Posts

Showing posts from 2018

Analyzing Software Development using data science and analytic tools

Yesterday I had the pleasure of hosting Markus Harrer, who gave a talk about analyzing software code using data science, which he describes in detail on his blog. There is often a communication gap between software developers and management. While good developers can see the big picture of the code, and can identify in time a need to restructure or even rewrite it, they often fail to see the risks, in both time and cost, of such an operation. Management, on the other hand, while able to identify the risks of embarking on the adventure of rewriting a legacy system, often fails to understand the outcome and the risks of a lack of maintenance. The solution to overcoming this 'gap of ignorance' is communicating using data science and analytics. Using Jupyter notebooks, for example, to combine textual explanations with analytics and diagrams can easily communicate the risks of a software system to management. For using data science on software code, one should decide
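
To give a flavor of what such an analysis looks like, here is a minimal sketch of loading a project's commit history into pandas inside a notebook. The export command, file name, and column layout are illustrative assumptions on my part, not taken from the talk:

    # Assumes the history was exported beforehand with something like:
    #   git log --pretty=format:'%h,%ad,%aN' --date=short > commits.csv
    import pandas as pd

    # Load the commit history into a DataFrame
    commits = pd.read_csv("commits.csv", names=["sha", "date", "author"],
                          parse_dates=["date"])

    # Count commits per month to see how development activity evolves
    activity = commits.set_index("date").resample("M").size()

    # In a notebook this renders an inline chart that management can read
    # right next to the textual explanation
    activity.plot(title="Commits per month")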

Natural Language Processing :: A beginner’s (brief) guide for working on NLP problems

This article was written for the Data Science Society's Datathon, held in February 2018, as an answer to questions from participants who were new to NLP. The Data Science Society's Datathon 2018 presented many cases that are Natural Language Processing related. One of the cases, for example, involves extracting entities' activities from unstructured documents and determining their sentiment. So, how should one begin working on such a problem? First, let's break the problem down: (1) we need to detect which entities are mentioned in an article, and then (2) we need to detect the sentiment that relates specifically to those entities. Let's think about it logically for a moment: entities are (normally) nouns, and in order to get the sentiment we will probably need to look at the adjectives that describe them, or the verbs that are related to them. So a first step could be (a) parsing the document into sentences, and then int
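
As a rough illustration of steps (1) and (2), here is a minimal spaCy sketch that splits a document into sentences, picks out the entities in each one, and collects the sentence's adjectives as naive sentiment cues. The model name, the example text, and the adjective heuristic are illustrative assumptions, not the article's actual code:

    import spacy

    # Assumes the small English model is installed:
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Acme Corp reported disappointing results, "
              "while Globex showed strong growth.")

    for sent in doc.sents:        # (a) split the document into sentences
        for ent in sent.ents:     # (1) entities mentioned in the sentence
            # (2) a naive cue: adjectives appearing in the same sentence
            adjectives = [tok.text for tok in sent if tok.pos_ == "ADJ"]
            print(ent.text, ent.label_, adjectives)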

NLP Toolkit (Made for self use - but feel free to use it too :) )

A list of tools for NLP tasks. Done mostly for self-reference... hence quite brief. Better lists can be found out there: https://github.com/keon/awesome-nlp

scikit-learn ( sklearn.feature_extraction.text )
- CountVectorizer - converts text to a matrix of token counts (n-gram counts), which is then used by:
- TfidfTransformer - transforms a count matrix (CountVectorizer output) to term-frequency or inverse-document-frequency (TF-IDF) weights (see the sketch after this list)

NLTK
- Corpus reader
- Word tokenization
- POS chunking
- Stemming
- Creating parse trees out of sentences
- Using knowledge bases (FrameNet, WordNet, PropBank)
- Different implementations for tagging (Senna, Stanford)

spaCy
- Corpus reader - tokenization
- NER, POS
- Semantic representation (word vectors)
- Labeled dependency parsing

Gensim
- Corpus reader / parser
- Transformations (TF-IDF, LSA, LDA, HDP)
- Similarity queries
- Topic segmentation (LDA, LSA)

AllenNLP
- High-level trained models
- Machine comprehension (QA based on given t
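
A minimal sketch of the CountVectorizer-to-TfidfTransformer chain mentioned above (the toy corpus is illustrative):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    corpus = ["the cat sat on the mat", "the dog sat on the log"]

    # Convert the raw text into a sparse matrix of token counts
    counts = CountVectorizer().fit_transform(corpus)

    # Re-weight the counts into TF-IDF scores
    tfidf = TfidfTransformer().fit_transform(counts)

    print(tfidf.shape)  # (2 documents, vocabulary-size columns)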