Natural Language Processing :: A beginner’s (brief) guide for working on NLP problems

This article was written for the Data Science Society's Datathon held in February 2018, as an answer to questions from participants who were new to NLP.

The Data Science Society's Datathon 2018 presented many cases related to Natural Language Processing.
One of the cases, for example, involves extracting entities' activities from unstructured documents and determining their sentiment.

So, how should one begin working on such a problem?

First, let’s break this problem down:
(1) we need to detect which entities are mentioned in an article, and then (2) we need to detect the sentiment that relates specifically to those entities.
Let’s think about it logically for a moment: entities are (normally) nouns, and in order to get the sentiment we will probably need to look at the adjectives that describe them, or the verbs that relate to them.
So a first step could be (a) parsing the document into sentences, and then into words, (b) determining the grammatical role of each word in the sentence, and (c) determining how these words relate to each other.
Let’s start giving these actions names:
  • In NLP, a collection of documents is often referred to as a corpus, and the parsing process is referred to as preprocessing;
  • Breaking up a corpus, as described in (a), is called tokenizing, while choosing the part of speech (POS) of a word is referred to as POS tagging;
  • To determine how words depend on each other (c), the common way is to represent the dependencies as a tree (although some prefer a graph), and unsurprisingly this is called a Dependency Tree;
  • The secondary level, i.e. subject, object, type of verbs and other semantics, is called Semantic Role Labeling (SRL);
  • Detecting that a specific word (an entity) is the name of a company/person/product is called Named Entity Recognition (or NER, since as you can see, we really love abbreviations)
There are already many tools and frameworks out there that can do this job for you, and even more, i.e. tokenizing, tagging, parsing, and semantic labeling. A great place to start is the nltk library in Python, which besides offering the basic tools, also has a great guide for beginners. spaCy will give you even more power and speed for the above-mentioned tasks.
Nevertheless, it’s always encouraged to recombine, hack, tweak and change existing models in order to create a better one. This is often done by re-training a model, and for that one needs a lot of data.
Distributed representations of words, also called word embeddings, replace the words in a sentence with matching vectors. To produce these vectors, a neural-network-based algorithm is trained on a big corpus, and through training it learns to convert words into vectors.
Well-known algorithms in that field are Word2vec, GloVe and fastText. And as before, there are ready-made tools for using them, such as gensim, and others.
In addition, as stated before, you can also re-train your own word representations on your given corpus, in order to gain accuracy and specificity for the task you're tackling.
Before using these vectors for sentiment analysis, it’s important to remember that they are no more than semantic representations. I.e. adjectives such as ‘good’ and ‘bad’ are represented by similar vectors, which lie close to each other in the vector space, distance-wise.
Where next?
All these are merely the basics. Among the places to check next are deep-learning-based models, for example CNNs and BiLSTMs (bi-directional LSTMs).

And if you are interested in getting deeper into NLP, which in my opinion is a fascinating topic(!), check out these great links:

A discussion regarding different approaches to semantic parsing through semantic composition:


NLP Toolkit (Made for self use - but feel free to use it too :) )

A list of tools for NLP tasks.
Done mostly for self-reference... hence quite brief.

Better lists can be found out there:

scikit-learn (sklearn.feature_extraction.text)
  • CountVectorizer - converts text to a matrix of token counts (with n-gram support), which is then used by:
  • TfidfTransformer - transforms a count matrix (CountVectorizer output) to a term-frequency or term-frequency times inverse-document-frequency (TF-IDF) representation
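The two classes are designed to be chained; a minimal sketch (the example documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# 1) text -> matrix of token counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# 2) count matrix -> TF-IDF representation
tfidf = TfidfTransformer().fit_transform(counts)

print(sorted(vectorizer.vocabulary_))  # the learned vocabulary
print(tfidf.shape)                     # (documents, unique tokens)
```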

NLTK
  • Corpus reader - word tokenization
  • POS tagging
  • Chunking
  • Stemming
  • Creating parse trees out of sentences
  • Using knowledge bases (FrameNet, WordNet, PropBank)
  • Different implementations for tagging (Senna, Stanford)

spaCy
  • Corpus reader - tokenization
  • NER, POS
  • Semantic representation (word vectors)
  • Labeled dependency parsing

gensim
  • Corpus reader / parser
  • Transformations (TF-IDF, LSA, LDA, HDP)
  • Similarity queries
  • Topic modeling (LDA, LSA)

AllenNLP
  • High-level trained models
  • Machine comprehension (QA based on a given text)
  • Coreference resolution (who/what do 'it', 'they' refer to)
  • NER
  • SRL
  • Textual entailment (does one sentence logically follow from another)


  • Seq2seq mapping (encoder-decoder for translation)
  • Tagging
  • Classification
  • Also includes a built-in WordEmbedder and Tokenizer
  • Based on TensorFlow

  • Attention-based encoder-decoder (machine translation)
  • Based on Theano, via the dl4mt-tutorial

http://universaldependencies.org (github repository: https://github.com/UniversalDependencies)
https://github.com/nlp-compromise/nlp-corpus (different categories)

GPU-based services to train and run models:

  • FloydHub (2h GPU, 20h CPU, 10GB)
  • Paperspace ($10 credit - monthly costs start at $5 for 50GB storage plus $0.07 per hour)
  • Kaggle (free - good for small, short training runs; executions are killed after ~5h; includes many cleaned datasets)


pip install pymssql fails with 'sqlfront.h': No such file or directory

I've tried to install pymssql on Windows using command line:
pip install pymssql

The operation fails with an error:
fatal error C1083: Cannot open include file: 'sqlfront.h': No such file or directory

While looking for a solution, the first results were not really helpful, until I arrived at this helpful one.

In short: I was using Python version 3.6, which currently isn't supported by pymssql.
Using Python 3.5 instead solves the issue, and pip install pymssql runs well.

If you're using Anaconda Navigator, it is even simpler:
In the Environments section, you can add a new environment with the Python version of your choice. When you choose a supported version - 3.5 for example - you will find pymssql in the packages list and can install it from there directly.


AngularJS Directive: Accessing a DOM element with a dynamic ID in an asynchronous directive


This problem was really bothering me for a few months. In earlier cases I tried to avoid it, but today I could not hide from it any longer. And the thing is that there was no post anywhere that could solve this matter... So here it goes.

You're writing a directive in AngularJS, and you want to access one of the DOM elements declared within its template.
Usually you do it through the linking function (the post section of the compile) or in the directive controller, by injecting and using '$element', as explained here.

But, in my case, I had to access an element using its ID and this ID was dynamic.

So, my template contains:

    <div ng-attr-id="{{ 'something_' + dynamicValue }}"></div>

And my directive contains:

    templateUrl: 'pathToTemplate',
    scope: {
        dynamicValue: '@'
    },
    link: function(scope, element, attributes) {
        attributes.$observe('dynamicValue', function(value) {
            element.find('#something_' + value);
        });
    }
The element.find doesn't work here, because nothing has actually been compiled and rendered yet.

The solution, at the moment, is to use $timeout.

$timeout, even with '1' as a parameter, will only be executed after everything has been rendered. Therefore using element.find inside its callback will deliver the expected result:

    // remember to inject $timeout into the directive
    templateUrl: 'pathToTemplate',
    scope: {
        dynamicValue: '@'
    },
    link: function(scope, element, attributes) {
        attributes.$observe('dynamicValue', function(value) {
            $timeout(function() {
                element.find('#something_' + value);
            }, 1);
        });
    }


Happy coding


A guide for modeling a graph database - A lunch with Neo4J chief scientist Jim Webber, London

Since the invention of NoSQL databases, they have been getting more and more attention from the developer community. One of the remarkable and unique databases in the NoSQL space is the graph database Neo4j, an open-source database which stores data as a graph.

Modeling a graph database is quite different from modeling a regular RDBMS database, and even from other NoSQL databases such as key-value collections. We are used to identifying the granularity of the data and saving it as columns, joining similar data together into tables. But since Neo4j is a flexible graph database, this approach does not work there.

In order to model the data, we first need to identify the queries that we're about to run on the database. These queries will form the basic logical sentences which are the key to modeling the database (more about that in a moment).

We need to identify where to store each piece of information:
- as a node
- as an edge
- as a property of a node or of an edge

Nodes should store items of information, nouns: Users, Receipts, Comments, Documents.
The edges should store relation items that form a verb: Created By, Voted By, Is the Son Of, etc.
Properties are additional descriptions for either nodes or edges. Data that does not need its own relations should go into a property. If we look at the information as a sentence, the properties are the adjectives.

So, nodes and edges (and properties) should create a logical sentence as part of the modeling:
User A VOTED_YES on Question Q
This sentence creates the model for us:
[User (name=A)] -- VOTED_YES --> [Question (content=Q)]

Here User and Question are nodes, each of which has properties that describe it (user name, question content/title).
Pay specific attention to the VOTED_YES edge.
We could have chosen VOTED (vote_data = 'yes'), where we have an edge of type VOTED and the answer is its property. Why didn't we model it that way?

Jim Webber, the chief scientist of Neo4j, explains: performance-wise, it's better to create granularity in the model, so that each of these votes is a unique type of edge.
And what if we want to collect all the votes for a specific question? We can create an index for the vote, and easily find all the votes in the index that point to a specific question.
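The mapping from the logical sentence to the graph can be sketched programmatically. Here is a small Python helper (hypothetical, for illustration only) that turns the sentence above into a Cypher CREATE statement; the User/Question labels and the VOTED_YES relationship follow the model described above:

```python
def sentence_to_cypher(user_name, relation, question_content):
    """Turn 'User A VOTED_YES on Question Q' into a Cypher CREATE statement.

    The node labels and relationship type follow the model above;
    the helper itself is illustrative, not part of Neo4j.
    """
    return (
        "CREATE (u:User {name: '%s'})-[:%s]->(q:Question {content: '%s'})"
        % (user_name, relation, question_content)
    )

print(sentence_to_cypher('A', 'VOTED_YES', 'Q'))
# CREATE (u:User {name: 'A'})-[:VOTED_YES]->(q:Question {content: 'Q'})
```

Notice how the granular edge type (VOTED_YES rather than VOTED plus a property) shows up directly in the relationship name.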

__ to be continued __


Javascript, animation and easing functions

I recently had to create an animation which obeys the laws of physics, using JavaScript and HTML5.

Besides gravity and free falling, one of these animations also had to apply a sort of bouncing and elastic behavior.

After creating it first with CSS3 and HTML, we realized that most Android devices do not fully support it yet, especially versions prior to Ice Cream Sandwich, such as Android 2.1, 2.2 and 2.3 (Froyo and its siblings).

So, I had to move everything to an HTML5 Canvas, and instead of applying CSS transitions, I had to implement everything myself.

The first website I encountered, and would like to highly recommend, is Timothee Groleau's easing generator, which can be found here:

Originally made for Flash - but it can easily be adapted to JavaScript/HTML5 too.

The output of his generator is basically a function that describes a certain movement of an object.
Such movement may be physical movement along the X/Y/Z axis, or even the rotation angle of an object.

The function takes 4 parameters:

t - the current time of the animation (starting at 0, increased every frame by 1/num_of_frames_per_second). Make sure that if you work in milliseconds, t has to be in milliseconds too.

b - the initial value (of the x/y/z position, the rotation degrees/radians, etc.)

c - the total change in value (target minus initial, on the same x/y/z axis or in rotation degrees)

d - the total time of the animation - again, like t, it has to be in the same time unit; personally, I recommend working with milliseconds.

The output of the function is the current value of your object at the specific t.
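To make the (t, b, c, d) signature concrete, here is a classic ease-out-quad function in the same Penner style (written in Python for brevity; the generator's output is ActionScript/JavaScript, but the math is identical):

```python
def ease_out_quad(t, b, c, d):
    """Penner-style easing: decelerating from full speed to zero.

    t - current time, b - initial value, c - total change in value,
    d - total duration (t and d must share the same time unit).
    """
    t = t / d                      # normalize time to 0..1
    return -c * t * (t - 2) + b   # current value at time t

# an object moving from 0 to 100 over 1 second:
print(ease_out_quad(0.5, 0, 100, 1))   # 75.0 - already 3/4 of the way there
```

Calling it once per frame with the growing t gives you the position to draw at, with no physics if-chains needed.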

As for drawing, especially for 3D objects, there are ready-made frameworks that can be used for this purpose, such as Three.js and K3D.

One last word - nearly every physical action has a formula that describes it. Use the generator to find it instead of writing too many if's... :)


TreePanel generated from TreeStore

Apparently, generating a TreePanel from a TreeStore is not 'out-of-the-box' like generating a grid, or even generating a tree from static JSON, in Ext JS.

In addition, at the moment there are not many examples out there that explain how it should be done. Most of the examples use a static JSON (in memory). But what if you are using a store with a ready-made JSON which you need to customize into a tree?

Listeners to the rescue
The way to make it happen is to alter the store a bit.
We start by defining a store as usual:

    Ext.define('MyTreeStore', {
        extend: 'Ext.data.TreeStore',

        config: {
            someConfig: 0
        },

        constructor: function (cfg) {
            var me = this;

            cfg = cfg || {};
            me.callParent([Ext.apply({
                autoLoad: true,
                storeId: 'MyTreeStoreID',
                root: {
                    expanded: true
                },
                proxy: {
                    type: 'rest',
                    url: 'http://website/JsonGenerator.php',
                    extraParams: {
                        someCoolConfig: 1
                    },
                    reader: {
                        type: 'json'
                    },
                    // Don't want the proxy to include these params in the request
                    pageParam: undefined,
                    startParam: undefined
                },
                fields: ['JSONField_ID', 'JSONField_NAME'],

                // ------------------------------------------------------------------------
                // this is the important part:

                listeners: {
                    append: function (thisNode, newChildNode, index, eOpts) {
                        if (!newChildNode.isRoot()) {
                            newChildNode.set('leaf', true);
                            newChildNode.set('text', newChildNode.get('JSONField_NAME'));
                        }
                    }
                }

                // ------------------------------------------------------------------------

            }, cfg)]);
        }
    });
