Natural Language Processing :: A beginner’s (brief) guide for working on NLP problems

This article was written for the Data Science Society's Datathon, held in February 2018, as an answer to questions from participants who were new to NLP.

This time, the Data Science Society's Datathon 2018 presented us with many cases related to Natural Language Processing.
One of the cases, for example, involves extracting entities' activities from unstructured documents, and determining their sentiment.

So, how should one begin working on such a problem?

First, let’s break this problem down:
(1) we need to detect which entities are mentioned in an article, and then (2) we need to detect the sentiment that relates specifically to those entities.
Let’s think about it logically for a moment: Entities are (normally) nouns, and in order to get the sentiment we will probably need to look at the adjectives that describe them, or the verbs that are related to them.
So a first step could be (a) parsing the document into sentences, and then into words, (b) determining the grammatical role of each word in the sentence, and (c) determining how these words are related to each other.
Let’s start giving these actions names:
  • In NLP, a collection of documents is often referred to as a corpus, and the initial parsing process is referred to as preprocessing;
  • Breaking up a corpus, as described in (a), is called tokenizing, while choosing the part of speech (POS) of a word is referred to as POS tagging;
  • In order to determine how words depend on each other (c), the common way to represent the dependencies is as a tree (although some prefer to use a graph), and unsurprisingly it is called a Dependency Tree;
  • The next level of analysis, i.e. identifying the subject, the object, the type of verb and other semantics, is called Semantic Role Labeling (SRL);
  • Detecting that a specific word (an entity) is the name of a company/person/product is called Named Entity Recognition (or NER since, as you can see, we really love abbreviations).
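To make step (a) concrete, here is a minimal sketch of sentence splitting and tokenizing using only Python's standard `re` module. The function names `split_sentences` and `tokenize` and the sample text are made up for illustration, and the rules are deliberately naive: real tokenizers handle abbreviations, numbers and many other cases that this sketch ignores.

```python
import re


def split_sentences(text):
    """Naively split text into sentences after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # A word may carry one apostrophe suffix ("It's"); any other
    # non-space, non-word character becomes a token of its own.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)


# Made-up sample text, for illustration only.
text = "Acme opened a new factory. Investors were not impressed!"
sentences = split_sentences(text)
tokens = [tokenize(sentence) for sentence in sentences]

print(sentences)
print(tokens)
```

Note that a text such as "Dr. Smith arrived." would be wrongly split after "Dr." by this rule, which is one good reason to prefer an existing tool over a hand-written one.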
There are already many tools and frameworks out there that can do this job for you, and even more: tokenizing, tagging, parsing and semantic labeling. A great place to start is the NLTK library in Python, which, besides offering the basic tools, also has a great guide for beginners. The spaCy library would give you even more power and speed for the above-mentioned tasks.
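Whichever tool produces the analysis, the result for each word boils down to a POS tag, a pointer to the word it depends on, and a label for that relation. The sketch below holds such a parse in a small data structure and then looks up the adjectives and the verb attached to a given noun. The sentence is hand-annotated for illustration (it is not the output of NLTK or spaCy), the tag and label names follow the Universal Dependencies style as an assumption, and `Token` and `descriptors` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    pos: str   # coarse part-of-speech tag
    head: int  # index of the governing token; the root points to itself
    dep: str   # label of the relation to the head


# Hand-annotated parse of "Acme launched a great phone".
# This is an illustration, not the output of a real parser.
sentence = [
    Token("Acme", "PROPN", 1, "nsubj"),
    Token("launched", "VERB", 1, "ROOT"),
    Token("a", "DET", 4, "det"),
    Token("great", "ADJ", 4, "amod"),
    Token("phone", "NOUN", 1, "obj"),
]


def descriptors(tokens, index):
    """Return adjectives attached to tokens[index], then its governing verb."""
    found = []
    # Adjectives whose head is the word we are interested in.
    for i, tok in enumerate(tokens):
        if i != index and tok.head == index and tok.pos == "ADJ":
            found.append(tok.text)
    # The verb that governs the word, unless the word is the root itself.
    head_index = tokens[index].head
    if head_index != index and tokens[head_index].pos == "VERB":
        found.append(tokens[head_index].text)
    return found


print(descriptors(sentence, 0))  # words related to "Acme"
print(descriptors(sentence, 4))  # words related to "phone"
```

These related words are the candidates one would then inspect when deciding which sentiment is attached to which entity.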
Nevertheless, it’s always encouraged to recombine, hack, tweak and change the current models in order to create a better model. This is often done by re-training a model, and for that one needs a lot of data.
Distributed representations of words, also called word embeddings, replace the words in a sentence with matching vectors. To do that, an algorithm (typically neural-network based) is first trained on a big corpus, and through this training it learns a vector for each word.
Well-known algorithms in this field are Word2vec, GloVe and fastText. As before, there are ready-made tools for using them, such as gensim, and others.
In addition, as stated before, you can also re-train your own word representation, on your given corpus, in order to gain accuracy and specificity for the task you're tackling.
Before using these vectors for sentiment analysis, it’s important to remember that these representations are no more than semantic representations, learned from the contexts in which words appear. That is, adjectives such as ‘good’ and ‘bad’, which tend to appear in similar contexts, can be represented by similar vectors that lie close to each other in the vector space, despite their opposite sentiment.
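This effect can be seen even without a neural network. The sketch below swaps in a much simpler, count-based technique (each word is represented by the counts of its neighbouring words, which is not what Word2vec does internally) and compares the resulting vectors with cosine similarity. The four-sentence corpus is made up and deliberately symmetric, and `cooccurrence_vectors` and `cosine` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict

# Made-up toy corpus: 'good' and 'bad' occur in the same contexts.
corpus = [
    "the movie was good",
    "the movie was bad",
    "the food was good",
    "the food was bad",
]


def cooccurrence_vectors(sentences, window=2):
    """Represent each word by the counts of the words seen near it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b)


vectors = cooccurrence_vectors(corpus)
good_vs_bad = cosine(vectors["good"], vectors["bad"])
good_vs_movie = cosine(vectors["good"], vectors["movie"])
print(good_vs_bad, good_vs_movie)
```

On this toy corpus ‘good’ and ‘bad’ share exactly the same neighbours, so they come out more similar to each other than ‘good’ is to ‘movie’. The vectors alone therefore cannot tell a positive adjective from a negative one.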
Where next?
All these are merely the basics. Good places to look next are deep-learning-based models, for example CNNs and BiLSTMs (bi-directional LSTMs).

And if you are interested in going deeper into NLP, which in my opinion is a fascinating topic(!), check out these great links:

A discussion regarding different approaches for semantic parsing through semantic composition:

