Posts

Showing posts from February, 2018

Natural Language Processing :: A beginner’s (brief) guide for working on NLP problems

This article was written for the Data Science Society's Datathon, held in February 2018, as an answer to questions from participants who were new to NLP. This time, the Data Science Society's Datathon 2018 presented us with many cases related to Natural Language Processing. One of the cases, for example, involves extracting entities' activities from unstructured documents and determining their sentiment. So, how should one begin working on such a problem? First, let’s break the problem down: (1) we need to detect which entities are mentioned in an article, and then (2) we need to detect the sentiment that relates specifically to those entities. Let’s think about it logically for a moment: entities are (normally) nouns, and in order to get the sentiment we will probably need to look at the adjectives that describe them, or the verbs that are related to them. So a first step could be (a) parsing the document into sentences, and then int

NLP Toolkit (Made for self use - but feel free to use it too :) )

A list of tools for NLP tasks. Done mostly for self-reference, hence quite brief. Better lists can be found out there: https://github.com/keon/awesome-nlp
scikit-learn (sklearn.feature_extraction.text): CountVectorizer - converts text to a token-count matrix (n-grams co-reference), which is then used by TfidfTransformer - transforms a count matrix (CountVectorizer output) to a normalized term-frequency (TF) or term-frequency times inverse-document-frequency (TF-IDF) representation.
NLTK: Corpus reader - word tokenization, POS, chunking, stemming, creating parse trees out of sentences, using knowledge bases (FrameNet, WordNet, PropBank), different implementations for tagging (Senna, Stanford).
spaCy: Corpus reader - tokenization, NER, POS, semantic representation (word vectors), labeled dependency parsing.
Gensim: Corpus reader / parser, transformations (TF-IDF, LSA, LDA, HDP), similarity queries, topic segmentation (LDA, LSA).
AllenNLP: High-level trained models, machine comprehension (QA based on given t
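
To make the CountVectorizer and TfidfTransformer entries a bit more concrete, here is a minimal plain-Python sketch of the two steps: build a token-count matrix, then reweight it with TF-IDF. It is not scikit-learn's code. The helper names `count_matrix` and `tfidf`, the whitespace tokenizer, and the toy documents are all assumptions made for illustration. The weighting uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization, which to my understanding matches scikit-learn's defaults, but check the library documentation before relying on that.

```python
import math
from collections import Counter


def count_matrix(docs):
    """Build a token-count matrix: one row per document, one column per term.

    Tokenization is a naive lowercase whitespace split (an assumption for
    this sketch; real vectorizers use a more careful tokenizer).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def tfidf(rows):
    """Reweight a count matrix with smoothed IDF and L2-normalize each row."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    # Document frequency: in how many documents does each term appear?
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted_rows = []
    for row in rows:
        weighted = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        weighted_rows.append([v / norm for v in weighted])
    return weighted_rows


# Hypothetical toy corpus.
docs = ["the cat sat", "the dog sat", "the dog barked"]
vocab, counts = count_matrix(docs)
weights = tfidf(counts)
```

In the first document, "cat" appears in only one document, "sat" in two, and "the" in all three, so the weights come out ordered cat > sat > the even though each word occurs exactly once there. That is the whole point of the IDF factor: terms that are common across the corpus are pushed down.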