NLP Toolkit (Made for self use - but feel free to use it too :) )

February 03, 2018

A list of tools for NLP tasks.
Done mostly for self-reference... hence quite brief.

Better lists can be found out there:
https://github.com/keon/awesome-nlp

scikit-learn (sklearn.feature_extraction.text)

CountVectorizer - converts text to a token-count matrix (optionally over n-grams), which is then used by:
TfidfTransformer - transforms a count matrix (CountVectorizer output) into a normalized term-frequency (TF) or term-frequency times inverse-document-frequency (TF-IDF) representation
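A minimal sketch of the two steps chained together. The three toy documents and the choice of `ngram_range=(1, 2)` are just assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy documents, made up for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Step 1: text -> sparse token-count matrix (unigrams and bigrams here).
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(docs)

# Step 2: count matrix -> TF-IDF weighted matrix of the same shape.
tfidf = TfidfTransformer().fit_transform(counts)

print(counts.shape, tfidf.shape)
print(sorted(vectorizer.vocabulary_)[:5])  # first few learned terms
```

TfidfVectorizer in the same module does both steps in one object, if the raw counts are not needed.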

NLTK

Corpus reader - Word tokenization
POS
Chunking
Stemming
Creating parse trees out of sentences
Using knowledge bases (FrameNet, WordNet, PropBank)
Different tagger implementations (Senna, Stanford)
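Several of these steps can be tried with NLTK without downloading any extra data. A rough sketch, where the sentence, the hand-written POS tags, and the one-rule chunk grammar are all made up for illustration (a real pipeline would use a trained tagger instead):

```python
from nltk.chunk import RegexpParser
from nltk.stem import PorterStemmer
from nltk.tokenize import TreebankWordTokenizer

sentence = "The quick brown foxes are jumping over the lazy dogs"

# Tokenization (rule-based, needs no downloaded models).
tokens = TreebankWordTokenizer().tokenize(sentence)

# Stemming.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

# POS tags written by hand here; normally these come from a tagger.
tags = ["DT", "JJ", "JJ", "NNS", "VBP", "VBG", "IN", "DT", "JJ", "NNS"]
tagged = list(zip(tokens, tags))

# Chunking: optional determiner, any adjectives, then a noun.
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN.*>}")
tree = chunker.parse(tagged)  # a shallow parse tree

noun_phrases = [
    " ".join(word for word, _ in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(stems)
print(noun_phrases)
```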

spaCy

Corpus reader - Tokenization
NER, POS
Semantic representation (Word Vectors)
Labeled dependency parsing

Gensim:

Corpus reader / parser
Transformations (TF-IDF, LSA, LDA, HDP)
Similarity Queries
Topic modeling (LDA, LSA)

AllenNLP

High-level trained models
Machine comprehension (QA based on given text)
Coreference Resolution (who/what do 'it', 'they' refer to)
NER
SRL
Textual Entailment (whether one sentence follows from, contradicts, or is neutral to another)

OpenNMT

Seq2seq mapping (Encoder - Decoder for translations)
Tagging
Classification
Also includes a built-in word embedder and tokenizer
Based on TensorFlow

Nematus

Attention-based encoder-decoder (machine translation)
Based on Theano, building on the dl4mt tutorial code
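For reference, one decoder step of additive attention in an encoder-decoder can be sketched in plain NumPy. This is not code from the dl4mt tutorial or from any toolkit; the function name, the dimensions, and the random weights are all hypothetical:

```python
import numpy as np

def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    """One attention step: returns (context vector, attention weights).

    decoder_state: (d_dec,)   encoder_states: (T, d_enc)
    W_dec: (d_att, d_dec)     W_enc: (d_att, d_enc)     v: (d_att,)
    """
    # score_t = v . tanh(W_dec @ s + W_enc @ h_t), one score per source position
    scores = np.tanh(encoder_states @ W_enc.T + W_dec @ decoder_state) @ v
    # Softmax over source positions (shifted by the max for stability).
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # Context: attention-weighted sum of the encoder states.
    context = weights @ encoder_states
    return context, weights

# Random stand-ins for trained parameters and encoder outputs.
rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 5, 8, 6, 4
encoder_states = rng.normal(size=(T, d_enc))
decoder_state = rng.normal(size=d_dec)
W_enc = rng.normal(size=(d_att, d_enc))
W_dec = rng.normal(size=(d_att, d_dec))
v = rng.normal(size=d_att)

context, weights = additive_attention(decoder_state, encoder_states, W_dec, W_enc, v)
print(weights.round(3), context.shape)
```

In a real translation model the decoder would recompute these weights at every output step, so each target word can focus on different source words.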

Corpus
http://universaldependencies.org (github repository: https://github.com/UniversalDependencies)
https://github.com/nlp-compromise/nlp-corpus (different categories)

GPU-based services to train and run models:

FloydHub (2h GPU, 20h CPU, 10GB)
Paperspace ($10 credit; monthly costs start at $5 for 50GB storage, plus $0.07 per hour)
Kaggle (Free - good for small, short training runs; executions are killed after ~5h; includes many cleaned datasets)
