NLP Toolkit (Made for self use - but feel free to use it too :) )

A list of tools for NLP tasks.
Compiled mostly for my own reference, hence quite brief.

Better lists can be found out there:

scikit-learn (sklearn.feature_extraction.text)
  • CountVectorizer - converts text documents to a token-count matrix (optionally counting n-grams), which is then used by:
  • TfidfTransformer - transforms a count matrix (CountVectorizer output) to a normalized term-frequency (TF) or term-frequency times inverse-document-frequency (TF-IDF) representation

  • Corpus reader - Word tokenization
  • POS
  • Chunking
  • Stemming
  • Creating parse trees out of sentences
  • Using knowledge bases (FrameNet, WordNet, PropBank)
  • Different implementations for tagging (Senna, Stanford)

  • Corpus reader - Tokenization
  • NER, POS
  • Semantic representation (Word Vectors)
  • Labeled dependency parsing
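
For the CountVectorizer and TfidfTransformer pair listed above, a plain-Python sketch of what the two steps compute may help. This is not scikit-learn's implementation: the whitespace tokenizer and the helper names `count_matrix` and `tfidf` are assumptions made for illustration, and the weighting assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization, which scikit-learn documents as its default.

```python
import math
from collections import Counter


def count_matrix(docs, ngram_range=(1, 1)):
    """Build a sorted vocabulary and a dense count matrix (one row per document)."""
    lo, hi = ngram_range
    per_doc = []
    for doc in docs:
        tokens = doc.lower().split()  # naive whitespace tokenizer (assumption)
        grams = [
            " ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)
        ]
        per_doc.append(Counter(grams))
    vocab = sorted(set().union(*per_doc))
    rows = [[counts[term] for term in vocab] for counts in per_doc]
    return vocab, rows


def tfidf(rows):
    """Weight a count matrix by smoothed IDF, then L2-normalize each row."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted_rows = []
    for row in rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        weighted_rows.append([x / norm for x in weighted])
    return weighted_rows


docs = ["the cat sat", "the dog sat", "the dog barked"]
vocab, counts = count_matrix(docs)
weights = tfidf(counts)
```

In this toy corpus "the" appears in every document, so it ends up with a lower weight than "cat" in the first row, which is the whole point of the IDF step.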

  • Corpus reader / parser
  • Transformations (TF-IDF, LSA, LDA, HDP)
  • Similarity Queries
  • Topic segmentation (LDA, LSA)

  • High-level trained models
  • Machine comprehension (QA based on given text)
  • Coreference resolution (what 'it' or 'they' refers to)
  • NER
  • SRL (semantic role labeling)
  • Textual Entailment (similar to topic segmentation and similarity queries combined)
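
The similarity queries mentioned above boil down to ranking documents by how close their vectors are to a query vector. A minimal sketch using raw bag-of-words counts and cosine similarity follows; the helper names `bow`, `cosine` and `most_similar` and the toy corpus are assumptions for illustration, and a real toolkit would usually compare TF-IDF or LSA vectors instead of raw counts.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector as a token -> count mapping (whitespace tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, corpus):
    """Return (score, document index) pairs, best match first."""
    q = bow(query)
    scored = [(cosine(q, bow(doc)), i) for i, doc in enumerate(corpus)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


corpus = [
    "human machine interface for computer applications",
    "graph of trees and minors",
    "user interface response time",
]
ranking = most_similar("computer interface", corpus)
```

Here the first document shares both query words and ranks first, the third shares one, and the second shares none and scores zero.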


  • Seq2seq mapping (Encoder - Decoder for translations)
  • Tagging
  • Classification
  • Also includes built-in WordEmbedder, Tokenizer
  • Based on TensorFlow

  • Attention-based encoder-decoder (machine translation)
  • Based on Theano through dl4mt tutorial

Corpus (GitHub repository; different categories)

GPU-based services to train and run models:

  • FloydHub (2h GPU, 20h CPU, 10GB)
  • Paperspace ($10 credit; monthly costs start at $5 for 50GB storage, plus $0.07 per hour)
  • Kaggle (free; good for small, short training runs; executions are killed after ~5h; includes many cleaned datasets)

