NLP Toolkit (Made for self use - but feel free to use it too :) )
A list of tools for NLP tasks.
Done mostly for self-reference... hence quite brief.
Better lists can be found out there:
https://github.com/keon/awesome-nlp
scikit-learn (sklearn.feature_extraction.text)
- CountVectorizer - converts text to a matrix of token counts (word or character n-grams), which is then used by:
- TfidfTransformer - transforms a count matrix (CountVectorizer output) to a normalized term-frequency (TF) or term-frequency times inverse-document-frequency (TF-IDF) representation
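
To make the count-matrix to TF-IDF step concrete, here is a minimal pure-Python sketch of what the two stages compute. It is not the scikit-learn code: the tokenizer is a simplified assumption (lowercased unigrams of 2+ characters), and the weighting uses the smoothed IDF plus L2 row normalization that, as far as I know, matches the TfidfTransformer defaults. The function names `count_matrix` and `tfidf` are made up for this example.

```python
import math
import re
from collections import Counter


def count_matrix(docs):
    """Build a token-count matrix: one row per document, one column per term."""
    # Simplified tokenizer (assumption): lowercase, words of 2+ characters.
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows


def tfidf(rows):
    """Turn a count matrix into L2-normalized TF-IDF rows."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        result.append([x / norm if norm else 0.0 for x in weighted])
    return result


if __name__ == "__main__":
    vocab, rows = count_matrix(["the cat sat", "the dog sat down", "cats and dogs"])
    for term, weight in zip(vocab, tfidf(rows)[0]):
        print(f"{term:5s} {weight:.3f}")
```

Terms that appear in fewer documents get a higher weight, which is the whole point of the IDF factor.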
NLTK
- Corpus reader - Word tokenization
- POS tagging
- Chunking
- Stemming
- Creating parse trees out of sentences
- Using knowledge bases (FrameNet, WordNet, PropBank)
- Different tagger implementations (Senna, Stanford)
spaCy
- Corpus reader - Tokenization
- NER, POS
- Semantic representation (Word Vectors)
- Labeled dependency parsing
Gensim:
- Corpus reader / parser
- Transformations (TF-IDF, LSA, LDA, HDP)
- Similarity Queries
- Topic segmentation (LDA, LSA)
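
As a rough illustration of what a similarity query does, the sketch below ranks documents against a query by cosine similarity over raw bag-of-words counts. This is not the Gensim API: Gensim would normally run the query in a transformed space (TF-IDF, LSA, ...), and the helper names `bow`, `cosine` and `most_similar` are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector as a term -> count mapping (whitespace tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, docs):
    """Return (doc_index, score) pairs, best match first."""
    q = bow(query)
    scored = [(i, cosine(q, bow(d))) for i, d in enumerate(docs)]
    # Sort by descending score; ties keep the original document order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    corpus = [
        "human machine interface",
        "graph of trees",
        "machine learning for trees",
    ]
    for index, score in most_similar("machine interface", corpus):
        print(index, round(score, 3), corpus[index])
```

Swapping the raw counts for TF-IDF or LSA vectors changes the vector space but not the ranking logic.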
AllenNLP
- High-level trained models
- Machine comprehension (QA based on given text)
- Coreference Resolution (who or what 'it' or 'they' refers to)
- NER
- SRL
- Textual Entailment (does one text follow from another; loosely similar to topic segmentation and similarity queries combined)
OpenNMT
- Seq2seq mapping (Encoder - Decoder for translations)
- Tagging
- Classification
- Also includes built-in WordEmbedder, Tokenizer
- Based on TensorFlow
Nematus
- Attention-based encoder-decoder (machine translation)
- Based on Theano through dl4mt tutorial
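
For reference, the core of an attention step in an encoder-decoder can be sketched in a few lines: score each encoder state against the current decoder state, softmax the scores, and take the weighted sum as the context vector. Note the swap: this sketch uses plain dot-product scoring for brevity, whereas the dl4mt-style models use a learned scoring function as far as I know. The names `softmax` and `attend` are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (attention weights, context vector) for one decoding step."""
    # Dot-product score between the decoder state and each encoder state.
    scores = [
        sum(d * h for d, h in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: attention-weighted sum of the encoder states.
    context = [
        sum(w * state[k] for w, state in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    weights, context = attend([5.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
    print(weights, context)
```

The decoder then conditions its next output on the context vector, so each target word can focus on different source positions.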
Corpus
http://universaldependencies.org (github repository: https://github.com/UniversalDependencies)
https://github.com/nlp-compromise/nlp-corpus (different categories)
GPU-based services to train and run models:
- FloydHub (2h GPU, 20h CPU, 10GB)
- PaperSpace ($10 credit; monthly costs start at $5 for 50GB storage plus $0.07 per hour)
- Kaggle (Free - good for small, short training runs; executions are killed after ~5h; includes many cleaned datasets)