fastText Notes

fastText is a handy and efficient library for document classification and for learning word vector representations.

The major difference between fastText and gensim's word2vec is:
fastText takes subword information (character n-grams) into account when building word representations. So even for new or misspelled words (out of vocabulary), it can still produce a reasonable word vector based on the training data.

e.g. if ‘words’ never appeared in the training text but ‘word’ did, a fastText model can still place ‘words’ close to ‘word’ in the vector space.
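
As a rough sketch, the official fasttext Python bindings show this behaviour directly. The corpus file name data.txt and the query word below are just placeholders, not from the original notes:

import fasttext

# Train an unsupervised skip-gram model on a plain-text corpus
# ('data.txt' is a placeholder file name for this sketch).
model = fasttext.train_unsupervised('data.txt', model='skipgram')

# 'words' gets a vector even if it never appeared in the corpus,
# because fastText assembles it from its character n-grams.
vec = model.get_word_vector('words')
print(vec.shape)

# Its nearest neighbours should include 'word' if that form was seen in training.
print(model.get_nearest_neighbors('words'))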

Common terminal commands used for text classification

The terminal's working directory should be the fastText folder:

Supervised text classification model training:
./fasttext supervised -input text.train -output model_text -lr 1.0 -epoch 25 -wordNgrams 2 -bucket 200000 -dim 50 -loss hs

bucket: word and character n-gram features are hashed into a fixed number of buckets in order to limit the memory usage of the model. The option -bucket sets the number of buckets used by the model. The larger the bucket number, the larger the final model size.
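
The same training run can also be done through the official fasttext Python bindings. The call below is a sketch that mirrors the flags above and assumes text.train contains one __label__-prefixed example per line:

import fasttext

# Mirror of the CLI flags above: learning rate 1.0, 25 epochs, word bigrams,
# 200k hash buckets, 50-dimensional vectors, hierarchical softmax loss.
model = fasttext.train_supervised(
    input='text.train',
    lr=1.0,
    epoch=25,
    wordNgrams=2,
    bucket=200000,
    dim=50,
    loss='hs',
)

# Save the trained classifier for later use.
model.save_model('model_text.bin')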

After training, to obtain the k most likely labels and their associated probabilities for a new piece of text:

$ ./fasttext predict-prob model.bin test.txt k
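
The Python bindings provide the same functionality via model.predict. The snippet below is only a sketch; the model file name matches the one saved above, and the example sentence is a placeholder:

import fasttext

# Load a previously trained classifier.
model = fasttext.load_model('model_text.bin')

# Return the 3 most likely labels and their probabilities for one document.
labels, probs = model.predict('example document text to classify', k=3)
print(labels, probs)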
