Arshad's Blog: Mallet command to generate topic models

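To generate a topic model with Mallet:

First, split the data into individual files, one line of data.txt per file. The files get six-digit numeric suffixes (data-000000, data-000001, and so on):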

split -l 1 -d -a 6 ../data.txt data-
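
As a quick sanity check of what the split step produces, here is a minimal sketch on a three-line sample. The scratch directory, the sample text, and the variable names are made up for illustration. It also assumes that the split files are meant to live in a directory named data, since that is what the next command passes to --input.

```shell
# Hypothetical scratch area and sample data, for illustration only.
workdir=$(mktemp -d)
printf 'first document\nsecond document\nthird document\n' > "$workdir/data.txt"

# Assumption: the split files go into a directory called "data".
mkdir "$workdir/data"
cd "$workdir/data"

# Same command as above: one line per file, numeric suffixes, six digits wide.
split -l 1 -d -a 6 ../data.txt data-

# List the generated files and record what was produced.
ls
file_count=$(ls | wc -l | tr -d ' ')
second_doc=$(cat data-000001)
echo "files: $file_count"
```

Each line of the input ends up in its own file, so Mallet treats each line as a separate document.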

Then convert the split files to Mallet format. The --input argument is the directory that contains the split files:

text2vectors --input data --remove-stopwords --output data-mallet.txt --keep-sequence TRUE --keep-sequence-bigrams TRUE

Next, generate the topics:

vectors2topics --input data-mallet.txt --num-topics 250 --num-top-words 100 --output-doc-topics doc-topics.txt --num-iterations 100 --show-topics-interval 1000 > topic-words.txt

To generate topics at the phrase level, add the n-gram option (--use-ngrams true):

vectors2topics --input data-mallet.txt --num-topics 250 --num-top-words 100 --output-doc-topics doc-topics.txt --num-iterations 100 --show-topics-interval 1000 --use-ngrams true > topic-phrases.txt
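
The word-level and phrase-level commands differ only in the trailing --use-ngrams flag. The following is a rough dry-run sketch that builds both command lines from shared settings and prints them without running Mallet. The variable names and the build_cmd helper are hypothetical; the flags and values are the ones used in this post.

```shell
# Hypothetical variable names; the values are the ones used in this post.
input=data-mallet.txt
num_topics=250
num_top_words=100
num_iterations=100

# Build the vectors2topics command line as a string so it can be inspected first.
# $1 is "true" to add --use-ngrams; anything else gives word-level topics.
build_cmd() {
    cmd="vectors2topics --input $input --num-topics $num_topics"
    cmd="$cmd --num-top-words $num_top_words --output-doc-topics doc-topics.txt"
    cmd="$cmd --num-iterations $num_iterations --show-topics-interval 1000"
    if [ "$1" = "true" ]; then
        cmd="$cmd --use-ngrams true"
    fi
    printf '%s\n' "$cmd"
}

word_cmd=$(build_cmd false)
phrase_cmd=$(build_cmd true)

# Print the commands instead of executing them (dry run).
echo "$word_cmd > topic-words.txt"
echo "$phrase_cmd > topic-phrases.txt"
```

Keeping the settings in one place makes it harder for the two runs to drift apart when the number of topics or iterations is changed.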

If any Mallet source file is modified, run "make clean", "make", and "make jar", in that order. If "make jar" is skipped, the change will not take effect.

Thursday, July 10, 2008
