Thursday, July 10, 2008

MALLET commands to generate topic models

To generate a topic model:

First, split the data into individual files, one file per line of data.txt:

split -l 1 -d -a 6 ../data.txt data-
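As a rough illustration of what the split step produces, here is a small sketch on a made-up three-line data.txt. The temporary directory, the sample text, and the "data" subdirectory are assumptions for the example (the relative path ../data.txt suggests the command is run from inside a subdirectory), and -d is a GNU split option that may not exist in other split implementations.

```shell
# Hypothetical sample input: three documents, one per line.
workdir=$(mktemp -d)
printf 'first document\nsecond document\nthird document\n' > "$workdir/data.txt"

# Assumption: the split files live in a subdirectory named "data".
mkdir "$workdir/data"
cd "$workdir/data"

# -l 1: one line per output file
# -d:   numeric suffixes instead of alphabetic ones (GNU split)
# -a 6: six-character suffixes, enough for up to 1,000,000 files
split -l 1 -d -a 6 ../data.txt data-

# Sanity check: the number of files should match the number of lines.
lines=$(wc -l < ../data.txt)
files=$(find . -type f -name 'data-*' | wc -l)
echo "lines=$((lines)) files=$((files))"
ls
```

With GNU split the files are named data-000000, data-000001, and so on. If the two counts differ by one, the usual cause is a last line without a trailing newline, which wc -l does not count.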

Then convert the directory of split files (data) to MALLET format:

text2vectors --input data --remove-stopwords --output data-mallet.txt --keep-sequence TRUE --keep-sequence-bigrams TRUE

Next, generate the topics:

vectors2topics --input data-mallet.txt --num-topics 250 --num-top-words 100 --output-doc-topics doc-topics.txt --num-iterations 100 --show-topics-interval 1000 > topic-words.txt

To generate topics at the phrase level, add the --use-ngrams option:

vectors2topics --input data-mallet.txt --num-topics 250 --num-top-words 100 --output-doc-topics doc-topics.txt --num-iterations 100 --show-topics-interval 1000 --use-ngrams true > topic-phrases.txt

If any source file is modified, run "make clean", "make", and "make jar", in that order. If "make jar" is skipped, the change will not take effect.
