To generate a topic model:
First, split the data into individual files, one file per line of input:
split -l 1 -d -a 6 ../data.txt data-
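As a rough sketch of what this step produces, the snippet below runs the same split command on a tiny made-up corpus. The three sample lines and the scratch directory are hypothetical. The directory name "data" is an assumption: the ../data.txt path suggests the command is run from a subdirectory, and the --input data argument in the next step suggests that subdirectory is called data. The -d flag (numeric suffixes) is a GNU coreutils option, so this assumes GNU split.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical three-document corpus, one document per line.
printf 'first document\nsecond document\nthird document\n' > data.txt

# Assumed layout: the per-document files live in a directory named "data".
mkdir data
cd data

# -l 1: one line per output file
# -d:   numeric suffixes instead of alphabetic ones (GNU split)
# -a 6: six-character suffixes, so files are data-000000, data-000001, ...
split -l 1 -d -a 6 ../data.txt data-

# List the generated files and show the contents of the second one.
ls
cat data-000001
```

With a six-digit suffix the naming scheme has room for up to one million documents; a larger corpus would need a bigger -a value.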
Then convert the split data to MALLET format:
text2vectors --input data --remove-stopwords --output data-mallet.txt --keep-sequence TRUE --keep-sequence-bigrams TRUE
Next, generate the topics:
vectors2topics --input data-mallet.txt --num-topics 250 --num-top-words 100 --output-doc-topics doc-topics.txt --num-iterations 100 --show-topics-interval 1000 > topic-words.txt
To generate topics at the phrase level, add the --use-ngrams option:
vectors2topics --input data-mallet.txt --num-topics 250 --num-top-words 100 --output-doc-topics doc-topics.txt --num-iterations 100 --show-topics-interval 1000 --use-ngrams true > topic-phrases.txt
If any source file is modified, run "make clean", "make", and "make jar", in that order. If "make jar" is skipped, the change will not take effect.