Tuesday, July 22, 2008

useful mkdir variant

You can create a whole path of directories with a single mkdir command:

mkdir -p a/b/c/d/e

This creates a folder a, inside it a folder b, and so on down to folder e. If part of the path already exists, mkdir does not raise an error; it simply creates the part that does not yet exist.

To see each directory as it is created, add the -v option:

mkdir -pv a/b/c/d/e
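
A minimal sketch of the "no error if part of the path exists" behaviour. It runs in a throwaway directory from mktemp, which is my own addition so that nothing real is touched; the variable names tmp and status are only illustrative.

```shell
# Scratch directory (an assumption for this example, not part of the tip itself).
tmp=$(mktemp -d)

mkdir -p "$tmp/a/b"          # creates a and a/b
mkdir -p "$tmp/a/b/c/d/e"    # a and a/b already exist; only c, d and e are created
mkdir -p "$tmp/a/b/c/d/e"    # the whole path exists; still no error
status=$?                    # exit status of the last mkdir

echo "exit status: $status"
```

Without -p, the last two commands would fail: the second because c does not exist yet as a parent for d, and the third because e already exists.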

Thursday, July 10, 2008

Mallet command to generate topic models

To generate a topic model with Mallet:

First, split the data into individual files, one file per line of input:

split -l 1 -d -a 6 ../data.txt data-
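
To see what the split step produces, here is a small sketch on a made-up three-line file. It assumes GNU split (the -d numeric-suffix option is not available in every split implementation); the scratch directory, the data subdirectory and the sample lines are my own, chosen only to mirror the ../data.txt layout of the command above.

```shell
# Scratch area with a tiny stand-in for data.txt (hypothetical sample content).
tmp=$(mktemp -d)
printf 'first document\nsecond document\nthird document\n' > "$tmp/data.txt"

# Run split from a subdirectory so that ../data.txt resolves as in the command above.
mkdir -p "$tmp/data"
( cd "$tmp/data" && split -l 1 -d -a 6 ../data.txt data- )

# One file per input line, with 6-digit numeric suffixes starting at 000000.
ls "$tmp/data"
```

With -l 1 each output file holds exactly one line, and -a 6 leaves room for up to a million files.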

Then convert the split data to Mallet format:

text2vectors --input data --remove-stopwords --output data-mallet.txt --keep-sequence TRUE --keep-sequence-bigrams TRUE

Next, generate the topics:

vectors2topics --input data-mallet.txt --num-topics 250 --num-top-words 100 --output-doc-topics doc-topics.txt --num-iterations 100 --show-topics-interval 1000 > topic-words.txt

To generate topics at the phrase level, add the --use-ngrams option:

vectors2topics --input data-mallet.txt --num-topics 250 --num-top-words 100 --output-doc-topics doc-topics.txt --num-iterations 100 --show-topics-interval 1000 --use-ngrams true > topic-phrases.txt

If any source file is modified, run "make clean", "make", and "make jar", in that order. If "make jar" is skipped, the change will not take effect.