initial preprocessing
stores results into stemmed-*.txt
% python preproc.py

creates document vectors for 28 files and computes tf-idf weights
prints everything onto the screen, which I manually cut and paste into a file called cossim-matrix.txt
% python docvector.py

reads in content from cossim-matrix.txt into a numeric matrix
repeatedly finds the smallest distance, then spits out the pair of documents
% python hiercluster.py
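
rough sketch of the idea behind the last two steps, not the actual code in docvector.py or hiercluster.py. Assumptions of mine: documents are already tokenized and stemmed, tf-idf is raw term count times log(N / document frequency), distance is 1 minus cosine similarity, and clusters are merged with single linkage (the notes above do not say which linkage is used). The function names and the toy document names are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        # assumed weighting: raw count * log(N / df)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def distance_matrix(vectors):
    """Square matrix of 1 - cosine similarity (assumed distance measure)."""
    n = len(vectors)
    return [[1.0 - cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


def hierarchical_merges(dist, names):
    """Repeatedly find the smallest distance and record the pair merged.

    Single linkage is an assumption: after a merge, the distance from the
    new cluster to any other cluster is the smaller of the two old distances.
    """
    n = len(names)
    clusters = {i: [names[i]] for i in range(n)}
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    while len(clusters) > 1:
        (i, j), best = min(d.items(), key=lambda kv: kv[1])
        merges.append((clusters[i], clusters[j], best))
        for k in clusters:
            if k in (i, j):
                continue
            key_i = tuple(sorted((i, k)))
            key_j = tuple(sorted((j, k)))
            d[key_i] = min(d[key_i], d[key_j])
            del d[key_j]
        del d[(i, j)]
        clusters[i] = clusters[i] + clusters[j]  # fold cluster j into i
        del clusters[j]
    return merges


if __name__ == "__main__":
    # toy input, not the real 28 files
    names = ["doc-a", "doc-b", "doc-c"]
    docs = [["cat", "dog"], ["cat", "dog", "dog"], ["fish", "sea"]]
    dist = distance_matrix(tfidf_vectors(docs))
    for left, right, distance in hierarchical_merges(dist, names):
        print(left, right, round(distance, 4))
```

building the matrix in memory like this would also avoid the manual cut and paste into cossim-matrix.txt, if the two scripts were ever combined.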