I am trying to process a bunch of text generated by human users to figure out what they are talking about. The text comes from an experiment in which a person could control some robots, with a few target areas the robots could move to and a few things they could move, like a crate. One thing that occurred to me was to use TF-IDF, which, given a text and a collection of texts, tells you which words in the text are relatively unusual to that particular text, compared to the rest of the collection.

It turns out that that’s not really what I wanted, because the words that are “unusual” are not really the ones that the particular text is about.

select(1.836) all(3.628) red(2.529) robots(1.517) ((1.444) small(5.014) drag(2.375) near(3.915) lhs(4.321) of(1.276) screen(2.018) )(1.444)

This is a sentence from the collection, and the value after each word is the TF-IDF score for that word. The things I care about are that it’s about robots, and maybe a little that it’s about the left-hand side ("lhs") of the screen. “Robots” actually got a pretty low score (barely more than “of”), while “small” got the highest score in the sentence.
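To see why the scores come out this way, here is a rough back-of-the-envelope sketch. It assumes the scores were produced by the formula in the code later in this post (tf * log(N / (1 + df)), natural log) and that each of the two words occurs once in the sentence, so tf is 1. Under those assumptions, the gap between two scores depends only on how many documents contain each word.

```python
import math

# Scores copied from the example sentence above.
score_robots = 1.517
score_small = 5.014

# With tf == 1, score = log(N / (1 + df)), so the difference of two scores is
# log((1 + df_robots) / (1 + df_small)); N cancels out.
ratio = math.exp(score_small - score_robots)
print(round(ratio, 1))  # roughly how many times more documents mention "robots"
```

In other words, if those assumptions hold, "robots" shows up in roughly 33 times as many documents as "small" (counting 1 + df), which is exactly why TF-IDF marks it as uninteresting: almost every text in the collection is about robots.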

At any rate, this is how I did my TF-IDF calculation. Add documents to the object with add_text, then get scores with get_tfidf. It uses NLTK for tokenization, but you could also use t.split(" ") to break strings up on spaces.

    import math

    import nltk


    class TFIDF(object):
        def __init__(self):
            # Count of docs containing a word
            self.doc_counts = {}
            self.docs = 0.0

        def add_text(self, t):
            # We're adding a new doc
            self.docs += 1.0
            # Get all the unique words in this text
            uniques = set(nltk.word_tokenize(t))
            for u in uniques:
                if u in self.doc_counts:
                    self.doc_counts[u] += 1
                else:
                    self.doc_counts[u] = 1

        def get_tfidf(self, t):
            word_counts = {}
            # Count occurrences of each word in this text
            words = nltk.word_tokenize(t)
            for w in words:
                if w in word_counts:
                    word_counts[w] += 1
                else:
                    word_counts[w] = 1

            # Calculate the TF-IDF for each word
            tfidfs = []
            for w in words:
                # Word count is either 0 (it's in no docs), or the count
                w_docs = self.doc_counts.get(w, 0)
                # The 1 is to avoid div/zero for previously unseen words
                idf = math.log(self.docs / (1 + w_docs))
                tf = word_counts[w]
                tfidfs.append((w, tf * idf))
            return tfidfs
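For trying this out without installing NLTK, here is a compact sketch of the same calculation. It assumes plain whitespace tokenization instead of nltk.word_tokenize, and the function name tfidf_scores and the four-sentence corpus are made up for illustration; they are not from the experiment data.

```python
import math
from collections import Counter


def tfidf_scores(text, corpus):
    """Score each token of `text` against `corpus` (a list of strings)."""
    # Document frequency: number of corpus texts containing each word.
    doc_counts = Counter()
    for doc in corpus:
        doc_counts.update(set(doc.split()))
    n_docs = len(corpus)
    # Term frequency within the text being scored.
    word_counts = Counter(text.split())
    # Same formula as the class above: tf * log(N / (1 + df)).
    return [(w, word_counts[w] * math.log(n_docs / (1 + doc_counts[w])))
            for w in text.split()]


# Made-up corpus, for illustration only.
corpus = [
    "select all red robots",
    "move blue robots to the crate",
    "drag robots near the small target",
    "send robots left",
]
scores = dict(tfidf_scores(corpus[2], corpus))
print(scores["robots"], scores["small"])
```

Even in this tiny corpus the same effect appears: "robots" is in every text, so its score is the lowest (slightly negative here, because of the +1 in the denominator), while "small" appears in only one text and scores highest.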