Web-Scale Distributional Similarity and Entity Set Expansion – Patrick Pantel, Eric Crestan, Arkady Borkovsky, Ana-Maria Popescu and Vishnu Vyas. 2009
This paper implements web-scale word similarity metrics and uses them to evaluate performance on a set expansion task. The authors implement a distributed algorithm that computes cosine similarity between context vectors for 500 million terms found in a 200-billion-word web dump. Each context vector consists of the pointwise mutual information (PMI) between the word and the chunk immediately to its left or right, although the details of the features are unclear. Using 200 quad-core Hadoop instances, the entire similarity matrix was computed in 5 hours. The authors took advantage of the sparsity of the matrices to build an inverted index. They then decomposed the similarity computation into three parts: one that involves only the features of the first word, one that involves only the features of the second word, and one that involves the features that are nonzero for both words. The inverted index made the joint part practical to compute, while the two single-word parts could be computed once and cached.
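
The decomposition described above can be illustrated with a small single-machine sketch. This is not the authors' Hadoop implementation. It only shows the idea for cosine similarity: the vector norms depend on one word each and are cached, and the dot product is accumulated through an inverted index so that only features nonzero for both words are touched. The function name `all_pairs_cosine`, the feature names, and the toy weights are all hypothetical; in the paper the weights would be PMI scores.

```python
import math
from collections import defaultdict


def all_pairs_cosine(vectors):
    """Cosine similarity for every pair of terms that share a feature.

    vectors: dict mapping term -> dict mapping feature -> weight
             (weights stand in for PMI scores; values here are made up).
    Returns a dict mapping (term_a, term_b) -> cosine, with term_a < term_b.
    """
    # Single-word parts: each norm depends on one term only,
    # so it is computed once and cached.
    norms = {
        term: math.sqrt(sum(w * w for w in feats.values()))
        for term, feats in vectors.items()
    }

    # Inverted index: feature -> list of (term, weight) with nonzero weight.
    index = defaultdict(list)
    for term, feats in vectors.items():
        for feature, weight in feats.items():
            if weight != 0.0:
                index[feature].append((term, weight))

    # Joint part: accumulate dot products only over shared features.
    # Pairs with no shared feature never appear (their cosine is 0).
    dots = defaultdict(float)
    for postings in index.values():
        postings.sort()  # terms are unique per feature, so this orders by term
        for i, (term_a, w_a) in enumerate(postings):
            for term_b, w_b in postings[i + 1:]:
                dots[(term_a, term_b)] += w_a * w_b

    # Combine the cached single-word parts with the joint part.
    return {
        (a, b): dot / (norms[a] * norms[b])
        for (a, b), dot in dots.items()
    }


# Toy context vectors: "L:" / "R:" mark left and right context (hypothetical).
toy_vectors = {
    "paris": {"L:visit": 2.0, "R:france": 1.0},
    "london": {"L:visit": 2.0, "R:england": 1.0},
    "banana": {"L:eat": 3.0},
}
sims = all_pairs_cosine(toy_vectors)
print(sims)
```

In this toy input only "london" and "paris" share a feature, so only that pair is scored. The cost of the joint part grows with the length of each feature's posting list rather than with the total number of term pairs, which is why sparsity matters at web scale.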