Understanding Natural Language

Understanding Natural Language – Winograd 1972

Describes the SHRDLU system, which operates in a simulated blocks world.  Argues for many features of modern linguistic theory, including phrase- and feature-based syntax and a semantics built on relations, properties, events and objects.  Introduces language understanding as the task of translating between a string representation of a language and a conceptual representation suitable for inference and reasoning.  Recognizes the interdependence among words, their immediate constituents, local discourse context, overall discourse context and background world knowledge, and combines parsing, interpretation and context-based reasoning into a single simultaneous process.  Language is used to convey meaning for a purpose, and we need systems that recognize this fact and incorporate all the sources of information we use to accomplish that task.

Web-Scale Distributional Similarity and Entity Set Expansion

Web-Scale Distributional Similarity and Entity Set Expansion – Patrick Pantel, Eric Crestan, Arkady Borkovsky, Ana-Maria Popescu and Vishnu Vyas. 2009

This paper implements web-scale word-similarity metrics and tests their performance on an entity set expansion task.  The authors implement a distributed algorithm to compute cosine similarity between context vectors for 500 million terms found in a 200-billion-word web crawl.  Context vectors consist of pointwise mutual information (PMI) scores between a word and the chunk immediately to its left or right, although the details of the features are unclear.  Using 200 quad-core Hadoop instances, the entire similarity matrix was computed in 5 hours.

The authors exploited the sparsity of the matrices to build an inverted index.  They then decomposed the cosine computation into three parts: one involving only the features of the first word, one involving only the features of the second word, and one involving the features with nonzero values for both words.  The inverted index made the shared-feature part practical to compute, while the parts that depend on an individual word could be cached.
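The pipeline above can be sketched in miniature: build sparse positive-PMI context vectors, index terms by feature so that only pairs sharing a feature are ever scored, and cache the per-word norms.  The corpus, feature names and counts below are toy stand-ins, and the MapReduce distribution is omitted entirely:

```python
from collections import defaultdict
from math import log, sqrt

# Toy (term, context-feature) counts.  In the paper the features are the
# chunks immediately to the left/right of each term; these pairs are
# hypothetical hand-picked examples.
cooc = {
    ("apple", "eat_L"): 4, ("apple", "pie_R"): 6,
    ("pear", "eat_L"): 3, ("pear", "tree_R"): 5,
    ("linux", "run_L"): 7, ("linux", "kernel_R"): 2,
}

# Marginal counts needed for PMI.
term_tot, feat_tot, total = defaultdict(int), defaultdict(int), 0
for (t, f), c in cooc.items():
    term_tot[t] += c
    feat_tot[f] += c
    total += c

# Sparse context vectors: term -> {feature: PMI weight}.
vectors = defaultdict(dict)
for (t, f), c in cooc.items():
    pmi = log((c * total) / (term_tot[t] * feat_tot[f]))
    if pmi > 0:          # keep only positive weights, preserving sparsity
        vectors[t][f] = pmi

# Inverted index: feature -> terms carrying it.  Only term pairs sharing
# at least one feature have a nonzero dot product, so only those pairs
# ever need to be scored.
index = defaultdict(set)
for t, feats in vectors.items():
    for f in feats:
        index[f].add(t)

# Norms depend on a single word, so they can be precomputed and cached.
norm = {t: sqrt(sum(w * w for w in feats.values()))
        for t, feats in vectors.items()}

def cosine(a, b):
    """Shared-feature dot product divided by the cached norms."""
    shared = set(vectors[a]) & set(vectors[b])
    dot = sum(vectors[a][f] * vectors[b][f] for f in shared)
    return dot / (norm[a] * norm[b]) if dot else 0.0

# Candidate neighbours of "apple" come straight from the inverted index.
candidates = set().union(*(index[f] for f in vectors["apple"])) - {"apple"}
scores = {t: cosine("apple", t) for t in candidates}
```

Here "pear" is retrieved because it shares the `eat_L` feature with "apple", while "linux" is never scored at all; at web scale that pruning is what makes the computation feasible.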

Entity Extraction via Ensemble Semantics

Entity Extraction via Ensemble Semantics – Pennacchiotti & Pantel (2009)

This paper proposes a new information extraction framework called Ensemble Semantics.  The authors describe a knowledge extraction architecture that collects information from multiple knowledge sources, applies multiple knowledge extractors and feature extractors to produce candidate relations and features of relation reliability, and then ranks the candidates to produce the final knowledge relations.  The particular system drew on web query logs, web corpus data (600 million documents), structured web data (web tables) and Wikipedia features.  The authors demonstrate that combining multiple knowledge extractors (distributional and pattern-based) with features from all knowledge sources significantly improves MAP for the categories of musicians, actors and athletes.

Wikipedia features alone significantly improved performance over the baseline system.  Including web query logs and web corpus data further improved performance and subsumed the benefit of the Wikipedia features.  The confidences of the knowledge extractors were the most important features, followed by features extracted from web data, which were drawn from query logs, structured table data and free-form web pages.  Features capturing the well-formedness of a term, its popularity and its common co-occurring terms were all very useful.  The weighting of features varied by knowledge source, illustrating a key advantage of the Ensemble Semantics framework: combining different knowledge sources, and using different features to integrate them, significantly improves knowledge extraction performance.
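The ranking stage can be sketched as a weighted scorer over per-candidate features drawn from several sources.  Everything here is illustrative: the candidate entities, feature names and weights are invented, and a hand-set linear scorer stands in for the learned ranker, whose per-source weighting the paper learns from data:

```python
# Hypothetical candidates for the category "actors".  Each carries features
# from different knowledge sources: extractor confidences (pattern-based and
# distributional) and query-log popularity.  Names and values are made up.
candidates = {
    "Tom Hanks":   {"pattern_conf": 0.9, "dist_conf": 0.8, "query_pop": 0.7},
    "Los Angeles": {"pattern_conf": 0.4, "dist_conf": 0.2, "query_pop": 0.9},
}

# Per-feature weights.  In the paper these are learned by a ranking model,
# and the learned weighting differs across knowledge sources; the values
# below are assumptions for illustration only.
weights = {"pattern_conf": 2.0, "dist_conf": 1.5, "query_pop": 0.5}

def score(feats):
    """Weighted sum of reliability features for one candidate."""
    return sum(weights[f] * v for f, v in feats.items())

ranked = sorted(candidates, key=lambda c: score(candidates[c]), reverse=True)
```

The point of the sketch is the shape of the problem, not the scorer: a true actor with high extractor confidence outranks a popular-but-wrong candidate, even though the wrong one wins on query-log popularity alone.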