Entity Extraction via Ensemble Semantics – Pennacchiotti & Pantel (2009)
This paper proposes a new framework for information extraction called Ensemble Semantics. The authors describe a knowledge extraction framework that collects information from multiple knowledge sources. Multiple knowledge extractors propose candidate relations, and feature extractors compute features that indicate how reliable each candidate is. A ranker then orders the candidates to produce the final knowledge relations. The particular system drew on web query logs, web corpus data (600 million documents), structured web data (web tables), and Wikipedia features.

The authors demonstrate that combining multiple knowledge extractors (distributional and pattern-based) with features from all knowledge sources significantly improves mean average precision (MAP) for the categories of musicians, actors, and athletes. Wikipedia features alone significantly improved performance over the baseline system. Including web query logs and web corpus data further improved performance and subsumed the benefits of using Wikipedia features.

The confidence scores of the knowledge extractors were the most important features. Features extracted from web data, which was drawn from query logs, structured table data, and free-form web text, were also very important. Features capturing whether a term is well formed, how popular the term is, and which terms commonly co-occur with it were all very useful. The weighting of features varied by knowledge source, which illustrates a key advantage of the Ensemble Semantics framework: combining different knowledge sources and using different features to integrate them significantly improves knowledge extraction performance.
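
The extract, featurize, and rank flow summarized above can be illustrated with a small Python sketch. This is not the authors' implementation: the extractor functions, the feature names, the toy candidates, and the weights are all hypothetical, and the linear weighted sum is only a stand-in for whatever ranking model the paper actually trains. The average precision function uses one common definition (normalizing by the number of correct items found in the list), which may differ from the paper's exact formula.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Candidate = Tuple[str, str]  # (term, category), e.g. ("Nina Simone", "musician")


@dataclass
class Scored:
    candidate: Candidate
    features: Dict[str, float] = field(default_factory=dict)
    score: float = 0.0


def run_ensemble(
    extractors: Dict[str, Callable[[], List[Tuple[Candidate, float]]]],
    feature_extractors: List[Callable[[Candidate], Dict[str, float]]],
    weights: Dict[str, float],
) -> List[Scored]:
    """Pool candidates from all extractors, attach features, and rank them."""
    pool: Dict[Candidate, Scored] = {}
    # Step 1: each knowledge extractor proposes candidates with a confidence,
    # and that confidence becomes a feature of the candidate.
    for name, extract in extractors.items():
        for cand, conf in extract():
            item = pool.setdefault(cand, Scored(cand))
            item.features[f"conf_{name}"] = conf
    # Step 2: feature extractors add evidence drawn from other sources.
    for item in pool.values():
        for fx in feature_extractors:
            item.features.update(fx(item.candidate))
        # Step 3: score the candidate. ASSUMPTION: a linear model with
        # hand-set weights stands in for a learned ranker.
        item.score = sum(weights.get(k, 0.0) * v for k, v in item.features.items())
    return sorted(pool.values(), key=lambda s: s.score, reverse=True)


def average_precision(ranked: List[Candidate], gold: Set[Candidate]) -> float:
    """Mean of precision@k over the ranks k where a correct item appears."""
    hits, total = 0, 0.0
    for rank, cand in enumerate(ranked, start=1):
        if cand in gold:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical toy extractors and features, for illustration only.
def pattern_extractor() -> List[Tuple[Candidate, float]]:
    return [(("Miles Davis", "musician"), 0.9), (("Click Here", "musician"), 0.6)]


def distributional_extractor() -> List[Tuple[Candidate, float]]:
    return [(("Miles Davis", "musician"), 0.7), (("Nina Simone", "musician"), 0.8)]


def query_log_features(cand: Candidate) -> Dict[str, float]:
    popularity = {"Miles Davis": 0.8, "Nina Simone": 0.7, "Click Here": 0.1}
    return {"query_popularity": popularity.get(cand[0], 0.0)}


weights = {"conf_pattern": 1.0, "conf_distributional": 1.0, "query_popularity": 0.5}
gold = {("Miles Davis", "musician"), ("Nina Simone", "musician")}

ranked = run_ensemble(
    {"pattern": pattern_extractor, "distributional": distributional_extractor},
    [query_log_features],
    weights,
)
for item in ranked:
    print(item.candidate, round(item.score, 2))
print("AP:", average_precision([s.candidate for s in ranked], gold))
```

In this toy run the noisy candidate proposed by only one extractor and with low query popularity ends up ranked last, which is the kind of effect the combined features are meant to produce. MAP would be the mean of the average precision values across categories.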