We address two problems in this paper. First, we want to verify the correctness of hundreds of millions of isA relationships. That is, given a candidate pair <c, e>, we want to evaluate how likely it is that e is an entity of class c. Second, given a candidate pair <e1, e2> and a known relationship R between classes c1 and c2, we want to evaluate whether relationship R holds between e1 and e2.
The explosive growth and popularity of the World Wide Web have resulted in a huge amount of text on the Internet, which presents an unprecedented opportunity for Information Extraction (IE). IE is at the core of many emerging applications, such as entity search, text mining, and risk analysis using financial reports. In these applications, we can divide the outcomes of IE into two categories according to their frequency: heads and tails. The heads are those that occur very frequently in the corpus. For instance, we can extract the fact that "google is a company" from numerous distinct sentences. Assessing such extractions relies on the assumption that the higher the frequency of an extraction, the more likely it is to be correct. Nevertheless, there are results that occur very infrequently. For instance, suppose that from a corpus we extract a statement that Rhodesia is a country, and its occurrences in the corpus are few and far between. (Rhodesia was an unrecognised state in southern Africa that existed between 1965 and 1979, following its Unilateral Declaration of Independence from the United Kingdom on 11 November 1965.) In Table 1, we show some frequent and rare candidate countries extracted from a web corpus using Hearst patterns. It turns out that all frequent entities are correct, while the majority of infrequent ones are incorrect. The mistakes come from either the extraction algorithm or erroneous sentences in the corpus.
Table 1: Frequent and infrequent candidate entities of country
Frequent Entities | Rare Entities |
India | Northern |
China | Sabah |
Germany | Yap |
Australia | Parts of sudan |
Japan | Wealthy |
France | Western romania |
Canada | American artists |
USA | South korea japan |
Brazil | New sjaelland |
Italy | Rhodesia |
How to verify the correctness of a tail extraction (also known as a sparse extraction) is one of the most important and challenging problems in IE. The distribution of words and phrases in a corpus of natural language utterances follows Zipf's law, which states that the frequency of any word or phrase is inversely proportional to its rank in the frequency table. Thus, most words and phrases occur only a few times in any particular syntactic pattern we use for extraction. Without a good mechanism to distinguish correct extractions from incorrect ones, sparse information extraction will be plagued by either low precision or low recall.
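To make the long-tail effect concrete, the following minimal sketch assumes an idealized Zipf distribution with exponent 1, where the frequency of the item at rank r is proportional to 1/r. The function names, the number of items, and the total number of occurrences are hypothetical values chosen for illustration only; they are not statistics of our corpus.

```python
def zipf_frequencies(num_items: int, total_occurrences: int) -> list[float]:
    """Expected frequency of each rank under an idealized Zipf law (exponent 1)."""
    # Normalizing constant: the harmonic number H_n = sum of 1/r for r = 1..n.
    harmonic = sum(1.0 / r for r in range(1, num_items + 1))
    return [total_occurrences / (r * harmonic) for r in range(1, num_items + 1)]


def tail_fraction(freqs: list[float], threshold: float) -> float:
    """Fraction of items whose expected frequency is at most `threshold`."""
    return sum(1 for f in freqs if f <= threshold) / len(freqs)


# Hypothetical numbers, used only to illustrate the shape of the distribution.
freqs = zipf_frequencies(num_items=5000, total_occurrences=100_000)
print(f"top-ranked item: {freqs[0]:.0f} expected occurrences")
print(f"items with at most 10 occurrences: {tail_fraction(freqs, 10):.1%}")
```

Under these assumptions, a handful of head items absorb most occurrences while the large majority of items fall below the threshold, which is the situation a sparse extraction verifier has to handle.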
Existing efforts in information extraction or sparse extraction can be divided into the following four classes. Heuristic-based approaches start with a set of seed entities for a given relation, or with some prior knowledge of the label distribution, and identify extraction patterns for the relation iteratively. Redundancy-based approaches require that extractions appear relatively frequently with a limited set of patterns. Knowledge-based approaches perform information extraction with the help of external resources, such as Wikipedia, Freebase and WordNet. Finally, most popular approaches to handling sparse extractions are context-based model-building approaches. They rely on an important hypothesis known as the distributional hypothesis, which says that different entities of the same semantic relation (unary or binary) tend to appear in similar textual contexts. For example, we may not find many occurrences of Rhodesia in the Hearst pattern "countries such as Rhodesia". But if Rhodesia appears in contexts similar to those in which terms such as India, USA, and Germany occur, then according to the distributional hypothesis we can be more certain about the claim that Rhodesia is a country. This hypothesis is useful for assessing sparse extractions. However, the challenge lies in modeling contexts and measuring the semantic similarity of two contexts.
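The distributional hypothesis can be illustrated with a simple bag-of-words baseline: collect the words that co-occur with a candidate, and compare them with the words that co-occur with known seeds using cosine similarity. This is only a sketch of the general idea, not the semantic representation proposed in this work; the function names and the toy corpus are invented for illustration, and for simplicity only single-token terms are matched.

```python
import math
from collections import Counter


def context_vector(sentences: list[str], term: str) -> Counter:
    """Bag-of-words counts of tokens that co-occur with `term` in a sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        if term in tokens:
            counts.update(t for t in tokens if t != term)
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus invented for illustration.
corpus = [
    "the government of india signed the treaty",
    "the government of germany signed the treaty",
    "the government of rhodesia declared independence",
    "wealthy families bought the painting",
]
# Pool the contexts of the seed entities into one vector.
seed_context = context_vector(corpus, "india") + context_vector(corpus, "germany")
for candidate in ("rhodesia", "wealthy"):
    score = cosine(context_vector(corpus, candidate), seed_context)
    print(candidate, round(score, 3))
```

In this toy setting the candidate that shares contexts with the seeds scores higher than the one that does not. The sketch also shows the weakness of plain word overlap: frequent function words such as "the" contribute to the score, which is one reason a more semantic notion of context is needed.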
We now analyze the challenges in these tasks. The first challenge is scale. For example, there are hundreds of millions of isA relationships (formed among 2.7 million categories and 5.5 million entities in Probase [1][2]). Learning a generative model (such as an HMM or a deep learning model) over the contexts of all entities is impractical because it is very time-consuming. The second challenge lies in improving the effectiveness of the verifier. Feature representations based on the contexts of words are very different from those based on the contexts of entities. Meanwhile, neither a bag of words nor a set of hidden states provides good semantics for understanding the relationship within a candidate pair. Motivated by this, in this paper we introduce a semantic, scalable, and effective approach for sparse information extraction assessment.
First, we introduce a semantic approach for solving the two problems. More precisely, we come up with a semantic representation of the contexts. This approach is natural because we are dealing with a large semantic network, which provides semantic information in various aspects. Using this information, we are able to introduce semantic features to describe a context, which leads to a lightweight and effective solution to context learning.
Second, we scan billions of web documents using MapReduce to capture the contexts of millions of entities and pairs of entities in Probase, and then compare their contexts with the contexts of seeds. We further use the similarities evaluated by our three semantic context-based approaches as the feature space for a given pair, and then train a binary classifier on a small amount of labeled data, trying different base classifiers and selecting the best one for predicting sparse extractions. Extensive studies show that our approach achieves better performance than state-of-the-art approaches in sparse extraction assessment.
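The context-collection step can be pictured as a map phase that emits (entity, context token) pairs and a reduce phase that aggregates them per entity. The sketch below runs both phases in a single process and is only meant to show the data flow; it is not our MapReduce job. The function names, the window size, and the two example documents are assumptions made for illustration, and only single-token entities are matched.

```python
from collections import Counter, defaultdict
from typing import Iterable, Iterator


def map_document(doc: str, entities: set[str], window: int = 2) -> Iterator[tuple[str, str]]:
    """Map step: emit (entity, context_token) for tokens within `window` of a mention."""
    tokens = doc.lower().split()
    for i, token in enumerate(tokens):
        if token in entities:
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    yield token, tokens[j]


def reduce_contexts(pairs: Iterable[tuple[str, str]]) -> dict[str, Counter]:
    """Reduce step: group emitted pairs by entity and count the context tokens."""
    contexts: dict[str, Counter] = defaultdict(Counter)
    for entity, context_token in pairs:
        contexts[entity][context_token] += 1
    return dict(contexts)


# Two invented documents and two entities, for illustration only.
docs = ["tourists visit sabah every year", "tourists visit japan every spring"]
pairs = (pair for doc in docs for pair in map_document(doc, {"sabah", "japan"}))
contexts = reduce_contexts(pairs)
print(contexts["sabah"])
```

Because each document is mapped independently and the reduce step only needs the pairs that share an entity key, this structure distributes naturally over many machines, which is what makes scanning a web-scale corpus feasible.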
For the experimental data sets, we randomly selected about 1800 entities that belong to 12 classes in Probase. Tables 2 and 3 show the descriptions and some examples of each class, respectively. Each entity has no more than 10 occurrences in Hearst patterns, and we call these entities sparse extractions. We chose this threshold because more than 90% of the entities of these 12 concepts have no more than 10 occurrences in Probase, that is, they lie in the long tail of the entity distribution curve. For example, Figure 2 shows the frequency distribution over the entities of country. We can clearly see the long-tail phenomenon under the dotted line, which marks no more than 10 occurrences. We asked human judges to evaluate their correctness. We also looked into three binary relations: isCapitalOf, isCurrencyOf, and headquarteredIn. We randomly picked 315 sparse extractions that have no more than 10 occurrences, and we also picked the 10 most frequent extractions for each relation to serve as seeds. Details of all test relationships are shown in Table 2.
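The selection rule described above (the most frequent extractions serve as seeds, and extractions with no more than 10 occurrences are treated as sparse) can be sketched as follows. The function name and the frequency counts are hypothetical and chosen for illustration; they are not the actual Probase counts, and the random sampling of the sparse set is omitted.

```python
def split_by_frequency(freq: dict[str, int], sparse_max: int = 10, num_seeds: int = 10):
    """Return (seeds, sparse): the most frequent entities and the rarely seen ones."""
    # Rank by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(freq, key=lambda e: (-freq[e], e))
    seeds = ranked[:num_seeds]
    seed_set = set(seeds)
    # Sparse extractions: at most `sparse_max` occurrences and not already a seed.
    sparse = [e for e in ranked if freq[e] <= sparse_max and e not in seed_set]
    return seeds, sparse


# Hypothetical occurrence counts for candidate entities of the class "country".
counts = {"india": 950, "china": 900, "germany": 700, "rhodesia": 3, "yap": 2, "wealthy": 1}
seeds, sparse = split_by_frequency(counts, sparse_max=10, num_seeds=3)
print("seeds:", seeds)
print("sparse:", sparse)
```

The sparse list produced this way mixes correct and incorrect candidates (here "rhodesia" and "wealthy"), which is why each sparse pair still needs a human label before it can be used for evaluation.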
Table 2: Data sets used in experiments
Relation | total pairs in Probase | pairs with frequency < 10 | pairs in experiments | #bad pairs | #good pairs |
isA relationships | | | | | |
country | 5534 | 92.81% | 415 | 226 | 189 |
sport | 2866 | 92.18% | 335 | 67 | 268 |
city | 8815 | 90.05% | 231 | 33 | 198 |
animal | 5562 | 92.38% | 186 | 37 | 149 |
seasoning | 531 | 92.47% | 169 | 41 | 128 |
company | 59734 | 96.84% | 82 | 9 | 73 |
painter | 1097 | 98.09% | 81 | 5 | 76 |
currency | 330 | 91.82% | 78 | 8 | 70 |
disease | 8280 | 92.60% | 69 | 9 | 60 |
film | 10859 | 96.62% | 65 | 25 | 40 |
language | 2703 | 93.53% | 51 | 6 | 45 |
river | 1924 | 97.77% | 40 | 2 | 38 |
total | 108235 | 92.25% | 1802 | 468 | 1334 |
Binary relationships | | | | | |
isCapitalOf(country, city) | | | 160 | 39 | 121 |
isCurrencyOf(country, currency) | | | 80 | 19 | 61 |
headquarteredIn(company, city) | | | 75 | 22 | 53 |
total | | | 315 | 80 | 235 |
Table 3: Examples of isA relations
Relation | bad pair example | good pair example |
country | <country, democratic people> | <country, g77> |
city | <city, santa martha> | <city, amadora> |
sport | <sport, trafalgar park> | <sport, girls golf> |
animal | <animal, cauquenes> | <animal, moon snail> |
seasoning | <seasoning, bacon bit> | <seasoning, five spice> |
company | <company, institute> | <company, hasbro> |
painter | <painter, robert young> | <painter, childe hassam> |
film | <film, forest gump> | <film, breach> |
language | <language, francophone> | <language, micmac> |
river | <river, manda> | <river, missouri river> |
isCapitalOf(country, city) | <dili, east timor> | <andorra, andorra la vella> |
isCurrencyOf(country, currency) | <baht, thailand> | <colombia, colombian peso> |
headquarteredIn(company, city) | <espoo, general electric> | <michelin, clermont-ferrand> |
For more details, refer to Used Data Sets (1) and Used Data Sets (new).
Our project is implemented in C# with SQL Server. The base classifiers used in our approach come from Weka-3.8.1.jar. For the source code of this project, refer to Source codes.
Our AM (attribute-based context), CM (concept-based context), and IM (isA-based context) approaches have similar parameters, so we explain the parameter list of CM as an example. The main functions of these three approaches are AMMain(string[] args), SuperConceptBasedMain(string[] args), and IMBasedMain(string[] args) in the file "CleaningMain.cs".
Parameter list for our CM approach
Variable | Description |
databaseServer | the name of the database server; |
databaseName | the name of the database; |
testEntityTable | the table of conceptualization; |
isSelectedTopK | whether to select the top-K tokens, 1: yes, 0: no; |
classNumThres | the maximum number of concepts in conceptualization; |
distEvalType | the type of distance evaluation; |
seedsNum | the number of seeds; |
bUseClustering | whether to use clustering, default: false; |
pathStr | the directory of the files; |
[1] Knowledgebase Probase: http://research.microsoft.com/en-us/projects/probase/release.aspx
[2] W. Wu, H. Li, H. Wang, and K. Q. Zhu. Probase: a probabilistic taxonomy for text understanding. In Proceedings of SIGMOD'12, pages 481-492, 2012.
[1] Peipei Li, Haixun Wang, Hongsong Li, and Xindong Wu. Employing Semantic Context for Sparse Information Extraction Assessment. ACM Transactions on Knowledge Discovery from Data, 12(5): 54:1-36, July 2018.
[2] Peipei Li, Haixun Wang, Hongsong Li, and Xindong Wu. Assessing Sparse Information Extraction using Semantic Contexts. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM'13), pages 1709-1714, San Francisco, CA, USA, October 28 to November 1, 2013.
Peipei Li (peipeili@hfut.edu.cn): Hefei University of Technology, China
Haixun Wang (haixun@google.com): Google Research, USA
Hongsong Li (hongsong.lhs@alibaba-inc.com): Alibaba Group, China
Xindong Wu (xwu@uvm.edu): University of Louisiana at Lafayette, USA