wjxtank2010/Georgetown_2016_Fall_QPR

Georgetown Memex Human Traffic Point Fact Search System

This is the search system for Memex Human Traffic Point Fact questions, developed by the Georgetown University Infosense Team.

Usage

The system is divided into four parts: Search, Validation, Answer Extraction and Ranking. It is implemented mainly in Python, so several packages need to be installed before you run it:

::

    fuzzywuzzy,elasticsearch,certifi,pyyaml,bs4,webcolors,nltk,cbor,lxml

The repository also contains a shell script named pipInstall.sh that installs all of the packages above when you run it.
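
Before running the system, it can help to confirm that the packages are importable. The following is a minimal sketch and not part of the repository; the function name ``missing_modules`` is hypothetical, and the list uses import names, which is why pyyaml appears as ``yaml``.

```python
import importlib.util

# Import names for the packages listed above. pyyaml is imported as "yaml";
# the other names are assumed to match their package names.
REQUIRED_MODULES = [
    "fuzzywuzzy", "elasticsearch", "certifi", "yaml", "bs4",
    "webcolors", "nltk", "cbor", "lxml",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

If anything is reported as missing, running pipInstall.sh should install it.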

Search

The Search part takes a SPARQL query as input, performs query parsing and query expansion, then builds an Elasticsearch query body and retrieves the top 3000 documents from Elasticsearch.

The code for this part is mainly in search.py. A sample SPARQL query input could be:

::

    {
        "type": "Point Fact",
        "question": "What is the country of birth listed in the ad that contains the phone number 6135019502, in Toronto Ontario, with the title 'the millionaires mistress'?",
        "id": "192",
        "SPARQL": ["PREFIX qpr: <http://istresearch.com/qpr>\nSELECT ?ad ?ethnicity\nWHERE\n{\t?ad a qpr:Ad ;\n\tqpr:phone '6135019502' ;\n\tqpr:location 'Toronto, Ontario' ;\n\tqpr:title ?title .\n\tFILTER CONTAINS(LCASE(?title), 'the millionaires mistress')\n}"]
    }

and the parsed query would be:

::

    {
        'must_search_field': {
            'phone': '6135019502',
            'location': 'Toronto',
            'title': 'the millionaires mistress'
        },
        'should_search_field': {
            'location': 'Toronto, Ontario'
        },
        'group': {},
        'required_match_field': {
            'phone': '6135019502',
            'location': 'Toronto, Ontario',
            'title': 'the millionaires mistress'
        },
        'answer_field': {
            'ethnicity': '?ethnicity'
        },
        'type': 'Point Fact',
        'id': '192'
    }
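
To illustrate how the required conditions and the answer field can be read out of a query like the sample, here is a rough regex-based sketch. It is not the parser in search.py: the function name ``parse_sparql`` is hypothetical, it only handles the simple patterns that appear in the sample query, and it does not produce ``must_search_field``, ``should_search_field`` or ``group``.

```python
import re


def parse_sparql(sparql):
    """Pull literal constraints and answer variables out of a simple query."""
    # Triple patterns of the form: qpr:field 'literal'
    required = dict(re.findall(r"qpr:(\w+)\s+'([^']*)'", sparql))
    # Constraints of the form: FILTER CONTAINS(LCASE(?var), 'literal')
    for var, value in re.findall(
            r"CONTAINS\(\s*LCASE\(\s*\?(\w+)\s*\)\s*,\s*'([^']*)'\s*\)", sparql):
        required[var] = value
    # Variables between SELECT and WHERE, except the ad itself
    select = re.search(r"SELECT\s+(.*?)\s+WHERE", sparql, re.S)
    variables = re.findall(r"\?(\w+)", select.group(1)) if select else []
    answer = {v: "?" + v for v in variables if v != "ad"}
    return {"required_match_field": required, "answer_field": answer}


query = ("PREFIX qpr: <http://istresearch.com/qpr>\n"
         "SELECT ?ad ?ethnicity\nWHERE\n{\t?ad a qpr:Ad ;\n"
         "\tqpr:phone '6135019502' ;\n\tqpr:location 'Toronto, Ontario' ;\n"
         "\tqpr:title ?title .\n"
         "\tFILTER CONTAINS(LCASE(?title), 'the millionaires mistress')\n}")
parsed = parse_sparql(query)
```

For the sample query, this yields the same ``required_match_field`` and ``answer_field`` entries as the parsed query shown above.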

The Elasticsearch query body (after query expansion) would be:

::

    {
        'query': {
            'bool': {
                'should': [
                    {'match': {'extracted_text': '613-501-9502'}},
                    {'match': {'extracted_text': '(613)501-9502'}},
                    {'match': {'extracted_text': 'Toronto, Ontario'}},
                    {'match': {'extracted_text': 'ethnicity'}}
                ],
                'must': {
                    'match': {'extracted_text': '613 AND 501 AND 9502 AND Toronto AND the millionaires mistress'}
                }
            }
        },
        'size': 3000
    }
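
The step from the parsed query to this query body can be sketched as follows. This is an illustration inferred from the example above, not the code in search.py: the helper names ``expand_phone`` and ``build_query_body`` are hypothetical, and the expansion rules (two phone formats, answer field names added as ``should`` clauses) are assumptions read off the sample output.

```python
def expand_phone(phone):
    """Return two common written forms of a 10-digit phone number."""
    area, prefix, line = phone[:3], phone[3:6], phone[6:]
    return [f"{area}-{prefix}-{line}", f"({area}){prefix}-{line}"]


def build_query_body(parsed, size=3000):
    """Build a query body shaped like the example above (assumed rules)."""
    must_terms, should = [], []
    for field, value in parsed["must_search_field"].items():
        if field == "phone":
            # Require the digit groups, and boost the formatted variants.
            must_terms += [value[:3], value[3:6], value[6:]]
            should += [{"match": {"extracted_text": v}}
                       for v in expand_phone(value)]
        else:
            must_terms.append(value)
    for value in parsed["should_search_field"].values():
        should.append({"match": {"extracted_text": value}})
    for field in parsed["answer_field"]:
        # The answer field name itself is used as a search hint.
        should.append({"match": {"extracted_text": field}})
    return {
        "query": {"bool": {
            "should": should,
            "must": {"match": {"extracted_text": " AND ".join(must_terms)}},
        }},
        "size": size,
    }
```

Calling ``build_query_body`` on the parsed query shown earlier reproduces the sample body; the resulting dictionary is what would be handed to the Elasticsearch client as the search body.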

Validation

After document retrieval, validation checks whether each document is actually what we are searching for. It takes the candidate documents from the previous step as input and generates a validation score for each document. The validation step is performed by the validate function in main.py, which validates documents using functions in extraction.py.

There are two validation modes: restricted and unrestricted.

In restricted mode, all of the given conditions in the query (stored in required_match_field in the parsed query) have to be satisfied for a document to be validated. In unrestricted mode, the more conditions a document satisfies, the better it is.

The system measures validation quality with a validation score. In restricted mode, the validation score is either 1 (all given conditions satisfied) or 0 (any condition not satisfied). In unrestricted mode, the validation score is the fraction of conditions satisfied. For example, if there are 5 given conditions and a document meets 3 of them, the validation score for that document is 3/5 = 0.6.

Initially, the system answers the query in restricted mode. If restricted mode yields no answers, the system automatically tries unrestricted mode.
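
The two modes and the fallback between them can be sketched as below. This is a simplified illustration, not the validate function in main.py: the names are hypothetical, a condition is treated as satisfied only on exact equality (the real system relies on the extraction functions in extraction.py), and the fallback triggers when no document validates rather than when no answer is found.

```python
def validation_score(required, extracted, restricted=True):
    """Score one document against the required conditions.

    required:  dict of field -> expected value (required_match_field)
    extracted: dict of field -> value found in the document
    """
    if not required:
        return 1.0  # assumption: nothing to check counts as fully valid
    satisfied = sum(1 for field, value in required.items()
                    if extracted.get(field) == value)
    if restricted:
        return 1.0 if satisfied == len(required) else 0.0
    return satisfied / len(required)


def validate_documents(required, documents):
    """Try restricted mode first, then fall back to unrestricted mode."""
    for restricted in (True, False):
        scores = {doc_id: validation_score(required, features, restricted)
                  for doc_id, features in documents.items()}
        scores = {doc_id: s for doc_id, s in scores.items() if s > 0}
        if scores:
            return scores
    return {}
```

With 5 required conditions and a document that meets 3 of them, restricted mode gives 0 and unrestricted mode gives 3/5 = 0.6, matching the example above.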

Answer Extraction

In the answer extraction part, the system checks whether the validated documents really contain an answer to the query and gives each document an answer extraction score. It also uses functions in extraction.py, this time extracting the features stored in answer_field in the parsed query.

However, this can be challenging because of noise: one document may contain more than one candidate answer. We consider an answer more confident when it appears close to relevant person features.

After answer extraction, if there is only one answer in a document, the document gets an answer extraction score of 1, computed as 1 - 0 (0 means no noise).

If there are multiple answers, the system calculates the average word distance between each answer and the selected features (features relevant to a person, e.g. name, address, email...). For example, if the selected features are name, address and email, and the document contains 2 names, 1 address, 0 emails and 3 answers, the average word distance for answer_i is defined as:

::

    avg_dis_i = (|P_name_1 - P_ans_i| + |P_name_2 - P_ans_i| + |P_address_1 - P_ans_i|) / 3

where P is the word position as a proportion of the whole document's length.

The better the answer, the smaller its average word distance. If answer_k has the smallest average word distance, the answer extraction score of the document is 1 - avg_dis_k. This denoising is done in the clarify function in main.py.
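
The scoring rule above can be sketched as follows. This is an illustration under stated assumptions, not the clarify function in main.py: the function name is hypothetical, positions are assumed to be fractions in [0, 1] (word index divided by document length), and the handling of documents with no answers or no person features is a guess.

```python
def answer_extraction_score(answer_positions, feature_positions):
    """Return (score, index of the best answer) for one document.

    Both arguments are lists of relative word positions in [0, 1].
    """
    if not answer_positions:
        return 0.0, None  # assumption: no answer found means score 0
    if len(answer_positions) == 1:
        return 1.0, 0     # single answer: 1 - 0, no noise
    if not feature_positions:
        return 1.0, 0     # assumption: nothing to compare against
    # Average distance from each answer to every person feature found.
    avg_dis = [
        sum(abs(p - a) for p in feature_positions) / len(feature_positions)
        for a in answer_positions
    ]
    best = min(range(len(avg_dis)), key=avg_dis.__getitem__)
    return 1.0 - avg_dis[best], best


# Two names at 10% and 20% of the document, one address at 30%,
# and three candidate answers at 15%, 60% and 90%.
score, best = answer_extraction_score([0.15, 0.60, 0.90], [0.10, 0.20, 0.30])
```

In this example the first answer sits closest to the person features, so it is selected and its average distance determines the document's score.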

Ranking

After obtaining the validation score and the answer extraction score, the system calculates a final score for each document and uses it for ranking. The final score is defined as:

::

    final score = validation score * answer extraction score

We then filter the documents with a threshold (currently 0.5). If no document scores above 0.5, the system returns the higher-scoring half of the candidates. The threshold can be adjusted if needed.

This ranking step is done in the generate_formal_answer function in main.py.
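
The ranking rule can be sketched as below. This is a minimal illustration, not the generate_formal_answer function itself: the function name ``rank_documents`` is hypothetical, and rounding the "half" up for an odd number of candidates is an assumption.

```python
def rank_documents(validation, extraction, threshold=0.5):
    """Rank documents by final score and apply the threshold filter.

    validation, extraction: dicts of doc_id -> score.
    Returns a list of (doc_id, final_score), best first.
    """
    final = {doc_id: validation[doc_id] * extraction.get(doc_id, 0.0)
             for doc_id in validation}
    ranked = sorted(final.items(), key=lambda item: item[1], reverse=True)
    kept = [(doc_id, s) for doc_id, s in ranked if s > threshold]
    if not kept:
        # No document clears the threshold: keep the better half
        # (rounded up, which is an assumption of this sketch).
        kept = ranked[:(len(ranked) + 1) // 2]
    return kept
```

For example, documents with final scores 0.9, 0.54 and 0.4 yield the first two; if all scores were at or below 0.5, the top half of the ranked list would be returned instead.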
