An Attention-Based Transformer for Neural Question Answering on Knowledge Graphs, via a Sequence-to-Sequence Approach, with Automatic Template Generation from Long Text.
This project is Stuart Chen's research for Google Summer of Code (GSoC) 2019, carried out in collaboration with DBpedia and the AKSW research group.
Here is the website blogging the development of this research.
- Python 3.6
- TensorFlow 1.14.0
- TensorFlow Hub
- NumPy
- NLTK
- spaCy
- Unidecode
- xmldict
In this project, I propose a methodology that leverages the Transformer for long-context question answering over a knowledge graph. The main pipeline uses a named-entity annotator and a syntactic parser to generate question templates from passages; a template is a question that contains entities and relations of the knowledge graph. When storing the templates, they are embedded with the Universal Sentence Encoder so that they can be grouped by semantic similarity, which also avoids redundancy in storage. A Transformer is then trained on the templates to translate a natural language question with annotated triples into SPARQL, which retrieves the answer from the knowledge graph. In the evaluation, the answer accuracy rose to 11.4% from the previous 0.93%.
To show the workflow, the model architecture is as follows:
- To begin with, please install the dependencies listed in requirements.txt. Before running any of the scripts, make sure this repository folder has been added to the system path $PYTHONPATH. Also, the spaCy model en_core_web_sm==2.1.0 needs to be downloaded; see the instructions on the official spaCy page. Finally, please create a folder called 'glove2wordvec' in 'neural-qa/data' and put the word2vec file into it.
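As an optional sanity check that the environment is ready (an illustrative snippet, not part of the pipeline itself):

```python
# Verify that the main dependencies import and that the spaCy model is available.
import numpy, nltk, unidecode          # noqa: F401
import tensorflow as tf
import tensorflow_hub as hub           # noqa: F401
import spacy

nlp = spacy.load("en_core_web_sm")
print("spaCy model:", nlp.meta["name"], nlp.meta["version"])
print("TensorFlow:", tf.__version__)
```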
This component aims at automating template generation from long text, with the help of the Universal Sentence Encoder, DBpedia Spotlight, DBpedia Lookup, NLTK, and spaCy.
We need abundant natural language text to obtain more questions involving DBpedia RDF triples, which are then transformed into templates.
For example, if you want to get the articles about Barack Obama (dbr:Barack_Obama), set DBR_NAME=Barack_Obama, then run:
neural-qa/templates_generator> python questions_generate_main.py --dbo_class=$DBR_NAME
Here, the variable $DBR_NAME should be a specific entity, such as Barack_Obama.
The script will automatically create a Bank directory in the neural-qa/data/ folder to save the articles.
The script sentences_filter.py selects the sentences pertinent to the RDF triples that we need.
The script question_convertor.py is responsible for converting the captured sentences into template questions with entity placeholders,
e.g. She was born in France. --> where <A> was born in ?
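As an illustration of how such placeholder substitution can be done with spaCy's named-entity recognizer (a minimal sketch, not the exact logic of question_convertor.py):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def to_template(sentence):
    """Replace named entities with placeholders <A>, <B>, ... to form a template."""
    text = sentence
    # Walk the entities from right to left so earlier character offsets stay valid.
    for i, ent in reversed(list(enumerate(nlp(sentence).ents))):
        placeholder = "<" + chr(ord("A") + i) + ">"
        text = text[:ent.start_char] + placeholder + text[ent.end_char:]
    return text

print(to_template("She was born in France."))  # -> "She was born in <A>."
```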
1.4. Matching these questions against the template questions in existing template sets with the Universal Sentence Encoder
The script sentence_encoder.py builds on the Universal Sentence Encoder[1], which is efficient at semantic sentence matching; it checks whether an existing template already corresponds to the new question.
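A minimal sketch of this matching step with TensorFlow Hub (TF 1.x style, matching the TensorFlow 1.14 requirement; the module URL and the sample sentences are only for illustration):

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Embed the new candidate question together with existing templates and compare them.
sentences = [
    "where <A> was born in ?",           # new candidate question
    "what is the birth place of <A> ?",  # existing template
    "who is the spouse of <A> ?",        # existing template
]

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
embeddings = embed(sentences)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(embeddings)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(vectors[0], v) for v in vectors[1:]]
print(scores)  # keep the best match only if it passes the chosen similarity threshold
```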
1.5. If the matching similarity score does not pass the threshold, the question goes to the query composition part
To use the pipeline, please run templates_generate_main.py after step 1 above:
python templates_generate_main.py --dbo_class=$DBO_CLASS --temps_fpath=$EXISTING_TEMPLATES_FILE_PATH --text_fpath=$TEXT_FILE_PATH --ntriple_fpath=$NTRIPLES_FILE_PATH --train_vec=$WHETHER_TO_TRAIN_THE_VECTOR --vecpath=$FILE_PATH_THAT_SAVES_VECTORS --temp_save_path=$FILE_PATH_SAVING_RESULTS
This will automatically initiate the pipeline.
Please have a look at the parameters:
- for --dbo_class=$DBO_CLASS, the $DBO_CLASS should be an ontology category, like Person, Monument, etc.
- for --temps_fpath=$EXISTING_TEMPLATES_FILE_PATH, the $EXISTING_TEMPLATES_FILE_PATH should be the file path to the template set for the DBpedia entity resource (dbr); for example, for Barack_Obama we should use the template set for Person.
- for --text_fpath=$TEXT_FILE_PATH, the $TEXT_FILE_PATH should be the text article extracted from the Wikipedia page.
- for --ntriple_fpath=$NTRIPLES_FILE_PATH, it should be the ntriple file.
- for --train_vec=$WHETHER_TO_TRAIN_THE_VECTOR, the default is to use the prepared vectors; however, you can set it to True, which trains the vectors with the Universal Sentence Encoder.
- for --vecpath=$FILE_PATH_THAT_SAVES_VECTORS, it is the file path where the vectors are stored.
- for --temp_save_path=$FILE_PATH_SAVING_RESULTS, please set the file path where you want to save the newly generated template set.
To find the automatically saved ntriple files and text files, please go into neural-qa/data/Bank/DBresources/; you will see a folder corresponding to the entity's ontology category. For example, Barack_Obama is in the category Person, so you can find the folder neural-qa/data/Bank/DBresources/Person/Barack_Obama, and the ntriple file and the text file will be there.
One result of our work can be seen here, which helps clarify the structure of the templates Bank directory, with the output results inside Bank/DBresourses/Person/Barack_Obama.
For example, to run the program for dbr_Barack_Obama, we use the command below:
neural-qa/templates_generator> python templates_generate_main.py --dbo_class=Person --temps_fpath=../data/annotations_Person.csv --text_fpath=../data/Bank/DBresourses/Person/Barack_Obama/Barack_Obama.txt --ntriple_fpath=../data/Bank/DBresourses/Person/Barack_Obama/Barack_Obama.ntriples --vecpath=../data/Bank/DBresourses/Person/Barack_Obama/Barack_Obama.vectors --temp_save_fpath=../data/Bank/DBresourses/Person/Barack_Obama/Barack_Obama.template.csv
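To quickly inspect the newly generated template set, a short peek at the CSV may help (illustrative only; the exact column layout depends on the generator):

```python
import csv

# Print the first few rows of the generated template set
# (path taken from the example command above).
path = "../data/Bank/DBresourses/Person/Barack_Obama/Barack_Obama.template.csv"
with open(path, newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.reader(f)):
        print(row)
        if i == 4:
            break
```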
The implementation of this neural Transformer part draws inspiration from the paper Attention Is All You Need[2] and the official TensorFlow model[3].
We use the templates in CSV format provided by SPARQL as a Foreign Language[4] to generate the training data for the experiments.
The generated data consists of two parts: data.en, the source data, and data.sparql, the target data.
The data.en file contains the natural language questions with annotated RDF entities, which are to be translated into the RDF structured query language SPARQL, as in this example:
"who is the spouse of dbr_Barack_Obama ?"
"who is the partner of dbr_Audrey_Hepburn ?"
...
To begin with, please run the data generation:
- this command must be run with Python 2.7, since it comes from the previous project.
cd neural-qa/
mkdir data/QALD7
neural-qa> python generator.py --transformer=True --templates data/QALD-7.csv --output data/QALD7
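After the generator has finished, a quick sanity check that data.en and data.sparql are line-aligned can save debugging later (a minimal, illustrative check; the paths follow the --output folder above):

```python
# The Transformer is trained on parallel lines, so source and target files must align.
def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

n_en = count_lines("data/QALD7/data.en")
n_sparql = count_lines("data/QALD7/data.sparql")
print(n_en, n_sparql)
assert n_en == n_sparql, "data.en and data.sparql are not aligned"
```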
Next, change to the Transformer directory:
cd neural-qa/transformer_atten/transformer
Then make a folder named 'data' inside the transformer folder, and within it make a folder QALD7; copy the generated data files into the ./data/QALD7/ folder. After that, run the pre-processing script, which builds the vocabulary file and splits the data into a training set and a validation set:
neural-qa/transformer_atten/transformer> python data_preprocess.py --data_dir=./data/QALD7
Then we can start training the model:
neural-qa/transformer_atten/transformer> python transformer_main.py --data_dir=./data/QALD7/DATA_DIR --model_dir=./data/QALD7/model_QALD7 --vocab_file=./data/QALD7/vocab.en_sparql --param_set=big
- Please make sure the folders and paths that have been set in the commands already exist.
- We strongly encourage using a previously generated dataset, which can be found here; decompress the zipped file and put its contents in neural-qa/transformer_atten/transformer/data/QALD7.
To conduct the training, please note the parameters to set:
- Please put all the tfrecord files in neural-qa/transformer_atten/transformer/data/QALD7/DATA_DIR/ to prevent runtime issues.
PARAM_SET=big
DATA_DIR=$path/to/the/data
MODEL_DIR=$path/to/your/model
VOCAB_FILE=$DATA_DIR/vocab.en_sparql
- Just a side note: please make sure the generated data for training are put in a folder that contains nothing but the data, and put the generated tfrecords into a DATA_DIR folder in transformer/data/QALD7; otherwise tf.errors.DataLossError might be raised. The model is prone to threading issues and corrupted-data loss errors; the feasible solution we know of is to put the tfrecords in a separate folder and make sure read/write access to the files inside has already been granted.
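For instance, a small helper like the following can gather the preprocessed record shards into a dedicated DATA_DIR folder (the glob pattern is an assumption and should be adjusted to whatever filenames data_preprocess.py actually emits):

```python
import glob
import os
import shutil

# Move the generated record shards into DATA_DIR so no other files sit next to them.
src_dir = "./data/QALD7"
data_dir = os.path.join(src_dir, "DATA_DIR")
os.makedirs(data_dir, exist_ok=True)
for path in glob.glob(os.path.join(src_dir, "*tfrecord*")):
    shutil.move(path, data_dir)
print("Files now in DATA_DIR:", os.listdir(data_dir))
```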
In our experiment, we use the command below:
python transformer_main.py --data_dir=./data/QALD7/DATA_DIR --model_dir=./data/QALD7/model_QALD7 --vocab_file=./data/QALD7/vocab.en_sparql --param_set=big
- For more instructions, please refer to the official model.
- NOTE: Since the current official TensorFlow model still has a potential issue, we strongly recommend training on CPU or checking the CUDA environment, in case memory runs out and the threads get killed.
- Training Time
- Loss
This shows the cross-entropy loss during training:
- GERBIL Evaluation
The table shows the evaluation results for the QALD-7 benchmark:
GERBIL is an online platform for question-answering F1-score evaluation based on the confusion matrix, and this table shows the answering accuracy of the model's output.
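As a reminder of how the F1 score relates to the confusion-matrix counts (a minimal illustration, not GERBIL's own implementation):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 computed from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf1(tp=12, fp=3, fn=5))  # toy counts, for illustration only
```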
For better comparison, we have a blog post about the QALD evaluation results of the original NSpM model.
I am so glad to have had this scientific research experience with the excellent researchers from Google and DBpedia. I gained more profound insights during our research in natural language processing, knowledge graphs, and deep learning. It deepened my understanding of scientific research and ignited an unquenchable curiosity about artificial intelligence.
So here, I want to talk about this project. We use long natural language text to generate question templates for reasoning on the knowledge graph.
We also tried to make the best use of the state-of-the-art attention-based Transformer as the learner that maps natural language questions to SPARQL queries over the graph-structured database.
What's more, we want to make the system a never-ending learner, as in Never-Ending Learning for Open-Domain Question Answering over Knowledge Bases[5], to sustain the long loop of accumulating knowledge. I believe this is a crucial key to artificial general intelligence.
In the beginning, that is, in the initial proposal, we wanted to use graph embeddings for the SQuAD machine reading comprehension tasks with reinforcement learning, but we gradually realized that the performance of the neural SPARQL machine is highly dependent on the training data, which indicates the crucial necessity of automating template generation from long contextual passages. Wikipedia is a wonderful source of plenty of such articles relevant to DBpedia RDF triples, so we decided to evolve an intelligent neural SPARQL machine with automated template generation, comparison, and accumulation, to approach a never-ending-learning intelligent agent.
Of course, during the coding I encountered many difficulties, such as running the benchmark evaluations and some tough impediments, but as I now think about these problems, I believe they gave me thorough growth. I got to learn more and more about the newest tools in the industry and became more familiar with international coding standards, which opened my door to a bigger world. For example, in the part that calculates vector similarity to match existing templates, we first used word mover's distance with GloVe vectors via gensim, but we found it too heavy and too slow; then we used spaCy and found it much faster. Soon after, we found that the Universal Sentence Encoder is even better for this task, which was a huge step forward in our development.
Another thing I still remember is the paraphrasing of the predicates; we used to think of loading all the phrases into RAM and doing the matching there. I still remember that the file was huge, more than 17.6 GB. Then I found that WordNet from NLTK can accomplish this paraphrasing task without such a huge cost, which is a smart solution.
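For illustration, a lightweight lookup along these lines can be done with NLTK's WordNet interface (a sketch, not the exact implementation used in the project):

```python
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def paraphrases(predicate):
    """Collect synonym lemmas of a predicate word from WordNet."""
    return sorted({lemma.name().replace("_", " ")
                   for synset in wn.synsets(predicate)
                   for lemma in synset.lemmas()})

print(paraphrases("spouse"))  # e.g. includes 'partner', 'mate', 'better half'
```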
We hope to keep working on making the question generation even better, including ASK queries, queries that require filters (how many, how much, etc.), and complex queries as well, because we believe this can give the neural SPARQL machines even better performance.
As an extension after this project, I continued to work on improving the answering accuracy of the model by making thorough use of the knowledge vector space to infer the answer. Different from other neural methodologies that concentrate on optimizing SPARQL query generation to get the correct entity, I took a step further, leveraging the embeddings of the triples relevant to the question to infer its answer as a vector, with a reinforcement learning algorithm to optimize the vector distance in embedding space.
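To make the idea concrete, the core of such an approach is nearest-neighbour scoring in the embedding space; here is a minimal, illustrative sketch (the vectors and candidate entities are made up for the example, not real DBpedia embeddings):

```python
import numpy as np

# Toy knowledge-graph embeddings (illustrative values only).
entity_vecs = {
    "dbr_Michelle_Obama": np.array([0.9, 0.1, 0.0]),
    "dbr_Barack_Obama":   np.array([0.8, 0.2, 0.1]),
    "dbr_Chicago":        np.array([0.0, 0.9, 0.4]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rank_answers(inferred_vec, candidates):
    """Rank candidate entities by cosine similarity to the inferred answer vector."""
    return sorted(candidates, key=lambda e: cosine(inferred_vec, entity_vecs[e]), reverse=True)

# Suppose the model inferred this vector for "who is the spouse of dbr_Barack_Obama ?"
inferred = np.array([0.85, 0.15, 0.05])
print(rank_answers(inferred, list(entity_vecs)))
```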
[1] Daniel Cer et al. (2018) Universal Sentence Encoder
[2] Ashish Vaswani et al. (2017) Attention Is All You Need
[3] TensorFlow - Official Models: https://github.com/tensorflow/models
[4] Tommaso Soru et al. (2017) SPARQL as a Foreign Language
[5] Abdalghani Abujabal et al. (2018) Never-Ending Learning for Open-Domain Question Answering over Knowledge Bases
[6] Rajarshi Das et al. (2017) Question Answering on Knowledge Bases and Text using Universal Schema and Memory Networks
[7] Haitian Sun et al. (2018) Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text
[8] Svetlana Stenchikova et al. (2018) QASR: Spoken Question Answering Using Semantic Role Labeling