Use Kore.ai's Knowledge Graph Generator to automatically extract terms from FAQs, define the hierarchy between these terms, and associate the FAQs with the right terms.
Overview
Prerequisites
Configuration Steps
Run KnowledgeGraph Generator
Command Options
Option: language
Option: type
Output details
Using Synonym Generator
Graph analyzer
Troubleshooting
The Kore.ai KnowledgeGraph Generator reduces the effort of building an ontology in the Knowledge Collection section by automating the process.
The output generated by this engine can be directly imported into the KnowledgeCollection, and you can use the FAQs after training the KnowledgeCollection. However, synonyms must be added manually if you want them, since the engine does not support synonym extraction yet.
If you have curated your own stopwords and you pass a JSON or CSV export directly, the engine considers only those stopwords when generating the graph. If the input is a CSV file of the extraction type, or you have not modified the stopword collection, the engine uses its predefined set of stopwords.
- Python 3.6: The KnowledgeGraph Generator requires Python 3.6, which can be downloaded from https://www.python.org/downloads/
- Virtual Environment: It is advised to use a virtual environment instead of installing the requirements directly on the system. Follow the steps mentioned here to set one up: https://www.geeksforgeeks.org/creating-python-virtual-environment-windows-linux/
Configuring KnowledgeGraph Generator involves the following major steps:
- Step 1: Download the KnowledgeGraphGenerator from GitHub. Find the repository here: https://github.com/Koredotcom/KnowledgeGraphGenerator
- Step 2: Activate the virtual environment: Execute the following command, with the required changes, to activate the virtual environment
source virtual_environments_folder_location/virtualenv_name/bin/activate
Once the virtual environment is activated, you should see the virtual environment name at the start of every command prompt in the console.
- Step 3: Install the requirements for the project: Run the following command from the project root directory (KnowledgeGraphGenerator) to install the requirements
pip install -r requirements.txt
Run the pip list command to verify that all the requirements are installed.
For Windows operating systems:
- Windows 10 users should install the Windows 10 SDK.
- The operating system should be up to date for a seamless installation of the requirements. Some libraries, such as scipy (an internal dependency), need specific DLLs that are only available in the latest updates; skipping this step may lead to a lot of troubleshooting.
We verified installation with build version 1903.
- Step 4: Download the spaCy English model: Run the following command to download the model
python -m spacy download en
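As a quick sanity check after the steps above, you can verify that the packages you expect are importable with a short script. This is a minimal sketch; the package names in the example are illustrative, and you should substitute the names from your requirements.txt.

```python
import importlib.util

def missing_packages(required):
    """Return the names in `required` that are not importable in this environment."""
    return [pkg for pkg in required if importlib.util.find_spec(pkg) is None]

# Illustrative check with stdlib modules; replace with names from requirements.txt.
print(missing_packages(["csv", "json"]))  # → []
```

Run this inside the activated virtual environment so it inspects the same interpreter that will run the generator.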
python KnowledgeGraphGenerator.py --file_path 'INPUT_FILE_PATH' --type 'INPUT_TYPE' --language 'LANGUAGE_CODE' --v true
On Windows, omit the quotes around the command arguments:
python KnowledgeGraphGenerator.py --file_path INPUT_FILE_PATH --type INPUT_TYPE --language LANGUAGE_CODE --v true
The command that generates the KnowledgeGraph takes the options described below.
Note: Options listed as mandatory must be given with the command; for optional ones, the default values mentioned are used.
Option name | Description | Mandatory / Optional | Default value
--file_path | Input file location | Mandatory |
--language | The language code for the language in which the input data exists | Optional | en (English)
--type | The type of input file | Mandatory |
--v | Runs the command in verbose mode to show intermediate progress steps | Optional | false
The following languages are currently supported; others will be added incrementally. Create an issue on the repository if you need a language on priority.
Language | Language Code |
English | en |
Type specifies the type of input file. Currently, three formats are supported:
- json_export: You can generate input in this format from the kore.ai bot builder tool by exporting the KnowledgeCollection and selecting the JSON format for export.
- csv_export: You can generate input in this format from the kore.ai bot builder tool by exporting the KnowledgeCollection and selecting the CSV format for export.
- csv: This format supports input from KnowledgeExtraction. To build an input file in this format, copy all the questions into the first column and their respective answers into the second column, and save the file as .csv.
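For the csv type, the file can be produced with a few lines of Python's csv module. This is a minimal sketch; the FAQ pairs and the file name faq_input.csv are hypothetical placeholders.

```python
import csv

# Hypothetical FAQ pairs; replace with your own questions and answers.
faqs = [
    ("How do I reset my password?", "Use the Forgot Password link on the login page."),
    ("What are the support hours?", "Support is available 24x7 over chat."),
]

# Questions go in the first column, answers in the second; no header row.
with open("faq_input.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(faqs)
```

Using csv.writer rather than string concatenation ensures commas and quotes inside questions or answers are escaped correctly.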
The generated output JSON file can be found under the project root directory with the file name ao_output.json.
The output JSON file can be directly imported into the KnowledgeCollection of a bot in JSON format.
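If you want to inspect the export before importing it, it is an ordinary JSON file and can be loaded with the standard library. This is a minimal sketch; the exact schema of the export is not documented here, so the snippet only loads the file as-is.

```python
import json

def load_graph_export(path="ao_output.json"):
    """Load the generated graph export from the project root as plain JSON."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```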
The synonym generator is an add-on tool that helps the bot developer derive synonyms for the nodes in the KG. To use it, follow these basic steps:
- Step 1: Run KG generator and create an ontology for the given questions.
- Step 2: Run synonym generator, giving this ontology as input.
- Step 3: Take the synonyms file that is generated, edit it as required, and re-run KG generator with it to create the final ontology.
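The three steps above can be sketched as command invocations built from the commands documented in this guide. The input file name in the example is a placeholder; the script names and flags are the ones this guide documents.

```python
# Sketch of the three-step workflow; input_path is a placeholder.
def build_commands(input_path, input_type):
    # Step 1: generate the ontology from the questions.
    kg_cmd = ["python", "KnowledgeGraphGenerator.py",
              "--file_path", input_path, "--type", input_type]
    # Step 2: run the synonym generator on the generated export
    # (run from the synonym_generator folder, per the steps in this guide).
    syn_cmd = ["python", "synonym_generator.py", "--file_path", "ao_output.json"]
    # Step 3: re-run the generator with the (edited) synonyms file.
    rerun_cmd = kg_cmd + ["--synonyms_file_path", "generated_synonyms.csv"]
    return kg_cmd, syn_cmd, rerun_cmd

for cmd in build_commands("faq_input.csv", "csv"):
    print(" ".join(cmd))  # or pass each list to subprocess.run(cmd, check=True)
```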
The synonym generator has the following modes of operation:
- Using the answers from the knowledge graph to generate synonyms.
- Using a given PDF document or ZIP of PDF documents to generate synonyms.
- Using a pre-trained word2vec model to generate synonyms.
If there are a substantial number of voluminous answers in the KG, the first option will give a closed-domain set of synonyms. In the event that the KG is smaller or does not have enough content, one can use a collection of PDF documents to provide the corpus. The third option gives a way to generate open-domain synonyms as it can use any pretrained word2vec model.
- Step 1: Download the GoogleNews model from https://github.com/mmihaltz/word2vec-GoogleNews-vectors. Alternatively, any other word2vec model can also be used.
- Step 2: Change to the synonym_generator folder.
- Step 3: Run synonym generator using the following command:
python synonym_generator.py --file_path 'INPUT_FILE_PATH' --training_data_path 'TRAINING_FILE_PATH' --training_data_type 'INPUT_TYPE'
- Step 4: The output is saved to a file called generated_synonyms.csv in the same directory.
These parameters take the following values:
Option name | Description | Mandatory / Optional | Default value
--file_path | Input file location | Mandatory |
--training_data_path | The path to the training data or a pretrained word2vec model. This can be a PDF file, a ZIP containing PDFs, or the path to a pretrained model. | Optional | None
--training_data_type | The type of training file | Optional |
The following is an example of how the synonym generator is to be used:
python synonym_generator.py --file_path ao_output.json
Use the following command to run the KnowledgeGraph generator with the generated synonyms file.
python KnowledgeGraphGenerator.py --file_path 'INPUT_FILE_PATH' --type 'INPUT_TYPE' --language 'LANGUAGE_CODE' --v true --synonyms_file_path 'path_to_synonyms_file/synonyms_file.csv'
The generated graph export will contain both the graph-level synonyms from the input file (if present) and the synonyms from the synonyms file, under the synonyms section.
The graph generated by the tool may not always meet human expectations. After the graph is generated, it is analyzed by our analyzer tool, which helps identify errors that arise from the input data as given. Two issues can occur while preparing the graph.
The report can be found under the project root directory with the file name analyzer_report.csv. Each new report is appended to the file, like a log, so the latest report is always the last one in the file, with the latest timestamp. The developer can clear or delete the file to remove previous reports.
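Since reports accumulate log-style, clearing the file between runs keeps only the report you care about. A minimal sketch, assuming the report file sits in the current working directory:

```python
import os

REPORT = "analyzer_report.csv"

# Truncate the appended-to report file to drop old reports,
# keeping the file in place for the analyzer's next run.
if os.path.exists(REPORT):
    open(REPORT, "w", encoding="utf-8").close()
```

Deleting the file with os.remove(REPORT) works equally well, as the analyzer recreates it.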
First, alternate questions that belong to primary questions in the input export are mapped to the same questions again, to preserve the question-to-alternate-question relation given previously. This may lead to lower path coverage for alternate questions, since the terms built for a primary question are also part of its alternate questions.
Second, questions that are very dissimilar from the rest of the corpus may not get grouped; these questions are placed at the root node. It is the bot developer's responsibility to group them as desired.
The output from the analyzer is a CSV file that shows each error type and the questions under that error. The path to reach each question is also included in the CSV.
If installing the requirements on Windows fails with a compiler error such as the following, the Windows 10 SDK (see Prerequisites) is likely missing, since basetsd.h is a Windows SDK header:

C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Python27\include -IC:\Python27\PC /Tchello.c /Fobuild\temp.win-amd64-2.7\Release\hello.obj
hello.c
C:\Python27\include\pyconfig.h(227) : fatal error C1083: Cannot open include file: 'basetsd.h': No such file or directory
error: command '"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\BIN\amd64\cl.exe"' failed with exit status 2