Source code related to the AAAI22 paper:
Unifying Knowledge Base Completion with PU Learning to Mitigate the Observation Bias. Jonas Schouterden, Jessa Bekker, Jesse Davis, Hendrik Blockeel.
- Abstract
- Contents of this repository
- Installation
- Notebooks
- Running the experiments
- Generating the tables in the paper
- Generating the images in the paper
- Preparation of the "ideal" Yago3-10 KB
## Abstract

The following is the abstract of our paper:

Methods for Knowledge Base Completion (KBC) reason about a knowledge base (KB) in order to derive new facts that should be included in the KB. This is challenging for two reasons. First, KBs only contain positive examples. This complicates model evaluation, which needs both positive and negative examples. Second, those facts that were selected to be included in the knowledge base are most likely not an i.i.d. sample of the true facts, due to the way knowledge bases are constructed. In this paper, we focus on rule-based approaches, which traditionally address the first challenge by making assumptions that enable identifying negative examples, which in turn makes it possible to compute a rule’s confidence or precision. However, they largely ignore the second challenge, which means that their estimates of a rule’s confidence can be biased. This paper approaches rule-based KBC through the lens of PU learning, which can cope with both challenges. We make three contributions. (1) We provide a unifying view that formalizes the relationship between multiple existing confidence measures based on (i) what assumptions they make about the selection mechanism and (ii) how their accuracy depends on it. (2) We introduce two new confidence measures that can mitigate known biases by using propensity scores that quantify how likely a fact is to be included in the KB. (3) We show through theoretical and empirical analysis that taking the bias into account improves the confidence estimates, even when the propensity scores are not known exactly.
## Contents of this repository

- artificial_bias_experiments: Python source code root module for running the experiments and generating images about those experiments.
- dask_utils: Python code for using dask when running the experiments.
- data/yago3_10: The yago3-10 dataset. This data directory is also used as the root for everything generated when running the experiments.
- external/AMIE3: External dependency: the AMIE-jar. See also the AMIE3 repository.
- images: Root directory for all images.
- kbc_pul: Python source code root module containing the core of this repository: everything related to rules, knowledge bases, confidence metrics and selection mechanisms.
- notebooks: Jupyter notebooks illustrating how to perform some common tasks.
- notes: Markdown files describing this repository.
- paper: PDF of the AAAI paper and its appendices.
- paper_latex_tables: Tables used in the paper, in LaTeX.
- amie_dir.json: Settings file used by our AMIE Python wrapper pointing to the AMIE jar.
- LICENSE
- README
## Installation

Create a fresh Python 3 environment (3. or higher) and install the following packages (for example via pip, as sketched after this list):
- jupyter: for the notebooks.
- pandas: for representing the KB.
- problog: used for its parsing functionality, i.e. parsing Prolog clauses from their string representation.
- pylo2: see below.
- matplotlib: for plotting.
- seaborn: for plotting.
- tqdm: for pretty progress bars.
- unidecode: used when cleaning the data.
- tabulate: for pretty table printouts.
- dask and distributed (dask.delayed and dask.distributed): for running the experiments with Dask.
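These can be installed with pip, for example as follows (a sketch, not a pinned or tested requirements file; pylo2 is installed separately as described below):

```bash
# Sketch: install the dependencies listed above; pylo2 is installed separately (see below).
pip install jupyter pandas problog matplotlib seaborn tqdm unidecode tabulate "dask[distributed]"
```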
We use data structures from Pylo2 to represent rules as Prolog clauses.
More specifically, Pylo2 data structures from src/pylo/language/lp
are often used.
To install Pylo2 in your Python environment, first clone it:

```bash
git clone git@github.com:sebdumancic/pylo2.git
cd pylo2
```
Note that Pylo2 has a lot of functionality we do not need. In particular, we do not use Pylo2's bindings to Prolog engines, so those bindings do not have to be built. To install Pylo2 without them, modify its setup.py by adding, right before the line:
print(f"Building:\n\tGNU:{build_gnu}\n\tXSB:{build_xsb}\n\tSWIPL:{build_swi}")
the following lines:
```python
build_gnu = None
build_xsb = None
build_swi = None
```
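After this edit, the relevant part of setup.py should look roughly like this (the print line itself is unchanged):

```python
# Added: disable building the Prolog-engine bindings (GNU Prolog, XSB, SWI-Prolog).
build_gnu = None
build_xsb = None
build_swi = None
print(f"Building:\n\tGNU:{build_gnu}\n\tXSB:{build_xsb}\n\tSWIPL:{build_swi}")
```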
Then, install Pylo2 in the current environment using:

```bash
python setup.py install
```
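As a quick sanity check that Pylo2 is installed and that the data structures from pylo.language.lp used throughout this repository can be imported, something like the following should work (the c_pred/c_var helpers are assumed here; consult the Pylo2 README if the API differs):

```python
# Sanity check for the Pylo2 installation (c_pred/c_var are assumed API, see above).
from pylo.language.lp import c_pred, c_var

parent = c_pred("parent", 2)   # binary predicate parent/2
x, y = c_var("X"), c_var("Y")
print(parent(x, y))            # should print an atom such as parent(X,Y)
```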
## Notebooks

The following notebooks are provided:
- How to run AMIE from Python
- The yago3-10 dataset: cleaning & exploration
- How to apply a rule to a KB
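The notebooks live under ./notebooks and can be opened with a standard Jupyter server, for example:

```bash
# From the repository root: open the notebooks in Jupyter.
jupyter notebook notebooks/
```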
## Running the experiments

For a description of how to run the experiments, see here.
## Generating the tables in the paper

For instructions on how to generate the tables in the paper from the results, see here.
## Generating the images in the paper

Instructions on how to generate the images in the paper can be found here.
## Preparation of the "ideal" Yago3-10 KB

In the paper, the experiments are run on a cleaned version of the yago3-10 dataset. The cleaning removes unicode characters that might be incompatible with older Prolog engines and is done using ./notebooks/yago3_10/data_exploration_and_preparation/yago3_10_data_cleaning.ipynb.

The original data was obtained using AmpliGraph, but can also be found under ./data/yago3_10/original. The cleaned version can be found under ./data/yago3_10/cleaned_csv.
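Since the KB is represented with pandas (see the dependency list above), the cleaned triples can be loaded directly into a DataFrame. The file name, separator and column names below are assumptions about the layout of ./data/yago3_10/cleaned_csv; adjust them to the files actually present there:

```python
# Sketch: load cleaned Yago3-10 triples with pandas.
# "train.csv", the comma separator and the column names are assumptions;
# check ./data/yago3_10/cleaned_csv for the actual files and format.
import pandas as pd

triples = pd.read_csv(
    "data/yago3_10/cleaned_csv/train.csv",
    names=["subject", "predicate", "object"],
)
print(triples.head())
print(f"{len(triples)} triples loaded")
```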