A largely incomplete but hopefully useful list of links to datasets for relational learning and inductive logic programming. No guarantees on availability.
A list of datasets per source.
-
The CVUT Prague Relational Dataset Repository: a large collection of ILP datasets, stored as MariaDB (SQL) databases.
Motl, Jan, and Oliver Schulte. "The CTU Prague Relational Learning Repository." arXiv preprint arXiv:1511.03086 (2015).
-
ACE data mining system data sets: nine ILP datasets in Quinlan's FOIL format, together with scripts to convert them into ACE format (see README.txt in the ZIP). These were used in:
Jan Struyf, Jesse Davis and David Page, An efficient approximation to lookahead in relational learners. In J. Fürnkranz, T. Scheffer and M. Spiliopoulou, editors, Machine Learning: ECML 2006, 17th European Conference on Machine Learning, Proceedings. Lecture Notes in Artificial Intelligence, volume 4212, pages 775-782, Springer, 2006.
- Muta188
- Muta230
- Financial
- Sisyphus A
- Sisyphus B
- UWCSE
- Yeast
- Carcinogenesis
- Bongard
-
- Animals
- CiteSeer
- Cora
- Epinions
- IMDB
- Kinships
- Nations
- Protein Interaction
- Radish Robot Mapping - Tutorial
- UMLS
- UW-CSE
- WebKB
-
ILP Datasets in SQL format:
- Carcinogenesis
- Financial
- Trains
- Mutagenesis
- Imdb
- IMDB Top/Bottom Movies
-
Stephen Muggleton's data set directory:
- Trains
- alzheimers
- carcinogenesis
- chess
- e_coli
- mesh
- more_chess
- mutagenesis
- proteins
- satellite
- suramin
- utube
-
Sriraam's StARLinG Lab data sets:
- Toy Father
- Toy Cancer
- IMDB
- Cora
- UW-CSE
- WebKB
- CiteSeer
- Boston Housing
- Drug-Drug Interactions
-
- alzheimers
- carcinogenesis
- dsstox
- metabolism
- mutagenesis
- pyrimidines
- trains
-
BayesBase: Datasets posted in 3 formats: (i) as a MySQL dump for a relational schema, (ii) in the WILL format, similar to the Aleph ILP input format, (iii) in the .db format of Markov Logic Networks as implemented in the Alchemy system.
- unielwin
- Mutagenesis_std
- MovieLens_std
- MovieLens_TQ(1M)
- Financial_std
- Mondial_std
- UW_std
- imdb_MovieLens
- Hepatitis_std
- Cont_PLG_TM (Continuous database)
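
For readers unfamiliar with the .db evidence format mentioned above, the following is a minimal sketch of how such a file could be read in Python. It assumes a simplified subset of the format: one ground atom per line, an optional leading `!` marking a false atom, and `//` comments. The function name `parse_db` and the sample atoms are hypothetical and are not taken from BayesBase or Alchemy; quoted constants containing commas or `//` are not handled.

```python
import re
from collections import defaultdict

# Matches one ground atom such as `advisedBy(Bob, Ann)`; an optional
# leading `!` marks the atom as false.
ATOM = re.compile(r"^(!?)\s*(\w+)\((.*)\)$")


def parse_db(text):
    """Group the ground atoms in `text` by predicate name.

    Returns a dict mapping each predicate to a list of
    (constants, truth_value) pairs, in file order.
    """
    facts = defaultdict(list)
    for raw in text.splitlines():
        # Drop `//` comments and surrounding whitespace.
        line = raw.split("//", 1)[0].strip()
        if not line:
            continue
        match = ATOM.match(line)
        if match is None:
            raise ValueError(f"not a ground atom: {raw!r}")
        negated, predicate, args = match.groups()
        constants = tuple(arg.strip() for arg in args.split(","))
        facts[predicate].append((constants, not negated))
    return dict(facts)


# Hypothetical evidence, not taken from any dataset listed here.
SAMPLE = """
// toy university domain
professor(Ann)
student(Bob)
advisedBy(Bob, Ann)
!advisedBy(Ann, Bob)
"""

facts = parse_db(SAMPLE)
for predicate, rows in facts.items():
    print(predicate, rows)
```

Grouping by predicate mirrors the one-table-per-relation layout of the SQL dumps, so the same dataset can be compared across formats.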
-
LINQS - Statistical Relational Learning Group
- Social Spammer
- Drug-Target Interaction
- Stance Classification
- CiteSeer for Document Classification
- CiteSeer for Entity Resolution
- Cora
- ArXiv
- PubMed Diabetes
- WebKB
- Terrorists
- Terrorist Attacks
-
kLog datasets as Prolog files:
- WebKB: Originally developed by M. Craven et al. (1998). The version available here is a direct conversion to Prolog of the data available at the Alchemy website.
- Internet Movie Database: Data extracted from this database has been used in a number of relational learning papers. The version available here was downloaded from the IMDb website, converted into SQL using the procedure described in http://imdbpy.sourceforge.net/docs/README.sqldb.txt, and finally a subset of the tuples was converted into a Prolog file.
- UW-CSE: The data set originally developed at the University of Washington for demonstrating the capabilities of Markov logic networks. The version available here is a direct conversion to Prolog of the data available at the Alchemy website.
- Bursi: This data set contains 4,337 molecules labeled according to mutagenicity (2,401 mutagens and 1,936 nonmutagens). Originally developed by Kazius et al. (2005), it has been used in a number of machine learning papers, especially those studying graph kernels.
- Biodegradability: This is an older data set of chemical structures containing 328 compounds labeled by their half-life for aerobic aqueous biodegradation (a regression task).
-
MLnet
Among others, some ILP datasets. Note: the link points to the Internet Archive's Wayback Machine.
- Kaggle
- KDnuggets
- Microsoft Research Open Data
- Registry of Open Data on AWS
- Awesome Public Datasets Collection
- San Francisco open data website
- Stanford Large Network Dataset Collection (SNAP)
- metapath2vec: Scalable Representation Learning for Heterogeneous Networks
- Benchmark data sets for graph kernels