Algorithm

WikiFactCrawl accepts any Wikidata topic and returns a list of facts about related topics. The facts are of form subject, property, object:

For example:

Joan of England place of birth: Château d'Angers

WikiFactCrawl uses a relevancy heuristic and a breadth-first search of the public WikiData knowledge graph. It locates related topics of interest and collects facts about those topics.

This source code is for the Java library.

The library is currently available online as a rest endpoint.: http://molten-method-309519.uc.r.appspot.com/q?wikiDataID=wd:Q42&maxDepth=2 where *Q42 is the WikiData entity id. To find an id for any topic, visit wikidata.org and search for your topic. The entity id is shown at the top of the page *3 is the maximum depth of the crawl. Please note the expansion of searching is nearly exponential and this number does not need to be high to return lots of results.

Algorithm

Parameters:

root:= WikiData unique identifier for any topic
maxDepth:= how many links away from root to crawl

Pseudo-Code:

Queue uncrawled:= []
list crawled:= []
list facts:= []
uncrawled.enqueue(root)
while uncrawled has items
    item = uncrawled.dequeue()
    if distance(item, root) < maxDepth
        uncrawled.enqueue(synonyms of item not in crawled)
        uncrawled.enqueue(relatives of item not in crawled)
        uncrawled.enqueue(lists inside item not in crawled)
    crawled.add(i)
    facts.add(all facts of item)
loop
print facts

Output: The list of facts about topics related to root

Technical details

The four tasks of crawling topic are run in parallel *Synonyms *Relatives *Lists *Facts

The crawl follows through links which are subproperties of certain parent classes. The links can be any number of levels of subproperty below the parent class. Each task is accomplished by querying WikiData's SparQL endpoint. The specific queries and parent property classes can be found in the code. The exact heuristics used to crawl will evolve with future versions.

Written by Jae Muzzin

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
src/main/java/info/clock/factcrawl		src/main/java/info/clock/factcrawl
nbactions.xml		nbactions.xml
pom.xml		pom.xml
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Algorithm

Technical details

About

Releases

Packages

Languages

jaemuzzin/WikiFactCrawl

Folders and files

Latest commit

History

Repository files navigation

Algorithm

Technical details

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages