
Configuring Entity Checking

Michael Röder edited this page Oct 11, 2024 · 2 revisions

GERBIL checks whether an entity represented by a URI exists. It also uses a cache so that an entity that has already been checked does not need to be checked again.

There are two types of entity checkers, index-based and HTTP-based, and two types of cache systems, in-memory and file-based.

Adding, removing, and modifying an entity checker, as well as configuring the cache, is done in the entity_checking.properties file, which is located in the src/main/properties/ folder.

Index-based Entity Checker

The Index-based Entity Checker uses a pregenerated Lucene index, which the start.sh script automatically downloads and uses if specified. The provided index was created for DBpedia. You can create an index for a different domain using DBpediaEntityCheckIndexTool, but be aware that, for now, this requires changing the code.

To use the Index-based Entity Checker, add the following line to entity_checking.properties:

org.aksw.gerbil.dataset.check.IndexBasedEntityChecker.yourEntityCheckerName=${org.aksw.gerbil.DataPath}/indexes/YOUR_INDEX,http://example.org, http://fr.example.org

The first argument sets the directory in which the Lucene index is located; the following arguments are the domains the index contains.

For example, to use the pregenerated index:
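To illustrate how such a value breaks down, here is a minimal Python sketch. It is not GERBIL's actual parsing code; the function name `split_index_checker_value` is hypothetical, and trimming whitespace around the commas is an assumption made for this example.

```python
def split_index_checker_value(value):
    """Split 'index_dir,domain1,domain2,...' into (index_dir, [domains]).

    Hypothetical helper for illustration only; whitespace around the
    commas is trimmed, which is an assumption of this sketch.
    """
    parts = [part.strip() for part in value.split(",") if part.strip()]
    if not parts:
        raise ValueError("expected at least an index directory")
    return parts[0], parts[1:]


index_dir, domains = split_index_checker_value(
    "${org.aksw.gerbil.DataPath}/indexes/YOUR_INDEX,http://example.org, http://fr.example.org"
)
print(index_dir)
print(domains)
```

The first element is the index directory (with the placeholder left unresolved), and every remaining element is one of the domains covered by the index.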

org.aksw.gerbil.dataset.check.IndexBasedEntityChecker.dbpedia=${org.aksw.gerbil.DataPath}/indexes/dbpedia_check,http://dbpedia.org,http://de.dbpedia.org,http://fr.dbpedia.org

This index was created from the English, French, and German DBpedia and is located under gerbil_data/indexes/dbpedia_check.

⚠️ The indexes we provide might be outdated. However, you can create a new index similar to the sameAs index if you need an up-to-date version.

HTTP-based Entity Checker

An alternative, which takes more time than the Index-based Entity Checker, is the HTTP-based Entity Checker. It checks whether an entity exists by sending an HTTP request and is enabled by adding lines such as the following:

org.aksw.gerbil.dataset.check.HttpBasedEntityChecker.namespace=http://example.org/res/
org.aksw.gerbil.dataset.check.HttpBasedEntityChecker.namespace=http://de.example.org/res/

This entity checker handles every URI that starts with one of the configured namespaces (e.g. http://example.org/res/) and tests whether the URI exists by sending an HTTP request.
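The behaviour described above can be sketched roughly as follows. This is an illustration in Python, not GERBIL's implementation: the function names, the use of a HEAD request, the 5-second timeout, and the decision to treat timeouts and network errors as "does not exist" are all assumptions of this sketch.

```python
import urllib.error
import urllib.request

# Namespaces as configured in the example above.
NAMESPACES = ["http://example.org/res/", "http://de.example.org/res/"]


def http_status(uri, timeout=5.0):
    """Send a HEAD request; return the status code, or None on network failure."""
    request = urllib.request.Request(uri, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 404 for a missing resource
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, ...


def check_entity(uri, namespaces=NAMESPACES, fetch_status=http_status):
    """Return True/False for URIs in a configured namespace, None otherwise."""
    if not any(uri.startswith(namespace) for namespace in namespaces):
        return None  # this checker is not responsible for the URI
    status = fetch_status(uri)
    # Assumption: 2xx/3xx means the entity exists; errors count as missing.
    return status is not None and 200 <= status < 400


# Usage with a stubbed status lookup, so the example runs without network access.
fake_statuses = {"http://example.org/res/Berlin": 200}
print(check_entity("http://example.org/res/Berlin", fetch_status=fake_statuses.get))
print(check_entity("http://example.org/res/Missing", fetch_status=fake_statuses.get))
print(check_entity("http://other.org/Berlin", fetch_status=fake_statuses.get))
```

Passing the status lookup as a parameter keeps the namespace logic testable without a live endpoint; with the default `http_status`, each covered URI costs one HTTP round trip.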

⚠️ Please note that this checker executes an HTTP request for each entity with the given namespace(s). Especially if requests time out, the check will take quite a lot of time.

ℹ️ This checker should be used in combination with the file-based cache to invest the time for checking only once.

ℹ️ The HTTP-based Entity Checker can be turned off easily by commenting out all namespace statements in the properties file.

Configuring the Cache

To adjust the cache to your system, edit entity_checking.properties.

If you want to use a persistent cache, add the following property:

org.aksw.gerbil.dataset.check.EntityCheckerManagerImpl.usePersistentCache=true

In-Memory Cache

The In-Memory Cache stores the entity checking results in memory. To restrict the memory used by the cache, you can set:

org.aksw.gerbil.dataset.check.InMemoryCachingEntityCheckerManager.cacheSize=1000000

This sets the maximum number of entities stored in the cache.

Furthermore, since URIs are added, changed, and removed over time, you can specify how long an entity should be cached:

org.aksw.gerbil.dataset.check.InMemoryCachingEntityCheckerManager.cacheDuration=2592000000

The duration is given in milliseconds; the value above corresponds to 30 days.
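How a size limit and a duration interact can be shown with a small Python sketch. It is not the code of InMemoryCachingEntityCheckerManager: the class name `InMemoryCheckCache`, its methods, and the policy of evicting the oldest entry when the cache is full are assumptions made for illustration.

```python
import time
from collections import OrderedDict


class InMemoryCheckCache:
    """Illustrative cache with a maximum size and a maximum entry age."""

    def __init__(self, cache_size=1_000_000, cache_duration_ms=2_592_000_000,
                 clock=time.time):
        self.cache_size = cache_size
        self.cache_duration_ms = cache_duration_ms
        self.clock = clock  # injectable so that expiry can be tested
        self.entries = OrderedDict()  # uri -> (exists, stored_at_ms)

    def _now_ms(self):
        return int(self.clock() * 1000)

    def get(self, uri):
        """Return the cached result, or None if unknown or expired."""
        entry = self.entries.get(uri)
        if entry is None:
            return None
        exists, stored_at_ms = entry
        if self._now_ms() - stored_at_ms > self.cache_duration_ms:
            del self.entries[uri]  # too old, must be checked again
            return None
        return exists

    def put(self, uri, exists):
        """Store a result and evict the oldest entries beyond cache_size."""
        self.entries.pop(uri, None)
        self.entries[uri] = (exists, self._now_ms())
        while len(self.entries) > self.cache_size:
            self.entries.popitem(last=False)  # assumption: drop oldest first


cache = InMemoryCheckCache(cache_size=2)
cache.put("http://example.org/res/A", True)
cache.put("http://example.org/res/B", False)
cache.put("http://example.org/res/C", True)  # evicts A in this sketch
print(cache.get("http://example.org/res/A"))
print(cache.get("http://example.org/res/B"))
```

A larger cacheSize trades memory for fewer repeated checks, while a shorter cacheDuration makes the cache notice changed or removed URIs sooner.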

File-based Cache

The file-based cache stores the entity checking results in a file. By default, the file is located at gerbil_data/cache/entityCheck.cache; this can be changed using the following property:

org.aksw.gerbil.dataset.check.FileBasedCachingEntityCheckerManager.cacheFile=${org.aksw.gerbil.CachePath}/entityCheck.cache

Since URIs are added, changed, and removed over time, you can specify how long an entity should be cached:

org.aksw.gerbil.dataset.check.FileBasedCachingEntityCheckerManager.cacheDuration=2592000000

The duration is given in milliseconds; the value above corresponds to 30 days.
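Because the cacheDuration values are plain millisecond counts, it can help to convert them to and from days when choosing a setting. The following Python snippet is only a convenience sketch; the helper name `days_to_ms` is hypothetical and not part of GERBIL.

```python
from datetime import timedelta


def days_to_ms(days):
    """Convert a number of days into milliseconds for a cacheDuration value."""
    return int(timedelta(days=days).total_seconds() * 1000)


# The value used in the examples on this page.
configured = timedelta(milliseconds=2592000000)
print(configured.days)  # 30

# A one-week duration expressed in milliseconds.
print(days_to_ms(7))  # 604800000
```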