Configuring Entity Checking
GERBIL checks whether an entity represented by a URI actually exists. It also uses a cache so that an entity that has already been tested does not need to be tested again.
There are two types of entity checker (index-based and HTTP-based) and two types of cache (in-memory and file-based).
Adding, removing, and modifying entity checkers, as well as configuring the cache, is done in the entity_checking.properties file, which is located in the src/main/properties/ folder.
The Index-based Entity Checker uses a pregenerated Lucene index, which the start.sh script will automatically download and use if specified.
The index is created for DBpedia.
You can create an index for a different domain using DBpediaEntityCheckIndexTool, but be aware that, for now, this requires changing the code.
To use the Index-based Entity Checker, add the following to entity_checking.properties:
org.aksw.gerbil.dataset.check.IndexBasedEntityChecker.yourEntityCheckerName=${org.aksw.gerbil.DataPath}/indexes/YOUR_INDEX,http://example.org,http://fr.example.org
The first argument sets the directory in which the Lucene index is located; the following arguments are the domains the index contains.
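To make the argument layout concrete, here is a minimal Java sketch that splits such a property value into the index directory and the list of domains. The class IndexCheckerConfig and its parse method are hypothetical names made up for illustration; this is not GERBIL's own parsing code. Trimming whitespace around the commas is an assumption of the sketch, and it does not resolve placeholders such as ${org.aksw.gerbil.DataPath}.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; this is NOT GERBIL's own parsing code.
class IndexCheckerConfig {
    final String indexDirectory;
    final List<String> namespaces;

    IndexCheckerConfig(String indexDirectory, List<String> namespaces) {
        this.indexDirectory = indexDirectory;
        this.namespaces = namespaces;
    }

    // Splits "directory,domain1,domain2,..." into its two logical parts.
    static IndexCheckerConfig parse(String propertyValue) {
        String[] parts = propertyValue.split(",");
        List<String> namespaces = new ArrayList<>();
        // Everything after the first argument is treated as a domain.
        for (int i = 1; i < parts.length; i++) {
            String namespace = parts[i].trim(); // assumption: surrounding whitespace is ignored
            if (!namespace.isEmpty()) {
                namespaces.add(namespace);
            }
        }
        String directory = parts[0].trim();
        if (directory.isEmpty() || namespaces.isEmpty()) {
            throw new IllegalArgumentException(
                    "Expected an index directory followed by at least one domain");
        }
        return new IndexCheckerConfig(directory, namespaces);
    }
}
```

Calling IndexCheckerConfig.parse with a value of the form shown above yields the directory as indexDirectory and the remaining comma-separated entries as namespaces.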
For example, to use the pregenerated index:
org.aksw.gerbil.dataset.check.IndexBasedEntityChecker.dbpedia=${org.aksw.gerbil.DataPath}/indexes/dbpedia_check,http://dbpedia.org,http://de.dbpedia.org,http://fr.dbpedia.org
The index was created using the English, French, and German DBpedia and is located under gerbil_data/indexes/dbpedia_check.
Another option, which takes more time than the Index-based Entity Checker, is the HTTP-based Entity Checker. It checks whether an entity exists using HTTP and can be enabled by adding the following:
org.aksw.gerbil.dataset.check.HttpBasedEntityChecker.namespace=http://example.org/res/
org.aksw.gerbil.dataset.check.HttpBasedEntityChecker.namespace=http://de.example.org/res/
For every URI that starts with one of those namespaces (e.g. http://example.org/res/), this entity checker tests against an HTTP endpoint whether the URI exists.
ℹ️ This checker should be used in combination with the file-based cache so that the time for checking an entity is spent only once.
ℹ️ The HTTP-based Entity Checker can be turned off easily by commenting out all namespace statements in the properties file.
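The two steps of the HTTP-based check (namespace matching, then an HTTP lookup) can be sketched in Java roughly as follows. The class NamespaceHttpCheck is a hypothetical illustration, not GERBIL's HttpBasedEntityChecker. Using a HEAD request, treating any 2xx or 3xx status as "exists", and the five-second timeouts are all assumptions of this sketch.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.List;

// Hypothetical sketch for illustration; this is NOT GERBIL's HttpBasedEntityChecker.
class NamespaceHttpCheck {
    private final List<String> namespaces;

    NamespaceHttpCheck(List<String> namespaces) {
        this.namespaces = namespaces;
    }

    // True if the URI starts with one of the configured namespaces.
    // With no namespaces configured, no URI is handled at all.
    boolean isResponsibleFor(String uri) {
        for (String namespace : namespaces) {
            if (uri.startsWith(namespace)) {
                return true;
            }
        }
        return false;
    }

    // Sends a HEAD request to the URI.
    // Assumption: any 2xx or 3xx status code means the entity exists.
    boolean existsViaHttp(String uri) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(uri).toURL().openConnection();
        try {
            connection.setRequestMethod("HEAD");
            connection.setConnectTimeout(5000); // assumed timeout in ms
            connection.setReadTimeout(5000);    // assumed timeout in ms
            int status = connection.getResponseCode();
            return status >= 200 && status < 400;
        } finally {
            connection.disconnect();
        }
    }
}
```

A caller would first ask isResponsibleFor(uri) and only then call existsViaHttp(uri). Because each lookup involves a network round trip, this also illustrates why caching the results is worthwhile.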
To adjust the cache to your system, edit entity_checking.properties.
If you want to use a persistent cache, add the following property:
org.aksw.gerbil.dataset.check.EntityCheckerManagerImpl.usePersistentCache=true
The in-memory cache stores the entity checking results in memory. To restrict the memory used by the cache, you can set:
org.aksw.gerbil.dataset.check.InMemoryCachingEntityCheckerManager.cacheSize=1000000
This sets the maximum number of entities stored in the cache.
Furthermore, because URIs are added, removed, and modified over time, you can specify how long an entity should be cached:
org.aksw.gerbil.dataset.check.InMemoryCachingEntityCheckerManager.cacheDuration=2592000000
The duration is given in milliseconds (2592000000 ms is 30 days).
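To illustrate how a size limit and a cache duration interact, here is a minimal Java sketch of a bounded cache whose entries expire. The class BoundedExpiringCache is hypothetical and is not GERBIL's InMemoryCachingEntityCheckerManager. Evicting the oldest inserted entry when the limit is exceeded is an assumption of the sketch, and the current time is passed in explicitly to keep the example deterministic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch for illustration; this is NOT GERBIL's in-memory cache.
class BoundedExpiringCache {
    // A cached check result together with the time it was stored.
    private static final class CachedResult {
        final boolean exists;
        final long storedAtMillis;

        CachedResult(boolean exists, long storedAtMillis) {
            this.exists = exists;
            this.storedAtMillis = storedAtMillis;
        }
    }

    private final long cacheDurationMillis;
    private final Map<String, CachedResult> results;

    BoundedExpiringCache(final int cacheSize, long cacheDurationMillis) {
        this.cacheDurationMillis = cacheDurationMillis;
        // Insertion-ordered map that drops its oldest entry once cacheSize is exceeded.
        this.results = new LinkedHashMap<String, CachedResult>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, CachedResult> eldest) {
                return size() > cacheSize;
            }
        };
    }

    void put(String uri, boolean exists, long nowMillis) {
        results.put(uri, new CachedResult(exists, nowMillis));
    }

    // Returns the cached result, or null if the URI is unknown or its entry has expired.
    Boolean get(String uri, long nowMillis) {
        CachedResult result = results.get(uri);
        if (result == null) {
            return null;
        }
        if (nowMillis - result.storedAtMillis > cacheDurationMillis) {
            results.remove(uri); // expired, so the entity has to be checked again
            return null;
        }
        return result.exists;
    }
}
```

In real use the caller would pass System.currentTimeMillis() as nowMillis. A null return value signals that the entity has to be checked again.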
The file-based cache stores the entity checking results in a file.
By default, this file is located at gerbil_data/cache/entityCheck.cache; however, this can be changed using the following property:
org.aksw.gerbil.dataset.check.FileBasedCachingEntityCheckerManager.cacheFile=${org.aksw.gerbil.CachePath}/entityCheck.cache
Because URIs are added, removed, and modified over time, you can specify how long an entity should be cached:
org.aksw.gerbil.dataset.check.FileBasedCachingEntityCheckerManager.cacheDuration=2592000000
The duration is given in milliseconds.
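Since the duration has to be written in milliseconds, a small Java helper can compute the value for a given number of days. The class CacheDurationExample is a hypothetical name used only for illustration; it relies solely on java.time.Duration from the standard library.

```java
import java.time.Duration;

// Hypothetical helper for illustration; not part of GERBIL.
class CacheDurationExample {
    // Converts a number of days into the millisecond value expected by cacheDuration.
    static long daysToMillis(long days) {
        return Duration.ofDays(days).toMillis();
    }

    public static void main(String[] args) {
        // 30 days corresponds to the value 2592000000 used above.
        System.out.println("cacheDuration=" + daysToMillis(30));
    }
}
```

For example, daysToMillis(7) gives the value for a one-week cache duration.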