Virtuoso wikidata import performance - virtuoso wikidata endpoints as part of snapquery wikidata mirror network #1326
Comments
Note that we (OpenLink Software [1], [2]) have also loaded Wikidata into a live Virtuoso instance, available at https://wikidata.demo.openlinksw.com/sparql. I'm not sure whether I'm the "Ted" referenced in the last paragraph; if so, regrettably, I've forgotten the specifics of that conversation. Could you provide more detail about the "question" being asked by this issue, especially to benefit others who may have more to contribute to the "answer" than I?
See https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours and https://www.wikidata.org/wiki/Wikidata:Scholia/Events/Hackathon_October_2024 for the details. We are well aware of the Virtuoso endpoint; it is already configured in the default https://github.com/WolfgangFahl/snapquery/blob/main/snapquery/samples/endpoints.yaml file. The question here is how we can quickly get a Virtuoso endpoint that is as up to date as possible. We intend to "rotate" images based on dumps as long as streaming updates are not possible, so currently that would be roughly weekly; ad-freiburg/qlever-control#82 is an example. This is just an initial issue to start the communication, as suggested by Ted in the online Wikidata Search Platform meeting mentioned above. Depending on how the Virtuoso open source project is going to be involved, we might need multiple tickets for the different aspects. I suggest sticking with the import performance issue in this ticket for the time being and waiting for Tim's comment.
Is Tim a GitHub user? Tagging their handle seems appropriate, if so. If not, I wonder how they are to comment here? (Also if not a GitHub user, it might make sense to instead raise these threads on the OpenLink Community Forum. They would need to register there, but this could be done using various third-party IdPs.)
The import took ~4 days, and the Virtuoso instance was configured according to the recommendation for 64 GB RAM (the highest recommendation available in the documentation).
To improve the import performance I want to try:
Is there a recommended configuration that would allow the dump to be imported within a single day?
@tholzheim Thanks, Tim, for showing up and moving the discussion forward.
@TallTed
Tim Holzheim has successfully imported Wikidata into a Virtuoso instance; see https://cr.bitplan.com/index.php/Wikidata_import_2024-10-28_Virtuoso and
https://wiki.bitplan.com/index.php/Wikidata_import_2024-10-28_Virtuoso
for the documentation. The endpoint is available at https://virtuoso.wikidata.dbis.rwth-aachen.de/sparql/ and we would love to integrate this and other Virtuoso endpoints into our snapquery https://github.com/WolfgangFahl/snapquery infrastructure.
Ted suggested that I open a ticket to get the discussion going about how Virtuoso endpoints could be made part of the snapquery Wikidata mirror infrastructure. The idea is to use named parameterized queries that hide the details of the endpoints, so that it does not matter whether you use Blazegraph, QLever, Jena, Virtuoso, Stardog, ... you name it. Queries should just work as specified and be proactively monitored for non-functional aspects.
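To make the idea of named parameterized queries a bit more concrete, here is a minimal sketch in plain Python. It is not the snapquery API: the short endpoint names, the `NamedQuery` class, the `label-of` query and the `request_url` helper are all hypothetical and only illustrate how a caller could refer to a query by name while the endpoint stays interchangeable. The two endpoint URLs are the ones mentioned in this thread, and the request uses only the standard `query` parameter of the SPARQL protocol.

```python
from dataclasses import dataclass
from string import Template
from urllib.parse import urlencode

# Endpoint URLs are taken from this thread; the short names are hypothetical.
ENDPOINTS = {
    "virtuoso-openlink": "https://wikidata.demo.openlinksw.com/sparql",
    "virtuoso-rwth": "https://virtuoso.wikidata.dbis.rwth-aachen.de/sparql/",
}


@dataclass(frozen=True)
class NamedQuery:
    """A SPARQL query addressed by name, with $placeholders for parameters."""

    name: str
    sparql: Template

    def render(self, **params: str) -> str:
        # substitute() raises KeyError if a parameter is missing.
        return self.sparql.substitute(**params)


# Hypothetical example query: labels of an item in a given language.
QUERIES = {
    "label-of": NamedQuery(
        name="label-of",
        sparql=Template(
            "PREFIX wd: <http://www.wikidata.org/entity/>\n"
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
            "SELECT ?label WHERE { wd:$qid rdfs:label ?label . "
            'FILTER(LANG(?label) = "$lang") }'
        ),
    )
}


def request_url(query_name: str, endpoint_name: str, **params: str) -> str:
    """Build a SPARQL protocol GET URL for a named query on a named endpoint."""
    query = QUERIES[query_name].render(**params)
    base = ENDPOINTS[endpoint_name]
    # The result format would be negotiated via the HTTP Accept header.
    return base + "?" + urlencode({"query": query})


if __name__ == "__main__":
    for endpoint in ENDPOINTS:
        print(request_url("label-of", endpoint, qid="Q42", lang="en"))
```

The caller only ever names the query and its parameters, so swapping one mirror for another changes the endpoint name and nothing else. Monitoring of non-functional aspects such as response time could then be attached in one place around the request, which is left out of this sketch.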