Handle more than 50,000 entries in the sitemap #8936
Comments
@PaulBoon Do you happen to know for a fact whether this is still a problem? That is, is Google still enforcing this limit?
@landreev This was a while back, but I do remember that the Google Search Console was driving me mad.
@PaulBoon Yeah. Could you please upload your script here? Maybe someone can use it for now, until we implement a proper solution in Dataverse itself.
@PaulBoon thank you. I've been looking into all of this, and yes, it will be a good idea to combine and document all the solutions/tips we may find.
but the bot just kept stubbornly using the combined sitemap we had there previously. I had to go into the search console and force-submit the index there (although there's a chance I simply didn't wait long enough and it would have switched to it eventually). There appear to be lots of small idiosyncratic things like this when trying to appease the bot.
@pdurbin The script we use to split the sitemap was scraped from the internet some time ago; see splitter.py below.
We also have two bash scripts that are used to get it working as a cron job: generate-sitemap.sh
and split-up-sitemap.sh
I do see some hardwired payara5 references and some more of our Ansible vars, but you get the general idea.
Sorry, I accidentally closed the issue.
Awesome, thanks @PaulBoon
2024/01/29
What steps does it take to reproduce the issue?
Generate a sitemap for an archive that has more than 50,000 datasets.
What happens?
A single sitemap.xml file is generated, but Google only accepts sitemap files with 50,000 or fewer URLs each, so the file won't be used for indexing.
What did you expect to happen?
Dataverse should split up the sitemap entries over several files and reference them in a sitemap index file. See: https://developers.google.com/search/docs/advanced/sitemaps/large-sitemaps
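For illustration only, here is a minimal Python sketch of the expected behaviour: split a list of URLs into files of at most 50,000 entries and build a sitemap index that points at them. This is not the Dataverse implementation or the splitter.py mentioned in the comments. The function names, the `sitemap1.xml`, `sitemap2.xml`, ... file naming and the `base_url` parameter are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit described in the sitemap protocol


def chunk(urls, size=MAX_URLS):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_urlset(urls):
    """Return a <urlset> element with one <url><loc> child per URL."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for u in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = u
    return root


def build_index(sitemap_urls):
    """Return a <sitemapindex> element referencing each sitemap file."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for u in sitemap_urls:
        ET.SubElement(ET.SubElement(root, "sitemap"), "loc").text = u
    return root


def split_sitemap(urls, base_url, size=MAX_URLS):
    """Return (index_element, {filename: urlset_element}).

    File names (sitemap1.xml, ...) are hypothetical, chosen for this sketch.
    """
    files = {}
    for i, part in enumerate(chunk(urls, size), start=1):
        files[f"sitemap{i}.xml"] = build_urlset(part)
    index = build_index([f"{base_url}/{name}" for name in files])
    return index, files
```

Each returned element could then be written to disk with `ET.ElementTree(element).write(path, encoding="utf-8", xml_declaration=True)`, and only the index file would need to be submitted to the search engine.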
Any related open or closed issues to this bug report?