Support sitemaps with more than 50,000 items #10321
Just a quick review. Thanks for the pull request, @jeromeroucou ! ❤️
@@ -1745,6 +1745,8 @@ https://demo.dataverse.org/sitemap.xml is the sitemap URL for the Dataverse Proj

Once the sitemap has been generated and placed in the domain docroot directory, it will become available to the outside callers at <YOUR_SITE_URL>/sitemap/sitemap.xml; it will also be accessible at <YOUR_SITE_URL>/sitemap.xml (via a *pretty-faces* rewrite rule). Some search engines will be able to find it at this default location. Some, **including Google**, need to be **specifically instructed** to retrieve it.

On Dataverse installation with more than 50000 Dataverse collections or datasets, sitemap file name is ``sitemap_index.html``, not ``sitemap.xml``. Be aware in previous steps to use the correct file name corresponding to your installation.
I'm confused. Why not use the library in all cases? And avoid having two code paths and two XML files?
My note wasn't precise enough.
In the sitemap specification, when there are more than 50,000 URLs and/or the sitemap file is larger than 50MB, we have to use a sitemap index file. The name of this file isn't clearly specified, but the examples use http://www.yoursite.com/sitemap_index.xml, or https://example.com/public/sitemap_index.xml in the Google developers documentation.
The sitemapgen4j library generates the file name sitemap_index.xml by default for a sitemap index. You can see it here.
So the library is used in all cases; only the name of the sitemap file changes.
I'll have to clarify this in the documentation 👍
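To make the naming rule above concrete, here is a minimal sketch in plain Java. It is not how sitemapgen4j is implemented: `SitemapFileNames` and `forUrlCount` are hypothetical names, and only the 50,000 URL limit is modeled (the 50MB size limit is ignored). The part names `sitemap1.xml`, `sitemap2.xml` follow the files discussed in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Dataverse or sitemapgen4j.
class SitemapFileNames {
    // Per-file URL limit from the sitemap specification.
    static final int MAX_URLS_PER_FILE = 50_000;

    // Returns the file names one would expect for the given number of URLs.
    static List<String> forUrlCount(int urlCount) {
        List<String> names = new ArrayList<>();
        if (urlCount <= MAX_URLS_PER_FILE) {
            // Everything fits in a single file, so no index is needed.
            names.add("sitemap.xml");
            return names;
        }
        // Too many URLs: an index file plus numbered part files.
        names.add("sitemap_index.xml");
        int parts = (urlCount + MAX_URLS_PER_FILE - 1) / MAX_URLS_PER_FILE; // ceiling division
        for (int i = 1; i <= parts; i++) {
            names.add("sitemap" + i + ".xml");
        }
        return names;
    }
}
```

Under these assumptions, 50,001 URLs would yield `sitemap_index.xml`, `sitemap1.xml` and `sitemap2.xml`, while 50,000 or fewer would yield a single `sitemap.xml`.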
Hi @jeromeroucou @pdurbin, is this set to be QAed (or does it still need more review / follow-up for the review)? Also, can we take this out of draft?
Just a general note here: the dependency added has not been updated since 2019 (no activity, stale PRs and issues) and uses old Java APIs. At least it doesn't introduce any other transitive dependencies. There are some forks with small patches. My suggestion: let's create a fork at github.com/gdcc and have our own package at Maven Central. In case we need patches because of bugs or changes at search engines, we would be much quicker to get this done. It also allows us to modernize the Java code over time.
Good catch @poikilotherm ! If it is possible to have a dedicated fork with patches, that would be nice, and I will be happy to update my PR to use that dependency.
@poikilotherm not a bad suggestion. So we'd fork https://github.com/dfabulich/sitemapgen4j under https://github.com/gdcc? @scolapasta and others reading this, what do you think?
I see a 👍 from @scolapasta and no objections in Slack so I forked it to https://github.com/gdcc/sitemapgen4j Now what? Fix it up and release to Maven Central? Do we want a separate issue to track this? @jeromeroucou are you interested in helping?
@pdurbin Yes, I am interested. I also think a separate issue on the forked project would be better for reviewing patches through PRs.
@jeromeroucou great thanks. @poikilotherm is already working away in this "modernize" branch, if you'd like to try building it: https://github.com/gdcc/sitemapgen4j/tree/modernize
He doesn't even have to. Just enable Maven Central Snaps and point to the new coordinates: io.gdcc:sitemapgen4j:2.0.0-SNAPSHOT
https://github.com/gdcc/sitemapgen4j has been released as v2.0.0 to Maven Central. (Might take another hour until available from all mirrors.) Have fun! (If we need fixes, updated
I have just released a version 2.1.0 that has again more optimizations. Most important: addressed loads of code smells and removed obsolete, no longer supported Google Sitemap Extensions.
@jeromeroucou Is this all ready for QA now? (If so, can you change it from draft? I just want to be sure there's not something else we're waiting for here.)
Hi @scolapasta , the PR is not yet ready as it's waiting for the sitemapgen4j library namespaces fix, referenced by gdcc/sitemapgen4j#18
Ah, OK, that makes sense. We'll keep an eye out for when you switch it out of draft state.
@jeromeroucou I merged your PR and a new SNAPSHOT version is available. Would you mind testing this in Dataverse again before I cut a release?
@poikilotherm Thanks for the update! On my laptop with the latest SNAPSHOT version, unit tests and sitemap generation with some dataverses and datasets work fine. Once the new release is published, I'll reference it in the pom.xml file.
Co-authored-by: Philip Durbin <philipdurbin@gmail.com>
@@ -0,0 +1,3 @@
The sitemap file generation can handle more than 50,000 entries if needed with the [sitemapgen4j](https://github.com/gdcc/sitemapgen4j) library, maintained by the Global Dataverse Community Consortium.

In this case, the Dataverse Admin API `api/admin/sitemap` create a sitemap index file, called `sitemap_index.xml`, in place of the `sitemap.xml` file. See the config section of the Installation Guide for details.
Could you please add just a couple of words here, along the lines of "... in place of the single `sitemap.xml`, which will be split into individual files `sitemap1.xml`, `sitemap2.xml` etc. as needed; these files will be listed in the main index file and placed in the same directory..." I obviously encourage you to try and explain this better and in a more clear way (hence "along the lines of ...")
Hi @landreev.
Thank you for the review. I've updated the release notes snippet and the configuration page, and I hope I've addressed your feedback.
I left a couple of comments. Looks great otherwise. We have more than 50K public objects in our production, so I've been using a simple script to split the sitemap into segments. But it will help other instances to have it done automatically.
Co-authored-by: Philip Durbin <philipdurbin@gmail.com>
@jeromeroucou I took a close look at the documentation and I'm suggesting a significant rewrite here: If you like the changes, please feel free to merge. Otherwise, I'm happy to talk about it! 😄
I looked at the code and poked around with the new testHugeSiteMap test. I'm basically ready to approve this and move it to "ready for QA" but I'd like to see the doc PR I made above get merged first.
sitemap URL must have "/sitemap/" to be accessible in case of sitemap index file
After discussion with @pdurbin, a new commit has been made to fix the sitemap URL in the case of a sitemap index file:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://domain.tld/sitemap/sitemap1.xml</loc>
    <lastmod>2024-04-22</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://domain.tld/sitemap/sitemap2.xml</loc>
    <lastmod>2024-04-22</lastmod>
  </sitemap>
</sitemapindex>

I've also modified the unit tests to validate this modification, and I've added some comments.
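For illustration, an index with `/sitemap/` in each `<loc>` could be rendered roughly as follows. This is a hand-rolled sketch, not the code in this PR (which relies on sitemapgen4j): `SitemapIndexSketch` and `render` are hypothetical names, and the inputs are assumed to contain no characters that need XML escaping.

```java
import java.util.List;

// Hypothetical illustration; the real index file is written by sitemapgen4j.
class SitemapIndexSketch {
    // Builds a sitemap index whose entries all live under <siteUrl>/sitemap/.
    static String render(String siteUrl, List<String> partFiles, String lastmod) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String file : partFiles) {
            xml.append("  <sitemap>\n");
            // The "/sitemap/" path segment is what makes each part file reachable.
            xml.append("    <loc>").append(siteUrl).append("/sitemap/").append(file).append("</loc>\n");
            xml.append("    <lastmod>").append(lastmod).append("</lastmod>\n");
            xml.append("  </sitemap>\n");
        }
        xml.append("</sitemapindex>\n");
        return xml.toString();
    }
}
```

Calling `SitemapIndexSketch.render("https://domain.tld", List.of("sitemap1.xml", "sitemap2.xml"), "2024-04-22")` would produce entries matching the two `<loc>` values shown in this comment.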
updates to sitemap docs
@jeromeroucou thanks for merging my docs pull request and for adding the pretty-config change. I haven't tested it specifically but I think this pull request is ready for QA so I'll move it along. Thanks!
Merging.
What this PR does / why we need it:
The current sitemap generation doesn't take the number of entries into account. However, the sitemap specification states that "each Sitemap file that you provide must have no more than 50,000 URLs and must be no larger than 50MB (52,428,800 bytes)" (see sitemaps.org).
Which issue(s) this PR closes:
Closes #8936
Special notes for your reviewer:
Sitemap generation is done by the sitemapgen4j library, which has the advantage of simplifying the code. I also use the new Java 8 date formatter DateTimeFormatter, which is thread-safe.
I made this PR a draft because the issue has been moved to "SPRINT READY" in the IQSS Dataverse Project.
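On the DateTimeFormatter point above: because `java.time.format.DateTimeFormatter` is immutable and thread-safe, a single shared instance can be used from several threads, which is not safe with the older `SimpleDateFormat`. A minimal sketch, assuming the `lastmod` value is a plain ISO date; `LastModFormat` and its methods are hypothetical names, not taken from this PR.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of sharing one formatter across threads.
class LastModFormat {
    // Immutable and thread-safe, so one shared instance is enough.
    static final DateTimeFormatter LASTMOD = DateTimeFormatter.ISO_LOCAL_DATE;

    static String format(LocalDate date) {
        return LASTMOD.format(date);
    }

    // Formats many dates concurrently using the same formatter instance.
    static List<String> formatInParallel(List<LocalDate> dates) {
        return dates.parallelStream()
                .map(LastModFormat::format)
                .collect(Collectors.toList());
    }
}
```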
Suggestions on how to test this:
A new unit test has been added. And if the code is launched in a running Dataverse, you can also call the API (for example with curl -X POST http://localhost:8080/api/admin/sitemap).
Is there a release notes update needed for this change?:
No, I don't think so.
Additional documentation:
The installation configuration documentation needs to be updated with the new file name of the sitemap.
Preview at https://dataverse-guide--10321.org.readthedocs.build/en/10321/installation/config.html#creating-a-sitemap-and-submitting-it-to-search-engines