
[question] large sitemap, check last sitemap item and continue generate next file only? #411

Open
kien-pham opened this issue Mar 19, 2023 · 2 comments


@kien-pham

Hello,

I have a large database with more than 100k records that I need to split across multiple sitemap files. Following your example code, I use SitemapAndIndexStream to generate all of the sitemap files, for example sitemap-1.xml through sitemap-9.xml.

Now I want to run the code every day, and it should only add the items between the last item in sitemap-9.xml and the newest item in the DB. If I run SitemapAndIndexStream again, it queries every item in the DB and regenerates every file, which uses far more resources than necessary.

How can I do this?

@huntharo
Contributor

Yeah, you don't want to run SitemapAndIndexStream again.

Although... the query should be backed by a DB index that holds all of the sitemap records sequentially, so the actual cost of the query should be very low: it should not use a lot of DB I/O or DB CPU.

100k records is not a lot. That would only need to write a 2-3 record sitemap index and 2-4 sitemap xml files. Regenerating all of that would only take a minute or so with the XML serialization being the longest part.
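
As a rough check on those file counts, here is a minimal sketch of the arithmetic. The 50,000 URLs per file ceiling comes from the sitemaps.org protocol; the function name `sitemapFileCount` and the lower example limit are hypothetical, chosen only for illustration.

```typescript
// Sketch only: estimates how many sitemap files a full regeneration writes.
// 50,000 URLs per file is the sitemaps.org protocol maximum.
const PROTOCOL_MAX_URLS_PER_FILE = 50000;

// Hypothetical helper, not part of the sitemap library.
function sitemapFileCount(
  recordCount: number,
  limit: number = PROTOCOL_MAX_URLS_PER_FILE
): number {
  if (limit < 1 || limit > PROTOCOL_MAX_URLS_PER_FILE) {
    throw new RangeError("limit must be between 1 and 50000");
  }
  // Every file is filled to the limit except possibly the last one.
  return Math.ceil(recordCount / limit);
}

// At the protocol maximum, 100k records fit in two files.
console.log(sitemapFileCount(100000));
// A lower per-file limit (an assumption here) produces more, smaller files.
console.log(sitemapFileCount(100000, 12500));
```

With the default limit, the sitemap index would list only a couple of entries, which is why a full regeneration stays cheap at this size.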

The alternative is much more complex. You'll need to pull back the index file, get the latest sitemap XML file out of it, fetch that file, parse the latest sitemap into items, add to the items collection, rotate into a new file (and add it to the index) when the last file fills up, then put back the old latest file, any newly added files, and the index file.
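
To make the steps above more concrete, here is a minimal in-memory sketch of just the append-and-rotate part. It assumes the index and the latest sitemap file have already been fetched and parsed, and it leaves fetching, XML parsing, serialization, and upload out entirely. The `SitemapFile` and `AppendResult` types, the `appendToSitemaps` function, and the `sitemap-N.xml` naming are all hypothetical and are not part of the sitemap library.

```typescript
// Hypothetical in-memory model of one sitemap file (not a library type).
interface SitemapFile {
  name: string;
  urls: string[];
}

// What the caller would need to write back after an incremental update.
interface AppendResult {
  index: string[];             // file names the sitemap index should list
  changedFiles: SitemapFile[]; // the old latest file (if touched) plus new files
}

// Appends newUrls to the latest file, rotating into a new file whenever the
// current one reaches `limit`. Assumes lastFile, when given, is the last
// entry of `index`.
function appendToSitemaps(
  index: string[],
  lastFile: SitemapFile | null,
  newUrls: string[],
  limit: number,
  nameFor: (n: number) => string = (n) => `sitemap-${n}.xml`
): AppendResult {
  const updatedIndex = index.slice();
  const changedFiles: SitemapFile[] = [];
  // Copy the latest file so the caller's parsed data is not mutated.
  let current: SitemapFile | null = lastFile
    ? { name: lastFile.name, urls: lastFile.urls.slice() }
    : null;

  for (const url of newUrls) {
    if (current === null || current.urls.length >= limit) {
      // The latest file is full (or missing): rotate and register in the index.
      current = { name: nameFor(updatedIndex.length + 1), urls: [] };
      updatedIndex.push(current.name);
    }
    current.urls.push(url);
    // Record each touched file once, in the order it was touched.
    if (changedFiles[changedFiles.length - 1] !== current) {
      changedFiles.push(current);
    }
  }
  return { index: updatedIndex, changedFiles };
}

// Example: the latest file holds 2 of 3 allowed URLs and 5 new URLs arrive.
const result = appendToSitemaps(
  ["sitemap-1.xml", "sitemap-2.xml"],
  { name: "sitemap-2.xml", urls: ["/a", "/b"] },
  ["/c", "/d", "/e", "/f", "/g"],
  3
);
console.log(result.index);
console.log(result.changedFiles.map((f) => `${f.name}: ${f.urls.length}`));
```

Only the files in `changedFiles` and the index would need to be rewritten; every earlier sitemap file stays untouched. A real implementation would also have to find the new DB rows (for example by remembering the last written record) and handle updates or deletions of existing items, which this sketch ignores.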

It's... not trivial. I have a system that I'm trying to open source that does all of that, but I can't make any promises.

@huntharo
Contributor

Here is a project that utilizes this project and uses Kinesis streams to allow millions of items to be written to sitemaps (and updated / maintained):

https://github.com/shutterstock/streaming-sitemaps

That project may meet your needs or give you some ideas on how to approach the problem.
