Handle more than 50,000 entries in the sitemap #8936

PaulBoon · 2022-08-25T11:46:16Z

What steps does it take to reproduce the issue?
Generate a sitemap for an archive that has more than 50k datasets

What happens?
A single sitemap.xml file is generated, but Google only accepts sitemap files with 50,000 or fewer URLs, so the file won't be used for indexing.

What did you expect to happen?
Dataverse should split up the sitemap entries over several files and reference them in a sitemap index file. See: https://developers.google.com/search/docs/advanced/sitemaps/large-sitemaps
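
For illustration only, here is a minimal Python sketch of the requested layout: numbered sitemap files with a bounded number of URLs each, plus an index that references them. The function name `write_sitemaps`, the file names, and the `base_url` parameter are assumptions made for this example; this is not how Dataverse itself generates its sitemap.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit discussed in this issue


def write_sitemaps(urls, out_dir, base_url, max_urls=MAX_URLS):
    """Write sitemap1.xml, sitemap2.xml, ... and sitemap_index.xml.

    Hypothetical helper: returns the names of the sitemap files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    names = []

    # One <urlset> file per chunk of at most max_urls entries.
    for start in range(0, len(urls), max_urls):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in urls[start:start + max_urls]:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        name = f"sitemap{len(names) + 1}.xml"
        ET.ElementTree(urlset).write(
            out_dir / name, encoding="utf-8", xml_declaration=True
        )
        names.append(name)

    # The index lists the public URL of every chunk file.
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for name in names:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = base_url + name
    ET.ElementTree(index).write(
        out_dir / "sitemap_index.xml", encoding="utf-8", xml_declaration=True
    )
    return names
```

Calling `write_sitemaps(urls, "sitemap", "https://example.org/sitemap/")` with 120,000 URLs would produce three sitemap files and one index; the example host is a placeholder.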

Any related open or closed issues to this bug report?

Dataverse discovery in Google - Machine Readable Sitemaps #4261

landreev · 2023-12-18T15:46:15Z

@PaulBoon Do you happen to know for a fact if this is still a problem? I.e., if Google is still enforcing this limit?
I was under the impression/assumption that they were no longer applying it, but I'm seeing some evidence to the contrary now, and they appear to still mention it in their documentation.
It really looks like we need to address this in the code to be safe.

PaulBoon · 2023-12-27T11:46:24Z

@landreev This is a while back, but I do remember that the Google Search Console was driving me mad.
It might be that Google is not very strict on the limit, but if you have 100k+ it was complaining the last time I looked.
We now have a Python script in place that splits up the sitemap every night via cron. However, we still have problems with Google: it is not clear what it is doing, or when and how, and its indexing is intentionally a black box.
We do have problems getting all the published datasets properly indexed by Google, as others from the Dataverse community also have. It might be good if we shared our combined knowledge somehow.

pdurbin · 2024-01-05T15:34:39Z

@PaulBoon yeah. Can you please upload your script here? Maybe someone can use it, for now, until we implement a proper solution in Dataverse itself.

landreev · 2024-01-05T17:11:52Z

@PaulBoon thank you. I've been looking into all of this, and yes, it will be a good idea to combine and document all the solutions/tips we may find.
BTW, did it actually work in your case, supplying the sitemap index to the bot by simply adding it to your robots.txt? - I did try that, via this line:

sitemap: https://dataverse.harvard.edu/sitemap_index.xml

but the bot just kept stubbornly using the combined sitemap we had there previously. I had to go into the search console and force-submit the index there. (although there's a chance I simply didn't wait long enough and it would have switched to it eventually - ?)

There appear to be lots of small idiosyncratic things like this when trying to appease the bot.
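
As a quick local check that the robots.txt line itself is well-formed, Python's standard `urllib.robotparser` can list the sitemaps a robots.txt declares. This is only a sketch: the `User-agent` and `Allow` lines are assumed for the example, and a successful parse here says nothing about when or whether Google's crawler will pick up the index.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_in_robots(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() (Python 3.8+) returns None when no sitemap line is present.
    return parser.site_maps() or []


# Example body; only the "sitemap:" line comes from this thread.
robots_txt = """User-agent: *
Allow: /
sitemap: https://dataverse.harvard.edu/sitemap_index.xml
"""

print(sitemaps_in_robots(robots_txt))
```

The parser matches the field name case-insensitively, so the lowercase `sitemap:` spelling used above is recognized.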

PaulBoon · 2024-01-10T14:55:10Z

@pdurbin The script we use to split the sitemap was scraped from the internet some time ago.
It is templated in our Ansible deployment scripts.

splitter.py below

#!/usr/bin/env python3

import os
import sys
from xml.sax import parse
from xml.sax.saxutils import XMLGenerator
import datetime

# based on code from https://github.com/realitix/sitemap_splitter
# needs to be run from same directory as the sitemap.xml file

BASE_URL = "{{ dataverse.payara.siteurl }}/sitemap/"
BREAK_AFTER = 2500  # number of <url> entries per split file

class CycleFile():
    def __init__(self, filename):
        self.basename, self.ext = os.path.splitext(filename)
        self.index = 0
        self.filenames = []
        self.open_next_file()

    def open_next_file(self):
        self.index += 1
        filename = self.name()
        # Write UTF-8 explicitly so the output matches the XML declaration
        self.file = open(filename, 'w', encoding='utf-8')
        self.filenames.append(filename)

    def name(self):
        return '%s%s%s' % (self.basename, self.index, self.ext)

    def cycle(self):
        self.file.close()
        self.open_next_file()

    def write(self, data):
        # XMLGenerator passes UTF-8 encoded bytes; the file is open in text mode
        self.file.write(data.decode('utf-8'))

    def close(self):
        self.file.close()


class XMLBreaker(XMLGenerator):
    def __init__(self, break_into=None, break_after=1000, out=None, *args, **kwargs):
        XMLGenerator.__init__(self, out, encoding='utf-8', *args, **kwargs)
        self.out_file = out
        self.break_into = break_into
        self.break_after = break_after
        self.context = []
        self.count = 0

    def startElement(self, name, attrs):
        XMLGenerator.startElement(self, name, attrs)
        self.context.append((name, attrs))

    def endElement(self, name):
        XMLGenerator.endElement(self, name)
        self.context.pop()

        if name == self.break_into:
            self.count += 1
            if self.count == self.break_after:
                self.count = 0
                for element in reversed(self.context):
                    self.out_file.write(b"\n")
                    XMLGenerator.endElement(self, element[0])
                self.out_file.cycle()

                XMLGenerator.startDocument(self)
                for element in self.context:
                    XMLGenerator.startElement(self, *element)


def generate_index(base_url, filenames):
    now = datetime.datetime.now()
    dt = now.strftime("%Y-%m-%d")
    index_content = """<?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    """

    for filename in filenames:
        index_content += """
            <sitemap>
                <loc>{}</loc>
                <lastmod>{}</lastmod>
            </sitemap>
        """.format(base_url+filename, dt)

    index_content += """
    </sitemapindex>
    """

    # Move current sitemap to backup and write the other one
    os.rename('sitemap.xml', 'backup_sitemap.xml')
    with open('sitemap.xml', 'w', encoding='utf-8') as f:
        f.write(index_content)


def run():
    filename = "sitemap.xml"
    break_into = "url"
    break_after = BREAK_AFTER
    cycle = CycleFile(filename)
    parse(filename, XMLBreaker(break_into, int(break_after), out=cycle))
    cycle.close()  # flush and close the last split file before writing the index
    generate_index(BASE_URL, cycle.filenames)


if __name__ == '__main__':
    run()

We also have two bash scripts that are used to get it working as a cron job.
The job will run: "/home/{{ shared_payara_user }}/bin/generate-sitemap.sh 2>&1 | /usr/bin/logger -t generate-sitemap"

generate-sitemap.sh

#!/bin/bash

# Update the dataverse sitemap
# see: https://guides.dataverse.org/en/latest/installation/config.html#creating-a-sitemap-and-submitting-it-to-search-engines
# The sitemap.xml file will be generated in /var/lib/payara5/glassfish/domains/domain1/docroot/sitemap/

SITEMAP_DIR="/var/lib/payara5/glassfish/domains/domain1/docroot/sitemap"
BIN_DIR="$HOME/bin"
SPLIT_DIR="/tmp/sitemap"

# Split existing sitemap first, this is the previously generated one.
# Needed because we don't know when the new sitemap file is ready so we lag behind one
if [[ -f $SITEMAP_DIR/sitemap.xml ]]
then
  (
    $BIN_DIR/splitup-sitemap.sh
  ) || {
    exit 1
  }
fi

CURL_OUT="curloutput.txt"
(
  # Try to update the sitemap
  # Run curl, and stick all output in the temp file
  /usr/bin/curl --silent --show-error -X POST http://localhost:8080/api/admin/sitemap > "$CURL_OUT" 2>&1
) || {
  # If curl exited with a non-zero error code, send its output to stderr so that
  # cron could e-mail it.
  # You can test this by stopping the payara service for instance
  cat "$CURL_OUT" 1>&2
  rm "$CURL_OUT"
  exit 1
}

# Otherwise curl completed 
# but maybe the result status was not 'OK'
if ! grep -q "^{\"status\":\"OK\"" "$CURL_OUT"; then
  # If it does not start with the OK status, also have error output in cron
  # This will be the case when there is a sitemap.xml.staged file for instance
  cat "$CURL_OUT" 1>&2
  rm "$CURL_OUT"
  # Remove any staged file, otherwise next update attempt will also fail
  rm -f $SITEMAP_DIR/sitemap.xml.staged
  exit 1
fi

# Everything seems OK, so send the output to stdout (which
# should be redirected to a log file in crontab)
cat "$CURL_OUT"
rm "$CURL_OUT"

and splitup-sitemap.sh

#!/bin/bash

SITEMAP_DIR="/var/lib/payara5/glassfish/domains/domain1/docroot/sitemap"
SPLIT_DIR="/tmp/sitemap"
BIN_DIR="$HOME/bin"

rm -rf $SPLIT_DIR
mkdir $SPLIT_DIR
cp $SITEMAP_DIR/sitemap.xml $SPLIT_DIR/
cp $BIN_DIR/splitter.py $SPLIT_DIR/
( 
   cd $SPLIT_DIR
   ./splitter.py
) || {
   rm -rf $SPLIT_DIR
   exit 1
}
mv $SPLIT_DIR/sitemap.xml $SPLIT_DIR/sitemap_index.xml
rm $SPLIT_DIR/backup_sitemap.xml
rm $SPLIT_DIR/splitter.py
cp $SPLIT_DIR/* $SITEMAP_DIR/
rm -rf $SPLIT_DIR

I do see some hardwired payara5 paths and a few more of our Ansible vars in there, but you get the general idea.
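
To sanity-check the result of a split like the one above, the following sketch counts the `<url>` entries in each numbered sitemap file and reports any that exceed a limit. The helper names `count_urls` and `oversized_sitemaps` are hypothetical, and the `sitemap[0-9]*.xml` pattern assumes the `sitemap1.xml`, `sitemap2.xml`, ... naming that splitter.py produces.

```python
import glob
import os
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def count_urls(path):
    """Count <url> elements in one sitemap file, streaming to keep memory low."""
    count = 0
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == SITEMAP_NS + "url":
            count += 1
            elem.clear()  # drop the entry's children once counted
    return count


def oversized_sitemaps(directory, limit=50_000):
    """Return {file name: url count} for split files holding more than `limit` URLs."""
    pattern = os.path.join(directory, "sitemap[0-9]*.xml")
    counts = {
        os.path.basename(path): count_urls(path)
        for path in sorted(glob.glob(pattern))
    }
    return {name: n for name, n in counts.items() if n > limit}
```

An empty dictionary from `oversized_sitemaps("/path/to/sitemap")` means every split file is within the limit; the path is a placeholder for the directory the files were copied to.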

PaulBoon · 2024-01-10T15:17:34Z

Sorry, I accidentally closed the issue

pdurbin · 2024-01-10T21:01:36Z

Awesome, thanks @PaulBoon

cmbz · 2024-01-30T01:43:04Z

2024/01/29

Prioritized following Slack conversation with @scolapasta

pdurbin added the Feature: Metadata, Type: Bug (a defect), and User Role: Sysadmin (installs, upgrades, and configures the system, connects via ssh) labels Oct 12, 2022

jggautier mentioned this issue Jul 14, 2023

robots.txt disallowing all, preventing crawling by Google etc. IQSS/dataverse.harvard.edu#227

Closed

pdurbin mentioned this issue Sep 22, 2023

Spike: Get the prod. archive fully reindexed by Google, while mitigating the load on the servers from crawling by the bot IQSS/dataverse.harvard.edu#228

Open

cmbz added this to IQSS Dataverse Project Dec 18, 2023

PaulBoon closed this as completed Jan 10, 2024

PaulBoon reopened this Jan 10, 2024

cmbz moved this to SPRINT- NEEDS SIZING in IQSS Dataverse Project Jan 30, 2024

jeromeroucou added a commit to Recherche-Data-Gouv/dataverse that referenced this issue Jan 31, 2024

Sitemap more than 50000 entries IQSS#8936

9110ade

landreev added the Size: 10 (a percentage of a sprint; 7 hours) label Feb 12, 2024

cmbz moved this from SPRINT- NEEDS SIZING to SPRINT READY in IQSS Dataverse Project Feb 12, 2024

jeromeroucou mentioned this issue Feb 14, 2024

Support sitemaps with more than 50,000 items #10321

Merged

pdurbin moved this from SPRINT READY to This Sprint 🏃‍♀️ 🏃 in IQSS Dataverse Project Feb 15, 2024

scolapasta moved this from This Sprint 🏃‍♀️ 🏃 to In Review 🔎 in IQSS Dataverse Project Feb 29, 2024

scolapasta self-assigned this Feb 29, 2024

scolapasta removed this from IQSS Dataverse Project Feb 29, 2024

scolapasta removed their assignment Feb 29, 2024

pdurbin added a commit that referenced this issue Apr 18, 2024

rename release note snippet with "8936" #8936

7cd7789

pdurbin added a commit that referenced this issue Apr 18, 2024

simplify release note, add upgrade section #8936

ceb8c0f

pdurbin added a commit that referenced this issue Apr 18, 2024

rewrite sitemap docs (50,000 items now supported) #8936

b228fe7

pdurbin added a commit that referenced this issue Apr 24, 2024

various sitemap doc tweaks #8936

7c6d101

landreev closed this as completed in #10321 May 8, 2024

pdurbin added this to the 6.3 milestone May 8, 2024

DS-INRAE added this to Recherche Data Gouv Jul 10, 2024

DS-INRAE moved this to Done in Recherche Data Gouv Jul 10, 2024
