Dataverse discovery in Google - Machine Readable Sitemaps #4261

eugene-barsky · 2017-11-06T17:23:59Z

Hello:

As per Philip's request, we sat in on a Google session about data discoverability, where Dr. Natasha Noy @google mentioned Google's preference for sitemaps for better data indexing.

They also talked about Google's preference for schema.org markup on landing pages.

Thanks,

Eugene

jggautier · 2017-11-06T17:32:03Z

Thanks @eugene-barsky! Some more info for this issue: Google's "Build and submit a sitemap" guide

pdurbin · 2017-11-07T01:53:33Z

Yep, as I mentioned at #2717 (comment), sitemaps were mentioned about 24 minutes into the video at https://www.rd-alliance.org/making-data-discoverable-web-search-engines, which is when the slide above appeared. Thanks for opening this issue, @eugene-barsky.

mheppler · 2018-04-02T15:53:23Z

In issue #4555 @pameyer commented:

"Why don't my datasets show up in google?" seems like a question that comes up relatively commonly (but completely out of scope for this issue).

This would be the appropriately scoped issue for that question.

djbrooke · 2018-09-12T13:15:35Z

We'll estimate this and work on it. I'm removing the schema.org part of the title, since that was delivered in 4.8.4. We still need to add a sitemap. This is more important now that Google Dataset Search is a thing. :)

pdurbin · 2018-09-24T15:45:27Z

I'm playing around with Documentation Driven Development (DDD), if that's a thing, by making pull request #5084 which for now is only a stub of the direction I think we're going. See d2ccf59

At standup I mentioned that for my family site I recently switched from Jekyll to Hugo, which creates a sitemap at http://thedurbins.com/sitemap.xml . Hugo creates an XML file but at https://support.google.com/webmasters/answer/183668 Google indicates they support multiple formats:

XML
RSS, mRSS, and Atom 1.0
Text

My assumption is that we want to create an XML file. I assume we'll be using an EJB timer to control how often the XML file is updated. If I'm misunderstanding any requirements, please advise.
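For reference, the XML format mentioned above is a `urlset` of `url` entries, each with a `loc` and an optional `lastmod`. Below is a minimal sketch of generating such a file in Python. It is not the Dataverse implementation (which is Java), the function name `build_sitemap` is hypothetical, and the scheduling side (the EJB timer) is not shown.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML as bytes for an iterable of (url, lastmod) pairs.

    lastmod is a YYYY-MM-DD string, or None to omit the <lastmod> element.
    build_sitemap is a hypothetical helper, not part of Dataverse.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    # xml_declaration requires Python 3.8 or later.
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)
```

Calling `build_sitemap([("https://example.org/dataverse/root", "2018-09-27")])` (an illustrative URL) returns the bytes that would be written to `sitemap.xml`.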

kcondon · 2018-10-02T16:01:43Z

Sending back, here's your requested "Punch List" @pdurbin, mostly what we had discussed last evening:

Need logging that tells admins something is running when the job takes longer than a few minutes, e.g. start, finish, object counts. Other ideas welcome.
Running the endpoint on a copy of prod finished in around an hour and produced a sitemap page, but I'm not sure it actually finished correctly, since the output on the command line differed from that of a successful run:

"Phil, sitemap seems like it finished in under an hour but I was not watching it.
weirdly it said this:
[root@dvn-vm5 tmp]# curl -X POST http://localhost:8080/api/admin/sitemap

<?xml version='1.0' encoding='UTF-8' ?>

this is what it says when it works:
[root@dvn-vm4 tmp]# curl -X POST http://localhost:8080/api/admin/sitemap
{"status":"OK","data":{"message":"Sitemap updated."}}
however, there is now a sitemap.xml file on vm5"

Make the operation non-blocking on the command line: my understanding is that the current blocking call keeps running even after Ctrl-C.
Make the call check whether an update is already running and, if so, return "already running" rather than executing again. This provides feedback and avoids potentially corrupting the sitemap. Some suggested approaches:
-Mozilla download model: write the sitemap to a differently named file until it is completed/verified, then copy it to sitemap.xml. The temp file acts as a lock to check.
-DB table entry with a lock.
-In-memory singleton with state to check whether an update is in progress.
Document or provide a script that can be added as a cron job to validate sitemap.xml after completion; otherwise it's unclear whether the update completed successfully or was interrupted by a service restart, etc.
Document the fact that the file exists only on the node on which the endpoint was run, so in a multi web node environment this file needs to be shared or replicated.

Many of the above are design/usability issues rather than strictly QA functional testing.
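The validation script suggested in the punch list could look roughly like the sketch below. It only checks that the file parses, has the expected root element, and lists at least one URL; it does not validate against the full sitemap schema. The function name `check_sitemap` is hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def check_sitemap(path):
    """Return (ok, message) for a basic sanity check of a sitemap file.

    A file truncated by an interrupted run fails to parse and is reported
    as not ok. check_sitemap is a hypothetical helper for illustration.
    """
    try:
        root = ET.parse(path).getroot()
    except (OSError, ET.ParseError) as err:
        return False, f"cannot read or parse {path}: {err}"
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        return False, f"unexpected root element {root.tag}"
    locs = root.findall(f"{{{SITEMAP_NS}}}url/{{{SITEMAP_NS}}}loc")
    if not locs:
        return False, "sitemap contains no <url><loc> entries"
    return True, f"{len(locs)} URLs listed"
```

A cron wrapper would call `check_sitemap` with the path to `sitemap.xml` and exit non-zero (or send mail) when the first element of the result is `False`.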

pdurbin · 2018-10-02T20:39:43Z

@kcondon after we chatted a few minutes ago I added the final thing we agreed would be nice: feedback from the curl command if the staged file exists. Over to you. Thanks.
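The staged-file approach discussed here (write to `sitemap.xml.staged`, verify, then move into place, and report an error if the staged file already exists) can be sketched as follows. This is an illustration in Python, not the Java code from the pull request; `update_sitemap` and the "already in progress" message are hypothetical, and deleting the staged file on failure is a choice made for this sketch.

```python
import os
import xml.etree.ElementTree as ET


def update_sitemap(sitemap_dir, xml_bytes):
    """Stage, verify, then publish sitemap.xml; return a status message."""
    final_path = os.path.join(sitemap_dir, "sitemap.xml")
    staged_path = final_path + ".staged"
    try:
        # Mode "xb" fails if the staged file already exists, so the file
        # doubles as a lock against overlapping runs.
        staged = open(staged_path, "xb")
    except FileExistsError:
        return "Sitemap update already in progress (staged file exists)."
    try:
        with staged:
            staged.write(xml_bytes)
        # Well-formedness check before the staged file replaces the old one.
        ET.parse(staged_path)
    except (OSError, ET.ParseError):
        os.remove(staged_path)
        raise
    # Move rather than copy: os.replace is atomic on the same filesystem.
    os.replace(staged_path, final_path)
    return "Sitemap updated."
```

Because the final file is only replaced after the staged copy parses, an interrupted or failed run leaves the previous `sitemap.xml` untouched.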

kcondon · 2018-10-04T19:13:07Z

@pdurbin
Found a couple things:

The path to sitemap.xml is documented in two different places with different directories: the multiple web server instructions say "logos", while the sitemap instructions say "sitemap". Based on the log messages, the sitemap instructions appear to be correct.
When no initial sitemap exists, the endpoint fails, as shown in the server log:

[2018-10-04T15:09:57.296-0400] [glassfish 4.1] [INFO] [] [edu.harvard.iq.dataverse.sitemap.SiteMapUtil] [tid: _ThreadID=147 _ThreadName=__ejb-thread-pool5] [timeMillis: 1538680197296] [levelValue: 800] [[
  Writing staged sitemap to /usr/local/glassfish4/glassfish/domains/domain1/docroot/sitemap/sitemap.xml.staged]]

[2018-10-04T15:09:57.397-0400] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.sitemap.SiteMapUtil] [tid: _ThreadID=147 _ThreadName=__ejb-thread-pool5] [timeMillis: 1538680197397] [levelValue: 900] [[
  Unable to update sitemap! Unable to write staged sitemap to /usr/local/glassfish4/glassfish/domains/domain1/docroot/sitemap/sitemap.xml.staged. TransformerException: java.io.FileNotFoundException: /usr/local/glassfish4/glassfish/domains/domain1/docroot/sitemap/sitemap.xml.staged (No such file or directory)]]
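The log does not say which part of the path is missing. Assuming the cause is that the `sitemap` directory under docroot has not been created yet on a fresh install, one way to avoid the `FileNotFoundException` is to create the directory before opening the staged file. A minimal Python sketch of that idea, with a hypothetical helper name:

```python
import os


def open_staged_sitemap(docroot):
    """Create <docroot>/sitemap if needed, then open the staged file.

    Hypothetical helper illustrating the idea; the actual fix in
    Dataverse may differ.
    """
    sitemap_dir = os.path.join(docroot, "sitemap")
    # exist_ok=True makes this a no-op when the directory is already there.
    os.makedirs(sitemap_dir, exist_ok=True)
    return open(os.path.join(sitemap_dir, "sitemap.xml.staged"), "wb")
```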

pdurbin mentioned this issue Nov 7, 2017

Indexing Dataverses in Google Scholar #2717

Closed

pdurbin added the User Role: Depositor Creates datasets, uploads data, etc. label Jul 13, 2018

djbrooke added Status: Backlog labels Sep 12, 2018

djbrooke changed the title ~~Dataverse discovery in Google - Sitemaps and schema.org~~ Dataverse discovery in Google - Sitemaps Sep 12, 2018

djbrooke added Status: This/Next Sprint and removed Status: Backlog User Role: Depositor Creates datasets, uploads data, etc. labels Sep 12, 2018

djbrooke self-assigned this Sep 19, 2018

djbrooke changed the title ~~Dataverse discovery in Google - Sitemaps~~ Dataverse discovery in Google - Machine Readable Sitemaps Sep 19, 2018

djbrooke removed their assignment Sep 19, 2018

djbrooke removed the ready for estimation label Sep 19, 2018

pdurbin added Status: Development and removed Status: This/Next Sprint labels Sep 24, 2018

pdurbin self-assigned this Sep 24, 2018

pdurbin added a commit that referenced this issue Sep 24, 2018

add doc stub for sitemaps #4261

d2ccf59

pdurbin mentioned this issue Sep 24, 2018

Support for sitemaps #4261 #5084

Merged

pdurbin added a commit that referenced this issue Sep 24, 2018

stub out sitemap code and tests #4261

afe3d0f

pdurbin added a commit that referenced this issue Sep 25, 2018

write sitemap to docroot #4261

61231b2

pdurbin added a commit that referenced this issue Sep 25, 2018

serve from /sitemap.xml #4261

45bde32

pdurbin added a commit that referenced this issue Sep 26, 2018

add datasets to sitemap #4261

e436355

pdurbin added a commit that referenced this issue Sep 27, 2018

add test to assert that XML is well formed #4261

125d163

pdurbin added a commit that referenced this issue Sep 27, 2018

validate sitemap against the schema #4261

f3d9b31

pdurbin added a commit that referenced this issue Sep 27, 2018

add dataverses to sitemap #4261

d96f8cb

pdurbin added a commit that referenced this issue Sep 27, 2018

fix test (dv must be published to appear in sitemap) #4261

9934fdf

matthew-a-dunlap removed their assignment Oct 1, 2018

matthew-a-dunlap added Status: QA and removed Status: Code Review labels Oct 1, 2018

kcondon self-assigned this Oct 1, 2018

poikilotherm mentioned this issue Oct 2, 2018

IQSS-5122 Fix NetBeans handling of test files. #5127

Merged

3 tasks

kcondon assigned pdurbin and unassigned kcondon Oct 2, 2018

kcondon added Status: Development and removed Status: QA labels Oct 2, 2018

pdurbin added a commit that referenced this issue Oct 2, 2018

Merge branch '5122-fix-netbeans-compat' into 4261-sitemap #4261

e73595f

pdurbin added a commit that referenced this issue Oct 2, 2018

add BEGIN and END lines to log #4261

bd54ba0

pdurbin added a commit that referenced this issue Oct 2, 2018

explain that logos and sitemaps are written per server #4261

c4116a1

pdurbin added a commit that referenced this issue Oct 2, 2018

stage sitemap before writing to final file #4261

b574f27

pdurbin added a commit that referenced this issue Oct 2, 2018

add validation to main routine, s/copy/move/ #4261

3b9bbf1

pdurbin added a commit that referenced this issue Oct 2, 2018

make async, report error if staged file exists #4261

11f6fca

pdurbin added Status: QA and removed Status: Development labels Oct 2, 2018

pdurbin removed their assignment Oct 2, 2018

kcondon self-assigned this Oct 3, 2018

pdurbin added a commit that referenced this issue Oct 3, 2018

Merge branch 'develop' into 4261-sitemap #4261

d3531c5

pdurbin added a commit that referenced this issue Oct 3, 2018

Merge branch 'develop' into 4261-sitemap #4261

c41fc16

pdurbin added a commit that referenced this issue Oct 4, 2018

typo: wrong directory for sitemap was documented #4261

c80dc43

kcondon added a commit that referenced this issue Oct 4, 2018

Merge pull request #5084 from IQSS/4261-sitemap

eed3ac1

Support for sitemaps #4261

kcondon closed this as completed Oct 4, 2018

kcondon removed the Status: QA label Oct 4, 2018

djbrooke added this to the 4.10 - Additional Data Transfer Options milestone Dec 11, 2018

PaulBoon mentioned this issue Aug 25, 2022

Handle more than 50,000 entries in the sitemap #8936

Closed
