Dataverse discovery in Google - Machine Readable Sitemaps #4261

Closed
eugene-barsky opened this issue Nov 6, 2017 · 9 comments

@eugene-barsky

Hello:

As per Philip's request, we sat in on a Google session about data discoverability, and Dr. Natasha Noy of Google mentioned Google's preference for sitemaps for better data indexing.

They also talked about Google's preference for schema.org markup on the landing pages.

Thanks,

Eugene
[Image: slide from the session]

@jggautier
Contributor

jggautier commented Nov 6, 2017

Thanks @eugene-barsky! Some more info for this issue: Google's "Build and submit a sitemap" guide

@pdurbin
Member

pdurbin commented Nov 7, 2017

Yep, as I mentioned at #2717 (comment), it was about 24 minutes into the video at https://www.rd-alliance.org/making-data-discoverable-web-search-engines that sitemaps were mentioned and the slide above appeared. Thanks for opening this issue, @eugene-barsky.

@mheppler
Contributor

mheppler commented Apr 2, 2018

In issue #4555 @pameyer commented:

"Why don't my datasets show up in google?" seems like a question that comes up relatively commonly (but completely out of scope for this issue).

This would be the appropriately scoped issue for that question.

@pdurbin pdurbin added the User Role: Depositor Creates datasets, uploads data, etc. label Jul 13, 2018
@djbrooke
Contributor

We'll estimate this and work on this. I'm removing the schema.org part of the title, since that was delivered in 4.8.4. We still need to add a sitemap. This is more important since Google Dataset Search is now a thing. :)

@djbrooke djbrooke changed the title Dataverse discovery in Google - Sitemaps and schema.org Dataverse discovery in Google - Sitemaps Sep 12, 2018
@djbrooke djbrooke added Status: This/Next Sprint and removed Status: Backlog User Role: Depositor Creates datasets, uploads data, etc. labels Sep 12, 2018
@djbrooke djbrooke self-assigned this Sep 19, 2018
@djbrooke djbrooke changed the title Dataverse discovery in Google - Sitemaps Dataverse discovery in Google - Machine Readable Sitemaps Sep 19, 2018
@djbrooke djbrooke removed their assignment Sep 19, 2018
@pdurbin pdurbin self-assigned this Sep 24, 2018
pdurbin added a commit that referenced this issue Sep 24, 2018
@pdurbin
Member

pdurbin commented Sep 24, 2018

I'm playing around with Documentation Driven Development (DDD), if that's a thing, by making pull request #5084 which for now is only a stub of the direction I think we're going. See d2ccf59

At standup I mentioned that for my family site I recently switched from Jekyll to Hugo, which creates a sitemap at http://thedurbins.com/sitemap.xml . Hugo creates an XML file but at https://support.google.com/webmasters/answer/183668 Google indicates they support multiple formats:

  • XML
  • RSS, mRSS, and Atom 1.0
  • Text

My assumption is that we want to create an XML file. I assume we'll be using an EJB timer to control how often the XML file is updated. If I'm misunderstanding any requirements, please advise.
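To make the XML option concrete, here is a minimal sketch of what generating such a file could look like. It is written in Python for brevity and is not the Dataverse implementation; `build_sitemap` and the example URLs are hypothetical. The namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML (bytes) for an iterable of (url, last_modified) pairs.

    last_modified is a date string such as "2018-09-24", or None to omit it.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        if last_modified:
            ET.SubElement(node, "lastmod").text = last_modified
    # Requires Python 3.8+ for the xml_declaration keyword.
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    # Hypothetical dataset landing pages, for illustration only.
    pages = [
        ("https://dataverse.example.edu/dataset.xhtml?persistentId=doi:10.5072/FK2/AAAAAA", "2018-09-24"),
        ("https://dataverse.example.edu/dataverse/root", None),
    ]
    print(build_sitemap(pages).decode("utf-8"))
```

Whatever schedules the job (an EJB timer, as assumed above, or something else) would only need to call a function like this and write the result to disk.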

pdurbin added a commit that referenced this issue Sep 24, 2018
pdurbin added a commit that referenced this issue Sep 25, 2018
pdurbin added a commit that referenced this issue Sep 25, 2018
pdurbin added a commit that referenced this issue Sep 26, 2018
pdurbin added a commit that referenced this issue Sep 27, 2018
@kcondon
Contributor

kcondon commented Oct 2, 2018

Sending back, here's your requested "Punch List" @pdurbin, mostly what we had discussed last evening:

  1. Need logging so admins can tell that something is running when the job takes longer than a few minutes, e.g. start, finish, object counts. Other ideas welcome.

  2. Running the endpoint on a copy of prod finished in around an hour and shows a sitemap page, but I'm not sure it actually finished correctly, since the output on the command line was different from that of a successful run:

"Phil, sitemap seems like it finished in under an hour but I was not watching it.
weirdly it said this:
[root@dvn-vm5 tmp]# curl -X POST http://localhost:8080/api/admin/sitemap

<?xml version='1.0' encoding='UTF-8' ?>

this is what it says when it works:
[root@dvn-vm4 tmp]# curl -X POST http://localhost:8080/api/admin/sitemap
{"status":"OK","data":{"message":"Sitemap updated."}}
however, there is now a sitemap.xml file on vm5"

  3. Make the operation non-blocking on the command line: my understanding is that the current blocking call continues to run anyway after Ctrl-C.

  4. Make the call check whether a run is already in progress and return "already running" rather than executing again. This provides feedback and avoids potentially corrupting the sitemap. Some suggested approaches:
    - Mozilla download model: write the sitemap to a different file until it is completed/verified, then copy it to sitemap.xml. The temp file acts as the lock to check.
    - DB table entry with a lock.
    - In-memory singleton with state to check whether a run is in progress.

  5. Document or provide a script to add as a cron job that would validate sitemap.xml after completion; otherwise we can't be sure whether it completed successfully or was interrupted by a service restart, etc.

  6. Document the fact that the file exists only on the node on which the endpoint was run, so in a multi web node environment this file needs to be shared or replicated.

Many of the above are design/usability issues rather than strictly QA functional testing.
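For the validation item in the list above, a minimal sketch of a check that a cron job could run is shown below. It is an illustration only, written in Python rather than as part of Dataverse; `validate_sitemap` is a hypothetical name, and the check covers well-formedness and the expected root element, not full schema validation.

```python
import xml.etree.ElementTree as ET

# Tag prefix for elements in the sitemaps.org namespace.
NS_PREFIX = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def validate_sitemap(path):
    """Return (ok, message) for the sitemap file at path.

    ok is False when the file is missing, is not well-formed XML
    (for example, truncated by an interrupted write), has an
    unexpected root element, or contains no <url> entries.
    """
    try:
        root = ET.parse(path).getroot()
    except (OSError, ET.ParseError) as err:
        return False, "cannot read or parse %s: %s" % (path, err)
    if root.tag != NS_PREFIX + "urlset":
        return False, "unexpected root element: %s" % root.tag
    count = len(root.findall(NS_PREFIX + "url"))
    if count == 0:
        return False, "sitemap contains no url entries"
    return True, "%d url entries" % count
```

A wrapper script could call `validate_sitemap` on the published file and exit with a nonzero status when `ok` is false, so that cron reports the failure.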

@pdurbin
Member

pdurbin commented Oct 2, 2018

@kcondon after we chatted a few minutes ago I added the final thing we agreed would be nice: feedback from the curl command if the staged file exists. Over to you. Thanks.
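The staged-file idea with feedback can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the code in the pull request: `update_sitemap` and the "already in progress" wording are hypothetical, while the `.staged` suffix and the "Sitemap updated." message are taken from the output quoted in this thread.

```python
import os


def update_sitemap(final_path, xml_bytes):
    """Write xml_bytes to final_path via a ".staged" file; return a status message."""
    staged_path = final_path + ".staged"
    try:
        # "x" mode creates the file exclusively and fails if it already exists,
        # so the staged file doubles as the "already running" lock.
        staged = open(staged_path, "xb")
    except FileExistsError:
        return "Sitemap update already in progress (staged file exists)."
    try:
        with staged:
            staged.write(xml_bytes)
        # os.replace is atomic when both paths are on the same filesystem,
        # so readers never see a half-written sitemap.xml.
        os.replace(staged_path, final_path)
    except BaseException:
        # Clean up so a failed run does not leave a stale lock behind.
        if os.path.exists(staged_path):
            os.remove(staged_path)
        raise
    return "Sitemap updated."
```

One caveat with this design: if the process is killed outright (a service restart, say), the staged file is left behind and has to be removed by hand before the next run will proceed.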

@kcondon kcondon self-assigned this Oct 3, 2018
@kcondon
Contributor

kcondon commented Oct 4, 2018

@pdurbin
Found a couple things:

  1. The path to sitemap.xml is listed differently in two places: the multiple web server instructions say (logos), while the sitemap instructions say (sitemap). Based on the log messages, the sitemap instructions appear to be correct.
  2. When no initial sitemap exists, the endpoint fails. From the server log:
[2018-10-04T15:09:57.296-0400] [glassfish 4.1] [INFO] [] [edu.harvard.iq.dataverse.sitemap.SiteMapUtil] [tid: _ThreadID=147 _ThreadName=__ejb-thread-pool5] [timeMillis: 1538680197296] [levelValue: 800] [[
  Writing staged sitemap to /usr/local/glassfish4/glassfish/domains/domain1/docroot/sitemap/sitemap.xml.staged]]

[2018-10-04T15:09:57.397-0400] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.sitemap.SiteMapUtil] [tid: _ThreadID=147 _ThreadName=__ejb-thread-pool5] [timeMillis: 1538680197397] [levelValue: 900] [[
  Unable to update sitemap! Unable to write staged sitemap to /usr/local/glassfish4/glassfish/domains/domain1/docroot/sitemap/sitemap.xml.staged. TransformerException: java.io.FileNotFoundException: /usr/local/glassfish4/glassfish/domains/domain1/docroot/sitemap/sitemap.xml.staged (No such file or directory)]] 
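One plausible reading of the log above is that the `sitemap` directory under docroot does not exist yet on a fresh installation; that is an assumption, not something the log confirms. Under that assumption, a minimal sketch of the usual remedy (in Python for illustration, with a hypothetical `write_staged` helper) is to create the parent directory before opening the staged file:

```python
import os


def write_staged(staged_path, xml_bytes):
    """Write xml_bytes to staged_path, creating missing parent directories first."""
    parent = os.path.dirname(staged_path)
    if parent:
        # exist_ok=True makes this a no-op when the directory is already there.
        os.makedirs(parent, exist_ok=True)
    with open(staged_path, "wb") as staged:
        staged.write(xml_bytes)
    return staged_path
```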
