Merge pull request #5084 from IQSS/4261-sitemap
Support for sitemaps #4261
kcondon authored Oct 4, 2018
2 parents 0b25f9d + c80dc43 commit eed3ac1
Showing 10 changed files with 453 additions and 1 deletion.
6 changes: 5 additions & 1 deletion doc/sphinx-guides/source/installation/advanced.rst
@@ -10,7 +10,11 @@ Advanced installations are not officially supported but here we are at least doc
Multiple Glassfish Servers
--------------------------

The main thing to know about running multiple Glassfish servers is that only one can be the dedicated timer server, as explained in the :doc:`/admin/timers` section of the Admin Guide.
You should be conscious of the following when running multiple Glassfish servers.

- Only one Glassfish server can be the dedicated timer server, as explained in the :doc:`/admin/timers` section of the Admin Guide.
- When users upload a logo for their dataverse using the "theme" feature described in the :doc:`/user/dataverse-management` section of the User Guide, these logos are stored only on the Glassfish server the user happened to be on when uploading the logo. By default, these logos are written to the directory ``/usr/local/glassfish4/glassfish/domains/domain1/docroot/logos``.
- When a sitemap is created by a Glassfish server, it is written to the filesystem of just that Glassfish server. By default, the sitemap is written to the directory ``/usr/local/glassfish4/glassfish/domains/domain1/docroot/sitemap``.

Detecting Which Glassfish Server a User Is On
+++++++++++++++++++++++++++++++++++++++++++++
22 changes: 22 additions & 0 deletions doc/sphinx-guides/source/installation/config.rst
@@ -423,6 +423,9 @@ Out of the box, Dataverse attempts to block search engines from crawling your in
Letting Search Engines Crawl Your Installation
++++++++++++++++++++++++++++++++++++++++++++++

Ensure robots.txt Is Not Blocking Search Engines
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For a public production Dataverse installation, you will probably want search engines to be able to index published pages (that is, pages visible to an unauthenticated user).
Polite crawlers usually respect the `Robots Exclusion Standard <https://en.wikipedia.org/wiki/Robots_exclusion_standard>`_; we have provided an example of a production robots.txt :download:`here </_static/util/robots.txt>`.

@@ -437,6 +440,25 @@ For more of an explanation of ``ProxyPassMatch`` see the :doc:`shibboleth` secti

If you are not fronting Glassfish with Apache you'll need to prevent Glassfish from serving the robots.txt file embedded in the war file by overwriting robots.txt after the war file has been deployed. The downside of this technique is that you will have to remember to overwrite robots.txt in the "exploded" war file each time you deploy the war file, which probably means each time you upgrade to a new version of Dataverse. Furthermore, since the version of Dataverse is always incrementing and the version can be part of the file path, you will need to be conscious of where on disk you need to replace the file. For example, for Dataverse 4.6.1 the path to robots.txt may be ``/usr/local/glassfish4/glassfish/domains/domain1/applications/dataverse-4.6.1/robots.txt`` with the version number ``4.6.1`` as part of the path.

Creating a Sitemap and Submitting it to Search Engines
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Search engines have an easier time indexing content when you provide them with a sitemap. The Dataverse sitemap includes URLs to all published dataverses and all published datasets that are not harvested or deaccessioned.

Create or update your sitemap by adding the following curl command to cron to run nightly or as you see fit:

``curl -X POST http://localhost:8080/api/admin/sitemap``

This will create or update a file in the following location unless you have customized your installation directory for Glassfish:

``/usr/local/glassfish4/glassfish/domains/domain1/docroot/sitemap/sitemap.xml``

On an installation of Dataverse with many datasets, creating or updating the sitemap can take a while. You can check Glassfish's server.log file for the "BEGIN updateSiteMap" and "END updateSiteMap" lines to see when the process started and stopped, along with any errors logged in between.
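
The log check could also be scripted. The following is a minimal sketch, not part of Dataverse: the class ``SitemapLogCheck`` and its method are hypothetical names, and it assumes the "BEGIN updateSiteMap" and "END updateSiteMap" markers appear in server.log as literal substrings.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper for illustration; not part of Dataverse.
class SitemapLogCheck {

    enum Status { NOT_STARTED, STARTED_NOT_FINISHED, FINISHED }

    // Reports the state of the most recent sitemap run found in the log lines.
    static Status lastRunStatus(List<String> logLines) {
        Status status = Status.NOT_STARTED;
        for (String line : logLines) {
            if (line.contains("BEGIN updateSiteMap")) {
                // A new run began; any earlier result is superseded.
                status = Status.STARTED_NOT_FINISHED;
            } else if (line.contains("END updateSiteMap")
                    && status == Status.STARTED_NOT_FINISHED) {
                status = Status.FINISHED;
            }
        }
        return status;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java SitemapLogCheck /path/to/server.log");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println(lastRunStatus(lines));
    }
}
```

A result of ``STARTED_NOT_FINISHED`` means the run is either still in progress or stopped early; in the latter case the warnings logged after the "BEGIN updateSiteMap" line should say why.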

https://demo.dataverse.org/sitemap.xml is the sitemap URL for the Dataverse Demo site and yours should be similar. Submit your sitemap URL to Google by following `Google's "submit a sitemap" instructions`_ or similar instructions for other search engines.

.. _Google's "submit a sitemap" instructions: https://support.google.com/webmasters/answer/183668

Putting Your Dataverse Installation on the Map at dataverse.org
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

31 changes: 31 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/api/SiteMap.java
@@ -0,0 +1,31 @@
package edu.harvard.iq.dataverse.api;

import edu.harvard.iq.dataverse.sitemap.SiteMapServiceBean;
import edu.harvard.iq.dataverse.sitemap.SiteMapUtil;
import javax.ejb.EJB;
import javax.ejb.Stateless;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Stateless
@Path("admin/sitemap")
public class SiteMap extends AbstractApiBean {

    @EJB
    SiteMapServiceBean siteMapSvc;

    @POST
    @Produces(MediaType.APPLICATION_JSON)
    public Response updateSiteMap() {
        boolean stageFileExists = SiteMapUtil.stageFileExists();
        if (stageFileExists) {
            return error(Response.Status.BAD_REQUEST, "Sitemap cannot be updated because staged file exists.");
        }
        siteMapSvc.updateSiteMap(dataverseSvc.findAll(), datasetSvc.findAll());
        return ok("Sitemap update has begun. Check logs for status.");
    }

}
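
The staged-file check in this endpoint mirrors the write, validate, and move sequence in `SiteMapUtil.updateSiteMap`. A minimal sketch of that pattern follows. It is an illustration under assumptions, not the commit's code: `StagedWriteSketch` and `publish` are hypothetical names, validation is reduced to a comment, and a temporary directory stands in for the Glassfish docroot.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration of the stage-then-move pattern; not Dataverse code.
class StagedWriteSketch {

    // Writes content to "<name>.staged", then moves it over "<name>".
    // Returns false without writing if a staged file already exists.
    static boolean publish(Path directory, String name, String content) throws IOException {
        Path staged = directory.resolve(name + ".staged");
        Path published = directory.resolve(name);
        if (Files.exists(staged)) {
            // A previous run is still in progress or stopped midway.
            return false;
        }
        Files.write(staged, content.getBytes(StandardCharsets.UTF_8));
        // Validation of the staged file would happen here, before the move.
        Files.move(staged, published, StandardCopyOption.REPLACE_EXISTING);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sitemap-sketch");
        System.out.println(publish(dir, "sitemap.xml", "<urlset/>"));
    }
}
```

Because `updateSiteMap` returns early on failure without deleting the staged file, a leftover `sitemap.xml.staged` blocks later runs until it is removed, which is what the logged warning asks the administrator to do.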
17 changes: 17 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/sitemap/SiteMapServiceBean.java
@@ -0,0 +1,17 @@
package edu.harvard.iq.dataverse.sitemap;

import edu.harvard.iq.dataverse.Dataset;
import edu.harvard.iq.dataverse.Dataverse;
import java.util.List;
import javax.ejb.Asynchronous;
import javax.ejb.Stateless;

@Stateless
public class SiteMapServiceBean {

    @Asynchronous
    public void updateSiteMap(List<Dataverse> dataverses, List<Dataset> datasets) {
        SiteMapUtil.updateSiteMap(dataverses, datasets);
    }

}
225 changes: 225 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/sitemap/SiteMapUtil.java
@@ -0,0 +1,225 @@
package edu.harvard.iq.dataverse.sitemap;

import edu.harvard.iq.dataverse.Dataset;
import edu.harvard.iq.dataverse.Dataverse;
import edu.harvard.iq.dataverse.DvObjectContainer;
import edu.harvard.iq.dataverse.util.SystemConfig;
import edu.harvard.iq.dataverse.util.xml.XmlValidator;
import java.io.File;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

public class SiteMapUtil {

    private static final Logger logger = Logger.getLogger(SiteMapUtil.class.getCanonicalName());

    static final String SITEMAP_FILENAME_FINAL = "sitemap.xml";
    static final String SITEMAP_FILENAME_STAGED = "sitemap.xml.staged";

    /**
     * TODO: Handle more than 50,000 entries in the sitemap.
     *
     * (As of this writing Harvard Dataverse only has ~3000 dataverses and
     * ~30,000 datasets.)
     *
     * "each Sitemap file that you provide must have no more than 50,000 URLs"
     * https://www.sitemaps.org/protocol.html
     *
     * Consider using a third party library: "One sitemap can contain a maximum
     * of 50,000 URLs. (Some sitemaps, like Google News sitemaps, can contain
     * only 1,000 URLs.) If you need to put more URLs than that in a sitemap,
     * you'll have to use a sitemap index file. Fortunately, WebSitemapGenerator
     * can manage the whole thing for you."
     * https://github.com/dfabulich/sitemapgen4j
     */
    public static void updateSiteMap(List<Dataverse> dataverses, List<Dataset> datasets) {

        logger.info("BEGIN updateSiteMap");

        String sitemapPathString = getSitemapPathString();
        String stagedSitemapPathAndFileString = sitemapPathString + File.separator + SITEMAP_FILENAME_STAGED;
        String finalSitemapPathAndFileString = sitemapPathString + File.separator + SITEMAP_FILENAME_FINAL;

        Path stagedPath = Paths.get(stagedSitemapPathAndFileString);
        if (Files.exists(stagedPath)) {
            logger.warning("Unable to update sitemap! The staged file from a previous run already existed. Delete " + stagedSitemapPathAndFileString + " and try again.");
            return;
        }

        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder documentBuilder = null;
        try {
            documentBuilder = documentBuilderFactory.newDocumentBuilder();
        } catch (ParserConfigurationException ex) {
            logger.warning("Unable to update sitemap! ParserConfigurationException: " + ex.getLocalizedMessage());
            return;
        }
        Document document = documentBuilder.newDocument();

        Element urlSet = document.createElement("urlset");
        urlSet.setAttribute("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9");
        urlSet.setAttribute("xmlns:xhtml", "http://www.w3.org/1999/xhtml");
        document.appendChild(urlSet);

        for (Dataverse dataverse : dataverses) {
            if (!dataverse.isReleased()) {
                continue;
            }
            Element url = document.createElement("url");
            urlSet.appendChild(url);

            Element loc = document.createElement("loc");
            String dataverseAlias = dataverse.getAlias();
            loc.appendChild(document.createTextNode(SystemConfig.getDataverseSiteUrlStatic() + "/dataverse/" + dataverseAlias));
            url.appendChild(loc);

            Element lastmod = document.createElement("lastmod");
            lastmod.appendChild(document.createTextNode(getLastModDate(dataverse)));
            url.appendChild(lastmod);
        }

        for (Dataset dataset : datasets) {
            if (!dataset.isReleased()) {
                continue;
            }
            if (dataset.isHarvested()) {
                continue;
            }
            // The deaccessioned check is last because it has to iterate through dataset versions.
            if (dataset.isDeaccessioned()) {
                continue;
            }
            Element url = document.createElement("url");
            urlSet.appendChild(url);

            Element loc = document.createElement("loc");
            String datasetPid = dataset.getGlobalId().asString();
            loc.appendChild(document.createTextNode(SystemConfig.getDataverseSiteUrlStatic() + "/dataset.xhtml?persistentId=" + datasetPid));
            url.appendChild(loc);

            Element lastmod = document.createElement("lastmod");
            lastmod.appendChild(document.createTextNode(getLastModDate(dataset)));
            url.appendChild(lastmod);
        }

        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = null;
        try {
            transformer = transformerFactory.newTransformer();
        } catch (TransformerConfigurationException ex) {
            logger.warning("Unable to update sitemap! TransformerConfigurationException: " + ex.getLocalizedMessage());
            return;
        }
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
        DOMSource source = new DOMSource(document);
        File directory = new File(sitemapPathString);
        if (!directory.exists()) {
            directory.mkdir();
        }

        boolean debug = false;
        if (debug) {
            logger.info("Writing sitemap to console/logs");
            StreamResult consoleResult = new StreamResult(System.out);
            try {
                transformer.transform(source, consoleResult);
            } catch (TransformerException ex) {
                logger.warning("Unable to print sitemap to the console: " + ex.getLocalizedMessage());
            }
        }

        logger.info("Writing staged sitemap to " + stagedSitemapPathAndFileString);
        StreamResult result = new StreamResult(new File(stagedSitemapPathAndFileString));
        try {
            transformer.transform(source, result);
        } catch (TransformerException ex) {
            logger.warning("Unable to update sitemap! Unable to write staged sitemap to " + stagedSitemapPathAndFileString + ". TransformerException: " + ex.getLocalizedMessage());
            return;
        }

        logger.info("Checking staged sitemap for well-formedness. The staged file is " + stagedSitemapPathAndFileString);
        try {
            XmlValidator.validateXmlWellFormed(stagedSitemapPathAndFileString);
        } catch (Exception ex) {
            logger.warning("Unable to update sitemap! Staged sitemap file is not well-formed XML! The exception for " + stagedSitemapPathAndFileString + " is " + ex.getLocalizedMessage());
            return;
        }

        logger.info("Checking staged sitemap against XML schema. The staged file is " + stagedSitemapPathAndFileString);
        URL schemaUrl = null;
        try {
            schemaUrl = new URL("https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd");
        } catch (MalformedURLException ex) {
            // This URL is hard coded and it's fine. We should never get MalformedURLException so we just swallow the exception and carry on.
        }
        try {
            XmlValidator.validateXmlSchema(stagedSitemapPathAndFileString, schemaUrl);
        } catch (SAXException | IOException ex) {
            logger.warning("Unable to update sitemap! Exception caught while checking XML staged file (" + stagedSitemapPathAndFileString + " ) against XML schema: " + ex.getLocalizedMessage());
            return;
        }

        Path finalPath = Paths.get(finalSitemapPathAndFileString);
        logger.info("Copying staged sitemap from " + stagedSitemapPathAndFileString + " to " + finalSitemapPathAndFileString);
        try {
            Files.move(stagedPath, finalPath, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException ex) {
            logger.warning("Unable to update sitemap! Unable to copy staged sitemap from " + stagedSitemapPathAndFileString + " to " + finalSitemapPathAndFileString + ". IOException: " + ex.getLocalizedMessage());
            return;
        }

        logger.info("END updateSiteMap");
    }

    private static String getLastModDate(DvObjectContainer dvObjectContainer) {
        // TODO: Decide if YYYY-MM-DD is enough. https://www.sitemaps.org/protocol.html
        // says "The date of last modification of the file. This date should be in W3C Datetime format.
        // This format allows you to omit the time portion, if desired, and use YYYY-MM-DD."
        return new SimpleDateFormat("yyyy-MM-dd").format(dvObjectContainer.getModificationTime());
    }

    public static boolean stageFileExists() {
        String sitemapPathString = getSitemapPathString();
        String stagedSitemapPathAndFileString = sitemapPathString + File.separator + SITEMAP_FILENAME_STAGED;
        Path stagedPath = Paths.get(stagedSitemapPathAndFileString);
        if (Files.exists(stagedPath)) {
            logger.warning("Unable to update sitemap! The staged file from a previous run already existed. Delete " + stagedSitemapPathAndFileString + " and try again.");
            return true;
        }
        return false;
    }

    private static String getSitemapPathString() {
        String sitemapPathString = "/tmp";
        // i.e. /usr/local/glassfish4/glassfish/domains/domain1
        String domainRoot = System.getProperty("com.sun.aas.instanceRoot");
        if (domainRoot != null) {
            // Note that we write to a directory called "sitemap" but we serve just "/sitemap.xml" using PrettyFaces.
            sitemapPathString = domainRoot + File.separator + "docroot" + File.separator + "sitemap";
        }
        return sitemapPathString;

    }
}
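
The 50,000 URL limit noted in the TODO on `updateSiteMap` could be handled by splitting the URL list into chunks, writing one sitemap file per chunk, and listing those files in a sitemap index file. The sketch below shows only the splitting step. `SitemapChunkSketch` and `partition` are hypothetical names, the example.org URLs are placeholders, and this is neither what the commit does nor how sitemapgen4j works internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of this commit.
class SitemapChunkSketch {

    // Per-file limit quoted from https://www.sitemaps.org/protocol.html
    static final int MAX_URLS_PER_SITEMAP = 50_000;

    // Splits items into consecutive chunks of at most chunkSize elements.
    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            // Copy the view so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 120_000; i++) {
            urls.add("https://example.org/dataset/" + i);
        }
        List<List<String>> chunks = partition(urls, MAX_URLS_PER_SITEMAP);
        System.out.println(chunks.size()); // prints 3
    }
}
```

Each chunk would then be written to its own file with the same DOM and transformer steps used in `updateSiteMap`.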
1 change: 1 addition & 0 deletions src/main/webapp/WEB-INF/glassfish-web.xml
@@ -11,5 +11,6 @@
<property name="alternatedocroot_1" value="from=/guides/* dir=./docroot"/>
<property name="alternatedocroot_2" value="from=/dataexplore/* dir=./docroot"/>
<property name="alternatedocroot_logos" value="from=/logos/* dir=./docroot"/>
<property name="alternatedocroot_sitemap" value="from=/sitemap/* dir=./docroot"/>
<parameter-encoding default-charset="UTF-8"/>
</glassfish-web-app>
5 changes: 5 additions & 0 deletions src/main/webapp/WEB-INF/pretty-config.xml
@@ -17,4 +17,9 @@
<view-id value="/search/advanced.xhtml" />
</url-mapping>

<url-mapping id="sitemap">
<pattern value="/sitemap.xml" />
<view-id value="/sitemap/sitemap.xml" />
</url-mapping>

</pretty-config>
23 changes: 23 additions & 0 deletions src/test/java/edu/harvard/iq/dataverse/api/SiteMapIT.java
@@ -0,0 +1,23 @@
package edu.harvard.iq.dataverse.api;

import com.jayway.restassured.RestAssured;
import org.junit.BeforeClass;
import org.junit.Test;
import com.jayway.restassured.response.Response;

public class SiteMapIT {

    @BeforeClass
    public static void setUpClass() {
        RestAssured.baseURI = UtilIT.getRestAssuredBaseUri();
    }

    @Test
    public void testSiteMap() {
        Response response = UtilIT.sitemapUpdate();
        response.prettyPrint();
        Response download = UtilIT.sitemapDownload();
        download.prettyPrint();
    }

}
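
This test only prints the two responses. A check on the downloaded body might look like the sketch below, which parses the XML, confirms the root element is `urlset`, and counts the `loc` entries. `SitemapXmlCheck` and `countLocs` are hypothetical names and not part of this commit; in the integration test, the XML string would presumably come from the body of the `download` response.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper for illustration; not part of this commit.
class SitemapXmlCheck {

    // Returns the number of <loc> elements in a sitemap document.
    // Throws if the XML cannot be parsed or the root element is not <urlset>.
    static int countLocs(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        String root = doc.getDocumentElement().getNodeName();
        if (!"urlset".equals(root)) {
            throw new IllegalArgumentException("Unexpected root element: " + root);
        }
        return doc.getElementsByTagName("loc").getLength();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>https://example.org/dataverse/root</loc>"
                + "<lastmod>2018-10-04</lastmod></url>"
                + "</urlset>";
        System.out.println(countLocs(xml)); // prints 1
    }
}
```

Since the sitemap may legitimately be empty on a fresh installation, an assertion on the root element is likely more robust than one on the entry count.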
10 changes: 10 additions & 0 deletions src/test/java/edu/harvard/iq/dataverse/api/UtilIT.java
@@ -1612,6 +1612,16 @@ static Response clearMetricCache() {
        return requestSpecification.delete("/api/admin/clearMetricsCache");
    }

    static Response sitemapUpdate() {
        return given()
                .post("/api/admin/sitemap");
    }

    static Response sitemapDownload() {
        return given()
                .get("/sitemap.xml");
    }

    @Test
    public void testGetFileIdFromSwordStatementWithNoFiles() {
        String swordStatementWithNoFiles = "<feed xmlns=\"http://www.w3.org/2005/Atom\">\n"