Merge pull request #32 from cicirello/drop-html-extension

Option to drop html extension from urls in sitemap
cicirello · Jun 28, 2021 · 7ba44bb · 7ba44bb
2 parents a7370bb + 6e7d70b
commit 7ba44bb
Show file tree

Hide file tree

Showing 7 changed files with 236 additions and 18 deletions.
diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
@@ -59,5 +59,36 @@ jobs:
         echo "url-count = ${{ steps.integration2.outputs.url-count }}"
         echo "excluded-count = ${{ steps.integration2.outputs.excluded-count }}"
 
+    - name: Integration test 3
+      id: integration3
+      uses: ./
+      with:
+        path-to-root: tests/subdir
+        base-url-path: https://TESTING.FAKE.WEB.ADDRESS.TESTING/
+        drop-html-extension: true
+
+    - name: Output stats test 3
+      run: |
+        echo "sitemap-path = ${{ steps.integration3.outputs.sitemap-path }}"
+        echo "url-count = ${{ steps.integration3.outputs.url-count }}"
+        echo "excluded-count = ${{ steps.integration3.outputs.excluded-count }}"
+
+    - name: Integration test 4
+      id: integration4
+      uses: ./
+      with:
+        path-to-root: tests/subdir
+        base-url-path: https://TESTING.FAKE.WEB.ADDRESS.TESTING/
+        sitemap-format: txt 
+        additional-extensions: docx pptx
+        drop-html-extension: true
+
+    - name: Output stats test 4
+      run: |
+        echo "sitemap-path = ${{ steps.integration4.outputs.sitemap-path }}"
+        echo "url-count = ${{ steps.integration4.outputs.url-count }}"
+        echo "excluded-count = ${{ steps.integration4.outputs.excluded-count }}"
+
     - name: Verify integration test results
       run: python3 -u -m unittest tests/integration.py
+
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,13 +4,11 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## [Unreleased] - 2021-05-20
+## [Unreleased] - 2021-06-28
 
 ### Added
 
 ### Changed
-* Use major release tag when pulling base docker image (e.g., automatically get non-breaking
-  changes to base image, such as bug fixes, etc without need to update Dockerfile).
 
 ### Deprecated
 
@@ -21,6 +19,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### CI/CD
 
 
+## [1.8.0] - 2021-06-28
+
+### Added
+* Added option to exclude `.html` from URLs listed in the sitemap 
+  for html files. GitHub Pages automatically serves a corresponding
+  html file if a user browses to a page with a URL with no file extension.
+  This new option to the `generate-sitemap` action enables your sitemap to
+  match this behavior if you prefer the extension-less look of URLs. There
+  is a new action input, `drop-html-extension`, to control this behavior.
+
+### Changed
+* Use major release tag when pulling base docker image (e.g., 
+  automatically get non-breaking changes to base image, such as 
+  bug fixes, etc without need to update Dockerfile).
+
+
 ## [1.7.2] - 2021-05-13
 
 ### Changed

diff --git a/README.md b/README.md
@@ -32,7 +32,8 @@ Pages, and has the following features:
 * It assumes that for files with the name `index.html` that the preferred URL for the page
   ends with the enclosing directory, leaving out the `index.html`. For example,
   instead of `https://WEBSITE/PATH/index.html`, the sitemap will contain
-  `https://WEBSITE/PATH/` in such a case.  
+  `https://WEBSITE/PATH/` in such a case. 
+* Provides option to exclude `.html` extension from URLs listed in sitemap.   
 
 The generate-sitemap GitHub action is designed to be used 
 in combination with other GitHub Actions. For example, it 
@@ -133,6 +134,22 @@ that are generated using the last commit dates of each file. Setting
 this input to anything other than `xml` will generate a plain text 
 `sitemap.txt` simply listing the urls.
 
+### `drop-html-extension`
+
+The `drop-html-extension` input provides the option to exclude `.html` extension 
+from URLs listed in the sitemap. The default is `drop-html-extension: false`.  If
+you want to use this option, just pass `drop-html-extension: true` to the action in
+your workflow. GitHub Pages automatically serves the 
+corresponding html file if URL has no file extension. For example, if a user
+of your site browses to the URL, `https://WEBSITE/PATH/filename` (with no extension),
+GitHub Pages automatically serves `https://WEBSITE/PATH/filename.html` if it exists.
+The default behavior of the `generate-sitemap` action includes the `.html` extension
+for pages where the filename has the `.html` extension. If you prefer to exclude the
+`.html` extension from the URLs in your sitemap, then 
+pass `drop-html-extension: true` to the action in your workflow. 
+Note that you should also ensure that any canonical links that you list within
+the html files corresponds to your choice here.  
+
 ## Outputs
 
 ### `sitemap-path`
@@ -172,7 +189,7 @@ you can also use a specific version such as with:
 
 ```yml
     - name: Generate the sitemap
-      uses: cicirello/generate-sitemap@v1.7.2
+      uses: cicirello/generate-sitemap@v1.8.0
       with:
         base-url-path: https://THE.URL.TO.YOUR.PAGE/
 ```

diff --git a/action.yml b/action.yml
@@ -53,6 +53,10 @@ inputs:
     description: 'Space separated list of additional file extensions to include in sitemap.'
     required: false
     default: ''
+  drop-html-extension:
+    description: 'Enables dropping .html from urls in sitemap.'
+    required: false
+    default: false
 outputs:
   sitemap-path: 
     description: 'The path to the generated sitemap file.'
@@ -70,3 +74,4 @@ runs:
     - ${{ inputs.include-pdf }}
     - ${{ inputs.sitemap-format }}
     - ${{ inputs.additional-extensions }}
+    - ${{ inputs.drop-html-extension }}
diff --git a/generatesitemap.py b/generatesitemap.py
@@ -50,28 +50,32 @@ def gatherfiles(extensionsToInclude) :
                 allfiles.append(os.path.join(root, f))
     return allfiles
 
-def sortname(f) :
+def sortname(f, dropExtension=False) :
     """Partial url to sort by, which strips out the filename
     if the filename is index.html.
 
     Keyword arguments:
     f - Filename with path
+    dropExtension - true to drop extensions of .html from the filename when sorting
     """
     if len(f) >= 11 and f[-11:] == "/index.html" :
         return f[:-10]
     elif f == "index.html" :
         return ""
+    elif dropExtension and len(f) >= 5 and f[-5:] == ".html" :
+        return f[:-5]
     else :
         return f
 
-def urlsort(files) :
+def urlsort(files, dropExtension=False) :
     """Sorts the urls with a primary sort by depth in the website,
     and a secondary sort alphabetically.
 
     Keyword arguments:
     files - list of files to include in sitemap
+    dropExtension - true to drop extensions of .html from the filename when sorting
     """
-    files.sort(key = lambda f : sortname(f))
+    files.sort(key = lambda f : sortname(f, dropExtension))
     files.sort(key = lambda f : f.count("/"))
 
 def hasMetaRobotsNoindex(f) :
@@ -207,12 +211,13 @@ def lastmod(f) :
         mod = datetime.now().astimezone().replace(microsecond=0).isoformat()
     return mod
 
-def urlstring(f, baseUrl) :
+def urlstring(f, baseUrl, dropExtension=False) :
     """Forms a string with the full url from a filename and base url.
 
     Keyword arguments:
     f - filename
     baseUrl - address of the root of the website
+    dropExtension - true to drop extensions of .html from the filename in urls
     """
     if f[0]=="." :
         u = f[1:]
@@ -222,6 +227,8 @@ def urlstring(f, baseUrl) :
         u = u[:-10]
     elif u == "index.html" :
         u = ""
+    elif dropExtension and len(u) >= 5 and u[-5:] == ".html" :
+        u = u[:-5]
     if len(u) >= 1 and u[0]=="/" and len(baseUrl) >= 1 and baseUrl[-1]=="/" :
         u = u[1:]
     elif (len(u)==0 or u[0]!="/") and (len(baseUrl)==0 or baseUrl[-1]!="/") :
@@ -233,41 +240,44 @@ def urlstring(f, baseUrl) :
 <lastmod>{1}</lastmod>
 </url>"""	
 
-def xmlSitemapEntry(f, baseUrl, dateString) :
+def xmlSitemapEntry(f, baseUrl, dateString, dropExtension=False) :
     """Forms a string with an entry formatted for an xml sitemap
     including lastmod date.
 
     Keyword arguments:
     f - filename
     baseUrl - address of the root of the website
     dateString - lastmod date correctly formatted
+    dropExtension - true to drop extensions of .html from the filename in urls
     """
-    return xmlSitemapEntryTemplate.format(urlstring(f, baseUrl), dateString)
+    return xmlSitemapEntryTemplate.format(urlstring(f, baseUrl, dropExtension), dateString)
 
-def writeTextSitemap(files, baseUrl) :
+def writeTextSitemap(files, baseUrl, dropExtension=False) :
     """Writes a plain text sitemap to the file sitemap.txt.
 
     Keyword Arguments:
     files - a list of filenames
     baseUrl - the base url to the root of the website
+    dropExtension - true to drop extensions of .html from the filename in urls
     """
     with open("sitemap.txt", "w") as sitemap :
         for f in files :
-            sitemap.write(urlstring(f, baseUrl))
+            sitemap.write(urlstring(f, baseUrl, dropExtension))
             sitemap.write("\n")
 
-def writeXmlSitemap(files, baseUrl) :
+def writeXmlSitemap(files, baseUrl, dropExtension=False) :
     """Writes an xml sitemap to the file sitemap.xml.
 
     Keyword Arguments:
     files - a list of filenames
     baseUrl - the base url to the root of the website
+    dropExtension - true to drop extensions of .html from the filename in urls
     """
     with open("sitemap.xml", "w") as sitemap :
         sitemap.write('<?xml version="1.0" encoding="UTF-8"?>\n')
         sitemap.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
         for f in files :
-            sitemap.write(xmlSitemapEntry(f, baseUrl, lastmod(f)))
+            sitemap.write(xmlSitemapEntry(f, baseUrl, lastmod(f), dropExtension))
             sitemap.write("\n")
         sitemap.write('</urlset>\n')
 
@@ -279,22 +289,23 @@ def writeXmlSitemap(files, baseUrl) :
     includePDF = sys.argv[4]=="true"
     sitemapFormat = sys.argv[5]
     additionalExt = set(sys.argv[6].lower().replace(",", " ").replace(".", " ").split())
+    dropExtension = sys.argv[7]=="true"
 
     os.chdir(websiteRoot)
     blockedPaths = parseRobotsTxt()
 
     allFiles = gatherfiles(createExtensionSet(includeHTML, includePDF, additionalExt))
     files = [ f for f in allFiles if not robotsBlocked(f, blockedPaths) ]
-    urlsort(files)
+    urlsort(files, dropExtension)
 
     pathToSitemap = websiteRoot
     if pathToSitemap[-1] != "/" :
         pathToSitemap += "/"
     if sitemapFormat == "xml" :
-        writeXmlSitemap(files, baseUrl)
+        writeXmlSitemap(files, baseUrl, dropExtension)
         pathToSitemap += "sitemap.xml"
     else :
-        writeTextSitemap(files, baseUrl)
+        writeTextSitemap(files, baseUrl, dropExtension)
         pathToSitemap += "sitemap.txt"
 
     print("::set-output name=sitemap-path::" + pathToSitemap)

diff --git a/tests/integration.py b/tests/integration.py
@@ -95,3 +95,45 @@ def testIntegrationWithAdditionalTypes(self) :
                      }
         self.assertEqual(expected, urlset)
 
+    def testIntegrationDropHtmlExtension(self) :
+        urlset = set()
+        with open("tests/subdir/sitemap.xml","r") as f :
+            for line in f :
+                i = line.find("<loc>")
+                if i >= 0 :
+                    i += 5
+                    j = line.find("</loc>", i)
+                    if j >= 0 :
+                        urlset.add(line[i:j].strip())
+                    else :
+                        self.fail("No closing </loc>")
+                i = line.find("<lastmod>")
+                if i >= 0 :
+                    i += 9
+                    j = line.find("</lastmod>", i)
+                    if j >= 0 :
+                        self.assertTrue(validateDate(line[i:j].strip()))
+                    else :
+                        self.fail("No closing </lastmod>")
+
+        expected = { "https://TESTING.FAKE.WEB.ADDRESS.TESTING/a",
+                     "https://TESTING.FAKE.WEB.ADDRESS.TESTING/y.pdf",
+                     "https://TESTING.FAKE.WEB.ADDRESS.TESTING/subdir/b",
+                     "https://TESTING.FAKE.WEB.ADDRESS.TESTING/subdir/z.pdf"
+                     }
+        self.assertEqual(expected, urlset)
+
+    def testIntegrationWithAdditionalTypesDropHtmlExtension(self) :
+        urlset = set()
+        with open("tests/subdir/sitemap.txt","r") as f :
+            for line in f :
+                line = line.strip()
+                if len(line) > 0 :
+                    urlset.add(line)
+        expected = { "https://TESTING.FAKE.WEB.ADDRESS.TESTING/a",
+                     "https://TESTING.FAKE.WEB.ADDRESS.TESTING/y.pdf",
+                     "https://TESTING.FAKE.WEB.ADDRESS.TESTING/subdir/b",
+                     "https://TESTING.FAKE.WEB.ADDRESS.TESTING/subdir/z.pdf"
+                     }
+        self.assertEqual(expected, urlset)
+