Merge pull request #32 from cicirello/drop-html-extension
Option to drop html extension from urls in sitemap
cicirello authored Jun 28, 2021
2 parents a7370bb + 6e7d70b commit 7ba44bb
Showing 7 changed files with 236 additions and 18 deletions.
31 changes: 31 additions & 0 deletions .github/workflows/build.yml
@@ -59,5 +59,36 @@ jobs:
         echo "url-count = ${{ steps.integration2.outputs.url-count }}"
         echo "excluded-count = ${{ steps.integration2.outputs.excluded-count }}"

+    - name: Integration test 3
+      id: integration3
+      uses: ./
+      with:
+        path-to-root: tests/subdir
+        base-url-path: https://TESTING.FAKE.WEB.ADDRESS.TESTING/
+        drop-html-extension: true
+
+    - name: Output stats test 3
+      run: |
+        echo "sitemap-path = ${{ steps.integration3.outputs.sitemap-path }}"
+        echo "url-count = ${{ steps.integration3.outputs.url-count }}"
+        echo "excluded-count = ${{ steps.integration3.outputs.excluded-count }}"
+    - name: Integration test 4
+      id: integration4
+      uses: ./
+      with:
+        path-to-root: tests/subdir
+        base-url-path: https://TESTING.FAKE.WEB.ADDRESS.TESTING/
+        sitemap-format: txt
+        additional-extensions: docx pptx
+        drop-html-extension: true
+
+    - name: Output stats test 4
+      run: |
+        echo "sitemap-path = ${{ steps.integration4.outputs.sitemap-path }}"
+        echo "url-count = ${{ steps.integration4.outputs.url-count }}"
+        echo "excluded-count = ${{ steps.integration4.outputs.excluded-count }}"
     - name: Verify integration test results
       run: python3 -u -m unittest tests/integration.py

20 changes: 17 additions & 3 deletions CHANGELOG.md
@@ -4,13 +4,11 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

-## [Unreleased] - 2021-05-20
+## [Unreleased] - 2021-06-28

 ### Added

 ### Changed
-* Use major release tag when pulling base docker image (e.g., automatically get non-breaking
-changes to base image, such as bug fixes, etc without need to update Dockerfile).

 ### Deprecated

@@ -21,6 +19,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### CI/CD


+## [1.8.0] - 2021-06-28
+
+### Added
+* Added option to exclude `.html` from URLs listed in the sitemap
+for html files. GitHub Pages automatically serves a corresponding
+html file if a user browses to a page with a URL with no file extension.
+This new option to the `generate-sitemap` action enables your sitemap to
+match this behavior if you prefer the extension-less look of URLs. There
+is a new action input, `drop-html-extension`, to control this behavior.
+
+### Changed
+* Use major release tag when pulling base docker image (e.g.,
+automatically get non-breaking changes to base image, such as
+bug fixes, etc without need to update Dockerfile).
+
+
 ## [1.7.2] - 2021-05-13

 ### Changed
21 changes: 19 additions & 2 deletions README.md
@@ -32,7 +32,8 @@ Pages, and has the following features:
 * It assumes that for files with the name `index.html` that the preferred URL for the page
 ends with the enclosing directory, leaving out the `index.html`. For example,
 instead of `https://WEBSITE/PATH/index.html`, the sitemap will contain
-`https://WEBSITE/PATH/` in such a case.
+`https://WEBSITE/PATH/` in such a case.
+* Provides option to exclude `.html` extension from URLs listed in sitemap.

 The generate-sitemap GitHub action is designed to be used
 in combination with other GitHub Actions. For example, it
@@ -133,6 +134,22 @@ that are generated using the last commit dates of each file. Setting
 this input to anything other than `xml` will generate a plain text
 `sitemap.txt` simply listing the urls.

+### `drop-html-extension`
+
+The `drop-html-extension` input provides the option to exclude the `.html` extension
+from URLs listed in the sitemap. The default is `drop-html-extension: false`. If
+you want to use this option, just pass `drop-html-extension: true` to the action in
+your workflow. GitHub Pages automatically serves the
+corresponding html file if a URL has no file extension. For example, if a user
+of your site browses to the URL `https://WEBSITE/PATH/filename` (with no extension),
+GitHub Pages automatically serves `https://WEBSITE/PATH/filename.html` if it exists.
+The default behavior of the `generate-sitemap` action includes the `.html` extension
+for pages where the filename has the `.html` extension. If you prefer to exclude the
+`.html` extension from the URLs in your sitemap, then
+pass `drop-html-extension: true` to the action in your workflow.
+Note that you should also ensure that any canonical links that you list within
+the html files correspond to your choice here.
+
 ## Outputs

 ### `sitemap-path`
@@ -172,7 +189,7 @@ you can also use a specific version such as with:

 ```yml
     - name: Generate the sitemap
-      uses: cicirello/generate-sitemap@v1.7.2
+      uses: cicirello/generate-sitemap@v1.8.0
       with:
         base-url-path: https://THE.URL.TO.YOUR.PAGE/
 ```
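
As an illustration of the `drop-html-extension` behavior described in the README hunk above, here is a minimal Python sketch of the filename-to-URL mapping. The helper name `sitemap_url` and its parameters are hypothetical and written only for this note. It is a simplified restatement of the documented rules (`index.html` collapses to its directory, `.html` is dropped only when the option is on), not the action's own code.

```python
def sitemap_url(path: str, base_url: str, drop_html_extension: bool = False) -> str:
    """Hypothetical helper: map a site-relative file path to a sitemap URL."""
    if path == "index.html" or path.endswith("/index.html"):
        # index.html is always listed as its enclosing directory.
        path = path[: -len("index.html")]
    elif drop_html_extension and path.endswith(".html"):
        # Only .html is dropped; other extensions (e.g. .pdf) are kept.
        path = path[: -len(".html")]
    # Join with exactly one slash between base URL and path.
    return base_url.rstrip("/") + "/" + path.lstrip("/")


base = "https://WEBSITE/"
print(sitemap_url("PATH/filename.html", base))
print(sitemap_url("PATH/filename.html", base, drop_html_extension=True))
print(sitemap_url("PATH/index.html", base, drop_html_extension=True))
print(sitemap_url("PATH/doc.pdf", base, drop_html_extension=True))
```

Under these assumptions, only the second call changes relative to the default behavior: `filename.html` is listed as `filename`, while the `index.html` and PDF entries are unaffected by the option.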
5 changes: 5 additions & 0 deletions action.yml
@@ -53,6 +53,10 @@ inputs:
     description: 'Space separated list of additional file extensions to include in sitemap.'
     required: false
     default: ''
+  drop-html-extension:
+    description: 'Enables dropping .html from urls in sitemap.'
+    required: false
+    default: false
 outputs:
   sitemap-path:
     description: 'The path to the generated sitemap file.'
@@ -70,3 +74,4 @@ runs:
     - ${{ inputs.include-pdf }}
     - ${{ inputs.sitemap-format }}
     - ${{ inputs.additional-extensions }}
+    - ${{ inputs.drop-html-extension }}
37 changes: 24 additions & 13 deletions generatesitemap.py
@@ -50,28 +50,32 @@ def gatherfiles(extensionsToInclude) :
                 allfiles.append(os.path.join(root, f))
     return allfiles

-def sortname(f) :
+def sortname(f, dropExtension=False) :
     """Partial url to sort by, which strips out the filename
     if the filename is index.html.

     Keyword arguments:
     f - Filename with path
+    dropExtension - true to drop extensions of .html from the filename when sorting
     """
     if len(f) >= 11 and f[-11:] == "/index.html" :
         return f[:-10]
     elif f == "index.html" :
         return ""
+    elif dropExtension and len(f) >= 5 and f[-5:] == ".html" :
+        return f[:-5]
     else :
         return f

-def urlsort(files) :
+def urlsort(files, dropExtension=False) :
     """Sorts the urls with a primary sort by depth in the website,
     and a secondary sort alphabetically.

     Keyword arguments:
     files - list of files to include in sitemap
+    dropExtension - true to drop extensions of .html from the filename when sorting
     """
-    files.sort(key = lambda f : sortname(f))
+    files.sort(key = lambda f : sortname(f, dropExtension))
     files.sort(key = lambda f : f.count("/"))

 def hasMetaRobotsNoindex(f) :
@@ -207,12 +211,13 @@ def lastmod(f) :
         mod = datetime.now().astimezone().replace(microsecond=0).isoformat()
     return mod

-def urlstring(f, baseUrl) :
+def urlstring(f, baseUrl, dropExtension=False) :
     """Forms a string with the full url from a filename and base url.

     Keyword arguments:
     f - filename
     baseUrl - address of the root of the website
+    dropExtension - true to drop extensions of .html from the filename in urls
     """
     if f[0]=="." :
         u = f[1:]
@@ -222,6 +227,8 @@ def urlstring(f, baseUrl) :
         u = u[:-10]
     elif u == "index.html" :
         u = ""
+    elif dropExtension and len(u) >= 5 and u[-5:] == ".html" :
+        u = u[:-5]
     if len(u) >= 1 and u[0]=="/" and len(baseUrl) >= 1 and baseUrl[-1]=="/" :
         u = u[1:]
     elif (len(u)==0 or u[0]!="/") and (len(baseUrl)==0 or baseUrl[-1]!="/") :
@@ -233,41 +240,44 @@ def urlstring(f, baseUrl) :
 <lastmod>{1}</lastmod>
 </url>"""

-def xmlSitemapEntry(f, baseUrl, dateString) :
+def xmlSitemapEntry(f, baseUrl, dateString, dropExtension=False) :
     """Forms a string with an entry formatted for an xml sitemap
     including lastmod date.

     Keyword arguments:
     f - filename
     baseUrl - address of the root of the website
     dateString - lastmod date correctly formatted
+    dropExtension - true to drop extensions of .html from the filename in urls
     """
-    return xmlSitemapEntryTemplate.format(urlstring(f, baseUrl), dateString)
+    return xmlSitemapEntryTemplate.format(urlstring(f, baseUrl, dropExtension), dateString)

-def writeTextSitemap(files, baseUrl) :
+def writeTextSitemap(files, baseUrl, dropExtension=False) :
     """Writes a plain text sitemap to the file sitemap.txt.

     Keyword Arguments:
     files - a list of filenames
     baseUrl - the base url to the root of the website
+    dropExtension - true to drop extensions of .html from the filename in urls
     """
     with open("sitemap.txt", "w") as sitemap :
         for f in files :
-            sitemap.write(urlstring(f, baseUrl))
+            sitemap.write(urlstring(f, baseUrl, dropExtension))
             sitemap.write("\n")

-def writeXmlSitemap(files, baseUrl) :
+def writeXmlSitemap(files, baseUrl, dropExtension=False) :
     """Writes an xml sitemap to the file sitemap.xml.

     Keyword Arguments:
     files - a list of filenames
     baseUrl - the base url to the root of the website
+    dropExtension - true to drop extensions of .html from the filename in urls
     """
     with open("sitemap.xml", "w") as sitemap :
         sitemap.write('<?xml version="1.0" encoding="UTF-8"?>\n')
         sitemap.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
         for f in files :
-            sitemap.write(xmlSitemapEntry(f, baseUrl, lastmod(f)))
+            sitemap.write(xmlSitemapEntry(f, baseUrl, lastmod(f), dropExtension))
             sitemap.write("\n")
         sitemap.write('</urlset>\n')

@@ -279,22 +289,23 @@ def writeXmlSitemap(files, baseUrl) :
     includePDF = sys.argv[4]=="true"
     sitemapFormat = sys.argv[5]
     additionalExt = set(sys.argv[6].lower().replace(",", " ").replace(".", " ").split())
+    dropExtension = sys.argv[7]=="true"

     os.chdir(websiteRoot)
     blockedPaths = parseRobotsTxt()

     allFiles = gatherfiles(createExtensionSet(includeHTML, includePDF, additionalExt))
     files = [ f for f in allFiles if not robotsBlocked(f, blockedPaths) ]
-    urlsort(files)
+    urlsort(files, dropExtension)

     pathToSitemap = websiteRoot
     if pathToSitemap[-1] != "/" :
         pathToSitemap += "/"
     if sitemapFormat == "xml" :
-        writeXmlSitemap(files, baseUrl)
+        writeXmlSitemap(files, baseUrl, dropExtension)
         pathToSitemap += "sitemap.xml"
     else :
-        writeTextSitemap(files, baseUrl)
+        writeTextSitemap(files, baseUrl, dropExtension)
         pathToSitemap += "sitemap.txt"

     print("::set-output name=sitemap-path::" + pathToSitemap)
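
The `urlsort` change above relies on Python's `list.sort` being stable: sorting by name first and by depth second yields a depth-major, name-minor order. The sketch below restates that idea with a single tuple key and shows why the sort key has to know about `dropExtension`. The names `sort_key` and `ordered` are hypothetical, and the file list is made up for illustration.

```python
def sort_key(path, drop_extension=False):
    """Hypothetical stand-in for sortname(): returns (depth, name-to-sort-by)."""
    if path.endswith("/index.html"):
        name = path[:-10]          # keep the trailing slash of the directory
    elif path == "index.html":
        name = ""
    elif drop_extension and path.endswith(".html"):
        name = path[:-5]
    else:
        name = path
    return (path.count("/"), name)


def ordered(files, drop_extension=False):
    """Depth-major, name-minor ordering using one tuple key."""
    return sorted(files, key=lambda f: sort_key(f, drop_extension))


files = ["./subdir/b.html", "./a.html", "./subdir/index.html",
         "./index.html", "./a-b.html"]

# Two stable passes, in the same order as urlsort() in the diff.
two_pass = list(files)
two_pass.sort(key=lambda f: sort_key(f, True)[1])
two_pass.sort(key=lambda f: f.count("/"))
assert two_pass == ordered(files, True)

print(ordered(files, False))
print(ordered(files, True))
```

With the extension kept, `./a-b.html` sorts before `./a.html` because `-` precedes `.` in ASCII; with it dropped, `./a` is a prefix of `./a-b` and sorts first. That difference is presumably why the commit threads `dropExtension` through the sort as well as through URL generation.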
42 changes: 42 additions & 0 deletions tests/integration.py
@@ -95,3 +95,45 @@ def testIntegrationWithAdditionalTypes(self) :
                      }
         self.assertEqual(expected, urlset)

+    def testIntegrationDropHtmlExtension(self) :
+        urlset = set()
+        with open("tests/subdir/sitemap.xml","r") as f :
+            for line in f :
+                i = line.find("<loc>")
+                if i >= 0 :
+                    i += 5
+                    j = line.find("</loc>", i)
+                    if j >= 0 :
+                        urlset.add(line[i:j].strip())
+                    else :
+                        self.fail("No closing </loc>")
+                i = line.find("<lastmod>")
+                if i >= 0 :
+                    i += 9
+                    j = line.find("</lastmod>", i)
+                    if j >= 0 :
+                        self.assertTrue(validateDate(line[i:j].strip()))
+                    else :
+                        self.fail("No closing </lastmod>")
+
+        expected = { "https://TESTING.FAKE.WEB.ADDRESS.TESTING/a",
+                     "https://TESTING.FAKE.WEB.ADDRESS.TESTING/y.pdf",
+                     "https://TESTING.FAKE.WEB.ADDRESS.TESTING/subdir/b",
+                     "https://TESTING.FAKE.WEB.ADDRESS.TESTING/subdir/z.pdf"
+                     }
+        self.assertEqual(expected, urlset)
+
+    def testIntegrationWithAdditionalTypesDropHtmlExtension(self) :
+        urlset = set()
+        with open("tests/subdir/sitemap.txt","r") as f :
+            for line in f :
+                line = line.strip()
+                if len(line) > 0 :
+                    urlset.add(line)
+        expected = { "https://TESTING.FAKE.WEB.ADDRESS.TESTING/a",
+                     "https://TESTING.FAKE.WEB.ADDRESS.TESTING/y.pdf",
+                     "https://TESTING.FAKE.WEB.ADDRESS.TESTING/subdir/b",
+                     "https://TESTING.FAKE.WEB.ADDRESS.TESTING/subdir/z.pdf"
+                     }
+        self.assertEqual(expected, urlset)
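
The XML test above extracts `<loc>` values by scanning each line with `str.find`. As a point of comparison only, and not what this commit does, the same set of URLs could be collected with the standard library's `xml.etree.ElementTree`. The helper name `loc_urls` and the sample document are assumptions made for this sketch. The namespace URI is the one written by `writeXmlSitemap` in the diff.

```python
import xml.etree.ElementTree as ET

# Namespace used by the <urlset> element in the generated sitemap.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def loc_urls(xml_bytes: bytes) -> set:
    """Hypothetical helper: return the set of <loc> URLs in a sitemap document."""
    root = ET.fromstring(xml_bytes)
    return {loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)}


# Made-up sample in the same shape the action writes.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://TESTING.FAKE.WEB.ADDRESS.TESTING/a</loc>
<lastmod>2021-06-28T12:00:00-04:00</lastmod>
</url>
<url>
<loc>https://TESTING.FAKE.WEB.ADDRESS.TESTING/subdir/b</loc>
<lastmod>2021-06-28T12:00:00-04:00</lastmod>
</url>
</urlset>
"""

print(sorted(loc_urls(sample)))
```

A parser-based check would also fail on malformed XML, which the line-by-line scan does not detect. The scan, in turn, needs no namespace handling and reports a missing closing tag with a specific message.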
