If you generate HTML files, then this tool might be for you.
HTMLProofer is a set of tests to validate your HTML output. These tests check if your image references are legitimate, if they have alt tags, if your internal links are working, and so on. It's intended to be an all-in-one checker for your output.
In scope for this project is any well-known and widely-used test for HTML document quality. A major use for this project is continuous integration -- so we must have reliable results. We generally favor correctness over performance. And, if necessary, we should be able to trace this program's detection of HTML errors back to documented best practices or standards, such as W3 specifications.
Third-party modules. We want this product to be useful for continuous integration so we prefer to avoid subjective tests which are prone to false positive results, such as spell checkers, indentation checkers, etc. If you want to work on these items, please see the section on custom tests and consider adding an implementation as a third-party module.
Advanced configuration. Most front-end developers can test their HTML using our command line program. Advanced configuration will require using Ruby.
Add this line to your application's Gemfile:
gem 'html-proofer'
And then execute:
$ bundle install
Or install it yourself as:
$ gem install html-proofer
NOTE: When installation speed matters, set NOKOGIRI_USE_SYSTEM_LIBRARIES to true in your environment. This is useful for increasing the speed of your Continuous Integration builds.
Below is a mostly comprehensive list of checks that HTMLProofer can perform.
img elements:
- Whether all your images have alt tags
- Whether your internal image references are not broken
- Whether external images are showing
- Whether your images are HTTP
a, link elements:
- Whether your internal links are working
- Whether your internal hash references (#linkToMe) are working
- Whether external links are working
- Whether your links are HTTPS
- Whether CORS/SRI is enabled
script elements:
- Whether your internal script references are working
- Whether external scripts are loading
- Whether CORS/SRI is enabled
- Whether your favicons are valid.
- Whether the images and URLs in the OpenGraph metadata are valid.
- Whether your HTML markup is valid. This is done via Nokogumbo to validate well-formed HTML5 markup.
You can configure HTMLProofer to run on:
- a file
- a directory
- an array of directories
- an array of links
It can also run through the command-line, Docker, or as Rack middleware.
- Require the gem.
- Generate some HTML.
- Create a new instance of HTMLProofer on your output folder.
- Run that instance.
Here's an example:
require 'html-proofer'
require 'html/pipeline'
require 'find'
# make an out dir
Dir.mkdir("out") unless File.exist?("out")
pipeline = HTML::Pipeline.new [
  HTML::Pipeline::MarkdownFilter,
  HTML::Pipeline::TableOfContentsFilter
], :gfm => true
# iterate over files, and generate HTML from Markdown
Find.find("./docs") do |path|
  if File.extname(path) == ".md"
    contents = File.read(path)
    result = pipeline.call(contents)
    File.open("out/#{path.split("/").pop.sub('.md', '.html')}", 'w') { |file| file.write(result[:output].to_s) }
  end
end
# test your out dir!
HTMLProofer.check_directory("./out").run
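The example above derives each output name with path.split("/").pop.sub('.md', '.html'), which replaces the first ".md" found anywhere in the file name. A slightly more defensive variant is sketched below. It uses a hypothetical helper, html_name_for, which is not part of HTMLProofer or html-pipeline, and only swaps a trailing .md. Both versions flatten the directory structure, so two Markdown files with the same base name in different folders would end up writing to the same file in out/.

```ruby
# Hypothetical helper (not part of HTMLProofer): map a Markdown path to the
# flat HTML file name written into the "out" directory.
def html_name_for(path)
  # File.basename strips the directory and, when given, a trailing suffix.
  File.basename(path, ".md") + ".html"
end

puts html_name_for("./docs/guide/intro.md")
puts html_name_for("docs/my.mdoc.md")
```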
If you simply want to check a single file, use the check_file method:
HTMLProofer.check_file('/path/to/a/file.html').run
If you want to check a directory, use check_directory:
HTMLProofer.check_directory('./out').run
If you want to check multiple directories, use check_directories:
HTMLProofer.check_directories(['./one', './two']).run
With check_links, you can also pass in an array of links:
HTMLProofer.check_links(['http://github.com', 'http://jekyllrb.com']).run
This configures Proofer to just test those links to ensure they are valid. Note that for the command-line, you'll need to pass a special --as-links argument:
htmlproofer www.google.com,www.github.com --as-links
Note: command-line flags differ from the option names used in Ruby: underscores are replaced with dashes. For example, allow_hash_href becomes --allow-hash-href.
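The underscore-to-dash rule for flags can be sketched in a few lines of Ruby. The helper name option_to_flag is made up for this illustration and is not part of HTMLProofer. It only shows how an option name relates to its flag, not which options are available on the command line.

```ruby
# Hypothetical helper: derive the command-line flag from a Ruby option name
# by replacing underscores with dashes and prefixing two dashes.
def option_to_flag(option)
  "--" + option.to_s.tr("_", "-")
end

%i[allow_hash_href http_status_ignore url_ignore].each do |opt|
  puts "#{opt} -> #{option_to_flag(opt)}"
end
```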
You'll also get a new program called htmlproofer with this gem. Terrific!
Pass in options through the command-line as flags, like this:
htmlproofer --extension .html.erb ./out
Use htmlproofer --help to see all command line options, or take a peek here.
For options which require an array of input, surround the value with quotes, and don't use any spaces. For example, to exclude an array of HTTP status codes, you might do:
htmlproofer --http-status-ignore "999,401,404" ./out
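For comparison, the same setting in a Ruby options hash is an array of numbers. The sketch below shows how the comma-separated flag value corresponds to that array. The helper status_list is invented for illustration and is not HTMLProofer's own parser.

```ruby
# Hypothetical helper: turn a comma-separated flag value into the array of
# integers that the http_status_ignore option expects in Ruby.
def status_list(flag_value)
  flag_value.split(",").map { |code| code.strip.to_i }
end

options = { :http_status_ignore => status_list("999,401,404") }
p options
```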
For something like url-ignore, and other options that require an array of regular expressions, you can pass in a syntax like this:
htmlproofer --url-ignore "/www.github.com/,/foo.com/" ./out
Since url_swap is a bit special, you'll pass in a pair of RegEx:String values. Use the escape sequence \: to produce a literal colon; htmlproofer will figure out what you mean.
htmlproofer --url-swap "wow:cow,mow:doh" --extension .html.erb --url-ignore www.github.com ./out
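To make the RegEx:String format concrete, here is a minimal sketch of how such a value could be turned into the hash that the url_swap option takes in Ruby. It is an assumption about the format rather than HTMLProofer's actual parsing code, and the name parse_url_swap is hypothetical. It splits pairs on commas (so a pattern containing a comma would not survive) and splits each pair on the first colon that is not preceded by a backslash.

```ruby
# Illustrative parser (an assumption, not HTMLProofer's own code):
# "regex:string,regex:string" becomes { Regexp => String }.
def parse_url_swap(value)
  value.split(",").each_with_object({}) do |pair, swaps|
    # Split on the first colon that is not escaped with a backslash.
    pattern, replacement = pair.split(/(?<!\\):/, 2)
    # Turn the escape sequence \: back into a literal colon.
    swaps[Regexp.new(pattern.gsub('\:', ":"))] = replacement.gsub('\:', ":")
  end
end

# Apply each swap in turn with gsub, as the url_swap option describes.
def apply_swaps(url, swaps)
  swaps.reduce(url) { |current, (pattern, string)| current.gsub(pattern, string) }
end

puts apply_swaps("http://wow.example/mow", parse_url_swap('wow:cow,mow:doh'))
```

With the escape sequence, a value such as '^http\://old.example:https\://new.example' yields one pair whose pattern and replacement both contain literal colons.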
Want to use HTML Proofer with your Jekyll site? Awesome. Simply add gem 'html-proofer' to your Gemfile as described above, and add the following to your Rakefile, using rake test to execute:
require 'html-proofer'
task :test do
  sh "bundle exec jekyll build"
  options = { :assume_extension => true }
  HTMLProofer.check_directory("./_site", options).run
end
Don't have or want a Rakefile? You can also do something like the following:
htmlproofer --assume-extension ./_site
If you have trouble installing (or don't want to install) Ruby/Nokogumbo, the command-line tool can be run through Docker. See klakegg/html-proofer for more information.
You can run html-proofer as part of your Rack middleware to validate your HTML at runtime. For example, in Rails, add these lines to config/application.rb:
config.middleware.use HTMLProofer::Middleware if Rails.env.test?
config.middleware.use HTMLProofer::Middleware if Rails.env.development?
This will raise an error at runtime if your HTML is invalid. You can choose to skip validation of a page by adding ?proofer-ignore to the URL.
This is particularly helpful for projects which have extensive CI, since any invalid HTML will fail your build.
Add the data-proofer-ignore attribute to any tag to exclude it from every check.
<a href="http://notareallink" data-proofer-ignore>Not checked.</a>
This can also apply to parent elements, all the way up to the <html> tag:
<div data-proofer-ignore>
  <a href="http://notareallink">Not checked because of parent.</a>
</div>
Say you've got some new files in a pull request, and your tests are failing because links to those files are not live yet. One thing you can do is run a diff against your base branch and explicitly ignore the new files, like this:
directories = %w(content)
merge_base = `git merge-base origin/production HEAD`.chomp
diffable_files = `git diff -z --name-only --diff-filter=AC #{merge_base}`.split("\0")
diffable_files = diffable_files.select do |filename|
  next true if directories.include?(File.dirname(filename))
  filename.end_with?('.md')
end.map { |f| Regexp.new(File.basename(f, File.extname(f))) }
HTMLProofer.check_directory('./output', { url_ignore: diffable_files }).run
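To see what ends up in url_ignore, the sketch below applies the same select and map steps to a fixed list of file names instead of git output. The file names and the wrapper method ignore_patterns are invented for illustration. Because the base names go into Regexp.new unescaped, names containing regular-expression metacharacters may match more than intended.

```ruby
# Same filter and map as above, wrapped in a hypothetical method so it can
# run on a fixed list of paths rather than on `git diff` output.
def ignore_patterns(changed_files, directories)
  changed_files.select do |filename|
    # Keep anything directly inside a watched directory...
    next true if directories.include?(File.dirname(filename))
    # ...and any Markdown file elsewhere.
    filename.end_with?(".md")
  end.map { |f| Regexp.new(File.basename(f, File.extname(f))) }
end

changed = ["content/new-page.html", "guides/setup.md", "lib/tool.rb"]
p ignore_patterns(changed, %w(content))
```

Here lib/tool.rb is dropped, and the remaining two files become patterns built from their base names, new-page and setup.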
The HTMLProofer constructor takes an optional hash of additional options:
Option | Description | Default |
---|---|---|
allow_missing_href | If true, does not flag <a> tags missing href (this is the default for HTML5). | false |
allow_hash_href | If true, ignores the href="#". | false |
alt_ignore | An array of Strings or RegExps containing imgs whose missing alt tags are safe to ignore. | [] |
assume_extension | Automatically add extension (e.g. .html) to file paths, to allow extensionless URLs (as supported by Jekyll 3 and GitHub Pages) | false |
check_external_hash | Checks whether external hashes exist (even if the webpage exists). This slows the checker down. | false |
check_favicon | Enables the favicon checker. | false |
check_opengraph | Enables the Open Graph checker. | false |
check_html | Enables HTML validation errors from Nokogumbo | false |
check_img_http | Fails an image if it's marked as http | false |
check_sri | Check that <link> and <script> external resources use SRI | false |
checks_to_ignore | An array of Strings indicating which checks you do not want to run | [] |
directory_index_file | Sets the file to look for when a link refers to a directory. | index.html |
disable_external | If true, does not run the external link checker, which can take a lot of time. | false |
empty_alt_ignore | If true, ignores images with empty alt tags. | false |
enforce_https | Fails a link if it's not marked as https. | false |
error_sort | Defines the sort order for error output. Can be :path, :desc, or :status. | :path |
extension | The extension of your HTML files including the dot. | .html |
external_only | Only checks problems with external references. | false |
file_ignore | An array of Strings or RegExps containing file paths that are safe to ignore. | [] |
http_status_ignore | An array of numbers representing status codes to ignore. | [] |
internal_domains | An array of Strings containing domains that will be treated as internal urls. | [] |
log_level | Sets the logging level, as determined by Yell. One of :debug, :info, :warn, :error, or :fatal. | :info |
only_4xx | Only reports errors for links that fall within the 4xx status code range. | false |
root_dir | The absolute path to the directory serving your html-files. | "" |
typhoeus_config | A JSON-formatted string. Parsed using JSON.parse and mapped on top of the default configuration values so that they can be overridden. | {} |
url_ignore | An array of Strings or RegExps containing URLs that are safe to ignore. It affects all HTML attributes. Note that non-HTTP(S) URIs are always ignored. | [] |
url_swap | A hash containing key-value pairs of RegExp => String. It transforms URLs that match RegExp into String via gsub. | {} |
verbose | If true, outputs extra information as the checking happens. Useful for debugging. Will be deprecated in a future release. | false |
In addition, there are a few "namespaced" options. These are:
- :validation
- :typhoeus / :hydra
- :parallel
- :cache
See below for more information.
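As a rough picture of how top-level and namespaced options sit together, here is a plain Ruby hash that combines them. Every key and value is taken from examples elsewhere in this document. The constant name PROOFER_OPTIONS is arbitrary, and the combination itself is only an illustration, not a recommended configuration.

```ruby
# A plain options hash mixing top-level and namespaced options.
# The constant name is arbitrary; the keys and values come from the
# examples in this document.
PROOFER_OPTIONS = {
  :assume_extension => true,
  :check_html => true,
  :validation => { :report_script_embeds => true },
  :typhoeus => { :ssl_verifyhost => 2 },
  :hydra => { :max_concurrency => 50 },
  :parallel => { :in_processes => 3 },
  :cache => { :timeframe => '30d' }
}

# With the gem installed, the hash would be passed as the second argument:
#   HTMLProofer.check_directory("./out", PROOFER_OPTIONS).run
p PROOFER_OPTIONS.keys
```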
If check_html is true, Nokogumbo performs additional validation on your HTML.
You can pass in additional options to configure this validation.
Option | Description | Default |
---|---|---|
report_eof_tags | When check_html is enabled, HTML markup with mismatched tags is reported as an error. | false |
report_invalid_tags | When check_html is enabled, HTML markup that is unknown to Nokogumbo is reported as an error. | false |
report_mismatched_tags | When check_html is enabled, HTML markup with malformed tags is reported as an error. | false |
report_missing_doctype | When check_html is enabled, HTML markup with a missing or out-of-order DOCTYPE is reported as an error. | false |
report_missing_names | When check_html is enabled, HTML markup that is missing entity names is reported as an error. | false |
report_script_embeds | When check_html is enabled, script tags containing markup are reported as errors. | false |
For example:
opts = { :check_html => true, :validation => { :report_script_embeds => true } }
Typhoeus is used to make fast, parallel requests to external URLs. You can pass in any of Typhoeus' options for the external link checks with the options namespace of :typhoeus. For example:
HTMLProofer.check_directory("out/", {:extension => ".htm", :typhoeus => { :verbose => true, :ssl_verifyhost => 2 } })
This sets HTMLProofer's extension to .htm, tells Typhoeus to be verbose, and applies specific SSL settings. Check the Typhoeus documentation for more information on what options it can receive.
You can similarly pass in a :hydra option with a hash configuration for Hydra.
The default value is:
{
  :typhoeus =>
  {
    :followlocation => true,
    :connecttimeout => 10,
    :timeout => 30
  },
  :hydra => { :max_concurrency => 50 }
}
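One way to think about how your settings combine with these defaults is a per-namespace merge, sketched below. This is an assumption about the behavior for illustration, not a copy of HTMLProofer's code, and merge_with_defaults is a hypothetical name. Under that assumption, keys you do not mention keep their default values.

```ruby
# Defaults as listed above.
DEFAULTS = {
  :typhoeus => { :followlocation => true, :connecttimeout => 10, :timeout => 30 },
  :hydra => { :max_concurrency => 50 }
}

# Assumption: user settings are merged one namespace at a time, so a user
# hash that only sets :timeout leaves the other Typhoeus defaults in place.
def merge_with_defaults(user_options)
  DEFAULTS.merge(user_options) do |_key, default, custom|
    default.is_a?(Hash) && custom.is_a?(Hash) ? default.merge(custom) : custom
  end
end

p merge_with_defaults({ :typhoeus => { :timeout => 60 } })
```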
You can provide a block to set some logic before an external link is checked. For example, say you want to provide an authentication token every time a GitHub URL is checked. You can do that like this:
proofer = HTMLProofer.check_directory(item, opts)
proofer.before_request do |request|
  request.options[:headers]['Authorization'] = "Bearer <TOKEN>" if request.base_url == "https://github.com"
end
proofer.run
The Authorization header is being set if and only if the base_url is https://github.com, and it is excluded for all other URLs.
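The conditional can be tried out without making any network requests by using a stand-in object. In the sketch below, FakeRequest and add_github_token are hypothetical names; the Struct only mimics the two members the block reads (base_url and options) and is not the real request class. The guard that creates the headers hash is an extra precaution in case none has been set yet.

```ruby
# Stand-in for the request object (an assumption for illustration).
# It exposes only what the block above reads.
FakeRequest = Struct.new(:base_url, :options)

# The same rule as the block above, plus a guard for a missing headers hash.
def add_github_token(request, token)
  return request unless request.base_url == "https://github.com"
  request.options[:headers] ||= {}
  request.options[:headers]["Authorization"] = "Bearer #{token}"
  request
end

github = add_github_token(FakeRequest.new("https://github.com", {}), "<TOKEN>")
other = add_github_token(FakeRequest.new("https://example.com", {}), "<TOKEN>")
p github.options
p other.options
```

Depending on what base_url holds for a given request, a prefix comparison such as start_with? may be more suitable than ==.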
Parallel can be used to speed up internal file checks. You can pass in any of its options with the options namespace :parallel. For example:
HTMLProofer.check_directories(["out/"], {:extension => ".htm", :parallel => { :in_processes => 3} })
In this example, :in_processes => 3 is passed into Parallel as a configuration option.
Checking external URLs can slow your tests down. If you'd like to speed that up, you can enable caching for your external links. Caching means that links already found to be valid are skipped for a certain period of time.
You can enable caching by passing in the option :cache, with a hash containing a single key, :timeframe. :timeframe defines the length of time the cache will be used before the link is checked again. The format of :timeframe is a number followed by a letter indicating the unit of time:
- M means months
- w means weeks
- d means days
- h means hours
For example, passing the following options means "recheck links older than thirty days":
{ :cache => { :timeframe => '30d' } }
And the following options means "recheck links older than two weeks":
{ :cache => { :timeframe => '2w' } }
You can change the directory where the cache file is kept by also providing the storage_dir key:
{ :cache => { :storage_dir => '/tmp/html-proofer-cache-money' } }
Links that were failures are kept in the cache and always rechecked. If they pass, the cache is updated to note the new timestamp.
The cache operates on external links only.
If caching is enabled, HTMLProofer writes to a log file called tmp/.htmlproofer/cache.log. You should probably ignore this folder in your version control system.
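The timeframe format and the recheck rule described above can be sketched as follows. These helpers are hypothetical and are not HTMLProofer's implementation. In particular, treating a month as 30 days is an assumption made here for simplicity.

```ruby
# Hypothetical helpers illustrating the :timeframe format.
# Assumption: a month is approximated as 30 days.
SECONDS_PER_UNIT = {
  "h" => 3600,
  "d" => 24 * 3600,
  "w" => 7 * 24 * 3600,
  "M" => 30 * 24 * 3600
}

def timeframe_to_seconds(timeframe)
  match = /\A(\d+)([Mwdh])\z/.match(timeframe)
  raise ArgumentError, "bad timeframe: #{timeframe}" unless match
  match[1].to_i * SECONDS_PER_UNIT.fetch(match[2])
end

# A link is rechecked if it failed last time or its entry is too old.
def recheck?(last_checked, passed, timeframe, now)
  !passed || (now - last_checked) > timeframe_to_seconds(timeframe)
end

now = Time.utc(2020, 3, 1)
puts recheck?(Time.utc(2020, 2, 20), true, "30d", now)   # passed 10 days ago
puts recheck?(Time.utc(2020, 1, 1), true, "30d", now)    # passed 60 days ago
puts recheck?(Time.utc(2020, 2, 29), false, "30d", now)  # failed last time
```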
If you want to enable caching with Travis CI, be sure to add these lines into your .travis.yml file:
cache:
  directories:
    - $TRAVIS_BUILD_DIR/tmp/.htmlproofer
For more information on using HTML-Proofer with Travis CI, see this wiki page.
HTML-Proofer can be as noisy or as quiet as you'd like. If you set the :log_level option, you can better define the level of logging.
Want to write your own test? Sure, that's possible!
Just create a class that inherits from HTMLProofer::Check. This subclass must define one method called run. This is called on your content, and is responsible for performing the validation on whatever elements you like. When you find a problem, call add_issue(message, line: line, content: content) to explain the error. line refers to the line number, and content is the node content of the broken element.
If you're working with the element's attributes (as most checks do), you'll also want to call create_element(node) as part of your suite. This constructs an object that contains all the attributes of the HTML element you're iterating on.
Here's an example custom test demonstrating these concepts. It reports mailto links that point to octocat@github.com:
class MailToOctocat < ::HTMLProofer::Check
  def mailto?
    return false if @link.data_proofer_ignore || @link.href.nil?
    @link.href.match /mailto/
  end
  def octocat?
    return false if @link.data_proofer_ignore || @link.href.nil?
    @link.href.match /octocat@github.com/
  end
  def run
    @html.css('a').each do |node|
      @link = create_element(node)
      line = node.line
      if mailto? && octocat?
        return add_issue("Don't email the Octocat directly!", line: line)
      end
    end
  end
end
See our list of third-party custom classes and add your own to this list.
Here are some brief snippets identifying some common problems that you can work around. For more information, check out our wiki.
Our wiki page on using HTML-Proofer with Travis CI might also be useful.
To ignore SSL certificates, turn off Typhoeus' SSL verification:
HTMLProofer.check_directory("out/", {
  :typhoeus => {
    :ssl_verifypeer => false,
    :ssl_verifyhost => 0
  }
}).run
To change the User-Agent used by Typhoeus:
HTMLProofer.check_directory("out/", {
  :typhoeus => {
    :headers => { "User-Agent" => "Mozilla/5.0 (compatible; My New User-Agent)" }
  }
}).run
To exclude URLs using regular expressions, include them between forward slashes and don't quote them:
HTMLProofer.check_directories(["out/"], {
  :url_ignore => [/example.com/],
}).run
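One detail worth keeping in mind with patterns like the one above: an unescaped dot in a regular expression matches any character, so /example.com/ is broader than it looks. The short sketch below uses only Ruby's standard Regexp class (the constant names and sample URLs are arbitrary) to show the difference and how Regexp.escape builds the stricter form.

```ruby
# An unescaped dot matches any character.
LOOSE = /example.com/
STRICT = /example\.com/

puts LOOSE.match?("http://exampleXcom.test")   # the dot matches "X"
puts STRICT.match?("http://exampleXcom.test")  # a literal dot is required
puts STRICT.match?("http://example.com/page")

# Regexp.escape builds the strict form from a plain string.
FROM_STRING = Regexp.new(Regexp.escape("example.com"))
puts FROM_STRING == STRICT
```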
Project | Repository | Notes |
---|---|---|
Jekyll's website | jekyll/jekyll | A separate script calls htmlproofer and this used to be called from Circle CI |
Raspberry Pi's documentation | raspberrypi/documentation | |
Squeak's website | squeak-smalltalk/squeak.org | |
Atom Flight Manual | atom/flight-manual.atom.io | |
HTML Website Template | fulldecent/html-website-template | A starting point for websites, uses a Rakefile and Travis configuration to call preconfigured testing |
Project Calico Documentation | projectcalico/calico | Simple integration with Jekyll and Docker using a Makefile |