Bug in regex used to detect robots noindex directive in page header #110

Closed
cicirello opened this issue Oct 5, 2023 · 0 comments · Fixed by #109
Labels
bug Something isn't working

Comments

@cicirello (Owner)
Summary

The current regular expression used to detect whether a page's header contains a meta tag with a robots noindex directive (e.g., so such pages can be excluded from the sitemap) has a potential bug. The pattern uses \s* in a couple of places to account for sequences of whitespace characters, but because it is written as a normal (non-raw) string literal, \s is treated as an invalid escape sequence in the string rather than being passed through to Python's regular expression processor. The backslash needs to be escaped (or the pattern written as a raw string). This was revealed when upgrading to Python 3.12, which emits a warning; earlier versions of Python do not warn on this, and the behavior appears to be correct regardless (not entirely sure why, though it is likely because unrecognized escape sequences are left in the string unchanged), but it should be fixed nonetheless.
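A minimal sketch of the problem follows; the pattern and variable names here are illustrative, not the project's actual regex. It shows a \s escape written inside a normal string literal, which Python 3.12 flags with a SyntaxWarning, and the raw-string form that passes the backslash through to the re module explicitly.

```python
import re

# Hypothetical pattern for illustration only (not the project's actual regex).
# In a normal string literal, \s is an invalid escape sequence. Python 3.12
# reports it with a SyntaxWarning shown by default; earlier versions only
# issued a DeprecationWarning, hidden unless warnings were enabled. Behavior
# was still correct because unrecognized escapes are left in the string
# unchanged, so the two characters "\s" still reached the regex engine.
buggy = "<meta\s+name=.robots.\s+content=.noindex."   # SyntaxWarning on 3.12

# Fix: a raw string (or doubling the backslashes as \\s) makes the intent explicit.
fixed = r"<meta\s+name=.robots.\s+content=.noindex."

head = '<meta name="robots" content="noindex">'
print(bool(re.search(fixed, head, re.IGNORECASE)))  # True
```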
