Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Page titles should not be updated by versions that are errors #751

Closed
Mr0grog opened this issue Sep 8, 2020 · 1 comment · Fixed by #1061
Closed

Page titles should not be updated by versions that are errors #751

Mr0grog opened this issue Sep 8, 2020 · 1 comment · Fixed by #1061
Labels

Comments

@Mr0grog
Copy link
Member

Mr0grog commented Sep 8, 2020

The Version model has an after_create callback that updates its page’s title field:

def sync_page_title
if title.present?
most_recent_capture_time = page.latest.capture_time
page.update(title: title) if most_recent_capture_time.nil? || most_recent_capture_time <= capture_time
end
end

However, it shouldn’t update the page title if the version in question represents an error — the page model represents the conceptual page (not the URL) that we are tracking, and it’s unhelpful to rename the page to represent a temporary error, or worse, a “not found” message when the page is permanently removed. It makes it hard to tell what page we’re talking about.

For example, this page: https://monitoring.envirodatagov.org/page/9c85fc4f-c0ef-4327-ac4e-df2b6314917f/ba9b8e97-668d-4c92-bf95-d2ee5ffa46f2..0f4b06b2-6473-4ff6-a5e6-aada11f65cdd
…shows the title “Page Not Found | USDA” when it should show “USDA | Office of the Chief Economist | Climate Change | Effects | Climate Change and Agriculture Report.”

Screen Shot 2020-09-08 at 10 28 46 AM

To do this, we should probably add a method on Version named is_error? or similar rather than directly checking the status code so that we can account for things like #468 in the future.

@Mr0grog Mr0grog added the bug label Sep 8, 2020
@Mr0grog
Copy link
Member Author

Mr0grog commented Sep 8, 2020

is_error? might be something like:

def is_error?(strict: false)
    return true if status && status >= 400
    return false if strict

    # Use some basic heuristics to determine if a page with an OK
    # status code is actually an error. There's lots of room for
    # improvement here.
    case_title = title.downcase
    (
        case_title.include?('error 404')
        || (case_title.match?(/\b404\b/) && title.include?('not found'))
        || case_title.match?(/(page|file)( was)? not found/)
        || case_title.split(/\s*(?:-+|\|)\s*/).any? {|p| p == 'restricted access'}
    )
end

Mr0grog added a commit that referenced this issue Jan 17, 2023
If a new version is an error (e.g. a 404 status code because the page was removed), don't update the page record's title with the title from the error response. Fixes #751.
Mr0grog added a commit that referenced this issue Jan 19, 2023
If a new version is an error (e.g. a 404 status code because the page was removed), don't update the page record's title with the title from the error response. Fixes #751.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant