Some RSS feeds have double-encoded HTML entities #109

Open
Mr0grog opened this issue Aug 2, 2020 · 1 comment
Labels
bug Something isn't working news Related to scraping news (rather than data)

Comments

@Mr0grog
Collaborator

Mr0grog commented Aug 2, 2020

The San Mateo source data (which is an actual RSS feed!) currently has some double-encoded HTML entities that we should see if we can clean up. That is, the source data might have code like:

    &amp;nbsp;

Which is a common symptom of an HTML entity like:

    &nbsp;

Getting re-encoded a second time and therefore getting ruined. See an example of this happening in practice on sfbrigade/stop-covid19-sfbayarea#309 (comment)

One common way to fix this is to repeatedly decode the HTML into plain text until there’s nothing left to decode, and then re-encode it once, e.g.:

"&amp;nbsp;" → "&nbsp;" → " " → (then re-encode once) → "&nbsp;"

The downside here is that this can ruin code that was intentionally double-encoded, like source code examples. That’s probably not likely in the kind of data we’re dealing with, though.
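
A minimal sketch of the decode-until-stable approach described above, in Python. The function name `normalize_entities` and the `max_rounds` cap are hypothetical and not taken from the project's code; the sketch relies only on the standard library's `html.unescape()` and `html.escape()`.

```python
import html


def normalize_entities(text, max_rounds=5):
    """Decode HTML entities until the text stops changing, then encode once.

    `normalize_entities` and `max_rounds` are illustrative names, not part of
    any existing scraper code.
    """
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            # Nothing left to decode.
            break
        text = decoded
    # quote=False leaves " and ' alone; only &, <, and > are escaped.
    return html.escape(text, quote=False)
```

This also illustrates the downside: a deliberately double-encoded snippet such as `&amp;lt;b&amp;gt;` would come back as `&lt;b&gt;`, losing one level of encoding.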

@Mr0grog Mr0grog added bug Something isn't working news Related to scraping news (rather than data) labels Aug 2, 2020
@Mr0grog
Collaborator Author

Mr0grog commented Aug 2, 2020

Note: Python’s built-in html module has escape() and unescape() functions for doing the above work.
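
As a rough illustration of how those two functions behave on a double-encoded non-breaking space (the variable names and the sample string are made up for this example):

```python
import html

double_encoded = "&amp;nbsp;"

once = html.unescape(double_encoded)   # one level removed
twice = html.unescape(once)            # the actual non-breaking space character

# escape() only rewrites &, <, > and (by default) quote characters,
# so the non-breaking space is left as a literal character.
re_encoded = html.escape(twice)

print(repr(once), repr(twice), repr(re_encoded))
```

One thing worth keeping in mind: `escape()` does not produce named entities like `&nbsp;`, so the cleaned text carries the character itself rather than the entity.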
