
Title extraction in indexer is based on the WARC headers and not HTTP headers #568

Closed
machawk1 opened this issue Sep 25, 2018 · 0 comments

@machawk1
Member

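This prevents a title from ever being extracted, because no WARC header will ever have a Content-Type of text/html. For example, the Content-Type of a WARC response record is application/http; msgtype=response. The text/html value appears only in the HTTP headers carried inside the record, so those are the headers the indexer needs to check.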
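
To illustrate the mismatch, here is a minimal sketch using only the Python standard library. It is not the indexer's actual code: the record is hand-written and simplified (it omits fields a real WARC record requires, such as WARC-Record-ID and Content-Length), and the names parse_headers, split_record, and extract_title are hypothetical helpers made up for this example.

```python
import re

# Hand-written, simplified WARC response record (illustrative only).
RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://example.com/\r\n"
    "Content-Type: application/http; msgtype=response\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<html><head><title>Example Domain</title></head><body></body></html>"
)


def parse_headers(block):
    """Turn 'Name: value' lines into a dict, skipping the first (version/status) line."""
    headers = {}
    for line in block.split("\r\n")[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def split_record(record):
    """Split a record into WARC headers, HTTP headers, and the HTTP body."""
    warc_block, http_block, body = record.split("\r\n\r\n", 2)
    return parse_headers(warc_block), parse_headers(http_block), body


def extract_title(headers, body):
    """Return the <title> text only if the given headers declare text/html."""
    if not headers.get("content-type", "").lower().startswith("text/html"):
        return None
    match = re.search(r"<title[^>]*>(.*?)</title>", body, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


warc_headers, http_headers, body = split_record(RECORD)

# Checking the WARC headers never matches text/html, so no title is found.
print(extract_title(warc_headers, body))  # None

# Checking the HTTP headers inside the record finds the title.
print(extract_title(http_headers, body))  # Example Domain
```

The same body is passed in both calls; only the set of headers consulted changes, which is the whole difference between the buggy and the intended behavior.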

PR incoming.

@machawk1 machawk1 self-assigned this Sep 25, 2018
machawk1 added a commit that referenced this issue Sep 25, 2018
Fix title extraction to use HTTP and not WARC headers. Closes #568