remove angle brackets from ArchiveRecord.getUrl #157
Comments
Thanks @dportabella for the issue. I will take a look at this tomorrow.
I've been looking into this a little, and it appears that we use the IIPC libraries to get the URL. I cannot see anything in the header.getUrl functions that would remove angle brackets, but it seems strange that nothing like that exists if the specification requires them. Do you have an example WARC I can use to see what happens when the WARC URL data is enclosed in angle brackets?
You can produce an archive test.warc.gz as follows: Here it explains that most libraries do not follow the specification:
Since we're drawing on the IIPC libraries, I'm going to suggest that for now we wait for the libraries to catch up rather than baking this into AUT. Thanks for sharing that issue, @dportabella. It sounds like it was an error (see also iipc/warc-specifications#24). In practice, apart from the example you've provided, I haven't seen angle brackets. Is this causing substantial issues on your end?
No; as a workaround, I remove the angle brackets from the result of ArchiveRecord.getUrl when needed.
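For readers who need the same workaround, here is a minimal sketch in Python of stripping the brackets from a URL string. The function name `strip_angle_brackets` is hypothetical and is not part of AUT or the IIPC libraries; the string passed in stands in for whatever ArchiveRecord.getUrl returned.

```python
def strip_angle_brackets(url: str) -> str:
    """Return url without one enclosing pair of angle brackets, if present.

    Hypothetical helper for illustration only; not an AUT or IIPC API.
    """
    # Only strip when BOTH the leading "<" and the trailing ">" are present,
    # so URLs written without brackets pass through unchanged.
    if len(url) >= 2 and url.startswith("<") and url.endswith(">"):
        return url[1:-1]
    return url


# Both forms of the same target URI normalize to the same string.
bracketed = strip_angle_brackets("<http://www.example.com/>")
plain = strip_angle_brackets("http://www.example.com/")
print(bracketed == plain)
```

The check is two constant-time string comparisons per record, so the cost on records without brackets should be small, though that is not measured here.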
OK. Given that this is an IIPC library issue rather than an AUT issue, I'm going to close this for now. My gut tells me that, given our limited resources, any fix might end up hurting performance in all cases while only helping the small number of angle-bracketed WARCs in the wild, if that makes sense.
According to iipc/warc-specifications#23, the standard says that WARC-Target-URI should be surrounded by <>, such as in:
WARC-Target-URI: <http://www.archive.org/images/logoc.jpg>
And this is the result produced by wget:
$ wget --warc-file=test.warc.gz "http://www.example.com"
However, some tools and datasets, such as the CommonCrawl dataset, omit the angle brackets. The aut library also does not expect the angle brackets, which does not match the specification. To accept both cases, would it be possible to remove the angle brackets in ArchiveRecord.getUrl when they are present?