remove angle brackets from ArchiveRecord.getUrl #157

Closed
dportabella opened this issue Dec 24, 2017 · 6 comments

@dportabella
Contributor

According to iipc/warc-specifications#23, the standard says that the WARC-Target-URI value should be surrounded by angle brackets (<>), as in:
WARC-Target-URI: <http://www.archive.org/images/logoc.jpg>

And this is the result produced by wget:
$ wget --warc-file=test.warc.gz "http://www.example.com"

However, some tools and datasets, such as the CommonCrawl dataset, omit the angle brackets. The aut library also does not expect the angle brackets, so a bracketed URL is returned with the brackets still attached. To accept both cases, would it be possible to remove the angle brackets in ArchiveRecord.getUrl when they are present?
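The normalization requested above can be sketched roughly as follows. This is an illustration only: it is written in Python rather than in aut's own language, the name `strip_angle_brackets` is hypothetical, and nothing here is aut or IIPC library code. It assumes the URL arrives as a plain string and removes at most one enclosing pair of brackets.

```python
def strip_angle_brackets(url):
    """Return url without one enclosing pair of angle brackets, if present.

    Hypothetical helper for illustration; not part of aut or the IIPC libraries.
    """
    if url is None:
        return None
    trimmed = url.strip()
    # Only strip when both the opening and the closing bracket are present.
    if len(trimmed) >= 2 and trimmed.startswith("<") and trimmed.endswith(">"):
        return trimmed[1:-1]
    return trimmed
```

Because unbracketed values pass through unchanged, a helper like this could be applied to records from both kinds of WARC files.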

@greebie greebie self-assigned this Jan 3, 2018
@greebie
Contributor

greebie commented Jan 3, 2018

Thanks @dportabella for the issue. I will take a look at this tomorrow.

@greebie
Contributor

greebie commented Jan 4, 2018

I've looked into this a little, and it appears that we are using the IIPC libraries to get the URL. I cannot see anything in the header.getUrl functions that would remove angle brackets, which seems strange if that is what the specification requires.

Do you have an example WARC I can use to see what happens when the URL is enclosed in angle brackets?

@dportabella
Contributor Author

You can produce an archive, test.warc.gz, as follows:
$ wget --warc-file=test.warc.gz "http://www.example.com"

This issue explains that most libraries do not follow the specification:
iipc/warc-specifications#23
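To check which form a given archive uses, one rough option is to scan the decompressed file for WARC-Target-URI header lines. The sketch below uses only Python's standard `gzip` module; the function name `target_uris` is hypothetical. It makes a simplifying assumption: it scans every line of the decompressed stream, so a payload line that happens to start with the header name would also be reported. A real WARC parser would avoid that.

```python
import gzip


def target_uris(warc_path):
    """Yield (raw_value, is_bracketed) for each WARC-Target-URI line found.

    Naive line scan for illustration; it does not parse record boundaries.
    """
    prefix = b"WARC-Target-URI:"
    # gzip.open reads through concatenated gzip members, one per WARC record.
    with gzip.open(warc_path, "rb") as stream:
        for line in stream:
            if line.startswith(prefix):
                value = line[len(prefix):].strip().decode("utf-8", errors="replace")
                yield value, value.startswith("<") and value.endswith(">")
```

For example, `for value, bracketed in target_uris("test.warc.gz"): print(bracketed, value)` would list each header value along with whether it is enclosed in angle brackets.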

@ianmilligan1
Member

Since we're drawing on IIPC libraries, I'm going to suggest that at this time we wait for the libraries to catch up rather than baking this into AUT.

Thanks for sharing that issue, @dportabella. It sounds like the requirement was an error in the specification (see also iipc/warc-specifications#24). In practice, apart from the example you've provided, I haven't seen angle brackets.

Is this causing substantial issues on your end?

@dportabella
Contributor Author

dportabella commented Jan 4, 2018

> Is this causing substantial issues on your end?

No. As a workaround, I strip the angle brackets from the result of ArchiveRecord.getUrl when needed.

@ianmilligan1
Member

OK. Given that this is an IIPC library issue rather than an AUT issue, I'm going to close this for now. My gut tells me that, given our limited resources, any fix might end up hitting performance in all cases while only fixing the small number of angle-bracketed WARCs in the wild, if that makes sense.
