warc format #11

Open
Natkeeran opened this issue Aug 16, 2018 · 2 comments


@Natkeeran

Natkeeran commented Aug 16, 2018

I downloaded a website from the Internet Archive using wayback-machine-downloader, then created a WARC using warcit with the following command: warcit --fixed-dt 20100212221453 http://domainname.com /dirpath

It did create a WARC file. I would like to index it into Solr using webarchive-discovery. When trying to do so, I get the following warnings:

2018-08-16 18:22:08 WARN  WARCIndexer:414 - Invalid status line: null@28005
2018-08-16 18:22:08 WARN  WARCIndexer:414 - Invalid status line: null@40193
2018-08-16 18:22:08 WARN  WARCIndexer:414 - Invalid status line: null@79054

I could not load it into AUT either.

An example WARC is attached. Can warcit be used to convert snapshots downloaded from the Internet Archive into WARC format? (Unfortunately, the Internet Archive does not provide a way to download WARCs.)

esports.com.warc.gz

@ikreymer
Member

Hm, it seems that AUT must not support resource records; it would be worth letting them know. It's probably also possible to generate fake response records, although that's less ideal.

But for your use case, you can also use webrecorder.io directly and enter a Wayback Machine URL. Webrecorder will detect that it's a Wayback Machine URL and should do the right thing with it.
You'll then be able to download a WARC directly as well.
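
For context on the record types mentioned above: in the WARC format, a response record stores the full HTTP response, including the status line and headers, while a resource record stores only the payload. A tool that expects an HTTP status line in every record could plausibly emit warnings like "Invalid status line: null" when it meets resource records, though the thread does not confirm that this is the cause here. The sketch below is one way to check which record types a WARC actually contains. It uses only the Python standard library and is not how AUT, webarchive-discovery, or warcit read WARCs; the function names `count_warc_types` and `count_warc_types_in_file` are made up for this illustration.

```python
import gzip
from collections import Counter


def count_warc_types(stream):
    """Count WARC-Type values in an uncompressed WARC byte stream.

    Minimal sketch: assumes well-formed records that each carry a
    Content-Length header, as the WARC format requires.
    """
    counts = Counter()
    while True:
        line = stream.readline()
        if not line:
            break  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        # Read the WARC header block up to the first empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line.strip() == b"":
                break
            name, _, value = line.partition(b":")
            headers[name.strip().lower()] = value.strip()
        rec_type = headers.get(b"warc-type", b"unknown")
        counts[rec_type.decode("ascii", "replace")] += 1
        # Skip the record body so payload bytes are never parsed as headers.
        stream.read(int(headers.get(b"content-length", b"0")))
    return counts


def count_warc_types_in_file(path):
    """Open a .warc or .warc.gz file and count its record types."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return count_warc_types(stream)
```

Calling `count_warc_types_in_file("esports.com.warc.gz")` on the attached file would return a counter keyed by record type. If it shows `resource` entries and no `response` entries, that would be consistent with the explanation above.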

@Natkeeran
Author

@ikreymer

Thank you for looking into this issue.

Can you please provide some additional background on resource record support? Is this related to how they implement or use the WARC standard?

I tried providing the URL http://web.archive.org/web/20071016060747/http://eelamsports.com:80/ to webrecorder.io. It seems to download just the home page, but I need the full snapshot/site to be downloaded.


2 participants