Slash (/) missing in warcit output #26

Open
DHKaplan opened this issue Aug 22, 2023 · 3 comments · May be fixed by #29

DHKaplan commented Aug 22, 2023

I needed a small WARC file for testing, so I took a regular wget download, picked a few files that link to each other, and used warcit to create the WARC file. When I opened it in ReplayWeb.page, no pages were visible. I opened the WARC file in a plain-text editor and found that the "/" was not being inserted after the domain name. Please see https://forum.webrecorder.net/t/warcit-not-putting-a-before-the-file-name/413 for more information.

despens (Contributor) commented Aug 26, 2023

Hey @DHKaplan, you need to enter the exact URL prefix you want when running warcit. For instance:

warcit http://www.wticalumni.com/ my-local-folder

The prefix can be anything, for example:

warcit 'http://mydomain.com/query?q=' my-local-folder

Because the tool is this flexible, you have to give the exact URL prefix, including any trailing slash you want in the resulting URLs.
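The effect of the prefix can be sketched in a few lines of Python. This is only an illustration of the behavior described in this thread, assuming the prefix and the file's relative path are joined by plain string concatenation; `target_uri` is a hypothetical helper, not warcit's actual code, and `events.htm` stands in for any file in the local folder.

```python
def target_uri(url_prefix: str, relative_path: str) -> str:
    """Hypothetical helper: join prefix and path with no separator added."""
    return url_prefix + relative_path


# No trailing slash: the file name runs straight into the host name.
print(target_uri("https://www.wticalumni.com", "events.htm"))
# -> https://www.wticalumni.comevents.htm

# Trailing slash supplied by the user: a well-formed URL.
print(target_uri("https://www.wticalumni.com/", "events.htm"))
# -> https://www.wticalumni.com/events.htm

# A query-string prefix works the same way, which is why no slash is added automatically.
print(target_uri("http://mydomain.com/query?q=", "events.htm"))
# -> http://mydomain.com/query?q=events.htm
```

Under that assumption, silently appending a slash would break prefixes such as the query-string one, so the prefix has to be taken literally.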

DHKaplan (Author) commented

@despens The folder that contains my HTML is www.wticalumni.com, and the command I am using is:

warcit https://www.wticalumni.com ./www.wticalumni.com/

I get no pages found. When I open the .gz file in a plain-text editor, I see:

WARC/1.0
WARC-Date: 2023-05-10T18:36:00Z
WARC-Source-URI: file://./www.wticalumni.com/events.htm
WARC-Creation-Date: 2023-08-26T17:00:45Z
WARC-Type: resource
WARC-Record-ID: <urn:uuid:e1377996-9417-4ddb-8af8-19dc44972209>
WARC-Target-URI: https://www.wticalumni.comevents.htm
WARC-Payload-Digest: sha1:AP4CVEJE4OHSPK24OURQRPDOHKP2LWOA
WARC-Block-Digest: sha1:AP4CVEJE4OHSPK24OURQRPDOHKP2LWOA
Content-Type: text/html
Content-Length: 10002

Note that the Source-URI line is WARC-Source-URI: file://./www.wticalumni.com/events.htm
while the Target-URI line is WARC-Target-URI: https://www.wticalumni.comevents.htm
There is no slash before the file name in the Target-URI.

I really appreciate your reply, but I can't see what I am doing wrong.
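The symptom in the record above can also be checked without reading the file by eye. When the slash is missing, the file name is absorbed into the host name, so the parsed URL has an empty path. A minimal sketch using only Python's standard library; `looks_fused` is a hypothetical helper name and the check is a heuristic, not part of warcit.

```python
from urllib.parse import urlsplit


def looks_fused(target_uri: str) -> bool:
    """Return True when the URI has a host but an empty path.

    Heuristic only: a bare site root written without a slash is flagged too.
    """
    parts = urlsplit(target_uri)
    return bool(parts.netloc) and parts.path == ""


print(looks_fused("https://www.wticalumni.comevents.htm"))   # True
print(looks_fused("https://www.wticalumni.com/events.htm"))  # False
```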

despens (Contributor) commented Aug 27, 2023

Hi @DHKaplan, you just need to include the trailing / character in the URL prefix:

warcit https://www.wticalumni.com/ ./www.wticalumni.com/
                                 ^
                                 |
                             important

Shrinks99 linked a pull request on Oct 3, 2023 that will close this issue