Importing fails if a WARC file misses some records. #12

Closed
Florents-Tselai opened this issue Oct 17, 2023 · 2 comments · Fixed by #13

Comments

@Florents-Tselai
Owner

With a newly created archive.db, running
warcdb import archive.db ./tests/apod.warc.gz
fails with sqlite_utils.db.AlterError: No such column: warcinfo.WARC-Record-ID

If, however, google.warc is imported first, it works fine:

warcdb import archive.db ./tests/google.warc
warcdb import archive.db ./tests/apod.warc.gz
./tests/apod.warc.gz: 807it [00:00, 1249.71it/s]

That is because google.warc is a "complete", ideal WARC file, so importing it first creates the db schema appropriately.

@Florents-Tselai
Owner Author

The proper way to do this is to ship the package itself with a predefined SQL schema, which would require a new version every time the schema changes. This is fine, but the WARC --> relational transformation as-is is just a personal preference, and I'm not convinced it's good enough to nail down.

Then again, we don't want a failed import every time an incomplete WARC file is supplied.

@edsu
Contributor

edsu commented Oct 17, 2023

I wonder if a stub warcinfo record could be generated if one isn't found when encountering the first record?
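As a rough sketch of the stub idea, assuming records are available as plain dicts of WARC headers (WarcDB's real importer may represent them differently), a wrapper could prepend a generated warcinfo record when the stream does not start with one. The names `make_stub_warcinfo` and `with_warcinfo` are hypothetical, not part of WarcDB:

```python
import uuid
from datetime import datetime, timezone
from typing import Dict, Iterable, Iterator

# Hypothetical record shape: a flat dict of WARC header name -> value.
Record = Dict[str, str]


def make_stub_warcinfo() -> Record:
    """Build a minimal placeholder warcinfo record (illustrative fields only)."""
    return {
        "WARC-Type": "warcinfo",
        "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
        "WARC-Date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def with_warcinfo(records: Iterable[Record]) -> Iterator[Record]:
    """Yield the records unchanged, preceded by a stub if the first is not warcinfo."""
    iterator = iter(records)
    first = next(iterator, None)
    if first is None:
        return  # empty input: nothing to import, so no stub either
    if first.get("WARC-Type") != "warcinfo":
        yield make_stub_warcinfo()
    yield first
    yield from iterator


# Usage: a stream that starts with a response record gains a stub in front.
records = list(with_warcinfo([{"WARC-Type": "response"}]))
```

This only guarantees that a warcinfo record exists; it does not by itself guarantee that every other table the importer expects gets created.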

But I see your point that we probably need to define a schema first? Perhaps we could use the wget schema as canonical? It looks like sqlite-utils lets you create tables without inserting:

https://sqlite-utils.datasette.io/en/stable/python-api.html#explicitly-creating-a-table
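To make the "define the schema first" idea concrete, here is a minimal sketch that uses the standard-library `sqlite3` module instead of sqlite-utils, purely to stay dependency-free. The table and column choices are illustrative assumptions, not WarcDB's actual schema, and `init_db` is a hypothetical helper:

```python
import sqlite3

# Illustrative columns only; this is NOT WarcDB's real schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS warcinfo (
    "WARC-Record-ID" TEXT PRIMARY KEY,
    "WARC-Date" TEXT,
    "WARC-Filename" TEXT
);
CREATE TABLE IF NOT EXISTS response (
    "WARC-Record-ID" TEXT PRIMARY KEY,
    "WARC-Warcinfo-ID" TEXT REFERENCES warcinfo("WARC-Record-ID"),
    "WARC-Target-URI" TEXT,
    payload BLOB
);
"""


def init_db(path: str) -> sqlite3.Connection:
    """Create every table up front, independent of what the first WARC file contains."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = init_db(":memory:")

# A response record can be stored even though no warcinfo record was ever seen,
# because the warcinfo table and its columns already exist.
conn.execute(
    'INSERT INTO response ("WARC-Record-ID", "WARC-Target-URI") VALUES (?, ?)',
    ("<urn:uuid:example>", "https://example.com/"),
)

tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
response_count = conn.execute("SELECT COUNT(*) FROM response").fetchone()[0]
```

Because the statements use `IF NOT EXISTS`, calling `init_db` on an already initialized database leaves it unchanged.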

Would it be weird to require users to do a warcdb init warc.db prior to importing records?

It looks like it's pretty new but I wonder if @simonw's https://github.com/simonw/sqlite-migrate could be useful here?
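For readers unfamiliar with schema migrations, the general idea can be sketched with the standard library alone. This is not sqlite-migrate's API; it is a hand-rolled illustration that stores the schema version in SQLite's `PRAGMA user_version`, and the migration functions and columns are made up for the example:

```python
import sqlite3
from typing import Callable, List

Migration = Callable[[sqlite3.Connection], None]


def m001_create_warcinfo(conn: sqlite3.Connection) -> None:
    # Hypothetical first schema version.
    conn.execute('CREATE TABLE warcinfo ("WARC-Record-ID" TEXT PRIMARY KEY)')


def m002_add_filename(conn: sqlite3.Connection) -> None:
    # Hypothetical later change shipped in a newer release.
    conn.execute('ALTER TABLE warcinfo ADD COLUMN "WARC-Filename" TEXT')


MIGRATIONS: List[Migration] = [m001_create_warcinfo, m002_add_filename]


def migrate(conn: sqlite3.Connection, migrations: List[Migration] = MIGRATIONS) -> int:
    """Apply any migrations newer than the stored version; return the new version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, step in enumerate(migrations, start=1):
        if version > current:
            step(conn)
            # PRAGMA does not accept bound parameters; version is a trusted int.
            conn.execute(f"PRAGMA user_version = {version}")
    conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]


conn = sqlite3.connect(":memory:")
after_init = migrate(conn, MIGRATIONS[:1])   # older release: only the first step
after_upgrade = migrate(conn)                # newer release: applies the rest
columns = [row[1] for row in conn.execute("PRAGMA table_info(warcinfo)")]
```

Running `migrate` again on an up-to-date database applies nothing, which is what makes a separate upgrade step safe to repeat.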

edsu added a commit to edsu/WarcDB that referenced this issue Oct 18, 2023
This commit adds a new `init` command which will initialize a SQLite
database with the schema. Initializing the database is required prior to
running `import`. The database schema is managed with sqlite-migrate
which is kind of new, but seems to work well.

So the process for working with warcdb is to:

```bash
$ warcdb init warc.db
$ warcdb import warc.db google.warc.gz
```

Then if you want to update warcdb and apply migrations you can:

```bash
$ pip install --upgrade warcdb
$ warcdb migrate warc.db
```

Closes Florents-Tselai#12