Add init command and migrations #13
@Florents-Tselai one thing I was wondering is if maybe we should normalize the column names, so that they look a bit more like what we would expect to see in a database? So instead of You may notice that currently we have some inconsistency,
This commit adds a new `init` command which will initialize a SQLite database with the schema. Initializing the database is required prior to running `import`. The database schema is managed with sqlite-migrate, which is fairly new but seems to work well.

So the process for working with warcdb is to:

```bash
$ warcdb init warc.db
$ warcdb import warc.db google.warc.gz
```

Then if you want to update warcdb and apply migrations you can:

```
$ pip install --upgrade warcdb
$ warcdb migrate warc.db
```

Closes Florents-Tselai#12
def test_import(warc_path):
    runner = CliRunner()

    with runner.isolated_filesystem() as fs:
        DB_FILE = "test_warc.db"
I had to remove this `runner.isolated_filesystem` context manager because using the runner twice (once to init and then again to import) worked, but printed this annoying warning at the end of the run:
(warcdb-py3.11) ➜ WarcDB git:(init-schema) ✗ /Users/edsummers/.pyenv/versions/3.11.2/lib/python3.11/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching. Some resources might leak.
warnings.warn('resource_tracker: process died unexpectedly, '
Traceback (most recent call last):
File "/Users/edsummers/.pyenv/versions/3.11.2/lib/python3.11/multiprocessing/resource_tracker.py", line 209, in main
cache[rtype].remove(name)
KeyError: '/mp-6s_7o47z'
So instead I just removed the test db file at the end of each run.
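The cleanup approach described above could look roughly like the following. This is a minimal sketch using only the standard library: `run_step` is a hypothetical stand-in for one CLI invocation (it is not part of warcdb), and the SQL statements are placeholders for the `init` and `import` steps.

```python
import os
import sqlite3

DB_FILE = "test_warc.db"  # same file name as in the test above


def run_step(db_path, sql):
    """Hypothetical stand-in for one CLI invocation against the database."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(sql)
        conn.commit()
    finally:
        conn.close()


def test_two_steps_then_cleanup():
    try:
        run_step(DB_FILE, "CREATE TABLE t (x TEXT)")     # placeholder for init
        run_step(DB_FILE, "INSERT INTO t VALUES ('a')")  # placeholder for import
        assert os.path.exists(DB_FILE)
    finally:
        # Remove the test database even if an assertion above fails.
        if os.path.exists(DB_FILE):
            os.remove(DB_FILE)
```

Putting the removal in a `finally` block means a failed assertion does not leave a stale database behind for the next run.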
from warcio import ArchiveIterator, StatusAndHeaders
from warcio.recordloader import ArcWarcRecord

from warcdb.migrations import migration
I removed unused imports and sorted them using isort.
    Initialize a new warcdb database
    """
    db = WarcDB(db_path)
    migration.apply(db.db)
This is the new command to initialize a database using the migrations.
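To illustrate the general idea behind applying migrations at `init` time, here is a minimal sketch using only the standard library `sqlite3` module. It is not how sqlite-migrate is implemented, and the table names, column names, and the `_applied_migrations` bookkeeping table are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical migrations as (name, SQL) pairs; the schema is illustrative only.
MIGRATIONS = [
    ("m001_initial", "CREATE TABLE warcinfo (warc_record_id TEXT PRIMARY KEY)"),
    ("m002_add_date", "ALTER TABLE warcinfo ADD COLUMN warc_date TEXT"),
]


def apply_migrations(conn):
    """Apply migrations that have not been recorded yet; return their names."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS _applied_migrations (name TEXT PRIMARY KEY)"
    )
    done = {row[0] for row in conn.execute("SELECT name FROM _applied_migrations")}
    applied = []
    for name, sql in MIGRATIONS:
        if name in done:
            continue  # already applied on an earlier run
        conn.execute(sql)
        conn.execute("INSERT INTO _applied_migrations (name) VALUES (?)", (name,))
        applied.append(name)
    conn.commit()
    return applied


conn = sqlite3.connect(":memory:")
first_run = apply_migrations(conn)   # applies both migrations
second_run = apply_migrations(conn)  # nothing left to apply
```

Because applied migrations are recorded, running the same step again on an existing database is safe, which is what makes a separate `migrate` command after an upgrade workable.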
Yes, in the first iteration, I just dragged and dropped the fields as they appear in the WARC spec, but hyphens complicate things a lot. Let's open a separate issue to discuss this; probably use camelCase instead of hyphens.
@@ -51,8 +46,7 @@ def record_payload(self: ArcWarcRecord):
 @cache
 def record_as_dict(self: ArcWarcRecord):
     """Method to easily represent a record as a dict, to be fed into db_utils.Database.insert()"""
-
-    return dict(self.rec_headers.headers)
+    return {k.lower().replace('-', '_'): v for k, v in self.rec_headers.headers}
Column names are normalized by lowercasing and replacing '-' with '_'. So `WARC-Record-Id` will be `warc_record_id`.
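As a quick standalone illustration of that normalization, the sketch below applies the same `lower()` and `replace('-', '_')` steps to a list of header pairs. The helper name `normalize_header_name` and the sample header values are made up for the example.

```python
def normalize_header_name(name: str) -> str:
    """Lowercase a header name and swap dashes for underscores."""
    return name.lower().replace("-", "_")


# Sample (name, value) pairs; the values are placeholders, not real record data.
headers = [
    ("WARC-Record-ID", "<urn:uuid:example>"),
    ("Content-Length", "42"),
]

row = {normalize_header_name(k): v for k, v in headers}
```

The resulting keys (`warc_record_id`, `content_length`) can be used as SQLite column names without quoting.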
Oops I didn't see your comment beforehand. Hopefully
This commit normalizes the column names so that they are lowercased and have underscores instead of dashes. Hopefully it's not disruptive for existing uses of warcdb!
I removed the warc-info record from google.warc and saved as no-warc-info.warc to test whether the import works when warc-info isn't present. Just to cut down on the size of the repo.
I agree with the workflow of init & migrate, but I'm merging this to unblock you and maybe we can circle back again.
This commit adds a new `init` command which will initialize a SQLite database with the canonical warcdb schema. The initial schema was derived from importing `tests/google.warc` and using its schema as a starting place. The `init` step was added to the unit test, and the `tests/apod.warc.gz` file was added to the list of files that are tested so we can see that it works.

For command line users, initializing the database with `init` is required prior to running `import`. The database schema is managed with sqlite-migrate, which is fairly new but seems to work well.

So the process for working with warcdb is to run `warcdb init warc.db` and then `warcdb import warc.db google.warc.gz`.

Then if you want to update warcdb and apply migrations, you can run `pip install --upgrade warcdb` followed by `warcdb migrate warc.db`.
Closes #12
Closes #6