Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core/structure: help locate/extract gdpr exports #175

Merged
merged 5 commits into from
Jul 7, 2021

Conversation

seanbreckenridge
Copy link
Contributor

@seanbreckenridge seanbreckenridge commented Jul 1, 2021

I tried to refactor get_files into something that could handle
extracting zipfiles or traversing structures, but the differences between
zipfile.Path and pathlib.Path make it pretty annoying, starts
polluting the types and handling that without some hack
of just overwriting/subclassing the type seems difficult, especially
while remaining backwards compatible

The purpose here was to temporarily extract the zipfile
into a temporary directory so other code (like my discord data module) can treat it as a real
filesystem, instead of having to deal with zipfile/archive code
everywhere.

I expect to run into the same problem for Google takeout in the future, it feels
easier to handle the extraction here in HPI, and then pass off a real directory
to Google takeout which knows how to parse that directory structure

This is something I had originally wrote just for my
discord module, but I feel like it would be pretty useful
to traverse any sort of gdpr directory structure (especially when
you're not sure if you have one or more gdpr exports, and you
want to merge them)

The context manager transparently unzips into the core_config.get_tmp_dir
folder (if the passed base is a zipfile), otherwise it recursively searches
for matching relative paths. If you're using cachew, that means it only gets
unzipped once while youre cache is still valid

I think the docstrings are good enough, let me know if theres anything you'd
want to change, particularly maybe about the location of the testdata/tests?

I think I added enough robustness so I don't mistakenly shutil.rmtree stuff,
but I wrote this on a plane with only access to python3 -c 'import contextlib; help(contextlib)'
so would appreciate just a look over that as this is the first contextmanager I've ever written

@seanbreckenridge
Copy link
Contributor Author

somewhat related karlicoss/kompress#10

@seanbreckenridge
Copy link
Contributor Author

seanbreckenridge commented Jul 1, 2021

Hmm, think the failures are unrelated to this PR... ?

Not sure, looks like a mypy config error or something

Expect there to be a few more commits before this can be squashed

@seanbreckenridge
Copy link
Contributor Author

seanbreckenridge commented Jul 2, 2021

Think thats all for now

Could also add tar or other stuff in the future, most gdpr exports come in zip format and zipfile is a built-in


if is_zip:
# make sure we're not mistakenly deleting data
assert str(searchdir).startswith(str(tdir)), f"Expected the temporary directory for extracting zip to start with the temporary directory prefix ({tdir}), found {searchdir}"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context manager looks good! It also seems safe because of this assert; even though in theory it's doesn't look really necessary in the current version of code, if someone added some code between line 113-115, and it threw exception, it would be possible to delete the actual zip file by accident.
Such condition cleanup arises now and then and I prefer to handle it slightly differently

Although I feel like for such cases it's cleaner to rely on context managers as well. So I'd rewrite it as something like:

@contextmanager
def ctx_zip():
    with ZipFile(base.absolute()) as zf, tempfile.TemporaryDirectory(dir=tdir) as td:
        # technically, ZipFile ctx manager is not necessarily since we're reading only?
        # but still, I guess it's nice to call close()
        zf.extractall(path=str(td))
        yield Path(td)

@contextmanager
def ctx_dir():
    yield base.absolute()

ctx = ctx_zip if is_zip else ctx_dir

with ctx as searchdir:
    # rest of code in try section (lines 124-142)

That way all the logic for each case is local and it's easier to reason about it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, I knew there was something I could do with yielding back the location to make it generic across dir/zipfile -- agreed it is cleaner, not sure if thats more or less understandable for someone unfamiliar with contextmanagers trying to read this code though

I'm fine leaving it as is if you are

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah sure!

@karlicoss
Copy link
Owner

Thanks, all looks good overall! Sorry took so long to look, was away.

@karlicoss karlicoss merged commit 821bc08 into karlicoss:master Jul 7, 2021
@seanbreckenridge seanbreckenridge deleted the structure branch July 9, 2021 21:48
@seanbreckenridge
Copy link
Contributor Author

seanbreckenridge commented Jul 23, 2021

seanbreckenridge/HPI@13254fe

As an update, integrated this into my.discord; now my discord data is just:

$ du -h discord
200M    discord

(When unzipped, it would have been):

$ du -h --summarize discord2
2.2G    discord2

With

# parses the GDPR export
class discord:
    export_path: Paths = data("discord/*.zip")
HPI_LOGS=debug hpi doctor --skip-config-check my.discord
[DEBUG   2021-07-22 19:51:10 my.discord __init__.py:801] [my.discord:messages] using inferred type <class 'discord_data.model.Message'>
[DEBUG   2021-07-22 19:51:10 my.discord __init__.py:801] [my.discord:activity] using inferred type <class 'discord_data.model.Activity'>
✅ OK  : my.discord
[DEBUG   2021-07-22 19:51:10 my.discord __init__.py:918] using /home/sean/.cache/cachew/my.discord:messages for db cache
[DEBUG   2021-07-22 19:51:10 my.discord __init__.py:921] new hash: cachew: 0.9.0, schema: [Column('message_id', Integer(), table=None), Column('timestamp', IsoDateTime(), table=None), Column('channel_channel_id', Integer(), table=None), Column('channel_name', String(), table=None), Column('channel__server_is_null', Boolean(), table=None), Column('channel_server_server_id', Integer(), table=None), Column('channel_server_name', String(), table=None), Column('content', String(), table=None), Column('attachments', String(), table=None)], dependencies: ['/home/sean/data/discord/march_2021.zip', '/home/sean/data/discord/october_2020.zip']
[DEBUG   2021-07-22 19:51:10 my.discord __init__.py:948] old hash: cachew: 0.9.0, schema: [Column('message_id', Integer(), table=None), Column('timestamp', IsoDateTime(), table=None), Column('channel_channel_id', Integer(), table=None), Column('channel_name', String(), table=None), Column('channel__server_is_null', Boolean(), table=None), Column('channel_server_server_id', Integer(), table=None), Column('channel_server_name', String(), table=None), Column('content', String(), table=None), Column('attachments', String(), table=None)], dependencies: ['/home/sean/data/discord/march_2021.zip', '/home/sean/data/discord/october_2020.zip']
[DEBUG   2021-07-22 19:51:10 my.discord __init__.py:951] hash matched: loading from cache
[DEBUG   2021-07-22 19:51:12 my.discord __init__.py:918] using /home/sean/.cache/cachew/my.discord:activity for db cache
[DEBUG   2021-07-22 19:51:12 my.discord __init__.py:921] new hash: cachew: 0.9.0, schema: [Column('event_id', String(), table=None), Column('event_type', String(), table=None), Column('_region_info_is_null', Boolean(), table=None), Column('region_info_city', String(), table=None), Column('region_info_country_code', String(), table=None), Column('region_info_region_code', String(), table=None), Column('region_info_time_zone', String(), table=None), Column('fingerprint_os', String(), table=None), Column('fingerprint_os_version', String(), table=None), Column('fingerprint_browser', String(), table=None), Column('fingerprint_ip', String(), table=None), Column('fingerprint_isp', String(), table=None), Column('fingerprint_device', String(), table=None), Column('fingerprint_distro', String(), table=None), Column('timestamp', IsoDateTime(), table=None)], dependencies: ['/home/sean/data/discord/march_2021.zip', '/home/sean/data/discord/october_2020.zip']
[DEBUG   2021-07-22 19:51:12 my.discord __init__.py:948] old hash: cachew: 0.9.0, schema: [Column('event_id', String(), table=None), Column('event_type', String(), table=None), Column('_region_info_is_null', Boolean(), table=None), Column('region_info_city', String(), table=None), Column('region_info_country_code', String(), table=None), Column('region_info_region_code', String(), table=None), Column('region_info_time_zone', String(), table=None), Column('fingerprint_os', String(), table=None), Column('fingerprint_os_version', String(), table=None), Column('fingerprint_browser', String(), table=None), Column('fingerprint_ip', String(), table=None), Column('fingerprint_isp', String(), table=None), Column('fingerprint_device', String(), table=None), Column('fingerprint_distro', String(), table=None), Column('timestamp', IsoDateTime(), table=None)], dependencies: ['/home/sean/data/discord/march_2021.zip', '/home/sean/data/discord/october_2020.zip']
[DEBUG   2021-07-22 19:51:12 my.discord __init__.py:951] hash matched: loading from cache
✅     - stats: {'messages': {'count': 132872, 'last': datetime.datetime(2019, 4, 19, 5, 17, 25, 806000, tzinfo=datetime.timezone.utc)}, 'activity': {'count': 1124745, 'last': datetime.datetime(2020, 7, 5, 3, 15, 3, 307000, tzinfo=datetime.timezone.utc)}}

@karlicoss
Copy link
Owner

very cool! mine are nearly not as huge, but still nice to not have to maintain an unpacked view!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants