HPI local installation caches Reddit exported data and does not refresh #187

Open
sergio-ns opened this issue Nov 21, 2021 · 4 comments

@sergio-ns

sergio-ns commented Nov 21, 2021

I am experimenting with HPI as I was looking for a system that would allow me to create a repository of my digital traces: cool stuff.

I've installed HPI as per the local/editable option.

I'm testing it with Reddit.
I've configured the path to the Reddit export file in $HOME/.config/my/my/__init__.py by adding:

export_path = "/home/ubuntu/hpi/reddit/*.json"

Rexport uses the information in secrets.py to dump the Reddit data:
python3 -m rexport.export --secrets $HOME/git/rexport/secrets.py > ./reddit/"export-$(date -I).json"

This snippet from the documentation should report the 4 subreddits with the most saved posts:

import my.reddit.all
from collections import Counter
print(Counter(s.subreddit for s in my.reddit.all.saved()).most_common(4))

But the data processed by my.reddit gets cached in $HOME/.cache and is not updated when I rerun the rexport script:

ubuntu@MARS:~/.cache/my$ ls -la
-rw-r--r-- 1 ubuntu ubuntu 1433600 Nov 21 15:35 my.reddit.rexport:comments
-rw-r--r-- 1 ubuntu ubuntu 1400832 Nov 21 15:34 my.reddit.rexport:saved
-rw-r--r-- 1 ubuntu ubuntu   94208 Nov 21 15:35 my.reddit.rexport:submissions
-rw-r--r-- 1 ubuntu ubuntu  561152 Nov 21 15:35 my.reddit.rexport:upvoted

To see the refreshed dump I must first delete the cached files.

What am I missing?

Thanks
s.

@sergio-ns sergio-ns changed the title HPI local installation caches exported data and does not refresh HPI local installation caches Reddit exported data and does not refresh Nov 21, 2021
@sergio-ns
Author

This is probably intended behavior, and the sqlite files are created in the .cache folder by cachew by design. The question is how to get those files recreated after re-running the rexport script. Perhaps removing them as part of the script execution is the most logical approach.
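
A minimal sketch of that workaround in Python, assuming the cache files live in $HOME/.cache/my and share the my.reddit.rexport: prefix shown in the listing above. The helper name clear_reddit_cache is made up for illustration and is not part of HPI or cachew.

```python
from pathlib import Path
from typing import List


def clear_reddit_cache(cache_dir: Path, prefix: str = "my.reddit.rexport:") -> List[str]:
    """Delete cache files whose names start with `prefix`; return the removed names."""
    if not cache_dir.is_dir():
        # Nothing cached yet (or a different cache location): nothing to do.
        return []
    removed = []
    for path in sorted(cache_dir.iterdir()):
        if path.is_file() and path.name.startswith(prefix):
            path.unlink()
            removed.append(path.name)
    return removed
```

Calling clear_reddit_cache(Path.home() / ".cache" / "my") right after the export step would force the next query to rebuild the cache. It only treats the symptom, though, and rebuilds everything on each run.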

@purarue
Contributor

purarue commented Dec 9, 2021

cachew should automatically detect that new input files have been added, recompute the results (e.g. comments), and overwrite the cache database.

On line 88 in my/reddit/rexport.py:

diff --git a/my/reddit/rexport.py b/my/reddit/rexport.py
index cca3e35..5c4d045 100755
--- a/my/reddit/rexport.py
+++ b/my/reddit/rexport.py
@@ -85,7 +85,7 @@ Upvote     = dal.Upvote
 def _dal() -> dal.DAL:
     inp = list(inputs())
     return dal.DAL(inp)
-cache = mcachew(depends_on=inputs) # depends on inputs only
+cache = mcachew(depends_on=inputs, logger=logger) # depends on inputs only


 @cache

If you modify the line to add the logger (this should actually probably be done by default), you can then see what cachew is doing by setting the HPI_LOGS variable like this:

HPI_LOGS=debug hpi query my.reddit.all.comments >/dev/null
[my.reddit.rexport:saved] using inferred type <class 'rexport.dal.Save'>
[my.reddit.rexport:comments] using inferred type <class 'rexport.dal.Comment'>
[my.reddit.rexport:submissions] using inferred type <class 'rexport.dal.Submission'>
[my.reddit.rexport:upvoted] using inferred type <class 'rexport.dal.Upvote'>
using /home/sean/.cache/cachew/my.reddit.rexport:comments for db cache
new hash: cachew: 0.9.0, schema: [Column('raw', Json(), table=None)], dependencies: (PosixPath('/home/sean/data/rexport/20200930T214405Z.json'), PosixPath('/home/sean/data/rexport/20211113T102439Z.json'))
old hash: cachew: 0.9.0, schema: [Column('raw', Json(), table=None)], dependencies: (PosixPath('/home/sean/data/rexport/20200930T214405Z.json'), PosixPath('/home/sean/data/rexport/20211113T102439Z.json'))
hash matched: loading from cache

In most cases you'll see the same "hash matched: loading from cache" message, since the input filenames are the same as the last time it ran.

If you then add a new one by running the rexport, and re-run that:

HPI_LOGS=debug hpi query my.reddit.all.comments >/dev/null
[my.reddit.rexport:saved] using inferred type <class 'rexport.dal.Save'>
[my.reddit.rexport:comments] using inferred type <class 'rexport.dal.Comment'>
[my.reddit.rexport:submissions] using inferred type <class 'rexport.dal.Submission'>
[my.reddit.rexport:upvoted] using inferred type <class 'rexport.dal.Upvote'>
using /home/sean/.cache/cachew/my.reddit.rexport:comments for db cache
new hash: cachew: 0.9.0, schema: [Column('raw', Json(), table=None)], dependencies: (PosixPath('/home/sean/data/rexport/20200930T214405Z.json'), PosixPath('/home/sean/data/rexport/20211113T102439Z.json'), PosixPath('/home/sean/data/rexport/20211209T191206Z.json'))
old hash: cachew: 0.9.0, schema: [Column('raw', Json(), table=None)], dependencies: (PosixPath('/home/sean/data/rexport/20200930T214405Z.json'), PosixPath('/home/sean/data/rexport/20211113T102439Z.json'))
hash mismatch: computing data and writing to db
[D 211209 11:13:17 dal:167] comments: finished processing /home/sean/data/rexport/20200930T214405Z.json:  999/ 999 new; total: 999
[D 211209 11:13:18 dal:167] comments: finished processing /home/sean/data/rexport/20211113T102439Z.json:   21/1000 new; total: 1020
[D 211209 11:13:18 dal:167] comments: finished processing /home/sean/data/rexport/20211209T191206Z.json:    0/1000 new; total: 1020

You should hopefully see it recalculating the results (hash mismatch: computing data and writing to db) to include the new data.
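
To make the match/mismatch behaviour concrete, here is a simplified model in Python of a cache key that depends only on input filenames. It is not cachew's actual hashing code; filename_key and the export-*.json names are assumptions chosen for illustration.

```python
from pathlib import Path
from typing import Iterable, Tuple


def filename_key(paths: Iterable[Path]) -> Tuple[str, ...]:
    """Cache key built from file paths only; contents and mtimes are ignored."""
    return tuple(sorted(str(p) for p in paths))


old = filename_key([Path("export-2021-11-21.json")])
# A file overwritten in place keeps its name, so the key is unchanged.
overwritten = filename_key([Path("export-2021-11-21.json")])
# An additional file with a new name changes the key and would trigger a recompute.
added = filename_key([Path("export-2021-11-21.json"), Path("export-2021-11-22.json")])

print(old == overwritten)  # True
print(old == added)        # False
```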

@purarue
Contributor

purarue commented Dec 9, 2021

Oh -- the only case where I see an issue is if the filenames of the new data are the same as the old, and you seem to be using date -I, which returns something like

date -I
2021-12-09

so if you make multiple exports on the same day, the new one overwrites the old one under the same filename, and cachew, seeing the same filename, assumes the data is unchanged.

Changing the date command to be specific to the second rather than the day, to something like:

python3 -m rexport.export --secrets /path/to/secrets.py >"export-$(date +%s).json"

... may fix this issue, unsure.

@karlicoss
Owner

Yep, I think @seanbreckenridge is right -- it would be due to cachew using filenames by default, so it assumes no changes if you only use the date.

There is something experimental to use the file modification time, but we still need to think about how/if we should rely on it by default: https://github.com/karlicoss/cachew/blob/49d349f5c32ae25d6f5a36279c8f0c5090242da2/src/cachew/__init__.py#L623-L626
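
As a rough illustration of the modification-time idea (not the experimental cachew code linked above, which may work differently), a key that pairs each path with its mtime changes when a file is overwritten under the same name. mtime_key is a hypothetical helper.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, Tuple


def mtime_key(paths: Iterable[Path]) -> Tuple[Tuple[str, int], ...]:
    """Hypothetical cache key: (path, mtime in nanoseconds) for every input file."""
    return tuple(sorted((str(p), p.stat().st_mtime_ns) for p in paths))


with tempfile.TemporaryDirectory() as tmp:
    export = Path(tmp) / "export-2021-12-09.json"
    export.write_text("[]")
    before = mtime_key([export])

    # Simulate a same-day re-export: same filename, modification time one second later.
    stat = export.stat()
    os.utime(export, ns=(stat.st_atime_ns, stat.st_mtime_ns + 1_000_000_000))
    after = mtime_key([export])

print(before[0][0] == after[0][0])  # True: the filename did not change
print(before == after)              # False: the mtime did
```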

And yeah, IMO it's best to keep the full timestamp, either via date +%s or date -Iseconds --utc (a bit more human readable).
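
For anyone driving the export from Python rather than the shell, the two naming schemes can be sketched as below. The export- prefix mirrors the commands in this thread; the function names are assumptions for illustration only.

```python
from datetime import datetime, timezone


def epoch_name(now: datetime) -> str:
    """Name like export-1639077126.json, comparable to `date +%s`."""
    return f"export-{int(now.timestamp())}.json"


def iso_name(now: datetime) -> str:
    """Name like export-2021-12-09T19:12:06+00:00.json, comparable to `date -Iseconds --utc`."""
    stamp = now.astimezone(timezone.utc).isoformat(timespec="seconds")
    return f"export-{stamp}.json"


now = datetime.now(timezone.utc)
print(epoch_name(now))
print(iso_name(now))
```

Either way, two exports taken at least a second apart get different filenames, so a filename-based cache key changes between them.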
