Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rework logprep temp files #402

Merged
merged 4 commits into from
Jun 13, 2023
Merged

Conversation

dtrai2
Copy link
Collaborator

@dtrai2 dtrai2 commented Jun 12, 2023

closes #401

@dtrai2 dtrai2 added the enhancement New feature or request label Jun 12, 2023
@dtrai2 dtrai2 requested a review from ekneg54 June 12, 2023 09:36
@dtrai2 dtrai2 self-assigned this Jun 12, 2023
@dtrai2 dtrai2 linked an issue Jun 12, 2023 that may be closed by this pull request
Copy link
Collaborator

@ekneg54 ekneg54 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will have issues with multiprocessing... :-/

Comment on lines 90 to 93
temp_dir = Path(tempfile.gettempdir())
list_path = temp_dir / Path(f"{self.name}-tldlist-{index}.dat")
list_path.touch()
list_path.write_bytes(GetterFactory.from_string(tld_list).get_raw())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could get issues with multiprocessing on this solution. Not for touching the file and not for reading it, but in case two processed try to write to this file as they do in line 92.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same in the other touched files.

How about creating the file as before with the process name but delete it on restart?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mh yeah I can try to delete the file or the whole temp directory on shutdown of logprep, but that doesn't help with your suggestion from the issue: "this leads to downloading the file on every restart instead of using the already downloaded file."
If the issues can only happen on writing to the file, can't we solve this then with a lock? I think the processors only create this file in the startup process and are not writing to it during the logprep run. So creating these files initially with a lock should be possible if multiple processes can read from it at the same time. Besides, I think even the reading shouldn't be an issue as the TLDExtract library should load the lists content into memory.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes.. a lock should work for this.
And we also should give it a try for the geoip_enricher database. we then should lock every database access.. .

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a FileLock to these write operations.

@dtrai2 dtrai2 requested a review from ekneg54 June 13, 2023 09:27
@codecov-commenter
Copy link

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.11 🎉

Comparison is base (466a809) 92.21% compared to head (7fae7ff) 92.32%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #402      +/-   ##
==========================================
+ Coverage   92.21%   92.32%   +0.11%     
==========================================
  Files         133      133              
  Lines        9452     9460       +8     
==========================================
+ Hits         8716     8734      +18     
+ Misses        736      726      -10     
Impacted Files Coverage Δ
logprep/metrics/metric_targets.py 100.00% <ø> (ø)
...prep/processor/domain_label_extractor/processor.py 98.52% <100.00%> (+0.06%) ⬆️
logprep/processor/domain_resolver/processor.py 100.00% <100.00%> (ø)
logprep/processor/geoip_enricher/processor.py 100.00% <100.00%> (ø)
logprep/util/prometheus_exporter.py 100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ekneg54 ekneg54 merged commit cdaf9a9 into main Jun 13, 2023
@ekneg54 ekneg54 deleted the 401-stores-temporary-files-not-in-tmp branch October 10, 2023 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

stores temporary files not in /tmp
3 participants