Data Aggregator Rewrite #753

Merged
merged 152 commits into from
Feb 22, 2021

Contributor

@vringar vringar commented Sep 29, 2020

This rewrite modularizes the storage provider setup to make it more flexible and easier to extend.
It also aims to reduce the coupling between how the extension outputs data and how that data is passed to storage.
Closes #561, Closes #652, Closes #684
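
To illustrate the kind of decoupling described above, here is a minimal sketch, not the code from this PR. All class and method names below (`StructuredStorageProvider`, `InMemoryProvider`, `StorageController`, `store_record`, `handle_record`, `handle_visit_done`) are assumptions chosen for illustration; only `finalize_visit_id` is borrowed from the commit messages, and its signature here is a guess. The idea is that the side receiving records only talks to an abstract provider, so storage backends can be swapped without touching the producer.

```python
import asyncio
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List, Tuple

Record = Dict[str, Any]


class StructuredStorageProvider(ABC):
    """Hypothetical interface: anything that can persist table records."""

    @abstractmethod
    async def store_record(self, table: str, visit_id: int, record: Record) -> None:
        """Buffer or persist one record belonging to a visit."""

    @abstractmethod
    async def finalize_visit_id(self, visit_id: int) -> None:
        """Signal that no more records will arrive for this visit."""


class InMemoryProvider(StructuredStorageProvider):
    """Illustrative backend that keeps everything in dictionaries."""

    def __init__(self) -> None:
        # visit_id -> records received but not yet finalized
        self.pending: DefaultDict[int, List[Tuple[str, Record]]] = defaultdict(list)
        # table name -> records of finalized visits
        self.committed: DefaultDict[str, List[Record]] = defaultdict(list)

    async def store_record(self, table: str, visit_id: int, record: Record) -> None:
        self.pending[visit_id].append((table, record))

    async def finalize_visit_id(self, visit_id: int) -> None:
        # Move all buffered records of this visit into their tables.
        for table, record in self.pending.pop(visit_id, []):
            self.committed[table].append(record)


class StorageController:
    """Forwards incoming records to whichever provider it was given."""

    def __init__(self, provider: StructuredStorageProvider) -> None:
        self.provider = provider

    async def handle_record(self, table: str, visit_id: int, record: Record) -> None:
        await self.provider.store_record(table, visit_id, record)

    async def handle_visit_done(self, visit_id: int) -> None:
        await self.provider.finalize_visit_id(visit_id)


def run_example() -> Dict[str, List[Record]]:
    provider = InMemoryProvider()
    controller = StorageController(provider)

    async def crawl() -> None:
        await controller.handle_record("http_requests", 1, {"url": "https://example.com"})
        await controller.handle_visit_done(1)

    asyncio.run(crawl())
    return dict(provider.committed)
```

Under these assumptions, supporting another backend would mean writing one more subclass of the provider interface; the controller and the data producer stay unchanged.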


codecov bot commented Sep 30, 2020

Codecov Report

Merging #753 (4820b2c) into master (a81d80a) will increase coverage by 50.30%.
The diff coverage is 69.24%.


@@             Coverage Diff             @@
##           master     #753       +/-   ##
===========================================
+ Coverage        0   50.30%   +50.30%     
===========================================
  Files           0       34       +34     
  Lines           0     3306     +3306     
===========================================
+ Hits            0     1663     +1663     
- Misses          0     1643     +1643     
Impacted Files Coverage Δ
openwpm/commands/browser_commands.py 26.87% <ø> (ø)
openwpm/storage/cloud_storage/s3_storage.py 0.00% <0.00%> (ø)
openwpm/utilities/build_cookie_table.py 0.00% <0.00%> (ø)
openwpm/commands/profile_commands.py 24.61% <17.85%> (ø)
openwpm/deploy_browsers/deploy_firefox.py 22.22% <30.00%> (ø)
openwpm/storage/cloud_storage/gcp_storage.py 39.62% <39.62%> (ø)
openwpm/socket_interface.py 58.38% <46.66%> (ø)
openwpm/storage/storage_controller.py 47.36% <47.36%> (ø)
openwpm/mp_logger.py 67.07% <50.00%> (ø)
openwpm/browser_manager.py 59.00% <80.00%> (ø)
... and 47 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

docs/Configuration.md (outdated, resolved)
openwpm/mp_logger.py (resolved)
openwpm/socket_interface.py (resolved)
docs/Configuration.md (resolved)
openwpm/storage/cloud_storage/gcp_storage.py (resolved)
openwpm/storage/parquet_schema.py (outdated, resolved)
openwpm/task_manager.py (resolved)
openwpm/task_manager.py (outdated, resolved)
logger = MPLogger(log_path, log_level_console=logging.DEBUG)
yield logger
logger.close()
# The performance hit for this might be unacceptable but it might help us discover bugs
Collaborator

What does this mean in terms of total test suite runtime? In general I think it's okay to trade off some test time if it means that we'll catch errors that we'd otherwise miss.
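
For readers without the diff at hand, the setup/teardown pattern in the snippet under review can be sketched as follows. This is an illustration only: it uses the standard library `logging` module as a stand-in for `MPLogger`, and `debug_logger` and `run_example` are hypothetical names that do not appear in the PR.

```python
import logging
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def debug_logger(log_path: Path) -> Iterator[logging.Logger]:
    """Create a DEBUG-level file logger, hand it to the caller, then clean up."""
    logger = logging.getLogger(f"sketch.{log_path.name}")
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(log_path)
    logger.addHandler(handler)
    try:
        yield logger
    finally:
        # Teardown runs even if the body raises, mirroring logger.close()
        # after the yield in the reviewed fixture.
        logger.removeHandler(handler)
        handler.close()


def run_example() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "test.log"
        with debug_logger(log_path) as logger:
            logger.debug("visiting page")
        return log_path.read_text()
```

Logging at DEBUG level writes every message to disk, which is where the runtime cost asked about above would come from.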

test/storage/test_gcp.py Show resolved Hide resolved
vringar and others added 3 commits February 22, 2021 12:56
Contributor Author

vringar commented Feb 22, 2021

I think I have addressed all outstanding comments.
The merge request UI is slowly but surely breaking under the number of commits and files changed.
So I'm proposing to merge this and fix all follow-up issues in new PRs.
With two 👍 on this comment I'll merge it.

@vringar vringar merged commit b29c3f4 into master Feb 22, 2021
@vringar vringar deleted the DataAggregatorRefactor branch February 22, 2021 16:51
Zaxeli pushed a commit to Zaxeli/OpenWPM that referenced this pull request Aug 10, 2021
* First steps in the rewrite

* Fixed import paths

* One giant refactor

* Fixing tests

* Adding mypy

* Removed mypy from pre-commit workflow

* First draft on DataAggregator

* Wrote a DataAggregator that starts and shuts down

* Created tests and added more empty types

* Got demo.py working

* Created sql_provider

* Cleaned up imports in TaskManager

* Added async

* Fixed minor bugs

* First steps at porting arrow

* Introduced TableName and different Task handling

* Added more failing tests

* First first completes others don't

* It works

* Started working on arrow_provider

* Implemented ArrowProvider

* Added logger fixture

* Fixed test_storage_controller

* Fixing OpenWPMTest.visit()

* Moved test/storage_providers to test/storage

* Fixing up tests

* Moved automation to openwpm

* Readded datadir to .gitignore

* Ran repin.sh

* Fixed formatting

* Let's see if this works

* Fixed imports

* Got arrow_memory_provider working

* Starting to rewrite tests

* Setting up fixtures

* Attempting to fix all the tests

* Still fixing tests

* Broken content saving

* Added node

* Fixed screenshot tests

* Fixing more tests

* Fixed tests

* Implemented local_storage.py

* Cleaned up flush_cache

* Fixing more tests

* Wrote test for LocalArrowProvider

* Introduced tests for local_storage_provider.py

* Asserting test dir is empty

* Creating subfolder for different aggregators

* New dependencies and init()

* Everything is terribly broken

* Figured out finalize_visit_id

* Running two event loops kinda works???

* Rearming the event

* Introduced mypy

* Downgraded black in pre-commit

* Modifying the database directly

* Fixed formatting

* Made mypy a lil stricter

* Fixing docs and config printing

* Realising I've been using the wrong with

* Trying to figure arrow_storage

* Moving lock initialization in in_memory_storage

* Fixing tests

* Fixing up tests and adding more typechecking

* Fixed num_browsers in test_cache_hits_recorded

* Parametrized unstructured

* String fix

* Added failing test

* New test

* Review changes with Steven

* Fixed repin.sh and test_arrow_cache

* Minor change

* Fixed prune-environment.py

* Removing references to DataAggregator

* Fixed test_seed_persistance

* More paths

* Fixed test display shutdown

* Made cache test more robust

* Update crawler.py

Co-authored-by: Steven Englehardt <senglehardt@mozilla.com>

* Slimming down ManagerParams

* Fixing more tests

* Update test/storage/test_storage_controller.py

Co-authored-by: Steven Englehardt <senglehardt@mozilla.com>

* Purging references to DataAggregator

* Reverted changes to .travis.yml

* Demo.py saves locally again

* Readjusting test paths

* Expanded comment on initialize to reference openwpm#846

* Made token optional in finalize_visit_id

* Simplified test parametrization

* Fixed callback semantics change

* Removed test_parse_http_stack_trace_str

* Added DataSocket

* WIP need to fix path encoding

* Fixed path encoding

* Added task and crawl to schema

* Fixed paths in GitHub actions

* Refactored completion handling

* Fix tests

* Trying to fix tests on CI

* Removed redundant setting of tag

* Removing references to S3

* Purging more DataAggregator references

* Cranking up logging to figure out test failure

* Moved test_values into a fixture

* Fixing GcpUnstructuredProvider

* Fixed paths for future crawls

* Renamed sqllite to official sqlite

* Restored demo.py

* Update openwpm/commands/profile_commands.py

Co-authored-by: Georgia Kokkinou <geor5ko@gmail.com>

* Restored previous behaviour of DumpProfileCommand

Co-authored-by: Georgia Kokkinou <geor5ko@gmail.com>

* Removed leftovers

* Cleaned up comments

* Expanded lock check

* Fixed more stuff

* More comment updates

* Update openwpm/socket_interface.py

Co-authored-by: Georgia Kokkinou <geor5ko@gmail.com>

* Removed outdated comment

* Using config_encoder

* Renamed tar_location to tar_path

* Removed references to database_name in docs

* Cleanup

* Moved screenshot_path and source_dump_path to ManagerParamsInternal

* Fixed imports

* Fixing up comments

* Fixing up comments

* More docs

* updated dependencies

* Fixed test_task_manager

* Reupgraded to python 3.9.1

* Restoring crawl_reference in mp_logger

* Removed unused imports

* Apply suggestions from code review

Co-authored-by: Steven Englehardt <senglehardt@mozilla.com>

* Cleaned up socket handling

* Fixed TaskManager.__exit__

* Moved validation code into config.py

* Removed comment

* Removed comment

* Removed comment

Co-authored-by: Steven Englehardt <senglehardt@mozilla.com>
Co-authored-by: Georgia Kokkinou <geor5ko@gmail.com>
3 participants