Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove dependency on localstack / refactor s3aggregator to parquet aggregator #652

Closed
birdsarah opened this issue May 13, 2020 · 7 comments · Fixed by #753
Closed

Remove dependency on localstack / refactor s3aggregator to parquet aggregator #652

birdsarah opened this issue May 13, 2020 · 7 comments · Fixed by #753

Comments

@birdsarah
Copy link
Contributor

birdsarah commented May 13, 2020

It's come up in a few issues so gathering the core issue here.

Related: #614, #628

Localstack has caused us a number of headaches:

It's purpose is to mock out s3, but that complexity is driven from the fact that the s3aggregator does two things for structured data:

  • aggregates data into parquet
  • pushes it to s3

In addition to that it saves out individual content files to s3.
We can separate these two functions allowing for openwpm to generate parquet and then to push it to a variety of locations: s3, gcs, or just local.

@birdsarah birdsarah changed the title Remove dependency on localstack Remove dependency on localstack / refactor s3aggregator to parquet aggregator May 13, 2020
@birdsarah
Copy link
Contributor Author

@vringar @englehardt I recall from our conversations that the above description of the issue is not accurate and there's an additional piece of related to saving content. Can you update the above description and we can continue iterating the conversation?

@englehardt
Copy link
Collaborator

englehardt commented May 15, 2020

This seems like a duplicate of #618? We should close that one or this one (this has more context).

And yes, see #390 for context.

Right now we have:

  1. S3Aggregator:
    1. structured data: parquet on S3
    2. unstructured data: individual gzipped files on S3 (very inefficient)
  2. LocalAggregator:
    1. structured data: sqlite
    2. unstructured data: leveldb

it seems like we want an option to independently choose how to save the structured data (e.g., parquet / sql) from the unstructured data (e.g., leveldb when local and something else when remote).

@birdsarah
Copy link
Contributor Author

it seems like we want an option to independently choose how to save the structured data (e.g., parquet / sql) from the unstructured data (e.g., leveldb when local and something else when remote).

+100

@birdsarah
Copy link
Contributor Author

Move from leveldb to rocksdb and use https://github.com/rockset/rocksdb-cloud to push to aws, google cloud etc?

@birdsarah
Copy link
Contributor Author

Move from leveldb to rocksdb and use https://github.com/rockset/rocksdb-cloud to push to aws, google cloud etc?

I don't love this suggestion because I think the path to windows is easier with leveldb.

@vringar
Copy link
Contributor

vringar commented Jun 29, 2020

One big thing I'd want to be mindful of is the discussion around caching and completion callbacks.
If we offload the caching into an external system we loose the knowledge whether something has been actually saved out to the remote storage or just cached locally.
At least for s3fs I couldn't find a way to manually flush, maybe this happens upon closing the FS?
Also if we cache the records to group them into reasonably sized parquet files and then save them out to s3fs we have a two layered caching system, which doesn't seem ideal to me.
So while I agree it would be nice have the data storage separated from the data encoding I think both would need to happen in openwpm or the completion callbacks become less useful.
Maybe we could pass the ParquetAggregator a "Storage Provider" in its constructor. And for tests this would be a local "Storage Provider".

@birdsarah
Copy link
Contributor Author

birdsarah commented Feb 22, 2021

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants