Look into columnar data store #521
Overview and Results

The goal of this work was to see how storing the data on S3 in different formats would change the performance of the queries run via Amazon Athena. This is the main service we use to do large-scale data aggregations (as well as allow full archive queries), so optimizing it can lead to great gains. I created a number of sample queries and table definitions (both described below). The results are summarized in the table below, with some discussion after that.
Values in the table are formatted as Time (minutes), Data Scanned (GB).
Curiously, the last query, which was date bound, seems to scan more data. Because we're using … But the big thing here is that, just for those two big aggregations, we'd see a ~75% cost reduction for Athena.

EDIT: I just added runs for ORC (GZIP and SNAPPY compressions). At a quick scan, ORC looks to be much more performant, but at the cost of reading a bit more data.

Queries

Locations Aggregation
Latest Aggregation
Global Averages Example
Single Parameter Averages Example
Date-bound Example
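The query text itself didn't survive the copy above, but a minimal sketch of a date-bound query of this kind, against an assumed `openaq_parquet` table with `parameter`, `value`, and `date_local` columns, might look like:

```sql
-- Hypothetical date-bound query of the kind tested above; the table name
-- (openaq_parquet) and columns (parameter, value, date_local) are assumptions.
SELECT parameter,
       avg(value) AS average,
       count(*)   AS measurement_count
FROM openaq_parquet
WHERE date_local BETWEEN '2019-01-01' AND '2019-01-31'
GROUP BY parameter;
```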
Table Creation

| Format  | Compression | Buckets           | Run time             | Data scanned |
| ------- | ----------- | ----------------- | -------------------- | ------------ |
| Parquet | GZIP        | none              | 3 minutes 27 seconds | 246.34 GB    |
| Parquet | GZIP        | 10 (`date_local`) | 9 minutes 7 seconds  | 246.34 GB    |
| Parquet | SNAPPY      | 10 (`date_local`) | 8 minutes 37 seconds | 246.36 GB    |
| Parquet | SNAPPY      | none              | 3 minutes 11 seconds | 246.36 GB    |
| ORC     | GZIP        | 10 (`date_local`) | 7 minutes 35 seconds | 272.4 GB     |
| ORC     | SNAPPY      | 10 (`date_local`) | 7 minutes 30 seconds | 272.4 GB     |
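The CREATE TABLE statements themselves were collapsed in the copy above; a minimal sketch of the Athena CTAS for one of these variants, with placeholder table names and S3 location, would be along these lines:

```sql
-- Hypothetical CTAS for the "Parquet (GZIP, 10 date_local buckets)" run;
-- the S3 location and table names are placeholders, not the real ones.
-- Note: Athena requires partitioned_by columns to come last in the SELECT,
-- so SELECT * assumes parameter is the final column of the source table.
CREATE TABLE openaq_parquet_gzip
WITH (
  format = 'PARQUET',
  parquet_compression = 'GZIP',
  external_location = 's3://example-bucket/openaq-parquet-gzip/',
  partitioned_by = ARRAY['parameter'],
  bucketed_by = ARRAY['date_local'],
  bucket_count = 10
) AS
SELECT * FROM openaq;
```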
Updated the table to include ORC (GZIP and SNAPPY compressions).
Added some color-coding to more easily compare the different format and compression options:

Parquet with GZIP compression, partitioned by parameter and with no buckets, seems to be the way to go.
INSERT INTO investigation

(pretty much the same as openaq.ddl)

You don't have to specify the format because INSERT INTO automatically uses the format and partitioning of the table being inserted into. Seems like this is a viable option for adding in data as it comes in.
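A minimal sketch of such an incremental insert, with placeholder table names, looks like:

```sql
-- Hypothetical incremental load; Athena picks up the format, compression,
-- and partitioning from the target table, so no WITH clause is needed.
-- Table names and the date filter are placeholders.
INSERT INTO openaq_parquet
SELECT *
FROM openaq
WHERE date_local = '2019-06-01';
```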
As for how to switch the data to the new format, here's a proposed plan that won't significantly change our current processes:
I think this looks like a great plan! Do you think we could just use the existing data bucket and have a …
Building upon #455, we likely want to store the data (or some version of it) in http://parquet.apache.org/ or https://orc.apache.org/. This should greatly decrease the cost of large queries across the data using tools like Athena, as well as increase performance. We'd need to figure out the right format as well as partition schemes and aggregation levels (bin these up by fetch, hour, day, week, etc.?).
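For reference, since Athena reads Hive-style partitions, one possible scheme is an external Parquet table partitioned by parameter; the schema, bucket name, and S3 paths below are purely illustrative:

```sql
-- Illustrative only: a Hive-style external table over Parquet data laid out
-- as s3://example-bucket/openaq-parquet/parameter=pm25/...; the schema and
-- paths are assumptions, not the actual OpenAQ layout.
CREATE EXTERNAL TABLE openaq_parquet (
  location   string,
  value      double,
  date_local string
)
PARTITIONED BY (parameter string)
STORED AS PARQUET
LOCATION 's3://example-bucket/openaq-parquet/';
```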