Bug/sc 425841/internal team raster loader killed when using #153
This reverts commit 815c42c.
raster_loader/io/common.py
quantiles = compute_quantiles(qdata, casting_function)

print("Computing most commons values...")
# Not sure whether we should compute most_common values for float bands
Maybe a warning could be shown to the user in this case, so they are aware of potentially meaningless/useless results
I added the warning message as suggested!
LGTM! Great work!
@volaya: I have spent a little bit trying to fix the code so that the CI passes, however it keeps on failing for Python3.8. After some changes and some googling I just realised that |
@cayetanobv just pointed out that |
Issue
This document details the most significant changes made to the `raster_loader` project to fix the bug reported in SC-425841. The changes include improvements in statistics handling, support for overviews, performance optimizations, and updates to the command-line interface.
Proposed Major Changes
1. Support for Approximate and Exact Statistics
Previously, the `raster_loader` project calculated exact statistics for raster bands by default. This required loading the entire raster dataset into memory, which could be inefficient for large datasets. In fact, the `Killed Signal` bug reported in the ticket was caused by the memory consumption of the statistics calculation process.
Users can now choose to compute exact and comprehensive statistics for raster bands on demand. The new options `--exact_stats` and `--all_stats` have been added to the command-line interfaces for BigQuery and Snowflake.
Now, by default, the raster loader computes approximate statistics for raster bands. The `--exact_stats` option computes exact statistics for the raster bands.
The computation of quantiles and most frequent values is also quite resource intensive. For this reason, the loader does not compute these statistics by default, and the `--all_stats` option has been added to compute all statistics.
How everything works now
The current implementation works as follows:
- By default (no option), the loader computes the main approximate statistics for the raster bands: `min`, `max`, `mean`, `std`, `count`, `sum` and `sum_square`.
- The `--all_stats` option computes all approximate statistics for the raster bands. That is, in addition to the main statistics above, it also computes approximate `quantiles` and `most_common`.
- The `--exact_stats` option computes exact statistics for the raster bands: `min`, `max`, `mean`, `std`, `count`, `sum`, `sum_square`, `quantiles`, and `most_common`.
Note
With the `--all_stats` option, the most common values are only computed for integer values.
Note
Users are strongly encouraged to use the approximate statistics (default) and to use `--exact_stats` only when the approximate statistics are not precise enough, since computing exact statistics is quite resource intensive.
Changes made
- Added the `--exact_stats` and `--all_stats` options to the command-line interfaces for BigQuery and Snowflake.
2. Performance Optimizations
When processing raster data, the loader was loading all blocks into memory and then into the DataWarehouse. For sparse raster data, this could lead to unnecessary resource consumption and processing time, because tons of `nodata` values were being stored. We have implemented a mechanism that skips blocks without any data to improve processing efficiency.
Proposed Minor Improvement Changes
3. Improvement in Support for Overviews
Overviews were handled together with the window blocks in the previous version of the `raster_loader` project. This implementation was not optimal for several reasons: i) overviews were not accounted for in the progress tracking, ii) the function handling overviews was huge and difficult to maintain, and iii) the processing of raster overviews was not efficient.
We have now separated the processing of raster overviews from the window blocks. This separation allows for better progress tracking and more efficient processing of raster data.
- New `rasterio_overview_to_records` function to process raster overviews.
4. Improvements in Metadata Handling
5. Improvements in Data Validation
Raster overviews are validated to ensure that the overview factors are consecutive powers of 2. This validation helps prevent errors when processing raster data. However, it was carried out after all the raster data had been uploaded to the DataWarehouse, which could waste resources if the overviews were invalid.
Now, the validation is performed before the raster data is uploaded to the DataWarehouse.
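As a rough illustration of this ordering, here is a minimal sketch that checks the overview factors before anything is uploaded. The names `overview_factors_are_valid`, `load_raster` and `upload_blocks` are hypothetical and not taken from the `raster_loader` code base, and the exact rule (first factor a power of 2, each later factor double the previous one) is an assumption about what "consecutive powers of 2" means here.

```python
def overview_factors_are_valid(factors):
    """Check that factors look like consecutive powers of 2, e.g. [2, 4, 8].

    Assumption: the first factor must be a power of 2 (>= 2) and every
    following factor must be exactly double the previous one.
    """
    factors = list(factors)
    if not factors:
        # Assumption: a raster without overviews is acceptable.
        return True
    first = factors[0]
    is_power_of_two = (
        isinstance(first, int) and first >= 2 and (first & (first - 1)) == 0
    )
    if not is_power_of_two:
        return False
    return all(b == 2 * a for a, b in zip(factors, factors[1:]))


def load_raster(factors, upload_blocks):
    """Validate first, then upload (hypothetical wrapper).

    `upload_blocks` stands in for whatever callable sends data to the
    DataWarehouse; it is only invoked when validation has passed.
    """
    if not overview_factors_are_valid(factors):
        raise ValueError(f"Invalid overview factors: {list(factors)}")
    upload_blocks()
```

With this ordering, a file whose overview factors are, say, `[2, 8]` fails immediately and no blocks are sent.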
6. Bug Fix in Metadata Update
One of the use cases of the loader is to update existing rasters with more data (`--append`). When this happens, the raster metadata needs to be updated with new stats computed from the previous information and the information from the new data.
With the previous implementation, a raster could be updated only once. The second time the raster was updated, the process threw an Uploading error due to rasters with different bands.
The main reason for this was a bug during this process of update, where a fake new band metadata object was added to the metadata information.
This behaviour has been fixed and now it is possible to update the raster as many times as the user wants.
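To make the append case concrete, the sketch below shows one way per-band statistics could be merged from the stored `count`, `sum` and `sum_square` values without re-reading the old data. It is an illustration under stated assumptions, not the project's actual code: the function names and the dictionary layout are hypothetical, `std` is taken to be the population standard deviation, and both inputs are assumed to have a non-zero `count`.

```python
import math


def merge_band_stats(old, new):
    """Combine the stats of one band from existing metadata and appended data.

    Hypothetical layout: each dict holds min, max, count, sum and sum_square.
    """
    count = old["count"] + new["count"]
    total = old["sum"] + new["sum"]
    sum_square = old["sum_square"] + new["sum_square"]
    mean = total / count
    # Population variance from the running sums; clamp tiny negative
    # values caused by floating-point rounding.
    variance = max(sum_square / count - mean ** 2, 0.0)
    return {
        "min": min(old["min"], new["min"]),
        "max": max(old["max"], new["max"]),
        "count": count,
        "sum": total,
        "sum_square": sum_square,
        "mean": mean,
        "std": math.sqrt(variance),
    }


def merge_metadata_bands(old_bands, new_bands):
    """Merge band by band, keeping the number of bands unchanged."""
    if len(old_bands) != len(new_bands):
        raise ValueError("Rasters have different bands")
    return [merge_band_stats(o, n) for o, n in zip(old_bands, new_bands)]
```

Because the result has the same shape and the same number of bands as its inputs, it can be fed back into the same merge on the next append.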
7. New metadata information
Two new parameters, `colorinterp` and `colortable`, have been added to the metadata information. These two parameters will be used to color rasters at visualisation time.
For the update use case to work, we had to change the metadata validation to take those parameters into account.
Pull Request Checklist