Bug/sc 425841/internal team raster loader killed when using #153

Merged

@rantolin rantolin commented Oct 30, 2024

Issue

This document details the most significant changes made to the raster_loader project to fix the bug reported in SC-425841. The changes include improvements in statistics handling, support for overviews, performance optimizations, and updates to the command-line interface.

Proposed Major Changes

1. Support for Approximate and Exact Statistics

Previously, the raster_loader project always calculated exact statistics for raster bands. This required loading the entire raster dataset into memory, which is inefficient for large datasets. In fact, the Killed signal reported in the ticket was caused by the memory consumption of the statistics calculation process.

Users can now choose to compute exact and comprehensive statistics for raster bands on demand. The new options --exact_stats and --all_stats have been added to the command-line interfaces for BigQuery and Snowflake.

By default, the raster loader now computes approximate statistics for raster bands. The --exact_stats option computes exact statistics instead.

The computation of quantiles and most frequent values is also quite resource intensive. For this reason, the loader does not compute these statistics by default, and the --all_stats option has been added to compute all statistics.

How everything works now

The current implementation works as follows:

  1. By default, the raster loader computes approximate main statistics for raster bands: min, max, mean, std, count, sum and sum_square.
  2. The --all_stats option computes all approximate statistics for the raster bands. That is, in addition to the main statistics above, it also computes approximate quantiles and most_common.
  3. If the approximate statistics are not precise enough, the --exact_stats option computes exact statistics for the raster bands: min, max, mean, std, count, sum, sum_square, quantiles, and most_common.

Note

With the --all_stats option, the most common values are computed only for integer values.

Note

Users are strongly encouraged to use the approximate statistics (the default) and to use --exact_stats only when the approximate statistics are not precise enough, as computing exact statistics is quite resource intensive.
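The trade-off between the two modes can be illustrated with a minimal sketch. It assumes that approximate statistics are taken from a decimated sample of the band (every `step`-th row and column); the PR does not state how the loader actually approximates, so `band_stats`, `main_stats` and the `step` parameter are hypothetical names used only for illustration.

```python
import numpy as np


def main_stats(values: np.ndarray) -> dict:
    """Compute the main statistics for a 1-D array of valid pixel values."""
    values = values.astype("float64")
    return {
        "min": float(values.min()),
        "max": float(values.max()),
        "mean": float(values.mean()),
        "std": float(values.std()),
        "count": int(values.size),
        "sum": float(values.sum()),
        "sum_square": float((values ** 2).sum()),
    }


def band_stats(band: np.ndarray, nodata=None, exact: bool = False, step: int = 4) -> dict:
    """Hypothetical helper, not the loader's real function.

    exact=True reads every pixel; exact=False reads only every `step`-th
    row and column (an assumed approximation strategy).
    """
    data = band if exact else band[::step, ::step]
    # Drop nodata pixels before computing statistics.
    valid = data[data != nodata] if nodata is not None else data.ravel()
    return main_stats(valid)


band = np.arange(64, dtype="int32").reshape(8, 8)
exact = band_stats(band, exact=True)
approx = band_stats(band, exact=False, step=4)
print(exact["count"], approx["count"])
```

In this sketch the approximate mode touches only a fraction of the pixels, which is why it needs far less memory, and why its count and sum describe the sample rather than the full band.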

Changes made

  • Added options to calculate exact and comprehensive statistics for raster bands on demand.
  • Implemented new functions for approximate and exact statistics calculation.
  • Added --exact_stats and --all_stats options in the command-line interfaces for BigQuery and Snowflake.
@click.option(
    "--exact_stats",
    help="Compute exact statistics for the raster bands.",
    default=False,
    is_flag=True,
)
@click.option(
    "--all_stats",
    help="Compute all statistics including quantiles and most frequent values.",
    required=False,
    is_flag=True,
)

2. Performance Optimizations

When processing raster data, the loader was loading all blocks into memory and into the data warehouse. For sparse raster data, this led to unnecessary resource consumption and processing time, because tons of nodata values were stored. We have implemented a mechanism to skip blocks without any data to improve processing efficiency.

  • Implemented skipping of blocks without data to improve processing efficiency.
  • Added an empty block counter for better progress tracking.
# Skip blocks without any data to relieve loading burden
if no_data_value is not None and np.all(arr == no_data_value):
    return None
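How the skip and the empty block counter fit together can be sketched as follows. This is a simplified illustration on an in-memory NumPy array, not the loader's code: `split_blocks` and its return shape are assumptions, and only the nodata test is taken from the excerpt above.

```python
import numpy as np


def split_blocks(raster: np.ndarray, block_size: int, no_data_value=None):
    """Hypothetical helper: return non-empty blocks and the number skipped."""
    records = []
    empty_blocks = 0
    height, width = raster.shape
    for row in range(0, height, block_size):
        for col in range(0, width, block_size):
            arr = raster[row:row + block_size, col:col + block_size]
            # Same test as in the PR excerpt: a block made only of nodata is skipped.
            if no_data_value is not None and np.all(arr == no_data_value):
                empty_blocks += 1
                continue
            records.append((row, col, arr))
    return records, empty_blocks


raster = np.full((4, 4), -1, dtype="int16")
raster[0, 0] = 5  # a single valid pixel in the top-left block
records, empty_blocks = split_blocks(raster, block_size=2, no_data_value=-1)
print(len(records), empty_blocks)
```

Note that an equality test never matches when the nodata value is NaN, so a raster using NaN as nodata would need `np.isnan` instead.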

Proposed Minor Improvement Changes

3. Improvement in Support for Overviews

Overviews were handled together with the window blocks in the previous version of the raster_loader project. This implementation was not optimal for several reasons: i) overviews were not accounted for in the progress tracking, ii) the function handling overviews was huge and difficult to maintain, and iii) the processing of raster overviews was not efficient.

We have now separated the processing of raster overviews from the window blocks. This separation allows for better progress tracking and more efficient processing of raster data.

  • Implemented the rasterio_overview_to_records function to process raster overviews.
  • Modified the processing logic to include both base blocks and overviews.
overviews_records_gen = rasterio_overview_to_records(
    file_path,
    self.band_rename_function,
    bands_info
)

windows_records_gen = rasterio_windows_to_records(
    file_path,
    self.band_rename_function,
    bands_info,
)
records_gen = chain(overviews_records_gen, windows_records_gen)
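The benefit for progress tracking can be shown with a small stand-alone sketch. The two generators below are stand-ins for rasterio_overview_to_records and rasterio_windows_to_records (their real signatures take a file path and band information, as shown above); `load` and the record layout are hypothetical.

```python
from itertools import chain


def overview_records(levels):
    """Stand-in for the overview generator: one record per overview level."""
    for level in levels:
        yield {"kind": "overview", "id": level}


def window_records(windows):
    """Stand-in for the window generator: one record per base block."""
    for window in windows:
        yield {"kind": "window", "id": window}


def load(levels, windows):
    # Overviews and base blocks are counted together, so the progress
    # total covers everything that will be uploaded.
    total = len(levels) + len(windows)
    records_gen = chain(overview_records(levels), window_records(windows))
    processed = []
    for done, record in enumerate(records_gen, start=1):
        processed.append(record)
        print(f"{done}/{total} blocks processed")
    return processed


result = load(levels=[2, 4], windows=[0, 1, 2])
```

Because `chain` is lazy, neither generator is materialised in full; records are produced one at a time, overviews first and base blocks afterwards.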

4. Improvements in Metadata Handling

  • Added support for color tables in band metadata.
  • Implemented functions to check metadata compatibility when adding records to existing tables.
def get_color_table(raster_dataset: rasterio.io.DatasetReader, band: int):
    try:
        if raster_dataset.colorinterp[band - 1].name == "palette":
            return raster_dataset.colormap(band)
        return None
    except ValueError:
        return None

5. Improvements in Data Validation

Raster overviews were already validated to ensure that the overview factors are consecutive powers of 2. This validation helps prevent errors when processing raster data. However, it was carried out after all the raster data had been uploaded to the data warehouse, which wasted resources if the overviews were invalid.

Now, the validation is performed before the raster data is uploaded to the data warehouse.

  • Implemented functions to validate the structure of raster blocks and overview factors.
def is_valid_raster_dataset(raster_dataset: rasterio.io.DatasetReader) -> bool:
    if not is_valid_block_shapes(raster_dataset.block_shapes):
        raise ValueError("Invalid block shapes: must be equal for all bands")

    if not is_valid_overview_indexes(raster_dataset.overviews(1)):
        raise ValueError(
            "Invalid overview factors: must be consecutive powers of 2"
        )

    return True
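The two helpers called above are not shown in this description. A minimal sketch of what they might check follows; the bodies are assumptions derived only from the error messages (in particular, that the factors must start at 2 and that a raster without overviews is accepted).

```python
def is_valid_block_shapes(block_shapes) -> bool:
    """Assumed rule: every band must report the same (height, width) block shape."""
    return len(set(block_shapes)) <= 1


def is_valid_overview_indexes(overview_factors) -> bool:
    """Assumed rule: factors must be exactly 2, 4, 8, ... with no gaps."""
    expected = [2 ** (i + 1) for i in range(len(overview_factors))]
    return list(overview_factors) == expected


print(is_valid_overview_indexes([2, 4, 8]))
print(is_valid_overview_indexes([2, 8]))
print(is_valid_block_shapes([(256, 256), (256, 256)]))
```

Running these checks before any upload means an invalid file fails fast, without writing anything to the data warehouse.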

6. Metadata Update Bug Fix

One of the use cases of the loader is to update existing rasters with more data (--append). When this happens, the raster metadata needs to be updated with new stats computed with previous information and information from this new data.

As implemented, this functionality allowed the raster to be updated only once. The second time the raster was updated, the process threw an uploading error because the raster appeared to have different bands.

The root cause was a bug in the update process that added a spurious new band metadata object to the metadata information.

This behaviour has been fixed and now it is possible to update the raster as many times as the user wants.
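One reason the main statistics can be updated on append without re-reading the existing data is that count, sum and sum_square are additive. The sketch below shows that arithmetic; `merge_band_stats` is a hypothetical name and the PR does not show the loader's actual merge code. Quantiles and most_common cannot be combined this way and are left out.

```python
import math


def merge_band_stats(old: dict, new: dict) -> dict:
    """Hypothetical helper: combine stats of existing data with stats of appended data."""
    count = old["count"] + new["count"]
    total = old["sum"] + new["sum"]
    total_sq = old["sum_square"] + new["sum_square"]
    mean = total / count
    # Population variance from the running sums; clamp tiny negative rounding errors.
    variance = max(total_sq / count - mean ** 2, 0.0)
    return {
        "min": min(old["min"], new["min"]),
        "max": max(old["max"], new["max"]),
        "mean": mean,
        "std": math.sqrt(variance),
        "count": count,
        "sum": total,
        "sum_square": total_sq,
    }


# Stats of [1, 2, 3] merged with stats of [4, 5].
old = {"min": 1, "max": 3, "count": 3, "sum": 6, "sum_square": 14}
new = {"min": 4, "max": 5, "count": 2, "sum": 9, "sum_square": 41}
merged = merge_band_stats(old, new)
print(merged["mean"], merged["std"])
```

Since the merged result has the same shape as its inputs, it can be merged again on the next append, which is the repeated-update behaviour this fix restores.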

7. New Metadata Information

Two new parameters, colorinterp and colortable, have been added to the metadata information. These two parameters will be used to color rasters at visualisation time.

For the update use case to work, we had to change the metadata validation to take those parameters into account:

def band_without_stats(band):
    return {
        k: band[k]
        for k in set(list(band.keys())) - set(["stats", "colorinterp", "colortable"])
    }
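A sketch of how such a filter might be used in a compatibility check is given below. The function `bands_are_compatible` is hypothetical, and band_without_stats is restated here in an equivalent form only so that the example is self-contained.

```python
IGNORED_KEYS = {"stats", "colorinterp", "colortable"}


def band_without_stats(band: dict) -> dict:
    """Equivalent to the function above: drop keys that may differ between uploads."""
    return {k: v for k, v in band.items() if k not in IGNORED_KEYS}


def bands_are_compatible(existing_bands: list, new_bands: list) -> bool:
    """Hypothetical check: bands match when everything except the ignored keys is equal."""
    return [band_without_stats(b) for b in existing_bands] == [
        band_without_stats(b) for b in new_bands
    ]


existing = [{"name": "band_1", "type": "uint8", "stats": {"min": 0}, "colorinterp": "gray"}]
appended = [{"name": "band_1", "type": "uint8", "stats": {"min": 3}, "colorinterp": "palette"}]
print(bands_are_compatible(existing, appended))
```

With this approach, an append is rejected only when structural properties such as the band name or type differ, not when statistics or color information change.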

Pull Request Checklist

  • I have tested the changes locally
  • I have added tests to cover my changes (if applicable)
  • I have updated the documentation (if applicable)

@rantolin rantolin requested a review from volaya October 31, 2024 09:09
@rantolin rantolin added the bug Something isn't working label Oct 31, 2024
quantiles = compute_quantiles(qdata, casting_function)

print("Computing most commons values...")
# Not sure whether we should compute most_common values for float bands

Maybe a warning could be shown to the user in this case, so they are aware of potentially meaningless/useless results

I added the warning message as suggested!

@volaya volaya left a comment

LGTM! Great work!

@rantolin

@volaya: I have spent a little bit trying to fix the code so that the CI passes, however it keeps on failing for Python3.8. After some changes and some googling I just realised that rasterio requires Python>=3.9

@cayetanobv just pointed out that Python3.8 is EOL

@rantolin rantolin marked this pull request as ready for review October 31, 2024 12:20
@rantolin rantolin merged commit 167c3d6 into main Oct 31, 2024
0 of 4 checks passed