Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inefficient compression of very large GeoTIFFs compared to gdal_* binaries #127

Open
niclasmattsson opened this issue Nov 1, 2021 · 2 comments

Comments

@niclasmattsson
Copy link

Originally posted at discourse, full text below including test code and links to sample data. I opened the issue here because visr suggested the problem might be in GDAL.jl instead of ArchGDAL.
https://discourse.julialang.org/t/inefficient-compression-of-very-large-geotiffs-with-archgdal-compared-to-gdal-binaries/70687/1


I'm using ArchGDAL to save climate data as large 3D GeoTIFFs, but I've noticed that files created in ArchGDAL are roughly 3x larger than those created with gdal_* binaries. As a diagnostic I eliminated all my calculations and wrote an ArchGDAL function that just resaves its input file using a given compression method. Then I wrapped this in a test function that also recompresses the ArchGDAL output using gdal_translate and compares resulting file sizes.

Spoiler: small input GeoTIFFs have the same output size, but very large input files (4 GB) become much larger after passing through ArchGDAL.

First my test code:

using ArchGDAL, GDAL_jll

function copy_gtiff(infile, outfile; compressmethod="ZSTD")
    ArchGDAL.readraster(infile) do origdataset
        nbands = ArchGDAL.nraster(origdataset)
        band1 = ArchGDAL.getband(origdataset, 1)
        nodata = ArchGDAL.getnodatavalue(band1)
        isfile(outfile) && rm(outfile)
        origdata = origdataset[:,:,:]
        ArchGDAL.create(outfile,
            driver = ArchGDAL.getdriver("GTiff"),
            width = ArchGDAL.width(origdataset),
            height = ArchGDAL.height(origdataset),
            nbands = nbands,
            dtype = ArchGDAL.pixeltype(band1),
            options = ["BIGTIFF=YES", "COMPRESS=$compressmethod"]
        ) do dataset
            geotransform = ArchGDAL.getgeotransform(origdataset)
            proj = ArchGDAL.getproj(origdataset)
            ArchGDAL.setgeotransform!(dataset, geotransform)
            ArchGDAL.setproj!(dataset, proj)
            for b = 1:nbands
                band = ArchGDAL.getband(dataset, b)
                ArchGDAL.setnodatavalue!(band, nodata)
                ArchGDAL.write!(band, origdata[:,:,b])
            end
        end
        return origdata
    end
end

function recompress_gtiff(infile, outfile; compressmethod="ZSTD")
    gdal_translate_path() do gdal_translate
        run(`$gdal_translate -q -co BIGTIFF=YES -co COMPRESS=$compressmethod $infile $outfile`)
    end
end

filesizeMB(filename) = round(filesize(filename)/1024^2, digits=1)

function readalldata(filename)
    ArchGDAL.readraster(filename) do dataset
        return dataset[:,:,:]
    end
end

function testcopy(compressmethod="ZSTD")
    for nbands in [3, 30, 300, 1000, 2920]
        infile = "test_ZSTD_$(nbands)bands.tif"
        outfile1, outfile2 = tempname(), tempname()
        println("\nTesting with $nbands bands, compression method = $compressmethod:")
        print("  copying $infile with ArchGDAL...")
        time = round(@elapsed origdata = copy_gtiff(infile, outfile1; compressmethod); digits=1)
        println(" ($time s). Original file $(filesizeMB(infile)) MB, copy $(filesizeMB(outfile1)) MB.")
        print("  recompressing with gdal_translate...")
        time = round(@elapsed recompress_gtiff(outfile1, outfile2; compressmethod); digits=1)
        println(" ($time s). Recompressed file $(filesizeMB(outfile2)) MB.")
        print("  reading copied file...")
        time = round(@elapsed newdata = readalldata(outfile1); digits=1)
        println(" ($time s). Identical to original file: $(all(origdata .== newdata)).")
        print("  reading recompressed file...")
        time = round(@elapsed newdata = readalldata(outfile2); digits=1)
        println(" ($time s). Identical to original file: $(all(origdata .== newdata)).")
        rm(outfile1)
        rm(outfile2)
    end
end

Running the tests using ZSTD compression:

julia> testcopy("ZSTD")

Testing with 3 bands, compression method = ZSTD:
  copying test_ZSTD_3bands.tif with ArchGDAL... (1.3 s). Original file 3.9 MB, copy 3.9 MB.
  recompressing with gdal_translate... (1.2 s). Recompressed file 3.9 MB.
  reading copied file... (0.0 s). Identical to original file: true.
  reading recompressed file... (0.0 s). Identical to original file: true.

Testing with 30 bands, compression method = ZSTD:
  copying test_ZSTD_30bands.tif with ArchGDAL... (2.1 s). Original file 40.2 MB, copy 40.2 MB.
  recompressing with gdal_translate... (2.1 s). Recompressed file 40.2 MB.
  reading copied file... (0.2 s). Identical to original file: true.
  reading recompressed file... (0.3 s). Identical to original file: true.

Testing with 300 bands, compression method = ZSTD:
  copying test_ZSTD_300bands.tif with ArchGDAL... (8.6 s). Original file 405.8 MB, copy 405.8 MB.
  recompressing with gdal_translate... (8.4 s). Recompressed file 405.8 MB.
  reading copied file... (1.6 s). Identical to original file: true.
  reading recompressed file... (1.6 s). Identical to original file: true.

Testing with 1000 bands, compression method = ZSTD:
  copying test_ZSTD_1000bands.tif with ArchGDAL... (23.1 s). Original file 1355.7 MB, copy 1355.7 MB.
  recompressing with gdal_translate... (22.3 s). Recompressed file 1355.7 MB.
  reading copied file... (5.6 s). Identical to original file: true.
  reading recompressed file... (5.8 s). Identical to original file: true.

Testing with 2920 bands, compression method = ZSTD:
  copying test_ZSTD_2920bands.tif with ArchGDAL... (171.5 s). Original file 3968.5 MB, copy 10906.3 MB.
  recompressing with gdal_translate... (61.3 s). Recompressed file 3968.5 MB.
  reading copied file... (15.1 s). Identical to original file: true.
  reading recompressed file... (15.6 s). Identical to original file: true.

My 5 test files are available in a Box folder here. I get similar results for other compression methods. Some valid values are DEFLATE, LZW and NONE. My test results for these are in a log file in that Box folder. My installed versions: ArchGDAL v0.7.4 and GDAL_jll v300.202.100+0.

I suspect the problem is some kind of interaction with the BIGTIFF setting, which is required for creating GeoTIFFs larger than 4 GB.

@visr
Copy link
Member

visr commented Nov 1, 2021

Ha I meant that that this is more likely a OSGeo/gdal issue, in that there is unlikely to be any Julia wrapping code that behaves differently somewhere between 1000 and 2920 bands.

It might be some work, but perhaps you can reproduce this with generated data, such that people don't need to download such a large raster? Then you can also see if it also happens if you have say 2 rows and 2 columns. And one way to verify is to write this test in Python, to see if it occurs there as well. GDAL devs are more likely to be able to run those reproducers as well.

@maxfreu
Copy link
Contributor

maxfreu commented Jun 2, 2023

For compression it can play a big role whether your file is band or pixel interleaved and whether or not it is tiled. gdal_translate might copy this info from the source file, while your code is not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants