
reading in slice of raster is unexpectedly slow #827

Open
tiemvanderdeure opened this issue Dec 5, 2024 · 11 comments
Labels: bug (Something isn't working)

Comments

@tiemvanderdeure (Contributor)

I'm working with future WorldClim data for bioclimatic variables, which is chunked along the X axis and also has Y and Band axes.

Reading a single slice along the Band axis takes almost as long as reading the entire file, even though I'd expect it to take roughly 1/19th of the time (the file has 19 bands).

Reproduce:

using Rasters, RasterDataSources, Dates, ArchGDAL
path = getraster(WorldClim{Future{BioClim, CMIP6, GFDL_ESM4, SSP370}}; res = "2.5m", date = Date(2090))
bior = Raster(path; lazy = true)
@time read(view(bior, :, 1, :));
@time read(view(bior, :, :, 1));
@time read(bior);

Gives

  0.004638 seconds (290 allocations: 652.578 KiB)
 11.733330 seconds (291 allocations: 142.394 MiB)
 14.055928 seconds (416 allocations: 2.642 GiB, 1.03% gc time)

May be a DiskArrays or ArchGDAL issue?
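As a side note, one way to confirm how the file is chunked is to inspect the GeoTIFF's internal block layout directly with ArchGDAL. This is a minimal sketch (not part of the original report), reusing path from the snippet above:

using ArchGDAL

# Sketch: print the block (chunk) layout of the first band.
# `path` is the file downloaded by getraster above.
ArchGDAL.read(path) do dataset
    band = ArchGDAL.getband(dataset, 1)
    @show ArchGDAL.blocksize(band)   # for a strip-organized GeoTIFF, each block spans the full X width
    @show ArchGDAL.width(band) ArchGDAL.height(band) ArchGDAL.nraster(dataset)
end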

@tiemvanderdeure added the bug label on Dec 5, 2024
@asinghvi17 (Collaborator)

This would be easier to diagnose if you wrap the data in bior within an AccessCountDiskArray... will investigate.
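For reference, a rough sketch of what that could look like (hypothetical, not something run in this thread; it assumes DiskArrays.TestTypes provides getindex_count alongside AccessCountDiskArray, and it wraps the lazy raster's parent FileArray directly):

using DiskArrays
using DiskArrays.TestTypes: AccessCountDiskArray, getindex_count  # getindex_count assumed available here

# Hypothetical sketch: count how many chunked reads a single Band slice triggers.
counted = AccessCountDiskArray(parent(bior); chunksize = (8640, 1, 1))
counted[:, :, 1];
@show getindex_count(counted)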

@asinghvi17 (Collaborator)

On my machine I get these timings:

julia> @time read(view(bior, :, 1, :));
  0.002350 seconds (290 allocations: 652.578 KiB)

julia> @time read(view(bior, :, :, 1));

  4.612768 seconds (411 allocations: 142.396 MiB, 1.41% gc time)

julia> @time read(bior);
  6.438516 seconds (294 allocations: 2.642 GiB, 0.05% gc time)

Curiously, open doesn't help:

julia> open(bior) do bior
       @time read(view(bior, :, 1, :));
       @time read(view(bior, :, :, 1));
       @time read(bior);
       end
  0.001171 seconds (101 allocations: 646.484 KiB)
  4.587579 seconds (101 allocations: 142.388 MiB)
  6.128943 seconds (110 allocations: 2.642 GiB, 0.22% gc time)

but if I modify it to be an AccessCountDiskArray, I get GDAL failures:

using DiskArrays.TestTypes: AccessCountDiskArray

bior = modify(bior) do A
    AccessCountDiskArray(A; chunksize = (8640, 1, 1))
end
julia> @time read(bior)
ERROR: GDALError (CE_Failure, code 10):
	Pointer 'hDS' is NULL in 'GDALGetRasterBand'.


Stacktrace:
  [1] maybe_throw()
    @ GDAL ~/.julia/packages/GDAL/hNyuz/src/error.jl:42
  [2] aftercare
    @ ~/.julia/packages/GDAL/hNyuz/src/error.jl:59 [inlined]
  [3] gdalgetrasterband
    @ ~/.julia/packages/GDAL/hNyuz/src/libgdal.jl:7270 [inlined]

and if I rebuild it instead, I get a MethodError involving GenericMemory (?!)

julia> bior_r = rebuild(bior; data = AccessCountDiskArray(bior.data; chunksize = (8640, 1, 1)))
╭────────────────────────────────╮
│ 8640×4320×19 Raster{Float32,3} │
├────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── dims ┐
   X    Projected{Float64} LinRange{Float64}(-180.0, 179.95833333333326, 8640) ForwardOrdered Regular Intervals{Start},
   Y    Projected{Float64} LinRange{Float64}(89.95833333333333, -89.99999999999997, 4320) ReverseOrdered Regular Intervals{Start},
  ↗ Band Categorical{String} [wc2.1_2.5m_bioc_GFDL-ESM4_ssp370_2081-2100_1, wc2.1_2.5m_bioc_GFDL-ESM4_ssp370_2081-2100_2, …, wc2.1_2.5m_bioc_GFDL-ESM4_ssp370_2081-2100_18, wc2.1_2.5m_bioc_GFDL-ESM4_ssp370_2081-2100_19] Unordered
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── metadata ┤
  Metadata{Rasters.GDALsource} of Dict{String, Any} with 4 entries:
  "units"    => ""
  "offset"   => 0.0
  "filepath" => "/Users/anshul/downloads/WorldClim/Future/BioClim/ssp370/GFDL-ESM4/wc2.1_2.5m_bioc_GFDL-ESM4_ssp370_2081-2100.tif"
  "scale"    => 1.0
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── raster ┤
  extent: Extent(X = (-180.0, 179.99999999999991), Y = (-89.99999999999997, 90.0), Band = (nothing, nothing))
  missingval: NaN32
  crs: GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]
  filename:
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘


julia> @time read(bior_r)
ERROR: MethodError: no method matching GenericMemory(::Int64, ::Ptr{Nothing})
The type `GenericMemory` exists, but no method is defined for this combination of argument types when trying to construct it.
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/Flatten/zTf7d/src/Flatten.jl:233 [inlined]
  [2] macro expansion
    @ ./none:0 [inlined]
  [3] _reconstruct(obj::AccessCountDiskArray{…}, data::Tuple{…}, flattentrait::typeof(FieldMetadata.flattenable), use::Type{…}, ignore::Type{…}, n::Int64)
    @ Flatten ./none:0
  [4] reconstruct(obj::AccessCountDiskArray{Float32, 3, Rasters.FileArray{…}, DiskArrays.ChunkRead{…}}, data::Tuple{ArchGDAL.RasterDataset{…}}, use::Type, ignore::Type)
    @ Flatten ~/.julia/packages/Flatten/zTf7d/src/Flatten.jl:166
  [5] (::Rasters.var"#36#37"{Rasters.var"#28#29"{UnionAll}, Raster{Float32, 3, Tuple{}, Tuple{}, AccessCountDiskArray{}, Symbol, DimensionalData.Dimensions.Lookups.Metadata{}, Float32}})(x::ArchGDAL.RasterDataset{Float32, ArchGDAL.Dataset})
    @ Rasters ~/.julia/dev/Rasters/src/array.jl:172
  [6] call_composed(fs::Tuple{Rasters.var"#36#37"{Rasters.var"#28#29"{…}, Raster{…}}}, x::Tuple{ArchGDAL.RasterDataset{Float32, ArchGDAL.Dataset}}, kw::@Kwargs{})
    @ Base ./operators.jl:1054
  [7] call_composed
    @ ./operators.jl:1053 [inlined]
  [8] (::ComposedFunction{typeof(Rasters.cleanreturn), Rasters.var"#36#37"{Rasters.var"#28#29"{…}, Raster{…}}})(x::ArchGDAL.RasterDataset{Float32, ArchGDAL.Dataset}; kw::@Kwargs{})
    @ Base ./operators.jl:1050
  [9] readraster(f::ComposedFunction{typeof(Rasters.cleanreturn), Rasters.var"#36#37"{Rasters.var"#28#29"{…}, Raster{…}}}, args::String; kwargs::@Kwargs{})
    @ ArchGDAL ~/.julia/packages/ArchGDAL/xZUPv/src/context.jl:268
 [10] readraster(f::Function, args::String)
    @ ArchGDAL ~/.julia/packages/ArchGDAL/xZUPv/src/context.jl:265
 [11] _open(f::Function, ::Rasters.GDALsource, filename::String; write::Bool, kw::@Kwargs{name::Nothing})
    @ RastersArchGDALExt ~/.julia/dev/Rasters/ext/RastersArchGDALExt/gdal_source.jl:112
 [12] _open
    @ ~/.julia/dev/Rasters/ext/RastersArchGDALExt/gdal_source.jl:88 [inlined]
 [13] #open#17
    @ ~/.julia/dev/Rasters/src/filearray.jl:48 [inlined]
 [14] open
    @ ~/.julia/dev/Rasters/src/filearray.jl:47 [inlined]
 [15] #_open_one#35
    @ ~/.julia/dev/Rasters/src/array.jl:170 [inlined]
 [16] _open_one(f::Function, A::Raster{…}, fa::Rasters.FileArray{…})
    @ Rasters ~/.julia/dev/Rasters/src/array.jl:169
 [17] open(f::Rasters.var"#28#29"{UnionAll}, A::Raster{Float32, 3, Tuple{…}, Tuple{}, AccessCountDiskArray{…}, Symbol, DimensionalData.Dimensions.Lookups.Metadata{…}, Float32}; kw::@Kwargs{})
    @ Rasters ~/.julia/dev/Rasters/src/array.jl:162
 [18] open(f::Function, A::Raster{Float32, 3, Tuple{…}, Tuple{}, AccessCountDiskArray{…}, Symbol, DimensionalData.Dimensions.Lookups.Metadata{…}, Float32})
    @ Rasters ~/.julia/dev/Rasters/src/array.jl:155
 [19] modify(f::Type{Array}, A::Raster{Float32, 3, Tuple{…}, Tuple{}, AccessCountDiskArray{…}, Symbol, DimensionalData.Dimensions.Lookups.Metadata{…}, Float32})
    @ Rasters ~/.julia/dev/Rasters/src/array.jl:98
 [20] #read#223
    @ ~/.julia/dev/Rasters/src/read.jl:16 [inlined]
 [21] read(x::Raster{Float32, 3, Tuple{X{…}, Y{…}, Band{…}}, Tuple{}, AccessCountDiskArray{Float32, 3, Rasters.FileArray{…}, DiskArrays.ChunkRead{…}}, Symbol, DimensionalData.Dimensions.Lookups.Metadata{Rasters.GDALsource, Dict{…}}, Float32})
    @ Rasters ~/.julia/dev/Rasters/src/read.jl:12
 [22] macro expansion
    @ ./timing.jl:581 [inlined]
 [23] top-level scope
    @ ./REPL[26]:1
Some type information was truncated. Use `show(err)` to see complete types.

@rafaqz (Owner) commented Dec 5, 2024

I think you have the chunksize inverted?

Oh, maybe not. But put an Array in there and see what happens.

@tiemvanderdeure (Contributor, Author)

Hm, I'm not sure what is going on here; I haven't really done much I/O work.

But I noticed that the first read of any band from an ArchGDAL dataset is much slower than all subsequent reads from the same dataset, even of different bands.

using ArchGDAL, Rasters, RasterDataSources, Dates
path = getraster(WorldClim{Future{BioClim, CMIP6, GFDL_ESM4, SSP370}}; res = "2.5m", date = Date(2090))
dataset = ArchGDAL.read(path)
@time band = ArchGDAL.getband(dataset, 1) |> Array
@time band = ArchGDAL.getband(dataset, 1) |> Array
@time band2 = ArchGDAL.getband(dataset, 2) |> Array
 11.663348 seconds (69.17 k allocations: 286.877 MiB, 2.46% gc time)
  0.232574 seconds (69.17 k allocations: 286.877 MiB, 38.37% gc time)
  0.270758 seconds (69.17 k allocations: 286.877 MiB, 12.53% gc time)

The first timing is similar to reading a single band from a lazy Raster, but with a lazy Raster all subsequent reads are still equally slow:

ras = Raster(path; lazy = true)
@time read(view(ras, Band = 1))
@time read(view(ras, Band = 2))
 11.957042 seconds (86.80 k allocations: 288.140 MiB, 1.73% gc time)
 12.023004 seconds (86.73 k allocations: 288.139 MiB, 1.68% gc time)

It seems curious that the number of allocations is exactly the same but the timings are so different. Profiling doesn't help because it's all just C calls.

@rafaqz (Owner) commented Dec 6, 2024

Yeah, you may want to try doing that in an open(rast) do R block
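For concreteness, that suggestion looks roughly like this (a sketch using the ras variable from the previous comment; the idea is that the file is opened once and both band reads reuse the open dataset):

open(ras) do r
    # The dataset stays open for both reads, so the GDAL open cost is paid only once.
    @time read(view(r, Band = 1))
    @time read(view(r, Band = 2))
end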

@tiemvanderdeure (Contributor, Author)

Riiight that fixes it

@rafaqz (Owner) commented Dec 6, 2024

Guess I need to communicate that much better, and maybe, as @asinghvi17 wants, allow opening without a do block.

@tiemvanderdeure (Contributor, Author) commented Dec 6, 2024

But then in this case the problem is here:

stack = RasterStack(view(ras, Band = 1:2), layersfrom = Rasters.Band)
@time read(stack)

open(stack) do s
    @time read(s)
end

Both come out at around 25 seconds, the same as reading them separately without the open-do block.

@rafaqz (Owner) commented Dec 6, 2024

Haha I was waiting for this one :)

Open the Raster before making the stack!

It's opening each one separately. We should be smart and union the filenames, open them once, then map them back to stack layers.
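A rough sketch of the "open the Raster before making the stack" workaround (hedged; whether it fully avoids the repeated opens isn't confirmed later in the thread):

open(ras) do r
    # Build the stack from the already-opened raster so the layers share one open dataset.
    stack = RasterStack(view(r, Band = 1:2); layersfrom = Rasters.Band)
    @time read(stack)
end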

Maybe we should allow a mid-level laziness that opens on load, using lazy = :open.

But I'm still surprised the overhead is so high; it doesn't totally make sense to me. Why is nearly all of the cost just opening the file if it's chunked?

@tiemvanderdeure (Contributor, Author)

Why is nearly all of the cost just opening the file if it's chunked?

Doesn't make any sense. What is GDAL doing for those 12 seconds?
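One way to see what GDAL is spending that time on (a sketch, not something tried here; it assumes GDAL's debug messages reach the Julia console) is to switch on the CPL_DEBUG config option before repeating the slow read:

using ArchGDAL, Rasters

# Sketch: turn on GDAL's internal debug logging, repeat the slow read, then turn it off.
ArchGDAL.setconfigoption("CPL_DEBUG", "ON")
ras = Raster(path; lazy = true)
@time read(view(ras, Band = 1))
ArchGDAL.clearconfigoption("CPL_DEBUG")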

In the end, the point of this would be to be able to do

stack = RasterStack(WorldClim{Future{BioClim, CMIP6, GFDL_ESM4, SSP370}}; res = "2.5m", date = Date(2090), lazy = true)
stack_cropped = stack[1:10, 1:10];

without this crazy overhead, and without users ever knowing that this thing came from a single .tif file with bands, and not from 19 separate .tifs.

Maybe we should allow a mid-level laziness that opens on load, using lazy = :open.

Is that necessary here? There should be a way to call open before maplayers somehow?

@rafaqz (Owner) commented Dec 6, 2024

The problem is actually knowing that each stack layer comes from the same file. There is a bunch of code that chains the opening of multiple stack files, but here it probably just opens the same file over and over because it's not that smart.

The stack will already hold views of FileArray objects; we just need to put the same actual ArchGDAL.RasterDataset inside all of them, and it will work fine with one load.

(But it also seems to me like something is broken and GDAL is doing too much work for each file; I'm sure this wasn't much of a problem in the past because opening wasn't that expensive.)
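To illustrate that idea at the ArchGDAL level (a purely hypothetical helper, not Rasters internals): open each unique source file once and read every band that comes from it through the same dataset handle:

using ArchGDAL

# Hypothetical helper: one open per file, many band reads through the same handle.
function read_bands_once(path::AbstractString, bands)
    ArchGDAL.read(path) do dataset
        [ArchGDAL.read(ArchGDAL.getband(dataset, b)) for b in bands]
    end
end

layers = read_bands_once(path, 1:2)   # `path` as earlier in the thread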
