-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reading in slice of raster is unexpectedly slow #827
Comments
This would be easier to diagnose if you wrap the data in |
On my machine I get these timings: julia> @time read(view(bior, :, 1, :));
0.002350 seconds (290 allocations: 652.578 KiB)
julia> @time read(view(bior, :, :, 1));
4.612768 seconds (411 allocations: 142.396 MiB, 1.41% gc time)
julia> @time read(bior);
6.438516 seconds (294 allocations: 2.642 GiB, 0.05% gc time) Curiously, julia> open(bior) do bior
@time read(view(bior, :, 1, :));
@time read(view(bior, :, :, 1));
@time read(bior);
end
0.001171 seconds (101 allocations: 646.484 KiB)
4.587579 seconds (101 allocations: 142.388 MiB)
6.128943 seconds (110 allocations: 2.642 GiB, 0.22% gc time)
╭────────────────────────────────╮ but if I modify it to be an AccessCountDiskArray, I get GDAL failures using DiskArrays.TestTypes: AccessCountDiskArray
bior = modify(bior) do A
AccessCountDiskArray(A; chunksize = (8640, 1, 1))
end julia> @time read(bior)
ERROR: GDALError (CE_Failure, code 10):
Pointer 'hDS' is NULL in 'GDALGetRasterBand'.
Stacktrace:
[1] maybe_throw()
@ GDAL ~/.julia/packages/GDAL/hNyuz/src/error.jl:42
[2] aftercare
@ ~/.julia/packages/GDAL/hNyuz/src/error.jl:59 [inlined]
[3] gdalgetrasterband
@ ~/.julia/packages/GDAL/hNyuz/src/libgdal.jl:7270 [inlined] and if I rebuild it I get a memory failure (?!) julia> bior_r = rebuild(bior; data = AccessCountDiskArray(bior.data; chunksize = (8640, 1, 1)))
╭────────────────────────────────╮
│ 8640×4320×19 Raster{Float32,3} │
├────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── dims ┐
↓ X Projected{Float64} LinRange{Float64}(-180.0, 179.95833333333326, 8640) ForwardOrdered Regular Intervals{Start},
→ Y Projected{Float64} LinRange{Float64}(89.95833333333333, -89.99999999999997, 4320) ReverseOrdered Regular Intervals{Start},
↗ Band Categorical{String} [wc2.1_2.5m_bioc_GFDL-ESM4_ssp370_2081-2100_1, wc2.1_2.5m_bioc_GFDL-ESM4_ssp370_2081-2100_2, …, wc2.1_2.5m_bioc_GFDL-ESM4_ssp370_2081-2100_18, wc2.1_2.5m_bioc_GFDL-ESM4_ssp370_2081-2100_19] Unordered
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── metadata ┤
Metadata{Rasters.GDALsource} of Dict{String, Any} with 4 entries:
"units" => ""
"offset" => 0.0
"filepath" => "/Users/anshul/downloads/WorldClim/Future/BioClim/ssp370/GFDL-ESM4/wc2.1_2.5m_bioc_GFDL-ESM4_ssp370_2081-2100.tif"
"scale" => 1.0
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── raster ┤
extent: Extent(X = (-180.0, 179.99999999999991), Y = (-89.99999999999997, 90.0), Band = (nothing, nothing))
missingval: NaN32
crs: GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]
filename:
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
julia> @time read(bior_r)
ERROR: MethodError: no method matching GenericMemory(::Int64, ::Ptr{Nothing})
The type `GenericMemory` exists, but no method is defined for this combination of argument types when trying to construct it.
Stacktrace:
[1] macro expansion
@ ~/.julia/packages/Flatten/zTf7d/src/Flatten.jl:233 [inlined]
[2] macro expansion
@ ./none:0 [inlined]
[3] _reconstruct(obj::AccessCountDiskArray{…}, data::Tuple{…}, flattentrait::typeof(FieldMetadata.flattenable), use::Type{…}, ignore::Type{…}, n::Int64)
@ Flatten ./none:0
[4] reconstruct(obj::AccessCountDiskArray{Float32, 3, Rasters.FileArray{…}, DiskArrays.ChunkRead{…}}, data::Tuple{ArchGDAL.RasterDataset{…}}, use::Type, ignore::Type)
@ Flatten ~/.julia/packages/Flatten/zTf7d/src/Flatten.jl:166
[5] (::Rasters.var"#36#37"{Rasters.var"#28#29"{UnionAll}, Raster{Float32, 3, Tuple{…}, Tuple{}, AccessCountDiskArray{…}, Symbol, DimensionalData.Dimensions.Lookups.Metadata{…}, Float32}})(x::ArchGDAL.RasterDataset{Float32, ArchGDAL.Dataset})
@ Rasters ~/.julia/dev/Rasters/src/array.jl:172
[6] call_composed(fs::Tuple{Rasters.var"#36#37"{Rasters.var"#28#29"{…}, Raster{…}}}, x::Tuple{ArchGDAL.RasterDataset{Float32, ArchGDAL.Dataset}}, kw::@Kwargs{})
@ Base ./operators.jl:1054
[7] call_composed
@ ./operators.jl:1053 [inlined]
[8] (::ComposedFunction{typeof(Rasters.cleanreturn), Rasters.var"#36#37"{Rasters.var"#28#29"{…}, Raster{…}}})(x::ArchGDAL.RasterDataset{Float32, ArchGDAL.Dataset}; kw::@Kwargs{})
@ Base ./operators.jl:1050
[9] readraster(f::ComposedFunction{typeof(Rasters.cleanreturn), Rasters.var"#36#37"{Rasters.var"#28#29"{…}, Raster{…}}}, args::String; kwargs::@Kwargs{})
@ ArchGDAL ~/.julia/packages/ArchGDAL/xZUPv/src/context.jl:268
[10] readraster(f::Function, args::String)
@ ArchGDAL ~/.julia/packages/ArchGDAL/xZUPv/src/context.jl:265
[11] _open(f::Function, ::Rasters.GDALsource, filename::String; write::Bool, kw::@Kwargs{name::Nothing})
@ RastersArchGDALExt ~/.julia/dev/Rasters/ext/RastersArchGDALExt/gdal_source.jl:112
[12] _open
@ ~/.julia/dev/Rasters/ext/RastersArchGDALExt/gdal_source.jl:88 [inlined]
[13] #open#17
@ ~/.julia/dev/Rasters/src/filearray.jl:48 [inlined]
[14] open
@ ~/.julia/dev/Rasters/src/filearray.jl:47 [inlined]
[15] #_open_one#35
@ ~/.julia/dev/Rasters/src/array.jl:170 [inlined]
[16] _open_one(f::Function, A::Raster{…}, fa::Rasters.FileArray{…})
@ Rasters ~/.julia/dev/Rasters/src/array.jl:169
[17] open(f::Rasters.var"#28#29"{UnionAll}, A::Raster{Float32, 3, Tuple{…}, Tuple{}, AccessCountDiskArray{…}, Symbol, DimensionalData.Dimensions.Lookups.Metadata{…}, Float32}; kw::@Kwargs{})
@ Rasters ~/.julia/dev/Rasters/src/array.jl:162
[18] open(f::Function, A::Raster{Float32, 3, Tuple{…}, Tuple{}, AccessCountDiskArray{…}, Symbol, DimensionalData.Dimensions.Lookups.Metadata{…}, Float32})
@ Rasters ~/.julia/dev/Rasters/src/array.jl:155
[19] modify(f::Type{Array}, A::Raster{Float32, 3, Tuple{…}, Tuple{}, AccessCountDiskArray{…}, Symbol, DimensionalData.Dimensions.Lookups.Metadata{…}, Float32})
@ Rasters ~/.julia/dev/Rasters/src/array.jl:98
[20] #read#223
@ ~/.julia/dev/Rasters/src/read.jl:16 [inlined]
[21] read(x::Raster{Float32, 3, Tuple{X{…}, Y{…}, Band{…}}, Tuple{}, AccessCountDiskArray{Float32, 3, Rasters.FileArray{…}, DiskArrays.ChunkRead{…}}, Symbol, DimensionalData.Dimensions.Lookups.Metadata{Rasters.GDALsource, Dict{…}}, Float32})
@ Rasters ~/.julia/dev/Rasters/src/read.jl:12
[22] macro expansion
@ ./timing.jl:581 [inlined]
[23] top-level scope
@ ./REPL[26]:1
Some type information was truncated. Use `show(err)` to see complete types. |
Oh maybe not. But put an Array in there and see what happens |
Hm I'm not sure what is going on here - I haven't really done much I/O. But I noticed that the first time you read in any band from an ArchGDAL dataset, it is much slower than all other reads from the same dataset, even different bands. using ArchGDAL, Rasters, RasterDataSources, Dates
path = getraster(WorldClim{Future{BioClim, CMIP6, GFDL_ESM4, SSP370}}; res = "2.5m", date = Date(2090))
dataset = ArchGDAL.read(path)
@time band = ArchGDAL.getband(dataset, 1) |> Array
@time band = ArchGDAL.getband(dataset, 1) |> Array
@time band2 = ArchGDAL.getband(dataset, 2) |> Array
The first timing is similar to reading a single band from a lazy Raster, but all subsequent reads are still equally slow.
Seems curious the number of allocations is the exact same but the timing is so different? Profiling doesn't help because it's all just C calls. |
Yeah, you may want to try doing that in an |
Riiight that fixes it |
Guess I need to communicate that much better, and maybe as @asinghvi17 wants allow opening without a do block. |
But then in this case the problem is here: stack = RasterStack(view(ras, Band = 1:2), layersfrom = Rasters.Band)
@time read(stack)
open(stack) do s
@time read(s)
end Both come out around 25 seconds - the same as if reading them in separately without the open-do block. |
Haha I was waiting for this one :) Open the Raster before making the stack! It's opening each one separately. We should be smart and union the filenames, open them once, then map them back to stack layers. Maybe... We should allow mid-level laziness open on load using But I'm still surprised the overhead is so high, it doesn't totally make sense to me. Why is nearly all of the cost just opening the file if it's chunked |
Doesn't make any sense. What is GDAL doing for those 12 seconds? At the end the point of this would be to be able to do stack = RasterStack(WorldClim{Future{BioClim, CMIP6, GFDL_ESM4, SSP370}}; res = "2.5m", date = Date(2090), lazy = true)
stack_cropped = stack[1:10, 1:10]; without this crazy overhead, and without users ever knowing that this thing came from a single .tif file with bands, and not from 19 separate .tifs.
Is that necessary here? There should be a way to call |
The problem is actually knowing that each stack layer is from the same file. There is a bunch of code that chains the opening of multiple stack files, but here it probably just opens the same file over and over because its not that smart. The stack will already hold views of FileArray objects, we just need to put the same actual ArchGDAL.RasterDataset inside all of them and it will work fine with one load. (but also it seems to me like something is also broken and gdal is doing too much for each file - I'm sure this wasn't much of a problem in the past because opening wasn't that expensive) |
I'm working with future worldclim data for bioclimatic variables, which is chunked along the X axis, and also has a Y and Band axis.
Reading in a single slice along the Band axis takes almost as long as reading in the entire thing, even though I'd expect it to take 1/19th the time.
Reproduce:
Gives
May be a DiskArrays or ArchGDAL issue?
The text was updated successfully, but these errors were encountered: