satpy v0.33 seems to run slower than v0.30 when I project modis l1b data #1944
Darn, this is not good to hear. Thanks for reporting it though. A couple questions and comments to narrow down the issues:
To answer your question about which chunk size to use: this differs from user to user and machine to machine. I've seen good results with 1024 when working with ABI data. The reason 2200 was mentioned in the AHI issue is that it matches the number of rows per segment in the 500m resolution data. That way each input AHI file isn't producing a ton of small chunks, and fewer chunks are easier for dask to deal with. I honestly have not played around much with chunk sizes for polar/swath-based data, so I can't give you a good answer there. After you provide some answers to my questions I can try running tests myself and see what I can find. There is no reason off the top of my head why my changes should make any of this processing worse. Another thing you could try would be to use regular …
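As a concrete illustration (my sketch, not from the comment above), the chunk size is controlled through the `PYTROLL_CHUNK_SIZE` environment variable; as far as I know it is read when the Satpy modules are first imported, so it has to be set before `import satpy`:

```python
import os

# Must be set before satpy is imported; the readers pick this
# variable up at import time (assumption based on satpy <= 0.33).
os.environ["PYTROLL_CHUNK_SIZE"] = "1024"  # e.g. 1024, 2048, 2200, ...

from glob import glob
from satpy import Scene

# Hypothetical MODIS L1B granule paths, for illustration only.
scn = Scene(reader="modis_l1b", filenames=glob("/data/MOD021KM*.hdf"))
```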
For reference, here is what I get with current satpy main and pyresample main with a chunk size of 2200 (left over from my AHI profiling), resampling with nearest neighbor (the default) to an area definition that is 16000x16000 with the same projection as yours but with the extents shifted over to match my test data (…). [profile plots and resulting image omitted]

So: peak memory is ~11GB and it takes ~200 seconds to process. Nothing about the profile plots is too surprising to me, and it generally seems pretty good considering this is a huge 16k x 16k output area. I'll need more details about how you are running your code before I make more guesses about what might be using so much memory.

NOTE: Even though a lot of my output image is empty, this shouldn't make a big difference until the very end, when GDAL starts compressing the data to write it to the geotiff.

NOTE 2: I was also watching HBO Max on my computer at the same time, so it isn't like my computer was doing nothing.
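For readers following along, a minimal sketch of this kind of run, assuming a placeholder lat/lon projection and extents rather than the exact area used above:

```python
from glob import glob
from pyresample.geometry import AreaDefinition
from satpy import Scene

# Placeholder 16000x16000 grid at 0.0025 degrees per pixel;
# the real projection/extents in the comment above differ.
area = AreaDefinition(
    "test_grid", "0.0025 degree lat/lon grid", "test_grid",
    {"proj": "longlat", "datum": "WGS84"},
    16000, 16000,                  # width, height in pixels
    (100.0, 0.0, 140.0, 40.0),     # area_extent: (min_x, min_y, max_x, max_y)
)

scn = Scene(reader="modis_l1b", filenames=glob("/data/MOD02*.hdf"))
scn.load(["1"])                           # MODIS band 1 reflectance
resampled = scn.resample(area, resampler="nearest")
resampled.save_datasets(base_dir="/tmp")  # geotiff output by default
```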
Running with satpy 0.32.0 and pyresample 1.22.1, I do get a slightly faster result but about the same memory usage. [profile plots omitted] The actual end of the CPU and memory usage is at ~190s, whereas the new versions ended at ~200s. All the graphs look very similar, so I wouldn't feel bad about saying this is mostly caused by load on my machine from other tasks...at least mostly. I'm not sure I'm seeing the same drastic differences you are, @haiyangdaozhang.
For 2, the reason it is taking so long is that MODIS only has geolocation data at up to 1km resolution. When you ask for the generic "latitude" dataset, Satpy will provide the highest resolution it is able to, which is 250m in this case. This extrapolation/interpolation takes time, much more time than you would expect if the data were only being loaded from disk. If you add a … If/when you do that, what is the shape of that original swath at 1km resolution?

For 4, is that from your experience with Satpy or with other satellite processing tools? For the most basic operations, and excluding the compression/saving-to-disk portion of the processing, Satpy should treat every pixel as a "black box" and process them all the same. However, now that I say that, there may be some edge cases (especially with geostationary/pre-gridded data) where Satpy may be able to optimize what data is loaded from disk.

For 5, import:

```python
from dask.diagnostics import CacheProfiler, ResourceProfiler, Profiler, visualize
import dask
```

Then wrap the actual work in your code:

```python
with dask.config.set({"num_workers": 4}), CacheProfiler() as cprof, ResourceProfiler() as rprof, Profiler() as prof:
    ...  # do stuff

visualize([prof, rprof, cprof], filename="some_file.html", show=False)
```

Here's an ongoing profiling script I've been using: https://gist.github.com/djhoese/2a38533d7cf8ab539aac1db4ca1eba46

You can learn more about dask diagnostics here: https://docs.dask.org/en/latest/diagnostics-local.html
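The suggestion in "For 2" above is truncated; my assumption is that it refers to requesting the native 1km geolocation explicitly via the `resolution` keyword of `Scene.load`, roughly like this (a sketch, not the author's exact snippet):

```python
from glob import glob
from satpy import Scene

scn = Scene(reader="modis_l1b", filenames=glob("/data/MOD021KM*.hdf"))
# Assumption: asking for 1km explicitly avoids the expensive
# 1km -> 250m interpolation of the geolocation arrays.
scn.load(["latitude"], resolution=1000)
print(scn["latitude"].shape)  # shape of the original swath at 1km
```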
@haiyangdaozhang Any updates on this? I'm not sure where exactly we left off, but I do see I asked you some questions in my last comment.
Thank you very much for your help! You can close this issue.
@haiyangdaozhang I will close this, but know that optimizing MODIS L1B processing will be one of my main goals over the next month or so, so hopefully things will get a little faster.
I checked issue #1902, then updated satpy from version 0.30 to version 0.33 to try to use less memory and time.
I used the 0.33 version of the program and the 0.30 version of the program to project MODIS Level 1 data and compared them. I found that the 0.33 version is actually slower. When I project data at a resolution of 0.0025° and a size of 16000 x 16000, version 0.33 takes 804 seconds on my machine while version 0.30 takes only 695 seconds. When I project 0.005°, 8000 x 8000 data, version 0.33 takes 287 seconds while version 0.30 takes only 256 seconds.
I tested them in Docker. For the 16000 x 16000 projection, both took at least 25 GB of memory, maybe more.
I don't know why it takes so much time and memory. When I try to get the latitude and longitude of the data with `scn["longitude"].compute()`, it takes at least 10 seconds. So I later read them directly from the HDF file instead, which takes less than 1 second.
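For comparison, reading the geolocation directly from the HDF4 file as described can be done with `pyhdf`; a sketch under the assumption that the file carries `Latitude`/`Longitude` SDS datasets (the file name is hypothetical):

```python
from pyhdf.SD import SD, SDC

# Hypothetical geolocation file; MOD03 carries full 1km Latitude/Longitude.
hdf = SD("/data/MOD03.A2021001.0000.061.hdf", SDC.READ)
lats = hdf.select("Latitude")[:]   # plain numpy arrays, no dask graph
lons = hdf.select("Longitude")[:]
hdf.end()
```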
In addition, how should I set `PYTROLL_CHUNK_SIZE`? For example, my original data is 2200 x 2200, and it is 4000 x 4000 after projection. Should I set `PYTROLL_CHUNK_SIZE` to 2000 or 2200?

My code:
myarea:
Thank you!