
satpy v0.33 seems to run slower than v0.30 when I project modis l1b data #1944

Closed
haiyangdaozhang opened this issue Dec 16, 2021 · 8 comments


@haiyangdaozhang

I checked issue #1902, then updated satpy from version 0.30 to version 0.33, hoping to use less memory and time.

I projected MODIS Level 1B data with both the 0.33 and 0.30 versions of my program and compared them. I found that 0.33 is actually slower. Projecting to a grid with 0.0025° resolution and a size of 16000 × 16000, version 0.33 takes 804 seconds on my machine while 0.30 takes only 695 seconds. Projecting to a 0.005°, 8000 × 8000 grid, version 0.33 takes 287 seconds while 0.30 takes only 256 seconds.

I tested them in Docker. For the 16000 × 16000 projection, both versions used at least 25 GB of memory, maybe more.
I don't know why it takes so much time and memory. Also, when I get the latitude and longitude of the data with scn["longitude"].compute(), it takes at least 10 seconds, so I later read them directly from the HDF file instead, which takes less than 1 second.

In addition, how should I set PYTROLL_CHUNK_SIZE? For example, my original data is 2200 × 2200 and becomes 4000 × 4000 after projection. Should I set PYTROLL_CHUNK_SIZE to 2000 or 2200?

My code:

import time
import os
import sys

# Set chunking and parallelism before importing satpy.
os.environ["PYTROLL_CHUNK_SIZE"] = "4000"
os.environ["DASK_NUM_WORKERS"] = "4"
os.environ["OMP_NUM_THREADS"] = "2"

import glob
from satpy import Scene
from dask.diagnostics import ProgressBar

sys.path.append("../")
from getArea import *  # provides the "myarea" definition shown below

file = "/data/test_data/*21339065901.hdf"
scn = Scene(reader="modis_l1b", filenames=glob.glob(file))
scn.load(["true_color_crefl"])
scb_new = scn.resample(myarea)

time1 = time.time()
with ProgressBar():
    scb_new.save_dataset("true_color_crefl", filename="test_true_color_crefl.tif")
time2 = time.time()
print(int(time2 - time1))

myarea:

Area ID: LATLONG_250
Description: 
Projection ID: LATLONG_250
Projection: {'datum': 'WGS84', 'no_defs': 'None', 'proj': 'longlat', 'type': 'crs'}
Number of columns: 16000
Number of rows: 16000
Area extent: (89.0, -12.0, 129.0, 28.0)
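
For reference, an equivalent area could be built with pyresample like this (a sketch; my actual definition comes from getArea):

from pyresample import create_area_def

myarea = create_area_def(
    "LATLONG_250",
    {"proj": "longlat", "datum": "WGS84"},  # plain lat/lon grid on WGS84
    width=16000,
    height=16000,
    area_extent=(89.0, -12.0, 129.0, 28.0),  # (lon_min, lat_min, lon_max, lat_max)
)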

Thank you!

@djhoese
Member

djhoese commented Dec 16, 2021

Darn, this is not good to hear. Thanks for reporting it though. A couple questions and comments to narrow down the issues:

  1. When you got your timings for Satpy 0.30, were you using this exact script? Were the chunk sizes the same?
  2. Did you also update your version of pyresample?
  3. Comment: That's a big area. What a nice test.
  4. You said doing scn["longitude"].compute() takes a long time, but I don't see you loading it explicitly in this script. If you are instead doing lons, lats = scn["true_color_crefl"].attrs["area"].get_lonlats(), are you doing it on the original Scene or the resampled Scene?
  5. What differences do you see if you don't set the chunk size and let Satpy use its default?
  6. Is your /data a local file system, or something like an NFS or Lustre mount that accesses remote data?

To answer your question about which chunk size: this differs from user to user and machine to machine. I've seen good results with 1024 when working with ABI data. The reason 2200 was mentioned in the AHI issue is that it matches the number of rows per segment in the 500m resolution data. This means each input AHI file isn't producing a ton of small chunks, and fewer chunks are easier for dask to deal with. I honestly have not played around much with chunk sizes for polar/swath-based data, so I can't give you a good answer.

After you provide some answers to my questions I can try running tests myself and see what I can find. There is no reason off the top of my head why my changes should make any of this processing worse. Another thing you could try is the regular true_color composite, to see whether it behaves better than true_color_crefl as far as the differences between versions go. If the difference between versions for true_color isn't significant, then maybe we should look into further optimizing how the crefl rayleigh modifier works and how it uses angles/lons/lats.
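
For example, a quick comparison could look like this (a hedged sketch; file and myarea are from your script above):

import time
import glob
from satpy import Scene

# Time the same resample/save pipeline for both composites to see whether
# the slowdown is specific to the crefl-corrected one.
for composite in ["true_color", "true_color_crefl"]:
    scn = Scene(reader="modis_l1b", filenames=glob.glob(file))
    scn.load([composite])
    new_scn = scn.resample(myarea)
    start = time.time()
    new_scn.save_dataset(composite, filename=f"test_{composite}.tif")
    print(f"{composite}: {int(time.time() - start)} s")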

@djhoese
Member

djhoese commented Dec 16, 2021

For reference, here is what I get with current satpy main and pyresample main with a chunk size of 2200 (left over from my AHI profiling), resampling with nearest neighbor (the default) to an area definition that is 16000x16000 with the same projection as yours, but with the extents shifted to match my test data (-180.0 in longitude for both X extents, so the same size of geographic area). I'm using some MODIS L1B data from 2012 that was produced by IMAPP, and I provided the 1000m, 500m, 250m, and geo files (equivalent to the NASA MOD02 and MOD03 files). I save to a tiled geotiff (new_scn.save_datasets(..., tiled=True)). I have 4 dask workers, OMP threads set to 1, and my source data and destination path are both on local SSDs on my laptop.
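
Roughly, that setup in code (a sketch, not my exact script; the path is a placeholder):

import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["PYTROLL_CHUNK_SIZE"] = "2200"

import glob
import dask
from pyresample import create_area_def
from satpy import Scene

# Same projection/size as your area, extents shifted -180 in longitude
# to cover my 2012 IMAPP test granule.
test_area = create_area_def(
    "test_16k",
    {"proj": "longlat", "datum": "WGS84"},
    width=16000,
    height=16000,
    area_extent=(-91.0, -12.0, -51.0, 28.0),
)

with dask.config.set({"num_workers": 4}):
    scn = Scene(reader="modis_l1b", filenames=glob.glob("/local/ssd/imapp/*.hdf"))
    scn.load(["true_color_crefl"])
    new_scn = scn.resample(test_area)  # nearest neighbor is the default
    new_scn.save_datasets(tiled=True)  # tiled GeoTIFF output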

[image: dask CPU/memory/cache profile plots for satpy main + pyresample main]

With a result that looks like this:

[image: the resulting true_color_crefl GeoTIFF]

So peak memory is ~11 GB and total processing time is ~200 seconds. Nothing about the profile plots is too surprising to me, and it generally seems pretty good considering this is a huge 16k x 16k output area. I'll need more details about how you are running things before I make more guesses about what might be using so much memory.

NOTE: Even though a lot of my output image is empty this shouldn't make a big difference until the very end when GDAL starts compressing the data to write it to the geotiff.

NOTE 2: I was also watching HBOMax on my computer at the same time so it isn't like my computer was doing nothing.

@djhoese
Member

djhoese commented Dec 16, 2021

Running with satpy 0.32.0 and pyresample 1.22.1 I do get a slightly faster result but about the same memory usage:

[image: dask CPU/memory/cache profile plots for satpy 0.32.0 + pyresample 1.22.1]

The actual end of the CPU and memory usage is at ~190s, whereas the new versions finished at ~200s. All the graphs look very similar, so I wouldn't feel bad about attributing the difference mostly to load on my machine from other tasks. I'm not sure I'm seeing the same drastic differences you are @haiyangdaozhang.

@haiyangdaozhang
Author

  1. I created two environments using Anaconda3.
    In the first environment, satpy is version 0.30.1 and pyresample is version 1.22.0.
    In the second environment, satpy is version 0.33.0 and pyresample is version 1.22.3.
    I wrote the test script and ran it in both environments to get the earlier results.

  2. For an original (un-resampled) Scene, I think lons, lats = scn["true_color_crefl"].attrs["area"].get_lonlats() will not work.
    I used the method below to read the original latitude and longitude, and it took about 10-20 seconds on my machine, so I am curious why it is so slow.

scn = Scene(reader="modis_l1b", filenames=glob.glob(file))
scn.load(["latitude"])
time1 = time.time()
lat = scn["latitude"].compute()
time2 = time.time()
print(f"It took {int(time2 - time1)} seconds to generate lat data.")
  3. I tested again on a new machine with an AMD 5600X, 32 GB of memory, and an SSD, running Windows 10 with Python 3.9. All other programs were closed; only the CMD window running the test script was open. For true_color, satpy v0.33 and v0.30 both take 3 minutes 50 seconds; for true_color_crefl, they both take 5 minutes. Maybe you are right that the speed difference between the two versions is just noise and there is no real difference. But in both versions, true_color_crefl takes more than a minute longer than true_color.

  4. In my experience, the smaller the proportion of the projection area that is covered by data, the less time and memory are used.

  5. May I ask how you monitor memory and CPU (how your images were generated)? I want to do further testing on memory.

@djhoese
Member

djhoese commented Dec 20, 2021

For 2, the reason it is taking so long is that MODIS geolocation data only goes down to 1km resolution. When you ask for the generic "latitude" dataset, Satpy will provide the highest resolution it is able to, which is 250m in this case. This extrapolation/interpolation takes time, much more time than you would expect if the data were only being loaded from disk. If you add resolution=1000 to your scn.load call, then scn["latitude"].compute() should be much faster.

If/when you do that, what is the shape of that original swath at 1km resolution?
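
For example (a sketch; file is your glob pattern from before):

import glob
from satpy import Scene

# Load the native 1 km geolocation instead of letting Satpy interpolate it to 250 m.
scn = Scene(reader="modis_l1b", filenames=glob.glob(file))
scn.load(["latitude"], resolution=1000)
lat = scn["latitude"].compute()  # should now be mostly disk I/O
print(lat.shape)  # the 1 km swath shape I'm asking about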

For 4, in your experience with Satpy or with other satellite processing tools? For the most basic operations, and excluding the compression/saving to disk portion of the processing, Satpy should treat every pixel as a "black box" and process them all the same. However, now that I say that, there may be some edge cases (especially with geostationary/pre-gridded data) where Satpy may be able to optimize what data is loaded from disk.

For 5, import:

from dask.diagnostics import CacheProfiler, ResourceProfiler, Profiler, visualize
import dask

Then when you do the actual work in your code:

with dask.config.set({"num_workers": 4}), CacheProfiler() as cprof, ResourceProfiler() as rprof, Profiler() as prof:
    # do stuff
visualize([prof, rprof, cprof], filename="some_file.html", show=False)
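
Applied to the save step from your script, that could look like this (a hedged sketch; scb_new is your resampled Scene):

with dask.config.set({"num_workers": 4}), CacheProfiler() as cprof, \
        ResourceProfiler() as rprof, Profiler() as prof:
    scb_new.save_dataset("true_color_crefl", filename="test_true_color_crefl.tif")
visualize([prof, rprof, cprof], filename="modis_profile.html", show=False)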

Here's an ongoing profiling script I've been using: https://gist.github.com/djhoese/2a38533d7cf8ab539aac1db4ca1eba46

You can learn more about dask diagnostics here: https://docs.dask.org/en/latest/diagnostics-local.html

@djhoese
Member

djhoese commented Feb 24, 2022

@haiyangdaozhang Any updates on this? I'm not sure where exactly we left off but I do see I asked you some questions in my last comment.

@haiyangdaozhang
Author

Thank you very much for your help! You can close this issue.

@djhoese
Member

djhoese commented Mar 3, 2022

@haiyangdaozhang I will close this, but know that optimizing MODIS L1B processing will be one of my main goals over the next month or so, so hopefully things will get a little faster.
