Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/hdf4 subdatasets #410

Merged
merged 3 commits into from
Aug 21, 2020
Merged

Feature/hdf4 subdatasets #410

merged 3 commits into from
Aug 21, 2020

Conversation

mpu-creare
Copy link
Contributor

Support reading subdatasets in the rasterio Node.
This is motivated by reading MODIS data downloaded from the source.

@mpu-creare mpu-creare changed the base branch from master to develop July 3, 2020 02:04
@coveralls
Copy link

coveralls commented Jul 3, 2020

Coverage Status

Coverage decreased (-1.5%) to 89.668% when pulling 30e283f on feature/hdf4_subdatasets into cdedd85 on develop.

@mpu-creare mpu-creare requested a review from jmilloy July 9, 2020 15:28

@property
def subdatasets(self):
return self.dataset.subdatasets
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could potentially make the subdatasets as different outputs and then open multiple datasets to composite the results. That would take another branch in the get_data function but would results in a much nicer user experience.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. A dataset can already have multiple outputs, at least if you open it with the xarray Dataset or H5PY nodes. Could you end up with a situation with several subdatasets that each have multiple outputs, but no way to have "nested" outputs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes me want to scream -- I don't know. I hope the answer is NO but I suspect it's a resounding Yes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the simple answer then is not to make the subdatasets different outputs, basically require one node per subdataset. Hopefully the file can be open concurrently for multiple read-only file-pointers.

Copy link
Contributor Author

@mpu-creare mpu-creare Jul 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the research I've done, there should not be any multi-band subdatasets.

I think I'll end up making the subdatasets multiple outputs -- make it really easy for users. I've now heard from two people that they gave up on PODPAC because they couldn't read the HDF files and they didn't know why. So, it should just work without them having to know about the intricacies of subdatasets.

Also, the s3 case is not a real use-case since reading and hdf file from S3 seems to involve reading the whole file anyway. In other words, it's very slow.

A related questions: right now we read band 1 by default. Should we just go ahead and read all bands by default? What do you think.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Yeah, I guess we should read all of the bands by default. The FileKeysMixin that is used by CSV, Zarr, Dataset, and H5PY loads all available data_keys by default as separate outputs, unless there is only one, in which case outputs is None and the node is a standard single-output node.

    @tl.default("data_key")
    def _default_data_key(self):
        if len(self.available_data_keys) == 1:
            return self.available_data_keys[0]
        else:
            return self.available_data_keys

    @tl.default("outputs")
    def _default_outputs(self):
        if not isinstance(self.data_key, list):
            return None
        else:
            return self.data_key

Rasterio should be basically the same, something like allowing band to be either a list or a single value and using this:

    @tl.default("band")
    def _default_band(self):
        if len(self.band_keys == 1:
            return self.band_keys[0]
        else:
            return self.band_keys

    @tl.default("outputs")
    def _default_outputs(self):
        if not isinstance(self.band, list):
            return None
        else:
            return self.band

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to make an issue for this, or fix it now before merging this issue? Do you have time, or shall I?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think fixing this inside this PR is fine. And don't worry about timing -- this is complex enough that I think we should do it for the next release and not rush something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up creating #420 for this instead. This PR was getting pretty old and it should be in develop to support our team right now.

@jmilloy
Copy link
Collaborator

jmilloy commented Jul 13, 2020

I don't really know what to do here.

When I try to open the subdataset using rasterio (no podpac), I get "No such file or directory". Am I doing it correctly?

>>> rasterio.open('HDF4_EOS:EOS_GRID:"MOD13Q1.A2013033.h08v05.006.2015256072248.hdf":MODIS_Grid_16DAY_250m_500m_VI:"250m 16 days NDVI"')
Traceback (most recent call last):
  File "rasterio/_base.pyx", line 216, in rasterio._base.DatasetBase.__init__
  File "rasterio/_shim.pyx", line 67, in rasterio._shim.open_dataset
  File "rasterio/_err.pyx", line 205, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: HDF4_EOS:EOS_GRID:"MOD13Q1.A2013033.h08v05.006.2015256072248.hdf":MODIS_Grid_16DAY_250m_500m_VI:"250m 16 days NDVI": No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jmilloy/Creare/Pipeline/_podpac38_/lib/python3.8/site-packages/rasterio/env.py", line 433, in wrapper
    return f(*args, **kwds)
  File "/home/jmilloy/Creare/Pipeline/_podpac38_/lib/python3.8/site-packages/rasterio/__init__.py", line 218, in open
    s = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
  File "rasterio/_base.pyx", line 218, in rasterio._base.DatasetBase.__init__
rasterio.errors.RasterioIOError: HDF4_EOS:EOS_GRID:"MOD13Q1.A2013033.h08v05.006.2015256072248.hdf":MODIS_Grid_16DAY_250m_500m_VI:"250m 16 days NDVI": No such file or directory

When I try to just use rasterio to open the hdf file directly, it doesn't work, either. Should it?

>>> rasterio.open('MOD13Q1.A2013033.h08v05.006.2015256072248.hdf')
Traceback (most recent call last):
  File "rasterio/_base.pyx", line 216, in rasterio._base.DatasetBase.__init__
  File "rasterio/_shim.pyx", line 67, in rasterio._shim.open_dataset
  File "rasterio/_err.pyx", line 205, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: 'MOD13Q1.A2013033.h08v05.006.2015256072248.hdf' not recognized as a supported file format.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jmilloy/Creare/Pipeline/_podpac38_/lib/python3.8/site-packages/rasterio/env.py", line 433, in wrapper
    return f(*args, **kwds)
  File "/home/jmilloy/Creare/Pipeline/_podpac38_/lib/python3.8/site-packages/rasterio/__init__.py", line 218, in open
    s = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
  File "rasterio/_base.pyx", line 218, in rasterio._base.DatasetBase.__init__
rasterio.errors.RasterioIOError: 'MOD13Q1.A2013033.h08v05.006.2015256072248.hdf' not recognized as a supported file format.

@mpu-creare
Copy link
Contributor Author

What version of rasterio are you using? Do you have hdf4 installed?

@jmilloy
Copy link
Collaborator

jmilloy commented Jul 13, 2020

rasterio 1.1.5

I'm checking hdf4 now. I guess it is not part of the extra dependencies?

@jmilloy
Copy link
Collaborator

jmilloy commented Jul 13, 2020

Oh, it's a system library. Okay, I would have expected a better error message from rasterio, but instally hdf4 should help!

@jmilloy
Copy link
Collaborator

jmilloy commented Jul 13, 2020

No difference.

@mpu-creare
Copy link
Contributor Author

mpu-creare commented Jul 13, 2020

hdf4 should come with rasterio -- if installed through conda. I'm using 1.1.3 and it works fine.

@mpu-creare
Copy link
Contributor Author

What's the cwd and the path of the file?

@mpu-creare
Copy link
Contributor Author

mpu-creare commented Jul 13, 2020

So,

df = rasterio.open('MOD13Q1.A2013033.h08v05.006.2015256072248.hdf')
df.datasets

should work.

Also, what's the filesize? Maybe something got broken in transfer. It should be 189MB (198,229,812 bytes)

@jmilloy
Copy link
Collaborator

jmilloy commented Jul 13, 2020

What's the cwd and the path of the file?

The file is in the cwd. The fact that the rasterio.open('<filename>.hdf') command fails because of the filetype makes me feel like the path/filename isn't the issue.

@jmilloy
Copy link
Collaborator

jmilloy commented Jul 13, 2020

seems okay? 190M, or 198229812 bytes.

md5sum: 1d41e6e86c0a247581b66666bd9c9f9b
sha1sum: d1a79009b63d9cac65e13a42c747407ecec35275

@mpu-creare
Copy link
Contributor Author

Yep, MD5SUM is correct.... hmmm... A mystery... here's some of my output:

In [1]: import rasterio

In [2]: df = rasterio.open("MOD13Q1.A2013033.h08v05.006.2015256072248.hdf")
C:\Anaconda3\envs\soilmap\lib\site-packages\rasterio\__init__.py:219: NotGeoreferencedWarning: Dataset has no geotransform set. The identity matrix may be returned.
  s = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)

In [3]: df.driver
Out[3]: 'HDF4'

In [4]: df.subdatasets
Out[4]:
['HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days NDVI',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days relative azimuth angle',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days composite day of the year',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days pixel reliability',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days EVI',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days VI Quality',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days red reflectance',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days NIR reflectance',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days blue reflectance',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days MIR reflectance',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days view zenith angle',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days sun zenith angle']

Do you have GDAL installed? You could try gdalinfo from the commandline to make sure at least that layer works. Then you can move on to `rio info '

Just ideas...

@mpu-creare
Copy link
Contributor Author

mpu-creare commented Jul 13, 2020

Try:

with rasterio.Env() as env:
   dk = list(env.drivers().keys())
   dk.sort()
   print(dk)

@jmilloy
Copy link
Collaborator

jmilloy commented Jul 13, 2020

Yeah, there's no HDF4 driver. I still can't figure out how to install it.

@jmilloy
Copy link
Collaborator

jmilloy commented Jul 13, 2020

gdalinfo worked, the file is fine

@jmilloy
Copy link
Collaborator

jmilloy commented Jul 13, 2020

I'm trying conda now, if that doesn't work, you might be best off just merging this since it is working for you. I'd prefer an explicit check for the subdataset string instead of a try except.

@jmilloy
Copy link
Collaborator

jmilloy commented Jul 13, 2020

Okay, conda was able to include the hdf4 driver for rasterio. I had no trouble opening the file locally using the Rasterio node.

Copy link
Collaborator

@jmilloy jmilloy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, but with a preference for the one change.

podpac/core/data/rasterio_source.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants