GeoDataset is not well-suited to datasets with multiple CRSes #278
Comments
I don't believe this will work. I definitely agree that this is something we need to support. The biggest issue with reprojection is that it significantly slows down sampling rates (see section 4.2 of our paper). @calebrob6 implemented something like what you're looking for for the Chesapeake CVPR dataset, but it isn't compatible with any other GeoDataset. I think this will likely be a major change we'll make for the 0.3.0 release (0.2.0 is coming soon). Reprojection is necessary for two purposes:
We could potentially provide a flag that says "don't reproject, I know this prevents me from doing 1 and 2". I'm trying to think of a more elegant solution that only reprojects when necessary. Will update this issue thread as I come up with new ideas. Feel free to add to this thread if you (or anyone else) have ideas!
For 1, the whole idea of reprojecting to UTM with fixed pixel sizes is that the CRS no longer matters. For all purposes, I have turned the rasters into their locally low-distortion representations with square pixels. Why should I not be able to sample from two different datasets at that point? Is For 2, if all the scenes in the GeoDataset are already in the same CRS, then why does there need to be any reprojection step involved? It could also be that I am being too clever for my own good w.r.t. this library. I could just leave my rasters in EPSG:4326, but then I incur large distortions at polar regions and oceans (up to 40km!). That's why we reproject to "local" UTMs in the first place.
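To illustrate why square pixels in degrees are not square on the ground, here is a rough sketch of how the ground footprint of an EPSG:4326 pixel changes with latitude. It assumes a spherical Earth with a mean radius of 6,371 km, and the function name `pixel_size_m` is made up for this example. It is not meant to reproduce the 40km figure above.

```python
import math

# Assumption: spherical Earth with mean radius, good enough for an illustration.
EARTH_RADIUS_M = 6_371_000.0


def pixel_size_m(res_deg: float, lat_deg: float) -> tuple[float, float]:
    """Approximate ground (width, height) in metres of a pixel res_deg degrees on a side."""
    metres_per_degree = math.pi * EARTH_RADIUS_M / 180.0
    height = res_deg * metres_per_degree
    # A degree of longitude shrinks with the cosine of the latitude.
    width = height * math.cos(math.radians(lat_deg))
    return width, height


for lat in (0, 45, 60, 80):
    w, h = pixel_size_m(0.0001, lat)
    print(f"lat {lat:>2}: {w:6.2f} m wide x {h:6.2f} m tall")
```

The pixel height stays constant while the width shrinks towards the poles, which is the distortion that reprojecting each scene to its local UTM zone avoids.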
That assumes that both datasets are in UTM. If they aren't, reprojection is again necessary.
PyTorch's
Not all scenes in the GeoDataset are in the same CRS: each scene is in its local UTM zone, and overlapping images on the boundaries of UTM zones could be in different UTM zones.
Going to take a stab at this
The thing that doesn't work here is the way we actually index into GeoDatasets. If you have a GeoDataset, Consider the general problem of indexing into a dataset (just getting some crop out of the dataset) made up of two rasters, A and B, that are projected into different UTM zones:
Now, further, consider the problem where you want to be able to do spatial joins with this dataset. Assume you have another dataset with a raster C in some geographic coordinate system that covers both A and B (maybe this is a global land cover map or something). You'd like to sample patches from both datasets and have everything line up. (You aren't asking about this use case, but it helps explain why we've done what we have.) For the first indexing approach: For the second indexing approach:
@adamjstewart @isaaccorley, what if we allowed RasterDatasets to be indexed both ways described above and then adjusted the Samplers to behave in "mode 1" (geographic space) or "mode 2" (pixel-space)? Note: it is easy to convert
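To make the two modes concrete, here is a minimal sketch of going back and forth between a geographic bounding box and a pixel window for a single north-up raster. The `Raster` class and both function names are hypothetical and are not part of the library's API. It also assumes the bounding box is expressed in the raster's own CRS.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Raster:
    """Hypothetical stand-in for a north-up raster in a projected CRS."""
    left: float  # x coordinate of the left edge, in CRS units
    top: float   # y coordinate of the top edge, in CRS units
    xres: float  # pixel width in CRS units
    yres: float  # pixel height in CRS units (positive)


def bbox_to_window(r, minx, miny, maxx, maxy):
    """Geographic space -> pixel space: returns (row_off, col_off, rows, cols)."""
    col_off = math.floor((minx - r.left) / r.xres)
    row_off = math.floor((r.top - maxy) / r.yres)
    col_end = math.ceil((maxx - r.left) / r.xres)
    row_end = math.ceil((r.top - miny) / r.yres)
    return row_off, col_off, row_end - row_off, col_end - col_off


def window_to_bbox(r, row_off, col_off, rows, cols):
    """Pixel space -> geographic space: returns (minx, miny, maxx, maxy)."""
    minx = r.left + col_off * r.xres
    maxy = r.top - row_off * r.yres
    return minx, maxy - rows * r.yres, minx + cols * r.xres, maxy


raster = Raster(left=500_000.0, top=4_000_000.0, xres=10.0, yres=10.0)
bbox = window_to_bbox(raster, row_off=10, col_off=20, rows=256, cols=256)
window = bbox_to_window(raster, *bbox)
```

The conversion is only this simple within one raster. Once a query spans rasters in different UTM zones, the bounding box has to be transformed between CRSes first, which is the hard part discussed in this thread.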
I would have to think more about this. I wonder if it's possible to not reproject at all by default and only reproject when sampling from two datasets in a different CRS.
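As a sketch of what "only reproject when necessary" could look like, the decision can be reduced to checking how many distinct CRSes the files hit by a query have. The function `target_crs` and its `default` parameter are hypothetical names for illustration, not an existing API.

```python
from typing import Optional


def target_crs(hit_crs: list[str], default: Optional[str] = None) -> Optional[str]:
    """Return the CRS to warp to, or None when no reprojection is needed.

    hit_crs holds the CRS of every file that intersects the query.
    """
    if len(set(hit_crs)) <= 1:
        # Zero or one distinct CRS: read the data as-is.
        return None
    # Mixed CRSes: fall back to the caller's choice, else the first file's CRS.
    return default if default is not None else hit_crs[0]


print(target_crs(["EPSG:32610", "EPSG:32610"]))
print(target_crs(["EPSG:32610", "EPSG:32611"]))
```

A query that only touches files in one UTM zone would skip the warp entirely, and only queries that straddle a zone boundary or join two datasets would pay the reprojection cost.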
I'm assuming
From what I can tell, "mode 2" would not allow intersections between datasets or overlapping files in a dataset, is this correct? This is pretty fundamental to GeoDataset, if we removed this we might as well just use a VisionDataset. Ideally, what I would like to do is avoid reprojection when we don't need it and only reproject when sampling from multiple files in different CRSs. What if we:
I think we might run into issues with varying image size if we don't reproject though...
Correct, but I don't think the reverse is needed.
You are correct that "mode 2" doesn't allow intersections, however if that is necessary, then the
I agree. Perhaps it'd be useful to write out expected behavior in different scenarios, e.g.:
What would it take to use multiple datasets? And the GeoSampler could create one grid for each dataset, or maybe multiple samplers are supported? Then the
What advantage would that have? You would still need to reproject and merge all indices inside the GeoSampler, then reproject the bbox back to the original CRS before sampling. I just don't want to make people create a separate dataset for every UTM Zone for a global dataset.
A common workflow to get geospatial datasets ML-ready is to reproject them to the correct UTM zone they originate in with fixed xres and yres in order to work with them directly in pixel space. Thus, datasets comprised of imagery from a wide area will have multiple CRSes.
RasterDataset currently relies on all the data being in the same CRS. If it is not, RasterDataset will automatically reproject the data to a fixed CRS (user-specified or the one found in the first raster).
One proposed solution is to use one RasterDataset per CRS and then combine them using torch.utils.data.ConcatDataset. This could work but feels inelegant.
A flag which lets me choose whether to reproject or not could work here.
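For context, here is a small sketch of why a wide-area collection ends up with many CRSes, and of the grouping step the one-dataset-per-CRS workaround would need. The helper names and the sample scenes are made up for illustration. The zone formula is the standard 6-degree UTM grid and ignores the special zones around Norway and Svalbard.

```python
import math
from collections import defaultdict


def utm_epsg(lon: float, lat: float) -> int:
    """EPSG code of the WGS 84 / UTM zone containing (lon, lat)."""
    zone = int(math.floor((lon + 180.0) / 6.0)) % 60 + 1
    # 326xx codes are northern-hemisphere zones, 327xx southern.
    return (32600 if lat >= 0 else 32700) + zone


def group_by_crs(scenes):
    """Group (name, lon, lat) scene centres by their local UTM EPSG code."""
    groups = defaultdict(list)
    for name, lon, lat in scenes:
        groups[utm_epsg(lon, lat)].append(name)
    return dict(groups)


# Hypothetical scene centres.
scenes = [
    ("seattle.tif", -122.3, 47.6),
    ("portland.tif", -122.7, 45.5),
    ("washington_dc.tif", -77.0, 38.9),
    ("sydney.tif", 151.2, -33.9),
]
groups = group_by_crs(scenes)
```

Each group could then back its own RasterDataset and be combined with torch.utils.data.ConcatDataset, which is the workaround described above. A global collection would need one dataset per occupied UTM zone.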