-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support GeoParquet #6438
Comments
Thank you, @severo ! I would be more than happy to help in any way I can. I am not familiar with this repo's codebase, but I would be eager to contribute. :) For the preview in Datasets Hub, I think it makes sense to just display the geospatial column as text. If there were a dataset loader, though, I think it should be able to support the geospatial components. Geopandas is probably the most user-friendly interface for that. I'm not sure if it's currently relevant in the context of geoparquet, but I think the pyogrio driver is faster than fiona. But the whole gdal dependency thing can be a real pain. If anything, it would need to be an optional dependency. Maybe it would be best if the loader tries importing relevant geospatial libraries, and in the event of an ImportError, falls back to text for the geometry column. Please let me know if I can be of assistance, and thanks again for creating this Issue. :) |
Just hitting into this same issue too showing GeoParquet files in Datasets Viewer. I tried to implement a custom reader for GeoParquet in https://huggingface.co/datasets/weiji14/clay_vector_embeddings/discussions/1, but it seems like HuggingFace has disabled datasets with custom loading scripts from using the dataset viewer according to https://discuss.huggingface.co/t/dataset-repo-requires-arbitrary-python-code-execution/59346 I'm thinking now if there's a way to simply map files with GeoParquet extensions (*.gpq, *.geoparquet, etc) to use the Parquet reader. Maybe we could allowlist these geoparquet file extensions at https://github.com/huggingface/datasets/blame/0caf91285116ec910f409e82cc6e1f4cff7496e3/src/datasets/packaged_modules/__init__.py#L30-L51? Having the table columns show up would be a quick win. Longer term though, it would certainly be nice if the WKB geometry columns could be displayed in a nicer form. Geopandas' read_parquet function is supposedly faster than |
Update: It looks like renaming the GeoParquet file to have a file extension of I've opened a quick PR at #6508 to allow files with a |
@joshuasundance-swca, @weiji14, If I'm understanding this correctly, the code below wouldn't be recommended to due to dependency headaches? If that's the case, what solution would there be to see the geometry features for .gpq files in huggingfaceHub? code for dataset_loader.py
|
Hi @mehrdad-es, I'm not sure if HuggingFace would be keen to add |
I've created huggingface/dataset-viewer#2416 to discuss the possibility of supporting (vectorial) geospatial columns in the dataset viewer, or in the converted parquet files. At the same time, it would be super interesting to see what is already possible to do with a Hugging Face dataset that hosts geospatial data.
It would be awesome to show this inside a Space. |
Feature request
Support the GeoParquet format
Motivation
GeoParquet (https://geoparquet.org/) is a common format for sharing vectorial geospatial data on the cloud, along with "traditional" data columns.
It would be nice to be able to load this format with datasets, and more generally, in the Datasets Hub (see https://huggingface.co/datasets/joshuasundance/govgis_nov2023-slim-spatial/discussions/1).
Your contribution
I would be happy to help work on a PR (but I don't think I can do one on my own).
Also, we have to define what we want to support:
The text was updated successfully, but these errors were encountered: