-
Notifications
You must be signed in to change notification settings - Fork 794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integration with geopandas #588 #818
Conversation
Do we need from geojson import FeatureCollection,Feature
from geojson.utils import generate_random
import altair as alt
alt.renderers.enable('notebook')
geo_data = FeatureCollection([Feature(geometry=generate_random('LineString'),
properties={"prop": i+1}) for i in range(5)])
alt.Chart(alt.InlineData(geo_data.__geo_interface__,{'type':'json','property':'features'})
).mark_geoshape(
fillOpacity=0.8
).project(
).encode(
fill='properties.prop:O'
).properties( width=500,height=300) |
This is great – thanks for tackling this! One comment: I think the cleanest approach would be to make it so the In particular, I think we should avoid explicit dependence on geojson within Altair's test suite if at all possible. Please let me know when this is ready for review! |
Is it ok to use |
I'd prefer a test using a custom class that provides a simple geo interface output, if that's at all possible, so we have some test coverage of the behavior even without geo libraries installed. We could also use an |
I think code and test are ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment below, but overall I'm a bit confused. It looks like this has three broad changes:
to_json()
will act differently if the dataframe has a__geo_interface__
to_csv()
will act differently if the dataframe has a__geo_interface__
- There is a new
to_geojson_values
data transformer that the user will have to somewhere explicitly enable in order to use geojson inputs.
Is that correct?
I'm a bit confused regarding the goal of these changes, and what your intent is with regard to the API and the user experience with Geojson data.
My thought would be that we should make it as seamless as possible: i.e. a user can pass a geojson object as they would any dataframe, and it should work without any additional configuration.
altair/utils/core.py
Outdated
attrs['type'] = infer_vegalite_type(data[attrs['field']]) | ||
if hasattr(data,'__geo_interface__'): #TODO: Add descripion in shorthand | ||
attrs['field']='properties.'+attrs['field'] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something strange is going on here: is not checking for type
in attrs
necessary for this PR? And why is this block doubly indented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'type' not in attrs
this is still checked, but in the line below (232).
The idea behind: to allow user work with GeoDataFrame like as regular DataFrame but showing geoshapes attached to its rows.
Implementation details:
First of all GeoDataFrame is subclass of pd.DataFrame
with __geo_interface__
. So it is instance of pd.DataFrame
that has attribute __geo_interface__
. GeoPandas are stored as geojson FeatureCollection with each row as Feature object where all columns are placed in properties object. Sample will be more informative:
{ /* vega-light data*/
"format": {
"property": "features", /* generate geoshape for each row*/
"type": "json"
},
"values": { /* valid geojson for all rows*/
"type": "FeatureCollection",
"features": [
{ /* valid geojson for each row*/
"type": "Feature",
"properties": { /* column values */
"pop_est": 12799293.0,
"continent": "Africa",
},
"geometry": { /* geometry of the row */
"type": "MultiPolygon",
"coordinates": [ /* a lot of numbers*/]
}
}
]
}
}
So first step is to add ["property": "features"] to vega-light data format description. That splits valid geojson stored in values back to rows of GeoDataFrame (it is possible to replace this step with storing content of "features" directly into "values" but that will made "values" invalid geojson).
Next is access to column values of GeoDataFrame. Values are accessible from chart as "properties.column_name". I hoped to simplify user experience by adding "properties." in shorthand. Now if user use shorthands he will get same behavior for GeoDataFrame as for DataFrame (take a look to updated description of PR) .
May be I should check if field name starts with "properties." to avoid doubling it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see.
We can think about that. In the meantime can you fix the indentation? 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed anymore due new save format
Yes, but I see it a little bit different.
No it was removed by 2cd9fde after your proposal of using
Yes, all three of
|
Strange... when I looked at changes a few hours ago, I didn't see the updated commits. I see that the changes are entirely different now. Let me take another look. |
altair/utils/data.py
Outdated
@@ -33,7 +33,7 @@ class DataTransformerRegistry(PluginRegistry[DataTransformerType]): | |||
# form. | |||
# | |||
# A data model transformer has the following type signature: | |||
# DataModelType = Union[dict, pd.DataFrame] | |||
# DataModelType = Union[dict, pd.DataFrame, gpd.GeoDataFrame, geojson interface object] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it true that any Python object can expose a geojson interface? In that case, I don't think the typing-style specification is all that useful anymore. (or just needs to have Any
as an entry)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According Sean Gillies it is growing community of adopters:). In fact there is some list of classes unknown to me. In my opinion there is no crucial value in supporting general interface - just funny feature. If it breaks some concept maybe we should remove interface support.
Or we can declare only geojson, so all others will be supported accidentally.
Union[dict, pd.DataFrame, gpd.GeoDataFrame, geojson.GeoJSON]
? What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. I'm not certain. I don't know the geo space well enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I think Union[dict, pd.DataFrame, gpd.GeoDataFrame, geojson.GeoJSON]
better for now :) I'll leave it until further instructions.
altair/utils/data.py
Outdated
|
||
elif hasattr(data,'__geo_interface__'): # geojson object | ||
with open(filename,'w') as f: | ||
json.dump(data.__geo_interface__, f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put all the geo_interface logic in one block... e.g.
if hasattr(data, '__geo_interface__'):
# do geo stuff
elif isinstance(data, pd.DataFrame):
# do dataframe stuff
elif isinstance(data, dict):
# do dict stuff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, GeoPandas
needs extra line of code than a simple __geo_interface__
. I had something like :
if isinstance(data, pd.DataFrame) and hasattr(data, '__geo_interface__'):
# do GeoPandas stuff
elif hasattr(data, '__geo_interface__'):
# do geo_interface stuff
elif isinstance(data, pd.DataFrame):
# do dataframe stuff
elif isinstance(data, dict):
# do dict stuff
That is simple and clear. But I supposed that it could provoke errors in case some modification in Pandas block without mirroring it to GeoPandas. That why is so strange logic in final version.
Should i rewrite it back?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see – there's overlap between geo and dataframe.
I think it would be best to put all the geo logic under one if statement, so something like:
if hasattr(data, '__geo_interface__'):
if dataframe:
#stuff
else:
# other stuff
elif isinstance(data, pd.DataFrame):
# do dataframe stuff
elif isinstance(data, dict):
# do dict stuff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
.vscode/settings.json
Outdated
@@ -0,0 +1,3 @@ | |||
{ | |||
"python.pythonPath": "/Users/tim/anaconda/anaconda/envs/altair_test/bin/python" | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
.vscode/tasks.json
Outdated
] | ||
} | ||
] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
altair/utils/core.py
Outdated
attrs['type'] = infer_vegalite_type(data[attrs['field']]) | ||
if hasattr(data,'__geo_interface__'): #TODO: Add descripion in shorthand | ||
attrs['field']='properties.'+attrs['field'] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see.
We can think about that. In the meantime can you fix the indentation? 😄
altair/utils/data.py
Outdated
@@ -57,6 +57,8 @@ def limit_rows(data, max_rows=5000): | |||
values = data['values'] | |||
else: | |||
return data | |||
else: | |||
return data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that there are three return statements in this code, the logic is pretty opaque (it took me a bit to read this and figure out what it was doing).
I think the function should be refactored for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
altair/utils/data.py
Outdated
data = sanitize_dataframe(data) | ||
data.to_csv(filename) | ||
return { | ||
'url': filename, | ||
'format': {'type': 'csv'} | ||
} | ||
elif hasattr(data,'__geo_interface__'):#GeoJSON | ||
raise NotImplementedError('to_csv only works with Pandas DataFrame objects.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here: please do the geo_interface logic all in one place, rather than checking it twice.
Also, four spaces for indentation please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
What does it take to include (potential) support for writing to I’m not saying it should be implemented now, but once it is supported in |
@mattijn It would be tricky now as interface it is not defined yet, but as soon as GeoPandas release it (I hope we will see it) we should update to topojson as default behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think all requested changes are done.
Thanks – I'm traveling to PyCon tomorrow and speaking the next two days, so it may be a few days before I get back to reviewing this. |
While Travis is fighting my Python 3.5 code it skipped main part of the test based on
Detailed report is here. It would take some time to analyze it, but my preliminary conclusion is that storing I'll think about some other format to store data. |
Yes, selenium is not particularly easy to set up or install in Travis, so for the time being those tests are skipped in the CI. I'm not certain I understand the issue with GeoDataFrame... are you trying to use GeoDataFrame for every example? What's the purpose of that? |
I was working on example for documentation: import altair as alt
import geopandas as gpd
alt.renderers.enable('notebook')
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
data = world[world.continent=='Africa']
alt.hconcat(
alt.Chart().mark_bar(
).encode(
x='pop_est',
#SortField do not use shorthand so full field naming is needed.
# \\. is to avoid Vega-lite bug 3727
y=alt.Y('name', sort=alt.SortField(field='properties\\.pop_est',
op='sum', order='descending')),
tooltip='pop_est',
color='pop_est',
).properties(
width=300,
height=400,
),
alt.Chart().mark_geoshape(
).project(
).encode(
# for GeoDataFrame fields shorthand adds "properties." and infer types
color='pop_est',
tooltip='name'
).properties(
width=400,
height=400,
title='Africa population'
),
data=data
) Now shorthand adds "properties." to field names and adds a "title" property if it is not set. But I understood that there are a lot of field definitions without shorthand, so we need to add "properties." in all other cases and than test it somehow. Then I bumped into vega-lite #3727 than on #3729, so I started thinking on systematic approach to test it all. I found "test_examples" that do complex test like I needed and decided that if use just data from GeoDataFrame in the same code I should expect the same charts as with DataFrame but with data stored otherway. I expected to find most of places where "properties." should be added and come up with solution for that, but things gone wrong. |
First six charts analysis in a table:
Basing on that table I can conclude that Vega-light has issues with nested data that will block all efforts to support GeoJSON on Altair. In other hand to hide from user that GeoPandas stored in nested structures will need some real efforts that may include new Mixin for transformers and so. New approach What do you think about it? Proof of concept import altair as alt
import geopandas as gpd
alt.renderers.enable('notebook')
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
data = world[world.continent=='Africa']
alt_data = dict(values =[dict(d,type = g['type'],geometry=g['geometry'])
for d,g in zip(
data.drop('geometry',axis=1).to_dict('row'),
data.geometry.__geo_interface__['features']
)] )
alt.Chart(alt_data).mark_geoshape(
).project(
).encode(
color='pop_est:Q',
tooltip='name:N'
).properties(
width=500,
height=300
) |
Is it ok to have dependence from geopandas in an example ? Or notebook with examples is preferable? |
elif isinstance(data, dict) and ('values' in data): | ||
values = data['values'] | ||
else: | ||
return data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As currently written the function will never progress beyond this line, and the max_rows check will never happen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if so, how could it pass test_limit_rows()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or do you mean that it bypass unknown data types, then yes like #887
Thanks for the continued work on this. I have to admit I'm a bit apprehensive about merging this at this point: this ends up being much more complicated than I anticipated in the initial discussion, and there seem to be a lot of details and subtleties here. Admittedly geojson is not something I have a deep understanding of... I would not be confident in my ability to maintain this code, or to answer user questions that will inevitably come up. |
How can I support you in that decision? Now with flat data version, in my opinion it looks much simpler that it was in progress. If you really don't want include it in Altair, could I publish "plugin" connector based on your 'utils.data' which will overload pipeline with my version? |
Hm, this PR is still open. Let me be brave and add a comment in the hope we might go forward (or backward, both is fine). I'd played a bit with the way how you create a flat data version of the pandas dataframe including the geometry column and its seems to follow more the 'normal' row-oriented JSON. I really liked the example you wrote of the population of continental Africa, where you combined a bar chart with a spatial chart*1. I'm worried though that if this flat data version is written on the side of So even though I will use your flat data version for the time being, I think Regarding plugins, I hope that we won't get a plugin for any *1 In the flat data version it creates a dictionary of the (geo)dataframe where there is no
|
…pandas # Conflicts: # altair/utils/data.py
…pandas # Conflicts: # altair/utils/data.py
@jakevdp I've rebased to resolve merge conflicts, but is good point to make a final decision on that branch. @mattijn thank you for your feedback. But fortunately GeoJSON Specification RFC 7946 allows use of "Foreign Members" in objects. That is exactly what I do during "flattering". So now is generating valid GeoJSON format, with one exception it stores array of GeoJSON objects which is acceptable for Vega but not allowed by specification. If you think that "data.values" by itself should be exactly a GeoJSON Object I can wrap it with collection and than spit it back with value of "data.format.property" Regarding maintenance and user support I understand Jake's scepticism. GeoJSON it little bit bigger than just a mark type so users could have a lot directly unrelated questions that should be redirected somewhere and it could take some time. From my side I have plan to make a notebook with examples of making charts and maps using geopandas and altair. And I can support that area for now if it could help. |
I think it would be good to move forward with this. But I think that #887 offers a much cleaner framework in which to add this sort of functionality. I thought we could merge that relatively quickly a few weeks ago, but review has been slow. |
@jakevdp I've reviewed #887 it really good idea to somehow universalise pipeline to work with different classes. But geojson and geopandas introduce some new cases to that abstraction (I commented it in the PR). That's why I would prefer if it possible to rebase #887 on my code to make a useful framework. Otherwise I'll need to somehow update their implementation of mechanics for handling invalid pipelines. |
} for item in dct['data']['values']] | ||
}) | ||
|
||
assert (data2[data.columns] == data).all().all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would data2[data.columns].equals(data)
not work here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should work :) and looks more readable
@jakevdp I think is time to close branch in favour to separate connection module. Or you prefer to use this branch somehow? Please take a look at https://iliatimofeev.github.io/gpdvega/. What do you think on idea to move it under altait-viz umbrella? |
Thanks for prototyping the |
Integration with geopandas( fix #588)
Now GeoDataFrame is valid Data type for alt.Chart
Any object with geo_interface could be used as data. But in that case it will be interpreted as one shape
If you want to split FeatureCollection into different shapes use
alt.geojson_feature
(76a2af8)With GeoPandas to_json or to_values directly.
TODO:
Changes summary
All three of
to_values
,to_json
,to_csv
act differently in two new use cases.Case GeoDataFrame (from GeoPandas) that is
pd.DataFrame
with__geo_interface__
:to_csv()
I don't know any way to use geometry in this format from vega-light. So it throws NotImplemented to avoid user misunderstandings.to_values()
andto_json()
will save GeoDataFrame.geometry through__geo_interface__
into geojson enriching it with values from data. That flatting allows simplify usage by avoiding nested properties. The Idea is inspired by vega/vega-lite/#3341Case geo_interface only for objects with
__geo_interface__
than is not apd.DataFrame
instance so it is not GeoDataFrame:to_csv()
Throws NotImplemented. Geojson tree structure could not be saved in plain csv format.to_values()
andto_json()
will save object through__geo_interface__
into geojson.geojson_feature
a clone oftopojson_feature
wraps data with appropriatealt.Data
and allows to split GeoJSON collection into different shapes.