-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
735 viz dataframe #741
735 viz dataframe #741
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of suggestions.
I'm missing some tests, specially in the dataset class. It would be nice having on example for each type of dataframe with a geometry column and check that get_geodataframe
returns what we expect.
cartoframes/dataset.py
Outdated
@@ -331,7 +337,7 @@ def _get_remote_geom_type(self, query): | |||
def _get_local_geom_type(self, gdf): | |||
"""Compute geom type of the local dataframe""" | |||
if len(gdf.geometry) > 0: | |||
geom_type = gdf.geometry[0].type | |||
geom_type = gdf.geometry.iloc[0].geom_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the first geometry is null?
We have an utility method to get the first not null value _first_not_null_value
that could be handy for this.
cartoframes/dataset.py
Outdated
return df['longitude'] | ||
if 'lng' in df: | ||
return df['lng'] | ||
if 'long' in df: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can add lon
as well
cartoframes/dataset.py
Outdated
return df['wkb_geometry'] | ||
if 'geom' in df: | ||
return df['geom'] | ||
if 'geojson' in df: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About geojson, if we support them, we should decode them properly right? I guess decode_geom does not support geojson.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I supposed that this was an alias for geometry
(https://github.com/CartoDB/cartodb/blob/master/services/importer/lib/importer/georeferencer.rb#L14).
IMO we can remove this case since we have good support for geojson in CF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's remove it then.
Yes, I have in mind to add tests. I'm thinking now if we could move the |
LGTM |
👀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great -- only thing I want changed is an error helping the user to move forward if geoms aren't found. I'm happy to help re-write it if you don't like the message I wrote.
cartoframes/dataset.py
Outdated
if lat_column is not None and lng_column is not None: | ||
df['geometry'] = _compute_geometry_from_latlng(lat_column, lng_column) | ||
else: | ||
raise ValueError('DataFrame has no geographic data.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about something like this:
"No geographic data found. If a geometry exists, change the column name to the_geom
or geom
or ensure it is a GeoDataFrame with a valid geometry. If there are latitude/longitude columns, rename to lat
/lng
."
This way people have a route forward if they do indeed have geographic information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or better yet we can give them the full list of valid names/formats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about
No geographic data found. If a geometry exists, change the column name (geometry, the_geom, wkt_geometry, wkb_geometry, geom, wkt, wkb) or ensure it is a DataFrame with a valid geometry. If there are latitude/longitude columns, rename to (latitude, lat), (longitude, lng, lon, long).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
cartoframes/viz/layer.py
Outdated
@@ -104,7 +112,9 @@ def __init__(self, | |||
|
|||
def _set_source(source, context): | |||
"""Set a Source class from the input""" | |||
if isinstance(source, (str, list, dict, Dataset)): | |||
if isinstance(source, (str, list, dict, Dataset)) or \ | |||
isinstance(source, pandas.DataFrame) or \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not include pandas.DataFrame
in the tuple above with the other types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True that
cartoframes/data/utils.py
Outdated
df['geometry'] = _compute_geometry_from_latlng(lat_column, lng_column) | ||
else: | ||
raise ValueError('''No geographic data found. ''' | ||
'''If a geometry exists, change the column name ({0}) or ensure it is a DataFrame with a valid geometry. ''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line too long (141 > 120 characters)
from __future__ import absolute_import | ||
|
||
from .dataset import Dataset, get_query, get_geodataframe | ||
from .dataset_info import DatasetInfo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'.dataset_info.DatasetInfo' imported but unused
@@ -0,0 +1,8 @@ | |||
from __future__ import absolute_import | |||
|
|||
from .dataset import Dataset, get_query, get_geodataframe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'.dataset.get_geodataframe' imported but unused
'.dataset.get_query' imported but unused
Hey, I have merged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions and a reflexion.
When I added this issue #704 I was thinking about starting in the from_dataframe
method, removing one of _df
or _gdf
leaving only one way to work locally, avoiding having the same data twice.
This one is focused on the visualization part and really, we are not using geopandas
for anything in the "data" side, but anyway, it is a step forward.
cartoframes/data/dataset.py
Outdated
|
||
from .utils import decode_geometry, compute_query, compute_geodataframe, get_columns, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The max length is 120. You can put the constant in the same line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not used to 120, but I can change it
cartoframes/data/utils.py
Outdated
|
||
|
||
def compute_geodataframe(dataset): | ||
if dataset._df is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if dataset._df is not None: | |
if dataset.dataframe is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I didn't notice this getter 👍
from ..columns import Column | ||
|
||
try: | ||
import geopandas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need that? Why is not a new dependency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a dependency. CARTOframes works without geopandas, but this library is required for some situations. This comes from the past versions, but you can create a ticket to discuss this in future versions of CF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
geopandas is a heavy dependency, so I like the idea of making it optional since it's only required for a couple of features. This is a common pattern. E.g., matplotlib is not required for pandas but when installed dataframe.plot()
returns plots.
', '.join(LAT_COLUMN_NAMES), | ||
', '.join(LNG_COLUMN_NAMES) | ||
)) | ||
return geopandas.GeoDataFrame(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will fail without geopandas dependency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, we need to check the flag here. Thanks
cartoframes/data/utils.py
Outdated
|
||
def compute_geodataframe(dataset): | ||
if dataset._df is not None: | ||
df = dataset._df.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
df = dataset._df.copy() | |
df = dataset.dataframe.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like to copy it. It could be a performance problem (in memory). I am thinking about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't copy the df to generate the gdf, the original df will be modified with an extra column. I'm not sure if we want this or not. Maybe it's a good thing to have the geometry column already there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I worry about performance too. E.g., if a user has a dataframe that's 1GB, which isn't uncommon if they have complex polygons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a warning (or logging.info) that a new column is being added to the original dataframe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have removed the copy() by now. But I think we should revisit it in the future.
Regarding the logging info, this will be displayed always when there is a df visualization, but if you consider that it is OK to have a warning we can add it
@@ -20,7 +20,7 @@ def __init__(self, id, url, name, privacy=PRIVACY_PUBLIC): | |||
|
|||
@classmethod | |||
def create(cls, html, name, context=None, password=None): | |||
from cartoframes.auth import _default_context | |||
from ..auth import _default_context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you prefer a relative path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a common practice to use relative imports in the project and absolute imports for your dependencies. It requires usually fewer chars and it decouples your implementation from the public API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, I like relative imports for project module imports
from .legend import Legend | ||
from ..data import Dataset | ||
|
||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as Dataset one
@@ -0,0 +1,150 @@ | |||
import time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this file, making the rest cleaner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome ❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Only one I'm concerned about is the _first_not_null_value
function in the case where all are null.
cartoframes/data/dataset.py
Outdated
pass | ||
return None | ||
def _first_not_null_value(array): | ||
return array.loc[~array.isnull()].iloc[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if they're all null? I think there will be an IndexError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True that. We can rename the method and use an if everywhere it is used. cc @alrocar
from ..columns import Column | ||
|
||
try: | ||
import geopandas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
geopandas is a heavy dependency, so I like the idea of making it optional since it's only required for a couple of features. This is a common pattern. E.g., matplotlib is not required for pandas but when installed dataframe.plot()
returns plots.
cartoframes/data/utils.py
Outdated
|
||
def compute_geodataframe(dataset): | ||
if dataset._df is not None: | ||
df = dataset._df.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I worry about performance too. E.g., if a user has a dataframe that's 1GB, which isn't uncommon if they have complex polygons.
cartoframes/data/utils.py
Outdated
|
||
def compute_geodataframe(dataset): | ||
if dataset._df is not None: | ||
df = dataset._df.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a warning (or logging.info) that a new column is being added to the original dataframe?
@@ -20,7 +20,7 @@ def __init__(self, id, url, name, privacy=PRIVACY_PUBLIC): | |||
|
|||
@classmethod | |||
def create(cls, html, name, context=None, password=None): | |||
from cartoframes.auth import _default_context | |||
from ..auth import _default_context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, I like relative imports for project module imports
7748bdf
to
810fdf1
Compare
810fdf1
to
e41c6ab
Compare
e96aa31
to
6a8f69d
Compare
6a8f69d
to
30b40aa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last thing. Awesome effort!
cartoframes/data/dataset.py
Outdated
@@ -504,7 +493,7 @@ def _get_geom_col_type(df): | |||
return None | |||
|
|||
try: | |||
geom = _decode_geom(_first_not_null_value(df, geom_col)) | |||
geom = decode_geometry(_first_value(df[geom_col])) | |||
except IndexError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this method raise an IndexError in any case? Maybe we could add a test case ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This Exception won't be raised anymore. What case do you want to cover with tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If first value returns None
I guess decode_geometry
will raise an exception, right? (not an IndexError) should we cover that case with a test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decode_geometry has an if internally, so the None case is ignored
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦♂ I missread the code, you are right!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💥 Looks great!
Related to #735, #746.
This PR adds support for DataFrame visualization. This is the same mechanism we use to get the query from a table, but in this case to obtains a gdf from a df. The geometry is obtained from the available data in the df:
So now you can render a df: