Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

735 viz dataframe #741

Merged
merged 22 commits into from
Jun 10, 2019
Merged

735 viz dataframe #741

merged 22 commits into from
Jun 10, 2019

Conversation

Jesus89
Copy link
Member

@Jesus89 Jesus89 commented Jun 6, 2019

Related to #735, #746.

This PR adds support for DataFrame visualization. This is the same mechanism we use to get the query from a table, but in this case to obtains a gdf from a df. The geometry is obtained from the available data in the df:

  1. Geom column
  2. Lat and lng columns

So now you can render a df:

  • From local data
import pandas as pd

data = {'latitude': [0, 10, 20, 30], 'longitude': [0, 10, 20, 30]}
df = pd.DataFrame.from_dict(data)

Map(Layer(df, 'width: 30'))
  • From CARTO
from cartoframes import Dataset

ds = Dataset.from_table('populated_places')

df = ds.download(limit=10)

Map(Layer(df, 'width: 10'))
  • From Dataset created with a df
import pandas as pd

data = {'latitude': [0, 10, 20, 30], 'longitude': [0, 10, 20, 30]}
df = pd.DataFrame.from_dict(data)

ds = Dataset.from_dataframe(df)

Map(Layer(ds, 'width: 30'))

Copy link
Contributor

@alrocar alrocar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of suggestions.

I'm missing some tests, specially in the dataset class. It would be nice having on example for each type of dataframe with a geometry column and check that get_geodataframe returns what we expect.

@@ -331,7 +337,7 @@ def _get_remote_geom_type(self, query):
def _get_local_geom_type(self, gdf):
"""Compute geom type of the local dataframe"""
if len(gdf.geometry) > 0:
geom_type = gdf.geometry[0].type
geom_type = gdf.geometry.iloc[0].geom_type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the first geometry is null?

We have an utility method to get the first not null value _first_not_null_value that could be handy for this.

return df['longitude']
if 'lng' in df:
return df['lng']
if 'long' in df:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add lon as well

return df['wkb_geometry']
if 'geom' in df:
return df['geom']
if 'geojson' in df:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About geojson, if we support them, we should decode them properly right? I guess decode_geom does not support geojson.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I supposed that this was an alias for geometry (https://github.com/CartoDB/cartodb/blob/master/services/importer/lib/importer/georeferencer.rb#L14).
IMO we can remove this case since we have good support for geojson in CF

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's remove it then.

@Jesus89
Copy link
Member Author

Jesus89 commented Jun 6, 2019

Yes, I have in mind to add tests. I'm thinking now if we could move the get_query and `get_geodataframe' utilities to another file (?) The Dataset file is pretty big now.

@Jesus89 Jesus89 requested a review from andy-esch June 6, 2019 12:49
@alrocar
Copy link
Contributor

alrocar commented Jun 6, 2019

I'm thinking now if we could move the get_query and `get_geodataframe' utilities to another file

LGTM

@andy-esch
Copy link
Contributor

👀

Copy link
Contributor

@andy-esch andy-esch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great -- only thing I want changed is an error helping the user to move forward if geoms aren't found. I'm happy to help re-write it if you don't like the message I wrote.

if lat_column is not None and lng_column is not None:
df['geometry'] = _compute_geometry_from_latlng(lat_column, lng_column)
else:
raise ValueError('DataFrame has no geographic data.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something like this:
"No geographic data found. If a geometry exists, change the column name to the_geom or geom or ensure it is a GeoDataFrame with a valid geometry. If there are latitude/longitude columns, rename to lat/lng."

This way people have a route forward if they do indeed have geographic information.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or better yet we can give them the full list of valid names/formats.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about

No geographic data found. If a geometry exists, change the column name (geometry, the_geom, wkt_geometry, wkb_geometry, geom, wkt, wkb) or ensure it is a DataFrame with a valid geometry. If there are latitude/longitude columns, rename to (latitude, lat), (longitude, lng, lon, long).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -104,7 +112,9 @@ def __init__(self,

def _set_source(source, context):
"""Set a Source class from the input"""
if isinstance(source, (str, list, dict, Dataset)):
if isinstance(source, (str, list, dict, Dataset)) or \
isinstance(source, pandas.DataFrame) or \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not include pandas.DataFrame in the tuple above with the other types?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True that

df['geometry'] = _compute_geometry_from_latlng(lat_column, lng_column)
else:
raise ValueError('''No geographic data found. '''
'''If a geometry exists, change the column name ({0}) or ensure it is a DataFrame with a valid geometry. '''

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line too long (141 > 120 characters)

@Jesus89 Jesus89 mentioned this pull request Jun 6, 2019
from __future__ import absolute_import

from .dataset import Dataset, get_query, get_geodataframe
from .dataset_info import DatasetInfo

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'.dataset_info.DatasetInfo' imported but unused

@@ -0,0 +1,8 @@
from __future__ import absolute_import

from .dataset import Dataset, get_query, get_geodataframe

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'.dataset.get_geodataframe' imported but unused
'.dataset.get_query' imported but unused

@Jesus89
Copy link
Member Author

Jesus89 commented Jun 7, 2019

Hey, I have merged develop with the sharing viz changes, updated the data namespace, and added tests to cover the geometry generation. So it's ready for review 😄

Copy link
Contributor

@simon-contreras-deel simon-contreras-deel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions and a reflexion.

When I added this issue #704 I was thinking about starting in the from_dataframe method, removing one of _df or _gdf leaving only one way to work locally, avoiding having the same data twice.

This one is focused on the visualization part and really, we are not using geopandas for anything in the "data" side, but anyway, it is a step forward.


from .utils import decode_geometry, compute_query, compute_geodataframe, get_columns, \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The max length is 120. You can put the constant in the same line

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not used to 120, but I can change it



def compute_geodataframe(dataset):
if dataset._df is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if dataset._df is not None:
if dataset.dataframe is not None:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't notice this getter 👍

from ..columns import Column

try:
import geopandas
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need that? Why is not a new dependency?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a dependency. CARTOframes works without geopandas, but this library is required for some situations. This comes from the past versions, but you can create a ticket to discuss this in future versions of CF.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

geopandas is a heavy dependency, so I like the idea of making it optional since it's only required for a couple of features. This is a common pattern. E.g., matplotlib is not required for pandas but when installed dataframe.plot() returns plots.

', '.join(LAT_COLUMN_NAMES),
', '.join(LNG_COLUMN_NAMES)
))
return geopandas.GeoDataFrame(df)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will fail without geopandas dependency

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, we need to check the flag here. Thanks


def compute_geodataframe(dataset):
if dataset._df is not None:
df = dataset._df.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
df = dataset._df.copy()
df = dataset.dataframe.copy()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like to copy it. It could be a performance problem (in memory). I am thinking about it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't copy the df to generate the gdf, the original df will be modified with an extra column. I'm not sure if we want this or not. Maybe it's a good thing to have the geometry column already there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I worry about performance too. E.g., if a user has a dataframe that's 1GB, which isn't uncommon if they have complex polygons.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a warning (or logging.info) that a new column is being added to the original dataframe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have removed the copy() by now. But I think we should revisit it in the future.

Regarding the logging info, this will be displayed always when there is a df visualization, but if you consider that it is OK to have a warning we can add it

@@ -20,7 +20,7 @@ def __init__(self, id, url, name, privacy=PRIVACY_PUBLIC):

@classmethod
def create(cls, html, name, context=None, password=None):
from cartoframes.auth import _default_context
from ..auth import _default_context
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you prefer a relative path?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a common practice to use relative imports in the project and absolute imports for your dependencies. It requires usually fewer chars and it decouples your implementation from the public API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I like relative imports for project module imports

from .legend import Legend
from ..data import Dataset

try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as Dataset one

@@ -0,0 +1,150 @@
import time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this file, making the rest cleaner

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Copy link
Contributor

@alrocar alrocar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome ❤️

Copy link
Contributor

@andy-esch andy-esch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Only one I'm concerned about is the _first_not_null_value function in the case where all are null.

pass
return None
def _first_not_null_value(array):
return array.loc[~array.isnull()].iloc[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if they're all null? I think there will be an IndexError

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True that. We can rename the method and use an if everywhere it is used. cc @alrocar

from ..columns import Column

try:
import geopandas
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

geopandas is a heavy dependency, so I like the idea of making it optional since it's only required for a couple of features. This is a common pattern. E.g., matplotlib is not required for pandas but when installed dataframe.plot() returns plots.


def compute_geodataframe(dataset):
if dataset._df is not None:
df = dataset._df.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I worry about performance too. E.g., if a user has a dataframe that's 1GB, which isn't uncommon if they have complex polygons.


def compute_geodataframe(dataset):
if dataset._df is not None:
df = dataset._df.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a warning (or logging.info) that a new column is being added to the original dataframe?

@@ -20,7 +20,7 @@ def __init__(self, id, url, name, privacy=PRIVACY_PUBLIC):

@classmethod
def create(cls, html, name, context=None, password=None):
from cartoframes.auth import _default_context
from ..auth import _default_context
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I like relative imports for project module imports

@Jesus89 Jesus89 force-pushed the 735-viz-dataframe branch 2 times, most recently from 7748bdf to 810fdf1 Compare June 7, 2019 16:08
@Jesus89 Jesus89 force-pushed the 735-viz-dataframe branch from 810fdf1 to e41c6ab Compare June 10, 2019 08:49
@Jesus89 Jesus89 mentioned this pull request Jun 10, 2019
@Jesus89 Jesus89 force-pushed the 735-viz-dataframe branch from e96aa31 to 6a8f69d Compare June 10, 2019 09:42
@Jesus89 Jesus89 force-pushed the 735-viz-dataframe branch from 6a8f69d to 30b40aa Compare June 10, 2019 09:43
@Jesus89 Jesus89 requested review from alrocar and andy-esch June 10, 2019 09:44
Copy link
Contributor

@alrocar alrocar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thing. Awesome effort!

@@ -504,7 +493,7 @@ def _get_geom_col_type(df):
return None

try:
geom = _decode_geom(_first_not_null_value(df, geom_col))
geom = decode_geometry(_first_value(df[geom_col]))
except IndexError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this method raise an IndexError in any case? Maybe we could add a test case ;)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Exception won't be raised anymore. What case do you want to cover with tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If first value returns None I guess decode_geometry will raise an exception, right? (not an IndexError) should we cover that case with a test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decode_geometry has an if internally, so the None case is ignored

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦‍♂ I missread the code, you are right!

Copy link
Contributor

@alrocar alrocar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Copy link
Contributor

@andy-esch andy-esch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💥 Looks great!

@Jesus89 Jesus89 merged commit 54cf937 into develop Jun 10, 2019
@Jesus89 Jesus89 deleted the 735-viz-dataframe branch June 10, 2019 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants