735 viz dataframe #741

Jesus89 · 2019-06-06T11:22:17Z

Related to #735, #746.

This PR adds support for DataFrame visualization. It uses the same mechanism we use to get the query from a table, but in this case to obtain a GeoDataFrame (gdf) from a DataFrame (df). The geometry is obtained from the data available in the df:

Geom column
Lat and lng columns
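
The lat/lng fallback listed above can be sketched without any geometry library. This is only an illustration: the body of the PR's own helper (`_compute_geometry_from_latlng`) is not shown in this conversation, and the function name `points_wkt_from_latlng` is hypothetical. It emits WKT strings, where the x coordinate (longitude) comes first.

```python
def points_wkt_from_latlng(latitudes, longitudes):
    """Build one WKT point per (lat, lng) pair; WKT order is 'x y', i.e. lng then lat."""
    if len(latitudes) != len(longitudes):
        raise ValueError('latitude and longitude columns must have the same length')
    return ['POINT ({} {})'.format(lng, lat)
            for lat, lng in zip(latitudes, longitudes)]


# Same shape of data as in the PR description, with distinct values so the
# coordinate order is visible in the result.
data = {'latitude': [0, 10, 20], 'longitude': [5, 15, 25]}
geometry = points_wkt_from_latlng(data['latitude'], data['longitude'])
```

A real implementation would more likely build geometry objects than strings; the point of the sketch is the lng-before-lat ordering.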

So now you can render a df:

From local data

import pandas as pd
from cartoframes.viz import Map, Layer

data = {'latitude': [0, 10, 20, 30], 'longitude': [0, 10, 20, 30]}
df = pd.DataFrame.from_dict(data)

Map(Layer(df, 'width: 30'))

From CARTO

from cartoframes import Dataset
from cartoframes.viz import Map, Layer

ds = Dataset.from_table('populated_places')

df = ds.download(limit=10)

Map(Layer(df, 'width: 10'))

From Dataset created with a df

import pandas as pd
from cartoframes import Dataset
from cartoframes.viz import Map, Layer

data = {'latitude': [0, 10, 20, 30], 'longitude': [0, 10, 20, 30]}
df = pd.DataFrame.from_dict(data)

ds = Dataset.from_dataframe(df)

Map(Layer(ds, 'width: 30'))

alrocar

A couple of suggestions.

I'm missing some tests, especially in the Dataset class. It would be nice to have one example for each type of dataframe with a geometry column and check that get_geodataframe returns what we expect.

alrocar · 2019-06-06T11:49:54Z

cartoframes/dataset.py

@@ -331,7 +337,7 @@ def _get_remote_geom_type(self, query):
    def _get_local_geom_type(self, gdf):
        """Compute geom type of the local dataframe"""
        if len(gdf.geometry) > 0:
-            geom_type = gdf.geometry[0].type
+            geom_type = gdf.geometry.iloc[0].geom_type


What if the first geometry is null?

We have a utility method to get the first non-null value, _first_not_null_value, that could be handy for this.

alrocar · 2019-06-06T12:03:04Z

cartoframes/dataset.py

+        return df['longitude']
+    if 'lng' in df:
+        return df['lng']
+    if 'long' in df:


maybe we can add lon as well

alrocar · 2019-06-06T12:12:05Z

cartoframes/dataset.py

+        return df['wkb_geometry']
+    if 'geom' in df:
+        return df['geom']
+    if 'geojson' in df:


About geojson, if we support them, we should decode them properly right? I guess decode_geom does not support geojson.

I supposed that this was an alias for geometry (https://github.com/CartoDB/cartodb/blob/master/services/importer/lib/importer/georeferencer.rb#L14).
IMO we can remove this case since we have good support for geojson in CF

let's remove it then.

Jesus89 · 2019-06-06T12:48:33Z

Yes, I have in mind to add tests. I'm now wondering whether we could move the `get_query` and `get_geodataframe` utilities to another file (?) The Dataset file is pretty big now.

alrocar · 2019-06-06T12:58:40Z

I'm thinking now if we could move the `get_query` and `get_geodataframe` utilities to another file

LGTM

andy-esch · 2019-06-06T13:00:03Z

👀

andy-esch

Looks great -- only thing I want changed is an error helping the user to move forward if geoms aren't found. I'm happy to help re-write it if you don't like the message I wrote.

andy-esch · 2019-06-06T13:05:21Z

cartoframes/dataset.py

+            if lat_column is not None and lng_column is not None:
+                df['geometry'] = _compute_geometry_from_latlng(lat_column, lng_column)
+            else:
+                raise ValueError('DataFrame has no geographic data.')


How about something like this:
"No geographic data found. If a geometry exists, change the column name to the_geom or geom or ensure it is a GeoDataFrame with a valid geometry. If there are latitude/longitude columns, rename to lat/lng."

This way people have a route forward if they do indeed have geographic information.

Or better yet we can give them the full list of valid names/formats.

What about

No geographic data found. If a geometry exists, change the column name (geometry, the_geom, wkt_geometry, wkb_geometry, geom, wkt, wkb) or ensure it is a DataFrame with a valid geometry. If there are latitude/longitude columns, rename to (latitude, lat), (longitude, lng, lon, long).
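
A minimal sketch of how the detection and the message proposed above could fit together. `LAT_COLUMN_NAMES` and `LNG_COLUMN_NAMES` appear in the PR's diff; `GEOM_COLUMN_NAMES`, `find_column`, `no_geo_data_message`, and `detect_geo_columns` are hypothetical names, and the "geometry column first, lat/lng second" priority is an assumption.

```python
GEOM_COLUMN_NAMES = ['geometry', 'the_geom', 'wkt_geometry', 'wkb_geometry', 'geom', 'wkt', 'wkb']
LAT_COLUMN_NAMES = ['latitude', 'lat']
LNG_COLUMN_NAMES = ['longitude', 'lng', 'lon', 'long']


def find_column(columns, candidates):
    """Return the first candidate name present in columns, or None."""
    for name in candidates:
        if name in columns:
            return name
    return None


def no_geo_data_message():
    """Build the error text from the constants so the lists never drift apart."""
    return (
        'No geographic data found. '
        'If a geometry exists, change the column name ({0}) or ensure it is a '
        'DataFrame with a valid geometry. '
        'If there are latitude/longitude columns, rename to ({1}), ({2}).'
    ).format(
        ', '.join(GEOM_COLUMN_NAMES),
        ', '.join(LAT_COLUMN_NAMES),
        ', '.join(LNG_COLUMN_NAMES)
    )


def detect_geo_columns(columns):
    """Return ('geom', name) or ('latlng', lat_name, lng_name); raise ValueError otherwise."""
    geom = find_column(columns, GEOM_COLUMN_NAMES)
    if geom is not None:
        return ('geom', geom)
    lat = find_column(columns, LAT_COLUMN_NAMES)
    lng = find_column(columns, LNG_COLUMN_NAMES)
    if lat is not None and lng is not None:
        return ('latlng', lat, lng)
    raise ValueError(no_geo_data_message())
```

Because `name in df` on a pandas DataFrame checks column labels, the same `find_column` works with a DataFrame or a plain list of column names.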

andy-esch · 2019-06-06T13:07:41Z

cartoframes/viz/layer.py

@@ -104,7 +112,9 @@ def __init__(self,

 def _set_source(source, context):
    """Set a Source class from the input"""
-    if isinstance(source, (str, list, dict, Dataset)):
+    if isinstance(source, (str, list, dict, Dataset)) or \
+       isinstance(source, pandas.DataFrame) or \


Why not include pandas.DataFrame in the tuple above with the other types?
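
For reference, `isinstance` accepts a single tuple of types, so the chained `or` checks can collapse into one call. The sketch below is an assumption-laden illustration, not the PR's code: `Dataset` is a stand-in class, and `SOURCE_TYPES` and `is_valid_source` are hypothetical names. pandas is imported defensively so the snippet runs even where it is not installed.

```python
try:
    import pandas
    _DATAFRAME_TYPES = (pandas.DataFrame,)
except ImportError:
    # Without pandas there is simply no DataFrame type to accept.
    _DATAFRAME_TYPES = ()


class Dataset(object):
    """Stand-in for cartoframes' Dataset, used only in this sketch."""


# One flat tuple of every accepted source type.
SOURCE_TYPES = (str, list, dict, Dataset) + _DATAFRAME_TYPES


def is_valid_source(source):
    """True when source is one of the accepted input types."""
    return isinstance(source, SOURCE_TYPES)
```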

houndci-bot · 2019-06-06T14:21:55Z

cartoframes/data/utils.py

+                df['geometry'] = _compute_geometry_from_latlng(lat_column, lng_column)
+            else:
+                raise ValueError('''No geographic data found. '''
+                                 '''If a geometry exists, change the column name ({0}) or ensure it is a DataFrame with a valid geometry. '''


line too long (141 > 120 characters)

houndci-bot · 2019-06-07T09:05:45Z

cartoframes/data/__init__.py

+from __future__ import absolute_import
+
+from .dataset import Dataset, get_query, get_geodataframe
+from .dataset_info import DatasetInfo


'.dataset_info.DatasetInfo' imported but unused

houndci-bot · 2019-06-07T09:05:45Z

cartoframes/data/__init__.py

@@ -0,0 +1,8 @@
+from __future__ import absolute_import
+
+from .dataset import Dataset, get_query, get_geodataframe


'.dataset.get_geodataframe' imported but unused
'.dataset.get_query' imported but unused

Jesus89 · 2019-06-07T10:32:49Z

Hey, I have merged develop with the sharing viz changes, updated the data namespace, and added tests to cover the geometry generation. So it's ready for review 😄

simon-contreras-deel

Some questions and a reflection.

When I added this issue #704 I was thinking about starting in the from_dataframe method, removing one of _df or _gdf leaving only one way to work locally, avoiding having the same data twice.

This one is focused on the visualization part and, really, we are not using geopandas for anything on the "data" side, but anyway, it is a step forward.

simon-contreras-deel · 2019-06-07T13:06:35Z

cartoframes/data/dataset.py


+from .utils import decode_geometry, compute_query, compute_geodataframe, get_columns, \


The max length is 120. You can put the constant in the same line

I'm not used to 120, but I can change it

simon-contreras-deel · 2019-06-07T13:07:53Z

cartoframes/data/utils.py

+
+
+def compute_geodataframe(dataset):
+    if dataset._df is not None:


Suggested change

-    if dataset._df is not None:
+    if dataset.dataframe is not None:

Oh, I didn't notice this getter 👍

simon-contreras-deel · 2019-06-07T13:15:43Z

cartoframes/data/utils.py

+from ..columns import Column
+
+try:
+    import geopandas


Why do we need that? Why is it not a new dependency?

This is not a dependency. CARTOframes works without geopandas, but this library is required for some situations. This comes from the past versions, but you can create a ticket to discuss this in future versions of CF.

geopandas is a heavy dependency, so I like the idea of making it optional since it's only required for a couple of features. This is a common pattern. E.g., matplotlib is not required for pandas but when installed dataframe.plot() returns plots.

simon-contreras-deel · 2019-06-07T13:16:17Z

cartoframes/data/utils.py

+                                     ', '.join(LAT_COLUMN_NAMES),
+                                     ', '.join(LNG_COLUMN_NAMES)
+                                 ))
+        return geopandas.GeoDataFrame(df)


It will fail without geopandas dependency

Yep, we need to check the flag here. Thanks
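
The optional-dependency pattern discussed in this thread, including the flag check, could look roughly like this. The flag name `HAS_GEOPANDAS`, the function name `to_geodataframe`, and the error text are assumptions; the conversation only says that a flag exists and must be checked before calling `geopandas.GeoDataFrame`.

```python
try:
    import geopandas
    HAS_GEOPANDAS = True
except ImportError:
    # geopandas is optional: remember that it is missing instead of failing at import time.
    geopandas = None
    HAS_GEOPANDAS = False


def to_geodataframe(df):
    """Wrap df in a GeoDataFrame, failing with a clear message if geopandas is missing."""
    if not HAS_GEOPANDAS:
        raise ImportError(
            'geopandas is required to visualize a DataFrame; '
            'install it to use this feature.'
        )
    return geopandas.GeoDataFrame(df)
```

Checking the flag up front turns an obscure `AttributeError` on `None` into an actionable message for the user.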

simon-contreras-deel · 2019-06-07T13:17:22Z

cartoframes/data/utils.py

+
+def compute_geodataframe(dataset):
+    if dataset._df is not None:
+        df = dataset._df.copy()


Suggested change

-        df = dataset._df.copy()
+        df = dataset.dataframe.copy()

I don't like to copy it. It could be a performance problem (in memory). I am thinking about it.

If we don't copy the df to generate the gdf, the original df will be modified with an extra column. I'm not sure if we want this or not. Maybe it's a good thing to have the geometry column already there.

Yeah, I worry about performance too. E.g., if a user has a dataframe that's 1GB, which isn't uncommon if they have complex polygons.

What about a warning (or logging.info) that a new column is being added to the original dataframe?

We have removed the copy() for now, but I think we should revisit it in the future.

Regarding the logging info, this will be displayed always when there is a df visualization, but if you consider that it is OK to have a warning we can add it
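
The trade-off in this thread (copy and pay memory, or mutate and warn) can be illustrated with a plain dict of columns standing in for a DataFrame; assigning `df['geometry'] = ...` on a pandas DataFrame mutates it in the same way. The function name `add_geometry_column`, the `copy` parameter, and the warning text are hypothetical.

```python
import warnings


def add_geometry_column(table, copy=False):
    """Add a 'geometry' column built from 'lat'/'lng'.

    table maps column name -> list of values (a stand-in for a DataFrame).
    With copy=False the caller's object is mutated and a warning is emitted.
    """
    if copy:
        target = {name: list(values) for name, values in table.items()}
    else:
        target = table
        warnings.warn("A new 'geometry' column is being added to the original data.")
    target['geometry'] = ['POINT ({} {})'.format(lng, lat)
                          for lat, lng in zip(table['lat'], table['lng'])]
    return target


original = {'lat': [0, 10], 'lng': [5, 15]}

copied = add_geometry_column(original, copy=True)
columns_after_copy = sorted(original)  # the original is untouched

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    in_place = add_geometry_column(original)  # mutates `original`
columns_after_in_place = sorted(original)
```

With `copy=True` memory use roughly doubles for the duration of the call, which is the concern raised above for large polygon datasets.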

simon-contreras-deel · 2019-06-07T13:27:24Z

cartoframes/viz/kuviz.py

@@ -20,7 +20,7 @@ def __init__(self, id, url, name, privacy=PRIVACY_PUBLIC):

    @classmethod
    def create(cls, html, name, context=None, password=None):
-        from cartoframes.auth import _default_context
+        from ..auth import _default_context


Why do you prefer a relative path?

I think it is common practice to use relative imports within the project and absolute imports for your dependencies. It usually requires fewer characters and it decouples your implementation from the public API.

Agree, I like relative imports for project module imports

simon-contreras-deel · 2019-06-07T13:28:51Z

cartoframes/viz/layer.py

 from .legend import Legend
+from ..data import Dataset
+
+try:


Same comment as Dataset one

simon-contreras-deel · 2019-06-07T13:32:09Z

cartoframes/data/utils.py

@@ -0,0 +1,150 @@
+import time


I really like this file, making the rest cleaner

alrocar

Awesome ❤️

andy-esch

Looks great! Only one I'm concerned about is the _first_not_null_value function in the case where all are null.

andy-esch · 2019-06-07T15:12:42Z

cartoframes/data/dataset.py

-                            pass
-    return None
+def _first_not_null_value(array):
+    return array.loc[~array.isnull()].iloc[0]


What happens if they're all null? I think there will be an IndexError

True that. We can rename the method and use an if everywhere it is used. cc @alrocar
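
One way to avoid the `IndexError` is to return `None` when nothing qualifies, instead of indexing into a possibly empty selection. This is a sketch under assumptions, not the fix that was merged: `first_value` and `_is_null` are hypothetical names, and "null" here means `None` or a float NaN.

```python
import math


def _is_null(value):
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def first_value(values):
    """Return the first non-null item of any iterable, or None if there is none."""
    return next((v for v in values if not _is_null(v)), None)
```

Callers then need an explicit `if value is not None` check, which is the "use an if everywhere it is used" part of the reply above.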

andy-esch · 2019-06-07T15:15:06Z

cartoframes/data/utils.py

+from ..columns import Column
+
+try:
+    import geopandas


geopandas is a heavy dependency, so I like the idea of making it optional since it's only required for a couple of features. This is a common pattern. E.g., matplotlib is not required for pandas but when installed dataframe.plot() returns plots.

andy-esch · 2019-06-07T15:16:20Z

cartoframes/data/utils.py

+
+def compute_geodataframe(dataset):
+    if dataset._df is not None:
+        df = dataset._df.copy()


Yeah, I worry about performance too. E.g., if a user has a dataframe that's 1GB, which isn't uncommon if they have complex polygons.

andy-esch · 2019-06-07T15:16:57Z

cartoframes/data/utils.py

+
+def compute_geodataframe(dataset):
+    if dataset._df is not None:
+        df = dataset._df.copy()


What about a warning (or logging.info) that a new column is being added to the original dataframe?

andy-esch · 2019-06-07T15:19:56Z

cartoframes/viz/kuviz.py

@@ -20,7 +20,7 @@ def __init__(self, id, url, name, privacy=PRIVACY_PUBLIC):

    @classmethod
    def create(cls, html, name, context=None, password=None):
-        from cartoframes.auth import _default_context
+        from ..auth import _default_context


Agree, I like relative imports for project module imports

alrocar

One last thing. Awesome effort!

alrocar · 2019-06-10T09:55:30Z

cartoframes/data/dataset.py

@@ -504,7 +493,7 @@ def _get_geom_col_type(df):
        return None

    try:
-        geom = _decode_geom(_first_not_null_value(df, geom_col))
+        geom = decode_geometry(_first_value(df[geom_col]))
    except IndexError:


Will this method raise an IndexError in any case? Maybe we could add a test case ;)

This Exception won't be raised anymore. What case do you want to cover with tests?

If first value returns None I guess decode_geometry will raise an exception, right? (not an IndexError) should we cover that case with a test?

decode_geometry has an if internally, so the None case is ignored

🤦‍♂ I misread the code, you are right!

alrocar

🚀

andy-esch

💥 Looks great!

Jesus89 added 3 commits June 5, 2019 15:01

Support df in Source

d92dd2c

Implement df geom detection

eac3231

Detect df geometry (the_geom, latitude-longitude, lat-lng)

27231b2

Jesus89 requested a review from simon-contreras-deel June 6, 2019 11:22

simon-contreras-deel requested a review from alrocar June 6, 2019 11:25

Jesus89 added 2 commits June 6, 2019 13:47

Refactor df geometry detection

cbe47ed

Fix DF detection over GDF

8043439

alrocar suggested changes Jun 6, 2019


Jesus89 added 2 commits June 6, 2019 14:45

Use _first_not_null_value to detect the local geom type

2f6a711

Improve geom columns detection

faf0e5a

Jesus89 requested a review from andy-esch June 6, 2019 12:49

Fix _first_not_null_value to pass the tests

384b0de

andy-esch reviewed Jun 6, 2019


Jesus89 added 2 commits June 6, 2019 15:48

Move methods to data.utils

3580393

Refactor compute_gdf. Improve error message

a6f3932

houndci-bot reviewed Jun 6, 2019


Fix linter

1b0c749

Jesus89 mentioned this pull request Jun 6, 2019

Sharing visualizations #740

Merged

Jesus89 added 3 commits June 7, 2019 10:16

Merge branch 'develop' into 735-viz-dataframe

cbbe01b

Create data namespace

212d68c

Update source example

d7a0012

houndci-bot reviewed Jun 7, 2019


Add data.utils tests

546023a

Update NEWS

db5ceb8

simon-contreras-deel reviewed Jun 7, 2019


alrocar reviewed Jun 7, 2019


Use Dataset getters

5700414

andy-esch suggested changes Jun 7, 2019


Jesus89 force-pushed the 735-viz-dataframe branch 2 times, most recently from 7748bdf to 810fdf1 Compare June 7, 2019 16:08

Do not copy df on compute_geodataframe method

e41c6ab

Jesus89 force-pushed the 735-viz-dataframe branch from 810fdf1 to e41c6ab Compare June 10, 2019 08:49

Jesus89 added 2 commits June 10, 2019 11:16

Fix _first_not_null_value method

bf4ddc9

Update NEWS

8615523

Jesus89 mentioned this pull request Jun 10, 2019

743 infer legend prop #749

Merged

Jesus89 force-pushed the 735-viz-dataframe branch from e96aa31 to 6a8f69d Compare June 10, 2019 09:42

Add new_geom_column warning message

30b40aa

Jesus89 force-pushed the 735-viz-dataframe branch from 6a8f69d to 30b40aa Compare June 10, 2019 09:43

Jesus89 requested review from alrocar and andy-esch June 10, 2019 09:44

alrocar reviewed Jun 10, 2019


Refactor _get_geom_col_type method

5535017

alrocar approved these changes Jun 10, 2019


andy-esch approved these changes Jun 10, 2019


Jesus89 merged commit 54cf937 into develop Jun 10, 2019

Jesus89 deleted the 735-viz-dataframe branch June 10, 2019 13:28

This was referenced Jun 10, 2019

Rename cartoframes.datasets namespace #746

Closed

Add Dataframe visualization support #735

Closed

andy-esch mentioned this pull request Jun 13, 2019

Optimize Dataset df/gdf #704

Closed
