Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Dataset df/gdf #704

Closed
simon-contreras-deel opened this issue May 28, 2019 · 6 comments · Fixed by #829
Closed

Optimize Dataset df/gdf #704

simon-contreras-deel opened this issue May 28, 2019 · 6 comments · Fixed by #829
Assignees
Labels
version: 1.0 Mandatory or nice to have features for 1.0 release

Comments

@simon-contreras-deel
Copy link
Contributor

simon-contreras-deel commented May 28, 2019

We have 2 properties to “support” DataFrame and geoDataFrame. Furthermore, we are using DataFrames and casting to GeoDataFrame (compute_geodataframe method) when we want to create a CARTO VL map. In the end, we have the same thing twice and in some cases, we have the same thing even twice in memory. In the geojson case, we are already converting it into a GeoDataFrame.

The proposal is to use only GeoDataFrame by default, having only one property and one creation method that tries to create a GeoDataFrame from the beginning:

  • create a Dataset from a GeoDataFrame ---> everything done
  • create a Dataset from a geojson ---> same as now (from_geojson method)
  • create a Dataset from a DataFrame with geometry ---> compute_geodataframe
  • create a Dataset from a DataFrame without geometry ---> pandas.DataFrame

With this PR https://github.com/CartoDB/cartoframes/pull/741/files we already have a big part of this work done.

We would need to solve the case when a user creates the Dataset from DataFrame without geometry and wants to add a geometry after that. Probably, we will need to create a specific method for that.

@simon-contreras-deel simon-contreras-deel added the version: 1.0 Mandatory or nice to have features for 1.0 release label May 28, 2019
@andy-esch
Copy link
Contributor

I believe this was handled in #741

@simon-contreras-deel
Copy link
Contributor Author

simon-contreras-deel commented Jun 13, 2019

When I created this issue, I was thinking about starting in the from_dataframe method, removing one of _df or _gdf leaving only one way to work locally, avoiding having the same data twice, and probably doing things after downloading a table.

I think this one still makes sense, we will need to do fewer things thanks to #741, but I am not sure about the scope

@Jesus89 Jesus89 added this to the [1.0] Refactor Auth/Data API milestone Jun 24, 2019
@Jesus89 Jesus89 changed the title Adding support for geoDataFrames in Dataset Optimize Dataset df/gdf Jun 25, 2019
@simon-contreras-deel
Copy link
Contributor Author

ei @andy-esch & @alrocar how do you see adding the geopandas dependecy?

We could handle this issue from 2 sides:

  • the one explained in the issue description that needs geopandas
  • do something between the current approach and the description one, trying to avoid using 2 objects with the same information when the user creates a map, but not converting to geodataframe from the beginning (doing it only when required: in the map creation)

We have talked about it in the past, and the main reason was the size of the geopandas package. But:

  • I suppose we could add geopandas and remove pandas (it is a geopandas dependecy)
  • we have a lot of places doing:
try:
    import geopandas
    HAS_GEOPANDAS = True
except ImportError:
    HAS_GEOPANDAS = False

@alrocar
Copy link
Contributor

alrocar commented Jul 4, 2019

+1 to remove internal duplication of _df and _gdf since both behave the same.

I suppose we could add geopandas and remove pandas (it is a geopandas dependecy)

I don't have a strong opinion about this, so I prefer to listen to Andy's thoughts.

@simon-contreras-deel
Copy link
Contributor Author

(I have deleted my previous comment)

Yes, we are creating 2 different objects. The second one, the GeoDataFrame one is not saved in the Dataset. So we are using 2x memory.

Showing the code deeper, I am afraid using only one object, we could transform the dataframe in the Map render process

@simon-contreras-deel
Copy link
Contributor Author

I am going to focus on the scenario without geopandas installed. We could advance more if we find it useful in the future

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
version: 1.0 Mandatory or nice to have features for 1.0 release
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants