Slim down dependencies #1313

Open
jsignell opened this issue Mar 3, 2020 · 14 comments

Comments

@jsignell
Member

jsignell commented Mar 3, 2020

I was chatting with @jorisvandenbossche at the dask developer meeting last week and he mentioned that gdal is only required for fiona which handles the IO parts of geopandas.

Since gdal is known to be a pain to install, it'd be nice if geopandas were split into two conda packages geopandas-core and geopandas. geopandas-core would include all the dependencies except fiona, and geopandas would include fiona, geopandas-core and all the current dependencies.

For pip installs there could be more subsets of dependencies, but the full install would be pip install geopandas[complete].

This pattern has been established in dask and other projects (such as geoviews).
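
For the pip side, a minimal sketch of how such dependency subsets might be declared as setuptools extras. The grouping and the extra names (`io`, `plot`, `complete`) are assumptions for illustration, not geopandas' actual packaging:

```python
# Hypothetical dependency groups; names and groupings are illustrative only.
CORE_REQUIRES = ["pandas", "shapely", "pyproj"]

EXTRAS_REQUIRE = {
    "io": ["fiona"],
    "plot": ["matplotlib", "descartes"],
}
# "complete" is the union of every other extra, similar to `dask[complete]`.
EXTRAS_REQUIRE["complete"] = sorted(
    {pkg for group in EXTRAS_REQUIRE.values() for pkg in group}
)

# These would then be passed to setuptools, e.g.
# setup(install_requires=CORE_REQUIRES, extras_require=EXTRAS_REQUIRE)
```

With a layout like this, a plain install would pull in only the core list, and the heavier packages would be opt-in through the extras.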

@jsignell
Member Author

jsignell commented Mar 3, 2020

I just noticed that there is a discussion of this in #1261. I'll leave this open for visibility unless people would prefer to close.

@jorisvandenbossche
Member

Thanks for opening the issue (one item I can strike out on my to do list :-))

There is some discussion about this also happening in #1261

@jorisvandenbossche
Member

For conda, I think we can have a geopandas-base or geopandas-core that depends on: pandas, shapely, pyproj, rtree.
And then the "full" geopandas can additionally depend on fiona, matplotlib-base + descartes, mapclassify (in decreasing order of certainty that it should be included).

Also rtree is optional, but since spatial join, overlay and clip are rather essential operations, I would prefer to keep this even in the core installation. But it can certainly be considered as well.

(in theory even pyproj could be made optional, since a GeoDataFrame does not require to have a crs ...)


For pip I am less sure. I would find it slightly annoying that pip install geopandas no longer gives the best user experience (since all online instructions say that, and not pip install geopandas[all/complete])

@knaaptime

it seems i'm in the minority here but, IMHO dropping these dependencies would be counterproductive for the vast majority of users.

As @jorisvandenbossche describes above, I'm already frustrated that descartes doesn't come with geopandas. Moving toward this model means that geopandas-base would end up as a package for "spatial analysis" that

  1. can't read or write geospatial data
  2. can't plot geospatial data
  3. can't reproject geospatial data to another representation
  4. can't use a spatial index for analytical operations

What case do users have for geopandas in that setting? Why not just use pandas/shapely directly or spatialpandas?

I get that as a general principle, fewer dependencies makes for a preferable alternative. But in the case of packages like pyproj or rtree that add essentially 0 overhead, why remove features users are likely to need in the majority of cases? I also get that fiona is a different beast, but i'd still argue that gdal is [still] a fundamental dependency for geospatial analysis and that if it's difficult to get installed, then it makes more sense to focus effort on easing that process (or swapping in pygeos) rather than removing its functionality
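
To make the four features listed above concrete, here is a rough sketch of checking which of them an environment could support. The feature-to-package mapping follows how the packages are described in this thread and is an assumption, and `available_features` is a hypothetical helper, not part of geopandas:

```python
from importlib.util import find_spec

# Illustrative mapping of features to the packages this thread associates
# with them; not an official list.
FEATURE_MODULES = {
    "read/write files": "fiona",
    "plotting": "matplotlib",
    "reprojection": "pyproj",
    "spatial index": "rtree",
}


def available_features(modules=FEATURE_MODULES):
    """Return {feature: bool} for whether each backing module can be found."""
    return {
        feature: find_spec(module) is not None
        for feature, module in modules.items()
    }
```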

@jorisvandenbossche
Member

Note that if you are using conda and are used to running conda install geopandas, nothing will change. Or it will actually improve (by also installing matplotlib and descartes). For conda, getting the slimmed-down version without dependencies requires explicit action.
That's why I like it for conda, but as mentioned above, much less so for pip (where getting all packages would require explicit action).

Also note that I only proposed to not include fiona in the core dependencies, and only mentioned rtree and pyproj as theoretical options (I am myself also not in favor of dropping those, as mentioned above, unless someone comes with good reasons).
The reason for fiona is that fiona/GDAL is notoriously hard to install and takes a lot of space (which is much less the case for other dependencies). Specifically on cluster nodes / cloud instances, you might want to avoid installing it if you don't need it. And right now geopandas basically makes it impossible to not install it.

@knaaptime

that's fair, of course. Also, just to be clear, i didn't mean to come in here and start an argument, just to raise an alternative view, and i was curious about the counterpoints :)

@ljwolf
Member

ljwolf commented Mar 3, 2020

One alternative would be to define the geopandas package on PyPI/conda to grab all the dependencies and make geopandas-base a separate package.

  1. a new package called geopandas-base is added to PyPI/conda that drops the GEOS content,
  2. the package on PyPI called "geopandas" remains "full geopandas" and imports from geopandas-base all the non-GEOS code.

With this, neither the PyPI nor conda target would change, and restricted users could use conda/pip install geopandas-base and program on top of geopandas-base? Is the issue that we want to support programming on top of geopandas-base as if it were geopandas?

@martinfleis
Member

I vote for @ljwolf's proposal, assuming he means GDAL, not GEOS.

I would like to see conda install geopandas to install most of the dependencies, probably with the exception of mapclassify as it comes with the whole scikit-learn (but that might be resolved on mapclassify side to have it optional, then I would be happy to have it included). Meaning that for simple plotting we would not have to install descartes and Matplotlib manually.

Then we should have geopandas-base coming with the bare minimum, probably even without rtree and pyproj. If advanced users have an issue with some of the C deps, they can just install geopandas-base and only those parts they require.

It would require just a small change to the codebase, the rest is the question of packaging, i.e. new recipe on conda-forge. Not sure about PyPI, how would that work because we would have to alter setup.py and requirements.

@jorisvandenbossche
Member

Yes, that is what I was intending for conda (and we have examples of matplotlib or dask that do this from a single feedstock (https://github.com/conda-forge/matplotlib-feedstock/blob/master/recipe/meta.yaml), or multiple feedstocks (https://github.com/conda-forge/dask-core-feedstock/blob/master/recipe/meta.yaml)). So there we have examples.

For pip I am less sure (eg matplotlib and dask don't do something similar on PyPI).

@martinfleis
Member

Shall we start with conda and then see if there is a need to try the same on PyPI later based on the response?

@knaaptime

I guess this is why i'm confused... If you have conda at your disposal, then installing gdal is trivial, no? The trouble comes from installing fiona from pip and handling gdal manually.

so if the root of the problem is that gdal/fiona are difficult to install with pip, why is it useful to create another conda package without gdal?

@snowman2
Contributor

snowman2 commented Mar 4, 2020

i vote for calling it geospandas - and have it only depend on pygeos/shapely (optionally rtree).

@martinfleis
Member

An update on this: the conda recipe currently offers a minimal geopandas-base flavour and geopandas with most of the dependencies.

We could still do the same for pip in some way.

@jorisvandenbossche
Member

so if the root of the problem is that gdal/fiona are difficult to install with pip, why is it useful to create another conda package without gdal?

To still answer this (very belatedly), I see two main reasons: 1) even with conda, gdal/fiona is still the package that can give problems from time to time (given the many C dependencies, it most easily gives channel conflicts, or some temporary error if one of the packages gets updated, or ..), 2) more importantly, it gives a large install size, and if you don't need gdal/fiona, the geopandas-base package gives you the option to get a lighter env (which can be useful in cases where size matters, eg in containers, AWS Lambda, ..)
