Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: Move rarely used I/O connectors to third party modules #28409

Closed
datapythonista opened this issue Sep 12, 2019 · 15 comments
Closed

DEPR: Move rarely used I/O connectors to third party modules #28409

datapythonista opened this issue Sep 12, 2019 · 15 comments
Labels
Deprecate Functionality to remove in pandas IO Data IO issues that don't fit into a more specific label Needs Discussion Requires discussion from core team before further action

Comments

@datapythonista
Copy link
Member

Related to #26804 (and indirectly related to #15862, #15008 if we standardize I/O connectors as part of this)

The recent survey shows that several of our I/O connectors (Stata, SPSS, SAS and GBQ) are rarely used (from the chart I'd say around 3% of respondents said they used them):

pandas_survey_io

While afaik all them are just wrappers around other libraries, and don't contain much code to maintain, I see few main immediate advantages:

  • Allow an ecosystem of third-party modules
  • Reduce the complexity and time of the CI
  • Not commit to the connector API

I/O ecosystem

While the number of widely used formats is quite limited, the number of possible I/O connectors is huge. Several new connectors have been requested in the past, like markdown (#11052), avro (#11752), docx (#22518), and there are surely plenty of specialized and internal formats that didn't make sense to propose adding to pandas, but that people would appreciate having and using the same standard API (pandas.read_* / DataFrame.to_*) we have. And there may also be alternative versions of the connectors we provide (see for example this lightning talk by @dutc dicussing about an unsafe pandas.read_csv: https://youtu.be/QkQ5HHEu1b4?t=1554).

CI

For the CI, those modules include dependencies that need to be downloaded for every build. We also have files in pandas/tests/io/data that eventually we may need to update. For GBQ in particular, we have to store the password for the account we use to test, and those tests fail from time to time for connectivity problems. For the clipboard (not proposing to move to a third-party yet, but is a good candidate too), we require to emulate an X system in our tests (need to check if we finally fixed the tests, but they weren't running for a while because of problems on the CI set up).

And of course our test suite is taking some extra time to test all these connectors.

API changes

Being part of our public API, I think we need to commit to the API of those connectors, in the same way as we do with the rest of our API. But, for whatever reason it may make sense to discontinue some of these connectors (e.g. Google releasing a replacement for GBQ...), or make important changes to their parameters, and probably it makes sense to have different release cycles for them.

This also applies to connectors not yet in pandas, but that we may include if we keep the policy of adding to our code base the connectors for all the formats / services we consider relevant.

I know timing is not great, but because of this API commitment, I'd personally remove Stata, SPSS, SAS and GBQ before pandas 1.0.

@datapythonista datapythonista added IO Data IO issues that don't fit into a more specific label IO Google Deprecate Functionality to remove in pandas IO Stata read_stata, to_stata Needs Discussion Requires discussion from core team before further action IO SAS SAS: read_sas labels Sep 12, 2019
@TomAugspurger
Copy link
Contributor

I don't think Stata or SAS are simple wrappers around 3rd party libraries. We're the ones maintaining them right now (primarily with individuals who are experts in that section of the code base, but aren't necessarily pandas maintainers). I'm concerned that moving them out of pandas will make them more likely to be abandoned.

As for CI, I assume we would be testing these as downstream dependencies? So we would still be downloading them and their dependencies (just not running the full set of tests).

@datapythonista
Copy link
Member Author

Thanks for the feedback, that's good to know.

Not sure if any of us is using the connectors to Stata or SAS, but personally I don't see as a negative thing that the connectors are abandoned if we don't maintain it ourselves. I guess the users of those are mainly corporations; if they are not interested in maintaining or funding the maintenance of those, not sure if it's worth that we do it for free.

I was thinking that third party connectors would implement and run their own tests. Do we test any downstream dependency currently?

@jreback
Copy link
Contributor

jreback commented Sep 12, 2019

I am not sure of the goal here. the existing data io connectors don’t require much maintenance by maintainers not is CI an issue at all here (the d/l of the reps is immaterial)

I also don’t think that the api is really that big of a deal.

So IIIC you want to enable adding in new data io connectors not being in pandas proper (or at least an api to do this)?

+1 on that and if that were available then you could easily move lesser used libraries out

BUT this seems like the cart before the horse; we can’t move these libraries out w/o an option for external installation; this is not possible for sas/ STATA as they all code in pandas. (gbq is completely external)

so if someone wants to move these out and maintain them great

@datapythonista
Copy link
Member Author

Thanks for the feedback. And yes, enable adding new data connectors (and eventually move the connectors that make sense out) is exactly what I wanted to propose. Sorry the focus was too much on moving out the connectors, wanted to complement what was already discussed in #26804 (which wasn't well focused either).

If people is happy with the idea of being able to extend pandas with new connectors, then I'll open a PR to see how that would look like, and we can better discuss over the code.

@TomAugspurger
Copy link
Contributor

I guess the users of those are mainly corporations

I don't think that's accurate, at least not for Stata.

Do we test any downstream dependency currently?

Yeah (see test_downstream.py)


I'm still a bit unclear on the goal here that would be achieved by moving this out of pandas. I don't think that wanting to add new IO formats (via the connector stuff) is necessarily related. Even if we had a connector API we could choose to maintain these here vs. another repo.

So, in my mind, it comes down to the cost of maintenance vs. the benefit to users. I think Stata, SAS, SPSS, at least pass that test. Their maintenance cost is relatively small, especially when people like @kshedden, @bashtage, @cbrnr, etc. are around to fix issues when they do come up. And I think the benefit of having these is decently large.

Something like msgpack didn't meet that test. That code was hard to work with, and since it's a wire format we were essentially defining, it couldn't handle the new extension types. The existing file formats like Stata don't really share that problem.

@cbrnr
Copy link
Contributor

cbrnr commented Sep 12, 2019

I'd definitely keep Stata, SAS, and SPSS in. Personally, I know a lot of people who are switching from one of these packages to Python/pandas, and having these readers greatly simplifies life for them.

@bashtage
Copy link
Contributor

Stata, SPSS, SAS

IMO these types of connectors, especially for import (but why not export if it is done), are really essential to the success of pandas. They make it possible to move to Pandas in a gradual way and to interface with a large part of the statistics community.

Stata has 0 external deps FWIW.

@bashtage
Copy link
Contributor

I also think the survey results are skewed towards a particular type, and so are challenging to interpret as broad guidance. I was not aware of the survey until I saw some results on reddit.

@datapythonista
Copy link
Member Author

Thanks all for the feedback.

Just to be clear, there is no discussion on removing those connectors from pandas, and I have no doubt they add value to many users (that 3%, if true, represents at least 30,000 users).

The question is whether it'd be better to have these connectors in separate repos. And consequently, users would have to add another dependency (e.g conda install pandas-stata) and possibly an extra import in their code (I guess with entry points this could not be required).

That would also allow that the people maintaining those could have write access to those repos.

If we allow third-party connectors (sorry I mixed both things here), I think it could make things simpler and more efficient to have some connectors in separate projects. The line of what should be in pandas and what not will always be ambiguous, in the same way it's somehow arbitrary what Python has in the standard library, and what lives in third-party libraries.

@TomAugspurger
Copy link
Contributor

I'm not sure how to square these two

Just to be clear, there is no discussion on removing those connectors from pandas

users would have to add another dependency (e.g conda install pandas-stata) and possibly an extra import in their code

The questions of "where is this developed" and "how do I install this" can be separated. We're able to

  1. develop things in one repo and install them in one package (today's setup)
  2. split things into separate repos and package as one
  3. split things into separate reops and package separately

Which of those (if any) are you proposing?

@datapythonista
Copy link
Member Author

Good point, I'm proposing 3.

But probably worth clarifying my priorities:

  1. Allow an ecosystem of I/O connectors (without monkey patching, and with the same API than the ones we ship).
  2. Make our connectors follow a standard API (e.g. subclass a base class and implement certain methods), the same than for third-party connectors
  3. Move GBQ to a third party library, because the complexity it adds to the CI
  4. Move out the other connectors that make sense

To me, 1 and 2 are quite important, they will empower users and simplify our code. 3 would be really nice, but not as important. I think 4 makes a lot of sense, but if people who actually maintain those less used connectors prefer to have them in pandas, I think it's not optimal, but it's surely all right.

@jreback
Copy link
Contributor

jreback commented Sep 13, 2019

Good point, I'm proposing 3.

the main issue is maintainers (who is going to maintain these) and infrastructure

it’s not trivial to build wheels in all platforms

But probably worth clarifying my priorities:

  1. Allow an ecosystem of I/O connectors (without monkey patching, and with the same API than the ones we ship).
  2. Make our connectors follow a standard API (e.g. subclass a base class and implement certain methods), the same than for third-party connectors
  3. Move GBQ to a third party library, because the complexity it adds to the C

this is already a 3rd party library;we just test

  1. Move out the other connectors that make sense

To me, 1 and 2 are quite important, they will empower users and simplify our code. 3 would be really nice, but not as important. I think 4 makes a lot of sense, but if people who actually maintain those less used connectors prefer to have them in pandas, I think it's not optimal, but it's surely all right.

@datapythonista
Copy link
Member Author

this is already a 3rd party library;we just test

Thanks for the info, I totally missed that. I thought it was a wrapper around a generic gbq library, I see now that it's a third-party connector already, and we basically have the documentation.

Do you think it's worth testing it in every pandas PR? I guess there is a reason, but seems like adding a nightly build to pandas_gbq that tests with the latest pandas (release candidates, or may be master) should be enough. I think having it well decoupled (moving the tests there) would make things simpler, not only for the CI.

@jreback
Copy link
Contributor

jreback commented Sep 13, 2019

wouldn’t object to having a nightly build test
but otoh it’s simpler to have everything in a single well tested build(s)

@jbrockmendel jbrockmendel removed IO Google IO SAS SAS: read_sas IO Stata read_stata, to_stata labels Dec 19, 2019
@datapythonista
Copy link
Member Author

I think the preference is to keep all I/O connectors in the pandas main repo. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas IO Data IO issues that don't fit into a more specific label Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

6 participants