DEPR: Move rarely used I/O connectors to third party modules #28409

datapythonista · 2019-09-12T11:45:43Z

Related to #26804 (and indirectly related to #15862, #15008 if we standardize I/O connectors as part of this)

The recent survey shows that several of our I/O connectors (Stata, SPSS, SAS and GBQ) are rarely used (from the chart I'd say around 3% of respondents said they used them):

While afaik all them are just wrappers around other libraries, and don't contain much code to maintain, I see few main immediate advantages:

Allow an ecosystem of third-party modules
Reduce the complexity and time of the CI
Not commit to the connector API

I/O ecosystem

While the number of widely used formats is quite limited, the number of possible I/O connectors is huge. Several new connectors have been requested in the past, like markdown (#11052), avro (#11752), docx (#22518), and there are surely plenty of specialized and internal formats that didn't make sense to propose adding to pandas, but that people would appreciate having and using the same standard API (pandas.read_* / DataFrame.to_*) we have. And there may also be alternative versions of the connectors we provide (see for example this lightning talk by @dutc dicussing about an unsafe pandas.read_csv: https://youtu.be/QkQ5HHEu1b4?t=1554).

CI

For the CI, those modules include dependencies that need to be downloaded for every build. We also have files in pandas/tests/io/data that eventually we may need to update. For GBQ in particular, we have to store the password for the account we use to test, and those tests fail from time to time for connectivity problems. For the clipboard (not proposing to move to a third-party yet, but is a good candidate too), we require to emulate an X system in our tests (need to check if we finally fixed the tests, but they weren't running for a while because of problems on the CI set up).

And of course our test suite is taking some extra time to test all these connectors.

API changes

Being part of our public API, I think we need to commit to the API of those connectors, in the same way as we do with the rest of our API. But, for whatever reason it may make sense to discontinue some of these connectors (e.g. Google releasing a replacement for GBQ...), or make important changes to their parameters, and probably it makes sense to have different release cycles for them.

This also applies to connectors not yet in pandas, but that we may include if we keep the policy of adding to our code base the connectors for all the formats / services we consider relevant.

I know timing is not great, but because of this API commitment, I'd personally remove Stata, SPSS, SAS and GBQ before pandas 1.0.

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2019-09-12T13:56:00Z

I don't think Stata or SAS are simple wrappers around 3rd party libraries. We're the ones maintaining them right now (primarily with individuals who are experts in that section of the code base, but aren't necessarily pandas maintainers). I'm concerned that moving them out of pandas will make them more likely to be abandoned.

As for CI, I assume we would be testing these as downstream dependencies? So we would still be downloading them and their dependencies (just not running the full set of tests).

datapythonista · 2019-09-12T14:04:51Z

Thanks for the feedback, that's good to know.

Not sure if any of us is using the connectors to Stata or SAS, but personally I don't see as a negative thing that the connectors are abandoned if we don't maintain it ourselves. I guess the users of those are mainly corporations; if they are not interested in maintaining or funding the maintenance of those, not sure if it's worth that we do it for free.

I was thinking that third party connectors would implement and run their own tests. Do we test any downstream dependency currently?

jreback · 2019-09-12T14:29:31Z

I am not sure of the goal here. the existing data io connectors don’t require much maintenance by maintainers not is CI an issue at all here (the d/l of the reps is immaterial)

I also don’t think that the api is really that big of a deal.

So IIIC you want to enable adding in new data io connectors not being in pandas proper (or at least an api to do this)?

+1 on that and if that were available then you could easily move lesser used libraries out

BUT this seems like the cart before the horse; we can’t move these libraries out w/o an option for external installation; this is not possible for sas/ STATA as they all code in pandas. (gbq is completely external)

so if someone wants to move these out and maintain them great

datapythonista · 2019-09-12T14:41:29Z

Thanks for the feedback. And yes, enable adding new data connectors (and eventually move the connectors that make sense out) is exactly what I wanted to propose. Sorry the focus was too much on moving out the connectors, wanted to complement what was already discussed in #26804 (which wasn't well focused either).

If people is happy with the idea of being able to extend pandas with new connectors, then I'll open a PR to see how that would look like, and we can better discuss over the code.

TomAugspurger · 2019-09-12T17:23:15Z

I guess the users of those are mainly corporations

I don't think that's accurate, at least not for Stata.

Do we test any downstream dependency currently?

Yeah (see test_downstream.py)

I'm still a bit unclear on the goal here that would be achieved by moving this out of pandas. I don't think that wanting to add new IO formats (via the connector stuff) is necessarily related. Even if we had a connector API we could choose to maintain these here vs. another repo.

So, in my mind, it comes down to the cost of maintenance vs. the benefit to users. I think Stata, SAS, SPSS, at least pass that test. Their maintenance cost is relatively small, especially when people like @kshedden, @bashtage, @cbrnr, etc. are around to fix issues when they do come up. And I think the benefit of having these is decently large.

Something like msgpack didn't meet that test. That code was hard to work with, and since it's a wire format we were essentially defining, it couldn't handle the new extension types. The existing file formats like Stata don't really share that problem.

cbrnr · 2019-09-12T17:33:33Z

I'd definitely keep Stata, SAS, and SPSS in. Personally, I know a lot of people who are switching from one of these packages to Python/pandas, and having these readers greatly simplifies life for them.

bashtage · 2019-09-12T17:35:44Z

Stata, SPSS, SAS

IMO these types of connectors, especially for import (but why not export if it is done), are really essential to the success of pandas. They make it possible to move to Pandas in a gradual way and to interface with a large part of the statistics community.

Stata has 0 external deps FWIW.

bashtage · 2019-09-12T17:37:24Z

I also think the survey results are skewed towards a particular type, and so are challenging to interpret as broad guidance. I was not aware of the survey until I saw some results on reddit.

datapythonista · 2019-09-12T22:45:19Z

Thanks all for the feedback.

Just to be clear, there is no discussion on removing those connectors from pandas, and I have no doubt they add value to many users (that 3%, if true, represents at least 30,000 users).

The question is whether it'd be better to have these connectors in separate repos. And consequently, users would have to add another dependency (e.g conda install pandas-stata) and possibly an extra import in their code (I guess with entry points this could not be required).

That would also allow that the people maintaining those could have write access to those repos.

If we allow third-party connectors (sorry I mixed both things here), I think it could make things simpler and more efficient to have some connectors in separate projects. The line of what should be in pandas and what not will always be ambiguous, in the same way it's somehow arbitrary what Python has in the standard library, and what lives in third-party libraries.

TomAugspurger · 2019-09-13T21:11:20Z

I'm not sure how to square these two

Just to be clear, there is no discussion on removing those connectors from pandas

users would have to add another dependency (e.g conda install pandas-stata) and possibly an extra import in their code

The questions of "where is this developed" and "how do I install this" can be separated. We're able to

develop things in one repo and install them in one package (today's setup)
split things into separate repos and package as one
split things into separate reops and package separately

Which of those (if any) are you proposing?

datapythonista · 2019-09-13T21:25:05Z

Good point, I'm proposing 3.

But probably worth clarifying my priorities:

Allow an ecosystem of I/O connectors (without monkey patching, and with the same API than the ones we ship).
Make our connectors follow a standard API (e.g. subclass a base class and implement certain methods), the same than for third-party connectors
Move GBQ to a third party library, because the complexity it adds to the CI
Move out the other connectors that make sense

To me, 1 and 2 are quite important, they will empower users and simplify our code. 3 would be really nice, but not as important. I think 4 makes a lot of sense, but if people who actually maintain those less used connectors prefer to have them in pandas, I think it's not optimal, but it's surely all right.

jreback · 2019-09-13T21:31:13Z

Good point, I'm proposing 3.

the main issue is maintainers (who is going to maintain these) and infrastructure

it’s not trivial to build wheels in all platforms

But probably worth clarifying my priorities:

Allow an ecosystem of I/O connectors (without monkey patching, and with the same API than the ones we ship).

Make our connectors follow a standard API (e.g. subclass a base class and implement certain methods), the same than for third-party connectors

Move GBQ to a third party library, because the complexity it adds to the C

this is already a 3rd party library;we just test

Move out the other connectors that make sense

To me, 1 and 2 are quite important, they will empower users and simplify our code. 3 would be really nice, but not as important. I think 4 makes a lot of sense, but if people who actually maintain those less used connectors prefer to have them in pandas, I think it's not optimal, but it's surely all right.

datapythonista · 2019-09-13T21:59:29Z

this is already a 3rd party library;we just test

Thanks for the info, I totally missed that. I thought it was a wrapper around a generic gbq library, I see now that it's a third-party connector already, and we basically have the documentation.

Do you think it's worth testing it in every pandas PR? I guess there is a reason, but seems like adding a nightly build to pandas_gbq that tests with the latest pandas (release candidates, or may be master) should be enough. I think having it well decoupled (moving the tests there) would make things simpler, not only for the CI.

jreback · 2019-09-13T22:02:30Z

wouldn’t object to having a nightly build test
but otoh it’s simpler to have everything in a single well tested build(s)

datapythonista · 2020-09-24T09:56:12Z

I think the preference is to keep all I/O connectors in the pandas main repo. Closing.

datapythonista added IO Data IO issues that don't fit into a more specific label IO Google Deprecate Functionality to remove in pandas IO Stata read_stata, to_stata Needs Discussion Requires discussion from core team before further action IO SAS SAS: read_sas labels Sep 12, 2019

jbrockmendel removed IO Google IO SAS SAS: read_sas IO Stata read_stata, to_stata labels Dec 19, 2019

datapythonista closed this as completed Sep 24, 2020

lithomas1 mentioned this issue Feb 16, 2023

PDEP-1 first revision (scope) #51417

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DEPR: Move rarely used I/O connectors to third party modules #28409

DEPR: Move rarely used I/O connectors to third party modules #28409

datapythonista commented Sep 12, 2019

TomAugspurger commented Sep 12, 2019

datapythonista commented Sep 12, 2019

jreback commented Sep 12, 2019

datapythonista commented Sep 12, 2019

TomAugspurger commented Sep 12, 2019

cbrnr commented Sep 12, 2019

bashtage commented Sep 12, 2019

bashtage commented Sep 12, 2019

datapythonista commented Sep 12, 2019

TomAugspurger commented Sep 13, 2019

datapythonista commented Sep 13, 2019

jreback commented Sep 13, 2019

datapythonista commented Sep 13, 2019

jreback commented Sep 13, 2019

datapythonista commented Sep 24, 2020

DEPR: Move rarely used I/O connectors to third party modules #28409

DEPR: Move rarely used I/O connectors to third party modules #28409

Comments

datapythonista commented Sep 12, 2019

TomAugspurger commented Sep 12, 2019

datapythonista commented Sep 12, 2019

jreback commented Sep 12, 2019

datapythonista commented Sep 12, 2019

TomAugspurger commented Sep 12, 2019

cbrnr commented Sep 12, 2019

bashtage commented Sep 12, 2019

bashtage commented Sep 12, 2019

datapythonista commented Sep 12, 2019

TomAugspurger commented Sep 13, 2019

datapythonista commented Sep 13, 2019

jreback commented Sep 13, 2019

datapythonista commented Sep 13, 2019

jreback commented Sep 13, 2019

datapythonista commented Sep 24, 2020