
CLN: Implement io modules as plugins #26804

Open
datapythonista opened this issue Jun 12, 2019 · 9 comments
Labels: Enhancement · IO (Data IO issues that don't fit into a more specific label) · Needs Discussion (Requires discussion from core team before further action) · Refactor (Internal refactoring of code)

Comments

@datapythonista
Member

(following the discussion in #26710)

Currently, most of the io functionality (csv, json, html, pickle, stata...) lives in pandas/io, but the to_* functions are implemented in pandas/core/[generic|frame|series].py, and the tests live in pandas/tests/io. Their dependencies are not explicit anywhere as far as I know; they are imported lazily, so the needed libraries are only reported once a function is used (some are listed in environment.yml, but not all).

I propose to move every io type to a directory in pandas/contrib/io that will include:

  • The read_* and to_* functions
  • The docstrings with the documentation
  • The tests
  • A file with the dependencies

So, an example structure could be:

pandas/contrib/io/stata/__init__.py
pandas/contrib/io/stata/reader.py
pandas/contrib/io/stata/writer.py
pandas/contrib/io/stata/tests/*
pandas/contrib/io/stata/dependencies.yml

To call the functionality we could simply have something like this in Series, DataFrame (also something similar for read_*), but other ideas welcome:

def __getattr__(self, name):
    if name.startswith('to_'):
        mod = importlib.import_module('pandas.contrib.io.{}'.format(name[3:]))
        # bind the writer to this object so df.to_csv(...) forwards the frame
        return functools.partial(mod.export_dataframe, self)
    raise AttributeError(name)
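As a runnable illustration of that dispatch (all names are hypothetical — pandas.contrib.io does not exist, so the stdlib json module stands in for a real io plugin here):

```python
import importlib

class FrameLike:
    """Minimal stand-in for DataFrame to show the lazy to_* dispatch.

    In the proposal, __getattr__ would import 'pandas.contrib.io.<fmt>';
    that package is hypothetical, so 'to_json' is mapped onto the stdlib
    json module purely to demonstrate the mechanics.
    """
    _io_modules = {"json": "json"}  # fmt -> importable module (illustrative)

    def __init__(self, data):
        self.data = data

    def __getattr__(self, name):
        if name.startswith("to_"):
            fmt = name[3:]  # 'to_json' -> 'json'
            try:
                mod = importlib.import_module(self._io_modules[fmt])
            except KeyError:
                raise AttributeError(name) from None
            # A real plugin would expose something like export_dataframe(df);
            # json.dumps plays that role here.
            return lambda: mod.dumps(self.data)
        raise AttributeError(name)

df = FrameLike([1, 2, 3])
print(df.to_json())  # prints "[1, 2, 3]"
```

An unknown format falls through to AttributeError, so normal attribute errors still behave as expected.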

I see several advantages here:

  • A clearer (more modular and more uniform) structure of the code (generic.py has more than 11k lines of code, a significant part of them io related; same for frame.py with 8k...)
  • We can better manage the dependencies (they would be explicit, and we wouldn't need lazy dependency checks if everything is imported lazily). From what I've seen, around two thirds of our optional dependencies are for IO modules.
  • We can explicitly decide in every build which io modules to test, avoiding all the skip_if_no checks that cause problems and let tests silently stop being run without anyone noticing.
  • Third party packages can be developed following a similar structure, so we can potentially add them to pandas if new formats become popular, or easily move io packages that we no longer consider worth maintaining ourselves out to third-party projects.

Not part of this proposal, but I think in the future we could also move other non-IO parts to contrib that would benefit from being decoupled, like plotting or the extension arrays (that's why I think contrib/io/ makes sense: we could have contrib/plotting... in the future).

CC: @pandas-dev/pandas-core

@datapythonista added the Refactor, IO, and Needs Discussion labels on Jun 12, 2019
@simonjayhawkins
Member

Generally agree.

Where would pandas\io\formats\format.py, which contains some common functionality, be located, and when would it be imported?

pandas\io\formats\format.py also contains the to_string() method; it may need to be split.

Some of this functionality, and NotebookFormatter from pandas\io\formats\html.py, are used for __repr__.

Again, pandas\io\formats\html.py could potentially be split into separate modules for the HTMLFormatter and NotebookFormatter, see #23820.

@datapythonista
Member Author

I was thinking more of the formats listed in http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html than of methods like to_string. I don't have a strong opinion about those, but I'd probably leave them as they are for now and make the decision after the rest are moved.

I agree the formats part should be split; I guess it's doable. Probably also other parts that can be reused by different io modules (including third-party modules).

@simonjayhawkins
Member

The to_ methods could benefit from a more unified API, and perhaps from using TableFormatter, DataFrameFormatter (and SeriesFormatter) as base classes for more of the to_ methods.

I think that for HTML (for notebooks) especially, a core implementation is needed, but also the ability to use 3rd party plugins for additional capability, like CDN-hosted DataTables.

Enhancements like to_markdown (#11052) could benefit from a unified API and a base class to extend, without being part of pandas itself.

@jreback
Contributor

jreback commented Jun 12, 2019

@datapythonista you are saying a lot of things, but aside from creating massive churn, what problem are you actually trying to solve here?

A clearer (more modular and more uniform) structure of the code (generic.py has more than 11k lines of code, a significant part are io related, same for frame.py with 8k...)

how does this proposal actually help here? The vast majority of generic.py code has nothing to do with this (< 10% at most); look at to_gbq and read_gbq; these wrappers actually take up very little space.

We can better manage the dependencies (they would be explicit, and we don't need to have lazy dependencies if everything is imported in a lazy way). Around two thirds of our optional dependencies are for IO modules for what I've seen.

again how does your proposal do anything for this?

(they would be explicit, and we don't need to have lazy dependencies if everything is imported in a lazy way)

not even sure what this means

In an easy way we can explicitly decide in every build which io modules we want to test, and avoid having all the skip_if_no that cause problems and tests stop being run without noticing
Third party packages can be developed following a similar structure, so we can potentially add them to pandas if new formats become popular, or we can easily move to a third party project io packages that we consider they're not worth maintaining ourselves anymore.

again how does your proposal actually address any of this

avoid having all the skip_if_no

and what would you replace this with? or are you proposing no testing?

@datapythonista

your general statements are more or less agreeable, but these lack any specific advantages I think. I was expecting you to say:

we should move all of the io routines to a public .io accessor to modularize them. I would actually be in favor of this, but your proposal is focused around the actual code layout / implementation and I fail to see much benefit here.


@datapythonista
Member Author

In generic.py it's around 10%, about 1,000 lines of code. If this doesn't sound like enough, let's continue after this issue by decoupling other things and making other parts more modular too; I do think it makes sense.

Regarding dependencies, with this proposal we should be able to have the imports in every IO module at the top of files as usual, rather than in _try_import or similar helpers. The same goes for pandas-internal imports currently done inside functions to avoid circular dependencies. Nothing in pandas should import pandas.contrib.io directly, only through the lazy machinery I propose.

If every module in pandas/contrib/io defines its own tests, we can in the CI (or locally) call things like pytest pandas/contrib/io/stata pandas/contrib/io/parquet and be explicit about what we're testing. This way, we don't need skip_if_no because we won't use the dependencies to decide what's being tested.

I'm also +1 on a DataFrame.io accessor, but I see it as somewhat unrelated to the code structure I propose here. I think with this proposal it should be much easier to implement, as a simple os.listdir('pandas/contrib/io') would return the list of everything that needs to be registered.
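A sketch of that discovery step (the directory layout is created in a temp dir here, since pandas/contrib/io is hypothetical):

```python
import tempfile
from pathlib import Path

# Build a throwaway layout mimicking the proposed pandas/contrib/io tree.
root = Path(tempfile.mkdtemp()) / "contrib" / "io"
for fmt in ("csv", "parquet", "stata"):
    (root / fmt).mkdir(parents=True)
    (root / fmt / "__init__.py").touch()

# Registration would iterate the directory instead of hard-coding formats:
formats = sorted(p.name for p in root.iterdir() if (p / "__init__.py").exists())
print(formats)  # prints ['csv', 'parquet', 'stata']
```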

The main advantage I see in this proposal is having a uniform and decoupled way of writing io modules. For me it makes a huge difference when maintaining those modules if:

  • It's immediate to see where 100% of the code, docs and tests of an io module live: e.g. .to_csv -> pandas/contrib/io/csv/*
  • There is a common structure of files/directories that all those modules use, so once it is learned, it's much easier to find where functions are
  • Since everything would be modular, we could make it easy to extend; having more or fewer IO modules is then not a question of whether "pandas supports X", but just of whether a module is included or needs to be installed. There are many examples of this: .to_pdf(), .to_markdown(), read_yahoo() (currently pandas-datareader, using a different API for the user), and also the modules we might discuss maintaining as third-party packages. Of course we can let maintainers of third-party modules use a different API (e.g. mylib.to_whatever(df)) or monkey-patch pandas, but I don't think that's how open source packages should interact with each other.

@simonjayhawkins
Member

If every module in pandas/contrib/io is defining its own tests, we can in the CI (or locally) call things like pytest pandas/contrib/io/stata pandas/contrib/io/parquet and be explicit on what we're testing. This way, we don't need skip_if_no because we won't use the dependencies to decide what's being tested.

we could probably do this now using pytest custom markers.
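For example (the marker name is illustrative, and it would also need registering via the `markers` ini option in setup.cfg or pytest.ini to avoid warnings):

```python
import pytest

# Tag io tests per format; CI can then select with `pytest -m stata`
# or deselect with `pytest -m "not stata"`.
@pytest.mark.stata
def test_stata_roundtrip():
    assert True
```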

@WillAyd
Member

WillAyd commented Jun 12, 2019 via email

@jbrockmendel
Member

where would pandas\io\formats\format.py which contains some common functionality be located and when would it be imported?

format.py being in io never felt right to me. Most of our classes rely on it for their __repr__ methods, throwing a wrench in an otherwise fairly clean dependency structure (I'm willing to ignore e.g. the dependency on io.stata, since DataFrame can be considered "complete" without to_stata in a way that it can't without __repr__).

If (part of?) format.py were moved to core, the improved dependency structure would be worth the churn IMO (per @jreback's point). What was left behind in io would be pretty close to what @datapythonista is suggesting.

@TomAugspurger
Contributor

I don't really have an opinion on the proposed reorganization of implementations / test files.

To call the functionality we could simply have something like this in Series, DataFrame (also something similar for read_*), but other ideas welcome:

I think I would be against any magic like this. I'm pretty happy with the read_parquet / DataFrame.to_parquet model: pandas has a very lightweight wrapper around the engine(s) with runtime imports, but the wrappers are implemented as normal Python functions.
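That model can be sketched as follows (names hypothetical; the stdlib json module stands in for an optional engine like pyarrow so the snippet runs anywhere):

```python
import importlib

def read_fake(text, engine="json"):
    """Light wrapper in the read_parquet style: a plain function whose
    optional engine is imported only at call time, so importing the
    library itself never requires the engine to be installed."""
    try:
        mod = importlib.import_module(engine)
    except ImportError as err:
        raise ImportError(f"the '{engine}' engine is required") from err
    return mod.loads(text)

print(read_fake("[1, 2]"))  # prints [1, 2]
```

Missing engines surface as a clear ImportError at call time, which is exactly the behavior of the existing wrappers.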
