
CLN: Implement io modules as plugins #26804

Open
datapythonista opened this issue Jun 12, 2019 · 9 comments
Labels: Enhancement · IO (Data IO issues that don't fit into a more specific label) · Needs Discussion (Requires discussion from core team before further action) · Refactor (Internal refactoring of code)

Comments

@datapythonista
Member

(following the discussion in #26710)

Currently, most of the io functionality (csv, json, html, pickle, stata...) lives in pandas/io, but the to_* functions are implemented in pandas/core/[generic|frame|series].py, and the tests live in pandas/tests/io. Their dependencies are not explicit anywhere as far as I know; they are imported lazily, so the needed libraries are only reported once a function is used (some are listed in environment.yml, but not all).

I propose to move every io type to a directory in pandas/contrib/io that will include:

  • The read_* and to_* functions
  • The docstrings with the documentation
  • The tests
  • A file with the dependencies

So, an example structure could be:

pandas/contrib/io/stata/__init__.py
pandas/contrib/io/stata/reader.py
pandas/contrib/io/stata/writer.py
pandas/contrib/io/stata/tests/*
pandas/contrib/io/stata/dependencies.yml

To call the functionality we could simply have something like this in Series, DataFrame (also something similar for read_*), but other ideas welcome:

def __getattr__(self, name):
    if name.startswith('to_'):
        mod = importlib.import_module('pandas.contrib.io.{}'.format(name[3:]))
        # bind the writer to this object so df.to_csv(...) forwards the frame
        return functools.partial(mod.export_dataframe, self)
    raise AttributeError(name)
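As a runnable illustration of that dispatch (all names are hypothetical — pandas.contrib.io does not exist, so the stdlib json module stands in for a real io plugin here):

```python
import importlib

class FrameLike:
    """Minimal stand-in for DataFrame to show the lazy to_* dispatch.

    In the proposal, __getattr__ would import 'pandas.contrib.io.<fmt>';
    that package is hypothetical, so 'to_json' is mapped onto the stdlib
    json module purely to demonstrate the mechanics.
    """
    _io_modules = {"json": "json"}  # fmt -> importable module (illustrative)

    def __init__(self, data):
        self.data = data

    def __getattr__(self, name):
        if name.startswith("to_"):
            fmt = name[3:]  # 'to_json' -> 'json'
            try:
                mod = importlib.import_module(self._io_modules[fmt])
            except KeyError:
                raise AttributeError(name) from None
            # A real plugin would expose something like export_dataframe(df);
            # json.dumps plays that role here.
            return lambda: mod.dumps(self.data)
        raise AttributeError(name)

df = FrameLike([1, 2, 3])
print(df.to_json())  # prints "[1, 2, 3]"
```

An unknown format falls through to AttributeError, so normal attribute errors still behave as expected.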

I see several advantages here:

  • A clearer (more modular and more uniform) structure of the code (generic.py has more than 11k lines of code, a significant part of them io related; same for frame.py with 8k...)
  • We can better manage the dependencies (they would be explicit, and we wouldn't need lazy dependency checks if everything is imported lazily). From what I've seen, around two thirds of our optional dependencies are for IO modules.
  • We can explicitly decide in every build which io modules to test, avoiding all the skip_if_no checks that cause problems and let tests silently stop being run without anyone noticing.
  • Third party packages can be developed following a similar structure, so we can potentially add them to pandas if new formats become popular, or easily move io packages that we no longer consider worth maintaining ourselves out to third-party projects.

Not part of this proposal, but I think in the future we could also move other non-IO parts to contrib that would benefit from being decoupled, like plotting or the extension arrays (that's why I think contrib/io/ makes sense: we could have contrib/plotting... in the future).

CC: @pandas-dev/pandas-core

@datapythonista added the Refactor, IO, and Needs Discussion labels on Jun 12, 2019
@simonjayhawkins
Member

Generally agree.

Where would pandas\io\formats\format.py, which contains some common functionality, be located, and when would it be imported?

pandas\io\formats\format.py also contains the to_string() method; it may need to be split.

Some of this functionality, and NotebookFormatter from pandas\io\formats\html.py, are used for __repr__.

Again, pandas\io\formats\html.py could potentially be split into separate modules for the HTMLFormatter and NotebookFormatter, see #23820.

@datapythonista
Member Author

I was thinking more of the formats listed in http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html than of methods like to_string. I don't have a strong opinion about those, but I'd probably leave them as they are for now and make the decision after the rest are moved.

I agree the formats part should be split; I guess it's doable. Probably also other parts that can be reused by different io modules (including third-party modules).

@simonjayhawkins
Member

The to_ methods could benefit from a more unified API, and perhaps from using TableFormatter, DataFrameFormatter (and SeriesFormatter) as base classes for more of the to_ methods.

I think that for HTML (for notebooks) especially, a core implementation is needed, but also the ability to use 3rd party plugins for additional capability, like CDN-hosted DataTables.

Enhancements like to_markdown (#11052) could benefit from a unified API and a base class to extend, without being part of pandas itself.

@jreback
Contributor

jreback commented Jun 12, 2019

@datapythonista you are saying a lot of things, but aside from creating massive churn, what problem are you actually trying to solve here?

A clearer (more modular and more uniform) structure of the code (generic.py has more than 11k lines of code, a significant part are io related, same for frame.py with 8k...)

how does this proposal actually help here? The vast majority of generic.py code has nothing to do with this (< 10% at most); look at to_gbq and read_gbq; these wrappers actually take up very little space.

We can better manage the dependencies (they would be explicit, and we don't need to have lazy dependencies if everything is imported in a lazy way). Around two thirds of our optional dependencies are for IO modules for what I've seen.

again how does your proposal do anything for this?

(they would be explicit, and we don't need to have lazy dependencies if everything is imported in a lazy way)

not even sure what this means

In an easy way we can explicitly decide in every build which io modules we want to test, and avoid having all the skip_if_no that cause problems and tests stop being run without noticing
Third party packages can be developed following a similar structure, so we can potentially add them to pandas if new formats become popular, or we can easily move to a third party project io packages that we consider they're not worth maintaining ourselves anymore.

again how does your proposal actually address any of this

avoid having all the skip_if_no

and what would you replace this with? or are you proposing no testing?

@datapythonista

your general statements are more or less agreeable, but these lack any specific advantages I think. I was expecting you to say:

we should move all of the io routines to a public .io accessor to modularize them. I would actually be in favor of this, but your proposal is focused around the actual code layout / implementation and I fail to see much benefit here.


@datapythonista
Member Author

In generic.py it's around 10%, about 1,000 lines of code. If this doesn't sound like enough, let's continue after this issue by decoupling other things and making other parts more modular too; I do think it makes sense.

Regarding dependencies, with this proposal we should be able to have the imports in every IO module at the top of files as usual, rather than in _try_import or similar helpers. The same goes for pandas-internal imports currently done inside functions to avoid circular dependencies. Nothing in pandas should import pandas.contrib.io directly, only through the lazy machinery I propose.

If every module in pandas/contrib/io defines its own tests, we can in the CI (or locally) call things like pytest pandas/contrib/io/stata pandas/contrib/io/parquet and be explicit about what we're testing. This way, we don't need skip_if_no because we won't use the dependencies to decide what's being tested.

I'm also +1 on a DataFrame.io accessor, but I see it as somewhat unrelated to the code structure I propose here. I think with this proposal it should be much easier to implement, as a simple os.listdir('pandas/contrib/io') would return the list of everything that needs to be registered.
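A sketch of that discovery step (the directory layout is created in a temp dir here, since pandas/contrib/io is hypothetical):

```python
import tempfile
from pathlib import Path

# Build a throwaway layout mimicking the proposed pandas/contrib/io tree.
root = Path(tempfile.mkdtemp()) / "contrib" / "io"
for fmt in ("csv", "parquet", "stata"):
    (root / fmt).mkdir(parents=True)
    (root / fmt / "__init__.py").touch()

# Registration would iterate the directory instead of hard-coding formats:
formats = sorted(p.name for p in root.iterdir() if (p / "__init__.py").exists())
print(formats)  # prints ['csv', 'parquet', 'stata']
```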

The main advantage I see in this proposal is having a uniform and decoupled way of writing io modules. For me it makes a huge difference when maintaining those modules if:

  • It's immediate to see where 100% of the code, docs and tests of an io module live: e.g. .to_csv -> pandas/contrib/io/csv/*
  • There is a common structure of files/directories that all those modules use, so once it is learned, it's much easier to find where functions are
  • Since everything would be modular, we could make it easy to extend; having more or fewer IO modules is then not a question of whether "pandas supports X", but just of whether a module is included or needs to be installed. There are many examples of this: .to_pdf(), .to_markdown(), read_yahoo() (currently pandas-datareader, using a different API for the user), and also the modules we might discuss maintaining as third-party packages. Of course we can let maintainers of third-party modules use a different API (e.g. mylib.to_whatever(df)) or monkey-patch pandas, but I don't think that's how open source packages should interact with each other.

@simonjayhawkins
Member

If every module in pandas/contrib/io is defining its own tests, we can in the CI (or locally) call things like pytest pandas/contrib/io/stata pandas/contrib/io/parquet and be explicit on what we're testing. This way, we don't need skip_if_no because we won't use the dependencies to decide what's being tested.

we could probably do this now using pytest custom markers.
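For example (the marker name is illustrative, and it would also need registering via the `markers` ini option in setup.cfg or pytest.ini to avoid warnings):

```python
import pytest

# Tag io tests per format; CI can then select with `pytest -m stata`
# or deselect with `pytest -m "not stata"`.
@pytest.mark.stata
def test_stata_roundtrip():
    assert True
```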

@WillAyd
Member

WillAyd commented Jun 12, 2019 via email

@jbrockmendel
Member

where would pandas\io\formats\format.py which contains some common functionality be located and when would it be imported?

format.py being in io never felt right to me. Most of our classes rely on it for their __repr__ methods, throwing a wrench in an otherwise fairly clean dependency structure (I'm willing to ignore e.g. the dependency on io.stata, since DataFrame can be considered "complete" without to_stata in a way that it can't without __repr__).

If (part of?) format.py were moved to core, the improved dependency structure would be worth the churn IMO (per @jreback's point). What was left behind in io would be pretty close to what @datapythonista is suggesting.

@TomAugspurger
Contributor

I don't really have an opinion on the proposed reorganization of implementations / test files.

To call the functionality we could simply have something like this in Series, DataFrame (also something similar for read_*), but other ideas welcome:

I think I would be against any magic like this. I'm pretty happy with the read_parquet / DataFrame.to_parquet model: pandas has a very lightweight wrapper around the engine(s) with runtime imports, but the wrappers are implemented as normal Python functions.
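That model can be sketched as follows (names hypothetical; the stdlib json module stands in for an optional engine like pyarrow so the snippet runs anywhere):

```python
import importlib

def read_fake(text, engine="json"):
    """Light wrapper in the read_parquet style: a plain function whose
    optional engine is imported only at call time, so importing the
    library itself never requires the engine to be installed."""
    try:
        mod = importlib.import_module(engine)
    except ImportError as err:
        raise ImportError(f"the '{engine}' engine is required") from err
    return mod.loads(text)

print(read_fake("[1, 2]"))  # prints [1, 2]
```

Missing engines surface as a clear ImportError at call time, which is exactly the behavior of the existing wrappers.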
