Pandas Enhancement Proposals? #28568

Closed
h-vetinari opened this issue Sep 22, 2019 · 6 comments · Fixed by #47444
Labels
Admin (Administrative tasks related to the pandas project); Needs Discussion (Requires discussion from core team before further action)

Comments

@h-vetinari
Contributor

h-vetinari commented Sep 22, 2019

I was wondering if there has been discussion about having dedicated RFC / Proposals for larger changes to pandas?

For example, after a rather heated discussion in #24046, @wesm asked:

I question whether GitHub issues was the appropriate venue for this discussion compared with some form of RFC / design document.

I wanted to come back to this and maybe write such a design document, and therefore wanted to check what the thoughts of the core team are about this? Needless to say, the obvious candidates to model this after would be Python's PEPs or NumPy's NEPs, but maybe there could/should be pandas-specific departures.

I think this could benefit several high-impact design questions (e.g. around extension arrays or release plans). Actually, I have 2-3 other bigger PRs that eventually stalled in design discussion (and a few ideas of similar magnitude), where I could imagine that such a vessel would help move the discussion forward.

@TomAugspurger
Contributor

We have a new section in our roadmap that I think covers this: https://dev.pandas.io/docs/development/roadmap.html#roadmap-evolution

@h-vetinari
Contributor Author

That proposal may then be submitted as a GitHub issue

In the case of #24046, it was exactly this point that was in question, and why I opened this issue here. I admit I hadn't seen that section about the roadmap/evolution, but the question remains (to me): has some more formal process been considered? I think pandas is big enough to warrant it.

The way I see it, there's a flood of issues and PRs, and with scarce core-dev resources, any given issue or PR will mostly receive only as much attention as is absolutely necessary (which is fine; no judgement about that). But this means that for larger topics, it's very hard to fully lay out a proposal and discuss its merits: people jump into threads (that get longer and longer) at different points, and it's way too easy to get caught up in discussions about semantics or other side issues. Eventually, it's very hard to keep track of what the status is, and people spend their scarce time elsewhere.

A PEP-like format could take a longer approach, where feedback is requested on the whole proposal, and not just the latest comment on the GH issue. Maybe I'm too naive about how this would work out, but it seems to work reasonably well for PEPs/NEPs.

Don't get me wrong, life would go on without PAEPs, but I'd be more hesitant about investing my time in (re-)running the gauntlet in the format that has failed (from my POV) several times already.

@TomAugspurger
Contributor

You can check the discussion on the roadmap PR where the evolution process was discussed a bit. I'm not sure what all was proposed as alternatives to design issues.

@simonjayhawkins simonjayhawkins added the Needs Discussion Requires discussion from core team before further action label Sep 24, 2019
@toobaz
Member

toobaz commented Oct 10, 2019

PEPs/NEPs are not a bad thing in absolute terms, clearly, but I don't see how they would increase the free time of devs, which is the real scarce thing. If anything, a new procedure will increase the burden.

@h-vetinari You start from the assumption that #24046 would have benefited from more attention from the devs, but it really got a lot.

Something I would be happy to recommend would be, in any API-related issue, a comment (e.g., the second of the discussion) which is always updated to reflect the status of the current proposal(s). But I wouldn't make the process more formal than this.

I perfectly understand the frustration when you feel you have wasted time. The way I (slowly learned to) try to avoid this in my pandas contributions which affect user behaviour is i) first discuss, then propose a PR ii) try to proceed one step at a time (first settle one point, then another). There are plenty of issues on GitHub that require limited discussion; some instead require a lot - and notice they are not necessarily more important, they are just more complicated.

@h-vetinari
Contributor Author

Sorry for the long time to respond.

@TomAugspurger
I looked in the PR for the roadmap (#27478), and some of these issues were discussed there as well:

@jorisvandenbossche: Regarding the "Roadmap evolution" you added, one concern I have is about using github issues for this, as it is very easy to miss an issue in the large flood of github notifications.

@TomAugspurger: Indeed. I'm not sure what's best here. I don't want to implement too much overhead, especially as we're still learning what process works best for pandas. As a compromise, perhaps we require that the mailing list also be notified?

@jorisvandenbossche: Yep, I also don't really know what would work best. Some random ideas: [...]

I brought the question of pandas PEPs/NEPs up as well in #24046, where there has just been a resolution of a long discussion thread - which could be considered documentation. However, I think far-reaching or controversial decisions should aim for a higher standard (following the PEP model):

  • Present a coherent proposal and arguments in favour (as a single document, not scattered over many issues and comments)
  • Discuss rejected alternatives and reasons why they are inferior to the proposed solution
  • Document the way the final decision was taken

@h-vetinari
Contributor Author

h-vetinari commented Oct 10, 2019

@toobaz
Thanks for the response. I was preparing mine at the same time and didn't see that yours preceded mine.

@toobaz: @h-vetinari You start from the assumption that #24046 would have benefited from more attention from the devs, but it really got a lot.

This is (emphatically) not my assumption. My assumption is that there was in fact so much discussion that people lost track of what the current status was. I believe that discussing a design document would have focused the discussion, rather than stretching it out to the point where people gave up participating.

@toobaz: PEPs/NEPs are not a bad thing in absolute terms, clearly, but I don't see how they would increase the free time of devs, which is the real scarce thing. If anything, a new procedure will increase the burden.

Dev time is precious, I agree (and would also benefit IMO; see above). But it would not necessarily be more difficult than reviewing and merging a PR that introduces the proposal (way before starting the implementation). And for PRs from devs where all others are in agreement anyway, you could just skip the proposal process.

@toobaz: I perfectly understand the frustration when you feel you have wasted time.

I really want to insist that the idea behind the proposal process goes far beyond (indeed precedes) the decision of #24046, and is not connected to frustration on my part (actually, I'm more happy that it's resolved than unhappy at the outcome).

@toobaz: The way I (slowly learned to) try to avoid this in my pandas contributions which affect user behaviour is i) first discuss, then propose a PR ii) try to proceed one step at a time (first settle one point, then another).

I've had to learn this as well (still am), but I have tried to have those discussions. It's precisely those issues where the discussion stalls or is too complicated / dispersed that I want to address. Some examples that affected me:

  • adding an inverse to .unique
  • changing Series.unique to return a Series
  • cleaning up and unifying dtype promotion internals
  • cleaning up and unifying update/combine_first/coalesce
  • cleaning up and unifying groupby return types

But I really don't believe I'm special in this regard; I think the process would also help other corners of the API, e.g. around serialisation and other interfaces.

In short: The more intricate the API implications, the harder it is to discuss in GitHub comments (because there are usually too many things to consider at the same time, or the comments/threads get ridiculously long, or both). That does not mean that the given change does not have merit though, just that it's (likely) too hard to discuss in a thread format.

5 participants