[WIP] Return basic statistics to Base #27375
Stuff like |
Yeah I'm fine with removing Edit: Added a commit that removes |
Because these are generally unuseful to laypeople. Standard deviation, on the other hand, is common enough to be considered "general purpose".
OK, I might agree with you in this particular case, but what are the criteria that we (as a project) have set for what gets put in Base, what gets moved to Stdlib, and what is the responsibility of a third-party package? It seems arbitrary to me that |
I would favor having |
I tend to agree with @StefanKarpinski here that retaining I don't understand though for necessarily including something in I think having common things in Base (that everyone would want, and I do think |
Sharing my point of view about this "slimming effect" that we've been experiencing as users and package developers. I think the real issue here is the lack of a formal specification of what is "in" and what is "out", and most importantly a proper organization of what is left out. Right now many things that were slimmed out of Base got a somewhat random home in a relatively huge package. Take I would never complain as a user/package developer to import a separate package to have The devs did a good job purging out things that are not essential to the language, and we are now struggling to get them in a place that makes sense and is minimal. Coming up with this set of minimal domain-specific packages is the challenge. |
This is not my experience at all, and it makes me sad that it seems to be an accepted way of using Julia. Again, I am vehemently opposed to the “npm-ification” of Julia. Perhaps I’m an outlier, or maybe I’m reading too much into this statement, but I have a fundamental aversion to it. |
We could start with some simple rules of thumb like "Base" packages can't have dependencies in their REQUIRE file. This is the current list of dependencies in
As package developers, we have little control of dependency growth unless we establish some conventions like the above. I propose to call a package "Base" if it only depends on |
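The rule of thumb above (a "Base" package has no dependencies in its REQUIRE file) can be checked mechanically. The following is a minimal sketch in Python, chosen only as a neutral illustration language. It assumes the usual REQUIRE layout of one package per line, optional version bounds, `#` comments, and optional `@platform` prefixes. The function names and the example file contents are hypothetical and are not the actual REQUIRE of any package mentioned in this thread.

```python
def third_party_requirements(require_text):
    """Return the package names listed in a REQUIRE file, excluding `julia` itself."""
    packages = []
    for raw_line in require_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        # Skip platform tags such as "@osx"; the first remaining field is the name.
        fields = [field for field in line.split() if not field.startswith("@")]
        if not fields:
            continue
        name = fields[0]
        if name != "julia":
            packages.append(name)
    return packages


def is_base_like(require_text):
    """A package is "Base"-like under this rule if REQUIRE lists nothing but julia."""
    return not third_party_requirements(require_text)


# Hypothetical REQUIRE contents, for illustration only.
EXAMPLE_REQUIRE = """\
julia 0.6
DataStructures
SortingAlgorithms  # sorting helpers
Missings
"""

if __name__ == "__main__":
    print(third_party_requirements(EXAMPLE_REQUIRE))
    print(is_base_like(EXAMPLE_REQUIRE))
```

A check like this only looks at direct dependencies; it says nothing about what those dependencies pull in themselves.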
I have no idea what |
Missings.jl is just a Compat-like package for 0.6 and not needed on 0.7. DataStructures is only used in one place and that dependency could probably go away. Finally, I think SortingAlgorithms is worth adding to the stdlib. Overall StatsBase isn't that big and it's pure-Julia. See #27152 (comment) for a list of features it includes. I'm not opposed to having |
If you want objective criteria for what should be in Base, the closest I can come is "as little as possible". It should contain only functions it would be hard to imagine writing any julia code without. Anything else is just arguing about what is "commonly used" enough. It's fine if people want to have those debates; just be aware that's what you're signing up for. |
I think a lot of the contention here is coming from the fact that |
From a developer perspective, I favor moving away from |
@sbromberger I think calling this npm-ification of Julia is a bit unfair. Julia packages tend to include larger chunks of functionality. On the dozens of packages, I find myself at least wanting @dpsanders I was wondering what Whether something is in |
So, since this discussion came out of a gripe I made in slack, perhaps I can help refocus the discussion. Right now, the following seems to have occurred, resulting in an undesirable situation (at least from my perspective) - perhaps I've misunderstood part of it.
It's the combination of 3 - 5 that is causing me grief. We've spent a TON of time and effort trying to reduce the number of dependencies that LightGraphs and the LG ecosystem have. This was originally "reduce the number of packages that get installed by The move of functions from Base to stdlib makes this a bit more complex, since we now can't use "number of import/using statements" as a metric, but rather, "number of packages in REQUIRE, plus their dependents", since the move to stdlib increases the former but doesn't really represent new dependencies. So what we've done is reimagined our "ecosystem fragility" as a function of "what are we depending on that's not core Julia?" This is important to us (me) because I am seeing a trend over the past 3 years of abandoned / unmaintained packages, and we've been bitten by this directly. In some cases it's just because the package is mature and there's not much left to do; in others, the maintainer has moved on to other things. But there is a higher risk that a third-party package will stagnate than that a stdlib package will, so our preference is for the latter over the former given that we've got to change our thinking about Base. Thus, we're now faced with a decision:
So, from our perspective, the initial problem has been resolved by doing something else entirely; but the issue still remains that in the future our strong preference is to prefer Base / stdlib over third party wherever possible. Our developer guidelines (fourth bullet) need to be updated but the essence still holds, as in order to keep LightGraphs as a "core" library for graph analysis, we need to make sure it's not difficult for end users to use, but just as importantly, for developers to incorporate, adapt, and extend. Julia Core team, please keep this in mind as you make further decisions about what lives in Base, what lives in stdlib, and what gets moved out to third-party, and the process by which any migration happens. |
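The "ecosystem fragility" metric described in this comment (packages in REQUIRE plus everything they depend on, not counting core Julia) amounts to a transitive closure over the dependency graph with Base/stdlib packages filtered out. Here is a minimal sketch in Python, used only as a neutral illustration language. The function name, the package names, and the dependency graph are all hypothetical and are not taken from LightGraphs or any real registry.

```python
def non_core_dependencies(package, requires, core):
    """Return every package outside `core` that `package` depends on,
    directly or transitively, according to the `requires` mapping."""
    seen = set()
    stack = [package]
    while stack:
        current = stack.pop()
        for dep in requires.get(current, ()):
            # Core (Base/stdlib) packages do not count toward the metric,
            # and each third-party package is counted once.
            if dep in core or dep in seen or dep == package:
                continue
            seen.add(dep)
            stack.append(dep)
    return seen


# Hypothetical dependency graph: package -> packages listed in its REQUIRE.
REQUIRES = {
    "GraphPkg": ["LinearAlgebra", "StatsPkg"],
    "StatsPkg": ["SortPkg", "Random"],
    "SortPkg": [],
}
# Hypothetical set of packages treated as core Julia.
CORE = {"LinearAlgebra", "Random"}

if __name__ == "__main__":
    deps = non_core_dependencies("GraphPkg", REQUIRES, CORE)
    print(sorted(deps), len(deps))
```

Under this metric, moving a function from Base to a stdlib package leaves the count unchanged, while moving it to a third-party package adds that package and all of its own non-core dependencies.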
From a user perspective, my test for Base exports would be: Am I supposed to read the manual for this function before starting to code julia? If not, then it wants to be in a separate namespace. But I agree that JuliaLang should be responsible for maintaining/curating basic necessities like standard deviation; that is, NPM left-pad is a catastrophe to learn from, and a fracturing of dependencies is the worst thing that could happen. But if there is a large majority of users that can get away without knowing of a function's existence, then it only pollutes their namespace and disorganizes the manual. I would be entirely happy if these things moved to a submodule, like |
I think everyone agrees that Apart from that, yes, I do acknowledge the larger philosophy of burden on package writers, but perhaps we can focus the conversation here on what is best to solve the current issue - being the specific functions in this PR. I really also don't like how folks are bringing up NPM here - as I think it does a disservice to other open source developers (after all it is very easy to sit in one's armchair and criticize things), and it tends to over-simplify the issues we are trying to sort out. |
The immediate issue from LightGraphs' perspective has already been resolved; that is, even if you choose to do nothing, we have worked around the change.
I think this criticism is unfair. NPM is a classic case of a new approach to dependency management that has seen some very strong drawbacks in implementation (left-pad being one of them; reliance on a central repository being another; fragmentation of that repository being a third). There are lots of advantages to an NPM-like system for Node developers, but I don't think it carries to Julia developers. This is not a criticism of NPM without context, which is what I gather you mean by "armchair criticism"; it is a plea to the Julia team not to create a repository structure that has all the well-known shortcomings of NPM without some significant offsetting benefits. I'm happy to discuss elsewhere if it's a distraction to the current issue. |
I believe with Pkg3 stdlib packages do need to be listed in Project.toml, so your steps 3-5 of first adding and then removing a dependency should not occur. |
I would like to point out that we are not moving each function into its own package or splitting things without reason. Historically, we shoved a lot of stuff in Base, before we had a well functioning package manager. I do agree with the general principles pointed out. I also personally had a similar reaction when we first started this process, but quickly realized it is actually better to group functionality in packages where developers vested in that functionality can own it and can move faster than base. |
How about we keep |
Because it's inconsistent with |
I was personally on the fence about moving |
I was under the impression that it was impossible at this point to move things from Base to Stdlib directly without going through a third-party package. Is this wrong? |
I would be onboard with |
I would like to clarify (as a user and follower of the developments) that the initial goal of being a language for scientific computing has changed along the way. Nowadays, it is incorrect to give Julia this label. We can either accept the new wave of generality and leave all the statistical functions outside of Base for consistency, or regain the title and export a set of names that anyone doing scientific work would expect to have coming from numpy, octave, matlab, etc. If all these statistical functions are collected coherently in a package |
@juliohm, you are not the arbiter of what does or does not make something a scientific computing language and that kind of rhetoric does not help make your point. |
Sorry for the noise, I am just sharing my feelings about these changes. Won't do it again. |
I discussed the thread with a co-worker and here is a summary of some points from the conversation:
|
Those are all valid points. The only argument being that |
...where you have to type As long as we're going to have mean, median, and quantile in Base, I won't fight including |
Moving |
Your input and opinion are welcomed. If you have a solid argument for or against something, please make it. Please do not make technical choices that you disagree with into "historical disasters" or epic existential threats to the language (because it is no longer a true scientific language). That kind of rhetoric actively hinders having a reasonable, civilized debate about what clearly is a subject where different people, many of whom have been actively involved in scientific computing for a long time, can have divergent opinions. By painting the situation in terms of the people who want Julia to be a scientific language (those who agree with you), versus those who don't (those who disagree), you are implicitly saying that all of the people who disagree with you are not real computational scientists and don't care about scientific computing. In the inner products thread there were many posts where you similarly stated or implied that anyone who disagreed with you was either not a real mathematician or didn't care about correctness (amusingly enough when arguing with Steven Johnson). That's simply not a good stance to approach a debate from. Assume that we are all reasonable people who want Julia to be the best language for scientific computing that it can be (and great for other things too), and that if you make a good case for why your way is better, we will listen to it. |
Thank you for the wise words @StefanKarpinski, I will try to organize my thoughts better in the future. Sometimes, however, I feel that there is reluctance from core devs to think neutrally about new ideas proposed by users and package developers. I am not the first to complain about this as you know, and there are many cases online of users blogging about these problems, which are real. That thread with inner products is a good example where a very decent idea was almost killed from the beginning without clear justification, and I would argue that it unrolled in that long discussion just because of the feedback and comments that I got with no scientific basis, and just personal taste. It felt to me like the idea was being evaluated with a mindset like "He is not from the pack, why do we have to listen to him?" By all means, we are all here with a shared goal to have a good language, and to design it to the best extent possible. More impartial evaluation of new issues contributed by external users would make things a lot better. |
My two cents on the above: these misunderstandings usually come from a lack of sufficient nuance. It is not the first (nor will it be the last) time that someone has turned a decision point into a "dooming the language" scenario, or that a core developer has shut a discussion down without explaining why the language / framing is disruptive and counterproductive (which can seem unjust / harsh and drive away a potential contributor). However, in most cases the nuance comes afterwards (as in this case) and clarifies the matter, improving the community environment. That is inherent in any system made of human beings, and hopefully it continues to improve. Sometimes I have had issues closed before I had time to address follow-ups that led to actual issues being addressed, but I totally get why some responses might be a bit rushed (e.g., closing this PR/issue while the discussion has kept evolving in spite of being closed), given developers' need to move through issues. |
I can't help but notice some irony here: some people object to adding a dependency on StatsBase, but their solution to that is to force every julia program to depend on some statistics functions via Base. |
The reason this PR is closed is because the discussion for moving everything back from StatsBase is not open. The only thing we are considering is |
Final vote from my part is that |
I think @Nosferican's split is a good one. The only thing we need to get there is to merge |
Yes, I'd be up with having the |
Kind of weird that |
We still have |
Part of the problem is that |
Implementation-wise, it would be silly to have |
As somebody who almost never uses statistics (and had to check only to find out that |
I don't personally love the move to Finally, as to the points made about lean vs. fat standard libraries: is copying functions into a package to avoid a dependency something that happens more often in these more modular environments (e.g. npm)? That seems so scary to me from a reuse perspective, and from a code idiom perspective. There seem to be some tough trade-offs between giving the language a built-in feel, with enough verbs in the standard library that it has some reach into basic packages, and having too fat a base distribution, which might hamper some domain applications (those that require small install footprints). Also, having all the documentation distributed across packages is a real pain point, as it makes everything feel less integrated. |
There’s a choice here:
Refusing to do either and insisting that the only acceptable solution is having super short names and having them in Base strikes me frankly as pretty unreasonable, yet that is the position that seems to be taken by many in this discussion. |
I prefer one. |
I am not sure it is that clean. Of course, in a perfect world you could pick one or the other, but for nearly every math function name (e.g. exp, sqrt, sin, cos, I [identity object], etc.) we chose a short, overly generic name that gets auto-imported. At this point these are simply idioms, and for them to mean something else hurts the brain. I see the trade-off, but we are being just as inconsistent not having to do What we seem to be doing is singling out var and std from the above list of many such common math functions and potentially giving them a non-idiomatic name because we don't want them in the Base namespace. I can see why this might be done, but it is hardly such an unreasonable or uncommon position as you make it sound. |
That in large part comes from I also think the namespace issue is not the main issue. Being in Base does not prevent other packages or code from using those names for something else, though it's often discussed as if that were the case. On the other side, keeping the functions in Base but renaming them would not fix anything as far as I'm concerned. What matters more to me is where the code is located, for purposes of developing and versioning it, and for purposes of removing it as a dependency for people who might want to build minimal executables at some point in the future. |
I agree that std and var are less common (likely just because they are not in math.h :) ), but I think the naming issues are the same. The names vary (std / sd and var), but when they exist in a base language they are almost always one of these versions (though I do see that Python has added variance and stdev to its standard library ... which is ugly given the numpy names ... but so it goes). Mostly this was in response to @StefanKarpinski's point about the two choices, which focused on the trade-off between namespacing and name length. As for the code location, I do think it has an end-user impact as well, which is likely why this always brings up so much discussion when any new piece is removed from base. The rise in |
I'd rather have longer names for |
This reverts "move cor, cov, std, stdm, var, varm and linreg to StatsBase (#27152)," commit 746d08f. Fixes #27374.