Include .conda packages #45
Comments
Looking into this with @cappadona |
@jezdez Did that go anywhere? I was working on collecting some download numbers for my library and right now 2023 shows minimal downloads due to the transition to |
@jakirkham @dopplershift. Apologies for the delay. We have not yet addressed |
Thanks Nick! 🙏 |
Hi @jakirkham @dopplershift. Quick update on the status of this issue. We're working on finalizing a new pipeline that will source this public data set and include |
Hi @cappadona Thanks for the update! Q: Would it be possible to also update the past statistics when the new pipeline is up? |
@leofang At the moment we're not planning to replace any existing files in the bucket and only implement the fix for future data. |
cc @aterrel @chenghlee (as we discussed this earlier) |
Hi @cappadona @jezdez Friendly nudge for updates 🙂 This has impacted several statistics tracking tools and caused confusion. I've heard jabbering about "no one is using conda" as they looked at the download counts from, say, |
Hi @leofang. Thanks for checking in. We are on track to include |
Just wanted to check in, @cappadona how are things looking here? |
Still looks reaaaally flat: https://prefix.dev/channels/conda-forge/packages/aesara (picked a random package) |
To be fair, Nick said end of the month originally. So end of next week. Though it would be good to learn if that is still true or if this is likely to slip |
@cappadona how are things looking? |
@jakirkham Sorry I missed your earlier message. Thanks for checking in. We're looking good and the March 2024 data published to the s3 bucket later this week will include I will post an update to this thread once the March data is available. |
Thanks Nick! 🙏 |
Thanks Nick! 🙏 With
|
Hi @jakirkham. The screenshot is an aggregation of multiple channels, which are usually identified in the final dataset via the |
How are things looking @cappadona ? |
@cappadona are there any updates here? Also as a side note, users are also asking about March data in this issue: #51 |
Hi @jakirkham. Monthly and hourly data for March and April 2024, which includes Thank you all for your patience. |
@cappadona Do you think we could update the old files as well, since .conda files had been hosted for a while? Should we keep this ticket open until we fix that? |
So just to get it right, the format of the parquet files changed? |
Tried generating my own script to parse through the data. Am seeing the following download counts for
#!/usr/bin/env python
import packaging
import sys
from packaging.version import InvalidVersion, Version
import matplotlib.pyplot as plt
import pandas as pd

plt.rcParams["figure.figsize"] = (22, 5)

def main(*argv):
    # Each entry pairs a package name with a filter on its parsed versions
    pkgs = [
        ("cudatoolkit", lambda v: Version("11.2") <= v < Version("12")),
        ("cuda-version", lambda v: Version("12") <= v and str(v) != "12.0.0"),
    ]
    for each_pkg, keep_filter in pkgs:
        year = "2024"
        month = "04"
        df = pd.read_parquet(f"{year}-{month}.parquet")
        df_pkg = df[df["pkg_name"] == each_pkg]
        pkg_vers = []
        for v in df_pkg["pkg_version"].unique():
            try:
                v = Version(v)
            except InvalidVersion:
                # Skip invalid version formats
                continue
            pkg_vers.append(v)
        pkg_vers = sorted(pkg_vers)
        pkg_vers_filt = list(filter(keep_filter, pkg_vers))
        df_pkg_sorted = pd.concat(
            [df_pkg[df_pkg["pkg_version"] == str(v)] for v in pkg_vers_filt]
        )
        # Copy so the unit conversion below does not write into a view of df
        df_pkg_plot = df_pkg_sorted[["pkg_version", "counts"]].copy()
        df_pkg_plot["counts"] = df_pkg_plot["counts"] / 1e6
        plt.clf()
        plt.bar(df_pkg_plot["pkg_version"], df_pkg_plot["counts"])
        plt.title(f"{each_pkg} versions vs. Downloads (millions) for {year}-{month}")
        plt.xlabel(f"{each_pkg} versions")
        plt.ylabel("Downloads (millions)")
        plt.savefig(f"{each_pkg}_download_count.svg")
    return 0

if __name__ == "__main__":
    sys.exit(main(*sys.argv))

Here are the results it shows (note values are in millions):
Admittedly this is only one month.
Plus some packages built with CUDA support link to the driver (like Arrow), so they may not pull in either of these packages at install time (despite building with CUDA support).
Also it would be better to group the
Nevertheless this is a good rough test of the data. It does seem to be picking up download counts for these packages that were missed in prior months (which had been off by a couple orders of magnitude in the worst case).
Edit: Fix issue where 12.0 got cutoff |
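If the monthly file holds more than one row per version (for example, one per platform), summing the counts first gives a single bar per version. This is a minimal sketch that assumes only the `pkg_name`, `pkg_version`, and `counts` columns the script above already uses. The function name `counts_per_version` and the sample rows are made up for illustration.

```python
import pandas as pd


def counts_per_version(df: pd.DataFrame, pkg_name: str) -> pd.Series:
    """Sum download counts per version for a single package."""
    df_pkg = df[df["pkg_name"] == pkg_name]
    # One total per distinct pkg_version string, indexed by that string
    return df_pkg.groupby("pkg_version")["counts"].sum()


# Made-up rows for illustration; real data would come from pd.read_parquet(...)
sample = pd.DataFrame(
    {
        "pkg_name": ["cuda-version", "cuda-version", "cuda-version", "cudatoolkit"],
        "pkg_version": ["12.0", "12.0", "12.4", "11.8.0"],
        "counts": [10, 5, 7, 3],
    }
)

totals = counts_per_version(sample, "cuda-version")
print(totals)
```

The resulting series could be passed to `plt.bar(totals.index, totals.values)` in place of the row-level data.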
Thanks @jakirkham. May 2024 data was made available this past Saturday, June 1st. As of today, the
Based on the community feedback thus far, we're considering replacing data for additional prior months, and updating them to also include cc @jezdez |
Thanks Nick! 🙏 This would be incredibly helpful 🙂 |
It would be amazing to pull these updates back to the introduction of If one goes and executes by-the-numbers notebook linked from the conda-forge landing page (with some minor adaptations to update the loop over which years we're interested in), we get the following for 2021-2023: While there's undoubtedly some variability in the monthly data, to my understanding that sharp drop-off is related to the introduction of |
I agree with @h-vetinari, let’s make this available for the whole time period, doesn’t make sense otherwise IMO. |
@cappadona , hope you had a good weekend! 😀 Do you have thoughts on the questions above? To summarize...
1. Can we backport the .conda count fix to earlier dates?
2. Do we know how the timestamps are being created (seeing 1970 references)?
Also would add one more...
1. How are the anaconda.org numbers generated relative to these? Seeing some differences ( conda-incubator/condastats#18 )
Thanks for your help! 🙏 |
June and July data is still not available. Is there an issue?
|
We are removing download counts from our website for now since it doesn't seem to be very reliable and looks just bad :( |
@cappadona could you please help us with the issues posted above? Notably:
|
@jakirkham FTR, this has been prioritized and is getting more attention again |
Thanks Jannis! 🙏 Please let us know if you need more info from us or need us to test anything 🙂 |
The 1970 issues were actually issues in our code. Sorry about that! |
We just fixed things on our end, but it appears that the pipeline to produce this data is not really working anymore? The latest data is 2024-06... |
Huh, I'd check with @cappadona about it, he was working on an analysis |
Hi all. We've been running some analysis on the dataset in response to everyone's feedback and will share our findings when this is complete. In the interim, responding to some of the recent questions in this thread...
The latest data available in the s3 bucket is for
Thank you. This is one issue that we haven't been able to reproduce.
False alarm -- addressed by Wolf
This is the main focus of our QA effort and we're tentatively planning to replace data beginning in
Temporarily paused publishing new data (see my response above)
We still need to dig into the download counter displayed on |
We've dropped the faulty data from our end. Any chance you are going to backfill data from the past? It looks pretty weird now, because some packages that had releases only have 1 measuring point. |
Lastly, while it appears you fixed the https://prefix.dev/channels/conda-forge/packages/_libgcc_mutex |
Hi Wolf, yes we are planning to backfill past data and we will be sharing details at this week's conda community sync. |
Hi @wolfv I'm unable to reproduce this dropoff for |
OK, then we might have an issue on our end again :) Thanks! |
@cappadona is this working correctly for other channels? Think it would be good to double check these are all handled correctly (others may have suggestions):
|
Also worth noting RAPIDS is switching to publishing |
Any updates on the backfill? I tried to run the by-the-numbers binder again, and dd.read_parquet("s3://anaconda-package-data/conda/hourly/2024/06/2024-06-*.parquet",storage_options={'anon': True}) returns an empty data frame, and so do all months after June (whereas the months up until May 2024 are fine). I've loosened the match to dd.read_parquet("s3://anaconda-package-data/conda/hourly/2024/06/*.parquet",storage_options={'anon': True}) and still nothing. |
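One way to tell "no files were published for that month" apart from a problem on the reading side is to list the bucket directly before calling `dd.read_parquet`. The sketch below reuses the path layout from the comment above. The helper names (`hourly_glob`, `empty_months`, `s3_lister`) are hypothetical, and the listing step assumes the `s3fs` package is installed and that the bucket allows anonymous access, as the `storage_options={'anon': True}` call above suggests.

```python
from typing import Callable, Iterable, List, Tuple

BUCKET_PREFIX = "s3://anaconda-package-data/conda/hourly"


def hourly_glob(year: int, month: int) -> str:
    """Build the glob pattern for one month of hourly parquet files."""
    return f"{BUCKET_PREFIX}/{year}/{month:02d}/*.parquet"


def empty_months(
    months: Iterable[Tuple[int, int]],
    list_files: Callable[[str], List[str]],
) -> List[Tuple[int, int]]:
    """Return the (year, month) pairs whose pattern matches no files."""
    return [(y, m) for y, m in months if not list_files(hourly_glob(y, m))]


def s3_lister() -> Callable[[str], List[str]]:
    """Return a glob function backed by s3fs with anonymous access."""
    import s3fs  # imported lazily so the helpers above work without it

    fs = s3fs.S3FileSystem(anon=True)
    return fs.glob
```

For example, `empty_months([(2024, m) for m in range(4, 9)], s3_lister())` would list the months between April and August 2024 that have no hourly files in the bucket.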
Asked about this at the Conda community meeting earlier this week and it sounds like they are working through some issues |
It would be helpful to include both .conda & .tar.bz2 packages. Particularly as more of the former and less of the latter are produced. May also help to track these separately to track the transition to the newer format
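Tracking the two formats separately could look roughly like the sketch below. It assumes raw download records that carry the artifact filename, which the published data set may not expose. The function names and the sample filenames are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def package_format(filename: str) -> str:
    """Classify a package artifact by its file extension."""
    if filename.endswith(".conda"):
        return "conda"
    if filename.endswith(".tar.bz2"):
        return "tar.bz2"
    return "other"


def downloads_by_format(records: Iterable[Tuple[str, int]]) -> Counter:
    """Sum download counts per package format from (filename, count) pairs."""
    totals: Counter = Counter()
    for filename, count in records:
        totals[package_format(filename)] += count
    return totals


# Made-up records for illustration only
sample_records = [
    ("numpy-1.26.4-py312_0.conda", 120),
    ("numpy-1.26.4-py312_0.tar.bz2", 30),
    ("scipy-1.12.0-py312_0.conda", 50),
]
format_totals = downloads_by_format(sample_records)
print(format_totals)
```

Keeping the totals side by side per month would show how quickly downloads shift from `.tar.bz2` to `.conda`.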