Include `.conda` packages #45

jakirkham · 2023-08-07T05:42:48Z

It would be helpful to include both .conda & .tar.bz2 packages, particularly as more of the former and fewer of the latter are produced. It may also help to count the two formats separately so the transition to the newer format can be tracked.

jakirkham · 2023-08-07T05:43:44Z

cc @beckermr @wolfv

jezdez · 2023-09-20T18:13:47Z

Looking into this with @cappadona

dopplershift · 2023-10-02T22:31:03Z

@jezdez Did that go anywhere? I was working on collecting some download numbers for my library and right now 2023 shows minimal downloads due to the transition to .conda.

jakirkham · 2023-10-24T00:03:04Z

@jezdez did this issue get solved more broadly?

Saw the python packages were fixed recently: #41

Is there a path for fixing the other packages? Or did this already happen?

cappadona · 2023-10-24T14:12:53Z

@jakirkham @dopplershift. Apologies for the delay.

We have not yet addressed .conda packages missing from this data set. This work is on our backlog, and we should be able to get this done in November. We will provide updates here, but please don't hesitate to reach out with questions.

jakirkham · 2023-10-25T17:37:56Z

Thanks Nick! 🙏

cappadona · 2024-01-04T17:47:07Z

Hi @jakirkham @dopplershift. Quick update on the status of this issue.

We're working on finalizing a new pipeline that will source this public data set and include .conda packages moving forward. We expect to have it ready by the end of March 2024 and will post an update here when it is available.

leofang · 2024-01-04T17:54:34Z

Hi @cappadona Thanks for the update! Q: Would it be possible to also update the past statistics when the new pipeline is up?

cappadona · 2024-01-04T18:00:06Z

@leofang At the moment we're not planning to replace any existing files in the bucket and only implement the fix for future data.

jakirkham · 2024-01-23T20:26:48Z

cc @aterrel @chenghlee (as we discussed this earlier)

leofang · 2024-03-01T15:37:10Z

Hi @cappadona @jezdez Friendly nudge for updates 🙂 This has impacted several statistics tracking tools and caused confusion. I've heard jabbering about "no one is using conda" as they looked at the download counts from, say, condastats, but it is simply not true.

cappadona · 2024-03-01T16:00:04Z

Hi @leofang. Thanks for checking in. We are on track to include .conda packages in the dataset by the end of the month.

jakirkham · 2024-03-19T19:25:34Z

Just wanted to check in, @cappadona how are things looking here?

wolfv · 2024-03-19T19:48:35Z

Still looks reaaaally flat: https://prefix.dev/channels/conda-forge/packages/aesara (picked a random package)

jakirkham · 2024-03-19T20:19:06Z

To be fair, Nick said end of the month originally. So end of next week

Though would be good to learn if that is still true or if this is likely to slip

jakirkham · 2024-04-01T04:29:49Z

@cappadona how are things looking?

cappadona · 2024-04-01T13:45:34Z

@jakirkham Sorry I missed your earlier message. Thanks for checking in. We're looking good and the March 2024 data published to the s3 bucket later this week will include .conda packages.

I will post an update to this thread once the March data is available.

jakirkham · 2024-04-01T17:46:19Z

Thanks Nick! 🙏

cappadona · 2024-04-05T21:40:17Z

Hi all. Quick update. We're just about there. Finalizing QA with the rest of the team, including a colleague who returns next week. Here are a couple examples for March 2024.

jakirkham · 2024-04-05T21:50:31Z

Thanks Nick! 🙏

With numpy this includes some older versions like 1.9.2, are these coming from defaults? Asking as conda-forge jumped to numpy version 1.9.3 (in the 1.9 series). Or is this an amalgamation of different channel statistics?

aesara is only in conda-forge AFAIK. So am guessing the top sheet is based on conda-forge data. Is that right?

cappadona · 2024-04-08T15:01:32Z

Hi @jakirkham. The screenshot is an aggregation of multiple channels, which are usually identified in the final dataset via the data_source column. I did confirm that conda-forge is the only data source for aesara.
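To make the role of data_source concrete, here is a minimal pandas sketch that splits one package's downloads by channel. The rows, the counts, and the channel labels are invented for illustration; only the column names (data_source, pkg_name, pkg_version, counts) are taken from this thread.

```python
import pandas as pd

# Hypothetical rows shaped like the monthly dataset; all values are made up.
df = pd.DataFrame(
    {
        "data_source": ["anaconda", "conda-forge", "conda-forge", "conda-forge"],
        "pkg_name": ["numpy", "numpy", "aesara", "aesara"],
        "pkg_version": ["1.9.2", "1.9.3", "2.9.3", "2.9.3"],
        "counts": [10, 25, 4, 6],
    }
)


def counts_by_source(frame, pkg_name):
    """Total downloads of one package, split by the channel that served it."""
    rows = frame[frame["pkg_name"] == pkg_name]
    return rows.groupby("data_source")["counts"].sum()


numpy_by_source = counts_by_source(df, "numpy")
aesara_by_source = counts_by_source(df, "aesara")
print(numpy_by_source)
print(aesara_by_source)
```

A package that only exists in one channel, like aesara here, shows up with a single entry in the result.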

jakirkham · 2024-04-17T02:10:07Z

How are things looking @cappadona ?

jakirkham · 2024-04-30T18:22:55Z

@cappadona are there any updates here?

Also as a side note, users are also asking about March data in this issue: #51

cappadona · 2024-05-04T14:48:41Z

Hi @jakirkham. Monthly and hourly data for March and April 2024, which include .conda packages, are now available in the bucket.

Thank you all for your patience.

jezdez · 2024-05-06T11:02:57Z

@cappadona Do you think we could update the old files as well, since .conda files had been hosted for a while? Should we keep this ticket open until we fix that?

wolfv · 2024-05-06T11:13:02Z

So just to get it right, the format of the parquet files changed?

jakirkham · 2024-05-30T04:49:09Z

Tried generating my own script to parse through the data. Am seeing the following download counts for cudatoolkit (legacy package for CUDA 11 and earlier) and cuda-version (used in CUDA 12 and later)

#!/usr/bin/env python


import sys

from packaging.version import InvalidVersion, Version

import matplotlib.pyplot as plt
import pandas as pd


plt.rcParams["figure.figsize"] = (22, 5)


def main(*argv):
    pkgs = [
        ("cudatoolkit", lambda v: Version("11.2") <= v < Version("12")),
        ("cuda-version", lambda v: Version("12") <= v and str(v) != "12.0.0"),
    ]

    year = "2024"
    month = "04"
    # Read the monthly file once instead of once per package
    df = pd.read_parquet(f"{year}-{month}.parquet")

    for each_pkg, keep_filter in pkgs:
        df_pkg = df[df["pkg_name"] == each_pkg]

        # Sum counts per version so that a version spread over several rows
        # (e.g. one per platform or data_source) yields a single bar
        versions = df_pkg["pkg_version"].astype(str)
        counts = df_pkg["counts"].groupby(versions).sum()

        # Map each original version string to its parsed Version
        parsed = {}
        for s in counts.index:
            try:
                parsed[s] = Version(s)
            except InvalidVersion:
                # Skip invalid version formats
                continue

        # Filter, then sort by parsed version while keeping the original strings
        kept = sorted((s for s in parsed if keep_filter(parsed[s])), key=parsed.get)
        counts_millions = counts.loc[kept] / 1e6

        plt.clf()

        plt.bar(counts_millions.index, counts_millions.to_numpy())
        plt.title(f"{each_pkg} versions vs. Downloads (millions) for {year}-{month}")
        plt.xlabel(f"{each_pkg} versions")
        plt.ylabel("Downloads (millions)")

        plt.savefig(f"{each_pkg}_download_count.svg")

    return 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv))

Here are the results it shows (note values are in millions):

Admittedly this is only one month

Plus some packages built with CUDA support link to the driver (like Arrow); so, may not pull in either of these packages at install time (despite building with CUDA support)

Also it would be better to group the cudatoolkit patch versions together like how cuda-version is handled
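A rough sketch of that grouping, assuming plain "major.minor.patch" version strings and using made-up counts. The helper names minor_series and group_by_minor are hypothetical and not part of the script above.

```python
from collections import defaultdict


def minor_series(version):
    """Reduce '11.8.0' to '11.8'; strings with fewer parts pass through unchanged."""
    return ".".join(version.split(".")[:2])


def group_by_minor(counts_by_version):
    """Add up download counts for all patch releases of the same minor series."""
    totals = defaultdict(int)
    for version, count in counts_by_version.items():
        totals[minor_series(version)] += count
    return dict(totals)


# Made-up counts, for illustration only.
example = {"11.2.0": 5, "11.2.2": 7, "11.8.0": 20, "11.8": 1}
grouped = group_by_minor(example)
print(grouped)
```

Splitting on dots is deliberately naive; version strings with epochs or pre-release tags would need packaging.version instead.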

Nevertheless this is a good rough test of the data. It does seem to be picking up download counts for these packages that were missed in prior months (which had been off by a couple orders of magnitude in the worst case)

Edit: Fix issue where 12.0 got cut off

cappadona · 2024-06-04T15:12:51Z

Thanks @jakirkham. May 2024 data was made available this past Saturday, June 1st.

As of today, the .conda packages are included in the data for the following months:

2024-03
2024-04
2024-05

Based on the community feedback thus far, we're considering replacing data for additional prior months, and updating them to also include .conda packages. Stay tuned.

cc @jezdez

jakirkham · 2024-06-04T16:58:28Z

> Based on the community feedback thus far, we're considering replacing data for additional prior months, and updating them to also include .conda packages. Stay tuned.

Thanks Nick! 🙏

This would be incredibly helpful 🙂

h-vetinari · 2024-06-05T02:06:06Z

It would be amazing to pull these updates back to the introduction of .conda artefacts, both for having a correct history and an accurate total number of downloads. The conda-forge landing page currently prominently displays the latter, and I think we're still not counting over a year of .conda downloads.

If one goes and executes the by-the-numbers notebook linked from the conda-forge landing page (with some minor adaptations to update the loop over which years we're interested in), we get the following for 2021-2023:

While there's undoubtedly some variability in the monthly data, to my understanding that sharp drop-off is related to the introduction of .conda around November 2022.

jezdez · 2024-06-05T12:26:40Z

I agree with @h-vetinari, let’s make this available for the whole time period, doesn’t make sense otherwise IMO.

wolfv · 2024-06-09T14:44:38Z

Did something happen with the timestamps? For some reason, we seem to have some new entries at "epoch 0" (i.e. somewhere in 1970)

I'll delete/filter them from our data but just wanted to check if anyone knows what's up?
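Filtering such rows could look like the sketch below. It assumes the timestamp column is called time and that anything before the year 2000 is implausible; both are assumptions made for this example, and the rows are invented.

```python
import pandas as pd

# Hypothetical rows; the first one carries an "epoch 0" timestamp.
df = pd.DataFrame(
    {
        "time": pd.to_datetime(["1970-01-01", "2024-05-01", "2024-05-02"]),
        "counts": [3, 10, 12],
    }
)

# Assumed cutoff: no real download record should predate this.
MIN_PLAUSIBLE = pd.Timestamp("2000-01-01")

clean = df[df["time"] >= MIN_PLAUSIBLE]
dropped = len(df) - len(clean)
print(f"dropped {dropped} row(s) with implausible timestamps")
```

Counting the dropped rows before discarding them makes it easier to tell whether the bad timestamps come from the source data or from one's own parsing.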

jakirkham · 2024-07-15T08:07:00Z

@cappadona , hope you had a good weekend! 😀

Do you have thoughts on the questions above? To summarize...

Can we backport the .conda count fix to earlier dates?
Do we know how the timestamps are being created (seeing 1970 references)?

Also would add one more...

How are the anaconda.org numbers generated relative to these? Seeing some differences ( Difference in numbers from condastats and anaconda.org conda-incubator/condastats#18 )

Thanks for your help! 🙏

modouldemba · 2024-08-06T21:31:45Z

June and July data is still not available. Is there an issue?

wolfv · 2024-08-07T08:52:57Z

We are removing download counts from our website for now since it doesn't seem to be very reliable and looks just bad :(

jakirkham · 2024-08-12T22:37:08Z

@cappadona could you please help us with the issues posted above?

Notably:

- Odd 1970 data points
- Prior months with .conda
- Missing recent months (June & July)
- Differences between this data and Anaconda.org's numbers
  - Anything wrong with the data from March to today ? #52
  - Difference in numbers from condastats and anaconda.org conda-incubator/condastats#18

jezdez · 2024-08-13T07:08:59Z

@jakirkham FTR, this has been prioritized and is getting more attention again

jakirkham · 2024-08-13T09:03:12Z

Thanks Jannis! 🙏

Please let us know if you need more info from us or need us to test anything 🙂

wolfv · 2024-09-06T14:22:23Z

The 1970 issues were actually issues in our code. Sorry about that!

wolfv · 2024-09-06T14:27:44Z

We just fixed things on our end, but it appears that the pipeline to produce this data is not really working anymore?

The latest data is 2024-06...

jezdez · 2024-09-06T16:54:27Z

Huh, I'd check with @cappadona about it, he was working on an analysis

cappadona · 2024-09-06T17:46:54Z

Hi all. We've been running some analysis on the dataset in response to everyone's feedback and will share our findings when this is complete.

In the interim, responding to some of the recent questions in this thread...

@wolfv

> We just fixed things on our end, but it appears that the pipeline to produce this data is not really working anymore?
>
> The latest data is 2024-06...

The latest data available in the s3 bucket is for 2024-05, which was made available in June. We have temporarily paused publishing new data until we complete the QA.

> The 1970 issues were actually issues in our code. Sorry about that!

Thank you. This is one issue that we haven't been able to reproduce.

@jakirkham @phwuil @nicrie

> Notably:
>
> Odd 1970 data points

False alarm -- addressed by Wolf

> Prior months with .conda

This is the main focus of our QA effort and we're tentatively planning to replace data beginning in 2022-06 to address the undercounting.

> Missing recent months (June & July)

Temporarily paused publishing new data (see my response above)

> Differences between this data and Anaconda.org's numbers
>
> Anything wrong with the data from March to today ? #52
>
> Difference in numbers from condastats and anaconda.org conda-incubator/condastats#18

We still need to dig into the download counter displayed on anaconda.org. I will also comment on each of those issues.

wolfv · 2024-09-20T12:18:27Z

We've dropped the faulty data from our end. Any chance you are going to backfill data from the past? it looks pretty weird now, because some packages that had releases only have 1 measuring point.

wolfv · 2024-09-20T12:19:15Z

wolfv · 2024-09-20T12:20:59Z

Lastly, while it appears you fixed the .conda, is it possible that .tar.bz2 are not accounted for anymore?

https://prefix.dev/channels/conda-forge/packages/_libgcc_mutex

cappadona · 2024-09-23T15:14:59Z

> We've dropped the faulty data from our end. Any chance you are going to backfill data from the past? it looks pretty weird now, because some packages that had releases only have 1 measuring point.

Hi Wolf, yes we are planning to backfill past data and we will be sharing details at this week's conda community sync.

cappadona · 2024-09-23T15:24:57Z

Hi @wolfv I'm unable to reproduce this dropoff for _libgcc_mutex when using condastats

wolfv · 2024-09-23T15:44:00Z

OK, then we might have an issue on our end again :) Thanks!

jakirkham · 2024-10-23T21:00:16Z

@cappadona is this working correctly for other channels?

Think it would be good to double check these are all handled correctly (others may have suggestions):

bioconda
defaults
nvidia
pytorch
rapidsai
rapidsai-nightly

jakirkham · 2024-11-05T19:12:40Z

Also worth noting RAPIDS is switching to publishing .conda packages. So we will want to make sure they are picked up in the statistics here

h-vetinari · 2024-11-07T23:51:48Z

> > We've dropped the faulty data from our end. Any chance you are going to backfill data from the past? it looks pretty weird now, because some packages that had releases only have 1 measuring point.
>
> Hi Wolf, yes we are planning to backfill past data and we will be sharing details at this week's conda community sync.

Any updates on the backfill?

I tried to run the by-the-numbers binder again, and

dd.read_parquet("s3://anaconda-package-data/conda/hourly/2024/06/2024-06-*.parquet",storage_options={'anon': True})

returns an empty data frame, and so do all months after June (whereas the months up until May 2024 are fine).

I've loosened the match to

dd.read_parquet("s3://anaconda-package-data/conda/hourly/2024/06/*.parquet",storage_options={'anon': True})

and still nothing.
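One way to narrow this down is to count the objects behind each month's prefix before handing the glob to dask. The sketch below is offline: FakeFS is a hypothetical stand-in with invented paths, and file_counts only assumes that the filesystem object has a glob method, as fsspec-style filesystems do. The path layout is the one from the calls above.

```python
def hourly_glob(year, month):
    """Glob pattern for one month of hourly files, following the paths quoted above."""
    return f"s3://anaconda-package-data/conda/hourly/{year}/{month:02d}/*.parquet"


def file_counts(fs, months):
    """Map 'YYYY-MM' to the number of paths the filesystem reports for that month."""
    return {f"{y}-{m:02d}": len(fs.glob(hourly_glob(y, m))) for y, m in months}


class FakeFS:
    """Hypothetical stand-in for a real filesystem so the sketch runs offline."""

    def __init__(self, paths):
        self.paths = paths

    def glob(self, pattern):
        # Treat everything before the "*" as a plain prefix match.
        prefix = pattern.rsplit("*", 1)[0]
        return [p for p in self.paths if p.startswith(prefix)]


# Invented listing: two files for May, none for June.
fake = FakeFS(
    [
        "s3://anaconda-package-data/conda/hourly/2024/05/2024-05-01.parquet",
        "s3://anaconda-package-data/conda/hourly/2024/05/2024-05-02.parquet",
    ]
)
counts = file_counts(fake, [(2024, 5), (2024, 6)])
print(counts)
```

Against the real bucket, the fake could be swapped for s3fs.S3FileSystem(anon=True). A count of zero for a month would point at missing files rather than at the read_parquet call.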

jakirkham · 2024-11-08T23:49:46Z

Asked about this at the Conda community meeting earlier this week and it sounds like they are working through some issues

jezdez mentioned this issue Sep 20, 2023

Download counts missing python 3.10+ versions #41

Closed

jaimergp mentioned this issue Jan 3, 2024

Missing packages? #49

Closed

wolfv mentioned this issue Jan 24, 2024

[Bug] Download stats only shown for old versions prefix-dev/prefix-dev#23

Open

jakirkham mentioned this issue Apr 17, 2024

Add missing March data #51

Closed

cappadona mentioned this issue Jun 4, 2024

Anything wrong with the data from March to today ? #52

Open

mglisse mentioned this issue Jun 5, 2024

Anything wrong with these stats ? conda-incubator/condastats#22

Closed

nicrie mentioned this issue Aug 13, 2024

Difference in numbers from condastats and anaconda.org conda-incubator/condastats#18

Closed

cappadona mentioned this issue Sep 9, 2024

Access issue with s3 path s3://anaconda-package-data/conda/hourly #55

Closed
