Include .conda packages #45

Open
jakirkham opened this issue Aug 7, 2023 · 66 comments

Comments

@jakirkham

It would be helpful to include both .conda & .tar.bz2 packages, particularly as more of the former and fewer of the latter are produced. It may also help to track these separately to follow the transition to the newer format.

@jakirkham
Author

cc @beckermr @wolfv

@jezdez
Member

jezdez commented Sep 20, 2023

Looking into this with @cappadona

@dopplershift

@jezdez Did that go anywhere? I was working on collecting some download numbers for my library and right now 2023 shows minimal downloads due to the transition to .conda.

@jakirkham
Author

@jezdez did this issue get solved more broadly?

Saw the python packages were fixed recently: #41

Is there a path for fixing the other packages? Or did this already happen?

@cappadona

@jakirkham @dopplershift. Apologies for the delay.

We have not yet addressed .conda packages missing from this data set. This work is on our backlog, and we should be able to get this done in November. We will provide updates here, but please don't hesitate to reach out with questions.

@jakirkham
Author

Thanks Nick! 🙏

@cappadona

Hi @jakirkham @dopplershift. Quick update on the status of this issue.

We're working on finalizing a new pipeline that will source this public data set and include .conda packages moving forward. We expect to have it ready by the end of March 2024 and will post an update here when it is available.

@leofang

leofang commented Jan 4, 2024

Hi @cappadona Thanks for the update! Q: Would it be possible to also update the past statistics when the new pipeline is up?

@cappadona

@leofang At the moment we're not planning to replace any existing files in the bucket and only implement the fix for future data.

@jakirkham
Author

cc @aterrel @chenghlee (as we discussed this earlier)

@leofang

leofang commented Mar 1, 2024

Hi @cappadona @jezdez Friendly nudge for updates 🙂 This has impacted several statistics tracking tools and caused confusion. I've heard jabbering about "no one is using conda" as they looked at the download counts from, say, condastats, but it is simply not true.

@cappadona

Hi @leofang. Thanks for checking in. We are on track to include .conda packages in the dataset by the end of the month.

@jakirkham
Author

Just wanted to check in, @cappadona how are things looking here?

@wolfv

wolfv commented Mar 19, 2024

Still looks reaaaally flat: https://prefix.dev/channels/conda-forge/packages/aesara (picked a random package)

@jakirkham
Author

To be fair, Nick said end of the month originally. So end of next week

Though would be good to learn if that is still true or if this is likely to slip

@jakirkham
Author

@cappadona how are things looking?

@cappadona

@jakirkham Sorry I missed your earlier message. Thanks for checking in. We're looking good and the March 2024 data published to the s3 bucket later this week will include .conda packages.

I will post an update to this thread once the March data is available.

@jakirkham
Author

Thanks Nick! 🙏

@cappadona

Hi all. Quick update. We're just about there. Finalizing QA with the rest of the team, including a colleague who returns next week. Here are a couple examples for March 2024.

Screenshot 2024-04-05 at 5 17 12 PM Screenshot 2024-04-05 at 5 20 53 PM

@jakirkham
Author

Thanks Nick! 🙏

With numpy this includes some older versions like 1.9.2. Are these coming from defaults? Asking as conda-forge jumped to numpy version 1.9.3 (in the 1.9 series). Or is this an amalgamation of different channel statistics?

aesara is only in conda-forge AFAIK. So am guessing the top sheet is based on conda-forge data. Is that right?

@cappadona

Hi @jakirkham. The screenshot is an aggregation of multiple channels, which are usually identified in the final dataset via the data_source column. I did confirm that conda-forge is the only data source for aesara.
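
To see which channels contribute to a package's total, the counts can be summed per data_source. The following is a minimal sketch, assuming rows that carry the data_source, pkg_name, and counts fields mentioned in this thread. The counts_by_source helper and the sample rows are hypothetical and are not real download numbers.

```python
from collections import defaultdict


def counts_by_source(rows, pkg_name):
    """Sum download counts per data_source for a single package.

    `rows` is any iterable of dict-like records with the keys
    "data_source", "pkg_name", and "counts".
    """
    totals = defaultdict(int)
    for row in rows:
        if row["pkg_name"] == pkg_name:
            totals[row["data_source"]] += row["counts"]
    return dict(totals)


# Made-up sample rows, for illustration only.
sample_rows = [
    {"data_source": "conda-forge", "pkg_name": "aesara", "counts": 10},
    {"data_source": "conda-forge", "pkg_name": "aesara", "counts": 5},
    {"data_source": "anaconda", "pkg_name": "numpy", "counts": 7},
    {"data_source": "conda-forge", "pkg_name": "numpy", "counts": 3},
]

print(counts_by_source(sample_rows, "aesara"))
print(counts_by_source(sample_rows, "numpy"))
```

With a pandas DataFrame df holding the same columns, df[df["pkg_name"] == "aesara"].groupby("data_source")["counts"].sum() should give a comparable breakdown.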

@jakirkham
Author

How are things looking @cappadona ?

@jakirkham
Author

@cappadona are there any updates here?

Also as a side note, users are also asking about March data in this issue: #51

@cappadona

Hi @jakirkham. Monthly and hourly data for March and April 2024, which include .conda packages, are now available in the bucket.

Thank you all for your patience.

@jezdez
Member

jezdez commented May 6, 2024

@cappadona Do you think we could update the old files as well, since .conda files had been hosted for a while? Should we keep this ticket open until we fix that?

@wolfv

wolfv commented May 6, 2024

So just to get it right, the format of the parquet files changed?

@jakirkham
Author

jakirkham commented May 30, 2024

Tried generating my own script to parse through the data. Am seeing the following download counts for cudatoolkit (legacy package for CUDA 11 and earlier) and cuda-version (used in CUDA 12 and later)

#!/usr/bin/env python


import sys

from packaging.version import InvalidVersion, Version

import matplotlib.pyplot as plt
import pandas as pd


plt.rcParams["figure.figsize"] = (22, 5)


def main(*argv):
    # Each entry pairs a package name with a predicate selecting the versions to plot.
    pkgs = [
        ("cudatoolkit", lambda v: Version("11.2") <= v < Version("12")),
        ("cuda-version", lambda v: Version("12") <= v and str(v) != "12.0.0"),
    ]

    year = "2024"
    month = "04"
    # Read the locally downloaded monthly file once and reuse it for every package.
    df = pd.read_parquet(f"{year}-{month}.parquet")

    for each_pkg, keep_filter in pkgs:
        df_pkg = df[df["pkg_name"] == each_pkg]

        # Parse the version strings so they sort numerically rather than lexically.
        pkg_vers = []
        for v in df_pkg["pkg_version"].unique():
            try:
                v = Version(v)
            except InvalidVersion:
                # Skip invalid version formats
                continue
            pkg_vers.append(v)
        pkg_vers = sorted(pkg_vers)

        pkg_vers_filt = list(filter(keep_filter, pkg_vers))

        # Reassemble the rows in ascending version order.
        df_pkg_sorted = pd.concat(
            [df_pkg[df_pkg["pkg_version"] == str(v)] for v in pkg_vers_filt]
        )

        # Copy before modifying to avoid pandas' SettingWithCopyWarning.
        df_pkg_plot = df_pkg_sorted[["pkg_version", "counts"]].copy()
        df_pkg_plot["counts"] = df_pkg_plot["counts"] / 1e6

        plt.clf()

        plt.bar(df_pkg_plot["pkg_version"], df_pkg_plot["counts"])
        plt.title(f"{each_pkg} versions vs. Downloads (millions) for {year}-{month}")
        plt.xlabel(f"{each_pkg} versions")
        plt.ylabel("Downloads (millions)")

        plt.savefig(f"{each_pkg}_download_count.svg")

    return 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv))

Here are the results it shows (note values are in millions):

cudatoolkit_download_count

cuda-version_download_count

Admittedly this is only one month

Plus some packages built with CUDA support link to the driver (like Arrow); so, may not pull in either of these packages at install time (despite building with CUDA support)

Also it would be better to group the cudatoolkit patch versions together like how cuda-version is handled
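
One way to do that grouping is to collapse each version to its major.minor prefix before summing. This is only a sketch under stated assumptions: the group_by_minor helper and the sample pairs are hypothetical, and the parsing is a plain string split rather than the packaging-based parsing used in the script above.

```python
from collections import defaultdict


def group_by_minor(version_counts):
    """Collapse patch releases so that 11.8.0 and 11.8.1 both count as 11.8.

    `version_counts` is an iterable of (version_string, count) pairs.
    Versions without a numeric "major.minor" prefix are skipped.
    """
    totals = defaultdict(int)
    for version, count in version_counts:
        parts = version.split(".")
        if len(parts) < 2 or not (parts[0].isdecimal() and parts[1].isdecimal()):
            continue
        totals[f"{parts[0]}.{parts[1]}"] += count
    # Sort numerically so that 11.10 would come after 11.8, not before it.
    return dict(
        sorted(totals.items(), key=lambda kv: tuple(int(p) for p in kv[0].split(".")))
    )


# Made-up sample pairs, for illustration only.
sample = [
    ("11.2.0", 4),
    ("11.8.0", 20),
    ("11.2.2", 6),
    ("11.3.1", 1),
    ("not-a-version", 9),
]
grouped = group_by_minor(sample)
print(grouped)
```

Starting from the DataFrame in the script, the input pairs could come from zip(df_pkg["pkg_version"], df_pkg["counts"]).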

Nevertheless this is a good rough test of the data. It does seem to be picking up download counts for these packages that were missed in prior months (which had been off by a couple orders of magnitude in the worst case)

Edit: Fix issue where 12.0 got cut off

@cappadona

Thanks @jakirkham. May 2024 data was made available this past Saturday, June 1st.

As of today, the .conda packages are included in the data for the following months:

  • 2024-03
  • 2024-04
  • 2024-05

Based on the community feedback thus far, we're considering replacing data for additional prior months, and updating them to also include .conda packages. Stay tuned.

cc @jezdez

@jakirkham
Author

> Based on the community feedback thus far, we're considering replacing data for additional prior months, and updating them to also include .conda packages. Stay tuned.

Thanks Nick! 🙏

This would be incredibly helpful 🙂

@h-vetinari

It would be amazing to pull these updates back to the introduction of .conda artefacts, both for having a correct history and an accurate total number of downloads. The conda-forge landing page currently prominently displays the latter, and I think we're still not counting over a year of .conda downloads.

If one goes and executes the by-the-numbers notebook linked from the conda-forge landing page (with some minor adaptations to update the loop over which years we're interested in), we get the following for 2021-2023:

Untitled

While there's undoubtedly some variability in the monthly data, to my understanding that sharp drop-off is related to the introduction of .conda around November 2022.

@jezdez
Member

jezdez commented Jun 5, 2024

I agree with @h-vetinari, let’s make this available for the whole time period, doesn’t make sense otherwise IMO.

@wolfv

wolfv commented Jun 9, 2024

Did something happen with the timestamps? For some reason, we seem to have some new entries at "epoch 0" (i.e., somewhere in 1970)

Screenshot 2024-06-09 at 09 33 05

I'll delete/filter them from our data but just wanted to check if anyone knows what's up?
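
A defensive filter of that kind could look roughly like the sketch below. It assumes records are (timestamp, value) pairs with timezone-aware timestamps. The drop_implausible_timestamps helper, the cutoff date, and the sample records are all hypothetical.

```python
from datetime import datetime, timezone

# Assumed cutoff: anything earlier is treated as a bogus timestamp.
# The exact date is a placeholder; pick one that predates your real data.
EARLIEST_PLAUSIBLE = datetime(2000, 1, 1, tzinfo=timezone.utc)


def drop_implausible_timestamps(records, earliest=EARLIEST_PLAUSIBLE):
    """Split (timestamp, value) records into kept and dropped lists."""
    kept, dropped = [], []
    for timestamp, value in records:
        target = kept if timestamp >= earliest else dropped
        target.append((timestamp, value))
    return kept, dropped


# Made-up sample records, for illustration only.
records = [
    (datetime(1970, 1, 1, tzinfo=timezone.utc), 12),  # "epoch 0" entry
    (datetime(2024, 5, 1, tzinfo=timezone.utc), 340),
    (datetime.fromtimestamp(0, tz=timezone.utc), 7),  # epoch 0 via a zero Unix time
]
kept, dropped = drop_implausible_timestamps(records)
print(len(kept), "kept;", len(dropped), "dropped")
```

Returning the dropped records as well makes it easier to inspect where the odd timestamps came from before discarding them.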

@jakirkham
Author

@cappadona , hope you had a good weekend! 😀

Do you have thoughts on the questions above? To summarize...

  1. Can we backport the .conda count fix to earlier dates?
  2. Do we know how the timestamps are being created (seeing 1970 references)?

Also would add one more...

  3. How are the anaconda.org numbers generated relative to these? Seeing some differences (see "Difference in numbers from condastats and anaconda.org", conda-incubator/condastats#18)

Thanks for your help! 🙏

@modouldemba

modouldemba commented Aug 6, 2024 via email

@wolfv

wolfv commented Aug 7, 2024

We are removing download counts from our website for now since it doesn't seem to be very reliable and looks just bad :(

@jakirkham
Author

jakirkham commented Aug 12, 2024

@jezdez
Member

jezdez commented Aug 13, 2024

@jakirkham FTR, this has been prioritized and is getting more attention again

@jakirkham
Author

Thanks Jannis! 🙏

Please let us know if you need more info from us or need us to test anything 🙂

@wolfv

wolfv commented Sep 6, 2024

The 1970 issues were actually issues in our code. Sorry about that!

@wolfv

wolfv commented Sep 6, 2024

We just fixed things on our end, but it appears that the pipeline to produce this data is not really working anymore?

The latest data is 2024-06...

@jezdez
Member

jezdez commented Sep 6, 2024

Huh, I'd check with @cappadona about it, he was working on an analysis

@cappadona

Hi all. We've been running some analysis on the dataset in response to everyone's feedback and will share our findings when this is complete.

In the interim, responding to some of the recent questions in this thread...


@wolfv

> We just fixed things on our end, but it appears that the pipeline to produce this data is not really working anymore?
>
> The latest data is 2024-06...

The latest data available in the s3 bucket is for 2024-05, which was made available in June. We have temporarily paused publishing new data until we complete the QA.

> The 1970 issues were actually issues in our code. Sorry about that!

Thank you. This is one issue that we haven't been able to reproduce.


@jakirkham @phwuil @nicrie

Notably:

False alarm -- addressed by Wolf

This is the main focus of our QA effort and we're tentatively planning to replace data beginning in 2022-06 to address the undercounting.

Temporarily paused publishing new data (see my response above)

We still need to dig into the download counter displayed on anaconda.org. I will also comment on each of those issues.

@wolfv

wolfv commented Sep 20, 2024

We've dropped the faulty data from our end. Any chance you are going to backfill data from the past? It looks pretty weird now, because some packages that had releases only have 1 measuring point.

@wolfv

wolfv commented Sep 20, 2024

Screenshot 2024-09-20 at 14 18 53

@wolfv

wolfv commented Sep 20, 2024

Lastly, while it appears you fixed the .conda, is it possible that .tar.bz2 are not accounted for anymore?

https://prefix.dev/channels/conda-forge/packages/_libgcc_mutex

Screenshot 2024-09-20 at 14 20 38

@cappadona

> We've dropped the faulty data from our end. Any chance you are going to backfill data from the past? It looks pretty weird now, because some packages that had releases only have 1 measuring point.

Hi Wolf, yes we are planning to backfill past data and we will be sharing details at this week's conda community sync.

@cappadona

Hi @wolfv I'm unable to reproduce this dropoff for _libgcc_mutex when using condastats

Screenshot 2024-09-23 at 11 20 20 AM

@wolfv

wolfv commented Sep 23, 2024

OK, then we might have an issue on our end again :) Thanks!

@jakirkham
Author

@cappadona is this working correctly for other channels?

Think it would be good to double check these are all handled correctly (others may have suggestions):

  • bioconda
  • defaults
  • nvidia
  • pytorch
  • rapidsai
  • rapidsai-nightly

@jakirkham
Author

Also worth noting RAPIDS is switching to publishing .conda packages. So we will want to make sure they are picked up in the statistics here

@h-vetinari

> > We've dropped the faulty data from our end. Any chance you are going to backfill data from the past? It looks pretty weird now, because some packages that had releases only have 1 measuring point.
>
> Hi Wolf, yes we are planning to backfill past data and we will be sharing details at this week's conda community sync.

Any updates on the backfill?

I tried to run the by-the-numbers binder again, and

dd.read_parquet("s3://anaconda-package-data/conda/hourly/2024/06/2024-06-*.parquet",storage_options={'anon': True})

returns an empty data frame, and so do all months after June (whereas the months up until May 2024 are fine).

I've loosened the match to

dd.read_parquet("s3://anaconda-package-data/conda/hourly/2024/06/*.parquet",storage_options={'anon': True})

and still nothing.
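
To check month by month which paths actually return data, the glob for each month can be generated rather than typed by hand. The sketch below only builds the path strings, following the layout quoted above. The hourly_glob and month_range helpers are hypothetical names, and no S3 access happens in the block itself.

```python
def hourly_glob(year, month, bucket="anaconda-package-data"):
    """Build the S3 glob for one month of hourly parquet files.

    The layout mirrors the path quoted in this comment:
    s3://<bucket>/conda/hourly/<YYYY>/<MM>/*.parquet
    """
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return f"s3://{bucket}/conda/hourly/{year:04d}/{month:02d}/*.parquet"


def month_range(start, end):
    """Yield (year, month) tuples from `start` to `end`, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


paths = [hourly_glob(y, m) for y, m in month_range((2024, 4), (2024, 7))]
for path in paths:
    print(path)
    # Each path could then be passed to, for example:
    # dd.read_parquet(path, storage_options={"anon": True})
```

Wrapping the commented-out read in a try/except per month would show exactly where the data stops, instead of one empty frame for the whole range.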

@jakirkham
Author

Asked about this at the Conda community meeting earlier this week and it sounds like they are working through some issues


9 participants