From 12058f259696037592eaadd6f4b5946cf2baab87 Mon Sep 17 00:00:00 2001
From: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
Date: Thu, 23 Nov 2023 05:32:17 -0500
Subject: [PATCH] docs(pypi-metadata-post): add Fortran pattern and fix regex

---
 .../index/execute-results/html.json | 4 ++--
 .../index.qmd | 24 ++++++++++++-------
 2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/docs/_freeze/posts/querying-pypi-metadata-compiled-languages/index/execute-results/html.json b/docs/_freeze/posts/querying-pypi-metadata-compiled-languages/index/execute-results/html.json
index 7aaa4cd78a33..e9681446ea78 100644
--- a/docs/_freeze/posts/querying-pypi-metadata-compiled-languages/index/execute-results/html.json
+++ b/docs/_freeze/posts/querying-pypi-metadata-compiled-languages/index/execute-results/html.json
@@ -1,7 +1,7 @@
 {
-  "hash": "56b4893eaa1fa706a74bf40277abf190",
+  "hash": "abc5bd2dbe85c1732e0b3f1e4a1e0353",
   "result": {
-    "markdown": "---\ntitle: Querying every file in every release on the Python Package Index (redux)\nauthor: Gil Forsyth\ndate: 2023-11-15\ncategories:\n - blog\n---\n\nSeth Larson wrote a great [blog\npost](https://sethmlarson.dev/security-developer-in-residence-weekly-report-18)\non querying a PyPI dataset to look for trends in the use of memory-safe\nlanguages in Python.\n\nCheck out Seth's article for more information on the dataset (and\nit's a good read!). 
It caught our eye because it makes use of\n[DuckDB](https://duckdb.org/) to clean the data for analysis.\n\nThat's right up our alley here in Ibis land, so let's see if we can duplicate\nSeth's results (and then continue on to plot them!)\n\n## Grab the data (locations)\n\nSeth showed (and then safely decomposed) a nested `curl` statement and that's\nalways viable -- we're in Python land so why not grab the filenames using\n`urllib3`?\n\n::: {#d6525f66 .cell execution_count=1}\n``` {.python .cell-code}\nimport urllib3\n\nhttp = urllib3.PoolManager()\n\nresp = http.request(\"GET\", \"https://github.com/pypi-data/data/raw/main/links/dataset.txt\")\n\nparquet_files = resp.data.decode().split()\nparquet_files\n```\n\n::: {.cell-output .cell-output-display execution_count=29}\n```\n['https://github.com/pypi-data/data/releases/download/2023-11-16-03-06/index-0.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-16-03-06/index-1.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-16-03-06/index-10.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-16-03-06/index-11.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-16-03-06/index-12.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-16-03-06/index-13.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-16-03-06/index-14.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-16-03-06/index-2.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-16-03-06/index-3.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-16-03-06/index-4.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-16-03-06/index-5.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-16-03-06/index-6.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-16-03-06/index-7.parquet',\n 
'https://github.com/pypi-data/data/releases/download/2023-11-16-03-06/index-8.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-16-03-06/index-9.parquet']\n```\n:::\n:::\n\n\n## Grab the data\n\nNow we're ready to get started with Ibis!\n\nDuckDB is clever enough to grab only the parquet metadata. This means we can\nuse `read_parquet` to create a lazy view of the parquet files and then build up\nour expression without downloading everything beforehand!\n\n::: {#4980d892 .cell execution_count=2}\n``` {.python .cell-code}\nimport ibis\nfrom ibis import _ # <1>\n\nibis.options.interactive = True\n```\n:::\n\n\n1. See https://ibis-project.org/how-to/analytics/chain_expressions.html for docs\non the deferred operator!\n\nCreate a DuckDB connection:\n\n::: {#f5c245b2 .cell execution_count=3}\n``` {.python .cell-code}\ncon = ibis.duckdb.connect()\n```\n:::\n\n\nAnd load up one of the files (we can run the full query after)!\n\n::: {#ccd86d1f .cell execution_count=4}\n``` {.python .cell-code}\npypi = con.read_parquet(parquet_files[0], table_name=\"pypi\")\n```\n:::\n\n\n::: {#2ec712cd .cell execution_count=5}\n``` {.python .cell-code}\npypi.schema()\n```\n\n::: {.cell-output .cell-output-display execution_count=33}\n```\nibis.Schema {\n project_name string\n project_version string\n project_release string\n uploaded_on timestamp\n path string\n archive_path string\n size uint64\n hash binary\n skip_reason string\n lines uint64\n repository uint32\n}\n```\n:::\n:::\n\n\n## Query crafting\n\nLet's break down what we're looking for. As a high-level view of the use of\ncompiled languages, Seth is using file extensions as an indicator that a given\nfiletype is used in a Python project.\n\nThe dataset we're using has _every file in every project_ -- what criteria should we use?\n\nWe can follow Seth's lead and look for things:\n\n1. 
A file extension that is one of: `asm`, `cc`, `cpp`, `cxx`, `h`, `hpp`, `rs`, `go`, and variants of `F90`, `f90`, etc...\n That is, C, C++, Assembly, Rust, Go, and Fortran.\n2. We exclude matches where the file path is within the `site-packages/` directory.\n3. We exclude matches that are in directories used for testing.\n\n::: {#4c189d14 .cell execution_count=6}\n``` {.python .cell-code}\nexpr = pypi.filter(\n [\n _.path.re_search(r\"\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0-2}(?:or)?|go)$\"),\n ~_.path.re_search(r\"(^|/)test(|s|ing)\"),\n ~_.path.contains(\"/site-packages/\"),\n ]\n)\nexpr\n```\n\n::: {.cell-output .cell-output-display execution_count=34}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓\n┃ project_name ┃ project_version ┃ project_release ┃ uploaded_on ┃ path ┃ archive_path ┃ size ┃ hash ┃ skip_reason ┃ lines ┃ repository ┃\n┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩\n│ string │ string │ string │ timestamp │ string │ string │ uint64 │ binary │ string │ uint64 │ uint32 │\n├─────────────────┼─────────────────┼──────────────────────────────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┼────────┼──────────────────────────────────────────────────────────────────────┼─────────────┼────────┼────────────┤\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/support.c │ 1607 │ b'\\xca\\x0c\\xf2\\\\R\\x83\\xefS\\x0c\\xe4\\x0c\\x15`\\x1fM\\x16\"\\x93\\x88\\x08' │ ~ │ 66 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/stemmer.c │ 5054 │ 
b'y\\xc3A\\x12\\x17\\xd4\\xeb\\xbb\\xcfan\\xfd\\x80\\xbac\\x18\\xcf\\xc0W\\x9a' │ ~ │ 230 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 313 │ b'\\x81s\\xa1t\\x86}\\xf9\\xe5\\xb5Zt\\xcb\\xd3\\xae\\nHfe\\x8c\\x9d' │ ~ │ 16 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 80922 │ b'\\xae<\\xc7f\\x02\\xc5{\\xc50\\xf4\\xdc\\x8fa\\x1a\\t..k\\xd5\\x9d' │ ~ │ 2205 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 313 │ b'\\x14D\\xeb\\xb4\\x9ac\\xab\\x14:b\\xa4\\xba\\xa5\\x9f\\x1f\\x06\\xce\\x0bj\\xf2' │ ~ │ 16 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 10684 │ b\"!\\xa259'\\x94\\xc7.\\x16\\x0b\\x08\\x95J\\x0e\\xef\\x86{\\x0e\\xd6\\x8f\" │ ~ │ 309 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 313 │ b'\\x10W.\\xcc7\\x08=WV\\xde\\x1bP9\\x03w\\x03\\xa2\\x8c\\xe7\\xec' │ ~ │ 16 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ 
packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 41620 │ b'\\x95P\\xd6|\\x85\\x97\\xb2H\\x14\\xa0d<q-iu\\xc1\\x98h\\xbb' │ ~ │ 1097 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 313 │ b'N\\xf7t\\xdd\\xcc\\xbb8Y\\x0b\\xbc\\xd5No_\\x8d\\xc7\\xf2\\x80\\x10\\xd0' │ ~ │ 16 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 25440 │ b'o\\n\\x96M+\\xb0\\xfbV\\xaa6<*\\xc8\\xb0B\\x03\\x8a\\xa9\\xc3\\x10' │ ~ │ 694 │ 1 │\n│ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │\n└─────────────────┴─────────────────┴──────────────────────────────┴─────────────────────────┴──────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────┴────────┴──────────────────────────────────────────────────────────────────────┴─────────────┴────────┴────────────┘\n\n```\n:::\n:::\n\n\nThat _could_ be right -- we can peak at the filename at the end of the `path` column to do a quick check:\n\n::: {#72c66b62 .cell execution_count=7}\n``` {.python .cell-code}\nexpr.path.split(\"/\")[-1]\n```\n\n::: {.cell-output .cell-output-display execution_count=35}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayIndex(StringSplit(path, '/'), -1) ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │\n├────────────────────────────────────────┤\n│ support.c │\n│ stemmer.c │\n│ stem_UTF_8_turkish.h │\n│ stem_UTF_8_turkish.c │\n│ stem_UTF_8_swedish.h │\n│ stem_UTF_8_swedish.c │\n│ stem_UTF_8_spanish.h │\n│ stem_UTF_8_spanish.c │\n│ stem_UTF_8_russian.h │\n│ stem_UTF_8_russian.c │\n│ … │\n└────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\nOk! Next up, we want to group the matches by:\n\n1. The month that the package / file was published\n For this, we can use the `truncate` method and ask for month as our truncation window.\n2. The file extension of the file used\n\n::: {#9b81c700 .cell execution_count=8}\n``` {.python .cell-code}\nexpr.group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1),\n).aggregate()\n```\n\n::: {.cell-output .cell-output-display execution_count=36}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓\n┃ month ┃ ext ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩\n│ timestamp │ string │\n├─────────────────────┼────────┤\n│ 2015-12-01 00:00:00 │ h │\n│ 2015-10-01 00:00:00 │ c │\n│ 2015-11-01 00:00:00 │ cpp │\n│ 2015-12-01 00:00:00 │ hpp │\n│ 2016-01-01 00:00:00 │ cc │\n│ 2016-07-01 00:00:00 │ c │\n│ 2017-03-01 00:00:00 │ c │\n│ 2016-09-01 00:00:00 │ cc │\n│ 2013-09-01 00:00:00 │ cpp │\n│ 2013-08-01 00:00:00 │ asm │\n│ … │ … │\n└─────────────────────┴────────┘\n\n```\n:::\n:::\n\n\nThat looks promising. Now we need to grab the package names that correspond to a\ngiven file extension in a given month and deduplicate it. And to match Seth's\nresults, we'll also sort by the month in descending order:\n\n::: {#837aae90 .cell execution_count=9}\n``` {.python .cell-code}\nexpr = (\n expr.group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1),\n )\n .aggregate(projects=_.project_name.collect().unique())\n .order_by(_.month.desc())\n)\n\nexpr\n```\n\n::: {.cell-output .cell-output-display execution_count=37}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ month ┃ ext ┃ projects ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ timestamp │ string │ array<string> │\n├─────────────────────┼────────┼──────────────────────────────────────────────┤\n│ 2017-07-01 00:00:00 │ c │ ['discretize', 'djangoforandroid', ... +262] │\n│ 2017-07-01 00:00:00 │ asm │ ['fibers', 'SyntheSys', ... +6] │\n│ 2017-07-01 00:00:00 │ rs │ ['tokio', 'xmldirector.plonecore', ... +2] │\n│ 2017-07-01 00:00:00 │ cpp │ ['diffpy.srreal', 'fastdtw', ... +108] │\n│ 2017-07-01 00:00:00 │ cxx │ ['amplpy', 'CPyCppyy', ... +8] │\n│ 2017-07-01 00:00:00 │ go │ ['django-instant', 'django-mqueue', ... +5] │\n│ 2017-07-01 00:00:00 │ cc │ ['dyNET', 'george', ... +14] │\n│ 2017-07-01 00:00:00 │ h │ ['fastmat', 'ffpyplayer', ... +222] │\n│ 2017-07-01 00:00:00 │ hpp │ ['diffpy.srreal', 'glwindow', ... +19] │\n│ 2017-06-01 00:00:00 │ cc │ ['grpcio', 'kenlm', ... +32] │\n│ … │ … │ … │\n└─────────────────────┴────────┴──────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## Massage and plot\n\nLet's continue and see what our results look like.\n\nWe'll do a few things:\n\n1. Combine all of the C and C++ extensions into a single group by renaming them all.\n2. Count the number of distinct entries in each group\n3. Plot the results!\n\n::: {#f0c6000f .cell execution_count=10}\n``` {.python .cell-code}\ncollapse_names = expr.mutate(\n ext=_.ext.re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\"),\n)\n\ncollapse_names\n```\n\n::: {.cell-output .cell-output-display execution_count=38}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ month ┃ ext ┃ projects ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ timestamp │ string │ array<string> │\n├─────────────────────┼──────────┼─────────────────────────────────────────────────────┤\n│ 2017-07-01 00:00:00 │ Assembly │ ['pwntools', 'fibers', ... +6] │\n│ 2017-07-01 00:00:00 │ Rust │ ['rust-pypi-example', 'tokio', ... +2] │\n│ 2017-07-01 00:00:00 │ C/C++ │ ['newrelic', 'nuclitrack', ... +262] │\n│ 2017-07-01 00:00:00 │ C/C++ │ ['pipcudemo', 'pyDEM', ... +108] │\n│ 2017-07-01 00:00:00 │ C/C++ │ ['numba', 'p4d', ... +222] │\n│ 2017-07-01 00:00:00 │ C/C++ │ ['pyemd', 'pogeo', ... +19] │\n│ 2017-07-01 00:00:00 │ C/C++ │ ['nixio', 'yawinpty', ... +14] │\n│ 2017-07-01 00:00:00 │ C/C++ │ ['pytetgen', 'python-libsbml-experimental', ... +8] │\n│ 2017-07-01 00:00:00 │ Go │ ['pre-commit', 'django-instant', ... +5] │\n│ 2017-06-01 00:00:00 │ C/C++ │ ['gippy', 'halotools', ... +148] │\n│ … │ … │ … │\n└─────────────────────┴──────────┴─────────────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\nNote that now we need to de-duplicate again, since we might've had separate\nunique entries for both an `h` and `c` file extension, and we don't want to\ndouble-count!\n\nWe could rewrite our original query and include the renames in the original\n`group_by` (this would be the smart thing to do), but let's push on and see if\nwe can make this work.\n\nThe `projects` column is now a column of string arrays, so we want to collect\nall of the arrays in each group, this will give us a \"list of lists\", then we'll\n`flatten` that list and call `unique().length()` as before.\n\nDuckDB has a `flatten` function, but it isn't exposed in Ibis (yet!).\n\nWe'll use a handy bit of Ibis magic to define a `builtin` `UDF` that will map directly\nonto the underlying DuckDB function (what!? 
See\n[here](https://ibis-project.org/how-to/extending/builtin.html#duckdb) for more\ninfo):\n\n::: {#73b19bd8 .cell execution_count=11}\n``` {.python .cell-code}\n@ibis.udf.scalar.builtin\ndef flatten(x: list[list[str]]) -> list[str]:\n ...\n\n\ncollapse_names = collapse_names.group_by([\"month\", \"ext\"]).aggregate(\n projects=flatten(_.projects.collect())\n)\n\ncollapse_names\n```\n\n::: {.cell-output .cell-output-display execution_count=39}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ month ┃ ext ┃ projects ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ timestamp │ string │ array<string> │\n├─────────────────────┼──────────┼────────────────────────────────────────────────┤\n│ 2011-01-01 00:00:00 │ C/C++ │ ['simplejson', 'simplerandom', ... +82] │\n│ 2011-01-01 00:00:00 │ Assembly │ ['pycryptopp'] │\n│ 2010-08-01 00:00:00 │ C/C++ │ ['tokyo-python', 'regex', ... +85] │\n│ 2010-07-01 00:00:00 │ C/C++ │ ['scikits.audiolab', 'svectors', ... +108] │\n│ 2010-05-01 00:00:00 │ C/C++ │ ['tornado', 'rl', ... +74] │\n│ 2010-03-01 00:00:00 │ C/C++ │ ['ThreadLock', 'yajl', ... +99] │\n│ 2009-07-01 00:00:00 │ C/C++ │ ['gevent', 'hashlib', ... +52] │\n│ 2012-12-01 00:00:00 │ C/C++ │ ['pyFaceTracker', 'pycpx', ... +149] │\n│ 2012-09-01 00:00:00 │ C/C++ │ ['gdmodule', 'gevent', ... +154] │\n│ 2012-07-01 00:00:00 │ C/C++ │ ['dm.xmlsec.binding', 'eea.exhibit', ... +110] │\n│ … │ … │ … │\n└─────────────────────┴──────────┴────────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\nWe could have included the `unique().length()` in the `aggregate` call, but\nsometimes it's good to check that your slightly off-kilter idea has worked (and\nit has!).\n\n::: {#8a2592b6 .cell execution_count=12}\n``` {.python .cell-code}\ncollapse_names = collapse_names.select(\n _.month, _.ext, project_count=_.projects.unique().length()\n)\n\ncollapse_names\n```\n\n::: {.cell-output .cell-output-display execution_count=40}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n┃ month ┃ ext ┃ project_count ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n│ timestamp │ string │ int64 │\n├─────────────────────┼──────────┼───────────────┤\n│ 2005-05-01 00:00:00 │ C/C++ │ 1 │\n│ 2005-03-01 00:00:00 │ C/C++ │ 1 │\n│ 2009-07-01 00:00:00 │ C/C++ │ 35 │\n│ 2009-07-01 00:00:00 │ Assembly │ 1 │\n│ 2008-06-01 00:00:00 │ C/C++ │ 28 │\n│ 2016-05-01 00:00:00 │ Rust │ 3 │\n│ 2015-11-01 00:00:00 │ C/C++ │ 325 │\n│ 2015-11-01 00:00:00 │ Assembly │ 3 │\n│ 2015-10-01 00:00:00 │ Rust │ 1 │\n│ 2015-09-01 00:00:00 │ Go │ 3 │\n│ … │ … │ … │\n└─────────────────────┴──────────┴───────────────┘\n\n```\n:::\n:::\n\n\nNow that the data are tidied, we can pass our expression directly to Altair and see what it looks like!\n\n::: {#7d774d0b .cell execution_count=13}\n``` {.python .cell-code}\nimport altair as alt\n\nchart = (\n alt.Chart(collapse_names)\n .mark_line()\n .encode(x=\"month\", y=\"project_count\", color=\"ext\")\n .properties(width=600, height=300)\n)\nchart\n```\n\n::: {.cell-output .cell-output-display execution_count=41}\n```{=html}\n\n\n\n\n```\n:::\n:::\n\n\nThat looks good, but it definitely doesn't match the plot from Seth's post:\n\n![upstream plot](upstream_plot.png)\n\nOur current plot is only showing the results from a subset of the available\ndata. 
Now that our expression is complete, we can re-run on the full dataset and\ncompare.\n\n## The full run\n\nTo recap -- we pulled a lazy view of a single parquet file from the `pypi-data`\nrepo, filtered for all the files that contain file extensions we care about,\nthen grouped them all together to get counts of the various filetypes used\nacross projects by month.\n\nHere's the entire query chained together into a single command, now running on\nall of the `parquet` files we have access to:\n\n::: {#1b12b5c1 .cell execution_count=14}\n``` {.python .cell-code}\npypi = con.read_parquet(parquet_files, table_name=\"pypi\")\n\nfull_query = (\n pypi.filter(\n [\n _.path.re_search(\n r\"\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0-2}(?:or)?|go)$\"\n ),\n ~_.path.re_search(r\"(^|/)test(|s|ing)\"),\n ~_.path.contains(\"/site-packages/\"),\n ]\n )\n .group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1),\n )\n .aggregate(projects=_.project_name.collect().unique())\n .order_by(_.month.desc())\n .mutate(\n ext=_.ext.re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\"),\n )\n .group_by([\"month\", \"ext\"])\n .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n)\nchart = (\n alt.Chart(full_query)\n .mark_line()\n .encode(x=\"month\", y=\"project_count\", color=\"ext\")\n .properties(width=600, height=300)\n)\nchart\n```\n\n::: {.cell-output .cell-output-display execution_count=42}\n```{=html}\n\n\n\n\n```\n:::\n:::\n\n\n",
+    "markdown": "---\ntitle: Querying every file in every release on the Python Package Index (redux)\nauthor: Gil Forsyth\ndate: 2023-11-15\ncategories:\n - blog\n---\n\nSeth Larson wrote a great [blog\npost](https://sethmlarson.dev/security-developer-in-residence-weekly-report-18)\non querying a PyPI dataset to look for trends in the use of memory-safe\nlanguages in Python.\n\nCheck out Seth's article for more 
information on the dataset (and\nit's a good read!). It caught our eye because it makes use of\n[DuckDB](https://duckdb.org/) to clean the data for analysis.\n\nThat's right up our alley here in Ibis land, so let's see if we can duplicate\nSeth's results (and then continue on to plot them!)\n\n## Grab the data (locations)\n\nSeth showed (and then safely decomposed) a nested `curl` statement and that's\nalways viable -- we're in Python land so why not grab the filenames using\n`urllib3`?\n\n::: {#81f38c7a .cell execution_count=1}\n``` {.python .cell-code}\nimport urllib3\n\nurl = \"https://raw.githubusercontent.com/pypi-data/data/main/links/dataset.txt\"\n\nwith urllib3.PoolManager() as http:\n resp = http.request(\"GET\", url)\n\nparquet_files = resp.data.decode().split()\nparquet_files\n```\n\n::: {.cell-output .cell-output-display execution_count=9}\n```\n['https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-0.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-1.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-10.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-11.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-12.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-13.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-14.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-2.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-3.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-4.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-5.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-6.parquet',\n 
'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-7.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-8.parquet',\n 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-9.parquet']\n```\n:::\n:::\n\n\n## Grab the data\n\nNow we're ready to get started with Ibis!\n\nDuckDB is clever enough to grab only the parquet metadata. This means we can\nuse `read_parquet` to create a lazy view of the parquet files and then build up\nour expression without downloading everything beforehand!\n\n::: {#f593faf1 .cell execution_count=2}\n``` {.python .cell-code}\nimport ibis\nfrom ibis import _ # <1>\n\nibis.options.interactive = True\n```\n:::\n\n\n1. See https://ibis-project.org/how-to/analytics/chain_expressions.html for docs\non the deferred operator!\n\nCreate a DuckDB connection:\n\n::: {#2c0c0fab .cell execution_count=3}\n``` {.python .cell-code}\ncon = ibis.duckdb.connect()\n```\n:::\n\n\nAnd load up one of the files (we can run the full query after)!\n\n::: {#1237bd7c .cell execution_count=4}\n``` {.python .cell-code}\npypi = con.read_parquet(parquet_files[0], table_name=\"pypi\")\n```\n:::\n\n\n::: {#1f81de21 .cell execution_count=5}\n``` {.python .cell-code}\npypi.schema()\n```\n\n::: {.cell-output .cell-output-display execution_count=13}\n```\nibis.Schema {\n project_name string\n project_version string\n project_release string\n uploaded_on timestamp\n path string\n archive_path string\n size uint64\n hash binary\n skip_reason string\n lines uint64\n repository uint32\n}\n```\n:::\n:::\n\n\n## Query crafting\n\nLet's break down what we're looking for. As a high-level view of the use of\ncompiled languages, Seth is using file extensions as an indicator that a given\nfiletype is used in a Python project.\n\nThe dataset we're using has _every file in every project_ -- what criteria should we use?\n\nWe can follow Seth's lead and look for things:\n\n1. 
A file extension that is one of: `asm`, `cc`, `cpp`, `cxx`, `h`, `hpp`, `rs`, `go`, and variants of `F90`, `f90`, etc...\n That is, C, C++, Assembly, Rust, Go, and Fortran.\n2. We exclude matches where the file path is within the `site-packages/` directory.\n3. We exclude matches that are in directories used for testing.\n\n::: {#7e8640fd .cell execution_count=6}\n``` {.python .cell-code}\nexpr = pypi.filter(\n [\n _.path.re_search(r\"\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\"),\n ~_.path.re_search(r\"(^|/)test(|s|ing)\"),\n ~_.path.contains(\"/site-packages/\"),\n ]\n)\nexpr\n```\n\n::: {.cell-output .cell-output-display execution_count=14}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓\n┃ project_name ┃ project_version ┃ project_release ┃ uploaded_on ┃ path ┃ archive_path ┃ size ┃ hash ┃ skip_reason ┃ lines ┃ repository ┃\n┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩\n│ string │ string │ string │ timestamp │ string │ string │ uint64 │ binary │ string │ uint64 │ uint32 │\n├─────────────────┼─────────────────┼──────────────────────────────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┼────────┼──────────────────────────────────────────────────────────────────────┼─────────────┼────────┼────────────┤\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/support.c │ 1607 │ b'\\xca\\x0c\\xf2\\\\R\\x83\\xefS\\x0c\\xe4\\x0c\\x15`\\x1fM\\x16\"\\x93\\x88\\x08' │ ~ │ 66 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/stemmer.c │ 5054 │ 
b'y\\xc3A\\x12\\x17\\xd4\\xeb\\xbb\\xcfan\\xfd\\x80\\xbac\\x18\\xcf\\xc0W\\x9a' │ ~ │ 230 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 313 │ b'\\x81s\\xa1t\\x86}\\xf9\\xe5\\xb5Zt\\xcb\\xd3\\xae\\nHfe\\x8c\\x9d' │ ~ │ 16 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 80922 │ b'\\xae<\\xc7f\\x02\\xc5{\\xc50\\xf4\\xdc\\x8fa\\x1a\\t..k\\xd5\\x9d' │ ~ │ 2205 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 313 │ b'\\x14D\\xeb\\xb4\\x9ac\\xab\\x14:b\\xa4\\xba\\xa5\\x9f\\x1f\\x06\\xce\\x0bj\\xf2' │ ~ │ 16 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 10684 │ b\"!\\xa259'\\x94\\xc7.\\x16\\x0b\\x08\\x95J\\x0e\\xef\\x86{\\x0e\\xd6\\x8f\" │ ~ │ 309 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 313 │ b'\\x10W.\\xcc7\\x08=WV\\xde\\x1bP9\\x03w\\x03\\xa2\\x8c\\xe7\\xec' │ ~ │ 16 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ 
packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 41620 │ b'\\x95P\\xd6|\\x85\\x97\\xb2H\\x14\\xa0d<q-iu\\xc1\\x98h\\xbb' │ ~ │ 1097 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 313 │ b'N\\xf7t\\xdd\\xcc\\xbb8Y\\x0b\\xbc\\xd5No_\\x8d\\xc7\\xf2\\x80\\x10\\xd0' │ ~ │ 16 │ 1 │\n│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 25440 │ b'o\\n\\x96M+\\xb0\\xfbV\\xaa6<*\\xc8\\xb0B\\x03\\x8a\\xa9\\xc3\\x10' │ ~ │ 694 │ 1 │\n│ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │\n└─────────────────┴─────────────────┴──────────────────────────────┴─────────────────────────┴──────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────┴────────┴──────────────────────────────────────────────────────────────────────┴─────────────┴────────┴────────────┘\n\n```\n:::\n:::\n\n\nThat _could_ be right -- we can peak at the filename at the end of the `path` column to do a quick check:\n\n::: {#d8a4bc9b .cell execution_count=7}\n``` {.python .cell-code}\nexpr.path.split(\"/\")[-1]\n```\n\n::: {.cell-output .cell-output-display execution_count=15}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayIndex(StringSplit(path, '/'), -1) ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │\n├────────────────────────────────────────┤\n│ support.c │\n│ stemmer.c │\n│ stem_UTF_8_turkish.h │\n│ stem_UTF_8_turkish.c │\n│ stem_UTF_8_swedish.h │\n│ stem_UTF_8_swedish.c │\n│ stem_UTF_8_spanish.h │\n│ stem_UTF_8_spanish.c │\n│ stem_UTF_8_russian.h │\n│ stem_UTF_8_russian.c │\n│ … │\n└────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\nOk! Next up, we want to group the matches by:\n\n1. The month that the package / file was published\n For this, we can use the `truncate` method and ask for month as our truncation window.\n2. The file extension of the file used\n\n::: {#edc6d70f .cell execution_count=8}\n``` {.python .cell-code}\nexpr.group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1),\n).aggregate()\n```\n\n::: {.cell-output .cell-output-display execution_count=16}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓\n┃ month ┃ ext ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩\n│ timestamp │ string │\n├─────────────────────┼────────┤\n│ 2015-12-01 00:00:00 │ cpp │\n│ 2015-11-01 00:00:00 │ h │\n│ 2015-12-01 00:00:00 │ f90 │\n│ 2015-11-01 00:00:00 │ hpp │\n│ 2010-11-01 00:00:00 │ c │\n│ 2010-07-01 00:00:00 │ h │\n│ 2010-12-01 00:00:00 │ cpp │\n│ 2010-03-01 00:00:00 │ h │\n│ 2011-08-01 00:00:00 │ cpp │\n│ 2010-05-01 00:00:00 │ h │\n│ … │ … │\n└─────────────────────┴────────┘\n\n```\n:::\n:::\n\n\nThat looks promising. Now we need to grab the package names that correspond to a\ngiven file extension in a given month and deduplicate it. And to match Seth's\nresults, we'll also sort by the month in descending order:\n\n::: {#02d5af05 .cell execution_count=9}\n``` {.python .cell-code}\nexpr = (\n expr.group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1),\n )\n .aggregate(projects=_.project_name.collect().unique())\n .order_by(_.month.desc())\n)\n\nexpr\n```\n\n::: {.cell-output .cell-output-display execution_count=17}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ month ┃ ext ┃ projects ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ timestamp │ string │ array<string> │\n├─────────────────────┼────────┼─────────────────────────────────────────────────────┤\n│ 2017-07-01 00:00:00 │ c │ ['newrelic', 'nuclitrack', ... +262] │\n│ 2017-07-01 00:00:00 │ asm │ ['pwntools', 'fibers', ... +6] │\n│ 2017-07-01 00:00:00 │ rs │ ['rust-pypi-example', 'tokio', ... +2] │\n│ 2017-07-01 00:00:00 │ f │ ['okada-wrapper', 'numpy', ... +6] │\n│ 2017-07-01 00:00:00 │ cpp │ ['pipcudemo', 'pyDEM', ... +108] │\n│ 2017-07-01 00:00:00 │ f90 │ ['pySpecData', 'numpy', ... +8] │\n│ 2017-07-01 00:00:00 │ cxx │ ['pytetgen', 'python-libsbml-experimental', ... +8] │\n│ 2017-07-01 00:00:00 │ go │ ['pre-commit', 'django-instant', ... +5] │\n│ 2017-07-01 00:00:00 │ cc │ ['nixio', 'pogeo', ... +14] │\n│ 2017-07-01 00:00:00 │ h │ ['numba', 'p4d', ... +222] │\n│ … │ … │ … │\n└─────────────────────┴────────┴─────────────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## Massage and plot\n\nLet's continue and see what our results look like.\n\nWe'll do a few things:\n\n1. Combine all of the C and C++ extensions into a single group by renaming them all.\n2. Count the number of distinct entries in each group\n3. Plot the results!\n\n::: {#e8f84935 .cell execution_count=10}\n``` {.python .cell-code}\ncollapse_names = expr.mutate(\n ext=_.ext.re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .re_replace(\"^f.*$\", \"Fortran\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n).dropna(\"ext\")\n\ncollapse_names\n```\n\n::: {.cell-output .cell-output-display execution_count=18}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ month ┃ ext ┃ projects ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ timestamp │ string │ array<string> │\n├─────────────────────┼──────────┼─────────────────────────────────────────────────────┤\n│ 2017-07-01 00:00:00 │ C/C++ │ ['pipcudemo', 'pyDEM', ... +108] │\n│ 2017-07-01 00:00:00 │ Fortran │ ['numpy', 'pySpecData', ... +8] │\n│ 2017-07-01 00:00:00 │ Go │ ['pre-commit', 'ronin', ... +5] │\n│ 2017-07-01 00:00:00 │ C/C++ │ ['pytetgen', 'python-libsbml-experimental', ... +8] │\n│ 2017-07-01 00:00:00 │ C/C++ │ ['newrelic', 'nuclitrack', ... +262] │\n│ 2017-07-01 00:00:00 │ Fortran │ ['okada-wrapper', 'numpy', ... +6] │\n│ 2017-07-01 00:00:00 │ Assembly │ ['pwntools', 'xmldirector.plonecore', ... +6] │\n│ 2017-07-01 00:00:00 │ Rust │ ['rust-pypi-example', 'tokio', ... +2] │\n│ 2017-07-01 00:00:00 │ C/C++ │ ['numba', 'numpythia', ... +222] │\n│ 2017-07-01 00:00:00 │ C/C++ │ ['pyemd', 'pogeo', ... +19] │\n│ … │ … │ … │\n└─────────────────────┴──────────┴─────────────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\nNote that now we need to de-duplicate again, since we might've had separate\nunique entries for both an `h` and `c` file extension, and we don't want to\ndouble-count!\n\nWe could rewrite our original query and include the renames in the original\n`group_by` (this would be the smart thing to do), but let's push on and see if\nwe can make this work.\n\nThe `projects` column is now a column of string arrays, so we want to collect\nall of the arrays in each group, this will give us a \"list of lists\", then we'll\n`flatten` that list and call `unique().length()` as before.\n\nDuckDB has a `flatten` function, but it isn't exposed in Ibis (yet!).\n\nWe'll use a handy bit of Ibis magic to define a `builtin` `UDF` that will map directly\nonto the underlying DuckDB function (what!? 
See\n[here](https://ibis-project.org/how-to/extending/builtin.html#duckdb) for more\ninfo):\n\n::: {#42a08cd7 .cell execution_count=11}\n``` {.python .cell-code}\n@ibis.udf.scalar.builtin\ndef flatten(x: list[list[str]]) -> list[str]:\n ...\n\n\ncollapse_names = collapse_names.group_by([\"month\", \"ext\"]).aggregate(\n projects=flatten(_.projects.collect())\n)\n\ncollapse_names\n```\n\n::: {.cell-output .cell-output-display execution_count=19}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ month ┃ ext ┃ projects ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ timestamp │ string │ array<string> │\n├─────────────────────┼──────────┼─────────────────────────────────────────────┤\n│ 2009-07-01 00:00:00 │ C/C++ │ ['gevent', 'hashlib', ... +52] │\n│ 2009-07-01 00:00:00 │ Assembly │ ['pycryptopp'] │\n│ 2008-10-01 00:00:00 │ Fortran │ ['numscons'] │\n│ 2008-08-01 00:00:00 │ Fortran │ ['numscons'] │\n│ 2008-06-01 00:00:00 │ C/C++ │ ['dm.incrementalsearch', 'Cython', ... +45] │\n│ 2008-05-01 00:00:00 │ Fortran │ ['numscons'] │\n│ 2007-03-01 00:00:00 │ Fortran │ ['Model-Builder'] │\n│ 2005-05-01 00:00:00 │ C/C++ │ ['ll-xist', 'll-xist'] │\n│ 2005-03-01 00:00:00 │ C/C++ │ ['pygenx', 'pygenx'] │\n│ 2011-08-01 00:00:00 │ Fortran │ ['pysces', 'ffnet'] │\n│ … │ … │ … │\n└─────────────────────┴──────────┴─────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\nWe could have included the `unique().length()` in the `aggregate` call, but\nsometimes it's good to check that your slightly off-kilter idea has worked (and\nit has!).\n\n::: {#ec05bdc8 .cell execution_count=12}\n``` {.python .cell-code}\ncollapse_names = collapse_names.select(\n _.month, _.ext, project_count=_.projects.unique().length()\n)\n\ncollapse_names\n```\n\n::: {.cell-output .cell-output-display execution_count=20}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n┃ month ┃ ext ┃ project_count ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n│ timestamp │ string │ int64 │\n├─────────────────────┼─────────┼───────────────┤\n│ 2007-03-01 00:00:00 │ C/C++ │ 6 │\n│ 2006-01-01 00:00:00 │ C/C++ │ 5 │\n│ 2005-10-01 00:00:00 │ C/C++ │ 2 │\n│ 2011-08-01 00:00:00 │ C/C++ │ 57 │\n│ 2011-03-01 00:00:00 │ C/C++ │ 63 │\n│ 2011-01-01 00:00:00 │ Fortran │ 2 │\n│ 2010-12-01 00:00:00 │ C/C++ │ 48 │\n│ 2010-08-01 00:00:00 │ Fortran │ 1 │\n│ 2010-07-01 00:00:00 │ Fortran │ 3 │\n│ 2010-03-01 00:00:00 │ Fortran │ 1 │\n│ … │ … │ … │\n└─────────────────────┴─────────┴───────────────┘\n\n```\n:::\n:::\n\n\nNow that the data are tidied, we can pass our expression directly to Altair and see what it looks like!\n\n::: {#c5073a65 .cell execution_count=13}\n``` {.python .cell-code}\nimport altair as alt\n\nchart = (\n alt.Chart(collapse_names.to_pandas())\n .mark_line()\n .encode(x=\"month\", y=\"project_count\", color=\"ext\")\n .properties(width=600, height=300)\n)\nchart\n```\n\n::: {.cell-output .cell-output-display execution_count=21}\n```{=html}\n\n\n\n\n```\n:::\n:::\n\n\nThat looks good, but it definitely doesn't match the plot from Seth's post:\n\n![upstream plot](upstream_plot.png)\n\nOur current plot is only showing the results from a subset of the available\ndata. 
Now that our expression is complete, we can re-run on the full dataset and\ncompare.\n\n## The full run\n\nTo recap -- we pulled a lazy view of a single parquet file from the `pypi-data`\nrepo, filtered for all the files that contain file extensions we care about,\nthen grouped them all together to get counts of the various filetypes used\nacross projects by month.\n\nHere's the entire query chained together into a single command, now running on\nall of the `parquet` files we have access to:\n\n::: {#859879c8 .cell execution_count=14}\n``` {.python .cell-code}\npypi = con.read_parquet(parquet_files, table_name=\"pypi\")\n\nfull_query = (\n pypi.filter(\n [\n _.path.re_search(\n r\"\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\"\n ),\n ~_.path.re_search(r\"(^|/)test(|s|ing)\"),\n ~_.path.contains(\"/site-packages/\"),\n ]\n )\n .group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1),\n )\n .aggregate(projects=_.project_name.collect().unique())\n .order_by(_.month.desc())\n .mutate(\n ext=_.ext.re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .re_replace(\"^f.*$\", \"Fortran\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n )\n .dropna(\"ext\")\n .group_by([\"month\", \"ext\"])\n .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n)\nchart = (\n alt.Chart(full_query.to_pandas())\n .mark_line()\n .encode(x=\"month\", y=\"project_count\", color=\"ext\")\n .properties(width=600, height=300)\n)\nchart\n```\n\n::: {.cell-output .cell-output-display execution_count=22}\n```{=html}\n\n\n\n\n```\n:::\n:::\n\n\n", "supporting": [ "index_files" ], diff --git a/docs/posts/querying-pypi-metadata-compiled-languages/index.qmd b/docs/posts/querying-pypi-metadata-compiled-languages/index.qmd index b2f0adc5f5b6..bee8ee5f7a31 100644 --- a/docs/posts/querying-pypi-metadata-compiled-languages/index.qmd +++ 
b/docs/posts/querying-pypi-metadata-compiled-languages/index.qmd @@ -27,9 +27,10 @@ always viable -- we're in Python land so why not grab the filenames using ```{python} import urllib3 -http = urllib3.PoolManager() +url = "https://raw.githubusercontent.com/pypi-data/data/main/links/dataset.txt" -resp = http.request("GET", "https://github.com/pypi-data/data/raw/main/links/dataset.txt") +with urllib3.PoolManager() as http: + resp = http.request("GET", url) parquet_files = resp.data.decode().split() parquet_files @@ -87,7 +88,7 @@ We can follow Seth's lead and look for things: ```{python} expr = pypi.filter( [ - _.path.re_search(r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0-2}(?:or)?|go)$"), + _.path.re_search(r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"), ~_.path.re_search(r"(^|/)test(|s|ing)"), ~_.path.contains("/site-packages/"), ] @@ -144,10 +145,12 @@ We'll do a few things: ```{python} collapse_names = expr.mutate( ext=_.ext.re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++") + .re_replace("^f.*$", "Fortran") .replace("rs", "Rust") .replace("go", "Go") - .replace("asm", "Assembly"), -) + .replace("asm", "Assembly") + .nullif(""), +).dropna("ext") collapse_names ``` @@ -202,7 +205,7 @@ Now that the data are tidied, we can pass our expression directly to Altair and import altair as alt chart = ( - alt.Chart(collapse_names) + alt.Chart(collapse_names.to_pandas()) .mark_line() .encode(x="month", y="project_count", color="ext") .properties(width=600, height=300) @@ -235,7 +238,7 @@ full_query = ( pypi.filter( [ _.path.re_search( - r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0-2}(?:or)?|go)$" + r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$" ), ~_.path.re_search(r"(^|/)test(|s|ing)"), ~_.path.contains("/site-packages/"), @@ -249,15 +252,18 @@ full_query = ( .order_by(_.month.desc()) .mutate( ext=_.ext.re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++") + .re_replace("^f.*$", "Fortran") .replace("rs", "Rust") .replace("go", "Go") - .replace("asm", "Assembly"), + 
.replace("asm", "Assembly") + .nullif(""), ) + .dropna("ext") .group_by(["month", "ext"]) .aggregate(project_count=flatten(_.projects.collect()).unique().length()) ) chart = ( - alt.Chart(full_query) + alt.Chart(full_query.to_pandas()) .mark_line() .encode(x="month", y="project_count", color="ext") .properties(width=600, height=300)