Your favourite conference is out, you have links to all the GitHub repos for that conf and you want to get a sense of what researchers have been using? This collection of scripts will help you figure out where and how Python APIs are used.
IN: A bunch of GitHub URLs, e.g. all the URLs from ICCV/CVPR.
OUT: The ability to find all calls, imports or attribute accesses of any API, e.g. all calls to `torch.compile` or `some_random_lib.CoolClass`, with:
- the code snippet in which the call/import/access happened.
- the GitHub permalink so you can browse more on GitHub and gain context.
- Answers to random questions like:
  - what are the most common values for DataLoader's `num_workers`?
  - what is the most popular video decoder?
  - who's using iterable datasets? who's using map-style datasets?
  - what are the most popular APIs from other libraries competing with mine?
  - who's still using this deprecated parameter?
  - when people use torchvision's `Resize()` or `resize()`, do they use bilinear or bicubic mode?
See `report.py` or N4981684; a rough sketch of this kind of query is shown just below.
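To give a flavour of what those queries look like, here is a minimal pandas sketch answering a couple of them. The CSV file name and the column names (`api`, `repo`, `code`, `permalink`) are assumptions for illustration, not necessarily the actual schema written by `parse_repos.py`.

```python
from pathlib import Path
import pandas as pd

# Hypothetical illustration: file name and column names (api, repo, code, permalink)
# are assumptions; the real schema is whatever parse_repos.py writes.
df = pd.read_csv(Path("~/csvs/cvpr_calls.csv").expanduser())

# All call sites of torch.compile, with the snippet and a permalink to browse on GitHub.
compile_calls = df[df["api"] == "torch.compile"]
print(compile_calls[["repo", "permalink", "code"]].head())
print("repos using it:", compile_calls["repo"].nunique())

# Crude answer to "most common num_workers values": regex the DataLoader snippets.
loader_calls = df[df["api"] == "torch.utils.data.DataLoader"]
num_workers = loader_calls["code"].str.extract(r"num_workers\s*=\s*(\d+)")[0]
print(num_workers.value_counts().head())
```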
Roughly:
1. `download_repos.py`: Clone the specified GitHub repositories in batches.
2. `parse_repos.py`: Parse the Python files in the repos and record all calls/imports/attribute accesses of any API (see the sketch after this list). Results are available as pandas DataFrames and saved as CSV files.
3. `report.py`: Query the saved CSV files and get reports for the APIs you're interested in. Open this as a bento notebook.
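For a rough idea of what step 2 does, here is a minimal sketch of recording imports, calls and attribute accesses from a single source file with Python's `ast` module. This is not the actual `parse_repos.py`; the real script also records the code snippet and GitHub permalink for each hit.

```python
import ast

def dotted_name(node):
    """Rebuild a dotted name like 'torch.utils.data.DataLoader' from an AST node."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None  # e.g. the call target is an arbitrary expression

def find_api_uses(source, filename="<unknown>"):
    """Return (kind, dotted_name, lineno) tuples for imports, calls and attribute accesses."""
    records = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Import):
            records += [("import", alias.name, node.lineno) for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            records += [("import", f"{module}.{alias.name}", node.lineno) for alias in node.names]
        elif isinstance(node, ast.Call) and (name := dotted_name(node.func)):
            records.append(("call", name, node.lineno))
        elif isinstance(node, ast.Attribute) and (name := dotted_name(node)):
            records.append(("attribute", name, node.lineno))
    return records

print(find_api_uses("import torch\nmodel = torch.compile(net)"))
```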
These files are meant to be copy/pasted and modified on your own devvm or laptop. This isn't meant to be a buck project (although it could be).
Steps 1 and 2 have been done for all 2.5k repos from ICCV and CVPR. You can download the resulting csv files from https://drive.google.com/drive/folders/1MYiMvFBFZwFl9CjNonoqMNP5A4qkGDkf?usp=sharing and go straight to step 3.
TL;DR: 32GB of RAM, plus joblib, pandas and numpy.
On a devvm with 32 cores, this scales reasonably well for the ~2,500 repos of ICCV/CVPR. It should take <20 minutes to download all the repos, run the analysis and load the resulting CSVs to start querying. Each API report should then take just a few seconds. The aggregated CSV files/DataFrames are ~20GB in total.
`download_repos.py` and `parse_repos.py` can be executed as bento notebooks for exploratory work, but running them standalone, i.e. `python download_repos.py`, will be a lot faster as it will leverage multiprocessing. To do that on a devvm you'll need a conda/virtualenv env with `joblib`, `pandas` and `numpy` installed:
$ https_proxy=http://fwdproxy:8080 http_proxy=http://fwdproxy:8080 pip install pandas numpy joblib
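The multiprocessing part is just the usual joblib pattern. Here is a hypothetical sketch of cloning a list of repos in parallel; the URL list, clone depth and error handling are placeholders, not the actual `download_repos.py` logic.

```python
import subprocess
from pathlib import Path
from joblib import Parallel, delayed

REPOS_DIR = Path("~/repos").expanduser()  # placeholder, mirrors the USER_EDIT var
REPOS_DIR.mkdir(parents=True, exist_ok=True)
URLS = ["https://github.com/pytorch/vision"]  # would come from the *_urls files

def clone(url):
    dest = REPOS_DIR / url.rstrip("/").split("/")[-1]
    if not dest.exists():
        # --depth 1: we only need the current code, not the full history.
        subprocess.run(["git", "clone", "--depth", "1", url, str(dest)], check=False)
    return dest

# n_jobs=-1 uses all available cores; this is why the standalone runs are much
# faster than executing the same code cell-by-cell in a notebook.
Parallel(n_jobs=-1)(delayed(clone)(url) for url in URLS)
```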
See steps 1. 2. 3. above. The global vars you might want to change are noted as `USER_EDIT`. For now it's just:
grep -nr -e USER_EDIT -e USER_TODO --exclude README.md
download_repos.py:50:REPOS_DIR = Path("~/repos").expanduser() # USER_EDIT
download_repos.py:53:URLS_FILE = "~/dev/repo_analysis/{conf}_urls" # USER_EDIT
parse_repos.py:295:REPOS_DIR = Path("~/repos").expanduser() # USER_EDIT
parse_repos.py:304:code_context = "full" # USER_EDIT. Can be "line" (fast, line-only) or "full" (slow, accurate).
report.py:14:# - Look for instances of "USER_TODO" and "USER_EDIT" and follow instructions.
report.py:18:# USER_TODO: Download csv files from https://drive.google.com/drive/folders/1MYiMvFBFZwFl9CjNonoqMNP5A4qkGDkf?usp=sharing
report.py:20:I_HAVE_DOWNLOADED_THE_ICCV_AND_CVPR_CSV_FILES_ALREADY = True # USER_EDIT
report.py:24: CSVS_DIR = Path("~/csvs").expanduser() # USER_EDIT
report.py:34: REPOS_DIR = Path("~/repos").expanduser() # USER_EDIT
I got the CVPR and ICCV URLs by scraping https://github.com/DmitryRyumin/CVPR-2023-Papers and https://github.com/DmitryRyumin/ICCV-2023-Papers/.
The results are in `cvpr_urls` and `iccv_urls`.
For other conferences... IDK, but someone else probably did it already. Maybe PapersWithCode?
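If you do need to build a URL list yourself, the scraping boils down to pulling github.com links out of whatever page lists the papers. A hypothetical sketch follows; the raw-README URL and the assumption that the repo links all live in that one file are mine, and behind a proxy you'd need the usual https_proxy env var.

```python
import re
import urllib.request

# Assumption: the repo links live in the top-level README on the main branch.
RAW_README = "https://raw.githubusercontent.com/DmitryRyumin/CVPR-2023-Papers/main/README.md"

with urllib.request.urlopen(RAW_README) as resp:
    text = resp.read().decode("utf-8")

# Grab github.com/<user>/<repo> links, de-duplicated but keeping order.
# You'd still want to filter out badges, the papers repo itself, etc.
urls = re.findall(r"https://github\.com/[\w.\-]+/[\w.\-]+", text)
urls = list(dict.fromkeys(urls))

with open("cvpr_urls", "w") as f:
    f.write("\n".join(urls) + "\n")
```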
Some key parts of these scripts are adapted from: