Your favourite conference is out, you have links to all the GitHub repos for that conf and you want to get a sense of what researchers have been using? This collection of scripts will help you figure out where and how Python APIs are used.
IN: A bunch of GitHub URLs, e.g. all the URLs from ICCV/CVPR.
OUT: The ability to find all calls, imports or attribute accesses of any API, e.g. all calls to `torch.compile` or `some_random_lib.CoolClass`, with:
- the code snippet in which the call/import/access happened.
- the GitHub permalink so you can browse more on GitHub and gain context.
- Answers to random questions like:
  - what are the most common values for DataLoader's `num_workers`?
  - what is the most popular video decoder?
  - who's using iterable datasets? who's using map-style datasets?
  - what are the most popular APIs from other libraries competing with mine?
  - who's still using this deprecated parameter?
  - when people use torchvision's `Resize()` or `resize()`, do they use bilinear or bicubic mode?
See `report.py` or N4981684; a rough sketch of this kind of query is shown just below.
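To give a flavour of what those queries look like, here is a minimal pandas sketch answering a couple of them. The CSV file name and the column names (`api`, `repo`, `code`, `permalink`) are assumptions for illustration, not necessarily the actual schema written by `parse_repos.py`.

```python
from pathlib import Path
import pandas as pd

# Hypothetical illustration: file name and column names (api, repo, code, permalink)
# are assumptions; the real schema is whatever parse_repos.py writes.
df = pd.read_csv(Path("~/csvs/cvpr_calls.csv").expanduser())

# All call sites of torch.compile, with the snippet and a permalink to browse on GitHub.
compile_calls = df[df["api"] == "torch.compile"]
print(compile_calls[["repo", "permalink", "code"]].head())
print("repos using it:", compile_calls["repo"].nunique())

# Crude answer to "most common num_workers values": regex the DataLoader snippets.
loader_calls = df[df["api"] == "torch.utils.data.DataLoader"]
num_workers = loader_calls["code"].str.extract(r"num_workers\s*=\s*(\d+)")[0]
print(num_workers.value_counts().head())
```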
Roughly:
1. `download_repos.py`: Clone the specified GitHub repositories in batches.
2. `parse_repos.py`: Parse the Python files in the repos and record all calls/imports/attribute accesses of any API (see the sketch after this list). Results are available as pandas DataFrames and saved as CSV files.
3. `report.py`: Query the saved CSV files and get reports for the APIs you're interested in. Open this as a bento notebook.
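For a rough idea of what step 2 does, here is a minimal sketch of recording imports, calls and attribute accesses from a single source file with Python's `ast` module. This is not the actual `parse_repos.py`; the real script also records the code snippet and GitHub permalink for each hit.

```python
import ast

def dotted_name(node):
    """Rebuild a dotted name like 'torch.utils.data.DataLoader' from an AST node."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None  # e.g. the call target is an arbitrary expression

def find_api_uses(source, filename="<unknown>"):
    """Return (kind, dotted_name, lineno) tuples for imports, calls and attribute accesses."""
    records = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Import):
            records += [("import", alias.name, node.lineno) for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            records += [("import", f"{module}.{alias.name}", node.lineno) for alias in node.names]
        elif isinstance(node, ast.Call) and (name := dotted_name(node.func)):
            records.append(("call", name, node.lineno))
        elif isinstance(node, ast.Attribute) and (name := dotted_name(node)):
            records.append(("attribute", name, node.lineno))
    return records

print(find_api_uses("import torch\nmodel = torch.compile(net)"))
```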
These files are meant to be copy/pasted and modified on your own devvm or laptop. This isn't meant to be a buck project (although it could be).
Steps 1 and 2 have been done for all 2.5k repos from ICCV and CVPR. You can download the resulting csv files from https://drive.google.com/drive/folders/1MYiMvFBFZwFl9CjNonoqMNP5A4qkGDkf?usp=sharing and go straight to step 3.
TL;DR: 32GB of RAM, plus joblib, pandas and numpy.
On a devvm with 32 cores, this scales reasonably well for the ~2,500 repos of ICCV/CVPR. It should take <20 minutes to download all the repos, run the analysis and load the resulting CSVs to start querying. Each API report should then take just a few seconds. The aggregated CSV files/DataFrames are ~20GB in total.
`download_repos.py` and `parse_repos.py` can be executed as bento notebooks for exploratory work, but running them standalone, i.e. `python download_repos.py`, will be a lot faster as it will leverage multiprocessing. To do that on a devvm you'll need a conda/virtualenv env with `joblib`, `pandas` and `numpy` installed:
$ https_proxy=http://fwdproxy:8080 http_proxy=http://fwdproxy:8080 pip install pandas numpy joblib
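The multiprocessing part is just the usual joblib pattern. Here is a hypothetical sketch of cloning a list of repos in parallel; the URL list, clone depth and error handling are placeholders, not the actual `download_repos.py` logic.

```python
import subprocess
from pathlib import Path
from joblib import Parallel, delayed

REPOS_DIR = Path("~/repos").expanduser()  # placeholder, mirrors the USER_EDIT var
REPOS_DIR.mkdir(parents=True, exist_ok=True)
URLS = ["https://github.com/pytorch/vision"]  # would come from the *_urls files

def clone(url):
    dest = REPOS_DIR / url.rstrip("/").split("/")[-1]
    if not dest.exists():
        # --depth 1: we only need the current code, not the full history.
        subprocess.run(["git", "clone", "--depth", "1", url, str(dest)], check=False)
    return dest

# n_jobs=-1 uses all available cores; this is why the standalone runs are much
# faster than executing the same code cell-by-cell in a notebook.
Parallel(n_jobs=-1)(delayed(clone)(url) for url in URLS)
```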
See steps 1. 2. 3. above. The global vars you might want to change are noted as `USER_EDIT`. For now it's just:
grep -nr -e USER_EDIT -e USER_TODO --exclude README.md
download_repos.py:50:REPOS_DIR = Path("~/repos").expanduser() # USER_EDIT
download_repos.py:53:URLS_FILE = "~/dev/repo_analysis/{conf}_urls" # USER_EDIT
parse_repos.py:295:REPOS_DIR = Path("~/repos").expanduser() # USER_EDIT
parse_repos.py:304:code_context = "full" # USER_EDIT. Can be "line" (fast, line-only) or "full" (slow, accurate).
report.py:14:# - Look for instances of "USER_TODO" and "USER_EDIT" and follow instructions.
report.py:18:# USER_TODO: Download csv files from https://drive.google.com/drive/folders/1MYiMvFBFZwFl9CjNonoqMNP5A4qkGDkf?usp=sharing
report.py:20:I_HAVE_DOWNLOADED_THE_ICCV_AND_CVPR_CSV_FILES_ALREADY = True # USER_EDIT
report.py:24: CSVS_DIR = Path("~/csvs").expanduser() # USER_EDIT
report.py:34: REPOS_DIR = Path("~/repos").expanduser() # USER_EDIT
I got the CVPR and ICCV URLs by scraping https://github.com/DmitryRyumin/CVPR-2023-Papers and https://github.com/DmitryRyumin/ICCV-2023-Papers/.
The results are in `cvpr_urls` and `iccv_urls`.
For other conferences... IDK, but someone else probably did it already. Maybe PapersWithCode?
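If you do need to build a URL list yourself, the scraping boils down to pulling github.com links out of whatever page lists the papers. A hypothetical sketch follows; the raw-README URL and the assumption that the repo links all live in that one file are mine, and behind a proxy you'd need the usual https_proxy env var.

```python
import re
import urllib.request

# Assumption: the repo links live in the top-level README on the main branch.
RAW_README = "https://raw.githubusercontent.com/DmitryRyumin/CVPR-2023-Papers/main/README.md"

with urllib.request.urlopen(RAW_README) as resp:
    text = resp.read().decode("utf-8")

# Grab github.com/<user>/<repo> links, de-duplicated but keeping order.
# You'd still want to filter out badges, the papers repo itself, etc.
urls = re.findall(r"https://github\.com/[\w.\-]+/[\w.\-]+", text)
urls = list(dict.fromkeys(urls))

with open("cvpr_urls", "w") as f:
    f.write("\n".join(urls) + "\n")
```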
Some key parts of these scripts are adapted from: