👥 Compare data loading between kvikIO and Zarr engine #7

Merged 9 commits on Oct 13, 2023

@weiji14 weiji14 commented Oct 12, 2023

Getting the hard numbers on how fast the GPU-based kvikIO engine is over the CPU-based Zarr engine.

compare_kvikio_zarr

Formula for calculation:

$$ \frac{Slower - Faster}{Slower} = \frac{16.0 - 11.9}{16.0} = 0.25625 \approx 25.6\%$$

I.e. kvikio engine takes ~25% less time than zarr engine to load the ERA5 subset dataset.

Note that:

  • $Speed = \frac{Epochs}{Time}$
  • Reported time taken is using the Median statistic (instead of the Mean) to better account for random variations due to cold start or caching.
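
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not code from the PR: the function name `time_reduction` and the variable names are hypothetical, and the inputs are the rounded median times quoted in the formula.

```python
def time_reduction(slower: float, faster: float) -> float:
    """Fraction of time saved by the faster engine, relative to the slower one."""
    if slower <= 0:
        raise ValueError("slower must be a positive duration")
    return (slower - faster) / slower


# Rounded median seconds/epoch taken from the formula above
zarr_median = 16.0
kvikio_median = 11.9

reduction = time_reduction(slower=zarr_median, faster=kvikio_median)
print(f"kvikio takes {reduction:.1%} less time than zarr")
```

Expressing the result as a fraction of the slower time keeps it bounded between 0 and 1, which matches the "less time" wording used here.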

Preview results in 2_compare_results.ipynb notebook.

TODO:

  • Get average data loading time over 10 epochs
  • Initialize Jupyter Notebook with kvikIO vs Zarr engine benchmark code
  • Plot bar graphs showing absolute time difference

Benchmark results between the Zarr and kvikIO engine were too close for one epoch, so looping over 10 epochs and reporting the average instead. Not printing the MSE Loss anymore to declutter the console output.
@weiji14 weiji14 self-assigned this Oct 12, 2023
Will be reusing some of this code in a Jupyter Notebook, so refactoring to use tqdm.auto instead of standard tqdm.
Save the time taken to complete each epoch, and compute the median, mean and standard deviation across all epochs. Needed because the time to process one epoch can vary by a few seconds across the ten epochs depending on various factors (e.g. caching), so computing the average time as total_time / num_epochs can lead to misleading results. Also updated main README.md to be more specific about the reported total/median/mean/std benchmark times and the size of the ERA5 subset dataset.
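
A rough sketch of the per-epoch timing idea, using only the Python standard library rather than the numpy/pandas calls in the actual benchmark script. The helper names `time_epochs` and `summarize` are hypothetical, and the example durations are made up to show how a single slow epoch pulls the mean away from the median.

```python
import statistics
import time


def time_epochs(run_epoch, num_epochs=10):
    """Call run_epoch() num_epochs times and return each duration in seconds."""
    durations = []
    for _ in range(num_epochs):
        start = time.perf_counter()
        run_epoch()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Total, median, mean and sample standard deviation of epoch durations."""
    return {
        "total": sum(durations),
        "median": statistics.median(durations),
        "mean": statistics.mean(durations),
        "std": statistics.stdev(durations),  # sample std, i.e. ddof=1
    }


# Made-up durations (not benchmark results): one slow first epoch, four steady ones
example = [20.0, 12.0, 12.0, 12.0, 12.0]
stats = summarize(example)
print(stats)
```

In this made-up case the median stays at the steady-state value while the mean is shifted by the one slow epoch, which is the effect described in the commit message.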
Reporting the actual numbers on which is faster - kvikIO or Zarr! Reusing some code from 1_benchmark_kvikIOzarr.py, but now the total/median/mean/std times can be displayed. Final cell calculates the speedup of kvikIO to be ~20% over Zarr, but note that this speedup can actually fluctuate depending on lots of factors (have seen values from 10%-30% over multiple runs).
Statistical data visualization in Python!
A bar plot (with error bars) to visually compare kvikio (with GPUDirect Storage) against the zarr (no GPUDirect Storage) xarray backend engines in terms of data loading speed. Speedup results still fluctuate between runs, but are mostly around the 20% mark.

Also did some slight refactoring to use pandas instead of numpy for the mean/median/std calculations. Using ddof=1 for the standard deviation.
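
For context on the ddof=1 choice: pandas computes the sample standard deviation (ddof=1) by default, whereas numpy's `std` defaults to the population form (ddof=0). The sketch below illustrates the difference with the standard library instead of pandas, so it is a stand-in rather than the notebook's code, and the durations are invented for illustration.

```python
import statistics

# Made-up epoch durations in seconds (not benchmark results)
durations = [12.0, 12.5, 11.5, 12.0]

# Sample standard deviation: divides by (n - 1), equivalent to ddof=1
sample_std = statistics.stdev(durations)

# Population standard deviation: divides by n, equivalent to ddof=0
population_std = statistics.pstdev(durations)

print(f"ddof=1: {sample_std:.4f} s, ddof=0: {population_std:.4f} s")
```

With only ten epochs per engine, the sample form gives a slightly larger and less optimistic spread.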
Comment on lines 596 to 606
"sns.set_theme(context=\"talk\", palette=[\"#7400ff\", \"#e01073\"])\n",
"ax = sns.barplot(data=df)\n",
"for container in ax.containers:\n",
" ax.bar_label(container=container, fontsize=11, fmt=\"%.1fs\", label_type=\"center\")\n",
"ax.set_ylabel(ylabel=\"Data load time per epoch\\n ◀ seconds, lower is better\")\n",
"ax.set_xlabel(\n",
" xlabel=\" (with GDS) (without GDS) \\n\\n xarray backend engine\"\n",
")\n",
"ax.set_title(label=\"Reading ERA5 data with/without GPUDirect Storage\")\n",
"fig = ax.get_figure()\n",
"fig.savefig(fname=\"figures/compare_kvikio_zarr.svg\", bbox_inches=\"tight\")"

For reference, these are the results when running from my freshly booted system (cold start):

compare_kvikio_zarr

The kvikIO engine shows:

  • Median: 12.1870 seconds/epoch
  • Mean ± Standard deviation: 14.2480 ± 6.6434 seconds/epoch

The zarr engine shows:

  • Median: 15.9633 seconds/epoch
  • Mean ± Standard deviation: 16.0398 ± 0.4271 seconds/epoch

which is why I switched to reporting the Median time instead of the Mean time at 25d1642.


Run 2, after the GPU has warmed up a bit, and presumably there's some caching going on:

compare_kvikio_zarr

The kvikIO engine shows:

  • Median: 11.9764 seconds/epoch
  • Mean: 12.2756 ± 1.1680 seconds/epoch

The zarr engine shows:

  • Median: 15.7782 seconds/epoch
  • Mean: 15.8177 ± 0.2417 seconds/epoch

And I just realized, the bar plot is showing the mean value, not the median 😅


Run 3, after fixing the bar plot to show the median value instead of the mean value.

compare_kvikio_zarr

The kvikIO engine shows:

  • Median: 11.8606 seconds/epoch
  • Mean: 12.1578 ± 0.8311 seconds/epoch

The zarr engine shows:

  • Median: 16.0240 seconds/epoch
  • Mean: 15.9295 ± 0.3761 seconds/epoch

Gonna go with this one (commit at 6235109) since 16.0s and 11.9s are almost round numbers 🙂

Seaborn plots the mean value by default, but changing to median instead. The kvikIO engine is now reported as 35% faster than the Zarr engine.
Speed is equal to Distance (or epochs) over time. It makes more sense to report 'less time' (absolute measure) instead of 'faster speed' (inverse measure), so fixing the formulation. Previous calculation of speedup may actually have been incorrect?
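
As a worked check using the rounded medians from the PR description (16.0 s for zarr, 11.9 s for kvikIO), the two formulations give different percentages for the same measurement. The symbols $g$ (relative gain in speed) and $r$ (relative reduction in time) are introduced here only for illustration.

$$ g = \frac{T_{zarr}}{T_{kvikIO}} - 1 = \frac{16.0}{11.9} - 1 \approx 0.345 $$

$$ r = \frac{T_{zarr} - T_{kvikIO}}{T_{zarr}} = \frac{16.0 - 11.9}{16.0} \approx 0.256 $$

$$ r = \frac{g}{1 + g} $$

So "roughly 35% faster" and "roughly 26% less time" describe the same result; only the denominator differs.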
@weiji14 weiji14 marked this pull request as ready for review October 13, 2023 04:06
@weiji14 weiji14 merged commit cc76120 into main Oct 13, 2023
2 checks passed
@weiji14 weiji14 deleted the benchmark_compare branch October 13, 2023 04:31
@RichardScottOZ

Have you tried this one with COGs?

@weiji14

weiji14 commented Oct 27, 2023

> Have you tried this one with COGs?

No, kvikIO doesn't work with GeoTIFFs yet unfortunately; someone needs to implement that with cuFile and all. But I'm wondering if there's a way to hack something together using kerchunk's tiff_to_zarr.
