🪣 hfds-clj

hfds-clj is a library for accessing HuggingFace datasets from Clojure. It provides seamless access to a dataset by:

  • downloading the HF dataset,
  • caching the downloaded dataset locally, and
  • serving it from the cache for subsequent requests.
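
The download, cache, and serve steps above can be sketched in a few lines of Clojure. This is only an illustration of the general pattern, not hfds-clj's actual implementation: the `cached-fetch` function, its `fetch-fn` argument, and the one-EDN-file-per-dataset layout are all assumptions made for the example.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.edn :as edn]
         '[clojure.string :as str])

(defn cached-fetch
  "Returns the records for `dataset-name`. Reads them from a file under
  `cache-dir` when a cached copy exists; otherwise calls `fetch-fn`,
  writes the result to the cache, and returns it.
  `fetch-fn` is a hypothetical stand-in for the real download step."
  [cache-dir dataset-name fetch-fn]
  (let [;; Assumed layout: one EDN file per dataset, with \"/\" made path-safe.
        file-name  (str (str/replace dataset-name "/" "__") ".edn")
        cache-file (io/file cache-dir file-name)]
    (if (.exists cache-file)
      ;; Cache hit: serve the records from disk.
      (edn/read-string (slurp cache-file))
      ;; Cache miss: download, store, then return the records.
      (let [records (vec (fetch-fn dataset-name))]
        (io/make-parents cache-file)
        (spit cache-file (pr-str records))
        records))))
```

Calling `cached-fetch` twice with the same directory and dataset name invokes `fetch-fn` only on the first call; the second call is answered from the cache file.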

It does not aim to replicate the full range of functionality found in the HuggingFace datasets library. As an immediate extension, though, it would be great to support Dataset Features.

Usage

CLI

Datasets can be downloaded from the command line:

clojure -X:download :dataset "allenai/prosocial-dialog"

See the Code section below for a description of the parameters.

Code

(require '[hfds-clj.core :refer [load-dataset]])

Download an HF dataset with this one-liner, where the single parameter is the dataset name as shown on its HF dataset page.

(load-dataset "Anthropic/hh-rlhf")

A second call with the Anthropic/hh-rlhf parameter loads the dataset from the cache and returns a lazy sequence of all its records.

A more fine-grained dataset request is supported via a parameterized call:

(load-dataset {:dataset "allenai/prosocial-dialog"
               :split   "train"
               :config  "default"
               :offset  0
               :length  100}
              {:hfds/download-mode :reuse-dataset-if-exists
               :hfds/cache-dir     "/data"
               :hfds/limit         4000})

Notes

  • This is extracted from Bosquet, where HuggingFace datasets are used for LLM-related development.
  • Thanks to TrueGrit for helping to robustly fetch data from the HF API.