
rose madder finish spitz

@b5 b5 released this 06 Jun 21:39
5e54b5a

We're going to start writing proper release notes now, so, uh, here are those notes:

This release brings a big new feature in the form of our first transformation implementation, and a bunch of refinements that make the experience of working with qri a little easier.

Introducing Skylark Transformations

For months qri has had a planned feature set for embedding the concept of "transformations" directly into datasets. Basically, transformations are scripts that auto-generate datasets. We've defined a "transformation" to be a repeatable process that takes zero or more datasets and data sources, and outputs exactly one dataset. By embedding transformations directly into datasets, users can repeat them with a single command, keeping the code that updates a dataset and its resulting data in the same place. This opens up a whole new set of uses for qri datasets, making them auditable, repeatable, configurable, and generally functional. Using transformations, qri can check to see if your dataset is out of date, and update it for you.
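To make that definition concrete, here is a minimal sketch in plain Python (not skylark, and not qri's actual API). The names Dataset, run_transform and merge_counts are hypothetical, invented purely for illustration; the only thing taken from the notes is the shape of a transformation: zero or more inputs in, exactly one dataset out, same result every time.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for a dataset: a list of rows. Not qri's data model.
Dataset = List[Dict[str, Any]]


def run_transform(transform: Callable[..., Dataset], *inputs: Dataset) -> Dataset:
    """Run a transformation over zero or more inputs and return exactly one dataset."""
    return transform(*inputs)


def merge_counts(*datasets: Dataset) -> Dataset:
    """Example transformation: sum the 'count' field per 'name' across all inputs."""
    totals: Dict[str, int] = {}
    for ds in datasets:
        for row in ds:
            totals[row["name"]] = totals.get(row["name"], 0) + row["count"]
    # Sort by name so repeated runs always produce identical output.
    return [{"name": name, "count": totals[name]} for name in sorted(totals)]
```

Calling run_transform(merge_counts) with no inputs still yields one (empty) dataset, and calling it twice with the same inputs yields the same rows, which is the "repeatable" part of the definition.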

While we've had the plan for transformations for some time now, it's taken us a long time to figure out how to write a first implementation. Because transformations are executable code, security & behavioural expectations are a big concern. We also want to set ourselves up for success by choosing an implementation that will feel familiar to those who do a lot of code-based data munging, while also leaving the door open to things we'd like to do in the future like parallelized execution.

So after a lot of research and a false start or five, we've decided on skylark, a scripting language that grew out of the bazel project at google, as our base implementation. This choice might seem strange at first (bazel is a build tool and has nothing to do with data), but skylark has a number of advantages:

  • python-like syntax - many people working in data science these days write python, and we like that.
  • deterministic subset of python - unlike python, skylark removes properties that reduce introspection into code behaviour. Things like while loops and recursive functions are omitted, making it possible for qri to infer how a given transformation will behave.
  • parallel execution - thanks to this deterministic requirement (and lack of global interpreter lock) skylark functions can be executed in parallel. Combined with peer-to-peer networking, we're hoping to advance transformations toward peer-driven distribued computing. More on that in the coming months.
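As a rough illustration of what writing without while loops and recursion looks like, here is a small sketch in ordinary Python that sticks to a bounded for loop. The function name and the max_steps cap are assumptions made up for this example; it runs as plain Python, and we have not checked here that it runs unchanged under skylark.

```python
def halvings_until_below(value, limit, max_steps=64):
    """Count how many halvings it takes for value to drop below limit.

    A `while value >= limit:` loop is replaced by a for loop with a fixed
    upper bound, so the function always terminates after at most max_steps
    iterations.
    """
    steps = 0
    for _ in range(max_steps):
        if value < limit:
            break
        value = value / 2
        steps += 1
    return steps
```

The trade-off is that the bound has to be picked up front: with a limit of 0 and a positive value, a while loop would spin forever, whereas this version simply stops after max_steps iterations.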

A tutorial on how to write skylark transformations is forthcoming; we'll post examples to our documentation site when it's ready: https://qri.io/docs

dataset.yaml, and more ❤️ for the CLI

For a while now we've been thinking about datasets as being a lot like web pages. Web pages have head, meta, and body elements. Datasets have meta, structure, commit, and data. To us this metaphor helps reason about the elements of a dataset, why they exist, and their function. And just like how webpages are defined in .html files, we've updated the CLI to work with .yaml files that define datasets. qri export --blank will now write a blank dataset file with comments that link to documentation on each section of a dataset. You can edit that file, save it, and run qri add --dataset=dataset.yaml me/my_dataset to add the dataset to qri. Ditto for qri update.
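One way to picture the four elements is as the top-level keys of a dataset definition. The sketch below is plain Python, not qri code: the helper unknown_elements is hypothetical, and the fields inside each element are invented for illustration rather than taken from qri's actual schema. Only the four element names come from the paragraph above.

```python
# The four top-level dataset elements named in these notes.
DATASET_ELEMENTS = ("meta", "structure", "commit", "data")


def unknown_elements(dataset):
    """Return, sorted, any top-level keys that are not one of the four elements."""
    return sorted(key for key in dataset if key not in DATASET_ELEMENTS)


# Hypothetical dataset definition. The inner fields are made up for this
# example and are not qri's real schema.
example = {
    "meta": {"title": "World Populations"},
    "structure": {"format": "csv"},
    "commit": {"title": "initial commit"},
    "data": [["country", "population"]],
}
```

Here unknown_elements(example) returns an empty list, while a definition with a stray top-level key such as "body" would have that key reported back.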

We'd like to encourage users to think in terms of these dataset.yaml files, building up a mental model of each element of a dataset in much the same way we think about HTML page elements. We chose yaml over JSON specifically because we can include comments in these files, making it easier to pass them around with tools outside of qri, and we're hoping this'll make it easier to think about datasets moving forward. In a future release we plan to rename the "data" element to "body" to bring this metaphor even closer.

Along with dataset.yaml, we've also done a bunch of refactoring & bug fixes to make the CLI generally work better, and look forward to improving on this trend in near-term patch releases. One of the biggest things we'd like to improve upon is providing more meaningful error messages.