Dataspec is a data specification and normalization toolkit written in pure Python. With Dataspec, you can create Specs to validate and normalize data of almost any shape. Dataspec is inspired by Clojure's spec library.
Specs are declarative data specifications written in pure Python code. Specs can be
created using the generic Spec constructor function s
. Specs provide two useful and
related functions. The first is to evaluate whether an arbitrary data structure
satisfies the specification. The second function is to conform (or normalize) valid
data structures into a canonical format.
The simplest Specs are based on common predicate functions, such as
lambda x: isinstance(x, str)
which asks "Is the object x an instance of str
?".
Fortunately, Specs are not limited to being created from single predicates. Specs can
also be created from groups of predicates, composed in a variety of useful ways, and
even defined for complex data structures. Because Specs are ultimately backed by
pure Python code, any question that you can answer about your data in code can be
encoded in a Spec.
- Simple API using primarily native Python types and data structures
- Stateless, immutable Spec objects are designed to be created once, reused, and composed
- Rich error objects point to the exact location of the error in the input value
- Builtin factories for many common validations
Dataspec is developed on GitHub and hosted
on PyPI. You can fetch Dataspec using pip
:
pip install dataspec
To enable support for phone number specs or arbitrary date strings, you can choose the extras when you install:
pip install dataspec[dates]
pip install dataspec[phonenumbers]
To begin using the dataspec
library, you can simply import the s
object:
from dataspec import s
s
is a generic constructor for creating new Specs. Many useful Specs can be
composed from basic Python objects like types, functions, and data structures. The
"Hello, world!" equivalent for creating new Specs might be a simple Spec that validates
that an input is a string (a Python str
). We can do this by simply passing the
Python str
type directly to s
. When s
receives an instance of a type
object, it assumes you want to create a Spec that validates input values are of that
type:
spec = s(str)
spec.is_valid("a string") # True
spec.is_valid(3) # False
Often you want to assert more than one condition on an input value. After all, it's
fairly trivial to assert type checks on a value. In fact, this may even be done by
a deserialization library on your behalf. Perhaps you're interested in checking that
your input is a string and that it contains only numbers and hyphens. dataspec
lets
you define Specs with boolean logic, which can be useful for asserting multiple
conditions on your input:
spec = s.all(str, lambda s: all(c.isdecimal() or c == "-" for c in s))
spec.is_valid("212-867-5309") # True
spec.is_valid("Philip Jennings") # False
Composition is at the heart of dataspec
's design. In the previous example, we
learned a few useful things. First, s
is actually a callable object with static
methods which help produce other sorts of Specs. Second, we can see that when we
pass objects understood to s
into various Spec constructors, they are automatically
coerced into the appropriate Spec type. Here, we passed a type
, which we used
previously. We also passed in a function of one argument returning a boolean; in
dataspec
, these are called predicates and they are turned into Specs which validate
input values if the function returns True
and fail otherwise. Finally, we learned
that s.all
can be used to produce and
-type boolean logic between different
Specs. (You can produce or
Specs using s.any
).
In the previous example, we used the and
logic to check for our conditions to show
various different features of dataspec
. However, in real code you'd likely take
advantage of dataspec
's builtin s.str
factory, which can assert several useful
properties of strings (in addition to the basic isinstance
check). In the case
above, perhaps we really wanted to check for a US ZIP code (with the trailing 4 digits).
We can perform that check using a simple regex string validator:
spec = s.str("us_zip_plus_4", regex=r"\d{5}\-\d{4}")
spec.is_valid("10001-3093") # True
spec.is_valid("10001") # False
spec.is_valid("N0L 1E0") # False
Scalar Specs like the one above are trivially different from the same checks you could
write in raw Python. The real power of dataspec
comes from its ability to compose
Specs for larger, nested data structures. Suppose you were accepting a physician
profile object via a JSON API and you wanted to validate that the physician licenses
were valid in all of the states you operate in:
operating_states = s("operating_states", {"CA", "GA", "NY"})
license_states = s("license_states", [operating_states, {"kind": list}])
license_states.is_valid(["CA", "NY"]) # True
license_states.is_valid(["SD", "GA"]) # False, you do not operate in South Dakota
license_states.is_valid({"CA"}) # False, as the input collection is a set
In the previous example, we learned a bit more about dataspec
. First, we can see
that Spec objects are designed to be reused. We declared operating_states
as a
separate Spec from license_states
with the intent that we could use it as a
component of other Specs. Specs are immutable and stateless, so they can be reused in
other Specs without issue. Next, we can see that we're expecting a collection, indicated
by the Python list
wrapping operating_states
in the license_states
Spec.
In particular, we are expecting exactly a list
, not a set
or tuple
.
Third, we are expecting a limited set of enumerated values, indicated by
operating_states
being a set
. Values not in the set are rejected. dataspec
also supports using Python's Enum
objects for defining enumerated types.
We did declare two separate Specs and pass both to s
directly. However, we could
have declared the entire Spec inline and s
would have converted each child value
into a Spec automatically: s([{"CA", "GA", "NY"}, {"kind": list}])
.
Building on the previous example, let's suppose we want to validate a simplified version of that physician profile object. Spec is great for validating data at your application boundaries. You can pass it your deserialized input values and it will help you ensure that you're receiving data in the shape your internal services expect:
spec = s(
"user-profile",
{
"id": s.str("id", format_="uuid"),
"first_name": s.str("first_name"),
"last_name": s.str("last_name"),
"date_of_birth": s.str("date_of_birth", format_="iso-date"),
s.opt("gender"): s("gender", {"M", "F"}),
"license_states": license_states, # using the previously defined Spec
}
)
spec.is_valid( # True
{
"id": "e1bc9fb2-a4d3-4683-bfef-3acc61b0edcc",
"first_name": "Carl",
"last_name": "Sagan",
"date_of_birth": "1996-12-20",
"license_states": ["CA"],
}
)
spec.is_valid( # False; the optional "gender" key included an invalid value
{
"id": "e1bc9fb2-a4d3-4683-bfef-3acc61b0edcc",
"first_name": "Carl",
"last_name": "Sagan",
"date_of_birth": "1996-12-20",
"gender": "O",
"license_states": ["CA"],
}
)
spec.is_valid( # True; note that extra keys _are ignored_
{
"id": "958e2f55-5fdf-4b84-a522-a0765299ba4b",
"first_name": "Marie",
"last_name": "Curie",
"date_of_birth": "1867-11-07",
"gender": "F",
"license_states": ["NY", "GA"],
"occupation": "Chemist",
}
)
spec.is_valid( # False; the "license_states" includes the invalid value "TX"
{
"id": "958e2f55-5fdf-4b84-a522-a0765299ba4b",
"first_name": "Marie",
"last_name": "Curie",
"date_of_birth": "1867-11-07",
"license_states": ["TX"],
}
)
dataspec
includes plenty of additional functionality which is not discussed above.
Read more at Read the Docs.
Python's ecosystem features a rich collection of data validation and normalization tools, so a new entrant in the space naturally begs the question "why didn't you just use X instead?". Before creating Dataspec, we surveyed a wide variety of different tools and had even used one or two in our production service. All of these tools are generally successful at validating data, but each had some issue that caused us to pass.
- Many of the libraries in this space primarily help validate data, but do not always help you normalize or conform that data after it has been validated. Dataspec provides validation and conformation out of the box.
- Libraries which do feature validation and normalization often complect these two steps. Dataspec validation is a discrete step that occurs before conformation, so it is easy to reason about failures in validation.
- Some of the libraries we tried were stateful or leaned too heavily on mutability. We tend to prefer immutable and stateless objects where mutability and state is not required. Specs in Dataspec are completely stateless and conformation always produces a new value. This is certainly more costly than mutating inputs, but mutating code is harder to reason about and is a major source of bugs, so we prefer to avoid it.
- Many libraries we surveyed focused on defining validations from the top-down, rather than encouraging composition. Specs in Dataspec are designed to be created once, reused, and composed, rather than requiring a separate definition for each usage.
MIT License