Visions provides a set of tools for defining and using semantic data types.
- Semantic type detection & inference on sequence data.
- Automated data processing.
- Completely customizable. Visions makes it easy to build and modify semantic data types for domain-specific purposes.
- Out-of-the-box support for multiple backend implementations including pandas, spark, numpy, and python.
- A robust set of default types and typesets covering the most common use cases.
Check out the complete documentation here.
Source code is available on GitHub, and binary installers are available via pip.
# Pip
pip install visions
Complete installation instructions (including extras) are available in the docs.
If you want to play right away, check out the examples folder. Otherwise, let's get some data:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df.head(2)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
The most important abstractions in visions are Types: these represent semantic notions about data. You have access to a range of well-tested types like Integer, Float, and Files covering the most common software development use cases.
Types can be bundled together into typesets. Behind the scenes, visions builds a traversable graph for any collection of types.
from visions import types, typesets
# StandardSet is the basic builtin typeset; this example uses the larger CompleteSet
typeset = typesets.CompleteSet()
typeset.plot_graph()
Note: Plots require pygraphviz to be installed.
Because of the special relationship between types, these graphs can be used to detect the type of your data or to infer a more appropriate one.
# Detection looks like this
typeset.detect_type(df)
# While inference looks like this
typeset.infer_type(df)
# Inference works well even if we monkey with the data, say by converting everything to strings
typeset.infer_type(df.astype(str))
>> {
'PassengerId': Integer,
'Survived': Integer,
'Pclass': Integer,
'Name': String,
'Sex': String,
'Age': Float,
'SibSp': Integer,
'Parch': Integer,
'Ticket': String,
'Fare': Float,
'Cabin': String,
'Embarked': String
}
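To make the difference between detection and inference concrete, here is a minimal standard-library sketch of a traversable type graph. It is not how visions is implemented: `ToyType`, `parses_as`, `detect`, and `infer` are hypothetical names invented for this illustration. Detection reports the type the values already have, while inference follows edges in the graph whenever a cast is safe for every value.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class ToyType:
    """Hypothetical stand-in for a semantic type; not a visions class."""
    name: str
    contains: Callable[[Sequence], bool]
    # Outgoing edges: (target type, test that the cast is safe, cast function)
    edges: List[Tuple["ToyType", Callable, Callable]] = field(default_factory=list)


def parses_as(cast):
    """Build a test that passes when every value survives `cast`."""
    def test(values):
        try:
            for v in values:
                cast(v)
        except (TypeError, ValueError):
            return False
        return True
    return test


INTEGER = ToyType("Integer", lambda vs: all(type(v) is int for v in vs))
FLOAT = ToyType("Float", lambda vs: all(type(v) is float for v in vs))
STRING = ToyType("String", lambda vs: all(type(v) is str for v in vs))

# Edges of the graph: String -> Integer, String -> Float, Float -> Integer
STRING.edges.append((INTEGER, parses_as(int), int))
STRING.edges.append((FLOAT, parses_as(float), float))
FLOAT.edges.append((INTEGER, lambda vs: all(v.is_integer() for v in vs), int))

TYPESET = [INTEGER, FLOAT, STRING]


def detect(values, typeset):
    """Return the first type whose membership test accepts the values as-is."""
    for t in typeset:
        if t.contains(values):
            return t
    raise ValueError("no type in the typeset contains these values")


def infer(values, typeset):
    """Start from the detected type and follow edges while a cast is safe."""
    current = detect(values, typeset)
    moved = True
    while moved:
        moved = False
        for target, is_safe, cast in current.edges:
            if is_safe(values):
                values = [cast(v) for v in values]
                current, moved = target, True
                break
    return current, values
```

With this toy graph, `detect(["1.0", "2.0"], TYPESET)` reports String, while `infer` walks String to Float to Integer and returns the values as integers.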
Visions solves many of the most common problems of working with tabular data. For example, sequences of integers are still recognized as integers whether they have trailing decimal 0's from being cast to float, missing values, or something else altogether. Much of this cleaning is performed automatically, so you get cleaned and processed data as well.
cleaned_df = typeset.cast_to_inferred(df)
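The kind of cleaning described above can be sketched without any dependencies. The helpers below are hypothetical and do not reproduce how visions handles this case; they only show the idea of recognizing integers that were stored as floats alongside missing values, and casting them back while preserving the gaps.

```python
import math


def is_missing(value) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def looks_like_integers(values) -> bool:
    """True when every non-missing value is a whole number such as 3.0."""
    present = [v for v in values if not is_missing(v)]
    return bool(present) and all(float(v).is_integer() for v in present)


def cast_to_optional_int(values):
    """Cast whole-number floats to int, mapping missing values to None."""
    if not looks_like_integers(values):
        raise ValueError("values are not integer-like")
    return [None if is_missing(v) else int(v) for v in values]
```

For example, `cast_to_optional_int([1.0, 2.0, float("nan"), 4.0])` yields `[1, 2, None, 4]`, whereas a sequence containing `1.5` is rejected.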
This is only a small taste of everything visions can do, including building your own domain-specific types and typesets, so please check out the API documentation or the examples/ directory for more info!
Thanks to its dispatch-based implementation, Visions is able to exploit framework-specific capabilities offered by libraries like pandas and spark. Currently it works with the following backends by default:
- Pandas (feature complete)
- Numpy (boolean, complex, date time, float, integer, string, time deltas, objects)
- Spark (boolean, categorical, date, date time, float, integer, numeric, object, string)
- Python (string, float, integer, date time, time delta, boolean, categorical, object, complex - other datatypes are untested)
If you're using pandas, it will also take advantage of parallelization tools like swifter if available.
It also offers a simple annotation-based API for registering new implementations as needed. For example, if you wished to extend the categorical data type to include a Dask-specific implementation, you might do something like:
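from visions.types.categorical import Categorical
from pandas.api import types as pdt
import dask.dataframe  # a bare `import dask` does not load the dataframe submodule

# Register a Dask-specific implementation of Categorical's membership test
@Categorical.contains_op.register
def categorical_contains(series: dask.dataframe.Series, state: dict) -> bool:
    return pdt.is_categorical_dtype(series.dtype)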
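The general idea of annotation-based registration can be illustrated with Python's standard `functools.singledispatch`. This is a swapped-in analogue, not visions' own dispatch machinery, and `contains_integer` is a hypothetical function invented for this sketch: one generic operation, with a separate implementation chosen by the type annotation of the first argument.

```python
from functools import singledispatch


@singledispatch
def contains_integer(sequence) -> bool:
    """Fallback used when no implementation is registered for the type."""
    raise NotImplementedError(
        f"no implementation registered for {type(sequence).__name__}"
    )


@contains_integer.register
def _(sequence: list) -> bool:
    # Generic Python implementation: inspect every element
    return all(type(v) is int for v in sequence)


@contains_integer.register
def _(sequence: range) -> bool:
    # A range can only ever hold integers, so no scan is needed
    return True
```

Calling `contains_integer([1, 2, 3])` uses the list implementation, `contains_integer(range(3))` uses the shortcut for ranges, and an unregistered type such as `str` raises `NotImplementedError`. Supporting a new container then only requires registering one more function, which mirrors how a new backend would be added.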
Contributions to visions are welcome. For more information, please visit the community contributions page and join us on Slack. The GitHub issues tracker is used for reporting bugs, feature requests, and support questions.
Also, please check out some of the other companies and packages using visions including:
If you're currently using visions or would like to be featured here, please let us know.
This package is part of the dylan-profiler project. The package is a core component of pandas-profiling. More information can be found here. This work was partially supported by SIDN Fonds.