Column Types
Workbench manipulates tabular data in Apache Arrow, Apache Parquet 2.0 and Pandas DataFrame formats. (See Table Data Structures to understand how each format is used.)
Arrow is the canonical data format.
Here are the valid column types and how to use them.
In Arrow, a column is a ChunkedArray (C++ docs, Python docs). For example:
import pyarrow as pa
data = pa.chunked_array([["a", "b", "c", None]])
(Currently, Workbench only supports zero or one Array in the ChunkedArray.)
All columns support null values.
A table is a Table (C++ docs, Python docs). For example:
table = pa.table({"A": data})
A table must not have any metadata. The schema can always be derived exclusively from the table's ChunkedArrays' data types.
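Continuing the example above, a quick sketch of what that means in pyarrow (the schema falls out of the columns' data types, and no metadata is attached):
table.schema           # A: string -- derived from the ChunkedArray's data type
table.schema.metadata  # None -- no table-level metadata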
The Parquet format is used as a render cache: all Parquet data has a one-to-one mapping to Arrow data. Parquet tables have no metadata.
Every Pandas column can be represented as a pandas.Series:
import pandas as pd
import numpy as np

# a DataFrame
table: pd.DataFrame = pd.DataFrame({
    'A': ['a', 'b', 'c'],
    'B': [1, 2, 3],
})

# a Series
a_series: pd.Series = table['A']
Workbench does not have a null column type; but any value (in any column) may be null.
Workbench supports nulls because its input data may include nulls. Workbench itself has no meaning for null. The nuance of null comes from our community of modules.
Our guiding principle with null is: it should remind users of SQL's NULL. Here are the basics of NULL-handling in SQL:
SELECT NULL + 3; -- NULL: arithmetic returns null
SELECT NULL || 'foo'; -- NULL: string operations produce null
SELECT SUM(v)
FROM (SELECT UNNEST(ARRAY[1,2,NULL,4]) AS v) t
; -- 7: aggregations ignore null values...
SELECT SUM(v)
FROM (SELECT UNNEST(ARRAY[NULL,NULL]::BIGINT[]) AS v) t
; -- NULL: ... or return null when all values are null
Each module can handle null in its own way. The module's documentation should clarify what null means.
Regardless of format, each table must follow these rules:
- Each column must have a different name.
- Column names cannot include ASCII control characters such as \n. (The full disallowed range is \x00-\x1f.)
- Column names cannot exceed 120 bytes when UTF-8-encoded. (All formats store column names as UTF-8.)
- Column names must be valid Unicode.
- Tables cannot exceed a million rows.
- Tables cannot exceed 500 columns.
- Values must be of the types listed below; and they must follow type-specific rules.
If your module output does not follow these rules, Workbench will tell the user your module has a bug.
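To illustrate, here is a minimal sketch of those checks over a pyarrow Table (validate_table is a hypothetical helper, not part of Workbench's API):
import pyarrow as pa

def validate_table(table: pa.Table) -> None:
    # Hypothetical helper: a sketch of the rules above, not a Workbench API.
    names = table.column_names
    if len(names) != len(set(names)):
        raise ValueError("Each column must have a different name")
    for name in names:
        if any(0x00 <= ord(c) <= 0x1F for c in name):
            raise ValueError("Column names cannot include ASCII control characters")
        if len(name.encode("utf-8")) > 120:  # all formats store names as UTF-8
            raise ValueError("Column names cannot exceed 120 bytes when UTF-8-encoded")
    if table.num_rows > 1_000_000:
        raise ValueError("Tables cannot exceed a million rows")
    if table.num_columns > 500:
        raise ValueError("Tables cannot exceed 500 columns")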
A module's render() must input and output a pd.DataFrame or ArrowTable:
from typing import Any, Dict
import pandas as pd
import numpy as np

def render(table: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
    # ...
    return table  # or another pd.DataFrame
Text
Each valid Text value is made up of 0 to 32,767 bytes of valid UTF-8 (or null).
Arrow text is either a StringArray (C++ docs, Python docs) or a DictionaryArray (C++ docs, Python docs) with StringArray values.
DictionaryArray is an optimization, and all modules must read it. Only use dictionaries to store large amounts of repetitive data. A good guideline: don't store more than 100 unique values or values over 100 bytes each. (TODO create hard limits.)
A DictionaryArray must not contain unused dictionary values or duplicate dictionary values.
While Step code sees the difference between DictionaryArray and StringArray, users do not see the difference. To them, text data is text data.
import pyarrow as pa
# these all produce equivalent text columns
string_array = pa.array(["a", "b", "c", None, "a"])
dictionary_array = pa.DictionaryArray.from_arrays([0, 1, 2, None, 0], ["a", "b", "c"])
dictionary_array = string_array.dictionary_encode()
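Continuing the snippet above: if a DictionaryArray you build might contain unused or duplicate dictionary values, one way to normalize it is to decode and re-encode it. A sketch, assuming the array's values are strings:
# Decode to a plain StringArray, then re-encode: the new dictionary holds only
# the values that are actually used, each exactly once.
compacted = dictionary_array.cast(pa.string()).dictionary_encode()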
Parquet mimics Arrow.
A StringArray is encoded in any valid Parquet encoding, except a dictionary encoding.
A DictionaryArray is encoded in RLE_DICTIONARY encoding.
A StringArray becomes a pd.Series of objects, in which each object is either a str or None.
A DictionaryArray becomes a pd.Series of unordered Categorical without unused values. Nulls are stored as numpy.nan in this case.
Beware: many Pandas algorithms behave differently when faced with String versus Categorical data. To be safe, you can always "un-optimize" and convert a categorical to objects.
All these series are equivalent, as far as the user is concerned:
import pandas as pd
import pyarrow as pa
str_series = pd.Series(["a", "b", "c", None, "a"])
str_series_from_arrow = pa.array(["a", "b", "c", None, "a"]).to_pandas()
cat_series = str_series.astype("category")
cat_series_from_arrow = pa.array(["a", "b", "c", None, "a"]).dictionary_encode().to_pandas()
# Converting from category to str takes two lines of code:
str_series_from_dict = cat_series.astype(str)
str_series_from_dict[cat_series.isnull()] = None
Workbench only uses categorical arrays and object arrays to handle strings. So here's the test logic:
def is_text(series):
    return hasattr(series, 'cat') or series.dtype == object
- Treat either as an object array using str functions: for instance, series.str.lower()
- You can also use if hasattr(series, 'cat') and then deal with series.cat.rename_categories() and so forth. These functions can be fast and memory-efficient. Category logic is tricky: for instance, you can't simply say series = series.cat.rename_categories(series.cat.categories.str.upper()) because the upper() call may return duplicates, which would throw an error. If you know you want the output to be categories and you're not optimizing, the simplest approach is to use series.str.func().astype('category') (see the sketch below).
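For example, a minimal sketch of both strategies (to_uppercase is a hypothetical helper, not a Workbench API):
import pandas as pd

def to_uppercase(series: pd.Series) -> pd.Series:
    # Hypothetical helper: uppercase a text column, whatever its storage.
    if hasattr(series, 'cat'):
        # Not optimizing: run the str function, then re-encode as category.
        # This avoids the duplicate-category error rename_categories() can raise.
        return series.str.upper().astype('category')
    # Object array of str/None: .str functions propagate nulls.
    return series.str.upper()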
Pandas is inefficient with text. It is RAM-heavy: Each value costs 50 bytes of overhead, plus a pointer, all in physical RAM. (In comparison, an Arrow value costs only a pointer in overhead, mmapped so text does not need to consume physical RAM.) And it is fragmented: text is scattered across physical RAM, making for many cache misses. (In comparison, Arrow pointers and text data are contiguous.) If your text-manipulation module is too slow, rewrite it as pure Arrow before you even begin optimizing.
Beware series.astype(str), because it will try to convert np.nan to str, which will not give the result you expect. To convert properly, use this two-step process:
text_series = number_series.astype(str)  # np.nan => "nan"
text_series[number_series.isna()] = np.nan  # "nan" from previous line => np.nan
text_series = text_series.astype('category')  # optionally
Number
Number arrays are arrays of signed integers or finite floats of any size (or null).
The rules of precision, when operating on number scalars and columns (a sketch in Pandas follows this list):
- Treat float values that can be losslessly converted to integer as integer.
- If the input values are all integers that can be represented as (signed) int64, the output column must be an integer type; output values must be accurate as per int64 math.
- Otherwise, the output column must be a float type; output values must be accurate as per float64 math.
- To optimize space and speed, integers and floats may optionally be represented as int32, int16, int8 or float32 only if converting to the smaller type does not lose any data.
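A sketch of how those rules play out in Pandas:
import pandas as pd

int_series = pd.Series([1, 2, 3])  # int64 input
(int_series + 1).dtype             # int64: all-integer math stays integer
(int_series / 2).dtype             # float64: 0.5 cannot be represented as an integer
# A float column like [1.0, 2.0, 3.0] holds values that convert losslessly to
# integer, so per the first rule a module may output it as an integer type.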
Floats must be finite or null. Workbench does not support NaN, Infinity and -Infinity.
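For instance, a Pandas module that might produce infinities can convert them to null before returning:
import numpy as np
import pandas as pd

series = pd.Series([1.0, np.inf, -np.inf])
series[np.isinf(series)] = np.nan  # Infinity and -Infinity become null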
Module documentation should explain to the user how null is handled in that module. Usually, modules should mimic SQL NULL handling: when null is an input to simple math, return null; when null is a value being aggregated, ignore it.
Any Number column has a "format" that dictates how to convert them to String. This is stored in metadata as a string adhering to a subset of the Python format specification. For instance: ${:,d}M
means "dollar-sign prefix" ($
), "thousands separator" (,
), "cast as integer" (d
) and "M suffix" (M
).
The default format is {:,} -- which usually renders as a decimal of arbitrary precision. Whole numbers -- even whole floating-point numbers -- are cast to int before being passed to Python's format() function, because format strings are supported in JavaScript and JavaScript has only one Number type.
Common, useful formats: {:,.1%} (percentage), {:,.2f} (fixed point notation).
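To get a feel for these specs, here is how Python's format() renders the examples above:
'${:,d}M'.format(1234)    # '$1,234M'
'{:,}'.format(1234567.5)  # '1,234,567.5' -- the default format
'{:,.1%}'.format(0.1234)  # '12.3%'
'{:,.2f}'.format(1234.5)  # '1,234.50'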
The Arrow format for numbers is NumericArray (C++ docs, Python docs).
The concrete types each module must support are Int8Array, Int16Array, Int32Array, Int64Array, FloatArray and DoubleArray. The canonical logic is Int64Array and DoubleArray operations.
Nulls can be stored in any type. NaN, Infinity and -Infinity cannot be stored in any type.
The Parquet format for numbers is INT32, INT64, FLOAT32 or FLOAT64 -- whatever fits the data, using a sensible encoding. Integers' Logical Types must be signed-integer.
Nulls can be stored in any type. NaN, Infinity and -Infinity cannot be stored in any type.
Pandas numbers are backed by numpy arrays (np.array). Nulls are represented as np.nan. NaN, Infinity and -Infinity cannot be stored. This usually isn't a big concern: Pandas's handling of np.nan usually does what we want to do with Workbench nulls.
BUG: numpy does not allow nulls in an int64-typed array. When the module-loading code converts an Arrow table to Pandas, it will automatically convert a null-containing int64 array to float64. This is lossy, and it makes the module produce bugs according to the "rules of precision" above. If your users complain, rewrite your algorithm as Arrow to correct the problem.
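A quick demonstration of the lossy conversion:
import pyarrow as pa

pa.array([1, 2, None], pa.int64()).to_pandas().dtype  # dtype('float64')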
import numpy
import pandas
from pandas.api.types import is_numeric_dtype

def is_number(series):
    return is_numeric_dtype(series)

series1 = pandas.Series([1, 2, 3, numpy.nan])  # automatic dtype -- float64 in this case
series2 = pandas.Series([1, 2.1, 3], dtype=numpy.float32)  # manual dtype
A slew of numpy/pandas operations operate on an array at a time, instead of a value at a time. For instance:
series2 = series1 * 2  # series2 has same length as series1, doubled values
series3 = series2 - series1  # arraywise subtraction -- here, series3 equals series1
number_series = pandas.to_numeric(text_series) # throws ValueError
number_series = pandas.to_numeric(text_series, errors='coerce') # np.nan for each parse error
This "timestamp" type is a 64-bit number (or null
) representing nanoseconds since the UNIX Epoch (1970-01-01T00:00:00.000Z
). It does not store timezone. All times are UTC.
Leap seconds are ignored: a leap second has the value of the second before it.
The minimum timestamp is 1677-09-21 00:12:43.145225Z; the maximum timestamp is 2262-04-11 23:47:16.854775807Z.
In Arrow, we use a TimestampArray (C++ docs, Python docs). The unit must always be ns.
The following are equivalent:
import datetime
import dateutil.parser
import pyarrow as pa
by_int64 = pa.array([1234678901234567000, None], pa.timestamp("ns"))
by_pyobject = pa.array([datetime.datetime(2009, 2, 15, 6, 21, 41, 234567), None], pa.timestamp("ns"))
by_parse = pa.array([dateutil.parser.parse("2009-02-15T06:21:41.234567Z"), None], pa.timestamp("ns"))
pyarrow.array() defaults to microsecond precision when parsing datetimes. Microsecond precision is not allowed. So this won't work:
by_pyobject = pa.array([datetime.datetime(2009, 2, 15)]) # INVALID
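Passing pa.timestamp("ns") explicitly, as in the earlier examples, avoids the problem:
by_pyobject = pa.array([datetime.datetime(2009, 2, 15)], pa.timestamp("ns"))  # VALID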
In Parquet, we use the TIMESTAMP logical type to store an int64 (or null). isAdjustedToUTC is always true, and precision is always NANOS.
In Pandas, we use the datetime64[ns] numpy type. See Pandas timeseries docs. null becomes pd.NaT.
import pandas

def is_datetime(series):
    return pandas.api.types.is_datetime64_dtype(series.dtype)

# Dates are long to write
series = pandas.Series(['2018-08-07T17:12:00.003-04:00'], dtype='datetime64[ns]')
If the input series is ISO8601-formatted, all is well. Otherwise, chaos. For instance: pandas will interpret 03:12 to be today's time. A proper converter has its work cut out for it.
datetime_series = pandas.to_datetime(text_series) # throws ValueError
datetime_series = pandas.to_datetime(text_series, errors='coerce')  # pd.NaT for each parse error
Two cool ideas:
series.dt.floor('d')  # `.dt.floor` rounds timestamps down to the day
series2 - series1 # subtraction gives timedelta64 values (not Workbench-compatible)
(series2 - series1) / np.timedelta64(1, 's') # np.float64 -- Workbench-compatible
iso8601_series = series.dt.strftime('%Y-%m-%dT%H:%M:%SZ')
This "date" type is a 32-bit number (or null
) representing days since the UNIX Epoch (1970-01-01
). It does not store timezone.
The minimum date is -5877641-06-23; the maximum is 5881580-07-11.
Beware, Pythonistas: Python's native datetime.date raises errors for years under 0001 or over 9999. Do not use Python datetime.date for Workbench date math. What's more, pyarrow and pandas tend to convert to Python datetime.date for some operations (in particular, repr()): you must avoid this.
A Date column has a "unit": "year", "month", "day", "week", "quarter". The "unit" declares which dates are valid and how dates should be formatted. "Day" is the default.
In Arrow, we use a Date32Array (C++ docs, Python docs).
The following are equivalent:
import datetime
import pyarrow as pa
by_pyobject = pa.array([datetime.date(2021, 4, 5), None])
by_int32 = pa.array([18722, None], pa.date32())
In Parquet, we use the DATE logical type to store an int32 (or null).
In Pandas, we use the period[D] Pandas type. See Pandas timeseries docs. null becomes pd.NaT.
Beware mixed metaphors. A Pandas Period is a timespan (with a duration and timezone); a Workbench Date is a calendar date. So while the physical storage is correct, you'll want to steer clear of Pandas' tendency to convert anything to timestamp. Workbench draws a clear line between timestamp (with timezone) and date (without timezone); your module should, too.
import pandas as pd

def is_date(series):
    return pd.PeriodDtype('D') == series.dtype

series = pd.Series([pd.Period('2021-04-05', 'D')])
If the input series is ISO8601-formatted (YYYY-MM-DD), all is well. Otherwise, Pandas will silently convert to text.
date_series = pd.Series(text_series, dtype=pd.PeriodDtype('D'))
if pd.PeriodDtype('D') != date_series.dtype:
    raise ValueError("At least one date was badly formatted")
Cool ideas:
series.dt.asfreq('M').dt.asfreq('D', how='start')  # round down to month
iso8601_series = series.dt.strftime('%Y-%m-%d')