- Fixes
- Repeated dropna/dropnan/dropmissing could report cached length #874
- Features
- Arrow is now a core dependency, vaex-arrow is deprecated. Much better chunked array support, numpy conversion is done lazily. #517
This is now part of vaex-core.
- Features
- Normalize histogram and change selection mode. #826
- Performance
- isin uses hashmaps, leading to a 2x-4x performance increase for primitives, 200x for strings in some cases #822
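
As a rough illustration of why a hash-based `isin` can be so much faster, here is a pure-Python sketch. It is not vaex's implementation, and `isin_scan` and `isin_hashed` are hypothetical names. Building a hash table from the lookup values once turns each membership test into an average constant-time lookup instead of a scan over all lookup values.

```python
def isin_scan(values, lookup):
    """Naive membership test: scans `lookup` once per value."""
    lookup = list(lookup)
    return [v in lookup for v in values]


def isin_hashed(values, lookup):
    """Hash-based membership test: build a set once, then do average O(1) lookups."""
    table = set(lookup)
    return [v in table for v in values]


values = ["a", "b", "c", "a"]
lookup = ["a", "c"]
mask = isin_hashed(values, lookup)  # one boolean per entry in `values`
```

Both functions return the same mask; only the cost per lookup differs, which matters most when the lookup list is long.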
- Features
- Selection toggle list. #797
- Fixes
- Remote dataframe was still using dtype, not data_type. #797
- Features
- Implementation of `GroupbyTransformer` #479
- Fixes
- Various fixes for aliased columns (column names with invalid identifiers) #768
- Fixes
- Fixes
- Masked arrays supported in hdf5 files on s3 #781
- Expression.map always uses masked arrays to be state transferrable (a new dataset might have missing values) #479
- Support importing Pandas dataframes with version 0.23 #794
- Various fixes for aliased columns (column names with invalid identifiers) #768 #793
- Fixes
- Join could in rare cases point to row 0, when there were values in the left, not present in the right #765
- Tabulate 0.8.7 escaped html, undo this to print dataframes nicely.
- Breaking changes:
- Python 2 is not supported anymore
- Variables don't have access to pi and e anymore
- `df.rename_column` is now `df.rename` (and also renames variables)
- DataFrame uses a normal dict instead of OrderedDict, requiring Python >= 3.6
- Default limits (e.g. for plots) is minmax, so we don't miss outliers
- `df.get_column_names()` returns the aliased names (invalid identifiers), pass `alias=False` to get the internal column name
- Default value of `virtual` is True in method `df.export`, `df.to_dict`, `df.to_items`, `df.to_arrays`.
- df.dtype is a property, to get data types for expressions, use df.data_type(), df.expr.dtype is still behaving the same
- df.categorize takes min_value and max_value, and no longer needs the check argument, also the labels do not have to be strings.
- vaex.open/from_csv etc does not copy the pandas index by default #756
- df.categorize takes an inplace argument, similar to most methods, and returns the dataframe affected.
- Performance
- Refactor
- Fixes
- Renaming columns fixes #571
- Joining with virtual columns but different data, and name collision fixes #570
- Variables are treated similarly as columns, and respected in join #573
- Arguments to lazy functions which are numpy arrays get put in the variables #573
- Executor does not block after failed/interrupted tasks. #571
- Default limits (e.g. for plots) is minmax, so we don't miss outliers #581
- Do not fail printing out dataframe with 0 rows #582
- Give proper NameError when using non-existing column names #299
- Several fixes for concatenated dataframes. #590
- dropna/nan/missing only dropped rows when all column values were missing, if no columns were specified. #600
- Flaky test for RobustScaler skipped for p36 #614
- Copying/printing sparse matrices #615
- Sparse column names with invalid identifiers are not rewritten. #617
- Column names with invalid identifiers which are rewritten are shown when printing the dataframe. #617
- Column name rewriting for invalid identifiers also works on virtual columns. #617
- Fix the links to the example datasets. #609
- Expression.isin supports dtype=object #669
- Fix `column_count`, now only counts hidden columns if explicitly specified #593
- df.values respects masked arrays #640
- Rewriting a virtual column and doing a state transfer does not lead to `ValueError: list.remove(x): x not in list` #592
- `df.<stat>(limits=...)` will now respect the selection #651
- Using automatic names for aggregators led to many underscores in name #687
- Support Python3.8 #559
- Features
- New lazy numpy wrappers: np.digitize and np.searchsorted #573
- `df.to_arrow_table`/`to_pandas_df`/`to_items`/`df.to_dict`/`df.to_arrays` now take a chunk_size argument for chunked iterators #589 (vaexio#699)
- Filtered datasets can be concatenated. #590
- DataFrames/Executors are thread safe (meaning you can schedule/compute from any thread), which makes it work out of the box for Dash and Flask #670
- `df.count/mean/std` etc can output in xarray.DataArray array type, makes plotting easier #671
- Column names can have unicode, and we use str.isidentifier to test, also don't accidentally hide columns. #617
- Percentile approx can take a sequence of percentages #527
- Polygon testing, useful in combinations with geo/geojson data #685
- Added dt.quarter property and dt.strftime method to expression (by Juho Lauri) #682
- Refactored server, can return multiple binary blobs, execute multiple tasks, cancel tasks, encoding/serialization is more flexible (like returning masked arrays). #571
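
To make the `str.isidentifier` check mentioned above concrete, the following sketch shows which column names would count as invalid identifiers. The helper `needs_alias` is a hypothetical name for illustration only and is not part of the vaex API.

```python
def needs_alias(name):
    """Return True when `name` is not a valid Python identifier."""
    return not name.isidentifier()


names = ["x", "petal length", "2nd", "größe", "class"]
flags = {name: needs_alias(name) for name in names}

# Unicode letters are allowed in identifiers, so "größe" passes.
# Note that str.isidentifier() also accepts keywords such as "class";
# keyword.iskeyword() from the standard library can be used to catch those.
```

Names containing spaces or starting with a digit fail the test, which is the kind of column name the aliasing entries in this changelog refer to.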
- Requirement of vaex-core >=2,<3
- Requirement of vaex-core >=2,<3
- Requirement of vaex-core >=2,<3
- Requirement of vaex-core >=2,<3
- Requirement of vaex-core >=2,<3
- Requirement of vaex-core >=2,<3
- Fixes
- Booleans were negated, and didn't respect offsets.
- Requirement of vaex-core >=2,<3
- Breaking changes
- vaex-jupyter is refactored #654
- Features
- Fixes
- Slicing arrow string arrays with masked arrays is respected/working #530
- Performance
- IncrementalPredictor uses parallel chunked support (2x speedup possible) #515
- Fix
- Features
- Performance
- Dataframes are always true (implements `__bool__`) to avoid calling `__len__` #496
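
The `__bool__` change relies on standard Python behaviour: when a class defines no `__bool__`, truth testing falls back to `__len__`, which can be expensive for a filtered dataframe. A minimal sketch, using a hypothetical `LazyTable` class rather than vaex's actual DataFrame:

```python
class LazyTable:
    """Toy container whose length is (pretend) expensive to compute."""

    def __init__(self, rows):
        self._rows = rows
        self.len_calls = 0  # counts how often the expensive path runs

    def __len__(self):
        self.len_calls += 1  # stands in for a costly row count
        return len(self._rows)

    def __bool__(self):
        # Always truthy, so `if table:` never triggers __len__.
        return True


table = LazyTable([])
truthy = bool(table)  # uses __bool__, not __len__
```

With `__bool__` defined, an empty table is still truthy and the length is never computed during the truth test.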
- Fixes
- Do not duplicate column when joining DataFrames on a column with the same name #480
- Better error messages/stack traces, and work better with debugger. #488
- Accept numpy scalars in expressions. #462
- Expression.astype can create datetime64 columns out of (arrow) strings arrays. #440
- Invalid mask access triggered when memory-mapped read only for strings. #459
- Features
- Features
- IncrementalPredictor for `scikit-learn` models that support the `.partial_fit` method #497
- Fixes
- Adding unique function names to dataframes to enable adding a predictor twice #492
- Compatibility with vaex-core 1.4.0
- Performance
- Parallel df.evaluate #474
- Avoid calling df.get_column_names (1000x for 1 billion rows per column use) #473
- Slicing e.g df[1:-1] goes much faster for filtered dataframes #471
- Dataframe copying and expression rewriting was slow #470
- Double indices columns were not using index cache since empty dict is falsy #439
- Features
- requires vaex-core >=1.3,<2 for parallel evaluate
- Fixes:
- bqplot 0.12 revealed a bug/inconsistency with heatmap #465
- Fixes
- Support for Apache Arrow >= 0.15
- Fixes
- Docstrings and minor improvements
- initial release 0.1
- feature: auto upcasting for sum #435
- fix: selection/filtering fix when using masked values #431
- fix: masked string array fixes #434
- fix: memory usage fix for joins #439
- fix: support for Apache Arrow >= 0.15