Released 2024-10-16
- feature: add support for Python 3.12, with corresponding updates to core dataframe dependencies
- feature: add `--set` flag for overriding values within `earthmover.yml` from the command line
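
As a rough illustration of what a command-line override amounts to, the sketch below sets a value inside a nested config dictionary. The dotted-key addressing, the `apply_override` helper, and the sample keys are all hypothetical; this changelog does not specify the actual `--set` syntax or implementation.

```python
from typing import Any


def apply_override(config: dict, dotted_key: str, value: Any) -> dict:
    """Set a nested value addressed by a dotted key such as "a.b.c".

    Hypothetical helper: intermediate levels are created when missing.
    """
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


# Sample config with made-up keys, standing in for a parsed YAML file.
config = {
    "config": {"log_level": "INFO"},
    "sources": {"schools": {"file": "./schools.csv"}},
}

apply_override(config, "sources.schools.file", "./other.csv")
apply_override(config, "config.log_level", "DEBUG")
```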
Released 2024-09-06
- bugfix: Jinja in destination `header` failed if dataframe is empty
Released 2024-09-04
- feature: implementing a limit_rows operation
- feature: add support for a `require_rows` boolean or non-negative int on any node
- feature: add support for Jinja in a destination node header and footer
- bugfix: union fails with duplicate columns
Released 2024-08-07
- feature: add `json_array_agg` function to `group_by` operation
- feature: select all columns using "*" in `modify_columns` operation
- internal: set working directory to the location of the `earthmover.yaml` file
- documentation: add information on `earthmover init` and `earthmover clean` to the README
- bugfix: fix bug with `earthmover clean` that could have removed earthmover.yaml files
Released 2024-07-12
- feature: add `earthmover init` command to initialize a new sample project in the expected bundle structure
- internal: expand test run to include the new `debug` and `flatten` operations, as well as a nested JSON source file
- internal: improve customization in write behavior in new file destinations
- bugfix: Fix bug when writing null values in `FileDestination`
Released 2024-06-26
- hotfix: Fix bug when writing out JSON in `FileDestination`
Released 2024-06-18
- hotfix: Resolve incompatible package dependencies
- hotfix: Fix type casting of nested JSON for destination templates
Released 2024-06-14
- feature: Add `DebugOperation` for logging data head, tail, columns, or metadata mid-run
- feature: Add `FlattenOperation` for splitting and exploding string columns into values
- feature: Add optional 'fill_missing_columns' field to `UnionOperation` to fill disjunct columns with nulls, instead of raising an error (default `False`)
- feature: Add `git_auth_timeout` config when entering Git credentials during package composition
- feature: Add `earthmover clean` command that removes local project artifacts
- feature: only output compiled template during `earthmover compile`
- feature: Render full row into JSON lines when `template` is undefined in `FileDestination`
- internal: Move `FileSource` size-checking and `FtpSource` FTP-connecting from compile to execute
- internal: Move template-file check from compile to execute in `FileDestination`
- internal: Allow filepaths to be passed to an optional `FileSource`, and check for file before creating empty dataframe
- internal: Build an empty dataframe if an empty folder is passed to an optional `FileSource`
- internal: fix some examples in README
- internal: remove GitPython dependency
- bugfix: fix bug in `FileDestination` where `linearize: False` resulted in BOM characters
- bugfix: fix bug where nested JSON would be loaded as a stringified Python dictionary
- bugfix: Ensure command list in help menu and log output is always consistent
- bugfix: fix bug in `ModifyColumnsOperation` where `__row_data__` was not exposed in Jinja templating
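
The nested-JSON bugfix in this release concerns values ending up as a stringified Python dictionary. The sketch below shows, in plain Python, why that representation matters: the `str()` form of a dict is not valid JSON, while `json.dumps()` output round-trips. The sample record is made up, and this only illustrates the general pitfall, not earthmover's internal code.

```python
import json

# A made-up nested record, standing in for one row of a nested JSON source.
record = {"id": 1, "tags": ["a", "b"], "address": {"city": "Austin", "zip": None}}

# str() yields Python repr syntax: single quotes and None instead of null.
as_python_repr = str(record)

# json.dumps() yields real JSON: double quotes and null.
as_json = json.dumps(record)

# The repr form cannot be parsed back as JSON.
try:
    json.loads(as_python_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False

# The JSON form round-trips to an equal object.
round_tripped = json.loads(as_json)
```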
Released 2024-04-26
- internal: allow any ordering of Transformations during graph-building in compile
- internal: only create a `/packages` dir when `earthmover deps` succeeds
Released 2024-04-17
- feature: add project composition using `packages` keyword in template file (see README)
- feature: add installation extras for optional libraries, and improve error logging to notify which is missing
- feature: `GroupByWithRankOperation` cumulatively sums record counts by group-by columns
- feature: setting `log_level: DEBUG` in template configs or setting `debug: True` for a node displays the head of the node mid-run
- feature: add `optional_fields` key to all Sources to add optional empty columns when missing from schema
- feature: add optional `ignore_errors` and `exact_match` boolean flags to `DateFormatOperation`
- internal: force-cast a dataframe to string-type before writing as a Destination
- internal: remove attempted directory-hashing when a source is a directory (i.e., Parquet)
- internal: refactor project to standardize import paths for Node and Operation
- internal: add `Node.full_name` attribute and `Node.set_upstream_source()` method
- internal: unify graph-building into compilation
- internal: refactor compilation and execution code for cleanliness
- internal: unify `Node.compile()` into initialization to ease Node development
- internal: Remove unused `group_by_with_count` and `group_by_with_agg` operations
Released 2024-04-08
- feature: adding fromjson() function to Jinja
- feature: fix docs typos
- feature: `SortRowsOperation` sorts the dataset by `columns`
Released 2023-09-11
- breaking change: remove `source` as Operation config and move to Transformation; this simplifies templates and reduces memory usage
- breaking change: `version: 2` required in Earthmover YAML files
- feature: `SnakeCaseColumnsOperation` converts all columns to snake_case
- feature: `show_progress` can be turned on globally in `config` or locally in any Source, Transformation, or Destination to display a progress bar
- feature: `repartition` can be turned on in any applicable `Node` to alter Dask partition-sizes post-execute
- feature: improve performance when writing Destination files
- feature: improved Earthmover YAML-parsing and config-retrieval
- internal: rename `YamlEnvironmentJinjaLoader` to `JinjaEnvironmentYamlLoader` for better transparency of use
- internal: simplify Earthmover.build_graph()
- internal: unify Jinja rendering into a single util function, instead of redeclaring across project
- internal: unify `Node.verify()` into `Node.execute()` for improved code legibility
- internal: improve attribute declarations across project
- internal: improve type-hinting and doc-strings across project
- bugfix: refactor SqlSource to be compatible with SQLAlchemy 2.x
Released 2023-07-11
Released 2023-06-13
Released 2023-05-12
- bugfix: `config.state_file` was being ignored when specified
- bugfix: further issues with multi-line `config.macros` - the resolution here (hopefully the last one!) is to pre-load macros (so they can be injected into run-time Jinja contexts) and then just allow the Jinja to render any macro definitions down to nothing in the config YAML... you do have to be careful with Jinja linebreak suppression, i.e.

      config:
        macros: >
          # this is a macro!
          {%- macro test() -%}
          testing!
          {%- endmacro -%}
      sources:
        ...

  could render down to

      config:
        macros: >
          # this is a macro!sources:
        ...

  which will fail with an error about no sources defined.
- bugfix: charset issues when reading / writing non-UTF8 files - this should be resolved by enforcing every file read/write to specify UTF8 encoding
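
For the charset fix above, a minimal sketch of what "specifying UTF8 encoding on every file read/write" looks like in plain Python. The file name and sample text are made up; the point is only that passing `encoding="utf-8"` explicitly avoids depending on the platform default, which differs between operating systems.

```python
import os
import tempfile

# Made-up sample text containing non-ASCII characters.
text = "escuela, niño, café\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")

    # Write with an explicit encoding instead of the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read back with the same explicit encoding.
    with open(path, "r", encoding="utf-8") as f:
        read_back = f.read()
```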
Released 2023-05-05
- feature: implement ability to call `{{ md5(column) }}` in Jinja throughout earthmover, with a framework for other Python functions to be added in the future
- bugfix: fix multi-line macros issue
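
For reference, a minimal sketch of an MD5 helper of the kind that could back a `{{ md5(column) }}` call. The function below is hypothetical: casting the value to a string and encoding it as UTF-8 before hashing are assumptions, not confirmed details of earthmover's implementation.

```python
import hashlib


def md5(value) -> str:
    """Return the hex MD5 digest of a value's string form.

    Assumption: the value is cast to str and encoded as UTF-8 first.
    """
    return hashlib.md5(str(value).encode("utf-8")).hexdigest()


digest = md5("hello")
```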
Released 2023-05-02
- bugfix: fix continued issues with environment variable expansion under Windows by changing from `os.path.expandvars()` to native Python `string.Template` implementation
- bugfix: change how earthmover loads `config.macros` from YAML to prevent issues with multi-line macros definitions
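
A minimal sketch of variable expansion with Python's `string.Template`, which substitutes `$NAME` and `${NAME}` placeholders the same way on every platform and regardless of surrounding quotes. The YAML snippet and variable names are made up, a plain dict stands in for the process environment, and the use of `safe_substitute()` (which leaves unknown placeholders untouched) is an assumption rather than a confirmed detail of earthmover.

```python
from string import Template

# Made-up YAML text with one placeholder inside single quotes.
raw_yaml = "data_dir: '${DATA_DIR}/input'\nyear: $YEAR\nother: $UNDEFINED\n"

# A plain dict stands in for os.environ so the result is reproducible.
env = {"DATA_DIR": "/tmp/data", "YEAR": "2023"}

# safe_substitute() fills known placeholders and leaves unknown ones as-is.
expanded = Template(raw_yaml).safe_substitute(env)
```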
Released 2023-03-27
- bugfix: a single quote in the config YAML could prevent environment variable expansion from working since `os.path.expandvars()` does not expand variables within single quotes in Python under Windows
Released 2023-03-23
- feature: added parse-time Jinja templating to YAML configuration
  ⚠️ Potentially breaking change: if your config YAML contains `add_columns` or `modify_columns` operations with Jinja expressions, these will now be parsed at YAML load time. To preserve the Jinja for runtime parsing, wrap the expressions with `{%raw%}...{%endraw%}`. See YAML parsing for further information.
- feature: removed dependency on matplotlib, which is only required if your YAML specified `config.show_graph: True`... now if you try to `show_graph` without matplotlib installed, you'll get an error prompting you to install matplotlib
Released 2023-02-23
- feature: added `str_min()` and `str_max()` functions for `group_by` operation
Released 2023-02-17
- feature: pass `__row_data__` dict into Jinja templates for easier dynamic column referencing
- bugfix: parameter / env var interpolation into YAML keys, not just values
- refactor error handling key assertion methods
- refactor YAML loader line number context handling
Released 2022-12-16
- trim nodes not connected to a destination from DAG
- ensure all source datatypes return a Dask dataframe
- update optional source functionality to require `columns` list, and pass an empty dataframe through the DAG
Released 2022-10-27
- support running in Google Colab
Released 2022-10-27
- support for Python 3.7
Released 2022-09-22
- initial release