Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed replacing instance of dataframe when using aggregate method #1057

Merged
merged 2 commits into from
Apr 27, 2024

Conversation

norberttech
Copy link
Member

Change Log

Added

Fixed

  • DataFrame::aggregate returns the same instance of dataframe with updated pipeline

Changed

Removed

Deprecated

Security


Description

As much as I hate using reflections I even more hate the idea of exposing Pipelines outside the dataframe. Pipelines are considered to be internal and should not be exposed as it's not something we want to provide BC promise.

I couldn't find a better compromise between keeping the data frame mutable (it's just a big builder for the pipeline) and allowing it to expose partial builders like in this case GroupBy. Since we are just building a pipeline, returning a new instance of its builder is highly problematic since it's already a different object so it can't be referenced anymore.

<?php

$df = df();
$df->read(...);
$df->withEntry(...);
$df->aggregate(...)
// without reflection this would not work anymore
$df->rename(...)
$rows = $df->fetch();

The goal of GroupedDataFrame is to not let users do something like for example DataFrame::groupBy()->run(). That would be highly problematic since it would be a source of memory leaks as grouping without aggregation is just collecting all rows. InnerBuilders for more advanced transformations are also improving DX since IDE will guide developers through all possible methods.

Copy link
Contributor

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak         | mode             | rstdev          |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| AvroExtractorBench    | bench_extract_10k | 1    | 3   | 35.290mb +0.00%  | 847.026ms -0.53% | ±0.53% +29.39%  |
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 5.010mb +0.01%   | 344.675ms -0.68% | ±0.63% +23.47%  |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 5.165mb +0.00%   | 1.077s -1.59%    | ±0.13% -87.59%  |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 135.837mb +0.00% | 911.921ms -0.96% | ±0.46% -22.65%  |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.925mb +0.01%   | 35.490ms -1.39%  | ±2.58% +785.80% |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.931mb +0.01%   | 433.845ms -1.70% | ±0.87% +16.94%  |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev         |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 116.231mb +0.00% | 59.593ms -1.70% | ±0.58% -41.36% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode             | rstdev          |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| AvroLoaderBench    | bench_load_10k | 1    | 3   | 96.676mb +0.00%  | 452.794ms -1.90% | ±0.71% +36.10%  |
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 55.151mb +0.00%  | 67.738ms -2.68%  | ±1.74% +34.47%  |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 107.585mb +0.00% | 50.570ms -3.24%  | ±0.46% -20.00%  |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 227.005mb +0.00% | 1.412s -1.37%    | ±0.78% +222.28% |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.967mb +0.00%  | 37.618ms -4.13%  | ±0.23% -80.19%  |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
Building Blocks
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark               | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 87.051mb +0.00%  | 3.309ms -12.97%  | ±1.74% +126.82% |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 102.649mb +0.00% | 188.537ms +0.19% | ±1.10% +119.81% |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 85.369mb +0.00%  | 18.714ms -2.01%  | ±0.59% -56.83%  |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 88.291mb +0.00%  | 1.663ms -13.56%  | ±0.77% +123.65% |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 88.291mb +0.00%  | 1.632ms -19.57%  | ±1.32% -59.11%  |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 85.403mb +0.00%  | 2.473ms -15.43%  | ±3.13% +352.03% |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 85.932mb +0.00%  | 15.949ms -4.59%  | ±0.36% -28.03%  |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 85.932mb +0.00%  | 15.984ms -7.00%  | ±1.08% +20.24%  |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 83.836mb +0.00%  | 1.800μs -5.56%   | ±0.00% -100.00% |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 83.836mb +0.00%  | 0.300μs -25.00%  | ±0.00% -100.00% |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 93.186mb +0.00%  | 12.535ms -8.56%  | ±1.69% +280.64% |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 122.557mb +0.00% | 61.649ms -1.76%  | ±0.26% -81.79%  |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 86.452mb +0.00%  | 1.223ms -22.95%  | ±2.07% -8.62%   |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 89.799mb +0.00%  | 62.461ms -4.68%  | ±0.54% -59.92%  |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 88.553mb +0.00%  | 3.839ms -4.62%   | ±0.84% -52.79%  |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 83.914mb +0.00%  | 39.598ms -3.59%  | ±1.20% +49.71%  |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 83.915mb +0.00%  | 40.761ms -0.52%  | ±1.42% -30.34%  |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 83.914mb +0.00%  | 40.496ms -2.55%  | ±0.28% -67.83%  |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 86.278mb +0.00%  | 7.362ms -2.85%   | ±0.76% -25.88%  |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 83.836mb +0.00%  | 28.908ms -5.60%  | ±0.69% +43.27%  |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 83.836mb +0.00%  | 13.100μs -4.93%  | ±0.62% -30.87%  |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 83.836mb +0.00%  | 15.888μs -3.67%  | ±0.60% +108.00% |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 102.651mb +0.00% | 191.701ms -0.32% | ±0.82% +41.67%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 116.729mb +0.00% | 519.883ms -6.72% | ±0.80% -71.20%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 60.207mb +0.00%  | 255.641ms -3.98% | ±0.81% +141.91% |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 15.140mb +0.00%  | 54.415ms -5.18%  | ±0.41% -70.00%  |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 59.966mb +0.00%  | 433.937ms -1.68% | ±0.40% +84.71%  |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 14.505mb +0.00%  | 85.211ms -2.81%  | ±0.55% -1.01%   |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+

@norberttech norberttech merged commit 42fe036 into flow-php:1.x Apr 27, 2024
17 checks passed
@norberttech norberttech deleted the bug/data-frame-aggregation branch May 9, 2024 08:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant