[SPARK-49387][PYTHON] Fix type hint for accuracy in `percentile_approx` and `approx_percentile`

### What changes were proposed in this pull request?
Fix type hint for `accuracy` in `percentile_approx` and `approx_percentile`

### Why are the changes needed?
A float `accuracy` is not supported:
```
In [9]: df.select(approx_percentile("value", [0.25, 0.5, 0.75], 1.1).alias("quantiles")).show()

...

AnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "approx_percentile(value, array(0.25, 0.5, 0.75), 1.1)" due to data type mismatch: The third parameter requires the "INTEGRAL" type, however "1.1" has the type "DOUBLE". SQLSTATE: 42K09;
```
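
Passing an integer avoids the error above. As a minimal sketch, assuming a hypothetical helper `normalize_accuracy` (not part of PySpark, named here only for illustration), a caller could coerce whole-valued floats to `int` and reject fractional ones before calling `approx_percentile`:

```python
from typing import Union


def normalize_accuracy(accuracy: Union[int, float]) -> int:
    """Hypothetical helper (not part of PySpark): return `accuracy` as a positive int.

    Whole-valued floats such as 10000.0 are converted to int; fractional
    values such as 1.1 are rejected, mirroring the INTEGRAL requirement
    reported in the AnalysisException above.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(accuracy, bool):
        raise TypeError("accuracy must be an int, not bool")
    if isinstance(accuracy, float):
        if not accuracy.is_integer():
            raise TypeError(f"accuracy must be integral, got {accuracy!r}")
        accuracy = int(accuracy)
    if not isinstance(accuracy, int):
        raise TypeError(f"accuracy must be an int, got {type(accuracy).__name__}")
    if accuracy <= 0:
        raise ValueError(f"accuracy must be positive, got {accuracy!r}")
    return accuracy
```

With such a helper, the call from the session above might be written as `approx_percentile("value", [0.25, 0.5, 0.75], normalize_accuracy(1e4))`.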

### Does this PR introduce _any_ user-facing change?
Yes, a minor doc change.

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47869 from zhengruifeng/py_approx_percentile_acc.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
zhengruifeng authored and attilapiros committed Oct 4, 2024
1 parent af8e8aa commit 22bd84a
Showing 2 changed files with 6 additions and 6 deletions.
4 changes: 2 additions & 2 deletions python/pyspark/sql/connect/functions/builtin.py
@@ -1223,7 +1223,7 @@ def percentile(
def percentile_approx(
col: "ColumnOrName",
percentage: Union[Column, float, Sequence[float], Tuple[float]],
-    accuracy: Union[Column, float] = 10000,
+    accuracy: Union[Column, int] = 10000,
) -> Column:
percentage = lit(list(percentage)) if isinstance(percentage, (list, tuple)) else lit(percentage)
return _invoke_function_over_columns("percentile_approx", col, percentage, lit(accuracy))
@@ -1235,7 +1235,7 @@ def percentile_approx(
def approx_percentile(
col: "ColumnOrName",
percentage: Union[Column, float, Sequence[float], Tuple[float]],
-    accuracy: Union[Column, float] = 10000,
+    accuracy: Union[Column, int] = 10000,
) -> Column:
percentage = lit(list(percentage)) if isinstance(percentage, (list, tuple)) else lit(percentage)
return _invoke_function_over_columns("approx_percentile", col, percentage, lit(accuracy))
8 changes: 4 additions & 4 deletions python/pyspark/sql/functions/builtin.py
@@ -6339,7 +6339,7 @@ def percentile(
def percentile_approx(
col: "ColumnOrName",
percentage: Union[Column, float, Sequence[float], Tuple[float]],
-    accuracy: Union[Column, float] = 10000,
+    accuracy: Union[Column, int] = 10000,
) -> Column:
"""Returns the approximate `percentile` of the numeric column `col` which is the smallest value
in the ordered `col` values (sorted from least to greatest) such that no more than `percentage`
@@ -6360,7 +6360,7 @@ def percentile_approx(
When percentage is an array, each value of the percentage array must be between 0.0 and 1.0.
In this case, returns the approximate percentile array of column col
at the given percentage array.
-    accuracy : :class:`~pyspark.sql.Column` or float
+    accuracy : :class:`~pyspark.sql.Column` or int
is a positive numeric literal which controls approximation accuracy
at the cost of memory. Higher value of accuracy yields better accuracy,
1.0/accuracy is the relative error of the approximation. (default: 10000).
@@ -6397,7 +6397,7 @@ def percentile_approx(
def approx_percentile(
col: "ColumnOrName",
percentage: Union[Column, float, Sequence[float], Tuple[float]],
-    accuracy: Union[Column, float] = 10000,
+    accuracy: Union[Column, int] = 10000,
) -> Column:
"""Returns the approximate `percentile` of the numeric column `col` which is the smallest value
in the ordered `col` values (sorted from least to greatest) such that no more than `percentage`
@@ -6414,7 +6414,7 @@ def approx_percentile(
When percentage is an array, each value of the percentage array must be between 0.0 and 1.0.
In this case, returns the approximate percentile array of column col
at the given percentage array.
-    accuracy : :class:`~pyspark.sql.Column` or float
+    accuracy : :class:`~pyspark.sql.Column` or int
is a positive numeric literal which controls approximation accuracy
at the cost of memory. Higher value of accuracy yields better accuracy,
1.0/accuracy is the relative error of the approximation. (default: 10000).
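
The docstring in the diff states that `1.0/accuracy` is the relative error of the approximation. A small sketch of that arithmetic, using a hypothetical function name `relative_error` that does not exist in PySpark:

```python
def relative_error(accuracy: int) -> float:
    """Relative error as the docstring describes it: 1.0 / accuracy.

    Hypothetical illustration only; PySpark exposes no such function.
    """
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    return 1.0 / accuracy


# A larger accuracy gives a smaller relative error, at the cost of memory.
for acc in (100, 10000, 1000000):
    print(acc, relative_error(acc))
```

Under this reading, the default `accuracy` of 10000 corresponds to a relative error of 1 in 10000.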