
Added Sorting Functions #233

Merged: 6 commits merged into hablapps:main on Jul 12, 2022
Conversation

@Amalicia (Contributor) commented Jun 24, 2022

Description

This PR implements the asc and desc functions alongside the orderBy and sort functions currently present in Spark.

Related Issue

Resolves #70

Motivation and Context

Currently, when performing orderBy or sort operations on a DataFrame, you have to use Spark columns and functions. This PR allows doric columns to be used when sorting.
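A rough sketch of the usage this enables (illustrative only; the column constructors and the exact asc/desc spellings here are assumptions based on this PR, not a documented API):

import doric._

// Hypothetical usage: sort with typed doric columns instead of
// Spark's untyped Column API; df is any DataFrame with these columns.
val sorted = df.orderBy(colInt("age").desc, colString("name").asc)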

How Has This Been Tested?

  • This pull request contains appropriate tests.

@Amalicia requested a review from a team as a code owner June 24, 2022 11:39
@github-actions bot added labels spark_2.4, spark_3.0, spark_3.1, spark_3.2, spark_3.3 (PR changes to the corresponding Spark version) on Jun 24, 2022
@Amalicia marked this pull request as draft June 24, 2022 11:40
@codecov bot commented Jun 24, 2022

Codecov Report

Merging #233 (0c02a25) into main (0b3f5e7) will increase coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #233      +/-   ##
==========================================
+ Coverage   96.61%   96.64%   +0.03%     
==========================================
  Files          53       54       +1     
  Lines         943      952       +9     
  Branches       18       12       -6     
==========================================
+ Hits          911      920       +9     
  Misses         32       32              
Flag Coverage Δ
spark-2.4.x 93.00% <100.00%> (+0.07%) ⬆️
spark-3.0.x 95.82% <100.00%> (+0.04%) ⬆️
spark-3.1.x 96.78% <100.00%> (+0.03%) ⬆️
spark-3.2.x 96.78% <100.00%> (+0.03%) ⬆️
spark-3.3.x 96.78% <100.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
core/src/main/scala/doric/sem/SortingOps.scala 100.00% <100.00%> (ø)
...re/src/main/scala/doric/syntax/CommonColumns.scala 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0b3f5e7...0c02a25.

@alfonsorr (Member) commented Jun 27, 2022

Rebase your branch with the latest updates of main; it is missing the changes for Spark 3.3.

Sorry, on my phone it showed me that the changes from the last PR were reverted 😵‍💫

@eruizalo mentioned this pull request on Jun 28, 2022
eruizalo previously approved these changes Jun 28, 2022

@eruizalo (Collaborator) left a comment:


LGTM!

Just rebase your branch, because codecov is acting weird: it says the coverage will drop because of your PR, and that's not true.

I updated issue #70 as well, because we were only talking about column functions (we forgot about DataFrame ordering).

alfonsorr previously approved these changes Jun 29, 2022
@Amalicia (Contributor, Author) commented:

Apologies for the delay on this! The past few days have been a little busy for me. Looking at the updated #70, it seems all that's left on this issue is to add def sortWithinPartitions(sortExprs: Column*): Dataset[T], right? 😄
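(For context, that signature is Spark's native API. In plain Spark it is used as below; a doric counterpart would presumably take doric columns instead.)

import org.apache.spark.sql.functions.col

// Spark's sortWithinPartitions orders rows inside each partition
// without triggering a global, cross-partition sort.
val locallySorted = df.sortWithinPartitions(col("group"), col("value").desc)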

@eruizalo (Collaborator) commented:

No problem at all! We really appreciate what you are doing.

About #70, yeah, that's the only function left. I updated the issue to track more related functions, but that doesn't mean you have to implement them (you can, though; if you don't, we will do it as soon as we can).

Anyway, whenever you rebase your branch and mark your PR as "ready for review" we will merge it.

Thank you again for your help and effort!

@Amalicia (Contributor, Author) commented:

The Spark 2.4 tests are failing due to Spark's month subtraction returning a different result from the LocalDate equivalent... It would appear this is some flakiness that will occur on certain days of the year.

One option to get around this and prevent it from happening again would be to hardcode the date to a certain value instead of using LocalDate.now(). Any other thoughts or suggestions? 🤔
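(To illustrate the day-of-year dependence: java.time clamps month arithmetic to the last valid day of the target month, so an expected value computed from LocalDate.now() shifts behaviour near month ends, which is where Spark 2.4's older calendar logic disagrees. The java.time half looks like this:)

import java.time.LocalDate

// java.time clamps to the last valid day of the target month, so the
// result depends on which day of the month the test happens to run:
LocalDate.of(2022, 5, 31).minusMonths(3) // 2022-02-28 (clamped)
LocalDate.of(2022, 5, 15).minusMonths(3) // 2022-02-15 (exact)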

@Amalicia (Contributor, Author) commented Jul 8, 2022

As a follow-up to the above comment about the LocalDate flakiness, I found the PR that updated the logic to use the LocalDate API: apache/spark#25153. I'll move the existing logic to the Spark 3.x tests and try to come up with a solution for the 2.4 tests.

@Amalicia marked this pull request as ready for review July 8, 2022 09:50
@alfonsorr (Member) commented:
Hi! I'm back, sorry for the absence. I see you got it fixed, but to test it for all versions we can set the expected value according to the Spark version. Something like:

import java.sql.Date
import java.time.LocalDate
import org.apache.spark.sql.{functions => f}

val expectedResult =
  if (spark.version.take(1).toInt > 2)
    List(Date.valueOf(LocalDate.now.minusMonths(3)), null).map(Option(_))
  else
    ??? // spark 2 result (left unspecified in the original comment)

df.testColumns2("dateCol", -3)(
  (d, m) => colDate(d).addMonths(m.lit),
  (d, m) => f.add_months(f.col(d), m),
  expectedResult
)

Thanks for all the work 😄

@Amalicia (Contributor, Author) commented:

Welcome back! No worries at all!!

Sounds good! Just for my understanding: when should tests be placed into the Spark 3 folders, like I did in b2b4c38, versus handling it with version logic like this?

@alfonsorr (Member) commented:

The folder names list the Spark versions they apply to: if a folder has 3.0 in its name, its tests run only on Spark 3.0, and if it lists several versions, its tests run on all of them. The plain scala folder applies to every version.
In your case, we can add one test to the 2.4 folder and the other to the 3.0_3.1_3.2_3.3 folder, or we can simplify it by adding a single test to the scala folder with a condition on the version.

We use the version folders mainly for compile problems (different interfaces, methods, or classes that don't exist in a given Spark version, etc.). In your case the function exists and is compatible; the only problem is that it returns one value or another depending on the version.
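(Schematically, the convention reads like this; the folder names are illustrative of the pattern, not exact repository paths:)

src/test/
├── scala/                    tests compiled and run for every Spark version
├── spark_2.4/                tests that apply only to Spark 2.4
└── spark_3.0_3.1_3.2_3.3/    tests that apply to Spark 3.0 through 3.3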

@alfonsorr (Member) commented:

Codecov error, I'm going to launch it again. 😓

@alfonsorr merged commit b77d57f into hablapps:main Jul 12, 2022
@Amalicia deleted the feature/sort-function branch July 12, 2022 07:34
Labels
spark_2.4, spark_3.0, spark_3.1, spark_3.2, spark_3.3 (PR changes to the corresponding Spark version)

Development
Merging this pull request may close issue #70 (Sorting functions).

3 participants