How to implement spill in StreamingWindow? #8975

JkSelf · 2024-03-06T03:53:27Z

JkSelf
Mar 6, 2024
Collaborator

Because the Window operator in Spark inserts an OrderBy operation to sort the input data before execution, we don't need to sort again within the Window operator. Therefore, we previously added StreamingWindowBuild in Velox to support the Window operator in Spark. During recent testing, we found that Q67 in TPC-DS still runs out of memory under tight memory conditions. This is because Q67 is a special case where all input data is in one window partition. Therefore, we plan to support the Spill case within the StreamingWindow.

Here are the steps we take to implement Spill in StreamingWindow:

1. Support spilling while data is being added during the addInput() phase. After all data in a complete WindowPartition has been added, the remaining data are spilled to a file. Then, when nextPartition() is called, the current WindowPartition is read from the file if it has been spilled; otherwise, it is read from memory.
2. The first step alone does not fully solve the Window spill problem, because WindowFunction evaluation currently only supports reading a WindowPartition from memory. Therefore, we also need to support reading data from the spill file in WindowPartition later.
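
The two steps above can be sketched roughly as follows. This is an illustrative Python model, not Velox code: `SpillablePartitionBuffer`, `max_rows_in_memory`, the row-count threshold that triggers a spill, and the pickle-based spill file are all assumptions made for the example. Only the addInput()/nextPartition() split and the "read from file if spilled, otherwise from memory" idea come from the proposal.

```python
import pickle
import tempfile


class SpillablePartitionBuffer:
    """Toy model of one window partition that can overflow to disk.

    Hypothetical sketch: the names and the pickle file format are assumptions
    made for illustration, not the Velox StreamingWindowBuild API.
    """

    def __init__(self, max_rows_in_memory):
        self.max_rows_in_memory = max_rows_in_memory
        self.rows = []            # rows currently held in memory
        self.spill_file = None    # created lazily on the first spill
        self.spilled_rows = 0     # number of rows written to the file

    def add_input(self, batch):
        """Buffer a batch; spill everything buffered once the limit is exceeded."""
        self.rows.extend(batch)
        if len(self.rows) > self.max_rows_in_memory:
            self._spill()

    def _spill(self):
        if self.spill_file is None:
            self.spill_file = tempfile.TemporaryFile()
        for row in self.rows:
            pickle.dump(row, self.spill_file)
        self.spilled_rows += len(self.rows)
        self.rows = []

    def next_partition(self):
        """Yield the partition's rows in arrival order.

        Spilled rows are read back from the file first, followed by any rows
        that are still in memory.
        """
        if self.spill_file is not None:
            self.spill_file.seek(0)
            for _ in range(self.spilled_rows):
                yield pickle.load(self.spill_file)
        yield from self.rows
```

Because `next_partition()` is a generator, the consumer sees one row at a time regardless of where the row is stored, which is the property step 2 asks of WindowPartition.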

Regarding the first step, we have opened a Draft PR that implements spill support during the addInput() phase. Could you help review whether this approach is feasible? Thanks. @mbasmanova @aditi-pandit @zhouyuan @zhztheplayer

mbasmanova · 2024-03-06T11:36:08Z

mbasmanova
Mar 6, 2024
Collaborator

@JkSelf I assume there are some easy cases, e.g. rank() over (partitioned by ... sorted by ... <default frame>). In these cases, the implementation doesn't really need to store all rows of a partition in memory. Instead, it can simply process one row at a time in a streaming manner without using much memory at all. For these cases, we don't need spilling.

There are harder cases where the frame changes wildly from row to row and the function needs to access all or many rows, or rows at the edges of the frame. In these cases spilling may not help. We would need to repeatedly read data from disk while processing each row, and that would be too slow to be useful.

Therefore, I suggest identifying the "easy" cases and optimizing their implementation so that it does not require storing all rows of a partition in memory.
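
As a rough illustration of such an "easy" case, the sketch below computes rank() over input that is already sorted by partition key and order key, one batch at a time. It is a Python model under stated assumptions, not the Velox implementation: the class name `StreamingRank`, the `add_batch` method, and the `(partition_key, order_key)` row shape are hypothetical.

```python
class StreamingRank:
    """Computes rank() over input already sorted by (partition key, order key).

    Only constant-size state is kept between batches: the current partition
    key, the previous order key, the row count so far, and the current rank.
    """

    def __init__(self):
        self.started = False
        self.partition_key = None
        self.prev_order_key = None
        self.rows_seen = 0
        self.rank = 0

    def add_batch(self, batch):
        """batch: list of (partition_key, order_key). Returns one rank per row."""
        ranks = []
        for pkey, okey in batch:
            new_partition = not self.started or pkey != self.partition_key
            if new_partition:
                # A new partition starts: reset the per-partition state.
                self.started = True
                self.partition_key = pkey
                self.rows_seen = 0
            self.rows_seen += 1
            # Peers (equal order keys) share a rank; otherwise the rank jumps
            # to the 1-based position of the row within the partition.
            if new_partition or okey != self.prev_order_key:
                self.rank = self.rows_seen
            self.prev_order_key = okey
            ranks.append(self.rank)
        return ranks
```

A partition may span any number of batches; no row needs to be retained after its rank has been emitted.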

aditi-pandit · 2024-03-06T16:53:23Z

aditi-pandit
Mar 6, 2024
Collaborator

@JkSelf :

I put a lot of thought into this problem when I started the Window design... and handling all cases is very complex (and could be aided by entirely new structures in Velox, such as random row access or a B-Tree concept). I have many notes in the Spilling section of https://ibm.box.com/s/nxttuhm2c0kttcpyv2kb23lxloem5q12.

I agree with @mbasmanova that in many cases window function computation can be made streaming, viz. ranking functions and incremental aggregations with (ROW frames, unbounded and default RANK), and those could be got out of the way first. I think TPC-DS has only those kinds of queries.
Staying on Q67 a bit longer: my understanding is that Q67 can be optimized to use the TopNRowNumber operator (see "Optimized TopN rank functions" in https://aws.amazon.com/blogs/big-data/run-queries-3x-faster-with-up-to-70-cost-savings-on-the-latest-amazon-athena-engine/). Can you implement this in Gluten?

JkSelf · 2024-03-07T02:29:59Z

JkSelf
Mar 7, 2024
Collaborator Author

@mbasmanova @aditi-pandit Thank you very much for your reply.

Based on your suggestions, I will first try to implement the spill of the Rank window function in Velox.
@aditi-pandit As for implementing TopNRowNumber in Gluten, Spark added the corresponding RankLimit operator in version 3.5 to achieve the same optimization (SPARK-37099). We have upgraded Spark to version 3.5 in Gluten and will enable this feature later. However, customers must also upgrade to Spark 3.5 to benefit from it, and our customers are currently on versions below 3.5; upgrading will take some time. So it is still necessary to implement spill in the Window operator for Gluten now.

Regarding TopNRowNumber, there is another issue. I saw that the input to RankLimit on the Spark side is still sorted. Do we need a variant of TopNRowNumber for already-sorted input, similar to what StreamingWindowBuild does for Window?

6 replies

JkSelf Mar 7, 2024
Collaborator Author

@mbasmanova Oh, I might have misunderstood your point. So you mean not to spill, but to optimize window functions such as rank so that they do not hold data in memory? For StreamingWindow, there is a problem. If all the data of a partition is in one WindowPartition, we hold the entire partition's data in memory during addInput(). In this case it seems necessary to hold the data, because the WindowPartition is not ready yet and the calculation in getOutput() cannot start.

JkSelf Mar 7, 2024
Collaborator Author

My current idea applies to window functions like Rank and Ntile. We need to implement spill in addInput() and then pass a SpillMergeStream to WindowPartition, so that the data of this partition can be read from the spilled file instead of being held in memory. Please correct me if I have misunderstood. @mbasmanova

JkSelf Mar 8, 2024
Collaborator Author

@mbasmanova @aditi-pandit Do you have any input? Thanks for your help.

aditi-pandit Mar 8, 2024
Collaborator

If the data is sorted and streaming is applicable, then we process one block from addInput() at a time. We don't need to construct a complete WindowPartition in this case. There are some corner situations, but that is the idea in general.

I don't think spilling is required.

mbasmanova Mar 8, 2024
Collaborator

@JkSelf

q67 uses rank()

rank() OVER (PARTITION BY i_category ORDER BY sumsales DESC)

My understanding is that Spark plans this using OrderBy(i_category, sumsales DESC) before the Window node. Hence, the data comes into Window operator already sorted by partition keys and order by keys. rank() function can produce the output without waiting to see all the rows in a partition. row_number() function is similar. Hence, these 2 functions do not require the whole partition to be stored in memory. Currently, StreamingWindow operator always loads full partition into memory, but it doesn't need to do that.

That said, other window functions do need to know how many rows there are in a partition (e.g. ntile). To support these for queries with huge partitions that don't fit in memory some sort of spilling would be required.
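
To make the dependency on partition size concrete, here is a small sketch of how an ntile bucket could be derived. It assumes the common SQL definition in which the first `partition_size % num_buckets` buckets receive one extra row; the function name `ntile_bucket` and its parameters are hypothetical and not taken from Velox.

```python
def ntile_bucket(row_index, partition_size, num_buckets):
    """Return the 1-based ntile bucket of a 0-based row index.

    Assumes the usual SQL semantics: rows are split as evenly as possible and
    the first (partition_size % num_buckets) buckets get one extra row.
    """
    base, extra = divmod(partition_size, num_buckets)
    # Rows below this boundary fall into the larger buckets of size base + 1.
    boundary = extra * (base + 1)
    if row_index < boundary:
        return row_index // (base + 1) + 1
    # When base == 0 every row is below the boundary, so this division is safe.
    return extra + (row_index - boundary) // base + 1
```

The same row index maps to different buckets depending on `partition_size`: with four buckets, row index 3 is in bucket 2 of a 10-row partition but in bucket 1 of a 20-row partition. The result for an early row therefore cannot be emitted until the last row of the partition has been seen, unlike rank() or row_number().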

Do you see such queries in practice, or only in benchmarks? If the latter, you can make changes to support row_number and rank over huge partitions without spilling. If the former, please share some of the use cases and we can figure out how to support them. If such support is needed, the first step would be to create a doc describing the use cases along with a detailed design of the solution.

Thanks.

aditi-pandit · 2024-03-08T06:20:56Z

aditi-pandit
Mar 8, 2024
Collaborator

@JkSelf : TopNRowNumber is not streaming. Yes, changes similar to StreamingWindow would be needed.

JkSelf · 2024-03-11T01:57:06Z

JkSelf
Mar 11, 2024
Collaborator Author

@mbasmanova @aditi-pandit Thank you for your replies.

1. I will first optimize the Rank and RowNumber functions, which use the default frame and do not require holding all partition data, by computing the window function result in a streaming fashion.
2. For scenarios not covered by the case above, does Velox plan to support spill in the future? Spark supports spill, so if Velox does not, Gluten will not be able to handle these scenarios.

5 replies

mbasmanova Mar 11, 2024
Collaborator

@JkSelf

Re: 2 - here, we are talking about the case where a single partition doesn't fit in memory. Do these occur in practice? Are these common enough to justify additional complexity?

JkSelf Mar 12, 2024
Collaborator Author

@mbasmanova Besides encountering OOM with Q67 TPC-DS, we now have a customer who has encountered OOM when calculating the window operator. The functions involved include sum, rank and row_number. Two types of Frames are used, one is the default frame (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), and the other is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. If we use solution 1, Gluten can only solve some of the customer's cases, but not all of the OOM issues.

mbasmanova Mar 12, 2024
Collaborator

@JkSelf Thank you for sharing further context.

> The functions involved include sum, rank and row_number

I assume we have a solution for rank and row_number, hence, only 'sum' needs to be considered.

> Two types of Frames are used, one is the default frame (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), and the other is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

The first frame can be solved easily because sum() can be computed incrementally, hence, no need to keep all rows in a partition in memory.
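
A minimal sketch of that incremental computation, assuming the input arrives sorted by partition key and that the frame is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW as quoted above. The function name `running_sums` and the `(partition_key, value)` row shape are hypothetical, and SQL NULL handling is ignored.

```python
from itertools import accumulate, groupby
from operator import itemgetter


def running_sums(rows):
    """Yield sum(value) over a ROWS UNBOUNDED PRECEDING..CURRENT ROW frame.

    rows: iterable of (partition_key, value), already sorted by partition key.
    Both groupby and accumulate are lazy, so only the running total of the
    current partition is kept in memory.
    """
    for _, group in groupby(rows, key=itemgetter(0)):
        yield from accumulate(value for _, value in group)
```

Because `rows` can be any iterator, the input can be produced batch by batch and each output value is available as soon as its row has been read.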

The second frame is the whole partition. What's the point of computing sum over the whole partition for all the rows in that partition? Can you share a bit more about this use case? Maybe ask the customer whether the query can be rewritten to avoid computing such sums.

aditi-pandit Mar 12, 2024
Collaborator

+1 @mbasmanova

The second frame looks like it comes from a correlated subquery rewrite (Groupby Pullup). I have seen systems implement that.

SELECT o1.orderdate,
       l.receiptdate
FROM orders o1,
     lineitem l,
     (SELECT orderkey, AVG(o2.totalprice) AS avgTotalPrice
      FROM orders o2
      GROUP BY orderkey) o2
WHERE o1.orderkey = l.orderkey
  AND o1.orderkey = o2.orderkey
  AND l.extendedprice > 104948
  AND o1.totalprice > avgTotalPrice

rewritten to

SELECT V.orderdate,
       V.receiptdate
FROM (
    SELECT l.receiptdate,
           o1.orderdate,
           o1.totalprice,
           AVG(o2.totalprice) OVER (PARTITION BY o2.orderkey
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS avgPricePerOrderKey
    FROM orders o2,
         orders o1,
         lineitem l
    WHERE o1.orderkey = l.orderkey
      AND l.extendedprice > 104948
      AND o2.orderkey = o1.orderkey
) V
WHERE V.totalprice > V.avgPricePerOrderKey

JkSelf Mar 15, 2024
Collaborator Author

@mbasmanova @aditi-pandit Thank you very much for your reply and suggestions. Currently, 80% of the customer's scenarios involve rank and row_number, and 20% involve sum. After we discussed this with the customer, they are considering rewriting the SQL to see whether they can avoid the sum use cases.

aditi-pandit · 2024-03-12T19:02:14Z

aditi-pandit
Mar 12, 2024
Collaborator

@JkSelf : It's interesting to know that Spark is going further in how much spilling it can handle.

Looking at the sub-partition level, does Spark assume that at least all rows of a frame should fit in memory?
If all rows of a frame should fit in memory, then we can work on making peer and frame calculations streaming by nature. That could be a next step.

If we had to handle the cases where even a frame might not fit in memory, then the algorithms get very complex. We would need to be able to handle random access of data in a spilled partition to make the implementation somewhat efficient.

1 reply

JkSelf Mar 15, 2024
Collaborator Author

@aditi-pandit The spill in Spark Window will put the input data into a buffer that can spill. It will hold the data of a complete window partition, not the data of a single frame. You can find the detailed code here.

aditi-pandit · 2024-03-25T19:05:56Z

aditi-pandit
Mar 25, 2024
Collaborator

@JkSelf @mbasmanova : I put together some notes about Window Partition spilling in doc https://ibm.ent.box.com/file/1481358566140?s=azyj3s8xgjyzdqqwv9q79yruj9xga6r7

Please give me your feedback on it.

aditi-pandit · 2024-04-16T14:56:19Z

aditi-pandit
Apr 16, 2024
Collaborator

@JkSelf : I started putting a doc together at https://docs.google.com/document/d/1ug_QAU4muSPqRZsBOanq--ECBUSzvIFDPov5R2XcdNs/edit?usp=sharing. I will be updating it over the course of this week.

I've been looking at your PR. I have a design in mind, and we can work on it together to make joint progress if you are open to the idea. Would you be up for a video call sometime to brainstorm further?
