Allow multiple CUs to write output files to a single DU. #173

pradeepmantha · 2014-02-09T19:15:51Z

Use case - In case of MapReduce - multiple Map tasks generates data related to a single reduce task.

Currently all the output files of a Map task are stored together in a DU, and later all the files related to a reduce from Map tasks are segregated by the MapReduce framework to pass the DU's as inputs to reduce task.

This could be optimized, if we can allow a Map task to write output files to a reduce DU's directly.

I envision, this could be a useful feature for other use cases too.

There could be some concurrency problems with metdata udpate, I just want log this, as we might come up with some solution to make this possible.

pradeepmantha · 2014-02-17T02:04:46Z

The task is still in 'Running' until the output data of the task is moved, which is captured by output DU. This should be decoupled.. The task should be moved to 'Done' state, to improve concurrency and Output DU should take care of moving data.

marksantcroos · 2014-02-19T13:27:34Z

Hi Pradeep, apart from the semantic problems this would create, why is it an optimization to have all data in one DU?

pradeepmantha · 2014-02-19T18:19:16Z

consider below example - I have 1000 map tasks running, where each task generates set of output files.. which need to be grouped later into 8 DUs.... With current framework.. I need to create 1000 empty intermediate Output DUS to store files of each task.. and later create 8 more DUS to manage the files between DUs. With the 1000 tasks directly writing to 8 DUS, will eliminate the creation of intermediate 1000 DUS and related wait time. This is useful for applications where high performance is required, where even milli seconds matter.. I think, this is additional flexibility that can be provided, and its upto application to use it or not.

marksantcroos · 2014-02-19T19:50:27Z

Hi Pradeep,

On 19 Feb 2014, at 19:19 , pradeepmantha notifications@github.com wrote:

consider below example - I have 1000 map tasks running, where each task generates set of output files.. which need to be grouped later into 8 DUs….

Ok, clear.

With current framework.. I need to create 1000 empty intermediate Output DUS to store files of each task..

Ok.

and later create 8 more DUS to manage the files between DUs.

Arguably you don’t need those 8 DUs but could divide the 1000 DUs in 8 piles, right?
(Not saying you should, but to get the semantics clear)

With the 1000 tasks directly writing to 8 DUS, will eliminate the creation of intermediate 1000 DUS and related wait time. This is useful for applications where high performance is required, where even milli seconds matter.. I think, this is additional flexibility that can be provided, and its upto application to use it or not.

I hear your concern about performance, which is important of course.
When performance is at play, its not necessarily required to change semantics, changing implementation might be enough.

Having “mixed” writing to a DU opens up a whole can of worms:

What if the CU’s run on different DCIs with different transfer mechanisms?
What if you want to replicate a DU?
What will be the state model of a DU?

Note that I’m not against your “wish”, but your proposal has far reaching implications, so it needs to be considered carefully.

Gr,

Mark

pradeepmantha · 2014-02-20T02:49:16Z

Hi,

Arguably you don't need those 8 DUs but could divide the 1000 DUs in 8
piles, right?

No. The DU( of the 8DU group) needs one file from each DU ( 1000 DU
group).. Explained the case below with detailed example..

i.e Task 1 creates - { file-0, file-1,file-2, file-3.... file-7}
Task 2 creates - { file1-0, file1-1,file1-2.... file1-7}
....
Task 1000 creates - { file999-0, file999-1,file999-2.... file999-7}

Now 8 Dus should look like.

DU -1 = { file1-0, file2-0, file3-0 .. file999-0}

DU-8 = { file 1-7,file2-7 ... file999-7}

so each DU ( of the 8DUS) need one file from the 1000 DUs, which is lot of
performance impact in case of MapReduce framework. So, I can't simply
divide into 1000 DUS into 8 groups.

Having "mixed" writing to a DU opens up a whole can of worms:

What if the CU's run on different DCIs with different transfer
mechanisms?

     - I am still trying to understand how the mixing can be avoided

with Current PD design? Consider CUS running on XSEDE cluster that supports
GlobusOnline PD, and FG Sierra(ssh/scp PD) and the target DU resides on
Sierra..
with current PD
- Condor tasks push data to Condor Pilot-data, Sierra
tasks push data to Sierra Pilot-Data.. Then for grouping, 8 DUS should be
prepared by mixing 1000 DUS(Which is not yet supported) from both condor
and Sierra pilot-datas and moved to Sierra Pilot-Data.

           With CUS writing data directly to DU -
               - Condor tasks directly push data to DU on sierra, based

on the affinity of target DU & Pilot-Data affinity.

   Both the cases need some effort.

What if you want to replicate a DU?
- Could be similar to above..

What will be the state model of a DU?

   - Might affect the state of CU, as data transfer from CU to DU could

be long and need a intermediate state to avoid cpu resource blocking, until
the transfer is finished.

Note that I'm not against your "wish", but your proposal has far reaching
implications, so it needs to be considered carefully.

Yes, I am mostly thinking in perspective of Pilot-MapReduce, which is
just one use case. But providing above features or hooks to have these
kind of features, will help the performance of application. These could be
some helper functions leaving application to figure out whats best for it.

Thanks
Pradeep

On Wed, Feb 19, 2014 at 11:50 AM, Mark Santcroos
notifications@git.luolix.topwrote:

Hi Pradeep,

On 19 Feb 2014, at 19:19 , pradeepmantha notifications@github.com
wrote:

consider below example - I have 1000 map tasks running, where each task
generates set of output files.. which need to be grouped later into 8 DUs....

Ok, clear.

With current framework.. I need to create 1000 empty intermediate Output
DUS to store files of each task..

Ok.

and later create 8 more DUS to manage the files between DUs.

Arguably you don't need those 8 DUs but could divide the 1000 DUs in 8
piles, right?
(Not saying you should, but to get the semantics clear)

With the 1000 tasks directly writing to 8 DUS, will eliminate the
creation of intermediate 1000 DUS and related wait time. This is useful for
applications where high performance is required, where even milli seconds
matter.. I think, this is additional flexibility that can be provided, and
its upto application to use it or not.

I hear your concern about performance, which is important of course.
When performance is at play, its not necessarily required to change
semantics, changing implementation might be enough.

Gr,

Mark

Reply to this email directly or view it on GitHubhttps://github.com//issues/173#issuecomment-35540272
.

marksantcroos · 2014-05-27T22:26:58Z

Hi Pradeep,

Still had this open, gave it some more thought, triggered by the Two Tales paper.

Long ago pradeepmantha notifications@github.com wrote:

Arguably you don't need those 8 DUs but could divide the 1000 DUs in 8
piles, right?

No. The DU( of the 8DU group) needs one file from each DU ( 1000 DU
group).. Explained the case below with detailed example..

i.e Task 1 creates - { file-0, file-1,file-2, file-3.... file-7}
Task 2 creates - { file1-0, file1-1,file1-2.... file1-7}
....
Task 1000 creates - { file999-0, file999-1,file999-2.... file999-7}

Now 8 Dus should look like.

DU -1 = { file1-0, file2-0, file3-0 .. file999-0}

DU-8 = { file 1-7,file2-7 ... file999-7}

so each DU ( of the 8DUS) need one file from the 1000 DUs, which is lot of
performance impact in case of MapReduce framework. So, I can't simply
divide into 1000 DUS into 8 groups.

Ok, I didn’t understand the pattern correctly, thanks for clarifying.

But I’m still tempted to say that you should “just” either:
A. create 8000 DU’s then. Then you either use all of these DUs independently or you merge them into 8 DUs.
B. create 1000 DU’s then. Then you split and merge them into 8 new DUs.

If we don’t talk about performance at this stage, then from a semantical/expressiveness perspective, this both does what you need I believe.

Of course both A and B will have different performance characteristics, but I would consider that an optimisation problem once we settle on the semantics.

Having "mixed" writing to a DU opens up a whole can of worms:

What if the CU's run on different DCIs with different transfer
mechanisms?

I am still trying to understand how the mixing can be avoided
with Current PD design? Consider CUS running on XSEDE cluster that supports
GlobusOnline PD, and FG Sierra(ssh/scp PD) and the target DU resides on
Sierra..
with current PD

Condor tasks push data to Condor Pilot-data, Sierra
tasks push data to Sierra Pilot-Data.. Then for grouping, 8 DUS should be
prepared by mixing 1000 DUS(Which is not yet supported) from both condor
and Sierra pilot-datas and moved to Sierra Pilot-Data.

With CUS writing data directly to DU -

Condor tasks directly push data to DU on sierra, based
on the affinity of target DU & Pilot-Data affinity.

Both the cases need some effort.

Regardless of what route we go there, I’m tempted to say that the initial output DUs should be written to by just one CU.
For the simple reasons that CUs on different DCIs might simply not be able to write to the same place. That is a real showstopper for mixed DU writing.

What if you want to replicate a DU?

Could be similar to above..

Thats a bit too ambiguous ...
If multiple CUs can write to a DU, I think it complicates the state model, because its state now depends on many CUs instead of just 1, and thereby make the decision wether to replicate and when more difficult.

What will be the state model of a DU?

Might affect the state of CU, as data transfer from CU to DU could
be long and need a intermediate state to avoid cpu resource blocking, until
the transfer is finished.

See my comment about states above.

Overall, I fully agree that the current granularity of PD makes use cases like yours complex.
But thats also why you created another layer that abstracts much of that away.

Then the issue of having to manage 1, 8, 1000, or 8000 DUs becomes “just” a performance issue.
To work on the performance aspects, I can see that we need to create some more possibilities to split/merge DUs to support your use case as good as we can!

Gr,

Mark

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow multiple CUs to write output files to a single DU. #173

Allow multiple CUs to write output files to a single DU. #173

pradeepmantha commented Feb 9, 2014

pradeepmantha commented Feb 17, 2014

marksantcroos commented Feb 19, 2014

pradeepmantha commented Feb 19, 2014

marksantcroos commented Feb 19, 2014

pradeepmantha commented Feb 20, 2014

marksantcroos commented May 27, 2014

DU -1 = { file1-0, file2-0, file3-0 .. file999-0}

Allow multiple CUs to write output files to a single DU. #173

Allow multiple CUs to write output files to a single DU. #173

Comments

pradeepmantha commented Feb 9, 2014

pradeepmantha commented Feb 17, 2014

marksantcroos commented Feb 19, 2014

pradeepmantha commented Feb 19, 2014

marksantcroos commented Feb 19, 2014

pradeepmantha commented Feb 20, 2014

DU -1 = { file1-0, file2-0, file3-0 .. file999-0}

marksantcroos commented May 27, 2014

DU -1 = { file1-0, file2-0, file3-0 .. file999-0}