[Translation] CLHC for Horizontal Scaling #5

KipCrossing · 2020-07-27T07:13:08Z

Problem:

Horizontal scaling means adding nodes to handle the processing of large datasets. A common method of processing large datasets across many nodes is MapReduce, which is currently used by the soiltech project.

The conditioned Latin hypercube sampling (cLHS) method, as outlined by Minasny and McBratney (2006), requires certain variables to be shared by all nodes. These shared variables include:

x - All current samples
r - reservoir of the non-used samples
Metro (from the annealing schedule)
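To make the dependency concrete, here is a minimal sketch of a single annealing iteration. It is a simplification, not the full published algorithm: the function name `anneal_step`, the `objective` callable and the single random swap are assumptions made for illustration. The point is that every iteration reads and mutates `x` and `r` and depends on the current temperature, so iterations cannot run independently on separate nodes.

```python
import math
import random


def anneal_step(x, r, objective, temperature, rng):
    """One swap-and-test iteration of simulated annealing (simplified).

    x: list of currently selected samples (mutated in place)
    r: reservoir of unused samples (mutated in place)
    Returns True if the proposed swap was kept.
    """
    old_score = objective(x)
    i = rng.randrange(len(x))
    j = rng.randrange(len(r))
    x[i], r[j] = r[j], x[i]  # propose: swap one sample with the reservoir
    delta = objective(x) - old_score
    # Metropolis acceptance probability: always keep improvements,
    # keep a worse set with probability exp(-delta / temperature).
    metro = 1.0 if delta <= 0 else math.exp(-delta / temperature)
    if rng.random() < metro:
        return True
    x[i], r[j] = r[j], x[i]  # rejected: undo the swap
    return False
```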

Ideally, for MapReduce problems that scale horizontally, nodes should not need to share state while processing the data.

Solution:

The quality of the selected samples is determined by the objective function; hypothetically, it can be evaluated on any set of random samples of size N. Therefore, the entire dataset may be split into groups of N random samples, and the objective function can be applied to each group. Further, new unique groups may be made by randomising the dataset again, so the process may be repeated to increase the probability of finding the 'best' sample set. The suggested method is as follows:

Note: this will be done within the context of the methods provided by the Apache Spark library

1. Get quantile definitions by sorting
2. Map a random number to each of the potential samples
3. Sort the dataset based on the random number
4. Reduce data into groups of N samples
5. Calculate the objective function (OF) for each group
6. Get the group with the best OF

Repeat steps (1) to (6) 10 times and choose the best sample group (more than 10 repetitions may be used to increase the odds).
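The procedure above can be sketched in plain Python to show the logic. This is not the Spark implementation, and several things are assumptions made for illustration: all function names are hypothetical, the data is a list of tuples of continuous variables, and the objective function only counts how far each quantile stratum is from holding exactly one sample (the categorical and correlation terms of the published objective function are left out).

```python
import random
from bisect import bisect_right


def quantile_cuts(data, n):
    """Step 1: per variable, the n - 1 cut points that split it into n strata."""
    m = len(data)
    cuts = []
    for column in zip(*data):
        ordered = sorted(column)
        cuts.append([ordered[(k * m) // n] for k in range(1, n)])
    return cuts


def objective(group, cuts):
    """Step 5: sum of |count - 1| over all variables and strata (0 is ideal)."""
    n = len(group)
    score = 0
    for var, var_cuts in enumerate(cuts):
        counts = [0] * n
        for row in group:
            counts[bisect_right(var_cuts, row[var])] += 1
        score += sum(abs(c - 1) for c in counts)
    return score


def best_group_one_round(data, n, cuts, rng):
    """Steps 2 to 6: random key, sort, chunk into groups of n, score, keep best."""
    keyed = [(rng.random(), row) for row in data]  # step 2
    keyed.sort(key=lambda pair: pair[0])           # step 3
    shuffled = [row for _, row in keyed]
    # Step 4: consecutive groups of n; an incomplete final group is dropped.
    groups = [shuffled[i:i + n] for i in range(0, len(shuffled) - n + 1, n)]
    scored = [(objective(g, cuts), g) for g in groups]
    return min(scored, key=lambda pair: pair[0])   # step 6


def select_samples(data, n, rounds=10, seed=0):
    """Repeat steps 2 to 6 `rounds` times and return (score, best group)."""
    rng = random.Random(seed)
    cuts = quantile_cuts(data, n)  # quantiles depend only on the data
    results = [best_group_one_round(data, n, cuts, rng) for _ in range(rounds)]
    return min(results, key=lambda pair: pair[0])


# Usage with synthetic data: two continuous variables, 1000 candidate samples.
data_rng = random.Random(1)
data = [(data_rng.random(), data_rng.gauss(0, 1)) for _ in range(1000)]
score, samples = select_samples(data, n=5)
```

Each group is scored using only its own rows and the precomputed cut points, so no state needs to pass between groups or between rounds.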

This alternative method does not require any of the nodes to share variables. Further, there is no limit on the size of the dataset. Lastly, the entire dataset may be used to obtain the final group of samples.

KipCrossing · 2020-07-27T07:13:35Z

Comment below if you would like to view the code.

KipCrossing · 2020-09-22T08:30:57Z

Using the original method:

Using the new method:

KipCrossing added the translate label Jul 27, 2020

KipCrossing assigned correllink and KipCrossing Jul 27, 2020
