[Translation] CLHC for Horizontal Scaling #5

Open
KipCrossing opened this issue Jul 27, 2020 · 2 comments

@KipCrossing
Member

Problem:

Horizontal scaling involves adding nodes to handle the processing of large datasets. A common method for processing these large datasets across many nodes is MapReduce, which is currently used by the soiltech project.

The conditioned Latin hypercube sampling (cLHS) method, as outlined by Minasny and McBratney (2006), requires certain variables to be shared by all nodes. These shared variables include:

  • x - All current samples
  • r - Reservoir of the unused samples
  • Metro (from the annealing schedule)

Ideally, for horizontally scaled MapReduce problems, nodes should not need to share any state while processing the data.
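To illustrate why these variables end up shared, here is a minimal sketch of an annealing loop in plain Python. It is a simplification and not the exact procedure from the paper: it assumes "Metro" refers to the Metropolis acceptance test, and the names `anneal`, `objective`, `temp` and `cooling` are hypothetical. Every iteration reads and writes `x`, `r` and the temperature, so the loop cannot be split across nodes without sharing them.

```python
import math
import random

def anneal(rows, n, objective, iterations=1000, temp=1.0, cooling=0.95, seed=0):
    """Simplified annealing search for n samples (illustrative only)."""
    rng = random.Random(seed)
    pool = list(rows)
    rng.shuffle(pool)
    # x: current samples, r: reservoir of unused samples (the shared state).
    x, r = pool[:n], pool[n:]
    current = objective(x)
    for _ in range(iterations):
        i, k = rng.randrange(n), rng.randrange(len(r))
        x[i], r[k] = r[k], x[i]  # propose swapping one sample with the reservoir
        candidate = objective(x)
        delta = candidate - current
        # Metropolis test: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = candidate
        else:
            x[i], r[k] = r[k], x[i]  # rejected: undo the swap
        # Annealing schedule: cool down, with a floor to avoid division by zero.
        temp = max(temp * cooling, 1e-12)
    return x, current
```

Each proposal depends on the result of the previous one, which is what makes this loop hard to distribute.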

Solution:

The quality of the selected samples is determined by the objective function; hypothetically, it can be evaluated on any set of N random samples. The entire dataset can therefore be split into groups of N random samples, and the objective function applied to each group. New, different groups can be made by randomising the dataset again, so the process can be repeated to increase the probability of finding the 'best' sample set. The suggested method is as follows:

Note: this will be done using the methods provided by the Apache Spark library.

  1. Get quantile definitions by sorting
  2. Map a random number to each of the potential samples
  3. Sort the dataset based on the random number
  4. Reduce data into groups of N samples
  5. Calculate the objective function (OF) for each group
  6. Select the group with the best OF

Repeat steps (1) to (6) 10 times and choose the best sample group overall. More than 10 repetitions may be used to increase the odds of finding a better group.
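The steps above can be sketched roughly as follows. This is plain Python standing in for the Spark map, sort and reduce calls, not the project's actual code. It assumes continuous variables only and scores a group by how far each quantile stratum is from holding exactly one sample (lower is better, 0 is a perfect Latin hypercube). The names `quantile_edges`, `objective` and `best_group` are hypothetical, and the quantile edges are computed once here because they do not change between repetitions.

```python
import random
from bisect import bisect_right

def quantile_edges(values, n):
    """Return n + 1 edges that split the sorted values into n strata."""
    s = sorted(values)
    last = len(s) - 1
    return [s[round(i * last / n)] for i in range(n + 1)]

def objective(group, edges_per_var):
    """Sum over variables and strata of |samples in stratum - 1|."""
    n = len(group)
    total = 0
    for j, edges in enumerate(edges_per_var):
        counts = [0] * n
        for row in group:
            # Search the interior edges only, so the result is always in 0..n-1.
            counts[bisect_right(edges, row[j], 1, n) - 1] += 1
        total += sum(abs(c - 1) for c in counts)
    return total

def best_group(rows, n, rounds=10, seed=0):
    """Return (score, group) for the best group of n rows found."""
    rng = random.Random(seed)
    n_vars = len(rows[0])
    # Step 1: quantile definitions from the sorted values of each variable.
    edges = [quantile_edges([r[j] for r in rows], n) for j in range(n_vars)]
    best = None
    for _ in range(rounds):
        # Steps 2-3: attach a random key to every row, then sort by that key.
        keyed = sorted(((rng.random(), r) for r in rows), key=lambda kr: kr[0])
        shuffled = [r for _, r in keyed]
        # Step 4: cut the shuffled rows into groups of n, dropping a short tail.
        groups = [shuffled[i:i + n] for i in range(0, len(shuffled) - n + 1, n)]
        # Steps 5-6: score each group on its own and keep the lowest score.
        for g in groups:
            score = objective(g, edges)
            if best is None or score < best[0]:
                best = (score, g)
    return best
```

Because each group is scored independently, steps 5 and 6 need no state shared between groups, only the quantile edges, which are fixed once computed.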

This alternative method does not require the nodes to share any variables. Further, there is no limit on the size of the dataset. Lastly, the entire dataset can be used to obtain the final group of samples.

@KipCrossing
Member Author

Comment below if you would like to view the code.

@KipCrossing
Member Author

Using the original method:

[Image: Quantiles_old2]

Using the new method:

[Image: Quantiles]
