LIHADOOP-23735: Added new DistributedCacheLimit heuristic. #187

Merged: 1 commit into linkedin:master on Feb 6, 2017

Conversation

rajagopr
Contributor

Jobs which put large files (> 500 MB) in the distributed cache are flagged.
Files referenced by the following properties are considered:
mapreduce.job.cache.files
mapreduce.job.cache.archives

<classname>com.linkedin.drelephant.mapreduce.heuristics.DistributedCacheLimitHeuristic</classname>
<viewname>views.html.help.mapreduce.helpDistributedCacheLimit</viewname>
<params>
<distributed.cache.file.size.limit>500000000</distributed.cache.file.size.limit>
</params>
Contributor

Can you suffix the property with bytes?
distributed.cache.file.size.limit.bytes

Contributor Author

Done.
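
For reference, a minimal sketch of how the renamed parameter could be read from the heuristic's param map. The class name, the resolveLimit helper, and the plain Map parameter are illustrative assumptions, not the exact dr-elephant API.

import java.util.Map;

// Hypothetical illustration: read the renamed limit from the heuristic's params,
// falling back to the 500 MB default when the key is absent or malformed.
final class DistributedCacheLimitParams {
  static final String LIMIT_KEY = "distributed.cache.file.size.limit.bytes";
  static final long DEFAULT_LIMIT_BYTES = 500L * 1024 * 1024;

  static long resolveLimit(Map<String, String> paramMap) {
    String raw = paramMap.get(LIMIT_KEY);
    if (raw == null) {
      return DEFAULT_LIMIT_BYTES;
    }
    try {
      return Long.parseLong(raw.trim());
    } catch (NumberFormatException e) {
      return DEFAULT_LIMIT_BYTES; // keep the default rather than fail the heuristic
    }
  }
}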

private static final String MAPREDUCE_JOB_CACHE_ARCHIVES_FILESIZES = "mapreduce.job.cache.archives.filesizes";
private static final String MAPREDUCE_JOB_CACHE_FILES = "mapreduce.job.cache.files";
private static final String MAPREDUCE_JOB_CACHE_ARCHIVES = "mapreduce.job.cache.archives";
private static long distributedCacheFileSizeLimit = 500000000; // 500MB default
Contributor

500 * FileUtils.ONE_MB would be cleaner and more accurate.

Contributor Author

Done.
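
With that change, the constant would look roughly as below. FileUtils.ONE_MB from commons-io is 1024 * 1024, so 500 * ONE_MB is 524,288,000 bytes rather than the earlier literal 500000000.

import org.apache.commons.io.FileUtils;

class DistributedCacheLimits {
  // 500 * ONE_MB = 524,288,000 bytes (ONE_MB is 1024 * 1024),
  // versus the earlier hard-coded literal 500000000.
  private static long distributedCacheFileSizeLimit = 500 * FileUtils.ONE_MB;
}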


if (checkFileSizeLimit(result, cacheFileToSizeMap)) {
result.setSeverity(Severity.MODERATE);
logger.warn("Cache file size was found to be exceeding the max limit of " + distributedCacheFileSizeLimit
Contributor

I don't think we should be logging job-specific messages.

In fact, any logging in the heuristics class will simply increase our log file size without much benefit.

Contributor

I am not sure why you want to avoid job-specific messages. All messages will be in the context of analyzing some job, and it's useful to have the job context for better debugging.
However, in this case, I think this message is redundant and has the same info that we can find in the DB. Hence, I suggest removing it.
Also, app_id is a better identifier than the job name.

Contributor Author

Ok, removed the message.

private boolean checkFileSizeLimit(HeuristicResult result, Map<String, String> cacheFileToSizeMap) {
boolean limitViolated = false;
for (String file : cacheFileToSizeMap.keySet()) {
long size = Long.parseLong(cacheFileToSizeMap.get(file));
Contributor

Can you handle the parse exception here?

Contributor Author

Done.
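
A sketch of one way the parse could be guarded; skipping the unparsable entry (rather than, say, flagging it) is an assumption, since the merged handling isn't shown in this excerpt.

import java.util.Map;

class CacheSizeCheck {
  // Hypothetical sketch: skip entries whose size cannot be parsed instead of
  // letting a NumberFormatException abort the whole heuristic run.
  static boolean exceedsLimit(Map<String, String> cacheFileToSizeMap, long limitBytes) {
    boolean limitViolated = false;
    for (Map.Entry<String, String> entry : cacheFileToSizeMap.entrySet()) {
      long size;
      try {
        size = Long.parseLong(entry.getValue());
      } catch (NumberFormatException e) {
        continue; // size not reported in a usable form; ignore this file
      }
      if (size > limitBytes) {
        limitViolated = true;
      }
    }
    return limitViolated;
  }
}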

0);
}

String[] archiveCacheFilesArray = archiveCacheFiles.split(",");
Contributor

I'd prefer using an ArrayList. (http://stackoverflow.com/questions/716597/array-or-list-in-java-which-is-faster)

List<String> archiveCacheFilesArray = new ArrayList<String>(Arrays.asList(archiveCacheFiles.split(",")));

Contributor Author

I'm not very convinced about the use case here; there's a cost to convert it. However, I have changed it per your suggestion.
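
For context, the copy into a fresh ArrayList (instead of using Arrays.asList directly) only matters if the list is mutated later; a small hedged illustration with made-up values, since this excerpt does not show how the list ends up being used.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CacheFileListExample {
  public static void main(String[] args) {
    String archiveCacheFiles = "a.jar,b.tar.gz,c.zip"; // hypothetical example value
    // Arrays.asList returns a fixed-size view; copying into an ArrayList
    // allows add/remove later without an UnsupportedOperationException.
    List<String> archiveCacheFilesList =
        new ArrayList<String>(Arrays.asList(archiveCacheFiles.split(",")));
    archiveCacheFilesList.remove("c.zip"); // would throw on the fixed-size view
    System.out.println(archiveCacheFilesList); // [a.jar, b.tar.gz]
  }
}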


String[] cacheFileSizesArray = cacheFileSizes.split(",");

if (cacheFilesArray.length != cacheFileSizesArray.length) {
result.setSeverity(Severity.MODERATE);
Contributor

Is it either NONE or MODERATE? I think exceeding 500 MB should be critical. Also, we should be able to have intermediate limits for severe and moderate. If you are not sure what those limits should be, Jack should be able to suggest some.

Contributor Author

I will get clarification on whether to have intermediate limits or not.
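
If intermediate limits were adopted, the check could grade the largest cached file against ascending thresholds instead of a single cut-off. The thresholds and the local Severity enum below are purely illustrative (the real code uses dr-elephant's Severity class); they are not values agreed in this PR.

class CacheSeveritySketch {
  // Illustrative thresholds only (in bytes); the PR as merged kept a single 500 MB limit.
  static final long MODERATE_BYTES = 500L * 1024 * 1024;
  static final long SEVERE_BYTES = 1024L * 1024 * 1024;       // 1 GB, hypothetical
  static final long CRITICAL_BYTES = 2L * 1024 * 1024 * 1024; // 2 GB, hypothetical

  enum Severity { NONE, MODERATE, SEVERE, CRITICAL }

  static Severity grade(long largestCacheFileBytes) {
    if (largestCacheFileBytes > CRITICAL_BYTES) return Severity.CRITICAL;
    if (largestCacheFileBytes > SEVERE_BYTES) return Severity.SEVERE;
    if (largestCacheFileBytes > MODERATE_BYTES) return Severity.MODERATE;
    return Severity.NONE;
  }
}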

String[] archiveCacheFileSizesArray = archiveCacheFileSizes.split(",");

if (archiveCacheFilesArray.length != archiveCacheFileSizesArray.length) {
result.setSeverity(Severity.MODERATE);
Contributor

I don't think this inconsistency should change our heuristic. We should just go by the sum of the sizes.

Contributor Author

To be clarified.

result.setSeverity(Severity.MODERATE);
logger
.warn("Mismatch in the number of files and their corresponding sizes for " + MAPREDUCE_JOB_CACHE_ARCHIVES);
result.addResultDetail(MAPREDUCE_JOB_CACHE_ARCHIVES, Integer.toString(archiveCacheFilesArray.length));
Contributor

How big can this get? I think we have a limit of 255 chars on this field, and this can easily exceed that. You should cut it off at that limit to avoid DB update errors.

Contributor Author

Not sure what you mean; I'm just representing a number as a string here.

When there's a mismatch in the number of files and their corresponding sizes, it will look as follows.

mapreduce.job.cache.archives : 10
mapreduce.job.cache.archives.filesizes : 9

cacheFileToSizeMap.put(cacheFilesArray[i], cacheFileSizesArray[i]);
}

if (checkFileSizeLimit(result, cacheFileToSizeMap)) {
Contributor

You are checking whether files and archives are separately > 500 MB. I think the intent is that the combined size should not exceed 500 MB. Can you get that clarified from Jack? Also, some comments on what the difference between the two is would be good.

Contributor Author

Ok, to be clarified.

boolean limitViolated = false;
for (String file : cacheFileToSizeMap.keySet()) {
long size = Long.parseLong(cacheFileToSizeMap.get(file));
if (size > distributedCacheFileSizeLimit) {
Contributor

You are checking each jar's size to be > 500 MB. Shouldn't it be the sum of all sizes > 500 MB?

Contributor Author

Again, this is to be clarified.
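
For comparison, a combined-size variant along the lines the reviewer suggests would sum all entries before comparing against the limit. This is only a sketch of the alternative; per the clarification further down, Jack did mean a single file's size, so this is not the behaviour that was merged.

import java.util.Map;

class CombinedCacheSizeSketch {
  // Hedged alternative: flag when the total cached bytes exceed the limit,
  // rather than any single file.
  static boolean totalExceedsLimit(Map<String, String> cacheFileToSizeMap, long limitBytes) {
    long total = 0;
    for (String size : cacheFileToSizeMap.values()) {
      try {
        total += Long.parseLong(size);
      } catch (NumberFormatException e) {
        // ignore entries whose size is not reported in a usable form
      }
    }
    return total > limitBytes;
  }
}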

@Test
public void testHeuristicResult() {
jobConf.setProperty("mapreduce.job.cache.files.filesizes", "100,200,300");
jobConf.setProperty("mapreduce.job.cache.archives.filesizes", "400,500,600");
Contributor

Are these tests passing? You are specifying 500 here, but I don't see it getting multiplied with FILEUTILS*100MB before checking against the limit.

Contributor Author

The Travis check would have failed if the tests didn't pass. There's no need to multiply by 100MB, as a file can be of any size. Here 500 means 500 bytes, which is less than 500 MB.
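
A quick way to see the point: the fixture values are byte counts, so even the largest one (600) is far below 500 * FileUtils.ONE_MB. A tiny standalone check, just to illustrate the arithmetic:

import org.apache.commons.io.FileUtils;

class TestSizeSanityCheck {
  public static void main(String[] args) {
    long limitBytes = 500 * FileUtils.ONE_MB; // 524,288,000 bytes
    long largestFixtureSize = 600;            // bytes, from "400,500,600" in the test
    System.out.println(largestFixtureSize < limitBytes); // true, so the limit is not violated
  }
}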

@akshayrai
Contributor

+1, LGTM except for the things you wanted to get clarified from Shankar.

@rajagopr
Contributor Author

rajagopr commented Feb 2, 2017

@akshayrai: Everything stands clarified now; Jack did mean a single file size.

akshayrai merged commit 4df9ba9 into linkedin:master on Feb 6, 2017
skakker pushed a commit to skakker/dr-elephant that referenced this pull request Dec 14, 2017