
[oneDNN] First fix to #33021 (#33174)

Merged 4 commits into PaddlePaddle:develop on Jun 9, 2021
Conversation

@jczaja (Contributor) commented May 27, 2021:

PR types: Bug fixes
PR changes: Others

Describe

This is a fix for the cache size increasing while in cache clearing mode, as described in issue #33021.
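For context, here is a minimal sketch of how an application enables the bounded cache mode this PR fixes; the model path and capacity value are placeholders, and the exact setup used by the PR's test may differ:

#include "paddle/fluid/inference/api/paddle_inference_api.h"

int main() {
  paddle::AnalysisConfig cfg;
  cfg.SetModel("/path/to/model");  // placeholder model directory
  cfg.EnableMKLDNN();              // run oneDNN (MKL-DNN) CPU kernels
  // Cache clearing mode: keep at most 2 input shapes cached; objects
  // belonging to the oldest shape are evicted first.
  cfg.SetMkldnnCacheCapacity(2);
  auto predictor = paddle::CreatePaddlePredictor<paddle::AnalysisConfig>(cfg);
  return 0;
}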

@jczaja added the Intel label on May 27, 2021
@paddle-bot-old commented:

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@jczaja force-pushed the prv-33021-fix branch 2 times, most recently from 431fe82 to a36fd33 on May 31, 2021 17:38
@jczaja changed the title from "[oneDNN] Candidate fix to #33021" to "[oneDNN] First fix to #33021" on May 31, 2021
- compilation fix

- Fix

- Lint
@jczaja (Contributor, Author) commented Jun 2, 2021:

@jakpiase, please continue your review.

@lidanqing-intel (Contributor) left a comment:

LGTM. Reviewed half so far.

cfg.SetMkldnnCacheCapacity(cache_capacity);

// we will use two predictors (same model)
auto first_predictor = CreatePaddlePredictor<AnalysisConfig>(cfg);
@lidanqing-intel (Contributor) commented Jun 2, 2021:
Hi, so there are two predictors? Where is second_predictor? And does this validate_cache_onednn mean test_cache_onednn?

@jczaja (Contributor, Author) replied:
Originally there were two predictors, but I removed the second one, so the comment here is misleading. I will fix that. Thanks!

@@ -739,7 +758,7 @@ void MKLDNNDeviceContext::SetBlob(const std::string& name,
     return;
   }

-unsigned int MKLDNNDeviceContext::GetCachedObjectsNumber(void) {
+unsigned int MKLDNNDeviceContext::GetCachedObjectsNumber(void) const {
   unsigned int num_entries = 0;
Contributor:
Is the result of GetCachedObjectsNumber equal to num_threads * num_input_shapes?

@jczaja (Contributor, Author) replied:
Actually, the cache structure has three levels: session IDs, then shapes, then the actual cached objects per shape. So to get the cache size we iterate through sessions and their shapes; for every shape we get the number of cached objects and sum it with the counts for the other shapes.
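To make that traversal concrete, here is a minimal sketch of the three-level counting; the map type aliases are assumptions for illustration, not the exact Paddle definitions:

#include <memory>
#include <string>
#include <unordered_map>

// Assumed three-level layout: session id -> input shape -> cached objects.
using KeyBlob = std::unordered_map<std::string, std::shared_ptr<void>>;
using ShapeBlob = std::unordered_map<std::string, std::shared_ptr<KeyBlob>>;
using BlobMap = std::unordered_map<int, std::shared_ptr<ShapeBlob>>;

unsigned int CountCachedObjects(const BlobMap& blob_map) {
  unsigned int num_entries = 0;
  for (const auto& session : blob_map) {         // level 1: session IDs
    for (const auto& shape : *session.second) {  // level 2: shapes per session
      num_entries += shape.second->size();       // level 3: objects per shape
    }
  }
  return num_entries;
}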

auto onednn_dev_ctx =
dynamic_cast<platform::MKLDNNDeviceContext *>(pool.Get(place));
return onednn_dev_ctx->GetCachedObjectsNumber();
}
Contributor:
Here we get the number of blobs cached across all threads for the different input shapes?

@jczaja (Contributor, Author) replied:
This function just returns the total number of cached objects, regardless of the number of threads used. There is a single oneDNN cache, shared among threads.
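In other words, every thread that queries the counter reads the same shared cache. A minimal sketch of the call path, assuming a CPU place and a PADDLE_WITH_MKLDNN build (the snippet quoted above is the in-repo version):

#include "paddle/fluid/platform/device_context.h"

unsigned int TotalOneDNNCacheSize() {
  // One shared device context (and thus one shared oneDNN cache) per place.
  auto& pool = paddle::platform::DeviceContextPool::Instance();
  auto* dev_ctx = dynamic_cast<paddle::platform::MKLDNNDeviceContext*>(
      pool.Get(paddle::platform::CPUPlace()));
  // Identical result no matter which thread calls this.
  return dev_ctx->GetCachedObjectsNumber();
}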

@jakpiase self-requested a review on June 2, 2021 14:55
jakpiase previously approved these changes Jun 2, 2021

@jakpiase (Contributor) left a comment:

LGTM

lidanqing-intel previously approved these changes Jun 2, 2021

@lidanqing-intel (Contributor) left a comment:

LGTM. Thank you very much!

@lidanqing-intel (Contributor) commented:
Hi @wzzju, could you please review it? Thanks!

Comment on lines 751 to 753
using ExecMap = std::unordered_map<
void*, std::vector<std::pair<BlobPtr_t<KeyBlob>, KeyBlob::iterator>>>;
using ExecShape = std::unordered_map<std::string, std::shared_ptr<ExecMap>>;
Contributor:
IMHO, these structures deserve some proper documentation. Every time I look at this code I need to decipher what kind of data is stored here and what the keys in these maps are. Please write a detailed explanation. Maybe define some types that will make this code more readable, like:

using ExecKey = void*;
using BlobMapCacheIterPair = std::pair<BlobPtr_t<KeyBlob>, KeyBlob::iterator>;
using ExecBlobMapsVec = std::vector<BlobMapCacheIterPair>;
using ExecBlobMap = std::unordered_map<ExecKey, ExecBlobMapsVec>;
using ExecShapeMap = std::unordered_map<std::string, std::shared_ptr<ExecBlobMap>>;

shape_lines.resize(num_samples);

// Let's remeber number of cached objects before
// exection and after every single execution
Contributor:
Suggested change:
-// exection and after every single execution
+// execution and after every single execution

std::vector<int> cache_filling;
cache_filling.push_back(GetNumCachedObjects());

// compute sequenctilly prediction
Contributor:
Suggested change:
-// compute sequenctilly prediction
+// compute sequentially prediction

}
} else {
VLOG(3) << "Prevented Clearing DNNL cache.";
block_next_cache_clearing_ = false;
}
}

void MKLDNNDeviceContext::RemoveShapeEntriesWithExecutor(void) const {
p_exec_items_->erase(p_exec_items_->begin());
Contributor:
Why are you removing the first object in the map? How do you know that this is the actual key (shape) you want to delete?

@jczaja (Contributor, Author) replied:
Removing shapes works according to a FIFO concept, so I remove the objects related to the oldest shape.
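To illustrate that FIFO policy, here is a minimal, self-contained sketch with hypothetical names; it assumes a separate list that tracks shape insertion order, since an unordered map by itself does not guarantee any ordering:

#include <list>
#include <string>
#include <unordered_map>
#include <vector>

class ShapeFifoCache {
 public:
  void Register(const std::string& shape) {
    if (entries_.count(shape) == 0) {
      order_.push_back(shape);  // remember when each shape first appeared
      entries_[shape] = {};
    }
  }
  void EvictOldestShape() {
    if (order_.empty()) return;
    entries_.erase(order_.front());  // drop all objects of the oldest shape
    order_.pop_front();              // FIFO: first registered, first removed
  }

 private:
  std::list<std::string> order_;  // shapes in insertion order
  std::unordered_map<std::string, std::vector<int>> entries_;  // shape -> objs
};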


Record ProcessALine(const std::string &line, const std::string &shape_line) {
VLOG(3) << "process a line";
std::vector<std::string> columns;
Contributor:
Hi, is this columns variable unused?

@jczaja (Contributor, Author) replied:
Good catch!

TEST(Analyzer_detect, validate_cache_onednn) {
validate_cache_onednn(2 /*cache_capacity */);
}
#endif
@lidanqing-intel (Contributor) commented Jun 7, 2021:
Hi, if this whole TEST(...) is within #ifdef PADDLE_WITH_MKLDNN ... #endif, does that mean this file could be renamed to xxx_mkldnn_tester.cc? Because otherwise, with WITH_MKLDNN OFF, this test will not be executed, although right now the file name suggests it would be.

@jczaja (Contributor, Author) replied:
OK. It will be a multi-threaded functional test of oneDNN execution, so I can change the file name.

@jczaja dismissed stale reviews from lidanqing-intel and jakpiase via 9c30e39 on June 7, 2021 12:29
@jczaja (Contributor, Author) commented Jun 8, 2021:

@juncaipeng, could you please start your review?

@jczaja (Contributor, Author) commented Jun 8, 2021:

@wzzju, please start your review. All required CI has passed.

@jakpiase (Contributor) commented Jun 8, 2021:

LGTM

@juncaipeng (Contributor) left a comment:

LGTM

@juncaipeng merged commit 1382cd2 into PaddlePaddle:develop on Jun 9, 2021
lidanqing-intel pushed a commit to lidanqing-intel/Paddle that referenced this pull request Jun 15, 2021
Superjomn pushed a commit that referenced this pull request Jun 16, 2021
…y consumption (#33571)

* [oneDNN] First fix to #33021  (#33174)

* - First fix to #33021

* [oneDNN] Second fix to #33021 (#33471)

* use older download_data function

Co-authored-by: Jacek Czaja <jacek.czaja@intel.com>