fix(snapshot/x86_64): make sure TSC_DEADLINE MSR is non-zero #4630
On x86_64, we observed that when restoring from a snapshot, one of the vCPUs had MSR_IA32_TSC_DEADLINE cleared and never received TSC interrupts until the MSR was updated externally (e.g. by setting the system time). We believe this happens because the TSC interrupt is lost during the snapshot-taking process: the MSR is cleared, but the interrupt is not delivered to the guest, so the guest does not rearm the timer. A visible effect is a failure to connect to a restored VM via SSH.
This commit introduces a workaround. If, when taking a snapshot, we see a zero MSR_IA32_TSC_DEADLINE, we replace its value with the MSR_IA32_TSC value from the same vCPU to make sure that the vCPU will continue to receive TSC interrupts.
(cherry picked from commit 94b37cb)
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
The TSC_DEADLINE MSR value is volatile, as it is updated by the guest kernel based on the current TSC value. (cherry picked from commit 4402c82) Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
The TSC_DEADLINE MSR value is volatile, as it is updated by the guest kernel based on the current TSC value. (cherry picked from commit cee34ab) Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Force-pushed: ceea8bd to 345218e
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@              Coverage Diff               @@
##     firecracker-v1.8    #4630    +/-    ##
==========================================
+ Coverage     82.14%    82.16%   +0.01%
==========================================
  Files           255       255
  Lines         31285     31307     +22
==========================================
+ Hits          25700     25722     +22
  Misses         5585      5585

Flags with carried forward coverage won't be shown.
We have observed no more failures of this type on main since merging #4618, so we are confident to proceed with this PR too.
LGTM, thanks for the change.
Changes
Backport from #4618.
This change introduces a workaround. If, when taking a snapshot, we see a zero MSR_IA32_TSC_DEADLINE, we replace its value with the MSR_IA32_TSC value from the same vCPU to make sure the vCPU will continue to receive TSC interrupts.
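The workaround described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this PR: `MsrEntry` and `fix_zero_tsc_deadline` are hypothetical names, and the saved MSRs are assumed to be available as a flat list of index/value pairs for a single vCPU. The MSR indices are the architectural ones (as in the Linux kernel's `msr-index.h`).

```rust
// Architectural MSR indices (as in the Linux kernel's msr-index.h).
const MSR_IA32_TSC: u32 = 0x10;
const MSR_IA32_TSC_DEADLINE: u32 = 0x6e0;

/// Hypothetical stand-in for one saved MSR of a vCPU: index plus 64-bit value.
#[derive(Debug, Clone, PartialEq)]
struct MsrEntry {
    index: u32,
    data: u64,
}

/// If the saved TSC_DEADLINE of this vCPU is zero, overwrite it with the
/// saved TSC value of the same vCPU. Returns true if a replacement was made.
fn fix_zero_tsc_deadline(msrs: &mut [MsrEntry]) -> bool {
    // Look up the saved TSC value first; without it there is nothing to copy.
    let tsc = match msrs.iter().find(|m| m.index == MSR_IA32_TSC) {
        Some(m) => m.data,
        None => return false,
    };
    // Only touch the deadline when it was saved as zero.
    if let Some(m) = msrs.iter_mut().find(|m| m.index == MSR_IA32_TSC_DEADLINE) {
        if m.data == 0 {
            m.data = tsc;
            return true;
        }
    }
    false
}
```

A non-zero deadline is left untouched, so in this sketch the replacement only affects vCPUs whose deadline was saved as zero.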
Reason
On x86_64, we observed that when restoring from a snapshot, one of the vCPUs had MSR_IA32_TSC_DEADLINE cleared and never received TSC interrupts until the MSR was updated externally (e.g. by setting the system time).
We believe this happens because the TSC interrupt is lost during the snapshot-taking process: the MSR is cleared, but the interrupt is not delivered to the guest, so the guest does not rearm the timer.
A visible effect is a failure to connect to a restored VM via SSH, similar to https://buildkite.com/firecracker/firecracker-pr-nightly/builds/1403#018f83db-5395-4656-8d9c-83b6fcfcfd54/50-1994 .
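The suspected failure mode can be illustrated with a toy model. This is a simplified sketch, not KVM or Firecracker code: the `VcpuTimer` type and its methods are hypothetical, and the model assumes (as the description above hypothesises) that a snapshot preserves the MSR values but drops an interrupt that was raised and not yet delivered.

```rust
/// Toy model of one vCPU's TSC-deadline timer (hypothetical, for illustration).
#[derive(Debug, Clone, Copy)]
struct VcpuTimer {
    tsc: u64,
    deadline: u64, // 0 means the timer is disarmed
    irq_pending: bool,
}

impl VcpuTimer {
    /// Advance the TSC; when it reaches an armed deadline, the deadline is
    /// cleared and a timer interrupt becomes pending.
    fn advance(&mut self, cycles: u64) {
        self.tsc += cycles;
        if self.deadline != 0 && self.tsc >= self.deadline {
            self.deadline = 0;
            self.irq_pending = true;
        }
    }

    /// Guest interrupt handler: rearms the timer only if an interrupt arrives.
    fn guest_handle_irq(&mut self, period: u64) -> bool {
        if !self.irq_pending {
            return false;
        }
        self.irq_pending = false;
        self.deadline = self.tsc + period;
        true
    }

    /// ASSUMPTION: the snapshot keeps the MSR values but loses the pending
    /// interrupt. With the workaround, a zero deadline is replaced by the TSC.
    fn snapshot_and_restore(self, apply_workaround: bool) -> VcpuTimer {
        let mut restored = VcpuTimer { irq_pending: false, ..self };
        if apply_workaround && restored.deadline == 0 {
            restored.deadline = restored.tsc;
        }
        restored
    }
}
```

In this model, a vCPU restored without the workaround keeps a zero deadline indefinitely because the guest handler never runs, whereas with the workaround the deadline is due as soon as the TSC advances, the interrupt is raised again, and the guest rearms the timer.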
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.
PR Checklist
[ ] If a specific issue led to this PR, this PR closes the issue.
[ ] API changes follow the Runbook for Firecracker API changes.
[ ] New TODOs link to an issue.