Add clarifications requested in issue #48 #49

Merged
merged 6 commits into from
Apr 5, 2024
50 changes: 35 additions & 15 deletions reri_err_reporting.adoc
@@ -287,8 +287,10 @@ is as follows:
Error reporting functionality in the error record is enabled if the
error-logging-and-signaling-enable (`else`) field is set to 1. The `else` field
is WARL and may default to 1 or 0 at reset. When `else` is 1, the hardware unit
-logs and signals errors in the error record. When `else` is 0, whether the
-hardware unit continues detecting and correcting errors is `UNSPECIFIED`.
+logs and signals errors in the error record. When `else` is 0, any signaling
+associated with prior logged errors remains unaffected, the hardware unit does
+not log and signal new errors in the error record, and it is `UNSPECIFIED`
+whether the hardware unit continues detecting and correcting errors.

[NOTE]
====
@@ -349,6 +351,21 @@ may be configured to cause a High-priority RAS local interrupt, an external
interrupt, or a Non-Maskable Interrupt (NMI) and the Low-priority RAS signal
may be configured to cause a Low-priority RAS local interrupt or an external
interrupt.

+When error class and/or priority-specific RAS handlers are implemented,
+these handlers must take into consideration the possibility that an error
+record intended for a handler could be overwritten by an error of higher
+severity or priority, which also triggers a signal to another RAS handler
+for the new error, in the period between the first signal's generation and
+the examination of the error record by the first RAS handler. In such
+instances, the first RAS handler may find an error record that is not
+intended for it. This handler may choose to disregard this error record as
+spurious from its perspective, and leave it to be handled by the other RAS
+handler. It may also note that an error occurred that concerns it, but
+information for the error is no longer available. Similarly, spurious
+signals may arise if the fields controlling the type of signal generated by
+an error record are modified while either the `v` field or the `ceco` field
+in the `status_i` register is set to 1.
====

If the error record supports corrected-error counting then the
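The note added in the hunk above describes how a class-specific RAS handler may react when the record it was signaled for has been overwritten. A minimal sketch of one possible triage policy follows, assuming the status bits are exposed as a plain dictionary and that the severity bits of older errors are sticky, as the overwrite rules later in this diff state. The function name `triage` and the three result strings are hypothetical and are not defined by the specification.

```python
# Hypothetical triage helper for a class-specific RAS handler.
# Severity classes are listed from highest to lowest.
SEVERITY_ORDER = ("uec", "ued", "ce")


def triage(handler_class: str, status: dict) -> str:
    """Return 'handle', 'info_lost', or 'spurious' for this handler."""
    if not status["v"]:
        # Nothing is logged, so the signal is spurious for this handler.
        return "spurious"
    # The record describes the highest severity class whose bit is set;
    # with no class bit set it is an informational update.
    logged = next((c for c in SEVERITY_ORDER if status[c]), "info")
    if logged == handler_class:
        return "handle"
    if status.get(handler_class, 0):
        # The handler's own class bit is still set (sticky), but the
        # details now belong to a higher severity error.
        return "info_lost"
    return "spurious"
```

For example, a CE handler that finds `ce` and `uec` both set would get `info_lost`: an error of its class occurred, but the record now describes the UEC.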
@@ -360,14 +377,14 @@ its value. If corrected error counting is not supported in the error record then
`cece` and `cec` may be hardwired to 0. An overflow of `cec` is signaled using
the signal configured in the `ces` field. When `cece` is 1, the logging of a CE
in the error record does not cause an error signal and an error signal
-configured in `ces` occurs only on a `cec` overflow.
+configured in `ces` occurs only on a `cec` overflow that sets the `ceco` bit.

The set-read-in-progress (`srdp`) field, when written with a value of 1, causes
the `rdip` (read-in-progress) bit of the associated `status_i` register to be
set. The `srdp` field always returns 0 on read. The `rdip` field in the
`status_i` register is set to 1 by hardware when an error is recorded in an
invalid error record causing the `v` field to change from 0 to 1. The `rdip`
-field is cleared to 0 by hardware when a new error overwrites a valid (`v=1`)
+field is cleared to 0 by hardware when a new error updates any field of a valid (`v=1`)
error record.

The status-register-invalidate (`sinv`) bit, when written with a value of 1,
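The `rdip` behavior described in this hunk can be sketched as a small state model. This is an illustration only, assuming that every new error logged into a valid record updates at least one field; the class `StatusModel` and the helper `read_was_undisturbed` are hypothetical names, not part of the specification.

```python
from dataclasses import dataclass


@dataclass
class StatusModel:
    v: int = 0     # record valid
    rdip: int = 0  # read-in-progress

    def hw_record_error(self) -> None:
        """Hardware logs an error into this record."""
        if self.v == 0:
            # Invalid record becomes valid: rdip is set to 1.
            self.v = 1
            self.rdip = 1
        else:
            # Assumed: the new error updates some field of the valid
            # record, so hardware clears rdip.
            self.rdip = 0

    def sw_write_srdp(self) -> None:
        """Software writes 1 to control_i.srdp, which sets rdip."""
        self.rdip = 1


def read_was_undisturbed(status: StatusModel, read_registers) -> bool:
    """Set rdip, read the record, then check that rdip is still 1."""
    status.sw_write_srdp()
    read_registers()
    return status.rdip == 1
```

If hardware updates the record while software is reading it, `rdip` reads back as 0 and the software can tell that the values it collected may be inconsistent.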
@@ -474,7 +491,7 @@ hardwired to 0. If the bits corresponding to more than one error class are set
to 1 then the error record holds information about the highest severity error
class among the bits set. The error record may be used to provide an
informational update by setting the `v` bit to 1 and setting `ce`, `ued`, and
-`uec` bits to 0. Such informational updates are signaled using the signal
+`uec` bits to 0. Such informational updates are lower severity than a CE but are signaled using the signal
configured in `control_i.ces`.

When `v` is 1, if more errors of the same class as the error currently logged in
@@ -659,8 +676,7 @@ overwrite that may occur while it is in process of reading an error record.

An error record that supports the 1 setting of the `cece` field in `control_i`,
implements a corrected-error-counter in the `cec` field. The `cec` is a WARL
-field. When `cece` is 1, the `cec` is incremented on each CE in addition to
-logging details of the error in the error record registers. If an unsigned
+field. When `cece` is 1, the `cec` is incremented on each CE. If an unsigned
integer overflow occurs on a `cec` increment then the
corrected-error-counter-overflow (`ceco`) field is set to 1. The `cec`
continues to count following an overflow. The `cec` and `ceco` fields hold valid
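The counter behavior in this hunk (increment on each CE, set `ceco` on unsigned overflow, keep counting afterwards) can be sketched as follows. The 16-bit width in `CEC_BITS` is an assumption made for illustration, and the class name `CorrectedErrorCounter` is hypothetical.

```python
from dataclasses import dataclass

CEC_BITS = 16  # assumed counter width, for illustration only


@dataclass
class CorrectedErrorCounter:
    cec: int = 0   # corrected-error counter
    ceco: int = 0  # corrected-error-counter-overflow flag

    def record_ce(self) -> bool:
        """Count one CE; return True if this increment changed ceco 0 -> 1."""
        # Unsigned wrap-around: the counter keeps counting after overflow.
        self.cec = (self.cec + 1) % (1 << CEC_BITS)
        overflowed = self.cec == 0
        newly_set = overflowed and self.ceco == 0
        if overflowed:
            self.ceco = 1
        return newly_set
```

The return value marks the 0 to 1 transition of `ceco`, which is the event a caller would use to decide whether to assert the signal configured in `ces`.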
@@ -774,14 +790,17 @@ global counter (e.g, mtime, etc.), or other implementation specific means.
[[OVERWRITE_RULES]]
=== Error Record Overwrite Rules

-When a hardware unit detects an error it may find its error record still valid
-due to an earlier detected error that has not yet been consumed by software.
+When a hardware unit detects an error and its error record is not valid, it
+writes the error record with the error information and marks the record as
+valid. However, if the error record is already valid, owing to an earlier
+detected but unprocessed error, the decision to overwrite the error record with
+new error information is determined by the new error's severity and/or priority.

The overwrite rules allow a higher severity error to overwrite a lower severity
-error. UEC has the highest severity, followed by UED, and then CE. When the two
+error. UEC has the highest severity, followed by UED, then CE, and finally, informational. When the two
errors have the same severity the priority of the errors (as determined by
`status_i.pri`) is used to determine if the error record is overwritten. Higher
-priority errors overwrite the lower priority errors. When a error record is
+priority errors overwrite the lower priority errors. When an error record is
overwritten by a higher severity error (UED/CE by UEC, UED by UEC, or CE by
UEC/UED), the status bits indicating the severity of the older errors are
retained (i.e., are sticky).
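The overwrite decision described in this hunk can be sketched in a few lines. This is a simplified model under stated assumptions: a numerically larger `pri` means higher priority, an error of equal severity and equal priority does not overwrite, and an error that loses the comparison leaves the record untouched. The names `Severity`, `Record`, `should_overwrite`, and `log_error` are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFO = 0  # informational: v=1 with ce, ued, and uec all 0
    CE = 1
    UED = 2
    UEC = 3


@dataclass
class Record:
    v: int = 0
    ce: int = 0
    ued: int = 0
    uec: int = 0
    pri: int = 0

    def severity(self) -> Severity:
        """Highest severity class among the bits currently set."""
        if self.uec:
            return Severity.UEC
        if self.ued:
            return Severity.UED
        if self.ce:
            return Severity.CE
        return Severity.INFO


def should_overwrite(rec: Record, sev: Severity, pri: int) -> bool:
    if not rec.v:
        return True  # an invalid record is simply written
    if sev != rec.severity():
        return sev > rec.severity()
    return pri > rec.pri  # same severity: priority decides (assumed strict)


def log_error(rec: Record, sev: Severity, pri: int) -> bool:
    """Write or overwrite the record; return True if it was updated."""
    if not should_overwrite(rec, sev, pri):
        return False
    if not rec.v:
        rec.ce = rec.ued = rec.uec = 0  # fresh record, no older bits
    # Bits of older, lower severity errors are kept (sticky).
    if sev is Severity.CE:
        rec.ce = 1
    elif sev is Severity.UED:
        rec.ued = 1
    elif sev is Severity.UEC:
        rec.uec = 1
    rec.v, rec.pri = 1, pri
    return True
```

Logging a CE and then a UED into the same record leaves both `ce` and `ued` set, while a later CE of any priority is rejected because the record already holds a higher severity error.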
@@ -791,6 +810,11 @@ When an error writes or overwrites an error record, the `status_i.cec` and
severity. When implemented, `cec` counts CE occurrences; unsigned integer
overflow on `cec` increment sets `ceco` to 1.

+Whenever a new error writes to or overwrites an error record, the signal
+configured in the `control_i` register for its severity level is asserted. When
+`status_i.ceco` changes from 0 to 1, the signal configured in `control_i.ces` is
+asserted.

<<<

[[REC_WRITE_RULE]]
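The signaling paragraph added in this hunk can be illustrated with a small lookup. Only `control_i.ces` is named in the text shown here; the field names `ueds` and `uecs` used for the UED and UEC signal selections are assumptions, as are the dictionary layout and the function name `asserted_signals`. Informational updates map to `ces`, following the earlier hunk in this diff.

```python
# Which control_i field selects the signal for each kind of logged error.
# "ueds" and "uecs" are assumed field names for this sketch.
FIELD_FOR = {"uec": "uecs", "ued": "ueds", "ce": "ces", "info": "ces"}


def asserted_signals(control: dict, wrote_severity=None, ceco_rose=False):
    """Return the signals asserted for one hardware event.

    wrote_severity: class of an error that wrote or overwrote the record,
                    or None if the record was not written.
    ceco_rose:      True if status_i.ceco changed from 0 to 1.
    """
    signals = []
    if wrote_severity is not None:
        signals.append(control[FIELD_FOR[wrote_severity]])
    if ceco_rose:
        signals.append(control["ces"])
    return signals
```

With `control = {"ces": "low", "ueds": "high", "uecs": "high"}`, a UED that overwrites the record yields `["high"]`, and a counter overflow that sets `ceco` yields `["low"]`.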
@@ -871,10 +895,6 @@ error. And yet another implementation may choose to record one of the errors as
determined by implementation specific rules.
====

-When a new error is recorded by the hardware unit in the `status_i` register of its
-error record then the signal configured in the `control_i` register for error is
-asserted.

=== Error Reporting Defined by Other Standards

Standards such as PCIe cite:[PCI] and CXL cite:[CXL] define standardized error