diff --git a/reri_err_reporting.adoc b/reri_err_reporting.adoc
index 402d5a1..10cda5e 100644
--- a/reri_err_reporting.adoc
+++ b/reri_err_reporting.adoc
@@ -287,8 +287,10 @@ is as follows:
 Error reporting functionality in the error record is enabled if the
 error-logging-and-signaling-enable (`else`) field is set to 1. The `else` field
 is WARL and may default to 1 or 0 at reset. When `else` is 1, the hardware unit
-logs and signals errors in the error record. When `else` is 0, whether the
-hardware unit continues detecting and correcting errors is `UNSPECIFIED`.
+logs and signals errors in the error record. When `else` is 0, any signaling
+associated with prior logged errors remains unaffected, the hardware unit does
+not log and signal new errors in the error record, and it is `UNSPECIFIED`
+whether the hardware unit continues detecting and correcting errors.
 
 [NOTE]
 ====
@@ -349,6 +351,21 @@ may be configured to cause a High-priority RAS local interrupt, an external
 interrupt, or an Non-Maskable Interrupt (NMI) and the Low-priority RAS signal
 may be configured to cause a Low-priority RAS local interrupt or an external
 interrupt.
+
+When error class and/or priority-specific RAS handlers are implemented,
+these handlers must take into consideration the possibility that an error
+record intended for a handler could be overwritten by an error of higher
+severity or priority — which also triggers a signal to another RAS handler
+for the new error — in the period between the first signal's generation and
+its examination of the error record by the first RAS handler. In such
+instances, the first RAS handler may find an error record that is not
+intended for it. This handler may choose to disregard this error record as
+spurious from its perspective, and leave it to be handled by the other RAS
+handler. It may also note that an error occurred that concerns it, but
+information for the error is no longer available. Similarly, spurious
+signals may arise if the fields controlling the type of signal generated by
+an error record are modified while either the `v` field or the `ceco` field
+in the `status_i` register is set to 1.
 ====
 
 If the error record supports corrected-error counting then the
@@ -360,14 +377,14 @@ its value. If corrected error counting is not supported in the error record
 then `cece` and `cec` may be hardwired to 0. An overflow of `cec` is signaled
 using the signal configured in the `ces` field. When `cece` is 1, the logging
 of a CE in the error record does not cause an error signal and an error signal
-configured in `ces` occurs only on a `cec` overflow.
+configured in `ces` occurs only on a `cec` overflow that sets the `ceco` bit.
 
 The set-read-in-progress (`srdp`) field, when written with a value of 1, causes
 the `rdip` (read-in-progress) bit of the associated `status_i` register to be
 set. The `srdp` field always returns 0 on read. The `rdip` field in the
 `status_i` register is set to 1 by hardware when an error is recorded in an
 invalid error record causing the `v` field to change from 0 to 1. The `rdip`
-field is cleared to 0 by hardware when a new error overwrites a valid (`v=1`)
+field is cleared to 0 by hardware when a new error updates any field of a valid (`v=1`)
 error record.
 
 The status-register-invalidate (`sinv`) bit, when written with a value of 1,
@@ -474,7 +491,7 @@ hardwired to 0. If the bits corresponding to more than one error class are set to
 1 then the error record holds information about the highest severity error
 class among the bits set. The error record may be used to provide an
 informational update by setting the `v` bit to 1 and setting `ce`, `ued`, and
-`uec` bits to 0. Such informational updates are signaled using the signal
+`uec` bits to 0. Such informational updates are lower severity than a CE but are signaled using the signal
 configured in `control_i.ces`.
 
 When `v` is 1, if more errors of the same class as the error currently logged in
@@ -659,8 +676,7 @@ overwrite that may occur while it is in process of reading an error record.
 
 An error record that supports the 1 setting of the `cece` field in `control_i`,
 implements a corrected-error-counter in the `cec` field. The `cec` is a WARL
-field. When `cece` is 1, the `cec` is incremented on each CE in addition to
-logging details of the error in the error record registers. If an unsigned
+field. When `cece` is 1, the `cec` is incremented on each CE. If an unsigned
 integer overflow occurs on an `cec` increment then the
 corrected-error-counter-overflow (`ceco`) field is set to 1. The `cec` continues
 to count following an overflow. The `cec` and `ceco` fields hold valid
@@ -774,14 +790,17 @@ global counter (e.g, mtime, etc.), or other implementation specific means.
 [[OVERWRITE_RULES]]
 === Error Record Overwrite Rules
 
-When a hardware unit detects an error it may find its error record still valid
-due to an earlier detected error that has not yet been consumed by software.
+When a hardware unit detects an error and its error record is not valid, it
+writes the error record with the error information and marks the record as
+valid. However, if the error record is already valid, owing to an earlier
+detected but unprocessed error, the decision to overwrite the error record with
+new error information is determined by the new error's severity and/or priority.
 
 The overwrite rules allow a higher severity error to overwrite a lower severity
-error. UEC has the highest severity, followed by UED, and then CE. When the two
+error. UEC has the highest severity, followed by UED, then CE, and finally, informational. When the two
 errors have the same severity the priority of the errors (as determined by
 `status_i.pri`) is used to determine if the error record is overwritten. Higher
-priority errors overwrite the lower priority errors. When a error record is
+priority errors overwrite the lower priority errors. When an error record is
 overwritten by a higher severity error (UED/CE by UEC, UED by UEC, or CE by
 UEC/UED), the status bits indicating the severity of the older errors are
 retained (i.e., are sticky).
@@ -791,6 +810,11 @@ When an error writes or overwrites an error record, the `status_i.cec` and
 severity. When implemented, `cec` counts CE occurrences; unsigned integer
 overflow on `cec` increment sets `ceco` to 1.
 
+Whenever a new error writes to or overwrites an error record, the signal
+configured in the `control_i` register for its severity level is asserted. When
+`status_i.ceco` changes from 0 to 1, the signal configured in `control_i.ces` is
+asserted.
+
 <<<
 
 [[REC_WRITE_RULE]]
@@ -871,10 +895,6 @@ error. And yet another implementation may choose to record one of the errors as
 determined by implementation specific rules.
 ====
 
-When a new error is recorded by the hardware unit in the `status_i` register of its
-error record then the signal configured in the `control_i` register for error is
-asserted.
-
 === Error Reporting Defined by Other Standards
 
 Standards such as PCIe cite:[PCI] and CXL cite:[CXL] define standardized error