From fe7b6ea519efc9ec067fae27dcae8e718428745b Mon Sep 17 00:00:00 2001
From: Ved Shanbhogue
Date: Sat, 23 Mar 2024 16:26:38 -0500
Subject: [PATCH 1/6] clarify severity of informational error

---
 reri_err_reporting.adoc | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/reri_err_reporting.adoc b/reri_err_reporting.adoc
index 402d5a1..6944b5a 100644
--- a/reri_err_reporting.adoc
+++ b/reri_err_reporting.adoc
@@ -474,7 +474,7 @@ hardwired to 0. If the bits corresponding to more than one error class are set
 to 1 then the error record holds information about the highest severity error
 class among the bits set. The error record may be used to provide an
 informational update by setting the `v` bit to 1 and setting `ce`, `ued`, and
-`uec` bits to 0. Such informational updates are signaled using the signal
+`uec` bits to 0. Such informational updates are lower severity than a CE but are signaled using the signal
 configured in `control_i.ces`.
 
 When `v` is 1, if more errors of the same class as the error currently logged in
@@ -778,10 +778,10 @@ When a hardware unit detects an error it may find its error record still valid
 due to an earlier detected error that has not yet been consumed by software.
 
 The overwrite rules allow a higher severity error to overwrite a lower severity
-error. UEC has the highest severity, followed by UED, and then CE. When the two
+error. UEC has the highest severity, followed by UED, then CE, and finally, informational. When the two
 errors have the same severity the priority of the errors (as determined by
 `status_i.pri`) is used to determine if the error record is overwritten. Higher
-priority errors overwrite the lower priority errors. When a error record is
+priority errors overwrite the lower priority errors. When an error record is
 overwritten by a higher severity error (UED/CE by UEC, UED by UEC, or CE by
 UEC/UED), the status bits indicating the severity of the older errors are
 retained (i.e., are sticky).
From 74ebcd972e3b79515f0c90da340136146e6b8aa6 Mon Sep 17 00:00:00 2001
From: Ved Shanbhogue
Date: Sat, 23 Mar 2024 16:30:56 -0500
Subject: [PATCH 2/6] clarify any status_i update clears rdip

---
 reri_err_reporting.adoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/reri_err_reporting.adoc b/reri_err_reporting.adoc
index 6944b5a..42c63ff 100644
--- a/reri_err_reporting.adoc
+++ b/reri_err_reporting.adoc
@@ -367,7 +367,7 @@ the `rdip` (read-in-progress) bit of the associated `status_i` register to be
 set. The `srdp` field always returns 0 on read. The `rdip` field in the
 `status_i` register is set to 1 by hardware when an error is recorded in an
 invalid error record causing the `v` field to change from 0 to 1. The `rdip`
-field is cleared to 0 by hardware when a new error overwrites a valid (`v=1`)
+field is cleared to 0 by hardware when a new error updates any field of a valid (`v=1`)
 error record.
 
 The status-register-invalidate (`sinv`) bit, when written with a value of 1,
From f63ccd59c9ec33e5be27c33bebe8de7408127e7f Mon Sep 17 00:00:00 2001
From: Ved Shanbhogue
Date: Sat, 23 Mar 2024 16:38:47 -0500
Subject: [PATCH 3/6] clarify asserted signals unaffected on clearing else

---
 reri_err_reporting.adoc | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/reri_err_reporting.adoc b/reri_err_reporting.adoc
index 42c63ff..dbd6bfb 100644
--- a/reri_err_reporting.adoc
+++ b/reri_err_reporting.adoc
@@ -287,8 +287,10 @@ is as follows:
 Error reporting functionality in the error record is enabled if the
 error-logging-and-signaling-enable (`else`) field is set to 1. The `else` field
 is WARL and may default to 1 or 0 at reset. When `else` is 1, the hardware unit
-logs and signals errors in the error record. When `else` is 0, whether the
-hardware unit continues detecting and correcting errors is `UNSPECIFIED`.
+logs and signals errors in the error record. When `else` is 0, any signaling
+associated with prior logged errors remains unaffected, the hardware unit does
+not log and signal new errors in the error record, and it is `UNSPECIFIED`
+whether the hardware unit continues detecting and correcting errors.
 
 [NOTE]
 ====
From 3b97fff4f5c192702cf02ebdba09d745f0b402bd Mon Sep 17 00:00:00 2001
From: Ved Shanbhogue
Date: Sat, 23 Mar 2024 16:39:35 -0500
Subject: [PATCH 4/6] add note for class or priority specific handlers

---
 reri_err_reporting.adoc | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/reri_err_reporting.adoc b/reri_err_reporting.adoc
index dbd6bfb..38f492d 100644
--- a/reri_err_reporting.adoc
+++ b/reri_err_reporting.adoc
@@ -351,6 +351,21 @@ may be configured to cause a High-priority RAS local interrupt, an external
 interrupt, or an Non-Maskable Interrupt (NMI) and the Low-priority RAS signal
 may be configured to cause a Low-priority RAS local interrupt or an external
 interrupt.
+
+When error class and/or priority-specific RAS handlers are implemented,
+these handlers must take into consideration the possibility that an error
+record intended for a handler could be overwritten by an error of higher
+severity or priority — which also triggers a signal to another RAS handler
+for the new error — in the period between the first signal's generation and
+its examination of the error record by the first RAS handler. In such
+instances, the first RAS handler may find an error record that is not
+intended for it. This handler may choose to disregard this error record as
+spurious from its perspective, and leave it to be handled by the other RAS
+handler. It may also note that an error occurred that concerns it, but
+information for the error is no longer available. Similarly, spurious
+signals may arise if the fields controlling the type of signal generated by
+an error record are modified while either the `v` field or the `ceco` field
+in the `status_i` register is set to 1.
 ====
 
 If the error record supports corrected-error counting then the
From a0cc37c95bcbc7d486a0c23efc877503e3054930 Mon Sep 17 00:00:00 2001
From: Ved Shanbhogue
Date: Wed, 27 Mar 2024 15:46:58 -0500
Subject: [PATCH 5/6] clarify cec overflow signals on setting ceco

---
 reri_err_reporting.adoc | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/reri_err_reporting.adoc b/reri_err_reporting.adoc
index 38f492d..4d598f5 100644
--- a/reri_err_reporting.adoc
+++ b/reri_err_reporting.adoc
@@ -377,7 +377,7 @@ its value. If corrected error counting is not supported in the error record then
 `cece` and `cec` may be hardwired to 0. An overflow of `cec` is signaled using
 the signal configured in the `ces` field. When `cece` is 1, the logging of a CE
 in the error record does not cause an error signal and an error signal
-configured in `ces` occurs only on a `cec` overflow.
+configured in `ces` occurs only on a `cec` overflow that sets the `ceco` bit.
 
 The set-read-in-progress (`srdp`) field, when written with a value of 1, causes
 the `rdip` (read-in-progress) bit of the associated `status_i` register to be
@@ -676,8 +676,7 @@ overwrite that may occur while it is in process of reading an error record.
 
 An error record that supports the 1 setting of the `cece` field in `control_i`,
 implements a corrected-error-counter in the `cec` field. The `cec` is a WARL
-field. When `cece` is 1, the `cec` is incremented on each CE in addition to
-logging details of the error in the error record registers. If an unsigned
+field. When `cece` is 1, the `cec` is incremented on each CE. If an unsigned
 integer overflow occurs on an `cec` increment then the
 corrected-error-counter-overflow (`ceco`) field is set to 1. The `cec` continues
 to count following an overflow. The `cec` and `ceco` fields hold valid
From ef158e2e15359e7a7be24eaf1bfea4f53591778f Mon Sep 17 00:00:00 2001
From: Ved Shanbhogue
Date: Sat, 30 Mar 2024 16:53:39 -0500
Subject: [PATCH 6/6] clarify recorded means writes and overwrites

---
 reri_err_reporting.adoc | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/reri_err_reporting.adoc b/reri_err_reporting.adoc
index 4d598f5..10cda5e 100644
--- a/reri_err_reporting.adoc
+++ b/reri_err_reporting.adoc
@@ -790,8 +790,11 @@ global counter (e.g, mtime, etc.), or other implementation specific means.
 [[OVERWRITE_RULES]]
 === Error Record Overwrite Rules
 
-When a hardware unit detects an error it may find its error record still valid
-due to an earlier detected error that has not yet been consumed by software.
+When a hardware unit detects an error and its error record is not valid, it
+writes the error record with the error information and marks the record as
+valid. However, if the error record is already valid, owing to an earlier
+detected but unprocessed error, the decision to overwrite the error record with
+new error information is determined by the new error's severity and/or priority.
 
 The overwrite rules allow a higher severity error to overwrite a lower severity
 error. UEC has the highest severity, followed by UED, then CE, and finally, informational. When the two
@@ -807,6 +810,11 @@ When an error writes or overwrites an error record, the `status_i.cec` and
 severity. When implemented, `cec` counts CE occurrences; unsigned integer
 overflow on `cec` increment sets `ceco` to 1.
 
+Whenever a new error writes to or overwrites an error record, the signal
+configured in the `control_i` register for its severity level is asserted. When
+`status_i.ceco` changes from 0 to 1, the signal configured in `control_i.ces` is
+asserted.
+
 <<<
 
 [[REC_WRITE_RULE]]
@@ -887,10 +895,6 @@ error. And yet another implementation may choose to record one of the errors as
 determined by implementation specific rules.
 ====
 
-When a new error is recorded by the hardware unit in the `status_i` register of its
-error record then the signal configured in the `control_i` register for error is
-asserted.
-
 === Error Reporting Defined by Other Standards
 
 Standards such as PCIe cite:[PCI] and CXL cite:[CXL] define standardized error