ASoC: SOF: ipc4: change trigger order for chain DMA #4798

kv2019i · 2024-01-26T15:33:29Z

The host DMA (controlled by BE ops) must be stopped before sending PAUSE/STOP IPC (sent from FE ops) to chain DMA. Unless this is done, the DMA stop flow is not following programming sequence and DMA engine may get stuck in busy state.

Link: thesofproject/sof#8792

The host DMA (controlled by BE ops) must be stopped before sending PAUSE/STOP IPC (sent from FE ops) to chain DMA. Unless this is done, the DMA stop flow is not following programming sequence and DMA engine may get stuck in busy state. Link: thesofproject/sof#8792 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>

kv2019i · 2024-01-26T16:28:23Z

For Intel folks, I scheduled test job #37444 that tests this kernel PR against a SOF built with a newer Zephyr (that raises an error if GBUSY is stuck in the DMA).

kv2019i · 2024-01-26T16:52:36Z

UPDATE: and test still fails.

@RanderWang I need your help to understand the original problem sequence in thesofproject/sof#8686 .. after your patch, TGL now fails in every PR plan and we need to figure out a way to unblock integration. Reverting your patch is one option, but I guess we'd then get #8686 back. Plus, DMA stuck in GBUSY after stop, just seems wrong, so we need to fix this as well (but this might take more time, so we can't hold off integration while we debug).

RanderWang · 2024-01-29T02:26:56Z

UPDATE: and test still fails.

@RanderWang I need your help to understand the original problem sequence in thesofproject/sof#8686 .. after your patch, TGL now fails in every PR plan and we need to figure out a way to unblock integration. Reverting your patch is one option, but I guess we'd then get #8686 back. Plus, DMA stuck in GBUSY after stop, just seems wrong, so we need to fix this as well (but this might take more time, so we can't hold off integration while we debug).

@kv2019i the problem sequence is simple. For pipeline stop case, host dma is stopped by two paths in the following code in SOF FW. If the time interval between two stop is very short(rare case), the bug will happen.

int ipc4_pipeline_trigger(struct ipc_comp_dev *ppl_icd, uint32_t cmd, bool *delayed)
{
       .....
	/* trigger the component */
	ret = pipeline_trigger(host->cd->pipeline, host->cd, cmd); // stop host dma, cmd = COMP_TRIGGER_STOP
	if (ret < 0) {
            .........
	} else if (cmd == COMP_TRIGGER_STOP) {
		/*
		 * reset the pipeline components if STOP trigger is executed in
		 * the same thread.
		 * Otherwise, the pipeline will be reset after the STOP trigger
		 * has finished executing in the pipeline task.
		 */
		ret = pipeline_reset(host->cd->pipeline, host->cd); // stop host dma
		if (ret < 0)
			ret = IPC4_INVALID_REQUEST;
	}
}

My patch was aligned with REF FW. As you mentioned that the bug only happens on CAVS 2.5 platforms. I check the REF FW again and found that hda dma code in current REF FW is totally different with an old CAVS FW code (2 years ago). There was a complete rework for hda dma. The wait for host dma only exists in current FW code (hw change ?), not in an old CAVS FW code. I don't know whether the current REF FW code can support CAVS platforms.

It is very likely that we need to disable wait for CAVS platforms. One problem is that hda dma in zephyr is shared by all intel platforms, how to identify hw platforms in hda dma code ?

kv2019i · 2024-01-29T09:39:37Z

@RanderWang I need your help to understand the original problem sequence in [thesofproject/sof#8686](https://github.com
@kv2019i the problem sequence is simple. For pipeline stop case, host dma is stopped by two paths in the following code in SOF FW. If the time interval between two stop is very short(rare case), the bug will happen.

int ipc4_pipeline_trigger(struct ipc_comp_dev *ppl_icd, uint32_t cmd, bool *delayed)

Hmm, but even if we have racy calls to DMA stop, intel_adsp_hda_dma_stop() (Zephyr HD-DMA driver) will first check intel_adsp_hda_is_enabled() and that should cover the race, right? I don't undestand how we get past that and end up with inbalanced PM:

static inline bool intel_adsp_hda_is_enabled(uint32_t base, uint32_t regblock_size, uint32_t sid)
{
»       return *DGCS(base, regblock_size, sid) & (DGCS_GEN | DGCS_FIFORDY);

GEN and FIFORDY are cleared as the first thing in the stop function, so evenif we get another stop immediately, the function should be correction. So is the "if (!intel_adsp_hda_is_enabled())" check broken? And if yes, can we fix that to protect against stop racing?

RanderWang · 2024-01-30T05:41:55Z

@RanderWang I need your help to understand the original problem sequence in [thesofproject/sof#8686](https://github.com
@kv2019i the problem sequence is simple. For pipeline stop case, host dma is stopped by two paths in the following code in SOF FW. If the time interval between two stop is very short(rare case), the bug will happen.

int ipc4_pipeline_trigger(struct ipc_comp_dev *ppl_icd, uint32_t cmd, bool *delayed)

Hmm, but even if we have racy calls to DMA stop, intel_adsp_hda_dma_stop() (Zephyr HD-DMA driver) will first check intel_adsp_hda_is_enabled() and that should cover the race, right? I don't undestand how we get past that and end up with inbalanced PM:
static inline bool intel_adsp_hda_is_enabled(uint32_t base, uint32_t regblock_size, uint32_t sid)
{
»       return *DGCS(base, regblock_size, sid) & (DGCS_GEN | DGCS_FIFORDY);
GEN and FIFORDY are cleared as the first thing in the stop function, so evenif we get another stop immediately, the function should be correction. So is the "if (!intel_adsp_hda_is_enabled())" check broken? And if yes, can we fix that to protect against stop racing?

sure, the problem is : intel_adsp_hda_is_enabled() return true for the second stop, so the dma is free again. Current wait is used to fix the second intel_adsp_hda_is_enabled()

kv2019i · 2024-01-31T18:33:57Z

This PR is based on bad assumptions... and does not work. Closing.

kv2019i · 2024-02-01T12:13:53Z

@RanderWang wrote:

sure, the problem is : intel_adsp_hda_is_enabled() return true for the second stop, so the dma is free again.
Current wait is used to fix the second intel_adsp_hda_is_enabled()

Sorry, I don't still understand. is_enable checks "DGCS_GEN | DGCS_FIFORDY" bits at function entry. If they are set, the driver then clears these bits. There is no link to the state of the GBUSY bit. I've checked multiple cases where the GBUSY is stuck and GEN/FIFORDY remain cleared (this happens with chain-dma flow). So I cannot see how a second stop (even if it is immediate) could see is_enable() return true unless we have two calls in parallel, but that should not be happening.

I'll try to make an alternative fix, but given how chain-dma works, returning error on GBUSY is not ok.

RanderWang · 2024-02-02T02:06:55Z

@RanderWang wrote:

sure, the problem is : intel_adsp_hda_is_enabled() return true for the second stop, so the dma is free again.
Current wait is used to fix the second intel_adsp_hda_is_enabled()

Sorry, I don't still understand. is_enable checks "DGCS_GEN | DGCS_FIFORDY" bits at function entry. If they are set, the driver then clears these bits. There is no link to the state of the GBUSY bit. I've checked multiple cases where the GBUSY is stuck and GEN/FIFORDY remain cleared (this happens with chain-dma flow). So I cannot see how a second stop (even if it is immediate) could see is_enable() return true unless we have two calls in parallel, but that should not be happening.

I'll try to make an alternative fix, but given how chain-dma works, returning error on GBUSY is not ok.

@kv2019i This bug is hard to be reproduced and I saved the mtrace for log which is about 130MB+ and happened after 10911 seconds. Thanks for your hard work!

"----host put" means try to stop dma before intel_adsp_hda_is_enabled(), "host put 4000020" means dma stop done after intel_adsp_hda_is_enabled() return true. For successive two dma_stop ops, there should be only one "host put 4000020" since the intel_adsp_hda_is_enabled() returns false for the second stop so no "host put 4000020" for second "----host put". You can find the first two dma stop for pipe16 has decreased one more pm ref, so the error happened at the last host dma stop for pipeline 14

[10911.308698] <inf> pipe: pipeline_trigger: pipe:16 0x0 pipe trigger cmd 0
[10911.308705] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.308710] <inf> pm: intel_adsp_hda_dma_stop: host put 4000020
[10911.308721] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.308733] <inf> pm: intel_adsp_hda_dma_stop: host put 4000020   // error since intel_adsp_hda_is_enabled() return ture
[10911.308760] <inf> src: src_reset: comp:16 0x10008 src_reset()
[10911.308785] <inf> pipe: pipeline_trigger: pipe:15 0x0 pipe trigger cmd 0
[10911.308835] <inf> dai_intel_ssp: dai_ssp_update_bits: dai_ssp_update_bits base 2a000, reg 34, mask 100, value 0
[10911.308885] <inf> dai_intel_ssp: dai_ssp_update_bits: dai_ssp_update_bits base 2a000, reg 8, mask 80, value 80
[10911.308891] <inf> dai_intel_ssp: dai_ssp_stop: dai_ssp_stop RX stop
...
[10911.385973] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.385978] <inf> pm: intel_adsp_hda_dma_stop: host put 4000020
[10911.385990] <inf> pm: intel_adsp_hda_dma_stop: ----host put     // no more "host put 4000020"
[10911.386051] <inf> pipe: pipeline_trigger: pipe:7 0x0 pipe trigger cmd 0
[10911.386106] <inf> dai_intel_ssp: dai_ssp_update_bits: dai_ssp_update_bits base 28000, reg 34, mask 100, value 0
[10911.386156] <inf> dai_intel_ssp: dai_ssp_update_bits: dai_ssp_update_bits base 28000, reg 8, mask 80, value 80
[10911.386161] <inf> dai_intel_ssp: dai_ssp_stop: dai_ssp_stop RX stop
[10911.387090] <inf> ipc: ipc_cmd: rx   : 0x46080004|0x80e0004
[10911.387466] <inf> ipc: ipc_cmd: rx   : 0x120b0000|0x0
[10911.387683] <inf> dma: dma_put: dma_put(), dma = 0x400ec790, sref = 4
[10911.388315] <inf> ipc: ipc_cmd: rx   : 0x13000002|0x1
[10911.388338] <inf> pipe: pipeline_trigger: pipe:8 0x0 pipe trigger cmd 0
[10911.388345] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.388355] <inf> pm: intel_adsp_hda_dma_stop: host put 4000020
[10911.388368] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.388905] <inf> ipc: ipc_cmd: rx   : 0x46070004|0x900000f
[10911.389418] <inf> ipc: ipc_cmd: rx   : 0x46080004|0x90004
[10911.389775] <inf> ipc: ipc_cmd: rx   : 0x12070000|0x0
[10911.389798] <inf> dma: dma_put: dma_put(), dma = 0x400ec748, sref = 1
[10911.390406] <inf> ipc: ipc_cmd: rx   : 0x12080000|0x0
[10911.390511] <inf> dma: dma_put: dma_put(), dma = 0x400ec790, sref = 3
[10911.390973] <inf> ipc: ipc_cmd: rx   : 0x13000002|0x1
[10911.390996] <inf> pipe: pipeline_trigger: pipe:10 0x0 pipe trigger cmd 0
[10911.391003] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.391006] <inf> pm: intel_adsp_hda_dma_stop: host put 4000020
[10911.391020] <inf> pm: intel_adsp_hda_dma_stop: ----host put  // no more "host put 4000020"
[10911.391063] <inf> pipe: pipeline_trigger: pipe:9 0x0 pipe trigger cmd 0
...
[10911.391123] <inf> eq_iir: eq_iir_reset: comp:9 0x10 eq_iir_reset()
[10911.391663] <inf> ipc: ipc_cmd: rx   : 0x460c0004|0x90006
[10911.392111] <inf> ipc: ipc_cmd: rx   : 0x120a0000|0x0
[10911.392183] <inf> dma: dma_put: dma_put(), dma = 0x400ec790, sref = 2
[10911.392703] <inf> ipc: ipc_cmd: rx   : 0x13000002|0x1
[10911.392736] <inf> pipe: pipeline_trigger: pipe:13 0x0 pipe trigger cmd 0
[10911.392743] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.392746] <inf> pm: intel_adsp_hda_dma_stop: host put 4000020
[10911.392758] <inf> pm: intel_adsp_hda_dma_stop: ----host put // no more "host put 4000020"
[10911.392803] <inf> pipe: pipeline_trigger: pipe:12 0x0 pipe trigger cmd 0
[10911.394495] <inf> ipc: ipc_cmd: rx   : 0x46100004|0xc0006
[10911.394858] <inf> ipc: ipc_cmd: rx   : 0x120d0000|0x0
[10911.394933] <inf> dma: dma_put: dma_put(), dma = 0x400ec790, sref = 1
[10911.395403] <inf> ipc: ipc_cmd: rx   : 0x13000002|0x1
[10911.395431] <inf> pipe: pipeline_trigger: pipe:14 0x0 pipe trigger cmd 0
[10911.395445] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.395448] <inf> pm: intel_adsp_hda_dma_stop: host put 4000020
[10911.395453] <wrn> pm_device: runtime_suspend: Unbalanced suspend
[10911.395458] <err> host_comp: host_common_trigger: comp:14 0x120004 host_trigger(): dma stop failed: -120
[10911.395463] <err> pipe: pipeline_trigger_run: pipe:14 0x0 pipeline_trigger_run(): ret = -120, host->comp.id = 1179652, cmd = 0
...

kv2019i mentioned this pull request Jan 26, 2024

[BUG] multiple-pause-resume.sh sof-test fails to FW_GEN_MSG failed with err 3018 thesofproject/sof#8792

Closed

kv2019i requested review from jsarha and ujfalusi January 26, 2024 16:28

kv2019i closed this Jan 31, 2024

RanderWang mentioned this pull request Feb 2, 2024

drivers: dma: intel-adsp-hda: modify stop dma logic zephyrproject-rtos/zephyr#68415

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ASoC: SOF: ipc4: change trigger order for chain DMA #4798

ASoC: SOF: ipc4: change trigger order for chain DMA #4798

kv2019i commented Jan 26, 2024

kv2019i commented Jan 26, 2024

kv2019i commented Jan 26, 2024

RanderWang commented Jan 29, 2024 •

edited

Loading

kv2019i commented Jan 29, 2024

RanderWang commented Jan 30, 2024 •

edited

Loading

kv2019i commented Jan 31, 2024

kv2019i commented Feb 1, 2024

RanderWang commented Feb 2, 2024

ASoC: SOF: ipc4: change trigger order for chain DMA #4798

ASoC: SOF: ipc4: change trigger order for chain DMA #4798

Conversation

kv2019i commented Jan 26, 2024

kv2019i commented Jan 26, 2024

kv2019i commented Jan 26, 2024

RanderWang commented Jan 29, 2024 • edited Loading

kv2019i commented Jan 29, 2024

RanderWang commented Jan 30, 2024 • edited Loading

kv2019i commented Jan 31, 2024

kv2019i commented Feb 1, 2024

RanderWang commented Feb 2, 2024

RanderWang commented Jan 29, 2024 •

edited

Loading

RanderWang commented Jan 30, 2024 •

edited

Loading