Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ASoC: SOF: ipc4: change trigger order for chain DMA #4798

Conversation

kv2019i
Copy link
Collaborator

@kv2019i kv2019i commented Jan 26, 2024

The host DMA (controlled by BE ops) must be stopped before sending PAUSE/STOP IPC (sent from FE ops) to chain DMA. Unless this is done, the DMA stop flow is not following programming sequence and DMA engine may get stuck in busy state.

Link: thesofproject/sof#8792

The host DMA (controlled by BE ops) must be stopped before
sending PAUSE/STOP IPC (sent from FE ops) to chain DMA.
Unless this is done, the DMA stop flow is not following programming
sequence and DMA engine may get stuck in busy state.

Link: thesofproject/sof#8792
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
@kv2019i
Copy link
Collaborator Author

kv2019i commented Jan 26, 2024

For Intel folks, I scheduled test job #37444 that tests this kernel PR against a SOF built with a newer Zephyr (that raises an error if GBUSY is stuck in the DMA).

@kv2019i kv2019i requested review from jsarha and ujfalusi January 26, 2024 16:28
@kv2019i
Copy link
Collaborator Author

kv2019i commented Jan 26, 2024

UPDATE: and test still fails.

@RanderWang I need your help to understand the original problem sequence in thesofproject/sof#8686 .. after your patch, TGL now fails in every PR plan and we need to figure out a way to unblock integration. Reverting your patch is one option, but I guess we'd then get #8686 back. Plus, DMA stuck in GBUSY after stop, just seems wrong, so we need to fix this as well (but this might take more time, so we can't hold off integration while we debug).

@RanderWang
Copy link

RanderWang commented Jan 29, 2024

UPDATE: and test still fails.

@RanderWang I need your help to understand the original problem sequence in thesofproject/sof#8686 .. after your patch, TGL now fails in every PR plan and we need to figure out a way to unblock integration. Reverting your patch is one option, but I guess we'd then get #8686 back. Plus, DMA stuck in GBUSY after stop, just seems wrong, so we need to fix this as well (but this might take more time, so we can't hold off integration while we debug).

@kv2019i the problem sequence is simple. For pipeline stop case, host dma is stopped by two paths in the following code in SOF FW. If the time interval between two stop is very short(rare case), the bug will happen.

int ipc4_pipeline_trigger(struct ipc_comp_dev *ppl_icd, uint32_t cmd, bool *delayed)
{
       .....
	/* trigger the component */
	ret = pipeline_trigger(host->cd->pipeline, host->cd, cmd); // stop host dma, cmd = COMP_TRIGGER_STOP
	if (ret < 0) {
            .........
	} else if (cmd == COMP_TRIGGER_STOP) {
		/*
		 * reset the pipeline components if STOP trigger is executed in
		 * the same thread.
		 * Otherwise, the pipeline will be reset after the STOP trigger
		 * has finished executing in the pipeline task.
		 */
		ret = pipeline_reset(host->cd->pipeline, host->cd); // stop host dma
		if (ret < 0)
			ret = IPC4_INVALID_REQUEST;
	}
}

My patch was aligned with REF FW. As you mentioned that the bug only happens on CAVS 2.5 platforms. I check the REF FW again and found that hda dma code in current REF FW is totally different with an old CAVS FW code (2 years ago). There was a complete rework for hda dma. The wait for host dma only exists in current FW code (hw change ?), not in an old CAVS FW code. I don't know whether the current REF FW code can support CAVS platforms.

It is very likely that we need to disable wait for CAVS platforms. One problem is that hda dma in zephyr is shared by all intel platforms, how to identify hw platforms in hda dma code ?

@kv2019i
Copy link
Collaborator Author

kv2019i commented Jan 29, 2024

@RanderWang I need your help to understand the original problem sequence in [thesofproject/sof#8686](https://github.com
@kv2019i the problem sequence is simple. For pipeline stop case, host dma is stopped by two paths in the following code in SOF FW. If the time interval between two stop is very short(rare case), the bug will happen.

int ipc4_pipeline_trigger(struct ipc_comp_dev *ppl_icd, uint32_t cmd, bool *delayed)

Hmm, but even if we have racy calls to DMA stop, intel_adsp_hda_dma_stop() (Zephyr HD-DMA driver) will first check intel_adsp_hda_is_enabled() and that should cover the race, right? I don't undestand how we get past that and end up with inbalanced PM:

static inline bool intel_adsp_hda_is_enabled(uint32_t base, uint32_t regblock_size, uint32_t sid)
{
»       return *DGCS(base, regblock_size, sid) & (DGCS_GEN | DGCS_FIFORDY);

GEN and FIFORDY are cleared as the first thing in the stop function, so evenif we get another stop immediately, the function should be correction. So is the "if (!intel_adsp_hda_is_enabled())" check broken? And if yes, can we fix that to protect against stop racing?

@RanderWang
Copy link

RanderWang commented Jan 30, 2024

@RanderWang I need your help to understand the original problem sequence in [thesofproject/sof#8686](https://github.com
@kv2019i the problem sequence is simple. For pipeline stop case, host dma is stopped by two paths in the following code in SOF FW. If the time interval between two stop is very short(rare case), the bug will happen.

int ipc4_pipeline_trigger(struct ipc_comp_dev *ppl_icd, uint32_t cmd, bool *delayed)

Hmm, but even if we have racy calls to DMA stop, intel_adsp_hda_dma_stop() (Zephyr HD-DMA driver) will first check intel_adsp_hda_is_enabled() and that should cover the race, right? I don't undestand how we get past that and end up with inbalanced PM:

static inline bool intel_adsp_hda_is_enabled(uint32_t base, uint32_t regblock_size, uint32_t sid)
{
»       return *DGCS(base, regblock_size, sid) & (DGCS_GEN | DGCS_FIFORDY);

GEN and FIFORDY are cleared as the first thing in the stop function, so evenif we get another stop immediately, the function should be correction. So is the "if (!intel_adsp_hda_is_enabled())" check broken? And if yes, can we fix that to protect against stop racing?

sure, the problem is : intel_adsp_hda_is_enabled() return true for the second stop, so the dma is free again. Current wait is used to fix the second intel_adsp_hda_is_enabled()

@kv2019i
Copy link
Collaborator Author

kv2019i commented Jan 31, 2024

This PR is based on bad assumptions... and does not work. Closing.

@kv2019i kv2019i closed this Jan 31, 2024
@kv2019i
Copy link
Collaborator Author

kv2019i commented Feb 1, 2024

@RanderWang wrote:

sure, the problem is : intel_adsp_hda_is_enabled() return true for the second stop, so the dma is free again.
Current wait is used to fix the second intel_adsp_hda_is_enabled()

Sorry, I don't still understand. is_enable checks "DGCS_GEN | DGCS_FIFORDY" bits at function entry. If they are set, the driver then clears these bits. There is no link to the state of the GBUSY bit. I've checked multiple cases where the GBUSY is stuck and GEN/FIFORDY remain cleared (this happens with chain-dma flow). So I cannot see how a second stop (even if it is immediate) could see is_enable() return true unless we have two calls in parallel, but that should not be happening.

I'll try to make an alternative fix, but given how chain-dma works, returning error on GBUSY is not ok.

@RanderWang
Copy link

@RanderWang wrote:

sure, the problem is : intel_adsp_hda_is_enabled() return true for the second stop, so the dma is free again.
Current wait is used to fix the second intel_adsp_hda_is_enabled()

Sorry, I don't still understand. is_enable checks "DGCS_GEN | DGCS_FIFORDY" bits at function entry. If they are set, the driver then clears these bits. There is no link to the state of the GBUSY bit. I've checked multiple cases where the GBUSY is stuck and GEN/FIFORDY remain cleared (this happens with chain-dma flow). So I cannot see how a second stop (even if it is immediate) could see is_enable() return true unless we have two calls in parallel, but that should not be happening.

I'll try to make an alternative fix, but given how chain-dma works, returning error on GBUSY is not ok.

@kv2019i This bug is hard to be reproduced and I saved the mtrace for log which is about 130MB+ and happened after 10911 seconds. Thanks for your hard work!

"----host put" means try to stop dma before intel_adsp_hda_is_enabled(), "host put 4000020" means dma stop done after intel_adsp_hda_is_enabled() return true. For successive two dma_stop ops, there should be only one "host put 4000020" since the intel_adsp_hda_is_enabled() returns false for the second stop so no "host put 4000020" for second "----host put". You can find the first two dma stop for pipe16 has decreased one more pm ref, so the error happened at the last host dma stop for pipeline 14

[10911.308698] <inf> pipe: pipeline_trigger: pipe:16 0x0 pipe trigger cmd 0
[10911.308705] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.308710] <inf> pm: intel_adsp_hda_dma_stop: host put 4000020
[10911.308721] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.308733] <inf> pm: intel_adsp_hda_dma_stop: host put 4000020   // error since intel_adsp_hda_is_enabled() return ture
[10911.308760] <inf> src: src_reset: comp:16 0x10008 src_reset()
[10911.308785] <inf> pipe: pipeline_trigger: pipe:15 0x0 pipe trigger cmd 0
[10911.308835] <inf> dai_intel_ssp: dai_ssp_update_bits: dai_ssp_update_bits base 2a000, reg 34, mask 100, value 0
[10911.308885] <inf> dai_intel_ssp: dai_ssp_update_bits: dai_ssp_update_bits base 2a000, reg 8, mask 80, value 80
[10911.308891] <inf> dai_intel_ssp: dai_ssp_stop: dai_ssp_stop RX stop
...
[10911.385973] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.385978] <inf> pm: intel_adsp_hda_dma_stop: host put 4000020
[10911.385990] <inf> pm: intel_adsp_hda_dma_stop: ----host put     // no more "host put 4000020"
[10911.386051] <inf> pipe: pipeline_trigger: pipe:7 0x0 pipe trigger cmd 0
[10911.386106] <inf> dai_intel_ssp: dai_ssp_update_bits: dai_ssp_update_bits base 28000, reg 34, mask 100, value 0
[10911.386156] <inf> dai_intel_ssp: dai_ssp_update_bits: dai_ssp_update_bits base 28000, reg 8, mask 80, value 80
[10911.386161] <inf> dai_intel_ssp: dai_ssp_stop: dai_ssp_stop RX stop
[10911.387090] <inf> ipc: ipc_cmd: rx   : 0x46080004|0x80e0004
[10911.387466] <inf> ipc: ipc_cmd: rx   : 0x120b0000|0x0
[10911.387683] <inf> dma: dma_put: dma_put(), dma = 0x400ec790, sref = 4
[10911.388315] <inf> ipc: ipc_cmd: rx   : 0x13000002|0x1
[10911.388338] <inf> pipe: pipeline_trigger: pipe:8 0x0 pipe trigger cmd 0
[10911.388345] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.388355] <inf> pm: intel_adsp_hda_dma_stop: host put 4000020
[10911.388368] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.388905] <inf> ipc: ipc_cmd: rx   : 0x46070004|0x900000f
[10911.389418] <inf> ipc: ipc_cmd: rx   : 0x46080004|0x90004
[10911.389775] <inf> ipc: ipc_cmd: rx   : 0x12070000|0x0
[10911.389798] <inf> dma: dma_put: dma_put(), dma = 0x400ec748, sref = 1
[10911.390406] <inf> ipc: ipc_cmd: rx   : 0x12080000|0x0
[10911.390511] <inf> dma: dma_put: dma_put(), dma = 0x400ec790, sref = 3
[10911.390973] <inf> ipc: ipc_cmd: rx   : 0x13000002|0x1
[10911.390996] <inf> pipe: pipeline_trigger: pipe:10 0x0 pipe trigger cmd 0
[10911.391003] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.391006] <inf> pm: intel_adsp_hda_dma_stop: host put 4000020
[10911.391020] <inf> pm: intel_adsp_hda_dma_stop: ----host put  // no more "host put 4000020"
[10911.391063] <inf> pipe: pipeline_trigger: pipe:9 0x0 pipe trigger cmd 0
...
[10911.391123] <inf> eq_iir: eq_iir_reset: comp:9 0x10 eq_iir_reset()
[10911.391663] <inf> ipc: ipc_cmd: rx   : 0x460c0004|0x90006
[10911.392111] <inf> ipc: ipc_cmd: rx   : 0x120a0000|0x0
[10911.392183] <inf> dma: dma_put: dma_put(), dma = 0x400ec790, sref = 2
[10911.392703] <inf> ipc: ipc_cmd: rx   : 0x13000002|0x1
[10911.392736] <inf> pipe: pipeline_trigger: pipe:13 0x0 pipe trigger cmd 0
[10911.392743] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.392746] <inf> pm: intel_adsp_hda_dma_stop: host put 4000020
[10911.392758] <inf> pm: intel_adsp_hda_dma_stop: ----host put // no more "host put 4000020"
[10911.392803] <inf> pipe: pipeline_trigger: pipe:12 0x0 pipe trigger cmd 0
[10911.394495] <inf> ipc: ipc_cmd: rx   : 0x46100004|0xc0006
[10911.394858] <inf> ipc: ipc_cmd: rx   : 0x120d0000|0x0
[10911.394933] <inf> dma: dma_put: dma_put(), dma = 0x400ec790, sref = 1
[10911.395403] <inf> ipc: ipc_cmd: rx   : 0x13000002|0x1
[10911.395431] <inf> pipe: pipeline_trigger: pipe:14 0x0 pipe trigger cmd 0
[10911.395445] <inf> pm: intel_adsp_hda_dma_stop: ----host put
[10911.395448] <inf> pm: intel_adsp_hda_dma_stop: host put 4000020
[10911.395453] <wrn> pm_device: runtime_suspend: Unbalanced suspend
[10911.395458] <err> host_comp: host_common_trigger: comp:14 0x120004 host_trigger(): dma stop failed: -120
[10911.395463] <err> pipe: pipeline_trigger_run: pipe:14 0x0 pipeline_trigger_run(): ret = -120, host->comp.id = 1179652, cmd = 0
...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants