Nested interrupts corrupt LR ( AKA printer freezes mid print with heaters on ) #18358
Comments
In addition to that the
I could be wrong since I am not familiar with UI but it appears as if the code is invoking an SD access method which is relatively slow, and hoping that the TMC stepper interrupt will not trigger in between? |
@minosg still an issue? |
Yes, that's still an issue. The core appears to be dying from recursive UART interrupt calls. It's not something that can be fixed by a config change. I am using this ticket to provide more technical info on the issue, otherwise we will keep seeing random "my printer freezes" bugs. I will be tracing that behaviour next week. If there is a limit on how long a ticket is allowed to be open, feel free to close it and we can continue to chat on it anyway. Thanks |
Please test the bugfix-2.0.x branch to see where it stands. |
@boelle I have tested with latest bugfix and the issue is still here. |
Theory 3: I have been testing this board using a debugger. The original theory that the LR is corrupted seems less likely, and I believe it is more of an interrupt firing issue. I believe it was introduced by PR #14030, but I need a deeper understanding of the stepper class.
What happens when the printer freezes: A freeze becomes evident when the printer is no longer moving. The logic is NOT stuck in exception mode, and the interrupt returns to the main thread. The code snippet of interest is this part of https://github.com/MarlinFirmware/Marlin/blob/2.0.x/Marlin/src/module/planner.h
This causes the UI update loop to trigger and the LCD UART to keep firing. The problem appears to be that this while loop never breaks. Chasing the logic, block_tail is only updated in the following call chain, which appears to no longer be firing: Stepper::pulse_phase_isr() -> discard_current_block -> release_current_block. The interrupt configuration on this platform is the following:
Which appear to be enabled during the time of the freeze
And the interrupt priorities appear to be in order
I have also checked timer5, the CC register and the count register, and the timer appears to be counting as expected.
Further notes: This platform appears not to use instruction and memory barriers for interrupts. I do not know if there is a design choice behind that, but the pull request I quoted is enabling and disabling timers using CPSID/CPSIE instructions, with no guarantee that they are actually in the state we want them to be in. I am not certain about the priority grouping setting. Considering that I cannot see any use of sub-priorities in Marlin core, are they being used in Maple, or should it be set to 0x4 anyway? Is that code a leftover from when we were using auto-preload with the timers instead of resetting them?
When the board freezes, most interrupts except the LCD and the temperature ones stop firing. This is very bizarre behavior which I am trying to track down. It is also a very happy coincidence, since if the temperature interrupt dies the hotend will cause a meltdown/fire, so it's a considerable safety risk. I have disabled interrupt grouping and set the usart1 (LCD) interrupt to the lowest priority (15), and yet it keeps on firing, which indicates there is no interrupt preemption or starvation happening.
What I suspect happens is that, because of too many interrupts firing (TMC drivers set in UART mode individually and not chained on a single line), something in the stepper.cpp mini-scheduler logic falls out of its timing bounds. The problem is that the issue occurs well before it becomes evident in the logic, so a longer trace is needed, which is not supported by ST-Link. To be continued. |
I think the issue is not STM32 only, as I have an LPC1768 and am seeing similar symptoms since I added a BTT TFT 35 with Marlin mode LCD to my MKS SGEN L board with TMC2209s in UART mode. My setup is as follows:
The TFT proxies the commands from the PI to the board. I have slightly modified the BTT firmware to parse the M117 commands and display them in TFT mode. What I have discovered so far in my tests (granted not yet exhaustive):
|
That is quite interesting. I wonder what this logic does. If it is restarting the usb_cdc logic, it will definitely be re-enabling interrupts. But the fact that even slower LCDs cause the issue could also support the hypothesis that the logic added to the loop is what pushes a timing-sensitive task outside its acceptable bounds. Compiling out the LCD will significantly decrease the ui.update() loop's time. I still cannot understand why acceleration affects it. Are you using Linear Advance? High acceleration will trigger the la_isr(), which is right in the heart of the stepper.cpp scheduler. |
I think this is where the disconnect happens in BTT firmware https://github.com/guruathwal/BIGTREETECH-TouchScreenFirmware/blob/c189809e95c627d1680ef81533b32e8fac56514b/TFT/src/User/Menu/SettingsMenu.c#L53
LA yes, set to 0.1 (I thought from your tests it did not really matter), acceleration is set to 1000 and speed to 90mm/s. Starting a second 3h print without LCD support. |
I have a very similar setup: SKR mini E3 V2, BTT TFT35. I had frequent crashes with I will try looping Octopi through the TFT35, didn't know that was possible. Is any configuration required for that? Also I'm still waiting for my STLink to get into some proper debugging. |
No special configuration just use guruathwal fork as it fixes a number of issues with uart forwarding. The only issue that still stands is the fact that because the lines sent by Octoprint start with line numbers (Nxxx) the TFT does not parse the information from them so for instance |
Second print without LCD support froze after 2h :( I was able to resume using the disconnect method. |
@kind3r I'm not sure your issue is related to this one, as the board does not freeze (being an LPC176x, the watchdog would catch that if it did, so there would be no risk of the heaters being left on). From your description you are not using the normal USB CDC serial port (which is a 10Mbit, lossless, flow-controlled connection), but bridging a standard UART through a TFT display board. It almost seems like the gcode commands to Marlin just stop at some point until you reset the TFT's serial connection. Do you have any way to check that data is getting to the LPC176x UART while in the frozen state? |
@p3p The watchdog is what made this issue visible and this is why it had its own ticket, but inherently we are talking about the same Marlin core cause. The key element to test for this issue is the following:
But does never freeze if there is display compiled, and you are printing from SD using M21, M24 then you are seeing the same issue. As discussed above, I think that the planner stops discarding blocks, and the firmware freezes waiting for new free blocks. When that happens some interrupts can fire (temperature, LCD) while others don't. |
It also happened while printing from the PI connected via USB (in which case I unplugged the USB cable and reconnected it but it was still frozen). I still need to run more tests to make it repeatable, but it seems similar (something uart related in either case). I'm not sure that heaters are still on as I immediately reacted every time, but next time it happens I will leave it frozen for longer to check if the heaters are really on or not. |
@kind3r Can you please try to print using SD directly and no usb host? Pronterface has a nice UI element ( SD ) to help with that. If you have a freeze, please let me know. |
@minosg Could you clarify a bit the conditions ?
|
If the issue is happening with the USB CDC connection and the hardware UART along with both STM32 and LPC then I guess I can rule out it being a low level issue (in my LPC Arduino framework) that's getting one of those into a bad state, they are very different peripherals, @kind3r I'm not sure if your board manufacturer has messed with the default bootloader or not but LPC176x should never boot loop, so if the watchdog triggers the board will be stalled in a safe state until hardware reset, in theory that rules out an actual mcu crash at least. |
@p3p For the STM case, I can confirm it happens regardless of whether you have a USB connected or not. Connecting the serial, or any other action that delays the logic, causes it to happen far more often. With the ST-Link attached it happens in the first 10 minutes, while normally it takes a couple of prints to trigger. |
Update on the investigation of this issue. I ran a tracing session today, and the following race condition was observed, looking into the stepper.cpp logic in particular. When the freeze happens, the interrupts and the timers are enabled and running/firing at the appropriate rate. The block_phase_isr() called the finalize state right before the printer froze:
discard_current_block() will set the pointer to null
Then the pulse_phase_isr will fire next time and exit
The next time the block_phase_isr() is called it will attempt to retrieve a new block and exit
Planner uses this logic
Moves planned will always return 0 since the head and tail of Planner::block_buffer are the same
This deadlocks the stepper logic. I would assume that step_events_completed >= step_event_count is a normal condition, and should happen at the end of every loop. The question is why it never recovers from it. |
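The starvation described in this comment can be illustrated with a small host-side sketch. It is not Marlin's planner code: `PlannerSketch`, `Block`, and `kBufferSize` are hypothetical names, and the power-of-two masking is an assumption made for the example. It only shows that once head and tail are equal, the stepper side is handed a null block on every call until the planner side advances the head again.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the planner ring buffer; all names are illustrative.
constexpr uint8_t kBufferSize = 16;  // assumed power of two so a mask can be used

struct Block { int id; };

struct PlannerSketch {
    Block buffer[kBufferSize] = {};
    volatile uint8_t head = 0;  // next slot the planner side writes
    volatile uint8_t tail = 0;  // next slot the stepper side reads

    // Number of queued blocks, computed modulo the buffer size.
    uint8_t moves_planned() const {
        return static_cast<uint8_t>(head - tail) & (kBufferSize - 1);
    }

    // Stepper side: nullptr means there is nothing to execute.
    Block* get_current_block() {
        return moves_planned() ? &buffer[tail] : nullptr;
    }

    // Planner side: queue one block; one slot is always kept free.
    bool queue_block(int id) {
        const uint8_t next = static_cast<uint8_t>(head + 1) & (kBufferSize - 1);
        if (next == tail) return false;  // buffer is full
        buffer[head].id = id;
        head = next;
        return true;
    }

    // Stepper side: release the block that was just executed.
    void discard_current_block() {
        if (moves_planned()) tail = static_cast<uint8_t>(tail + 1) & (kBufferSize - 1);
    }
};

// Replays the traced sequence: one block is executed and discarded, after
// which the stepper side sees an empty queue. Returns true if it is starved.
bool stepper_starves_when_head_equals_tail() {
    PlannerSketch p;
    p.queue_block(1);
    Block* b = p.get_current_block();
    assert(b != nullptr && b->id == 1);
    p.discard_current_block();
    return p.get_current_block() == nullptr;
}
```

In this model the only way out of the starved state is another `queue_block()` call from the non-interrupt side, which matches the observation that the stepper never recovers if the planner is itself waiting.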
@minosg I ran the test like you said and it worked fine (from SD Card, all serials disconnected after starting the print). I also ran another test from Octoprint via USB and it also worked fine. I think @p3p is right and it seems that my issue is related to the BTT TFT uart handling and not Marlin's since printing resumes after disconnecting and reconnecting from the TFT and I don't think Marlin restarts, it's just the TFT that stops communication. |
A minor update: I have good news and bad news. The good news: I have identified an actual bug in the planner, which is making this issue worse. The planner slices moves into a series of micro-moves made to fit in a ring buffer (default size of 16 blocks). Ring buffers and interrupts are a nasty combination, since a double increment when moving the head/tail can make an empty buffer appear full or the other way around. From the code:
The good news is that when this was designed, logic was included to make sure that things are delayed a bit when the buffer is working at its boundaries (i.e. 15 out of 16 blocks full). This is done by delay_before_delivering. At the beginning of each move this logic is triggered, which is aimed at giving the stepper driver 100ms to move the tail before the new line starts:
The problem is that the stepper will call get_current_block which is implemented as such
The latter is equivalent to
This effectively cancels the delay and wastes a cycle of stepper logic if it returns a null pointer. I believe it was meant to be written as:
It is advised that you apply this change to your code. It will not stop the printer-freezing deadlock, but it will make motion smoother. |
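The ring-buffer hazard mentioned in this comment (an unchecked extra increment making a full buffer look empty, or an empty one look full) can be shown with plain index arithmetic. This is a minimal sketch under assumptions: a 16-entry ring with power-of-two masking, and helper names (`used`, `advance`, `used_after_unchecked_pushes`) invented for the illustration rather than taken from Marlin.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint8_t kSize = 16;         // assumed ring size, power of two
constexpr uint8_t kMask = kSize - 1;

// Queued entries as computed from the two indices.
constexpr uint8_t used(uint8_t head, uint8_t tail) {
    return static_cast<uint8_t>(head - tail) & kMask;
}

// Unchecked advance of an index, i.e. what a racing second increment amounts to.
constexpr uint8_t advance(uint8_t index) {
    return static_cast<uint8_t>(index + 1) & kMask;
}

// Occupancy reported after pushing n entries into an empty ring
// without ever checking whether the ring is full.
uint8_t used_after_unchecked_pushes(uint8_t n) {
    uint8_t head = 0;
    const uint8_t tail = 0;
    for (uint8_t i = 0; i < n; ++i) head = advance(head);
    return used(head, tail);
}
```

With 15 pushes the ring reports 15 entries, but a 16th unchecked push wraps the head onto the tail and the ring reports 0, so a full buffer looks empty. In the other direction, `used(0, 1)` models a tail that was advanced once too often on an empty ring and reports 15, so an empty buffer looks almost full.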
We’re accepting PRs, so please feel free to contribute your changes. |
This is a minor change, and hardly a critical one. Whether it gets contributed and attributed to me, or to anyone who reads these comments and adds it to their already pending PR, is of little importance compared to fixing the core issue. I am currently deep down the rabbit hole of investigating this issue, which effectively does not allow you to use drivers in UART mode alongside a display or OctoPrint. When and if that is resolved, all side findings can be merged in a larger PR without taking up valuable maintainers' time. |
@minosg I think you are misreading the code a bit, the idea is to have a delay of |
I could really use any help on understanding the planner code so thanks for commenting that. I could not determine intent of this code to begin with. Why should the delay only apply for when there are less than 3 moves planned? During a normal operation the HEAD > TAIL and moves planned is the SIZE_OFF_BUFF -1 >= 3 so it will default to zero right after first invocation effectively using no delay and proceed to deliver the block. If the logic intended is:
Then the above code is not doing that
From the comments in the code about this delay, it seems to indicate it was added to protect you from the ISR messing the
|
Is there a reason why the interrupt isn't disabled while non-interrupt code manipulates the ring buffer? |
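The approach this question points at can be sketched as a scoped guard that masks interrupts while non-interrupt code updates the ring indices. Everything below is a host-side stand-in under stated assumptions: `g_irq_enabled`, `irq_save_and_disable()`, and `irq_restore()` are hypothetical and merely simulate an interrupt mask with a flag; on real hardware they would map to the platform's own primitives, and the thread does not say whether the firmware does this.

```cpp
#include <cassert>
#include <cstdint>

// Simulated global interrupt mask (hypothetical; a real target would use
// its own save/disable/restore primitives instead of a plain flag).
static bool g_irq_enabled = true;

inline bool irq_save_and_disable() {
    const bool was_enabled = g_irq_enabled;
    g_irq_enabled = false;
    return was_enabled;
}

inline void irq_restore(bool was_enabled) { g_irq_enabled = was_enabled; }

// RAII guard: masks interrupts on construction and restores the previous
// state on destruction, so nested use does not re-enable them too early.
class CriticalSection {
public:
    CriticalSection() : was_enabled_(irq_save_and_disable()) {}
    ~CriticalSection() { irq_restore(was_enabled_); }
    CriticalSection(const CriticalSection&) = delete;
    CriticalSection& operator=(const CriticalSection&) = delete;
private:
    bool was_enabled_;
};

struct Ring {
    uint8_t head = 0;
    uint8_t tail = 0;
};

// Advance the head of an assumed 16-entry ring with interrupts masked.
// If masked_during_update is non-null it records whether the mask was
// active while the index was being modified.
bool push_guarded(Ring& r, bool* masked_during_update = nullptr) {
    CriticalSection lock;
    if (masked_during_update) *masked_during_update = !g_irq_enabled;
    const uint8_t next = static_cast<uint8_t>(r.head + 1) & 15;
    if (next == r.tail) return false;  // full, one slot kept free
    r.head = next;
    return true;
}
```

Restoring the saved state instead of unconditionally re-enabling is the design choice that matters here: code that is already running with interrupts masked stays masked after the guard goes out of scope.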
@boelle, @rhapsodyv and I are already heavily involved in this issue, which is entirely about serial low level race conditions. There is no need to test random things to narrow down the problem at this point. |
@sjasonsmith I tried to keep the original hardware identical to the beginning of the thread. I am attaching a file with the configurations and the ox.gcode, which is sliced with PrusaSlicer (so it contains the settings as comments). The board is a BTT SKR Mini E3 1.2, the display is a Malyan LCD, and the body of the printer is an MP Select Mini, which is a standard 3x NEMA17 machine with a 120x120mm bed. The gcode file has been used thoroughly for months since it could consistently trigger the issue. I was printing from SD card. |
Thanks. I think I can replicate that setup here. I don't have a Mini E3 1.2, but I have an E3 DIP which uses SoftwareSerial like it does. |
@sjasonsmith Please note that the E3 1.2 uses software serial while the E3 1.0 and 2.0 use hardware serial. I think that is limited to the TMC steppers though, just pointing it out since it adds a slight timing overhead. |
Yeah, that is why I'll use my DIP instead of the e3 1.0. The DIP version uses SoftwareSerial since it has to work with TMC2208 drivers, which are not addressable. |
@minosg, which PlatformIO environment are you building your firmware with? |
STM32F103RC_btt_512K |
I was able to reproduce the hang when using a Malyan LCD. As it turns out, the Malyan LCD code is accessing the Arduino framework's serial classes directly, rather than using something instantiated by Marlin. This bypasses the recent improvements @rhapsodyv made. This explains why you saw the hang still, but other users reported it was gone. Other users had other serial-connected displays or boards connected to computers or OctoPrint over serial, all of which used the new class. I have bypassed this so that it will use the I assume that you had to modify |
@sjasonsmith this bug keeps on giving. I suspected it had to do with linking or symbol overriding but didn't think to check the display driver. To answer your question, yes and no. This test was done with the stock malyan_lcd as present in the GitHub tracking, since it was aimed at testing the branch. You can use the serial pins in the EXP0 connector and it just works if you pull 3.3V and GND from the SWD. But in my stable branch I have heavily modified that library, mostly to make it poll less frequently and be more time-deterministic when API calls are made. This was because I originally thought it would fix the bug. |
Ok, I connected mine to the TFT port, hence the need for the modification. I’m working on a fix, but it unfortunately has to touch more than a couple lines of code. As a temporary workaround you can change Serial1 to MSerial1 in the Malyan CPP. You will also need to modify the bottom of MarlinSerial.cpp to make sure MSerial1 is instantiated. |
@minosg could you test again with the changes in the PR above, #19464? I believe this will resolve the hangs for both Malyan LCD and TMC drivers when using HardwareSerial. I was able to reproduce the hang on my machine, so I am hopeful it will be fixed for you as well. When updating to that PR, you will see a new option |
@sjasonsmith I will be testing it as soon as possible. I am not testing with my patched Malyanlcd code, I am always using the vanilla in the PR. I will let you know how it goes. |
I looked around the web for this issue and came up with this discussion. SKR E3 1.2 + BTT TFT35 E3 + Marlin 2.0.5.3 and 2.0.6.1 |
Try marlin bugfix branch. I think it’s fixed for your setup. |
@sjasonsmith I would like to report that after 24 hours of non stop testing the PR appears to fix the issue. I didn't modify the malyan lcd and just used the one in the branch, copying over the configurations . Sir I think that this is it..... |
Hello, I believe I am experiencing the same issue. The printer will freeze at a random point during prints. This usually happens after two hours but before five hours of printing. The freeze does not happen at any consistent point. The last two freezes did happen on the shell rather than internally, but I cannot remember if that was consistent previously. The one time the freeze happened in front of me there was no noise or other sign of distress; the print head simply stopped moving and slowly built a cocoon of filament around itself before stopping altogether. The heaters and power remain on after the freeze. This behavior happens regardless of the media used for printing (SD card, USB, OctoPrint). After the freeze the menus seem operable at first, but the printer is unresponsive to commands and the TFT35 display will lose connection (I believe the connection is already lost and the display has just not realized it). Rebooting the printer consistently restores functionality, but the prints do not seem to resume correctly (there is usually at least a layer shift). Ender 3 Pro |
@Incarnant can you post your configuration files? Have you modified any files other than the two configuration files? Is there any chance you have old and new source files mixed? |
@Incarnant you said firmware is from October 1 bugfix, but you mention several multi-hour failures. Did you have several failures in one day after Oct 1st, or were failures with older firmware? |
@sjasonsmith I am new to Marlin and Git so please let me know if you wanted different files. Here are the Configuration.h and Configuration_adv.h files. Three files were modified: Platformio.ini was modified for board type; Configuration.h was copied from the configuration examples and modified for bltouch and leveling; Configuration_adv.h was copied from the configuration examples. I did use the configuration files from the example configurations, but those files and all of the others were freshly retrieved. I deleted the previous files prior to unzipping the freshly downloaded Marlin files, so I don't believe there would have been any additional mixing of files. I have attempted one long print since the firmware was flashed on 10/2/20 (release date of the firmware appeared to be 10/1/20) and it failed as described. The previous failures were with previous versions of the firmware (I have attempted a firmware update twice over the month that I have been aware of the problem). |
@Incarnant does your board have an onboard SD? Can you do a long print from the onboard SD? |
@rhapsodyv Yes, it has an onboard sd card. I will kick one off now. |
Okay, some interesting results. The actual print went through to completion. However, the TFT35 display lost connection to the printer at some point. After the print was completed it still showed as actively printing with the timer incrementing. The buttons were still responsive and I could click through menus, but there was no observable response from the printer. Also, temperatures were displayed as at printing temperature, but the actual print bed and nozzle had cooled after printing. I had a thought yesterday and realized I should ask. Do I need to patch my Maple library for the bugfix firmware to work? |
@Incarnant no, you do not need to modify Maple. Marlin now overrides the Maple interrupt handler discussed in this issue with a version that doesn’t hang and supports an emergency parser. Your issue is something different from this bug report. This bug causes the entire printer to hang; it could not continue printing if the error occurred. I am going to close this issue, since we believe it is resolved and doesn’t match the problem you currently see. Your problem might be an issue with the TFT firmware, or might be another Marlin issue not directly related to this one. |
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Bug Description
This ticket is aimed to report a correlation on a common but hard to reproduce issue across a variety of unrelated users and configurations. I have been following older tickets and running experiments on a spare machine of mine using the BTT_SKR_MINI board 1.2 and Malyan_LCD
#18117
#18315
#17161
Possibly related
#18356
On STM32F1 platforms the watchdog is disabled, which explains why it freezes in that mode, but I do not think the issue is limited to them
#18226
My Configurations
configurations.zip
Steps to Reproduce
high baud-rate 50000 it requires. ( It makes it more likely to occur ). If your display allows you to
change the baud rate increase it to its maximum
Expected behavior:
I expect all the prints to complete
Actual behavior:
At random occurrences the printer freezes with heaters on, and if watchdog is enabled it just resets
Additional Information
As discussed before #18177 , I have confirmed that most of the previous reported workarounds, while they make the issue less frequent, will not fix it.
I have confirmed that it freezes when
The only common occurrence across all tickets submitted with related behavior in the last 6 months are:
I have not been able to locate a single report of it happening with steppers on standalone legacy mode.
What I believe is the actual cause of this issue:
Theory 1
An interrupt kicks in when already in exception mode. LR 0xfffffff9 indicates an EXC_RETURN state. When trying to recover the original LR, it has been corrupted and the code lands in an undefined space
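As a reading aid for the LR value quoted in Theory 1, here is a small decoder for the three EXC_RETURN encodings that the ARMv7-M architecture defines for cores without an FPU, such as the Cortex-M3 in the STM32F103. The function name `describe_exc_return` is made up for this sketch; only the three constants come from the architecture.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Decode the value found in LR while inside an exception handler.
// Only the three ARMv7-M encodings for cores without an FPU are handled.
std::string describe_exc_return(uint32_t lr) {
    switch (lr) {
        case 0xFFFFFFF1u:
            return "return to Handler mode, main stack";     // a handler was preempted
        case 0xFFFFFFF9u:
            return "return to Thread mode, main stack";      // thread code was interrupted
        case 0xFFFFFFFDu:
            return "return to Thread mode, process stack";
        default:
            return "not a defined EXC_RETURN value";
    }
}
```

On its own, 0xfffffff9 is the value LR normally holds inside a handler that interrupted thread-mode code running on the main stack. A handler that preempted another handler would see 0xfffffff1 instead, which may be worth checking against the captured call stack.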
Theory 2:
It could also be possible that the UartInterrupt flag for the screen uart is not properly cleared and the system locks in this loop of libmaple's usart_private.h
Consecutive triggering of the interrupt could cause stack corruption, resulting in an improperly formatted LR register
Last debug session call stack when the incident occurred
Registers state
Workaround/Patches
The issue has been correctly identified and can be mitigated by patching your local Maple library, usually residing inside the platform folder
On Windows, the location of the file that needs to be edited is:
C:\Users\YOURUSERNAME\.platformio\packages\framework-arduinoststm32-maple\STM32F1\system\libmaple\usart_private.h
On macOS, open a terminal and use the following command to find your platformio folder
open ~/.platformio
On Linux it should be in the same location
cd ~/.platformio
The callback for usart in usart_private.h needs to be replaced by either patch:
PATCH 1
PATCH 2
How to choose: