Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relax OS_Time IRQHandler #2833

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

kisslorand
Copy link
Contributor

@kisslorand kisslorand commented Sep 10, 2023

Requirements

BTT or MKS TFT.

Description

This PR is an attempt to ease up the OS timer IRQ handler. Major credit goes to @rondlh who brought it to our attention here: #2832. More than that @rondlh made several suggestions to make the IRQ handler lighter on the MCU, suggestions that are included in this PR.
The extra changes are:

  • remove modulo operation which is expensive and replace it with bare simple arithmetic ones
  • remove the usage of any conditional and ternary from one of the functions called in that interrupt handler and replaced it with bare arithmetic operations
  • moved the interrupt flag reset to the end of the operations to avoid any unpredictable behaviour of the IRQ handler

Benefits

In theory it speeds up the TFT.

@rondlh
Copy link

rondlh commented Sep 10, 2023

Brilliant improvement for updatePrintTime, but why did you make void loopTouchScreen 2 to 3 times slower than before? Sabotage?

@kisslorand
Copy link
Contributor Author

Sorry, my sense of humor might be very different from yours, I really cannot decide it was meant as a joke or you are asking for real.

@rondlh
Copy link

rondlh commented Sep 10, 2023

Sorry, my sense of humor might be very different from yours, I really cannot decide it was meant as a joke or you are asking for real.

I'm asking for real, your implementation is 2-3 times slower than before, the sabotage part is obviously a joke.

@kisslorand
Copy link
Contributor Author

Can you pleased share the benchmark you used to measure the performance decrease?
All I see is performance increase there.

@rondlh
Copy link

rondlh commented Sep 10, 2023

Can you pleased share the benchmark you used to measure the performance decrease?
All I see is performance increase there.

Show me... I theoretically can see your implementation is much slower

@kisslorand
Copy link
Contributor Author

Show what?

  touch = !XPT2046_Read_Pen() * (touch + !touchScreenIsPress);
  touchScreenIsPress = !(TOUCH_DEBOUNCE_MS - touch);

Are we talking about these 2 lines?

@rondlh
Copy link

rondlh commented Sep 10, 2023

Yes, I'm talking about loopTouchScreen.
You say you see performance increase, what did you see/find? What test? Any numbers?

@kisslorand
Copy link
Contributor Author

kisslorand commented Sep 10, 2023

How did you come to the conclusion it is 2-3 slower?

Just to clarify: there's no sabotage intention, I spent quite some time to try to increase the performance of that function.

@rondlh
Copy link

rondlh commented Sep 10, 2023

How did you came to the conclusion it is 2-3 slower?

Like I said, theoretically I can see that. But you said:

All I see is performance increase there.

So the question is... how did you see performance improvement?

@kisslorand
Copy link
Contributor Author

I have the same question, how do you see performance decrease?

@rondlh
Copy link

rondlh commented Sep 10, 2023

There is "an issue" in your code, similar to the issue in "ADVANCED_OK", that stops it from being efficient.

@kisslorand
Copy link
Contributor Author

You said there's no need for benchmarks, but you just intrigued me. I will make some benchmarks and will come back with the results. It might take an hour or so.

@rondlh
Copy link

rondlh commented Sep 10, 2023

OK, take it easy.
May I ask what nationality you are? (I'm Dutch)

@kisslorand
Copy link
Contributor Author

I am already on it. ;)

@kisslorand
Copy link
Contributor Author

I have the results. :)
CodeBattle

@rondlh
Copy link

rondlh commented Sep 10, 2023

How did you test?
Can you show me the code?

@kisslorand
Copy link
Contributor Author

kisslorand commented Sep 10, 2023

Actually the two codes compiled do perform the same, the compiler does its job quite well. The difference above could be a rounding issue as the timer is counting microseconds (I did more tests and the results were identical for both code versions). The previous test were done on a Cortex-M4 MCU, I have the results on a Cortex-M3 MCU also:

CodeBattle1

I'm aware that it is harder to understand the code I wrote, I have no issue undoing it and keeping the original code since I know now that the compiler knows how to optimize it.

This test was interesting for me, I wasn't aware that a Cortex-M4 MCU is so much faster than a Cortex-M3.

@kisslorand
Copy link
Contributor Author

May I ask what nationality you are? (I'm Dutch)

There's no secret, I am Hungarian. You could already tell that since I made many additions/changes to the Hungarian language pack.

@rondlh
Copy link

rondlh commented Sep 11, 2023

My tests show that your loopTouchScreen is only 25% slower than the original code (STM32F207), not the 2-3 times I theoretically expected. Only code readability has decreased by a factor of 2-3.

You can make loopTouchScreen a tiny bit faster by dropping the guard inversion, and swapping the normal and else case. This trick can also be used elsewhere in the code, but the benefit is very small.
The DMA serial writing I have under test has significantly more impact.

Slightly faster code:

void loopTouchScreen(void)  // Handle in interrupt
{
  static uint8_t touch;
  if (XPT2046_Read_Pen())
  {
    touchScreenIsPress = false;
    touch = 0;
  }
  else
  {
    if (touch >= 20)  // 20ms
    {
      touchScreenIsPress = true;
    }
    else
    {
      touch++;
    }
  }
}

@kisslorand
Copy link
Contributor Author

What speed increase did your test show for your proposed code?

@rondlh
Copy link

rondlh commented Sep 11, 2023

What speed increase did your test show for your proposed code?

A few percent only.

Your modulo calculation replacement using a counter is faster. Usually you should count down if you do this, not up, but for modern compilers it might be irrelevant. For the TFT performance it has virtually no impact.
Confirmed: Counting down is about 15% faster than counting up (only considering the counting part).

So overall, the conclusions in this PR:
Cons:

@kisslorand
Copy link
Contributor Author

kisslorand commented Sep 11, 2023

Opinions are always welcome but taken with a grain of salt.

@kisslorand kisslorand force-pushed the OS_Time branch 2 times, most recently from 03ba16b to 2bc45e4 Compare September 11, 2023 22:11
@rondlh
Copy link

rondlh commented Sep 12, 2023

Where is the opinion in this? I only see facts.

  • loopTouchScreen has become about 25% slower

This is a fact, even your own tests show that at best loopTouchScreen has not become faster, only more unreadable. That your code is slower was obvious to me immediately when I saw it, although I overestimated the speed decrease that it causes. Look again at the code... it's obvious! If you still don't see it then I can show it to you.

  • loopTouchScreen is completely unreadable and unmaintainable

Fact, you also agree to this above.

  • most of the changes are just copied from

Very likely, or perhaps you came up with the exact same idea a day after @digant73 did

  • TIM7_IRQHandler is a bit faster (still not optimal)

Fact, like I showed you above, you need to count down, not up to be faster.

  • Some unused code was removed

Fact (+43 −293), but if you deny it I'm willing to take your side.

Your code should not be merged in it's current state! I don't understand that you keep "force pushing" a merge while being aware of the issues in your code.

BTW: Force-pushing slower and less readable code is sabotage in my book

@rondlh
Copy link

rondlh commented Sep 12, 2023

Can somebody please review this code change force-pushed by @kisslorand?
Please consider execution speed and code readability. In my view it's obvious, but I still did a benchmark which confirmed my suspicions. This is a routine handling the touch screen presses of the TFT and is executed 1000x a second.

ORIGINAL CODE:

static volatile bool touchScreenIsPress = false;
void loopTouchScreen(void)  // Handle in interrupt
{
  static uint8_t touch;
  if (!XPT2046_Read_Pen())
  {
    if (touch >= 20)  // 20ms
    {
      touchScreenIsPress = true;
	  lastTouchTime = OS_GetTimeMs(); // IRON, ADDED TO RECORD TOUCH TIME
    }
    else
    {
      if (OS_GetTimeMs() - lastTouchTime > 250) // IRON, PREVENT TOO FAST TOUCHING
      touch++;
    }
  }
  else
  {
    touchScreenIsPress = false;
    touch = 0;
  }
}

@kisslorand VERSION: (comments removed that hold the original version, I wonder why those were left in the code)

static volatile bool touchScreenIsPress = false;
void loopTouchScreen(void)  // Handle in interrupt
{
  static uint8_t touch;
  touch = !XPT2046_Read_Pen() * (touch + !touchScreenIsPress);
  touchScreenIsPress = !(TOUCH_DEBOUNCE_MS - touch);
}

UPDATE: @kisslorand has seen the light and updated loopTouchScreen to:

void loopTouchScreen(void)  // Handle in interrupt
{
  if (XPT2046_Read_Pen() == LOW)
    touchCountdown -= !!touchCountdown;
  else
    touchCountdown = TOUCH_DEBOUNCE_MS;
}

Eliminating a variable (touchScreenIsPress) makes sense as it basically represented the same thing as touch was. It's a bit of a dirty trick with touchCountdown, but this should work fine and is a bit faster, good job.

@kisslorand
Copy link
Contributor Author

kisslorand commented Sep 13, 2023

@kisslorand has seen the light

Actually what I've seen is the rookie mistake I made benchmarking the original function and mine. In the benchmark I didn't use the variables from the functions benchmarked, the compiler optimized the code and eliminated the unused variables.
A lesson for everybody, you are not benchmarking the code you wrote, you are benchmarking the code the compiler generates!

After the realization of my mistake I redid the tests, the results are the following:

Benchmark

Where "My new code" is:

void loopTouchScreen(void)  // Handle in interrupt
{
  if (XPT2046_Read_Pen() == LOW)
    touchCountdown -= !!touchCountdown;
  else
    touchCountdown = TOUCH_DEBOUNCE_MS;
}

"My old code" is:

void loopTouchScreen(void)  // Handle in interrupt
{
  static uint8_t touch;
  touch = !XPT2046_Read_Pen() * (touch + !touchScreenIsPress);
  touchScreenIsPress = !(TOUCH_DEBOUNCE_MS - touch);
}

"Ori code" is:

void loopTouchScreen(void)  // Handle in interrupt
{
    static uint8_t touch;
    if (!XPT2046_Read_Pen())
    {
      if (touch >= 20)  // 20ms
      {
        touchScreenIsPress = true;
      }
      else
      {
        touch++;
      }
    }
    else
    {
      touchScreenIsPress = false;
      touch = 0;
    }
}

"Dumb code" is same as "Ori code" but variables not being used, it can be observed how the optimization of compiler the kicks in.

So @rondlh was right, my old code was 20% slower than the original code (not 25% as he stated, but who knows, maybe my tests are still inaccurate).

I have yet to verify the up/down count speed theory, although it is about the comparison to 0 (zero) and not the direction of the count. The specialized literature mentions that it was a thing in the 1980s but modern compilers and MCUs do not exhibit this behaviour but better test it, right?

Note: negative criticism is good as long as it is done in a civilized manner. Github also have this rule: "criticize the idea, not the person".

@kisslorand
Copy link
Contributor Author

kisslorand commented Sep 13, 2023

Up/Down count and comparison to zero results:

Count

Upcount: count from 0 to X
Downcount: count from X + 10 down to 10
CompToZero: count from X down to 0

The difference is clearly from comparison to 0 (zero), not from the count direction. The results of @rondlh for down count being more faster than up count is probably because he tested a down count to 0 (zero). It is the comparison to 0 (zero) that makes down count faster. Anyhow I can only see a 3% difference, not 15%, but in any case it is a free trick for speed.

Everyday we learn something...

@rondlh
Copy link

rondlh commented Sep 14, 2023

What do you aim to achieve by moving `TIMER_INTF(TIMER6) &= (uint16_t)~(1<<0); to the back of the ISR?
Generally that's not the right thing to do, but I'm curios to hear your explanation.

@kisslorand
Copy link
Contributor Author

In many microcontroller architectures, including those that use the STM32 series with the STM32 HAL library, clearing the interrupt flag at the end of the ISR is the recommended and standard practice (check TIM3, the buzzer interrupt). This is to ensure that the ISR completes its main tasks before allowing the processor to respond to new interrupts, minimizing interrupt latency and potential issues with nested interrupts.
In our case in reality it is irrelevant where's the interrupt reset done in the ISR for TIM7. I guess I did it out of habit.

@rondlh
Copy link

rondlh commented Sep 15, 2023

I also think it is irrelevant in this case because the ISR will terminate long before the next interrupt arrives.

If you have experience with this kind of thing then perhaps you can have a look at the serial write DMA interrupt handler I'm testing. I still have incidental issues (typically 1 byte lost, not received by the host). It's build on top of #2824, not sure if it will work for the current source base. (STM32F2_4 platform only!).
src-user-HAL-DMA Writing.zip
srt-user-API-SerialConnection.zip

I have been testing this with the serial line idle interrupt disabled (manual update of wIndex in Serial_Get). That seems to work fine (still testing), but when I enable the serial line idle interrupt then I see some rare issues (1 byte lost, not received by the host). Note that I use the same ISR for the serial idle interrupt (reading) and the DMA serial writing (transfer complete interrupt). I don't use the DMA transfer complete interrupt because that is very complex, but probably it's the better choice.
Any insight would be appreciate.

UPDATE: Nice to see you like it. I think the code is actually fine. I slightly improved it in a search to find an issue I am having. But the issue is not causes by the DMA transfer. I have some data corruption at very high print speeds (> 500%), sometimes a byte is missed at the host, about 1 byte in 6MB. But I see the same issue when using Interrupt based serial writing and unbuffered writing. Serial speed is at 250k Baud, now I'm testing at 115200 baud to see if that solves the issue.

@rondlh
Copy link

rondlh commented Sep 21, 2023

#2824 now contains a performance benchmark that you can enable in Configuration.h (DEBUG_MONITORING). When enabled, in the notification menu you will find another button to bring you to the monitoring screen. The scan rate will show you how many times the loopProcess is executed. This way you can test the impact of your improvement easily.

@kisslorand
Copy link
Contributor Author

Thanks for the info, I did that.
The speed increase is at best 2% compared to the original code. I even tested modulo with 1024 (os_timer & 0x03ff == 0) which gave occasionally at best 3% speed increase. It's amazing how practice beats theory. :)
I think this PR can be entirely ignored given the new data.
Yes, the TIM7 is 10% faster than it was originally but in the big scheme of the entire FW it gives at best 2% speed increase.

@rondlh
Copy link

rondlh commented Sep 21, 2023

So you mean that this PR increases the scan rate by about 2%?

@kisslorand
Copy link
Contributor Author

So you mean that this PR increases the scan rate by about 2%?

Yes.

@rondlh
Copy link

rondlh commented Sep 22, 2023

What concrete numbers do you get? Before and after scan rates? What hardware do you use?

@kisslorand
Copy link
Contributor Author

The checking was done on a BTT TFT35 V3.0 with STM32F207

The scan rate on average was:

  1. master branch -> 29400~29600
  2. this PR -> 29900~30100

@rondlh
Copy link

rondlh commented Sep 22, 2023

So you mean that this PR increases the scan rate by about 2%?

Yes.

The checking was done on a BTT TFT35 V3.0 with STM32F207

The scan rate on average was:

1. master branch -> 29400~29600

2. this PR -> 29900~30100

I think you are doing something very wrong, nothing you state here makes any sense.
The speed improvement is a solid 0.0% (BTT TFT35 V3.0)
2% is complete non-sense, anybody can see that when looking at the tiny source code changes.

@kisslorand
Copy link
Contributor Author

I think you are doing something very wrong, nothing you state here makes any sense.

It's the opinion of the same person who has the strong belief that counting down is faster than counting up. :)

2% is complete non-sense, anybody can see that when looking at the tiny source code changes.

A real professional argument brought to the table! :)

I guess it's true what they say, a circus is not a circus without a clown.

Copy link

This PR has been automatically marked as stale because it has had no activity for the last 60 days. It will be closed in 7 days if no further activity occurs. Thank you for your contribution.

@github-actions github-actions bot added the Stale label Apr 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants