bad double for loop optimization #8261

Closed
6 tasks done
mhightower83 opened this issue Aug 8, 2021 · 15 comments

@mhightower83
Contributor

Basic Infos

  • This issue complies with the issue POLICY doc.
  • I have read the documentation at readthedocs and the issue is not addressed there.
  • I have tested that the issue is present in current master branch (aka latest git).
  • I have searched the issue tracker for a similar issue.
  • If there is a stack dump, I have decoded it.
  • I have filled out all fields below.

Platform

  • Hardware: [ESP-12]
  • Core Version: [5f04fbb]
  • Development Env: [Arduino IDE]
  • Operating System: [Ubuntu]

Settings in IDE

  • Module: [Nodemcu]
  • Flash Mode: [dio]
  • Flash Size: [4MB]
  • lwip Variant: [v2 Higher Bandwidth]
  • Reset Method: [nodemcu]
  • Flash Frequency: [40Mhz]
  • CPU Frequency: [80Mhz]
  • Upload Using: [SERIAL]
  • Upload Speed: [460800] (serial upload only)

Problem Description

With the newer "xtensa-lx106-elf-gcc (GCC) 10.3.0" I have had problems from time to time that I dismissed as errors in my use of Extended ASM. It now appears that may not have been completely true. The MCVE below builds and executes correctly with -O0, or with the old "xtensa-lx106-elf-gcc (GCC) 4.8.2" compiler at either -O0 or -Os.
However, in the case of "xtensa-lx106-elf-gcc (GCC) 10.3.0" with -Os, it generates bad code that does not properly update the array as it executes a double for loop.

This MCVE is an extreme minimization of the original bearssl experiment that was failing for me. There is always a chance I minimized out the failure case and added my own bug. However, I think I got everything cleaned up.

MCVE Sketch

uint16_t x[] = {1, 2, 3};

extern "C" void test(uint16_t *x);

void setup() {
  Serial.begin(115200);
  delay(200);
  Serial.println("\r\nTesting 1, 2, 3");

  test(x);

  for (size_t i = 0; i < sizeof(x)/sizeof(uint16_t); i++)
    ets_uart_printf("addr(%p), x[%u](0x%04X)\n", &x[i], i, x[i]);
}

void loop() {
}

To keep the compiler from defeating our test by optimizing straight to the answer, place this code alongside the sketch in a separate .c file:

#include <Arduino.h>
#include <pgmspace.h>  // make sure dependencies catch debug edits to file

#if 1
// Used this one to get a simplified ASM for inspection
#undef pgm_read_word
#define pgm_read_word get_uint16

static inline __attribute__((always_inline))
uint16_t get_uint16(const uint16_t *p16) {
  uint32_t val;
  asm volatile ("l16ui %[val], %[base], 0;" :[val]"=r"(val) :[base]"r"(p16) :);
  return val;
}

#elif 0
//  Extended ASM is not an issue, this will also fail.
#undef pgm_read_word
#define pgm_read_word get_uint16

static inline __attribute__((always_inline))
uint16_t get_uint16(const uint16_t *p16) {
  uint32_t val = (*(uint32_t *)((uintptr_t)p16 & ~0x3u));
  uint32_t pos = ((uintptr_t)p16 & 0x3u) * 8u;
  val >>= pos;
  return (uint16_t)val;
}

#else
//  Use PROGMEM, the original context the failure was observed with.
#endif


// #pragma GCC optimize("O0")   // Works
#pragma GCC optimize("Os")   // Fails
void test(uint16_t *x) {
  size_t len = 3;
  for (size_t u = 0; u < len; u++) {
    uint16_t x1 = pgm_read_word(&x[0]);
    for (size_t v = 0; v < len; v++) {
      x[v] = pgm_read_word(&x[v]) + x1;
    }
  }
}

Debug Messages

Sketch output when working will look something like this:

  Testing 1, 2, 3
  addr(0x3ffe85c8), x[0](0x0008)
  addr(0x3ffe85ca), x[1](0x0009)
  addr(0x3ffe85cc), x[2](0x000A)

Sketch failing output will look something like this (-Os):

  Testing 1, 2, 3
  addr(0x3ffe85c8), x[0](0x0002)
  addr(0x3ffe85ca), x[1](0x0003)
  addr(0x3ffe85cc), x[2](0x0004)

Working code generated under Arduino ESP8266 Core 2.7.4
xtensa-lx106-elf-gcc (GCC) 4.8.2 with -Os option

Disassembly of section .text.test:

00000000 <test>:
   0:	340c                	movi.n	a4, 3
   2:	001272               	l16ui	a7, a2, 0
   5:	030c                	movi.n	a3, 0
   7:	f47070               	extui	a7, a7, 0, 16
   a:	523a                	add.n	a5, a2, a3
   c:	001562               	l16ui	a6, a5, 0
   f:	676a                	add.n	a6, a7, a6
  11:	005562               	s16i	a6, a5, 0
  14:	332b                	addi.n	a3, a3, 2
  16:	f06366               	bnei	a3, 6, a <test+0xa>
  19:	440b                	addi.n	a4, a4, -1
  1b:	fe3456               	bnez	a4, 2 <test+0x2>
  1e:	f00d                	ret.n

Failing code generated under Arduino ESP8266 Core current master
xtensa-lx106-elf-gcc (GCC) 10.3.0 with -Os option

Disassembly of section .text.test:

00000000 <test>:
   0:	822b                	addi.n	a8, a2, 2
   2:	04c272               	addi	a7, a2, 4
   5:	03a062               	movi	a6, 3
   8:	001232               	l16ui	a3, a2, 0
   b:	f43030               	extui	a3, a3, 0, 16
   e:	001252               	l16ui	a5, a2, 0
  11:	535a                	add.n	a5, a3, a5
  13:	f45050               	extui	a5, a5, 0, 16
  16:	001842               	l16ui	a4, a8, 0
  19:	434a                	add.n	a4, a3, a4
  1b:	f44040               	extui	a4, a4, 0, 16
  1e:	001792               	l16ui	a9, a7, 0
  21:	339a                	add.n	a3, a3, a9
  23:	660b                	addi.n	a6, a6, -1
  25:	f43030               	extui	a3, a3, 0, 16
  28:	fdc656               	bnez	a6, 8 <test+0x8>
  2b:	005252               	s16i	a5, a2, 0
  2e:	015242               	s16i	a4, a2, 2
  31:	025232               	s16i	a3, a2, 4
  34:	f00d                	ret.n
@mcspr
Collaborator

mcspr commented Aug 9, 2021

Slightly modified test code to digest / try to understand all of the functions mentioned above:
https://gist.github.com/mcspr/b0add9ffc049eb93c174621851b624fa
Added an if 1 / if 0 case with plain *p16; instead of the asm variant, modified the plain-C version to offset by 2 instead of 4, plus a macro to unroll the inner loop to see the differences.

Is the issue with the asm the missing "memory" clobber at the end?
With the original plain-C & pgm functions it does read out of bounds, since the alignment is 2 (u16) and not 4 (u32), so that's also undefined behaviour from the compiler's POV?
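
For concreteness, a sketch of that out-of-bounds read, using the addresses from the debug output above:

```cpp
uint16_t x[] = {1, 2, 3};  // 6 bytes; only 2-byte alignment is guaranteed
// In the sketch output, x starts at 0x3ffe85c8, so x[2] occupies
// 0x3ffe85cc..0x3ffe85cd.  pgm_read_word(&x[2]) rounds the pointer down to
// 0x3ffe85cc and does a 32-bit load covering 0x3ffe85cc..0x3ffe85cf,
// i.e. two bytes past the end of x.
```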

@earlephilhower
Collaborator

@mcspr This is interesting, but I'm not sure the repo is the right place to pose the question. Given that it seems you can isolate the issue to a single test routine, wouldn't the GCC bugs mailing list be more appropriate? https://gcc.gnu.org/bugs/

While we do have simple machine definition patches for the xtensa, I'm really not sure how they would even be tangentially related to the code you're showing. https://github.com/earlephilhower/esp-quick-toolchain/tree/master/patches/gcc10.3

Maybe a clean gcc10.3 build from scratch with the ESP8266 xtensa architecture headers, then a gcc -S ... to see if there is any difference?

@mhightower83
Contributor Author

@mcspr Thanks for reviewing.

The problem I had with creating a simple example for the original failure I observed was that a certain amount of complexity appeared to be needed for the failure to occur. From empirical observations, *p16 and the 2-byte offset instead of 4 bytes result in too simple an example. Your example is showing the correct answer for these and a failing result from pgm_read_word.

If we increase the array size to 4 elements and align to 4 to correct the out-of-bounds technicality, and keep the C function with 32-bit word access, this change can still demonstrate the failure without introducing Extended ASM. Since Extended ASM is also used in PROGMEM, I wanted to show/prove that the issue was not with Extended ASM.

uint16_t x[] __attribute__((aligned(4))) = {1, 2, 3, 0} ;

@earlephilhower
Collaborator

Oops, my bad. Sorry, my previous comment #8261 (comment) should have been directed to @mhightower83 . Rough Monday morning here!

@mcspr
Collaborator

mcspr commented Aug 9, 2021

@mhightower83 true, the first two cases just try to 'fix' the access through the pointer and are not really supposed to fail, though (plus "memory" to circumvent the gcc caching / whatever it does with memory accesses there; I don't really have a great explanation for asm, as I mostly try to avoid it instead of actually using it :)

I am leaning toward the idea that the optimization is correct in its assumptions, and that to actually read the values the code should change its approach? As for failing examples, consider the order of operations here (which is sort of similar to what happens in the pgm & plain-C variants):

    uint16_t a[2] {1,2};
    Serial.printf("unaligned 0x%08X\n", reinterpret_cast<uintptr_t>(&a[1]));

    uintptr_t aligned = reinterpret_cast<uintptr_t>(&a[1]) & static_cast<uintptr_t>(~3);
    Serial.printf("aligned 0x%08X\n", aligned);
    a[0] = 3;
    a[1] = 4;

    uint32_t b = *reinterpret_cast<const uint32_t*>(aligned);
    Serial.printf("%08X\n", b);
    Serial.printf("a[0](%p)=%hu\n", &a[0], a[0]);
    Serial.printf("a[1](%p)=%hu\n", &a[1], a[1]);

Placed into the setup:

unaligned 0x3FFFFF82
aligned 0x3FFFFF80
00020001
a[0](0x3fffff80)=3
a[1](0x3fffff82)=4

And I would also agree with the gcc mailing list advice: https://gcc.gnu.org/lists.html

@mhightower83
Contributor Author

Working through the "Before reporting that GCC compiles your code incorrectly, compile it with ..." guidance at https://gcc.gnu.org/bugs/, I found that the -fno-strict-aliasing build option resolves the problem with PROGMEM pgm_read_word and my plain-C version of the same. @mcspr I think this is in line with your thinking.

As for the newer issue with Extended ASM and the 10.3 compiler's read-caching optimization: the Extended ASM references memory through a pointer, the compiler doesn't know about it, and the data read may be stale because the compiler is still holding the current value in a register. My solution: don't use memory pointers in Extended ASM for reading or writing to memory.
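
For comparison, a sketch of the alternative I am choosing not to rely on: if the asm does read memory through the pointer, the compiler can at least be told about it with a dummy "m" input operand (or, more bluntly, a "memory" clobber):

```cpp
static inline __attribute__((always_inline))
uint16_t get_uint16(const uint16_t *p16) {
  uint32_t val;
  // The extra "m"(*p16) input declares that the asm reads *p16, so the
  // compiler will not keep serving a stale copy of it from a register.
  asm volatile ("l16ui %[val], %[base], 0"
                : [val]"=r"(val)
                : [base]"r"(p16), "m"(*p16));
  return val;
}
```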

For me, this answer on StackOverflow was useful "What is the strict aliasing rule?" https://stackoverflow.com/a/99010
A remark I found of particular relevance:

Basically, with this rule, it doesn't have to think about inserting instructions to refresh the contents of buff every run of the loop. Instead, when optimizing, with some annoyingly unenforced assumptions about aliasing, it can omit those instructions, load buff[0] and buff[1] into CPU registers once before the loop is run, and speed up the body of the loop.

Unless I misunderstood something, it appears I have gone from bug to feature.
And now the question left is: how best to address strict aliasing in PROGMEM pgm_read_word?
As well as in my collection of inline functions in mmu_iram.h: mmu_get_uint16(), etc.
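
For the record, one option I have not settled on (just a sketch using GCC's may_alias type attribute; the function name here is made up) would be to read through a pointer type that is allowed to alias anything:

```cpp
// Hypothetical pgm_read_word variant; __may_alias__ is a GCC extension that
// exempts accesses through this type from the strict-aliasing assumption.
typedef uint32_t __attribute__((__may_alias__)) aliasing_uint32_t;

static inline __attribute__((always_inline))
uint16_t pgm_read_word_aliasing(const void *p16) {
  uint32_t val = *(const aliasing_uint32_t *)((uintptr_t)p16 & ~(uintptr_t)3u);
  val >>= ((uintptr_t)p16 & 3u) * 8u;
  return (uint16_t)val;
}
```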

@mcspr
Collaborator

mcspr commented Aug 10, 2021

Huh. From the same page, there is a suggestion about unions, which are a special case for GCC:

uint16_t my_pgm_read_word(const void* p16) {
    union AsU16 {
        uint16_t value;
        uint32_t aligned;
    };

    const auto* ptr = reinterpret_cast<const AsU16*>(p16);
    return (*ptr).value;
}

The numbers are the same for the test array, and it also reads a PROGMEM pointer, as the name suggests.

I still don't get how PROGMEM is involved, though. If the issue is with changing values through different pointer types, then unless it's done like the test above, how is it changing the data on the flash / at that memory location?

@mhightower83
Contributor Author

mhightower83 commented Aug 10, 2021

@mcspr Right, it occurred to me, while clearing my head, that PROGMEM functions, when used against flash, would not be an issue. It is when they are used against IRAM that trouble could follow, which is what my test simulated (however, it was really DRAM).

So with that thought, I think the mmu_get/set group of functions in mmu_iram.h are the ones that will need special attention.

Also, I think it is clear that in the past many programs would have worked fine without following the strict-aliasing rule, and now with gcc 10.3 they may not work reliably.

My mind is numb and wants to stop; I'll close with this quick search result:
https://stackoverflow.com/questions/2958633/gcc-strict-aliasing-and-horror-stories

@TD-er
Contributor

TD-er commented Aug 12, 2021

Another very good read (so far... still reading it): "What is the Strict Aliasing Rule and Why do we care? (OR Type Punning, Undefined Behavior and Alignment, Oh My!)"

(A better title could have been "pun intended", by the way ;) )

@TD-er
Contributor

TD-er commented Aug 15, 2021

Hmm, I've been reading a lot on this subject over the last 2 days, and I'm not sure I will ever dare to compile without the -fno-strict-aliasing build flag set. At least not for embedded code where I have to rely on lots of low-level libraries, as I do now for ESPEasy.

Can the pre-compiled pieces of code we link, like the WiFi code (and others?), perhaps also be compiled with this flag set?
The more I read about this, the more situations I can imagine where code will run perfectly fine in one build and be completely unstable in other builds with only seemingly unrelated changes, like changing some flash string... which is exactly what is happening to me a lot (especially when using the esp8266/Arduino core 3.0.x code base).

@mcspr
Collaborator

mcspr commented Aug 15, 2021

@TD-er only related to the code generation happening with the gcc10.3, which does not include SDK blobs.

Specifically, the mmu_{get,set}... group of functions

/*
 * Some inlines to allow faster random access to non32bit access of iRAM or
 * iCACHE data elements. These remove the extra time and stack space that would
 * have occurred by relying on exception processing.
 */
static inline __attribute__((always_inline))
uint8_t mmu_get_uint8(const void *p8) {
  ASSERT_RANGE_TEST_READ(p8);
  uint32_t val = (*(uint32_t *)((uintptr_t)p8 & ~0x3));
  uint32_t pos = ((uintptr_t)p8 & 0x3) * 8;
  val >>= pos;
  return (uint8_t)val;
}
static inline __attribute__((always_inline))
uint16_t mmu_get_uint16(const uint16_t *p16) {
  ASSERT_RANGE_TEST_READ(p16);
  uint32_t val = (*(uint32_t *)((uintptr_t)p16 & ~0x3));
  uint32_t pos = ((uintptr_t)p16 & 0x3) * 8;
  val >>= pos;
  return (uint16_t)val;
}
static inline __attribute__((always_inline))
int16_t mmu_get_int16(const int16_t *p16) {
  ASSERT_RANGE_TEST_READ(p16);
  uint32_t val = (*(uint32_t *)((uintptr_t)p16 & ~0x3));
  uint32_t pos = ((uintptr_t)p16 & 0x3) * 8;
  val >>= pos;
  return (int16_t)val;
}
static inline __attribute__((always_inline))
uint8_t mmu_set_uint8(void *p8, const uint8_t val) {
  ASSERT_RANGE_TEST_WRITE(p8);
  uint32_t pos = ((uintptr_t)p8 & 0x3) * 8;
  uint32_t sval = val << pos;
  uint32_t valmask = 0x0FF << pos;
  uint32_t *p32 = (uint32_t *)((uintptr_t)p8 & ~0x3);
  uint32_t ival = *p32;
  ival &= (~valmask);
  ival |= sval;
  *p32 = ival;
  return val;
}
static inline __attribute__((always_inline))
uint16_t mmu_set_uint16(uint16_t *p16, const uint16_t val) {
  ASSERT_RANGE_TEST_WRITE(p16);
  uint32_t pos = ((uintptr_t)p16 & 0x3) * 8;
  uint32_t sval = val << pos;
  uint32_t valmask = 0x0FFFF << pos;
  uint32_t *p32 = (uint32_t *)((uintptr_t)p16 & ~0x3);
  uint32_t ival = *p32;
  ival &= (~valmask);
  ival |= sval;
  *p32 = ival;
  return val;
}
static inline __attribute__((always_inline))
int16_t mmu_set_int16(int16_t *p16, const int16_t val) {
  ASSERT_RANGE_TEST_WRITE(p16);
  uint32_t sval = (uint16_t)val;
  uint32_t pos = ((uintptr_t)p16 & 0x3) * 8;
  sval <<= pos;
  uint32_t valmask = 0x0FFFF << pos;
  uint32_t *p32 = (uint32_t *)((uintptr_t)p16 & ~0x3);
  uint32_t ival = *p32;
  ival &= (~valmask);
  ival |= sval;
  *p32 = ival;
  return val;
}

It is always reading from a valid object though, so the pgm side of things, which only reads, seems to be fine. Writing and reading is a problem though: unlike the pgm_... variants, writes through the pointer cause the optimizer to cut out code deemed unnecessary, as it treats the pointer-to-object-through-reinterpret-cast and the object itself separately.
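
To make that concrete, a contrived sketch of the read-after-write hazard (array placement and values are made up for illustration):

```cpp
uint16_t x[2] __attribute__((aligned(4))) = {1, 2};

void hazard() {
  x[0] = 5;  // store through the uint16_t object itself
  // mmu_set_uint16-style store through a reinterpret-cast uint32_t pointer
  uint32_t *p32 = (uint32_t *)((uintptr_t)&x[0] & ~(uintptr_t)3u);
  *p32 = (*p32 & ~0xFFFFu) | 7u;
  // With -fstrict-aliasing the compiler may assume a uint32_t store cannot
  // modify a uint16_t object, so it is allowed to print 5 here instead of 7.
  Serial.printf("x[0]=%u\n", (unsigned)x[0]);
}
```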

BTW, looking at the objdump of the union code above once again, it becomes just a single l16ui and isn't what I originally thought it'd do.
It is still confusing, though, whether this kind of code is actually a valid operation:

static inline uint16_t __attribute__((always_inline))
get_uint16_plain_c(const void *p16) {
    auto* ptr = reinterpret_cast<const void*>((uintptr_t)(p16) & (uintptr_t)(~3));

    uint32_t out;
    std::memcpy(&out, ptr, sizeof(uint32_t));
    out >>= ((uintptr_t)(p16) & (uintptr_t)(3)) * 8;

    return out;
}

(And, despite the pgmspace.h comment, the memcpy is optimized into a 32-bit-wide read; e.g. when adding an external func to avoid the inline marker, and also with float real_out; std::memcpy(&real_out, &out, sizeof(real_out));)

00000000 <_Z19my_pgm_read_wordPKv>:
   0:   c37c                    movi.n  a3, -4
   2:   103230                  and a3, a2, a3
   5:   0338                    l32i.n  a3, a3, 0
   7:   142020                  extui   a2, a2, 0, 2
   a:   402200                  ssa8l   a2
   d:   912030                  srl a2, a3
  10:   f42020                  extui   a2, a2, 0, 16
  13:   f00d                    ret.n

@TD-er
Contributor

TD-er commented Aug 15, 2021

@TD-er only related to the code generation happening with the gcc10.3, which does not include SDK blobs.

For this specific issue you found, sure, but the reason the -fno-strict-aliasing build flag generates the expected code for this issue is that you disable some optimizations which are enabled when selecting -O2 or higher.
The kind of issues that may be caused by these optimizations, for code that isn't strict-aliasing safe, has been a topic of discussion for much longer than GCC 10.x has been around.
So that's what I meant with my remark: I feel I may need to use this build flag to build my project, as I don't know how I can be certain at compile time that these optimizations will not break my code.

@mcspr
Collaborator

mcspr commented Aug 15, 2021

@TD-er That is all there is, though:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-fstrict-aliasing
https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wstrict-aliasing

Plus, the future gcc11 does even more stuff related to aliasing; I'd wager that changing assumptions related to object access is a better solution than trying to fight the compiler authors. And optimizations could be enabled / disabled at the object-file level by using optimize pragmas, if there are some definitive issues with the gcc implementation.
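
For illustration, the knobs being referred to look like this (same mechanism as the #pragma GCC optimize("Os") in the MCVE above; the function name is made up):

```cpp
// Everything after this point in the translation unit:
#pragma GCC optimize("no-strict-aliasing")

// Or limited to a single function:
__attribute__((optimize("no-strict-aliasing")))
uint16_t legacy_read(const uint16_t *p) {
  return (uint16_t)*(const uint32_t *)p;  // type-punning read this option tolerates
}
```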

@TD-er
Contributor

TD-er commented Aug 15, 2021

Sure, if we know of specific code not working well with these kinds of optimizations, then excluding some optimizations per file is the best way to go.
But I'm not sure where things go wrong.
One thing I do know is that my attempts at core 3.0.x builds work perfectly fine in one build, and in the next one it crashes very often, and the decoded stack quite often points to something in the String class.
So one could think the String class is to blame, but it could very well be in the mmu_iram code you showed.
I honestly don't know, and maybe to be safe I will build my project, at least for now, with these optimizations disabled, just to get some progress in my development, as I feel I've been held back for years now trying to find bugs in a black box while blindfolded and riding a mechanical bull... well, you get the idea.

@mhightower83
Contributor Author

From https://gcc.gnu.org/bugs/:
Before reporting that GCC compiles your code incorrectly, compile it with gcc -Wall -Wextra and see whether this shows anything wrong with your code. Similarly, if compiling with -fno-strict-aliasing -fwrapv -fno-aggressive-loop-optimizations makes a difference, or if compiling with -fsanitize=undefined produces any run-time errors, then your code is probably not correct.

Since -fno-strict-aliasing revealed a cause for my failure that emerged by going from Core 2.7.4 to 3.0, I am thinking that some of the other options in the GCC guidance quoted above might shed some light for others whose projects build okay but do not work properly under Core 3.0.

@TD-er and @mcspr thank you for those links and discussions. They were very helpful. To comply with strict-aliasing rules, I will mostly use __builtin_memcpy to ensure the compiler gets the opportunity to do the right thing. This should help avoid any issue with a #define changing memcpy that might exist for other purposes. To preserve 32-bit word access (and block conversion to 16-bit or 8-bit loads), I found that an empty Extended ASM reference to the 32-bit value did the trick. This helped persuade the compiler's optimizer to keep the 32-bit word access.

An example code fragment complying with strict-aliasing rules using a union:

    const uint32_t *icache_flash = (const uint32_t *)0x40200000u;
    union {               // to comply with strict-aliasing rules
      image_header_t hdr; // total size 8 bytes
      uint32_t u32;       // we only need the 1st 4 bytes
    } imghdr_4bytes;
    // read the first 4 bytes (magic byte + flash config)
    imghdr_4bytes.u32 = *icache_flash;
    /*
      ICACHE memory read requires aligned word transfers.  Because
      imghdr_4bytes.hdr.flash_size_freq is a byte value, the GCC 10.3 compiler
      tends to optimize out our 32-bit access for 8-bit access. If we reference
      the 32-bit word from Extended ASM, this persuades the compiler to keep the
      32-bit register load and extract the 8-bit value later.     
     */
    asm volatile ("# imghdr_4bytes.u32 => %0" ::"r"(imghdr_4bytes.u32));
    flashchip->chip_size = esp_c_magic_flash_chip_size(imghdr_4bytes.hdr.flash_size_freq >> 4);

An example complying with strict-aliasing rules using __builtin_memcpy:

static inline __attribute__((always_inline))
uint8_t mmu_get_uint8(const void *p8) {
  void *v32 = (void *)((uintptr_t)p8 & ~(uintptr_t)3u);
  uint32_t val;
  __builtin_memcpy(&val, v32, sizeof(uint32_t));
  asm volatile ("" ::"r"(val));  // helps ensure we keep the 32-bit load operation
  uint32_t pos = ((uint32_t)p8 & 3u) * 8u;
  val >>= pos;
  return (uint8_t)val;
}
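
And, following the same pattern, a sketch of what the corresponding write path could look like (not necessarily the exact code that gets merged):

```cpp
static inline __attribute__((always_inline))
uint8_t mmu_set_uint8(void *p8, const uint8_t val) {
  void *v32 = (void *)((uintptr_t)p8 & ~(uintptr_t)3u);
  uint32_t ival;
  __builtin_memcpy(&ival, v32, sizeof(uint32_t));  // 32-bit read
  asm volatile ("" ::"r"(ival));                   // keep the 32-bit load
  uint32_t pos = ((uint32_t)p8 & 3u) * 8u;
  ival &= ~((uint32_t)0xFFu << pos);               // clear the target byte
  ival |= (uint32_t)val << pos;                    // insert the new value
  __builtin_memcpy(v32, &ival, sizeof(uint32_t));  // 32-bit write back
  return val;
}
```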

And a thank you to @jjsuwa-sys3175 for introducing me to the concept of "injecting dependency" by way of unused input variables in Extended ASM.

mcspr pushed a commit that referenced this issue Oct 13, 2021
These changes are needed to address bugs that can emerge with the improved optimization from the GCC 10.3 compiler.

Updated performance inline functions `mmu_get_uint8()`, ... and `mmu_set_uint8()`, ...  to comply with strict-aliasing rules. 
Without this change, stale data may be referenced. This issue was revealed in discussions on #8261 (comment) 

Changes to avoid over-optimization of 32-bit-wide transfers from IRAM, turning them into 8-bit or 16-bit transfers, by the new GCC 10.3 compiler. This has been a recurring/tricky problem for me with the new compiler.

So far referencing the 32-bit value loaded by way of an Extended ASM R/W output register has stopped the compiler from optimizing down to an 8-bit or 16-bit transfer.

Example:
```cpp
  uint32_t val;
  __builtin_memcpy(&val, v32, sizeof(uint32_t));
  asm volatile ("" :"+r"(val)); // inject 32-bit dependency
  ...
```

Updated example `irammem.ino`
* Do a simple test of compliance with strict-aliasing rules
* For `mmu_get_uint8()`, added tests to evaluate whether 32-bit-wide transfers were converted to an 8-bit transfer.