ESP32 I2S 3-step Cadence Faster encoding #855
-
Make note of the endianness of variables. I believe I ran into an issue where one of the ESP32 variants (the C3?) uses a different core (ARM) and its endianness layout was different; part of the reason for the move to working only in bytes was to reduce complexity and fix some bugs around this. So you need to test all the platforms to make sure nothing unexpected appears.
-
I have created a pull request as part of the DMX512 encoding work for the faster 3-step encoding, and I have added support for big-endian byte order in it as well. I have not found any boards that use it, and on the Arduino forum no one knew of one that does, but if the macro is defined it should work just fine. I did not go as far as providing for the (new to me) PDP-endian order. Anyway, it's there; have a look if you have time.
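For reference, a compile-time check along these lines is one common way to handle it; the macro name `NEO_BIG_ENDIAN` and the helper below are only illustrative, not necessarily what the pull request actually uses:

```cpp
#include <stdint.h>

// Illustrative only: detect a big-endian target at compile time using the
// GCC-provided byte-order macros. NEO_BIG_ENDIAN is a made-up name here.
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#define NEO_BIG_ENDIAN 1
#endif

// Return a 16-bit value laid out in little-endian wire order regardless of
// the host byte order, so the DMA buffer bytes come out the same everywhere.
static inline uint16_t HostToLittleEndian16(uint16_t value)
{
#if defined(NEO_BIG_ENDIAN)
    return (uint16_t)((value >> 8) | (value << 8));
#else
    return value;
#endif
}
```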
-
With the new 3-step encoding for the ESP32 I2S, the trade-off is less memory use at the cost of encoding speed. While I was busy with the DMX512 method (which, as a by-product, was much easier for me to add), it occurred to me that the 3-step encoding could be made significantly quicker, so I decided to have a go. My first idea was to use a double (2 x 16-entry, 32-bit) lookup table, similar to the 4-step but wider, since the 12-bit result per nibble eventually has to end up as a multiple of 8 bits. The way to speed things up over the current method was to remove as many bit-shifts as possible: bit-shifts are relatively slow, and shifting a variable 8 bits either way takes 8 times as long as shifting it 1 bit (unlike, for instance, addition, where x + 8 takes just as long as x + 17).
I was even looking for a way to read the high nibble directly from the source byte. If I remember correctly, the Z80 had a specific instruction to do just that, but that probably doesn't exist on a modern ESP anymore. So anyway, I came up with this.
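In outline (simplified, with the 0b100 / 0b110 cadence bit patterns and the little-endian byte swap as assumptions rather than the exact code), the approach is: two 16-entry 32-bit tables, one per nibble, with the 12-bit patterns pre-positioned and pre-byte-swapped so the hot loop is just two lookups, an OR and a 3-byte memcpy():

```cpp
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// 3-step cadence: every source bit expands to three output bits,
// assumed here as 0 -> 0b100 and 1 -> 0b110, so one byte becomes 24 bits.

// Two 16-entry tables, one per nibble position. The 12-bit patterns are
// pre-positioned and pre-byte-swapped (for a little-endian core) so the
// hot loop only needs a lookup, an OR and a 3-byte memcpy().
static uint32_t s_encodeHigh[16];
static uint32_t s_encodeLow[16];

// build the 12-bit cadence pattern for one nibble, MSB first
static uint16_t NibbleTo3Step(uint8_t nibble)
{
    uint16_t pattern = 0;
    for (uint8_t bit = 0; bit < 4; bit++)
    {
        pattern <<= 3;
        pattern |= (nibble & 0x08) ? 0b110 : 0b100;
        nibble <<= 1;
    }
    return pattern;
}

void InitEncodeTables(void)
{
    for (uint8_t n = 0; n < 16; n++)
    {
        // high nibble lands in output bits 23..12, low nibble in bits 11..0
        uint32_t high = (uint32_t)NibbleTo3Step(n) << 12;
        uint32_t low = NibbleTo3Step(n);

        // swap the three bytes so that on a little-endian core the first
        // byte in memory is the one that must be sent first (bits 23..16)
        s_encodeHigh[n] = ((high >> 16) & 0xff) | (high & 0xff00) | ((high & 0xff) << 16);
        s_encodeLow[n] = ((low >> 16) & 0xff) | (low & 0xff00) | ((low & 0xff) << 16);
    }
}

// encode count source bytes into 3 * count output bytes
void Encode3StepBuffer(const uint8_t* src, uint8_t* dst, size_t count)
{
    while (count--)
    {
        uint8_t value = *src++;
        uint32_t combined = s_encodeHigh[value >> 4] | s_encodeLow[value & 0x0f];
        memcpy(dst, &combined, 3); // only three of the four bytes are meaningful
        dst += 3;
    }
}
```

The table contents are computed once with shifts, so the per-byte work in the loop contains no shifting beyond extracting the two nibbles.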
First I tested the resulting bit pattern on my UNO and compared it to what the current encoding produces, and after some fiddling with the memcpy() pointers I got it to match. A quick speed comparison showed great promise.
Then another thought occurred to me: get rid of the memcpy() and assign directly into 16-bit variables, using 6 x 16-entry 16-bit lookup tables.
Again it took a bit of fiddling to get right, but it appeared to be marginally slower than the first attempt.
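In outline, that second attempt looks something like this (again simplified, with the same assumed cadence patterns and my own table names). Two source bytes always produce exactly three 16-bit words, and two of the four nibble patterns straddle a word boundary, which is why six tables are needed:

```cpp
#include <stdint.h>
#include <stddef.h>

// Same pattern builder as in the previous sketch: 0 -> 0b100, 1 -> 0b110,
// MSB first, so one nibble becomes a 12-bit cadence pattern.
static uint16_t NibbleTo3Step(uint8_t nibble)
{
    uint16_t pattern = 0;
    for (uint8_t bit = 0; bit < 4; bit++)
    {
        pattern <<= 3;
        pattern |= (nibble & 0x08) ? 0b110 : 0b100;
        nibble <<= 1;
    }
    return pattern;
}

// Two source bytes expand to 48 bits = three 16-bit words. The four nibble
// patterns land at bit offsets 36, 24, 12 and 0 of that 48-bit group, so two
// of them straddle a word boundary; six 16-entry tables hold the pre-shifted
// halves so the encode loop needs no shifts beyond nibble extraction.
static uint16_t s_t0[16];  // high nibble of byte 0 -> word 0, bits 15..4
static uint16_t s_t1a[16]; // low nibble of byte 0  -> word 0, bits 3..0
static uint16_t s_t1b[16]; // low nibble of byte 0  -> word 1, bits 15..8
static uint16_t s_t2a[16]; // high nibble of byte 1 -> word 1, bits 7..0
static uint16_t s_t2b[16]; // high nibble of byte 1 -> word 2, bits 15..12
static uint16_t s_t3[16];  // low nibble of byte 1  -> word 2, bits 11..0

void InitEncode16Tables(void)
{
    for (uint8_t n = 0; n < 16; n++)
    {
        uint16_t p = NibbleTo3Step(n);
        s_t0[n] = (uint16_t)(p << 4);
        s_t1a[n] = (uint16_t)(p >> 8);
        s_t1b[n] = (uint16_t)((p & 0x00ff) << 8);
        s_t2a[n] = (uint16_t)(p >> 4);
        s_t2b[n] = (uint16_t)((p & 0x000f) << 12);
        s_t3[n] = p;
    }
}

// encode an even number of source bytes into 16-bit words (3 words per 2 bytes)
void Encode3StepWords(const uint8_t* src, uint16_t* dst, size_t countBytes)
{
    for (size_t i = 0; i < countBytes; i += 2)
    {
        uint8_t b0 = *src++;
        uint8_t b1 = *src++;
        *dst++ = s_t0[b0 >> 4] | s_t1a[b0 & 0x0f];
        *dst++ = s_t1b[b0 & 0x0f] | s_t2a[b1 >> 4];
        *dst++ = s_t2b[b1 >> 4] | s_t3[b1 & 0x0f];
    }
}
```

This variant writes whole 16-bit words so no memcpy() is needed, at the cost of more table reads per source byte.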
So I migrated to the ESP32 (which, unlike the UNO, is not always on my desk) and performed a speed test using this sketch.
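(The test sketch itself isn't reproduced here, but a minimal timing harness along these lines is all that is needed; the buffer size, iteration count and the Encode3StepBuffer() call are placeholders for whichever encoder variant is being measured.)

```cpp
#include <Arduino.h>

// Hypothetical encoder under test; swap in whichever variant is being measured.
extern void Encode3StepBuffer(const uint8_t* src, uint8_t* dst, size_t count);

static const size_t PixelBytes = 1024;    // placeholder buffer size
static uint8_t s_source[PixelBytes];
static uint8_t s_encoded[PixelBytes * 3]; // 3-step output is 3x the input

void setup()
{
    Serial.begin(115200);

    // fill the source with something non-trivial so the tables get exercised
    for (size_t i = 0; i < PixelBytes; i++)
    {
        s_source[i] = (uint8_t)(i * 37);
    }

    const uint32_t iterations = 100;
    uint32_t start = micros();
    for (uint32_t i = 0; i < iterations; i++)
    {
        Encode3StepBuffer(s_source, s_encoded, PixelBytes);
    }
    uint32_t elapsed = micros() - start;

    Serial.print("average us per encode pass: ");
    Serial.println((float)elapsed / iterations);
}

void loop()
{
}
```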
And the results (using a 160 MHz clock rate):
Conclusion: the first attempt at 3-step encoding is marginally quicker than the second and is more than 5x as fast as the current method. With small pixel buffers it is less than twice as slow as the 4-step, and with large buffers it is almost as fast as the 4-step. The temporary memory demand is a bit higher than for the 4-step: 2 x 16 x 32-bit lookup tables (128 bytes) vs 16 x 16-bit (32 bytes), so 96 bytes more in lookup tables.
The quickest would of course be 256-entry lookup tables, but that seems excessive, wasting a whole KB on them.
Anyway, I thought I'd share it. I'll get the whole cloning and branching thing sorted soon.