Benchmark comparison between V1 and V2 #256

Closed
stevstrong opened this issue Feb 13, 2021 · 13 comments

@stevstrong

Hi,

I started to use V2 and wanted to compare its performance with V1 on an STM32F407 generic board using SPI2.

I ran the bench.ino example from each version with different buffer sizes (see the first column in the table below).

For V2 I set ENABLE_DEDICATED_SPI to 1 and SD_FAT_TYPE = 3.
Both versions use the same default value SD_SCK_MHZ(50).

In both versions I used the same card:

FreeStack: 62800
Type is FAT32
Card size: 7.95 GB (GB = 1E9 bytes)

Manufacturer ID: 0X73
OEM ID: BG
Product: NCard
Version: 1.0
Serial number: 0X26463014
Manufacturing date: 7/2012

File size 5 MB

And here are the results:

buffer|  |     write speed and latency     |  |    read speed and latency      |
size  |  |  speed  |   max |  min  |  avg  |  |  speed  |   max |  min  | avg  |
bytes |  |  KB/Sec |  usec |  usec |  usec |  | KB/Sec  |  usec |  usec | usec |
--------------------------------------------------------------------------------
V1
--------------------------------------------------------------------------------
  512 |  |  224.67 | 40954 |  1919 |  2277 |  | 1305.40 |  1474 |   381 |  391 |
 1024 |  |  445.12 | 44017 |  1433 |  2298 |  | 1677.01 |  1696 |   583 |  610 |
 2048 |  |  780.02 | 31450 |  1843 |  2621 |  | 1970.50 |  1763 |  1019 | 1038 |
 4096 |  | 1202.10 | 32252 |  2662 |  3400 |  | 2180.24 |  2599 |  1842 | 1878 |
 8192 |  | 1645.41 | 33917 |  4305 |  4964 |  | 2340.57 |  4243 |  3485 | 3500 |
--------------------------------------------------------------------------------
V2
--------------------------------------------------------------------------------
n*512 |  |  314.92 | 56916 | 15847 | 16266 |  |  323.44 | 16137 | 15835 | 15839|

Note that V2 performs the same regardless of buffer size, while V1 performs better as the buffer size grows.

V2 has better write speed only with a buffer size of 512 bytes.
At that buffer size, however, its read speed is very low: V1 reads about 4 times faster.

Honestly, I didn't expect such a large difference in performance.
In particular, for buffer sizes larger than 512 bytes, V1 clearly outperforms V2 for both write and read accesses.

Can you please explain how I can speed up V2 to reach the same read and write performance as V1 for larger buffer sizes?

Thank you in advance.

@greiman
Owner

greiman commented Feb 13, 2021

Which STM32 board package are you using?

@greiman
Owner

greiman commented Feb 13, 2021

I tested on a Teensy 4.1 SPI. The only difference is the SPI driver. I have not tested with the Roger Clark board package.

Here is Teensy 4.1 dedicated SPI. Buffer size doesn't matter much: about 5,080 KB/sec write and 5,190 KB/sec read.

Dedicated SPI mode.
size,write,read
bytes,KB/sec,KB/sec
512,5080.41,5183.81
1024,5087.64,5192.18
2048,5090.97,5196.36
4096,5072.95,5199.12
8192,5086.21,5199.13
16384,5086.54,5200.63
32768,5078.84,5206.49

Arduino Due does about 4,440 KB/sec write and 4,580 KB/sec read.

Here is the old driver you are probably using. Someone sent it to me but I no longer test with old F405 and F407 boards. The even slower ST package is becoming popular since it supports so many chips.

I plan to make the Teensy style driver an option on all boards. It uses the standard array transfer function transfer(buf, size). I need to copy to a temp array on send and fill the buffer on receive.

@stevstrong
Author

stevstrong commented Feb 14, 2021

I am using my fork of Roger Clark's core. I did not change the SPI driver between V1 and V2, and I made sure DMA was active all the time for both versions.
That means some other difference must be causing the different behavior.
Could you possibly give me some hints about where I should look, or which internal variables to monitor?

EDIT
I made sure that ENABLE_DEDICATED_SPI and DEDICATED_SPI are configured correctly: I printed m_sharedSpi and got 0.
Read/write accesses are done by sectors (in V2) and by blocks (in V1).

Focusing only on the 512-byte buffer case, it would be interesting to know what the SdFat library does between two consecutive cache reads: V1 waits only ~390 µs between consecutive reads, while V2 waits ~2200 µs.

@greiman
Owner

greiman commented Feb 14, 2021

I don't have a clue what you are doing. If you are trying to be clever and do raw write/read to the SD, forget it; V2 is not for you. It has tools to beat raw block writes. It can write a 64GB exFAT file as a single multi-block optimized write.

The ST board package has a very slow SPI driver but it does fairly well with 512 byte reads and writes. Here is output from the bench example.

I am using a NUCLEO-F446RE.

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1805.05,300,282,283
1803.75,301,282,283

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1818.18,285,280,281
1818.18,285,280,281

Notice, the max time for a read is 285 μs and the min is 280 μs. There is no 2200 μs between reads.

Here is a trace of timing for the ST transfer(buf, count). SCLK is 45 MHz. Notice the space between the two bytes.

[Image: F446REtwo]

Here is the test sequence.

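#include <SPI.h>
const uint8_t CS_PIN = 10;  // Assumption: the post does not say which pin is used.

void setup() {
  pinMode(CS_PIN, OUTPUT);
  digitalWrite(CS_PIN, HIGH);  // deselect the device
  SPI.begin();
}
void loop() {
  uint8_t buf[] = {0X55, 0XAA};
  // 50 MHz is requested; per the trace, SCLK actually runs at 45 MHz.
  SPI.beginTransaction(SPISettings(50000000, MSBFIRST, SPI_MODE0));
  digitalWrite(CS_PIN, LOW);
  SPI.transfer(buf, 2);  // in-place array transfer
  digitalWrite(CS_PIN, HIGH);
  SPI.endTransaction();
  delay(1);
}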

There is 372 ns between bytes. A byte at 45 MHz takes 178 ns. If there were DMA with no space between bytes, the rate would be more like 5 MB/sec.
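
The arithmetic behind these numbers can be sketched as follows. This is a rough model, not part of SdFat: it assumes the 372 ns is the idle gap that follows each 178 ns byte, so one byte occupies about 550 ns of bus time. The function names are made up for illustration.

```cpp
#include <cassert>

// Time to clock out one 8-bit byte at the given SCLK frequency, in ns.
inline double byteTimeNs(double sclkHz) {
  assert(sclkHz > 0);
  return 8.0 * 1e9 / sclkHz;
}

// Bytes per second when each byte is followed by an idle gap of gapNs.
// Assumption: the gap is added on top of the byte time.
inline double bytesPerSec(double sclkHz, double gapNs) {
  return 1e9 / (byteTimeNs(sclkHz) + gapNs);
}

// Time to move one 512-byte sector, in microseconds.
inline double sectorTimeUs(double sclkHz, double gapNs) {
  return 512.0 * (byteTimeNs(sclkHz) + gapNs) / 1000.0;
}
```

Under that assumption, byteTimeNs(45e6) is about 178 ns, bytesPerSec(45e6, 372) is about 1.82 MB/sec, and sectorTimeUs(45e6, 372) is about 281 μs, which lines up with the 1818 KB/sec and 281 μs average in the bench output above. With no gap, bytesPerSec(45e6, 0) is 5.625 MB/sec.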

@stevstrong
Author

I eventually found the bug. While merging your master into my repo, some lines responsible for setting the SPI clock correctly got commented out.

After applying the fix I got these values:

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
2085.07,42775,206,245
2079.00,45520,206,245

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
2454.59,509,207,208
2458.21,509,207,207

This is an amazing increase in performance when compared to V1 using a very large buffer.

Sorry for causing any trouble, but at least I could confirm that V2 brings a real speed improvement.

So thank you a lot for your effort.
I have been using this very nice library for a very long time, and I will use V2 from now on. :)

@greiman
Owner

greiman commented Feb 14, 2021

What are you using for an SD card? Your card has a max write latency of over 40 ms. Mine has about 300 μs with the slow ST driver.

@stevstrong
Author

stevstrong commented Feb 14, 2021

I use the card from my first post, but I tried other cards as well and got similar results.
None of my cards perform better than 2469 KB/sec read speed.
Do you think there is still something wrong in my fork?

@stevstrong
Author

stevstrong commented Feb 14, 2021

I just realized I was using SPI2 all the time, which has a maximum clock of 22.5 MHz.
I have now tried SPI1 (45 MHz) and the results are much better:

Type is FAT32
Card size: 7.95 GB (GB = 1E9 bytes)

Manufacturer ID: 0X73
OEM ID: BG
Product: NCard
Version: 1.0
Serial number: 0X26463014
Manufacturing date: 7/2012

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3501.40,78707,107,145
3885.00,98004,107,131

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
4712.54,408,107,108
4721.44,409,107,107

I think I can live with these values :)

@greiman
Owner

greiman commented Feb 14, 2021

It's almost certainly the card. Modern cards are more complex than the F407 you are using. Phones recording video have totally changed the technology. Consumers spend almost $10B per year on fast SD cards.

The flash Allocation Unit (AU) in many modern cards is now over 4MB. The Record Unit (RU) for a class 10 card is 512KB. Performance depends on how fragmented an AU and its RUs become. At some point a card may need to copy and reprogram flash. For high end cards this can overlap writes, since the card has a number of large RAM buffers.

The other problem is that cards try to discover how you are using the card and use buffering and caching to optimize performance. The read performance of an area on a card depends on how it was written.

Here is how a modern card manages flash:
[Image: AU]

This is how performance degrades as flash is fragmented.
[Image: RU]

At some point the card has to copy the fragmented AU to a new AU. Older cards can't overlap this operation, so huge latencies occur.

Don't think you can guess what card you need for a given app. Every manufacturer shoots for being best for phone uses with modern SD controllers, not people with old Cortex mpus using SPI. Too bad STM32 only supported the V 2.00 standard, September 25, 2006 until recently. The STM32H7 attempts to support the January 22, 2013 standard. I bought an early H7 Nucleo but it was too buggy to work with. I just bought two V2 NUCLEO-H743ZI2 boards but STM32Cube was unusable for the SDMMC.

@stevstrong
Author

I think the 4.7 MB/s read speed is limited by the hardware: the SPI clock is 45 MHz (45 Mb/s, or about 5.6 MB/s before command and token overhead).
So a faster card could not perform much better than this on this board anyway.
I will keep using my old cards; my current application does not require higher speeds, so everything is fine.
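
Dividing the SPI clock by eight bits per byte gives the raw ceiling for each port, which can be compared against the bench results posted in this thread. This is a minimal sketch with illustrative function names; it assumes the bench reports KB as 1000 bytes and ignores command, token, and CRC overhead.

```cpp
#include <cassert>

// Upper bound on SPI payload rate in KB/s (KB = 1000 bytes, an assumption):
// one bit per SCLK cycle, eight bits per byte.
inline double spiCeilingKBps(double sclkHz) {
  assert(sclkHz > 0);
  return sclkHz / 8.0 / 1000.0;
}

// Fraction of the raw ceiling that a measured rate achieves.
inline double busEfficiency(double measuredKBps, double sclkHz) {
  return measuredKBps / spiCeilingKBps(sclkHz);
}
```

For SPI2 at 22.5 MHz the ceiling is 2812.5 KB/s, so the 2458.21 KB/s read result is about 87% of it. For SPI1 at 45 MHz the ceiling is 5625 KB/s, so 4721.44 KB/s is about 84%. In both cases the bus, not the card, accounts for most of the limit.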

@greiman
Owner

greiman commented Feb 14, 2021

Still has bad latency.

I wrote a ring buffer that is integrated with SdFat and made sure isBusy() works for preallocated files. This program works with fairly long latency SDs and allows a novice to write a fast logger with a simple loop. I suspect your F407 could log reliably at more than 5,000 samples per second, since isBusy() will ensure a 512 byte write takes no longer than 110 μs. Here is the example for Teensy 4.1. It can log at 25,000 samples per second with my SDIO driver, but it would be trivial to convert to SPI.
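
A rough way to size such a ring buffer is to ask how many bytes the logger produces while the card is busy. The sketch below is an illustration, not code from SdFat; the 12 bytes per CSV line is an assumption, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes produced while the card is busy for stallUs microseconds,
// rounded up to a whole byte.
inline uint64_t bytesDuringStall(uint32_t samplesPerSec,
                                 uint32_t bytesPerSample,
                                 uint32_t stallUs) {
  assert(bytesPerSample > 0);
  const uint64_t produced =
      uint64_t(samplesPerSec) * bytesPerSample * stallUs;
  return (produced + 999999) / 1000000;
}

// Round a byte count up to a whole number of 512-byte sectors.
inline uint64_t roundUpToSectors(uint64_t bytes) {
  return (bytes + 511) / 512 * 512;
}
```

With 5,000 samples per second, the assumed 12 bytes per line, and the 98,004 μs worst-case write latency reported earlier in this thread, bytesDuringStall(5000, 12, 98004) gives 5881 bytes, which roundUpToSectors raises to 6144 bytes (12 sectors). A real logger would want some margin on top of that.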

This code writes a 512 byte block when the SD is not busy:

    size_t n = rb.bytesUsed();
    if (n >= 512 && !file.isBusy()) {
      // Not busy only allows one sector before possible busy wait.
      // Write one sector from RingBuf to file.
      if (512 != rb.writeOut(512)) {
        Serial.println("writeOut failed");
        break;
      }
    }

This code prints the data to be logged into the ring buffer as csv.

    // Read ADC0 - about 17 usec on Teensy 4, Teensy 3.6 is faster.
    uint16_t adc = analogRead(0);
    // Print spareMicros into the RingBuf as test data.
    rb.print(spareMicros);
    rb.write(',');
    // Print adc into RingBuf.
    rb.println(adc);
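
The gating idea in the first snippet (flush one sector only when at least 512 bytes are queued and the card is not busy) can be tried off-target with stand-ins. ByteRing and FakeCard below are hypothetical host-side types written for this illustration; they are not SdFat's RingBuf or file classes, and the three-poll busy period is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimal single-threaded byte ring buffer (illustrative only).
template <size_t N>
class ByteRing {
 public:
  size_t bytesUsed() const { return count_; }

  // Append up to n bytes; returns the number actually stored.
  size_t write(const uint8_t* src, size_t n) {
    size_t stored = 0;
    while (stored < n && count_ < N) {
      buf_[head_] = src[stored++];
      head_ = (head_ + 1) % N;
      ++count_;
    }
    return stored;
  }

  // Remove up to n bytes into dst; returns the number actually removed.
  size_t read(uint8_t* dst, size_t n) {
    size_t taken = 0;
    while (taken < n && count_ > 0) {
      dst[taken++] = buf_[tail_];
      tail_ = (tail_ + 1) % N;
      --count_;
    }
    return taken;
  }

 private:
  uint8_t buf_[N] = {};
  size_t head_ = 0, tail_ = 0, count_ = 0;
};

// Stand-in for a file on an SD card: after each sector write it
// reports busy for the next three polls (arbitrary choice).
struct FakeCard {
  int busyPolls = 0;
  size_t sectorsWritten = 0;
  bool isBusy() {
    if (busyPolls > 0) { --busyPolls; return true; }
    return false;
  }
  void writeSector(const uint8_t*) { ++sectorsWritten; busyPolls = 3; }
};

// One pass of the logger loop: flush a sector only when enough data
// is queued and the card is idle. Returns true if a sector was written.
template <size_t N>
bool flushOneSector(ByteRing<N>& rb, FakeCard& card) {
  if (rb.bytesUsed() < 512 || card.isBusy()) return false;
  uint8_t sector[512];
  const size_t got = rb.read(sector, sizeof(sector));
  assert(got == sizeof(sector));
  (void)got;
  card.writeSector(sector);
  return true;
}
```

Because flushOneSector returns immediately when the card is busy, the sampling loop keeps running and new data simply accumulates in the ring until the card is ready again.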

I have a DMA adc example using the ring buffer in an ISR that does 3 million samples per second on Teensy. I will also make a SPI example soon.

Amazing to write an 8GB file with three samples per μs.

[Image: TeensySqr]

@stevstrong
Author

I handled the latency with 2 bluepills: one recorded 8 channels at 44kHz and shifted the values over SPI to a second bluepill, which stored the received data in a contiguous file (using double buffered DMA). Here is the project (host & slave); it was a long time ago.

@greiman
Owner

greiman commented Feb 15, 2021

Amazingly complex and not very flexible. You might want to look at this 8-channel 16-bit 44.1 kHz system. It uses one Teensy and users can design the audio system with graphical programming.

Paul sent me a big box of his audio hardware and I developed a Teensy driver that can push 512 bytes into the SDMMC controller in 5μs then return to overlap other I/O. No way this could be done with your architecture.

It can use various devices for recording, play, mixing,... Check out the devices on the left side.

One key idea of V2 was to cope with write latency without complex architecture.

I realize complex architecture is often required for performance. I spent my entire career designing parts of the world's largest data acquisition systems. My last project was a network for Atlas at CERN. The network collects 100GB/sec from 100 million data channels. It is now being upgraded to use 3,000 10Gbit Ethernet links in the Clos architecture I used.

erichelgeson added a commit to erichelgeson/BlueSCSI that referenced this issue Dec 3, 2022
See: greiman/SdFat#256 (comment)
- performance can be impacted if the file is fragmented on the SD card.