Benchmark comparison between V1 and V2 #256
Which STM32 board package are you using?
I tested with SPI on a Teensy 4.1; the only difference is the SPI driver. I have not tested with the Roger Clark board package. Here are the Teensy 4.1 results with dedicated SPI. Buffer size doesn't matter much: about 5,080 KB/sec write and 5,190 KB/sec read.
Arduino Due does about 4,440 KB/sec write and 4,580 KB/sec read. Here is the old driver you are probably using. Someone sent it to me, but I no longer test with old F405 and F407 boards. The even slower ST package is becoming popular since it supports so many chips. I plan to make the Teensy-style driver an option on all boards. It uses the standard array transfer function transfer(buf, size), which overwrites the buffer in place, so I need to copy to a temp array on send and fill the buffer on receive.
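A minimal sketch of what "copy to a temp array on send, fill the buffer on receive" can look like, assuming the only bulk primitive is an in-place transfer(buf, count) like Arduino's SPIClass::transfer(void*, size_t), which replaces each sent byte with the byte clocked in. MockSpi, spiSend, spiReceive and the 512-byte temp size are hypothetical names for illustration, not SdFat's driver. Receive fills with 0xFF because an SD card in SPI mode expects the host to keep MOSI high while it reads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an SPI port with only an in-place bulk transfer:
// every byte of buf is sent, then overwritten with the byte received.
struct MockSpi {
  std::vector<uint8_t> sent;  // bytes clocked out on MOSI
  uint8_t reply = 0x00;       // byte the fake card returns on MISO

  void transfer(void* buf, size_t count) {
    uint8_t* p = static_cast<uint8_t*>(buf);
    for (size_t i = 0; i < count; i++) {
      sent.push_back(p[i]);
      p[i] = reply;  // in-place: the transmit data is destroyed
    }
  }
};

constexpr size_t kTempSize = 512;  // assumed temp array size (one SD block)

// Send: copy through a temp array so the caller's const data survives.
void spiSend(MockSpi& spi, const uint8_t* src, size_t n) {
  assert(src != nullptr || n == 0);
  uint8_t tmp[kTempSize];
  while (n > 0) {
    size_t chunk = n < kTempSize ? n : kTempSize;
    std::memcpy(tmp, src, chunk);
    spi.transfer(tmp, chunk);
    src += chunk;
    n -= chunk;
  }
}

// Receive: pre-fill with 0xFF (MOSI high), then let transfer overwrite it.
void spiReceive(MockSpi& spi, uint8_t* dst, size_t n) {
  std::memset(dst, 0xFF, n);
  spi.transfer(dst, n);
}
```

The extra copy on send costs a memcpy per block, which is the price of reusing a portable in-place API instead of a board-specific DMA path.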
I am using my fork of Roger Clark's core. I did not change the SPI driver between V1 and V2, and I made sure DMA was active throughout for both versions. EDIT: Focusing only on the 512-byte buffer case, it would be interesting to know what the SdFat library does between two consecutive cache reads: V1 waits only ~390 µs between consecutive reads, while V2 waits ~2200 µs.
I don't have a clue what you are doing. If you are trying to be clever and do raw writes/reads to the SD, forget it; V2 is not for you. It has tools to beat raw block writes: it can write a 64GB exFAT file as a single optimized multi-block write. The ST board package has a very slow SPI driver, but it does fairly well with 512-byte reads and writes. Here is the output from the bench example, using a NUCLEO-F446RE.
Notice that the max time for a read is 285 μs and the min is 280 μs; there is no 2200 μs gap between reads. Here is a timing trace for the ST transfer(buf, count) with SCLK at 45 MHz. Notice the space between two bytes. Here is the test sequence.
There is 372 ns between bytes, while a byte at 45 MHz takes 178 ns. With DMA and no space between bytes, the rate would be more like 5 MB/sec.
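A small calculation makes the cost of the inter-byte gap concrete. It reads the 372 ns as idle time after each byte, which is an assumption on our part, and the function names are hypothetical.

```cpp
#include <cassert>

// Time to shift one byte (8 bits) at the given SCLK frequency, in ns.
double byteTimeNs(double sclkHz) { return 8.0 * 1e9 / sclkHz; }

// Sustained rate in bytes/s when each byte is followed by an idle gap.
double bytesPerSec(double sclkHz, double gapNs) {
  assert(sclkHz > 0 && gapNs >= 0);
  return 1e9 / (byteTimeNs(sclkHz) + gapNs);
}

// Time to move one 512-byte block, in microseconds.
double blockTimeUs(double sclkHz, double gapNs) {
  return 512.0 * (byteTimeNs(sclkHz) + gapNs) / 1000.0;
}
```

With 45 MHz and a 372 ns gap this works out to about 1.82 MB/s, or roughly 281 µs per 512-byte block, which lines up with the 280 to 285 µs read times above. With no gap the same clock gives 5.625 MB/s, about 91 µs per block.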
I eventually found that the bug was introduced while merging your master into my repo: some lines responsible for setting the SPI clock correctly had been commented out. After applying the fix I got these values:
This is an amazing increase in performance compared to V1 with a very large buffer. Sorry for causing any trouble, but at least I could confirm that V2 brings a real speed improvement. Thank you very much for your effort.
What are you using for an SD card? Your card has a max write latency of over 40 ms; mine has about 300 μs with the slow ST driver.
I use the card from my first post, but I tried other cards as well and got similar results.
I just realized I had been using SPI2 all along, which has a maximum clock of 22.5 MHz.
I think I can live with these values :)
I think the 4.7 MB/s read speed is limited by the hardware: an SPI clock of 45 MHz gives 45 Mbit/s, or about 5.6 MB/s at most.
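The ceiling can be checked with simple arithmetic, treating 1 MB as 10^6 bytes and ignoring command overhead and inter-byte gaps:

$$
\frac{45 \times 10^{6}\ \text{bit/s}}{8\ \text{bit/byte}} = 5.625\ \text{MB/s},
\qquad
\frac{4.7\ \text{MB/s}}{5.625\ \text{MB/s}} \approx 0.84
$$

On the same basis, SPI2's 22.5 MHz limit caps it at

$$
\frac{22.5 \times 10^{6}\ \text{bit/s}}{8\ \text{bit/byte}} \approx 2.81\ \text{MB/s}.
$$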
Still has bad latency. I wrote a ring buffer that is integrated with SdFat and made sure isBusy() works for preallocated files. This program works with SDs that have fairly long latency and allows a novice to write a fast logger with a simple loop. I suspect your F407 could log reliably at more than 5,000 samples per second, since isBusy() will ensure a 512-byte write takes no longer than 110 μs. Here is the example for Teensy 4.1; it can log 25,000 samples per second with my SDIO driver, and it would be trivial to convert to SPI. This code writes a 512-byte block when the SD is not busy:
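The listing referred to above is not included in this thread. As a rough, hypothetical sketch of the same idea (not SdFat's ring buffer class or its isBusy()), the class below buffers bytes and hands exactly one 512-byte block to a simulated card only when a busy flag is clear, so the logging loop never waits on the card.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kBlock = 512;

// Hypothetical fixed-capacity byte ring buffer.
class ByteRing {
 public:
  explicit ByteRing(size_t capacity) : buf_(capacity) {
    assert(capacity > 0 && capacity % kBlock == 0);  // whole blocks only
  }
  size_t used() const { return used_; }

  // Producer side: returns false (sample dropped) if the buffer is full.
  bool push(uint8_t b) {
    if (used_ == buf_.size()) return false;
    buf_[(head_ + used_) % buf_.size()] = b;
    used_++;
    return true;
  }

  // Consumer side: copy one 512-byte block to the "card" only if it is
  // idle and a full block is buffered; otherwise return immediately.
  bool writeBlockIfIdle(bool cardBusy, std::vector<uint8_t>& card) {
    if (cardBusy || used_ < kBlock) return false;
    for (size_t i = 0; i < kBlock; i++) {
      card.push_back(buf_[head_]);
      head_ = (head_ + 1) % buf_.size();
    }
    used_ -= kBlock;
    return true;
  }

 private:
  std::vector<uint8_t> buf_;
  size_t head_ = 0;  // index of the oldest buffered byte
  size_t used_ = 0;  // number of buffered bytes
};
```

In a logger loop the acquisition side calls push() for each sample byte, and each pass also calls writeBlockIfIdle() with the card's busy status; the buffer capacity is what absorbs the card's worst-case latency.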
This code prints the data to be logged into the ring buffer as csv.
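That listing is also not reproduced here. A hypothetical sketch of formatting one record as CSV text before it goes into a buffer might look like this; a std::string stands in for the ring buffer, and appendCsvRecord is an illustrative name, not an SdFat function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Append one CSV record "time,v0,v1,...\n" to out; returns bytes appended.
size_t appendCsvRecord(std::string& out, uint32_t timeUs,
                       const int16_t* values, size_t count) {
  assert(values != nullptr || count == 0);
  const size_t before = out.size();
  char field[16];  // fits any uint32_t or ",int16_t" in decimal
  std::snprintf(field, sizeof field, "%lu",
                static_cast<unsigned long>(timeUs));
  out += field;
  for (size_t i = 0; i < count; i++) {
    std::snprintf(field, sizeof field, ",%d", static_cast<int>(values[i]));
    out += field;
  }
  out += '\n';
  return out.size() - before;
}
```

Text records vary in length, so writing the buffer out in fixed 512-byte blocks, rather than one record at a time, keeps record boundaries independent of block boundaries.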
I have a DMA ADC example using the ring buffer in an ISR that does 3 million samples per second on Teensy. I will also make an SPI example soon. Amazing to write an 8GB file with three samples per μs.
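To put three samples per μs in perspective, here is a back-of-the-envelope budget. It assumes 16-bit (2-byte) samples and an 8GB file of 8 × 10^9 bytes; neither figure is stated in the thread, and loggingBudget is a hypothetical helper.

```cpp
#include <cassert>

struct Budget {
  double bytesPerSec;    // sustained data rate
  double usPerBlock;     // time for the stream to fill one 512-byte block
  double secondsToFill;  // time to write the whole file
};

Budget loggingBudget(double samplesPerSec, double bytesPerSample,
                     double fileBytes) {
  assert(samplesPerSec > 0 && bytesPerSample > 0);
  Budget b;
  b.bytesPerSec = samplesPerSec * bytesPerSample;
  b.usPerBlock = 512.0 / b.bytesPerSec * 1e6;
  b.secondsToFill = fileBytes / b.bytesPerSec;
  return b;
}
```

Under those assumptions, loggingBudget(3e6, 2.0, 8e9) gives 6 MB/s, a new 512-byte block about every 85 µs, and roughly 1,333 s (about 22 minutes) to fill the file.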
I handled the latency with two Blue Pills: one recorded 8 channels at 44 kHz and shifted the values over SPI to a second Blue Pill, which stored the received data in a contiguous file (using double-buffered DMA). Here is the project (host & slave); it was a long time ago.
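Double buffering of the kind used on the receiving Blue Pill can be illustrated with a ping-pong buffer. This is a hypothetical, single-threaded sketch: on the real hardware the producer would be DMA half- and full-transfer interrupts, which would also need volatile or atomic flags that this version leaves out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical ping-pong buffer: the producer fills one half while the
// consumer drains the other, then they swap.
template <size_t N>
class PingPong {
 public:
  // Producer: store one byte. Returns false if the half it must fill has
  // not been drained yet (the consumer fell behind).
  bool put(uint8_t b) {
    if (ready_[fill_]) return false;
    assert(pos_ < N);
    buf_[fill_][pos_++] = b;
    if (pos_ == N) {  // half complete: hand it over and switch halves
      ready_[fill_] = true;
      fill_ ^= 1;
      pos_ = 0;
    }
    return true;
  }

  // Consumer: copy out the next completed half, in fill order.
  bool take(std::vector<uint8_t>& sink) {
    if (!ready_[drain_]) return false;
    sink.insert(sink.end(), buf_[drain_].begin(), buf_[drain_].end());
    ready_[drain_] = false;
    drain_ ^= 1;
    return true;
  }

 private:
  std::array<std::array<uint8_t, N>, 2> buf_{};
  bool ready_[2] = {false, false};
  size_t fill_ = 0, pos_ = 0, drain_ = 0;
};
```

The consumer has one full half-period to write each half, which is the latency budget this design relies on.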
Amazingly complex and not very flexible. You might want to look at this 8-channel 16-bit 44.1 kHz system. It uses one Teensy, and users can design the audio system with graphical programming. Paul sent me a big box of his audio hardware, and I developed a Teensy driver that can push 512 bytes into the SDMMC controller in 5 μs and then return to overlap other I/O. No way this could be done with your architecture. It can use various devices for recording, playback, mixing, and so on; check out the devices on the left side. One key idea of V2 was to cope with write latency without a complex architecture. I realize complex architecture is often required for performance. I spent my entire career designing parts of the world's largest data acquisition systems. My last project was a network for Atlas at CERN. The network collects 100 GB/sec from 100 million data channels. It is now being upgraded to use 3,000 10 Gbit Ethernet links in the Clos architecture I used.
See: greiman/SdFat#256 (comment) - performance can be impacted if the file is fragmented on the SD card.
Hi,
I started to use V2 and wanted to check its performance compared to V1 on a generic STM32F407 board using SPI2.
The test was made with bench.ino from the respective versions, for different buffer sizes (see the first column in the table below). For V2 I set ENABLE_DEDICATED_SPI to 1 and SD_FAT_TYPE = 3. Both versions use the same default value, SD_SCK_MHZ(50). I used the same card in both versions.
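For reference, the V2 settings described here map onto the SdFat 2.x bench.ino setup roughly as follows. This is a sketch assuming that example's layout (ENABLE_DEDICATED_SPI set in SdFatConfig.h, SD_CS_PIN = SS as a board-dependent assumption), not the poster's actual sketch; selecting SPI2 is core-specific and not shown.

```cpp
// V2 settings from the post, in the style of SdFat 2.x's bench.ino.
// Assumed: ENABLE_DEDICATED_SPI is set to 1 in SdFatConfig.h.
#include "SdFat.h"

const uint8_t SD_CS_PIN = SS;     // assumed: board's default chip select
#define SPI_CLOCK SD_SCK_MHZ(50)  // requested maximum; the actual SCLK
                                  // depends on the core and SPI peripheral
#define SD_CONFIG SdSpiConfig(SD_CS_PIN, DEDICATED_SPI, SPI_CLOCK)

// SD_FAT_TYPE = 3 in bench.ino selects these types (FAT16/FAT32 and exFAT).
SdFs sd;
FsFile file;

void setup() {
  Serial.begin(9600);
  if (!sd.begin(SD_CONFIG)) {
    sd.initErrorHalt(&Serial);  // print the init error and halt
  }
}

void loop() {}
```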
And here are the results:
It is important to note that V2 has the same performance regardless of buffer size, while V1 performs better as the buffer size grows.
V2 has better write speed only with a 512-byte buffer.
For this buffer size, however, V2's read speed is very low, and V1 is almost 5 times faster.
Honestly, I didn't expect such a large difference in performance.
In particular, for buffers larger than 512 bytes, V1 clearly outperforms V2 for both writes and reads.
Can you please explain how I can speed up V2 to match V1's read and write performance for larger buffer sizes?
Thank you in advance.