Multi-threaded .xlsx row loading #346
+1 on making loading faster. In an application I'm working on, I have a benchmark for loading 1m rows and then writing it to a file.
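For reference, a minimal version of that kind of load benchmark might look like the sketch below. It assumes a recent calamine where `worksheet_range` returns a `Result` (older versions returned an `Option<Result<…>>`); the file name and sheet name are placeholders.

```rust
use calamine::{open_workbook, Reader, Xlsx};
use std::time::Instant;

fn main() {
    let start = Instant::now();
    // Placeholder path: a workbook with ~1m rows on "Sheet1".
    let mut workbook: Xlsx<_> = open_workbook("1m_rows.xlsx").expect("cannot open workbook");
    let range = workbook
        .worksheet_range("Sheet1")
        .expect("cannot read sheet");
    println!(
        "loaded {} rows in {:?}",
        range.rows().count(),
        start.elapsed()
    );
}
```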
From the doc comment on the relevant ZIP-reading function:

```rust
/// Reads the given file from the ZIP archive.
///
/// Since each file in a ZIP archive is compressed independently,
/// multiple files can be read in parallel.
```

Which is what is needed, as the structure of an .xlsx file is:

```
./
├── [Content_Types].xml
├── _rels/
│   └── .rels
├── docProps/
│   ├── app.xml
│   └── core.xml
└── xl/
    ├── _rels/
    │   └── workbook.xml.rels
    ├── sharedStrings.xml
    ├── styles.xml
    ├── theme/
    │   ├── theme.xml
    │   └── theme1.xml
    ├── workbook.xml
    └── worksheets/
        └── sheet1.xml
```

Generally, when you get a file handle/descriptor, you have to deal with the OS changing its internal index of where it's at while reading, which is why Rust requires exclusive (`&mut`) access to read from one. The biggest gains would be from reading the large archive entries, the worksheets, in parallel. Perhaps it might be possible to instead store the underlying data so that each reader gets its own stream. Having separate streams will also help with this issue:
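As a rough sketch of what independent streams could look like (not calamine's actual code; it assumes the `zip` crate, reads the whole archive into memory up front, and uses a placeholder book.xlsx), each thread opens its own `ZipArchive` view over the same bytes and decompresses its entry independently:

```rust
use std::io::{Cursor, Read};
use std::thread;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read the whole .xlsx (a ZIP archive) into memory once, so every
    // thread can seek independently without fighting over one file offset.
    let bytes = std::fs::read("book.xlsx")?;

    let entries = ["xl/sharedStrings.xml", "xl/worksheets/sheet1.xml"];
    let results: Vec<_> = thread::scope(|s| {
        let handles: Vec<_> = entries
            .iter()
            .map(|name| {
                let bytes = &bytes;
                s.spawn(move || -> Result<Vec<u8>, zip::result::ZipError> {
                    // Each thread gets its own archive view (Cursor is Read + Seek),
                    // so entries are located and decompressed independently.
                    let mut zip = zip::ZipArchive::new(Cursor::new(&bytes[..]))?;
                    let mut file = zip.by_name(name)?;
                    let mut buf = Vec::new();
                    file.read_to_end(&mut buf)?;
                    Ok(buf)
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    for (name, r) in entries.iter().zip(&results) {
        println!("{name}: {} bytes", r.as_ref().map(|b| b.len()).unwrap_or(0));
    }
    Ok(())
}
```

Opening the file separately on each thread would avoid holding the whole archive in memory, at the cost of each thread re-reading the central directory.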
(calamine, lines 778 to 798 at commit 09c25ba)
The reading functions can probably be changed to return the data instead, and the caller can then populate the struct with that data, rather than relying on mutable changes to update it. Inside those functions they could spawn a thread and return the join handle, so the main thread only blocks until the last one is finished.
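A rough sketch of that shape (all names here are hypothetical stand-ins, not calamine's internals): each reader spawns a thread that produces its parsed data, and the caller joins the handles to assemble the struct.

```rust
use std::thread::{self, JoinHandle};

// Hypothetical stand-ins for the parsed parts of a workbook.
struct SharedStrings(Vec<String>);
struct Styles(Vec<String>);

struct Workbook {
    strings: SharedStrings,
    styles: Styles,
}

// Instead of `fn read_shared_strings(&mut self)`, return the data
// via a join handle so several reads can be in flight at once.
fn read_shared_strings() -> JoinHandle<SharedStrings> {
    thread::spawn(|| {
        // ... decompress and parse xl/sharedStrings.xml here ...
        SharedStrings(vec![])
    })
}

fn read_styles() -> JoinHandle<Styles> {
    thread::spawn(|| {
        // ... decompress and parse xl/styles.xml here ...
        Styles(vec![])
    })
}

fn open_workbook() -> Workbook {
    // Spawn all reads first, then block only on the slowest one.
    let strings = read_shared_strings();
    let styles = read_styles();
    Workbook {
        strings: strings.join().expect("shared strings thread panicked"),
        styles: styles.join().expect("styles thread panicked"),
    }
}

fn main() {
    let wb = open_workbook();
    println!(
        "{} shared strings, {} styles",
        wb.strings.0.len(),
        wb.styles.0.len()
    );
}
```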
I ran some benchmarks on different Rust XML parsers to see if there is a faster implementation, and the two likely candidates were xmlparser and quick-xml:

```
test huge_xmlparser    ... bench:   1,672,879 ns/iter (+/- 20,140)
test huge_quick_xml    ... bench:   2,396,037 ns/iter (+/- 39,752)
test huge_xmlrs        ... bench:  36,258,312 ns/iter (+/- 180,438)
test large_xmlparser   ... bench:     730,787 ns/iter (+/- 22,924)
test large_quick_xml   ... bench:   1,250,053 ns/iter (+/- 21,943)
test large_xmlrs       ... bench:  11,239,516 ns/iter (+/- 76,937)
test medium_quick_xml  ... bench:     206,232 ns/iter (+/- 2,157)
test medium_xmlparser  ... bench:     240,737 ns/iter (+/- 4,531)
test medium_xmlrs      ... bench:   3,975,916 ns/iter (+/- 44,967)
test tiny_xmlparser    ... bench:       1,078 ns/iter (+/- 17)
test tiny_quick_xml    ... bench:       2,233 ns/iter (+/- 70)
test tiny_xmlrs        ... bench:      17,155 ns/iter (+/- 429)
```

Looking at this, I decided to try out my own benchmark using the data from the qsv benchmark. I saved it as an .xlsx, extracted the worksheet XML, and ran both parsers over it:

```rust
use quick_xml::events::Event;
use quick_xml::reader::Reader;

// include file in binary to prevent IO mucking with bench
static XML: &str = include_str!(r"..\test\sheet1.xml");

fn main() {
    let which = std::env::args().nth(1).expect("must provide parser to run");
    match which.as_str() {
        "xmlparser" => xmlparser(),
        "quickxml" => quick_xml(),
        _ => {
            panic!("didn't provide either `xmlparser` or `quickxml` but instead provided `{which}`")
        }
    }
}

fn xmlparser() {
    let mut count = 0;
    for _token in xmlparser::Tokenizer::from(XML) {
        count += 1;
    }
    println!("tokens = {count}");
}

fn quick_xml() {
    let mut reader = Reader::from_str(XML);
    reader.trim_text(true);
    let mut texts = 0;
    let mut starts = 0;
    let mut buf = Vec::with_capacity(1024);
    // The `Reader` does not implement `Iterator` because it outputs borrowed data (`Cow`s)
    loop {
        // NOTE: this is the generic case when we don't know about the input BufRead.
        // When the input is a &str or a &[u8], we don't actually need another
        // buffer; we could directly call `reader.read_event()`.
        match reader.read_event_into(&mut buf) {
            Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
            // exits the loop when reaching end of file
            Ok(Event::Eof) => break,
            Ok(Event::Start(_e)) => starts += 1,
            Ok(Event::Text(_e)) => texts += 1,
            // There are several other `Event`s we do not consider here
            _ => (),
        }
        // if we don't keep a borrow elsewhere, we can clear the buffer to keep memory usage low
        buf.clear();
    }
    println!("Starts: {starts}\nTexts: {texts}");
}
```

This is on a release build:

```
hyperfine --warmup 3 'target\release\xml_parser_compare.exe xmlparser' 'target\release\xml_parser_compare.exe quickxml'
Benchmark 1: target\release\xml_parser_compare.exe xmlparser
  Time (mean ± σ):      6.407 s ±  0.022 s    [User: 6.033 s, System: 0.372 s]
  Range (min … max):    6.376 s …  6.439 s    10 runs

Benchmark 2: target\release\xml_parser_compare.exe quickxml
  Time (mean ± σ):      5.854 s ±  0.017 s    [User: 5.525 s, System: 0.339 s]
  Range (min … max):    5.817 s …  5.871 s    10 runs

Summary
  target\release\xml_parser_compare.exe quickxml ran
    1.09 ± 0.00 times faster than target\release\xml_parser_compare.exe xmlparser
```

So, contrary to the micro-benchmarks above, quick-xml came out ahead here. Memory usage for both was very low, so I don't think that is a consideration to benchmark.

Considering that jqnatividad's test took 10 seconds to read, there are about 4 seconds of overhead happening outside of parsing the worksheet file, which is really the only area that would be actionable here. Some time is going into the other reads as well.

In the most general of use cases, the sheet is lazily read: it doesn't read the other worksheets, only the one specified. This forces a linear read of the archive.

To make the reads, and therefore the parsing, concurrent, it might be best to read all files at the start, so that the total time is only the time it takes to read the longest file. This comes at the cost of memory, however, as all the data would be held in memory at once. That could prove an issue for very large datasets and cause OOM errors. The in-memory representation of the parsed data won't have the same overhead as the XML file itself, but if someone has 50 sheets with a million rows and any decent number of columns, then it could become a very real concern.

As for parallel parsing of a single XML file, I found an excellent dissertation on a technique for doing just that, mainly by breaking the file into chunks and then parsing those chunks on other threads. The library it describes, PXML, is written in C++, but I can't find any source code for it, and implementing the technique in Rust would be outside the scope of this library. But perhaps there could be a way to hack something similar together.
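As a very rough illustration of the chunking idea (not PXML's actual algorithm; this just leans on the fact that worksheet XML is a flat list of `<row>` elements, and the file name and chunk count are made up), one could split the sheet at row boundaries and count cells on separate threads:

```rust
use quick_xml::events::Event;
use quick_xml::reader::Reader;
use std::thread;

/// Split worksheet XML into up to `n` chunks, cutting only at `<row`
/// start tags so each chunk holds complete rows.
fn row_chunks(xml: &str, n: usize) -> Vec<&str> {
    let bytes = xml.as_bytes();
    let step = xml.len() / n;
    let mut chunks = Vec::with_capacity(n);
    let mut start = 0;
    for _ in 1..n {
        let probe = start + step;
        if probe >= bytes.len() {
            break;
        }
        // Find the next `<row` at or after the probe point. `<` is ASCII,
        // so cutting there always lands on a UTF-8 char boundary.
        match bytes[probe..].windows(4).position(|w| w == b"<row") {
            Some(off) => {
                let cut = probe + off;
                chunks.push(&xml[start..cut]);
                start = cut;
            }
            None => break,
        }
    }
    chunks.push(&xml[start..]);
    chunks
}

fn main() {
    // Hypothetical pre-extracted worksheet (e.g. xl/worksheets/sheet1.xml).
    let xml = std::fs::read_to_string("sheet1.xml").expect("worksheet xml");
    let total: usize = thread::scope(|s| {
        let handles: Vec<_> = row_chunks(&xml, 4)
            .into_iter()
            .map(|chunk| {
                s.spawn(move || {
                    // Count cells (`<c>` elements) in this chunk. Chunks are not
                    // well-formed documents on their own, so a tail error (e.g.
                    // the trailing `</sheetData>`) is treated as end-of-chunk.
                    let mut reader = Reader::from_str(chunk);
                    let mut cells = 0;
                    loop {
                        match reader.read_event() {
                            Ok(Event::Start(e)) if e.name().as_ref() == b"c" => cells += 1,
                            Ok(Event::Eof) | Err(_) => break,
                            _ => (),
                        }
                    }
                    cells
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    });
    println!("cells = {total}");
}
```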
Yes, and it's a problem now; see #27. I think lazy loading is also important, to reduce memory use and improve performance in some cases: when reading only a few rows from a large file, calamine is currently slower than openpyxl, a pure-Python library.
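For that few-rows-from-a-large-file case, the win comes from stopping the parse early instead of consuming the whole sheet. A minimal sketch of the idea with quick-xml (standalone, not calamine's API; the pre-extracted sheet1.xml and the cap of 10 rows are arbitrary):

```rust
use quick_xml::events::Event;
use quick_xml::reader::Reader;

fn main() {
    // Hypothetical pre-extracted worksheet XML.
    let xml = std::fs::read_to_string("sheet1.xml").expect("worksheet xml");
    let mut reader = Reader::from_str(&xml);
    let mut rows = 0;
    loop {
        match reader.read_event() {
            // Stop as soon as we have the rows we want, instead of
            // parsing the remaining megabytes of the sheet.
            Ok(Event::Start(e)) if e.name().as_ref() == b"row" => {
                rows += 1;
                if rows == 10 {
                    break;
                }
            }
            Ok(Event::Eof) => break,
            Err(e) => panic!("parse error at {}: {e:?}", reader.buffer_position()),
            _ => (),
        }
    }
    println!("read {rows} rows, then stopped");
}
```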
Reading using pandas 2.2, with the data from https://github.com/MarkPflug/Benchmarks/tree/main/source/Benchmarks/Data.
Using that data's worksheet XML:

```
hyperfine --warmup 3 'target\release\xml_parser_compare.exe xmlparser' 'target\release\xml_parser_compare.exe quickxml'

Benchmark 1: target\release\xml_parser_compare.exe xmlparser
  Time (mean ± σ):     172.0 ms ±   0.8 ms    [User: 155.9 ms, System: 18.3 ms]
  Range (min … max):   171.2 ms … 174.0 ms    17 runs

Benchmark 2: target\release\xml_parser_compare.exe quickxml
  Time (mean ± σ):     178.6 ms ±   1.2 ms    [User: 167.7 ms, System: 15.7 ms]
  Range (min … max):   177.1 ms … 181.0 ms    16 runs

Summary
  target\release\xml_parser_compare.exe xmlparser ran
    1.04 ± 0.01 times faster than target\release\xml_parser_compare.exe quickxml
```

The other files in the archive are only a few KB each.
Those shouldn't take that long to parse, at least compared to the 34.4MB worksheet, so for the Python bench to take 1.5 seconds is certainly interesting. I will try to go through some other libraries' implementations of lazy reading, and generally see if we can't put together something to increase performance.

From the other benchmark I did, the parser was doing ~220MB/s of throughput, which checks out here too: 34.4MB in ~172ms works out to roughly 200MB/s. From the parse-time chart, you can see that 250MB/s isn't really that fast, and even though my benchmark was done on much newer hardware, the throughput I was getting was still only ~220MB/s. This is all very rough of course, but it's not painting a great picture.
In a recent application of mine, 50,000 rows are read from an .xlsx file, and reading those 50,000 rows is the bottleneck. Wouldn't it be nice to have multi-threaded reading of an .xlsx file? We do live in the multi-core CPU era, after all.