perf: Speedup ndjson reader ~40% #18197

Merged (3 commits) on Aug 15, 2024

ChayimFriedman2
Contributor

This speeds up the NDJSON reader by roughly 40% in the following benchmark:

use criterion::{criterion_group, criterion_main, Criterion};
use mimalloc::MiMalloc;
use polars::prelude::*;
use polars_core::POOL;
use rand::{thread_rng, Rng};

#[global_allocator]
static MIMALLOC: MiMalloc = MiMalloc;

fn from_json(json: &[u8]) -> DataFrame {
    JsonReader::new(std::io::Cursor::new(json))
        .with_json_format(JsonFormat::JsonLines)
        .set_rechunk(false)
        .finish()
        .unwrap()
}

fn my_benchmark(_c: &mut Criterion) {
    POOL.install(|| {
        let mut c = Criterion::default().configure_from_args();

        const SIZE: i32 = 5_000_000;
        let mut rng = thread_rng();
        let mut df = df![
            "a" => (0..SIZE).map(|_| rng.gen::<i32>()).collect::<Vec<_>>(),
            "b" => (0..SIZE).map(|v| v.to_string()).collect::<Vec<_>>(),
        ]
        .unwrap();

        let mut json = Vec::new();
        JsonWriter::new(&mut json)
            .with_json_format(JsonFormat::JsonLines)
            .finish(&mut df)
            .unwrap();

        c.bench_function("JSON Lines Deserialization", |b| {
            b.iter(|| from_json(&json))
        });
    });
}

criterion_group!(benches, my_benchmark);
criterion_main!(benches);

I had plans to improve it further, but I can't find the time for that, and it would involve bigger changes, possibly even a rewrite of the mechanism. The changes in this PR are simple and effective, so I thought I'd just send them.

Best reviewed commit-by-commit.

A warning from the second commit, repeated here for noticeability:

This could break people's code since we will not split correctly (and thus error) if one object spans two lines or two objects are in the same line. However, such code was already broken, since NDJSON is not allowed to contain any line breaks. If this is a concern, it is possible (at some perf degradation) to check for `}\n` instead of `\n` alone, and that will make this basically equivalent to the splitting logic we have for threads.

This simple change speeds up NDJSON reading by 30%.
Previously we used it to delimit the values. While convenient, it was not efficient (see the comment in the code).

This gives a 20% speedup.

This *could* break people's code since we will not split correctly (and thus error) if one object spans two lines or two objects are in the same line. However, such code was already broken, since NDJSON is not allowed to contain any line breaks. If this is a concern, it is possible (at some perf degradation) to check for `}\n` instead of `\n` alone, and that will make this basically equivalent to the splitting logic we have for threads.
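
To make the trade-off above concrete, here is a minimal sketch of the two splitting strategies over a plain byte buffer. The helper names `is_blank`, `split_on_newline`, and `split_on_brace_newline` are hypothetical, and this is not the Polars implementation; it only illustrates how splitting on `\n` alone cuts an object that spans two lines, while checking for `}\n` keeps it in one piece.

```rust
/// Returns true if the slice is empty or contains only ASCII whitespace.
/// (Hypothetical helper for this sketch.)
fn is_blank(bytes: &[u8]) -> bool {
    bytes.iter().all(|b| b.is_ascii_whitespace())
}

/// Fast path: every `\n` ends a value. Correct for spec-compliant NDJSON.
fn split_on_newline(buf: &[u8]) -> Vec<&[u8]> {
    buf.split(|&b| b == b'\n')
        .filter(|line| !is_blank(line))
        .collect()
}

/// Stricter variant: a `\n` ends a value only when the previous byte is `}`.
fn split_on_brace_newline(buf: &[u8]) -> Vec<&[u8]> {
    let mut out = Vec::new();
    let mut start = 0;
    for i in 1..buf.len() {
        if buf[i] == b'\n' && buf[i - 1] == b'}' {
            if !is_blank(&buf[start..i]) {
                out.push(&buf[start..i]);
            }
            start = i + 1;
        }
    }
    // Keep a trailing value that is not terminated by a newline.
    if !is_blank(&buf[start..]) {
        out.push(&buf[start..]);
    }
    out
}
```

With an input where the first object spans two lines, the fast path yields three fragments (two of which are not valid JSON on their own, so parsing would error), whereas the `}\n` variant yields the two intended objects. Neither variant separates two objects placed on the same line.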

As a nice bonus, this allows us to avoid a dependency on `serde_json` for JSON parsing (although we still use it for other things).

The original PR that introduced this usage of `serde_json` was pola-rs#5427. It was done because newline handling wasn't correct. However, as I said above, it is very simple: newlines are not allowed anywhere except between values. And even if we decide we want to handle non-spec-compliant NDJSON, we still don't handle it properly as we can break thread chunks in the middle of a string.
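
The last point can be illustrated with a small sketch. Assume a hypothetical helper, `next_chunk_start`, that picks a thread-chunk boundary by scanning for `}\n`; the name and shape are illustrative only and do not reflect Polars' actual code. On non-spec-compliant input where a string contains a raw newline right after a `}`, the chosen boundary lands inside that string.

```rust
/// Returns the index just past the first `}\n` found at or after `from`,
/// or `None` if there is no such pair. Assumes `from <= buf.len()`.
/// (Hypothetical boundary finder for this sketch.)
fn next_chunk_start(buf: &[u8], from: usize) -> Option<usize> {
    buf[from..]
        .windows(2)
        .position(|w| w == b"}\n")
        .map(|p| from + p + 2)
}
```

For example, given the bytes of `{"s":"x}` followed by a raw newline and then `y"}`, the first boundary found is the position of `y`, which is still inside the string value, rather than the start of the next object.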

The above-mentioned PR also said this had massive perf gains. However, I cannot reproduce that. I checked out the repo as it was at that time, and that PR was a definite regression. That is also expected, given that `serde_json::StreamDeserializer` does a lot of additional work, and it also shows up in profiles. It was probably benchmarked incorrectly (maybe with a debug build?).

This code errors on invalid JSON. But `simd_json` already errors on invalid JSON (and we propagate that), so I see no reason to keep it.
In addition, a side effect of that code is that it also rejects some valid JSON: the empty object (`{}`). An empty dataframe does not seem very useful, but I see no reason to *forbid* it. Also, the empty object may appear in a non-empty dataframe, to signal an all-null row.

As a nice side benefit, this also improves perf by 3.5%, but that could be just noise.

codecov bot commented Aug 14, 2024

Codecov Report

Attention: Patch coverage is 96.96970% with 1 line in your changes missing coverage. Please review.

Project coverage is 80.31%. Comparing base (aa1950c) to head (ecf762d).
Report is 106 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| crates/polars-io/src/ndjson/core.rs | 96.96% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18197      +/-   ##
==========================================
- Coverage   80.35%   80.31%   -0.05%     
==========================================
  Files        1492     1498       +6     
  Lines      196332   198748    +2416     
  Branches     2813     2833      +20     
==========================================
+ Hits       157759   159618    +1859     
- Misses      38052    38603     +551     
- Partials      521      527       +6     

@ritchie46 changed the title from "Speedup ndjson reader" to "perf: Speedup ndjson reader" on Aug 15, 2024
@github-actions bot added the labels performance (Performance issues or improvements), python (Related to Python Polars), and rust (Related to Rust Polars) on Aug 15, 2024
Member

@ritchie46 left a comment

Really nice PR and great review with those commit messages. I have one question.

Review thread on crates/polars-io/src/ndjson/core.rs (resolved)
@ritchie46 changed the title from "perf: Speedup ndjson reader" to "perf: Speedup ndjson reader ~40%" on Aug 15, 2024
@ritchie46 merged commit 8476f8c into pola-rs:main on Aug 15, 2024
23 of 24 checks passed
@ChayimFriedman2 deleted the speedup-json-reader branch on August 15, 2024 10:34