
Add an experimental csv module exposing a streaming csv parser #3743

Open
wants to merge 5 commits into master

Conversation

@oleiade (Member) commented on May 15, 2024

What?

This PR is a cleaned-up version of the CSV streaming parser we hacked together during Crococon.

It aims to address #2976, and adds the capability to natively parse a whole CSV file into a SharedArray (without having to resort to papaparse).

Parse function

The parse function takes an fs.File instance as input, as well as options; it parses the whole file as CSV and returns a SharedArray instance containing the parsed records.

It aims to offer an experience similar to what is currently possible with the open function and papaparse, with the added benefits of:

  • consuming less memory: because it uses the new fs.open function, the file content and the parsed records are shared across VUs (although the records are themselves a copy of the content).
  • being faster, especially for larger files, as it is designed to bypass most of the JS runtime and to parse the file and store the results in a SharedArray directly in Go. Through our pairing sessions with @joanlopez, we profiled the execution extensively with Pyroscope and made some comparisons: most of the CPU time spent parsing into a SharedArray with papaparse was spent in the JS runtime. The approach picked in this PR mitigates that.

This API trades memory for performance: the whole file content is still held in memory a couple of times, and we also hold a copy of all the file's parsed rows. However, in our benchmarks, this approach was significantly faster than using papaparse.

import { open } from 'k6/experimental/fs'
import csv from 'k6/experimental/csv'
import { scenario } from 'k6/execution'

export const options = {
	iterations: 10,
}

// Open the csv file, and parse it ahead of time.
let file;
let csvRecords;
(async function () {
	file = await open('data.csv');

	// The `csv.parse` function consumes the entire file at once, and returns
	// the parsed records as a SharedArray object.
	csvRecords = await csv.parse(file, { delimiter: ',', skipFirstLine: true, fromLine: 10, toLine: 1000 });
})();


export default async function() {
	console.log(csvRecords[scenario.iterationInTest])
}

Parser

The parser results from our initial CSV parsing workshop at Crococon. Its API is specifically designed to address #2976. It exposes a Parser object whose instances behave similarly to a JS iterator: calling the next method returns the next set of records, as well as a done marker indicating whether there is more to consume.

The parser relies exclusively on the fs.File constructs and parses rows on demand, instead of storing them all in memory. As such, it consumes less memory but is also somewhat slower to parse (comparable to papaparse), because each call to next() needs to go through the whole JS runtime and event loop (as observed during our profiling sessions with Pyroscope), making the cost of creating/awaiting the next promise significantly bigger than the actual parsing operation.

The parser effectively trades performance for memory but offers some flexibility in parsing and interpreting the results.

import { open } from 'k6/experimental/fs'
import csv from 'k6/experimental/csv'

export const options = {
	iterations: 10,
}

let file;
let parser;
(async function () {
	file = await open('data.csv');
	parser = new csv.Parser(file, { delimiter: ',', skipFirstLine: true, fromLine: 10, toLine: 1000});
})();

export default async function() {
	const {done, value} = await parser.next();
	if (done) {
		throw new Error("No more rows to read");
	}

	console.log(value);
}
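
Since next() follows the { done, value } convention, a single iteration can also drain the file record by record and decide what to do with each one along the way. Below is a minimal sketch under the assumption that records are returned as arrays of column values; the filtering condition is purely illustrative:

import { open } from 'k6/experimental/fs'
import csv from 'k6/experimental/csv'

let file;
let parser;
(async function () {
	file = await open('data.csv');
	parser = new csv.Parser(file, { delimiter: ',', skipFirstLine: true });
})();

export default async function () {
	// Drain the file within a single iteration.
	while (true) {
		const { done, value } = await parser.next();
		if (done) {
			break;
		}

		// Assumption: each record is an array of column values; skip records
		// whose first column is empty.
		if (value[0] === '') {
			continue;
		}

		console.log(value);
	}
}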

Implementation details & Open Questions

  • In order to support the module, we had to significantly reshape the internals of the fs module to make it possible to open and manipulate files from another module. The biggest part of the change was introducing an interface capturing the fs.File behavior we needed to rely on from the outside, namely read, seek and stat: ReadSeekStater. See commit 8e782c1 for more details.
  • As we found during our profiling investigation, instantiating shared arrays in JS and filling them with the results of parsing the CSV file in Go showed little benefit: most of the execution time was still spent in the JS runtime. To make things faster, we added a Go SharedArray constructor to the Go data module, which replicates the behavior of the JS constructor in Go and effectively bypasses most of the runtime overhead. We were not sure this was the best approach, so let us know if you can think of something better. See commit d5e6ebc for more details.

What's not there yet

  • Part of the initial design described in Add a streaming-based CSV parser to k6 #2976 included two concepts I haven't included here yet, as I'm not sure what the best API or performance-oriented solution would be (ideas welcome 🤝):
    • The ability to describe a strategy for the parser to select which lines should be picked for parsing or ignored (say, when you want a file's lines to be parsed and spread evenly across all your VUs).
    • The ability to instruct the parser to cycle through the file: once it reaches the end, it restarts from the top. My main question mark is that, since this is probably already possible with the existing APIs (csv.Parser.next returns an iterator-like object with a done property, seeking through the file is possible, and re-instantiating the parser once the end is reached is an option; see the sketch after this list), would we indeed want a dedicated method/API for that?
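
For illustration, cycling is already achievable in user land by re-instantiating the parser once done is reported. The sketch below assumes the new parser needs an explicit seek back to the start of the file; whether the constructor would handle that itself is part of the open question:

import { open, SeekMode } from 'k6/experimental/fs'
import csv from 'k6/experimental/csv'

let file;
let parser;
(async function () {
	file = await open('data.csv');
	parser = new csv.Parser(file, { delimiter: ',', skipFirstLine: true });
})();

export default async function () {
	let { done, value } = await parser.next();

	// Once the end of the file is reached, seek back to the start and
	// re-instantiate the parser to restart from the top.
	if (done) {
		await file.seek(0, SeekMode.Start);
		parser = new csv.Parser(file, { delimiter: ',', skipFirstLine: true });
		({ done, value } = await parser.next());
	}

	console.log(value);
}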

Why?

Using CSV files in k6 tests is a very common pattern, and until recently, doing it efficiently could prove tricky. One common issue users encounter is that JS tends to be rather slow when performing parsing operations. Hence, we are leveraging the fs module constructs and the asynchronous APIs introduced in Goja over the last year to implement a Go-based, "high-performance" streaming CSV parser.

Checklist

  • I have performed a self-review of my code.
  • I have added tests for my changes.
  • I have run linter locally (make lint) and all checks pass.
  • I have run tests locally (make tests) and all tests pass.
  • I have commented on my code, particularly in hard-to-understand areas.

Related PR(s)/Issue(s)

#2976

@oleiade oleiade self-assigned this May 15, 2024
@joanlopez (Contributor) commented, quoting the PR description:

One common issue encountered by users is that JS tends to be rather slow when performing parsing operations.

Take this as just a simple idea rather than something that's really a requirement for this pull request to move forward, but considering that you explicitly mentioned that, it would be nice to have a small benchmark for comparison.

@joanlopez (Contributor) left a review comment:

Thanks for giving form to what we started during Crococon 💟

I left multiple comments as some initial feedback, but generally speaking I think this approach is more than okay, and from my side I'd suggest moving forward (with tests and all that) 🚀

I'm not sure how far we are from being able to publish this as an experimental module, but I guess the feedback and usage we collect from users during its experimental stage will be what helps us answer some of the open questions you left, and actually confirm whether the current approach is good enough or not.

Review thread on js/modules/k6/experimental/csv/csv.js (outdated, resolved)
@oleiade (Member, Author) commented on May 27, 2024

Posting here a summary of the use-cases we discussed privately, and that we'd like the module to tackle:

  1. As a user, I want to read a CSV file containing 1000 credentials, and have each credential processed by a single iteration.
    • no credential should be processed more than once
    • unless the parser is explicitly instructed to restart from the beginning? In that scenario, the same credential can be processed multiple times.
    • if the option is not set, and the user calls parser.next() after all credentials are consumed, they keep getting a { done: true, value: undefined } response.
  2. As a user, I want to read a CSV file containing 1000 credentials, and have each subset of those credentials reserved for processing by a single VU.
    • the subset of credentials could be a chunk: credentials 0-100 go to VU 1, 101-200 go to VU 2, etc.
    • the subset of credentials could be every Nth credential: 0, 10, 20, 30, etc. go to VU 1; 1, 11, 21, 31, etc. go to VU 2; and so on.
    • This is possible with the existing SharedArray approach, but it needs a faster way of processing the rows (a rough sketch follows this list).
  3. As a user, I want each iteration to stream through my CSV file, and have the ability to act upon each returned record.
    • The user has the ability to skip a record, or to stop the iteration, based on the content of the record or the line number.
    • This assumes that each iteration needs the whole content of the file to perform its test.
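
As a rough sketch of the second use case with the APIs proposed here, the SharedArray returned by csv.parse can be partitioned so that each VU only reads every Nth record. The file name, scenario configuration, and partitioning arithmetic below are purely illustrative:

import { open } from 'k6/experimental/fs'
import csv from 'k6/experimental/csv'
import exec from 'k6/execution'

export const options = {
	vus: 10,
	iterations: 100,
}

let file;
let csvRecords;
(async function () {
	file = await open('credentials.csv');
	csvRecords = await csv.parse(file, { delimiter: ',', skipFirstLine: true });
})();

export default function () {
	// Interleaved partitioning: with N VUs, VU 1 reads records 0, N, 2N, ...,
	// VU 2 reads records 1, N+1, 2N+1, ..., and so on.
	const vuCount = options.vus;
	const vuIndex = exec.vu.idInTest - 1; // VU ids are 1-based.
	const recordIndex = vuIndex + exec.vu.iterationInScenario * vuCount;

	if (recordIndex >= csvRecords.length) {
		return; // This VU has consumed its share of the records.
	}

	console.log(csvRecords[recordIndex]);
}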

@oleiade oleiade marked this pull request as ready for review August 27, 2024 13:06
@oleiade oleiade requested a review from a team as a code owner August 27, 2024 13:06
@oleiade oleiade requested review from olegbespalov, joanlopez and codebien and removed request for a team August 27, 2024 13:06