
@CsvFileSource is unusable for datasets involving very long lines #3923

Closed
mdindoffer opened this issue Aug 13, 2024 · 1 comment · Fixed by #3924
@mdindoffer

Description

If you want to use a CSV file-based dataset that contains very long column values, you have to increase the maxCharsPerColumn property from its default value of 4096. If, however, you don't know the length of the largest datapoint in your dataset, or cannot commit to a hard limit that accommodates future growth, the logical thing to do is to set the value to the largest possible one, i.e. Integer.MAX_VALUE.

This, to my surprise, crashes the test execution with:

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at org.junit.jupiter.params.shadow.com.univocity.parsers.common.input.DefaultCharAppender.<init>(DefaultCharAppender.java:40)
	at org.junit.jupiter.params.shadow.com.univocity.parsers.csv.CsvParserSettings.newCharAppender(CsvParserSettings.java:93)
	at org.junit.jupiter.params.shadow.com.univocity.parsers.common.ParserOutput.<init>(ParserOutput.java:111)
	at org.junit.jupiter.params.shadow.com.univocity.parsers.common.AbstractParser.<init>(AbstractParser.java:91)
	at org.junit.jupiter.params.shadow.com.univocity.parsers.csv.CsvParser.<init>(CsvParser.java:70)
	at org.junit.jupiter.params.provider.CsvParserFactory.createParser(CsvParserFactory.java:61)
	at org.junit.jupiter.params.provider.CsvParserFactory.createParserFor(CsvParserFactory.java:40)
	at org.junit.jupiter.params.provider.CsvFileArgumentsProvider.provideArguments(CsvFileArgumentsProvider.java:64)
	at org.junit.jupiter.params.provider.CsvFileArgumentsProvider.provideArguments(CsvFileArgumentsProvider.java:44)
	at org.junit.jupiter.params.provider.AnnotationBasedArgumentsProvider.provideArguments(AnnotationBasedArgumentsProvider.java:52)
	at org.junit.jupiter.params.ParameterizedTestExtension.arguments(ParameterizedTestExtension.java:145)
	at org.junit.jupiter.params.ParameterizedTestExtension.lambda$provideTestTemplateInvocationContexts$2(ParameterizedTestExtension.java:90)
	at org.junit.jupiter.params.ParameterizedTestExtension$$Lambda/0x00007427a8142bb0.apply(Unknown Source)
	at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:273)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:276)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)

I was shocked to see that the CsvParser implementation used by JUnit really does pre-allocate a char array of that size up front to store the CSV values. See 1.

This is completely unusable for CSV strings of unknown length. One could of course provide a value that fits within the JVM's array-size limit (i.e. Integer.MAX_VALUE - 8), at which point this doesn't crash, but then your unit test allocates an absolutely ridiculous amount of heap memory just to run.

Digging further, I found that the shaded univocity-parsers library does actually have another implementation besides DefaultCharAppender, called ExpandingCharAppender, which grows the char buffer at runtime, starting from a modest buffer length of 8192.
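The grow-on-demand strategy described above can be sketched roughly like this (a minimal illustration loosely modeled on ExpandingCharAppender's behavior, not the library's actual code):

```java
// Minimal sketch of a grow-on-demand char appender: starts with a small
// buffer and doubles it when full, instead of pre-allocating the maximum.
// Illustrative only -- not univocity's real ExpandingCharAppender.
class ExpandingBuffer {
    private char[] chars = new char[8192]; // modest initial capacity
    private int index = 0;

    void append(char ch) {
        if (index == chars.length) {
            // double the buffer on demand
            chars = java.util.Arrays.copyOf(chars, chars.length * 2);
        }
        chars[index++] = ch;
    }

    int length() {
        return index;
    }

    @Override
    public String toString() {
        return new String(chars, 0, index);
    }
}
```

With this approach, memory usage tracks the actual size of the data rather than a worst-case cap, which is exactly why a -1 / unbounded mode is attractive here.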

The library bases its decision on which appender to use on the CsvParserSettings, see 2. Apparently, all that is required to switch to the ExpandingCharAppender is to pass a value of -1 for maxCharsPerColumn.

Unfortunately, the maxCharsPerColumn property of the @CsvFileSource annotation requires the value to be a positive integer:

org.junit.platform.commons.PreconditionViolationException: maxCharsPerColumn must be a positive number: -1
	at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:273)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:276)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	Suppressed: org.junit.platform.commons.PreconditionViolationException: Configuration error: You must configure at least one set of arguments for this @ParameterizedTest
		at java.base/java.util.stream.AbstractPipeline.close(AbstractPipeline.java:323)
		at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:273)
		... 9 more

Steps to reproduce

  1. OutOfMemoryError when using large column length limits
@ParameterizedTest
@CsvFileSource(resources = "/file.csv", numLinesToSkip = 1, maxCharsPerColumn = Integer.MAX_VALUE)
void dummy(String columnA, String columnB) {
}
  2. PreconditionViolationException for trying to use an unbounded ExpandingCharAppender with maxCharsPerColumn = -1
@ParameterizedTest
@CsvFileSource(resources = "/file.csv", numLinesToSkip = 1, maxCharsPerColumn = -1)
void dummy(String columnA, String columnB) {
}
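Until the validation is relaxed, one workaround is to bypass the shaded parser entirely and read the file yourself, e.g. feeding rows into a @MethodSource provider. A bare-bones sketch (assumption: naive comma splitting with no support for quoted fields or embedded delimiters):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Naive line-based CSV reader with no per-column size cap (illustration
// only: it ignores quoting and escaping). Each row grows with the input,
// so arbitrarily long column values need no up-front limit.
class NaiveCsv {
    static List<String[]> parse(BufferedReader reader) throws IOException {
        List<String[]> rows = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            rows.add(line.split(",", -1)); // -1 keeps trailing empty columns
        }
        return rows;
    }
}
```

Rows read this way can back a @MethodSource or a custom ArgumentsProvider, sidestepping maxCharsPerColumn altogether, at the cost of losing @CsvFileSource's quoting and header handling.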

Context

  • Used versions (Jupiter/Vintage/Platform): JUnit 5.10.3
  • Build Tool/IDE: JDK 21

TLDR

Please switch to the ExpandingCharAppender by default when using @CsvFileSource, or at least allow its usage by removing the positive-integer validation of the maxCharsPerColumn property, and document the valid range.

Alternatively, you may consider switching to a better CSV parser implementation altogether. This obscure "Univocity" library last saw a commit in 2021, and its website univocity.com returns an HTTP 404 error page.

@marcphilipp
Member

> Please switch to the ExpandingCharAppender by default when using @CsvFileSource, or at least allow its usage by removing the positive integer validation of maxCharsPerColumn property, and document the valid range.

We'll do that for now.

> Alternatively, you may consider switching to a better CSV parser implementation altogether. This obscure "Univocity" library has last seen a commit in 2021 and its website univocity.com returns an HTTP 404 error page.

I'm hoping the fork mentioned in uniVocity/univocity-parsers#534 will get some traction but we'll keep an eye on it.
