
@CsvFileSource is unusable for datasets involving very long lines #3923

Closed
mdindoffer opened this issue Aug 13, 2024 · 1 comment · Fixed by #3924
@mdindoffer

Description

If you want to use a CSV file-based dataset that contains very long column values, you have to increase the maxCharsPerColumn property from its default value of 4096. If, however, you don't know the length of the largest datapoint in your dataset, or cannot commit to a hard limit that accommodates future growth, the logical thing to do is to set the value to the largest possible one, i.e. Integer.MAX_VALUE.

This, to my surprise, crashes the test execution with:

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at org.junit.jupiter.params.shadow.com.univocity.parsers.common.input.DefaultCharAppender.<init>(DefaultCharAppender.java:40)
	at org.junit.jupiter.params.shadow.com.univocity.parsers.csv.CsvParserSettings.newCharAppender(CsvParserSettings.java:93)
	at org.junit.jupiter.params.shadow.com.univocity.parsers.common.ParserOutput.<init>(ParserOutput.java:111)
	at org.junit.jupiter.params.shadow.com.univocity.parsers.common.AbstractParser.<init>(AbstractParser.java:91)
	at org.junit.jupiter.params.shadow.com.univocity.parsers.csv.CsvParser.<init>(CsvParser.java:70)
	at org.junit.jupiter.params.provider.CsvParserFactory.createParser(CsvParserFactory.java:61)
	at org.junit.jupiter.params.provider.CsvParserFactory.createParserFor(CsvParserFactory.java:40)
	at org.junit.jupiter.params.provider.CsvFileArgumentsProvider.provideArguments(CsvFileArgumentsProvider.java:64)
	at org.junit.jupiter.params.provider.CsvFileArgumentsProvider.provideArguments(CsvFileArgumentsProvider.java:44)
	at org.junit.jupiter.params.provider.AnnotationBasedArgumentsProvider.provideArguments(AnnotationBasedArgumentsProvider.java:52)
	at org.junit.jupiter.params.ParameterizedTestExtension.arguments(ParameterizedTestExtension.java:145)
	at org.junit.jupiter.params.ParameterizedTestExtension.lambda$provideTestTemplateInvocationContexts$2(ParameterizedTestExtension.java:90)
	at org.junit.jupiter.params.ParameterizedTestExtension$$Lambda/0x00007427a8142bb0.apply(Unknown Source)
	at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:273)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:276)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)

I was shocked to see that the CsvParser implementation used by JUnit really does pre-allocate a char array of that size up front to store the CSV values. See 1.

This is completely unusable for CSV strings of unknown length. One could of course provide a value that fits within the JVM's array-size limit (i.e. Integer.MAX_VALUE - 8), at which point this doesn't crash, but then your unit test allocates an absolutely ridiculous amount of heap memory just to run.

Digging further, I found that the shaded univocity-parsers library does actually have another implementation besides DefaultCharAppender, called ExpandingCharAppender, which grows the char buffer at runtime, starting from a modest buffer length of 8192.
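The grow-on-demand strategy described above can be sketched roughly like this (a minimal illustration loosely modeled on ExpandingCharAppender's behavior, not the library's actual code):

```java
// Minimal sketch of a grow-on-demand char appender: starts with a small
// buffer and doubles it when full, instead of pre-allocating the maximum.
// Illustrative only -- not univocity's real ExpandingCharAppender.
class ExpandingBuffer {
    private char[] chars = new char[8192]; // modest initial capacity
    private int index = 0;

    void append(char ch) {
        if (index == chars.length) {
            // double the buffer on demand
            chars = java.util.Arrays.copyOf(chars, chars.length * 2);
        }
        chars[index++] = ch;
    }

    int length() {
        return index;
    }

    @Override
    public String toString() {
        return new String(chars, 0, index);
    }
}
```

With this approach, memory usage tracks the actual size of the data rather than a worst-case cap, which is exactly why a -1 / unbounded mode is attractive here.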

The library bases its decision on which appender to use on the CsvParserSettings, see 2. Apparently, all that is required to switch to the ExpandingCharAppender is to pass a value of -1 for maxCharsPerColumn.

Unfortunately, the maxCharsPerColumn property of the @CsvFileSource annotation requires the value to be a positive integer:

org.junit.platform.commons.PreconditionViolationException: maxCharsPerColumn must be a positive number: -1
	at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:273)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:276)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	Suppressed: org.junit.platform.commons.PreconditionViolationException: Configuration error: You must configure at least one set of arguments for this @ParameterizedTest
		at java.base/java.util.stream.AbstractPipeline.close(AbstractPipeline.java:323)
		at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:273)
		... 9 more

Steps to reproduce

  1. OutOfMemoryError when using large column length limits
@ParameterizedTest
@CsvFileSource(resources = "/file.csv", numLinesToSkip = 1, maxCharsPerColumn = Integer.MAX_VALUE)
void dummy(String columnA, String columnB) {
}
  2. PreconditionViolationException for trying to use an unbounded ExpandingCharAppender with maxCharsPerColumn = -1
@ParameterizedTest
@CsvFileSource(resources = "/file.csv", numLinesToSkip = 1, maxCharsPerColumn = -1)
void dummy(String columnA, String columnB) {
}
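Until the validation is relaxed, one workaround is to bypass the shaded parser entirely and read the file yourself, e.g. feeding rows into a @MethodSource provider. A bare-bones sketch (assumption: naive comma splitting with no support for quoted fields or embedded delimiters):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Naive line-based CSV reader with no per-column size cap (illustration
// only: it ignores quoting and escaping). Each row grows with the input,
// so arbitrarily long column values need no up-front limit.
class NaiveCsv {
    static List<String[]> parse(BufferedReader reader) throws IOException {
        List<String[]> rows = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            rows.add(line.split(",", -1)); // -1 keeps trailing empty columns
        }
        return rows;
    }
}
```

Rows read this way can back a @MethodSource or a custom ArgumentsProvider, sidestepping maxCharsPerColumn altogether, at the cost of losing @CsvFileSource's quoting and header handling.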

Context

  • Used versions (Jupiter/Vintage/Platform): JUnit 5.10.3
  • Build Tool/IDE: JDK 21

TLDR

Please switch to the ExpandingCharAppender by default when using @CsvFileSource, or at least allow its usage by removing the positive-integer validation of the maxCharsPerColumn property, and document the valid range.

Alternatively, you may consider switching to a better CSV parser implementation altogether. This obscure "Univocity" library last saw a commit in 2021, and its website univocity.com returns an HTTP 404 error page.

@marcphilipp
Member

> Please switch to the ExpandingCharAppender by default when using @CsvFileSource, or at least allow its usage by removing the positive integer validation of maxCharsPerColumn property, and document the valid range.

We'll do that for now.

> Alternatively, you may consider switching to a better CSV parser implementation altogether. This obscure "Univocity" library has last seen a commit in 2021 and its website univocity.com returns an HTTP 404 error page.

I'm hoping the fork mentioned in uniVocity/univocity-parsers#534 will get some traction but we'll keep an eye on it.
