text.lines enhancements #2758

Merged (12 commits) on Dec 22, 2021

Conversation

stephenjudkins
Contributor

  • Add support for input with '\r'-only newlines, as produced by classic Mac OS. Yes, content like this still exists.
  • Add support for raising an error when an accumulated line exceeds a given size; this keeps bad or malicious inputs from causing OOMs.
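
A minimal sketch of the two behaviors described above, written against plain Scala strings rather than fs2 streams. `LineSplitSketch`, `splitLines`, and `LineTooLong` are hypothetical names for illustration and are not part of fs2; always emitting the final segment (so empty input yields one empty line) is also an assumption of this sketch.

```scala
object LineSplitSketch {
  // Hypothetical error type for this sketch; not the exception class used in the PR.
  final class LineTooLong(val length: Int, val max: Int)
      extends RuntimeException(s"line of at least $length chars exceeds limit of $max")

  /** Splits `input` on "\n", "\r\n", and bare "\r".
    * Returns Left as soon as a line grows beyond `maxLineLength`, if a limit is given.
    */
  def splitLines(
      input: String,
      maxLineLength: Option[Int] = None
  ): Either[LineTooLong, List[String]] = {
    val lines = List.newBuilder[String]
    val current = new StringBuilder
    var error: Option[LineTooLong] = None
    var i = 0
    while (i < input.length && error.isEmpty) {
      input.charAt(i) match {
        case '\n' =>
          lines += current.result()
          current.clear()
        case '\r' =>
          lines += current.result()
          current.clear()
          // Treat "\r\n" as a single delimiter by skipping the '\n' that follows.
          if (i + 1 < input.length && input.charAt(i + 1) == '\n') i += 1
        case c =>
          current.append(c)
          // Check the limit while accumulating, so an oversized line fails early.
          maxLineLength.foreach { max =>
            if (current.length > max) error = Some(new LineTooLong(current.length, max))
          }
      }
      i += 1
    }
    error match {
      case Some(e) => Left(e)
      case None =>
        lines += current.result() // the final (possibly empty) segment
        Right(lines.result())
    }
  }
}
```

For example, `LineSplitSketch.splitLines("a\rb\r\nc\nd")` yields four lines, while passing `Some(3)` as the limit rejects any input containing a line longer than three characters.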

@stephenjudkins stephenjudkins changed the title text.lines enhancements: text.lines enhancements Dec 17, 2021
maxLineLength match {
case Some((max, raiseThrowable)) if stringBuilder.length > max =>
Pull.raiseError[F](
new IllegalStateException(
Contributor Author

Can anyone recommend any other type of exception here instead of IllegalStateException?

Member

I don't think it's an IllegalStateException. We didn't call it at an inopportune time: we called it with an input that didn't match the configuration. I think IllegalArgumentException is the closest, but I might just give it its own type of RuntimeException.

Contributor Author

Yeah, that seems right to me. I've created a new LineTooLongException.
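
A rough sketch of what a dedicated exception type buys the caller: the failure can be matched on by type instead of being confused with other `IllegalStateException`s or `IllegalArgumentException`s. The constructor fields, the message, and the `LineTooLongDemo` helper are assumptions for illustration, not copied from the PR.

```scala
object LineTooLongDemo {
  // Hypothetical shape; the field names and message are assumptions of this sketch.
  final class LineTooLongException(val length: Int, val max: Int)
      extends RuntimeException(s"Max line size is $max but $length chars have been accumulated")

  // Fails with the dedicated type when the line exceeds the configured maximum.
  def checkLength(line: String, max: Int): Either[Throwable, String] =
    if (line.length > max) Left(new LineTooLongException(line.length, max))
    else Right(line)

  // Callers can single out the "line too long" case without inspecting messages.
  def describe(result: Either[Throwable, String]): String = result match {
    case Left(e: LineTooLongException) => s"too long: ${e.length} > ${e.max}"
    case Left(other)                   => s"other failure: ${other.getMessage}"
    case Right(line)                   => s"ok: $line"
  }
}
```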

@@ -360,12 +383,25 @@ object text {
chunk.foreach { string =>
fillBuffers(stringBuilder, linesBuffer, string)
}
Pull.output(Chunk.buffer(linesBuffer)) >> go(stream, stringBuilder, first = false)

maxLineLength match {
Contributor

There is a LinesBenchmark; could you please run it against the main version and this one?


private def linesImpl[F[_]](
maxLineLength: Option[(Int, RaiseThrowable[F])] = None,
crsOnly: Boolean = false
Contributor

I see that \r can also be a line delimiter.
So, should this be a change in behavior? That is, should the default change from treating \n and \r\n as line delimiters to treating \r, \r\n, and \n as line delimiters? I would expect lines to handle this kind of ambiguity always, not behind a flag. In that case we should check that \r\n doesn't produce two lines.
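
The tricky case for that check is a "\r\n" pair split across two chunks. Below is a minimal sketch, on plain `List[String]` chunks rather than fs2 streams, of carrying a "previous character was \r" flag across chunk boundaries. `ChunkedLinesSketch` and `splitChunks` are hypothetical names, and this is not the implementation from the PR.

```scala
object ChunkedLinesSketch {
  /** Splits a sequence of chunks into lines, treating "\n", "\r\n" and bare "\r" as delimiters.
    * A '\r' at the end of one chunk is remembered so that a '\n' opening the next chunk
    * is recognised as the second half of "\r\n" rather than as another delimiter.
    */
  def splitChunks(chunks: List[String]): List[String] = {
    val lines = List.newBuilder[String]
    val current = new StringBuilder
    var lastWasCr = false // true when the previous character, possibly in an earlier chunk, was '\r'
    chunks.foreach { chunk =>
      chunk.foreach {
        case '\n' if lastWasCr =>
          lastWasCr = false // the line was already emitted when the '\r' was seen
        case '\n' =>
          lines += current.result()
          current.clear()
        case '\r' =>
          lines += current.result()
          current.clear()
          lastWasCr = true
        case c =>
          current.append(c)
          lastWasCr = false
      }
    }
    lines += current.result() // the final (possibly empty) segment
    lines.result()
  }
}
```

With this flag, `List("a\r", "\nb")` splits into `a` and `b` with no empty line in between, while two consecutive bare `\r` characters still produce an empty line.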

Contributor Author

My assumption is that other users wouldn't want the existing behavior to change. If I were the only consumer of this library I'd just have it always treat bare \rs as line separators.

I don't know what the consensus-forming process is here, but I'd be happy to go with the flow and do whatever other stakeholders agree on.

@stephenjudkins
Contributor Author

FYI: Found a couple of corner cases that fail, added tests, working on a fix now.

@stephenjudkins
Contributor Author

So I've thought about this a bit more, and I now think having the lines method split on \r in all cases is the right path forward.

Arguments for:

  • \r is, in fact, a valid newline character and there is real data out there that uses it for whatever reason
  • Having text with bare \rs and also \r\n or \n and only splitting on the latter two is a pretty obscure corner case and anyone depending on having lines with \r in them is relying on some undocumented behavior
  • For \n and \r\n-delimited text we can add this support with very little performance impact, since we're rarely going to hit any of the new conditions
  • linesFor or linesFancy are weird names, but I can't think of better ones. It's annoying that linesFor[F] requires RaiseThrowable[F] even when we're not using maxLineLength, only onlyCrs. Adding multiple methods also gets weird, since in our case we want both functionalities at the same time, and I don't think any type system circus tricks for these methods would be welcome here. onlyCrs is also a weird and confusing parameter name, but again, I can't think of a much better one. If the other method were simply linesLimit[F: RaiseThrowable] things really would seem much cleaner.

Arguments against:

  • macOS/BSD head -n and other utilities don't respect bare \r newlines, so it's obscure
  • Don't change existing behavior

I'm prepared to be swayed either way. Regardless, the two pieces of functionality are very useful for us in real-world contexts (especially the max line length!) so I'd like to see them included. Let me know your thoughts

@mpilquist
Member

@stephenjudkins Thanks! I agree with changing the existing default behavior to cover \r as well as \n and \r\n. I like linesLimit or linesLimited.

@stephenjudkins
Contributor Author

Latest benchmarks:

On main:

sbt:root> benchmark/Jmh/run fs2.benchmark.LinesBenchmark -i 1 -wi 1 -f 1 -t 1
...
[info] Benchmark                      (asciiLineSize)  (chunkSize)   Mode  Cnt       Score   Error  Units
[info] LinesBenchmark.linesBenchmark                0            4  thrpt       129367.374          ops/s
[info] LinesBenchmark.linesBenchmark                0           16  thrpt       127115.654          ops/s
[info] LinesBenchmark.linesBenchmark                0           64  thrpt       121482.553          ops/s
[info] LinesBenchmark.linesBenchmark                1            4  thrpt        27572.723          ops/s
[info] LinesBenchmark.linesBenchmark                1           16  thrpt        61276.567          ops/s
[info] LinesBenchmark.linesBenchmark                1           64  thrpt        68777.173          ops/s
[info] LinesBenchmark.linesBenchmark               10            4  thrpt         8109.786          ops/s
[info] LinesBenchmark.linesBenchmark               10           16  thrpt        39059.947          ops/s
[info] LinesBenchmark.linesBenchmark               10           64  thrpt        56341.091          ops/s
[info] LinesBenchmark.linesBenchmark              100            4  thrpt         1232.612          ops/s
[info] LinesBenchmark.linesBenchmark              100           16  thrpt         7520.345          ops/s
[info] LinesBenchmark.linesBenchmark              100           64  thrpt        12763.142          ops/s

On lines-enhancements:

sbt:root> benchmark/Jmh/run fs2.benchmark.LinesBenchmark -i 1 -wi 1 -f 1 -t 1
...
[info] Benchmark                      (asciiLineSize)  (chunkSize)   Mode  Cnt       Score   Error  Units
[info] LinesBenchmark.linesBenchmark                0            4  thrpt       129716.313          ops/s
[info] LinesBenchmark.linesBenchmark                0           16  thrpt       133247.824          ops/s
[info] LinesBenchmark.linesBenchmark                0           64  thrpt       132027.374          ops/s
[info] LinesBenchmark.linesBenchmark                1            4  thrpt        29923.197          ops/s
[info] LinesBenchmark.linesBenchmark                1           16  thrpt        49368.691          ops/s
[info] LinesBenchmark.linesBenchmark                1           64  thrpt        77137.007          ops/s
[info] LinesBenchmark.linesBenchmark               10            4  thrpt         8076.095          ops/s
[info] LinesBenchmark.linesBenchmark               10           16  thrpt        38300.109          ops/s
[info] LinesBenchmark.linesBenchmark               10           64  thrpt        54354.817          ops/s
[info] LinesBenchmark.linesBenchmark              100            4  thrpt         1212.297          ops/s
[info] LinesBenchmark.linesBenchmark              100           16  thrpt         7655.271          ops/s
[info] LinesBenchmark.linesBenchmark              100           64  thrpt        12687.502          ops/s
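
To make the two tables easier to compare, here is a small sketch that computes the relative throughput change per row from the scores listed above. `BenchmarkDelta` and its members are hypothetical helper names. Both runs used a single measurement iteration and report no error, so these deltas say nothing about how much of the difference is noise.

```scala
object BenchmarkDelta {
  // (asciiLineSize, chunkSize, ops/s on main, ops/s on lines-enhancements),
  // copied from the two benchmark runs above.
  val runs: List[(Int, Int, Double, Double)] = List(
    (0, 4, 129367.374, 129716.313),
    (0, 16, 127115.654, 133247.824),
    (0, 64, 121482.553, 132027.374),
    (1, 4, 27572.723, 29923.197),
    (1, 16, 61276.567, 49368.691),
    (1, 64, 68777.173, 77137.007),
    (10, 4, 8109.786, 8076.095),
    (10, 16, 39059.947, 38300.109),
    (10, 64, 56341.091, 54354.817),
    (100, 4, 1232.612, 1212.297),
    (100, 16, 7520.345, 7655.271),
    (100, 64, 12763.142, 12687.502)
  )

  /** Relative throughput change in percent; positive means the branch is faster. */
  def percentChange(main: Double, branch: Double): Double =
    (branch - main) / main * 100.0

  /** One formatted line per benchmark configuration. */
  def report: List[String] = runs.map { case (lineSize, chunkSize, main, branch) =>
    f"asciiLineSize=$lineSize%3d chunkSize=$chunkSize%2d ${percentChange(main, branch)}%+6.2f%%"
  }
}
```

Calling `BenchmarkDelta.report.foreach(println)` prints one line per configuration.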

@mpilquist mpilquist merged commit 61b593c into typelevel:main Dec 22, 2021
@nikiforo
Contributor

@stephenjudkins

I'm not sure that EOF is handled correctly. Take a look at

import fs2.{Stream, text}

def check3 = {
  val bytes1 = "a\r".getBytes()
  val bytes2 = "a\n".getBytes()
  // Decode each byte stream as UTF-8, split it into lines, and collect the result.
  val lines1 = Stream.emits(bytes1).through(text.utf8.decode).through(text.lines).compile.toList
  val lines2 = Stream.emits(bytes2).through(text.utf8.decode).through(text.lines).compile.toList
  println(lines1.length)
  println(lines2.length)
}

it prints

1
2
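
For comparison, a minimal sketch of a reference check that does not involve fs2: splitting with Java's `String.split` and a negative limit keeps trailing empty strings, so a trailing `\r`, `\n`, or `\r\n` each produce the same number of segments. `TrailingDelimiterCheck` is a hypothetical helper name, and treating this as the expected behavior for `text.lines` is an assumption drawn from the report above.

```scala
object TrailingDelimiterCheck {
  // Splits on "\r\n", bare "\r", or "\n". "\r\n" is listed first so it matches as one delimiter.
  // The -1 limit tells String.split to keep trailing empty strings.
  def splitKeepingTrailing(s: String): List[String] =
    s.split("\r\n|\r|\n", -1).toList
}
```

Under this definition, `"a\r"` and `"a\n"` both split into two segments, which is the consistency the report above is asking for.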

@nikiforo
Contributor

A small addition to the command:

sbt:root> benchmark/Jmh/run fs2.benchmark.LinesBenchmark -i 1 -wi 1 -f 1 -t 1

The more warmups and iterations you run, the more accurate the results are. For instance, I had to increase them from 6 and 10 to 10 and 20 here to reduce errors. Also, 25 iterations were made here. When you have only 1 iteration, the error isn't printed in the output.
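
As a rough illustration of why a single iteration cannot carry an error estimate: any spread-based error needs at least two samples. The sketch below computes a sample mean and the standard error of the mean. It is only an analogy, since JMH computes its Error column in its own way, and `IterationError` is a hypothetical helper name.

```scala
object IterationError {
  /** Returns the sample mean and, when at least two scores exist, the standard error of the mean.
    * With a single score there is no spread to measure, so the error is None.
    */
  def meanAndStdError(scores: List[Double]): Option[(Double, Option[Double])] =
    if (scores.isEmpty) None
    else {
      val n = scores.length
      val mean = scores.sum / n
      val stdError =
        if (n < 2) None
        else {
          // Unbiased sample variance (divide by n - 1), then standard error = sqrt(variance / n).
          val variance = scores.map(s => (s - mean) * (s - mean)).sum / (n - 1)
          Some(math.sqrt(variance / n))
        }
      Some((mean, stdError))
    }
}
```

With more iterations the standard error shrinks roughly with the square root of the sample count, which is the usual reason to raise `-i` and `-wi` when two runs are close.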

@stephenjudkins
Contributor Author

@nikiforo Well, that's unfortunate. Fix here: #2764

@stephenjudkins
Contributor Author

> A small addition to the command:
>
> sbt:root> benchmark/Jmh/run fs2.benchmark.LinesBenchmark -i 1 -wi 1 -f 1 -t 1
>
> The more warmups and iterations you run, the more accurate the results are. For instance, I had to increase them from 6 and 10 to 10 and 20 here to reduce errors. Also, 25 iterations were made here. When you have only 1 iteration, the error isn't printed in the output.

Would love to see some more documentation about benchmarking tradeoffs here. I don't know enough about the specific context to be helpful

@nikiforo
Contributor

Arguments for:

"abc\rdef".lines().collect(java.util.stream.Collectors.toList()).asScala // == `Buffer(abc, def)`
