`text.lines` enhancements #2758

stephenjudkins · 2021-12-17T22:32:29Z

* Add support for input with '\r'-only newlines, as produced by classic Mac OS. Yes, content like this still exists.
* Add support for raising an error when an accumulated line exceeds a certain size; this helps prevent bad or malicious inputs from causing OOMs.

stephenjudkins · 2021-12-17T22:57:34Z

core/shared/src/main/scala/fs2/text.scala

+          maxLineLength match {
+            case Some((max, raiseThrowable)) if stringBuilder.length > max =>
+              Pull.raiseError[F](
+                new IllegalStateException(


Can anyone recommend any other type of exception here instead of IllegalStateException?

rossabaker · 2021-12-19

I don't think it's an IllegalStateException. We didn't call it at an inopportune time; we called it with an input that didn't match the configuration. I think IllegalArgumentException is the closest, but I might just give it its own subtype of RuntimeException.

stephenjudkins · 2021-12-19

Yeah, that seems right to me. I've created a new LineTooLongException.
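
For readers following along, a dedicated exception of the kind discussed above might look roughly like the sketch below. The constructor parameters, the message text, the enclosing `LineLimitSketch` object, and the `checkLength` helper are all illustrative assumptions; the class that was actually merged may differ.

```scala
object LineLimitSketch {
  // Hypothetical shape for a dedicated "line too long" exception.
  // Field names and message are assumptions, not the merged fs2 code.
  final class LineTooLongException(val length: Int, val max: Int)
      extends RuntimeException(
        s"Max line size is $max but $length chars have been accumulated"
      )

  // Illustrative helper: report an error once the accumulated,
  // still-unterminated line has grown past the configured limit.
  def checkLength(
      buffered: StringBuilder,
      max: Int
  ): Either[LineTooLongException, Unit] =
    if (buffered.length > max)
      Left(new LineTooLongException(buffered.length, max))
    else
      Right(())
}
```

Giving the failure its own RuntimeException subtype lets callers match on it specifically instead of catching every IllegalStateException or IllegalArgumentException.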

I haven't run any benchmarks but this should avoid a copy, at least

nikiforo · 2021-12-20T14:43:38Z

core/shared/src/main/scala/fs2/text.scala

@@ -360,12 +383,25 @@ object text {
          chunk.foreach { string =>
            fillBuffers(stringBuilder, linesBuffer, string)
          }
-          Pull.output(Chunk.buffer(linesBuffer)) >> go(stream, stringBuilder, first = false)
+
+          maxLineLength match {


There is a LinesBenchmark; could you please run it against main and against this branch?

nikiforo · 2021-12-20T14:59:44Z

core/shared/src/main/scala/fs2/text.scala

+
+  private def linesImpl[F[_]](
+      maxLineLength: Option[(Int, RaiseThrowable[F])] = None,
+      crsOnly: Boolean = false


I see that \r can also be a line delimiter.
So, should this be a change in default behavior? That is, should the default change from treating \n and \r\n as line delimiters to treating \r, \r\n, and \n as line delimiters? I would expect lines to handle this kind of ambiguity always, not behind a flag. In that case we should check that \r\n doesn't produce two lines.

stephenjudkins · 2021-12-21

My assumption is that other users wouldn't want the existing behavior to change. If I were the only consumer of this library I'd just have it always treat bare \rs as line separators.

I don't know what the consensus-forming process is here, but I'd be happy to go with the flow and do whatever other stakeholders agree on.

stephenjudkins · 2021-12-21T23:23:42Z

FYI: Found a couple of corner cases that fail, added tests, working on a fix now.


stephenjudkins · 2021-12-21T23:57:37Z

So I've thought about this a bit more, and I now think having the lines method split on \rs in all cases is the right path forward.

Arguments for:

* \r is, in fact, a valid newline character, and there is real data out there that uses it for whatever reason.
* Text that mixes bare \rs with \r\n or \n, where only the latter two are split on, is a pretty obscure corner case, and anyone depending on having lines with \r in them is relying on undocumented behavior.
* For \n- and \r\n-delimited text we can add this support with very little performance impact, since we're rarely going to hit any of the new conditions.
* linesFor or linesFancy are weird names, but I can't think of better ones. It's annoying that linesFor[F] requires RaiseThrowable[F] even when we're not using maxLineLength, only onlyCrs. Adding multiple methods also gets weird, since in our case we want both functionalities at the same time, and I don't think any type system circus tricks for these methods would be welcome here. onlyCrs is also a weird and confusing parameter name, but again, I can't think of a much better one. If the other method were simply linesLimit[F: RaiseThrowable], things really would seem much cleaner.

Arguments against:

* macOS/BSD head -n and other utilities don't respect bare \r newlines, so it's obscure.
* Don't change existing behavior.

I'm prepared to be swayed either way. Regardless, the two pieces of functionality are very useful for us in real-world contexts (especially the max line length!) so I'd like to see them included. Let me know your thoughts

mpilquist · 2021-12-22T00:10:28Z

@stephenjudkins Thanks! I agree with changing the existing default behavior to cover \r as well as \n and \r\n. I like linesLimit or linesLimited.
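
The delimiter rule agreed on here (split on \n, \r\n, and bare \r, with \r\n counting as a single separator) can be illustrated with a small, non-streaming sketch. This is plain Scala over a complete String, not the fs2 implementation; `LineSplitSketch` and `splitLines` are hypothetical names, and emitting a trailing empty line after a final separator is a choice made for this sketch.

```scala
object LineSplitSketch {
  // Splits on "\n", "\r\n" and bare "\r". A "\r\n" pair is ONE separator.
  def splitLines(input: String): List[String] = {
    val lines = List.newBuilder[String]
    val current = new StringBuilder
    var i = 0
    while (i < input.length) {
      input.charAt(i) match {
        case '\n' =>
          lines += current.toString
          current.clear()
        case '\r' =>
          lines += current.toString
          current.clear()
          // Swallow the '\n' of a "\r\n" pair so it does not
          // produce an extra empty line.
          if (i + 1 < input.length && input.charAt(i + 1) == '\n') i += 1
        case other =>
          current.append(other)
      }
      i += 1
    }
    // Whatever follows the last separator is the final line (possibly empty).
    lines += current.toString
    lines.result()
  }
}
```

For example, `LineSplitSketch.splitLines("a\r\nb\rc\nd")` yields `List("a", "b", "c", "d")`: the \r\n between "a" and "b" does not produce two lines.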

stephenjudkins · 2021-12-22T00:43:01Z

Latest benchmarks:

On main:

sbt:root> benchmark/Jmh/run fs2.benchmark.LinesBenchmark -i 1 -wi 1 -f 1 -t 1
...
[info] Benchmark                      (asciiLineSize)  (chunkSize)   Mode  Cnt       Score   Error  Units
[info] LinesBenchmark.linesBenchmark                0            4  thrpt       129367.374          ops/s
[info] LinesBenchmark.linesBenchmark                0           16  thrpt       127115.654          ops/s
[info] LinesBenchmark.linesBenchmark                0           64  thrpt       121482.553          ops/s
[info] LinesBenchmark.linesBenchmark                1            4  thrpt        27572.723          ops/s
[info] LinesBenchmark.linesBenchmark                1           16  thrpt        61276.567          ops/s
[info] LinesBenchmark.linesBenchmark                1           64  thrpt        68777.173          ops/s
[info] LinesBenchmark.linesBenchmark               10            4  thrpt         8109.786          ops/s
[info] LinesBenchmark.linesBenchmark               10           16  thrpt        39059.947          ops/s
[info] LinesBenchmark.linesBenchmark               10           64  thrpt        56341.091          ops/s
[info] LinesBenchmark.linesBenchmark              100            4  thrpt         1232.612          ops/s
[info] LinesBenchmark.linesBenchmark              100           16  thrpt         7520.345          ops/s
[info] LinesBenchmark.linesBenchmark              100           64  thrpt        12763.142          ops/s

On lines-enhancements:

sbt:root> benchmark/Jmh/run fs2.benchmark.LinesBenchmark -i 1 -wi 1 -f 1 -t 1
...
[info] Benchmark                      (asciiLineSize)  (chunkSize)   Mode  Cnt       Score   Error  Units
[info] LinesBenchmark.linesBenchmark                0            4  thrpt       129716.313          ops/s
[info] LinesBenchmark.linesBenchmark                0           16  thrpt       133247.824          ops/s
[info] LinesBenchmark.linesBenchmark                0           64  thrpt       132027.374          ops/s
[info] LinesBenchmark.linesBenchmark                1            4  thrpt        29923.197          ops/s
[info] LinesBenchmark.linesBenchmark                1           16  thrpt        49368.691          ops/s
[info] LinesBenchmark.linesBenchmark                1           64  thrpt        77137.007          ops/s
[info] LinesBenchmark.linesBenchmark               10            4  thrpt         8076.095          ops/s
[info] LinesBenchmark.linesBenchmark               10           16  thrpt        38300.109          ops/s
[info] LinesBenchmark.linesBenchmark               10           64  thrpt        54354.817          ops/s
[info] LinesBenchmark.linesBenchmark              100            4  thrpt         1212.297          ops/s
[info] LinesBenchmark.linesBenchmark              100           16  thrpt         7655.271          ops/s
[info] LinesBenchmark.linesBenchmark              100           64  thrpt        12687.502          ops/s

nikiforo · 2021-12-22T21:31:41Z

@stephenjudkins

I'm not sure that EOF is handled correctly. Take a look at

import fs2.{Stream, text}

def check3: Unit = {
  // Same content, terminated once by a bare '\r' and once by '\n'
  val bytes1 = "a\r".getBytes()
  val bytes2 = "a\n".getBytes()
  val lines1 = Stream.emits(bytes1).through(text.utf8.decode).through(text.lines).compile.toList
  val lines2 = Stream.emits(bytes2).through(text.utf8.decode).through(text.lines).compile.toList
  println(lines1.length)
  println(lines2.length)
}

it prints

1
2
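
One way to see why a trailing \r is awkward in a streaming setting: when a chunk ends in \r, the splitter cannot yet know whether the next chunk starts with the \n of a \r\n pair. The sketch below is plain Scala over a list of string chunks, not the fs2 code and not the fix referenced later in this thread. It emits the line as soon as \r is seen and remembers a "pending CR" flag so that a following \n is swallowed. All names (`ChunkedLinesSketch`, `State`, `step`, `finish`) are hypothetical.

```scala
object ChunkedLinesSketch {
  // State carried across chunks: the unterminated partial line, and whether
  // the last character seen was a '\r' that already ended a line.
  final case class State(partial: String, pendingCr: Boolean)

  // Consume one chunk, returning the new state and the lines completed by it.
  def step(state: State, chunk: String): (State, List[String]) = {
    val out = List.newBuilder[String]
    val current = new StringBuilder(state.partial)
    var pendingCr = state.pendingCr
    chunk.foreach { c =>
      if (pendingCr && c == '\n') {
        // Second half of a "\r\n" pair: the line was already emitted on '\r'.
        pendingCr = false
      } else {
        pendingCr = false
        c match {
          case '\n' =>
            out += current.toString
            current.clear()
          case '\r' =>
            out += current.toString
            current.clear()
            pendingCr = true
          case other =>
            current.append(other)
        }
      }
    }
    (State(current.toString, pendingCr), out.result())
  }

  // At end of input the remaining partial line is emitted even when empty,
  // so "a\r" and "a\n" are treated identically. (Simplification: an entirely
  // empty input also yields one empty line in this sketch.)
  def finish(state: State): List[String] = List(state.partial)

  def lines(chunks: List[String]): List[String] = {
    val start = (State("", pendingCr = false), List.empty[String])
    val (end, emitted) = chunks.foldLeft(start) { case ((st, acc), chunk) =>
      val (next, completed) = step(st, chunk)
      (next, acc ++ completed)
    }
    emitted ++ finish(end)
  }
}
```

With this scheme `lines(List("a\r"))` and `lines(List("a\n"))` both contain two elements, and a \r\n pair split across chunks, as in `lines(List("a\r", "\nb"))`, still counts as a single separator.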

nikiforo · 2021-12-22T21:52:55Z

A small addition to the command:

sbt:root> benchmark/Jmh/run fs2.benchmark.LinesBenchmark -i 1 -wi 1 -f 1 -t 1

The more warmups and iterations, the more accurate the results are. For instance, I had to increase them from 6 and 10 to 10 and 20 here to reduce errors. Also, 25 iterations were made here. When you have only 1 iteration, the error isn't printed in the output.

stephenjudkins · 2021-12-22T21:54:30Z

@nikiforo well, that's unfortunate. The fix is in #2764.

stephenjudkins · 2021-12-22T21:56:44Z

> A small addition to the command:
>
> sbt:root> benchmark/Jmh/run fs2.benchmark.LinesBenchmark -i 1 -wi 1 -f 1 -t 1
>
> The more warmups and iterations, the more accurate the results are. For instance, I had to increase them from 6 and 10 to 10 and 20 here to reduce errors. Also, 25 iterations were made here. When you have only 1 iteration, the error isn't printed in the output.

Would love to see some more documentation about benchmarking tradeoffs here. I don't know enough about the specific context to be helpful

nikiforo · 2021-12-22T21:57:24Z

> Arguments for:

"abc\rdef".lines().collect(java.util.stream.Collectors.toList()).asScala // == `Buffer(abc, def)`

text.lines enhancements:

a6c6b01


stephenjudkins changed the title ~~text.lines enhancements:~~ text.lines enhancements Dec 17, 2021

stephenjudkins added 5 commits December 17, 2021 14:35

Fix docs, change some param names

a56fe21

format

6e503a4

scalafmt

e05d9b2

scalafmt

601e72c

Fix for scala 2.12 build

9d46873


stephenjudkins added 3 commits December 19, 2021 15:49

Use custom exception for too-long line

f6adcbe

Don't use deprecated method, for good measure.

960e1fc

I haven't run any benchmarks but this should avoid a copy, at least

Scala 2.12 fix, again

f54d4fd


Test that existing CRLF-delimited text is split appropriately when `crsOnly` is enabled

68a2a59

Always split lines on '\r'-delimited input; simplify methods

774cb3b

stephenjudkins force-pushed the sdj/lines-enhancements branch from 4205e1a to 774cb3b Compare December 22, 2021 00:38

unsure how that compiled

e8c89f2

mpilquist merged commit 61b593c into typelevel:main Dec 22, 2021
