-
Notifications
You must be signed in to change notification settings - Fork 603
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
text.lines
enhancements
#2758
text.lines
enhancements
#2758
Conversation
stephenjudkins
commented
Dec 17, 2021
- add support for input with '\r'-only newlines, like from macos classic. yes, content like this still exists
- add support to throw an error when an accumulated line is over a certain size; good to prevent bad/malicious inputs from causing OOMs
* add support for input with '\r'-only newlines, like from macos classic. yes, content like this still exists * add support to throw an error when an accumulated line is over a certain size; good to prevent bad/malicious inputs from causing OOMs
maxLineLength match { | ||
case Some((max, raiseThrowable)) if stringBuilder.length > max => | ||
Pull.raiseError[F]( | ||
new IllegalStateException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can anyone recommend any other type of exception here instead of IllegalStateException?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's an IllegalStateException
. We didn't call it at an inopportune time: we called it with an input that didn't match the configuration. I think IllegalArgumentException
is the closest, but I might just give it its own type of RuntimeException
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that seems right to me. I've created a new LineTooLongException
I haven't run any benchmarks but this should avoid a copy, at least
@@ -360,12 +383,25 @@ object text { | |||
chunk.foreach { string => | |||
fillBuffers(stringBuilder, linesBuffer, string) | |||
} | |||
Pull.output(Chunk.buffer(linesBuffer)) >> go(stream, stringBuilder, first = false) | |||
|
|||
maxLineLength match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is LinesBenchmark
, can you, please, run it against main
version and this one?
|
||
private def linesImpl[F[_]]( | ||
maxLineLength: Option[(Int, RaiseThrowable[F])] = None, | ||
crsOnly: Boolean = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, that \r
can also be a line delimiter.
So, should it be a regression? i.e. should the default behavior change from treating \n
& \r\n
as a line delimiter to treating \r
, \r\n
, \n
as a line delimiter? I would expect lines
to handle this kind of ambiguity not with a flag, but always. In that case we should check that \r\n
doesn’t produce two lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My assumption is that other users wouldn't want the existing behavior to change. If I were the only consumer of this library I'd just have it always treat bare \r
s as line separators.
I don't know about the consensus-forming process is here. But I'd be happy to go with the flow and do whatever other stakeholders agree on here.
FYI: Found a couple corner cases that fail, added texts, working on fix now |
…rsOnly` is enabled
So I've thought about this a bit more and actually think having the Arguments for:
Arguments against:
I'm prepared to be swayed either way. Regardless, the two pieces of functionality are very useful for us in real-world contexts (especially the max line length!) so I'd like to see them included. Let me know your thoughts |
@stephenjudkins Thanks! I agree with changing the existing default behavior to cover \r as well as \n and \r\n. I like |
4205e1a
to
774cb3b
Compare
Latest benchmarks: On
On
|
I'm not sure that EOF is handled correctly. Take a look at def check3 = {
val bytes1 = "a\r".getBytes()
val bytes2 = "a\n".getBytes()
val lines1 = Stream.emits(bytes1).through(text.utf8.decode).through(text.lines).compile.toList
val lines2 = Stream.emits(bytes2).through(text.utf8.decode).through(text.lines).compile.toList
println(lines1.length)
println(lines2.length)
} it prints
|
A small addition to the command:
The more warmups and iterations, the more accurate results are. For instance, I had to increase them from 6 and 10 to 10 and 20 here to reduce errors. Also, 25 iterations were made here. When you have only 1 iteration, the error isn't printed in the output. |
Would love to see some more documentation about benchmarking tradeoffs here. I don't know enough about the specific context to be helpful |
"abc\rdef".lines().collect(java.util.stream.Collectors.toList()).asScala // == `Buffer(abc, def)` |