Bump PosLen capacity up to 80 bits #98

quinnj · 2021-10-22T06:12:01Z

cc: @nickrobinson251 @bkamins @nalimilan

I'm exploring solutions to JuliaData/CSV.jl#935. If you look at the first commit ("code"), I have 2 PosLen types: the normal PosLen with 64 bits, and PosLen80 with 80 bits. As I started to look into how we would use that in CSV.jl, it got real messy real fast. We'd have to allow switching based on PosLen type and pass it through everywhere, and it would be a mess to make sure performance doesn't regress anywhere. I'm not saying it's impossible, just some real non-trivial work. There are additional complications because PosLenString/PosLenStringVector are hard-coded for PosLen right now, so we'd either have to make them parameterized on the AbstractPosLen subtype, or not allow stringtype=PosLenString to avoid the mess there.

Alternatively, in the 2nd commit, I redefine the existing PosLen to have 80 bits, which results in a maximum pos of ~70TB and a maximum single cell length of ~4GB. There are only 2-3 failing tests in Parsers.jl with this change, mainly from the removal of constants that are referenced in tests. I also ran CSV.jl's tests, and with some minor changes to the use of internal Parsers.jl consts, the tests pass there too. So the question is: are we ok w/ making the PosLen size go from 64 bits to 80 bits everywhere? Obviously that will result in more memory in the stringtype=PosLenString case, but surprisingly the impact would be pretty minimal otherwise. We use PosLen in several places during the "detection" code, but that's only looking at a small sample of rows here and there or parsing column names, so it's unlikely to make a noticeable difference in memory usage.

Why 80 bits? It seemed like the smallest increase that gives us the largest boost in pos/len values. Note that the extra 16 bits vs. the current 64-bit PosLen are split with 12 bits going to len and only 4 to pos. pos already had a max size of ~4TB, which seems like a pretty reasonable maximum already. With 12 extra bits for the len, we go from a max single cell size of ~1MB to ~4GB, which also seems like a pretty generous maximum. 80 bits also seems pretty reasonable from an alignment standpoint, since it's a multiple of 16 at least, which I believe is the default alignment value for Julia arrays/strings.
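The capacity figures above follow from simple powers of two. A minimal sketch of that arithmetic, assuming field widths of 42 bits for pos and 20 bits for len in the 64-bit layout (inferred from the ~4TB and ~1MB figures, not read from the Parsers.jl source); `capacity` and the variable names are hypothetical helpers for illustration only:

```julia
# Illustrative arithmetic only: the field widths below are inferred from the
# capacities quoted in this comment, not taken from the Parsers.jl source.

# Number of distinct values representable in an nbits-wide unsigned field.
capacity(nbits) = Int64(1) << nbits

pos_bits_64, len_bits_64 = 42, 20      # assumed 64-bit layout: ~4TB pos, ~1MB len
pos_bits_80 = pos_bits_64 + 4          # 4 of the 16 extra bits go to pos
len_bits_80 = len_bits_64 + 12         # the other 12 go to len

for (name, pos_bits, len_bits) in (("64-bit", pos_bits_64, len_bits_64),
                                   ("80-bit", pos_bits_80, len_bits_80))
    println(name, ": max pos = ", capacity(pos_bits), " bytes, max len = ",
            capacity(len_bits), " bytes")
end
```

Under these assumed widths, the 16 extra bits are fully accounted for by the 4 + 12 split, and any bits left over in the 64-bit layout are unchanged.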

Anyway, I'm going to let this simmer in my mind for a bit and mull things over, but I'm leaning towards going forward with it.

bkamins · 2021-10-22T07:30:42Z

I think switching to 80 is OK.

nalimilan · 2021-10-23T13:32:48Z

Sounds reasonable, but I'd double-check the alignment issues, as AFAICT a Vector of 80-bit values takes as much space as a Vector of 128-bit values:

julia> sizeof(fill((1, Int16(1)), 1000))/1000
16.0

julia> sizeof(fill((1, 1), 1000))/1000
16.0

Also FWIW it seems that arrays are 64-bit aligned when they get large.
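The measurement above can be broken down into payload versus padding. A small sketch, assuming a typical 64-bit platform and using tuple types as stand-ins (these are not the PR's PosLen types; `payload_bytes` and `stored_bytes` are hypothetical helper names):

```julia
# Sketch: compare the bytes that carry data with the bytes each element occupies.
# Tuple types are stand-ins here, not the PosLen types from the PR.
T80  = Tuple{Int64, Int16}   # 80 bits of payload
T128 = Tuple{Int64, Int64}   # 128 bits of payload

payload_bytes(T) = sum(sizeof, fieldtypes(T))  # bytes that actually hold field data
stored_bytes(T)  = sizeof(T)                   # bytes per element, padding included

for T in (T80, T128)
    println(T, ": payload = ", payload_bytes(T), " bytes, stored = ",
            stored_bytes(T), " bytes")
end
```

The difference between the two numbers for the 80-bit stand-in is padding added so that each element stays aligned for its widest field, which is why both vectors report the same per-element size.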

quinnj · 2021-10-26T03:36:47Z

Yeah, upon further investigation, there are some very foundational alignment requirements, even for primitive types, which mean that PosLen will always take up 128 bits. Hmmmm, I'll have to think about this a little more; unfortunately, 128 bits means we effectively "waste" quite a lot of bits for what is, IMO, a rare use case. Perhaps there's an alternative solution in CSV.jl where we can avoid using PosLen to parse large strings at the user's request.

quinnj added 2 commits October 21, 2021 23:18

code (3b5cf62)

Single PosLen w/ 80-bit capacity (0da4a2d)

quinnj mentioned this pull request Oct 22, 2021

Unable to parse really long CSV cell (breaks Parsers.jl) JuliaData/CSV.jl#935 (Open)

nickrobinson251 mentioned this pull request Jun 20, 2022

Parsing fails with long strings JuliaData/CSV.jl#1009 (Closed)
