cc: @nickrobinson251 @bkamins @nalimilan
I'm exploring solutions to JuliaData/CSV.jl#935. If you look at the first commit ("code"), I have 2 PosLen types: the normal `PosLen` with 64 bits, and `PosLen80` with 80 bits. As I started to look into how we would use that in CSV.jl, it got real messy real fast. We'd have to allow switching based on the PosLen type and pass it through everywhere, and it would be a mess to ensure performance stays correct throughout. I'm not saying it's impossible, just some real non-trivial work. There are additional complications because `PosLenString`/`PosLenStringVector` are hard-coded for `PosLen` right now, so we'd either have to make them parameterized on the `AbstractPosLen` subtype, or not allow `stringtype=PosLenString` to avoid the mess there.

Alternatively, in the 2nd commit, I redefine the existing
`PosLen` to have 80 bits, which results in a maximum `pos` of ~70TB and a maximum single cell length of ~4GB. There are only 2-3 failing tests in Parsers.jl with this change, mainly from the removal of constants that are referenced in tests. I also ran CSV.jl's tests, and with some minor changes to the use of internal Parsers.jl consts, the tests pass there too. So the question is: are we ok w/ making the `PosLen` size go from 64 bits to 80 bits everywhere? Obviously that will result in more memory in the `stringtype=PosLenString` case, but surprisingly the impact would be pretty minimal otherwise. We use `PosLen` in several places during the "detection" code, but that's only looking at a small sample of rows here and there or parsing column names, so it's unlikely to make a noticeable difference in memory usage.

Why 80 bits? It seemed like the smallest increase that gives us the largest boost in
`pos`/`len` values. Note that the extra 16 bits vs. the current 64-bit `PosLen` are allocated with 12 bits to `len` and only 4 to `pos`. `pos` already had a max size of ~4TB, which seems like a pretty reasonable maximum already. With 12 extra bits for the `len`, we go from a max single cell size of ~1MB to ~4GB, which also seems like a pretty generous maximum. 80 bits also seems pretty reasonable from an alignment standpoint, since it's a multiple of 16 at least, which I believe is the default alignment value for Julia arrays/strings.

Anyway, I'm going to let this simmer in my mind for a bit and mull things over, but I'm leaning towards going forward with it.
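
To make the bit budget concrete, here's a rough sketch of the arithmetic. It is not the code from either commit. The names (`FLAG_BITS`, `pack`, `unpack_pos`, `unpack_len`) are made up for illustration, the value is held in a `UInt128` purely for convenience, and the field order is an assumption. The 2 leftover bits are inferred from the numbers above (64 - 42 - 20 = 2) and assumed to stay reserved for flags.

```julia
# Illustrative only: hypothetical names, not the actual Parsers.jl definitions.
const OLD_POS_BITS = 42                 # 2^42 bytes, the ~4TB max pos of the 64-bit layout
const OLD_LEN_BITS = 20                 # 2^20 bytes, the ~1MB max len of the 64-bit layout
const FLAG_BITS    = 64 - OLD_POS_BITS - OLD_LEN_BITS   # assumed: the remaining 2 bits
const POS_BITS     = OLD_POS_BITS + 4   # 46 bits
const LEN_BITS     = OLD_LEN_BITS + 12  # 32 bits

@assert FLAG_BITS + POS_BITS + LEN_BITS == 80

const MAX_POS = (UInt128(1) << POS_BITS) - 1   # 2^46 - 1, roughly 70TB
const MAX_LEN = (UInt128(1) << LEN_BITS) - 1   # 2^32 - 1, roughly 4GB

# Pack pos and len into the low 78 bits of a UInt128, with len in the lowest bits.
# (Assumed field order; the real layout may differ.)
function pack(pos::Integer, len::Integer)
    0 <= pos <= MAX_POS || throw(ArgumentError("pos out of range"))
    0 <= len <= MAX_LEN || throw(ArgumentError("len out of range"))
    return (UInt128(pos) << LEN_BITS) | UInt128(len)
end

unpack_pos(x::UInt128) = Int((x >> LEN_BITS) & MAX_POS)
unpack_len(x::UInt128) = Int(x & MAX_LEN)
```

For example, `pack(2^46 - 1, 2^32 - 1)` round-trips through `unpack_pos`/`unpack_len`, while `pack(2^46, 0)` throws because the position no longer fits in 46 bits.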