On-the-fly cut filtering + improvements in arrow/lazy-dict #294

pzelasko · 2021-05-04T15:47:52Z

This PR adds a .filter() method to samplers, which is useful when dealing with large, lazily loaded manifests. I reached out to the Arrow developers and learned that Arrow already supports automatic schema inference in Python; I just hadn't discovered it, so I'm removing the code related to that. I also optimized lazy reading a bit by making the chunks larger (which means less frequent disk reads) and moved to the random-access Arrow format (not leveraged yet, but I have some ideas; we'll see).
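To illustrate the idea of on-the-fly filtering, here is a minimal sketch of how a sampler could apply a predicate while iterating over a lazy manifest, without loading it into memory first. It is not the code from this PR: `Cut`, `LazyFilteringSampler`, and `lazy_cuts` are hypothetical stand-ins written only for this example, and the real sampler API may differ.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class Cut:
    # Hypothetical stand-in for a manifest entry; only the fields used below.
    id: str
    duration: float


class LazyFilteringSampler:
    """Hypothetical sampler that batches cuts from a lazily read iterable."""

    def __init__(self, cuts: Iterable[Cut], batch_size: int) -> None:
        self.cuts = cuts
        self.batch_size = batch_size
        self._predicates: List[Callable[[Cut], bool]] = []

    def filter(self, predicate: Callable[[Cut], bool]) -> "LazyFilteringSampler":
        # Only register the predicate; nothing is read from the manifest here.
        self._predicates.append(predicate)
        return self

    def __iter__(self) -> Iterator[List[Cut]]:
        batch: List[Cut] = []
        for cut in self.cuts:
            # Predicates run one cut at a time, as the cuts are read.
            if not all(p(cut) for p in self._predicates):
                continue
            batch.append(cut)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:
            # Emit the final, possibly smaller, batch.
            yield batch


def lazy_cuts(n: int) -> Iterator[Cut]:
    # Generator standing in for a manifest that is read chunk by chunk.
    for i in range(n):
        yield Cut(id=f"cut-{i}", duration=float(i))


sampler = LazyFilteringSampler(lazy_cuts(10), batch_size=3)
sampler = sampler.filter(lambda cut: cut.duration >= 4.0)
for batch in sampler:
    print([cut.id for cut in batch])
```

Because the predicate is evaluated during iteration, cuts that fail it are skipped as they are read, so the manifest never has to be materialized and filtered up front.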


pzelasko added 5 commits May 4, 2021 09:38

Increase JSONL block size and simplify the arrow schema inference

44eb32a

Add ".filter()" method to samplers

46e62d8

Merge branch 'master' into feature/lazy-manifests-impr2

7e664db

Enable reading chunks from URLs with a warning

79796b9

Merge remote-tracking branch 'origin/feature/lazy-manifests-impr2' in…

fa3dc67

…to feature/lazy-manifests-impr2

pzelasko added this to the v0.7 milestone May 4, 2021

pzelasko merged commit 68bad38 into master May 4, 2021

pzelasko deleted the feature/lazy-manifests-impr2 branch July 1, 2021 01:24
