-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: Add some comments #16008
refactor: Add some comments #16008
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nameexhaustion FYI
@@ -54,6 +55,8 @@ pub(crate) fn get_offsets( | |||
} | |||
} | |||
|
|||
/// Reads bytes from `file` to `buf` and returns pointers into `buf` that can be parsed. | |||
/// TODO! this can be implemented without copying by pointing in the memmapped file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently this reads from file
into buf
and then returns offsets (SyncPtr<u8>, usize)
, which is semantically (*const u8, usize)
, being the starting offset ptr and the length of the chunk that can be parsed.
These pointers currently point into buf
. However we can directly mmap the file and point into the memory mapped slice directly. This saves an explicit read and reslice
(copy remaningin bytes to start of buf).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking through and noticed that we also have an mmap-based reader at:
pub struct BatchedCsvReaderMmap<'a> { |
It looks like we switch between the 2 readers depending on low_memory
:
polars/crates/polars-pipe/src/executors/sources/csv.rs
Lines 92 to 100 in 414e5f6
let batched_reader = if options.low_memory { | |
let batched_reader = unsafe { Box::new((*reader).batched_borrowed_read()?) }; | |
let batched_reader = Box::leak(batched_reader) as *mut BatchedCsvReaderRead; | |
Either::Right(batched_reader) | |
} else { | |
let batched_reader = unsafe { Box::new((*reader).batched_borrowed_mmap()?) }; | |
let batched_reader = Box::leak(batched_reader) as *mut BatchedCsvReaderMmap; | |
Either::Left(batched_reader) | |
}; |
I'm thinking we can maybe just always use the mmap reader? I don't think mmap should be memory intensive, and we could also adjust how far we seek as well to accomodate low memory conditions.
@@ -109,18 +112,23 @@ impl<'a> ChunkReader<'a> { | |||
self.buf_end = 0; | |||
} | |||
|
|||
fn return_slice(&self, start: usize, end: usize) -> (usize, usize) { | |||
fn return_slice(&self, start: usize, end: usize) -> (SyncPtr<u8>, usize) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return SyncPtr
as it is more clear that we return pointers that way.
/// | ||
/// This will make a pointer sync and send. | ||
/// Ensure that you don't break aliasing rules. | ||
pub unsafe fn from_const(ptr: *const T) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating this isn't unsafe.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #16008 +/- ##
==========================================
- Coverage 80.95% 80.91% -0.05%
==========================================
Files 1384 1384
Lines 178151 178181 +30
Branches 3043 3050 +7
==========================================
- Hits 144229 144179 -50
- Misses 33439 33513 +74
- Partials 483 489 +6
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
No description provided.