-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#3417: Fix read_csv with skiprows
and header
parameters
#3419
Conversation
2e4762b
to
affc60e
Compare
Codecov Report
@@ Coverage Diff @@
## master #3419 +/- ##
===========================================
- Coverage 85.61% 48.98% -36.63%
===========================================
Files 201 187 -14
Lines 16852 16034 -818
===========================================
- Hits 14428 7855 -6573
- Misses 2424 8179 +5755
Continue to review full report at Codecov.
|
d970de1
to
0b524c8
Compare
PR is ready for review. Changes performance measurements (no performance degradation was obtained):
|
@amyskov if we need so much code for each edge case, then the code base for IO functions will be very difficult to maintain. |
You are right, such edge cases bring logic complication. These edge cases occurs because of complex interconnections of |
What if we delete rows after the dataframe is created? |
We can do that, but we still should know what to skip. See pandas example below:
Output:
In this case rows |
Hi @modin-project/modin-core please take a look at this PR. |
# when `skiprows_md` contains `in` operator | ||
mod_index = skiprows_md(index_range) | ||
assert is_list_like(mod_index) | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why catch all exception types? This is not ideal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, exception types were clarified.
|
||
Returns | ||
------- | ||
skiprows : int, array or callable | ||
Updated skiprows parameter. | ||
skiprows_md : int, array or callable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is md
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
md
stands for Modin. In the context of this and csv_dispatcher._read
functions skiprows
is original pandas.read_csv
skiprows
parameter, while skiprows_md
is parameter version modified for Modin specific purposes (sorting of list-like parameter, etc.)
0b524c8
to
df35302
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to agree with @anmyachev about the complexity of the code.
pandas read_csv
has a lot of edge cases and parameter combinations. It would be good to try to solve more generally or in a more compact and maintainable way.
I know this is blocking other issues, I am ok to proceed with this one, but adding 100+ lines for each combination of parameters will make the code unmaintainable.
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
It was done in attempt to localize all manipulations with |
is_list_like(skiprows_md) and skiprows_md[0] < header_size | ||
) or ( | ||
callable(skiprows) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you use skiprows
and skiprowd_md
variables at the same time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done in that way because in the case of list-like skiprows
we need already sorted array to get the smallest element (the first element in the skiprows_md
array) but in the case of callable skiprows
we can't get skiprows_md
since it's value already shifted on header_size
value and we can't get intersection information from it (so we need unshifted skiprows
callable).
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
6bc8cbd
to
a054a26
Compare
…r.py Co-authored-by: Anatoly Myachev <anatoly.myachev@intel.com> Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
8623bc7
to
c04de5b
Compare
@anmyachev @devin-petersohn do you have some additional comments? |
@amyskov, I think we can merge this PR after @devin-petersohn's comments are addressed and conflicts are resolved. |
Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@amyskov, please update release notes. |
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
Right. Release notes doc was updated. |
Signed-off-by: Alexander Myskov alexander.myskov@intel.com
What do these changes do?
PR fixes corner cases of
read_csv
function withskiprows
andheader
parameters, which values have intersections. Also manipulations withskiprows
parameter were localized in_manage_skiprows_parameter
function.flake8 modin
black --check modin
git commit -s
read_csv
fails withskiprows
andheader
parameters passed #3417docs/developer/architecture.rst
is up-to-date