-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12675: [C++] CSV parsing report row on which error occurred #10321
Conversation
9a51fa3
to
57b90e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this. I left some comments about the tests. I'm not as familiar with the CSV parser so we'll pull in someone else to take a look there too, but this looks good overall.
63b663c
to
ca1a124
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just another nit about the tests.
93e00c7
to
6853b3d
Compare
Antoine, could you take a quick look here as you're more familiar with the CSV parser? The changes here look minimally invasive + having the extra error context would be nice. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for doing this. Here are some suggestions and questions.
python/pyarrow/tests/test_csv.py
Outdated
csv.write(linesep) | ||
for row in arr.T: | ||
csv.write(",".join(map(str, row))) | ||
csv.write(linesep) | ||
csv = csv.getvalue().encode() | ||
columns = [pa.array(a, type=pa.int64()) for a in arr] | ||
expected = pa.Table.from_arrays(columns, col_names) | ||
expected = pa.Table.from_arrays( | ||
columns, col_names) if write_names else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the condition is for here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If write_names is false then col_names is not set so the Table cannot be created. I'll change this to be a condition on col_names because that is a bit more obvious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or you can set col_names
unconditionally, which will be less fragile IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then the column names returned by the csv parser may be different than the table or is that not an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the test should ensure that the column names are the same.
cpp/src/arrow/csv/reader.cc
Outdated
: io_context_(std::move(io_context)), | ||
read_options_(read_options), | ||
parse_options_(parse_options), | ||
convert_options_(convert_options), | ||
num_rows_seen_(count_rows ? 1 : -1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is weird. Why not have a separate member bool count_rows_
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to reduce the number of member variables. Using a separate bool might be a bit clearer but it doesn't simplify the code any and you end up having one variable indicating if another variable is being used and using -1/<0 to indicate disabled is common enough I didn't think it obfuscated the intent too much.
If you feel strongly there should be two member variables to track row count I can make that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having separate member variables would be clearer IMO.
cpp/src/arrow/csv/parser.h
Outdated
auto start = values[pos].offset; | ||
auto stop = values[pos + 1].offset; | ||
auto quoted = values[pos + 1].quoted; | ||
ARROW_RETURN_NOT_OK(visit(parsed_ + start, stop - start, quoted)); | ||
Status status = visit(parsed_ + start, stop - start, quoted); | ||
if (ARROW_PREDICT_FALSE(first_row >= 0 && !status.ok())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion:
if (ARROW_PREDICT_FALSE(!status.ok())) {
if (first_row >= 0) {
status = ...
}
return status;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ARROW_RETURN_NOT_OK adds the extra context when that is enabled so I think it would be better to keep that around or add a new macro which doesn't check status just adds the context and returns when it is enabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're right. But you should still ensure that !status.ok()
is the first condition inside ARROW_PREDICT_FALSE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated it to be almost what you suggested but I kept the ARROW_RETURN_NOT_OK around status. If there is a desire to add an ARROW_RETURN_WITh_CONTEXT macro I am happy to add that and modify the other macros to use it. It only gets rid of a duplicate status.ok() check which isn't much overhead so probably not really worth it.
8482c8f
to
fe78690
Compare
@ursabot please benchmark |
Benchmark runs are scheduled for baseline = 861b5da and contender = fe78690. Results will be available as each benchmark for each run completes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the updates! Just a remaining question.
fe78690
to
a5b7d4d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thank you @n3world
…h error Add the line which has the incorrect column count to the output so it is easier to identify in large inputs. Authored-by: Nate Clark <nate@neworld.us> Signed-off-by: Nate Clark <nate@neworld.us>
Track the row number for readers which process blocks serially and report the row number in column mismatch message.
If the DataBatch::VisitColoumn visitor returns an error status prepend the row number on which the error occorred to add context.
a5b7d4d
to
c1178d2
Compare
I've rebased on master to try and fix the AppVeyor CI issue. |
For serial CSV readers track the absolute row number and report it in errors encountered during parsing or converting. I did try to get row numbers for the parallel reader but the only way I thought that could work would be to add delimiter counting to the Chunker but that seemed to add more complexity than I wanted to. Closes apache#10321 from n3world/ARROW-12675-report_rows Authored-by: Nate Clark <nate@neworld.us> Signed-off-by: Antoine Pitrou <antoine@python.org>
For serial CSV readers track the absolute row number and report it in errors encountered during parsing or converting.
I did try to get row numbers for the parallel reader but the only way I thought that could work would be to add delimiter counting to the Chunker but that seemed to add more complexity than I wanted to.