
ARROW-12675: [C++] CSV parsing report row on which error occurred #10321

Closed
wants to merge 4 commits

Conversation

@n3world (Contributor) commented May 14, 2021

For serial CSV readers, track the absolute row number and report it in errors encountered during parsing or converting.

I did try to get row numbers for the parallel reader, but the only approach I could think of was to add delimiter counting to the Chunker, and that seemed to add more complexity than it was worth.
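The idea can be illustrated outside of Arrow with a minimal sketch: a serial pass over the input keeps a running row counter and includes it in any column-count error. The function name `CheckCsv` and the exact message format are hypothetical, not Arrow's actual implementation (quoting and embedded newlines are deliberately ignored here):

```cpp
#include <sstream>
#include <string>

// Scan CSV input line by line, tracking the absolute (1-based) row number
// so that a column-count mismatch can name the offending row.
// Returns an empty string on success, an error message otherwise.
std::string CheckCsv(const std::string& csv, int expected_columns) {
  std::istringstream in(csv);
  std::string line;
  long long row = 0;
  while (std::getline(in, line)) {
    ++row;
    int columns = 1;
    for (char c : line) {
      if (c == ',') ++columns;
    }
    if (columns != expected_columns) {
      return "CSV parse error: Row #" + std::to_string(row) +
             ": Expected " + std::to_string(expected_columns) +
             " columns, got " + std::to_string(columns) + ": " + line;
    }
  }
  return "";
}
```

With input `"a,b\n1,2\n3"` and 2 expected columns, the error names row 3, which is exactly the context this PR adds to Arrow's serial readers.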


@n3world n3world force-pushed the ARROW-12675-report_rows branch 2 times, most recently from 9a51fa3 to 57b90e2 Compare May 18, 2021 18:23
@lidavidm (Member) left a comment

Thanks for doing this. I left some comments about the tests. I'm not as familiar with the CSV parser so we'll pull in someone else to take a look there too, but this looks good overall.

Resolved review comments on: python/pyarrow/tests/test_csv.py; cpp/src/arrow/csv/reader.cc (×4); cpp/src/arrow/csv/parser_test.cc (×2)
@n3world n3world force-pushed the ARROW-12675-report_rows branch 2 times, most recently from 63b663c to ca1a124 Compare May 19, 2021 19:03
@lidavidm (Member) left a comment

Just another nit about the tests.

Resolved review comment on cpp/src/arrow/csv/parser_test.cc
@n3world n3world force-pushed the ARROW-12675-report_rows branch 2 times, most recently from 93e00c7 to 6853b3d Compare May 19, 2021 20:23
@lidavidm lidavidm requested a review from pitrou May 20, 2021 13:23
@lidavidm (Member) commented

Antoine, could you take a quick look here, as you're more familiar with the CSV parser? The changes look minimally invasive, and having the extra error context would be nice.

@pitrou (Member) left a comment

Thanks a lot for doing this. Here are some suggestions and questions.

Resolved review comment on python/pyarrow/tests/test_csv.py
     csv.write(linesep)
     for row in arr.T:
         csv.write(",".join(map(str, row)))
         csv.write(linesep)
     csv = csv.getvalue().encode()
     columns = [pa.array(a, type=pa.int64()) for a in arr]
-    expected = pa.Table.from_arrays(columns, col_names)
+    expected = pa.Table.from_arrays(
+        columns, col_names) if write_names else None
Member:

I'm not sure what the condition is for here?

Author (Contributor):

If write_names is false then col_names is not set, so the Table cannot be created. I'll change this to a condition on col_names because that is a bit more obvious.

Member:

Or you can set col_names unconditionally, which will be less fragile IMO.

Author (Contributor):

But then the column names returned by the CSV parser may differ from the table's. Or is that not an issue?

Member:

Well, the test should ensure that the column names are the same.

     : io_context_(std::move(io_context)),
       read_options_(read_options),
       parse_options_(parse_options),
       convert_options_(convert_options),
+      num_rows_seen_(count_rows ? 1 : -1),
Member:

This is weird. Why not have a separate member bool count_rows_?

Author (Contributor):

Just to reduce the number of member variables. Using a separate bool might be a bit clearer, but it doesn't simplify the code, and you end up with one variable indicating whether another variable is in use. Using -1 (or any negative value) to mean disabled is common enough that I didn't think it obscured the intent too much.

If you feel strongly there should be two member variables to track row count I can make that change.

Member:

Having separate member variables would be clearer IMO.
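The trade-off under discussion can be sketched with two hypothetical structs (neither is Arrow's actual reader): a sentinel encoding where `num_rows_seen` doubles as the enabled flag, versus a separate `count_rows` bool:

```cpp
#include <cstdint>

// (a) Sentinel encoding, as in the PR: -1 means "counting disabled",
// so one member carries both the count and the on/off state.
struct SentinelCounter {
  int64_t num_rows_seen;
  explicit SentinelCounter(bool count_rows)
      : num_rows_seen(count_rows ? 1 : -1) {}
  bool counting() const { return num_rows_seen >= 0; }
  void Advance(int64_t n) {
    if (counting()) num_rows_seen += n;
  }
};

// (b) Separate flag: one extra member, but each variable has a single,
// obvious meaning.
struct FlaggedCounter {
  bool count_rows;
  int64_t num_rows_seen;
  explicit FlaggedCounter(bool count) : count_rows(count), num_rows_seen(1) {}
  void Advance(int64_t n) {
    if (count_rows) num_rows_seen += n;
  }
};
```

Both behave identically from the outside; the disagreement is purely about which reads more clearly at the use sites.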

Resolved review comment on cpp/src/arrow/dataset/file_csv.cc
     auto start = values[pos].offset;
     auto stop = values[pos + 1].offset;
     auto quoted = values[pos + 1].quoted;
-    ARROW_RETURN_NOT_OK(visit(parsed_ + start, stop - start, quoted));
+    Status status = visit(parsed_ + start, stop - start, quoted);
+    if (ARROW_PREDICT_FALSE(first_row >= 0 && !status.ok())) {
Member:

Suggestion:

if (ARROW_PREDICT_FALSE(!status.ok())) {
  if (first_row >= 0) {
    status = ...
  }
  return status;
}

Author (Contributor):

ARROW_RETURN_NOT_OK adds the extra context when that is enabled, so I think it would be better to keep it, or to add a new macro which doesn't check the status but just adds the context and returns when that is enabled.

Member:

Ah, you're right. But you should still ensure that !status.ok() is the first condition inside ARROW_PREDICT_FALSE.

Author (Contributor):

I updated it to be almost what you suggested, but kept the ARROW_RETURN_NOT_OK around the status. If there is a desire for an ARROW_RETURN_WITH_CONTEXT macro, I am happy to add it and modify the other macros to use it. It would only remove a duplicate status.ok() check, which isn't much overhead, so it's probably not worth it.
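For illustration, here is the shape such a context-adding macro could take, with a hypothetical `RETURN_WITH_ROW_CONTEXT` name and a minimal stand-in `Status` type (not Arrow's; the real ARROW_RETURN_NOT_OK additionally captures source-location context when that is enabled):

```cpp
#include <cstdint>
#include <string>

// Minimal stand-in for arrow::Status, for illustration only.
struct Status {
  std::string msg;  // empty means OK
  bool ok() const { return msg.empty(); }
  static Status OK() { return Status{}; }
  static Status Invalid(std::string m) { return Status{std::move(m)}; }
};

// On failure, prepend the row number (when row counting is enabled,
// i.e. row >= 0) and return; on success, fall through.
#define RETURN_WITH_ROW_CONTEXT(expr, row)                        \
  do {                                                            \
    Status _st = (expr);                                          \
    if (!_st.ok()) {                                              \
      if ((row) >= 0) {                                           \
        _st.msg = "Row #" + std::to_string(row) + ": " + _st.msg; \
      }                                                           \
      return _st;                                                 \
    }                                                             \
  } while (false)

// Example visitor-style call site using the macro.
Status Visit(int64_t first_row, bool fail) {
  Status status = fail ? Status::Invalid("bad value") : Status::OK();
  RETURN_WITH_ROW_CONTEXT(status, first_row);
  return Status::OK();
}
```

With `first_row = 42` a failure surfaces as `Row #42: bad value`, while a disabled counter (`first_row = -1`) leaves the message untouched.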

Review comment on cpp/src/arrow/csv/parser.cc
@n3world n3world force-pushed the ARROW-12675-report_rows branch 2 times, most recently from 8482c8f to fe78690 Compare May 26, 2021 20:29
@pitrou (Member) commented May 27, 2021

@ursabot please benchmark

@ursabot commented May 27, 2021

Benchmark runs are scheduled for baseline = 861b5da and contender = fe78690. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️33.33%] ec2-t3-large-us-east-2 (mimalloc)
[Finished ⬇️75.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2 (mimalloc)
[Finished ⬇️2.07% ⬆️0.0%] ursa-i9-9960x (mimalloc)
[Finished ⬇️5.06% ⬆️5.61%] ursa-thinkcentre-m75q (mimalloc)

@pitrou (Member) left a comment

Thanks a lot for the updates! Just a remaining question.

Resolved review comment on cpp/src/arrow/dataset/file_csv.cc
@n3world n3world force-pushed the ARROW-12675-report_rows branch from fe78690 to a5b7d4d Compare May 27, 2021 13:29
@pitrou (Member) left a comment

+1, thank you @n3world

n3world added 4 commits May 27, 2021 15:51
…h error

Add the line which has the incorrect column count to the output so it
is easier to identify in large inputs.

Authored-by: Nate Clark <nate@neworld.us>
Signed-off-by: Nate Clark <nate@neworld.us>
Track the row number for readers which process blocks serially and
report the row number in the column mismatch message.
If the DataBatch::VisitColumn visitor returns an error status, prepend
the row number on which the error occurred to add context.
@pitrou pitrou force-pushed the ARROW-12675-report_rows branch from a5b7d4d to c1178d2 Compare May 27, 2021 13:52
@pitrou
Copy link
Member

pitrou commented May 27, 2021

I've rebased on master to try and fix the AppVeyor CI issue.

@pitrou pitrou closed this in 26de76e May 27, 2021
@n3world n3world deleted the ARROW-12675-report_rows branch May 27, 2021 16:15
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
Closes apache#10321 from n3world/ARROW-12675-report_rows

Authored-by: Nate Clark <nate@neworld.us>
Signed-off-by: Antoine Pitrou <antoine@python.org>