
ARROW-12675: [C++] CSV parsing report row on which error occurred #10321

Closed
wants to merge 4 commits

Conversation

@n3world (Contributor) commented May 14, 2021

For serial CSV readers, track the absolute row number and report it in errors encountered during parsing or converting.

I did try to get row numbers for the parallel reader, but the only approach I could think of was to add delimiter counting to the Chunker, and that seemed to add more complexity than it was worth.
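The idea can be illustrated outside of Arrow with a minimal sketch: a serial pass over the input keeps a running row counter and includes it in any column-count error. The function name `CheckCsv` and the exact message format are hypothetical, not Arrow's actual implementation (quoting and embedded newlines are deliberately ignored here):

```cpp
#include <sstream>
#include <string>

// Scan CSV input line by line, tracking the absolute (1-based) row number
// so that a column-count mismatch can name the offending row.
// Returns an empty string on success, an error message otherwise.
std::string CheckCsv(const std::string& csv, int expected_columns) {
  std::istringstream in(csv);
  std::string line;
  long long row = 0;
  while (std::getline(in, line)) {
    ++row;
    int columns = 1;
    for (char c : line) {
      if (c == ',') ++columns;
    }
    if (columns != expected_columns) {
      return "CSV parse error: Row #" + std::to_string(row) +
             ": Expected " + std::to_string(expected_columns) +
             " columns, got " + std::to_string(columns) + ": " + line;
    }
  }
  return "";
}
```

With input `"a,b\n1,2\n3"` and 2 expected columns, the error names row 3, which is exactly the context this PR adds to Arrow's serial readers.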


@n3world n3world force-pushed the ARROW-12675-report_rows branch 2 times, most recently from 9a51fa3 to 57b90e2 Compare May 18, 2021 18:23
@lidavidm (Member) left a comment

Thanks for doing this. I left some comments about the tests. I'm not as familiar with the CSV parser so we'll pull in someone else to take a look there too, but this looks good overall.

Resolved review comments on: python/pyarrow/tests/test_csv.py; cpp/src/arrow/csv/reader.cc (×4); cpp/src/arrow/csv/parser_test.cc (×2)
@n3world n3world force-pushed the ARROW-12675-report_rows branch 2 times, most recently from 63b663c to ca1a124 Compare May 19, 2021 19:03
@lidavidm (Member) left a comment

Just another nit about the tests.

Resolved review comment on cpp/src/arrow/csv/parser_test.cc
@n3world n3world force-pushed the ARROW-12675-report_rows branch 2 times, most recently from 93e00c7 to 6853b3d Compare May 19, 2021 20:23
@lidavidm lidavidm requested a review from pitrou May 20, 2021 13:23
@lidavidm (Member) commented

Antoine, could you take a quick look here, as you're more familiar with the CSV parser? The changes look minimally invasive, and having the extra error context would be nice.

@pitrou (Member) left a comment

Thanks a lot for doing this. Here are some suggestions and questions.

Resolved review comment on python/pyarrow/tests/test_csv.py
     csv.write(linesep)
     for row in arr.T:
         csv.write(",".join(map(str, row)))
         csv.write(linesep)
     csv = csv.getvalue().encode()
     columns = [pa.array(a, type=pa.int64()) for a in arr]
-    expected = pa.Table.from_arrays(columns, col_names)
+    expected = pa.Table.from_arrays(
+        columns, col_names) if write_names else None
Member:

I'm not sure what the condition is for here?

Author (Contributor):

If write_names is false then col_names is not set, so the Table cannot be created. I'll change this to a condition on col_names because that is a bit more obvious.

Member:

Or you can set col_names unconditionally, which will be less fragile IMO.

Author (Contributor):

But then the column names returned by the CSV parser may differ from the table's. Or is that not an issue?

Member:

Well, the test should ensure that the column names are the same.

     : io_context_(std::move(io_context)),
       read_options_(read_options),
       parse_options_(parse_options),
       convert_options_(convert_options),
+      num_rows_seen_(count_rows ? 1 : -1),
Member:

This is weird. Why not have a separate member bool count_rows_?

Author (Contributor):

Just to reduce the number of member variables. Using a separate bool might be a bit clearer, but it doesn't simplify the code, and you end up with one variable indicating whether another variable is in use. Using -1 (or any negative value) to mean disabled is common enough that I didn't think it obscured the intent too much.

If you feel strongly there should be two member variables to track row count I can make that change.

Member:

Having separate member variables would be clearer IMO.
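The trade-off under discussion can be sketched with two hypothetical structs (neither is Arrow's actual reader): a sentinel encoding where `num_rows_seen` doubles as the enabled flag, versus a separate `count_rows` bool:

```cpp
#include <cstdint>

// (a) Sentinel encoding, as in the PR: -1 means "counting disabled",
// so one member carries both the count and the on/off state.
struct SentinelCounter {
  int64_t num_rows_seen;
  explicit SentinelCounter(bool count_rows)
      : num_rows_seen(count_rows ? 1 : -1) {}
  bool counting() const { return num_rows_seen >= 0; }
  void Advance(int64_t n) {
    if (counting()) num_rows_seen += n;
  }
};

// (b) Separate flag: one extra member, but each variable has a single,
// obvious meaning.
struct FlaggedCounter {
  bool count_rows;
  int64_t num_rows_seen;
  explicit FlaggedCounter(bool count) : count_rows(count), num_rows_seen(1) {}
  void Advance(int64_t n) {
    if (count_rows) num_rows_seen += n;
  }
};
```

Both behave identically from the outside; the disagreement is purely about which reads more clearly at the use sites.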

Resolved review comment on cpp/src/arrow/dataset/file_csv.cc
     auto start = values[pos].offset;
     auto stop = values[pos + 1].offset;
     auto quoted = values[pos + 1].quoted;
-    ARROW_RETURN_NOT_OK(visit(parsed_ + start, stop - start, quoted));
+    Status status = visit(parsed_ + start, stop - start, quoted);
+    if (ARROW_PREDICT_FALSE(first_row >= 0 && !status.ok())) {
Member:

Suggestion:

if (ARROW_PREDICT_FALSE(!status.ok())) {
  if (first_row >= 0) {
    status = ...
  }
  return status;
}

Author (Contributor):

ARROW_RETURN_NOT_OK adds the extra context when that is enabled, so I think it would be better to keep it, or to add a new macro which doesn't check the status but just adds the context and returns when that is enabled.

Member:

Ah, you're right. But you should still ensure that !status.ok() is the first condition inside ARROW_PREDICT_FALSE.

Author (Contributor):

I updated it to be almost what you suggested, but kept the ARROW_RETURN_NOT_OK around the status. If there is a desire for an ARROW_RETURN_WITH_CONTEXT macro, I am happy to add it and modify the other macros to use it. It would only remove a duplicate status.ok() check, which isn't much overhead, so it's probably not worth it.
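For illustration, here is the shape such a context-adding macro could take, with a hypothetical `RETURN_WITH_ROW_CONTEXT` name and a minimal stand-in `Status` type (not Arrow's; the real ARROW_RETURN_NOT_OK additionally captures source-location context when that is enabled):

```cpp
#include <cstdint>
#include <string>

// Minimal stand-in for arrow::Status, for illustration only.
struct Status {
  std::string msg;  // empty means OK
  bool ok() const { return msg.empty(); }
  static Status OK() { return Status{}; }
  static Status Invalid(std::string m) { return Status{std::move(m)}; }
};

// On failure, prepend the row number (when row counting is enabled,
// i.e. row >= 0) and return; on success, fall through.
#define RETURN_WITH_ROW_CONTEXT(expr, row)                        \
  do {                                                            \
    Status _st = (expr);                                          \
    if (!_st.ok()) {                                              \
      if ((row) >= 0) {                                           \
        _st.msg = "Row #" + std::to_string(row) + ": " + _st.msg; \
      }                                                           \
      return _st;                                                 \
    }                                                             \
  } while (false)

// Example visitor-style call site using the macro.
Status Visit(int64_t first_row, bool fail) {
  Status status = fail ? Status::Invalid("bad value") : Status::OK();
  RETURN_WITH_ROW_CONTEXT(status, first_row);
  return Status::OK();
}
```

With `first_row = 42` a failure surfaces as `Row #42: bad value`, while a disabled counter (`first_row = -1`) leaves the message untouched.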

Review comment on cpp/src/arrow/csv/parser.cc
@n3world n3world force-pushed the ARROW-12675-report_rows branch 2 times, most recently from 8482c8f to fe78690 Compare May 26, 2021 20:29
@pitrou (Member) commented May 27, 2021

@ursabot please benchmark

@ursabot commented May 27, 2021

Benchmark runs are scheduled for baseline = 861b5da and contender = fe78690. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️33.33%] ec2-t3-large-us-east-2 (mimalloc)
[Finished ⬇️75.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2 (mimalloc)
[Finished ⬇️2.07% ⬆️0.0%] ursa-i9-9960x (mimalloc)
[Finished ⬇️5.06% ⬆️5.61%] ursa-thinkcentre-m75q (mimalloc)

@pitrou (Member) left a comment

Thanks a lot for the updates! Just a remaining question.

Resolved review comment on cpp/src/arrow/dataset/file_csv.cc
@n3world n3world force-pushed the ARROW-12675-report_rows branch from fe78690 to a5b7d4d Compare May 27, 2021 13:29
@pitrou (Member) left a comment

+1, thank you @n3world

n3world added 4 commits May 27, 2021 15:51
…h error

Add the line which has the incorrect column count to the output so it
is easier to identify in large inputs.

Authored-by: Nate Clark <nate@neworld.us>
Signed-off-by: Nate Clark <nate@neworld.us>
Track the row number for readers which process blocks serially and
report the row number in the column mismatch message.
If the DataBatch::VisitColumn visitor returns an error status, prepend
the row number on which the error occurred to add context.
@pitrou pitrou force-pushed the ARROW-12675-report_rows branch from a5b7d4d to c1178d2 Compare May 27, 2021 13:52
@pitrou
Copy link
Member

pitrou commented May 27, 2021

I've rebased on master to try and fix the AppVeyor CI issue.

@pitrou pitrou closed this in 26de76e May 27, 2021
@n3world n3world deleted the ARROW-12675-report_rows branch May 27, 2021 16:15
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
Closes apache#10321 from n3world/ARROW-12675-report_rows

Authored-by: Nate Clark <nate@neworld.us>
Signed-off-by: Antoine Pitrou <antoine@python.org>