feat: native CSV parsing #4073
Looks very useful. Thanks!
Although not explicitly supported, I am wondering if having a multiline CSV example in the test folder would be a useful reference.
@rajan-chari Thanks for reviewing, I have added that in d7e5e91.
LGTM
<3 the extensive test coverage
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Just rebased the PR, and it seems the CI error is not caused by this PR but by recent commits in master, since the previous run passed: https://github.com/VowpalWabbit/vowpal_wabbit/actions/runs/2833443417
I agree, that break is not introduced by this PR. It snuck in with #4109.
Awesome stuff, merging now :)
Previous reviews: HollowMan6#1
A project of Reinforcement Learning Open Source Fest 2022.
Tutorial CSV
Iris.csv
Iris.txt
iris2.csv
Build the CSV parser with the cmake option `-DBUILD_CSV=On`.
Summary of features supported and tested:
- The field separator is `,` by default, but `"`, `|`, and `:` are reserved and cannot be used as separators: the double quote (`"`) is for escaping, the vertical bar (`|`) separates namespace and feature names, and `:` can be used in labels.
- `--csv_separator` symbols that appear inside double-quoted cells are treated as normal string characters, not as separators.
- Labels and tags are identified by the header names `_label` and `_tag` by default. Every other header field may contain a namespace and a feature name separated by the namespace separator, the vertical bar (`|`).
- `--csv_header` overrides the CSV header by providing (namespace, `|`, and) feature names separated with `,`. By default, CSV files are assumed to have a header with feature and/or namespace names in the first line. Combined with `--csv_no_file_header`, the file is assumed to have no header, and specifying `--csv_header` is then required.
- Leading and trailing whitespace characters (`\r\n\f\v`) as well as UTF-8 BOM characters (`\xef\xbb\xbf`) are removed before separation.
- `\t` can be used to represent a tab separator. Otherwise, assigning more than one character throws an error.
- `--csv_ns_value` scales the namespace values by a float ratio, e.g. `--csv_ns_value=a:0.5,b:0.3,:8`, where namespace `a` has a ratio of 0.5, `b` of 0.3, the empty namespace of 8, and all other namespaces of 1.
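
To make the rules above concrete, here is a rough Python sketch of how the option values and header fields could be interpreted. It is not the C++ code in this PR: the function names (`parse_separator`, `split_header_field`, `parse_ns_values`, `split_quoted_row`) are hypothetical, and treating a header field without `|` as belonging to the empty namespace is an assumption made for illustration.

```python
import csv
import io

# Characters the feature list above describes as reserved.
RESERVED = {'"', '|', ':'}


def parse_separator(arg):
    """Validate a separator argument (illustrative, not the PR's code)."""
    if arg == r"\t":  # the two-character escape stands for a tab
        return "\t"
    if len(arg) != 1:
        raise ValueError("separator must be a single character")
    if arg in RESERVED:
        raise ValueError("%r is reserved and cannot be a separator" % arg)
    return arg


def split_header_field(field):
    """Split 'namespace|feature' into (namespace, feature).

    Assumption: a field without '|' goes to the empty namespace.
    """
    namespace, bar, feature = field.partition("|")
    return (namespace, feature) if bar else ("", field)


def parse_ns_values(arg):
    """Parse 'a:0.5,b:0.3,:8' into a namespace -> ratio mapping."""
    ratios = {}
    for pair in arg.split(","):
        namespace, _, value = pair.rpartition(":")
        ratios[namespace] = float(value)
    return ratios


def split_quoted_row(line, separator=","):
    """Split one line; separators inside double quotes stay in the cell."""
    reader = csv.reader(io.StringIO(line), delimiter=separator, quotechar='"')
    return next(reader)
```

For example, `parse_ns_values("a:0.5,b:0.3,:8")` maps `a` to 0.5, `b` to 0.3, and the empty namespace to 8.0. Namespaces missing from the mapping would keep the default ratio of 1.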