This repository has been archived by the owner on Nov 8, 2023. It is now read-only.

feat: Native CSV Parser #1

Closed
wants to merge 54 commits into from

Conversation

HollowMan6
Owner

@HollowMan6 HollowMan6 commented May 11, 2022

Iris.csv
Iris.txt
iris2.csv

multiline.csv
multiline.txt

Build the CSV parser with the CMake option -DBUILD_CSV=On

vw -d Iris.txt
vw --csv -d Iris.csv
vw --csv -d iris2.csv --named_labels Setosa,Versicolor,Virginica --oaa 3

vw -d multiline.txt --cb_adf
vw --csv -d multiline.csv --cb_adf

Summary of features supported and tested:

  1. The CSV field separator can be set with --csv_separator (default: ,). The characters ", | and : are reserved and cannot be used as separators: the double quote (") is the escape character, the vertical bar (|) separates namespaces from feature names, and : can appear in labels.
  2. For each separated cell, the outer double quotes are removed automatically when they pair up. A --csv_separator character that appears inside a double-quoted cell is treated as an ordinary character, not as a separator.
  3. Double quotes at the start and end of a cell are treated as enclosing the field. Quotes that appear elsewhere, outside an enclosed field, have no special meaning. (This is also how Microsoft Excel parses CSV.)
  4. If double quotes enclose a field, a double quote inside that field must be escaped by preceding it with another double quote; the escaping quote is removed during parsing.
  5. The header line supplies the feature names (and optionally namespaces). By default, the label and tag columns are identified by the names _label and _tag. Every other header element may contain a namespace and a feature name, separated by the namespace separator, the vertical bar (|).
  6. --csv_header overrides the CSV header; it takes a ,-separated list of feature names, each optionally prefixed by a namespace and |. By default, the first line of a CSV file is assumed to be a header containing feature and/or namespace names, and --csv_header overrides it. With --csv_no_file_header, the file is assumed to have no header line, and in that case --csv_header is required.
  7. If a line has more separated elements than the header, an error is thrown.
  8. Before separation, ASCII whitespace (\r\n\f\v) and the UTF-8 BOM bytes (\xef\xbb\xbf) are trimmed from the element.
  9. If a header element specifies no namespace, the empty namespace is used.
  10. The separator may be given as \t to represent a tab. Otherwise, specifying more than one character throws an error.
  11. The label is read directly as a string and interpreted by the VW text label parser.
  12. Feature values are classified as float or string. NaN is treated as a string, and quoted numbers are always treated as strings.
  13. A feature whose value is empty is skipped.
  14. The parser is reset when the end of a file is reached (to allow for possible multiple-input-file support).
  15. --csv_ns_value scales namespace values by a float ratio given per namespace.
    e.g. --csv_ns_value=a:0.5,b:0.3,:8 gives namespace a a ratio of 0.5, namespace b a ratio of 0.3, the empty namespace a ratio of 8, and all other namespaces a ratio of 1.
  16. If all the cells in a line are empty, the line is treated as an empty line. CSV is not a good fit for the multiline format, as evidenced by the large number of empty fields; multiline format often means that different lines have different schemas. However, I have kept empty-line support so that the parser stays flexible and extensible. An error is still thrown if the number of fields in a line does not match the previous lines, even when all the fields are empty, as this usually indicates a typo the user did not intend.
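
To make the quoting rules (items 2 to 4) and the float/string decision (item 12) concrete, here is a minimal Python sketch. It is not the C++ code in this PR: the function names split_csv_line and classify_value are hypothetical, the sketch assumes a single-character separator, and it ignores trimming, headers, and namespaces.

```python
import math


def split_csv_line(line, separator=","):
    """Split one line into (text, was_quoted) cells.

    Assumed behavior, following items 2-4 above: a double quote at the
    start of a cell opens an enclosed field, "" inside an enclosed field
    is an escaped quote, and a quote anywhere else is an ordinary character.
    """
    cells = []
    buf = []
    in_quotes = False
    quoted = False
    at_cell_start = True
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    buf.append('"')  # escaped quote: keep one, drop the escape
                    i += 2
                    continue
                in_quotes = False  # closing quote of the enclosed field
            else:
                buf.append(ch)  # separators inside quotes are plain text
        elif ch == '"' and at_cell_start:
            in_quotes = True
            quoted = True
        elif ch == separator:
            cells.append(("".join(buf), quoted))
            buf = []
            quoted = False
            at_cell_start = True
            i += 1
            continue
        else:
            buf.append(ch)  # includes quotes outside an enclosed field
        at_cell_start = False
        i += 1
    cells.append(("".join(buf), quoted))
    return cells


def classify_value(text, was_quoted):
    """Return a float for unquoted numeric text, otherwise the string.

    Quoted cells and NaN stay strings, mirroring item 12.
    """
    if was_quoted:
        return text
    try:
        value = float(text)
    except ValueError:
        return text
    return text if math.isnan(value) else value


if __name__ == "__main__":
    for text, was_quoted in split_csv_line('5.1,"a,b","say ""hi""",x"y'):
        print(repr(classify_value(text, was_quoted)))
```

Returning the was_quoted flag alongside the text is one way to keep the "quoted numbers are strings" rule separate from the splitting logic; the actual parser may track this differently.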

@HollowMan6 HollowMan6 changed the title Native CSV Parser feat: Native CSV Parser May 11, 2022
@HollowMan6 HollowMan6 force-pushed the csv branch 17 times, most recently from 6ec35db to ef3b342 Compare May 19, 2022 03:12
@HollowMan6 HollowMan6 force-pushed the csv branch 6 times, most recently from 9dbf6f9 to e1546b8 Compare May 22, 2022 03:21
@HollowMan6
Owner Author

New features:

  1. --csv_label also supports specifying multiple columns, in order and separated by ,,
    to represent each component of the label type.
  2. The label is read directly as a string; its components are combined in order,
    separated by spaces by default. With the option --csv_multilabels, they are combined with , instead.
  3. Tags are supported.
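
As a rough illustration of how the label components might be combined, here is a short Python sketch. The function build_label, its parameters, and the column names in the usage lines are hypothetical and only mirror the description above; this is not the parser's actual code.

```python
def build_label(header, row, label_columns, multilabels=False):
    """Join the cells of the chosen label columns into one label string.

    label_columns plays the role of the column list given to --csv_label;
    multilabels plays the role of --csv_multilabels (join with "," instead
    of a space).
    """
    index = {name: i for i, name in enumerate(header)}
    parts = [row[index[name]] for name in label_columns]
    joiner = "," if multilabels else " "
    return joiner.join(parts)


if __name__ == "__main__":
    header = ["_label", "weight", "x"]
    row = ["1", "0.5", "3.2"]
    print(build_label(header, row, ["_label", "weight"]))
    print(build_label(header, row, ["_label", "weight"], multilabels=True))
```

The resulting string would then be handed to the VW text label parser, as described in item 11 of the original summary.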

@HollowMan6 HollowMan6 force-pushed the csv branch 2 times, most recently from bfd461b to 75afd71 Compare May 22, 2022 15:24
Signed-off-by: Hollow Man <hollowman@opensuse.org>
…s empty

3 participants