dataset_format

CSV DATASET

CSV datasets are standard CSV files with some additional conventions (and minor limitations):

only one example is allowed per line. A single example cannot contain newlines and cannot span multiple lines;
columns are separated by commas. Commas inside a quoted string aren't column delimiters;
when available, the label (output value) of every example is stored in a single, fixed column. The user can specify the output column (or its absence);
if the value contained in the output column is a:
- number, the model is a REGRESSION model;
- string, we have a CLASSIFICATION model.
each column must describe the same kind of information in every example;
the column (feature) order doesn't affect the results: the first feature is not weighted any more than the last;
TEXT STRINGS
- text matching is case-sensitive: wine is different from Wine;
- if a string contains a comma or a double quote, then the string must be enclosed in double quotes; a double quote must be escaped with another double quote, for example: "sentence with a ""double"" quote inside"
NUMERIC VALUES
- both integer and decimal values are supported;
- numbers in quotes and multiple numeric values in the same field will be treated as strings. For example:
  - Numbers: 2, 12, 2.36
  - Strings: "2 12", a 23, "12"
A test set can have an empty output value.
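
The quoting and number/string rules above can be sketched in a few lines of Python. This is only an illustration of the conventions, not the code used by the actual parser: the function names (split_raw_fields, field_kind, infer_task) are hypothetical, and Python's float() accepts a few forms (such as inf or nan) that the real parser may treat differently.

```python
def split_raw_fields(line):
    """Split one CSV line on the commas that lie outside double quotes.

    Quotes are kept in the returned fields, so "12" and 12 stay distinguishable.
    """
    fields, current, in_quotes = [], [], False
    for ch in line:
        if ch == '"':
            # An escaped quote ("") toggles twice, leaving the state unchanged.
            in_quotes = not in_quotes
        if ch == "," and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def field_kind(raw):
    """Return "number" for an unquoted integer or decimal, "string" otherwise."""
    raw = raw.strip()
    if raw.startswith('"'):
        return "string"          # numbers in quotes are strings
    try:
        float(raw)
    except ValueError:
        return "string"          # e.g. a 23, or several numbers in one field
    return "number"


def infer_task(lines, output_column):
    """Guess the kind of model from the non-empty values of the output column."""
    kinds = set()
    for line in lines:
        raw = split_raw_fields(line)[output_column].strip()
        if raw:                  # a test set may leave the output empty
            kinds.add(field_kind(raw))
    if not kinds:
        return "UNKNOWN"
    return "REGRESSION" if kinds == {"number"} else "CLASSIFICATION"


rows = ['5.1,"red, dry",wine', '4.7,"said ""hello""",Wine']
print(split_raw_fields(rows[0]))
print(infer_task(rows, 0), infer_task(rows, 2))
```

Keeping the quotes in the raw fields is a deliberate choice here: a standard CSV reader would strip them, and the distinction between 12 and "12" would be lost.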

As a best practice, remove punctuation (other than apostrophes) from your data. Commas, periods and other punctuation rarely add meaning to the training data, yet the learning engine treats them as meaningful elements. For example, and, (with a trailing comma) is not matched to and.
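
A minimal sketch of this cleanup step in Python, assuming the text is plain ASCII (strip_punctuation is a hypothetical helper name, not part of the framework):

```python
import string

# Every ASCII punctuation character except the apostrophe.
_TO_REMOVE = string.punctuation.replace("'", "")
_TABLE = str.maketrans("", "", _TO_REMOVE)


def strip_punctuation(text):
    """Delete punctuation from a text value, keeping apostrophes."""
    return text.translate(_TABLE)


print(strip_punctuation("don't stop, please."))
```

Apply this to text columns only: it also deletes decimal points and minus signs, so running it over numeric fields would change their values.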

NOTES

The CSV parser has been improved: it can now detect headers and identify delimiters (see pocket_csv.h for further details). Even so, following the conventions above guarantees a hassle-free experience.
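
To get a feel for what header and delimiter detection involves, the same idea can be tried with Python's standard csv.Sniffer. This is not the pocket_csv.h interface, which is not shown here. It is only an analogous heuristic, and describe_csv is a hypothetical helper name.

```python
import csv


def describe_csv(sample):
    """Return (delimiter, has_header) as guessed from a small text sample."""
    sniffer = csv.Sniffer()
    # Restrict the candidates to a few common delimiters.
    dialect = sniffer.sniff(sample, delimiters=",;\t")
    return dialect.delimiter, sniffer.has_header(sample)


sample = "name;age;city\nalice;30;rome\nbob;25;paris\n"
print(describe_csv(sample))
```

Such detection is heuristic and can fail on small or irregular samples, which is one more reason to stick to the conventions described above.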