Parser Specification


H2O Parsing Concepts

  • import/upload of files and the resulting H2O key names
  • data and header rows in a file
  • special header_from_file specification by the user
  • H2O's automatic determination of column types and headers from files
  • override of header determination and header file selection (there is no override of column determination)
  • column token rules, including white space stripping and number vs. enum determination
  • row rules (EOL etc.)
  • special parsing ($, %, time)

Documentation updates needed

Need to describe how using the Hive 0x1 separator allows single quotes and double quotes in tokens. Need to describe how the "single_quotes=1" param makes single quotes get treated as quote characters (by default, single quote is treated as a plain character everywhere; see Token parsing below).

Parsing multiple files

Normally, a beginning user parses only a single file. This file will have an optional header row and data rows. The entire file is examined, and H2O deduces whether the first row is a header. Use of the first row as the header can be forced with the header=1 param.

H2O can also be directed to parse multiple files as if they were one file. In this case the file used for header determination is random. H2O can be forced to use one of the files in the pattern match for file selection, or it can be pointed to a file outside of the pattern match. In both cases a key name needs to be used; pattern match is not allowed for the header_from_file= param.

A "file" has the following contents:

  • Comment row (optional, first line only)
  • Header row (optional, follows Comment if present)
  • Data row(s) (optional; follows Header if present, otherwise follows Comment if present). Data may be ignored if the header_from_file= file is not part of the parse pattern match for selected files. An example file with all three parts appears below.
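
For instance, a small file with all three parts might look like this (the content here is purely illustrative):

    # monthly widget sales, exported 2014-01-01
    date,region,amount
    2014-01-02,west,100.5
    2014-01-03,east,87.2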

The algorithm H2O uses when parsing multiple files that may have data and header rows is:

Default behavior:

  • Get setup (parser type, separator, number of columns) from all files, using H2O "column determination" rules.
  • Look for a possible header (first line with all strings, followed by a line with at least one number). If this results in multiple header choices, pick one at random.
  • Report the info to the user. (Browser only. This is the "query" step that doesn't exist when using JSON.)

User then picks/confirms global setup (separator, number of cols, parser type...)

User can also pick a header file (header_from_file=), for which there are the following options:

  • Pick one of the files being parsed. It can have data iff it has matching setup. If it does not have any data lines, it can use the special override on separator determination: comma and space can be used as separators in addition to the current global setup. Space is still stripped as white space, so multiple spaces end up being treated as a single space. But remember, multiple commas are always individual separators.

  • Pick another unparsed file as a header file. It cannot have any data. Setup must match either the global setup, or use comma or space as separator.

  • Pick an already parsed dataset (? this is currently not tested). The key must point to a VA or frame, and the number of cols must match the global setup. The dataset CAN have data, though it won't be included in the new dataset.

The different separator is allowed only for a one-line, header-only file. If the file contains more than one line, it must have the same setup as the rest.

If you have a file with non-matching setup in your list of files to parse, the browser will complain about it, and the user must pick it as the header file manually (and then comma and space will be tested as potential separators).

When using the JSON interface, the API requires header=1 if header_from_file= is used. There is no "prompting" of information to the user from the files, as in the browser.
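
As a hedged sketch only, a multi-file parse through the JSON interface might look like the following. The endpoint name /2/Parse2.json and the source_key/destination_key parameter spellings are assumptions of this example; only header=1 and header_from_file= are confirmed by this spec.

    import urllib.request, urllib.parse

    # Hypothetical request: parse all keys matching a pattern, forcing the
    # header to come from a separate key. Endpoint and parameter names are
    # assumptions; header=1 is required whenever header_from_file= is used.
    params = urllib.parse.urlencode({
        "source_key": "*mydata_part*",        # pattern match over imported keys
        "header": 1,                          # required with header_from_file=
        "header_from_file": "myheader.csv",   # a key name; patterns not allowed
        "destination_key": "mydata.hex",
    })
    with urllib.request.urlopen("http://localhost:54321/2/Parse2.json?" + params) as r:
        print(r.read())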

Parsing a single file

A raw data file is interpreted (parsed) to create a data .hex file. Raw data that can be parsed doesn't necessarily meet the requirements of all algorithms. This specification only covers the H2O parsing step.

Logistic regression may limit the values in output classes. Random Forest may limit the number of unique output class values, or exclude fractional values.

Parsing decides how many columns a dataset has, whether the columns are numbers or strings, handles missing values, handles incorrect values for a column, and extracts column labels from an optional header row.

After a Parse, columns with entries that got default values or NA will be identified by "num_missing_values" in the result of an Inspect.

Some algorithms (RF) will ignore parsed rows that have missing values. (NEED TO CLARIFY THIS) Some algorithms will ignore columns that are all the same value. In some cases, H2O will translate a column to all NAs if it has mixed strings and numbers (likely an error).

Row format

A row is a sequence of tokens separated by {SEP} characters, terminated with a CR, CR LF, or LF line ending.

UPDATE: Since NUL (0x00) characters may be used for padding, NULs are allowed, and ignored after any line end, until the first non-NUL character. NUL is not an end of line character, though. Verify: NUL (0x00) can be used as a character in tokens, and not be ignored.
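
A minimal sketch of the NUL-padding rule (illustrative, not H2O code):

    # Skip NUL (0x00) padding after a line end; NULs are ignored until the
    # first non-NUL byte. NUL itself is not an end-of-line character.
    def skip_nul_padding(buf, pos):
        while pos < len(buf) and buf[pos] == 0:
            pos += 1
        return pos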

{SEP} must be space or comma. Update: Tab and | (vertical bar) are now valid for use as {SEP}, similar to comma.

In this text, <sp> will be used to identify a space character.

Two adjacent {SEP} characters (,,) cause a NaN value for a number column, and an empty string for a string column. (Note this is different from the default filling for missing columns at the end.)

If space is used as {SEP}, multiple adjacent spaces are treated as one {SEP}.

Any space adjacent to a quoted token (outside the quotes) is illegal. Although one could argue it's whitespace, the definition of {SEP} would then be complicated, and {SEP} is always a single character.

If a row has a smaller number of columns than expected, the remaining columns are filled with default values. The default is 0 for number columns. For string columns, it's the first defined string (enum) for the column.
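
A minimal sketch of this default-fill rule (the names here are illustrative):

    # Pad a short row: 0 for number columns, the column's first defined
    # enum string for string columns.
    def pad_row(tokens, ncols, col_is_number, first_enum):
        for i in range(len(tokens), ncols):
            tokens.append(0 if col_is_number[i] else first_enum[i])
        return tokens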

Blank lines are ignored. (IS THIS TRUE? or do blank lines load NA sometimes? Are blank lines allowed before the first two non-comment lines in a file?)

Update: Comments are allowed at the beginning of the file. They should start with "#" at the beginning of a line. No comments allowed after the first non-comment line.

Every dataset must have a minimum of two non-comment lines.

The last record in a file may or may not end with an end of line character. (Verify?)

Column detection

Column headers, numbers of columns, and separators are detected from the first two lines. This places additional constraints on the first two rows.

If the first row is only string tokens and second row has at least one number token, then the first row is considered to be a header. Otherwise the first row is considered to be data.

A header with any tokens that match the number E-BNF will cause it to be treated as a data row.

The first two rows should have the same number of columns, and this is required to be the max number of columns across all lines in the data.
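
A minimal sketch of the header heuristic above, assuming a number test like the E-BNF given under Token parsing below (illustrative, not H2O code):

    import re

    # Approximation of the number E-BNF given under "Token parsing" below.
    NUMBER_RE = re.compile(r"^[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?:[eE][+-]?\d+)?$")

    # First row is a header iff it is all string tokens and the second row
    # has at least one number token.
    def looks_like_header(row1, row2):
        is_num = lambda t: NUMBER_RE.match(t.strip()) is not None
        return (not any(is_num(t) for t in row1)) and any(is_num(t) for t in row2)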

Token parsing

Each token can be quoted (single or double quoted). Single quote is treated as a character though, unless the parameter single_quotes=1 is used for the parse.

If the token is quoted then whatever lies inside the quotes (single or double) is part of the token. For instance "<sp>1" is not the number 1, but a two character string: space followed by character 1.

If single_quotes=1 is used, single quotes surrounding a token will be stripped, just as double quotes normally are.

"1" will be parsed as number, since there is nothing other than the 1.

There are two whitespace characters: <sp> and Tab.

Leading whitespace for tokens is always ignored, and trailing whitespace for numbers is ignored too. Therefore <sp><sp><sp>123<sp><sp><sp> is the number 123, and <sp><sp><sp>Hello_World<sp><sp><sp> is the string Hello_World<sp><sp><sp> (assuming space itself is not {SEP}).
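
A sketch of these stripping rules (illustrative, not H2O code):

    import re
    NUMERIC = re.compile(r"^[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?:[eE][+-]?\d+)?$")

    def strip_token(tok):
        tok = tok.lstrip(" \t")               # leading whitespace always ignored
        if NUMERIC.match(tok.rstrip(" \t")):  # trailing ignored only for numbers
            tok = tok.rstrip(" \t")
        return tok

    # strip_token("   123   ")         -> "123"
    # strip_token("   Hello_World   ") -> "Hello_World   "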

While there is whitespace stripping for tokens, no additional whitespace is allowed around separators. This comes up when quoted tokens are used, e.g. "string1",<sp>"string2" is illegal. If the tokens are not quoted, the rules for whitespace around tokens cover the cases. Note extra whitespace at the beginning and end of lines with quoted tokens is illegal.

The whitespace and quote stripping rules imply that a pure number can never be used or interpreted as a string. (Question: does that mean raw numbers are illegal in string columns?)

Strings can contain any character - we do not do any unicode goodness on them, so they will appear as they do in the CSV file itself.

If the token needs to contain the same quote (single or double) that was used to delineate the token, you can escape it with another. For instance: "John said ""Hello there""" or 'This ain''t wrong'. Tokens with embedded commas or line ends need no special handling other than quotes (single or double) around the entire token.
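
A minimal sketch of the quote-escaping rule, assuming the token arrives with its surrounding quotes intact (illustrative, not H2O code):

    # Strip surrounding quotes and undouble the embedded escape quotes.
    def unquote(tok, q='"'):
        assert tok[0] == q and tok[-1] == q
        return tok[1:-1].replace(q + q, q)

    # unquote('"John said ""Hello there"""') -> 'John said "Hello there"'
    # unquote("'This ain''t wrong'", q="'")  -> "This ain't wrong"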

Whatever does not parse as number is parsed as string.

This is the E-BNF grammar for numbers:

DIGIT = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 

NUMBER = [ '-' | '+' ] {DIGIT} FRACTION EXPONENT | 'NA' | 'NAN' | 'NaN'

FRACTION = empty | '.' DIGIT { DIGIT }

EXPONENT = empty | ( 'e' | 'E' ) [ '-' | '+' ] DIGIT { DIGIT }

Where | denotes choice, [ ] an option, and { } repetition zero or more times.

We do not expect to see more digits than double precision can represent. If more digits than the double precision limit are found, they are truncated.

There is no support for thousands separators in numbers, e.g. 1,000,000.
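
As a cross-check, here is a hedged transcription of the grammar into a regex (an assumption of this write-up, not H2O code). Note the E-BNF as written allows zero digits, which would accept a bare '+' or '-' (see the 'issues' list below); the regex here requires at least one digit instead:

    import re

    # NUMBER per the E-BNF above, plus the NA spellings. At least one digit
    # is required here, which the grammar as written does not enforce.
    NUMBER_RE = re.compile(r"^[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?:[eE][+-]?\d+)?$")
    NA_TOKENS = {"NA", "NAN", "NaN"}

    def is_number_token(tok):
        tok = tok.strip(" \t")   # leading/trailing whitespace ignored for numbers
        return tok in NA_TOKENS or NUMBER_RE.match(tok) is not None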

Beyond numbers and strings

Added feature: spotting % as "percentage" - and auto-dividing by 100. So "100%" is parsed as 1.00.

Added feature: ignore $ in numeric columns.
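
A minimal sketch of both special cases (illustrative, not H2O code):

    # "100%" parses as 1.00 (divide by 100); "$12.50" parses as 12.5
    # (the $ is ignored in numeric columns).
    def parse_special(tok):
        if tok.endswith("%"):
            return float(tok[:-1]) / 100.0
        if tok.startswith("$"):
            return float(tok[1:])
        return float(tok)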

NEW STUFF

There is no token recognition yet for something called 'strings' or something called "uuid". Also: the time stamp type needs a definition, to be written up.

Here's a placeholder for some special handling around binomial enums, with 0 specifying NA.

Answers are for the tomas-parse branch; as of right now, everything is tested on small files.

On 07/29/2014 04:49 AM, Tom Kraljevic wrote:

(dataset) —> (h2o in-memory representation)

    0 —> NA
    N —> N
    Y —> Y

questions:

Q: Are there only 3 possible values in the response column? (0/N/Y)
A: Yes, either 0/N/Y or 0/n/y or 0/F/T or 0/f/t.

Q: What happens if there are other values, or a missing value? Are any other parse "types" possible in that column as an error (strings, other numbers, etc.)?
A: Then it'll fall back to default behavior; the column could become enum or numbers.

Q: Can N and Y be lower case?
A: Yes, as long as it's consistent.

Q: What about single or double quotes surrounding 0/N/Y? Not interesting, or?
A: Yes, as long as it's consistent.

Q: Is there a particular column separator?
A: No, anything goes.

Q: What's the expected line-end?
A: As always, Windows or Linux.

Q: Is this multi-file or single file?
A: So far only single file.

Q: gz or not gz?
A: gz is failing right now; not clear why.

Q: Is there a separate header file, or is the header embedded?
A: No header needed; it's the new "heuristic".

Q: Are N and Y legal for a header or not?
A: Can be problematic if the other columns' header names are also confused as data, but should be fine in general.

Q: Is the response column always a particular column, or can it be any column?
A: Any.

Q: What's the # of cols and rows expected to be at the customer?
A: Dunno.

Q: What's the largest number of machines that will parse the dataset?
A: Dunno.
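
A hedged sketch of the 0/N/Y heuristic described in the answers above (the names here are hypothetical, not H2O code):

    # 0 maps to NA; the letter tokens pass through. Casing must be
    # consistent within the column (0/N/Y, 0/n/y, 0/F/T, or 0/f/t).
    NA = None  # stand-in for H2O's NA

    def map_binomial_token(tok):
        if tok == "0":
            return NA          # 0 -> NA
        if tok in ("N", "Y", "n", "y", "F", "T", "f", "t"):
            return tok         # N -> N, Y -> Y
        raise ValueError("column falls back to default enum/number parsing")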

Sparse libsvm or svmlight data format.

There are no headers in libsvm format. All the language above about headers should be ignored. (We don't currently have any multi-file libsvm parsing. Should we add that?)

The format is automatically recognized by the parser. (libsvm tokens are another token case that should be added to the number/string BNF above.)

Note there is no such thing as a header specification for libsvm. If there is a header file or a commented header, it's not clear what H2O currently does.

non-standard issues:

  • values equal to 0 are not illegal, just unnecessary, i.e. this is okay:

    6:+0.000000e+00

  • comments are not allowed at the end of lines like this?

    1 1:2 2:1 # your comments

  • 'inf' and 'nan" are not allowed for values

  • each line must end with a newline (the alternate end-of-line sequences are also allowed)

  • The index values must be in ascending order and positive integers.

  • valueN must be an integer or real number (not enum/strings)

  • labels can be integers or reals (not enums/strings)

  • a multi-label format allows [label] to be a comma-separated group of labels. Do we support this? Is it non-standard?

  • H2O should be able to tolerate any number of spaces between tokens?

  • H2O should be able to tolerate any of the supported lineends?

  • Tab is not legal as whitespace between tokens

libsvm data format

    [label] [index1]:[value1] [index2]:[value2] ...
    [label] [index1]:[value1] [index2]:[value2] ...

label: Sometimes referred to as 'class'. The response/output. Usually integers (positive or negative). QUESTION: is there something for a "missing" class to cause NAs?

index: Ordered indexes. Usually continuous integers.

value: The data for training.
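
A minimal sketch of parsing one libsvm line under the rules listed above (illustrative, not H2O code):

    # Parse "label idx:val idx:val ..." into (label, {index: value}).
    # Indexes must be ascending positive integers; values are numeric.
    def parse_libsvm_line(line):
        tokens = line.split()
        label = float(tokens[0])
        features, prev = {}, 0
        for tok in tokens[1:]:
            idx_str, val_str = tok.split(":")
            idx = int(idx_str)
            if idx <= prev:
                raise ValueError("indexes must be ascending positive integers")
            features[idx] = float(val_str)
            prev = idx
        return label, features

    # parse_libsvm_line("1 1:2 2:1") -> (1.0, {1: 2.0, 2: 1.0})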

Some current 'issues'

  1. H2O thinks the single characters '+' and '-' are numbers. (check) This would be consistent with the number E-BNF above, where {DIGIT} allows zero digits.

Comments

If the document follows the above mentioned rules, we should always parse it correctly. Violations should be reported; please assign the bugs immediately to me (Peta). We will, however, parse a lot more than specified here, but in general when you diverge from these rules, the behavior is unspecified.

Subtle issues in other definitions

For comparison and debugging of compatibility issues only.

There are a number of subtle variant definitions that we are currently not compatible with. I will list them here, since people may assume we are compatible, or we may want to adopt these details in the future. (Actually, reading them closely against our text above, I think we do support the escaping with quotes as described here?) You can read these closely and then find the text above that is not exactly the same behavior.

  • Unix style programs escape commas by inserting a single backslash character before each comma, i.e. a single cell with the text apples, carrots, and oranges becomes apples\, carrots\, and oranges. If a field contains a backslash character, then an additional backslash character is inserted before it.

  • Unix style programs have two distinct ways of escaping end of line characters within a field. Some Unix style programs use the same escape method as with commas, and just insert a single backslash before the end of line character. Other Unix style programs replace the end of line character using C-style character escaping, where CR becomes \r and LF becomes \n.

  • In an Excel escaped CSV file, in fields containing a double quote, the double quote is escaped by replacing the single double quote with two double quotes.

  • Some files use an escaping format that is a mixture of the Excel escaping and Unix escaping where fields with commas are embedded in a set of double quotes like the Excel escaping, but fields containing double quotes are escaped by inserting a single backslash character before each double quote like the Unix style comma escaping.

  • Some trim all leading and trailing whitespace characters (spaces and tabs) adjacent to commas or record delimiters.

  • Excel in Northern Europe may use ; as a separator.

  • The end of line characters used for record delimiters are sometimes changed to other characters like a semicolon.

  • Non-printable characters in a field are sometimes escaped using one of several C-style character escape sequences: \### and \o### octal, \x## hex, \d### decimal, and \u#### Unicode.

Pending H2O issues that need clarification

We support CR, CR+LF, or LF line endings. I'll call those the EOL characters.

We haven't said what H2O does with mixtures of those in a dataset.

We no longer have the ability to escape EOL characters, because of the unmatched double quote "feature": we close an enum on an EOL, to avoid an unmatched quote causing a buffer overrun by consuming multiple rows for that one enum (until a closing quote is found).

I don't know if H2O does this closure assuming any legal EOL, or just the current EOL.

I think the current H2O treats any of the endings as a valid EOL at any time.

Brandon and Tomas have discussed adding a user choice for the parse EOL, like the current column separator (an EOL separator).

It would have a "guess" value, but then ONLY that value would be used for EOL. That would allow embedded "other" EOLs in enums, for whatever reason, and not mess up the closure for missing quotes.

Also:

Brandon says LF+CR is an EOL thing some people support, but we don't support that (not tested). Should we add that as legal?

On guessing the "right" EOL symbol for the dataset, how about: the first EOL "thing" you run into is considered "the EOL".

If you hit another EOL thing that's different after that, it's considered an error? (I'm thinking for today, if we don't add the user-controlled EOL choice above.)

This would give users feedback on whether their dataset is "correct" relative to the definition of "correct" covered by current H2O tests (which don't mix EOLs in a dataset).

Right now, we don't have any H2O-to-user feedback (or description) on these issues.
