rfc.txt


                                                            Christoph Vilsmeier
                                                                  January 2018


            The Tabular Data (TDAT) Data Interchange Format


Abstract

    TDAT is a lightweight, text-based, language-independent data interchange
    format. It was derived from CSV (Comma Separated Values). TDAT defines a
    small set of formatting rules for the portable representation of tabular
    data.


Status of This Memo

    This is a preliminary draft version.


Copyright Notice

    TODO


Table of Contents

    1. Introduction
      1.1. Conventions Used in This Document
    2. TDAT Grammar
    3. Values
      3.1. Integer Values
      3.2. Floating Point Values
      3.3. Boolean Values
      3.4. String Values
      3.5. Time Values
    4. String and Character Issues
      4.1. Character Encoding
      4.2. Whitespace Characters
    5. Parsers and Generators
    6. Examples
    7. References


1. Introduction

    TDAT is a text format for the serialization of tabular data. It is derived
    from CSV (Comma Separated Values). A TDAT model contains zero or more
    tables, as shown in the following sample:

        products
        |id:i    |name:s          |in_stock:b     |dateOfEntry:t
        |1       |"The Zen"       |true           |2014-02-12T13:14:15.116
        |2       |"Zweigelt Blau" |true           |2016-10-11T08:37:16.143

    A TDAT table has a name, a column header and a list of data rows. Each
    column has a name and a type. Valid types are:

        'i' for integer numbers

        'f' for floating point numbers
        
        'b' for booleans
        
        's' for strings
        
        't' for times

    A string is a sequence of zero or more Unicode characters [UNICODE]. Note
    that this citation references the latest version of Unicode rather than a
    specific release. It is not expected that future changes in the Unicode
    specification will impact the syntax of TDAT.

    A time is represented as a ISO8601 [ISO8601] date/time value without
    timezone information, in UTC time coordinates.

    TDAT's design goals were for it to be minimal, portable, textual,
    type-safe, well-defined and human-readable.


1.1. Conventions Used in This Document

    The grammatical rules in this document are to be interpreted as described
    in ABNF [RFC5234].


2. TDAT Grammar

    A TDAT model is a ordered sequence of zero or more tables.

    A TDAT table has a name and an ordered sequence of zero or more columns
    and a ordered sequence of zero or more data rows. The table name must be
    unique within a model.

    A TDAT column has a name and a type, separated by a colon ':'. The column
    name must be unique within a table. The type must be one of the supported
    types: 'i', 'f', 'b', 's', 't'. Whitespace characters before the column
    name and after the column type are allowed and must be silently removed by
    parsers.

    A TDAT data row is a ordered sequence of zero or more cells. The number of
    cells in a row must be equal to the number of columns.

    A TDAT cell starts with the separator '|', followed by the cell value.
    The type of the n-th cell value is derived from the type of the n-th
    column. Whitespace characters before or after the cell value are allowed
    and must be silently removed by parsers. Empty cells are allowed, the
    value for an empty cell is null (not set, undefined, nil).

    Empty lines are allowed and must be silently ignored by parsers. A line is
    empty if it contains only whitespace characters followed by a newline
    character.

    White space characters are space U+0020, horizontal tab U+0009 and
    carriage return U+000D. (Note that newline U+000A is not a whitespace
    character).

        ws = *( U+0020 /              ; Space
                U+0009 /              ; Horizontal tab
                U+000D )              ; Carriage return


3. Values

3.1. Integer Values

    The representation of integer numbers is similar to that used in most
    programming languages. An integer is represented in base 10 using decimal
    digits. It may be prefixed with an optional minus sign. Leading zeros are
    not allowed.

    An exponent part begins with the letter E in uppercase or lowercase, which
    may be followed by a plus or minus sign. The E and optional sign are
    followed by one or more digits.

    Numeric values that cannot be represented in the grammar below (such as
    Infinity and NaN) are not allowed.

        integer   = [ minus ] digits [ exp ]

        digits    = zero / ( digit1-9 *DIGIT )

        digit1-9  = U+0031 - U+0039       ; 1-9

        exp       = e [ minus / plus ] 1*DIGIT

        e         = U+0065 / U+0045       ; e E

        minus     = U+002D                ; -

        plus      = U+002B                ; +

        zero      = U+0030                ; 0

    This specification allows implementations to set limits on the range of
    integer values accepted. Good interoperability can be achieved by using
    and assuming 64 bit representations.


3.2. Floating Point Values

    The representation of floating point values is similar to that used in
    most programming languages. A floating point value is represented in base
    10 using decimal digits. It contains an integer component that may be
    prefixed with an optional minus sign, which may be followed by a fraction
    part and/or an exponent part. Leading zeros are not allowed.

    A fraction part is a decimal point followed by one or more digits.

    An exponent part begins with the letter E in uppercase or lowercase, which
    may be followed by a plus or minus sign. The E and optional sign are
    followed by one or more digits.

        float          = [ minus ] digits [ frac ] [ exp ]

        digits         = zero / ( digit1-9 *DIGIT )

        digit1-9       = U+0031 - U+0039      ; 1-9

        e              = U+0065 / U+0045      ; e E

        exp            = e [ minus / plus ] 1*DIGIT

        frac           = decimal-point 1*DIGIT

        decimal-point  = U+002E               ; .

        minus          = U+002D               ; -

        plus           = U+002B               ; +

        zero           = U+0030               ; 0

    This specification allows implementations to set limits on the range of
    floating point values accepted. Good interoperability can be achieved by
    using 64 bit representations.


3.3. Boolean Values

    The representation of boolean values is similar to conventions used in the
    C family of programming languages. A boolean value is represented as
    "true" or "false".

        boolean  = "true" / "false


3.4. String Values

    The representation of strings is similar to conventions used in the C
    family of programming languages. A string begins and ends with quotation
    marks. All Unicode characters may be placed within the quotation marks,
    except for the characters that must be escaped: quotation mark, reverse
    solidus, and the control characters (U+0000 through U+001F).

    Any character may be escaped. If the character is in the Basic
    Multilingual Plane (U+0000 through U+FFFF), then it may be represented as
    a six-character sequence: a reverse solidus, followed by the lowercase
    letter u, followed by four hexadecimal digits that encode the character's
    code point. The hexadecimal letters A through F can be uppercase or
    lowercase. So, for example, a string containing only a single reverse
    solidus character may be represented as "\u005C".

    Alternatively, there are two-character sequence escape representations of
    some popular characters. So, for example, a string containing only a
    single reverse solidus character may be represented more compactly as
    "\\".

    To escape an extended character that is not in the Basic Multilingual
    Plane, the character is represented as a 12-character sequence, encoding
    the UTF-16 surrogate pair. So, for example, a string containing only the G
    clef character (U+1D11E) may be represented as "\uD834\uDD1E".

        string      = quotation-mark *char quotation-mark

        char        = unescaped / 
                        escape ( U+0022 /          ; "    quotation mark
                                 U+005C /          ; \    reverse solidus
                                 U+002F /          ; /    solidus
                                 U+0062 /          ; b    backspace
                                 U+0066 /          ; f    form feed
                                 U+006E /          ; n    line feed
                                 U+0072 /          ; r    carriage return
                                 U+0074 /          ; t    tab
                                 U+0075 4HEXDIG )  ; uXXXX

        escape      = U+005C               ; \

        quotation-mark = U+0022            ; "

        unescaped   = U+0020 - U+0021 / U+0023 - U+005B / U+005D U+FFFF


3.5. Time Values

    The representation of time values is similar to ISO-8601. A time value
    starts with the date part, followed by a 'T', followed by the time part.
    Time values are represented without time zone information. Best
    interoperability is achieved by using and assuming UTC exclusively.

        time    = date t time

        date    = year "-" month "-" day
 
        year    = 4DIGIT               ; 0000-9999

        month   = 2DIGIT               ; 01-12

        day     = 2DIGIT               ; 01-N, N based on year/month

        time    = hour ":" minute ":" second [frac]

        hour    = 2DIGIT               ; 00-23

        minute  = 2DIGIT               ; 00-59

        second  = 2DIGIT               ; 00-59

        frac    = decimal-point 1*DIGIT
        

4. String and Character Issues

4.1. Character Encoding

    TDAT text must be encoded using UTF-8 [RFC3629]. There are no other
    encodings allowed.

    Implementations must not add a byte order mark (U+FEFF) to the beginning
    of a TDAT text. However, parsers may ignore the presence of a byte order
    mark rather than treating it as an error.


4.2. Whitespace Characters

    TDAT cells may contain trailing whitespace characters. These whitespace
    characters are used to provide padding to cells (for better readability by
    humans). These whitespace characters are insignificant. A TDAT cell with
    whitespace characters is equivalent to a cell without trailing whitespace
    characters.
    

5. Parsers and Generators

    A TDAT parser transforms a TDAT text into another representation. A TDAT
    parser must accept all texts that conform to the TDAT grammar. A TDAT
    parser may accept non-TDAT forms or extensions.

    A parser implementation may set limits on

        - the size of texts that it accepts

        - the range and precision of numbers

        - the number of columns in a table

        - the number of rows in a table

        - the length and character contents of strings.

    A TDAT generator produces TDAT text. The resulting text must strictly
    conform to the TDAT grammar.


6. Examples

    This is a TDAT model with two tables:

        teachers
        |id:i   |name:s       |birth:t                   |male:b
        |1      |"John Doe"   |1972-07-15T10:11:12.333   |true
        |2      |"Mary Doe"   |1984-04-05T11:12:13.444   |false

        courses
        |id:i|name:s|room:s
        |1|"Biology"|"S-30"
        |2|"Mathematics"|"N-12"
        |3|"Mathematics"|

    In the first table "teachers", cells are padded for better human
    readability. The second table "courses" is not padded.

    In the second table "courses", the third row's room cel is empty.
    Therefore, the room value is null (not set, undefined, nil).

    This is a TDAT model with two empty tables:

        products

        owners

    Both tables have zero columns and zero rows.



7. References

    [UNICODE]  The Unicode Consortium, "The Unicode Standard",
               <http://www.unicode.org/versions/latest/>.

    [ISO8601]  "Data elements and interchange formats -- Information
               interchange -- Representation of dates and times", ISO
               8601:1988(E), International Organization for
               Standardization, June, 1988.

    [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
               Specifications: ABNF", STD 68, RFC 5234,
               DOI 10.17487/RFC5234, January 2008,
               <https://www.rfc-editor.org/info/rfc5234>.

    [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
               10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
               2003, <https://www.rfc-editor.org/info/rfc3629>.


Contributors

    This RFC was written by Christoph Vilsmeier.


Author's Address

    Christoph Vilsmeier 
    Email: cv@vilsmeier-consulting.de