Christoph Vilsmeier
January 2018
The Tabular Data (TDAT) Data Interchange Format
Abstract
TDAT is a lightweight, text-based, language-independent data interchange
format. It was derived from CSV (Comma Separated Values). TDAT defines a
small set of formatting rules for the portable representation of tabular
data.
Status of This Memo
This is a preliminary draft version.
Copyright Notice
TODO
Table of Contents
1. Introduction
1.1. Conventions Used in This Document
2. TDAT Grammar
3. Values
3.1. Integer Values
3.2. Floating Point Values
3.3. Boolean Values
3.4. String Values
3.5. Time Values
4. String and Character Issues
4.1. Character Encoding
4.2. Whitespace Characters
5. Parsers and Generators
6. Examples
7. References
1. Introduction
TDAT is a text format for the serialization of tabular data. It is derived
from CSV (Comma Separated Values). A TDAT model contains zero or more
tables, as shown in the following sample:
products
|id:i |name:s          |in_stock:b |dateOfEntry:t
|1    |"The Zen"       |true       |2014-02-12T13:14:15.116
|2    |"Zweigelt Blau" |true       |2016-10-11T08:37:16.143
A TDAT table has a name, a column header and a list of data rows. Each
column has a name and a type. Valid types are:
'i' for integer numbers
'f' for floating point numbers
'b' for booleans
's' for strings
't' for times
A string is a sequence of zero or more Unicode characters [UNICODE]. Note
that this citation references the latest version of Unicode rather than a
specific release. It is not expected that future changes in the Unicode
specification will impact the syntax of TDAT.
A time is represented as an ISO 8601 [ISO8601] date/time value without
time zone information, expressed in UTC.
TDAT's design goals were for it to be minimal, portable, textual,
type-safe, well-defined and human-readable.
1.1. Conventions Used in This Document
The grammatical rules in this document are to be interpreted as described
in ABNF [RFC5234].
2. TDAT Grammar
A TDAT model is an ordered sequence of zero or more tables.
A TDAT table has a name, an ordered sequence of zero or more columns,
and an ordered sequence of zero or more data rows. The table name must be
unique within a model.
A TDAT column has a name and a type, separated by a colon ':'. The column
name must be unique within a table. The type must be one of the supported
types: 'i', 'f', 'b', 's', 't'. Whitespace characters before the column
name and after the column type are allowed and must be silently removed by
parsers.
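The column rules above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation. The function name parse_header is hypothetical, and the sketch assumes, as in the sample in Section 1, that every column definition in the header line is preceded by the separator '|'.

```python
VALID_TYPES = {"i", "f", "b", "s", "t"}
TDAT_WS = " \t\r"  # space, horizontal tab, carriage return


def parse_header(line):
    """Parse a header line such as '|id:i |name:s' into (name, type) pairs."""
    line = line.strip(TDAT_WS)
    if not line.startswith("|"):
        raise ValueError("header line must start with '|'")
    columns = []
    seen = set()
    for field in line[1:].split("|"):
        # Whitespace before the name and after the type is removed.
        name, sep, type_code = field.strip(TDAT_WS).partition(":")
        if not sep or not name or type_code not in VALID_TYPES:
            raise ValueError(f"invalid column definition: {field!r}")
        if name in seen:
            raise ValueError(f"duplicate column name: {name!r}")
        seen.add(name)
        columns.append((name, type_code))
    return columns
```

For the header of the "products" sample, this returns the four (name, type) pairs in column order.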
A TDAT data row is an ordered sequence of zero or more cells. The number of
cells in a row must be equal to the number of columns.
A TDAT cell starts with the separator '|', followed by the cell value.
The type of the n-th cell value is derived from the type of the n-th
column. Whitespace characters before or after the cell value are allowed
and must be silently removed by parsers. Empty cells are allowed; the
value of an empty cell is null (not set, undefined, nil).
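The cell rules can be illustrated with a short Python sketch that splits one data row into raw cell texts. The name split_cells is hypothetical. This document does not state whether a '|' character inside a quoted string separates cells; the sketch assumes that it does not, and therefore tracks quotation marks and escapes while scanning.

```python
TDAT_WS = " \t\r"  # space, horizontal tab, carriage return


def split_cells(line):
    """Split a data row into raw cell texts; an empty cell becomes None."""
    line = line.strip(TDAT_WS)
    if not line.startswith("|"):
        raise ValueError("data row must start with '|'")
    cells, current = [], []
    in_string = escaped = False
    for ch in line[1:]:
        if in_string:
            current.append(ch)
            if escaped:
                escaped = False          # character after a backslash
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False        # closing quotation mark
        elif ch == "|":
            cells.append("".join(current))
            current = []
        else:
            if ch == '"':
                in_string = True         # opening quotation mark
            current.append(ch)
    cells.append("".join(current))
    # Padding is insignificant; a cell with no content is null.
    return [cell.strip(TDAT_WS) or None for cell in cells]
```

A caller would then compare the length of the returned list with the number of columns and convert each cell according to its column type.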
Empty lines are allowed and must be silently ignored by parsers. A line is
empty if it contains only whitespace characters followed by a newline
character.
Whitespace characters are space U+0020, horizontal tab U+0009 and
carriage return U+000D. (Note that newline U+000A is not a whitespace
character.)
ws = *( U+0020 / ; Space
U+0009 / ; Horizontal tab
U+000D ) ; Carriage return
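As an illustration, the whitespace set and the empty-line rule might be expressed in Python as below. The helper names are hypothetical. Note that Python's str.strip() without arguments removes more characters than TDAT defines as whitespace (for example newline and vertical tab), so the sketch passes the set explicitly.

```python
TDAT_WS = " \t\r"  # U+0020, U+0009, U+000D


def strip_ws(text):
    """Remove leading and trailing TDAT whitespace only."""
    return text.strip(TDAT_WS)


def is_empty_line(line):
    """True if the line holds only TDAT whitespace before its newline."""
    return strip_ws(line.rstrip("\n")) == ""
```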
3. Values
3.1. Integer Values
The representation of integer numbers is similar to that used in most
programming languages. An integer is represented in base 10 using decimal
digits. It may be prefixed with an optional minus sign and followed by an
optional exponent part. Leading zeros are not allowed.
An exponent part begins with the letter E in uppercase or lowercase, which
may be followed by a plus or minus sign. The E and optional sign are
followed by one or more digits.
Numeric values that cannot be represented in the grammar below (such as
Infinity and NaN) are not allowed.
integer = [ minus ] digits [ exp ]
digits = zero / ( digit1-9 *DIGIT )
digit1-9 = U+0031 - U+0039 ; 1-9
exp = e [ minus / plus ] 1*DIGIT
e = U+0065 / U+0045 ; e E
minus = U+002D ; -
plus = U+002B ; +
zero = U+0030 ; 0
This specification allows implementations to set limits on the range of
integer values accepted. Good interoperability can be achieved by using
and assuming 64-bit representations.
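The integer grammar can be checked with a regular expression, as in the following Python sketch. The name parse_integer is hypothetical. The sketch makes two assumptions that go beyond the grammar: it limits values to the signed 64-bit range, and it rejects texts such as "5e-1" whose exponent yields a value that is not a whole number.

```python
import re
from decimal import Decimal

# [ minus ] digits [ exp ]; [0-9] is used because \d also matches
# non-ASCII decimal digits in Python.
INTEGER_RE = re.compile(r"-?(?:0|[1-9][0-9]*)(?:[eE][-+]?[0-9]+)?")
INT64_MIN, INT64_MAX = -(2 ** 63), 2 ** 63 - 1


def parse_integer(text):
    """Convert a TDAT integer cell to int, or raise ValueError."""
    if INTEGER_RE.fullmatch(text) is None:
        raise ValueError(f"not a TDAT integer: {text!r}")
    try:
        value = Decimal(text)  # exact, also for exponent forms
    except ArithmeticError as exc:
        raise ValueError(f"exponent too large: {text!r}") from exc
    if not INT64_MIN <= value <= INT64_MAX:
        raise ValueError(f"outside the 64-bit range: {text!r}")
    if value != value.to_integral_value():
        raise ValueError(f"not a whole number: {text!r}")
    return int(value)
```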
3.2. Floating Point Values
The representation of floating point values is similar to that used in
most programming languages. A floating point value is represented in base
10 using decimal digits. It contains an integer component that may be
prefixed with an optional minus sign, which may be followed by a fraction
part and/or an exponent part. Leading zeros are not allowed.
A fraction part is a decimal point followed by one or more digits.
An exponent part begins with the letter E in uppercase or lowercase, which
may be followed by a plus or minus sign. The E and optional sign are
followed by one or more digits.
float = [ minus ] digits [ frac ] [ exp ]
digits = zero / ( digit1-9 *DIGIT )
digit1-9 = U+0031 - U+0039 ; 1-9
e = U+0065 / U+0045 ; e E
exp = e [ minus / plus ] 1*DIGIT
frac = decimal-point 1*DIGIT
decimal-point = U+002E ; .
minus = U+002D ; -
plus = U+002B ; +
zero = U+0030 ; 0
This specification allows implementations to set limits on the range of
floating point values accepted. Good interoperability can be achieved by
using 64-bit representations.
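A Python sketch of the floating point grammar is shown below. The name parse_float is hypothetical. Python's built-in float() accepts forms that this grammar excludes (such as "NaN", ".5" and "1."), so the text is validated first. Treating a value that overflows to infinity as an error is an assumption of this sketch.

```python
import math
import re

# [ minus ] digits [ frac ] [ exp ]
FLOAT_RE = re.compile(
    r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?"
)


def parse_float(text):
    """Convert a TDAT floating point cell to float, or raise ValueError."""
    if FLOAT_RE.fullmatch(text) is None:
        raise ValueError(f"not a TDAT float: {text!r}")
    value = float(text)  # 64-bit IEEE 754 double
    if math.isinf(value):
        raise ValueError(f"outside the 64-bit range: {text!r}")
    return value
```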
3.3. Boolean Values
The representation of boolean values is similar to conventions used in the
C family of programming languages. A boolean value is represented as
"true" or "false".
boolean = "true" / "false"
3.4. String Values
The representation of strings is similar to conventions used in the C
family of programming languages. A string begins and ends with quotation
marks. All Unicode characters may be placed within the quotation marks,
except for the characters that must be escaped: quotation mark, reverse
solidus, and the control characters (U+0000 through U+001F).
Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be represented as
a six-character sequence: a reverse solidus, followed by the lowercase
letter u, followed by four hexadecimal digits that encode the character's
code point. The hexadecimal letters A through F can be uppercase or
lowercase. So, for example, a string containing only a single reverse
solidus character may be represented as "\u005C".
Alternatively, there are two-character sequence escape representations of
some popular characters. So, for example, a string containing only a
single reverse solidus character may be represented more compactly as
"\\".
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence, encoding
the UTF-16 surrogate pair. So, for example, a string containing only the G
clef character (U+1D11E) may be represented as "\uD834\uDD1E".
string = quotation-mark *char quotation-mark
char = unescaped /
escape ( U+0022 / ; " quotation mark
U+005C / ; \ reverse solidus
U+002F / ; / solidus
U+0062 / ; b backspace
U+0066 / ; f form feed
U+006E / ; n line feed
U+0072 / ; r carriage return
U+0074 / ; t tab
U+0075 4HEXDIG ) ; uXXXX
escape = U+005C ; \
quotation-mark = U+0022 ; "
unescaped = U+0020 - U+0021 / U+0023 - U+005B / U+005D - U+FFFF
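The escape rules can be illustrated with the following Python sketch, which decodes one quoted string token. The name decode_string is hypothetical. Joining a \uXXXX surrogate pair into one character by a round trip through UTF-16 is an implementation choice of the sketch, and rejecting a lone surrogate is an assumption.

```python
import re

SIMPLE_ESCAPES = {
    '"': '"', "\\": "\\", "/": "/", "b": "\b",
    "f": "\f", "n": "\n", "r": "\r", "t": "\t",
}
HEX4_RE = re.compile(r"[0-9a-fA-F]{4}")


def decode_string(token):
    """Decode a quoted TDAT string token into a Python str."""
    if len(token) < 2 or token[0] != '"' or token[-1] != '"':
        raise ValueError("string must be enclosed in quotation marks")
    body = token[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            esc = body[i + 1:i + 2]
            if esc == "u":
                digits = body[i + 2:i + 6]
                if HEX4_RE.fullmatch(digits) is None:
                    raise ValueError("invalid \\u escape")
                out.append(chr(int(digits, 16)))
                i += 6
            elif esc in SIMPLE_ESCAPES:
                out.append(SIMPLE_ESCAPES[esc])
                i += 2
            else:
                raise ValueError("invalid escape sequence")
        elif ch == '"' or ord(ch) < 0x20:
            raise ValueError("character must be escaped")
        else:
            out.append(ch)
            i += 1
    # Combine UTF-16 surrogate pairs; a lone surrogate raises
    # UnicodeDecodeError, which is a subclass of ValueError.
    text = "".join(out)
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

With this sketch, the tokens "\u005C" and "\\" both decode to a single reverse solidus, and "\uD834\uDD1E" decodes to the single character U+1D11E.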
3.5. Time Values
The representation of time values is similar to ISO 8601 [ISO8601]. A time value
starts with the date part, followed by a 'T', followed by the time part.
Time values are represented without time zone information. Best
interoperability is achieved by using and assuming UTC exclusively.
time = date "T" time-part
date = year "-" month "-" day
year = 4DIGIT ; 0000-9999
month = 2DIGIT ; 01-12
day = 2DIGIT ; 01-N, N based on year/month
time-part = hour ":" minute ":" second [ frac ]
hour = 2DIGIT ; 00-23
minute = 2DIGIT ; 00-59
second = 2DIGIT ; 00-59
frac = decimal-point 1*DIGIT
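As an illustration, a time value could be parsed in Python as sketched below. The name parse_time is hypothetical. The result is a naive datetime that the caller is expected to treat as UTC. Two limitations are assumptions of the sketch, not of the format: fractional digits beyond microsecond precision are truncated, and year 0000 is rejected because Python's datetime starts at year 1.

```python
import re
from datetime import datetime

TIME_RE = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})T"
    r"([0-9]{2}):([0-9]{2}):([0-9]{2})(?:\.([0-9]+))?"
)


def parse_time(text):
    """Convert a TDAT time cell to a naive datetime (UTC by convention)."""
    match = TIME_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a TDAT time: {text!r}")
    year, month, day, hour, minute, second = (
        int(group) for group in match.groups()[:6]
    )
    frac = match.group(7) or ""
    # Scale the fraction to microseconds, truncating extra digits.
    microsecond = int(frac[:6].ljust(6, "0"))
    # datetime() raises ValueError for out-of-range fields,
    # for example month 13 or February 30.
    return datetime(year, month, day, hour, minute, second, microsecond)
```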
4. String and Character Issues
4.1. Character Encoding
TDAT text must be encoded using UTF-8 [RFC3629]. There are no other
encodings allowed.
Implementations must not add a byte order mark (U+FEFF) to the beginning
of a TDAT text. However, parsers may ignore the presence of a byte order
mark rather than treating it as an error.
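The encoding rules might be applied in Python as in the sketch below. The function names are hypothetical. The reader takes the lenient option permitted above and skips a leading byte order mark; the writer never emits one.

```python
import codecs


def decode_tdat_bytes(data):
    """Decode TDAT bytes as strict UTF-8, tolerating a leading BOM."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Invalid UTF-8 raises UnicodeDecodeError (a ValueError subclass).
    return data.decode("utf-8")


def encode_tdat_text(text):
    """Encode TDAT text as UTF-8 without a byte order mark."""
    return text.encode("utf-8")
```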
4.2. Whitespace Characters
TDAT cells may contain trailing whitespace characters, which pad cells for
better readability by humans. These whitespace characters are
insignificant: a TDAT cell with trailing whitespace characters is
equivalent to the same cell without them.
5. Parsers and Generators
A TDAT parser transforms a TDAT text into another representation. A TDAT
parser must accept all texts that conform to the TDAT grammar. A TDAT
parser may accept non-TDAT forms or extensions.
A parser implementation may set limits on
- the size of texts that it accepts
- the range and precision of numbers
- the number of columns in a table
- the number of rows in a table
- the length and character contents of strings.
A TDAT generator produces TDAT text. The resulting text must strictly
conform to the TDAT grammar.
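A generator can be sketched in Python as follows. All names are hypothetical, and the sketch produces unpadded output. It makes several assumptions: floating point values are written with Python's repr(), times are written with millisecond precision, and characters outside the Basic Multilingual Plane are always escaped as surrogate pairs.

```python
import math
from datetime import datetime


def encode_string(text):
    """Quote a str, escaping what the string grammar requires."""
    out = ['"']
    for ch in text:
        cp = ord(ch)
        if ch in ('"', "\\"):
            out.append("\\" + ch)
        elif cp < 0x20:
            out.append("\\u%04X" % cp)
        elif cp > 0xFFFF:
            cp -= 0x10000  # split into a UTF-16 surrogate pair
            out.append("\\u%04X\\u%04X"
                       % (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)))
        else:
            out.append(ch)
    out.append('"')
    return "".join(out)


def format_value(value, type_code):
    """Render one cell value; None becomes an empty cell."""
    if value is None:
        return ""
    if type_code == "i":
        return str(int(value))
    if type_code == "f":
        if not math.isfinite(value):
            raise ValueError("Infinity and NaN are not allowed")
        return repr(float(value))
    if type_code == "b":
        return "true" if value else "false"
    if type_code == "s":
        return encode_string(value)
    if type_code == "t":
        if not isinstance(value, datetime):
            raise ValueError("time cells require a datetime")
        return value.isoformat(timespec="milliseconds")
    raise ValueError(f"unknown column type: {type_code!r}")


def generate_table(name, columns, rows):
    """Render a table from (name, type) columns and row tuples."""
    lines = [name]
    if columns:
        lines.append("".join(f"|{n}:{t}" for n, t in columns))
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row length must equal the column count")
        lines.append("".join(
            "|" + format_value(v, t) for v, (_, t) in zip(row, columns)
        ))
    return "\n".join(lines) + "\n"
```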
6. Examples
This is a TDAT model with two tables:
teachers
|id:i |name:s     |birth:t                 |male:b
|1    |"John Doe" |1972-07-15T10:11:12.333 |true
|2    |"Mary Doe" |1984-04-05T11:12:13.444 |false
courses
|id:i|name:s|room:s
|1|"Biology"|"S-30"
|2|"Mathematics"|"N-12"
|3|"Mathematics"|
In the first table "teachers", cells are padded for better human
readability. The second table "courses" is not padded.
In the second table "courses", the third row's room cell is empty.
Therefore, the room value is null (not set, undefined, nil).
This is a TDAT model with two empty tables:
products
owners
Both tables have zero columns and zero rows.
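The overall structure of the examples above can be recovered with a short Python sketch that groups lines into tables without converting any values. The name parse_structure is hypothetical. The sketch relies on a convention inferred from the examples rather than stated by the grammar: a non-empty line that does not start with '|' is a table name, the first '|' line after it is the column header, and the remaining '|' lines are data rows.

```python
TDAT_WS = " \t\r"  # space, horizontal tab, carriage return


def parse_structure(text):
    """Group the lines of a TDAT text into tables of raw lines."""
    tables = []
    for raw in text.split("\n"):
        line = raw.strip(TDAT_WS)
        if not line:
            continue  # empty lines are ignored
        if not line.startswith("|"):
            if any(table["name"] == line for table in tables):
                raise ValueError(f"duplicate table name: {line!r}")
            tables.append({"name": line, "header": None, "rows": []})
        elif not tables:
            raise ValueError("cell line before any table name")
        elif tables[-1]["header"] is None:
            tables[-1]["header"] = line
        else:
            tables[-1]["rows"].append(line)
    return tables
```

Applied to the model with two empty tables, this yields two entries whose header is None and whose row list is empty.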
7. References
[UNICODE] The Unicode Consortium, "The Unicode Standard",
<http://www.unicode.org/versions/latest/>.
[ISO8601] "Data elements and interchange formats -- Information
interchange -- Representation of dates and times", ISO
8601:1988(E), International Organization for
Standardization, June, 1988.
[RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", STD 68, RFC 5234,
DOI 10.17487/RFC5234, January 2008,
<https://www.rfc-editor.org/info/rfc5234>.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
2003, <https://www.rfc-editor.org/info/rfc3629>.
Contributors
This RFC was written by Christoph Vilsmeier.
Author's Address
Christoph Vilsmeier
Email: cv@vilsmeier-consulting.de