-
Notifications
You must be signed in to change notification settings - Fork 980
UTC Timestamp Data Type Proposal
Drill provides the TIMESTAMP
data type which is a millisecond-resolution timestamp
in the local time zone. Proposed is a new UTCTIME
timestamp with nanosecond resolution
in the UTC time zone.
With a UTC timestamp available, we will find it necessary to revisit and rationalize Drill's other date/time data types. Proposed are new types to accurately capture the idea of a zone-less date and time, and ways to define the existing data types (with their quirks) to preserve backward compatibility.
SQL does not provide syntax for declaring data types. This note uses a simple extension:
FN(type1) --> type2
Where type
and type2
are Drill data types. The -->
notation simply
means that FN
maps type1
into type2
.
Some Drill functions use a data type as a literal. In this case, we trust the reader to understand the difference between a type and literal:
CAST(BIGINT AS UTCTIME) --> UTCTIME
Here, the first UTCTIME
is a literal.
UTCTIME
is a data type for a value, represented as a 64-bit integer,
that gives the number of nanoseconds since the epoch, UTC.
Drill has two informal groupings of data types: inferred types and
explicit types. Since Drill operates without schema, readers map from incoming
data to one of the inferred types (typically VARCHAR
, INT
or BIGINT
, FLOAT8
and so on.) The explicit types add behavior, but typically cannot be inferred
from an input. Explicit types are things like (VARDECIMAL
, TIMESTAMP
, DATE
,
TIME
, INTERVAL
and so on.) UTCTIME
is an explicit data type: the user must
write a SQL expression to convert an inferred BIGINT
column to a UTCTIME
value.
We add a new keyword to Drill's existing TIMESTAMP
syntax:
-
TIMESTAMP string --> TIMESTAMP
-- Declare a constant in the local time zone. -
UTCTIME string --> UTCTIME
-- Declare a constant in the UTC time zone. The string is in ISO notation and may include a timezone offset portion.
UTCTIME
is directly converable to and from the BIGINT
data type:
CAST(UTCTIME AS BIGINT) --> BIGINT
CAST(BIGINT AS UTCTIME) --> UTCTIME
The conversion is direct: the numeric value of the BIGINT
becomes the
nanoseconds since the epoch value for UTCTIME
and visa-vera.
The following invariants hold for BIGINT
column bi
and UTCTIME
column utc
:
CAST( CAST(utc AS BIGINT) AS UTCTIME) = utc --> TRUE
CAST( CAST(bi AS UTCTIME) AS BIGINT) = bi --> TRUE
The following math operations are extended for UTCTIME
:
-
UTCTIME +/- INTERVAL --> UTCTIME
-- Add or subtract a time interval to/from a UTC timestamp giving another UTC timestamp. (Same asDATE_ADD
below.) -
UTCTIME +/- BIGINT --> UTCTIME
-- Add or subtract a given number of nanoseconds to/from a UTC timestamp giving another UTC timestamp. (Same asDATE_ADD
below.) -
UTCTIME - UTCTIME --> BIGINT
-- Subtract one UTC timestamp from another to produce a difference in nanoseconds. (See alsoDATE_SUB
below.) -
UTCTIME relop UTCTIME
-- Compare two UTC times. Comparision is based on the underlying 64-bit integer values.
Drill provides a number of date/time functions
which are extended for UTCTIME
:
-
AGE(UTCTIME) --> INTERVAL
-- Returns the span of time between the given time and now. -
AGE(UTCTIME, UTCTIME) --> INTERVAL
-- Returns the span beween two UTC times. -
DATE_ADD(UTCTIME, INTERVAL) --> UTCTIME
-- Adds an interval to a UTC timestamp giving a new UTC timestamp. -
DATE_DIFF(UTCTIME, UTCTIME) --> INTERVAL
-- Same asAGE
. -
DATE_PART(keyword, UTCTIME) --> INTEGER
-- Returns the date part of a UTC timestamp. -
DATE_SUB(UTCTIME, INTERVAL) --> UTCTIME
-- Subtracts a date/time interval from a UTC timestamp, giving another UTC timestamp. -
EXTRACT(arg FROM UTCTIME) --> DOUBLE
-- Returns the value of a date/time field.
Note that Drill intervals are accurate to milliseconds. To compute the precise difference between
two UTC timestamps, convert them to BIGINT
first, to get another BIGINT
which represents
the difference in milliseconds.
The following new functions complement existing functions.
-
NOW_UTC --> UTCTIME
is parallel toCURRENT_TIMESTAMP --> TIMESTAMP
andNOW --> TIMESTAMP
, produces the current time, UTC to a resolution of nanoseconds.
Drill has other "CURRENT" functions (CURRENT_TIME
, CURRENT_DATE
, etc.) We omit UTC
versions because UTC is used in use cases here current dates and times are seldom needed.
The values can be computed if needed.
New functions to generalize existing functions:
-
GROUP_TIME(UTCTIME BY arg) --> UTCTIME
, wherearg
is one ofDAY
,MONTH
,YEAR
, or isINTEGER unit
whereunit
is one of the units above, optionally with anS
appended (e.g.5 DAYS
.) -- Truncates the time to the given unit so that all time values within that same range map to the time at the start of that range. -
ROUND_TIME(UTCTIME TO arg) --> UTCTIME
, where arg is as defined above. -- As above, but rounds rather than truncates.
Here is a current Drill query which uses a BIGINT
as a stand-in for the
missing UTCTIME
data type:
SELECT hostname,
time_bucket_ns(timestamp_ns, 36000000) as hour_ts,
AVG(usage_user) as avg_metric
FROM dfs.test.`tsbs_test_large.json`
WHERE (timestamp_ns / 1000000000) BETWEEN 1451606400 AND (1451606400 + 3600*24)
GROUP BY hostname, hour_ts
And the proposed Drill version after the revisions proposed here:
SELECT hostname,
GROUP_TIME(timestamp_ts BY HOUR) as hour_ts,
AVG(usage_user) as avg_metric
FROM ...
WHERE timestamp_ts >= UTCTIME '2016-01-01T00:37:12Z'
AND timestamp_ts < UTCTIME '2016-01-01T00:37:12Z'
+ INTERVAL(1 DAY)
GROUP BY hour_ts
The UTCTIME
is represented by an 8-byte signed integer; the same as the BIGINT
data
type.
Some basic implementation details:
- Add
UTCTIME
to theMinorType
enum. - Add
UTCTIME
to theTypes
class, metadata classes and so on. - Create a
UtcTimeVector
and associated nullable and repeated vectors, holders and so on. - Create
UTCTIME
column readers and writers (in both new and legacy formats.) Use a Javalong
to store the timestamp at the API level. - Implement each of the proposed Drill functions.
- Implement code generation for the proposed math operations.
Note that the changes must be done in a way that are backward compatible since Drill provides an ODBC driver which we are no longer able to update.
A robust product will provide conversion to/from Drill's existing data types.
The Drill TIMESTAMP
class is defined as a timestamp in local time.
However since Drill previously had no UTC time type, people sometimes
used TIMESTAMP
as if it were UTC. Still,
Drill functions do assume local time, hence
TIMESTAMP
behavior is very confusing to most people.
It is possible (though perhaps not wise) to convert a UTCTIME
to a
Drill TIMESTAMP
and visa-versa:
UTC_TO_LOCAL(UTCTIME) --> TIMESTAMP
rounds the nanosecond resolution to the closest
millisecond, then performs the required conversion from the UTC time zone to the
local time zone using the Java 8 timezone operations.
LOCAL_TO_UTC(TIMESTAMP) --> UTCTIME
performs timezone conversion from local time
to UTC, then multiplies the value by a million to convert from milliseconds to
nanoseconds.
Alternative syntax, since SQL allows syntactic structures inside functions:
CONVERT_TIME(UTCTIME TO LOCAL) --> TIMESTAMP
CONVERT_TIME(TIMESTAMP TO UTC) --> UTCTIME
With the obvious no-op transformations:
CONVERT_TIME(UTCTIME TO UTC) --> UTCTIME
CONVERT_TIME(TIMESTAMP TO LOCAL) --> TIMESTAMP
The above syntax could be extended to take a timezone name in place of the two kewords. However, it is beyond the scope of this project to convert to/from timezones other than the local time as Drill provides no data types to represent other timezones.
For those people who have (mistakenly) treated as TIMESTAMP
as UTC, the following
(undocumented, except for here) workaround is possible to undo the error:
CAST( CAST(TIMESTAMP AS BIGINT) AS UTCTIME) --> UTCTIME
In SQL, DATE
is a concept independent of timezones: it is simply a representation
of the concept of a day on a calendar, it is not a date in a specific location.
Drill, however, represents a DATE
as a millisecond timestamp of midnight relative
to the epoch in the local timezone. This representation is unfortunate as it leads
to endless confusion. There are two interpretations of a DATE
:
- As a calendar date
- As an instant in the local time zone at midnight on the given date.
Unfortunately, these two interpretations give different results with working with UTC
dates. Under the calendar date interpretation, converting from UTC to a DATE
would
produce the date in the UTC timezone on which the instant occurred. Under the local
time interpretation, we would get the date in the local time zone on which the instant
occurred, which may be different.
Fixing this issue is, however, beyond the scope of this project. Instead, we live with the muddle by providing two functions:
localDate(utcTimeCol) AS localDate
utcDate(utcTimeCol) as utcDate
localDate(UTCTIME) --> DATE
performs UTC-to-local time conversion as described above,
then truncates the value to a date value.
utcDate(UTCTIME) --> DATE
omits the timezone conversion and simply truncates the
valute to a date value. Users must be careful not to interpret the DATE
's timezone.
In SQL, a TIME
is simply a clock reading: a duration since midnight in a relative,
unspecified time zone. In Drill, the time is assumed to be in the local time zone
to be consistent with DATE
. That is, add a DATE
and a TIME
to produce a
TIMEZONE
in local time.
In fact, a TIME
is just an offset from midnight. But, given the historical
interpretation in Drill, we also need to conversion functions:
localTime(utcTimeCol) AS localTime
utcTime(utcTimeCol) AS utcTime
localTime(UTCTIME) --> TIME
performs UTC-to-local time conversion, then truncates
the date portion, leving only the offset from midnight local time.
utcDate(UTCTIME) --> TIME
performs only the truncation giveing the offset from
midnight UTC.
With the two UTC functions, we end up with two numbers (DATE
and TIME), which, if added, would produce a UTC timestamp in milliseconds for which Drill has no declared data type. Multiplying by a million would recover a (rounded)
UTCTIME`.
-
AGE(UTCTIME, TIMESTAMP) --> INTERVAL
-- Converts local time to UTC, then works as above. -
AGE(TIMESTAMP, UTCTIME) --> INTERVAL
-- As above. -
DATE_ADD(UTCTIME, arg) --> UTCTIME
-- Adds an interval to a UTC timestamp producing another UTC timestamp. -
DATE_DIFF(TIMESTAMP, UTCTIME) --> INTERVAL
-- Same asAGE
. -
DATE_DIFF(UTCTIME, TIMESTAMP) --> INTERVAL
-- Same asAGE
.
The following new functions complement existing functions.
-
LOCALTIME
? -
LOCALTIMESTAMP
? -
NOW
? -
TIMEOFDAY
? UNIX_TIMESTAMP
An alternative is to create one-argument versions of each function:
CURRENT_DATE(LOCAL|UTC)
CURRENT_TIME(LOCAL|UTC)
CURRENT_TIMESTAMP(LOCAL|UTC)
Although beyond the scope of this project, it is worth thinking how we would round out Drill's date/time support.
We would start by adding three new types which are explicitly not tied to any time zone:
- Add
LOCALDATE
- A date with no implied relation to a timestamp, it is just a combination of a year, month and day. - Add
LOCALTIME
- A time with no implied relation to a timestamp or time zone, it is just a combination of a hour, month, second and fractions of a second. - Add
DATETIME
- A date and time with no implied time zone.
These form the following relation:
DATETIME = LOCALDATE + LOCALTIME
We cannot change the meaning of existing types, but we can be very clear how they work:
-
TIMESTAMP
is the number of milliseconds since midnight Jan 1, 1970 in the local time zone. -
DATE
is a timestamp in the local time zone truncated to midnight. -
TIME
is a timestamp in the local time zone with days remove.
The existing types now are defined via the following relationship:
TIMESTAMP = DATE + TIME
Drill provides a number of keywords to define date/time constants. Exising keywords:
-
TIMESTAMP string --> TIMESTAMP
-- Declare aTIMESTAMP
constant. -
INTERVAL args --> INTERVAL
-- Declare anINTERVAL
constant. -
DATE string --> DATE
-- Declare a date in the local time zone. -
TIME string --> TIME
-- Declare a time in the local time zone.
To these we add new keywords:
DATETIME string --> DATETIME
LOCALDATE string --> LOCALDATE
LOCALTIME string --> LOCALTIME
Casts convert between these types. Casts simply reinterpret the value; they apply no time zone conversions.
-
CAST(DATETIME AS UTCTIME) --> UTCTIME
-- Interpret a date/time as UTC. -
CAST(DATETIME AS TIMESTAMP) --> TIMESTAMP
-- Interpret a date/time as local. -
CAST(TIMESAMP AS DATETIME) --> DATETIME
-- Return the date and time (without timezone information) for a local timestamp. -
CAST(UTCTIME AS DATETIME) --> DATETIME
-- Return the date and time (without timezone information) for a UTC timestamp. -
CAST(LOCALDATE AS DATETIME)
--> Converts a local date to a local date/time at midnight. -
CAST(DATETIME AS LOCALDATE)
--> Returns the date portion of date/time. -
CAST(DATETIME AS LOCALTIME)
--> Returns the time portion of a date/time. -
CAST(TIMESTAMP AS LOCALDATE) --> LOCALDATE
-- Return the date part of a local timestamp. -
CAST(UTCTIME AS LOCALDATE) --> LOCALDATE
-- Return the date part of a UTC timestamp. -
CAST(TIMESTAMP AS LOCALTIME) --> LOCALTIME
-- Return the time part of a local timestamp. -
CAST(UTCTIME AS LOCALTIME) --> LOCALTIME
-- Return the time part of a UTC timestamp.
In general, Drill should perform automatic converions (via casts) for all functions. If either argument is in localtime, then all arguments should be cast to local date and/or time. For example:
AGE(LOCALDATE, UTCTIME)
is equivalent to:
AGE( CAST(LOCALDATE AS DATETIME), CAST(UTCTIME AS DATETIME))
Both widening and narrowing operations are supported. If both are possible, narrowing is chosen (based on standard "significant digits" reasoning):
AGE(LOCALDATE, DATETIME)
Is converted to:
AGE( LOCALDATE, CAST(DATETIME AS LOCALDATE))
If one argument is in UTC time, and the other is local, then both arguments should be converted to UTC. For example:
AGE(TIMESTAMP, UTCTIME)
Is equivalent to:
AGE( UTCTIME(TIMESTAMP), UTCTIME)
Extensions to existing Drill functions:
-
AGE(LOCALDATE, LOCALDATE) --> INTERVAL
-- Interval (of days) between two dates. -
AGE(DATETIME, DATETIME) --> INTERVAL
-- Interval (of days) between two date/times. -
DATE_ADD(LOCALDATE, arg) --> LOCALDATE
-- Add an interval to a local date. -
DATE_ADD(DATETIME, arg) --> DATETIME
-- Add an interval to a local date/time. -
DATE_DIFF(LOCALDATE, LOCALDATE) --> INTERVAL
-- Subtract to local dates. -
DATE_DIFF(DATETIME, DATETIME) --> INTERVAL
-- Subtract two local date/times. -
DATE_PART(LOCALDATE, arg) --> INTEGER
-- Extract a field from a local date. -
DATE_PART(DATETIME, arg) --> INTEGER
-- Extract a field from a local date/time. -
DATE_SUB(LOCALTIME, INTERVAL) --> LOCALTIME
-- Subtract an interval from a local time. -
DATE_SUB(LOCALDateE, INTERVAL) --> LOCALTIME
-- Subtract an interval from a local date. -
DATE_SUB(DATETIME, INTERVAL) --> LOCALTIME
-- Subtract an interval from a local date/time. -
EXTRACT(field FROM LOCALDATE | LOCALTIME | DATETIME) --> DOUBLE
-- Extract the given field from the date and/or time. Returns 0 if the field is not represented.
New functions to generalize existing functions:
-
ROUND_DATE(type TO arg) --> type
, wheretype
is one ofDATE
,TIMESTAMP
,UTCTIME
,LOCALDATE
orDATETIME
, andarg
is one ofDAY
,MONTH
,YEAR
, or isINTEGER unit
whereunit
is one of the units above, optionally with anS
appended (e.g.5 DAYS
.) -
ROUND_TIME(type TO arg) --> type
, wheretype
is one ofTIME
,TIMESTAMP
,UTCTIME
,LOCALTIME
orDATETIME
, and arg is one ofHOUR
,MINUTE
,SECOND
, orMS
, or isINTEGER UNIT
as above.
Functions to be deprecated (though left in the product) because they are redundant:
LOCALTIMESTAMP
-
NEARESTDATE(LOCALDATE | DATETIME)
-- Rounds to the nearest date. TIMESTAMPADD
TIMESTAMPDIFF