Skip to content

Date Column Design

Adam Hooper edited this page Apr 27, 2021 · 14 revisions

Proposal: a new "date" column type.

Why?

Users currently use "timestamp" to represent dates. This is ambiguous: to different people, the same timestamp happens at different dates.

You should be allowed to format a "timestamp" in any timezone. That makes no sense with dates.

Data-model change

A new "date" format will store an int32 number of days since 1970-01-01. (Why not integer number of periods, like Pandas? Because Parquet/Arrow tools wouldn't format the integers correctly.)

Unit (year / quarter / month / week / day)

A "date" column has a "unit": "year", "month", "day", "week", "quarter".

The unit determines which dates are allowed. "year" allows 1970-01-01 and not 1970-01-02, for example. Only the first date in its unit is valid. The others are invalid: Workbench will not store them or display them. (Weeks start Monday, as per ISO8601.)

Format

Let's defer deciding upon a "format" akin to the Number type's "format". For now, format one "auto" format per unit. These are how the data is displayed, not how it is stored. (Exception: if you "convert to text", it will be stored this way.)

  • Year: 2021
  • Month: 2021-03
  • Quarter: 2021-Q1
  • Week: 2021-03-01
  • Day: 2021-03-03

Export format

The export format is always YYYY-MM-DD.

Rationale: ISO8601 is a widely-accepted standard; there's no widely-accepted standard for "Quarter" or "Week starting Sunday".

Code-wise, this keeps our generic Arrow/Parquet tools generic. They don't need to understand "unit".

Prior art

  • Arrow has date32 but no notion of "unit": only days are allowed.
  • Pandas has Period. It's very flexible: e.g., it allows "3-hour intervals starting from 2021-03-03T13:48:12.142Z", and it supports int64. There's also "business day" and "business hour" functionality. All this flexibility makes for ValueErrors and edge cases when alternating between Timestamp and Period.

How we deviate

  • We don't deviate from Arrow date32, for better or worse. In most units, most date32 values are invalid and Workbench modules must take care to only produce valid dates.
  • We use the name "Date", not Pandas' "Period", because "Period" has annoying nuance. "Period" means, "continuous span between two Timestamps": our users must think of timezone when they think of the period. "Date" omits the concept of timezone. Most users don't think about timezones when they're thinking of dates. (Rephrased: many users will convert Timestamps to Dates, but few users will want to convert Dates to Timestamps.)

Potential problems

  • What is a unit (period's) origin? ISO "Week" starts Monday, but in USA culture Sunday is common. There are 12 typical financial quarter origins (Q1=Jan-Mar, Q1=Feb-Apr, Q1=Mar-May, etc.) and week-starts-Saturday is the norm in some cultures. For now, let's stick with origin=Monday, origin=January for now. Later, we can add an "origin" field to go alongside the "unit" field.
  • Are we locking ourselves away from "Hours" and "Minutes"? (These would involve timezones.)

User-facing features

Seeing unit

The column header should show the column type as "Date (week starting Monday)" or "Date (Year)".

Converting units

The user can convert from "year" to "day": Workbench will impute the first day of the year. Ditto for all casts to a more-specific type ("week-monday => day", "month=>quarter", etc.)

The user can "round" from "day" to "year" or "week" that the date appears in.

The user can "round" _Timestamp_s to Dates, by supplying a timezone and unit.

It should be difficult to convert from "week" to "month", because the data isn't there. The user can explicitly "convert + round" ... and forcing the user go to through those two steps should encourage them to realize they're creating errors in their workflow.

Date math

We should have something similar to Timestamp Math: how many days between two date columns, for instance? The dates must be in the same unit.

The "Group by" module

The most obvious use case is helping users chart with aggregation.

As of 2021-03-02, the "group by" module lets users group timestamps into "rounded timestamps". That was a mistake that led to misleading workflows: those timestamps have no timezone.

Ideally, "group by" + "by date" users will all migrate to grouping by a regular "date".

First, we'll build a "Timestamp to Date" conversion module.

Second, let's consider whether to retrofit the "group by" date-math section (there may be huge backwards-compatibility problems) or to nudge the user towards "timestamp to date" with a quick-fix button. Nudging existing users would be easy: they already indicated the unit they want. How do we nudge new users?

Charting

Line charts will also have an "auto" format per unit. This should look good enough for now.

Roadmap

TODO put these on Pivotal:

  • ✓ Make cjwpandasmodule create and support pandas.PeriodArray. (Nix most of cjwkernel.pandas.types to get there.)
  • ✓ Make a DateCellFormatter.
  • ✓ Write a first module that converts to Date.
  • ✓ Write more modules that use Date.
  • Prototype a Group By that groups by Date.
  • Help users migrate to Group By Date.

List of modules that need tests/changes

  • convert-date (??? What to do with this one???)
  • ✓ converttotext
  • countbydate (??? what to do with this one???)
  • ✓ filter (condition)
  • ✓ formula (need to convert from int32 to Excel int date)
  • groupby (??? what to do with this one???)
  • ✓ linechart
  • ✓ reshape
  • selectsql (maybe)
  • splittimestamp (output dates, not text)
  • timestampmath
  • ✓ transpose

List of new modules

  • ✓ converttimestamptodate (requires timezone)
  • converttexttodate (no timezone? We could warn when data has timestamps)
  • convertdatetotimestamp?
  • datemath