Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for JOINing time series via "AS OF" time #271

Open
min-mwei opened this issue Oct 21, 2017 · 24 comments
Open

Add support for JOINing time series via "AS OF" time #271

min-mwei opened this issue Oct 21, 2017 · 24 comments

Comments

@min-mwei
Copy link

For time series data, a powerful function is asof pioneered in kx/kdb, also being implemented by Pandas:
http://pandas.pydata.org/pandas-docs/version/0.19.0/whatsnew.html#whatsnew-0190-enhancements-asof-merge

It would be really cool if Timescaledb could support it natively instead of having write 3 tricky queries to join two time series tables.

@mfreed
Copy link
Member

mfreed commented Oct 21, 2017

Definitely an analytical function on our short list. Thanks for the request.

(And if others would similar like this, please upvote!)

@mfreed mfreed changed the title asof join Add support for JOINing time series via "AS OF" time Oct 22, 2017
@ydesai0830
Copy link

Hi, is there an update on support for an As Of Join. I am looking to test out various Timeseries solutions and AsOf is something we would really like to support.

@mfreed

@oldrichsmejkal
Copy link

Hi,

any plan for this feature?
It is major point for financial (tick) data use case, so it can help to adopt timescale in this niche..
Another reference from leading db used in this market..: https://code.kx.com/wiki/Reference/aj

Thanks!

@runekaagaard
Copy link

runekaagaard commented Jan 14, 2019

I agree. Having a performant global "AS OF" on a traditional RDMS like postgres would be a holy grail. Maintaining history today with postgres is, hmm, painful. This alone has made me seriously consider using Datomic (https://docs.datomic.com/on-prem/clojure/index.html#datomic.api/as-of). A dream query for me would be writing:

SELECT foo.id, bar.baz, sup.sop FROM foo
JOIN bar ON ...
JOIN sup ON ...
WHERE ...
LIMIT By 42
AS OF 2017-11-12

and getting the state of the world from that datetime. I realise this is moving from time series and into another territory which might not be a good fit for timescaledb. But a man can hope :)

@franz101
Copy link

Any updates or best practices?

@franz101
Copy link

This seems to be a possibility?
https://dba.stackexchange.com/posts/185372/revisions

@mridsole
Copy link

This is the best I've managed to get so far:

SELECT * FROM table_a
CROSS JOIN LATERAL (
  SELECT * FROM table_b
  WHERE (
    ...
    AND
    table_a.time > table_b.time
  )
  ORDER BY table_b.time DESC LIMIT 1
) lookup
ORDER BY table_a.time;

Still not very performant though, takes >1s on two tables of ~200k rows (pandas can do it >50x faster).

@bboule bboule added the core label Feb 19, 2020
@EgorKraevTransferwise
Copy link

Are there any plans to add this? Would be a really big deal for financial applications, among many others

@kzk2000
Copy link

kzk2000 commented Mar 20, 2021

+1

Any update here? Is there an ETA for as-of join support?

@runekaagaard
Copy link

runekaagaard commented Mar 21, 2021

Very under the (my) radar, mariaDB suddenly has very cool looking support for AS OF time machining:

https://mariadb.com/kb/en/system-versioned-tables/

@kzk2000
Copy link

kzk2000 commented Mar 21, 2021

@runekaagaard thanks, so no ETA?

Just to be crystal clear: We are referring to as-of joins and not AS OF for data versioning.

Good example for AS OF JOINS is here: https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html --> see bottom of that page which illustrates how to join market trades & quotes

The kdb+ equivalent command is "aj", see https://code.kx.com/q/ref/aj/

If I misunderstood the purpose of this issue here, let me know and I'm happy to open another one

@alex-tate
Copy link

The addition of as-of joins within Timescale would be a huge benefit to us. We have lots of high frequency time series datasets from environmental sensors that have slightly differing timestamps (at the ms scale) and occasional gaps. Our requirements are pretty much met by the functionality described in the pandas.merge_asof documentation but it would be great if we could do the same sort of thing at the database level. A possible addition would be to allow for aligning the sensor time series datasets against a strict timeline (e.g. every exact second, minute etc.).

@davidkohn88
Copy link
Contributor

davidkohn88 commented Jun 21, 2021

The simple way to do this is to write the following sort of join:

CREATE TABLE foo ( time timestamptz, id int, val double precision);

CREATE INDEX on foo(id, time DESC);

SELECT t1.time, t1.value as t1_val, t2.value as t2_val
FROM foo AS t1, 
LATERAL (
  SELECT value
  FROM foo t2
  WHERE t2.id = 2 AND t2.time <= t1.time
  ORDER BY t2.time DESC
  LIMIT 1
) t2
WHERE t1.id = 1
ORDER BY t1.time;

You will definitely want the index on id, time DESC there, as that will make it much more efficient.

In cases where the tables are separate it will be slightly different, simpler in some ways as it involves less aliasing:

CREATE TABLE foo1 ( time timestamptz, val double precision);
CREATE TABLE foo2( time timestamptz, val double precision)

CREATE INDEX on foo1(time DESC);
CREATE INDEX ON foo2 (time DESC) INCLUDE (value);

SELECT foo1.time, foo1.value as foo1_val, foo2.value as foo2_val
FROM foo1,  
LATERAL (
  SELECT foo2.value
  FROM foo2 
  WHERE foo2.time <= foo1.time
  ORDER BY foo2.time DESC
  LIMIT 1
) foo2
ORDER BY foo1.time;

This may not always be the most efficient, but it should work reasonably well in smallish cases. We'll also work on some ways of doing this with the timeseries API as discussed: timescale/timescaledb-toolkit#162 that may be more efficient in some cases.

@JLockerman
Copy link
Contributor

note: the LATERAL query above benefits significantly from being able to perform an index-only scan on foo2, you would want an index like

CREATE INDEX ON foo2 (time DESC) INCLUDE (value);

@NunoFilipeSantos
Copy link
Contributor

Thank you @davidkohn88 this is a fantastic solution suggestion. 👍

@enthusiastics
Copy link

I am also very interested in this "as_of" join, what would be the best solution for now?

@davidkohn88
Copy link
Contributor

davidkohn88 commented Sep 10, 2021

I am also very interested in this "as_of" join, what would be the best solution for now?

The joins above ( #271 (comment)) are reasonable solutions for now and should be reasonably performant, depending on what you're doing, we're also thinking about adding some more functionality around this in the toolkit, but it's probably a bit of a ways off, but you can add comments in this issue for now, and maybe explain more what exactly you're trying to achieve and what you think we should prioritize: timescale/timescaledb-toolkit#162

@akuzm
Copy link
Member

akuzm commented Sep 22, 2021

With LATERAL join on timestamp inequality, I guess the right side is going to do index lookups, but it's still O(left_rows * log(right_rows)). A merge-join-like algorithm will be O(left_rows + right_rows).

An older attempt of implementing this in vanilla postgres: https://www.postgresql.org/message-id/flat/bc494762-26bd-b100-e1f9-a97901ddad57%40postgrespro.ru

For the reference, ClickHouse uses special grammar for this: https://clickhouse.com/docs/en/sql-reference/statements/select/join/#asof-join-usage

@akuzm
Copy link
Member

akuzm commented Sep 27, 2021

The different grammar is probably required because the semantics is different from the normal join on inequality condition -- we only have to return the closest righthand row that matches. In an extension, we can't introduce new grammar (or can we?), so we can consider using a special dummy function for the join, e.g. JOIN ON left.series = right.series AND timescale.asof(left.timestamp <= right.ts).

I had an old patch that extended the merge join executor to support full join on inequality: https://www.postgresql.org/message-id/flat/b31e1a2d-5ed2-cbca-649e-136f1a7c4c31@postgrespro.ru
I think I could simplify and reuse it for ASOF joins.

@akuzm
Copy link
Member

akuzm commented Oct 19, 2021

@akuzm
Copy link
Member

akuzm commented Dec 14, 2021

@alex-tate
Copy link

Is there an agreed way forward on this issue even if addressing it isn't imminent? A related timescaledb-toolkit issue was closed last year pending work on 'multi-value timeseries' but it is not clear what this functionality refers to. The latest posts in this issue suggest the asof-join functionality could be addressed within Postgres itself but it is difficult to know from the discussion whether this is likely to happen in the near-future.

syvb pushed a commit to syvb/timescaledb that referenced this issue Sep 8, 2022
271: Lambda fixes r=WireBaron a=JLockerman

Two minor fixes to lambdas:
 1. Allow newlines within lambdas.
 2. Allow the parsing of multiple successive `let`s instead of stopping
    at the first.

Altogether this allows the parsing of expressions such as
```SQL
let $foo = -2;
let $bar = $foo * $foo;
$bar * $bar
```

Co-authored-by: Joshua Lockerman <josh@timescale.com>
@akuzm
Copy link
Member

akuzm commented Oct 10, 2022

Is there an agreed way forward on this issue even if addressing it isn't imminent? A related timescaledb-toolkit issue was closed last year pending work on 'multi-value timeseries' but it is not clear what this functionality refers to. The latest posts in this issue suggest the asof-join functionality could be addressed within Postgres itself but it is difficult to know from the discussion whether this is likely to happen in the near-future.

We're planning to try and prototype this inside the TimescaleDB extension in the Q4 of 2022, using the no. 3 merge-hash algorithm I posted upstream. That's just a research prototype, can't promise when and if we will release something that is actually usable.

@RMB-eQuant
Copy link

We have found the “as-of” join to be essential in our analyses involving high frequency financial data (market data, trade data etc.). The approach we present in the link below seems to be very fast and execution time appears to scale linearly with row count (n), unlike lateral join type approaches that typically scale as O(n^2). The SQL query we link to could perhaps be further optimized but we think that this is a promising approach in general.

https://gist.github.com/RMB-eQuant/758539f8914f2dd4461ec0ce144b048b

The table below compares the execution time of our approach (called “UNION ALL ALGO”) to the lateral join approach presented in the post #271 (comment) (“LATERAL JOIN”). Execution time is in seconds and the benchmarks were run on a Timescale-pro-100-16gb-2cpu-compute-optimized (2 CPU, 16 GB RAM, 100 GB storage) instance. In the table below, NaN values correspond to runs that were too slow to complete.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests