Package functionality review #482
Comments
Hi @sairus7.
There are many design decisions whose original rationale I'm not sure about (I guess most of them are for financial time series). Let's work out each issue one by one, I. The What's your idea about the first issue? I'm studying Tables/TableOperations, and I will post my blueprint here later. |
Here is my proposal: I. a. The type parameter: b. The interface stuffs.
b.1
@sairus7 could you review these ideas? and I can make a PR for this proposal. |
@iblis17
Should there be any other compile-time information, except time type? a.1, a.2
Do you mean that there can be two columns with different precision for time, as described at https://docs.julialang.org/en/v1/stdlib/Dates/#Dates.Date ? I think it can be solved with "batching", where you have one timestamp type for a whole bunch of records (a table of tables, or partitions), and a detailed relative timestamp for every element within that batch. So I am not sure we should add it at the single-table level. (But we can add a reference time as a metadata field for the whole table object.)
As for the other idea of having several timestamps (like a time interval t1:t2) for each row, I think it is similar to sparse arrays for highly duplicated data (since all rows between t1:t2 have the same values), and we should think about it after defining the main functionality.
1.b About interface stuff.
I think we should first outline the main difference between TimeSeries and table-data packages like DataFrames.jl or IndexedTables.jl; otherwise there is no need for TimeSeries.jl itself, especially if we decide to support heterogeneous column types and stick to the table interface. Or, if there are only minor differences, we can just rewrite TimeSeries.jl as a thin wrapper around those packages. From my point of view, time series differ from simple arrays because they have timestamps and some specific operations. I think of timestamps as a "position in time", or a time index, in addition to the integer element index. So, here is a list of questions that should be answered first:
|
Maybe @bkamins can give us his opinion on this? |
Thank you for working on this. As there are many aspects of the issue (and probably I do not grasp everything you discussed) I would start with the question what is the main use case for TimeSeries.jl? and then design against this use case. E.g. DataFrames.jl design objective is to be maximally flexible, possibly at the cost of performance (when used correctly it is fast though), i.e. to be used when no more specialized package exists. I would assume that TimeSeries.jl would be a more specialized package which would provide functions that may be only made available if we have a notion of time index. There are many such use cases, that are currently hard with DataFrames.jl, e.g.:
So in summary. The question is: what features TimeSeries.jl should provide so that a user would want to switch from DataFrames.jl to TimeSeries.jl for some specific task. This will probably mean that TimeSeries.jl should provide by default more restrictions than DataFrames.jl at the benefit of doing things better (as currently you can do anything with DataFrames.jl but not always fast or conveniently). Below I say what I would find intuitive in answers to the 7 points you put:
I do not think it would be super useful (though sometimes it might be useful). So if you see big benefits of having homogeneous type I would go for homogeneous choice. However, my intuition is that there will not be big benefits.
If you go for homogeneous type I think for sure dynamic is better (as you have type inference for free). However, in general in DataFrames.jl although it is not type stable mostly you can easily switch to "type-stable" mode. Actually I think that the crucial thing is if you want to allow to add rows to time series in place. I assume you want it (which e.g. means that you cannot use
Do you see any uses for such a row-object? (like taking advantage that it would know its time stamp). If yes I think it is not a problem to have a custom type for it. Also do you want the row-object to be a view (like
I would find keeping them sorted intuitive. In what cases would you want to allow to change a timestamp? I would feel that it should be immutable? Also I feel timestamps should be unique and that join should be performed only on timestamp (at least by default).
I would normally think that it should be a "hidden" column like index.
For me timestamp would be an index only
I think it should create a new table. Please treat these comments as loose first impressions of course. |
I.a.
I think for
ah, right. I try to list some here, and maybe we can make it completed later, once we decide some key design.
a.1, a.2
I want to cover both of these cases (different precision, and multiple timestamps as an interval) in the type parameter design.
1.b Interface
This is a hard question. The properties of a time index break all the rules and, I think, make wrapping around those pkgs not worthwhile. So in the beginning, I prefer not to depend on them. I keep an open mind on this issue. After we have explored enough use cases, maybe we can leverage those pkgs for part of them.
Well, in short, my answer is that we can implement all styles if needed.
I cannot find cases where a user needs to manipulate an unsorted time series. So combinations 4, 6 and 8 are kept, and I think case 6 won't have enough performance benefit. Combinations 4 and 8 might have a benefit if the underlying structure is So, I will vote for combination 2 as top priority, then implementing combinations 4 and 8 if we still have enough mental effort.
I managed to list them in the table of part
I want timestamps sorted all the time. The timestamps don't need to be unique, and the order between records which share the same timestamp is defined by the user. We should make sure functions provided by this pkg do not change that relative order. I have some sensor-generated data that share the same timestamp, since the timestamp precision isn't enough. About
I think a hidden column is fine for me. I want that user can always set/switch the index. Once set, the column will be the first column in the presentation (via
Well, this is quite complex question since I encountered both situations in single project: (1) I want the raw row value without timestamp, so I can feed them into function from I don't have an elegant approach at this moment. I still think about it. I write down some here. (i) Determine which is the common use case, make the common case as default. Then provide a variant function to support another. For example: (ii) Always return the row object. Then provide a
Yes, the type with dynamic columns should support this functionality. |
I think we should divide our methods into three distinct parts, with increasing functionality: Here are some (incomplete) considerations about these three parts:
A. Timestamps
I agree that a time index differs from an integer index. One of the key differences is that it has a global "address space". An integer index exists only within a specific collection, and can be changed or dropped when querying a subcollection or element. But a time index refers to some "address in time", not a certain collection, and cannot be dropped by default. Also, while an index is an integer, a time index is continuous, and it can have different precision levels (days, minutes and so on) with different rounding and comparison behaviour between two time values which have different precision.
A.1 Time types
What time types can be used: Possible operations:
A.2. Vector of timestamps, some kind of "point process". I agree that timestamps should be sorted and not unique. It can be:
For discrete case, should we think that each timestamp has non-zero length equal to timestep? I have some draft examples of how I use a combination of types from A.1 (a), (b), (d) as timestamps, and transformations between them: https://gist.github.com/sairus7/7a3f2ea6d3e0c34b4ea973d3b80105e8 Possible operations:
B. Time vector / column Here we have just two synced vectors - timestamp vector from A.2., and data vector. I will call it here a column. Operations:
C. Table - seems like this is just a set of columns from B, which have the same timestamps vector? What about table of combinations, IndexedTable have sorted primary key, so they are in option 6 too.
```julia
using AxisArrays, Dates
t = Dates.now()
timerange = t : Millisecond(5) : t + Millisecond(45)
data = reshape(1:20, :, 2) |> collect
a = AxisArray(data; time = (timerange), chan = ([:c1, :c2]))
a[time = t, chan = :c1]
a[time = t..t+Millisecond(10), chan = :c2]
```
I agree that we can start with option 2 from the table. I agree we should return time series objects on I agree on joins on non-index columns. Should In general, I don't see any disagreements with your proposals. |
The next step is completing the interface specs. I think the naming issue is the most difficult part. Feel free to correct me if my naming is confusing.
A. Timestamps
A.1. Time types
I only want to discuss (b) Period and (d); the others are fine for me. About the relative time to some reference point: (b) Period: for time intervals, or time offset value (time relative to some reference point).
```julia
julia> Minute <: Period
true
```
Do we need a type for holding both a reference point and an offset value? Maybe not; at least I cannot find such a type for Cartesian coordinates or for C/C++ pointers, where no type contains both pieces of information. So I think The general policy of adding a operations:
Operations
A.2. Vector of timestamps
(Well, I knew nothing about point processes before you mentioned them; any resources that I can consult are appreciated.)
Allowing repeats does introduce some API design issues. I'm not sure which one is the good design, so I just write down my thoughts here.
a) secondary index: By automatically building a secondary index that represents the given order, the join (or other) operations can be done via the compound key of timestamp + secondary index. For instance:
We can provide two sets of APIs. The first set will use 'secondary index=1' as default, then return the
b) parametric type: By adding a boolean parameter in the type that denotes the existence of repeated timestamps, dispatch the
This example is a good starting point.
A.2.i Types
So, assume that we have an abstract type `abstract type AbstractTimeAxis{T} <: AbstractVector{T} end` where I think the naming still needs more discussion, with the aim of not confusing users.
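To make the `AbstractTimeAxis` idea concrete, here is a minimal sketch of a lazy, regularly spaced axis. Only `AbstractTimeAxis` comes from the proposal above; `RegularTimeAxis` and its field names are assumptions made for illustration, not an agreed design.

```julia
using Dates

# AbstractTimeAxis follows the proposal above; RegularTimeAxis and its
# fields are hypothetical names used only in this sketch.
abstract type AbstractTimeAxis{T} <: AbstractVector{T} end

struct RegularTimeAxis{T<:TimeType,P<:Period} <: AbstractTimeAxis{T}
    start::T   # timestamp of the first element
    step::P    # constant spacing between neighbouring timestamps
    len::Int   # number of timestamps
end

# Minimal AbstractVector interface: size and scalar getindex.
Base.size(ax::RegularTimeAxis) = (ax.len,)

function Base.getindex(ax::RegularTimeAxis, i::Int)
    @boundscheck checkbounds(ax, i)
    return ax.start + (i - 1) * ax.step   # computed on demand, nothing stored
end

ax = RegularTimeAxis(DateTime(2020, 1, 1), Millisecond(5), 10)
first_ts = ax[1]
last_ts = ax[end]
materialized = collect(ax)   # a plain Vector{DateTime} when the user wants one
```

Because the type only implements `size` and `getindex`, iteration, `collect` and `issorted` come for free from the `AbstractVector` fallbacks.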
A.2.ii Operations
a. Support the iterator protocol
Some types in @sairus7's example are lazy (they calculate the timestamp when needed). I will make them support the iterator protocol, so the user can materialize them if desired.
b. Indexing related operations
c. Relative time calculation
The signature
Example:
```julia
julia> g = TimeGrid(DateTime(2020, 1, 1), 60)
TimeGrid(DateTime("2020-01-01T00:00:00"), 60.0)
julia> g[+, 0]
0 milliseconds
julia> g[+, 42]
700 milliseconds
julia> g[+, Minute(1)]
3601
```
d. Get a subvector of timestamps
Support
I think this can be simply achieved by
e. Reduction operations
f. Resampling operations
g. Consolidate two vectors of timestamps: merge and intersect
This is a quite complex case dealing with the
About cases 1 and 2, I reduce the problem to the product (or sum) of periodic functions. The product (or sum) of two periodic functions may or may not be a periodic function. It depends on the period ratio. But in case of implementation, I want to treat a normal
h. Search the point with given criteria, like equal or nearest neighbours
For two point process
I propose implementing
i. Common vector operations
Here I only list some operations that are notable for discussion.
B. Time vector / column
And I will consider the time vector is a
B.1 Operations
a. Indexing
b. Element-wise binary operations with two `AbstractTimeSeries`
c. Get a sub-table or view
d. `join` operations
e. Change the time vector
f. Reduction operations on values for single table
g. Resampling operations
C. Table
Yes, so how about treating sections B and C as the same?
The interval feature looks great. If I understand correctly, that interval data type is provided by IntervalSets.jl, and we can support it. |
A side note about precision and rounding, which is closely related to the question from A.2: "should we think that each timestamp has non-zero length equal to timestep?" Why would we need it? I think of how to represent time segments (intervals) as timestamps, and the main difference is that intervals have an additional "time length" attribute. Which makes me think that any timestamp is not a point with zero length, but a time interval with "unit" length. This is similar to the inner representation of timestamp itself as integer value ( But AFAIK there are no methods to check that a higher-resolution timestamp lies within a lower-resolution timestamp. More than that, we don't even know the actual resolution, right?
```julia
using Dates
t_month = floor(Dates.now(), Dates.Month)
t_sec = floor(Dates.now(), Dates.Second)
t_sec in t_month == true # method error
```
From this example I'm not sure if we should leave this to the user's knowledge of their data, or decide to make some additional time-interval operations and check for (or dispatch on) known and unknown time-length. But if we do, then we should add some additional timestamp vector types with metadata. |
I think the "time length" attribute will only relate to additional operations. It is only meaningful when doing operations against the time length attribute, we won't I googled around this topic. Maybe we can consult some operation designs from here: https://www.codeproject.com/Articles/168662/Time-Period-Library-for-NET
So, yes, we should leave it to user knowledge, but with a common assumption as default. |
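As a sketch of what "a common assumption as default" could look like for the containment question raised above: the helper below treats a floored timestamp as a half-open interval whose length the caller supplies. The name `within` and its signature are hypothetical; nothing like it exists in Dates or TimeSeries.jl as far as this thread shows.

```julia
using Dates

# Hypothetical helper: treat `bucket` as the half-open interval
# [bucket, bucket + resolution) and test whether `t` falls inside it.
# The resolution is passed explicitly because a DateTime does not record it.
function within(t::TimeType, bucket::TimeType, resolution::Period)
    return bucket <= t < bucket + resolution
end

t = DateTime(2021, 3, 15, 12, 30, 45)
t_month = floor(t, Month)    # 2021-03-01T00:00:00
t_sec = floor(t, Second)     # 2021-03-15T12:30:45

inside = within(t_sec, t_month, Month(1))                  # true
outside = within(DateTime(2021, 4, 1), t_month, Month(1))  # false, April is the next bucket
```

Passing the resolution explicitly keeps the decision with the user, which matches the conclusion above; a timestamp vector type carrying that resolution as metadata could supply it automatically.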
Sorry, I'm late to the party, but I have a couple of things to add.
I've met this situation too, but there is an easy(?) workaround, at least it worked for me. Since
```julia
using Dates

struct DateTimeBar{T <: TimeType, L <: Real} <: TimeType
    ts::T
    duration::L
end
duration(x::DateTimeBar) = x.duration
# Order bars by their start timestamp only.
Base.isless(x1::DateTimeBar, x2::DateTimeBar) = isless(x1.ts, x2.ts)
```
and generate a vector of "bar" times. There is no need to create an extra column or do anything like that. Something like that can work with
```julia
struct CountingDateTime{T <: TimeType, L <: Period} <: TimeType
    start::T
    offset::L
    counter::Int
end
# Materialize the timestamp as start + counter * offset.
Dates.DateTime(x::CountingDateTime) = x.start + x.counter * x.offset
```
so specialized functions can be written if needed (maybe even in another package?) to work with such a type. Exploring this idea further, one can define
```julia
struct DateTimeWithKeys{T <: TimeType, S <: Tuple} <: TimeType
    ts::T
    keys::S
end
```
and generate a time column with embedded keys. For example, if you gather signals from different sources, you can have something like
```julia
dts = [DateTimeWithKeys(Date("2021-01-01"), ("Device A", )),
       DateTimeWithKeys(Date("2021-01-01"), ("Device B", )),
       DateTimeWithKeys(Date("2021-01-02"), ("Device A", )),
       DateTimeWithKeys(Date("2021-01-03"), ("Device C", ))]
```
and "keys" can be used for filtering, joining, sorting, etc. This idea is actually implemented in Google's BigTable design http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf Regarding Row values which should be returned when a table is indexed, maybe it makes sense to utilize JuliaQuant/Timestamps.jl? I revived it recently after a few years of hibernation, and one of the ideas was to have a useful row-level timestamp data presentation. It can solve some questions like "what to return: value or timestamp + value" since you can return |
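A small sketch of how the embedded keys might be used for filtering. The struct is copied from the comment above; the filtering step itself is only an illustrative assumption about usage, not an API of any package.

```julia
using Dates

# Struct as defined in the comment above.
struct DateTimeWithKeys{T<:TimeType,S<:Tuple} <: TimeType
    ts::T
    keys::S
end

dts = [DateTimeWithKeys(Date("2021-01-01"), ("Device A",)),
       DateTimeWithKeys(Date("2021-01-01"), ("Device B",)),
       DateTimeWithKeys(Date("2021-01-02"), ("Device A",)),
       DateTimeWithKeys(Date("2021-01-03"), ("Device C",))]

# Keep only the entries whose key tuple contains "Device A".
device_a = filter(d -> "Device A" in d.keys, dts)
dates_a = [d.ts for d in device_a]   # the plain dates of the selected entries
```

Two entries share the timestamp 2021-01-01 here, so the key is what distinguishes them, which is the repeated-timestamp case discussed earlier in the thread.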
Oh, that may be a good option. After I finish the interface spec in this discussion thread, we can check |
@sairus7 Could you review it? |
I'm looking into using the methods in this package in DimensionalData.jl/GeoData.jl - when there is a time dimension present, as in AxisArrays.jl. Often we have multidimensional arrays where time is one of the dimensions. GeoData.jl also defines So to add to this functionality review, it would be useful if this package generalised to working with any arbitrary-dimension arrays organised in a time-series vector, somewhat like how Interpolations.jl does that. |
Has any kind of |
Hi!
I have some experience working with time series (from medical sensors), and I was thinking of using TimeSeries.jl for my projects. For now I have some sort of review of this package, outlining choices that look strange to me, at least from the docs, along with proposals from my point of view. Maybe the authors will find it helpful.
I. `AbstractTimeSeries` is absent from docs - is this some kind of common interface for different timeseries types? If so, you should add an example showing which methods I should implement to support a custom timeseries type.
II. Heterogeneous series (tables) are dropped, from docs:
This is a huge limitation, if one needs timeseries with complex information, stored as a vector of structures, or a namedtuple of columns of different types (see StructArrays.jl).
Maybe there should be a different TimeTable type with heterogeneous columns (similar to DataFrame), and TimeArray for a single column type, sharing the same timestamps from the parent table?
More than that, individual columns can be a custom AbstractVector with some metadata for exotic element types. For example, if elements are encoded and metadata is needed to decode them on `getindex`.
III. There is no separate implementation for timeseries with a regular sample rate, that can be constrained to operations that produce a uniform sampling (similar to SampledSignals.jl). This type does not need to store a materialized `timestamps` vector at all, since time can be calculated from `index`, `startdate` and `samplerate` (I call this a "time grid", which provides an `index2time` and `time2index` pair of functions). Timeseries remains uniform unless you want to take irregular / arbitrary samples from it - the result is then converted to a common (non-uniform) timeseries with a timestamps vector in it.
IV. There are no timeseries with several timestamp columns. In my practice, I always have three different timeseries types:
There are several special cases for (3) with regard to indexing (what to do if I request time point inside the segment or time interval that partially overlap with segments on edges).
Maybe there can be even more exotic (or common) timeseries with more than two timestamps (each row is itself a repetition of some complex process in time with many "phases"), where you should explicitly choose which timestamp column you want to index by. But I would not complicate it that far.
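For the "time grid" from point III, the `index2time` / `time2index` pair could be sketched as follows. The function names come from the post; the `TimeGrid` struct, its fields, the sample rate in Hz and the millisecond rounding are all assumptions made for this illustration.

```julia
using Dates

# Hypothetical type for this sketch: a start date plus a sample rate in Hz.
struct TimeGrid
    startdate::DateTime
    samplerate::Float64   # samples per second
end

# Timestamp of the 1-based sample `i`, rounded to whole milliseconds.
index2time(g::TimeGrid, i::Integer) =
    g.startdate + Millisecond(round(Int, 1000 * (i - 1) / g.samplerate))

# 1-based index of the last sample at or before `t`.
time2index(g::TimeGrid, t::DateTime) =
    floor(Int, Dates.value(t - g.startdate) * g.samplerate / 1000) + 1

g = TimeGrid(DateTime(2020, 1, 1), 250.0)
t = index2time(g, 501)   # 2 seconds after the start
i = time2index(g, t)     # back to 501
```

No timestamps vector is stored; a sample rate that does not divide 1000 evenly would need more care with rounding than this sketch takes.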
V. Row indexing. You can index rows by:
What is missing:
`time` and `index` positional arguments for different combinations).
VI. Splitting by condition section has two different sets of functions:
`where` in tables, but for timeseries (`when`, `findwhen`, `findall`), `from`, `to`).
VII. Maybe there should be some convention between functions that take and return timeseries, and functions that return standard vector types:
`findwhen` vs `findall`;
Also, there may be some methods to toggle between timeseries type - and underlying Table type, or standard array / vector of tuples. This is similar to Tables.columntable from DataFrames, they are using it to toggle between type-stable and compile-friendly cases.
VIII. Operation on single columns - or whole timeseries
this is a very tricky part, because there is an implicit inner join, and all columns should be the same numeric type. So maybe it should be applied only on a single column, or a single column can be modified this way in place? This is also about heterogeneity, as in section II above.
`diff`, `percentchange`, `moving`, `upto` with similar functions from any other package.
`basecall` looks strange to me - what if I want to run a function not from Base, and run it on a single column, or a set of selected columns?
IX. Combine methods
`merge` naming instead of more common `join`?
`collapse`: AFAIK this is called `decimation` or `resampling` with another samplerate or time intervals - not only `day`, `week`, etc. Maybe even a vector of custom intervals. And there should be any arbitrary function that can `reduce` all elements that fall within each time interval (for example, you can get a time distribution if you count the number of elements over fixed time intervals).
X. Customize TimeArray printing
Can I choose a time string format to show, or is it chosen automatically based on - what? It would be nice to have examples for high-frequency timestamps in units of milliseconds.
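Regarding the `collapse` point in IX, here is a rough sketch of resampling with an arbitrary reducer over fixed intervals of any `Period`, including milliseconds. `resample` is a hypothetical helper written for this illustration, not the TimeSeries.jl `collapse` API, and it assumes the timestamps are already sorted.

```julia
using Dates

# Hypothetical helper: floor every timestamp to `period`, then reduce the
# values that share a bucket with the user-supplied function `f`.
function resample(f, timestamps::Vector{<:TimeType}, values::Vector, period::Period)
    length(timestamps) == length(values) ||
        throw(ArgumentError("timestamps and values must have the same length"))
    buckets = [floor(t, period) for t in timestamps]   # start of each row's interval
    bucket_starts = unique(buckets)                    # keeps first-seen order
    reduced = [f(values[findall(==(b), buckets)]) for b in bucket_starts]
    return bucket_starts, reduced
end

t0 = DateTime(2021, 1, 1)
ts = [t0 + Millisecond(ms) for ms in (0, 40, 90, 120, 250, 260)]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

bucket_ts, counts = resample(length, ts, vals, Millisecond(100))  # counts == [3, 1, 2]
_, sums = resample(sum, ts, vals, Millisecond(100))               # sums == [6.0, 4.0, 11.0]
```

Using `length` as the reducer gives the element count per interval, which is the "time distribution" example mentioned above. The repeated `findall` makes this quadratic in the worst case, so it is meant to show the interface rather than an efficient implementation.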