Time series #267

Closed
vincentqb opened this issue Sep 5, 2019 · 5 comments

@vincentqb
Contributor

vincentqb commented Sep 5, 2019

I'd like to reflect upon what functions would be desired (that could live within torchaudio or outside) in order to offer preliminary support for time series.

  • Time series data format (e.g. many channels compared to audio)
  • Missing data imputing
  • Interaction with calendar information
  • Conversion to other formats, say to audio waveform
  • Option on transformations to respect time direction
  • Streaming use case?

Could we make sure that our constructs are general enough to touch on time series, without sacrificing the primary goal of audio for this library?

Motivation

An audio (multichannel) waveform is a (vector) time series with a constant time step of 1/sample_rate seconds.

Audio processing and time series analysis are related, though their goals may differ. The transformations used in audio and in general time series sometimes differ (e.g. dB, Mel, ...). For instance, in time series forecasting, transforms are usually expected to respect the time direction and to consume only past information when computing a future value, as in "online" consumption of an audio waveform.
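
To make "respecting the time direction" concrete, here is a minimal sketch of a causal (trailing-window) moving average in plain Python. The function name `causal_moving_average` is hypothetical and is not part of torchaudio; the point is only that each output depends on the current and earlier samples, so it can be computed while streaming.

```python
from collections import deque
from typing import Iterable, Iterator


def causal_moving_average(samples: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the current sample and up to `window - 1` past samples.

    Hypothetical helper for illustration: no future sample is ever read,
    so the output at step t is final as soon as sample t arrives.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    buf = deque(maxlen=window)
    total = 0.0
    for x in samples:
        if len(buf) == window:
            total -= buf[0]  # the oldest sample is about to be evicted
        buf.append(x)
        total += x
        yield total / len(buf)


print(list(causal_moving_average([1, 2, 3, 4], window=2)))  # [1.0, 1.5, 2.5, 3.5]
```

A centered window would instead need samples from both sides of each point, which is fine for offline audio processing but leaks future information in a forecasting setting.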

@nairbv @zdevito @kingjr @adefossez @gully -- do you have use cases for time series that could relate to torchaudio?

Additional context

@vincentqb vincentqb self-assigned this Sep 5, 2019
@vincentqb
Contributor Author

@sergulaydore -- for reference

@gully

gully commented Oct 12, 2019

Here are some perspectives on time series from astronomy data.

The astropy project had a discussion on tradeoffs surrounding a TimeSeries class for astronomical applications in their ongoing Proposal for Enhancement. There are some subtle discussions distinguishing two types of time series:

  • sampled time series that sum up a count rate observed over a time interval, such as how many photons were received from a telescope sensor in a 30 minute interval
  • event data that are timestamps of discrete events, such as the energy of a single proton measured at the instant it impinges on a sensor.

The distinction essentially comes down to sparsity-- populating zeros in between infrequent/discrete events is wasteful.

Here at the NASA Kepler/K2 Guest Observer Office we focus on high-precision flux time series: the brightness of a star measured every 30 minutes for four years, with quarterly gaps for transmitting the telescope data back to Earth. You can see that this acquisition rate yields a modest amount of data by the standards of audio: our "impressive" 70,000 time samples would be acquired in under 2 seconds of single-channel 44.1 kHz audio.

Some other distinctions: our time series data come with metadata headers that are generally preserved in our objects. Each time sample possesses columns (multichannels) of mixed data types: time, flux (float), flux uncertainty, quality flag (int), quality mask (bool), sky coordinate xy movement. Our in-house toolkit lightkurve deals with this time series data, with tons of application-specific pre-processing steps that wouldn't matter much for a general time series class. The name nods to the convention of "light curves" rather than the audio-familiar waveforms.

We do frequency-domain analysis with FFTs all the time with some slight differences: we use an algorithm that can support unevenly sampled time spacings. We occasionally do spectrogram analysis, but you can see that a 70,000 sample signal can only be cut into 175 bins of nfft=400, which makes for a crude spectrogram.
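
The frame arithmetic above can be checked with a short sketch. It assumes no padding at the signal edges, and `num_frames` is a hypothetical helper rather than a torchaudio function; libraries that pad or center frames will report different counts.

```python
def num_frames(num_samples: int, n_fft: int, hop_length: int) -> int:
    """Number of full analysis frames when no padding is applied (assumption)."""
    if num_samples < n_fft:
        return 0
    return 1 + (num_samples - n_fft) // hop_length


kepler_samples = 70_000

# Non-overlapping frames (hop equal to n_fft), as in the estimate above.
print(num_frames(kepler_samples, n_fft=400, hop_length=400))  # 175

# A 50% overlap roughly doubles the number of time bins.
print(num_frames(kepler_samples, n_fft=400, hop_length=200))  # 349

# The same number of samples expressed as seconds of 44.1 kHz audio.
print(round(kepler_samples / 44_100, 2))  # 1.59
```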

Astronomers use scalable Gaussian Process analysis all the time. Popular frameworks are tailored towards 1D time series astronomy, but could (and should?) apply more broadly to time series applications that care about uncertainty quantification or probabilistic prediction. The GPyTorch framework is promising, and I aspire to create astronomy-specific demos to advertise this library more widely to astronomers. The fixed sampling interval of audio makes it amenable to some of the geometric assumptions of GPyTorch.

Those are some thoughts for now. Very curious to see how these themes evolve!

@nairbv

nairbv commented Jan 17, 2020

For non-audio applications (e.g. in finance) I could imagine a number of useful features/functions.

I'm not familiar with audio time series requirements, but similar to what @gully describes above, there are a number of ways time series of financial data can be represented that may be broadly applicable:

  • In raw trade or tick data, each data point discretely represents a trade or price change. Some data might include each change to the best bid and ask.
  • Tick data is typically aggregated into "candlesticks" as open (start), low, high, close (final), volume (total number of shares traded) per time period. Each period then ends up being represented as a vector of these five values.
  • Other approaches similar to the "sampled time series" described above by @gully would be open/low/close/high/duration per N trades or shares traded or ticks. There are a variety of approaches like this that can be used for summarizing "bars" of discrete financial time series data.
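
As an illustration of converting between these representations, here is a minimal sketch that aggregates time-ordered ticks into fixed-period candlesticks. The `Candle` class, the `ticks_to_candles` function, and the `(timestamp, price, size)` tick layout are all assumptions made for this example, not an existing API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Candle:
    start: int      # start of the period, in the same units as the tick timestamps
    open: float
    low: float
    high: float
    close: float
    volume: float


def ticks_to_candles(ticks: Iterable[Tuple[int, float, float]], period: int) -> List[Candle]:
    """Aggregate time-ordered (timestamp, price, size) ticks into candles.

    Hypothetical helper: periods without any tick produce no candle,
    so the output stays sparse rather than being filled with placeholders.
    """
    candles: List[Candle] = []
    for ts, price, size in ticks:
        start = ts - ts % period  # floor the timestamp to its period
        if candles and candles[-1].start == start:
            c = candles[-1]
            c.low = min(c.low, price)
            c.high = max(c.high, price)
            c.close = price
            c.volume += size
        else:
            candles.append(Candle(start, price, price, price, price, size))
    return candles


ticks = [(0, 10.0, 1), (30, 12.0, 2), (59, 9.0, 1), (60, 11.0, 5)]
for candle in ticks_to_candles(ticks, period=60):
    print(candle)
```

Bars defined per N trades or per N shares traded would follow the same pattern, with the grouping key computed from a running count instead of the timestamp.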

Ideally a time-series representation should be flexible/abstract enough so that other representations can be added easily. Tools that convert representations of the data could be useful.

Some other functionality that could be useful in time series tools, at least if applied to certain financial problems:

  • For "Interaction with calendar information," it could be useful to have a way to "join" multiple time series from different sources.
    • One may want to train a single model with data from multiple securities aligned on time.
    • Maybe also useful for multi-modal models or stereo audio?
  • A way to augment time-series data with cumulative or moving averages, stdev, etc.
    • Traders often augment their price data with a variety of derived metrics (Bollinger Bands, EMA, SMA, MACD, etc.). I'm not sure if there are similar derived metrics for audio time series.
  • Forecasting data loaders that help deal with look-ahead or recency bias, maybe using sliding time windows?
    • It's easy to introduce look-ahead bias, especially when training online or learning incrementally.
    • One wouldn't want to re-train a model from scratch with each new tick, but could use some kind of sampling method to incorporate new information while controlling or eliminating recency bias.
    • Ways to preprocess the data during loading, e.g. to convert values to deltas or returns
  • Ways to test for and adjust for stationarity.
    • One might want to normalize a return series with mean return, but need to use a cumulative or rolling historical mean to avoid look ahead bias.
  • Something for generating simplistic auto-regressive test time series could be useful (http://www.jessicayung.com/generating-autoregressive-data-for-experiments/)
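
Two of the ideas above, a simple auto-regressive test series and normalization that avoids look-ahead bias, can be sketched together in plain Python. Both function names are hypothetical, the AR(1) process is started from zero, and the first value is demeaned against 0.0 because it has no history; these are assumptions of the sketch, not a recommended recipe.

```python
import random
from typing import List


def generate_ar1(n: int, phi: float = 0.8, sigma: float = 1.0, seed: int = 0) -> List[float]:
    """Simulate x[t] = phi * x[t-1] + eps[t], with eps ~ N(0, sigma^2).

    The process starts from zero; |phi| < 1 keeps it stationary.
    """
    rng = random.Random(seed)
    series, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, sigma)
        series.append(prev)
    return series


def demean_with_past_only(series: List[float]) -> List[float]:
    """Subtract from each value the mean of the values strictly before it.

    The first value has no history, so 0.0 is used as its reference (assumption).
    """
    out, running_sum = [], 0.0
    for t, x in enumerate(series):
        past_mean = running_sum / t if t > 0 else 0.0
        out.append(x - past_mean)
        running_sum += x
    return out


series = generate_ar1(200)
normalized = demean_with_past_only(series)

# Extending the series never changes earlier outputs: no look-ahead.
assert demean_with_past_only(series[:50]) == normalized[:50]
```

Subtracting the mean of the whole series would instead let every normalized value depend on data that was not yet available at that time step.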

@nairbv

nairbv commented Jun 4, 2020

More ideas:

@vincentqb
Contributor Author

pytorch/pytorch#49338

@vincentqb vincentqb changed the title Time series? Time series Jan 8, 2021
@mthrok mthrok closed this as completed Jul 31, 2023