Data classes #228

Gautzilla · 2024-11-27T11:48:02Z

Context

I think we need to rethink the way we structure, access and manipulate data in OSEkit.

Global thoughts

Audio data, auxiliary data...

Data logic should be managed by base classes (e.g. Data, DataItem...). A Data instance would be as simple as some values associated with timestamps.

Specific data, like audio, should be addressed by specialized classes that inherit from the base classes: AudioData(Data), AudioDataItem(DataItem), which should contain only the audio-related stuff.

This would make the base class code reusable for all data types: if we add an AuxiliaryData whose values are just floats stored in a CSV, we could reshape this data using the general methods (e.g. super().reshape(...)).
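To illustrate the reuse described above, here is a minimal sketch and not the planned implementation. The `crop` method stands in for any general method such as reshape. The CSV layout (`timestamp` and `value` columns) and the use of the standard-library `datetime` instead of pandas are assumptions made to keep the example self-contained.

```python
from __future__ import annotations

import csv
import io
from datetime import datetime


class Data:
    """Base class (hypothetical): values associated with timestamps."""

    def __init__(self, timestamps: list[datetime], values: list[float]):
        if len(timestamps) != len(values):
            raise ValueError("timestamps and values must have the same length")
        self.timestamps = timestamps
        self.values = values

    def crop(self, begin: datetime, end: datetime) -> Data:
        """Keep only the values whose timestamp falls in [begin, end)."""
        kept = [(t, v) for t, v in zip(self.timestamps, self.values) if begin <= t < end]
        # type(self) makes subclasses get back an instance of their own class.
        return type(self)([t for t, _ in kept], [v for _, v in kept])


class AuxiliaryData(Data):
    """Floats stored in a CSV; only the CSV-specific part lives here."""

    @classmethod
    def from_csv(cls, csv_text: str) -> AuxiliaryData:
        # Assumed layout: a header row with 'timestamp' (ISO 8601) and 'value'.
        rows = list(csv.DictReader(io.StringIO(csv_text)))
        timestamps = [datetime.fromisoformat(row["timestamp"]) for row in rows]
        values = [float(row["value"]) for row in rows]
        return cls(timestamps, values)


csv_text = (
    "timestamp,value\n"
    "2024-11-27T00:00:00,1.5\n"
    "2024-11-27T00:00:10,2.5\n"
    "2024-11-27T00:00:20,3.5\n"
)
aux = AuxiliaryData.from_csv(csv_text)
# crop() is inherited from Data, no audio- or CSV-specific code is involved.
cropped = aux.crop(datetime(2024, 11, 27, 0, 0, 10), datetime(2024, 11, 27, 0, 0, 30))
print(type(cropped).__name__, cropped.values)  # AuxiliaryData [2.5, 3.5]
```

The subclass only knows how to load its own format; everything that works on timestamps stays in the base class.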

Base classes structure

At the moment, I'm thinking of:

- Splitting the relationship between the data and the actual files, with separate Data and DataFile classes:
  - In the present context, Data would represent the audio items contained within the dataset, with the user-specified duration, sampling rate, etc.
  - DataFile points to a file on disk, with timestamps for its begin/end dates and methods to access the data within a timestamp range.
- Since a single Data object could either be shorter than a file or cover multiple files, an intermediate DataItem class should help recover the whole Data values from the files:
  - Data would have a list of DataItem as an attribute. Methods of Data that read the data would in fact concatenate the data accessed through each DataItem.
- The Dataset class would just have a list of Data as an attribute, and only interact with this class.
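The structure above can be sketched as follows. This is only an illustration under stated assumptions: the files are held in memory instead of on disk, `from_files` is a hypothetical constructor name, and gaps between files are not handled.

```python
from __future__ import annotations

from datetime import datetime, timedelta


class DataFile:
    """Stand-in for a file on disk (samples are kept in memory here)."""

    def __init__(self, begin: datetime, sample_rate: int, samples: list[float]):
        self.begin = begin
        self.sample_rate = sample_rate
        self.samples = samples
        self.end = begin + timedelta(seconds=len(samples) / sample_rate)

    def read(self, start: datetime, stop: datetime) -> list[float]:
        """Return the samples between two timestamps."""
        first = round((start - self.begin).total_seconds() * self.sample_rate)
        last = round((stop - self.begin).total_seconds() * self.sample_rate)
        return self.samples[first:last]


class DataItem:
    """The part of one file that falls inside a Data object."""

    def __init__(self, file: DataFile, begin: datetime, end: datetime):
        self.file = file
        self.begin = begin
        self.end = end

    def get_value(self) -> list[float]:
        return self.file.read(self.begin, self.end)


class Data:
    def __init__(self, items: list[DataItem]):
        self.items = items

    @classmethod
    def from_files(cls, files: list[DataFile], begin: datetime, end: datetime) -> Data:
        """Build one item per file overlapping [begin, end), clipped to that range."""
        items = [
            DataItem(f, max(begin, f.begin), min(end, f.end))
            for f in sorted(files, key=lambda f: f.begin)
            if f.begin < end and f.end > begin
        ]
        return cls(items)

    def get_value(self) -> list[float]:
        # Concatenate the values accessed through each item.
        return [sample for item in self.items for sample in item.get_value()]


t0 = datetime(2024, 11, 27)
file_a = DataFile(t0, sample_rate=2, samples=[0.0, 1.0, 2.0, 3.0])  # covers 0 s to 2 s
file_b = DataFile(t0 + timedelta(seconds=2), sample_rate=2, samples=[4.0, 5.0, 6.0, 7.0])
# A Data object from 1 s to 3 s spans the end of file_a and the start of file_b.
data = Data.from_files([file_a, file_b], t0 + timedelta(seconds=1), t0 + timedelta(seconds=3))
print(data.get_value())  # [2.0, 3.0, 4.0, 5.0]
```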

Examples of simplified operations

This would help simplify the workflow and separate the concerns: we could do lots of stuff on the dataset without touching the original audio files (e.g. Reshape in the drawing below), and add methods that consolidate the audio files on demand:
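As a rough illustration of why reshaping would not need to touch the audio files: with this design it could come down to computing new begin/end timestamps, with samples read only when the data is consolidated. The `reshape_bounds` helper below is hypothetical and uses the standard-library `datetime` for self-containment.

```python
from datetime import datetime, timedelta


def reshape_bounds(
    begin: datetime, end: datetime, duration: timedelta
) -> list[tuple[datetime, datetime]]:
    """Split [begin, end) into consecutive windows; the last one may be shorter."""
    if duration <= timedelta(0):
        raise ValueError("duration must be positive")
    bounds = []
    start = begin
    while start < end:
        stop = min(start + duration, end)
        bounds.append((start, stop))
        start = stop
    return bounds


begin = datetime(2024, 11, 27)
windows = reshape_bounds(begin, begin + timedelta(seconds=25), timedelta(seconds=10))
# No file is read here: each window is only a pair of timestamps.
print([(stop - start).total_seconds() for start, stop in windows])  # [10.0, 10.0, 5.0]
```

Each pair of bounds could then be used to build a new Data object, and a separate method could write the corresponding audio to disk when needed.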

Example code

Kind of a code dump here, but that's a very simple implementation without any reshaping methods or anything:

import soundfile as sf
from pathlib import Path
from os import PathLike
from pandas import Timestamp, Timedelta
import numpy as np
from OSmOSE.utils.timestamp_utils import strptime_from_text

class AudioFile:
    def __init__(self, path: PathLike | str, begin: Timestamp | None = None, strptime_format: str | None = None):
        self.path = Path(path)

        if begin is None and strptime_format is None:
            raise ValueError('Either begin or strptime_format must be specified')

        self.metadata = sf.info(path)
        self.begin = begin if begin is not None else strptime_from_text(text = self.path.name, datetime_template=strptime_format)
        self.end = self.begin + Timedelta(seconds = self.metadata.duration)

    def read(self, start: Timestamp, stop: Timestamp) -> np.ndarray:
        sample_rate = self.metadata.samplerate
        start_sample = round((start-self.begin).total_seconds() * sample_rate)
        stop_sample = round((stop-self.begin).total_seconds() * sample_rate)
        return sf.read(self.path, start=start_sample, stop=stop_sample)[0]

class AudioItem:
    def __init__(self, file: AudioFile, begin: Timestamp | None = None, end: Timestamp | None = None):
        self.file = file
        self.begin = begin if begin is not None else self.file.begin
        self.end = end if end is not None else self.file.end

    def get_value(self) -> np.ndarray:
        return self.file.read(start=self.begin, stop=self.end)

class AudioData:
    def __init__(self, audio_items: list[AudioItem]):
        self.audio_items = audio_items

    @classmethod
    def from_file(cls, file: AudioFile):
        audio_item = AudioItem(file)
        return cls(audio_items=[audio_item])

    def get_value(self) -> np.ndarray:
        return np.concatenate([item.get_value() for item in self.audio_items])

class AudioDataSet:
    def __init__(self, data: list[AudioData], path: PathLike | str):
        self.data = data
        self.path = Path(path)

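    @classmethod
    def from_folder(cls, folder: PathLike | str, strptime_format: str | None = None, timestamps_file: PathLike | str | None = None):
        if not strptime_format and not timestamps_file:
            raise ValueError('Either strptime_format or timestamps_file must be specified')
        if not strptime_format:
            # Reading the begin timestamps from timestamps_file is not covered by this simple implementation.
            raise NotImplementedError('Reading timestamps from timestamps_file is not implemented yet')

        folder = Path(folder)
        # One AudioFile per wav file, with its begin timestamp parsed from the file name.
        audio_files = (AudioFile(path = p, strptime_format=strptime_format) for p in folder.glob("*.wav"))
        # Each AudioData initially covers exactly one whole file.
        audio_data = [AudioData.from_file(audio_file) for audio_file in audio_files]
        return cls(data = audio_data, path = folder)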


Gautzilla · 2024-11-28T10:59:45Z

@ElodieENSTA I have marked this as APLOSE related since it might change the relationship between the Data (with which the spectrograms are plotted) and the audio files.

Gautzilla added the "high priority" label (Urgent issue) Nov 27, 2024

Gautzilla self-assigned this Nov 27, 2024

Gautzilla added the "APLOSE related" label (The changes are impacted APLOSE behavior) Nov 28, 2024

This was referenced Nov 29, 2024

integration of timestamp utils #216

Merged

[DRAFT] New data classes #233

Draft
