Data classes #228

Open
Gautzilla opened this issue Nov 27, 2024 · 1 comment
Assignees: Gautzilla
Labels: APLOSE related, high priority
Context

I think we need to rethink the way we structure, access, and manipulate data in OSEkit.

Global thoughts

Audio data, auxiliary data...

Data logic should be managed by base classes (e.g. Data, DataItem...). A Data instance would be as simple as some values associated with timestamps.

Specific data, like audio, should be addressed by specialized classes that inherit from the base classes: AudioData(Data), AudioDataItem(DataItem), which should contain only the audio-related stuff.

This would make the base class code reusable for all data types: if we add an AuxiliaryData whose values are just floats stored in a csv, we could reshape this data using the general methods (e.g. super().reshape(...)).
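As a rough illustration of that inheritance idea, here is a minimal sketch. Everything in it is an assumption made for the example: values are held in memory as plain lists, `datetime` stands in for `pandas.Timestamp`, and `reshape`, `rms` and `mean` are hypothetical methods, not existing OSEkit API.

```python
from __future__ import annotations

import math
from datetime import datetime, timedelta


class Data:
    """Hypothetical base class: some values associated with timestamps."""

    def __init__(self, begin: datetime, sample_rate: float, values: list[float]):
        self.begin = begin
        self.sample_rate = sample_rate  # values per second
        self.values = values

    @property
    def end(self) -> datetime:
        return self.begin + timedelta(seconds=len(self.values) / self.sample_rate)

    def reshape(self, duration: timedelta) -> list[Data]:
        """Split the data into consecutive chunks of `duration` (the last one may be shorter)."""
        chunk_len = round(duration.total_seconds() * self.sample_rate)
        if chunk_len <= 0:
            raise ValueError("duration must cover at least one value")
        # type(self) keeps the subclass, so AudioData.reshape returns AudioData objects.
        return [
            type(self)(
                self.begin + timedelta(seconds=i / self.sample_rate),
                self.sample_rate,
                self.values[i : i + chunk_len],
            )
            for i in range(0, len(self.values), chunk_len)
        ]


class AudioData(Data):
    """Only the audio-related stuff lives here (hypothetical example method)."""

    def rms(self) -> float:
        return math.sqrt(sum(v * v for v in self.values) / len(self.values))


class AuxiliaryData(Data):
    """E.g. floats read from a csv; reshaping is inherited from Data."""

    def mean(self) -> float:
        return sum(self.values) / len(self.values)


begin = datetime(2024, 11, 27)
audio = AudioData(begin, 4.0, [0.0, 1.0, -1.0, 0.5, 0.5, 0.25])
parts = audio.reshape(timedelta(seconds=1))
```

Neither subclass redefines `reshape`: the time logic stays in `Data`, and each specialized class only adds what is specific to its data type.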

Base classes structure

At the moment, I'm thinking of:

  • Splitting the relationship between the data and the actual files, with separate Data and DataFile classes.
    • In the present context, Data would represent the audio items contained within the dataset, with the user-specified duration, sampling rate, etc.
    • DataFile points to a file on disk, with timestamps for its begin/end date and methods to access the data within a timestamp range.
  • Since a single Data object could either be shorter than a file or span multiple files, an intermediate DataItem class should help recover the whole Data values from the files:
    • Data would have a list of DataItem as attribute. Methods from Data that read the data would in fact concatenate the data accessed through each DataItem.
  • The Dataset class would just have a list of Data as attribute, and only interact with this class.
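To make the role of DataItem more concrete, here is a minimal sketch of how a Data object covering an arbitrary time range could be assembled from several files. It relies on assumptions made for the example: the DataFile below holds its samples in memory instead of pointing to a file on disk, `datetime` stands in for `pandas.Timestamp`, `from_files` is a hypothetical constructor, and gaps between files are simply skipped.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DataFile:
    """Stand-in for a file on disk (samples are kept in memory in this sketch)."""
    begin: datetime
    sample_rate: int
    samples: list[float]

    @property
    def end(self) -> datetime:
        return self.begin + timedelta(seconds=len(self.samples) / self.sample_rate)

    def read(self, start: datetime, stop: datetime) -> list[float]:
        first = round((start - self.begin).total_seconds() * self.sample_rate)
        last = round((stop - self.begin).total_seconds() * self.sample_rate)
        return self.samples[first:last]


@dataclass
class DataItem:
    """The part of one file that belongs to a Data object."""
    file: DataFile
    begin: datetime
    end: datetime

    def get_value(self) -> list[float]:
        return self.file.read(self.begin, self.end)


class Data:
    def __init__(self, items: list[DataItem]):
        self.items = items

    @classmethod
    def from_files(cls, files: list[DataFile], begin: datetime, end: datetime) -> Data:
        """Create one item per file that overlaps [begin, end)."""
        items = []
        for file in sorted(files, key=lambda f: f.begin):
            item_begin = max(begin, file.begin)
            item_end = min(end, file.end)
            if item_begin < item_end:  # the file overlaps the requested range
                items.append(DataItem(file, item_begin, item_end))
        return cls(items)

    def get_value(self) -> list[float]:
        # Concatenate the values accessed through each item.
        return [v for item in self.items for v in item.get_value()]


t0 = datetime(2024, 11, 27)
file_1 = DataFile(t0, 2, [0.0, 1.0, 2.0, 3.0])  # covers t0 to t0 + 2 s
file_2 = DataFile(t0 + timedelta(seconds=2), 2, [4.0, 5.0, 6.0, 7.0])  # next 2 s
spanning = Data.from_files([file_1, file_2], t0 + timedelta(seconds=1), t0 + timedelta(seconds=3))
short = Data.from_files([file_1, file_2], t0 + timedelta(seconds=0.5), t0 + timedelta(seconds=1))
```

Here `spanning` is made of two items (the end of the first file and the beginning of the second), while `short` only uses a part of the first file.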

Examples of simplified operations

This would help simplify the workflow and separate the concerns: we could do lots of stuff on the dataset without touching the original audio files (e.g. Reshape in the drawing below), and add methods that consolidate the audio files on demand:

[Figure: data_classes]
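To illustrate why a reshape would not need to touch the audio files, here is a minimal sketch in which reshaping only manipulates timestamps. The names `FileSpan`, `Segment` and `reshape` are hypothetical, `datetime` stands in for `pandas.Timestamp`, and the sketch assumes the new windows simply start at the beginning of the first file.

```python
from __future__ import annotations

from datetime import datetime, timedelta
from typing import NamedTuple


class FileSpan(NamedTuple):
    """Time span covered by one file (no audio is read here)."""
    name: str
    begin: datetime
    end: datetime


class Segment(NamedTuple):
    """Part of a file that would be read when a window is consolidated."""
    file_name: str
    begin: datetime
    end: datetime


def reshape(files: list[FileSpan], duration: timedelta) -> list[list[Segment]]:
    """Cut the time covered by `files` into windows of `duration`.

    Each window is returned as the list of file segments it is made of.
    """
    if duration <= timedelta(0):
        raise ValueError("duration must be positive")
    if not files:
        return []
    files = sorted(files, key=lambda f: f.begin)
    begin = files[0].begin
    end = max(f.end for f in files)
    windows = []
    while begin < end:
        stop = min(begin + duration, end)
        windows.append(
            [
                Segment(f.name, max(begin, f.begin), min(stop, f.end))
                for f in files
                if f.begin < stop and f.end > begin  # the file overlaps the window
            ]
        )
        begin = stop
    return windows


t0 = datetime(2024, 11, 27)
files = [
    FileSpan("a.wav", t0, t0 + timedelta(seconds=10)),
    FileSpan("b.wav", t0 + timedelta(seconds=10), t0 + timedelta(seconds=20)),
]
windows = reshape(files, timedelta(seconds=8))
```

With two contiguous 10 s files reshaped to 8 s, the second window is made of the last 2 s of a.wav and the first 6 s of b.wav. The audio would only be read when a window is actually consolidated to disk.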

Example code

Kind of a code dump here, but that's a very simple implementation without any reshaping methods or anything:

import soundfile as sf
from pathlib import Path
from os import PathLike
from pandas import Timestamp, Timedelta
import numpy as np
from OSmOSE.utils.timestamp_utils import strptime_from_text

class AudioFile:
    def __init__(self, path: PathLike | str, begin: Timestamp | None = None, strptime_format: str | None = None):
        self.path = Path(path)

        if begin is None and strptime_format is None:
            raise ValueError('Either begin or strptime_format must be specified')

        self.metadata = sf.info(path)
        self.begin = begin if begin is not None else strptime_from_text(text = self.path.name, datetime_template=strptime_format)
        self.end = self.begin + Timedelta(seconds = self.metadata.duration)

    def read(self, start: Timestamp, stop: Timestamp) -> np.ndarray:
        sample_rate = self.metadata.samplerate
        start_sample = round((start-self.begin).total_seconds() * sample_rate)
        stop_sample = round((stop-self.begin).total_seconds() * sample_rate)
        return sf.read(self.path, start=start_sample, stop=stop_sample)[0]

class AudioItem:
    def __init__(self, file: AudioFile, begin: Timestamp | None = None, end: Timestamp | None = None):
        self.file = file
        self.begin = begin if begin is not None else self.file.begin
        self.end = end if end is not None else self.file.end

    def get_value(self) -> np.ndarray:
        return self.file.read(start=self.begin, stop=self.end)

class AudioData:
    def __init__(self, audio_items: list[AudioItem]):
        self.audio_items = audio_items

    @classmethod
    def from_file(cls, file: AudioFile):
        audio_item = AudioItem(file)
        return cls(audio_items=[audio_item])

    def get_value(self) -> np.ndarray:
        return np.concatenate([item.get_value() for item in self.audio_items])

class AudioDataSet:
    def __init__(self, data: list[AudioData], path: PathLike | str):
        self.data = data
        self.path = Path(path)

    @classmethod
    def from_folder(cls, folder: PathLike | str, strptime_format: str | None = None, timestamps_file: PathLike | str | None = None):
        if not strptime_format and not timestamps_file:
            raise ValueError('Either strptime_format or timestamps_file must be specified')
        if not strptime_format:
            # Reading the begin timestamps from timestamps_file is not implemented in this sketch.
            raise NotImplementedError('Parsing timestamps_file is not supported yet')

        folder = Path(folder)
        # One AudioFile per wav file, then one AudioData per AudioFile.
        audio_files = (AudioFile(path = p, strptime_format=strptime_format) for p in folder.glob("*.wav"))
        audio_data = [AudioData.from_file(audio_file) for audio_file in audio_files]
        return cls(data = audio_data, path = folder)
@Gautzilla Gautzilla added the high priority Urgent issue label Nov 27, 2024
@Gautzilla Gautzilla self-assigned this Nov 27, 2024
@Gautzilla Gautzilla added the APLOSE related The changes are impacted APLOSE behavior label Nov 28, 2024

Gautzilla commented Nov 28, 2024

@ElodieENSTA I have marked this as APLOSE related since it might change the relationship between the Data (with which the spectrograms are plotted) and the audio files.
