
[Discussion]: Audio Query Support to master #577

Closed
gkennickell opened this issue Nov 15, 2024 · 5 comments

@gkennickell
Contributor

Providing an interface to query the raw audio samples is a very useful feature for multi-modal research.

The topic has been discussed before, with feature support added here: #233

But the discussion from 2021 #183 left off with the recommendation to use the fork until the feature could be refined.

The main issue with this approach is that there have been many useful changes in master since then, and with those changes enough divergence to require modifications to the fork. On our end, we have to maintain a custom build to support these audio use cases, which is not ideal: it is error-prone and costly in time.

Is there a plan to merge this feature to master?

@pseudo-rnd-thoughts
Member

Thanks for raising the issue, I agree that it would be super interesting to add this directly into ALE.
@gkennickell would you be willing to make a working version of #233?

If so, for multi-modal policies, would it be worth adding the option for observations that contain both image and audio?

@gkennickell
Contributor Author

would you be willing to make a working version of #233

Yes, I can do that. What would be the preferred way to submit a working version, given the state of the original PR and the goal of maintaining attribution/credit? The other question: given that the doc/example/ folder was removed a while back and replaced with a minimal set of markdown docs, should the scripts added in 53eb67d be converted to markdown and moved under docs/, or removed?

If so, for multi-modal policies, would it be worth adding the option for observations that contain both image and audio?

The original commit added a combined call on the ale python interface:
53d1c84#diff-044087092cf92ca5a0d16c162cd1a3fd99540c760f10fca3824ad5e1449dc84bR265

I did not integrate that change into the version we maintain, opting to keep the lower level ale python interface as minimal as possible.

For the AtariEnv, returning an observation that contained both the 'obs_type' and audio would seem ideal, given gym wraps everything required for a step(). I am exclusively using the ale api, so I do not currently have audio exposed on the gym interface.

If this would be useful to expose to gym, one way to achieve this could be to:

  • add a 'sound' boolean to init (default=False)
  • if 'sound' is True, _get_obs() would return a tuple of [obs_type, audio]

For applications that do not use audio, it would be backwards-compatible.
Downside - for those that do, they'd have to maintain two call return signatures for testing with/without audio enabled.

It might be useful for _get_obs() to return an extensible dict or utility class in the future, as there's been research which used image data along with RAM as additional compressed state.
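A minimal sketch of the two shapes discussed above (tuple return vs. an extensible dict), using stand-in data; AtariEnvSketch and its helper names are hypothetical illustrations, not the actual AtariEnv implementation:

```python
# Hypothetical sketch of the two observation shapes discussed above.
# AtariEnvSketch, _get_audio, and the buffer sizes are illustrative
# stand-ins, not the real AtariEnv implementation.
class AtariEnvSketch:
    def __init__(self, sound: bool = False):
        self.sound = sound

    def _get_image(self):
        # stand-in for the screen observation (210x160 placeholder)
        return [[0] * 160 for _ in range(210)]

    def _get_audio(self):
        # stand-in for one frame of TIA-derived audio samples
        return [0] * 512

    def _get_obs(self):
        # tuple variant: backwards-compatible when sound is off, but callers
        # must handle two return shapes when testing with/without audio
        obs = self._get_image()
        if self.sound:
            return obs, self._get_audio()
        return obs

    def _get_obs_dict(self):
        # dict variant: extensible to RAM or other modalities later
        out = {"image": self._get_image()}
        if self.sound:
            out["audio"] = self._get_audio()
        return out
```

The dict variant avoids the dual-signature downside noted above, at the cost of changing the observation type for all callers.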

Open to suggestions.

@pseudo-rnd-thoughts
Member

Yes, can do. What would be the preferred way to submit a working version given the state of the original PR, and wanting to maintain attribution/credit? The other question: given the doc/example/ folder was removed a while back and replaced with a minimal set of markdown docs, should the scripts added here 53eb67d be converted to markdown and moved under docs/ or removed?

Amazing! Could you make a PR here? We can link to / reference the older versions.

I did not integrate that change into the version we maintain, opting to keep the lower level ale python interface as minimal as possible.

I'm happy with a minimal interface, as that means less work to support.

If this would be useful to expose to gym

I would use sound_obs: bool = False, adding it either to the _get_obs function or to reset / step directly.

Are you currently working on a research paper with this? Is there anything that I should be aware of?

@gkennickell
Contributor Author

Ok, thank you. I'll get started on this.

Are you currently working on a research paper with this? Is there anything that I should be aware of?

No paper planned as of yet, just general research.

@gkennickell
Contributor Author

While putting together a PR of the original commit (#233) on top of latest, I reviewed what was done and decided to refactor/simplify the code.

The original commit dual-purposed the SoundSDL+SoundExporter path in order to implement the feature.
The primary purpose of SoundSDL is audio playback to a device, with the SoundExporter a test option for writing samples generated for playback to a file.

Conflating SoundSDL with additional support for a per-frame sound observation through SoundExporter added complexity that will most likely make maintenance more difficult and error-prone: it requires multiple boolean switches to enforce mutually exclusive behavior in both the exporter and sound playback paths, and it pulls ALE common utility headers into an emucore base class.

Additionally, generating a sound observation from the TIA sound registers does not require SDL support, so an artificial dependency is created.

It looks like the main reasons the feature was implemented through SoundExporter on SoundSDL were to:

  • re-use the SoundExporter data buffer for storing samples generated for the frame.
  • re-use the TIA sound register queue on SoundSDL.

The commit I'm proposing at #579 does the following:

  • Requires a 'sound_obs' boolean to enable (in the original commit this was 'record_sound_for_user').
    • NOTE: The audio buffer is zeroed out if 'sound_obs' is False; otherwise it is populated with the per-frame sound data generated from the TIA sound registers.
  • Creates a simple SoundRaw class which holds a TIA register queue and provides a simple frame processing command.
  • Enforces the same mutual exclusivity with SoundSDL+SoundExporter option.
  • Adds a sound data buffer on the stella_environment, similar to the other observations: image and RAM.
  • I left the call logic in stella_environment unchanged: sample generation only happens on the last frame-skip frame of act().

For the API:

  • The ale interface is straightforward, and a simple python test is added.
  • For gym, this is not as straightforward:
    • returning the sound obs from step() seems the cleanest, but will require applications to upgrade their call logic when they update to latest.
    • the atari_env test requires a change to gymnasium/utils/env_checker.py:check_env, but it's not clear what the process is for synchronizing those changes.
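In the spirit of the simple python test mentioned above, a minimal ALE-interface check might look roughly like this; check_audio_buffer is a hypothetical helper, and the exact setBool/getAudio call names follow this discussion (PR #579), so signatures may differ:

```python
# Hypothetical sketch: exercise the proposed audio buffer through the
# low-level ALE interface. setBool("sound_obs", ...) and getAudio() follow
# the calls discussed in this thread (PR #579); exact signatures may differ.
def check_audio_buffer(ale, rom_path, n_steps=10):
    ale.setBool("sound_obs", True)   # enable per-frame audio generation
    ale.loadROM(rom_path)
    for _ in range(n_steps):
        ale.act(0)                   # NOOP action
        audio = ale.getAudio()       # samples for the last frame-skip frame
        assert len(audio) > 0, "expected per-frame audio samples"
```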

The other consideration is how many users would be interested in multi-modal tests in gym. We could forgo modifying the gym API until there is demand; in the meantime, developers of multi-modal gym applications could call env.ale.setBool("sound_obs", True) and env.ale.getAudio() directly.
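The direct-call workaround could look roughly like this; step_with_audio is a hypothetical helper, and setBool/getAudio follow the calls named in this thread rather than a confirmed public API:

```python
# Hypothetical workaround sketch: fetch audio from the underlying ALE object
# without changing the gym step() API. setBool / getAudio follow the calls
# named above (PR #579); exact names and return types may differ.
def step_with_audio(env, action):
    """Step a wrapped Atari env and also return the per-frame audio samples."""
    obs, reward, terminated, truncated, info = env.step(action)
    audio = env.unwrapped.ale.getAudio()
    return (obs, audio), reward, terminated, truncated, info

# Usage (assuming gymnasium + ale-py with the proposed feature):
#   env = gymnasium.make("ALE/Breakout-v5")
#   env.unwrapped.ale.setBool("sound_obs", True)
#   (obs, audio), reward, term, trunc, info = step_with_audio(env, 0)
```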

Thoughts?

Pull request: #579
