
Hum to Tweet:

You're talking about identifying a frequency and converting that into a certain symbol? Wouldn't an FFT suffice to move from time-domain into frequency-domain, and then a peak-detection (perhaps even shape detection to recognize vowels and such) algorithm to detect what is present in the signal?

The goal is to hum into a microphone and output a series of letters (symbols, not notes). Most systems try and match your hum to some fixed note, but that is not what I want to do. I just want to match the relative distance between hums.

A test could be: sing "daa daa daa" into the microphone and have it output "AAA" text, then sing "daa dee daa" and have it output "ABA", then "daa dee doo" to output "ABC", and so on. Effectively, you're matching the pitch difference between the hums, on some configurable step interval, to a letter - since there are 26 letters + 10 numbers, the divisor would probably be 36.
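
A minimal sketch of that relative mapping (everything here - the function name, the 36-symbol alphabet order, the semitone-sized step - is illustrative, not from any existing code):

```js
// Sketch: map the relative pitch change between hums to letters.
// `pitches` would come from a pitch detector (one Hz value per hum).
const ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"; // 36 symbols

function humsToSymbols(pitches, stepRatio = Math.pow(2, 1 / 12)) {
  if (!pitches.length) return "";
  const base = pitches[0]; // the first hum is the anchor, symbol "a"
  return pitches.map(hz => {
    // how many steps above/below the first hum, rounded to the nearest step
    const steps = Math.round(Math.log(hz / base) / Math.log(stepRatio));
    // wrap into the 36-symbol alphabet so it stays tweetable
    const idx = ((steps % ALPHABET.length) + ALPHABET.length) % ALPHABET.length;
    return ALPHABET[idx];
  }).join("");
}

// humsToSymbols([220, 220, 220]) -> "aaa"  ("daa daa daa")
// humsToSymbols([220, 233, 220]) -> "aba"  ("daa dee daa", one step up)
```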

Timing will be handled in the 2nd iteration of the project. As far as input transformation, the goal is not to match a particular absolute input sound to an output, the goal is to just track the relative change in input sounds. Other features, like the Tweet-to-Song tool, will decide what the actual notes played are (often they'll be full instrumental synths, but in the future, will be an AI orchestra with control-points/parameters for adjusting emotion or other variables).

If it is gonna be easier to work with "timing" from the start, the Tweet-to-Song system already uses "=" to extend a sound. It is extended for as long as whatever the current BPM is set to, which is dynamic. So for recording I'd probably default to the smallest increment of 8 letters per second - I couldn't really beatbox/beep-boop faster than that. However I assume the smaller the interval, the harder it becomes to detect in code? So if it is easier for the 1st working version (again, the goal is to do this all as incremental value-add that is usable at each step) to have people beep-boop slowly, that is fine. Again, remember, all timing will be adjustable in the Tweet system later, speeding things up or slowing them down, etc.

I'm trying to do relative melody capture so it can be modified/instrumentified/synth-ized/+ everything else later in text format (and a higher fidelity data format that annotates the text format). The goal of tweet-to-song is not to have a canonical representation of Western music theory in letter form, but to actually disconnect specific notes from being fixed to certain letters. So that way they can be relatively reassigned to match different inputs. Like, for example, a laptop QWERTY keyboard.

Our approach is to have 3 components, (1) input (2) template (3) output, the same input will create the same output if using the same template, however the template can be changed just as much as the input can be changed. This is where we'll be able to swap in AI templates using the same input parser, etc. in the future.

So, basically, you want to convert an input sound sample into something like frequency+overtones+duration (or midi-like), and then produce audio from that again, optionally in a different voice like an instrument or back to speech again?

Yes. (tho not likely speech)

The timing of the glyph would be handled in a 2nd iteration.

Interval wouldn't be a problem in such a scheme. If you're pulling audio samples (time domain) through an FFT to convert them into the frequency domain, then with 20ms slices of audio you would get 50Hz resolution and a maximum of 50 symbols per second. The hardest part would be to find plenty of samples of the phonetic alphabet (different notes/hums) and detecting those in the frequency domain.
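
As a rough browser-side sketch of that slicing (using the Web Audio AnalyserNode; the 1024-sample window and the 20ms timer are illustrative choices):

```js
// Sketch: pull ~20ms slices of microphone audio through an FFT.
// At 48kHz, a 1024-sample window is ~21ms, giving ~47Hz per-bin resolution.
async function startAnalyser() {
  const ctx = new AudioContext();
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = ctx.createMediaStreamSource(stream);
  const analyser = ctx.createAnalyser();
  analyser.fftSize = 1024;
  source.connect(analyser);

  const bins = new Float32Array(analyser.frequencyBinCount);
  const binHz = ctx.sampleRate / analyser.fftSize; // Hz per frequency bin

  setInterval(() => {
    analyser.getFloatFrequencyData(bins); // dB magnitudes, phases discarded
    // peak-pick `bins` here; bin i corresponds to i * binHz
  }, 20); // one slice every ~20ms -> up to ~50 symbols per second
}
```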

Oye, a symbol every 20ms is probably too high-bandwidth for human-level text editing of it. My plan later is that each symbol can be "opened up" and more finely controlled using a curve editor to change attack/speed/volume/decay/etc., but that level of "meta" editor is +1 up in skill, for people who want to learn the tool for more powerful creative control. But the tool needs to be usable for people who don't want to dive deeper. Like with the text/letter/tweet editor, the sampling would be constrained to a-z and 0-9, so 36 division steps.

Then the intermediary format should probably be different. Perhaps a line/json-object per symbol, because then you can later easily extend functionalities.
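
For example, one record per symbol could look something like this (field names are hypothetical, just to show how extra fields could be added later):

```js
// Hypothetical line-per-symbol intermediary format: one small object per
// symbol, so fields like attack/decay/volume curves can be added later
// without breaking older data.
const symbols = [
  { t: 0,   dur: 125, freq: 220.0, symbol: "a" },
  { t: 125, dur: 125, freq: 233.1, symbol: "b" },
  { t: 250, dur: 250, freq: 220.0, symbol: "a" } // extended, i.e. "a="
];
```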

Correct, a more detailed internal format will be saved as a graph inside GUN. But to reduce the scope/complexity of the 1st version, we've only been working with "lossy" data, as in... the 1st thing we started to build was the "hum->letter" converter, but we quickly skipped to the 2nd step, the "tweet->song" system, which has already been working for about a year. This has produced "good enough" results, with its limited control/tooling, to warrant continuing. Doing the more detailed format right now would not be as valuable to the current iteration. Tho, yes, it would be good as a long-term goal - I'm trying to be careful to be incremental about development and progress. Defining the format now would be too early because it needs to be informed by the curve editor we build.

Sounds a bit like the Speex codec though. That one is able to use frequencies, overtones & white noise to compress & decompress speech. If you'd base the code on that, you could easily change the intermediary format and decoder to produce any voice you want (like a guitar or piano). Basically, if you're able to detect the distance between overtones, you should be able to find the fundamental frequency of the hum (either at the peak distance or the peak distance / 2). Based on the overtones, you should be able to detect vowels (the diff between dee, daa, doo), then convert into the symbols you want for whatever application you desire.
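
A rough sketch of the fundamental-from-overtone-spacing idea (peak picking is assumed to have already happened; `peaks` is a sorted list of peak frequencies in Hz):

```js
// Sketch: harmonics of a hum sit near f0, 2*f0, 3*f0, ... so the median
// gap between neighbouring spectral peaks approximates the fundamental
// (or 2*f0 when the fundamental itself is weak - the "/2" caveat above).
function estimateFundamental(peaks) {
  if (peaks.length < 2) return peaks[0] || 0;
  const gaps = [];
  for (let i = 1; i < peaks.length; i++) gaps.push(peaks[i] - peaks[i - 1]);
  gaps.sort((a, b) => a - b);
  return gaps[Math.floor(gaps.length / 2)]; // median gap ~ fundamental
}

// estimateFundamental([110, 220, 330, 440]) -> 110
```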

Woah! OK, now that might be going to the opposite extreme of being overly literal, which.. may not be a bad thing lol. Let me send you some samples of the tweet to song output - they're low quality, but the Music Maker tool already exists for it and is "usable" for silly/small projects (that is the incremental goal I'm trying to hit, each new piece needs to be useful for some sort of result, and I'm actually trying to tie it to the mini "games" I'm making, so each level of the game is some slightly more advanced feature that dictates which next meta tool is built).

Here are some examples of songs it has generated already - they are not the songs/output that I want but they at least do something which is passable for some type of art/backgrounds/projects:

A good summary of "why" is that my son (I'm a single dad) could produce/create a song in 1 minute, even at age 3! The point is: it is about building tools that empower people to create at faster scale, speed, volume, and (eventually) quality than professionals or orchestras.

For more context on how this could even be possible, please see my article on creating worlds.

The next coding/development step would be to improve the Tweet-to-Song system by adding more control points to create more sophisticated songs, but I don't want to do that "yet" until the original 1st step is built (humming into a microphone). Maybe a literal "dee, daa, doo" approach could fit as the next step? I'm just not sure how literal it would be, in terms of human-like voice - whether it would produce output too constrained to speech-like results, rather than "dee, daa, doo" -> "instrumental synth" -> (then in the far future) "AI musical synth".

You could turn the symbols into midi-compatible signals. That way you would be able to re-use them in a plethora of audio software.
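
For instance, the letter symbols could be mapped to MIDI-like note events along these lines (the base note and per-symbol duration are arbitrary illustrative choices, not part of any existing tool):

```js
// Sketch: turn relative symbols into MIDI-like note events that other
// audio software could consume.
const BASE_NOTE = 60; // middle C, standing in for the first-hum anchor
const STEP_MS = 125;  // 8 symbols per second

function symbolsToMidiEvents(symbols) {
  const alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
  return [...symbols].map((s, i) => ({
    note: BASE_NOTE + alphabet.indexOf(s), // relative offset from the anchor
    velocity: 100,
    start: i * STEP_MS,
    duration: STEP_MS
  }));
}

// symbolsToMidiEvents("aba") -> [{note: 60, ...}, {note: 61, ...}, {note: 60, ...}]
```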

Yeah, I'm more OK going in that direction rather than "convert into Western music theory note symbols". However, probably due to my ignorance, most of the MIDI synths I've heard sound... very... MIDI-ish? Or is it just that when I hear high quality stuff generated from MIDI I don't realize it was? The article I sent probably explains the background to this much better, giving context for the type of editing tools I'm building up towards, which I'm not sure how well MIDI philosophically aligns with - tho without a doubt I'd want to support exporting to MIDI.

Here's a high quality soundfont demonstration.

YESSSS! Perfect, actually this is why we're already using soundfonts. That is how we're generating the instrumentals currently. My understanding from our experimental prototypes done in ToneJS (I'd rather not have dependencies tho, but ToneJS has done a great job) is that even each soundfont can be "distorted" individually with different curve/attack adjustments. So you may have 1 sound font for a grand piano, but you can still then play each "stroke" of that same piano key sound with slightly different variable sound.

The route for incremental usage I'd take now would be to add a midi-json layer between the frequency domain and your symbols. That way you can introduce a translator between your symbols and MIDI, making it usable in audio software. Because you said you already had some audio-into-symbol conversion, right? I believe MIDI supports all the control points necessary for adjusting soundfont attack/decay/etc., but I'm not 100% sure. You could also dig into https://github.com/dntj/jsfft to get a proper grasp of transforming the time domain (voltage levels) into the frequency domain (frequencies, phases and amplitudes). After that, you can ditch the phases and apply pattern-matching on the frequencies+amplitudes to detect what sound is being made by the user.
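
The pattern-matching step could start as simple peak picking over the magnitudes (a minimal sketch; `spectrum` is assumed to be the per-bin magnitudes from an FFT like the AnalyserNode example earlier, and `binHz` the Hz-per-bin resolution):

```js
// Sketch: keep only magnitudes (phases ditched) and pick local peaks,
// which can then feed fundamental/vowel matching.
function pickPeaks(spectrum, binHz, threshold = -60) {
  const peaks = [];
  for (let i = 1; i < spectrum.length - 1; i++) {
    const localMax = spectrum[i] > spectrum[i - 1] && spectrum[i] > spectrum[i + 1];
    if (localMax && spectrum[i] > threshold) {
      peaks.push({ freq: i * binHz, mag: spectrum[i] });
    }
  }
  return peaks.sort((a, b) => b.mag - a.mag).slice(0, 8); // loudest few peaks
}
```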

On a slightly different note (ahem, pun), a specific use case I want for myself, personally, is to generate retrowave songs from my hums or typed-out tweets. Like, typing "qq ww ee" then choosing a retrowave 'instrument' instead of the current 'piano' instrument. Tho my assumption is that doing such a full retrowave composition would be more towards it being an "AI" instrument than a simple single-track instrument on its own. Our tweet-to-song supports multi-track input, including chords, to multi-track output. But stuff like the AI versions would auto-generate multi-track output for you from single-track input. Then this could be combined with the tweet-to-song multi-track input to add non-instrumental controls, like "emotion" and speed/intensity, that affect some or all of the other tracks.

But I'm getting distracted by the future. For right now tho, I'm just trying to go from hum-to-text, then reuse the tweet-to-song single-instrument-per-track output, to create songs like what I demoed earlier - but this time rather than me typing the input, I hum it instead. You'd sample it through MIDI?

Ignoring the input part (starts to sound like the compressor part of an auto-encoder), you should be able to convert any format you want into midi, once you've defined the format. Then from midi you can go into almost anything, including adding the full orchestra produced by an AI.

Cool. Well, hmm, just trying to clarify whether MIDI will be able to handle doing multi-track, on-the-fly, dynamic changes to the song. For instance, say one of the AI-level tracks may define emotion, or another one may define "error". So if we have "qq ww ee" input text for the song part, we may want to tell the AI to generate a full orchestra melody based on that sequence, but then add in another track of "f...abf" to inject randomness/noise/non-determinism/mistakes into the orchestra output. So rather than the timing of each instrument in the orchestra being exactly in sync like code/a robot would do, it'd fuzzily shift some of it by some milliseconds so it sounds more human-made. That is an example where I have no clue if a format like MIDI would support encoding it.

For example, on the current version of the tweet-to-song system, I can, on song playback, dynamically adjust blur/distortion sound effects on the song. Which is cool, cause it is what made some of the demoed songs earlier (I think only 1 of them did it) sound more human. I do this by waving my mouse around like a conductor as the browser simultaneously records & downloads the song as it plays. So just like we have a list of instruments, we have a list of effects, and one of those effects would be "error", on top of the blur/distortion etc. ones we have.

Have you got a well-defined specification of the format? You could add "wiggle" (term just invented), basically making the odd notes earlier and the even notes later. That way it would be measurable, probably making it compatible with other things. On top of that you could account for human imperfections, making all notes wiggle by +/- 5ms. Most musicians are able to hit notes to within 5-10ms. But still, does your tweet music notation have a specification? If so, converting human-produced audio into that notation should be doable.
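
A tiny sketch of that wiggle, applied to note events with a `start` time in ms (the amounts are the +/- 5ms figures above; the event shape is hypothetical):

```js
// Sketch: odd/even notes shifted slightly early/late, plus a small random
// jitter per note to imitate human imperfection.
function wiggle(events, amountMs = 5, jitterMs = 5) {
  return events.map((e, i) => ({
    ...e,
    start: e.start
      + (i % 2 === 0 ? -amountMs : amountMs)  // alternate early / late
      + (Math.random() * 2 - 1) * jitterMs    // +/- human jitter
  }));
}
```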

It has a parser, here, but remember, this parser does not contain more advanced controls yet. The way I'm defining multiple tracks or simultaneous instruments, etc., is pretty flexible, here's an example of the different approaches that already work:

aa bb cc
qq ww ee

This can be received as a full text blob, non-streamed. These 2 lines represent 2 tracks that play simultaneously. But we also support streaming - take the same thing, but save the 2 lines as separate strings, concurrently, in GUN:

aa bb cc

and

qq ww ee

We can sync these simultaneously, and then tell the Tweet-to-Song system via preconfigured meta-data to play both simultaneously. So we're able to achieve multi-track playing multiple ways: one that is more human-text friendly, and another which adds additional data in a more machine-friendly way. This 2nd approach is a bit easier to "stream". But there is also a 3rd approach:

[aq][aq] [bw][bw] [ce][ce]

This is sort of human-readable, and the parser already supports it, it plays them as chords. So:

  • a-z and 0-9 are playable soundfonts (based on whatever the configuration is... this may be defined by the creator, or lol, could be overridden by the listener. The author may not have specified the instrument, having played it on piano, but the listener's current instrument may be set to drum, so they'll hear it in drum).

  • [abc] plays each letter simultaneously (so this does require the parser to seek the closing symbol).

  • \n represents a new track.

  • (abc) represents grouping, for later when we support math operations (this was added already, but buggy).

  • = means extend the previous soundfont, as in b==== "holds" the sound for a longer amount of time, as defined by whatever the BPM currently is.

  • / volume up, \ volume down (not implemented, I don't think)

And then which actual notes are played is determined by another dynamic configuration (just like the instrument), which defaults to a major harmonic scale, I think? Here's a demo / debugging tool someone wrote for it. They made like 7 different scale mappings, but we chose the default based off which one made the best harmony for a QWERTY keyboard setup. It should show which scale is being used in the tool.
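
(Not the actual parser linked above - just a minimal sketch of how the notation rules listed could be tokenized; the token shape here is made up for illustration:)

```js
// Sketch of the notation: letters/digits trigger soundfonts, [..] plays as
// a chord, = extends the previous sound, a new line is a new track.
function parseTweet(text) {
  return text.split("\n").map(line => {
    const track = [];
    for (let i = 0; i < line.length; i++) {
      const c = line[i];
      if (c === "[") {                       // chord: seek the closing ]
        const end = line.indexOf("]", i);
        track.push({ chord: line.slice(i + 1, end).split(""), beats: 1 });
        i = end;
      } else if (c === "=") {                // hold the previous sound longer
        const prev = track[track.length - 1];
        if (prev) prev.beats += 1;
      } else if (/[a-z0-9]/.test(c)) {
        track.push({ note: c, beats: 1 });
      }                                      // spaces etc. are skipped here
    }
    return track;
  });
}

// parseTweet("aa bb cc\nqq ww ee")         -> 2 tracks, played simultaneously
// parseTweet("[aq][aq] [bw][bw] [ce][ce]") -> 1 track of chords
```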

So instrument, scale, harmony, etc. even BPM, blur, distortion, are all dynamic configs. The cool thing about this is they can be dynamically adjusted during playback like I was mentioning earlier. During playback I can modify the playback by modifying the configs (wiggling my mouse based on what command I have selected) without it changing the source input. Tho in the future, I'll want to record those actions & save it as a track where that track is played "simultaneously" just like an instrumental track, but this track is a modifier track to the other tracks (or specific tracks) that change their volume/speed/effects more precisely. This is getting into Iteration 5 tho, and we're still just going from 2 back to 1 (hum-to-text). Like I was saying before, these more fine-grain controls are timing-based, and will depend either on recorded input or a curve editor (and will also be what later lets us control the AI emotion/intensity/error/etc. in the future). But yeah, trying to not get ahead of ourselves.

Let's say I'd be able to convert audio into midi-ish (freqs, timings and overtones), would that suffice for you to continue the project to use those mappings etc?

That sounds like a good, scope-able next step. Yes. However, lol, you'll have to tell me - I assume it won't be hard for you to show me how to read the MIDI format such that I can then write an interpreter that lossily down-converts it into letters for a tweet, right? So then I can pipe hum -> midi -> text -> song, and later we can skip the text step (hum -> midi -> song), or in the future, when we have a more advanced text parser, do text -> midi -> song, right?

Yes, once it's in machine-readable format, you could easily convert it to any notation you like (so the tweet format for example).

Waahooo! So now a dumb question lol, now that you've said the obvious... I'd assume there is already a hum -> midi JS library out there? Or no, because they all assume fixed Western music theory notes as the output? Whereas what we're doing differently is having the MIDI format just encode the relative FFT changes? (To me, this seems like the obvious way anybody would approach writing such a tool, but yeah, all the online "hum to song" tools I tried were like "OK, try singing into the microphone and we'll match the exact music note!" and I'm like, dude, if I was that good at singing I wouldn't need this tool! lol)

Haven't investigated that yet. This project would attempt converting hummed audio into notes (I mean, frequency, because then you can map it onto any note notation you want), so that may already exist (we're probably not the first attempting such a thing). Or perhaps 1-36; adding octave notation, you could cover the whole human-detectable frequency range.

Speaking of making mappings, I recently discovered/learned lerp! So now I'm obsessed with lerping everything. It is used for video game logic, but can be applied to anything. D3 has a really good system for it too, I believe, called range or scale? So basically now, rather than hard-coding these "lossy" down-conversions, I'd just create a series of piped lerps: hum -> midi -> text.
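
For reference, the lerp family is tiny (a sketch; the remap/mouse example is hypothetical, just to show the piping idea):

```js
// lerp and friends: map one range onto another, then chain the mappings.
const lerp = (a, b, t) => a + (b - a) * t;        // t in [0,1] -> [a,b]
const invLerp = (a, b, v) => (v - a) / (b - a);   // [a,b] -> [0,1]
const remap = (a, b, c, d, v) => lerp(c, d, invLerp(a, b, v));

// e.g. a conductor-style control: map mouse X across the window onto BPM.
// const bpm = remap(0, window.innerWidth, 60, 180, event.clientX);
```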

I've actually written a cubic interpolation function once when writing re-sampling functions for audio tools. Basically just stacked linear interpolation. Check one of them out. Explanation of that smoothing/resample function: x1 & x2 are the 1/3 and 2/3 points between y1 and y2, following the slopes between y0-y2 and y1-y3. After that a simple bezier smoothing is applied between y1, x1, x2, y2.
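
Not the original function, but a sketch of the idea as described (essentially a Catmull-Rom-style cubic expressed as a bezier between y1 and y2):

```js
// Sketch: control points x1, x2 sit 1/3 and 2/3 of the way from y1 to y2,
// offset by the slopes through the neighbouring samples, then a cubic
// bezier is evaluated between y1, x1, x2, y2 (t runs 0..1).
function cubicSmooth(y0, y1, y2, y3, t) {
  const x1 = y1 + (y2 - y0) / 6; // slope over y0..y2, scaled to a 1/3 step
  const x2 = y2 - (y3 - y1) / 6; // slope over y1..y3, scaled to a 1/3 step
  const u = 1 - t;
  return u * u * u * y1 + 3 * u * u * t * x1 + 3 * u * t * t * x2 + t * t * t * y2;
}
```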

NIIIIICE!! I never knew about this generalizable math function, despite being a math guy and having my entire life thought of "encoding" everything between 0-1 (0-100%). Actually, that is kinda what the article I linked earlier talks about: for unskilled people, you can fake being skilled by effectively mapping every variable/parameter to a single dimension, then controlling that dimension in a way more forgiving to unskilled people, then re-composing all those layers back on top of each other. Your cubic interpolator is interesting - so it isn't quite a lerp (between 2 points), it is more like an SVG control point for what happens after the points?

One frustrating thing I have with "pen" tools is that it wants to map a curve into a control point that is both rotation and distance combined which makes it hard for me as the human to manipulate the UI correctly (especially when having to click down with mouse and stuff, which is the biggest offender, as I explain in my article). Figma is a good example where they have an option to "pull/bend" the line between 2 points, as if it was a bow. I know this can lead to asymptotes/edges on the vertexes themselves but for the sake of controlling the curve, it is much easier. Redefining the UI as bending a line on 1 point rather than managing rotation & distance of a vertex, was very helpful for my unskilled abilities.

The quick cubic explanation is that it smooths the area between points, attempting not to introduce any above-Nyquist frequency components. It still uses tension, like a bow, but enforces known points by calculating the tension points from the slope of the surrounding points.

As in your cubic is kinda like having a dampening at the end? But doesn't allow the dampening to overextend too far on the opposite side (heavily weighted)? Like if we want a ball to squish/bounce against a brick wall but we don't want the squishing of the ball itself to go "into" the wall? Oh, woah, that's actually an interesting idea - could you use it for that? Like, basically saying let the curve of x1 to x2 define ball position, but let the curve x2 to x3 define the squishing of the ball from the x1 direction? This seems like it could fix a lot of clipping issues in video games - seems like an immediately obvious yet extremely useful physics feature? Though I assume you're using your cubic in a music context, so maybe my analogy of understanding it there would be to cap the music at a certain max volume, but when it hits that limit, let it "go over" a little but smoothly, bound by some secondary limit, so it doesn't result in the annoying scratching sounds?

Upsampling is never accurate. Upsampling is always about producing a realistic result without annoying artefacts, not about accuracy. This function does not have ringing beyond 1 sample, unlike sinc-based resampling. It does go over, yes. Which may be accurate if you're in the audio context. But, we kinda drifted off-topic. Hum -> midi would be an fft + note detection. So that step would look like hum -> fft -> midi; the fft would be noted as its own component, as it's rather CPU-intensive in the project as I understand it & because tuning the fft may produce either accurate notes or bogus ones.

Is there any way we can "skip" the notes entirely? In the Western-music-theory sense, I mean - again, I'm primarily interested in simply the relative change between hums, not the note of the hum itself. So it is more encoding the line between points, rather than the location of the points themselves. If need be, the first point/hum could always be set to 0, or some base identifier/whatever.

Assuming my voice, the lowest note I can produce is about 55Hz, last I measured. At that frequency, we would need at least 1 Western musical note of spacing, i.e. one semitone, a factor of the twelfth root of 2 (~1.059), so 1.059 x 55Hz, resulting in a diff of about 3Hz. Resolving a 3Hz difference needs an FFT window of roughly 1/3 of a second, limiting the max symbol rate to about 3Hz. If we'd ditch the fundamental frequency & measure overtones, we'd need to be able to measure 110Hz, raising the limit to about a 6Hz symbol rate.
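
The same napkin math written out for reference (values rounded):

```js
// One Western semitone is a factor of 2^(1/12) ~ 1.0595.
const semitone = Math.pow(2, 1 / 12);
const f0 = 55;                                   // lowest hummed note, Hz
const spacing = f0 * (semitone - 1);             // ~3.3 Hz between adjacent steps
// Resolving a ~3 Hz difference needs an FFT window of roughly 1/3 s,
// which caps the symbol rate near 3 per second at 55 Hz...
const overtoneSpacing = 2 * f0 * (semitone - 1); // ...or ~6.5 Hz using the 110 Hz overtone
```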

Nice napkin math! Where the "symbol" rate here would be MIDI instead of mine, right? Then I could write a lerp that pipes from midi to text.

Yes, that'd be midi-like. Napkin math for the win - educated estimations. So the question is: do you want to be fast or do you want to be accurate? Initially targeting a 10Hz symbol rate?

Probably speed over accuracy, cause accuracy can always be tweaked later in the text editing (or the curve editor, when it exists). If that lets you do more realtime decoding, then all the better. If one rate is feasible for realtime on an average-market cell phone, then that seems like a good bandwidth/CPU constraint. (I like products that progressively enhance - like building a video game that is playable on old gen, but then adding code checks to see if newer perf/features are available and opting into them, rather than games/apps assuming newer tech and then failing/glitching/falling back to old polyfills. Generally speaking, my assumption is, if the game defaults to like 1000 triangles in its rendering engine, then it's not hard to later just say "oh, 1000*10 that" and crunch more triangles, because the underlying algos assumed their constraint.)

Still plenty of power then :P Extrapolating the fft, by simply overlapping the windows by ~50%, is still possible on a cellphone. That way we can achieve a 6Hz symbol rate AND 3Hz accuracy. If extrapolation is implemented, we could easily reduce CPU load on low-power machines if it's lagging, or increase/decrease accuracy.

Perfect! How hard to do that?

(one of my 7 year goals, is to build a rendering engine, that doesn't "download" anything, it just instead spends that time running physics simulations "baking" an AI at whatever resolution the device's benchmark can handle within reasonable human time expectancy, then the rendering engine composites from the ML rather than re-running the physics sims. Basically memoized physics calls mapped to more intelligent outputs on screen.)

Simply fft on a sliding window, converting the complex numbers to quadrature, then assuming a high-pass boost with its knee at the fft window limit. We can assume the fft sliding window acts as a lowpass, turning the high-pass boost into a parallel linear estimate of the current sample, so not really that hard. Napkin example: if the window is 100ms and we slide 10% each step (giving 10ms symbol time), the previous window averaged 100 and the current window averages 95, then the newest slice can be estimated at 50.
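
That napkin example in code (a sketch; `prevAvg`/`currAvg` stand for the previous and current sliding-window measurements):

```js
// If only 10% of the window is new audio each step, subtract the 90% that
// was already present in the previous window to estimate the newest slice.
function estimateNewestSlice(prevAvg, currAvg, overlap = 0.9) {
  return (currAvg - overlap * prevAvg) / (1 - overlap);
}

// estimateNewestSlice(100, 95) -> 50, matching the numbers above
```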

The smartest I got was just looking at these docs/examples: MDN, pitch, hp5. I don't actually understand the signal processing as you're explaining it, but it is nice to see/learn from.

Anyway, I'm going to bed & will let the hum -> midi pipe stew for a bit. Got to get out of bed in about 6 hours. That MDN page is a nice example though, as its examples enable hardware acceleration for the fft on devices that have it, because browsers are quite eager to use that.

Sounds great.

For anybody who read this whole thread, please join the conversation with your thoughts on it, check out the Games page, and let us know if you can help build these music tools we've discussed!

