The radio hackers in the audience will be familiar with a spectrogram display, but for the uninitiated, it’s basically a visual representation of how a range of frequencies are changing with time. Usually such a display is used to identify a clear transmission in a sea of noise, but with the right software, it’s possible to generate a signal that shows up as text or an image when viewed as a spectrogram. Musicians even occasionally use the technique to hide images in their songs. Unfortunately, the audio side of such a trick generally sounds like gibberish to human ears.
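To make the picture-in-a-spectrogram trick concrete, here is a minimal sketch using only NumPy. It is not the software any particular musician or radio operator uses. The function names, the 8 kHz sample rate, and the 500 to 3500 Hz band are all assumptions chosen for illustration. Each image row drives a sine wave at its own frequency, each column becomes a short slice of time, and a plain short-time FFT turns the audio back into a picture.

```python
import numpy as np

def image_to_audio(image, sample_rate=8000, col_seconds=0.1, f_lo=500.0, f_hi=3500.0):
    """Hypothetical helper: one sine per image row, one time slice per image column."""
    rows, cols = image.shape
    freqs = np.linspace(f_hi, f_lo, rows)      # top row maps to the highest frequency
    n = int(sample_rate * col_seconds)         # samples per column
    t = np.arange(n) / sample_rate
    chunks = []
    for c in range(cols):
        # Pixel brightness scales the amplitude of that row's tone.
        tones = image[:, c, None] * np.sin(2 * np.pi * freqs[:, None] * t)
        chunks.append(tones.sum(axis=0))
    return np.concatenate(chunks), freqs

def spectrogram(signal, frame_len):
    """Magnitude FFT of consecutive, non-overlapping, Hann-windowed frames."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    windowed = frames * np.hanning(frame_len)
    return np.abs(np.fft.rfft(windowed, axis=1)).T   # shape: (frequency bins, time)

# A tiny 4x3 test pattern: 1 = bright pixel, 0 = dark pixel.
image = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [0, 1, 0],
                  [1, 0, 1]], dtype=float)

audio, freqs = image_to_audio(image)
spec = spectrogram(audio, frame_len=800)   # 800 samples = one image column
```

Plotting `spec` with any image viewer shows the pattern again as bright cells at the chosen frequencies. Played through a speaker, though, `audio` is just a stack of beeps, which is exactly the "gibberish" problem described above.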
Or at least, it used to. Students from the University of Michigan have found a way to use diffusion models to not only create a spectrogram image for a given prompt, but to do it with audio that actually makes sense given what the image shows. So, for example, if you asked for a spectrogram of a race car, you might get an audio track that sounds like a revving engine.
The first step of the technique is easy enough: two separate pre-trained models are used, Stable Diffusion to create the image and Auffusion to produce the audio. The results are then combined via a weighted average and fed into an iterative denoising process to refine the end result. Normally the process produces a grayscale image, but as the paper explains, a third model can be brought in to produce a more visually pleasing result without impacting the audio itself.
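The weighted-average idea can be sketched in a few lines of NumPy. This is a toy, not the paper's code: the two "models" below are hypothetical stand-ins that each pull a sample toward a fixed target, and `image_weight`, `steps`, and `step_size` are made-up parameters. The only point is to show how two noise estimates can be blended at every step of an iterative denoising loop, so the result is a compromise that both models can live with.

```python
import numpy as np

def make_toy_denoiser(target):
    """Hypothetical stand-in for a pre-trained model: its 'noise estimate' is
    just the offset between the current sample and the target it prefers."""
    def predict_noise(x):
        return x - target
    return predict_noise

def joint_denoise(x, image_model, audio_model, image_weight=0.5, steps=50, step_size=0.2):
    """Blend both models' noise estimates and remove a fraction of it each step."""
    for _ in range(steps):
        eps_image = image_model(x)
        eps_audio = audio_model(x)
        # Weighted average of the two estimates (the weight is an assumption).
        eps = image_weight * eps_image + (1.0 - image_weight) * eps_audio
        x = x - step_size * eps
    return x

rng = np.random.default_rng(0)
image_target = np.ones((4, 4))     # what the toy "image model" wants to see
audio_target = np.zeros((4, 4))    # what the toy "audio model" wants to see

start = rng.normal(size=(4, 4))    # begin from pure noise
result = joint_denoise(start,
                       make_toy_denoiser(image_target),
                       make_toy_denoiser(audio_target))
```

In this toy setup, `result` settles at the weighted mix of the two targets. Shifting `image_weight` toward 1.0 favors the image side, and shifting it toward 0.0 favors the audio side.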
Ultimately, neither the visual nor audio component is perfect. But they both get close enough that you ge ..