Unfinished Business: Salience and the Pandemic

The mid-1980s was a good time to be in Paris. A highlight for me was experiencing some of the earliest concerts involving computer-aided musical improvisation: George Lewis brought his Voyager system to IRCAM and was there developing new work for a premiere in 1986. David Wessel worked up his first system for his enduring “catch and throw” paradigm; I saw that premiere too, at the Centre Pompidou.

David’s challenges playing in that concert were striking and suggestive. He was “catching” in real time a stream of MIDI representations of Roscoe Mitchell’s live saxophone playing. The MIDI stream was gated concurrently into 12 buffers, one for each note of the lowest octave of a MIDI keyboard. Keys in each octave above that would play back the corresponding buffer with different transformations and instruments.
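For readers curious about the mechanics, here is a rough Python sketch of that routing logic. It is my reconstruction of the idea, not the original code; the starting note number, the buffer format, and the function names are illustrative assumptions.

```python
# Sketch of the "catch and throw" routing described above. Illustrative only:
# note numbers, data structures, and names are assumptions, not the 1985 code.

CATCH_OCTAVE_START = 36      # hypothetical MIDI note of the lowest "catch" key
NUM_BUFFERS = 12             # one buffer per key of the lowest octave

buffers = [[] for _ in range(NUM_BUFFERS)]
recording = [False] * NUM_BUFFERS        # which buffers are currently "catching"

def on_keyboard_note(note, velocity):
    """Handle a note played on the control keyboard."""
    offset = note - CATCH_OCTAVE_START
    if 0 <= offset < NUM_BUFFERS:
        # Lowest octave: gate recording into the corresponding buffer.
        recording[offset] = velocity > 0   # note-on opens the gate, note-off closes it
        if recording[offset]:
            buffers[offset].clear()
    elif offset >= NUM_BUFFERS and velocity > 0:
        # Higher octaves: "throw" a buffer back; the octave selects the
        # transformation and instrument used for playback.
        index = offset % NUM_BUFFERS
        transform = offset // NUM_BUFFERS
        play_back(buffers[index], transform)

def on_incoming_midi(event):
    """Route the live player's pitch-to-MIDI stream into every open buffer."""
    for i in range(NUM_BUFFERS):
        if recording[i]:
            buffers[i].append(event)

def play_back(events, transform):
    """Placeholder: send the captured events to a synthesizer, transformed."""
    pass
```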

Deciding polyphonically when to record what with the left hand, while also responding musically with re-evocations of remembered prior sequences, was a nearly intractable challenge. This led David into interesting research on how many tasks people can perform at once, and also to a commitment to build musical listening “agents” that could analyze incoming musical streams at different time scales and identify, for example, what pitch space they were in.
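To make the idea of such an agent concrete, here is a toy sketch of my own (not CNMAT code): a listening agent that keeps a sliding window of recent notes and guesses which major-scale pitch collection best covers them.

```python
from collections import deque

# Toy "listening agent": maintains a sliding window of recent MIDI notes and
# reports which transposition of the major scale best covers them.
# An illustrative sketch only, far simpler than the agents David's group built.

MAJOR_SCALE = {0, 2, 4, 5, 7, 9, 11}     # pitch classes of a major scale on C

class PitchSpaceAgent:
    def __init__(self, window=32):
        self.recent = deque(maxlen=window)   # last `window` pitch classes heard

    def hear(self, midi_note):
        self.recent.append(midi_note % 12)

    def best_pitch_space(self):
        """Return the scale root (0-11) whose major collection covers the most
        recently heard notes."""
        def coverage(root):
            scale = {(root + pc) % 12 for pc in MAJOR_SCALE}
            return sum(1 for pc in self.recent if pc in scale)
        return max(range(12), key=coverage)

agent = PitchSpaceAgent()
for note in [62, 66, 69, 74, 61, 64]:    # D, F#, A, D, C#, E
    agent.hear(note)
print(agent.best_pitch_space())          # 2, i.e. a D-major collection
```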

In 1985 I was working in David’s “Small Systems” group at IRCAM on MacMix, the precursor to the DAW, and I was curious to explore how musically salient information derived from recordings could be usefully displayed. I was pedantic about it in those days: I didn’t want to display anything misleading, and I knew of David’s struggles with the unreliability of the pitch-to-MIDI converter he was using. In the early MacMix I had to settle for a time-domain waveform display that, at every time scale, only ever displayed actual samples from the original recording, i.e. using peaks rather than averages. This differentiated my waveform display from band-limited, filtered versions and from oscilloscopes, and it was handy for catching the glitches and clicks that were rather common in the early days of digital recording.
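The peak-picking reduction is simple to sketch. The following Python is an illustration under my own assumptions (NumPy arrays, one min/max pair per display column), not the MacMix code:

```python
import numpy as np

def peak_waveform(samples, pixels):
    """Reduce a recording to `pixels` columns, each showing the true min and max
    sample in its span, so every displayed value is an actual sample rather than
    a band-limited or averaged approximation."""
    samples = np.asarray(samples)
    edges = np.linspace(0, len(samples), pixels + 1, dtype=int)
    mins, maxs = [], []
    for start, stop in zip(edges[:-1], edges[1:]):
        span = samples[start:stop] if stop > start else samples[start:start + 1]
        mins.append(span.min())
        maxs.append(span.max())
    return np.array(mins), np.array(maxs)

# A single-sample click buried in low-level noise stays visible at any zoom:
signal = np.random.uniform(-0.01, 0.01, 48000)
signal[30000] = 0.9                      # a glitch
lo, hi = peak_waveform(signal, 512)
print(hi.max())                          # ~0.9: the glitch survives the reduction
```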

The difficulty of identifying perceptual onset times in audio streams meant that I didn’t trust algorithms to find good edit points. At a presentation on the early harmonizers and pitch correctors of that era, I noted that the best-sounding ones picked splice points at zero crossings where the signal was moving in the same direction. This is why MacMix had a mode that displayed a sample-level zoom for each of the two edit cursors.
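That zero-crossing heuristic is easy to state in code. Here is a sketch under my own simplifying assumptions (a mono signal, candidates restricted to crossings in one chosen direction):

```python
import numpy as np

def splice_candidates(samples, rising=True):
    """Indices where the signal crosses zero in one direction.
    Splicing two recordings at crossings moving in the same direction avoids
    the step discontinuity (click) that a mismatched splice would introduce."""
    s = np.asarray(samples)
    if rising:
        crossings = np.where((s[:-1] <= 0) & (s[1:] > 0))[0] + 1
    else:
        crossings = np.where((s[:-1] >= 0) & (s[1:] < 0))[0] + 1
    return crossings

# Splice two sine bursts together at rising zero crossings of each:
a = np.sin(np.linspace(0, 20 * np.pi, 8000))
b = np.sin(np.linspace(0, 31 * np.pi, 8000))
cut_a = splice_candidates(a)[5]
cut_b = splice_candidates(b)[3]
spliced = np.concatenate([a[:cut_a], b[cut_b:]])   # both sides rising through zero
```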

I also built in a way to enter these potentially salient splice points with clicks of the mouse while listening to a recording. This reliance on the musician’s own sense of which time points were salient influenced the musical projects made with MacMix, which was free of a strict temporal grid - a different choice from the one later DAW designers would make. You can hear an example of music made with MacMix on “Go Where” by David Wessel and Ushio Torikai (Victor VDR-1026).

At CNMAT, we developed many systems to identify and organize salient parameters from real-time sound sources. David referred to the ones for his musical activities as listening, composing, and performing assistants. These explorations included Augmented Transition Networks (ATNs), Probabilistic Graphical Models, and Explicit Duration Hidden Markov Models (EDHMMs).

Along the way we were always thinking about how these parameters could be used during musical performance. One of the most fruitful control structures evolved considerably from the simple keyboard selection mechanism David started with: organize the high-dimensional salient feature vectors into a tractable low-dimensional space that can be explored (i.e. performed) in real time. We called the mathematical formalization RBFI, radial basis function interpolation, and we developed several GUIs based on it. These interpolation tools turned out to be applicable to a surprisingly diverse range of applications: we used them to help people customize their hearing aids, and to organize pitch, harmonic, and melodic space, timbre space, and metric space.
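As a minimal sketch of the interpolation idea - not our implementation, and using one common formulation (normalized Gaussian weighting) that may differ in detail from the published RBFI work - preset parameter vectors are anchored at points in a low-dimensional control space and blended according to the cursor's distance from each anchor:

```python
import numpy as np

def rbf_interpolate(cursor, anchors, presets, width=0.3):
    """Blend high-dimensional preset vectors according to the cursor's distance
    from each anchor point in the low-dimensional control space.
    cursor:  (d,) position in the control space (e.g. d = 2 for a touch surface)
    anchors: (n, d) anchor positions, one per preset
    presets: (n, p) salient-parameter vectors to interpolate between
    """
    d2 = np.sum((anchors - cursor) ** 2, axis=1)      # squared distances
    w = np.exp(-d2 / (2 * width ** 2))                # Gaussian radial basis
    w /= w.sum()                                      # normalize to a weighting
    return w @ presets                                # blended parameter vector

# Four presets (here 8 invented synthesis parameters each) pinned to the corners
# of a unit square; dragging the cursor morphs smoothly between them.
anchors = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
presets = np.random.rand(4, 8)
print(rbf_interpolate(np.array([0.25, 0.75]), anchors, presets))
```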

In 2011 we faced an unusual challenge: to create a compelling concert experience for a group of geographically dispersed musical improvisers, using the unusually high-performance internet links that UC San Diego and UC Berkeley had access to. The audio part was relatively straightforward - a sound engineer in each location made sure local audio was reliably sent and remote audio reliably received, and also mixed for the local audience and the performers on stage. This worked, but it was unsatisfying to experience unless the performers and audiences could see the remote performers, which we addressed with a concurrent video link. With just a wide shot of the stage, though, the remote performers were not perceived as “present” in the music; we found we could solve this with a combination of wide shots and closeups. And this is where things become intractable without a large staff to manage the cameras and a live video editor to produce a good, dynamic feed that follows where the musical action is at any given moment.

The solution was to use multiple cameras, composite the sources with Max/MSP/Jitter, and build a machine learning system that arranged those sources according to musical salience - an active musician taking a solo would expect the feed to contain mostly closeups of their performance. The system used video activity measures and sound activity measures to decide what was most salient, and then RBFI to smoothly interpolate the multiple camera views in the feed.
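Here is a rough sketch of that control flow. The activity measures, their relative weighting, and the simple crossfade compositing are stand-ins I invented for illustration; the concert system's measures and its Jitter compositing were more elaborate.

```python
import numpy as np

def camera_weights(audio_activity, video_activity, sharpness=4.0):
    """Combine per-performer audio and video activity into salience scores,
    then turn them into smooth mixing weights for the corresponding camera
    feeds (soft emphasis rather than hard cuts). The 0.6/0.4 mix is arbitrary."""
    salience = 0.6 * np.asarray(audio_activity) + 0.4 * np.asarray(video_activity)
    w = np.exp(sharpness * salience)
    return w / w.sum()

def composite(frames, weights):
    """Weighted blend of camera frames (H x W x 3 arrays). Here a plain
    crossfade stands in for the interpolated layouts of the real system."""
    frames = np.asarray(frames, dtype=float)
    return np.tensordot(weights, frames, axes=1)

# Three cameras: the soloist (index 1) is loud and moving, so their closeup
# dominates the output feed while the others fade into the background.
audio = [0.1, 0.9, 0.2]
video = [0.2, 0.8, 0.1]
frames = np.random.rand(3, 120, 160, 3)
feed = composite(frames, camera_weights(audio, video))
```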

In the paper about this, Yotam Mann wrote up a clear "cooking show" example. At the presentation of the paper I used the system in real time to demonstrate how, in the context of a guitar lesson, it could automatically and smoothly move between closeups of the left hand and right hand and a wide-angle shot - according to the whims of the teacher and without requiring any camera operators.

This brings me to the unfinished business: during the 2020 pandemic, musicians have turned to videoconferencing software to try to make a living performing. Unlike many influencers, who have their hands free to operate the production controls of their own shows, musicians have to rely on other people or accept the consequences of a “set and forget” approach.

It’s clear how some good listening and watching agents could automatically help with the challenges teachers, musicians and other performing artists face. Some videoconferencing systems have limited automation features. Many are just now working on multiple camera support. I haven’t found any yet with the ambitious vision we worked on for Myra Melford and her collaborators in San Diego.

Further Exploration

Miriam Akkermann's history of "Catch and Throw": https://quod.lib.umich.edu/cgi/p/pod/dod-idx/contacts-turbulents.pdf

David's 1991 musings: http://cnmat.org/News/Wessel/InstrumentsThatLearn.html

Robert Rowe's 1993 Survey of Machine Listening: http://wp.nyu.edu/robert_rowe/text/interactive-music-systems-1993/chapter5/

Our paper on Pervasive Cameras: http://adrianfreed.com/content/pervasive-cameras-making-sense-many-angle...

MacMix:

David Wessel's on instruments that privilege improvisation:

CNMAT's machine learning work: http://adrianfreed.com/content/machine-learning-and-ai-cnmat

Cyril Drame's 1998 demonstration of musical style transfer using neural networks and additive synthesis: http://cnmat.org/~cyril/

Aaron Einbond et al. improvisation system inspired by David's work: https://openaccess.city.ac.uk/id/eprint/15424/1/

Commercial Developments

iZotope's "intelligent" assistants: www.izotope.com/en/learn/izotope-and-assistive-audio-technology.html

Descript's Overdub Voice. Style injection - of spoken voice: https://www.descript.com/lyrebird