The mid-1980s was a good time to be in Paris. A highlight for me was
experiencing some of the earliest concerts involving computer-aided musical improvisation:
George Lewis brought his Voyager system to IRCAM and was around developing new work
for a premiere in 1986. David Wessel worked up the first system for his enduring “catch and throw”
paradigm. I saw the premiere of that too, at the Centre Pompidou.
David’s challenges
playing in that concert were striking and suggestive. He was “catching” in real time a stream of MIDI
representations of Roscoe Mitchell’s live saxophone playing. The MIDI stream was gated concurrently into
12 buffers, one per note of the lower octave of a MIDI keyboard. Keys in each octave above would
play back the corresponding buffer with different transformations and instruments.
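To make the mechanics concrete, here is a minimal sketch in Python of that routing; the specific note numbers, the key-down gating behavior, and the playback stub are my illustrative assumptions, not details of the original system.

    NUM_BUFFERS = 12
    GATE_OCTAVE_START = 36          # assumed MIDI note for the lowest "catch" key

    buffers = [[] for _ in range(NUM_BUFFERS)]   # one captured MIDI stream per key
    recording = [False] * NUM_BUFFERS            # gate state per buffer

    def on_keyboard_event(note, velocity):
        """Route a key event from the control keyboard."""
        offset = note - GATE_OCTAVE_START
        if 0 <= offset < NUM_BUFFERS:
            # Lower octave: each key gates recording into its own buffer.
            recording[offset] = velocity > 0     # assume key-down opens the gate
        elif offset >= NUM_BUFFERS and velocity > 0:
            # Higher octaves: replay the corresponding buffer, with a
            # transformation chosen by how many octaves up the key is.
            play_back(buffers[offset % NUM_BUFFERS], offset // NUM_BUFFERS)

    def on_saxophone_event(note, velocity):
        """The incoming pitch-to-MIDI stream is 'caught' by every open gate."""
        for i in range(NUM_BUFFERS):
            if recording[i]:
                buffers[i].append((note, velocity))

    def play_back(buffer, transform):
        # Placeholder: a real system would schedule these events in time and
        # apply the transformation and instrument assignment.
        pass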
Deciding polyphonically when
to record what with the left hand while also responding musically with re-evocations of remembered prior sequences
was a nearly intractable challenge. This led David into interesting research on how many tasks people could
perform at once and also to a commitment to build musical listening “agents” that could analyze incoming
musical streams at different time scales and identify, for example, what pitch space they were in.
In 1985 I was working in David’s “Small Systems” group at IRCAM
on MacMix, the precursor to the DAW, and I was curious to explore
how musically salient information derived from recordings could be usefully displayed.
I was pedantic about it in those days. I didn’t want to display anything misleading - I knew of David’s
challenges with the unreliability of the pitch-to-MIDI converter he was using.
In the early MacMix I had to settle for a time-domain waveform display which, at all time scales, only
ever displayed an actual sample from the original recording, i.e. using peaks rather than averages.
This differentiated my
waveform display from band-limited, filtered versions or oscilloscopes and was handy for catching the glitches and clicks
that were rather common in the early days of digital recording.
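The underlying rule is simple to sketch: for each display column, plot the true minimum and maximum samples in the span it covers. A rough Python illustration of the idea (not the original MacMix code):

    import numpy as np

    def peak_columns(samples, width):
        """Min/max peaks for each of `width` display columns, so every
        plotted value is an actual sample from the recording rather than
        a filtered or averaged one. Assumes width <= len(samples); at
        finer zooms individual samples are drawn directly."""
        samples = np.asarray(samples)
        edges = np.linspace(0, len(samples), width + 1, dtype=int)
        return [(samples[a:b].min(), samples[a:b].max())
                for a, b in zip(edges[:-1], edges[1:])]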
The difficulties identifying
perceptual onset times in audio streams meant that I didn’t trust algorithms to
find good edit points. Seeing a presentation on the early harmonizers and pitch correctors
of that era, I noted that the best-sounding ones were picking splice points at zero crossings
where the signal was moving in the same direction. This is why MacMix has a mode where
a sample-level zoom is displayed for each of the two edit cursors.
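The rule itself is easy to state in code. Here is my sketch of the principle, finding the zero crossings where the signal moves in a chosen direction (again an illustration, not MacMix source):

    import numpy as np

    def splice_points(x, rising=True):
        """Indices where the signal crosses zero while moving in the given
        direction. Joining two regions at crossings that move the same way
        avoids the discontinuities that produce audible clicks."""
        x = np.asarray(x, dtype=float)
        if rising:
            mask = (x[:-1] < 0) & (x[1:] >= 0)
        else:
            mask = (x[:-1] > 0) & (x[1:] <= 0)
        return np.nonzero(mask)[0] + 1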
I also built in a system
so that these potentially salient splice points could be entered with clicks of the mouse while
you listened to a recording. This use of the musician’s own sense of which time points might be salient
influenced the musical projects made with MacMix, which was
free of a strict temporal grid - a different choice from the one later DAW designers would make.
You can hear an example of music made with MacMix on “Go Where” by David Wessel and Ushio Torikai,
Victor VDR-1026.
At CNMAT, we developed many systems to identify and organize salient parameters from real-time
sound sources. David referred to the ones for his musical activities
as listening, composing, and performing assistants. These explorations included
Augmented Transition Networks (ATNs), Probabilistic Graphical Models, and Explicit Duration Hidden Markov Models (EDHMMs).
Along the way we were always thinking of how these parameters could be used during musical performance.
One of the most fruitful control structures evolved considerably from
the simple selection mechanism of a keyboard that David started with. It was to organize
the high-dimensional salient feature vectors into a tractable low-dimensional space that
could be explored (i.e. performed) in real time. We called the mathematical formalization RBFI, radial basis function
interpolation, and we developed several GUIs based on it.
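In its simplest form the idea fits in a few lines: place each high-dimensional preset at an anchor point in a low-dimensional space, and as the performer moves a cursor, blend the presets with normalized radial basis weights. The sketch below uses a single global Gaussian width, where the actual RBFI tools offered richer controls, and the anchor layout and preset values are made-up examples.

    import numpy as np

    def rbf_interpolate(cursor, anchors, presets, sigma=0.2):
        """Blend high-dimensional parameter presets according to the
        cursor's distance from each anchor, using normalized Gaussian
        radial basis weights.

        cursor  : (2,) position being performed in real time
        anchors : (n, 2) preset locations in the low-dimensional space
        presets : (n, d) one d-dimensional parameter vector per anchor
        """
        d2 = np.sum((np.asarray(anchors, float) - np.asarray(cursor, float)) ** 2, axis=1)
        w = np.exp(-d2 / (2 * sigma ** 2))
        w /= w.sum()                      # weights form a partition of unity
        return w @ np.asarray(presets, float)

    # Example: a 2-D touch surface blending three synthesis presets,
    # here hypothetically (pitch in Hz, brightness).
    anchors = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
    presets = [[440.0, 0.1], [220.0, 0.8], [880.0, 0.4]]
    params = rbf_interpolate((0.4, 0.3), anchors, presets)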
These interpolation tools turned out to be useful in a surprisingly diverse range of applications. We
used them to help people customize their hearing aids and to organize pitch, harmonic, and melodic spaces, timbre space, and metric space.
In 2011 we were faced with an unusual challenge: to create a compelling concert experience for a group of
musical improvisers who were geographically dispersed, using access that UC San Diego and UC Berkeley had to
unusually high-performance internet links. The audio part was relatively straightforward - a sound engineer
in each location made sure they were reliably sending local audio and receiving remote audio. They also mixed for the local
audience and performers on stage. This worked but was unsatisfying to experience unless the performers and audiences
could see the remote performers. This was solved with a concurrent video link. With just a wide shot of the stage
the remote performers were not perceived as “present” in the music. We found we could solve this by using a combination of
wide shots and closeups. And this is where things became intractable without a huge staff to manage the cameras
and a live video editor to produce a good, dynamic feed according to where the musical action was at any given moment.
The solution was to use multiple cameras, composite the sources with Max/MSP/Jitter, and build a machine learning system that
arranged the sources according to musical salience - active musicians taking solos would expect to see the feed contain
mostly closeups of their performance. The system used video activity measures and sound activity measures to decide what
was most salient and then RBFI to smoothly interpolate among the multiple camera views in the feed.
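A toy version of the salience logic might look like the following, with the caveat that the real system ran in Max/MSP/Jitter and interpolated the layout of shots rather than simply cross-fading whole frames; the specific activity measures and weighting here are my assumptions.

    import numpy as np

    def mix_cameras(frames, prev_frames, audio_levels, alpha=0.5):
        """Weight each camera by a blend of its audio activity and a crude
        visual motion measure (frame differencing), then composite the
        feeds by those salience weights."""
        motion = np.array([np.abs(f.astype(float) - p.astype(float)).mean()
                           for f, p in zip(frames, prev_frames)])
        audio = np.asarray(audio_levels, dtype=float)
        salience = (alpha * audio / (audio.sum() + 1e-9)
                    + (1 - alpha) * motion / (motion.sum() + 1e-9))
        w = salience / salience.sum()
        out = sum(wi * f.astype(float) for wi, f in zip(w, frames))
        return out.astype(np.uint8)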
In the paper about this, Yotam Mann wrote up a clear “cooking show” example. At the presentation of the paper I used the system
in real time to demonstrate how, in the context of a guitar lesson, the system could automatically and smoothly move between closeups of the left hand,
the right hand, and a wide-angle shot - according to the whims of the teacher and without requiring any camera operators.
This brings me to the unfinished business: during the 2020 pandemic, musicians have turned to trying to make a living performing
over videoconferencing software. Unlike many influencers who have their hands free to operate production controls
for their own shows, musicians have to rely on other people or accept the consequences of a “set and forget” approach.
It’s clear how some good listening and watching agents could automatically help with the challenges teachers, musicians and other
performing artists face. Some videoconferencing systems have limited automation features. Many are only now working on multiple-camera
support. I haven’t found any yet with the ambitious vision we worked on for Myra Melford and her collaborators in San Diego.