This paper is intended to illustrate some of the ways in which artificial intelligence has been able to explain aspects of the nature of our musical experience. It will start by posing the question "What is it that happens when we experience a piece of music?", and will describe the ways in which research that has been carried out over the last thirty years can successfully explain why we experience what we experience in listening to music. However, this success is relative, and the paper will conclude by examining some of the challenges that music poses for artificial intelligence.
What is it that happens when we experience a piece of music? To make the question rather more concrete, we shall examine the notional experience of part of a specific and simple piece of Western tonal music, the first eight bars of El Noy de la Mare, a Catalan folk song arranged by Miguel Llobet.
An answer was hinted at by the Gestalt psychologists earlier in this century. They were trying to account for the perception of visual shape, but their theories seemed also to apply to melody, to auditory shape. They suggested that certain laws seemed to underlie our perception of form, laws of proximity, similarity, good continuation and so on. These laws governed our tendencies to group elements in a visual field so as to constitute shape or form at a larger scale than the individual elements of the pattern (so, for example, if two sub-elements in an array of elements are closer to each other than they are to a third, they are likely to be experienced as visually "grouped" by the "law of proximity").
More recently, Al Bregman in Canada (see Bregman, 1990) has developed a theory in which he suggests that these laws are heuristics or best guesses that we employ in parsing or making sense of our auditory environment. He refers to the processes whereby we make sense of the world of sound as Auditory Scene Analysis, a non-conscious process of guessing about "what's making the noise out there", but guessing in a way that fits consistently with the facts of the world. Auditory Scene Analysis processes operate on sound signals, employing principles that enable the making of valid inferences about the existence and the character of the sources of sounds in the real world, principles that are rarely if ever breached in nature and are highly generalisable.
For example, if a sound reaches a listener's ears and is slightly less intense in the left ear than in the right, they can immediately infer that the source of the sound - the thing that's making the noise - is to their right. If that sound has a particular pitch, they will probably infer that any other sounds made by that sound source will be similar in pitch to the first sound, as well as similar in intensity, waveform, etc., and further infer that any sounds similar to the first are likely to come from the same location as the first sound. From these types of inferences, something very like the operation of the laws of similarity, proximity and good continuation emerges, and Auditory Scene Analysis principles can be used to explain why we experience the sequence of pitches in El Noy de la Mare as a melody, pitch moving in time. Consecutive pitches in this melody are very close to each other in pitch-space, so on hearing the second pitch a listener's Auditory Scene Analysis inference mechanisms will assign it to the same source as the first pitch.
If the distance in pitch space had been large, they might have inferred that a second sound source existed, even though they knew that it was the same instrument making the sound - this inferred sound source would be a virtual rather than a real source.
Hence a pattern such as shown in Figure 4, where successive notes are separated by large pitch jumps but alternate notes are close together in pitch, is probably heard as two separate and simultaneous melodies rather than one melody leaping around. This tendency to group together, to linearise, pitches that are close together in pitch-space and in time provides us with the basis for hearing a melody as a shape, as pitch moving in time, emanating from a single - real or virtual - source.
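The proximity heuristic described above can be illustrated with a minimal sketch. It assumes that notes are given as MIDI note numbers, and it uses a hypothetical threshold, max_jump, for the largest interval that is still assigned to an existing stream. Neither the function name nor the threshold value comes from Bregman's work, and the sketch ignores the role of time (tempo) and timbre in real stream segregation.

```python
def segregate_streams(pitches, max_jump=4):
    """Greedily assign each note to the stream whose last pitch is nearest.

    pitches:  MIDI note numbers in order of onset.
    max_jump: largest interval (in semitones) still treated as the same
              line; an illustrative threshold, not a measured value.
    """
    streams = []
    for p in pitches:
        # Find the existing stream whose most recent note is closest in pitch.
        best = min(streams, key=lambda s: abs(s[-1] - p), default=None)
        if best is not None and abs(best[-1] - p) <= max_jump:
            best.append(p)
        else:
            # Too far from every stream: infer a new (virtual) source.
            streams.append([p])
    return streams


# Large jumps between successive notes, small steps between alternate notes.
alternating = [60, 72, 62, 74, 64, 76, 62, 74]
# A stepwise melody.
stepwise = [60, 62, 64, 65, 67, 65, 64, 62]

print(segregate_streams(alternating))  # [[60, 62, 64, 62], [72, 74, 76, 74]]
print(segregate_streams(stepwise))     # [[60, 62, 64, 65, 67, 65, 64, 62]]
```

Under these assumptions the leaping pattern is split into two lines while the stepwise pattern stays as a single line, which is the behaviour the text attributes to listeners.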
Many different musics of the world exploit these Auditory Scene Analysis processes. J. S. Bach used them frequently to conjure up the impression of compound, seemingly simultaneous, melodies even though only one single stream of notes is presented. For example, the pattern given in Figure 5 (from the Courante of Bach's First 'Cello Suite) can be performed on guitar on one string, yet at least two concurrent pitch patterns or streams will be heard - two auditory streams will be segregated (to use Bregman's terminology).
However, this process of stream segregation can be reversed; streams can not only segregate, they may also fuse. A sound example of this can be found on Bregman's own Auditory Scene Analysis Demonstration CD in which two Bugandan xylophones are being played. In this music each player performs something that is identifiable as a melody, but in the actual performance the melodies are interleaved so that what is heard is an emergent shape from the interaction of both melodies. Figure 6(a) shows the sequential pattern of keystrokes that are performed on each of the two xylophones or amandindas while Figures 6(b) and 6(c) give the resulting pitch patterns produced by each separate instrument. Having first heard either amandinda A - 6(b) - or amandinda B - 6(c), one might expect to hear something like the pattern shown in Figure 6(d). However, it is much more likely that a listener will experience something like the pattern shown in Figure 6(e), with "resultant" pitch patterns arising by virtue of pitch proximity. It is virtually impossible to hear either of the two separate parts when both are playing together, interleaved.
To recap: Auditory Scene Analysis - processes that apply in general auditory cognition - can help explain a significant aspect of our experience of our original piece, our perception of a succession of pitches as a connected line, as pitch moving in time. But a listener does not just hear a vague, unarticulated shape, a figure or contour (though it's arguable that very young infants may experience a melody in just this way); they hear that shape unfolding within a predictable time framework, they hear some pitches in the melody as more important than others, they hear repetitions of temporal patterns within the melody. What principles govern our capacities to experience musical time?
So, on hearing the first two events of El Noy de la Mare, a listener would predict that the next event will occur as far after the second note as the second is after the first. In fact, this doesn't happen; the third event happens half as far after the second note as the second is after the first. This renders the initial hypothesis somewhat provisional, but sticking with it an event occurs which fits with the original prediction - the fourth event of the piece occurs as far after the second note as the second is after the first. Indeed the third event, falling halfway between the second and fourth events, fits well into this framework, as a lower-level temporal framework that subdivides the original best guess. As El Noy de la Mare unfolds there is very little that conflicts with the initial hypothesis about the appropriate unit time interval to employ in setting up our temporal framework, the initial estimate of an appropriate beat or tactus for the piece.
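The reasoning in this paragraph can be sketched as a simple hypothesis test. The sketch below assumes onset times in arbitrary units and takes the first inter-onset interval as the provisional beat; the function name, the tolerance and the three labels are hypothetical choices made for illustration, not part of any published model.

```python
def check_beat_hypothesis(onsets, tol=0.05):
    """Take the first inter-onset interval as the beat and test each onset.

    Returns (beat, labels), where each label says how an onset fits:
    'beat'        - falls on a predicted beat,
    'subdivision' - falls halfway between two predicted beats,
    'conflict'    - fits neither level.
    """
    beat = onsets[1] - onsets[0]
    labels = []
    for t in onsets:
        position = (t - onsets[0]) / beat  # position in units of the beat
        on_beat = abs(position - round(position)) <= tol
        on_half = abs(2 * position - round(2 * position)) <= tol
        if on_beat:
            labels.append("beat")
        elif on_half:
            labels.append("subdivision")
        else:
            labels.append("conflict")
    return beat, labels


# Four events laid out as in the text: the third falls halfway between
# the second and the fourth.
print(check_beat_hypothesis([0.0, 1.0, 1.5, 2.0]))
# (1.0, ['beat', 'beat', 'subdivision', 'beat'])
```

No onset is labelled as a conflict here, so the initial estimate of the beat survives, and the third event is absorbed as a lower-level subdivision.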
Having abstracted the basic beat, beats are grouped over larger time-scales, creating a multilevelled temporal framework that is at least partially dependent on whether events stay the same or change radically. It is also dependent on whether or not the exact rhythmic structure, the exact pattern of time intervals between the onset of notes, is repeated, and it can be seen that at the opening of El Noy de la Mare, a specific rhythmic pattern is repeated. This provides the listener with a first estimate of a higher level - larger time-scale - temporal framework, or metrical framework for the piece, a means of working out not just when events are likely to occur in the piece but where in time particularly important events are likely to happen. In fact, throughout the rest of this piece, the rhythmic and metrical structure set up in the first bar remains wholly applicable; no events occur that conflict enough to lead a listener to revise their hypotheses about an appropriate beat and metre.
Had the piece started with a short duration between the first and second notes, and then a longer duration between the second and third notes, it's likely that we would have started our metrical and beat frameworks not on the first note but on the second; we would immediately have revised our initial hypothesis that the interval between the first and second notes should function as the beat interval. That such processes are non-conscious and automatic is evident in the fact that at the beginning of a performance of Beethoven's Fifth Symphony, for example, the conductor doesn't turn to the audience and say "To make sense of this piece you will need to know that it's in 2/4 time and starts with an upbeat". He doesn't need to do this because the audience will work it out for itself (even if its members don't know that that's what they're doing). Processes of Dynamic Attending as outlined by Jones would seem to be almost as universal, generalisable and automatic as Bregman's processes of Auditory Scene Analysis. But, as will be seen, there may be exceptions and limits to their generalisability.
This has changed the identity of the melody and produced quite a different piece. The element that has not been considered so far is harmony, the effect of notes sounding at the same time as other notes. To explain the experience of the original piece, one must consider what are the factors that allow some collections of notes to be heard as sounding well together and some as not sounding well together. It is also necessary to consider how those factors are operational in the experience of this particular piece. One explanation of harmony has been around for a very long time, perhaps more than 2,000 years: an explanation based on the properties of the harmonic series.
When a listener hears a single musical note, it's almost always the case that what is reaching our ears is not just one sound but a multiplicity of sounds that we experience as one thing, as one pitch. When an open string on a guitar is plucked a listener will generally hear one note. If the listener is then presented with, for example, the fifth harmonic of the original note (produced by preventing the string from vibrating at a point 2/5ths of the way along its length and plucking it), they will also hear one note. But then, if the first note is played again, it is likely that the listener will hear not one but at least two different pitches, even though what they are presented with is exactly the same as the initial, open string. The difference is that by playing the second pitch, a harmonic of the first, the listener's attention has been focused on something that was there all the time - the fact that the other pitch is in there with the first sound, is part of the same complex sound.
A simple explanation for this phenomenon is that when a string is plucked it vibrates in many different ways all at the same time. Each of these different modes of vibration contributes a different sound, a different frequency, to the overall sound of the note. These different frequencies of vibration are all related in a lawful way to one another. There's almost always a lowest component of the sound, the fundamental frequency, and higher partial components, harmonics. The frequencies - the rate at which the components vibrate - are simply integer-related, so that the second harmonic component vibrates at twice the frequency of the fundamental frequency, the third at three times the frequency, the fourth at four times and so on. When the harmonics of a complex tone are played one after another, the pitches that they individually produce appear as in Figure 10(a).
When all these harmonics sound together one generally hears a rich tone that has the same pitch as the pitch that would be experienced if one were to hear the fundamental by itself (see Figure 10b). So the pitch identity of a complex harmonic tone is pretty much the same as the pitch identity of the fundamental - a complex tone has a unitary identity in perception. Even when a group of harmonically related partials is heard and the fundamental is not present (the "missing-fundamental" phenomenon), there is still a tendency to hear the set of harmonics as having a single pitch - a unitary perceptual identity - that corresponds to the pitch of the fundamental (see Figure 10c), even though the fundamental is not physically present in the sound.
Now, looking back at the first chord of the piece it can be seen that the individual pitches of the chord seem to occur within a harmonic series based on the lowest note of the chord (see Figure 10d), and perhaps that is why the notes of the chord fit well together. This explanation for harmony - that the chords that appear "good" to us fit within a harmonic series - has certainly been around in Western music theory for at least three hundred years, and definitely appears plausible. The trouble is that it doesn't quite work!
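The integer relationship between harmonics, and the way a set of partials can imply an absent fundamental, can be shown with a short sketch. It makes an idealising assumption: partial frequencies are exact integers in Hz, so that the implied fundamental can be computed as their greatest common divisor. Pitch perception is not literally a divisor calculation, and the function names and the 110 Hz example are chosen purely for illustration.

```python
from functools import reduce
from math import gcd


def harmonics(fundamental_hz, count):
    """Frequencies of the first `count` harmonics: f, 2f, 3f, ..."""
    return [n * fundamental_hz for n in range(1, count + 1)]


def implied_fundamental(partials_hz):
    """Largest frequency of which every partial is an integer multiple.

    Idealisation: assumes exact integer frequencies in Hz.
    """
    return reduce(gcd, partials_hz)


series = harmonics(110, 6)
print(series)                           # [110, 220, 330, 440, 550, 660]

# Remove the fundamental: the remaining partials still imply 110 Hz.
print(implied_fundamental(series[1:]))  # 110
```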
The reason for this is that the tuning of most Western instruments doesn't allow the production of frequencies that are related by simple integers; the tuning is not intended to facilitate the production of pitches that conform exactly to the harmonic series. If one produces the fifth harmonic of the bottom string on a guitar, and then plays the "same" note on the top string of the guitar, it is actually a slightly different pitch. Sounded together, the two notes are quite dissonant - they don't fit together at all. The reason is that for the last four hundred years most fretted instruments (and for the last 150 years, most keyboard instruments) have been tuned using a system called Equal Temperament, where the pitches don't quite fit with the harmonic series. There are complex historical, theoretical and structural reasons for this choice of tuning which are outlined in detail in the entries on Tuning and Temperament in Groves Dictionary of Music (1980) and in Lindley (1984). But if Equal Temperament is taken as a given, a listener is likely to notice that despite the pitches of chords being slightly out of tune with the "ideal" harmonic series, they still sound more-or-less acceptable. The reason for this is that our ears can accept a degree of approximateness in tuning; we appear to differentiate between pitches on the basis of the categories that they fall into rather than on the basis of the absolute frequencies that make them up (see Burns, 1999). An example might help make this more clear.
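The size of the mismatch in the guitar example can be worked out directly. The sketch below assumes standard guitar tuning, in which the "same" note on the top string lies 28 equal-tempered semitones (two octaves and a major third) above the open bottom string; that assumption, and the helper name cents, are not taken from the text.

```python
from math import log2


def cents(ratio):
    """Size of a frequency ratio in cents (1200 cents per octave)."""
    return 1200 * log2(ratio)


just_ratio = 5.0                 # fifth harmonic: five times the fundamental
tempered_ratio = 2 ** (28 / 12)  # 28 equal-tempered semitones (assumed)

difference = cents(tempered_ratio) - cents(just_ratio)
print(round(difference, 2))      # 13.69
```

On these assumptions the equal-tempered note is roughly 14 cents, about a seventh of a semitone, sharper than the harmonic.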
We can think of a pentatonic melody as a melody played on the black notes of the piano. The black notes of the piano have different numbers of white notes intervening between them - there is only one white note between some pairs of black notes, but there are two between others. This means that a step from one black note to the next can be a small step, or a large step. And it seems reasonable to suppose that when a listener hears a pentatonic melody, part of the identity of the melody is the pattern of successive small and large steps.
If the black notes on the piano were to be re-tuned using five equal-sized steps to the octave so that the size of the interval between each black note was the same (a tuning similar to the Bugandan xylophone music in Bregman's example), and the melody were to be played on the re-tuned black notes, it is likely that a listener accustomed to Western music and Western tunings would hear a version of the "original" melody, with some big steps and some small steps. This would be in spite of the fact that all steps between adjacent notes were now the same size. The same sort of "approximate fit" works for the perception of harmony, but our tolerance here is much smaller. The important point to take from this is that we can still extract, from a group of nearly-but-not-quite harmonically related tones such as one would find in a chord in equal temperament on a guitar, something like the unitary qualities possessed by a group of tones that are harmonically related.
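The step sizes involved in the re-tuning can be made explicit with a little arithmetic. The sketch below assumes the black notes are taken in the order C#, D#, F#, G#, A# and back to C#, with equal-tempered semitones of 100 cents; the variable names are illustrative.

```python
# Semitones between successive black notes: C#-D#, D#-F#, F#-G#, G#-A#, A#-C#.
black_key_steps = [2, 3, 2, 2, 3]

western_cents = [100 * s for s in black_key_steps]  # small and large steps
equal_cents = [1200 / 5] * 5                        # five equal steps per octave

# How far each re-tuned step lies from the Western step it replaces.
deviation = [e - w for e, w in zip(equal_cents, western_cents)]

print(western_cents)  # [200, 300, 200, 200, 300]
print(equal_cents)    # [240.0, 240.0, 240.0, 240.0, 240.0]
print(deviation)      # [40.0, -60.0, 40.0, 40.0, -60.0]
```

Every re-tuned step sits between the two Western step sizes, which is consistent with the suggestion that a listener can assimilate each one to the step category that the melody leads them to expect.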
Recent computational theories have used this fact to explain how it is that we experience harmony in music, how it is that we experience groups of tones as sounding more, or less, well when they sound together. Based on the theories of Ernst Terhardt (see Terhardt, Stoll and Seewan, 1982), Richard Parncutt (1989) has put together a computational model that seeks to explain our perception of chords using the same sorts of principles as can be used to explain our perception of single complex tones. Parncutt's theory exploits the "missing-fundamental" phenomenon within a computational theory that models the results of general auditory processes. It suggests that, just as when a group of harmonically related partials is presented without a fundamental we hear them as having a pitch corresponding to the fundamental, when a group of complex tones that are nearly harmonically related are sounded as a chord, we can experience that chord as having a more, or less, clear and unitary identity. That is, the better a chord fits with the harmonic series, the more unambiguous its unitary perceptual identity is - we can call the unitary identity of a chord its root, which would be to the chord what the fundamental is to the complex tone.
A chord whose constituents correspond pretty well to the harmonic series - like the one that starts the piece - is likely to be experienced as having a fairly unambiguous root, an unambiguous unitary identity. So it's likely to be experienced as complete, at rest, as being somehow stable. The more complicated a chord is, the further that it deviates from harmonicity, the more likely it is that its identity will be ambiguous; it will have several possible roots and will be experienced as incomplete, somehow unstable. This fits very well with the ways in which chords are used in tonal music. The chord in Figure 12(a), for instance, seems unstable; it has a number of possible roots, its identity is unresolved, and it seems to require to be followed by something stable. On the other hand, the chord in Figure 12(b) has an unambiguous root - it is stable, complete and final.
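The idea that a chord can have one clear root or several competing ones can be illustrated with a much simplified sketch. It is not Parncutt's model: it works on pitch classes (0 to 11) rather than on spectra, the root-support weights are illustrative assumptions, and the diminished seventh chord is a hypothetical stand-in for an unstable chord, not necessarily the chord shown in Figure 12(a).

```python
# Illustrative root-support weights (assumed values): interval in semitones
# above a candidate root -> how strongly a chord note at that interval
# supports the candidate, by analogy with low members of a harmonic series.
ROOT_SUPPORT = {0: 10, 7: 5, 4: 3, 10: 2, 2: 1}


def root_scores(chord_pitch_classes):
    """Score all 12 pitch classes as candidate roots of the chord."""
    return {
        root: sum(ROOT_SUPPORT.get((pc - root) % 12, 0)
                  for pc in chord_pitch_classes)
        for root in range(12)
    }


def root_candidates(chord_pitch_classes):
    """Pitch classes that share the highest root score."""
    scores = root_scores(chord_pitch_classes)
    best = max(scores.values())
    return [root for root, score in scores.items() if score == best]


major_triad = [0, 4, 7]            # e.g. C, E, G
diminished_seventh = [0, 3, 6, 9]  # e.g. C, E flat, G flat, A

print(root_candidates(major_triad))         # [0]
print(root_candidates(diminished_seventh))  # [0, 2, 3, 5, 6, 8, 9, 11]
```

With these assumed weights the major triad has a single best candidate, whereas the diminished seventh has eight equally good ones, which mirrors the contrast between a stable chord and an ambiguous, unstable one.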
These potentials of chords - their stability or instability - are exploited in the patterning in time of music. A stable chord will be followed by, and unbalanced by, an unstable chord, and the unstable chord will require to be followed by a stable chord to give a sense of completion - the patterning of chords can contribute a sense of directed movement, of dynamic structure, to the experience of the unfolding of a piece of music.
Parncutt's theory is largely a theory at the level of "raw" and untutored perception; it concerns the nature of immediate perceptions and relies on the operation of perceptual processes that may be more or less innate. But most listeners have been exposed to many, many pieces of music, the vast majority of which have been in a style similar to the piece that has been used as an example here - tonal music. Another set of theories about our perceptions of musical pitch organisation is based largely on the consequences of our long-term exposure to tonal music, together with close examination of some of the structural characteristics of the collections of pitches (such as scales) that tonal music employs. There is not space here to expand on the role of these structural characteristics in our musical perceptions, but the collections of pitches that are used in tonal music - scales, modes, chords and keys - when considered in mathematical terms within computational representations have been shown to exhibit some extraordinary properties that suit them excellently for use in the outlining of pattern in time (Balzano, 1980; Browne, 1980).
Those theories that focus on the consequences of listeners being exposed to music that exhibits consistent structural features - to tonal music - suggest that as a result of that long-term exposure they abstract and schematicise in their long-term memories mental representations that embody the regularities of pitch usage displayed by tonal music at several levels.
In the first two bars of our piece (see Figure 1), although the harmonies change one note stays constant through these bars, almost as a drone - a pedal point, in music theoretic terms. As the piece unfolds, a listener registers (unconsciously) that this pitch is sounding for longer than any other pitch in the piece so far, and it becomes marked as more important than the other pitches - as indeed it is, for it is the tonic - the key-note - of the entire piece, the note on which it will eventually end, the most stable note in the piece.
As the piece unfolds a listener will also abstract information about all the other pitches that are used, registering their frequency of occurrence and their total sounding duration. Some are used more than others; some potential notes are not used at all. And so a listener abstracts from this piece a hierarchy of importance of the different notes based on how long they sound for (and also on the places in which they sound - if they occur on strong beats, at expected temporal locations, they are likely to be registered as more important than notes that occur on weak beats); thus in the course of listening to the piece a listener forms a mental representation of the different degrees of importance of the notes in the piece, an event hierarchy. The more pieces in a similar style to which the listener is exposed, the more event hierarchies they abstract, and these become integrated into a representation in long-term memory of a tonal hierarchy that reflects the regularities of pitch usage across all the pieces to which they have been exposed. Researchers who have examined these processes over the last twenty years - Carol Krumhansl (1990), Jay Bharucha (1999), David Butler (Brown, Butler and Jones, 1994) and Lola Cuddy (1997) - have shown that the processes involved in abstracting a tonal hierarchical representation of pitch, together with the processes that operate on the particular and unique structural properties of the sets of pitches that are used in tonal music, account very well for the ways in which people experience melodic and harmonic structure in tonal music (a recent overview is given in Cross, 1997).
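The abstraction of an event hierarchy described above can be sketched as a weighted tally. The sketch assumes that each note is reduced to a pitch-class name, a duration and a flag saying whether it falls on a strong beat; the bonus given to strong beats and the example notes are hypothetical, and are not a transcription of El Noy de la Mare.

```python
from collections import defaultdict


def event_hierarchy(notes, strong_beat_bonus=1.5):
    """Rank pitch classes by total sounding duration.

    notes: iterable of (pitch_class, duration, on_strong_beat).
    strong_beat_bonus: assumed multiplier for notes on strong beats.
    """
    weight = defaultdict(float)
    for pitch_class, duration, on_strong_beat in notes:
        weight[pitch_class] += duration * (strong_beat_bonus if on_strong_beat else 1.0)
    # Most important pitch class first.
    return sorted(weight.items(), key=lambda item: item[1], reverse=True)


# Hypothetical opening: one pitch class sounds longest, like a pedal point.
opening = [
    ("D", 2.0, True),
    ("F#", 1.0, False),
    ("A", 1.0, True),
    ("D", 2.0, True),
    ("E", 0.5, False),
]

print(event_hierarchy(opening))
# [('D', 6.0), ('A', 1.5), ('F#', 1.0), ('E', 0.5)]
```

A tonal hierarchy in long-term memory could then be pictured as some combination of many such tallies across pieces, although how that integration happens is a matter for the theories cited in the text rather than for this sketch.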
Lerdahl and Jackendoff's theory postulates four different "components of musical intuition", four different but inter-related domains of, or perspectives on, what is experienced when a listener encounters a piece of tonal music: grouping structure, metrical structure, time-span structure and prolongational structure. Each of these four different domains is constituted of a set of well-formedness rules that define legitimate structures; because more than one legitimate structure might be assignable to a single piece of music, Lerdahl and Jackendoff also incorporate sets of preference rules within each domain that select out the "preferred" structural interpretation.
Their grouping structure derives from the types of principles that Bregman employs in his theory of Auditory Scene Analysis; it represents the results of a listener's responses to attributes of the musical surface, the "pure sound" of the music (the succession of pitches and rhythms that are presented to their conscious awareness - see Jackendoff, 1987), and depicts the ways in which the music is partitioned into groups as it unfolds in time, with relations between elements within a group being experienced as more immediate, more direct, than relations between elements that fall within different groups. The types of cues that tell a listener when a group boundary is occurring are largely concerned with degree of change; this can be in any area, harmony, melodic movement, rhythmic structure, etc. Grouping also relies on phenomena such as repetition. As can be seen in Figure 13, in the first bar of the piece the end of the first group is probably signalled by the change in harmony; the size of the second group is determined by the next change of harmony but also by the fact that the rhythmic structure of the first group is repeated.
The grouping structure of the piece is indicated by the square brackets above the music; notice how groups are formed into larger groups, and how the pattern is generally symmetrical with one exception. This occurs because a melody note that could act to end a short group is here incorporated into the subsequent group, on the grounds that (a) the time interval between it and the next melody note is shorter than the time interval between it and the note that preceded it, and (b) that the note fits better with the harmony of the chord that follows it than with that which precedes it. In fact, the grouping here is a question of preference, perhaps even of musical interpretation - the music "works" even if the note is grouped with its predecessors.
Metrical structure results from the application of the types of processes that have already been discussed above in respect of metrical inference; note durations, note intensities, are registered and compared as the piece progresses and each timepoint in the piece at which an event occurs is assigned a structural weight, giving rise to a simple periodic structure in the experience of the piece, a metrical frame, which is represented by the dots beneath the music.
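The dot notation for a metrical frame can be sketched as a count of the metrical levels on which each timepoint falls. The sketch assumes a simple duple hierarchy with level periods of 1, 2 and 4 time units; these periods are hypothetical and are not claimed to be the metre of El Noy de la Mare.

```python
def metrical_grid(n_positions, periods=(1, 2, 4)):
    """Structural weight of each timepoint: the number of levels it lies on.

    periods: assumed spacing of each metrical level, in smallest time units.
    """
    return [sum(1 for p in periods if i % p == 0) for i in range(n_positions)]


grid = metrical_grid(8)
print(grid)  # [3, 1, 2, 1, 3, 1, 2, 1]

# Draw the grid as rows of dots, one row per metrical level.
for level in range(1, max(grid) + 1):
    print(" ".join("." if weight >= level else " " for weight in grid))
```

Timepoints with more dots carry more structural weight, so the pattern of weights repeats periodically in the way the text describes.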
Time-span structure is rather more abstract; it derives from the grouping and metrical structures but also draws on the representations of tonal pitch usage in a listener's long-term memory in working out which are the most important events in each group or timespan, and how they are connected with each other across the groups.
The derivation of timespan structure involves something analogous to the operation of a tonal grammar (see Figure 14), a process of interpreting - re-writing - the elements of the musical surface as tonally-functional entities, as abstract harmonies.
The processes whereby this tonal grammar is applied employ the schematicised tonal hierarchy in long-term memory to identify the chords of a piece as it progresses and to work out what function they are fulfilling in the piece. So the first event of the piece is the most important event in the first group as it is a chord on the tonic of the key of the piece (see Figure 15).
The most important event of the second group is a chord on the dominant of the key, the second most important chord of the key, so it is less important overall than is the first event. The branchings of the timespan tree structure represent how important each event is in the experience of the piece, and it can be seen that the very first chord is the most important in the entire first half of the piece, in which the second most important event is the last chord of the first half. The most important event in the whole piece is the last - a chord on the tonic - and this is symbolised by having the most direct connection - a straight line - to the highest level of the time-span structure.
Prolongational structure shows how the events depicted in the timespan structure flow as the piece unfolds.
It depicts the ways in which chords that constitute stable harmonies give way to more unstable chords as the piece progresses, and the way in which the unstable chords eventually resolve to a stable, final chord, all within one overarching framework or key. It is clear in Figure 16 that the first chord of the piece connects at the highest level to the final chord, and all the events intervening between the two chords can be thought of as elaborating a pattern of increasing and then decreasing harmonic instability - can be thought of as increasing and then decreasing the harmonic tension. The prolongational structure represents the flow - the increase and subsequent decrease - of harmonic tension that is likely to be experienced as the piece unfolds. The prolongational structure has some similarities with the timespan structure, but is much less symmetrical, showing how the tension increases up to the least stable chord in the piece (see Figure 16) which occurs about three-quarters of the way through the piece and then rapidly unwinds through a series of increasingly stable chords to the close.
For example, using these theories one can account for the experience of almost any tonal piece. But theories that account for our experience of pieces that are not tonal (such as those of Schoenberg, Boulez or Reich) are thin on the ground (though there are a few exceptions, such as Lerdahl (1989)). We can account for some aspects of the experience of such pieces using auditory scene analysis - hearing different "virtual sources" as lines, for instance - but current theories of musical cognition don't enable us to go much further.
This is probably because they have been developed and tested almost entirely on tonal music - the dominant music of Western culture. And culture is something about which theories in the domain of artificial intelligence - indeed, theories within cognitive science - have not had much to say until very recently. Perhaps the point can be made best by turning to a piece of music from an entirely different culture, that of the Bolivian campesinos of Northern Potosí, who live in the high Andes and whose culture seems to retain many of the characteristics of pre-Hispanic - Inca - ways of interpreting and experiencing the world.
I have come to know the music of this culture a little through working with an ethnomusicologist, Henry Stobart, who has spent a great deal of time immersed in it; much of what I'm able to suggest about that music and of our experience of it comes from working with Henry and relies greatly on his knowledge (a brief account of the following can be found in Stobart and Cross, 1994).
Superficially it may seem to be a very simple musical culture, but there's considerable complexity just beneath the surface. To give an example, a matrimonios (wedding song) as shown in Figure 17 appears fairly simple (although with something of a rhythmical twist). If one tried to clap along with it on hearing it, it is very likely that one would clap at the points shown by the crosses in Figure 17(a).
That is, one would probably interpret the two short events that start the piece as occurring on an upbeat before the first strong beat of the piece which occurs on the first longest note; at least, that's what one would be doing if one were perceiving it in accord with the types of principles discussed above in respect of deriving a metrical framework. However, for the campesinos of Northern Potosí, the piece has no upbeat; the first event of the piece is the first event of the metrical frame, being heavily marked by footfalls. They appear likely to interpret the rhythm of the piece as shown by the crosses in Figure 17(b).
On looking more closely at the music of this culture and asking some of its members to clap along with different pieces of music, it became evident that this culture does not appear to use upbeats at all. Its members simply take the first event of a piece to be the initial element of its metrical framework. They do not employ the strategy that seems so natural to Western listeners, of listening for the first longest event and conferring on that event some referential status, organising our metrical framework around that event. And this is a strategy that has been claimed by some (Lerdahl and Jackendoff, among others) as being a cognitive universal - which it self-evidently is not.
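The contrast between the two metrical inference strategies can be made concrete with a minimal sketch. Both functions return the index of the event taken as the first strong beat. The "first longest event" rule is simplified here to the first occurrence of the longest duration within a short opening window; that simplification, the window size, the function names and the example rhythm are all assumptions made for illustration.

```python
def downbeat_first_longest(durations, window=4):
    """Western-style strategy (simplified): anchor the metre on the first
    occurrence of the longest duration among the opening events."""
    opening = durations[:window]
    return opening.index(max(opening))


def downbeat_first_event(durations):
    """Strategy attributed in the text to the campesinos of Northern Potosí:
    the first event is the first element of the metrical frame."""
    return 0


# Hypothetical rhythm: two short events followed by a long one, repeated.
rhythm = [0.5, 0.5, 1.0, 0.5, 0.5, 1.0]

print(downbeat_first_longest(rhythm))  # 2 (the two short events form an upbeat)
print(downbeat_first_event(rhythm))    # 0 (no upbeat)
```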
One of the reasons why the Potosíans employ this strategy appears to be related to the prosodic structure of their language, Quechua, which is an inflected, suffixal language. Words are formed by joining suffixes to a stem, and the primary prosodic stress occurs on the penultimate syllable. The suffixal nature of the language means that the absolute position of the stressed syllable within the word is subject to constant change. There is, however, a secondary stress which occurs on the initial syllable of the word, and it seems very likely that this initial secondary stress, the only fixed position stress feature in the language, is what is giving rise to - or is at least related to - the "non-upbeat" metrical inference strategies used by the campesinos.
So here is an instance of culture - in the form of linguistic practice - shaping aspects of musical perception that Western theorists, and cognitive-scientific theory, have tended to think of as fixed and immutable, as "natural" and innate.
But not only culture, in the form of language, has shaped the music of Potosí; patterns of human movement also play a significant role in determining aspects of musical structure. Indeed, an emphasis on human movement as an integral component of music and of the experience of musical structure has been noted in respect of other non-Western musics (see Blacking, 1976; Baily, 1985). In another genre of songs, pascuas (Easter songs), a repeated rhythmic pattern occurs that seems, to a Western listener, to comprise two short events followed by a long, as shown in Figure 18(a). An analysis conducted by Henry and myself showed that in this pattern the first of the paired short notes is always shorter than the second, by a consistent amount, and that their durations consistently stand in a ratio of 4:5 as indicated in Figure 18(b), a ratio that we rarely if ever encounter in Western music.
The only discoverable basis for this derived from the consistent and unequal motoric pattern involved in producing the rhythmic pattern; the short first note occurs on a downstroke, the longer duration of the second note, produced on an upstroke, being a result of delaying the fall of the hand for the third note. The inequality in the pattern was a direct consequence of the action embedded in the music, was a consequence of the embodied nature of music. In fact, in the original performances, not only was the first of the paired short notes consistently shorter than the other, it was usually almost inaudible, subverting and playing with the referentiality of the first event in respect of the rest of the pattern.
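The kind of measurement involved can be sketched as follows, assuming that note onsets have already been extracted from a recording. The onset times below are hypothetical values chosen to illustrate a 4:5 ratio; they are not data from the analysis, and the function names are likewise illustrative.

```python
import math
from typing import List, Sequence


def inter_onset_intervals(onsets: Sequence[float]) -> List[float]:
    """Durations between successive onsets, in the units of the input."""
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


def short_pair_ratio(onsets: Sequence[float]) -> float:
    """Ratio of the first short note to the second short note, given the
    onsets of the two short notes and of the long note that follows."""
    if len(onsets) < 3:
        raise ValueError("need the onsets of three successive notes")
    first, second = inter_onset_intervals(onsets)[:2]
    return first / second


# Hypothetical onsets in seconds: short (0.20 s), short (0.25 s), then long.
onsets = [0.0, 0.20, 0.45]
ratio = short_pair_ratio(onsets)

print("ratio of first to second short note:", round(ratio, 3))
print("close to 4:5?", math.isclose(ratio, 4 / 5, rel_tol=0.02))
print("close to 1:1?", math.isclose(ratio, 1.0, rel_tol=0.02))
```

A Western transcription that notates the two short notes as equal implicitly assumes a ratio of 1:1; the tolerance of 2% used in the comparison is an arbitrary choice for this sketch.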
Now if these sorts of factors - linguistic and cultural factors, motoric and embodied factors - condition the way that music happens in another culture, it is not unreasonable to suggest that these types of factors play a major role in the music of our own culture, a role that is largely ignored both in theory in AI and in experiment within cognitive science. It seems feasible that by looking at what actually happens in the music of our own and other cultures, by addressing issues of culture and embodiment - and of emotion, which has been neglected in this paper but is a fast-growing area of research in cognitive science - cognitive science and AI can advance our understanding of mind in general. Music bears the unique and specific stamp of the culture from which it originates as well as the imprint of the actions required for its materialisation, and affords a powerful means of tracking the dynamics of the interaction of mind, behaviour and culture.
But music not only bears the traces of culture, it appears to be a universal human capacity; every culture that we know of has something we can recognise as music. Every member of those different cultures seems to come equipped with the capacity to participate in music, much as they appear to come equipped with the capacity for language (Blacking, 1995). Music seems to be as innate as language, as much a part of the human cognitive and behavioural repertoire. Bearing this in mind, it is self-evident that the achievement of an understanding of human musical capacities should be central to the cognitive-scientific research programme, because music presents it with challenges that seem to go beyond even those involved in understanding language. Language has self-evident evolutionary, social and individual utility; after all, language is, at the least, about something. Music is not about anything in particular. And the challenge and potential rewards for both AI and the broader domain of cognitive science lie in attempting to reconcile the dual aspects of music: music is humanly universal yet at the same time appears to be a strangely functionless activity.
Steven Pinker has suggested that "as far as biological cause and effect is concerned, music is useless". I believe that Pinker is wrong, and that music is extraordinarily functional, but in subtle ways, and in respect of aspects of human behaviour and cognition that cognitive science has barely begun to touch on (Cross, 1999). A good case can be made for music as being an individual and social activity that quite literally changes our minds, affording us the scope to integrate and restructure our mental processes throughout our development and through interaction with the environment and with others. Music, as a matrix that integrates the results of cultural processes, embodied dynamics and emotional significations, can provide an extraordinarily fruitful resource for the development of cognitive-scientific thinking.
Human experience is beginning to be conceived of in increasingly rich and complex ways in cognitive science; it is beginning to be acknowledged as a complex, embodied, encultured, valenced and "enhistoried" set of social and cognitive processes (see Damasio, 1994; Port and van Gelder, 1995; Elman et al., 1996). Music has to be approached in the same light, rather than being thought of as simply pretty patterns in sound, and as such the study of musical experience can offer much to artificial intelligence. I am confident that it will do so, and, conversely, that artificial intelligence will have considerably more to offer to music.
References:
Baily, J. (1985) Musical structure and human movement. In P. Howell, I. Cross and R. West (Eds.) Musical structure and cognition. London: Academic Press.
Balzano, G. (1980) The group-theoretic representation of 12-fold and microtonal pitch systems. Computer Music Journal, 4, 66-84.
Bharucha, J. (1999) Neural nets, temporal composites and tonality. In Deutsch, D. (Ed.) The psychology of music (2nd edition). London: Academic Press.
Blacking, J. (1976) How musical is man? London: Faber.
Blacking, J. (1995) Music, Culture and Experience. London: University of Chicago Press.
Bregman, A. (1990) Auditory scene analysis: the perceptual organisation of sound. Cambridge, MA: MIT Press.
Browne, R. (1981) Tonal implications of the diatonic set. In Theory Only, 5, 3-21.
Burns, E. (1999) Intervals, scales and tuning. In Deutsch, D. (Ed.) The psychology of music (2nd edition). London: Academic Press.
Brown, H., Butler, D. and Jones, M. R. (1994) Musical and temporal influences on key discovery. Music Perception, 11 (4), 371-407.
Cook, P. (Ed.) (1999) Music, cognition and computerized sound. Cambridge, MA: MIT Press.
Cross, I. (1997) Pitch schemata. In I. Deliège and J. Sloboda (Eds.), Perception and cognition of music. Hove: Psychology Press.
Cross, I. (1999) Is music the most important thing we ever did? Music, development and evolution. In Suk Won Yi (Ed) Music, mind and science, Seoul: Seoul National University Press.
Cuddy, L. (1997) Tonal relations. In I. Deliège and J. Sloboda (Eds.), Perception and cognition of music. Hove: Psychology Press.
Damasio, A. (1994) Descartes' error: emotion, reason and the human brain. New York: G. P. Putnam's Sons.
Deutsch, D. (Ed.) (1999) The psychology of music (2nd edition). London: Academic Press.
Elman, J., Bates, E. A., Johnson, M., Karmiloff-Smith, A., Parisi, D. and Plunkett, K. (1996) Rethinking innateness: a connectionist perspective on development. Cambridge, MA: MIT Press.
Handel, S. (1989) Listening. Cambridge, MA: MIT Press.
Huron, D. (1992) The Ramp archetype and the maintenance of passive auditory attention. Music Perception, 10 (1), 83-92.
Jackendoff, R. (1987) Consciousness and the computational mind. Cambridge, MA: MIT Press.
Jones, M. R. (1992) Attending to musical events. In M. R. Jones and S. Holleran (Eds.), Cognitive bases of musical communication. Washington, D.C.: American Psychological Association.
Krumhansl, C. L. (1990) The cognitive foundations of musical pitch. Oxford: Oxford University Press.
Lerdahl, F. (1989) Atonal prolongational structure. Contemporary Music Review, 4, 65-87.
Lerdahl, F. and Jackendoff, R. (1983) A generative theory of tonal music. Cambridge, MA: MIT Press.
Lindley, M. (1984) Lutes, Viols and Temperaments. Cambridge: Cambridge University Press.
Narmour, E. (1989) The analysis and cognition of basic melodic structures. London: University of Chicago Press.
Narmour, E. (1992) The analysis and cognition of melodic complexity. London: University of Chicago Press.
Parncutt, R. (1989) Harmony: a psychoacoustical approach. London: Springer-Verlag.
Port, R. F., and van Gelder, T. (Eds.) (1995) Mind as motion: explorations in the dynamics of cognition. Cambridge, MA: MIT Press.
Stobart, H. and Cross, I. (1994) Aspects of rhythmic structure in the music of Northern Potosí, Bolivia. In I. Deliège (Ed.), Proceedings of the 3rd International Conference on Music Perception and Cognition, Liège.
Terhardt, E., Stoll, G. and Seewan, M. (1982) Algorithm for extraction of pitch and pitch salience from complex tonal signals. Journal of the Acoustical Society of America, 71, 679-688.
Todd, N. (1994) The auditory "primal sketch": a multi-scale model of rhythmic grouping. Journal of New Music Research, 23, 25-70.