Gesture–vocal coupling in Karnatak music performance: A neuro–bodily distributed aesthetic entanglement

In many musical styles, vocalists manually gesture while they sing. Coupling between gesture kinematics and vocalization has been examined in speech contexts, but it is an open question how these couple in music making. We examine this in a corpus of South Indian Karnatak vocal music that includes motion-capture data. Through peak magnitude analysis (linear mixed regression) and continuous time-series analyses (generalized additive modeling), we assessed whether vocal trajectories around peaks in vertical velocity, speed, or acceleration were coupled with changes in vocal acoustics (namely, F0 and amplitude). Kinematic coupling was stronger for F0 change than for amplitude change, pointing to F0's musical significance. Acceleration was the most predictive of F0 change and had the most reliable magnitude coupling, showing a one-third power relation. That acceleration, rather than other kinematics, is maximally predictive of vocalization is interesting because acceleration entails force transfers onto the body. As a theoretical contribution, we argue that gesturing in musical contexts should be understood in relation to the physical connections between gesturing and vocal production that are brought into harmony with the vocalists' (enculturated) performance goals. Gesture-vocal coupling should, therefore, be viewed as a neuro-bodily distributed aesthetic entanglement.


INTRODUCTION
Across a wide range of musical styles worldwide, vocalists tend to gesture manually while they sing. In existing research, such co-singing gesture practices have been analyzed with regard to communication, expressivity, transmission, iconicity/metaphor, and perceived effort, often as part of broader discussions of music embodiment. [1][2][3][4][5] However, fundamental questions remain unanswered regarding the coupling between gesture and sound, namely, what features of vocal sound and gesture kinematics are most closely coupled, and in what way. Vocalists in Indian classical styles tend to gesture spontaneously while performing, and as a result, there is already a small body of research on gesturing in these practices. This study is thus part of a larger inquiry into gesture and vocal performance in Indian musical contexts. Meanwhile, as the wider field of music and gesture/movement studies is still heavily skewed toward a focus on Western Art Music, jazz, and popular music, this study contributes to increasing diversity among the styles examined. We propose that studies of real musical practices across diverse cultural contexts can play an important role in understanding connections between gesture and sound production in performance contexts.
Karnatak vocalists frequently gesture while singing, producing a variety of tracing, pointing, flicking, pushing, pulling, and stretching motions (for an example, see https://youtu.be/INk1KvYOf8U). These hand and upper body gestures do not comprise a formal system of symbols and referents; instead, performers experience their gesturing as being spontaneous. 4 Nevertheless, similarities can be found between the gestures of different vocalists. The gesturing in Karnatak contexts is akin to that of related North Indian styles, and indeed, the majority of research on gesture and Indian music has focused on North Indian practices, including Khyal and Dhrupad. Across these styles, gestures are not taught formally. Instead, the tendency to gesture in certain ways appears to be acquired implicitly during the lengthy learning process. 3,4 In Indian musical contexts, performer and audience gesturing has been analyzed to explore topics, including audience perception of metrical structure, 2 kinetic analogy through sound, 6 vocalization and gesture as parallel channels for melody, 3 metaphor, iconicity, and cross-domain mapping, 4 performance practice across cultural contexts, 7 and connections between physical effort and vocal sound. 5 Notwithstanding this body of work, questions remain regarding the nature of the coupling between gesture kinematics and vocal sound, namely, which aspects of each are structurally related to the other. Until recently, research on music-related gesture-vocal coupling has been hampered by a lack of appropriate statistical methods for assessing the coupling, but methodological progress has provided new solutions, which we employ here.
In this study, we ask what features of vocal sound and gesture kinematics are most closely coupled, and in what way. Second, we examine how this varies across performers and performances, and whether the nature of the coupling is affected by musical context, in particular across the melodic frameworks known as ragas (rāgas). a Such examination will contribute to understanding of what the gestures index or represent, and how, which has relevance for the question of why performers gesture as they do. Furthermore, the findings could have implications for broader debates regarding how gestures relate to vocal production through the physical connections between the two systems.
These questions are all part of the general unsolved conundrum of why humans tend to move along with music (or play music along with movement) in complexly varied ways. [8][9][10][11] The study has distinctly interdisciplinary foundations, and its theoretical background is drawn from two areas: the first from gesture studies, where there is an existing body of research addressing coupling between gesture and vocal production in speech contexts; the second from work within musicology and associated fields on relationships between music and movement, and also on the aesthetics of the Karnatak style.

a Ragas are melodic frameworks that include rules on which musical pitches can be played and in which order, the gamakas (ornaments) that should be played on those pitches, and the characteristic phrases that must be performed in order to properly express the raga. As a result, each raga is considered to have its own particular character or "color."

BACKGROUND IN GESTURE-SPEECH COUPLING
During speaking, the hands gesture not only to convey meaning through enacting, depicting, or symbolizing, but also by adopting a certain prosody (or melody) together with speech. When emphasis is given to a speech segment, the concomitant rising excursion in the fundamental frequency is often accompanied by a salient jerky movement of the gesturing hand. [12][13][14][15] This so-called beat-like quality of gesture-where manual movement synchronizes with sharp rises in pitch and other acoustic markers of emphasis-has traditionally been understood to be a cognitively acquired tendency. 15,16 However, the upper limbs are attached to the torso, which is part of the respiratory-vocal system. It is known that simply moving your arms can interact with breathing cycles, 17 and sudden arm movements can change intra-abdominal (and thus potentially subglottal) pressures due to the recruitment of a wider set of posture-stabilizing muscles during upper-limb movements. 18,19 Indeed, in nonhuman animals, such as flying bats and birds, vocalizations entrain to wing beats because the flying-related muscle tensions affect respiratory-vocal control. 20,21 Gesture-speech physics research has followed this line of thinking through for human manual gesturing, and found that chest kinematics are affected by upper limb movement, which in turn affects acoustic markers of emphasis during, for example, monosyllable utterances. 22 It has also been shown that standing (vs. sitting), moving higher-mass effectors (two arms > one arm > hand), and making movements with higher de-/acceleration lead to increased acoustic markers of emphasis in vocalizing and fluent speech. 23,24 This supports the gesture-speech physics hypothesis that the beat-like quality of gesture should be understood as force-generating physical impulses, which recruit a wider set of (posture-stabilizing) muscles, 25,26 some of which are implicated in respiratory-vocal control. 
19,27,28 Thus, acceleration is an important kinematic marker of gesture-speech kinetics, as force transfers (i.e., physical impulses) necessarily occur when a body segment with a certain mass accelerates or decelerates over some time (note that force = mass × acceleration). Other kinematic variables, such as speed or position, are less directly informative about the forces generated by gesturing. According to the gesture-speech physics thesis, then, acceleration is a key kinematic marker of force generation and is predicted to be the most reliable parameter for understanding coupling between gesture and the respiratory-vocal system. This has recently been supported by machine learning results showing that gesture acceleration is better predicted from speech acoustics than gesture speed. 29 The existing research suggests that there are physical tensegrity-related interconnections between the respiratory-vocal system and the upper limbs, which we predict are enacted in this real musical context too. We study this here through an exploratory analysis based on acceleration being a marker for force transfers across the body that are primarily powered by muscle contractions (but also by the tensile equilibrating properties of connective tissues). If acceleration couples more strongly with acoustic variables than speed or vertical velocity do, this would be consistent with an interpretation in which force transfer is a significant factor in the coupling of gesture and vocal production. However, we do not mean to imply that gesturing in this context is purely determined by physical coupling, only that it plays a dispositional role: it poises performers to move their arms in a particular way rather than another, but it does not physically obligate them, and there can be individual differences in how performers cope with the physical constraints of gesturing and vocalizing at the same time. 
Therefore, we see such gesturing as also being a particular strategy that is "cognitively" acquired and may then also be further shaped in relation to other cross-modal perceptual mappings, specific cultural tendencies, and the aesthetic goals of the vocalist for a particular performance. This leads us to characterize the coupling observed as a neuro-bodily distributed aesthetic entanglement, which we will further explicate in the discussion. We suggest that it would be misguided to imagine a binary opposition between either aesthetic goals or biomechanical stabilities. Instead, we propose that performances should be viewed as arising out of multiple constraints brought into harmony by the performer.
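The centrality of acceleration to the force-transfer argument can be sketched numerically. The following Python fragment (an illustration, not part of the authors' analysis pipeline; the segment mass is a placeholder, not a measured quantity) derives acceleration from a wrist position trace by double differentiation and converts it into a force proxy via force = mass × acceleration:

```python
import numpy as np

def force_proxy(z_cm, fs=60.0, segment_mass_kg=0.5):
    """Estimate a force-transfer proxy from a 1D wrist position trace (cm).

    Differentiating position twice gives acceleration; multiplying by an
    assumed effective segment mass gives a force proxy (F = m * a).
    `segment_mass_kg` is a hypothetical placeholder value.
    """
    v = np.gradient(z_cm, 1.0 / fs)       # velocity, cm/s
    a = np.gradient(v, 1.0 / fs)          # acceleration, cm/s^2
    return segment_mass_kg * (a / 100.0)  # cm/s^2 -> m/s^2; force in newtons

# Example: a hand oscillating vertically at 2 Hz with 5 cm amplitude,
# sampled at 60 Hz as in the motion capture described below
t = np.arange(0, 2, 1 / 60.0)
z = 5 * np.sin(2 * np.pi * 2 * t)
f = force_proxy(z)
```

Note that speed or position alone would not yield this quantity: two movements with identical peak speed but different sharpness of direction change transfer very different impulses.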

BACKGROUND IN MUSIC-MOVEMENT CORRESPONDENCES AND KARNATAK MUSIC AESTHETICS
In seeking to account for observed connections between co-musicking movements and musical sound, some research has looked to the potential influence of perceptual cross-domain mappings found between movement and sound/music. 4,5,30 For example, studies using motion imagery paradigms have identified correspondences between change in fundamental frequency and change in vertical position; increase in loudness and movement toward the listener; and increase in loudness and the application of force. 31,32 Meanwhile, studies where participants are asked to move or trace in response to short music-like phrases have similarly found correlations between change in fundamental frequency and vertical position, 33,34 loudness and movement velocity, 35 and impulsive sounds and high acceleration peaks. 34 Through cross-modal interaction, performers' gestures can also influence audience members' perception of musical sound, biasing pitch perception 36 and perceived duration. 37 Similarly, in speech contexts, seeing a beat gesture can change the perception of a co-occurrent syllable as if it were lexically stressed. 38 Following such research, music is now increasingly understood as a multimodal phenomenon, wherein physical movement and musical sounds are intertwined in behavior and experience. Cross-modal correspondences are held to develop largely through repeated experience of regularities in the environment, 39 including accumulated experiences of physical interactions with objects and the sounds that result. 32,40 For example, we know that when we hit an object with more force, a louder sound arises. Such insights connect with ecological psychology perspectives on sound perception, in which it is proposed that people perceive the causes of sounds more immediately than their acoustic qualities. 
[41][42][43] Vocalists thus perform in the context of such perceptual cross-modal correspondences and statistical regularities across the body and environment, evidence of which can be observed in their gesturing. 4,5,44,45 However, the likely impact of performers' aesthetic goals on gesturing also demands consideration. Unlike in experimental studies on cross-modality, for the current study, vocalists were not asked to trace or respond to music with movement; rather, they gestured spontaneously while focusing on performance goals related to the aesthetics of the Karnatak style, as they normally do when performing. Based on interviews with the vocalists conducted alongside the ragaālāpana b recordings made for this study, the goals of such performances are to express the character and beauty of the raga and thus move the audience. In Karnatak music, the character of each raga is conveyed through its characteristic phrases and motifs, which must be performed with specific patterns of emphasis and de-emphasis through modulation of loudness, pitch, duration, and timbre. [46][47][48] The correct and expressive performance of such phrases forms the basis for what is considered beautiful in the style-its aesthetic qualities. 46 Therefore, the required modulations of loudness, pitch, and duration within characteristic phrases and motifs should be conveyed to the audience. As music is a multimodal practice, wherein cross-modal interaction means that gestures can affect the perception of musical sound (as discussed above), both sound and gestural movement may contribute to the audience's experience of these qualities. Gestures can thus be viewed as contributing to the aesthetic experience of performances, and the influence of particular perceptual cross-modal correspondences on gesturing should be viewed in this context. 
Considering that vocalists have performance goals, it seems likely that they use such correspondences skillfully and unreflectively in performance, with a keen appreciation of what is effective in their aesthetic endeavor.

CURRENT STUDY
In this study, we examine coupling between gestures and vocal sound based on the following research questions and rationales.
Which couples most strongly with gesture kinematics: F0 or amplitude?
Following existing work on cross-modal perception discussed above, we expect to find cross-modal relationships between performers' gestures and the vocal sound produced. But which variables couple most strongly in this real musical context? The question is important because strength of coupling can provide insight into which sonic features are indexed by performers' gestures, and thus may also have an impact on audience members' perception of the music (considering the literature on cross-modal interaction discussed above). Here, we examine the strength of coupling between kinematic variables and two acoustic features commonly implicated in studies on cross-modal mappings in musical contexts: change in F0 and amplitude.

b Ragaālāpana is a musical format without meter, where the performer extemporizes on a raga based on its existing characteristic phrases and raga grammar. In this format, nonlexical vocables are sung, rather than lyrics.
What couples most strongly with vocal acoustics: acceleration, speed, or vertical velocity?
Following existing research on gesture-vocal coupling in speech and musical contexts, we expect acceleration to couple most strongly with acoustic variables. We aim here to provide a fine-grained analysis of coupling between kinematic and acoustic variables, looking not only at temporal coupling but also magnitude coupling. Such an analysis is important for what it can tell us about how performers index the sonic features-which kinematic features are most strongly implicated.
In addition, a finding of acceleration being most reliably coupled with acoustics would be consistent with interpretations highlighting the salience of force, as this is more directly related to acceleration than the other kinematic features analyzed.

How do the couplings examined above vary across performers and performance types (different ragas)?
We ask whether individual performers have idiosyncratic modes of coupling between gesture and acoustics, or whether there are commonalities across performers. In addition, we seek to discover whether the raga performed has an effect on the quality of coupling. This question is stimulated by the fact that ragas are often considered by musicians to have particular characters or moods. 48 As Karnatak vocal performance is a complex human behavior involving physical, cognitive, cultural, and aesthetic influences, we expect the results to be similarly complex, but we hope to identify some underlying trends in answer to our questions through our analysis of a large number of performances by four Karnatak vocalists who, as socially acknowledged expert performers, are taken to be indicative of current performance practice in the style.
The outline of this study is as follows. We first compare acceleration peaks with 3D speed and vertical velocity peaks, the latter two of which are kinematic variables with a high likelihood of entraining to musical features (based on research discussed above). We focus on studying temporal regions around peaks in movement as we know that gestures are intermittent in their activity, such that there are moments of vocalization without gesture that we should not average with moments of gesturing. Further, our analysis procedure is tailored for the study of time series that likely couple polyrhythmically due to the inherently different time scales that define each system (see Methods). We then perform a coupling analysis that focuses on the presence of temporal coupling, such that some acoustic fluctuation consistently occurs relative to the timing of a kinematic fluctuation. Then, we follow up with a magnitude coupling analysis, which quantifies the degree to which a gesture kinematic magnitude scales with the acoustic fluctuations.
Finally, we analyze whether there is consistent variability in gesture-vocal coupling across ragas (melodic frameworks) or performers. For example, some performers may use one particular cross-modal mapping (e.g., vertical motion with F0 change) over another (e.g., speed and F0 change).

Performances and performers
In total, 35 recorded performances of ragaālāpana were analyzed, covering eight ragas performed by four Karnatak vocalists.

Audio recording
Sound was recorded at 48 kHz using Neumann KM184 condenser microphones.

Motion tracking
Motion tracking was performed with the Xsens MVN Awinda (Xsens, the Netherlands; 60 Hz sampling), a full-body inertial-sensor motion capture system. We smoothed the x, y, z traces with a zero-lag 30 Hz third-order Butterworth filter to reduce noise-related jitter. We extracted movement traces (x, y, z displacements) of the left and right wrists.
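A zero-lag (forward-backward) Butterworth filter of the kind described above could be sketched as follows in Python with SciPy (an illustrative reimplementation, not the authors' script; the demo signal and its 200 Hz sampling rate are assumptions chosen so that the 30 Hz cutoff sits below the Nyquist frequency):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def zero_lag_lowpass(x, cutoff_hz, fs_hz, order=3):
    """Third-order Butterworth low-pass filter applied forwards and
    backwards (filtfilt), which cancels phase delay ("zero-lag")."""
    b, a = butter(order, cutoff_hz / (fs_hz / 2.0))  # normalized cutoff
    return filtfilt(b, a, x)

# Demo: a slow 1 Hz movement contaminated with fast 45 Hz jitter
fs = 200.0
t = np.arange(0, 2, 1 / fs)
x = np.sin(2 * np.pi * 1 * t) + 0.2 * np.sin(2 * np.pi * 45 * t)
x_smooth = zero_lag_lowpass(x, cutoff_hz=30.0, fs_hz=fs)
```

Running the filter in both directions squares the magnitude response, so the jitter component is strongly attenuated while the slow movement passes essentially unchanged.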

Video recording
Video recording was performed with a GoPro Hero4 camera at 50 fps.

Manual-vocal events measurement
Manual gesture events were annotated in ELAN. 49 The gesturing events were defined as a sequence of movement strokes and poststroke holds in the gesture space in front of the performer. The start boundary of the gesture event was approximately determined as the moment when the hand finished its preparatory phase from rest position to gesture space. The end boundary of the gesture event was the moment when the gesturing hand retracted to the rest position. Thus, the gesture events did not include the preparatory and retraction phases from and to rest positions (a common approach in co-speech gesture coding). 50

c All four performers sang all eight ragas, apart from raga Bhairavi, which was not performed by one vocalist.

Acoustic measurements
For each acoustic measure x, we consider the absolute change (|Δx|) in magnitude (i.e., the absolutized first derivative of x with respect to time). The derivative is used because we are interested in whether kinematics couple with dynamic changes in the acoustics (does acceleration couple with F0 movements?). This is different from asking whether gesture kinematics tend to covary with high or low F0/amplitude.
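The |Δx| measure can be sketched in a few lines of Python (an illustrative sketch, not the authors' code). Note how a rising-then-falling contour changes the sign of its derivative, but |Δx| registers both directions simply as "change":

```python
import numpy as np

def abs_change(x, fs_hz):
    """Absolutized first derivative |dx/dt|, in units of x per second."""
    return np.abs(np.gradient(x, 1.0 / fs_hz))

# A contour that rises then falls: the raw derivative flips sign,
# but the absolute change treats both directions alike
x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
print(abs_change(x, fs_hz=1.0))  # -> [1. 1. 0. 1. 1.]
```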

Absolute change fundamental frequency (|ΔF0|)
The fundamental frequency is the main acoustic determinant of the perceived pitch of the sound. Fundamental frequency was extracted as a time series with a sample rate of 200 Hz using Praat. 51 Pitch tracks were hand-checked for noise-related tracking problems (e.g., period doubling), and the pitch ranges for each performance were adjusted in Praat accordingly. We smoothed F0 with an 8 Hz Hanning window and then computed the absolute change of F0 over time, henceforth |ΔF0|. |ΔF0| is expressed in Hertz change per second.
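The smoothing-and-differentiation step could look as follows in Python (an interpretive sketch, not the Praat-based pipeline; we read "8 Hz Hanning window" as a window spanning 1/8 s, which at 200 Hz is 25 samples, an assumption on our part):

```python
import numpy as np

def hann_smooth(x, fs_hz, window_hz):
    """Smooth x with a unit-gain Hanning kernel spanning 1/window_hz s."""
    n = int(round(fs_hz / window_hz))  # e.g., 200/8 = 25 samples
    w = np.hanning(n)
    w /= w.sum()                       # normalize so the kernel has unit gain
    return np.convolve(x, w, mode="same")

# Toy F0 contour at 200 Hz: 220 Hz with a 3 Hz vibrato-like modulation
fs = 200.0
t = np.arange(0, 1, 1 / fs)
f0 = 220 + 30 * np.sin(2 * np.pi * 3 * t)
f0_smooth = hann_smooth(f0, fs, window_hz=8.0)
abs_dF0 = np.abs(np.gradient(f0_smooth, 1 / fs))  # |dF0/dt| in Hz per second
```

The smoothing removes fine jitter while leaving slower melodic movement, so |ΔF0| reflects musically relevant pitch change rather than tracker noise.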

Absolute change amplitude envelope (|ΔENV|)
The amplitude envelope (ENV) tracks gross intensity changes in the sound. Using a custom-written script (https://osf.io/6vjqn/), we extracted the amplitude envelope from the audio. To extract the amplitude envelope, we applied the Hilbert transform and took the complex modulus of the analytic signal, yielding a 1D time series. 52 We then smoothed the amplitude envelope using an 8 Hz Hanning window. This smoothing should preserve the gross information in the acoustics that couples at time scales comparable with those of the kinematics, while ignoring very fine-structured information in the amplitude signal. We downsampled the ENV time series to 200 Hz and rescaled the amplitude envelope within each performance from 0 to 1. We then computed the absolute change of the amplitude envelope over time, henceforth |ΔENV|. |ΔENV| is expressed in rescaled amplitude envelope units (or arbitrary units, a.u.) change per second.
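The envelope extraction chain described above could be sketched in Python as follows (an illustrative reimplementation rather than the OSF script; the decimation-by-slicing step is a simplification of whatever resampling the original script used):

```python
import numpy as np
from scipy.signal import hilbert

def amplitude_envelope(audio, fs_hz, smooth_hz=8.0, target_fs=200):
    """Amplitude envelope: analytic signal -> complex modulus ->
    Hanning smoothing -> downsample -> rescale to [0, 1]."""
    env = np.abs(hilbert(audio))                       # complex modulus
    n = int(round(fs_hz / smooth_hz))                  # 1/8 s Hanning kernel
    w = np.hanning(n)
    w /= w.sum()
    env = np.convolve(env, w, mode="same")             # 8 Hz smoothing
    step = int(fs_hz // target_fs)
    env = env[::step]                                  # crude downsample to ~200 Hz
    env = (env - env.min()) / (env.max() - env.min())  # rescale 0..1
    return env

# Toy signal: a 220 Hz tone with a slow 2 Hz loudness modulation, at 48 kHz
fs = 48000
t = np.arange(0, 0.5, 1 / fs)
tone = np.sin(2 * np.pi * 220 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 2 * t))
env = amplitude_envelope(tone, fs)
```

The Hilbert-transform modulus tracks the slow loudness modulation while discarding the 220 Hz carrier, which is exactly the "gross intensity" information the text describes.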

Velocity z
The negative or positive rate of vertical displacement (velocity in the z dimension) was obtained. Positive values indicate that the hand is moving up, and negative values indicate movement downward. This measure is especially of interest if there is an acoustic mapping onto the vertical dimension, for example, if a positive change in F0 (increase in Hz) is reflected in an upward gesture. We express velocity z as a vector d quantity in centimeters per second.

Speed
3D speed of a particular body segment was calculated from the individual x, y, z velocity (v) components, s = √(vx² + vy² + vz²), as provided by the motion tracking system. When speed is higher, the body segment is moving faster in an arbitrary direction, and when speed is zero, there is no movement. Speed cannot be negatively valued. Therefore, we express speed as a scalar quantity in centimeters per second.
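The vector/scalar distinction between the two kinematic measures is worth making concrete: vertical velocity keeps its sign (direction), whereas speed never does. A minimal Python sketch of the speed formula:

```python
import numpy as np

def speed_3d(vx, vy, vz):
    """Scalar 3D speed from velocity components: s = sqrt(vx^2 + vy^2 + vz^2)."""
    return np.sqrt(vx**2 + vy**2 + vz**2)

# A hand moving rightwards (vx = 3 cm/s) and downwards (vz = -4 cm/s):
# vertical velocity is negative, but speed is an unsigned 5 cm/s
vx, vy, vz = np.array([3.0]), np.array([0.0]), np.array([-4.0])
print(speed_3d(vx, vy, vz))  # -> [5.]
```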

Measurement aggregation
Acoustic measures and kinematic measures were aggregated using a custom-written processing script (https://osf.io/q3rxa/). We upsampled the motion tracking data using linear interpolation from 60 to 200 Hz, which was then aligned in time with the acoustic measures (already sampled at 200 Hz). All our coupling analyses take this acoustic + kinematic time series as their input.

F I G U R E 1 The lower panel of A shows the concomitant gesture acceleration time series, where the positive peaks are identified and given a magnitude category based on 33% quantiles (three lower peaks, three middle peaks, and three high peaks). The time series are then analyzed for temporal coupling (while taking into account magnitude peaks) as shown in B. For the temporal coupling analysis, we prepare the time series for GAM by sampling the vocal time series around an interval (here 600 ms) centered on the kinematic peaks. An example of such a sample is given in B, which shows peak 2 of the time series in panel A (annotated as "sample 2"). If we repeat this process for all peaks, we can generalize over the vocal trajectories while taking magnitude into account, and this results in GAM fitted (nonlinear) slopes as given in the right panel of B. It can be seen that there is a consistent pattern for this single gesture event such that a positive peak in |ΔF0| follows the acceleration peak after about 130 ms, that is, there is a delayed temporal coupling. It can also be seen that this general pattern (given in black) is much more pronounced for the high magnitude peaks, followed by the middle peak, and then the low peak, which suggests a role for the magnitude of the peak in establishing temporal coupling. (C) To further assess the magnitude coupling, we establish from the GAM the delayed coupling in milliseconds, and then take average samples of |ΔF0| and relate these continuously to acceleration peak magnitude using linear mixed regressions.
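The upsampling-and-alignment step could be sketched as follows in Python (an illustrative sketch of linear interpolation from 60 to 200 Hz, not the OSF script itself):

```python
import numpy as np

def upsample_to(x, fs_from, fs_to, duration_s):
    """Linearly interpolate a kinematic trace from fs_from to fs_to Hz so it
    can be aligned sample-by-sample with the 200 Hz acoustic series."""
    t_old = np.arange(0, duration_s, 1.0 / fs_from)
    t_new = np.arange(0, duration_s, 1.0 / fs_to)
    return np.interp(t_new, t_old, x[: len(t_old)])

# A 2 s, 60 Hz motion trace (a slow 1 Hz oscillation) brought up to 200 Hz
motion_60 = np.sin(2 * np.pi * 1.0 * np.arange(0, 2, 1 / 60))
motion_200 = upsample_to(motion_60, 60, 200, duration_s=2)
```

For signals as slow as gesture kinematics, linear interpolation introduces negligible distortion relative to the 60 Hz original.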

Gesture-vocal temporal and magnitude coupling analysis
Temporal coupling between kinematic peaks and the vocal trajectory is, for example, expected if there is some kind of dynamic neurophysiological feedback between gesture and vocalization trajectories, or when there is a biomechanical coupling between gesture and vocalization, with, for example, laryngeal counter-adjustments arising after about 30 ms. 55 We further added fixed effects of different kinematic peak magnitudes to the GAM model so as to provide an initial test for magnitude coupling (does acoustic output generally scale with kinematic peak magnitude?) and to visualize how kinematic magnitude affects the temporal coupling of kinematics with the vocal trajectory (does temporal coupling especially arise when the kinematic peak reaches a certain magnitude?). We refer to Figure 1 for a graphical explanation of our temporal analysis approach.
Specifically, for each gesture event, we identified all positive peaks (in the case of scalar quantity, such as speed) or negative and positive peaks (in the case of vector quantities, such as acceleration and vertical velocity) in the time series. Kinematic peaks were determined using a peak-finding algorithm implemented by R-package pracma. 56 In order to capture sufficient variability in the magnitude of acoustic peaks to be related to the kinematic peak, the peak-finding algorithm was not thresholded (e.g., minimum magnitude of the peak), though positive or negative peaks needed to exceed the 0 boundary. Thus, we also have peaks that are of relatively minor magnitude, next to more pronounced kinematic peaks ( Figure 1). To make a distinction between different magnitudes of the peaks and their relation to temporal coupling, we initially distinguish between low, middle, and high magnitude peaks, which were determined for each performer separately by identifying the lowest (0-33% quantile), middle (33-66%), and highest (66-100% quantile) peaks for that performer.
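The unthresholded peak-finding and tertile-binning procedure could be sketched as follows (the original analysis used the R package pracma; this Python equivalent with SciPy's `find_peaks` is our own illustrative reconstruction, on a toy acceleration trace):

```python
import numpy as np
from scipy.signal import find_peaks

def peaks_with_tertiles(acc):
    """Find unthresholded positive peaks and bin them into low/middle/high
    magnitude tertiles (0-33%, 33-66%, 66-100% quantiles)."""
    idx, _ = find_peaks(acc)       # no minimum-height threshold...
    idx = idx[acc[idx] > 0]        # ...but peaks must exceed the 0 boundary
    mags = acc[idx]
    q33, q66 = np.quantile(mags, [1 / 3, 2 / 3])
    labels = np.where(mags <= q33, "low",
                      np.where(mags <= q66, "middle", "high"))
    return idx, mags, labels

# Toy acceleration trace: 1 Hz bursts whose amplitude waxes and wanes,
# yielding peaks of visibly different magnitudes
t = np.linspace(0, 6, 600)
acc = np.sin(2 * np.pi * t) * (0.5 + 0.4 * np.sin(2 * np.pi * t / 6))
idx, mags, labels = peaks_with_tertiles(acc)
```

In the real analysis the quantile boundaries were computed per performer, so that "high" is relative to each vocalist's own movement range.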
In a follow-up analysis, we quantify magnitude coupling continuously, and these arbitrary cutoffs are merely used for assessing context-dependent effects of magnitude coupling on temporal coupling.

Comparisons to other types of analyses
Note that our analysis procedure differs in some ways from other common approaches to assessing synchronization processes in music and movement. 34,58,59,60 Cross-wavelet analysis is currently a popular approach for analyzing (multiscale) synchronization between two time series of a similar nature, such as movement produced by two persons. 58,60 However, changes in vocal acoustics have oscillation periods that are orders of magnitude shorter than those of the relatively slower manual gesture system (e.g., compare |ΔF0| and the acceleration time series in Figure 1). Hence, the two systems operate on inherently different time scales and are likely to couple polyrhythmically, 61 or indeed in pulses, 22 and it is then less than ideal to perform an analysis that is designed to assess continuous coupling within time scales between (multiple) oscillations that have a comparable period. Further, our analysis is sensitive to lagged synchronization processes, similar to cross-correlation analysis or cross-recurrence quantification analysis. 62 However, the current GAM approach (combined with linear mixed regression) provides a statistical tool for disentangling random effects of performer and performance from nonlinear main effects of time relative to peak kinematics and fixed effects of kinematic magnitude. 57 A drawback of the current approach is that it is not possible to infer which system is likely causing (in a statistical sense) some effect in the other; approaches such as Granger causality analysis are an interesting further avenue of inquiry for the current research. 62

Performer and raga difference in temporal and magnitude coupling analysis
Our main temporal and magnitude coupling analysis is focused on whether we can generalize, over performers and performances, the way that gesture couples with vocalization. Of course, this might obscure interesting differences between performers or between what is being performed (i.e., which raga). We will, therefore, further explore performer- and raga-dependent differences by summarizing all GAM model fits of the different gesture-vocal couplings using dimensionality reduction (principal component analysis; PCA).

Descriptive performances
Gesture-vocal events

Gesture kinematics and vocal acoustics
In Table S1, we report descriptive information about gesture kinematics (e.g., average peak velocity of a gesture). Table 1 provides the GAM modeling coefficients with each model's explained deviance for |ΔF0|. Please see Table S3 for the GAM results for the amplitude envelope, which showed generally poorer modeling performance (around 6% deviance explained or less), with a slightly better fit for vertical velocity (6.03%) as compared to speed (4.38%) or acceleration (5.43%). Figure 2 provides the fitted trajectories for all models. As a sanity check, we also model the kinematic trajectories, separated by the magnitude of the kinematic peaks (low, middle, vs. high magnitude).

Temporal coupling analysis
The related vocal trajectories show differences per magnitude of the peak, with (1) higher baselines for higher magnitude kinematic peaks (indicating a generally higher |ΔF0| during the 1 s around the kinematic peak) and (2) more pronounced peaks (indicating a heightened |ΔF0| at a particular moment relative to the peak in kinematics; i.e., temporal coupling is more pronounced for higher magnitudes of the kinematic peak). A separate figure shows the GAM fitted trajectories for |ΔENV|.
From inspecting the GAM results reported in Table 1, we noticed differences between the kinematic variables, with, for example, speed explaining 8.28% of the deviance in |ΔF0|. Note that these models are all generalizations over performers, and there might be considerable individual differences underlying these patterns. In the final "Individual and performance differences" section below, we report on performer differences in temporal coupling.

Quantifying magnitude coupling
We found that sudden vocal changes occur around peaks in speed, acceleration, and deceleration, especially for |ΔF0|, and secondarily |ΔENV|. We thereby show clear temporal coupling with kinematics.
We also found an indication that such peaks in vocal changes scale with the magnitude of kinematic peaks, providing clear evidence that magnitude coupling is also occurring. However, we should model the magnitude of kinematic peaks continuously relative to the magnitude of vocal changes, to provide a strong estimate of the differences in magnitude coupling between kinematic features. We do this for |ΔF0| as this was the acoustic feature most strongly coupled to kinematics.

F I G U R E 2 Generalized additive modeling around peak kinematics. The upper row shows the GAM predicted vocal trajectories around a kinematic peak (and reported in Table 1). The row below shows the kinematic trajectory relative to a positive or negative peak and provides a sanity check that we have normalized time correctly such that at time 0, there is a kinematic peak of a particular magnitude. The lowest left panel shows the vocal trajectories for speed and the lowest right panel shows enlarged vocal trajectories for positive acceleration.
For this analysis, we use information from the GAM models to sample the magnitudes of vocal changes around kinematic peaks. We consistently find that |ΔF0| is best predicted by all kinematic variables under a log-log transformation, as is also evident from Figure 3, indicating a nonlinear scaling between sound and kinematic magnitude. Log |ΔF0| is best predicted by log acceleration, as indicated by the higher marginal effect sizes (>14.9%) relative to the other kinematic variables (<10.3%), which quantify the fixed effects' contribution in explaining the data (Table 2). From these marginal effect sizes, we conclude that gesture acceleration (as compared to speed and vertical velocity) is the best predictor of the magnitude of |ΔF0|.
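The log-log transformation matters here because it linearizes a power law: if |ΔF0| ∝ acceleration^b, then log |ΔF0| = log c + b · log acceleration, and the exponent b can be read off as a regression slope. A minimal sketch on synthetic data (the generating constant, noise level, and the built-in one-third exponent are assumptions for illustration; the study itself used mixed-effects models rather than this simple fit):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical per-peak data: kinematic peak magnitude and nearby |ΔF0|,
# generated with a one-third power relation plus multiplicative noise.
accel_peak = rng.uniform(1.0, 50.0, 500)               # arbitrary units
df0 = 2.0 * accel_peak ** (1 / 3) * rng.lognormal(0, 0.2, 500)

# Ordinary least squares on the log-log data recovers the exponent.
slope, intercept = np.polyfit(np.log(accel_peak), np.log(df0), 1)
print(slope)   # close to the generating exponent of 1/3
```

The recovered slope is the scaling exponent, which is why a one-third power relation between acceleration and |ΔF0| appears as a straight line of slope ~0.33 in log-log space.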
Note that the conditional effect sizes are also informative about the by-performer and by-raga modulation of the kinematic effects: that is, how much variance is explained by the model if we unfix the fixed effects coefficients so that they vary per raga and performer. We conclude from all this that gesture-vocal coupling is generally best described by acceleration, as is evident from the higher fixed effect sizes. However, we also observe that, depending on performer and raga, positive vertical velocity can show high predictive performance. In other words, acceleration-|ΔF0| coupling is evident across ragas and performers, but vertical mapping is a more variably important mode of gesture-vocal coupling (conditional on a particular raga or performer).

Individual and performance differences
Gross generalizations across performers' gesture-vocal coupling can obscure large individual differences in coupling styles.
The previous magnitude coupling analysis already provided some information about differences between performers and ragas (as particularly evident in the conditional effect sizes for vertical velocity), and in this final Results section, we further visualize and analyze the cross-performance and cross-performer variability that underlies our general results.
As an indication of individual differences across performers and ragas in the temporal coupling analysis, we refitted the GAM trajectories for |ΔF0| by performer and raga, separately for vertical velocity z and acceleration (Figure 4).
To investigate this variability as seen in Figure 4 (and Figure 3), we extract information from our models for each gesture-vocal coupling relation, using GAM (temporal coupling) and linear mixed effects models (magnitude coupling) separately for each participant, so as to determine whether there are individual differences across performances in terms of which kinematic variable couples most with vocal acoustics (for simplification, we restrict this analysis to |ΔF0|). To visualize this variability across many models, we use dimensionality reduction (PCA) to plot potential differences in performances and/or performers (Figure 5).

FIGURE 4  |ΔF0| GAM trajectories are plotted per performer (1 through 4) and for the different ragas. The black lines reflect the general trajectory over all performers and ragas. Gray shaded areas indicate the standard error. Please note that interactive versions of the current graphs for velocity z (positive peaks; by performer and raga) and acceleration (positive; by performer and raga), as well as for velocity z (negative; by performer and raga), speed (by performer and raga), and acceleration (negative; by performer and raga), are provided on our supplemental page on the Open Science Framework.

There is considerable variability in gesture-vocal coupling across ragas and performers. To provide a quantitative indication of how well this variability is captured by performer or raga category, we set up a machine classification task using the R package caret (Kuhn 63 ). For three different seed initializations, we trained a random forest classifier on a randomly selected 50% of the data, so as to then predict the performer or raga for the other 50% of the data. We scaled and centered the data before training and used repeated cross-validation for training (for code, see: https://osf.io/3mquh/). Though our results should be interpreted carefully given that we do not have many datapoints, we found that the classifier could not differentiate ragas on the basis of the model results, yielding an average classification accuracy of 17.95% (Table S4). This suggests that gesture-vocal coupling is not sufficiently different across ragas. We did, however, find that a machine learning (ML) classifier could more reliably predict the performer class in the testing set (52.38% accuracy), suggesting that performers were somewhat stable in their gesture-vocal coupling (Table S5).

FIGURE 5  PCA biplots for raga (left plot) and performer (right plot) are shown, indicating how much of the variability in gesture-vocal coupling (as indexed by the deviance explained by the GAM model for each performance) is structured according to raga or performer. The arrows indicate the dimensions of variability. For example, a positive value in the direction of z.velocity.pos indicates that relatively more deviance was explained by a GAM model for |ΔF0| for that performance. Note also that some arrows are aligned, meaning that those dimensions are correlated (e.g., the arrows of acceleration.positive and acceleration.negative are aligned, indicating that the deviances explained are correlated).
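The overall workflow of this PCA-plus-classification analysis can be sketched as follows. This is a deliberately simplified stand-in, not the study's code: the per-performance feature matrix is synthetic, PCA is computed via SVD, and a nearest-centroid rule replaces the caret random forest; only the general shape of the analysis (per-performance coupling features, scaling and centering, a 50/50 split, accuracy against chance) mirrors what was done.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical modeling results: one row per performance, one column per
# kinematic coupling measure (e.g., deviance explained by that GAM), with
# performer-specific profiles so that performer structure is recoverable.
n_performers, per_performer = 4, 9                  # 36 performances (cf. n = 35)
performer = np.repeat(np.arange(n_performers), per_performer)
profiles = rng.uniform(2, 12, (n_performers, 5))    # per-performer coupling profile
X = profiles[performer] + rng.normal(0, 1.0, (performer.size, 5))

# PCA via SVD on centered and scaled data (the basis of a biplot like Figure 5).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)                     # variance ratio per component

# Nearest-centroid classification on a random 50/50 split: a simplified
# stand-in for the random forest trained with caret in the study.
idx = rng.permutation(performer.size)
train, test = idx[: idx.size // 2], idx[idx.size // 2:]
centroids = np.array([Z[train][performer[train] == k].mean(axis=0)
                      for k in range(n_performers)])
dists = np.linalg.norm(Z[test][:, None, :] - centroids[None, :, :], axis=2)
accuracy = float((dists.argmin(axis=1) == performer[test]).mean())
print(explained[:2], accuracy)
```

Because the synthetic features carry performer-specific structure, held-out performer accuracy exceeds the 25% chance level, analogous to the performer (but not raga) classifiability reported above.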

DISCUSSION
In this article, we studied coupling between gesture kinematics and vocal acoustics in a South Indian style of musical performance that is characterized by multimodal expression. We asked which kinematic variables, informative in turn about physical effort and impulse (acceleration), amount of movement per unit time (speed), and vertical mapping (vertical velocity), are most coupled with changes in vocalization acoustics (fundamental frequency and amplitude envelope). Across the board, kinematics were more strongly temporally coupled with F0 than with amplitude. This is interesting because, although amplitude variation is important in music performance, F0 is arguably a more perceptually significant acoustic variable for characterizing the nature and structure of a melody. The gestural indexing of musical pitch may have aesthetic and communicative benefits in this style. This is particularly the case considering the aesthetic importance in Karnatak music of conveying subtle nuances of pitch movement, along with the quality and rhythms of this movement, including which pitches are emphasized or de-emphasized (as discussed in the background on music section above).
We find that acceleration has the highest predictive performance for modeling the timing and magnitude of nearby change in F0, both for temporal coupling (are acoustic fluctuations timed relative to kinematic peaks?) and magnitude coupling (is the magnitude of the acoustic fluctuations scaled to the magnitude of the kinematic peak?). However, for certain performers, the coupling of upward movements with changes in F0 was also evident. To understand this individual and by-raga variation, we followed up with a machine classification analysis, in which we tried to predict the performer or raga based on gesture-vocal coupling modeling results across kinematic and acoustic variable relations. We found that the variability in gesture-vocal coupling is more determined by performer than by a particular raga, though in general the classification accuracy was poor, perhaps due to lack of data, and perhaps also suggesting that gesture-vocal coupling is something that is more richly and creatively varied by the performer.

Gesture-vocal coupling strength
In general, kinematic and acoustic relations were weakly associated in time (explaining less than 13% of the variance) to moderately associated in magnitude. As Levin notes, 26 scaling relations between a force perturbation on the body and its effects on more peripheral elements of the tensioned bodily system are also likely to be nonlinear, such that a more extreme perturbation (e.g., a 2× higher gesture acceleration peak) can be attenuated by the bioarchitecture, as its tensioned setup allows the distribution of forces over multiple elements of the system (leading to only a one-third-power acoustic effect). Though this finding aligns with those in human movement and biomechanics studies, the current magnitude coupling should not be treated as a power law in a strict 1:1 sense: the acoustic fluctuations in our data scaled more variably with kinematics than the classic negative one-third power law observed between movement curvature and speed, which almost patterns as a single-valued function. 66 Nevertheless, it is interesting that we obtain this novel multimodal power relationship between a kinematic and an acoustic variable, which can serve as a promising basis for further research connecting principles from human movement science with research on the human voice.
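The attenuation implied by a one-third power relation can be made concrete with a single calculation (a worked illustration, not a result from the data):

```python
# Under a one-third power relation, |ΔF0| scales as acceleration**(1/3),
# so doubling the kinematic perturbation does not double the acoustic effect.
ratio = 2.0 ** (1 / 3)
print(ratio)
```

The ratio is about 1.26: a 2× larger acceleration peak corresponds to only a ~26% larger acoustic change, consistent with forces being distributed and damped across the tensioned bodily system rather than transmitted one-to-one.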

Gesture-vocal coupling from a gesture-speech physics perspective
Though acceleration is most directly informative about physical impulses onto the body and was the kinematic parameter most reliably associated with acoustics in the current study, we are cautious in interpreting this as direct evidence for the gesture-speech physics hypothesis as originally proposed. [22][23][24]68 In this study, we only observe associations; we have not systematically varied kinematic and acoustic conditions so as to probe putative causal relations between forces and acoustics. Additionally, we find that performers' vocal peaks followed manual acceleration peaks but preceded manual deceleration peaks. This suggests that the connection between force transfer and vocal movements in this context is not straightforward (as it would be if all moments of force transfer connected with acoustic peaks), but rather that gestural movements are managed by performers to take into account both aesthetic and biophysical constraints. Further, it should be emphasized that acceleration is a kinematic variable, and it might be that the performance variable for the performer is also kinematic in nature; for example, the vocalizer might aim to change the direction of the movement during a vocal inflection to visually signal to the audience. Sudden changes in direction will also be preceded and followed by negative and positive peaks in the acceleration profile of the gesture. Further, head gesture accelerations have been found to align with external beats, 69 and these head gestures seem to function to visually signal piece onsets and tempo. 70 Thus, that acceleration is the kinematic variable coupling most reliably with structure in sound is a finding open to multiple interpretations regarding the underlying mechanisms. 71,72 At the same time, manual accelerations will by necessity involve force transfers in the musculoskeletal system.
Given the accumulating evidence that gestures can affect the voice directly through force transfer via the respiratory system (as evidenced by studies relating the mass and acceleration of the moving articulators to chest kinematics and vocalization), 73 and given that the current study also concerns voicing and gesturing and shows highly comparable relationships in a music context, we conclude that the current findings are in line with the general gesture-speech physics thesis.
Gesture-vocal coupling did diverge in some subtle ways from what has been observed previously in steady phonation and fluent speech. For example, in phonation and speech, the perturbing effects of gestures seem to have more extreme effects on the amplitude envelope than on F0. 22

Gesture-vocal coupling and bodily tensegrity: an aesthetic entanglement
In our theoretical contribution in this article, we propose that gesture-vocal coupling should be considered in relation to the tensegrity structure of the body, as well as within the wider aesthetic and performance context. In this sense, it can be viewed as a neuro-bodily distributed aesthetic entanglement. This entanglement occurs within the context of kinetic cross-modal mappings that are best understood as an active sensing and perturbing of the deformations of a prestressed, tensegrity-structured body due to gesture-induced physical impulses.
Tensegrity structures involve tensioned elements (connective tissues and muscles) and compressive elements (bones) that form a networked architecture.
This imbues such systems with particular dispositions, 77 which are characteristic of many living systems, such as (human) animal bodies, as well as cells. One of these dispositions is that there is always some level of tension, which naturally distributes locally induced forces over more peripherally connected sets of musculoskeletal elements. 78 This pre-stress entails that tensioning one element (e.g., clenching the left fist) can affect movement parameters (e.g., stiffness and amplitude) of more peripheral elements (right arm movement). 79 In the case of vocalizing and moving the upper limbs, we argue that the tensegrity structure of the body creates the conditions wherein a forceful gesture can affect respiratory-vocal processes, as the cascading effects of moving one body part can affect respiration and, therefore, vocalization. 22 It is not just that a bodily action can impact the wider musculoskeletal system "downstream" in this way, but also that the perception of those effects 80 becomes part of the aesthetic performance itself. That is, the "dynamical causal loop" 81 between gestural action and the sensed constraints on the respiratory-vocal system is regulated in relation to the musical context in each vocal performance.
We describe gesture-vocal coupling as an aesthetic entanglement in the sense that gesture-induced physics are brought into harmony with the performer's goals, linked to aesthetic norms that are part of the musical practice. The gesture-vocal complex is thus distributed across the neuro-bodily system, influenced by factors such as the particular body and tensegrity structure of the performer, 82 cross-modal perception, the performer's personal history (e.g., the influence of their teacher and learning process), and the structure and character of the performance. All of this is brought into active neuro-bodily harmony through gesturing and vocalizing in biomechanically stable ways.

Future directions
This study's finding that acceleration is the best statistical predictor of gesture-vocal coupling is consistent with the interpretation that mappings between force-producing movements and acoustic change are salient for gesture-vocal coupling in this context. This is consistent with observations by Paschalidou regarding the significance of gestural enactments of effort in North Indian vocal performance. 5 However, the findings of the current study do not prove that force transfer is what makes these mappings salient. We suggest that future research should, therefore, target this specific mapping in co-singing and co-musicking gesturing. An important future research endeavor is a more fine-grained study of which gesture-related muscle groups, implicated in respiratory control, are best functionally recruited during which vocal targets. Further, bodily gestures that can interact with the vocal system need not be limited to hand gestures, [83][84][85] and indeed, it is apparent in the current performances that movements were performed with the whole upper body. For example, postural changes associated with piano playing seem to affect superior airflow, and thereby the harmonic formants of vocal acoustics, when a performer sings while playing the piano. 86 In addition, more research is needed that addresses the generalizability of particular gesture-vocal couplings across and within performance styles. Within styles, it is important to note that the number of professional performers who participated in this study constitutes a relatively low "sample size" when weighed against conventional standards in psychology, and thus we should be cautious about making any sweeping generalizations from the current data alone.
Note, however, that the number of samples taken for this study is high and contributes to the reliability of the current results: many gesture events (n = 1630) were collected, recorded during multiple performances per performer (total performances, n = 35). Across styles, we would hypothesize that biomechanical stabilities between vocalization and gesture apply to all gesture-vocal performances. However, the ways in which this is manifested are likely to vary depending on the aesthetic qualities admired in the style. For example, much contemporary opera performance tends to favor relatively naturalistic acting, without additional gesturing. It would therefore be important in each case to examine how the biomechanical constraints discussed here are brought into harmony with the aesthetics and sociocultural context of the particular style.
Finally, viewing gesture-vocal coupling as a neuro-bodily distributed aesthetic entanglement invites systematic research into connections between musical syntax (ragas, phrases, and motifs) and gestural physics, both within and across performers. This aesthetic entanglement perspective considers the harmony between the physical tensegrity structure of human bodies and the aesthetic goals of a specific music performance. We propose that through such an approach, progress can be made in understanding why gesture manifests as it does in musical contexts.