Virtual Instrument Performances (VIP): A Comprehensive Review

Driven by recent advancements in Extended Reality (XR), the hype around the Metaverse, and real-time computer graphics, the transformation of the performing arts, particularly in digitizing and visualizing musical experiences, is an ever-evolving landscape. This transformation offers significant potential in promoting inclusivity, fostering creativity, and enabling live performances in diverse settings. However, despite its immense potential, the field of Virtual Instrument Performances (VIP) has remained relatively unexplored due to numerous challenges. These challenges arise from the complex and multi-modal nature of musical instrument performances: the need for high-precision motion capture under occlusion, including the intricate interactions of a musician's body and fingers with the instrument; the precise synchronization and seamless integration of various sensory modalities; accommodating variations in musicians' playing styles and facial expressions; and addressing instrument-specific nuances. This comprehensive survey delves into the intersection of technology, innovation, and artistic expression in the domain of virtual instrument performances. It explores musical performance multi-modal databases and investigates a wide range of data acquisition methods, encompassing diverse motion capture techniques, facial expression recording, and various approaches for capturing audio and MIDI (Musical Instrument Digital Interface) data. The survey also explores Music Information Retrieval (MIR) tasks, with a particular emphasis on the Musical Performance Analysis (MPA) field, and offers an overview of various works in the realm of Musical Instrument Performance Synthesis (MIPS), encompassing recent advancements in generative models. The ultimate aim of this survey is to unveil the technological limitations, initiate a dialogue about the current challenges, and propose promising avenues for future research at the intersection of technology and the arts.


Introduction
The digital evolution of performing arts, including musical experiences, in virtual settings stands at the forefront of a transformative era driven by Extended Reality (XR), the Metaverse, the widespread adoption of Artificial Intelligence (AI), and recent advances in real-time computer graphics. This shift has significantly altered the performing arts landscape, unlocking unparalleled possibilities for inclusivity, creativity, and live performances in diverse locations. Beyond the challenges accentuated by the recent pandemic, which served as a catalyst for these possibilities, our motivation is firmly grounded in the inherent potential for innovation and growth within virtual and mixed-reality spaces.
The digitization and visualization of the performing arts play a pivotal role in enhancing accessibility to art, preserving cultural heritage in an intangible format, and reaching a diverse global audience. This transformation not only ensures the long-term preservation, documentation, and analysis of these art forms for future generations but also serves educational purposes. It enables students to delve into the intricacies of various art forms, exploring their historical, cultural, and technical aspects. In this revolutionary era where the digital realm seamlessly connects with artistic expression, the performance of musical instruments in extended reality (XR) represents a dynamic evolution. We define this as Virtual Instrument Performance (VIP), a multimedia presentation that encompasses the comprehensive execution of a musical instrument within a virtual environment. This multidisciplinary art form combines musical skill with advanced audiovisual technologies for high-quality audio production, animations, and interactive elements, creating a holistic and captivating experience for both performers and audiences. Within this digital realm, the performer plays their instrument in a computer-generated world, where visual effects and animations synchronize with the music, enhancing sensory engagement. This blurs the boundaries between reality and virtual worlds, expanding the possibilities of musical expression and entertainment in the digital age. VIP represents a boundary-pushing fusion of music, visuals, and interactivity, transcending traditional barriers tied to physical presence and allowing artists to connect with worldwide audiences. The Metaverse, a shared virtual space, and digital twins, enhanced virtual replicas of physical entities, act as catalysts, driving performing arts into new dimensions. In this expansive digital canvas, artists craft immersive experiences that transcend geographical boundaries, leading to an era where live performances are not restricted to a specific stage but resonate across borders. Moreover, the flexibility introduced by recording and broadcasting liberates audiences from the constraints of time zones and schedules, allowing them to enjoy concerts at their convenience. Performing in a virtual environment unleashes new horizons for creativity, freeing artists from the constraints of the physical, tangible world while offering innovative means to engage and interact with their creations. Virtual spaces unlock a realm of boundless potential, from altering appearances and introducing virtual entities to defying gravity, unlocking new dimensions of scalability and creativity. Viewers can now access exclusive vantage points and unique perspectives on performances, and even participate in ways that were previously unimaginable.
However, creating virtual characters that convincingly play musical instruments presents significant challenges. Firstly, there is the loss of the live experience: watching a performance on a screen or in a virtual environment lacks the energy and connection between audience and performers. Additionally, several other hurdles must be overcome, including maintaining quality and authenticity (for example, camera angles, sound recording, and post-production editing can affect the viewer's perception of the performance), technical obstacles (for example, capturing the essence of a live performance and presenting it in a visually appealing way requires skill and equipment), and financial constraints, since the cost of digitization and visualization can be prohibitive. Like other performing arts, playing musical instruments encompasses intricate, multi-modal performances with complex and subtly detailed movements, making their acquisition, analysis, comprehension, and synthesis inherently demanding. In particular, data acquisition involves integrating and synchronizing various types of data, capturing precise motion with high fidelity, accommodating variations in musicians' playing styles, addressing occlusion challenges, and dealing with instrument-specific nuances. On the other hand, generating convincing animations of musicians playing musical instruments requires replicating instrument sounds accurately, synthesizing complex and multi-modal animations (covering pose, wrist, facial expressions, and instrument animations), infusing emotional expression, ensuring real-time interaction, and efficiently managing computational resources. Balancing these aspects necessitates advanced technology, including cutting-edge motion capture systems, sound modeling techniques, and advanced AI algorithms, all crucial for achieving the realism and expressiveness required for convincing virtual musical performances.
The importance of this domain, along with an acknowledgment of its challenges, has been emphasized by the attention it has received from various global organizations. These organizations provide financial support to numerous projects with the aim of shaping the future of performing arts digitization, visualization, and the advancement of their virtual enhancements. Among several others, European projects like PREMIERE [PRE23], SHARESPACE [SHA23a], CAROUSEL+ [CAR21], the Apollo project [APO23], and the PHENICX project [LGS15] are key players in this dynamic landscape. For instance, PREMIERE is dedicated to developing a comprehensive ecosystem of digital applications powered by advanced AI, XR, and 3D technologies to cater to the diverse needs of individuals involved in performing arts productions. Simultaneously, SHARESPACE paves the way for inclusive hybrid societies by facilitating remote interactions within a shared sensorimotor space. The CAROUSEL+ project allows online users to participate in online performing-art creations, such as dance, despite physical separation, addressing issues of isolation and loneliness. These developments also lay the foundation for novel forms of online communication and expression. On the other hand, the Apollo project adds a physical dimension to this digital landscape, establishing a permanent exhibition in the foyer of the Konzerthaus that provides visitors with insights into Berlin's musical heritage and the opportunity to experience virtual reality. The PHENICX project utilizes new digital methods to make classical music performances more accessible and engaging through innovative multi-modal enhancements. These projects serve as exemplary illustrations of the significant contributions that funding and collaborative efforts can make in shaping the future of performing arts. However, it is worth noting that none of these projects primarily focuses on VIPs, underscoring the untapped potential for future research in this domain.
In addition to global funding organizations, the past few years have witnessed a surge in interest from the industry within the realm of VIP. This transformative landscape has magnetized prominent artists who discern the immense potential of virtual concerts. Collaborative ventures with platforms like Roblox [Rob23], Meta's Horizon Venues [Met23a], WaveXR [Wav], and Epic Games' Fortnite [For] have given birth to immersive experiences that transcend conventional musical performances. Renowned figures such as John Legend [Leg20], who seamlessly combined vocals and piano, and acclaimed bands like Foo Fighters [Met22] and 21 Pilots [Mov23a] have boldly ventured into this digital frontier. Their efforts have not only gained a huge audience and attention but also made virtual concerts a profitable business, changing how we see art and redefining the landscape of artistic expression and entertainment.
Despite the transformative potential of virtual instrument performances, this dynamic and ever-evolving field has not received the research attention it deserves, mostly due to the formidable challenges it presents, which often act as barriers to further exploration. This survey serves as a groundbreaking state-of-the-art report, offering a comprehensive exploration of the intricate fusion of technology, innovation, and artistic expression in this domain. It goes beyond being a mere response to global challenges and instead positions itself as an enlightening guide to the boundless possibilities that the virtual world opens up for musical experiences. While a comprehensive musical performance encompasses a multitude of elements, our survey specifically emphasizes the instrumental dimension, focusing on the delicate nuances of musicians' movements and the audio quality of the music.
In particular, this survey explores recent advancements in data acquisition, with a specific focus on the multi-modal aspects within this field. Our study extends to existing multi-modal repositories, particularly those centered around musical instruments and musicians, which may serve as valuable resources for training AI networks and models. We have carefully assessed data acquisition methods and systems, which encompass a wide array of techniques, including motion capture, facial expression recording, and the capture of audio and MIDI data. Our evaluation highlights the strengths of these methods while also addressing the limitations and challenges they present. Furthermore, our study delves into recent techniques for Music Information Retrieval (MIR) tasks, with a particular emphasis on the Musical Performance Analysis (MPA) field, and offers an overview of various works in the realm of Musical Instrument Performance Synthesis (MIPS), encompassing recent advancements in generative models (e.g., methods that take MIDI information as their sole input and generate realistic animations featuring individuals playing musical instruments). Our analysis covers both the progress made in this area and the limitations that these innovative techniques currently face. The primary objective of our survey is to shed light on the current technological constraints, discuss ongoing challenges, and propose future research pathways in this continually evolving intersection of technology and the arts. To analyze and synthesize new VIPs, performance capture technologies are used to record multi-modal data of performers and their instruments. The data are then either directly used by analysis and synthesis systems or stored in databases using appropriate formats and representations. In some cases, these datasets are used for archiving purposes and are therefore enriched with metadata, analysis, synthesized data, and semantic annotations by experts in the respective domains.
Figure 2 provides a visual representation of the structural interconnection among the various sections within this survey.
Our survey is structured as follows: in Section 2, we begin by presenting the various representations and formats employed in repositories that store virtual instrument performances, including pose, facial, and audio files. Moving on to Section 3, we provide a comprehensive exploration of the existing datasets related to performing music. These repositories are categorized based on their modality, e.g., audio-modality or multi-modality, as well as their scope and the range of instruments they encompass. In Section 4, we delve into the technologies utilized for data acquisition. Here, we explore a multitude of methodologies and technologies that capture human movements, spanning from pose and facial expressions to finger dexterity and audio aspects. Section 5 offers an in-depth view of methods to analyze musical performances. This section serves as a canvas where we extract diverse musical properties and explore the nuances of artistic expression, considering inputs such as posture and finger extensions. Section 6 unfolds examples of Musical Instrument Performance Synthesis, presenting various methodologies and recent machine learning models employed to generate musical performances with different instruments. In Section 7, we engage in a thoughtful discussion, addressing the challenges and limitations encountered throughout our exploration of the virtual musical performances pipeline, and conclude our survey with closing remarks that encapsulate the essence of our journey across the vast realm of virtual instrument performance. This section not only provides a reflective analysis of the insights we have accumulated but also gives practical recommendations and outlines future research directions in this multi-disciplinary domain.

Background Knowledge
This section explores the vital concept of data representation in the context of VIP. From capturing the gestures of instrumentalists to the audio itself, data representation serves as the bridge connecting the world of art performance to the digital realm, enabling new possibilities for artistic expression and analysis. We start by mentioning various audio representations, and then we explore the motion representation of performers.

Audio Representation and Storage
This section delves into the diverse methods used for storing, describing, and documenting sound in the realms of music and technology. It encompasses various protocols that facilitate communication between audio hardware devices, a collection of music annotations that provide detailed descriptions of sounds, and a range of audio file formats optimized for music storage. Firstly, let's explore two popular Communication Protocols: MIDI and OSC. Musical Instrument Digital Interface (MIDI) is a standardized protocol and set of specifications used for the digital communication and control of electronic musical instruments and computer systems. MIDI enables the exchange of musical information and instructions between different devices. A MIDI message starts with a status byte indicating its type and channel, followed by pitch and velocity data bytes. For example, to play a note in MIDI, a "Note On" message is transmitted, with an assigned "velocity" setting that influences the note's volume [MIDa, Epi]. OpenSoundControl (OSC), similar to MIDI, serves as a protocol for the real-time exchange of messages between software and hardware in various applications [MIDb]. OSC is a newer protocol that can transmit a wider range of data types than MIDI, such as numerical values, strings, arrays, and even user-defined data structures, but it is also more complex and less widely supported. OSC is more suitable for a wider range of creative applications beyond traditional music, including interactive installations, multimedia performances, and communication between various types of software and hardware devices.
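The Note On byte layout described above can be sketched in a few lines. The following is a minimal, illustrative example; the function names are our own, not part of the MIDI specification:

```python
def note_on(channel: int, pitch: int, velocity: int) -> bytes:
    """Build a 3-byte MIDI Note On message: a status byte (0x90 | channel),
    followed by two 7-bit data bytes for pitch and velocity."""
    assert 0 <= channel < 16 and 0 <= pitch < 128 and 0 <= velocity < 128
    return bytes([0x90 | channel, pitch, velocity])

def parse(msg: bytes) -> dict:
    """Decode a Note On / Note Off message back into its fields."""
    status, pitch, velocity = msg
    return {
        "type": "note_on" if status & 0xF0 == 0x90 else "note_off",
        "channel": status & 0x0F,
        "pitch": pitch,
        "velocity": velocity,
    }

msg = note_on(channel=0, pitch=60, velocity=100)  # middle C at moderate volume
decoded = parse(msg)
```

The same three numbers (channel, pitch, velocity) are what a synthesizer or a VIP animation system would consume when driving a virtual performance from MIDI input.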
Secondly, we list the Basic Music Elements [Sar16], the fundamental concepts that define and give structure to a piece of music. They contribute to the mood, harmony, and rhythm of what we hear. Pitch: the frequency of the note's vibration (how high or low the sound is); Duration: how long or short the sound is; Dynamics: the volume (how loud or quiet the sound is); Timbre: the unique sound of an instrument, for example, an electric guitar sounds different from an acoustic guitar (the tone color of a sound); Melody: a succession of musical notes; Harmony: the simultaneous, vertical combination of notes, usually forming chords (multiple pitches played at the same time); Tempo: beats per minute (how fast or slow a piece of music is played); Texture: the density (thickness or thinness) of layers of sounds, melodies, and rhythms in a piece (a complex orchestral composition will have more possibilities for dense textures than a song accompanied only by guitar or piano).
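Two of these elements map directly onto simple formulas. Assuming twelve-tone equal temperament anchored at A4 = 440 Hz (MIDI note 69), pitch and tempo can be sketched as follows; the helper names are illustrative:

```python
def midi_to_hz(note: int, a4: float = 440.0) -> float:
    """Frequency of a MIDI note number: each semitone scales by 2**(1/12)."""
    return a4 * 2.0 ** ((note - 69) / 12.0)

def beat_seconds(bpm: float) -> float:
    """Duration of one beat in seconds at the given tempo (beats per minute)."""
    return 60.0 / bpm

concert_a = midi_to_hz(69)   # 440.0 Hz
middle_c = midi_to_hz(60)    # roughly 261.6 Hz
quarter_note = beat_seconds(120)  # 0.5 s per beat at 120 BPM
```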
Thirdly, we proceed with Audio Formats, which encapsulate the diverse ways in which digital sound is stored and represented. While some formats prioritize minimizing file size for easier sharing and storage, others focus on retaining the utmost audio fidelity for professional applications. Choosing the right audio format depends on the need for quality and the intended usage. WAV, AIFF, FLAC, and PCM provide high-quality, uncompressed or lossless audio, ideal for editing and archiving, though with larger file sizes. For online distribution, compressed formats like MP3, AAC, and OGG offer smaller files at the cost of potential quality loss. M4A and WebM are versatile, supporting various codecs, and are suited for web use and Apple devices. Ultimately, the choice should balance audio quality and file size, considering the end-user's platform and needs.
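As a concrete illustration of an uncompressed format, the sketch below writes a one-second 440 Hz tone as 16-bit PCM WAV using Python's standard-library `wave` module; the file name, sample rate, and amplitude are arbitrary choices for the example:

```python
import math
import struct
import wave

SAMPLE_RATE = 44100  # samples per second (CD quality)

# One second of a 440 Hz sine tone at half amplitude, as 16-bit integers.
samples = [
    int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE))
    for t in range(SAMPLE_RATE)
]

with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)        # mono
    f.setsampwidth(2)        # 2 bytes per sample = 16-bit PCM
    f.setframerate(SAMPLE_RATE)
    f.writeframes(struct.pack("<%dh" % len(samples), *samples))
```

Because WAV stores the raw PCM samples, the resulting file is large (about 86 KB per second here) but loses nothing, which is why such formats are preferred for editing and archiving.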
Lastly, we provide a reference list of acronyms and terminology associated with musical performances that will be employed in subsequent sections of this survey. MFCCs (Mel-frequency Cepstral Coefficients): features used to simplify audio signals, making them more amenable to analysis and pattern recognition; MFCCs are particularly valuable in speech and audio processing applications. Onset/Offset: the identification of the starting and ending points of musical notes; this process is crucial for the precise analysis of various musical elements, including tempo, pitch, and more. String quintets: musical compositions designed for five string players, often involving combinations of violins, violas, cellos, and double bass. Vibrato: a musical technique where the pitch of a note is subtly varied, typically through small, rapid oscillations in pitch, to add expressiveness and depth to the sound; vibrato is commonly used by string players and singers to enhance the emotional quality of their performance.
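Onset detection can be approximated very crudely by watching for jumps in short-time energy between consecutive frames. The toy sketch below is illustrative only (practical systems use spectral flux, MFCC-based features, or learned models), and its frame size and threshold ratio are arbitrary:

```python
import math

def frame_energy(signal, frame_len=256):
    """Short-time energy of each non-overlapping frame."""
    return [
        sum(x * x for x in signal[i : i + frame_len])
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]

def detect_onsets(signal, frame_len=256, ratio=4.0):
    """Frame indices where energy jumps by `ratio` over the previous frame."""
    e = frame_energy(signal, frame_len)
    return [i for i in range(1, len(e)) if e[i] > ratio * (e[i - 1] + 1e-12)]

# Silence followed by a tone burst: the onset lands at the frame boundary.
quiet = [0.0] * 1024
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(1024)]
onsets = detect_onsets(quiet + tone)
```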

Motion Representation and Storage
Typically, character animation is represented using joint/bone hierarchies: each bone's transformation is relative to its parent, and bones drive subsets of a mesh's vertices with specific influences (weights). These hierarchies allow for efficient manipulation and animation of the entire skeleton through techniques such as keyframe animation, forward and inverse kinematics, and motion capture. Rotations in these representations are typically expressed using Euler angles, Quaternions [PGA18], Rotation Matrices (or 6D representations) [ZBL*19], or variations such as Dual Quaternions [AAC22]. The storage and retrieval of this type of data is usually achieved using suitable motion capture file formats and protocols. One of the most commonly used motion capture formats is BVH (Biovision Hierarchical Data). It is divided into two sections: the first delineates the skeleton's hierarchical structure and initial pose, while the second captures the motion, providing channel data for each frame [MM*01]. Another format that has been gaining popularity in recent years is SMPL [LMR*15]. The Skinned Multi-Person Linear Model (SMPL) is a data-driven model that accurately captures a wide range of human body shapes and poses using a vertex-based approach. It utilizes parameters derived from the rest pose template, blend weights, pose-dependent blend shapes, identity-dependent blend shapes, and a regressor from vertices to joint locations. Furthermore, complementing motion capture data that realistically animate the body of a virtual avatar, facial capture has surged in prominence with the advancing horizons of technology, accurately translating the subtle movements of our faces into digital form for realistic representation (more details in Section 4.4). Central to this is the concept of "blendshapes". This technique involves a set of predefined facial expressions that can be blended in various combinations to represent a spectrum of human emotions. When these blendshapes are integrated into 3D models, they allow the models to emulate real-world facial expressions with incredible precision. To store and transfer these complex datasets, formats like FBX [AUT], Alembic [SL], and COLLADA [AB06] are utilized. These formats not only encapsulate the blendshape data but also ensure compatibility across different software and platforms.
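The blendshape technique described above reduces to a weighted sum of per-vertex offsets from a neutral mesh. The following toy sketch uses a hypothetical three-vertex 2D "face"; the shape names, vertex coordinates, and weights are all illustrative, not from any production rig:

```python
NEUTRAL  = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]   # neutral vertex positions
SMILE    = [(-0.1, 0.1), (1.1, 0.1), (0.5, 1.0)]  # target expression shapes
JAW_OPEN = [(0.0, -0.2), (1.0, -0.2), (0.5, 1.0)]

def blend(neutral, shapes, weights):
    """Apply a weighted sum of blendshape deltas to the neutral mesh."""
    out = []
    for v, (x, y) in enumerate(neutral):
        for shape, w in zip(shapes, weights):
            x += w * (shape[v][0] - neutral[v][0])  # accumulate per-vertex delta
            y += w * (shape[v][1] - neutral[v][1])
        out.append((x, y))
    return out

# A half-strength smile combined with a slightly open jaw.
frame = blend(NEUTRAL, [SMILE, JAW_OPEN], [0.5, 0.25])
```

Real facial rigs work the same way, only with thousands of vertices and dozens of shapes (often the 52 ARKit-style expressions), with the per-frame weight vector being exactly what facial capture systems record.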

Future Research
We argue that future research should incorporate facial expression data when capturing musical performance data, as facial expressions offer significant insights into the emotions and intentions behind the music. The interplay of facial expressions with musical elements provides a richer context, allowing for a deeper understanding and appreciation of the performance.

Multi-modal Datasets of Performing Music
Creating multi-modal repositories of musical performance data is a complex task that requires careful organization and systematic presentation. It also involves addressing significant challenges in data acquisition, including the capture of high-fidelity data, curation, and synchronization across various modalities (see Section 4). The intricate nature of music-related performance capture data adds an additional layer of complexity, with challenges like data occlusion, the capture of nuanced dexterous movements of the performer, and the need for standardized metadata to ensure the repository's quality, usability, and comprehensiveness. In this section, we provide an overview of various databases and repositories, each offering a unique perspective on musical content. These repositories encompass a wide range of data types, from sheet music, audio recordings, video, and MoCap, to a diverse spectrum of musical instruments, genres, and styles. Exploring these databases is a valuable step in the research process, enabling researchers to access, evaluate, and leverage existing resources to advance their work, validate algorithms, promote interdisciplinary collaboration, facilitate data integration, train machine learning models, and inspire innovative research directions, thereby contributing to the growth of knowledge in the field. In the following subsections, we briefly discuss music data archiving and then categorize various databases, firstly based on their data modality (audio or multi-modal) and secondly based on their primary intended usage; it is worth noting that certain datasets may be well-suited for multiple tasks, but we group them according to their predominant use cases. Our organization partially relies on the approach presented by Li et al. [LLD*19], offering a structured exploration of this rich landscape. Our analysis additionally includes recent repositories not covered in the original paper, along with datasets that exhibit greater variability and are not closely associated with URMP [LLD*19], ensuring a more comprehensive review of the available resources. While our primary focus lies on multi-modal datasets, we have also chosen to include repositories centered around audio and MIDI, recognizing their potential utility for the research community. Finally, this section includes a concise discussion on music composition.

Archiving Musical Performances
The organization and accessibility of any database play a pivotal role, with metadata serving as the hub, providing the essential descriptions of the underlying data. In essence, metadata functions as a documentation system for the data at hand. These metadata can be categorized into five primary types, each shedding light on different facets of the resources [DIHB08]:
1. Descriptive metadata aids in the discovery and identification of resources. It captures elements such as the pitch contour of a vocal line, the genre, or specific instrument types used in a musical composition.
2. Structural metadata delves into the organization of data, elucidating details like the sequencing of note annotations in a musical score or the hierarchy of layers in a multi-track recording.
3. Administrative metadata comes into play when documenting aspects like the date of a song's annotation, the file type of a performer's video, or rights related to the usage of motion capture data of a dancer interpreting the music.
4. Reference metadata might describe standard classifications, like predefined categories of emotion or sentiment, or reference points for motion capture data related to standard body movements.
5. Statistical metadata within these datasets could reveal patterns, such as the frequency of a particular emotion across several songs or common movements found in motion capture data across multiple performances.
To organize data systematically and cohesively, it is imperative to implement and delineate metadata schemas. These schemas illustrate the interconnections among various metadata components [Sic14]. The primary role of metadata is to assist users in locating information, exploring resources, and conducting in-depth examinations of the content and structure of the data. This is particularly vital for managing electronic resources and ensuring the digital preservation of information and assets. Similar to the work of Aristidou et al. [ASC19], which deals with the acquisition of dance data and proposed a schema for comprehensive archiving of dance performances, a schema for musical instrument performances should be established. While numerous schemas are focused on music data, they often overlook the multi-modal aspects of musical instrument performance data. To our knowledge, the closest resemblance to a schema describing multi-modal musical performance data is RepoVizz [MLMG11], a data repository and visualization tool that offers structured storage and browsing of multi-modal recordings. This tool stores data as DataPacks, which are essentially tree documents with nodes categorizing data, providing descriptions, or pointing to different data files, but it does not rely on a specific structural schema. Hence, we assert that the creation of a suitable metadata schema or protocol, designed to facilitate the organization and maintenance of a substantial volume of multi-modal musical performance data, is of paramount importance and will significantly benefit future research endeavors. In this survey, we will not delve deeply into the details of music metadata and archiving. However, for those seeking a preliminary exploration of this subject, we recommend the work of Serra et al. [SMB*13], which provides detailed insights and discussions on various aspects of music data, its organization, and preservation.
It serves as an excellent initial reference for anyone interested in delving into the specifics of music archiving.
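As a purely hypothetical illustration of how the five metadata types listed above might be organized for a single multi-modal recording, consider the sketch below. Every field name is our own invention for the example, not a published schema:

```python
# One record of a hypothetical multi-modal performance repository,
# grouped by the five metadata types [DIHB08]. All values are invented.
PERFORMANCE_METADATA = {
    "descriptive":    {"genre": "baroque", "instrument": "cello"},
    "structural":     {"tracks": ["audio.wav", "mocap.bvh"], "score": "score.musicxml"},
    "administrative": {"annotated": "2023-05-01", "license": "CC-BY-4.0"},
    "reference":      {"emotion_taxonomy": "predefined emotion categories"},
    "statistical":    {"mean_tempo_bpm": 72.5},
}

REQUIRED_TYPES = {"descriptive", "structural", "administrative",
                  "reference", "statistical"}

def validate(record: dict) -> bool:
    """A record is schema-complete when it covers all five metadata types."""
    return REQUIRED_TYPES <= set(record)
```

A real schema would of course constrain the fields within each type as well (controlled vocabularies, required keys, file-format enumerations), which is precisely the standardization effort this subsection argues is still missing for multi-modal musical performance data.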

Audio/MIDI-modal Datasets
This subsection briefly reviews several performance datasets that focus mainly on the audio and MIDI modalities. These datasets are categorized into four groups according to their predominant applications. The first category, "Pitch Estimation, Transcription, and Analysis", focuses on the foundational process of understanding music by extracting notations and interpreting individual notes, forming the basis for further analysis. The second category, "Music Information Retrieval and Instrument Recognition", centers on extracting metadata and differentiating between various musical sources. The third category, "Music Generation and Composition", delves into the creative aspect of music, emphasizing tools and algorithms designed to create novel sounds and automate musical composition. The final category, "Source Separation, Mixing, and Signal Processing", covers the technical aspects, focusing on refining audio quality, isolating vocals or instruments, and ensuring an optimal listening experience. Table 1 lists the audio/MIDI-modality repositories, categorized based on their primary scope. This table provides details regarding the instruments featured, duration, and data content and formats for each repository.
Pitch Estimation, Transcription, and Analysis: The MAPS database [EBD10] was designed as a robust resource for the music information retrieval community. Comprising MIDI-annotated piano recordings, it aims to further the evolution of pitch estimation and automatic transcription techniques, and boasts an array of sounds captured under varied conditions. Furthering the discourse on transcription, the LabROSA dataset [PE07], a collection of 130 pieces of audio and MIDI recorded on a Yamaha Disklavier grand piano, mainly aids research into classification-based transcription methods. Drawing attention to stringed instruments, the GuitarSet [XBP*18] stands out with its use of a hexaphonic pickup. This comprehensive dataset includes numerous acoustic guitar excerpts accompanied by time-aligned annotations, pivotal for transcription and performance analytics. For ensemble works, the TRIOS dataset [FP13] is a valuable resource, offering separated tracks from five chamber music trio recordings, along with their corresponding MIDI scores. The dataset by Su et al. [SY16] involves an innovative approach in which a musician recreates nine musical excerpts that are later checked for accuracy against annotated MIDI, with the aim of resolving any possible mismatches. Delving into classical realms, the Bach10 dataset [DPZ10] is tailor-made for polyphonic music research. Featuring ten J.S. Bach chorales, it provides a blend of audio recordings and accurate ground-truth data for each part played by distinct instruments. Shifting focus to orchestral compositions, the PHENICX-Anechoic dataset [MCOB*16] offers denoised recordings of four symphonic pieces, accompanied by note annotations sourced from the Anechoic Dataset [PPL08]. Concluding this category, the MusicNet dataset [THK17] contains classical music tracks from 10 composers and 11 instruments, spanning 34 hours, each annotated with precise, time-specific labels from 513 classes.

Music Information Retrieval and Instrument Recognition:
The Wood Wind Quintet (WWQ) dataset [BED09] provides insights from a single classical quintet, releasing a 54-second snippet for public use, which has served as a benchmark for the MIREX Multi-F0 Estimation And Tracking task [MIR23]. Moving to a broader spectrum, the RWC Music Database [GHNO02, GHNO03, G * 04] contains six unique collections, featuring everything from popular music to classical tunes, totaling 315 musical pieces. A standout aspect of this database is its exhaustive compilation of 50 instruments, capturing diverse playing styles and dynamics. It likewise provides the original audio signals, corresponding standard MIDI files, and, for song entries, supplementary text files containing lyrics. Furthermore, Nlakh [KPJ * 23] combines the NSynth [ERR * 17] and Lakh [Raf16] datasets and is offered in two distinct versions focusing on solo and mixed tracks. It caters to a wide instrument range and is notable for its large size. The SSMD dataset [HKS12] offers individual ground-truth annotated audio tracks from cover versions of popular western songs, majorly spotlighting vocals, with its library of 104 songs. Venturing into regional tunes, the iKala dataset [CYF * 15] contains high-quality Chinese pop songs, each paired with human-annotated pitch contours and time-marked lyrics, challenging separation algorithms with its inclusion of non-vocal segments. Last but certainly not least, the Free Music Archive (FMA) [DBVB17] sets itself apart as a vast repository, covering 343 days of audio from over 100K tracks, neatly categorized into 161 genres, while also offering a plethora of metadata.

Music Generation and Composition:
The MAESTRO dataset [HSR * 18] presents around 200 hours of audio and MIDI recordings from ten years of the International Piano-e-Competition. The recordings have been synchronized to maintain an accuracy close to 3 ms, and each piece is thoroughly annotated, offering insights into composers, titles and performance years. Another contribution comes from the ADL (Augmented Design Lab) Piano MIDI dataset [FLW20], which showcases a collection of piano compositions spanning various genres. Extracted and refined from the larger Lakh MIDI dataset [Raf16], this dataset emphasizes compositions associated with "Piano Family" instruments. Adding to this category, the dataset developed by Benetos et al. [BKD12] emerges as an instrumental tool for automatic piano tutoring. The dataset consists of seven real-world recordings, intentionally captured with a moderately detuned Yamaha U3 Disklavier. Each recording is a true reflection of human performances, complete with occasional mistakes, which are precisely documented in the MIDI ground-truth. The NSynth dataset [ERR * 17] offers a collection of 306K musical notes from 1,006 instruments, each categorized by its distinct pitch, timbre and envelope. Notably, each note is a monophonic audio snippet, covering every pitch on a standard MIDI piano and five distinct velocities. Notes are further annotated with details like their sound production source, their instrument family and various sonic qualities. The Nintendo Entertainment System Music Database (NES-MDB) [DMM18] features tracks synthesized by the iconic NES and spans approximately 46 hours of chiptunes. Each track in the dataset provides a score for four instrument voices, accompanied by details on dynamics and timbre. The POP909 dataset [WCJ * 20] includes piano arrangements for 909 songs, totaling 60 hours, produced by expert musicians and available in MIDI format.

Emotion and Sentiment Analysis/Generation in Music:
The VGMIDI dataset [FW19] is a collection of 823 pieces extracted from video game soundtracks in MIDI format. These tracks, converted to piano arrangements, are of varying lengths, with some as short as 26 seconds and others extending up to 3 minutes. The selection criteria focus on the pieces' emotional intensity, with 95 pieces annotated based on valence, indicating the emotion's positivity or negativity, and arousal, denoting the emotion's intensity. The EMOPIA dataset [HCD * 21] centers around the perceived emotion in pop piano music, combining both audio and MIDI formats. The emotion detected in each clip is verified through labels provided by a team of four annotators, ensuring a comprehensive understanding of the emotional content. Furthermore, the DEAM dataset [AYS16] offers a more expansive perspective on Western popular music genres, including but not limited to rock, pop, electronic, and jazz. It includes 58 full-length tracks and 1,744 45-second excerpts.

Multi-modal Datasets
This subsection is dedicated to the examination of performance datasets featuring multi-modal data. These databases incorporate a range of modalities, extending beyond audio and note annotations, and may encompass visual data, motion capture (MoCap) data, and information related to style and emotion. Several of them are supplemented with audio, MIDI, and skeleton data, as well as video resources with clearly visible hands. Finally, RepoVizz [MLMG11] emerges as a tool tailored to the needs of the scientific community studying music performance, as it is not only a data repository but also an effective visualization tool. It provides structured storage and user-friendly access to multi-modal recordings, spanning audio, video, motion capture, and much more. The goal of RepoVizz is to enable seamless online access to a shared music performance database, fostering collaboration and innovation among researchers.

Challenges and Limitations
As we conclude this section, it is essential to highlight the open challenges and limitations in the realm of multi-modal music databases. While these repositories are invaluable for various research areas, there are ongoing challenges related to data acquisition, documentation, and organization, such as the quality of the data, the synchronization of multiple modalities, dealing with data occlusions, interoperability, stylistic variations, and metadata standards. Moreover, as evident from Table 2, most of the datasets that feature motion capture data are predominantly centered on stringed instruments. This concentration on a specific subset of instruments represents a limitation within the domain of musical instruments. In light of this observation, future research initiatives should direct their efforts toward the establishment of repositories that encompass a more diverse array of musical instruments. The anticipated result of such endeavors is the creation of resources that are not only more expansive in their coverage but also more readily accessible. This expansion and increased accessibility would greatly benefit both the music and virtual instrument research communities, providing richer and more representative datasets for exploring and advancing the field.

Music Composition
In

Conclusions
It is crucial to underscore that the results generated in this section are not flawless. These shortcomings stem from various challenges, such as limitations in creativity and musical structure, difficulties in conveying emotion, restricted user interactivity, and inconsistencies in music evaluation criteria. Nonetheless, given the rapid progress in this field, we expect that more advanced tools for automated music generation will soon emerge, which will be well-suited for research purposes.

Musical Performance Capture
In the context of musical instrument performance, the interplay between a musician's bodily movements, finger dexterity, and facial expressions, combined with the characteristics of the musical instrument and the resulting auditory experience, collectively shapes the expressive and artistic delivery of the music. When it comes to digitizing a musical performance for archiving, documentation, streaming, analysis, and synthesis, it is essential to capture all the elements that are integral to the overall experience. This holistic approach to digitization and documentation is crucial for faithfully preserving the essence of the performance.
These elements encompass auditory aspects, such as capturing the unique timbre of the instruments and obtaining a high-quality audio recording that faithfully reproduces the voices of the performers. This audio component plays a pivotal role in retaining the emotional depth of the performance. Beyond the auditory aspects, digitization also extends to the visual components of the performance, including the appearance of the performers, their attire, the stage, lighting, shapes, colors, instruments, and any objects used. Moreover, the digitization process encompasses dynamic and kinesthetic components, including the performers' postures, the nuanced movements of their fingers during instrumental play, and the emotions conveyed through their facial expressions. The extent to which each of these elements is captured in detail can vary based on the objectives of replicating the performance in a virtual environment. Furthermore, apart from capturing the movements of the performers and the sounds of the instruments, in some cases it is also necessary to capture the kinematics of props. This includes the movement of instruments on the stage and the dynamics of instrument accessories, such as the drumsticks of a drum kit.
In the scope of this survey, our primary focus centers around the interaction between artists and their instruments, their ability to convey emotion, and the quality of the sound they produce and transmit within a virtual context. Therefore, our concentration is mainly directed towards capturing the dynamic movements of performers, which include their postures, finger actions, and facial expressions, as well as achieving a faithful reproduction of the instruments' sound as played by the performers. Within this section, we will explore various systems and technologies designed to capture each of these critical modalities. We will assess their suitability within the context of VIP, highlighting their advantages and drawbacks, and addressing the challenges they present. The aim is to provide a well-informed basis for selecting the most suitable capture technology that aligns with the specific needs of users in the realm of virtual instrument performance.

Music
Recording instruments can be accomplished through various techniques, with the resulting sound being saved as audio files, often annotated with the corresponding instrument(s) that produced the sound. Some instruments, particularly electronic ones (e.g., electronic keyboards, electronic drums, and certain wind instruments), support the automatic retrieval of MIDI data in addition to capturing the sound. This MIDI data is valuable for further processing and analysis.
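To make the MIDI modality concrete, the following sketch decodes raw note-on/note-off channel-voice messages as laid out in the MIDI 1.0 specification; the function and the example byte stream are our own illustrative constructions, not part of any capture system discussed here.

```python
def parse_midi_events(data: bytes):
    """Extract (event, channel, note, velocity) tuples from raw MIDI bytes."""
    events = []
    i = 0
    while i + 3 <= len(data):
        status = data[i]
        kind = status & 0xF0
        if kind in (0x80, 0x90):              # note-off / note-on status bytes
            channel = status & 0x0F
            note, velocity = data[i + 1], data[i + 2]
            # By MIDI convention, a note-on with velocity 0 acts as a note-off.
            name = "on" if kind == 0x90 and velocity > 0 else "off"
            events.append((name, channel, note, velocity))
            i += 3
        else:
            i += 1                            # skip message types not handled here
    return events

# Example: note-on for middle C (60) on channel 0, followed by its note-off.
stream = bytes([0x90, 60, 100, 0x80, 60, 0])
print(parse_midi_events(stream))
# → [('on', 0, 60, 100), ('off', 0, 60, 0)]
```

Real MIDI streams also carry running status, control changes, and meta events, which dedicated libraries handle; the sketch covers only the note messages that matter most for transcription.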
One of the initial and most common methods for capturing instrument sounds is to use microphones. When recording acoustic instruments, a common practice is to position a microphone in front of the instrument, as depicted in the left image of Figure 3. Conversely, when capturing the sound of electric instruments, the microphone is frequently situated in front of the amplifier to record the amplified sound, as demonstrated in the right image of Figure 3.
Alternatively, audio interfaces offer an effective means of audio acquisition. Among the available audio interfaces, the "Scarlett" [Foc23], produced by Focusrite, stands out as one of the most renowned and widely used options. To use these interfaces, instruments are connected to the audio interface using a jack cable, and the interface is then linked to a computer. Another notable audio interface is the "iRig" [Mul23], known for its portability. It allows for direct connections to smart devices like iPhones, iPads, or personal computers. An alternative method involves directly connecting an electronic musical instrument to a computer, provided that the instrument is electronic and compatible. This setup enables the automatic acquisition of both the instrument's sound and corresponding MIDI data. As mentioned earlier, this approach is particularly useful for electronic instruments that have built-in MIDI recording capabilities.
MIDI files are often used as the ground-truth transcription for music, enabling precise representation of the musical data. For instruments that lack integrated MIDI recording capabilities, manual annotation remains the most accurate method for creating ground-truth transcriptions. However, this manual approach is labor-intensive and time-consuming. To address this challenge, several software solutions for converting audio to MIDI are available. Some popular options include Basic Pitch by Spotify [Spo23], Piano Scribe by Google [Goo23], and Logic Pro [App23b], each offering various features for MIDI conversion.
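The arithmetic at the core of such audio-to-MIDI conversion is the equal-temperament mapping between a detected fundamental frequency and a MIDI note number. The minimal sketch below illustrates only this mapping (the helper names are ours, not from any of the tools above); real transcription software adds onset detection, polyphony handling, and much more.

```python
import math

def freq_to_midi(freq_hz: float) -> int:
    """Nearest MIDI note number for a detected fundamental frequency,
    via the equal-temperament relation f = 440 * 2**((n - 69) / 12),
    where 69 is the MIDI number of A4 (440 Hz)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))

def midi_to_name(note: int) -> str:
    """Human-readable pitch name for a MIDI note number (C4 = 60)."""
    names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    return f"{names[note % 12]}{note // 12 - 1}"

print(freq_to_midi(440.0), midi_to_name(freq_to_midi(440.0)))    # → 69 A4
print(freq_to_midi(261.63), midi_to_name(freq_to_midi(261.63)))  # → 60 C4
```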
Finally, in the domain of audio and music editing, as well as notation, a wide range of software applications is available for recording, post-production, and musical composition. These applications cater to different user needs and preferences, ranging from industry-standard commercial tools [App23b, Abl23, Stu23, Ste23, Ado23] to software designed for small businesses or home users [App23a]. Additionally, there are research-based solutions for specialized applications [Aud23, CLS10, MRL * 15, LL21]. However, the discussion of these methods and tools is beyond the scope of this survey. For a comprehensive review of audio editing methods and tools, readers are encouraged to refer to the following works [Col13, Mat23a], which provide an in-depth exploration of this topic.

Body Movement Capture
Motion capture technology has played a pivotal role in digitizing, preserving, and disseminating intangible creations, such as dance performances [ASC19] or sport performances [vdKMR18]. Recent years have witnessed a growing demand for realistic 3D animation in various sectors, including media, entertainment, research, and training, prompting industries to seek effective 3D motion capture solutions. The advantages and disadvantages of these technologies have been extensively reviewed in surveys such as [WF02, MHK06]. This technology has found widespread application in the entertainment industry, notably in the production of animated films, video games, and virtual reality experiences. Recent advancements in hardware and software, including high-speed cameras, inertial measurement units, and depth sensors, have significantly enhanced the sophistication and accuracy of motion capture.

Figure caption: The left image shows musicians playing violin, motion captured using an optical MoCap system with reflective markers, tracking their body and instrument motion (image extracted from [Fut]). The right image shows a musician playing piano, while the movements of the finger joints are tracked with 5 mm reflective markers (image extracted from [Ger]).
Motion capture can be broadly categorized into marker-based and marker-less systems. The choice of the most suitable system depends on the required quality and purpose of the application (e.g., mobility, interaction), as well as the desired level of accuracy and precision, allowing for the capture of even the most subtle movements and expressions, such as finger gestures, facial expressions, and even eye movements. In the following subsections, we offer a concise review of the prevalent technologies and systems utilized for capturing human motion. This will encompass a comparative analysis of different methods and an examination of more intricate movements, including fingers and facial expressions.

Marker-based Systems
Marker-based systems necessitate the attachment of sensors, markers, or stickers to the bodies of performers. These systems can be categorized into two main types: optical and inertial-based systems.

Optical Systems
Optical motion capture systems use fiducial markers placed near joints for real-time data acquisition. Popular in studios, these markers enable 3D positioning via high-speed cameras using triangulation. Passive systems like Vicon [Sys23] and NaturalPoint's OptiTrack [Opt23b] use retroreflective balls, offering high accuracy but suffering from sensitivity to lighting and marker-swapping issues. Active systems like PhaseSpace [Pha23b] and Qualisys [Qua23] use LEDs for cleaner, labelled data but require wires and power sources. While precise, optical systems are costly, intrusive, lack portability, and require extensive setup. Data cleaning, especially for occlusions, remains a challenge [AL13, PHLW15, LC10, SDB * 12], with recent attempts using Deep Learning (DL) for denoising and restoring missing markers [Hol18, CWZ * 21]. Wheatland et al. [WWS * 15] survey several systems and technologies, highlighting their advantages and limitations. An example of full-body and finger tracking using an optical motion capture system with reflective markers is shown in Figure 4.
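To illustrate the simplest form of the gap-filling step of optical MoCap cleaning, the toy sketch below linearly interpolates occluded samples in a single marker coordinate track. Production tools and the DL methods cited use far more sophisticated spline-, model-, or learning-based reconstruction; all names and values here are illustrative.

```python
def fill_marker_gaps(track):
    """Fill occluded samples (None) in one marker coordinate track by
    linear interpolation between the nearest observed neighbours.
    Gaps touching the start or end of the track are left unfilled."""
    filled = list(track)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            j = i
            while j < len(filled) and filled[j] is None:
                j += 1                         # find the end of the gap
            if 0 < i and j < len(filled):      # interior gap: interpolate
                a, b = filled[i - 1], filled[j]
                for k in range(i, j):
                    t = (k - i + 1) / (j - i + 1)
                    filled[k] = a + t * (b - a)
            i = j
        else:
            i += 1
    return filled

# A marker occluded for two frames between observed positions 1.0 and 4.0:
print(fill_marker_gaps([1.0, None, None, 4.0]))  # → [1.0, 2.0, 3.0, 4.0]
```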

Inertial-based Systems

Inertial systems, including XSens [Mov23c] and Rokoko [Rok23b], use micro-inertial measurement units (IMUs), biomechanical models, and sensor fusion algorithms for motion capture. These systems measure rotational rates using gyroscopes, magnetometers, and accelerometers, translating them into a skeleton model. Tesla Suit [Sui23b] has introduced a suit that additionally encompasses a full-body haptic feedback system using electro muscle stimulation and transcutaneous electrical nerve stimulation. While inertial-based systems offer advantages such as cost-effectiveness, portability, and suitability for outdoor use, they are not without challenges. They can be complex, lack precise orientation measurement, and suffer from positional accuracy issues and drift over time. Despite these challenges, they are gaining popularity among independent game developers due to their quick setup. More recently, there is a trend towards reducing the equipment and body attachments for motion tracking, using only six or even fewer inertial sensors, such as Sony's MoCapi [Den22]; several machine learning techniques using sparse sensors show promise, especially in applications like virtual reality and sports training [YZH * 22, PYA * 23, DKP * 23]. However, these methods are still in research development and face challenges in capturing highly dynamic and heterogeneous movements.
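As a rough illustration of the sensor-fusion idea behind such systems, the sketch below implements a one-axis complementary filter, a common textbook technique rather than the proprietary algorithms of the products above: the gyroscope is integrated for short-term responsiveness, while the accelerometer's gravity reference bounds the drift. All names and values are illustrative.

```python
def complementary_filter(gyro_rates, accel_angles, dt, alpha=0.98):
    """Fuse gyroscope rates (deg/s) and accelerometer tilt angles (deg)
    into one tilt-angle estimate per sample.

    Toy one-axis version: the gyro term tracks fast motion, and the
    (1 - alpha) accelerometer term slowly pulls the estimate back to
    the gravity reference, cancelling integration drift.
    """
    angle = accel_angles[0]
    estimates = [angle]
    for rate, acc in zip(gyro_rates[1:], accel_angles[1:]):
        angle = alpha * (angle + rate * dt) + (1 - alpha) * acc
        estimates.append(angle)
    return estimates

# A stationary IMU with a 0.5 deg/s gyro bias: pure integration would drift
# without bound, but the accelerometer term keeps the error bounded.
est = complementary_filter([0.5] * 1000, [0.0] * 1000, dt=0.01)
print(round(est[-1], 3))  # → 0.245 (bounded, instead of ~5 degrees of drift)
```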

Markerless Systems
The markerless family of methods and systems is less intrusive than the previous two families of methods, as it eliminates the need for subjects to wear specialized tracking equipment. Typically, the subject's silhouette is captured from various angles using single or multiple vision or RGB-depth cameras along with specialized software. Markerless approaches encompass three primary methodologies: generative, discriminative, and hybrid. Generative (model-based) methods determine a person's pose and body shape by fitting a template model to data extracted from images. By inputting a set of model parameters, such as body shape, bone lengths, and joint angles, a representation of the model is generated, capturing the pose and shape of the body [GPKT10, WZC12, YLH * 12, HYXC15, YSD * 16]. On the other hand, discriminative (model-free) approaches map image features directly to pose descriptions or search a database of poses to find the closest match to the current image, as seen in studies such as [SSK * 13, TSSF12, PMTS * 15, YGTW15]. A blend of the previously mentioned strategies is utilized in hybrid methods [BMB * 11].
Many researchers prefer single-camera setups for markerless motion capture because of their cost-effectiveness, simplicity, and speed. Monocular systems are generally less expensive than multi-camera configurations, with quicker setup times and fast data processing, and have been used in various recent studies [PCG * 19, ZPT * 19, YZZ * 20]. To address the challenges of occlusions in monocular systems, and to achieve greater accuracy and precision along with full 360-degree coverage, the use of multiple cameras has become more prevalent [HAF * 16, OERF * 16, DDF * 17], at the cost of a more complex configuration and increased processing demands. In the recent era of DL, there has been a significant increase in efforts dedicated to pose reconstruction, utilizing both single and multiple camera setups, to enhance accuracy, adaptability, and automation. DL models can automatically extract and learn complex features from raw data, perform end-to-end learning, adapt to various poses and conditions, and deliver high accuracy. The approach benefits from extensive data availability, parallel processing capabilities, and continuous advancements in model architectures, making it a versatile and powerful option for accurately estimating poses in diverse applications [MSM * 20, SAA * 20, HZZ * 21]. Several commercial motion capture systems and companies in use today fuse markerless technology with DL, e.g., Microsoft's Kinect [Mic23], Move.AI [AI23], DeepMotion [Dee23], Plask [Pla23], Mediapipe [LTN * 19], FreeMoCap [Mat23b], etc. Nonetheless, they have not yet achieved the level of accuracy and fidelity seen in optical MoCap systems.
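The geometric principle that lets multiple calibrated cameras resolve depth can be shown with the rectified-stereo special case of triangulation, where depth follows directly from the disparity between the two image projections. The sketch below uses illustrative values and is not tied to any specific system mentioned above.

```python
def stereo_depth(x_left: float, x_right: float,
                 focal_px: float, baseline_m: float) -> float:
    """Depth of a point seen by a rectified stereo camera pair.

    Simplest case of multi-view triangulation: Z = f * B / d, where
    f is the focal length in pixels, B the camera baseline in metres,
    and d the horizontal disparity between the two image projections.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("point must have positive disparity")
    return focal_px * baseline_m / disparity

# A point at pixel column 640 in the left image and 600 in the right,
# with a 1000 px focal length and a 12 cm baseline:
print(stereo_depth(640, 600, focal_px=1000, baseline_m=0.12))  # → 3.0 (metres)
```

General multi-view setups replace this closed form with a least-squares triangulation over all camera rays, but the depth-from-disparity intuition is the same.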
The main advantages of these methods lie in their affordability, portability, the absence of body-attached sensors, and ease of setup. However, they encounter challenges when the articulated body is obscured from cameras due to self-occlusions or occlusions by other objects, subject clipping, or when the subject wears extensive clothing like bulky costumes. Furthermore, localizing subjects in a global coordinate system becomes extremely challenging without multiple synchronized video sources. Proper lighting conditions are essential, given that performances may vary from low-light conditions to illumination from several light sources. The clothing of performers and the complexity of the environment add to the challenges of obtaining the desired outcomes. To address these challenges, controlled lighting and controlled background environments are typically employed. Despite these efforts, capturing multiple characters becomes problematic when other elements in the scene obstruct the subject's view, especially in scenarios like performances on a stage crowded with multiple people and objects. In comparison to optical or IMU-based systems, these methods have not yet achieved the same level of fidelity and versatility.
Recently, volumetric capture has gained prominence among vision-based methods. This approach constructs 3D models from multiple 2D images or videos, as demonstrated by companies like 4Dviews Studios [4DV23] and Evercoast [Eve23]. While useful for virtual reality and animation, it requires numerous cameras and view angles to produce a detailed and accurate 3D model, and it is sensitive to lighting and moving objects. Heavy clothing or other elements in the scene pose challenges, obscuring body shape and creating shadows. Moreover, characters are usually represented as one combined mesh with their clothes, which makes it difficult to separate different costume layers or to perform rigging and skinning.

Discussion on the MoCap Categories
Optical motion capture technology stands as the industry standard for capturing the movements of intangible entities, including musicians.

Fingers Capture
Finger motion capture differs from full-body acquisition due to the intricate nature of hand movements, demanding advanced precision in capturing the highly articulated motions of the fingers. This precision is particularly crucial for applications such as surgery, sign language interpretation, or playing musical instruments. These systems encounter unique challenges that set them apart from whole-body tracking, including self-occlusion and precise contact modeling. They often require specialized hardware, such as gloves or infrared cameras.
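The articulated structure that makes fingers demanding to track can be illustrated with a toy planar forward-kinematics model, in which each joint's flexion angle accumulates along the segment chain; the segment lengths, angles, and names below are illustrative and not drawn from any particular capture system.

```python
import math

def finger_tip(joint_angles_deg, segment_lengths_cm):
    """Planar forward kinematics for one finger's joint chain.

    Each joint angle is the flexion relative to the previous segment;
    with all angles at 0 the finger lies straight along the x-axis.
    Returns the (x, y) position of the fingertip.
    """
    x = y = 0.0
    heading = 0.0
    for angle, length in zip(joint_angles_deg, segment_lengths_cm):
        heading += math.radians(angle)       # flexion accumulates along the chain
        x += length * math.cos(heading)
        y += length * math.sin(heading)
    return x, y

# A straight index finger (MCP, PIP, DIP joints all at 0 degrees):
print(finger_tip([0, 0, 0], [4.0, 2.5, 2.0]))  # → (8.5, 0.0)
```

Because small angle errors at the knuckle propagate to large fingertip errors at the end of the chain, finger capture demands markedly higher angular precision than full-body tracking.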
The level of detail and precision needed to capture hand and finger movements can vary greatly based on the project's specific requirements. Some applications require high-fidelity tracking of hand and finger motions, while others prioritize capturing broader body movements. In contexts like musical performances, finger and hand movements play a critical role in instrument playing. To achieve accurate tracking of these intricate movements, a reliable motion capture system is essential. Therefore, assessing a system's ability to capture hands and fingers is vital for tailoring it to a project's specific needs.
Numerous commercial motion capture systems offer specialized gloves designed for finger motion capture. These include optical-based products such as the Vicon, OptiTrack, and PhaseSpace gloves [Sys23, Opt23a, Pha23a], as well as inertial-based products like Rokoko Smartgloves [Rok23c], Xsens Metagloves [Mov23b], Tesla Glove [Sui23a], and Perception Neuron Studio Gloves [Neu23]. Moreover, the MANUS Quantum MoCap Metagloves can be integrated into most industry-standard motion capture systems [Met23b]. These products inherit both the advantages and limitations of their respective system families, as described in the previous section. It is important to note that gloves may not be suitable for musicians, as they can interfere with their ability to play instruments effectively. One potential solution is to employ an optical motion capture system that uses small markers applied directly to the fingers and hands, or thin gloves with markers, as demonstrated in relevant research papers [Ari18, KMO * 09, PPHB18].
Another approach is to use camera-based systems that do not require physical markers attached to the body, such as the Leap Motion Controller [Ult23]. There is a wide range of specialized hand tracking methods that rely on silhouette extraction principles and achieve animation by fitting a skeleton into a 3D model. Recent advancements in this field, exemplified by methods like DeepMotion [Dee23], Move.AI [AI23], Google's MediaPipe [LTN * 19], and the Free Motion Capture Project (FreeMoCap) [Mat23b], have expanded their capabilities to encompass tracking of complete body parts, including the face, hands, and fingers, even for multiple individuals, using only monocular video input.

Markerless Systems Accuracy
The accuracy of markerless systems in hand tracking and reconstruction surpasses that of full-body tracking, primarily due to the more constrained articulation of hand movements. However, while these markerless systems may not be the primary choice for hand tracking in musical instrument applications, due to their sensitivity to lighting and environmental conditions and their susceptibility to occlusions, they are more commonly used for hand tracking than as full-body tracking solutions.

Face Capture
Facial capture is specialized and distinct from full-body motion capture due to the unique challenges associated with capturing the complexity and subtlety of facial expressions, as well as its critical role in conveying emotions and character in various applications. Further distinctions include the technical challenges involved in capturing minute facial details and the priority placed on realism over efficiency in facial capture setups. These distinct requirements contribute to the specialization of facial capture as a field within motion capture technology.
Facial expression capture and motion transfer to virtual characters have been subjects of research for over three decades. Central to this has been the Facial Action Coding System (FACS), developed by Paul Ekman and Wallace V. Friesen in the 1970s [EF78]. Industry standards in facial motion capture have converged towards the utilization of Head-Mounted Cameras (HMCs), exemplified by systems such as Vicon's Cara [Car]. These specialized helmets can accommodate both cameras and smartphones, ensuring a stable and consistent perspective of the actor's face, even during head movements. This prevents any blurriness in the captured expressions and maintains the quality of the data. The lightweight and comfortable design of HMCs preserves the artist's freedom of movement, providing a seamless experience during performances. Moreover, the capabilities of HMCs extend beyond facial motion capture, encompassing comprehensive motion capture setups that include the entire body and musical instruments, as illustrated in Figure 5. This exemplifies the versatility and wide-ranging applications of HMCs in motion capture.
There are two principal categories within the domain of HMCs: marker-based and markerless systems. Marker-based systems use physical markers tracked by cameras. Examples include reflective-marker systems like Vicon and OptiTrack, which use small reflective spheres and infrared cameras, as well as painted or sticker markers applied directly to the actor's skin.
Despite their accuracy and reliability, these systems can be intrusive and time-consuming to set up. On the other hand, markerless systems eliminate the need for physical markers, relying on computer vision and machine learning algorithms to directly track facial movements, for example, by detecting landmark points on the face to extract facial expressions. This category includes depth-sensing cameras (e.g., Apple's Face ID using TrueDepth Camera technology [App23c]) that generate a three-dimensional map of the face, RGB cameras combined with software algorithms (e.g., Faceware [Fac23b]), and smartphone applications capable of markerless motion capture. A trend towards markerless systems is evident, marked by a transition from compact cameras such as GoPro to devices featuring Apple's TrueDepth Camera technology. This technology captures facial data by projecting and analyzing numerous invisible dots, generating a depth map and concurrently recording an infrared image of the face at high resolution (up to 4K) and frame rate (up to 240 fps).
Both marker-based and markerless systems have their distinct advantages and are suited to different applications. Marker-based systems, while potentially intrusive, offer unparalleled accuracy, especially for subtle facial expressions. Markerless systems, in contrast, provide rapid setup and are less obtrusive, but may not achieve the same level of precision. The selection between these two types of systems should be informed by the specific requirements of the project and the resources available.
In terms of software, there are numerous specialized applications designed to animate characters based on the facial data captured by the camera. Among the many options available, such as Maya [Aut23] and Blender [Ble23], MetaHuman Animator [Gam23] from Epic Games stands out as a leading solution. This software enables the rapid and precise translation of real-world performances into high-fidelity facial animations, compatible with both iPhone and stereo HMCs. Other applications, such as Live Link Face [Fac23a], Rokoko's Face Capture [Rok23a], and iClone [Rea23], also offer real-time facial motion capture and are compatible with Apple smart devices.
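Many of these tools ultimately drive a linear blendshape face model, in which captured per-frame expression weights mix precomputed shape offsets into the neutral mesh. The sketch below shows only that weighted-sum step on a deliberately tiny "mesh"; the shape names and values are purely illustrative, not the internal format of any product above.

```python
def apply_blendshapes(neutral, deltas, weights):
    """Deform a neutral face mesh by a weighted sum of blendshape deltas.

    Implements v = v_neutral + sum_i w_i * delta_i, the linear model that
    facial-capture pipelines commonly drive with per-frame expression
    weights. Meshes are flat lists of vertex coordinates for simplicity.
    """
    out = list(neutral)
    for name, w in weights.items():
        for i, d in enumerate(deltas[name]):
            out[i] += w * d                  # accumulate each weighted offset
    return out

# One vertex, two toy shapes: 'smile' raises it by 2, 'jaw_open' lowers it by 1.
neutral = [0.0]
deltas = {"smile": [2.0], "jaw_open": [-1.0]}
print(apply_blendshapes(neutral, deltas, {"smile": 0.5, "jaw_open": 1.0}))
# → [0.0]  (the two half-activated offsets cancel exactly)
```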

Challenges in Multi-modal Synchronization
Synchronizing multi-modal data captured during an instrumental music performance is of critical importance. The research conducted by Li et al. [LLD * 19] addresses the intricate challenge of synchronizing concurrent sound sources when generating multitrack datasets. Furthermore, when integrating data from diverse modalities using varying capture devices, achieving synchronization among these devices is vital to attain the desired output. These devices may possess distinct processing speeds, capture frequencies, and data transfer rates, which can introduce inconsistencies. While manual synchronization of all devices is possible, this approach is labor-intensive and susceptible to errors. An efficient alternative is to employ a global clock. Timecode generators are commonly used for this purpose, maintaining local synchronization across devices by assigning a unique code to each frame or data packet. This ensures a consistent timeline across multiple devices, thereby facilitating precise data alignment. Another method that can aid in synchronization, though it may not completely solve the issue, is Genlock, which ensures that all devices operate at the same capture frequency. This is particularly crucial in scenarios where even minor differences in data capture rates can result in significant inconsistencies in the final output. Most MoCap systems support both Genlock and Timecode [Opt, XSe]. Typically, a central control PC or synchronization unit is utilized to initiate and conclude recordings on the various devices within the same network.

© 2024 The Authors. Computer Graphics Forum published by Eurographics and John Wiley & Sons Ltd.
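The timecode-based alignment described above amounts to converting each device's SMPTE-style stamp into a common time base. The snippet below sketches this for non-drop-frame timecode with illustrative frame rates; it is not the API of any specific MoCap system.

```python
def timecode_to_frames(tc: str, fps: int) -> int:
    """Convert an HH:MM:SS:FF timecode stamp to an absolute frame index.

    With every device stamped by a shared timecode generator, streams
    recorded at different rates can be aligned by converting stamps to
    a common time base (non-drop-frame timecode assumed for simplicity).
    """
    hh, mm, ss, ff = (int(part) for part in tc.split(":"))
    return ((hh * 60 + mm) * 60 + ss) * fps + ff

def align_offset_seconds(tc_a: str, fps_a: int, tc_b: str, fps_b: int) -> float:
    """Seconds by which stream B's stamp is later than stream A's."""
    return (timecode_to_frames(tc_b, fps_b) / fps_b
            - timecode_to_frames(tc_a, fps_a) / fps_a)

# A 120 fps MoCap take and a 25 fps video take stamped by the same clock:
print(round(align_offset_seconds("01:00:00:00", 120, "01:00:02:12", 25), 2))
# → 2.48
```

In post-production, this per-stream offset is what lets audio, video, and MoCap tracks be snapped onto one shared timeline despite their differing capture rates.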

Musical Performance Analysis
The analysis of musical performances involves evaluating a range of modalities that stem from playing a musical instrument. This assessment encompasses not only the music itself but also the performer's posture, including body language, finger movements, and facial expressions. Both perspectives offer valuable insights into the nuances of the performance and contribute to a comprehensive understanding of the artistic expression. MPA, in particular, focuses on the evaluation of live musical performances, examining how musicians interpret a piece and highlighting nuances in variations and expressiveness that go beyond the original score. For instance, one application of MPA is tutoring musical instrument learning, where students perform and receive feedback [EMNS20]. Another MPA example is illustrated by the PHENICX project [LGS15], which focuses on visualizing information within orchestral music, incorporating elements from the musical score and performance-related aspects. It is important to recognize that the interpretation of a musical piece during a performance can profoundly influence listeners' perceptions. Even when working with the same musical score, different renditions can lead to distinct preferences and interpretations among listeners. The parameters of music audio performance can be categorized along the same fundamental dimensions as audio: tempo and timing (musicians adjust tempo and timing during performance for expressive effect), dynamics (performers make decisions about volume variations based on their musical judgement), pitch (musicians enhance musical expression by employing techniques like vibrato, adding nuances to the prescribed pitches in the score), and timbre (performers shape the timbre of a musical piece through their playing techniques and instrument configurations) [Ler12, LAPG19, LAPG21]. Within the domain of MPA, there are relevant surveys that offer valuable insights, including [Gab99, Gab03, GDDP * 08, Ler12, LAPG19, LAPG21].
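As a small illustration of one of these dimensions, the sketch below (a crude proxy with hypothetical frame sizes, not a method from the cited surveys) estimates performed dynamics as a frame-wise RMS loudness curve:

```python
import math

def rms_db(frame):
    """Root-mean-square level of one audio frame, in dB relative to full scale."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20 * math.log10(max(rms, 1e-12))  # floor avoids log10(0)

def dynamics_curve(samples, frame_len=1024):
    """Frame-wise loudness curve: a crude proxy for performed dynamics."""
    return [rms_db(samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

# a synthetic crescendo: a 440 Hz tone whose amplitude ramps from 0.1 to 0.8
sr = 8000
samples = [(0.1 + 0.7 * n / (2 * sr)) * math.sin(2 * math.pi * 440 * n / sr)
           for n in range(2 * sr)]
curve = dynamics_curve(samples)
assert curve[-1] > curve[0]  # later frames are louder, as expected for a crescendo
```

Comparing such curves across different renditions of the same score is one simple way to quantify the interpretation differences described above.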

Pose Analysis
Similar to research conducted in the fields of sports analysis and physiology [CECS18, BNWY23, HW23], this subsection provides an overview of methods that evaluate performers' posture and musculoskeletal systems. These methods are designed to promote both their physical and mental well-being, prevent injuries, and enhance the quality of their performance. One study, for example, examined fifteen violinists playing a G scale under three shoulder rest conditions. The study found that a higher rest led to less head and left shoulder rotation and reduced left acromion elevation, but increased left shoulder flexion and left forearm pronation, emphasizing the musicians' ability to adapt their body rather than their bowing technique. These results suggest that tailored assessments and improved shoulder rest designs could enhance player comfort and adaptability without compromising sound quality. Together, these studies underscore the transformative impact of motion capture technology in the realm of musical performance, fostering a data-driven approach to skill development, injury prevention, and the enhancement of training methodologies.

Pose Analysis using Motion Capture
Pose Analysis using Computer Vision: Blanco-Pineiro et al. [BPDPM15] investigated the postures of 100 music students, utilizing video and photo analyses performed by expert evaluators. The methodology included recording musicians in both seated and standing positions, as well as capturing still images in "ready-to-play" static poses. They examined 11 variables related to overall and specific body part postural quality, identifying common postural flaws and the contexts in which they occur. The aim was to highlight these issues and promote corrective measures for better postural habits during musical performances. Araujo et al. [ACML09] investigated postural flaws in four student violinists from an orchestra, using 20-minute frontal video recordings and anatomic markers. The study aimed to categorize and evaluate the frequency of these postural flaws. The findings revealed that all the violinists displayed postural flaws during their performance, highlighting that these flaws were unnecessary and could be avoided, as they are not intrinsic to standard instrumental techniques. Chan et al. [CDA13] took a different approach by evaluating the effectiveness of a 10-week intervention program on the posture of 57 professional orchestral musicians. Utilizing photographs for pre- and post-intervention assessments, they found improvements from exercise therapy, showcasing the potential of visual assessment tools in evaluating postural changes. Other studies analyzed the connection between body posture, muscle activity, and sound quality in clarinetists, as well as the impact of different chairs on the postures of violin and viola players. These studies collectively underscore the significance of ergonomic considerations and the potential of postural exercise therapy in enhancing musicians' performance and well-being.
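Many of the vision-based assessments above ultimately reduce to measuring joint angles from estimated keypoints. A minimal sketch, assuming 2-D keypoints from any off-the-shelf pose estimator and a hypothetical acceptable-range check (the thresholds are illustrative, not taken from the cited studies):

```python
import math

def joint_angle(a, b, c):
    """Angle (in degrees) formed at keypoint b by keypoints a and c,
    e.g. shoulder-elbow-wrist for elbow flexion in a 2-D pose estimate."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cosang = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosang))))

def within_band(angle, lo, hi):
    """Toy stand-in for an expert postural-quality variable: is the
    measured joint angle inside an acceptable band?"""
    return lo <= angle <= hi

# shoulder=(0,0), elbow=(1,0), wrist=(1,1): a right-angle elbow
elbow = joint_angle((0, 0), (1, 0), (1, 1))
print(round(elbow, 1))  # → 90.0
```

Thresholding such angles frame by frame is the automated analogue of the expert visual assessments used in the studies above.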

Conclusions
In this section, we emphasize the importance of evaluating both the auditory output and physical movements in a musical performance. We delve into various methods and applications for analyzing musical performances, which play a pivotal role in assessing quality and enhancing artistic development. These analyses have the potential to enrich the overall experience for both artists and their audience. For example, the audience can enjoy a more immersive experience by gaining additional insights into the performance, such as detailed note transcriptions, or by experiencing dynamic lighting adjustments that align with the mood and artistic intent. Musicians, on the other hand, can derive multiple benefits from such analyses. They can use the insights to prevent potential injuries resulting from repetitive movements or improper posture during performances and to refine their techniques. Such analyses can also serve as a valuable tool in music education, helping musicians refine their skills. Furthermore, these insights can assist instrument manufacturers in creating more ergonomic instruments and supportive accessories, ultimately contributing to the prevention of musculoskeletal issues among performers. Future research directions could benefit from recent developments in volumetric capturing, which can enhance the analysis. When combined with other data modalities such as ECG, EEG, dynamic 3D scans (i.e., 4D scans), and muscle deformations, the analysis can further improve our understanding of performance quality.

Musical Performance Synthesis
Musical performance synthesis refers to the intricate process of replicating the nuances of a physical musical performance using technology [DZBKM22]. This interdisciplinary field brings together elements of music theory, sound science, and computer methodologies to capture more than just the fundamental notes of a piece. It aims to encapsulate the true essence of a performance, encompassing elements such as motion, unique expressions, dynamics, and the variations introduced by an artist. Within this section, our primary emphasis is on techniques designed to produce human motion in direct response to audio or MIDI input. An essential aspect in the faithful replication of a musician's performance on an instrument lies in our ability to capture the subtleties of their gestures, posture, fingers, and emotions.
Related areas of research in human animation synthesis include audio-driven dance motion synthesis [YWJ * 20], sign language generation [RKES21], and audio-driven gesture generation [GFH * 23]. In the context of dance motion synthesis, numerous studies have leveraged machine learning methods to create realistic human animations. However, when it comes to generating motion based on audio input, a significant challenge lies in achieving temporal consistency and synchronization between the audio and the motion. Various techniques have been explored, such as recurrent neural networks (RNNs) [GMK * 19], but these are susceptible to temporal error accumulation and may result in static poses, particularly when dealing with inputs not present in the training data or when noise is introduced. To address this, temporal convolution was introduced [GBK * 19] to generate simple gestures. Capturing complex and varied dance movements or nuanced musician motions presents a great challenge due to their intricate long-term spatial-temporal and kinematic characteristics. It is important to acknowledge that dancing and musical performances share certain similarities, such as their reliance on rhythm and timing to create a sense of movement and flow [Bri]. Both are considered creative forms of expression capable of evoking emotions, telling stories, and conveying messages. However, they also exhibit notably different characteristics, particularly in the context of playing instruments. Musicians frequently engage with a diverse array of instruments, making the capturing of nuanced motions, particularly subtle finger movements, a challenging task. Moreover, the demand for high precision at contact points, especially in intricate finger positioning, further increases the difficulty of motion synthesis. Musicians typically exhibit more static and delicate movements, in contrast to the expansive stage spaces commonly used by dancers. Moreover, in musical performances, the motion itself generates the sound, in contrast to dancing, where the audio complements the movement; this necessitates precise synchronization between the captured motions and the resulting audio. For all these reasons, the field of musical instrument motion synthesis has experienced limited progress. Another factor contributing to the underdevelopment of this field is the scarcity of accessible motion repositories. Previous works tend to overlook the multi-modality inherent in musical instrument movements, often focusing solely on either upper body actions or finger movements, thus failing to comprehensively encompass the entirety of the motion. As a result, most available methods generate partial body animations. In this section, we review the most prominent methods for synthesizing musical instrument performances, organized according to the specific musical instrument being emulated.
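To make the contrast with recurrent decoding concrete, the toy sketch below (hypothetical weights and layout, not the architecture of [GBK * 19]) maps a sequence of audio feature vectors to pose parameters with a causal temporal convolution, so each output frame depends only on a fixed window of past input frames rather than on previously generated poses:

```python
def temporal_conv1d(features, kernels, bias):
    """Causal 1-D convolution over a feature sequence.
    features: T frames, each a length-F feature vector.
    kernels:  one entry per output pose dimension; each entry holds K taps,
              and each tap is a length-F weight vector (tap k looks k frames
              into the past).  bias: one value per output dimension."""
    out = []
    for t in range(len(features)):
        pose = []
        for o, taps in enumerate(kernels):
            acc = bias[o]
            for k, tap in enumerate(taps):
                src = t - k          # causal: only current and past frames
                if src >= 0:
                    acc += sum(w * x for w, x in zip(tap, features[src]))
            pose.append(acc)
        out.append(pose)
    return out

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pose = temporal_conv1d(feats, kernels=[[[0.5, 0.5], [0.5, 0.5]]], bias=[0.0])
print(pose)  # → [[0.5], [1.0], [1.5]]
```

Because outputs never feed back into the model, errors cannot accumulate over time the way they can in autoregressive RNN decoding; the price is a fixed receptive field.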

Piano:
The piano is a well-known musical instrument that is conventionally played with the performer seated close to the instrument. When playing the piano, the upper body is primarily engaged in striking the piano keys, while the feet are responsible for manipulating the pedals. Achieving the desired musical notes on a piano necessitates extremely precise placement of the fingers on the keys. Most approaches to piano playing concentrate on the movements of the hands and fingers, overlooking the broader physical engagement required for a nuanced performance. Early methods for generating piano animations, like those by Sekiguchi and Eiho [SE00], used a virtual space simulator and hand movement generator: they assigned fingers, positioned hands using spline functions, and calculated finger angles based on note difficulty.

Violin:
The violin is a renowned musical instrument traditionally played by a musician holding it close to their body. When playing the violin, the performer uses the bow to draw across the strings, while their fingers press on the strings to produce specific musical notes. Achieving the desired tones on a violin requires precise finger placement and control of the bow's speed and pressure to produce accurate and expressive music. Several studies have explored the use of neural networks and deep learning in generating violin performance animations.

Conclusions
In summary, it is clear that the field of musical performance synthesis has undergone a substantial transformation in recent years. It has shifted from optimizing fingering based on predefined rules and heuristics to utilizing machine learning techniques, particularly deep learning, to create natural and expressive musical performances. These advancements have not only facilitated the automatic generation of lifelike 3D animations and human-like performances but have also raised the prospect of exciting applications in virtual performances, interactive entertainment, music education, and humanoid robotics. However, the development of this domain faces challenges due to the limited availability of multi-modal musical instrument repositories and the difficulties in capturing and synchronizing them. Most existing research has primarily concentrated on partial body reconstruction, with a strong emphasis on fine details in hand and finger animations. To push this field forward, future research should broaden its focus from merely hitting the right notes to generating expressive musical gestures and harnessing tactile feedback. A recent trend in this field is the adoption of generative models, including progressive GANs like GANimator [LAZ * 22] and diffusion models like MDM [TRG * 23], Motion Diffusion in Latent Space [CJL * 23], and TEDi [ZLAH23]. These models, although not directly related to audio or music inputs, excel in handling other multi-modal inputs like text and emotions. They generate temporally consistent, high-fidelity, and natural long motions with the potential for sentimental control. These advancements, with the incorporation of additional constraints for precise motion synthesis, such as physics-based constraints [YSI * 23], could provide a robust framework for use in musical instrument performance synthesis. This promising direction opens doors to even more sophisticated and expressive musical performances in the future.

Conclusions and Discussion
In this report, we provide an overview of the current state of Virtual Instrument Performances, which can be a powerful tool for digitizing and visualizing the performing arts. This process plays a pivotal role in preserving cultural heritage, expanding access to a global audience, fostering creativity, and enriching educational resources. We have explored various aspects, including methods for storing motion and audio data (Section 2) and a comprehensive list of significant multi-modal datasets (Section 3). However, a universal schema for capturing and storing musical performances is still lacking. The imperative need for a common format to represent multi-modal data is evident. By defining appropriate encodings for each data type, we can effectively capture the nuances of motion and audio, benefiting the broader community and advancing future research.
Next, we present methods for capturing instrument-based performances (Section 4). These methods encompass capturing audio directly (MIDI) or indirectly from the instruments (raw audio) and capturing the motion of performers, including their body, fingers, and face. Each aspect presents unique challenges, and we summarize the pros and cons of each technology. We emphasize the importance of synchronized data from various sources, as high-quality data is indispensable for subsequent tasks, such as training models.
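As an aside on why MIDI offers such a precise symbolic representation, the sketch below decodes note-on/note-off messages from a raw stream of MIDI status and data bytes. It is a deliberately simplified parser: real capture software must also handle running status, timestamps, and the full message set.

```python
def parse_midi_messages(data):
    """Decode channel-voice messages from a raw MIDI byte stream
    (status byte 0x80-0xEF followed by one or two data bytes).
    Only note-on/note-off are decoded in this sketch."""
    msgs, i = [], 0
    while i < len(data):
        status = data[i]
        kind = status & 0xF0      # message type
        channel = status & 0x0F   # MIDI channel 0-15
        if kind == 0x90 and data[i + 2] > 0:
            msgs.append(("note_on", channel, data[i + 1], data[i + 2]))
            i += 3
        elif kind == 0x80 or (kind == 0x90 and data[i + 2] == 0):
            # per the MIDI spec, note-on with velocity 0 means note-off
            msgs.append(("note_off", channel, data[i + 1]))
            i += 3
        else:
            i += 1   # skip bytes we do not decode in this sketch
    return msgs

# note-on middle C (60) at velocity 100 on channel 0, then note-off
stream = bytes([0x90, 60, 100, 0x80, 60, 0])
print(parse_midi_messages(stream))  # → [('note_on', 0, 60, 100), ('note_off', 0, 60)]
```

Each note event is exact in pitch and velocity, which is precisely what makes MIDI so attractive as ground truth compared with transcribing raw audio.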
High-quality multi-modal data enables the development of innovative solutions to analyze performances (Section 5). These solutions help us understand how performers interact with their instruments, for both performance and health reasons. We cover approaches related to audio and pose analysis separately, presenting several solutions based on technologies such as motion capture, vision, and photogrammetry. We identify the potential of high-quality motion capture systems and newer approaches like volumetric capture, which allow for non-intrusive analysis of body interactions with instruments, especially when combined with various data modalities including ECG, EEG, dynamic 3D scans, and muscle deformations; these methods are expected to further enhance our comprehension of performance quality.
Importantly, high-quality data opens the door to disruptive approaches that can enhance artists' creativity and change the possibilities in virtual performances (Section 6). Generative deep learning systems are at the heart of these possibilities, enabling the generation of new motion that respects the properties of source data, based on different control signals such as audio (e.g., MIDI, raw audio), emotions/style, and text. Recent trends in this area highlight the importance of diffusion models and physics-based constraints to generate physically plausible and expressive instrument performances.

Recommendations
Based on the insights we have gained on our journey, we recommend that, within today's technological landscape, a comprehensive pipeline for capturing and storing musical instrument performances should consider the following factors. Capturing musical performances requires a delicate equilibrium between recording auditory and visual intricacies. When it comes to audio recording, established techniques like employing microphones and audio interfaces provide precise representation, particularly for electronic instruments that can directly interface with computers. While MIDI files offer precision in musical representation, they necessitate manual transcription for instruments lacking MIDI outputs. On the other hand, the choice of motion capture technology for pose acquisition is a critical decision. Marker-based systems offer precision but may be cost-prohibitive, are not as portable as other solutions, and impose movement restrictions due to sensors attached to the body or the instruments themselves. In contrast, markerless systems offer greater versatility but may compromise on fidelity. Capturing the subtleties of finger movements presents its own set of challenges, including self-occlusion and the need for specialized equipment like gloves, which can be restrictive for musicians. Recent advancements in deep learning and computer vision offer reliable methods in controlled environments, primarily because the highly constrained articulation of the hand aids pose prediction. Facial capture, crucial for conveying emotion, relies on systems like FACS for expression categorization. While HMCs ensure consistent facial capture, the choice between marker-based and markerless techniques introduces its own challenges, ranging from intrusiveness to potential precision limitations. Lastly, the synchronization of multi-modal data is of utmost importance, especially since different devices can introduce inconsistencies. Techniques such as timecode generators and Genlock are essential, with many motion capture systems supporting both methods for seamless synchronization. Ultimately, the selection of technology hinges on striking the right balance between achieving precision and practicality in capturing musical performances.

Challenges and Future Work
The field of virtual instrument performances continues to face a range of emerging challenges, which we outline here together with future directions for innovation and advancement in this domain. An interesting avenue for exploration lies in the collaborative synthesis of musical instrument performances by various entities, including human artists, robots, and AI agents. While there are numerous works in the literature on virtual performances, there is notably limited discourse on the interaction between performers, whether they are virtual or real. A notable example of work shedding light on this aspect is the research conducted by Chakraborty et al. [CDT21]. Such collaborations raise intriguing questions about the division of roles, creative decision-making, and the integration of AI into artistic expression. Future research can delve into the possibilities and challenges of this multi-agent collaboration, exploring how it might redefine the boundaries of virtual instrument performances.
Within the domain of creative and artistic performances, the visualization of virtual instrument performances is of paramount importance. Future research should focus on the development of innovative methods to visualize these performances, encompassing advanced techniques for rendering lifelike avatars, creating immersive virtual concert halls, and generating interactive visual representations of the performer's emotional state. Effective visualization not only enhances the audience's experience but also offers valuable insights for performers and researchers. Moreover, the advent of XR technologies introduces unique challenges when dealing with multiple performers. Future research should explore the intricacies of XR settings, investigating how VR devices and sensors might offer new ways for performers to interact with virtual instruments. Understanding the dynamics of multiple performers in these environments, such as collaborative improvisation or synchronized actions, is essential for pushing the boundaries of virtual instrument performances.
Furthermore, future research should place a stronger emphasis on collecting data that describes the style and conditions of performances. This includes monitoring the performer's heart rate, tracking improvisational moments, and observing emotional fluctuations during the performance. This data can provide a deeper understanding of the performer's state and creative choices, ultimately leading to more immersive and emotionally resonant virtual instrument performances. Investigating the integration of biofeedback data and improvisational tracking can pave the way for groundbreaking developments in this field.
Another avenue for future research involves exploring the possibilities of simulating audio and interactions in virtual environments that differ from the settings in which the performance was originally captured. It is evident that the spatial and environmental context significantly impacts the audio and visual experience of a performance. For instance, capturing a performance in a studio but simulating it in a large auditorium or a confined space can result in distinct audio characteristics and altered expressions in motion. Similarly, investigating how the presence or absence of an audience influences a performer's emotion and expression is also a promising avenue. Future research should delve into audio simulation techniques, environmental modeling, and their effects on the overall virtual instrument performance.
We consider the thoughtful and careful study of privacy and ethics around the application of these methods to be an important issue. The digitization of an artist and their performance includes sensitive private data, such as motion, playing style, and biofeedback data. Additionally, there is the issue of unauthorized use of a performer's data to generate novel performances, raising questions about ownership of synthesized content, similar to the discussion around generative models for images and audio [Bar23] and Large Language Models [WMR * 21, Har23]. Since DL systems are trained on data, it is essential to select data in such a way that racial, sex, or body-related biases are minimized or ideally removed completely. Ethics is a crucial topic in several domains surrounding XR environments and DL methods; we recommend reading the interesting study by Slater et al. [SGLH * 20] for more information.
Addressing the challenges and opportunities presented in these areas will undoubtedly lead to groundbreaking developments in this dynamic field. Researchers and practitioners are strongly encouraged to explore these themes, pushing the boundaries of what is achievable in the world of virtual instrument performances.

Figure 2 :
Figure 2: Structural interconnection between the different sections of this survey. To analyze and synthesize new VIPs, performance capture technologies are used to record multi-modal data of performers and their instruments. The data is then either directly used by analysis and synthesis systems or stored in databases using appropriate formats and representations. In some cases, these datasets are used for archiving purposes and are therefore enriched with metadata, analysis, synthesized data, and semantic annotations by experts in the respective domains.

Figure 3 :
Figure 3: Recording instruments with microphones in front of the instrument: a cello on the left [Zie23], and an amp on the right [Bra23].

Figure 4 :
Figure 4: The left image shows musicians playing the violin, motion captured using an optical MoCap system with reflective markers tracking their body and instrument motion (image extracted from [Fut]). The right image shows a musician playing the piano, while the movements of the finger joints are tracked with 5 mm reflective markers (image extracted from [Ger]).
A voxel-based representation of the subject's body evolves over time, and animation is achieved by fitting a skeleton into the 3D model [DAST * 08, GSdA * 09, VBMP08, LSG * 11, LGS * 13]. Over the last decade, numerous methods have been proposed; in this work, we draw insights from two key surveys, the work of Desmarais et al. [DMSM21] and the work of Xia et al. [XGL * 17]. Additional insights can be found in related studies [HXZ * 19, XCZ * 18].

Figure 5 :
Figure 5: Use of a motion capture system to holistically track a musician (full body, face, fingers), drum sticks, and drums. Image extracted from [Cin21].
Music Information Retrieval (MIR) and Musical Performance Analysis (MPA) are two closely related research fields, both centered on aspects of music. MIR concentrates on developing algorithms and techniques to extract information from music audio signals, which can serve various purposes, including music genre classification [TC02], instrument classification [HBKD06], beat detection [PBDL23], music recommendation [ZSQJ12], and music transcription and melody extraction [SG12]. The work presented in [W * 03], which revolves around audio identification, has been successfully incorporated into applications like Shazam [Sha23b]. Shazam stands as an exemplary MIR application, capable of identifying songs by analyzing short audio samples and matching them against an extensive audio database. Within the realm of MIR, there are several noteworthy surveys that provide valuable insights, including [Dow03, TWV05, Ori06, CVG * 08, SGU * 14, SNA19, KR12].
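Among the MIR tasks listed above, beat and tempo detection can be illustrated with a minimal autocorrelation over an onset-strength envelope (a simplified stand-in for the surveyed methods; the envelope, frame rate, and BPM range below are illustrative assumptions):

```python
def estimate_tempo(onset_env, fps, bpm_range=(60, 180)):
    """Pick the beat period (in frames) whose autocorrelation of the
    onset-strength envelope is strongest, then convert it to BPM."""
    n = len(onset_env)
    mean = sum(onset_env) / n
    x = [v - mean for v in onset_env]
    lo = int(fps * 60 / bpm_range[1])               # shortest candidate period
    hi = min(int(fps * 60 / bpm_range[0]), n - 1)   # longest candidate period
    best_lag = max(range(lo, hi + 1),
                   key=lambda lag: sum(x[i] * x[i - lag] for i in range(lag, n)))
    return fps * 60.0 / best_lag

# an idealized envelope with an onset impulse every 0.5 s at 100 frames/s
env = [1.0 if i % 50 == 0 else 0.0 for i in range(400)]
print(estimate_tempo(env, fps=100))  # → 120.0
```

Real beat trackers additionally weight the autocorrelation toward a preferred tempo and track beat positions dynamically; this sketch only recovers a global tempo from a clean envelope.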
The field of instrumental performance analysis has been significantly enriched through the integration of motion capture technology, providing intricate insights into musicians' motor skills for improved training and injury prevention. The Tone project [Cye23] introduces a virtual mirror, offering musicians real-time musculoskeletal feedback and the ability to analyze their posture and muscle activity from various perspectives. Additionally, Ancillao et al. [ASGA17] examine upper limb and bow positioning in violin players, emphasizing the criticality of quantitative assessments in skill evaluation and motor disorder diagnosis. Investigating finger movement coordination during piano playing, the study conducted by Winges et al. [WF15] discerns the nuanced differences in technique between professional and amateur musicians. Wolf et al. [WMB * 19] introduced a marker-based method to explore upper body movements, with a particular focus on addressing musculoskeletal disorders of high string players (violin and viola); see Figure 6. In the pursuit of injury prevention and performance optimization, Shan et al. [SV03] delve into Overuse Syndrome in violinists, advocating for training strategies that emphasize physical economy.

Figure 6 :
Figure 6: Analysis of 3D upper body kinematics of high string players during performance. Image extracted from [WMB * 19].
Bejjani et al. [BH89] examined how body measurements influence the postures of 16 professional trumpeters while performing standing up. Through detailed photographs and anthropometric data collection, they provided data on the physical limitations that can impact a musician's performance. Longo et al. [LDSR * 20] contributed by investigating the impact of body posture on voice performance during simultaneous singing and instrument playing. The study, which included 17 musicians (guitarists and pianists), utilized the Multi-Dimensional Voice Program (MDVP) for voice analysis and visual assessments for evaluating posture. Results underscored the complex relationship between a musician's physicality and their auditory output. Shifting the focus to ergonomics, Valenzuela-Gomez et al. [VGRGAG20] investigated the postural implications of different guitar supports (guitar cushion, rigid lap support, and footstool) on classical guitarists. By integrating REBA and 3DSSPP software with subjective questionnaires, their work highlighted the ergonomic challenges and the need for improved support designs to enhance comfort and performance. Finally, Islan et al. [IBP * 18] provided a comprehensive analysis of the glenohumeral joint dynamics in violinists, employing a multifaceted approach involving the RULA (Rapid Upper Limb Assessment) method, CATIA software for geometric modeling, and ANSYS software for finite element (FEM) analysis.

Figure 7 :
Figure 7: Analysis of the upper body posture of a musician playing a high string bowed instrument using 3D back scans. Image extracted from [OMB * 18].
Nagata et al. [KMO * 09] used motion capture to acquire piano fingering movements, rendering visualizations of the fingering and generating fingerings automatically using optimized algorithms. Zhu et al. [ZRHN13] used motion planning and optimization methods based on graph theory for 3D piano animations, where, similarly to Yamamoto et al. [YUS * 10], they used Inverse and Forward Kinematics for hand modeling and animation from MIDI files. More recently, there has been a shift towards the use of machine learning and generative approaches. The first category of methods utilized Long Short-Term Memory networks (LSTMs), mainly due to their capabilities in modeling sequential data and capturing temporal dependencies. For instance, Li et al. [LMD18] developed a deep neural network system that translates MIDI note data and metric structures into a real-time skeleton sequence of a pianist playing a keyboard instrument. Their approach combined Convolutional Neural Networks (CNNs) and LSTMs to generate human-like piano performances. Similarly, Shlizerman et al. [SDSKS18] transformed audio recordings of piano (and violin) performances into animations. They trained an LSTM network on internet-sourced videos and applied the predicted points to rigged avatars (see Figure 8). Bogaers et al. [BYV21] introduced a music-driven method that generates expressive musical gestures for virtual humans using 3D motion capture data and LSTM networks. In contrast, Guo et al. [GCZ * 21] introduced an augmented reality training system for piano, using MIDI data to generate 3D hand animations based on pre-trained Hidden Markov Models. The Viterbi algorithm determined the optimal finger path, and optimization methods modeled different fingerings and skills. Xu et al. [XLW * 22] used reinforcement learning (RL) to create piano finger animations. They employed an end-to-end RL approach to train an agent for piano playing using touch-augmented hands on a simulated piano. They designed touch- and audio-based reward functions and utilized the Soft Actor-Critic (SAC) method for training the RL agent. The results showed that tactile sensor feedback enhanced learning efficiency, leading to proficient piano playing in a fixed number of training iterations. In a recent study, Zakka et al. [ZWS * 23] introduced a system that builds upon the work of Xu et al. by utilizing deep reinforcement learning techniques to train anthropomorphic robotic hands in piano playing, resulting in the synthesis of dexterous robotic hand performances.
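The Viterbi-style decoding of an optimal finger path mentioned above can be sketched as a small dynamic program over the five fingers. The transition-cost model below is a toy assumption for illustration, not the trained model used by Guo et al. [GCZ * 21]:

```python
def viterbi_fingering(pitches, trans_cost):
    """Assign one of five fingers to each note so that the summed
    transition cost along the sequence is minimal."""
    FINGERS = range(1, 6)
    # best[f] = (cost, path) of the cheapest assignment ending on finger f
    best = {f: (0.0, [f]) for f in FINGERS}
    for prev_p, p in zip(pitches, pitches[1:]):
        interval = p - prev_p
        new_best = {}
        for f in FINGERS:
            c, path = min(((cst + trans_cost(g, f, interval), pth)
                           for g, (cst, pth) in best.items()),
                          key=lambda t: t[0])
            new_best[f] = (c, path + [f])
        best = new_best
    return min(best.values(), key=lambda t: t[0])[1]

def toy_cost(g, f, interval):
    # hypothetical cost: prefer finger motion that follows the pitch direction
    return abs((f - g) - interval / 2.0)

print(viterbi_fingering([60, 62, 64, 65], toy_cost))
```

In a full system, the emission and transition costs would come from a trained model (e.g., an HMM over hand shapes) rather than a hand-written heuristic, but the decoding structure is the same.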

Figure 8 :
Figure 8: Synthesizing piano playing movements: (a) an input audio signal is (b) fed into an LSTM network to predict body movement points, which are used to (c) animate an avatar and show it playing the input music on a virtual piano. Image extracted from [SDSKS18].

Figure 10 :
Figure 10: Overview of the music-to-motion framework proposed by [CFZ * 21]. The framework consists of a generator and a discriminator.

Table 1: Audio/MIDI-modal Music Performance Datasets. N: Note, M: MIDI, PC: Pitch contour, G: Genre, I: Instrument, L: Lyrics, E: Emotion, m: metadata, ✗: Unavailable/Non-working Link.

…vocal and instrument melodies alongside piano accompaniments, all aligned with the original audio; annotations include tempo, beat, key, and chords. The Groove MIDI Dataset [GRE*19] offers 13.6 hours of electronic drum performances from 10 professional drummers, paired with relevant metadata such as style annotations and tempo, all in MIDI format. The Bach Doodle dataset [HHR*19] stems from an interactive tool [Bac] that allows users to craft melodies harmonized in Bach's style by the Coconet [HCR*19] model; this resulted in over 21.6 million compositions across 8.5 million sessions, detailing user melodies, harmonizations, and various metadata attributes. Finally, MusicCaps [ADB*23], which focuses on text-to-music generation, contains musical snippets sourced from AudioSet [GEF*17]; each 10-second clip is paired with a descriptive English caption and a list of music aspects.

Source Separation, Mixing, and Signal Processing: The MASS (Music Audio Signal Separation) dataset [VIN08] provides short song excerpts lasting between 10 and 40 seconds. Each excerpt offers stereo Microsoft PCM WAV files at 44.1 kHz and 24 bits, capturing every instrumental track, which, depending on production settings, may or may not include effects. On a parallel note, the "MIXPLORATION" dataset [CPR14] is designed for the analysis of audio mixing and includes four root components: the raw source audio files, the specific mixing parameters, survey data capturing listener feedback on these mixes, and a time-series log of the mixing adjustments. Furthermore, regarding melody extraction, MedleyDB [BST*14] is a collection of melody-annotated, royalty-free multitrack recordings designed mainly for melody extraction research. While 14 tracks do not have a defined melody, they still play a significant role in other musical research areas; the dataset also delivers instrument activation annotations and extensive metadata. The MUSDB18 dataset [RLS*17] offers 150 full-length tracks from a spectrum of genres; for each track, its original stems, isolating elements like vocals, drums, and bass, are provided. Meanwhile, the MTG-Jamendo dataset [BWT*19] emerges as an open framework for music auto-tagging; sourcing its music from Jamendo, it comes with over 55K tracks tagged across multiple categories such as mood, genre, and instruments. Finally, the Slakh dataset [MWSLR19] integrates multi-track audio files and aligned MIDI. Stemming from the Lakh MIDI dataset [Raf16], it employs high-quality virtual instruments to render individual MIDI tracks, which are then combined to form complete musical compositions. Its current version, Slakh2100, offers 2.1K tracks generated from a diverse range of 187 patches across 34 categories.

[* 11] focuses on ensemble performances across blues, funk, and swing genres. It not only provides multitrack audio but also delves into rhythmic quality, onset detection, and other intricate musical annotations. URMP [LLD*19] includes 44 classical chamber music pieces, ranging from duets to quintets, accompanied by visual information; each piece comes with musical scores, individual audio tracks, and detailed ground-truth annotations with both frame-level and note-level transcriptions. The Lakh MIDI dataset [Raf16], with its vast collection of MIDI files, offers ground-truth data for audio content-based music information retrieval, transcription, meter, lyrics, and advanced musicological characteristics. Lastly, the YouTube-100M dataset [HCE*17], while not exclusively a music dataset, has been used for soundtrack classification; it contains around 100 million YouTube videos, auto-labeled with multiple labels out of a set of 30K topic labels (averaging 5 labels per video) based on information, context, and visuals. [* 19], offering detailed recordings across a plethora of instruments. These are further sup-

…movement analysis, pose estimation, and the study of how music influences or interacts with physical movements. Details about these datasets are presented in Table 2; regarding motion data, we specify which elements were captured, among upper body, lower body, fingers, and instrument.
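Multitrack datasets such as Slakh and MUSDB18 derive their ground truth directly from the production process: the mixture is (approximately) the sum of the isolated stems, and the stems themselves serve as separation targets. The sketch below illustrates that relationship with synthetic signals standing in for rendered instrument tracks; `mix_stems` and the stem names are illustrative assumptions, not an API of either dataset.

```python
import numpy as np

def mix_stems(stems):
    """Sum per-instrument stems into a mixture. The individual stems remain
    available as separation ground truth, as in Slakh/MUSDB18-style data."""
    mixture = np.sum([stems[name] for name in stems], axis=0)
    peak = np.max(np.abs(mixture))
    if peak > 1.0:                      # simple peak normalization to avoid clipping
        mixture = mixture / peak
    return mixture

sr = 44100
t = np.arange(sr) / sr                  # one second of audio at 44.1 kHz
rng = np.random.default_rng(0)
stems = {
    "bass":  0.3 * np.sin(2 * np.pi * 55 * t),      # synthetic stand-in tracks
    "vocal": 0.3 * np.sin(2 * np.pi * 440 * t),
    "drums": 0.3 * np.sign(np.sin(2 * np.pi * 2 * t)) * rng.normal(0, 0.1, sr),
}
mix = mix_stems(stems)
print(mix.shape)  # (44100,)
```

A source-separation model trained on such data receives `mix` as input and is scored against the entries of `stems`.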

Table 2: Multi-modal Music Performance Datasets. N: Note, M: MIDI, m: metadata, V: Video, Mo: MoCap (U: Upper body, L: Lower body, f: fingers, I: Instrument), ✗: Unavailable/Non-working Link.

Music Form/Structure Generation. Lu et al. developed MeloForm [LTY*22], an expert system that constructs melodies from motifs to phrases using a predefined musical form, employing a transformer-based refinement model to enhance richness. Museformer, proposed by Yu et al. [YLW*22], also uses a transformer, incorporating novel fine- and coarse-grained attention mechanisms for music generation that capture both music structure-related correlations and additional contextual information, leading to high-quality, well-structured long music sequences. An interesting work by Wang et al. [WLL*20] introduces an algorithm for synthesizing interactive background music based on visual content; using neural networks for scene sentiment analysis and a cost function for music synthesis, it ensures emotional consistency between visual and auditory elements, as well as music continuity. It is also worth acknowledging several studies in the domain of Song Writing; specifically, the works by Sheng et al. [SST*20], Xue et al. [XSW*21], and Ju et al. [JLT*22] have made noteworthy contributions to this field. We encourage readers interested in exploring this topic in greater detail to refer to the comprehensive analysis by Ji et al. [JYL23], which extensively explores current popular music generation tasks using deep learning techniques. Likewise, Siphocly et al. [SEHS21] describe and analyze various AI algorithms and techniques available for composing computer music.
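To give a concrete flavor of the fine- and coarse-grained attention idea behind Museformer, the sketch below builds a boolean causal attention mask in which each token attends token-by-token to its most recent bars and only to a single summary position per earlier bar. This is a simplified illustration, not Museformer's actual scheme; `barwise_attention_mask`, the bar length, and the "last token summarizes its bar" convention are assumptions made here for clarity.

```python
import numpy as np

def barwise_attention_mask(n_tokens, bar_len, fine_bars=2):
    """Boolean causal mask: fine-grained attention within the last `fine_bars`
    bars, coarse-grained attention to one summary token per earlier bar."""
    mask = np.zeros((n_tokens, n_tokens), dtype=bool)
    for q in range(n_tokens):
        q_bar = q // bar_len
        for k in range(q + 1):                   # causal: keys up to the query
            k_bar = k // bar_len
            if q_bar - k_bar < fine_bars:
                mask[q, k] = True                # fine: recent bars, every token
            elif k % bar_len == bar_len - 1:
                mask[q, k] = True                # coarse: one summary per old bar
    return mask

m = barwise_attention_mask(n_tokens=16, bar_len=4, fine_bars=2)
# Last token: 8 fine keys (two recent bars) + 2 bar summaries = 10 allowed keys.
print(m.shape, int(m[15].sum()))  # (16, 16) 10
```

The appeal of such masks is that attention cost grows with the number of recent tokens plus the number of bars, rather than with the full sequence length, which is what makes long, structured sequences tractable.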
In recent years, advancements in music composition technologies have provided new opportunities for research. These tools enable digital music composition, allowing researchers to investigate and analyze musical constructs with greater precision and depth. Notably, they make it possible to create customized datasets for further research initiatives, such as training machine learning models or augmenting existing datasets. While not the main focus of this survey, this subsection offers a brief overview of recent studies that employ diverse methods to compose computer-generated music. Over the last few years, Text-to-Music Generation has gained significant popularity. Agostinelli et al. [ADB*23] proposed MusicLM, a generative model that delivers high-quality music, maintaining consistency over extended durations and accurately adhering to text-based conditioning cues. Similarly, MusicGen by Copet et al. [CKG*23], a single Language Model that operates over music tokens, can generate high-quality samples influenced by textual descriptions or melodic [* 23], a framework for generating music with any arbitrary source-target track combinations, which relies on a novel music representation combined with a diffusion model. Furthermore, some works focused on
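Language-model approaches such as MusicGen treat music as a sequence of discrete tokens and generate it autoregressively. The sketch below shows the generic temperature-sampling loop with a toy next-token model; `sample_tokens`, `toy_model`, and the vocabulary size are illustrative assumptions and do not reflect MusicGen's actual tokenizer or architecture.

```python
import numpy as np

def sample_tokens(logits_fn, prompt, n_new, temperature=1.0, seed=0):
    """Autoregressive sampling over a discrete music-token vocabulary,
    conditioned on a prompt (e.g. tokens derived from a text description)."""
    rng = np.random.default_rng(seed)
    seq = list(prompt)
    for _ in range(n_new):
        logits = logits_fn(seq)                       # score every next token
        p = np.exp((logits - logits.max()) / temperature)
        p /= p.sum()                                  # softmax with temperature
        seq.append(int(rng.choice(len(p), p=p)))      # sample one token
    return seq

VOCAB = 16

def toy_model(seq):
    # Toy stand-in for a trained network: favor (last token + 1) mod VOCAB.
    logits = np.zeros(VOCAB)
    logits[(seq[-1] + 1) % VOCAB] = 5.0
    return logits

out = sample_tokens(toy_model, prompt=[0], n_new=8, temperature=0.5)
print(len(out))  # 9
```

In a real system the sampled token sequence is then decoded back to audio (for instance by a neural audio codec); lowering the temperature makes generation more deterministic, raising it more varied.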

Table 3: The three primary MoCap categories with their advantages and disadvantages.

Optical motion capture finds widespread use in various fields, such as film production, video game development, biomechanics, and medical research. By utilizing cameras to track reflective or light-emitting markers, it is known for its remarkable precision and reliability, even when capturing subtle or intricate movements, as well as for high-frame-rate acquisition. However, it is generally costly, requires camera calibration, and can be hindered by occlusions. Conversely, inertial motion capture systems are prized for their portability and versatility; however, they are susceptible to issues like positional drift and magnetic interference, and the precision of data acquisition is somewhat diminished. Finally, markerless (or vision-based) motion capture systems leverage advanced computer vision algorithms, providing flexibility and convenience, albeit at the cost of reduced pose accuracy; they are notably sensitive to lighting conditions and occlusions. For a summarized overview of the advantages and limitations of each of these methods, please refer to Table 3.
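The precision of optical systems comes from multi-camera triangulation: once the cameras are calibrated, a marker seen in two or more views can be reconstructed by linear least squares. The following noiseless sketch uses the standard Direct Linear Transform (DLT) formulation; the synthetic camera parameters and the `triangulate` helper are illustrative assumptions.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one marker from two calibrated cameras.
    P1, P2: 3x4 projection matrices; x1, x2: 2D pixel observations."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],          # each observation contributes two
        x1[1] * P1[2] - P1[1],          # linear constraints on the 3D point
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                          # null vector = homogeneous 3D point
    return X[:3] / X[3]                 # dehomogenize

# Two synthetic cameras observing a marker at (0.1, 0.2, 2.0) meters.
K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                 # camera at origin
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.], [0.]])])   # 0.5 m baseline

X_true = np.array([0.1, 0.2, 2.0, 1.0])
x1 = P1 @ X_true; x1 = x1[:2] / x1[2]   # project the marker into each view
x2 = P2 @ X_true; x2 = x2[:2] / x2[2]
X_est = triangulate(P1, P2, x1, x2)
print(np.allclose(X_est, X_true[:3], atol=1e-6))  # True
```

With real cameras, lens distortion, calibration error, and pixel noise make the reconstruction approximate, and a marker occluded in all but one view cannot be triangulated at all, which is exactly the occlusion limitation noted above.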