Stylized Crowd Formation Transformation Through Spatiotemporal Adversarial Learning

Achieving crowd formation transformations has wide-ranging applications in fields such as unmanned aerial vehicle formation control, crowd simulation, and large-scale performances. However, planning trajectories for hundreds of agents is a challenging and tedious task. When modifying crowd formation change schemes, adjustments are typically required based on the style of the formation change. Existing methods often involve manual adjustments at each crucial step, leading to a substantial amount of manual labor. Motivated by these challenges, this study introduces a novel generative adversarial network (GAN) for generating crowd formation transformations. The proposed GAN learns specific styles from a series of crowd formation transformation trajectories and can transform a new crowd with an arbitrary number of individuals into the same style with minimal manual intervention. The model incorporates a space-time transformer module to aggregate spatiotemporal information for learning distinct styles of formation transformation. Furthermore, this article investigates the relationship between the distribution of training data and the length of trajectory sequences, providing insights into the preprocessing of training data.


Introduction
Crowd formation transformation involves controlling the movement of a crowd under strict constraints, [1] often simulating crowd behaviors as they transition from one formation to another. [2] During this transformation process, individual agents intelligently navigate to their destinations while ensuring that the entire crowd maintains order and elegance. Generating crowd transformations is a significant area of study with practical implications in various domains, including unmanned aerial vehicle formation control, [3] animated films, [4] large-scale performance planning and design, [5] and more.
The style of crowd formation transformation is characterized by the relative positions and states of individuals during movement, as well as the overall movement trends and boundaries. In many scenarios, generating a specific style of crowd formation transformation (SSCFT) is essential. For example, in crowd performance design, multiple transformation schemes with the same style are needed as alternative options. Moreover, during the early creative stages, the exact number of performers is uncertain, and it is necessary to evaluate the performance states for different quantities of actors. During the implementation phase, slight adjustments to the actors' positions are required to observe various artistic effects.
Current methods for constructing crowd formation transformations primarily focus on the processes of formation generation, formation maintenance, and formation alteration. [6] In the construction phase, they typically rely on techniques such as image registration, [5] grid partitioning, [3] and geometric constraints. [7] Subsequently, formation maintenance and alteration are enforced through rule matching, [5,6] behavior models, [8] collision avoidance models, [9] or other constraints. [10] While these methods yield convincing results in crowd formation transitions, they often require manual operations or expert knowledge during formation generation and adjustment. As the crowd size increases, these approaches become more complex and labor-intensive. Furthermore, with advancements in the field of computer vision, extracting motion trajectories from real group formations has become much easier. This progress is particularly important in constructing simulations for games or performances. However, existing methods often require a cumbersome reconstruction process.
To address these challenges, this work pioneers the use of deep learning models to tackle SSCFT problems. Specifically, given a specific crowd transformation style template, the proposed model transfers a crowd, regardless of its initial formation or the number of agents, to a new formation following the template's style. This approach differs from existing deep learning-based research, which mainly focuses on predicting crowd behavior using deep learning [11,12] or reinforcement learning, [13,14] or generating crowd data using GANs. [15] Crowd formation transformation is a specialized branch of crowd simulation that imposes unique restrictions compared to crowd planning and pedestrian simulation. For example, in the design of large-scale performances, it is imperative to ensure the artistic quality of crowd modeling during the migration process, considering the individual characteristics and trajectory features of each agent.
Inspired by these challenges, a deep learning-based method is proposed for generating SSCFT data. A spatiotemporal feature extraction module is designed to capture both spatial features of crowds at each moment and temporal features along individual trajectories. This module takes basic position-based formation transformation data and a new crowd formation with an arbitrary number of agents as input, producing SSCFT data for the new crowd. The model's generated results can be controlled for diversity or similarity with the original data using hyperparameters. These results can be readily applied to real-world tasks after simple postprocessing.
Figure 1 showcases the generated results. Section A presents a formation process where a 121-person square formation is derived from an 800-person circular scheduling process. Such simple geometric shapes are common in performance activities, and many studies [3,7,10,16] have focused on transformations involving simple geometric formations. Section B depicts the adjustment of headcounts in the transition process of a large, irregular crowd formation, a common scenario in mass crowd formation transformation. Section C illustrates the adjustments made to a cartoon-patterned crowd formation process after increasing the number of participants. The main contributions of this work can be summarized as follows: 1) Introduction of a style transfer architecture capable of generating SSCFT data based on specified migration data and initial crowd positions. The model allows control over the similarity with the original data and the diversity of generated results through hyperparameter adjustments. To the best of our knowledge, this work is the first to apply deep learning methods to SSCFT problems, opening new avenues for creating and modifying crowd formation transformations. Compared to traditional methods, this approach eliminates the need for manual constraint modifications and crowd distribution adjustments for each keyframe when changing the number of people in the crowd, significantly enhancing the efficiency of crowd formation transfer design.
2) Investigation of the relationship between sequence lengths of crowd migration data and the generated results, providing insights into the processing of training data. When dealing with large data spans, finer-grained time series divisions should be employed. 3) Introduction of a space-time transformer model that remains unaffected by the crowd's size, enabling effective feature extraction from crowd trajectories.

Crowd Formation Control
In the field of formation construction, various approaches are explored. Xu et al. [17] employed shape constraints to establish the positions of static groups, while Li et al. [5] used image sampling for constructing static formations. Zheng et al. [7] adopted geometric constraints to generate crowd distributions, and He et al. [10] used user interaction to recognize user gestures and created formations based on predefined templates. Wang et al. [18] introduced a topological formation control method, while Allain et al. [19] proposed a dynamic crowd model for formation editing. Gu et al. [20] leveraged user-input sketches to construct group formations.
Formation maintenance and alteration have likewise been studied extensively. [21,22] Typically, these methods reframe the crowd as a static ensemble and perform matching and path planning based on the relationships between these ensembles. In the design process, rule-based models iteratively update formation transformations by manually adjusting the positions of individuals within each static ensemble and the matching rules within groups until they reach the desired outcome. Agent-based models, in contrast, depend on additional information such as social attributes [9,23,24] or velocity information [6,25-29] to manipulate each agent or target with distinct attributes.
Macroscopic strategy models often rely on physical models, [30] potential field models, [10] and similar methods to drive collective transformations. Nevertheless, neither individual-based nor macroscopic strategies alone suffice to manipulate each individual's position within a crowd. Furthermore, their performance can be sensitive to parameter selection, making it tedious to determine parameters that yield satisfactory results.
When modifying the transition process based on given crowd position data, manual adjustments are typically performed through repetitive trial and error to reconstruct rules or strategies, which can be exceedingly time-consuming. In contrast, our method adopts an end-to-end approach to generate a similar scheduling process when designing transformations in crowd formation. The results generated by our method can then be easily fine-tuned with the aforementioned methods, significantly reducing the manual intervention required.

Crowd Feature Extraction
Crowd trajectory feature extraction has been widely studied in the context of crowd trajectory prediction. Earlier methods used linear models, [31] Gaussian regression models, [32] time series analysis, [33] autoregressive models, [34] and other techniques. However, these methods have limited ability to cope with complex trajectory features. Research in recent years has achieved great success using recurrent neural network (RNN) models, [15,35,36] which can effectively extract crowd features but are at a disadvantage in the parallel processing of data. Now, with the success of the transformer model [37] in natural language processing, sequence feature extraction is gradually shifting from RNN models to transformer models, which break through the limitation that RNN models cannot be computed in parallel, and whose attention mechanism can take into account both global and local information. The latest research in crowd feature extraction based on the transformer model [38] has achieved excellent results. The trajectory prediction task shows that extracting crowd trajectory features should consider not only the trajectory of a single individual but also the influence of other individuals. Alahi et al. [35] and Gupta et al. [36] aggregated group information by modeling social interaction, while Salzmann et al. [39] and Zhao et al. [40] aggregated it via the scene context.
Inspired by previous works, we consider both spatial compatibility and temporal consistency and propose the space-time transformer model to generate SSCFT. In this model, feature extraction is not affected by the size of the crowd, and the model can attend to each individual in the group from a global perspective. The time and space complexity of the computation increases only linearly with the crowd size. To verify the effectiveness of the architecture we designed, we also test the use of RNN models for temporal feature extraction and different space modules for spatial feature extraction, which also produce visually acceptable effects.

Generative Model
Generative adversarial networks (GANs) [41] are a method of training generative models proposed by Goodfellow. A GAN consists of a generator network for data generation and a discriminator counterpart that penalizes the differences between generated and real data. It has achieved excellent results in image generation, [42-44] style transfer, [45,46] and sequence generation. [47] The conditional generative adversarial network (CGAN) [48] is an extension of the original GAN. A CGAN is implemented by feeding extra information y to both the discriminative and generative models as part of the input layer. In the generative model, the prior input noise p(z) and the conditional information y together form a joint hidden-layer representation. The adversarial training framework is quite flexible in how the hidden-layer representations are composed. Similarly, the objective function of the CGAN is a two-player minimax game with conditional probability. In this article, we exploit the mechanism of the CGAN for our transformer model.
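For reference, the standard CGAN objective from ref. [48], on which our adversarial framework builds, can be written as

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x \mid y)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z \mid y))\right)\right]$$

where $y$ is the conditional information supplied to both the generator and the discriminator.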

Problem Definition
Following previous works, [1,5,16,20-22] we assume the space is a two-dimensional plane. In addition, all data are assumed to scatter within a specific area, ranging from $-1$ to $1$. This is implemented by conducting a $-1$ to $1$ normalization of the data in different crowds.
This work focuses on SSCFT problems, where a novel crowd position $\hat{C}$ with $m$ agents should be transformed according to a style template $X$ with $n$ agents, generating novel crowd migration data $Y = [\hat{T}_1, \hat{T}_2, \ldots, \hat{T}_m]$. We note that $m$ and $n$ need not be equal, indicating the strong generalization capacity of the proposed method. The style template $X = [T_1, T_2, \ldots, T_n]$, where $T_i$ is the moving trajectory of the $i$-th agent. $T_i = [P_i^1, P_i^2, \ldots, P_i^t]$ represents the complete path of the $i$-th agent from the position $P_i^1$ at time 1 to the position $P_i^t$ at time $t$. Each position $P_i^j \in \mathbb{R}^2$. The time interval between $P_i^j$ and $P_i^{j+1}$ from the same crowd transformation data $X$ is fixed and is denoted as $\Delta t$.
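As a concrete illustration of this preprocessing, the following minimal sketch (our own, not the authors' code; whether the two coordinate axes are scaled jointly or independently is an assumption) normalizes a crowd trajectory array of shape $(n, t, 2)$ into the $[-1, 1]$ range:

```python
import numpy as np

def normalize_crowd(X):
    """Map a crowd trajectory array of shape (n, t, 2) into [-1, 1].

    A single affine transform is applied to both coordinate axes so the
    formation's aspect ratio is preserved (an assumption on our part).
    """
    lo, hi = X.min(), X.max()
    return 2.0 * (X - lo) / (hi - lo) - 1.0

# Hypothetical example: n = 200 agents, t = 15 time steps, 2D positions.
X = np.random.uniform(0.0, 300.0, size=(200, 15, 2))
X_norm = normalize_crowd(X)
assert X_norm.min() >= -1.0 and X_norm.max() <= 1.0
```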

Model Overview
In the process of crowd formation transformation, the trajectory of each agent and the overall transformation of the formation compose the crowd transformation style. The transformation process of an agent over a time series is its temporal feature, and the relative positional relationship within the crowd is its spatial feature. Both spatial compatibility, where agents should scatter without collision, and temporal consistency, where agents should move intelligently, should be considered. Therefore, we consider these spatiotemporal characteristics both when extracting the style features of a formation transformation and when discriminating the generated crowds.
As shown in Figure 2, to generate style-specific crowd data, our model is divided into three key components: style encoder (E), generator (G), and discriminator (D). E extracts the style features of the data: its input is formation transformation data, and its output is the corresponding style features. G generates formation transformations consistent with the style encoded by E; its input is the output of E together with a new set of crowd positions. D identifies whether the data are real or generated; its input is formation transformation data, the style features, and a group of initial crowd positions. The overall structure is a CGAN, [48] which can effectively generate SSCFT data based on the input style vector and crowd positions. We design the style encoder module specifically to extract the style features of formation transformations. The E and G modules compose a conditional autoencoder to ensure the effectiveness of style extraction.
Finally, the results generated by our model can be used in real applications after postprocessing. For this purpose, we develop a simple software tool that allows users to visually select individuals within the keyframes of a formation and modify their positions by moving them up, down, left, or right.

Time and Space Transformer
In the crowd formation transformation task, the trajectory sequences of individuals constitute the temporal features, while the positions of the group at a specific moment constitute the spatial features. Within the computational framework, the layers dedicated to extracting sequential features and spatial features are both based on transformer modules; however, the data are formatted differently on input. We refer to the component responsible for temporal feature extraction as the time transformer, the one dedicated to spatial feature extraction as the space transformer, and the joint application of the two as the space-time transformer. The transformer performs better than conventional neural networks at extracting features from sequential data because it is superior at capturing long-range dependencies. It does not need a fixed sequence length when processing data, and its attention mechanism is independent of the order of the input sequence, making it well suited to crowd data with an unfixed number of people. The style of a crowd formation transformation is determined by the entire trajectory process, which matches the global attention mechanism of the transformer. When extracting spatial features at a specific moment, the number of agents differs between groups and there is no ordering among the agents; this, too, matches the properties of the transformer.

Figure 3 illustrates the distinct processing methods employed by the time transformer and the space transformer for time series and spatial data. Both are multilayer structures. In the time transformer module, the input consists of vectors arranged according to the temporal sequence of trajectories. After positional encoding is added, the trajectory features of each agent are computed in parallel, and the positions at each moment are viewed from a global perspective. The space transformer module processes data similarly, but its input comprises spatial information about the crowd at specific moments. Given that spatial characteristics lack a sequential order, positional encoding is not applied. This approach enables each agent to consider the location information of other crowd members at the same moment; through the transformer's attention mechanism, the influence of other agents on the trajectory of the target agent is captured more effectively.
With the help of the encoder module in the transformer, we construct a space-time transformer model consisting of E, G, and D. The model efficiently extracts time series and spatial position features. The attention calculation for each transformer follows the method of ref. [37]. The detailed architecture of the three modules is demonstrated in the following sections.
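As a rough sketch of this dual use of the encoder (our own illustration, not the authors' released code; the 64-dimensional feature size matches the style embeddings described below, while the toy positional encoding is an assumption), the same module can be applied along either axis of an agents-by-time feature tensor by folding the other axis into the batch dimension:

```python
import torch
import torch.nn as nn

d_model, n_heads, depth = 64, 4, 2  # 4 attention heads and depth 2, as in the paper

def make_encoder():
    layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

time_tf, space_tf = make_encoder(), make_encoder()

n, t = 136, 15                      # hypothetical crowd size and sequence length
feats = torch.randn(n, t, d_model)  # per-agent, per-moment features

# Time transformer: each agent's trajectory is one sequence (batch = agents),
# so a positional encoding is added to mark the temporal order.
pos = torch.arange(t, dtype=torch.float32)[None, :, None] / t  # toy encoding
temporal = time_tf(feats + pos)                                # (n, t, d_model)

# Space transformer: each moment's crowd is one order-free sequence
# (batch = moments); no positional encoding is added.
spatial = space_tf(feats.transpose(0, 1))                      # (t, n, d_model)
```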

Style Encoder
The style encoder maps a given crowd trajectory template to corresponding style embeddings that guide the stylistic transformation of new crowd data. It takes as input the positional coordinates of people and outputs 64-dimensional style embeddings. This process assumes that the trajectories of a portion of the agents are sufficient to represent the overall style of the given formation transformation template, so that the stylistic information of the entire crowd can be embedded. Based on this assumption, we design a style encoder that implements a style-specific transformation of new crowd data according to the style template.
The style encoder structure is demonstrated in Figure 4A. It can be divided into four steps. First, the trajectory position information of the input data is extended from $\mathbb{R}^2$ to a 64-dimensional vector by a multilayer perceptron (MLP) layer. Second, we regard the spatial dimension as the length of the sequence and add a token to the first position of the sequence, so the sequence length of the intermediate features is one plus the number of people in the crowd. The token is an $n \times 64$-dimensional vector filled with zeros, into which the spatial features are aggregated; $n$ is the number of people in the crowd. Third, we send the features to the space transformer to aggregate the spatial information. The space transformer consists of a series of transformer blocks that aggregate the spatial features of each moment to obtain the spatial outputs. Following the idea of refs. [35,36], the spatial sequences of all moments share the same transformer module, and the output of the space transformer does not change the feature dimension. We then take the first position of all spatial features as a spatial vector and flip its spatial and temporal dimensions to obtain a time series with a spatial dimension of 1. Finally, we send it to the time transformer module, which, like the space transformer, is composed of a series of transformer blocks. The output of the time transformer is a $t \times 64$-dimensional vector representing the style features of this crowd transfer process with $t$ moments.
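A condensed sketch of these four steps follows (a minimal reimplementation under our own assumptions; the module widths follow the text, but the zero-token handling and layer details are ours):

```python
import torch
import torch.nn as nn

def make_encoder(d=64, heads=4, depth=2):
    layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class StyleEncoder(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.embed = nn.Linear(2, d)          # step 1: R^2 -> 64-d features
        self.space_tf = make_encoder(d)       # shared across all moments
        self.time_tf = make_encoder(d)

    def forward(self, X):                     # X: (n, t, 2) template trajectories
        h = self.embed(X).transpose(0, 1)     # (t, n, d): one sequence per moment
        token = torch.zeros(h.shape[0], 1, h.shape[2], device=h.device)
        h = torch.cat([token, h], dim=1)      # step 2: prepend the zero token
        h = self.space_tf(h)                  # step 3: aggregate spatial features
        style = h[:, 0, :].unsqueeze(0)       # token slot, flipped to (1, t, d)
        # step 4 (temporal positional encoding omitted for brevity):
        return self.time_tf(style).squeeze(0) # (t, d) style features

style = StyleEncoder()(torch.randn(200, 15, 2))  # -> torch.Size([15, 64])
```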

Generator
The generator maps a new crowd to its SSCFT based on the style template. Its structure is shown in Figure 4B. We use the structure of conditional generative adversarial nets [48] to fit the conditional probability distribution from style and initial positions to trajectories. In the generator, the style vector of a specific crowd formation transformation serves as one part of the input, and the initial positions C of a group of people serve as the other part, so as to generate the overall movement process of this group. In training, we use the first positions of part of the people in the original data as the conditional input, and the number of generated trajectories matches the number of positions in the input C. We again use a single-layer MLP to embed the position information into a fixed-dimensional vector $h^1$ and send it to a space transformer module to incorporate features from other people's positions. After random noise is added, the style feature vector is mapped back to its original dimension by a new MLP layer. Then each person's feature vector $h_i^1$ and the new style vector are concatenated and sent to the time transformer to generate trajectory features in parallel. The random noise is sampled from a standard normal distribution to ensure the variety of the generated results. Finally, an MLP layer maps the features back to $\mathbb{R}^2$. At generation time, the initial position of each person yields a complete trajectory based on the style vector. The model consists of multiple transformer blocks that share weights across different trajectories.
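The following sketch mirrors that data flow (again our own reconstruction; how the per-agent vector and the noise-perturbed style vector are combined over time is an assumption):

```python
import torch
import torch.nn as nn

def make_encoder(d, heads=4, depth=2):
    layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class Generator(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.embed = nn.Linear(2, d)        # initial positions -> h1
        self.space_tf = make_encoder(d)     # let positions attend to each other
        self.style_mlp = nn.Linear(d, d)    # re-project the noised style vector
        self.time_tf = make_encoder(2 * d)  # concatenated per-step features
        self.out = nn.Linear(2 * d, 2)      # back to R^2 coordinates

    def forward(self, C, style):            # C: (m, 2); style: (t, d)
        m, (t, d) = C.shape[0], style.shape
        h1 = self.space_tf(self.embed(C).unsqueeze(0)).squeeze(0)  # (m, d)
        s = self.style_mlp(style + torch.randn_like(style))        # add noise
        seq = torch.cat([h1[:, None, :].expand(m, t, d),           # h1 over time
                         s[None].expand(m, t, d)], dim=-1)         # (m, t, 2d)
        return self.out(self.time_tf(seq))                         # (m, t, 2)

Y = Generator()(torch.rand(136, 2) * 2 - 1, torch.randn(15, 64))   # (136, 15, 2)
```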

Discriminator
The discriminator identifies whether given data constitute a real crowd trajectory with a specific condition and style. Its input therefore consists of three parts: the sequences of crowd movement, the style of the movement, and the initial positions of the crowd. Its structure is shown in Figure 4C.
Unlike the encoder, the discriminator aggregates the features of the temporal and spatial series sequentially. In the discriminative process, two MLP layers harmonize the feature vectors from multiple sources to the same dimension. The former maps the positional coordinate data from $\mathbb{R}^2$ to 64-dimensional embeddings, which are concatenated with the style embeddings. The latter maps the conditional position information from $\mathbb{R}^2$ to 128 dimensions and concatenates it to the sequence as its initial position.
Finally, we send them to the network together. The discriminator must consider two aspects: the trajectory correctness of each individual agent and the correctness of the group trajectories. Therefore, after the time transformer completes, its results are sent to a single-layer MLP that judges the correctness of each agent. At the same time, we perform a space-time transposition on the results of the time transformer and send them to the space transformer to obtain a judgment on the whole.
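A compact sketch of this two-headed critic (our reconstruction; the 64- and 128-dimensional widths follow the text, while the pooling applied before each head is an assumption):

```python
import torch
import torch.nn as nn

def make_encoder(d, heads=4, depth=2):
    layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class Discriminator(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.traj_mlp = nn.Linear(2, d)        # trajectory positions -> 64-d
        self.cond_mlp = nn.Linear(2, 2 * d)    # conditional positions -> 128-d
        self.time_tf = make_encoder(2 * d)
        self.space_tf = make_encoder(2 * d)
        self.agent_head = nn.Linear(2 * d, 1)  # per-agent correctness score
        self.global_head = nn.Linear(2 * d, 1) # whole-crowd score

    def forward(self, Y, style, C):            # Y: (m, t, 2); style: (t, d); C: (m, 2)
        m, t, _ = Y.shape
        h = torch.cat([self.traj_mlp(Y), style[None].expand(m, -1, -1)], dim=-1)
        h = torch.cat([self.cond_mlp(C)[:, None, :], h], dim=1)  # prepend condition
        h = self.time_tf(h)                                      # (m, t + 1, 2d)
        agent_scores = self.agent_head(h.mean(dim=1))            # (m, 1)
        g = self.space_tf(h.transpose(0, 1))                     # swap time/space
        global_score = self.global_head(g.mean(dim=(0, 1)))      # (1,)
        return agent_scores, global_score
```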

Encoder Loss
The overall structure of the style encoder together with the generator is similar to that of an autoencoder, except that additional conditions are added during the generation process. During training, their weights are learned simultaneously. We use the reconstruction loss of the generator to optimize the style encoder, where the input condition of the generator is the initial positions $C_i$ of all people in the original crowd formation. Its loss function is as follows:

$$L_{e1} = \frac{1}{k} \sum_{i=1}^{k} \left\lVert G(E(X_i), C_i) - X_i \right\rVert_2$$

where $k$ is the number of crowd formation transformation style templates and $C_i$ is the initial positions of style template $X_i$.
Considering that crowd formation transformations with similar styles should have similar style vectors, we also design a style loss to penalize the differences between vectors extracted from crowds of the same style:

$$L_{e2} = \frac{1}{k} \sum_{i=1}^{k} \left\lVert E(X_i^{part}) - E(X_i) \right\rVert_2$$

$X_i^{part}$ denotes a subset of people randomly selected from $X_i$, where the difference between the number of people in $X_i^{part}$ and in $X_i$ should not be too large. We assume that $X_i^{part}$ and $X_i$ have the same style.

Adversarial Cost
As mentioned before, our GAN uses the structure of conditional adversarial networks, [48] but our discriminator consists of two parts, global discrimination and individual discrimination, so our adversarial loss also consists of two parts. Through experiments, we found it difficult to obtain ideal results by simply using the training method of the original generative adversarial network, [41] so we adopted the WGAN [49] training method instead. The adversarial cost uses the Wasserstein distance to evaluate the distance between the discriminative distributions of the generated data and the original data. Its loss function is as follows:

$$L_G = \max_{\theta_D} \; \mathbb{E}_{X \sim p_{\mathrm{data}}}\left[ D(X, E(X), C) \right] - \mathbb{E}_{\hat{Y} \sim p_G}\left[ D(\hat{Y}, E(X), C) \right]$$

where $\theta_D$ is the parameter set of the discriminator.
During training, we set the weight $w_1$ for $L_G$.

L2 Loss
For the generated data, we also apply an L2 loss to measure the distance between the generated data and the ground truth:

$$L_2 = \left\lVert G(E(X_i), C_i^{part}) - X_i^{part} \right\rVert_2$$

This formula describes the distance between the data generated from the initial positions of part of the people in the original trajectories and the corresponding original data. During training, we set the weight $w_2$ for this loss.
To increase the generalization ability of the generator, we also bring the style features of the original data and of data generated from initial positions different from the original closer together:

$$L_{G_g} = \left\lVert E\big(G(E(X_i), P^{part})\big) - E(X_i) \right\rVert_2$$

Here $P$ is a set composed of the initial positions $P^1$ of all training data, and we randomly select part of the data in $P$ as $P^{part}$ each time. Thus $P^{part}$ may not come from the initial crowd positions of the corresponding style template, but the style of the crowd scheduling process generated with the same style vector should remain consistent.
Due to the influence of random noise, results generated multiple times with the same inputs should differ. To generate diverse results, we maximize the difference between two draws by minimizing

$$L_{G_d} = -\left\lVert G(E(X_i), C, z_1) - G(E(X_i), C, z_2) \right\rVert_2$$

where $z_1$ and $z_2$ are independent noise samples. The weight of this loss during training is set to 0.5.
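Putting the generator-side terms together, a schematic implementation might look as follows (the loss forms are our reconstructions of the equations above, with `E`, `G`, and `D` as defined in the earlier sketches):

```python
import torch

def generator_losses(G, E, D, X, X_part, C_part, P_other, w1=1.0, w2=1.0):
    """Schematic combination of the generator-side objectives."""
    style = E(X)                                    # style of the full template
    Y1, Y2 = G(C_part, style), G(C_part, style)     # two draws differ only in noise
    l2 = w2 * (Y1 - X_part).pow(2).mean()           # L2: match the selected agents
    agent_s, global_s = D(Y1, style, C_part)
    adv = -w1 * (agent_s.mean() + global_s.mean())  # WGAN: raise both critic scores
    style_match = (E(G(P_other, style)) - style).pow(2).mean()   # L_{G_g}
    diversity = -0.5 * (Y1 - Y2).norm(dim=-1).mean()             # L_{G_d}
    return l2 + adv + style_match + diversity
```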

Experimental Section
The training data in this article are composed of multiple components. One part is sourced from real-world large-scale performance activities, capturing the transformation processes of crowd formations; the other part is created manually. Our goal is to generate SSCFT data using the trained styles and new initial positions as conditions, so we mainly focus on the generated results for the data being trained. We select a total of 15 groups of formation transformation data, in which the number of people per group ranges from 100 to 3000, and the sequence length is set to 15 by default. The number of attention heads in every transformer module is set to 4. Because the optimization follows WGAN, we use the RMSProp optimizer, with the initial learning rate set to 0.0001. During training, the discriminator is trained more often than the style encoder and generator: for every two discriminator updates, the encoder and generator are trained once. The hyperparameters $w_1$ and $w_2$ mentioned above are set to 1 by default.
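In code, that schedule might look like the following skeleton (our own sketch; `E`, `G`, `D`, the data loader, and `generator_losses` are assumed to be defined as in the earlier sections, and the WGAN clipping threshold is an assumption):

```python
import torch

opt_D = torch.optim.RMSprop(D.parameters(), lr=1e-4)
opt_EG = torch.optim.RMSprop(list(E.parameters()) + list(G.parameters()), lr=1e-4)

for step, (X, X_part, C_part, P_other) in enumerate(loader):
    # Critic step: widen the Wasserstein gap between real and generated data.
    style = E(X).detach()                   # freeze E/G during the critic update
    with torch.no_grad():
        Y_fake = G(C_part, style)
    a_r, g_r = D(X_part, style, C_part)
    a_f, g_f = D(Y_fake, style, C_part)
    d_loss = (a_f.mean() + g_f.mean()) - (a_r.mean() + g_r.mean())
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    for p in D.parameters():                # weight clipping, as in WGAN [49]
        p.data.clamp_(-0.01, 0.01)

    if step % 2 == 1:                       # one E/G update per two D updates
        eg_loss = generator_losses(G, E, D, X, X_part, C_part, P_other)
        opt_EG.zero_grad(); eg_loss.backward(); opt_EG.step()
```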

Evaluation Metrics
For the similarity to the original transformation style, we use three metrics to evaluate our results: 1) Average displacement error (ADE): the average L2 distance between the ground truth and our prediction over all predicted time steps.
2) Final displacement error (FDE): the distance between the predicted final destination and the true final destination at the end of the forecast.
3) Chamfer distance (CD): measures the spatial similarity between predicted and ground truth trajectories by calculating the average minimum distance between their respective positions.
ADE and FDE are widely used in trajectory prediction studies. [39,40] Although this article focuses on generation tasks, which are essentially different from prediction tasks, the methods used in training are similar to those of prediction tasks, so we use these two indicators to evaluate the accuracy of our model. CD is typically used to assess the spatial similarity between point clouds, which resembles the spatial similarity of crowd positions. We use it to evaluate the average similarity between the generated crowd and the style template at corresponding keyframes.
To diminish the negative effect of the randomness in the results, we calculate the average ADE, FDE, and CD of 100 generated results.
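The three metrics are straightforward to compute; a minimal sketch follows (our own, with trajectories as tensors of shape (agents, steps, 2)):

```python
import torch

def ade(pred, gt):
    """Average displacement error: mean L2 distance over all agents and steps."""
    return (pred - gt).norm(dim=-1).mean()

def fde(pred, gt):
    """Final displacement error: mean L2 distance at the last time step."""
    return (pred[:, -1] - gt[:, -1]).norm(dim=-1).mean()

def chamfer(pred_frame, gt_frame):
    """Chamfer distance between two crowds at one moment: each point's
    distance to its nearest neighbour in the other set, averaged both ways."""
    d = torch.cdist(pred_frame, gt_frame)        # pairwise distances, (m, n)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```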

Time and Space Transformer Performance Analysis
This section verifies the influence of different transformer depths and different space-time modules on the results. First, we use different transformer depths (1, 2, 4) to evaluate the performance of our model with crowd trajectory sequences of length 15. Table 1 presents the performance of the intermediate transformer modules with varying depths. It is evident that deeper transformer modules exhibit superior performance in terms of the ADE and FDE metrics, indicating that deeper models can generate more precise trajectories. Notably, increasing the model depth from 1 layer to 2 layers yields a significant performance improvement, underscoring the importance of depth for performance. However, when the depth is further increased from 2 to 4 layers, the gains become relatively modest, suggesting that the model has already reached the complexity required by the task, with additional depth offering only marginal improvements. Importantly, the transformer module with a depth of 2 performs best on the CD metric. This observation may be attributed to task-specific characteristics: the CD metric primarily emphasizes simulating spatial features and does not necessitate an excessively deep model. Thus, considering both performance and computational efficiency, setting the depth of the transformer module to 2 in subsequent experiments is a judicious choice, ensuring high performance while optimizing computational efficiency.
In Figure 5, we present a diagram illustrating the results generated using different model structures. This process is relatively challenging to learn because it represents a typical scenario with diverse crowd behaviors, including overall coordination, intermingling of the crowd, and the processes of the crowd splitting and merging. Section (A) of Figure 5 shows an initial formation transformation process with 200 agents. In Sections (B)-(D) of Figure 5, we generate new data in the style of (A) from a set of initial positions different from the original ones: the original transformation process contains 200 agents, whereas we use an initial condition of 136 agents. In Section (B), we generate new data similar to the original formation transformation style using models with different transformer depths. It can be observed that transformer modules with depths of 2 and 4 generate results that are more stylistically similar to the ground truth formation; furthermore, the details of the generated crowds become more precise, which aligns with the evaluation metric results.
RNN models have long been the mainstream method for feature extraction from time series, including improved variants such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks, and some of the latest studies still use them. We believe that RNN models are also effective for temporal feature extraction within our architecture, so we compare the results of temporal feature extraction using different structures in our experiments. Spatial information differs from time series in that feature extraction should be independent of the order of the input individuals; therefore, our model does not use an RNN for comparison in the aggregation of spatial features. The social mechanism used in refs. [35,36] can aggregate spatial features but requires a great deal of memory when the crowd is large, which makes training impossible. We compare the results of using MLP layers and transformer layers in the space module. In the MLP space module, we simply replace the transformer structure with an MLP layer; to handle the varying number of people, we align the data to a uniform dimension by padding.
We use RNN models for temporal feature extraction to verify that our overall architecture is effective. We fix the space transformer and train the model with RNN, LSTM, and GRU time modules, respectively. To verify the performance of the space transformer, we also fix the time transformer and use the MLP space module to obtain results. The results, shown in Table 2, indicate that using the transformer module in the time dimension outperforms the other sequence models and that using the space transformer improves the effect. Section (C) of Figure 5 shows the results generated by the different time series feature extraction methods. The experiments show that RNN models can also generate visually acceptable effects.
That different feature extraction methods all produce usable results indicates the effectiveness of the model architecture we designed. By contrast, however, the global attention mechanism of the transformer better captures time series details. In Section (D) of Figure 5, we show the result of using MLP layers for the space module. The experiments show that MLP models can also generate visually acceptable effects, but the results are worse than with transformers. This demonstrates the effectiveness of our space-time transformer structure.

Effect of Trajectory Length on Performance
In this section, we investigate the impact of different sequence lengths on model performance. Initially, we interpolate the sequence lengths for each group to 10, 15, and 20 based on their migration processes and then train the model to explore the outcomes.
Table 3 presents the results of these experiments. The performance under different sequence lengths shows little difference across the evaluation metrics, which may be related to the distribution of our data. Examining the evaluation metrics of each group of data in detail, we find that the values are far apart for different groups, suggesting that the generated results are related to the data distribution. Our analysis indicates that using shorter sequences for crowd formation transformation data with larger spatial spans may fail to capture their characteristics effectively: in practice, small changes in the generated transformation data cause great differences in the actual trajectories, so the generated crowd data retain only the movement trend of the original data. We therefore retrain with more fine-grained crowd trajectory sequences on the data groups where $\max(x_{\max} - x_{\min},\; y_{\max} - y_{\min}) > 180$ and where $\max(x_{\max} - x_{\min},\; y_{\max} - y_{\min}) < 60$, respectively, and report ADE, FDE, and CD on these data. Figure 6 shows the results.
The experimental results reveal that for data groups with extensive spatial spans, an increase in trajectory sequence length corresponds to decreasing values of ADE, FDE, and CD. Furthermore, this trend continues as the sequence length grows, indicating an enhancement in model performance with longer sequences. However, such a phenomenon is not observed in data groups characterized by smaller spatial spans. Our experiments show that the spatial scale of the crowd formation transformation should be considered in practice, and when the data span is large, a finer-grained time series division should be used.

Ablation Experiment
Various loss functions play essential roles in our architecture design. In this section, we aim to verify the significance of each loss term. Specifically, $L_{e1}$ and $L_{e2}$ serve the purpose of enhancing the feature and style extraction capabilities of the style encoder. These are fundamental components that we do not intend to ablate.
In contrast, we analyze the influence of $L_2$, $L_G$, and $L_{G_d}$, which control the accuracy and diversity of the generated outcomes, and perform ablation experiments separately for these loss terms. While assessing their impact, we compute the averages of 100 generated results for ADE, FDE, and CD, along with the standard deviations of ADE and FDE (denoted vADE and vFDE), which provide insight into result diversity. We omit the standard deviation for CD since its values were very small, suggesting that results generated multiple times remain spatially close to the original data. The smaller the values of ADE, FDE, and CD, the more similar the generated result is to the original transformation process; the larger the values of vADE and vFDE, the better the diversity of the generated results. The results are presented in Table 4. The experiments demonstrate that the $L_2$ loss effectively improves the accuracy of the generated results: when we minimize the $L_2$ loss, the model more easily generates trajectories that closely resemble the original data. However, this also introduces a significant drawback, namely limited diversity. The adversarial cost $L_G$ places greater emphasis on the diversity of generated results: by encouraging the model to produce distinct trajectories, $L_G$ enhances variety, albeit at the expense of some accuracy. Additionally, $L_{G_d}$ can further amplify the diversity of the generated outcomes, with a minor reduction in accuracy. In Figure 7, the diversity of our approach is illustrated by three different results generated from the same scene, shown in the same frame. While the overall arrangement of individuals remains consistent, there are subtle yet distinct differences in the agents' positions. Such variations are evident at the detail level, illustrating our model's capability to produce similar yet distinct outcomes. For certain tasks, all these variations represent acceptable and practical results.
To further investigate the impact of the L2 loss and the adversarial cost on the diversity of generated results, we conduct experiments varying the weights $w_1$ and $w_2$. The results, presented in Table 5, show that our model can increase result diversity while maintaining style similarity: increasing the weight of the adversarial cost yields more diverse outcomes, while increasing the weight of the L2 loss makes the generated results more similar to the original data.
$L_{G_g}$ aims to improve the accuracy of generating specific styles. In testing our model, we observe that the generator is highly sensitive to two key factors: the style vector and the initial conditions. In particular, when we train multiple crowd formation transformation templates simultaneously, the generated content may erroneously match templates that do not belong to the intended style.
To evaluate the impact of $L_{G_g}$, we construct a crowd set containing the first-frame crowd positions from all crowd formation transformation templates. We then randomly sample between 80 and 800 distinct crowd positions as initial conditions and generate results for all style templates in the training set. We classify each generated result according to the template with the lowest CD value to assess accuracy. Conducting ablation experiments on $L_{G_g}$ with 100 runs for each case, we observe a significant increase in correctness rates: the use of $L_{G_g}$ raises the correctness rate from 51.1% to 93.6%. This demonstrates a substantial improvement in the success rate of generating the desired results with $L_{G_g}$.

Efficiency Improvement
To evaluate the efficiency improvements of our method, we use real performance data and transform it into a new formation scheduling scheme under the guidance of professional choreographers. First, we adopt a baseline method by simply editing the formation data. Then, five volunteers with a background in computer science, using the baseline method's time as a reference, apply our method, the posture recognition-based method proposed in ref. [10], and the image sampling-based method presented in ref. [5] to adapt the existing formations to the new scheme. We evaluate the efficiency improvement by comparing the time spent against the baseline time. The results show that the posture recognition-based method, the image sampling-based method, and our method require 75%, 63%, and 26% of the baseline time, respectively. This indicates significant potential for improving efficiency with our approach. Figure 8 illustrates a scene used in this experiment, a performance from the opening ceremony of the 2022 Beijing Winter Olympics. The migration of these individuals is selected from the results generated by our system, with minor adjustments made to collision situations for a small number of individuals.

Conclusion
A novel network structure has been proposed for the SSCFT problem, comprising a style encoder, generator, and discriminator. For the first time, generative adversarial networks are applied to formation transformation, with the space-time transformer module extracting the style features of the crowd formation. We build the style encoder to extract the original crowd transformation style, then combine the style with a new group's initial crowd positions to generate a new SSCFT. The results show that the model is feasible for SSCFT transfer.
The similarity with the original data and the diversity of the generated results can be controlled by adjusting the hyperparameters in the loss function, indicating that the GAN has good extensibility on SSCFT problems, and that more possibilities can be created by constructing further loss functions in the future.
In conclusion, it is important to acknowledge that while our model demonstrates promising results in generating SSCFT, it still exhibits certain limitations in practical applications. A key challenge lies in the model's learning of microinteractions, which is not yet fully refined; this often necessitates postprocessing to achieve the desired results, especially in scenarios where detailed crowd dynamics are crucial. Moving forward, we are committed to addressing these limitations, particularly by enhancing the model's capability to handle intricate micro-level interactions, thus broadening its applicability across various tasks. Furthermore, we believe that our method holds significant potential for a variety of domains, especially crowd simulation tasks. We are eager to collaborate with more researchers in this field to further explore and refine our approach, ensuring it meets diverse application needs.

Figure 1 .
Figure 1. Generated transformations similar to template styles, using initial conditions with various formations and numbers of people. A) Shape-based formation transfer, B) irregular formation scaling, C) pattern formation scaling.

Figure 2 .
Figure 2. Overview of our framework for generating SSCFT.

Figure 3 .
Figure 3. The transformer module handles time series data and spatial data differently. The input for the time transformer consists of individual trajectory information, while the input for the space transformer comprises group position information at a specific moment.

Figure 4 .
Figure 4. The network structure of our model. It consists of three key components: A) Style encoder, B) Generator, and C) Discriminator.

Figure 5 .
Figure 5. Generated results with different transformer depths, time modules, and space modules. (A) is the initial crowd formation transformation, which includes 200 people. (B), (C), and (D) are SSCFTs consistent with the style of (A), generated from 136 people's initial positions using different transformer depths, time modules, and space modules. (Parts that we argue to be flawed are marked with red lines.)

Figure 6 .
Figure 6. The performance for different sequence lengths and space ranges.

Figure 7 .
Figure 7. Triple-generated results in the same frame.

Figure 8 .
Figure 8. Application of the results generated by our model after postprocessing.

Table 1 .
Performance in different transformer depths.

Table 2 .
Performance with different feature extraction methods.

Table 3 .
Performance with different trajectory sequence lengths.

Table 4 .
Effect of each loss function on model performance.

Table 5 .
Comparison of similarity and diversity under different hyperparameters.