Virtual‐real‐fusion simulation framework for evaluating and optimizing small‐spatial‐scale placement of cooperative roadside sensing units

Roadside sensing units’ (RSUs) perception capability may be substantially impaired by occlusion issues even when they work cooperatively. However, the joint influence of static and dynamic occlusions in real‐life situations remains inadequately considered in optimizing RSUs’ placement. This study proposes a virtual‐real‐fusion simulation (VRFS) framework that combines traffic simulation and point clouds of real‐world road environments to optimize RSUs’ deployment. Point clouds and triangular meshes are used to model static and dynamic obstacles, respectively. A structure‐retained spherical projection method is developed to efficiently emulate RSUs’ data collection. Based on the developed VRFS, probabilistic occupancy maps (POM) are created to represent traffic scenarios. The POM‐based cross entropy (CE) is proposed as a surrogate metric for evaluating the detection performance of cooperative RSUs. A Bayesian optimizer is applied to optimize the RSUs’ placement parameters (decision variables) by minimizing CE. Test results show that it is viable to use the POM‐based CE as a proxy for evaluating cooperative RSUs’ sensing performance. Considering the occlusion effect adds to the efficacy of POM‐based CE as a surrogate metric. Compared with traffic volume, the adverse effect of the proportion of large vehicles on RSUs’ detection performance is more significant. There are no significant patterns regarding how the optimized RSU positions vary with traffic parameters. The comparisons with existing methods further verify the importance of considering both static and dynamic occlusions in optimizing RSUs’ placement. Besides, the proposed method can yield better optimization results more efficiently than existing approaches.


Background
Many studies in the transportation field are built on the premise that vehicles are fully connected and all traffic agents' motion data are available based on vehicle-to-everything technology. The assumption of a connected environment brings about many new potentials. Ding et al. (2022) and Ma, Yu, et al. (2023) optimized traffic signals at isolated intersections in a connected traffic environment. Based on the high-quality and fine-grained path data provided by connected and autonomous vehicles (CAV), Shi, Zhou, et al. (2022) and Shi, Nie, et al. (2022) proposed deep reinforcement learning methods to control CAV and buses, respectively, in a distributed manner. Likewise, Li et al. (2023) developed a deep reinforcement learning-based model to control CAV platoons at signal-free intersections.
However, it has been recognized that a long transition period, entailing vigorous development of the cooperative vehicle-infrastructure system (CVIS), is required before CAV substantially penetrate the market (Xu, Li, et al., 2022). An essential role of CVIS is capturing motion data of multiple agents on the roads, which serve as the primary inputs for intelligent traffic management and control systems (Chen, Dong, et al., 2021; Shi et al., 2021). To achieve this purpose, roadside sensing units (RSUs), a fundamental part of CVIS, need to be deployed to enhance the perception capability of infrastructures.
Over the past decades, placing various RSUs to collect traffic data for estimating traffic state (e.g., congestion; Adeli & Ghosh-Dastidar, 2004) and detecting incidents (Karim & Adeli, 2003) has been a fundamental part of constructing intelligent transportation systems. Although light detection and ranging (Lidar) sensors, cameras, and millimeter-wave radars are all used as RSUs for various transportation applications, Lidar is currently the primary sensor supporting cooperative detection on a small scale (Gouda et al., 2021; Meng et al., 2023). Video data collected by cameras are essentially image sequences, which are two-dimensional (2D). Despite the remarkable advances in image processing, it is very challenging to accurately map video data from different cameras into a unified three-dimensional (3D) space for seamless traffic monitoring. Multiple millimeter-wave radars can cooperatively sense and track traffic agents within a specific road area, whereas the low-resolution data captured by radars make it challenging for them to identify static or small targets (e.g., pedestrians) (Zhang et al., 2021). Compared with cameras and radars, Lidar sensors can sense objects in a fully 3D manner and obtain high-resolution point clouds. Besides, different Lidars can work cooperatively to better monitor traffic situations (Kloeker et al., 2020). Even when Lidars are fused with radars or cameras, Lidars are the fundamental sensors for positioning traffic agents (Caltagirone et al., 2019). Therefore, this study emphasizes Lidar sensors as RSUs.

Problem context
Similar to human eyes, RSUs cannot see through nontransparent substances. In this regard, static roadside obstacles (e.g., building façades) may obstruct RSUs' view at road intersections or curved road sections. Besides, a moving vehicle may be continuously invisible to an RSU because of the presence of a large-size vehicle, as shown in Figure 1. Therefore, RSUs' sensing capability may be impaired by both static and dynamic occlusions.
Although several RSUs can cooperate to reduce blind spots, the monitoring performance of cooperative RSUs may be unsatisfactory if the RSUs are not well placed. For instance, many traffic agents may go undetected if RSUs are not placed high enough at an intersection where the proportion of large vehicles is high. Note that large vehicles (e.g., buses) in this study refer to vehicles longer than 8 m and higher than 2 m. In that case, the normal functions of CVIS, such as risk monitoring, cannot be guaranteed.
The research community has recognized the importance of optimizing RSUs' placement toward better sensing performance within a limited budget (Jin et al., 2022). The general process of optimizing RSUs' configurations in existing studies is similar and comprises three main parts: (I) scenario definition, (II) emulation of sensor data collection, and (III) performance evaluation. Part I defines the virtual environment in which RSUs need to be placed. Part II parameterizes the configurations of RSUs and associates them with the sensor modeling. An evaluation metric is used in Part III to quantify the performance of cooperative sensing. As such, an optimization procedure can be established using the RSUs' configuration parameters and the evaluation metric as the decision variables and the objective, respectively.
Unlike road infrastructures in the design phase, the environments of existing or as-built roads are much more complex, which complicates the RSU placement optimization problem. In this regard, although many research efforts have been expended on the RSU placement problem, there is still a lack of an effective method for optimizing RSUs' deployment at existing road infrastructures.

Related work
RSUs are essentially a type of visual sensor. Therefore, in addition to the studies focusing on roadside Lidars' deployment optimization, the literature related to the placement of visual sensors in other scenarios is also reviewed in this section. Because of the high cost of RSUs' procurement and installation, this study focuses on placing RSUs at a small spatial scale. Therefore, studies related to the configuration of RSUs at a road network level (e.g., Li et al., 2020; Kang et al., 2022) are not discussed. As previously mentioned, optimizing RSUs' deployment involves three main methodological parts, each of which may significantly affect the optimization results. Besides, the scenario definition is closely related to the sensor emulation. Therefore, the literature review is divided into two subsections and the main gaps in existing studies are discussed separately.

Scenario definition and sensor modeling
Based on the dimension and fineness of the scenario and sensor models, current studies in this research domain can be generally categorized into three types: 2D, 3D, and 3D+. The features of each type of study are summarized in Table 1.
Regarding the definition of scenarios, targets and obstacles are two fundamental components for optimizing visual sensors' placement. Specifically, targets denote the objects (or areas) to be detected, while obstacles refer to the objects that may cause occlusions and thus reduce sensors' sensing capability. To reduce computational complexity and costs, either targets or obstacles are represented by simple geometry, such as 2D polygons or 2D grids, in many studies.
In Hiromori et al. (2013), obstacles and targets were represented by 2D blocks and 2D regions where pedestrians may appear, respectively. The placement of sensors was optimized to maximize the coverage of pedestrian areas. Yang et al. (2018) used 2D polygons to represent the layout of construction sites and optimized the placement of surveillance cameras toward better coverage of the construction areas. Similarly, Altahir et al. (2018) and Suresh et al. (2020) used 2D shapes to model buildings as obstacles in an urban context and optimized multicamera placement to improve the coverage of public areas.
In 2D studies, sensor models are commonly represented by triangles or fan shapes originating at sensor positions. As such, the detection accuracy can be quantified by estimating to what extent target areas overlap with the sensors' field of view (FOV). Two-dimensional approaches played an essential role during the early stages when computational power was limited and are still used in solving the placement optimization of multicamera systems (Suresh et al., 2020). However, the simplified representation of scenarios and models may compromise the accuracy of optimization results. Besides, placing visual sensors in 2D settings cannot handle optimization problems in which the sensor heights matter. Owing to the deficiencies of 2D methods, researchers are paying increasing attention to optimizing visual sensors' deployment in 3D environments.
Compared to 2D approaches, 3D methods can take more sensor configuration parameters into account, such as height, pitch, and roll angles, and are thereby more capable of tackling complex optimization problems.
To place monitoring sensors optimally on a 3D structure, Dybedal and Hovland (2017) divided the volume of interest into numerous small cubes. Visual cameras were modeled as 3D cones with limited FOV and ranges. As such, the detection performance was measured by estimating whether small cubes fall inside any sensor's FOV. A mixed-integer linear programming-based framework was constructed to optimize the cameras' configurations to ensure the volume of interest was seamlessly covered. Zhang et al. (2019) found it necessary to optimize cameras' placement in a 3D space for monitoring metro station construction sites. Therefore, user-defined 3D meshes were employed to model sidewalls and supporting structures that may obstruct the cameras' view. In Zhang et al. (2019), the target monitoring areas were represented by discrete points. Given a target point, a line segment connecting it with the camera was constructed. By analytically solving the intersection between the line segment and rectangular meshes, the target point's visibility to cameras can be estimated.
Because of the user-defined 3D geometry's inflexibility in modeling diverse scenarios, the majority of 3D studies optimized visual sensors' placement on simulation platforms, including Autodesk Revit (Chen et al., 2021; López et al., 2023), Autodesk CAD (Rougeron et al., 2022), RoadRunner (Vijay et al., 2021), Unity (Jin et al., 2022), Unreal (Hermann et al., 2022), and CARLA (Cai et al., 2023; Qu et al., 2023), as noted in Table 1. The advantages of simulation software are twofold: (1) users can create customized scenarios by assembling 3D models provided in the software, and (2) the built-in ray-casting algorithms support dynamic capturing of sensor data. The workflow of simulation-based optimization involves three steps. First, 3D scenes are created interactively in the simulation environment. Targets are predefined in this phase as well. Second, sensor simulators are activated to evaluate how well the targets are perceived by sensors. A surrogate evaluation metric is commonly used at this step to quantify the sensing performance. Third, optimization problems are formulated by setting the surrogate metric as the objective function. Different approaches, such as genetic algorithms (Chen et al., 2021; López et al., 2023), integer programming (Qu et al., 2023; Vijay et al., 2021), and evolutionary strategies (Hermann et al., 2022; Rougeron et al., 2022), are applied to solve the optimization problems.
Although 3D approaches can produce more accurate optimization results than 2D approaches, current 3D methods for optimizing cameras' or Lidars' placement work in hypothetical scenes and may fail to precisely map real-world environments. Specifically, real-world scenes are commonly more complex than those in simulation platforms. In this regard, the simulation-based optimization results may be unreliable because of domain shift (Huch et al., 2023; Ma, Yu, et al., 2023).
To address this issue, the components in the virtual space should be carefully handcrafted to create digital twins of the real-world environment. However, manually creating realistic scenarios using simulation software is very costly and time-consuming (Fang et al., 2020; Li et al., 2019). Three-dimensional point clouds collected on-site enable a precise depiction of real-world scenarios and are easy to update (Gouda et al., 2021). Therefore, using point clouds of real-world environments as optimization settings is gaining increasing acceptance in 3D+ studies.
Studies on the placement of wireless sensors first exploited 3D point clouds in optimizing the topology of wireless sensor networks (WSN). In analogy to visual sensors, signals between two wireless sensors can be significantly affected by obstacles. In this case, the assumption in previous studies of a fixed isometric transmission range among WSN nodes may fail in complex terrain areas. Therefore, Demetri et al. (2019) and Oroza et al. (2021) used a Lidar-derived digital terrain model as the background of the WSN.
To optimize the configuration of a multicamera surveillance system, 3D point clouds were used to depict the indoor environment in Malhotra et al. (2022). They concluded that 3D point cloud data acting as the optimization background can better address cameras' placement in real environments. Likewise, Ma et al. (2022) used 3D point clouds to represent the road environment of constructed highways. A deep learning (DL)-enabled virtual scanning method was proposed to evaluate the joint coverage of road areas by roadside Lidars. As such, various RSU placement schemes can be quantitatively compared in virtual scenes modeled by point clouds.
As mentioned in Section 1.2, either static or dynamic occlusions may impair RSUs' sensing capability, whereas the overwhelming majority of existing studies only considered static obstacles, as noted in Table 1.
Some researchers have started to explore the influence of dynamic occlusions on RSUs' placement. Du et al. (2022) established an analytical model that projects 3D cuboids representing vehicles onto the camera plane. Based on the extent to which different vehicles overlap on the plane, an occlusion degree model (ODM) was constructed to quantify the detection performance of RSUs. Extensive simulations were conducted to reveal the mathematical relationship between placement schemes and detection performance. Nonetheless, their method worked in a simplified environment where static obstacles and road geometry were not considered. Jin et al. (2022) utilized the ray-casting function in Unity3D to simulate the Lidar sensor. The detection performance of a roadside Lidar was investigated under different sensor configurations. However, they only considered a single sensor, and their study did not involve the optimization of sensor placement.
To take both static and dynamic occlusions into account, Ma, Liu, et al. (2023) developed a virtual method for optimizing roadside Lidars' placement in which both the road environment and traffic agents were modeled by point clouds. A coarse-to-fine subsampling algorithm was proposed to emulate the process of collecting sensor data under the joint impacts of static and dynamic obstacles. However, traffic agents in their study were driven by trajectory data collected on-site. In this regard, the method is not flexible enough to handle diverse traffic scenarios, because high-accuracy trajectory data are unavailable in many cases.

Evaluation metric
The evaluation metric is used to quantify how well sensors are deployed, which is indispensable when optimizing sensors' placement. Over the past 5 years, the coverage of targets, though evaluated in different forms, has been the most frequently used metric for optimizing sensors' configurations.
The coverage-based measure works well in studies considering static occlusions, but may not handle dynamic occlusions.Road traffic agents may appear at random, leading to the high uncertainty of dynamic occlusions.
As previously mentioned, Du et al. (2022) used the ODM to describe dynamic intervehicle occlusions. In this regard, an ODM threshold determines whether a traffic agent is detectable. However, their method substantially simplified road models and did not consider static occlusions, which makes it challenging to address sensors' placement on existing roads with complex environments.
In Ma, Liu, et al. (2023), the recall rates of traffic agents across multiple time steps were averaged to measure the detection performance of RSUs. As previously mentioned, the road environment and traffic participants were both represented by point clouds, which results in a large volume of point cloud data. In this regard, though both static and dynamic occlusions were incorporated in the computation of recall rates, the estimation process was very time-consuming (hundreds of seconds per iteration) even though a parallel computing technique was applied.
To accommodate the uncertainty of occlusions, researchers in the intelligent vehicle (IV) domain have used information theory-based metrics to configure on-board visual sensors. Ma, Liu, et al. (2021) voxelized the targets to be detected within the sensors' perception space. The distribution of voxel states (i.e., occupied or void) was estimated using real-world data. Since sensors cannot completely perceive all voxels within their FOV, the conditional entropy of the ground-truth voxel states given the sensor measurements was adopted as the surrogate metric.
Hu et al. (2022) converted an IV's FOV into probabilistic occupancy grids in which each grid cell stores its probability of being occupied. The information gain (IG) of occupied grid cells for a specific sensor configuration was used as a surrogate evaluation metric. A higher IG means that a sensor configuration can reduce more uncertainty. As such, the perception capability can be improved by maximizing IG. However, their method works only in simulated scenarios.
Besides, configuring sensors on an IV differs from placing RSUs on existing roads. In addition to dynamic occlusions, the complex static obstacles in real-world environments may substantially affect RSUs' perception performance. However, neither Ma, Liu, et al. (2021) nor Hu et al. (2022) considered static occlusions when computing the surrogate metric.
Inspired by studies in the IV domain, Jin et al. (2022) computed the number of points within each voxel (called PVPC) to represent the voxel's state. The maximum density gain based on the PVPC was adopted as the surrogate metric to evaluate the placement of RSUs. However, though Jin et al. (2022) focused on RSUs' placement, they also overlooked the occlusions caused by static objects, like Du et al. (2022).

Motivations and objectives
Based on the literature review, the following gaps are identified in existing studies:

1. when defining scenarios for RSUs' placement, an approach is needed that maps the real-world road environment while retaining the flexibility to simulate diverse traffic scenes;
2. an efficient sensor model of RSUs is needed when accurate and dense 3D data are used to depict real-world road objects; and
3. there is a lack of an evaluation metric for optimizing RSUs' placement that considers both static and dynamic occlusions.
To fill the gaps, this study aims to combine traffic simulation and 3D data of real-world road environments and to propose a computer-aided virtual-real-fusion simulation (VRFS) framework for optimizing RSUs' deployment at existing roads.

Overview
The proposed VRFS framework comprises four main steps (see Figure 2). First, road models in SUMO (Lopez et al., 2018) are constructed using the driving lines extracted from point clouds of real-world road scenes. Traffic parameters can be configured in this phase to simulate different situations. Second, traffic agent models controlled by the traffic simulator are fused with the point clouds representing the road environment. As such, virtual-real-fusion traffic scenarios can be created. Third, a sensor model is developed to emulate the data collection of RSUs, based on which the performance of cooperative detection can be quantitatively evaluated using the intersection over union (IoU) metric.
Because estimating the average IoU involves long simulations, IoU is not directly used as the optimization objective. Based on the developed VRFS, probabilistic occupancy maps (POM) are generated to represent traffic scenarios (called POM-oc). Taking the occlusion effect into account, the POMs that can be observed by RSUs are denoted as POM-ob. Therefore, a surrogate metric that measures the difference between POM-oc and POM-ob is proposed for evaluating the detection performance of cooperative RSUs in the fourth step. Note that in this phase, candidate RSU positions are generated prior to formulating the optimization model. As such, the search space of the subsequent optimization problem can be substantially reduced. Setting the surrogate metric as the objective, the Bayesian optimizer is employed to optimize RSUs' placement. Each step is described in detail as follows.
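The idea of comparing POM-oc and POM-ob can be illustrated with a minimal sketch. Assuming both maps are stored as arrays of per-cell occupancy probabilities (the exact CE formulation used in this study may differ), a binary cross entropy between the two maps can be computed as follows; lower values indicate that the observed map reproduces the underlying scenario more faithfully:

```python
import numpy as np

def pom_cross_entropy(pom_oc, pom_ob, eps=1e-9):
    """Binary cross entropy between the ground-truth occupancy map
    (pom_oc) and the map observable by the RSUs (pom_ob).

    Both inputs are arrays of occupancy probabilities in [0, 1].
    Probabilities are clipped away from 0 and 1 to keep the
    logarithms finite.
    """
    p = np.clip(np.asarray(pom_oc, dtype=float), eps, 1.0 - eps)
    q = np.clip(np.asarray(pom_ob, dtype=float), eps, 1.0 - eps)
    ce = -(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))
    return float(ce.mean())
```

When the observed map equals the ground-truth map, the metric reduces to the map's own entropy, which is its minimum over all observed maps; any occlusion-induced discrepancy increases it, which is what makes it usable as a minimization objective.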
Current studies regarding RSUs' placement address two types of problems: (1) FIX problem, which aims to maximize RSUs' perception performance given a fixed number of RSUs, and (2) MIN problem, which aims to minimize the number of RSUs given a target sensing performance.This study mainly addresses the FIX problem.
The VRFS framework can be divided into four computational layers, as illustrated in Figure 3. The links between steps and layers are presented as well. The contents in each brace indicate the data associated with the corresponding layers. The first layer is the 3D road environment. All computations related to real-world data processing, such as extraction of the road surface, delineation of driving lines, and generation of candidate RSU positions, are conducted in this layer. The second layer is mainly related to traffic simulations, which control the volume and composition of traffic flows. Using traffic agents' kinematic information in layer 2, triangulated meshes representing 3D agents are merged into the point cloud data, and thereby 3D traffic scenarios are created in layer 3. In this layer, the ground truth bounding box (BBox) and POM are also estimated. The sensor models in layer 4 can virtually and cooperatively detect agents and evaluate the average IoU.

Road reconstruction
The driving lines can be either semiautomatically extracted (Holgado-Barco et al., 2015) or manually delineated. It would be more desirable to have high-definition maps, which contain rich road information including driving lines and their topology.
SUMO is an open-source, highly portable, microscopic traffic simulation software. Users can create customized road models and simulate diverse traffic scenarios. SUMO supports different input formats of road networks, including plain XML, OpenDRIVE, and OSM. More importantly, one of the striking features of SUMO is that users can flexibly adjust the dimensions of traffic agents. Therefore, this study uses SUMO as the traffic simulator. Note that other traffic simulators can also be used if an application programming interface is provided.
The driving lines can be partitioned into endpoints and edges and written into an extensible markup language (XML) file following the required syntax (Lopez et al., 2018). The road model can be created by importing the XML file using the "NETEDIT" tool of SUMO. Manual adjustments are recommended to refine the road model when the automatically generated junction shapes are unsatisfactory.
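As a rough illustration of the required syntax, the snippet below writes hypothetical endpoints and edges into SUMO's plain-XML node and edge files (the file names, IDs, coordinates, lane count, and speed are made up for the example):

```python
# Hypothetical endpoints (id, x, y) and edges (id, from, to) that would
# be derived from the extracted driving lines.
nodes = [("n0", 0.0, 0.0), ("n1", 150.0, 0.0)]
edges = [("e0", "n0", "n1")]

# SUMO plain-XML node file.
with open("lines.nod.xml", "w") as f:
    f.write("<nodes>\n")
    for nid, x, y in nodes:
        f.write(f'    <node id="{nid}" x="{x}" y="{y}"/>\n')
    f.write("</nodes>\n")

# SUMO plain-XML edge file; numLanes and speed (m/s) are example values.
with open("lines.edg.xml", "w") as f:
    f.write("<edges>\n")
    for eid, src, dst in edges:
        f.write(f'    <edge id="{eid}" from="{src}" to="{dst}" '
                f'numLanes="2" speed="13.9"/>\n')
    f.write("</edges>\n")
```

The two files can then be converted into a network with SUMO's netconvert tool (e.g., `netconvert --node-files lines.nod.xml --edge-files lines.edg.xml -o road.net.xml`) or opened in NETEDIT for manual refinement.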
Once the road model is built in SUMO, multiple simulations can be conducted by configuring traffic parameters. This study gives emphasis to motorized traffic. Although SUMO supports the simulation of multimodal traffic, its modeling of interactions between vehicles and vulnerable road users is unsatisfactory. Therefore, if pedestrians and cyclists are included in the analyses, other commercial software such as Vissim is more desirable.

Virtual-real-fusion traffic scenarios creation
The general process of creating virtual-real-fusion traffic scenarios is shown in Figure 3a. SUMO does not provide built-in 3D models of traffic agents. Therefore, a bank of 3D agent models is constructed, in which the different models are obtained from open-access online resources. Some examples of agent models are displayed in Figure 3b. Note that the agent model library can be expanded to increase the diversity of traffic scenarios. To accurately map agents in the 2D simulation environment to the 3D space, all agents' sizes should be set according to those in the agent model library when configuring traffic parameters for simulations.
Road environments in the real world are 3D, while the simulated trajectory data in SUMO are 2D. Therefore, the ground is first extracted from the point cloud data using a pillar-based filtering method (Ma, Easa, et al., 2021). In Ma, Easa, et al. (2021), raw point clouds of the road environment are partitioned into pillars with specified grid sizes along the x-axis and y-axis. Then, pillar-wise filtering is conducted by removing the points higher than a given threshold measured from the bottom of each pillar. Finally, a connectivity and similarity-based clustering technique is used to extract the ground area from the remaining points. As shown in Figure 4a, a digital elevation map (DEM) is created based on the extracted ground points. As such, the elevation of each agent at any simulation step can be obtained with linear interpolation.
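A minimal sketch of the DEM step, assuming the extracted ground points are given as an (N, 3) array: heights are averaged into a regular grid to form the DEM, and an agent's elevation is then looked up from the grid. A nearest-cell lookup is used here for brevity, whereas the paper uses linear interpolation; function names and the cell size are illustrative.

```python
import numpy as np

def build_dem(ground_pts, cell=0.5):
    """Average ground-point heights into a regular grid (a simple DEM).

    ground_pts: (N, 3) array of extracted ground points (x, y, z).
    Returns the DEM grid and its origin/cell size for later lookups.
    """
    x0, y0 = ground_pts[:, 0].min(), ground_pts[:, 1].min()
    ix = ((ground_pts[:, 0] - x0) / cell).astype(int)
    iy = ((ground_pts[:, 1] - y0) / cell).astype(int)
    z_sum = np.zeros((ix.max() + 1, iy.max() + 1))
    cnt = np.zeros_like(z_sum)
    np.add.at(z_sum, (ix, iy), ground_pts[:, 2])  # accumulate heights per cell
    np.add.at(cnt, (ix, iy), 1)
    dem = np.divide(z_sum, cnt, out=np.zeros_like(z_sum), where=cnt > 0)
    return dem, (x0, y0, cell)

def agent_elevation(dem, origin, x, y):
    """Nearest-cell elevation lookup for a simulated agent position."""
    x0, y0, cell = origin
    return dem[int(round((x - x0) / cell)), int(round((y - y0) / cell))]
```

In practice the 2D SUMO position of each agent at every simulation step would be passed through `agent_elevation` (or a linear interpolant over the same grid) to lift it into 3D.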
Triangulated meshes are used to represent 3D agents in this study. A mesh model is made up of many faces, and each face is formed by three vertices. The common data format of meshes is illustrated in Figure 4c. The coordinates and the indices of the vertices that form a triangulated face are stored separately. Usually, other information such as texture is stored along with the indices.
In the library of agent models, each agent is described with respect to its own local coordinate frame, as shown in Figure 4c. Let the position of agent k at the t-th time step be given by the traffic simulator. To align agent k with the road background, a coordinate transformation is required, as in Equation (1). For each agent model, the vertex indices are defined with respect to its local vertex set. In this case, the aligned agent mesh models are discrete in essence. To facilitate the subsequent emulation of RSUs' virtual detection mechanism, the discrete mesh models are merged into a single mesh by reindexing vertices via Equation (2).
where k = sorted order of agents by increasing ID, n_k = number of vertices of the k-th agent in the traffic scenario, and i_loc, i_glob = local and global vertex indices of the k-th agent, respectively. Note that the vertices of each agent mesh are reindexed into the entire vertex set at each time step.
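The alignment and merging steps above can be sketched as follows, with a simple heading-based pose as a stand-in for the transformation in Equation (1) and a vertex-index offset as a stand-in for the reindexing in Equation (2); the function names are illustrative:

```python
import numpy as np

def place_agent(verts, pos, heading):
    """Rotate an agent's local vertices by its heading (about the
    z-axis) and translate them to its simulated position -- a
    stand-in for the coordinate transformation of Equation (1)."""
    c, s = np.cos(heading), np.sin(heading)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return verts @ R.T + np.asarray(pos, dtype=float)

def merge_meshes(agents):
    """Merge per-agent (vertices, faces) pairs into one mesh by
    offsetting each agent's local face indices into the global
    vertex set -- a stand-in for the reindexing of Equation (2)."""
    all_v, all_f, offset = [], [], 0
    for verts, faces in agents:
        all_v.append(np.asarray(verts, dtype=float))
        all_f.append(np.asarray(faces) + offset)  # local -> global indices
        offset += len(verts)
    return np.vstack(all_v), np.vstack(all_f)
```

Because each agent's face indices are shifted by the total vertex count of the agents before it, the merged face array references one global vertex set, which is what the virtual detection step operates on.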

Sensor model

General workflow of sensor modeling
The sensor model plays an important role in quantitatively evaluating the detection performance of cooperative RSUs and validating the proposed surrogate evaluation metric, as indicated in Figure 2. As previously mentioned, this study mainly considers Lidars as RSUs. In this case, the sensor model needs to take RSUs' placement parameters as input variables and output virtual Lidar scans.
Considering that the road environment and traffic agents are represented by point clouds and meshes, respectively, a pixel-level fusion method is developed to generate the synthetic sensor output. The general process is shown in Figure 5a. For an RSU, let D_env and D_agt be the depth maps of the static road environment and the movable traffic agents, respectively. Under the same RSU configuration, D_env and D_agt have identical sizes and there is a one-to-one correspondence between their elements. An element-wise operation is conducted on every nonzero element of D_agt using Equation (3).
where D_env(u, v) and D_agt(u, v) = the u-th row, v-th column elements of the static-environment and traffic-agent depth maps, respectively, and D′_agt = the fused depth map. D′_agt represents the information available to a specific RSU under the joint impact of static and dynamic obstacles. The sensor output (i.e., Lidar scans) is obtained by projecting D′_agt back to the x-y-z coordinate frame. In real-world applications, several RSUs can be installed to achieve cooperative detection of traffic agents. The dimensions of each agent are already known, making it easy to create the ground truth BBox (see Figure 5b). An efficient L-shape algorithm (Zhang et al., 2017) is employed to fit cuboids over the clustered data points. As such, the IoU metric is computable by comparing the detected BBox with the ground truth at each time step. IoU measures the extent to which the detected BBoxes overlap with the ground truth BBoxes. Occlusions may cause substantial data loss, which is directly reflected in IoU. Therefore, IoU is a straightforward metric for measuring the detection performance of cooperative RSUs under the influence of occlusions. Examples of the cooperatively detected BBox are shown in Figure 5c.
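The fusion rule can be sketched element-wise: an agent pixel is kept only where the agent return is closer to the RSU than the static background (zero denotes "no return"); otherwise the agent is occluded at that pixel. This is an assumed reading of the fusion logic, not the paper's exact Equation (3):

```python
import numpy as np

def fuse_depth_maps(d_env, d_agt):
    """Fuse the static-environment depth map (d_env) with the
    traffic-agent depth map (d_agt) of the same shape.

    An agent pixel survives only where the agent is closer to the
    sensor than the static background, or where the background gives
    no return at all; occluded agent pixels are zeroed out.
    """
    d_env = np.asarray(d_env, dtype=float)
    d_agt = np.asarray(d_agt, dtype=float)
    agent_hit = d_agt > 0
    visible = agent_hit & ((d_env == 0) | (d_agt < d_env))
    return np.where(visible, d_agt, 0.0)
```

Projecting the nonzero elements of the fused map back through the per-pixel ray directions then yields the virtual Lidar scan of the agents.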
As noted in Equation (3), the depth maps of the static road environment and the traffic agents are the two most important inputs for generating sensor outputs. The static-environment depth map in this study is obtained with the method of Ma et al. (2022). Because point cloud data are very dense, Ma et al.'s method can output a smooth depth map for any specific RSU configuration using spatial interpolation techniques. However, the triangulated meshes representing traffic agents differ from point clouds: the sizes of triangles in a traffic scenario may vary greatly, and traffic agents' states change with the simulation steps. Therefore, generating the traffic agents' depth map is more challenging in this study.

Generating depth map of traffic agents
When a Lidar works, it emits high-frequency laser beams or pulses within its FOV following a specific pattern. A beam hitting an object returns the relative distance and angle information, based on which the relative 3D position of that point to the sensor can be obtained.
Let [θ_min, θ_max] and [φ_min, φ_max] be the ranges of an RSU's horizontal and vertical viewing angles, respectively. Let Δθ and Δφ be the horizontal and vertical angular resolutions, respectively. Note that the vertical angular resolution of some commercial Lidar products may be uneven.
In existing studies, numerous rays pointing from the sensor position in different directions were created to emulate laser beams (López et al., 2023). Finding the positions where rays intersect with different objects is the main computational task of sensor modeling. Agents are represented by triangulated meshes in this study. Therefore, the core of generating the agents' depth map is the ray-triangle intersection solving algorithm.
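For illustration, the virtual laser beams can be generated as unit direction vectors from the angular ranges and resolutions; the 360° horizontal and ±15° vertical FOV in the usage example are typical commercial-Lidar values assumed here, and an even vertical resolution is assumed for simplicity:

```python
import numpy as np

def lidar_ray_directions(h_range, v_range, h_res, v_res):
    """Unit direction vectors for a virtual Lidar's beams.

    h_range, v_range: (min, max) horizontal/vertical angles in degrees.
    h_res, v_res: angular resolutions in degrees.
    Returns an (n_azimuth, n_elevation, 3) array of unit vectors.
    """
    az = np.deg2rad(np.arange(h_range[0], h_range[1], h_res))
    el = np.deg2rad(np.arange(v_range[0], v_range[1], v_res))
    A, E = np.meshgrid(az, el, indexing="ij")
    # Spherical-to-Cartesian conversion for each (azimuth, elevation) pair.
    return np.stack([np.cos(E) * np.cos(A),
                     np.cos(E) * np.sin(A),
                     np.sin(E)], axis=-1)
```

For example, `lidar_ray_directions((0, 360), (-15, 15), 0.2, 1.0)` yields a 1800 × 30 grid of beams, i.e., the ray set against which the mesh triangles must be tested.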
Thanks to advances in graphics processing unit (GPU)-based computing, the efficiency of solving ray-face intersections has been substantially improved. However, the GPU power is currently not effectively harnessed because of redundant computations. Suppose there are N_tri triangles and N_ray rays in the ray-face intersection computation. In this case, there are in total N_tri ⋅ N_ray ray-face pairs. Nonetheless, a ray does not touch the face in most pairs. For instance, 3 × 10^5 triangulated faces and 3 × 10^5 rays will generate 9 × 10^10 combinations, whereas the number of valid ray-triangle intersections is possibly less than 2 × 10^5. In this regard, a lot of computational power is wasted and efficiency is reduced.
Using acceleration data structures can speed up the process of finding effective ray-face combinations (López et al., 2022). However, employing such data structures is not the preferred choice here: traffic agents are in motion, so the data structure would need to be rebuilt at every time step, which would consume substantial computational resources. Therefore, a structure-retained spherical projection (SRSP) method, which is data-structure-free, is developed to search ray-triangle pairs efficiently. The pseudocode is presented in Algorithm 1.
The vertices of agent models are mapped into a new coordinate space by Equation (4), as shown in Figure 6a. A Cartesian coordinate system is constructed by taking the horizontal angle, the vertical angle, and the range as its three coordinate components (see Figure 6a). This process does not modify the topology of the vertices, so the structure of the meshes is retained. The orientation of a laser ray is determined by its horizontal and vertical angles. In this regard, a 3D ray in the original Cartesian frame is equivalent to an upward vector in the projected frame. As such, examining whether a 3D ray intersects with a face is simplified into checking whether a point falls inside a triangle on the angular plane.
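The mapping of Equation (4) can be sketched as follows. This is a minimal illustration rather than the paper's implementation; the function name and the degree-based angle convention are assumptions:

```python
import numpy as np

def spherical_project(vertices, sensor_pos):
    """Map mesh vertices from the Cartesian frame into an
    (azimuth, elevation, range) frame centered at the sensor.
    The vertex order is unchanged, so the triangle topology of
    the mesh is retained ("structure-retained")."""
    rel = np.asarray(vertices, dtype=float) - np.asarray(sensor_pos, dtype=float)
    rng = np.linalg.norm(rel, axis=1)                    # range to sensor
    azim = np.degrees(np.arctan2(rel[:, 1], rel[:, 0]))  # horizontal angle
    elev = np.degrees(np.arcsin(rel[:, 2] / rng))        # vertical angle
    return np.column_stack([azim, elev, rng])

# A vertex straight ahead on the x-axis maps to (0 deg, 0 deg, range).
proj = spherical_project([[10.0, 0.0, 0.0]], [0.0, 0.0, 0.0])
```

Because only per-vertex coordinates change, the triangle index list of the mesh is untouched, which is what makes the projection structure-retained.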
A preliminary filtering is conducted by excluding triangles outside the RSU's FOV, as illustrated in Figure 6b. When filtering point clouds, a point is discarded if it does not fall within the specified region; in contrast, a triangle is discarded only if all three of its vertices fall outside the FOV. The remaining vertices are then rasterized on the angular plane using Equation (4),
where [·] = a function that rounds numbers to an integer, and the primed quantities are the extreme angular values of the remaining vertices after the preliminary filtering. In this way, the center of each grid cell corresponds to a virtual ray, which can be indexed by its (row, column) grid position. A BBox of each rasterized triangulated face is then constructed, as shown in Figure 7a. Taking one face as an illustration, all grid cells inside its BBox indicate potential rays that may intersect with that face. A void grid cell means the corresponding ray does not intersect with any triangle and will not be considered in the computations. Notably, the number of ray-face combinations is substantially reduced at this step.
However, simulating an RSU requires not only returning the coordinates of collision points but also comparing distances when the same ray hits different faces. Therefore, ray-face combinations should be organized according to the ray index. To this end, an encode-sort-decode process is proposed for searching and organizing ray-face combinations. Specifically, the (row, column) ray indices of the grid cells within a BBox are encoded as one-dimensional (1D) codes separately for each triangle (see Figure 7a). The encoded grid cells are concatenated into a linear array. To indicate the face ID to which each grid cell belongs, an array of face IDs with the same length is also created, as noted in Figure 7a. Therefore, the 2D grid cells of each BBox are transformed into a 1 × n × 2 matrix, where n denotes the number of grid cells in the BBox.
The size of projected triangles may vary significantly, which results in uneven grid-cell counts across triangles. To facilitate GPU computing, an all-zero matrix with one row per triangle, a column count matching the largest BBox, and a depth of 2 is created in GPU memory blocks. In this way, the encoding and concatenation for each triangle can be executed in parallel by different threads, thus improving efficiency.
Then, as shown in Figure 7b, the 1D codes are further concatenated and sorted in ascending order on the GPU. During the sorting process, the positions of the face-ID cells change along with the code cells. The BBoxes of different faces may overlap, meaning that one code may correspond to several faces; the sorting operation brings identical codes together, as illustrated in Figure 7c. After decoding the sorted codes, the reorganized ray-face combinations are obtained. For computational simplicity, a linear function (Equation 6) is used as the encoder. As such, the decoder can be a combination of integer division and remainder operations (Equation 7).
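The encode-sort-decode idea can be illustrated with a small sketch, assuming a linear encoder in the spirit of Equation (6); the grid width `W` and the toy indices are illustrative values, not the paper's:

```python
import numpy as np

# Width of the rasterized grid (number of ray columns); illustrative value.
W = 1000

def encode(rows, cols):
    # Linear encoder (cf. Equation 6): one scalar code per (row, col) ray index.
    return rows * W + cols

def decode(codes):
    # Decoder (cf. Equation 7): integer division and remainder recover (row, col).
    return codes // W, codes % W

# Grid cells covered by the BBoxes of two faces (face IDs 0 and 1); the
# BBoxes overlap, so one ray index appears under both faces.
rows = np.array([3, 3, 3, 4])
cols = np.array([7, 8, 8, 7])
face = np.array([0, 0, 1, 1])

codes = encode(rows, cols)
order = np.argsort(codes, kind="stable")  # sorting brings equal codes together
codes_sorted, face_sorted = codes[order], face[order]

r, c = decode(codes_sorted)  # reorganized ray indices, grouped by ray
```

After the sort, the two entries sharing code 3008 are adjacent, so the rays hitting several faces can be processed block by block.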
The algorithm by Möller and Trumbore (2005) is applied to solve the intersection points of rays and faces on the GPU. When a ray intersects with several faces, only the intersection point closest to the sensor center is retained. It is noteworthy that the agent ID to which the collided face belongs is also buffered for each ray during this process. For each grid cell on the angular plane, its depth value is the distance from the sensor center to the intersection point along the corresponding ray; the depth value is 0 if the ray does not intersect any triangle. As such, the agents' depth map is generated.
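For reference, a minimal single-ray version of the Möller-Trumbore intersection test might look like the following; the batched GPU version used in the paper differs in structure, but the arithmetic is the same:

```python
import numpy as np

def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore ray-triangle intersection.
    Returns the distance t along the ray to the hit point, or None."""
    origin, direction = np.asarray(origin, float), np.asarray(direction, float)
    v0, v1, v2 = (np.asarray(v, float) for v in (v0, v1, v2))
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:                 # ray parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    s = origin - v0
    u = np.dot(s, p) * inv_det         # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = np.dot(direction, q) * inv_det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = np.dot(e2, q) * inv_det
    return t if t > eps else None      # keep hits in front of the sensor only

# Ray along +z from the origin; triangle in the z = 5 plane around the axis.
t = ray_triangle_intersect([0, 0, 0], [0, 0, 1],
                           [-1, -1, 5], [2, -1, 5], [-1, 2, 5])
```

The returned `t` is exactly the per-ray depth value described above; comparing `t` across faces hit by the same ray keeps the closest intersection.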
Note that if discrete agent models were not combined into an entire mesh in Section 2.3, a for-loop is required to cyclically apply the sensor model over traffic agents, which may significantly increase the computation time when the traffic volume is large.

Candidate RSU positions
In this study, the procedure in Ma et al. (2022) is applied to generate candidate RSU positions within the ROI.

POM estimation
IoU is a direct metric for evaluating the detection performance of RSUs, but it is not a good objective for optimizing RSUs' placement. If the variables configuring an RSU (e.g., positions and poses) were set directly as the decision variables, each optimization iteration would need to run the simulation over all time steps to yield the average IoU, which is very time consuming.
Inspired by studies in the automated driving area that use information theory-based metrics to evaluate multisensor configurations (Ma, Liu, et al., 2021), a POM-based surrogate metric is proposed in this study for evaluating the placement of cooperative RSUs. The process of estimating POM is shown in Figure 8.
In order to incorporate the uncertainty of traffic agents' emergence, traffic scenarios are voxelized within a specified region of interest (ROI). Each voxel inside the ROI has two possible states during traffic operation: occupied and void. POM is then defined as the joint probability distribution of voxels' occupancy within the ROI.
Two types of POMs are involved in this study: (1) POM-oc, which denotes the joint probability of voxels occupied by the ground truth BBoxes over the simulated time steps, and (2) POM-ob, which represents the conditional joint probability distribution of voxels' occupancy under an RSU's observation. A POM can be visualized by coloring voxels according to their probability of occupancy.
Let the ground truth BBoxes at each time step be characterized by their positions, dimensions, and orientations, and consider the voxels in the ROI that are occupied at least once throughout the simulation. The occupancy probability of each such voxel is estimated via Equations (8) and (9) as the fraction of time steps at which the voxel falls inside any ground truth BBox. For simplicity, the probability distribution estimated from these samples is assumed to equal the true distribution.
Considering that one voxel's state does not imply whether other voxels are occupied, the voxels' occupancy can be treated as independently distributed random variables. As such, the joint probability of the occupied voxels can be estimated by Equation (10) as the product of their individual occupancy probabilities.
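A minimal sketch of estimating per-voxel occupancy probabilities and the independence-based joint probability of Equation (10), using a toy occupancy history (the array values are illustrative only):

```python
import numpy as np

# Boolean occupancy samples: rows = time steps, columns = voxels in the ROI.
# True means the voxel fell inside a ground-truth BBox at that time step.
occ = np.array([[1, 0, 0],
                [1, 1, 0],
                [1, 0, 0],
                [1, 1, 0]], dtype=bool)

# Per-voxel occupancy probability: fraction of steps the voxel was occupied.
p = occ.mean(axis=0)

# Treating voxels as independent, the joint probability of the voxels that
# were occupied at least once is the product of their individual probabilities.
occupied = p > 0
joint = np.prod(p[occupied])
```

Here the first voxel is always occupied (p = 1.0, e.g., a static obstacle), the second half the time (p = 0.5), and the third never, so the joint probability over the occupied voxels is 0.5.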
From an RSU's perspective, the probability distribution of occupied voxels differs from the ground truth distribution. On one hand, the voxels outside an RSU's FOV cannot be observed; the occupancy probabilities of those voxels are inferred as 0 by the RSU. On the other hand, virtual rays are emitted from the sensor center to capture data points impinging on agents, and during the ray-casting process, the probable occupancy of a voxel along a ray may affect the observability of the voxels downstream.
To estimate POM-ob, the voxels corresponding to the static road environment are also considered when estimating the occupancy distribution. Ray-tracing is employed to find the voxels traced by the RSU's virtual rays, as illustrated in Figure 9.
Similar to the sensor model in Section 2.4.1, the sensor space is first mapped to the angular coordinate frame to facilitate the creation of virtual rays. Each ray in this frame is then discretized into points spaced one voxel size apart, and the points are sorted by increasing range. After that, the points are mapped back to Cartesian space and voxelized, as shown in Figure 9a.
As previously mentioned, the observability of voxels downstream along a ray may be impaired by those upstream. For instance, an occupancy probability of 1.0 means that the corresponding voxel is consistently occupied during the observation (e.g., by a static obstacle); in this case, the conditional occupancy probabilities of all downstream voxels are 0 even if their unconditional probabilities are positive. Therefore, the conditional occupancy probability is estimated via Equations (11) and (12),
where the conditional probability of a voxel's occupancy is defined under the RSU's observation, conditioned on the voxel intersecting a ray of the RSU. Unlike traffic agents, whose presence is uncertain, the presence of a specific RSU configuration is determined, so its probability is 1. An example of the unconditional and conditional occupancy probabilities along a ray is presented in Figure 9b.
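Equations (11) and (12) are not reproduced here. The sketch below shows one plausible transmittance-style formulation consistent with the behavior described in the text (an upstream occupancy probability of 1.0 zeroes out all downstream voxels); it should be read as an assumption, not the paper's exact equations:

```python
import numpy as np

def conditional_occupancy(p_along_ray):
    """Conditional occupancy of the voxels traced by one ray, ordered
    from the sensor outward. Assumed form: a voxel is observable only
    if the ray is not blocked upstream, so each probability is scaled
    by the product of (1 - p) over all voxels closer to the sensor.
    An upstream p of 1.0 (e.g., a static obstacle) therefore makes all
    downstream values 0, as described in the text."""
    p = np.asarray(p_along_ray, dtype=float)
    # Probability that the ray reaches each voxel (transmittance so far).
    reach = np.concatenate(([1.0], np.cumprod(1.0 - p)[:-1]))
    return p * reach

# The second voxel is always occupied, so the third becomes unobservable.
p_obs = conditional_occupancy([0.5, 1.0, 0.4])
```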
Different rays may intersect the same voxel, which may yield different conditional probabilities for that voxel; in that case, only the maximum is retained. To this end, a process similar to searching ray-face combinations is applied. Specifically, the voxels traced by rays are stored in a matrix with one row per ray and a column count equal to the maximum detection range divided by the voxel size. The 3D index of each voxel within the ROI is encoded as a scalar code, so each voxel in the matrix has a corresponding code. All voxels are then sorted by their codes. As illustrated in Figure 10, the voxels sharing the same code end up in the same block, making it easy to find the maximum conditional probability. POM-ob can be estimated after all voxels are processed. Note that all computations in this phase are executed on the GPU.
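The keep-the-maximum refinement can be sketched with a sort followed by a segment-wise maximum; the codes and probabilities below are toy values for illustration:

```python
import numpy as np

# Voxel codes reached by different rays (equal codes = same voxel) and the
# conditional occupancy each ray assigns to that voxel.
codes = np.array([42, 17, 42, 99, 17])
p_obs = np.array([0.2, 0.9, 0.6, 0.1, 0.3])

# Sort by code so that entries of the same voxel form contiguous blocks.
order = np.argsort(codes, kind="stable")
codes_s, p_s = codes[order], p_obs[order]

# Start index of each block of equal codes, then the maximum per block.
starts = np.flatnonzero(np.r_[True, codes_s[1:] != codes_s[:-1]])
uniq = codes_s[starts]
p_max = np.maximum.reduceat(p_s, starts)
```

`uniq` lists each voxel once and `p_max` holds the retained maximum conditional probability, mirroring the block-wise refinement of Figure 10.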
It is assumed for simplicity that the distributions of voxels' occupancy under the RSU's observation are independent. In this regard, the conditional joint probability of occupied voxels can be estimated by Equation (13), where the product runs over the voxels with positive conditional probability, that is, the voxels occupied at least once under the RSU's observation across the simulated time steps; the number of such voxels is at most the number of truly occupied voxels.

Optimization process
One intuitive way of improving an RSU's sensing capability is to reduce the difference between the observed occupancy distribution (POM-ob) and the ground truth distribution (POM-oc), the latter being assumed to be the true distribution over voxels. The cross entropy of the observed distribution relative to the true one is used to quantitatively measure how well an RSU is placed, via Equations (14) and (15).
In Equation (13), the number of observed occupied voxels is commonly less than the number of truly occupied voxels because of FOV limits and occlusions. In this case, the conditional probabilities of the occupied voxels in POM-oc that are not observed by the RSU would be 0. However, a probability of 0 fails Equation (15). Therefore, a very small positive number is empirically used in place of 0 in actual computations for numerical stability; such a strategy was also adopted in previous studies (Hu et al., 2022; Ma, Liu, et al., 2021). When multiple RSUs are considered, the POM-ob of different RSUs can be estimated separately. Multiple candidate poses are considered at each candidate RSU position. RSU configurations can be modeled as a high-dimensional matrix (see Figure 11a), where row, column, and depth correspond to the position index and the discrete roll and pitch angles; in addition, each element of the matrix can store candidate yaw angles. The set of candidate Euler angles is user-defined. In this way, a specific RSU configuration can be indexed by a (position, roll, pitch, yaw) tuple. One or more configuration parameters (e.g., the pitch angle) can also be fixed as required to reduce the search space.
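A minimal sketch of the POM-based cross entropy with a small floor replacing zero probabilities; the floor value used here is illustrative and not the paper's constant:

```python
import numpy as np

def cross_entropy(p_true, p_obs, floor=1e-6):
    """Cross entropy of the observed occupancy distribution relative to
    the ground truth one, over the voxels occupied at least once (POM-oc).
    Occupied voxels the RSU never observes have p_obs = 0, which would
    make log() diverge, so a small floor replaces 0 for numerical
    stability (the floor value is an assumption of this sketch)."""
    p_true = np.asarray(p_true, dtype=float)
    p_obs = np.maximum(np.asarray(p_obs, dtype=float), floor)
    return -np.sum(p_true * np.log(p_obs))

# A perfectly observed map yields a lower CE than a partially occluded one.
ce_good = cross_entropy([0.5, 0.8], [0.5, 0.8])
ce_bad = cross_entropy([0.5, 0.8], [0.5, 0.0])  # second voxel fully occluded
```

Smaller CE indicates better placement, so the occluded configuration is correctly penalized.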
The general optimization process is shown in Figure 11b. After the acquisition of POM-oc using VRFS, the ray-tracing method is applied to estimate POM-ob for each candidate RSU configuration, yielding a POM-ob array. Precomputing the POM-ob array is recommended if computer memory permits. As such, the POM-ob of a specific RSU configuration can be efficiently retrieved via its configuration index, avoiding repetitive computation of POM-ob.
A smaller cross entropy means better placement. Therefore, the optimization problem is formulated as Equation (16): find the configuration index tuple that minimizes the cross entropy (CE) between the combined POM-ob of the placed RSUs and POM-oc. The constraints bound each component of the index tuple by the numbers of candidate positions, roll angles, pitch angles, and yaw angles, respectively. Setting the configuration indices as decision variables, the Bayesian optimizer is used to update the variable values, and the POM-ob corresponding to the updated indices is retrieved from the POM-ob array. If multiple RSUs are considered, their POM-ob must be combined before computing the CE. Besides, the position indices of different RSUs cannot be identical (see Equation 18) because different RSUs are not allowed to be installed at the same position.
Note that RSUs' configuration parameters could be set directly as decision variables. In that case, the optimization procedure would search a continuous parameter space, demanding more iterations and computation time. Besides, though this study mainly addresses the FIX problem, the proposed framework can be modified to address the MIN problem by setting the number of RSUs as a decision variable; in that case, a specific CE level should be formulated as a constraint.
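The retrieval-based optimization loop can be sketched as follows. A plain random search stands in for the Bayesian optimizer, and the CE table is synthetic; the point is only to show how a precomputed POM-ob/CE array is indexed by a configuration tuple during the search:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for the precomputed array: CE values indexed by
# (position, roll, pitch, yaw). In the real framework, POM-ob would be
# retrieved and the CE computed; here the objective values are synthetic.
n_pos, n_roll, n_pitch, n_yaw = 20, 3, 3, 4
ce_table = rng.uniform(1.0, 10.0, size=(n_pos, n_roll, n_pitch, n_yaw))

def objective(idx):
    # Look up the surrogate metric for one configuration index tuple.
    return ce_table[idx]

# Random search over the discrete configuration space as a simple stand-in
# for the Bayesian optimizer: propose an index tuple, evaluate CE, keep best.
best_idx, best_ce = None, np.inf
for _ in range(100):
    idx = (rng.integers(n_pos), rng.integers(n_roll),
           rng.integers(n_pitch), rng.integers(n_yaw))
    ce = objective(idx)
    if ce < best_ce:
        best_idx, best_ce = idx, ce
```

Because each evaluation is a table lookup rather than a multi-step simulation, many iterations remain cheap, which is the efficiency argument made above.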

TIME PERFORMANCE OF SRSP-BASED METHOD
The time performance of the SRSP-based and plain GPU-based computing methods is compared on solving ray-triangle intersections. The test results are presented in Table 2 and plotted in Figure 12. All computations are executed on a computer with 16 GB of RAM, an Intel Core i7-12700H CPU, and an RTX 3060 (laptop) GPU. As noted in Figure 12a, the GPU-based method outperforms the SRSP-based method when the number of ray-face combinations is less than 10^8. However, as the number of combinations continues to increase, the runtime of the GPU-based method increases substantially. The reasons are twofold: (1) the SRSP method reduces the number of ray-triangle pairs and thus avoids wasting computational power on rays that are unlikely to collide with triangles, and (2) when the number of combinations is too large to fit in GPU memory, the computations need to be split into several subparts and executed sequentially on the GPU, reducing efficiency. A more powerful GPU may further improve the time performance of the GPU-based method, but it also means higher cost.
The SRSP-based sensor model comprises three main parts: (1) preliminary filtering, (2) encoding and reorganization, and (3) decoding and ray-face intersection solving. Figure 12b shows how the computation time of the different parts varies with the number of ray-face combinations. As noted, the second part takes the most time when the number of combinations exceeds 10^8. The all-zero matrix described in Section 2.4.1 needs to be created to accelerate encoding and concatenation, and its column count, which corresponds to the largest BBox, may also increase substantially with the number of combinations, consuming large computational resources.

Test sites
The proposed VRFS framework is tested on two different sites: an X-shaped intersection (Site 1) and a curved highway segment (Site 2). At Site 1, a two-lane undivided highway intersects with a four-lane divided highway. The median barrier is about 3.0 m wide, which can be compressed to create a left-turn lane. Traffic agents traveling from south to east and from north to west can use a dedicated right-turn lane.
There are different kinds of roadside obstacles at Site 1, including building façades, continuous walls, and vegetation. The complex road environment poses a challenge to the placement of RSUs. Site 2 is a four-lane divided highway section, where the trees alongside the inner road edge and the vegetation in the median barrier may cause occlusions to RSUs. The geometric information of the two test sites is presented in Table 3.
Point cloud data of the real-world 3D road environments at both sites were collected by a mobile mapping system, which mainly comprises a laser scanner (Zoller+Frohlich, 2020) and a combined positioning system (Novatel, 2021). The technical information of the scanning system is presented in Table 4. To address the occlusions caused by the median barrier and by vehicles on the road during data collection, the point clouds on both sides of the median barrier were scanned separately and then merged.
The average point spacing is about 0.05 m. The process of creating a VRFS scenario is graphically presented in Figure 13. In this study, the driving lines were manually delineated along the lane markings to guarantee accuracy. To generate the DEM from point clouds using the method proposed by Ma, Easa, et al. (2021), the grid resolution parameters were both set to 0.2 m. Different vehicle types, including sedans, sport utility vehicles, vans, trucks, and buses, were included in the VRFS, with the latter two viewed as large vehicles.

POM-based metric validation
In this section, the IoU metric mentioned in Section 2.4.2 is used to examine the effectiveness of the POM-based metric. DL models can take RSU-captured data as input and output detected BBoxes; however, the varied accuracy of different DL models may introduce bias. Therefore, the IoU is estimated using cuboids fitted over the point clusters segmented by agent ID instead of BBoxes detected by DL models. At each site, five candidate RSUs are randomly configured (see Tables A1 and A2 in Appendix A). Deploying 1, 2, and 3 RSUs is considered separately, which yields C(5,1) = 5, C(5,2) = 10, and C(5,3) = 10 placement plans, respectively. The voxel size is set as 0.2 m in this case.
For each placement plan, the VRFS method is run for 7200 steps (step size: 1.0 s) to estimate the stepwise IoU and POM-oc. Then, the POM-ob of the RSUs are estimated and compared with POM-oc to compute the CE. In the ray-tracing process, the occlusion effect of voxels is considered (see Figure 9b). For comparison, POM-ob in the absence of occlusions is also estimated; in that case, a voxel's conditional probability equals its unconditional probability whenever the voxel is touched by any RSU ray. The relationships between the average IoU and the CE with and without considering occlusions are displayed in Figure 14.
As noted in Figure 14a, whether or not the occlusion issue is considered, negative correlations between CE and IoU, though with fluctuations, are manifested in the different cases. This finding implies the effectiveness of CE as a surrogate metric for evaluating cooperative sensing performance. The difference between the CE values with and without considering occlusions may vary with the test site.
Gathering the data of all placement plans at each site, the Pearson correlation coefficients (PCC) between IoU and CE are calculated and shown in Figure 14b. Even when occlusions are not considered in the ray-tracing process, the CE is significantly correlated with IoU. This result is in accordance with Hu et al. (2022), who used an IG metric derived from POM irrespective of obstacles. However, the PCCs between IoU and CE considering occlusions are closer to −1 than those without occlusions at both sites, indicating that taking into account the occlusion effect caused by static or dynamic obstacles can improve the efficacy of the surrogate metric.
Different from the intersection scenario, the two PCCs are almost identical in the highway scenario, which can be understood from two aspects. First, the curvature of the highway segment is not large; hence, the roadside trees do not substantially affect the data collection of RSUs. Second, unlike at intersections, where vehicles need to wait for their green phase, traffic flow on the highway is continuous. In this regard, even if a sedan is occluded by a large bus, the duration of the dynamic occlusion will not be long.

4.3
Optimization results and analyses

Influence of traffic parameters
As different parameters (e.g., traffic volume) may affect the optimization results, the optimization procedure is applied in the different scenarios noted in Table 5. The detailed traffic flow information of the different directions is presented in Appendix B. For simplicity, q1, q2, q3 and p1, p2, p3 are used to represent the different traffic volumes and vehicle proportions, respectively. In addition to traffic volume and traffic composition, different numbers of simulation steps and different numbers of RSUs are also considered. In total, 81 test scenarios are created for each site.
To enable the use of the optimization procedure, as mentioned in Section 2.5.1, the approach by Ma et al. (2022) is applied to generate candidate RSU positions within the ROI. In this case, the horizontal interval and the vertical spacing were set as 4.0 and 0.5 m, respectively. The upper and lower limits on the height of candidate RSUs above the ground are 10.0 and 4.0 m, respectively.
The generated candidate RSU positions at each site are shown in Figure 15; there are 693 and 1900 candidate RSU locations at Sites 1 and 2, respectively. Although the proposed VRFS can address the placement of mixed RSUs with different technical parameters (e.g., Puck 32C and OS1-32), the same RSUs are deployed at both sites to better compare the optimization results. The technical parameters of the RSU to be placed and the candidate configurations are presented in Table 6.
The Bayesian optimizer for all cases is identically configured: number of initial evaluation points is 10, the point size for fitting the Gaussian process model is 300, the acquisition function is a variant of "expected improvement," which iteratively evaluates whether it overexploits a local search area (Bull, 2011), and the limit of objective function evaluation is 100.The optimized RSU positions at the two sites are displayed in Figure 16.
Different markers are used to differentiate the optimization results obtained using POMs estimated with different numbers of simulation steps. For clarity, only the optimization results corresponding to 5400 simulation steps are presented in Figure 16. As noted in Figure 16, the optimized horizontal positions of the RSUs vary greatly with traffic volume and traffic composition, and there is no visually significant pattern at either site. This finding implies the importance of creating an optimization procedure for placing RSUs. Specifically, the 3D environment and traffic flow parameters may vary greatly across road sections; optimizing RSUs' placement is thus a site-specific problem, and the desirable placement plan may be quite different for different existing road infrastructures. Therefore, it is challenging to draw general conclusions on how to better configure RSUs.
Figure 17 shows the average height of the optimized RSU locations in different test scenarios. The value in each cell of the heatmaps denotes the average height. The horizontal and vertical axes of the heatmaps represent traffic composition and traffic volume, respectively, as indicated in Figure 17a. The curve diagrams on the left and top of the heatmaps depict how the average height varies with traffic volume and traffic composition, respectively. The upward and downward arrows in the curve diagrams indicate increasing and decreasing trends of the height variation, respectively.
Similar to Figure 16, there is no significant pattern regarding how the average height changes with the traffic parameters. However, it can be observed that when traffic volume or the proportion of large vehicles increases, higher RSU positions do not mean better sensing performance in all cases. Such a result can be attributed to the limited FOV of each RSU: many traffic agents on the road may fall outside the RSUs' joint FOV if the RSUs are placed too high. On the other hand, lower RSUs are more likely to be obstructed by vehicles, which may also result in poorer detection performance. The results in Figures 16 and 17 imply that a computer-aided tool such as the proposed VRFS framework, which can handle diverse traffic scenarios, is required to carefully optimize RSUs' placement.
The CE curves of the optimized RSU configurations under different conditions are plotted in Figure 18. The dashed, dotted, and solid lines correspond to q1, q2, and q3, respectively. In theory, a longer simulation time means more accurate estimation of the probability distributions of voxels' occupancy (i.e., the POM). It is noted in Figure 18a that though the number of simulation steps affects the CE results, it does not substantially affect the variation patterns of the CE curves. This finding is meaningful for real-world applications: the observation of traffic flow and the estimation of POM can be performed within a limited time, saving a lot of engineering time. Similarly, increasing the number of RSUs increases the overall CE level without changing the CE curves' pattern.
The CE is positively correlated with the proportion of large vehicles in all cases, which is reasonable: a larger CE means poorer detection performance, and when the proportion of large vehicles increases, the dynamic occlusion issue becomes more severe, making it more difficult for RSUs to accurately sense traffic agents. In addition to the proportion of large vehicles, the traffic volume may also affect the CE results: more vehicles mean more overlaps between vehicles from the RSUs' perspective, which degrades the RSUs' sensing capability. To compare the impacts of traffic volume and large vehicle proportion on RSUs' detection performance, the CE curves corresponding to 7200 simulation steps and 6 RSUs, as shown in Figure 18b, are further investigated.
The adverse impact of the large vehicle proportion on RSUs' detection is more significant than that of traffic volume. As noted in Figure 18b, the CE results change only slightly when the traffic volume increases from q2 to q3 at both sites, whereas changing the large vehicle proportion substantially changes the CE results in all cases. When the traffic volume is large but there are few large vehicles, the overlaps between traffic agents may only cause partial occlusions, so RSUs can still work effectively via DL models (Marvasti et al., 2020). However, some traffic agents may be completely occluded by large vehicles, in which case no signals are collected by the RSUs and those agents cannot be detected even with very powerful DL techniques. Therefore, RSUs need to be placed more carefully for roads with many large vehicles.

Both static and dynamic occlusions are considered in M23, but the vanilla model relies on real-world trajectory data. Therefore, M23 is adapted to address the RSU placement issue in this case. Specifically, traffic agents are driven by the trajectory data output by the traffic simulator. At each iteration step of the optimization, a group of time steps is randomly selected from the simulated steps, so that a batch of traffic scenarios can be randomly reconstructed, and the average IoU of those scenarios is computed as the optimization objective. The Bayesian optimizer is employed to optimize the RSUs' positions by maximizing the mean IoU.
For the application of the proposed method and M23, the traffic volume, large vehicle proportion, and number of simulation steps are configured as q2, p2, and 5400, respectively. The batch size of M23 is specified as 50. The Bayesian optimizers in both methods are identically configured.
To compare the optimization results obtained by the different methods, RSUs are virtually placed at the optimized positions and VRFS is applied to estimate the IoU stepwise. Boxplots of the IoU results of the different methods are shown in Figure 19.
M23 and the proposed method substantially outperform V21 and M22, implying that including dynamic occlusions is important when optimizing RSUs' placement. M22 generates better optimization results than V21 at Site 1; however, the mean IoU values of V21 and M22 are very close at Site 2. This result is reasonable: although Site 2 is a curved highway section, the radius (1670 m) is not small, so the adverse impact of roadside obstacles is limited. Therefore, it is undesirable to overlook roadside obstacles when configuring RSUs at road sections with complex environments. M23 is comparable to the proposed method regarding the optimization results and notably outperforms it at Site 2. However, the efficiency of M23 is unsatisfactory, as mentioned in Section 1.3. The batch size is 50 in this case, which means M23 needs to perform a 50-step simulation to compute the average IoU at each iteration of the optimization. Owing to the large data volume of point clouds, M23 is time-consuming even though parallel computing is applied (Ma et al., 2023).
The time needed to solve the optimization problem (maximum iteration steps: 150) with the two methods is presented in Table 7. As noted, the proposed method is much more efficient than M23 at both test sites. For the proposed method, the average time of creating a POM is about 0.1 s per simulation step, and the created POM can be repeatedly applied to different optimization problems. The processing time of M23 would be even longer if a larger batch size were used.

CONCLUSION
This research proposed a VRFS framework that leverages traffic simulation and 3D point clouds of real-world road infrastructure to optimize the placement of cooperative RSUs. When properly developed, the proposed framework can support road agencies in deploying RSUs to ensure better surveillance functions of CVIS in the foreseeable future. Based on this study, the following conclusions are offered:

1. Point cloud data precisely depicting the real-world road environment were combined with an open-source traffic simulator to create VRFS traffic scenarios. On this basis, static and dynamic occlusions can be modeled more accurately while the flexibility to address various traffic scenarios is retained.
2. This study developed an SRSP-based algorithm for efficiently searching ray-triangle pairs, which serves as the backbone of the sensor model. The SRSP-based method outperforms the GPU-based approach when the number of ray-triangle combinations exceeds 10^8.
3. Based on POMs as representations of traffic scenarios, a ray-tracing method was developed to incorporate the occlusion effect into the computation of CE. The effectiveness of CE as a surrogate metric was demonstrated through comparisons with simulation-derived IoU. Moreover, considering the occlusion effect improves the efficacy of CE.
4. The optimization results regarding RSUs' placement are sensitive to the scenario setups. It is therefore more desirable to apply computer-aided tools such as the VRFS framework to configure RSUs than to draw general conclusions about RSU deployment through extensive simulations.
5. There are no significant patterns in how the optimized RSU positions vary with traffic parameters. Placing RSUs at a small spatial scale is thus a site-specific problem that requires a computer-aided tool.
6. The results show that the number of simulation steps and the number of RSUs to be placed do not substantially affect the pattern of the CE curves. Compared with traffic volume, the proportion of large vehicles has a more significant adverse effect on RSUs' detection performance.
7. This study does not emphasize the optimal placement of RSUs. Investigating the performance of different algorithms in optimizing RSUs' configuration is a meaningful research direction.
8. Several limitations remain to be addressed in future studies. First, 3D data of the real-world road environment are required for the proposed framework. Second, vulnerable road users were not considered when estimating the POM; including them in RSU placement on urban streets is expected in the future. Third, owing to the limitations of agent kinematic models in traffic simulators, it is difficult to create some highly critical interactions among traffic agents. In this regard, DL techniques for generating adversarial scenarios (e.g., Xu, Xiang, et al., 2022) can be integrated with the proposed VRFS to optimize RSUs' configuration toward better monitoring. Besides, only Lidar-based RSUs were considered in this study; a library of different sensor models (e.g., Cai et al., 2023) should be constructed to address the deployment of multimodal RSUs. Finally, since other parameters such as voxel size may also affect the performance of the VRFS framework, a more comprehensive investigation of the impacts of different parameters on VRFS is expected in the future.
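The POM-based CE surrogate described above can be sketched in a few lines. The following is a minimal illustration, assuming flattened voxel lists of occupancy probabilities; the function and variable names are illustrative and not taken from the paper's implementation:

```python
import math

def pom_cross_entropy(p_true, p_obs, eps=1e-9):
    """Voxel-wise cross entropy between a ground-truth occupancy map
    (p_true) and the POM observed by candidate RSUs (p_obs).
    Lower CE means the RSUs reproduce the scene more faithfully."""
    assert len(p_true) == len(p_obs)
    ce = 0.0
    for q, p in zip(p_true, p_obs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        ce += -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    return ce / len(p_true)

# A well-observed map yields a lower CE than a poorly observed one.
truth = [1.0, 0.0, 1.0, 0.0]
good  = [0.9, 0.1, 0.8, 0.2]   # close to the truth
bad   = [0.5, 0.5, 0.5, 0.5]   # uninformative observation
```

An optimizer minimizing this quantity over candidate RSU configurations then acts as the proxy for maximizing detection performance.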

ACKNOWLEDGMENTS
We would like to express our gratitude to the editor-in-chief and anonymous reviewers for their constructive and helpful comments.

FIGURE Static and dynamic occlusions.

FIGURE Fusing virtual objects with the point cloud-modeled environment: (a) process of creating virtual-real-fusion traffic scenarios, (b) examples of agent models, and (c) data format of a mesh model.

where (x_i^l, y_i^l, z_i^l)^T = local coordinates of vertex i of an agent, which are time-invariant; (x_i, y_i, z_i)_t^T = global coordinates corresponding to (x_i^l, y_i^l, z_i^l)^T at time step t; and R_{3×3} = rigid rotation matrix, which can be estimated from the coordinates of adjacent trajectory points p_t and p_{t-1}.
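The transformation above maps each time-invariant local vertex into global coordinates via the rotation estimated from adjacent trajectory points. A minimal Python sketch, under the simplifying assumption of a yaw-only (planar) rotation derived from the heading between consecutive trajectory points; all names are illustrative:

```python
import math

def yaw_rotation(p_t, p_prev):
    """Estimate a planar rigid rotation (yaw only, a simplifying
    assumption for vehicles on flat roads) from two adjacent
    trajectory points p_t and p_{t-1}."""
    theta = math.atan2(p_t[1] - p_prev[1], p_t[0] - p_prev[0])
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def to_global(local_vertex, R, center):
    """Map a time-invariant local mesh vertex to global
    coordinates: g = R @ v + c."""
    return tuple(
        sum(R[i][j] * local_vertex[j] for j in range(3)) + center[i]
        for i in range(3)
    )
```

For example, an agent heading along the diagonal (45°) rotates the local x-axis vertex (1, 0, 0) onto the diagonal before translation to the trajectory point.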

FIGURE 5 Process of cooperative detection: (a) fusion of foreground and background maps, (b) ground truth bounding box (BBox), and (c) segmented agents and detected BBoxes.

Let the set of RSU positions be given. The general process of virtual detection is shown in Figure 5a. At each time step, a point cloud is obtained for each RSU position. Because an agent ID attribute is attached to each data point captured by the RSUs in the sensor model, traffic agent data obtained by different RSUs can easily be clustered according to their agent ID.
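The ID-based clustering step can be sketched as follows. Here a point is assumed to be a tuple (x, y, z, agent_id), and an axis-aligned bounding box is used as a simplified stand-in for the detected BBoxes in Figure 5c; all names are illustrative:

```python
from collections import defaultdict

def fuse_by_agent_id(rsu_clouds):
    """Merge point clouds captured by several RSUs and group the
    points by the agent ID attached in the sensor model, so that
    each cluster corresponds to one traffic agent."""
    clusters = defaultdict(list)
    for cloud in rsu_clouds:              # one cloud per RSU
        for x, y, z, agent_id in cloud:
            clusters[agent_id].append((x, y, z))
    return dict(clusters)

def bbox(points):
    """Axis-aligned bounding box of one cluster (a simplified
    stand-in for a detected BBox)."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))
```

Because the IDs come from the simulator rather than a detector, no data-association step is needed when fusing the views of different RSUs.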

FIGURE Process of estimating probabilistic occupancy map (POM).

FIGURE 9 Ray-tracing process: (a) rays to voxels in order; (b) variation of the occupancy and observation probabilities along a ray.

FIGURE Refinement of voxels' occupancy probability.
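The per-ray refinement idea can be sketched minimally, under the simplifying assumption that the first returned hit along a ray marks its voxel as observed occupied, earlier voxels as observed free, and later voxels as occluded (the paper's actual probability update may differ); the function name is illustrative:

```python
def refine_along_ray(p_oc, hit_index):
    """Refine occupancy probabilities along one ray, whose voxels
    are listed in traversal order. Voxels in front of the first hit
    are observed free, the hit voxel is observed occupied, and
    voxels behind the hit keep their prior (occluded)."""
    refined = list(p_oc)
    for i in range(len(p_oc)):
        if i < hit_index:
            refined[i] = 0.0          # traversed: observed free
        elif i == hit_index:
            refined[i] = 1.0          # surface hit: observed occupied
        # i > hit_index: occluded, keep the prior probability
    return refined
```

Keeping the prior behind the hit is exactly what lets the CE metric penalize occluded regions that the RSUs cannot observe.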

FIGURE 11 General optimization process: (a) example of roadside sensing unit (RSU) configuration matrix, (b) optimization procedure.
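The optimization procedure in Figure 11b selects an RSU configuration that minimizes the surrogate objective (CE in the paper). The paper employs a Bayesian optimizer; the dependency-free sketch below substitutes random search over candidate subsets purely to illustrate the loop, and the names and toy objective are illustrative:

```python
import random

def optimize_placement(candidates, k, objective, n_iter=200, seed=0):
    """Search for a k-RSU subset of the candidate positions that
    minimizes the surrogate objective. Random search stands in for
    the Bayesian optimizer used in the paper."""
    rng = random.Random(seed)
    best_cfg, best_val = None, float("inf")
    for _ in range(n_iter):
        cfg = tuple(sorted(rng.sample(range(len(candidates)), k)))
        val = objective([candidates[i] for i in cfg])
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val
```

In practice the objective evaluation is the expensive part (it requires ray tracing over the POM), which is why a sample-efficient Bayesian optimizer is preferred over naive search.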

FIGURE Time test results: (a) time curves and (b) runtime of different parts. Tests were run on a computer with 16 GB of RAM, an Intel® Core™ i7-12700H CPU, and an RTX 3060 (laptop) GPU.
There are 693 and 1900 candidate RSU locations at Sites 1 and 2, respectively. The proposed VRFS can also address the placement of mixed RSUs with different technical parameters (e.g., Puck 32C and …).

TABLE 5 Traffic scenario setups. Parameter ranges: … ≤ 147; 40 ≤ … ≤ 88; 8 ≤ … ≤ 20 (shifted). a The simulation steps for estimating the POM (i.e., T) are 3600, 5400, and 7200; the numbers of RSUs are 2, 4, and 6.

FIGURE 15 Candidate roadside sensing unit (RSU) locations: (a) Site 1 and (b) Site 2.

FIGURE 16 Optimized roadside sensing unit (RSU) positions (T = 5400): (a) Site 1 and (b) Site 2 (only road edges and median barriers are delineated).

TABLE 6 Technical parameters of roadside sensing unit (RSU) and candidate pose angles. a Not applicable because the RSU can capture a 360° view of the surrounding environment.

FIGURE Average height of optimized roadside sensing unit (RSU) positions: (a) Site 1 and (b) Site 2.

FIGURE Cross entropy (CE) curves: (a) CE curves under different conditions and (b) influence of traffic volume and vehicle proportion.

FIGURE Intersection over union (IoU) of optimized roadside sensing unit (RSU) positions by different methods: (a) Site 1 and (b) Site 2.

4.3.2 Comparisons with existing methods

Vijay et al. (2021), Ma et al. (2022), Ma, Liu, et al. (2023), and the proposed method are separately applied to optimize the placement of four RSUs at both sites. Technical parameters of the RSUs are specified in Table 6. For simplicity, V21, M22, and M23 are used to denote Vijay et al. (2021), Ma et al. (2022), and Ma, Liu, et al. (2023), respectively. The optimization objectives of V21 and M22 are both to enlarge the coverage of road areas. The difference is that the road environment in M22 is modeled by point clouds, whereas no obstacles are included in V21. Note that neither study considers dynamic occlusions.
TABLE 1 Features of existing and proposed studies. Static: Hiromori et al., 2013; Altahir et al., 2018; Yang et al., 2018; Geissler & Gräfe, 2019; Suresh et al., 2020. 3D geometry, analytical (3D), static: Dybedal & Hovland, 2017; Zhang et al., 2019.

Because the observation event of one RSU does not affect or imply observations of other RSUs, the voxel-by-voxel sum of p_ob estimated at different RSUs, denoted as ∑ p_ob, satisfies ∑ p_ob ≤ 1. Suppose a set of candidate RSU positions is generated. Given that 3D poses may also affect an RSU's perception capability, different pitch, roll, or yaw angles can be considered.
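The voxel-by-voxel combination of per-RSU observation probabilities can be sketched as follows. Capping the sum at 1 is an assumption made here to keep the result a valid probability; the paper's exact combination rule may differ, and the function name is illustrative:

```python
def combine_pob(pob_per_rsu):
    """Combine observation probabilities p_ob of independent RSUs
    voxel by voxel: sum across RSUs, capped at 1 so the combined
    value remains a valid probability."""
    n_voxels = len(pob_per_rsu[0])
    return [min(1.0, sum(rsu[v] for rsu in pob_per_rsu))
            for v in range(n_voxels)]
```

Voxels seen by several RSUs thus accumulate observation probability, which is how cooperative sensing mitigates the occlusion of any single unit.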

TABLE 2 Runtime tests of graphics processing unit (GPU)-based and structure-retained spherical projection (SRSP)-based computing. Columns: number of ray-face combinations; number of reduced combinations; GPU-based computing (ms); SRSP-based approach (ms). a Number of the POM-ob array estimated by Equation (17).

TABLE 7 Optimization time (s). Columns: test sites; Ma et al. (2023); proposed method (probabilistic occupancy map (POM) creation; optimization).
This work was supported in part by the Fundamental Research Funds for the Central Universities under Grants JZ2023HGTA0191 and JZ2022HGTA0338, in part by the Anhui Provincial Natural Science Foundation under Grant 2308085QE188, in part by the Natural Sciences and Engineering Research Council of Canada under Grant Ryerson-2020-04667, and in part by the Guangdong Science and Technology Strategic Innovation Fund (the Guangdong-Hong Kong-Macau Joint Laboratory Program) under Grant 2020B1212030009.