Enabling Viewpoint Learning through Dynamic Label Generation

Optimal viewpoint prediction is an essential task in many computer graphics applications. Unfortunately, common viewpoint qualities suffer from two major drawbacks: dependency on clean surface meshes, which are not always available, and the lack of closed-form expressions, which requires a costly search involving rendering. To overcome these limitations we propose to separate viewpoint selection from rendering through an end-to-end learning approach, whereby we reduce the influence of the mesh quality by predicting viewpoints from unstructured point clouds instead of polygonal meshes. While this makes our approach insensitive to the mesh discretization during evaluation, it only becomes possible when resolving label ambiguities that arise in this context. Therefore, we additionally propose to incorporate the label generation into the training procedure, making the label decision adaptive to the current network predictions. We show how our proposed approach allows for learning viewpoint predictions for models from different object categories and for different viewpoint qualities. Additionally, we show that prediction times are reduced from several minutes to a fraction of a second, as compared to state-of-the-art (SOTA) viewpoint quality evaluation. Code and training data are available at https://github.com/schellmi42/viewpoint_learning; to our knowledge, the training data is the largest viewpoint quality dataset available.


Introduction
3D models play an essential role in all areas of computer graphics, such as games, animated movies or virtual reality. To effectively showcase these models or to assess their quality, not only are model parameters such as geometry and material important, but the selection of optimal views is also crucial. Optimal views should ensure that the model complexity is appropriately communicated and that relevant structures are visible. Many quality measures have been developed to aid in the automatic selection of optimal viewpoints on 3D models. The applications range from obtaining vantage points for capturing stills in architecture [HWZ * 17], to initial camera positioning for complex scene inspection [MEB * 17, HVH * 16, SLMR14, SLF * 11], camera control [LC15] and recommendations for scientific visualization [YLLY19]. Most common viewpoint quality measures aim to measure the information content of rendered 2D images of 3D models. The information content is usually derived from the visibility of the model geometry, making it sensitive to mesh quality and, in some cases, discretization. For this reason, view quality measures generally assume the geometry to be a clean watertight surface mesh, which is not always available in real world applications. Faulty meshing, on the other hand, such as holes in the geometry or self-intersecting triangles, distorts the resulting viewpoint quality [BFS * 18]. Finally, to find the best viewpoint, existing work often renders the 3D model for a large set of candidate viewpoints, which makes finding optimal views a costly brute-force search [VFSL02, FWBK15, KTL * 17].
In this paper, we present the first optimal viewpoint learning approach, and demonstrate its applicability by learning different existing viewpoint qualities. To overcome the above mentioned limitations of finding optimal views, we train a deep neural network end-to-end to predict high quality viewpoints directly from the 3D model, dropping the rendered image from the optimization. This makes our approach independent from rendering during evaluation, which means that the time-consuming rendering is separated from the optimal viewpoint prediction. Hence, in contrast to previous work, our learned approach allows instantaneous predictions, making an expensive brute-force search over many rendered view alternatives unnecessary, and reducing prediction times from several minutes to a fraction of a second. In order to reduce the influence of mesh quality and discretization, we only use unstructured surface points as input, i.e., no information about the polygonization is provided to the network. This forces the network to predict optimal views from an implicitly learned latent geometry representation, which, by design, is independent of the actual mesh polygonization. By training the network on clean meshes, we bias the latent representations towards clean surfaces, and as a result during evaluation the network will estimate a latent representation of a clean surface from the given points. These considerations make our approach robust to the model's discretization and mesh quality during evaluation, which enables the prediction of optimal viewpoints on a wide range of 3D models from different sources with varying mesh quality.
However the end-to-end mapping from 3D model to optimal viewpoint is not well defined, as viewpoint quality measures do not necessarily have a unique maximum, but may have several, for instance, but not exclusively, due to model symmetries. This ambiguity leads to conflicting ground truth information, resulting in opposing gradients which prevent meaningful learning. Existing techniques to resolve such label ambiguities only work for specific settings, e.g., local ambiguity [GXX * 17] or symmetric ambiguity [LGS19]. Thus we propose a more general approach, the dynamic label generation, which integrates the label decision into the training process. This allows the network to dynamically adjust the labels during training which results in a harmonized label decision over the dataset, effectively reducing the influence of contradicting label decisions, and thus gradients, and enabling learning for this more general type of ambiguity.
Thus, within this paper we make the following contributions:
• We present the first learning-based approach that predicts optimal viewpoints directly on 3D models, while being robust to the input mesh quality.
• We introduce a novel dynamic label generation method, incorporating the label decision into the training to resolve label ambiguity.
• We release viewpoint quality annotations for a subset of ModelNet40, which makes it the largest available viewpoint quality dataset by a large margin.

Related Work
The search for a good viewpoint of a 3D object is a problem that can be dated back to ancient societies such as the Greeks and Romans. Several rules, such as the golden ratio or the rule of thirds, have been proposed to estimate beauty or proportion. More recently, the search for preferred views has also been addressed, especially in computer vision tasks (e.g., for object recognition). Despite the number of articles devoted to this issue, little has been done to generate fast algorithms for good viewpoint selection. In most cases, the measures require inspecting a very dense set of candidate views, which is time-consuming. Accelerations presented in the literature are typically greedy algorithms (e.g., for light source positioning [Gum02], or for volumetric models [MVN12]).
Our learned viewpoint prediction outperforms all these methods by design, as one forward pass through the network enables viewpoint prediction in milliseconds, rather than minutes, which are required by the brute-force approaches.
Label ambiguity. Ambiguous labels are present in many tasks, such as image classification, image segmentation, pose estimation or age estimation [GXX * 17], and can hurt the performance of a learner if not considered [RLDB17]. There are different sources for these ambiguities. Some tasks naturally allow multiple correct labels, e.g., in image classification an image can contain multiple objects. For other tasks it is difficult to provide a definitive label, e.g., it is hard to determine the exact pose of a partially occluded person. While classification tasks can resolve label ambiguity to some degree by design, regression tasks often struggle with ambiguous label information. While restating a regression as a classification is possible [ST19], it limits the possible performance by discretizing the output space. In cases where ambiguity exclusively stems from symmetry, partial restatement can be a trade-off, e.g., to resolve axial symmetry [LGS19] or rotational symmetry [CKF18]. The problem of label ambiguity can also be viewed as a problem of contradicting gradients. The influence of such gradients can be reduced using mixtures of experts [JJNH91, JJ94], where multiple experts are trained together with a gating network to divide the problem space into disjoint regions, each having its own expert. However, this method is not applicable to label ambiguity that is not separable in the input space, e.g., when the same data point is present twice in the dataset with different labels.
In contrast to these approaches, we present a novel dynamic label generation, which integrates the label generation into the training stage and harmonizes the label decision without further assumptions or restrictions.

Viewpoint Quality Measures
To demonstrate the proposed deep learning technique, we have considered four different viewpoint quality measures, which we selected based on their effectiveness in previous studies and their popularity: Viewpoint Entropy (VE) [VFSH01], Visibility Ratio (VR), also referred to as surface area [PB96], Viewpoint Kullback-Leibler divergence (VKL) [SPFG05], and Viewpoint Mutual Information (VMI) [FSG09], each defined in the cited work.
According to Bonaventura et al. [BFS * 18], VE and VMI are the most popular viewpoint quality metrics, used in most papers. When evaluating them regarding user preference, Bonaventura et al. also found that these two lie at opposite extremes of the user preference spectrum: while the views selected by VE are highly preferred by users, the ones selected by VMI were not always deemed as informative. Further, Secord et al. [SLF * 11] ranked VR and VE as the two most preferred. Finally, we added VKL as it is partially sensitive to the model's discretization, whereas the other measures are at the extremes (non-sensitive/highly sensitive). Other measures could also be considered; however, we restricted ourselves to these four measures as they represent the range of two main properties, see Table 1.
The best viewpoints for VR and VE correspond to the highest viewpoint quality values, and for VKL and VMI to the lowest viewpoint quality values. These viewpoint quality measures are defined for polygonal models and thus are, in contrast to our approach, dependent on the actual meshing to varying degrees. While VR and VMI are insensitive to the discretization of the model, and VKL is nearly insensitive, they all still assume clean surface meshes, as for example self-intersections of polygons change the total surface area $A_t$ and the polygon areas $A_z$, and thus also VR and VKL, without necessarily altering the visible surface. This underlying assumption makes it harder to compare good viewpoints for models under different meshing qualities or resolutions, which is a problem if we want to extract model-spanning features of good viewpoints and bias the network towards good viewpoints of clean surface meshes. We reduce these influences with a mesh cleaning pipeline (see Section 4.2), in order to ensure that the viewpoint quality measures work as expected for different meshes.
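For orientation, the four measures are commonly written in the cited literature roughly as below. This is a hedged restatement, not a reproduction of the definitions used here. The notation is an assumption: $a_z(v)$ is the projected area of polygon $z$ seen from viewpoint $v$, $a_t(v)$ the total projected area, $A_z$ and $A_t$ the corresponding surface areas on the model, $\mathrm{vis}_z(v) \in \{0, 1\}$ a visibility indicator, and $p(z \mid v)$, $p(z)$ the conditional and marginal polygon probabilities of the viewpoint information channel.

```latex
\begin{align}
\mathrm{VE}(v)  &= -\sum_{z} \frac{a_z(v)}{a_t(v)} \log \frac{a_z(v)}{a_t(v)} \\
\mathrm{VR}(v)  &= \sum_{z} \mathrm{vis}_z(v)\, \frac{A_z}{A_t} \\
\mathrm{VKL}(v) &= \sum_{z} \frac{a_z(v)}{a_t(v)} \log \frac{a_z(v)/a_t(v)}{A_z/A_t} \\
\mathrm{VMI}(v) &= \sum_{z} p(z \mid v) \log \frac{p(z \mid v)}{p(z)}
\end{align}
```

Written this way, VE depends only on the distribution of projected polygon areas, which is consistent with its value changing when faces are subdivided.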
To compute the best viewpoints for a given model we sample the unit sphere $S^2$ with 1k viewpoints $V \subset S^2 \subset \mathbb{R}^3$ on a Fibonacci sphere [Gon10], generating almost equidistantly distributed viewpoints. The viewpoint quality measures are computed on rendered 2D images, and are thus influenced by the image resolution. We compared different image resolutions, see Fig. 2, and chose to render the 3D models with 1024 × 1024 pixels, where the camera is placed at a distance of half the diagonal of the bounding box, centered on the mean of the bounding box, using perspective projection. We found this to be a good trade-off between accuracy and compute time.
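The candidate viewpoint sampling can be sketched as follows. This is a minimal illustration of one common Fibonacci sphere construction (golden-angle spiral over equal-height bands); the exact variant of [Gon10] may differ in offsets, and the function name `fibonacci_sphere` is hypothetical.

```python
import math

def fibonacci_sphere(n):
    """Return n roughly equidistant unit vectors (x, y, z) on the sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        # Centre of the i-th of n equal-height bands, so z lies in (-1, 1).
        z = 1.0 - (2.0 * i + 1.0) / n
        radius = math.sqrt(max(0.0, 1.0 - z * z))
        # Rotate by the golden angle per step to spread points evenly.
        phi = i * golden_angle
        points.append((radius * math.cos(phi), radius * math.sin(phi), z))
    return points

# 1k candidate viewpoints, as used for sampling the viewpoint qualities.
viewpoints = fibonacci_sphere(1000)
```

Each returned vector is a viewing direction on the unit sphere; the camera distance and projection are applied separately when rendering.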
We further normalized each measure to the range [0, 1], where 0 and 1 refer to the viewpoint quality of the worst and best viewpoint, respectively:
\[ \overline{VQ}(v) = \frac{VQ(v) - VQ(v^-)}{VQ(v^+) - VQ(v^-)}, \]
where $v^+ \in V$ is a viewpoint with the best and $v^- \in V$ is one with the worst viewpoint quality of the sampled views $V$. In the following we will always refer to these normalized versions of the viewpoint quality measures.
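A minimal sketch of this normalization, assuming the raw qualities of all sampled views are available as a list. The function name and the `higher_is_better` flag are hypothetical; the flag covers measures such as VKL and VMI, whose best views have the lowest raw values.

```python
def normalize_quality(values, higher_is_better=True):
    """Map raw viewpoint qualities to [0, 1]; 1 is the best view, 0 the worst."""
    best = max(values) if higher_is_better else min(values)
    worst = min(values) if higher_is_better else max(values)
    if best == worst:
        # Assumption: a constant quality map treats every view as optimal.
        return [1.0 for _ in values]
    return [(q - worst) / (best - worst) for q in values]
```

For example, raw values `[2.0, 4.0, 3.0]` map to `[0.0, 1.0, 0.5]` when higher is better, and to `[1.0, 0.0, 0.5]` when lower is better.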

High Quality Viewpoint Prediction
Predicting good viewpoints with neural networks confronts us with two major challenges, the non-uniqueness of the best viewpoint and the mesh dependency of the viewpoint qualities. In the following sections we will describe how we address these challenges.

Dynamic Label Generation
Optimal viewpoints are not necessarily unique, e.g., due to model symmetry, which means that instead of a definite optimal viewpoint $v^+ \in V$ we typically find a set of viewpoints $V^+ \subset V$ which maximize the viewpoint quality measure. This phenomenon is referred to as label ambiguity.
In the general setting of label ambiguity a set of labels $Y$ is given together with a quality measure $p : Y \to [0, 1]$, which in our case is given by the normalized viewpoint qualities. The naïve label decision would be to ignore label ambiguity and choose one label $y^+ \in Y$ as the Single Label (SL) for each model prior to training, and train to minimize the loss $\ell(\hat{y}, y^+)$ between the prediction $\hat{y}$ and the chosen label. For viewpoints the natural choice is the cosine distance between the prediction $\hat{v}$ and one viewpoint $v^+$ with a viewpoint quality of 1:
\[ \ell(\hat{v}, v^+) = 1 - \langle \hat{v}, v^+ \rangle. \]
(Note: the $\ell_2$ norms are 1 as we evaluate on the unit sphere.) However, if this decision is not consistent over the entire dataset, the network is unable to resolve the label ambiguity during training, e.g., if two similar models with similar viewpoint quality distributions are labeled differently, the network receives contradicting gradients impacting the learning capability, as illustrated in Fig. 3 (top).
We aim to resolve this problem by moving the label decision from a preprocessing step into the training process, by making it dependent on the current network prediction. This way the label decision is implicitly learned by the network, and can change dynamically during training to harmonize the label decisions over the dataset. In the following we propose two techniques for dynamic label generation.
Multiple Labels (ML). We choose a subset of high quality labels $Y^+ = \{y \in Y : p(y) \geq \alpha\}$ with a quality threshold $0 \leq \alpha \leq 1$. During training the loss between the current prediction $\hat{y}$ and the closest label in $Y^+$ is minimized:
\[ \min_{y \in Y^+} \ell(\hat{y}, y). \]
In our setting of viewpoint prediction this simplifies to
\[ \min_{v \in V^+} \left( 1 - \langle \hat{v}, v \rangle \right), \]
where we select labels $V^+$ with a quality threshold $\alpha = 0.99$.
In practice $V^+$ often consists of clusters covering areas of good viewpoint quality values, which are similar for similar input models, causing the gradients to reinforce each other. However, as the network only optimizes to the closest label, we observe it stopping at the boundary of one of these clusters, rather than moving towards its center (see Fig. 3), which results in non-optimal values. To further improve the performance we propose a second approach which considers the quality measure and not just a quality threshold.
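The difference between the single label and the multiple label loss can be illustrated with a small sketch. It uses plain Python tuples instead of tensors, and the function names are hypothetical; in actual training the same expressions would be written with differentiable operations.

```python
import math

def cosine_distance(pred, label):
    """1 - cos(angle) between two 3D vectors; pred need not be unit length."""
    dot = sum(p * l for p, l in zip(pred, label))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_l = math.sqrt(sum(l * l for l in label))
    return 1.0 - dot / (norm_p * norm_l)

def multiple_label_loss(pred, viewpoints, qualities, alpha=0.99):
    """Cosine distance to the closest viewpoint with normalized quality >= alpha."""
    # Normalized qualities always contain a 1, so this set is never empty.
    good = [v for v, q in zip(viewpoints, qualities) if q >= alpha]
    return min(cosine_distance(pred, v) for v in good)

# Two opposite views of (almost) equal quality, e.g. due to symmetry.
views = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
quality = [1.0, 0.995, 0.5]
pred = (-0.9, 0.1, 0.0)

sl = cosine_distance(pred, views[0])            # fixed single label
ml = multiple_label_loss(pred, views, quality)  # closest good label
```

Here the prediction already points at a good view, yet the fixed single label penalizes it heavily, whereas the multiple label loss is close to zero.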
Figure 3: Dynamic label generation. Illustration of the proposed dynamic label generation technique for best viewpoint prediction; we use Mercator projections of the viewpoint sphere, as indicated on the right. Top: Best viewpoints are not necessarily unique, thus randomly choosing a maximum as label can create different labels for similar input models, which the network is unable to resolve. Bottom: To harmonize the label decision, we propose our two stage dynamic label generation. We first provide the network with multiple labels (ML) of high viewpoint quality and optimize towards the closest one. The labels typically form clusters in high quality areas, in which case the optimization tends to converge towards the boundary. To refine the predictions, we generate the label dynamically in a second stage (GL). The viewpoint quality distribution is weighted with a Gaussian centered at the current prediction and the maximum of the result is used as a label, which is typically a close local maximum, i.e., the maximum of the closest cluster. Both stages, ML and GL, provide more similar labels for similar input.
Gaussian Labels (GL). We propose to select labels with a high quality value $p$ in the proximity of the current network prediction. We incorporate this through a locality constraint by weighting the label distribution with a shifted Gaussian function
\[ p_g(y, \hat{y}) = p(y) \cdot \left( \exp\left( -\frac{\|y - \hat{y}\|^2}{2\sigma^2} \right) + s \right), \]
and then optimize towards a label which maximizes this measure,
\[ y^+_g(\hat{y}) = \operatorname{argmax}_{y \in Y} \, p_g(y, \hat{y}). \]
The additive term $s$ ensures that distant high quality labels are not dismissed, which keeps the network from getting stuck in larger regions with low $p$ values. For our experiments we set $\sigma = 2$, $s = 1$, which leads to
\[ VQ_g(v, \hat{v}) = VQ(v) \cdot \left( \exp\left( -\frac{\|v - \hat{v}\|^2}{8} \right) + 1 \right). \]
We observe this approach to keep optimizing towards a local maximum of $VQ_g$, whereby the value of this local maximum can in some cases be sub-optimal, e.g., if the initial guess of the network is in a bad region.
For best results we use ML for initialization to first optimize towards the closest high quality viewpoint, followed by GL to refine the predictions inside a promising region, see Fig. 3.
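The Gaussian label selection can be sketched as below. This is an illustration under assumptions: the shift $s$ is taken to sit inside the weighting factor (otherwise it would not affect the argmax), the prediction is assumed to be normalized to unit length, and the function name is hypothetical.

```python
import math

def gaussian_label(pred, viewpoints, qualities, sigma=2.0, s=1.0):
    """Return the sampled viewpoint maximizing quality * (Gaussian(pred) + s)."""
    def weight(v):
        sq_dist = sum((a - b) ** 2 for a, b in zip(v, pred))
        return math.exp(-sq_dist / (2.0 * sigma ** 2)) + s

    scores = [q * weight(v) for v, q in zip(viewpoints, qualities)]
    best = max(range(len(viewpoints)), key=scores.__getitem__)
    return viewpoints[best]

views = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
quality = [1.0, 0.98, 0.3]

# The label follows the prediction towards the nearby high quality view.
label_a = gaussian_label((-1.0, 0.0, 0.0), views, quality)
label_b = gaussian_label((1.0, 0.0, 0.0), views, quality)
```

Because the label is recomputed from the current prediction at every step, it stays fixed only once the prediction has settled near a local maximum of the weighted quality.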

Mesh Cleaning
As mentioned above, some viewpoint quality measures are sensitive to the meshing of the models, and bad mesh quality can lead to distortions in the viewpoint quality computation. These inaccuracies reduce the comparability between different models, which makes it hard for a network to determine the important features. Providing clean and comparably meshed models, on the other hand, biases the network to implicitly extract features of a clean surface solely from point information. To minimize these influences, we pass all meshes through a mesh cleaning pipeline, which resolves mesh intersections and regularizes the meshing. For details on the mesh cleaning and its influence on the viewpoint quality measures we refer the reader to the supplementary material. We note that our pipeline does not remove all artifacts, but the achieved mesh quality proved sufficient for our experiments.

Network Architecture
We deliberately chose point clouds as input to achieve robustness to mesh polygonization and discretization, in contrast to neural networks which operate directly on meshes. The feature extractor that maps the point cloud to a latent representation is shown in Fig. 4 (top). The learned latent representation is processed by four parallel Multi Layer Perceptrons (MLPs) with three layers of sizes 1024, 256, 3, each outputting a viewpoint $\hat{v} \in \mathbb{R}^3$ for one of the four viewpoint quality measures, see Fig. 4 (bottom). We found that training one feature extractor for all four viewpoint qualities improves the performance as compared to training four separate networks, an effect we attribute to the different losses improving the feature extractor, similar to auxiliary losses [SWY * 15]. We further use batch normalization and ReLU activation between all layers.
For all conducted experiments, we used the same hyperparameters, stressing that our network is applicable to different categories and viewpoint quality measures without further tuning. Namely we use dropout [SHK * 14] in the MLP layers with a dropout rate of 0.5, Adam optimization [KB14] with batch size of 8 and a learning rate decay with an initial learning rate of 0.001 which is multiplied by 0.75 every 200 epochs. We train for a total of 3000 epochs and switch from ML to GL after 1500.
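The training schedule described above can be summarized in a few lines. This sketch assumes a staircase decay (the rate changes only at multiples of 200 epochs) and zero-based epoch counting; both are assumptions, as are the function names.

```python
def learning_rate(epoch, initial=0.001, decay=0.75, step=200):
    """Learning rate multiplied by `decay` every `step` epochs (staircase)."""
    return initial * decay ** (epoch // step)

def label_mode(epoch, switch_epoch=1500):
    """Multiple labels (ML) for initialization, Gaussian labels (GL) afterwards."""
    return "ML" if epoch < switch_epoch else "GL"

# Schedule over the full 3000 epochs.
schedule = [(e, learning_rate(e), label_mode(e)) for e in range(3000)]
```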

Experiments
To validate our viewpoint learning approach, which is enabled by dynamic label generation, we conducted three experiments. First, we trained a neural network to predict good viewpoints on point clouds of arbitrarily oriented 3D models, and compared our label generation method to existing techniques. Next, we inspected the robustness of our method towards different meshings and samplings of the input models, and lastly we provide timings for both the sampling algorithm and our network.

Data
All experiments were conducted on a subset of ModelNet40 [WSK * 15], composed of the categories airplane, bench, bottle, car, chair, sofa, table and toilet, which we split into 80% training, 10% validation and 10% test data. All models were preprocessed as described in Section 4.2. In order to sample the viewpoint quality measures in reasonable time we only use models with at most 10k faces. All meshes are converted into point clouds by sampling 20k random uniform points on the faces. We use a rather dense input of 4096 points per model to capture fine geometric detail; for comparison, common object classification networks typically use only 1024 points [HRV * 18, QYSG17].
Data Augmentation. Neural networks working with three dimensional input data usually require a large database to achieve noteworthy performance. This is due to the high dimensionality of the input space, as well as to the complexity of the tasks. As the available sources for 3D data are rather limited, as compared for example to image data, the use of data augmentation, which virtually increases the dataset, is crucial for our experiments. Therefore, we use the following two data augmentation strategies, random sampling and rotations:
1. The input point clouds are generated by selecting 1024 points using farthest point sampling [ELPZ97], and selecting additional 3072 random points.
Figure 5: Viewpoints predicted by our network for unseen models and ground truth obtained from sampling the viewpoint sphere. We also show the corresponding viewpoint quality spheres centered at the displayed viewpoint. The network successfully predicts high quality viewpoints, indicated by the yellow areas in the viewpoint spheres.
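The random sampling strategy (1024 farthest point samples plus 3072 random points) can be sketched as follows. This is a simple O(n·k) illustration in pure Python, not the implementation used for the experiments. It assumes distinct input points, a random starting point, and that the additional random points are drawn without replacement from the points not yet selected; all function names are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_sampling(points, k, rng):
    """Return indices of k points, each maximizing distance to those chosen."""
    first = rng.randrange(len(points))
    chosen = [first]
    # Squared distance of every point to its nearest chosen point so far.
    min_sq = [sq_dist(p, points[first]) for p in points]
    for _ in range(k - 1):
        nxt = max(range(len(points)), key=min_sq.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            d = sq_dist(p, points[nxt])
            if d < min_sq[i]:
                min_sq[i] = d
    return chosen

def sample_input(points, n_fps=1024, n_random=3072, seed=None):
    """Combine farthest point samples with additional random points."""
    rng = random.Random(seed)
    fps_idx = farthest_point_sampling(points, n_fps, rng)
    remaining = sorted(set(range(len(points))) - set(fps_idx))
    rand_idx = rng.sample(remaining, n_random)
    return [points[i] for i in fps_idx + rand_idx]
```

With a different seed per epoch, the same 20k-point cloud yields a different 4096-point input each time, which is what makes this step act as augmentation.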

Viewpoint Prediction
We demonstrate the effectiveness of our two stage dynamic label generation (ML+GL) by comparing it against single label cosine distance (SL) and existing work on resolving label ambiguity: Deep Label Distribution Learning (DLDL) [GXX * 17], which directly predicts the viewpoint quality distribution, and Spherical Regression (SR) [LGS19], which splits the optimization into two parts, a regression for the absolute value $|v|$ and a classification task for the signs.
We train a network as described in Section 4.3 on each category to predict viewpoints for the four different viewpoint quality measures, the Viewpoint Entropy (VE), the Visibility Ratio (VR), the Viewpoint Kullback-Leibler divergence (VKL) and the Viewpoint Mutual Information (VMI). For SR and DLDL we have to adapt the architecture and loss as follows:
SR: The loss function for SR consists of the cosine distance to $|v^+|$ and the cross-entropy loss of the sign prediction. Furthermore, we use two MLPs per output to predict absolute values and the sign categories.
DLDL: The loss function for DLDL is the per-pixel $\ell_2$ distance. The MLPs are replaced by 2D decoder networks consisting of 2D deconvolutions and residual blocks [HZRS16] predicting the viewpoint quality distributions; for more details on the architecture we refer to the supplementary material. For a fair comparison we predict at a resolution of 32 × 32 = 1024, which is close to the 1000 sampled viewpoints used as labels. We choose the viewpoint with the highest predicted viewpoint quality as the predicted viewpoint.
We measured the mean viewpoint qualities of the predicted viewpoints on the test set, averaged over all categories, and compared the different methods in Table 2. Our proposed two stage combination of ML and GL (ML+GL) clearly outperforms the naïve approach SL and both DLDL and SR. Further, we performed an ablation study where we compare our combined ML+GL approach to only using multiple labels (ML) and Gaussian labels (GL). While all three dynamic label generation methods outperform the existing methods with precomputed labels, confirming that our proposed method provides a better way to resolve label ambiguity for this task, the two stage ML+GL method improves over both single stage methods. We conclude that initializing the predictions with ML substantially improves the results over training solely with GL, as GL has a stronger locality restriction, making it sensitive to initialization.
We found that SR is not always able to resolve the ambiguity, leading to predictions with wrong sign decisions or false regression results for $|v|$, interpolating good viewpoints, see Fig. 6. We theorize that this is because an underlying assumption for SR is that $|v|$ is the same for all labels, but as in our case the label ambiguity does not solely stem from model symmetry and the input is not necessarily aligned with the 3D axes, the assumption does not hold.
Table 2: Viewpoint prediction results. Top: Mean viewpoint quality in % of the predicted viewpoints using the different labeling techniques on the test set. Dynamically generated labels (ML, GL, ML+GL) improve over existing methods (SL, DLDL, SR), and our proposed two stage dynamic label generation method ML+GL improves over the one stage methods (ML, GL), yielding the best results for all four viewpoint quality measures. Bottom: Mean viewpoint quality in % of the ML+GL approach for the different categories. The performance is consistent over all categories.
Figure 6: Spherical regression. SR struggles with resolving label ambiguity as the ambiguity is not axis-symmetric, leading to predictions that have a flipped sign decision (yellow) although the absolute value might be correct (blue).
Predicting the viewpoint quality distribution (DLDL) produced the worst results. By analyzing the network prediction we found that the predicted distributions are much smoother than the ground truth distributions; for details we refer the reader to the supplementary material. We hypothesize that the network is unable to capture the geometric details which create the high frequency properties of the viewpoint quality distribution and as a result predicts an averaged distribution for similar models. We attribute this to two main factors. First, the task itself is harder than only predicting the optimal viewpoint, as it demands the extraction of geometric features at a finer scale. The extraction of such details would however require a denser input sampling and a wider and deeper feature extractor, and in consequence also a larger data set for training. Second, the influence of mesh quality on the viewpoint quality distribution is naturally higher than on the position of the optimal viewpoint. Thus our preprocessing pipeline might be insufficient and create distortions that the network is unable to resolve.
Figure 7: Robustness to mesh polygonization. We show predictions using VE for different subdivisions of the seating surface. As VE favors small triangles, the bias towards views from the top increases with higher mesh density (red). Our network based approach remains stable independent of the meshing (yellow).
The results of our method are stable for all examined categories, as can be seen in the bottom half of Table 2, showing that no additional tuning of the hyperparameters is necessary to learn various categories or viewpoint quality measures. Detailed results can be found in the supplementary material.
Viewpoints predicted on the test set, i.e. unseen models, by our network trained with ML+GL labels can be seen in Fig. 5. We stress that due to label ambiguity the network is not optimized towards reproducing the same viewpoint as the sampling method, but to predict a viewpoint with high viewpoint quality. This potentially leads to different views, e.g., the toilet in Fig. 5, for which both views have a high quality, as can be seen in the viewpoint quality spheres in the figure.

Mesh and Sampling Independence
We use unstructured 3D convolutions and hence the input to the network are point clouds only consisting of coordinate information. As these points carry no additional information about the polygonization of the underlying mesh we expect our approach to be insensitive to the discretization of the mesh at test time. Furthermore, the use of MCCNNs, which consider an estimate of the point density, should result in a robustness to point sampling strategies.
To confirm this we perform two different experiments. The first one is the application to a toy example, in which we subdivide a part of the chair_0047 mesh into smaller polygons. On the original model VE prefers views from the bottom showing more geometric details in form of the legs, while after subdividing the seating surface VE mistakes the small faces as surface details, emphasizing the visibility of this area, see Fig. 7, which results in a viewpoint far from the optimal views of the original mesh. Our approach on the other hand, predicts viewpoints in an optimal area of the original mesh, independent of the meshing.
In a second experiment we show the robustness of our approach in practice to input that differs from the clean data provided during training. First, to investigate the robustness to mesh quality, we tested our network on the raw ModelNet40 models, which contain self-intersections, non-surface faces and non-uniform discretization. This is a more challenging task than the first toy example, as the model geometries are different and not only the mesh discretization, which confronts the network with out-of-domain input. Second, to show that our approach is additionally robust to different point sampling strategies, we also evaluate on the points provided by Qi et al. [QYSG17], who use a different pipeline to achieve clean surface point clouds. The results reported in Table 3 confirm that our approach is robust under sampling of the input data. We infer that the network has learned an internal representation of the meshing used during training.

Timings
We compared the time needed to estimate high quality viewpoints using the sampling approach described in Section 3, and the time needed to predict high quality views using our neural network model, as described in Section 4.3. The timings were measured on a system with an Intel Core i7-8700K CPU @ 3.70GHz and a NVIDIA GeForce GTX 1080 GPU. While the sampling approach was implemented using Python and OpenGL, our network approach was realized through Python and TensorFlow. To make the measurements comparable, we employed the following two conditions. First, we neglected initialization times, which include loading the meshes, preprocessing the meshes for the sampling method and sampling points and loading the weights for the network. Second, we sampled the viewpoint quality measures in one procedure, computing shared values only once. For the evaluation we chose models of different sizes, ranging from 10k faces to 1M faces, whereby we processed all these models 10 times with both methods and reported the averaged times in Table 4.
While the elapsed time of the sampling approach is approximately linear in the number of candidate views and the number of faces, the network only requires one execution. This execution's time is independent of the model size, outperforming the other method by orders of magnitude. While we see some variation in the execution time of the network, which we attribute to varying numbers of points in the 3D convolutions and point hierarchy levels, the timings are comparable for all inspected models.

Limitations and Future Work
To achieve the reported results, we trained category specific instances of our network in a divide-and-conquer scheme, which is common for similar deep learning tasks such as viewpoint estimation [ST19] or upright prediction [LZL16]. This prevents the proposed network from generalizing to unseen categories, however, we see no theoretical limitation of our method and expect such generalization to be possible in the future by i) expanding the learning capabilities, e.g. using mixture of experts as was shown for viewpoint estimation [LGS19], and ii) increased amount of training data, a key ingredient in order to generalize to unseen categories.
While our network can predict multiple viewpoints at once, the views are independent, as it predicts one viewpoint per measure. We see potential for predicting multiple viewpoints that complement one another. However, this leads to the problem of defining a good second view. Is it one that best covers the unseen parts of the model or a second view with high quality value? Note that the latter can be a very similar view direction. Moreover, the number of good views may vary per model, which could be addressed with network architectures that can output sequences, e.g. recurrent models.
Table 4: Time comparison. Elapsed time of sampling based methods and ours for different model sizes; all timings are averaged over 10 executions. We measure the brute force sampling method using 250, 500 and 1k candidate views, and measure our model when batch processing 1, 64 and 256 models at the same time. Our network approach is faster by orders of magnitude and is independent of the model size as it uses a point cloud of fixed size. We report N/A where the execution did not finish after 12h.
Our method learns good viewpoints based on existing viewpoint quality measures; however, no measure is able to fully model human preference. While our method is general enough to learn on manually selected viewpoints, currently no large scale data set is available, and existing data is too limited for deep learning (16 models [SLF * 11], 68 models [DCG10]). A way to overcome the need for a large data set would be self- or weakly-supervised training, which could in the future be investigated based on recent advances in differentiable rendering [NDVZJ19, NPLBY18].
Furthermore, we see potential to induce non-geometrical biases to the network by considering semantics, e.g. up-right orientation.

Conclusion
The proposed learned viewpoint prediction provides a way to predict high quality viewpoints for different viewpoint quality measures and model categories. By separating viewpoint selection and rendering our approach performs faster than existing techniques by several orders of magnitude. This makes our method suitable for applications which benefit from speed and parallelizability, such as automatic thumbnail generation of 3D data sets or initial camera placement for user interaction. The prediction of viewpoints directly from unstructured 3D point data proved to make the prediction robust to meshing properties, which makes us believe that the network has learned an internal representation of a clean mesh, as intended. The proposed dynamic label generation method is essential to resolve label ambiguity during training, outperforming existing methods, and is designed to be transferable to other learning tasks that involve label ambiguity.
On top of the contributions made in this article, we provide a dataset, which will be, to our knowledge, the first large scale viewpoint quality dataset, containing more than 16k models in total. More details can be found in the supplementary material.