SENS: Part-Aware Sketch-based Implicit Neural Shape Modeling

We present SENS, a novel method for generating and editing 3D models from hand-drawn sketches, including those of abstract nature. Our method allows users to quickly and easily sketch a shape, and then maps the sketch into the latent space of a part-aware neural implicit shape architecture. SENS analyzes the sketch and encodes its parts into ViT patch encoding, subsequently feeding them into a transformer decoder that converts them to shape embeddings suitable for editing 3D neural implicit shapes. SENS provides intuitive sketch-based generation and editing, and also succeeds in capturing the intent of the user's sketch to generate a variety of novel and expressive 3D shapes, even from abstract and imprecise sketches. Additionally, SENS supports refinement via part reconstruction, allowing for nuanced adjustments and artifact removal. It also offers part-based modeling capabilities, enabling the combination of features from multiple sketches to create more complex and customized 3D shapes. We demonstrate the effectiveness of our model compared to the state-of-the-art using objective metric evaluation criteria and a user study, both indicating strong performance on sketches with a medium level of abstraction. Furthermore, we showcase our method's intuitive sketch-based shape editing capabilities, and validate it through a usability study.


Introduction
Data-driven techniques have become the de facto state-of-the-art for recovering a shape from a partial representation in computer graphics. Training neural networks can leverage prior domain knowledge of the data to deal with the innate ambiguity of the input. Neural implicit fields are currently widely used as a generative model because of their ability to represent arbitrary shapes at arbitrary resolutions [CZ19, PFS * 19, AHY * 19, OELS * 22, TTM * 22]. However, generative models either allow one to randomly sample from the latent space or interpolate between known latent representations, and hence offer only very limited control over the output shape, which hinders creativity. Thus, editing implicit representations for creative processes is not straightforward [HPG * 22, HASB20].
In this paper, we approach the generation and editing of neural implicit shapes based on free-form sketching. Sketching is an intuitive and effective way to visually communicate shape information. Moreover, sketch-based modeling and editing can be particularly impactful in fields such as architecture, game development and product design, where 3D models are an essential part of the workflow. Despite vigorous efforts in sketch-based 3D modeling, it remains a challenging problem. First, the reconstruction of a 3D shape from an image is inherently ill-posed, since a raw image without annotation is generally a representation of a 3D object merely from a single viewpoint. Second, sketches can vary significantly in style and abstraction level, ranging from fast, casual or even sloppy styles to professional, rigorous sketches. In this paper, we define abstract sketches as hand-drawn representations that may lack geometric accuracy and focus more on capturing the essence or key features of the intended 3D shape rather than its exact specifications. When assuming near-perfect correspondence between the sketched silhouettes or other shape features and the output shape, high quality results can be achieved, see e.g. [LGK * 17, LPL * 18, DSC * 20, ZLY * 23]. Similarly, exceptional 3D results can be extracted from high quality input technical drawings that include 3D clues, such as hidden lines [LPBM20] or symmetric strokes [HGSB22]. However, designing a sketch-based 3D modeling system that is agnostic to the level of sketch abstraction of the input and the personal style of the user, accommodating inexact or unskilled drawings, is challenging.
Aside from using sketches to retrieve scenes for modeling [ERB * 12], data-driven generative techniques have always been susceptible to merely retrieving shapes from their training datasets [TRR * 19, SSG * 22]. Providing guarantees that shape-generating systems create novel shapes is thus imperative. We therefore approach the problem using a part-aware generative model to avoid this retrieval pitfall. Part-aware modeling can mitigate the issue, since the generation first detaches the different parts, before assembling the whole shape coherently. This motivates us to use SPAGHETTI [HPG * 22], a part-aware neural implicit shape representation model, as our backbone.
We present SENS, a method that leverages part-aware neural implicit representation to output novel shapes out of an input sketch. Our framework decomposes the input sketch into patches that are fed into a Vision Transformer [DBK * 20]. A transformer decoder then outputs the latent code into the latent space used by SPAGHETTI [HPG * 22]. Using this space, editing can be applied to specific isolated parts of the shapes. For example, the user can manually select a part of the generated shape, such as the back of a chair, and redraw it by restricting the modification to the selected part only. Furthermore, our method offers the ability to systematically replace selected parts of a generated shape, providing an effective means of refining the model and removing any undesired artifacts. SENS also offers the possibility to outline the obtained shape while modeling, granting the user the possibility to modify the sketch directly and lowering the sketching skill cap.
We compare SENS with state-of-the-art sketch-to-shape techniques, encompassing both empirical and quantitative analyses. To illustrate that our method goes beyond simple shape retrieval, we present the top-4 shapes retrieved from the shapes generated by our approach. We further validate the quality of SENS's generation ability via a comparative perceptual user study. We also showcase the editing possibilities of our method in an interactive environment.
Our key contributions are:
• Sketch-based modeling based on single-view sketches of diverse levels of abstraction.
• State-of-the-art results for shape generation with limited retrieval.
• New editing capabilities that allow for part-based shape refinement and localized sketch-based reshaping and combinations.

Related work
Sketch-based modeling. Sketch-based modeling was extensively researched before the recent burst of data-driven techniques. As we focus on the latter, we only present a fragment of this domain and refer the reader to [CIW08
Neural networks shape representation types. The rise of deep learning for 3D geometry inspired the use of many shape representations. Explicit representations are popular for their expressiveness and editing possibilities. However, mesh representations require using graph neural networks [HHF * 18, WZL * 18, FFY * 19], which are computationally harder to process due to the inherent lack of regularity. Parametric representations offer mathematical accuracy but are hard to acquire and often rely on other representations for learning, such as meshes [PUG19], point clouds [SLK * 20] or distance fields [SFK * 20]. Voxel representations leverage the regularity of the grid to ease the design of effective networks [ZZZ * 18, WZZ * 18], but they are resolution dependent and lead to poor representations of details. Point clouds are easy to acquire and process but do not embed geometrical structures [FSG16, YHCOZ18, YHH * 19]. We refer the reader to [MKKv22] for a comprehensive survey on neural shape representations.
Neural implicit shape generation and modeling. Neural implicit representations emerged as an alternative representation.
Neural sketch-to-mesh methods. Using neural networks for sketch-based modeling is an active area of research in computer graphics, and there has been notable progress in recent years in developing neural network-based approaches for generating 3D models from 2D sketches. ShapeMVD [LGK * 17] and SketchCNN [LPL * 18] reconstruct shapes from 2D sketches using a convolutional neural network, but require multiple views and do not support abstract sketches. ProSketch [ZQG * 21] and DeepSketch [ZGZS22] are trained on a mix of synthetic and professional sketches. Some view-aware modeling systems exist: Sketch2Mesh [GRYF21] proposes an encoder/decoder architecture to reconstruct 3D shapes that can be refined via a user interface; Garment Ideation [CWC * 22] is a feature aggregation-based iterative method targeted towards garment ideation that predicts a winding number to generate 3D shapes; concurrently to our work, LAS-Diffusion [ZPW * 23] proposes a multi-class diffusion method based on an attention mechanism and GA-Sketching [ZLY * 23] proposes a multi-view method with modeling options via iterative refinement, but they fall short in effectively processing abstract sketches. Edit3D [CCR * 22] employs a unified latent space to generate 3D shapes, sketches, and RGB images, thereby establishing a correspondence between these three types of representations that enables shape and color editing. Delanoy et al. [DBA * 17] propose a method to recover a volumetric shape from an input sketch. Parametric representations are also used for sketches [SBS19]. Sketch2CAD [LPBM20] is based on the generation of primitives, Free2CAD [LPBM22] decomposes an input sketch into a sequence of strokes that are mapped to a sequence of CAD instructions, and GeoCode [PLH * 22] offers sketch-based modeling of parametric shapes with additional part-aware control of the relevant parameters. Note that neural methods can also recover shapes from non-sketch images. Pixel2Mesh [WZL * 18] recovers a mesh from an image while 3D-R2N2 [CXG * 16] and NeRFs [MST * 20] can reconstruct a shape from multiple views. SKED [MPS * 23] is a NeRF-based method which provides a sketch-guided text-based shape editing method. Table 1 presents the strengths and weaknesses of the works most related to ours.

Method
SENS generates a neural implicit shape from a single-view input sketch. More specifically, it associates with a sketch a latent code that can be interpreted by a neural implicit shape decoder. To this end, we design a neural network that learns to match a sketch to its corresponding shape's latent code in the latent space of SPAGHETTI [HPG * 22]. SPAGHETTI is designed to convert a latent vector into a collection of m Gaussians, where each Gaussian represents a part of the object. Subsequently, each part goes through a "mixing network", a transformer encoder that ensures global consistency across the shape. An "occupancy network" follows for decoding the final shape, as it returns the signed distance function from a query point. The main property of SPAGHETTI that we use is that it is a part-aware implicit shape decoder: its latent space is divided into several parts, and each part of the latent space encodes a corresponding part of the resulting shape. This feature makes it possible to train our network on partial inputs to mitigate shape retrieval, to train a refinement network that can regenerate selected latent parts, and to restrict the shape generation to specific parts during the modeling process.

Data generation and input normalization
To improve our network robustness with respect to the style and the level of abstraction of the input, we use a dataset with a variation of designs. Our dataset is based on a subset of the ShapeNet dataset [CFG * 15]: the chair dataset with 6755 shapes, the lamp dataset with 833 shapes, and the airplane dataset with 1775 shapes. Each was rendered with six different views in three different manners: (i) volumetric rendering that relies on ray marching; (ii) outline rendering based on the depth map; and (iii) partial outline rendering, which are renders of SPAGHETTI's shapes after masking out parts of their latent code. In addition, (iv) abstract sketches of eight strokes were computed based on each view of the volumetric renderings by using CLIPasso [VPB * 22] with 2000 iterations. For chairs, we used an additional dataset, (v) ProSketch [ZQG * 21], to add freehand sketches drawn by experts. We display examples from our sketch dataset in Fig. 3. The data is augmented with random perspective transformation and horizontal symmetry. Using the fact that CLIPasso provides vector graphics outputs, we applied data augmentation to its abstract sketches by modifying the stroke width before rendering it as an image. We normalize the input by centering the sketch, cropping the empty borders and resizing it to a 256 × 256 image. Partial outline renderings are normalized and cropped in alignment with their respective full renders.
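The input normalization described above (crop the empty borders, center, resize to 256 × 256) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the function name `normalize_sketch`, the white background value, the square padding that preserves the aspect ratio, and the nearest-neighbor resize are all choices made here for the example.

```python
import numpy as np

def normalize_sketch(img, size=256, background=255):
    """Crop empty borders, center the drawing on a square canvas, resize to size x size.

    `img` is a 2D array; pixels equal to `background` are treated as empty.
    Square padding and nearest-neighbor resampling are assumptions of this sketch.
    """
    ink = img != background
    if not ink.any():
        # Nothing drawn: return an empty canvas.
        return np.full((size, size), background, dtype=img.dtype)
    rows = np.where(ink.any(axis=1))[0]
    cols = np.where(ink.any(axis=0))[0]
    crop = img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    # Pad the shorter side so the drawing sits centered on a square canvas.
    h, w = crop.shape
    side = max(h, w)
    canvas = np.full((side, side), background, dtype=img.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    canvas[top:top + h, left:left + w] = crop
    # Nearest-neighbor resize to the target resolution.
    idx = np.arange(size) * side // size
    return canvas[np.ix_(idx, idx)]

# Usage: a black 20 x 100 bar drawn off-center on a 100 x 200 white image.
sketch = np.full((100, 200), 255, dtype=np.uint8)
sketch[10:30, 50:150] = 0
normalized = normalize_sketch(sketch)
```

After normalization the bar spans the full width of the 256 × 256 image and is vertically centered, regardless of where it was drawn originally.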

Sketch-to-latent representation
Our network maps a sketch to the latent representation of a neural implicit shape generator, namely SPAGHETTI [HPG * 22]. SPAGHETTI receives as input a latent representation that is mapped to a collection of m vectors of dimension $d_{model}$ that represents a Gaussian mixture model (GMM), i.e. each of these m vectors corresponds to a 3D Gaussian. SPAGHETTI outputs a 3D implicit shape by mapping each Gaussian to a part of the represented shape, and mixes these parts to produce a globally coherent shape. In this work, we make use of this intermediate GMM-based latent space and map the input sketch directly to it. For each given shape, we precompute its collection of latent vectors $\{z_i\}_{i=1}^{m}$ using shape inversion [HPG * 22]. An overview of our network architecture is displayed in Fig. 2. Inspired by the DETR object detection model [CMS * 20], our network is composed of an image encoder that takes an input sketch and outputs visual embeddings. A transformer decoder maps a set of learned part queries together with these visual embeddings to SPAGHETTI's multi-part latent space. The image encoder (Fig. 2 left) is a Vision Transformer [DBK * 20] that decomposes the input sketch into patches and outputs their visual embeddings.
Figure 4: A sketch of poor quality (i) may yield inadequate results (ii). Users can select unsatisfactory parts of the output (ii, lasso selection on shape in red; iii, selected parts in orange). Our refinement network can predict a refined shape (iv) by regenerating the selected parts of the latent space based on the non-selected parts.
The transformer decoder (Fig. 2 middle) takes as input these visual embeddings and a set of m part queries, and processes them using its self-attention and cross-attention layers. The part queries are learnable vectors, i.e. they are optimized at the same time as the network. Finally, each output vector of the decoder is mapped to a latent part vector of the set $\{\hat{z}_i\}_{i=1}^{m}$ consumed by the neural implicit shape decoder, SPAGHETTI, which uses them to generate the output shape (Fig. 2 right). The training loss we use is

$$\mathcal{L} = \frac{1}{m} \sum_{i=1}^{m} \lVert z_i - \hat{z}_i \rVert,$$

where $z_i$ is the ground truth ith part vector of the 3D shape that corresponds to the input sketch, and $\hat{z}_i$ is the prediction of SENS.
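To make the role of the part queries concrete, the following NumPy sketch runs a single cross-attention step in which m learned queries attend to the patch embeddings and each query yields one latent part vector. It is a simplified illustration, not the SENS architecture: it uses one head and one layer, omits self-attention, normalization and feed-forward blocks, and the weights are random stand-ins for trained parameters. The values m = 16 and d_model = 512 follow the single-class setup reported in the Results section; `n_patches` and all function and variable names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, memory, w_q, w_k, w_v):
    """Single-head cross-attention: every part query attends to all patch embeddings."""
    q, k, v = queries @ w_q, memory @ w_k, memory @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # shape (m, n_patches)
    return attn @ v, attn

rng = np.random.default_rng(0)
m, d_model = 16, 512      # number of parts and latent width (single-class setup)
n_patches = 256           # assumed number of image patches

part_queries = rng.normal(size=(m, d_model))               # learnable in the real network
patch_embeddings = rng.normal(size=(n_patches, d_model))   # stand-in for the image encoder output
w_q, w_k, w_v, w_out = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                        for _ in range(4))

decoded, attn = cross_attention(part_queries, patch_embeddings, w_q, w_k, w_v)
z_hat = decoded @ w_out   # one predicted latent part vector per query
```

The key design point illustrated here is that the number of outputs is fixed by the number of queries, not by the number of image patches, which is what lets the decoder emit exactly m part vectors for any sketch.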

Partial shape
SENS is trained to perform reconstruction also by additional outline renders of partial 3D shapes. The goal is to reinforce the uncoupling between parts of the restored shape as demonstrated in Sec. 4.6.
The partial outline rendering supervision for this task is obtained by randomly selecting a subset of part vectors $\{z_i [c_i]\}_{i=1}^{m}$ where the binary assignment $c_i$ indicates the presence of part i in the subset. Then the subset of vectors is given to SPAGHETTI which generates the corresponding partial 3D implicit shape. Finally, we render the partial shape. See Fig. 3(iii) for an example.
When feeding partial outline renders into SENS, we use different loss functions. In this case, the output of the transformer decoder is passed through an MLP to an additional classification score $\hat{c}_i \in [0, 1]$ which indicates the presence of part i in the input outline render. We optimize it by the binary cross entropy loss

$$\mathcal{L}_{cls} = -\frac{1}{m} \sum_{i=1}^{m} \left[ c_i \log \hat{c}_i + (1 - c_i) \log (1 - \hat{c}_i) \right],$$

where $c_i$ is the ground truth indicator of part i in the input render. Moreover, the loss for the latent vector prediction of our network is

$$\mathcal{L}_{part} = \frac{1}{\lVert c \rVert_0} \sum_{i=1}^{m} c_i \, \lVert z_i - \hat{z}_i \rVert,$$

where the $c_i$ are used to ignore latents of parts not present at the input and the normalizing factor $\lVert c \rVert_0$ counts the number of non-zero entries in $c = [c_1, \ldots, c_m]$.
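The two partial-input losses can be sketched numerically as below: a binary cross entropy on the part-presence scores, and a latent loss that is masked by the ground-truth indicators and normalized by the number of present parts. This is an illustrative sketch only. The function name `partial_losses`, the averaging of the cross entropy over the m parts, the clipping constant `eps`, and in particular the use of the Euclidean norm for the per-part distance are assumptions; the text does not fix these details.

```python
import numpy as np

def partial_losses(z, z_hat, c, c_hat, eps=1e-7):
    """Losses for partial outline renders.

    z, z_hat : (m, d) ground-truth and predicted latent part vectors
    c        : (m,) binary ground-truth presence indicators
    c_hat    : (m,) predicted presence scores in [0, 1]
    """
    c_hat = np.clip(c_hat, eps, 1.0 - eps)  # avoid log(0)
    l_cls = -np.mean(c * np.log(c_hat) + (1 - c) * np.log(1 - c_hat))
    per_part = np.linalg.norm(z - z_hat, axis=1)  # Euclidean distance: an assumption
    # Only parts present in the input contribute; normalize by their count (||c||_0).
    l_part = np.sum(c * per_part) / max(np.count_nonzero(c), 1)
    return l_cls, l_part

# Usage: four parts, of which parts 0 and 2 are present in the render.
z = np.zeros((4, 3))
z_hat = np.array([[3.0, 4.0, 0.0],   # present part, distance 5
                  [9.0, 9.0, 9.0],   # absent part, ignored by the latent loss
                  [0.0, 0.0, 0.0],   # present part, distance 0
                  [1.0, 1.0, 1.0]])  # absent part, ignored by the latent loss
c = np.array([1.0, 0.0, 1.0, 0.0])
l_cls, l_part = partial_losses(z, z_hat, c, c_hat=c)
```

In this example the latent loss averages only the two present parts, so the large error on the absent part 1 has no effect on it.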

Refinement network
The refinement network regenerates parts of a given shape, and is illustrated in Fig. 4. In some cases, poor quality or ambiguities in the input sketch (i) may lead to artifacts in the generated shape (ii). The user can select unsatisfactory parts from the output shape (iii, marked in orange). A selected part on the shape has a corresponding latent vector part in the GMM latent space. Our refinement network, which is conditioned on the latent vector parts of the non-selected parts, outputs a set of vector parts that replace the selected ones. Finally, the shape decoder regenerates the refined shape using the new latent vector parts (iv).
The refinement network is a bidirectional transformer encoder network that receives the set of latent vectors $z \in \mathbb{R}^{m \times d_{model}}$ such that the corresponding vectors of the selected parts are masked (i.e., zeroed). It outputs $\hat{z} \in \mathbb{R}^{m \times d_{model}}$, which contains the refined vectors in the entries corresponding to the selected parts.
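The network uses a masking objective [DCLT18], where 5% to 40% of the input vectors are masked, and the network has to predict their content based on the unmasked context. The loss is

$$\mathcal{L}_{refine} = \frac{1}{\lVert \mathbb{1} \rVert_1} \sum_{i=1}^{m} \mathbb{1}_i \, \lVert z_i - \hat{z}_i \rVert,$$

where the indicator $\mathbb{1}_i$ equals one if and only if the input vector $z_i$ was masked and $\lVert \mathbb{1} \rVert_1 = \sum_{i=1}^{m} \mathbb{1}_i$ is a normalizing factor.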
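A minimal sketch of this masking objective is given below: a random 5% to 40% of the part vectors is zeroed, and the loss is evaluated only on the masked entries. The helper names `sample_mask` and `refinement_loss`, the uniform sampling of the masking ratio, the rounding, the minimum of one masked vector, and the Euclidean norm are assumptions made for illustration; the refinement network itself is not modeled here.

```python
import numpy as np

def sample_mask(m, rng, low=0.05, high=0.40):
    """Choose between 5% and 40% of the m part vectors to mask (at least one)."""
    ratio = rng.uniform(low, high)
    n_masked = max(1, int(round(ratio * m)))
    mask = np.zeros(m, dtype=bool)
    mask[rng.choice(m, size=n_masked, replace=False)] = True
    return mask

def refinement_loss(z, z_pred, mask):
    """Mean distance between target and prediction over the masked parts only."""
    per_part = np.linalg.norm(z - z_pred, axis=1)  # Euclidean norm: an assumption
    return np.sum(mask * per_part) / mask.sum()

rng = np.random.default_rng(0)
m, d_model = 16, 512
z = rng.normal(size=(m, d_model))          # stand-in for ground-truth latent parts
mask = sample_mask(m, rng)
z_in = np.where(mask[:, None], 0.0, z)     # masked vectors are zeroed at the input

# A network that simply copied its input would be penalized on the masked parts only.
loss_copy_input = refinement_loss(z, z_in, mask)
loss_perfect = refinement_loss(z, z, mask)
```

Because unmasked entries never contribute to the loss, the network is free to pass them through unchanged and is trained purely on inferring the missing parts from their context.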

Results
We show our shape generation and editing results, with quantitative evaluation and insights into retrieval, completion, ablation, and limitations. We trained two single-class SENS networks over chairs and airplanes and a multi-class network that was trained jointly over chairs, airplanes and lamps. The latent space of the pre-trained SPAGHETTI model consists of m = 16 and m = 32 parts with dimensions $d_{model}$ = 512 and $d_{model}$ = 768 for the single-class and multi-class networks respectively. We will publish our sketches dataset, code, pre-trained models and user interface upon acceptance.

Generation comparison
SENS can generate a shape from a single input sketch. As we trained our neural network on a combination of outline renderings, abstract sketches and expert freehand sketches (Fig. 3), we are able to produce sensible outputs from sketches of diverse styles. Fig. 13, Fig. 14, and our supplementary material show some examples of our sketch-based generation.
We compare SENS in Fig. 6 with three single-view image-to-shape methods, namely Pixel2Mesh [WZL * 18], Sketch2Mesh [GRYF21] and DeepSketch [ZGZS22]. Pixel2Mesh is a generic, non-sketch-specific, image-to-shape method. Though able to reconstruct a shape that matches the outline of the input, the result is less aesthetically pleasing. While DeepSketch and Sketch2Mesh are targeted towards sketch-to-shape applications, their methods struggle to produce reasonable output from abstract sketches. DeepSketch is trained on synthetic shapes [ZGZS20] and expert freehand sketches [ZQG * 21], and even though Sketch2Mesh is trained on several sketch styles, the external contours of the input sketches remain the same. We do not use the refinement of Sketch2Mesh because it requires additional camera view parameters.
We also compare with ShapeMVD [LGK * 17], a multi-view reconstruction method, in Fig. 7, using input sketches from their own test dataset. The inputs to ShapeMVD are two orthogonal views that are precisely aligned. Their method predicts the depth map and normal map to output a point cloud from which a mesh is extracted using screened Poisson Surface Reconstruction [KH13]. Because the additional view reduces the ambiguity, their method is able to generate shapes that are more accurate to the input, but which seem to be more prone to artifacts. We noticed that ShapeMVD failed at shape generation from abstract sketches, thereby raising the level of skill required to use it.

Evaluation
For an objective evaluation, we ran Pixel2Mesh, Sketch2Mesh, DeepSketch and SENS on the AmateurSketch dataset [QGS * 21], which contains 3000 freehand sketches of ShapeNet chairs of medium abstraction level. We then computed the Chamfer distance (CD), the Earth Mover's distance (EMD) and the shading-image-based Fréchet Inception distance (FID) [HRU * 18, PZZ22, ZLWT22]. Our results are reported in Table 2, and we refer the reader to our supplementary material for more details about the used metrics. Note that SENS performs better in all the metrics referenced here.
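As a reference for the first of these metrics, the following is a brute-force sketch of a symmetric Chamfer distance between two point sets sampled from shape surfaces. Several conventions exist (squared or unsquared distances, sum or mean of the two directions); the variant below, with mean squared nearest-neighbor distances summed over both directions, is an assumption and may differ from the exact definition used in the supplementary material.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (n, 3) and b (k, 3).

    Uses mean squared nearest-neighbor distances in both directions.
    Memory is O(n * k), so this is only suitable for small point sets.
    """
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Usage: b contains both points of a plus one extra point at squared distance 1.
a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
cd = chamfer_distance(a, b)
```

For point sets with many thousands of samples, the pairwise matrix becomes too large and a spatial index such as a k-d tree is the usual replacement for the nearest-neighbor search.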
As an additional perceptual evaluation, we conducted a user study. We randomly sampled 24 sketches from the AmateurSketch [QGS * 21] dataset on which we applied the methods we compare with. Users were asked to rank the four chairs for how realistic and how similar to the input sketch they are. Table 3 shows the results for both questions in separate columns. 54 people took part in our user study. Note that SENS consistently ranks highest both in terms of realism and similarity. More details are to be found in the supplementary material.

Shape completion
As explained in Sec. 3.3, our network predicts latent codes $z \in \mathbb{R}^{m \times d_{model}}$ and a continuous score $c \in [0, 1]^m$, where $c_i$ indicates the probability that the ith component of z is represented in the sketch. While the use of partial outline rendering allows our training to disentangle the different parts of the input sketch, the prediction of the mask c is useful to determine the confidence of the network in the reconstruction of each part. Because SENS reconstructs a shape from a single viewpoint, it often has to reconstruct parts of the shape that are not depicted in the input sketch. We show in Fig. 5 several examples of completion. The part i of a shape is said to be completed if the mask probability $c_i$ is below a certain threshold, here set to 0.01. Completed parts are displayed in orange.
Figure caption (partial): Note that these sketches were not part of the dataset. We also show the top-4 retrieval: we first remesh the output shapes of SPAGHETTI [HSG18] that were used for training SENS, and compute the Chamfer Distance over 100,000 sampled points over the surface. We display the output shape of SPAGHETTI. The order is left to right, top to bottom.
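The completion criterion amounts to thresholding the predicted presence scores. A minimal sketch follows; the function name `completed_parts` and the example scores are made up for illustration, while the 0.01 threshold is the value stated in the text.

```python
import numpy as np

def completed_parts(c_hat, threshold=0.01):
    """Indices of parts whose predicted presence score falls below the threshold.

    These are parts the network considers absent from the sketch, and which it
    therefore had to complete on its own.
    """
    return np.flatnonzero(np.asarray(c_hat) < threshold)

# Usage with made-up scores for a four-part shape.
scores = [0.90, 0.005, 0.50, 0.0]
completed = completed_parts(scores)
```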

Shape retrieval
It is crucial for shape-generation techniques to address the retrieval problem. This means that a method should be able to generate a desired shape based on a given sketch, and not just retrieve a shape from the training dataset that approximates a reasonable result. In Fig. 6, we provide evidence that SENS does not merely retrieve shapes. The main enabler for this is the part-aware property of SENS as it is trained to produce disentangled part vectors that are combined to generate the whole shape. For instance, while the first and second output shapes share similar legs as their respective first retrievals, they exhibit significant differences in the back area. The rounded back of the third chair is not present in the top-4 shape retrieval results. While the fourth shape has an identical structure to its top retrieval, the back, seat, and legs' lengths vary.

Editing
The ability to generate 3D shapes from sketches can simplify 3D modeling. Yet, a user may desire to edit the generated shape, which is a complex task. One major advantage of SENS is the ability to easily edit shapes through sketching (Fig. 1). We implemented a user interface using the Visualization Toolkit (VTK) [SMLK06] featuring a drawing canvas and a viewer that displays the generated shape after its conversion to a mesh via marching cubes [LC87]. We present a live demonstration of the editing possibilities in a video attached to the supplementary material.

Outline rendering
Our interface proposes an outline rendering method of the displayed shape, enabling users to perform direct modifications on the drawing canvas. The pipeline is illustrated in Fig. 8: after an initial drawing (i) generates a starting shape (ii), the shape is rendered as a depth map (iii), which is then smoothed via a Gaussian filter. Edges are then extracted using the Canny edge detection method [Can86]. Consequently, the outline aligns with the shape's orientation on the screen (iv). As a result, our interface allows users to first create an abstract sketch of a chair, generate its outline, and then directly edit the outline (v) for further shape generation (vi). This simplification of the 3D modeling process greatly reduces the demand for advanced sketching skills.
Figure 8: Our outline rendering pipeline. An initial drawing (i) serves for shape generation (ii). We render its depth map (iii), which is in turn used for edge extraction (iv). The outline can be modified (v) and used as an input for further shape generation (vi).
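The depth-map-to-outline step can be illustrated with the NumPy-only sketch below. Note the substitution: instead of the full Canny detector used in the paper, it thresholds the gradient magnitude of the Gaussian-smoothed depth map, which corresponds to Canny's first stages without non-maximum suppression or hysteresis, so the resulting outlines are thicker. All function names, `sigma` and `threshold` are assumptions chosen for the example.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at three standard deviations."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def smooth(depth, sigma=1.0):
    """Separable Gaussian blur with edge padding; output has the input's shape."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(depth, pad, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def outline(depth, sigma=1.0, threshold=0.1):
    """Boolean edge map: large gradient magnitude of the smoothed depth map.

    Simplified stand-in for Canny (no non-maximum suppression, no hysteresis).
    """
    gy, gx = np.gradient(smooth(depth.astype(float), sigma))
    return np.hypot(gx, gy) > threshold

# Usage: a near square (depth 0) in front of a far background (depth 1).
depth = np.ones((32, 32))
depth[8:24, 8:24] = 0.0
edges = outline(depth)
```

In an interactive tool, a dedicated image-processing library would normally supply both the Gaussian filter and the Canny detector; the point of the sketch is only the order of operations: render depth, smooth, then extract edges.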

Refinement via part reconstruction
Because SPAGHETTI is a part-aware shape decoder, it is possible to select parts of the latent code and use a refinement network to regenerate them based on the unselected parts, as described in Sec. 3.4. The selection is illustrated in Fig. 4 and operates as follows: first, the user employs a freehand lasso selection on the screen (ii). Then, our interface detects which faces of the mesh are picked by the lasso selection. The parts of the latent code that encode the generation of these picked faces are then labeled as "selected". Once a part is selected, we display in orange all the faces that are generated by this part (iii), not only the originally picked faces. Our interface then masks the selected parts and feeds the latent code to the refinement network, which generates new parts of the latent code to replace the selected ones (iv). While the refinement network was initially trained to reconstruct 5% to 40% of masked latent vectors, there are no practical constraints on the number of vector components that can be masked for refinement. The refinement strategy can be particularly useful for removing artifacts from the generated shape, as exemplified in Fig. 4 and in the supplementary video.
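The step from picked faces to selected latent parts can be sketched as a simple lookup, assuming that every mesh face carries the index of the latent part that generated it. The per-face label array `face_part` and the function `select_parts` are hypothetical names introduced for this illustration; the actual interface may attribute faces to parts differently.

```python
import numpy as np

def select_parts(face_part, picked_faces):
    """Map lasso-picked faces to latent parts.

    face_part    : (n_faces,) index of the latent part that generated each face
    picked_faces : indices of the faces inside the lasso
    Returns the selected part indices and a boolean mask over ALL faces that
    belong to a selected part (the faces to highlight).
    """
    selected = np.unique(face_part[picked_faces])
    highlight = np.isin(face_part, selected)
    return selected, highlight

# Usage: seven faces generated by three parts; the lasso touches faces 2 and 5.
face_part = np.array([0, 0, 1, 1, 2, 2, 2])
selected, highlight = select_parts(face_part, picked_faces=[2, 5])
```

Picking a single face of a part is enough to select the whole part, which is why the highlighted region can be larger than the lasso.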

Part-based modeling
The use of a part-aware shape decoder also enables local modifications to the generated shapes. Indeed, SENS accepts a sketch as input and produces a corresponding latent code that can be broken down into several parts. However, these latent parts can originate from different input sketches, hence allowing the fusion of features from distinct shapes. We provide an illustration of part-based modeling in Fig. 9. The initial drawing (i) generates a latent code that, when decoded, yields a shape (ii). The user can select parts of the latent code, illustrated in orange on the output shape (iii). Drawing another sketch (iv) generates a new latent code, that if decoded by SPAGHETTI, would yield a completely different shape (v). Instead of replacing the entire latent code, only the selected parts are replaced, hence producing a new shape that blends features from both original shapes (vi). In this example, the resulting chair combines the base of the first chair with the backrest of the second chair. This technique represents a substantial improvement over traditional sketch-to-shape methods in significantly extending the modeling flexibility and generation capabilities, going beyond the dataset's inherent limitations. Note that our part-based modeling method can be used with sketches of different abstraction levels, which strengthens its flexibility.
Figure 9: Part-based modeling example. The input sketch (i) is fed to our network to generate a shape (ii). The user can select parts of the resulting shape (iii). Given another sketch (iv), SENS would generate a completely different shape (v). But using part-based modeling, our interface will only replace the selected parts (vi).
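At the level of the latent code, the part replacement described above reduces to overwriting the selected rows of one latent matrix with the matching rows of another. The sketch below assumes that part index i in both latent codes refers to comparable regions of the two shapes; the function name `blend_latents` and the toy values are illustrative only.

```python
import numpy as np

def blend_latents(z_first, z_second, selected):
    """Replace the selected part vectors of z_first with those of z_second.

    z_first, z_second : (m, d) latent codes predicted from two different sketches
    selected          : indices of the parts chosen by the user
    The inputs are left untouched; a new latent code is returned.
    """
    z_out = z_first.copy()
    z_out[selected] = z_second[selected]
    return z_out

# Usage with toy latent codes: four parts of dimension two.
z_first = np.zeros((4, 2))    # stand-in for the latent code of the first sketch
z_second = np.ones((4, 2))    # stand-in for the latent code of the second sketch
z_blend = blend_latents(z_first, z_second, selected=[1, 3])
```

The blended code would then be passed to the shape decoder in place of either original code, so that the unselected parts of the first shape are preserved.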

Evaluation
To evaluate the usability of our method's editing capabilities, we carried out a user study with 8 participants from diverse backgrounds, possessing varying levels of modeling and sketching expertise. During this session, participants were tasked with two assignments: firstly, creating any chair design, ensuring they utilized all available editing tools to familiarize themselves with our system; and secondly, modeling three distinct shapes based on provided images. After the modeling session, participants completed two questionnaires to gauge the system usability and the efforts required to use it. We show the results in our supplementary material, where we detail the questionnaire outcomes and showcase a range of shapes crafted during the study. Feedback from participants was largely positive; they found the system intuitive and user-friendly, expressing satisfaction with their outputs.

Ablation studies
To analyze the relevance of different components of SENS, we provide an ablation study. Visual results are displayed in Fig. 10 on inputs presented with increasing levels of abstraction from top to bottom. For quantitative evaluations, refer to Table 4. To provide a fair comparison, no model in our ablation study has been trained on the ProSketch dataset, and all networks were trained for 40 hours. First, we trained the same network by removing the mask loss $\mathcal{L}_{cls}$ and the partial loss $\mathcal{L}_{part}$, both explained in Sec. 3.3 and referred to as "ablation partial loss". We claim that these losses improve the part disentanglement, hence allowing SENS to produce shapes that are less prone to mere shape retrieval. This is particularly visible in the chairs' handles that are not present or not connected to the seat in the original drawing. Yet, they are visible in the output shape. The quantitative comparison supports our analysis. The metrics indicate that eliminating the partial loss significantly decreases the distance between the shapes in the dataset and those generated, indicating a tendency towards retrieval. Second, we trained SENS without using abstract sketches, referred to as "ablation dataset". It clearly appears that the more abstract the input sketch, the more the obtained result decreases in quality, notably with some parts being absent from the output shape. The quantitative metrics further demonstrate that incorporating sketches of varying abstraction levels enhances our method's adaptability to different input sketch styles. This is evidenced by the weaker performance on our metrics by the version of our method with dataset ablation.
Figure 10: We present our ablation study on three different input styles, namely a shape outline, a drawing and an abstract sketch. Columns: input sketch, ablation partial loss, ablation dataset, ours (full). "ablation partial loss" means that the network did not train with partial loss; "ablation dataset" means that the network did not train with abstract sketches.

Multi-class reconstruction
Until now, we had conditioned SENS on a specific class of shapes. We demonstrate here that it is possible to condition SENS on multiple classes at the same time. To account for the greater shape diversity, our multi-class shape generator relies on a higher number of Gaussians and the latent representation has a higher dimension. Fig. 11 compares the results of the multi-class network to the single-class network of the respective category. We observe that the multi-class version successfully produces shapes that correspond to the right category. Compared to the single-class version, the output shapes are slightly less accurate, especially with sharp features. This can be observed in the chair and airplane sketches. Note though that for lamps, we get better results with the multi-class network. As lamps have a smaller training set, the multi-class network exhibits better generalization than the single-class network as it has access to more data.

Limitations
Single-view sketch-to-shape reconstruction is a challenging problem as it requires overcoming unavoidable ambiguities. SENS tackles this by conditioning the network on a limited number of classes. Yet, it might struggle to produce a shape that corresponds to the input sketch if it cannot resolve these ambiguities. Fig. 12 shows such limitations. Fig. 12(i) shows that SENS might omit, deform, or add details that were not required by the user. This problem also appears in the airplane's tail in Fig. 7. Yet, Fig. 5 demonstrates that tackling this ambiguity can benefit the consistency of the result. This often comes down to a trade-off between being close to the input or producing a coherent shape. Fig. 12(ii) shows that although the sketch may be drawn with precision, the final shape may not include the high-frequency details or patterns depicted in the sketch. Such challenges can be attributed to our method's handling of sketches with varying abstraction levels, inherent limitations in the SPAGHETTI shape decoder's detail rendering capabilities, and the absence of view parameters to guide the generation process; factors that collectively impact the method's ability to deal with intricate details. As we condition SENS to limited classes of shapes, the output is restricted to an object of such a class, even when the input sketch is unrelated. Fig. 12(iii) presents a direct example. Also, the stool sketch in the middle row of Fig. 13 is not correctly mapped. Multi-class SENS is subject to misinterpretation of the shape category, as exemplified in the top right corner of Fig. 14. Finally, SENS inherits some of SPAGHETTI's limitations, such as the necessity of training on a limited number of shape classes with similar structures, similar artifacts and lack of fine detail in the generated shapes, and potential under- or over-clustering of parts within the same Gaussian, which restricts the desired level of control permitted by our selection tool for refinement or part-based modeling.

Conclusion
In this paper, we present SENS, a method for generating neural implicit shapes through sketching. The key concept of our approach is mapping different parts of the input sketch to a part-aware latent space. Each part of the latent code is consistently mapped to a different part of the generated shape. Our part-aware reconstruction approach allows the network to integrate the relationships between different parts of the object, resulting in 3D models that are less prone to mere shape retrieval from the training dataset. In addition, we offer part-based shape modeling, where users can select a part of a shape and redraw its corresponding sketch. This allows for even more precise model editing, and enables users to combine features from different shapes, thus expanding the scope of what can be modeled beyond the dataset's inherent limitations. Another implication of a part-aware latent space is the possibility to refine specific parts of the shape, hence allowing systematic artifact removal in the final model. Recent developments in generative diffusion-based models have shown promise for sketch-to-shape modeling, as highlighted in works like [ZPW * 23]. These models, when combined with part-aware shape decoders [BKD * 23], offer new potential for advancing the field. This integration not only enhances current methodologies but also paves the way for innovative research directions in sketch-based shape generation.
Another key contribution of our method is the ability to generate shapes from a single sketch at various levels of abstraction. Moreover, we can edit their outline directly through sketching, reducing the need for advanced artistic skills in the modeling process. We have shown through our experiments and comparisons with prior shape generation methods that SENS generates models with a higher level of detail and realism while requiring less drawing expertise. We believe that our method provides a powerful tool for creating 3D models, offering both ease of use and high-quality results.

Abstract
We provide more details related to data preparation, implementation, training and evaluation of our method.

Network Architecture
The network is composed of three parts: a Vision Transformer encoder, a Transformer decoder and an implicit shape decoder (SPAGHETTI). The Transformer decoder also takes as input m learnable part queries of dimension 1.5 h_d that are optimized simultaneously with the weights of the network. It is composed of 12 cross-attention layers and feed-forward networks with layer normalization. The output of the Transformer decoder is then mapped to the latent code z_h of the shape decoder latent space via an MLP with ReLU activation.

Training
Single-class models are trained on an Nvidia RTX 3090 GPU for 850 epochs. We use a gradual warmup scheduler [?] to linearly increase the learning rate at each epoch. The learning rate starts at 10^-7 and linearly increases to 10^-6. Our approach to training the multi-class model was based on a combined dataset from various classes, namely chairs, planes, and lamps. We include ShapeNet outline and partial outline renderings, as well as CLIPasso

Evaluation
Our evaluation is performed on the AmateurSketch dataset [QGS * 21], which contains 3000 freehand sketches of ShapeNet shapes [CFG * 15] of medium abstraction level. We only compare on the chair class, because it is the only class supported by all the methods we compare with.
For each sketch in the AmateurSketch dataset, we extract a mesh from the implicit shape produced by our network. Then, we sample 100,000 points on the surface of our output and on the reference mesh, and compute the chamfer distance between the two resulting point clouds using the Point Cloud Utils library [?].
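To make the chamfer computation concrete, here is a minimal pure-Python sketch. It assumes the symmetric form that sums the two directional averages of unsquared nearest-neighbor distances; the normalization used by the library may differ, and the function name `chamfer_distance` is ours. The brute-force search is quadratic, so it is only meant for tiny point sets, not for 100,000 samples.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance between two non-empty point sets.

    Assumed convention: mean nearest-neighbor distance from A to B
    plus mean nearest-neighbor distance from B to A (unsquared).
    """
    def mean_nearest(src, dst):
        # For every point in src, find the closest point in dst (brute force).
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


# Toy data: a unit square and the same square lifted by 0.5 along z.
square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
lifted = [(x, y, z + 0.5) for (x, y, z) in square]

print(chamfer_distance(square, square))
print(chamfer_distance(square, lifted))
```

For real point clouds, a spatial index (for example a k-d tree) would replace the inner brute-force search.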

Earth mover's distance (EMD)
The earth mover's distance is a measure of dissimilarity between two probability distributions or point sets, and is often described as the minimum cost to transform one distribution into the other. The EMD between two point sets A = {a_1, ..., a_n} and B = {b_1, ..., b_m} can be formally defined as:
$$\mathrm{EMD}(A, B) = \min_{\pi \in \Pi(A, B)} \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{i,j} \, \lVert a_i - b_j \rVert_2$$
where π is a correspondence between A and B, i.e. Π(A, B) is the set of n × m matrices whose rows and columns sum to one, and π_{i,j} ∈ [0, 1] is the coefficient indicating how much points a_i and b_j correspond to each other. Due to the computational complexity of the EMD, we sample 1000 points on both meshes. We also use the Point Cloud Utils library [?] for the computation of the EMD.
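As an illustration of the correspondence-based definition, the sketch below computes the EMD exactly for two tiny point sets of equal size. It relies on an assumption worth stating: when both sets have the same number of points and every row and column of π sums to one, an optimal π can always be found among permutation matrices, so enumerating one-to-one matchings is sufficient. The function name `emd_equal_size` is hypothetical, and the factorial cost makes this unusable beyond a handful of points; it is not how a point-cloud library computes the metric.

```python
import itertools
import math


def emd_equal_size(points_a, points_b):
    """Exact EMD for two equal-size point sets, unit mass per point.

    Brute force over all one-to-one matchings (permutations).
    """
    if len(points_a) != len(points_b):
        raise ValueError("this sketch only handles equal-size point sets")
    n = len(points_a)
    # Pairwise Euclidean distances: cost[i][j] = ||a_i - b_j||.
    cost = [[math.dist(a, b) for b in points_b] for a in points_a]
    # Minimum total transport cost over all permutation matchings.
    return min(
        sum(cost[i][perm[i]] for i in range(n))
        for perm in itertools.permutations(range(n))
    )


# Toy data: three points on a line and the same points shifted up by 1,
# listed in a different order.
line = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
shifted = [(2.0, 1.0), (0.0, 1.0), (1.0, 1.0)]

print(emd_equal_size(line, line))
print(emd_equal_size(line, shifted))
```

Unlike the chamfer distance, this matching is global: each point is used exactly once, which is why the order of the points in `shifted` does not matter.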

Fréchet inception distance (FID)
To take visual perception into consideration, we use the Fréchet inception distance [HRU * 18]. FID evaluates the similarity between two sets of images, generated and real, by computing the Fréchet distance between the Gaussian distributions of their respective features. A lower FID value signifies a greater resemblance between the two image sets. The shading-image-based FID has been described in SDF-StyleGAN [ZLWT22], for which the authors report that it yields relevant results for measuring the plausibility and similarity of two shapes. We sample 20 views and render the shape S_out produced by SENS and the reference shape S_ref.
The features are then extracted from these images via the Inception-V3 network [?], an architecture trained on ImageNet [?], which maps an image to a probability distribution over 1000 classes. From these features, we can extract the mean µ_i and the covariance matrix Σ_i for each image set i ∈ {1, 2}. The formula used to compute the FID is given by:
$$\mathrm{FID} = \lVert \mu_1 - \mu_2 \rVert_2^2 + \mathrm{Tr}\left( \Sigma_1 + \Sigma_2 - 2 \left( \Sigma_1 \Sigma_2 \right)^{1/2} \right)$$
To compute the FID, we use the cleanFID library [PZZ22].

Interpretation
We report the results of our objective evaluation in Table 1. First, we note that Sketch2Mesh [GRYF21] fails to produce a shape in 112 cases when the input is cropped. To provide a fair comparison, we could not use their refinement step, because the camera view parameters are not an input of our method. We report the results for both cropped and padded input sketches, observing that the best-performing variant depends on the metric used. Because the training procedure is available for DeepSketch [ZGZS22], we train this method for our evaluation in two ways: (1) using their default dataset, which includes their synthetic renders and ProSketch [ZQG * 21], and (2) using our training dataset, which consists of our full outline rendering, ProSketch, and abstract CLIPasso [VPB * 22] renders.
We indicate results for both training procedures. The evaluation on the default DeepSketch is done on padded input. Because cropped inputs are used for retraining DeepSketch on our dataset, we crop and center the AmateurSketch input sketches for its evaluation. Pixel2Mesh [WZL * 18] and our method are evaluated with cropped input sketches.
For both geometric and perceptual metrics, SENS performs substantially better than the state of the art. This indicates that SENS is particularly suitable for sketches with different levels of abstraction, and is therefore a relevant approach for allowing people of various drawing skills to attempt sketch-based modeling. Since training DeepSketch on our dataset does not improve its metrics, this additionally indicates that the dataset is not the sole factor that explains the difference in performance between SENS and the state of the art.

Multi-class reconstruction
While LAS-Diffusion [ZPW * 23] is targeted toward a view-aware setting, this sketch-to-shape method can run without camera parameters. Since the authors provide the multi-class pretrained network for this task, we compare multi-class SENS with LAS-Diffusion using the same evaluation metrics as for the single-class comparison. The results are reported in Table 2. We can see that our method performs better than LAS-Diffusion on the AmateurSketch dataset. However, we emphasize that the multi-class LAS-Diffusion has been trained on all the ShapeNet classes, while the training of our method focused on only 3 classes. Moreover, while it is possible to run LAS-Diffusion without input view information, the authors state in their ablation study that using a view-agnostic network tends to yield additional or wrong geometry. Therefore, no definitive conclusion can be drawn from this comparison.
Additionally, when comparing single-class and multi-class SENS, we notice that the metrics give very similar results. This shows that our multi-class setup has good generalization abilities.

Subjective evaluation (user study)
To perform a perceptual evaluation of our work, we conduct a user study. We randomly sample 24 sketches from the AmateurSketch dataset and render the output of SENS, Pixel2Mesh [WZL * 18], Sketch2Mesh [GRYF21] (cropped input), and retrained DeepSketch [ZGZS22]. We show in Fig. 1 the exact format used for the user study. For each sketch, we ask participants to rank the four methods' output in two questions: how realistic and how close to the input sketch the resulting chair looks. For the second question, we align the rendering view of the shape with the same azimuth angle as given by the AmateurSketch dataset. The order of the methods is randomized across the sketches, but the same order is used for both questions for each sketch. We recruit 54 individuals of diverse backgrounds and ages to partake in the user study, including 15 women and 39 men.
Figure 1: The two types of questions asked in our user study. When asking for how realistic the shape looks, the same view is applied for rendering the shapes. When asking for similarity with the input sketch, shapes are rendered with the same azimuth angle as the input sketch. The azimuth angle is provided by the AmateurSketch dataset.
The results are reported in Table 3 and Fig. 2. According to this study, SENS provides the most realistic shape in 87.9% of the cases and the most similar to the input sketch in 94% of the cases. Pixel2Mesh is often deemed to perform the worst, especially in terms of realism. Sketch2Mesh and DeepSketch both seem to perform equally well for both questions and rank second and third with nearly equal scores, as shown by the interquartile range in Table 4. Therefore, our user study is aligned with our objective evaluation.
Table 3: Perceptual evaluation through a user study, highlighting the performance of our method in comparison to Pixel2Mesh [WZL * 18], Sketch2Mesh [GRYF21] and retrained DeepSketch [ZGZS22] in terms of realism and similarity to input sketches. The ranking in each question is from 1 (best) to 4 (worst).
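The median and interquartile range used to summarize the rankings can be computed as in the following sketch. The rank values are made up for illustration and are not taken from the study, and the quartile convention (the default "exclusive" method of Python's `statistics.quantiles`) is an assumption; other conventions give slightly different quartiles on small samples.

```python
import statistics


def median_and_iqr(ranks):
    """Return (median, interquartile range) of a list of ranks."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(ranks, n=4)
    return statistics.median(ranks), q3 - q1


# Hypothetical ranks (1 = best, 4 = worst) given to one method by eight participants.
example_ranks = [1, 1, 2, 1, 3, 1, 2, 1]
median, iqr = median_and_iqr(example_ranks)
print(median, iqr)
```

Two methods with the same median and a similar IQR, as reported for Sketch2Mesh and DeepSketch, are hard to separate on rank data alone.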

Usability study
To evaluate the usability of our sketch-to-shape generation and editing methods, we carried out a usability study, drawing inspiration from the study presented in GA-Sketching [ZLY * 23]. Eight participants from diverse backgrounds participated in the study. Among them, half were aged between 20 and 30, while the rest were above 30. The gender distribution was balanced, with 50% women and 50% men. In terms of 3D modeling experience, 25% reported having no experience, 50% had limited experience, and 25% identified as hobbyists. When it came to 2D sketching or drawing, half the participants had no experience, 25% reported limited experience, and 25% described themselves as hobbyists. Notably, none of the participants were professional 2D illustrators or 3D artists.
The modeling session was divided into two phases. Initially, participants were introduced to the software's operation and its various functionalities, which included sketch-to-shape generation, outline rendering, part-based modeling, and part refinement. Subsequently, participants undertook two tasks. In Task 1, they had the freedom to sketch any chair design; however, they were required to use each of the software's functionalities at least once during the session, ensuring they became familiar with all available options. Task 2 involved modeling three specific shapes provided as reference images. While their sketches did not need to align with the image's perspective, the resulting shapes had to closely resemble the target.
The outcomes from both tasks are depicted in Fig. 3 and Fig. 4. The outcomes of Task 1 underscore the system's resilience and adaptability. Even when participants, some of whom lacked advanced drawing skills, sketched rudimentary or imprecise chair designs, the algorithm consistently produced coherent 3D shapes. Often, only a few additional intuitive modeling steps were needed to refine the shape. Task 2 further demonstrates the system's ability to convert target ideas into concrete 3D models.
Participants were able to transform target images into 3D chairs, even when the sketched perspectives differed from the reference images. This ease of transformation from a 2D reference image to a realistic 3D chair model highlights the system's ability to bring users' visions to life.
After completing the modeling session, participants were invited to complete a feedback form including both the System Usability Scale (SUS) questionnaire [?] and the NASA Task Load Index (NASA-TLX) questionnaire [?]. The SUS questionnaire contains ten questions that evaluate the system's usability and gauge its usefulness, ease of use, and consistency. The NASA-TLX questionnaire is designed to measure task-related effort intensities, such as mental (Q1), physical (Q2), and temporal (Q3) demands, as well as performance (Q4), effort (Q5), and frustration levels (Q6). The results are shown in Fig. 5 and Fig. 6. The exceptionally low SUS scores for Q2 and Q4, combined with elevated scores for Q5 and Q7 and the unanimous score of 1 for Q10, suggest that the editing options are highly intuitive. This observation is further corroborated by the low scores reflected in the NASA-TLX. The marginally subpar scores for Q6 and Q9 appear to align with the loss of very high-frequency details between the sketches and the resulting shape, a limitation we acknowledge in the main paper. However, it is worth noting the significant elevation in the NASA-TLX Q4 score, implying participants' satisfaction with their performance. Participants could readily conceptualize an initial rudimentary shape, even from the most abstract sketches and even with very limited experience.
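For readers unfamiliar with the SUS, the sketch below shows the conventional way of aggregating the ten answers into a single 0 to 100 score, which explains why low answers are desirable on even-numbered questions and high answers on odd-numbered ones. The study above reports per-question means rather than this aggregate, so the snippet is only an illustration; the function name `sus_score` and the sample answers are hypothetical.

```python
def sus_score(responses):
    """Aggregate ten SUS answers (each from 1 to 5) into a 0-100 score.

    Conventional scoring: odd-numbered items contribute (answer - 1),
    even-numbered items contribute (5 - answer); the sum is scaled by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("the SUS questionnaire has exactly ten items")
    total = 0
    for index, answer in enumerate(responses, start=1):
        if not 1 <= answer <= 5:
            raise ValueError("each answer must be between 1 and 5")
        # Odd items are positively worded, even items negatively worded.
        total += (answer - 1) if index % 2 == 1 else (5 - answer)
    return total * 2.5


# Hypothetical answers from one participant (Q1 to Q10).
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))
```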

Additional visual results
In addition to the quantitative and qualitative evaluations, we also provide further visual results.We randomly sample 128 sketches from the AmateurSketch dataset and present the result of SENS in Fig. 7

Figure 2: SENS takes as input a 256 × 256 normalized grayscale sketch. It is partitioned into 16 × 16 patches, and then passed through a Vision Transformer. A transformer decoder is then used to generate the latent variable z ∈ R^(m × d_model), which is a part-aware latent code with m parts represented by latent vectors of dimension d_model. This code conditions the neural implicit representation given by SPAGHETTI, which is used to generate the output shape. Through the part-aware latent space, we get a mapping between sketch and shape parts.
[Figure 3 panel labels: volumetric tracing, outline rendering, partial outline rendering, abstract sketch [VPB * 22], freehand sketch [ZQG * 21]]

Figure 3: We used a variety of sketch styles as inputs to our method. The target shape is an implicit shape rendered via volume rendering, outline rendering and partial outline rendering. The abstract sketch is produced using CLIPasso [VPB * 22] on the volume rendering. Expert freehand sketches come from ProSketch [ZQG * 21].
left) is a Vision Transformer network [DBK * 20]. It divides 256 × 256 sketch images into 16 × 16 patches. Each patch is mapped to a single visual embedding via a transformer encoder.

Figure 5: We exemplify how our network performs shape completion from single-view sketches. If the input sketch does not display the full shape, the network is still able to reconstruct it, notably taking advantage of the symmetry of the class of shapes in the dataset.

Figure 6: We compare our method with state-of-the-art single-view reconstruction methods on sketches of various styles such as an outline, an abstract sketch, a non-expert handmade sketch and an expert freehand sketch from the ProSketch dataset [ZQG * 21]. Note that these sketches were not part of the dataset. We also show the top-4 retrieval: we first remesh the output shapes of SPAGHETTI [HSG18] that were used for training SENS, and compute the chamfer distance over 100,000 points sampled on the surface. We display the output shape of SPAGHETTI. The order is left to right, top to bottom.

Figure 7: We compare SENS with ShapeMVD [LGK * 17], a sketch-to-shape method requiring multi-view input sketches. The pairs of input sketches belong to the ShapeMVD test set. Since SENS relies on a single-view input, we show the results for both input sketches.

Figure 11: We compare our multi-class and single-class sketch-to-shape models. The input of the last column comes from AmateurSketch. The other sketches are produced by us.

Figure 12: While our method makes it quick to obtain a shape from a drawing, it struggles in certain cases. (ii) comes from ProSketch [ZQG * 21] but was not included in the training data.

Figure 13: We showcase our method using sketches of various styles and levels of abstraction. Chairs in the first row are casually drawn or produced via image processing techniques. The second row shows that our method works with sketches drawn by professionals. Images from the last row are front and side views of chairs originating from ShapeMVD [LGK * 17]. We include them here to facilitate comparisons with further works.

Figure 14: Multi-class SENS can produce chairs, planes and lamps out of sketches at diverse abstraction levels. Note that we do not indicate to the network at inference the kind of object we draw. In some cases, this can lead SENS to misinterpret the class of the drawn shape (see top-right).
The three parts of the network are a Vision Transformer encoder, a Transformer decoder and an implicit shape decoder (SPAGHETTI). The Vision Transformer encoder is a "sketch to visual embeddings" Transformer encoder. It takes as input a 256 × 256 grayscale image, decomposes it into 256 patches of size 16 × 16, uses a learnable position encoding, and maps each patch to a visual embedding of dimension h_d = 512. The Vision Transformer itself consists of 8 layers interleaving multi-head attention layers and feed-forward networks with layer normalization [DBK * 20]. Then, we use a Transformer decoder as our "visual embedding to shape latent code" network. It maps the 256 visual embeddings to a latent code composed of m vectors of dimension d_model. Single-class SENS uses m = 16 and d_model = 512, while multi-class SENS uses m = 32 and d_model = 768.
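The sizes quoted for the encoder and decoder can be checked with a small shape-bookkeeping sketch. It contains no learned layers: `split_into_patches` only cuts a nested-list image into 16 × 16 tiles, and `stage_shapes` lists the tensor shapes implied by the text (256 visual embeddings of dimension 512, m part queries of dimension 1.5 h_d, and an m × d_model latent code). The function names and the dictionary layout are ours, chosen for illustration.

```python
# Hyperparameters quoted in the text; the dictionary layout is illustrative.
H_D = 512  # visual embedding dimension h_d
VARIANTS = {
    "single-class": {"m": 16, "d_model": 512},
    "multi-class": {"m": 32, "d_model": 768},
}


def split_into_patches(image, patch_size=16):
    """Cut a square image (list of rows) into non-overlapping patches, row-major."""
    size = len(image)
    if any(len(row) != size for row in image) or size % patch_size != 0:
        raise ValueError("expected a square image whose side is a multiple of patch_size")
    return [
        [row[left:left + patch_size] for row in image[top:top + patch_size]]
        for top in range(0, size, patch_size)
        for left in range(0, size, patch_size)
    ]


def stage_shapes(variant, image_size=256, patch_size=16):
    """Shapes of the intermediate representations for one SENS variant."""
    cfg = VARIANTS[variant]
    num_patches = (image_size // patch_size) ** 2
    return {
        "visual_embeddings": (num_patches, H_D),        # one embedding per patch
        "part_queries": (cfg["m"], int(1.5 * H_D)),     # learnable decoder queries
        "latent_code": (cfg["m"], cfg["d_model"]),      # input to the shape decoder
    }


sketch = [[0.0] * 256 for _ in range(256)]  # blank 256 x 256 grayscale image
print(len(split_into_patches(sketch)))
print(stage_shapes("single-class"))
print(stage_shapes("multi-class"))
```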
[VPB * 22] abstract sketches, and ProSketch chair sketches [ZQG * 21]. The training was based on 630 epochs, and the training duration for the multi-class model was 96 hours, which is longer than the 60 hours required for the single-class model due to the increased amount of data per epoch. The same learning rate and scheduler were used.
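The warmup schedule shared by both models, a learning rate rising linearly from 10^-7 to 10^-6, can be sketched as a plain function of the epoch index. The number of warmup epochs is not stated in the text, so `warmup_epochs` is a free parameter here, and holding the rate constant after warmup is an assumption; the function name `warmup_lr` is hypothetical.

```python
def warmup_lr(epoch, warmup_epochs, start_lr=1e-7, end_lr=1e-6):
    """Linearly interpolate the learning rate from start_lr to end_lr.

    After `warmup_epochs` the rate is held at end_lr (an assumption).
    """
    if warmup_epochs <= 0:
        raise ValueError("warmup_epochs must be positive")
    # Fraction of the warmup completed, clamped to [0, 1].
    progress = min(max(epoch, 0), warmup_epochs) / warmup_epochs
    return start_lr + progress * (end_lr - start_lr)


# Example with a hypothetical 10-epoch warmup.
for epoch in (0, 5, 10, 20):
    print(epoch, warmup_lr(epoch, warmup_epochs=10))
```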

Figure 2: Results of our user study, displayed as a histogram. The results highlight the performance of our method in comparison to Pixel2Mesh [WZL * 18], Sketch2Mesh [GRYF21], and retrained DeepSketch [ZGZS22] in terms of realism and similarity to input sketches.

Figure 3: Some sketches and shapes from Task 1 of the usability study. The results come from each user (P1 to P8, ordered from left to right, top to bottom). Some sketches (P3, P6, and P8) are edited versions of the outline rendering from previously generated shapes. The displayed shapes are not solely generated from the input sketches, but might have been refined via part reconstruction or part-based modeling.
Median and interquartile range (IQR) of the results of our user study, for both realism and similarity to input sketches.

Figure 4: The three target shape images are displayed in the first column, with four attempts to model them during Task 2 of the usability study. The target shapes are sourced from the public domain.

Figure 5: The mean of SUS scores. The whiskers represent the standard deviation. For odd-numbered questions, higher scores indicate better performance; for even-numbered questions, lower scores are preferable.

Figure 7: We randomly sample sketches from the AmateurSketch dataset and showcase the results of our method.

Figure 8: We randomly sample sketches from the AmateurSketch dataset and showcase the results of our method.

Figure 9: We randomly sample sketches from the AmateurSketch dataset and showcase the results of our method.

Table 1: Comparison of sketch-based shape generation methods.

Table 3: Perceptual evaluation through a user study, highlighting the performance of our method in comparison to Pixel2Mesh [WZL * 18], Sketch2Mesh [GRYF21], and retrained DeepSketch [ZGZS22] in terms of realism and similarity to input sketches (1 is best rank).

Table 4: Performance comparison of ablated methods on the Ama-
© 2024 The Authors. Computer Graphics Forum published by Eurographics and John Wiley & Sons Ltd.

Table 1: Performance comparison of shape reconstruction methods on the AmateurSketch dataset [QGS * 21] using chamfer distance
© 2024 The Authors. Computer Graphics Forum published by Eurographics - The European Association for Computer Graphics and John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. arXiv:2306.06088v2 [cs.GR] 21 Feb 2024
The chamfer distance calculates the average distance from each point in one set to its closest point in the other set and is an intuitive way to quantify the dissimilarity between two point clouds. It is thus widely used for geometric comparison. The chamfer distance between two point sets A and B can be defined as follows:
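One common form that matches this description is given below. It is stated as an assumption: the normalization (a sum of two directional averages rather than, say, half of it) and the use of unsquared Euclidean distances are choices that vary between papers and libraries.

```latex
\mathrm{CD}(A, B) =
  \frac{1}{|A|} \sum_{a \in A} \min_{b \in B} \lVert a - b \rVert_2
  + \frac{1}{|B|} \sum_{b \in B} \min_{a \in A} \lVert a - b \rVert_2
```

The first term averages, over the points of A, the distance to the nearest point of B; the second term does the same from B to A, which makes the measure symmetric.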