A Hierarchical Architecture for Neural Materials

Neural reflectance models are capable of reproducing the spatially-varying appearance of many real-world materials at different scales. Unfortunately, existing techniques such as NeuMIP have difficulties handling materials with strong shadowing effects or detailed specular highlights. In this paper, we introduce a neural appearance model that offers a new level of accuracy. Central to our model is an inception-based core network structure that captures material appearance at multiple scales using parallel-operating kernels and ensures multi-stage features through specialized convolution layers. Furthermore, we encode the inputs into frequency space, introduce a gradient-based loss, and apply it adaptively as training progresses. We demonstrate the effectiveness of our method using a variety of synthetic and real examples.


Introduction
Modeling the appearance of real-world materials in a physically faithful fashion is crucial for predictive rendering. This, however, is a challenging task: many materials comprise complex fine-grained geometries that largely drive their macro-scale appearances. Traditionally, material reflectance is specified using spatially varying BRDFs (SVBRDFs) or Bidirectional Texture Functions (BTFs). While these models work adequately for many applications, they are typically limited to one physical scale (or resolution). Further, SVBRDFs have difficulty handling parallax effects, while BTFs [DvGNK99] are highly data-intensive.
To address these limitations, several appearance models utilizing neural representations [KMX*21; KWM*22; FWH*22] have been introduced recently. Although some of these methods, such as NeuMIP [KMX*21], can learn the appearances of complex materials at varying scales, their physical accuracy can degrade drastically for materials exhibiting complex shadows or specular highlights.
In this paper, we address this problem by introducing a neural appearance model with a new hierarchical architecture. As shown in Fig. 1, our model has the generality to accurately capture complex directionally dependent effects, including shadows and highlights, at multiple physical scales.
Concretely, we make the following contributions:

• We propose a new framework that improves the ability of neural materials to capture highly glossy appearance, self-shadowing, and sharp highlights by introducing a new hierarchical architecture and an input-encoding step that maps the training inputs into a higher-dimensional space (§3.1).

• For better robustness, we also introduce new losses that allow our model to better capture both high- and low-frequency effects (§3.2).
We demonstrate the effectiveness of our technique by comparing it to the original NeuMIP [KMX*21], as shown in an example in Fig. 1. In practice, similar to NeuMIP, our neural reflectance model can be integrated into most rasterization- and ray-tracing-based rendering systems.

Related Work
Neural rendering has emerged as a promising approach for a wide variety of applications, including material rendering, texture synthesis, and view synthesis. In this section, we review the most recent and relevant work in the area of neural rendering, focusing on techniques for material rendering and displacement mapping.
Displacement mapping is a powerful technique for augmenting material complexity on surface geometries, yielding convincing parallax, silhouette, and shadowing effects. However, it imposes a considerable demand on computational resources. Conventional ray-tracing-based renderers typically implement displacement by tessellating the base geometry, a process that requires significant storage and computation [TBS*21].
Bidirectional Texture Functions (BTFs), first proposed by Dana et al. [DvGNK99], have been employed to represent arbitrary reflective surface appearances. Storing a discretized 6D function incurs substantial costs, and a multitude of compression techniques have been scrutinized [HFM10]. Rainer et al. [RJGW19] introduced a neural architecture based on an autoencoder framework for compressing BTF slices per texel, and later advanced this work by integrating diverse materials into a shared latent space [RGJW20].
NeuMIP [KMX*21; KWM*22], an innovative neural approach, renders and represents materials across disparate scales efficiently. Despite its advantages, NeuMIP is constrained by its network architecture and design, struggling to reproduce the high-frequency information inherent in materials. Furthermore, it fails to accommodate curved surfaces. A more recent endeavor [KWM*22] aimed to overcome these shortcomings by incorporating surface curvature and transparency information into the neural model; yet capturing high-frequency materials remains a formidable challenge. In this paper, we compare to NeuMIP [KMX*21] rather than the curved-surfaces variant [KWM*22] because our contribution is clearly visible in flat samples and our technique is easily applicable to the newer version as well. In the aforementioned work of Rainer et al. [RJGW19], the decoder takes the latent vector and the incoming and outgoing directions as inputs, and each BTF requires training a separate autoencoder; the follow-up [RGJW20] shares a latent space across different materials. Xu et al. [XWH*23] developed a novel importance-sampling method for neural materials. Gauthier et al. [GFL*22] proposed a technique for mapping normal maps to anisotropic roughness levels. [ZRW*24] recently developed a neural rendering model using transformation layers and an encoder-decoder structure. However, because their model requires additional parameters in the dataset, it is not compatible with existing real BTF datasets. Furthermore, it does not support materials with displacement.

Micro-geometry appearance models grapple with the granular details of a material and provide high-fidelity rendering results. The realistic rendering of fabrics, for instance, continues to be an elusive goal despite substantial efforts [KSZ*16; MXF*21]. More recently, Montazeri et al. [MGZJ20a] introduced an efficient and unified shading model for woven and subsequently knit [MGZJ20b] fabrics, though these models do not address multi-resolution rendering. In this study, we exploit their model to generate our fabric samples for training data.

Our Method
In this section, we describe our neural method for modeling the appearances of complex materials exhibiting effects, such as shadows and specular highlights, that cannot be accurately handled by previous neural models. In what follows, we first detail in §3.1 our network design, which is crucial for better accuracy. Then, we explain in §3.2 our optimization strategies that further improve the accuracy of the training process.

Hierarchical Network Architecture
Overview of NeuMIP. The input to NeuMIP is a 7D parameter set comprising the position u, the incoming and outgoing directions ω i and ωo, and the kernel size for prefiltering. Its pipeline consists of three main stages: (i) update the position u to compensate for the micro-geometry using a neural offset module; (ii) query a neural texture pyramid at the updated position to handle different levels of detail; and (iii) pass the queried feature vector to a decoder network to obtain the reflectance value.
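The three stages above can be sketched as follows. This is a toy illustration of the data flow only: the three functions are stand-ins we wrote for this sketch (the real components are learned networks), and all names and constants are ours, not NeuMIP's.

```python
import numpy as np

# Toy stand-ins for the three learned components (the real ones are networks).
def neural_offset(u, wi):
    # (i) shift the UV position slightly along the view to mimic parallax
    return u + 0.01 * wi[:2]

def query_pyramid(u, sigma):
    # (ii) look up a feature vector for this position and prefilter kernel size
    return np.concatenate([u, [sigma]])

def decoder(feat, wi, wo):
    # (iii) map the feature and directions to an RGB reflectance value
    return np.clip(feat * np.dot(wi, wo), 0.0, 1.0)[:3]

def shade(u, wi, wo, sigma):
    u_prime = neural_offset(u, wi)        # stage (i)
    feat = query_pyramid(u_prime, sigma)  # stage (ii)
    return decoder(feat, wi, wo)          # stage (iii)

rgb = shade(np.array([0.5, 0.5]), np.array([0.0, 0.0, 1.0]),
            np.array([0.0, 0.0, 1.0]), 0.1)
```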
Our method uses the same three-stage approach as NeuMIP but with several fundamental differences: we replace the NeuMIP decoder with an Inception-module-based decoder, while the latent texture pyramid remains the same. This is shown in Fig. 2 and detailed as follows.

Inception module:
To better capture high-frequency effects such as detailed highlights or shading variations, we use an Inception module (instead of the MLP layers used in NeuMIP).
Inception modules [SLJ*15] are specialized network blocks designed to approximate an optimal local structure of a convolutional network. They allow multiple filter sizes instead of a single fixed one. Networks leveraging Inception modules [SLJ*15] have been demonstrated capable of accurately preserving image features at both micro and macro scales.
Our core network architecture, which is predominantly based on Inception modules, is shown in Fig. 3. The two 1 × 1 convolution layers at either end serve the purpose of fully-connected layers and adjust the input and output sizes. Central to this design are the four-layer Inception modules that capture image features at four different scales using four kernel sizes operating in parallel. The convolutions in the Inception module take all 25 channels as input. Each Inception module block consists of four parallel pathways. The first three pathways employ convolutional layers with window sizes of 1 × 1, 3 × 3, and 5 × 5, respectively, to extract information at various spatial scales. The middle two pathways first apply a 1 × 1 convolution to the input, which reduces the number of input channels and decreases model complexity. The fourth pathway uses a 3 × 3 max-pooling layer followed by a 1 × 1 convolutional layer to adjust the number of channels. All four pathways introduce suitable padding to ensure that the input and output dimensions, in terms of height and width, remain consistent.
The output channel count of the Inception modules is 7 + 12 + 3 + 3 = 25, with the output channel ratio of the four pathways, 7 : 12 : 3 : 3, being approximately 2 : 4 : 1 : 1. Every Inception module is structured to accept and produce 25 channels. In our comparative study against fully connected networks of equivalent depth and neuron count, the fully connected networks displayed noticeably lower performance. Moreover, increasing the number of neurons or opting for deeper fully connected networks did not lead to notable enhancements in their ability to capture intricate details.
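A minimal PyTorch sketch of one such Inception block, following our reading of Fig. 3 and the channel counts above (7 + 12 + 3 + 3 = 25); the layer names and the placement of the bottleneck convolutions are our assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Four parallel pathways over a 25-channel feature map (25 in, 25 out)."""
    def __init__(self, channels: int = 25):
        super().__init__()
        # Pathway 1: plain 1x1 convolution (7 output channels).
        self.p1 = nn.Conv2d(channels, 7, kernel_size=1)
        # Pathway 2: 1x1 bottleneck, then 3x3 convolution (12 output channels).
        self.p2 = nn.Sequential(
            nn.Conv2d(channels, 12, kernel_size=1),
            nn.Conv2d(12, 12, kernel_size=3, padding=1),
        )
        # Pathway 3: 1x1 bottleneck, then 5x5 convolution (3 output channels).
        self.p3 = nn.Sequential(
            nn.Conv2d(channels, 3, kernel_size=1),
            nn.Conv2d(3, 3, kernel_size=5, padding=2),
        )
        # Pathway 4: 3x3 max-pooling, then 1x1 convolution (3 output channels).
        self.p4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(channels, 3, kernel_size=1),
        )

    def forward(self, x):
        # Concatenate along the channel axis: 7 + 12 + 3 + 3 = 25 channels.
        return torch.cat([self.p1(x), self.p2(x), self.p3(x), self.p4(x)], dim=1)

block = InceptionBlock()
x = torch.randn(1, 25, 8, 8)
y = block(x)  # padding keeps the spatial size; channel count is preserved
```

Note how the padding in each pathway keeps the height and width unchanged, so blocks can be stacked freely between the two outer 1 × 1 layers.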
Input encodings: To further improve the effectiveness of our method in handling detailed appearances, we adopt the positional encoding originally introduced by NeRF [MST*21]. Specifically, we incorporate high-frequency encoding for the lighting direction ω i and camera direction ωo, along with the texture position u. Rahaman et al. [RBA*19] showed that neural networks are biased toward learning low-frequency functions and perform poorly at representing high-frequency variation. Thus, instead of directly operating on input coordinates as in previous work, we modify the MLP decoder by mapping its inputs to a higher-dimensional space using a Fourier transformation [BB86]. The Fourier transformations are applied to the inputs (ω i , ωo, u), mapping them to the frequency domain.
This mapping significantly improves the ability of the network to reconstruct highlights and capture high-frequency image features, addressing the shortcomings of the original NeuMIP network. Our decoder F is formulated as a composition of two functions, F = F′ ∘ γ, where γ(p) = (sin(2^0 πp), cos(2^0 πp), . . . , sin(2^(L−1) πp), cos(2^(L−1) πp)) and L defines the number of frequency levels. Based on our experiments, we set L to 10 and 4 for γ(u) and γ(ω), respectively.
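The NeRF-style encoding described above can be sketched in a few lines; the vectorized layout (interleaving of sin/cos features) is our choice for illustration:

```python
import numpy as np

def positional_encoding(p: np.ndarray, L: int) -> np.ndarray:
    """Map each coordinate of p to 2L frequency features (NeRF-style)."""
    # Frequencies 2^0 .. 2^(L-1), each scaled by pi.
    freqs = (2.0 ** np.arange(L)) * np.pi
    angles = p[..., None] * freqs                       # shape (..., dim, L)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*p.shape[:-1], -1)             # flatten per sample

u = np.array([[0.25, 0.75]])               # a single 2D texture position
gamma_u = positional_encoding(u, L=10)     # 2 coords * 2 (sin, cos) * 10 = 40 features
```

With L = 10 for the 2D position u this yields 40 features, and L = 4 for each direction yields 16, which together form the decoder's encoded input.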
The contribution of our proposed architecture in capturing fine details is exhibited in the first two columns of Fig. 4.

Enhanced Loss Functions
While our hierarchical network design introduced in §3.1 is crucial for accurately reproducing material appearance at varying scales, training the network using standard image losses (e.g., L1 or L2) may lead to results that still lack details. To address this issue, we propose the following losses for training.
Gradient loss: Inspired by the Canny edge-detection algorithm [Can86], we utilize a gradient loss to encourage the network to better preserve detailed shading variations:

L_G(I, Î) = ‖G_x(I) − G_x(Î)‖ + ‖G_y(I) − G_y(Î)‖,   (1)

where, for any image I, G_x(I) := k_x ∗ I and G_y(I) := k_y ∗ I denote, respectively, the image I convolved with the horizontal and vertical Sobel edge-detection filters [SF73], and Î is the reference image. The gradient loss is used only during training and does not influence network evaluation.
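A small numpy sketch of this loss; the Sobel kernels are standard, but the reduction (mean of absolute gradient differences) is our assumption, since the paper does not state the norm:

```python
import numpy as np

# Standard Sobel edge-detection kernels [SF73].
KX = np.array([[-1.0, 0.0, 1.0],
               [-2.0, 0.0, 2.0],
               [-1.0, 0.0, 1.0]])
KY = KX.T

def _conv_valid(img, k):
    """Correlate a 2D image with a 3x3 kernel over the 'valid' region."""
    h, w = img.shape
    out = np.empty((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return out

def gradient_loss(I, I_ref):
    """Penalize differences between the Sobel gradients of output and reference."""
    dx = _conv_valid(I, KX) - _conv_valid(I_ref, KX)
    dy = _conv_valid(I, KY) - _conv_valid(I_ref, KY)
    return np.mean(np.abs(dx)) + np.mean(np.abs(dy))

flat = np.zeros((8, 8))
edge = np.zeros((8, 8)); edge[:, 4:] = 1.0
loss = gradient_loss(flat, edge)  # nonzero: the output misses the reference's edge
```

Identical images give zero loss; a missing edge in the prediction is penalized even where a plain pixel-wise loss would be small.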
Output remapping: Human perception of an object's luminance is inherently nonlinear [BMKM20]. Neural networks, however, tend to minimize a global loss, treating the same numerical error equally in high- and low-luminance regions, whereas to a human observer an error in a low-luminance region is more pronounced. Building on this insight, we introduce an "output remapping" strategy to assist learning. Once the network predicts (linear) RGB values, the remapping assigns different weights based on luminance. Specifically, the remapping is applied to both the reference and the generated textures before computing the loss during training. The remapped values are then passed through our gradient loss, Eqn. (1), leading to a significant enhancement in the quality of shadows and darker regions, with no adverse effect on the image's overall brightness. Like the gradient loss, the remapping is used only during training and does not influence network evaluation. After rigorous experiments, we found that applying a fourth-root function as the remapping offers the best balance for capturing both low and high frequencies. Our final loss function is formulated as

L = ‖I^(1/4) − Î^(1/4)‖ + L_G(I^(1/4), Î^(1/4)),

where I^(1/4) and Î^(1/4) are obtained by applying a per-pixel fourth root to the output and reference images, and L_G is the gradient loss defined in Eqn. (1).

Dataset and training
Identical to NeuMIP, our neural appearance model takes as input 7D queries (expressing the camera and light directions ωo and ω i , the UV location u, and the prefilter kernel size σ) and outputs a single 3D vector indicating the corresponding RGB reflectance value. For each material, our training data involves a large set of input-output value pairs. To train our model, we minimize the loss discussed in §3.2 using the Adam algorithm.
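A minimal sketch of this training setup: Adam on mini-batches of 7D queries mapped to RGB targets. The tiny MLP is a stand-in for the paper's decoder, and the plain MSE objective, learning rate, and batch sizes here are illustrative, not the paper's values:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in network: 7D query -> 3D RGB (the paper uses the Inception decoder).
model = nn.Sequential(nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 3))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

queries = torch.rand(1024, 7)   # (u, omega_i, omega_o, sigma) per query
targets = torch.rand(1024, 3)   # reference RGB reflectance values

for step in range(50):
    idx = torch.randint(0, queries.shape[0], (256,))   # sample a mini-batch
    pred = model(queries[idx])
    loss = torch.mean((pred - targets[idx]) ** 2)      # stand-in for the Sec. 3.2 loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```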
In practice, we generate our synthetic training datasets (Basket and Twill cloth, Metal ring, Bump) using the KeyShot path-tracing renderer [Lux20]. Specifically, the Metal ring and Bump data are rendered using displaced geometry (expressed using height maps). The Basket and Twill cloth datasets, on the other hand, use state-of-the-art ply-based cloth models [MGZJ20a; MGZJ20b]. Our generated data involves 500 input-output value pairs, and our training process uses mini-batching with a batch size of 30,000.
Additionally, we use two datasets (Victorian cloth and Turtle shell) published with NeuMIP [KMX*21] to evaluate our method. We retrained NeuMIP on the Victorian fabric and Turtle shell datasets with the training parameters tuned for best results.
The training of our model, as well as the original NeuMIP, is performed per material and uses all available training data. Training one material model takes about 90 minutes.

Rendering
Similar to NeuMIP, our neural reflectance model can be integrated into Monte Carlo renderers. The results shown in this paper use an implementation in the Mitsuba 2 rendering engine [NVZJ19], accounting only for direct illumination. At render time, we use material query buffers (storing u, ω i , ωo, and σ) to compute the inputs to our framework, then pass the whole buffer to the GPU to evaluate the queries as a batch. The level of detail is also accounted for in the rendered results based on the camera distance per query. All comparison results use 1 sample per pixel (SPP).

Performance
Our model requires approximately 0.035 seconds to generate a 512 × 512 texture on an NVIDIA V100 GPU, compared to around 0.028 seconds for NeuMIP, a marginal difference. Given the added complexity of the convolutional architecture, the number of parameters increases, and the time needed to evaluate our network is about 25% longer than that of the single-resolution NeuMIP. This increase in evaluation time is a justifiable trade-off for capturing details more accurately.

Figure 8: An assortment of materials rendered with our method on a non-flat surface. Please view the video in our supplemental materials for a light rotating around the scene as well as gradual zooming in to showcase the level of detail.

Results
In what follows, we demonstrate the effectiveness of our method empirically using rendered results. Specifically, we first show ablation studies to justify the necessity of individual components of our method (§5.1). Then, we compare our model to the NeuMIP baseline using a range of materials (§5.2).

Ablation Studies
In Fig. 4, we employ the Metal ring and Basket cloth examples to demonstrate the significance of each component of our proposed method. The ring example uses displacement maps in which the vertical displacement follows a two-dimensional Gaussian function, and the basket cloth is rendered with real micro-geometry. As shown in this figure, our hierarchical architecture enabled by the Inception module enhances the overall expressive capability of the model. Consistently, our input-encoding module and gradient loss prove instrumental in capturing edges and high-frequency features. Furthermore, the remapping strategy aids in the high-quality reconstruction of shadows. When the remapping step is skipped, the back yarn is missing in the fourth column, as marked by the red square. This is due to using the MSE loss, where the network tends to reduce the loss evenly over the whole texture; the same numerical error, however, has different effects in low-luminance and high-luminance areas. The remapping simulates the human eye's response to light power, so the network better learns the importance of each texture value. This is a simple example exhibiting strong self-shadowing and sharp highlights. NeuMIP has difficulty accurately handling these effects, even when using significantly larger network sizes (see Fig. 5) that are much more expensive to train and evaluate than our model.

Evaluation Results
Comparisons with large NeuMIP: To challenge the original NeuMIP framework fairly, we also experimented with larger versions of its network, increasing the number of neurons and layers. As shown in Fig. 5, the self-shadowing and sharp highlights are difficult to capture even with a larger MLP, whereas our hierarchical architecture reproduces these features. Additionally, as the network size increases, the training and query times increase, while the performance of our method stays nearly the same as the original NeuMIP.
Comparisons with previous works: In Figs. 1 and 6, we compare results generated using our method and NeuMIP on a wide range of complex materials (from both our and NeuMIP's datasets). In the Metal ring example, the NeuMIP result misses most shadows and specular highlights. In both the Twill cloth and Victorian fabric samples, NeuMIP has difficulty capturing the sharp highlights correctly. The Metal grid scene showcases high-frequency details that are captured by our hierarchical network architecture but missed by the NeuMIP decoder. We did not compare to the recent variant of NeuMIP [KWM*22] because its code is not published and it requires additional parameters (curvature) as input.
The input to our neural framework is a 7D parameter set that can be obtained from either synthetic datasets or real measured BTFs. We used the leather example from the UBO 2014 dataset [MK14] to exemplify the effectiveness of our model regardless of the input source. As shown in Fig. 6, unlike NeuMIP, our method better captures the fine-grained details as well as the highlights in the leather scene.

Multi-resolution results:
We demonstrate the effectiveness of our method in addressing the different levels of detail of a material in Fig. 7. As expected, the errors become smaller for the coarser levels as we travel down the hierarchical structure. This is due to the natural downsampling effect and the gradual fading of high-frequency details. We refer to the closest view as level 0, and coarser levels are assigned higher indices. The error scores for the different levels are reported in Table 2. In the Basket cloth sample, note that the deeper yarns are missing in NeuMIP, while ours successfully reproduces low-luminance regions as well as high-frequency features such as edges and fiber details. Furthermore, in Fig. 8 we show a non-flat surface shaded using our method to showcase our integration with a renderer. Please view the accompanying video for the gradual change in the level of detail.
Quantitative evaluation: We also measure the numerical error of our neural method compared to the reference. The performance of our method relative to NeuMIP is listed in Table 1 using both the MSE (Mean Squared Error) and the perceptual LPIPS (Learned Perceptual Image Patch Similarity) metrics. These numbers average over the whole dataset, and our method consistently outperforms NeuMIP under the same configuration. In Table 2, we additionally report the scores of our model and NeuMIP at multiple scales, using the different levels of detail from the reference dataset.

Discussion and Conclusion
Limitations and future work: To integrate our neural reflectance models into physics-based Monte Carlo rendering frameworks, efficient importance-sampling techniques for these models need to be developed, which we think is an important problem for future investigation. Our model only captures direct illumination with a single light bounce, and supporting global illumination is another interesting direction for future work. Besides, due to the larger footprint required by the convolution layers, our method is slightly slower than the original method, which could be optimized.
Further, adopting our technique to improve the accuracy of the more recent neural reflectance model [KWM*22] (with better silhouettes) is worth exploring.
Lastly, generalizing our technique to introduce neural BSSRDFs (that can capture subsurface scattering) can be beneficial to many future applications.

Conclusion:
In this paper, we improved the accuracy of NeuMIP [KMX*21] by introducing a new neural representation as well as a training process for this representation. Using neural networks of identical size, compared with NeuMIP, our representation is capable of reproducing detailed specular highlights and shadowing at significantly higher accuracy while better preserving a material's overall color. Additionally, we proposed an optional modification to the decoder architecture that further enhances performance. We demonstrated the effectiveness of our technique by comparing it to NeuMIP (at equal network size) using several examples.
Our model is inspired by NeuMIP [KMX*21] and uses a novel network design. Further, our technique is compatible with most rasterization- and ray-tracing-based rendering systems.

submitted to COMPUTER GRAPHICS Forum (4/2024).

Figure 2 :
Figure 2: Overview of our neural architecture. The inputs are the 2D spatial coordinates (u) and the incident and outgoing directions (ω i and ωo), which are encoded using a Fourier transformation. The encoded u ′ is then updated based on the micro-geometry using the neural texture pyramid [KMX*21]. We use the Inception modules illustrated in Fig. 3 to decode the color output. Lastly, we employ a remapping to optimize the final output color (R ′ , G ′ , B ′ ).

Figure 3 :
Figure 3: Our decoder architecture incorporates four convolutional layers, unlike NeuMIP, which uses MLPs. The main distinction lies in the central incorporation of two Inception modules, flanked by two 1 × 1 convolution layers. This design choice significantly bolsters performance due to its hierarchical structure.

Figure 4 :Figure 5 :
Figure 4: The ablation study conducted by deactivating one feature at a time to showcase the effect of each component.

Figure 6 :
Figure 6: Comparisons of real and synthetic data with the reference. Please see the accompanying video for further comparisons.

Figure 7 :
Figure 7: Rendered results at the different levels of detail for selected materials.

Table 1 :
Errors for the images in Fig. 6.

Table 2 :
Errors for images rendered across multiple levels of detail, as shown in Fig. 7.