Modeling of 3D NAND Characteristics for Cross‐Temperature by Using Graph Neural Network and Its Application

Herein, the impact of cross‐temperature on 3D NAND flash memory is modeled by considering adjacent cells using machine learning. The cells comprising NAND flash memory exhibit diverse states and connectivity patterns. To effectively capture this complexity, the 3D NAND flash is converted to a graph structure and a graph neural network (GNN) is leveraged, known for its exceptional performance in handling graph data. To the best of the authors' knowledge, this is the first attempt to model 3D NAND flash memory using a GNN. This method has good generalization performance across various retention times and temperatures, achieving a remarkable overlap of 95.28% between ground truth and predicted distributions. Moreover, two applications of this method are introduced that contribute to NAND flash memory improvement. One is a GNN‐assist program, which leverages the well‐trained GNN to suppress the $V_{\text{th}}$ degradation caused by cross‐temperature, resulting in reduced $V_{\text{th}}$ shift and narrower $V_{\text{th}}$ distribution width. The other is a sensitivity decomposition to identify the parameters influencing the cell at cross‐temperature. It is found that the cross‐temperature impact extends beyond physically connected cells to adjacent cells at close distances. Overall, this work provides valuable insights into modeling 3D NAND flash memory using GNNs and offers practical methods for enhancing NAND flash memory reliability.

As NAND flash memory was developed to meet the required capability, both the device process and the device parameters became considerably more complex and numerous. In addition, this development increases the temperature sources that affect cell performance. Machine learning can be used to deal with this expanded data. Machine learning can handle massive amounts of data faster than humans and can simultaneously consider the complex parameters of NAND flash memory, a task that cannot be easily accomplished by humans. Recently, researchers have applied machine learning methods to analyze and optimize NAND flash memory characteristics. [9,10] Research related to cross-temperature has also been performed, and several methods have been suggested to suppress its impact; however, these efforts addressed the collective effect rather than the effects on individual cells. [8,11] In addition, although the characteristics of 3D NAND include interference by neighboring cells, [12,13] previous machine learning approaches [9,10] have not focused on this undesirable interaction. In this study, we investigated a machine learning technique for 3D NAND flash memory to mitigate the influence of cross-temperature based on individual NAND cell analyses. To address the interference between cells in 3D NAND, we employ graph neural networks (GNNs), [14] which are well suited for handling the geometrical relationships determined by cell position. To the best of our knowledge, no previous studies have utilized GNNs to model 3D NAND flash or applied machine learning techniques to predict the properties of individual cells. The contributions of this study are as follows: 1) We utilize a GNN to predict the changes in NAND cell characteristics caused by cross-temperature without any additional measurements. 2) We propose a neural network-based program method to improve the V th distribution of NAND flash memory and correct the degradation caused by cross-temperature. 3) We propose a sensitivity decomposition method that can analyze the impact of each adjacent cell on the target cell under cross-temperature conditions.
In the remainder of this article, we describe the details of our artificial intelligence (AI)-based method in Section 3, experimentally demonstrate the advantages of our method in Section 4, and conclude this work in Section 6.

Cross-Temperature Effect on NAND Flash Memory
The performance of semiconductor devices is affected by thermal conditions. In NAND flash memory, performance variations, such as in the current and V th, occur depending on the operating temperature. [7] Previous research has shown that V th varies more in a 3D array than in a 2D array when the temperature changes. NAND flash memory has a read operation that checks the amount of trapped charge and a program operation that induces charges to be trapped in the charge trap layer. Because NAND flash memory is not always used at a constant temperature, temperature differences occur between programming and reading data; this can significantly affect the reliability of NAND flash memory. Previous studies investigated the temperature coefficient (Tco) and analyzed the cross-temperature impact on the V th of NAND flash memory. [8,11] They found that cross-temperature increases the fail bit count of NAND flash memory, and that the V th distribution shifts to the right when the temperature changes from high to low and to the left when the temperature changes from low to high. In addition, they suggested several methods to improve the temperature dependency, such as the use of a different read current and a higher bottom select gate pass voltage.

Deep Learning for NAND Flash
As the capacity of a single cell increases from SLC to TLC, the V th distributions and the margins between them become extremely narrow. Moreover, nonstationary noises arise, including retention error and cross-temperature, due to device scaling-down and operation in 3D NAND flash memory. These noises make it difficult to determine the accurate V th distribution and affect data recovery.
Machine learning has also been applied to estimate the reliability of NAND flash memory. [17,18] However, these methods are limited to measuring the error rate at the chip level rather than evaluating reliability at the cell level. Furthermore, they primarily focused on only a few types of noise, such as P/E cycle and retention time, and neglected other types, such as cross-temperature. In addition, some studies introduced methods to enhance detection during the read process. [19,20] They used neural networks to distinguish the V th distributions more clearly and to determine the characteristics without specific knowledge.

Graph Neural Network
Graph neural networks (GNNs) [14] have gained significant attention for addressing problems with graph-structured data such as 3D meshes, materials, and molecules. GNNs utilize message passing to extract representative features from graph-structured data.
Message passing involves the exchange of features among connected nodes and edges and the update of their respective features, and it is conducted iteratively. This iterative process allows GNNs to capture and propagate relevant information throughout the graph, improving the model's understanding of the underlying relationships and patterns within the data. Kipf and Welling [21] proposed graph convolutional networks (GCNs), which integrate the locality concept of convolutional neural networks (CNNs) into GNNs, effectively addressing the issue of overfitting. The graph attention network [22] introduced an attention mechanism into GCNs, highlighting the importance of individual neighboring nodes and achieving notable performance improvements across various tasks.
GNNs have also been applied to molecular property prediction. [25] Molecular properties are influenced by both the connectivity (or bonds) and the interactions among atoms. To address this, Gilmer et al. [23] proposed the message passing neural network, which incorporates node and edge features in the message-passing phase. This approach has led to significant performance improvements in the molecular property prediction task.

Method
We design a model that predicts the characteristic (V th) of a single cell in 3D NAND flash memory. Since a cell is affected by its neighbors, [12,13] the characteristics of cells change simultaneously during operation as the chip temperature changes. Therefore, it is crucial to carefully consider both the single cell and its neighboring cells to accurately predict the electrical characteristics. To this end, we adopt GNNs, which model interactions across nodes (or vertices) well, to predict the V th of each cell. Furthermore, using the proposed model, we propose two applications that can improve the reliability of NAND flash memory: first, a program (PGM) method to enhance the V th distribution, and second, an analysis of the degradation that is influenced by neighboring cells under cross-temperature. In the following subsections, we explain the aforementioned techniques.

NAND Flash Data Preparation
A commercial TLC 3D NAND flash memory chip was used for massive data generation. The NAND flash memory has a vertically stacked 48-layer structure referred to as the SMArT scheme, [26] and the V th of the cells on the same page was measured at varying temperatures. The temperature condition of the NAND chip was controlled by a temperature chamber, and the V th was measured using a chip tester. All cells of the target page were first erased and then randomly programmed to have a V th ranging from the erase state to P7. After the chip temperature was changed and stabilized for a period, V th was measured again. In total, 73,728 cells were programmed so that each program and verify (PV) level contained a similar number of cells, and the coordinate information of the 3D NAND flash memory, such as block, COLUMN, input/output (IO), word line (WL), and string (STR), and the two V th values at different temperatures were obtained from experimental measurements.
We transformed the measured chip data into graph-structured data for cross-temperature characteristic modeling. Graph-structured data can be represented by nodes (n) and edges (e), and we defined each cell of the NAND flash and the connectivities between the cells as nodes and edges, respectively. To transform the measured data into graph-structured data, we defined the connections between the cells, namely the joint word line (JWL) and joint string line (JSL), as shown in Figure 3a. JWL and JSL represent connectivity in the XY-axis and Z-axis directions, respectively. Based on the defined connections, we determined the connectivities between all the cells of the measurement data and created an undirected graph (G) for the entire cell array. Then, we obtained the subgraphs (G = {g_1, g_2, ⋯, g_N}) induced by the K-hop neighborhood of the target cells for which we want to know V th at the target temperature (Temp 2). Each cell (or node, n_i) in the subgraph has the following features: location of the cell (COLUMN, IO, WL, STR), initial and target temperatures (Temp 1, Temp 2), and V th at the initial temperature (V th,1). The temperature unit is Celsius (°C). Our measured data also include the retention effect [27-29] in addition to cross-temperature; therefore, the retention time (T ret) was also used as a node feature. Furthermore, because the target cell is affected differently by the physically connected JSL cell and the surrounding JWL cells, [30-32] we needed to differentiate their connectivities. To achieve this, we used the edge feature (e) with indices of 0 (JWL) and 1 (JSL). For training our GNN, we used V th at the target temperature (V th,2) and the PV level as labels.
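As a concrete illustration of this preprocessing, the sketch below shows one way the measured cell records could be assembled into a DGL graph with typed edges. The feature layout, helper names, and the use of dgl.khop_in_subgraph are illustrative assumptions rather than the authors' released code.

```python
# Sketch: turn measured cell records into a graph (cells = nodes, JWL/JSL = typed edges).
import dgl
import torch

def build_nand_graph(cells, jwl_pairs, jsl_pairs):
    """cells: (N, 8) float tensor [COLUMN, IO, WL, STR, Temp1, Temp2, Tret, Vth1];
    jwl_pairs / jsl_pairs: lists of (i, j) node-index pairs for each connection type."""
    pairs = jwl_pairs + jsl_pairs
    # undirected graph: add each connection in both directions
    src = [i for i, j in pairs] + [j for i, j in pairs]
    dst = [j for i, j in pairs] + [i for i, j in pairs]
    g = dgl.graph((torch.tensor(src), torch.tensor(dst)), num_nodes=cells.shape[0])
    g.ndata['feat'] = cells.float()
    # edge type: 0 = JWL (XY-axis neighbour), 1 = JSL (Z-axis neighbour)
    etype = torch.tensor([0] * len(jwl_pairs) + [1] * len(jsl_pairs))
    g.edata['etype'] = torch.cat([etype, etype])
    return g

# K-hop subgraph around a target cell (K = 1 in the main experiments), e.g.:
# g_sub, target_idx = dgl.khop_in_subgraph(g, torch.tensor([42]), k=1)
```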

Graph Neural Network Modeling for Threshold Voltage Prediction
Given that the characteristics of NAND cells are affected by their neighbors, we leverage the power of GNNs to model the 3D NAND flash. In our approach, we adapt and extend the molecular property prediction method [23] to the 3D NAND flash, taking advantage of its effectiveness in capturing node-level interactions.
Our model takes as input the subgraph (g) consisting of the target cell and its surrounding cells and predicts the V th of the target cell (V th,2) at the specified temperature (Temp 2). The proposed model comprises three phases, as illustrated in Figure 4: a) node and edge feature projection phase, b) message passing phase, and c) readout phase.
In the feature projection phase (Figure 4a), the low-dimensional node feature (n i) and edge feature (e ij) are projected into high-dimensional features via a multilayer perceptron (MLP, f proj) and an embedding lookup table (f emb), respectively.
In the message passing phase (Figure 4b), meaningful features are extracted by iteratively updating the node features. As shown in Figure 5, the hidden feature of the node (h_i^l) is updated based on the message m_i^{l+1} according to Equation (1), where N(i) denotes the neighborhood of the i-th node in the subgraph (g). We choose the mean operator for aggregation in Equation (1). This message is constructed using the adjacent cell features (h_j^l, j ∈ N(i)) and the relationship (ê_ij) between the cells, and the feature of each cell is updated based on this message. We show experimentally that considering this interaction has a significant impact on the GNN performance in Section 4.5.1.
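For reference, a plausible form of this update, reconstructed from the description above (mean aggregation over messages built from neighbor and edge features) rather than reproduced from the paper's original Equation (1), is:

```latex
m_i^{l+1} = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} f_{\mathrm{msg}}\!\left(h_j^{l},\, \hat{e}_{ij}\right),
\qquad
h_i^{l+1} = f_{\mathrm{upd}}\!\left(h_i^{l},\, m_i^{l+1}\right)
```

where f_msg and f_upd stand for learnable message and update functions; the exact parameterization used in the paper is the one described in its Appendix A.1.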
After the message passing phase, the readout MLP (f read) (Figure 4c) predicts V th,2 using the extracted feature (h_i^L) of the target cell as an input.
The detailed implementation of our GNN is described in Appendix A.1.
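To make the three phases concrete, the following PyTorch/DGL sketch mirrors the projection, message passing, and readout stages described above. Layer widths, activation choices, and module names are assumptions; the authors' exact architecture is the one given in Appendix A.1.

```python
# Sketch of the three-phase model: projection -> message passing (mean aggregation) -> readout.
import torch
import torch.nn as nn
import dgl
import dgl.function as fn

class MessagePassingLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.msg_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU())  # builds m from (h_j, e_ij)
        self.upd_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU())  # updates h_i from (h_i, m_i)

    def forward(self, g, h, e):
        with g.local_scope():
            g.ndata['h'], g.edata['e'] = h, e
            # message: concatenate neighbour feature with edge feature, then MLP
            g.apply_edges(lambda edges: {'m': self.msg_mlp(
                torch.cat([edges.src['h'], edges.data['e']], dim=-1))})
            # aggregate messages over the neighbourhood with a mean
            g.update_all(fn.copy_e('m', 'm'), fn.mean('m', 'agg'))
            return self.upd_mlp(torch.cat([h, g.ndata['agg']], dim=-1))

class NANDGNN(nn.Module):
    def __init__(self, in_dim=8, hid=64, n_edge_types=2, n_layers=2):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, hid), nn.SiLU())  # f_proj
        self.edge_emb = nn.Embedding(n_edge_types, hid)               # f_emb (JWL=0, JSL=1)
        self.layers = nn.ModuleList(MessagePassingLayer(hid) for _ in range(n_layers))
        self.readout = nn.Linear(hid, 1)                              # f_read (V_th only, without MTL)

    def forward(self, g):
        h = self.proj(g.ndata['feat'])
        e = self.edge_emb(g.edata['etype'])
        for layer in self.layers:
            h = layer(g, h, e)
        return self.readout(h)   # per-node V_th prediction; the target node is indexed outside
```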
Furthermore, during training, our method predicts not only V th,2 of the target cell but also V th,2 of its neighboring cells, incorporating all predicted V th,2 values in the loss calculation. This approach ensures stable training of the GNN. However, during inference, our method predicts only the V th of the target cell.

Stable Training with Multitask Learning
To further improve the performance of our GNN, we incorporate the principle of multitask learning (MTL) [33] into the readout MLP (f read). MTL involves the simultaneous training of a single model on multiple tasks, thereby enhancing the performance of the primary task, an effect known as eavesdropping. [33] To apply MTL to our method, we extend the functionality of the readout MLP to predict both the threshold voltage and the PV of the cell. To accommodate the 8-state TLC dataset, we expand the output channels of the readout MLP from 1 to 9, as shown in Figure 6. The first output channel is dedicated to the prediction of V th, whereas the remaining channels are used for PV prediction. For training the PV prediction task, we use the cross-entropy loss ℒ_PV (Equation (5)), $\mathcal{L}_{PV} = -\sum_{i} y_i \log(p_i)$, where y_i represents the actual value, which can be either 0 or 1, and p_i represents the probability assigned by the softmax function to the i-th PV. The final training loss is $\mathcal{L} = \mathcal{L}_{V_{th}} + \lambda\,\mathcal{L}_{PV}$, where λ denotes the weight of ℒ_PV; we used a value of 0.1 for λ.
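A minimal sketch of this multitask head and loss is given below, assuming that a plain MSE term stands in for the threshold-voltage loss ℒ_Vth and that channel 0 of the readout carries the V th prediction; variable names are illustrative.

```python
# Multitask loss: channel 0 -> V_th regression, channels 1-8 -> PV classification (TLC, 8 states).
import torch
import torch.nn.functional as F

def multitask_loss(readout_out, vth_target, pv_target, lam=0.1):
    """readout_out: (N, 9) tensor; vth_target: (N,) float; pv_target: (N,) long in [0, 7]."""
    vth_pred = readout_out[:, 0]
    pv_logits = readout_out[:, 1:]                      # 8 PV classes
    loss_vth = F.mse_loss(vth_pred, vth_target)         # stand-in for L_Vth
    loss_pv = F.cross_entropy(pv_logits, pv_target)     # L_PV, Equation (5)
    return loss_vth + lam * loss_pv                     # lambda = 0.1 as in the paper
```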

GNN-Assist PGM
In NAND flash memory, precise V th control is one of the essential techniques for increasing bit density and reliability. However, NAND cells can be unintentionally programmed by various factors such as disturbance, initial charge loss, and WL interference, making it difficult to obtain an accurate V th distribution because of abnormally programmed cells. [34,35] Many programming (PGM) schemes have been studied to overcome these problems, [36,37] but it is difficult for them to account for continuously changing temperature. Therefore, we propose a novel program method that improves the V th distribution using a GNN that considers the temperature at which the target cell operates and the states of the adjacent cells.
Our program method improves the V th distribution by identifying the V th with the lowest temperature dependency, considering the adjacent cells, according to the following steps. First, we virtually program the un-programmed P0 cell into P1-P7 and obtain the V th,2 for each PV using the well-trained GNN (Figure 7a). To ensure that the programming aligns with the existing V th characteristics, we calculate the histograms of each state (P1-P7) based on the measurement data and use them to sample V th for programming. Then we program P0 cells with the predicted PV that is least affected by the cross-temperature (Figure 7b). In this step, we evaluate the V th shift for all PV candidates and choose the candidate with the smallest V th shift. Finally, we conduct the same process iteratively to obtain an improved V th distribution with the smallest V th shift, resulting in reduced V th width.
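The per-cell selection step can be sketched as follows, assuming a node-feature layout in which V th,1 occupies a fixed column and per-state histograms sampled from the measured data; the data-placement constraint that keeps an equal number of cells per state is omitted for brevity, and the helper is not the authors' implementation.

```python
# Sketch of the GNN-assist PGM selection: try each candidate PV virtually, keep the one
# whose predicted V_th,2 moves least from the programmed V_th,1.
import numpy as np
import torch

VTH1_COL = 7  # assumed position of V_th,1 in the node feature vector

@torch.no_grad()
def select_pv(model, g_sub, target_idx, state_hists, rng=np.random.default_rng(0)):
    best_pv, best_shift = None, float('inf')
    for pv in range(1, 8):                                   # candidate states P1..P7
        centers, probs = state_hists[pv]                     # measured histogram of state pv
        vth1 = float(rng.choice(centers, p=probs))           # sample a programmed V_th,1
        feats = g_sub.ndata['feat'].clone()
        feats[target_idx, VTH1_COL] = vth1                   # virtual programming of the P0 cell
        g_sub.ndata['feat'] = feats
        vth2 = model(g_sub)[target_idx].item()               # predicted V_th at Temp_2
        if abs(vth2 - vth1) < best_shift:
            best_pv, best_shift = pv, abs(vth2 - vth1)
    return best_pv, best_shift
```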

Sensitivity Decomposition
In semiconductor research, the sensitivity analysis of device parameters is one of the important techniques for both optimization and analysis. [38,39] However, since there are many device parameters and complex correlations between them, it is difficult to analyze the sensitivities of the parameters separately. [42] Our method analyzes the sensitivity of the target cell by decomposing the gradients of adjacent cells at cross-temperature. The overall procedure of our sensitivity analysis is described in Figure 8. First, we obtain V th,2 using the well-trained GNN and calculate the MSE between the output of the GNN and the initial V th (V th,1). Then, we backpropagate this MSE to the input graph and obtain the gradient of each node and its features. This gradient describes how the features affect the V th degradation at cross-temperature.
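A compact sketch of this gradient-based decomposition is shown below, assuming the feature layout of the earlier sketches; the returned gradient matrix is read row-wise per cell and column-wise per feature.

```python
# Sketch: backpropagate the MSE between predicted V_th,2 and initial V_th,1 to the input features.
import torch
import torch.nn.functional as F

def sensitivity_decomposition(model, g_sub, target_idx, vth1_col=7):
    feats = g_sub.ndata['feat'].clone().requires_grad_(True)
    g_sub.ndata['feat'] = feats
    vth2_pred = model(g_sub)[target_idx, 0]              # GNN prediction for the target cell
    vth1 = feats[target_idx, vth1_col].detach()          # initial V_th,1 of the target cell
    loss = F.mse_loss(vth2_pred, vth1)                   # drift of the cell under cross-temperature
    loss.backward()
    # feats.grad: (num_cells_in_subgraph, num_features)
    #   row i  -> sensitivity to cell i (target, JWL, or JSL neighbour)
    #   column -> sensitivity to an individual feature (V_th,1, Temp_1, position, ...)
    return feats.grad.detach()
```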

Experimental Setup
We used 19 chip-level NAND datasets measured at different temperature and retention time settings. To show the generalization performance of our method, we divided these data into a training set (D train), a validation set (D val), and two test datasets (D test t and D test t/r). For training and validation, 15 datasets were selected and randomly divided into training and validation sets. We constructed two test datasets: one for evaluating performance at temperatures not seen during training (D test t) and the other for evaluating performance at both unseen temperatures and retention times (D test t/r). The resulting dataset consisted of 65% training data, 15% validation data, and 10% for each of the two test datasets, comprising a total of 1,400,832 subgraphs. The splitting of the dataset is presented in Figure 9.
For training, we used the AdamW optimizer [43] with a learning rate scheduler [44,45] and a weight decay of 0.005. A detailed description of the training hyperparameters is provided in Appendix A.2. All experiments were conducted in PyTorch v1.13.0 and DGL v1.0.0 on a single RTX3090 GPU.
In practice, the trained models exhibit slight performance variations even when using the same architecture and training hyperparameters because of the random initialization of model parameters and random data order. To reduce the effect of randomness, we conducted five training runs for each model under the same configuration. We then reported the average and standard deviation across these five trials unless otherwise specified.
Our model predicted the V th of each cell, allowing us to obtain the distribution of the predicted V th for each PV state and temperature setting. We employ the Bhattacharyya coefficient (BC) [46] to quantitatively assess the similarity between the predicted and ground truth distributions. The BC measures the degree of overlap between two distributions as a value from 0 to 1. In addition, we reported the mean absolute error (MAE, Equation (7)) and the accuracy of PV prediction (Acc) for evaluation.
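For reference, a straightforward way to compute these two metrics from per-cell predictions is sketched below; the histogram binning is an assumption, since the paper's measurement resolution is not reproduced here.

```python
# Sketch of the evaluation metrics: Bhattacharyya coefficient between V_th histograms and MAE.
import numpy as np

def bhattacharyya_coefficient(vth_true, vth_pred, bins=128):
    lo = min(vth_true.min(), vth_pred.min())
    hi = max(vth_true.max(), vth_pred.max())
    p, _ = np.histogram(vth_true, bins=bins, range=(lo, hi))
    q, _ = np.histogram(vth_pred, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(np.sqrt(p * q)))     # 1.0 = perfect overlap, 0.0 = disjoint

def mae(vth_true, vth_pred):
    return float(np.mean(np.abs(vth_true - vth_pred)))
```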

Generalization Evaluation
We found that our method has strong generalization ability beyond its training dataset. To validate this generalization ability, we analyzed the results on the out-of-distribution datasets (D test t and D test t/r).
Figure 10 and Tables 1 and 2 illustrate the performance of the GNN under varying temperatures and retention times. The values listed in each table represent the average (with the standard deviation) of the results obtained from five training runs of the GNN. Our method exhibits a low V th error that is nearly the same as the measurement resolution (0.727) and a high BC value for temperatures and retention times that were not seen during training. This indicates that our method performs well in out-of-distribution settings and has good generalization performance.
Although our method performed well on most datasets, it showed unsatisfactory results in high-temperature settings, as indicated by the 0.778 error in the −25 °C → 100 °C case of D test t in Table 1. Moreover, in the bottom-left of Figure 10, the ground truth and prediction distributions do not align. This is due to data preprocessing, during which any measured values below the measurement boundary are replaced with the boundary value. This preprocessing often results in the substitution of numerous P1 V th values with the boundary value, particularly in datasets that exhibit considerable V th shifts, such as high-temperature data (100 °C). As shown in Figure 11a, the V th distribution of P1 peaks again at the boundary value at 100 °C, resulting in a distinct distribution shape that differs from the V th distribution at −25 °C. In contrast, Figure 11b demonstrates similar V th distribution shapes across different temperatures. In addition, our method tends to predict a shifted V th distribution based on the shape of the input V th distribution. For instance, as shown in Figure 11a, our method predicted V th,2 (dashed red line) by following the V th distribution at the initial temperature (solid blue line). However, due to preprocessing, the ground truth data exhibited two peaks (solid red line), a shape different from the single-peaked initial temperature distribution. Consequently, the predicted distribution (dashed red line) shows a considerable difference in V th distribution shape compared with the ground truth (solid red line), which resulted in a V th MAE of about 0.830 and a BC of about 0.7. Nevertheless, our GNN-based method successfully modeled the V th characteristics of 3D NAND and exhibited good performance in terms of error and BC, except for cases where the V th values were clipped by pre-processing, e.g., 100 °C → −25 °C in D val and −25 °C → 100 °C in D test t.

Result of GNN-Assist PGM
We evaluated the effectiveness of the proposed GNN-assist PGM in improving the V th distribution at cross-temperature. We compared our PGM method with the conventional random program method using two criteria: distribution shift and width. To measure the distribution shift, we calculated the difference between the means of the two distributions (V th,1 and V th,2) for each method. Then, we obtained the ratio of the two differences (V th shift ratio) and used it as an evaluation metric for the distribution shift. To evaluate the effect of reducing the distribution width, we computed the standard deviation (std) of the V th,2 distribution obtained using each method. Similar to the V th shift evaluation, we calculated the ratio of the two standard deviations as a metric (std ratio). All the ratios represent the improvement of the GNN-assist PGM over the random PGM. For the evaluation data, we randomly erased 50% of the programmed cells per state (P1-P7) and used them as P0 cells, which were newly programmed for comparison. In addition, to avoid programming bias toward specific PVs, we set the number of erased PVs to the maximum number of programmable PVs, ensuring an equal number of states as in the original data.
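The two ratios can be computed directly from the programmed and read-back V th samples of each method, as in this small sketch (array names are illustrative):

```python
# Sketch of the comparison metrics: V_th shift ratio and std (width) ratio per PV state.
import numpy as np

def pgm_metrics(vth1_gnn, vth2_gnn, vth1_rand, vth2_rand):
    shift_gnn = abs(np.mean(vth2_gnn) - np.mean(vth1_gnn))
    shift_rand = abs(np.mean(vth2_rand) - np.mean(vth1_rand))
    shift_ratio = shift_gnn / shift_rand                 # V_th shift ratio (smaller is better)
    std_ratio = np.std(vth2_gnn) / np.std(vth2_rand)     # distribution width ratio
    return shift_ratio, std_ratio
```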
The results of GNN-assist PGM are presented in Table 3, showing that our method outperforms the random PGM in terms of both V th distribution shift and width for all PVs. Our method achieves an average shift improvement of 47.20% compared with the random PGM method, with a minimum V th shift ratio of 47.72% at P1. Furthermore, an average width improvement of 18.36% was achieved using our GNN-assist PGM method compared with the random PGM method, with the lowest std ratio of 70.45% at P6. We also provide a visualization of the results in Figure 12, where each column (a,b) shows the results from different PVs, and each row shows the results from the two PGMs. In Figure 12, the black line in each row shows the V th,1 distribution obtained by each method, and the colored dashed lines represent the V th,2 distributions. We put both PGM results on the same x-axis to compare the distribution width and the degree of shift. As shown in Figure 12, the GNN-assist PGM method caused less V th shift than the random PGM method, which can be observed from the reduced gap between the gray dashed lines. In addition, since both distributions are compared on the same V th scale, it can be observed visually that the distribution width of our proposed method is reduced. The improved distribution of the GNN-assist PGM method ensures a larger read window margin of NAND flash memory and enables precise V th control for increasing the bits per cell by improving reliability.
Table 1. Overall performance of GNN. The error denotes the mean absolute error between the ground truth and the GNN prediction.

Result of Sensitivity Decomposition
We decomposed the sensitivity of adjacent cells using gradients to investigate the degradation source under cross-temperature. We defined the V th in the regions above and below 3-sigma in the V th distribution of each PV as outliers and performed sensitivity decomposition for the tail sides of the V th distribution, which critically affect reliability. We conducted two types of sensitivity decomposition: one for groups of cells and one for individual cells.
Figure 13 shows the sensitivity decomposition of several target cells for tendency analysis. Each box plot denotes the gradient per cell (target, JWL, and JSL cells), showing the impact of each cell on the V th shift. To focus on adjacent cells, we excluded the target cell, which always yielded the largest gradient. Figure 13a shows the result of the sensitivity decomposition for data including cells on another STR (JWL-5 and JWL-6), which are located relatively farther away in the 3D NAND than the other JWL cells. The results showed that the JWL-5 and JWL-6 cells exhibited considerably lower gradient values than the other JWL cells. This indicates that the impact of adjacent cells far from the target cell at cross-temperature is not significant. As shown in Figure 13b, the JSL cell, which is physically connected, typically exhibits the highest gradient value among the adjacent cells. Interestingly, the JWL cells that are not physically connected to the target cell also affect the V th degradation under cross-temperature conditions. To investigate the effects of the PV differences between the target cell and its neighbors, we conducted a gradient analysis based on the PV difference. Given that the gradient influences the V th shift in both scale and direction, using the average or median of the gradients is insufficient to study the trends among PV differences. Instead, we employed the absolute value of the difference between the largest and smallest gradient values, referred to as the gradient variation. Figure 13c presents the gradient variation based on the PV difference between the target and the adjacent cell. The dashed line in Figure 13c indicates that the gradient variation increases linearly with the PV difference. Specifically, when the PV of the adjacent cell is larger than that of the target cell (x-axis < 0), the gradient variation is low; otherwise (x-axis > 0), it is large. The JSL cells in 3D NAND, which are directly connected to the target cell, cause trapped-charge migration, which worsens as the PV difference between the target cell and adjacent cells increases. [27]
Furthermore, as the PV of the target cell increases, the variation caused by the trapped charge increases. Under cross-temperature, the V th of the JWL cells also changes, interfering with the target cell. This illustrates that adjacent cells with a lower PV than the target cell play a significant role in causing the target cell to become an abnormal cell. We further conducted the sensitivity decomposition on individual cells with a large V th shift for a more detailed influence analysis and present it in Figure 14. As shown in Figure 14, the gradient variation of adjacent cells was decomposed individually, and the absolute gradients did not depend on the adjacency type (JSL, JWL). In the more specific decomposition for each cell, adjacent cells with lower V th,1 than the target cell exhibited high gradients. For instance, the JSL and JWL-2 cells in the first row, as well as the JSL, JWL-2, and JWL-4 cells in the second row, had higher gradients than the others. In contrast, cells with V th,1 similar to that of the target cell, such as JWL-1 in the second row, exhibited very small gradients. Furthermore, it can be confirmed that not only V th,1 but also the programming temperature (Temp 1) caused the V th shift of the target cell. This suggests that the cells exerting influence at cross-temperature are not limited to physically connected cells such as JSL cells. Instead, the V th difference from the target cell has a greater influence on degrading the target cell's V th characteristic at cross-temperature. Consequently, both the tendency of groups of cells and the influence of each individual cell were verified with the gradient variation using our sensitivity decomposition method. This sensitivity decomposition using the GNN can be used to evaluate the influence of several physics models on electrical characteristics and to analyze degradation sources.
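For clarity, the gradient variation used in Figure 13c, that is, the absolute spread between the largest and smallest gradient values within a group of cells, can be computed with a trivial helper such as:

```python
# Gradient variation = |max(gradients) - min(gradients)| over a group of adjacent cells.
import numpy as np

def gradient_variation(grads):
    grads = np.asarray(grads)
    return float(abs(grads.max() - grads.min()))
```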

Investigating Key Components in GNN Training
In this section, we conduct ablation studies of both the graph structure (number and type of adjacent cells) and the network architecture to identify the key components of the proposed method.

Graph Structure
The graph structure, such as the number of adjacent cells (K-hop) and the type of adjacency, is crucial for achieving optimal GNN performance. Based on this understanding, we trained the GNN while varying the types and numbers of adjacent cells. By analyzing the training results, we identified the range of adjacent cells that significantly affects the target cell. To train the GNN using various adjacent cells, we constructed several graphs as follows. First, we used only the target cell and ignored adjacent cells (0-hop). Second, we built graphs via K-hop sampling, covering 1-hop and 2-hop neighborhoods. Third, we sampled only cells adjacent through JSL relationships. For the 0-hop dataset, we trained and evaluated a simple MLP network rather than a GNN because the 0-hop data consist of only the target cell. Details of the MLP network are described in Appendix A.1. Figure 15 shows the training results with the datasets described above. When predicting V th,2 using only the target cell (0-hop, blue line), the model fails to consider the influence of adjacent cells and performs worse than when adjacent cells are considered (JSL, 1-hop, 2-hop). In contrast, when the impact of adjacent cells is considered more broadly (2-hop, green line), the performance is inferior to that of the 1-hop setting (red line). This result suggests that cells located far from the target cell may not considerably affect the cross-temperature behavior and could lead to training failure. The target cell with only JSL neighbors (JSL, light blue line) shows better performance than the 0-hop setting but worse performance than the 1-hop and 2-hop datasets (red and green lines). This result suggests that the impact of cross-temperature is not governed solely by the target cell and the JSL cell but also by the JWL cells, and these results are consistent with the discussion in Section 4.4.
We also measured the computational cost and report it in Table 4. We measured the inference time of the proposed model using the test dataset (D test t). For accurate measurement, we recorded the time after 50 warm-up runs. Table 4 shows that the proposed method (1-hop) can predict V th in real time (about 5 ms) compared with the real measurement environment, which requires considerable time and additional equipment. Although the MLP model (0-hop in Table 4) exhibits a very fast speed (0.7 ms), it shows lower performance than the GNN model (Figure 15) and lacks the ability to model the relationships among adjacent cells. This limitation makes it unsuitable for the proposed applications (GNN-assist PGM and sensitivity decomposition).
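One way such a timing measurement could be reproduced (50 warm-up passes, then synchronized timing on the GPU) is sketched below; batching and data loading are omitted, and the function name is illustrative.

```python
# Sketch of an inference-time measurement with warm-up and GPU synchronisation.
import time
import torch

@torch.no_grad()
def measure_inference_time(model, subgraphs, n_warmup=50, n_runs=100):
    model.eval()
    for g in subgraphs[:n_warmup]:            # warm-up runs (kernel setup, caching)
        model(g)
    torch.cuda.synchronize()
    times = []
    for g in subgraphs[:n_runs]:
        t0 = time.perf_counter()
        model(g)
        torch.cuda.synchronize()              # wait for the GPU before stopping the clock
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)
```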

Network Architecture
In terms of network architecture, the crucial factor that affects the performance of a GNN is the number of layers, also known as its depth. In general, for deep learning models such as MLPs and CNNs, deeper models tend to yield better performance. In contrast, with GNNs, increasing the depth leads to an over-smoothing problem, where node features converge to a single value, resulting in worse performance. [47] To determine the appropriate depth for our model, we conducted an ablation study; the results are shown in Figure 16. We found that if the depth was too small (blue line), the network was under-fitted. Conversely, when the depth was large (green line), the training became unstable. Based on this result, we set the depth of the GNN to 2.
To demonstrate the effectiveness of the training methods employed to improve the GNN performance, we analyzed the loss curves; Figure 17 shows the loss and BC curves. Each column represents the training, validation, and two test curves. The baseline model was trained using the threshold voltage loss (ℒ Vth) with only the threshold voltage of the target cell. The baseline model (light blue line) shows a very unstable loss for all datasets during the training process. The model trained with MTL (green line) exhibited more stable training than the baseline but still showed an unstable loss curve. We addressed this unstable training by forcing the GNN to predict both the V th of the target cell and the V th of its neighbors (blue line). The measured data also include a retention effect. To design a GNN with generalized performance on this retention effect, we adopted the retention time as a node feature. This resulted in stable performance on the retention test dataset (D test t/r) (red line).

Discussion and Future Work
In Section 4.5.1, our GNN model showed lower performance when the graph size was enlarged (from 1-hop to 2-hop) in the 3D NAND structure. However, as NAND flash memory develops toward higher bit density, the cells in 3D NAND scale down, and more factors need to be considered because the spacing between cells becomes smaller. Consequently, the graph size and the modeling of interactions within this expanded set of cells become even more important for advanced devices. Recent GNN research investigates goal-directed graph generation, which generates graphs based on specific conditions. We can anticipate studies that use GNNs to generate new structures of 3D NAND flash and find device structures that meet target performance. These techniques are also expected to dramatically reduce the time required to develop new devices.

Conclusion
In this study, we trained a neural network to model the V th characteristics of 3D NAND flash memory at cross-temperature. We used a GNN to consider the structural characteristics of 3D NAND at various temperatures, and the trained network predicted the test data precisely with an average BC of 0.95 and an accuracy of 99.87%. We proposed two application approaches: GNN-assist PGM and sensitivity decomposition. The GNN-assist PGM method enables empty (P0) cells to be programmed with the optimal PV for cross-temperature to mitigate the V th distribution degradation, achieving an average 47.20% improvement in V th distribution shift and an 18.36% reduction in V th distribution width.
The sensitivity decomposition successfully identified the temperature influence that causes the V th degradation of the target cell using a gradient analysis method. This method identified the tendency of multiple cells whose V th values correspond to the tails of the distribution at cross-temperature and the degradation factors that make a single cell less reliable. Based on these findings, we confirmed that the V th difference between the target and the adjacent cells at cross-temperature has a significant influence on the degradation of the target cell. Consequently, our research achieved a GNN that successfully modeled the V th characteristics of 3D NAND cells at cross-temperature and proposed two application approaches that improved the V th distribution and identified the degradation source. Our results can be used in future NAND development to improve NAND reliability.

Appendix
Network Architecture

MLP Network
In Section 4.5.1, we used an MLP network to predict V th with only the target cell; the detailed network structure is shown in Figure 18.
Figure 18. Detail of the MLP network. The number in parentheses for each layer denotes the dimension change. BN denotes batch normalization. [48] For the activation function, we used SiLU. [49] a) Detail of the FC layer; b) structure of the MLP network.

GNN Network
Figure 19 shows the detailed structure of our GNN. The FC layer used in the GNN network has the same structure as that used in the MLP network (Figure 18a). The lookup table retrieves a learnable vector according to the edge category (JWL or JSL), and we used this vector as the edge feature.

Training Parameters
For stable learning, we adopt the cosine annealing learning rate scheduler with warm restarts. [44,45] This scheduler increases the learning rate (lr) linearly, starting from the minimum learning rate (lr min) up to the maximum learning rate (lr max). Once the maximum learning rate is reached, the learning rate is decreased according to Equation (A1) until it reaches the minimum learning rate again.

$$
\mathrm{lr}(t) =
\begin{cases}
\mathrm{lr}_{\min} + (\mathrm{lr}_{\max} - \mathrm{lr}_{\min})\,\dfrac{t}{T_{up}}, & 0 \le t < T_{up} \\[6pt]
\mathrm{lr}_{\min} + \dfrac{1}{2}\,(\mathrm{lr}_{\max} - \mathrm{lr}_{\min})\left(1 + \cos\!\left(\pi\,\dfrac{t - T_{up}}{T - T_{up}}\right)\right), & T_{up} \le t < T
\end{cases}
\tag{A1}
$$
In addition, this schedule is repeated multiple times, and each repetition of the schedule is modified by multiplying T and lr max by specific factors (T mult and γ). In this way, we can avoid the overfitting issue without special methods such as early stopping. [44] For training, we set lr max, T, T mult, T up, and γ to 0.001, 50, 2, 10, and 0.7, respectively, and visualize the learning rate per epoch in Figure 20. We trained all models for 140 epochs using the AdamW optimizer with weight decay, β1, and β2 set to 0.005, 0.9, and 0.999, respectively, where all these values were found by grid search.
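A hedged re-implementation of this schedule is sketched below, with lr min assumed to be 1e-6 (the paper does not state it) and the warm-up applied within every cycle; it mirrors the stated hyperparameters but is not the authors' scheduler code.

```python
# Sketch: cosine annealing with warm restarts, linear warm-up over T_up epochs, cycle length
# multiplied by T_mult and peak learning rate multiplied by gamma after each restart.
import math

def lr_at_epoch(epoch, lr_min=1e-6, lr_max=1e-3, T=50, T_up=10, T_mult=2, gamma=0.7):
    # locate the current cycle
    t, cycle_len, peak = epoch, T, lr_max
    while t >= cycle_len:
        t -= cycle_len
        cycle_len *= T_mult
        peak *= gamma
    if t < T_up:                                   # linear warm-up
        return lr_min + (peak - lr_min) * t / T_up
    progress = (t - T_up) / (cycle_len - T_up)     # cosine decay back to lr_min
    return lr_min + 0.5 * (peak - lr_min) * (1 + math.cos(math.pi * progress))
```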

Prediction Result of GNN Across Various Temperatures
Figure 21 shows the overall result of GNN prediction.

Figure 1. Plots depicting V th shift. The black dashed line shows the distribution when programmed (25 °C). The blue and red lines denote distributions when read at different temperatures (75 °C and 100 °C). The average values of each distribution are listed in the legend. The threshold voltage is represented in arbitrary units (a.u.).

Figure 2. P5 cell statistics. a) Median cell V th shift. The black arrow denotes the median of P5 when programmed at 25 °C. The blue and red lines show the changed V th distributions at 75 °C and 100 °C for cells that were at the median of the V th distribution when programmed. The median values of each distribution are listed in the legend. b) V th variance of individual cells. Each box corresponds to the V th when a single cell is read five times, with the difference between the max and min values displayed on each upper whisker. The circle denotes an outlier.

Figure 3. 3D NAND flash data preprocessing. a) Coordinate definition; b) the process of generating graph-structured data of 3D NAND.

Figure 4. Overview of the proposed method. a) Feature projection, b) message passing (update node features), and c) readout (predict V th).

Figure 6. Structure of the multitask MLP. The dashed red box outputs the probability of the PV class.

Figure 7. Overview of GNN-assist PGM. w and s denote the width of the V th,2 distribution and the shift of V th, respectively. a) Virtual programming; b) finding the optimal PV for cross-temperature.
Figure 8. Overview of sensitivity decomposition.


Figure 9. Split of the dataset. a) Training and validation sets (D train, D val). b) Unseen temperature test dataset (D test t). c) Unseen temperature and retention test dataset (D test t/r). Yellow bars (and subscripts R1, R2, ⋯, R5) denote retention time-variant data with the same temperature. Subscript STR denotes measured data with another STR.

Figure 10. Distribution prediction of GNN across various temperature settings. Not all temperature settings were used for training.

Figure 12. Distribution comparison between random and GNN-assist PGMs. V th distributions for a) PV1 and b) PV5. The black line denotes the distribution of V th,1. Blue and red dashed lines denote the random PGM and GNN-assist PGM V th,2 distributions, respectively. The gray vertical dashed lines denote the mean values of the distributions.

Figure 14. Sensitivity decomposition for each cell that shows a large V th shift. a) Gradient per node. b) Gradient per feature. abs(gradient) denotes the absolute value of the gradient. The value above each subplot represents V th,1 or the V th shift, converted to arbitrary units (a.u.).

Figure 15. BC versus epoch during the training process for a) the unseen temperature test dataset and b) the unseen temperature and retention test dataset.

Figure 16. Ablation study of depth. We present a learning curve obtained by training only once at each depth setting.

Figure 17. Loss and quantitative evaluation curves per epoch.

Table 2. Quantitative evaluation per temperature and PV. The reported values denote the mean of five training runs, with the standard deviation shown in parentheses.

Table 3. Comparison of GNN-assist PGM and random PGM. The V th value and the value in parentheses represent the mean and standard deviation of the distribution, respectively.

Table 4. Comparison of inference times. We report both the average and standard deviation of the inference time.