Automated Evaluation of Human Embryo Blastulation and Implantation Potential using Deep‐Learning

In in vitro fertilization (IVF) treatments, early identification of embryos with high implantation potential is required for shortening time to pregnancy while avoiding clinical complications to the newborn and the mother caused by multiple pregnancies. Current classification tools are based on morphological and morphokinetic parameters that are manually annotated using time‐lapse video files. However, manual annotation introduces interobserver and intraobserver variability and provides a discrete representation of preimplantation development while ignoring dynamic features that are associated with embryo quality. Fully automated and standardized classifiers are developed by training deep neural networks directly on the raw video files of >6200 blastulation‐labeled and >5500 implantation‐labeled embryos. Prediction of embryo implantation is more accurate than the current state‐of‐the‐art morphokinetic classifier. Embryo classification improves with video length, and the most predictive images show only partial association with morphological features. A deep learning substitute for human evaluation of embryo developmental competence thus contributes to implementing single embryo transfer methodology.

involving expansive datasets, is gradually incorporated into the health-care system worldwide [12][13][14] and IVF clinics specifically. [15] Deep learning was used for performing automated morphokinetic annotation [16] and morphological scoring of blastocysts based on the last frame that was acquired just before transfer. [17] Tran et al. also used deep learning for predicting the implantation outcome of day-5 embryos by training on the time-lapse images. However, negative implantation outcome was mislabeled by including discarded embryos that were not transferred to the uterus (80% of the entire dataset). [18] A recent publication reports the prediction of implantation of day-5 transferred embryos using single "static" images of the blastocyst stage. [19] By training a deep neural network (DNN) using whole-video time-lapse images of preimplantation embryo development, we achieved high accuracy and provide an early prediction of embryo blastulation and embryo implantation already on day-3.

The SHIFRA Database
We conducted a retrospective study in which we trained DNNs directly on the raw video files to generate automated and standardized classification algorithms of embryo blastulation and implantation. DNN training requires large datasets of labeled embryos. Therefore, we assembled and compiled a database that we term "SHIFRA" by collecting video files of >20 000 fresh embryos that were cultured in nine time-lapse incubators (Embryoscope, Vitrolife) during the past 6 years in four medical centers dispersed across Israel (Figure S1a and Table S1, Supporting Information). Our data-providing clinics accept patients with heterogeneous ethnic and racial backgrounds (Eastern and Western European Jews, North African and Middle Eastern Jews, Arabs, and others) [20] and span different maternal age groups (Figure S1, Supporting Information), thus decreasing the effect of confounding variables and increasing embryo classification generality. Seven-frame z-stacks, 15 μm apart, were recorded at 18-to-20 min intervals for up to 6 days of incubation, providing continuous 3D imaging of preimplantation embryo development (Figure 1a). Based on time-lapse imaging, morphokinetic profiles of 16 000 embryos were annotated, specifying the time series of discrete events (Figure S2b-i, Supporting Information): pronuclei appearance and fading (tPNa/f), cleavage of N cells (tN; N = 2 to 9), morula compaction (tM), and start of blastulation (tSB). Morphokinetic annotations were determined by qualified and trained embryologists according to established protocols (2-to-3 annotators). [2,21] Quality assurance (QA) of morphokinetic annotations was conducted in a blinded fashion by an additional expert embryologist (Figure S2a, Supporting Information). The temporal intervals between consecutive morphokinetic events were included as well, showing temporal separation between cleavage cycles (Figure S2b-ii, Supporting Information).
Labeling of embryo blastulation is based on morphokinetic histories. Embryos that reached start of blastulation (tSB) inside the incubator were labeled blastulation-positive (BLAST_p). BLAST-negative (BLAST_n) embryos that had been arrested at earlier developmental states and BLAST-unknown (BLAST_u) embryos were identified by projecting their morphokinetic profiles onto the time windows that permit transitioning from one embryo state to the next (Figure S3a, Table S2 and S3, Supporting Information; see Experimental Section). The SHIFRA database contains additional metadata, including maternal age, day-of-transfer, and co-transferred embryo statistics (Figure S1a,b, Supporting Information). Known Implantation Data (KID) labeling was determined based on embryo transfer statistics and the number of gestational sacs and fetal heart beats as measured on weeks 5-7 of pregnancy (see Experimental Section). Positive and negative implantation outcomes (KID_p and KID_n) refer to embryos that were successfully implanted or failed to implant, respectively, and embryos whose implantation outcome was uncertain are labeled KID unknown (KID_u; Figure S1c, Supporting Information).

Early Prediction of Blastocyst Formation
The potential of an embryo to undergo blastulation marks its developmental quality and is linked with its potential to implant in the uterus. [22] Although BLAST_n embryos were arrested at earlier stages, their morphokinetic profiles overlap substantially in time with those of BLAST_p embryos (KS < 0.28, where the two-sample Kolmogorov-Smirnov statistic quantifies the distance between two distributions; Figure 1b-i,ii). We divided a total of >6200 BLAST-labeled embryos into a train-validation set and a strictly uncontaminated test set of a randomly selected 28% of the embryos. A fully automated blastulation classifier, termed SHIFRA B , was designed in two parts (see Experimental Section). Packet learning: A DNN was trained directly on the time-lapse image packets of the embryos, outputting a scalar value for each frame (temporal features) from the start of fertilization to the time of prediction (t p ). Embryo learning: The potential to blastulate was evaluated based on the packet scores (temporal features) of each embryo by training a Random Forest classifier. Random Forest was chosen because it provided the most accurate predictions compared with other classifiers that are suited for this task (XGBoost and logistic regression).
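The two-sample Kolmogorov-Smirnov statistic quoted above is the largest vertical gap between the empirical cumulative distributions of two samples. The following is a minimal NumPy sketch of that definition; the event times are made-up illustrative values, not data from the study, and the function name is our own. In practice, `scipy.stats.ks_2samp` reports the same statistic together with a p-value.

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximal gap between the two empirical CDFs."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # The gap can only change at observed values, so evaluate both CDFs there.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Hypothetical morphokinetic event times in hours (illustration only).
t_blast_p = [25.0, 26.5, 27.0, 28.0, 30.0]
t_blast_n = [26.0, 27.5, 29.0, 31.0, 33.0]
ks = ks_statistic(t_blast_p, t_blast_n)
print(f"KS = {ks:.2f}")
```

A value near 0 means the two temporal distributions are nearly indistinguishable, and a value of 1 means they do not overlap at all.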
Predictive strength was quantified using the area under the curve (AUC) of the receiver operating characteristic (ROC). Blastulation prediction AUC, which was evaluated for test-set embryos, increased monotonically with time of prediction, t p (Figure 1c): 0.65 at 48 h, 0.73 at 72 h, 0.88 at 96 h, and 0.94 at 110 h. To confirm that SHIFRA B can significantly predict blastulation also of high-quality embryos, we calculated the AUC for a cohort of embryos that reached at least the 8-cell cleavage state (8C_p). As expected, BLAST-prediction AUC of 8C_p embryos was lower than for the total embryo population, yet it reached 0.63 at 72 h, 0.84 at 96 h, and 0.91 at 110 h (Figure 1c inset). Automated blastulation prediction by SHIFRA B was compared with the manual morphokinetic classifier developed by Milewski et al. for five-cells positive (5C_p) embryos. [8] AUC values were comparable, with a small advantage to SHIFRA B . Blastulation rate shows a near-perfect monotonic correlation with SHIFRA B prediction but not with Milewski prediction (Figure 1d). In the clinic, binary classification is obtained by setting the threshold values of SHIFRA B that define negative prediction (below threshold) and positive prediction (above threshold).
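As an illustration of the two quantities used above, the sketch below computes the ROC AUC as the probability that a randomly chosen positive embryo receives a higher score than a randomly chosen negative one, and converts scores into a binary prediction with a threshold. The function names and the example scores are hypothetical; `sklearn.metrics.roc_auc_score` provides an equivalent AUC.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

def classify(scores, threshold):
    """Positive prediction above the threshold, negative otherwise."""
    return np.asarray(scores, dtype=float) > threshold

# Hypothetical classifier scores for four embryos (illustration only).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
print(classify(scores, threshold=0.5))
```

The pairwise formulation is quadratic in the number of embryos, which is acceptable for a sketch; rank-based implementations scale better on large test sets.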
Herein, we present retrospective blastocyst prediction statistics by setting two threshold values that support positive predictive value 0.91 (Figure S8a-i, Supporting Information) and sensitivity 0.98 (Figure S8a-ii, Supporting Information). To test generality, we conducted a fivefold stratified cross-validation yielding an average AUC = 0.83 ± 0.02 STD (Figure S6a, Supporting Information). In addition, we compared the blastulation prediction of test-set embryos from each clinic separately (AUC = 0.78 ± 0.07 STD; Figure S6b, Supporting Information) and performed leave-one-clinic-out cross-validation where classifiers were trained on embryos from three clinics and tested on the fourth (Figure S6c, Supporting Information). Because the H3 clinic is the largest provider of BLAST-labeled embryos, leaving it out leaves the smallest training set, and the H3 AUC was lower than that of the other clinics. The above-average AUC of H4 is likely skewed by the small number of, and the uneven ratio between, BLAST_p and BLAST_n test-set embryos. Temporal comparison between the morphokinetic events and intervals of embryos that were obtained from young (age < 32) and older (age > 38) women showed negligible differences (KS < 0.08) except for time of start of blastulation (tSB), which occurred on average 3 h earlier, and the tM-tSB interval, which was 1.5 h shorter in embryos derived from young women (Figure S9, Supporting Information). To verify that blastulation prediction was not confounded by maternal age, we divided all test-set embryos into four age groups and calculated AUC separately for each (Figure S6d, Supporting Information). Indeed, blastulation prediction AUC was comparable between maternal age groups: AUC = 0.75 ± 0.03 STD. Collectively, we establish accuracy, robustness, and generality of a fully automated day-3 prediction of embryo blastulation.
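The leave-one-clinic-out scheme described above can be sketched as an index generator that holds out all embryos of one clinic at a time. This is only an illustration of the splitting logic with made-up clinic assignments; the classifier training itself is omitted, and scikit-learn's `LeaveOneGroupOut` offers the same kind of split.

```python
import numpy as np

def leave_one_clinic_out(clinic_ids):
    """Yield (held_out_clinic, train_indices, test_indices) for each clinic."""
    clinic_ids = np.asarray(clinic_ids)
    for clinic in np.unique(clinic_ids):
        test_mask = clinic_ids == clinic
        # Train on every other clinic, test on the held-out one.
        yield clinic, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical clinic assignment for six embryos (illustration only).
clinics = ["H1", "H2", "H3", "H3", "H4", "H1"]
splits = list(leave_one_clinic_out(clinics))
for clinic, train_idx, test_idx in splits:
    print(clinic, train_idx.tolist(), test_idx.tolist())
```

Reporting the mean and standard deviation of the per-clinic AUCs then gives a summary of the form "AUC = mean ± STD".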

Prediction of Embryo Implantation
Relative to blastulation prediction, KID prediction is required to overcome two major obstacles. 1) Unlike blastulation, which depends on the capacity of the embryo to develop in the incubator under controlled conditions, implantation also depends on endometrial receptivity, a parameter that is not accounted for during training. 2) Training is limited to embryos that had been preselected for transfer according to existing morphological and/or morphokinetic protocols. As a result, training is restricted to a dataset of morphokinetically homogeneous KID-labeled embryos. Accordingly, the temporal distributions of morphokinetic events and intervals of KID_p and KID_n embryos are almost fully overlapping (Figure 2a-i,ii). We divided the dataset of >5500 KID-labeled embryos into a train-validation set and an uncontaminated test set that consisted of a randomly selected 21% of the embryos (Figure S1c, Supporting Information). We designed an automated two-stage KID classifier that consisted of packet learning followed by embryo learning. Similar to SHIFRA B , we term the KID classifier SHIFRA K . To improve predictive strength and robustness, packet learning combined three DNNs, which were trained separately on KID-labeled embryos, and one DNN that was trained on BLAST-labeled embryos.
In all cases, we used the same preprocessing steps and all networks had the same architecture (Figure S5, Supporting Information). The temporal features were obtained by summing together the four scalar values that were generated by each network for each frame. Embryo learning was performed using a logistic regression classifier that was trained on all the temporal features that belong to each embryo (see Experimental Section). KID predictive strength increased with time of prediction t p , as evaluated for the same cohort of day-5 transferred test-set embryos (Figure 2b). AUC increases slowly from 48 to 84 h and more rapidly from 84 h onward. Compared with the manual morphokinetic KIDScore decision support tools (Vitrolife), embryo implantation prediction by SHIFRA K is as accurate as KIDScore-D3 on day-3 and more accurate than KIDScore-D5 on day-5 as evaluated for the same test-set embryos (Figure 2b-inset). Day-5 predictive strength of SHIFRA K remains high despite the fact that 98% of the transferred embryos were blastocysts (very high-quality embryos) and endometrial receptivity was likely an important factor. Prediction of implantation of day-5 transferred embryos reaches AUC = 0.71 (calculated on test-set embryos). Satisfyingly, implantation rate positively correlates with SHIFRA K , thus demonstrating its clinical utility (Figure 2c). AUC values for the H1 (above average) and H2 (below average) clinics were likely skewed due to a highly uneven ratio between KID_p and KID_n train-set embryos (Figure S1c, Supporting Information). We also tested generality via leave-one-clinic-out cross-validation: AUC = 0.64 ± 0.04 STD (Figure S7c, Supporting Information). The below-average AUC for H3 is consistent with the smallest available train-set once H3 embryos were removed. The fraction of implanted embryos out of all transferred embryos (considering also KID_u transfers) was 2.7-fold higher for young women (age < 32; 39%) than for older women (age > 38; 14%).
To verify that SHIFRA K is not biased by maternal age, we divided the embryos into four age groups and tested KID prediction on each. Satisfyingly, variation in day-5 KID prediction across age groups was small: average AUC = 0.75 ± 0.03 STD (Figure S7d, Supporting Information). We thus confirm automated prediction of embryo implantation with superior predictive strength of day-5 transfers and verify generality across clinics and age groups.

SHIFRA K Differentiates between Implanted and Nonimplanted Blastocysts
Aneuploidy might impair embryo implantation but permit blastocyst formation. [23] To study the relationship between blastulation and implantation potentials, we analyze the classification of blastulation and implantation colabeled embryos: 121 BLAST_p-KID_n (B p K n ) embryos and 275 BLAST_p-KID_p (B p K p ) embryos (Figure 3a). Blastulation and implantation prediction scores are weakly positively correlated (Pearson correlation score 0.3), which is indicative of common visual elements. Consistent with their BLAST_p labels, blastulation prediction score distributions of B p K n and B p K p embryos overlap. In contrast, average implantation scores of B p K p embryos were 40% higher than those of B p K n embryos (Figure 3b), indicating that SHIFRA K differentiates between blastocysts that have the capacity to implant and blastocysts that do not.

The Last <10 h Are Sufficient for Prediction of Embryo Quality
To gain additional insight into the underlying mechanisms of embryo classification, we analyze the visual information embedded within the time-lapse images that SHIFRA B and SHIFRA K are sensitive to. To this end, we use the SHapley Additive exPlanations (SHAP) methodology for quantifying the impact of the temporal features on embryo prediction. [24] We identify the frames that contribute the most to accurate embryo classification by setting the sign of the SHAP values of each feature according to the BLAST or KID label of the embryos (negative: -1; positive: +1) and averaging across embryos (mean adjusted SHAP). In this manner, positive (negative) SHAP values of temporal features of positively (negatively) labeled embryos contribute oppositely to the mean adjusted SHAP compared with positive (negative) SHAP values of temporal features of negatively (positively) labeled embryos. We find that the contribution of the temporal features to blastulation prediction [...] To study whether the top-ranked temporal features can direct blastulation and implantation prediction without including the rest of the temporal features, we retrained the embryo-learning blastulation and implantation classifiers using only the top-ranked features. Indeed, we found that the top-ranked features were sufficient for reaching high predictive power with AUC values (BLAST: AUC = 0.75; KID: AUC = 0.7; Figure 4c-i,ii) comparable to those of SHIFRA B (AUC = 0.74, Figure 1c-inset) and SHIFRA K (AUC = 0.71, Figure 2b-inset). This indicates that the top-ranked temporal features as defined here mark the developmental potential of the embryos to blastulate and to implant.
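The mean adjusted SHAP described above can be sketched in a few lines of NumPy. The sketch assumes that a matrix of per-embryo, per-feature SHAP values has already been computed elsewhere; the function names and the toy numbers are hypothetical and serve only to show the sign adjustment and averaging.

```python
import numpy as np

def mean_adjusted_shap(shap_values, labels):
    """Multiply each embryo's SHAP values by its label (+1 or -1), then average.

    shap_values: array of shape (n_embryos, n_temporal_features)
    labels:      array of shape (n_embryos,) with entries +1 or -1
    """
    shap_values = np.asarray(shap_values, dtype=float)
    signs = np.asarray(labels, dtype=float)[:, None]
    return (signs * shap_values).mean(axis=0)

def top_ranked_features(shap_values, labels, n_top):
    """Indices of the n_top features with the largest mean adjusted SHAP."""
    adjusted = mean_adjusted_shap(shap_values, labels)
    return np.argsort(adjusted)[::-1][:n_top]

# Toy example: two embryos, two temporal features (illustration only).
shap_matrix = [[0.2, -0.1],   # positively labeled embryo
               [-0.3, 0.1]]   # negatively labeled embryo
labels = [+1, -1]
print(mean_adjusted_shap(shap_matrix, labels))
print(top_ranked_features(shap_matrix, labels, n_top=1))
```

A feature scores high when it pushes positively labeled embryos up and negatively labeled embryos down, that is, when it contributes to correct classification in both classes.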

SHIFRA B & SHIFRA K Are Sensitive to Dynamic Features Beyond Morphological/Morphokinetic States
The ambiguity in identifying the actual visual elements that direct neural network prediction is one of the major drawbacks in deep learning. To obtain a mechanistic insight into how embryo prediction by SHIFRA B and SHIFRA K works, we present the top-ten positive SHAP images and top-ten negative SHAP images for each of the top-ranked SHIFRA B and SHIFRA K temporal features (Figure 5 and 6). Herein, we mark the cleavage-stage embryos with four cells or less (blue frames), embryos with asymmetric blastomeres (green frames), and highly fragmented embryos (red frames) in the images that contributed the most to positive and negative blastulation prediction by SHIFRA B (Figure 5). Temporal features 1 (72 h), 2 (70 h), and 4 (71 h) were most sensitive to four-cell cleavage stage embryos and were identified as SHAP-negative images. Temporal features 1, 3 (71.7 h), and 8 (66.3 h) were most sensitive to uneven blastomere size and temporal features 2 and 4 were most sensitive to embryo fragmentation, which also obtained negative-SHAP scores (Figure 5). Similarly, we marked the cleavage and morula stage embryos with noncompacted blastomeres (blue frames) and compacted morulae (green frames) in the images that contributed the most to positive and negative KID prediction by SHIFRA K (Figure 6). We found that temporal features 10 (106.7 h) and 13 (107.3 h) were most sensitive to the appearance of noncompacted blastomeres and temporal features 3 (109.7 h), 9 (100.3 h), and 11 (103.7 h) were most sensitive to morula. In both cases, these morphological characteristics directed negative-SHAP KID prediction (Figure 6). The morphokinetic and morphological characteristics that we discuss herein are depicted only by a small fraction of the images of the temporal features, indicating that the determinant visual elements that direct blastulation and implantation prediction are not distinguished by human-level perception.
Figure 6. Top positive versus negative SHAP-scored embryo frames directing KID prediction. Top ten SHAP-positive versus top ten SHAP-negative embryo frames are shown for the selected temporal features by mean adjusted SHAP for SHIFRA K KID prediction at 110 h. Cleavage stage or morula stage embryos exposing noncompacted blastomeres (blue frames) and morula stage embryos (green frames) are marked.

Discussion
The advantage of using retrospective over prospective embryo transfer datasets for DNN training is their ethical feasibility and far greater size. However, prediction accuracy is limited by the lack of critical information about endometrial receptivity and by the use of a homogeneous dataset of embryos that were retrospectively preselected for transfer according to established morphological and/or morphokinetic criteria. To overcome these limitations, we first trained a DNN on blastulation outcome using a heterogeneous set of labeled embryos and integrated it with three additional DNNs that were trained separately on implantation outcome. Indeed, accuracy of day-3 implantation prediction was equally low by SHIFRA K and KIDScore-D3, but it was significantly higher on day-5 by SHIFRA K , proving to be more accurate than implantation prediction by KIDScore-D5 (Figure 2b). Unlike the current state-of-the-art embryo classification algorithms, [8,10] deep learning provides full automation and standardization of embryo classification, which are both important for clinical adoption. [25,26] The SHIFRA classifiers, which provide early evaluation of blastulation and implantation potential, are the first step toward the development of a decision-making tool that will provide a personalized, multistep, embryo transfer strategy. [27,28] Namely, given a finite number of embryos obtained from a patient and their assessed quality, this tool will specify the multistep order and timing of embryo transfers (including transfers of multiple embryos), as well as which embryos are to be cryopreserved for subsequent transfers. The general framework presented herein, together with larger datasets of embryo data, opens the door for the implementation of such personalized clinical tools that will optimize conception rates while shortening time to pregnancy in IVF treatments.

Experimental Section
The SHIFRA Database, Embryo Annotation, and QA: We developed a PostgreSQL database with a front-end website that supported display, query, and data annotation (Hebrew University IT). Maintenance of the SHIFRA database was outsourced (CHELEM LTD), and data curation was conducted by a full-time trained embryologist. Anonymized time-lapse video files and the corresponding metadata were imported from five hospitals (Table S1, Supporting Information). Data were imported under the approval of the Helsinki ethical committee in each hospital.
Only embryos that were fertilized via intracytoplasmic sperm injection (ICSI) and showed two-pronuclei appearance (2PNa) inside the incubator were included. In this manner, we accurately defined time of fertilization, discarded nonfertilized embryos, and obtained a full morphokinetic profile starting from tPNa. Preimplantation genetic screening/diagnosis (PGS/PGD) tested embryos were discarded as well. Morphokinetic annotations were performed by qualified and experienced embryologists in each IVF clinic adhering to established protocols. [2,21] QA was carried out by comparing the morphokinetic annotations of 253 randomly selected embryos with the annotations of an expert embryologist in a blinded manner (Dr. IHV, Soroka Medical Center, Figure S2a, Supporting Information).
Labeling of Embryo Blastulation: Identification of arrested embryos that cannot advance toward blastulation was based on their morphokinetic profiles (Figure S2b-i,ii, Supporting Information) and the time windows for advancing from one morphokinetic event to the next (Table S2 and S3, Supporting Information).
Morphokinetic Events: Each embryo was represented by its total time of incubation measured from fertilization, t inc , and its latest developmental state reached inside the incubator, S n (Figure S3a-i, Supporting Information). Embryos that have the capacity to advance to the next developmental state if incubation is extended are located within the orange regions, which are bounded by the 1st percentile of the temporal distribution of the morphokinetic event S n and the 99th percentile of the temporal distribution of the consecutive morphokinetic event S n+1 (Table S2, Supporting Information). Embryos located in the red regions, which are bounded from below by the 99th percentile of the consecutive morphokinetic event S n+1 , missed the permitted time window for advancing toward the next developmental state. For example, an embryo that was incubated for 96 h and reached only the four-cell state was arrested and was labeled 4-cells positive, 5-cells negative (4Cp5Cn).
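The time-window rule above can be sketched as follows. The percentile bounds are computed here from synthetic event times that merely stand in for the annotated distributions of Table S2; the function names and all numbers are hypothetical.

```python
import numpy as np

def advance_window(times_current, times_next):
    """Bounds of the 'orange' region for state S_n.

    Lower bound: 1st percentile of the times at which S_n is reached.
    Upper bound: 99th percentile of the times at which S_n+1 is reached.
    """
    lower = float(np.percentile(times_current, 1))
    upper = float(np.percentile(times_next, 99))
    return lower, upper

def is_arrested(t_inc, window):
    """'Red' region: incubated past the 99th percentile of the next event."""
    _, upper = window
    return t_inc > upper

# Synthetic stand-ins for annotated t4 and t5 distributions (hours).
times_t4 = np.linspace(34.0, 50.0, 101)
times_t5 = np.linspace(40.0, 70.0, 101)
window_4c = advance_window(times_t4, times_t5)

# An embryo still at the four-cell state after 96 h missed its window.
print(is_arrested(96.0, window_4c))
# An embryo at the four-cell state after 60 h may still advance.
print(is_arrested(60.0, window_4c))
```

Embryos that are not arrested by this rule and have not yet reached the next state remain undetermined, which corresponds to the orange regions of the text.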
Morphokinetic Intervals: A similar statistical analysis was performed to identify developmentally arrested embryos based on the temporal distributions of the morphokinetic intervals. Each embryo was represented by the time that lapsed between the time of the last morphokinetic event and the time of incubation, t inc - t n , which was plotted versus the latest developmental state reached inside the incubator, S n (Figure S3a-ii, Supporting Information). The orange regions were bounded by the 1st percentile and the 99th percentile (Table S3, Supporting Information). Embryos that were located in these regions reached developmental state S n and still hold the potential to advance to morphokinetic stage S n+1 if incubation is extended. In contrast, embryos that were located in the red regions missed their interval time window and were thus arrested in developmental state S n . For example, an embryo that had completed four-cell cleavage and then spent 36 h without advancing to the next cleavage event was arrested.
BLAST-Labeling: Adhering to a stringent methodology, embryos that were found to be developmentally arrested either based on their morphokinetic profiles or based on their interval profiles were classified BLAST_n (Figure S3c-ii, Supporting Information). Embryos that reached start-of-blastulation (SB) inside the incubator were labeled BLAST_p and were located in the green regions (Figure S3c-i, Supporting Information). Obtaining BLAST_p embryos required lengthy incubations, which was facilitated by clinics that transfer embryos on day-5 (H3; Figure S3d and S1b, Supporting Information). The remaining embryos that were located within the orange regions were classified BLAST-unknown (BLAST_u, Figure S3c-i, Supporting Information).
Labeling of Embryo Implantation and Clinical Pregnancy: Implantation of transferred embryos in the uterus was labeled by KID. KID status was determined based on established protocols by comparing the number of transferred embryos with the number of implanted embryos as determined by the measured number of gestational sacs on week five of pregnancy. In the case that the number of transferred and implanted embryos was equal, the embryos were labeled KID-positive (KID_p). KID-negative (KID_n) corresponds to the case of no implanted embryos. KID-unknown (KID_u) marks embryos whose implantation outcome cannot be determined, for example, when three embryos were transferred and only one or two were implanted. Clinical pregnancy (CP) accounts for the implantation of a viable embryo as determined by fetal heart beat ultrasound measurement on week 7 of pregnancy. Positive CP (CP_p) accounts for the case that the number of transferred embryos, gestational sacs, and fetal heart beats were the same. Negative CP (CP_n) includes transferred embryos that failed to implant (KID_n) and KID_p embryos with no fetal heartbeat.
KID Classification by KIDScore-D3 and KIDScore-D5 Decision Support Tools: Prediction of embryo implantation was performed by KIDScore-D3 on day-3 (66 h prediction time) and by KIDScore-D5 on day-5 (110 h prediction time) according to the manufacturer's protocols (Vitrolife). Specificity and sensitivity were evaluated as a function of KIDScore values.
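The KID and CP labeling rules described above, which compare the number of transferred embryos with the numbers of gestational sacs and fetal heartbeats, can be sketched as two small functions. The function names are our own, and two choices are assumptions not stated in the text: transfers with more sacs than transferred embryos are treated as undetermined, and a hypothetical `CP_u` label is returned for cases the text does not classify.

```python
def kid_label(n_transferred, n_sacs):
    """Known Implantation Data label shared by all embryos of one transfer."""
    if n_transferred <= 0:
        raise ValueError("at least one embryo must be transferred")
    if n_sacs == 0:
        return "KID_n"   # no embryo implanted
    if n_sacs == n_transferred:
        return "KID_p"   # every transferred embryo implanted
    return "KID_u"       # outcome cannot be attributed (assumption: also if sacs > transferred)

def cp_label(n_transferred, n_sacs, n_heartbeats):
    """Clinical pregnancy label based on fetal heartbeats on week 7."""
    kid = kid_label(n_transferred, n_sacs)
    if kid == "KID_p" and n_heartbeats == n_transferred:
        return "CP_p"
    if kid == "KID_n" or (kid == "KID_p" and n_heartbeats == 0):
        return "CP_n"
    return "CP_u"        # hypothetical label for cases not covered by the text

print(kid_label(3, 1))     # three transferred, one implanted
print(cp_label(1, 1, 0))   # implanted, but no fetal heartbeat
```

Single embryo transfers never produce `KID_u` under these rules, which is one reason they yield cleaner training labels.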
Statistical Analysis and Software: Statistical analyses were performed using NumPy, Scikit-Learn, and SciPy Python packages (Python Software Foundation) and R (The R Foundation). Graphs were generated using Matplotlib and Seaborn Python packages (Python Software Foundation).
Preprocessing of Embryo Video Files: The convergence of supervised neural networks depends on the balance between the number of labeled objects and their total size. We preprocessed the input videos (≈100 MB) and decreased their size 16-fold while capturing the dynamic nature of preimplantation embryo development. Preprocessing included 4×-resizing of the embryo images (2× biaxially) and segmentation of the embryo region of interest. Below, we describe the steps for performing automated image preprocessing using a U-Net classifier. [29]
Semiautomated Interactive Labeling of Embryo Segmentation: The first goal was to generate a large set of validated embryo images labeled by binary masks of the embryo region of interest that will be used for training a fully automated classifier. Using the semiautomated interactive GrabCut algorithm, [30] the embryos were framed manually by a bounding box, allowing for an initial segmentation of the embryo (Figure S4a, Supporting Information). Next, embryo segmentation was adjusted interactively by scribbling the regions to include within the embryo region of interest (ROI) and the regions to exclude from the embryo ROI, allowing fine tuning of embryo segmentation. Multiple segmentation-scribbling-fine-tuning iterations were allowed. Once embryo segmentation was approved, morphological operations were executed (filling holes, contour smoothing, and dilation), and a binary mask was defined. In total, we generated binary masks for 2700 images spanning all relevant developmental states of the embryos.
Automated Embryo Segmentation Using An Automated U-Net Pixel Classifier: A U-Net classifier was trained using the images that were labeled interactively via GrabCut. Segmented images were resized to 256 × 256 pixels and divided into a train set (2350 images), a validation set (200 images), and an uncontaminated test set (150 images). Training was performed with batches of 100 randomly selected images at 1500 steps per epoch. Each batch of images underwent random augmentation that included 0°-180° rotations, horizontal flips, 0-0.1 shearing, and 0-0.1 zooming. In this manner, training on exactly the same images on different steps was prevented. Network convergence was reached within 20 epochs. The U-Net classifier obtained embryo images as an input and generated a binary mask output of the embryo ROI resized to 500 × 500 pixels (Figure 2b-i). Accuracy of embryo segmentation was evaluated on test-set embryos using the Dice coefficient, which measures the overlap between the segmented pixels and the binary mask labels. The Dice coefficient ranged between 0.91 and 0.94 across all developmental states (Figure S4b-ii, Supporting Information). Embryo ROIs were evaluated for all images in the database. Empty well images were identified based on small mask area and discarded.
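For reference, the Dice coefficient used above equals twice the number of overlapping foreground pixels divided by the total number of foreground pixels in the two masks. A minimal NumPy sketch follows; the function name and the toy masks are our own, and returning 1.0 when both masks are empty is a convention we assume for illustration.

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |pred AND true| / (|pred| + |true|) for binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    total = pred.sum() + true.sum()
    if total == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return float(2.0 * np.logical_and(pred, true).sum() / total)

# Toy 2 x 2 masks (illustration only): one shared pixel, three foreground pixels in total.
predicted = [[1, 1],
             [0, 0]]
reference = [[1, 0],
             [0, 0]]
print(dice_coefficient(predicted, reference))
```

A value of 1 means the predicted ROI coincides with the labeled mask, and a value of 0 means the two do not overlap.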
Grouping of Five-Frame Packets: All segmented embryo frames were resized 2× along each axis into 128 × 128 pixel images. Five consecutive resized and segmented frames of the same focal plane and the same embryo were grouped together into packets, P m n , of embryo m and first frame time index n (Figure S5a, Supporting Information). Each packet was associated with a blastulation and/or implantation label y m . These packets served as the input objects of the packet learning neural networks (see in the following sections).
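The grouping of consecutive frames into packets can be sketched as a sliding window over the frame stack of one embryo and one focal plane. The text does not state whether consecutive packets overlap, so the `stride` parameter below (default 1, meaning one packet per starting frame) is an assumption, as are the function name and the empty placeholder frames.

```python
import numpy as np

def make_packets(frames, packet_size=5, stride=1):
    """Group consecutive frames into packets.

    frames: array of shape (n_frames, height, width) for one embryo and focal plane.
    Returns a list of (first_frame_index, packet) pairs, where each packet has
    shape (packet_size, height, width).
    """
    frames = np.asarray(frames)
    n_packets = (frames.shape[0] - packet_size) // stride + 1
    if n_packets <= 0:
        return []  # video too short to form a single packet
    return [(n, frames[n:n + packet_size])
            for n in range(0, n_packets * stride, stride)]

# Eight placeholder frames of 128 x 128 pixels (illustration only).
video = np.zeros((8, 128, 128), dtype=np.float32)
packets = make_packets(video)
print(len(packets), packets[0][1].shape)
```

Each returned index plays the role of the first frame time index n of packet P m n , and every packet of an embryo inherits that embryo's label.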
Packet Learning-Neural Network Design and Training Parameters: The network was implemented using the PyTorch framework, [31] and trained using Stochastic Gradient Descent with Nesterov momentum of 0.9. The input objects of the network were packets of five preprocessed frames P m n (Figure S5a, Supporting Information). Training batches included k = 4 packets from each of K = 8 embryos within 12 h windows. Training sets were balanced by selecting positive- and negative-labeled embryos (both blastulation and implantation) in inverse proportion to their respective frequencies.
Packets of the three central focal planes (-15, 0, and +15 μm) were used for training, whereas validation and test-set embryos contributed packets only of the middle focal plane. We used a residual network architecture [32] consisting of 13 layers that included seven residual blocks and two fully connected layers (Figure S5b, Supporting Information). The last layer consists of w input neurons (see below) that were associated with nonoverlapping time windows as defined for blastulation and implantation prediction networks (see in the following paragraphs). Packet score output was determined by selection of one of the input neurons according to the time index n of the first frame of the packet (Figure S5b, Supporting Information).
Packet Learning-Weighted Logistic Loss: We postulate that embryo developmental potential is marked by scarce dynamic events that last 30-60 min and are thus captured by individual packets. Hence, a high-quality embryo will have only a few packets that are scored high, whereas all packets of a low-quality embryo will be scored low. This principle was implemented by weighting the logistic loss as follows. [33] The weighted loss of embryo m, l m , is calculated based on all k packets. Note that the sum of weights w m n across the k packets of embryo m is 1. γ is the softness parameter. At the limit γ → 0, the weight becomes 1/k, independent of the scores of the packets. At the opposite limit of large γ, w m n approaches 1 only for the packet of maximal score. The problem with approaching this limit is that it becomes increasingly difficult for the network to converge. The DNN weights were optimized to minimize the batch loss L over all K embryos in the batch. Performance was optimized by setting γ = 3. For a negative-labeled embryo (y m = -1), even if a single packet obtains a positive score, its weight w m n will be the highest and the loss of the embryo l m will be large. As a result, convergence is approached only if all packets of a negative embryo obtain negative scores. In contrast, one packet with a high positive score is sufficient for obtaining a small loss for a positive embryo (y m = +1). For these embryos, the packets with low scores will have small weights, the packet with the highest score will have the highest weight, and a small loss will be obtained.
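The explicit loss formulas are not reproduced in this text, so the following NumPy sketch shows only one form that is consistent with the properties stated above, and it should be read as an assumption rather than as the published definition. Here the weights are a softmax of the packet scores scaled by γ (uniform 1/k as γ → 0, concentrated on the maximal score for large γ), the per-packet loss is the logistic loss log(1 + exp(-y s)), and the batch loss is the mean over embryos. All function names are hypothetical.

```python
import numpy as np

def packet_weights(scores, gamma):
    """Assumed softmax weights over the k packet scores of one embryo (sum to 1)."""
    z = gamma * np.asarray(scores, dtype=float)
    z = z - z.max()               # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def embryo_loss(scores, label, gamma=3.0):
    """Weighted logistic loss of one embryo; label is +1 or -1."""
    scores = np.asarray(scores, dtype=float)
    per_packet = np.logaddexp(0.0, -label * scores)   # log(1 + exp(-y * s))
    return float(np.sum(packet_weights(scores, gamma) * per_packet))

def batch_loss(batch_scores, batch_labels, gamma=3.0):
    """Assumed batch loss: mean of the embryo losses over the K embryos."""
    losses = [embryo_loss(s, y, gamma) for s, y in zip(batch_scores, batch_labels)]
    return float(np.mean(losses))

# One high-scoring packet among low-scoring ones (illustration only).
scores = [-2.0, -2.0, -2.0, 4.0]
print(embryo_loss(scores, +1))   # small loss for a positive embryo
print(embryo_loss(scores, -1))   # large loss for a negative embryo
```

With these assumed definitions, a single confident packet dominates the loss of its embryo, which reproduces the asymmetry described in the text between positively and negatively labeled embryos.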
Packet Learning of Blastulation- and Implantation-Labeled Packets: Packet learning for blastulation prediction and for implantation prediction was performed using the same DNNs as described earlier (Figure S5b, Supporting Information). Network training for blastulation prediction was performed using w = 25 nonoverlapping time windows (0-4; 4-8; 120; >120). DNN training using BLAST-labeled and KID-labeled embryos typically converged within 20-to-60 epochs.
Unlike blastulation, implantation outcome of transferred embryos depended not only on their developmental competence, but also on endometrial receptivity, which was not considered in the learning process. Therefore, packet learning for KID prediction was performed using an ensemble of three DNNs that were trained on KID-labeled packets and one DNN that was trained on BLAST-labeled packets (four networks in total). The three KID networks were trained using different hyperparameters (initial weights and learning rates). In this manner, blastulation prediction packet learning generates one frame score, whereas KID prediction packet learning generates four scores for the first image of each packet that are summed into one final frame score.
Embryo Learning: Scoring the developmental potential of embryos was performed by training a second classifier. Each embryo was represented by a vector of frame scores obtained by packet learning as described earlier. Next, temporal features were generated by interpolation of the vectors of frame scores, thus obtaining a synchronized representation of all embryos. To allow embryo prediction at different time points (time of prediction, t p ), different classifiers were trained independently. Given t p , a classifier was trained on all train-set labeled embryos of video length greater than t p using the temporal features earlier than t p . BLAST prediction was performed using a Random Forest classifier, whereas KID prediction was performed using logistic regression. This choice was made based on the performance of multiple classifiers that were tested for BLAST and KID prediction. In both cases, training parameters were optimized via grid-search fivefold cross validation.
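The synchronization step of embryo learning can be illustrated by interpolating the frame scores of each embryo onto a common time grid that ends at the time of prediction. The grid spacing, the use of linear interpolation, and the function name below are assumptions made for this sketch; the text states only that temporal features were generated by interpolation.

```python
import numpy as np

def temporal_features(frame_times_h, frame_scores, t_p, step_h=1.0 / 3.0):
    """Interpolate frame scores onto a common grid from 0 h to t_p.

    frame_times_h: increasing acquisition times of the scored frames (hours)
    frame_scores:  packet-learning score of each frame
    step_h:        hypothetical grid spacing (about 20 min by default)
    """
    grid = np.arange(0.0, t_p + 1e-9, step_h)
    # np.interp is linear between samples and holds the end values outside them.
    return np.interp(grid, frame_times_h, frame_scores)

# Toy embryo with three scored frames (illustration only).
features = temporal_features([0.0, 1.0, 2.0], [0.0, 1.0, 2.0], t_p=2.0, step_h=0.5)
print(features)
```

Because every embryo is mapped onto the same grid, the resulting vectors have equal length and can be stacked into a feature matrix for the Random Forest or logistic regression classifier trained for a given t p .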