An Intelligent, User‐Inclusive Pipeline for Organic Semiconductor Design

Data‐guided methodologies for designing new materials are developing apace, yet advances for organic crystals have been infrequent. For the case of organic crystals, data‐guided design platforms involve two problems: the need to construct regression models that can predict solid‐state properties from molecular structures alone and the vast number of molecular structures that satisfy a targeted solid‐state property. In this paper, a new pipeline is presented for designing molecules with targeted solid‐state electronic properties. Here, the first problem is overcome by means of a novel visualization method for band structure data, which allows the user to specify their design preferences and simplify the creation of the required regression models. The second problem is overcome by allowing for user intervention in the selection of the final molecule for subsequent synthesis. The effectiveness of this pipeline is demonstrated by using it to design a simple new molecule with a targeted bandgap and impressive band dispersion. This pipeline is the first data‐guided method that can design new molecules on the basis of genuine solid‐state electronic properties, and with its emphasis on user‐inclusivity, it can easily be applied in collaborative or industry settings.


Introduction
Computational materials science, like most fields of science, has embraced the methods of data science. Over the last decade, a dramatic shift-away from calculations on single materials and toward the application of machine learning to large databases of materials-has played out in the literature. [1,2] Although this shift is still in progress, it is widely expected to result in a new scientific paradigm: "data-driven material science". [3,4,5] While "datadriven materials science" lacks a precise meaning, it is broadly DOI: 10.1002/adts.202300159 understood as the idea that large parts of the materials discovery pipeline, from the proposal of a hypothetical structure to its final synthesis, could be performed by computers and machines that operate according to machine learning algorithms. Aspects of data-driven material discovery have been demonstrated for certain types of materials, including single molecules, [6,7,8] molecular assemblies, [9] polymers, [10,11,12] and various types of organic solids. [13,14] Until now, however, the literature has been dominated by algorithmic developments, proofsof-principle, and numerous review articles on machine learning. The important question of how human expertise might figure in this emerging scientific paradigm has received surprising little attention.
Will future materials science involve human expertise at all? By considering ongoing trends in materials science within the wider context of automation, this question can be answered with a confident "yes". Economists have examined two possible outcomes of automation. [15] The first outcome is one of the widespread displacement of human workers by machines, which is characterized by significant job loss. [16,17] This outcome has been widely promoted in the popular media (see, e.g., Ref. [18]) and dominates public discourse. The second outcome is one in which human workers are not displaced by machines but simply change the way in which they work while staying in the same job. In the case of materials science, the first outcome would be a rather extreme situation in which human expertise is eliminated from the materials discovery pipeline all together. However, by considering a broad range of jobs across society, there is surprisingly little evidence that such outcomes are normal. Indeed, by accounting for the fact that most jobs involve a wide range of tasks, many of which cannot be easily implemented on a machine, economists have estimated that the actual degree of job loss by automation is only ≈9-15%. [19,20,21] Judging by these statistics, it appears unlikely that "data-driven materials science" would involve a significant substitution of machines for scientists. More likely, "data-driven materials science" will simply be a new mode of research in which scientists spend a greater fraction of their time on tasks that cannot be automated. In order to speculate on the future of materials science, it would therefore seem more useful to ask a different question: what aspects of the materials discovery pipeline would be difficult for a machine to perform?
Organic crystals are a type of material for which the difficulties of an entirely machine-operated material discovery pipeline are apparent. Organic crystals are crystals comprised of organic molecules held in-place by non-covalent interactions. Their electronic structures, which arise from coupling between molecular orbitals of neighboring molecules, are typically semiconducting in character, making organic crystals attractive for device applications such as light-emitting diodes, field effect transistors, and organic solar cells. [22,23,24] For such applications, organic crystals with specific bandgaps and high charge mobilities are required, and a major advantage of organic crystals is the massive size of the design space (arising from the essentially infinite possibilities of organic synthesis) that can be explored for achieving these properties. [25,26] However, the common approach to designing new organic crystals-making a sequence of small additions and adjustments to a promising initial structure in an attempt to tune electronic properties-is notoriously ineffective. Minor changes to molecule structure usually induce dramatic changes in crystal packing and molecular orbital shapes, causing bandgaps and charge mobilities to change unpredictably. Such unpredictable changes make it difficult to tune electronic properties on the basis of small molecular structure adjustments.
Faced with this problem, the possibility of applying data-driven design strategies to organic crystals becomes attractive. However, the application of these methods to organic crystals involves two unique problems. The first problem arises as follows: datadriven design strategies require the creation of regression models that map structural descriptors to the physical property of interest. For the case of organic crystals, an important point must be observed: the regression model must exclusively map molecule structure-derived descriptors to crystal-phase electronic properties. This point follows from the fact that, in order to be useful for designing new materials, the regression model should only map variables that can be controlled during the material synthesis process. Molecular structures are the only part of an organic crystal over which a synthetic chemist has direct control. Thus, while accurate regression models for bandgaps and other properties have been built on the basis of crystal structurederived descriptors, [27,28] these models cannot be used for material design purposes. This point has been recognized by several recent authors, who have devised regression models based on molecular-structure-derived descriptors and used them to design molecules for new organic crystals. However, instead of considering genuine solid-state properties, these models instead predict molecule-or dimer-level properties such as reorganization energy [29,30,31] and transfer integrals, [32] thereby sidestepping the problem altogether. The use of these models in material design applications is therefore unclear. Based on progress so far, it appears challenging to construct a regression model for crystal-state electronic properties on the basis of molecule-derived descriptors alone.
The second problem follows from the first. Suppose that such a regression model happened to be created. In order to predict new molecules on the basis of that model, one would need to optimize it with respect to molecular descriptors, minimizing the difference between the model's predictions and the desired solidstate property. However, this optimization would necessarily be non-unique, in the sense that a massive number of candidate molecules, each of which would be predicted to exhibit the de-sired crystal-phase electronic property, would exist. This massive number of molecules would reflect the many ways in which particular structural motifs, functional groups, crystal symmetries, and molecular packing modes can be combined and permuted in order to achieve a specific property value. From the point of view of a computer, each of the many molecules obtained from this optimization would be just as good as any other, as every one of them would be predicted to satisfy the desired solid-state property. Additional criteria, such as synthetic potential and novelty, would therefore need to be invoked in order for the machine to select a single molecule for subsequent synthesis. Such criteria are highly qualitative cannot be easily checked by a machine. On the other hand, they can be immediately and unambiguously evaluated by an experienced chemist. This fact shows that an optimal data-driven pipeline for designing new organic materials would be one that allowed for user intervention at some point.
In this paper, we present a data-guided pipeline for designing new molecules with targeted solid-state band structure properties. Our pipeline posits user inclusion as the means to overcoming the two problems stated above. Concretely, it overcomes the first problem by introducing an electronic structure chart (ESC), a method for visualizing the space of organic crystal band structures. The ESC allows a user to specify a region of the band structure space within which they would like to create a new material. A machine-learning algorithm then learns the molecular motifs that are contained in that region, generating a simpler binary classification model (rather than a full regression model for a continuous physical property value) for subsequent molecular design purposes. Our pipeline overcomes the second problem by allowing the user to select a molecule for subsequent synthesis among the candidates designed by the algorithm. As presented, our pipeline can be used to design new materials with a targeted bandgap and a targeted valance band ( or highest occupied molecular orbital (HOMO) band ) curvature. However, due to data size limitations, we illustrate the pipeline by considering a target bandgap and allow for user selection to identify a molecule with high hole mobility. We predict a new molecule with a high synthetic potential, a targeted bandgap, and impressive hole mobility (effective mass = 0.66). This pipeline can be generalized to inorganic crystals or other types of solid-state materials, and could easily be adopted in collaborative or industrial settings. This paper is organized as follows: In Section 2, we introduce our band structure database and explain how the electronic structure chart is generated. In Section 3, we present the regression algorithm for designing new molecules with target band structure properties, as specified by the user. In Section 4, we show how a user can select a single molecule from the output of the regression algorithm for subsequent synthesis. In Section 5, we validate the band structure properties of our selected molecule on the basis of crystal structure prediction and density functional theory calculations.

Data Collection and Electronic Structure Chart
The electronic structure chart (ESC) is a visual projection of a band structure database and the first component of our pipeline. In this work, the ESC is built from organic crystal bandgaps and valance (HOMO) band curvatures. It allows the user to specify the region of the band structure space within which they would respectively. Grey, white, blue, purple, and yellow parts represent carbon, hydrogen, nitrogen, zinc, and sulfur atoms, respectively. Red lines indicate unit cell boxes. Images drawn with Materials Studio Visualizer. [64] E-H) Electronic band structures calculated for pentacene, ZnPc, DNT-V, and DNT-W, respectively. Occupied and unoccupied bands are colored green and blue, respectively. The vertical thin red lines indicate Brillouin zone boundaries and the horizontal line indicates the Fermi level. Local bandgap and local curvature are illustrated. I-L) Electronic fingerprints for pentacene, ZnPc, DNT-V, and DNT-W, respectively. All plots were produced with R. [69] like to design a new material and subsequently direct the construction of a regression model for designing new molecules. We first describe how the database was compiled and how descriptors ("electronic fingerprints") for generating the ESC were calculated in Section 2.1 before introducing the ESC in Section 2.2.

Data Collection and Electronic Fingerprints
A database of ≈5500 organic crystal structures was compiled by random sampling from the Crystallography Open Database (COD) [33] (see Experimental Section). A full list of crystal structures used in this database is provided in Table S1 (Supporting Information). Four representative structures from this database (pentacene, ZnPc, DNT-V, [34] and DNT-W [35] ) are shown in Figure 1A-D. For each material in this database, electronic band structures were computed in a high-throughput manner via density functional theory (DFT). The band structures calculated for pentacene, ZnPc, DNT-V, and DNT-W are shown in Figure 1E-H, respectively. These band structures, as with almost all materials in this database, show a clear bandgap, indicating semiconductivity. While these bandgaps will underestimate the experimental ones by 1-2 eV due to the generalized gradient approximation used in the DFT calculations, [36] the band structure trends will be reliable.
It is well established that charge transport in organic crystals cannot be quantitatively modeled in the band transport picture due to the presence of strong electron-phonon coupling and polaron formation. [37][38][39][40][41][42][43] However, one must not confuse the use of band structures for modeling charge transport with the use of band structures for building descriptors for electronic properties. If we are to design new organic crystals with strong electronic coupling between neighboring molecules, then band structures are useful because the curvature of the bands reflects the magnitude of this electronic coupling. Electronic coupling parameters appear as coefficients in advanced treatments of charge transport in organic crystals as well (for example, see the various mobility expressions in Ref. [41]). Bandgaps also remain physically meaningful in the presence of electron-phonon coupling, although they tend to narrow slightly in the presence of thermal fluctuations in the nuclei's positions. [44] While certainly not a satisfactory means of modeling room-temperature charge transport, band structures contain a wealth of information useful for designing new materials.
In order to create the electronic structure chart (ESC), descriptors must be extracted from the band structures. For each band structure, descriptors were extracted in three steps. First, n kpoints from the Brillouin zone were selected at random. Second, for each of these k-points, the local gap k (the energy difference between the highest occupied band and the lowest unoccupied band at that k-point) and the local curvature c k (the second derivative of the energy of the occupied band at that k-point) were calculated. This step results in n pairs of points ( k, c k ). In the third step, a joint probability distribution P( , c) of local gaps and local curvatures across the Brillouin zone was created using the n pairs of points obtained in the previous step (see Experimental Section for further details). Examples of joint probability distributions obtained from the band structures of pentacene, ZnPc, DNT-V, and DNT-W are shown in Figure 1I-L, respectively. These joint probability distributions are referred to as electronic fingerprints and serve as descriptors for the electronic band structure. For a given organic crystal, the electronic fingerprint describes how the local gap and the local curvature jointly vary across the Brillouin zone. It therefore summarizes information on hole transport in the semiconductor as well as its optical properties at absolute zero temperature. These electronic fingerprints could be modified to describe electron transport by computing the curvature of the lowest unoccupied band as well.
Electronic fingerprints are potentially demanding to calculate, as a high-quality band structure over the entire Brillouin zone should be pre-calculated in order to obtain a uniform and reliable sample of local bandgaps and curvatures. In this paper (including in the examples in Figure 1), we actually sample k-points from along the Brillouin zone path used in the band structure calculation in order to minimize computational demands. This approximation should be acceptable because the Brillouin zone paths used here, which were generated with the AFLOW package, [45,46] cover all of the representative parts of the Brillouin zone.
Construction of the ESC requires a measure of dissimilarity between pairs of band structures. Because electronic fingerprints are probability distribution functions, it is possible to quantify these dissimilarities using techniques from probability theory. In this work, the dissimilarity between pairs of electronic fingerprints is measured using the Jansen-Shannon (JS) divergence. [47] Consider two organic crystals and their respective electronic fingerprints, P( , c) and P'( , c). The JS divergence between P and P' is defined as where M( , c) = (P( , c) + P'( , c))/2 and k is the Kullback-Leibler divergence, defined as and similarly for k(P', M). See Experimental Section for calculation details. In principle, the JS divergence ranges between 0 and log 10 (2) = 0.301. However, for the cases studied here, the JS distance varied between ≈0.25 and 0.301 for most cases. In Figure 1, the JS divergence between pentacene and ZnPc is 0.28, implying moderate overlap of electronic fingerprints, and the distance between DNT-V and DNT-W is 0.30, implying weaker overlap. Between ZnPc and DNT-V it is 0.301, implying negligible overlap between electronic fingerprints and hence quite different band structures.

Electronic Structure Chart for Organic Crystals
The electronic structure chart (ESC) is presented in Figure 2A. It was computed using the tSNE (t-distributed stochastic neighbor embedding) method (see Experimental Section for details). [48] The tSNE method generates a 2D projection of the database in such a way that the distances between nearby points correspond to the JS divergence in (2). In this diagram, each point represents an organic crystal structure from our database. Crystals that are close together in the ESC should therefore have similar band structures. Indeed, it can be seen that pentacene and ZnPc are located close together, whereas DNT-V and DNT-W are located further away. The ESC is therefore a graphical visualization of the space of organic crystal band structures.
In passing, note that by convention, the horizontal and vertical axes of the ESC do not have labels or units. The reason for this is that the tSNE method only preserves the JS divergence between nearby points but not between widely separated points. The concept of "distance" therefore can only be meaningfully applied at small scales on the tSNE plot but not on large scales, and for this reason axis units are not reported.
A close inspection of the ESC reveals that it is organized on two levels. On a global level, the compounds are organized according to their local gaps. This is clear from Figure 2A, in which the points are colored according to the average local gap of the material (the local gap averaged over the Brillouin zone path). Following the trail of points starting from the left-hand side, the points gradually change from red (indicating large gaps) to bright blue (indicating very small gaps). These two extremes correspond, respectively, to insulating, -bond-dominated compounds and metal-containing compounds with small gaps due to partially filled metal ion orbitals. Aromatic compounds and other -electron-heavy compounds populate the ESC between these two extremes (see Figure S1, Supporting Information, for examples).
Locally within the ESC, the organic crystals appear to be arranged according to local curvatures. Figure 2B presents the same ESC, but with the points colored according to the logarithm of the average local curvature (the absolute value of the local curvature averaged over the Brillouin zone path). While it appears that cases with lower curvature (blue points) tend to concentrate on the edges of the trail and that cases with higher curvature tend to concentrate in the middle (red points), it is difficult to draw unambiguous conclusions on the distribution of curvatures from Figure 2B alone. In order to determine whether local curvatures are distributed in a systematic, non-random fashion across the ESC, we computed a series of autocorrelation functions (ACFs) Figure 2. Electronic structure chart. A) In the electronic structure chart, each material from an organic crystal database is projected into a 2D plane in such a way that crystals with similar electronic fingerprints are placed close together. Each point indicates one organic crystal from the database. The locations of pentacene, ZnPc, DNT-V, and DNT-W are indicated. The points are colored according to their average local bandgap. B) As for (A), but with points colored according to the logarithm of the average local valance band curvature. C) Autocorrelation functions (ACFs) for the average bandgap, average band curvature, and random noise. Insert shows the ACF for the average bandgap over a longer range of random walk steps. Figures were produced using R and the Rtsne package. [69,70] as shown in Figure 2C. Roughly speaking, an ACF measures how quickly a quantity of interest changes as we move through the trail of points in the ESC (see Experimental Section for details).
The ACF corresponding to random noise (blue line) represents a base case in which independent random variates from a standard normal distribution were assigned to each crystal. In this case, ACF quickly decays to zero. This ACF can be taken as a control case, where the quantity of interest is randomly distributed across the ESC. When the quantity of interest is the average local curvature, the ACF decays appreciably more slowly (green line). This unambiguously shows that local curvatures are distributed in a non-random, systematic way across the ESC. When the quantity of interest is the average energy gap, the ACF decays extremely slowly (red), which reflects the slow variation of the bandgap over the entire ESC. The stark difference in ACF decay rates for the cases of average local gaps and average local curvatures reflects the two-level, global-versus-local, distribution of these properties across the ESC.

Regression Algorithm for Designing New Molecules
An essential part of any data-driven pipeline for organic crystal design is the construction of a regression model that exclusively maps molecule descriptors to band structure properties such as bandgap. In essence, the ESC allows one to replace the full regression problem (prediction of continuous property values) with a simpler classification problem (prediction of binary labels), facilitating the construction of such models. Concretely, the user selects an interesting "target region" of the ESC, following which a classification model is trained to predict whether or not a molecule belongs to the target region when crystallized.
The two-level, global-versus-local, manner in which bandgaps and curvatures are organized within the ESC allows the user to select the target region in one of two ways. The first way utilizes the global organization of the ESC and is useful for designing new organic crystals whose bandgap lies within a targeted range. Here, the user can select a region of the ESC that contains crystals with bandgaps within that range and utilize the structural information contained therein. The second way utilizes the local organization of the ESC, and might be used to design organic crystals with specific bandgaps and band curvatures. Here, the user would select a specific crystal such as pentacene (which possesses a high band curvature), and utilize the structural information from the other crystals in its vicinity.
In this work, we opt for the first of these ways and consider designing a new material whose bandgap falls within a targeted range. This is because the second of these ways requires a high density of data across the ESC and, hence, a larger database than we currently have. In this work, we (the users) considered the materials contained in the green box in Figure 2A as our target region. This region contains organic semiconductors with small bandgaps (≈1.0-2.0 eV), and also contains some high hole mobility compounds (pentacene, ZnPc, and C 60 ). The corresponding molecule design problem is summarized in Figure 3A.
The molecule design procedure involves three steps: i) building a molecule database; ii) using this database to build a regression model that predicts whether or not a molecule will belong to the target region upon crystallization; and iii) generating a large number of molecules and identifying promising candidates using the regression model.
The construction of the molecule database (step (i)) is described in Figure 3B. In the implementation discussed here, each crystal structure was first expanded into a 2 × 2 × 2 supercell, from which the isolated molecules were extracted. Cases that contained multiple types of organic molecules, organic molecule radicals, or isolated inorganic ions (such as K + ) were ignored. This procedure yielded 2264 molecules in total. Each molecule was given a label 1 or 0, depending upon whether its parent crystal structure belonged to the target region or not. The molecule structures were stored as SMILES strings for subsequent processing (see Experimental Section for details).
The performance of the regression model obtained for our case is shown in Figure 3C as a confusion matrix (step (ii)). This model  Figure 2). The goal of this part is to identify molecules within the blue box (target molecules). B) Construction of the molecule database for constructing the regression models. The molecules are extracted from the organic crystal database in Figure 2. C) Performance of the classification model, presented as a confusion matrix. TPR, FPR, FNR, and TNR are the true positive rate, false positive rate, false negative rate, and true negative rate, respectively. D,E) Performance of regression models for predicting boiling point and melting point (E), respectively, for organic molecules the molecules in the crystalline state. In (C) and (D), blue and red points correspond to training and test data, respectively. F) Process for predicting molecules with targeted solidstate properties. iQSPR within the XenonPy package [54] is used to predict candidate molecules on the basis of the molecule database and the models constructed in (C-E). Details are discussed in the text.
was constructed using the molecule database from step 1, ECFP descriptors with radius 4, [49] and the XGboost library. [50] Note that while this model was trained on binary data, it assigns labels based on whether or not the model output exceeds 0.5. The regression model achieved impressive accuracy, only misclassifying 7.7% of the molecules in the test data. In addition to this model, we actually built two additional regression models to predict the boiling points ("BP model") and melting points ("MP model") of each molecule in the crystalline state ( Figure 3D,E). These models were built in consideration of real device applications, as an organic semiconducting material requires a melting point more than ≈130°C in order to be compatible with de-vice fabrication conditions. Moreover, the incorporation of a regression model for the boiling point allows the protocol to avoid proposing unrealistic molecules in which the melting point exceeds the boiling point. The BP and MP models were trained using the data in Ref,[ [51] ] which provides SMILES strings and experimental boiling and melting points for thousands of organic compounds. The BP and MP models were constructed using pattern fingerprints (PatternFP) and extended connectivity fingerprints (ECFP) with radius 2, respectively, which were generated from the SMILES strings using the RDkit library. [52] These two models were also constructed using the XGboost library.
The generation of the final list of candidate molecules (step (iii)) is performed using the Bayesian inverse molecular design algorithm (inverse-QSPR) in the XenonPy package. [53,54] In short, this procedure takes the molecule database and the above regression models as input and generates a new list of candidate molecules based on the ones in the database. The melting points, boiling points, and classification of the new molecules are then predicted with the above models, and the ones that have desirable properties (high melting points and classification >0.5) are recorded. This procedure is iterated, each time using the results from the previous step to facilitate the generation of the candidate molecules. For details, see Ref. [54] In our case, we found that this procedure could consistently generate molecules with correctly predicted labels after 28 iterations. The final list contained 248 molecules (see Table S2, Supporting Information). This long list of molecules reflects the large number of ways in which organic crystals with bandgaps within the target range could be achieved, therefore bringing us to the second problem mentioned in the introduction.

User Selection of a Molecule and Validation of Solid-State Properties
From the point of view of a machine, each of the 248 molecules generated above is just as good as the other; each one is predicted to exhibit a bandgap within the target range when crystallized. In order to select a single molecule from the list, we allow for user intervention. In this work, the users in question are two professional synthetic chemists working in the chemical industry.

Selection of Material From Candidates
Molecule selection involves mainly qualitative criteria, which are difficult to implement with machine-learned models. More specifically, we eliminate the majority of the 248 compounds based on the criteria of structure planarity and structural simplicity. This results in 15 compounds, from which one is selected on the basis of synthetic potential and structural novelty.
While we did not utilize the ESC in such a way that molecules with high solid-state band curvature would be selectively generated, we can apply some simple chemical intuition in order to identify such molecules. Many, although not all, organic semiconductors with excellent hole transport characteristics have rigid and planar structures rich in -electrons. Typical examples include pentacene, ZnPc, DNT-V and DNT-W as shown in Figure 1. Such planar molecular structures allow for closely packed co-facial alignments of molecules in the crystalline phase, through which charges can easily move. As the first step in our material selection, we therefore remove all molecules whose aromatic rings will not align in the same plane. This reduces the number of candidate materials in the list from 248 to 36. This dramatic reduction is possible because most of the 248 structures predicted by the machine are highly complex, consisting of highly branched structures (such as #170, #208, or #240, as shown in Table S3, Supporting Information) or structures with multiple discrete aromatic moieties connected by alkyl chains (such as #116, #123, and #136). There are also good theoretical reasons for excluding these complex molecules, because their predicted properties are unlikely to be reliable. This is because the machine learning models described in the previous section were constructed using training sets containing relatively simple structures and were not trained to make predictions on such complex structures. In passing, we note that filtering based on planarity could be carried out by a machine as well.
In the second step of our selection, we remove structures that are very complex and would be difficult to synthesize in practice. This criterion is qualitative but not arbitrary. For example, few chemists would consider structures # 39, # 64, or # 101 (Table S3, Supporting Information) as reasonable targets for synthesis due to the large number of fused rings in the structures, as well as the presence of strained rings (such as in # 39 and # 64). Similarly, the massive alkyl chains in structures # 148 and # 174, as well as the long fluorinated chains in #171 and # 200, would also exclude them from consideration. Removing these structures reduces the list from 36 molecules to 15. These structures are highlighted in Table S2 (Supporting Information). Note that this stage of the selection would be difficult to implement on a machine as it is based on qualitative, experience-based chemical considerations (such as the presence of extensive fused rings or massive alkyl chains) rather than quantitative criteria.
Of the remaining 15 compounds, we select the one shown in Figure 4A (henceforth "molecule 1"). Our reasons for selecting molecule 1 are as follows.
The first reason for selecting molecule 1 is that it has a high synthetic potential. It appears that molecule 1 could be synthesized in one step by direct reaction by two commercially available compounds: trifluoroethylamine and anthracenedicarboxylic anhydride (see Figure S2, Supporting Information). This same reaction has been reported previously using reactants similar to anthracenedicarboxylic anhydride with high yield and selectivity [55][56][57][58] (Figure S2, Supporting Information). The high selectivity of this reaction is attributed to the high electronegativity of the acid anhydride, making it strongly susceptible to nucleophilic attack compared to the aromatic rings. The high synthetic potential for molecule 1 is therefore due to the fact that molecule 1's structure can be decomposed into commercially available starting compounds, as well as the strong literature support for a selective, high-yield reaction between those compounds.
It would be difficult for a machine to evaluate the synthetic potential of a molecule with the same efficiency as a real chemist. An experienced chemist can quickly decompose the molecule into a set of starting materials (that is, perform a retrosynthesis) and has an instinct for which ones are commercially available and which ones react with high selectivity. While efforts to develop machine-based retrosynthesis methods have been reported (e.g., Ref. [59]) a reliable machine-based retrosynthesis with a subsequent literature investigation to determine reaction yield and selectivity would be very difficult to perform at present.
The second reason for selecting molecule 1 is that it is structurally novel. The presence of electronegative moieties (imide group and fluoro-alkyl group) at one end of the molecule has two consequences. The first consequence is that carbonyl groups (C = O) are strongly susceptible to chemical reactions with nucleophilic reactants, allowing for the selective formation of derivatives of molecule 1. Indeed, substitution reactions on molecules similar to molecule 1 have been reported, resulting in the replacement of the O atoms with S or CH 2 [60,61] (see Figure S3, Supporting Information). The ability to selectively form derivatives is highly desirable for organic semiconductor development, as it allows for fine-tuning of electronic structure and crystal packing. The second consequence of these electronegative moieties is the presence of a dipole moment in the molecule. This is a desirable property for processing thin films of organic semiconductor for printed electronics applications, as it enhances the solubility of the molecule in polar solvents (see Ref. [62]). While a machine could easily select a molecule on the basis of dipole moment, it would not be able to select a molecule on the basis of "structural novelty". Organic chemists have a clear and shared sense for structural novelty, however this sense is too qualitative to be expressed with a machine-learned model.

Crystal Structure Prediction and Solid-State Electronic Properties of Selected Material
In order to evaluate molecule 1's physical properties in the crystalline state, it is necessary to first predict its crystal structure. Figure 4B shows the lowest-energy polymorph of molecule 1 as predicted using the Polymorph Predictor module in Materials Studio, [63,64] followed by DFT-based structural relaxation (see Experimental Section). Other methods for predicting crystal structure are compared in Ref. [65]. This polymorph (polymorph 1) consists of stacks of molecules aligned along the short crystallographic axis. Within these stacks, the molecules are aligned in a co-facial manner with some lateral displacement in order to reduce repulsion between electron-rich imide and fluoro-alkyl moieties. Molecules belonging to adjacent stacks are oriented in opposite directions, which further reduces electrostatic repulsion between electron-rich parts of adjacent molecules. Figure 4C presents the band structure for polymorph 1 as calculated from DFT. The Brillouin zone-averaged local gap is 1.91 eV, which compares well to the average local gaps seen in the organic crystals in the target region of the ESC (ranging between 1 and 2 eV). This result therefore confirms that the regression model in our pipeline succeeded to identify the molecular structural motifs associated with these bandgaps. In addition, considerable curvature in the bands can be seen. In fact, the Brillouin zone-averaged valance band curvature works out to be 83.57 eV Å 2 (log-averaged curvature of 4.43), which corresponds to the 97th % percentile (top 3%) of all the curvatures in our initial database. This strong band curvature implies high hole mobility, and indeed the hole effective mass works out to be 0.66m e along the path segment (U, X), where m e is the electron mass. On the one hand, this result suggests that the data-assisted component of our scheme harnessed molecular structure information from pentacene, ZnPc, C 60 , and other high hole mobility organic crystals contained in the target region. On the other hand, this result shows the effectiveness of including human expertise in making the final choice of which molecule to synthesize, as most of the candidate molecules generated by the regression models would not be expected to pack in the cofacial arrangement conducive to charge transport.
While the crystal structure presented in Figure 4B is reasonable, it may not necessarily be the one observed experimentally, as organic crystals are typically polymorphic. It is therefore important to consider other low-energy crystal structures as well. Table  S4 (Supporting Information) presents crystal and band structures for four other low-energy polymorphs (polymorphs 2-5). Polymorphs 2-5 have almost identical energy (≈ −6.65 eV/atom) but are less stable than polymorph 1 discussed above (which has an energy of −7.18 eV/atom). The Brillouin zone-averaged gaps of polymorphs 2-5 range from 1.66 to 2.05 eV, which again compares well to the targeted bandgap range (1-2 eV). Polymorphs 2 and 5 possess a similar structure, consisting of pairs of molecules aligned with their planes parallel but in opposite directions to reduce electostatic repulsions between electron-rich parts. Because this co-facial stacking does not persist beyond pairs of molecules, the Brillouin zone-averaged curvature is reduced compared to the case of polymorph 1, becoming 63.79 and 30.61 eV Å 2 for polymorph 2 and 5, respectively (corresponding to log-averaged curvatures of 4.17 and 3.42, respectively). These band curvatures are nonetheless quite large, and correspond to the 94% and 74% percentile of the curvatures in the initial database. Polymorphs 3 and 4 also have a similar crystal structure, consisting of stacks of molecules aligned in a nearly co-facial manner. However, compared to polymorph 1, the molecules in these stacks are aligned with a slight offset to reduce electrostatic repulsions between electron-rich parts. Polymorphs 3 and 4 have nearly the same Brillouin zone-averaged band curvature of 40.21 and 38.39 eV Å 2 , respectively (corresponding to a log-average of ≈3.70), which, while not as large as that of polymorph 1, nonetheless corresponds to the 82% percentiles of the curvatures in the original data. The lowest energy polymorphs of molecule 1 therefore appear to possess the target bandgap as well as appreciable valance band curvature for hole transport.

Discussion and Conclusion
In light of the progress of automation in other fields, the emerging data-driven materials science paradigm will almost certainly involve considerable user input. Support for this statement can be found for the particular case of organic crystals. On the one hand, designing new organic crystals requires regression models that map exclusively from molecule structure-derived descriptors, the construction of which can be challenging. On the other hand, the number of molecules that satisfy a specific property in the solid-state will typically be massive. Thus, any attempt to design new molecules on the basis of regression model predictions would result in vast number of candidates. Selection of one of these molecules for synthesis would require qualitative criteria such as synthetic potential and novelty -a task for which a trained chemist is better suited than a machine. In this paper, we presented a new data-guided pipeline for designing molecules that satisfy a target bandgap range, as well as a targeted valance (HOMO) band curvature when the available data are sufficiently large. By letting the user specify a target region of the band structure space, it allows for the construction of the required regression model to be simplified. By allowing for user intervention in the selection of the final molecule for synthesis, it overcomes the second problem without introducing a manifold of artificial selection filters. We demonstrated how this pipeline can lead to the design of new organic crystals with targeted bandgaps and competitive hole mobilities.
Our pipeline uses an electronic structure chart (ESC) to facilitate the construction of the required regression model. However, due to the fact that the ESC appears in the first component of our pipeline, all subsequent components as well as the final prediction are limited by the simplifications imposed by the ESC. The first limitation arises from the use of electronic band structures to construct the ESC. While band structures summarize a wealth of useful information on bulk electronic properties, including the degree of molecular orbital overlap between neighboring molecules, it is common knowledge that charge transport in organic crystals cannot be adequately modeled by band theory. The molecules within an organic crystal are highly dynamic at room temperature, and charge-phonon coupling usually makes substantial contributions to charge transport in these materials. While the local arrangement of organic crystals within our ESC is reflective of the degree of HOMO overlap between neighboring molecules (through the local curvature), it may not be reflective of actual hole mobilities due to the absence of chargephonon coupling in our treatment. However, while this point is important for designing organic crystals with specific roomtemperature charge mobilities, it does not fundamentally undermine our pipeline. If our electronic fingerprints could be generalized to include charge-phonon coupling, then a revised ESC could be constructed via the Jansen-Shannon divergence (Equation (2)) and the tSNE method as described above. The question of how to incorporate charge-phonon coupling into the electronic fingerprints without incurring large computational costs should therefore be pursued as future research.
The second limitation concerns the organization of the ESC. As was demonstrated in Section 3, the ESC is organized on two levels. On a global level, the arrangement of organic crystals in the ESC reflects bandgaps, and on a local level the arrangement reflects band curvatures at a nearly constant bandgap size. This is inconvenient for designing new materials with high hole mobilities, as locally the ESC must contain a high density of high curvature materials in order to be useful. In turn, this requires a larger database. In this work, this shortcoming is compensated by relying on user intervention: the user was allowed to select a molecule according to their own expectation of what kinds of molecules lead to high mobility in the crystal phase (such as structural planarity). While this strategy was successful in this paper, it nonetheless highlights a shortcoming of the ESC as a material design tool. Future research should therefore investigate why the ESC is organized on two levels, and whether the electronic fingerprints could be improved so that an ESC organized on a single level can be achieved. A related limitation arises from the fact that the descriptors used to generate the ESC (the electronic fingerprints) were obtained by uniformly sampling across the Brillouin zone, whereas for hole mobilities only k-points near the valance band maximum are relevant. In Figure S4 (Supporting Information), we present some electronic fingerprints calculated using a weighting scheme that biases the k-point sampling toward the valance band maximum. For cases in which band energies do not vary so strongly across the Brillouin zone (pentacene and ZnPc), the electronic fingerprints do not differ significantly from the ones obtained with uniform sampling. This result would probably occur for a majority of organic crystals, which tend to have flat bands near the Fermi level. On the other hand, for cases in which band energies show more variation (DNT-V and DNT-W), we observe that the fingerprints tend to concentrate into peaks away from zero curvature. Inclusion of such effects would certainly have some effect on the ESC and might also help to spread curvature variation more globally across the ESC. This would clearly be an interesting topic for future research.
A final limitation of our design pipeline is that it is interpolative. In other words, it can only predict materials with electronic properties already contained in the initial dataset. While the importance of this shortcoming should not be dismissed, it is beyond the scope of this work to address it here. In fact, all machine learning (ML)-based design strategies suffer from this problem, as ML-derived models are not designed to make predictions for cases that substantially differ from the training data. This shortcoming is therefore a general problem that needs to be addressed by all who are interested in intelligent material design.

Experimental Section
Database Generation: The crystal structures for the database (cif files) were obtained from the Crystallography Online Database (COD) [33] in three steps. In the first step, a list of URLs (uniform resource locators) for all C and H-containing crystal structures in the COD was obtained using the COD search feature. In the second step, multiple batches of ≈100 structures were selected at random and downloaded. For each batch, a visual inspection was performed to remove structures that contained artifacts such as thermal disorder and under-coordinated atoms. The remaining structures were saved in VASP (POSCAR) format. The second step was repeated multiple times, resulting in a database of 5472 structures. In the third step, 33 additional compounds which are of high practical importance (because they exhibit high hole mobilities or are reasonable targets for synthesis) were selected from the literature and added to the database (see Table S1, Supporting Information, for details). This resulted in a database of 5505 compounds. Note that the additional 33 com-pounds contained 7 inorganic semiconductors (such as GaAs and InAs) and graphene as well, which were added in the hopes of discovering organic semiconductors with very high hole mobilities; in the end, no such materials were discovered.
For each structure in the database, band structure calculations were performed using DFT as implemented in the Vienna Ab Initio Simulations package (VASP) version 5.4.4. [66] These calculations used the rev-vdW-DF2 exchange-correlation functional, [67] PAW-PBE pseudopotentials, [68] 650 eV basis set cut-offs, Gaussian smearing with 0.2 eV smearing widths, and 2 × 2 × 2 Γ-centered k-points grids. Brillouin zone paths were generated using AFLOW, [45,46] with 10 k-points between each high-symmetry point. Because relaxation of a single molecular crystal can demand several days of computation, band structure calculations were performed directly on the experimental structures obtained above.
Electronic Fingerprints and Electronic Structure Chart Generation: To generate the electronic fingerprint for a given band structure, 100 k-points were sampled randomly (with replacement) from the Brillouin zone path. At each k-point k in the sample, the local curvature c k = ∂ 2 E vb (k)/∂k 2 and local energy gap k = E cb (k)-E vb (k) were computed, where E vb (k) is the energy of the highest occupied ("valance") band and E cb (k) the energy of the next band (the "conduction" band) at that point. This resulted in a sample of 100 pairs ( k , c k ). The electronic fingerprints were plotted on 10 × 10 grids, as shown in Figure 1I-L. The electronic fingerprints did not discernibly change when larger samples of k-points were used.
In order to compute the Jansen-Shannon divergence between a pair of band structures (Equation 1), their electronic fingerprints were first plotted on a common grid. Grids (10 × 10) were used, and the grids were specified so that they extended from the minimum to the maximum of the local curvatures and energy gaps exhibited by the pair. The Kullback-Leibler divergences in Equations (1) and (2) were computed using the discretized formula where the sum runs over all grid points, which are indexed by i, and P and M are the electronic fingerprints for the two band structures under consideration. The use of discrete grids and discretized Kullback-Leibler divergences is preferred to their continuous analogues, because the unavoidable use of discrete k-point paths makes it difficult to obtain a good approximation to a continuous probability distribution. These calculations were performed within R. [69] The electronic structure chart in Figure 2A,B was obtained using the Rtsne package for R [70] with a perplexity parameter of 30.
Autocorrelation Functions: The autocorrelation functions (ACFs) in Figure 2C were computed according to the formula where k = ⟨g (X k )⟩ and Here, X represents a random walk (Markov chain) that hops between points in the tSNE plot. X k is the point on which the random walk resides after making k hops. g(X k ) is the value of the property of interest (average local gap, average local curvature, or random noise) for this point. The angular brackets represent the expected value. The term in the numerator is interpreted as the correlation of g(X k ) with g(X 0 ), the value of the property of interest at the initial position of the random walk.
In this work, this random walk was simulated such that, for a given point in the tSNE plot, a hop was permitted with equal probability to any other point within a cut-off radius of 1.0. The ACFs were computed according to the above equation by simulating 10 000 random walks independently of each other, each initiated from a randomly selected point. Simulations were performed within R. [69] Prediction of Candidate Molecules: For each crystal structure in the database, supercells were generated using the ASE library. [71] Molecule structures were converted to SMILES strings using OpenBabel. [72] Default settings were used to construct the regression models with the XGBoost library. iQSPR was implemented as described in the methods sections of Ref. [53] and [54]. All calculations were performed within Python.
Crystal Structure Prediction and Band Structure Calculation for Molecule 1: Crystal structure prediction was performed using the Polymorph Predictor module in Materials Studio, [63,64] using the packing >> geometry optimization >> clustering protocol. Packing structures were generated using MC simulated annealing with default parameters. Geometry optimization was performed using the "smart" algorithm. Clustering was performed using default parameters. The COMPASSII force field was used for all calculations. For atomic charges, we used Mulliken charges calculated using DFT and the PBE exchange-correlation functional, [67] as implemented in the FHI-aims package. [73] Predictions were performed for the major space groups (P21/C, P-1, P212121, C2/C, and P21). The five structures with lowest energy were selected for band structure calculations.
Band structure calculations were performed using DFT as described above (see "Database generation"). Prior to band structure calculations, all polymorphs were relaxed using DFT with the rev-vdW-DF2 exchangecorrelation functional. [67]

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.