Materials Informatics Reveals Unexplored Structure Space in Cuprate Superconductors

High-temperature superconducting cuprates have the potential to be transformative in a wide range of energy applications. In this work we analyse the corpus of historical data about cuprates using materials informatics and re-examine how their structures are related to their critical temperatures (Tc). The available data is highly clustered and no single database contains all the features of interest to properly examine trends. To work around these issues we employ a linear calibration approach that allows us to utilise multiple data sources -- combining fine resolution data for which the Tc is unknown with coarse resolution data where it is known. The hybrid data set constructed enables us to explore the trends in Tc with the apical and in-plane copper-oxygen distances. We show that large regions of the materials space have yet to be explored and highlight how novel experiments relying on nano-engineering of the crystal structure may enable us to explore such new regions. Based on the trends identified we propose that single layer Bi-based cuprates are good candidate systems for such experiments.


INTRODUCTION
Since the discovery of high-temperature superconductivity (HTS) in cuprate materials in 1986 [1], a significant amount of research has been devoted to understanding, tuning, doping, and growing new cuprates in order to optimise their properties, both their critical transition temperatures (Tc) and other important properties such as the coherence length of the superconducting charge carriers. This work has uncovered a variety of scaling laws [2,3] and structure-function relationships [4][5][6] that provide insight into the origin of superconductivity in cuprates and other superconducting systems. However, the many different variables cannot all be optimised at the same time and, furthermore, some of them are interdependent, making it difficult to distinguish causation from correlation for these trends. In particular, bond distances cannot be independently tuned in the 3 orthogonal directions. The most common way to tune the structure is via thin film epitaxy, where the substrate tunes the in-plane lattice parameter. However, the out-of-plane lattice parameter is not fixed and it responds elastically to the in-plane strain. Recently, experimental techniques based on nano-engineering [7,8] have been reported that enable independent tuning of lattice distances. These techniques open up new experimental pathways to increase Tc through the exploitation of structure-function relationships. Several structure-function relationships are now well established in the literature for HTS in cuprates. The most important factors believed to be relevant for the Tc are: i) the concentration of charge carriers in the conduction planes, ii) the nature of bonding in the charge-reservoir layers, iii) the in-plane Cu-O distance, and iv) the apical Cu-O distance (here we do not consider electron-doped superconducting cuprates without apical oxygens i.e. those adopting a T' structure).
* Correspondence email address: aal44@cam.ac.uk
In this work we focus on the structure-dependent relationships between Tc and the apical and in-plane Cu-O distances. The influence of the apical Cu-O distance on Tc is attributed to its effect on the localisation of charge carriers in the CuO2 planes [9]. The importance of the in-plane Cu-O distance is believed to be due to its impact on Cu-O-Cu super-exchange in the CuO2 planes. Such attempts to couple experimentally observed trends to physically relevant mechanisms provide qualitative understanding and insight into the nature of superconductivity in these systems. Unfortunately, whilst progress has been made on theories describing the percolative nature through which superconductivity emerges in cuprates [10][11][12] and the importance of inhomogeneity in such processes, the consolidation of satisfactory quantitative theories for HTS in cuprates that reflect known structure-function relationships and are capable of making predictions about Tc remains a challenge [13,14].
In the absence of an accepted mechanistic theory, the potential availability of large amounts of historical data and high impact applications has led some researchers towards data-driven phenomenological approaches i.e. machine learning. Thus far, most work in this area has focused on building models for predicting Tc given a set of easy to evaluate descriptors that represent the materials in question [15][16][17]. The hope is that such models may enable the discovery of new families of high-temperature superconductors by first detecting abstract empirical patterns in featurisations of materials currently known to display superconductivity, and then screening new materials based on similarity to these identified patterns. However, questions exist about whether such approaches will be fruitful when tested experimentally as the evaluation metrics used in proof of concept workflows are often not reflective of real materials discovery workflows [18]. A less explored but similar avenue is how materials informatics approaches can be used in conjunction with careful physical insights to probe our understanding of systems already known to display superconductivity [19]. It is this avenue of enquiry we pursue in this work.
By combining high-resolution structural data where the Tc is unknown with coarse-resolution data where Tc is known, we show the existence of unexplored regions of the materials space defined by the apical and in-plane Cu-O distances that are ripe for further experimental investigation. Our approach focuses solely on the apical and in-plane Cu-O distances, as the limited availability of data precludes direct inclusion of other important factors known to affect the physics. However, by selecting the points with the highest Tc for each cuprate family in different regions of the lattice parameter space, we are able to restrict ourselves to points where the other physical parameters are likely to be near their optima. Our results highlight how materials informatics can play an important role in helping to guide experimental efforts in materials science.

INFERRING STRUCTURAL PARAMETERS OF CUPRATES
The principal data source for this work is the SuperCon database compiled and distributed by the Japanese National Institute for Materials Science. Whilst SuperCon records the critical temperatures of an extensive range of superconducting materials, the information available for each composition is minimal: structural information is only available for a small proportion of the entries and, when available, is limited to the lattice parameters. Unfortunately, it is the common structure shared between different cuprates, characterised by the apical and in-plane distances of the superconducting CuO2 planes, that is interesting for examining trends.
Whilst the lattice parameters can be determined relatively easily via x-ray diffraction, directly measuring the atomic positions needed to determine the apical and in-plane distances typically involves much more specialised x-ray diffraction apparatus or neutron diffraction experiments. Fortunately, many cuprate structures are already recorded in the Inorganic Crystal Structure Database (ICSD) [20]. However, the critical temperatures of these materials are not recorded alongside their structures, which limits the utility of ICSD as a data source for studying Tc-structure trends.
Consequently, whilst large amounts of data are available on cuprates, the structure of that data is incomplete in terms of the information required to look at structure-function relationships. As a result, trends are often examined between relatively small numbers of selected data-points that contain the necessary information for the analysis of a given structure-function hypothesis. Selection of data in this way has the potential to lead to overconfidence in the Tc-structure trends observed. Below we attempt to overcome this issue of incomplete data by obtaining estimates for the apical and in-plane distances from the knowledge of the more readily available lattice parameters. Whilst the accuracy of such an approach is diminished and prevents truly quantitative analysis, using estimates allows for a far greater number of examples to be considered, therefore ensuring the robustness of any trends that remain.

Table I. Abbreviations used to describe the cuprate systems explored in this work. We make use of the A-jk(n-1)n four-digit notation for cuprates of the form A_j B_k S_{n-1} Cu_n O_{j+k+2n} described in [21] for Tl, Hg and Bi based cuprates. In the RE123 and RE124 abbreviations RE denotes a range of rare earth metals i.e. RE={Nd, Gd, Pr, etc.}. The representative formulas given are not exclusive and the materials contained in both SuperCon and ICSD contain a variety of dopants e.g. Y for Ca in Bi2212 to reduce cation disorder [22] and elemental substitutions e.g. Sr for Ba in Hg1201 to produce Ba-free cuprates [23].

Abbreviation   Representative formula
Y124           YBa2Cu4O8
Tl1201         TlBa2CuO5
Tl1212         TlBa2CaCu2O7
Tl1223         TlBa2Ca2Cu3O9
Tl2201         Tl2Ba2CuO6
Tl2212         Tl2Ba2CaCu2O8
Hg1201         HgBa2CuO5
Hg1212         HgBa2CaCu2O6
Hg1223         HgBa2Ca2Cu3O9
Hg2212         Hg2Ba2YCu2O8
Bi2201         Bi2Sr2CuO6
Bi2212         Bi2Sr2CaCu2O8

For cuprates, the a-lattice parameter is closely related to the in-plane distance. Assuming that the CuO2 planes are approximately square planar, the in-plane distance can be estimated as a/2 for tetragonal phases and a/(2√2) for octahedral phases. This assumption breaks down for cuprates under pressure, where the CuO2 planes generally tend to buckle to relieve pressure on the structure.
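The in-plane estimate above can be illustrated with a minimal Python sketch. The function name and the `doubled_cell` flag are hypothetical, and reading the a/(2√2) case as a cell whose a-axis runs along the diagonal of the CuO2 square is an assumption of this sketch rather than something stated in the text.

```python
import math


def in_plane_distance(a: float, doubled_cell: bool = False) -> float:
    """Estimate the in-plane Cu-O distance (in Å) from the a-lattice parameter (in Å).

    Assumes flat, approximately square-planar CuO2 planes (no buckling).
    doubled_cell=False: a spans one Cu-O-Cu edge, so d = a / 2.
    doubled_cell=True:  a spans the diagonal of the CuO2 square (assumption),
                        so d = a / (2 * sqrt(2)).
    """
    if a <= 0:
        raise ValueError("lattice parameter must be positive")
    if doubled_cell:
        return a / (2.0 * math.sqrt(2.0))
    return a / 2.0
```

For example, `in_plane_distance(3.84)` gives 1.92 Å. Because the estimate ignores buckling of the CuO2 planes, it should not be applied to structures measured under pressure.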
Whilst the c-lattice parameter is often considered as a proxy for the apical distance, there is little correlation between the two: the c-lattice parameter also depends on the thickness of the charge-reservoir layers, which can vary significantly between different families of cuprates. Consequently, to estimate the apical distances each family has to be treated independently. The approach adopted here is to use linear calibration models constructed on reference data that relate the a and c-lattice parameters to the apical distance. Figure 1 shows a scatter plot of the structure space as characterised by the a and c-lattice parameters for both the source data (SuperCon) and the reference data (ICSD). The 16 cuprate families investigated here were selected because they have more than 5 examples in both ICSD and SuperCon. We see that both data sets assign density to similar regions of the structure space for these families.
Figure 1. The figure shows the overlap in the lattice parameters between the SuperCon data set (blue dots) and the ICSD reference data (orange crosses). See Table I for a list of abbreviations.

The variation in the apical distance within families is due to the chemical pressure that arises from doping or atomic substitutions. The simplest model is that the pressure imparts a uniform stress along the c axis that strains the material. If the total strains are in the Hookean regime, the strain along the c-axis should then be directly proportional to the strain in the apical distance, the implication being that we can approximate the materials' layered structure with hypothetical slabs of constant Young's modulus. The strains along a and c will also be related by the Poisson effect. Therefore, the minimal linear calibration model for the apical distance, $d_{\mathrm{apical}}$, that also includes this effect is:

$$ d_{\mathrm{apical}} = \alpha a + \beta c + \gamma \qquad (1) $$

where α, β and γ are the parameters of the calibration model that need to be fitted for each family. In each case a robust linear model based on the Huber penalty [24] was used to reduce the effect of outliers when fitting the calibration models of the form (1). Models were fitted for the following families: La214 (T), Y123, RE123, Y124, RE124, Tl1201, Tl1212, Tl1223, Tl2201, Tl2212, Hg1201, Hg1212, Hg1223, Hg2212, Bi2201, and Bi2212 (see Table I for a list of abbreviations). To increase the amount of available reference data, both neutron and x-ray scattering structures recorded in ICSD were used. Ideally only structures derived from neutron scattering would be used, as the oxygen positions for x-ray derived structures can be affected by systematic errors. However, here we believe the increased abundance and diversity of reference data outweighs this potential loss of accuracy.
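To make the calibration step concrete, the following is a minimal pure-Python sketch of a Huber-type robust fit, assuming the calibration model takes the linear form d_apical = α a + β c + γ. It uses iteratively reweighted least squares with a fixed threshold `delta`; the function names, the threshold value and the iteration count are hypothetical, and this is a simplified stand-in rather than the fitting routine actually used in this work.

```python
def solve3(A, b):
    """Solve a 3x3 linear system A x = b by Gaussian elimination with partial pivoting."""
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for k in range(col, 4):
                M[r][k] -= f * M[col][k]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = sum(M[i][k] * x[k] for k in range(i + 1, 3))
        x[i] = (M[i][3] - s) / M[i][i]
    return x


def fit_huber_calibration(a, c, d, delta=0.02, n_iter=50):
    """Fit d ~ alpha*a + beta*c + gamma with a Huber loss via reweighted least squares.

    delta is the Huber threshold in Å (hypothetical value): residuals below it are
    penalised quadratically, larger residuals are down-weighted by delta / |residual|.
    """
    n = len(d)
    a0, c0 = sum(a) / n, sum(c) / n            # centre the predictors for conditioning
    rows = [(ai - a0, ci - c0, 1.0) for ai, ci in zip(a, c)]
    w = [1.0] * n                               # first pass is ordinary least squares
    theta = [0.0, 0.0, 0.0]
    for _ in range(n_iter):
        # Weighted normal equations (X^T W X) theta = X^T W d
        A = [[sum(wi * r[p] * r[q] for wi, r in zip(w, rows)) for q in range(3)]
             for p in range(3)]
        b = [sum(wi * r[p] * di for wi, r, di in zip(w, rows, d)) for p in range(3)]
        theta = solve3(A, b)
        res = [di - sum(t * x for t, x in zip(theta, r)) for r, di in zip(rows, d)]
        w = [1.0 if abs(e) <= delta else delta / abs(e) for e in res]
    alpha, beta, g = theta
    return alpha, beta, g - alpha * a0 - beta * c0   # undo the centring
```

One such model would be fitted per cuprate family on the ICSD reference structures and then applied to the SuperCon lattice parameters of the same family. The sketch needs at least three reference points whose (a, c) values are not collinear.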
To check the validity of this simple approach we employ density functional theory (DFT) to investigate how the apical Cu-O distance changes as the lattice is strained. We only use DFT as a proxy to look for qualitative trends that are likely to be mirrored in real systems. This is due to well-documented discrepancies in the lattice parameters between the structures of cuprates as reconstructed from neutron/x-ray scattering experiments and the relaxed structures returned from DFT calculations.
We take La2CuO4, YBa2Cu3O7 and HgBa2Ca2Cu3O8 as prototypical test cases, being illustrative of one, two and three-layer cuprates respectively. Figure 2 shows that, for tensile and compressive strains of up to 4%, monotonic and broadly linear trends exist between the a and c-lattice parameters and the apical distance for the prototype systems explored. The dependence on the c-lattice parameter is stronger than on the a-lattice parameter, as expected. As the strains become larger some degree of non-linearity does appear; however, the maximal strains examined here are significantly larger than the spread in the experimental data. Whilst a more complicated model could be used to fit this non-linearity in the clean data obtained via DFT, for the experimental data other factors such as systematic variations between different experimental setups cannot be accounted for. Therefore, adding additional terms to the calibration models without strong physically-motivated priors for their inclusion is undesirable.

MATERIALS INFORMATICS REVEALS UNEXPLORED STRUCTURE SPACE
Beyond the apical and in-plane distances there are other important factors known to influence Tc that need to be considered. Unfortunately, many of these are typically harder to quantify, for example, the nature of bonding in the charge-reservoir layers. Perhaps the most important of these factors is that achieving optimal oxygen-doping is necessary to maximise Tc. Given the aim of maximising Tc, we are generally only interested in trends between materials characterised in optimal states. Unfortunately, materials in SuperCon are commonly reported with unknown oxygen concentrations. This is problematic because achieving optimal oxygen doping for a given composition within a family depends on whether, and which, other dopants are present. As a result, naively selecting the materials with the highest Tcs would end up discarding much of the diversity in the data set, making it difficult to establish trends. To ensure that we maintain as much structural diversity as possible we first perform k-means clustering [25] on the data in the a and c-lattice parameter space and then take the top 20% of each cluster by Tc. This selection strategy ensures that the data points considered in the subsequent analysis are diverse in terms of their apical and in-plane Cu-O distances but are also likely to be close to optimal in terms of the other relevant factors for optimising Tc. Stratifying the remaining data set into different sub-groups we observe the following trends:
1. If cuprate materials are differentiated according to the number of CuO2 planes in the unit cell there is a clear separation between the different groups (Figures 3B and 3C). The observed increases in Tc with the number of planes are well understood considering intralayer interactions between CuO2 planes [26]. A similar separation is also clearly visible in the Uemura relation [3], where the saturation and suppression of the Tc occurs at different relaxation rates depending on the number of CuO2 planes.
2. Whilst grouping by the number of CuO2 planes emphasises a strong positive correlation observed between Tc and the apical distance (Figure 3C), grouping materials via the main cation (Figure 3D) shows that all the highest Tcs come from Hg and Tl-based materials with high apical distances (circled in green in Figure 3D). We note that higher critical temperatures have been achieved in both two and three layer Hg-based cuprates via the application of hydrostatic pressure [27,28], which is known to decrease the apical distance [29] with minimal impact on the in-plane distance. However, these increases in Tc have been attributed to the effect of pressure on the position of Ba atoms in the buffer layer [30,31].
3. There is an apparent optimum in-plane distance for Hg and Tl-based cuprates at around 1.92 Å (Figure 3A). In contrast, Rare-Earth-based (RE) cuprates show very little variation in Tc as the in-plane distance changes. Looking at the highest Tc Bi-based materials, there is a slight increase in Tc as the in-plane distance approaches 1.92 Å but, as there are no Bi-based materials reported with in-plane distances above 1.92 Å in the data sets examined, it is not apparent whether a drop-off in Tc, as is the case for Hg and Tl-based materials, would be observed. The slight increase in Tc with the in-plane distance for these Bi-based materials could perhaps be attributed to changes in the multi-layer structure between the high Tc Bi2201 type materials (Bi_{2+x}Sr_{2-x-y}Ca_yCuO_{6+δ}) [32] and the Bi2212 (Bi2Sr2CaCu2O_{8+δ}) family. This suggests that for constant apical distance there may be no strong dependence on the in-plane distance for the Bi-based materials.
Figure 3E shows the apical distance versus in-plane distance. The data points are coloured according to their Tc, with red colours indicating higher Tc and blue colours indicating lower Tc. It is clear that large areas of the apical/in-plane materials space, potentially yielding higher Tc materials, remain unexplored (this region is labelled 'Unexplored Region' in Figure 3E and it occurs for high apical distance, i.e. above 2.5 Å).

Figure 3. Panels A-D show the trends between the critical temperature and the apical and in-plane distances stratified by the number of CuO2 planes and cation type. In panels A and B we see the apparent optimum in the in-plane distance of 1.92 Å for Hg and Tl-based materials. Panels B and C clearly highlight the trend that Tc increases with the number of CuO2 planes. The green circled region in D shows that the trend of Tc increasing with apical distance is only apparent due to Hg and Tl-based materials with high apical distances. Panel E shows the variation of Tc with the apical and in-plane distances. A large region (labelled "Unexplored Region") of high apical Cu-O distance and low in-plane Cu-O distance is apparent. Experiments that probe this region are likely to provide useful insight into the nature of Tc-structure trends in cuprates. We highlight the vertically aligned nanocomposite (VAN) samples from [7,8] as examples of nano-engineering approaches to investigate the region. In the key, VAN c-214/a-113 refers to the La2CuO_{4-δ}/LaCuO3 interface from [7] and VAN c-214/a-214 refers to the c-aligned La2CuO_{4-δ}/a-aligned La2CuO_{4-δ} interface from [8].
Past efforts have only been able to sample in small regions around known systems due to the limitations of perturbing systems with mechanical and chemical pressures. A key limitation is that, due to Poisson effects, such methods influence both the a and c-lattice parameters, preventing the exploration of trends in a one-factor-at-a-time manner. Recently, new experiments have shown that it is possible to tune the a and c-lattice parameters independently, allowing for unexplored regions of the materials space to be investigated [7,8]; this vertically aligned nanocomposite (VAN) approach has led to enhanced Tcs of 50 K [7] and up to ∼120 K from magnetic measurements [8] in nano-engineered La2CuO_{4-δ} films relative to 40 K in the bulk (these points are highlighted in Figure 3E; see Methods for how the apical distances for these samples were estimated). These examples support the hypothesis that the unexplored region may yield higher Tc systems.
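The selection strategy described at the start of this section (k-means clustering in the a and c-lattice parameter space followed by keeping the top 20% of each cluster by Tc) can be sketched in plain Python as below. This is an illustrative sketch only: the function names, the number of clusters, the random seed, the use of unscaled lattice parameters and the rule of keeping at least one point per cluster are all assumptions that are not specified in the text.

```python
import math
import random


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(p, q))


def kmeans_labels(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)       # initialise centres on k distinct data points
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: dist2(p, centres[j])) for p in points]
        if new_labels == labels:          # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                   # keep the old centre if a cluster empties
                centres[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def select_top_fraction(records, labels, fraction=0.2):
    """Keep the top `fraction` of each cluster by Tc; records are (a, c, tc) tuples."""
    clusters = {}
    for rec, lab in zip(records, labels):
        clusters.setdefault(lab, []).append(rec)
    selected = []
    for members in clusters.values():
        members.sort(key=lambda r: r[2], reverse=True)
        n_keep = max(1, math.ceil(fraction * len(members)))   # at least one (assumption)
        selected.extend(members[:n_keep])
    return selected
```

A typical call would be `labels = kmeans_labels([(a, c) for a, c, _ in records], k)` followed by `select_top_fraction(records, labels)`. Selecting within clusters rather than globally is what preserves structural diversity: a low-Tc region of the (a, c) space still contributes its best examples.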

DISCUSSIONS AND CONCLUSION
Having established the existence of a large unexplored region of the structure space of cuprate superconductors, we believe that novel experimental approaches that allow for new regions of the apical/in-plane materials space to be probed would be fruitful to further understand structure-Tc trends in cuprates and potentially increase Tcs.
From the results presented here we believe that the 3D strain engineering of Bi2201 or Bi2212 systems is of particular interest because they lie at the base of the unexplored region with >2.5 Å apical distances (Figure 3E). 3D strain engineering using VAN is suitable for Bi2201 and Bi2212 because they can be made in-situ via epitaxial growth methods [33,34]; however, other methods such as non-linear phononics might also be appropriate [35]. Such experiments would allow greater insight into whether the high Tcs of Hg and Tl based cuprates are due to structural effects from the large apical distance or intrinsic electronic effects of the Hg and Tl cations. When attempting to optimise Tc, it should be noted that both Bi2201 and Bi2212 are known to benefit from substitutional doping [22]. This potential need for substitutional doping is important to consider in the selection of suitable materials for the substrate and matrix within the VAN setup. Bi2212 is also known to naturally exhibit crystal "super-modulation", which manifests as large variations in the Cu-O apical distance at the unit cell level [36].
Also of interest is whether it would be possible to increase Tcs in Hg1212 or Hg1223 thin films in a manner that reduces the in-plane Cu-O distance whilst constraining the apical Cu-O distance to remain high. This would allow us to move into the Unexplored Region of Figure 3E from its right-hand edge (from the cluster of red points at the highest apical distance values), rather than moving up from its bottom edge as proposed for Bi-based systems. However, we believe this approach will be more challenging due to issues that arise when growing Hg-Ba-Ca-Cu-O thin films [37].
Finally, we note that more systematic integration of existing data sources, deposition of new data and novel data mining efforts [38] are desirable to improve superconductivity databases. Consolidating the vast amount of information available in the literature into a comprehensive source, containing critical temperatures alongside atomic structures, would facilitate the improved application of materials informatics approaches. Critically, such resources would enable direct consideration of how distributions of bond distances, and variations in bond angles e.g. buckling of the CuO2 planes, affect Tc.

A. Examination of the variation of prototypical cuprates under strain
Density functional theory calculations were used to model how cuprates might behave under strain. Although standard DFT may not fully describe the electronic structure related to the superconducting states, we expect it still gives reasonable estimates for the strain responses. The plane wave pseudopotential code CASTEP [39] was used with the PBE exchange-correlation functional [40]. A plane wave cut-off energy of 700 eV was used. Monkhorst-Pack grids were used for sampling the reciprocal space with k-point spacing less than 2π × 0.05 Å^-1. On-the-fly generated core-corrected ultrasoft pseudopotentials [41] from CASTEP's C18 library were used. The equilibrium cell volumes of the structures were optimised with residual stresses less than 0.05 GPa. Once the equilibrium cell volumes were obtained, subsequent optimisations were performed with fixed cell sizes corresponding to strains in the c and a-b directions ranging from -4% to +4%. The ionic positions were relaxed until the maximum force was less than 0.01 eV Å^-1 in all calculations. The AiiDA framework was used to manage and automate the calculations [42,43].
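As an illustration of the fixed-cell strain scan, the short Python sketch below generates the lattice parameters for strains between -4% and +4%. It assumes that the in-plane (a-b) and c-axis strains are applied separately and in 1% steps; neither the step size nor the function name comes from the text, and the sketch does not interface with CASTEP or AiiDA.

```python
def strained_cells(a0, c0, max_strain=0.04, n_steps=4):
    """Return fixed-cell lattice parameters for separate in-plane and c-axis strains.

    a0, c0: equilibrium lattice parameters in Å.
    Each returned entry is a tuple (strain, a, c).
    n_steps=4 with max_strain=0.04 gives 1% steps (hypothetical choice).
    """
    strains = [max_strain * i / n_steps for i in range(-n_steps, n_steps + 1)]
    in_plane = [(s, a0 * (1 + s), c0) for s in strains]       # strain a-b, hold c
    out_of_plane = [(s, a0, c0 * (1 + s)) for s in strains]   # strain c, hold a-b
    return in_plane, out_of_plane
```

For each generated cell the ionic positions would then be relaxed at fixed cell size and the apical Cu-O distance read off from the relaxed structure.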
B. Estimation of the apical distance in VAN systems
In [7] a Tc of 50 K and a and c-lattice parameters of 3.79-3.76 Å and 13.20-13.28 Å are reported for the La2CuO_{4-δ}/LaCuO3 (c-214/a-113) interfacial region. The presence of domain matching in the structure suggests uniform stress along the c-axis; therefore, the apical distance can be estimated using the linear calibration model for the La214 (T) family. This gives an estimate of 2.40-2.43 Å for the apical distance.
In [8] a weak magnetic signature for superconductivity at 120 K is reported for a c-aligned La2CuO_{4-δ}/a-aligned La2CuO_{4-δ} (c-214/a-214) interface. Here there is La-block matching at the interface, rather than domain matching, suggesting a non-uniform stress. The La-block is believed to be 4.00 ± 0.01 Å at the interface, much larger than the average of 3.67 Å for the ICSD La214 (T) reference data. As this estimate requires a large degree of extrapolation we cannot justify a linear model. Instead, we derive an estimate for the apical distance by considering the offset from the top of the La-block to the apical oxygen. This offset is strongly peaked around 0.56 Å, giving an estimate of 2.56 Å with a 90% confidence interval of 2.51-2.62 Å.
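The arithmetic behind the central estimate for the c-214/a-214 interface can be sketched as follows. This assumes, which is not stated explicitly in the text, that the Cu atom sits at the mid-height of the La-block, so that the top of the block lies half the block height above the CuO2 plane. The symbols h_La (La-block height) and Δ (offset from the top of the La-block to the apical oxygen) are introduced here for illustration only.

```latex
d_{\mathrm{apical}} \approx \frac{h_{\mathrm{La}}}{2} + \Delta
  = \frac{4.00\,\mathrm{\AA}}{2} + 0.56\,\mathrm{\AA}
  = 2.56\,\mathrm{\AA}
```

Under this reading the result matches the quoted central value of 2.56 Å. The 90% confidence interval presumably reflects the spread of the offset in the reference data and is not reproduced here.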

Data and Code availability
The processed data and the processing and plotting code used to analyse it are available from www.github.com/comprhys/apical.