The genotype list string code syntax for exchanging nomenclature‐level genotyping results in clinical and research data management and analysis systems

The nomenclatures used to describe HLA and killer‐cell immunoglobulin‐like receptor (KIR) alleles distinguish unique nucleotide and peptide sequences, and patterns of expression, but are insufficient for describing genotyping results, as description of ambiguities and relations across loci requires terminology beyond allele names. The genotype list (GL) String grammar describes genotyping results for genetic systems with defined nomenclatures, like HLA and KIR, documenting what is known and unknown about a given genotyping result. However, the accuracy of a GL String is dependent on the reference database version under which it was generated. Here, we describe the GL String Code (GLSC) system, which associates each GL String with meta‐data describing the specific reference context in which the GL String was created, and in which it should be interpreted. GLSC is a defined syntax for exchanging GL Strings in the context of a specific gene‐family namespace, allele‐name code‐system, and pertinent reference database version. GLSC allows HLA and KIR genotyping data to be transmitted, parsed and interpreted in the appropriate context, in an unambiguous manner, on modern data‐systems, including Health Level 7 Fast Healthcare Interoperability Resources systems. The technical specification for GLSC can be found at https://glstring.org.


| INTRODUCTION
The need for standardized nomenclature systems to describe genes and gene products has been recognized for almost 70 years. 1,2 The World Health Organization Nomenclature Committee (WHO Nomenclature Committee) for Factors of the HLA system 3 and its killer-cell immunoglobulin-like receptor (KIR) gene sub-committee 4 maintain allele-name nomenclature systems for the HLA and KIR genes, which encode cell-surface molecules that facilitate the interaction of immune cells as part of the mechanisms of adaptive and innate immunity. 5-9 Though they differ significantly in key specifics, both gene systems' allele name nomenclatures apply a set of fields to describe the unique peptide, coding-nucleotide and noncoding nucleotide sequences for each allele, 10 with an additional field for HLA alleles describing the (often predicted) antigenicity of each unique protein sequence, 11 as illustrated in Figure 1. In both cases, the HLA and KIR nomenclatures describe what is known about a given gene's nucleotide and peptide sequences (and antigenicity for HLA) for specific reference specimens. Both gene families display high levels of polymorphism, 12 with 36,263 unique HLA nucleotide sequences, encoding 21,013 unique HLA proteins, known as of April 2023, and 1617 unique KIR nucleotide sequences, encoding 703 unique proteins, known as of December 2022.
In HLA and KIR genotyping experiments, it is often the case that the complete nucleotide sequence of an HLA or KIR gene is not determined. 13 Given the large number of known HLA and KIR alleles, accurate reporting of an experiment usually requires that a list of multiple possible alleles sharing the interrogated sequence be returned as an experimental finding. In cases where unphased sequences are assessed for a gene, multiple possible combinations of these sequences must be considered, often resulting in multiple possible genotypes. 14 The genotype list (GL) String grammar was developed to standardize the reporting of these and other types of ambiguity that can result from HLA and KIR genotyping experiments. 15 As shown in Table 1, GL Strings apply a set of six hierarchically parsed delimiters (?, ^, |, +, ~, and /) that, combined, accurately and comprehensively describe what is known and unknown about an HLA or KIR genotype for a given genotyping experiment. 16 For example, HLA-A*23:01:01:01+HLA-A*24:02:01:01/HLA-A*24:02:01:96|HLA-A*23:17:01:01+HLA-A*24:462 describes two alternative HLA-A genotypes, delimited by the "|," one of which includes two alleles that are identical in the sequenced region and are delimited by the "/." The alleles encoded by individual copies of the HLA-A gene are delimited by the "+." GL Strings can further describe what is known and unknown regarding multiple loci. HLA-A*02:01/HLA-A*02:02+HLA-A*03:01|HLA-A*02:07+HLA-A*03:06^HLA-B*08:01+HLA-B*44:02/HLA-B*44:03^HLA-DRB1*03:01~HLA-DRB3*01:01+HLA-DRB1*03:01~HLA-DRB3*01:01 describes genotypes for the HLA-A, HLA-B, HLA-DRB1, and HLA-DRB3 genes. In addition to the "|," "/," and "+" operators described above, genotypes for loci that are not in phase are delimited by the "^," while alleles of loci for which phase has been experimentally determined are delimited with the "~."
Finally, GL Strings can describe what is known and unknown regarding paralogous loci. KIR2DL5A*00105+KIR2DL5A*00105?KIR2DL5A*00105+KIR2DL5B*00802?KIR2DL5B*00802+KIR2DL5B*00802 describes three possible genotypes for the KIR2DL5A and KIR2DL5B genes. The possible genotypes are delimited by the "?" operator. In this case, nucleotide sequences correspond to alleles at both the KIR2DL5A and KIR2DL5B genes, and copy number information confirms two copies of these sequences.
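The hierarchical parsing described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: it assumes that the six delimiters are applied strictly in the order in which they are listed in the text (?, ^, |, +, ~, /), and the function names `parse_gl_string` and `list_alleles` are hypothetical.

```python
# Delimiters from outermost to innermost, in the order listed in the text.
# Treating this order as strict parsing precedence is an assumption of this sketch.
DELIMITERS = ["?", "^", "|", "+", "~", "/"]


def parse_gl_string(gl_string, level=0):
    """Recursively split a GL String into a nested structure.

    A plain allele name is returned as a string. Any level at which a
    delimiter occurs is returned as {delimiter: [parsed parts]}.
    """
    if level == len(DELIMITERS):
        return gl_string
    delimiter = DELIMITERS[level]
    parts = gl_string.split(delimiter)
    if len(parts) == 1:
        # This delimiter is absent, so move on to the next (inner) one.
        return parse_gl_string(gl_string, level + 1)
    return {delimiter: [parse_gl_string(part, level + 1) for part in parts]}


def list_alleles(node):
    """Collect every allele name in a parsed GL String, in order of appearance."""
    if isinstance(node, str):
        return [node]
    alleles = []
    for children in node.values():
        for child in children:
            alleles.extend(list_alleles(child))
    return alleles


if __name__ == "__main__":
    example = (
        "HLA-A*23:01:01:01+HLA-A*24:02:01:01/HLA-A*24:02:01:96"
        "|HLA-A*23:17:01:01+HLA-A*24:462"
    )
    tree = parse_gl_string(example)
    print(tree)
    print(list_alleles(tree))
```

For the HLA-A example, the outermost key of the parsed structure is "|" (two alternative genotypes), each genotype is keyed by "+" (two gene copies), and the allele ambiguity appears under "/".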
However, the HLA nomenclature format has changed over time, and the number of known HLA and KIR alleles has grown with continuous allele discovery. HLA allele name and sequence data are housed in the IPD-IMGT/HLA Database, 17 which has had 97 releases since December of 1998. KIR allele name and sequence data are housed in the IPD-KIR Database, 18 which has had 18 releases since July of 2003. Because each database release includes new allele sequences and names, a GL String that accurately represents genotyping performed in the context of one database release is likely to be inaccurate for other releases. For instance, a newly discovered allele that shares a stretch of nucleotides with previously identified alleles should be included in a respective ambiguous typing result, but cannot be included if the original genotyping had been performed before that allele was identified. If the GL String for that genotyping is interpreted in the wrong temporal context, it might appear that this new allele had been excluded as a potential typing result, when it had not been. Therefore, to accurately represent a genotyping, a GL String must be considered in the reference database context under which it was generated.
FIGURE 1 Structural elements of HLA and KIR allele names. Four HLA and four KIR alleles with names of different lengths are shown. HLA alleles include colon (:) delimiters between fields. KIR allele names do not include a serologic group. Delimiters do not precede expression variants in HLA alleles. This figure is derived from Marsh et al. 11 and Marsh et al. 10
The development of a standard means of representing the relationship between a genotyping result and its reference context has proven challenging. GL Strings can be very long, requiring modifications of software and databases to accommodate them. In turn, these accommodations necessitate the concerted efforts of informatics professionals to implement changes on multiple systems that share data internationally. Attempts to include GL Strings in clinical reports written using healthcare standards have focused on Health Level 7 International (HL7) Fast Healthcare Interoperability Resources (FHIR), an international standard developed by the HL7 organization for the exchange of clinical and health data. 19 Since its development in 2012, the FHIR standard has become widely used in healthcare communication systems, Electronic Health Record (EHR) systems, and mobile healthcare applications.
An HL7 FHIR codeable concept data element associates a code with the code system that defines and controls that code (see https://hl7.org/fhir/R4/datatypes.html#CodeableConcept, https://hl7.org/fhir/R4/terminologies.html, and http://hl7.org/fhir/R4/terminologies-systems.html for details). For example, a codeable concept that includes the code "57290-9" and code system "http://loinc.org" defines the Logical Observation Identifiers Names and Codes (LOINC) code for the laboratory test "HLA-A [Type] by High resolution" (https://loinc.org/57290-9). A result of this laboratory test could be a single HLA-A allele, which could similarly be recorded as a codeable concept by providing, for example, "A*01:01:01:01" as the code, and providing the Uniform Resource Locator (URL) for the IPD-IMGT/HLA Database, "https://www.ebi.ac.uk/ipd/imgt/hla," as the code system. An HL7 FHIR codeable concept can optionally include a version of the code system. A compact JavaScript Object Notation (JSON) example of an HLA-A allele included in an HL7 FHIR observation.valueCodeableConcept is illustrated in Figure 2.
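As a rough sketch of the structure just described, the following Python snippet assembles a valueCodeableConcept for a single HLA-A allele and prints it as JSON. The `system`, `version`, and `code` keys are the standard members of the FHIR Coding data type, and the code system URL and allele name are those quoted above. The function name `allele_codeable_concept` is hypothetical, and the release version "3.25.0" is used purely for illustration; it is not necessarily the version shown in Figure 2.

```python
import json

# Code system URL for the IPD-IMGT/HLA Database, as quoted in the text.
IPD_IMGT_HLA_SYSTEM = "https://www.ebi.ac.uk/ipd/imgt/hla"


def allele_codeable_concept(allele, version=None):
    """Build a valueCodeableConcept fragment for one allele name.

    The version is optional, mirroring the optional code system version
    of an HL7 FHIR codeable concept.
    """
    coding = {"system": IPD_IMGT_HLA_SYSTEM, "code": allele}
    if version is not None:
        coding["version"] = version
    return {"valueCodeableConcept": {"coding": [coding]}}


if __name__ == "__main__":
    # "3.25.0" is an illustrative database release version (assumption).
    concept = allele_codeable_concept("A*01:01:01:01", version="3.25.0")
    print(json.dumps(concept, indent=2))
```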
This approach is feasible for individual alleles, but a GL String cannot serve as a code because the IPD-IMGT/HLA Database cannot be specified as the code system; it does not define the GL String grammar. A new system is needed to identify the gene namespace and reference context together with the GL String. Here, we introduce the GL String Code (GLSC) system, which encapsulates GL Strings with meta-data specifying the gene namespace (which includes the code system under which the allele names should be parsed) and the pertinent reference database version or date of genotyping in a single text string. Unlike GL Strings, GLSCs can be included in HL7 FHIR valueCodeableConcepts. Use of GLSC in medical data transmission systems enables the most effective use of genotyping results for medical applications.

| METHODS AND RESULTS
The GLSC system was developed as part of the Data Standards Hackathons for Next Generation Sequencing (DaSH for NGS; https://github.com/nmdp-bioinformatics/dash/wiki), which have been developing tools, systems and services for standardized management of immunogenomic data for the last decade. 23,24 An early DaSH for NGS product was the GL Service, 25 a RESTful web service envisioned as a Digital Object Identifier system for HLA and KIR genotypes. The GL Service generated a short, unique uniform resource identifier (URI) that corresponded to a submitted GL String, and retrieved that GL String when the URI was entered in an internet browser. The GL Service allowed URIs of uniform length (less than 100 characters) to be exchanged in lieu of potentially very long (hundreds to thousands of characters) GL Strings. To account for the growth and changes in allele names with each reference database release, each URI included the specific gene database and release-version under which its GL String had been registered, and GL Strings for each release-version were recorded in separate instances. GL Service URIs were compact and easily transmitted, but dereferencing them required internet access, as well as the maintenance of GL Service instances for each HLA or KIR reference database release. These requirements limited the utility of the GL Service, and it has been decommissioned. The GLSC system was developed to address these limitations, by transmitting a full GL String in the context under which it was created.
The three elements of a GLSC (Gene Family [GF] Namespace, GF Nomenclature Version or GL String creation Date, and GL String) are described in Table 2. Multiple nomenclatures may be in use within a given GF Namespace. Within the hla namespace, the GLSC system supports the following identifiers: allele names, G groups and P groups defined by the WHO Nomenclature Committee, multiple allele code (MAC) designations defined by the National Marrow Donor Program (NMDP; e.g., the HLA-A*24:AMG MAC represents the HLA-A*24:02/HLA-A*24:09N ambiguity [https://bioinformatics.bethematchclinical.org/hlaresources/allele-codes/allele-code-lists/]), and the additional codes (XXXX, NNNN, UUUU, and NEW) and allele family XX codes defined by the World Marrow Donor Association (WMDA). 26 Within the kir namespace, only allele names defined by the WHO Nomenclature Committee's KIR subcommittee are supported.
For HLA alleles, G and P groups and KIR alleles, the GF Nomenclature Version should identify the corresponding IPD-IMGT/HLA or IPD-KIR Database version under which the genotyping results were generated. For NMDP MACs and WMDA additional and allele family XX codes, when a specific database version is unavailable, the date on which the GL String was constructed can be provided instead, under the assumption that the date will identify the potential range of reference database releases under which the genotype data were generated. When dates are provided, they must adhere to the HL7 FHIR date type (https://hl7.org/fhir/R4/datatypes.html#date), and follow the yyyy, yyyy-mm, or yyyy-mm-dd format. The elements of a GLSC are delimited with pound signs (#). An example GLSC for an HLA genotype generated under IPD-IMGT/HLA Database release version 3.25.0 is illustrated in Figure 3.
FIGURE 2 An HLA-A allele in a compact JSON HL7 FHIR observation.valueCodeableConcept. The FHIR specification describes three data types for transmitting coded data: code, Coding and CodeableConcept. The Coding data type encapsulates the code system, version, code and display associated with a coded value. The code data type contains only the code, and may be used when the code system is indicated by the definition of the data element in which it appears. The CodeableConcept data type may contain multiple Coding representations of the same concept. In the valueCodeableConcept shown, the code system is the URL for the IPD-IMGT/HLA Database, the version is the IPD-IMGT/HLA Database release version under which this HLA-A allele was genotyped (shown in bold), and the code is the name of the allele.
TABLE 2 Elements of a genotype list string code.

| Element | Description | Examples |
| --- | --- | --- |
| Gene family namespace | The set of code-systems specific to a particular gene | HLA, KIR |
| Gene family nomenclature version | The base version of the nomenclature system used by the described gene family namespace; when a nomenclature version is not available, the date on which the GL String was constructed can be used | |

Because each GLSC includes a GL String, GLSCs are compatible with any current application of a GL String. The addition of GF Namespace and Nomenclature Version or Date information in a GLSC will foster the standardization of allele identifiers generated in different reference database epochs, allowing the GL String data in a GLSC to be updated with respect to allele identifier changes, or nucleotide sequence extensions for a given allele, over time. Efforts to add support for GLSC and to implement these types of standardizations for extant data-analysis tools and data-storage systems are underway.
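The three-element, pound-sign-delimited structure described above can be illustrated with a short Python sketch. It assumes the element order given in the text (namespace, then version or date, then GL String); the names `GLSC`, `build_glsc`, `parse_glsc`, and `looks_like_fhir_date` are hypothetical, and the date check only tests the yyyy, yyyy-mm, or yyyy-mm-dd shape rather than enforcing the full HL7 FHIR date rules. The authoritative grammar is the technical specification at https://glstring.org.

```python
import re
from typing import NamedTuple

# Shape check only (yyyy, yyyy-mm, yyyy-mm-dd); it does not validate
# month or day ranges, which the full FHIR date type constrains.
DATE_SHAPE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


class GLSC(NamedTuple):
    namespace: str        # e.g. "hla" or "kir"
    version_or_date: str  # reference database version, or creation date
    gl_string: str


def build_glsc(namespace, version_or_date, gl_string):
    """Join the three elements with the '#' delimiter."""
    return "#".join((namespace, version_or_date, gl_string))


def parse_glsc(glsc):
    """Split a GLSC into its three elements, rejecting malformed input."""
    parts = glsc.split("#", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected namespace#version-or-date#GL String")
    return GLSC(*parts)


def looks_like_fhir_date(value):
    """Return True if the value has the yyyy, yyyy-mm, or yyyy-mm-dd shape."""
    return DATE_SHAPE.fullmatch(value) is not None


if __name__ == "__main__":
    # Illustrative values only; the GL String is taken from the earlier examples.
    code = build_glsc("hla", "3.25.0", "HLA-A*02:01/HLA-A*02:02+HLA-A*03:01")
    print(code)
    print(parse_glsc(code))
    print(looks_like_fhir_date("2023-04"))
```

Keeping the GL String as the last element means that a receiver can split on the first two pound signs only and hand the remainder, unchanged, to any existing GL String parser.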

| Exchanging GLSCs on HL7 FHIR data systems
GLSCs are easily read by both humans and machine systems. Use of GLSC facilitates the transmission of HLA and KIR genotype data through modern FHIR systems, by including GLSCs in HL7 FHIR systems as part of a codeable concept. HL7 FHIR Observation resources that report laboratory results use the codeable concept in the observation.valueCodeableConcept element as the result. In these cases, the GLSC grammar is defined by the "http://glstring.org" code system. In addition to the GLSC itself, HL7 FHIR systems using a valueCodeableConcept may include additional details of the code system and its version. Using the grammar defined for the GL String Code, an example codeable concept is illustrated in Figure 4.
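The embedding described in this section might be sketched as follows in Python. The code system URI is the one quoted in the paragraph above and the code system version (1.1) is the one given in the Figure 4 caption; both are exposed as parameters in case a deployment uses different values. The function name `glsc_codeable_concept` and the GLSC value itself are illustrative assumptions, not a reproduction of Figure 4.

```python
import json


def glsc_codeable_concept(glsc, system="http://glstring.org", version="1.1"):
    """Wrap a GL String Code in a valueCodeableConcept fragment.

    `system` identifies the code system that defines the GLSC grammar and
    `version` is the version of that code system, not the reference
    database release (which travels inside the GLSC itself).
    """
    return {
        "valueCodeableConcept": {
            "coding": [
                {"system": system, "version": version, "code": glsc}
            ]
        }
    }


if __name__ == "__main__":
    # Illustrative GLSC: namespace, database release, and GL String joined by '#'.
    example_glsc = "hla#3.25.0#HLA-A*02:01/HLA-A*02:02+HLA-A*03:01"
    print(json.dumps(glsc_codeable_concept(example_glsc), indent=2))
```

Note that two versions are in play: the coding's `version` describes the GLSC grammar, while the reference database release under which the genotype was generated is carried in the second element of the code.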

| DISCUSSION AND CONCLUSION
Immunogenetic genotyping efforts have been underway for almost 40 years. Over these last four decades, innovations in nucleotide sequencing technologies, and the concomitant development of computing and informatics technologies, have driven the rapid growth in the discovery of HLA and KIR polymorphism, and its application for basic science, clinical solutions, and human health. It is likely that the number of new HLA and KIR alleles identified each year will continue to grow, as estimates suggest that the human population harbors millions of HLA alleles at each locus. 12,28 The need for sophisticated informatic systems to manage and exchange these data in an automated fashion will only increase as new therapeutic applications requiring HLA and KIR genotyping are innovated. A key DaSH goal has been the development of seamless, easy-to-use systems for the exchange and application of HLA and KIR data, with the aim of enabling widespread, effective application of immunogenetic health-care data.
FIGURE 3 An example genotype list string code. The Genotype List String Code # delimiters are shown in bold. This Genotype List String was generated under IPD-IMGT/HLA Database release version 3.25.0. These HLA data were previously published. 30
FIGURE 4 Genotype list string code embedded in a compact JSON HL7 FHIR valueCodeableConcept. A GL String Code is shown in bold. The inclusion of the GLSC in an HL7 FHIR message requires that it is enclosed within valueCodeableConcept and coding tags, which include definition of the code system, in this case, https://glstring.org, and the code system version, in this case 1.1.
The development of GLSC and its integration into HL7 FHIR systems is a key step in meeting those goals. A GLSC in an HL7 FHIR valueCodeableConcept describes both the genotype and the information necessary to interpret it in its proper context in a lossless fashion, which can be translated into more accurate, granular electronic health data for clinicians and patients. Looking beyond the application of GLSC for HLA and KIR genotypes, future versions of the GLSC system could easily extend its use to additional genetic nomenclature systems. For example, sequencing results for gene systems that lack a formal nomenclature (e.g., the ABO and Leukocyte Immunoglobulin-Like Receptor genes) could be named using a Gene Feature Enumeration approach, 29 and transmitted via HL7 FHIR.
Schneider provided input and expertise on HL7 FHIR systems. Steven Mack drafted the paper and Supporting Information. All authors made contributions to the final version of the paper. No artificial intelligence systems were applied in the writing of the paper or for the work described.