A complete pipeline enables haplotyping and phasing macrohaplotype in long sequencing reads for polyploidy samples and a multi‐source DNA mixture

Macrohaplotype combines multiple types of phased DNA variants, increasing forensic discrimination power. High‐quality long‐sequencing reads, for example, PacBio HiFi reads, provide data to detect macrohaplotypes in multiploidy and DNA mixtures. However, the bioinformatics tools for detecting macrohaplotypes are lacking. In this study, we developed a bioinformatics software, MacroHapCaller, in which targeted loci (i.e., short TRs [STRs], single nucleotide polymorphisms, and insertion and deletions) are genotyped and combined with novel algorithms to call macrohaplotypes from long reads. MacroHapCaller uses physical phasing (i.e., read‐backed phasing) to identify macrohaplotypes, and thus it can detect multi‐allelic macrohaplotypes for a given sample. MacroHapCaller was validated with data generated from our designed targeted PacBio HiFi sequencing pipeline, which sequenced ∼8‐kb amplicon regions harboring 20 core forensic STR loci in human benchmark samples HG002 and HG003. MacroHapCaller also was validated in whole‐genome long‐read sequencing data. Robust and accurate genotyping and phased macrohaplotypes were obtained with MacroHapCaller compared with the known ground truth. MacroHapCaller achieved a higher or consistent genotyping accuracy and faster speed than existing tools HipSTR and DeepVar. MacroHapCaller enables efficient macrohaplotype analysis from high‐throughput sequencing data and supports applications using discriminating macrohaplotypes.


INTRODUCTION
Variation of genomic DNA sequence is widely used for theoretical studies and practical applications.The variants include tandem repeat (TR), insertion and deletion (InDel), and single nucleotide polymorphism (SNP).Short TRs (STRs) are the core genetic markers of DNA forensic applications and databases, such as the FBI's Combined DNA Index System (CODIS) database [1].Deconvoluting DNA mixtures contributed by multiple individuals is still challenging, although some efforts have been investigated [2].The combination of SNPs and STRs can provide a better discrimination power than STRs only for DNA identification [3].Any proximate variants next to widely used STRs can be combined together and phased as different macrohaplotypes.For example, a macrohaplotype can consist of a CODIS core STR and additional completely phased flanking variants of STRs, SNPs, and InDels [4].More variants in a macrohaplotype marker enrich the genetic information, increasing the resolution in individual identification [4].The recently available pan-genome strategy harbors haplotyped variants of a species like the human draft pangenome [5], which provides better-referenced variant loci for comparison between individuals than ever.These variant sites could be selected as an efficient panel to build macrohaplotypes.
Variants from the same continuous long DNA sequence are implicitly phased.The most recently developed highthroughput long-read sequencing technologies, such as PacBio HiFi and Oxford Nanopore, can obtain the similar accuracy as short-read sequencing, but with much longer reads to enable better genomic haplotype separation and macrohaplotype analysis [6,7].In contrast, the short next-generation sequencing (NGS) reads, for example, reads from Illumina NovaSeq, often rely on computational tools to phase variants from different reads with certain degree of errors [8].The correct phasing from long reads provides the feasibility of correct calling of macrohaplotypes, which could be very useful in mixture interpretation [4], as well as forensic genetic genealogy and kinship analysis [9,10].However, to make the macrohaplotype analysis applicable, the development of wet-lab experimental methods and associated bioinformatic tool(s) are needed.
Currently main stream bioinformatic tools, for example, DeepVariant [11] and Genome Analysis Toolkit [12], mainly consider the SNPs and InDels in NGS reads.Other tools, for example, STRait Razor [13] and HipSTR [14], address the variation of STR or TR in short NGS reads.TRcaller calls TR alleles from both short and long NGS reads, but not SNPs or InDels [15].To phase variants, additional tools such as WhatsHap are needed [16].Some genome assemblers, for example, FALCON-Phase, can phase the diploidy haplotypes from long reads, but it needs extra chromosomal interactive Hi-C data, which is usually very expensive and complex to obtain [17,18].In addition, most existing tools are designed for diploid samples, which may not fit the needs for analyzing a complex DNA mixture or polyploid with more than two alleles at a single locus.
PolyHaplotyper can analyze multiple haplotypes with fullsib families but only uses SNPs [19].Another issue is that most existing tools usually do not simultaneously call STRs, SNP and InDels.
In this study, we developed a complete pipeline for macrohaplotype analysis, from benchwork to bioinformatics.First, PacBio HiFi reads were generated through sequencing, either with whole genome sequencing (WGS) or targeted amplicon sequencing, and then these HiFi reads are mapped or aligned to a reference genome.These mapped reads were the input to the novel bioinformatic tool, MacroHapCaller, which calls the variants of SNPs, InDels, and STRs with new algorithms and then combining them into phased macrohaplotypes.This pipeline was validated with ∼8-kb targeted HiFi sequencing at 20 forensic CODIS core STR sites in both amplicon sequencing reads and WGS datasets of the human benchmark sample HG002 and other human DNA samples.

Amplification of targeted long DNA fragments
To obtain PacBio HiFi sequencing reads, targeted DNA fragments were amplified.PCR primers were designed to target ∼8-kb regions with the Primer3 tool [20].The expected amplicons from each locus primer pair include one CODIS core STR locus [1], SNPs sites, and InDel sites mined from the 30x on GRCh38 results (version 20201028) of 1000 Genomes project [21].The primers are provided in Table S1.Each primer sequence consists of a locus-specific sequence on the 3ʹ end, an 18-bp barcoded universal molecule identifier (UMI) in the middle, and a macrohaplotype universal primer sequence on the 5ʹ end.DNA from the human benchmark sample HG002 (son) and HG003 (father) were used as a genomic DNA template.Singleplex PCR was performed with 50 ng DNA input per locus using the Takara LA Taq Long PCR Kit (Cat.RR013A) with denaturation at 98 • C, 30 cycles of annealing at 68 • C, and a 10 min final extension at 72 • C. Amplicons were purified with PacBio Ampure PB beads (Cat.100-265-900), quantified with Agilent D1000 ScreenTape Assay on the Agilent TapeStation, and pooled at equal inputs (125 ng per locus).The pooled amplified product then went through another round of bead clean up, this time with PacBio SMRTbell Clean Up Beads (Cat.102-158-300), to bring the pool to the working volume necessary for SMRTbell construction.

PacBio HiFi reads sequencing and generation
Note that 2500 ng of purified DNA amplicons were used to construct the SMRTbell library with the PacBio SMRTbell Express Template Prep Kit 3.0 according to manufacturer recommendations and then sequenced on a PacBio Sequel IIe platform at 70 pM on plate loading concentration with the Sequel II Sequencing Kit 2.0 and Sequel Binding Kit 3.2 and a 30-min movie time per SMRT cell across a total of four SMRT cells (Part Number 101-791-800 version 02).The HiFi reads were generated with SMRT Link (version 11.0) software with default settings.

Whole genome sequencing HiFi reads
The publicly available WGS HiFi read files from the PacBio Revio platform for human benchmark sample HG002, HG003, and HG004 were downloaded from https://downloads.pacbcloud.com/public/revio/2022Q4/. Multiple runs of sequencing data generated from the same subject were merged before subsequent analysis.The reads (FASTQ format) were input into MacroHapCaller pipeline with default settings and the same configure files for 8-kb regions of 20 CODIS STR coming with the software.

Bioinformatic pipeline of macrohaplotype analysis
The bioinformatics macrohaplotype analysis pipeline includes read quality filtering, optional UMI deduplexing, read mapping to a reference genome, alignment sorting, and then the macrohaplotype calling (Figure 1A).In step 1 of this pipeline, the long sequencing reads, for example, PacBio HiFi reads, are filtered to have an average read quality score of 30 (Phred scale, Q 30 ) using a Python script.In step 2, the UMI deduplex was conducted via UMI-tools (version 1.1.2)[22] and this step is optional.In step 3, the processed reads were mapped to the reference genome (build hg38) and then were sorted and indexed to generate an alignment file in BAM format via min-imap2 (version 2.24) and samtools (version 1.10 or 1.11).In step 4, alleles from the targeted sites are extracted as a haplotype using a configuration file for each variant type (SNPs, InDels, and TRs).In step 5, the final step is to call and report macrohaplotypes via MacroHapCaller with the algorithms described later (Figure 1B).To filter possibly noisy macrohaplotypes, the minimum proportion of supported reads is considered (0.001 by default), and macrohaplotypes are reported if this proportion is no less than the threshold.Users are allowed to customize the parameters for each of the steps if different from default settings.

Algorithms of macrohaplotype calling in long sequencing reads
To call the variants at targeted sites in a long sequence of each NGS read, the MacroHapCaller first takes the alignment file in BAM.Reads with a mapping quality less than Q 40 are removed.For each aligned read, the CIGAR string is parsed and the alleles at the targeted sites are extracted.Here, we described three algorithms for TR, SNP, and InDel calling, respectively.
For TR calling, most existing variant call tools have difficulty in calling TR variants.Therefore, we developed a novel TR calling method from the alignment data (Figure 2).This algorithm requires a configuration file which defines the TR locus position in the reference genome, two anchor sequences, and repeat motif features of targeted TR loci (Table S2).A pair of short sequences (e.g., 18-20 bp) identical to reference genome, termed the left and right anchors, flanking each targeted TR are required.The anchor sequence should be as unique as possible in the TR flanking region with at least several bases, for example, 3-5 bp, from TR sequences.The purpose of the anchor sequences is to help to identify the boundaries of TR reads.During TR calling, each targeted TR site will be screened through a computing loop.In each cycle of the loop, a chunk sequence (c), 30 bp by default, of a read next to the targeted TR locus will be extracted to check the presence of an anchor sequence.A checking window slides at the length of anchor sequence (l) through the chunk sequence to locate the first best match in the read.Therefore, the matching algorithm has a linear order of O((c-l)/l).This method skips the search for a match through the whole read, which significantly reduces the computing time by reducing steps from the whole length of a read, for example, 8-30 bp.A mismatch of edit distance will be allowed, 1 by default.If both left and right anchor sequences are present in a read, the bases between the anchors, excluding anchor sequences, are extracted and then trimmed off non-TR sequence, which is assumed to be identical or the same length as the reference.Candidate TR sequence is further checked to validate the presence of the repeat motif given in the TR configure file.If satisfied, the TR sequence is recorded for each targeted site.The start- ing position in the reference coordinate, supported read abundance, and read proportion relative to the total reads at each locus in the alignment file will be reported.Multiple TR loci, if given, in a read will be analyzed and recorded as a phased haplotype (Figure 2).
The algorithm takes short anchor sequences to locate the boundaries for TR in an alignment file.A candidate raw TR sequence within the two anchor sites, not including anchor sequences, is extracted and trimmed, followed by validating the presence of TR motifs.The final TR sequence will be recorded and starting position, supported read abundance and read proportion relative to the locus will be output.Multiple TR loci in a read will be analyzed and recorded as phased haplotypes.
For the SNP calling, a configuration file is also needed which contains the locus position information of targeted SNP loci.The SNP configuration is a tab-separated text file and an example file is presented in Table S3.Each of the targeted sites in an aligned read will be called.In each calling of a given SNP site, the corresponding base in the read will be extracted if this base is aligned to the reference.The corresponding base and its position in the reference sequence will be output as genotyping results.This is different from many other variant callers, which usually do not output alleles if the alleles are identical to the reference genome.
For the InDel calling, a configure file which contains the targeted InDel positions is also required.The InDel configure is a tab-separated text file in BED format (https://useast.ensembl.org/info/website/upload/bed.html), and an example of file was given at Table S4.All sites with a CIGAR value "I" or "D" in the alignment file of a read against the reference will be considered as InDel candidates.Candidate InDels will be called if the following given criteria are satisfied.An InDel must not be inside a homopolymer region which is determined by the homopolymer length threshold, 4 by default.The insert variant must have an average quality score higher than Q 15 (by default).The InDel will be discarded if the called InDel is not included in the configure file.The base(s) will be extracted from reference to fill the position if the listed InDel site does not have an InDel in this read (i.e., same as reference).Therefore, the data from InDel call will have the same number of targeted sites, which is useful for later comparison between samples.
All called SNPs, InDels, and TRs from the same read will be concatenated as a macrohaplotype string.All macrohaplotypes are clustered based on the string context and then ranked by the supported reads with the same called string.
The variant calling algorithms are implemented in Java (version 60) and Java version of HTSlib [23].Threading is enabled by default.

Performance comparison
HiFi reads were generated from HG002 and HG003 with four and one SMRT cell runs, respectively, in independent experiments.The alignment of HiFi reads against the reference genome HG38 was generated with our mapping pipeline and macrohaplotypes were called with Macro-HapCaller.Each of the four SMRT runs were tested.To test the scalability, another alignment file was generated from four merged runs of HiFi reads of HG002 and was used as input file for the comparison.DeepVariant (version 1.1.0)was used to call SNPs and InDels from the alignment file, and then compared with the SNPs and InDels called by MacroHapCaller.TRcaller (version 1.5.3) and HipSTR (version 0.6.2) were used to call STR alleles from the alignment file.

Demonstrated application of the MacroHapCaller pipeline in DNA forensic sites
A novel bioinformatic tool called MacroHapCaller has been developed to call macrohaplotypes consisting of TRs, SNPs, and InDels with three novel calling algorithms.The MacroHapCaller pipeline was tested in 8-kb amplified regions containing the latest core 20 DNA forensic STR sites, which are well studied and have good references (Table S1).A panel of user-selected targeted variants for genotyping are specified in three variant configure files, one for each type of variants.In the demonstration, the configuration files were generated from population variant information out of the 1000 Genome Project.The selected STR sites include 20 CODIS core STRs and additional STR sites present in the given regions.The selected SNP sites were screened from all variants of the 2504 unrelated samples out of all 3202 available human samples in the 1000 Genome Project (https://www.internationalgenome.org/) and filtered to remove variants with less than 1% minor allele frequency (MAF).The variant sites in the amplicon regions were kept as the final SNP sites.To avoid interference during alignment, SNPs were filtered further to remove those within three bases of any STR or InDel sites.In total, 970 SNPs were used in the demonstration, averaging 48.5 SNPs per CODIS STR-containing amplicon.The selected InDels are a subset of variant sites with MAF > 1% among 2504 unrelated samples in the 1000 Genome Project.The InDels were then further removed if a InDel site was overlapped within a 3-bp distance to any STR sites.InDels overlapped with any homopolymers (operationally defined as four bases or more) also were removed due to high sequencing error in the homopolymer region.In total, 57 InDels were used in this InDel configure file.
As tested, more than four million long HiFi reads were generated successfully in total with four PacBio Sequel IIe runs from amplicons generated from the human benchmark HG002 sample for 8-kb regions containing 20 CODIS STRs (Table S1).Eighteen out of 20 targeted regions produced high coverage HiFi Q 30 reads (≥25,000), except that of CSF1PO (1021 reads) and D8S1179 (327 reads) (Figure 3).Alleles for all 20 core CODIS STRs were detected in the sequencing read dataset, and were 100% concordant with the publicly known truth data of HG002.Fewer supported reads for the CSF1PO and D8S1179 sites indicated that further PCR optimization for these sites may be needed.The MacroHapCaller pipeline was used to analyze the CODIS STRs in the Q30 filtered HiFi sequences.Expected STR alleles were obtained from each set of HiFi reads which were consistent with the publicly known truth size in HG002 (Table S5), suggesting a good validation of the design of the amplification, sequencing, and bioinformatics analysis pipeline for macrohaplotypes.Our results revealed that one SMRT cell run is enough to generate plenty of reads (6−9 Gbp) for an accurate Macrohaplotype calling.To compare detailed sequence and abundance difference of macrohaplotypes, the output from Macro-HapCaller can be imported into bioinformatic tool USAT for graphic viewing [24].In summary, the MacroHapCaller is good for macrohaplotype analysis in forensics.
STR alleles were analyzed from PacBio HiFi Q30 reads, which were sequenced from 8-kb targeted amplicons containing 20 CODIS core markers.The labels for the horizontal axis are the names of 20 CODIS core markers.The vertical axis in the logarithmic scale shows the number of reads which contain the expected STR alleles for each marker.

Performance of the MacroHapCaller pipeline
The accuracy of variant calling is compared between MacroHapCaller and other widely used variant calling tools DeepVariant and HipSTR (Table 1).However, the existing tools cannot call macrohaplotypes, so the called variant at each variant site was compared.The same alignment file that was generated from the merged dataset of four PacBio 8-kb amplicon sequencing datasets of the sample HG002 was used as the input for this comparison (Table 1).Macrohaplotypes were called with MacroHap-Caller for the benchmark sample HG002 (Table S5).Among the expected 37 true STR alleles [15], MacroHap-Caller detected all 37 alleles while HipSTR detected 31 alleles, suggesting a higher STR calling accuracy of Macro-HapCaller (100%) than of HipSTR (81%); both Macro-HapCaller and TRcaller generated 100% consistent alleles (Table S6).For 970 given SNP loci, MacroHapCaller called 970 loci.Among those, 182 loci are real variants, meaning different from the reference genome.DeepVariant only reported variants by default different from the diploid reference genome, which were 100% consistent with the variants called from MacroHapCaller for these 182 loci.MacroHapCaller detected all 57 InDels included in the configure file.Out of these 57 InDel sites, nine sites were detected by DeepVariant and were 100% consistent with the alleles detected by MacroHapCaller (Table S6).The macrohaplotypes at each of 20 CODIS loci were called from HiFi reads generated from 8-kbamplicon and then were compared between HG002 and HG003.In total, 40 macrohaplotypes were called.At all 20 loci, at least one allele were shared between HG002 and HG003, among which four loci shared two identical alleles between HG002 and HG003.This indicted that one identical macrohaplotype of each locus was perfectly inherited from the parent HG003 to the child HG002.
We analyzed the running time for the MacroHapCaller pipeline.To analyze the macrohaplotypes on four PacBio HiFi sequencing datasets, the total running time of alignment, calling, phasing, and reporting macrohaplotypes was around 10 min with 12 computing threads in a desktop computer with 120G memory.It took around 2 min for the MacroHapCaller to call macrohaplotypes from the BAM alignment file.MacroHapCaller runs faster than HipSTR and DeepVariant, but slower than TRcaller (Table 1).We also tested the performance results from dataset generated from individual SMRT cells, and the variants detected from one cell and four cells were consistent, except that a shorter running time is needed, which suggested data from one SMRT cell run for the targeted amplicon is sufficient to produce robust macrohaplotypes.
We also tested the application of MacroHapCaller pipeline in public WGS data from PacBio Revio for human sample HG002, HG003 and HG004 with the same set of configure files used in the above demonstration (https:// downloads.pacbcloud.com/public/revio/2022Q4/).Results showed that all macrohaplotypes for 20 CODIS loci were successfully called by MacroHapCaller, which were consistent with those from amplicon-based HiFi reads for HG002 that had an average 93x coverage of the genome.For HG003 and HG004, some macrohaplotypes were not detected because the average read depths were much lower (∼30).Thirty reads may not be sufficient to cover the whole targeted regions, which was confirmed with a manual check.Therefore, high read coverage is recommended for WGS to detect a complete set of macrohaplotypes.
Here, it is worth noting that MacroHapCaller is capable of reporting the actual allele even if an STR locus contains an interruption compared to the reference sequence.For example, if an allele in the reference is [AGAT]14 while [AGAT]7[ACAT][AGAT]6 in the sequencing data, the allele will be reported as [AGAT]7[ACAT][AGAT]6.In addition, any targeted variant sites can be added or removed by the user to set up a user-specific panel for the macrohaplotype analysis.The current panel of selected target sites in the three configure files may need more optimization with more testing samples.For example, noisy alleles were found besides the expected macrohaplotypes at the locus D1S1656, which was mainly caused by several specific variant sites.Removal and optimization of the targeted panel will reduce the proportion of noisy alleles.In addition, this pipeline will be improved in the future to allow private mutations, which are not included in the configure files.

CONCLUDING REMARKS
A novel bioinformatic software MacroHapCaller has been developed with novel algorithms for macrohaplotype analysis from long-read sequencing data.Three types of variants, TR, SNP, and InDel, in a continuous read can be efficiently called and phased as macrohaplotypes from diploid, polyploid, and DNA mixtures to increase discrimination of power of DNA identity.In addition to forensic DNA mixture interpretation, MacroHapCaller can also be employed in many other applications, such as genetic genealogy and disease diagnostics.

C O N F L I C T O F I N T E R E S T S TAT E M E N T
The authors declare no conflicts of interest.

D ATA AVA I L A B I L I T Y S TAT E M E N T
MacroHapCaller is publicly available at https://github.com/Ge-Lab/MacroHapCaller and https://github.com/XuewenWangUGA/MacroHapCaller.The HiFi sequencing data are available on request.

F
I G U R E 1 Schematic diagram of macrohaplotype analysis pipeline and algorithms of variant calling from long sequences.(A) The workflow of calling macrohaplotypes with three independent models for single nucleotide polymorphism (SNP), insertion and deletion (InDel), and short tandem repeat (STR).The output is the macrohaplotypes which consist of sequences at variance sites and count of reads supporting the macrohaplotype, etc. (B) The schematic diagram of MacroHapCaller algorithms.F I G U R E 2 Algorithms of anchor-based tandem repeat sequence calling from the read-reference alignment.

F I G U R E 3
Distribution of supported short tandem repeat (STR) alleles in PacBio HiFi sequences.

A
U T H O R C O N T R I B U T I O N SJianye Ge designed this study.Xuewen Wang, Muyi Liu, and Jianye Ge developed the algorithms and the software.Jonathan King and Melissa Muenzler designed the experiments.Melissa Muenzler and Jonathan King conducted the experiments and sequencing.Xuewen Wang and Melissa Muenzler analyzed the data.Xuewen Wang drafted the first version of manuscript.Xuewen Wang revised the manuscript.Bruce Budowle and Hongmin Li contributed to the overall design, algorithm design, and manuscript review.All authors read and approved the final version of the manuscript.A C K N O W L E D G M E N T SThis work was supported in part by award 15PNIJ-21-GG-04159-RESS, awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice and by internal funds from the Center for Human Identification.The opinions, findings, and conclusions or recommendations expressed are those of the authors and do not necessarily reflect those of the U.S. Department of Justice.The authors would like to thank August Woerner from the University of North Texas Center of Human Identification, who assisted in this project.
The performance comparison of MacroHapCaller and other tools.
TA B L E 1 aThe percentage in parenthesis is the consistent rate%.