Innovations in the sequencing, synthesis, and manipulation of nucleic acids, built around the 4-letter DNA alphabet (A, T, G, C), have driven major advances in medicine, genomics, and synthetic biology. Yet research over the last three decades has shown that the genetic alphabet can be expanded to six or more letters, offering new opportunities for biotechnology. Unnatural base-pairing xeno-nucleic acids (ubp XNAs) are synthetic nucleotides that maintain base-pairing complementarity while remaining orthogonal to the natural bases. These ubp XNAs are structurally diverse, ranging from isomers of standard bases (isoG:isoC) to entirely novel hydrophobic pairs. However, the absence of next-generation sequencing tools for ubp XNAs has prevented high-throughput omics studies and limited research progress. Further, replication errors often revert ubp XNAs to natural bases during PCR, posing a significant challenge for routine molecular biology workflows.
Here, we show how next-generation sequencing and deep learning can address both the sequencing and the amplification problems. Using nanopore sequencing, we train recurrent neural network models to sequence hydrogen-bonding and non-hydrogen-bonding ubp XNAs with high accuracy (90-99%). We then apply these models to study ubp XNA loss during PCR amplification by tailoring them to detect known replication error modes. Using high-throughput, multiplexed condition screening (polymerase, sequence context, nucleotide concentration, etc.), we find conditions that significantly enhance the replication fidelity of the highly error-prone isoG:Me-isoC pair in 6-letter PCR (98.4% fidelity per theoretical doubling). This work closes a critical technological gap in XNA-compatible sequencing techniques, paving the way for fundamental discoveries in xenobiology research with broader implications for synthetic biology, medicine, and molecular evolution.
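To give a sense of what a per-doubling fidelity figure implies over a full amplification, the sketch below converts between per-doubling fidelity and overall ubp retention. It assumes that retention compounds multiplicatively, so that retention after n theoretical doublings equals fidelity raised to the power n; this model and the function names are illustrative assumptions, not definitions taken from this work.

```python
import math


def retention_after_doublings(fidelity: float, doublings: float) -> float:
    """Fraction of strands still carrying the ubp after `doublings` doublings.

    Assumption (not from the source): losses compound multiplicatively,
    so retention = fidelity ** doublings.
    """
    if not 0.0 <= fidelity <= 1.0:
        raise ValueError("fidelity must lie in [0, 1]")
    if doublings < 0:
        raise ValueError("doublings must be non-negative")
    return fidelity ** doublings


def fidelity_per_doubling(retention: float, fold_amplification: float) -> float:
    """Invert the model: per-doubling fidelity from overall retention.

    The number of theoretical doublings is log2 of the fold amplification.
    """
    if not 0.0 < retention <= 1.0:
        raise ValueError("retention must lie in (0, 1]")
    if fold_amplification <= 1.0:
        raise ValueError("fold_amplification must exceed 1")
    doublings = math.log2(fold_amplification)
    return retention ** (1.0 / doublings)


if __name__ == "__main__":
    # Hypothetical scenario: 20 theoretical doublings at 98.4% per doubling.
    kept = retention_after_doublings(0.984, 20)
    print(f"Retention after 20 doublings: {kept:.3f}")
```

Under this assumed model, 98.4% fidelity per doubling sustained over 20 theoretical doublings would leave roughly 72% of amplicons carrying the unnatural pair, which illustrates why small per-doubling gains matter in practice.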