2015 Synthetic Biology: Engineering, Evolution & Design (SEED)

Secure Offline Communication of Short Messages with DNA


Authors: Bijan Zakeri1,2*, Peter A. Carr2,3*, Timothy K. Lu1,2*

Affiliations:
1Department of Electrical Engineering and Computer Science, Department of Biological Engineering, Research Laboratory of Electronics, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
2MIT Synthetic Biology Center, 500 Technology Square, Cambridge, MA 02139, USA
3MIT Lincoln Laboratory, 244 Wood Street, Lexington, MA 02420, USA
*Correspondence to: Bijan Zakeri (bijan.zakeri@oxfordalumni.org), Peter A. Carr (carr@ll.mit.edu), and Timothy K. Lu (timlu@mit.edu).

Abstract:

The Internet has revolutionized communication with its great speed and volume but remains vulnerable to security breaches. For applications where security is more important than speed, the offline transfer of data remains vital. Moving beyond pen and paper, DNA is increasingly being used as a medium for information storage and communication. Inspired by one-time pads, considered to be an unbreakable form of encryption, we present a rational strategy for designing individualized keyboards (iKeys) that are amenable to randomization, serve as a linguistic platform for high-density encoding of plaintext into DNA, and achieve the first instance of chromatogram patterning through multiplexed sequencing. We used an iKey in combination with a secret-sharing system we call Multiplexed Sequence Encryption (MuSE) for the secure offline communication of information that is disseminated across multiple DNA strands, but can be extracted in one step. By recreating a World War II communication from Bletchley Park, we demonstrate that watermarks, a key, a message, and a decoy can be written on DNA and the correct information is revealed only if specific strands are co-sequenced. These technologies enable facile encoding and decoding of high-density information within DNA for secure offline communications.

Main Text:

Communication has many faces. Yet the rapid advances of the past decades have made us highly reliant on a single medium, online communication, which has led to emerging security and privacy challenges. As the costs and time constraints of DNA synthesis and sequencing are rapidly declining (1, 2), DNA is emerging as a viable medium for high-density information storage (3–9), and DNA cryptographic and steganographic methodologies are being employed for securing embedded information (10–13). Previously, DNA has been used for hiding messages (3) and storing digital data (8, 9); however, these methods require advanced laboratories with trained scientists to extract information. Simpler writing and reading methods are required for DNA communication to become more broadly adopted. Here we combined the familiarity of text-based communication, the QWERTY keyboard, and the genetic code to develop iKeys that serve as facile platforms for DNA communication.

The natural genetic code employs three-letter DNA words (codons) to represent the 20 common amino acids used to build proteins. The four-letter DNA alphabet of adenine (A), cytosine (C), guanine (G) and thymine (T) thus yields 4^3 = 64 distinct codons. These 64 codons were mapped onto a modified QWERTY keyboard to produce a personalized platform, iKey-64, for translating text into DNA (Fig. 1A). The codons in iKey-64 can be randomized to produce a unique iKey for every message for communication security, akin to a one-time pad (14–18). Any specific version of iKey-64 can itself be encoded in DNA and provided as an additional component of a communication, serving as a unique dictionary for each message (Fig. 1, B and C).
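
As an illustration of how plaintext maps to codons, the following minimal sketch encodes and decodes letters with the published button assignments. The letter table is read off the Fig. 1A layout, and the rule that a capital letter is written as shift followed by the letter codon is inferred from the DNA-1 sequence in Fig. 2C; both are assumptions of this sketch, as are the function names `encode` and `decode`. Digits, punctuation, and the remaining function buttons are left out.

```python
# Letter codons as read from the Fig. 1A layout (assumption of this sketch).
LETTERS = {
    "Q": "AAC", "W": "AAT", "E": "TCA", "R": "CGT", "T": "ATG", "Y": "CGC",
    "U": "GTG", "I": "GTA", "O": "ACT", "P": "TGT", "A": "TGC", "S": "CAT",
    "D": "TCT", "F": "GCG", "G": "GAG", "H": "CTC", "J": "CCA", "K": "TTC",
    "L": "AGA", "Z": "GGA", "X": "AAG", "C": "ATA", "V": "GTT", "B": "TAT",
    "N": "GTC", "M": "ACA",
}
START, END, SHIFT = "ATC", "GAT", "CAC"
SPACES = {1: "AGT", 2: "CTA"}  # space1 and space2


def encode(text, space=1):
    """Translate letters and spaces into a codon string framed by start/end."""
    codons = [START]
    for ch in text:
        if ch == " ":
            codons.append(SPACES[space])
        elif ch.upper() in LETTERS:
            if ch.isupper():
                codons.append(SHIFT)  # assumed: shift precedes a capital
            codons.append(LETTERS[ch.upper()])
        else:
            raise ValueError(f"character {ch!r} is not covered by this sketch")
    codons.append(END)
    return "".join(codons)


def decode(dna):
    """Inverse of encode for sequences that use only the codons above."""
    by_codon = {codon: letter for letter, codon in LETTERS.items()}
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    if codons[0] != START or codons[-1] != END:
        raise ValueError("message must begin with start and finish with end")
    out, upper = [], False
    for codon in codons[1:-1]:
        if codon == SHIFT:
            upper = True
        elif codon in SPACES.values():
            out.append(" ")
        else:
            letter = by_codon[codon]
            out.append(letter if upper else letter.lower())
            upper = False
    return "".join(out)


# DNA-1 as printed in Fig. 2C.
DNA_1 = (
    "ATCCACACATGCCATCATTGCATACTCGTGCATTCAATGATGCATAGTCACGTAGTCCATATGGTAATGG"
    "TGATGTCAAGTCACATGTCAATACTCGTCACTAGAACTGAGCGCGAT"
)
print(encode("Massachusetts Institute Technology", space=1) == DNA_1)
print(decode(DNA_1))
```

Randomizing an iKey for a new message would amount to shuffling the codon values of these tables while keeping the same button set.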

To increase security in addition to the substitution cipher of iKey-64, we sought to disseminate texts between multiple DNA strands so that the desired message would be revealed only if the correct strand combinations were analyzed. This multiplexing is at the heart of the MuSE strategy, a secret-sharing system where a fragmented DNA message can be securely distributed between multiple parties (19). Analyzing only a single strand would yield either inconclusive or incorrect messages designed to mislead unauthorized individuals [for technology overview see (20)].

Conventionally, to extract information embedded on multiple DNA strands, one would first have to sequence each strand separately followed by sequence alignments. In designing MuSE, we expected that when multiple DNA strands are analyzed together by Sanger sequencing using a common primer, at chromatogram positions where two bases are identical a large peak would be observed, and where two bases differ a small peak would be observed, thereby producing a pattern (Fig. 2A). However, it is known that stretches of homopolymers in DNA often lead to sequencing inaccuracies (21). Not surprisingly, the naïve sequencing of multiple DNA strands with a common primer was unable to achieve chromatogram patterning, and instead produced poor chromatograms (fig. S1). To mitigate this problem, we rationally designed iKey-64 to mediate pattern formation and reduce the incidence of homopolymers in DNA messages. To achieve this, codon assignment was based on the frequency of use of letters in the English language (22) (tables S1 and S2). The homopolymer codons AAA, CCC, GGG, and TTT are assigned to four function keys, ensuring that in normal text no homopolymer longer than four bases is possible. Even letter combinations yielding four identical bases (such as GTT-TTC representing V-K) are kept quite rare.
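
The four-base bound on homopolymers can be checked by brute force. The sketch below is an illustration only (the helper `longest_run` is a hypothetical name): it enumerates every run of three codons drawn from the 60 codons that remain once AAA, CCC, GGG, and TTT are reserved, and reports the longest stretch of identical bases. It also counts how many ordered codon pairs reach that bound, without weighting by letter frequency.

```python
from itertools import product

BASES = "ACGT"
RESERVED = {base * 3 for base in BASES}  # AAA, CCC, GGG, TTT
TEXT_CODONS = [
    "".join(codon)
    for codon in product(BASES, repeat=3)
    if "".join(codon) not in RESERVED
]


def longest_run(seq):
    """Length of the longest stretch of identical consecutive bases."""
    best = run = 0
    previous = ""
    for base in seq:
        run = run + 1 if base == previous else 1
        best = max(best, run)
        previous = base
    return best


# Three consecutive codons are enough: a longer run would have to pass
# through a whole codon, which would then be one of the reserved four.
worst = max(longest_run(a + b + c) for a, b, c in product(TEXT_CODONS, repeat=3))

# Ordered codon pairs whose junction yields four identical bases, e.g. GTT-TTC.
four_base_pairs = sum(
    1 for a, b in product(TEXT_CODONS, repeat=2) if longest_run(a + b) == 4
)

print(len(TEXT_CODONS), worst, four_base_pairs)
```

Under uniform codon usage only a small fraction of codon pairs hits the four-base limit; the frequency-based assignment described above is what keeps such pairs rare in actual English text.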


We tested iKey-64 with MuSE by writing the message “Massachusetts Institute Technology” on two DNA strands, where space1 (AGT) was used in the first DNA strand (DNA-1) and space2 (CTA) in the second DNA strand (DNA-2) to demarcate individual words in the sequences (Fig. 2, B and C). In this design, co-sequencing both DNA strands together should introduce troughs around words in the resulting chromatogram. Individual sequencing of DNA-1 and DNA-2 produced high-quality reads. However, in a DNA-1+2 mixture, forward sequencing with a common primer did not produce chromatogram patterning, but rather camouflaged the message (Fig. 2D). This was due to variable DNA sequences placed upstream of the messages, where stretches of C and A homopolymers at the 5′ ends interfered with base determination during Sanger sequencing, thus causing intentional misalignment of the recognized bases in the chromatogram (fig. S2, A and B). In contrast, reverse sequencing of DNA-1+2 with a common primer produced a distinct pattern on the chromatogram that can be readily decoded with iKey-64. Since there were no interfering stretches of homopolymers in the variable DNA regions, there were no shifts in the base calls during sequencing, thus leading to predictable chromatogram patterning and a single-step extraction of information from the two strands (fig. S2, C and D).
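
The expected pattern for this pair of strands can be sketched by comparing the two sequences from Fig. 2C position by position. This is an idealized model and not a simulation of Sanger chemistry: identical bases are marked as a large peak (`|`) and differing bases as a small one (`.`), assuming the reads stay in register. The names `peak_pattern` and `troughs` are hypothetical.

```python
import re

# Message regions of DNA-1 and DNA-2 as printed in Fig. 2C.
DNA_1 = (
    "ATCCACACATGCCATCATTGCATACTCGTGCATTCAATGATGCATAGTCACGTAGTCCATATGGTAATGG"
    "TGATGTCAAGTCACATGTCAATACTCGTCACTAGAACTGAGCGCGAT"
)
DNA_2 = (
    "ATCCACACATGCCATCATTGCATACTCGTGCATTCAATGATGCATCTACACGTAGTCCATATGGTAATGG"
    "TGATGTCACTACACATGTCAATACTCGTCACTAGAACTGAGCGCGAT"
)


def peak_pattern(strand_a, strand_b):
    """'|' where the co-sequenced bases agree, '.' where they differ."""
    if len(strand_a) != len(strand_b):
        raise ValueError("strands must be aligned and of equal length")
    return "".join("|" if a == b else "." for a, b in zip(strand_a, strand_b))


def troughs(pattern):
    """Half-open (start, end) spans, 0-based, of consecutive small peaks."""
    return [match.span() for match in re.finditer(r"\.+", pattern)]


pattern = peak_pattern(DNA_1, DNA_2)
print(pattern)
print(troughs(pattern))
```

The only mismatches fall on the space1/space2 codons (AGT versus CTA), so the troughs bracket the three words of the message.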
MuSE can be tuned to embed information in chromatograms discreetly so that alignments of DNA sequencing data to known templates cannot be used to identify embedded information. For example, adjusting the ratio of DNA-1/DNA-2 allows the degree of contrast achieved in the chromatogram patterns to be varied (Fig. 2E). At 0% DNA-1, there is only a single peak present at each nucleotide position corresponding to DNA-2 (fig. S3). However, at 10-30% DNA-1, two peaks appear at nucleotide locations corresponding to the variable DNA regions, where the smaller peaks correspond to the sequence of DNA-1 but the base calls still match that of DNA-2. As the resulting sequence produced is that of the more concentrated partner, the sequencing output aligns perfectly with DNA-2 even in the presence of chromatogram patterning (fig. S4). A similar pattern emerges when DNA-2 is the less concentrated partner. This can be used to discreetly embed messages in sequencing data, where inspection of chromatograms would allow extraction of information. However, indiscriminate DNA sequencing and alignments against known templates would overlook embedded messages.
The MuSE platform can be used to disseminate information across many DNA strands based on a shared iKey, where multiplexed sequencing of different strand combinations would provide unique readouts. To demonstrate this, iKey-64 was used to encode watermarks, a key, a message, and a decoy across six strands in a 525 bp region to recreate a WWII communication made during the establishment of Bletchley Park (Fig. 3A) (23). The functions of the elements are: (i) watermarks: an identification tag for each strand, (ii) key: a riddle whose solution provides the correct combinations of DNA strands required for analysis to reveal the message, (iii) message: the desired information to be communicated, and (iv) decoy: a false message to be revealed if improper strand combinations are analyzed.
To extract the information via multiplexed sequencing, two different primers, PrimerKey and PrimerMessage, that are common to all six strands are required. Here a simple key was chosen and encoded on all six strands, where sequencing of any individual strand or combination of multiple strands with PrimerKey would reveal the information: “Pascal’s triangle: d2r6-reverse” (Fig. 3A). This riddle means the message is revealed by co-sequencing DNA pairs on the reverse strand as ordered in Pascal’s triangle from diagonal 2 down until row 6. Thus, if strand pairs n1+n2, n3+n4, and n5+n6 were to be co-sequenced using PrimerMessage, then the embedded message “Bletchley Park: GC&CS Codebreakers” would be revealed. However, if one were to misinterpret the key, then a decoy message would be revealed. For example, we embedded one decoy message, “Captain Ridley’s Shooting Party”, that would be revealed if one were to co-sequence DNA pairs n2+n3, n4+n5, and n6+n1, a circular permutation of the correct key. For further complexity, message-harboring strands can be interspersed amongst many decoy-encoding or non-coding strands for increased security against unauthorized access.
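
One way to read the riddle programmatically is sketched below. It assumes that rows of Pascal's triangle are numbered from 0 and that diagonal 2 is the diagonal of natural numbers, so that "d2r6" yields the strand order 1 to 6; consecutive entries are then paired, and the decoy pairing is the same order rotated by one position. Treating every other pair as giving no pattern, and all function names, are assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations
from math import comb


def diagonal_to_row(diagonal, last_row):
    """Entries of the given diagonal of Pascal's triangle down to last_row.

    Diagonal 1 is the outer edge of ones; rows are numbered from 0.
    """
    k = diagonal - 1
    return [comb(n, k) for n in range(k, last_row + 1)]


def pairings(order):
    """Pair consecutive entries: [1, 2, 3, 4, 5, 6] -> {1,2}, {3,4}, {5,6}."""
    return {frozenset(order[i:i + 2]) for i in range(0, len(order), 2)}


order = diagonal_to_row(2, 6)                   # strand order from "d2r6"
MESSAGE_PAIRS = pairings(order)                 # n1+n2, n3+n4, n5+n6
DECOY_PAIRS = pairings(order[1:] + order[:1])   # n2+n3, n4+n5, n6+n1


def readout(a, b):
    """Expected result of co-sequencing strands a and b with PrimerMessage."""
    pair = frozenset((a, b))
    if pair in MESSAGE_PAIRS:
        return "message"
    if pair in DECOY_PAIRS:
        return "decoy"
    return "no pattern"


counts = Counter(readout(a, b) for a, b in combinations(order, 2))
print(order, dict(counts))
```

Of the 15 possible strand pairs, three belong to the message pairing and three to the decoy pairing in this model.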
An unauthorized user may use random primers (PrimerExternalFw/Rv) instead of PrimerKey and PrimerMessage to extract messages if they were embedded in large DNA regions. To obfuscate random sequencing, the information-containing regions were flipped between the forward and reverse strands to provide a camouflage effect. As expected, co-sequencing with PrimerExternalFw/Rv did not produce chromatogram patterning, whether message/decoy pairs or all six strands were co-sequenced (Table 1 and table S3). However, co-sequencing of all six strands with PrimerKey produced the readout “Pascal’s triangle: d2r6-reverse”, while the message/decoy-containing regions did not lead to chromatogram patterning. Similarly, chromatogram patterning was not observed in the message/decoy-containing regions when PrimerMessage was used for co-sequencing all six strands. Instead, sequencing of DNA pairs with PrimerMessage as per the order in Pascal’s triangle (n1+n2, n3+n4, and n5+n6) revealed the message via chromatogram patterning (fig. S5), while co-sequencing of the incorrect pairs (n2+n3, n4+n5, and n6+n1) revealed the decoy information. As expected, co-sequencing of other pair combinations did not lead to any patterning (Fig. 3B). This demonstrated that in addition to the security afforded by iKey-64 and MuSE, one must also possess an accurate set of primers and interpret the key correctly to unlock embedded messages.
If unauthorized individuals were to gain access to a DNA communication, next-generation sequencing (NGS) would most likely be used to extract data as messages may be incorporated in linear, plasmid, or genomic DNA regions, especially if chromatogram patterning is not used for rapid data extraction. To recreate such a scenario, we tested the difficulty associated with NGS analysis of unknown DNA samples. We prepared a purified mixture of n1+n2+n3+n4+n5+n6 and submitted it for NGS analysis to an outside party under blind experimental conditions, asking them to provide us with the assembled contents of the sample (fig. S6, A and B). While sequencing of the mixture produced ~2 million reads, the blind assembly of the reads to reconstruct the contents proved difficult and inconclusive (table S4). However, after the initial analysis we informed the outside party that there were 6 plasmids in the sample, each containing 525 bp messages as inserts. We further provided the vector sequence and asked for the exact sequences of the messages in the sample. A second round of analysis identified 6 assembled sequences that represented our messages (table S5). Alignment of the 6 identified sequences with n1, n2, n3, n4, n5, and n6 templates provided most of the information in the six messages, with n1, n2, n3, and n5 providing almost perfect alignments (fig. S6C). This demonstrated the difficulty associated with blind sequencing of a MuSE communication without any prior knowledge of DNA contents. Even if the sequences of a DNA communication were identified after considerable time and expense, the contents of a communication would still be protected by the iKey, combination key, and decoy/non-coding sequences.
iKey-64 is designed to convert plaintext into a DNA-encodable language. If chromatogram patterning is desired to expedite data extraction, the codons may potentially be shuffled to enable 9.1 × 10^61 iKey-64 variants (table S2). Since chromatogram patterning requires iKey buttons to be categorized to reduce large homopolymeric stretches, the codon assignment is not perfectly random and the maximum number of potential iKeys cannot be achieved. However, if chromatogram patterning is not desired then a maximum of 64! = 1.3 × 10^89 iKey-64 variants exist, significantly increasing the security of encoded information far beyond a comparable 64-bit key for an online communication, which produces 2^64 = 1.8 × 10^19 variants. Nevertheless, data encoded using iKey-64 would still not be truly random due to the frequency of use for each button, but additional measures may be implemented to increase security: (i) Linguistics: principles of linguistics may be applied to the layout of iKeys to modify alphabets for DNA communication, introduce new grammatical rules or create iKeys in different languages, (ii) Codons: increasing the number of nucleotides per codon can introduce redundancies in the buttons to adjust for character usage frequency, and (iii) Cryptography: plaintext may first be subject to advanced cryptographic algorithms. To illustrate, four-nucleotide codons can be used to create 256-button keyboards such as iKey-256 (fig. S7). When the number of buttons for each letter is adjusted to reflect its frequency in English text, the probability of using a button for E would equal that for Q. Similar redundancies may also be introduced for buttons representing numerals, grammar, and other user-defined functions. For instance, the frequency of numerals may be adjusted according to Benford’s Law (24). Herein we have included a series of challenges for a community assessment of the security of iKey-encoded information [for iKey challenges see (20)].
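
The key-space figures quoted above follow from simple counting, which the short sketch below reproduces. It only evaluates the arithmetic (number of codons, number of unconstrained codon-to-button assignments, and the size of a 64-bit key space); the equivalent bit length of 64! is added here for comparison and is not a figure from the text.

```python
from math import factorial, log2

codons_3nt = 4 ** 3                      # distinct three-nucleotide codons
codons_4nt = 4 ** 4                      # buttons available to an iKey-256
ikey64_variants = factorial(codons_3nt)  # unconstrained codon-to-button assignments
key64_variants = 2 ** 64                 # values of a 64-bit key

print(codons_3nt, codons_4nt)
print(f"64!  = {ikey64_variants:.1e}")
print(f"2^64 = {key64_variants:.1e}")
print(f"log2(64!) = {log2(ikey64_variants):.1f} bits")
```

As the paragraph itself notes, the size of the key space is only part of the picture: a substitution scheme still leaks button usage frequencies, which is what the listed additional measures are meant to address.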
To further increase the complexity of the iKey system, codons can be used to represent words, phrases, and characters present in different languages. Using such an approach, a single communication can include words from several languages to further complicate hacking attempts. In the case of English, it is estimated that the vocabulary of an educated native adult speaker consists of ~17,000 lemmas, while only 10 lemmas constitute 25% of common word usage (25, 26). The use of 8-nucleotide codons can generate iKeys with 65,536 buttons, sufficient to include all of the commonly used words in English as well as accommodate individual letters, numerals, grammatical characters, functional characters, and high-frequency words. Theoretically, the iKey platform may be designed to incorporate the entire English language. The Oxford English Dictionary (OED), the most comprehensive record of the English language, contains 291,500 entries and a total of 615,100 word forms (27). Encoding all of the entries of the OED on an iKey, where 1 codon would encode 1 word, would require 10-nucleotide codons to generate a 1,048,576-button keyboard. Additionally, the entire text (entries and definitions) of the OED is composed of 59 million words and 350 million characters, resulting in 5.9 characters/word. Thus, encoding the average 6-character word with an iKey-64 (1 codon = 3 nucleotides per character) would require 18 nucleotides, while with an iKey-1,048,576, where each codon represents 1 word, only 10 nucleotides would be required, representing a 44% reduction in DNA requirements. Current linguistic principles are optimized for written communication, while encryption methodologies are designed with online communication in mind. As DNA data storage is gaining traction, writing on DNA need not abide by conventional grammatical rules or encryption algorithms. Opportunities exist for exploring DNA linguistic principles to devise new grammatical rules for transferring prose onto DNA.
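
The codon lengths and the density comparison in this paragraph can be reproduced with a few lines of arithmetic. The helper `codon_length` is a hypothetical name; it returns the smallest number of nucleotides per codon whose 4^n combinations cover a given number of buttons. The word and character counts are the OED figures quoted above.

```python
def codon_length(buttons):
    """Smallest n such that 4**n codons can cover the requested buttons."""
    length = 1
    while 4 ** length < buttons:
        length += 1
    return length


for label, buttons in [
    ("iKey-64", 64),
    ("educated-speaker lemmas", 17_000),
    ("OED entries", 291_500),
    ("OED word forms", 615_100),
]:
    n = codon_length(buttons)
    print(f"{label}: {n}-nucleotide codons ({4 ** n:,} buttons)")

chars_per_word = 350e6 / 59e6            # OED characters / OED words
per_character = 6 * codon_length(64)     # 6-character word, one codon per character
per_word = codon_length(291_500)         # one codon per word
reduction = 1 - per_word / per_character

print(f"{chars_per_word:.1f} characters/word")
print(f"{per_character} nt vs {per_word} nt: {reduction:.0%} reduction")
```

The same function shows why 8-nucleotide codons are the first size that accommodates roughly 17,000 lemmas, since 4^7 = 16,384 falls just short.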
Offline communication with MuSE is best for niche applications, where small texts need to be shared between limited numbers of people. Ideally, extraction of information from such communications would occur in the field in real time, an increasing possibility with technological advancements in portable Sanger and nanopore sequencers (28, 29). We have incorporated several tiers of security to protect information communicated via DNA. Nevertheless, situations are bound to arise where the security of communications may be compromised, either through human or machine errors. There is no such thing as a lock without a key, but as with any security system the objectives are simple: provide access to a select few and keep out everyone else (19).


References and Notes:

1. P. A. Carr, G. M. Church, Genome engineering. Nat. Biotechnol. 27, 1151–1162 (2009).
2. J. Clarke et al., Continuous base identification for single-molecule nanopore DNA sequencing. Nat. Nanotechnol. 4, 265–270 (2009).
3. C. T. Clelland, V. Risca, C. Bancroft, Hiding messages in DNA microdots. Nature. 399, 533–534 (1999).
4. C. Bancroft, T. Bowler, B. Bloom, C. T. Clelland, Long-term storage of information in DNA. Science. 293, 1763–1765 (2001).
5. M. Liss et al., Embedding permanent watermarks in synthetic genes. PLoS ONE. 7 (2012).
6. J. P. Cox, Long-term data storage in DNA. Trends Biotechnol. 19, 247–250 (2001).
7. L. Sennels, T. Bentin, To DNA, all information is equal. Artif. DNA, PNA & XNA. 3, 109–111 (2012).
8. N. Goldman et al., Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature. 494 (2013).
9. G. M. Church, Y. Gao, S. Kosuri, Next-generation digital information storage in DNA. Science. 337 (2012).
10. D. Haughton, F. Balado, BioCode: two biologically compatible algorithms for embedding data in non-coding and coding regions of DNA. BMC Bioinforma. 14, 121 (2013).
11. D. Heider, A. Barnekow, DNA-based watermarks using the DNA-Crypt algorithm. BMC Bioinforma. 8, 176 (2007).
12. D. Tulpan, C. Regoui, G. Durand, L. Belliveau, S. Leger, HyDEn: a hybrid steganocryptographic approach for data encryption using randomized error-correcting DNA codes. BioMed Res. Int. 2013 (2013).
13. T. Kawano, Run-length encoding graphic rules, biochemically editable designs and steganographical numeric data embedment for DNA-based cryptographical coding system. Commun. & Integr. Biol. 6 (2013).
14. A. Ekert, R. Renner, The ultimate physical limits of privacy. Nature. 507, 443–447 (2014).
15. A. Gehani, T. H. LaBean, J. H. Reif, DNA-based Cryptography. DNA Based Comput. V: Dimacs Workshop DNA Based Comput. V June 14-15, 1999 Mass. Inst. Technol. 54 (2000).
16. C. Mao, T. H. LaBean, J. H. Reif, N. C. Seeman, Logical computation using algorithmic self-assembly of DNA triple-crossover molecules. Nature. 407, 493–496 (2000).
17. M. Hirabayashi, H. Kojima, K. Oiwa, (Springer Japan, 2010), pp. 174–183.
18. M. Hirabayashi, H. Kojima, K. Oiwa, Effective algorithm to encrypt information based on self-assembly of DNA tiles. Nucleic Acids Symp. Ser. 53, 79–80 (2009).
19. N. Ferguson, B. Schneier, T. Kohno, Cryptography engineering: design principles and practical applications (Wiley Publishing, Inc., Indianapolis, 2010).
20. Information on materials and methods is available at the Science Web site.
21. K. V. Voelkerding, S. A. Dames, J. D. Durtschi, Next-generation sequencing: from basic research to diagnostics. Clin. Chem. 55, 641–658 (2009).
22. Oxford University Press. What is the frequency of the letters of the alphabet in English? Available at http://www.oxforddictionaries.com/us/words/what-is-the-frequency-of-the-letters-of-the-alphabet-in-english. (2014).
23. Bletchley Park Trust. Captain Ridley’s Shooting Party. Available at http://www.bletchleypark.org.uk/content/hist/worldwartwo/captridley.rhtm. (2014).


24. A. D. Alves, H. H. Yanasse, N. Y. Soma, Benford’s Law and articles of scientific journals: comparison of JCR and Scopus data. Scientometrics. 98, 173–184 (2014).
25. Oxford University Press. The OEC: Facts about the language. Available at http://www.oxforddictionaries.com/us/words/the-oec-facts-about-the-lang…. (2014).
26. R. Goulden, I. S. P. Nation, J. Read, How large can a receptive vocabulary be? Appl. Linguist. 11, 341–363 (1990).
27. Oxford University Press. Dictionary Facts. Available at http://public.oed.com/history-of-the-oed/dictionary-facts/. (2013).
28. D. Stoddart, A. J. Heron, E. Mikhailova, G. Maglia, H. Bayley, Single-nucleotide discrimination in immobilized DNA oligonucleotides with a biological nanopore. Proc. Natl. Acad. Sci. United States Am. 106, 7702–7707 (2009).
29. R. G. Blazej, P. Kumaresan, R. A. Mathies, Microfabricated bioprocessor for integrated nanoliter-scale Sanger DNA sequencing. Proc. Natl. Acad. Sci. United States Am. 103, 7240–7245 (2006).

Acknowledgments:

This work is sponsored by the Defense Advanced Research Projects Agency under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government. B.Z. conceived the study, performed the experiments, analyzed the data, and wrote the manuscript. All authors read and edited the manuscript. The authors declare competing financial interests. Reported data and archived materials are available in the laboratory of T.K.L.

[Fig. 1A: iKey-64 layout, shown as button and codon]
Row 1: [“ ”] CGG, [1 !] AGC, [2 @] TGA, [3 #] GAC, [4 $] CGA, [5 %] ACG, [6 ^] TAG, [7 &] GCT, [8 *] TCG, [9 (] GCA, [0 )] CTG, [- _] TTA, [= +] GAA, [end] GAT
Row 2: [start] ATC, [Q] AAC, [W] AAT, [E] TCA, [R] CGT, [T] ATG, [Y] CGC, [U] GTG, [I] GTA, [O] ACT, [P] TGT, [{ α] CTT, [} β] TTG, [\ γ] ATT
Row 3: [shift] CAC, [A] TGC, [S] CAT, [D] TCT, [F] GCG, [G] GAG, [H] CTC, [J] CCA, [K] TTC, [L] AGA, [: ;] TCC, [~ δ] ACC, [enter] TGG
Row 4: [forward] AGG, [Z] GGA, [X] AAG, [C] ATA, [V] GTT, [B] TAT, [N] GTC, [M] ACA, [, <] CAG, [. >] TAC, [/ ?] TAA, [reverse] CAA
Row 5: [F1] AAA, [F2] CCC, [F3] GGG, [F4] TTT, [space1] AGT, [space2] CTA; F5 and the remaining buttons (labels not recoverable): GGC, GGT, CCT, GCC, CCG

[Fig. 1, B and C: iKey-64 codons in numbered button order]
CGGAGCTGAGACCGAACGTAGGCTTCGGCACTGTTAGAAGATATCAAC
AATTCACGTATGCGCGTGGTAACTTGTCTTTTGATTCACTGCCATTCT
GCGGAGCTCCCATTCAGATCCACCTGGAGGGGAAAGATAGTTTATGTC
ACACAGTACTAACAAAAACCCGGGTTTAGTCTAGGCGGTCCTGCCCCG

Fig. 1. 64-button iKey for chromatogram patterning. (A) iKey-64, used to convert plaintext to codons for DNA transcription. Messages begin with “start” and finish with “end”; “forward” and “reverse” provide information on the strand containing the desired message; and “space1” and “space2” may be used to produce troughs in chromatograms. Codons can be randomized to produce one-time iKeys. (B) iKey-64 buttons and codons were numbered to transcribe the keyboard onto a strand of DNA. (C) iKey-64 transcribed on DNA. Codons were flanked by 10 Ts to separate the start and end of the keyboard from surrounding DNA for identification, marked by red lines.

[Fig. 2 panels: (A) MuSE schematic with example reads for DNA-1, DNA-2, and DNA-1+2; (B) MIT message schematic, with legend: flanking DNA, variable DNA, message, space1/2; (C) message sequences, below; (D) forward and reverse sequencing chromatograms for DNA-1, DNA-2, and DNA-1+2, x-axis: Length (bp), 1 to 1,000; (E) chromatograms at varying DNA-1 : DNA-2 ratios]

MIT Message: Massachusetts Institute Technology
DNA-1 ATCCACACATGCCATCATTGCATACTCGTGCATTCAATGATGCATAGTCACGTAGTCCATATGGTAATGGTGATGTCAAGTCACATGTCAATACTCGTCACTAGAACTGAGCGCGAT
DNA-2 ATCCACACATGCCATCATTGCATACTCGTGCATTCAATGATGCATCTACACGTAGTCCATATGGTAATGGTGATGTCACTACACATGTCAATACTCGTCACTAGAACTGAGCGCGAT

Fig. 2. Chromatogram patterning with MuSE. (A) Schematic for chromatogram patterning. When two DNA strands are co-sequenced via Sanger sequencing, different overlapping nucleotides produce small peaks while identical ones produce large peaks. Peaks are kept in alignment via iKey-64. (B) Schematic of chromatogram patterning for the message “Massachusetts Institute Technology” via MuSE. (C) Sequence of “Massachusetts Institute Technology” used in B and encoded with iKey-64. (D) When DNA-1+2 are co-sequenced at equal concentrations with a common primer (green arrows), chromatogram patterning is achieved during reverse (PrimerExternalRv) but not forward (PrimerExternalFw) sequencing due to the flanking variable DNA regions. (E) Chromatogram patterning can be tuned to discreetly embed information in sequencing data by varying the ratios of DNA-1 (red) and DNA-2 (black). Red lines surround embedded messages.

[Fig. 3 panels: (A) iKey-64 keyboard (same layout as Fig. 1A) and the combinatorial message: six 525 bp strands, n1 to n6, each carrying a watermark (n1) to n6)), the key “Pascal’s triangle: d2r6-reverse”, a message fragment (Bletchley / Park: GC&CS / Codebreakers) and a decoy fragment (Captain / Ridley’s / Shooting Party); Pascal’s triangle with numbered rows and diagonals; primer sites PrimerExternal, PrimerKey, and PrimerMessage; legend: external DNA, watermark, key, message, decoy. (B) Matrix of chromatograms for strand pairs n1 to n6, with message and decoy patterns boxed]

Fig. 3. Combinatorial message depicting a WWII communication. (A) iKey-64 was used to transcribe watermarks, a key, a message, and a decoy between six DNA strands. If strands are co-sequenced according to the key (Pascal’s triangle on left) with the appropriate primers, then the correct communication would be revealed. (B) Chromatograms of an n1 × n6 matrix of strands tuned and co-sequenced with PrimerMessage. Chromatogram patterning is not achieved when incorrect pairs are co-sequenced. Boxes highlight patterns that communicate either the message (green) or decoy (red).


Table 1. Combinatorial message readouts. Tuning and co-sequencing of multiple DNA strands reveals a variety of messages depending on the primers used and the order of strands co-sequenced. Boxes highlight patterned messages.

[Table 1 columns: DNA, Primer, Chromatogram, Readout. Readouts shown: “Pascal’s triangle: d2r6-reverse”; “Bletchley Park: GC&CS Codebreakers”; “Captain Ridley’s Shooting Party”]


Supplementary Materials:
Materials and Methods
Technology Overview
iKey Encryption: Challenges
Figures S1-S7
Tables S1-S6