2015 Synthetic Biology: Engineering, Evolution & Design (SEED)
Secure Offline Communication of Short Messages with DNA
Authors: Bijan Zakeri (1,2,*), Peter A. Carr (2,3,*), Timothy K. Lu (1,2,*)

Affiliations:
(1) Department of Electrical Engineering and Computer Science, Department of Biological Engineering, Research Laboratory of Electronics, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
(2) MIT Synthetic Biology Center, 500 Technology Square, Cambridge, MA 02139, USA
(3) MIT Lincoln Laboratory, 244 Wood Street, Lexington, MA 02420, USA
(*) Correspondence to: Bijan Zakeri (bijan.zakeri@oxfordalumni.org), Peter A. Carr (carr@ll.mit.edu), and Timothy K. Lu (timlu@mit.edu).

Abstract:
The Internet has revolutionized communication with its great speed and volume but remains vulnerable to security breaches. For applications where security is more important than speed, the offline transfer of data remains vital. Moving beyond pen and paper, DNA is increasingly being used as a medium for information storage and communication. Inspired by one-time pads, considered to be an unbreakable form of encryption, we present a rational strategy for designing individualized keyboards (iKeys) that are amenable to randomization, serve as a linguistic platform for high-density encoding of plaintext into DNA, and achieve the first instance of chromatogram patterning through multiplexed sequencing. We used an iKey in combination with a secret-sharing system we call Multiplexed Sequence Encryption (MuSE) for the secure offline communication of information that is disseminated across multiple DNA strands, but can be extracted in one step. By recreating a World War II communication from Bletchley Park, we demonstrate that watermarks, a key, a message, and a decoy can be written on DNA and the correct information is revealed only if specific strands are co-sequenced. These technologies enable facile encoding and decoding of high-density information within DNA for secure offline communications.

Main Text:
Communication has many faces. Yet the rapid advances of the past decades have made us highly reliant on a single medium, online communication, which has led to emerging security and privacy challenges. As the costs and time constraints of DNA synthesis and sequencing are rapidly declining (1, 2), DNA is emerging as a viable medium for high-density information storage (3-9), and DNA cryptographic and steganographic methodologies are being employed for securing embedded information (10-13). Previously, DNA has been used for hiding messages (3) and storing digital data (8, 9); however, these methods require advanced laboratories with trained scientists to extract information. Simpler writing and reading methods are required for DNA communication to become more broadly adopted. Here we combined the familiarity of text-based communication, the QWERTY keyboard, and the genetic code to develop iKeys that serve as facile platforms for DNA communication.
The natural genetic code employs three-letter DNA words (codons) to represent the 20 common amino acids used to build proteins. The four-letter DNA alphabet of adenine (A), cytosine (C), guanine (G) and thymine (T) thus yields 4^3 = 64 distinct codons. These 64 codons were mapped onto a modified QWERTY keyboard to produce a personalized platform, iKey-64, for translating text into DNA (Fig. 1A). The codons in iKey-64 can be randomized to produce a unique iKey for every message for communication security, akin to a one-time pad (14-18). Any specific version of iKey-64 can itself be encoded in DNA and provided as an additional component of a communication, serving as a unique dictionary for each message (Fig. 1, B and C).
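
The text-to-DNA translation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: the letter and control codons are read off the iKey-64 layout in Fig. 1A, the function names `encode` and `decode` are hypothetical, and the rule that `shift` precedes each capital letter is an assumption inferred from the DNA-1 sequence in Fig. 2C. Only letters and spaces are handled.

```python
# Letter codons as read from the iKey-64 layout in Fig. 1A.
LETTER_CODONS = {
    "Q": "AAC", "W": "AAT", "E": "TCA", "R": "CGT", "T": "ATG",
    "Y": "CGC", "U": "GTG", "I": "GTA", "O": "ACT", "P": "TGT",
    "A": "TGC", "S": "CAT", "D": "TCT", "F": "GCG", "G": "GAG",
    "H": "CTC", "J": "CCA", "K": "TTC", "L": "AGA",
    "Z": "GGA", "X": "AAG", "C": "ATA", "V": "GTT", "B": "TAT",
    "N": "GTC", "M": "ACA",
}
CODON_LETTERS = {codon: letter for letter, codon in LETTER_CODONS.items()}

# Control keys from the same layout.
START, END, SHIFT = "ATC", "GAT", "CAC"
SPACE1, SPACE2 = "AGT", "CTA"


def encode(text, space=SPACE1):
    """Translate letters and spaces into a start...end codon string."""
    codons = [START]
    for ch in text:
        if ch == " ":
            codons.append(space)
            continue
        codon = LETTER_CODONS.get(ch.upper())
        if codon is None:
            raise ValueError(f"no codon in this sketch for {ch!r}")
        if ch.isupper():
            # Assumption: 'shift' is written before each capital letter.
            codons.append(SHIFT)
        codons.append(codon)
    codons.append(END)
    return "".join(codons)


def decode(dna):
    """Inverse of encode(); both space codons decode to a blank."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    if codons[0] != START or codons[-1] != END:
        raise ValueError("message must begin with start and finish with end")
    out, shifted = [], False
    for codon in codons[1:-1]:
        if codon == SHIFT:
            shifted = True
        elif codon in (SPACE1, SPACE2):
            out.append(" ")
        else:
            letter = CODON_LETTERS[codon]
            out.append(letter if shifted else letter.lower())
            shifted = False
    return "".join(out)


dna1 = encode("Massachusetts Institute Technology", space=SPACE1)
dna2 = encode("Massachusetts Institute Technology", space=SPACE2)
```

Under these assumptions, `dna1` reproduces the DNA-1 message sequence listed in Fig. 2C, and `dna2` differs from it only at the two space codons.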
To increase security in addition to the substitution cipher of iKey-64, we sought to disseminate texts between multiple DNA strands so that the desired message would be revealed only if the correct strand combinations were analyzed. This multiplexing is at the heart of the MuSE strategy, a secret-sharing system where a fragmented DNA message can be securely distributed between multiple parties (19). Analyzing only a single strand would yield either inconclusive or incorrect messages designed to mislead unauthorized individuals [for technology overview see (20)].
Conventionally, to extract information embedded on multiple DNA strands, one would first have to sequence each strand separately followed by sequence alignments. In designing MuSE, we expected that when multiple DNA strands are analyzed together by Sanger sequencing using a common primer, at chromatogram positions where two bases are identical a large peak would be observed, and where two bases differ a small peak would be observed, thereby producing a pattern (Fig. 2A). However, it is known that stretches of homopolymers in DNA often lead to sequencing inaccuracies (21). Not surprisingly, the naïve sequencing of multiple DNA strands with a common primer was unable to achieve chromatogram patterning, and instead produced poor chromatograms (fig. S1). To mitigate this problem, we rationally designed iKey-64 to mediate pattern formation and reduce the incidence of homopolymers in DNA messages. To achieve this, codon assignment was based on the frequency of use of letters in the English language (22) (tables S1 and S2). The homopolymer codons AAA, CCC, GGG, and TTT are assigned to four function keys, ensuring that in normal text no homopolymer longer than four bases is possible. Even letter combinations yielding four identical bases (such as GTT-TTC representing V-K) are kept quite rare.
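
The four-base bound on homopolymer length can be checked by brute force. The sketch below is an illustration rather than part of the original work: it enumerates every run of three consecutive codons drawn from the 60 non-homopolymer codons and records the longest stretch of identical bases. The helper name `longest_run` is hypothetical. Three codons suffice because a run can only span more than two codons if the middle codon is itself a homopolymer.

```python
from itertools import product

BASES = "ACGT"
ALL_CODONS = ["".join(p) for p in product(BASES, repeat=3)]
HOMOPOLYMER_CODONS = {base * 3 for base in BASES}  # AAA, CCC, GGG, TTT
TEXT_CODONS = [c for c in ALL_CODONS if c not in HOMOPOLYMER_CODONS]


def longest_run(seq):
    """Length of the longest stretch of identical consecutive bases."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


# Longest homopolymer over all triples of non-homopolymer codons.
max_run = max(
    longest_run(a + b + c) for a, b, c in product(TEXT_CODONS, repeat=3)
)

# The V-K example from the text: GTT followed by TTC.
vk_run = longest_run("GTT" + "TTC")
```

Here `max_run` comes out as 4, and the V-K pair (`vk_run`) is one of the combinations that reaches it.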
We tested iKey-64 with MuSE by writing the message "Massachusetts Institute Technology" on two DNA strands, where space1 (AGT) was used in the first DNA strand (DNA-1) and space2 (CTA) in the second DNA strand (DNA-2) to demarcate individual words in the sequences (Fig. 2, B and C). In this design, co-sequencing both DNA strands together should introduce troughs around words in the resulting chromatogram. Individual sequencing of DNA-1 and DNA-2 produced high quality reads. However, in a DNA-1+2 mixture, forward sequencing with a common primer did not produce chromatogram patterning, but rather camouflaged the message (Fig. 2D). This was due to variable DNA sequences placed upstream of the messages, where stretches of C and A homopolymers at the 5' ends interfered with base determination during Sanger sequencing, thus causing intentional misalignment of the recognized bases in the chromatogram (fig. S2, A and B). Alternatively, reverse sequencing of DNA-1+2 with a common primer produced a distinct pattern on the chromatogram that can be readily decoded with iKey-64. Since there were no interfering stretches of homopolymers in the variable DNA regions, there were no shifts in the base calls during sequencing, thus leading to predictable chromatogram patterning and a single-step extraction of information from the two strands (fig. S2, C and D).
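
The expected pattern for this design can be illustrated by comparing the two message sequences position by position. The sketch below is an idealized picture, not a model of Sanger chemistry: it assumes perfectly aligned reads and marks a position as a trough wherever the two strands differ. The sequences are the DNA-1 and DNA-2 messages from Fig. 2C, split into their three word blocks; the names `build_strand` and `peak_pattern` are hypothetical.

```python
# Word blocks of the Fig. 2C message (start and shift codons included).
HEAD = "ATCCACACATGCCATCATTGCATACTCGTGCATTCAATGATGCAT"   # start + Massachusetts
MID = "CACGTAGTCCATATGGTAATGGTGATGTCA"                   # Institute
TAIL = "CACATGTCAATACTCGTCACTAGAACTGAGCGCGAT"            # Technology + end


def build_strand(space_codon):
    """Join the word blocks with the given space codon."""
    return HEAD + space_codon + MID + space_codon + TAIL


DNA1 = build_strand("AGT")  # space1
DNA2 = build_strand("CTA")  # space2


def peak_pattern(a, b):
    """'|' where both strands agree (tall peak), '.' where they differ."""
    return "".join("|" if x == y else "." for x, y in zip(a, b))


pattern = peak_pattern(DNA1, DNA2)
troughs = [i for i, mark in enumerate(pattern) if mark == "."]
word_lengths = [len(block) for block in pattern.split("...")]
```

In this idealized view the only troughs fall on the two space codons, which frame the three words as blocks of 45, 30, and 36 bases.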
MuSE can be tuned to embed information in chromatograms discreetly so that alignments of DNA sequencing data to known templates cannot be used to identify embedded information. For example, adjusting the ratio of DNA-1/DNA-2 allows the degree of contrast achieved in the chromatogram patterns to be varied (Fig. 2E). At 0% DNA-1, there is only a single peak present at each nucleotide position corresponding to DNA-2 (fig. S3). However, at 10-30% DNA-1, two peaks appear at nucleotide locations corresponding to the variable DNA regions, where the smaller peaks correspond to the sequence of DNA-1 but the base calls still match that of DNA-2. As the resulting sequence produced is that of the more concentrated partner, the sequencing output aligns perfectly with DNA-2 even in the presence of chromatogram patterning (fig. S4). A similar pattern emerges when DNA-2 is the less concentrated partner. This can be used to discreetly embed messages in sequencing data, where inspection of chromatograms would allow extraction of information. However, indiscriminate DNA sequencing and alignments against known templates would overlook embedded messages.
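
The effect of the mixing ratio can be pictured with a toy model. The sketch below assumes, purely for illustration, that the signal for each base is proportional to the molar fraction of the strand carrying it and that the base caller reports the strongest signal; real chromatograms are noisier and do not follow this rule exactly. The two short strands and the function names are hypothetical.

```python
def mixed_signal(dna1, dna2, frac1):
    """Per-position {base: relative signal} for a DNA-1/DNA-2 mixture."""
    signals = []
    for b1, b2 in zip(dna1, dna2):
        position = {}
        for base, fraction in ((b1, frac1), (b2, 1.0 - frac1)):
            if fraction > 0:
                position[base] = position.get(base, 0.0) + fraction
        signals.append(position)
    return signals


def base_calls(signals):
    """Call the base with the largest signal at each position."""
    return "".join(max(position, key=position.get) for position in signals)


def double_peaks(signals):
    """Indices where two different bases both contribute signal."""
    return [i for i, position in enumerate(signals) if len(position) == 2]


# Hypothetical strands: shared flanks around a differing codon (AGT vs CTA).
strand1 = "AC" + "AGT" + "CA"
strand2 = "AC" + "CTA" + "CA"

at_0_percent = mixed_signal(strand1, strand2, 0.0)
at_20_percent = mixed_signal(strand1, strand2, 0.2)
```

At 0% DNA-1 the model shows single peaks only. At 20% DNA-1 it shows minor peaks at the three differing positions, while the called sequence still equals the DNA-2 strand, mirroring the behavior described above.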
The MuSE platform can be used to disseminate information across many DNA strands based on a shared iKey, where multiplexed sequencing of different strand combinations would provide unique readouts. To demonstrate this, iKey-64 was used to encode watermarks, a key, a message, and a decoy across six strands in a 525 bp region to recreate a WWII communication made during the establishment of Bletchley Park (Fig. 3A) (23). The functions of the elements are: (i) watermarks: an identification tag for each strand; (ii) key: a riddle whose solution provides the correct combinations of DNA strands required for analysis to reveal the message; (iii) message: the desired information to be communicated; and (iv) decoy: a false message to be revealed if improper strand combinations are analyzed.
To extract the information via multiplexed sequencing, two different primers, PrimerKey and PrimerMessage, that are common to all six strands are required. Here a simple key was chosen and encoded on all six strands, where sequencing of any individual strand or any combination of strands with PrimerKey would reveal the information: "Pascal's triangle: d2r6-reverse" (Fig. 3A). This riddle means the message is revealed by co-sequencing DNA pairs on the reverse strand as ordered in Pascal's triangle from diagonal 2 down until row 6. Thus, if strand pairs n1+n2, n3+n4, and n5+n6 were to be co-sequenced using PrimerMessage, then the embedded message "Bletchley Park: GC&CS Codebreakers" would be revealed. However, if one were to misinterpret the key, then a decoy message would be revealed. For example, we embedded one decoy message, "Captain Ridley's Shooting Party", that would be revealed if one were to co-sequence DNA pairs n2+n3, n4+n5, and n6+n1, a circular permutation of the correct key. For further complexity, message-harboring strands can be interspersed amongst many decoy-encoding or non-coding strands for increased security against unauthorized access.
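
The pairing logic of the key can be written out explicitly. The sketch below is one reading of the riddle, offered as an assumption rather than the authors' procedure: the second diagonal of Pascal's triangle, taken over rows 1 to 6, gives the numbers 1 to 6, which are then paired consecutively. Rotating the same order by one position gives the decoy pairs named in the text. The function names are hypothetical.

```python
import math
from itertools import combinations

# Second diagonal of Pascal's triangle for rows 1..6: C(row, 1) = 1, 2, ..., 6.
diagonal = [math.comb(row, 1) for row in range(1, 7)]
ordered_strands = [f"n{k}" for k in diagonal]


def consecutive_pairs(labels, offset=0):
    """Pair neighbours after rotating the list by `offset` positions."""
    rotated = labels[offset:] + labels[:offset]
    return [frozenset(rotated[i:i + 2]) for i in range(0, len(rotated), 2)]


MESSAGE_PAIRS = consecutive_pairs(ordered_strands)            # n1+n2, n3+n4, n5+n6
DECOY_PAIRS = consecutive_pairs(ordered_strands, offset=1)    # n2+n3, n4+n5, n6+n1


def readout(strand_a, strand_b):
    """Which kind of pattern a co-sequenced pair is expected to reveal."""
    pair = frozenset((strand_a, strand_b))
    if pair in MESSAGE_PAIRS:
        return "message"
    if pair in DECOY_PAIRS:
        return "decoy"
    return "no pattern"


# Expected readout for every unordered pair of the six strands.
summary = {
    (a, b): readout(a, b) for a, b in combinations(ordered_strands, 2)
}
```

Of the 15 possible pairs, three are expected to reveal the message, three the decoy, and the remaining nine no pattern.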
An unauthorized user may use random primers (PrimerExternalFw/Rv) instead of PrimerKey and PrimerMessage to extract messages if they were embedded in large DNA regions. To obfuscate random sequencing, the information-containing regions were flipped between the forward and reverse strands to provide a camouflage effect. As expected, co-sequencing with PrimerExternalFw/Rv did not produce chromatogram patterning, whether message/decoy pairs or all six strands were co-sequenced (Table 1 and table S3). However, co-sequencing of all six strands with PrimerKey produced the readout "Pascal's triangle: d2r6-reverse", while the message/decoy-containing regions did not lead to chromatogram patterning. Similarly, chromatogram patterning was not observed in the message/decoy-containing regions when PrimerMessage was used for co-sequencing all six strands. Instead, sequencing of DNA pairs with PrimerMessage as per the order in Pascal's triangle (n1+n2, n3+n4, and n5+n6) revealed the message via chromatogram patterning (fig. S5), while co-sequencing of the incorrect pairs (n2+n3, n4+n5, and n6+n1) revealed the decoy information. As expected, co-sequencing of other pair combinations did not lead to any patterning (Fig. 3B). This demonstrated that in addition to the security afforded by iKey-64 and MuSE, one must also possess an accurate set of primers and interpret the key correctly to unlock embedded messages.
If unauthorized individuals were to gain access to a DNA communication, next-generation sequencing (NGS) would most likely be used to extract data as messages may be incorporated in linear, plasmid, or genomic DNA regions, especially if chromatogram patterning is not used for rapid data extraction. To recreate such a scenario, we tested the difficulty associated with NGS analysis of unknown DNA samples. We prepared a purified mixture of n1+n2+n3+n4+n5+n6 and submitted it for NGS analysis to an outside party under blind experimental conditions, asking them to provide us with the assembled contents of the sample (fig. S6, A and B). While sequencing of the mixture produced ~2 million reads, the blind assembly of the reads to reconstruct the contents proved difficult and inconclusive (table S4). However, after the initial analysis we informed the outside party that there were 6 plasmids in the sample, each containing 525 bp messages as inserts. We further provided the vector sequence and asked for the exact sequences of the messages in the sample. A second round of analysis identified 6 assembled sequences that represented our messages (table S5). Alignment of the 6 identified sequences with n1, n2, n3, n4, n5, and n6 templates provided most of the information in the six messages, with n1, n2, n3, and n5 providing almost perfect alignments (fig. S6C). This demonstrated the difficulty associated with blind sequencing of a MuSE communication without any prior knowledge of DNA contents. Even if the sequences of a DNA communication were identified after considerable time and expense, the contents of a communication would still be protected by the iKey, combination key, and decoy/non-coding sequences.
iKey-64 is designed to convert plaintext into a DNA-encodable language. If chromatogram patterning is desired to expedite data extraction, the codons may potentially be shuffled to enable 9.1 x 10^61 iKey-64 variants (table S2). Since chromatogram patterning requires iKey buttons to be categorized to reduce large homopolymeric stretches, the codon assignment is not perfectly random and the maximum number of potential iKeys cannot be achieved. However, if chromatogram patterning is not desired then a maximum of 64! = 1.3 x 10^89 iKey-64 variants exist, significantly increasing the security of encoded information far beyond a comparable 64-bit key for an online communication, which produces 2^64 = 1.8 x 10^19 variants. Nevertheless, data encoded using iKey-64 would still not be truly random due to the frequency of use for each button, but additional measures may be implemented to increase security: (i) Linguistics: principles of linguistics may be applied to the layout of iKeys to modify alphabets for DNA communication, introduce new grammatical rules, or create iKeys in different languages; (ii) Codons: increasing the number of nucleotides per codon can introduce redundancies in the buttons to adjust for character usage frequency; and (iii) Cryptography: plaintext may first be subject to advanced cryptographic algorithms. To illustrate, four-nucleotide codons can be used to create 256-button keyboards such as iKey-256 (fig. S7). When the number of buttons for each letter is adjusted to reflect its frequency in English text, the probability of using any one button for E would equal that of using any one button for Q. Similar redundancies may also be introduced for buttons representing numerals, grammar, and other user-defined functions. For instance, the frequency of numerals may be adjusted according to Benford's Law (24). Herein we have included a series of challenges for a community assessment of the security of iKey-encoded information [for iKey challenges see (20)].
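
The key-space figures quoted in this paragraph follow from elementary counting and can be reproduced directly. The sketch below only restates that arithmetic; the variable names are illustrative, and the 9.1 x 10^61 figure for patterned iKeys is not recomputed because it depends on the button categories in table S2.

```python
import math

# Unconstrained iKey-64: every permutation of 64 codons over 64 buttons.
ikey64_variants = math.factorial(64)

# A 64-bit binary key, for comparison.
key64_variants = 2 ** 64

# Equivalent key length of an unconstrained iKey-64, in bits.
ikey64_bits = math.log2(ikey64_variants)

# Buttons available with four-nucleotide codons (iKey-256).
ikey256_buttons = 4 ** 4

print(f"{ikey64_variants:.1e}")   # 1.3e+89
print(f"{key64_variants:.1e}")    # 1.8e+19
print(ikey256_buttons)            # 256
```

Expressed in bits, 64! corresponds to a key of roughly 296 bits, which gives a sense of scale for the comparison with a 64-bit key.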
To further increase the complexity of the iKey system, codons can be used to represent words, phrases, and characters present in different languages. Using such an approach, a single communication can include words from several languages to further complicate hacking attempts. In the case of English, it is estimated that the vocabulary of an educated native adult speaker consists of ~17,000 lemmas, while only 10 lemmas constitute 25% of common word usage (25, 26). The use of 8-nucleotide codons can generate iKeys with 65,536 buttons, sufficient to include all of the commonly used words in English as well as accommodate individual letters, numerals, grammatical characters, functional characters, and high frequency words. Theoretically, the iKey platform may be designed to incorporate the entire English language. The Oxford English Dictionary (OED), the most comprehensive record of the English language, contains 291,500 entries and a total of 615,100 word forms (27). Encoding all of the entries of the OED on an iKey, where 1 codon would encode 1 word, would require 10-nucleotide codons to generate a 1,048,576-button keyboard. Additionally, the entire text (entries and definitions) of the OED is composed of 59 million words and 350 million characters, resulting in 5.9 characters/word. Thus, encoding the average 6-character word with an iKey-64 (1 codon = 3 nucleotides/character) would require 18 nucleotides, while with an iKey-1,048,576, where each codon represents 1 word, only 10 nucleotides would be required, representing a 44% reduction in DNA requirements. Current linguistic principles are optimized for written communication, while encryption methodologies are designed with online communication in mind. As DNA data storage is gaining traction, writing on DNA need not abide by conventional grammatical rules or encryption algorithms. Opportunities exist for exploring DNA linguistic principles to devise new grammatical rules for transferring prose onto DNA.
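
The codon-length arithmetic in this paragraph can be verified with a short calculation. The sketch below uses only the figures quoted in the text; the helper name `codon_length_for` is hypothetical.

```python
import math


def codon_length_for(n_buttons):
    """Smallest codon length (in nucleotides) giving at least n_buttons codons."""
    length = 1
    while 4 ** length < n_buttons:
        length += 1
    return length


# Keyboard sizes mentioned in the text.
buttons_8nt = 4 ** 8      # 65,536
buttons_10nt = 4 ** 10    # 1,048,576

# One codon per OED entry (291,500 entries).
oed_codon_length = codon_length_for(291_500)

# Average word length from the OED text: 350 million characters / 59 million words.
chars_per_word = 350e6 / 59e6
avg_word_chars = math.ceil(chars_per_word)

# Nucleotides per average word: letter-by-letter versus one codon per word.
nt_letterwise = avg_word_chars * 3
nt_wordwise = oed_codon_length
reduction = 1 - nt_wordwise / nt_letterwise

print(f"{chars_per_word:.1f} characters/word")
print(f"{reduction:.0%} fewer nucleotides per word")
```

With these inputs the letter-by-letter encoding needs 18 nucleotides per average word and the word-level encoding needs 10, in line with the 44% reduction stated above.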
Offline communication with MuSE is best for niche applications, where small texts need to be shared between limited numbers of people. Ideally, extraction of information from such communications would occur in the field in real-time, an increasing possibility with technological advancements in portable Sanger and nanopore sequencers (28, 29). We have incorporated several tiers of security to protect information communicated via DNA. Nevertheless, situations are bound to arise where the security of communications may be compromised, either through human or machine errors. There is no such thing as a lock without a key, but as with any security system the objectives are simple: provide access to a select few and keep out everyone else (19).
References and Notes:
1. P. A. Carr, G. M. Church, Genome engineering. Nat. Biotechnol. 27, 1151-1162 (2009).
2. J. Clarke et al., Continuous base identification for single-molecule nanopore DNA sequencing. Nat. Nanotechnol. 4, 265-270 (2009).
3. C. T. Clelland, V. Risca, C. Bancroft, Hiding messages in DNA microdots. Nature. 399, 533-534 (1999).
4. C. Bancroft, T. Bowler, B. Bloom, C. T. Clelland, Long-term storage of information in DNA. Science. 293, 1763-1765 (2001).
5. M. Liss et al., Embedding permanent watermarks in synthetic genes. PLoS ONE. 7 (2012).
6. J. P. Cox, Long-term data storage in DNA. Trends Biotechnol. 19, 247-250 (2001).
7. L. Sennels, T. Bentin, To DNA, all information is equal. Artif. DNA, PNA & XNA. 3, 109-111 (2012).
8. N. Goldman et al., Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature. 494 (2013).
9. G. M. Church, Y. Gao, S. Kosuri, Next-generation digital information storage in DNA. Science. 337 (2012).
10. D. Haughton, F. Balado, BioCode: two biologically compatible algorithms for embedding data in non-coding and coding regions of DNA. BMC Bioinforma. 14, 121 (2013).
11. D. Heider, A. Barnekow, DNA-based watermarks using the DNA-Crypt algorithm. BMC Bioinforma. 8, 176 (2007).
12. D. Tulpan, C. Regoui, G. Durand, L. Belliveau, S. Leger, HyDEn: a hybrid steganocryptographic approach for data encryption using randomized error-correcting DNA codes. BioMed Res. Int. 2013 (2013).
13. T. Kawano, Run-length encoding graphic rules, biochemically editable designs and steganographical numeric data embedment for DNA-based cryptographical coding system. Commun. & Integr. Biol. 6 (2013).
14. A. Ekert, R. Renner, The ultimate physical limits of privacy. Nature. 507, 443-447 (2014).
15. A. Gehani, T. H. LaBean, J. H. Reif, DNA-based Cryptography. DNA Based Comput. V: Dimacs Workshop DNA Based Comput. V June 14-15, 1999 Mass. Inst. Technol. 54 (2000).
16. C. Mao, T. H. LaBean, J. H. Reif, N. C. Seeman, Logical computation using algorithmic self-assembly of DNA triple-crossover molecules. Nature. 407, 493-496 (2000).
17. M. Hirabayashi, H. Kojima, K. Oiwa, (Springer Japan, 2010), pp. 174-183.
18. M. Hirabayashi, H. Kojima, K. Oiwa, Effective algorithm to encrypt information based on self-assembly of DNA tiles. Nucleic Acids Symp. Ser. 53, 79-80 (2009).
19. N. Ferguson, B. Schneier, T. Kohno, Cryptography engineering: design principles and practical applications (Wiley Publishing, Inc., Indianapolis, 2010).
20. Information on materials and methods is available at the Science Web site.
21. K. V. Voelkerding, S. A. Dames, J. D. Durtschi, Next-generation sequencing: from basic research to diagnostics. Clin. Chem. 55, 641-658 (2009).
22. Oxford University Press. What is the frequency of the letters of the alphabet in English? Available at http://www.oxforddictionaries.com/us/words/what-is-the-frequency-of-the-letters-of-the-alphabet-in-english. (2014).
23. Bletchley Park Trust. Captain Ridley's Shooting Party. Available at http://www.bletchleypark.org.uk/content/hist/worldwartwo/captridley.rhtm. (2014).
24. A. D. Alves, H. H. Yanasse, N. Y. Soma, Benford's Law and articles of scientific journals: comparison of JCR and Scopus data. Scientometrics. 98, 173-184 (2014).
25. Oxford University Press. The OEC: Facts about the language. Available at http://www.oxforddictionaries.com/us/words/the-oec-facts-about-the-lang…. (2014).
26. R. Goulden, I. S. P. Nation, J. Read, How large can a receptive vocabulary be? Appl. Linguist. 11, 341-363 (1990).
27. Oxford University Press. Dictionary Facts. Available at http://public.oed.com/history-of-the-oed/dictionary-facts/. (2013).
28. D. Stoddart, A. J. Heron, E. Mikhailova, G. Maglia, H. Bayley, Single-nucleotide discrimination in immobilized DNA oligonucleotides with a biological nanopore. Proc. Natl. Acad. Sci. United States Am. 106, 7702-7707 (2009).
29. R. G. Blazej, P. Kumaresan, R. A. Mathies, Microfabricated bioprocessor for integrated nanoliter-scale Sanger DNA sequencing. Proc. Natl. Acad. Sci. United States Am. 103, 7240-7245 (2006).
Acknowledgments:
This work is sponsored by the Defense Advanced Research Projects Agency under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government. B.Z. conceived the study, performed the experiments, analyzed the data, and wrote the manuscript. All authors read and edited the manuscript. The authors declare competing financial interests. Reported data and archived materials are available in the laboratory of T.K.L.
[Fig. 1, panels A to C. Recoverable content of the figure, listed as key [codon]:]
(A) iKey-64 layout
Row 1: “ ” [CGG], 1 ! [AGC], 2 @ [TGA], 3 # [GAC], 4 $ [CGA], 5 % [ACG], 6 ^ [TAG], 7 & [GCT], 8 * [TCG], 9 ( [GCA], 0 ) [CTG], - _ [TTA], = + [GAA], end [GAT]
Row 2: start [ATC], Q [AAC], W [AAT], E [TCA], R [CGT], T [ATG], Y [CGC], U [GTG], I [GTA], O [ACT], P [TGT], { α [CTT], } β [TTG], \ γ [ATT]
Row 3: shift [CAC], A [TGC], S [CAT], D [TCT], F [GCG], G [GAG], H [CTC], J [CCA], K [TTC], L [AGA], : ; [TCC], ~ δ [ACC], enter [TGG]
Row 4: forward [AGG], Z [GGA], X [AAG], C [ATA], V [GTT], B [TAT], N [GTC], M [ACA], , < [CAG], . > [TAC], / ? [TAA], reverse [CAA]
Row 5: F1 to F4 [AAA, CCC, GGG, TTT], space1 [AGT], space2 [CTA]; the remaining five keys, including F5, carry GGT, GGC, CCT, GCC, and CCG
(B) iKey-64 buttons numbered 1 to 64 in reading order.
(C) iKey-64 written on DNA, one codon per button:
CGGAGCTGAGACCGAACGTAGGCTTCGGCACTGTTAGAAGATATCAAC
AATTCACGTATGCGCGTGGTAACTTGTCTTTTGATTCACTGCCATTCT
GCGGAGCTCCCATTCAGATCCACCTGGAGGGGAAAGATAGTTTATGTC
ACACAGTACTAACAAAAACCCGGGTTTAGTCTAGGCGGTCCTGCCCCG
Fig. 1. 64-button iKey for chromatogram patterning. (A) iKey-64, used to convert plaintext to codons for DNA transcription. Messages begin with "start" and finish with "end"; "forward" and "reverse" provide information on the strand containing the desired message; and "space1" and "space2" may be used to produce troughs in chromatograms. Codons can be randomized to produce one-time iKeys. (B) iKey-64 buttons and codons were numbered to transcribe the keyboard onto a strand of DNA. (C) iKey-64 transcribed on DNA. Codons were flanked by 10 Ts to separate the start and end of the keyboard from surrounding DNA for identification, marked by red lines.
[Fig. 2, panels A to E; chromatogram images not reproduced. (A) MuSE schematic with DNA-1, DNA-2, and DNA-1+2 traces. (B) MIT message layout: flanking DNA, variable DNA, message, space1/2. (D) Forward and reverse sequencing of DNA-1, DNA-2, and DNA-1+2; x-axis: length (bp), 1 to 1,000. (E) DNA-1 : DNA-2 ratios.]
(C) MIT message: Massachusetts Institute Technology
DNA-1 ATCCACACATGCCATCATTGCATACTCGTGCATTCAATGATGCATAGTCACGTAGTCCATATGGTAATGGTGATGTCAAGTCACATGTCAATACTCGTCACTAGAACTGAGCGCGAT
DNA-2 ATCCACACATGCCATCATTGCATACTCGTGCATTCAATGATGCATCTACACGTAGTCCATATGGTAATGGTGATGTCACTACACATGTCAATACTCGTCACTAGAACTGAGCGCGAT
Fig. 2. Chromatogram patterning with MuSE. (A) Schematic for chromatogram patterning. When two DNA strands are co-sequenced via Sanger sequencing, different overlapping nucleotides produce small peaks while identical ones produce large peaks. Peaks are kept in alignment via iKey-64. (B) Schematic of chromatogram patterning for the message "Massachusetts Institute Technology" via MuSE. (C) Sequence of "Massachusetts Institute Technology" used in B and encoded with iKey-64. (D) When DNA-1+2 are co-sequenced at equal concentrations with a common primer (green arrows), chromatogram patterning is achieved during reverse (PrimerExternalRv) but not forward (PrimerExternalFw) sequencing due to the flanking variable DNA regions. (E) Chromatogram patterning can be tuned to discreetly embed information in sequencing data by varying the ratios of DNA-1 (red) and DNA-2 (black). Red lines surround embedded messages.
[Fig. 3, panels A and B; keyboard and chromatogram images not reproduced. (A) iKey-64 and the combinatorial message: six 525 bp strands, n1 to n6, carrying watermarks, the key "Pascal's triangle: d2r6-reverse", and message or decoy words, with sites for PrimerExternal, PrimerKey, and PrimerMessage, shown beside Pascal's triangle with its rows and diagonals numbered. Message words: Bletchley / Park: GC&CS / Codebreakers. Decoy words: Captain / Ridley's / Shooting Party. Legend: external DNA, watermark, key, message, decoy. (B) Matrix of strands n1 to n6 against n1 to n6, with message and decoy pairs highlighted.]
Fig. 3. Combinatorial message depicting a WWII communication. (A) iKey-64 was used to transcribe watermarks, a key, a message, and a decoy between six DNA strands. If strands are co-sequenced according to the key (Pascal's triangle on left) with the appropriate primers, then the correct communication would be revealed. (B) Chromatograms of an n1 x n6 matrix of strands tuned and co-sequenced with PrimerMessage. Chromatogram patterning is not achieved when incorrect pairs are co-sequenced. Boxes highlight patterns that communicate either the message (green) or decoy (red).
Table 1. Combinatorial message readouts. Tuning and co-sequencing of multiple DNA strands reveals a variety of messages depending on the primers used and the order of strands co-sequenced. Boxes highlight patterned messages.
[Table columns: DNA | Primer | Chromatogram | Readout. Readouts listed: "Pascal's triangle: d2r6-reverse"; "Bletchley", "Park: GC&CS", "Codebreakers"; "Captain", "Ridley's", "Shooting Party". Chromatogram images not reproduced.]
Supplementary Materials:
Materials and Methods
Technology Overview
iKey Encryption: Challenges
Figures S1-S7
Tables S1-S6