A Natural Language Approach to Automated Cryptanalysis of Two-Time Pads

Joshua Mason (josh@cs.jhu.edu), Kathryn Watkins (kwatkins@jhu.edu), Jason Eisner (jason@cs.jhu.edu), Adam Stubblefield (astubble@cs.jhu.edu)
Johns Hopkins University
ABSTRACT

While keystream reuse in stream ciphers and one-time pads has been a well-known problem for several decades, the risk to real systems has been underappreciated. Previous techniques have relied on being able to accurately guess words and phrases that appear in one of the plaintext messages, making it far easier to claim that "an attacker would never be able to do that." In this paper, we show how an adversary can automatically recover messages encrypted under the same keystream if only the type of each message is known (e.g. an HTML page in English). Our method, which is related to HMMs, recovers the most probable plaintext of this type by using a statistical language model and a dynamic programming algorithm. It produces up to 99% accuracy on realistic data and can process ciphertexts at 200ms per byte on a $2,000 PC. To further demonstrate the practical effectiveness of the method, we show that our tool can recover documents encrypted by Microsoft Word 2002 [22].
Categories and Subject Descriptors

E.3 [Data]: Data Encryption

General Terms

Security

Keywords

Keystream reuse, one-time pad, stream cipher
1 Introduction

Since their discovery by Gilbert Vernam in 1917 [20], stream ciphers have been a popular method of encryption. In a stream cipher, the plaintext, p, is exclusive-ored (xored) with a keystream, k, to produce the ciphertext, p ⊕ k = c. A special case arises when the keystream is truly random: the cipher is known as a one-time pad, proved unbreakable by Shannon [18].
It is well known that the security of stream ciphers rests on never reusing the keystream k [9]. For if k is reused to encrypt two different plaintexts, p and q, then the ciphertexts p ⊕ k and q ⊕ k can be xored together to recover p ⊕ q. The goal of this paper is to complete this attack by recovering p and q from p ⊕ q. We call this the "two-time pad problem."
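A minimal Java sketch makes the cancellation concrete (this is our own illustration, not the paper's tool; the messages and names are invented):

    import java.security.SecureRandom;
    import java.util.Arrays;

    public class TwoTimePadLeak {
        static byte[] xor(byte[] a, byte[] b) {
            byte[] out = new byte[Math.min(a.length, b.length)];
            for (int i = 0; i < out.length; i++) out[i] = (byte) (a[i] ^ b[i]);
            return out;
        }

        public static void main(String[] args) {
            byte[] p = "attack at dawn".getBytes();
            byte[] q = "retreat now!!!".getBytes();   // same length as p
            byte[] k = new byte[p.length];
            new SecureRandom().nextBytes(k);          // a perfectly random pad...
            byte[] c1 = xor(p, k);                    // ...but used twice
            byte[] c2 = xor(q, k);
            // The keystream cancels, leaving the xor of the two plaintexts:
            System.out.println(Arrays.equals(xor(c1, c2), xor(p, q)));  // true
        }
    }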
In this paper we present an automated method for recovering p and q given only the "type" of each file. More specifically, we assume that p and q are drawn from some known probability distributions. For example, p might be a Word document and q might be an HTML web page. The probability distributions can be built from a large corpus of examples of each type (e.g. by mining the Web for documents or web pages). Given the probability distributions, we then transform the problem of recovering p and q into a "decoding" problem that can be solved using some modified techniques from the natural language processing community. Our results show that the technique is extremely effective on realistic datasets (more than 99% accuracy on some file types) while remaining efficient (200ms per recovered byte).

Our attack on two-time pads has practical consequences. Proofs that keystream reuse leaks information have not stopped system designers from reusing keystreams. A small sampling of the systems so affected includes Microsoft Office [22], 802.11 WEP [3], WinZip [11], PPTP [17], and intercepted Soviet diplomatic, military, and intelligence communications [2, 21]. We do not expect that this problem will disappear any time soon: indeed, since NIST has endorsed the CTR mode for AES [7], effectively turning a block cipher into a stream cipher, future systems that might otherwise have used CBC with a constant IV may instead reuse keystreams. The WinZip vulnerability is already of this type.

To demonstrate this practicality more concretely, we show that our tool can be used to recover documents encrypted by Microsoft Word 2002. The vulnerability we focus on was known before this work [22], but could not be exploited effectively.
1.1 Prior Work

Perhaps the most famous attempt to recover plaintexts that have been encrypted with the same keystream is the National Security Agency's VENONA project [21]. The NSA's forerunner, the Army's Signal Intelligence Service, noticed that some encrypted Soviet telegraph traffic appeared to reuse keystream material. The program to reconstruct the messages' plaintext began in 1943 and did not end until 1980. Over 3,000 messages were at least partially recovered. The project was partially declassified in 1995, and many of the decryptions were released to the public [2]. However, the ciphertexts and cryptanalytic methods remain classified.
There is a "classical" method of recovering p and q from p ⊕ q when p and q are known to be English text. First guess a word likely to appear in the messages, say "the". Then, attempt to xor "the" with each length-3 substring of p ⊕ q. Wherever the result is something that "looks like" English text, chances are that one of the messages has "the" in that position and the other message has the result of the xor. By repeating this process many times, the cryptanalyst builds up portions of plaintext. This method was somewhat formalized by Rubin in 1978 [15].
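The following Java sketch illustrates one pass of this classical "crib dragging" (our own simplified rendering, not a historical tool; the printability test is a crude stand-in for "looks like English text"):

    public class CribDrag {
        /** Crude stand-in for "looks like English text": printable ASCII only. */
        static boolean looksLikeText(byte[] s) {
            for (byte b : s)
                if (b < 0x20 || b > 0x7e) return false;
            return true;
        }

        /** Slide the guessed word across x = p xor q; print plausible offsets. */
        static void drag(byte[] x, byte[] crib) {
            for (int off = 0; off + crib.length <= x.length; off++) {
                byte[] other = new byte[crib.length];
                for (int i = 0; i < crib.length; i++)
                    other[i] = (byte) (x[off + i] ^ crib[i]);
                if (looksLikeText(other))   // maybe 'crib' sits here in p or q
                    System.out.println(off + ": " + new String(other));
            }
        }
    }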
In 1996, Dawson and Nielsen [5] created a program that uses a series of heuristic rules to automatically attempt this style of decryption. They simplified matters by assuming that the plaintexts used only 27 characters of the 256-character ASCII set: the 26 English uppercase letters and the space. Given p ⊕ q, this assumption allowed them to unambiguously recover non-coinciding spaces in p and q, since in ASCII, an uppercase letter xored with a space cannot be equal to any two uppercase letters xored together. They further assumed that two characters that xored to 0 were both equal to space, the most common character. To decode the words between the recovered spaces, they employed lists of common words of various lengths (and a few "tricks"). They chose to test their system by running it on subsets of the same training data from which they had compiled their common-word lists (a preprocessed version of the first 600,000 characters of the English Bible). They continued adding new tricks and rules until they reached the results shown in Figure 1. It is important to note that the rules they added were specifically designed to get good results on the examples they were using for testing (hence are not guaranteed to work as well on other examples).

We were able to re-attempt Dawson and Nielsen's experiments on the King James Bible¹ using the new methodology described in this paper, without any special tuning or tricks. Dawson and Nielsen even included portions of all three test passages they used, so our comparison is almost completely apples-to-apples. Our results are compared with theirs in Figure 1.

¹ We used the Project Gutenberg edition, which matches the excerpts from [5], available at www.gutenberg.org/dirs/etext90/kjv10.txt
2 Our Method

Instead of layering on heuristic after heuristic to recover specific types of plaintext, we take a more principled and general approach. Let x be the known xor of the two ciphertexts. A feasible solution to the two-time pad problem is a string pair (p, q) such that p ⊕ q = x. We assume that p and q were independently drawn from known probability distributions Pr1 and Pr2, respectively. We then seek the most probable of the feasible solutions: the (p, q) that maximizes Pr1(p) · Pr2(q).

To define Pr1 and Pr2 in advance, we adopt a parametric model of distributions over plaintexts, known as a language model, and estimate its parameters from known plaintexts in each domain. For example, if p is known to be an English webpage, we use a distribution Pr1 that has previously been fit against a corpus (naturally occurring collection) of English webpages. The parametric form we adopt for Pr1 and Pr2 is such that an exact solution to our search problem is tractable.

This kind of approach is widely used in the speech and natural language processing community, where recovering the most probable plaintext p given a speech signal x is actually known as "decoding."² We borrow some well-known techniques from that community: smoothed n-gram language models, along with dynamic programming (the "Viterbi decoding" algorithm) to find the highest-probability path through a hidden Markov model [14].
2.1 Smoothed n-gram Language Models

If the plaintext string p = (p_1, p_2, ..., p_ℓ) is known to have length ℓ, we wish Pr1 to specify a probability distribution over strings of length ℓ. In our experiments, we simply use an n-gram character language model (taking n = 7), which means defining

    Pr1(p) = ∏_{i=1}^{ℓ} Pr1(p_i | p_{i∸n+1}, ..., p_{i−2}, p_{i−1})    (1)

where i ∸ n denotes max(i − n, 0). In other words, the character p_i is assumed to have been chosen at random, where the random choice may be influenced arbitrarily by the previous n − 1 characters (or i − 1 characters if i < n), but is otherwise independent of previous history. This independence assumption is equivalent to saying that the string p is generated from an (n−1)st-order Markov process.

Equation (1) is called an n-gram model because the numerical factors in the product are derived, as we will see, from statistics on substrings of length n. One obtains these statistics from a training corpus of relevant texts. Obviously, in practice (and in our experiments) one must select this corpus without knowing the plaintexts p and q.³ However, one may have side information about the type of plaintext ("genre"). One can create a separate model for each type of plaintext that one wishes to recover (e.g. English corporate email, Russian military orders, Klingon poetry in Microsoft Word format). For example, our HTML language model was derived from a training corpus that we built by searching Google on common English words and crawling the search results.
² That problem also requires knowing the distribution Pr(x | p), which characterizes how text strings tend to be rendered as speech. Fortunately, in our situation, the comparable probability Pr(x | p, q) is simply 1, since the observed x is a deterministic function (namely xor) of p, q. Our method could easily be generalized for imperfect (noisy) eavesdropping by modeling this probability differently.
³ It would be quite possible in future work, however, to choose or build language models based on information about p and q that our methods themselves extract from x. A simple approach would try several choices of (Pr1, Pr2) and use the pair that maximizes the probability of observing x. More sophisticated and rigorous approaches based on [1, 6] would use the imperfect decodings of p and q to re-estimate the parameters of their respective language models, starting with a generic language model and optionally iterating until convergence. Informally, the insight here is that the initial decodings of p and q, particularly in portions of high confidence, carry useful information about (1) the genres of p and q (e.g., English email), (2) the particular topics covered in p and q (e.g., oil futures), and (3) the particular n-grams that tend to recur in p and q specifically. For example, for (2), one could use a search-engine query to retrieve a small corpus of documents that appear similar to the first-pass decodings of p and q, and use them to help build "story-specific" language models Pr1 and Pr2 [10] that better predict the n-grams of documents on these topics and hence can retrieve more accurate versions of p and q on a second pass.
(a)
              Correct pair recovered   Incorrect pair recovered   Not decrypted
              [5]      This work       [5]      This work         [5]      This work
    P0 ⊕ P1   62.7%    100.0%          17.8%    0%                20.5%    0%
    P1 ⊕ P2   61.5%    99.99%          17.6%    0.01%             20.9%    0%
    P2 ⊕ P0   62.6%    99.96%          17.9%    0.04%             19.5%    0%

(b)
              Correct when keystream   Incorrect when keystream   Not decrypted
              used three times         used three times
              [5]      This work       [5]      This work         [5]      This work
    P0        75.2%    100.0%          12.3%    0%                12.5%    0%
    P1        76.3%    100.0%          11.4%    0%                12.3%    0%
    P2        75.4%    100.0%          11.8%    0%                12.8%    0%
Figure 1: These tables show a comparison between previous work [5] and this work. All results presented for previous work are directly from [5]. Both systems were trained on the exact same dataset (the first 600,000 characters of the King James Version of the Bible, specially formatted as in [5]: all punctuation other than spaces was removed and all letters were converted to upper case) and were tested on the same three plaintexts (those used in [5], which were included in the training set). Unlike the prior work, our system was tuned automatically on the training set, and not tuned at all for the test set. (a) The first table shows the results of recovering the plaintexts from the listed xor combinations. The reported percentages show the recovery status for the pair of characters in each plaintext position, without necessarily being in the correct plaintext. For example, the recovered P0 could contain parts of P0 and parts of P1. (b) The second table shows the results when the same keystream is used to encrypt all three files, and P0 ⊕ P1 and P1 ⊕ P2 are fed as inputs to the recovery program simultaneously. Here the percentages show whether a character was correctly recovered in the correct file.
It is tempting to define Pr1(s | h,o,b,n,o,b) as the fraction of occurrences of hobnob in the Pr1 training corpus that were followed by s: namely c(hobnobs)/c(hobnob?), where c(...) denotes count in the training corpus and ? is a wildcard. Unfortunately, even for a large training corpus, such a fraction is often zero (an underestimate!) or undefined. Even positive fractions are unreliable if the denominator is small. One should use standard "smoothing" techniques from natural language processing to obtain more robust estimates from corpora of finite size.

Specifically, we chose parametric Witten-Bell backoff smoothing, which is about the state of the art for n-gram models [4]. This method estimates the 7-gram probability by interpolating between the naive count ratio above and a recursively smoothed estimate of the 6-gram probability Pr1(s | o,b,n,o,b). The latter, known as a "backed-off" estimate, is less vulnerable to low counts because shorter contexts such as obnob pick up more (albeit less relevant) instances. The interpolation coefficient favors the backed-off estimate if observed 7-grams of the form hobnob? have a low count on average, indicating that the longer context hobnob is insufficiently observed.
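The backoff recursion can be sketched as follows in Java. This is our own simplified illustration of the interpolation idea, not LingPipe's exact parametric Witten-Bell formula: the count maps, the hyperparameter K, and the add-one base case are all assumptions of the sketch. The logProb method implements equation (1).

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch of a backoff-smoothed character n-gram model. */
    public class BackoffLM {
        final Map<String, Long> counts = new HashMap<>();      // c(substring)
        final Map<String, Integer> distinct = new HashMap<>(); // distinct continuations
        final double K = 5.0;           // interpolation hyperparameter (ours)
        final int alphabetSize = 256;

        /** Pr(c | context), interpolating toward shorter contexts. */
        double prob(char c, String context) {
            if (context.isEmpty()) {
                long num = counts.getOrDefault(String.valueOf(c), 0L);
                long den = counts.getOrDefault("", 0L);    // total characters trained
                return (num + 1.0) / (den + alphabetSize); // add-one base case
            }
            double backoff = prob(c, context.substring(1)); // drop oldest character
            long ctx = counts.getOrDefault(context, 0L);
            int d = distinct.getOrDefault(context, 0);
            if (ctx == 0 || d == 0) return backoff;         // context never observed
            long ext = counts.getOrDefault(context + c, 0L);
            double lambda = ctx / (ctx + K * d);            // trust the long context?
            return lambda * ((double) ext / ctx) + (1 - lambda) * backoff;
        }

        /** log Pr(p) as in equation (1): contexts truncated to the last n-1 chars. */
        double logProb(String p, int n) {
            double lp = 0;
            for (int i = 0; i < p.length(); i++)
                lp += Math.log(prob(p.charAt(i), p.substring(Math.max(0, i - n + 1), i)));
            return lp;
        }
    }

Training (filling counts and distinct from a corpus) is omitted; any substring counter suffices for the sketch.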
Notice that the first factor in equation (1) is simply Pr(p_1), which considers no contextual information at all. This is appropriate if p is an arbitrary packet that might come from the middle of a message. If we know that p starts at the beginning of a message, we prepend a special character bom to it, so that p_1 = bom. Since p_2, ..., p_n are all conditioned on p_1 (among other things), their choice will reflect this beginning-of-message context. Similarly, if we know that p ends at the end of a message, we append a special character eom, which will help us correctly reconstruct the final characters of an unknown plaintext p. Of course, for these steps to be useful, the messages in the training corpus must also contain bom and eom characters. Our experiments only used the bom character.
2.2 Finite-State Language Models

Having estimated our probabilities, we can regard the 7-gram language model Pr1 defined by equation (1) as a very large edge-labeled directed graph, G1, which is illustrated in Figure 2a. Each vertex or "state" of G1 represents a context (not necessarily observed in training data) such as the 6-gram "not se".

Sampling a string of length ℓ from Pr1 corresponds to a random walk on G1. When the random walk reaches some state, such as hobnob, it next randomly follows an outgoing edge; for instance, it chooses the edge labeled s with independent probability Pr1(s | h,o,b,n,o,b). Following this edge generates the character s and arrives at a new 6-gram context state, obnobs. Note that the h has been safely forgotten since, by assumption, the choice of the next edge depends only on the 6 most recently generated characters. Our random walk is defined to start at the empty, 0-gram context, representing ignorance; it proceeds immediately through 1-gram, 2-gram, ... contexts until it enters the 6-gram contexts and continues to move among those. The probability of sampling a particular string p by this process, Pr1(p), is the probability of the (unique) path labeled with p. (A path's label is defined as the concatenation of its edges' labels, and its probability is defined as the product of its edges' probabilities.)

In effect, we have defined Pr1 using a probabilistic finite-state automaton.⁴
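A short Java sketch of this random walk, reusing the hypothetical BackoffLM from the previous sketch (sampling by inverting the cumulative distribution over the 256 possible next characters):

    import java.util.Random;

    /** Sketch: sample a length-len string by a random walk on G1. */
    public class RandomWalkSampler {
        static String sample(BackoffLM lm, int len, int n, Random rng) {
            StringBuilder out = new StringBuilder();
            String state = "";                       // the empty 0-gram context
            for (int t = 0; t < len; t++) {
                double u = rng.nextDouble(), cum = 0;
                int c = 255;
                for (int b = 0; b < 256; b++) {      // invert the CDF over next chars
                    cum += lm.prob((char) b, state);
                    if (cum >= u) { c = b; break; }
                }
                out.append((char) c);                // emit the edge's label...
                state += (char) c;                   // ...and move to the next state,
                if (state.length() > n - 1)          // forgetting all but the last
                    state = state.substring(1);      // n-1 characters
            }
            return out.toString();
        }
    }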
In fact, our attack would work for any language models Pr1, Pr2 defined in this way, not just n-gram language models. In the general finite-state case, different states could remember different amounts of context, or nonlocal context such as a "region" in a document. For example, n-gram probabilities might be significantly different in a message header vs. the rest of the message, or in an HTML table vs. the rest of the HTML document. Beyond remembering the previous n − 1 characters of context, a state can remember whether the previous context includes a <table> tag that has not yet been closed with </table>. Useful nonlocal properties of the context can be manually hard-coded into the FSA, or learned automatically from a corpus [1].

⁴ Except that G1 does not have final states; we simply stop after generating ℓ characters, where ℓ is given. This is related to our treatment of bom and eom.
[Figure 2 appears here: four graph fragments, panels (a) through (d); the node and arc labels are described in the caption below.]
Figure 2: Fragments of the graphs built lazily by our algorithm. (a) shows G1, which defines Pr1. If we are ever in the state hobnob (a matter that is yet to be determined), then the next character is most likely to be b, s, space, or punctuation, as reflected in arc probabilities not shown, though it could be anything. (b) similarly shows G2. inconc is most likely to be followed by e, i, l, o, r, or u. (c) shows X, a straight-line automaton that encodes the observed stream x = p xor q. The figure shows the unlikely case where x = (..., 1, 1, 1, 1, 1, 1, 1, ...): thus all arcs in X are labeled with pairs (p_i, q_i) such that p_i ⊕ q_i = x_i = 1. All paths have length |x|. (d) shows Gx. This produces exactly the same pair sequences of length |x| as X does, but the arc probabilities now reflect the product of the two language models, requiring more and richer states. (16, hobnob, inconc) is a reachable state in our example since hobnob ⊕ inconc = 111111. Of the 256 arcs (p_17, q_17) leaving this state, the only reasonably probable one is (s, r), since both factors of its probability Pr1(s | hobnob) · Pr2(r | inconc) are reasonably large. Note, however, that our algorithm might choose a less probable arc (from this state or from a competing state also at time 16) in order to find the globally best path of Gx that it seeks.
2.3 Cross Product of Language Models

We now move closer to our goal by constructing the joint distribution Pr(p, q). Recall our assumption that p and q are sampled independently from the genre-specific probability distributions Pr1 and Pr2. It follows that Pr(p, q) = Pr1(p) · Pr2(q). Replacing Pr1(p) by its definition (1) and Pr2(q) by its similar definition, and rearranging the factors, it follows that

    Pr(p, q) = ∏_{i=1}^{ℓ} Pr(p_i, q_i | p_{i∸n+1}, ..., p_{i−2}, p_{i−1}, q_{i∸n+1}, ..., q_{i−2}, q_{i−1})    (2)

where

    Pr(p_i, q_i | p_{i∸n+1}, ..., p_{i−1}, q_{i∸n+1}, ..., q_{i−1})
        = Pr1(p_i | p_{i∸n+1}, ..., p_{i−1}) · Pr2(q_i | q_{i∸n+1}, ..., q_{i−1})    (3)

We can regard equation (2) as defining an even larger graph, G (similar to Figure 2d), which may be constructed as the cross product of G1 = (V1, E1) and G2 = (V2, E2). That is, G = (V1 × V2, E), where E contains the labeled edge

    (u1, u2)  --(char1, char2) : prob1 · prob2-->  (v1, v2)

iff E1 contains u1 --char1 : prob1--> v1 and E2 contains u2 --char2 : prob2--> v2. The weight prob1 · prob2 of this edge is justified by (3). Again, we never explicitly construct this enormous graph, which has more than 256^14 edges (for our situation of n = 7 and a character set of size 256).

This construction of G is similar to the usual construction for intersecting finite-state automata [8], the difference being that we obtain a (weighted) automaton over character pairs. The construction would still apply even if, as suggested at the end of the previous section, we used finite-state language models other than n-gram models. It is known as the "same-length cross-product construction."
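Because the cross product is built lazily, a state need only be represented as a pair of contexts, with its outgoing edges generated on demand. A minimal Java sketch of this representation, reusing the hypothetical BackoffLM from section 2.1 (our own names, not the paper's code):

    /** Sketch of the lazily built cross product G (equations (2)-(3)). */
    public class CrossProduct {
        final BackoffLM lm1, lm2;   // the two genre-specific language models
        final int n;

        CrossProduct(BackoffLM lm1, BackoffLM lm2, int n) {
            this.lm1 = lm1; this.lm2 = lm2; this.n = n;
        }

        /** Weight of the edge labeled (c1, c2) out of state (ctx1, ctx2): eq. (3). */
        double edgeWeight(String ctx1, String ctx2, char c1, char c2) {
            return lm1.prob(c1, ctx1) * lm2.prob(c2, ctx2);
        }

        /** Follow that edge: each context shifts independently. */
        String[] step(String ctx1, String ctx2, char c1, char c2) {
            return new String[] { shift(ctx1, c1), shift(ctx2, c2) };
        }

        private String shift(String ctx, char c) {   // keep the last n-1 characters
            String t = ctx + c;
            return t.length() > n - 1 ? t.substring(1) : t;
        }
    }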
2.4 Constructing and Searching the Space of Feasible Solutions

Given x of length ℓ, the feasible solutions (p, q) correspond to the paths through G that are compatible with x. A path e_1 e_2 ... e_ℓ is compatible with x if for each 1 ≤ i ≤ ℓ, the edge e_i is labeled with some (p_i, q_i) such that p_i ⊕ q_i = x_i. As a special case, if p_i and/or q_i is known to be the special character bom or eom, then p_i ⊕ q_i is unconstrained (indeed undefined).
We now construct a new weighted graph, Gx, that represents just the feasible paths through G. All these paths have length ℓ, so Gx will be acyclic. We will then find the most probable path in Gx and read off its label (p, q).

The construction is simple. Gx, shown in Figure 2d, contains precisely all edges of the form

    (i−1, (p_{i∸n+1}, ..., p_{i−1}), (q_{i∸n+1}, ..., q_{i−1}))
        --(p_i, q_i) : prob-->  (i, (p_{i∸n+2}, ..., p_i), (q_{i∸n+2}, ..., q_i))    (4)

such that p_j ⊕ q_j = x_j for each j ∈ [i ∸ n + 1, i], and

    prob = Pr1(p_i | p_{i∸n+1}, ..., p_{i−2}, p_{i−1}) · Pr2(q_i | q_{i∸n+1}, ..., q_{i−2}, q_{i−1}).
Gx may also be obtained in finite-state terms as follows. We represent x as a graph X (Figure 2c) with vertices 0, 1, ..., ℓ. From vertex i−1 to vertex i, we draw 256 edges,⁵ labeled with the 256 (p_i, q_i) pairs that are compatible with x_i, namely (0, 0 ⊕ x_i), ..., (255, 255 ⊕ x_i). We then compute Gx = (Vx, Ex) by intersecting X with the language-pair model G as one would intersect finite-state automata. This is like the cross-product construction of the previous section, except that here, the edge set Ex contains the edge (i−1, u) --(char1, char2) : 1 · prob--> (i, v) iff the edge set of X contains (i−1) --(char1, char2) : 1--> i and E contains u --(char1, char2) : prob--> v.
Using dynamic programming, it is now possible in O(ℓ) time to obtain our decoding by finding the best length-ℓ path of Gx from the initial state (0, (), ()). Simply run a single-source shortest-path algorithm to find the shortest path to any state of the form (ℓ, ...), taking the length of each edge to be the negative logarithm of its probability, so that minimizing the sum of lengths is equivalent to maximizing the product of probabilities.⁶ It is not even necessary to use the full Dijkstra's algorithm with a priority queue, since Gx is acyclic. Simply iterate over the vertices of Gx in increasing order of i, and compute the shortest path to each vertex (i, ...) by considering its incoming arcs from vertices (i−1, ...) and the shortest paths to those vertices. This is known as the Viterbi algorithm; it is guaranteed to find the optimal path.

The trouble is the size of Gx. On the upside, because x_j constrains the pair (p_j, q_j) in equation (4), there are at most ℓ · 256^6 states and ℓ · 256^7 edges in Gx (not ℓ · 256^12 and ℓ · 256^14). Unfortunately, this is still an astronomical number. It can be reduced somewhat if Pr1 or Pr2 places hard restrictions on characters or character sequences in p and q, so that some edges have probability 0 and can be omitted. As a simple example, perhaps it is known that each (p_j, q_j) must be a pair of printable (or even alphanumeric) characters for which p_j ⊕ q_j = x_j. However, additional techniques are usually needed.

Our principal technique at present is to prune Gx drastically, sacrificing the optimality guarantee of the Viterbi algorithm. In practice, as soon as we construct the states (i, ...) at time i, we determine the shortest path from the initial state to each, just as above. But we then keep only the 100 best of these time-i states according to this metric (less greedy than keeping only the 1 best!), so that we need to construct at most 100 · 256 states at time i + 1. These are then evaluated and pruned again, and the decoding proceeds. More sophisticated multi-pass or A* techniques are also possible, although we have not implemented them.⁷

⁵ Each edge has weight 1 for purposes of weighted intersection or weighted cross-product. This is directly related to footnote 2.

⁶ Using logarithms also prevents underflow.

⁷ If we used our metric to prioritize exploration of Gx instead of pruning it, we would obtain A* algorithms (known in the speech and language processing community as "stack decoders"). In the same A* vein, the metric's accuracy can be improved by considering right context as well as left: one can add an estimate of the shortest path from (i, ...) through the remaining ciphertext to the final state. Such estimates can be batch-computed quickly by decoding x from end to beginning using smaller, m-gram language models (m < n).
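To make the pruned search concrete, here is a condensed Java sketch of the beam-pruned Viterbi decoding just described, again built on the hypothetical BackoffLM from section 2.1. The names and simplifications are ours, not the authors' implementation; it scores in log-space and keeps backpointers so that (p, q) can be read off at the end.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Sketch of the beam-pruned Viterbi search over Gx (equation (4)). */
    public class TwoTimePadDecoder {
        static final int BEAM = 100;   // the paper keeps the 100 best states per time i

        static class Hyp {             // one surviving state of Gx, with backpointer
            final String ctx1, ctx2;   // last (up to) n-1 characters of p and of q
            final double score;        // log-probability of the best path reaching it
            final Hyp back;
            final char c1, c2;         // labels on the arc that reached this state
            Hyp(String ctx1, String ctx2, double score, Hyp back, char c1, char c2) {
                this.ctx1 = ctx1; this.ctx2 = ctx2; this.score = score;
                this.back = back; this.c1 = c1; this.c2 = c2;
            }
        }

        static String[] decode(byte[] x, BackoffLM lm1, BackoffLM lm2, int n) {
            List<Hyp> beam = new ArrayList<>();
            beam.add(new Hyp("", "", 0.0, null, '\0', '\0')); // state (0, (), ())
            for (byte xi : x) {
                Map<String, Hyp> best = new HashMap<>(); // best hypothesis per state
                for (Hyp h : beam) {
                    for (int b = 0; b < 256; b++) {      // the 256 pairs with c1^c2 = xi
                        char c1 = (char) b, c2 = (char) (b ^ (xi & 0xff));
                        double s = h.score + Math.log(lm1.prob(c1, h.ctx1))
                                           + Math.log(lm2.prob(c2, h.ctx2));
                        String k1 = shift(h.ctx1, c1, n), k2 = shift(h.ctx2, c2, n);
                        String key = k1 + '\0' + k2;
                        Hyp old = best.get(key);         // Viterbi: keep max per state
                        if (old == null || s > old.score)
                            best.put(key, new Hyp(k1, k2, s, h, c1, c2));
                    }
                }
                List<Hyp> next = new ArrayList<>(best.values());
                next.sort((h1, h2) -> Double.compare(h2.score, h1.score));
                beam = next.subList(0, Math.min(BEAM, next.size())); // prune to 100
            }
            StringBuilder p = new StringBuilder(), q = new StringBuilder();
            for (Hyp h = beam.get(0); h.back != null; h = h.back) {
                p.append(h.c1); q.append(h.c2);   // read the label off backwards
            }
            return new String[] { p.reverse().toString(), q.reverse().toString() };
        }

        static String shift(String ctx, char c, int n) {
            String t = ctx + c;
            return t.length() > n - 1 ? t.substring(1) : t;
        }
    }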
2.5 Multiple Reuse

If a keystream k is used more than twice, the method works even better. Assume we now have three plaintexts to recover, p, q, and r, and are given p ⊕ q and p ⊕ r (note that q ⊕ r adds no further information). A state of G or Gx now includes a triple of language model states, and an edge probability is a product of 3 language model probabilities. The Viterbi algorithm can be used as before to find the best path through this graph given a pair of outputs (those corresponding to p ⊕ q and p ⊕ r). Of course, this technique can be extended beyond three plaintexts in a similar fashion.
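A sketch of the bookkeeping, again with our own hypothetical names: given x1 = p ⊕ q and x2 = p ⊕ r, guessing p_i forces both q_i and r_i, so there are still only 256 choices per position, but each arc's score gains a third language-model factor.

    /** Sketch: scoring one arc when the keystream was used three times. */
    public class TripleScore {
        static double arcLogProb(BackoffLM lmP, BackoffLM lmQ, BackoffLM lmR,
                                 String ctxP, String ctxQ, String ctxR,
                                 int pi, byte x1i, byte x2i) {
            char qi = (char) (pi ^ (x1i & 0xff));   // forced by x1 = p xor q
            char ri = (char) (pi ^ (x2i & 0xff));   // forced by x2 = p xor r
            return Math.log(lmP.prob((char) pi, ctxP))
                 + Math.log(lmQ.prob(qi, ctxQ))     // three language-model factors,
                 + Math.log(lmR.prob(ri, ctxR));    // one per plaintext
        }
    }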
3 Implementation

Our implementation of the probabilistic plaintext recovery can be separated cleanly into two distinct phases. First, language models are built for each of the types of plaintext that will be recovered. This process only needs to occur once per type of plaintext, since the resulting model can be reused whenever a new plaintext of that type needs to be recovered. The second phase is the actual plaintext recovery.

All our model building and cracking experiments were run on a commodity Dell server (dual Xeon 3.0 GHz, 8GB RAM) that cost under $2,000. The server runs a Linux kernel that supports the Xeon's 64-bit extensions to the x86 instruction set. The JVM used is BEA's freely available JRockit, since Sun's JVM does not currently support 64-bit memory spaces on x86.
3.1 Building the Language Models

To build the models, we used an open source natural language processing (NLP) package called LingPipe [4].⁸ LingPipe is a Java package that provides an API for many common NLP tasks such as clustering, spelling correction, and part-of-speech tagging. We only used it to build a character-based n-gram model based on a large corpus of documents (see section 4 for details of the corpora used in our experiments).
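A sketch of this training step, assuming LingPipe's com.aliasi.lm.NGramProcessLM class (the constructor and method names are taken from our reading of the library's documentation and may differ across versions; the paper does not show its own code):

    import com.aliasi.lm.NGramProcessLM;

    /** Sketch of the training phase using LingPipe's character LM. */
    public class TrainModel {
        static NGramProcessLM train(Iterable<String> corpus) {
            NGramProcessLM lm = new NGramProcessLM(7);  // n = 7, as in the paper
            for (String doc : corpus)
                lm.train(doc);                          // accumulate n-gram counts
            return lm;        // lm.log2Estimate(s) then scores a candidate string s
        }
    }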
Internally, LingPipe stores the model as a trie with greater-length n-grams nearer the leaves. We had LingPipe "compile" the model down to a simple lookup-table-based representation of the trie. Each row of the table, which corresponds to a single n-gram, takes 18 bytes, except for the rows which correspond to leaf nodes (maximal-length n-grams), which take only 10 bytes. If an n-gram is never seen in the training data, it will not appear in the table; instead, the longest substring of the n-gram that does appear in the table will be used. The extra 8 bytes in these nodes specify how to compute the probability in this "backed-off" case. All probabilities in both LingPipe and our Viterbi implementation are computed and stored in log-space to avoid issues with numerical underflow. All of the language models used in this paper have n = 7. The language models take several
⁸ Available at http://www.alias-i.com/lingpipe