Analytic Pattern Matching: From DNA to Twitter (With Open Problems)

Wojciech Szpankowski
Purdue University, W. Lafayette, IN

June 28, 2016
AofA, Kraków, 2016

Joint work with Philippe Jacquet.
Outline

1. Motivations
   - Finding Biological Signals
   - Searching Google
   - Classifying Twitter
2. Pattern Matching Problems
   - Exact String Matching
   - Generalized String Matching
   - Subsequence String Matching (Hidden Patterns)
3. Analysis & Applications
   - Exact String Matching (Warmup)
   - Generalized String Matching & Biological Motifs (Open Problem)
   - Hidden Patterns & Ranking Google Pages (Open Problems)
   - Joint String Complexity & Classification of Twitter (Open Problems)

Motivation: Biology & String Matching

The biological world is highly stochastic and inhomogeneous (S. Salzberg).

[Figure: structure of a gene, showing the promoter, transcription start, 5' UTR, start codon, exons and introns with donor and acceptor sites, stop codon, poly-A site, and 3' UTR.]

Let W = w_1 ... w_m and T be strings over a finite alphabet A. The basic question: how many times does W occur in T? Define O_n(W) as the number of times W occurs in T, that is,

    O_n(W) = #{ i : T_{i-m+1}^i = W, m <= i <= n }.

Basic thrust of our approach: when searching for over-represented or under-represented patterns, we must make sure that a candidate pattern is not produced by randomness itself (to avoid too many false positives).

Motivation: Google & Subsequence Matching

Motivation: Twitter & String Complexity

Figure 1: Two similar Twitter texts have many common words.
Figure 2: Twitter classification.

Outline Update

1. Motivations
2. Pattern Matching Problems
   - Exact String Matching
   - Generalized String Matching
   - Subsequence String Matching
3. Analysis & Applications

Pattern Matching

Let W and T be (sets of) strings over a finite alphabet A. We call W the pattern and T the text. The text T is of length n and is generated by a probabilistic source.

Text: T_m^n = t_m ... t_n.
Pattern: W = w_1 ... w_m, w_i in A; set of patterns: W = {W_1, ..., W_d} with W_i in A^{m_i}.

Basic question: how many times does W occur in T (or how long must we wait until W occurs in T)? Define

    O_n(W) = #{ i : T_{i-m+1}^i = W, m <= i <= n },

the number of W occurrences in the text T_1^n. Our goal: study the probabilistic behavior of O_n(W) for various pattern matching problems using tools of analytic combinatorics.

Variations on Pattern Matching

(Exact) String Matching: in exact string matching the pattern W = w_1 ... w_m is a given string (i.e., a consecutive sequence of symbols).

Generalized String Matching: in generalized pattern matching a set of patterns (rather than a single pattern) is given, that is, W = (W_0, W_1, ..., W_d), where each W_i for i >= 1 is a subset of A^{m_i}. The set W_0 is called the forbidden set. Three cases are considered:
- W_0 = ∅: we are interested in the number of patterns from W occurring in the text.
- W_0 ≠ ∅: we study the number of occurrences of W_i, i >= 1, under the condition that no pattern from W_0 occurs in the text.
- W_i = ∅ for i >= 1 and W_0 ≠ ∅: restricted pattern matching.

Hidden Words or Subsequence Pattern Matching: we search for a subsequence W = w_1 ... w_m rather than a string in the text. That is, there are indices 1 <= i_1 < i_2 < ... < i_m <= n such that T_{i_1} = w_1, T_{i_2} = w_2, ..., T_{i_m} = w_m. We also say that the word W is hidden in the text. For example, W = "date" occurs four times as a subsequence in the text T = "hidden pattern".

Joint String Complexity: for a given string X, we ask how many distinct subwords it contains; this is called the string complexity. For two strings X and Y, we want to know how many common and distinct subwords they contain; this is called the joint string complexity.

Outline Update

1. Motivations
2. Pattern Matching Problems
3. Analysis & Applications
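As a quick illustration of O_n(W) in the exact-matching case (a sketch of mine, not from the slides; the function name is made up), note that occurrences may overlap and all of them are counted:

```python
def count_occurrences(text, w):
    """O_n(W): number of (possibly overlapping) occurrences of w in text."""
    m = len(w)
    return sum(1 for i in range(len(text) - m + 1) if text[i:i + m] == w)

# Overlapping occurrences count separately:
print(count_occurrences("ababab", "abab"))  # -> 2
```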
Analysis & Applications
- Exact String Matching
- Generalized String Matching & Finding Biological Motifs
- Hidden Patterns & Ranking Google Pages
- Joint String Complexity & Classification of Twitter

Warmup: Analysis of Exact String Matching

For a language L, its generating function (GF) L(z) is

    L(z) = \sum_{u in L} P(u) z^{|u|},

and concatenation of languages translates into a product of GFs.

Autocorrelation Polynomial: for W define the autocorrelation set S as

    S = { w_{k+1}^m : w_1^k = w_{m-k+1}^m },

let WW be the set of positions k satisfying w_1^k = w_{m-k+1}^m, and set

    S(z) = \sum_{k in WW} P(w_{k+1}^m) z^{m-k}.

Example: for W = abab we have WW = {2, 4} and S = {ab, ε}, since the prefix ab matches the suffix ab (k = 2) and the whole word matches itself (k = 4); hence S(z) = 1 + P(ab) z^2.

Define T_r as the set of words containing exactly r >= 1 occurrences of W:

    T_r = R · M^{r-1} · U,

which can be illustrated for T_4 as R M M M U.

Language Relations & Generating Functions

Lemma 1. (i) The languages M, U and R satisfy:

    U · A = M + U - {ε},
    \bigcup_{k >= 1} M^k = A^* · W + S - {ε},
    W · M = A · R - (R - W).
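The autocorrelation set and polynomial can be computed mechanically from the definition; the following sketch (function and variable names are mine) returns WW and the coefficients of S(z) for a memoryless source, reproducing the W = abab example:

```python
from fractions import Fraction

def autocorrelation(w, prob):
    """Compute WW = {k : w[0:k] == w[m-k:m]} and the coefficients of the
    autocorrelation polynomial S(z) = sum_{k in WW} P(w_{k+1..m}) z^(m-k)."""
    m = len(w)
    WW = [k for k in range(1, m + 1) if w[:k] == w[m - k:]]
    coeffs = {}  # maps a power of z to its coefficient
    for k in WW:
        p = Fraction(1)
        for c in w[k:]:          # P(w_{k+1..m}): probability of the leftover suffix
            p *= prob[c]
        coeffs[m - k] = coeffs.get(m - k, Fraction(0)) + p
    return WW, coeffs

# W = abab with P(a) = P(b) = 1/2: WW = [2, 4] and S(z) = 1 + z^2/4.
half = Fraction(1, 2)
print(autocorrelation("abab", {"a": half, "b": half}))
```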
(ii) For a memoryless source the generating functions associated with the languages M, U and R satisfy

    U(z) = (M(z) - 1)/(z - 1),
    1/(1 - M(z)) = S(z) + P(W) z^m/(1 - z),
    R(z) = P(W) z^m U(z).

(iii) The generating functions T_r(z) = \sum_{n >= 0} Pr{O_n(W) = r} z^n and T(z, u) = \sum_{r >= 1} T_r(z) u^r satisfy

    T_r(z) = R(z) M(z)^{r-1} U(z),  r >= 1,
    T_0(z) = S(z) / ((1 - z) S(z) + z^m P(W)),
    T(z, u) = u R(z) U(z) / (1 - u M(z)).

Main Results: Moments and Limiting Distributions

Moments: we have

    E[O_n(W)] = P(W)(n - m + 1),
    Var[O_n(W)] = n c_1 + c_2,

with

    c_1 = P(W)(2S(1) - 1 - (2m - 1)P(W)),
    c_2 = -P(W)((m - 1)(2S(1) - 1) + 2S'(1)) + (m - 1)(3m - 1)P^2(W).

Limiting Distributions: the limiting distribution of P(O_n(W) = k) depends on the relation between n and m = |W|. With θ = (S(1) - 1)/S(1):

- nP(W) -> ∞, W given: CLT, local limit theorem, and large deviations.
- nP(W) -> λ > 0, mP(W) -> 0: Pólya-Aeppli law, that is, Po(λ̃) compounded with Geom(θ), where λ̃ = nP(W)/S(1).
- nP(W) -> 0, m = o(n): P(O_n(W) = k) ≈ (nP(W)/S^2(1)) θ^{k-1}.
- m = αn: P(O_n(W) = k) ≈ (n(1 - α)P(W)/S^2(1)) θ^{k-1}.

Sketch of Proofs

1. By the Cauchy formula, E[u^{O_n}] = [z^n](T(z, u) + T_0(z)), where

    [z^n] T(z, u) = 1/(2πi) ∮ (u R(z) U(z))/(1 - u M(z)) dz/z^{n+1}.

2. For the CLT and large deviations we apply the residue theorem to find, for some R > 1,

    E[u^{O_n}] = C(u) ρ(u)^{-(n-m+1)} + O(R^{-n}),

where ρ(u) is the smallest root of 1 - u M(z) = 0.

3. For the Poisson-type law we assume nP(W) -> λ and mP(W) -> 0, which forces ρ(u) -> 1; this leads to

    E[u^{O_n}] = exp( (nP(W)/S(1)) (u - 1)/(1 - uθ) ) (1 + O(mP(W))),  nP(W) -> λ,
    E[u^{O_n}] = 1 + (nP(W)/S(1)) (u - 1)/(1 - uθ) + o(nP(W)),  nP(W) -> 0.

Open Problem

Open Problem 1: extend string pattern matching to graph pattern matching, that is, replace the text T by a graph G and find the number of occurrences of a given subgraph/pattern g in G for different models of graph generation. Attention: watch out for the automorphism group of g!
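The mean formula E[O_n(W)] = P(W)(n - m + 1) is easy to verify by brute force for small n. The sketch below (my own illustration, not from the slides; binary alphabet {a, b}) sums O_n over all 2^n texts weighted by their memoryless probabilities:

```python
from fractions import Fraction
from itertools import product

def mean_occurrences(w, n, p):
    """E[O_n(w)] for a memoryless source over {a, b} with P(a) = p,
    computed exactly by enumerating all 2^n texts."""
    m, mean = len(w), Fraction(0)
    for t in product("ab", repeat=n):
        text = "".join(t)
        pr = Fraction(1)
        for c in text:
            pr *= p if c == "a" else 1 - p
        o = sum(1 for i in range(n - m + 1) if text[i:i + m] == w)
        mean += pr * o
    return mean

# W = abab, n = 6, p = 1/2: the formula gives (6 - 4 + 1) / 2^4 = 3/16.
print(mean_occurrences("abab", 6, Fraction(1, 2)))  # -> 3/16
```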
Exact graph matching was analyzed in: Picard F., Daudin J.-J., Schbath S., Robin S., Assessing the Exceptionality of Network Motifs, J. Comp. Bio., 2008.

Open Problem 2: extend string pattern matching to other advanced structural pattern matching, such as trees, sets, and multimodal data structures.

Outline Update

1. Motivations
2. Pattern Matching Problems
3. Analysis & Applications
   - Exact String Matching
   - Generalized String Matching & Finding Biological Motifs
   - Hidden Patterns & Ranking Google Pages
   - Joint String Complexity & Classification of Twitter

Biology: Weak Signals and Generalized Pattern Matching

Denise and Régnier (2002) observed that in biological sequences, whenever a word is overrepresented, its subwords are also overrepresented. For example, if W_1 = AATAAA is overrepresented, then W_2 = ATAAAN is also overrepresented. Overrepresented subwords are called artifacts, and it is important to automatically disregard the noise created by artifacts.
New Approach: once a dominating signal has been detected, we look for a weaker signal by comparing the observed numbers of pattern occurrences to the conditional expectations, not the regular expectations. This harder question needs a new approach through generalized pattern matching, namely showing that

    E[O_n(W_2) | O_n(W_1) = k] ≈ αn.

To evaluate this we need to study generalized pattern matching.

Polyadenylation Signals in Human Genes

Beaudoing et al. (2000) studied several variants of the well-known AAUAAA polyadenylation signal in mRNA of human genes. Using this approach, Denise and Régnier (2002) discovered and eliminated all artifacts and found new signals in a much simpler and more reliable way.

[Table: hexamers AAUAAA, AAAUAA, AUAAAA, UUUUUU, AUAAAU, AAAAUA, UAAAAU, AUUAAA, AUAAAG, UAAUAA, UAAAAA, UUAAAA, CAAUAA, AAAAAA, UAAAUA, with columns: observed count, rank, expected count, Z-score, rank, conditional expectation, conditional Z-score, rank (numeric entries omitted).]

Outline Update

1. Motivations
2. Pattern Matching Problems
3. Analysis & Applications
   - Exact String Matching
   - Generalized String Matching & Finding Biological Motifs
   - Hidden Patterns & Ranking Google Pages
   - Joint String Complexity & Classification of Twitter

Hidden Patterns

In subsequence pattern matching, a hidden word W occurs as a subsequence:

    T_{i_1} = w_1, T_{i_2} = w_2, ..., T_{i_m} = w_m,

where we put the additional constraints i_{j+1} - i_j <= d_j. The tuple I = (i_1, ..., i_m) is called a position, and D = (d_1, ..., d_m) constitutes the constraints. Let O_n(W) be the number of W occurrences in T. Observe that O_n(W) = \sum_I X_I, where X_I := 1 if W occurs at position I in T_1^n, and 0 otherwise.
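In the unconstrained case (all d_j infinite), O_n(W) is computed by a standard dynamic program; the sketch below is my own illustration (names made up) and reproduces the slide's "date" example:

```python
def count_hidden(text, w):
    """Number of occurrences of w as a subsequence of text, i.e. the number
    of index tuples i_1 < ... < i_m with text[i_j] == w_j for all j."""
    dp = [0] * (len(w) + 1)  # dp[j] = embeddings of w[:j] in the prefix read so far
    dp[0] = 1
    for c in text:
        for j in range(len(w), 0, -1):  # backwards, so each text symbol is used once
            if w[j - 1] == c:
                dp[j] += dp[j - 1]
    return dp[-1]

# The slide's example: "date" is hidden four times in "hidden pattern".
print(count_hidden("hidden pattern", "date"))  # -> 4
```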
If all d_j are finite, then we have the constrained problem, which is a generalized pattern matching problem. The analysis below is based on: P. Flajolet, W.S., and B. Vallée, ICALP 2001 & JACM. (Extension to dynamic sources by Bourdon and Vallée.)

Constrained HPM and Generalized Pattern Matching

1. The (W, D) constrained subsequence problem is viewed as generalized string matching. Example: if (W, D) = a #_2 b, then W = {ab, aab, abb}.

2. de Bruijn Automaton. Let M = max{length(W)} - 1. Define a de Bruijn automaton over B, where B = A^M.

3. Let b in B and a in A. Then the transition from state b upon scanning symbol a of the text is to b̂ in B such that b̂ = b_2 b_3 ... b_M a. For example,

    abb (in B) scanned with a (in A) leads to bba (in B).

4. The Transition Matrix: T(u) is a complex-valued transition matrix defined as

    [T(u)]_{b, b̂} := P(a) u^{O_{M+1}(ba) - O_M(b)} [[ b̂ = b_2 b_3 ... b_M a ]],

where ba denotes the (M+1)-symbol word obtained by appending a to b, and O_M(b) is the number of occurrences of patterns from W in the text b.

5. Example. Let W = {ab, aab, aba}. Then M = 2, and over the states aa, ab, ba, bb the matrix T(u) is

    T(u) =
           aa       ab        ba       bb
    aa [   P(a)     P(b)u^2   0        0     ]
    ab [   0        0         P(a)u    P(b)  ]
    ba [   P(a)     P(b)u     0        0     ]
    bb [   0        0         P(a)     P(b)  ]

For a thorough analysis of generalized pattern matching by symbolic calculus see Clément, Bassino, and Nicodème, TALG, 2012.
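The transitions and the occurrence gains that decorate T(u) can be generated mechanically from the definition; this sketch (names are mine) builds them for any finite pattern set and reproduces the exponents of u in the example for W = {ab, aab, aba}:

```python
from itertools import product

def de_bruijn_transitions(patterns, alphabet):
    """For each state b in A^M (M = max pattern length - 1) and symbol a,
    return the next state b[1:] + a together with the exponent of u, i.e.
    the number of new pattern occurrences completed by reading a:
    O_{M+1}(b a) - O_M(b)."""
    M = max(len(p) for p in patterns) - 1

    def occ(x):  # total occurrences of all patterns inside x
        return sum(1 for p in patterns for i in range(len(x) - len(p) + 1)
                   if x[i:i + len(p)] == p)

    return {(b, a): (b[1:] + a, occ(b + a) - occ(b))
            for b in map("".join, product(alphabet, repeat=M))
            for a in alphabet}

trans = de_bruijn_transitions({"ab", "aab", "aba"}, "ab")
# Reading b in state aa completes both ab and aab, hence the factor u^2.
print(trans[("aa", "b")])  # -> ('ab', 2)
```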
Generating Functions

6. Using properties of products of matrices we conclude that

    O_n(u) = E[u^{O_n(W)}] = b^t(u) T^n(u) 1,

where b^t(u) is an initial vector and 1 = (1, ..., 1).

7. Spectral Decomposition. Let λ(u) be the largest eigenvalue of T(u). Then for some A < 1:

    O_n(u) = c(u) λ^n(u) (1 + O(A^n)).

Theorem 1 (Flajolet, Vallée, W.S.). For fixed W (i.e., m = O(1)):

Central Limit Theorem:

    Pr{ (O_n - nP(W)) / (σ(W) √n) <= x } -> (1/√(2π)) ∫_{-∞}^x e^{-t^2/2} dt.

Large Deviations: if T(u) is primitive, then

    Pr{ O_n(W) = a E[O_n] } ~ (1 / (σ_a √(2πn))) e^{-n I(a) + θ_a},

where I(a) can be explicitly computed and θ_a is a known constant.

Question: extension to large m? How?

Deletion Channel: Why HPM for Large m

A deletion channel with parameter d:
- input: a binary sequence x := x_1^n = x_1 ... x_n;
- channel: deletes each symbol independently with probability d;
- output: a subsequence Y = Y(x) = x_{i_1} ... x_{i_M} of x, where M follows the binomial distribution Bi(n, 1 - d) and the indices i_1, ..., i_M correspond to the undeleted bits;
- I(X_1^n; Y(X_1^n)): the mutual information between the output and the input;
- C = max_P I(X; Y): the channel capacity.
Theorem 2. For any random input X_1^n, the mutual information satisfies

    I(X_1^n; Y_1^M) = \sum_{w in A^*} d^{n - |w|} (1 - d)^{|w|} ( E[O_n(w) log O_n(w)] - E[O_n(w)] log E[O_n(w)] ),

where O_n(w) is the number of hidden patterns w in X_1^n.

Open Problems

Open Problem 3: extend the analysis of HPM and GPM to large m, in particular m = Θ(n).

Open Problem 4: analyze HPM and GPM for large alphabet size |A|, e.g., |A| = Θ(n).

Open Problem 5: find the mutual information and capacity of the deletion channel.

Some Observations

1. Assume m = θn, θ = 1 - d, and p = P(1). Then

    E[O_n(W)] = P(W) \binom{n}{m},

and for a typical W we have

    E[O_n(W)] ≈ 2^{n(H(θ) - θH(p))},

where H(x) is the binary entropy.

2. Let now W = 0^m. Then we can prove

    P( O_n(0^m) = \binom{k}{m} ) = \binom{n}{k} p^k (1 - p)^{n-k},

and then

    Var[O_n(0^m)] ~ E[O_n^2(0^m)] ≈ 2^{n β(1-d, p)},

where β(θ, p) is an explicit (if lengthy) expression in binary entropies of θ and p.

Outline Update

1. Motivations
2. Pattern Matching Problems
3. Analysis & Applications
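A deletion channel is trivial to simulate, which is handy for sanity-checking the Bi(n, 1 - d) output length. This is a sketch of mine (not from the slides); the fixed seed is only for reproducibility:

```python
import random

def deletion_channel(x, d, rng):
    """Delete each symbol of x independently with probability d; the output
    is the subsequence of surviving symbols."""
    return "".join(c for c in x if rng.random() >= d)

rng = random.Random(2016)
y = deletion_channel("0110" * 250, 0.1, rng)
# len(y) concentrates around (1 - d) * n = 900 for n = 1000.
print(len(y))
```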
Analysis & Applications
- Exact String Matching
- Generalized String Matching & Finding Biological Motifs
- Hidden Patterns & Intrusion Detection
- Joint String Complexity & Classification of Twitter

Some Definitions

The string complexity of a single sequence is the number of its distinct substrings. Throughout, we write X for the string and denote by I(X) the set of distinct substrings of X over the alphabet A.

Example. If X = aabaa, then

    I(X) = {ε, a, b, aa, ab, ba, aab, aba, baa, aaba, abaa, aabaa},

so |I(X)| = 12. But if X = aaaaa, then

    I(X) = {ε, a, aa, aaa, aaaa, aaaaa},

so |I(X)| = 6.

The string complexity is the cardinality of I(X), and we study here the average string complexity

    E[|I(X)|] = \sum_{X in A^n} P(X) |I(X)|,

where X is generated by a memoryless/Markov source.

Suffix Trees and String Complexity

[Figure: non-compact suffix trie and compact suffix tree for X = abaababa, with string complexity |I(X)| = 24.]

String complexity = number of internal nodes in a non-compact suffix trie.

Some Simple Facts

Let O_X(w) denote the number of times that the word w occurs in X. Then

    |I(X)| = \sum_{w in A^*} min{1, O_X(w)}.

Since between any two positions in X there is one and only one substring,

    \sum_{w in A^*} O_X(w) = (|X| + 1)|X| / 2.

Hence

    (|X| + 1)|X| / 2 - |I(X)| = \sum_{w in A^*} max{0, O_X(w) - 1}.

Define C_n := E[|I(X)| : |X| = n]. Then

    C_n = (n + 1)n/2 - \sum_{w in A^*} \sum_{k >= 2} (k - 1) P(O_n(w) = k)
        = (n + 1)n/2 - \sum_{w in A^*} \sum_{k >= 2} k P(O_n(w) = k) + \sum_{w in A^*} P(O_n(w) >= 2).
We need to study probabilistically O_n(w), the number of w occurrences in a text X generated by a probabilistic source.

Back to Suffix Trees

Size S_n and path length L_n:

    E[S_n] = \sum_{w in A^*} P(O_{n - |w| + 1}(w) >= 2),
    E[L_n] = \sum_{w in A^*} \sum_{k >= 2} k P(O_n(w) = k).

[Figure: suffix tree built from the first four suffixes S_1, S_2, S_3, S_4 of a binary string X.]

A random suffix tree resembles a random independent trie (Jacquet & W.S., 1994; Ward & Fayolle, 2005): since

    P(O_n(w) >= 2) - P(Ω_n(w) >= 2) = O( n^{-ε} P^{1-ε}(w) ),

where Ω_n(w) indicates how many of n independent strings share the same prefix w, we can prove that

    E[S_n] - E[S_n^T] = O(n^{1-ε}),
    E[L_n] - E[L_n^T] = O(n^{1-ε}).

Back to String Complexity

The last expression allows us to write

    C_n = (n + 1)n/2 + E[S_n] - E[L_n],

where E[S_n] ≈ E[S_n^T] and E[L_n] ≈ E[L_n^T] are, respectively, the average size and path length of the associated independent tries. The analysis then proceeds by Poissonization, Mellin transform, inverse Mellin (saddle point), and de-Poissonization.
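The definitions of string complexity and joint string complexity can be checked directly on the slide's examples. The brute-force sketch below is my own illustration, fine for short strings (suffix trees provide the efficient computation):

```python
def substrings(x):
    """I(X): the set of distinct substrings of x, including the empty word."""
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}

def string_complexity(x):
    """|I(X)|, the string complexity of x."""
    return len(substrings(x))

def joint_complexity(x, y):
    """Number of distinct substrings common to x and y."""
    return len(substrings(x) & substrings(y))

print(string_complexity("aabaa"))          # -> 12, as on the slide
print(string_complexity("aaaaa"))          # -> 6, as on the slide
print(joint_complexity("aabaa", "aaaaa"))  # -> 3: the common words ε, a, aa
```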