A Cluster Refinement Algorithm for MotifDiscoveryArif UllahReg.no: (EE-1738)Ghulam Ishaq Khan Institute of Engineering Sciencesand Technology, Topi, Swabi KPK PakistanSEMESTER PROJECT REPORTSubmitted toDr. Ahmar RashidJanuary 10, 2018Report: CS(506) Arif Ullah (EE-1738)AbstractTranscription Factor Binding Sites (TFBS) or motif discovery is one of the hot area of research incomputer science and bio informatics. The motif discovery has been proved to NP-hard problem. In thispaper the Cluster Refinement algorithm for Motif Discovery (CRMD) has been proposed , which makeuse of statistical motif model with different motif and motif instances. The proposed algorithm searchesfor the best initial candidate motif in the DNA sequences using entropy based clustering approach. Theoptimal motif is selected from the set of initial candidates after greedy refinement procedure is employed.The said refinement procedure uses adaptive threshold to change the number of motif instances. CRMDhas the ability to find multiple motif in the DNA sequences and assuming one occurrence per sequenceleads to further improvement of the algorithm performance. It is concluded from the experimental resultsthat the performance-complexity trade off of CRMD on the synthetis as well as real data set, is fargood from the four state of the art algorithms.1Report: CS(506) Arif Ullah (EE-1738)I. INTRODUCTIONTranscription factor binding sites (TFBSs) or motif discovery is well known and one of the keychallenge in contemporary biology and bio informatics. The motif discovery in DNA sequencesis important for understanding the regulation relationship and evolutionary mechanism of livingorganisms. Computational method are still unable to efficiently find the TFBS on sequence level.One of the most difficult task is to search for the optimum globally in a high dimensional space.Numerous algorithm has been proposed for finding the TFBS discovery, based on the searchingprocedure like onsensus-based search algorithms and statistical optimization methods. Consensusbased search has limitations like modif width and maximum error handling while initial point andgetting stuck in the local optima with non optimal solutions effect the performance of Statisticalmethods. Population-based evolutionary algorithms has weak performance complexity trade off.The proposed heuristic approach called Cluster Refinement algorithm for Motif Discovery canefficiently find the local optima and searches for the global optima with in finite set of localoptima. First, the initial candidate is decided on the basis of entropy based clustering algorithmand the refinement algorithm is applied to find the global optima. if the prior knowledge or OneOccurrence Per Sequence (OOPS) is given, CRMD can be used to find multiple distinct motifs.The experimental results show that CRMD performance is far better for most of the 800 syntheticdata sets.The CMRD is also tested on the eight real data set ABS database 1, SCPD database 2,Escherichia coli data sets 3, and Tompa’s data sets 4, MEME 5, Motif Sampler 6, GAME7 and GALF-P 8. The result is evident of the fact that CMRD has good performance and do notstuck in the local optima like MEME 5 and Motif Sampler 6 and the performance complexitytrade-off is better than Genetic Algorithm-based motif discovery strategies like GAME andGALF-P. The running time of CMRD is compared to most of the tested algorithm in this paper.The organization of the paper includes: section 2 is the background related to motif discovery,section 3 includes the mathematical formulation of the problem and objective function, section4 is all about CMRD and the section 4 experimental results and in section 6 conclusion ispresented.II. BACKGROUNDthe TFBS is defined as “small portion of nucleotide fragment is the cris-regulatory region ofgene in a DNA sequence ” and it interaction with the Transcription Factor(TF) influence thegene expression. Usually the width is 5-10 base pair (bp) but in real case its width is upto 20Report: CS(506) Arif Ullah (EE-1738)bp. Generally, the widths of TFBS can be restricted from 5 to 25 bp.TRBS discovery includes de novo motif discovery which is an attractive pre screening procedureas compared to costly and complex biological experiments like DNA foot printing 9 and ChIPchip 10. de novo motif discovery try to find motif with out having any prior knowledge aboutthe consensus.the motif discovery problem can be categorized into two categories i.e. enumerative (consensusbased) approaches and statistical (matrix-based) approach using either combinatorial method ofstring discovery or using Position Frequency Matrix (PFM), or Position Weight Matrix (PWM)for finding profile of the TFBSs. The string matching procedure fail in enumerative approachesand the brute force search is infeasible due to due to the NP-hardness 12 while in consensusbased approaches 13 assumes that the hamming distance between the consensus and motifinstance is known. These approach has limitation of handling short motif width (upto 14 bp)and small hamming distance in order to solve the problem in reasonable computational timewhich shows its limitation in real word problem where the width could be up to 22 bp alsothe hamming distance is not known beforehand. The hamming distance (d) should be optimallyselected because too small d may leads to miss the TFBS while large value of d increasecomputational complexity.PWM and PFM is good choice for using to represent the motif with continuous frequency. Somestatistical methods like Expectation Maximization 5, 18 and Motif Sampler 19, 20 havebeen proposed in literature. However, statistical method uses probabilistic approach but may takelarge convergence time. Another disadvantage is that they often stuck in the global optima.Evolutionary algorithm like Genetic Algorithms (GAs) 7, 8, 21, 22, 23 have beenproposed in literature for TFBS discovery. The advantages of GA is that it is likely to explorethe global optimum in huge dimensional search space but the computational complexity is toohigh and does not guarantee the consistency of the solution in different run.III. OBJECTIVEThe TFBS discovery is formulated as an optimization problem with certain objectives and weneed to maximize the objective function with certain constraint.Report: CS(506) Arif Ullah (EE-1738)A. Problem FormulationFor given DNA sequences, first we are required to find the binding sites corresponding tomotif instances.Data input: Given set of sequence S = fSiji = 1; 2; :::; Dg , the nucleotide alphabet definedon the alphabet B = fA; C; T; Gg . li is the length of sequence while w is width of motif. Theset of all the w long subsequences contained in S is fSijiji = 1; 2::; D; ji = 1; 2::; li – w + 1gwhere Sjii is the motif instance and ji is the binding sites.Position output: we are required to find PIM of motif, A = fAj i ji = 1; 2; ::; Dg where Ai =fAj i jj = 1; 2; :::; lig is the starting position or indicator row vector for sequence Si. if the jthposition in ith sequence Si is a binding site thenAj i = 1 otherwise Aj i = 0. if jAj is the setof motif instances then the ith motif instances is given by:S(A) = fS(A)1; S(A)2; :::; S(A)jAjwhereS(A)i = fS(A)1 i S(A)2 i :::; S(A)w i. on the jth position the list of nucleotide is given by:S(A)j = fS(A)j 1S(A)j 2:::; S(A)j jAj.Consensus output: the consensus should also be find out. in the absence of consensus we haveto find the Position Count Matrix (PCM) represented by N(A) for each and every nucleotide ofindividual motif instance of A. N(A) = (N(A)1; N(A)2; ::; N(A)w) position Frequency Matrix(PFM) represented by N^(A) can be obtained by normalizing N(A) i.e. N^(A) = NjA(Aj ). TheReport: CS(506) Arif Ullah (EE-1738)motif discovery problem can be illustrated in Fig.1. where M(C) represent different number ofnucleotides in the data set C and ?0 = f?0b = MP(bS2)BbMSbjb 2 BgB. Maximum a PosterioriWMotif discovery problem require the evaluation of PIM and PCM in term of certain optimization measure. the candidate motif instance can be evaluated in different way, in paper Bayesiananalysis is used to derive the posterior probability of the motif instances and searching for themotif with maximum probability. Assuming that the motif instances is generated independentlyand follow multinomial distribution Qw j=1 p(N(A))j where the probability of nucleotide at thejth position is given by p(N(A))j, similarly the joint probability of the nucleotide is given byp(N(A))j = Yb2B?N(A)j bjbwhere ?jb represent the latent probability of base b in position j.The PIM A can be viewed asthe missing label of the data S where ? is the latent parameters of the distribution model of A.The likelihood of S of the background sequences and the motif instances is given as follows:p(Sj?; ?0; A; p0) = pj 0Aj(1 – p0)L-jAj?0M(S(Ac))wY j=1?Nj (A)jIV. ALGORITHMSolving PIM directly from sequence S and motif width w is computationally intractable.The motif discovery problem is NP hard and become more complex if we assume OOPS. Theproposed algorithm for motif discovery consist of three main steps. The algorithm 1 is the mainalgorithm. First, all the possible subsequence from the set of all sequences. the starting positionof each sub sequence ranges from 1 to jSij-w+1. The sub sequences represented by sub(S) aregrouped into cluster using clustering procedure based on entropy based clustering. The PIM ofClusteri is represented by A(Clusteri). From Clusteri the set of artificial motifs instances A(i)are generated whose PFM is denoted by N^(A^(i)). The third step includes the refine procedureuses the PCM N^(A^(i)) of motif instance. The output is generated as the motifs with best p(A(i)).The result is further enhanced post adaptation is applied by assuming OOPS procedure, only ifthe result motif is consistent.The execution procedure of the main algorithm can illustrated by figure 2. In which the firstReport: CS(506) Arif Ullah (EE-1738)set represent the set of DNA sequences the the set of all possible sub sequences are generatedthen all the subsequences are grouped in to cluster followed by the PCM of each cluster. In thisexample D=10 and the number of clusters are 17. After applying the refinement technique thebest set of motif instant is returned.Report: CS(506) Arif Ullah (EE-1738)The next two steps includes the clustering technique called Enteropy Based Clustering andGreedy Refinement.A. Entropy Based ClusteringThe clustering technique uses initial PCM for the refinement. The selection of good initialPCM may leads to the increase the likelihood of local optima as global optima. Consideringinitial PCM has better results than the random selection of PCM, But using large number ofsubsequences lead to computational complexity.Th approach of clustering is based on grouping the subsequences with certain size. The advantages of clustering are:1) First, clustering all the subsequences guarantees that every subsequence has a large chanceto occur in a certain cluster, and thus, is likely to be considered in the subsequent process.2) Second, grouping the similar subsequences together exempts us from the costly computationof processing every subsequence later, and in the extreme case, the huge number (4w) of all thepotential consensus.3)Third, clustering similar subsequences into the same group has already accomplished part ofthe job of maximizing the posterior probability4)Fourth, our clustering has an explicit control of the number of the subsequences in a cluster,as it discards small insignificant clusters and partitions large clusters to remove the noise.the algorithm 2 is the pseudo-code for the clustering procedure in the algorithm 1.Report: CS(506) Arif Ullah (EE-1738)The pseudocode for theCluster of Algorithm 1 is presented in Algorithm 2. Cluster willdiscard the subsequence s if its size jsj is smaller than D. Between D4 and D, it is returned as acluster. For sizes greater than D, the Cluster partitions s. The step Pos(s) partitions two subsetsof sequences using pos as optimal position and base as optimal nucleotide. On position pos andnucleotide b = base, lies the subsequence of first set spos base. Similarly, the subsequence of the otherset i.e. sposb6=base lies at pos and has nucleotide other than base. Smaller clusters of the subsequences are obtained by applying Cluster recursively in Clustering(spos base) and Clustering pos b6=base.Since Clustering makes sure that the size of subsequences remain in between D4 and D, weare assured that all clusters consist of D4 subsequences which is limited by a maximal clusternumber 4LD , where L is the total number of subsequence. In practice, the number of clusters iseven smaller than the maximal number. This is because subsequences in a cluster are usuallylarger than D4 . For a typical dataset of thousands of subsequences, the number of cluster is onlyof the order of few hundreds.B. Greedy RefinementRefine procedure use algorithm 1 and 2 to find the local optimal set from initial cluster.the refine procedure use PCM N^(A(cluster)) of initial subsequences for further refinement.Refinement find the best subsequence on the basis N(A) such that N(A) is improve than theN(A^) (the conserved one).The advantage of refinement procedure are:1) it uses a fast greedy local search method to find the local optimal motif instances.2) The search is flexible as it allows a variable number of instances. At the same time, thenumber is changed by at most one instance each iteration, and thus, Refine still converges veryfast.Algorithm 3 shows the overall pseudocode of the Refinement procedure.Report: CS(506) Arif Ullah (EE-1738)it replace the old motif iteratively by the new motif candiadate such that Iteratively, it replacesthe old motif candidate instances p(A(i))., which is A initially, with the new candidate subsequences to maximize the posterior probability p(A(i)). Refine tries to remove the least likelycandidate motif instance s1 to increase p(A(i)). If it is removed successfully, NUM is decreased.Otherwise, Refine tries to add the next most likely subsequence s2 to increase p(A(i)). If it isadded successfully, NUM is increased. Refine stops iterating when A remains the same in twoconsecutive iterations, and finally, it returns the last Motif.Report: CS(506) Arif Ullah (EE-1738)V. EXPERIMENTAL RESULTSThe algorithm has been tested CRMD on both synthetic and real DNA data sets. A testingdata set consists of sequences with motif instances already tagged, and hence, it can be used forthe algorithm performance evaluation. On the nucleotide level, it is calculated that how manynucleotides that the predicted instances and the true instances overlap for. On the site level, apredicted instance is correct if it overlaps with the true instance for at least one nucleotide. Tocombine the performance indices on both levels, we propose that a motif instance Ai j is correctlyrecovered if either of its ends is within 3 bp away from the corresponding end of the true motifinstance 7, 8. More formally, we haveAij = 1 is8><>:correct if jj – jsj < 3 or jj + w - jejincorrect; otherwisewhere js and je are the indices of the starting and ending positions of the closest true motifinstance.To measure the performance of CRMD and other algorithms, we adopt the metrics of Precision,Recall, and F-score 7, 8 defined as below, where the operator j:j is the cardinality of the set:P recision = jcorrect motifjjmotif foundjF - score = jcorrect motifjjmotif foundjRecall = 2 × P recision × RecalljP recision + RecalljA. Synthetic Data SetsA total of 800 synthetic data sets with length 300 bp for each sequence were generated underthe following eight combinations of three perspectives:1) motif width: short (8 bp) or long (16 bp);2) number of sequences: small (20) or large (60);3) motif conservation: high or low.Table 1 shows the results CRMD has the highest average F -scores on six out of the eightscenarios. In the remaining two scenarios, CRMD has the second highest average F -scores.CRMD also has the highest average F -score over all the 800 problems of eight differentscenarios, which proves that CRMD is relatively robust in a variety of problems.Report: CS(506) Arif Ullah (EE-1738)B. Real Data SetsThe algorithm has been tested for eight real data sets, i.e., CREB, CRP, ERE, E2F, MEF2,MYOD, SRF, and TBP, and compared the performance with the other four algorithms. Theseeight data sets consist of the sequences from many different species. To investigate the performance of CRMD on real data sets and how it is compared to other algorithms, we have alsotested the algorithms on a wide range of real data sets. The following tables 2,3,4 describe thedetailed analysis of the result performance in term of precision, recall and F-score.Report: CS(506) Arif Ullah (EE-1738)The computational complexity or the running time of CRMD is compared with he four stateof the art algorithm in table 5.Report: CS(506) Arif Ullah (EE-1738)VI. CONCLUSION AND FUTURE WORKThe novel approach has been propose for motif finding called CMRD. The entropy basedclustering is used to find the initial first candidate, for finding the local optima geedy refinementprocedure is used. the search for the optimal motif is based on maximizing the a posteriori probability. the computational complexity is minimized significantly using entropy based clusteringbecause the number of cluster is too much less as compared to the number of subsequences.usinginitial cluster the refinement procedure find the local optima which efficient and faster way interm of computational complexity i.e. with out sampling the subsequences probabilistically.The performance result show that the performance of CRMD is good as compared to the fourstate of the art algorithm presented in the paper. Also the algorithm has been tested for syntheticas well as real data set which is show that the performance as well complexity of the proposedalgorithm is good in case of synthetic data and many cases of real data set.Future work:The complexity of the algorithm mainly depends upon generation of subsequence as well asthe clustering procedure. so if the subsequences are generated efficiently with improved timecomplexity it will further improve the over all complexity of CRMD. so there is a need ofefficient generation of sub sequence from the given DNA sequence, so using neighbor hoodbased search algorithm like likelihood ascent search (LAS) algorithm can be modified and usedfor the said purpose and check in future whether such algorithm is working in this senerio ornot.Also the hashing technique could be checked whether it is working or not for accessing clusterclusters.The genetic algorithm is not of the most efficient algorithm in term of performance if large set ofpopulation is selected but the main problem is that it will drastically increase the computationalcomplexity of searching procedure which is not desirable specially in case of motif discoveryproblem so the problem formulation and improvement in computational cost should be kept inReport: CS(506) Arif Ullah (EE-1738)consideration if these algorithm are used for motif discovery each and every step should bechecked and analyses the complexity.Note: The report has been generated in LATEX and the figures and plots are automatically set which may be placed at different positions as compared to the originalpaper article.Report: CS(506) Arif Ullah (EE-1738)REFERENCES1 E. Blanco, D. Farre´, M.M. Alba', X. Messeguer, and R. Guigo´, "ABS:ADatabase of Annotated Regulatory Binding Sites from Orthologous Promoters," Nucleic Acids Research, vol. 34, no. Database Issue, pp. 63-67, 20062 J. Zhu, "SCPD: A Promoter Database of the Yeast Saccharomyces Cerevisiae," Bioinformatics, vol. 15, no. 7, pp. 607-611, 1999.3 J. Hu, B. Li, and D. Kihara, "Limitations and Potentials of Current MotifDiscovery Algorithms," Nucleic Acids Research, vol. 33, no. 15, pp. 4899-4913, 2005.4 M. Tompa, N. Li, and T. Bailey, "Assessing Computational Tools for theDiscovery of Transcription Factor Binding Sites," Nature Biotechnology, vol.23, pp. 137-144, 2005. vol. 34, no. 10, Oct. 2016, pp. 2650-2666.5 T.L. Bailey and C. Elkan, "Fitting a Mixture Model by Expectation Maximization to Discover Motifs in Biopolymers," Proc. Second Int'l Conf. IntelligentSystems for Molecular Biology, pp. 28-36, 1994.6 J.S. Liu, A.F. Neuwald, and C.E. Lawrence, "Bayesian Models for MultipleLocal Sequence Alignment and Gibbs Sampling Strategies," J. Am. StatisticalAssoc., vol. 90, no. 432, pp. 1156-1170, 19957 Z. Wei and S.T. Jensen, "GAME: Detecting Cis-Regulatory Elements Usinga Genetic Algorithm," Bioinformatics, vol. 22, no. 13, pp. 1577-1584, 2006.Report: CS(506) Arif Ullah (EE-1738)8 T.-M. Chan, K.-S. Leung, and K.-H. Lee, "Tfbs Identification Based on Genetic Algorithm with Combined Representations and Adaptive Post-Processing,"J. Bioinformatics, vol. 24, pp. 341-349, 20079 D.J. Galas and A. Schmitz, "DNAse Footprinting: A Simple Method for theDetection of Protein-DNA Binding Specificity," Nucleic Acids Research, vol.5, no. 9, pp. 3157-3170, Sept. 1987.10 C. Horak and M. Snyder, "ChIP-Chip: A Genomic Approach for IdentifyingTranscription Factor Binding Sites," Methods in Enzymology, vol. 350, pp.469-483, 2002.11 G. Sandve and F. Drablos, "A Survey of Motif Discovery Methods in anIntegrated Framework," Biology Direct, vol. 1, no. 1, 2006.12 M. Li, B. Ma, and L. Wang, "Finding Similar Regions in Many Sequences,"J. Computer and System Sciences, vol. 65, pp. 73-96, 2002.13 P. Pevzner and S. Sze, "Combinatorial Approaches to Finding Subtle Signalsin DNA Sequences," Proc. Eighth Int'l Conf. Intelligent Systems for Molecular Biology, pp. 269-278, 2000.15 M.F. Sagot, "Spelling Approximate Repeated or Common Motifs Using aSuffix Tree," LATIN '98: Theoretical Informatics, pp. 111- 127, SpringerVerlag, 1998.Report: CS(506) Arif Ullah (EE-1738)16 P. Bieganski, J. Riedl, J.V. Carlis, and E. Retzel, "Generalized Suffix Treesfor Biological Sequence Data: Applications and Implementations," Proc. 27thHawaii Int'l Conf. Systems Sciences, pp. 35-44, 1994.17 J. Buhler and M. Tompa, "Finding Motifs Using Random Projections," J.Computational Biology, vol. 9, no. 2, pp. 225-242, 2002.18 B. Raphael, L. Lung-Tien, and G. Varghese, "A Uniform Projection Methodfor Motif Discovery in DNA Sequences," IEEE/ACM Trans. ComputationalBiology and Bioinformatics, vol. 1, no. 2, pp. 91- 94, Apr.-June 2004.19 K. Blekas, D. Fotiadis, and A. Likas, "Greedy Mixture Learning for MultipleMotif Discovery in Biological Sequences," Bioinformatics, vol. 19, no. 5, pp.607-617, 2003.20 J.S. Liu, A.F. Neuwald, and C.E. Lawrence, "Bayesian Models for MultipleLocal Sequence Alignment and Gibbs Sampling Strategies," J. Am. StatisticalAssoc., vol. 90, no. 432, pp. 1156-1170, Nov. 1995.21 C.E. Lawrence, S.F. Altschul, M.S. Boguski, J.S. Liu, A.F. Neuwald, and J.C.Wooton, "Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy forMultiple Alignment," Science, vol. 262, no. 8, pp. 208-214, Oct. 1993.22 G.B. Fogel, D.G. Weekes, G. Varga, E.R. Dow, H.B. Harlow, J.E. Onyia,and C. Su, "Discovery of Sequence Motifs Related to Coexpression of GenesReport: CS(506) Arif Ullah (EE-1738)Using Evolutionary Computation," Nucleic Acids Research, vol. 32, no. 13,pp. 3826-3835, 2004.23 M. Lones and A. Tyrrell, "Regulatory Motif Discovery Using a PopulationClustering Evolutionary Algorithm," IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 403-414, July-Sept. 2007.