This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address that task. It has been widely used to encode biological sequences into feature vectors so that well-known machine-learning classifiers, which require this format, can be applied. However, designing a suitable feature space for a set of proteins is not a trivial task. For this purpose, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step.

To demonstrate the efficiency of this approach, we compare several encoding methods using a number of machine-learning classifiers. The experimental results showed that our encoding method outperforms the others in terms of classification accuracy and number of generated attributes. We also compared the classifiers in terms of accuracy. Results indicated that SVM generally outperforms the other classifiers with any encoding method. We showed that SVM, coupled with our encoding method, can be an efficient protein classification system. In addition, we studied how varying the substitution matrix affects the quality of our method and hence the classification quality. We noticed that our method yields good classification accuracies with all the substitution matrices and that the variance of the accuracies obtained across substitution matrices is slight. However, the number of generated features varies from one substitution matrix to another. Furthermore, the use of already published datasets allowed us to carry out a comparison with several related works. The outcomes of our comparative experiments confirm the efficiency of our encoding method for representing protein sequences in classification tasks.

Analysis and interpretation of biological sequence data is a fundamental task in bioinformatics. Classification and prediction techniques are one way to deal with such tasks. In fact, biologists are often interested in identifying the family to which a newly sequenced protein belongs. This makes it possible to study the evolution of the protein and to discover its biological functions. Furthermore, the study and prediction of oligomeric proteins (quaternary structures) are very useful in biology and medicine for many reasons. Indeed, such proteins often play a role in the functional evolution of bio-macromolecules and in the repair of misfolds and defects. They are also involved in many important biological processes such as chromosome replication, signal transduction, folding pathways and metabolism. Biologists also seek, for instance, to identify active sites in proteins and enzymes, to classify parts of DNA sequences into coding or non-coding zones, or to determine the function of nucleic sequences, such as identifying promoter sites and junction sites.

Alignment is the main technique used by biologists to look for homology among sequences, and hence to classify new sequences into already known families/classes. Since the relevant information is represented by strings of characters, this technique generally does not enable the use of well-known classification techniques such as decision trees (DT), naïve Bayes (NB), support vector machines (SVM) and nearest neighbour (NN), which have proved to be very efficient in real data mining tasks. In fact, those classifiers rely on data described in a relational format.

Meanwhile, different studies have been devoted to motif extraction in biological sequences. Motif extraction methods are generally based on the assumption that significant regions are better preserved during evolution because of their importance to the structure and/or function of the molecule, and thus that they appear more frequently than expected. Previous work has shown that motif extraction methods can contribute efficiently to the use of machine learning algorithms for the classification of biological sequences. In this case, the classification follows the knowledge discovery in data (KDD) process and hence comprises three major steps. The preprocessing step consists of extracting motifs from a set of sequences. These motifs are then used as attributes/features to construct a binary table in which each row corresponds to a sequence. The presence or absence of an attribute in a sequence is denoted by 1 or 0, respectively.
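The binary table described in the text (one row per sequence, one column per motif, 1 for presence and 0 for absence) can be illustrated with a minimal sketch. It assumes, purely for illustration, that motifs are plain fixed-length substrings (n-grams); it does not implement the substitution-matrix-based similarity that the proposed method uses, and the names `extract_ngrams` and `build_binary_table` are hypothetical.

```python
from typing import List, Set, Tuple


def extract_ngrams(sequence: str, n: int) -> Set[str]:
    """Return the set of all length-n substrings of a sequence."""
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}


def build_binary_table(
    sequences: List[str], n: int = 3
) -> Tuple[List[str], List[List[int]]]:
    """Encode sequences as rows of 0/1 flags, one column per motif.

    Assumption: motifs are plain n-grams collected over all sequences.
    This stands in for the motif extraction step; it is not the
    substitution-matrix-based method proposed in the text.
    """
    # Collect every n-gram seen in any sequence, in a stable order.
    motifs = sorted(set().union(*(extract_ngrams(s, n) for s in sequences)))
    # 1 if the motif occurs in the sequence, 0 otherwise.
    table = [[1 if m in s else 0 for m in motifs] for s in sequences]
    return motifs, table
```

Calling `build_binary_table(["ACDA", "CDAC"], n=3)` returns the sorted motif list together with one 0/1 row per sequence. Rows in this relational form are the kind of input that classifiers such as DT, NB, SVM and NN expect.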