TRAINER: A General-Purpose Trainable Short Biosequence Classifier

Kalkan, Alper T.
Umu, Sinan U.
Akkaya, Mahinur
Classifying sequences is one of the central problems in computational biosciences. Several tools have been released to map an unknown molecular entity to one of the known classes using solely its sequence data. However, all of the existing tools are problem-specific and restricted to an alphabet constrained by the relevant biological structure. Here, we introduce TRAINER, a new online tool designed to serve as a generic sequence classification platform that enables users to provide their own training data with any alphabet defined therein. TRAINER allows users to select among several feature representation schemes and supervised machine learning methods with relevant parameters. Trained models can be saved so that other users can reuse them without retraining. Two case studies are reported to demonstrate effective use of the system for DNA and protein sequences: candidate effector prediction and nucleolar localization signal prediction. The biological relevance of the results is discussed.
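The idea of pairing a feature representation with a supervised learner over a user-defined alphabet can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from TRAINER itself: the function names (`kmer_features`, `train_centroids`, `classify`) are hypothetical, k-mer frequencies stand in for whichever feature schemes the tool offers, a nearest-centroid rule stands in for its learning methods, and the training sequences are made up.

```python
from collections import Counter
from itertools import product
from math import dist


def kmer_features(seq, alphabet, k=2):
    """Return normalized k-mer frequencies of seq over the given alphabet."""
    # Enumerate every possible k-mer so all sequences share one feature order.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)  # avoid division by zero for short input
    return [counts[m] / total for m in kmers]


def train_centroids(examples, alphabet, k=2):
    """Average the feature vectors of each class (nearest-centroid model)."""
    grouped = {}
    for seq, label in examples:
        grouped.setdefault(label, []).append(kmer_features(seq, alphabet, k))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def classify(seq, centroids, alphabet, k=2):
    """Assign seq to the class whose centroid is closest (Euclidean distance)."""
    vec = kmer_features(seq, alphabet, k)
    return min(centroids, key=lambda label: dist(vec, centroids[label]))


# Toy, made-up training data over a DNA alphabet (illustration only).
training = [
    ("ATATATAT", "AT-rich"),
    ("TATATTAA", "AT-rich"),
    ("GCGCGGCC", "GC-rich"),
    ("CGCGCCGG", "GC-rich"),
]
model = train_centroids(training, "ACGT")
print(classify("ATTATAAT", model, "ACGT"))
```

Because the alphabet is an explicit argument, the same functions would accept a protein alphabet or any other symbol set; only the feature dimension (alphabet size raised to the power k) changes.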


SVM-based detection of distant protein structural relationships using pairwise probabilistic suffix trees
Ogul, Hasan; Mumcuoğlu, Ünal Erkan (2006-08-01)
A new method based on probabilistic suffix trees (PSTs) is defined for pairwise comparison of distantly related protein sequences. The new definition is adopted in a discriminative framework for protein classification using pairwise sequence similarity scores in feature encoding. The framework uses support vector machines (SVMs) to separate structurally similar and dissimilar examples. The new discriminative system, which we call SVM-PST, has been tested on the SCOP family classification task, and compared ...
Seven, Ahmet İrfan (2013-05-01)
Mutation of skew-symmetrizable matrices is a fundamental operation that first arose in Fomin-Zelevinsky's theory of cluster algebras; it also appears naturally in many different areas of mathematics. In this paper, we study mutation classes of skew-symmetrizable 3 x 3 matrices and associated graphs. We determine representatives for these classes using a natural minimality condition, generalizing and strengthening results of Beineke-Brüstle-Hille and Felikson-Shapiro-Tumarkin. Furthermore, we obtain a new num...
Loop-based conic multivariate adaptive regression splines is a novel method for advanced construction of complex biological networks
Ayyıldız Demirci, Ezgi; Purutçuoğlu Gazi, Vilda; Weber, Gerhard Wilhelm (2018-11-01)
The Gaussian Graphical Model (GGM) and its Bayesian alternative, called the Gaussian copula graphical model (GCGM), are two widely used approaches to construct the undirected networks of biological systems. They define the interactions between species by using the conditional dependencies of the multivariate normality assumption. However, when the system's dimension is high, the performance of the model becomes computationally demanding, and, particularly, the accuracy of GGM decreases when the observations...
Time series on Riemannian manifolds
Ergezer, Hamza; Leblebicioğlu, Mehmet Kemal; Department of Electrical and Electronics Engineering (2017)
In this thesis, feature covariance matrices are utilized to solve several problems related to time series. In the first part of the thesis, a novel representation is proposed to represent the time series using feature covariance matrices. By this representation, time series are carried onto Riemannian manifold space. The proposed representation is firstly applied to trajectories which are essentially 2D time series. Anomaly detection and activity perception problems in crowded visual scenes are studied by usi...
Optimization approaches for classification and feature selection using overlapping hyperboxes
Akbulut, Derya; Özdemirel, Nur Evin; İyigün, Cem; Department of Industrial Engineering (2019)
In this thesis, an optimization approach is proposed for the binary classification problem. A mixed integer programming (MIP) model formulation is used to generate hyperboxes as classifiers. The hyperboxes are determined by lower and upper bounds on the feature values, and overlapping of hyperboxes is allowed to reach a balance between misclassification and overfitting. For the test phase, distance-based heuristic algorithms are also developed to classify the overlap and uncovered samples that are not class...
Citation Formats
H. OĞUL, A. T. Kalkan, S. U. Umu, and M. Akkaya, “TRAINER: A General-Purpose Trainable Short Biosequence Classifier,” PROTEIN AND PEPTIDE LETTERS, pp. 1108–1114, 2013, Accessed: 00, 2020. [Online]. Available: