Integrating machine learning techniques into robust data enrichment approach and its application to gene expression data

Erdogdu, Utku
Alhajj, Reda
Polat, Faruk
Rokne, Jon
Demetrick, Douglas
The availability of enough samples for effective analysis and knowledge discovery has been a challenge in the research community, especially in the area of gene expression data analysis. Thus, the approaches being developed for data analysis have mostly suffered from the lack of enough data to train and test the constructed models. We argue that the process of sample generation could be successfully automated by employing sophisticated machine learning techniques. An automated sample generation framework could successfully complement the actual sample generation from real cases. This argument is validated in this paper by describing a framework that integrates multiple models (perspectives) for sample generation. We illustrate its applicability for producing new gene expression data samples, a highly demanding area that has not received attention. The three perspectives employed in the process are based on models that are not closely related. This independence eliminates the bias of having the produced approach cover only certain characteristics of the domain, which would lead to samples skewed towards one direction. The first model is based on the Probabilistic Boolean Network (PBN) representation of the gene regulatory network underlying the given gene expression data. The second model integrates a Hierarchical Markov Model (HIMM), and the third model employs a genetic algorithm in the process. Each model learns as many characteristics of the domain being analysed as possible and tries to incorporate the learned characteristics in generating new samples. In other words, the models base their analysis on domain knowledge implicitly present in the data itself. The developed framework has been extensively tested by checking how the new samples complement the original samples. The produced results are very promising in showing the effectiveness, usefulness and applicability of the proposed multi-model framework.
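To make the first perspective more concrete, the following is a minimal sketch of how a Probabilistic Boolean Network can emit synthetic binary expression samples. It is an illustration under stated assumptions, not the authors' implementation: the three-gene network, its predictor functions and probabilities, and the names `PBN`, `step` and `generate_samples` are all hypothetical, and the step of inferring the network from real expression data (which the abstract describes) is not shown.

```python
import random

# Hypothetical toy PBN over three binarised genes (NOT taken from the paper).
# Each gene maps to a list of (probability, predictor) pairs; the
# probabilities for one gene sum to 1.
PBN = {
    0: [(0.7, lambda s: s[1] and s[2]), (0.3, lambda s: s[1])],
    1: [(1.0, lambda s: not s[0])],
    2: [(0.6, lambda s: s[0] or s[1]), (0.4, lambda s: s[2])],
}


def step(state, rng):
    """Advance the network by one synchronous update.

    For every gene, one predictor is drawn according to its probability
    and applied to the current state of all genes.
    """
    next_state = []
    for gene in sorted(PBN):
        probs = [p for p, _ in PBN[gene]]
        funcs = [f for _, f in PBN[gene]]
        chosen = rng.choices(funcs, weights=probs, k=1)[0]
        next_state.append(int(bool(chosen(state))))
    return tuple(next_state)


def generate_samples(n_samples, burn_in=20, seed=0):
    """Run the network from a random start and record visited states."""
    rng = random.Random(seed)
    state = tuple(rng.randint(0, 1) for _ in PBN)
    # Discard early states so samples depend less on the random start.
    for _ in range(burn_in):
        state = step(state, rng)
    samples = []
    for _ in range(n_samples):
        state = step(state, rng)
        samples.append(state)
    return samples


samples = generate_samples(5)
for sample in samples:
    print(sample)
```

Each printed tuple is one synthetic sample with one binary value per gene. In a setting like the one the abstract describes, the predictors and their probabilities would be learned from the original expression data rather than written by hand.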


Subtree selection in kernels for graph classification
TAN, MEHMET; Polat, Faruk; Alhajj, Reda (Inderscience Publishers, 2013-01-01)
Classification of structured data is essential for a wide range of problems in bioinformatics and cheminformatics. One such problem is in silico prediction of small molecule properties such as toxicity, mutagenicity and activity. In this paper, we propose a new feature selection method for graph kernels that uses the subtrees of graphs as their feature sets. A masking procedure which boils down to feature selection is proposed for this purpose. Experiments conducted on several data sets as well as a compari...
A quantitative investigation of students' attitudes towards electronic book technology
Bulur, Hatice Gonca (SAGE Publications, 2020-09-01)
The purpose of this study is to analyse the factors that have an impact on technology adoption for e-books utilizing the Analytic Hierarchy Process and Multiple Regression Analysis methods. Findings indicate that perceived usefulness and ease of use are the most significant determinants in using e-books. Of key significance is that Analytic Hierarchy Process results show that consumers make pairwise comparisons, adding environmental concerns to the selection process. Recognizing the importance of all these ...
Discovering functional interaction patterns in protein-protein interaction networks
Turanalp, Mehmet E.; Can, Tolga (Springer Science and Business Media LLC, 2008-06-11)
Background: In recent years, a considerable amount of research effort has been directed to the analysis of biological networks with the availability of genome-scale networks of genes and/or proteins of an increasing number of organisms. A protein-protein interaction (PPI) network is a particular biological network which represents physical interactions between pairs of proteins of an organism. Major research on PPI networks has focused on understanding the topological organization of PPI networks, evolution...
Effective feature reduction for link prediction in location-based social networks
Bayrak, Ahmet Engin; Polat, Faruk (SAGE Publications, 2019-10-01)
In this study, we investigated feature-based approaches for improving the link prediction performance for location-based social networks (LBSNs) and analysed their performances. We developed new features based on time, common friend detail and place category information of check-in data in order to make use of information in the data which cannot be utilised by the existing features from the literature. We proposed a feature selection method to determine a feature subset that enhances the prediction perform...
Explicit diversification of search results across multiple dimensions for educational search
Yigit-Sert, Sevgi; Altıngövde, İsmail Sengör; Macdonald, Craig; Ounis, Iadh; ULUSOY, ÖZGÜR (Wiley, 2020-09-01)
Making use of search systems to foster learning is an emerging research trend known as search as learning. Earlier works identified result diversification as a useful technique to support learning-oriented search, since diversification ensures a comprehensive coverage of various aspects of the queried topic in the result list. Inspired by this finding, first we define a new research problem, multidimensional result diversification, in the context of educational search. We argue that in a search engine for th...
Citation Formats
U. Erdogdu, M. TAN, R. Alhajj, F. Polat, J. Rokne, and D. Demetrick, “Integrating machine learning techniques into robust data enrichment approach and its application to gene expression data,” INTERNATIONAL JOURNAL OF DATA MINING AND BIOINFORMATICS, pp. 247–281, 2013, Accessed: 00, 2020. [Online]. Available: