Reproducible and scalable data analysis on high performance computing

2022-10
Akkoyun, Emrah
Kuru, Nurdan
Dereli, Onur
Bircan, Aylin
Taştan, Öznur
Adebali, Ogün
We recently presented the PHACT tool (PHylogeny-Aware Computation of Tolerance) for assessing amino acid substitutions, which achieved superior predictive performance compared to widely adopted tools [1]. PHACT scores alterations not only by their frequency in the multiple sequence alignment (MSA), as most common tools do, but also by using gene-based phylogenetic trees. PHACT's inputs are the MSA of the protein, the phylogenetic tree estimated from that MSA, and the probability distribution of amino acids at each ancestral node estimated from the tree. To assess the predictive performance of PHACT, we performed various experiments on a dataset that includes 20,546 proteins and 61,662 variants. Analyzing a single protein takes roughly one day on eight cores, i.e., 192 CPU hours, so the whole dataset of 20,546 proteins requires about 3.94 million CPU hours. Running this on a single powerful machine with 64 cores and 512 GB of memory would take around seven years, which is not practical; we completed the analyses within four months by using a High-Performance Computing (HPC) cluster.

Performing such an extensive, reproducible, and scalable data analysis for multiple proteins with various parameters on an HPC cluster is not straightforward [3]. For example, 50 proteins with ten parameters and ten consecutive tasks amount to 5,000 independent jobs that must all complete successfully. Each job (task) has different characteristics: some are CPU-intensive, others are memory-intensive; some finish within hours, while others take days or weeks. Moreover, an HPC cluster is a complex environment in which hundreds of servers run together and failures are not exceptional, so managing such a large number of jobs by hand is not feasible. A workflow tool, in which the analysis is defined by a set of rules that derive a set of output files from a set of input files, is therefore a must. Besides scalability, reproducibility is a critical requirement: other researchers should be able to obtain the same results at any time [4]. All tools and software used during the analyses, the input files, and the computational environment should be defined in text files so that the whole setup can be redeployed without additional effort.

To satisfy these requirements, we used a Snakemake workflow together with the conda package manager, owing to Snakemake's human-readable, Python-based language, its portability, its integration with conda, its automatic deployment, and its ability to specify software dependencies. The PHACT framework specifies its rules in a Snakefile. Rules decompose the workflow into small steps such as finding homologs of each query sequence (PSI-BLAST), performing multiple sequence alignment (MAFFT), or generating a maximum-likelihood phylogenetic tree (RAxML-NG, FastTree). Each rule has its own parameters, which can be set via a single configuration file (config/config.yml); a simplified, hypothetical rule of this kind is sketched below. A dry run can be used to check whether the workflow is adequately defined and to estimate the amount of computation remaining: it summarizes the total number of jobs per rule and the sets of input and output files they consume and produce. For two query files, as shown in Fig. 1, 29 jobs will be executed. Besides running on a local computer with a limited number of query IDs, the PHACT framework is designed to analyze bulk query IDs in parallel on an HPC cluster.
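As an illustration of how such rules look, the sketch below shows a simplified, hypothetical alignment rule; the rule name, file paths, and parameter keys are assumptions for this example and are not copied from the PHACT Snakefile. Parameters are read from config/config.yml, and the software for the step is pinned through a conda environment file.

```python
# Hypothetical, simplified Snakefile fragment for the MSA step; the actual
# PHACT rules, names, and paths differ. Parameters come from config/config.yml
# and the tool versions are pinned in a conda environment file.
configfile: "config/config.yml"

rule align_homologs:
    input:
        "results/{query}/homologs.fasta"          # produced by the PSI-BLAST rule
    output:
        "results/{query}/alignment.fasta"
    params:
        maxiterate=config["mafft"]["maxiterate"]  # assumed config key, e.g. 1000
    threads: 8
    conda:
        "envs/mafft.yaml"                         # pins the MAFFT version
    shell:
        "mafft --thread {threads} --maxiterate {params.maxiterate} "
        "{input} > {output}"
```

Declaring threads and a conda environment per rule is what later lets Snakemake request appropriate resources from the scheduler and recreate the same software environment on any machine.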
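The dry run and a small local run could then be launched as follows (a sketch; -n, --use-conda, and --cores are standard Snakemake options, but the reported job counts depend on the actual workflow and inputs):

```bash
# Dry run: verifies that the workflow is adequately defined and lists every
# job that would be executed, with a per-rule summary of job counts and the
# input/output files involved, without running anything.
snakemake -n

# Small-scale local run, e.g. for a handful of query IDs, using the conda
# environments declared in the rules and at most 8 local cores.
snakemake --use-conda --cores 8
```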
Most HPC clusters have a scheduler that handles the workload on the compute nodes, and users normally have to prepare a bash script and submit it to the cluster to interact with that scheduler. Snakemake performs all of this automatically (see the sketch below). Within this work, a valuable dataset containing the MSAs and phylogenetic trees, amounting to more than one million files and 1.6 TB in size, was created and shared with other researchers. All details, such as documentation, scripts, tools, environments, and input proteins, can be found on our GitHub page [5], and all results are published on our FTP server [6].
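For illustration, the kind of invocation Snakemake automates on a cluster could look like the following. This is a sketch that assumes a SLURM scheduler and a Snakemake release that still supports the --cluster option (newer versions use execution profiles or executor plugins instead); the resource placeholders only resolve if the rules define them.

```bash
# Submit up to 200 concurrent jobs through SLURM; Snakemake generates and
# submits one batch job per rule instance, filling in each rule's declared
# thread and memory requirements, and tracks completions and failures.
snakemake --use-conda --jobs 200 \
    --cluster "sbatch --cpus-per-task={threads} --mem={resources.mem_mb}"
```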

Citation Formats
E. Akkoyun, N. Kuru, O. Dereli, A. Bircan, Ö. Taştan, and O. Adebali, “Reproducible and scalable data analysis on high performance computing,” Erdemli, Mersin, TÜRKİYE, 2022, p. 2028, Accessed: 00, 2023. [Online]. Available: https://hibit2022.ims.metu.edu.tr/.