Criteria for the Evaluation of Workflow Management Systems for Scientific Data Analysis

2022-10
Dilan Kiran, Aleyna
Ay, Mehmet Can
Allmer, Jens
Many scientific endeavors, such as molecular biology, have become dependent on large-scale data and its analysis. For example, precision medicine depends on molecular measurements and data analysis on a per-patient basis. Data analysis that supports medical decisions has to be standardized and performed in a consistent manner across patients. Although the stakes are perhaps not life-threatening, data analyses in basic research have also become increasingly complex. The analysis of RNA-seq data, for example, entails multiple steps ranging from quality assessment of the measurements to statistical analyses. Workflow management systems (WFMS) enable the development of data analysis workflows (WF), their reproduction, and their application to datasets of the same type. However, there are far more than a hundred WFMS available to choose from and no way to convert data analysis WFs among WFMS. Therefore, the initial choice of a WFMS is important as it entails a lock-in to the system. Perhaps the reach in the particular field (number of citations) can be used as a proxy for the selection of a WFMS, but of the roughly 25 WFMS we mention in this work, at least 5 have a large reach in scientific data analysis. Hence, other criteria are needed to differentiate among WFMS. By extracting such criteria from selected studies concerning WFMS and adding further criteria, we arrived at five critical criteria (reproducibility, reusability, FAIRness, versioning support, and security) and five important criteria (providing a graphical user interface, WF flexibility, WF scalability, WF shareability, and computational transparency) for the assessment of WFMS. We applied the criteria to the most cited WFMS in PubMed and found that none of them supports all criteria. We hope that suggesting these criteria will spark a discussion on what features are important for WFMS in scientific data analysis and perhaps will lead to the development of WFMS that fulfill such criteria.

Citation Formats
A. Dilan Kiran, M. C. Ay, and J. Allmer, “Criteria for the Evaluation of Workflow Management Systems for Scientific Data Analysis,” Erdemli, Mersin, TÜRKİYE, 2022, p. 3035, Accessed: 00, 2023. [Online]. Available: https://hibit2022.ims.metu.edu.tr/.