Parallelization of K-Means and DBSCAN clustering algorithms on a HPC cluster

2013
Durrani, Hunain
The amount of information that computer systems must process daily has grown enormously. It is impossible, or prohibitively expensive, to build a single supercomputer powerful enough to process such large data sets within the required time limits. Cluster computing emerged to address this problem: hundreds of small computers are connected through ultra-fast switches so that their combined computational power, together with parallel processing techniques, makes it possible to solve many difficult problems quickly. Cloud computing, in turn, emerged to market the data processing power gathered in cluster computing centers with hundreds of thousands of computers, allowing customers to purchase additional processing power, storage, memory, and communication capacity when needed. Data mining has long been a popular research topic because it helps large-scale businesses extract useful information from a mass of irrelevant data. In this era of big data stores, parallel implementations of data mining algorithms are a basic tool for large-scale businesses. In this research, parallel versions of two popular clustering algorithms, K-Means and DBSCAN, are developed, and it is shown experimentally that their performance continues to improve as the input data size increases, making these parallel implementations ideally suited to parallel computing environments.
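As a rough illustration of what a parallel K-Means can look like, the sketch below uses a common data-parallel scheme: the points are split into chunks, each worker computes per-cluster sums and counts for its chunk, and the partial results are reduced into new centroids. This is an assumption made for illustration only; the abstract does not state which decomposition, language, or communication library the thesis uses. All function names are hypothetical, and the thread pool merely stands in for cluster nodes (CPython threads would not give a real speedup for this workload).

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def partial_sums(chunk, centroids):
    """Worker step: per-cluster coordinate sums and counts for one chunk."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in chunk:
        j = nearest(point, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    return sums, counts


def kmeans_step(chunks, centroids, workers=4):
    """One iteration: workers process chunks, then partial results are reduced."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda ch: partial_sums(ch, centroids), chunks))
    dim = len(centroids[0])
    new_centroids = []
    for j, old in enumerate(centroids):
        total = sum(counts[j] for _, counts in results)
        if total == 0:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(list(old))
            continue
        new_centroids.append(
            [sum(sums[j][d] for sums, _ in results) / total for d in range(dim)]
        )
    return new_centroids


def parallel_kmeans(points, centroids, n_chunks=4, max_iter=100):
    """Repeat the parallel step until the centroids stop changing."""
    chunks = [points[i::n_chunks] for i in range(n_chunks)]
    for _ in range(max_iter):
        updated = kmeans_step(chunks, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

Only the small per-cluster sums and counts are exchanged in each iteration, not the points themselves, which is one plausible reason such a scheme can keep scaling as the input grows.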

Citation Formats
H. Durrani, “Parallelization of K-Means and DBSCAN clustering algorithms on a HPC cluster,” M.S. - Master of Science, Middle East Technical University, 2013.