Parallelization of K-Means and DBSCAN clustering algorithms on a HPC cluster

2013
Durrani, Hunain
The amount of information that computer systems must process daily has grown enormously. It is impossible, or prohibitively expensive, to build a single supercomputer powerful enough to process such large data within the required time limits. Cluster computing has emerged to address this problem: hundreds of small computers are connected through ultra-fast switches so that their combined computational power, together with parallel processing techniques, makes it possible to solve many difficult problems quickly. Cloud computing, in turn, has emerged to market the data processing power of cluster computing centers containing hundreds of thousands of computers, allowing customers to purchase additional processing power, storage, memory, and communication capacity when needed. Data mining has long been a popular research topic because it helps large-scale businesses extract useful information from a mass of irrelevant data. In the era of big data, parallel implementations of data mining algorithms are a basic tool for large-scale businesses. In this research, parallel versions of two popular clustering algorithms, K-Means and DBSCAN, are developed, and it is shown experimentally that their performance continues to improve even as the input data size keeps increasing, making these parallel implementations well suited to parallel computing environments.
Citation Formats
H. Durrani, “Parallelization of K-Means and DBSCAN clustering algorithms on a HPC cluster,” M.S. thesis, Middle East Technical University, 2013.