Developing novel methods on data stream classification and clustering for accuracy improvement

2024-9
Maden, Engin
Streaming data from sources such as social media, telecommunication networks, and credit card processing are accumulating and growing enormously, so producing valuable information from such big data environments has become increasingly important. Data streams have specific characteristics, such as continuous flow, high volume, rapid arrival, and changes in distribution. These characteristics impose constraints on processing: resources and time are limited, and the data can be scanned only once. Data stream mining addresses these constraints with streaming versions of traditional data mining operations such as clustering and classification. This study addresses data stream classification and short text stream clustering, a specific area of data stream clustering. Enhancements and novel methods are proposed, and their performance is compared with state-of-the-art methods. For data stream classification, our proposed methods are m-kNN (Mean Extended kNN) and the CSWB (Combined Sliding Window Based) classifier, which combines m-kNN and MC-NN (Micro Cluster Nearest Neighbour). Two new versions of CSWB are also presented: CSWB-e, which combines our m-kNN classifier with K* (K-Star) and C4.5, and CSWB-e2, which combines it with K* (K-Star) and Naive Bayes. For short text stream clustering, a method named T-GSC (A Two Level Graph Based Short Text Stream Clusterer) is proposed. A survey of current methods in short text stream clustering is also presented, in which the methods are classified by their clustering approaches.
Citation Formats
E. Maden, “Developing novel methods on data stream classification and clustering for accuracy improvement,” Ph.D. - Doctoral Program, Middle East Technical University, 2024.