New Techniques in Profiling Big Datasets for Machine Learning with a Concise Review of Android Mobile Malware Datasets

2018-12-04
CANBEK, Gurol
SAĞIROĞLU, ŞEREF
Taşkaya Temizel, Tuğba
As the volume, variety, and velocity of big data increase, its other aspects, such as veracity, value, variability, and venue, cannot be easily interpreted by data owners or researchers. These aspects are also unclear when the data is to be used in machine learning studies such as classification or clustering. This study proposes four techniques with fourteen criteria for systematically profiling datasets collected from different sources, so that they can be distinguished from one another and their strengths and weaknesses identified. The proposed approach is demonstrated on five Android mobile malware datasets from the literature and the security industry, namely Android Malware Genome Project, Drebin, Android Malware Dataset, Android Botnet, and Virus Total 2018. The results show that the proposed profiling methods reveal remarkable comparative insight into the datasets and direct researchers toward big but more visible, qualitative, and internalized datasets.
International Congress on Big Data, Deep Learning and Fighting Cyber Terrorism (IBIGDELFT)

Citation Formats
G. CANBEK, Ş. SAĞIROĞLU, and T. Taşkaya Temizel, “New Techniques in Profiling Big Datasets for Machine Learning with a Concise Review of Android Mobile Malware Datasets,” presented at the International Congress on Big Data, Deep Learning and Fighting Cyber Terrorism (IBIGDELFT), Turkish IT Author, Ankara, TURKEY, 2018, Accessed: 00, 2020. [Online]. Available: https://hdl.handle.net/11511/55994.