Millions of new websites are created daily, which makes it difficult to determine which ones are safe. Cybersecurity is the practice of protecting organizations and users from cyberattacks. Cybercriminals use various methods, including phishing, to trick users into revealing sensitive information. In Australia alone, over 74,000 phishing attacks were reported in 2022, resulting in financial losses of more than $24 million. Artificial intelligence (AI) and machine learning have proven effective in domains such as cancer detection, financial fraud detection, and chatbot development, and machine learning models such as Random Forest and Support Vector Machines are commonly used for classification tasks. With cybercrime on the rise, machine learning offers a way to identify both known and previously unseen malicious URLs. The purpose of this study is to compare different instance selection methods and machine learning models for classifying malicious URLs.
This study used a dataset of approximately 650,000 URLs from Kaggle, labeled with four categories: phishing, defacement, malware, and benign. Three reduced datasets of around 170,000 URLs each were generated from it using three instance selection methods (DRLSH, BPLSH, and random selection), all implemented in MATLAB. Four machine learning models were then trained on each reduced dataset: Support Vector Machine (SVM), Decision Tree (DT), k-Nearest Neighbors (KNN), and Random Forest (RF). Each URL was represented by 16 input features and one output feature (the class label), and model performance was evaluated on this representation.
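As a rough illustration of the simplest of the three selection methods, random instance selection can be sketched in a few lines of Python. This is not the study's MATLAB implementation: the function names, the toy class sizes, and the randomly generated 16-value feature vectors are all assumptions made for the example. It draws about 26% of the instances (mirroring the ratio of roughly 170,000 to 650,000) and reports the class proportions before and after selection.

```python
import random
from collections import Counter


def make_toy_dataset(class_counts, n_features=16, seed=0):
    """Build (features, label) pairs with random features (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for label, count in class_counts.items():
        for _ in range(count):
            features = [rng.random() for _ in range(n_features)]
            data.append((features, label))
    return data


def random_instance_selection(instances, n_select, seed=0):
    """Draw n_select instances uniformly at random, without replacement."""
    if not 0 <= n_select <= len(instances):
        raise ValueError("n_select must be between 0 and len(instances)")
    rng = random.Random(seed)
    return rng.sample(instances, n_select)


def class_distribution(instances):
    """Return the fraction of instances belonging to each class label."""
    counts = Counter(label for _, label in instances)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical, deliberately imbalanced class sizes (650 instances in total).
toy = make_toy_dataset(
    {"benign": 400, "defacement": 100, "phishing": 100, "malware": 50}
)

# Keep 170 of 650 instances, i.e. the same ratio as 170,000 of 650,000.
subset = random_instance_selection(toy, n_select=170)

print("full:  ", class_distribution(toy))
print("subset:", class_distribution(subset))
```

Comparing the two printed distributions is a quick way to check whether a selection method preserves or distorts the class proportions of an imbalanced dataset.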
During hyperparameter tuning, each of the four models was trained on the training dataset under different hyperparameter settings, with Bayesian optimization used to find the best settings for each model. The tuned models were then used for classification, and the results were compared. Random instance selection outperformed both BPLSH and DRLSH in terms of classification accuracy and the time required for data selection. The lower accuracies obtained with DRLSH and BPLSH may be attributable to class imbalance in the dataset, which could have led to poor sample selection.
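The two comparison criteria mentioned above, classification accuracy and elapsed time for data selection, can be measured with a small harness like the following sketch. Several things here are assumptions rather than the study's setup: the nearest-centroid classifier is a deliberately simple stand-in for the SVM, DT, KNN, and RF models, the two-class toy data is synthetic, and all function names are hypothetical. Hyperparameter tuning is omitted. Any selection function with the same `(instances, n_select)` signature could be passed in place of `random_selection`.

```python
import random
import time


def fit_centroids(train):
    """Compute the mean feature vector of each class (stand-in classifier)."""
    sums, counts = {}, {}
    for features, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def predict(centroids, features):
    """Return the label whose centroid is closest in squared distance."""
    def sq_dist(label):
        return sum((c - f) ** 2 for c, f in zip(centroids[label], features))
    return min(centroids, key=sq_dist)


def accuracy(centroids, test):
    """Fraction of test instances whose predicted label is correct."""
    correct = sum(predict(centroids, x) == y for x, y in test)
    return correct / len(test)


def random_selection(instances, n_select, seed=0):
    """Uniform random selection without replacement."""
    return random.Random(seed).sample(instances, n_select)


def evaluate_selection(select_fn, train, test, n_select):
    """Time the selection step only, then train and score on the subset."""
    start = time.perf_counter()
    subset = select_fn(train, n_select)
    elapsed = time.perf_counter() - start
    model = fit_centroids(subset)
    return {
        "n_selected": len(subset),
        "selection_seconds": elapsed,
        "accuracy": accuracy(model, test),
    }


def make_toy(n_per_class, seed):
    """Two well-separated Gaussian classes with 4 features (synthetic)."""
    rng = random.Random(seed)
    data = []
    for label, center in (("benign", 0.0), ("malicious", 3.0)):
        for _ in range(n_per_class):
            data.append(([rng.gauss(center, 1.0) for _ in range(4)], label))
    return data


train, test = make_toy(300, seed=1), make_toy(100, seed=2)
result = evaluate_selection(random_selection, train, test, n_select=150)
print(result)
```

Timing only the selection call keeps the elapsed-time figure separate from training cost, so the two criteria can be reported independently for each method.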