A Hybrid Deep Learning Ensemble for Multi-Class Malicious URL Detection in Arabic and English
Malicious URLs serve as primary vectors for cyber-attacks, facilitating phishing, malware distribution, and website defacement. While conventional systems focus on binary classification, security operations require granular threat identification. Furthermore, the literature exhibits a significant bias toward English content, leaving Arabic-speaking populations disproportionately exposed to localized threats. This study proposes a hybrid stacked ensemble architecture integrating a CNN-BiLSTM with an attention mechanism, Random Forest, and XGBoost. The methodology incorporates 27 lexical features for English URLs and 23 specialized features tailored to Arabic linguistic structures and Punycode-encoded domains. The model was evaluated on 651,191 English and 20,329 Arabic URLs. The architecture achieved a peak accuracy of 99.43% on the English dataset and 89.07% on the Arabic dataset, outperforming baseline configurations. Feature correlation analysis demonstrated that class imbalance inflates apparent feature importance, with average correlation coefficients decreasing by 0.214 after balancing. Comparative experiments with a unified cross-language model yielded inferior results (91.2% on English, 82.5% on Arabic), indicating that language-specific optimization is essential. This research establishes the first multi-class baseline for Arabic URL detection, providing a robust, scalable framework for regional threat intelligence.
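To make the idea of lexical features over Punycode-encoded domains concrete, the following is a minimal Python sketch using only the standard library. The function name `extract_lexical_features` and the six features it returns are hypothetical illustrations; they are not the 27 English or 23 Arabic features used in the study, which the abstract does not enumerate.

```python
from urllib.parse import urlsplit


def decode_punycode_label(label: str) -> str:
    """Decode one 'xn--' host label to Unicode; return it unchanged otherwise."""
    if not label.startswith("xn--"):
        return label
    try:
        return label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        # Malformed Punycode: keep the raw label rather than fail.
        return label


def extract_lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features (hypothetical feature set)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".") if host else []
    decoded_host = ".".join(decode_punycode_label(lbl) for lbl in labels)
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        # 1 if any host label is Punycode-encoded (internationalized domain).
        "has_punycode": int(any(lbl.startswith("xn--") for lbl in labels)),
        # Characters from the basic Arabic Unicode block in the decoded host.
        "arabic_char_count": sum("\u0600" <= ch <= "\u06FF" for ch in decoded_host),
    }
```

Under these assumptions, a feature vector of this kind would feed the tree-based members of an ensemble (such as Random Forest or XGBoost), while a character-level model would typically consume the raw URL string directly.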

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.