Article

A Comparative Model to Analyze the Impact of Tax Dataset Augmentation on the Accuracy of Machine Learning Models

Many factors influence the performance of a classification model in machine learning (ML), including the size of the dataset and the type of ML technique used in the classification process. In particular, accuracy varies among ML methods. This paper develops a model that analyzes and measures the impact of tax dataset augmentation on the accuracy of ML models. It compares the performance of models built with four of the most common ML techniques: decision tree (DT), random forest (RF), support vector machine (SVM), and artificial neural network (ANN) in the form of a multilayer perceptron (MLP). The comparison is based on three datasets provided by Yemen’s Tax Authority. The first dataset contains 1083 records. The second contains the same records as the first but, following data preprocessing, holds nearly five times as many, which amounts to approximately 5000 additional records. The third is the original dataset duplicated nearly ten times, giving almost 10,000 records. All three datasets were partitioned using k-fold cross-validation. Results show that dataset augmentation affects the performance of the ANN (MLP), DT, RF, and SVM classifiers in terms of accuracy, recall, precision, and F-score. Performance varies among ANN (MLP), DT, and RF, while the SVM classifier yields the lowest results. In general, although some techniques led to overfitting, most ML models trained on the tax dataset with five times the duplicates produced better outcomes than those trained on the original dataset. These findings provide practical guidance for tax authorities in selecting robust ML models under limited data availability and highlight the risks associated with naïve dataset expansion.
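One risk of naïve dataset expansion can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the integer records, the helper names (duplicate, k_fold_indices, leaked_fraction), the exact duplication factors of 5 and 10, and the choice of k = 10 are all assumptions made for illustration. The sketch duplicates a stand-in dataset of 1083 unique records, splits it into k folds, and measures how many test records also appear verbatim in the training folds. When that share is high, a classifier can score well by memorizing copies, which is one way the overfitting noted in the abstract may arise.

```python
import random


def duplicate(records, factor):
    """Naive augmentation: repeat every record `factor` times."""
    return [r for r in records for _ in range(factor)]


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def leaked_fraction(records, k, seed=0):
    """Average share of test records that also occur verbatim in the training folds."""
    folds = k_fold_indices(len(records), k, seed)
    shares = []
    for test_idx in folds:
        test_set = set(test_idx)
        # Values seen during training for this fold (records must be hashable).
        train_values = {records[i] for i in range(len(records)) if i not in test_set}
        hits = sum(1 for i in test_idx if records[i] in train_values)
        shares.append(hits / len(test_idx))
    return sum(shares) / k


# Hypothetical stand-in for the original tax records: 1083 unique integer IDs.
original = list(range(1083))

for factor in (1, 5, 10):  # assumed exact multiples; the paper says "nearly" 5x and 10x
    data = duplicate(original, factor)
    share = leaked_fraction(data, k=10)
    print(f"factor={factor:2d} records={len(data):5d} leaked share={share:.3f}")
```

If duplication is used, one option is to split the original records into folds first and duplicate only inside each training fold, so that no copy of a test record is seen during training.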

Abeer Abdullah Shujaaaddeen
Department of Computer Science, Faculty of Computer and Information Technology, Sana’a University, Sana’a, Yemen
Ammar T. Zahary
Department of Computer Science, Faculty of Computer and Information Technology, Sana’a University, Sana’a, Yemen
Fadl Mutaher Ba-Alwi
Department of Computer Science, Faculty of Computer and Information Technology, Sana’a University, Sana’a, Yemen

How to Cite

A Comparative Model to Analyze the Impact of Tax Dataset Augmentation on the Accuracy of Machine Learning Models. (2026). Sana’a University Journal of Applied Sciences and Technology, 4(4), 1914-1923. https://doi.org/10.59628/jast.v4i4.2121
