Spam Detection and Classification Based on DistilBERT Deep Learning Algorithm

Authors

  • Tianrui Liu, Electrical and Computer Engineering, University of California San Diego, La Jolla, United States
  • Shaojie Li, Huacong Qingjiao Information Technology (Beijing) Co., Ltd., Beijing, China
  • Yushan Dong, University of Maryland, MD, United States
  • Yuhong Mo, Electrical and Computer Engineering, Carnegie Mellon University, United States
  • Shuyao He, Northeastern University, Boston, United States

DOI:

https://doi.org/10.5281/zenodo.11180575

Keywords:

spam detection, DistilBERT, accuracy, classification

Abstract

This paper addresses spam classification as a problem in information security. With the widespread use of the Internet and email, spam has become one of the major issues affecting both user experience and information security. The study first preprocesses the text data: it converts the text to lowercase, removes irrelevant content such as links and punctuation, and filters out stop words and single-character words. The DistilBERT model is then applied to the text classification task, where it achieves 93% accuracy in distinguishing spam from non-spam emails. The confusion matrix shows that 18,500 emails were correctly classified, while a small number of spam emails were misclassified as non-spam. Overall, DistilBERT shows high accuracy in spam classification, although other algorithms may still improve prediction accuracy further. The study provides a useful reference for improving future spam filtering systems, which in turn is expected to enhance user experience and information security.
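
The preprocessing steps listed in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `preprocess`, `accuracy`, `STOP_WORDS`, and `URL_RE` are hypothetical, the stop-word list is a small stand-in because the abstract does not say which list was used, and the confusion-matrix counts in the usage note are toy values rather than the paper's results.

```python
import re
import string

# Hypothetical, deliberately small stop-word list (the abstract does not
# specify which list the study used).
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "for", "in", "is",
    "it", "of", "on", "or", "the", "this", "to", "with", "you", "your",
})

# Assumed definition of a "link": anything starting with http(s):// or www.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

# Map every ASCII punctuation character to a space so words stay separated.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(text: str) -> str:
    """Lowercase, strip links and punctuation, drop stop words and 1-char tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [t for t in text.split() if len(t) > 1 and t not in STOP_WORDS]
    return " ".join(tokens)


def accuracy(confusion: list[list[int]]) -> float:
    """Accuracy from a square confusion matrix: diagonal sum over total count."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total
```

For example, `preprocess("WIN a FREE prize!!! Visit http://spam.example.com/win now, it is yours.")` returns `"win free prize visit now yours"`, and `accuracy([[8, 2], [1, 9]])` gives 0.85 for that toy matrix. In a pipeline like the one the abstract describes, the cleaned text would then be passed to a DistilBERT tokenizer and classifier.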

Published

2024-05-11

How to Cite

Tianrui Liu, Shaojie Li, Yushan Dong, Yuhong Mo, & Shuyao He. (2024). Spam Detection and Classification Based on DistilBERT Deep Learning Algorithm. Applied Science and Engineering Journal for Advanced Research, 3(3), 6–10. https://doi.org/10.5281/zenodo.11180575
