Phishing Detection System through Hybrid Machine Learning Based on URL

Abstract:

The Internet's exponential growth has led to an increase in e-commerce usage, but it has also attracted attackers who seek to steal personal information online. One of the most prevalent methods used by cybercriminals is phishing, in which they trick individuals into divulging sensitive information through fake URLs. Because phishing is a semantics-based attack that exploits the vulnerabilities of computer users, it is difficult to distinguish legitimate URLs from phishing URLs. Software companies provide anti-phishing systems that rely on blacklists, heuristics, visual analysis, and machine learning, but these systems cannot prevent all phishing attempts. To address this issue, this research proposes five classification methods that use hybrid features derived with natural language processing (NLP) and principal component analysis (PCA). The study finds that the Random Forest algorithm using NLP and word vector features outperforms the other methods, classifying phishing URLs with 99.75% accuracy.
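
To make the idea of URL-based features concrete, the following is a minimal sketch of how lexical and word-level features might be extracted from a URL before being passed to a classifier such as Random Forest. It is an illustration under stated assumptions, not the feature set used in this study: the function names, the chosen features, and the `SUSPICIOUS_WORDS` list are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list for illustration only; not taken from the study.
SUSPICIOUS_WORDS = {"login", "verify", "secure", "account", "update"}


def tokenize_url(url):
    """Split a URL into lowercase alphanumeric word tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def url_features(url):
    """Return a dictionary of simple lexical and word-based URL features."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    tokens = tokenize_url(url)
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        # A dotted-quad host in place of a domain name is a common heuristic signal.
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "has_at_symbol": "@" in url,
        "num_digits": sum(c.isdigit() for c in url),
        "num_tokens": len(tokens),
        "num_suspicious_words": sum(t in SUSPICIOUS_WORDS for t in tokens),
    }


if __name__ == "__main__":
    for name, value in url_features("http://192.168.0.1/secure-login/verify.php").items():
        print(f"{name}: {value}")
```

In a full pipeline, feature dictionaries like these would be converted to numeric vectors, optionally combined with word vector representations of the tokens and reduced with PCA, and then used to train the classifiers being compared.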