Published on April 6, 2025

Phisherman's Friend - Phishing URL Detection

AI/ML

Security

Data Analysis

Python

Feature Engineering

Classification

URL Analysis

Cybersecurity

Machine Learning

Phisherman's Friend - Phishing URL Detection

Project grade received: "A+"

Overview

Phishing remains one of the most prevalent social-engineering attacks on the internet. Even savvy users can be fooled by URLs that mimic legitimate sites.

In this project, we explored phishing-URL detection through a relatable case study of links sent to impersonate government websites in Singapore. In this case, we anchored our solutions around this link:

Additionally, the contents of the website differ. This makes sense as the phishing link aims to siphon data from the users, while the real website provides value to the viewers.

On the left is the actual website, the right is the phishing link, same as in the above image. We can see that the Telegram spam message is using the image from the official MOF website.

An important observation is that the phishing link uses HTTPS (indicated by the padlock icon in browsers). While HTTPS encrypts the connection between user and website, it only verifies domain ownership—not the legitimacy of the organization. Attackers can obtain valid HTTPS certificates for deceptive domains, creating a false sense of security for users who trust the padlock icon as a sign of website authenticity.

Project Objective

We explored the various methods of phishing-URL detection, starting with basic methods like list-based detection, followed by traditional ML, and then deep learning methods, as well as a hybrid approach.

We also made our own ML powered phishing-URL detector, built end-to-end to showcase the detection of our chosen case study url as a phishing link, compared to the official webpage.

In this project overview, I will focus on the ML model built. The overall content on other detection methods and more can be found in the slides, which is linked above.

Our Demo Model

The notebook implements a comprehensive phishing URL detection system using deep learning. Here's a breakdown of what it does:

1. Data Collection & Preparation

Dataset Selection: We utilized the PhiUSIIL Phishing URL Dataset from the UCI Machine Learning Repository, which contains over 235,000 URLs (134,850 legitimate and 100,945 phishing)
Data Balancing: Created a balanced working set of 1,800 URLs to optimize training time while maintaining performance
Rich Feature Set: Leveraged features from both URL structure and webpage source code

2. Feature Engineering

We extracted three categories of features:

Feature Category	Examples	Why It Matters
Lexical	URL length, special character frequency	Phishing URLs often contain unusual character patterns
Host-based	Domain age, HTTPS usage, subdomain count	Phishing sites typically use newer domains with complex subdomains
Content-based	Frequency of suspicious tokens (e.g., "login," "verify")	Social engineering keywords are common in phishing

3. Character-Level URL Encoding

Converted each URL to a one-hot encoded matrix (100 characters max)
This encoding preserves character position information that traditional feature extraction might miss
Each URL becomes a 100 x 73 matrix (73 = lowercase letters, digits, and special characters)

4. CNN Model Architecture

Our model processes URLs as character sequences with:

Input Layer: One-hot encoded URL characters
Three Convolutional Blocks: With increasing filter sizes (64, 128, 256)
Batch Normalization: For training stability
Pooling Layers: To reduce dimensionality and extract key patterns
Dense Layers: For final classification

5. Model Training and Evaluation

Trained for 10 epochs with Adam optimizer
Achieved 94.8% accuracy on test data
Monitored losses to prevent over-fitting

6. Model Visualization

The most innovative aspect of our approach is the visualization of what the model "sees":

Phishing URL detection with attention heatmap showing suspicious characters

Legitimate URL detection with minimal suspicious patterns identified

URL Attention Maps: As shown above, our model clearly identifies suspicious elements in the phishing URL (top image: notice the strong attention on "cleim-your-bodget" and unusual domains) while showing minimal concern with the legitimate government domain (bottom image)
Character-Level Analysis: The model examines each character's contribution to the phishing prediction, with deeper red indicating higher suspicion
Interpretable Results: The threshold of 0.5 separates phishing from legitimate URLs, with confidence scores shown

7. Case Study Testing

When testing against our case study phishing URL that impersonated the Singapore government website, our model successfully identified key risk factors:

Feature-by-feature comparison between phishing and legitimate URLs

The comparative analysis above reveals critical differences between the URLs:

Suspicious Subdomain: The phishing URL uses a deceptive subdomain "cleim-your-bodget-6fm4.up" (medium risk)
Excessive Hyphens: The phishing URL contains 3 hyphens compared to 0 in the legitimate URL

This explainable approach not only flags malicious URLs but provides specific reasons why, making the system more transparent and trustworthy for cybersecurity applications.

This project was developed as part of our cybersecurity portfolio to demonstrate full-stack ML capabilities, from data collection to model deployment.

See all projects