Phisherman's Friend - Phishing URL Detection

Project grade received: "A+"
Overview
Phishing remains one of the most prevalent social-engineering attacks on the internet. Even savvy users can be fooled by URLs that mimic legitimate sites.
In this project, we explored phishing-URL detection through a relatable case study of links sent to impersonate government websites in Singapore. In this case, we anchored our solutions around this link:

Additionally, the contents of the website differ. This makes sense as the phishing link aims to siphon data from the users, while the real website provides value to the viewers.

On the left is the actual website, the right is the phishing link, same as in the above image. We can see that the Telegram spam message is using the image from the official MOF website.
An important observation is that the phishing link uses HTTPS (indicated by the padlock icon in browsers). While HTTPS encrypts the connection between user and website, it only verifies domain ownership—not the legitimacy of the organization. Attackers can obtain valid HTTPS certificates for deceptive domains, creating a false sense of security for users who trust the padlock icon as a sign of website authenticity.
Project Objective
We explored the various methods of phishing-URL detection, starting with basic methods like list-based detection, followed by traditional ML, and then deep learning methods, as well as a hybrid approach.
We also made our own ML powered phishing-URL detector, built end-to-end to showcase the detection of our chosen case study url as a phishing link, compared to the official webpage.
In this project overview, I will focus on the ML model built. The overall content on other detection methods and more can be found in the slides, which is linked above.
Our Demo Model
The notebook implements a comprehensive phishing URL detection system using deep learning. Here's a breakdown of what it does:
1. Data Collection & Preparation
- Dataset Selection: We utilized the PhiUSIIL Phishing URL Dataset from the UCI Machine Learning Repository, which contains over 235,000 URLs (134,850 legitimate and 100,945 phishing)
- Data Balancing: Created a balanced working set of 1,800 URLs to optimize training time while maintaining performance
- Rich Feature Set: Leveraged features from both URL structure and webpage source code
2. Feature Engineering
We extracted three categories of features:
Feature Category | Examples | Why It Matters |
---|---|---|
Lexical | URL length, special character frequency | Phishing URLs often contain unusual character patterns |
Host-based | Domain age, HTTPS usage, subdomain count | Phishing sites typically use newer domains with complex subdomains |
Content-based | Frequency of suspicious tokens (e.g., "login," "verify") | Social engineering keywords are common in phishing |
3. Character-Level URL Encoding
- Converted each URL to a one-hot encoded matrix (100 characters max)
- This encoding preserves character position information that traditional feature extraction might miss
- Each URL becomes a 100 x 73 matrix (73 = lowercase letters, digits, and special characters)
4. CNN Model Architecture
Our model processes URLs as character sequences with:
- Input Layer: One-hot encoded URL characters
- Three Convolutional Blocks: With increasing filter sizes (64, 128, 256)
- Batch Normalization: For training stability
- Pooling Layers: To reduce dimensionality and extract key patterns
- Dense Layers: For final classification
5. Model Training and Evaluation
- Trained for 10 epochs with Adam optimizer
- Achieved 94.8% accuracy on test data
- Monitored losses to prevent over-fitting
6. Model Visualization
The most innovative aspect of our approach is the visualization of what the model "sees":


- URL Attention Maps: As shown above, our model clearly identifies suspicious elements in the phishing URL (top image: notice the strong attention on "cleim-your-bodget" and unusual domains) while showing minimal concern with the legitimate government domain (bottom image)
- Character-Level Analysis: The model examines each character's contribution to the phishing prediction, with deeper red indicating higher suspicion
- Interpretable Results: The threshold of 0.5 separates phishing from legitimate URLs, with confidence scores shown
7. Case Study Testing
When testing against our case study phishing URL that impersonated the Singapore government website, our model successfully identified key risk factors:

The comparative analysis above reveals critical differences between the URLs:
- Suspicious Subdomain: The phishing URL uses a deceptive subdomain "cleim-your-bodget-6fm4.up" (medium risk)
- Excessive Hyphens: The phishing URL contains 3 hyphens compared to 0 in the legitimate URL
This explainable approach not only flags malicious URLs but provides specific reasons why, making the system more transparent and trustworthy for cybersecurity applications.
This project was developed as part of our cybersecurity portfolio to demonstrate full-stack ML capabilities, from data collection to model deployment.