JPMorgan: Parse HTML and Extract External Domain Links — Data Engineering Interview Q&A (2026)

Parse HTML and Extract External Domain Links

Scenario

A web page contains numerous hyperlinks to both internal pages and external websites. You need to extract and catalog all external domains for analysis.

Task

Write a Python script at /home/interview/extract_domains.py that fetches the HTML page at http://content.local, extracts every hyperlink, and keeps only the external ones (links whose domain is not content.local). Save the unique external domains to /home/interview/external_domains.txt, one per line, each with its protocol (for example, https://blog.sample.org).

Note: BeautifulSoup is already installed for HTML parsing.

Example

Expected output format in /home/interview/external_domains.txt:

http://news.example.com
https://blog.sample.org
http://cdn.resources.net

Step 1: Fetch and examine the HTML page

curl http://content.local/

Review the HTML structure to identify link patterns.
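
When reviewing the page, it can help to know which shapes an href value may take. The sketch below is illustrative only: `classify_href` and the sample values are hypothetical, not taken from the actual page, and the category names are an assumption made for this example.

```python
from urllib.parse import urlparse

def classify_href(href: str) -> str:
    """Return a rough category for an href value (hypothetical helper)."""
    href = href.strip()
    if href.startswith('#'):
        return 'fragment'            # same-page anchor, no domain
    if href.startswith('//'):
        return 'protocol-relative'   # host given, scheme inherited from the page
    parsed = urlparse(href)
    if parsed.scheme in ('http', 'https') and parsed.netloc:
        return 'absolute'            # full web URL, may be external
    if parsed.scheme:
        return 'non-web'             # mailto:, tel:, javascript:, ...
    return 'relative'                # path on the current site

# Made-up sample values for illustration
samples = ['#top', '/about', 'page.html', '//cdn.resources.net/lib.js',
           'https://blog.sample.org/post', 'mailto:team@example.com']
for s in samples:
    print(f"{s:35} {classify_href(s)}")
```

Only the 'absolute' and 'protocol-relative' categories can point at another domain, so those are the ones the script needs to inspect.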

Step 2: Create the Python script

nano /home/interview/extract_domains.py

Write a script using BeautifulSoup to parse HTML and extract external domains:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

# Fetch the HTML page; stop early on a timeout or an HTTP error status
response = requests.get('http://content.local/', timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Find all <a> tags that carry an href attribute
links = soup.find_all('a', href=True)

# Collect external domains
external_domains = set()

for link in links:
    href = link['href']
    
    # Skip same-page anchor links such as "#top"
    if href.startswith('#'):
        continue
    
    # Handle protocol-relative URLs (//host/path); the page itself is served over http
    if href.startswith('//'):
        href = 'http:' + href
    
    # Only absolute URLs can point to another domain; relative links are internal
    if '://' in href:
        parsed = urlparse(href)
        domain = parsed.netloc.lower()

        # Keep web links only, and compare on hostname so that a port or
        # mixed case (e.g. Content.Local:80) still counts as internal
        is_web = parsed.scheme in ('http', 'https')
        if is_web and parsed.hostname and parsed.hostname != 'content.local':
            # Store protocol plus domain, e.g. https://blog.sample.org
            full_domain = f"{parsed.scheme}://{domain}"
            external_domains.add(full_domain)

# Save unique domains to file
with open('/home/interview/external_domains.txt', 'w') as f:
    for domain in sorted(external_domains):
        f.write(domain + '\n')

print(f"Extracted {len(external_domains)} unique external domains")
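
The filtering logic can also be checked without network access by pulling it into a function that takes plain href strings. The following is a minimal sketch under that assumption, not the reference solution: `collect_external_domains` and the sample hrefs are hypothetical, and it swaps the manual `//` handling for `urllib.parse.urljoin`, which resolves relative and protocol-relative links against the page URL.

```python
from urllib.parse import urljoin, urlparse

def collect_external_domains(hrefs, base_url='http://content.local/'):
    """Return sorted unique 'scheme://host[:port]' strings for external links."""
    internal_host = urlparse(base_url).hostname
    found = set()
    for href in hrefs:
        # Resolve relative and protocol-relative links against the page URL
        parsed = urlparse(urljoin(base_url, href.strip()))
        if parsed.scheme not in ('http', 'https'):
            continue  # mailto:, tel:, javascript:, ...
        if not parsed.hostname or parsed.hostname == internal_host:
            continue  # no host, or a link back to the same site
        found.add(f"{parsed.scheme}://{parsed.netloc.lower()}")
    return sorted(found)

# Made-up hrefs for illustration; in the script they would come from link['href']
sample_hrefs = [
    '#top', '/about', 'http://content.local/contact',
    'http://news.example.com/story?id=1', 'http://news.example.com/other',
    'https://blog.sample.org/post', '//cdn.resources.net/app.js',
    'mailto:team@example.com',
]
print(collect_external_domains(sample_hrefs))
# ['http://cdn.resources.net', 'http://news.example.com', 'https://blog.sample.org']
```

In the script, the same function could be fed with `[a['href'] for a in soup.find_all('a', href=True)]`.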

Step 3: Run the script

python3 /home/interview/extract_domains.py

Step 4: Verify the output

cat /home/interview/external_domains.txt

The file should list the unique external domains, one per line, sorted alphabetically.
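
Beyond reading the file by eye, a short check can confirm the format. This is a sketch only: `check_output_lines` is a hypothetical helper, and the rules it enforces (http/https only, no path, no internal domain, no duplicates) are this example's reading of the task statement.

```python
from urllib.parse import urlparse

def check_output_lines(lines, internal_host='content.local'):
    """Return a list of problems found in the output lines (empty if none)."""
    problems = []
    entries = [line.strip() for line in lines if line.strip()]
    for entry in entries:
        parsed = urlparse(entry)
        if parsed.scheme not in ('http', 'https') or not parsed.hostname:
            problems.append(f"not scheme://domain: {entry}")
        elif parsed.path or parsed.query or parsed.fragment:
            problems.append(f"has path or query: {entry}")
        elif parsed.hostname == internal_host:
            problems.append(f"internal domain listed: {entry}")
    if len(set(entries)) != len(entries):
        problems.append("duplicate entries")
    return problems

# In-memory example; for the real file, pass the open file object instead
print(check_output_lines(['http://news.example.com\n', 'https://blog.sample.org\n']))
```

For the real output, `check_output_lines(open('/home/interview/external_domains.txt'))` would run the same checks on the file written by the script.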
