Scenario
A web page contains numerous hyperlinks to both internal pages and external websites. You need to extract and catalog all external domains for analysis.
Task
Write a Python script at /home/interview/extract_domains.py that fetches the HTML page from http://content.local, extracts all hyperlinks, filters for external links (links to domains other than content.local), and saves the unique external domains to /home/interview/external_domains.txt (one domain per line, including protocol).
Note: BeautifulSoup is already installed for HTML parsing.
Example
Expected output format in /home/interview/external_domains.txt:
http://news.example.com
https://blog.sample.org
http://cdn.resources.net
Step 1: Fetch and examine the HTML page
curl http://content.local/
Review the HTML structure to identify link patterns.
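If the raw HTML is hard to scan by eye, a small sketch like the one below can tally the kinds of href values a page contains. The sample list and the classify_href and survey helpers are hypothetical names used only for illustration; on the real page the list would come from the href attributes of the anchor tags.

```python
from collections import Counter

# Hypothetical sample hrefs; on the real page these would be collected
# from the <a> tags returned by curl or BeautifulSoup.
SAMPLE_HREFS = [
    '#top',
    '/about',
    'mailto:admin@content.local',
    '//cdn.resources.net/app.js',
    'http://news.example.com/story',
    'http://content.local/page',
]


def classify_href(href):
    """Return a rough label for the kind of link an href represents."""
    if href.startswith('#'):
        return 'fragment'
    if href.startswith('//'):
        return 'protocol-relative'
    if '://' in href:
        return 'absolute'
    if href.startswith(('mailto:', 'tel:')):
        return 'non-web scheme'
    return 'relative'


def survey(hrefs):
    """Count how many hrefs fall into each category."""
    return Counter(classify_href(h) for h in hrefs)


if __name__ == '__main__':
    for kind, count in sorted(survey(SAMPLE_HREFS).items()):
        print(f"{kind}: {count}")
```

Only the absolute and protocol-relative categories can name an external domain, so those are the patterns the script in Step 2 needs to handle.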
Step 2: Create the Python script
nano /home/interview/extract_domains.py
Write a script using BeautifulSoup to parse HTML and extract external domains:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

# Fetch the HTML page and stop early on an HTTP error
response = requests.get('http://content.local/', timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Extract all <a> tags that carry an href attribute
links = soup.find_all('a', href=True)

# Collect external domains
external_domains = set()
for link in links:
    href = link['href'].strip()

    # Protocol-relative URLs (//host/path) inherit the page's scheme, http here
    if href.startswith('//'):
        href = 'http:' + href

    # Only absolute URLs name a host; fragments (#...), relative paths,
    # mailto: and tel: links contain no '://' and are skipped
    if '://' not in href:
        continue

    parsed = urlparse(href)
    # Lowercase the host so differently cased links count as one domain
    domain = parsed.netloc.lower()

    # Filter out internal links; hostname ignores any port and letter case
    if domain and parsed.hostname != 'content.local':
        # Store full domain with protocol
        external_domains.add(f"{parsed.scheme}://{domain}")

# Save unique domains to file, one per line
with open('/home/interview/external_domains.txt', 'w') as f:
    for domain in sorted(external_domains):
        f.write(domain + '\n')

print(f"Extracted {len(external_domains)} unique external domains")
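As an alternative to checking href prefixes by hand, the link can first be resolved against the page URL with urljoin, which handles fragments, relative paths, and protocol-relative URLs uniformly. The following is a minimal sketch, not part of the original script; external_origin, BASE_URL, and INTERNAL_HOST are hypothetical names, and restricting results to http and https is an assumption about what counts as a web link.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = 'http://content.local/'   # page the links were found on
INTERNAL_HOST = 'content.local'      # host treated as internal


def external_origin(href, base_url=BASE_URL, internal_host=INTERNAL_HOST):
    """Return 'scheme://host' for an external web link, or None otherwise."""
    # Resolve fragments, relative paths and //host/path against the page URL
    absolute = urljoin(base_url, href.strip())
    parsed = urlparse(absolute)

    # Assumption: only http and https links are of interest
    if parsed.scheme not in ('http', 'https'):
        return None

    # After resolution, relative links and fragments point at the internal host
    if not parsed.hostname or parsed.hostname == internal_host:
        return None

    return f"{parsed.scheme}://{parsed.netloc.lower()}"


if __name__ == '__main__':
    for sample in ['#top', '/about', 'mailto:admin@content.local',
                   '//cdn.resources.net/app.js',
                   'https://blog.sample.org/post']:
        print(sample, '->', external_origin(sample))
```

Inside the loop of the main script, each href could be passed through a helper like this and any non-None result added to the set.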
Step 3: Run the script
python3 /home/interview/extract_domains.py
Step 4: Verify the output
cat /home/interview/external_domains.txt
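The file should list each unique external domain once, one per line, sorted alphabetically. Because the sort compares the full string including the protocol, http:// entries appear before https:// entries.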
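Beyond reading the file by eye, the output can be checked programmatically. The sketch below is one possible check under the format described in the task; find_problems is a hypothetical helper, and the sample list stands in for the lines of /home/interview/external_domains.txt.

```python
from urllib.parse import urlparse


def find_problems(lines, internal_host='content.local'):
    """Return a list of human-readable problems found in the output lines."""
    problems = []
    entries = [line.strip() for line in lines if line.strip()]

    if len(entries) != len(set(entries)):
        problems.append('duplicate entries')
    if entries != sorted(entries):
        problems.append('entries not sorted')

    for entry in entries:
        parsed = urlparse(entry)
        if not parsed.scheme or not parsed.netloc:
            problems.append(f'missing protocol or host: {entry}')
        elif parsed.hostname == internal_host:
            problems.append(f'internal domain listed: {entry}')
        elif parsed.path or parsed.query or parsed.fragment:
            problems.append(f'not a bare domain: {entry}')
    return problems


if __name__ == '__main__':
    # Sample lines for illustration only
    sample = [
        'http://cdn.resources.net\n',
        'http://news.example.com\n',
        'https://blog.sample.org\n',
    ]
    print(find_problems(sample) or 'no problems found')
```

To check the real output, pass an open file object for /home/interview/external_domains.txt instead of the sample list; iterating over the file yields its lines.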