Scraping websites with backend-obfuscated content can be challenging because the obfuscation adds an extra layer you must peel back. Websites use several techniques, including Base64 and ROT13 encoding, which appear as gibberish in the page source. You need to decode these strings to make them meaningful.
This article discusses how to tackle five popular techniques that websites use to obfuscate their text content.
1. Base64 Encoding
The Base64 encoding scheme transforms binary data into a textual representation by mapping it to a set of 64 ASCII characters—uppercase letters (A-Z), lowercase letters (a-z), digits (0-9), and two additional symbols (+ and /).
Example in HTML
<div data-email="am9obkBleGFtcGxlLmNvbQ==">Contact Us</div>
<!-- Decodes to: john@example.com -->
In this example, the attribute data-email contains a Base64-encoded string representing an email address. When decoded, it reveals the actual email “john@example.com,” which is hidden from immediate view in the page source.
How to Handle It
Create a Base64 decoding function designed to detect and decode Base64-encoded strings.
import base64
import logging
import re

logger = logging.getLogger(__name__)

def decode_base64(encoded_text: str) -> str:
    try:
        # Add padding if necessary
        padding = 4 - (len(encoded_text) % 4)
        if padding != 4:
            encoded_text += '=' * padding
        return base64.b64decode(encoded_text).decode('utf-8')
    except Exception as e:
        logger.error(f"Base64 decoding failed: {str(e)}")
        return encoded_text
This function:
- Checks if the input string requires padding (Base64 strings must be a multiple of 4 characters)
- Adds the necessary padding if missing
- Attempts to decode the string into UTF-8 text
- Logs the error and returns the original string if the decoding fails
After defining the function, use a regular expression to check whether a string looks like Base64-encoded text, and call the function only when it does.

if re.match(r'^[A-Za-z0-9+/]*={0,2}$', text):
    text = decode_base64(text)
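To see the detection and decoding steps together, here is a self-contained sketch; the `candidate` variable is hypothetical, and its value comes from the `data-email` example earlier in this section:

```python
import base64
import re

candidate = "am9obkBleGFtcGxlLmNvbQ=="  # from the data-email attribute above

# Only attempt decoding when the string matches the Base64 alphabet
if re.match(r'^[A-Za-z0-9+/]*={0,2}$', candidate):
    decoded = base64.b64decode(candidate).decode('utf-8')
    print(decoded)  # john@example.com
```

Note that this pattern also matches ordinary alphabetic words, so in practice it helps to pair it with a minimum-length check, as the combined deobfuscation function later in this article does.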
2. ROT13 Encoding
ROT13—a simple substitution cipher—replaces each letter in the alphabet with the letter 13 positions ahead, wrapping around at the end of the alphabet. Applying ROT13 twice gives you the original string, making it a symmetric cipher.
Example
Original: Hello World
ROT13: Uryyb Jbeyq
This example shows how the phrase “Hello World” transforms with ROT13, making it unreadable without decoding.
How to Handle It
Create a ROT13 decoding function that translates each alphabetic character by shifting it 13 places.
def decode_rot13(text: str) -> str:
    return text.translate(str.maketrans(
        'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
        'NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm'
    ))
This function applies the ROT13 translation to every alphabetic character and leaves other characters untouched. The detection logic lives in the calling code, which:
- Checks that the input text consists solely of alphabetic characters
- Verifies that the decoded text differs from the original to avoid unnecessary replacement
Next, use a combination of RegEx and if statements to find ROT13 encoded strings.
if re.match(r'^[A-Za-z]+$', text):
    rot13_text = decode_rot13(text)
    if rot13_text.lower() != text.lower():
        text = rot13_text
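Python's standard library also ships a rot13 codec, so the translation table above can be replaced with a one-liner if you prefer:

```python
import codecs

def decode_rot13(text: str) -> str:
    # The built-in 'rot13' codec shifts each alphabetic character by 13 places
    return codecs.decode(text, 'rot13')

print(decode_rot13('Uryyb Jbeyq'))  # Hello World
print(decode_rot13(decode_rot13('Hello World')))  # Hello World (applying it twice is a no-op)
```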
3. JavaScript Obfuscation
JavaScript obfuscation refers to techniques that transform readable JavaScript code into a form that is difficult for humans to understand while preserving its functionality. This is often done to protect intellectual property, hide sensitive logic such as API keys, or prevent automated scraping by making the code complex and unreadable.
Examples
<!-- Simple Function Obfuscation -->
<div data-js="(function(){return 'Hidden' + ' Text';})()">
<!-- String Concatenation -->
<span data-js="String.fromCharCode(72,101,108,108,111)">
<!-- Encoded Logic -->
<div data-js="eval(atob('YWxlcnQoJ0hlbGxvJyk='))">
These examples demonstrate different obfuscation techniques such as anonymous functions, character code concatenation, and encoded logic executed via eval and Base64 decoding.
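For the common String.fromCharCode pattern, you do not necessarily need a JavaScript engine. A small sketch (the helper name `decode_char_codes` is hypothetical) can recover the text with a regular expression alone:

```python
import re

def decode_char_codes(js_code: str) -> str:
    # Pull the numeric arguments out of String.fromCharCode(...)
    match = re.search(r'String\.fromCharCode\(([\d,\s]+)\)', js_code)
    if not match:
        return js_code
    codes = [int(n) for n in match.group(1).split(',')]
    # Convert each character code to its character and join them
    return ''.join(chr(n) for n in codes)

print(decode_char_codes("String.fromCharCode(72,101,108,108,111)"))  # Hello
```

This handles only the literal-arguments case; anything involving real logic (the eval/atob example above) still needs execution, which the next section covers.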
How to Handle It
Create a function that uses js2py to run the JavaScript and capture its output. If execution fails, errors are logged.
import js2py
from typing import Any

def execute_js(js_code: str) -> Any:
    try:
        result = js2py.eval_js(js_code)
        return result
    except Exception as e:
        logger.error(f"JavaScript execution failed: {str(e)}")
        return None
This function enables you to retrieve dynamically generated or obfuscated content effectively. You can use it inside an if block that searches for the attribute ‘data-js’.
# assuming 'element' is a BeautifulSoup element
if 'data-js' in element.attrs:
    js_code = element['data-js']
    js_result = execute_js(js_code)
    if js_result:
        text = str(js_result)
This method ensures that even complex JavaScript obfuscations can be decoded to reveal hidden content.
4. HTML Entity Encoding
HTML entities are textual representations of special characters that might otherwise be interpreted as HTML tags or cause parsing issues. They begin with an ampersand (&) and end with a semicolon (;), encoding reserved characters, non-ASCII symbols, or invisible characters in a way that browsers can correctly render.
Examples
&amp; → &
&lt; → <
&gt; → >
&#72;&#101; → He
These examples illustrate how common characters and Unicode codes are represented as HTML entities.
How to Handle It
Create a function that utilizes the BeautifulSoup library to decode HTML entities automatically.
from bs4 import BeautifulSoup

def decode_html_entities(text: str) -> str:
    return BeautifulSoup(text, 'html.parser').get_text()
This function:
- Parses the input text as HTML
- Extracts the plain text, converting all entities into their corresponding characters
This method ensures that the extracted text is clean, readable, and free from encoded entities.
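If you only need entity decoding, without full HTML parsing, the standard library's html.unescape is a lighter-weight alternative that handles both named and numeric entities:

```python
import html

print(html.unescape('&amp;'))        # &
print(html.unescape('&lt;b&gt;'))    # <b>
print(html.unescape('&#72;&#101;'))  # He
```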
Combining Everything
You can combine all the functions above in a requests-based scraper.
First, define a function that combines all the other functions.
def deobfuscate_text(text: str, element=None) -> str:
    """Apply all deobfuscation techniques to text"""
    original_text = text
    try:
        # 1. Check for JavaScript obfuscation in data-js attribute
        if element and 'data-js' in element.attrs:
            js_code = element['data-js']
            js_result = execute_js(js_code)
            if js_result:
                text = str(js_result)
                print("JavaScript deobfuscation applied")

        # 2. Check for Base64 encoding
        if re.match(r'^[A-Za-z0-9+/]*={0,2}$', text) and len(text) > 4:
            decoded_base64 = decode_base64(text)
            if decoded_base64 != text:
                text = decoded_base64
                print("Base64 decoding applied")

        # 3. Check for ROT13 encoding
        if re.match(r'^[A-Za-z]+$', text):
            rot13_text = decode_rot13(text)
            if rot13_text.lower() != text.lower():
                text = rot13_text
                print("ROT13 decoding applied")

        # 4. Decode HTML entities
        if '&' in text and ';' in text:
            decoded_entities = decode_html_entities(text)
            if decoded_entities != text:
                text = decoded_entities
                print("HTML entity decoding applied")
    except Exception as e:
        print(f"Deobfuscation failed: {str(e)}")
        return original_text
    return text
This function accepts a text and the element holding it—the parent node—and returns the deobfuscated text.
Next, fetch the obfuscated page using requests.

import requests

url = "https://example.com/obfuscated-page"
response = requests.get(url)
Parse the response using BeautifulSoup.
soup = BeautifulSoup(response.content, 'html.parser')
Finally, iterate through all the text nodes and call deobfuscate_text().
for text_node in soup.find_all(string=True):
    if text_node.strip():  # Skip empty or whitespace-only strings
        parent = text_node.parent
        cleaned = deobfuscate_text(text_node, parent)
        print(f"Original: {text_node}")
        print(f"Deobfuscated: {cleaned}")
Scraping Websites with Backend Obfuscated Content: Additional Protection Measures
To enhance reliability while scraping websites with backend-obfuscated content, use these additional techniques:
1. Rotating User Agents
Randomly change the User-Agent header in HTTP requests to mimic different browsers and devices, reducing the chance of being blocked.
from fake_useragent import UserAgent

user_agent = UserAgent()
headers = {'User-Agent': user_agent.random}
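The UserAgent class comes from the third-party fake-useragent package. If you would rather avoid the extra dependency, a minimal sketch with a hand-maintained pool works too (the User-Agent strings below are illustrative values):

```python
import random

# A small, hand-maintained pool of User-Agent strings (illustrative values)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

# Pick a different agent for each request
headers = {'User-Agent': random.choice(USER_AGENTS)}
print(headers['User-Agent'])
```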
2. Undetected Chrome Driver
Use the undetected Chrome driver with specific options to avoid detection by anti-bot measures implemented by websites.
import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
driver = uc.Chrome(options=options)
3. Session Management
Maintain persistent sessions with cookies and headers to simulate continuous browsing and reduce repeated authentication or rate limiting.
session = requests.Session()
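As a minimal sketch of what the session gives you: headers set on the session are sent with every subsequent request, and cookies returned by the server accumulate in session.cookies automatically (the User-Agent value here is illustrative):

```python
import requests

session = requests.Session()
# Headers set here persist across every request made through this session
session.headers.update({'User-Agent': 'Mozilla/5.0 (compatible; example-bot)'})

# Cookies from responses are stored automatically, e.g.:
# response = session.get('https://example.com/login')
# later_response = session.get('https://example.com/data')  # sends the login cookies
print(session.headers['User-Agent'])
```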
Scraping Websites with Backend Obfuscated Content: Best Practices
Follow these best practices while scraping backend-obfuscated text to ensure your scraper runs smoothly:
1. Error Handling
- Wrap all the deobfuscation methods in try-except blocks to ensure the scraper does not crash on unexpected input.
- If deobfuscation fails, return the original content to avoid data loss.
- Implement comprehensive logging for troubleshooting and improving decoding algorithms.
2. Performance
- Apply deobfuscation methods selectively based on pattern matching to minimize unnecessary processing.
- Perform resource cleanup after use to prevent memory leaks.
- Employ parallelization and caching strategies to improve speed.
Wrapping Up: Why You Need a Web Scraping Service
Scraping backend-obfuscated content requires adding an extra layer to your scraping script that decodes otherwise unreadable data. It's possible to do it yourself. However, you'll need additional time and resources, which may not be practical in a large-scale project.
Instead, why not use a web scraping service?
A web scraping service like ScrapeHero can take care of all the scraping. You can then focus on what matters without worrying about the technicalities of data collection.
ScrapeHero is an enterprise-grade web scraping service capable of building high-quality scrapers and crawlers. Contact ScrapeHero to save time and focus on your core needs.