Dealing With the Gibberish: Scraping Websites with Backend Obfuscated Content


Scraping websites with backend-obfuscated content can be challenging because obfuscation adds an extra layer between you and the data. Websites use several techniques, including Base64 and ROT13 encoding, that show up as gibberish in the page source. You need to decode these strings to make them meaningful.

This article discusses how to tackle four common techniques that websites use to obfuscate their text content.

1. Base64 Encoding

The Base64 encoding scheme transforms binary data into a textual representation by mapping it to a set of 64 ASCII characters—uppercase letters (A-Z), lowercase letters (a-z), digits (0-9), and two additional symbols (+ and /).

Example in HTML


<div data-email="am9obkBleGFtcGxlLmNvbQ==">Contact Us</div>
<!-- Decodes to: john@example.com -->

In this example, the attribute data-email contains a Base64-encoded string representing an email address. When decoded, it reveals the actual email “john@example.com,” which is hidden from immediate view in the page source.

How to Handle It

Create a Base64 decoding function designed to detect and decode Base64-encoded strings.

import base64
import logging

logger = logging.getLogger(__name__)

def decode_base64(encoded_text: str) -> str:
    try:
        # Add padding if necessary
        padding = 4 - (len(encoded_text) % 4)
        if padding != 4:
            encoded_text += '=' * padding
        return base64.b64decode(encoded_text).decode('utf-8')
    except Exception as e:
        logger.error(f"Base64 decoding failed: {str(e)}")
        return encoded_text

This function:

  • Checks whether the input string requires padding (a valid Base64 string's length must be a multiple of 4)
  • Adds the necessary padding if missing
  • Attempts to decode the string into UTF-8 text
  • Logs the error and returns the original string if the decoding fails

After defining the function, use a RegEx pattern to detect candidate Base64-encoded strings and call the function on any match.

import re

if re.match(r'^[A-Za-z0-9+/]*={0,2}$', text):
    text = decode_base64(text)
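
For instance, here's a quick sanity check using the Base64 string from the data-email example above (a standalone snippet, assuming decode_base64 is already defined):

encoded = "am9obkBleGFtcGxlLmNvbQ=="  # from the data-email example above
if re.match(r'^[A-Za-z0-9+/]*={0,2}$', encoded):
    print(decode_base64(encoded))  # john@example.com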

2. ROT13 Encoding

ROT13—a simple substitution cipher—replaces each letter in the alphabet with the letter 13 positions ahead, wrapping around at the end of the alphabet. Applying ROT13 twice gives you the original string, making it a symmetric cipher.

Example

Original: Hello World
ROT13:    Uryyb Jbeyq

This example shows how the phrase “Hello World” transforms with ROT13, making it unreadable without decoding.

How to Handle It

Create a ROT13 decoding function that translates each alphabetic character by shifting it 13 places.

def decode_rot13(text: str) -> str:
    return text.translate(str.maketrans(
        'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
        'NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm'
    ))

The function itself only performs the letter-by-letter translation, mapping A–M to N–Z and vice versa. The detection and verification steps belong in the calling code, which:

  1. Detects whether the input text consists solely of alphabetic characters
  2. Applies the ROT13 translation
  3. Verifies that the decoded text differs from the original to avoid unnecessary replacement

Use a combination of RegEx and if statements to find ROT13-encoded strings:

if re.match(r'^[A-Za-z]+$', text):
    rot13_text = decode_rot13(text)
    if rot13_text.lower() != text.lower():
        text = rot13_text
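
As an aside, Python's standard library ships a ROT13 codec, so the hand-rolled translation table can be swapped for a one-liner if you prefer:

import codecs

print(codecs.decode('Uryyb Jbeyq', 'rot13'))  # Hello World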

3. JavaScript Obfuscation

JavaScript obfuscation refers to techniques that transform readable JavaScript code into a form that is difficult for humans to understand while preserving its functionality. This is often done to protect intellectual property, hide sensitive logic such as API keys, or prevent automated scraping by making the code complex and unreadable.

Examples


<!-- Simple Function Obfuscation -->
<div data-js="(function(){return 'Hidden' + ' Text';})()"></div>
<!-- Character Codes -->
<span data-js="String.fromCharCode(72,101,108,108,111)"></span>
<!-- Encoded Logic -->
<div data-js="eval(atob('YWxlcnQoJ0hlbGxvJyk='))"></div>

These examples demonstrate different obfuscation techniques, such as anonymous functions, character-code conversion, and encoded logic executed via eval and Base64 decoding (atob).

How to Handle It

Create a function that uses js2py to run the JavaScript and capture its output. If execution fails, errors are logged.

from typing import Any

import js2py

def execute_js(js_code: str) -> Any:
    try:
        result = js2py.eval_js(js_code)
        return result
    except Exception as e:
        logger.error(f"JavaScript execution failed: {str(e)}")
        return None

This function enables you to retrieve dynamically generated or obfuscated content effectively. You can use it inside an if block that searches for the attribute ‘data-js’.

# assuming 'element' is a BeautifulSoup element
if 'data-js' in element.attrs:
    js_code = element['data-js']
    js_result = execute_js(js_code)
    if js_result:
        text = str(js_result)

This approach decodes self-contained JavaScript expressions; code that depends on browser APIs will instead fall through to the logged error path.
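
A quick check against the first two sample attributes (the third calls the browser-only atob and alert functions, which js2py does not provide, so it would log an error and return None):

print(execute_js("(function(){return 'Hidden' + ' Text';})()"))  # Hidden Text
print(execute_js("String.fromCharCode(72,101,108,108,111)"))     # Hello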

4. HTML Entity Encoding

HTML entities are textual representations of special characters that might otherwise be interpreted as HTML tags or cause parsing issues. They begin with an ampersand (&) and end with a semicolon (;), encoding reserved characters, non-ASCII symbols, or invisible characters in a way that browsers can correctly render.

Examples

&amp;        → &
&lt;         → <
&gt;         → >
&#72;&#101;  → He

These examples illustrate how common characters and Unicode codes are represented as HTML entities.

How to Handle It

Create a function that utilizes the BeautifulSoup library to decode HTML entities automatically.

from bs4 import BeautifulSoup

def decode_html_entities(text: str) -> str:
    return BeautifulSoup(text, 'html.parser').get_text()

This function:

  1. Parses the input text as HTML
  2. Extracts the plain text, converting all entities into their corresponding characters

This method ensures that the extracted text is clean, readable, and free from encoded entities.
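
Alternatively, the standard library's html module performs the same conversion without constructing a parser, which can be faster for plain strings:

import html

print(html.unescape('&lt;b&gt;He&amp;llo&#33;&lt;/b&gt;'))  # <b>He&llo!</b>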


Combining Everything

You can combine all of the above functions in a requests-based web scraper.

First, define a function that combines all the other functions.

def deobfuscate_text(text: str, element=None) -> str:
    """Apply all deobfuscation techniques to text"""
    original_text = text
    
    try:
        # 1. Check for JavaScript obfuscation in data-js attribute
        if element and 'data-js' in element.attrs:
            js_code = element['data-js']
            js_result = execute_js(js_code)
            if js_result:
                text = str(js_result)
                print("JavaScript deobfuscation applied")
        
        # 2. Check for Base64 encoding
        if re.match(r'^[A-Za-z0-9+/]*={0,2}$', text) and len(text) > 4:
            decoded_base64 = decode_base64(text)
            if decoded_base64 != text:
                text = decoded_base64
                print("Base64 decoding applied")
        
        # 3. Check for ROT13 encoding
        if re.match(r'^[A-Za-z]+$', text):
            rot13_text = decode_rot13(text)
            if rot13_text.lower() != text.lower():
                text = rot13_text
                print("ROT13 decoding applied")
        
        # 4. Decode HTML entities
        if '&' in text and ';' in text:
            decoded_entities = decode_html_entities(text)
            if decoded_entities != text:
                text = decoded_entities
                print("HTML entity decoding applied")
                
    except Exception as e:
        print(f"Deobfuscation failed: {str(e)}")
        return original_text
    
    return text

This function accepts a text and the element holding it—the parent node—and returns the deobfuscated text.

Next, fetch the obfuscated page. Use requests to do so.

url = "https://example.com/obfuscated-page" 
response = requests.get(url)

Parse the response content using BeautifulSoup.

soup = BeautifulSoup(response.content, 'html.parser')

Finally, iterate through all the text nodes and call deobfuscate_text().

for text_node in soup.find_all(string=True):
    if text_node.strip():  # Skip empty or whitespace-only strings
        parent = text_node.parent
        cleaned = deobfuscate_text(text_node, parent)
        print(f"Original: {text_node}")
        print(f"Deobfuscated: {cleaned}")

Want to learn more about static web scraping? Read this article on web scraping using Python requests.

Scraping Websites with Backend Obfuscated Content: Additional Protection Measures

To enhance reliability while scraping websites with backend-obfuscated content, use these additional techniques:

1. Rotating User Agents

Randomly change the User-Agent header in HTTP requests to mimic different browsers and devices, reducing the chance of being blocked.

from fake_useragent import UserAgent

user_agent = UserAgent()
headers = {'User-Agent': user_agent.random}

2. Undetected Chrome Driver

Use the undetected Chrome driver with specific options to avoid detection by anti-bot measures implemented by websites.

import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
driver = uc.Chrome(options=options)

3. Session Management

Maintain persistent sessions with cookies and headers to simulate continuous browsing and reduce repeated authentication or rate limiting.

session = requests.Session()
session.headers.update(headers)  # reuse the rotated User-Agent across requests

Scraping Websites with Backend Obfuscated Content: Best Practices

Follow these best practices while scraping backend-obfuscated text to ensure your scraper runs smoothly:

1. Error Handling

  • Wrap all the deobfuscation methods in try-except blocks to ensure the scraper does not crash on unexpected input.
  • If deobfuscation fails, return the original content to avoid data loss.
  • Implement comprehensive logging for troubleshooting and improving decoding algorithms (a minimal configuration sketch follows this list).
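
For the logging point, here is a minimal configuration sketch; the log file name is just an example:

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s',
    filename='scraper.log',  # hypothetical log file
)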

2. Performance

  • Apply deobfuscation methods selectively based on pattern matching to minimize unnecessary processing.
  • Perform resource cleanup after use to prevent memory leaks.
  • Employ parallelization and caching strategies to improve speed (see the caching sketch after this list).
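
As one example of the caching point, here is a minimal sketch that memoizes repeated payloads with functools.lru_cache, wrapping the decode_base64 function defined earlier:

from functools import lru_cache

@lru_cache(maxsize=4096)
def decode_base64_cached(encoded_text: str) -> str:
    # identical inputs are decoded once and then served from the cache
    return decode_base64(encoded_text)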

Wrapping Up: Why You Need a Web Scraping Service

Scraping backend-obfuscated websites requires adding an extra layer to your scraping script that decodes otherwise unreadable data. It's possible to do this yourself. However, you'll need additional time and resources, which may not be practical in a large-scale project.

Instead, why not use a web scraping service?

A web scraping service like ScrapeHero can take care of all the scraping. You can then focus on what matters without worrying about the technicalities of data collection.

ScrapeHero is an enterprise-grade web scraping service capable of building high-quality scrapers and crawlers. Contact ScrapeHero to save time and focus on your core needs.
