Scrape Data from the Data Layer of Google Tag Manager


Google Tag Manager (GTM) has a data layer, a JavaScript object that stores information before passing it to the GTM container. You can scrape this GTM data. However, scraping data from Google Tag Manager requires you to execute JavaScript, which plain HTTP requests can’t do.

Therefore, you need a browser automation library, such as Selenium for Python. Selenium has methods to execute JavaScript in the browser and return the result.
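
Here’s a minimal sketch of the core idea, assuming Chrome and a matching ChromeDriver are installed; the URL is a placeholder:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# execute_script() runs JavaScript inside the page and returns the result
data_layer = driver.execute_script("return window.dataLayer")
print(data_layer)
driver.quit()

If the page defines a data layer, this prints the events and variables pushed into it; otherwise, it prints None.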

Here is an image showing an example of a data layer variable:

[Screenshot: an empty data layer variable in the website’s static code]

How the Data Layer in Google Tag Manager Works

The data layer in Google Tag Manager is a JavaScript array, usually named window.dataLayer, that gets information when scripts on the page execute. For example, a product page might run window.dataLayer.push({'event': 'view_item', 'value': 19.99}) to record a product view. Because the data layer only exists inside a running browser, this tutorial uses Selenium to read it. Start by importing the required modules.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import pandas  # to save scraping errors as a CSV file
import re      # to extract domain names from URLs
import json    # to save the scraped data layers as JSON files

The code scrapes the data layers of several eCommerce websites listed on Webinopoly.com. Therefore, the first step is to visit that page and collect the links.

driver = webdriver.Chrome()
driver.get("https://webinopoly.com/blogs/news/top-100-most-successful-shopify-stores")
a = driver.find_elements(By.XPATH, "//tr/td/a")

The above code uses Selenium’s find_elements method to locate the link elements with an XPath. Finding the right XPath can be challenging; you must analyze each website’s HTML and work it out yourself.
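
Before moving on, it helps to confirm that the XPath matched what you expect; a quick sanity check could look like this:

# Print the number of matched links and a few sample URLs
print(len(a), "links found")
for link in a[:3]:
    print(link.get_attribute("href"))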

The next step is to visit each site listed on the page and extract the information from its data layer.

Errors might occur while scraping the data layer; therefore, you’ll wrap the logic in a try-except block and collect any failures in a list.

errors = []
for url in a[:10]:
    try:
        newDriver = webdriver.Chrome()
        newDriver.get(url.get_attribute("href"))

        # Scroll to the bottom of the page
        html = newDriver.find_element(By.TAG_NAME, "html")
        html.send_keys(Keys.END)

        # Execute JavaScript to read the data layer object
        data_layer = newDriver.execute_script("return window.dataLayer")
        print(url.get_attribute("href"))
        print(data_layer)

        # Extract the domain name to use as the file name
        href = url.get_attribute("href")
        filename = re.search(r'://(.*?)\.com', href)
        print(filename.group(1))

        # Save the data layer as a JSON file named after the domain
        with open(filename.group(1) + ".json", 'w') as file:
            json.dump(data_layer, file, indent=2)
        newDriver.quit()

The code iterates through the first 10 link elements and tries to:

  1. Extract the URL from each element. You can use the get_attribute() method on the element to extract the URL.
  2. Scroll to the bottom of the page and read the data layer. The execute_script() method allows you to execute JavaScript (an alternative way to scroll is sketched below).
  3. Save the result as a JSON file. You can use json.dump() to write the extracted information to a JSON file.
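
The scrolling step above uses Keys.END on the html element. A hedged alternative is to scroll with JavaScript instead; both approaches move the viewport to the bottom of the page:

# Scroll to the bottom of the page via JavaScript instead of Keys.END
newDriver.execute_script("window.scrollTo(0, document.body.scrollHeight)")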

NOTE: The code uses a regular expression to get the domain name (the text between “://” and “.com”) and uses it as the name for each JSON file.
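
For instance, here’s how the pattern behaves on one of the scraped URLs:

import re

match = re.search(r'://(.*?)\.com', "https://colourpop.com/")
print(match.group(1))  # prints: colourpop

Note that the pattern assumes a .com domain; on other top-level domains, re.search() returns None, and the resulting AttributeError is caught by the except block.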

If there are any errors, the code appends them to the errors list initialized before the loop.

    except Exception as e:
        errors.append([str(e), url.get_attribute("href")])
        newDriver.quit()

You’ll then save this list as a CSV file. pandas.DataFrame() converts errors into a DataFrame, and to_csv() writes it to a CSV file.

df = pandas.DataFrame(errors, columns=["error", "url"])
df.to_csv("errors.csv")
driver.quit()
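
To picture the output, here’s an illustration with two hypothetical error records (the messages and URLs below are made up):

errors = [
    ["timed out receiving message from renderer", "https://examplestore.com"],
    ["no such element", "https://anotherstore.com"],
]
df = pandas.DataFrame(errors, columns=["error", "url"])
df.to_csv("errors.csv")

This produces a CSV with one row per failed site, which makes it easy to retry just the failures later.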

Here is the full code for scraping the data layer of Google Tag Manager.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import pandas
import re
import json


# Collect the links to the eCommerce sites from Webinopoly
driver = webdriver.Chrome()
driver.get("https://webinopoly.com/blogs/news/top-100-most-successful-shopify-stores")
a = driver.find_elements(By.XPATH, "//tr/td/a")

errors = []
for url in a[:10]:
    try:
        newDriver = webdriver.Chrome()
        newDriver.get(url.get_attribute("href"))

        # Scroll to the bottom of the page
        html = newDriver.find_element(By.TAG_NAME, "html")
        html.send_keys(Keys.END)

        # Execute JavaScript to read the data layer object
        data_layer = newDriver.execute_script("return window.dataLayer")
        print(url.get_attribute("href"))
        print(data_layer)

        # Extract the domain name to use as the file name
        href = url.get_attribute("href")
        filename = re.search(r'://(.*?)\.com', href)
        print(filename.group(1))

        # Save the data layer as a JSON file named after the domain
        with open(filename.group(1) + ".json", 'w') as file:
            json.dump(data_layer, file, indent=2)
        newDriver.quit()
    except Exception as e:
        errors.append([str(e), url.get_attribute("href")])
        newDriver.quit()


# Save any scraping errors as a CSV file
df = pandas.DataFrame(errors, columns=["error", "url"])
df.to_csv("errors.csv")
driver.quit()

Here is the information scraped from the data layer of colourpop.com.

[
  [
    "js",
    {
      "getTimeAlias": {}
    }
  ],
  [
    "config",
    "G-WSYJCRME4P",
    {
      "send_page_view": false
    }
  ],
  [
    "event",
    "page_view",
    {
      "page_location": "https://colourpop.com/",
      "page_path": "/",
      "page_title": "ColourPop Cosmetics",
      "send_to": "G-WSYJCRME4P"
    }
  ],
  [
    "set",
    "developer_id.dMWZhNz",
    true
  ],
  {
    "event": "gtm.dom",
    "gtm.uniqueEventId": 9
  },
  {
    "event": "gtm.load",
    "gtm.uniqueEventId": 10
  }
]

Code Limitations

While the above code allows you to scrape the data layer using Python, it has certain limitations.

This code assumes that the data layer object is named dataLayer. However, a web page may use a custom name, in which case the code won’t work. In such situations, you must figure out the data layer’s name by analyzing the website.
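
As a workaround, you could probe a few common names before giving up; this is only a sketch, and the candidate list is an assumption rather than an exhaustive inventory:

# Try a few common data layer names; the list here is an assumption
for name in ["dataLayer", "digitalData", "utag_data"]:
    value = newDriver.execute_script(f"return window['{name}']")
    if value:
        print(f"Found a data layer under window.{name}")
        break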

The code is also vulnerable to anti-scraping measures; it does not include strategies, such as proxy rotation, that help you scrape without getting blocked. Therefore, you can’t use this code to pull data from the data layer at scale.
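
If you need to route requests through a proxy, Selenium’s Chrome options accept one; here’s a minimal sketch in which the proxy address is a placeholder:

options = webdriver.ChromeOptions()
# proxy.example.com:8080 is a placeholder; substitute a real proxy server
options.add_argument("--proxy-server=http://proxy.example.com:8080")
driver = webdriver.Chrome(options=options)

Rotating through a pool of such proxies, however, is something you’d have to build or buy separately.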

Wrapping Up

It’s possible to extract details from the data layer of GTM using Python. The information can provide insights into your competitors’ tracking strategies. However, you must use an automated browser, such as Selenium or Playwright, instead of plain HTTP requests.
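
For comparison, here’s a hedged sketch of the same read using Playwright’s synchronous API, assuming Playwright and its browsers are installed:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://colourpop.com/")
    # evaluate() runs JavaScript in the page and returns the result
    data_layer = page.evaluate("window.dataLayer")
    print(data_layer)
    browser.close()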

The lack of strategies to bypass anti-scraping measures makes the code unsuitable for large-scale web scraping. Moreover, the data layer variable’s name may differ between sites; you must figure it out by studying each website’s structure, which can be tedious.

It’s better to use a professional web scraping service if you don’t want to code yourself or build more robust code suitable for large-scale data extraction.

Try ScrapeHero. We’re a full-service web scraping service provider capable of building enterprise-grade web scrapers. Our services range from data extraction to custom robotic process automation.
