How to Scrape Real-Time Data Streams from Server-Sent Events

Share:

Scraping real time data streams

Server-Sent Events (SSE) allow servers to send real-time data to clients over an established HTTP connection. Therefore, you can use them for scraping real-time data streams. Not sure how to get started? Read on. 

This tutorial discusses how to scrape data from SSE in Python and JavaScript.

How to Scrape from SSEs

Logic of scraping data SSE endpoints

SSEs are unidirectional. Once the client establishes a connection to the SSE endpoint, the client can’t send data to the server. However, the server can send data using the same connection without the client specifically requesting it. 

So you need an endpoint for SSEs to work, and scraping live data streams from SSEs starts with locating this endpoint. 

After that, you can establish an HTTP connection to that endpoint using a GET request and extract data from the events sent by the server.

To find the SSE endpoint:

  1. Go to the target web page
  2. Open developer tools (right-click anywhere and select ‘Inspect’)
  3. Select ‘Network’ tab and filter by ‘Fetch/XHR’
  4. Look for and copy the URL of the GET request with the type ‘eventsource’

Here’s a demo HTML page that uses server-sent events.

HTML demo page showing a server-sent event

Besides using the endpoint directly, you can also extract data from the HTML page. However, this requires you to use a headless browser, which is more resource-intensive. This article on scraping Google’s dynamic search results discusses how to use headless browsers for real-time data scraping.

Scraping Real-Time Data Streams Using JavaScript

In JavaScript, you need to install a module named eventsource to parse data from SSE. You can install it using the node package manager (npm).

After installing the necessary packages, you can start writing code.

Start by importing the EventSource class and the fs module. While the EventSource class enables you to extract SSE data, the fs module allows you to save the extracted data to a file.

const EventSource = require('eventsource').EventSource;
const fs = require('fs')

The code shown in this tutorial uses two functions:

  • scrape_sse(): Extracts data from an SSE endpoint.
  • writeToFile(): Writes the extracted data to a JSON lines file.

scrape_sse()

This function accepts a URL and uses it to create an object of the EventSource class.

const source = new EventSource(url);

This object contains methods to grab the data from the SSE endpoint. But first define a variable to store the data.

let data;

Next, use the onmessage() method to extract and save the data to a file.

source.onmessage = (event) => {
    const time = new Date().toTimeString().split(' ')[0];
    data = {[JSON.parse(event.data).msg]: time};
    writeToFile(data)
  };

The above code:

  1. Builds a dict with the message as the key and the current time as the value
  2. Calls writeToFile() to save the extracted data to a JSON lines file.

writeToFile()

This function accepts the extracted data as a dict and saves it in a JSON lines file. The code uses fs.appendFileSync() to do so.

function writeToFile(data) {
  const logEntry = JSON.stringify(data) + '\n';
  fs.appendFileSync('output.jsonl', logEntry, 'utf8');
}

After defining the function, you can call scrape_sse() with a URL as the argument.

if(require.main === module) {
  scrapeSse('https://sse.dev/test?interval=1');
  process.on('SIGINT', () => {
    console.log('Streaming stopped');
    process.exit();
  });
}

Here’s the complete code:

const EventSource = require('eventsource').EventSource;
const fs = require('fs')


function writeToFile(data) {
  const logEntry = JSON.stringify(data) + '\n';
  fs.appendFileSync('output.jsonl', logEntry, 'utf8');
}


function scrapeSse(url) {
 
  const source = new EventSource(url);
  let data;


  source.onmessage = (event) => {
    const time = new Date().toTimeString().split(' ')[0];
    data = {[JSON.parse(event.data).msg]: time};
    writeToFile(data)
  };


  source.onerror = (error) => {
    console.error('SSE error:', error);
  };
}


if(require.main === module) {
  scrapeSse('https://sse.dev/test?interval=1');


  process.on('SIGINT', () => {
    console.log('Streaming stopped');
    process.exit();
  });
}

Go the hassle-free route with ScrapeHero

Why worry about expensive infrastructure, resource allocation and complex websites when ScrapeHero can scrape for you at a fraction of the cost?

Scraping Real-Time Data Streams Using Python

In Python, you need two modules: sseclient and requests. You can install both of them using PIP.

The requests library connects with the data stream and the sseclient parses the data sent by the server.

Besides these, you’ll also need the json module to save the extracted data in a JSON file.

Start your code to scrape real-time data streams by importing the packages mentioned above.

import sseclient
import requests

from datetime import datetime
import json

Next, create two functions:

  • scrape_sse(): Extracts data from server-sent events
  • save_data(): Saves the extracted data

scrape_sse()

This function 

  1. Accepts a URL
  2. Extracts data from an SSE endpoint
  3. Saves the data

Begin the function by using Python requests to connect to the SSE endpoint.

headers = {"Accept": "text/event-stream"}
    response = requests.get(url, stream=True, headers=headers)

Then, pass the response to the class SSEClient, which creates an object.

client = sseclient.SSEClient(response)

The created object has a method events() that returns a generator. You can use this generator to process the events sent by the server and extract the required data.

for event in client.events():
        date_time = datetime.now()
        current_time = str(date_time).split()[1].split('.')[0]
        data = json.loads(event.data)
        print(data)
        string_data = json.dumps({current_time:data['msg']})
        save_data(string_data)

The above code:

  1. Gets the current time
  2. Extracts the data from the SSE
  3. Saves it in a JSON file by calling save_data

save_data()

This function accepts the extracted data and saves it in a JSON Lines file.

def save_data(data):
    with open("data.jsonl", "a", encoding="utf-8") as f:
        f.write(data + "\n")

You can now call scrape_sse() with a URL as the argument.

try:
    scrape_sse("https://sse.dev/test")
except KeyboardInterrupt:
    print("Streaming has been stopped")

Code Limitations

Although you can use the code shown in this tutorial to scrape data from a real website:

  • You need to find the correct SSE endpoint, which can be challenging.
  • The structure returned by the endpoint can change, breaking your code; you need to alter your code whenever that happens.

Want a no-hassle solution for gathering real-time data? Try ScrapeHero’s web scraping API.

Wrapping Up: Why Use a Web Scraping Service

You can scrape data from Server-Sent Events using packages like sseclient (Python) or eventsource (JavaScript), provided you have the SSE endpoint.

However, finding the SSE endpoint can be tedious. Moreover, you must monitor the changes to the endpoint yourself. If you want to avoid all this work, consider using a web scraping service.

A web scraping service like ScrapeHero can gather real-time data despite technical hurdles. Their experts can build enterprise-grade web scrapers capable of pulling high-quality data. You can then focus on analyzing the real-time data without worrying about extracting it.

Table of contents

Scrape any website, any format, no sweat.

ScrapeHero is the real deal for enterprise-grade scraping.

Clients love ScrapeHero on G2

Ready to turn the internet into meaningful and usable data?

Contact us to schedule a brief, introductory call with our experts and learn how we can assist your needs.

Continue Reading

Data Storytelling with Web Scraping

Messy Data? Data Storytelling with Web Scraping Makes It Clear

Explore how web scraping and data storytelling work together to drive more brilliant insights.
Scraping Real-Time Data from WebSockets

All About Scraping Real-Time Data from WebSockets

Learn how to scrape real-time data from WebSockets.
Adaptive Web Scraping

Tired of Broken Scrapers? Discover Adaptive Web Scraping

Learn to build a resilient scraper using adaptive web scraping.
ScrapeHero Logo

Can we help you get some data?