Big Data, Web Scraping and Analysis

Big data refers to large volumes of rapidly generated, mostly unstructured data that can also be unreliable. Its collection and analysis, though challenging, are crucial for your business to remain competitive.

This article gives an overview of web scraping big data and the methods to analyze it.

Data Types, Sources, and Characteristics of Big Data

Sources of Big Data

You can perform big data extraction from various sources:

  • APIs: You can get data from the APIs offered by various websites or social media platforms. These may provide additional data or data in a more structured format; see the sketch after this list.
  • Sensors: IoT devices can collect data. These devices can store and transmit information regarding their usage.
  • Web Scraping: You can extract vast quantities of public data online. This tutorial discusses big data web scraping using Python.
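
For instance, here is a minimal sketch of pulling data from an API with Python requests. The endpoint URL, query parameters, and response structure are hypothetical placeholders, not any specific site's API.

import requests  # HTTP client for calling the API

# Hypothetical API endpoint; replace with the actual URL and parameters
api_url = "https://api.example.com/v1/products"
params = {"category": "electronics", "page": 1}

response = requests.get(api_url, params=params, timeout=30)
response.raise_for_status()  # Stop early if the request failed

# Many APIs return JSON, which parses directly into Python objects
data = response.json()
print(data)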

Data Types Included While Web Scraping Big Data

Data types in big data can be

  • Structured
  • Semi-Structured
  • Unstructured

Structured data is what traditional data analysis uses. Two popular examples are

  • Tables: Data organized in rows and columns, where you need two parameters, a row and a column, to refer to a data point.
  • Key-value: Data organized as key-value pairs, where you can refer to a data point with one parameter, the key.

Semi-structured data has a structure that varies throughout the data set. For example

  • XML/HTML: Stores data in various tags, giving it a structure. However, the tags’ names and positions are inconsistent throughout the data set.
  • JSON: Stores data in key-value pairs, but the values could be arrays or another set of key-value pairs.

Unstructured data is everything else; it doesn’t have any structure or apparent patterns. It can include:

  • Video
  • Audio
  • Text

Web Scraping Big Data Using Python

Big data scraping is possible with Python, which has several libraries dedicated to web scraping. For example,

  • Python requests can get HTML data from websites via HTTP requests.
  • BeautifulSoup can leverage HTML or XML parsers and provide intuitive methods to extract data.
  • Selenium and Playwright can programmatically control a browser, making them suitable for scraping dynamic websites.

The above libraries can fetch structured, semi-structured, or unstructured data. Read on to get an overview of scraping each kind of data.

Tables

You can directly scrape tables using Pandas. The read_html() method of Pandas can extract tables from a URL, and you can also use Pandas to clean the extracted data.

import pandas as pd # Import the pandas library for data manipulation

# URL of the web page containing the table
url = "http://example.com/table-data"

# Use the read_html() method to scrape all tables from the URL
tables = pd.read_html(url)

# Assuming the table of interest is the first table (index 0)
df = tables[0]

# Print the first few rows of the dataframe
print(df.head())

HTML/XML

You can extract XML/HTML data using Python requests. The get() method of Python requests can send HTTP requests to a URL, and the resulting response will contain the HTML/XML data.

import requests

# Send an HTTP GET request and print the raw HTML of the response
response = requests.get("https://example.com")
print(response.text)

Python requests can get HTML/XML data from static websites. However, you require automated browsers like Playwright or Selenium to scrape dynamic websites.

You can visit the website using your preferred automated browser to extract the source code.

# Import the web driver from Selenium
from selenium import webdriver

# Launch the browser in headless mode
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
browser = webdriver.Chrome(options=options)


# Visit the website
browser.get("https://example.com")


# Get the page source
print(browser.page_source)

The above code

  • Launches a Selenium browser
  • Visits a website
  • Gets the HTML source code
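
If you prefer Playwright, here is a rough equivalent of the Selenium snippet above. It assumes you have installed the playwright package and its browser binaries (playwright install).

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch a headless Chromium browser
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Visit the website
    page.goto("https://example.com")

    # Get the rendered page source
    print(page.content())

    browser.close()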

JSON

You can extract JSON data from the HTML/XML. Usually, a script tag holds the JSON data. In that case, you can use Python requests to get the HTML and BeautifulSoup to parse it and find the script tag.

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")
script = soup.find('script')
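
Once you find the script element, you can parse its contents with Python's json module. Below is a minimal sketch; the type="application/ld+json" attribute is a common but assumed way for pages to embed JSON, and some pages wrap the JSON in JavaScript that you must trim first.

import json

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Hypothetical example: look for a script tag that holds embedded JSON
script = soup.find('script', type='application/ld+json')

if script and script.string:
    data = json.loads(script.string)  # Parse the embedded JSON into a Python object
    print(data)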

Videos/Images

To scrape videos, you need to find the URL. You can get the URLs from the HTML/XML data fetched with Python requests or Selenium.

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")
video = soup.find('video')
image = soup.find('img')

Then, use Python requests to fetch the video; the process for images is similar.

video_content = requests.get(video['src'])
image_content = requests.get(image['src'])


with open("video", "wb") as video_file:
    video_file.write(video_content.content)


with open("image", "wb") as image_file:
    image_file.write(image_content.content)

Text

You can parse and extract the text once you fetch the HTML/XML data. You can use BeautifulSoup for that. Find an element using the find method, and use the text attribute to extract the text.

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")
print(soup.text)
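
To pull text from a specific element rather than the whole page, combine find() with the text attribute. The tag and class names below are hypothetical placeholders; adjust them to the target page's markup.

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Hypothetical selectors for a heading and an article body
heading = soup.find('h1')
article = soup.find('div', class_='article-body')

if heading:
    print(heading.text.strip())
if article:
    print(article.text.strip())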

Analyzing Big Data Using Python

Basic Steps of Big Data Analytics

Big data analytics includes three steps:

  1. Collection
  2. Pre-Processing
  3. Analysis

Web scraping comes under the first step.

The second step involves preparing the data for analysis. This step is crucial, as poor pre-processing can lead to inaccurate results.

Pre-processing involves data cleaning and data transformation.

Data transformation makes the data more suitable for analysis. It involves the following; a short pandas sketch of a few of these operations follows the list.

  • Making continuous data discrete (binning)
  • Grouping data
  • Applying log transformations
  • Aggregating multiple data points into one
  • Reducing the number of variables in a data set
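
Here is a minimal pandas sketch of a few of these transformations, namely discretization, log transformation, and aggregation; the data set and column names are made up for illustration.

import numpy as np
import pandas as pd

# Hypothetical data set of purchase records
df = pd.DataFrame({
    'customer': ['A', 'A', 'B', 'B', 'C'],
    'age': [22, 22, 47, 47, 35],
    'amount': [25, 40, 310, 95, 120],
})

# Discretization: bucket a continuous column into categories
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100], labels=['young', 'middle', 'senior'])

# Log transformation: compress the range of a skewed column
df['log_amount'] = np.log(df['amount'])

# Grouping and aggregation: collapse multiple rows into one per customer
summary = df.groupby('customer')['amount'].sum()

print(df)
print(summary)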

Data cleaning makes the data more reliable. It involves

  • Filling in missing data
  • Removing duplicate data
  • Removing outliers, or values that deviate significantly from the rest
  • Ensuring consistency across the data set

You can use Python for pre-processing. There are libraries dedicated to handling each type of data. Examples include

  • Pandas for structured data: it consists of several methods for cleaning data. For example, its dropna() method can remove rows with missing values.
    import pandas as pd  # Import the pandas library for data manipulation
    
    
    # Load the dataset into a dataframe from a CSV file
    df = pd.read_csv("data.csv")
    
    
    # Remove rows with missing values using the dropna() method
    df = df.dropna()
    
    
    # Print the first few rows of the cleaned dataframe
    print(df.head())
  • Scikit-image for images: it has functions for image manipulation, including filtering (applying filters) and segmentation (making sections of an image).
    from skimage import io, filters  # Import the skimage library for image processing
    
    
    # Read the image from a file
    image = io.imread("image.jpg")
    
    
    # Apply the Sobel filter to detect edges in the image
    edges = filters.sobel(image)
    
    
    # Display the edges using the io module
    io.imshow(edges)
    io.show()
  • OpenCV for tasks requiring computer vision: It can read videos and convert them into frames; you can then process each frame.
    import cv2  # Import the OpenCV library for computer vision tasks
    
    
    # Open the video file
    cap = cv2.VideoCapture("video.mp4")
    
    
    # Loop through the video frames
    while cap.isOpened():
        ret, frame = cap.read()  # Read a frame from the video
        if not ret:
            break  # Exit the loop if no more frames
    
    
        # Convert the frame to grayscale
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    
    
        # Display the grayscale frame
        cv2.imshow("frame", gray)
    
    
        # Break the loop if the "q" key is pressed
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    
    
    # Release the video capture object and close all OpenCV windows
    cap.release()
    cv2.destroyAllWindows()
    
  • The os module for interacting with the operating system: It allows you to manage a directory, get environment variables, run programs, etc.
    import os  # Import the os module for interacting with the operating system
    
    
    # List files in a specified directory
    files = os.listdir("path_to_directory")
    
    
    # Print the list of files
    print(files)
    

After preprocessing comes the analysis, where you use various techniques to derive insights from data. A few of them are

  • Data Mining: Understanding the patterns in the data sets and deriving insights.
  • Predictive Analysis: Predicting future outcomes based on past data
  • Deep Learning: Using neural networks to solve a problem
  • Text Analysis: Deriving meaning out of large amounts of text
  • Statistical Methods: Using statistical methods to analyze data

Data Mining

The process tries to find relationships, patterns, and trends in a data set to derive meaningful conclusions. A popular data mining method is classification, where you classify data into categories.

For example, you can classify a person as someone who makes a specific purchase or someone who doesn’t based on their salary and age.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# Create a DataFrame
data_classification = {
    'age': [22, 25, 47, 52, 46, 56, 55, 60, 62, 61, 30, 48, 33, 40, 28, 22, 25, 35, 29, 40],
    'salary': [25000, 30000, 52000, 110000, 95000, 105000, 92000, 130000, 100000, 98000, 40000, 65000, 48000, 85000, 31000, 27000, 29000, 47000, 32000, 60000],
    'purchased': [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1]  # 0 = No, 1 = Yes
}


df_classification = pd.DataFrame(data_classification)


# Prepare the features and target variable
X = df_classification[['age', 'salary']]
y = df_classification['purchased']


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


# Create and train a decision tree classifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)


# Make predictions on the test set
y_pred = clf.predict(X_test)


# Print the accuracy of the classifier
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Predictive Analysis

Predictive analysis involves understanding past data to predict future results. Linear regression is one of the methods you can use to predict future outcomes; it assumes a linear relationship between a target variable and one or more features.

For example, you can assume that experience and salary have a linear relationship and then use linear regression to make predictions.

import pandas as pd
from sklearn.linear_model import LinearRegression


# Create a DataFrame
data_regression = {
    'years_experience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'salary': [45000, 50000, 60000, 65000, 70000, 75000, 80000, 85000, 90000, 95000]
}


df_regression = pd.DataFrame(data_regression)




# Prepare the features and target variable
X = df_regression[['years_experience']]
y = df_regression['salary']


# Create and train a linear regression model
reg = LinearRegression().fit(X, y)


# Make predictions on the training set
predictions = reg.predict(X)


# Print the predictions
print(predictions)

Deep Learning

In deep learning, you use neural networks to find complex patterns in seemingly unstructured data. Neural networks are algorithms that operate similarly to the brain’s neurons.

Here is the code for training a model that predicts house prices from the number of rooms, square footage, and age of the houses.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import pandas as pd


# Create a DataFrame
data_deep_learning = {
    'num_rooms': [3, 4, 2, 3, 4, 3, 5, 4, 2, 3],
    'square_footage': [1500, 1800, 1200, 1600, 2000, 1500, 2500, 1800, 1100, 1700],
    'age_of_house': [10, 15, 5, 10, 20, 10, 30, 25, 5, 12],
    'price': [300000, 400000, 250000, 350000, 500000, 300000, 600000, 450000, 220000, 370000]
}


df_deep_learning = pd.DataFrame(data_deep_learning)




# Prepare the features and target variable
X = df_deep_learning[['num_rooms', 'square_footage', 'age_of_house']]
y = df_deep_learning['price']


# Define the input shape based on the features
input_shape = X.shape[1]


# Define a simple neural network model
model = Sequential([
    Dense(64, activation='relu', input_shape=(input_shape,)),  # Input layer
    Dense(64, activation='relu'),  # Hidden layer
    Dense(1)  # Output layer
])


# Compile the model with Adam optimizer and mean squared error loss function
model.compile(optimizer='adam', loss='mse')


# Train the model on the training data
model.fit(X, y, epochs=10, batch_size=32)

Text Analysis

Text analysis involves analyzing large amounts of text to derive meaning or information. For example,

  • Sentiment Analysis: Understands the tone behind the text
    from textblob import TextBlob  # Import TextBlob for text processing
    
    
    # Sample text for sentiment analysis
    text = "I love programming in Python!"
    
    
    # Create a TextBlob object
    blob = TextBlob(text)
    
    
    # Print the sentiment of the text (polarity and subjectivity)
    print(blob.sentiment)
  • Text Extraction: Gets a particular information from the text
    import re  # Import the re module for regular expressions
    
    
    # Sample text containing a number
    text = "Sample text with a number 12345"
    
    
    # Find all sequences of digits in the text
    numbers = re.findall(r'\d+', text)
    
    
    # Print the extracted numbers
    print(numbers)

Statistical Methods

Statistical methods apply traditional statistics to big data. Although these methods have limitations, they can be useful in the early stages of big data analytics, when you need to understand the characteristics and structure of the data.

For example, central tendencies, like mean, can describe the data set.

import numpy as np  # Import NumPy for numerical operations


# Sample data
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]


# Calculate and print the mean of the data
print(np.mean(data))

Challenges in Big Data Analytics

Big data analytics is challenging because of its characteristics: velocity, volume, veracity, and variety.

  • Velocity: the rate at which the data is generated
  • Volume: the size of the data set
  • Veracity: the reliability of the data set
  • Variety: the data types in the data set

In big data, the magnitude of each of these characteristics is enormous.

Rapid data generation makes storing and processing more challenging. As this pace increases with the latest technology, it is essential to keep your infrastructure up-to-date.

The large volume adds further burden to your infrastructure. You must ensure adequate capacity to store big data as you need a considerable amount to derive meaningful insights.

Unreliability increases the work required for processing, making meticulous data cleaning crucial to ensure high-quality data.

Finally, the variety of data reduces the efficiency of web scraping big data, as you cannot use a single solution for every data type.

Wrapping Up

Big data is a large volume of structured and unstructured data that grows rapidly. Traditional data analysis techniques are inefficient for analyzing it; its collection and storage are also challenging.

For both collection and analysis, you can use Python. For example, Python has libraries like Requests and BeautifulSoup for web scraping and Pandas and Scikit-learn for analysis.

However, it can be overwhelming to perform both web scraping and analysis. That is where ScrapeHero Services can help you.

We are a full-service web scraping service provider capable of building enterprise-grade scrapers and crawlers. You only have to provide the requirements, and we will deliver high-quality data that you can use to train your AI models.
