How to scrape Yahoo Finance and extract stock market data using Python & LXML

Yahoo Finance is a good source for extracting financial data, whether it is stock market data, trading prices, or business-related news.

In this tutorial, we will extract the trading summary for a public company from Yahoo Finance (like http://finance.yahoo.com/quote/AAPL?p=AAPL). We’ll be extracting the following fields for this tutorial.

  1. Previous Close
  2. Open
  3. Bid
  4. Ask
  5. Day’s Range
  6. 52 Week Range
  7. Volume
  8. Average Volume
  9. Market Cap
  10. Beta
  11. PE Ratio
  12. EPS
  13. Earnings Date
  14. Dividend & Yield
  15. Ex-Dividend Date
  16. 1y Target Est

Below is a screenshot of what data we’ll be extracting from Yahoo Finance.

Scraping Logic

  1. Construct the URL of the search results page from Yahoo Finance. For example, here is the one for Apple: http://finance.yahoo.com/quote/AAPL?p=AAPL
  2. Download the HTML of the search result page using Python Requests.
  3. Parse the page using LXML – LXML lets you navigate the HTML tree structure using XPaths. We have predefined the XPaths for the details we need in the code.
  4. Save the data to a JSON file.
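The four steps above can be sketched as follows. The XPath selectors here are illustrative placeholders rather than the tutorial's exact ones (Yahoo's markup changes often), and the function names are ours:

```python
import json
import requests
from lxml import html

def build_url(ticker):
    # Step 1: construct the quote-page URL for a ticker symbol
    return "http://finance.yahoo.com/quote/%s?p=%s" % (ticker, ticker)

def parse_summary(page_html):
    # Step 3: walk the summary-table rows with XPath (placeholder selectors)
    parser = html.fromstring(page_html)
    data = {}
    for row in parser.xpath('//tr'):
        cells = row.xpath('.//td//text()')
        if len(cells) >= 2:
            data[cells[0]] = "".join(cells[1:])
    return data

def scrape(ticker):
    # Step 2: download the page, then Step 4: save the parsed data as JSON
    response = requests.get(build_url(ticker))
    data = parse_summary(response.text)
    with open("%s-summary.json" % ticker, "w") as fp:
        json.dump(data, fp, indent=4)
    return data
```

The real script downloadable below follows this same shape, with the XPaths tuned to Yahoo's actual summary table.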

Requirements

Install Python 3 and Pip

Here is a guide to install Python 3 in Linux – http://docs.python-guide.org/en/latest/starting/install3/linux/

Mac Users can follow this guide – http://docs.python-guide.org/en/latest/starting/install3/osx/

Windows Users go here – https://www.scrapehero.com/how-to-install-python3-in-windows-10/

Packages

For this web scraping tutorial using Python 3, we will need two packages for downloading and parsing the HTML: Python Requests, to download the page, and Python LXML, to parse it with XPaths. Both can be installed with pip, for example pip3 install requests lxml.


The Code

You can download the code from https://gist.github.com/scrapehero/516fc801a210433602fe9fd41a69b496

If you would like the code in Python 2, it is available at https://gist.github.com/scrapehero/b0c7426f85aeaba441d603bb81e1d0e2

Running the Scraper

Assume the script is named yahoo_finance.py. If you run the script from a command prompt or terminal with the -h flag, you will see the usage information:

python yahoo_finance.py -h

usage: yahoo_finance.py [-h] ticker

positional arguments:
  ticker

optional arguments:
  -h, --help  show this help message and exit

The ticker argument is the ticker symbol or stock symbol to identify a company.

To find the stock data for Apple Inc we would put the argument like this:

 python3 yahoo_finance.py aapl

This should create a JSON file called aapl-summary.json that will be in the same folder as the script.
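The -h output shown earlier comes from argparse. A minimal sketch of this command-line interface (the help text is our own, since only the usage output appears above) looks like:

```python
import argparse

def build_argparser():
    # Single positional argument: the ticker symbol, e.g. "aapl"
    argparser = argparse.ArgumentParser()
    argparser.add_argument("ticker", help="ticker symbol of the company")
    return argparser

# Passing the argument list explicitly, instead of reading sys.argv
args = build_argparser().parse_args(["aapl"])
print(args.ticker)  # aapl
```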

The output file would look similar to this:

{
    "Previous Close": "139.52", 
    "Open": "138.92", 
    "Bid": "138.69 x 100", 
    "Ask": "139.01 x 4600", 
    "Day's Range": "138.82 - 139.80", 
    "52 Week Range": "89.47 - 140.28", 
    "Volume": "16,641,812", 
    "Avg. Volume": "28,451,631", 
    "Market Cap": "729.58B", 
    "Beta": "1.36", 
    "PE Ratio (TTM)": "16.69", 
    "EPS (TTM)": 8.33, 
    "Earnings Date": "2017-04-24 to 2017-04-28", 
    "Dividend & Yield": "2.28 (1.63%)", 
    "Ex-Dividend Date": "N/A", 
    "1y Target Est": 142.48, 
    "url": "http://finance.yahoo.com/quote/aapl?p=aapl", 
    "ticker": "aapl"
}
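Because the scraper writes standard JSON, downstream code can load the summary back with the json module. A small round-trip sketch, using a fragment of the sample output above inlined as a string:

```python
import json

# A fragment of the sample output above, inlined for illustration
sample = '{"Previous Close": "139.52", "Market Cap": "729.58B", "ticker": "aapl"}'
summary = json.loads(sample)
print(summary["Market Cap"])  # 729.58B
```

In practice you would open aapl-summary.json and use json.load on the file object instead.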


Let us know in the comments how this scraper worked for you.

Known Limitations

This code should work for grabbing stock market data of most companies. However, if you want to scrape thousands of pages frequently (say, multiple times per hour), there are some important things you should be aware of; you can read about them in How to build and run scrapers on a large scale and How to prevent getting blacklisted while scraping.

If you need professional help with scraping complex websites, contact us.


Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.


Responses

soulsoldseparately July 16, 2018

Is the ‘print’ function the main difference between using Python 2.7 vs 3.6?


Jose Fernandes (@joseferpt) September 15, 2018

I would like to scrape the Statistics and Analysis pages. Can you please share the code, or indicate the changes to make to the summary code shared above? Thanks in advance.


Jeff December 23, 2018

Thanks for the great work you do. I’ve been wanting to do something like this for quite some time and you provided me the right motivation. I hope you don’t mind, but I’ve modified your code a bit to add some flexibility. You use the actual webpage people get at Yahoo Finance just for a few pieces of data. For the rest you use an address that returns a nice JSON blob that you use to fill in the rest of the information. It works great, but the same custom address doesn’t return much for mutual funds or ETFs. I was able to find a similar address that could be used for mutual funds and ETFs, but think a better approach is to just use the publicly known webpage. I was able to manipulate that and produce summary information for stocks (same output as your script), mutual funds and ETFs.

The other advantage of doing it this way is that there’s a vast amount of other information available in the JSON blobs that I grab. To find out what is available I suggest using http://beautifytools.com/html-beautifier.php and loading the Yahoo Financial summary page url. Once you click on “Beautify html” you’re presented with a nice tree format of what’s in there. This view will also show where the paths came from for the data I do store.

Here’s the modified code:

from lxml import html
import requests
from time import sleep
import json
import argparse
from collections import OrderedDict

def matching(string, begTok, endTok):
    # Find the location of the beginning token
    start = string.find(begTok)
    stack = []
    # Append it to the stack
    stack.append(start)
    end = -1
    # Loop through the rest of the string until we find the matching ending token
    for i in range(start + 1, len(string)):
        if begTok in string[i]:
            stack.append(i)
        elif endTok in string[i]:
            stack.pop()
            if len(stack) == 0:
                # Removed the last begTok, so we're done
                end = i + 1
                break
    return end

def parse(ticker):
    # Yahoo Finance summary for a stock, mutual fund or ETF
    url = "http://finance.yahoo.com/quote/%s?p=%s" % (ticker, ticker)
    response = requests.get(url, verify=False)
    print("Parsing %s" % (url))
    sleep(4)
    summary_data = OrderedDict()

    # Locate the _context JSON blob, which tells us whether this is an equity, a mutual fund or an ETF
    contextStart = response.text.find('"_context"')
    contextEnd = contextStart + matching(response.text[contextStart:], '{', '}')

    # Locate the QuoteSummaryStore JSON blob
    summaryStart = response.text.find('"QuoteSummaryStore"')
    summaryEnd = summaryStart + matching(response.text[summaryStart:], '{', '}')

    # Locate the ticker quote JSON blob
    streamStart = response.text.find('"StreamDataStore"')
    quoteStart = streamStart + response.text[streamStart:].find("%s" % ticker.upper()) - 1
    quoteEnd = quoteStart + matching(response.text[quoteStart:], '{', '}')

    try:
        json_loaded_context = json.loads('{' + response.text[contextStart:contextEnd] + '}')
        json_loaded_summary = json.loads('{' + response.text[summaryStart:summaryEnd] + '}')
        # Didn't end up needing this for the summary details, but there's lots of good data there
        json_loaded_quote = json.loads('{' + response.text[quoteStart:quoteEnd] + '}')
        store = json_loaded_summary["QuoteSummaryStore"]
        if "EQUITY" in json_loaded_context["_context"]["quoteType"]:
            # Define all the data that appears on the Yahoo Finance summary page for a stock
            # Use http://beautifytools.com/html-beautifier.php to understand where each path came from or to add any additional data
            prev_close = store["summaryDetail"]["previousClose"]['fmt']
            mark_open = store["summaryDetail"]["open"]['fmt']
            bid = store["summaryDetail"]["bid"]['fmt'] + " x " + str(store["summaryDetail"]["bidSize"]['raw'])
            ask = store["summaryDetail"]["ask"]['fmt'] + " x " + str(store["summaryDetail"]["askSize"]['raw'])
            day_range = store["summaryDetail"]["regularMarketDayLow"]['fmt'] + " - " + store["summaryDetail"]["regularMarketDayHigh"]['fmt']
            year_range = store["summaryDetail"]["fiftyTwoWeekLow"]['fmt'] + " - " + store["summaryDetail"]["fiftyTwoWeekHigh"]['fmt']
            volume = store["summaryDetail"]["volume"]['longFmt']
            avg_volume = store["summaryDetail"]["averageVolume"]['longFmt']
            market_cap = store["summaryDetail"]["marketCap"]['fmt']
            beta = store["summaryDetail"]["beta"]['fmt']
            PE = store["summaryDetail"]["trailingPE"]['fmt']
            eps = store["defaultKeyStatistics"]["trailingEps"]['fmt']
            earnings_list = store["calendarEvents"]['earnings']
            datelist = []
            for i in earnings_list['earningsDate']:
                datelist.append(i['fmt'])
            earnings_date = ' to '.join(datelist)
            div = store["summaryDetail"]["dividendRate"]['fmt'] + " (" + store["summaryDetail"]["dividendYield"]['fmt'] + ")"
            ex_div_date = store["summaryDetail"]["exDividendDate"]['fmt']
            y_Target_Est = store["financialData"]["targetMeanPrice"]['raw']

            # Store ordered pairs to be written to a file
            summary_data.update({'Previous Close': prev_close, 'Open': mark_open, 'Bid': bid, 'Ask': ask,
                "Day's Range": day_range, '52 Week Range': year_range, 'Volume': volume,
                'Avg. Volume': avg_volume, 'Market Cap': market_cap, 'Beta (3Y Monthly)': beta,
                'PE Ratio (TTM)': PE, 'EPS (TTM)': eps, 'Earnings Date': earnings_date,
                'Forward Dividend & Yield': div, 'Ex-Dividend Date': ex_div_date,
                '1y Target Est': y_Target_Est, 'ticker': ticker, 'url': url})
            return summary_data
        elif "MUTUALFUND" in json_loaded_context["_context"]["quoteType"]:
            # Define all the data that appears on the Yahoo Finance summary page for a mutual fund
            prev_close = store["summaryDetail"]["previousClose"]['fmt']
            ytd_return = store["summaryDetail"]["ytdReturn"]['fmt']
            exp_rat = store["defaultKeyStatistics"]["annualReportExpenseRatio"]['fmt']
            category = store["fundProfile"]["categoryName"]
            last_cap_gain = store["defaultKeyStatistics"]["lastCapGain"]['fmt']
            morningstar_rating = store["defaultKeyStatistics"]["morningStarOverallRating"]['raw']
            morningstar_risk_rating = store["defaultKeyStatistics"]["morningStarRiskRating"]['raw']
            sustainability_rating = store["esgScores"]["sustainScore"]['raw']
            net_assets = store["summaryDetail"]["totalAssets"]['fmt']
            beta = store["defaultKeyStatistics"]["beta3Year"]['fmt']
            yld = store["summaryDetail"]["yield"]['fmt']
            five_year_avg_ret = store["fundPerformance"]["performanceOverview"]["fiveYrAvgReturnPct"]['fmt']
            holdings_turnover = store["defaultKeyStatistics"]["annualHoldingsTurnover"]['fmt']
            div = store["defaultKeyStatistics"]["lastDividendValue"]['fmt']
            inception_date = store["defaultKeyStatistics"]["fundInceptionDate"]['fmt']

            # Store ordered pairs to be written to a file
            summary_data.update({'Previous Close': prev_close, 'YTD Return': ytd_return,
                'Expense Ratio (net)': exp_rat, 'Category': category, 'Last Cap Gain': last_cap_gain,
                'Morningstar Rating': morningstar_rating, 'Morningstar Risk Rating': morningstar_risk_rating,
                'Sustainability Rating': sustainability_rating, 'Net Assets': net_assets,
                'Beta (3Y Monthly)': beta, 'Yield': yld, '5y Average Return': five_year_avg_ret,
                'Holdings Turnover': holdings_turnover, 'Last Dividend': div,
                'Average for Category': 'N/A', 'Inception Date': inception_date,
                'ticker': ticker, 'url': url})
            return summary_data
        elif "ETF" in json_loaded_context["_context"]["quoteType"]:
            # Define all the data that appears on the Yahoo Finance summary page for an ETF
            prev_close = store["summaryDetail"]["previousClose"]['fmt']
            mark_open = store["summaryDetail"]["open"]['fmt']
            bid = store["summaryDetail"]["bid"]['fmt'] + " x " + str(store["summaryDetail"]["bidSize"]['raw'])
            ask = store["summaryDetail"]["ask"]['fmt'] + " x " + str(store["summaryDetail"]["askSize"]['raw'])
            day_range = store["summaryDetail"]["regularMarketDayLow"]['fmt'] + " - " + store["summaryDetail"]["regularMarketDayHigh"]['fmt']
            year_range = store["summaryDetail"]["fiftyTwoWeekLow"]['fmt'] + " - " + store["summaryDetail"]["fiftyTwoWeekHigh"]['fmt']
            volume = store["summaryDetail"]["volume"]['longFmt']
            avg_volume = store["summaryDetail"]["averageVolume"]['longFmt']
            net_assets = store["summaryDetail"]["totalAssets"]['fmt']
            nav = store["summaryDetail"]["navPrice"]['fmt']
            yld = store["summaryDetail"]["yield"]['fmt']
            ytd_return = store["defaultKeyStatistics"]["ytdReturn"]['fmt']
            beta = store["defaultKeyStatistics"]['beta3Year']['fmt']
            exp_rat = store["fundProfile"]["feesExpensesInvestment"]["annualReportExpenseRatio"]['fmt']
            inception_date = store["defaultKeyStatistics"]["fundInceptionDate"]['fmt']

            # Store ordered pairs to be written to a file
            summary_data.update({'Previous Close': prev_close, 'Open': mark_open, 'Bid': bid, 'Ask': ask,
                "Day's Range": day_range, '52 Week Range': year_range, 'Volume': volume,
                'Avg. Volume': avg_volume, 'Net Assets': net_assets, 'NAV': nav,
                'PE Ratio (TTM)': 'N/A', 'Yield': yld, 'YTD Return': ytd_return,
                'Beta (3Y Monthly)': beta, 'Expense Ratio (net)': exp_rat,
                'Inception Date': inception_date, 'ticker': ticker, 'url': url})
            return summary_data
    except Exception:
        print("Failed to parse json response")
        return {"error": "Failed to parse json response"}

if __name__ == "__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument('ticker', help='')
    args = argparser.parse_args()
    ticker = args.ticker
    print("Fetching data for %s" % (ticker))
    scraped_data = parse(ticker)
    print("Writing data to output file")
    with open('%s-summary.json' % (ticker), 'w') as fp:
        json.dump(scraped_data, fp, indent=4)
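A quick way to sanity-check the brace-matching helper in the comment above is to run it on a small string. Here is a standalone copy (same logic, with an added end = -1 guard for the case where no matching token is found), extracting a nested JSON object:

```python
def matching(string, begTok, endTok):
    # Return the index just past the token that closes the first begTok
    start = string.find(begTok)
    stack = [start]
    end = -1
    for i in range(start + 1, len(string)):
        if begTok in string[i]:
            stack.append(i)
        elif endTok in string[i]:
            stack.pop()
            if not stack:
                end = i + 1
                break
    return end

text = 'prefix {"a": {"b": 1}} suffix'
print(text[text.find("{"):matching(text, "{", "}")])  # {"a": {"b": 1}}
```

Note that because the comparison is done one character at a time, this only works for single-character tokens such as braces.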

