How to scrape Yahoo Finance and extract stock market data using Python & LXML

Yahoo Finance is a good source of financial data, including stock market data, trading prices, and business news.

In this tutorial, we will extract the trading summary for a public company from Yahoo Finance (for example, http://finance.yahoo.com/quote/AAPL?p=AAPL). We will extract the following fields:

  1. Previous Close
  2. Open
  3. Bid
  4. Ask
  5. Day’s Range
  6. 52 Week Range
  7. Volume
  8. Average Volume
  9. Market Cap
  10. Beta
  11. PE Ratio
  12. EPS
  13. Earnings Date
  14. Dividend & Yield
  15. Ex-Dividend Date
  16. 1y Target Est


Scraping Logic

  1. Construct the URL of the company's quote page on Yahoo Finance. For example, here is the one for Apple: http://finance.yahoo.com/quote/AAPL?p=AAPL
  2. Download the HTML of the quote page using Python Requests.
  3. Parse the page using LXML, which lets you navigate the HTML tree structure using XPaths. We have predefined the XPaths for the details we need in the code.
  4. Save the data to a JSON file.
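The four steps above can be sketched roughly as follows. This is a minimal illustration, not the tutorial's actual script: the function names are hypothetical, the XPath is the one quoted by a commenter further down, and Yahoo's markup may have changed since, so treat the selectors as assumptions.

```python
import json

import requests
from lxml import html


def build_url(ticker):
    """Step 1: build the quote-page URL for a ticker symbol."""
    return "http://finance.yahoo.com/quote/%s?p=%s" % (ticker, ticker)


def download(url):
    """Step 2: fetch the page HTML with Requests."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def parse_summary(page_html):
    """Step 3: read label/value pairs from the summary table rows."""
    tree = html.fromstring(page_html)
    # Assumed XPath; the live page may use different attributes.
    rows = tree.xpath('//div[contains(@data-test, "summary-table")]//tr')
    summary = {}
    for row in rows:
        cells = row.xpath("./td")
        if len(cells) < 2:
            continue  # skip rows that are not label/value pairs
        label = cells[0].text_content().strip()
        value = cells[1].text_content().strip()
        if label:
            summary[label] = value
    return summary


def save_json(data, path):
    """Step 4: write the extracted fields to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=4)


def scrape(ticker):
    """Run all four steps for one ticker and return the output path."""
    data = parse_summary(download(build_url(ticker)))
    path = "%s-summary.json" % ticker
    save_json(data, path)
    return path
```

Calling scrape("aapl") would attempt a live request. The parsing step can be tried offline by passing any HTML string to parse_summary.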

Requirements

Install Python 3 and Pip

Here is a guide to install Python 3 in Linux – http://docs.python-guide.org/en/latest/starting/install3/linux/

Mac Users can follow this guide – http://docs.python-guide.org/en/latest/starting/install3/osx/

Windows Users go here – https://www.scrapehero.com/how-to-install-python3-in-windows-10/

Packages

For this web scraping tutorial using Python 3, we need two packages: Python Requests for downloading the HTML and LXML for parsing it. Both can be installed with pip.

The Code

You can download the code from the link https://gist.github.com/scrapehero/516fc801a210433602fe9fd41a69b496 if the embed above does not work.

If you would like the code in Python 2, check out this link: https://gist.github.com/scrapehero/b0c7426f85aeaba441d603bb81e1d0e2

Running the Scraper

Assume the script is named yahoo_finance.py. If you type the script name in a command prompt or terminal with the -h flag, it prints a usage message describing its arguments.

The ticker argument is the ticker symbol or stock symbol to identify a company.
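A command-line interface with a single positional ticker argument could be set up along these lines. It is a sketch using argparse from the standard library; the description and help strings are assumptions, not the wording of the original script.

```python
import argparse


def build_arg_parser():
    """Create a parser with one positional argument, the ticker symbol."""
    parser = argparse.ArgumentParser(
        description="Fetch the Yahoo Finance trading summary for one company."
    )
    parser.add_argument("ticker", help="ticker symbol of the company, e.g. AAPL")
    return parser


# Parsing an explicit list keeps the example independent of the real command line.
args = build_arg_parser().parse_args(["AAPL"])
print("Fetching data for %s" % args.ticker)
```

argparse adds the -h option automatically, so no extra code is needed for the usage message.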

To find the stock data for Apple Inc., pass its ticker symbol as the argument, for example: python yahoo_finance.py aapl

This should create a JSON file called aapl-summary.json in the same folder as the script.
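The file naming could work roughly like this. The helper name write_summary is hypothetical, and the dictionary holds placeholder values rather than real market data; the example writes into a temporary folder so it leaves nothing behind.

```python
import json
import os
import tempfile


def write_summary(ticker, data, folder="."):
    """Write data to <ticker>-summary.json inside folder and return the path."""
    path = os.path.join(folder, "%s-summary.json" % ticker)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=4)
    return path


# Placeholder values for illustration only; they are not real market data.
example = {"ticker": "aapl", "Open": "0.00"}

with tempfile.TemporaryDirectory() as folder:
    out_path = write_summary("aapl", example, folder=folder)
    file_name = os.path.basename(out_path)
    with open(out_path, encoding="utf-8") as handle:
        round_trip = json.load(handle)

print(file_name)
```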

The output file would look similar to this:


Let us know in the comments how this scraper worked for you.

Known Limitations

This code should work for grabbing stock market data of most companies. However, if you want to scrape thousands of pages and do it frequently (say, multiple times per hour), there are some important things you should be aware of. You can read about them in "How to build and run scrapers on a large scale" and "How to prevent getting blacklisted while scraping".


Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

23 comments on “How to scrape Yahoo Finance and extract stock market data using Python & LXML”

Mike

I tried running your code, but I keep getting hassled by a

SyntaxError: ‘return’ outside function

I think I have your indentation right. Is there any way to post the code somewhere so I can get it straight from the horse’s mouth?

Hawkeye

If you don’t feel like writing a scraper I suggest looking at python’s yahoo-finance package. You can get all of the same data with a few lines of code.

delfinoharrison

Yahoo finance API is not available anymore. I have moved to MarketXLS after this change, much more reliable data.

    ScrapeHero

    This is the problem with APIs – Scraping is the only option to gather such data without using an API that can restrict or get discontinued at any time.

future learn2

This code (raw_table_key = table_data.xpath('.//td[@class="C(black)"]//text()')) is not working in Python 2.7, so I have changed it to this: raw_table_key = table_data.xpath('.//td[contains(@class,"C(black)")]//text()'). Anyway, very nice code and helpful to me. Thanks.
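One plausible reason the change above helps: an exact @class comparison only matches when the attribute holds that one value and nothing else, while contains() also matches when the cell carries additional classes. The snippet below demonstrates the difference on a made-up row; the extra class names are assumptions, not Yahoo's actual markup.

```python
from lxml import html

# Hypothetical row: the label cell carries more than one class.
snippet = (
    '<table><tr>'
    '<td class="C(black) W(51%)">Open</td>'
    '<td class="Ta(end)">170.00</td>'
    '</tr></table>'
)
table_data = html.fromstring(snippet)

# Exact comparison: the whole attribute value must equal "C(black)".
exact = table_data.xpath('.//td[@class="C(black)"]//text()')

# Substring test: matches any class attribute containing "C(black)".
partial = table_data.xpath('.//td[contains(@class,"C(black)")]//text()')

print(exact)
print(partial)
```

With this sample the exact match finds nothing and the contains() version returns the label text.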

Ahmet

Tried your code and I get :

Fetching data for aapl
Parsing http://finance.yahoo.com/quote/aapl?p=aapl
Writing data to output file

when I go to the output file it is very limited:
{
    "": "172.29",
    "url": "http://finance.yahoo.com/quote/aapl?p=aapl",
    "ticker": "aapl",
    "1y Target Est": 172.29,
    "EPS (TTM)": 8.808,
    "Earnings Date": "2017-10-23 to 2017-10-27"
}

Am I doing something wrong? I use Atom as my editor and running on a Macbook

Ira Fuchs

I am trying to use your code just to get a current price for securities. The slimmed down version looks like this:

from lxml import html
import requests
from exceptions import ValueError
from time import sleep
import json
import argparse
from collections import OrderedDict
from time import sleep

def parse(ticker):
    url = "http://finance.yahoo.com/quote/%s?p=%s"%(ticker,ticker)
    response = requests.get(url)
    parser = html.fromstring(response.text)
    summary_table = parser.xpath('//div[contains(@data-test,"summary-table")]//tr')
    summary_data = OrderedDict()
    other_details_json_link = "https://query2.finance.yahoo.com/v10/finance/quoteSummary/{0}?formatted=true&lang=en-US&region=US&modules=financialData".format(ticker)
    summary_json_response = requests.get(other_details_json_link)
    try:
        json_loaded_summary = json.loads(summary_json_response.text)
        return json_loaded_summary["quoteSummary"]["result"][0]["financialData"]["currentPrice"]['raw']
    except ValueError:
        print "Failed to parse json response"
        return {"error":"Failed to parse json response"}

if __name__=="__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument('ticker',help = '')
    args = argparser.parse_args()
    ticker = args.ticker
    print parse(ticker)

This works fine except for getting prices of ETFs (e.g. EFA, VWO).

Yahoo provides these prices but the data is slightly different. My question is how to modify your code to work with ETFs. Also, in writing this script how do you view the XML that needs to be parsed since it is dynamically created?
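A script like the one above indexes straight into the JSON, so it fails with a KeyError or TypeError whenever a response lacks the financialData block. Whether that is what happens for ETFs is an assumption and has not been verified here. A more defensive Python 3 lookup could be sketched as below; the function name is hypothetical and both sample payloads are made up for illustration.

```python
import json


def extract_current_price(payload_text):
    """Return the raw current price, or None if the payload lacks it."""
    try:
        payload = json.loads(payload_text)
    except ValueError:
        return None  # not valid JSON
    if not isinstance(payload, dict):
        return None
    summary = payload.get("quoteSummary") or {}
    results = summary.get("result") or []
    if not results:
        return None  # "result" is null or empty, e.g. on an error response
    financial = results[0].get("financialData") or {}
    price = financial.get("currentPrice") or {}
    return price.get("raw")


# Made-up payloads shaped like the responses quoted in this thread.
ok_payload = json.dumps(
    {"quoteSummary": {"result": [{"financialData": {"currentPrice": {"raw": 1.5}}}], "error": None}}
)
error_payload = json.dumps(
    {"quoteSummary": {"result": None, "error": {"code": "Not Found"}}}
)

print(extract_current_price(ok_payload))
print(extract_current_price(error_payload))
```

Returning None instead of raising lets the caller decide how to handle tickers for which the field is missing.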

The Market Prophet

I’m having some trouble getting this working, getting an error on line 12:

C:\Python36\Scripts>python yf.py aapl
File “yf.py”, line 12
print “Parsing %s”%(url)
^
SyntaxError: invalid syntax

Any Ideas to fix this?

    ScrapeHero

    The scraper is written in Python 2.7 and you are using 3.6. For now, you can install Python 2.7 and run the script with it.

    We’ll soon update this script to python 3
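For reference, the line from the traceback fails because print is a function in Python 3 and needs parentheses. A Python 3 form of that one statement would look like this; the URL value is taken from the log output quoted earlier in the thread.

```python
url = "http://finance.yahoo.com/quote/aapl?p=aapl"

# Python 2 statement form (a SyntaxError in Python 3):
#     print "Parsing %s"%(url)
# Python 3 function form:
message = "Parsing %s" % (url)
print(message)
```

Other Python 2 constructs in the script would need similar changes, so fixing this line alone may not be enough.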

      The Market Prophet

      Wonderful, thank you! Got it going on 2.7. However, I’m having the same issue as Ahmet above, the output is limited to the following:
      {
          "": "189.48",
          "url": "http://finance.yahoo.com/quote/aapl?p=aapl",
          "ticker": "aapl",
          "1y Target Est": 189.48,
          "EPS (TTM)": 9.21,
          "Earnings Date": "2018-02-01"
      }

      Additionally, I’d like to pull all data from the yahoo finance statistics page (https://ca.finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL). I tried simply swapping the URL but it didn’t do the trick so I suspect there’s more to it

      Any tips would be very appreciated!

        The Market Prophet

        Got it working with future learn2’s advice above and pulling all the data I wanted from the json link. Can’t seem to get Dividend data though – doesn’t seem to be in the json link. Any idea how to get dividend data?

          ScrapeHero

          Hi there,
          Please follow the same pattern to identify the dividend field and modify the scraper to grab that or other fields.

          ScrapeHero

          Actually the dividend is already extracted. In the above example
          "Dividend & Yield": "2.28 (1.63%)",

Matt

Is it possible to enter a list of tickers to have the program generate files for each ticker? Instead of doing it individually?

    ScrapeHero

    Hi Matt,
    Sure, you can put the ticker symbols in a text file and write a Python program that reads the file line by line and passes each ticker to this program.
    A quick Google search for “Python read text file as input to script” can provide a lot of scripts or snippets for you.
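A minimal sketch of that idea follows. The helper names are hypothetical, and scrape_one stands for whatever function does the work for a single ticker; the demonstration uses a stand-in so that it runs without network access.

```python
import os
import tempfile


def read_tickers(path):
    """Return the non-empty lines of a text file, one ticker per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def run_for_all(path, scrape_one):
    """Call scrape_one(ticker) for every ticker listed in the file."""
    return {ticker: scrape_one(ticker) for ticker in read_tickers(path)}


# Demonstration with a temporary file and a stand-in for the real scraper.
with tempfile.TemporaryDirectory() as folder:
    listing = os.path.join(folder, "tickers.txt")
    with open(listing, "w", encoding="utf-8") as handle:
        handle.write("AAPL\n\nMSFT\n")
    results = run_for_all(listing, lambda ticker: "%s-summary.json" % ticker.lower())

print(results)
```

Blank lines in the file are skipped, so trailing newlines do not produce empty tickers.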

cayenne91

Hi there,

I tried to run this function parse(ticker) using ticker = ‘APPL’ but when loading for this line:

other_details_json_link = 'https://query2.finance.yahoo.com/v10/finance/quoteSummary/{0}?formatted=true&lang=en-US&region=US&modules=financialData'.format('APPL')

Also tried this url:

"https://query2.finance.yahoo.com/v10/finance/quoteSummary/{0}?formatted=true&lang=en-US&region=US&modules=summaryProfile%2CfinancialData%2CrecommendationTrend%2CupgradeDowngradeHistory%2Cearnings%2CdefaultKeyStatistics%2CcalendarEvents&corsDomain=finance.yahoo.com".format('AAPL')

I get: {'quoteSummary': {'result': None, 'error': {'code': 'Not Found', 'description': 'Quote not found for ticker symbol: APPL'}}}

response is a 404 so I suspect this url doesn’t work anymore?

Cheers,

Ciaran

    ScrapeHero

    Are you trying to find AAPL instead of APPL for Apple?

      cayenne91

      Thank you very much ScrapeHero, this is perfect

      cayenne91

      I’m curious how do you know what the modules are called that can be put into your query. Reason I ask is that for smaller stocks (ticker='BG.VI') I get back the following:

      {"quoteSummary":{"result":null,"error":{"code":"Not Found","description":"No fundamentals data found for any of the summaryTypes=financialData,defaultKeyStatistics,summaryProfile,earnings,calendarEvents,upgradeDowngradeHistory,recommendationTrend"}}}

      If I could understand the query parameters better I might be able to tweak your code to get around this for myself.
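For experimenting with the query parameters, the URL can be assembled from a list of module names instead of being typed by hand. The sketch below only uses module names that appear in this thread and does not document which modules exist; the function name is hypothetical, and the endpoint may no longer respond in the same way.

```python
from urllib.parse import quote, urlencode

BASE = "https://query2.finance.yahoo.com/v10/finance/quoteSummary/"


def build_summary_url(ticker, modules):
    """Build a quoteSummary URL requesting the given modules."""
    query = urlencode({
        "formatted": "true",
        "lang": "en-US",
        "region": "US",
        # Modules are sent as one comma-separated value; urlencode
        # escapes each comma as %2C.
        "modules": ",".join(modules),
    })
    return BASE + quote(ticker) + "?" + query


print(build_summary_url("AAPL", ["financialData"]))
print(build_summary_url("BG.VI", ["financialData", "defaultKeyStatistics"]))
```

Requesting one module at a time makes it easier to see which of them the service reports as missing for a given ticker.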
