Introduction
In this post, I use selenium to demonstrate how to scrape a JavaScript-enabled web page.
Why not Beautiful Soup?
If you have some experience using Python for web scraping, you have probably already heard of beautifulsoup and urllib. With the following code, we can view the HTML and then use the HTML tags to extract the desired elements. However, if the web page is rendered with JavaScript, you will notice that some of the HTML elements can't be seen by Beautiful Soup, because they are produced by JavaScript at load time. Instead, you will only see the script tags, which indicate where the JavaScript code is placed.
import urllib.request
from bs4 import BeautifulSoup

urlpage = 'https://apps.polkcountyiowa.gov/PolkCountyInmates/CurrentInmates/'
# download the web page
f = urllib.request.urlopen(urlpage)
# extract the html text
html = f.read()
# parse html using beautifulsoup
bs = BeautifulSoup(html, 'html.parser')
The desired HTML elements are generated by the JavaScript referenced in the script tags, so an alternative tool is needed to extract this page.
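To see why, here is a minimal offline sketch using only the stdlib html.parser. The markup below is hypothetical, standing in for what a server might return for a JavaScript-rendered page: a static parse sees the empty container and the script tag, but none of the table rows.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML for a JavaScript-rendered page: the table is built
# client-side, so the static source holds only an empty div and a script tag.
RAW_HTML = """
<html><body>
<div id="DataTables_Table_0"></div>
<script src="/js/render-table.js"></script>
</body></html>
"""

class TagCollector(HTMLParser):
    """Collect every opening tag name seen in the document."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

parser = TagCollector()
parser.feed(RAW_HTML)
print(parser.tags)  # → ['html', 'body', 'div', 'script'] — no table rows at all
```

The rows we want simply never appear in the static source, which is exactly what Beautiful Soup would be parsing.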
Procedures of Web-Scraping using Selenium
1. Prerequisite
- download the Chrome driver from https://chromedriver.chromium.org/downloads
- the current stable version is 76.0.3809.126
- choose your operating system (Mac/Windows/Linux)
- extract the webdriver to CHROME_DRIVER (e.g. ./chromedriver)
2. Launch the Chrome Driver
Use selenium to launch a Chrome browser by calling webdriver.Chrome(). A blank Chrome window should pop up.
from selenium import webdriver

CHROME_DRIVER = './chromedriver'
# run chrome webdriver from the executable path of your choice
driver = webdriver.Chrome(executable_path=CHROME_DRIVER)
Now, let’s load the page we want to extract.
import time

# load web page
urlpage = 'https://apps.polkcountyiowa.gov/PolkCountyInmates/CurrentInmates/'
driver.get(urlpage)
print('waiting 15s for page loading')
# wait 15 seconds to allow the page to load completely
time.sleep(15)
Use driver.quit() to close the browser when you are done testing.
3. Parse the Web Page
selenium provides multiple ways to locate HTML elements. Using Chrome Developer Tools (Chrome > More tools > Developer tools), we can easily inspect them. For example, to extract the link of Details, we point at the HTML element and copy its xpath location.
In selenium, we can call find_elements_by_xpath to extract all elements matching an xpath pattern.
xpath = '//*[@id="DataTables_Table_0"]/tbody/tr[1]/td[1]/a'
results = driver.find_elements_by_xpath(xpath)
# results is a list, because find_elements_by_xpath looks for all items matching the xpath;
# find_element_by_xpath returns only the first matching item.
len(results)
# 1
results[0].get_attribute('href')
# https://apps.polkcountyiowa.gov/PolkCountyInmates/CurrentInmates/Details?Book_ID=299591
It's worth noting that this xpath pattern is too specific, so only the first link is returned. Therefore, we need to generalize the xpath pattern to capture all the links.
Let's trace back to the upper levels of the xpath. Instead of using tr[1] to extract the first row, we use *[contains(@role,'row')] to capture all the rows that carry the attribute role='row'.
Then, within each element, we use the td/a xpath to locate the a tags. Because the number of links is relatively big, a tqdm progress bar is added to show the progress of the extraction.
xpath = "//*[@id='DataTables_Table_0']/tbody//*[contains(@role,'row')]"
results = driver.find_elements_by_xpath(xpath)
len(results)
# 938
results[0].find_element_by_xpath('td/a').get_attribute('href')
# https://apps.polkcountyiowa.gov/PolkCountyInmates/CurrentInmates/Details?Book_ID=299591

# add a progress bar to extract all links
from tqdm import tqdm

links = []
for result in tqdm(results):
    links.append(result.find_element_by_xpath('td/a').get_attribute('href'))
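The generalization step can also be sketched offline with the stdlib xml.etree.ElementTree, which supports a limited XPath subset (attribute equality rather than contains()). The markup below is a hypothetical stand-in for the rendered table; it only illustrates why matching on an attribute captures every row while an index predicate matches one.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the JavaScript-rendered table body.
TABLE = ET.fromstring("""
<tbody>
  <tr role="row"><td><a href="/Details?Book_ID=1">Details</a></td></tr>
  <tr role="row"><td><a href="/Details?Book_ID=2">Details</a></td></tr>
  <tr role="row"><td><a href="/Details?Book_ID=3">Details</a></td></tr>
</tbody>
""")

# tr[1] pins the match to the first row only...
first_only = TABLE.findall("tr[1]")
# ...while matching on the role attribute captures every row.
all_rows = TABLE.findall("tr[@role='row']")

# within each row, td/a locates the link element
links = [row.find("td/a").get("href") for row in all_rows]
print(len(first_only), len(all_rows))  # → 1 3
print(links)
```

The selenium version works the same way: the attribute-based predicate returns all rows, and the relative td/a xpath is then applied inside each one.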
4. Save the Data
Finally, we can save the links to a csv file for later use.
import pandas as pd

df_links = pd.DataFrame({'links': links})
# save data to csv
df_links.to_csv('./links.csv', index=False)
# close the browser
driver.quit()
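If pandas is not available, the stdlib csv module can do the same job. A minimal sketch, with hypothetical links standing in for the scraped list:

```python
import csv
import os
import tempfile

# Hypothetical links, standing in for the list extracted above.
links = [
    "https://apps.polkcountyiowa.gov/PolkCountyInmates/CurrentInmates/Details?Book_ID=299591",
    "https://apps.polkcountyiowa.gov/PolkCountyInmates/CurrentInmates/Details?Book_ID=299592",
]

path = os.path.join(tempfile.gettempdir(), "links.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["links"])                 # header row, matching the pandas version
    writer.writerows([link] for link in links)  # one link per row

# read back to verify
with open(path) as f:
    rows = list(csv.reader(f))
print(rows[0])  # → ['links']
```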
The complete code is stored here.