Web Scraping to Automate Data Collection
All businesses record and maintain data, but internal records are not the only source for analytical projects. Most data is stored for operational purposes such as accounting or sales. While there is plenty of information in these sources, I encourage you to also consider external data sources. In this discussion, we will explore an automated method of data collection from an external website using simple web scraping techniques in Python.
Suppose we are interested in identifying zip codes that surround certain areas. For example, we might want to market to zip codes that are near our various store locations. In this scenario we could use proprietary software, such as ArcGIS, to determine this information, but that is often expensive and unnecessary. Another option is to identify an alternative data source, often on the internet, and develop a web scraping script to query and collect the information.
A quick search reveals many options for identifying zip codes. One of these is Free Map Tools, a site developed for users to measure, save, and send maps to others. Its focus is on hobbyists such as runners, cyclists, and sailors, but we can make use of its zip code radius lookup feature. We simply supply an origin zip code and a radius in miles, and it returns a list of zip codes that surround the origin. Feel free to try it out; it will help you understand what information we need to supply the site in order to answer the business question.
There are three important considerations when using any external data source:
- Are we collecting the data for a unique project, or will it be used to support continuing operations?
- How important is the data to the project?
- Does this violate any terms and conditions of the website?
Data used for continuing operations is less suitable for web scraping because you introduce risk by trusting the numbers from an external source. You can mitigate some of this risk by evaluating the source's reliability. In this case, the About Us page reveals that Free Map Tools pulls from the Google API, which is a trusted underlying source of information. Further, you can sometimes determine the last time the page was updated or modified by searching the HTML source code. A page updated regularly may be more reliable, but be wary of dynamic pages, which often list the most recent update as the last time a query or form was submitted.
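When a server does expose a Last-Modified HTTP header, Python's standard library can parse it into a datetime for comparison. Here is a minimal sketch; the header value is illustrative, not taken from Free Map Tools:

```python
from email.utils import parsedate_to_datetime

# An HTTP Last-Modified header value (illustrative example only)
last_modified = "Wed, 01 May 2019 14:30:00 GMT"
updated = parsedate_to_datetime(last_modified)  # timezone-aware datetime
print(updated.year)  # 2019
```

Comparing this timestamp against your project timeline is a quick sanity check on how stale a source might be.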
Also, if the data being collected is absolutely essential to project success, you may be better off using a more traditional data collection method. All in all, you must understand the risks involved with web scraping and weigh them against the benefits. Lastly, do not scrape a website if doing so is against the website's policy.
Let’s go ahead and dive into how web scraping will allow us to submit many zip codes to the web page and pull the zip code information of interest. The Python program can be found at our GitHub page.
Web Scraping Tutorial
First, we will import the Time, Pandas, and Selenium Python packages. Time will allow for delayed data submission so that we do not overload the servers, which can lead to IP bans. Pandas is a standard data manipulation and munging package. Selenium is a website navigation driver. We will have Selenium enter our information into the web page forms and pull the data from the site.
import time
import pandas as pd
from selenium import webdriver
It is a good practice to save the webpage as a variable. This way, if they make future updates to the URL we can simply change this variable, rather than finding all the unique occurrences throughout our script.
webpage = 'https://www.freemaptools.com/find-zip-codes-inside-radius.htm'
Let’s read in a CSV file that contains all the origin zip codes in a single column.
zips = pd.read_csv('FILEPATH_to_csv', sep=',')
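One detail worth noting: zip codes that begin with zero lose their leading digit if pandas infers an integer column. Passing dtype=str to read_csv avoids this; a minimal sketch with illustrative inline CSV data:

```python
import io
import pandas as pd

# Reading ZIP codes as strings preserves leading zeros that would be
# dropped if pandas inferred an integer column (the CSV here is illustrative).
csv_data = io.StringIO("ZIP_FIELD_NAME\n00544\n90210\n")
zips = pd.read_csv(csv_data, dtype=str)
print(zips["ZIP_FIELD_NAME"].tolist())  # ['00544', '90210']
```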
We now establish a webdriver instance and specify whether to use Chrome, Firefox or Internet Explorer. We can also pass the webpage to the driver.
driver = webdriver.Chrome('FILEPATH to chromedriver.exe')
driver.get(webpage)
We use the find_element_by_name method to locate the various elements of the webpage, such as form fields and submit buttons. The trick to finding these elements is to search the page's source code for ID tags or names. The following picture demonstrates the discovery of the ID and name tags for the radius input box.
radius = driver.find_element_by_name("tb_radius_miles")
We can call various methods on the element we defined. In the next two lines, we clear the previous selection and enter our radius in miles rather than kilometers (the radius value is a placeholder for your own):
radius.clear() # clear the default radius value
radius.send_keys('RADIUS_IN_MILES')
The other elements we will need to supply are the origin zip, submit button, and the box containing zip codes returned from the site.
origin_zip = driver.find_element_by_name("goto")
draw = driver.find_element_by_name("Go")
We will now write a function that we will apply to each row in the zips dataframe using Pandas' apply(). The details of the function are commented in the code.
def get_surrounding_zips(row):
    origin_zip.clear() # clear the previous zip
    new_zip = str(row['ZIP_FIELD_NAME']) # pull the new zip from the zips dataframe
    if len(new_zip) == 4: # data cleaning because 4 digit zip codes lost a leading zero and must be padded to 5 digits
        new_zip = '0' + new_zip
    origin_zip.send_keys(new_zip) # send the padded zip code
    draw.click() # submit the form to tell the website to draw the map and find the surrounding zip codes
    time.sleep(1) # set a delay to process and not overload the website servers
    result_box = driver.find_element_by_id("tb_output") # element containing the resulting zips
    return result_box.get_attribute("value") # return the value of the resulting zips
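As an aside, the zero-padding step above can also be written with Python's built-in str.zfill, which left-pads a string with zeros to a fixed width:

```python
# str.zfill left-pads with zeros to the given width, which handles
# ZIP codes that lost leading zeros when read as integers.
for z in [544, 1040, 90210]:
    print(str(z).zfill(5))  # prints 00544, 01040, 90210
```

This handles three-digit codes as well, which the length-four check above would miss.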
zips['NEW_ZIPS_FIELD_NAME'] = zips.apply(get_surrounding_zips, axis=1)
The zips dataframe now contains all zips surrounding the uploaded origin zip codes. We can now use this information in further Python programming, or we can export it.
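Exporting is a one-liner with pandas' to_csv. A small sketch using the same placeholder column names as the script above (the values here are made up for illustration):

```python
import pandas as pd

# Hypothetical result frame: origin ZIPs plus the surrounding ZIPs string
# returned by the site (column names are placeholders, as in the script above).
zips = pd.DataFrame({
    "ZIP_FIELD_NAME": ["00544", "01040"],
    "NEW_ZIPS_FIELD_NAME": ["00501, 11742", "01001, 01013"],
})
zips.to_csv("surrounding_zips.csv", index=False)  # index=False omits the row index column
```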
Web scraping is a powerful method for gathering external data. Always remember to check data reliability before a web scrape, and to consider other data collection approaches if the data is essential to project success or supports ongoing company operations.
If you enjoyed this tutorial, be sure to subscribe to actionable-business-analytics!