Chapter 5 - Web scraping with Pandas

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = (15, 3)
plt.rcParams['font.family'] = 'sans-serif'

Summary

By the end of this chapter, we’re going to have downloaded all of Canada’s weather data for 2012, and saved it to a CSV.

We’ll do this by downloading it one month at a time, and then combining all the months together.

Here’s the temperature every hour for 2012! Download weather_2012.csv and try yourself.

weather_2012_final = pd.read_csv('weather_2012.csv', index_col='Date/Time')
weather_2012_final['Temp (C)'].plot(figsize=(15, 6))

Output:

Plot temperature data for 2012

5.1 Downloading one month of weather data

When playing with the cycling data, I wanted temperature and precipitation data to find out if people like biking when it’s raining. So I went to the site for Canadian historical weather data, and figured out how to get it automatically.

Here we’re going to get the data for March 2012, and clean it up

Here’s an URL template you can use to get data in Montreal.

url_template = "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=5415&Year={year}&Month={month}&timeframe=1&submit=Download+Data"

To get the data for March 2013, we need to format it with month=3, year=2012.

url = url_template.format(month=3, year=2012)
weather_mar2012 = pd.read_csv(url, index_col='Date/Time', parse_dates=True)

This is super great! We can just use the same read_csv function as before, and just give it a URL as a filename. Awesome.

There are 16 rows of metadata at the top of this CSV, but pandas knows CSVs are weird, so there’s a skiprows options. We parse the dates again, and set ‘Date/Time’ to be the index column. Here’s the resulting dataframe.

weather_mar2012

Output:

YearMonthDayTimeData QualityTemp (°C)Temp FlagDew Point Temp (°C)Dew Point Temp FlagRel Hum (%)...Wind Spd FlagVisibility (km)Visibility FlagStn Press (kPa)Stn Press FlagHmdxHmdx FlagWind ChillWind Chill FlagWeather
Date/Time
2012-03-01 00:00:0020123100:00-5.5NaN-9.7NaN72...NaN4.0NaN100.97NaNNaNNaN-13NaNSnow
2012-03-01 01:00:0020123101:00-5.7NaN-8.7NaN79...NaN2.4NaN100.87NaNNaNNaN-13NaNSnow
2012-03-01 02:00:0020123102:00-5.4NaN-8.3NaN80...NaN4.8NaN100.80NaNNaNNaN-13NaNSnow
2012-03-01 03:00:0020123103:00-4.7NaN-7.7NaN79...NaN4.0NaN100.69NaNNaNNaN-12NaNSnow
2012-03-01 04:00:0020123104:00-5.4NaN-7.8NaN83...NaN1.6NaN100.62NaNNaNNaN-14NaNSnow
2012-03-01 05:00:0020123105:00-5.3NaN-7.9NaN82...NaN2.4NaN100.58NaNNaNNaN-14NaNSnow
2012-03-01 06:00:0020123106:00-5.2NaN-7.8NaN82...NaN4.0NaN100.57NaNNaNNaN-14NaNSnow
2012-03-01 07:00:0020123107:00-4.9NaN-7.4NaN83...NaN1.6NaN100.59NaNNaNNaN-13NaNSnow
2012-03-01 08:00:0020123108:00-5.0NaN-7.5NaN83...NaN1.2NaN100.59NaNNaNNaN-13NaNSnow
2012-03-01 09:00:0020123109:00-4.9NaN-7.5NaN82...NaN1.6NaN100.60NaNNaNNaN-13NaNSnow
2012-03-01 10:00:0020123110:00-4.7NaN-7.3NaN82...NaN1.2NaN100.62NaNNaNNaN-13NaNSnow
2012-03-01 11:00:0020123111:00-4.4NaN-6.8NaN83...NaN1.0NaN100.66NaNNaNNaN-12NaNSnow
2012-03-01 12:00:0020123112:00-4.3NaN-6.8NaN83...NaN1.2NaN100.66NaNNaNNaN-12NaNSnow
2012-03-01 13:00:0020123113:00-4.3NaN-6.9NaN82...NaN1.2NaN100.65NaNNaNNaN-12NaNSnow
2012-03-01 14:00:0020123114:00-3.9NaN-6.6NaN81...NaN1.2NaN100.67NaNNaNNaN-11NaNSnow
2012-03-01 15:00:0020123115:00-3.3NaN-6.2NaN80...NaN1.6NaN100.71NaNNaNNaN-10NaNSnow
2012-03-01 16:00:0020123116:00-2.7NaN-5.7NaN80...NaN2.4NaN100.74NaNNaNNaN-8NaNSnow
2012-03-01 17:00:0020123117:00-2.9NaN-5.9NaN80...NaN4.0NaN100.80NaNNaNNaN-9NaNSnow
2012-03-01 18:00:0020123118:00-3.0NaN-6.0NaN80...NaN4.0NaN100.87NaNNaNNaN-9NaNSnow
2012-03-01 19:00:0020123119:00-3.6NaN-6.4NaN81...NaN3.2NaN100.93NaNNaNNaN-9NaNSnow
2012-03-01 20:00:0020123120:00-3.7NaN-6.4NaN81...NaN4.8NaN100.95NaNNaNNaN-10NaNSnow
2012-03-01 21:00:0020123121:00-3.9NaN-6.7NaN81...NaN6.4NaN100.98NaNNaNNaN-10NaNSnow
2012-03-01 22:00:0020123122:00-4.3NaN-6.9NaN82...NaN2.4NaN101.00NaNNaNNaN-11NaNSnow
2012-03-01 23:00:0020123123:00-4.3NaN-7.1NaN81...NaN4.8NaN101.04NaNNaNNaN-11NaNSnow
2012-03-02 00:00:0020123200:00-4.8NaN-7.3NaN83...NaN3.2NaN101.04NaNNaNNaN-12NaNSnow
2012-03-02 01:00:0020123201:00-5.3NaN-7.9NaN82...NaN4.8NaN101.09NaNNaNNaN-12NaNSnow
2012-03-02 02:00:0020123202:00-5.2NaN-7.8NaN82...NaN6.4NaN101.11NaNNaNNaN-12NaNSnow
2012-03-02 03:00:0020123203:00-5.5NaN-7.9NaN83...NaN4.8NaN101.15NaNNaNNaN-12NaNSnow
2012-03-02 04:00:0020123204:00-5.6NaN-8.2NaN82...NaN6.4NaN101.15NaNNaNNaN-13NaNSnow
2012-03-02 05:00:0020123205:00-5.5NaN-8.3NaN81...NaN12.9NaN101.15NaNNaNNaN-12NaNSnow
..................................................................
2012-03-30 18:00:00201233018:003.9NaN-7.9NaN42...NaN24.1NaN101.26NaNNaNNaNNaNNaNMostly Cloudy
2012-03-30 19:00:00201233019:003.1NaN-6.7NaN49...NaN25.0NaN101.29NaNNaNNaNNaNNaNMostly Cloudy
2012-03-30 20:00:00201233020:003.0NaN-8.4NaN43...NaN25.0NaN101.30NaNNaNNaNNaNNaNMostly Cloudy
2012-03-30 21:00:00201233021:001.7NaN-9.0NaN45...NaN25.0NaN101.32NaNNaNNaNNaNNaNCloudy
2012-03-30 22:00:00201233022:000.4NaN-8.1NaN53...NaN25.0NaN101.30NaNNaNNaNNaNNaNMostly Cloudy
2012-03-30 23:00:00201233023:001.4NaN-7.7NaN51...NaN25.0NaN101.34NaNNaNNaNNaNNaNMainly Clear
2012-03-31 00:00:00201233100:001.5NaN-8.6NaN47...NaN25.0NaN101.33NaNNaNNaNNaNNaNMostly Cloudy
2012-03-31 01:00:00201233101:001.3NaN-9.6NaN44...NaN25.0NaN101.31NaNNaNNaNNaNNaNMostly Cloudy
2012-03-31 02:00:00201233102:001.3NaN-9.7NaN44...NaN25.0NaN101.29NaNNaNNaNNaNNaNCloudy
2012-03-31 03:00:00201233103:000.7NaN-8.8NaN49...NaN25.0NaN101.30NaNNaNNaNNaNNaNCloudy
2012-03-31 04:00:00201233104:00-0.9NaN-8.5NaN56...NaN25.0NaN101.32NaNNaNNaN-5NaNCloudy
2012-03-31 05:00:00201233105:00-0.6NaN-9.2NaN52...NaN25.0NaN101.30NaNNaNNaN-5NaNCloudy
2012-03-31 06:00:00201233106:00-0.5NaN-9.2NaN52...NaN48.3NaN101.32NaNNaNNaN-5NaNCloudy
2012-03-31 07:00:00201233107:00-0.3NaN-9.2NaN51...NaN48.3NaN101.32NaNNaNNaN-5NaNCloudy
2012-03-31 08:00:00201233108:000.7NaN-8.5NaN50...NaN48.3NaN101.33NaNNaNNaNNaNNaNCloudy
2012-03-31 09:00:00201233109:001.5NaN-7.8NaN50...NaN48.3NaN101.34NaNNaNNaNNaNNaNMostly Cloudy
2012-03-31 10:00:00201233110:002.9NaN-8.1NaN44...NaN48.3NaN101.30NaNNaNNaNNaNNaNMainly Clear
2012-03-31 11:00:00201233111:004.6NaN-9.7NaN35...NaN48.3NaN101.24NaNNaNNaNNaNNaNClear
2012-03-31 12:00:00201233112:006.4NaN-7.1NaN37...NaN48.3NaN101.16NaNNaNNaNNaNNaNClear
2012-03-31 13:00:00201233113:006.5NaN-9.7NaN30...NaN48.3NaN101.08NaNNaNNaNNaNNaNClear
2012-03-31 14:00:00201233114:007.7NaN-8.5NaN31...NaN48.3NaN101.01NaNNaNNaNNaNNaNMainly Clear
2012-03-31 15:00:00201233115:007.7NaN-8.6NaN30...NaN48.3NaN100.94NaNNaNNaNNaNNaNMainly Clear
2012-03-31 16:00:00201233116:008.4NaN-7.7NaN31...NaN48.3NaN100.89NaNNaNNaNNaNNaNMainly Clear
2012-03-31 17:00:00201233117:007.9NaN-8.1NaN31...NaN48.3NaN100.88NaNNaNNaNNaNNaNMainly Clear
2012-03-31 18:00:00201233118:007.0NaN-8.2NaN33...NaN48.3NaN100.87NaNNaNNaNNaNNaNMainly Clear
2012-03-31 19:00:00201233119:005.9NaN-8.0NaN36...NaN25.0NaN100.88NaNNaNNaNNaNNaNClear
2012-03-31 20:00:00201233120:004.4NaN-7.2NaN43...NaN25.0NaN100.85NaNNaNNaNNaNNaNClear
2012-03-31 21:00:00201233121:002.6NaN-6.3NaN52...NaN25.0NaN100.86NaNNaNNaNNaNNaNClear
2012-03-31 22:00:00201233122:002.7NaN-6.7NaN50...NaN25.0NaN100.82NaNNaNNaNNaNNaNClear
2012-03-31 23:00:00201233123:001.5NaN-6.9NaN54...NaN25.0NaN100.79NaNNaNNaNNaNNaNClear

744 rows × 24 columns

Let’s plot it!

weather_mar2012[u"Temp (\xc2\xb0C)"].plot(figsize=(15, 5))

Output:

Plot which borough has the most noise complaints

Notice how it goes up to 25° C in the middle there? That was a big deal. It was March, and people were wearing shorts outside.

And I was out of town and I missed it. Still sad, humans.

I had to write ‘\xb0’ for that degree character °. Let’s fix up the columns. We’re going to just print them out, copy, and fix them up by hand.

weather_mar2012.columns = [
    u'Year', u'Month', u'Day', u'Time', u'Data Quality', u'Temp (C)', 
    u'Temp Flag', u'Dew Point Temp (C)', u'Dew Point Temp Flag', 
    u'Rel Hum (%)', u'Rel Hum Flag', u'Wind Dir (10s deg)', u'Wind Dir Flag', 
    u'Wind Spd (km/h)', u'Wind Spd Flag', u'Visibility (km)', u'Visibility Flag',
    u'Stn Press (kPa)', u'Stn Press Flag', u'Hmdx', u'Hmdx Flag', u'Wind Chill', 
    u'Wind Chill Flag', u'Weather']

You’ll notice in the summary above that there are a few columns which are are either entirely empty or only have a few values in them. Let’s get rid of all of those with dropna.

The argument axis=1 to dropna means “drop columns”, not rows”, and how='any' means “drop the column if any value is null”.

This is much better now – we only have columns with real data.

weather_mar2012 = weather_mar2012.dropna(axis=1, how='any')
weather_mar2012[:5]

Output:

YearMonthDayTimeData QualityTemp (C)Dew Point Temp (C)Rel Hum (%)Wind Spd (km/h)Visibility (km)Stn Press (kPa)Weather
Date/Time
2012-03-01 00:00:0020123100:00-5.5-9.772244.0100.97Snow
2012-03-01 01:00:0020123101:00-5.7-8.779262.4100.87Snow
2012-03-01 02:00:0020123102:00-5.4-8.380284.8100.80Snow
2012-03-01 03:00:0020123103:00-4.7-7.779284.0100.69Snow
2012-03-01 04:00:0020123104:00-5.4-7.883351.6100.62Snow

The Year/Month/Day/Time columns are redundant, though, and the Data Quality column doesn’t look too useful. Let’s get rid of those.

The axis=1 argument means “Drop columns”, like before. The default for operations like dropna and drop is always to operate on rows.

weather_mar2012 = weather_mar2012.drop(['Year', 'Month', 'Day', 'Time', 'Data Quality'], axis=1)
weather_mar2012[:5]

Output:

Temp (C)Dew Point Temp (C)Rel Hum (%)Wind Spd (km/h)Visibility (km)Stn Press (kPa)Weather
Date/Time
2012-03-01 00:00:00-5.5-9.772244.0100.97Snow
2012-03-01 01:00:00-5.7-8.779262.4100.87Snow
2012-03-01 02:00:00-5.4-8.380284.8100.80Snow
2012-03-01 03:00:00-4.7-7.779284.0100.69Snow
2012-03-01 04:00:00-5.4-7.883351.6100.62Snow

Awesome! We now only have the relevant columns, and it’s much more manageable.

5.2 Plotting the temperature by hour of day

This one’s just for fun – we’ve already done this before, using groupby and aggregate! We will learn whether or not it gets colder at night. Well, obviously. But let’s do it anyway.

temperatures = weather_mar2012[[u'Temp (C)']].copy()
print(temperatures.head)
temperatures.loc[:,'Hour'] = weather_mar2012.index.hour
temperatures.groupby('Hour').aggregate(np.median).plot()

Output:

Date/Time                    
2012-03-01 00:00:00      -5.5
2012-03-01 01:00:00      -5.7
2012-03-01 02:00:00      -5.4
2012-03-01 03:00:00      -4.7
2012-03-01 04:00:00      -5.4
2012-03-01 05:00:00      -5.3
2012-03-01 06:00:00      -5.2
2012-03-01 07:00:00      -4.9
2012-03-01 08:00:00      -5.0
2012-03-01 09:00:00      -4.9
2012-03-01 10:00:00      -4.7
2012-03-01 11:00:00      -4.4
2012-03-01 12:00:00      -4.3
2012-03-01 13:00:00      -4.3
2012-03-01 14:00:00      -3.9
2012-03-01 15:00:00      -3.3
2012-03-01 16:00:00      -2.7
2012-03-01 17:00:00      -2.9
2012-03-01 18:00:00      -3.0
2012-03-01 19:00:00      -3.6
2012-03-01 20:00:00      -3.7
2012-03-01 21:00:00      -3.9
2012-03-01 22:00:00      -4.3
2012-03-01 23:00:00      -4.3
2012-03-02 00:00:00      -4.8
2012-03-02 01:00:00      -5.3
2012-03-02 02:00:00      -5.2
2012-03-02 03:00:00      -5.5
2012-03-02 04:00:00      -5.6
2012-03-02 05:00:00      -5.5
...                       ...
2012-03-30 18:00:00       3.9
2012-03-30 19:00:00       3.1
2012-03-30 20:00:00       3.0
2012-03-30 21:00:00       1.7
2012-03-30 22:00:00       0.4
2012-03-30 23:00:00       1.4
2012-03-31 00:00:00       1.5
2012-03-31 01:00:00       1.3
2012-03-31 02:00:00       1.3
2012-03-31 03:00:00       0.7
2012-03-31 04:00:00      -0.9
2012-03-31 05:00:00      -0.6
2012-03-31 06:00:00      -0.5
2012-03-31 07:00:00      -0.3
2012-03-31 08:00:00       0.7
2012-03-31 09:00:00       1.5
2012-03-31 10:00:00       2.9
2012-03-31 11:00:00       4.6
2012-03-31 12:00:00       6.4
2012-03-31 13:00:00       6.5
2012-03-31 14:00:00       7.7
2012-03-31 15:00:00       7.7
2012-03-31 16:00:00       8.4
2012-03-31 17:00:00       7.9
2012-03-31 18:00:00       7.0
2012-03-31 19:00:00       5.9
2012-03-31 20:00:00       4.4
2012-03-31 21:00:00       2.6
2012-03-31 22:00:00       2.7
2012-03-31 23:00:00       1.5

[744 rows x 1 columns]>
Plot dataframe as per weekdays

So it looks like the time with the highest median temperature is 2pm. Neat.

5.3 Getting the whole year of data

Okay, so what if we want the data for the whole year? Ideally the API would just let us download that, but I couldn’t figure out a way to do that.

First, let’s put our work from above into a function that gets the weather for a given month.

I noticed that there’s an irritating bug where when I ask for January, it gives me data for the previous year, so we’ll fix that too. [no, really. You can check =)]

def download_weather_month(year, month):
    if month == 1:
        year += 1
    url = url_template.format(year=year, month=month)
    weather_data = pd.read_csv(url, skiprows=15, index_col='Date/Time', parse_dates=True, header=True)
    weather_data = weather_data.dropna(axis=1)
    weather_data.columns = [col.replace('\xb0', '') for col in weather_data.columns]
    weather_data = weather_data.drop(['Year', 'Day', 'Month', 'Time', 'Data Quality'], axis=1)
    return weather_data

We can test that this function does the right thing:

download_weather_month(2012, 1)[:5]

Output:

Temp (C)Dew Point Temp (C)Rel Hum (%)Wind Spd (km/h)Visibility (km)Stn Press (kPa)Weather
Date/Time
2013-01-01 00:00:00-1.0-1.795356.499.89Snow
2013-01-01 01:00:00-2.0-5.1793516.199.93Mainly Clear
2013-01-01 02:00:00-2.7-6.0782819.3100.08Snow
2013-01-01 03:00:00-5.6-11.7623025.0100.21Clear
2013-01-01 04:00:00-7.7-12.6683519.3100.32Mainly Clear

Now we can get all the months at once. This will take a little while to run.

data_by_month = [download_weather_month(2012, i) for i in range(1, 13)]

Once we have this, it’s easy to concatenate all the dataframes together into one big dataframe using pd.concat. And now we have the whole year’s data!

weather_2012 = pd.concat(data_by_month)
weather_2012

Output:

Temp (C)Dew Point Temp (C)Rel Hum (%)Wind Spd (km/h)Visibility (km)Stn Press (kPa)Weather
Date/Time
2013-01-01 00:00:00-1.0-1.795356.499.89Snow
2013-01-01 01:00:00-2.0-5.1793516.199.93Mainly Clear
2013-01-01 02:00:00-2.7-6.0782819.3100.08Snow
2013-01-01 03:00:00-5.6-11.7623025.0100.21Clear
2013-01-01 04:00:00-7.7-12.6683519.3100.32Mainly Clear
2013-01-01 05:00:00-9.7-14.8663325.0100.47Clear
2013-01-01 06:00:00-11.1-17.0623025.0100.65Clear
2013-01-01 07:00:00-12.2-17.2662025.0100.78Clear
2013-01-01 08:00:00-13.0-17.7681324.1100.87Clear
2013-01-01 09:00:00-13.0-17.3702024.1100.86Clear
2013-01-01 10:00:00-12.6-17.8651924.1100.90Clear
2013-01-01 11:00:00-12.2-17.6642224.1100.88Mainly Clear
2013-01-01 12:00:00-11.8-17.2642624.1100.87Mainly Clear
2013-01-01 13:00:00-11.3-17.4612624.1100.83Mainly Clear
2013-01-01 14:00:00-11.3-17.4612824.1100.82Mainly Clear
2013-01-01 15:00:00-11.4-17.6603024.1100.85Mainly Clear
2013-01-01 16:00:00-12.0-18.0612224.1100.81Mainly Clear
2013-01-01 17:00:00-13.0-18.4641925.0100.90Clear
2013-01-01 18:00:00-13.4-18.4662425.0100.96Clear
2013-01-01 19:00:00-14.1-18.7682025.0101.02Clear
2013-01-01 20:00:00-14.3-19.0671525.0101.04Clear
2013-01-01 21:00:00-14.8-19.5671525.0100.98Mainly Clear
2013-01-01 22:00:00-16.3-20.272725.0100.98Mostly Cloudy
2013-01-01 23:00:00-15.4-19.8691125.0100.99Cloudy
2013-01-02 00:00:00-14.0-18.4691119.3100.96Snow
2013-01-02 01:00:00-14.1-18.3701125.0100.91Mostly Cloudy
2013-01-02 02:00:00-14.3-18.3721325.0100.94Snow Showers
2013-01-02 03:00:00-14.7-18.076919.3100.91Snow
2013-01-02 04:00:00-14.2-17.17969.7100.83Snow
2013-01-02 05:00:00-14.3-17.08006.4100.81Snow
........................
2012-12-30 18:00:00-12.6-16.0762425.0101.36Mainly Clear
2012-12-30 19:00:00-13.4-16.5772625.0101.47Mainly Clear
2012-12-30 20:00:00-13.8-16.5802425.0101.52Clear
2012-12-30 21:00:00-13.8-16.5802025.0101.50Mainly Clear
2012-12-30 22:00:00-13.7-16.3811925.0101.54Mainly Clear
2012-12-30 23:00:00-12.1-15.1782825.0101.52Mostly Cloudy
2012-12-31 00:00:00-11.1-14.4772625.0101.51Cloudy
2012-12-31 01:00:00-10.7-14.0771525.0101.50Cloudy
2012-12-31 02:00:00-10.1-13.477925.0101.45Cloudy
2012-12-31 03:00:00-11.8-14.481625.0101.42Mostly Cloudy
2012-12-31 04:00:00-10.5-12.8831125.0101.34Cloudy
2012-12-31 05:00:00-10.2-12.484625.0101.28Cloudy
2012-12-31 06:00:00-9.7-11.785425.0101.23Cloudy
2012-12-31 07:00:00-9.3-11.385019.3101.19Snow Showers
2012-12-31 08:00:00-8.6-10.38743.2101.14Snow Showers
2012-12-31 09:00:00-8.1-9.68942.4101.09Snow
2012-12-31 10:00:00-7.4-8.98946.4101.05Snow,Fog
2012-12-31 11:00:00-6.7-7.99199.7100.93Snow
2012-12-31 12:00:00-5.8-7.588412.9100.78Snow
2012-12-31 13:00:00-4.6-6.686412.9100.63Snow
2012-12-31 14:00:00-3.4-5.784611.3100.57Snow
2012-12-31 15:00:00-2.3-4.68499.7100.47Snow
2012-12-31 16:00:00-1.4-4.0821312.9100.40Snow
2012-12-31 17:00:00-1.1-3.385199.7100.30Snow
2012-12-31 18:00:00-1.3-3.188179.7100.19Snow
2012-12-31 19:00:000.1-2.781309.7100.13Snow
2012-12-31 20:00:000.2-2.483249.7100.03Snow
2012-12-31 21:00:00-0.5-1.593284.899.95Snow
2012-12-31 22:00:00-0.2-1.889289.799.91Snow
2012-12-31 23:00:000.0-2.1863011.399.89Snow

8784 rows × 7 columns

5.4 Saving to a CSV

It’s slow and unnecessary to download the data every time, so let’s save our dataframe for later use!

weather_2012.to_csv('weather_2012.csv')

And we’re done!

Help us improve this content by editing this page on GitHub