data = read_weather("data/weather.csv")
File "plot_data-2.py", line 28, in read_weather
usecols=(0,1,2,3,21))
File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 796, in loadtxt
items = [conv(val) for (conv, val) in zip(converters, vals)]
ValueError: could not convert string to float: Rain
The final line tells us that `numpy` can't convert the string "Rain" into a floating point number. This is from the weather events column in our data, which contains text like "Rain" or "Snow-Fog". We could try and write a converter for this column too, but I've chosen to simply have `numpy` bring the column in as a string (which we'll manually parse later).
To do this, we pass in a special object for the `dtype` parameter of `loadtxt`. This object can be constructed by giving a dictionary to the numpy.dtype function. The code below provides names and data types for all of the columns we'll be using:
:::python
def read_weather(file_name):
dtypes = np.dtype({ 'names' : ('timestamp', 'max temp', 'mean temp', 'min temp', 'events'),
'formats' : [np.int, np.float, np.float, np.float, 'S100'] })
data = np.loadtxt(file_name, delimiter=',', skiprows=1,
converters = { 0 : date2int },
usecols=(0,1,2,3,21), dtype=dtypes)
return data
The last column format is given as "S100", which means "string up to 100 characters in length." numpy needs to know the maximum size of the column for efficiency, so I gave myself plenty of room with 100 characters. Running this new code produces the following output:
:::text
$ python plot_data.py
[(734503, 53.0, 43.0, 32.0, 'Rain') (734504, 32.0, 25.0, 18.0, 'Snow')
(734505, 27.0, 20.0, 12.0, '') (734506, 42.0, 34.0, 26.0, '')
(734509, 52.0, 40.0, 28.0, '') (734510, 47.0, 36.0, 24.0, '')
(734511, 51.0, 38.0, 24.0, '') (734512, 57.0, 43.0, 28.0, '')
(734513, 45.0, 43.0, 40.0, 'Rain') (734514, 43.0, 29.0, 15.0, 'Fog-Snow')
(734515, 19.0, 17.0, 15.0, 'Snow') (734516, 27.0, 18.0, 9.0, 'Snow')
...
We're finally in business. Each row in our data set consists of a timestamp (the date converted to an integer), the maximum, mean, and minimum temperature, and the weather events that occurred that day. We're going to start by plotting the mean temperature versus the day of the year. Since we've given names to each of our columns, we can pull them out easily:
:::python
def read_weather(file_name):
dtypes = np.dtype({ 'names' : ('timestamp', 'max temp', 'mean temp', 'min temp', 'events'),
'formats' : [np.int, np.float, np.float, np.float, 'S100'] })
data = np.loadtxt(file_name, delimiter=',', skiprows=1,
converters = { 0 : date2int },
usecols=(0,1,2,3,21), dtype=dtypes)
return data
#--------------------------------------------------
data = read_weather('data/weather.csv')
min_temps = data['min temp']
mean_temps = data['mean temp']
max_temps = data['max temp']
dates = [datetime.fromordinal(d) for d in data['timestamp']]
events = data['events']
for date, temp in zip(dates, mean_temps):
print '{0:%b %d}: {1}'.format(date, temp)
Each column can be extract individually from the `data` array by using `data['column name']`. I've used the datetime.fromordinal function on the `timestamp` column to convert the integers back into datetime objects.
Using the handy built-in zip function, I've printed out pairs of dates and mean temperatures. I use advanced string formatting to print the month, day, and temperature (see the datetime documentation for date formatting information). The program now gives the following output:
:::text
Jan 01: 43.0
Jan 02: 25.0
Jan 03: 20.0
Jan 04: 34.0
...
May 11: 59.0
May 12: 62.0
May 14: 69.0
Everything looks good, so let's get started plotting.
3. Temperature Plot
------------------------
We're going to start with a simple line plot that has the day of the year on the x-axis and the mean temperature for that day on the y-axis. Our plotting function, called `temp\_plot`, will take in dates and times, and give us back a `matplotlib` figure object. Here's the code:
:::python
def temp_plot(dates, mean_temps):
year_start = datetime(2012, 1, 1)
days = [(d - year_start).days + 1 for d in dates]
fig = pyplot.figure()
pyplot.title('Temperatures in Bloomington 2012')
pyplot.ylabel('Mean Temperature (F)')
pyplot.xlabel('Day of Year')
pyplot.plot(days, mean_temps, marker='o')
return fig
We start by computing the day of the year for each date. The `datetime` module lets us subtract dates from each other, producing a timedelta object. We subtract each date from January 1st of 2012, adding 1 so that our count will start from 1 instead of 0. The `days` field on a `timedelta` object gives the total number of days (in this case, from January 1st).
Next, we create a new `matplotlib` figure. In between calls to pyplot.figure, `matplotlib's` plotting functions will draw new plots on top of old ones. We'll use this fact to add a trend line to our plot shortly.
After adding a title and some axis labels to our figure, we call pyplot.plot with our `days` (x values) and `mean\_temps` arrays (y values). I've also passed in 'o' to the optional `marker` parameter so that small circles will be plotted for each data point.
In the main body of the program, we use the `os` module to create a "plots" directory (checking if it exists first). Next, we call our `temp\_plot` function and then use savefig to save the figure out to a png file:
:::python
data = read_weather('data/weather.csv')
min_temps = data['min temp']
mean_temps = data['mean temp']
max_temps = data['max temp']
dates = [datetime.fromordinal(d) for d in data['timestamp']]
events = data['events']
if not os.path.exists('plots'):
os.mkdir('plots')
fig = temp_plot(dates, mean_temps)
fig.savefig('plots/day_vs_temp.png')
Running $ python plot_data.py should create a "plots" folder and put a file inside called "day_vs_temp.png" that looks like this:
![Day vs. Temperature](../assets/img/day_vs_temp-5.png)
Not bad! Let's add a trend line to the plot based on a simple linear model of the data.
3.1 Adding a trend line
-----------------------
By using numpy's polyfit function, adding a trend line is a snap. This function takes our x and y values (`days` and `mean\_temps`), and gives us back a slope and intercept (the final parameter is the degree of the fitted polynomial -- we pass 1 for a linear fit).
slope, intercept = np.polyfit(days, mean\_temps, 1)

Using the slope and intercept, we can plot a trend line by computing "ideal" temperatures for each day according to the old `y = mx + b` formula. With our variables below, this will be `ideal\_temps = (slope * days) + intercept`. Note that I've changed the `days = ...` line to `days = np.array(...)` so that we can do mathematical operations directly on the array.
:::python
def temp_plot(dates, mean_temps):
year_start = datetime(2012, 1, 1)
days = np.array([(d - year_start).days + 1 for d in dates])
fig = pyplot.figure()
pyplot.title('Temperatures in Bloomington 2012')
pyplot.ylabel('Mean Temperature (F)')
pyplot.xlabel('Day of Year')
pyplot.plot(days, mean_temps, marker='o')
slope, intercept = np.polyfit(days, mean_temps, 1)
ideal_temps = intercept + (slope * days)
r_sq = r_squared(mean_temps, ideal_temps)
fit_label = 'Linear fit ({0:.2f})'.format(slope)
pyplot.plot(days, ideal_temps, color='red', linestyle='--', label=fit_label)
pyplot.annotate('r^2 = {0:.2f}'.format(r_sq), (0.05, 0.9), xycoords='axes fraction')
pyplot.legend(loc='lower right')
return fig
To make the plot a little more useful, I've annotated the plot with the R-squared value of the fit. pyplot.annotate lets you put text on the figure in a variety of ways. Here, I've set the `xycoords` parameter to "axes fraction" so that `annotate` interprets my coordinates (0.05, 0.9) as fractions between 0 and 1 relative to the figure axes. The (0.05, 0.9) means to place the text horizontally 5% from the y-axis (left) and 90% from the x-axis (bottom).
The final call to pyplot.legend places a legend on the figure. You *must* include a `label` parameter on at least one plot object for this to work (I've included it on the trend line `plot` call). By default, the legend will show up in the upper-right corner of the figure. This will get in the way on our current plot, so I moved the figure to the lower-right with the `loc` parameter.
With the changes above, here's the new plot:
![Day vs. Temperature](../assets/img/day_vs_temp-6.png)
Notice that the string formatting (`{0:.3f}`) has rounded the R-squared value and slope label for us to three decimal places.
3.2 Adding "error" bars
------------------------
Since we also have the min and max temperatures in our data, let's add "error" bars to our plot to show the temperature range on each day. We'll modify `temp\_plot` to take in two additional parameters (`min\_temps` and `max\_temps`), and plot the temperature range if they both have values (i.e., are not `None`).
Adding error bars requires us to use the pyplot.errorbar function instead of `pyplot.plot`. It takes additional parameters (`xerr` and `yerr`) for the x and y errors. We will use `yerr`, and pass in an array with two rows: one for error above each data point, and one for the error below. This array is easily computed by subtracting the max and min temperatures from the mean, and then stacking the two arrays together row-wise with numpy.vstack.
:::python
def temp_plot(dates, mean_temps, min_temps = None, max_temps = None):
year_start = datetime(2012, 1, 1)
days = np.array([(d - year_start).days + 1 for d in dates])
fig = pyplot.figure()
pyplot.title('Temperatures in Bloomington 2012')
pyplot.ylabel('Mean Temperature (F)')
pyplot.xlabel('Day of Year')
if (max_temps is None or min_temps is None):
# Normal plot without error bars
pyplot.plot(days, mean_temps, marker='o')
else:
# Compute min/max temperature difference from the mean
temp_err = np.row_stack((mean_temps - min_temps,
max_temps - mean_temps))
# Make line plot with error bars to show temperature range
pyplot.errorbar(days, mean_temps, marker='o', yerr=temp_err)
pyplot.title('Temperatures in Bloomington 2012 (max/min)')
slope, intercept = np.polyfit(days, mean_temps, 1)
ideal_temps = intercept + (slope * days)
r_sq = r_squared(mean_temps, ideal_temps)
fit_label = 'Linear fit ({0:.2f})'.format(slope)
pyplot.plot(days, ideal_temps, color='red', linestyle='--', label=fit_label)
pyplot.annotate('r^2 = {0:2f}'.format(r_sq), (0.05, 0.9), xycoords='axes fraction')
pyplot.legend(loc='lower right')
return fig
#--------------------------------------------------
data = read_weather('data/weather.csv')
min_temps = data['min temp']
mean_temps = data['mean temp']
max_temps = data['max temp']
dates = [datetime.fromordinal(d) for d in data['timestamp']]
events = data['events']
if not os.path.exists('plots'):
os.mkdir('plots')
# Plot without error bars
fig = temp_plot(dates, mean_temps)
fig.savefig('plots/day_vs_temp.png')
# Plot with error bars
fig = temp_plot(dates, mean_temps, min_temps, max_temps)
fig.savefig('plots/day_vs_temp-all.png')
The new plot is saved to a file named `day\_vs\_temp-all.png` and looks like this:
![Day vs. Temperature](../assets/img/day_vs_temp-all-7.png)
If you need to compute standard error for your `errorbar` plot, you can use scipy.stats.sem from the scipy module.
For our next plot, we'll do a multi-part histogram of the weather events for each month.
4. Event Histogram
------------------
Histograms in `matplotlib` are generated using the pyplot.hist function. This function takes an array of data, which can itself contain arrays (for a multi-part histogram). We want to count events per month, so we'll need to create an array for each type of event. Inside these arrays will be observations like `[1, 1, 2, 3, 3]` for "January", "January", "February", "March", "March". Here's a diagram to help out:
![Event Diagram](../assets/img/events.png)
When `pyplot.hist` receives our data, it will attempt to "bin" the month observations automatically. By default, it will break observations into 10 bins. We want a bin for each month instead, and we want the bins aligned properly to the month numbers (1 = January, 2 = February, etc.). The `bins` parameter to `pyplot.hist` takes either a number (representing the desired number of bins) or a sequence (representing the desired bin *edges*). In the code below, we pass `range(1, 5 + 2)` to ensure that our bins start at 1 (for January) and go *through* 5 (for May).
:::python
def hist_events(dates, events):
event_months = []
for i in range(num_events):
event_months.append([])
# Build up lists of months where events occurred
for date, event_str in zip(dates, events):
if len(event_str) == 0:
continue # Skip blank events
month = date.month
# Multiple events in a day are separated by '-'
for event in event_str.split('-'):
event_code = event2int(event)
event_months[event_code].append(month)
# Plot histogram
fig = pyplot.figure()
pyplot.title('Weather Events in Bloomington 2012')
pyplot.xlabel('Month')
pyplot.ylabel('Event Count')
bins = np.arange(1, 5 + 2)
pyplot.hist(event_months, bins=bins, label=event_types)
pyplot.legend()
return fig
The main body of the program is updated to call `hist\_events` and save the resulting figure to `plots/event\_histogram.png`.
data = read_weather('data/weather.csv')
min_temps = data['min temp']
mean_temps = data['mean temp']
max_temps = data['max temp']
dates = [datetime.fromordinal(d) for d in data['timestamp']]
events = data['events']
if not os.path.exists('plots'):
os.mkdir('plots')
fig = temp_plot(dates, mean_temps)
fig.savefig('plots/day_vs_temp.png')
fig = temp_plot(dates, mean_temps, min_temps, max_temps)
fig.savefig('plots/day_vs_temp-all.png')
fig = hist_events(dates, events)
fig.savefig(os.path.join('plots', 'event_histogram.png'))
When we run $ python plot_data.py, the new plot looks like this:
![Event Histogram](../assets/img/event_histogram-8.png)
Each collection of bars represents a month, and the individual bars represent the number of Rain, Thunderstorm, etc. events observed for that month. The figure's legend was populated by passing `event\_types` in for the `label` parameter of `pyplot.hist`.
The plot looks good, but it would be nice to properly label the months. We could do this manually with pyplot.xticks as follows:
:::python
pyplot.xticks( (1.5, 2.5, 3.5, 4.5, 5.5), ('January', 'February', 'March', 'April', 'May') )
This will label each bin in the center (hence the .5 added to each number) with the proper month name. If our data grows to include more months, however, we'll have to manually extend the number of bins and our labels. Let's change `hist\_events` to keep track of the range of months in the data. Additionally, we'll use Python's calendar module to automatically get the month names.
At the top of the program, we'll import the `calendar` module:
:::python
import calendar
and then redefine hist_events as follows:
def hist_events(dates, events):
event_months = []
for i in range(num_events):
event_months.append([])
# Build up lists of months where events occurred
min_month = 13
max_month = 0
for date, event_str in zip(dates, events):
if len(event_str) == 0:
continue # Skip blank events
month = date.month
min_month = min(month, min_month)
max_month = max(month, max_month)
# Multiple events in a day are separated by '-'
for event in event_str.split('-'):
event_code = event2int(event)
event_months[event_code].append(month)
# Plot histogram
fig = pyplot.figure()
pyplot.title('Weather Events in Bloomington 2012')
pyplot.xlabel('Month')
pyplot.ylabel('Event Count')
pyplot.axes().yaxis.grid()
num_months = max_month - min_month + 1;
bins = np.arange(1, num_months + 2) # Bin edges
pyplot.hist(event_months, bins=bins, label=event_types)
# Align month labels to bin centers
month_names = calendar.month_name[min_month:max_month+1]
pyplot.xticks(bins + 0.5, month_names)
pyplot.legend()
return fig
During the process of building our observation arrays, we now track the minimum and maximum months observed. This allows us to automatically create our bin edges, and let's us grab months names from the `calendar` module by indexing into the `calendar.month\_name` list.
Note that the `bins` variables was created using numpy.arange, which is a shortcut for `bins = numpy.array(range(1, num\_months + 2))`. Making `bins` a `numpy` array lets us call `pyplot.xticks` with `bins + 0.5`, centering `month\_names` on each bin.
As a bonus, I've also added a horizontal grid using the axis.grid function. You can add both a horizontal and vertical grid at the same time by calling pyplot.grid.
Here's the updated plot:
![Event Histogram](../assets/img/event_histogram-9.png)
Looks ready for publication!
[[Code and Data](../assets/weather.zip)]