Notes on Wes McKinney's Python for Data Analysis

Chapters:


1. Preliminaries


NumPy


  • A fast and efficient multidimensional array object ndarray
  • Functions for performing element-wise computations with arrays or mathematical operations between arrays
  • Tools for reading and writing array-based data sets to disk
  • Linear algebra, Fourier transform, and random number generation
  • Tools for connecting C, C++, and Fortran code to Python
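
A minimal sketch of the first two points above, element-wise computation on an ndarray (the values are just illustrative):

import numpy as np

# A 2 x 3 array; arithmetic applies element-wise with no explicit loops
data = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(data * 10)      # multiply every element by 10
print(data + data)    # element-wise addition of two arrays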

pandas


pandas provides rich data structures and functions designed to make working with structured data fast, easy, and expressive. It is, as you will see, one of the critical ingredients enabling Python to be a powerful and productive data analysis environment. The primary object in pandas that will be used in this book is the DataFrame, a two-dimensional tabular, column-oriented data structure with both row and column labels.

pandas combines the high performance array-computing features of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases (such as SQL). It provides sophisticated indexing functionality to make it easy to reshape, slice and dice, perform aggregations, and select subsets of data. pandas is the primary tool that we will use in this book.

The name pandas is derived from panel data, an econometrics term for multidimensional structured data sets, as well as from Python data analysis itself.

matplotlib


matplotlib is the most popular Python library for producing plots and other 2D data visualizations. It was originally created by John D. Hunter (JDH) and is now maintained by a large team of developers. It is well suited for creating publication-quality plots. It integrates well with IPython (see below), providing a comfortable interactive environment for plotting and exploring data. The plots are also interactive; you can zoom in on a section of the plot and pan around using the toolbar in the plot window.
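
A minimal sketch of producing a simple plot (not from the book; the random-walk data here is just illustrative):

import numpy as np
import matplotlib.pyplot as plt

# Plot the cumulative sum of random draws as a line chart
plt.plot(np.random.randn(100).cumsum())
plt.title('A simple matplotlib line plot')
plt.show()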

IPython


Aside from the standard terminal-based IPython shell, the project also provides:

  • A Mathematica-like HTML notebook for connecting to IPython through a web browser (more on this later)
  • A rich Qt GUI framework-based console with inline plotting
  • An infrastructure for interactive parallel and distributed computing

I will devote a chapter to IPython and how to get the most out of its features. I strongly recommend using it while working through this book.

Chapter 2. Whetting your appetite


usa.gov data from bit.ly


In 2011, URL shortening service bit.ly partnered with the United States government website usa.gov to provide a feed of anonymous data gathered from users who shorten links ending with .gov or .mil. As of this writing, in addition to providing a live feed, hourly snapshots are available as downloadable text files (see http://www.usa.gov/About/developer-resources/1usagov.shtml).

In the case of the hourly snapshots, each line in the files contains a common form of web data known as JSON, which stands for JavaScript Object Notation.
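
The notes don't show loading the data; a minimal sketch, assuming one of the hourly snapshot files has been downloaded (the filename here is hypothetical):

import json

# Hypothetical local path to a downloaded hourly snapshot
path = 'usagov_bitly_data.txt'

# Each line of the file is a JSON record; parse them into a list of dicts
records = [json.loads(line) for line in open(path)]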

Counting time zones in pure Python


from collections import defaultdict
 
def get_counts2(sequence):
    counts = defaultdict(int) # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts

def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
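
These functions can then be applied to the time zone field ('tz') pulled out of the parsed records:

# Not every record has a time zone, so check for the key
time_zones = [rec['tz'] for rec in records if 'tz' in rec]

counts = get_counts2(time_zones)
print(top_counts(counts))   # the ten most common time zones, in ascending order of count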

If you search the Python standard library, you may find the collections.Counter class, which makes this task a lot easier:

In [49]: from collections import Counter
 
In [50]: counts = Counter(time_zones)
 
In [51]: counts.most_common(10)

Counting time zones with pandas


The main pandas data structure is the DataFrame, which you can think of as representing a table or spreadsheet of data. Creating a DataFrame from the original set of records is simple:

In [162]: from pandas import DataFrame, Series
 
In [163]: import pandas as pd
 
In [164]: frame = DataFrame(records)
 
In [9]: frame['tz'][:10]
Out[9]: 
0     America/New_York
1       America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6        Europe/Warsaw
7                     
8                     
9                     
Name: tz

The output shown for frame is the summary view used for large DataFrame objects. The Series object returned by frame['tz'] has a value_counts method that gives us what we're looking for:

In [167]: tz_counts = frame['tz'].value_counts()
 
In [168]: tz_counts[:10]
Out[168]: 
America/New_York       1251
                        521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
America/Sao_Paulo        33

Then, we might want to make a plot of this data using the plotting library matplotlib. First, a bit of munging fills in substitute values for unknown and missing time zone data in the records. The fillna method replaces missing (NA) values, while unknown values (empty strings) can be replaced using boolean array indexing:

In [169]: clean_tz = frame['tz'].fillna('Missing')
 
In [170]: clean_tz[clean_tz == ''] = 'Unknown'
 
In [171]: tz_counts = clean_tz.value_counts()
 
In [172]: tz_counts[:10]
Out[172]: 
America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Missing                 120
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35

Making a horizontal bar plot can be accomplished using the plot method on the counts object:

In [174]: tz_counts[:10].plot(kind='barh', rot=0)


Parsing all of the interesting information in these "agent" strings may seem like a daunting task. Luckily, once you have mastered Python's built-in string functions and regular expression capabilities, it is really not so bad. For example, we could split off the first token in the string (corresponding roughly to the browser capability) and make another summary of the user behavior:

In [178]: results = Series([x.split()[0] for x in frame.a.dropna()])
 
In [179]: results[:5]
Out[179]: 
0               Mozilla/5.0
1    GoogleMaps/RochesterNY
2               Mozilla/4.0
3               Mozilla/5.0
4               Mozilla/5.0
 
In [180]: results.value_counts()[:8]
Out[180]: 
Mozilla/5.0                 2594
Mozilla/4.0                  601
GoogleMaps/RochesterNY       121
Opera/9.80                    34
TEST_INTERNET_AGENT           24
GoogleProducer                21
Mozilla/6.0                    5
BlackBerry8520/5.0.0.681       4

Now, suppose you wanted to decompose the top time zones into Windows and non-Windows users. As a simplification, let's say that a user is on Windows if the string 'Windows' is in the agent string. Since some of the agents are missing, I'll exclude these from the data:

In [181]: cframe = frame[frame.a.notnull()]

We then want to compute a value indicating whether each row is Windows or not:

In [182]: operating_system = ['Windows' if 'Windows' in agent else 'Not Windows'
   .....:                     for agent in cframe['a']]
 
In [183]: operating_system[:5]
Out[183]: ['Windows', 'Not Windows', 'Windows', 'Not Windows', 'Windows']

Then, you can group the data by its time zone column and this new list of operating systems:

In [184]: by_tz_os = cframe.groupby(['tz', operating_system])


The group counts, analogous to the value_counts function above, can be computed using size. This result is then reshaped into a table with unstack:

In [185]: agg_counts = by_tz_os.size().unstack().fillna(0)


Finally, let's select the top overall time zones. To do so, I construct an indirect index array from the row counts in agg_counts:

# Use to sort in ascending order
In [187]: indexer = agg_counts.sum(1).argsort()

I then use take to select the rows in that order, then slice off the last 10 rows:

In [189]: count_subset = agg_counts.take(indexer)[-10:]
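
The stacked bar plot referred to in the next paragraph isn't shown in these notes; presumably something like:

count_subset.plot(kind='barh', stacked=True)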

The plot doesn't make it easy to see the relative percentage of Windows users in the smaller groups, but the rows can easily be normalized to sum to 1 and then plotted again:

In [194]: normed_subset = count_subset.div(count_subset.sum(1), axis=0)
 
In [195]: normed_subset.plot(kind='barh', stacked=True)

MovieLens 1M data set


The MovieLens 1M data set contains 1 million ratings collected from 6000 users on 4000 movies. It's spread across 3 tables: ratings, user information, and movie information. After extracting the data from the zip file, each table can be loaded into a pandas DataFrame object using pandas.read_table:

import os
import pandas as pd
 
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None,
                      names=unames)
 
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None,
                        names=rnames)
 
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None,
                        names=mnames)

Using pandas's merge function, we first merge ratings with users, then merge that result with the movies data. pandas infers which columns to use as the merge (or join) keys based on overlapping names:

In [211]: data = pd.merge(pd.merge(ratings, users), movies)

In this form, aggregating the ratings grouped by one or more user or movie attributes is straightforward once you build some familiarity with pandas. To get mean movie ratings for each film grouped by gender, we can use the pivot_table method:

In [214]: mean_ratings = data.pivot_table('rating', rows='title',
   .....:                                 cols='gender', aggfunc='mean')
 
In [215]: mean_ratings[:5]
Out[215]: 
gender                                F         M
title                                            
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024

This produces another DataFrame containing the mean ratings, with movie titles as row labels and gender as column labels. There are a number of questions you might want to answer now. First, I'm going to filter down to movies that received at least 250 ratings (a completely arbitrary number); to do this, I group the data by title and use size() to get a Series of group sizes for each title:

ratings_by_title = data.groupby('title').size()
active_titles = ratings_by_title.index[ratings_by_title >= 250]

The index of titles receiving at least 250 ratings can then be used to select rows from mean_ratings above:

In [220]: mean_ratings = mean_ratings.ix[active_titles]

To see the top films among female viewers, we can sort by the F column in descending order.

In [223]: top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)

Measuring rating disagreement


Suppose we wanted to find the movies that are most divisive between male and female viewers. One way is to add a column to mean_ratings containing the difference in means:

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

Sorting by 'diff' gives us the movies with the greatest rating difference that were preferred by women:

In [226]: sorted_by_diff = mean_ratings.sort_index(by='diff')

Reversing the order of the rows and again slicing off the top 15 rows, we get the movies preferred by men that women didn't rate as highly:

# Reverse order of rows, take first 15 rows
In [228]: sorted_by_diff[::-1][:15]

Suppose instead you wanted the movies that elicited the most disagreement among viewers, independent of gender. Disagreement can be measured by the variance or standard deviation of the ratings:

# Standard deviation of rating grouped by title
In [229]: rating_std_by_title = data.groupby('title')['rating'].std()
 
# Filter down to active_titles
In [230]: rating_std_by_title = rating_std_by_title.ix[active_titles]
 
# Order Series by value in descending order
In [231]: rating_std_by_title.order(ascending=False)[:10]
Out[231]: 
title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating

You may have noticed that movie genres are given as a pipe-separated (|) string. If you wanted to do some analysis by genre, more work would be required to transform the genre information into a more usable form. I will revisit this data later in the book to illustrate such a transformation.
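
A minimal first step toward such a transformation (not the book's approach, just a sketch): split each pipe-separated genre string into a Python list:

# Split each movie's genre string on '|' into a list of genre labels
genre_lists = [genres.split('|') for genres in movies.genres]
print(genre_lists[:3])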

US Baby Names 1880-2010


http://www.ssa.gov/oact/babynames/limits.html

As the data is in a nicely comma-separated form, it can be loaded into a DataFrame with pandas.read_csv:

In [241]: import pandas as pd
 
In [242]: names1880 = pd.read_csv('names/yob1880.txt', names=['name', 'sex', 'births'])

These files only contain names with at least 5 occurrences in each year, so for simplicity's sake we can use the sum of the births column by sex as the total number of births in that year:

In [244]: names1880.groupby('sex').births.sum()
Out[244]: 
sex
F       90993
M      110493
Name: births

Since the data set is split into files by year, one of the first things to do is to assemble all of the data into a single DataFrame and add a year field. This is easy to do using pandas.concat:

# 2010 is the last available year right now
years = range(1880, 2011)
 
pieces = []
columns = ['name', 'sex', 'births']
 
for year in years:
    path = 'names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)
 
# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)

There are a couple of things to note here. First, remember that concat glues the DataFrame objects together row-wise by default. Second, you have to pass ignore_index=True because we're not interested in preserving the original row numbers returned from read_csv. So we now have a very large DataFrame containing all of the names data.
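
A tiny illustration of the ignore_index behavior, using two hypothetical two-row frames:

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})   # index 0, 1
b = pd.DataFrame({'x': [3, 4]})   # index 0, 1

# Without ignore_index=True the result keeps the duplicate labels 0, 1, 0, 1;
# with it, the rows are relabeled 0 through 3
print(pd.concat([a, b], ignore_index=True))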

total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)
total_births.tail()
sex         F        M
year                  
2007  1917881  2070445
2008  1885178  2034014
2009  1830134  1976208
2010  1768463  1909167
2011  1742410  1880633

Next, let's insert a column prop with the fraction of babies given each name relative to the total number of births. A prop value of 0.02 would indicate that 2 out of every 100 babies were given a particular name. Thus, we group the data by year and sex, then add the new column to each group:

def add_prop(group):
    # In Python 2, integer division floors, so cast births to float before dividing
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group
names = names.groupby(['year', 'sex']).apply(add_prop)

When doing a group operation like this, it's often valuable to do a sanity check, like verifying that the prop column sums to 1 within all the groups. Since this is floating-point data, use np.allclose to check that the group sums are sufficiently close to (but perhaps not exactly equal to) 1:

import numpy as np
In [252]: np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
Out[252]: True

Now that this is done, I'm going to extract a subset of the data to facilitate further analysis: the top 1000 names for each sex/year combination. This is yet another group operation:

def get_top1000(group):
    return group.sort_index(by='births', ascending=False)[:1000]
grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)

Now that this is done, I'm going to extract a subset of the data to facilitate further analysis: the top 1000 names for each sex/year combination. This is yet another group operation:

def get_top1000(group):
    return group.sort_index(by='births', ascending=False)[:1000]
grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)


Analyzing naming trends


With the full data set and Top 1000 data set in hand, we can start analyzing various naming trends of interest. First, let's split the Top 1000 names into the boy and girl portions:

In [256]: boys = top1000[top1000.sex == 'M']
 
In [257]: girls = top1000[top1000.sex == 'F']

Simple time series, like the number of Johns or Marys for each year, can be plotted but require a bit of munging to be more useful. Let's form a pivot table of the total number of births by year and name:

In [258]: total_births = top1000.pivot_table('births', rows='year', cols='name', aggfunc=sum)

Now, this can be plotted for a handful of names using DataFrame's plot method:

In [259]: total_births
Out[259]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 1880 to 2010
Columns: 6865 entries, Aaden to Zuri
dtypes: float64(6865)
 
In [260]: subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
 
In [261]: subset.plot(subplots=True, figsize=(12, 10), grid=False, title="Number of births per year")

Measuring the increase in naming diversity


One explanation for the decrease seen in the plots above is that fewer parents are choosing common names for their children. This hypothesis can be explored and confirmed in the data. One measure is the proportion of births represented by the top 1000 most popular names, which I aggregate and plot by year and sex:

In [263]: table = top1000.pivot_table('prop', rows='year', cols='sex', aggfunc=sum)
In [264]: table.plot(title='Sum of table1000.prop by year and sex', yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))

So you can see that, indeed, there appears to be increasing name diversity (decreasing total proportion in the top 1000). Another interesting metric is the number of distinct names, taken in order of popularity from highest to lowest, in the top 50% of births. This number is a bit more tricky to compute. Let's consider just the boy names from 2010:

In [265]: df = boys[boys.year == 2010]
 
In [266]: df
Out[266]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 260877 to 261876
Data columns:
name      1000  non-null values
sex       1000  non-null values
births    1000  non-null values
year      1000  non-null values
prop      1000  non-null values
dtypes: float64(1), int64(2), object(2)

After sorting prop in descending order, we want to know how many of the most popular names it takes to reach 50%. You could write a for loop to do this, but a vectorized NumPy way is a bit more clever. Taking the cumulative sum, cumsum, of prop then calling the method searchsorted returns the position in the cumulative sum at which 0.5 would need to be inserted to keep it in sorted order:

In [267]: prop_cumsum = df.sort_index(by='prop', ascending=False).prop.cumsum()
 
In [268]: prop_cumsum[:10]
Out[268]: 
260877    0.011523
260878    0.020934
260879    0.029959
260880    0.038930
260881    0.047817
260882    0.056579
260883    0.065155
260884    0.073414
260885    0.081528
260886    0.089621
 
In [269]: prop_cumsum.searchsorted(0.5)
Out[269]: 116


Since arrays are zero-indexed, adding 1 to this result gives you a result of 117. By contrast, in 1900 this number was much smaller:

In [270]: df = boys[boys.year == 1900]
 
In [271]: in1900 = df.sort_index(by='prop', ascending=False).prop.cumsum()
 
In [272]: in1900.searchsorted(0.5) + 1
Out[272]: 25

It should now be fairly straightforward to apply this operation to each year/sex combination; groupby those fields and apply a function returning the count for each group:

def get_quantile_count(group, q=0.5):
    group = group.sort_index(by='prop', ascending=False)
    return group.prop.cumsum().searchsorted(q) + 1
 
diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')

This resulting DataFrame diversity now has two time series, one for each sex, indexed by year. This can be inspected in IPython and plotted as before:

In [274]: diversity.head()
Out[274]: 
sex    F   M
year        
1880  38  14
1881  38  14
1882  38  15
1883  39  15
1884  39  16
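
The plot itself isn't reproduced in these notes; presumably something along the lines of:

diversity.plot(title="Number of popular names in top 50%")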

The "Last letter" Revolution


In 2007, baby name researcher Laura Wattenberg pointed out on her website (http://www.babynamewizard.com) that the distribution of boy names by final letter has changed significantly over the last 100 years. To see this, I first aggregate all of the births in the full data set by year, sex, and final letter:

# extract last letter from name column
get_last_letter = lambda x: x[-1]
last_letters = names.name.map(get_last_letter)
last_letters.name = 'last_letter'
table = names.pivot_table('births', rows=last_letters, cols=['sex', 'year'], aggfunc=sum)

Then, I select out 3 representative years spanning the history and print the first few rows:

In [277]: subtable = table.reindex(columns=[1910, 1960, 2010], level='year')
 
In [278]: subtable.head()
Out[278]: 
sex               F                      M                
year           1910    1960    2010   1910    1960    2010
last_letter                                               
a            108376  691247  670605    977    5204   28438
b               NaN     694     450    411    3912   38859
c                 5      49     946    482   15476   23125
d              6750    3729    2607  22111  262112   44398
e            133569  435013  313833  28655  178823  129012
 
In [280]: letter_prop = subtable / subtable.sum().astype(float)
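
The notes skip the bar plots of these letter proportions by sex and year; a sketch using matplotlib's subplots (figure size and titles are just illustrative):

import matplotlib.pyplot as plt

# One panel per sex, with bars grouped by the three selected years
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)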

Going back to the full table created above, I again normalize by year and sex and select a subset of letters for the boy names, finally transposing to make each column a time series:

In [283]: letter_prop = table / table.sum().astype(float)
 
In [284]: dny_ts = letter_prop.ix[['d', 'n', 'y'], 'M'].T
 
In [285]: dny_ts.head()
Out[285]: 
             d         n         y
year                              
1880  0.083055  0.153213  0.075760
1881  0.083247  0.153214  0.077451
1882  0.085340  0.149560  0.077537
1883  0.084066  0.151646  0.079144
1884  0.086120  0.149915  0.080405
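
With this in hand, the trend in these letters over time can be plotted, presumably with:

dny_ts.plot()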

Boy names that became girl names (and vice versa)

Another fun trend is looking at names that were more popular with one sex earlier in the sample but have "changed sexes" in the present. One example is the name Lesley or Leslie. Going back to the top1000 data set, I compute a list of names occurring in the data set starting with 'lesl':

In [288]: all_names = top1000.name.unique()
 
In [289]: mask = np.array(['lesl' in x.lower() for x in all_names])
 
In [290]: lesley_like = all_names[mask]
 
In [291]: lesley_like
Out[291]: array([Leslie, Lesley, Leslee, Lesli, Lesly], dtype=object)

From there, we can filter down to just those names and sum births grouped by name to see the relative frequencies:

In [292]: filtered = top1000[top1000.name.isin(lesley_like)]
 
In [293]: filtered.groupby('name').births.sum()
Out[293]: 
name
Leslee      1082
Lesley     35022
Lesli        929
Leslie    370429
Lesly      10067
Name: births

Next, let's aggregate by sex and year and normalize within year:

In [294]: table = filtered.pivot_table('births', rows='year', cols='sex', aggfunc='sum')
 
In [295]: table = table.div(table.sum(1), axis=0)
 
In [296]: table.tail()
Out[296]: 
sex   F   M
year       
2006  1 NaN
2007  1 NaN
2008  1 NaN
2009  1 NaN
2010  1 NaN
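
The notes end here; the book closes this example by plotting the breakdown over time, with something like:

table.plot(style={'M': 'k-', 'F': 'k--'})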