In [1]:
# all of our import statements
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

Table Creation

You can initialize a table by putting a list of column names inside the parentheses of the Table( ) call. Typically you do it this way when you're going to add rows to the table from scratch, such as when you're iterating through some other data and want to build the table up as you go (a quick sketch of this follows the next two cells).

In [2]:
Table(['Name', 'Type'])
Out[2]:
Name Type
In [3]:
Table(['Name', 'Type']).with_rows([
        ['Fido', 'Retriever'],
        ['Clifford', 'Big Red'],
        ['Spot', 'Malamute']
    ])
Out[3]:
Name Type
Fido Retriever
Clifford Big Red
Spot Malamute
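
For instance, here is a quick sketch of building a table up row by row while looping over some other data, using the .with_row( ) method (the pet names below are made up purely for illustration):

# a sketch: start from an empty table with labels and append one row per item
pets = Table(['Name', 'Type'])
for name, kind in [('Rex', 'Terrier'), ('Milo', 'Beagle')]:
    # .with_row returns a new table with that row appended
    pets = pets.with_row([name, kind])
pets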

The more traditional way to create a table is with the .with_columns( ) method, which takes alternating column names and lists (or arrays) of values.

In [4]:
dog = Table().with_columns([
        'Name', ['Fido', 'Clifford', 'Spot'],
        'Type', ['Retriever', 'Big Red', 'Malamute']
    ])
dog
Out[4]:
Name Type
Fido Retriever
Clifford Big Red
Spot Malamute

Pulling Out a Column

There are multiple ways that you can pull a column out of a table, but some are better than others for different situations.

The safest way to pull out a column is with the .column( ) method. It explicitly lets the reader know that you are pulling out a column, and does not have any side-effects.

In [5]:
dog.column('Type')
Out[5]:
array(['Retriever', 'Big Red', 'Malamute'], 
      dtype='<U9')

Another way is to index by the column name: put the column name in brackets right after the table. It is less verbose, and readers will still understand that you're pulling the column with that name out of the table.

In [6]:
dog['Type']
Out[6]:
array(['Retriever', 'Big Red', 'Malamute'], 
      dtype='<U9')

Indexing by column number is probably the worst way to do it: when reading the code, it is unclear which column is being pulled out. As a rule of thumb, if you're tempted to index by position, index by the column name instead.

In [7]:
dog[1]
Out[7]:
array(['Retriever', 'Big Red', 'Malamute'], 
      dtype='<U9')

There are special things you can do when you use brackets to index a column. When you index with a column name, you can overwrite the existing column by assigning a new value to it.

Suppose we wanted to change Clifford's type to "Big" instead of "Big Red", because we want to add a "Color" column and don't want to be redundant.

In [8]:
print('This is the original Table')
dog.show()

# this changes the current values of "Type" to a new array
dog['Type'] = make_array('Retriever', 'Big', 'Malamute')

# this creates a new column "Color"
dog['Color'] = make_array('Golden', 'Red', 'Gray')

print('Now we have changed the "Type" column, and added a new column')
dog.show()
This is the original Table
Name Type
Fido Retriever
Clifford Big Red
Spot Malamute
Now we have changed the "Type" column, and added a new column
Name Type Color
Fido Retriever Golden
Clifford Big Red
Spot Malamute Gray

Now our dog table is changed for good. We can't recover the old values that used to be in 'Type' (unless we rerun all of the cells from the top, back to where we first created the table). So mutate with caution; it can be dangerous. Sometimes it is better to create a new version of the table with that column changed. That way, if we realize our new table is flawed or we want the old values back, we can always return to the original.

In [9]:
dog_2 = dog.with_columns([
        'Type', make_array('happy', 'sad', 'creepy'),
        'Color', make_array('green', 'blue', 'purple')
    ])
print('dog_2 (our new version)')
dog_2.show()
print('dog (still unchanged)')
dog.show()
dog_2 (our new version)
Name Type Color
Fido happy green
Clifford sad blue
Spot creepy purple
dog (still unchanged)
Name Type Color
Fido Retriever Golden
Clifford Big Red
Spot Malamute Gray
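
Another way to protect yourself, sketched below, is to keep an untouched copy before mutating in place. This assumes a Table.copy( ) method for making an independent copy of a table (which, to the best of my knowledge, the datascience library supplies):

# keep a backup snapshot before experimenting with in-place changes
dog_backup = dog.copy()                               # untouched copy
dog['Type'] = make_array('happy', 'sad', 'creepy')    # now mutate in place
dog = dog_backup                                      # and this is how we'd revert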

Table Manipulation

Moving on to some real data: below is a selected list of grocery and liquor stores in San Francisco, originally assembled from a combination of SF Open Data, Google Places, Yelp, and the Census.

In [10]:
businesses = Table.read_table('SelectedBusinesses.csv')
businesses.show(3)
[First 3 rows of the full table shown here; with 36 columns (Location Id, Business Account Number, Ownership Name, DBA Name, price level, address and mailing fields, NAICS and LIC codes, tax flags, neighborhood, coordinates, and the Grocery Store / Liquor Store indicators), the output is too wide to reproduce legibly.]

... (477 rows omitted)

Use .num_columns to get the number of columns and .num_rows to get the number of rows.

In [11]:
number_of_columns = businesses.num_columns
number_of_business = businesses.num_rows
In [12]:
print('There are ' + str(number_of_business) + " businesses")
print('And there are ' + str(number_of_columns) + " columns")
There are 480 businesses
And there are 36 columns

There are many columns here that we don't currently care about. Let's cut the table down by selecting just the ones we want.

In [13]:
businesses.labels
Out[13]:
('Location Id',
 'Business Account Number',
 'Ownership Name',
 'DBA Name',
 'price level',
 'Street Address',
 'City',
 'State',
 'Source Zipcode',
 'Business Start Date',
 'Business End Date',
 'Location Start Date',
 'Location End Date',
 'Mail Address',
 'Mail City',
 'Mail Zipcode',
 'Mail State',
 'NAICS Code',
 'NAICS Code Description',
 'Parking Tax',
 'Transient Occupancy Tax',
 'LIC Code',
 'LIC Code Description',
 'Supervisor District',
 'Neighborhoods - Analysis Boundaries',
 'Business Corridor',
 'Business Location',
 'Coordinates',
 'Lats',
 'Lons',
 'Original',
 'Business',
 'Object',
 'Type',
 'Grocery Store',
 'Liquor Store')
In [14]:
select_businesses = businesses.select('Ownership Name', 'DBA Name','Source Zipcode','Transient Occupancy Tax','Parking Tax','price level', 'Grocery Store', 'Liquor Store')
select_businesses
Out[14]:
Ownership Name DBA Name Source Zipcode Transient Occupancy Tax Parking Tax price level Grocery Store Liquor Store
Nam Hai Corporation Nam Hai Corporation 94108 False False 1 True False
Kontrafouris Athanasios Union Street Produce 94133 False False 1 True False
Afewerk Senait S Nicks Foods 94103 False False 2 True False
Afewerk Senait S Nick's Foods 94103 False False 2 True False
Chen Yu Ying Quarts' N Pints 94122 False False 1 False True
Ping Lau Kwan Ho Kee Market 94133 False False 4 True False
Habash Husam G London Market & Deli 94115 False False 3 True False
Jo Pos Inc D & M Liquors 94115 False False 4 False True
Bsisso Ketam A Noe Hill Market 94114 False False 2 True False
Hom Song Chin/leung Homs Grocery 94110 False False 1 True False

... (470 rows omitted)

Now that we have a more manageable dataset, we can try to find some things out about it. Start by using .stats() to get summary statistics for each column.

In [15]:
select_businesses.stats()
Out[15]:
statistic Ownership Name DBA Name Source Zipcode Transient Occupancy Tax Parking Tax price level Grocery Store Liquor Store
min 2e Shqair Inc 1001 Castro Street 94102 0 0 1 0 0
max Zuha & Rana Corporation Zains Fine Wine & Spirits 95901 0 0 4 1 1
median 94112 0 0 2 1 0
sum 4.51781e+07 0 0 793 358 177

It is odd that both of the tax columns have a max and a min of 0. Let's investigate further . . .

In [16]:
select_businesses.column('Transient Occupancy Tax')[:10]
Out[16]:
array([False, False, False, False, False, False, False, False, False, False], dtype=bool)
In [17]:
select_businesses.column('Parking Tax')[:10]
Out[17]:
array([False, False, False, False, False, False, False, False, False, False], dtype=bool)

A sum of zero for each tax column means that neither contains any True values: in Python, True evaluates to one and False evaluates to zero, so summing a boolean column counts its Trues. Because every business has the same value in those columns, they don't add any information to our table, so let's drop them. We're confident we won't need them later, so we can reassign select_businesses to the new table.
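
As a quick sanity check of that reasoning (a throwaway example, not part of the analysis), summing a boolean array counts its True values:

sum(make_array(True, False, True))              # 2
sum(select_businesses.column('Parking Tax'))    # 0, so the column contains no True values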

In [18]:
select_businesses = select_businesses.drop('Parking Tax', 'Transient Occupancy Tax')
select_businesses.show(3)
Ownership Name DBA Name Source Zipcode price level Grocery Store Liquor Store
Nam Hai Corporation Nam Hai Corporation 94108 1 True False
Kontrafouris Athanasios Union Street Produce 94133 1 True False
Afewerk Senait S Nicks Foods 94103 2 True False

... (477 rows omitted)

We have the median price level from .stats(); now let's compute the mean. Then we'll look at the distribution of price levels.

In [19]:
price_levels = select_businesses.column('price level')
np.mean(price_levels)
Out[19]:
1.6520833333333333
In [20]:
select_businesses.group('price level').barh(column_for_categories='price level')

Next, let's see whether the number of businesses in an area affects any of the other columns. We can use the .group( ) method to accomplish this.

In [21]:
select_businesses.group('Source Zipcode')
Out[21]:
Source Zipcode count
94102 35
94103 25
94104 2
94105 9
94107 17
94108 14
94109 53
94110 59
94111 2
94112 27

... (17 rows omitted)

Now sort the zipcodes by how many businesses each one contains.

In [22]:
grouped_zipcodes = select_businesses.group('Source Zipcode').sort('count', descending=True)
grouped_zipcodes
Out[22]:
Source Zipcode count
94110 59
94109 53
94133 39
94102 35
94117 32
94112 27
94103 25
94118 22
94122 20
94115 20

... (17 rows omitted)

What is the zipcode with the most businesses?

In [23]:
grouped_zipcodes.take(0)
Out[23]:
Source Zipcode count
94110 59

Next, we want to compare the mean price level of businesses in 94110 with that of businesses in all the other zipcodes.

In [24]:
# pull out the zipcode with the most businesses
zipcode = grouped_zipcodes.take(0).column('Source Zipcode').item(0)

# these are boolean arrays, which you can pass to .where( ) to select rows
in_94110 = select_businesses['Source Zipcode'] == zipcode
not_in_94110 = select_businesses['Source Zipcode'] != zipcode
In [25]:
print('94110')
select_businesses.where(in_94110).show(3)
print('The rest of the zipcodes')
select_businesses.where(not_in_94110).show(3)
94110
Ownership Name DBA Name Source Zipcode price level Grocery Store Liquor Store
Hom Song Chin/leung Homs Grocery 94110 1 True False
Totah B/totah M/ Totah N Norms Market 94110 2 False True
Jada Rimon & Sana R Image Market 94110 1 True False

... (56 rows omitted)

The rest of the zipcodes
Ownership Name DBA Name Source Zipcode price level Grocery Store Liquor Store
Nam Hai Corporation Nam Hai Corporation 94108 1 True False
Kontrafouris Athanasios Union Street Produce 94133 1 True False
Afewerk Senait S Nicks Foods 94103 2 True False

... (418 rows omitted)

In [26]:
np.mean(select_businesses.where(in_94110).column('price level'))
Out[26]:
1.4915254237288136
In [27]:
np.mean(select_businesses.where(not_in_94110).column('price level'))
Out[27]:
1.6745843230403801

On average, the price level is lower for businesses in zipcode 94110 than for those in the other zipcodes.
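
As an aside, instead of building the boolean arrays by hand, the same row selections can be written with table predicates (we'll use predicates again further down); a sketch:

# equivalent selections using the are predicates from the datascience library
select_businesses.where('Source Zipcode', are.equal_to(zipcode)).num_rows       # businesses in 94110
select_businesses.where('Source Zipcode', are.not_equal_to(zipcode)).num_rows   # all the others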

The .group( ) method can take a second argument: a function that is applied to all of the values in each column within each group. Typically you use np.mean or max, but many other functions work (there's an extra example after the next cell).

In [28]:
averaged_businesses = select_businesses.group('Source Zipcode', np.mean)
averaged_businesses
Out[28]:
Source Zipcode Ownership Name mean DBA Name mean price level mean Grocery Store mean Liquor Store mean
94102 1.8 0.6 0.485714
94103 1.56 0.8 0.36
94104 1 0.5 0.5
94105 2.11111 0.444444 0.666667
94107 1.82353 0.588235 0.470588
94108 1.92857 0.714286 0.428571
94109 1.64151 0.830189 0.320755
94110 1.49153 0.864407 0.271186
94111 3 0.5 0.5
94112 1.51852 0.62963 0.407407

... (17 rows omitted)
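
Note that the 'Ownership Name mean' and 'DBA Name mean' columns come out empty because a mean of strings isn't defined. Also, np.mean isn't special here: any function that collapses an array of values can be passed as the second argument. For example, a sketch using max to find the highest price level in each zipcode (selecting the numeric columns first keeps the output tidy):

select_businesses.select('Source Zipcode', 'price level').group('Source Zipcode', max)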

In [29]:
# using group, we can get the business counts for each zipcode
business_counts = select_businesses.group('Source Zipcode').sort('Source Zipcode').column('count')

# then we can make a new column with those same values on the above table
averaged_businesses = averaged_businesses.sort('Source Zipcode').with_columns([
        'Business Count', business_counts])
averaged_businesses
Out[29]:
Source Zipcode Ownership Name mean DBA Name mean price level mean Grocery Store mean Liquor Store mean Business Count
94102 1.8 0.6 0.485714 35
94103 1.56 0.8 0.36 25
94104 1 0.5 0.5 2
94105 2.11111 0.444444 0.666667 9
94107 1.82353 0.588235 0.470588 17
94108 1.92857 0.714286 0.428571 14
94109 1.64151 0.830189 0.320755 53
94110 1.49153 0.864407 0.271186 59
94111 3 0.5 0.5 2
94112 1.51852 0.62963 0.407407 27

... (17 rows omitted)

We can display the distribution of average price levels with a histogram.

STOP!

Making sure that you understand histograms is key. If you aren't completely sure what one is, read this section of the textbook: https://www.inferentialthinking.com/chapters/06/2/visualizing-numerical-distributions.html

In [30]:
averaged_businesses.hist('price level mean', bins=np.arange(1,3.25,.25))

Let's estimate how many zipcodes (rows of averaged_businesses) fall in the bin between 1.5 and 1.75.

In [31]:
# height (percent per unit) * bin width (units) = percent of data in the bin
160 * .25
Out[31]:
40.0

So roughly forty percent of our data lies within that bin.

In [32]:
.40 * 27
Out[32]:
10.8

That equates to about 11 zipcodes.
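
Putting the two steps together in one expression (bar height in percent per unit, times bin width, times the number of zipcodes):

160 / 100 * .25 * 27    # roughly 10.8 zipcodes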

It looks odd that there is a bin so far out to the right. Let's take a look at which rows it includes.

In [33]:
# selecting rows with .where( ) and a table predicate
averaged_businesses.where('price level mean', are.above(2.5))
Out[33]:
Source Zipcode Ownership Name mean DBA Name mean price level mean Grocery Store mean Liquor Store mean Business Count
94111 3 0.5 0.5 2

Use a scatter plot to see the relationship between the mean price level and the number of businesses in a zipcode.

In [34]:
averaged_businesses.scatter('price level mean', 'Business Count')