# All of our import statements
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
You can initialize a table by putting a list of column names into the parentheses that follow the call. Typically you do it this way when you're going to add rows to the table from scratch, such as when you're iterating through some other data and want to build the table row by row (a sketch of that pattern follows the next example).
Table(['Name', 'Type'])
Table(['Name', 'Type']).with_rows([
    ['Fido', 'Retriever'],
    ['Clifford', 'Big Red'],
    ['Spot', 'Malamute']
])
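And here is a minimal sketch of the row-by-row pattern mentioned above, using the .with_row( ) method inside a loop (the pet names here are made up):
# build a table one row at a time while iterating over other data
pets = Table(['Name', 'Type'])
for name, kind in [('Rex', 'Terrier'), ('Bella', 'Poodle')]:
    pets = pets.with_row([name, kind])
pets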
The more traditional way to create a table is with the .with_columns( ) method, pairing each column name with an array of its values.
dog = Table().with_columns([
    'Name', ['Fido', 'Clifford', 'Spot'],
    'Type', ['Retriever', 'Big Red', 'Malamute']
])
dog
There are multiple ways that you can pull a column out of a table, but some are better than others for different situations.
The safest way to pull out a column is with the .column( ) method. It explicitly lets the reader know that you are pulling out a column, and it does not have any side effects.
dog.column('Type')
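Note that .column returns a NumPy array, so you can apply NumPy functions to the result directly; for example:
# the result is a NumPy array, so NumPy functions apply directly
np.sort(dog.column('Type'))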
Another way is to index by the column name: put the column name in brackets after the table. It is less verbose, and readers will still understand that you're pulling the column with that name out of the table.
dog['Type']
Indexing by column number is probably the worst way to do it: when reading the code, it is unclear which column is being pulled out. As a rule of thumb, if you're going to index with brackets, use the column name instead.
dog[1]
There are special things you can do when you use brackets to index a column. When you index with a column name, you can overwrite the current column by assigning a new value to it.
Suppose we want to change Clifford's type to "Big" instead of "Big Red", because we are about to add a "Color" column and don't want to be redundant.
print('This is the original Table')
dog.show()
# this changes the current values of "Type" to a new array
dog['Type'] = make_array('Retriever', 'Big', 'Malamute')
# this creates a new column "Color"
dog['Color'] = make_array('Golden', 'Red', 'Gray')
print('Now we have changed the "Type" column, and added a new column')
dog.show()
Now our dog table is forever changed. We are unable to get back the old values that we used to have in 'Type' (unless we re-run all of the cells from the point where we first created the table). So do this with caution; it can be dangerous. Sometimes it is better to create a new version of the table with that column changed. That way, if we realize that our new table is flawed, or we want to access our old table, we can always revert back to it.
dog_2 = dog.with_columns([
    'Type', make_array('happy', 'sad', 'creepy'),
    'Color', make_array('green', 'blue', 'purple')
])
print('dog_2 (our new version)')
dog_2.show()
print('dog (still unchanged)')
dog.show()
Moving on to some real data: below is a selected list of grocery and liquor stores in San Francisco, originally compiled from a combination of SF Open Data, Google Places, Yelp, and the Census.
businesses = Table.read_table('SelectedBusinesses.csv')
businesses.show(3)
Use .num_columns to get the number of columns, and .num_rows to get the number of rows.
number_of_columns = businesses.num_columns
number_of_businesses = businesses.num_rows
print('There are ' + str(number_of_businesses) + ' businesses')
print('And there are ' + str(number_of_columns) + ' columns')
There are many columns here that we currently don't care about. Let's cut down on them by selecting the ones we want.
businesses.labels
select_businesses = businesses.select('Ownership Name', 'DBA Name', 'Source Zipcode', 'Transient Occupancy Tax', 'Parking Tax', 'price level', 'Grocery Store', 'Liquor Store')
select_businesses
Now that we have a more manageable dataset, we can try to find some things out about it. Start by using .stats() to get summary statistics for each column of the table.
select_businesses.stats()
It is odd that both of the taxes have maxes and mins of 0. Let's investigate further.
select_businesses.column('Transient Occupancy Tax')[:10]
select_businesses.column('Parking Tax')[:10]
A sum of zero for each of the taxes means that neither has any True values. In Python, True evaluates to one and False evaluates to zero (illustrated below). Because all of the businesses have the same value in those columns, they aren't really adding any information to our table, so let's drop them. We're pretty confident that we won't need them in the future, so we can reassign select_businesses to the new table.
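First, a quick illustration of that True/False arithmetic (the boolean arrays here are made up):
# True behaves like 1 and False like 0, so summing a boolean
# array counts the True values
sum(make_array(True, False, True))    # 2
sum(make_array(False, False, False))  # 0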
select_businesses = select_businesses.drop('Parking Tax', 'Transient Occupancy Tax')
select_businesses.show(3)
We have the median price level, but let's also get the mean. Then let's look at the distribution of those numbers.
price_levels = select_businesses.column('price level')
np.mean(price_levels)
select_businesses.group('price level').barh(column_for_categories='price level')
Next, let's see if the number of businesses in an area affects any of the other columns. We can use the .group( ) method to accomplish this.
select_businesses.group('Source Zipcode')
Now sort them so that the zipcodes with the most businesses in their area come first.
grouped_zipcodes = select_businesses.group('Source Zipcode').sort('count', descending=True)
grouped_zipcodes
What is the zipcode with the most businesses?
grouped_zipcodes.take(0)
Next, we will compare the mean price level of businesses in 94110 with the mean for businesses in all the other zipcodes.
# selecting out the zipcode with the most businesses
zipcode = grouped_zipcodes.take(0).column('Source Zipcode').item(0)
# These are boolean arrays, which you can pass to the where method to select rows
in_94110 = select_businesses['Source Zipcode'] == zipcode
not_in_94110 = select_businesses['Source Zipcode'] != zipcode
print('94110')
select_businesses.where(in_94110).show(3)
print('The rest of the zipcodes')
select_businesses.where(not_in_94110).show(3)
np.mean(select_businesses.where(in_94110).column('price level'))
np.mean(select_businesses.where(not_in_94110).column('price level'))
The price level is on average lower for the businesses that are within the zipcode 94110.
The .group( ) method can take a second argument: a function that is applied to the values in each column of each group. Typically you use np.mean or max, but there are many functions you could use.
averaged_businesses = select_businesses.group('Source Zipcode', np.mean)
averaged_businesses
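For instance, here is the same grouping with Python's built-in max instead of np.mean (just a sketch; any function that accepts an array of column values will work):
# the same grouping, but taking the max of each column per zipcode
select_businesses.group('Source Zipcode', max)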
# using group, we can get the business counts for each zipcode
business_counts = select_businesses.group('Source Zipcode').sort('Source Zipcode').column('count')
# then we can make a new column with those same values on the above table
averaged_businesses = averaged_businesses.sort('Source Zipcode').with_columns([
    'Business Count', business_counts])
averaged_businesses
We can display the distribution of average price levels with a histogram.
Making sure that you understand histograms is key. If you aren't completely sure what one is, read this section of the textbook: https://www.inferentialthinking.com/chapters/06/2/visualizing-numerical-distributions.html
averaged_businesses.hist('price level mean', bins=np.arange(1,3.25,.25))
Let's estimate how many zipcodes are in the bin between 1.5 and 1.75.
# height of the bar (percent per unit) * bin width (units) = percent in the bin
160 * .25
So forty percent of our data lies within that bin.
.40 * 27
That equates to about 11 of the 27 zipcodes.
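We can check that estimate directly with a where query (are.between keeps values greater than or equal to the lower bound and strictly below the upper bound):
# count the zipcodes whose mean price level falls in [1.5, 1.75)
averaged_businesses.where('price level mean', are.between(1.5, 1.75)).num_rows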
It looks odd that there is a bin so far out to the right; let's take a look at which rows it might include.
# filtering rows with .where and a table predicate
averaged_businesses.where('price level mean', are.above(2.5))
Use a scatter plot to see the relationship between price level and the number of businesses in a zipcode.
averaged_businesses.scatter('price level mean', 'Business Count')