# All of our import statements
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
You can initialize a table by putting a list of column names into the parentheses that follow the call. Typically you do it this way when you're going to add rows to the table from scratch, such as when you're iterating through some other data and want to build the table row by row (a sketch of that pattern follows the next example).
Table(['Name', 'Type'])
Table(['Name', 'Type']).with_rows([
    ['Fido', 'Retriever'],
    ['Clifford', 'Big Red'],
    ['Spot', 'Malamute']
])
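And here is a minimal sketch of the row-by-row pattern mentioned above, using the .with_row( ) method inside a loop (the pet names here are made up):
# build a table one row at a time while iterating over other data
pets = Table(['Name', 'Type'])
for name, kind in [('Rex', 'Terrier'), ('Bella', 'Poodle')]:
    pets = pets.with_row([name, kind])
pets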
The more traditional way to create a table is with the .with_columns( ) method, pairing each column name with an array of its values.
dog = Table().with_columns([
    'Name', ['Fido', 'Clifford', 'Spot'],
    'Type', ['Retriever', 'Big Red', 'Malamute']
])
dog
There are multiple ways that you can pull a column out of a table, but some are better than others for different situations.
The safest way to pull out a column is with the .column( ) method. It explicitly lets the reader know that you are pulling out a column, and it does not have any side effects.
dog.column('Type')
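Note that .column returns a NumPy array, so you can apply NumPy functions to the result directly; for example:
# the result is a NumPy array, so NumPy functions apply directly
np.sort(dog.column('Type'))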
Another way is to index by the column name: put the column name in brackets after the table. It is less verbose, and readers will still understand that you're pulling the column with that name out of the table.
dog['Type']
Indexing by column number is probably the worst way to do it: when reading the code, it is unclear which column is being pulled out. As a rule of thumb, if you're going to index with brackets, use the column name instead.
dog[1]
There are special things you can do when you use brackets to index a column. When you index with a column name, you can overwrite the current column by assigning a new value to it.
Suppose we want to change Clifford's type to "Big" instead of "Big Red", because we are about to add a "Color" column and don't want to be redundant.
print('This is the original Table')
dog.show()
# this changes the current values of "Type" to a new array
dog['Type'] = make_array('Retriever', 'Big', 'Malamute')
# this creates a new column "Color"
dog['Color'] = make_array('Golden', 'Red', 'Gray')
print('Now we have changed the "Type" column, and added a new column')
dog.show()
Now our dog table is forever changed. We are unable to get back the old values that we used to have in 'Type' (unless we re-run all of the cells from the point where we first created the table). So do this with caution; it can be dangerous. Sometimes it is better to create a new version of the table with that column changed. That way, if we realize that our new table is flawed, or we want to access our old table, we can always revert back to it.
dog_2 = dog.with_columns([
    'Type', make_array('happy', 'sad', 'creepy'),
    'Color', make_array('green', 'blue', 'purple')
])
print('dog_2 (our new version)')
dog_2.show()
print('dog (still unchanged)')
dog.show()
Moving on to some real data: below is a selected list of grocery and liquor stores in San Francisco, originally compiled from a combination of SF Open Data, Google Places, Yelp, and the Census.
businesses = Table.read_table('SelectedBusinesses.csv')
businesses.show(3)
Use .num_columns to get the number of columns, and .num_rows to get the number of rows.
number_of_columns = businesses.num_columns
number_of_businesses = businesses.num_rows
print('There are ' + str(number_of_businesses) + ' businesses')
print('And there are ' + str(number_of_columns) + ' columns')
There are many columns here that we currently don't care about. Let's cut down on them by selecting the ones we want.
businesses.labels
select_businesses = businesses.select('Ownership Name', 'DBA Name', 'Source Zipcode', 'Transient Occupancy Tax', 'Parking Tax', 'price level', 'Grocery Store', 'Liquor Store')
select_businesses
Now that we have a more manageable dataset, we can try to find some things out about it. Start by using .stats() to get summary statistics for each column of the table.
select_businesses.stats()
It is odd that both of the taxes have maxes and mins of 0. Let's investigate further.
select_businesses.column('Transient Occupancy Tax')[:10]
select_businesses.column('Parking Tax')[:10]
A sum of zero for each of the taxes means that neither has any True values. In Python, True evaluates to one and False evaluates to zero (illustrated below). Because all of the businesses have the same value in those columns, they aren't really adding any information to our table, so let's drop them. We're pretty confident that we won't need them in the future, so we can reassign select_businesses to the new table.
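First, a quick illustration of that True/False arithmetic (the boolean arrays here are made up):
# True behaves like 1 and False like 0, so summing a boolean
# array counts the True values
sum(make_array(True, False, True))    # 2
sum(make_array(False, False, False))  # 0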
select_businesses = select_businesses.drop('Parking Tax', 'Transient Occupancy Tax')
select_businesses.show(3)
We have the median price level, but let's also get the mean. Then let's look at the distribution of those numbers.
price_levels = select_businesses.column('price level')
np.mean(price_levels)
select_businesses.group('price level').barh(column_for_categories='price level')
Next, let's see if the number of businesses in an area affects any of the other columns. We can use the .group( ) method to accomplish this.
select_businesses.group('Source Zipcode')
Now sort them so that the zipcodes with the most businesses in their area come first.
grouped_zipcodes = select_businesses.group('Source Zipcode').sort('count', descending=True)
grouped_zipcodes
What is the zipcode with the most businesses?
grouped_zipcodes.take(0)
Next, we will compare the mean price level of businesses in 94110 with the mean for businesses in all the other zipcodes.
# selecting out the zipcode with the most businesses
zipcode = grouped_zipcodes.take(0).column('Source Zipcode').item(0)
# These are boolean arrays, which you can pass to the where method to select rows
in_94110 = select_businesses['Source Zipcode'] == zipcode
not_in_94110 = select_businesses['Source Zipcode'] != zipcode
print('94110')
select_businesses.where(in_94110).show(3)
print('The rest of the zipcodes')
select_businesses.where(not_in_94110).show(3)
np.mean(select_businesses.where(in_94110).column('price level'))
np.mean(select_businesses.where(not_in_94110).column('price level'))
The price level is on average lower for the businesses that are within the zipcode 94110.
The .group( ) method can take a second argument: a function that is applied to the values in each column of each group. Typically you use np.mean or max, but there are many functions you could use.
averaged_businesses = select_businesses.group('Source Zipcode', np.mean)
averaged_businesses
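For instance, here is the same grouping with Python's built-in max instead of np.mean (just a sketch; any function that accepts an array of column values will work):
# the same grouping, but taking the max of each column per zipcode
select_businesses.group('Source Zipcode', max)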
# using group, we can get the business counts for each zipcode
business_counts = select_businesses.group('Source Zipcode').sort('Source Zipcode').column('count')
# then we can make a new column with those same values on the above table
averaged_businesses = averaged_businesses.sort('Source Zipcode').with_columns([
    'Business Count', business_counts])
averaged_businesses
We can display the distribution of average price levels with a histogram.
Making sure that you understand histograms is key. If you aren't completely sure what one is, read this section of the textbook: https://www.inferentialthinking.com/chapters/06/2/visualizing-numerical-distributions.html
averaged_businesses.hist('price level mean', bins=np.arange(1,3.25,.25))
Let's estimate how many zipcodes are in the bin between 1.5 and 1.75.
# height of the bar (percent per unit) * bin width (units) = percent in the bin
160 * .25
So forty percent of our data lies within that bin.
.40 * 27
That equates to about 11 of the 27 zipcodes.
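We can check that estimate directly with a where query (are.between keeps values greater than or equal to the lower bound and strictly below the upper bound):
# count the zipcodes whose mean price level falls in [1.5, 1.75)
averaged_businesses.where('price level mean', are.between(1.5, 1.75)).num_rows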
It looks odd that there is a bin so far out to the right; let's take a look at which rows it might include.
# filtering rows with .where and a table predicate
averaged_businesses.where('price level mean', are.above(2.5))
Use a scatter plot to see the relationship between price level and the number of businesses in a zipcode.
averaged_businesses.scatter('price level mean', 'Business Count')