Practical Machine Learning - Day 1

VMware Bangalore
June 18-20, 2018

Amit Kapoor, Anand Chitipothu, Bargava Subramanian

Notes of this workshop are available online at:
https://bit.ly/vmware-ml


In [1]:
1 + 2 
Out[1]:
3

Write any Python code and press Shift + Enter to execute it.

There are code and markdown cells. The markdown cell is used to write text like this.

You can change the cell type from the Cell menu at the top, or press ESC then m for a markdown cell and ESC then y for a code cell.

In [2]:
1 + 2
Out[2]:
3

Simple formatting tips:

# Heading 1 -- Heading 1

## Heading 2 -- Heading 2

**bold text** -- bold text

*italic text* -- italic text

Use 4 spaces at the beginning of a line to write something verbatim:

This is
  verbatim text

Data Science Libraries

We are going to use the following data science libraries:

  • numpy - numerical computation
  • pandas - spreadsheet for hackers
  • matplotlib - plotting graphs
  • scikit-learn (sklearn) - machine learning
In [3]:
import numpy as np
In [4]:
x = np.array([1, 2, 3])
In [5]:
x
Out[5]:
array([1, 2, 3])
In [6]:
x.dtype
Out[6]:
dtype('int64')
In [9]:
type(x)
Out[9]:
numpy.ndarray
In [10]:
x.shape
Out[10]:
(3,)
In [11]:
d = np.array([[0.1, 0.2, 0.3],
              [1.1, 1.2, 1.3]])
In [12]:
d.dtype
Out[12]:
dtype('float64')
In [13]:
d.shape
Out[13]:
(2, 3)
In [16]:
d3 = np.array([d, d, d, d])
In [18]:
d3
Out[18]:
array([[[0.1, 0.2, 0.3],
        [1.1, 1.2, 1.3]],

       [[0.1, 0.2, 0.3],
        [1.1, 1.2, 1.3]],

       [[0.1, 0.2, 0.3],
        [1.1, 1.2, 1.3]],

       [[0.1, 0.2, 0.3],
        [1.1, 1.2, 1.3]]])
In [17]:
d3.shape
Out[17]:
(4, 2, 3)

One of the best things about numpy is that it supports vector operations: arithmetic on an array is applied to every element at once.

In [19]:
d
Out[19]:
array([[0.1, 0.2, 0.3],
       [1.1, 1.2, 1.3]])
In [20]:
d + 1
Out[20]:
array([[1.1, 1.2, 1.3],
       [2.1, 2.2, 2.3]])
In [21]:
1 / d
Out[21]:
array([[10.        ,  5.        ,  3.33333333],
       [ 0.90909091,  0.83333333,  0.76923077]])
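
The same elementwise rule also covers arrays of different shapes, which numpy calls broadcasting. A minimal sketch, reusing the 2x3 array d from above; the row and col arrays are made up for illustration.

```python
import numpy as np

d = np.array([[0.1, 0.2, 0.3],
              [1.1, 1.2, 1.3]])

# A 1-D array of length 3 is applied to every row of the (2, 3) array.
row = np.array([10, 20, 30])
scaled = d * row

# A (2, 1) column is applied to every column instead.
col = np.array([[1], [2]])
shifted = d + col

print(scaled.shape, shifted.shape)
```

Both results keep the shape (2, 3); only the smaller operand is stretched.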

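Numpy operations are much faster than the equivalent Python loops. In the timings below, multiplying 10 million numbers by 10 takes about 46 ms with numpy and about 1.7 s with a list comprehension.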

In [27]:
x1 = np.array(range(10000000))
x2 = list(range(10000000))
In [29]:
%%time
y1 = 10 * x1
CPU times: user 38 ms, sys: 10.7 ms, total: 48.7 ms
Wall time: 46.4 ms
In [30]:
%%time
y2 = [10*a for a in x2]
CPU times: user 1.05 s, sys: 504 ms, total: 1.55 s
Wall time: 1.74 s
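
The %%time magic only works inside the notebook. As a rough sketch, the same comparison can be run in plain Python with the standard timeit module. The array size here is an assumption, chosen smaller than above to keep the run short, and the actual timings will differ from machine to machine.

```python
import timeit
import numpy as np

n = 100_000  # assumed size, much smaller than the 10 million used above
x1 = np.arange(n)
x2 = list(range(n))

# Run each version 5 times and report the total time in seconds.
t_numpy = timeit.timeit(lambda: 10 * x1, number=5)
t_list = timeit.timeit(lambda: [10 * a for a in x2], number=5)

print(f"numpy: {t_numpy:.4f}s  list: {t_list:.4f}s")
```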

Q: Can we use the * operator on regular lists?

Yes, but it is a completely different operation: it repeats the list instead of multiplying each element.

In [32]:
5 * [1, 2, 3]
Out[32]:
[1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
In [33]:
"hello" * 3
Out[33]:
'hellohellohello'
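
To make the difference concrete, here is a small side-by-side sketch. The variable names are only for illustration.

```python
import numpy as np

values = [1, 2, 3]

repeated = 5 * values                   # list repetition: 15 elements
scaled_loop = [5 * v for v in values]   # elementwise, using a comprehension
scaled_np = 5 * np.array(values)        # elementwise, using numpy

print(len(repeated), scaled_loop, scaled_np)
```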

Boolean Indexing

In [34]:
d
Out[34]:
array([[0.1, 0.2, 0.3],
       [1.1, 1.2, 1.3]])
In [35]:
d > 1
Out[35]:
array([[False, False, False],
       [ True,  True,  True]])
In [39]:
# count number of elements greater than 1
np.sum(d > 1)
Out[39]:
3
In [38]:
# all elements greater than 1
d[d > 1]
Out[38]:
array([1.1, 1.2, 1.3])
In [40]:
# set all the elements less than 1 to 0
d[d < 1] = 0
In [41]:
d
Out[41]:
array([[0. , 0. , 0. ],
       [1.1, 1.2, 1.3]])
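
The assignment above modifies d in place. If the original values are still needed, one possible alternative is np.where, which builds a new array from a condition. A minimal sketch, starting again from the original values of d:

```python
import numpy as np

d = np.array([[0.1, 0.2, 0.3],
              [1.1, 1.2, 1.3]])

# np.where(condition, a, b) takes values from a where the condition
# is True and from b everywhere else.
clipped = np.where(d < 1, 0, d)

print(clipped)
print(d)   # d itself is unchanged
```
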
In [42]:
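from scipy import misc
# Note: recent SciPy releases provide this image as scipy.datasets.face
# and no longer ship scipy.misc.face.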
In [43]:
face = misc.face(gray=True)
In [44]:
face
Out[44]:
array([[114, 130, 145, ..., 119, 129, 137],
       [ 83, 104, 123, ..., 118, 134, 146],
       [ 68,  88, 109, ..., 119, 134, 145],
       ...,
       [ 98, 103, 116, ..., 144, 143, 143],
       [ 94, 104, 120, ..., 143, 142, 142],
       [ 94, 106, 119, ..., 142, 141, 140]], dtype=uint8)
In [45]:
face.shape
Out[45]:
(768, 1024)
In [46]:
face.dtype
Out[46]:
dtype('uint8')
In [48]:
import matplotlib.pyplot as plt
%matplotlib inline
In [50]:
plt.imshow(face, cmap='gray')
Out[50]:
<matplotlib.image.AxesImage at 0x157187710>
In [51]:
plt.imshow(255-face, cmap='gray')
Out[51]:
<matplotlib.image.AxesImage at 0x11182a2e8>
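
The expression 255-face is safe because every result stays between 0 and 255. Other arithmetic on a uint8 image can overflow and wrap around silently. A small sketch with made-up pixel values, plus one possible way to brighten without wrapping:

```python
import numpy as np

pixels = np.array([0, 100, 200, 250], dtype=np.uint8)  # made-up pixel values

# The result stays uint8, so values past 255 wrap around (300 becomes 44).
wrapped = pixels + np.uint8(100)

# Do the arithmetic in a wider type, clip to 0..255, then convert back.
safe = np.clip(pixels.astype(np.int16) + 100, 0, 255).astype(np.uint8)

print(wrapped)
print(safe)
```
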
In [54]:
face0 = face.copy() # keep a copy of the original
In [57]:
x = np.arange(256).reshape(16, 16)
In [58]:
x
Out[58]:
array([[  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
         13,  14,  15],
       [ 16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,
         29,  30,  31],
       [ 32,  33,  34,  35,  36,  37,  38,  39,  40,  41,  42,  43,  44,
         45,  46,  47],
       [ 48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,
         61,  62,  63],
       [ 64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,
         77,  78,  79],
       [ 80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,
         93,  94,  95],
       [ 96,  97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108,
        109, 110, 111],
       [112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124,
        125, 126, 127],
       [128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140,
        141, 142, 143],
       [144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
        157, 158, 159],
       [160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172,
        173, 174, 175],
       [176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188,
        189, 190, 191],
       [192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204,
        205, 206, 207],
       [208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220,
        221, 222, 223],
       [224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236,
        237, 238, 239],
       [240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252,
        253, 254, 255]])
In [59]:
plt.imshow(x, cmap='gray')
Out[59]:
<matplotlib.image.AxesImage at 0x156f8f7f0>

Problem: Convert all pixels darker than value 200 to black in the face image and plot it.

In [60]:
face[face<200] = 0
In [61]:
plt.imshow(face, cmap='gray')
Out[61]:
<matplotlib.image.AxesImage at 0x1570c0e10>
In [62]:
# get the original
face = face0.copy()

plt.imshow(face.transpose(), cmap='gray')
Out[62]:
<matplotlib.image.AxesImage at 0x1570af198>
In [63]:
face.shape
Out[63]:
(768, 1024)
In [68]:
plt.imshow(face[:400, 600:1024], cmap='gray')
Out[68]:
<matplotlib.image.AxesImage at 0x1571eaef0>
In [71]:
h, w = face.shape
In [72]:
h
Out[72]:
768
In [73]:
w
Out[73]:
1024
In [74]:
plt.imshow(face, cmap='gray')
Out[74]:
<matplotlib.image.AxesImage at 0x15706e668>
In [81]:
face[:h//2, :w//2] = 0
plt.imshow(face, cmap='gray')
Out[81]:
<matplotlib.image.AxesImage at 0x155f09da0>
In [79]:
4 / 2
Out[79]:
2.0
In [80]:
4 // 2
Out[80]:
2
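
The two cells above show why the slices use // and not /. Slice bounds must be integers, and / always returns a float in Python 3. A short sketch using a plain list as a stand-in for the image rows:

```python
h = 768

half_float = h / 2    # 384.0, a float
half_int = h // 2     # 384, an int

rows = list(range(h))          # stand-in for the rows of the image
top = rows[:half_int]          # works: integer slice bound

failed = False
try:
    rows[:half_float]          # float slice bound is rejected
except TypeError as err:
    failed = True
    print("TypeError:", err)
```
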

Problem: Swap the top-right quarter with the bottom-left quarter of the image.

In [89]:
face = face0.copy()
TR = face[:h//2, w//2:]
BL = face[h//2:, :w//2]
plt.imshow(TR, cmap="gray")
plt.show()
plt.imshow(BL, cmap="gray")
Out[89]:
<matplotlib.image.AxesImage at 0x1584dc978>
In [90]:
TR = face[:h//2, w//2:].copy()
BL = face[h//2:, :w//2].copy()
face[:h//2, w//2:] = BL
face[h//2:, :w//2] = TR
plt.imshow(face, cmap="gray")
Out[90]:
<matplotlib.image.AxesImage at 0x1582631d0>
In [93]:
face = face0.copy()
face[:h//2, w//2:], face[h//2:, :w//2] = (
    face[h//2:, :w//2].copy(), 
    face[:h//2, w//2:].copy())
plt.imshow(face, cmap="gray")
Out[93]:
<matplotlib.image.AxesImage at 0x1580045f8>
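
The .copy() calls matter because slicing a numpy array returns a view of the same memory, not a new array. The sketch below uses a small 4x4 array as a stand-in for the image and shows what goes wrong without them.

```python
import numpy as np

a = np.arange(16).reshape(4, 4)   # small stand-in for the image
b = a.copy()

# Without .copy(): TR and BL are views into a.
TR = a[:2, 2:]
BL = a[2:, :2]
a[:2, 2:] = BL   # overwrites the top-right quarter, and TR along with it
a[2:, :2] = TR   # TR already holds BL's values, so the original TR is lost

# With .copy(): both quarters are saved before anything is overwritten.
TR = b[:2, 2:].copy()
BL = b[2:, :2].copy()
b[:2, 2:] = BL
b[2:, :2] = TR

print(a)
print(b)
```

In a, both quarters end up with the bottom-left values; in b, they are swapped as intended.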

Pandas

In [98]:
import pandas as pd

Pandas has two important data structures: the Series and the DataFrame.

A DataFrame is like a spreadsheet, and a Series is like a single column of it.

In [99]:
x = pd.Series([1.1, 2.2, 3.3, 4.4])
In [100]:
x
Out[100]:
0    1.1
1    2.2
2    3.3
3    4.4
dtype: float64
In [101]:
x[0]
Out[101]:
1.1
In [102]:
x.shape
Out[102]:
(4,)
In [103]:
x.dtype
Out[103]:
dtype('float64')
In [104]:
x + 10
Out[104]:
0    11.1
1    12.2
2    13.3
3    14.4
dtype: float64

Every Series has an index. By default it is 0, 1, 2, ..., but we can supply our own labels.

In [106]:
x = pd.Series([1.1, 2.2, 3.3, 4.4], index=["a", "b", "c", "d"])
In [107]:
x
Out[107]:
a    1.1
b    2.2
c    3.3
d    4.4
dtype: float64
In [108]:
x['a']
Out[108]:
1.1
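
The index is more than a label for lookups: arithmetic between two Series matches values by index label, not by position. A minimal sketch with two made-up Series:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
b = pd.Series([10.0, 20.0, 30.0], index=["c", "b", "d"])

# Values are paired by label; labels present in only one Series give NaN.
total = a + b

print(total)
```
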

The DataFrame

In [109]:
data = [[1, 1],
        [2, 4],
        [3, 9],
        [4, 16]]
df = pd.DataFrame(data)
In [110]:
df
Out[110]:
   0   1
0  1   1
1  2   4
2  3   9
3  4  16
In [111]:
df = pd.DataFrame(data, columns=["x", "y"])
In [112]:
df
Out[112]:
   x   y
0  1   1
1  2   4
2  3   9
3  4  16
In [113]:
df.x
Out[113]:
0    1
1    2
2    3
3    4
Name: x, dtype: int64
In [115]:
df['x']
Out[115]:
0    1
1    2
2    3
3    4
Name: x, dtype: int64
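
Both forms return the same Series here, but attribute access has limits. It does not work for column names that contain spaces, or that clash with an existing DataFrame method. A small sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3],
    "count": [10, 20, 30],          # clashes with the DataFrame.count method
    "unit price": [1.5, 2.5, 3.5],  # contains a space
})

same = df.x.equals(df["x"])       # both forms give the same Series
is_method = callable(df.count)    # df.count is the method, not the column
counts = df["count"]              # brackets always return the column
prices = df["unit price"]         # names with spaces need brackets as well

print(same, is_method)
```
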

Problem: Add a new column z to the dataframe df with value y - x in each row.

In [116]:
df.y - df.x
Out[116]:
0     0
1     2
2     6
3    12
dtype: int64
In [117]:
df['z'] = df.y - df.x
# df['z'] = df['y'] - df['x']
In [118]:
df
Out[118]:
   x   y   z
0  1   1   0
1  2   4   2
2  3   9   6
3  4  16  12
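
Assigning to df['z'] changes df in place. If we want to keep the original DataFrame as it is, one alternative is the assign method, which returns a new DataFrame. A minimal sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame([[1, 1], [2, 4], [3, 9], [4, 16]], columns=["x", "y"])

# assign() returns a new DataFrame with the extra column; df is untouched.
df2 = df.assign(z=df.y - df.x)

print(df2)
print(df.columns.tolist())
```
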

Example: Iris Dataset

In [2]:
url = "https://notes.pipal.in/2018/vmware-ml/iris.csv"
In [3]:
df = pd.read_csv(url)
In [4]:
df.head()
Out[4]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa
In [126]:
df.Name.unique()
Out[126]:
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
In [128]:
df.Name.nunique()
Out[128]:
3
In [129]:
df.Name.value_counts()
Out[129]:
Iris-virginica     50
Iris-setosa        50
Iris-versicolor    50
Name: Name, dtype: int64

Q: What happens if we write df.Name.unique without the parentheses?

In [130]:
x = "hello"
In [131]:
x.upper()
Out[131]:
'HELLO'
In [132]:
x.upper
Out[132]:
<function str.upper>
In [136]:
class Foo:
    def foo(self): pass
    def __repr__(self): return "<The Foo object>"
In [137]:
f = Foo()
In [138]:
f.foo
Out[138]:
<bound method Foo.foo of <The Foo object>>

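So df.Name.unique, without the parentheses, does not call the method. It evaluates to the bound method itself, and the notebook displays it as a bound method of the Series, followed by the printed contents of that Series.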
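
The difference can also be checked in code. In this sketch, the small Series is a made-up stand-in for df.Name.

```python
import pandas as pd

names = pd.Series(["a", "b", "a"])   # made-up stand-in for df.Name

method = names.unique      # no parentheses: the bound method object
result = names.unique()    # parentheses: the method is actually called

print(callable(method))
print(result)
```
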

In [139]:
df.describe()
Out[139]:
       SepalLength  SepalWidth  PetalLength  PetalWidth
count   150.000000  150.000000   150.000000  150.000000
mean      5.843333    3.054000     3.758667    1.198667
std       0.828066    0.433594     1.764420    0.763161
min       4.300000    2.000000     1.000000    0.100000
25%       5.100000    2.800000     1.600000    0.300000
50%       5.800000    3.000000     4.350000    1.300000
75%       6.400000    3.300000     5.100000    1.800000
max       7.900000    4.400000     6.900000    2.500000
In [140]:
df.boxplot()
Out[140]:
<matplotlib.axes._subplots.AxesSubplot at 0x15c3050f0>
In [141]:
df.SepalLength.hist()
Out[141]:
<matplotlib.axes._subplots.AxesSubplot at 0x15c304710>
In [148]:
df.hist(figsize=(12, 8), sharex=True, sharey=True);
In [154]:
df.hist(column='PetalLength', by='Name', sharex=True, sharey=True);
In [157]:
df.boxplot(column='PetalLength', by='Name');

Problem: Find the ratio of PetalLength to PetalWidth and plot a histogram grouped by Name.

In [158]:
df['Ratio'] = df.PetalLength / df.PetalWidth
In [160]:
df.hist(column='Ratio', by='Name');
In [8]:
df.plot(kind='scatter', x='PetalLength', y='PetalWidth');
In [12]:
df.Name.unique()
Out[12]:
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
In [13]:
names = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
In [14]:
df['iname'] = df.Name.map(names.get)
In [15]:
df.head()
Out[15]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  iname
0          5.1         3.5          1.4         0.2  Iris-setosa      0
1          4.9         3.0          1.4         0.2  Iris-setosa      0
2          4.7         3.2          1.3         0.2  Iris-setosa      0
3          4.6         3.1          1.5         0.2  Iris-setosa      0
4          5.0         3.6          1.4         0.2  Iris-setosa      0
In [16]:
names.get('Iris-setosa')
Out[16]:
0
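
Series.map also accepts the dictionary directly, so names.get is not strictly needed. A small sketch; the label 'Iris-unknown' is made up to show what happens to a value that is missing from the dictionary.

```python
import pandas as pd

names = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

# The last label is made up and has no entry in the dictionary.
species = pd.Series(['Iris-setosa', 'Iris-virginica', 'Iris-unknown'])

codes = species.map(names)   # labels without a match become NaN

print(codes)
print(codes.isna().sum())    # number of labels that were not mapped
```

Counting the NaN values afterwards is a cheap check that every label was covered.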
In [22]:
plt.rcParams['figure.subplot.bottom'] = 0.15
#plt.rcParams
In [23]:
df.plot(kind='scatter', x='PetalLength', y='PetalWidth', c='iname', 
        cmap='viridis');

House Rent Dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

FRAME

In [ ]:
Given a set of features describing my requirements for a house,
how much rent will I have to pay?

ACQUIRE

In [4]:
url = "https://notes.pipal.in/2018/vmware-ml/rent-data.json.zip"
In [5]:
df = pd.read_json(url)
In [6]:
df.columns
Out[6]:
Index(['bathrooms', 'bedrooms', 'building_id', 'created', 'description',
       'display_address', 'features', 'interest_level', 'latitude',
       'listing_id', 'longitude', 'manager_id', 'photos', 'price',
       'street_address'],
      dtype='object')
In [7]:
df.head()
Out[7]:
bathrooms bedrooms building_id created description display_address features interest_level latitude listing_id longitude manager_id photos price street_address
10 1.5 3 53a5b119ba8f7b61d4e010512e0dfc85 2016-06-24 07:54:24 A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... Metropolitan Avenue [] medium 40.7145 7211212 -73.9425 5ba989232d0489da1b5f2c45f6688adc [https://photos.renthop.com/2/7211212_1ed4542e... 3000 792 Metropolitan Avenue
10000 1.0 2 c5c8a357cba207596b04d1afd1e4f130 2016-06-12 12:19:27 Columbus Avenue [Doorman, Elevator, Fitness Center, Cats Allow... low 40.7947 7150865 -73.9667 7533621a882f71e25173b27e3139d83d [https://photos.renthop.com/2/7150865_be3306c5... 5465 808 Columbus Avenue
100004 1.0 1 c3ba40552e2120b0acfc3cb5730bb2aa 2016-04-17 03:26:41 Top Top West Village location, beautiful Pre-w... W 13 Street [Laundry In Building, Dishwasher, Hardwood Flo... high 40.7388 6887163 -74.0018 d9039c43983f6e564b1482b273bd7b01 [https://photos.renthop.com/2/6887163_de85c427... 2850 241 W 13 Street
100007 1.0 1 28d9ad350afeaab8027513a3e52ac8d5 2016-04-18 02:22:02 Building Amenities - Garage - Garden - fitness... East 49th Street [Hardwood Floors, No Fee] low 40.7539 6888711 -73.9677 1067e078446a7897d2da493d2f741316 [https://photos.renthop.com/2/6888711_6e660cee... 3275 333 East 49th Street
100013 1.0 4 0 2016-04-28 01:32:41 Beautifully renovated 3 bedroom flex 4 bedroom... West 143rd Street [Pre-War] low 40.8241 6934781 -73.9493 98e13ad4b495b9613cef886d79a6291f [https://photos.renthop.com/2/6934781_1fa4b41a... 3350 500 West 143rd Street

REFINE

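Find missing values. In this dataset they do not show up as nulls, so isnull() reports nothing. They show up as empty strings and empty lists instead.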

In [8]:
df.isnull().sum()
Out[8]:
bathrooms          0
bedrooms           0
building_id        0
created            0
description        0
display_address    0
features           0
interest_level     0
latitude           0
listing_id         0
longitude          0
manager_id         0
photos             0
price              0
street_address     0
dtype: int64
In [9]:
(df == '').sum()
Out[9]:
bathrooms             0
bedrooms              0
building_id           0
created               0
description        1446
display_address     135
features              0
interest_level        0
latitude              0
listing_id            0
longitude             0
manager_id            0
photos                0
price                 0
street_address       10
dtype: int64
In [10]:
# Count the number of rows where the features column is an empty list
df.features.map(lambda x: x==[]).sum()
Out[10]:
3218
In [11]:
df.applymap(lambda x: x == '' or x==[]).sum()
Out[11]:
bathrooms             0
bedrooms              0
building_id           0
created               0
description        1446
display_address     135
features           3218
interest_level        0
latitude              0
listing_id            0
longitude             0
manager_id            0
photos             3615
price                 0
street_address       10
dtype: int64
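
The same check can be tried on a tiny made-up DataFrame, which makes the counts easy to verify by hand. This sketch applies the test column by column with Series.map, as an alternative to applymap, which recent pandas releases deprecate. The toy data and the is_empty helper are assumptions for illustration.

```python
import pandas as pd

# Made-up rows that imitate the kinds of columns in the rent data.
toy = pd.DataFrame({
    "description": ["Sunny flat", "", "Near park"],
    "features": [["Doorman"], [], []],
    "price": [3000, 5465, 2850],
})

def is_empty(value):
    """Return True for an empty string or an empty list."""
    return isinstance(value, (str, list)) and len(value) == 0

# Apply the check to each column, then count the True values per column.
empty_counts = toy.apply(lambda col: col.map(is_empty)).sum()

print(empty_counts)
```
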

EXPLORE

In [12]:
df.columns
Out[12]:
Index(['bathrooms', 'bedrooms', 'building_id', 'created', 'description',
       'display_address', 'features', 'interest_level', 'latitude',
       'listing_id', 'longitude', 'manager_id', 'photos', 'price',
       'street_address'],
      dtype='object')
In [13]:
df.bedrooms.hist()
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x133bbee10>
In [17]:
df.bedrooms.plot(kind="box")
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ddcfe10>
In [18]:
df.bedrooms.value_counts()
Out[18]:
1    15752
2    14623
0     9475
3     7276
4     1929
5      247
6       46
8        2
7        2
Name: bedrooms, dtype: int64
In [20]:
df.bedrooms.value_counts().plot(kind="bar")
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x11daefc88>
In [21]:
df.shape
Out[21]:
(49352, 15)

Problem: Plot price vs. bedrooms.

Problem: Plot price vs. bedrooms and color it by interest_level.

In [23]:
df.plot(kind="scatter", x='bedrooms', y='price', alpha=0.5)
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x11dac0780>
In [27]:
# remove the points where price is too high
df[df.price < 100000].plot(kind="scatter", x='bedrooms', y='price', alpha=0.5);
In [28]:
# use log scale for y
df.plot(kind="scatter", x='bedrooms', y='price', alpha=0.5, logy=True);
In [31]:
df[df.price < 100000].boxplot(column='price', by='bedrooms');
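
For the second problem, coloring the points by interest_level, one possible approach is to draw each level as its own scatter layer. This is only a sketch: the toy rows and the color choices are made up, and with the real data the same loop would run on df instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up listings; the real data comes from the rent-data file loaded above.
toy = pd.DataFrame({
    "bedrooms": [1, 2, 3, 1, 0, 2],
    "price": [2850, 5465, 3000, 3275, 2100, 4100],
    "interest_level": ["high", "low", "medium", "low", "high", "medium"],
})

# Hypothetical color for each interest level.
colors = {"low": "tab:blue", "medium": "tab:orange", "high": "tab:red"}

fig, ax = plt.subplots()
for level, group in toy.groupby("interest_level"):
    # One scatter layer per level, so that each gets its own legend entry.
    ax.scatter(group.bedrooms, group.price,
               color=colors[level], alpha=0.5, label=level)

ax.set_yscale("log")   # log scale for price, as in the plot above
ax.set_xlabel("bedrooms")
ax.set_ylabel("price")
ax.legend(title="interest_level")
```
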