Practical Machine Learning - Day 1

VMware Bangalore
June 18-20, 2018

Amit Kapoor • Anand Chitipothu • Bargava Subramanian

Notes of this workshop are available online at:
https://bit.ly/vmware-ml


1 + 2

3

Write any Python code and press Shift + Enter to execute it.

There are code and markdown cells. The markdown cell is used to write text like this.

You can change the cell type from the Cell menu at the top, or press ESC followed by m for a markdown cell and ESC followed by y for a code cell.

1 + 2

3

Simple formatting tips:

# Heading 1 -- Heading 1

## Heading 2 -- Heading 2

**bold text** -- bold text

*italic text* -- italic text

Indent a line with 4 spaces at the beginning to write something verbatim:

This is
  verbatim text

Data Science Libraries

We are going to use the following data science libraries:

numpy - numerical computation
pandas - spreadsheet for hackers
matplotlib - plotting graphs
scikit-learn (sklearn) - machine learning

import numpy as np

x = np.array([1, 2, 3])

x

array([1, 2, 3])

x.dtype

dtype('int64')

type(x)

numpy.ndarray

x.shape

(3,)

d = np.array([[0.1, 0.2, 0.3],
              [1.1, 1.2, 1.3]])

d.dtype

dtype('float64')

d.shape

(2, 3)

d3 = np.array([d, d, d, d])

d3

array([[[0.1, 0.2, 0.3],
        [1.1, 1.2, 1.3]],

       [[0.1, 0.2, 0.3],
        [1.1, 1.2, 1.3]],

       [[0.1, 0.2, 0.3],
        [1.1, 1.2, 1.3]],

       [[0.1, 0.2, 0.3],
        [1.1, 1.2, 1.3]]])

d3.shape

(4, 2, 3)

One of the best things about numpy is that it supports vectorized operations: arithmetic is applied to every element of the array at once.

d

array([[0.1, 0.2, 0.3],
       [1.1, 1.2, 1.3]])

d + 1

array([[1.1, 1.2, 1.3],
       [2.1, 2.2, 2.3]])

1 / d

array([[10.        ,  5.        ,  3.33333333],
       [ 0.90909091,  0.83333333,  0.76923077]])

Numpy operations are much faster than the equivalent Python loops, as the timings below show.

x1 = np.array(range(10000000))
x2 = list(range(10000000))

%%time
y1 = 10 * x1

CPU times: user 38 ms, sys: 10.7 ms, total: 48.7 ms
Wall time: 46.4 ms

%%time
y2 = [10*a for a in x2]

CPU times: user 1.05 s, sys: 504 ms, total: 1.55 s
Wall time: 1.74 s
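The %%time magic only works inside IPython/Jupyter. As a rough sketch, the same comparison can be run in plain Python with the standard timeit module. The array size and the number of repetitions below are arbitrary choices for illustration, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

n = 1_000_000  # smaller than the 10,000,000 used above, to keep the run short
arr = np.arange(n)
lst = list(range(n))

# Average time of one vectorized multiplication vs. one list comprehension.
runs = 3
t_numpy = timeit.timeit(lambda: 10 * arr, number=runs) / runs
t_list = timeit.timeit(lambda: [10 * a for a in lst], number=runs) / runs

print(f"numpy: {t_numpy * 1000:.2f} ms per run")
print(f"list:  {t_list * 1000:.2f} ms per run")

# Both approaches compute the same values.
same = (10 * arr).tolist() == [10 * a for a in lst]
print(same)
```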

Q: Can we use * operator on regular lists?

Yes, but it does something completely different: it repeats the sequence instead of multiplying each element.

5 * [1, 2, 3]

[1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]

"hello" * 3

'hellohellohello'
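To make the difference concrete, here is a small side-by-side sketch. The variable names are only for illustration.

```python
import numpy as np

values = [1, 2, 3]

repeated = 2 * values                  # list repetition
scaled = 2 * np.array(values)          # elementwise multiplication
scaled_list = [2 * v for v in values]  # pure-Python equivalent of the numpy line

print(repeated)      # [1, 2, 3, 1, 2, 3]
print(scaled)        # [2 4 6]
print(scaled_list)   # [2, 4, 6]
```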

Boolean Indexing

d

array([[0.1, 0.2, 0.3],
       [1.1, 1.2, 1.3]])

d > 1

array([[False, False, False],
       [ True,  True,  True]])

# count number of elements greater than 1
np.sum(d > 1)

3

# all elements greater than 1
d[d > 1]

array([1.1, 1.2, 1.3])

# set all the elements less than 1 to 0
d[d < 1] = 0

d

array([[0. , 0. , 0. ],
       [1.1, 1.2, 1.3]])
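Two details of the cells above are worth spelling out. Summing a boolean mask counts the True values because True behaves as 1 and False as 0. And the assignment d[d < 1] = 0 changes d in place. A minimal sketch of a non-destructive alternative, assuming np.where is acceptable here:

```python
import numpy as np

d = np.array([[0.1, 0.2, 0.3],
              [1.1, 1.2, 1.3]])

mask = d > 1
count = int(mask.sum())          # True counts as 1, False as 0

# np.where builds a new array: 0 where the condition holds, d's value elsewhere.
clipped = np.where(d < 1, 0, d)

print(count)      # 3
print(clipped)
print(d[0, 0])    # 0.1, the original array is unchanged
```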

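# Note: scipy.misc.face is no longer available in recent SciPy releases;
# there, use `from scipy import datasets` and `datasets.face(gray=True)` instead.
from scipy import misc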

face = misc.face(gray=True)

face

array([[114, 130, 145, ..., 119, 129, 137],
       [ 83, 104, 123, ..., 118, 134, 146],
       [ 68,  88, 109, ..., 119, 134, 145],
       ...,
       [ 98, 103, 116, ..., 144, 143, 143],
       [ 94, 104, 120, ..., 143, 142, 142],
       [ 94, 106, 119, ..., 142, 141, 140]], dtype=uint8)

face.shape

(768, 1024)

face.dtype

dtype('uint8')

import matplotlib.pyplot as plt
%matplotlib inline

plt.imshow(face, cmap='gray')

<matplotlib.image.AxesImage at 0x157187710>

plt.imshow(255-face, cmap='gray')

<matplotlib.image.AxesImage at 0x11182a2e8>

face0 = face.copy() # keep a copy of the original

x = np.arange(256).reshape(16, 16)

x

array([[  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
         13,  14,  15],
       [ 16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,
         29,  30,  31],
       [ 32,  33,  34,  35,  36,  37,  38,  39,  40,  41,  42,  43,  44,
         45,  46,  47],
       [ 48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,
         61,  62,  63],
       [ 64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,
         77,  78,  79],
       [ 80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,
         93,  94,  95],
       [ 96,  97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108,
        109, 110, 111],
       [112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124,
        125, 126, 127],
       [128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140,
        141, 142, 143],
       [144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
        157, 158, 159],
       [160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172,
        173, 174, 175],
       [176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188,
        189, 190, 191],
       [192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204,
        205, 206, 207],
       [208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220,
        221, 222, 223],
       [224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236,
        237, 238, 239],
       [240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252,
        253, 254, 255]])

plt.imshow(x, cmap='gray')

<matplotlib.image.AxesImage at 0x156f8f7f0>

Problem: Convert all pixels darker than value 200 to black in the face image and plot it.

face[face<200] = 0

plt.imshow(face, cmap='gray')

<matplotlib.image.AxesImage at 0x1570c0e10>

# get the original
face = face0.copy()

plt.imshow(face.transpose(), cmap='gray')

<matplotlib.image.AxesImage at 0x1570af198>

face.shape

(768, 1024)

plt.imshow(face[:400, 600:1024], cmap='gray')

<matplotlib.image.AxesImage at 0x1571eaef0>

h, w = face.shape

h

768

w

1024

plt.imshow(face, cmap='gray')

<matplotlib.image.AxesImage at 0x15706e668>

face[:h//2, :w//2] = 0
plt.imshow(face, cmap='gray')

<matplotlib.image.AxesImage at 0x155f09da0>

4 / 2

2.0

4 // 2

2

Problem: Swap the top-right quarter with the bottom-left quarter of the image.

face = face0.copy()
TR = face[:h//2, w//2:]
BL = face[h//2:, :w//2]
plt.imshow(TR, cmap="gray")
plt.show()
plt.imshow(BL, cmap="gray")

<matplotlib.image.AxesImage at 0x1584dc978>

TR = face[:h//2, w//2:].copy()
BL = face[h//2:, :w//2].copy()
face[:h//2, w//2:] = BL
face[h//2:, :w//2] = TR
plt.imshow(face, cmap="gray")

<matplotlib.image.AxesImage at 0x1582631d0>

face = face0.copy()
face[:h//2, w//2:], face[h//2:, :w//2] = (
    face[h//2:, :w//2].copy(), 
    face[:h//2, w//2:].copy())
plt.imshow(face, cmap="gray")

<matplotlib.image.AxesImage at 0x1580045f8>
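The .copy() calls in the swap matter because slicing a numpy array returns a view that shares memory with the original, not a separate array. A small sketch on a one-dimensional array (the names are illustrative) shows the difference:

```python
import numpy as np

a = np.arange(6)            # array([0, 1, 2, 3, 4, 5])
left_view = a[:3]           # a view: shares memory with a
left_copy = a[:3].copy()    # an independent copy

a[:3] = a[3:]               # overwrite the left half of a

print(a)            # [3 4 5 3 4 5]
print(left_view)    # [3 4 5], the view follows the change
print(left_copy)    # [0 1 2], the copy keeps the old values
print(np.shares_memory(a, left_view))   # True
```

Without the copies, the first assignment in the swap would overwrite one quarter before its old pixels had been saved.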

Pandas

import pandas as pd

Pandas has two important data structures: the Series and the DataFrame.

A DataFrame is like a spreadsheet, and a Series is like a single column of it.

x = pd.Series([1.1, 2.2, 3.3, 4.4])

x

0    1.1
1    2.2
2    3.3
3    4.4
dtype: float64

x[0]

1.1

x.shape

(4,)

x.dtype

dtype('float64')

x + 10

0    11.1
1    12.2
2    13.3
3    14.4
dtype: float64

Every Series has an index. By default it is 0, 1, 2, ..., but you can supply your own labels.

x = pd.Series([1.1, 2.2, 3.3, 4.4], index=["a", "b", "c", "d"])

x

a    1.1
b    2.2
c    3.3
d    4.4
dtype: float64

x['a']

1.1
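The index is more than a display label: arithmetic between two Series matches values by label, not by position. A minimal sketch with two made-up Series:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
b = pd.Series([10.0, 20.0, 30.0], index=["c", "b", "a"])

total = a + b   # values are paired up by label

print(total["a"])   # 31.0, i.e. 1.0 + 30.0
print(total["c"])   # 13.0, i.e. 3.0 + 10.0

# .loc looks up by label, .iloc by position.
print(a.loc["b"], a.iloc[1])   # 2.0 2.0
```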

The DataFrame

data = [[1, 1],
        [2, 4],
        [3, 9],
        [4, 16]]
df = pd.DataFrame(data)

df

df = pd.DataFrame(data, columns=["x", "y"])

df

df.x

0    1
1    2
2    3
3    4
Name: x, dtype: int64

df['x']

0    1
1    2
2    3
3    4
Name: x, dtype: int64

Problem: Add a new column z to the dataframe df with value y - x in each row.

df.y - df.x

0     0
1     2
2     6
3    12
dtype: int64

df['z'] = df.y - df.x
# df['z'] = df['y'] - df['x']

df
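Assigning to df['z'] modifies df in place. If the original frame should stay untouched, one alternative is DataFrame.assign, which returns a new DataFrame. A small sketch using the same data (df2 is a name chosen for this example):

```python
import pandas as pd

df = pd.DataFrame([[1, 1], [2, 4], [3, 9], [4, 16]], columns=["x", "y"])

# assign returns a copy with the extra column; df itself is not changed.
df2 = df.assign(z=df["y"] - df["x"])

print(list(df.columns))    # ['x', 'y']
print(list(df2.columns))   # ['x', 'y', 'z']
print(df2["z"].tolist())   # [0, 2, 6, 12]
```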

Example: Iris Dataset

url = "https://notes.pipal.in/2018/vmware-ml/iris.csv"

df = pd.read_csv(url)

df.head()

df.Name.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

df.Name.nunique()

3

df.Name.value_counts()

Iris-virginica     50
Iris-setosa        50
Iris-versicolor    50
Name: Name, dtype: int64

Q: What happens if we write df.Name.unique without the parentheses?

x = "hello"

x.upper()

'HELLO'

x.upper

<function str.upper>

class Foo:
    def foo(self): pass
    def __repr__(self): return "<The Foo object>"

f = Foo()

f.foo

<bound method Foo.foo of <The Foo object>>

So df.Name.unique, without the parentheses, does not call the method. It evaluates to the bound method object, whose printed form also includes the Series it is bound to, so the data gets printed as well.
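The same behavior can be checked on a tiny Series. This is only an illustrative sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["a", "b", "a"])

method = s.unique      # no parentheses: a reference to the bound method
values = s.unique()    # with parentheses: the method is called

print(callable(method))   # True
print(list(values))       # ['a', 'b']
print(list(method()))     # ['a', 'b'], the stored method can be called later
```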

df.describe()

df.boxplot()

<matplotlib.axes._subplots.AxesSubplot at 0x15c3050f0>

df.SepalLength.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x15c304710>

df.hist(figsize=(12, 8), sharex=True, sharey=True);

df.hist(column='PetalLength', by='Name', sharex=True, sharey=True);

df.boxplot(column='PetalLength', by='Name');

Problem: Find the ratio of PetalLength to PetalWidth and plot a histogram grouped by Name.

df['Ratio'] = df.PetalLength / df.PetalWidth

df.hist(column='Ratio', by='Name');

df.plot(kind='scatter', x='PetalLength', y='PetalWidth');

df.Name.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

names = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

df['iname'] = df.Name.map(names.get)

df.head()

names.get('Iris-setosa')

0
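Series.map also accepts the dictionary directly, so passing names.get is not required. If the exact numbers do not matter, pd.factorize can build the codes automatically. A small sketch on a made-up Series of species names:

```python
import pandas as pd

species = pd.Series(["Iris-setosa", "Iris-versicolor",
                     "Iris-virginica", "Iris-setosa"])
mapping = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

by_dict = species.map(mapping)            # a dict can be passed as-is

# factorize assigns codes in order of first appearance.
codes, uniques = pd.factorize(species)

print(by_dict.tolist())   # [0, 1, 2, 0]
print(codes.tolist())     # [0, 1, 2, 0]
print(list(uniques))
```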

plt.rcParams['figure.subplot.bottom'] = 0.15
#plt.rcParams

df.plot(kind='scatter', x='PetalLength', y='PetalWidth', c='iname', 
        cmap='viridis');

House Rent Dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

FRAME

Given a set of features describing my requirements for a house,
how much rent will I have to pay?

ACQUIRE

url = "https://notes.pipal.in/2018/vmware-ml/rent-data.json.zip"

df = pd.read_json(url)

df.columns

Index(['bathrooms', 'bedrooms', 'building_id', 'created', 'description',
       'display_address', 'features', 'interest_level', 'latitude',
       'listing_id', 'longitude', 'manager_id', 'photos', 'price',
       'street_address'],
      dtype='object')

df.head()

REFINE

Find missing values.

df.isnull().sum()

bathrooms          0
bedrooms           0
building_id        0
created            0
description        0
display_address    0
features           0
interest_level     0
latitude           0
listing_id         0
longitude          0
manager_id         0
photos             0
price              0
street_address     0
dtype: int64

(df == '').sum()

bathrooms             0
bedrooms              0
building_id           0
created               0
description        1446
display_address     135
features              0
interest_level        0
latitude              0
listing_id            0
longitude             0
manager_id            0
photos                0
price                 0
street_address       10
dtype: int64

# Count the rows where the features column holds an empty list
df.features.map(lambda x: x==[]).sum()

3218

# Note: newer pandas versions deprecate applymap in favor of DataFrame.map
df.applymap(lambda x: x == '' or x==[]).sum()

bathrooms             0
bedrooms              0
building_id           0
created               0
description        1446
display_address     135
features           3218
interest_level        0
latitude              0
listing_id            0
longitude             0
manager_id            0
photos             3615
price                 0
street_address       10
dtype: int64
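The empty-value check can also be wrapped in a small helper and applied column by column, which avoids applymap altogether. The sketch below runs on a few made-up rows standing in for the rent data; is_empty is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def is_empty(value):
    """Return True for an empty string or an empty list."""
    return isinstance(value, (str, list)) and len(value) == 0


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "description": ["Nice flat", "", "Sunny"],
    "features": [[], ["Elevator"], []],
    "price": [3000, 5465, 2850],
})

# For each column, map every value to True/False and count the True values.
empty_counts = sample.apply(lambda col: col.map(is_empty).sum())

print(empty_counts)
```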

EXPLORE

df.columns

Index(['bathrooms', 'bedrooms', 'building_id', 'created', 'description',
       'display_address', 'features', 'interest_level', 'latitude',
       'listing_id', 'longitude', 'manager_id', 'photos', 'price',
       'street_address'],
      dtype='object')

df.bedrooms.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x133bbee10>

df.bedrooms.plot(kind="box")

<matplotlib.axes._subplots.AxesSubplot at 0x11ddcfe10>

df.bedrooms.value_counts()

1    15752
2    14623
0     9475
3     7276
4     1929
5      247
6       46
8        2
7        2
Name: bedrooms, dtype: int64

df.bedrooms.value_counts().plot(kind="bar")

<matplotlib.axes._subplots.AxesSubplot at 0x11daefc88>

df.shape

(49352, 15)

Problem: Plot price vs. bedrooms.

Problem: Plot price vs. bedrooms and color it by interest_level.

df.plot(kind="scatter", x='bedrooms', y='price', alpha=0.5)

<matplotlib.axes._subplots.AxesSubplot at 0x11dac0780>

# remove the points where price is too high
df[df.price < 100000].plot(kind="scatter", x='bedrooms', y='price', alpha=0.5);

# use log scale for y
df.plot(kind="scatter", x='bedrooms', y='price', alpha=0.5, logy=True);

df[df.price < 100000].boxplot(column='price', by='bedrooms');
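The second problem above, coloring the points by interest_level, has no solution in these notes. One possible sketch is to draw one scatter series per level so that matplotlib assigns a color to each and a legend can be shown. The rows below are made up for illustration; with the real data, the workshop's df would take the place of sample.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the rent data.
sample = pd.DataFrame({
    "bedrooms": [1, 2, 3, 1, 0, 2],
    "price": [3000, 5465, 2850, 3275, 2400, 3350],
    "interest_level": ["medium", "low", "high", "low", "high", "medium"],
})

fig, ax = plt.subplots()

# One scatter call per interest level; each call gets its own color.
for level, group in sample.groupby("interest_level"):
    ax.scatter(group["bedrooms"], group["price"], label=level, alpha=0.5)

ax.set_yscale("log")   # same idea as logy=True above
ax.set_xlabel("bedrooms")
ax.set_ylabel("price")
ax.legend(title="interest_level")

labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(labels)
```

In a notebook, leave out the matplotlib.use("Agg") line so that the figure is displayed inline.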

Output of df.describe() on the Iris dataset:

       SepalLength  SepalWidth  PetalLength  PetalWidth
count   150.000000  150.000000   150.000000  150.000000
mean      5.843333    3.054000     3.758667    1.198667
std       0.828066    0.433594     1.764420    0.763161
min       4.300000    2.000000     1.000000    0.100000
25%       5.100000    2.800000     1.600000    0.300000
50%       5.800000    3.000000     4.350000    1.300000
75%       6.400000    3.300000     5.100000    1.800000
max       7.900000    4.400000     6.900000    2.500000

Output of df.head() on the house rent dataset:

	bathrooms	bedrooms	building_id	created	description	display_address	features	interest_level	latitude	listing_id	longitude	manager_id	photos	price	street_address
10	1.5	3	53a5b119ba8f7b61d4e010512e0dfc85	2016-06-24 07:54:24	A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...	Metropolitan Avenue	[]	medium	40.7145	7211212	-73.9425	5ba989232d0489da1b5f2c45f6688adc	[https://photos.renthop.com/2/7211212_1ed4542e...	3000	792 Metropolitan Avenue
10000	1.0	2	c5c8a357cba207596b04d1afd1e4f130	2016-06-12 12:19:27		Columbus Avenue	[Doorman, Elevator, Fitness Center, Cats Allow...	low	40.7947	7150865	-73.9667	7533621a882f71e25173b27e3139d83d	[https://photos.renthop.com/2/7150865_be3306c5...	5465	808 Columbus Avenue
100004	1.0	1	c3ba40552e2120b0acfc3cb5730bb2aa	2016-04-17 03:26:41	Top Top West Village location, beautiful Pre-w...	W 13 Street	[Laundry In Building, Dishwasher, Hardwood Flo...	high	40.7388	6887163	-74.0018	d9039c43983f6e564b1482b273bd7b01	[https://photos.renthop.com/2/6887163_de85c427...	2850	241 W 13 Street
100007	1.0	1	28d9ad350afeaab8027513a3e52ac8d5	2016-04-18 02:22:02	Building Amenities - Garage - Garden - fitness...	East 49th Street	[Hardwood Floors, No Fee]	low	40.7539	6888711	-73.9677	1067e078446a7897d2da493d2f741316	[https://photos.renthop.com/2/6888711_6e660cee...	3275	333 East 49th Street
100013	1.0	4	0	2016-04-28 01:32:41	Beautifully renovated 3 bedroom flex 4 bedroom...	West 143rd Street	[Pre-War]	low	40.8241	6934781	-73.9493	98e13ad4b495b9613cef886d79a6291f	[https://photos.renthop.com/2/6934781_1fa4b41a...	3350	500 West 143rd Street

Output of the first df.head() call on the Iris dataset:

   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa