VMware Bangalore
June 18-20, 2018
Amit Kapoor • Anand Chitipothu • Bargava Subramanian
Notes of this workshop are available online at:
https://bit.ly/vmware-ml
1 + 2
Write any Python code and press Shift + Enter to execute it.
A notebook has code cells and markdown cells. Markdown cells are used to write text like this.
You can change the cell type from the Cell menu at the top, or press ESC then m for a markdown cell and ESC then y for a code cell.
1 + 2
We are going to use the following data science libraries: numpy, pandas, and matplotlib.
import numpy as np
x = np.array([1, 2, 3])
x
x.dtype
type(x)
x.shape
d = np.array([[0.1, 0.2, 0.3],
[1.1, 1.2, 1.3]])
d.dtype
d.shape
d3 = np.array([d, d, d, d])
d3
d3.shape
One of the best things about numpy is that it allows us to do vector operations.
d
d + 1
1 / d
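As a rough illustration of what "vector operations" means here, the sketch below applies a scalar and a 1-D array to the same 2-D array. The names `a` and `b` and the values in `b` are illustrative and not part of the workshop notes.

```python
import numpy as np

a = np.array([[0.1, 0.2, 0.3],
              [1.1, 1.2, 1.3]])
b = np.array([10, 20, 30])   # illustrative row vector, not from the notes

# A scalar is applied to every element
plus_one = a + 1

# The 1-D array b is broadcast across each row of a
scaled = a * b

# Arrays of the same shape are combined element by element
total = a + a

print(plus_one)
print(scaled)
print(total)
```

No explicit loop is written in any of these cases; numpy applies the operation to every element.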
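Numpy operations are very efficient: they run in compiled code over the whole array, so they are usually much faster than an equivalent Python loop.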
x1 = np.array(range(10000000))
x2 = list(range(10000000))
%%time
y1 = 10 * x1
%%time
y2 = [10*a for a in x2]
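The `%%time` magic only works inside a notebook. A minimal sketch of the same comparison in plain Python, assuming `time.perf_counter` as the timer and a smaller array size (`n`) than the one used above, might look like this:

```python
import time
import numpy as np

n = 1_000_000  # smaller than the 10 million used above, to keep the run short
arr = np.arange(n)
lst = list(range(n))

start = time.perf_counter()
y_arr = 10 * arr                 # one vectorised operation in compiled code
t_numpy = time.perf_counter() - start

start = time.perf_counter()
y_lst = [10 * a for a in lst]    # Python-level loop over every element
t_list = time.perf_counter() - start

print(f"numpy: {t_numpy:.4f}s, list comprehension: {t_list:.4f}s")
```

The exact timings depend on the machine, so no specific numbers are claimed here.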
Q: Can we use * operator on regular lists?
Yes, but that is a completely different operation: on a list or a string, * repeats the sequence instead of multiplying each element.
5 * [1, 2, 3]
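To put the two behaviours side by side, here is a small sketch; the variable names are illustrative.

```python
import numpy as np

items = [1, 2, 3]

repeated = 3 * items                  # list repetition
scaled = 3 * np.array(items)          # elementwise multiplication
scaled_list = [3 * v for v in items]  # plain-Python equivalent of the numpy result

print(repeated)      # [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(scaled)        # [3 6 9]
print(scaled_list)   # [3, 6, 9]
```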
"hello" * 3
d
d > 1
# count number of elements greater than 1
np.sum(d > 1)
# all elements greater than 1
d[d > 1]
# set all the elements less than 1 to 0
d[d < 1] = 0
d
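The cells above use a boolean mask in three ways: counting, selecting, and assigning. The sketch below repeats them on a fresh array and adds `np.where` as one possible way to get the same result as the assignment without modifying the original array; using `np.where` is a suggestion, not something the notes do.

```python
import numpy as np

d = np.array([[0.1, 0.2, 0.3],
              [1.1, 1.2, 1.3]])

mask = d > 1                 # boolean array with the same shape as d
count = int(mask.sum())      # True counts as 1, False as 0
selected = d[mask]           # 1-D array holding only the matching elements

# np.where builds a new array and leaves d unchanged
clipped = np.where(d < 1, 0, d)

print(count)
print(selected)
print(clipped)
```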
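# scipy.misc is deprecated in newer SciPy releases; there, the same image
# is available as scipy.datasets.face(gray=True)
from scipy import misc
face = misc.face(gray=True)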
face
face.shape
face.dtype
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(face, cmap='gray')
plt.imshow(255-face, cmap='gray')
face0 = face.copy() # keep a copy of the original
x = np.arange(256).reshape(16, 16)
x
plt.imshow(x, cmap='gray')
Problem: Convert all pixels darker than value 200 to black in the face image and plot it.
face[face<200] = 0
plt.imshow(face, cmap='gray')
# get the original
face = face0.copy()
plt.imshow(face.transpose(), cmap='gray')
face.shape
plt.imshow(face[:400, 600:1024], cmap='gray')
h, w = face.shape
h
w
plt.imshow(face, cmap='gray')
face[:h//2, :w//2] = 0
plt.imshow(face, cmap='gray')
4 / 2
4 // 2
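The two cells above contrast true division with floor division. The distinction matters for the slicing used here, because slice bounds such as `h//2` must be integers. A minimal sketch, with an assumed height `h` and a plain list standing in for the image rows:

```python
h = 768                  # assumed image height, for illustration only

half_true = h / 2        # true division always returns a float
half_floor = h // 2      # floor division of two ints returns an int

rows = list(range(h))
top_half = rows[:half_floor]     # works: the slice bound is an int

try:
    rows[:half_true]             # a float slice bound is rejected
    float_slice_ok = True
except TypeError:
    float_slice_ok = False

print(half_true, half_floor, len(top_half), float_slice_ok)
```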
Problem: Swap the top-right quarter with the bottom-left quarter of the image.
face = face0.copy()
TR = face[:h//2, w//2:]
BL = face[h//2:, :w//2]
plt.imshow(TR, cmap="gray")
plt.show()
plt.imshow(BL, cmap="gray")
# Plain slices are views into face, so take copies before overwriting either quarter
TR = face[:h//2, w//2:].copy()
BL = face[h//2:, :w//2].copy()
face[:h//2, w//2:] = BL
face[h//2:, :w//2] = TR
plt.imshow(face, cmap="gray")
face = face0.copy()
face[:h//2, w//2:], face[h//2:, :w//2] = (
face[h//2:, :w//2].copy(),
face[:h//2, w//2:].copy())
plt.imshow(face, cmap="gray")
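The swap relies on `.copy()` because a numpy slice is a view that shares memory with the original array. The sketch below shows the difference on a small stand-in array (`img` is illustrative, not the face image).

```python
import numpy as np

img = np.arange(16).reshape(4, 4)    # small stand-in for the face image
h, w = img.shape

view = img[:h//2, w//2:]             # a view: shares memory with img
snapshot = img[:h//2, w//2:].copy()  # an independent copy

img[:h//2, w//2:] = 0                # overwrite the top-right quarter

shares = np.shares_memory(view, img)

print(view)       # reflects the overwrite
print(snapshot)   # still holds the original values
print(shares)
```

Without the copies, the second assignment in the swap would read data that the first assignment had already overwritten.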
import pandas as pd
Pandas has two important data structures: the Series and the DataFrame.
A DataFrame is like a spreadsheet and a Series is like a single column of it.
x = pd.Series([1.1, 2.2, 3.3, 4.4])
x
x[0]
x.shape
x.dtype
x + 10
Every Series has an index. By default it is 0, 1, 2, ..., but custom labels can be supplied.
x = pd.Series([1.1, 2.2, 3.3, 4.4], index=["a", "b", "c", "d"])
x
x['a']
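Once a Series has labels, values can be looked up either by label or by position. A brief sketch using the `.loc` and `.iloc` accessors, which the notes do not use but which make the intent explicit:

```python
import pandas as pd

s = pd.Series([1.1, 2.2, 3.3, 4.4], index=["a", "b", "c", "d"])

by_label = s.loc["b"]          # look up by index label
by_position = s.iloc[1]        # look up by integer position

label_slice = s.loc["a":"c"]   # label slices include the end label
pos_slice = s.iloc[0:2]        # position slices exclude the end, as in Python

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```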
data = [[1, 1],
[2, 4],
[3, 9],
[4, 16]]
df = pd.DataFrame(data)
df
df = pd.DataFrame(data, columns=["x", "y"])
df
df.x
df['x']
Problem: Add a new column z to the dataframe df with value y - x in each row.
df.y - df.x
df['z'] = df.y - df.x
# df['z'] = df['y'] - df['x']
df
url = "https://notes.pipal.in/2018/vmware-ml/iris.csv"
df = pd.read_csv(url)
df.head()
df.Name.unique()
df.Name.nunique()
df.Name.value_counts()
Q: What happens if we write df.Name.unique without the parentheses?
x = "hello"
x.upper()
x.upper
class Foo:
    def foo(self): pass
    def __repr__(self): return "<The Foo object>"
f = Foo()
f.foo
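So df.Name.unique without the parentheses evaluates to a bound method of the Series instead of calling it. The repr of a bound method includes the repr of the object it is bound to, which is why the Series data is printed as well.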
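The same point can be checked on a small Series; `s` below is an illustrative stand-in for `df.Name`.

```python
import pandas as pd

s = pd.Series(["x", "y", "x"])

method = s.unique      # no parentheses: a bound method object, nothing is computed
result = s.unique()    # with parentheses: the method runs and returns the unique values

print(callable(method))
print(list(result))
print(list(method()))  # calling the stored method later gives the same values
```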
df.describe()
df.boxplot()
df.SepalLength.hist()
df.hist(figsize=(12, 8), sharex=True, sharey=True);
df.hist(column='PetalLength', by='Name', sharex=True, sharey=True);
df.boxplot(column='PetalLength', by='Name');
Problem: Find the ratio of PetalLength to PetalWidth and plot a histogram grouped by Name.
df['Ratio'] = df.PetalLength / df.PetalWidth
df.hist(column='Ratio', by='Name');
df.plot(kind='scatter', x='PetalLength', y='PetalWidth');
df.Name.unique()
names = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
df['iname'] = df.Name.map(names.get)
df.head()
names.get('Iris-setosa')
plt.rcParams['figure.subplot.bottom'] = 0.15
#plt.rcParams
df.plot(kind='scatter', x='PetalLength', y='PetalWidth', c='iname',
cmap='viridis');
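The colour column is needed because the scatter plot expects numeric values for `c`. The sketch below shows the dictionary mapping used above next to `pd.factorize`, which is offered here as one possible alternative that derives the integer codes from the data. The Series `species` is illustrative, not the full iris column.

```python
import pandas as pd

species = pd.Series(["Iris-setosa", "Iris-versicolor",
                     "Iris-virginica", "Iris-setosa"])

# Explicit mapping, as in the notes
mapping = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
codes_map = species.map(mapping)

# factorize assigns codes in order of first appearance
codes_fact, uniques = pd.factorize(species)

print(codes_map.tolist())
print(codes_fact.tolist())
print(list(uniques))
```

An explicit dictionary fixes which number each name gets; `pd.factorize` avoids typing the names but ties the codes to the order of the rows.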
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Given a set of features describing my requirements for a house,
how much rent will I have to pay?
url = "https://notes.pipal.in/2018/vmware-ml/rent-data.json.zip"
df = pd.read_json(url)
df.columns
df.head()
Find missing values.
df.isnull().sum()
(df == '').sum()
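# Count the number of values in the features column where the value is an empty list
df.features.map(lambda x: x==[]).sum()
# Count empty strings and empty lists in every column
# (applymap is deprecated since pandas 2.1 in favour of DataFrame.map)
df.applymap(lambda x: x == '' or x==[]).sum()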
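The checks above treat three things as "missing": nulls, empty strings, and empty lists. The following sketch runs the same idea on a small made-up frame. The frame `toy` and the helper `is_empty` are hypothetical; it uses `Series.map` per column, which is one way to write the check.

```python
import pandas as pd

# Hypothetical toy data; the column names mimic the rent dataset
toy = pd.DataFrame({
    "description": ["Sunny flat", "", "Near park"],
    "features": [["Elevator"], [], []],
    "price": [3000, 2500, None],
})

def is_empty(value):
    """Return True for an empty string or an empty list."""
    return isinstance(value, (str, list)) and len(value) == 0

null_counts = toy.isnull().sum()                                 # real nulls (NaN/None)
empty_counts = toy.apply(lambda col: col.map(is_empty).sum())    # '' and []

print(null_counts)
print(empty_counts)
```

Counting the two kinds separately makes it easier to see which columns hide missing data behind empty values.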
df.columns
df.bedrooms.hist()
df.bedrooms.plot(kind="box")
df.bedrooms.value_counts()
df.bedrooms.value_counts().plot(kind="bar")
df.shape
Problem: Plot price vs. bedrooms.
Problem: Plot price vs. bedrooms and color it by interest_level.
df.plot(kind="scatter", x='bedrooms', y='price', alpha=0.5)
# remove the points where price is too high
df[df.price < 100000].plot(kind="scatter", x='bedrooms', y='price', alpha=0.5);
# use log scale for y
df.plot(kind="scatter", x='bedrooms', y='price', alpha=0.5, logy=True);
df[df.price < 100000].boxplot(column='price', by='bedrooms');
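The cells above address the first problem but not the colouring by interest_level. One possible approach is sketched below: draw one scatter per level so that each gets its own colour and legend entry. The toy data, the level names ("low", "medium", "high"), and the colour choices are all assumptions made for illustration; the real values should be checked with `df.interest_level.unique()`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for the rent dataset
toy = pd.DataFrame({
    "bedrooms": [1, 2, 3, 1, 2, 0],
    "price": [2500, 3400, 5200, 2100, 3900, 1800],
    "interest_level": ["low", "medium", "high", "high", "low", "medium"],
})

# Assumed colour for each level
colors = {"low": "tab:blue", "medium": "tab:orange", "high": "tab:red"}

fig, ax = plt.subplots()
for level, group in toy.groupby("interest_level"):
    # One scatter call per level gives each level its own legend entry
    ax.scatter(group.bedrooms, group.price,
               color=colors[level], label=level, alpha=0.5)

ax.set_xlabel("bedrooms")
ax.set_ylabel("price")
ax.set_yscale("log")   # same idea as logy=True above
ax.legend(title="interest_level")
```

In the notebook, replacing `toy` with `df` (or `df[df.price < 100000]`) should give the corresponding plot for the real data, provided the level names match.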