Airwatch Bangalore
May 7-9, 2018
Notes from this workshop are available online at:
https://bit.ly/airwatch-ml
import pandas as pd
import numpy as np
%matplotlib inline
There are three different species of the iris flower, and each species looks slightly different. Can we use machine learning to predict the species of a flower from its measurements?
Pictures are from the Wikipedia page, CC BY-SA.
The iris prediction problem is the hello world of machine learning.
The scikit-learn library comes with this dataset.
from sklearn.datasets import load_iris
iris = load_iris()
iris.keys()
print(iris['DESCR'])
The iris data from scikit-learn is available as numpy arrays. To make our job easier, I've exported them as CSV files so that we can load them as pandas dataframes. I've also divided the dataset into two parts: one to train our model and another to see how well our model is performing.
url_iris_train = "https://notes.pipal.in/2018/airwatch-ml/iris-train.csv"
url_iris_test = "https://notes.pipal.in/2018/airwatch-ml/iris-test.csv"
df_train = pd.read_csv(url_iris_train, index_col=0)
df_test = pd.read_csv(url_iris_test, index_col=0)
df_train.head()
df_train.describe()
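For reference, here is a minimal sketch of how a similar train/test pair could be built from `load_iris`. The exact split behind `iris-train.csv` and `iris-test.csv` is not stated in these notes, so the 70/30 ratio, the stratification, the seed, and the `df_train_sketch`/`df_test_sketch` names are all assumptions.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Column names chosen to match the ones used in these notes.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df = pd.DataFrame(iris.data, columns=columns)
# iris.target holds 0/1/2; iris.target_names turns them into species names.
df["species"] = iris.target_names[iris.target]

# Assumption: a stratified 70/30 split with a fixed seed.
df_train_sketch, df_test_sketch = train_test_split(
    df, test_size=0.3, stratify=df["species"], random_state=0)

print(len(df_train_sketch), len(df_test_sketch))
```

Stratifying keeps all three species represented in both parts, which matters for a dataset this small.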
def predict(petal_length, petal_width):
    # Improve this function
    return "setosa"
def test(dataset):
    predicted = np.array([predict(x1, x2) for x1, x2 in
                          zip(dataset.petal_length, dataset.petal_width)])
    actual = dataset.species
    matched = sum(predicted == actual)
    return matched / len(dataset)
test(df_train)
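The `test` helper computes plain accuracy: the fraction of rows where the prediction matches the actual species. As a rough cross-check, the sketch below compares that calculation with scikit-learn's `accuracy_score` on a made-up four-row table (`toy` is a hypothetical stand-in, not the workshop data).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical four-row stand-in for df_train.
toy = pd.DataFrame({
    "petal_length": [1.4, 4.7, 5.9, 1.3],
    "petal_width": [0.2, 1.4, 2.1, 0.3],
    "species": ["setosa", "versicolor", "virginica", "setosa"],
})

# The placeholder predictor answers "setosa" for every flower.
predicted = np.array(["setosa"] * len(toy))

# Fraction of matching rows, computed by hand and by the library.
manual = (predicted == toy.species).mean()
library = accuracy_score(toy.species, predicted)
print(manual, library)
```

Two of the four rows are setosa, so both numbers come out as 0.5.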
df_train.boxplot()
df_train.boxplot("petal_length", by="species")
df_train.plot(kind="scatter", x="petal_length", y="petal_width")
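The plots suggest that the petal measurements separate the species fairly well, so one way to improve the placeholder predictor is a pair of hand-picked thresholds. This is only a sketch: the name `predict_by_rule` and the cut-offs 2.5 and 1.75 are assumptions chosen for illustration, and they should be adjusted after reading the plots.

```python
def predict_by_rule(petal_length, petal_width):
    # Hypothetical thresholds picked by eye from the plots.
    if petal_length < 2.5:
        return "setosa"
    if petal_width < 1.75:
        return "versicolor"
    return "virginica"

print(predict_by_rule(1.4, 0.2))
print(predict_by_rule(4.5, 1.5))
print(predict_by_rule(6.0, 2.2))
```

Swapping this body into `predict` and rerunning `test(df_train)` shows how far simple rules can go before a model is needed.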
df_train.head()
df_train.species.unique()
species = {"setosa": 0, "versicolor": 1, "virginica": 2}
species.get("setosa")
df_train['ispecies'] = df_train.species.map(species.get)
df_train.head()
df_train.plot(kind="scatter",
              x="petal_length", y="petal_width",
              c="ispecies", cmap="viridis")
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=2)
model.fit(df_train[["petal_length", "petal_width"]], df_train.ispecies)
df_train.head()
model.predict([[1.4, 0.2]])[0]
def predict(petal_length, petal_width):
    data = [[petal_length, petal_width]]
    y = model.predict(data)[0]
    return ["setosa", "versicolor", "virginica"][y]
predict(1.4, 0.2)
test(df_train)
test(df_test)
Install the modelvis library.
!pip install -q -U modelvis
import modelvis
modelvis.__version__
modelvis.render_tree(model,
                     feature_names=["petal_length", "petal_width"],
                     class_names=["setosa", "versicolor", "virginica"])
modelvis.print_tree_as_code(model)
modelvis.plot_decision_boundaries(
    model,
    X=df_train[["petal_length", "petal_width"]],
    y=df_train.ispecies,
    class_names=["setosa", "versicolor", "virginica"],
    show_input=True
)
DecisionTreeClassifier??
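# min_impurity_split was removed in scikit-learn 1.0; the defaults already let the tree grow until its leaves are pure or max_depth is reached.
model = DecisionTreeClassifier(max_depth=20, min_samples_split=2)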
model.fit(df_train[["petal_length", "petal_width"]], df_train.ispecies)
modelvis.plot_decision_boundaries(
    model,
    X=df_train[["petal_length", "petal_width"]],
    y=df_train.ispecies,
    class_names=["setosa", "versicolor", "virginica"],
    show_input=True
)
modelvis.plot_decision_boundaries(
    model,
    X=df_train[["petal_length", "petal_width"]],
    y=df_train.ispecies,
    class_names=["setosa", "versicolor", "virginica"],
    show_input=True, probability=True
)
modelvis.render_tree(model,
                     feature_names=["petal_length", "petal_width"],
                     class_names=["setosa", "versicolor", "virginica"])
df_train[(df_train.petal_width > 1.75) & (df_train.petal_length <= 4.85)]
from sklearn.linear_model import LogisticRegression
model2 = LogisticRegression()
X = df_train[["petal_length", "petal_width"]]
y = df_train.ispecies
model2.fit(X, y)
model2.coef_
model2.intercept_
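To see what `coef_` and `intercept_` mean, the sketch below rebuilds the predictions of a logistic regression by hand: each class gets one linear score, and the class with the highest score wins. It assumes the full `load_iris` data (petal columns only) instead of the workshop CSVs, and `clf`, `X_all`, and `y_all` are names introduced just for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_all = iris.data[:, 2:4]   # petal length and petal width
y_all = iris.target

# max_iter is raised so the default solver has room to converge.
clf = LogisticRegression(max_iter=1000).fit(X_all, y_all)

# One linear score per class: X @ coef_.T + intercept_
scores = X_all @ clf.coef_.T + clf.intercept_
manual_pred = clf.classes_[scores.argmax(axis=1)]

print(np.array_equal(manual_pred, clf.predict(X_all)))
print(np.allclose(scores, clf.decision_function(X_all)))
```

Because every score is a linear function of the two features, the boundaries between classes are straight lines.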
modelvis.plot_decision_boundaries(
    model2,
    X=X,
    y=y,
    class_names=["setosa", "versicolor", "virginica"],
    show_input=True
)
modelvis.plot_decision_boundaries(
    model2,
    X=X,
    y=y,
    class_names=["setosa", "versicolor", "virginica"],
    probability=True,
    show_input=True
)
def predict(petal_length, petal_width):
    data = [[petal_length, petal_width]]
    # Use the logistic regression model (model2), not the decision tree.
    y = model2.predict(data)[0]
    return ["setosa", "versicolor", "virginica"][y]
test(df_train)
test(df_test)
Exercise: Build a decision-tree model and a linear model using all four features of the iris dataset, and find the accuracy of each on the test dataset.
df_test['ispecies'] = df_test.species.map(species.get)
X = df_train.iloc[:, 0:4]
y = df_train.ispecies
X_test = df_test.iloc[:, 0:4]
y_test = df_test.ispecies
model = DecisionTreeClassifier()
model.fit(X,y);
scoring = []
for i in range(1, 10):
    model.set_params(max_depth=i)
    model.fit(X, y)
    train_accuracy = model.score(X, y)
    test_accuracy = model.score(X_test, y_test)
    scoring.append([i, train_accuracy, test_accuracy])
scoring
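The loop above covers the decision-tree half of the exercise. For the linear-model half, a possible sketch is shown below. It assumes a stratified 70/30 split of `load_iris` in place of the workshop CSVs, so the accuracy it prints will not match the numbers for `df_test`; the names `linear_model`, `X_tr`, `X_te`, `y_tr`, and `y_te` are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
# Assumption: a stratified 70/30 split stands in for the workshop CSVs.
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3,
    stratify=iris.target, random_state=0)

# All four features are used; max_iter is raised to help convergence.
linear_model = LogisticRegression(max_iter=1000)
linear_model.fit(X_tr, y_tr)

print("train accuracy:", linear_model.score(X_tr, y_tr))
print("test accuracy:", linear_model.score(X_te, y_te))
```

With the workshop data, the same two `score` calls on `X`, `y` and `X_test`, `y_test` give the figures to compare against the decision-tree table.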