Practical Machine Learning - Day 2 - Housing¶

VMware Bangalore
June 18-20, 2018

Amit Kapoor • Anand Chitipothu • Bargava Subramanian

Notes of this workshop are available online at: https://bit.ly/vmware-ml


Housing¶

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.subplot.bottom'] = 0.15

%matplotlib inline

FRAME¶

Find out whether a rental listing is going to generate interest from potential customers.

ACQUIRE¶

url = "https://notes.pipal.in/2018/vmware-ml/rent-data.json.zip"
# url = "https://notes.pipal.in/2018/vmware-ml/rent-data.json"

df = pd.read_json(url)

df.columns

Index(['bathrooms', 'bedrooms', 'building_id', 'created', 'description',
       'display_address', 'features', 'interest_level', 'latitude',
       'listing_id', 'longitude', 'manager_id', 'photos', 'price',
       'street_address'],
      dtype='object')

df.shape

(49352, 15)

df.dtypes

bathrooms          float64
bedrooms             int64
building_id         object
created             object
description         object
display_address     object
features            object
interest_level      object
latitude           float64
listing_id           int64
longitude          float64
manager_id          object
photos              object
price                int64
street_address      object
dtype: object

df.head()

REFINE¶

Prediction Variable: interest_level

df.interest_level.value_counts()

low       34284
medium    11229
high       3839
Name: interest_level, dtype: int64
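
The counts above show a strong class imbalance. As a minimal sketch, using only the three counts printed above (rebuilt by hand into a Series), the share of each class and the accuracy of always predicting the most frequent class can be computed as follows. Any accuracy figure reported later is worth reading against this baseline.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({"low": 34284, "medium": 11229, "high": 3839})

# Share of each class in the dataset
shares = counts / counts.sum()
print(shares.round(3))

# Accuracy of a model that always predicts the most frequent class
baseline_accuracy = shares.max()
print(round(baseline_accuracy, 4))
```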

What is my feature set (X)?

df.columns

Index(['bathrooms', 'bedrooms', 'building_id', 'created', 'description',
       'display_address', 'features', 'interest_level', 'latitude',
       'listing_id', 'longitude', 'manager_id', 'photos', 'price',
       'street_address'],
      dtype='object')

Domain check: Does the type of the data match what it should be?
Quality of the data: Are there any missing values in the column?
Distribution: Does it fit our expectations about the data? Are there any outliers in that column?
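
The three checks above can be bundled into one small helper. This is only a sketch: `check_column` is a hypothetical name, and the toy Series stands in for a real column such as df.bathrooms.

```python
import pandas as pd

def check_column(series):
    """Summarise dtype, missing values and spread for one column."""
    return {
        "dtype": str(series.dtype),               # domain check
        "n_missing": int(series.isnull().sum()),  # quality check
        "n_unique": int(series.nunique()),
        "min": series.min(),                      # quick look at the range
        "max": series.max(),
    }

# Hypothetical toy data standing in for df.bathrooms
toy = pd.Series([1.0, 1.5, 2.0, None, 10.0], name="bathrooms")
report = check_column(toy)
print(report)
```

The same helper could be applied to every numeric column in turn before looking at histograms.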

# What are the unique values in the column
df.bathrooms.unique()

array([ 1.5,  1. ,  2. ,  3.5,  3. ,  2.5,  0. ,  4. ,  4.5, 10. ,  5. ,
        6. ,  6.5,  5.5,  7. ])

# Is the type of bathrooms correct?
df.bathrooms.dtypes

dtype('float64')

# Are there any null values
df.bathrooms.isnull().sum()

0

# Are there any outliers in this data
df.bathrooms.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x135975a58>

df.bathrooms.value_counts(sort=True)

1.0     39422
2.0      7660
3.0       745
1.5       645
0.0       313
2.5       277
4.0       159
3.5        70
4.5        29
5.0        20
5.5         5
6.0         4
6.5         1
10.0        1
7.0         1
Name: bathrooms, dtype: int64

Bedrooms¶

# Are there any null values
df.bedrooms.isnull().sum()

0

df.bedrooms.value_counts(sort=True)

1    15752
2    14623
0     9475
3     7276
4     1929
5      247
6       46
8        2
7        2
Name: bedrooms, dtype: int64

# Cross Tabulation
pd.crosstab(df.bedrooms, df.bathrooms)

Outlier / Null Identification & Redressal System¶

Identification

Domain knowledge: Is the value even possible?
Correlation with other features: bathrooms vs. bedrooms; check against the size of the dataset
Visual inspection: literally send a person, or look at the associated photos
Statistical measures: box plot, ...
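
One common statistical measure is the box-plot rule: flag values more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below assumes that rule; `iqr_outliers` is a hypothetical helper and the prices are toy values, not the workshop data.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside the box-plot whiskers."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Hypothetical toy prices; the last value is deliberately extreme
prices = pd.Series([2850, 3000, 3275, 3350, 5465, 4490000])
mask = iqr_outliers(prices)
print(prices[mask])
```

On a column dominated by a single value, such as bathrooms (where about 80% of rows are 1.0), the IQR is zero and the rule flags every other value, so domain knowledge is still needed to decide what counts as an outlier.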

Redressal system for Outliers & Null Values

REMOVE: Drop that row or column from the dataset
IMPUTATION:
- Domain knowledge: Bathrooms = Bedrooms - 0.5
- Calculate a figure: mean, median, mode
- Build a cluster model to impute the value
CONVERSION: Convert from continuous to categorical
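
The three options above could look like this in pandas. This is a sketch on a hypothetical toy frame: the threshold of 7 bathrooms, the median fill, the bin edges and the `bath_band` column name are illustrative choices, not part of the workshop code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing and one implausible value
toy = pd.DataFrame({
    "bedrooms": [1, 2, 2, 3, 4],
    "bathrooms": [1.0, np.nan, 10.0, 2.0, 2.0],
})

# REMOVE: drop rows that fail a domain rule (here: more than 7 bathrooms)
cleaned = toy[~(toy.bathrooms > 7)].copy()

# IMPUTATION: fill missing values with the median of the remaining rows
median_bath = cleaned.bathrooms.median()
cleaned["bathrooms"] = cleaned.bathrooms.fillna(median_bath)

# CONVERSION: bin the continuous column into categories
cleaned["bath_band"] = pd.cut(
    cleaned.bathrooms, bins=[-0.1, 1, 2, np.inf], labels=["0-1", "1.5-2", "2.5+"]
)
print(cleaned)
```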

df[df.bathrooms == 10]

## Check for dtypes for all
df.dtypes

bathrooms          float64
bedrooms             int64
building_id         object
created             object
description         object
display_address     object
features            object
interest_level      object
latitude           float64
listing_id           int64
longitude          float64
manager_id          object
photos              object
price                int64
street_address      object
dtype: object

bathrooms ratio
bedrooms ratio
building_id interval
created datetime
description text
display_address text
features list of text
interest_level ordinal
latitude interval
listing_id interval
longitude interval
manager_id nominal
photos list but -> photos*
price ratio
street_address text

# Fix the dtypes
df.created = pd.to_datetime(df.created)

Null values together¶

To visualise missing values, you can also use the missingno library.

df.isnull().sum()

bathrooms          0
bedrooms           0
building_id        0
created            0
description        0
display_address    0
features           0
interest_level     0
latitude           0
listing_id         0
longitude          0
manager_id         0
photos             0
price              0
street_address     0
dtype: int64

## Outlier for Price
df.price.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x134f8cc18>

df[df.price > 200000]

Outlier treatment¶

Drop Price > 200000
Drop Bathrooms > 7

df.drop?

# Need to run this ! 
df.drop(df[df.price > 200000].index, inplace=True)

df.drop(df[df.bathrooms > 7].index, inplace=True)

df.shape

(49347, 15)

Explore the data¶

Let's look at the possible input features

Bathrooms
Bedrooms
Price
Latitude
Longitude
Features --> use the length of the features list as a variable
Created
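
Features and Created are not numeric, so they need to be turned into numbers before a model can use them. A minimal sketch on hypothetical toy rows; the derived column names `num_features`, `created_hour` and `created_dayofweek` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy rows shaped like the listing data
toy = pd.DataFrame({
    "features": [[], ["Doorman", "Elevator"], ["Pre-War"]],
    "created": ["2016-06-24 07:54:24", "2016-06-12 12:19:27", "2016-04-28 01:32:41"],
})

# Length of the features list as a numeric variable
toy["num_features"] = toy.features.apply(len)

# Parts of the creation timestamp as numeric variables
toy["created"] = pd.to_datetime(toy.created)
toy["created_hour"] = toy.created.dt.hour
toy["created_dayofweek"] = toy.created.dt.dayofweek  # Monday=0

print(toy[["num_features", "created_hour", "created_dayofweek"]])
```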

# Price
df.price.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x134d7e7f0>

df.price.plot(kind="hist", logx=True, bins=100)

<matplotlib.axes._subplots.AxesSubplot at 0x12f5f80f0>

plt.scatter(df.longitude, df.latitude, alpha=0.1)
# Limits are given as (min, max) so that west stays on the left
plt.xlim(-74, -70)
plt.ylim(38, 45)

(38, 45)

df.head()

from plotnine import *

(ggplot(df) + aes('longitude', 'latitude', color='interest_level')
 + geom_point(alpha=0.01) + ylim(40.6, 40.8) + xlim(-74.5, -73.5)
 + facet_wrap("interest_level")
)

/Users/amitkaps/miniconda3/lib/python3.6/site-packages/plotnine/layer.py:450: UserWarning: geom_point : Removed 4008 rows containing missing values.
  self.data = self.geom.handle_na(self.data)

<ggplot: (-9223372036525907740)>

Transform¶

Shape transform: Change price to its logarithm
Encoding: Convert categorical values to numbers

df["priceLog"] = np.log(df.price)

df.columns

Index(['bathrooms', 'bedrooms', 'building_id', 'created', 'description',
       'display_address', 'features', 'interest_level', 'latitude',
       'listing_id', 'longitude', 'manager_id', 'photos', 'price',
       'street_address', 'priceLog'],
      dtype='object')

X = df[["bathrooms", "bedrooms", "latitude", "longitude", "priceLog"]]
X.head()

y = df['interest_level']
y.head()

10        medium
10000        low
100004      high
100007       low
100013       low
Name: interest_level, dtype: object

type(y)

pandas.core.series.Series

from sklearn.preprocessing import LabelEncoder

le_y = LabelEncoder()

le_y.fit(y)

LabelEncoder()

le_y.classes_

array(['high', 'low', 'medium'], dtype=object)

y_encoded = le_y.transform(y)

y_encoded

array([2, 1, 0, ..., 1, 1, 1])

print("Original Values: \n" ,  y.iloc[:5], "\n\n Encoded values: \n", y_encoded[:5])

Original Values: 
 10        medium
10000        low
100004      high
100007       low
100013       low
Name: interest_level, dtype: object 

 Encoded values: 
 [2 1 0 1 1]
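
LabelEncoder assigns codes in the sorted order of the class labels, which is why the output above shows high=0, low=1, medium=2. For a classification target this is harmless, since the codes only name the classes. If codes that follow the natural order low < medium < high are wanted, one option (a sketch, not the approach used in this notebook) is an ordered pandas Categorical:

```python
import pandas as pd

# Hypothetical toy target with the same three levels as interest_level
y_toy = pd.Series(["medium", "low", "high", "low", "low"])

# Declare the order explicitly so that the codes follow low < medium < high
levels = ["low", "medium", "high"]
y_ordered = pd.Categorical(y_toy, categories=levels, ordered=True)

print(y_ordered.codes)
```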

First model - Decision Tree¶

from sklearn.tree import DecisionTreeClassifier

model_dt = DecisionTreeClassifier(max_depth=2)

model_dt.fit(X, y_encoded)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Cross-validation¶

#from sklearn.model_selection import StratifiedKFold

from sklearn.model_selection import cross_val_score

score = cross_val_score(model_dt, X, y_encoded, scoring="accuracy", cv=5, n_jobs=-1)

np.mean(score)

0.6968203489387579
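
For a classifier, passing cv=5 to cross_val_score already uses stratified folds without shuffling. The sketch below makes the splitter explicit and adds shuffling. It runs on synthetic data from make_classification as a stand-in for X and y_encoded, so its scores say nothing about the housing data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y_encoded: 3 imbalanced classes, 5 features
X_toy, y_toy = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)

# Explicit stratified splitter; shuffling guards against any ordering in the rows
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = DecisionTreeClassifier(max_depth=2, random_state=0)

fold_scores = cross_val_score(model, X_toy, y_toy, scoring="accuracy", cv=cv)
print(fold_scores.mean(), fold_scores.std())
```

Looking at the spread across folds, not only the mean, gives a sense of how stable the estimate is.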

Problem: Try a different error metric (precision, recall)
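
One possible starting point for this exercise is cross_validate, which accepts several scorers at once. The sketch below uses macro-averaged precision and recall, an assumed choice for a three-class target, and synthetic data in place of X and y_encoded.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y_encoded: 3 imbalanced classes, 5 features
X_toy, y_toy = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)

model = DecisionTreeClassifier(max_depth=4, random_state=0)
metrics = ["accuracy", "precision_macro", "recall_macro"]

# cross_validate returns one array of per-fold scores per metric
results = cross_validate(model, X_toy, y_toy, cv=5, scoring=metrics)

for name in metrics:
    print(name, results["test_" + name].mean())
```

With imbalanced classes, macro recall often tells a different story from accuracy, because every class counts equally regardless of its size.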

Problem: Change the model depth and see how the model performance changes

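def depthDT(depth):
    # Mean 5-fold cross-validated accuracy for a tree of the given depth
    model = DecisionTreeClassifier(max_depth=depth)
    # cross_val_score fits a fresh clone on each fold, so no separate fit is needed
    score = cross_val_score(model, X, y_encoded, scoring="accuracy", cv=5, n_jobs=-1)
    return np.mean(score)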

scores = []
for i in range(1,20,1):
    score = depthDT(i)
    scores.append(score)

# Plot against the actual depths (1 to 19) rather than the list index
plt.plot(range(1, 20), scores)

[<matplotlib.lines.Line2D at 0x12fb12b00>]
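
The same depth sweep can also be written with scikit-learn's validation_curve, which returns training scores alongside the cross-validated scores, so overfitting shows up as a growing gap between the two. This is a sketch on synthetic data, not a rerun of the experiment above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y_encoded: 3 imbalanced classes, 5 features
X_toy, y_toy = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)

depths = np.arange(1, 11)
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X_toy, y_toy,
    param_name="max_depth", param_range=depths, cv=5, scoring="accuracy",
)

# Both arrays have one row per depth and one column per fold
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)

best_depth = depths[mean_test.argmax()]
print(best_depth)
```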

Output of df.head() (before priceLog is added):

	bathrooms	bedrooms	building_id	created	description	display_address	features	interest_level	latitude	listing_id	longitude	manager_id	photos	price	street_address
10	1.5	3	53a5b119ba8f7b61d4e010512e0dfc85	2016-06-24 07:54:24	A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...	Metropolitan Avenue	[]	medium	40.7145	7211212	-73.9425	5ba989232d0489da1b5f2c45f6688adc	[https://photos.renthop.com/2/7211212_1ed4542e...	3000	792 Metropolitan Avenue
10000	1.0	2	c5c8a357cba207596b04d1afd1e4f130	2016-06-12 12:19:27		Columbus Avenue	[Doorman, Elevator, Fitness Center, Cats Allow...	low	40.7947	7150865	-73.9667	7533621a882f71e25173b27e3139d83d	[https://photos.renthop.com/2/7150865_be3306c5...	5465	808 Columbus Avenue
100004	1.0	1	c3ba40552e2120b0acfc3cb5730bb2aa	2016-04-17 03:26:41	Top Top West Village location, beautiful Pre-w...	W 13 Street	[Laundry In Building, Dishwasher, Hardwood Flo...	high	40.7388	6887163	-74.0018	d9039c43983f6e564b1482b273bd7b01	[https://photos.renthop.com/2/6887163_de85c427...	2850	241 W 13 Street
100007	1.0	1	28d9ad350afeaab8027513a3e52ac8d5	2016-04-18 02:22:02	Building Amenities - Garage - Garden - fitness...	East 49th Street	[Hardwood Floors, No Fee]	low	40.7539	6888711	-73.9677	1067e078446a7897d2da493d2f741316	[https://photos.renthop.com/2/6888711_6e660cee...	3275	333 East 49th Street
100013	1.0	4	0	2016-04-28 01:32:41	Beautifully renovated 3 bedroom flex 4 bedroom...	West 143rd Street	[Pre-War]	low	40.8241	6934781	-73.9493	98e13ad4b495b9613cef886d79a6291f	[https://photos.renthop.com/2/6934781_1fa4b41a...	3350	500 West 143rd Street

Output of df[df.price > 200000]:

	bathrooms	bedrooms	building_id	created	description	display_address	features	interest_level	latitude	listing_id	longitude	manager_id	photos	price	street_address
12168	1.0	2	5d3525a5085445e7fcd64a53aac3cb0a	2016-06-24 05:02:58		West 116th Street	[Doorman, Elevator, Cats Allowed, Dogs Allowed...	low	40.8011	7208794	-73.9480	d1737922fe92ccb0dc37ba85589e6415	[]	1150000	40 West 116th Street
32611	1.0	2	cd25bbea2af848ebe9821da820b725da	2016-06-24 05:02:11		Hudson Street	[Doorman, Elevator, Cats Allowed, Dogs Allowed...	low	40.7299	7208764	-74.0071	d1737922fe92ccb0dc37ba85589e6415	[]	4490000	421 Hudson Street
55437	1.0	1	37385c8a58176b529964083315c28e32	2016-05-14 05:21:28		West 57th Street	[Doorman, Cats Allowed, Dogs Allowed]	low	40.7676	7013217	-73.9844	8f5a9c893f6d602f4953fcc0b8e6e9b4	[]	1070000	333 West 57th Street
57803	1.0	1	37385c8a58176b529964083315c28e32	2016-05-19 02:37:06	This 1 Bedroom apartment is located on a prime...	West 57th Street	[Doorman, Elevator, Pre-War, Dogs Allowed, Cat...	low	40.7676	7036279	-73.9844	18133bc914e6faf6f8cc1bf29d66fc0d	[https://photos.renthop.com/2/7036279_924b52f0...	1070000	333 West 57th Street

Output of df.head() (after priceLog is added):

	bathrooms	bedrooms	building_id	created	description	display_address	features	interest_level	latitude	listing_id	longitude	manager_id	photos	price	street_address	priceLog
10	1.5	3	53a5b119ba8f7b61d4e010512e0dfc85	2016-06-24 07:54:24	A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...	Metropolitan Avenue	[]	medium	40.7145	7211212	-73.9425	5ba989232d0489da1b5f2c45f6688adc	[https://photos.renthop.com/2/7211212_1ed4542e...	3000	792 Metropolitan Avenue	8.006368
10000	1.0	2	c5c8a357cba207596b04d1afd1e4f130	2016-06-12 12:19:27		Columbus Avenue	[Doorman, Elevator, Fitness Center, Cats Allow...	low	40.7947	7150865	-73.9667	7533621a882f71e25173b27e3139d83d	[https://photos.renthop.com/2/7150865_be3306c5...	5465	808 Columbus Avenue	8.606119
100004	1.0	1	c3ba40552e2120b0acfc3cb5730bb2aa	2016-04-17 03:26:41	Top Top West Village location, beautiful Pre-w...	W 13 Street	[Laundry In Building, Dishwasher, Hardwood Flo...	high	40.7388	6887163	-74.0018	d9039c43983f6e564b1482b273bd7b01	[https://photos.renthop.com/2/6887163_de85c427...	2850	241 W 13 Street	7.955074
100007	1.0	1	28d9ad350afeaab8027513a3e52ac8d5	2016-04-18 02:22:02	Building Amenities - Garage - Garden - fitness...	East 49th Street	[Hardwood Floors, No Fee]	low	40.7539	6888711	-73.9677	1067e078446a7897d2da493d2f741316	[https://photos.renthop.com/2/6888711_6e660cee...	3275	333 East 49th Street	8.094073
100013	1.0	4	0	2016-04-28 01:32:41	Beautifully renovated 3 bedroom flex 4 bedroom...	West 143rd Street	[Pre-War]	low	40.8241	6934781	-73.9493	98e13ad4b495b9613cef886d79a6291f	[https://photos.renthop.com/2/6934781_1fa4b41a...	3350	500 West 143rd Street	8.116716

Output of pd.crosstab(df.bedrooms, df.bathrooms):

bathrooms  0.0    1.0  1.5   2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  10.0
bedrooms
0          157   9279    9    29    0    0    0    1    0    0    0    0    0    0     0
1           73  15301  154   207    2   14    0    0    1    0    0    0    0    0     0
2           59  10872  208  3359   87   36    1    0    0    0    0    0    0    0     1
3           21   3594  208  2769  144  491   33   16    0    0    0    0    0    0     0
4            3    365   60  1155   39  148   33   89   21   13    3    0    0    0     0
5            0     10    5   130    4   36    3   40    7    6    2    3    0    1     0
6            0      1    1    11    1   18    0    12    0    1    0    1    0    0     0
7            0      0    0     0    0    1    0    0    0    0    0    0    1    0     0
8            0      0    0     0    0    1    0    1    0    0    0    0    0    0     0