Practical Machine Learning - Day 2 - Housing

VMware Bangalore
June 18-20, 2018

Amit Kapoor • Anand Chitipothu • Bargava Subramanian

Notes of this workshop are available online at: https://bit.ly/vmware-ml

Housing

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.subplot.bottom'] = 0.15

%matplotlib inline

FRAME

Will a rental listing generate interest from potential customers?

ACQUIRE

In [3]:
url = "https://notes.pipal.in/2018/vmware-ml/rent-data.json.zip"
# url = "https://notes.pipal.in/2018/vmware-ml/rent-data.json"
In [4]:
df = pd.read_json(url)
In [5]:
df.columns
Out[5]:
Index(['bathrooms', 'bedrooms', 'building_id', 'created', 'description',
       'display_address', 'features', 'interest_level', 'latitude',
       'listing_id', 'longitude', 'manager_id', 'photos', 'price',
       'street_address'],
      dtype='object')
In [8]:
df.shape
Out[8]:
(49352, 15)
In [9]:
df.dtypes
Out[9]:
bathrooms          float64
bedrooms             int64
building_id         object
created             object
description         object
display_address     object
features            object
interest_level      object
latitude           float64
listing_id           int64
longitude          float64
manager_id          object
photos              object
price                int64
street_address      object
dtype: object
In [6]:
df.head()
Out[6]:
bathrooms bedrooms building_id created description display_address features interest_level latitude listing_id longitude manager_id photos price street_address
10 1.5 3 53a5b119ba8f7b61d4e010512e0dfc85 2016-06-24 07:54:24 A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... Metropolitan Avenue [] medium 40.7145 7211212 -73.9425 5ba989232d0489da1b5f2c45f6688adc [https://photos.renthop.com/2/7211212_1ed4542e... 3000 792 Metropolitan Avenue
10000 1.0 2 c5c8a357cba207596b04d1afd1e4f130 2016-06-12 12:19:27 Columbus Avenue [Doorman, Elevator, Fitness Center, Cats Allow... low 40.7947 7150865 -73.9667 7533621a882f71e25173b27e3139d83d [https://photos.renthop.com/2/7150865_be3306c5... 5465 808 Columbus Avenue
100004 1.0 1 c3ba40552e2120b0acfc3cb5730bb2aa 2016-04-17 03:26:41 Top Top West Village location, beautiful Pre-w... W 13 Street [Laundry In Building, Dishwasher, Hardwood Flo... high 40.7388 6887163 -74.0018 d9039c43983f6e564b1482b273bd7b01 [https://photos.renthop.com/2/6887163_de85c427... 2850 241 W 13 Street
100007 1.0 1 28d9ad350afeaab8027513a3e52ac8d5 2016-04-18 02:22:02 Building Amenities - Garage - Garden - fitness... East 49th Street [Hardwood Floors, No Fee] low 40.7539 6888711 -73.9677 1067e078446a7897d2da493d2f741316 [https://photos.renthop.com/2/6888711_6e660cee... 3275 333 East 49th Street
100013 1.0 4 0 2016-04-28 01:32:41 Beautifully renovated 3 bedroom flex 4 bedroom... West 143rd Street [Pre-War] low 40.8241 6934781 -73.9493 98e13ad4b495b9613cef886d79a6291f [https://photos.renthop.com/2/6934781_1fa4b41a... 3350 500 West 143rd Street

REFINE

Prediction Variable: interest_level

In [11]:
df.interest_level.value_counts()
Out[11]:
low       34284
medium    11229
high       3839
Name: interest_level, dtype: int64

What is my feature set (X)?

In [13]:
df.columns
Out[13]:
Index(['bathrooms', 'bedrooms', 'building_id', 'created', 'description',
       'display_address', 'features', 'interest_level', 'latitude',
       'listing_id', 'longitude', 'manager_id', 'photos', 'price',
       'street_address'],
      dtype='object')

Refinement

  • Domain Check: Does the type of the data match what it should be?
  • Quality of the Data: Are there any missing values in the column?
  • Distribution: Does the distribution fit the expectation about the data? Are there any outliers in that column?
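
The first two checks can be bundled into one small helper that is run on each column in turn. This is only a minimal sketch: `check_column` and the toy `demo` frame are hypothetical names that are not part of the workshop notebook.

```python
import pandas as pd

def check_column(series):
    """Summarise dtype, missing values and range for one column (hypothetical helper)."""
    return {
        "dtype": str(series.dtype),
        "n_null": int(series.isnull().sum()),
        "n_unique": int(series.nunique()),
        "min": series.min(),
        "max": series.max(),
    }

# Toy stand-in for the real data, with one missing and one suspicious value
demo = pd.DataFrame({"bathrooms": [1.0, 1.5, 2.0, None, 10.0]})
report = check_column(demo["bathrooms"])
print(report)
```

On the real data the same call would be `check_column(df.bathrooms)`; the cells that follow perform the same checks one step at a time.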
In [19]:
# What are the unique values in the column
df.bathrooms.unique()
Out[19]:
array([ 1.5,  1. ,  2. ,  3.5,  3. ,  2.5,  0. ,  4. ,  4.5, 10. ,  5. ,
        6. ,  6.5,  5.5,  7. ])
In [20]:
# Is the type of bathrooms correct?
df.bathrooms.dtypes
Out[20]:
dtype('float64')
In [21]:
# Are there any null values
df.bathrooms.isnull().sum()
Out[21]:
0
In [24]:
# Are there any outliers in this data
df.bathrooms.hist()
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x135975a58>
In [29]:
df.bathrooms.value_counts(sort=True)
Out[29]:
1.0     39422
2.0      7660
3.0       745
1.5       645
0.0       313
2.5       277
4.0       159
3.5        70
4.5        29
5.0        20
5.5         5
6.0         4
6.5         1
10.0        1
7.0         1
Name: bathrooms, dtype: int64

Bedrooms

In [32]:
# Are there any null values
df.bedrooms.isnull().sum()
Out[32]:
0
In [33]:
df.bedrooms.value_counts(sort=True)
Out[33]:
1    15752
2    14623
0     9475
3     7276
4     1929
5      247
6       46
8        2
7        2
Name: bedrooms, dtype: int64
In [35]:
# Cross Tabulation
pd.crosstab(df.bedrooms, df.bathrooms)
Out[35]:
bathrooms    0.0    1.0    1.5    2.0    2.5    3.0    3.5    4.0    4.5    5.0    5.5    6.0    6.5    7.0   10.0
bedrooms
0            157   9279      9     29      0      0      0      1      0      0      0      0      0      0      0
1             73  15301    154    207      2     14      0      0      1      0      0      0      0      0      0
2             59  10872    208   3359     87     36      1      0      0      0      0      0      0      0      1
3             21   3594    208   2769    144    491     33     16      0      0      0      0      0      0      0
4              3    365     60   1155     39    148     33     89     21     13      3      0      0      0      0
5              0     10      5    130      4     36      3     40      7      6      2      3      0      1      0
6              0      1      1     11      1     18      0     12      0      1      0      1      0      0      0
7              0      0      0      0      0      1      0      0      0      0      0      0      1      0      0
8              0      0      0      0      0      1      0      1      0      0      0      0      0      0      0

Outlier / Null Identification & Redressal System

Identification

  • Domain Knowledge: Is it even possible?
  • Correlate with other features: Bathrooms vs Bedrooms, check against the size of the dataset
  • Visual inspection: Literally send a person, or look at the associated photos
  • Statistical Measure: Box-Plot, ...
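
The box-plot measure in the last bullet is usually the 1.5 × IQR rule: anything further than 1.5 interquartile ranges outside the quartiles is flagged. A minimal sketch follows, assuming that convention; `iqr_outliers` and the toy `prices` series are hypothetical and not from the notebook.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy prices: the last value is far from the rest
prices = pd.Series([2850, 3000, 3275, 3350, 5465, 4490000])
mask = iqr_outliers(prices)
print(prices[mask])
```

The rule needs care on concentrated columns. About 80% of listings have exactly one bathroom, so both quartiles of `bathrooms` are 1.0, the IQR is zero, and every other value would be flagged. That is where the domain-knowledge check earns its place.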

Redressal system for Outliers & Null Values

  • REMOVE: Drop that row or column from the dataset
  • IMPUTATION:
    • Domain Knowledge: Bathroom = Bedrooms - 0.5
    • Calculate a figure: mean, median, mode
    • Build a cluster model to impute value
  • Conversion: Convert from Continuous to Categorical
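
The imputation and conversion options can be sketched on a toy frame. The notebook itself takes the REMOVE route, so this is only an illustration: the `demo` data, the median as the imputed figure, and the bin edges are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with one implausible bathroom count (hypothetical data)
demo = pd.DataFrame({"bedrooms": [0, 1, 2, 2, 3],
                     "bathrooms": [1.0, 1.0, 1.0, 10.0, 2.0]})

# Imputation: treat the implausible value as missing, then fill with the median
demo.loc[demo["bathrooms"] > 7, "bathrooms"] = np.nan
median_bath = demo["bathrooms"].median()
demo["bathrooms_imputed"] = demo["bathrooms"].fillna(median_bath)

# Conversion: bin the continuous count into a few categories
demo["bath_band"] = pd.cut(demo["bathrooms_imputed"],
                           bins=[-0.1, 1.0, 2.0, np.inf],
                           labels=["0-1", "1.5-2", "2.5+"])
print(demo)
```

Binning also blunts the effect of extreme values, because a listing with 10 bathrooms simply lands in the top band.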
In [37]:
df[df.bathrooms == 10]
Out[37]:
bathrooms bedrooms building_id created description display_address features interest_level latitude listing_id longitude manager_id photos price street_address
104459 10.0 2 424f8014bddc288d26da5fe81d0bea02 2016-04-09 04:34:31 ***The building?s well-attended lobby welcomes... W 52 St. [Doorman, Elevator, Fitness Center, Laundry in... low 40.7633 6849204 -73.9849 0c71a59cb70215fbf49c9dd93efaa172 [https://photos.renthop.com/2/6849204_1f92b58a... 3600 260 W 52 St.
In [38]:
## Check for dtypes for all
df.dtypes
Out[38]:
bathrooms          float64
bedrooms             int64
building_id         object
created             object
description         object
display_address     object
features            object
interest_level      object
latitude           float64
listing_id           int64
longitude          float64
manager_id          object
photos              object
price                int64
street_address      object
dtype: object
  • bathrooms ratio
  • bedrooms ratio
  • building_id interval
  • created datetime
  • description text
  • display_address text
  • features list of text
  • interest_level ordinal
  • latitude interval
  • listing_id interval
  • longitude interval
  • manager_id nominal
  • photos list but -> photos*
  • price ratio
  • street_address text
In [41]:
# Fix the dtypes
df.created = pd.to_datetime(df.created)

Null values together

If you want to visualise missing values, you can also use the missingno library.
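
A per-column summary of missing values can be built with pandas alone. The sketch below runs on a hypothetical toy frame; the commented lines show how the same frame could be passed to missingno's `matrix` plot, assuming the library is installed.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (hypothetical data)
demo = pd.DataFrame({
    "price": [3000, np.nan, 2850, 3275],
    "bedrooms": [3, 2, np.nan, np.nan],
    "latitude": [40.7145, 40.7947, 40.7388, 40.7539],
})

# Count and share of missing values per column, worst first
missing = pd.DataFrame({
    "n_missing": demo.isnull().sum(),
    "share": demo.isnull().mean(),
}).sort_values("n_missing", ascending=False)
print(missing)

# With missingno installed, the same frame can be drawn as a nullity matrix:
# import missingno as msno
# msno.matrix(demo)
```

Note that `isnull()` only counts real NaN values. Placeholders, such as the `building_id` of `0` visible in the `df.head()` output, are not counted and need a separate check.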

In [45]:
df.isnull().sum()
Out[45]:
bathrooms          0
bedrooms           0
building_id        0
created            0
description        0
display_address    0
features           0
interest_level     0
latitude           0
listing_id         0
longitude          0
manager_id         0
photos             0
price              0
street_address     0
dtype: int64
In [54]:
## Outlier for Price
df.price.hist()
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x134f8cc18>
In [59]:
df[df.price > 200000]
Out[59]:
bathrooms bedrooms building_id created description display_address features interest_level latitude listing_id longitude manager_id photos price street_address
12168 1.0 2 5d3525a5085445e7fcd64a53aac3cb0a 2016-06-24 05:02:58 West 116th Street [Doorman, Elevator, Cats Allowed, Dogs Allowed... low 40.8011 7208794 -73.9480 d1737922fe92ccb0dc37ba85589e6415 [] 1150000 40 West 116th Street
32611 1.0 2 cd25bbea2af848ebe9821da820b725da 2016-06-24 05:02:11 Hudson Street [Doorman, Elevator, Cats Allowed, Dogs Allowed... low 40.7299 7208764 -74.0071 d1737922fe92ccb0dc37ba85589e6415 [] 4490000 421 Hudson Street
55437 1.0 1 37385c8a58176b529964083315c28e32 2016-05-14 05:21:28 West 57th Street [Doorman, Cats Allowed, Dogs Allowed] low 40.7676 7013217 -73.9844 8f5a9c893f6d602f4953fcc0b8e6e9b4 [] 1070000 333 West 57th Street
57803 1.0 1 37385c8a58176b529964083315c28e32 2016-05-19 02:37:06 This 1 Bedroom apartment is located on a prime... West 57th Street [Doorman, Elevator, Pre-War, Dogs Allowed, Cat... low 40.7676 7036279 -73.9844 18133bc914e6faf6f8cc1bf29d66fc0d [https://photos.renthop.com/2/7036279_924b52f0... 1070000 333 West 57th Street

Outlier treatment

  • Drop Price > 200000
  • Drop Bathrooms > 7
In [71]:
df.drop?
In [75]:
# Need to run this ! 
df.drop(df[df.price > 200000].index, inplace=True)
In [76]:
df.drop(df[df.bathrooms > 7].index, inplace=True)
In [77]:
df.shape
Out[77]:
(49347, 15)
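
The shape confirms that five rows were removed: four for price and one for bathrooms. An equivalent way to apply both rules, without modifying the frame in place, is a boolean mask. The sketch below uses a hypothetical toy frame rather than the real listings.

```python
import pandas as pd

# Toy stand-in for the listings (hypothetical data)
demo = pd.DataFrame({"price": [3000, 5465, 4490000, 2850],
                     "bathrooms": [1.5, 1.0, 1.0, 10.0]})

# Keep only rows that pass both outlier rules; the original frame is left untouched
keep = (demo["price"] <= 200000) & (demo["bathrooms"] <= 7)
clean = demo[keep].copy()
print(len(demo), "->", len(clean))
```

Because `demo` is unchanged, the cell can be re-run safely, which is not true of a cell that calls `drop` with `inplace=True`.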

Explore the data

Let's look at the possible input features

  • Bathrooms
  • Bedrooms
  • Price
  • Latitude
  • Longitude
  • Features --> # len of features as a variable
  • Created
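
The last two items are not numeric as they stand. One possible way to turn them into model inputs is sketched below on a toy frame; the new column names (`n_features`, `created_month`, `created_hour`) are assumptions, and the notebook does not go on to use them.

```python
import pandas as pd

# Toy rows shaped like the real columns (hypothetical data)
demo = pd.DataFrame({
    "features": [[], ["Doorman", "Elevator"], ["Pre-War"]],
    "created": ["2016-06-24 07:54:24", "2016-06-12 12:19:27", "2016-04-28 01:32:41"],
})
demo["created"] = pd.to_datetime(demo["created"])

# Length of the features list as a numeric variable
demo["n_features"] = demo["features"].apply(len)

# Calendar parts of the creation timestamp
demo["created_month"] = demo["created"].dt.month
demo["created_hour"] = demo["created"].dt.hour
print(demo[["n_features", "created_month", "created_hour"]])
```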
In [78]:
# Price
df.price.hist()
Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x134d7e7f0>
In [83]:
df.price.plot(kind="hist", logx=True, bins=100)
Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f5f80f0>
In [143]:
plt.scatter(df.longitude, df.latitude, alpha=0.1)
plt.xlim(-74, -70)
plt.ylim(38,45)
Out[143]:
(38, 45)
In [155]:
df.head()
Out[155]:
bathrooms bedrooms building_id created description display_address features interest_level latitude listing_id longitude manager_id photos price street_address priceLog
10 1.5 3 53a5b119ba8f7b61d4e010512e0dfc85 2016-06-24 07:54:24 A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... Metropolitan Avenue [] medium 40.7145 7211212 -73.9425 5ba989232d0489da1b5f2c45f6688adc [https://photos.renthop.com/2/7211212_1ed4542e... 3000 792 Metropolitan Avenue 8.006368
10000 1.0 2 c5c8a357cba207596b04d1afd1e4f130 2016-06-12 12:19:27 Columbus Avenue [Doorman, Elevator, Fitness Center, Cats Allow... low 40.7947 7150865 -73.9667 7533621a882f71e25173b27e3139d83d [https://photos.renthop.com/2/7150865_be3306c5... 5465 808 Columbus Avenue 8.606119
100004 1.0 1 c3ba40552e2120b0acfc3cb5730bb2aa 2016-04-17 03:26:41 Top Top West Village location, beautiful Pre-w... W 13 Street [Laundry In Building, Dishwasher, Hardwood Flo... high 40.7388 6887163 -74.0018 d9039c43983f6e564b1482b273bd7b01 [https://photos.renthop.com/2/6887163_de85c427... 2850 241 W 13 Street 7.955074
100007 1.0 1 28d9ad350afeaab8027513a3e52ac8d5 2016-04-18 02:22:02 Building Amenities - Garage - Garden - fitness... East 49th Street [Hardwood Floors, No Fee] low 40.7539 6888711 -73.9677 1067e078446a7897d2da493d2f741316 [https://photos.renthop.com/2/6888711_6e660cee... 3275 333 East 49th Street 8.094073
100013 1.0 4 0 2016-04-28 01:32:41 Beautifully renovated 3 bedroom flex 4 bedroom... West 143rd Street [Pre-War] low 40.8241 6934781 -73.9493 98e13ad4b495b9613cef886d79a6291f [https://photos.renthop.com/2/6934781_1fa4b41a... 3350 500 West 143rd Street 8.116716
In [146]:
from plotnine import *
In [163]:
(ggplot(df) + aes('longitude', 'latitude', color='interest_level') 
 + geom_point(alpha=0.01) + ylim(40.6,40.8)  +xlim(-73.5,-74.5) +
 facet_wrap("interest_level")
)
/Users/amitkaps/miniconda3/lib/python3.6/site-packages/plotnine/layer.py:450: UserWarning: geom_point : Removed 4008 rows containing missing values.
  self.data = self.geom.handle_na(self.data)
Out[163]:
<ggplot: (-9223372036525907740)>

Transform

  • Shape Transform: Change price to log numbers
  • Encoding: Convert categorical values to numbers
In [84]:
df["priceLog"] = np.log(df.price)
In [89]:
df.columns
Out[89]:
Index(['bathrooms', 'bedrooms', 'building_id', 'created', 'description',
       'display_address', 'features', 'interest_level', 'latitude',
       'listing_id', 'longitude', 'manager_id', 'photos', 'price',
       'street_address', 'priceLog'],
      dtype='object')
In [93]:
X = df[["bathrooms", "bedrooms", "latitude", "longitude", "priceLog"]]
X.head()
Out[93]:
        bathrooms  bedrooms  latitude  longitude  priceLog
10            1.5         3   40.7145   -73.9425  8.006368
10000         1.0         2   40.7947   -73.9667  8.606119
100004        1.0         1   40.7388   -74.0018  7.955074
100007        1.0         1   40.7539   -73.9677  8.094073
100013        1.0         4   40.8241   -73.9493  8.116716
In [95]:
y = df['interest_level']
y.head()
Out[95]:
10        medium
10000        low
100004      high
100007       low
100013       low
Name: interest_level, dtype: object
In [96]:
type(y)
Out[96]:
pandas.core.series.Series
In [97]:
from sklearn.preprocessing import LabelEncoder
In [98]:
le_y = LabelEncoder()
In [99]:
le_y.fit(y)
Out[99]:
LabelEncoder()
In [109]:
le_y.classes_
Out[109]:
array(['high', 'low', 'medium'], dtype=object)
In [100]:
y_encoded = le_y.transform(y)
In [101]:
y_encoded
Out[101]:
array([2, 1, 0, ..., 1, 1, 1])
In [108]:
print("Original Values: \n" ,  y.iloc[:5], "\n\n Encoded values: \n", y_encoded[:5])
Original Values: 
 10        medium
10000        low
100004      high
100007       low
100013       low
Name: interest_level, dtype: object 

 Encoded values: 
 [2 1 0 1 1]
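
LabelEncoder assigns codes in alphabetical order (high = 0, low = 1, medium = 2), so the codes do not follow the natural ordering of interest. That is fine for a classification target, but if the ordering matters, an explicit mapping keeps it. The sketch below shows two ways; the `order` dictionary and the toy `y_demo` series are hypothetical.

```python
import pandas as pd

y_demo = pd.Series(["medium", "low", "high", "low", "low"], name="interest_level")

# Explicit mapping that keeps low < medium < high
order = {"low": 0, "medium": 1, "high": 2}
y_ordinal = y_demo.map(order)

# Same idea with an ordered categorical
y_cat = pd.Categorical(y_demo, categories=["low", "medium", "high"], ordered=True)

print(y_ordinal.tolist())
print(y_cat.codes.tolist())
```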

First model - Decision Tree

In [113]:
from sklearn.tree import DecisionTreeClassifier
In [114]:
model_dt = DecisionTreeClassifier(max_depth=2)
In [115]:
model_dt.fit(X, y_encoded)
Out[115]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Cross-validation

In [116]:
#from sklearn.model_selection import StratifiedKFold
In [119]:
from sklearn.model_selection import cross_val_score
In [120]:
score = cross_val_score(model_dt, X, y_encoded, scoring="accuracy", cv=5, n_jobs=-1)
In [121]:
np.mean(score)
Out[121]:
0.6968203489387579
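
For context, the value counts earlier show 34284 of 49352 listings (about 69.5%) labelled low, so always predicting low would already score close to the 0.697 above. A majority-class baseline makes that comparison explicit. The sketch below uses synthetic data with a similar imbalance and scikit-learn's `DummyClassifier`; the data, the seeds, and the explicit `StratifiedKFold` splitter are assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (hypothetical): about 70% / 20% / 10% across three classes
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 3))
y_demo = rng.choice([0, 1, 2], size=500, p=[0.7, 0.2, 0.1])

# Stratified folds keep the class proportions the same in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
baseline = DummyClassifier(strategy="most_frequent")
tree = DecisionTreeClassifier(max_depth=2, random_state=0)

baseline_acc = cross_val_score(baseline, X_demo, y_demo, scoring="accuracy", cv=cv).mean()
tree_acc = cross_val_score(tree, X_demo, y_demo, scoring="accuracy", cv=cv).mean()
print("baseline:", baseline_acc, "tree:", tree_acc)
```

On the real data the same comparison would use `X` and `y_encoded`. A model is only useful to the extent that it beats this baseline.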

Problem: Try a different evaluation metric (precision, recall)
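
One way to approach this problem is `cross_validate`, which accepts several scorers at once. With imbalanced classes, accuracy can look acceptable while recall on the rare classes is poor. The sketch below runs on synthetic data; the macro averaging (each class weighted equally) is an assumption about what is wanted here.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data (hypothetical): the class depends on the first feature
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 2))
y_demo = np.digitize(X_demo[:, 0], bins=[0.5, 1.2])   # classes 0, 1, 2

metrics = ["accuracy", "precision_macro", "recall_macro"]
model = DecisionTreeClassifier(max_depth=2, random_state=0)
results = cross_validate(model, X_demo, y_demo, cv=5, scoring=metrics)

# cross_validate returns one array of fold scores per metric under "test_<name>"
summary = {name: results["test_" + name].mean() for name in metrics}
print(summary)
```

On the real data, replace `X_demo` and `y_demo` with `X` and `y_encoded`.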

Problem: Change the model depth and see how the model performance changes

In [135]:
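def depthDT(depth):
    # Mean 5-fold cross-validated accuracy for a tree of the given depth
    model = DecisionTreeClassifier(max_depth=depth)
    # cross_val_score clones and fits the model on every fold, so no separate fit is needed
    score = cross_val_score(model, X, y_encoded, scoring="accuracy", cv=5, n_jobs=-1)
    return np.mean(score)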
In [139]:
scores = []
for i in range(1,20,1):
    score = depthDT(i)
    scores.append(score)
In [140]:
plt.plot(range(1, 20), scores)  # x-axis shows the actual tree depth
Out[140]:
[<matplotlib.lines.Line2D at 0x12fb12b00>]
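
The depth sweep above reports only validation accuracy. Comparing it with training accuracy shows where the tree starts to overfit: training accuracy keeps rising with depth while validation accuracy flattens or drops. scikit-learn's `validation_curve` returns both in one call. The sketch below uses synthetic noisy data, and the depth range of 1 to 10 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy binary data (hypothetical): label mostly driven by the first feature
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(400, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

depths = np.arange(1, 11)
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo,
    param_name="max_depth", param_range=depths, cv=5, scoring="accuracy")

# Each result has one row per depth and one column per fold
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)
best_depth = depths[valid_mean.argmax()]
print("best depth by validation accuracy:", best_depth)
```

Plotting `train_mean` and `valid_mean` against `depths` on the same axes makes the gap between the two curves visible.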