Practical Machine Learning - Day 1

Airwatch Bangalore
May 7-9, 2018


In [1]:
1 + 2
Out[1]:
3

This is a simple text cell.

You can execute any cell by pressing Shift+Enter.

You can convert a cell to a markdown cell by pressing Esc + M, or by using the dropdown in the toolbar at the top.

To convert a markdown cell back to a code cell, press Esc + Y.

Data Science Libraries

For Data Science we are going to use the following libraries:

  • numpy - numerical computation
  • pandas - spreadsheet for hackers
  • matplotlib - plotting
  • scikit-learn (sklearn) - machine learning library

numpy

In [2]:
import numpy as np
In [3]:
x = np.array([1, 2, 3])
In [4]:
x.shape
Out[4]:
(3,)
In [5]:
x.dtype
Out[5]:
dtype('int64')
In [6]:
x.sum()
Out[6]:
6
In [7]:
x + 1
Out[7]:
array([2, 3, 4])
In [8]:
d = np.array([[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]])
In [9]:
d
Out[9]:
array([[ 0.1,  0.2,  0.3],
       [ 1.1,  1.2,  1.3]])
In [10]:
d.shape
Out[10]:
(2, 3)
In [11]:
d * 10
Out[11]:
array([[  1.,   2.,   3.],
       [ 11.,  12.,  13.]])
In [12]:
d
Out[12]:
array([[ 0.1,  0.2,  0.3],
       [ 1.1,  1.2,  1.3]])
In [13]:
d > 1
Out[13]:
array([[False, False, False],
       [ True,  True,  True]], dtype=bool)
In [14]:
np.sum(d > 1)
Out[14]:
3
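
The count above works because a comparison such as `d > 1` returns a boolean array, and `True` counts as 1 when summed. The sketch below spells that out on the same array; the names `mask`, `count`, `per_row` and `fraction` are only illustrative.

```python
import numpy as np

d = np.array([[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]])

mask = d > 1                 # boolean array with the same shape as d
count = int(mask.sum())      # True counts as 1, so this counts the matches
per_row = mask.sum(axis=1)   # number of matches in each row
fraction = mask.mean()       # share of elements that satisfy the condition

print(count, per_row, fraction)
```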

Set all elements less than 1 to 0.

In [15]:
d[d < 1] = 0
In [16]:
d
Out[16]:
array([[ 0. ,  0. ,  0. ],
       [ 1.1,  1.2,  1.3]])
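
Assigning through a boolean mask changes `d` in place. If the original values are still needed, one possible alternative is `np.where`, which builds a new array instead. This is a small sketch, and the name `clipped` is an assumption made for illustration.

```python
import numpy as np

d = np.array([[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]])

# np.where(condition, a, b) takes values from a where the condition is True
# and from b elsewhere, and returns a new array.
clipped = np.where(d < 1, 0, d)

print(clipped)
print(d)  # the original array is left untouched
```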

pandas

In [1]:
import pandas as pd

Pandas has two important data structures: Series and DataFrame.

A Series is a single column, and a DataFrame is a spreadsheet made of such columns.

In [18]:
x = pd.Series([1.1, 2.2, 3.3, 4.4])
In [19]:
x
Out[19]:
0    1.1
1    2.2
2    3.3
3    4.4
dtype: float64
In [20]:
x = pd.Series([1.1, 2.2, 3.3, 4.4], ["a", "b", "c", "d"])
In [21]:
x
Out[21]:
a    1.1
b    2.2
c    3.3
d    4.4
dtype: float64
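
Once a Series has labels, values can be looked up by label as well as filtered by condition. A short sketch using the same Series; the variable names `b_value`, `subset` and `large` are illustrative.

```python
import pandas as pd

x = pd.Series([1.1, 2.2, 3.3, 4.4], index=["a", "b", "c", "d"])

b_value = x["b"]            # look up a single label
subset = x.loc[["a", "d"]]  # look up several labels at once
large = x[x > 3]            # boolean filtering keeps the labels

print(b_value)
print(subset)
print(large)
```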

DataFrame

In [22]:
data = [[1, 1], [2, 4], [3, 9], [4, 16]]
d = pd.DataFrame(data)
In [23]:
d
Out[23]:
   0   1
0  1   1
1  2   4
2  3   9
3  4  16
In [24]:
data = [[1, 1], [2, 4], [3, 9], [4, 16]]
d = pd.DataFrame(data, columns=["x", "y"])
In [25]:
d
Out[25]:
   x   y
0  1   1
1  2   4
2  3   9
3  4  16
In [26]:
d.x
Out[26]:
0    1
1    2
2    3
3    4
Name: x, dtype: int64
In [27]:
d[d.x > 2]
Out[27]:
   x   y
2  3   9
3  4  16
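
Row filters like the one above can be combined. A minimal sketch, assuming the same `x`/`y` frame: use `&` for "and" and `|` for "or", and wrap each condition in parentheses because these operators bind more tightly than comparisons. The names `both` and `either` are illustrative.

```python
import pandas as pd

d = pd.DataFrame([[1, 1], [2, 4], [3, 9], [4, 16]], columns=["x", "y"])

# Rows where both conditions hold.
both = d[(d.x > 1) & (d.y < 16)]

# Rows where at least one condition holds.
either = d[(d.x == 1) | (d.y == 16)]

print(both)
print(either)
```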

Weather Data

In [ ]:
url = "https://archive.org/download/www.imdaws.com-2012/daily.zip/daily%2FARG-2012-07-01.csv"
In [36]:
!wget -nv -O ARG-2012-07-01.csv $url 
2018-05-07 15:47:29 URL:https://notes.pipal.in/2018/airwatch-ml/ARG-2012-07-01.csv [0/0] -> "ARG-2012-07-01.csv" [1]
In [3]:
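# pd.read_csv accepts either a URL or a local path. The second assignment
# switches to the local copy downloaded by wget above.
url = "https://notes.pipal.in/2018/airwatch-ml/ARG-2012-07-01.csv"
url = "ARG-2012-07-01.csv"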
In [59]:
df = pd.read_csv(url)
In [60]:
df.head()
Out[60]:
SR.NO. STATION ID DATE TIME [UTC] LATITUDE [N] LONGITUDE [E] RAINFALL [mm] TEMPERATURE [Deg C] TMAX [Deg C] TMIN [Deg C]
0 1 ANDER 1-Jul-2012 23:00:00 26.1 84.3 0
1 2 AGAR 1-Jul-2012 23:00:00 23.7 76.0 1
2 3 AJAYGARH 1-Jul-2012 23:00:00 24.9 80.2 0
3 4 AKHUPADA 1-Jul-2012 23:00:00 20.9 86.3 0
4 5 AMLOH 1-Jul-2012 23:00:00 30.6 76.2 0
In [61]:
df.shape
Out[61]:
(10859, 10)
In [62]:
df.dtypes
Out[62]:
SR.NO.                   int64
STATION ID              object
DATE                    object
TIME [UTC]              object
LATITUDE [N]           float64
LONGITUDE [E]          float64
RAINFALL [mm]           object
TEMPERATURE [Deg C]     object
TMAX [Deg C]            object
TMIN [Deg C]            object
dtype: object
In [63]:
df["STATION ID"].nunique()
Out[63]:
498
In [64]:
df.columns
Out[64]:
Index(['SR.NO.', 'STATION ID', 'DATE', 'TIME [UTC]', 'LATITUDE [N]',
       'LONGITUDE [E]', 'RAINFALL [mm]', 'TEMPERATURE [Deg C]', 'TMAX [Deg C]',
       'TMIN [Deg C]'],
      dtype='object')

Let us simplify the column names.

In [65]:
df.columns = ["srno", "station", "date", "time", 
              "lat", "lon", "rainfall", 
              "temp", "tmax", "tmin"]
In [66]:
df.head()
Out[66]:
srno station date time lat lon rainfall temp tmax tmin
0 1 ANDER 1-Jul-2012 23:00:00 26.1 84.3 0
1 2 AGAR 1-Jul-2012 23:00:00 23.7 76.0 1
2 3 AJAYGARH 1-Jul-2012 23:00:00 24.9 80.2 0
3 4 AKHUPADA 1-Jul-2012 23:00:00 20.9 86.3 0
4 5 AMLOH 1-Jul-2012 23:00:00 30.6 76.2 0
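
Assigning a full list to `df.columns` depends on the columns being in exactly the expected order. A possible alternative is `rename` with an explicit mapping, sketched here on a tiny hypothetical frame (`sample` and `renamed` are illustrative names, and only two of the ten columns are shown).

```python
import pandas as pd

# A hypothetical two-column sample using the original header names.
sample = pd.DataFrame({"STATION ID": ["AGAR"], "RAINFALL [mm]": ["1"]})

# rename returns a new DataFrame; columns missing from the mapping keep their names.
renamed = sample.rename(columns={
    "STATION ID": "station",
    "RAINFALL [mm]": "rainfall",
})

print(renamed.columns.tolist())
```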

To save the DataFrame to disk, use:

In [12]:
df.to_csv("rainfall.csv")
In [13]:
df.date.unique()
Out[13]:
array(['1-Jul-2012'], dtype=object)
In [14]:
df.date.nunique()
Out[14]:
1
In [15]:
df.time.nunique()
Out[15]:
24
In [16]:
df.time.unique()
Out[16]:
array(['23:00:00', '22:00:00', '21:00:00', '20:00:00', '19:00:00',
       '18:00:00', '17:00:00', '16:00:00', '15:00:00', '14:00:00',
       '13:00:00', '12:00:00', '11:00:00', '10:00:00', '09:00:00',
       '08:00:00', '07:00:00', '06:00:00', '05:00:00', '04:00:00',
       '03:00:00', '02:00:00', '01:00:00', '00:00:00'], dtype=object)
In [17]:
df.dtypes
Out[17]:
srno          int64
station      object
date         object
time         object
lat         float64
lon         float64
rainfall     object
temp         object
tmax         object
tmin         object
dtype: object
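
The `date` and `time` columns are stored as object, that is, as plain strings. A minimal sketch of combining them into real timestamps, assuming the formats seen above (`1-Jul-2012` and `23:00:00`); the `sample` frame and the `ts` name are illustrative and not part of the workshop data.

```python
import pandas as pd

sample = pd.DataFrame({
    "date": ["1-Jul-2012", "1-Jul-2012"],
    "time": ["23:00:00", "00:00:00"],
})

# Join the two strings and parse them with an explicit format.
ts = pd.to_datetime(sample.date + " " + sample.time,
                    format="%d-%b-%Y %H:%M:%S")

print(ts)
print(ts.dt.hour.tolist())
```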

How do we get all entries for the station "AGAR"?

In [18]:
df[df.station == "AGAR"]
Out[18]:
srno station date time lat lon rainfall temp tmax tmin
1 2 AGAR 1-Jul-2012 23:00:00 23.7 76.0 1
470 471 AGAR 1-Jul-2012 22:00:00 23.7 76.0 1
1010 1011 AGAR 1-Jul-2012 21:00:00 23.7 76.0 1
1429 1430 AGAR 1-Jul-2012 20:00:00 23.7 76.0 1
1858 1859 AGAR 1-Jul-2012 19:00:00 23.7 76.0 1
2299 2300 AGAR 1-Jul-2012 18:00:00 23.7 76.0 1
2734 2735 AGAR 1-Jul-2012 17:00:00 23.7 76.0 1
3639 3640 AGAR 1-Jul-2012 15:00:00 23.7 76.0 0
4984 4985 AGAR 1-Jul-2012 12:00:00 23.7 76.0 0
5434 5435 AGAR 1-Jul-2012 11:00:00 23.7 76.0 0
5888 5889 AGAR 1-Jul-2012 10:00:00 23.7 76.0 0
6336 6337 AGAR 1-Jul-2012 09:00:00 23.7 76.0 0
6788 6789 AGAR 1-Jul-2012 08:00:00 23.7 76.0 0
7250 7251 AGAR 1-Jul-2012 07:00:00 23.7 76.0 0
7712 7713 AGAR 1-Jul-2012 06:00:00 23.7 76.0 0
8165 8166 AGAR 1-Jul-2012 05:00:00 23.7 76.0 0
8600 8601 AGAR 1-Jul-2012 04:00:00 23.7 76.0 0
9079 9080 AGAR 1-Jul-2012 03:00:00 23.7 76.0 3
9520 9521 AGAR 1-Jul-2012 02:00:00 23.7 76.0 3
9963 9964 AGAR 1-Jul-2012 01:00:00 23.7 76.0 3
10408 10409 AGAR 1-Jul-2012 00:00:00 23.7 76.0 3
In [19]:
c = df[df.station == "AGAR"].rainfall
In [20]:
c2 = c.astype(float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
In [21]:
c2
Out[21]:
1        1.0
470      1.0
1010     1.0
1429     1.0
1858     1.0
2299     1.0
2734     1.0
3639     0.0
4984     0.0
5434     0.0
5888     0.0
6336     0.0
6788     0.0
7250     0.0
7712     0.0
8165     0.0
8600     0.0
9079     3.0
9520     3.0
9963     3.0
10408    3.0
Name: rainfall, dtype: float64

The conversion works for AGAR, yet the rainfall column as a whole is stored as object rather than as a numeric type, so some other row probably holds an invalid value. Let us find it.

In [22]:
df.rainfall.unique()
Out[22]:
array(['0', '1', '11', '23', '6', '2', '37', ' ', '3', '35', '4', '55',
       '13', '5', '38', '53', '16', '62', '32', '12', '26', '43', '19',
       '20', '8', '24', '9', '25', '30', '21', '41', '42', '47', '7', '14',
       '28', '31', '40', '59', '44', '10', '80', '27', '34', '52', '18',
       '33', '48', '45', '39', '76', '17', '22', '15', '75', '46', '72',
       '36', '66', '60', '51', '29', '50', '395', '286'], dtype=object)
In [71]:
df.rainfall = df.rainfall.replace(" ", "0").astype(float)
In [24]:
df.head()
Out[24]:
srno station date time lat lon rainfall temp tmax tmin
0 1 ANDER 1-Jul-2012 23:00:00 26.1 84.3 0.0
1 2 AGAR 1-Jul-2012 23:00:00 23.7 76.0 1.0
2 3 AJAYGARH 1-Jul-2012 23:00:00 24.9 80.2 0.0
3 4 AKHUPADA 1-Jul-2012 23:00:00 20.9 86.3 0.0
4 5 AMLOH 1-Jul-2012 23:00:00 30.6 76.2 0.0
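
Replacing the blank with "0" treats a missing reading as no rain, which is the choice made above. Another option, sketched below under the assumption that blanks mean "not recorded", is `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into NaN so the decision can be made explicitly afterwards. The `raw` series is a small made-up sample.

```python
import pandas as pd

raw = pd.Series(["0", "1", " ", "11", "395"])

# Values that cannot be parsed as numbers become NaN.
as_nan = pd.to_numeric(raw, errors="coerce")

# Filling with 0 reproduces the choice made in the notebook.
as_zero = as_nan.fillna(0)

print(as_nan.isna().sum())
print(as_zero.tolist())
```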

How many stations had at least some rainfall?

Which station got the most rainfall during that day?

In [25]:
df[df.station == 'SATHYABAMA UNIVERSITY'] 
Out[25]:
srno station date time lat lon rainfall temp tmax tmin
415 416 SATHYABAMA UNIVERSITY 1-Jul-2012 23:00:00 12.9 80.2 0.0 29.3
958 959 SATHYABAMA UNIVERSITY 1-Jul-2012 22:00:00 12.9 80.2 0.0 28.8
959 960 SATHYABAMA UNIVERSITY 1-Jul-2012 22:00:00 12.9 80.2 0.0 28.8
1380 1381 SATHYABAMA UNIVERSITY 1-Jul-2012 21:00:00 12.9 80.2 0.0 28.2
1810 1811 SATHYABAMA UNIVERSITY 1-Jul-2012 20:00:00 12.9 80.2 0.0 28.1
2244 2245 SATHYABAMA UNIVERSITY 1-Jul-2012 19:00:00 12.9 80.2 0.0 28
2679 2680 SATHYABAMA UNIVERSITY 1-Jul-2012 18:00:00 12.9 80.2 0.0 28.5
3137 3138 SATHYABAMA UNIVERSITY 1-Jul-2012 17:00:00 12.9 80.2 0.0 28.5
3586 3587 SATHYABAMA UNIVERSITY 1-Jul-2012 16:00:00 12.9 80.2 0.0 29
4045 4046 SATHYABAMA UNIVERSITY 1-Jul-2012 15:00:00 12.9 80.2 0.0 28.8
4502 4503 SATHYABAMA UNIVERSITY 1-Jul-2012 14:00:00 12.9 80.2 0.0 29.1
4935 4936 SATHYABAMA UNIVERSITY 1-Jul-2012 13:00:00 12.9 80.2 0.0 29.1
5837 5838 SATHYABAMA UNIVERSITY 1-Jul-2012 11:00:00 12.9 80.2 0.0 30.5
6281 6282 SATHYABAMA UNIVERSITY 1-Jul-2012 10:00:00 12.9 80.2 0.0 33.8
6738 6739 SATHYABAMA UNIVERSITY 1-Jul-2012 09:00:00 12.9 80.2 0.0 34.1
7197 7198 SATHYABAMA UNIVERSITY 1-Jul-2012 08:00:00 12.9 80.2 0.0 33.2
7662 7663 SATHYABAMA UNIVERSITY 1-Jul-2012 07:00:00 12.9 80.2 0.0 32
8117 8118 SATHYABAMA UNIVERSITY 1-Jul-2012 06:00:00 12.9 80.2 0.0 31.2
8549 8550 SATHYABAMA UNIVERSITY 1-Jul-2012 05:00:00 12.9 80.2 0.0 30.3
9024 9025 SATHYABAMA UNIVERSITY 1-Jul-2012 04:00:00 12.9 80.2 0.0 29.7
9466 9467 SATHYABAMA UNIVERSITY 1-Jul-2012 03:00:00 12.9 80.2 9.0 29 24.2
9910 9911 SATHYABAMA UNIVERSITY 1-Jul-2012 02:00:00 12.9 80.2 9.0 27.7
10357 10358 SATHYABAMA UNIVERSITY 1-Jul-2012 01:00:00 12.9 80.2 9.0 27.1
10805 10806 SATHYABAMA UNIVERSITY 1-Jul-2012 00:00:00 12.9 80.2 9.0 26.2
In [26]:
df.rainfall.sum()
Out[26]:
19279.0
In [27]:
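# numeric_only=True keeps the string columns (date, time, temp, ...) out of the sum
df_station = df.groupby(["station", "lat", "lon"]).sum(numeric_only=True)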
In [28]:
# delete the srno column
del df_station['srno']
In [29]:
df_station.head()
Out[29]:
                    rainfall
station  lat  lon
ADAMPUR  31.4 75.7       0.0
AGAR     23.7 76.0      19.0
AJAYGARH 24.9 80.2       0.0
AKHUPADA 20.9 86.3       0.0
ALOT     23.8 75.5       8.0
In [30]:
df_station.reset_index(inplace=True)
In [31]:
df_station.head()
Out[31]:
station lat lon rainfall
0 ADAMPUR 31.4 75.7 0.0
1 AGAR 23.7 76.0 19.0
2 AJAYGARH 24.9 80.2 0.0
3 AKHUPADA 20.9 86.3 0.0
4 ALOT 23.8 75.5 8.0
In [32]:
np.sum(df_station.rainfall > 0)
Out[32]:
171
In [33]:
df_station.sort_values('rainfall', ascending=False).head()
Out[33]:
station lat lon rainfall
94 CHAKUR 18.3 76.5 711.0
347 PAUNI 20.8 79.6 632.0
458 THAMARASSERY 11.4 75.9 618.0
304 MURTIZAPUR 20.7 77.4 608.0
492 VILANGANKUNNU 10.6 76.2 603.0
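
The two questions can also be answered directly from a per-station total. This sketch uses a tiny made-up table (`readings`, with stations "A", "B" and "C") rather than the real data, so the numbers are illustrative only.

```python
import pandas as pd

readings = pd.DataFrame({
    "station": ["A", "A", "B", "B", "C"],
    "rainfall": [0.0, 3.0, 0.0, 0.0, 5.0],
})

# Total rainfall per station, as a Series indexed by station name.
totals = readings.groupby("station")["rainfall"].sum()

wet_stations = int((totals > 0).sum())  # stations with at least some rain
wettest = totals.idxmax()               # label of the largest total

print(wet_stations, wettest)
```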
In [34]:
x = df.groupby('station').count()['rainfall'].value_counts()
In [35]:
x
Out[35]:
23    136
24    109
22    109
21     49
20     25
25     22
19     11
26      5
3       4
12      4
16      4
1       3
11      2
10      2
6       2
5       2
18      2
2       2
14      1
13      1
17      1
9       1
15      1
Name: rainfall, dtype: int64
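
Some stations have 25 or 26 readings for a 24-hour day, which suggests repeated rows (the SATHYABAMA UNIVERSITY listing above shows 22:00:00 twice). One way to check for this, sketched on a small made-up table, is `duplicated` and `drop_duplicates` on the station and time columns. Whether dropping the repeats is right for the real data is an assumption to verify.

```python
import pandas as pd

readings = pd.DataFrame({
    "station": ["S", "S", "S", "T"],
    "time": ["23:00:00", "22:00:00", "22:00:00", "23:00:00"],
    "rainfall": [0.0, 0.0, 0.0, 1.0],
})

# True for every row that repeats an earlier (station, time) pair.
dupes = readings.duplicated(subset=["station", "time"])

# Keep only the first row of each (station, time) pair.
deduped = readings.drop_duplicates(subset=["station", "time"])

per_station = deduped.groupby("station").size()
print(dupes.sum(), per_station.to_dict())
```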
In [36]:
%matplotlib inline
In [37]:
(df.groupby('station')
    .count()
    .rainfall
    .value_counts()
    .sort_index()
    .plot(kind="bar"))
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x1150da668>

Visualization

In [38]:
data = [
    ["N", 25, 8],
    ["E", 5, 2],
    ["W", 10, 3],
    ["S", 15, 5],
    ["C", 20, 6]]
columns = ["Area", "Sales", "Profit"]
# Use a separate name so the rainfall DataFrame in df is not overwritten.
sales = pd.DataFrame(data, columns=columns)
In [39]:
sales
Out[39]:
Area Sales Profit
0 N 25 8
1 E 5 2
2 W 10 3
3 S 15 5
4 C 20 6

Let's get back to the rainfall data.

Where in the country is it raining the most? Can you show it visually?

In [41]:
df_station.head()
Out[41]:
station lat lon rainfall
0 ADAMPUR 31.4 75.7 0.0
1 AGAR 23.7 76.0 19.0
2 AJAYGARH 24.9 80.2 0.0
3 AKHUPADA 20.9 86.3 0.0
4 ALOT 23.8 75.5 8.0
In [43]:
# Add this line to the beginning of your notebook
%matplotlib inline
In [67]:
df.columns
Out[67]:
Index(['srno', 'station', 'date', 'time', 'lat', 'lon', 'rainfall', 'temp',
       'tmax', 'tmin'],
      dtype='object')
In [74]:
df_station = df.groupby(["station", "lat", "lon"]).sum(numeric_only=True).reset_index()
In [75]:
df_station.head(50).plot(kind="bar", x="station", y="rainfall")
Out[75]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a1295f8>
In [76]:
(df_station
 .sort_values("rainfall")
 .tail(50)
 .plot(kind="barh", x="station", y="rainfall", figsize=(5, 20)))
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a212748>
In [54]:
df_station.rainfall.hist(bins=50)
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x1189586d8>
In [84]:
df_station.plot(kind="scatter", 
                x="lon", y="lat", 
                s=df_station.rainfall,
                figsize=(4, 4), 
                alpha=0.3)
Out[84]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ab850b8>
In [99]:
import matplotlib.pyplot as plt
plt.style.use("default")
In [100]:
df_station.plot(kind="scatter", 
                x="lon", y="lat", 
                c="rainfall",
                figsize=(4, 4), 
                alpha=0.3,
                cmap="Reds")
Out[100]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b91fbe0>
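
The two scatter plots above encode rainfall once as marker size and once as colour. A possible next step is to use both in one plot. The sketch below uses three made-up stations and scales the marker sizes to a fixed range so that dry stations remain visible; the scaling constants and the name `sizes` are arbitrary choices for illustration.

```python
import pandas as pd

stations = pd.DataFrame({
    "station": ["P", "Q", "R"],
    "lat": [18.3, 20.8, 11.4],
    "lon": [76.5, 79.6, 75.9],
    "rainfall": [0.0, 120.0, 600.0],
})

# Marker area between 10 and 210 points^2, proportional to rainfall.
sizes = 10 + 200 * stations.rainfall / stations.rainfall.max()

ax = stations.plot(kind="scatter",
                   x="lon", y="lat",
                   s=sizes,          # size encodes rainfall
                   c="rainfall",     # colour encodes rainfall too
                   cmap="Reds",
                   alpha=0.6,
                   figsize=(4, 4))
ax.set_title("Rainfall by station")
```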