Practical Machine Learning - Day 1

Airwatch Bangalore
May 7-9, 2018

1 + 2

3

This is a simple text cell.

You can execute any cell by pressing Shift+Enter.

You can convert a cell to a markdown cell by pressing Esc + M. You can also do that with the cell-type dropdown in the toolbar at the top.

To convert a markdown cell back to a code cell, press Esc + Y.

Data Science Libraries

For Data Science we are going to use the following libraries:

numpy - numerical computation
pandas - spreadsheet for hackers
matplotlib - plotting
scikit-learn (sklearn) - machine learning library

numpy

import numpy as np

x = np.array([1, 2, 3])

x.shape

(3,)

x.dtype

dtype('int64')

x.sum()

6

x + 1

array([2, 3, 4])
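The `x + 1` cell above works because numpy applies arithmetic to every element of the array. A minimal sketch of the same idea, with a plain Python list for comparison (the variable names here are only illustrative):

```python
import numpy as np

x = np.array([1, 2, 3])

# Element-wise arithmetic: the scalar is applied to every element
plus_one = x + 1
doubled = x * 2

# The same operation on a plain Python list needs an explicit loop
plus_one_list = [v + 1 for v in [1, 2, 3]]

print(plus_one)       # [2 3 4]
print(doubled)        # [2 4 6]
print(plus_one_list)  # [2, 3, 4]
```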

d = np.array([[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]])

d

array([[ 0.1,  0.2,  0.3],
       [ 1.1,  1.2,  1.3]])

d.shape

(2, 3)

d * 10

array([[  1.,   2.,   3.],
       [ 11.,  12.,  13.]])

d

array([[ 0.1,  0.2,  0.3],
       [ 1.1,  1.2,  1.3]])

d > 1

array([[False, False, False],
       [ True,  True,  True]], dtype=bool)

np.sum(d > 1)

3
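Summing a boolean array counts its True entries, because True is treated as 1 and False as 0. A small sketch of a few related counts on the same array (the names `mask`, `count_per_row` and `fraction` are illustrative):

```python
import numpy as np

d = np.array([[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]])
mask = d > 1

count_all = int(mask.sum())       # total number of True entries
count_per_row = mask.sum(axis=1)  # True entries in each row
fraction = float(mask.mean())     # share of entries greater than 1

print(count_all)      # 3
print(count_per_row)  # [0 3]
print(fraction)       # 0.5
```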

Set all elements less than 1 to 0.

d[d < 1] = 0

d

array([[ 0. ,  0. ,  0. ],
       [ 1.1,  1.2,  1.3]])
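The assignment `d[d < 1] = 0` changes `d` in place. If you would rather keep the original array, one possible alternative is `np.where`, which builds a new array. A minimal sketch (the name `cleaned` is illustrative):

```python
import numpy as np

d = np.array([[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]])

# np.where(condition, value_if_true, value_if_false) returns a new array
cleaned = np.where(d < 1, 0, d)

print(cleaned)  # first row is all zeros, second row is unchanged
print(d[0, 0])  # 0.1, the original array is untouched
```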

pandas

import pandas as pd

Pandas has two important data structures: Series and DataFrame.

A Series is like a single column, and a DataFrame is like a spreadsheet.

x = pd.Series([1.1, 2.2, 3.3, 4.4])

x

0    1.1
1    2.2
2    3.3
3    4.4
dtype: float64

x = pd.Series([1.1, 2.2, 3.3, 4.4], index=["a", "b", "c", "d"])

x

a    1.1
b    2.2
c    3.3
d    4.4
dtype: float64
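Once a Series has labels, values can be looked up either by label or by position. A short sketch of both, using the same Series (the names `by_label`, `by_position` and `subset` are illustrative):

```python
import pandas as pd

x = pd.Series([1.1, 2.2, 3.3, 4.4], index=["a", "b", "c", "d"])

by_label = x["b"]        # look up a single value by its label
by_position = x.iloc[1]  # look up the same value by its position
subset = x[["a", "d"]]   # select several labels at once

print(by_label)     # 2.2
print(by_position)  # 2.2
print(subset)
```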

DataFrame

data = [[1, 1], [2, 4], [3, 9], [4, 16]]
d = pd.DataFrame(data)

d

data = [[1, 1], [2, 4], [3, 9], [4, 16]]
d = pd.DataFrame(data, columns=["x", "y"])

d

d.x

0    1
1    2
2    3
3    4
Name: x, dtype: int64

d[d.x > 2]
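The expression `d.x > 2` produces a boolean Series, and indexing with it keeps only the rows where it is True. Several conditions can be combined with `&` (and) and `|` (or), provided each condition is wrapped in parentheses. A small sketch on the same data (the names `mask`, `both` and `either` are illustrative):

```python
import pandas as pd

data = [[1, 1], [2, 4], [3, 9], [4, 16]]
d = pd.DataFrame(data, columns=["x", "y"])

mask = d.x > 2                      # one True/False value per row
both = d[(d.x > 2) & (d.y < 10)]    # rows that satisfy both conditions
either = d[(d.x < 2) | (d.y > 10)]  # rows that satisfy at least one

print(mask.tolist())      # [False, False, True, True]
print(both.x.tolist())    # [3]
print(either.x.tolist())  # [1, 4]
```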

Weather Data

url = "https://archive.org/download/www.imdaws.com-2012/daily.zip/daily%2FARG-2012-07-01.csv"

!wget -nv -O ARG-2012-07-01.csv $url

2018-05-07 15:47:29 URL:https://notes.pipal.in/2018/airwatch-ml/ARG-2012-07-01.csv [0/0] -> "ARG-2012-07-01.csv" [1]

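# Alternative remote copy of the same file
url = "https://notes.pipal.in/2018/airwatch-ml/ARG-2012-07-01.csv"
# Read the local copy downloaded above (this assignment is the one that takes effect)
url = "ARG-2012-07-01.csv"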

df = pd.read_csv(url)

df.head()

df.shape

(10859, 10)

df.dtypes

SR.NO.                   int64
STATION ID              object
DATE                    object
TIME [UTC]              object
LATITUDE [N]           float64
LONGITUDE [E]          float64
RAINFALL [mm]           object
TEMPERATURE [Deg C]     object
TMAX [Deg C]            object
TMIN [Deg C]            object
dtype: object

df["STATION ID"].nunique()

498

df.columns

Index(['SR.NO.', 'STATION ID', 'DATE', 'TIME [UTC]', 'LATITUDE [N]',
       'LONGITUDE [E]', 'RAINFALL [mm]', 'TEMPERATURE [Deg C]', 'TMAX [Deg C]',
       'TMIN [Deg C]'],
      dtype='object')

Let us simplify the column names.

df.columns = ["srno", "station", "date", "time", 
              "lat", "lon", "rainfall", 
              "temp", "tmax", "tmin"]
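Assigning to `df.columns` replaces every name at once and relies on the order of the list. If only some columns need new names, `rename` with an explicit mapping is one possible alternative. A minimal sketch on a one-row stand-in DataFrame (the data below is made up for illustration, only the column names come from the file):

```python
import pandas as pd

# Stand-in for the weather data, with two of the original column names
df = pd.DataFrame({"STATION ID": ["AGAR"], "RAINFALL [mm]": ["1"]})

# rename returns a new DataFrame; columns missing from the mapping keep their names
renamed = df.rename(columns={"STATION ID": "station",
                             "RAINFALL [mm]": "rainfall"})

print(list(renamed.columns))  # ['station', 'rainfall']
print(list(df.columns))       # ['STATION ID', 'RAINFALL [mm]']
```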

df.head()

To save it to disk, use:

df.to_csv("rainfall.csv")
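By default `to_csv` also writes the row index as an unnamed first column. If that column is not wanted, `index=False` leaves it out. A small sketch that writes to a string instead of a file, so nothing is created on disk (the two-row DataFrame is a stand-in):

```python
import io
import pandas as pd

df = pd.DataFrame({"station": ["AGAR", "ALOT"], "rainfall": [19.0, 8.0]})

# Without a path, to_csv returns the CSV text instead of writing a file
with_index = df.to_csv()                  # first column holds the row index
without_index = df.to_csv(index=False)    # only the real columns

# Reading the text back gives the same two columns
restored = pd.read_csv(io.StringIO(without_index))

print(with_index.splitlines()[0])     # ,station,rainfall
print(without_index.splitlines()[0])  # station,rainfall
```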

df.date.unique()

array(['1-Jul-2012'], dtype=object)

df.date.nunique()

1

df.time.nunique()

24

df.time.unique()

array(['23:00:00', '22:00:00', '21:00:00', '20:00:00', '19:00:00',
       '18:00:00', '17:00:00', '16:00:00', '15:00:00', '14:00:00',
       '13:00:00', '12:00:00', '11:00:00', '10:00:00', '09:00:00',
       '08:00:00', '07:00:00', '06:00:00', '05:00:00', '04:00:00',
       '03:00:00', '02:00:00', '01:00:00', '00:00:00'], dtype=object)
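Both `date` and `time` are stored as strings. If you need real timestamps (for sorting or extracting the hour), one option is to join the two columns and parse them with `pd.to_datetime`. A sketch on two stand-in rows; the format string is an assumption based on the values shown above and on English month abbreviations:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["1-Jul-2012", "1-Jul-2012"],
    "time": ["23:00:00", "00:00:00"],
})

# Join the two string columns and parse them with an explicit format
df["timestamp"] = pd.to_datetime(df.date + " " + df.time,
                                 format="%d-%b-%Y %H:%M:%S")
df["hour"] = df.timestamp.dt.hour

print(df.hour.tolist())    # [23, 0]
print(df.timestamp.min())  # 2012-07-01 00:00:00
```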

df.dtypes

srno          int64
station      object
date         object
time         object
lat         float64
lon         float64
rainfall     object
temp         object
tmax         object
tmin         object
dtype: object

How do we get all entries for the station "AGAR"?

df[df.station == "AGAR"]

c = df[df.station == "AGAR"].rainfall

c2 = c.astype(float)

c2

1        1.0
470      1.0
1010     1.0
1429     1.0
1858     1.0
2299     1.0
2734     1.0
3639     0.0
4984     0.0
5434     0.0
5888     0.0
6336     0.0
6788     0.0
7250     0.0
7712     0.0
8165     0.0
8600     0.0
9079     3.0
9520     3.0
9963     3.0
10408    3.0
Name: rainfall, dtype: float64

The rainfall column has dtype object rather than a numeric type, which suggests it contains at least one non-numeric value. Let us find it.

df.rainfall.unique()

array(['0', '1', '11', '23', '6', '2', '37', ' ', '3', '35', '4', '55',
       '13', '5', '38', '53', '16', '62', '32', '12', '26', '43', '19',
       '20', '8', '24', '9', '25', '30', '21', '41', '42', '47', '7', '14',
       '28', '31', '40', '59', '44', '10', '80', '27', '34', '52', '18',
       '33', '48', '45', '39', '76', '17', '22', '15', '75', '46', '72',
       '36', '66', '60', '51', '29', '50', '395', '286'], dtype=object)

df.rainfall = df.rainfall.replace(" ", "0").astype(float)
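The line above treats a blank reading as zero rainfall. If you would rather keep blanks as missing values and decide later how to handle them, `pd.to_numeric` with `errors="coerce"` is one possible alternative. A minimal sketch on a few of the values seen above (the names `raw`, `as_numbers` and `filled` are illustrative):

```python
import pandas as pd

raw = pd.Series(["0", "1", " ", "23", "395"])

# Values that cannot be parsed as numbers become NaN instead of raising
as_numbers = pd.to_numeric(raw, errors="coerce")
n_missing = int(as_numbers.isna().sum())

# Filling with 0 afterwards gives the same values as replace(" ", "0")
filled = as_numbers.fillna(0)

print(n_missing)        # 1
print(filled.tolist())  # [0.0, 1.0, 0.0, 23.0, 395.0]
```

Counting the missing values first shows how much data is affected before it is filled in.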

df.head()

How many stations had at least some rainfall?

Which station got the most rainfall on that day?

df[df.station == 'SATHYABAMA UNIVERSITY']

df.rainfall.sum()

19279.0

# numeric_only=True limits sum() to the numeric columns (srno and rainfall)
df_station = df.groupby(["station", "lat", "lon"]).sum(numeric_only=True)

# delete the srno column
del df_station['srno']

df_station.head()

df_station.reset_index(inplace=True)

df_station.head()

np.sum(df_station.rainfall > 0)

171

df_station.sort_values('rainfall', ascending=False).head()
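The two questions above can also be answered directly from the per-station totals. A small sketch on made-up readings for three hypothetical stations A, B and C (not taken from the weather file):

```python
import pandas as pd

# Made-up readings, several rows per station
readings = pd.DataFrame({
    "station": ["A", "A", "B", "B", "C"],
    "rainfall": [0.0, 0.0, 2.0, 5.0, 1.0],
})

# Total rainfall per station
totals = readings.groupby("station")["rainfall"].sum()

n_with_rain = int((totals > 0).sum())  # stations with at least some rainfall
wettest = totals.idxmax()              # label of the station with the largest total

print(n_with_rain)  # 2
print(wettest)      # B
```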

x = df.groupby('station').count()['rainfall'].value_counts()

x

23    136
24    109
22    109
21     49
20     25
25     22
19     11
26      5
3       4
12      4
16      4
1       3
11      2
10      2
6       2
5       2
18      2
2       2
14      1
13      1
17      1
9       1
15      1
Name: rainfall, dtype: int64
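The output above is a count of counts: the left column is the number of readings a station reported, and the right column is how many stations reported that many. A sketch of the two steps on made-up readings for three hypothetical stations:

```python
import pandas as pd

readings = pd.DataFrame({
    "station": ["A", "A", "B", "B", "C"],
    "rainfall": [0.0, 0.0, 2.0, 5.0, 1.0],
})

# Step 1: number of readings per station
per_station = readings.groupby("station")["rainfall"].count()

# Step 2: how many stations share each reading count
distribution = per_station.value_counts().sort_index()

print(per_station.to_dict())   # {'A': 2, 'B': 2, 'C': 1}
print(distribution.to_dict())  # {1: 1, 2: 2}
```

Here one station has a single reading and two stations have two readings each.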

%matplotlib inline

(df.groupby('station')
    .count()
    .rainfall
    .value_counts()
    .sort_index()
    .plot(kind="bar"))

<matplotlib.axes._subplots.AxesSubplot at 0x1150da668>

Visualization

data = [
    ["N", 25, 8],
    ["E", 5, 2],
    ["W", 10, 3],
    ["S", 15, 5],
    ["C", 20, 6]]
columns = ["Area", "Sales", "Profit"]
df = pd.DataFrame(data, columns=columns)

df

Let's get back to the rainfall data.

Where in the country is it raining the most? Can you show it visually?

df_station.head()

# Add this line to the beginning of your notebook
%matplotlib inline

df.columns

Index(['srno', 'station', 'date', 'time', 'lat', 'lon', 'rainfall', 'temp',
       'tmax', 'tmin'],
      dtype='object')

df_station = df.groupby(["station", "lat", "lon"]).sum(numeric_only=True).reset_index()

df_station.head(50).plot(kind="bar", x="station", y="rainfall")

<matplotlib.axes._subplots.AxesSubplot at 0x11a1295f8>

(df_station
 .sort_values("rainfall")
 .tail(50)
 .plot(kind="barh", x="station", y="rainfall", figsize=(5, 20)))

<matplotlib.axes._subplots.AxesSubplot at 0x11a212748>

df_station.rainfall.hist(bins=50)

<matplotlib.axes._subplots.AxesSubplot at 0x1189586d8>

df_station.plot(kind="scatter", 
                x="lon", y="lat", 
                s=df_station.rainfall,
                figsize=(4, 4), 
                alpha=0.3)

<matplotlib.axes._subplots.AxesSubplot at 0x11ab850b8>
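In the scatter plot above, `s` sets the marker area (in points squared), so raw rainfall totals in the hundreds produce very large markers and zero totals produce invisible ones. One possible approach is to rescale the totals to a chosen maximum area first. A minimal sketch using three rows from the station totals; `max_area` is an assumed value picked only for illustration:

```python
import pandas as pd

# Three stations taken from the totals computed earlier
df_station = pd.DataFrame({
    "station": ["ADAMPUR", "AGAR", "CHAKUR"],
    "lat": [31.4, 23.7, 18.3],
    "lon": [75.7, 76.0, 76.5],
    "rainfall": [0.0, 19.0, 711.0],
})

max_area = 200.0  # assumption: area of the largest marker, in points squared

# Scale so the wettest station gets max_area and the others shrink in proportion
sizes = df_station.rainfall / df_station.rainfall.max() * max_area

print(sizes.round(1).tolist())
```

Passing `s=sizes` instead of `s=df_station.rainfall` to the scatter call keeps every marker within the chosen range.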

import matplotlib.pyplot as plt
plt.style.use("default")

df_station.plot(kind="scatter", 
                x="lon", y="lat", 
                c="rainfall",
                figsize=(4, 4), 
                alpha=0.3,
                cmap="Reds")

<matplotlib.axes._subplots.AxesSubplot at 0x11b91fbe0>

Output of df[df.station == 'SATHYABAMA UNIVERSITY']:
	srno	station	date	time	lat	lon	rainfall	temp	tmin
415	416	SATHYABAMA UNIVERSITY	1-Jul-2012	23:00:00	12.9	80.2	0.0	29.3
958	959	SATHYABAMA UNIVERSITY	1-Jul-2012	22:00:00	12.9	80.2	0.0	28.8
959	960	SATHYABAMA UNIVERSITY	1-Jul-2012	22:00:00	12.9	80.2	0.0	28.8
1380	1381	SATHYABAMA UNIVERSITY	1-Jul-2012	21:00:00	12.9	80.2	0.0	28.2
1810	1811	SATHYABAMA UNIVERSITY	1-Jul-2012	20:00:00	12.9	80.2	0.0	28.1
2244	2245	SATHYABAMA UNIVERSITY	1-Jul-2012	19:00:00	12.9	80.2	0.0	28
2679	2680	SATHYABAMA UNIVERSITY	1-Jul-2012	18:00:00	12.9	80.2	0.0	28.5
3137	3138	SATHYABAMA UNIVERSITY	1-Jul-2012	17:00:00	12.9	80.2	0.0	28.5
3586	3587	SATHYABAMA UNIVERSITY	1-Jul-2012	16:00:00	12.9	80.2	0.0	29
4045	4046	SATHYABAMA UNIVERSITY	1-Jul-2012	15:00:00	12.9	80.2	0.0	28.8
4502	4503	SATHYABAMA UNIVERSITY	1-Jul-2012	14:00:00	12.9	80.2	0.0	29.1
4935	4936	SATHYABAMA UNIVERSITY	1-Jul-2012	13:00:00	12.9	80.2	0.0	29.1
5837	5838	SATHYABAMA UNIVERSITY	1-Jul-2012	11:00:00	12.9	80.2	0.0	30.5
6281	6282	SATHYABAMA UNIVERSITY	1-Jul-2012	10:00:00	12.9	80.2	0.0	33.8
6738	6739	SATHYABAMA UNIVERSITY	1-Jul-2012	09:00:00	12.9	80.2	0.0	34.1
7197	7198	SATHYABAMA UNIVERSITY	1-Jul-2012	08:00:00	12.9	80.2	0.0	33.2
7662	7663	SATHYABAMA UNIVERSITY	1-Jul-2012	07:00:00	12.9	80.2	0.0	32
8117	8118	SATHYABAMA UNIVERSITY	1-Jul-2012	06:00:00	12.9	80.2	0.0	31.2
8549	8550	SATHYABAMA UNIVERSITY	1-Jul-2012	05:00:00	12.9	80.2	0.0	30.3
9024	9025	SATHYABAMA UNIVERSITY	1-Jul-2012	04:00:00	12.9	80.2	0.0	29.7
9466	9467	SATHYABAMA UNIVERSITY	1-Jul-2012	03:00:00	12.9	80.2	9.0	29	24.2
9910	9911	SATHYABAMA UNIVERSITY	1-Jul-2012	02:00:00	12.9	80.2	9.0	27.7
10357	10358	SATHYABAMA UNIVERSITY	1-Jul-2012	01:00:00	12.9	80.2	9.0	27.1
10805	10806	SATHYABAMA UNIVERSITY	1-Jul-2012	00:00:00	12.9	80.2	9.0	26.2

Output of df_station.head() after grouping by station, lat and lon:
			rainfall
station	lat	lon
ADAMPUR	31.4	75.7	0.0
AGAR	23.7	76.0	19.0
AJAYGARH	24.9	80.2	0.0
AKHUPADA	20.9	86.3	0.0
ALOT	23.8	75.5	8.0

Output of df_station.head() after reset_index:
	station	lat	lon	rainfall
0	ADAMPUR	31.4	75.7	0.0
1	AGAR	23.7	76.0	19.0
2	AJAYGARH	24.9	80.2	0.0
3	AKHUPADA	20.9	86.3	0.0
4	ALOT	23.8	75.5	8.0

Output of df_station.sort_values('rainfall', ascending=False).head():
	station	lat	lon	rainfall
94	CHAKUR	18.3	76.5	711.0
347	PAUNI	20.8	79.6	632.0
458	THAMARASSERY	11.4	75.9	618.0
304	MURTIZAPUR	20.7	77.4	608.0
492	VILANGANKUNNU	10.6	76.2	603.0

Output of df_station.head() in the Visualization section:
	station	lat	lon	rainfall
0	ADAMPUR	31.4	75.7	0.0
1	AGAR	23.7	76.0	19.0
2	AJAYGARH	24.9	80.2	0.0
3	AKHUPADA	20.9	86.3	0.0
4	ALOT	23.8	75.5	8.0

Output of the first df.head(), before the columns were renamed (temperature columns not shown):
	SR.NO.	STATION ID	DATE	TIME [UTC]	LATITUDE [N]	LONGITUDE [E]	RAINFALL [mm]
0	1	ANDER	1-Jul-2012	23:00:00	26.1	84.3	0
1	2	AGAR	1-Jul-2012	23:00:00	23.7	76.0	1
2	3	AJAYGARH	1-Jul-2012	23:00:00	24.9	80.2	0
3	4	AKHUPADA	1-Jul-2012	23:00:00	20.9	86.3	0
4	5	AMLOH	1-Jul-2012	23:00:00	30.6	76.2	0

Output of df for the sales data in the Visualization section:
	Area	Sales	Profit
0	N	25	8
1	E	5	2
2	W	10	3
3	S	15	5
4	C	20	6