Airwatch Bangalore
May 7-9, 2018
1 + 2
This is a simple text.
You can execute any cell by pressing Shift+Enter.
You can convert a cell to a markdown cell by pressing Esc + M. You may also be able to do that using the dropdown at the top.
To convert a markdown cell to a code cell, press Esc + Y.
For Data Science we are going to use the following libraries:
import numpy as np
x = np.array([1, 2, 3])
x.shape
x.dtype
x.sum()
x + 1
d = np.array([[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]])
d
d.shape
d * 10
d
d > 1
np.sum(d > 1)
Set all elements less than 1 to 0.
d[d < 1] = 0
d
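The assignment above changes the array in place. As a minimal sketch, assuming you would rather keep the original values around, `np.where` builds a new array instead. The names `arr`, `mask` and `cleaned` are only illustrative and are not part of the workshop code.

```python
import numpy as np

# Same values as the array d used above, under a separate illustrative name
arr = np.array([[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]])

# Boolean mask: True where the element is less than 1
mask = arr < 1

# np.where(condition, a, b) picks a where the condition is True, otherwise b.
# It returns a new array and leaves arr unchanged.
cleaned = np.where(mask, 0, arr)

print(cleaned)
print(arr)
```

Use the in-place form when the original values are no longer needed, and `np.where` when you want to compare before and after.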
import pandas as pd
Pandas has two important data structures: Series and DataFrame.
A Series is like a single column, and a DataFrame is like a spreadsheet made up of such columns.
x = pd.Series([1.1, 2.2, 3.3, 4.4])
x
x = pd.Series([1.1, 2.2, 3.3, 4.4], ["a", "b", "c", "d"])
x
data = [[1, 1], [2, 4], [3, 9], [4, 16]]
d = pd.DataFrame(data)
d
data = [[1, 1], [2, 4], [3, 9], [4, 16]]
d = pd.DataFrame(data, columns=["x", "y"])
d
d.x
d[d.x > 2]
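The filter `d[d.x > 2]` works in two steps: the comparison produces a boolean Series, and indexing with that Series keeps only the rows marked True. The sketch below spells the steps out and shows one way to combine two conditions. The names `squares`, `mask`, `selected` and `both` are illustrative.

```python
import pandas as pd

# Same data as the DataFrame d above, under an illustrative name
squares = pd.DataFrame([[1, 1], [2, 4], [3, 9], [4, 16]], columns=["x", "y"])

# Step 1: the comparison gives one True/False value per row
mask = squares.x > 2
print(mask.tolist())

# Step 2: indexing with the mask keeps the rows where it is True
selected = squares[mask]
print(selected)

# Conditions are combined with & (and) or | (or).
# Each condition needs its own parentheses.
both = squares[(squares.x > 1) & (squares.y < 16)]
print(both)
```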
url = "https://archive.org/download/www.imdaws.com-2012/daily.zip/daily%2FARG-2012-07-01.csv"
!wget -nv -O ARG-2012-07-01.csv $url
url = "https://notes.pipal.in/2018/airwatch-ml/ARG-2012-07-01.csv"
url = "ARG-2012-07-01.csv"
df = pd.read_csv(url)
df.head()
df.shape
df.dtypes
df["STATION ID"].nunique()
df.columns
Let us simplify the column names.
df.columns = ["srno", "station", "date", "time",
"lat", "lon", "rainfall",
"temp", "tmax", "tmin"]
df.head()
If you want to save it to the disk, then use:
df.to_csv("rainfall.csv")
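By default `to_csv` also writes the row index as an extra, unnamed first column. If you do not want that column when you read the file back, pass `index=False`. The sketch below uses a small made-up DataFrame and a temporary file so that it does not touch `rainfall.csv`. The names `sample`, `path` and `restored` are illustrative.

```python
import os
import tempfile

import pandas as pd

# Small made-up table, only for illustration
sample = pd.DataFrame({"station": ["A", "B"], "rainfall": [0.0, 2.5]})

# Write to a temporary location without the index column
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
sample.to_csv(path, index=False)

# Reading it back gives the same two columns and nothing else
restored = pd.read_csv(path)
print(restored.columns.tolist())
```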
df.date.unique()
df.date.nunique()
df.time.nunique()
df.time.unique()
df.dtypes
How to get all entries for station "AGAR"?
df[df.station == "AGAR"]
c = df[df.station == "AGAR"].rainfall
# np.float was removed in NumPy 1.24; the builtin float does the same job
c2 = c.astype(float)
c2
The rainfall column is not numeric. There may be an invalid value. Let us find that out.
df.rainfall.unique()
# np.float was removed in NumPy 1.24; the builtin float does the same job
df.rainfall = df.rainfall.replace(" ", "0").astype(float)
df.head()
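The replacement above handles one known bad value, a single space. As a hedged alternative, `pd.to_numeric` with `errors="coerce"` turns anything it cannot parse into NaN, which makes it easy to list the offending values before deciding what to do with them. The Series below is made up for illustration. The names `raw`, `as_number`, `bad_values` and `filled` are not from the workshop code.

```python
import pandas as pd

# Made-up readings as strings, with one blank entry like the one found above
raw = pd.Series(["0", "1.5", " ", "12"])

# Values that cannot be parsed as numbers become NaN instead of raising an error
as_number = pd.to_numeric(raw, errors="coerce")

# Look up the original strings at the positions that failed to parse
bad_values = raw[as_number.isna()].unique()
print(list(bad_values))

# Treat the unparseable readings as zero rainfall, as the notebook does
filled = as_number.fillna(0)
print(filled.tolist())
```

Treating a missing reading as 0 is a modelling choice. Leaving it as NaN keeps it out of sums and means without pretending that it did not rain.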
How many stations had at least some rainfall?
Which station got the most rainfall during that day?
df[df.station == 'SATHYABAMA UNIVERSITY']
df.rainfall.sum()
# numeric_only=True keeps text columns such as date and time out of the sum
df_station = df.groupby(["station", "lat", "lon"]).sum(numeric_only=True)
# delete the srno column
del df_station['srno']
df_station.head()
df_station.reset_index(inplace=True)
df_station.head()
np.sum(df_station.rainfall > 0)
df_station.sort_values('rainfall', ascending=False).head()
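The two questions asked earlier (how many stations had some rain, and which station had the most) can also be answered directly from a per-station total. The sketch below uses a tiny made-up table so the result is easy to check by hand. The station names and the variables `readings`, `totals`, `wet_stations` and `wettest` are hypothetical.

```python
import pandas as pd

# Made-up readings: several rows per station, as in the real data
readings = pd.DataFrame({
    "station": ["P", "P", "Q", "Q", "R"],
    "rainfall": [0.0, 0.0, 1.5, 2.0, 0.5],
})

# Total rainfall per station, as a Series indexed by station name
totals = readings.groupby("station")["rainfall"].sum()

# Comparing gives True/False per station; summing counts the True values
wet_stations = int((totals > 0).sum())

# idxmax returns the index label (the station) with the largest total
wettest = totals.idxmax()

print(wet_stations, wettest)
```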
x = df.groupby('station').count()['rainfall'].value_counts()
x
%matplotlib inline
(df.groupby('station')
.count()
.rainfall
.value_counts()
.sort_index()
.plot(kind="bar"))
data = [
["N", 25, 8],
["E", 5, 2],
["W", 10, 3],
["S", 15, 5],
["C", 20, 6]]
columns = ["Area", "Sales", "Profit"]
# Use a separate name so that the rainfall DataFrame df is not overwritten
df_sales = pd.DataFrame(data, columns=columns)
df_sales
Let's get back to the rainfall data.
Where in the country is the most rainfall happening? Can you show it visually?
df_station.head()
# Add this line to the beginning of your notebook
%matplotlib inline
df.columns
# numeric_only=True keeps text columns such as date and time out of the sum
df_station = df.groupby(["station", "lat", "lon"]).sum(numeric_only=True).reset_index()
df_station.head(50).plot(kind="bar", x="station", y="rainfall")
(df_station
.sort_values("rainfall")
.tail(50)
.plot(kind="barh", x="station", y="rainfall", figsize=(5, 20)))
df_station.rainfall.hist(bins=50)
df_station.plot(kind="scatter",
x="lon", y="lat",
s=df_station.rainfall,
figsize=(4, 4),
alpha=0.3)
import matplotlib.pyplot as plt
plt.style.use("default")
df_station.plot(kind="scatter",
x="lon", y="lat",
c="rainfall",
figsize=(4, 4),
alpha=0.3,
cmap="Reds")