Freight Optimization

Notes: https://notes.pipal.in/2018/vmware-ml2/

1 - Introduction | Freight Optimization

In [2]:
import numpy as np
import pandas as pd

Datasets

In [3]:
url = "https://notes.pipal.in/2018/vmware-ml2/10-Secondary-Freight-Data.csv"
matrix_url = "https://notes.pipal.in/2018/vmware-ml2/13-location-distance-matrix.csv"
In [4]:
freight = pd.read_csv(url)
In [5]:
freight.head()
Out[5]:
Sales Order No. New Date Supplying DC Location Customer Code New Customer Town Qty (in cases) Freight/Cartage Remarks Amount
0 2004912014 1-Jul-08 Pune 190886 PUNE 15.0 5 Charges - Per case 75
1 2004912846 1-Jul-08 Ahemadabad 190406 JAMNAGAR 30.0 18 Charges - Per case 540
2 2004913418 1-Jul-08 Jaipur 188582 BAGRU 6.0 20 Charges - Per case 120
3 2004916450 2-Jul-08 Raipur 191024 RAIPUR 23.0 5 Charges - Per case 104
4 2004916806 2-Jul-08 Zirakhpur 207786 Banur 1.0 1,000 Charges - Per Consignment 1,000
In [6]:
freight.dtypes
Out[6]:
Sales Order No. New        int64
Date                      object
Supplying DC Location     object
Customer Code New          int64
Customer Town             object
Qty (in cases)           float64
Freight/Cartage           object
Remarks                   object
Amount                    object
dtype: object
In [8]:
freight.columns
Out[8]:
Index(['Sales Order No. New', 'Date', 'Supplying DC Location',
       'Customer Code New', 'Customer Town', 'Qty (in cases)',
       'Freight/Cartage', 'Remarks', 'Amount'],
      dtype='object')
In [90]:
matrix = pd.read_csv(matrix_url)
In [91]:
matrix.head()
Out[91]:
Source S. Longitude S. Latitude Destination D. Longitude D. Latitude Lane Distance
0 BANGALORE 77.59 12.98 BANGALORE 77.59 12.98 BANGALORE to BANGALORE 0.00
1 BANGALORE 77.59 12.98 CHENNAI 80.24 13.07 BANGALORE to CHENNAI 294.47
2 BANGALORE 77.59 12.98 MUMBAI 72.84 18.98 BANGALORE to MUMBAI 554.79
3 BANGALORE 77.59 12.98 HYDERABAD 78.49 17.39 BANGALORE to HYDERABAD 142.09
4 BANGALORE 77.59 12.98 NEW DELHI 77.17 28.62 BANGALORE to NEW DELHI 381.46

Refine the Data

Basic Cleaning:

  • fix column names
  • drop unused/redundant columns

Basic Cleaning

In [9]:
columns = {
    'Sales Order No. New': 'orderno',
    'Date': 'date', 
    'Supplying DC Location': 'source',
    'Customer Code New': 'custcode',
    'Customer Town': 'dest', 
    'Qty (in cases)': 'qty',
    'Freight/Cartage': 'cartage', 
    'Remarks': 'remarks', 
    'Amount': 'amount'
}
In [12]:
freight.rename(columns=columns, inplace=True)
In [13]:
freight.head()
Out[13]:
orderno date source custcode dest qty cartage remarks amount
0 2004912014 1-Jul-08 Pune 190886 PUNE 15.0 5 Charges - Per case 75
1 2004912846 1-Jul-08 Ahemadabad 190406 JAMNAGAR 30.0 18 Charges - Per case 540
2 2004913418 1-Jul-08 Jaipur 188582 BAGRU 6.0 20 Charges - Per case 120
3 2004916450 2-Jul-08 Raipur 191024 RAIPUR 23.0 5 Charges - Per case 104
4 2004916806 2-Jul-08 Zirakhpur 207786 Banur 1.0 1,000 Charges - Per Consignment 1,000
In [15]:
freight.drop(["orderno", "custcode"], axis=1, inplace=True)
In [16]:
freight.head()
Out[16]:
date source dest qty cartage remarks amount
0 1-Jul-08 Pune PUNE 15.0 5 Charges - Per case 75
1 1-Jul-08 Ahemadabad JAMNAGAR 30.0 18 Charges - Per case 540
2 1-Jul-08 Jaipur BAGRU 6.0 20 Charges - Per case 120
3 2-Jul-08 Raipur RAIPUR 23.0 5 Charges - Per case 104
4 2-Jul-08 Zirakhpur Banur 1.0 1,000 Charges - Per Consignment 1,000
In [18]:
freight.amount.dtype
Out[18]:
dtype('O')
In [23]:
freight.amount.unique()[:20]
Out[23]:
array(['75', '540', '120', '104', '1,000', '84', '255', '138', '48', '39',
       '380', '54', '42', '65', '57', '38', '113', '125', '119', '26'], dtype=object)
In [25]:
freight.dtypes
Out[25]:
date        object
source      object
dest        object
qty        float64
cartage     object
remarks     object
amount      object
dtype: object
In [26]:
# remove commas in the numbers
freight.amount.head()
Out[26]:
0       75
1      540
2      120
3      104
4    1,000
Name: amount, dtype: object
In [33]:
freight.amount.head().str.replace(",", "")
Out[33]:
0      75
1     540
2     120
3     104
4    1000
Name: amount, dtype: object
In [35]:
freight.amount = freight.amount.str.replace(",", "")
In [36]:
freight.amount.head()
Out[36]:
0      75
1     540
2     120
3     104
4    1000
Name: amount, dtype: object

Problem: Replace commas from cartage column.

In [38]:
freight.cartage = freight.cartage.str.replace(",", "")
In [42]:
# fix column types
freight.amount = pd.to_numeric(freight.amount)
freight.cartage = pd.to_numeric(freight.cartage)
freight.date = pd.to_datetime(freight.date)
In [43]:
freight.head()
Out[43]:
date source dest qty cartage remarks amount
0 2008-07-01 Pune PUNE 15.0 5 Charges - Per case 75
1 2008-07-01 Ahemadabad JAMNAGAR 30.0 18 Charges - Per case 540
2 2008-07-01 Jaipur BAGRU 6.0 20 Charges - Per case 120
3 2008-07-02 Raipur RAIPUR 23.0 5 Charges - Per case 104
4 2008-07-02 Zirakhpur Banur 1.0 1000 Charges - Per Consignment 1000
In [44]:
# dtypes
freight.dtypes
Out[44]:
date       datetime64[ns]
source             object
dest               object
qty               float64
cartage             int64
remarks            object
amount              int64
dtype: object

Q: Is the remarks column required?

Lets see.

In [45]:
freight.remarks.unique()
Out[45]:
array(['Charges - Per case', 'Charges - Per Consignment'], dtype=object)
In [46]:
freight[freight.remarks=='Charges - Per case'].head()
Out[46]:
date source dest qty cartage remarks amount
0 2008-07-01 Pune PUNE 15.0 5 Charges - Per case 75
1 2008-07-01 Ahemadabad JAMNAGAR 30.0 18 Charges - Per case 540
2 2008-07-01 Jaipur BAGRU 6.0 20 Charges - Per case 120
3 2008-07-02 Raipur RAIPUR 23.0 5 Charges - Per case 104
5 2008-07-02 Raipur SARIPALI (C.G.) 4.0 21 Charges - Per case 84
In [47]:
freight[freight.remarks=='Charges - Per Consignment'].head()
Out[47]:
date source dest qty cartage remarks amount
4 2008-07-02 Zirakhpur Banur 1.0 1000 Charges - Per Consignment 1000
25 2008-07-03 Delhi NEW DELHI 1.0 1000 Charges - Per Consignment 1000
31 2008-07-03 Ghaziabad NOIDA 17.0 250 Charges - Per Consignment 250
32 2008-07-03 Chennai CHENNAI 4.0 550 Charges - Per Consignment 550
38 2008-07-04 Kolkata KOLKATA 11.0 750 Charges - Per Consignment 750

Fix Missing Values

In [49]:
len(freight)
Out[49]:
60976
In [50]:
freight.isnull().head()
Out[50]:
date source dest qty cartage remarks amount
0 False False False False False False False
1 False False False False False False False
2 False False False False False False False
3 False False False False False False False
4 False False False False False False False
In [51]:
freight.isnull().sum()
Out[51]:
date       0
source     6
dest       0
qty        0
cartage    0
remarks    0
amount     0
dtype: int64
In [52]:
freight[freight.source.isnull()]
Out[52]:
date source dest qty cartage remarks amount
28572 2008-12-29 NaN LEH 120.0 0 Charges - Per case 0
34377 2009-01-31 NaN LEH 109.0 0 Charges - Per case 0
45204 2009-03-31 NaN LEH 15.0 0 Charges - Per case 0
52852 2009-05-18 NaN LEH 135.0 0 Charges - Per case 0
53438 2009-05-22 NaN LEH 160.0 0 Charges - Per case 0
60903 2009-06-30 NaN LEH 154.0 0 Charges - Per case 0
In [53]:
# remove NaN values
freight.dropna(inplace=True)

Standadize text fields

In [58]:
shortunits = {
    "Charges - Per case": "case",
    "Charges - Per Consignment": "consignment"
}
freight.remarks = freight.remarks.replace(shortunits)
In [59]:
freight.remarks.value_counts()
Out[59]:
case           57776
consignment     3194
Name: remarks, dtype: int64
In [61]:
freight.dest.value_counts()
Out[61]:
Chennai             1138
BANGALORE           1134
MUMBAI              1014
AHMEDABAD            799
CHENNAI              779
KOLKATA              748
NEW DELHI            721
HYDERABAD            620
PUNE                 601
Hyderabad            563
THANE                506
New Delhi            469
RANCHI               462
Mumbai               424
LUCKNOW              414
SALEM                382
MADURAI              345
PATNA                342
SURAT                292
JAIPUR               264
RAJKOT               252
Bangalore            245
VARANASI             241
TIRUNELVELI          236
JAMSHEDPUR           226
KANPUR               211
TRIVANDRUM           209
JABALPUR             178
THRISSUR             174
AGRA                 168
                    ... 
BHUPALPALLY            1
Bachepalli             1
Dalsingsarai           1
Bellampalle            1
Thuvarankurichy        1
Distt:Ahmedabad        1
Ghumarwin              1
Puttur                 1
Banga                  1
LALSOT                 1
BANTHARA BAZAAR,       1
KARANPRAYAG            1
DAMANDIU               1
DEBAI                  1
TAMKUHI ROAD           1
TIRODA                 1
Ramanagaram            1
DUNGANJ                1
BARSHI                 1
ANPARA BAZAR           1
Namakkal               1
SONARPUR               1
JAMUI                  1
Bahadurgarh            1
ANDOLE                 1
SHAJAPUR               1
 RAMGANJ MANDI         1
NAWABGANJ              1
Lanka                  1
CHINTPURNI             1
Name: dest, Length: 1923, dtype: int64
In [72]:
def fix_names(c):
    c = c.str.strip()
    c = c.str.title()
    return c
In [73]:
freight.source = fix_names(freight.source)
freight.dest = fix_names(freight.dest)
In [74]:
freight.head()
Out[74]:
date source dest qty cartage remarks amount
0 2008-07-01 Pune Pune 15.0 5 case 75
1 2008-07-01 Ahemadabad Jamnagar 30.0 18 case 540
2 2008-07-01 Jaipur Bagru 6.0 20 case 120
3 2008-07-02 Raipur Raipur 23.0 5 case 104
4 2008-07-02 Zirakhpur Banur 1.0 1000 consignment 1000

Q: Which source has the maximum quantity of transfer?

In [77]:
freight.groupby('source').sum().head()
Out[77]:
qty cartage amount
source
Ahemadabad 67593.0 285560 1692459
Bangalore 221221.0 294249 3254982
Bbsr 44318.0 54520 1099986
Bhiwandi 103646.0 252468 1670543
Chandigarh 4247.0 545 21235
In [81]:
freight.groupby('source').sum().qty.sort_values(ascending=False).head()
Out[81]:
source
Bangalore    221221.0
Chennai      140902.0
Madurai      133278.0
Hyderabad    122391.0
Cochin       103790.0
Name: qty, dtype: float64
In [82]:
(freight.groupby('source')
        .sum()
        .qty
        .sort_values(ascending=False)
        .head()
)
Out[82]:
source
Bangalore    221221.0
Chennai      140902.0
Madurai      133278.0
Hyderabad    122391.0
Cochin       103790.0
Name: qty, dtype: float64

Q: What is the source/destination pair that has the maximum quantity of transfer?

In [84]:
(freight.groupby(['source', 'dest'])
        .sum()
        .qty
        .sort_values(ascending=False)
        .head()
)
Out[84]:
source     dest     
Bangalore  Bangalore    111120.0
Chennai    Chennai       80251.0
Bhiwandi   Mumbai        58666.0
Hyderabad  Hyderabad     54372.0
Delhi      New Delhi     51669.0
Name: qty, dtype: float64

Q: Show data for Bangalore -> Bangalore.

In [86]:
freight[(freight.source == 'Bangalore') & (freight.dest == 'Bangalore')].head()
Out[86]:
date source dest qty cartage remarks amount
36 2008-07-04 Bangalore Bangalore 345.0 7 case 2415
51 2008-07-04 Bangalore Bangalore 54.0 7 case 378
121 2008-07-04 Bangalore Bangalore 126.0 7 case 882
125 2008-07-04 Bangalore Bangalore 40.0 7 case 280
150 2008-07-05 Bangalore Bangalore 76.0 7 case 532
In [88]:
freight['computed_amount'] = freight.qty * freight.cartage
In [89]:
freight.head()
Out[89]:
date source dest qty cartage remarks amount computed_amount
0 2008-07-01 Pune Pune 15.0 5 case 75 75.0
1 2008-07-01 Ahemadabad Jamnagar 30.0 18 case 540 540.0
2 2008-07-01 Jaipur Bagru 6.0 20 case 120 120.0
3 2008-07-02 Raipur Raipur 23.0 5 case 104 115.0
4 2008-07-02 Zirakhpur Banur 1.0 1000 consignment 1000 1000.0

Problem: Refine the location-distance-matrix data and merge it with the freight dataframe.

Hint: pd.merge?

In [94]:
matrix = pd.read_csv(matrix_url)
In [95]:
matrix.head()
Out[95]:
Source S. Longitude S. Latitude Destination D. Longitude D. Latitude Lane Distance
0 BANGALORE 77.59 12.98 BANGALORE 77.59 12.98 BANGALORE to BANGALORE 0.00
1 BANGALORE 77.59 12.98 CHENNAI 80.24 13.07 BANGALORE to CHENNAI 294.47
2 BANGALORE 77.59 12.98 MUMBAI 72.84 18.98 BANGALORE to MUMBAI 554.79
3 BANGALORE 77.59 12.98 HYDERABAD 78.49 17.39 BANGALORE to HYDERABAD 142.09
4 BANGALORE 77.59 12.98 NEW DELHI 77.17 28.62 BANGALORE to NEW DELHI 381.46
In [97]:
# fix column names
columns = {
    "Source": "source",
    "S. Longitude": "slon",
    "S. Latitude": "slat",
    "Destination": "dest",
    "D. Longitude": "dlon",
    "D. Latitude": "dlat",
    "Lane": "lane",
    "Distance": "distance"
}    
matrix.rename(columns=columns, inplace=True)
In [98]:
matrix.head()
Out[98]:
source slon slat dest dlon dlat lane distance
0 BANGALORE 77.59 12.98 BANGALORE 77.59 12.98 BANGALORE to BANGALORE 0.00
1 BANGALORE 77.59 12.98 CHENNAI 80.24 13.07 BANGALORE to CHENNAI 294.47
2 BANGALORE 77.59 12.98 MUMBAI 72.84 18.98 BANGALORE to MUMBAI 554.79
3 BANGALORE 77.59 12.98 HYDERABAD 78.49 17.39 BANGALORE to HYDERABAD 142.09
4 BANGALORE 77.59 12.98 NEW DELHI 77.17 28.62 BANGALORE to NEW DELHI 381.46
In [99]:
# missing values
matrix.isnull().sum()
Out[99]:
source      0
slon        0
slat        0
dest        0
dlon        0
dlat        0
lane        0
distance    0
dtype: int64
In [101]:
# standadize names
matrix.source = fix_names(matrix.source)
matrix.dest = fix_names(matrix.dest)
matrix.head()
Out[101]:
source slon slat dest dlon dlat lane distance
0 Bangalore 77.59 12.98 Bangalore 77.59 12.98 BANGALORE to BANGALORE 0.00
1 Bangalore 77.59 12.98 Chennai 80.24 13.07 BANGALORE to CHENNAI 294.47
2 Bangalore 77.59 12.98 Mumbai 72.84 18.98 BANGALORE to MUMBAI 554.79
3 Bangalore 77.59 12.98 Hyderabad 78.49 17.39 BANGALORE to HYDERABAD 142.09
4 Bangalore 77.59 12.98 New Delhi 77.17 28.62 BANGALORE to NEW DELHI 381.46
In [111]:
(matrix.source + "--" + matrix.dest).nunique(), len(matrix)
Out[111]:
(65890, 65896)
In [113]:
# drop the duplicates
matrix.drop_duplicates(['source', 'dest'], inplace=True)
In [116]:
len(matrix)
Out[116]:
65890
In [114]:
df = pd.merge(freight, matrix, how='left', on=['source', 'dest'])
In [115]:
len(freight), len(matrix), len(df)
Out[115]:
(60970, 65890, 60970)
In [117]:
df.head()
Out[117]:
date source dest qty cartage remarks amount computed_amount slon slat dlon dlat lane distance
0 2008-07-01 Pune Pune 15.0 5 case 75 75.0 73.85 18.53 73.85 18.53 PUNE to PUNE 0.00
1 2008-07-01 Ahemadabad Jamnagar 30.0 18 case 540 540.0 NaN NaN NaN NaN NaN NaN
2 2008-07-01 Jaipur Bagru 6.0 20 case 120 120.0 75.81 26.92 75.53 26.82 JAIPUR to BAGRU 30.66
3 2008-07-02 Raipur Raipur 23.0 5 case 104 115.0 81.65 21.23 81.65 21.23 RAIPUR to RAIPUR 0.00
4 2008-07-02 Zirakhpur Banur 1.0 1000 consignment 1000 1000.0 NaN NaN NaN NaN NaN NaN
In [121]:
df.isnull().sum()
Out[121]:
date                   0
source                 0
dest                   0
qty                    0
cartage                0
remarks                0
amount                 0
computed_amount        0
slon               15782
slat               15782
dlon               15782
dlat               15782
lane               15782
distance           15782
dtype: int64
In [122]:
len(df)
Out[122]:
60970
In [133]:
df[df.isnull().sum(axis=1)>0].head()
Out[133]:
date source dest qty cartage remarks amount computed_amount slon slat dlon dlat lane distance
1 2008-07-01 Ahemadabad Jamnagar 30.0 18 case 540 540.0 NaN NaN NaN NaN NaN NaN
4 2008-07-02 Zirakhpur Banur 1.0 1000 consignment 1000 1000.0 NaN NaN NaN NaN NaN NaN
11 2008-07-02 Cochin Mattanchery 38.0 10 case 380 380.0 NaN NaN NaN NaN NaN NaN
31 2008-07-03 Ghaziabad Noida 17.0 250 consignment 250 4250.0 NaN NaN NaN NaN NaN NaN
37 2008-07-04 Ahemadabad Derol 7.0 24 case 168 168.0 NaN NaN NaN NaN NaN NaN

Q: What is the total amount where distance is missing?

In [134]:
df[df.isnull().sum(axis=1)>0].amount.sum()
Out[134]:
11070970
In [137]:
df[df.isnull().sum(axis=1)>0].amount.sum() / df.amount.sum()
Out[137]:
0.271028672517011
In [138]:
freight.source.nunique(), matrix.source.nunique(), df.source.nunique()
Out[138]:
(31, 54, 31)
In [139]:
freight.source.unique()
Out[139]:
array(['Pune', 'Ahemadabad', 'Jaipur', 'Raipur', 'Zirakhpur', 'Cochin',
       'Rohtak', 'Chennai', 'Delhi', 'Ghaziabad', 'Kolkata', 'Bangalore',
       'Madurai', 'Vijaywada', 'Nagpur', 'Lucknow', 'Coimbatore', 'Goa',
       'Hyderabad', 'Ranchi', 'Bhiwandi', 'Gauwhati', 'Bbsr', 'Varanasi',
       'Patna', 'Indore', 'Rishikesh', 'Jammu', 'Parwanoo', 'Chandigarh',
       'Haldwani'], dtype=object)
In [140]:
matrix.source.unique()
Out[140]:
array(['Bangalore', 'Chennai', 'Mumbai', 'Hyderabad', 'New Delhi', 'Salem',
       'Coimbatore', 'Pune', 'Srinagar', 'Madurai', 'Kolkata', 'Raipur',
       'Lucknow', 'Imphal', 'Tirunelveli', 'Secunderabad', 'Ahmedabad',
       'Jaipur', 'Mangalore', 'Dimapur', 'Thane', 'Ranchi', 'Kanpur',
       'Varanasi', 'Kollam', 'Vishakapatnam', 'Mysore', 'Trichy', 'Hubli',
       'Madanapalle', 'Bareilly', 'Pudukottai', 'Barhampur', 'Tirupur',
       'Ludhiana', 'Jamshedpur', 'Bhubaneshwar', 'Bhiwandi', 'Chandigarh',
       'Cochin', 'Delhi', 'Guwahati', 'Ghaziabad', 'Goa', 'Haldwani',
       'Indore', 'Jammu', 'Nagpur', 'Parwanoo', 'Patna', 'Rishikesh',
       'Rohtak', 'Vijaywada', 'Zirakpur'], dtype=object)
In [143]:
set(freight.source.unique()) - set(matrix.source.unique())
Out[143]:
{'Ahemadabad', 'Bbsr', 'Gauwhati', 'Zirakhpur'}
In [144]:
set(freight.dest.unique()) - set(matrix.dest.unique())
Out[144]:
set()
In [149]:
cities = {
    'Ahemadabad': 'Ahmedabad',
    'Bbsr': 'Bhubaneshwar',
    'Gauwhati': 'Guwahati',
    'Zirakhpur': 'Zirakpur'
}
# freight.replace({"source": cities}).head()
freight.replace({"source": cities}, inplace=True)
In [150]:
freight.head()
Out[150]:
date source dest qty cartage remarks amount computed_amount
0 2008-07-01 Pune Pune 15.0 5 case 75 75.0
1 2008-07-01 Ahmedabad Jamnagar 30.0 18 case 540 540.0
2 2008-07-01 Jaipur Bagru 6.0 20 case 120 120.0
3 2008-07-02 Raipur Raipur 23.0 5 case 104 115.0
4 2008-07-02 Zirakpur Banur 1.0 1000 consignment 1000 1000.0
In [151]:
df = pd.merge(freight, matrix, how='left', on=['source', 'dest'])
In [152]:
df.isnull().sum()
Out[152]:
date                  0
source                0
dest                  0
qty                   0
cartage               0
remarks               0
amount                0
computed_amount       0
slon               7136
slat               7136
dlon               7136
dlat               7136
lane               7136
distance           7136
dtype: int64
In [154]:
df[df.isnull().sum(axis=1) > 0].head()
Out[154]:
date source dest qty cartage remarks amount computed_amount slon slat dlon dlat lane distance
4 2008-07-02 Zirakpur Banur 1.0 1000 consignment 1000 1000.0 NaN NaN NaN NaN NaN NaN
11 2008-07-02 Cochin Mattanchery 38.0 10 case 380 380.0 NaN NaN NaN NaN NaN NaN
31 2008-07-03 Ghaziabad Noida 17.0 250 consignment 250 4250.0 NaN NaN NaN NaN NaN NaN
73 2008-07-04 Cochin Tripunithura 6.0 10 case 60 60.0 NaN NaN NaN NaN NaN NaN
90 2008-07-04 Goa Talegao 11.0 21 case 231 231.0 NaN NaN NaN NaN NaN NaN
In [155]:
df[df.isnull().sum(axis=1) > 0].amount.sum() / df.amount.sum()
Out[155]:
0.12835823942861685
In [166]:
data = df.dropna()
In [207]:
data = data[data.remarks == "case"].copy()

Explore

In [180]:
#!pip install altair
In [161]:
import altair as alt
import matplotlib.pyplot as plt
%matplotlib inline
In [184]:
plt.style.use("ggplot")
In [185]:
data.distance.hist(bins=40)
Out[185]:
<matplotlib.axes._subplots.AxesSubplot at 0x11cd5f7f0>
In [186]:
data.cartage.hist(bins=40);
In [183]:
data[data.cartage > 100]
Out[183]:
date source dest qty cartage remarks amount computed_amount slon slat dlon dlat lane distance
2270 2008-07-21 Kolkata Andaman 63.0 204 case 12821 12852.0 88.33 22.63 92.69 11.68 KOLKATA to ANDAMAN 483.33
2861 2008-07-25 Kolkata Andaman 6.0 204 case 1221 1224.0 88.33 22.63 92.69 11.68 KOLKATA to ANDAMAN 483.33
6336 2008-08-18 Kolkata Andaman 35.0 204 case 7123 7140.0 88.33 22.63 92.69 11.68 KOLKATA to ANDAMAN 483.33
6339 2008-08-18 Kolkata Andaman 3.0 204 case 611 612.0 88.33 22.63 92.69 11.68 KOLKATA to ANDAMAN 483.33
11737 2008-09-20 Kolkata Andaman 60.0 204 case 12210 12240.0 88.33 22.63 92.69 11.68 KOLKATA to ANDAMAN 483.33
17669 2008-10-30 Kolkata Andaman 43.0 204 case 8751 8772.0 88.33 22.63 92.69 11.68 KOLKATA to ANDAMAN 483.33
24520 2008-12-08 Kolkata Andaman 56.0 204 case 11396 11424.0 88.33 22.63 92.69 11.68 KOLKATA to ANDAMAN 483.33
30650 2009-01-19 Kolkata Andaman 59.0 204 case 12007 12036.0 88.33 22.63 92.69 11.68 KOLKATA to ANDAMAN 483.33
35159 2009-02-09 Kolkata Andaman 51.0 204 case 10379 10404.0 88.33 22.63 92.69 11.68 KOLKATA to ANDAMAN 483.33
42529 2009-03-23 Kolkata Andaman 59.0 204 case 12007 12036.0 88.33 22.63 92.69 11.68 KOLKATA to ANDAMAN 483.33
46240 2009-04-14 Kolkata Andaman 0.0 204 case 0 0.0 88.33 22.63 92.69 11.68 KOLKATA to ANDAMAN 483.33
46448 2009-04-14 Kolkata Andaman 32.0 204 case 6512 6528.0 88.33 22.63 92.69 11.68 KOLKATA to ANDAMAN 483.33
50928 2009-05-07 Kolkata Andaman 48.0 204 case 9768 9792.0 88.33 22.63 92.69 11.68 KOLKATA to ANDAMAN 483.33
58038 2009-06-18 Kolkata Andaman 28.0 204 case 5698 5712.0 88.33 22.63 92.69 11.68 KOLKATA to ANDAMAN 483.33
58993 2009-06-24 Kolkata Andaman 28.0 204 case 5698 5712.0 88.33 22.63 92.69 11.68 KOLKATA to ANDAMAN 483.33
In [199]:
data.plot(kind="scatter", x="distance", y="cartage", alpha=0.05);
In [200]:
data.plot(kind="scatter", x="distance", y="cartage", alpha=0.05, logy=True);
In [213]:
data.loc[data.distance==0, "distance"] = 5
In [214]:
data.plot(kind="scatter", x="distance", y="cartage", alpha=0.05, logy=True, logx=True);
In [216]:
data.plot(kind="scatter", 
          x="distance", 
          y="cartage", 
          c="amount",
          alpha=0.05, 
          logy=True, 
          logx=True,
          cmap='viridis');
In [219]:
data.to_csv("data.csv")

Model

In [220]:
data.head()
Out[220]:
date source dest qty cartage remarks amount computed_amount slon slat dlon dlat lane distance
0 2008-07-01 Pune Pune 15.0 5 case 75 75.0 73.85 18.53 73.85 18.53 PUNE to PUNE 5.00
1 2008-07-01 Ahmedabad Jamnagar 30.0 18 case 540 540.0 72.60 23.03 70.06 22.46 AHMEDABAD to JAMNAGAR 282.60
2 2008-07-01 Jaipur Bagru 6.0 20 case 120 120.0 75.81 26.92 75.53 26.82 JAIPUR to BAGRU 30.66
3 2008-07-02 Raipur Raipur 23.0 5 case 104 115.0 81.65 21.23 81.65 21.23 RAIPUR to RAIPUR 5.00
5 2008-07-02 Raipur Saripali (C.G.) 4.0 21 case 84 84.0 81.65 21.23 82.53 21.18 RAIPUR to SARIPALI (C.G.) 98.64

Do we need to take date column in modeling?

Let us see if date has any effect on the cartage.

In [226]:
data[(data.source=='Bangalore') & (data.dest == 'Hubli')].describe()
Out[226]:
qty cartage amount computed_amount slon slat dlon dlat distance
count 166.000000 166.0 166.000000 166.000000 1.660000e+02 1.660000e+02 1.660000e+02 1.660000e+02 1.660000e+02
mean 49.608434 26.0 1314.024096 1289.819277 7.759000e+01 1.298000e+01 7.514000e+01 1.535000e+01 2.800100e+02
std 35.485170 0.0 939.946884 922.614410 1.140308e-13 1.425385e-14 9.977697e-14 4.632502e-14 1.140308e-13
min 1.000000 26.0 26.000000 26.000000 7.759000e+01 1.298000e+01 7.514000e+01 1.535000e+01 2.800100e+02
25% 23.250000 26.0 615.750000 604.500000 7.759000e+01 1.298000e+01 7.514000e+01 1.535000e+01 2.800100e+02
50% 42.000000 26.0 1112.500000 1092.000000 7.759000e+01 1.298000e+01 7.514000e+01 1.535000e+01 2.800100e+02
75% 71.750000 26.0 1900.500000 1865.500000 7.759000e+01 1.298000e+01 7.514000e+01 1.535000e+01 2.800100e+02
max 179.000000 26.0 4741.000000 4654.000000 7.759000e+01 1.298000e+01 7.514000e+01 1.535000e+01 2.800100e+02
In [231]:
data.groupby(['source', 'dest']).std().cartage.value_counts()
Out[231]:
0.000000     1410
2.462961        1
0.773565        1
2.386378        1
3.829708        1
4.712121        1
3.601470        1
1.054093        1
0.450225        1
18.961218       1
2.905092        1
4.284857        1
1.388730        1
7.807881        1
6.966903        1
4.569226        1
2.745873        1
0.472742        1
1.190891        1
0.876162        1
1.426785        1
0.551364        1
Name: cartage, dtype: int64
In [233]:
data.groupby(['source', 'dest']).std().cartage.hist();
In [247]:
# roll up

agg_func = {
    "cartage": ["mean"], 
    "distance": ["mean"], 
    "qty": ["sum"], 
    "amount": ["sum", "count"]
}

newdata = (data
    .groupby(["source", "dest", "slat", "slon", "dlat", "dlon"])
    .agg(agg_func)
    .reset_index())

newdata.columns = newdata.columns.droplevel(1)
newdata.columns = list(newdata.columns)[:-1] + ['count']
newdata.head()
Out[247]:
source dest slat slon dlat dlon cartage distance qty amount count
0 Ahmedabad Ahmadabad 23.03 72.6 23.03 72.60 18.0 5.00 26.0 468 3
1 Ahmedabad Ahmedabad 23.03 72.6 23.03 72.60 18.0 5.00 12236.0 220248 797
2 Ahmedabad Ahmedabd 23.03 72.6 23.03 72.60 18.0 5.00 33.0 594 4
3 Ahmedabad Amreli 23.03 72.6 21.60 71.22 25.0 161.25 111.0 2775 15
4 Ahmedabad Anand 23.03 72.6 22.55 72.95 22.0 42.40 155.0 3410 21
In [252]:
newdata.plot(kind='scatter', x='distance', y='cartage', alpha=0.3);
In [254]:
newdata.plot(kind='scatter', x='distance', y='cartage', alpha=0.3, logy=True, logx=True);
In [257]:
newdata['distance_log'] = np.log10(newdata.distance)
newdata['cartage_log'] = np.log10(newdata.cartage)
X = newdata[['distance_log']]
y = newdata.cartage_log
In [258]:
X.head()
Out[258]:
distance_log
0 0.698970
1 0.698970
2 0.698970
3 2.207500
4 1.627366
In [259]:
y.head()
Out[259]:
0    1.255273
1    1.255273
2    1.255273
3    1.397940
4    1.342423
Name: cartage_log, dtype: float64
In [260]:
from sklearn.linear_model import LinearRegression
In [262]:
model = LinearRegression()
model.fit(X, y)
Out[262]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [263]:
model.intercept_, model.coef_
Out[263]:
(1.1983538178610935, array([ 0.06244738]))
In [295]:
def show_predicitons(X, y, y_pred):
    plt.scatter(X, y)
    plt.plot(X, y_pred, color='b')
In [297]:
y_pred = model.predict(X)
show_predicitons(X, y, y_pred)
In [300]:
from sklearn.model_selection import cross_val_score
In [322]:
model = LinearRegression()
score = cross_val_score(model, 
                        X, y, 
                        scoring='neg_mean_squared_error', 
                        cv=5, 
                        n_jobs=-1,)
np.mean(score)
Out[322]:
-0.083758488565365427
In [311]:
score
Out[311]:
array([-0.06201186, -0.04173291, -0.3876596 , -0.02725678, -0.08762837])

Lets build another model with 2 features.

In [318]:
X2 = newdata[['distance_log', 'count']]
In [321]:
model = LinearRegression()
score = cross_val_score(model, 
                        X2, y, 
                        scoring='neg_mean_squared_error', 
                        cv=5, 
                        n_jobs=-1,)
np.mean(score)
Out[321]:
-0.084442034144584455
In [325]:
newdata['cartage_per_distance'] = newdata['cartage'] / newdata['distance']
newdata['cartage_per_distance_log'] = np.log10(newdata['cartage_per_distance'])
In [324]:
newdata.plot(kind='scatter', 
             x='distance',
             y='cartage_per_distance',
             logx=True,
             logy=True);
In [329]:
model = LinearRegression()
In [330]:
X = newdata[['distance_log']]
y = newdata['cartage_per_distance_log']
In [331]:
score = cross_val_score(model, 
                        X, y, 
                        scoring='neg_mean_squared_error', 
                        cv=5, 
                        n_jobs=-1,)
np.mean(score)
Out[331]:
-0.083758488565365469
In [332]:
model.fit(X, y)
y_pred = model.predict(X)
show_predicitons(X, y, y_pred)

Tree-based models

In [333]:
from sklearn.tree import DecisionTreeRegressor
In [368]:
tree = DecisionTreeRegressor(max_depth=2)
In [369]:
X = newdata[['distance_log']]
y = newdata['cartage_per_distance_log']
In [370]:
tree.fit(X, y)
Out[370]:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
In [371]:
def show_predicitons(X, y, y_pred):
    plt.scatter(X, y, color='r')
    plt.scatter(X, y_pred, color='b')
In [372]:
y_pred = tree.predict(X)
show_predicitons(X, y, y_pred)
In [373]:
#!pip install modelvis
In [374]:
import modelvis
In [375]:
modelvis.render_tree(tree, feature_names=['distance_log'])
Out[375]:
Tree 0 distance_log ≤ 1.9442 mse = 0.3069 samples = 1464 value = -0.692 1 distance_log ≤ 1.3756 mse = 0.2053 samples = 582 value = -0.2194 0->1 True 4 distance_log ≤ 2.2855 mse = 0.1293 samples = 882 value = -1.0038 0->4 False 2 mse = 0.1657 samples = 148 value = 0.2985 1->2 3 mse = 0.0961 samples = 434 value = -0.396 1->3 5 mse = 0.0882 samples = 405 value = -0.7747 4->5 6 mse = 0.0818 samples = 477 value = -1.1983 4->6
In [377]:
model.predict([[1.4]])
Out[377]:
array([-0.11421984])
In [378]:
from sklearn.model_selection import GridSearchCV
In [380]:
tree = DecisionTreeRegressor()
param_grid = {
    "max_depth": [2, 3, 4, 5, 6, 7, 8, 9, 10]
}
grid = GridSearchCV(tree, 
                    param_grid=param_grid, 
                    cv=10, 
                    scoring="neg_mean_squared_error")
grid.fit(X, y)
grid.best_estimator_
Out[380]:
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
In [396]:
tree = grid.best_estimator_

df2 = newdata.sort_values(by='distance_log')
X = df2[['distance_log']]
y = df2.cartage_per_distance_log

def show_predicitons2(model, X, y):
    X = X.as_matrix()
    X1 = np.arange(X[0].min(), X[0].max(), 1000)
    y1 = model.predict(X1)
    
def show_predicitons(X, y, y_pred):    
    plt.scatter(X, y, color='r')
    plt.plot(X, y_pred, color='b')

y_pred = tree.predict(X)
show_predicitons(X, y, y_pred)

Cars dataset

Download the cars dataset and build a model to predict the price of a car.

features: brand, kmpl, bhp, type
target: price