Python Virtual Training For Arcesium - Module III - Day 5¶

Mar 13-17, 2023 Vikrant Patil

All notes are available online at https://notes.pipal.in/2023/arcesium_finop_jan/

Please login to https://engage.pipal.in/ and launch jupyter lab

For today create a notebook with name module3-day5

notebook names are case sensitive. Make sure you give correct name

© Pipal Academy LLP

Reading pdf files¶

we will make use of third party library pypdf2

In [1]:
import PyPDF2
In [2]:
!pip install pypdf2
Requirement already satisfied: pypdf2 in /home/vikrant/usr/local/default/lib/python3.10/site-packages (3.0.1)
In [3]:
pdffileurl = "https://posoco.in/download/16-07-20_nldc_psp/?wpdmdl=30215"
In [4]:
import requests

def download(url, filename):
    resp = requests.get(url)
    with open(filename, "wb") as f:
        f.write(resp.content)
        

download(pdffileurl, "data.pdf")
In [5]:
with open("data.pdf","rb")  as f:
    pdfreader = PyPDF2.PdfReader(f)
    for page in pdfreader.pages:
        print(page.extract_text()[:100])
        print("="*20)
 
National Load Despatch Centre  
राष्ट्रीय भार प्रेषण केंद्र 
POWER SYSTEM OPERATION CORPORATION LI
====================
NR WR SR ER NER TOTAL
59882 41115 34238 21526 2730 159491
1114 0 0 0 6 1120
1398 998 807 447 48 3698
====================
16-Jul-2020
Sl 
NoVoltage Level Line Details Circuit Max Import (MW) Max Export (MW) Import (MU) Exp
====================

We are interested in second page

In [6]:
with open("data.pdf","rb")  as f: # open the file in read,binary mode
    pdfreader = PyPDF2.PdfReader(f)
    page = pdfreader.pages[1]
    print(page.extract_text(page.extract_text()))
NR WR SR ER NER TOTAL
59882 41115 34238 21526 2730 159491
1114 0 0 0 6 1120
1398 998 807 447 48 3698
355 33 77 149 29 643
11 49 128 - - 187
39.60 16.60 41.59 4.60 0.03 102
12.6 0.0 0.0 0.0 0.0 12.6
65470 43593 38117 21535 2827 160654
22:20 10:29 10:00 21:20 19:41 21:26
Region FVI < 49.7 49.7 - 49.8 49.8 - 49.9 < 49.9 49.9 - 50.05 > 50.05
All India 0.057 0.16 1.81 13.19 15.16 76.52 8.32
Max.Demand Shortage during Energy Met Drawal OD(+)/UD(-) Max OD Energy
Region States Met during the 
day(MW)maximum 
Demand(MW)(MU)Schedule
(MU)(MU) (MW)Shortage 
(MU)
Punjab 11090 0 237.9 146.8 -1.8 49 0.0
Haryana 9388 0 209.4 152.8 0.7 325 1.9
Rajasthan 12087 0 262.4 119.7 5.4 809 0.0
Delhi 5726 0 118.6 102.8 -1.4 228 0.0
NR UP 22873 0 448.9 208.5 2.0 546 0.4
Uttarakhand 1899 0 42.8 20.7 0.8 111 0.0
HP 1366 0 28.6 -2.6 -0.2 91 0.0
J&K(UT) & Ladakh(UT) 2177 544 43.1 20.3 0.4 502 10.3
Chandigarh 295 0 6.0 5.9 0.2 61 0.0
Chhattisgarh 3685 0 86.9 36.8 0.8 468 0.0
Gujarat 13478 0 286.2 87.6 4.0 527 0.0
MP 9547 0 214.7 113.8 -3.8 198 0.0
WR Maharashtra 16964 0 365.1 138.1 -1.9 457 0.0
Goa 405 0 8.5 8.2 -0.2 33 0.0
DD 246 0 5.3 5.3 0.0 19 0.0
DNH 614 0 14.0 13.8 0.2 44 0.0
AMNSIL 777 0 17.1 4.2 0.7 272 0.0
Andhra Pradesh 6439 0 141.0 45.6 -1.3 607 0.0
Telangana 8614 0 167.3 81.6 -2.5 385 0.0
SR Karnataka 8486 0 155.1 51.1 -3.4 650 0.0
Kerala 3077 0 65.2 46.1 0.5 179 0.0
Tamil Nadu 12371 0 271.3 125.9 -3.7 573 0.0
Puducherry 349 0 7.5 7.5 -0.1 35 0.0
Bihar 5740 0 111.5 106.0 -0.3 386 0.0
DVC 2989 0 62.7 -42.6 -0.7 206 0.0
Jharkhand 1438 0 26.3 18.5 -1.0 124 0.0
ER Odisha 3983 0 82.2 -0.2 -0.2 325 0.0
West Bengal 7917 0 162.6 47.2 -0.8 303 0.0
Sikkim 100 0 1.4 1.5 -0.1 17 0.0
Arunachal Pradesh 120 3 2.0 1.8 0.2 40 0.0
Assam 1759 23 30.0 27.1 -0.1 135 0.0
Manipur 183 1 2.6 2.3 0.3 37 0.0
NER Meghalaya 307 2 5.3 -1.3 0.3 52 0.0
Mizoram 89 1 1.5 1.2 0.0 13 0.0
Nagaland 140 2 2.2 2.3 -0.2 23 0.0
Tripura 298 7 4.9 5.9 0.7 66 0.0
Bhutan Nepal Bangladesh
53.3 -1.5 -19.1
2337.0 -271.3 -1110.0
NR WR SR ER NER TOTAL
352.1 -295.4 95.0 -145.8 -6.0 0.0
359.2 -293.7 84.6 -152.6 -3.4 -6.0
7.1 1.6 -10.5 -6.9 2.6 -6.0
NR WR SR ER NER TOTAL
3838 14847 11792 3445 677 34598
9289 23225 14423 4892 47 51876
13127 38072 26215 8337 723 86473
NR WR SR ER NER All India
546 1080 370 482 7 2486
25 13 14 0 0 52
355 33 77 149 29 643
26 33 47 0 0 106
40 82 19 0 22 163
71 73 210 5 0 359
1063 1314 737 636 58 3809
6.71 5.54 28.51 0.73 0.05 9.43
42.55 10.54 45.35 24.19 49.63 29.09
1.068
1.102 Based on State Max Demands
Diversity factor = Sum of regional or state maximum demands / All India maximum demand
*Source: RLDCs for solar connected to ISTS; SLDCs for embedded solar. Limited visibility of embedded solar data.
Executive Director-NLDCShare of RES in total generation (%)
Share of Non-fossil fuel (Hydro,Nuclear and RES) in total generation(%)
H. All India Demand Diversity Factor
Based on Regional Max DemandsLignite
Hydro
Nuclear
Gas, Naptha & Diesel
RES (Wind, Solar, Biomass & Others)
TotalState Sector
Total
G. Sourcewise generation (MU)
CoalActual(MU)
O/D/U/D(MU)
F. Generation Outage(MW)
Central SectorDay Peak (MW)
E. Import/Export by Regions (in MU) - Import(+ve)/Export(-ve); OD(+)/UD(-)
Schedule(MU)D. Transnational Exchanges (MU) - Import(+ve)/Export(-ve)   
Actual (MU)Energy Shortage (MU)
Maximum Demand Met During the Day (MW) (From NLDC SCADA)
Time Of Maximum Demand Met (From NLDC SCADA)
B. Frequency Profile (%)
C. Power Supply Position in StatesDemand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs)
Peak Shortage (MW)
Energy Met (MU)
Hydro Gen (MU)
Wind Gen (MU)
Solar Gen (MU)*Report for previous day Date of Reporting: 16-Jul-2020
A. Power Supply Position at All India and Regional level

In [7]:
def extract_data(text):
    lines = text.split("\n")
    data_text = lines[:10]
    
    headers = data_text[0].split()
    data = [line.strip().split() for line in data_text[1:]]
    return headers, data
    

def extract_data_from_pdf(pdffile):     
    with open("data.pdf","rb")  as f:
        pdfreader = PyPDF2.PdfReader(f)
        page = pdfreader.pages[1]
        text = page.extract_text()
        return extract_data(text)
In [8]:
h, d = extract_data_from_pdf("data.pdf")
In [9]:
h
Out[9]:
['NR', 'WR', 'SR', 'ER', 'NER', 'TOTAL']
In [10]:
d
Out[10]:
[['59882', '41115', '34238', '21526', '2730', '159491'],
 ['1114', '0', '0', '0', '6', '1120'],
 ['1398', '998', '807', '447', '48', '3698'],
 ['355', '33', '77', '149', '29', '643'],
 ['11', '49', '128', '-', '-', '187'],
 ['39.60', '16.60', '41.59', '4.60', '0.03', '102'],
 ['12.6', '0.0', '0.0', '0.0', '0.0', '12.6'],
 ['65470', '43593', '38117', '21535', '2827', '160654'],
 ['22:20', '10:29', '10:00', '21:20', '19:41', '21:26']]
In [11]:
import pandas as pd
In [12]:
pd.DataFrame([{"a":1,"b":2}, {"a":2, "b":3}])
Out[12]:
a b
0 1 2
1 2 3
In [13]:
dirlist = []
for row in d:
    dirlist.append(dict(zip(h, row)))
In [14]:
dirlist
Out[14]:
[{'NR': '59882',
  'WR': '41115',
  'SR': '34238',
  'ER': '21526',
  'NER': '2730',
  'TOTAL': '159491'},
 {'NR': '1114', 'WR': '0', 'SR': '0', 'ER': '0', 'NER': '6', 'TOTAL': '1120'},
 {'NR': '1398',
  'WR': '998',
  'SR': '807',
  'ER': '447',
  'NER': '48',
  'TOTAL': '3698'},
 {'NR': '355',
  'WR': '33',
  'SR': '77',
  'ER': '149',
  'NER': '29',
  'TOTAL': '643'},
 {'NR': '11', 'WR': '49', 'SR': '128', 'ER': '-', 'NER': '-', 'TOTAL': '187'},
 {'NR': '39.60',
  'WR': '16.60',
  'SR': '41.59',
  'ER': '4.60',
  'NER': '0.03',
  'TOTAL': '102'},
 {'NR': '12.6',
  'WR': '0.0',
  'SR': '0.0',
  'ER': '0.0',
  'NER': '0.0',
  'TOTAL': '12.6'},
 {'NR': '65470',
  'WR': '43593',
  'SR': '38117',
  'ER': '21535',
  'NER': '2827',
  'TOTAL': '160654'},
 {'NR': '22:20',
  'WR': '10:29',
  'SR': '10:00',
  'ER': '21:20',
  'NER': '19:41',
  'TOTAL': '21:26'}]
In [15]:
pd.DataFrame(dirlist)
Out[15]:
NR WR SR ER NER TOTAL
0 59882 41115 34238 21526 2730 159491
1 1114 0 0 0 6 1120
2 1398 998 807 447 48 3698
3 355 33 77 149 29 643
4 11 49 128 - - 187
5 39.60 16.60 41.59 4.60 0.03 102
6 12.6 0.0 0.0 0.0 0.0 12.6
7 65470 43593 38117 21535 2827 160654
8 22:20 10:29 10:00 21:20 19:41 21:26
In [16]:
rowlabels = """Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs)
Peak Shortage (MW)
Energy Met (MU)
Hydro Gen (MU)
Wind Gen (MU)
Solar Gen (MU)*
Energy Shortage (MU)
Maximum Demand Met During the Day (MW) (From NLDC SCADA)
Time Of Maximum Demand Met (From NLDC SCADA)""".split("\n")
In [17]:
pd.DataFrame([dict(zip(h, row)) for row in d], index=rowlabels)
Out[17]:
NR WR SR ER NER TOTAL
Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs) 59882 41115 34238 21526 2730 159491
Peak Shortage (MW) 1114 0 0 0 6 1120
Energy Met (MU) 1398 998 807 447 48 3698
Hydro Gen (MU) 355 33 77 149 29 643
Wind Gen (MU) 11 49 128 - - 187
Solar Gen (MU)* 39.60 16.60 41.59 4.60 0.03 102
Energy Shortage (MU) 12.6 0.0 0.0 0.0 0.0 12.6
Maximum Demand Met During the Day (MW) (From NLDC SCADA) 65470 43593 38117 21535 2827 160654
Time Of Maximum Demand Met (From NLDC SCADA) 22:20 10:29 10:00 21:20 19:41 21:26
In [18]:
def extract_data(text):
    lines = text.split("\n")
    data_text = lines[:10]
    
    headers = data_text[0].split()
    data = [line.strip().split() for line in data_text[1:]]
    return headers, data
    

def extract_data_from_pdf(pdffile):     
    with open("data.pdf","rb")  as f:
        pdfreader = PyPDF2.PdfReader(f)
        page = pdfreader.pages[1]
        text = page.extract_text()
        h, d = extract_data(text)
        rowlabels = """Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs)
Peak Shortage (MW)
Energy Met (MU)
Hydro Gen (MU)
Wind Gen (MU)
Solar Gen (MU)*
Energy Shortage (MU)
Maximum Demand Met During the Day (MW) (From NLDC SCADA)
Time Of Maximum Demand Met (From NLDC SCADA)""".split("\n")
        return pd.DataFrame([dict(zip(h, row)) for row in d], index=rowlabels)
In [19]:
extract_data_from_pdf("data.pdf")
Out[19]:
NR WR SR ER NER TOTAL
Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs) 59882 41115 34238 21526 2730 159491
Peak Shortage (MW) 1114 0 0 0 6 1120
Energy Met (MU) 1398 998 807 447 48 3698
Hydro Gen (MU) 355 33 77 149 29 643
Wind Gen (MU) 11 49 128 - - 187
Solar Gen (MU)* 39.60 16.60 41.59 4.60 0.03 102
Energy Shortage (MU) 12.6 0.0 0.0 0.0 0.0 12.6
Maximum Demand Met During the Day (MW) (From NLDC SCADA) 65470 43593 38117 21535 2827 160654
Time Of Maximum Demand Met (From NLDC SCADA) 22:20 10:29 10:00 21:20 19:41 21:26
In [20]:
df = extract_data_from_pdf("data.pdf")
In [21]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs) to Time Of Maximum Demand Met (From NLDC SCADA)
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   NR      9 non-null      object
 1   WR      9 non-null      object
 2   SR      9 non-null      object
 3   ER      9 non-null      object
 4   NER     9 non-null      object
 5   TOTAL   9 non-null      object
dtypes: object(6)
memory usage: 504.0+ bytes

The dataframe is of not use because all numeric data is converted into text here!

In [27]:
def extract_data(text):
    lines = text.split("\n")
    data_text = lines[:10]
    
    headers = data_text[0].split()
    data = [line.strip().split() for line in data_text[1:]]
    return headers, data
    

def extract_data_from_pdf(pdffile):     
    with open("data.pdf","rb")  as f:
        pdfreader = PyPDF2.PdfReader(f)
        page = pdfreader.pages[1]
        text = page.extract_text()
        h, d = extract_data(text)
        colnames = """Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs)
Peak Shortage (MW)
Energy Met (MU)
Hydro Gen (MU)
Wind Gen (MU)
Solar Gen (MU)*
Energy Shortage (MU)
Maximum Demand Met During the Day (MW) (From NLDC SCADA)
Time Of Maximum Demand Met (From NLDC SCADA)""".split("\n")
        
        return pd.DataFrame(dict(zip(colnames, d)), index=h)
In [28]:
extract_data_from_pdf("data.pdf")
Out[28]:
Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs) Peak Shortage (MW) Energy Met (MU) Hydro Gen (MU) Wind Gen (MU) Solar Gen (MU)* Energy Shortage (MU) Maximum Demand Met During the Day (MW) (From NLDC SCADA) Time Of Maximum Demand Met (From NLDC SCADA)
NR 59882 1114 1398 355 11 39.60 12.6 65470 22:20
WR 41115 0 998 33 49 16.60 0.0 43593 10:29
SR 34238 0 807 77 128 41.59 0.0 38117 10:00
ER 21526 0 447 149 - 4.60 0.0 21535 21:20
NER 2730 6 48 29 - 0.03 0.0 2827 19:41
TOTAL 159491 1120 3698 643 187 102 12.6 160654 21:26
In [29]:
df = extract_data_from_pdf("data.pdf")
In [30]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, NR to TOTAL
Data columns (total 9 columns):
 #   Column                                                            Non-Null Count  Dtype 
---  ------                                                            --------------  ----- 
 0   Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs)  6 non-null      object
 1   Peak Shortage (MW)                                                6 non-null      object
 2   Energy Met (MU)                                                   6 non-null      object
 3   Hydro Gen (MU)                                                    6 non-null      object
 4   Wind Gen (MU)                                                     6 non-null      object
 5   Solar Gen (MU)*                                                   6 non-null      object
 6   Energy Shortage (MU)                                              6 non-null      object
 7   Maximum Demand Met During the Day (MW) (From NLDC SCADA)          6 non-null      object
 8   Time Of Maximum Demand Met (From NLDC SCADA)                      6 non-null      object
dtypes: object(9)
memory usage: 480.0+ bytes
In [49]:
def removedash(row):
    return [item.replace("-", "") for item in row]
    
def extract_data(text):
    lines = text.split("\n")
    data_text = lines[:10]
    
    headers = data_text[0].split()
    data = [removedash(line.split()) for line in data_text[1:]]
    return headers, data
    

def extract_data_from_pdf(pdffile):     
    with open("data.pdf","rb")  as f:
        pdfreader = PyPDF2.PdfReader(f)
        page = pdfreader.pages[1]
        text = page.extract_text()
        h, d = extract_data(text)
        print(h, d)
        colnames = """Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs)
Peak Shortage (MW)
Energy Met (MU)
Hydro Gen (MU)
Wind Gen (MU)
Solar Gen (MU)*
Energy Shortage (MU)
Maximum Demand Met During the Day (MW) (From NLDC SCADA)
Time Of Maximum Demand Met (From NLDC SCADA)""".split("\n")
        
        data = pd.DataFrame(dict(zip(colnames, d)), index=h)
        
        for c in colnames[:-1]:
            data[c] = pd.to_numeric(data[c])
        return data
In [50]:
df = extract_data_from_pdf("data.pdf")
df
['NR', 'WR', 'SR', 'ER', 'NER', 'TOTAL'] [['59882', '41115', '34238', '21526', '2730', '159491'], ['1114', '0', '0', '0', '6', '1120'], ['1398', '998', '807', '447', '48', '3698'], ['355', '33', '77', '149', '29', '643'], ['11', '49', '128', '', '', '187'], ['39.60', '16.60', '41.59', '4.60', '0.03', '102'], ['12.6', '0.0', '0.0', '0.0', '0.0', '12.6'], ['65470', '43593', '38117', '21535', '2827', '160654'], ['22:20', '10:29', '10:00', '21:20', '19:41', '21:26']]
Out[50]:
Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs) Peak Shortage (MW) Energy Met (MU) Hydro Gen (MU) Wind Gen (MU) Solar Gen (MU)* Energy Shortage (MU) Maximum Demand Met During the Day (MW) (From NLDC SCADA) Time Of Maximum Demand Met (From NLDC SCADA)
NR 59882 1114 1398 355 11.0 39.60 12.6 65470 22:20
WR 41115 0 998 33 49.0 16.60 0.0 43593 10:29
SR 34238 0 807 77 128.0 41.59 0.0 38117 10:00
ER 21526 0 447 149 NaN 4.60 0.0 21535 21:20
NER 2730 6 48 29 NaN 0.03 0.0 2827 19:41
TOTAL 159491 1120 3698 643 187.0 102.00 12.6 160654 21:26
In [48]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, NR to TOTAL
Data columns (total 9 columns):
 #   Column                                                            Non-Null Count  Dtype  
---  ------                                                            --------------  -----  
 0   Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs)  6 non-null      int64  
 1   Peak Shortage (MW)                                                6 non-null      int64  
 2   Energy Met (MU)                                                   6 non-null      int64  
 3   Hydro Gen (MU)                                                    6 non-null      int64  
 4   Wind Gen (MU)                                                     4 non-null      float64
 5   Solar Gen (MU)*                                                   6 non-null      float64
 6   Energy Shortage (MU)                                              6 non-null      float64
 7   Maximum Demand Met During the Day (MW) (From NLDC SCADA)          6 non-null      int64  
 8   Time Of Maximum Demand Met (From NLDC SCADA)                      6 non-null      object 
dtypes: float64(3), int64(5), object(1)
memory usage: 480.0+ bytes

when to use text mode / binary mode¶

  • to read text based files like csv, txt , html, -> "r","w" this means read or write in text mode
  • to read/write file which is binary - xlsx, pdf, jpeg ypu should specify binary mode. "rb", "wb" -> read in binary , write in binary

debugging your python programs using pdb¶

In [54]:
%%file wc.py
import sys

def linecount(filename):
    with open(filename) as f:
        return len(f.readlines())
    
def wordcount(filename):
    with open(filename) as f:
        return len(f.read().split())
    
def charcount(filename):
    with open(filename) as f:
        return len(f.read())


if __name__ == "__main__":
    filename = sys.argv[1]
    print(linecount(filename), wordcount(filename), charcount(filename), filename)
    
Overwriting wc.py
In [56]:
!python wc.py poem.txt
5415.79s - pydevd: Sending message related to process being replaced timed-out after 5 seconds
21 144 857 poem.txt
In [2]:
import PyPDF2
import pandas as pd

def removedash(row):
    return [item.replace("-", "") for item in row]
    
def extract_data(text):
    lines = text.split("\n")
    data_text = lines[:10]
    
    headers = data_text[0].split()
    data = [removedash(line.split()) for line in data_text[1:]]
    return headers, data
    

def extract_data_from_pdf(pdffile):     
    with open("data.pdf","rb")  as f:
        pdfreader = PyPDF2.PdfReader(f)
        page = pdfreader.pages[1]
        text = page.extract_text()
        h, d = extract_data(text)
        print(h, d)
        colnames = """Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs)
Peak Shortage (MW)
Energy Met (MU)
Hydro Gen (MU)
Wind Gen (MU)
Solar Gen (MU)*
Energy Shortage (MU)
Maximum Demand Met During the Day (MW) (From NLDC SCADA)
Time Of Maximum Demand Met (From NLDC SCADA)""".split("\n")
        
        data = pd.DataFrame(dict(zip(colnames, d)), index=h)
        
        for c in colnames[:-1]:
            data[c] = pd.to_numeric(data[c])
        return data
    
extract_data_from_pdf("data.pdf")
['NR', 'WR', 'SR', 'ER', 'NER', 'TOTAL'] [['59882', '41115', '34238', '21526', '2730', '159491'], ['1114', '0', '0', '0', '6', '1120'], ['1398', '998', '807', '447', '48', '3698'], ['355', '33', '77', '149', '29', '643'], ['11', '49', '128', '', '', '187'], ['39.60', '16.60', '41.59', '4.60', '0.03', '102'], ['12.6', '0.0', '0.0', '0.0', '0.0', '12.6'], ['65470', '43593', '38117', '21535', '2827', '160654'], ['22:20', '10:29', '10:00', '21:20', '19:41', '21:26']]
Out[2]:
Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs) Peak Shortage (MW) Energy Met (MU) Hydro Gen (MU) Wind Gen (MU) Solar Gen (MU)* Energy Shortage (MU) Maximum Demand Met During the Day (MW) (From NLDC SCADA) Time Of Maximum Demand Met (From NLDC SCADA)
NR 59882 1114 1398 355 11.0 39.60 12.6 65470 22:20
WR 41115 0 998 33 49.0 16.60 0.0 43593 10:29
SR 34238 0 807 77 128.0 41.59 0.0 38117 10:00
ER 21526 0 447 149 NaN 4.60 0.0 21535 21:20
NER 2730 6 48 29 NaN 0.03 0.0 2827 19:41
TOTAL 159491 1120 3698 643 187.0 102.00 12.6 160654 21:26

Debuggin with pdb¶

python -m pdb yourscript.py args1 args2

(pdb) h # will show help .. what commands are available
(pdb) l # show lines from file with line no.
(pdb) b # set a breakpoint b 12 ..set breakpoint at line 12. b filename:line
(pdb) c # will start execution and stop at first break point
(pdb) n # execute this line and go to next line and stop
(pdb) s # if there is function call on current line then go inside that function and stop
(pdb) p variable # will print current value of variable

A quick revision of basics¶

In [3]:
names = ["aakash", "priya", "Aman"] 
In [4]:
for i in range(3):
    values[i] = i*i*2 # you can modify list without defining it!
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[4], line 2
      1 for i in range(3):
----> 2     values[i] = i*i*2

NameError: name 'values' is not defined
In [9]:
values = [] # or define a empty list
for i in range(3):
    values.append( i*i*2) # and keep appending
In [10]:
values = [0, 0, 0] # either initialize the list of required size
for i in range(3):
    values[i] = i*i*2
In [12]:
def func():
    x = 10 # this is local variable inside the function
    
x = 20

func() 

print(x)
    
20
In [13]:
def appendzero(x):
    x.append(0) # this will modify actual list
    
    
y = [1, 1, 1, 1]
appendzero(y) 
print(y)
[1, 1, 1, 1, 0]
In [14]:
def appendzero_and_return(x):
    x = x + [0] # local variable... will not affect original variable
    return x
    
    
y = [1, 1, 1, 1]
appendzero_and_return(y)
Out[14]:
[1, 1, 1, 1, 0]
In [15]:
print(y)
[1, 1, 1, 1]
In [16]:
def square(x):
    print(x*x) # you will get None as return
In [17]:
x2 = square(5)
25
In [18]:
print(x2)
None
In [20]:
def square(x):
    return x*x
In [21]:
x2 = square(5)
In [22]:
x2
Out[22]:
25
In [23]:
for i in range(3):
    print(i, end=",")
0,1,2,
In [24]:
for c in "text data":
    print(c, end=",")
t,e,x,t, ,d,a,t,a,
In [26]:
for key in {"z":2, "x":4}:
    print(key)
z
x
In [27]:
d = {"z":2, "x":4}
In [28]:
for key, value in d.items():
    print(key, value)
z 2
x 4
In [29]:
values = [] 
for i in range(3):
    values.append( i*i*2) 
In [30]:
[i*i*2 for i in range(3)]
Out[30]:
[0, 2, 8]
In [31]:
[i*i*2 for i in range(10) if i%2==0]
Out[31]:
[0, 8, 32, 72, 128]
In [33]:
{k:str(k) for k in range(5)}
Out[33]:
{0: '0', 1: '1', 2: '2', 3: '3', 4: '4'}
In [34]:
pd.DataFrame({"col1":[1, 2, 3, 4],
             "col2":[2, 2, 2, 2]})
Out[34]:
col1 col2
0 1 2
1 2 2
2 3 2
3 4 2
In [36]:
pd.DataFrame([
    {"col1":1, "col2":2},
    {"col1":21, "col2":22},
    {"col1":31, "col2":32}])
Out[36]:
col1 col2
0 1 2
1 21 22
2 31 32
In [37]:
wallet = pd.read_csv("wallet.csv", index_col=0)
In [38]:
wallet
Out[38]:
date category description debit
0 2021-03-07 14:53:28.377359 Music Amazon 421.207327
1 2020-10-08 09:53:28.377359 Food Swiggy 328.440080
2 2021-02-23 09:53:28.377359 Books Amazon 244.679437
3 2020-11-01 14:53:28.377359 Utility Phone 222.756318
4 2021-06-05 13:53:28.377359 Books Flipcart 494.128492
... ... ... ... ...
95 2021-07-19 13:53:28.377359 Utility Phone 388.671213
96 2021-01-12 19:53:28.377359 Books Flipcart 467.554562
97 2021-03-25 11:53:28.377359 Utility Phone 320.789434
98 2021-05-13 15:53:28.377359 Travel Taxi 442.096469
99 2020-10-11 16:53:28.377359 Food Hotel 100.455501

100 rows × 4 columns

find min transaction in utility category¶

In [40]:
wallet.groupby("category").min()['debit'].loc['Utility']
Out[40]:
103.68079074846584

Find total expenses on books

In [43]:
wallet.groupby("category").sum(numeric_only=True)['debit'].loc['Books']
Out[43]:
4929.750393283798
In [44]:
wallet.groupby(['category', 'description']).sum(numeric_only=True)
Out[44]:
debit
category description
Books Amazon 1037.442802
Amazon Kindle 1389.052376
Flipcart 2503.255216
Food Hotel 2752.174732
Swiggy 1936.495366
Zomato 3592.519075
Music Amazon 1467.247766
Netflix 1546.567562
spotify 1219.636541
Travel Auto 2210.428935
Metro 1216.463665
Taxi 2626.039276
Utility Electricity 2885.064355
Phone 4677.202878
In [45]:
df = wallet.groupby(['category', 'description']).sum(numeric_only=True)
In [46]:
df
Out[46]:
debit
category description
Books Amazon 1037.442802
Amazon Kindle 1389.052376
Flipcart 2503.255216
Food Hotel 2752.174732
Swiggy 1936.495366
Zomato 3592.519075
Music Amazon 1467.247766
Netflix 1546.567562
spotify 1219.636541
Travel Auto 2210.428935
Metro 1216.463665
Taxi 2626.039276
Utility Electricity 2885.064355
Phone 4677.202878
In [47]:
df.loc[('Books','Amazon')]
Out[47]:
debit    1037.442802
Name: (Books, Amazon), dtype: float64
In [48]:
df.reset_index()
Out[48]:
category description debit
0 Books Amazon 1037.442802
1 Books Amazon Kindle 1389.052376
2 Books Flipcart 2503.255216
3 Food Hotel 2752.174732
4 Food Swiggy 1936.495366
5 Food Zomato 3592.519075
6 Music Amazon 1467.247766
7 Music Netflix 1546.567562
8 Music spotify 1219.636541
9 Travel Auto 2210.428935
10 Travel Metro 1216.463665
11 Travel Taxi 2626.039276
12 Utility Electricity 2885.064355
13 Utility Phone 4677.202878

combining data¶

  • pd.concat([]) .. this will put data one below each other
  • df1.join(df2) ... will join based on index
  • pd.merge(df1, df2, on=common_col) this will merge by given column name
In [ ]:
pd.read_html(url) # load tables on this page and return a list of tables

downloading from internet¶

  1. requests and API , json (most authentic way to download data legaly)
  2. selenium to download files if download is available only from browser

xlsxwriter¶

  • create a workbook of name of the file
  • add worksheet into workbook
  • write into worksheet
  • create format in workbook and make use of it while writing in worksheet
  • close workbook
In [ ]: