Python Virtual Training For Arcesium - Module III - Day 4¶

Dec 17-23, 2020 Vikrant Patil

These notes are available online at http://notes.pipal.in/2020/arcesium_finop_batch3/module3-day4.html

We will be using JupyterHub at http://lab.pipal.in for this training. Create a notebook named module3-day4.ipynb for today's session. Before you start, shut down all kernels except the one for today's notebook.

Using selenium to download data from internet¶

docs for python-selenium

https://selenium-python.readthedocs.io/installation.html#drivers

download the driver for the browser you need to launch from python

for firefox, the driver is geckodriver

https://github.com/mozilla/geckodriver/releases

create virtual environment for selenium

python -m venv firefox_selenium

activate it using (windows)

firefox_selenium\Scripts\activate.bat

for linux/mac

source firefox_selenium/bin/activate

the geckodriver download for windows is a zip file which has geckodriver.exe in it. unzip it and copy geckodriver.exe into firefox_selenium\Scripts\

linux/mac users: copy the extracted executable into firefox_selenium/bin/

pip install selenium
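
Selenium looks for geckodriver on the PATH, and activating the virtual environment puts its Scripts\ (or bin/) folder on the PATH. A small sketch to check that the driver can be found before launching the browser; `find_driver` is a hypothetical helper name, not part of selenium.

```python
import shutil

def find_driver(name="geckodriver"):
    """Return the full path of the driver executable found on PATH, or None.

    find_driver is a hypothetical helper for illustration only.
    """
    return shutil.which(name)

path = find_driver()
if path is None:
    print("geckodriver not found on PATH; is the virtual environment activated?")
else:
    print("geckodriver found at", path)
```

If the driver is not found, activating the environment again or re-checking the copy step above is the first thing to try.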

%%file search_arcesium.py

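from selenium import webdriver
from selenium.webdriver.common.by import By

# geckodriver must be on PATH (e.g. inside the activated venv)
driver = webdriver.Firefox()
driver.get("https://www.arcesium.com/")

# find_element(By...) works in selenium 3 and 4; the older
# find_element_by_class_name helper is gone in recent selenium 4 releases
careers = driver.find_element(By.CLASS_NAME, "careers-anchor")
careers.click()
driver.quit()  # close the browser and stop geckodriver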

Overwriting search_arcesium.py

documentation for selenium using python

https://selenium-python.readthedocs.io/

Reading pdf files¶

to read pdf files we will need a package called PyPDF2

from jupyter

!pip install PyPDF2

from cmd

pip install PyPDF2

!pip install PyPDF2

Collecting PyPDF2
  Using cached PyPDF2-1.26.0.tar.gz (77 kB)
Building wheels for collected packages: PyPDF2
  Building wheel for PyPDF2 (setup.py) ... done
  Created wheel for PyPDF2: filename=PyPDF2-1.26.0-py3-none-any.whl size=61084 sha256=84670daae0093b25e472b99f599e73fda3e64185542411e7a44b3141db7ab31e
  Stored in directory: /home/vikrant/.cache/pip/wheels/b1/1a/8f/a4c34be976825a2f7948d0fa40907598d69834f8ab5889de11
Successfully built PyPDF2
Installing collected packages: PyPDF2
Successfully installed PyPDF2-1.26.0
WARNING: You are using pip version 20.2.3; however, version 20.3.3 is available.
You should consider upgrading via the '/home/vikrant/anaconda3/bin/python -m pip install --upgrade pip' command.

!cat download.py

import sys
import requests

def download(url, filename):
    resp = requests.get(url)
    with open(filename, "wb") as f:
        f.write(resp.content)
        
if __name__ == "__main__":
    url = sys.argv[1]
    filename = sys.argv[2]
    download(url, filename)

!python download.py https://posoco.in/download/16-07-20_nldc_psp/?wpdmdl=30215 demanddata.pdf

We will read this pdf file https://posoco.in/download/16-07-20_nldc_psp/?wpdmdl=30215 and try to extract table A from page 2 (page index 1, since PyPDF2 counts pages from 0)

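import PyPDF2

# PyPDF2 1.26 API, matching the version installed above;
# PyPDF2 3.x renamed these (PdfReader, len(reader.pages), reader.pages[i], extract_text())
with open("demanddata.pdf", "rb") as f:
    pdfreader = PyPDF2.PdfFileReader(f)
    n = pdfreader.getNumPages()   # number of pages in the file
    page = pdfreader.getPage(1)   # pages are counted from 0, so this is page 2
    print(page.extractText()[:100])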

NR
WR
SR
ER
NER
TOTAL
59882
41115
34238
21526
2730
159491
1114
0
0
0
6
1120
1398
998
807
447
48
3698

def print_pdf_text(filename):
    with open(filename, "rb") as f:
        pdfreader = PyPDF2.PdfFileReader(f)
        n = pdfreader.getNumPages()
        for p in range(n):
            page = pdfreader.getPage(p)
            print(page.extractText()[:500])
            print("="*10)

print_pdf_text("demanddata.pdf")

 
National Load Despatch Centre
 
 
POWER SYSTEM OPERATION CORPORATION LIMITED
 

(
Government of India Enterprise
/


)
 
B
-
9, QUTU
B INSTITUTIONAL AREA, KATWARIA SARAI,
 
NEW DELHI 
-
110016
 

,

,


,


_____________________________________________________________________________________________________________________________
__________
 
Ref:
 
POSOCO/NLDC/SO/Daily PSP
 
Report
 
 
:
 
16
th
 
Jul
 
20
20
 
 
To,
 
 
1.
 

,


,


,


==========
NR
WR
SR
ER
NER
TOTAL
59882
41115
34238
21526
2730
159491
1114
0
0
0
6
1120
1398
998
807
447
48
3698
355
33
77
149
29
643
11
49
128
-
-
187
39.60
16.60
41.59
4.60
0.03
102
12.6
0.0
0.0
0.0
0.0
12.6
65470
43593
38117
21535
2827
160654
22:20
10:29
10:00
21:20
19:41
21:26
Region
FVI
< 49.7
49.7 - 49.8
49.8 - 49.9
< 49.9
49.9 - 50.05
> 50.05
All India
0.057
0.16
1.81
13.19
15.16
76.52
8.32
Max.Demand
Shortage during
Energy Met
Drawal
OD(+)/UD(-)
Max OD
Energy
Region
States
Met during the 
day(MW)
ma
==========
16-Jul-2020
Sl 
No
Voltage Level
Line Details
Circuit
Max Import (MW)
Max Export (MW)
Import (MU)
Export (MU)
NET (MU)
1
HVDC
ALIPURDUAR-AGRA
D/C
0
1001
0.0
24.4
-24.4
2
HVDC
PUSAULI  B/B
-
0
399
0.0
9.6
-9.6
3
765 kV
GAYA-VARANASI
D/C
0
655
0.0
12.9
-12.9
4
765 kV
SASARAM-FATEHPUR
S/C
108
119
0.0
0.9
-0.9
5
765 kV
GAYA-BALIA
S/C
0
478
0.0
4.7
-4.7
6
400 kV
PUSAULI-VARANASI
S/C
0
283
0.0
5.9
-5.9
7
400 kV
PUSAULI -ALLAHABAD
S/C
0
180
0.0
3.5
-3.5
8
400 kV
MUZAFFARPUR-GORAKHPUR
D/C
0
834
0.0
15.5
==========

def get_page(pdffile, pageno):
    with open(pdffile, "rb") as f:
        pdfreader = PyPDF2.PdfFileReader(f)
        page = pdfreader.getPage(pageno)
        return page.extractText()

print(get_page("demanddata.pdf", 1))

NR
WR
SR
ER
NER
TOTAL
59882
41115
34238
21526
2730
159491
1114
0
0
0
6
1120
1398
998
807
447
48
3698
355
33
77
149
29
643
11
49
128
-
-
187
39.60
16.60
41.59
4.60
0.03
102
12.6
0.0
0.0
0.0
0.0
12.6
65470
43593
38117
21535
2827
160654
22:20
10:29
10:00
21:20
19:41
21:26
Region
FVI
< 49.7
49.7 - 49.8
49.8 - 49.9
< 49.9
49.9 - 50.05
> 50.05
All India
0.057
0.16
1.81
13.19
15.16
76.52
8.32
Max.Demand
Shortage during
Energy Met
Drawal
OD(+)/UD(-)
Max OD
Energy
Region
States
Met during the 
day(MW)
maximum 
Demand(MW)
(MU)
Schedule
(MU)
(MU)
(MW)
Shortage 
(MU)
Punjab
11090
0
237.9
146.8
-1.8
49
0.0
Haryana
9388
0
209.4
152.8
0.7
325
1.9
Rajasthan
12087
0
262.4
119.7
5.4
809
0.0
Delhi
5726
0
118.6
102.8
-1.4
228
0.0
NR
UP
22873
0
448.9
208.5
2.0
546
0.4
Uttarakhand
1899
0
42.8
20.7
0.8
111
0.0
HP
1366
0
28.6
-2.6
-0.2
91
0.0
J&K(UT) & Ladakh(UT)
2177
544
43.1
20.3
0.4
502
10.3
Chandigarh
295
0
6.0
5.9
0.2
61
0.0
Chhattisgarh
3685
0
86.9
36.8
0.8
468
0.0
Gujarat
13478
0
286.2
87.6
4.0
527
0.0
MP
9547
0
214.7
113.8
-3.8
198
0.0
WR
Maharashtra
16964
0
365.1
138.1
-1.9
457
0.0
Goa
405
0
8.5
8.2
-0.2
33
0.0
DD
246
0
5.3
5.3
0.0
19
0.0
DNH
614
0
14.0
13.8
0.2
44
0.0
AMNSIL
777
0
17.1
4.2
0.7
272
0.0
Andhra Pradesh
6439
0
141.0
45.6
-1.3
607
0.0
Telangana
8614
0
167.3
81.6
-2.5
385
0.0
SR
Karnataka
8486
0
155.1
51.1
-3.4
650
0.0
Kerala
3077
0
65.2
46.1
0.5
179
0.0
Tamil Nadu
12371
0
271.3
125.9
-3.7
573
0.0
Puducherry
349
0
7.5
7.5
-0.1
35
0.0
Bihar
5740
0
111.5
106.0
-0.3
386
0.0
DVC
2989
0
62.7
-42.6
-0.7
206
0.0
Jharkhand
1438
0
26.3
18.5
-1.0
124
0.0
ER
Odisha
3983
0
82.2
-0.2
-0.2
325
0.0
West Bengal
7917
0
162.6
47.2
-0.8
303
0.0
Sikkim
100
0
1.4
1.5
-0.1
17
0.0
Arunachal Pradesh
120
3
2.0
1.8
0.2
40
0.0
Assam
1759
23
30.0
27.1
-0.1
135
0.0
Manipur
183
1
2.6
2.3
0.3
37
0.0
NER
Meghalaya
307
2
5.3
-1.3
0.3
52
0.0
Mizoram
89
1
1.5
1.2
0.0
13
0.0
Nagaland
140
2
2.2
2.3
-0.2
23
0.0
Tripura
298
7
4.9
5.9
0.7
66
0.0
Bhutan
Nepal
Bangladesh
53.3
-1.5
-19.1
2337.0
-271.3
-1110.0
NR
WR
SR
ER
NER
TOTAL
352.1
-295.4
95.0
-145.8
-6.0
0.0
359.2
-293.7
84.6
-152.6
-3.4
-6.0
7.1
1.6
-10.5
-6.9
2.6
-6.0
NR
WR
SR
ER
NER
TOTAL
3838
14847
11792
3445
677
34598
9289
23225
14423
4892
47
51876
13127
38072
26215
8337
723
86473
NR
WR
SR
ER
NER
All India
546
1080
370
482
7
2486
25
13
14
0
0
52
355
33
77
149
29
643
26
33
47
0
0
106
40
82
19
0
22
163
71
73
210
5
0
359
1063
1314
737
636
58
3809
6.71
5.54
28.51
0.73
0.05
9.43
42.55
10.54
45.35
24.19
49.63
29.09
1.068
1.102
Based on State Max Demands
Diversity factor = Sum of regional or state maximum demands / All India maximum demand
*Source: RLDCs for solar connected to ISTS; SLDCs for embedded solar. Limited visibility of embedded solar data.
Executive Director-NLDC
Share of RES in total generation (%)
Share of Non-fossil fuel (Hydro,Nuclear and RES) in total generation(%)
H. All India Demand Diversity Factor
Based on Regional Max Demands
Lignite
Hydro
Nuclear
Gas, Naptha & Diesel
RES (Wind, Solar, Biomass & Others)
Total
State Sector
Total
G. Sourcewise generation (MU)
Coal
Actual(MU)
O/D/U/D(MU)
F. Generation Outage(MW)
Central Sector
Day Peak (MW)
E. Import/Export by Regions (in MU) - Import(+ve)/Export(-ve); OD(+)/UD(-)
Schedule(MU)
D. Transnational Exchanges (MU) - Import(+ve)/Export(-ve)€€€
Actual (MU)
Energy Shortage (MU)
Maximum Demand Met During the Day (MW) (From NLDC SCADA)
Time Of Maximum Demand Met (From NLDC SCADA)
B. Frequency Profile (%)
C. Power Supply Position in States
Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs)
Peak Shortage (MW)
Energy Met (MU)
Hydro Gen (MU)
Wind Gen (MU)
Solar Gen (MU)*
Report for previous day
Date of Reporting:
16-Jul-2020
A. Power Supply Position at All India and Regional level

def chunk(items, n, count):
    """
    [i1, i2, i3, i4, i5, i6, i7 ....i100]
    will break it into pieces of size n
    """
    s = 0 
    for i in range(count):
        yield items[s:s+n] # n items from items, starting s position
        s = (i+1)*n
        

def extract_table_A(pagetext):
    lines = pagetext.split("\n")  # one extracted value per line
    header = "NR WR SR ER NER TOTAL"
    headers = header.strip().split()
    data = {}
    steps = chunk(lines, len(headers), 9)  # header row + 8 data rows

for row in chunk(list(range(20)), 3, 4):
    print(row)

[0, 1, 2]
[3, 4, 5]
[6, 7, 8]
[9, 10, 11]
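
The `chunk` generator above needs the caller to say how many pieces to produce. A minimal sketch of a variant that works the number of pieces out from the length of the list; `chunk_all` is a hypothetical name used only for this illustration.

```python
def chunk_all(items, n):
    """Yield successive pieces of size n until items is exhausted.

    The last piece may be shorter than n.
    chunk_all is a hypothetical variant of chunk, for illustration only.
    """
    for start in range(0, len(items), n):
        yield items[start:start + n]

for row in chunk_all(list(range(8)), 3):
    print(row)
# [0, 1, 2]
# [3, 4, 5]
# [6, 7]
```

For the PDF page the explicit `count` is still useful, because only the first few lines of the page belong to table A.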

import random

def randomnums(n):
    for i in range(n):
        yield random.random()

ran = randomnums(5)

ran

<generator object randomnums at 0x7f45eda8a270>

r = reversed([1, 2, 3, 4, 5])

r

<list_reverseiterator at 0x7f45edf5f550>

next(r)

5

next(r)

4

next(r)

3

next(r)

2

next(r)

1

next(r)

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-41-8ebe59a56b1d> in <module>
----> 1 next(r)

StopIteration:

ran

<generator object randomnums at 0x7f45eda8a270>

next(ran)

0.7359782505244229

next(ran)

0.34407781572007

next(ran)

0.07464743675189789

ran

<generator object randomnums at 0x7f45eda8a270>

def randomnums(n):
    print("Start generator")
    for i in range(n):
        print("yielding ...", i)
        yield random.random() 
        print("Back to loop")
        
    print("End of generator")

ran = randomnums(3)

next(ran)

Start generator
yielding ... 0

0.800996994201215

next(ran)

Back to loop
yielding ... 1

0.9485583938907229

random.random()

0.700528870612659

def nhellos(n):
    print("Start generator")
    for i in range(n):
        print("yielding ...", i)
        yield "hello"
        print("Back to loop")  
    print("End of generator")

h = nhellos(2)

next(h)

Start generator
yielding ... 0

'hello'

s = next(h)

Back to loop
yielding ... 1

s

'hello'

next(h)

Back to loop
End of generator

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-58-31146b9ab14d> in <module>
----> 1 next(h)

StopIteration:

for r in randomnums(4):
    print(r)

Start generator
yielding ... 0
0.12238524801463746
Back to loop
yielding ... 1
0.15188283688094728
Back to loop
yielding ... 2
0.4091636583979146
Back to loop
yielding ... 3
0.5480814846207586
Back to loop
End of generator
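
The traces above show that a generator runs only when `next` asks for a value and raises StopIteration when the function body ends. A small sketch of two consequences, using a hypothetical `squares` generator: values are produced on demand, and a generator can be consumed only once.

```python
def squares(n):
    """Yield 0, 1, 4, ... one value at a time (hypothetical example)."""
    for i in range(n):
        yield i * i

gen = squares(4)
first = next(gen)          # runs the body up to the first yield
rest = list(gen)           # consumes the remaining values
again = list(gen)          # the generator is exhausted now, nothing is left
done = next(gen, "done")   # a default value avoids StopIteration

print(first, rest, again, done)
# 0 [1, 4, 9] [] done
```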

def chunk(items, n, count):
    """
    [i1, i2, i3, i4, i5, i6, i7 ....i100]
    will break it into pieces of size n
    """
    s = 0 
    for i in range(count):
        yield items[s:s+n] # n items from items, starting s position
        s = (i+1)*n
        

def extract_table_A(pagetext):
    lines = pagetext.split("\n")
    header = "NR WR SR ER NER TOTAL"
    headers = header.strip().split()
    data = {}
    steps = chunk(lines, len(headers), 9)
    next(steps)
    for row in steps:
        print(row)

extract_table_A(get_page("demanddata.pdf", 1))

['59882', '41115', '34238', '21526', '2730', '159491']
['1114', '0', '0', '0', '6', '1120']
['1398', '998', '807', '447', '48', '3698']
['355', '33', '77', '149', '29', '643']
['11', '49', '128', '-', '-', '187']
['39.60', '16.60', '41.59', '4.60', '0.03', '102']
['12.6', '0.0', '0.0', '0.0', '0.0', '12.6']
['65470', '43593', '38117', '21535', '2827', '160654']

import pandas as pd
def chunk(items, n, count):
    """
    [i1, i2, i3, i4, i5, i6, i7 ....i100]
    will break it into pieces of size n
    """
    s = 0 
    for i in range(count):
        yield items[s:s+n] # n items from items, starting s position
        s = (i+1)*n
        

def extract_table_A(pagetext):
    lines = pagetext.split("\n")
    header = "NR WR SR ER NER TOTAL"
    headers = header.strip().split()
    data = {}
    steps = chunk(lines, len(headers), 9)
    next(steps)
    for row in steps:
        for h, d in zip(headers, row):
            data.setdefault(h , []).append(d)
            
    return pd.DataFrame(data)

d = {}

d['x']

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-67-7657742692bd> in <module>
----> 1 d['x']

KeyError: 'x'

d.get('x', 0)

0

d

{}

d.setdefault("x", 0)

0

d

{'x': 0}
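
`setdefault` returns the value stored under the key, inserting the default first if the key is missing. That is what lets `data.setdefault(h, []).append(d)` build one list per header. A small sketch with two rows of the table; the shortened `headers` and `rows` lists are assumptions made to keep the example short.

```python
headers = ["NR", "WR", "SR"]          # shortened header list, for illustration
rows = [["59882", "41115", "34238"],
        ["1114", "0", "0"]]

data = {}
for row in rows:
    for h, value in zip(headers, row):
        # first time: stores [] under h and returns it; later: returns the existing list
        data.setdefault(h, []).append(value)

print(data)
# {'NR': ['59882', '1114'], 'WR': ['41115', '0'], 'SR': ['34238', '0']}
```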

import pandas as pd
def chunk(items, n, count):
    """
    [i1, i2, i3, i4, i5, i6, i7 ....i100]
    will break it into pieces of size n
    """
    s = 0 
    for i in range(count):
        yield items[s:s+n] # n items from items, starting s position
        s = (i+1)*n
        

def get_page(pdffile, pageno):
    with open(pdffile, "rb") as f:
        pdfreader = PyPDF2.PdfFileReader(f)
        page = pdfreader.getPage(pageno)
        return page.extractText()
        
        
def extract_table_A(pagetext):
    lines = pagetext.split("\n")
    header = "NR WR SR ER NER TOTAL"
    headers = header.strip().split()
    data = {}
    steps = chunk(lines, len(headers), 9)
    next(steps)
    for row in steps:
        for h, d in zip(headers, row):
            data.setdefault(h , []).append(d)
            
    return pd.DataFrame(data)

extract_table_A(get_page("demanddata.pdf", 1))

%%file extract_tableA.py
"""Extract table A from a pdf file and save it as csv.
Assumes the page layout of the file it was tested with:
https://posoco.in/download/16-07-20_nldc_psp/?wpdmdl=30215
"""
import pandas as pd
import PyPDF2
import typer

app = typer.Typer()


def chunk(items, n, count):
    """
    [i1, i2, i3, i4, i5, i6, i7 ....i100]
    will break it into pieces of size n
    """
    s = 0 
    for i in range(count):
        yield items[s:s+n] # n items from items, starting s position
        s = (i+1)*n
        

def get_page(pdffile, pageno):
    with open(pdffile, "rb") as f:
        pdfreader = PyPDF2.PdfFileReader(f)
        page = pdfreader.getPage(pageno)
        return page.extractText()
        
        
def extract_table_A(pagetext):
    lines = pagetext.split("\n")
    header = "NR WR SR ER NER TOTAL"
    headers = header.strip().split()
    data = {}
    steps = chunk(lines, len(headers), 9)
    next(steps)
    for row in steps:
        for h, d in zip(headers, row):
            data.setdefault(h , []).append(d)
            
    return pd.DataFrame(data)


@app.command()
def extract_tableA(pdffile, csvfile):
    """
    extracts table A from pdffile and saves it in csvfile
    """
    page = get_page(pdffile, 1)
    df = extract_table_A(page)
    df.to_csv(csvfile)
    
    
if __name__ == "__main__":
    app()

Overwriting extract_tableA.py

Any command line tool typically has elaborate command line options. typer builds the --help text from the function signature and docstring.

!python extract_tableA.py --help

Usage: extract_tableA.py [OPTIONS] PDFFILE CSVFILE

  extracts table A from pdffile and saves it in csvfile

Arguments:
  PDFFILE  [required]
  CSVFILE  [required]

Options:
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.

  --help                          Show this message and exit.

!python extract_tableA.py demanddata.pdf demanddata.csv

!pip install typer

Requirement already satisfied: typer in /home/vikrant/anaconda3/lib/python3.8/site-packages (0.3.1)
Requirement already satisfied: click<7.2.0,>=7.1.1 in /home/vikrant/anaconda3/lib/python3.8/site-packages (from typer) (7.1.2)
WARNING: You are using pip version 20.2.3; however, version 20.3.3 is available.
You should consider upgrading via the '/home/vikrant/anaconda3/bin/python -m pip install --upgrade pip' command.

!cat demanddata.csv

,NR,WR,SR,ER,NER,TOTAL
0,59882,41115,34238,21526,2730,159491
1,1114,0,0,0,6,1120
2,1398,998,807,447,48,3698
3,355,33,77,149,29,643
4,11,49,128,-,-,187
5,39.60,16.60,41.59,4.60,0.03,102
6,12.6,0.0,0.0,0.0,0.0,12.6
7,65470,43593,38117,21535,2827,160654
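
Every value in the csv is still text, and row 4 has `-` in two columns, presumably where no figure was reported. A minimal sketch of turning the cells into numbers with only the standard library; `to_number` is a hypothetical helper and `csv_text` is a three-row excerpt of the file above.

```python
import csv
import io

# excerpt of demanddata.csv, copied from the output above
csv_text = """,NR,WR,SR,ER,NER,TOTAL
3,355,33,77,149,29,643
4,11,49,128,-,-,187
5,39.60,16.60,41.59,4.60,0.03,102
"""

def to_number(text):
    """Convert a cell to float; return None for '-' or anything non-numeric.

    to_number is a hypothetical helper for illustration only.
    """
    try:
        return float(text)
    except ValueError:
        return None

reader = csv.DictReader(io.StringIO(csv_text))
rows = []
for record in reader:
    record.pop("")  # drop the unnamed index column written by to_csv
    rows.append({region: to_number(value) for region, value in record.items()})

print(rows[1])
# {'NR': 11.0, 'WR': 49.0, 'SR': 128.0, 'ER': None, 'NER': None, 'TOTAL': 187.0}
```

With pandas, `pd.to_numeric(df["ER"], errors="coerce")` does a similar conversion on a whole column and gives NaN for the `-` cells.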

%%file head.py
import typer

app = typer.Typer()

@app.command()
def head(filename:str, n:int=5):
    with open(filename) as f:
        for i in range(n):
            print(f.readline(), end="")
            
if __name__ == "__main__":
    app()

Overwriting head.py

!python head.py --help

Usage: head.py [OPTIONS] FILENAME

Arguments:
  FILENAME  [required]

Options:
  --n INTEGER                     [default: 5]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.

  --help                          Show this message and exit.

!python head.py demanddata.csv

,NR,WR,SR,ER,NER,TOTAL
0,59882,41115,34238,21526,2730,159491
1,1114,0,0,0,6,1120
2,1398,998,807,447,48,3698
3,355,33,77,149,29,643

!python head.py --n 3 demanddata.csv

,NR,WR,SR,ER,NER,TOTAL
0,59882,41115,34238,21526,2730,159491
1,1114,0,0,0,6,1120
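
head.py calls `readline` n times; when the file has fewer than n lines the extra calls return empty strings. An alternative sketch using `itertools.islice`, which simply stops at the end of the file; `head_lines` is a hypothetical name and the temporary file is only there to make the example self-contained.

```python
import itertools
import os
import tempfile

def head_lines(filename, n=5):
    """Return the first n lines of filename (fewer if the file is shorter).

    head_lines is a hypothetical helper for illustration only.
    """
    with open(filename) as f:
        return list(itertools.islice(f, n))

# small demonstration on a temporary file with three lines
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a\nb\nc\n")

two = head_lines(tmp.name, 2)         # first two lines
all_lines = head_lines(tmp.name, 10)  # only three lines exist
os.remove(tmp.name)

print(two)
print(all_lines)
```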

date patterns¶

import datetime

datetime.datetime.today().strftime("%Y")

'2021'

datetime.datetime.today()

datetime.datetime(2021, 1, 15, 13, 6, 57, 298854)

datetime.datetime.today().strftime("%Y-%m-%d %H:%M:%S")

'2021-01-15 13:07:35'

time = '2021-01-15 13:07:35'

datetime.datetime.strptime(time, "%Y-%m-%d %H:%M:%S")

datetime.datetime(2021, 1, 15, 13, 7, 35)

time1 = "2030/01/15"

datetime.datetime.strptime(time1, "%Y/%m/%d")

datetime.datetime(2030, 1, 15, 0, 0)
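
`strptime` and `strftime` together convert between date formats. A small sketch that reads the reporting date as it is written in the pdf (16-Jul-2020) and writes it in year-month-day form; it assumes the default locale, where `%b` matches English month abbreviations.

```python
import datetime

report_date = "16-Jul-2020"  # date format used in the report
# %d = day, %b = abbreviated month name (locale dependent), %Y = four digit year
parsed = datetime.datetime.strptime(report_date, "%d-%b-%Y")
iso = parsed.strftime("%Y-%m-%d")
print(iso)
# 2020-07-16
```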

import re #regular expression module

multiplestring = """
fjdsaf hdsg kjfhdsf kjhfds ds
dhf kdsjh

def hello():
    print("hello")

sadsad kjshdf jsfkjdhfs kjdshafkjhdsa f
kjhfds kd
kjhdsfkj 
kjhd f
kkdjkfj
"""

empty = re.compile(r"^$") # empty line
ninechars = re.compile(r"^.........$") # exactly nine characters
p1 = re.compile(r"\d+.+") # one or more digits followed by one or more characters
P2 = re.compile(r"^\d+$") # only digits, one or more

p1.match("hello") # if there is no match it returns None

p1.match("2kjfdkjf")

<re.Match object; span=(0, 8), match='2kjfdkjf'>

P2.match("fdfd")

P2.match("5")

<re.Match object; span=(0, 1), match='5'>

P2.match("5556575")

<re.Match object; span=(0, 7), match='5556575'>
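
Patterns like these can sort the lines extracted from the pdf into numbers, times and labels. A minimal sketch; the `number` and `clock` patterns and the sample `lines` list are assumptions made for this illustration, with values taken from the page text shown earlier.

```python
import re

# hypothetical patterns for illustration
number = re.compile(r"^-?\d+(\.\d+)?$")   # optional minus, digits, optional decimal part
clock = re.compile(r"^\d{2}:\d{2}$")      # times such as 22:20

lines = ["NR", "59882", "-1.8", "39.60", "22:20", "-", "Punjab"]

numbers = [s for s in lines if number.match(s)]
times = [s for s in lines if clock.match(s)]
labels = [s for s in lines if not number.match(s) and not clock.match(s)]

print(numbers)  # ['59882', '-1.8', '39.60']
print(times)    # ['22:20']
print(labels)   # ['NR', '-', 'Punjab']
```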

s = "<c>text</c>"

DataFrame returned by extract_table_A(get_page("demanddata.pdf", 1)):

      NR     WR     SR     ER   NER   TOTAL
0  59882  41115  34238  21526  2730  159491
1   1114      0      0      0     6    1120
2   1398    998    807    447    48    3698
3    355     33     77    149    29     643
4     11     49    128      -     -     187
5  39.60  16.60  41.59   4.60  0.03     102
6   12.6    0.0    0.0    0.0   0.0    12.6
7  65470  43593  38117  21535  2827  160654