Module 3 - Day 3

Quick recap

HTTP protocol

  • server - the machine where the HTTP server runs; it serves clients by responding with HTML, XML, files or streamed data
  • client - Firefox, Chrome, a Python script or a mobile app; anything that uses the HTTP protocol to fetch data from a server
  • get - all the information is in the URL; every required parameter is passed in the URL itself
  • post - some parameters can be given in the URL, but the rest travel hidden in the request body (for example form fields)
  • put - a mechanism to put a file/data directly on the server
  • delete - a mechanism to delete a file/data from the server
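The difference between get and post can be seen without sending anything over the network. A minimal sketch using only the standard library; the example.com URL and the parameter values are placeholders, not a real API:

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"a": 3, "b": 4, "n": 10}

# GET: the parameters become part of the URL (the query string)
get_url = "https://example.com/fibs?" + urlencode(params)
get_request = Request(get_url, method="GET")

# POST: the same parameters travel in the request body instead
post_request = Request("https://example.com/fibs",
                       data=urlencode(params).encode("ascii"),
                       method="POST")

print(get_request.full_url)    # parameters visible in the URL
print(post_request.full_url)   # URL carries no parameters
print(post_request.data)       # parameters are in the body
```

Nothing is sent here; the Request objects are only built so that the URL and the body can be inspected.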
import requests
url = "https://www.python.org/"
response = requests.get(url)
response
<Response [200]>
print(response.text[:600]) # html is nothing but text
<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <!-- Google tag (gtag.js) -->
    <script async src="https://www.googletagmanager.com/gtag/js?id=G-TF35YF9CVH"></script>
    <script>
      window.dataLayer = window.dataLayer || [];
      function gtag(){dataLayer.push(arguments);}
      gtag('js'

alphavantage example

url = "https://www.alphavantage.co/query"
API_KEY = "UKVFE0JLE0TBPDEF"
p = {"function":"TIME_SERIES_INTRADAY",
          "symbol":"IBM",
          "interval": "5min",
          "apikey": API_KEY}
r = requests.get(url, params=p)
r
<Response [200]>
r.status_code
200
data = r.json() # converts the returned JSON data into Python lists/dictionaries
len(data)
2
type(data)
dict
data.keys()
dict_keys(['Meta Data', 'Time Series (5min)'])
data['Meta Data']
{'1. Information': 'Intraday (5min) open, high, low, close prices and volume',
 '2. Symbol': 'IBM',
 '3. Last Refreshed': '2024-03-12 19:55:00',
 '4. Interval': '5min',
 '5. Output Size': 'Compact',
 '6. Time Zone': 'US/Eastern'}
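Once the JSON is a dictionary, the rest is ordinary dictionary work. A small sketch on a hand-written sample, so that it runs without an API key. The sample only imitates the shape of the response: the inner keys such as "4. close" and all the numbers are assumptions made up for illustration.

```python
import json

# Hypothetical sample in the shape of the intraday response; the values are made up.
sample = """
{
  "Meta Data": {"2. Symbol": "IBM", "4. Interval": "5min"},
  "Time Series (5min)": {
    "2024-03-12 19:55:00": {"1. open": "197.10", "4. close": "197.30", "5. volume": "120"},
    "2024-03-12 19:50:00": {"1. open": "197.00", "4. close": "197.10", "5. volume": "85"}
  }
}
"""

data = json.loads(sample)          # same kind of object that r.json() returns
series = data["Time Series (5min)"]

# in this sample the prices are strings, so convert before doing arithmetic
closes = {timestamp: float(bar["4. close"]) for timestamp, bar in series.items()}
latest = max(closes)               # timestamps in this format sort chronologically as text
print(latest, closes[latest])
```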

problem

Write a function download_notes to scrape live notes from https://notes.arcesium-lab.pipal.in. The pattern for the URL is this

notes for module 1 day 1 -> https://notes.arcesium-lab.pipal.in/1-1.html
notes for module 3 day 2 -> https://notes.arcesium-lab.pipal.in/3-2.html

There are three modules and 5 days in each module.

def download_training_notes(module, day):
    filename = f"{module}-{day}.html"
    url = f"https://notes.arcesium-lab.pipal.in/{filename}"
    r = requests.get(url)
    if r.status_code == 200:
        with open(filename, "w") as f:
            f.write(r.text)
download_training_notes(1,1)
import os
import requests

def download_training_notes(module, day):
    filename = f"{module}-{day}.html"
    url = f"https://notes.arcesium-lab.pipal.in/{filename}"
    r = requests.get(url)
    if r.status_code == 200:
        with open(filename, "w") as f:
            f.write(r.text)

def download_all_notes():
    foldername = "all_training_notes"
    os.mkdir(foldername)
    os.chdir(foldername)
    for module in range(1, 3):
        for day in range(1, 6):
            print(f"Downloading module {module} - day {day}") 
            download_training_notes(module, day)
    os.chdir("..")
            
download_all_notes()
FileNotFoundError: [Errno 2] No such file or directory: 'all_training_notes'
os.getcwd()
FileNotFoundError: [Errno 2] No such file or directory
os.chdir("/home/jupyter-vikrant/arcesium-python-2024")
download_all_notes()
Downloading module 1 - day 1
Downloading module 1 - day 2
Downloading module 1 - day 3
Downloading module 1 - day 4
Downloading module 1 - day 5
Downloading module 2 - day 1
Downloading module 2 - day 2
Downloading module 2 - day 3
Downloading module 2 - day 4
Downloading module 2 - day 5
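Changing the working directory with os.chdir leaves the notebook in a different directory if anything fails halfway. A sketch of an alternative that never changes directory: build full paths with os.path.join and create the folder with os.makedirs(..., exist_ok=True). The helper names notes_targets and prepare_folder are made up for this example, and nothing is downloaded here; each (url, path) pair would be handed to requests.get and open. It loops over all three modules, whereas range(1, 3) in the run above stops at module 2.

```python
import os
import tempfile

BASE_URL = "https://notes.arcesium-lab.pipal.in"

def notes_targets(folder, modules=3, days=5):
    """Return (url, path) pairs for every module/day combination."""
    targets = []
    for module in range(1, modules + 1):      # range excludes its end, hence the + 1
        for day in range(1, days + 1):
            filename = f"{module}-{day}.html"
            targets.append((f"{BASE_URL}/{filename}", os.path.join(folder, filename)))
    return targets

def prepare_folder(folder):
    # exist_ok=True makes a second run harmless instead of raising FileExistsError
    os.makedirs(folder, exist_ok=True)
    return folder

with tempfile.TemporaryDirectory() as tmp:
    folder = prepare_folder(os.path.join(tmp, "all_training_notes"))
    prepare_folder(folder)                    # calling it again does not fail
    targets = notes_targets(folder)
    print(len(targets))
    print(targets[-1][0])
```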

Exception Handling


def download_all_notes():
    foldername = "all_training_notes"
    os.mkdir(foldername)
    os.chdir(foldername)
    try:
        for module in range(1, 3):
            for day in range(1, 6):
                print(f"Downloading module {module} - day {day}") 
                download_training_notes(module, day)
    except Exception as e:
        pass
    finally:
        os.chdir("..")
x + 2
NameError: name 'x' is not defined

def func():
    foldername = "all_training_notes"
    os.mkdir(foldername)
    os.chdir(foldername)
    try:
        for module in range(1, 3):
            for day in range(1, 6):
                doom
                print(f"Downloading module {module} - day {day}") 
                download_training_notes(module, day)
    except Exception as e:
        print(e)
    finally:
        print("I was in finally block")
        os.chdir("..")
func()
FileExistsError: [Errno 17] File exists: 'all_training_notes'

def func():
    try:
        for module in range(1, 3):
            for day in range(1, 6):
                doom
                print("Hello World!") 
    except Exception as e:
        print(e)
    finally:
        print("I was in finally block")
func()
name 'doom' is not defined
I was in finally block
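The order in which try, except and finally run can be recorded in a list. A small sketch (the function names run, good and bad are made up for this example); it also catches only NameError instead of every Exception, so unexpected errors are not silently swallowed:

```python
def run(step):
    log = []
    try:
        log.append("try")
        step()
        log.append("no error")
    except NameError as e:         # catch only what we know how to handle
        log.append(f"caught: {e}")
    finally:
        log.append("finally")      # runs on success and on failure
    return log

def good():
    return 42

def bad():
    return doom                    # undefined name, raises NameError when called

print(run(good))
print(run(bad))
```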
%load_problem fibs-api
Problem: Fibs API

Write a function fibs to generate fibonacci numbers using fibs API at https://numbers.apps.pipal.in/.

The API takes 3 query parameters a, b and n and generates n fibonacci numbers starting with a and b.

$ curl 'https://numbers.apps.pipal.in/fibs?a=3&b=4&n=10'
3
4
7
11
18
29
47
76
123
199

Write a function fibs that takes three numbers a, b and n as arguments and returns a list of fibonacci numbers given by the API.

>>> fibs(3, 4, 10)
[3, 4, 7, 11, 18, 29, 47, 76, 123, 199]

>>> sum(fibs(3, 4, 10))
517

You can verify your solution using:

%verify_problem fibs-api

# your code here

def fibs(first, second, n):
    apiparams = {"a": first,
                 "b": second,
                 "n": n}
    url =  "https://numbers.apps.pipal.in/fibs"
    r = requests.get(url, params=apiparams)
    if r.status_code == 200:
        return r.text   
fibs(1, 1, 30)
'1\n1\n2\n3\n5\n8\n13\n21\n34\n55\n89\n144\n233\n377\n610\n987\n1597\n2584\n4181\n6765\n10946\n17711\n28657\n46368\n75025\n121393\n196418\n317811\n514229\n832040\n'
r.text
# your code here

def fibs(first, second, n):
    apiparams = {"a": first,
                 "b": second,
                 "n": n}
    url =  "https://numbers.apps.pipal.in/fibs"
    r = requests.get(url, params=apiparams)
    if r.status_code == 200:
        return r.text.split()  
fibs(1, 1, 10)
['1', '1', '2', '3', '5', '8', '13', '21', '34', '55']
# your code here

def fibs(first, second, n):
    apiparams = {"a": first,
                 "b": second,
                 "n": n}
    url =  "https://numbers.apps.pipal.in/fibs"
    r = requests.get(url, params=apiparams)
    if r.status_code == 200:
        return [int(w) for w in r.text.split()]  
fibs(1, 1, 20)
[1,
 1,
 2,
 3,
 5,
 8,
 13,
 21,
 34,
 55,
 89,
 144,
 233,
 377,
 610,
 987,
 1597,
 2584,
 4181,
 6765]
sum(fibs(1, 1, 100))
927372692193078999175
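The API results can be cross-checked with a local computation. A sketch that needs no network; local_fibs and parse_numbers are names made up for this example, and parse_numbers is the same split-and-int step as in fibs above:

```python
def local_fibs(a, b, n):
    """Generate n fibonacci-style numbers starting with a and b, without any API."""
    numbers = []
    for _ in range(n):
        numbers.append(a)
        a, b = b, a + b
    return numbers

def parse_numbers(text):
    """Turn one-number-per-line text into a list of ints."""
    return [int(word) for word in text.split()]

print(local_fibs(3, 4, 10))
print(sum(local_fibs(3, 4, 10)))
print(parse_numbers("3\n4\n7\n"))
```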
def write_fibs(first, second, n, filename):
    apiparams = {"a": first,
                 "b": second,
                 "n": n}
    url =  "https://numbers.apps.pipal.in/fibs"
    r = requests.get(url, params=apiparams)
    if r.status_code == 200:
        with open(filename, "w") as f:
            f.write(r.text)
write_fibs(1, 1, 50, "fibs50.txt")
!cat fibs50.txt
1
1
2
3
5
8
13
21
34
55
89
144
233
377
610
987
1597
2584
4181
6765
10946
17711
28657
46368
75025
121393
196418
317811
514229
832040
1346269
2178309
3524578
5702887
9227465
14930352
24157817
39088169
63245986
102334155
165580141
267914296
433494437
701408733
1134903170
1836311903
2971215073
4807526976
7778742049
12586269025
filename = zen.txt
NameError: name 'zen' is not defined
filename  = "zen.txt"
doom # if your text does not start with ' or " then it means it is a variable
NameError: name 'doom' is not defined
"doom" # literal text
'doom'

Authentication

Some website APIs are protected with a username and password

%%file /tmp/pass.txt
UJKJDS^&KSG
Writing /tmp/pass.txt
user = "someusername"
with open("/tmp/pass.txt") as f:
    pass_ = f.read().strip()

r = requests.get("https://api.github.com/user", auth=(user, pass_))
r
<Response [401]>
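Passing auth=(user, pass_) makes requests use HTTP Basic authentication: the username and password are joined with a colon, base64-encoded and sent in the Authorization header. A sketch of what that header looks like, using only the standard library (basic_auth_header is a name made up for this example, and the password is a placeholder):

```python
import base64

def basic_auth_header(user, password):
    """Build the Authorization header value used by HTTP Basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

header = basic_auth_header("someusername", "not-a-real-password")
print(header)

# base64 is an encoding, not encryption: anyone who sees the header can reverse it
decoded = base64.b64decode(header.split()[1]).decode("utf-8")
print(decoded)
```

Because the header can be decoded this easily, such credentials should only travel over https.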

Different kinds of authentications

  • kerberos auth
pip install requests requests-kerberos
from requests_kerberos import HTTPKerberosAuth, OPTIONAL
kerberos_auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
r = requests.get(url, auth=kerberos_auth)

Accessing emails from Outlook

All about Outlook and Python

Create a virtual environment named pywinservices

python -m venv pywinservices

activate it using

.bat

After activation, install the following package

pip install pypiwin32

The following code uses the MAPI API provided by Microsoft. For more advanced cases refer to

https://docs.microsoft.com/en-us/office/client-developer/outlook/outlook-home

import datetime
import os
import win32com.client


def get_outlook():
    return win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")


def get_inbox(email=None):
    outlook = get_outlook()
    if email is None:
        inbox = outlook.GetDefaultFolder(6)  # 6 = Inbox of the default account
    else:
        inbox = outlook.Folders.Item(email).Folders.Item('Inbox')
        # inbox1 = outlook.Folders.Item('123@abc.com').Folders.Item('Inbox') # To access 123@abc.com Inbox
        # inbox2 = outlook.Folders.Item('456@def.com').Folders.Item('Inbox') # To access 456@def.com Inbox
    return inbox


def get_default_inbox_messages():
    inbox = get_inbox()
    messages = inbox.Items
    return messages


def saveattachemnts_from_email(subject):
    """Save the first attachment of emails that are unread with a matching
    subject, or that were sent today.
    """
    messages = get_default_inbox_messages()
    path = os.path.expanduser("~/Desktop/Attachments")
    os.makedirs(path, exist_ok=True)  # make sure the target folder exists
    today = datetime.date.today()

    for message in messages:
        # `and` binds tighter than `or`; the parentheses make that explicit
        if (message.Subject == subject and message.Unread) or message.SentOn.date() == today:
            # body_content = message.Body
            for attachment in message.Attachments:
                attachment.SaveAsFile(os.path.join(path, attachment.FileName))
                if message.Subject == subject and message.Unread:
                    message.Unread = False
                break  # stop after the first attachment


# print the first few emails from a folder
def print_emails(folderindex: int = 6, emailcount: int = 3):
    """
    folderindex 3, 4, 5, 6 ..Trash, Outbox, Sent, Inbox
    Try different folder indexes and email counts.
    """
    messages = get_outlook().GetDefaultFolder(folderindex).Items
    for i in range(emailcount):
        message = messages[i]
        print(message.Subject)
        # this body of message can be parsed to extract table
        print(message.Body)
        print("="*30)


def send_email_with_attachment():
    import win32com.client as win32
    outlook = win32.Dispatch('outlook.application')
    mail = outlook.CreateItem(0)
    mail.To = 'To address'
    mail.Subject = 'Message subject'
    mail.Body = 'Message body'
    mail.HTMLBody = '<h2>HTML Message body</h2>'  # this field is optional

    # To attach a file to the email (optional):
    attachment = "Path to the attachment"
    mail.Attachments.Add(attachment)

    mail.Send()
ModuleNotFoundError: No module named 'win32com'
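win32com comes with the pywin32 package, which is only available on Windows, so the import fails on other systems as in the error above. A sketch of a check that can run before calling any of the Outlook functions (outlook_automation_available is a name made up for this example):

```python
import importlib.util
import sys

def outlook_automation_available():
    """True only on Windows with the win32com package importable."""
    on_windows = sys.platform == "win32"
    has_win32com = importlib.util.find_spec("win32com") is not None
    return on_windows and has_win32com

if outlook_automation_available():
    print("win32com found, the Outlook functions can be used")
else:
    print("Outlook automation is not available on this machine")
```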

Reading pdf files

pdffileurl = "https://posoco.in/download/16-07-20_nldc_psp/?wpdmdl=30215"

import requests

def download(url, filename):
    r = requests.get(url)
    with open(filename, "wb") as f: # should be opened in binary write mode
        f.write(r.content)

download(pdffileurl, "data.pdf")
import PyPDF2
with open("data.pdf", "rb") as f: # you will have to open in read-binary mode
    pdfreader = PyPDF2.PdfReader(f)
    for page in pdfreader.pages:
        print(page.extract_text()[:100])
        print("="*20)
 
National Load Despatch Centre  
राष्ट्रीय भार प्रेषण केंद्र 
POWER SYSTEM OPERATION CORPORATION LI
====================
NR WR SR ER NER TOTAL
59882 41115 34238 21526 2730 159491
1114 0 0 0 6 1120
1398 998 807 447 48 3698
====================
16-Jul-2020
Sl 
NoVoltage Level Line Details Circuit Max Import (MW) Max Export (MW) Import (MU) Exp
====================
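download() above writes whatever the server returns, so an error page would also end up saved as data.pdf. PDF files start with the bytes %PDF-, which gives a cheap check before handing the file to a PDF reader. A sketch with two throwaway files (looks_like_pdf is a name made up for this example, and the file contents are invented):

```python
import os
import tempfile

def looks_like_pdf(filename):
    """Check the first bytes of a file for the PDF signature."""
    with open(filename, "rb") as f:
        return f.read(5) == b"%PDF-"

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.4\n...")            # made-up content, only the signature is real
    with open(bad, "wb") as f:
        f.write(b"<html>Not found</html>")   # what an error page saved as .pdf looks like
    results = (looks_like_pdf(good), looks_like_pdf(bad))

print(results)
```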
with open("data.pdf", "rb") as f: # you will have to open in read-binary mode
    pdfreader = PyPDF2.PdfReader(f)
    page = pdfreader.pages[1]
    print(page.extract_text())
NR WR SR ER NER TOTAL
59882 41115 34238 21526 2730 159491
1114 0 0 0 6 1120
1398 998 807 447 48 3698
355 33 77 149 29 643
11 49 128 - - 187
39.60 16.60 41.59 4.60 0.03 102
12.6 0.0 0.0 0.0 0.0 12.6
65470 43593 38117 21535 2827 160654
22:20 10:29 10:00 21:20 19:41 21:26
Region FVI < 49.7 49.7 - 49.8 49.8 - 49.9 < 49.9 49.9 - 50.05 > 50.05
All India 0.057 0.16 1.81 13.19 15.16 76.52 8.32
Max.Demand Shortage during Energy Met Drawal OD(+)/UD(-) Max OD Energy
Region States Met during the 
day(MW)maximum 
Demand(MW)(MU)Schedule
(MU)(MU) (MW)Shortage 
(MU)
Punjab 11090 0 237.9 146.8 -1.8 49 0.0
Haryana 9388 0 209.4 152.8 0.7 325 1.9
Rajasthan 12087 0 262.4 119.7 5.4 809 0.0
Delhi 5726 0 118.6 102.8 -1.4 228 0.0
NR UP 22873 0 448.9 208.5 2.0 546 0.4
Uttarakhand 1899 0 42.8 20.7 0.8 111 0.0
HP 1366 0 28.6 -2.6 -0.2 91 0.0
J&K(UT) & Ladakh(UT) 2177 544 43.1 20.3 0.4 502 10.3
Chandigarh 295 0 6.0 5.9 0.2 61 0.0
Chhattisgarh 3685 0 86.9 36.8 0.8 468 0.0
Gujarat 13478 0 286.2 87.6 4.0 527 0.0
MP 9547 0 214.7 113.8 -3.8 198 0.0
WR Maharashtra 16964 0 365.1 138.1 -1.9 457 0.0
Goa 405 0 8.5 8.2 -0.2 33 0.0
DD 246 0 5.3 5.3 0.0 19 0.0
DNH 614 0 14.0 13.8 0.2 44 0.0
AMNSIL 777 0 17.1 4.2 0.7 272 0.0
Andhra Pradesh 6439 0 141.0 45.6 -1.3 607 0.0
Telangana 8614 0 167.3 81.6 -2.5 385 0.0
SR Karnataka 8486 0 155.1 51.1 -3.4 650 0.0
Kerala 3077 0 65.2 46.1 0.5 179 0.0
Tamil Nadu 12371 0 271.3 125.9 -3.7 573 0.0
Puducherry 349 0 7.5 7.5 -0.1 35 0.0
Bihar 5740 0 111.5 106.0 -0.3 386 0.0
DVC 2989 0 62.7 -42.6 -0.7 206 0.0
Jharkhand 1438 0 26.3 18.5 -1.0 124 0.0
ER Odisha 3983 0 82.2 -0.2 -0.2 325 0.0
West Bengal 7917 0 162.6 47.2 -0.8 303 0.0
Sikkim 100 0 1.4 1.5 -0.1 17 0.0
Arunachal Pradesh 120 3 2.0 1.8 0.2 40 0.0
Assam 1759 23 30.0 27.1 -0.1 135 0.0
Manipur 183 1 2.6 2.3 0.3 37 0.0
NER Meghalaya 307 2 5.3 -1.3 0.3 52 0.0
Mizoram 89 1 1.5 1.2 0.0 13 0.0
Nagaland 140 2 2.2 2.3 -0.2 23 0.0
Tripura 298 7 4.9 5.9 0.7 66 0.0
Bhutan Nepal Bangladesh
53.3 -1.5 -19.1
2337.0 -271.3 -1110.0
NR WR SR ER NER TOTAL
352.1 -295.4 95.0 -145.8 -6.0 0.0
359.2 -293.7 84.6 -152.6 -3.4 -6.0
7.1 1.6 -10.5 -6.9 2.6 -6.0
NR WR SR ER NER TOTAL
3838 14847 11792 3445 677 34598
9289 23225 14423 4892 47 51876
13127 38072 26215 8337 723 86473
NR WR SR ER NER All India
546 1080 370 482 7 2486
25 13 14 0 0 52
355 33 77 149 29 643
26 33 47 0 0 106
40 82 19 0 22 163
71 73 210 5 0 359
1063 1314 737 636 58 3809
6.71 5.54 28.51 0.73 0.05 9.43
42.55 10.54 45.35 24.19 49.63 29.09
1.068
1.102 Based on State Max Demands
Diversity factor = Sum of regional or state maximum demands / All India maximum demand
*Source: RLDCs for solar connected to ISTS; SLDCs for embedded solar. Limited visibility of embedded solar data.
Executive Director-NLDCShare of RES in total generation (%)
Share of Non-fossil fuel (Hydro,Nuclear and RES) in total generation(%)
H. All India Demand Diversity Factor
Based on Regional Max DemandsLignite
Hydro
Nuclear
Gas, Naptha & Diesel
RES (Wind, Solar, Biomass & Others)
TotalState Sector
Total
G. Sourcewise generation (MU)
CoalActual(MU)
O/D/U/D(MU)
F. Generation Outage(MW)
Central SectorDay Peak (MW)
E. Import/Export by Regions (in MU) - Import(+ve)/Export(-ve); OD(+)/UD(-)
Schedule(MU)D. Transnational Exchanges (MU) - Import(+ve)/Export(-ve)   
Actual (MU)Energy Shortage (MU)
Maximum Demand Met During the Day (MW) (From NLDC SCADA)
Time Of Maximum Demand Met (From NLDC SCADA)
B. Frequency Profile (%)
C. Power Supply Position in StatesDemand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs)
Peak Shortage (MW)
Energy Met (MU)
Hydro Gen (MU)
Wind Gen (MU)
Solar Gen (MU)*Report for previous day Date of Reporting: 16-Jul-2020
A. Power Supply Position at All India and Regional level
def extract_page(i):
    with open("data.pdf", "rb") as f: # you will have to open in read-binary mode
        pdfreader = PyPDF2.PdfReader(f)
        page = pdfreader.pages[i]
        return page.extract_text()
extract_page(1)
'NR WR SR ER NER TOTAL\n59882 41115 34238 21526 2730 159491\n1114 0 0 0 6 1120\n1398 998 807 447 48 3698\n355 33 77 149 29 643\n11 49 128 - - 187\n39.60 16.60 41.59 4.60 0.03 102\n12.6 0.0 0.0 0.0 0.0 12.6\n65470 43593 38117 21535 2827 160654\n22:20 10:29 10:00 21:20 19:41 21:26\nRegion FVI < 49.7 49.7 - 49.8 49.8 - 49.9 < 49.9 49.9 - 50.05 > 50.05\nAll India 0.057 0.16 1.81 13.19 15.16 76.52 8.32\nMax.Demand Shortage during Energy Met Drawal OD(+)/UD(-) Max OD Energy\nRegion States Met during the \nday(MW)maximum \nDemand(MW)(MU)Schedule\n(MU)(MU) (MW)Shortage \n(MU)\nPunjab 11090 0 237.9 146.8 -1.8 49 0.0\nHaryana 9388 0 209.4 152.8 0.7 325 1.9\nRajasthan 12087 0 262.4 119.7 5.4 809 0.0\nDelhi 5726 0 118.6 102.8 -1.4 228 0.0\nNR UP 22873 0 448.9 208.5 2.0 546 0.4\nUttarakhand 1899 0 42.8 20.7 0.8 111 0.0\nHP 1366 0 28.6 -2.6 -0.2 91 0.0\nJ&K(UT) & Ladakh(UT) 2177 544 43.1 20.3 0.4 502 10.3\nChandigarh 295 0 6.0 5.9 0.2 61 0.0\nChhattisgarh 3685 0 86.9 36.8 0.8 468 0.0\nGujarat 13478 0 286.2 87.6 4.0 527 0.0\nMP 9547 0 214.7 113.8 -3.8 198 0.0\nWR Maharashtra 16964 0 365.1 138.1 -1.9 457 0.0\nGoa 405 0 8.5 8.2 -0.2 33 0.0\nDD 246 0 5.3 5.3 0.0 19 0.0\nDNH 614 0 14.0 13.8 0.2 44 0.0\nAMNSIL 777 0 17.1 4.2 0.7 272 0.0\nAndhra Pradesh 6439 0 141.0 45.6 -1.3 607 0.0\nTelangana 8614 0 167.3 81.6 -2.5 385 0.0\nSR Karnataka 8486 0 155.1 51.1 -3.4 650 0.0\nKerala 3077 0 65.2 46.1 0.5 179 0.0\nTamil Nadu 12371 0 271.3 125.9 -3.7 573 0.0\nPuducherry 349 0 7.5 7.5 -0.1 35 0.0\nBihar 5740 0 111.5 106.0 -0.3 386 0.0\nDVC 2989 0 62.7 -42.6 -0.7 206 0.0\nJharkhand 1438 0 26.3 18.5 -1.0 124 0.0\nER Odisha 3983 0 82.2 -0.2 -0.2 325 0.0\nWest Bengal 7917 0 162.6 47.2 -0.8 303 0.0\nSikkim 100 0 1.4 1.5 -0.1 17 0.0\nArunachal Pradesh 120 3 2.0 1.8 0.2 40 0.0\nAssam 1759 23 30.0 27.1 -0.1 135 0.0\nManipur 183 1 2.6 2.3 0.3 37 0.0\nNER Meghalaya 307 2 5.3 -1.3 0.3 52 0.0\nMizoram 89 1 1.5 1.2 0.0 13 0.0\nNagaland 140 2 2.2 2.3 -0.2 23 0.0\nTripura 298 7 4.9 5.9 0.7 66 0.0\nBhutan Nepal 
Bangladesh\n53.3 -1.5 -19.1\n2337.0 -271.3 -1110.0\nNR WR SR ER NER TOTAL\n352.1 -295.4 95.0 -145.8 -6.0 0.0\n359.2 -293.7 84.6 -152.6 -3.4 -6.0\n7.1 1.6 -10.5 -6.9 2.6 -6.0\nNR WR SR ER NER TOTAL\n3838 14847 11792 3445 677 34598\n9289 23225 14423 4892 47 51876\n13127 38072 26215 8337 723 86473\nNR WR SR ER NER All India\n546 1080 370 482 7 2486\n25 13 14 0 0 52\n355 33 77 149 29 643\n26 33 47 0 0 106\n40 82 19 0 22 163\n71 73 210 5 0 359\n1063 1314 737 636 58 3809\n6.71 5.54 28.51 0.73 0.05 9.43\n42.55 10.54 45.35 24.19 49.63 29.09\n1.068\n1.102 Based on State Max Demands\nDiversity factor = Sum of regional or state maximum demands / All India maximum demand\n*Source: RLDCs for solar connected to ISTS; SLDCs for embedded solar. Limited visibility of embedded solar data.\nExecutive Director-NLDCShare of RES in total generation (%)\nShare of Non-fossil fuel (Hydro,Nuclear and RES) in total generation(%)\nH. All India Demand Diversity Factor\nBased on Regional Max DemandsLignite\nHydro\nNuclear\nGas, Naptha & Diesel\nRES (Wind, Solar, Biomass & Others)\nTotalState Sector\nTotal\nG. Sourcewise generation (MU)\nCoalActual(MU)\nO/D/U/D(MU)\nF. Generation Outage(MW)\nCentral SectorDay Peak (MW)\nE. Import/Export by Regions (in MU) - Import(+ve)/Export(-ve); OD(+)/UD(-)\nSchedule(MU)D. Transnational Exchanges (MU) - Import(+ve)/Export(-ve)\xa0\xa0\xa0\nActual (MU)Energy Shortage (MU)\nMaximum Demand Met During the Day (MW) (From NLDC SCADA)\nTime Of Maximum Demand Met (From NLDC SCADA)\nB. Frequency Profile (%)\nC. Power Supply Position in StatesDemand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs)\nPeak Shortage (MW)\nEnergy Met (MU)\nHydro Gen (MU)\nWind Gen (MU)\nSolar Gen (MU)*Report for previous day Date of Reporting: 16-Jul-2020\nA. Power Supply Position at All India and Regional level\n'
def extract_tableA_data(pagetext):
    lines = pagetext.split("\n")
    data_of_interest = lines[:10]

    headers = data_of_interest[0].split()
    data = [line.split() for line in data_of_interest[1:]]
    return headers, data


def extract_table_A(filename):
    page = extract_page(1)
    return extract_tableA_data(page)
extract_table_A("data.pdf")
(['NR', 'WR', 'SR', 'ER', 'NER', 'TOTAL'],
 [['59882', '41115', '34238', '21526', '2730', '159491'],
  ['1114', '0', '0', '0', '6', '1120'],
  ['1398', '998', '807', '447', '48', '3698'],
  ['355', '33', '77', '149', '29', '643'],
  ['11', '49', '128', '-', '-', '187'],
  ['39.60', '16.60', '41.59', '4.60', '0.03', '102'],
  ['12.6', '0.0', '0.0', '0.0', '0.0', '12.6'],
  ['65470', '43593', '38117', '21535', '2827', '160654'],
  ['22:20', '10:29', '10:00', '21:20', '19:41', '21:26']])
import pandas as pd
pd.DataFrame({"A":[1, 2, 3, 4],
              "B":["a","b","c","d"]})
A B
0 1 a
1 2 b
2 3 c
3 4 d
pd.DataFrame([{"A":1, "B":"a"},
             {"A":2, "B":"b"},
             {"A":3, "B": "c"},
             {"A": 4, "B": "d"}])
A B
0 1 a
1 2 b
2 3 c
3 4 d
columnames, rows = extract_table_A("data.pdf")
dictrows = []
for row in rows:
    dictrows.append(dict(zip(columnames, row)))
dictrows
[{'NR': '59882',
  'WR': '41115',
  'SR': '34238',
  'ER': '21526',
  'NER': '2730',
  'TOTAL': '159491'},
 {'NR': '1114', 'WR': '0', 'SR': '0', 'ER': '0', 'NER': '6', 'TOTAL': '1120'},
 {'NR': '1398',
  'WR': '998',
  'SR': '807',
  'ER': '447',
  'NER': '48',
  'TOTAL': '3698'},
 {'NR': '355',
  'WR': '33',
  'SR': '77',
  'ER': '149',
  'NER': '29',
  'TOTAL': '643'},
 {'NR': '11', 'WR': '49', 'SR': '128', 'ER': '-', 'NER': '-', 'TOTAL': '187'},
 {'NR': '39.60',
  'WR': '16.60',
  'SR': '41.59',
  'ER': '4.60',
  'NER': '0.03',
  'TOTAL': '102'},
 {'NR': '12.6',
  'WR': '0.0',
  'SR': '0.0',
  'ER': '0.0',
  'NER': '0.0',
  'TOTAL': '12.6'},
 {'NR': '65470',
  'WR': '43593',
  'SR': '38117',
  'ER': '21535',
  'NER': '2827',
  'TOTAL': '160654'},
 {'NR': '22:20',
  'WR': '10:29',
  'SR': '10:00',
  'ER': '21:20',
  'NER': '19:41',
  'TOTAL': '21:26'}]
pd.DataFrame(dictrows)
NR WR SR ER NER TOTAL
0 59882 41115 34238 21526 2730 159491
1 1114 0 0 0 6 1120
2 1398 998 807 447 48 3698
3 355 33 77 149 29 643
4 11 49 128 - - 187
5 39.60 16.60 41.59 4.60 0.03 102
6 12.6 0.0 0.0 0.0 0.0 12.6
7 65470 43593 38117 21535 2827 160654
8 22:20 10:29 10:00 21:20 19:41 21:26
def extract_table_A(filename):
    page = extract_page(1)
    colnames, rows = extract_tableA_data(page)
    dictrows  = [dict(zip(colnames, row)) for row in rows]
    return pd.DataFrame(dictrows)
extract_table_A("data.pdf")
NR WR SR ER NER TOTAL
0 59882 41115 34238 21526 2730 159491
1 1114 0 0 0 6 1120
2 1398 998 807 447 48 3698
3 355 33 77 149 29 643
4 11 49 128 - - 187
5 39.60 16.60 41.59 4.60 0.03 102
6 12.6 0.0 0.0 0.0 0.0 12.6
7 65470 43593 38117 21535 2827 160654
8 22:20 10:29 10:00 21:20 19:41 21:26
def get_row_labels():
    rowlabels = """Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs)
Peak Shortage (MW)
Energy Met (MU)
Hydro Gen (MU)
Wind Gen (MU)
Solar Gen (MU)
Energy Shortage (MU)
Maximum Demand Met During the Day (MW) (From NLDC SCADA)
Time Of Maximum Demand Met (From NLDC SCADA)""".split("\n")
    return rowlabels


def extract_page(filename, i):
    with open(filename, "rb") as f: # you will have to open in read-binary mode
        pdfreader = PyPDF2.PdfReader(f)
        page = pdfreader.pages[i]
        return page.extract_text()


def extract_tableA_data(pagetext):
    lines = pagetext.split("\n")
    data_of_interest = lines[:10] # header line followed by the 9 rows of table A

    headers = data_of_interest[0].split()
    data = [line.split() for line in data_of_interest[1:]]
    return headers, data


def extract_table_A(filename):
    page = extract_page(filename, 1) # table A is on the second page (index 1)
    colnames, rows = extract_tableA_data(page)
    dictrows  = [dict(zip(colnames, row)) for row in rows]
    rowlabels = get_row_labels()
    return pd.DataFrame(dictrows, index=rowlabels)
extract_table_A("data.pdf")
NR WR SR ER NER TOTAL
Demand Met during Evening Peak hrs(MW) (at 2000 hrs; from RLDCs) 59882 41115 34238 21526 2730 159491
Peak Shortage (MW) 1114 0 0 0 6 1120
Energy Met (MU) 1398 998 807 447 48 3698
Hydro Gen (MU) 355 33 77 149 29 643
Wind Gen (MU) 11 49 128 - - 187
Solar Gen (MU) 39.60 16.60 41.59 4.60 0.03 102
Energy Shortage (MU) 12.6 0.0 0.0 0.0 0.0 12.6
Maximum Demand Met During the Day (MW) (From NLDC SCADA) 65470 43593 38117 21535 2827 160654
Time Of Maximum Demand Met (From NLDC SCADA) 22:20 10:29 10:00 21:20 19:41 21:26
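Every cell in this table is still a string, because it came out of split(). A sketch of a converter that turns cells into numbers where possible, maps the "-" placeholder to None and leaves times untouched. The name to_number is made up for this example; the three rows are copied from the output above.

```python
def to_number(cell):
    """Convert a table cell to int or float; return None for '-' and leave other text alone."""
    if cell == "-":
        return None
    for convert in (int, float):
        try:
            return convert(cell)
        except ValueError:
            pass
    return cell   # e.g. times such as '22:20' stay as strings

rows = [["11", "49", "128", "-", "-", "187"],
        ["39.60", "16.60", "41.59", "4.60", "0.03", "102"],
        ["22:20", "10:29", "10:00", "21:20", "19:41", "21:26"]]

converted = [[to_number(cell) for cell in row] for row in rows]
print(converted[0])
```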

When should you write a program to automate?

  • When the problem is recurring
  • When the data you are going to process follows some definite logic
  • Do not write Python automation for problems where the way the data is stored follows no logic!

Downloading data from internet

  • direct file download!
  • try it with pd.read_html
  • check if the site provides a GET/POST API and use requests to get the data
  • general scraping: download the HTML using requests, then use BeautifulSoup to extract what you are interested in!
  • the last option is to use selenium
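The general scraping step boils down to "walk the HTML and pick out the tags you care about". A sketch of that idea using the standard library's html.parser as a stand-in for BeautifulSoup, so it runs without installing anything. The HTML snippet and the class name LinkCollector are made up for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

html = """
<ul>
  <li><a href="1-1.html">Module 1 - Day 1</a></li>
  <li><a href="1-2.html">Module 1 - Day 2</a></li>
  <li><a name="no-link">not a link</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

In a real scraper the html string would be the r.text of a requests.get call.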