Python Virtual Training For Arcesium - Module III - Day 2¶

Mar 13-17, 2023 Vikrant Patil

All notes are available online at https://notes.pipal.in/2023/arcesium_finop_jan/

Please login to https://engage.pipal.in/ and launch jupyter lab

For today create a notebook with name module3-day2

Notebook names are case sensitive, so make sure you type the name exactly as shown.

© Pipal Academy LLP

Downloading Data From Internet¶

HTTP protocol

  • GET - all information is in the URL
  • POST - some information travels in the request body instead of the URL (for example, form submissions)
  • PUT
  • DELETE
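
To make the difference between GET and POST concrete, here is a minimal sketch that uses only the standard library, so nothing is sent over the network. The URL https://example.com/search and the field names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

fields = {"q": "pandas", "page": "1"}  # hypothetical form fields

# GET: the fields are encoded into the URL itself, after the "?"
get_req = Request("https://example.com/search?" + urlencode(fields))

# POST: the URL stays clean and the same fields travel in the request body
post_req = Request("https://example.com/search",
                   data=urlencode(fields).encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

The requests examples in the rest of these notes build the same kind of request objects, and they also send them.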
In [1]:
import requests
In [2]:
!pip install requests
Requirement already satisfied: requests in /home/vikrant/usr/local/default/lib/python3.10/site-packages (2.28.1)
Requirement already satisfied: idna<4,>=2.5 in /home/vikrant/usr/local/default/lib/python3.10/site-packages (from requests) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in /home/vikrant/usr/local/default/lib/python3.10/site-packages (from requests) (2022.9.24)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/vikrant/usr/local/default/lib/python3.10/site-packages (from requests) (1.26.13)
Requirement already satisfied: charset-normalizer<3,>=2 in /home/vikrant/usr/local/default/lib/python3.10/site-packages (from requests) (2.1.1)

[notice] A new release of pip available: 22.3.1 -> 23.0.1
[notice] To update, run: pip install --upgrade pip
In [3]:
url = "https://www.python.org/"
response = requests.get(url)
In [5]:
response # the number shown here is the HTTP response status code
Out[5]:
<Response [200]>
In [7]:
print(response.text[:600]) # We access this if data is text/html
<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <!-- Google tag (gtag.js) -->
    <script async src="https://www.googletagmanager.com/gtag/js?id=G-TF35YF9CVH"></script>
    <script>
      window.dataLayer = window.dataLayer || [];
      function gtag(){dataLayer.push(arguments);}
      gtag('js'
In [8]:
with open("python.org.html", "w") as f:
    f.write(response.text)
In [9]:
url = "https://www.python.org/search/"
params = {"q": "pandas"}
r = requests.get(url, params=params)
In [10]:
r
Out[10]:
<Response [200]>
In [12]:
with open("python.org-search.html", "w") as f:
    f.write(r.text)

Download Stock Data¶

We will download stock data from alphavantage.co

In [20]:
url = 'https://www.alphavantage.co/query'
API_KEY  = "UKVFE0JLE0TBPDEF"
params = {"function":"TIME_SERIES_INTRADAY",
          "symbol":"IBM",
          "interval":"5min",
          "apikey":API_KEY}

r = requests.get(url, params=params)
data = r.json() # this will return data in python dictionaries/lists/basic datatypes
In [21]:
r
Out[21]:
<Response [200]>
In [24]:
type(data)
Out[24]:
dict
In [25]:
data.keys()
Out[25]:
dict_keys(['Meta Data', 'Time Series (5min)'])
In [26]:
data['Meta Data']
Out[26]:
{'1. Information': 'Intraday (5min) open, high, low, close prices and volume',
 '2. Symbol': 'IBM',
 '3. Last Refreshed': '2023-03-13 20:00:00',
 '4. Interval': '5min',
 '5. Output Size': 'Compact',
 '6. Time Zone': 'US/Eastern'}
In [28]:
len(data['Time Series (5min)'])
Out[28]:
100
In [31]:
import pandas as pd
import json
In [38]:
pd.read_json(json.dumps(data['Time Series (5min)'])) # this expects json string
Out[38]:
2023-03-13 20:00:00 2023-03-13 19:30:00 2023-03-13 18:55:00 2023-03-13 18:50:00 2023-03-13 18:10:00 2023-03-13 17:30:00 2023-03-13 17:00:00 2023-03-13 16:25:00 2023-03-13 16:15:00 2023-03-13 16:05:00 ... 2023-03-13 09:10:00 2023-03-13 09:05:00 2023-03-13 09:00:00 2023-03-13 08:55:00 2023-03-13 08:50:00 2023-03-13 08:20:00 2023-03-13 08:15:00 2023-03-13 08:10:00 2023-03-13 08:05:00 2023-03-13 07:15:00
1. open 125.7101 125.71 125.68 125.62 125.58 126.1694 125.61 125.58 125.23 125.58 ... 125 125.1 125.2 124.7971 125 125 125.00 125.4 125.71 125.71
2. high 125.7101 125.71 125.68 125.62 125.58 126.1694 125.61 125.58 125.23 125.58 ... 125 125.1 125.2 124.7971 125 125 125.01 125.4 125.71 125.71
3. low 125.7101 125.71 125.68 125.62 125.58 126.1694 125.61 125.58 125.23 125.58 ... 125 125.1 125.2 124.7971 125 125 125.00 125.2 125.50 125.71
4. close 125.7101 125.71 125.68 125.62 125.58 126.1694 125.61 125.58 125.23 125.58 ... 125 125.1 125.2 124.7971 125 125 125.00 125.2 125.50 125.71
5. volume 385.0000 100.00 101.00 352.00 1444.00 365.0000 496.00 2653.00 210.00 111201.00 ... 201 160.0 120.0 100.0000 1693 501 731.00 755.0 865.00 335.00

5 rows × 100 columns

What is json¶

JSON is a plain-text data format built from a few basic data types:

  • numbers (ints, floats) and strings - 1, 2, 3, 3.4, 1.2, "hello"
  • lists - '[1, 2, 3, 4, 5]'
  • dictionaries - '{"a": 4, "b": 5}'
In [33]:
json.dumps([1, 2, 3, 4]) # dump python data as json string
Out[33]:
'[1, 2, 3, 4]'
In [34]:
json.dumps(params)
Out[34]:
'{"function": "TIME_SERIES_INTRADAY", "symbol": "IBM", "interval": "5min", "apikey": "UKVFE0JLE0TBPDEF"}'
In [35]:
xjson = input("Please input a list of integers: ")
In [36]:
xjson
Out[36]:
'[1, 2, 3, 4, 5]'
In [37]:
json.loads(xjson)
Out[37]:
[1, 2, 3, 4, 5]
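
The cells above cover lists and dictionaries of numbers. As a small additional sketch, the following shows how the other JSON values map to Python types when loaded. The record itself is made up for illustration.

```python
import json

# A made-up JSON document that uses every basic JSON type
text = ('{"name": "IBM", "price": 125.71, "volume": 385, '
        '"tags": ["a", "b"], "active": true, "note": null}')

record = json.loads(text)  # JSON text -> Python data

for key, value in record.items():
    print(key, type(value).__name__, value)

# Dumping and loading again gives back an equal Python object
round_trip = json.loads(json.dumps(record))
print(round_trip == record)
```

Note that JSON true and null become Python True and None, and that JSON object keys are always strings.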

Back to alphavantage¶

In [40]:
df = pd.read_json(json.dumps(data['Time Series (5min)']))
In [43]:
daily_IBM = df.transpose()
In [44]:
daily_IBM
Out[44]:
1. open 2. high 3. low 4. close 5. volume
2023-03-13 20:00:00 125.7101 125.7101 125.7101 125.7101 385.0
2023-03-13 19:30:00 125.7100 125.7100 125.7100 125.7100 100.0
2023-03-13 18:55:00 125.6800 125.6800 125.6800 125.6800 101.0
2023-03-13 18:50:00 125.6200 125.6200 125.6200 125.6200 352.0
2023-03-13 18:10:00 125.5800 125.5800 125.5800 125.5800 1444.0
... ... ... ... ... ...
2023-03-13 08:20:00 125.0000 125.0000 125.0000 125.0000 501.0
2023-03-13 08:15:00 125.0000 125.0100 125.0000 125.0000 731.0
2023-03-13 08:10:00 125.4000 125.4000 125.2000 125.2000 755.0
2023-03-13 08:05:00 125.7100 125.7100 125.5000 125.5000 865.0
2023-03-13 07:15:00 125.7100 125.7100 125.7100 125.7100 335.0

100 rows × 5 columns
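
The round trip through json.dumps and pd.read_json followed by a transpose is one way to build this table. A minimal alternative sketch follows, assuming the time-series dictionary has the shape seen above (timestamp keys that map to dictionaries of values, possibly delivered as strings). The two sample rows are copied from the output above, and the function name to_frame is hypothetical.

```python
import pandas as pd

# Two rows in the same shape as data['Time Series (5min)']
sample = {
    "2023-03-13 08:10:00": {"1. open": "125.4000", "2. high": "125.4000",
                            "3. low": "125.2000", "4. close": "125.2000",
                            "5. volume": "755"},
    "2023-03-13 08:05:00": {"1. open": "125.7100", "2. high": "125.7100",
                            "3. low": "125.5000", "4. close": "125.5000",
                            "5. volume": "865"},
}

def to_frame(time_series):
    # orient="index" turns each timestamp into a row, so no transpose is needed
    df = pd.DataFrame.from_dict(time_series, orient="index").astype(float)
    df.index = pd.to_datetime(df.index)  # real timestamps instead of strings
    # "1. open" -> "open", "2. high" -> "high", and so on
    df.columns = [name.split(". ", 1)[1] for name in df.columns]
    return df.sort_index()  # oldest row first

frame = to_frame(sample)
print(frame)
```

Sorting by the datetime index puts the rows in chronological order, which is usually what plotting and resampling expect.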

In [45]:
r = requests.get("https://api.github.com/events")
In [46]:
r
Out[46]:
<Response [200]>
In [49]:
githubdata = r.json() # although the method is named json, it returns Python data:
                      # the site responded with JSON text, and requests parsed it
                      # into Python lists and dictionaries
In [50]:
type(githubdata)
Out[50]:
list
In [51]:
len(githubdata)
Out[51]:
30
In [52]:
githubdata[0]
Out[52]:
{'id': '27698359707',
 'type': 'PushEvent',
 'actor': {'id': 76208813,
  'login': 'NaveendraKumar',
  'display_login': 'NaveendraKumar',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/NaveendraKumar',
  'avatar_url': 'https://avatars.githubusercontent.com/u/76208813?'},
 'repo': {'id': 613300826,
  'name': 'NaveendraKumar/Qt_Git_Jenkins_Config',
  'url': 'https://api.github.com/repos/NaveendraKumar/Qt_Git_Jenkins_Config'},
 'payload': {'repository_id': 613300826,
  'push_id': 12931866936,
  'size': 1,
  'distinct_size': 1,
  'ref': 'refs/heads/master',
  'head': '6b07253a4b03413275ffffb83de6148fb77679fa',
  'before': '72e40bf859576e3279be47b0293b28c75d1eae01',
  'commits': [{'sha': '6b07253a4b03413275ffffb83de6148fb77679fa',
    'author': {'email': 'naveendrakumar37@gmail.com',
     'name': 'NaveendraKumar'},
    'message': 'New Rectangle added in project',
    'distinct': True,
    'url': 'https://api.github.com/repos/NaveendraKumar/Qt_Git_Jenkins_Config/commits/6b07253a4b03413275ffffb83de6148fb77679fa'}]},
 'public': True,
 'created_at': '2023-03-14T05:23:06Z'}

Authentication¶

Some websites need a username and password to access data.

In [53]:
%%file /tmp/pass.txt
GhjgGf&(3jd
Writing /tmp/pass.txt
In [54]:
user = "someusername"
pass_ = open("/tmp/pass.txt").read().strip()
resp = requests.get("https://api.github.com/user", auth=(user, pass_))  # use https when sending credentials
In [55]:
resp # error! 401 means the server rejected these credentials
Out[55]:
<Response [401]>
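
The number inside <Response [...]> is the HTTP status code, which requests also exposes as response.status_code. As a small sketch, the standard library can translate such codes into their official names. The helper name describe and the grouping labels are made up for illustration.

```python
from http import HTTPStatus

def describe(status_code):
    """Return a short human-readable description of an HTTP status code."""
    phrase = HTTPStatus(status_code).phrase
    if 200 <= status_code < 300:
        kind = "success"
    elif 400 <= status_code < 500:
        kind = "client error"
    elif 500 <= status_code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{status_code} {phrase} ({kind})"

for code in (200, 401, 404, 500):
    print(describe(code))
```

In a script, calling response.raise_for_status() is a common way to stop early when the status code signals an error.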

Different kinds of authentications¶

  • Kerberos

pip install requests requests-kerberos

In [ ]:
from requests_kerberos import HTTPKerberosAuth, OPTIONAL
kerberos_auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
# request_url is a placeholder for the Kerberos-protected URL you want to fetch
r = requests.get(request_url, auth=kerberos_auth)

Some more on requests.get¶

In [56]:
url = 'https://www.alphavantage.co/query'
API_KEY  = "UKVFE0JLE0TBPDEF"
params = {"function":"TIME_SERIES_INTRADAY",
          "symbol":"IBM",
          "interval":"5min",
          "apikey":API_KEY}

r = requests.get(url, params=params)
data = r.json() 
In [57]:
url = "https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol=IBM&interval=5min&apikey=UKVFE0JLE0TBPDEF"
r = requests.get(url)
data = r.json()
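
The two cells above fetch the same data: requests encodes the params dictionary into the query string that the second cell spells out by hand. A minimal sketch of that encoding with the standard library follows. It mirrors what happens for simple string values only, and YOUR_API_KEY is a placeholder.

```python
from urllib.parse import urlencode, urlsplit, parse_qsl

base = "https://www.alphavantage.co/query"
params = {"function": "TIME_SERIES_INTRADAY",
          "symbol": "IBM",
          "interval": "5min",
          "apikey": "YOUR_API_KEY"}  # placeholder, not a real key

# Dictionary -> query string -> full URL
full_url = f"{base}?{urlencode(params)}"
print(full_url)

# Going the other way: split a full URL back into its parameters
recovered = dict(parse_qsl(urlsplit(full_url).query))
print(recovered == params)
```

Passing a dictionary is usually easier to read and change than a long hand-written URL, and special characters in the values get escaped for you.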
In [62]:
def get_api_key():
    return "UKVFE0JLE0TBPDEF"

def get_daily_time_series(ticker, interval="5min"):
    url = 'https://www.alphavantage.co/query'
    API_KEY  = get_api_key()
    params = {"function":"TIME_SERIES_INTRADAY",
              "symbol":ticker,
              "interval":interval,
              "apikey":API_KEY}

    r = requests.get(url, params=params)
    data = r.json()
    return pd.DataFrame(data[f'Time Series ({interval})']).transpose()
In [66]:
get_daily_time_series("APLE", "30min")
Out[66]:
1. open 2. high 3. low 4. close 5. volume
2023-03-13 16:30:00 15.3300 15.3300 15.3300 15.3300 22111
2023-03-13 16:00:00 15.3400 15.3550 15.3025 15.3200 482534
2023-03-13 15:30:00 15.4200 15.4450 15.2800 15.3350 421490
2023-03-13 15:00:00 15.4500 15.5000 15.4000 15.4300 84357
2023-03-13 14:30:00 15.4200 15.4800 15.3600 15.4500 201028
... ... ... ... ... ...
2023-03-03 15:00:00 17.0800 17.1000 17.0150 17.0200 82605
2023-03-03 14:30:00 17.0600 17.0800 17.0400 17.0750 58178
2023-03-03 14:00:00 17.0400 17.0700 17.0350 17.0600 35509
2023-03-03 13:30:00 17.0150 17.0600 16.9800 17.0350 59608
2023-03-03 13:00:00 16.9850 17.0300 16.9800 17.0150 35553

100 rows × 5 columns
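
The function above indexes the response with data[f'Time Series ({interval})'], which fails with a bare KeyError if the service answers with something else, for example after a mistyped ticker. A minimal sketch of a friendlier check follows, assuming only that such a response lacks the time-series key. The helper name extract_time_series and the error payload are invented for illustration.

```python
def extract_time_series(payload, interval="5min"):
    """Return the time-series part of a response, or raise a clear error."""
    key = f"Time Series ({interval})"
    if key not in payload:
        found = ", ".join(sorted(payload)) or "no keys"
        raise ValueError(f"{key!r} not in response; found: {found}")
    return payload[key]

# A trimmed-down response in the shape shown earlier in these notes
good = {"Meta Data": {},
        "Time Series (5min)": {"2023-03-13 20:00:00": {"4. close": "125.7101"}}}

# An invented payload standing in for whatever the service sends on failure
bad = {"Message": "made-up error text"}

print(len(extract_time_series(good)))
try:
    extract_time_series(bad)
except ValueError as err:
    print(err)
```

The error message lists the keys that did arrive, which is usually enough to see what the service was trying to say.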

Post¶

In [67]:
r = requests.post("http://httpbin.org/post", data={"value1":"A", "value2":"B"})
r.json()
Out[67]:
{'args': {},
 'data': '',
 'files': {},
 'form': {'value1': 'A', 'value2': 'B'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Content-Length': '17',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.28.1',
  'X-Amzn-Trace-Id': 'Root=1-64100b61-1c10c28f6694520a5f66b50d'},
 'json': None,
 'origin': '157.33.224.254',
 'url': 'http://httpbin.org/post'}

Downloading a file¶

In [68]:
def download(url, filename):
    resp = requests.get(url)
    with open(filename, "wb") as f:
        f.write(resp.content)
In [69]:
download("https://www.python.org/ftp/python/3.10.10/Python-3.10.10.tgz", "Python-source.tgz")
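
The download function above holds the whole file in memory before writing it. For large files, a common alternative is to write the data chunk by chunk. A minimal sketch of the writing half follows; the helper name save_chunks is hypothetical, and the demonstration uses made-up chunks so that it runs without a network connection.

```python
import os
import tempfile

def save_chunks(chunks, filename):
    """Write an iterable of bytes chunks to filename; return bytes written."""
    written = 0
    with open(filename, "wb") as f:
        for chunk in chunks:
            written += f.write(chunk)
    return written

# With requests, the chunks could come from a streamed response, for example:
#     with requests.get(url, stream=True) as resp:
#         save_chunks(resp.iter_content(chunk_size=8192), filename)

# Offline demonstration with made-up chunks
path = os.path.join(tempfile.gettempdir(), "save_chunks_demo.bin")
size = save_chunks([b"abc", b"defg"], path)
print(size, os.path.getsize(path))
```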

Problem

  • Write a function download_notes which downloads your training notes. It takes the module name and the day as parameters.
>>> download_notes("module1", "day1")
downloaded .. module1-day1.html
In [71]:
def download_all_notes():
    for m in range(1, 4):
        for d in range(1, 6):
            download_notes(f"module{m}",f"day{d}")

def download_notes(module, day):
    filename = f"{module}-{day}.html"
    url = f"https://notes.pipal.in/2023/arcesium_finop_jan/{filename}"  # f-string, so {filename} is filled in
    download(url, filename)
    
In [72]:
download_notes("module1", "day5")
In [73]:
!ls module1-day5.html
module1-day5.html

Download using Selenium¶

  • create a virtualenv

python3 -m venv selenium

  • download geckodriver and extract it into the selenium/bin/ folder on Linux/Mac, or into selenium\Scripts on Windows

  • activate the virtual env

source selenium/bin/activate

for Windows

selenium\Scripts\activate.bat

  • install selenium

pip install selenium
In [76]:
%%file download_with_selenium.py
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

driver = webdriver.Firefox() # launch a browser

driver.get("http://www.python.org") # go to this url

assert "Python" in driver.title # make sure the site is loaded
time.sleep(5)

elem = driver.find_element(By.NAME, "q")

elem.clear()
elem.send_keys("pandas")
elem.send_keys(Keys.RETURN)
time.sleep(10)
assert "No results found." not in driver.page_source
driver.close()
Overwriting download_with_selenium.py

Things to remember¶

  • If API documentation is given, use requests to download the data
  • Otherwise, try directly with pandas: pd.read_html
  • Next, try requests and extract the data from the HTML
  • The last option is to download using selenium (when sites are dynamic)
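
As an illustration of the third option in the list above, here is a minimal sketch that pulls all link targets out of a piece of HTML with the standard library's html.parser. The HTML snippet is made up; in practice the text would come from response.text. The class name LinkCollector is hypothetical, and other HTML parsing libraries would work as well.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Made-up HTML standing in for a downloaded page
sample_html = """
<html><body>
  <a href="module1-day1.html">Module 1, Day 1</a>
  <a href="module1-day2.html">Module 1, Day 2</a>
  <p>No link here</p>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```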