Python Virtual Training For Arcesium - Module III - Day 3

Aug 19-25, 2022 Vikrant Patil

All notes are available online at https://notes.pipal.in/2022/arcesium_finop_batch1/

Please accept the invitation that you received by email and log in at

https://engage.pipal.in/

Then log in to the lab and create today's notebook, module3-day3.

© Pipal Academy LLP

Downloading data from the internet

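The four HTTP methods used most often are

  • get - fetch data; every parameter is visible in the url
  • post - send data to the server; some of it may be in the url, the rest travels in the request body and is not shown in the url
  • put - create or replace the resource at the given url
  • delete - remove the resource at the given url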

If I search for some text in a search engine, this is what the url looks like:

https://duckduckgo.com/?t=ffab&q=lets+search+for+something
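
The part of that url after ? is the query string: key=value pairs joined by &. As a small sketch using only the standard library (urllib.parse is not used elsewhere in these notes), the url can be taken apart and put back together like this:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "https://duckduckgo.com/?t=ffab&q=lets+search+for+something"

# Split the url into its components and decode the query string.
parts = urlsplit(url)
query = parse_qs(parts.query)  # each value is a list; '+' decodes to a space
print(parts.netloc)
print(query)

# Going the other way: encode a dict of parameters into a query string.
rebuilt = f"{parts.scheme}://{parts.netloc}{parts.path}?" + urlencode(
    {"t": "ffab", "q": "lets search for something"}
)
print(rebuilt == url)
```

This encoding step is what requests does with the params= argument of requests.get.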

Example: Alpha Vantage

In [2]:
import requests
alphavantageurl = "https://www.alphavantage.co/query"
API_KEY = "UKVFE0JLE0TBPDEF"

params = {
    "function":"TIME_SERIES_INTRADAY",
    "symbol":"AAPL",
    "interval":"15min",
    "apikey":API_KEY
}

resp = requests.get(alphavantageurl, params=params)
In [3]:
resp.status_code
Out[3]:
200
In [4]:
data = resp.json()
In [5]:
type(data)
Out[5]:
dict
In [6]:
data.keys()
Out[6]:
dict_keys(['Meta Data', 'Time Series (15min)'])
In [7]:
data['Meta Data']
Out[7]:
{'1. Information': 'Intraday (15min) open, high, low, close prices and volume',
 '2. Symbol': 'AAPL',
 '3. Last Refreshed': '2022-08-22 20:00:00',
 '4. Interval': '15min',
 '5. Output Size': 'Compact',
 '6. Time Zone': 'US/Eastern'}
In [8]:
type(data['Time Series (15min)'])
Out[8]:
dict
In [9]:
len(data['Time Series (15min)'])
Out[9]:
100
In [10]:
data['Time Series (15min)'].keys()
Out[10]:
dict_keys(['2022-08-22 20:00:00', '2022-08-22 19:45:00', '2022-08-22 19:30:00', '2022-08-22 19:15:00', '2022-08-22 19:00:00', '2022-08-22 18:45:00', '2022-08-22 18:30:00', '2022-08-22 18:15:00', '2022-08-22 18:00:00', '2022-08-22 17:45:00', '2022-08-22 17:30:00', '2022-08-22 17:15:00', '2022-08-22 17:00:00', '2022-08-22 16:45:00', '2022-08-22 16:30:00', '2022-08-22 16:15:00', '2022-08-22 16:00:00', '2022-08-22 15:45:00', '2022-08-22 15:30:00', '2022-08-22 15:15:00', '2022-08-22 15:00:00', '2022-08-22 14:45:00', '2022-08-22 14:30:00', '2022-08-22 14:15:00', '2022-08-22 14:00:00', '2022-08-22 13:45:00', '2022-08-22 13:30:00', '2022-08-22 13:15:00', '2022-08-22 13:00:00', '2022-08-22 12:45:00', '2022-08-22 12:30:00', '2022-08-22 12:15:00', '2022-08-22 12:00:00', '2022-08-22 11:45:00', '2022-08-22 11:30:00', '2022-08-22 11:15:00', '2022-08-22 11:00:00', '2022-08-22 10:45:00', '2022-08-22 10:30:00', '2022-08-22 10:15:00', '2022-08-22 10:00:00', '2022-08-22 09:45:00', '2022-08-22 09:30:00', '2022-08-22 09:15:00', '2022-08-22 09:00:00', '2022-08-22 08:45:00', '2022-08-22 08:30:00', '2022-08-22 08:15:00', '2022-08-22 08:00:00', '2022-08-22 07:45:00', '2022-08-22 07:30:00', '2022-08-22 07:15:00', '2022-08-22 07:00:00', '2022-08-22 06:45:00', '2022-08-22 06:30:00', '2022-08-22 06:15:00', '2022-08-22 06:00:00', '2022-08-22 05:45:00', '2022-08-22 05:30:00', '2022-08-22 05:15:00', '2022-08-22 05:00:00', '2022-08-22 04:45:00', '2022-08-22 04:30:00', '2022-08-22 04:15:00', '2022-08-19 20:00:00', '2022-08-19 19:45:00', '2022-08-19 19:30:00', '2022-08-19 19:15:00', '2022-08-19 19:00:00', '2022-08-19 18:45:00', '2022-08-19 18:30:00', '2022-08-19 18:15:00', '2022-08-19 18:00:00', '2022-08-19 17:45:00', '2022-08-19 17:30:00', '2022-08-19 17:15:00', '2022-08-19 17:00:00', '2022-08-19 16:45:00', '2022-08-19 16:30:00', '2022-08-19 16:15:00', '2022-08-19 16:00:00', '2022-08-19 15:45:00', '2022-08-19 15:30:00', '2022-08-19 15:15:00', '2022-08-19 15:00:00', '2022-08-19 14:45:00', 
'2022-08-19 14:30:00', '2022-08-19 14:15:00', '2022-08-19 14:00:00', '2022-08-19 13:45:00', '2022-08-19 13:30:00', '2022-08-19 13:15:00', '2022-08-19 13:00:00', '2022-08-19 12:45:00', '2022-08-19 12:30:00', '2022-08-19 12:15:00', '2022-08-19 12:00:00', '2022-08-19 11:45:00', '2022-08-19 11:30:00', '2022-08-19 11:15:00'])
In [11]:
data['Time Series (15min)']['2022-08-22 20:00:00']
Out[11]:
{'1. open': '167.8800',
 '2. high': '167.9900',
 '3. low': '167.8600',
 '4. close': '167.9800',
 '5. volume': '20165'}
In [12]:
import pandas as pd
In [13]:
pd.DataFrame(data['Time Series (15min)'])
Out[13]:
           2022-08-22 20:00:00  2022-08-22 19:45:00  2022-08-22 19:30:00  2022-08-22 19:15:00  2022-08-22 19:00:00  2022-08-22 18:45:00  2022-08-22 18:30:00  2022-08-22 18:15:00  2022-08-22 18:00:00  2022-08-22 17:45:00  ...  2022-08-19 13:30:00  2022-08-19 13:15:00  2022-08-19 13:00:00  2022-08-19 12:45:00  2022-08-19 12:30:00  2022-08-19 12:15:00  2022-08-19 12:00:00  2022-08-19 11:45:00  2022-08-19 11:30:00  2022-08-19 11:15:00
1. open               167.8800             167.8900             167.8500             167.7500             167.7600             167.6700             167.7000             167.7700             167.7300             167.7500  ...             172.4200             172.7400             172.6500             172.1600             172.1374             172.1600             172.1100             172.0100             171.8719             171.8350
2. high               167.9900             167.9100             167.8800             167.8500             167.8100             167.7700             167.7000             167.7800             167.7692             167.7500  ...             172.4800             173.0300             172.9400             172.6700             172.3500             172.1800             172.4480             172.2500             172.1600             172.1600
3. low                167.8600             167.8500             167.8200             167.7500             167.7500             167.6500             167.6000             167.6800             167.7300             167.5700  ...             172.0150             172.3000             172.5000             172.0600             172.0700             171.8600             172.0800             171.8600             171.7850             171.6400
4. close              167.9800             167.8800             167.8600             167.8100             167.8100             167.7500             167.6700             167.6800             167.7692             167.7300  ...             172.0200             172.4250             172.7400             172.6600             172.1600             172.1312             172.1700             172.1100             172.0050             171.8750
5. volume                20165                 8805                 7143                 4649                 6759                13170                 5764                14911                 7538                 5932  ...              1242545              1804845              1834057              1452497              1160098              1378402              1701756              1714674              1805881              2337331

5 rows × 100 columns

In [14]:
time_series_data = pd.DataFrame(data['Time Series (15min)']).transpose()
In [15]:
time_series_data
Out[15]:
                       1. open    2. high     3. low   4. close  5. volume
2022-08-22 20:00:00   167.8800   167.9900   167.8600   167.9800      20165
2022-08-22 19:45:00   167.8900   167.9100   167.8500   167.8800       8805
2022-08-22 19:30:00   167.8500   167.8800   167.8200   167.8600       7143
2022-08-22 19:15:00   167.7500   167.8500   167.7500   167.8100       4649
2022-08-22 19:00:00   167.7600   167.8100   167.7500   167.8100       6759
...                        ...        ...        ...        ...        ...
2022-08-19 12:15:00   172.1600   172.1800   171.8600   172.1312    1378402
2022-08-19 12:00:00   172.1100   172.4480   172.0800   172.1700    1701756
2022-08-19 11:45:00   172.0100   172.2500   171.8600   172.1100    1714674
2022-08-19 11:30:00   171.8719   172.1600   171.7850   172.0050    1805881
2022-08-19 11:15:00   171.8350   172.1600   171.6400   171.8750    2337331

100 rows × 5 columns
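
Every value in this table is still a string, and the index holds the timestamps as text. A possible clean-up, sketched on two rows copied from the output above (the name sample and the shortened column names are choices made for this example, not something the API provides):

```python
import pandas as pd

sample = {
    "2022-08-22 20:00:00": {"1. open": "167.8800", "2. high": "167.9900",
                            "3. low": "167.8600", "4. close": "167.9800",
                            "5. volume": "20165"},
    "2022-08-22 19:45:00": {"1. open": "167.8900", "2. high": "167.9100",
                            "3. low": "167.8500", "4. close": "167.8800",
                            "5. volume": "8805"},
}

# orient="index" makes each timestamp a row, so no transpose is needed.
df = pd.DataFrame.from_dict(sample, orient="index")

# "1. open" -> "open", "5. volume" -> "volume", and so on.
df.columns = [name.split(". ", 1)[1] for name in df.columns]

# Convert text to numbers and the index to real timestamps, oldest first.
df = df.apply(pd.to_numeric)
df.index = pd.to_datetime(df.index)
df = df.sort_index()

print(df.dtypes)
print(df)
```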

problem

  • Download a csv file for the above data. In general, write a function to download csv data for TIME_SERIES_INTRADAY for any ticker
def download_alpha(ticker, interval, filename):
    pass

download_alpha("AAPL", "15min", "apple_15min.csv")
In [18]:
def download_alpha(ticker, interval, filename):
    alphavantageurl = "https://www.alphavantage.co/query"
    API_KEY = "UKVFE0JLE0TBPDEF"
    
    params = {"function":"TIME_SERIES_INTRADAY",
             "symbol":ticker,
             "apikey":API_KEY,
             "datatype": "csv",
             "interval":interval}
    
    resp = requests.get(alphavantageurl, params=params)
    if resp.status_code == 200:
        with open(filename, "w") as f:
            f.write(resp.text)
    else:
        raise Exception("Data download failed") # instead of printing, raise an exception
In [17]:
45 + "r"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [17], in <cell line: 1>()
----> 1 45 + "r"

TypeError: unsupported operand type(s) for +: 'int' and 'str'
In [19]:
download_alpha("AAPL", "15min","apple_15min.csv")
In [20]:
!head apple_15min.csv
timestamp,open,high,low,close,volume
2022-08-22 20:00:00,167.8800,167.9900,167.8600,167.9800,20165
2022-08-22 19:45:00,167.8900,167.9100,167.8500,167.8800,8805
2022-08-22 19:30:00,167.8500,167.8800,167.8200,167.8600,7143
2022-08-22 19:15:00,167.7500,167.8500,167.7500,167.8100,4649
2022-08-22 19:00:00,167.7600,167.8100,167.7500,167.8100,6759
2022-08-22 18:45:00,167.6700,167.7700,167.6500,167.7500,13170
2022-08-22 18:30:00,167.7000,167.7000,167.6000,167.6700,5764
2022-08-22 18:15:00,167.7700,167.7800,167.6800,167.6800,14911
2022-08-22 18:00:00,167.7300,167.7692,167.7300,167.7692,7538
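
Two things in download_alpha may be worth hardening. The key is written into the source, and a 200 status does not by itself prove that the body is csv: Alpha Vantage appears to report problems such as an unknown symbol or an exhausted rate limit as a JSON message that still comes with status 200. The sketch below is one way to handle both. The environment variable name ALPHAVANTAGE_API_KEY and the two helper names are assumptions made for this example.

```python
import os

# Header line seen in the csv files downloaded above.
EXPECTED_HEADER = "timestamp,open,high,low,close,volume"

def looks_like_intraday_csv(text):
    """Return True if the first line of text is the expected csv header."""
    if not text.strip():
        return False
    return text.splitlines()[0].strip() == EXPECTED_HEADER

def get_api_key(env_var="ALPHAVANTAGE_API_KEY"):
    """Read the API key from the environment instead of the source code."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"set the {env_var} environment variable first")
    return key

good = "timestamp,open,high,low,close,volume\n2022-08-22 20:00:00,167.88,167.99,167.86,167.98,20165\n"
bad = '{"Error Message": "made-up error text for this example"}'
print(looks_like_intraday_csv(good), looks_like_intraday_csv(bad))
```

Inside download_alpha, the check would go just before the file is written: raise an exception when looks_like_intraday_csv(resp.text) is False.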
In [21]:
%%file download_data.py
import requests
import typer

def download_alpha(ticker:str, interval:str, filename:str):
    alphavantageurl = "https://www.alphavantage.co/query"
    API_KEY = "UKVFE0JLE0TBPDEF"
    
    params = {"function":"TIME_SERIES_INTRADAY",
             "symbol":ticker,
             "apikey":API_KEY,
             "datatype": "csv",
             "interval":interval}
    
    resp = requests.get(alphavantageurl, params=params)
    if resp.status_code == 200:
        with open(filename, "w") as f:
            f.write(resp.text)
    else:
        raise Exception("Data download failed") # instead of printing, raise an exception
        
if __name__ == "__main__":
    typer.run(download_alpha)
Writing download_data.py
In [22]:
!python download_data.py --help
/home/vikrant/usr/local/jupyter-py3.10/lib/python3.10/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
Usage: download_data.py [OPTIONS] TICKER INTERVAL FILENAME

Arguments:
  TICKER    [required]
  INTERVAL  [required]
  FILENAME  [required]

Options:
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.
In [23]:
!python download_data.py "IBM" "5min" "ibm_5min.csv"
/home/vikrant/usr/local/jupyter-py3.10/lib/python3.10/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
In [24]:
!head ibm_5min.csv
timestamp,open,high,low,close,volume
2022-08-22 20:00:00,135.7200,135.7200,135.7200,135.7200,159
2022-08-22 19:55:00,135.7300,135.7300,135.7300,135.7300,100
2022-08-22 18:15:00,135.9500,136.0500,135.9400,136.0500,925
2022-08-22 17:55:00,135.8600,135.9000,135.8600,135.9000,1101
2022-08-22 17:35:00,135.8000,135.8000,135.8000,135.8000,500
2022-08-22 16:40:00,135.5570,135.5570,135.5500,135.5500,662
2022-08-22 16:30:00,135.5500,135.5500,135.5500,135.5500,245
2022-08-22 16:25:00,135.7500,135.7500,135.7500,135.7500,101
2022-08-22 16:20:00,135.5500,135.5500,135.5500,135.5500,11368
In [26]:
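def download_big(url, filename, chunksize=1024):
    # stream=True reads the body chunk by chunk instead of loading it all into memory
    resp = requests.get(url, stream=True)
    with open(filename, "wb") as f:
        for chunk in resp.iter_content(chunk_size=chunksize):
            f.write(chunk)
            print(".", end="")  # one dot per chunk as a progress indicator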
In [27]:
excelurl = "https://raw.githubusercontent.com/vikipedia/python-trainings/master/online_course/source/module2/wallet.xlsx"

download_big(excelurl, "excel_data.xlsx")
...........
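
The loop inside download_big does not depend on the network: it writes whatever chunks it is given. A minimal sketch that moves the loop into a helper (write_chunks is a name invented here) so it can be tried on in-memory data:

```python
import os
import tempfile

def write_chunks(chunks, filename):
    """Write an iterable of bytes objects to filename; return the bytes written."""
    total = 0
    with open(filename, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total

# Simulate a 2500-byte download arriving in chunks of at most 1024 bytes.
payload = b"x" * 2500
chunks = [payload[i:i + 1024] for i in range(0, len(payload), 1024)]

target = os.path.join(tempfile.mkdtemp(), "demo.bin")
written = write_chunks(chunks, target)
print(len(chunks), written)
```

With requests, the chunks would come from requests.get(url, stream=True).iter_content(chunk_size=1024).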

Authentication

In [28]:
user = "vikipedia"
pass_ = open("/tmp/pass.txt").read().strip()
resp = requests.get("http://api.github.com/user", auth=(user, pass_)) # simple user/password authentication
In [29]:
resp.status_code
Out[29]:
401
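
The 401 above means the server rejected the credentials; GitHub no longer accepts account passwords for API requests and expects a token instead. What auth=(user, pass_) does is HTTP Basic authentication: it adds an Authorization header that contains user:password encoded as base64. A sketch with made-up credentials:

```python
import base64

def basic_auth_header(user, password):
    """Build the value of the Authorization header for HTTP Basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

header = basic_auth_header("someuser", "not-a-real-password")  # made-up values
print(header)

# base64 is an encoding, not encryption: anyone can reverse it.
decoded = base64.b64decode(header.split(" ", 1)[1]).decode("utf-8")
print(decoded)
```

Because the header is so easy to decode, basic authentication should only be sent over https.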

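In your case (Kerberos authentication)

pip install requests requests-kerberos

from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# request_url and params stand for your own endpoint and query parameters
kerberos_auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
response = requests.get(request_url, auth=kerberos_auth, params=params)
response.json()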

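General rules for downloading

  • API - most authentic way of downloading data
  • pd.read_csv - when the url points to a csv file
  • pd.read_excel - when the url points to an excel file
  • pd.read_html() - try this first when the data is a table on a web page
  • requests.get - for everything else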
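
As an illustration of the pd.read_csv route: it accepts a url, a file name, or any file-like object. The sketch below feeds it the first lines of apple_15min.csv through io.StringIO so that it runs without a network connection; parse_dates and index_col are optional choices made here.

```python
import io

import pandas as pd

csv_text = """timestamp,open,high,low,close,volume
2022-08-22 20:00:00,167.8800,167.9900,167.8600,167.9800,20165
2022-08-22 19:45:00,167.8900,167.9100,167.8500,167.8800,8805
"""

# parse_dates turns the text timestamps into datetime values;
# index_col makes them the row labels.
prices = pd.read_csv(io.StringIO(csv_text),
                     parse_dates=["timestamp"],
                     index_col="timestamp")
print(prices.dtypes)
print(prices["close"].max())
```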

General scraping

In [33]:
url = "http://www.thehindu.com"
resp = requests.get(url, params={"service":"rss"})
In [34]:
resp.status_code
Out[34]:
200
In [35]:
xmltext = resp.text
In [36]:
print(xmltext[:1000])
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title> The Hindu - Home </title>
    <link> https://www.thehindu.com/ </link>
    <description> RSS Feed </description>
    <language>en-us</language>
    <copyright>Copyright 2022 The Hindu</copyright>    
<item>
    <title>
        <![CDATA[Rupee falls 4 paise to 79.88 against U.S. dollar in early trade ]]>
    </title>
    <author>
        <![CDATA[PTI  ]]>
    </author>
    <category>
        <![CDATA[Markets]]>
    </category>
    <link>https://www.thehindu.com/business/markets/rupee-falls-4-paise-to-7988-against-us-dollar-in-early-trade/article65800436.ece
    </link>
    <description>
        <![CDATA[The rupee opened at 79.85 against the dollar, then fell to 79.88, registering a decline of 4 paise over the last close]]>
    </description>
    <pubDate>
        <![CDATA[Tue, 23 Aug 2022 11:25:15 +0530]]>
    </pubDate>
</item>
    
<item>
    <title>
        <![CDATA[BJP MLA T. Raja Singh arrested by Hyd
In [37]:
from xml.etree import ElementTree as et

root = et.fromstring(xmltext)
items = root.findall(".//item")
In [38]:
print(et.tostring(items[0]).decode())
<item>
    <title>
        Rupee falls 4 paise to 79.88 against U.S. dollar in early trade 
    </title>
    <author>
        PTI  
    </author>
    <category>
        Markets
    </category>
    <link>https://www.thehindu.com/business/markets/rupee-falls-4-paise-to-7988-against-us-dollar-in-early-trade/article65800436.ece
    </link>
    <description>
        The rupee opened at 79.85 against the dollar, then fell to 79.88, registering a decline of 4 paise over the last close
    </description>
    <pubDate>
        Tue, 23 Aug 2022 11:25:15 +0530
    </pubDate>
</item>
    

In [41]:
for item in items[:5]:
    print(item.findtext("title").strip())
    print(item.findtext("link").strip())
    print(item.findtext("author").strip())
    print("="*25)
Rupee falls 4 paise to 79.88 against U.S. dollar in early trade
https://www.thehindu.com/business/markets/rupee-falls-4-paise-to-7988-against-us-dollar-in-early-trade/article65800436.ece
PTI
=========================
BJP MLA T. Raja Singh arrested by Hyderabad police
https://www.thehindu.com/news/national/telangana/bjp-mla-t-raja-singh-arrested-by-hyderabad-police/article65800422.ece
B.Pradeep
=========================
BSF recovers cache of arms near Indo-Pak border in Punjab
https://www.thehindu.com/news/national/bsf-recovers-cache-of-arms-near-indo-pak-border-in-punjab/article65800420.ece
PTI
=========================
Top news developments in Karnataka on August 23, 2022
https://www.thehindu.com/news/national/karnataka/top-news-developments-in-karnataka-on-august-23-2022/article65800284.ece
Karnataka Bureau
=========================
Google Doodle pays tribute to Indian physicist and meteorologist Anna Mani
https://www.thehindu.com/sci-tech/science/google-doodle-pays-tribute-to-indian-physicist-and-meteorologist-anna-mani/article65800385.ece
The Hindu Bureau
=========================
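
The same findall and findtext calls can collect the items into a list of dictionaries, which is easier to hand over to pandas or to a csv writer. A sketch on a shortened copy of the feed (the helper name parse_items is made up for this example):

```python
from xml.etree import ElementTree as et

rss = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title> The Hindu - Home </title>
    <item>
      <title><![CDATA[Rupee falls 4 paise to 79.88 against U.S. dollar in early trade ]]></title>
      <author><![CDATA[PTI  ]]></author>
      <category><![CDATA[Markets]]></category>
    </item>
    <item>
      <title><![CDATA[BSF recovers cache of arms near Indo-Pak border in Punjab]]></title>
      <author><![CDATA[PTI]]></author>
    </item>
  </channel>
</rss>"""

def parse_items(xmltext, fields=("title", "author", "category")):
    """Return one dict per <item>, with surrounding whitespace stripped."""
    root = et.fromstring(xmltext)
    rows = []
    for item in root.findall(".//item"):
        # findtext returns the default when the tag is missing
        rows.append({f: item.findtext(f, default="").strip() for f in fields})
    return rows

rows = parse_items(rss)
for row in rows:
    print(row)
```

Passing default="" avoids the AttributeError that .strip() would raise on None when an item lacks one of the tags.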

Post url

In [42]:
resp = requests.post("https://httpbin.org/post", data={"input1":"x","input2":"y"})
In [43]:
resp.json()
Out[43]:
{'args': {},
 'data': '',
 'files': {},
 'form': {'input1': 'x', 'input2': 'y'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Content-Length': '17',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.27.1',
  'X-Amzn-Trace-Id': 'Root=1-630472f3-07d1a0306fd233143683ee3e'},
 'json': None,
 'origin': '152.57.196.198',
 'url': 'https://httpbin.org/post'}
In [45]:
resp = requests.get("https://httpbin.org/get", params={"param1":"x","param2":"y"})
In [46]:
resp.json()
Out[46]:
{'args': {'param1': 'x', 'param2': 'y'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.27.1',
  'X-Amzn-Trace-Id': 'Root=1-63047348-688d409a2f7d0c9a4afedc60'},
 'origin': '152.57.196.198',
 'url': 'https://httpbin.org/get?param1=x&param2=y'}
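
The difference between the two calls can also be seen without sending anything. The sketch below uses requests.Request(...).prepare(), which builds the request exactly as it would be sent: for get the parameters end up in the url, for post with data= they end up in the body.

```python
import requests

get_req = requests.Request("GET", "https://httpbin.org/get",
                           params={"param1": "x", "param2": "y"}).prepare()
post_req = requests.Request("POST", "https://httpbin.org/post",
                            data={"input1": "x", "input2": "y"}).prepare()

# GET: parameters are part of the url, there is no body.
print(get_req.url, get_req.body)

# POST: the url is unchanged, the form data travels in the body.
print(post_req.url, post_req.body)
print(post_req.headers["Content-Type"])
```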

Selenium

Selenium lets you launch a browser from a python program and interact with a website the way a user would, for example by clicking and typing.

  1. Install selenium package in your virtual environment
   pip install selenium
  2. Install the selenium driver (i.e. place the geckodriver executable in ENVFOLDER/Scripts/ for windows, ENVFOLDER/bin/ for linux)

more detailed documentation of selenium

In [47]:
%%file search_python_docs.py
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.python.org")
elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("python docs")
elem.send_keys(Keys.RETURN)
driver.close()
Writing search_python_docs.py

Regular expressions

  • str methods can search for fixed text; regular expressions can search for patterns
In [48]:
import re 
In [55]:
pattern = re.compile(r"^$") # ^ means start of line/text, $ means end of line/text
linewith_single_char = re.compile(r"^.$") # . means any char!
datelikepattern = re.compile(r"^\d{4,4}-\d{2,2}-\d{2,2}") # \d means digit, {m,n} means min m and max n times
In [118]:
lines = """line1
1
2
some dsjhdkjs kdjhfkds

2021-10-25
sadk ksjdksa

kjs"""
In [53]:
for line in lines.split("\n"):
    if pattern.match(line):
        print("found empty line")
found empty line
found empty line
In [54]:
for line in lines.split("\n"):
    if datelikepattern.match(line):
        print(line)
2021-10-25
In [56]:
for line in lines.split("\n"):
    if linewith_single_char.match(line):
        print(line)
1
2
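
The loops above split the text and call match on each line. match only looks at the start of the string it is given, while search looks anywhere in it. With the re.MULTILINE flag, ^ and $ also match at every line boundary, so findall can scan the whole text in one call. A small sketch on the same lines text:

```python
import re

text = """line1
1
2
some dsjhdkjs kdjhfkds

2021-10-25
sadk ksjdksa

kjs"""

datelike = re.compile(r"\d{4}-\d{2}-\d{2}")

# match fails because the date is not at the start of the string.
print(datelike.match("on 2021-10-25"))
# search finds the date anywhere in the string.
print(datelike.search("on 2021-10-25").group())

# With MULTILINE, ^ and $ anchor at each line, not only at the ends of the text.
single_chars = re.findall(r"^.$", text, flags=re.MULTILINE)
dates = re.findall(r"^\d{4}-\d{2}-\d{2}$", text, flags=re.MULTILINE)
print(single_chars, dates)
```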
In [57]:
lines = """line1
1
2
some dsjhdkjs kdjhfkds

2021-10-25
sadk ksjdksa

kjs,


Total = (2323.45)
"""
In [63]:
search_total = re.compile(r"Total += *\(\d+\.?\d*\)")
In [67]:
searchitem = None
for line in lines.split("\n"):
    if search_total.match(line):
        searchitem = line
In [71]:
searchitem.split("=")[1].strip().replace("(","").replace(")","")
Out[71]:
'2323.45'
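
An alternative to the split/replace chain is to let the pattern capture the number itself with a group, i.e. a part of the pattern enclosed in unescaped parentheses. A sketch of that variant (the pattern below is rewritten for this example and is not the one used above):

```python
import re

lines = """kjs,


Total = (2323.45)
"""

# \( and \) match literal brackets; the unescaped (...) between them is the capture group.
total_pattern = re.compile(r"^Total\s*=\s*\((\d+\.?\d*)\)", flags=re.MULTILINE)

m = total_pattern.search(lines)
total = float(m.group(1)) if m else None
print(total)
```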

problem

  • From a csv file "wallet.csv", find the rows whose timestamp is between 14:00 and 15:00

csvurl = "https://raw.githubusercontent.com/vikipedia/python-trainings/master/online_course/source/module2/wallet.csv"

In [101]:
!cat wallet.csv
,date,category,description,debit
0,2021-03-07 14:53:28.377359,Music,Amazon,421.2073272347991
1,2020-10-08 09:53:28.377359,Food,Swiggy,328.4400802428426
2,2021-02-23 09:53:28.377359,Books,Amazon,244.67943701511354
3,2020-11-01 14:53:28.377359,Utility,Phone,222.7563175805277
4,2021-06-05 13:53:28.377359,Books,Flipcart,494.1284923793595
5,2021-07-28 19:53:28.377359,Utility,Electricity,219.94171130968408
6,2021-04-16 11:53:28.377359,Books,Amazon Kindle,270.32259514795845
7,2021-02-15 10:53:28.377359,Food,Zomato,457.1831036346536
8,2021-08-10 19:53:28.377359,Utility,Phone,151.49637259947792
9,2020-11-29 14:53:28.377359,Travel,Auto,443.61888423247854
10,2021-06-15 13:53:28.377359,Travel,Metro,328.1754210974373
11,2021-07-24 13:53:28.377359,Food,Zomato,434.4954675355444
12,2021-07-24 14:53:28.377359,Music,Amazon,329.5360031897569
13,2021-06-06 10:53:28.377359,Utility,Phone,154.0449491816659
14,2021-06-09 13:53:28.377359,Travel,Taxi,485.2977429821982
15,2021-08-24 17:53:28.377359,Food,Zomato,262.9439932340398
16,2021-03-05 19:53:28.377359,Utility,Phone,390.31687619327926
17,2021-04-17 18:53:28.377359,Utility,Electricity,316.8786754246636
18,2021-05-08 15:53:28.377359,Travel,Auto,433.82240427779357
19,2021-05-16 10:53:28.377359,Books,Flipcart,109.32590886550067
20,2020-10-12 18:53:28.377359,Travel,Auto,365.92180825376613
21,2021-01-04 19:53:28.377359,Travel,Metro,329.09737150258513
22,2021-06-24 15:53:28.377359,Food,Zomato,489.1434830522253
23,2020-12-11 10:53:28.377359,Music,Netflix,354.94024099198157
24,2021-05-31 11:53:28.377359,Books,Amazon,498.10049550461065
25,2021-05-21 14:53:28.377359,Food,Hotel,483.315863517772
26,2020-08-26 15:53:28.377359,Books,Amazon Kindle,138.806577801854
27,2021-05-01 15:53:28.377359,Utility,Electricity,103.68079074846585
28,2020-12-14 15:53:28.377359,Utility,Phone,358.4599327957656
29,2021-06-20 10:53:28.377359,Utility,Electricity,184.5577284049955
30,2020-09-15 18:53:28.377359,Food,Swiggy,203.5292397894327
31,2020-09-25 11:53:28.377359,Books,Flipcart,246.50352738452796
32,2021-06-23 11:53:28.377359,Food,Zomato,345.03043608141513
33,2021-05-14 18:53:28.377359,Food,Hotel,449.24802955761743
34,2021-05-14 10:53:28.377359,Utility,Phone,499.8581815222449
35,2021-02-18 18:53:28.377359,Travel,Metro,441.6021430011205
36,2020-12-10 10:53:28.377359,Travel,Auto,472.94143917262176
37,2021-04-18 16:53:28.377359,Music,Amazon,266.0690783774673
38,2021-08-15 10:53:28.377359,Travel,Auto,494.1243994056571
39,2021-05-17 17:53:28.377359,Food,Swiggy,112.33316019807455
40,2021-07-19 12:53:28.377359,Food,Swiggy,291.54598801930536
41,2021-02-20 19:53:28.377359,Utility,Phone,425.18719068071806
42,2021-08-22 17:53:28.377359,Food,Hotel,210.25626950078572
43,2020-09-21 12:53:28.377359,Utility,Phone,486.03393276160733
44,2020-12-26 19:53:28.377359,Utility,Electricity,257.92759337085425
45,2021-05-27 16:53:28.377359,Utility,Electricity,154.74287259516655
46,2021-05-15 15:53:28.377359,Utility,Electricity,359.3249716537848
47,2020-10-28 10:53:28.377359,Books,Flipcart,310.408610004679
48,2021-08-23 17:53:28.377359,Utility,Electricity,310.05840961423314
49,2021-03-16 09:53:28.377359,Music,spotify,232.30340219121138
50,2020-12-24 11:53:28.377359,Food,Zomato,463.00187492635547
51,2020-12-22 17:53:28.377359,Food,Zomato,331.22702332837093
52,2021-03-26 09:53:28.377359,Travel,Taxi,403.6100701341934
53,2021-01-27 09:53:28.377359,Utility,Electricity,183.1866624101276
54,2020-11-16 10:53:28.377359,Music,spotify,160.81754340768396
55,2021-01-21 19:53:28.377359,Books,Flipcart,423.74970808720553
56,2021-05-19 18:53:28.377359,Utility,Phone,319.3428762684619
57,2021-07-15 15:53:28.377359,Utility,Phone,279.6090437716363
58,2021-05-20 10:53:28.377359,Food,Hotel,255.8710346734312
59,2020-08-28 11:53:28.377359,Food,Swiggy,208.2329120852039
60,2021-01-17 11:53:28.377359,Utility,Electricity,382.5195101154448
61,2021-02-25 13:53:28.377359,Food,Hotel,124.65827844174062
62,2021-01-27 19:53:28.377359,Books,Amazon Kindle,497.7708601564023
63,2021-05-10 11:53:28.377359,Travel,Taxi,355.9890502253258
64,2021-01-31 14:53:28.377359,Food,Zomato,232.2223798622789
65,2020-10-23 18:53:28.377359,Music,Netflix,188.7487426895118
66,2020-10-09 16:53:28.377359,Food,Swiggy,263.9577700340145
67,2021-07-31 14:53:28.377359,Music,Netflix,324.786916846731
68,2020-08-26 09:53:28.377359,Travel,Taxi,279.1478844739421
69,2020-10-10 15:53:28.377359,Utility,Electricity,300.52462041935115
70,2021-08-17 13:53:28.377359,Utility,Phone,125.22977317126336
71,2021-03-30 12:53:28.377359,Food,Swiggy,245.36050838040904
72,2021-06-30 18:53:28.377359,Books,Amazon,294.66286899004876
73,2021-08-15 17:53:28.377359,Travel,Metro,117.58872931045573
74,2021-03-20 11:53:28.377359,Travel,Taxi,303.05542098520453
75,2021-03-03 12:53:28.377359,Food,Hotel,425.6252909948148
76,2020-11-17 09:53:28.377359,Music,Netflix,197.5346000167895
77,2021-01-18 14:53:28.377359,Books,Amazon Kindle,482.1523430204321
78,2020-09-09 16:53:28.377359,Music,spotify,415.3728938035302
79,2021-08-17 09:53:28.377359,Music,Netflix,321.7634156544651
80,2021-02-17 09:53:28.377359,Food,Swiggy,283.09570727160764
81,2020-10-29 16:53:28.377359,Food,Hotel,470.08099539923614
82,2020-09-22 09:53:28.377359,Music,spotify,411.14270120842224
83,2021-03-18 09:53:28.377359,Books,Flipcart,451.5844070294999
84,2020-09-21 10:53:28.377359,Music,Netflix,158.7936457269333
85,2021-01-12 09:53:28.377359,Music,Amazon,130.37490757527
86,2021-05-07 16:53:28.377359,Food,Zomato,198.450671792638
87,2021-05-19 15:53:28.377359,Food,Zomato,378.82064134052473
88,2021-04-18 09:53:28.377359,Utility,Phone,124.2212478444578
89,2021-04-12 14:53:28.377359,Music,Amazon,218.487173429263
90,2020-12-01 14:53:28.377359,Music,Amazon,101.57327588889417
91,2021-01-22 17:53:28.377359,Food,Hotel,232.66346838787223
92,2021-01-12 19:53:28.377359,Travel,Taxi,356.8426379886326
93,2021-01-11 09:53:28.377359,Utility,Electricity,111.72080867898062
94,2021-01-04 13:53:28.377359,Utility,Phone,431.1855366816298
95,2021-07-19 13:53:28.377359,Utility,Phone,388.6712132388421
96,2021-01-12 19:53:28.377359,Books,Flipcart,467.5545618966052
97,2021-03-25 11:53:28.377359,Utility,Phone,320.78943360123816
98,2021-05-13 15:53:28.377359,Travel,Taxi,442.0964693975505
99,2020-10-11 16:53:28.377359,Food,Hotel,100.45550129902665
In [73]:
csvurl = "https://raw.githubusercontent.com/vikipedia/python-trainings/master/online_course/source/module2/wallet.csv"
download_big(csvurl, "wallet.csv")
.......
In [113]:
def lines_between2_3_pm(filename):
    datepattern = re.compile(r"\d{1,2},\d{4,4}-\d{2,2}-\d{2,2} 14:.+")
    with open(filename) as f:
        for line in f:
            if datepattern.match(line.strip()):
                print(line, end="")
In [114]:
lines_between2_3_pm("wallet.csv")
0,2021-03-07 14:53:28.377359,Music,Amazon,421.2073272347991
3,2020-11-01 14:53:28.377359,Utility,Phone,222.7563175805277
9,2020-11-29 14:53:28.377359,Travel,Auto,443.61888423247854
12,2021-07-24 14:53:28.377359,Music,Amazon,329.5360031897569
25,2021-05-21 14:53:28.377359,Food,Hotel,483.315863517772
64,2021-01-31 14:53:28.377359,Food,Zomato,232.2223798622789
67,2021-07-31 14:53:28.377359,Music,Netflix,324.786916846731
77,2021-01-18 14:53:28.377359,Books,Amazon Kindle,482.1523430204321
89,2021-04-12 14:53:28.377359,Music,Amazon,218.487173429263
90,2020-12-01 14:53:28.377359,Music,Amazon,101.57327588889417
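
The regular expression works because the hour sits at a fixed place in every line. An alternative sketch parses the timestamp instead, using the csv and datetime modules on a few rows copied from wallet.csv (the function name rows_in_hour is invented for this example):

```python
import csv
import io
from datetime import datetime

sample = """,date,category,description,debit
0,2021-03-07 14:53:28.377359,Music,Amazon,421.2073272347991
1,2020-10-08 09:53:28.377359,Food,Swiggy,328.4400802428426
3,2020-11-01 14:53:28.377359,Utility,Phone,222.7563175805277
18,2021-05-08 15:53:28.377359,Travel,Auto,433.82240427779357
"""

def rows_in_hour(fileobj, hour):
    """Yield the rows whose 'date' column falls within the given hour of the day."""
    for row in csv.DictReader(fileobj):
        stamp = datetime.strptime(row["date"], "%Y-%m-%d %H:%M:%S.%f")
        if stamp.hour == hour:
            yield row

matches = list(rows_in_hour(io.StringIO(sample), 14))
for row in matches:
    print(row["date"], row["category"], row["debit"])
```

With the real file, pass open("wallet.csv", newline="") instead of the StringIO object.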
In [107]:
l = "0,2021-03-07 14:53:28.377359,Music,Amazon,421.2073272347991"
datepattern = re.compile(r"\d{1,2},\d{4,4}-\d{2,2}-\d{2,2} 14:.+")
In [108]:
datepattern.match(l)
Out[108]:
<re.Match object; span=(0, 59), match='0,2021-03-07 14:53:28.377359,Music,Amazon,421.207>

More help on regular expression can be found here

In [115]:
r"\d{4,5}" # an integer with 4 or 5 digits
Out[115]:
'\\d{4,5}'
In [116]:
r"3\d{3,3}" # any four digit number that starts with 3
Out[116]:
'3\\d{3,3}'
In [117]:
r"\d\d\d\d" # 4 digits!
Out[117]:
'\\d\\d\\d\\d'
^ - start of line
$ - end of line
\d - a digit
\s - white space
. - any char
+ - 1 or more of whatever is previous to this
? - 0 or 1 of whatever is previous to this
* - 0 or more occurrences of whatever is previous to this
{m,n} - min m times and max n times of whatever is previous to this
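
A few quick checks of the symbols listed above, using re.fullmatch so that the whole string has to fit the pattern. The sample strings are made up for this example.

```python
import re

# (pattern, text, should the whole text match?)
checks = [
    (r"\d{4,5}", "2022", True),    # 4 or 5 digits
    (r"\d{4,5}", "123", False),    # too short
    (r"3\d{3}", "3141", True),     # four digits starting with 3
    (r"3\d{3}", "4141", False),
    (r"ab+", "abbb", True),        # + : one or more b
    (r"ab+", "a", False),
    (r"ab?", "a", True),           # ? : zero or one b
    (r"ab*", "a", True),           # * : zero or more b
    (r"\d+\s\d+", "10 20", True),  # \s : one whitespace character
    (r"a.c", "abc", True),         # . : any single character
]

results = [bool(re.fullmatch(pattern, text)) == expected
           for pattern, text, expected in checks]
print(all(results))
```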