Session 8

Published

October 23, 2023

Topics Covered
  • Invoking External Applications
  • Text Processing & Regular Expressions

Invoking External Applications

We often want to:

  • invoke an external command
  • invoke an external command and read the output

There are two modules in the Python standard library that allow us to do this: os and subprocess.

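The os.system function in the os module is primitive: it hands the whole command line to the shell, which makes it easy to misuse. The subprocess module avoids the shell by default and gives finer control over the input and output of the command.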

!pwd
/home/jupyter-anand/book
!echo hello
hello
import os
os.system("echo hello")
hello
0
status = os.system("echo hello")
hello
status
0

A status of 0 indicates successful termination of the process. Any non-zero status indicates an error.

status = os.system("cat no-file")
cat: no-file: No such file or directory
status
256

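The previous command terminated with a non-zero status, indicating that it was not successful. On Unix, os.system returns the wait status rather than the exit code itself, which is why we see 256 here even though cat exited with code 1.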
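The value returned by os.system can be decoded into a plain exit code. A minimal sketch, assuming Python 3.9 or later for os.waitstatus_to_exitcode; the helper name exit_code_from_status is made up for illustration and is not part of the standard library:

```python
import os

def exit_code_from_status(status):
    """Convert a value returned by os.system into a plain exit code."""
    if hasattr(os, "waitstatus_to_exitcode"):  # available from Python 3.9
        return os.waitstatus_to_exitcode(status)
    # Fallback assumption: the process exited normally, so the exit code
    # sits in the high byte of the 16-bit wait status.
    return status >> 8

print(exit_code_from_status(0))      # 0
print(exit_code_from_status(256))    # 1
print(exit_code_from_status(32512))  # 127
```

An exit code of 127 is what the shell conventionally reports when a command is not found.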

The subprocess module

import subprocess
subprocess.call(["echo", "hello"])
hello
0
status = subprocess.call(["echo", "hello"])
hello
status
0
subprocess.call(["echo", "foo > bar"])
foo > bar
0

Let’s see what happens if we use os.system.

os.system("echo foo > bar")
0
!cat bar
foo

os.system passes the command to the shell, and the shell executes it. When we say echo foo > bar, the shell interprets that as redirecting the output to the file bar.
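Redirection does not require a shell. As a rough sketch, the same effect can be had by passing an open file as stdout. To keep the example runnable on any platform, it uses a child Python process as a stand-in for echo and writes into a temporary directory; both choices are assumptions made for illustration:

```python
import os
import subprocess
import sys
import tempfile

# A stand-in for "echo foo" that works wherever Python itself runs.
cmd = [sys.executable, "-c", "print('foo')"]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "bar")
    # Equivalent of "cmd > bar", but no shell is involved.
    with open(path, "w") as f:
        status = subprocess.call(cmd, stdout=f)
    with open(path) as f:
        content = f.read()

print(status)         # 0
print(repr(content))  # 'foo\n'
```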

!seq 10 > 10.txt

When we call subprocess.call, we specify each argument explicitly instead of leaving that for the shell to parse.

subprocess.call(["echo", "1 & 2 & 3 ; 4"])
0
1 & 2 & 3 ; 4
os.system("echo 1 & 2 & 3 ; 4")
1
sh: 1: sh: 1: 3: not found
sh: 1: 4: not found
2: not found
32512

We can make subprocess unsafe too, by passing shell=True to force it to run the command through the shell. But that is not the default behavior. With subprocess, we are safe by default.

subprocess.call("echo 1 & 2 & 3 ; 4", shell=True)
1
/bin/sh: 1: /bin/sh: 1: 2: not found
3: not found
/bin/sh: 1: 4: not found
127
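The practical consequence is that text coming from a user can be passed as a list element without any quoting. A small sketch, where the untrusted string is a hypothetical input and a child Python process stands in for echo so that the example runs anywhere:

```python
import subprocess
import sys

untrusted = "hello; echo injected"  # hypothetical user input

# Stand-in for echo: a child Python process that prints its first argument.
cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted]
result = subprocess.run(cmd, capture_output=True, text=True)

print(result.stdout)  # hello; echo injected
```

The semicolon reaches the child as ordinary text. If a shell really is needed, shlex.quote from the standard library can quote a single argument for POSIX shells.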
p = subprocess.Popen(["echo", "hello"])
hello
p
<Popen: returncode: None args: ['echo', 'hello']>
p.pid
1542209
p.wait()
0

We can read the output by specifying stdout as a pipe.

p = subprocess.Popen(["echo", "hello"], stdout=subprocess.PIPE)
p.stdout.read()
b'hello\n'
p.wait() 
0

By default, the output is bytes. We can get text instead by passing text=True.

p = subprocess.Popen(["echo", "hello"], stdout=subprocess.PIPE, text=True)
p.stdout.read()
'hello\n'
p = subprocess.Popen(["cat", "bad-file"], stdout=subprocess.PIPE, text=True)
cat: bad-file: No such file or directory
p.stdout.read()
''
p = subprocess.Popen(["cat", "bad-file"], 
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.PIPE,
                     text=True)
p.stdout.read()
''
p.stderr.read()
'cat: bad-file: No such file or directory\n'

We can also tell subprocess to send stderr to stdout.

p = subprocess.Popen(["cat", "bad-file"], 
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.STDOUT,
                     text=True)
p.stdout.read()
'cat: bad-file: No such file or directory\n'
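For the common case of running a command to completion and collecting its output, subprocess.run wraps the Popen steps shown above in one call. A minimal sketch, assuming Python 3.7 or later for the capture_output and text parameters; the child command is a made-up Python one-liner that writes to both streams and exits with status 3:

```python
import subprocess
import sys

# Child process: prints to stdout and stderr, then exits with status 3.
code = "import sys; print('out'); print('err', file=sys.stderr); sys.exit(3)"

result = subprocess.run([sys.executable, "-c", code],
                        capture_output=True,  # pipe both stdout and stderr
                        text=True)            # decode bytes to str

print(result.returncode)    # 3
print(repr(result.stdout))  # 'out\n'
print(repr(result.stderr))  # 'err\n'
```

subprocess.run waits for the process to finish, so there is no separate wait() call to remember.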

Example: figlet

!figlet hello
 _          _ _       
| |__   ___| | | ___  
| '_ \ / _ \ | |/ _ \ 
| | | |  __/ | | (_) |
|_| |_|\___|_|_|\___/ 
                      

Let’s implement a function figlet that calls the figlet command and returns its output.

def figlet(text):
    cmd = ["figlet", text]
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    output = p.stdout.read()
    p.wait()
    return output
figlet("hello")
" _          _ _       \n| |__   ___| | | ___  \n| '_ \\ / _ \\ | |/ _ \\ \n| | | |  __/ | | (_) |\n|_| |_|\\___|_|_|\\___/ \n                      \n"
print(figlet("hello"))
 _          _ _       
| |__   ___| | | ___  
| '_ \ / _ \ | |/ _ \ 
| | | |  __/ | | (_) |
|_| |_|\___|_|_|\___/ 
                      
for i in range(5):
    print(figlet(str(i)))
  ___  
 / _ \ 
| | | |
| |_| |
 \___/ 
       

 _ 
/ |
| |
| |
|_|
   

 ____  
|___ \ 
  __) |
 / __/ 
|_____|
       

 _____ 
|___ / 
  |_ \ 
 ___) |
|____/ 
       

 _  _   
| || |  
| || |_ 
|__   _|
   |_|  
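A function like figlet assumes that the command is installed and succeeds. One possible way to handle the other cases is sketched below; command_output is a hypothetical helper, and returning None on failure is just one design choice among several:

```python
import subprocess
import sys

def command_output(cmd):
    """Return the stdout of cmd, or None if it is missing or fails."""
    try:
        # check=True raises CalledProcessError on a non-zero exit status.
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except FileNotFoundError:
        return None  # the executable does not exist
    except subprocess.CalledProcessError:
        return None  # the command ran but reported failure
    return result.stdout

ok = command_output([sys.executable, "-c", "print('hi')"])
failed = command_output([sys.executable, "-c", "import sys; sys.exit(2)"])
missing = command_output(["no-such-command-xyz"])
print(repr(ok), failed, missing)  # 'hi\n' None None
```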
        

Problem: cowsay

There is a command cowsay that prints a text as if a cow is saying it.

!/usr/games/cowsay "python is awesome!"
 ____________________
< python is awesome! >
 --------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

Problem: Write a function cowsay that takes a text as an argument, calls the command cowsay with that text, and returns the output of that command.

>>> x = cowsay("python is awesome")
>>> print(x)
 ___________________
< python is awesome >
 -------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

Regular Expressions

Regular expressions are a mini-language for pattern matching.

sentence = "10 apples and 20 mangoes"

How many fruits are there in that sentence?

sentence.split()
['10', 'apples', 'and', '20', 'mangoes']
[int(w) for w in sentence.split() if str.isnumeric(w)]
[10, 20]
sum([int(w) for w in sentence.split() if str.isnumeric(w)])
30

Let’s see how to do that with regular expressions.

import re
re.findall(r"\d+", sentence)
['10', '20']

Suppose we want to mask the numbers.

re.sub(r"\d", "X", sentence)
'XX apples and XX mangoes'
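To finish the fruit-counting question with regular expressions, the matched strings can be converted to integers and summed. A short sketch using a raw string for the pattern, so that the backslash reaches the re module untouched:

```python
import re

sentence = "10 apples and 20 mangoes"

# findall returns strings, so convert each match before adding.
numbers = [int(n) for n in re.findall(r"\d+", sentence)]
total = sum(numbers)

print(numbers)  # [10, 20]
print(total)    # 30
```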

Syntax

A regular expression can contain ordinary and special characters.

Rule 1: Ordinary characters match themselves

re.match("one", "one thousand phones")
<re.Match object; span=(0, 3), match='one'>
re.findall("one", "one thousand phones")
['one', 'one']

Rule 2: Special character . matches any character

re.match("a.c", "abc")
<re.Match object; span=(0, 3), match='abc'>

If we want to match a literal ., we need to escape it.

re.match(r"a\.b", "axb") # no match
re.match(r"a\.b", "a.b") # match
<re.Match object; span=(0, 3), match='a.b'>

Rule 3: Special character | is used to match either of two patterns

re.match("a|b", "a")
<re.Match object; span=(0, 1), match='a'>
re.match("a|b", "b")
<re.Match object; span=(0, 1), match='b'>
re.match("a|b", "c")
re.findall("apple|mango|banana", "10 apples and 20 mangoes")
['apple', 'mango']

Rule 4: A character group

A character group matches any one of the characters in the group.

re.match("a[bc]", "ab")
<re.Match object; span=(0, 2), match='ab'>
re.match("a[bc]", "ac")
<re.Match object; span=(0, 2), match='ac'>
re.match("a[bc]", "ad")
re.findall("a[bc]", "ab ac ad ae")
['ab', 'ac']

We can also specify a range.

re.match("a[0-9]b", "a5b")
<re.Match object; span=(0, 3), match='a5b'>
re.match("[a-z][0-9][a-z]", "a5b")
<re.Match object; span=(0, 3), match='a5b'>
re.match("[a-z][0-9][a-z]", "k8s")
<re.Match object; span=(0, 3), match='k8s'>

We can also negate a character group using ^.

re.findall("[^a-z0-9]", "10 apples and 20 mangoes!! <>")
[' ', ' ', ' ', ' ', '!', '!', ' ', '<', '>']

Rule 5: Modifiers

  • ? - matches 0 or 1 occurrences of the preceding item
  • * - matches 0 or more occurrences of the preceding item
  • + - matches 1 or more occurrences of the preceding item

Let’s say we want to match hexadecimal numbers like 0x12fe.

re.search("0x[0-9a-f]+", "the number 0x12fw is a hexadecimal number")
<re.Match object; span=(11, 16), match='0x12f'>
re.match("ab?", "a")
<re.Match object; span=(0, 1), match='a'>
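The difference between the three modifiers is easiest to see by running them on the same text. A small illustrative sketch (the sample string is made up):

```python
import re

text = "a ab abb"

optional = re.findall(r"ab?", text)      # a followed by at most one b
zero_or_more = re.findall(r"ab*", text)  # a followed by any number of b's
one_or_more = re.findall(r"ab+", text)   # a followed by at least one b

print(optional)      # ['a', 'ab', 'ab']
print(zero_or_more)  # ['a', 'ab', 'abb']
print(one_or_more)   # ['ab', 'abb']
```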

Rule 6: Predefined escape codes

  • \d - any digit (same as [0-9] for ASCII text)
  • \s - any whitespace character
  • \w - any word character (same as [a-zA-Z0-9_] for ASCII text; use \w+ to match a whole identifier)
re.findall(r"\d+", "10 apples and 30 mangoes")
['10', '30']
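The other two escape codes work the same way. A brief sketch with made-up sample strings, showing \w+ picking out words and \s+ splitting on any run of whitespace:

```python
import re

# \w+ matches runs of letters, digits and underscores.
words = re.findall(r"\w+", "hello, world_1!")

# \s+ matches runs of spaces, tabs and newlines alike.
fields = re.split(r"\s+", "a  b\tc\nd")

print(words)   # ['hello', 'world_1']
print(fields)  # ['a', 'b', 'c', 'd']
```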

Rule 7: Grouping

re.findall(r"(\d+) (apples|mangoes)", "10 apples and 20 mangoes")
[('10', 'apples'), ('20', 'mangoes')]
re.findall(r"(\d+) ([a-z]+)", "10 apples, 20 mangoes and 30 bananas are in the bag.")
[('10', 'apples'), ('20', 'mangoes'), ('30', 'bananas')]
matches = re.findall(r"(\d+) ([a-z]+)", "10 apples, 20 mangoes and 30 bananas are in the bag.")
from tabulate import tabulate
print(tabulate(matches))
--  -------
10  apples
20  mangoes
30  bananas
--  -------
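Because findall returns one tuple per match when the pattern has groups, the result converts naturally into other structures. A sketch that builds a dictionary of counts, and also shows named groups as an optional way to refer to a group by name instead of position:

```python
import re

text = "10 apples, 20 mangoes and 30 bananas are in the bag."

# Each match is a (count, fruit) tuple; turn the pairs into a dict.
counts = {fruit: int(n) for n, fruit in re.findall(r"(\d+) ([a-z]+)", text)}
print(counts)  # {'apples': 10, 'mangoes': 20, 'bananas': 30}

# (?P<name>...) gives a group a name that can be used with m.group().
m = re.search(r"(?P<count>\d+) (?P<fruit>[a-z]+)", text)
print(m.group("count"), m.group("fruit"))  # 10 apples
```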

Rule 8: Match begin and end of string

The special characters ^ and $ match the beginning and the end of the string respectively.

For example, to remove trailing whitespace:

re.sub(r"\s+$", "", "   hello  world   ")
'   hello  world'
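The same idea works at the other end of the string with ^, and the two anchors can be combined with | to trim both sides in one call. A short sketch (for plain whitespace, str.strip does the same job; this is only to illustrate the anchors):

```python
import re

text = "   hello  world   "

leading_removed = re.sub(r"^\s+", "", text)
both_removed = re.sub(r"^\s+|\s+$", "", text)

print(repr(leading_removed))  # 'hello  world   '
print(repr(both_removed))     # 'hello  world'
```

The spaces between the two words are untouched, because they are neither at the beginning nor at the end of the string.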

Python API of Regular Expressions

findall

finds all occurrences of the pattern.

re.findall(r"\d+", "10 apples and 20 mangoes")
['10', '20']

match

matches a regular expression at the beginning of a string.

re.match(r"\d+", "10 apples and 20 mangoes")
<re.Match object; span=(0, 2), match='10'>
m = re.match(r"\d+", "10 apples and 20 mangoes")
m.group()
'10'

search

similar to match, but finds the pattern anywhere in the string.

re.search(r"\d+", "10 apples and 20 mangoes")
<re.Match object; span=(0, 2), match='10'>
re.search(r"\d+", "There are 10 apples and 20 mangoes")
<re.Match object; span=(10, 12), match='10'>
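Both functions return None when nothing matches, so the result is usually checked before use. A small sketch contrasting match, search and fullmatch (fullmatch requires the whole string to match the pattern):

```python
import re

text = "There are 10 apples and 20 mangoes"

at_start = re.match(r"\d+", text)    # the string does not start with a digit
anywhere = re.search(r"\d+", text)   # finds the first number

print(at_start)  # None
if anywhere:
    print(anywhere.group(), anywhere.start(), anywhere.end())  # 10 10 12

print(re.fullmatch(r"\d+", "10") is not None)     # True
print(re.fullmatch(r"\d+", "10 apples") is None)  # True
```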

sub

replaces every occurrence of the pattern with the replacement string.

re.sub(r"\d+", "?", "10 apples and 20 mangoes")
'? apples and ? mangoes'
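The replacement does not have to be a fixed string. As a sketch, it can refer back to groups with \1, \2, or it can be a function that receives each match object and returns the text to substitute:

```python
import re

text = "10 apples and 20 mangoes"

# Backreferences: swap the number and the word.
swapped = re.sub(r"(\d+) (\w+)", r"\2: \1", text)

# Function replacement: double every number.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)

print(swapped)  # apples: 10 and mangoes: 20
print(doubled)  # 20 apples and 40 mangoes
```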

split

re.split("<[a-z/]+>", "<b>Hello</b><i>World</i>")
['', 'Hello', '', 'World', '']

If we put the pattern in a group, that is also included in the result.

re.split("(<[a-z/]+>)", "<b>Hello</b><i>World</i>")
['', '<b>', 'Hello', '</b>', '', '<i>', 'World', '</i>', '']

Problems

%load_problem squeeze
Problem: Squeeze

Write a function squeeze to replace multiple consecutive space characters with a single space.

>>> squeeze("a   b   c d")
'a b c d'

You can verify your solution using:

%verify_problem squeeze

# your code here


Example: Parse git log

%%file git-log.txt
commit b27f92644e44657533671155ce92f597ffdc2b03
Author: Anand Chitipothu <anandology@gmail.com>
Date:   Thu Jul 7 08:22:38 2022 +0530

    Added gitignore

commit 7ed3348503ccd2f234d62f848428646469625ee5
Author: Anand Chitipothu <anandology@gmail.com>
Date:   Thu Jul 7 08:21:50 2022 +0530

    Added version checks to verify commit script

commit cb86214516da82c619417e91df485ae1c739e645
Author: Anand Chitipothu <anandology@gmail.com>
Date:   Wed Jul 6 23:16:54 2022 +0530

    Verify your python workspace 
Writing git-log.txt

Can we convert this into git’s one-line log format?

b27f926 Added gitignore
7ed3348 Added version checks to verify commit script
cb86214 Verify your python workspace
gitlog = open("git-log.txt").read()
print(gitlog)
commit b27f92644e44657533671155ce92f597ffdc2b03
Author: Anand Chitipothu <anandology@gmail.com>
Date:   Thu Jul 7 08:22:38 2022 +0530

    Added gitignore

commit 7ed3348503ccd2f234d62f848428646469625ee5
Author: Anand Chitipothu <anandology@gmail.com>
Date:   Thu Jul 7 08:21:50 2022 +0530

    Added version checks to verify commit script

commit cb86214516da82c619417e91df485ae1c739e645
Author: Anand Chitipothu <anandology@gmail.com>
Date:   Wed Jul 6 23:16:54 2022 +0530

    Verify your python workspace 
regex = re.compile("^commit ([0-9a-f]+)", re.M)
regex.split(gitlog)
['',
 'b27f92644e44657533671155ce92f597ffdc2b03',
 '\nAuthor: Anand Chitipothu <anandology@gmail.com>\nDate:   Thu Jul 7 08:22:38 2022 +0530\n\n    Added gitignore\n\n',
 '7ed3348503ccd2f234d62f848428646469625ee5',
 '\nAuthor: Anand Chitipothu <anandology@gmail.com>\nDate:   Thu Jul 7 08:21:50 2022 +0530\n\n    Added version checks to verify commit script\n\n',
 'cb86214516da82c619417e91df485ae1c739e645',
 '\nAuthor: Anand Chitipothu <anandology@gmail.com>\nDate:   Wed Jul 6 23:16:54 2022 +0530\n\n    Verify your python workspace \n']
def pairs(values):
    return zip(values[::2], values[1::2])
for a, b in pairs(range(10)):
    print(a, b)
0 1
2 3
4 5
6 7
8 9
regex = re.compile("^commit ([0-9a-f]+)", re.M)
tokens = regex.split(gitlog)
tokens
['',
 'b27f92644e44657533671155ce92f597ffdc2b03',
 '\nAuthor: Anand Chitipothu <anandology@gmail.com>\nDate:   Thu Jul 7 08:22:38 2022 +0530\n\n    Added gitignore\n\n',
 '7ed3348503ccd2f234d62f848428646469625ee5',
 '\nAuthor: Anand Chitipothu <anandology@gmail.com>\nDate:   Thu Jul 7 08:21:50 2022 +0530\n\n    Added version checks to verify commit script\n\n',
 'cb86214516da82c619417e91df485ae1c739e645',
 '\nAuthor: Anand Chitipothu <anandology@gmail.com>\nDate:   Wed Jul 6 23:16:54 2022 +0530\n\n    Verify your python workspace \n']
for commit_hash, rest in pairs(tokens[1:]):
    # The commit message is the block after the first blank line.
    comment = rest.split("\n\n")[1].strip()
    print(commit_hash[:7], comment)
b27f926 Added gitignore
7ed3348 Added version checks to verify commit script
cb86214 Verify your python workspace
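An alternative to split plus pairs is to capture the hash and the message with a single pattern and re.finditer. The sketch below is only one possible approach: it assumes every commit has exactly an Author line and a Date line followed by a blank line and a one-line message (merge commits, for example, would not fit), and it uses an inline sample_log string instead of reading git-log.txt so that it runs on its own:

```python
import re

sample_log = """\
commit b27f92644e44657533671155ce92f597ffdc2b03
Author: Anand Chitipothu <anandology@gmail.com>
Date:   Thu Jul 7 08:22:38 2022 +0530

    Added gitignore

commit 7ed3348503ccd2f234d62f848428646469625ee5
Author: Anand Chitipothu <anandology@gmail.com>
Date:   Thu Jul 7 08:21:50 2022 +0530

    Added version checks to verify commit script
"""

# Group 1 is the hash, group 2 is the first line of the message.
# re.M lets ^ match at the start of every line; "." never crosses a newline.
commit_re = re.compile(r"^commit ([0-9a-f]+)\nAuthor: .*\nDate: .*\n\n +(.*)", re.M)

oneline = [f"{m.group(1)[:7]} {m.group(2).strip()}"
           for m in commit_re.finditer(sample_log)]

for line in oneline:
    print(line)
# b27f926 Added gitignore
# 7ed3348 Added version checks to verify commit script
```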