Session 4

Published

October 9, 2023

Topics Covered

Working with Files

Working with Files

%%file three.txt
One
Two
Three

Overwriting three.txt

open("three.txt").read()

'One\nTwo\nThree\n'

open("three.txt").readlines()

['One\n', 'Two\n', 'Three\n']

contents = open("three.txt").read()

print(contents)

One
Two
Three

print(contents, end="")

One
Two
Three

for line in open("three.txt").readlines():
    print(line)

One

Two

Three

for line in open("three.txt").readlines():
    print(line, end="")

One
Two
Three
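A file object can also be looped over directly, without calling readlines() first. A minimal sketch, assuming a scratch copy of three.txt in a temporary directory (the path and the `lines` list are illustrative, not part of the session):

```python
import os
import tempfile

# Create a scratch copy of three.txt (illustrative path, not the notebook's file).
path = os.path.join(tempfile.mkdtemp(), "three.txt")
f = open(path, "w")
f.write("One\nTwo\nThree\n")
f.close()

lines = []
f = open(path)
for line in f:                        # iterating a file yields one line at a time
    lines.append(line.rstrip("\n"))   # drop the trailing newline
    print(lines[-1])
f.close()
```

Stripping the newline is an alternative to passing end="" to print.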

%load_problem cat

Problem: Cat

Write a program cat.py that takes one or more filenames as command-line arguments and prints the contents of them.

$ python cat.py files/five.txt
one
two
three
four
five

$ python cat.py files/abcd.txt files/1234.txt
A
B
C
D
1
2
3
4

You can verify your solution using:

%verify_problem cat

%%file cat.py
# your code here

!cat files/five.txt

one
two
three
four
five

!cat files/abcd.txt files/1234.txt

A
B
C
D
1
2
3
4

Binary Data and Binary Files

Just like str for text, Python has a bytes type for binary data.

a = "hello"

type(a)

str

binary_data = b"hello"

type(binary_data)

bytes

data = b"hello\x01\x02\x03"

data[-1]

data[0]

print(data)

b'hello\x01\x02\x03'

text = "अआइई"

len(text)

%%file hindi.txt
अआइई

Writing hindi.txt

!ls -l hindi.txt

-rw-r--r-- 1 jupyter-anand jupyter-anand 13 Oct  9 04:26 hindi.txt

data = text.encode("utf-8")

data

b'\xe0\xa4\x85\xe0\xa4\x86\xe0\xa4\x87\xe0\xa4\x88'

len(data)

To read a binary file, we need to open it in rb (read-binary) mode.

open("hindi.txt").read()

'अआइई\n'

open("hindi.txt", "rb").read()

b'\xe0\xa4\x85\xe0\xa4\x86\xe0\xa4\x87\xe0\xa4\x88\n'
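The relationship between text and bytes can be checked in a few lines. A small sketch using the same text as above; the variable name `roundtrip` is only illustrative:

```python
text = "अआइई"
data = text.encode("utf-8")        # str -> bytes

print(len(text))   # 4 characters
print(len(data))   # 12 bytes: each of these characters takes 3 bytes in UTF-8

roundtrip = data.decode("utf-8")   # bytes -> str, using the same encoding
print(roundtrip == text)           # True
```

This also explains the 13 bytes reported by ls: 12 bytes of text plus the trailing newline.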

Writing to Files

To write a file, we need to open the file in write mode.

f = open("a.txt", "w")
f.write("one\n")
f.write("two\n")
f.close()

open("a.txt").read()

'one\ntwo\n'

Whenever we open a file in write mode, all the contents of that file are wiped out.

Also, it is important to close a file after writing. Writes are buffered, so the contents are only guaranteed to reach the file once it is closed.

f = open("b.txt", "w")
f.write("one\n")
f.write("two\n")
# the file is not closed yet

open("b.txt").read()

''

f.close() # now the contents will be written to disk

open("b.txt").read()

'one\ntwo\n'
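Closing is not the only way to push buffered text out: file objects also have a flush() method. A minimal sketch, assuming a scratch file in a temporary directory (the path and the name `after_flush` are illustrative):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "b.txt")   # illustrative scratch file

f = open(path, "w")
f.write("one\n")
f.write("two\n")
f.flush()                          # hand the buffered text over to the operating system

after_flush = open(path).read()    # readable even though f is still open
print(repr(after_flush))           # 'one\ntwo\n'

f.close()
```

Flushing is useful for long-running programs that keep a file open, such as ones that write logs.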

f = open("c.txt", "w")
f.write("one")
f.write("two")
f.close()

open("c.txt").read()

'onetwo'

f = open("c.txt", "w")

f.write("one")

f.write("helloworld")

Q: How to find documentation?

f.write<shift+tab> -- shows help in a popup
f.write? -- adds help to the notebook

Try help on any other function.

len<shift+tab>
len?

f.write

f.write?

Signature: f.write(text, /)
Docstring:
Write string to stream.
Returns the number of characters written (which is always equal to
the length of the string).
Type:      builtin_function_or_method
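The return value described in the docstring can be checked directly. A small sketch, assuming a scratch file in a temporary directory (the names `n1` and `n2` are illustrative):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "c.txt")   # illustrative scratch file

f = open(path, "w")
n1 = f.write("one")          # returns the number of characters written
n2 = f.write("helloworld")
f.close()

print(n1, n2)                # 3 10
```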

The `with` statement

The with statement is handy for closing a file automatically at the end of a code block.

with open("a.txt", "w") as f:
    f.write("one\n")
    f.write("two\n")
    # f gets closed here automatically

open("a.txt").read()

'one\ntwo\n'
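Whether a file has been closed can be checked with its closed attribute. A minimal sketch, assuming a scratch file in a temporary directory; it also simulates an error to show that the file gets closed even when the block fails:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "a.txt")   # illustrative scratch file

with open(path, "w") as f:
    f.write("one\n")
    inside = f.closed        # False: still open inside the block

outside = f.closed           # True: closed when the block ended
print(inside, outside)

try:
    with open(path, "a") as g:
        raise ValueError("simulated failure")   # made-up error for the demo
except ValueError:
    pass

closed_after_error = g.closed   # True: closed even though the block raised
print(closed_after_error)
```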

Appending to an existing file

open("a.txt").read()

'one\ntwo\n'

with open("a.txt", "a") as f:
    f.write("three\n")

open("a.txt").read()

'one\ntwo\nthree\n'
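The difference between the "a" and "w" modes is easiest to see side by side. A small sketch on a scratch file (the path and variable names are illustrative):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "a.txt")   # illustrative scratch file

with open(path, "w") as f:
    f.write("one\n")

with open(path, "a") as f:      # "a" keeps the existing contents and adds to the end
    f.write("two\n")
after_append = open(path).read()

with open(path, "w") as f:      # "w" wipes the file before writing
    f.write("three\n")
after_write = open(path).read()

print(repr(after_append))       # 'one\ntwo\n'
print(repr(after_write))        # 'three\n'
```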

Reading and Writing Binary Files

!ls -l files/python.png

-rw-r--r-- 1 jupyter-anand jupyter-anand 11155 Dec  7  2021 files/python.png

open("files/python.png").read()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

Whenever we open a file in text mode, it is opened with an encoding.

When we read from such a file, Python reads the bytes from the file, decodes them using that encoding, and gives back the text.

The default encoding is usually utf-8.
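The encoding can also be passed explicitly through the encoding argument of open(), which avoids depending on the platform default. A minimal sketch, assuming a scratch file in a temporary directory (the path and variable names are illustrative):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "hindi.txt")   # illustrative scratch file

with open(path, "w", encoding="utf-8") as f:
    f.write("अआइई")

raw = open(path, "rb").read()                 # the bytes stored in the file
text = open(path, encoding="utf-8").read()    # the same bytes, decoded to text

# Reading in text mode amounts to decoding the raw bytes ourselves.
same = raw.decode("utf-8") == text
print(len(raw), len(text), same)              # 12 4 True
```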

How to read binary files then?

data = open("files/python.png", "rb").read()

data[:20]

b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02Y'
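The first eight bytes in the output above are the PNG signature, so a prefix check on the bytes is enough to recognise the file type. A minimal sketch; the helper name is_png is hypothetical, and the sample bytes are copied from the output above rather than read from disk:

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # the 8-byte signature at the start of a PNG file

def is_png(data):
    """Return True if data begins with the PNG signature (hypothetical helper)."""
    return data.startswith(PNG_SIGNATURE)

header = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02Y"   # data[:20] from above
print(is_png(header))     # True
print(is_png(b"hello"))   # False
```

The leading 0x89 byte is also why decoding the file as utf-8 fails at position 0.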

Similarly, when we write binary data, we use the wb mode.

with open("bytes.bin", "wb") as f:
    f.write(b"\x89\x01\x02\x03\x04")

!ls -l bytes.bin

-rw-r--r-- 1 jupyter-anand jupyter-anand 5 Oct  9 04:54 bytes.bin

open("bytes.bin", "rb").read()

b'\x89\x01\x02\x03\x04'

open("bytes.bin").read()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

Problem: Copy File

%load_problem copy-file

Problem: Copy File

Write a program copy_file.py to copy contents of one file to another.

The program should accept a source file and a destination file as arguments and copy the source to the destination.

$ python copy_file.py files/five.txt 5.txt

Note: Don't call this file copy.py, as that would interfere with the standard library module of the same name.

You can verify your solution using:

%verify_problem copy-file

%%file copy_file.py
# your code here

Writing copy_file.py

!python copy_file.py files/python.png python.png

!ls -l files/python.png

-rw-r--r-- 1 jupyter-anand jupyter-anand 11155 Dec  7  2021 files/python.png

Writing Command-line applications

!wc files/words.txt

 10  26 154 files/words.txt

!wc -l files/words.txt

10 files/words.txt

!wc --help

Usage: wc [OPTION]... [FILE]...
  or:  wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified.  A word is a non-zero-length sequence of
printable characters delimited by white space.

With no FILE, or when FILE is -, read standard input.

The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
      --files0-from=F    read input from the files specified by
                           NUL-terminated names in file F;
                           If F is - then read names from standard input
  -L, --max-line-length  print the maximum display width
  -w, --words            print the word counts
      --help        display this help and exit
      --version     output version information and exit

GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
Full documentation <https://www.gnu.org/software/coreutils/wc>
or available locally via: info '(coreutils) wc invocation'

!grep three files/words.txt

one two three
one two three four
one two three four five
two three four five
three four five
one-two-three-four-five-six-seven

!grep -c three files/words.txt

!python square.py 5

!python square.py --help

Traceback (most recent call last):
  File "/home/jupyter-anand/book/square.py", line 4, in <module>
    n = int(sys.argv[1])
ValueError: invalid literal for int() with base 10: '--help'

Example: hello.py

We’ll build a simple command-line app that just says hello to a name.

%%file hello.py
import argparse

p = argparse.ArgumentParser()
p.add_argument("name", help="name to say hello")

args = p.parse_args()
print(args)
print("Hello", args.name)

Overwriting hello.py

!python hello.py Python

Namespace(name='Python')
Hello Python

!python hello.py

usage: hello.py [-h] name
hello.py: error: the following arguments are required: name

!python hello.py --help

usage: hello.py [-h] name

positional arguments:
  name        name to say hello

options:
  -h, --help  show this help message and exit

Let’s improve that by adding flags.

%%file hello2.py
import argparse

p = argparse.ArgumentParser()
p.add_argument("name", help="name to say hello")
p.add_argument("-r", "--repeats", type=int, default=1, help="number of times to repeat the message")

args = p.parse_args()
print(args)
for i in range(args.repeats):
    print("Hello", args.name)

Writing hello2.py

!python hello2.py Python

Namespace(name='Python', repeats=1)
Hello Python

!python hello2.py Python -r 4

Namespace(name='Python', repeats=4)
Hello Python
Hello Python
Hello Python
Hello Python

!python hello2.py Python --repeats 4

Namespace(name='Python', repeats=4)
Hello Python
Hello Python
Hello Python
Hello Python

!python hello2.py --help

usage: hello2.py [-h] [-r REPEATS] name

positional arguments:
  name                  name to say hello

options:
  -h, --help            show this help message and exit
  -r REPEATS, --repeats REPEATS
                        number of times to repeat the message
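parse_args also accepts an explicit list of arguments, which makes it possible to try a parser without going through the command line. A small sketch with the same parser as hello2.py; the variable names `args` and `defaults` are illustrative:

```python
import argparse

p = argparse.ArgumentParser()
p.add_argument("name", help="name to say hello")
p.add_argument("-r", "--repeats", type=int, default=1,
               help="number of times to repeat the message")

# An explicit list stands in for the real command line.
args = p.parse_args(["Python", "-r", "4"])
print(args.repeats, type(args.repeats))   # 4 <class 'int'>

defaults = p.parse_args(["Python"])
print(defaults.repeats)                   # 1, the default
```

Because of type=int, the value arrives as an integer and can be passed straight to range.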

Boolean Flags

%%file hello3.py
import argparse

p = argparse.ArgumentParser()
p.add_argument("name", help="name to say hello")
p.add_argument("-r", "--repeats", type=int, default=1, help="number of times to repeat the message")
p.add_argument("-u", "--uppercase", 
               action="store_true",
               default=False,
               help="convert the message into uppercase")


args = p.parse_args()
# print(args)

msg = f"Hello {args.name}"
if args.uppercase:
    msg = msg.upper()
    
for i in range(args.repeats):
    print(msg)

Overwriting hello3.py

!python hello3.py --help

usage: hello3.py [-h] [-r REPEATS] [-u] name

positional arguments:
  name                  name to say hello

options:
  -h, --help            show this help message and exit
  -r REPEATS, --repeats REPEATS
                        number of times to repeat the message
  -u, --uppercase       convert the message into uppercase

!python hello3.py Python -u

HELLO PYTHON

!python hello3.py Python -u -r 3

HELLO PYTHON
HELLO PYTHON
HELLO PYTHON

!python hello3.py -u Python -r 3

HELLO PYTHON
HELLO PYTHON
HELLO PYTHON

!python hello3.py -u  -r 3 Python

HELLO PYTHON
HELLO PYTHON
HELLO PYTHON
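The three runs above give the same result because argparse does not care where the optional flags appear relative to the positional argument. A minimal sketch that checks this by passing explicit argument lists to a parser like the one in hello3.py (help texts omitted for brevity):

```python
import argparse

p = argparse.ArgumentParser()
p.add_argument("name")
p.add_argument("-r", "--repeats", type=int, default=1)
p.add_argument("-u", "--uppercase", action="store_true")

a = p.parse_args(["Python", "-u", "-r", "3"])
b = p.parse_args(["-u", "Python", "-r", "3"])
c = p.parse_args(["-u", "-r", "3", "Python"])
print(a == b == c)        # True: Namespace objects compare by their attributes

plain = p.parse_args(["Python"])
print(plain.uppercase)    # False: a store_true flag is False when it is not given
```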

Problem: Rewrite the square.py program that we wrote earlier using argparse.

%%file square.py
import argparse

Overwriting square.py

!python square.py --help

!python square.py 5

!grep --help

Usage: grep [OPTION]... PATTERNS [FILE]...
Search for PATTERNS in each FILE.
Example: grep -i 'hello world' menu.h main.c
PATTERNS can contain multiple patterns separated by newlines.

Pattern selection and interpretation:
  -E, --extended-regexp     PATTERNS are extended regular expressions
  -F, --fixed-strings       PATTERNS are strings
  -G, --basic-regexp        PATTERNS are basic regular expressions
  -P, --perl-regexp         PATTERNS are Perl regular expressions
  -e, --regexp=PATTERNS     use PATTERNS for matching
  -f, --file=FILE           take PATTERNS from FILE
  -i, --ignore-case         ignore case distinctions in patterns and data
      --no-ignore-case      do not ignore case distinctions (default)
  -w, --word-regexp         match only whole words
  -x, --line-regexp         match only whole lines
  -z, --null-data           a data line ends in 0 byte, not newline

Miscellaneous:
  -s, --no-messages         suppress error messages
  -v, --invert-match        select non-matching lines
  -V, --version             display version information and exit
      --help                display this help text and exit

Output control:
  -m, --max-count=NUM       stop after NUM selected lines
  -b, --byte-offset         print the byte offset with output lines
  -n, --line-number         print line number with output lines
      --line-buffered       flush output on every line
  -H, --with-filename       print file name with output lines
  -h, --no-filename         suppress the file name prefix on output
      --label=LABEL         use LABEL as the standard input file name prefix
  -o, --only-matching       show only nonempty parts of lines that match
  -q, --quiet, --silent     suppress all normal output
      --binary-files=TYPE   assume that binary files are TYPE;
                            TYPE is 'binary', 'text', or 'without-match'
  -a, --text                equivalent to --binary-files=text
  -I                        equivalent to --binary-files=without-match
  -d, --directories=ACTION  how to handle directories;
                            ACTION is 'read', 'recurse', or 'skip'
  -D, --devices=ACTION      how to handle devices, FIFOs and sockets;
                            ACTION is 'read' or 'skip'
  -r, --recursive           like --directories=recurse
  -R, --dereference-recursive  likewise, but follow all symlinks
      --include=GLOB        search only files that match GLOB (a file pattern)
      --exclude=GLOB        skip files that match GLOB
      --exclude-from=FILE   skip files that match any file pattern from FILE
      --exclude-dir=GLOB    skip directories that match GLOB
  -L, --files-without-match  print only names of FILEs with no selected lines
  -l, --files-with-matches  print only names of FILEs with selected lines
  -c, --count               print only a count of selected lines per FILE
  -T, --initial-tab         make tabs line up (if needed)
  -Z, --null                print 0 byte after FILE name

Context control:
  -B, --before-context=NUM  print NUM lines of leading context
  -A, --after-context=NUM   print NUM lines of trailing context
  -C, --context=NUM         print NUM lines of output context
  -NUM                      same as --context=NUM
      --group-separator=SEP  print SEP on line between matches with context
      --no-group-separator  do not print separator for matches with context
      --color[=WHEN],
      --colour[=WHEN]       use markers to highlight the matching strings;
                            WHEN is 'always', 'never', or 'auto'
  -U, --binary              do not strip CR characters at EOL (MSDOS/Windows)

When FILE is '-', read standard input.  With no FILE, read '.' if
recursive, '-' otherwise.  With fewer than two FILEs, assume -h.
Exit status is 0 if any line is selected, 1 otherwise;
if any error occurs and -q is not given, the exit status is 2.

Report bugs to: bug-grep@gnu.org
GNU grep home page: <https://www.gnu.org/software/grep/>
General help using GNU software: <https://www.gnu.org/gethelp/>

!grep  "two three" files/words.txt

one two three
one two three four
one two three four five
two three four five

!grep  -c "two three" files/words.txt

!grep  --count "two three" files/words.txt
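A boolean flag like --count maps naturally onto argparse's store_true action. A rough sketch of how a small grep-like tool might be laid out. The names build_parser and grep are hypothetical, the sample lines are made up, and the matching is a plain substring test rather than grep's regular expressions:

```python
import argparse

def build_parser():
    # Hypothetical parser mirroring a tiny subset of grep's interface.
    p = argparse.ArgumentParser(description="print lines that contain a pattern")
    p.add_argument("pattern", help="text to search for")
    p.add_argument("-c", "--count", action="store_true",
                   help="print only a count of matching lines")
    return p

def grep(pattern, lines, count=False):
    """Return the matching lines, or how many there are when count is True."""
    matches = [line for line in lines if pattern in line]
    return len(matches) if count else matches

lines = ["one two three", "three four five", "four five six"]   # made-up sample data
args = build_parser().parse_args(["--count", "two three"])
print(grep(args.pattern, lines, count=args.count))   # 1
```

Keeping the matching logic in a function separate from the argument parsing makes it easy to test without running the script.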