Session 4

Published

October 9, 2023

Topics Covered
  • Working with Files

Working with Files

%%file three.txt
One
Two
Three
Overwriting three.txt
open("three.txt").read()
'One\nTwo\nThree\n'
open("three.txt").readlines()
['One\n', 'Two\n', 'Three\n']
contents = open("three.txt").read()
print(contents)
One
Two
Three
print(contents, end="")
One
Two
Three
for line in open("three.txt").readlines():
    print(line)
One

Two

Three
for line in open("three.txt").readlines():
    print(line, end="")
One
Two
Three
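
The blank lines in the first loop above appear because every line read from the file still ends with "\n" and print adds one more. As an alternative to end="", a minimal sketch that strips the newline and loops over the file object directly (the scratch file path is made up for this illustration):

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "three.txt")
with open(path, "w") as f:
    f.write("One\nTwo\nThree\n")

# A file object can be looped over directly; it yields one line at a time.
with open(path) as f:
    lines = [line.rstrip("\n") for line in f]

for line in lines:
    print(line)  # print now adds the only newline
```

Looping over the file object avoids building the whole list that readlines() returns, which may matter for large files.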
%load_problem cat
Problem: Cat

Write a program cat.py that takes one or more filenames as command-line arguments and prints their contents.

$ python cat.py files/five.txt
one
two
three
four
five

$ python cat.py files/abcd.txt files/1234.txt
A
B
C
D
1
2
3
4

You can verify your solution using:

%verify_problem cat

%%file cat.py
# your code here


!cat files/five.txt
one
two
three
four
five
!cat files/abcd.txt files/1234.txt
A
B
C
D
1
2
3
4

Binary Data and Binary Files

Just as Python has a str type for text, it has a bytes type for binary data.

a = "hello"
type(a)
str
binary_data = b"hello"
type(binary_data)
bytes
data = b"hello\x01\x02\x03"
data[-1]
3
data[0]
104
print(data)
b'hello\x01\x02\x03'
text = "अआइई"
len(text)
4
%%file hindi.txt
अआइई
Writing hindi.txt
!ls -l hindi.txt
-rw-r--r-- 1 jupyter-anand jupyter-anand 13 Oct  9 04:26 hindi.txt
data = text.encode("utf-8")
data
b'\xe0\xa4\x85\xe0\xa4\x86\xe0\xa4\x87\xe0\xa4\x88'
len(data)
12
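
The numbers above (4 characters, 12 bytes, and 13 bytes on disk once the trailing newline is counted) follow from how UTF-8 encodes these letters. A small sketch, reusing the same text value, that shows the per-character byte counts and that decode() reverses encode():

```python
text = "अआइई"
data = text.encode("utf-8")

# Each of these Devanagari letters takes 3 bytes in UTF-8.
sizes = [len(ch.encode("utf-8")) for ch in text]
print(sizes)

# decode() reverses encode() when the same encoding is used.
roundtrip = data.decode("utf-8")
print(roundtrip == text)
```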

To read a binary file, we need to open it in rb (read-binary) mode.

open("hindi.txt").read()
'अआइई\n'
open("hindi.txt", "rb").read()
b'\xe0\xa4\x85\xe0\xa4\x86\xe0\xa4\x87\xe0\xa4\x88\n'

Writing to Files

To write a file, we need to open the file in write mode.

f = open("a.txt", "w")
f.write("one\n")
f.write("two\n")
f.close()
open("a.txt").read()
'one\ntwo\n'

Whenever we open a file in write mode, all the existing contents of that file are wiped out.

Also, it is important to close a file after writing. Writes are buffered, so the contents are guaranteed to reach the disk only after the file is closed (or flushed).

f = open("b.txt", "w")
f.write("one\n")
f.write("two\n")
# not closed yet
4
open("b.txt").read()
''
f.close() # now the contents will be written to disk
open("b.txt").read()
'one\ntwo\n'
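
Closing is not the only way to push buffered writes out. As a sketch, assuming a scratch file in a temporary directory (the path is made up for this illustration), calling flush() makes the text visible to other readers while the file stays open:

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "b.txt")

f = open(path, "w")
f.write("one\n")
f.write("two\n")
f.flush()  # hand the buffered text to the operating system without closing

with open(path) as reader:
    after_flush = reader.read()

f.close()
print(repr(after_flush))
```

In everyday code, closing the file (or using the with statement) is usually enough; flush() is mainly useful for files that stay open for a long time, such as logs.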
f = open("c.txt", "w")
f.write("one")
f.write("two")
f.close()
open("c.txt").read()
'onetwo'
f = open("c.txt", "w")
f.write("one")
3
f.write("helloworld")
10

Q: How do we find documentation?

f.write<shift+tab> -- shows help in a popup
f.write? -- adds help to the notebook

Try help on any other function.

len<shift+tab>
len?
f.write
f.write?
Signature: f.write(text, /)
Docstring:
Write string to stream.
Returns the number of characters written (which is always equal to
the length of the string).
Type:      builtin_function_or_method

The with statement

The with statement is a handy way to close a file automatically at the end of its code block.

with open("a.txt", "w") as f:
    f.write("one\n")
    f.write("two\n")
    # f gets closed here automatically
open("a.txt").read()
'one\ntwo\n'
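
Roughly speaking, the with statement behaves like a try/finally block that always closes the file. The sketch below is an approximation for illustration, not the exact mechanics, and the scratch path is made up:

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "a.txt")

# Manual version: close() runs even if one of the writes raises an exception.
f = open(path, "w")
try:
    f.write("one\n")
    f.write("two\n")
finally:
    f.close()
print(f.closed)

# Equivalent with statement: the file is closed when the block ends.
with open(path, "w") as g:
    g.write("one\n")
    g.write("two\n")
print(g.closed)
```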

Appending to an existing file

open("a.txt").read()
'one\ntwo\n'
with open("a.txt", "a") as f:
    f.write("three\n")
open("a.txt").read()
'one\ntwo\nthree\n'
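
To see the difference between the two modes side by side, here is a minimal sketch that appends to a file and then reopens it in write mode. The scratch path is made up for this illustration:

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "a.txt")

with open(path, "w") as f:
    f.write("one\n")

# Mode "a" keeps the existing contents and adds to the end.
with open(path, "a") as f:
    f.write("two\n")
with open(path) as f:
    appended = f.read()

# Mode "w" wipes out the existing contents before writing.
with open(path, "w") as f:
    f.write("three\n")
with open(path) as f:
    overwritten = f.read()

print(repr(appended))
print(repr(overwritten))
```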

Reading and Writing Binary Files

!ls -l files/python.png
-rw-r--r-- 1 jupyter-anand jupyter-anand 11155 Dec  7  2021 files/python.png
open("files/python.png").read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

Whenever we open a file in text mode, it is opened with an encoding.

When we read from such a file, Python reads the bytes from the file, decodes them using that encoding, and gives back the text.

The default encoding is usually utf-8.
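
Since the default can differ between systems, the encoding can also be passed explicitly to open. A small sketch, assuming a scratch file in a temporary directory, that shows text mode doing the same decoding we could do by hand on the raw bytes:

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "hindi.txt")

# Write the text using an explicit encoding instead of the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write("अआइई")

# Text mode: Python decodes the bytes for us.
with open(path, encoding="utf-8") as f:
    text = f.read()

# Binary mode: we get the raw bytes and decode them ourselves.
with open(path, "rb") as f:
    raw = f.read()

print(len(text), len(raw))
print(raw.decode("utf-8") == text)
```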

How to read binary files then?

data = open("files/python.png", "rb").read()
data[:20]
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02Y'
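
One common use of the raw bytes is checking what kind of file we have. As a sketch, the first 8 bytes shown above are the standard PNG signature; the helper name looks_like_png is made up for this illustration, and the header value is copied from the output above:

```python
# The 8-byte signature that every PNG file starts with.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def looks_like_png(data):
    """Hypothetical helper: return True if data starts with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)

# The first 20 bytes of files/python.png, as printed above.
header = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02Y"

print(looks_like_png(header))
print(looks_like_png(b"hello"))
```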

Similarly, when we write binary data, we use mode wb.

with open("bytes.bin", "wb") as f:
    f.write(b"\x89\x01\x02\x03\x04")
!ls -l bytes.bin
-rw-r--r-- 1 jupyter-anand jupyter-anand 5 Oct  9 04:54 bytes.bin
open("bytes.bin", "rb").read()
b'\x89\x01\x02\x03\x04'
open("bytes.bin").read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

Problem: Copy File

%load_problem copy-file
Problem: Copy File

Write a program copy_file.py to copy contents of one file to another.

The program should accept a source file and a destination file as arguments and copy the source to the destination.

$ python copy_file.py files/five.txt 5.txt

Note: Don't call this file copy.py, as that would interfere with a standard library module of the same name.

You can verify your solution using:

%verify_problem copy-file

%%file copy_file.py
# your code here


Writing copy_file.py
!python copy_file.py files/python.png python.png
!ls -l files/python.png
-rw-r--r-- 1 jupyter-anand jupyter-anand 11155 Dec  7  2021 files/python.png

Writing Command-line applications

!wc files/words.txt
 10  26 154 files/words.txt
!wc -l files/words.txt
10 files/words.txt
!wc --help
Usage: wc [OPTION]... [FILE]...
  or:  wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified.  A word is a non-zero-length sequence of
printable characters delimited by white space.

With no FILE, or when FILE is -, read standard input.

The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
      --files0-from=F    read input from the files specified by
                           NUL-terminated names in file F;
                           If F is - then read names from standard input
  -L, --max-line-length  print the maximum display width
  -w, --words            print the word counts
      --help        display this help and exit
      --version     output version information and exit

GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
Full documentation <https://www.gnu.org/software/coreutils/wc>
or available locally via: info '(coreutils) wc invocation'
!grep three files/words.txt
one two three
one two three four
one two three four five
two three four five
three four five
one-two-three-four-five-six-seven
!grep -c three files/words.txt
6
!python square.py 5
25
!python square.py --help
Traceback (most recent call last):
  File "/home/jupyter-anand/book/square.py", line 4, in <module>
    n = int(sys.argv[1])
ValueError: invalid literal for int() with base 10: '--help'

Example: hello.py

We’ll build a simple command-line app that just says hello to a name.

%%file hello.py
import argparse

p = argparse.ArgumentParser()
p.add_argument("name", help="name to say hello")

args = p.parse_args()
print(args)
print("Hello", args.name)
Overwriting hello.py
!python hello.py Python
Namespace(name='Python')
Hello Python
!python hello.py 
usage: hello.py [-h] name
hello.py: error: the following arguments are required: name
!python hello.py --help
usage: hello.py [-h] name

positional arguments:
  name        name to say hello

options:
  -h, --help  show this help message and exit

Let’s improve that by adding flags.

%%file hello2.py
import argparse

p = argparse.ArgumentParser()
p.add_argument("name", help="name to say hello")
p.add_argument("-r", "--repeats", type=int, default=1, help="number of times to repeat the message")

args = p.parse_args()
print(args)
for i in range(args.repeats):
    print("Hello", args.name)
Writing hello2.py
!python hello2.py Python
Namespace(name='Python', repeats=1)
Hello Python
!python hello2.py Python -r 4
Namespace(name='Python', repeats=4)
Hello Python
Hello Python
Hello Python
Hello Python
!python hello2.py Python --repeats 4
Namespace(name='Python', repeats=4)
Hello Python
Hello Python
Hello Python
Hello Python
!python hello2.py --help
usage: hello2.py [-h] [-r REPEATS] name

positional arguments:
  name                  name to say hello

options:
  -h, --help            show this help message and exit
  -r REPEATS, --repeats REPEATS
                        number of times to repeat the message

Boolean Flags

%%file hello3.py
import argparse

p = argparse.ArgumentParser()
p.add_argument("name", help="name to say hello")
p.add_argument("-r", "--repeats", type=int, default=1, help="number of times to repeat the message")
p.add_argument("-u", "--uppercase", 
               action="store_true",
               default=False,
               help="convert the message into uppercase")


args = p.parse_args()
# print(args)

msg = f"Hello {args.name}"
if args.uppercase:
    msg = msg.upper()
    
for i in range(args.repeats):
    print(msg)
Overwriting hello3.py
!python hello3.py --help
usage: hello3.py [-h] [-r REPEATS] [-u] name

positional arguments:
  name                  name to say hello

options:
  -h, --help            show this help message and exit
  -r REPEATS, --repeats REPEATS
                        number of times to repeat the message
  -u, --uppercase       convert the message into uppercase
!python hello3.py Python -u
HELLO PYTHON
!python hello3.py Python -u -r 3
HELLO PYTHON
HELLO PYTHON
HELLO PYTHON
!python hello3.py -u Python -r 3
HELLO PYTHON
HELLO PYTHON
HELLO PYTHON
!python hello3.py -u  -r 3 Python
HELLO PYTHON
HELLO PYTHON
HELLO PYTHON
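
The runs above suggest that flags and positional arguments can come in any order. One way to check this without leaving the notebook is to pass a list of arguments to parse_args instead of letting it read the command line. A minimal sketch, rebuilding the same parser as hello3.py:

```python
import argparse

p = argparse.ArgumentParser()
p.add_argument("name", help="name to say hello")
p.add_argument("-r", "--repeats", type=int, default=1,
               help="number of times to repeat the message")
p.add_argument("-u", "--uppercase", action="store_true",
               help="convert the message into uppercase")

# parse_args accepts an explicit list in place of the real command line.
a = p.parse_args(["Python", "-u", "-r", "3"])
b = p.parse_args(["-u", "-r", "3", "Python"])

print(a)
print(a == b)  # both orderings give the same result
```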

Problem: Rewrite the square.py program that we wrote earlier so that it uses argparse.

%%file square.py
import argparse
Overwriting square.py
!python square.py --help
!python square.py 5
!grep --help
Usage: grep [OPTION]... PATTERNS [FILE]...
Search for PATTERNS in each FILE.
Example: grep -i 'hello world' menu.h main.c
PATTERNS can contain multiple patterns separated by newlines.

Pattern selection and interpretation:
  -E, --extended-regexp     PATTERNS are extended regular expressions
  -F, --fixed-strings       PATTERNS are strings
  -G, --basic-regexp        PATTERNS are basic regular expressions
  -P, --perl-regexp         PATTERNS are Perl regular expressions
  -e, --regexp=PATTERNS     use PATTERNS for matching
  -f, --file=FILE           take PATTERNS from FILE
  -i, --ignore-case         ignore case distinctions in patterns and data
      --no-ignore-case      do not ignore case distinctions (default)
  -w, --word-regexp         match only whole words
  -x, --line-regexp         match only whole lines
  -z, --null-data           a data line ends in 0 byte, not newline

Miscellaneous:
  -s, --no-messages         suppress error messages
  -v, --invert-match        select non-matching lines
  -V, --version             display version information and exit
      --help                display this help text and exit

Output control:
  -m, --max-count=NUM       stop after NUM selected lines
  -b, --byte-offset         print the byte offset with output lines
  -n, --line-number         print line number with output lines
      --line-buffered       flush output on every line
  -H, --with-filename       print file name with output lines
  -h, --no-filename         suppress the file name prefix on output
      --label=LABEL         use LABEL as the standard input file name prefix
  -o, --only-matching       show only nonempty parts of lines that match
  -q, --quiet, --silent     suppress all normal output
      --binary-files=TYPE   assume that binary files are TYPE;
                            TYPE is 'binary', 'text', or 'without-match'
  -a, --text                equivalent to --binary-files=text
  -I                        equivalent to --binary-files=without-match
  -d, --directories=ACTION  how to handle directories;
                            ACTION is 'read', 'recurse', or 'skip'
  -D, --devices=ACTION      how to handle devices, FIFOs and sockets;
                            ACTION is 'read' or 'skip'
  -r, --recursive           like --directories=recurse
  -R, --dereference-recursive  likewise, but follow all symlinks
      --include=GLOB        search only files that match GLOB (a file pattern)
      --exclude=GLOB        skip files that match GLOB
      --exclude-from=FILE   skip files that match any file pattern from FILE
      --exclude-dir=GLOB    skip directories that match GLOB
  -L, --files-without-match  print only names of FILEs with no selected lines
  -l, --files-with-matches  print only names of FILEs with selected lines
  -c, --count               print only a count of selected lines per FILE
  -T, --initial-tab         make tabs line up (if needed)
  -Z, --null                print 0 byte after FILE name

Context control:
  -B, --before-context=NUM  print NUM lines of leading context
  -A, --after-context=NUM   print NUM lines of trailing context
  -C, --context=NUM         print NUM lines of output context
  -NUM                      same as --context=NUM
      --group-separator=SEP  print SEP on line between matches with context
      --no-group-separator  do not print separator for matches with context
      --color[=WHEN],
      --colour[=WHEN]       use markers to highlight the matching strings;
                            WHEN is 'always', 'never', or 'auto'
  -U, --binary              do not strip CR characters at EOL (MSDOS/Windows)

When FILE is '-', read standard input.  With no FILE, read '.' if
recursive, '-' otherwise.  With fewer than two FILEs, assume -h.
Exit status is 0 if any line is selected, 1 otherwise;
if any error occurs and -q is not given, the exit status is 2.

Report bugs to: bug-grep@gnu.org
GNU grep home page: <https://www.gnu.org/software/grep/>
General help using GNU software: <https://www.gnu.org/gethelp/>
!grep  "two three" files/words.txt
one two three
one two three four
one two three four five
two three four five
!grep  -c "two three" files/words.txt
4
!grep  --count "two three" files/words.txt
4