File Opening and Reading

Opening a File

We quickly introduce opening, reading and closing a file. We'll follow through with better examples in the Functions lecture next week.

Before you can read a file, you must "open" it using the open() function. At a minimum, it requires a filename (or a pathname to the file, if you are familiar with pathnames under Mac/Linux/Windows). A second parameter is a short string of letters that specifies the mode in which you are opening the file. In this class we will primarily open files for read-only access. We do this by providing the literal string "r" as the second parameter.

If you omit this second parameter, open() defaults to read-only access (mode "r") to that filename, which is the safe choice.
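As a quick check of that default, here is a minimal sketch. It does not use the census file: it first writes a tiny stand-in file (the name sample-names.txt and the temporary folder are assumptions made only so the example runs anywhere), then opens it with and without the mode string.

```python
import os
import tempfile

# Hypothetical stand-in for the census file, so the sketch runs anywhere.
sample_path = os.path.join(tempfile.mkdtemp(), "sample-names.txt")
out = open(sample_path, "w")  # "w" opens for writing; used here only for setup
out.write("MARY           2.629  2.629      1\n")
out.close()

explicit = open(sample_path, "r")  # mode spelled out
implicit = open(sample_path)       # mode omitted

# Each file object records the mode it was opened with.
explicit_mode = explicit.mode
implicit_mode = implicit.mode
print(explicit_mode, implicit_mode)

explicit.close()
implicit.close()
```

Both file objects should report the same read-only mode, so the two open() calls are interchangeable for our purposes.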

If the call succeeds, open() returns a file object, a data type that refers to the file you just opened. You use this object's methods to read, rewind, write (if the file was opened for writing), and perform other operations that make sense for a file.
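A file object should also be closed when you are finished with it. The sketch below shows two ways to do that, using a small stand-in file (sample-names.txt in a temporary folder is an assumption, not the census file). The with statement in the second half is not covered in this section; it is shown only as a common alternative that closes the file for you.

```python
import os
import tempfile

# Hypothetical stand-in for the census file.
sample_path = os.path.join(tempfile.mkdtemp(), "sample-names.txt")
out = open(sample_path, "w")
out.write("MARY           2.629  2.629      1\n")
out.close()

# Option 1: open, use, then close explicitly.
infile = open(sample_path, "r")
text = infile.read()
infile.close()
print(infile.closed)   # the file object remembers that it was closed

# Option 2: a with block closes the file automatically when the block ends.
with open(sample_path, "r") as infile2:
    text_again = infile2.read()
print(infile2.closed)
```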

The file I have opened is a list of female first names from the 1990 census. You can download it here (notice that I renamed it after downloading, adding .txt to the end of the filename): https://www2.census.gov/topics/genealogy/1990surnames/dist.female.first

Place this file in the same folder/directory that contains your Python script or Jupyter Notebook.

In [27]:
infile = open("census-dist-female-first.txt", "r")

Reading a file using .read()

The simplest way to read a file is using the .read() method. When called without any parameters, it reads the entire file and returns it as a single string. Please be aware of this behavior. If you open and read a very large file using .read(), you will get a very large and unwieldy string in return.

You can provide an optional parameter to specify how many characters (bytes, usually) you want to read, e.g. data = infile.read(10) will read 10 characters from a file.

In this example, we read only the first 100 characters of the census text file. We ask len() for the length of the returned string, then slice that string, called data, to show us those first 100 characters.

In [28]:
data = infile.read(100)
In [29]:
len(data)
Out[29]:
100
In [30]:
data[0:100]
Out[30]:
'MARY           2.629  2.629      1\nPATRICIA       1.073  3.702      2\nLINDA          1.035  4.736   '
In [32]:
print(data[0:100])
MARY           2.629  2.629      1
PATRICIA       1.073  3.702      2
LINDA          1.035  4.736   

Seeking (a.k.a. "be kind, rewind")

Files are traditionally viewed as sequential tapes (think cassette or VHS tapes). When you open a file, you begin at the start of the file (position 0). Any access you perform that is piecemeal, such as reading $n$ characters of a file or reading the first $m$ lines of a file, will move your file position by a certain amount within the file. Any subsequent reading will start from that current position.

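Recall that .read(), when called with no parameter value, reads the entire file, which leaves the position at the end. Earlier we read only 100 characters, so .tell() reports a position just past them, and .seek(0) rewinds to the start of the file. The position is counted in bytes rather than characters, which is likely why it shows 102 instead of 100 here (for example, if each line ending occupies two bytes in the file).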

In [33]:
infile.tell()
Out[33]:
102
In [6]:
infile.seek(0)
Out[6]:
0
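To make the tape analogy concrete, here is a minimal sketch of reading in pieces, checking the position, and rewinding. It uses a small stand-in file rather than the census download (the filename and temporary folder are assumptions), and it only reads within the first line so that line endings do not affect the reported position.

```python
import os
import tempfile

# Hypothetical two-line stand-in for the census file.
sample_path = os.path.join(tempfile.mkdtemp(), "sample-names.txt")
out = open(sample_path, "w")
out.write("MARY           2.629  2.629      1\n")
out.write("PATRICIA       1.073  3.702      2\n")
out.close()

infile = open(sample_path, "r")

first = infile.read(4)            # reads 'MARY' and advances the position
pos_after_first = infile.tell()   # where the next read will start
second = infile.read(4)           # continues from there: four padding spaces

rewound_to = infile.seek(0)       # "be kind, rewind": back to position 0
again = infile.read(4)            # same four characters as the first read

infile.close()

print(repr(first), pos_after_first, repr(second), rewound_to, repr(again))
```

The second read does not start over; it picks up where the first one stopped. Only after seek(0) does a read return the start of the file again.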

Reading Rainbow

There is a spectrum of methods for reading files (particularly text files, whose data consists of byte values that fall within the ASCII table):

  • .read(): Reads a file from the current position to the end (usually, from the start to end).
  • .readline(): Reads a file one, line, at, a time. Useful if you have to step through the lines of a file very deliberately.
  • .readlines(): Reads the file from the current position to the end (usually, from the start to end for most use cases), and returns a list of lines from that file.
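The three methods listed above can be compared side by side in a short sketch. It runs against a three-line stand-in file, not the census download (the filename, the temporary folder, and the variable names are assumptions for illustration).

```python
import os
import tempfile

# Hypothetical three-line stand-in for the census file.
rows = [
    "MARY           2.629  2.629      1\n",
    "PATRICIA       1.073  3.702      2\n",
    "LINDA          1.035  4.736      3\n",
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample-names.txt")
out = open(sample_path, "w")
for row in rows:
    out.write(row)
out.close()

# .read(): everything from the current position to the end, as one string.
infile = open(sample_path, "r")
whole = infile.read()
infile.close()

# .readline(): one line per call, newline included.
infile = open(sample_path, "r")
first = infile.readline()
second = infile.readline()
infile.close()

# .readlines(): the remaining lines, returned as a list of strings.
infile = open(sample_path, "r")
all_lines = infile.readlines()
infile.close()

print(type(whole).__name__, len(whole))
print(repr(first))
print(repr(second))
print(type(all_lines).__name__, len(all_lines))
```

The same data comes back in three shapes: one long string, one line at a time, or a list of lines. Which one is convenient depends on what you plan to do with the data next.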

What is a 'line'?

Visually, you know a line as a run of text that starts at the left of the screen and ends where the text breaks and the next line begins. What a line is feels natural and like common sense to us. A computer, however, needs an explicit way to mark the end of a line, because to it all data is just a continuous stream of binary values.

You may see "\n" in some of the text you read in from a file. This is the "new line" or "line feed" character, usually just called "newline". It is a special literal notation for strings that says "this line of text ends here, begin a new one". It's the character produced when you press enter/return.

A literal string can have a newline embedded in it:

In [34]:
multi_line = "This is a\ntwo-line string\n"
print(multi_line)
This is a
two-line string
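Building on the same two-line string, this small sketch shows that "\n" is a single character inside the string even though it is typed as two, and how to see it or split on it. The variable names newline_length, newline_count, and pieces are just for illustration.

```python
multi_line = "This is a\ntwo-line string\n"

# "\n" is written with two keystrokes but stored as one character.
newline_length = len("\n")
print(newline_length)

# repr() shows the newline as \n instead of acting on it, like Out[...] does.
print(repr(multi_line))

# Count the newlines, then break the string into its lines.
newline_count = multi_line.count("\n")
pieces = multi_line.splitlines()   # the newlines themselves are dropped
print(newline_count, pieces)
```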

In [7]:
infile.readline()
Out[7]:
'MARY           2.629  2.629      1\n'
In [8]:
infile.readline()
Out[8]:
'PATRICIA       1.073  3.702      2\n'
In [9]:
infile.readline()
Out[9]:
'LINDA          1.035  4.736      3\n'
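Each string returned by .readline() still carries its trailing newline and its padding spaces. As a hedged sketch of one way to tidy such a line, the code below strips the newline and splits on whitespace. The column meanings (name, frequency, cumulative frequency, rank) are an assumption inferred from the values, and the variable names are hypothetical.

```python
# One line exactly as .readline() returned it above.
line = "MARY           2.629  2.629      1\n"

clean = line.rstrip("\n")   # drop the trailing newline only
fields = clean.split()      # split on runs of whitespace

# Assumed column meanings, inferred from the data.
name = fields[0]
frequency = float(fields[1])
cumulative = float(fields[2])
rank = int(fields[3])

print(fields)
print(name, frequency, cumulative, rank)
```

Converting the numeric columns with float() and int() turns the text into values you can do arithmetic with.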
In [10]:
infile.seek(0)
lines = infile.readlines()
In [11]:
type(lines)
Out[11]:
list
In [12]:
len(lines)
Out[12]:
4275
In [13]:
print(lines[0:100])
['MARY           2.629  2.629      1\n', 'PATRICIA       1.073  3.702      2\n', 'LINDA          1.035  4.736      3\n', 'BARBARA        0.980  5.716      4\n', 'ELIZABETH      0.937  6.653      5\n', 'JENNIFER       0.932  7.586      6\n', 'MARIA          0.828  8.414      7\n', 'SUSAN          0.794  9.209      8\n', 'MARGARET       0.768  9.976      9\n', 'DOROTHY        0.727 10.703     10\n', 'LISA           0.704 11.407     11\n', 'NANCY          0.669 12.075     12\n', 'KAREN          0.667 12.742     13\n', 'BETTY          0.666 13.408     14\n', 'HELEN          0.663 14.071     15\n', 'SANDRA         0.629 14.700     16\n', 'DONNA          0.583 15.282     17\n', 'CAROL          0.565 15.848     18\n', 'RUTH           0.562 16.410     19\n', 'SHARON         0.522 16.932     20\n', 'MICHELLE       0.519 17.451     21\n', 'LAURA          0.510 17.961     22\n', 'SARAH          0.508 18.469     23\n', 'KIMBERLY       0.504 18.973     24\n', 'DEBORAH        0.494 19.467     25\n', 'JESSICA        0.490 19.958     26\n', 'SHIRLEY        0.482 20.439     27\n', 'CYNTHIA        0.469 20.908     28\n', 'ANGELA         0.468 21.376     29\n', 'MELISSA        0.462 21.839     30\n', 'BRENDA         0.455 22.293     31\n', 'AMY            0.451 22.745     32\n', 'ANNA           0.440 23.185     33\n', 'REBECCA        0.430 23.615     34\n', 'VIRGINIA       0.430 24.044     35\n', 'KATHLEEN       0.424 24.468     36\n', 'PAMELA         0.416 24.884     37\n', 'MARTHA         0.412 25.297     38\n', 'DEBRA          0.408 25.704     39\n', 'AMANDA         0.404 26.108     40\n', 'STEPHANIE      0.400 26.508     41\n', 'CAROLYN        0.385 26.893     42\n', 'CHRISTINE      0.382 27.275     43\n', 'MARIE          0.379 27.655     44\n', 'JANET          0.379 28.034     45\n', 'CATHERINE      0.373 28.408     46\n', 'FRANCES        0.370 28.777     47\n', 'ANN            0.364 29.141     48\n', 'JOYCE          0.364 29.505     49\n', 'DIANE          0.359 29.864     50\n', 'ALICE          0.357 30.221     51\n', 'JULIE          0.348 30.568     52\n', 'HEATHER        0.337 30.905     53\n', 'TERESA         0.336 31.241     54\n', 'DORIS          0.335 31.577     55\n', 'GLORIA         0.335 31.912     56\n', 'EVELYN         0.322 32.233     57\n', 'JEAN           0.315 32.548     58\n', 'CHERYL         0.315 32.863     59\n', 'MILDRED        0.313 33.176     60\n', 'KATHERINE      0.313 33.489     61\n', 'JOAN           0.306 33.795     62\n', 'ASHLEY         0.303 34.098     63\n', 'JUDITH         0.297 34.395     64\n', 'ROSE           0.296 34.691     65\n', 'JANICE         0.285 34.975     66\n', 'KELLY          0.283 35.258     67\n', 'NICOLE         0.281 35.539     68\n', 'JUDY           0.276 35.815     69\n', 'CHRISTINA      0.275 36.090     70\n', 'KATHY          0.272 36.362     71\n', 'THERESA        0.271 36.633     72\n', 'BEVERLY        0.267 36.900     73\n', 'DENISE         0.264 37.164     74\n', 'TAMMY          0.259 37.423     75\n', 'IRENE          0.252 37.675     76\n', 'JANE           0.250 37.925     77\n', 'LORI           0.248 38.173     78\n', 'RACHEL         0.242 38.415     79\n', 'MARILYN        0.241 38.657     80\n', 'ANDREA         0.236 38.893     81\n', 'KATHRYN        0.234 39.127     82\n', 'LOUISE         0.229 39.356     83\n', 'SARA           0.229 39.584     84\n', 'ANNE           0.228 39.812     85\n', 'JACQUELINE     0.228 40.040     86\n', 'WANDA          0.226 40.266     87\n', 'BONNIE         0.223 40.489     88\n', 'JULIA          0.223 40.711     89\n', 'RUBY           0.221 40.932     90\n', 'LOIS           0.220 41.153     91\n', 'TINA           0.220 41.372     92\n', 'PHYLLIS        0.219 41.591     93\n', 'NORMA          0.218 41.809     94\n', 'PAULA          0.217 42.026     95\n', 'DIANA          0.216 42.242     96\n', 'ANNIE          0.216 42.458     97\n', 'LILLIAN        0.211 42.669     98\n', 'EMILY          0.208 42.877     99\n', 'ROBIN          0.208 43.085    100\n']
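Printing a slice of the list directly produces one long, hard-to-read row. As a sketch of a friendlier alternative, the loop below prints one entry per line and builds a second list without the trailing newlines. The three-item list is a stand-in for what .readlines() returned, and the name stripped is hypothetical.

```python
# Stand-in for the list returned by infile.readlines().
lines = [
    "MARY           2.629  2.629      1\n",
    "PATRICIA       1.073  3.702      2\n",
    "LINDA          1.035  4.736      3\n",
]

# Each string already ends in "\n", so tell print() not to add another.
for line in lines[0:3]:
    print(line, end="")

# Build a cleaned-up copy with the trailing newlines removed.
stripped = []
for line in lines:
    stripped.append(line.rstrip("\n"))

print(stripped[0])
```

Without end="", every entry would be followed by a blank line, because print() adds its own newline after the one already in the string.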