Lecture 13

File Input

23 Brumaire, Year CCXXXI

Song of the day: (Gimme Some of That) Ol' Atonal Music by Merle Hazard feat. Alison Brown (2019).

Sections

  1. Functions Review
  2. Reading Data From Files
    1. Opening The File
    2. Reading From The File
    3. Iterating Through The File
    4. Closing The File

Part 1: Functions Review

Write a function that accepts two strings (i.e. it has two parameters, both expected to be strings) and returns the Hamming Distance between them. For our purposes, the Hamming Distance is the number of positions at which the two strings differ, where each extra character in the longer string counts as one difference. So, cat and cats have a Hamming Distance of 1, and bat and car have a distance of 2. As you can see, you cannot assume that both strings are the same length.

Solution
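One possible approach, offered as a sketch rather than the official answer: compare the characters that both strings share position by position, and count every extra character in the longer string as one difference. The function name hamming_distance and its parameter names are assumptions, since the exercise does not prescribe them.

```python
def hamming_distance(first, second):
    # Every extra character in the longer string counts as one difference
    shorter_length = min(len(first), len(second))
    distance = abs(len(first) - len(second))

    # Compare the overlapping characters position by position
    for idx in range(shorter_length):
        if first[idx] != second[idx]:
            distance += 1

    return distance


def main():
    print(hamming_distance("cat", "cats"))  # 1
    print(hamming_distance("bat", "car"))   # 2


main()
```

Starting the count at the difference in lengths means the loop only ever has to look at positions that exist in both strings.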

Part 2: Reading Data From Files

Thus far, we've been getting our user input from the console directly into our programs. This may or may not be obvious to you, but the applications that we want to eventually write don't interact with the console at all. In fact, ask yourself: have you, as a user, ever had to use the console to use an Apple or Google application? Of course not. The console is not for regular users, but for programmers.

So, if our programs are not getting their data from the console, where is it coming from? The easy answer to this is "the internet," but it's a little more complicated than that. Data is often sent, used and received in batches, and those batches are often files that contain the data necessary for the application to work properly.

For example, let's say that we were making a weather application. For this application, we would need the temperature reading from every city around the world. This data would, of course, not come from the console, but from the internet. Say that the readings that we received from the weather stations came back in a text file such as this one:

City,Temperature
San Francisco,47
Manhattan,57
Boston,51
Houston,68
St. James,50

This type of formatting is often called comma-separated values (CSV), as each element is, indeed, separated by a comma. The way that you are supposed to think of these files is sort of like an Excel spreadsheet:

| City | Temperature |
| --- | --- |
| San Francisco | 47 |
| Manhattan | 57 |
| Boston | 51 |
| Houston | 68 |
| St. James | 50 |

Figure 1: Tabular version of the CSV data shown above.

Let's set the following goal for ourselves. Let's write a function that:

  1. Accepts the name of this data file as an argument.
  2. Opens this data file.
  3. Reads each line from the file.
  4. Displays the Fahrenheit weather data in Celsius for each city.

So, something that would work like this:

def main():
    print_temperatures("data.csv")


main()

Output:

The temperature in San Francisco is 8.33°C.
The temperature in Manhattan is 13.89°C.
The temperature in Boston is 10.56°C.
The temperature in Houston is 20.0°C.
The temperature in St. James is 10.0°C.

We can use the following function to help us make a quick job out of temperature conversions:

def get_celsius(fahrenheit_temp):
    celsius = (fahrenheit_temp - 32.0) * (5.0 / 9.0)

    return celsius

This is a perfect way of introducing files, so let's get to it.

Step 1: Opening the file

Our function "skeleton" might look like this:

def print_temperatures(filepath):
    pass

The first step is, of course, to open our file up:

file_obj = open(filepath, 'r')

So, what's going on here? The open() function (as far as this course is concerned) accepts two arguments:

  1. The path of the file you want to open, in str form. Since data.csv exists in the same directory (folder) as our current file, we don't have to give the full file path, but we certainly could do so (/Users/sebastianromerocruz/Documents/NYU_Adjunct/CS1114-material/lectures/file_input/data.csv, in my case) and it would work just the same.
  2. The mode that you want to open the file with. In order to read the contents of a file, we use 'r', which stands for "read mode".

Notice that it's not enough to just make a call to open()—we also have to save its returned value into a variable (in this case, the variable file_obj).

def print_temperatures(filepath):
    # STEP 1: Open the file in read mode
    file_obj = open(filepath, 'r')
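To see what open() hands back, here is a small sketch. So that it can run anywhere, it first creates its own sample file in the system's temporary folder. That setup step, the file name lecture13_open_demo.csv, and the variable names are assumptions made for illustration only (writing to files belongs to a later lecture).

```python
import os
import tempfile

# Setup only: create a tiny sample file so this sketch is self-contained
sample_path = os.path.join(tempfile.gettempdir(), "lecture13_open_demo.csv")
setup_file = open(sample_path, 'w')
setup_file.write("City,Temperature\nBoston,51\n")
setup_file.close()

# Open the file in read mode and save the returned file object
file_obj = open(sample_path, 'r')

print(file_obj.name)    # the path that we passed in
print(file_obj.mode)    # r
print(file_obj.closed)  # False, since the file is currently open

file_obj.close()
```

The file object remembers both arguments that we gave to open(), which is why saving it into a variable matters: every later operation goes through that variable.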

Step 2: Reading from the file

Once we have this file open, we can operate on it in a number of ways. Here are some of the file methods that we can, and will, use in this class:

| Method | Example | Description |
| --- | --- | --- |
| read() | file_obj.read() | Reads the entire file as a single string. |
| readline() | file_obj.readline() | Returns the next line of the file, with all text up to and including the newline character. |
| readlines() | file_obj.readlines() | Returns a list of strings, each representing a single line of the file. Note that we haven't introduced lists yet, so I would really stay away from this one for the time being. |

Figure 2: File object methods. See full documentation here.
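As a rough illustration of how the three methods differ, the sketch below reads the same small file three times. The sample file, its name (lecture13_read_demo.csv), and the variable names are assumptions for demonstration; repr() is used only to make the newline characters visible.

```python
import os
import tempfile

# Setup only: create a tiny sample file so this sketch is self-contained
sample_path = os.path.join(tempfile.gettempdir(), "lecture13_read_demo.csv")
setup_file = open(sample_path, 'w')
setup_file.write("City,Temperature\nBoston,51\nHouston,68\n")
setup_file.close()

# read(): the whole file as one string
file_obj = open(sample_path, 'r')
whole_file = file_obj.read()
file_obj.close()

# readline(): one line per call, newline included
file_obj = open(sample_path, 'r')
first_line = file_obj.readline()
second_line = file_obj.readline()
file_obj.close()

# readlines(): every line at once, collected in a list
file_obj = open(sample_path, 'r')
all_lines = file_obj.readlines()
file_obj.close()

print(repr(whole_file))   # 'City,Temperature\nBoston,51\nHouston,68\n'
print(repr(first_line))   # 'City,Temperature\n'
print(repr(second_line))  # 'Boston,51\n'
print(all_lines)          # ['City,Temperature\n', 'Boston,51\n', 'Houston,68\n']
```

The file is reopened before each method so that every read starts from the top of the file.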

So how does Python know when a new line starts? For us, it is easy to tell by visually recognizing that one line sits below another, but programming languages need specific instructions as to how to recognize these things. It turns out that if you extract a line from data.csv and print its exact contents, it looks like this:

file_path = "data.csv"
file_obj = open(file_path, 'r')

first_line = file_obj.readline()
print(repr(first_line))  # repr() shows the quotes and the "hidden" \n

Output:

'City,Temperature\n'

What do we see at the end of the line? That's right: a \n (newline) character! The same way we print a new line using, say, the print() function, Python reads this "hidden" character and recognizes that this represents a break in the line.
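A quick way to convince yourself that the newline really is part of the string is sketched below on a hard-coded line, so no file is needed. The variable name line is just an illustrative choice, and repr() is a built-in function that displays a string with its quotes and escape characters.

```python
line = "City,Temperature\n"

print(line)                 # the text, followed by an extra blank line
print(repr(line))           # 'City,Temperature\n'
print(len(line))            # 17: sixteen visible characters plus the newline
print(line.endswith("\n"))  # True
```

The blank line after the first print() appears because print() adds its own newline after the one that is already stored in the string.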

It is almost always the case that you will have trailing (or preceding) whitespace characters when reading lines from a file (e.g. ' ', '\n', '\t'), so it'd be great to have a quick way to get rid of them, as they can cause subtle bugs when comparing or displaying strings (for example, "Boston\n" == "Boston" evaluates to False, even though float("4.5\n") happens to work because float() ignores surrounding whitespace). The string method strip() does just the job:

string_with_whitespace = "\t    Hello, World!\n\n   \n"
string_without_whitespace = string_with_whitespace.strip()

print(string_without_whitespace)

Output:

Hello, World!

As you can see, strip() removes both the whitespace that precedes the first non-whitespace character and the whitespace that trails after the last non-whitespace character.
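Here is a short sketch of where leftover whitespace tends to bite, using hard-coded strings that stand in for lines read from a file. The variable names are assumptions for illustration.

```python
raw_city = "Boston\n"
raw_padded_city = "\tSt. James \n"
raw_temperature = "  51\n"

# A trailing newline makes two "equal-looking" strings unequal
print(raw_city == "Boston")          # False
print(raw_city.strip() == "Boston")  # True

# strip() only touches the ends; the space inside the name survives
print(repr(raw_padded_city.strip()))  # 'St. James'

# Stripping before converting keeps the intent explicit
print(float(raw_temperature.strip()))  # 51.0
```

Stripping each line once, right after reading it, keeps every later comparison and slice free of surprises.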

Step 3: Iterating through the file

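Something about file objects in Python that is extremely convenient for us is that a for-loop can step through them one line at a time, much like it steps through a sequence. (Strictly speaking, file objects are iterables rather than full sequences: you can loop over them, but you cannot index into them or take their length.) Other objects that we have looped over in the past are strings and range objects, so what does that mean for us?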

Well, if our file has 1,000,000 lines, we would have to call readline() 1,000,000 times. Of course, this is possible with a for-loop, but there is a much neater way of doing this:

for line in file_obj:
    print(line)

Output:

City,Temperature

San Francisco,47

Manhattan,57

Boston,51

Houston,68

St. James,50

Isn't that nice? We don't even need to know how many lines a file has. Python will safely iterate through it the same way it iterates through any other sequence, and we can consider each of its lines individually.
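For comparison, here is a sketch of the same traversal done with readline() alone. It relies on readline() returning an empty string once the end of the file is reached. It also passes end="" to print(), which avoids the blank lines seen in the output above (they come from print() adding its own newline after the one already in each line). The sample file, its name, and the counter variables are assumptions for illustration.

```python
import os
import tempfile

# Setup only: create a tiny sample file so this sketch is self-contained
sample_path = os.path.join(tempfile.gettempdir(), "lecture13_iterate_demo.csv")
setup_file = open(sample_path, 'w')
setup_file.write("City,Temperature\nBoston,51\nHouston,68\n")
setup_file.close()

# Version 1: keep calling readline() until it returns an empty string
file_obj = open(sample_path, 'r')
readline_count = 0
line = file_obj.readline()
while line != "":
    print(line, end="")  # the line already ends in \n, so don't add another
    readline_count += 1
    line = file_obj.readline()
file_obj.close()

# Version 2: let the for-loop do the bookkeeping
file_obj = open(sample_path, 'r')
loop_count = 0
for line in file_obj:
    loop_count += 1
file_obj.close()

print(readline_count, loop_count)  # 3 3
```

Both versions visit the same three lines; the for-loop simply hides the "read the next line, check whether we are done" steps.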

So, let's throw this into our function. I'm going to use the find() string method along with string slicing to isolate both the city name and its corresponding temperature from each line:

def print_temperatures(filepath):
    # STEP 1: Open the file
    temp_file = open(filepath, 'r')  # opening filepath in read mode ('r')

    # STEP 2: Extract data from file
    for line in temp_file:
        # Convert to celsius
        line = line.strip()                      # get rid of potential new lines
        comma_idx = line.find(',')               # find location of the separating comma
        city_name = line[:comma_idx]             # isolate the city name
        temperature = line[comma_idx + 1:]       # isolate the temperature
        temperature = float(temperature)         # cast temperature str into float
        celsius_temp = get_celsius(temperature)  # get celsius equivalent

        # Print result
        print(f"The temperature in {city_name} is {round(celsius_temp, 2)}°C.")

Running this, we get the following:

Traceback (most recent call last):
  File "/Users/sebastianromerocruz/Desktop/lecture/lecture.py", line 43, in <module>
    main()
  File "/Users/sebastianromerocruz/Desktop/lecture/lecture.py", line 40, in main
    print_temperatures("data.csv")
  File "/Users/sebastianromerocruz/Desktop/lecture/lecture.py", line 30, in print_temperatures
    temperature = float(temperature)         # cast temperature str into float
ValueError: could not convert string to float: 'Temperature'

Uh oh. What happened? Following the error trace, it looks like our temperature = float(temperature) line failed because we passed the string "Temperature" into float(). This string comes from the first line of the file, often called the header of a file. This line is not actually part of the data, but rather tells us what each column in the "comma-grid" is supposed to be populated with. In order to fix this problem, we have to tell Python to read (and discard) the first line before iterating through the rest of the data using our for-loop.
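To make the failure concrete, the sketch below applies the same find() and slicing steps to two hard-coded lines, one from the data and one from the header. The variable names are assumptions for illustration.

```python
header_line = "City,Temperature"
data_line = "San Francisco,47"

# Data line: everything before the comma is the city, everything after is the number
data_comma_idx = data_line.find(',')
data_city = data_line[:data_comma_idx]
data_temperature = data_line[data_comma_idx + 1:]

# Header line: the exact same steps, but the second piece is a word
header_comma_idx = header_line.find(',')
header_city = header_line[:header_comma_idx]
header_temperature = header_line[header_comma_idx + 1:]

print(data_comma_idx)           # 13
print(data_city)                # San Francisco
print(float(data_temperature))  # 47.0
print(header_temperature)       # Temperature
# float(header_temperature) is the call that raises the ValueError in the traceback
```

The slicing itself works on both lines; it is only the conversion to float that cannot cope with the header.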

Using readline(), this is not a problem:

def print_temperatures(filepath):
    # STEP 1: Open the file
    temp_file = open(filepath, 'r')  # opening filepath in read mode ('r')

    # STEP 1.5: Skip the header
    temp_file.readline()  # can skip saving to variable if you don't actually need the header

    # STEP 2: Extract data from file
    for line in temp_file:
        # Convert to celsius
        line = line.strip()                      # get rid of potential new lines
        comma_idx = line.find(',')               # find location of the separating comma
        city_name = line[:comma_idx]             # isolate the city name
        temperature = line[comma_idx + 1:]       # isolate the temperature
        temperature = float(temperature)         # cast temperature str into float
        celsius_temp = get_celsius(temperature)  # get celsius equivalent

        # Print result
        print(f"The temperature in {city_name} is {round(celsius_temp, 2)}°C.")

This works because every time Python reads a line (either via readline() or via a for-loop), it moves on to the following line in the file and, with the tools we have seen so far, does not go back. Therefore, if we call readline() once before the for-loop, the first line that the loop will see is:

San Francisco,47

instead of the header.
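This "moving forward" behavior can be observed directly. The following sketch reads the header with readline(), records the first line that the for-loop receives, and then tries one more readline() after the loop has finished. The sample file, its name, and the variable names are assumptions for illustration.

```python
import os
import tempfile

# Setup only: create a tiny sample file so this sketch is self-contained
sample_path = os.path.join(tempfile.gettempdir(), "lecture13_header_demo.csv")
setup_file = open(sample_path, 'w')
setup_file.write("City,Temperature\nSan Francisco,47\nManhattan,57\n")
setup_file.close()

file_obj = open(sample_path, 'r')

# Reading the header moves Python past the first line
header = file_obj.readline()

# The loop picks up wherever the previous read left off
first_seen_by_loop = ""
for line in file_obj:
    if first_seen_by_loop == "":
        first_seen_by_loop = line.strip()

# Nothing is left once the loop has gone through the whole file
leftover = file_obj.readline()

file_obj.close()

print(header.strip())      # City,Temperature
print(first_seen_by_loop)  # San Francisco,47
print(repr(leftover))      # ''
```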

If you run your function now, it looks like everything is working according to plan:

The temperature in San Francisco is 8.33°C.
The temperature in Manhattan is 13.89°C.
The temperature in Boston is 10.56°C.
The temperature in Houston is 20.0°C.
The temperature in St. James is 10.0°C.

So we're done here. Or are we?

Step 4: Closing your file

We have to do one last step, and that is to close our file:

def print_temperatures(filepath):
    temp_file = open(filepath, 'r')

    temp_file.readline()

    for line in temp_file:
        line = line.strip()
        comma_idx = line.find(',')
        city_name = line[:comma_idx]
        temperature = line[comma_idx + 1:]
        temperature = float(temperature)
        celsius_temp = get_celsius(temperature)

        print(f"The temperature in {city_name} is {round(celsius_temp, 2)}°C.")

    temp_file.close()  # right here!

Closing your file, while not required in this class, becomes a very important step later down the line in your computer science career. If you don't close the file that you are working on, the operating system may keep it tied up, and other functions, processes, and/or apps that want to use this file may not be able to, which can cause the whole operation to fail.
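If you ever want to check whether a file object is still open, file objects carry a closed attribute. The sketch below shows it flipping from False to True. The sample file, its name, and the variable names are assumptions for illustration.

```python
import os
import tempfile

# Setup only: create a tiny sample file so this sketch is self-contained
sample_path = os.path.join(tempfile.gettempdir(), "lecture13_close_demo.csv")
setup_file = open(sample_path, 'w')
setup_file.write("City,Temperature\nBoston,51\n")
setup_file.close()

temp_file = open(sample_path, 'r')
closed_before = temp_file.closed  # the file is open at this point

temp_file.readline()              # do whatever work we need

temp_file.close()
closed_after = temp_file.closed   # the file has been released

print(closed_before)  # False
print(closed_after)   # True
```

Placing close() after the loop, at the same indentation level as the for statement, ensures that it runs once, after all of the lines have been processed.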

Previous: Functions: return | Next: File Output and Exceptions