Introduction to File Handling in Python

1. Introduction to File Handling in Python

Overview of File Handling and Its Importance in Data Science

File handling is a fundamental part of any programming language: it lets you store, retrieve, and manipulate data in files. In data science, file handling is particularly important because it gives data scientists access to data stored in various formats, such as CSV, JSON, and Excel. It matters at every stage of a project, from data preprocessing and analysis to model training and evaluation, where datasets are read from files and results are written back to them.

Data in the real world is messy and often scattered across different file formats. Knowing how to handle these files properly allows you to automate data cleaning, transform raw data into a format that can be analyzed, and eventually extract valuable insights from it.

Python's Built-in File Handling Methods

Python provides built-in functions and methods for file operations, which makes working with files straightforward. The basic operations are opening a file, reading from it, writing to it, and closing it when you're done. The mode passed to open(), such as read (r), write (w), or append (a), determines which of these operations are allowed. Here's a brief introduction to these fundamental methods:

  • Open: Before you can read from or write to a file, you need to open it using the open() function. This function requires the path to the file and optionally the mode in which to open the file. If the mode is not specified, it defaults to read mode.
file = open('example.txt', 'r')


  • Read: Once a file is opened in read mode, you can use the read(), readline(), or readlines() methods to read its contents.
    • read() reads the entire file.
    • readline() reads the next line.
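    • readlines() reads all remaining lines and returns them as a list of strings.
content = file.read()


  • Write: To write to a file, you must open it in write (w) or append (a) mode. Write mode creates the file if it does not exist and overwrites any existing content, while append mode adds new text to the end of the file. Then you can use the write() or writelines() method to write text to the file.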
file = open('example.txt', 'w')
file.write('Hello, world!')


  • Close: It is crucial to close a file when you're done with it to free up system resources. Use the close() method for this purpose.
file.close()


  • With Statement: Python also supports the with statement, which makes file handling easier by closing the file automatically when the block ends, even if an error occurs during file operations. This is the recommended way to work with files.
with open('example.txt', 'r') as file:
    content = file.read()
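
The operations above can be combined into one short round trip. The following is a minimal sketch, not a definitive recipe: the file name modes_demo.txt and the temporary directory are assumptions chosen so that no real file is overwritten, and encoding='utf-8' is an explicit choice rather than something the text above requires.

```python
import os
import tempfile

# Hypothetical scratch file inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

# 'w' creates the file, or truncates it if it already exists.
with open(path, 'w', encoding='utf-8') as file:
    file.write('first line\n')

# 'a' keeps the existing content and adds to the end of the file.
with open(path, 'a', encoding='utf-8') as file:
    file.write('second line\n')

# 'r' (the default mode) opens the file for reading only.
with open(path, 'r', encoding='utf-8') as file:
    content = file.read()

print(content)
```

Swapping the 'a' in the second block for 'w' would leave only the second line in the file, which is the practical difference between the two modes.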


Understanding these basic operations is the first step in mastering file handling in Python, setting a solid foundation for dealing with more complex data formats and operations in data science projects.

2. Working with Plain Text Files

Plain text files, often having a .txt extension, are one of the simplest and most common file types you'll encounter. They store data in a readable format without any formatting or binary data. In this section, we'll explore how to perform basic file operations like reading from and writing to plain text files using Python. Understanding these operations is crucial for preprocessing and analyzing data in data science projects.

Reading from Plain Text Files

Python offers several methods to read from a file:

  • read(): Reads the entire content of the file into a single string.
  • readline(): Reads the next line from the file, including the newline character.
  • readlines(): Reads all the lines in a file and returns them as a list of strings.
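
To make the difference between the three methods concrete, here is a small sketch that calls each of them on the same file. The file name read_demo.txt and its three-line content are assumptions made up for illustration.

```python
import os
import tempfile

# Hypothetical sample file with three short lines.
path = os.path.join(tempfile.mkdtemp(), 'read_demo.txt')
with open(path, 'w', encoding='utf-8') as file:
    file.write('alpha\nbeta\ngamma\n')

with open(path, 'r', encoding='utf-8') as file:
    whole = file.read()       # the full content as one string
    file.seek(0)              # move back to the start of the file
    first = file.readline()   # the next line, newline included
    rest = file.readlines()   # every line that has not been read yet

print(repr(whole))  # 'alpha\nbeta\ngamma\n'
print(repr(first))  # 'alpha\n'
print(rest)         # ['beta\n', 'gamma\n']
```

Each method continues from the current position in the file, which is why readlines() returns only the two lines that follow the one already consumed by readline().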

Example of reading an entire file:

with open('example.txt', 'r') as file:
    content = file.read()
    print(content)

Example of reading a file line by line:

with open('example.txt', 'r') as file:
    for line in file:
        print(line, end='')

Writing to Plain Text Files

To write to a file, you can use the following methods:

  • write(): Writes a string to the file.
  • writelines(): Writes a list of strings to the file.

Example of writing to a file:

with open('output.txt', 'w') as file:
    file.write('Hello, Python!\n')

Example of writing multiple lines to a file:

lines = ['First line\n', 'Second line\n']
with open('output.txt', 'w') as file:
    file.writelines(lines)
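
One detail worth seeing in action: writelines() does not add newline characters for you, which is why the strings in the list above already end in \n. The sketch below, which uses a hypothetical file named colors.txt, shows what happens with and without them.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'colors.txt')
words = ['red', 'green', 'blue']

# Without newline characters, the strings are written back to back.
with open(path, 'w', encoding='utf-8') as file:
    file.writelines(words)
with open(path, 'r', encoding='utf-8') as file:
    joined = file.read()

# Appending '\n' to each string puts every item on its own line.
with open(path, 'w', encoding='utf-8') as file:
    file.writelines(word + '\n' for word in words)
with open(path, 'r', encoding='utf-8') as file:
    separated = file.read()

print(repr(joined))     # 'redgreenblue'
print(repr(separated))  # 'red\ngreen\nblue\n'
```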


Using Context Managers for File Operations

The with statement in Python treats the file object returned by open() as a context manager. The open() call still opens the file; what the with statement adds is a guarantee that the file is closed afterwards, which is both concise and a clean way to ensure that resources are properly managed. When the block inside the with statement is exited, the close() method is called automatically on the file object, even if an exception occurs within the block.
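
The closing behavior can be checked directly through the closed attribute of the file object. The following sketch assumes a hypothetical file numbers.txt whose second line is deliberately not a number, so that an exception is raised inside the with block.

```python
import os
import tempfile

# Hypothetical input file; the second line cannot be converted to an integer.
path = os.path.join(tempfile.mkdtemp(), 'numbers.txt')
with open(path, 'w', encoding='utf-8') as file:
    file.write('42\nnot a number\n')

try:
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            value = int(line)  # raises ValueError on the second line
except ValueError:
    pass  # the error is expected in this demonstration

# The with block was left through an exception, yet the file is closed.
closed_after_error = file.closed
print(closed_after_error)  # True
```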

Exercise: Read a Text File and Count the Number of Words

Let's put what you've learned into practice. The following exercise will help you apply the basics of file handling by counting the number of words in a text file.

def count_words_in_file(filename):
    with open(filename, 'r') as file:
        content = file.read()
        words = content.split()
        return len(words)

filename = 'example.txt'
word_count = count_words_in_file(filename)
print(f'The file {filename} contains {word_count} words.')

This exercise demonstrates how to read a file, process its content, and perform a simple analysis. Exercises like this are foundational in data science, where reading and processing text data is a common task.
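
For large files, reading everything with read() may use more memory than necessary. As one possible variation, assuming the same whitespace-based definition of a word, the count can be accumulated line by line. The function name count_words_by_line and the sample file notes.txt are hypothetical.

```python
import os
import tempfile

def count_words_by_line(filename):
    """Count whitespace-separated words without loading the whole file."""
    total = 0
    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:
            total += len(line.split())  # blank lines contribute 0
    return total

# Hypothetical sample file, including one blank line.
path = os.path.join(tempfile.mkdtemp(), 'notes.txt')
with open(path, 'w', encoding='utf-8') as file:
    file.write('one two\nthree\n\nfour five six\n')

word_total = count_words_by_line(path)
print(word_total)  # 6
```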

3. Handling CSV Files

CSV (Comma-Separated Values) files are a common data format used in data science for storing tabular data. Python provides a built-in csv module to easily deal with CSV files for both reading and writing operations.

Introduction to CSV Files and the csv Module

CSV files are plain text files where each line contains a list of values separated by commas. These files are simple and widely supported by data analysis tools and libraries, making them a popular choice for data exchange and storage in data science projects.

The Python csv module offers a way to easily read from and write to CSV files. It supports various customization options to handle different CSV file formats and complexities.
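
As an illustration of those customization options, the sketch below reads semicolon-separated data by passing the delimiter parameter to csv.reader. The sample data is made up, and io.StringIO stands in for a real file so the example runs on its own.

```python
import csv
import io

# Made-up semicolon-separated data; the quoted field contains the delimiter.
data = io.StringIO('name;city\n"Lovelace; Ada";London\nLinus;Helsinki\n')

reader = csv.reader(data, delimiter=';')
rows = list(reader)

for row in rows:
    print(row)
# ['name', 'city']
# ['Lovelace; Ada', 'London']
# ['Linus', 'Helsinki']
```

Splitting each line with str.split(';') would break the quoted field in two, which is one reason to prefer the csv module over manual string handling.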

Reading CSV Files Using csv.reader

To read a CSV file, you can use the csv.reader() function. It takes a file object and returns a reader object that iterates over the rows of the given CSV file, yielding each row as a list of strings.

import csv

with open('students.csv', 'r', newline='') as file:  # newline='' is recommended by the csv module
    reader = csv.reader(file)
    for row in reader:
        print(row)

Handling header rows: If your CSV file includes a header row (first row as column names), you can use next(reader) to skip the header row before looping over the remaining rows.

with open('students.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    header = next(reader)  # Read the header row so the loop starts at the first data row
    for row in reader:
        print(row)

Writing Data to CSV Files Using csv.writer

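You can write data to a CSV file using the writer object returned by csv.writer(). It provides methods like writerow for writing a single row and writerows for writing multiple rows at once. Open the file with newline='' so that the csv module controls line endings itself; without it, extra blank rows can appear on platforms that use \r\n line endings, such as Windows.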

with open('output.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Grade'])
    writer.writerow(['John Doe', 'A'])
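
The writerows method mentioned above is not used in that example, so here is a short sketch of it. The file name writerows_demo.csv and the rows are assumptions for illustration; the file is read back at the end only to confirm what was written.

```python
import csv
import os
import tempfile

# Hypothetical output file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'writerows_demo.csv')

rows = [
    ['Name', 'Grade'],
    ['John Doe', 'A'],
    ['Jane Doe', 'B+'],
]

# writerows() writes every row in the list with a single call.
with open(path, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerows(rows)

# Read the file back to check the result.
with open(path, 'r', newline='', encoding='utf-8') as file:
    read_back = list(csv.reader(file))

print(read_back == rows)  # True
```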

Using DictReader and DictWriter for Working with Dictionaries

DictReader and DictWriter are classes in the csv module that allow you to read from and write to CSV files using dictionaries. This can be more convenient when dealing with CSV files that have header rows.

# Reading using DictReader
with open('students.csv', 'r', newline='') as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(row)  # Each row is a dictionary

# Writing using DictWriter
with open('output.csv', 'w', newline='') as file:
    fieldnames = ['Name', 'Grade']
    writer = csv.DictWriter(file, fieldnames=fieldnames)

    writer.writeheader()
    writer.writerow({'Name': 'Jane Doe', 'Grade': 'B+'})

Exercise: Calculate the Average Grade for Each Student

For this exercise, let's assume you have a CSV file named grades.csv with two columns: Name and Grade, where Grade is a numeric value and a student may appear in more than one row.

import csv

def calculate_average_grade(csv_file):
    grades = {}  # Maps each student name to a list of grades
    with open(csv_file, 'r', newline='') as file:
        reader = csv.DictReader(file)
        for row in reader:
            name = row['Name']
            grade = float(row['Grade'])
            if name in grades:
                grades[name].append(grade)
            else:
                grades[name] = [grade]
    
    for name, grades_list in grades.items():
        average_grade = sum(grades_list) / len(grades_list)
        print(f"{name}: Average Grade = {average_grade:.2f}")

# Assuming 'grades.csv' is your file
calculate_average_grade('grades.csv')

This exercise demonstrates how to use DictReader to process CSV data into a more manageable format (dictionaries), perform calculations, and then output the result. Handling CSV files in this way is a common task in data science projects, especially during the data cleaning and preprocessing stages.
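
One possible variation on this exercise returns the averages instead of printing them, which makes the result easier to reuse in later steps. This is a sketch under stated assumptions: the function name average_grades, the use of collections.defaultdict, and the sample rows are all choices made for illustration, not part of the exercise itself.

```python
import csv
import os
import tempfile
from collections import defaultdict

def average_grades(csv_file):
    """Return a dictionary mapping each student name to their average grade."""
    grades = defaultdict(list)  # missing names start with an empty list
    with open(csv_file, 'r', newline='', encoding='utf-8') as file:
        for row in csv.DictReader(file):
            # DictReader yields strings, so convert the grade to a number.
            grades[row['Name']].append(float(row['Grade']))
    return {name: sum(values) / len(values) for name, values in grades.items()}

# Hypothetical sample data in which one student appears twice.
path = os.path.join(tempfile.mkdtemp(), 'grades.csv')
with open(path, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerows([['Name', 'Grade'], ['Ana', 90], ['Ben', 70], ['Ana', 80]])

averages = average_grades(path)
print(averages)  # {'Ana': 85.0, 'Ben': 70.0}
```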
