Introduction to Pandas

What is Pandas and Why Use It?

Pandas is a powerful Python library used for data manipulation and analysis. It provides data structures and functions to efficiently handle structured data. Pandas is widely used in data science, machine learning, and other fields where data processing is required.

Overview of its Importance in Data Analysis and Manipulation

Pandas simplifies many common data manipulation tasks such as reading and writing data, handling missing values, filtering, grouping, and more. Its primary data structures, Series and DataFrame, allow for easy manipulation and analysis of tabular data.

Installation and Setup

Installing Pandas via pip or conda

You can install Pandas with either pip or conda. If you are using Anaconda, install it via conda; otherwise, use pip.

# pip install pandas
# conda install pandas

# Importing pandas
import pandas as pd

Series:

A Series is a one-dimensional array-like object capable of holding data of any type (integers, strings, floating-point numbers, Python objects, etc.). It consists of a sequence of values along with associated array labels or index. You can think of a Series as a single column of data.

Characteristics of Series:

  1. Homogeneity: All elements in a Series are of the same data type.
  2. Size Immutable: The size of a Series cannot be changed once it is created.
  3. Indexing: Each element in a Series is associated with an index label, which can be used to access individual elements.
  4. Similar to NumPy Arrays: Series is built on top of NumPy arrays, so many NumPy functions can be applied directly to Series.
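The characteristics above can be seen in a minimal example (the values and labels here are illustrative):

```python
import numpy as np
import pandas as pd

# A small Series: one homogeneous dtype, a labeled index, NumPy-compatible
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s.dtype)     # int64: all elements share one dtype
print(s['b'])      # 20: access an element by its index label
print(np.sqrt(s))  # NumPy ufuncs apply element-wise to a Series
```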

Creating a Series:

You can create a Series with the pd.Series() constructor by passing a list, array, or dictionary.

import pandas as pd

# Creating a Series from a list
s1 = pd.Series([1, 2, 3, 4, 5])
print(s1)

# Creating a Series from a dictionary
data = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
s2 = pd.Series(data)
print(s2)

OUTPUT:

0    1
1    2
2    3
3    4
4    5
dtype: int64
a    0
b    1
c    2
d    3
dtype: int64

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name:

pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

OUTPUT:

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64

DataFrame:

A DataFrame is a two-dimensional labeled data structure with columns of potentially different data types. It is analogous to a spreadsheet or SQL table, or a dictionary of Series objects. In essence, a DataFrame is a container for Series objects, where each Series represents a column.

Characteristics of DataFrame:

  1. Tabular Structure: Data is organized in rows and columns, just like a table.
  2. Column Indexing: Columns can be accessed by their names.
  3. Heterogeneous Data: Each column in a DataFrame can have a different data type.
  4. Size Mutable: Unlike Series, you can modify the size (add or delete rows/columns) of a DataFrame.

Creating a DataFrame:

You can create a DataFrame with the pd.DataFrame() constructor by passing a dictionary, a list of dictionaries, or a NumPy array.

import pandas as pd
import numpy as np

# Creating a DataFrame from a dictionary
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 35, 29, 42]}
df1 = pd.DataFrame(data)
print(df1)

# Creating a DataFrame from a list of dictionaries
data = [{'Name': 'John', 'Age': 28}, {'Name': 'Anna', 'Age': 35}]
df2 = pd.DataFrame(data)
print(df2)

# Creating a DataFrame from a NumPy array
data = np.array([[1, 2, 3], [4, 5, 6]])
df3 = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df3)

OUTPUT:

    Name  Age
0   John   28
1   Anna   35
2  Peter   29
3  Linda   42

   Name  Age
0  John   28
1  Anna   35

   A  B  C
0  1  2  3
1  4  5  6

pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})

OUTPUT:

             Bob           Sue
0    I liked it.  Pretty good.
1  It was awful.        Bland.

We are using the pd.DataFrame() constructor to generate these DataFrame objects. The syntax for declaring a new one is a dictionary whose keys are the column names (Bob and Sue in this example), and whose values are a list of entries. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.

The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.

The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:
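For example, reusing the Bob and Sue reviews from above with custom row labels:

```python
import pandas as pd

# Assign custom row labels with the index parameter
df = pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
                   'Sue': ['Pretty good.', 'Bland.']},
                  index=['Product A', 'Product B'])
print(df)  # rows are now labeled 'Product A' and 'Product B'
```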

DataFrame Attributes

DataFrame attributes are properties that provide information about the DataFrame.

.shape: Returns a tuple representing the dimensions of the DataFrame (number of rows, number of columns).

.dtypes: Returns the data types of each column in the DataFrame.

.index: Returns the index (row labels) of the DataFrame.

.columns: Returns the column labels of the DataFrame.

.head(): Shows the first few rows.

.tail(): Shows the last few rows.

In Pandas, several functions and attributes provide summary statistics and information about a DataFrame. Commonly used ones include info(), describe(), head(), tail(), shape, dtypes, nunique(), and value_counts().

1. info()

The info() function provides a concise summary of the DataFrame, including the number of non-null values in each column, data types, and memory usage.

# Display summary information about the DataFrame
df.info()

2. describe()

The describe() function generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution.

# Generate descriptive statistics
df.describe()

3. head()

The head() function returns the first n rows of the DataFrame. By default, it returns the first five rows.

# Display the first few rows of the DataFrame
df.head()

4. tail()

The tail() function returns the last n rows of the DataFrame. By default, it returns the last five rows.

# Display the last few rows of the DataFrame
df.tail()

5. shape

The shape attribute returns a tuple representing the dimensions of the DataFrame (number of rows, number of columns).

# Get the dimensions of the DataFrame
df.shape

6. dtypes

The dtypes attribute returns the data types of each column in the DataFrame.

# Get the data types of columns
df.dtypes

7. nunique()

The nunique() function returns the number of unique values in each column of the DataFrame.

# Get the number of unique values in each column
df.nunique()

8. value_counts()

The value_counts() function returns a Series containing counts of unique values in a column, sorted in descending order.

# Get the frequency of unique values in a column
df['column_name'].value_counts()

These summary functions are invaluable for quickly understanding the structure, content, and distribution of data within a DataFrame, allowing for effective exploratory data analysis.
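A quick sketch of a few of these summary functions on a small, made-up DataFrame (the city/sales data is purely illustrative):

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'city': ['Pune', 'Delhi', 'Pune', 'Mumbai'],
                   'sales': [250, 300, 250, 400]})

print(df.shape)                   # (4, 2)
print(df.nunique())               # city: 3 unique values, sales: 3
print(df['city'].value_counts())  # 'Pune' appears twice, so it is listed first
```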

df1.head()

OUTPUT:

    Name  Age
0   John   28
1   Anna   35
2  Peter   29
3  Linda   42

df1.columns

OUTPUT:

Index(['Name', 'Age'], dtype='object')

# Shape of DataFrame
print(df1.shape)

# Data types of columns
print(df1.dtypes)

# Index of DataFrame
print(df1.index)

# Column labels of DataFrame
print(df1.columns)

OUTPUT: 

(4, 2)
Name    object
Age      int64
dtype: object
RangeIndex(start=0, stop=4, step=1)
Index(['Name', 'Age'], dtype='object')

df1.info()

OUTPUT:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 192.0+ bytes

df1.describe()

OUTPUT:

             Age
count   4.000000
mean   33.500000
std     6.454972
min    28.000000
25%    28.750000
50%    32.000000
75%    36.750000
max    42.000000

Data can be stored in many forms and formats, such as CSV files, Excel spreadsheets, SQL databases, JSON files, and more. Pandas provides convenient functions to read data from these sources and create DataFrames.

Reading from CSV Files

CSV (Comma Separated Values) files are one of the most common ways of storing data in tabular form. Pandas provides the pd.read_csv() function to read data from CSV files.

import pandas as pd

# Reading data from a CSV file
df = pd.read_csv('file.csv')

Reading from Excel Files

Excel files are widely used for storing tabular data. Pandas provides the pd.read_excel() function to read data from Excel files.

import pandas as pd

# Reading data from an Excel file
df = pd.read_excel('file.xlsx')

Reading from SQL Databases

Pandas can also read data directly from SQL databases using the pd.read_sql() function.

import pandas as pd
import sqlite3

# Connecting to the database
conn = sqlite3.connect('database.db')

# Reading data from a SQL query
query = "SELECT * FROM table"
df = pd.read_sql(query, conn)

Other Formats

Pandas supports reading from various other formats such as JSON, HTML, HDF5, and more. Each format has its corresponding function (e.g., pd.read_json(), pd.read_html(), pd.read_hdf()).

import pandas as pd

# Reading data from a JSON file
df = pd.read_json('file.json')

# Reading data from an HTML file
df = pd.read_html('file.html')[0]  # [0] is used to select the first table from HTML

# Reading data from an HDF5 file
df = pd.read_hdf('file.h5', 'key')

Additional Parameters

These functions accept various additional parameters to customize the reading process, such as specifying delimiters, header rows, index columns, and more. Refer to the Pandas documentation for detailed information on these parameters.

# Example with additional parameters
df = pd.read_csv('file.csv', delimiter=';', header=0, index_col='id')
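To illustrate these parameters without needing a file on disk, here is a minimal sketch that reads a semicolon-delimited CSV held in memory (the id/name/score data is made up for the example):

```python
import io
import pandas as pd

# A small semicolon-delimited CSV kept in memory (stands in for 'file.csv')
raw = "id;name;score\n1;Ann;90\n2;Ben;85\n"

# header=0 takes column names from the first line;
# index_col='id' uses that column as the row index
df = pd.read_csv(io.StringIO(raw), delimiter=';', header=0, index_col='id')
print(df)
print(df.loc[2, 'score'])  # 85
```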


By leveraging these functions, you can easily read data from different file formats and databases into Pandas DataFrames, facilitating data analysis and manipulation tasks.

df = pd.read_csv('/content/drive/MyDrive/ML Course by ME/Python/Pandas/datsets_pandas/Housing.csv')
df.head()
df.tail()
df.shape

OUTPUT:

(545, 13)

df.columns

OUTPUT:

Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
       'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
       'parking', 'prefarea', 'furnishingstatus'],
      dtype='object')

df.dtypes

OUTPUT:

df.info()

OUTPUT:

df.describe()

OUTPUT:

df.describe(include='all')

In Pandas, you can select specific columns or rows from a DataFrame using various methods. Here are some common approaches:

Selecting Specific Columns

  1. Using Bracket Notation: You can use square brackets [] with the column names to select specific columns. Pass a single column name or a list of column names.
 # Selecting a single column
 df['column_name']

 # Selecting multiple columns
 df[['column_name1', 'column_name2']]

  2. Using Dot Notation: If column names are valid Python variable names and don't conflict with DataFrame methods or attributes, you can use dot notation.

 # Selecting a single column
 df.column_name

 # Note: Dot notation won't work if the column name contains spaces or special characters.
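A small sketch of this limitation, using hypothetical columns (one a valid identifier, one containing a space):

```python
import pandas as pd

# Hypothetical columns for illustration
df = pd.DataFrame({'price': [100, 200], 'unit cost': [60, 120]})

print(df.price.sum())         # 300: dot notation works for 'price'
print(df['unit cost'].sum())  # 180: brackets required, the name has a space
```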

df['price'].head()

OUTPUT:

0    13300000
1    12250000
2    12250000
3    12215000
4    11410000
Name: price, dtype: int64

df[['price','area']].head()

OUTPUT:

      price  area
0  13300000  7420
1  12250000  8960
2  12250000  9960
3  12215000  7500
4  11410000  7420

df.price

OUTPUT:

0      13300000
1      12250000
2      12250000
3      12215000
4      11410000
        ...   
540     1820000
541     1767150
542     1750000
543     1750000
544     1750000
Name: price, Length: 545, dtype: int64

Selecting Specific Rows

  1. Using Indexing: You can use .loc[] or .iloc[] to select specific rows based on index labels or positions, respectively.
 # Selecting a single row by index label
 df.loc['index_label']

 # Selecting multiple rows by index labels
 df.loc[['index_label1', 'index_label2']]

 # Selecting a single row by position
 df.iloc[row_position]

 # Selecting multiple rows by positions
 df.iloc[start_index:end_index]

loc and iloc are both widely used methods in Pandas for selecting specific rows and columns from a DataFrame. They provide powerful indexing capabilities and are essential for data manipulation and analysis. Here's an overview of each method:

1. loc:

loc is primarily label-based indexing, which means that you can use it to access rows and columns by their labels (index and column names). It allows you to select data based on the row and column labels.

Syntax:

df.loc[row_label, column_label]

Examples:

# Selecting a single row by index label
df.loc['index_label']

# Selecting a single element by row and column labels
df.loc['index_label', 'column_label']

# Selecting multiple rows by index labels
df.loc[['index_label1', 'index_label2']]

# Selecting rows and columns simultaneously
df.loc['index_label', ['column_label1', 'column_label2']]

2. iloc:

iloc is integer-based indexing, which means that you can use it to access rows and columns by their integer positions. It allows you to select data based on the row and column positions (indices).

Syntax:

df.iloc[row_index, column_index]

Examples:

# Selecting a single row by position
df.iloc[row_position]

# Selecting a single element by row and column positions
df.iloc[row_position, column_position]

# Selecting multiple rows by positions
df.iloc[start_index:end_index]

# Selecting rows and columns simultaneously by positions
df.iloc[start_index:end_index, start_column:end_column]

Key Differences:

  • loc is inclusive of both the start and stop labels, whereas iloc is exclusive of the stop index (similar to Python slicing).
  • With loc, you specify row and column labels, while with iloc, you specify row and column positions.
  • loc can accept boolean arrays or callable functions for selection, while iloc only accepts integers.
  • loc can be slower than iloc because label lookups add overhead.

Both loc and iloc are versatile and powerful tools for accessing data in Pandas DataFrames. The choice between them depends on whether you want to select data based on labels or positions.
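The first difference (inclusive vs exclusive slicing) can be seen in a minimal example with made-up labels:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

print(df.loc['a':'c'])  # 3 rows: the stop label 'c' is included
print(df.iloc[0:2])     # 2 rows: the stop position 2 is excluded
```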

# Selecting a single row by index label
df.loc[100]

OUTPUT:

price                   6230000
area                       6600
bedrooms                      3
bathrooms                     2
stories                       1
mainroad                    yes
guestroom                    no
basement                    yes
hotwaterheating              no
airconditioning             yes
parking                       0
prefarea                    yes
furnishingstatus    unfurnished
Name: 100, dtype: object

# Selecting a single element by row and column labels
#df.loc['index_label', 'column_label']
df.loc[100,'price']
# Selecting multiple rows by index labels
df.loc[[100,101,102]]

# Selecting rows and columns simultaneously
#df.loc['index_label', ['column_label1', 'column_label2']]
df.loc[100,['price','area','bathrooms']]

OUTPUT:

price        6230000
area            6600
bathrooms          2
Name: 100, dtype: object

# Selecting a single row by position
df.iloc[100]


OUTPUT:

price                   6230000
area                       6600
bedrooms                      3
bathrooms                     2
stories                       1
mainroad                    yes
guestroom                    no
basement                    yes
hotwaterheating              no
airconditioning             yes
parking                       0
prefarea                    yes
furnishingstatus    unfurnished
Name: 100, dtype: object

# Selecting multiple rows by positions
df.iloc[0:5]

# Selecting a single element by row and column positions
df.iloc[0, 2]

OUTPUT:

4

# Selecting rows and columns simultaneously by positions
#df.iloc[start_index:end_index, start_column:end_column]
df.iloc[0:5,0:3]

OUTPUT:

      price  area  bedrooms
0  13300000  7420         4
1  12250000  8960         4
2  12250000  9960         3
3  12215000  7500         4
4  11410000  7420         4

Using Conditional Selection: You can use boolean indexing to select rows based on conditions.

 # Selecting rows based on a condition
 df[df['column_name'] > value]

 # Selecting rows based on multiple conditions
 df[(df['column1'] > value1) & (df['column2'] < value2)]

df[df['price'] > 10000000]
df[(df['price'] > 8000000) & (df['area'] < 10000)]

Using Query: Pandas provides a query() method to select rows based on a query expression.

 # Selecting rows based on a query expression
 df.query('column_name > value')

These methods allow you to easily select specific columns or rows from a DataFrame based on your analysis requirements. Choose the method that best fits your use case and preference.
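One convenience of query() worth noting: the @ prefix lets the expression reference a Python variable. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical prices and areas to illustrate query()
df = pd.DataFrame({'price': [120, 80, 200], 'area': [50, 30, 95]})

limit = 100
# @limit refers to the Python variable defined above
print(df.query('price > @limit and area < 90'))  # only the 120/50 row matches
```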

df.query('price > 10000000 & area < 15000')

df['airconditioning'] == 'yes'

OUTPUT:

0       True
1       True
2      False
3       True
4       True
      ...  
540    False
541    False
542    False
543    False
544    False
Name: airconditioning, Length: 545, dtype: bool

df.query("airconditioning == 'yes'")

Deleting Rows:

To delete rows using drop(), you need to specify the index labels or positions of the rows you want to remove. You can use the axis parameter to indicate that you're dropping rows (axis=0).

Syntax:

df.drop(labels, axis=0, inplace=False)

  • labels: The index labels or positions of the rows to be dropped.
  • axis: Specifies whether to drop rows (axis=0) or columns (axis=1). Default is 0 (rows).
  • inplace: If True, the operation is done in place and returns None. If False, it returns a new DataFrame with the rows removed. Default is False.

Example:

# Drop rows with index labels 2 and 4
df.drop([2, 4], axis=0, inplace=True)

Deleting Columns:

To delete columns using drop(), you need to specify the column names or positions of the columns you want to remove. You can use the axis parameter to indicate that you're dropping columns (axis=1).

df.drop(labels, axis=1, inplace=False)

  • labels: The column names or positions of the columns to be dropped.
  • axis: Specifies whether to drop rows (axis=0) or columns (axis=1). Default is 0 (rows).
  • inplace: If True, the operation is done in place and returns None. If False, it returns a new DataFrame with the columns removed. Default is False.

# Drop columns 'column_name1' and 'column_name2'
df.drop(['column_name1', 'column_name2'], axis=1, inplace=True)

  • By default, drop() returns a new DataFrame with the specified rows or columns removed. If you want to modify the original DataFrame in place, you can set the inplace parameter to True.
  • If you want to drop rows or columns based on conditions, you can use boolean indexing or query() method to select the rows or columns to be removed, then use drop() to delete them.
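The second bullet (dropping rows based on a condition) can be sketched as follows, with hypothetical score data:

```python
import pandas as pd

df = pd.DataFrame({'score': [55, 90, 40, 75]})

# Select the index labels of rows that fail a condition, then drop them
low = df[df['score'] < 60].index
df = df.drop(low, axis=0)  # returns a new DataFrame (inplace=False by default)
print(df)                  # rows with labels 1 and 3 remain
```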

df.drop([3, 4], axis=0, inplace=False)
df.head()

df.drop([3, 4], axis=0, inplace=True)
df.head()

df.drop('parking', axis=1)

df.drop(['parking', 'airconditioning'], axis=1)
