Introduction to Pandas
What is Pandas and Why Use It?
Pandas is a powerful Python library used for data manipulation and analysis. It provides data structures and functions to efficiently handle structured data. Pandas is widely used in data science, machine learning, and other fields where data processing is required.
Overview of its Importance in Data Analysis and Manipulation
Pandas simplifies many common data manipulation tasks such as reading and writing data, handling missing values, filtering, grouping, and more. Its primary data structures, Series and DataFrame, allow for easy manipulation and analysis of tabular data.
Installation and Setup
Installing Pandas via pip or conda
You can install Pandas using pip or conda. If you're using Anaconda, install it via conda; otherwise, use pip.
#pip install pandas
#conda install pandas
# importing pandas
import pandas as pd
Series:
A Series is a one-dimensional array-like object capable of holding data of any type (integers, strings, floating-point numbers, Python objects, etc.). It consists of a sequence of values along with an associated array of labels, called its index. You can think of a Series as a single column of data.
Characteristics of Series:
- Homogeneity: All elements in a Series are of the same data type.
- Size Immutable: The size of a Series cannot be changed once it is created.
- Indexing: Each element in a Series is associated with an index label, which can be used to access individual elements.
- Similar to NumPy Arrays: Series is built on top of NumPy arrays, so many NumPy functions can be applied directly to Series.
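A small sketch of the last two points (the values and labels here are made up for illustration): NumPy functions apply element-wise to a Series, and the index labels are preserved.

```python
import numpy as np
import pandas as pd

# A labeled Series; the data is made up for illustration
s = pd.Series([1, 4, 9, 16], index=['a', 'b', 'c', 'd'])

# NumPy ufuncs apply element-wise and preserve the index
roots = np.sqrt(s)
print(roots['c'])        # access an element by its index label

# Vectorized arithmetic also preserves the index
doubled = s * 2
print(doubled.tolist())  # [2, 8, 18, 32]
```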
Creating a Series:
You can create a Series using the pd.Series() constructor by passing a list, array, or dictionary.
import pandas as pd
# Creating a Series from a list
s1 = pd.Series([1, 2, 3, 4, 5])
print(s1)
# Creating a Series from a dictionary
data = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
s2 = pd.Series(data)
print(s2)
OUTPUT:
0 1
1 2
2 3
3 4
4 5
dtype: int64
a 0
b 1
c 2
d 3
dtype: int64
A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
OUTPUT:
2015 Sales 30
2016 Sales 35
2017 Sales 40
Name: Product A, dtype: int64
DataFrame:
A DataFrame is a two-dimensional labeled data structure with columns of potentially different data types. It is analogous to a spreadsheet or SQL table, or a dictionary of Series objects. In essence, a DataFrame is a container for Series objects, where each Series represents a column.
Characteristics of DataFrame:
- Tabular Structure: Data is organized in rows and columns, just like a table.
- Column Indexing: Columns can be accessed by their names.
- Heterogeneous Data: Each column in a DataFrame can have a different data type.
- Size Mutable: Unlike Series, you can modify the size (add or delete rows/columns) of a DataFrame.
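These properties can be seen in a short sketch (the column names and data here are made up for illustration):

```python
import pandas as pd

# Toy data; column names are made up for illustration
df = pd.DataFrame({'name': ['Ann', 'Bob'],   # object column
                   'score': [9.5, 7.0]})     # float64 column

# Heterogeneous data: each column keeps its own dtype
print(df.dtypes)

# Size mutable: add a new column derived from an existing one...
df['passed'] = df['score'] > 8

# ...and delete a row by its index label
df = df.drop(0)
print(df.shape)  # (1, 3)
```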
Creating a DataFrame:
You can create a DataFrame using the pd.DataFrame() constructor by passing a dictionary, a list of dictionaries, or a NumPy array.
#import pandas as pd
# Creating a DataFrame from a dictionary
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 35, 29, 42]}
df1 = pd.DataFrame(data)
print(df1)
# Creating a DataFrame from a list of dictionaries
data = [{'Name': 'John', 'Age': 28}, {'Name': 'Anna', 'Age': 35}]
df2 = pd.DataFrame(data)
print(df2)
# Creating a DataFrame from a NumPy array
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6]])
df3 = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df3)
OUTPUT:
Name Age
0 John 28
1 Anna 35
2 Peter 29
3 Linda 42
Name Age
0 John 28
1 Anna 35
A B C
0 1 2 3
1 4 5 6
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
OUTPUT:
             Bob           Sue
0    I liked it.  Pretty good.
1  It was awful.        Bland.
We are using the pd.DataFrame() constructor to generate these DataFrame objects. The syntax for declaring a new one is a dictionary whose keys are the column names (Bob and Sue in this example), and whose values are a list of entries. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.
The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.
The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:
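For example, reusing the Bob and Sue reviews from above with custom row labels:

```python
import pandas as pd

# Same reviews as before, but with our own row labels via index=
reviews = pd.DataFrame(
    {'Bob': ['I liked it.', 'It was awful.'],
     'Sue': ['Pretty good.', 'Bland.']},
    index=['Product A', 'Product B'])
print(reviews)
```

The rows are now labeled Product A and Product B instead of 0 and 1.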
DataFrame Attributes
DataFrame attributes are properties that provide information about the DataFrame.
.shape: Returns a tuple representing the dimensions of the DataFrame (number of rows, number of columns).
.dtypes: Returns the data types of each column in the DataFrame.
.index: Returns the index (row labels) of the DataFrame.
.columns: Returns the column labels of the DataFrame.
.head(): Returns the first few rows of the DataFrame.
.tail(): Returns the last few rows of the DataFrame.
In Pandas, there are several functions that provide summary statistics and information about DataFrames. Some commonly used ones include info(), describe(), head(), tail(), shape, dtypes, nunique(), and value_counts().
1. info()
The info() function provides a concise summary of the DataFrame, including the number of non-null values in each column, data types, and memory usage.
# Display summary information about the DataFrame
df.info()
2. describe()
The describe() function generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution.
# Generate descriptive statistics
df.describe()
3. head()
The head() function returns the first n rows of the DataFrame. By default, it returns the first five rows.
# Display the first few rows of the DataFrame
df.head()
4. tail()
The tail() function returns the last n rows of the DataFrame. By default, it returns the last five rows.
# Display the last few rows of the DataFrame
df.tail()
5. shape
The shape attribute returns a tuple representing the dimensions of the DataFrame (number of rows, number of columns).
# Get the dimensions of the DataFrame
df.shape
6. dtypes
The dtypes attribute returns the data types of each column in the DataFrame.
# Get the data types of columns
df.dtypes
7. nunique()
The nunique() function returns the number of unique values in each column of the DataFrame.
# Get the number of unique values in each column
df.nunique()
8. value_counts()
The value_counts() function returns a Series containing counts of unique values in a column, sorted in descending order.
# Get the frequency of unique values in a column
df['column_name'].value_counts()
These summary functions are invaluable for quickly understanding the structure, content, and distribution of data within a DataFrame, allowing for effective exploratory data analysis.
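As a quick sketch of nunique() and value_counts() on a toy column (the city data here is made up for illustration):

```python
import pandas as pd

# Toy column; the city names are made up for illustration
df = pd.DataFrame({'city': ['Pune', 'Delhi', 'Pune', 'Mumbai', 'Pune']})

print(df.nunique())                # 3 distinct cities

counts = df['city'].value_counts()
print(counts)                      # 'Pune' first: counts sort descending
```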
df1.head()
OUTPUT:
    Name  Age
0   John   28
1   Anna   35
2  Peter   29
3  Linda   42
df1.columns
OUTPUT:
Index(['Name', 'Age'], dtype='object')
# Shape of DataFrame
print(df1.shape)
# Data types of columns
print(df1.dtypes)
# Index of DataFrame
print(df1.index)
# Column labels of DataFrame
print(df1.columns)
OUTPUT:
(4, 2)
Name object
Age int64
dtype: object
RangeIndex(start=0, stop=4, step=1)
Index(['Name', 'Age'], dtype='object')
df1.info()
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 4 non-null object
1 Age 4 non-null int64
dtypes: int64(1), object(1)
memory usage: 192.0+ bytes
df1.describe()
OUTPUT:
             Age
count   4.000000
mean   33.500000
std     6.454972
min    28.000000
25%    28.750000
50%    32.000000
75%    36.750000
max    42.000000
Reading Data from Different Sources
Data can be stored in various forms and formats, such as CSV files, Excel spreadsheets, SQL databases, JSON files, and more. Pandas provides convenient functions to read data from these sources and create DataFrames.
Reading from CSV Files
CSV (Comma Separated Values) files are one of the most common ways of storing data in tabular form. Pandas provides the pd.read_csv() function to read data from CSV files.
import pandas as pd
# Reading data from a CSV file
df = pd.read_csv('file.csv')
Reading from Excel Files
Excel files are widely used for storing tabular data. Pandas provides the pd.read_excel() function to read data from Excel files.
import pandas as pd
# Reading data from an Excel file
df = pd.read_excel('file.xlsx')
Reading from SQL Databases
Pandas can also read data directly from SQL databases using the pd.read_sql() function.
import pandas as pd
import sqlite3
# Connecting to the database
conn = sqlite3.connect('database.db')
# Reading data from a SQL query
query = "SELECT * FROM table"
df = pd.read_sql(query, conn)
Other Formats
Pandas supports reading from various other formats such as JSON, HTML, HDF5, and more. Each format has its corresponding function (e.g., pd.read_json(), pd.read_html(), pd.read_hdf()).
import pandas as pd
# Reading data from a JSON file
df = pd.read_json('file.json')
# Reading data from an HTML file
df = pd.read_html('file.html')[0] # [0] is used to select the first table from HTML
# Reading data from an HDF5 file
df = pd.read_hdf('file.h5', 'key')
Additional Parameters
These functions accept various additional parameters to customize the reading process, such as specifying delimiters, header rows, index columns, and more. Refer to the Pandas documentation for detailed information on these parameters.
# Example with additional parameters
df = pd.read_csv('file.csv', delimiter=';', header=0, index_col='id')
By leveraging these functions, you can easily read data from different file formats and databases into Pandas DataFrames, facilitating data analysis and manipulation tasks.
df = pd.read_csv('/content/drive/MyDrive/ML Course by ME/Python/Pandas/datsets_pandas/Housing.csv')
df.head()
df.tail()
df.shape
OUTPUT:
(545, 13)
df.columns
OUTPUT:
Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'parking', 'prefarea', 'furnishingstatus'], dtype='object')
df.dtypes
df.info()
df.describe()
df.describe(include='all')
In Pandas, you can select specific columns or rows from a DataFrame using various methods. Here are some common approaches:
Selecting Specific Columns
- Using Bracket Notation: You can use square brackets [] with the column names to select specific columns. Pass a single column name or a list of column names.
# Selecting a single column
df['column_name']
# Selecting multiple columns
df[['column_name1', 'column_name2']]
- Using Dot Notation: If column names are valid Python variable names and don't conflict with DataFrame methods or attributes, you can use dot notation.
# Selecting a single column
df.column_name
# Note: Dot notation won't work if the column name contains spaces or special characters.
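A quick sketch of that caveat (the column names here are made up): a column whose name contains a space must be accessed with brackets.

```python
import pandas as pd

# Toy data; 'unit price' deliberately contains a space
df = pd.DataFrame({'unit price': [9.99, 4.50], 'qty': [3, 7]})

# Dot notation works for a valid Python identifier...
print(df.qty.tolist())            # [3, 7]

# ...but 'unit price' is not one, so brackets are required
print(df['unit price'].tolist())  # [9.99, 4.5]
```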
df['price'].head()
OUTPUT:
0 13300000
1 12250000
2 12250000
3 12215000
4 11410000
Name: price, dtype: int64
df[['price','area']].head()
OUTPUT:
price area
0 13300000 7420
1 12250000 8960
2 12250000 9960
3 12215000 7500
4 11410000 7420
df.price
OUTPUT:
0 13300000
1 12250000
2 12250000
3 12215000
4 11410000
...
540 1820000
541 1767150
542 1750000
543 1750000
544 1750000
Name: price, Length: 545, dtype: int64
Selecting Specific Rows
- Using Indexing: You can use .loc[] or .iloc[] to select specific rows based on index labels or positions, respectively.
# Selecting a single row by index label
df.loc['index_label']
# Selecting multiple rows by index labels
df.loc[['index_label1', 'index_label2']]
# Selecting a single row by position
df.iloc[row_position]
# Selecting multiple rows by positions
df.iloc[start_index:end_index]
loc and iloc are both widely used methods in Pandas for selecting specific rows and columns from a DataFrame. They provide powerful indexing capabilities and are essential for data manipulation and analysis. Here's an overview of each method:
1. loc:
loc is primarily label-based indexing, which means that you can use it to access rows and columns by their labels (index and column names). It allows you to select data based on the row and column labels.
Syntax:
df.loc[row_label, column_label]
Examples:
# Selecting a single row by index label
df.loc['index_label']
# Selecting a single element by row and column labels
df.loc['index_label', 'column_label']
# Selecting multiple rows by index labels
df.loc[['index_label1', 'index_label2']]
# Selecting rows and columns simultaneously
df.loc['index_label', ['column_label1', 'column_label2']]
2. iloc:
iloc is integer-based indexing, which means that you can use it to access rows and columns by their integer positions. It allows you to select data based on the row and column positions (indices).
Syntax:
df.iloc[row_index, column_index]
Examples:
# Selecting a single row by position
df.iloc[row_position]
# Selecting a single element by row and column positions
df.iloc[row_position, column_position]
# Selecting multiple rows by positions
df.iloc[start_index:end_index]
# Selecting rows and columns simultaneously by positions
df.iloc[start_index:end_index, start_column:end_column]
Key Differences:
- loc is inclusive of both the start and stop labels, whereas iloc is exclusive of the stop index (similar to Python slicing).
- With loc, you specify row and column labels, while with iloc, you specify row and column positions.
- loc can accept boolean arrays or callable functions for selection; iloc interprets its inputs purely positionally.
- loc can be slower than iloc because it involves label lookups.
Both loc and iloc are versatile and powerful tools for accessing data in Pandas DataFrames. The choice between them depends on whether you want to select data based on labels or positions.
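The inclusive vs. exclusive slicing difference is easiest to see side by side; a small sketch with made-up values and labels:

```python
import pandas as pd

# A labeled Series; values and labels are made up for illustration
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# loc slicing is inclusive of the stop label: 'a' through 'c'
print(s.loc['a':'c'].tolist())   # [10, 20, 30]

# iloc slicing excludes the stop position, like Python list slicing
print(s.iloc[0:3].tolist())      # [10, 20, 30]
print(s.iloc[0:2].tolist())      # [10, 20]
```

Note that s.loc['a':'c'] and s.iloc[0:3] return the same three elements even though the stop values differ: one counts the stop label in, the other counts the stop position out.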
# Selecting a single row by index label
df.loc[100]
OUTPUT:
price 6230000
area 6600
bedrooms 3
bathrooms 2
stories 1
mainroad yes
guestroom no
basement yes
hotwaterheating no
airconditioning yes
parking 0
prefarea yes
furnishingstatus unfurnished
Name: 100, dtype: object
# Selecting a single element by row and column labels
#df.loc['index_label', 'column_label']
df.loc[100,'price']
# Selecting multiple rows by index labels
df.loc[[100,101,102]]
# Selecting rows and columns simultaneously
#df.loc['index_label', ['column_label1', 'column_label2']]
df.loc[100,['price','area','bathrooms']]
OUTPUT:
price 6230000
area 6600
bathrooms 2
Name: 100, dtype: object
# Selecting a single row by position
df.iloc[100]
OUTPUT:
price 6230000
area 6600
bedrooms 3
bathrooms 2
stories 1
mainroad yes
guestroom no
basement yes
hotwaterheating no
airconditioning yes
parking 0
prefarea yes
furnishingstatus unfurnished
Name: 100, dtype: object
# Selecting multiple rows by positions
#df.iloc[start_index:end_index]
df.iloc[0:5]
# Selecting a single element by row and column positions
#df.iloc[row_position, column_position]
df.iloc[0,2]
OUTPUT:
4
# Selecting rows and columns simultaneously by positions
#df.iloc[start_index:end_index, start_column:end_column]
df.iloc[0:5,0:3]
OUTPUT:
price area bedrooms
0 13300000 7420 4
1 12250000 8960 4
2 12250000 9960 3
3 12215000 7500 4
4 11410000 7420 4
- Using Conditional Selection: You can use boolean indexing to select rows based on conditions.
# Selecting rows based on a condition
df[df['column_name'] > value]
# Selecting rows based on multiple conditions
df[(df['column1'] > value1) & (df['column2'] < value2)]
df[df['price']>10000000]
df[(df['price']>8000000) & (df['area']<10000)]
- Using Query: Pandas provides a query() method to select rows based on a query expression.
# Selecting rows based on a query expression
df.query('column_name > value')
These methods allow you to easily select specific columns or rows from a DataFrame based on your analysis requirements. Choose the method that best fits your use case and preference.
df.query('price>10000000 & area<15000')
df['airconditioning']=='yes'
OUTPUT:
0 True
1 True
2 False
3 True
4 True
...
540 False
541 False
542 False
543 False
544 False
Name: airconditioning, Length: 545, dtype: bool
df.query("airconditioning == 'yes'")
Deleting Rows:
To delete rows using drop(), specify the index labels of the rows you want to remove. You can use the axis parameter to indicate that you're dropping rows (axis=0).
Syntax:
df.drop(labels, axis=0, inplace=False)
- labels: The index labels of the rows to be dropped.
- axis: Specifies whether to drop rows (axis=0) or columns (axis=1). Default is 0 (rows).
- inplace: If True, the operation is done in place and returns None. If False, it returns a new DataFrame with the rows removed. Default is False.
Example:
# Drop rows with index labels 2 and 4
df.drop([2, 4], axis=0, inplace=True)
Deleting Columns:
To delete columns using drop(), specify the names of the columns you want to remove. You can use the axis parameter to indicate that you're dropping columns (axis=1).
df.drop(labels, axis=1, inplace=False)
- labels: The names of the columns to be dropped.
- axis: Specifies whether to drop rows (axis=0) or columns (axis=1). Default is 0 (rows).
- inplace: If True, the operation is done in place and returns None. If False, it returns a new DataFrame with the columns removed. Default is False.
# Drop columns 'column_name1' and 'column_name2'
df.drop(['column_name1', 'column_name2'], axis=1, inplace=True)
- By default, drop() returns a new DataFrame with the specified rows or columns removed. If you want to modify the original DataFrame in place, you can set the inplace parameter to True.
- If you want to drop rows or columns based on conditions, you can use boolean indexing or the query() method to select the rows or columns to be removed, then use drop() to delete them.
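The condition-based pattern mentioned in the last point can be sketched like this (toy data; the 200 threshold is arbitrary):

```python
import pandas as pd

# Toy data; values and the threshold are made up for illustration
df = pd.DataFrame({'price': [100, 250, 175, 300],
                   'area':  [50, 80, 60, 90]})

# Select the index of rows matching the condition, then drop them
to_drop = df[df['price'] > 200].index
df = df.drop(to_drop, axis=0)
print(df)  # only the rows with price <= 200 remain
```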
df.drop([3,4], axis=0, inplace=False)
df.head()
df.drop([3,4], axis=0, inplace=True)
df.head()
df.drop('parking', axis=1)
df.drop(['parking', 'airconditioning'], axis=1)