Pandas Tutorial

Welcome to the Pandas tutorial! In this guide, we'll explore key features of the Pandas library in Python, which is widely used for data manipulation and analysis. We'll cover Pandas Series and DataFrame, and learn important functions like Series.map(), DataFrame.apply(), and more.

Pandas Series

Series

A Pandas Series is a one-dimensional labeled array capable of holding any data type. It can be thought of as an enhanced version of a list or an array in Python.

import pandas as pd
series = pd.Series([1, 2, 3, 4, 5])
print(series)

Series.map()

The map() function allows you to apply a function or mapping to each element of the Series.

series = pd.Series([1, 2, 3, 4, 5])
result = series.map(lambda x: x ** 2)
print(result)
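
As a sketch of the "mapping" form mentioned above, map() also accepts a dict. The grade-to-points table below is a made-up example rather than part of the original tutorial; values with no matching key become NaN.

```python
import pandas as pd

grades = pd.Series(['a', 'b', 'c', 'a'])

# Hypothetical lookup table; 'c' is deliberately left out.
points = {'a': 4, 'b': 3}

# Each element is looked up in the dict; missing keys map to NaN.
mapped = grades.map(points)
print(mapped)
```

Because of the NaN, the result is a float Series even though the dict values are integers.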

Series.std()

The std() function calculates the standard deviation of the Series, a measure of how spread out the values are. By default it returns the sample standard deviation (ddof=1).

series = pd.Series([1, 2, 3, 4, 5])
std_dev = series.std()
print(std_dev)
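
The ddof parameter controls the divisor used by std(). The short comparison below, using the same five values, is only an illustration of the difference between the sample and population versions.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5])

# Default: sample standard deviation, divides by n - 1 (ddof=1).
sample_std = series.std()

# Population standard deviation, divides by n (ddof=0).
population_std = series.std(ddof=0)

print(sample_std, population_std)
```

NumPy's np.std() defaults to ddof=0, so the two libraries can give different answers for the same data unless ddof is set explicitly.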

Series.to_frame()

The to_frame() method converts a Series into a DataFrame. This is useful when you want to work with the Series as a DataFrame.

series = pd.Series([1, 2, 3, 4, 5])
df = series.to_frame(name='Numbers')
print(df)

Series.unique()

The unique() function returns an array of the unique elements in the Series, in order of first appearance.

series = pd.Series([1, 2, 2, 3, 4])
unique_values = series.unique()
print(unique_values)

Series.value_counts()

The value_counts() function counts the occurrences of each unique value in the Series and, by default, sorts the result from most to least frequent.

series = pd.Series([1, 2, 2, 3, 3, 3])
value_counts = series.value_counts()
print(value_counts)

Pandas DataFrame

DataFrame

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). You can think of it as a table or a spreadsheet.

import pandas as pd
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
print(df)

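DataFrame.append()

The append() method added rows to a DataFrame and was commonly used to combine data from different DataFrames. It was deprecated in pandas 1.4 and removed in pandas 2.0, so current code should use pd.concat() instead.

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
combined_df = pd.concat([df1, df2], ignore_index=True)  # replaces df1.append(df2, ignore_index=True)
print(combined_df)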

DataFrame.apply()

The apply() function is used to apply a function along the axis (rows or columns) of a DataFrame.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
result = df.apply(lambda x: x.max())
print(result)
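
The example above uses the default axis=0, which passes each column to the function. A minimal sketch of both directions, with the same data, might look like this:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# axis=0 (default): the function receives each column as a Series.
col_max = df.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series.
row_sum = df.apply(lambda row: row['A'] + row['B'], axis=1)

print(col_max)
print(row_sum)
```

For simple arithmetic like this, the vectorized form df['A'] + df['B'] is usually faster than a row-wise apply().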

DataFrame.aggregate()

The aggregate() method applies one or more functions to the DataFrame. This is useful for summarizing data.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
result = df.aggregate(['sum', 'mean'])
print(result)

DataFrame.assign()

The assign() method adds new columns to the DataFrame or modifies existing ones.

df = pd.DataFrame({'A': [1, 2, 3]})
df = df.assign(B = [4, 5, 6])
print(df)

DataFrame.astype()

The astype() method casts a pandas object to a different data type. It can be called on a whole DataFrame or, as here, on a single column.

df = pd.DataFrame({'A': ['1', '2', '3']})
df['A'] = df['A'].astype(int)
print(df)

DataFrame.count()

The count() function returns the number of non-null entries for each column or row in the DataFrame.

df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [None, 2, 3, 4]})
count_result = df.count()
print(count_result)

pd.cut()

cut() is a top-level pandas function (pd.cut), not a DataFrame method. It assigns each data value to one of a set of discrete bins or intervals.

data = pd.Series([1, 7, 5, 3, 9, 4])
labels = ['Low', 'Medium', 'High']
binned_data = pd.cut(data, bins=3, labels=labels)
print(binned_data)
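
Passing an integer for bins splits the data range into equal-width intervals. If the boundaries matter, explicit edges can be given instead; the edges 0, 3, 6, 9 below are arbitrary values chosen for illustration.

```python
import pandas as pd

data = pd.Series([1, 7, 5, 3, 9, 4])

# Explicit edges give the intervals (0, 3], (3, 6], (6, 9].
# Intervals are closed on the right by default.
binned = pd.cut(data, bins=[0, 3, 6, 9], labels=['Low', 'Medium', 'High'])
print(binned)
```

Note that the value 3 falls in 'Low' because the right edge of each interval is included.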

DataFrame.describe()

The describe() method provides summary statistics of the DataFrame, including count, mean, std, min, and more.

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})
summary = df.describe()
print(summary)

DataFrame.drop_duplicates()

The drop_duplicates() function removes duplicate rows from the DataFrame.

df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [4, 5, 5, 6, 6]})
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)

DataFrame.groupby()

The groupby() method splits the data into groups based on some criteria, allowing you to perform operations on each group.

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A'], 'Value': [10, 20, 30, 40, 50]})
grouped = df.groupby('Category').sum()
print(grouped)
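
Several statistics can be computed per group in one pass with agg(). The sketch below reuses the same data and is one possible way to summarize it.

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A'],
                   'Value': [10, 20, 30, 40, 50]})

# Select the column to summarize, then apply several aggregations per group.
summary = df.groupby('Category')['Value'].agg(['sum', 'mean', 'count'])
print(summary)
```

The result has one row per category and one column per aggregation.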

DataFrame.head()

The head() function returns the first n rows of the DataFrame (default is 5).

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6], 'B': [7, 8, 9, 10, 11, 12]})
head_result = df.head(3)
print(head_result)

DataFrame.hist()

The hist() function generates histograms for numeric columns of the DataFrame.

import matplotlib.pyplot as plt  # required for plt.show()

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
df.hist()
plt.show()

DataFrame.iterrows()

The iterrows() function iterates over DataFrame rows as (index, Series) pairs.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
for index, row in df.iterrows():
    print(index, row['A'], row['B'])

DataFrame.join()

The join() function joins columns of another DataFrame or Series into the current DataFrame.

df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'B': [4, 5, 6]})
joined_df = df1.join(df2)
print(joined_df)

DataFrame.mean()

The mean() function computes the mean (average) of each numeric column in the DataFrame.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
mean_values = df.mean()
print(mean_values)

DataFrame.melt()

The melt() function unpivots a DataFrame from wide format to long format, making it suitable for certain types of analysis.

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
melted_df = pd.melt(df, id_vars=['A'], value_vars=['B'])
print(melted_df)

DataFrame.merge()

The merge() function combines two DataFrames based on a common column or index, similar to SQL JOIN operations.

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 35]})
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
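
The how argument chooses which keys survive the merge. As a rough illustration with the same two tables, a left join keeps every row of df1, and an outer join with indicator=True shows where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 35]})

# Left join: all IDs from df1; Age is NaN where df2 has no match (ID 3).
left_merged = pd.merge(df1, df2, on='ID', how='left')

# Outer join: IDs from either table; '_merge' records the source of each row.
outer_merged = pd.merge(df1, df2, on='ID', how='outer', indicator=True)

print(left_merged)
print(outer_merged)
```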

DataFrame.pivot_table()

The pivot_table() function creates a pivot table from the DataFrame, summarizing data in a matrix format.

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A'], 'Value': [10, 20, 30, 40, 50]})
pivot = df.pivot_table(values='Value', index='Category', aggfunc='sum')
print(pivot)

DataFrame.query()

The query() function allows you to query the DataFrame using a string expression. This is similar to SQL WHERE clauses.

df = pd.DataFrame({'Age': [22, 25, 30, 35], 'Name': ['Alice', 'Bob', 'Charlie', 'David']})
result = df.query('Age > 25')
print(result)

DataFrame.rename()

The rename() function allows you to change column or index names in the DataFrame.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df_renamed = df.rename(columns={'A': 'X', 'B': 'Y'})
print(df_renamed)

DataFrame.sample()

The sample() function returns a random sample of rows from the DataFrame. Pass random_state to get the same sample on every run.

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
sampled_df = df.sample(n=2)
print(sampled_df)

DataFrame.shift()

The shift() function shifts the values of a DataFrame (or Series) by a specified number of periods.

df = pd.DataFrame({'A': [1, 2, 3, 4]})
shifted_df = df.shift(1)
print(shifted_df)
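
A common use of shift() is comparing each row with the previous one. The values below are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 4, 7]})

# shift(1) moves values down one row, so this is "current minus previous".
# The first row has no previous value and becomes NaN.
df['change'] = df['A'] - df['A'].shift(1)
print(df)
```

For this particular calculation, df['A'].diff() gives the same result.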

DataFrame.sort_values()

The sort_values() method sorts the DataFrame by the values of one or more columns. The old DataFrame.sort() method no longer exists in current pandas.

df = pd.DataFrame({'A': [1, 3, 2, 4]})
sorted_df = df.sort_values(by='A')
print(sorted_df)

DataFrame.sum()

The sum() function computes the sum of the values in each column or row.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
sum_result = df.sum()
print(sum_result)

DataFrame.to_excel()

The to_excel() function exports the DataFrame to an Excel file.

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df.to_excel('output.xlsx', index=False)

DataFrame.transform()

The transform() function applies a function to each column (or row) and returns a DataFrame with the same shape.

df = pd.DataFrame({'A': [1, 2, 3]})
transformed_df = df.transform(lambda x: x + 1)
print(transformed_df)

DataFrame.transpose()

The transpose() function transposes the rows and columns of the DataFrame.

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
transposed_df = df.transpose()
print(transposed_df)

DataFrame.where()

The where() function keeps values where the condition is True and replaces the rest, with NaN by default or with the value passed as other.

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})
result = df.where(df > 20, other=0)
print(result)
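
The mask() method is the complement of where(): it replaces values where the condition is True. A side-by-side sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})

# where: keep values greater than 20, replace everything else with 0.
kept = df.where(df > 20, other=0)

# mask: replace values greater than 20 with 0, keep everything else.
masked = df.mask(df > 20, other=0)

print(kept)
print(masked)
```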

Pandas Operations

Operations

Pandas supports a wide variety of operations like addition, subtraction, multiplication, and division on DataFrame columns. Operations are performed element-wise across the DataFrame.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
result = df['A'] + df['B']
print(result)

Add Column to DataFrame

You can add new columns to a DataFrame by assigning values to a new column name.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['C'] = df['A'] + df['B']
print(df)

DataFrame to NumPy Array

You can convert a DataFrame to a NumPy array using the to_numpy() method. This is useful for applying numerical operations that are not available directly in Pandas.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
numpy_array = df.to_numpy()
print(numpy_array)

DataFrame to CSV

You can write a DataFrame to a CSV file using the to_csv() function.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df.to_csv('output.csv', index=False)

Reading Files

Pandas can read data from various file formats, including CSV, Excel, and JSON, using functions like read_csv(), read_excel(), and read_json().

df = pd.read_csv('input.csv')
print(df.head())
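
The snippet above assumes a file named input.csv exists. To try read_csv() without creating a file, the text can be wrapped in io.StringIO; the two-row CSV below is made-up sample data.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file.
csv_text = "name,age\nAlice,25\nBob,30\n"

# read_csv() accepts any file-like object, not only a path.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.dtypes)
```

The age column is parsed as integers automatically, while name stays as text.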

Concatenation

The concat() function combines DataFrames either vertically (rows) or horizontally (columns).

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})
df_concat = pd.concat([df1, df2], axis=0)
print(df_concat)
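
The example above stacks rows (axis=0). A short sketch of the horizontal case, plus what happens to the index when stacking, could look like this:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})

# axis=1 places the frames side by side, aligning rows on the index.
side_by_side = pd.concat([df1, df2], axis=1)

# axis=0 keeps each frame's original index, so labels can repeat.
stacked = pd.concat([df1, df1], axis=0)

print(side_by_side)
print(stacked.index.tolist())
```

Passing ignore_index=True when stacking rows produces a fresh 0..n-1 index instead of repeated labels.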

Data Operations

Operations

Pandas offers powerful functions for data operations such as grouping, aggregating, filtering, and reshaping data.

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B'], 'Value': [10, 20, 30, 40]})
grouped_df = df.groupby('Category').sum()
print(grouped_df)

Data Processing

Data processing in Pandas often involves cleaning, transforming, or filtering the data to make it ready for analysis.

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
processed_df = df[df['Age'] > 25]
print(processed_df)

DataFrame.corr()

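The corr() method computes the pairwise correlation between the columns of a DataFrame, using the Pearson coefficient by default. Each value ranges from -1 to 1 and measures the strength of the linear relationship between two columns.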

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
correlation = df.corr()
print(correlation)

DataFrame.dropna()

The dropna() method is used to remove rows or columns that contain missing values (NaNs).

df = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6]})
clean_df = df.dropna()
print(clean_df)
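
dropna() takes a few parameters that control what gets dropped. The variations below, on the same data, are a sketch of the most common ones.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6]})

# Default: drop any row that contains at least one missing value.
any_missing = df.dropna()

# Only consider column 'A' when deciding which rows to drop.
subset_a = df.dropna(subset=['A'])

# Drop a row only if every value in it is missing.
all_missing = df.dropna(how='all')

# Drop columns (axis=1) that contain any missing value.
cols_dropped = df.dropna(axis=1)

print(len(any_missing), len(subset_a), len(all_missing), cols_dropped.shape)
```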

DataFrame.fillna()

The fillna() function is used to fill missing values (NaN) with a specified value.

df = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6]})
filled_df = df.fillna(0)
print(filled_df)
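
A single fill value is not always appropriate for every column. fillna() also accepts a dict keyed by column name; the choice of the column mean for A and 0 for B below is arbitrary and only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6]})

# Fill each column with its own value: the mean of A, and 0 for B.
filled = df.fillna({'A': df['A'].mean(), 'B': 0})
print(filled)
```

mean() skips missing values by default, so the mean of A here is computed from 1 and 2 only.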

DataFrame.replace()

Use replace() to replace values in the DataFrame with another value.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df_replaced = df.replace(1, 100)
print(df_replaced)

DataFrame.iloc[]

iloc[] is an indexer (used with square brackets rather than called like a method) for integer-location-based indexing. It selects rows and columns by their integer positions.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
row_1 = df.iloc[1]  # Selects the second row
print(row_1)

DataFrame.isin()

The isin() method checks if each element in the DataFrame is contained in a list or array.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
result = df.isin([1, 5])
print(result)

DataFrame.loc[]

loc[] allows you to select rows and columns based on labels (index names), not integer positions.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])
row_y = df.loc['y']  # Selects the row with index 'y'
print(row_y)

loc vs iloc

- loc[] is label-based, meaning you provide the index labels for rows and columns.
- iloc[] is position-based, meaning you provide the integer index positions.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])
row_x_loc = df.loc['x']  # label-based selection
row_0_iloc = df.iloc[0]  # position-based selection
print(row_x_loc)
print(row_0_iloc)
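
One practical difference shows up when slicing: loc[] includes the end label, while iloc[] excludes the end position, like ordinary Python slices. A small sketch with the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

# Label slice: both endpoints are included, so this returns rows 'x' and 'y'.
by_label = df.loc['x':'y']

# Position slice: the end is excluded, so this returns only the first row.
by_position = df.iloc[0:1]

print(by_label)
print(by_position)
```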

Pandas Cheat Sheet

Cheat Sheet

A cheat sheet is a quick reference guide that lists common Pandas methods and syntax.

You can access various Pandas cheat sheets online to streamline your coding process and enhance productivity.

Pandas Index

Index

An Index object in Pandas holds the labels for rows or columns. These labels are what loc[] selects on and what Pandas uses to align data between objects.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])
index = df.index  # Get the index of the DataFrame
print(index)

Multiple Index

A MultiIndex gives rows or columns hierarchical (multi-level) labels, making it easier to handle complex data. Passing a list of columns to set_index() creates one.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df = df.set_index(['A', 'B'])  # 'C' remains as a data column
print(df)
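
Once a hierarchical index is in place, loc[] can select by the outer level alone or by a full tuple of labels. The Region, Year, and Sales columns below are hypothetical names used only for this sketch.

```python
import pandas as pd

# Hypothetical sales data, not from the original tutorial.
df = pd.DataFrame({
    'Region': ['East', 'East', 'West'],
    'Year': [2023, 2024, 2023],
    'Sales': [100, 150, 200],
})
indexed = df.set_index(['Region', 'Year'])

# Outer level only: all rows for 'East', now indexed by Year.
east = indexed.loc['East']

# Full key as a tuple, plus a column label: a single value.
east_2024 = indexed.loc[('East', 2024), 'Sales']

print(east)
print(east_2024)
```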

Reindex

The reindex() method allows you to change the index of a DataFrame or Series, making it easier to reorder data or add/remove labels.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
reindexed_df = df.reindex([2, 0, 1])
print(reindexed_df)
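
reindex() can also introduce labels that do not exist yet. As a sketch, label 5 below is not in the original index, so its row is filled with NaN unless fill_value is given.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Label 5 is new: its row is all NaN. Label 2 is omitted, so it is dropped.
with_nan = df.reindex([0, 1, 5])

# fill_value supplies a replacement for the new rows instead of NaN.
with_fill = df.reindex([0, 1, 5], fill_value=0)

print(with_nan)
print(with_fill)
```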

Reset Index

The reset_index() method replaces the index with the default integer index and moves the current index into a column.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
reset_df = df.reset_index()
print(reset_df)

Set Index

The set_index() method sets a specified column as the index for the DataFrame.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df_with_index = df.set_index('A')
print(df_with_index)

Pandas and NumPy

Pandas and NumPy work well together, and many Pandas functions rely on NumPy operations. You can convert DataFrames or Series to NumPy arrays for high-performance numerical computations.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
numpy_array = df.to_numpy()  # Convert DataFrame to NumPy array
print(numpy_array)

Boolean Indexing

Boolean indexing allows you to filter data based on conditions, returning rows that match the condition.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
filtered_df = df[df['A'] > 1]  # Select rows where 'A' is greater than 1
print(filtered_df)

Concatenating Data

The concat() function is used to concatenate two or more DataFrames along rows or columns.

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})
concatenated_df = pd.concat([df1, df2], ignore_index=True)
print(concatenated_df)

Pandas vs NumPy

- NumPy is a powerful library for numerical computations with high-performance array operations.
- Pandas is built on top of NumPy and provides powerful tools for data manipulation, analysis, and handling structured data like time series, missing values, and more.

Pandas Time Series

Time Series

Pandas has strong support for time series data, including easy conversion to datetime objects and resampling capabilities.

date_range = pd.date_range('2020-01-01', periods=5, freq='D')
df = pd.DataFrame({'Date': date_range, 'Value': [10, 20, 30, 40, 50]})
print(df)
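
Resampling, mentioned above, needs a datetime index. A minimal sketch that sums the same five daily values into two-day buckets:

```python
import pandas as pd

date_range = pd.date_range('2020-01-01', periods=5, freq='D')
df = pd.DataFrame({'Date': date_range, 'Value': [10, 20, 30, 40, 50]})

# resample() groups rows by time bucket; here each bucket spans two days.
two_day = df.set_index('Date')['Value'].resample('2D').sum()
print(two_day)
```

The last bucket contains only one day of data, so its total is just that day's value.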

Datetime

The to_datetime() function in Pandas is used to convert string representations of dates into Pandas datetime objects.

df = pd.DataFrame({'Date': ['2020-01-01', '2020-02-01']})
df['Date'] = pd.to_datetime(df['Date'])
print(df)

Time Offset

Pandas supports time offsets, which can be used to shift dates by a certain amount. The Timedelta object is useful for this.

date = pd.to_datetime('2024-01-01')
offset = pd.Timedelta(days=5)
new_date = date + offset  # Adds 5 days
print(new_date)
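
Timedelta represents a fixed duration. For calendar-aware shifts such as "one month later", pandas also provides pd.DateOffset. The comparison below is a small illustration of how the two differ at a month end.

```python
import pandas as pd

date = pd.to_datetime('2024-01-31')

# Fixed duration: exactly 30 days later.
fixed = date + pd.Timedelta(days=30)

# Calendar-aware: one month later, clipped to the last valid day of February.
calendar = date + pd.DateOffset(months=1)

print(fixed)     # 2024-03-01 00:00:00
print(calendar)  # 2024-02-29 00:00:00
```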

Time Periods

Time periods represent spans of time (such as a week or month). The Period object can be used to work with periods.

period = pd.Period('2024-01', freq='M')  # Monthly period
next_period = period + 1  # Next period (February 2024)
print(next_period)
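
Because a Period is a span rather than a single instant, it exposes its boundaries. A short sketch using the same monthly period:

```python
import pandas as pd

period = pd.Period('2024-01', freq='M')

# The first and last instants covered by the period.
start = period.start_time
end = period.end_time

print(start)
print(end)
print(period + 1)  # 2024-02
```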

Convert String to Date

Use pd.to_datetime() to convert string representations of dates into Pandas datetime objects.

date_str = '2024-01-01'
date = pd.to_datetime(date_str)
print(date)

Pandas Plot

Plot

Pandas has built-in plotting functionality. You can use the plot() method to create various types of plots, such as line plots, bar plots, and histograms, using DataFrame or Series data.

import matplotlib.pyplot as plt
df = pd.DataFrame({'Date': pd.date_range('2024-01-01', periods=5, freq='D'), 'Value': [10, 20, 15, 30, 25]})
df.set_index('Date').plot(title='Sample Time Series')
plt.show()  # displays the figure when running outside a notebook

The plot() method uses the DataFrame's index for the x-axis, so setting Date as the index puts the dates along the x-axis.

Multiple Plot

You can also draw each column on its own axes within one figure by passing subplots=True.

import matplotlib.pyplot as plt
df = pd.DataFrame({'Date': pd.date_range('2024-01-01', periods=5, freq='D'), 'Value1': [10, 20, 30, 40, 50], 'Value2': [5, 10, 15, 20, 25]})
df.set_index('Date').plot(subplots=True, layout=(2, 1), figsize=(10, 6))
plt.show()  # displays the figure when running outside a notebook