Initial Dataset Exploration and Analysis
The following steps will give you a comprehensive understanding of the dataset’s structure, content, and potential issues before diving into deeper analysis or model-building.
Load the Data Efficiently
Load only a portion of the dataset first, especially if it’s large, to avoid overwhelming memory:
import pandas as pd
# Load only the first 100 rows (not a random sample)
df = pd.read_csv('your_file.csv', nrows=100)
You can also load in chunks if the file is very large:
chunk_size = 10000
for chunk in pd.read_csv('your_file.csv', chunksize=chunk_size):
    # Process each chunk here
    pass
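As a minimal sketch of what "process each chunk" might look like, the example below keeps running totals across chunks so the full file never has to sit in memory at once. The in-memory CSV, the `amount` column, and the tiny chunk size are all hypothetical stand-ins for your own file and settings.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for 'your_file.csv'
csv_data = io.StringIO("id,amount\n1,10\n2,20\n3,30\n4,40\n5,50\n")

total_rows = 0
total_amount = 0

# chunksize=2 is deliberately tiny; a real file would use thousands of rows
for chunk in pd.read_csv(csv_data, chunksize=2):
    total_rows += len(chunk)               # each chunk is an ordinary DataFrame
    total_amount += chunk["amount"].sum()  # aggregate, then let the chunk go

print(total_rows, total_amount)
```

Only the running totals survive each iteration, which is what keeps memory use flat regardless of file size.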
Get an Overview of the Data
Use methods to quickly assess the structure:
First and last few rows:
df.head() # View first 5 rows
df.tail() # View last 5 rows
Summary of data types and non-null counts:
df.info()
Column names:
df.columns
Number of rows and columns:
df.shape
Check for Missing Data
Missing data can affect analysis and model performance:
# Count missing values per column
df.isnull().sum()
# Percentage of missing values per column (mean() alone gives a 0-1 fraction)
df.isnull().mean() * 100
Remove rows with null values:
# Remove every row that contains a null value in any column
df = df.dropna()
# Drop rows with null values in a specific column
df = df.dropna(subset=['column'])
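Before dropping anything, it can help to see the counts and percentages side by side. The sketch below builds one small summary table; the toy DataFrame and the names `df_missing` and `missing_summary` are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two of three columns
df_missing = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", None, None, "y"],
    "c": [1, 2, 3, 4],
})

# One row per column: how many values are missing, and what share that is
missing_summary = pd.DataFrame({
    "missing_count": df_missing.isnull().sum(),
    "missing_pct": df_missing.isnull().mean() * 100,
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```

Sorting by percentage puts the worst columns first, which makes it easier to decide whether dropping rows, dropping a column, or imputing is the more reasonable choice.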
Look at Descriptive Statistics
Get an overall sense of the data’s distribution:
# Summary statistics for numeric columns
df.describe()
# Summary for all columns, including categorical
df.describe(include='all')
Examine Data Types
Ensure the data types (e.g., numeric, categorical) are correct:
df.dtypes
If you notice inconsistencies (e.g., a column with numbers stored as strings), you might want to convert them:
# Convert to numeric, with errors set to NaN
df['column'] = pd.to_numeric(df['column'], errors='coerce')
Convert a column to another type (a 64-bit integer in this case). The column must not contain missing values, or the conversion raises an error:
df['column'] = df['column'].astype('int64')
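The two conversions above can interact badly: coercing bad values to NaN leaves a float column that a plain integer cast rejects. One possible workaround, sketched here on a made-up Series, is pandas' nullable `Int64` dtype (capital "I"), which can hold missing values alongside integers.

```python
import pandas as pd

# Hypothetical column of numbers stored as strings, with one bad entry
raw = pd.Series(["1", "2", "oops", "4"])

# "oops" becomes NaN, so the result is float64 rather than an integer type
numeric = pd.to_numeric(raw, errors="coerce")

try:
    numeric.astype("int64")
    cast_failed = False
except ValueError:
    # NaN cannot be represented in a plain int64 column
    cast_failed = True

# The nullable integer dtype keeps the gap as <NA> instead of failing
as_nullable_int = numeric.astype("Int64")
print(as_nullable_int)
```

Whether to keep the missing entry, fill it, or drop the row depends on the analysis; the nullable dtype only postpones that decision without losing the integer type.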
Handle Duplicates
Check for any duplicate rows that may need to be removed:
# Number of duplicate rows
df.duplicated().sum()
# Remove duplicates (keeps the first occurrence of each row)
df = df.drop_duplicates()
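Rows are sometimes duplicates only with respect to a key column rather than across every column. The sketch below uses the `subset` and `keep` parameters for that case; the `user_id` and `email` columns are hypothetical.

```python
import pandas as pd

# Hypothetical data: user 1 appears twice identically, user 3 twice with different emails
df_dup = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com", "c2@x.com"],
})

# Rows identical in every column
full_dupes = df_dup.duplicated().sum()

# Rows that repeat an earlier user_id, whatever the other columns hold
key_dupes = df_dup.duplicated(subset=["user_id"]).sum()

# Keep only the first row seen for each user_id
deduped = df_dup.drop_duplicates(subset=["user_id"], keep="first")

print(full_dupes, key_dupes, len(deduped))
```

Comparing the two counts is a quick way to spot records that share a key but disagree elsewhere, which usually deserves a closer look before deleting anything.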
Examine Unique Values in Categorical Columns
For categorical data, check the number of unique values and their distribution:
# Distribution of values
df['column'].value_counts()
# Number of unique values
df['column'].nunique()
Count the number of occurrences of a certain value in a column:
count = (df['column'] == 'value').sum()
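Raw counts are often easier to read as proportions, and missing values are silently excluded by default. A small sketch, using a made-up Series of colours, shows the `normalize` and `dropna` parameters of `value_counts`:

```python
import pandas as pd

# Hypothetical categorical column with one missing entry
colours = pd.Series(["red", "blue", "red", None, "red", "blue"])

# normalize=True returns shares instead of counts;
# dropna=False keeps missing values as their own category
shares = colours.value_counts(normalize=True, dropna=False)

print(shares)
```

Including the missing category keeps the shares summing to 1 over all rows, so a rare level is not mistaken for a common one.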
Check for Outliers
Use basic visualizations to check for outliers in numeric columns:
import matplotlib.pyplot as plt
df.boxplot(column='numeric_column')
plt.show()
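A box plot flags points beyond 1.5 times the interquartile range (IQR) from the quartiles, and the same rule can be applied numerically to list the flagged values. This is a minimal sketch on made-up numbers; the 1.5 multiplier is the usual convention, not a requirement.

```python
import pandas as pd

# Hypothetical numeric column with one obviously extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences used by the conventional box plot rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

A flagged value is only a candidate for review; whether it is an error or a genuine extreme depends on the data.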
Inspect Data Relationships
Check correlations between numerical columns to identify potential relationships:
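# numeric_only=True skips non-numeric columns, which otherwise raise an error in pandas 2.0+
df.corr(numeric_only=True)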
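On a wide dataset the full correlation matrix is hard to scan, so one option is to pull out only the strongly correlated pairs. The sketch below assumes a made-up DataFrame and an arbitrary 0.9 threshold, and selects numeric columns explicitly with `select_dtypes` so that text columns are left out.

```python
import pandas as pd

# Hypothetical data: y is exactly 2 * x, z is only loosely related
df_corr = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
    "label": list("abcde"),  # non-numeric, excluded below
})

# Pearson correlation over numeric columns only
corr = df_corr.select_dtypes(include="number").corr()

# Each unordered pair once, kept only if |r| exceeds the assumed threshold
threshold = 0.9
strong_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(strong_pairs)
```

Correlation only measures linear association, so a low value does not rule out a relationship.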
Visualize the Data
Simple plots can give you insights into patterns:
Histograms for distributions:
df['numeric_column'].hist()
plt.show()
Bar charts for categorical data:
df['categorical_column'].value_counts().plot(kind='bar')
plt.show()
Memory Usage
For large datasets, monitor memory usage to avoid crashes:
# Total memory usage in bytes
df.memory_usage(deep=True).sum()
Consider optimizing memory by converting data types (e.g., converting 64-bit integer columns to smaller types when the values allow it):
# Convert to a smaller int type (safe only if every value fits in the int32 range, about ±2.1 billion)
df['column'] = df['column'].astype('int32')
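Rather than choosing a target type by hand, pandas can pick the smallest integer type that fits the data, and repetitive text columns can often be stored as categories. The sketch below measures memory before and after on a made-up DataFrame; the column names and sizes are assumptions.

```python
import pandas as pd

# Hypothetical data: small integers stored as int64, and a text column with few distinct values
df_mem = pd.DataFrame({
    "count": pd.Series(range(1000), dtype="int64"),
    "city": ["Paris", "Lima", "Oslo", "Cairo"] * 250,
})

before = df_mem.memory_usage(deep=True).sum()

# downcast="integer" chooses the smallest signed integer type that holds every value
df_mem["count"] = pd.to_numeric(df_mem["count"], downcast="integer")

# category stores each distinct string once plus a compact integer code per row
df_mem["city"] = df_mem["city"].astype("category")

after = df_mem.memory_usage(deep=True).sum()
print(before, after, df_mem.dtypes.to_dict())
```

The category conversion pays off mainly when a column has few distinct values relative to its length; for mostly unique strings it can use more memory than it saves.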