Initial Dataset Exploration and Analysis

The following steps give you a working understanding of the dataset’s structure, content, and potential issues before diving into deeper analysis or model-building.

Load the Data Efficiently

Load only a portion of the dataset first, especially if it’s large, to avoid overwhelming memory:

   import pandas as pd

   # Load a sample of 100 rows
   df = pd.read_csv('your_file.csv', nrows=100)

You can also load in chunks if the file is very large:

   chunk_size = 10000
   for chunk in pd.read_csv('your_file.csv', chunksize=chunk_size):
       # Process each chunk here
       pass
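
As a minimal sketch of what "process each chunk" can look like, the example below accumulates a row count and a column total across chunks. The in-memory CSV and the column names (id, amount) are hypothetical stand-ins for a large file on disk:

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk
csv_data = io.StringIO("id,amount\n1,10\n2,20\n3,30\n4,40\n5,50\n")

total_rows = 0
total_amount = 0

# Read two rows at a time and keep only running totals in memory
for chunk in pd.read_csv(csv_data, chunksize=2):
    total_rows += len(chunk)
    total_amount += chunk["amount"].sum()

print(total_rows, total_amount)
```

Only one chunk is held in memory at a time, so this pattern suits aggregates that can be built up incrementally (counts, sums, minimums, maximums).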

Get an Overview of the Data

Use these DataFrame methods and attributes to quickly assess the structure:

First and last few rows:

   df.head()  # View first 5 rows
   df.tail()  # View last 5 rows

Summary of data types and non-null counts:

   df.info()

Column names:

   df.columns

Number of rows and columns:

   df.shape

Check for Missing Data

Missing data can affect analysis and model performance:

   # Count missing values per column
   df.isnull().sum()
   
   # Percentage of missing values per column
   df.isnull().mean() * 100

Remove rows with null values:

   # Remove all rows where any column contains a null value
   df = df.dropna()

   # Drop rows with null values in a specific column
   df = df.dropna(subset=['column'])
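
Before dropping anything, it can help to see counts and percentages side by side and to check how many rows each dropna variant would keep. The sketch below uses a small made-up DataFrame; the column names (age, city) are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in both columns
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Oslo", "Lima", None, "Pune"],
})

# One table with the count and the percentage of missing values per column
missing_summary = pd.DataFrame({
    "missing_count": df.isnull().sum(),
    "missing_pct": df.isnull().mean() * 100,
}).sort_values("missing_pct", ascending=False)
print(missing_summary)

# Compare how many rows survive each strategy before committing to one
rows_any = len(df.dropna())                # rows with no nulls at all
rows_age = len(df.dropna(subset=["age"]))  # rows with a non-null age
print(rows_any, rows_age)
```

Dropping on every column can discard far more rows than dropping on the columns you actually need, so comparing the two counts first is a cheap safeguard.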

Look at Descriptive Statistics

Get an overall sense of the data’s distribution:

   # Summary statistics for numeric columns
   df.describe()
   
   # Summary for all columns, including categorical
   df.describe(include='all')

Examine Data Types

Ensure the data types (e.g., numeric, categorical) are correct:

   df.dtypes

If you notice inconsistencies (e.g., a column with numbers stored as strings), convert the affected columns:

   # Convert to numeric; values that cannot be parsed become NaN
   df['column'] = pd.to_numeric(df['column'], errors='coerce')

Convert a column to another type (a 64-bit integer in this case). The conversion raises an error if the column still contains NaN values, so handle missing data first:

   df['column'] = df['column'].astype('int64')
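
If a column has unparseable entries but you still want integer values, one option is pandas' nullable Int64 type, which can hold missing values. This is a sketch on made-up data, not the only way to handle the conversion:

```python
import pandas as pd

# Hypothetical column of numbers stored as strings, with one bad entry
df = pd.DataFrame({"column": ["1", "2", "abc", "4"]})

# Coerce to numeric: "abc" becomes NaN and the result is float64
numeric = pd.to_numeric(df["column"], errors="coerce")

# Count how many values failed to parse before overwriting the column
n_failed = int(numeric.isna().sum())

# Nullable integer type (capital "I") keeps the gap as <NA>
df["column"] = numeric.astype("Int64")

print(n_failed)
print(df.dtypes)
```

Checking the number of failed conversions first makes it obvious when errors='coerce' is silently turning a large share of the column into missing values.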

Handle Duplicates

Check for any duplicate rows that may need to be removed:

   # Number of duplicate rows
   df.duplicated().sum()
   
   # Remove duplicates, keeping the first occurrence of each row
   df = df.drop_duplicates()
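
Rows are sometimes duplicated only on a key column rather than across every column. The sketch below, on a made-up DataFrame with hypothetical customer_id and email columns, shows the subset and keep parameters for that case:

```python
import pandas as pd

# Hypothetical data: customer 1 is fully duplicated, customer 3 differs by email
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com", "c@y.com"],
})

# Rows identical across every column
full_dupes = int(df.duplicated().sum())

# Rows that repeat an earlier customer_id, whatever the other columns hold
key_dupes = int(df.duplicated(subset=["customer_id"]).sum())

# keep=False marks every member of a duplicate group, which helps inspection
repeated = df[df.duplicated(subset=["customer_id"], keep=False)]

# Keep the first row for each customer_id
deduped = df.drop_duplicates(subset=["customer_id"], keep="first")

print(full_dupes, key_dupes, len(repeated), len(deduped))
```

Inspecting the repeated rows before dropping them shows whether the duplicates are true copies or conflicting records that need a decision.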

Examine Unique Values in Categorical Columns

For categorical data, check the number of unique values and their distribution:

   # Distribution of values
   df['column'].value_counts()
   
   # Number of unique values
   df['column'].nunique()

Count the number of occurrences of a certain value in a column:

   count = (df['column'] == 'value').sum()
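
By default, value_counts() and nunique() leave out missing values. The sketch below, on a small made-up Series, shows how to get proportions instead of raw counts and how to include missing values in the tally:

```python
import pandas as pd

# Hypothetical categorical column with one missing value
s = pd.Series(["red", "blue", "red", None, "red"], name="color")

# Proportions among non-missing values
shares = s.value_counts(normalize=True)

# Raw counts with missing values shown as their own entry
counts_with_missing = s.value_counts(dropna=False)

# Number of distinct non-missing values
n_unique = s.nunique()

print(shares)
print(counts_with_missing)
print(n_unique)
```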

Check for Outliers

Use basic visualizations to check for outliers in numeric columns:

   import matplotlib.pyplot as plt
   df.boxplot(column='numeric_column')
   plt.show()
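
To list the outlying rows rather than only see them, you can apply the same 1.5 × IQR rule that box plot whiskers use by default. This is a sketch on made-up values; the 1.5 multiplier is a convention, not a hard threshold:

```python
import pandas as pd

# Hypothetical numeric column with one obviously extreme value
df = pd.DataFrame({"numeric_column": [10, 12, 11, 13, 12, 11, 95]})

q1 = df["numeric_column"].quantile(0.25)
q3 = df["numeric_column"].quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = df[(df["numeric_column"] < lower) | (df["numeric_column"] > upper)]
print(outliers)
```

Flagged rows are candidates for review, not automatic removal: an extreme value may be a data-entry error or a genuine observation.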

Inspect Data Relationships

Check correlations between numerical columns to identify potential relationships:

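   # numeric_only=True skips non-numeric columns, which otherwise raise an error in pandas 2.0+
   df.corr(numeric_only=True)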
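
As a small illustration of correlation on a mixed-type DataFrame, the sketch below uses made-up columns (height, weight, label) and restricts the calculation to the numeric ones:

```python
import pandas as pd

# Hypothetical data: two numeric columns and one text column
df = pd.DataFrame({
    "height": [150, 160, 170, 180],
    "weight": [50, 60, 70, 80],
    "label": ["a", "b", "c", "d"],
})

# Pearson correlation over numeric columns only; "label" is left out
corr = df.corr(numeric_only=True)
print(corr)
```

The result is a square table indexed by column name, so corr.loc['height', 'weight'] reads off a single pair. A high coefficient indicates a linear association only and says nothing about causation.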

Visualize the Data

Simple plots can give you insights into patterns:

Histograms for distributions:

   df['numeric_column'].hist()
   plt.show()

Bar charts for categorical data:

   df['categorical_column'].value_counts().plot(kind='bar')
   plt.show()

Memory Usage

For large datasets, monitor memory usage to avoid crashes:

   # Total memory usage in bytes
   df.memory_usage(deep=True).sum()

Consider reducing memory by converting data types (e.g., converting 64-bit integer columns to a smaller type when every value fits):

   # Convert to a smaller int type (values must fit within the int32 range)
   df['column'] = df['column'].astype('int32')
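
Rather than picking a target type by hand, you can let pandas choose the smallest integer type that fits, and store repetitive text as a categorical. The sketch below uses made-up data and hypothetical column names (quantity, status); actual savings depend on your data and pandas version:

```python
import pandas as pd

# Hypothetical data: small integers and a text column with few distinct values
df = pd.DataFrame({
    "quantity": [1, 2, 3, 4] * 250,
    "status": ["new", "open", "held", "done"] * 250,
})

before = df.memory_usage(deep=True).sum()

# downcast="integer" picks the smallest signed integer type that holds all values
df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")

# A categorical stores each distinct string once plus a compact integer code per row
df["status"] = df["status"].astype("category")

after = df.memory_usage(deep=True).sum()
print(before, after)
print(df.dtypes)
```

Categoricals pay off when a column has few distinct values relative to its length; for columns of mostly unique strings they can use more memory than the original.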