3. Exploratory Data Analysis

Python
Machine Learning
Data Visualization
Published: September 21, 2025

Loading the Data

Now that we have our processed dataset from part 2, we can start exploring the data to understand the relationships between features and identify any patterns that might help us build a better predictive model. First, let's load the processed dataset and define which features are numeric, since we'll be focusing on them for most of our analysis.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

data = pd.read_csv('data/processed_pokemon_cards.csv')

numeric_features = [
    'hp', 'level', 'convertedRetreatCost', 'number', 'primary_pokedex_number',
    'pokemon_count', 'total_weakness_multiplier', 'total_weakness_modifier',
    'total_resistance_multiplier', 'total_resistance_modifier',
    'pokedex_frequency', 'artist_frequency', 'ability_count', 'attack_count',
    'max_damage', 'attack_cost'
]

Overview of the Dataset

Before we dive into visualizations, let's get a high-level overview of our dataset structure and see what we're working with.

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4470 entries, 0 to 4469
Columns: 111 entries, level to has_ancient_trait
dtypes: float64(34), int64(75), object(2)
memory usage: 3.8+ MB
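
Because the frame has 111 columns, info() falls back to the compact summary above rather than listing every column. A quick follow-up, reusing the loaded data frame, shows which columns actually contain missing values:

# Count missing values per column and keep only the columns that have any
missing = data.isna().sum()
missing[missing > 0].sort_values(ascending=False)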

This gives us the data types, memory footprint, and a sense of where values are missing. Now let's look at the summary statistics for our numeric features to understand the distribution and range of values.

data[numeric_features].describe().T
                               count        mean         std    min   25%   50%    75%    max
hp                            4470.0   98.158837   61.409467   30.0  60.0  80.0  110.0  380.0
level                         1147.0   27.697472   15.783715    5.0  14.0  26.0   36.0  100.0
convertedRetreatCost          4470.0    1.537360    0.927594    0.0   1.0   1.0    2.0    5.0
number                        4470.0   64.167785   54.347535    1.0  24.0  51.0   88.0  304.0
primary_pokedex_number        4470.0   79.379418   53.447625    1.0  37.0  79.0  121.0  474.0
pokemon_count                 4470.0    1.014765    0.127830    1.0   1.0   1.0    1.0    3.0
total_weakness_multiplier     4470.0    1.826846    0.570393    0.0   2.0   2.0    2.0    4.0
total_weakness_modifier       4470.0    1.120805    4.660906    0.0   0.0   0.0    0.0   40.0
total_resistance_multiplier   4470.0    0.000000    0.000000    0.0   0.0   0.0    0.0    0.0
total_resistance_modifier     4470.0   -6.241611   11.439853  -60.0   0.0   0.0    0.0    0.0
pokedex_frequency             4470.0   39.108277   30.512610    4.0  24.0  30.0   45.0  168.0
artist_frequency              4289.0  130.888552  148.742955    1.0  21.0  57.0  215.0  471.0
ability_count                 4470.0    0.207606    0.406740    0.0   0.0   0.0    0.0    2.0
attack_count                  4470.0    1.654362    0.533293    0.0   1.0   2.0    2.0    4.0
max_damage                    4470.0   53.957494   54.109789    0.0  20.0  30.0   70.0  330.0
attack_cost                   4470.0    3.356823    1.726154    0.0   2.0   3.0    4.0   13.0

Visualizing Distributions

To better understand how our features are distributed, we can create box plots and histograms. These visualizations help us identify outliers, skewness, and the overall shape of the data. Let's focus on a few key features that we expect to be important for predicting hit points.

focus_columns = [
  'level', 'max_damage', 'convertedRetreatCost', 'attack_count', 'pokemon_count', 'number'
]

fig, axes = plt.subplots(4, 4, figsize=(16, 12))
axes = axes.flatten()

for i, feature in enumerate(focus_columns):
    axes[i].boxplot(data[feature].dropna())
    axes[i].set_title(feature)
    axes[i].set_ylabel('Value')

for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

These box plots help us identify outliers and understand the spread of values for each feature. Some features have a few extreme values that may need special attention during modeling; a quick way to count them is sketched below. After that check, let's look at the same features as histograms to see the actual shape of each distribution.
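
As a rough way to quantify those extreme values, here is a small sketch using the standard 1.5 × IQR rule, which is also the default rule Matplotlib uses for the box plot whiskers:

# Count values falling outside the 1.5 * IQR whiskers for each focus feature
for feature in focus_columns:
    values = data[feature].dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((values < lower) | (values > upper)).sum()
    print(f'{feature}: {n_outliers} values outside the whiskers')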

fig, axes = plt.subplots(4, 4, figsize=(16, 12))
axes = axes.flatten()

for i, feature in enumerate(focus_columns):
    axes[i].hist(data[feature].dropna(), bins=30, edgecolor='black')
    axes[i].set_title(feature)
    axes[i].set_xlabel('Value')
    axes[i].set_ylabel('Frequency')

for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

The histograms show how the values are distributed. Some features, such as attack_count and pokemon_count, take only a handful of discrete values, while others, such as level and max_damage, spread across a much wider range.

Since hp is our target variable, let's take a closer look at its distribution with a side-by-side comparison.

feature = 'hp'

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].boxplot(data[feature].dropna())
axes[0].set_title(f'Box Plot of {feature}')
axes[0].set_ylabel('Value')

axes[1].hist(data[feature].dropna(), bins=30, edgecolor='black')
axes[1].set_title(f'Histogram of {feature}')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

From these plots, we can see that HP values tend to cluster around certain values, which makes sense since Pokemon cards typically have HP values in multiples of 10. This might be something to consider when building our model.
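
We can sanity-check that observation directly by measuring what fraction of HP values are exact multiples of 10 and listing the most common values:

# Fraction of cards whose HP is an exact multiple of 10
print((data['hp'] % 10 == 0).mean())

# Ten most common HP values
print(data['hp'].value_counts().head(10))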

Feature Correlations

Understanding how features relate to each other and to our target variable is crucial. Let's create a correlation heatmap to visualize these relationships.

corr_df = data[numeric_features]

corr_matrix = corr_df.corr()

plt.figure(figsize=(14, 10))

sns.heatmap(
    corr_matrix,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    linewidths=0.5
)

plt.title('Correlation Heatmap of Features')
plt.show()

This correlation heatmap shows us several interesting relationships:

  • Features that are strongly correlated with hp (our target) will be important for our model
  • Features that are highly correlated with each other might cause multicollinearity issues
  • We can identify which features might be redundant or provide little unique information

We should pay special attention to features that have strong positive or negative correlations with hp, as these will likely be the most predictive.
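
One way to make that concrete is to pull the hp column out of the correlation matrix and sort it by absolute value, so the strongest relationships, positive or negative, come first. This small snippet assumes pandas 1.1+ for the key argument to sort_values; the constant total_resistance_multiplier column will show up as NaN.

# Correlations with the target, strongest first by absolute value
hp_corr = corr_matrix['hp'].drop('hp')
hp_corr.sort_values(key=abs, ascending=False)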

Categorical Feature Distributions

Now let's look at the distribution of categorical features such as types and subtypes. These were one-hot encoded in part 2, so we can count how many cards fall into each category.

type_columns = data.filter(regex='^type_')
type_counts = type_columns.sum()
type_counts = type_counts.sort_values(ascending=False)

plt.figure(figsize=(12, 6))
plt.bar(range(len(type_counts)), type_counts.values)
plt.xticks(range(len(type_counts)), type_counts.index, rotation=45, ha='right')
plt.xlabel('Type')
plt.ylabel('Count')
plt.title('Distribution of Pokemon Card Types')
plt.tight_layout()
plt.show()
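
The raw counts are easier to compare when expressed as a share of all cards. A quick way to do that, reusing the type_counts computed above:

# Each type's share of the full dataset
(type_counts / len(data)).round(3)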

This gives us a sense of how balanced our dataset is across the different Pokemon types. If certain types are underrepresented, our model might not perform as well for those types. Let's also look at the distribution of subtypes.

subtype_columns = data.filter(regex='^subtype_')
subtype_counts = subtype_columns.sum()
subtype_counts = subtype_counts.sort_values(ascending=False)

plt.figure(figsize=(12, 6))
plt.bar(range(len(subtype_counts)), subtype_counts.values)
plt.xticks(range(len(subtype_counts)), subtype_counts.index, rotation=45, ha='right')
plt.xlabel('Subtype')
plt.ylabel('Count')
plt.title('Distribution of Pokemon Card Subtypes')
plt.tight_layout()
plt.show()

Key Takeaways

From this exploratory data analysis, we’ve learned several important things about our dataset:

  1. Feature Distributions: Most features have reasonable distributions, though some have outliers we might need to handle
  2. Target Variable: HP values are discrete and tend to fall into specific ranges
  3. Correlations: We’ve identified which features are most strongly correlated with HP
  4. Class Balance: We can see how our data is distributed across different types and subtypes

These insights will help guide our feature selection and modeling decisions in the next part of this project. We now have a solid understanding of what we’re working with and can make informed decisions about how to build our predictive model.