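Why Visualize Data?
Data visualization transforms abstract numbers into visual stories. A widely repeated (but poorly sourced) claim is that the human brain processes images 60,000× faster than text; whatever the exact figure, well-chosen visual encodings are read far faster than tables of numbers. Visualization helps us explore, analyze, and communicate data effectively.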
Three Purposes of Visualization
1. Exploratory: Discover patterns, trends, and anomalies in the data
2. Explanatory: Communicate findings to stakeholders clearly
3. Confirmatory: Verify hypotheses and validate models
Visual Perception & Pre-attentive Attributes
The human visual system can detect certain visual attributes almost instantly (< 250ms) without conscious effort. These are called pre-attentive attributes.
- Position: Most accurate for quantitative data (use X/Y axes)
- Length: Bar charts leverage this effectively
- Color Hue: Best for categorical distinctions
- Color Intensity: Good for gradients/magnitude
- Size: Bubble charts, but humans underestimate area
- Shape: Useful for categories, but limit to 5-7 shapes
- Orientation: Lines, angles
Cleveland & McGill's Accuracy Ranking
1. Position on common scale (bar chart)
2. Position on non-aligned scale (multiple axes)
3. Length (bar)
4. Angle, Slope
5. Area
6. Volume, Curvature
7. Color saturation, Color hue
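Under the Hood: The Weber-Fechner Law
Why are humans bad at comparing bubble sizes (area) but great at comparing bar chart heights (length/position)? Human perception of physical magnitudes follows a roughly logarithmic scale, not a linear one. Weber's law expresses this as a constant ratio between the smallest detectable change and the starting intensity:
$$\frac{\Delta I}{I} = k$$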
- $I$: Initial stimulus intensity (e.g., initial bubble area)
- $\Delta I$: Just Noticeable Difference (JND) required to perceive a change
- $k$: Weber's constant. For length/position $k \approx 0.03$ (very sensitive), but for area $k \approx 0.10$ to $0.20$ (very insensitive).
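As a rough illustration, Weber's law says the just noticeable difference is a fixed fraction of the stimulus, $\Delta I = k \cdot I$. The sketch below applies that to one length stimulus and one area stimulus. The function name `just_noticeable_difference` and the stimulus value of 100 are illustrative assumptions; the $k$ values are the ones quoted in the list above.

```python
def just_noticeable_difference(intensity: float, k: float) -> float:
    """Weber's law: the smallest detectable change is a fixed fraction k of the stimulus."""
    return k * intensity


# Hypothetical stimulus of 100 units, using the k values quoted in the text.
length_jnd = just_noticeable_difference(100.0, 0.03)  # bar length
area_jnd = just_noticeable_difference(100.0, 0.15)    # bubble area (midpoint of 0.10-0.20)

print(f"Length must change by about {length_jnd:.1f} units to be noticed")
print(f"Area must change by about {area_jnd:.1f} units to be noticed")
```

Under these assumed constants, a bubble has to change several times more than a bar before a reader notices, which is the practical argument for preferring length or position encodings.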
The Grammar of Graphics
The Grammar of Graphics (Wilkinson, 1999) is a framework for describing statistical graphics. It's the foundation of ggplot2 (R) and influences Seaborn, Altair, and Plotly.
- Data: The dataset being visualized
- Aesthetics (aes): Mapping data to visual properties (x, y, color, size)
- Geometries (geom): Visual elements (points, lines, bars, areas)
- Facets: Subplots by categorical variable
- Statistics: Transformations (binning, smoothing, aggregation)
- Coordinates: Cartesian, polar, map projections
- Themes: Non-data visual elements (fonts, backgrounds)
Under the Hood: Coordinate Transformations
When mapping data to visuals, the coordinate system applies a mathematical transformation. For example, converting standard Cartesian coordinates $(x, y)$ to polar coordinates $(r, \theta)$ to render a pie chart or Coxcomb plot:
$$r = \sqrt{x^2 + y^2}, \qquad \theta = \operatorname{atan2}(y, x)$$
This is why pie charts are computationally and perceptually different from bar charts: they apply a non-linear polar transformation to the linear data dimensions.
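A minimal sketch of the polar idea, using only the standard library. The helper names and the sample values are illustrative assumptions. It shows the Cartesian-to-polar conversion and how a pie chart turns category values into angles.

```python
import math


def cartesian_to_polar(x: float, y: float) -> tuple[float, float]:
    """Return (r, theta), with theta in radians measured from the positive x-axis."""
    return math.hypot(x, y), math.atan2(y, x)


def polar_to_cartesian(r: float, theta: float) -> tuple[float, float]:
    """Inverse mapping, used when a polar chart is finally drawn on screen."""
    return r * math.cos(theta), r * math.sin(theta)


r, theta = cartesian_to_polar(1.0, 1.0)
x_back, y_back = polar_to_cartesian(r, theta)

# A pie chart maps each value to an angle proportional to its share of the total.
values = [10, 30, 60]  # hypothetical category totals
angles = [2 * math.pi * v / sum(values) for v in values]

print(r, theta)
print([round(math.degrees(a), 1) for a in angles])
```

The angles always sum to a full circle, so a pie chart can only show parts of a whole, whereas a bar chart has no such constraint.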
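from plotnine import ggplot, aes, geom_point, geom_smooth, theme_minimal, labs
from plotnine.data import mpg  # sample dataset bundled with plotnine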
# Following Grammar of Graphics exactly:
# Data (mpg) -> Aesthetics (x,y,color) -> Geometries (point, smooth)
plot = (
    ggplot(mpg, aes(x='displ', y='hwy', color='class'))
    + geom_point(size=3, alpha=0.7)
    + geom_smooth(method='lm', se=False)  # Add a regression line per class
    + theme_minimal()  # Add theme
    + labs(title='Engine Displacement vs Highway MPG')
)
print(plot)
Choosing the Right Chart
The best visualization depends on your data type and question. Here's a decision guide:
One Variable (Univariate):
• Continuous: Histogram, KDE, Box plot, Violin plot
• Categorical: Bar chart, Count plot
Two Variables (Bivariate):
• Both Continuous: Scatter plot, Line chart, Hexbin, 2D histogram
• Continuous + Categorical: Box plot, Violin, Strip, Swarm
• Both Categorical: Heatmap, Grouped bar chart
Multiple Variables (Multivariate):
• Pair plot (scatterplot matrix)
• Parallel coordinates
• Heatmap correlation matrix
• Faceted plots (small multiples)
Common Chart Mistakes
Under the Hood: Information Entropy in Visuals
How much data can a chart "handle" before it becomes cluttered? We can use Shannon entropy ($H$) to quantify the visual information density. If a chart has $n$ visual marks (dots, lines) with probabilities $p_i$ of drawing attention:
$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$
Takeaway: If you add too many dimensions (color, size, shape simultaneously) on a single plot, the entropy $H$ can exceed the roughly 2.5 bits that human working memory handles comfortably, and readers tire quickly. This is one information-theoretic argument for "less is more" in dashboard design.
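To make the entropy argument concrete, here is a small sketch that computes $H = -\sum_i p_i \log_2 p_i$ for a few attention distributions. The probabilities are made-up illustrative values, not measurements of any real chart.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical attention distributions over the marks of a chart.
uniform_4 = shannon_entropy([0.25] * 4)               # four equally loud marks
uniform_8 = shannon_entropy([0.125] * 8)              # eight equally loud marks
focused = shannon_entropy([0.85, 0.05, 0.05, 0.05])   # one accent, three muted marks

print(f"4 equal marks: {uniform_4:.2f} bits")
print(f"8 equal marks: {uniform_8:.2f} bits")
print(f"1 accent + 3 muted: {focused:.2f} bits")
```

With the same number of marks, concentrating attention on a single accent lowers the entropy, which matches the design advice of muting everything except the element that carries the message.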
Matplotlib Figure Anatomy
Understanding Matplotlib's object hierarchy is key to creating professional visualizations.
Figure → Axes → Axis → Tick → Label
• Figure: The overall window/canvas
• Axes: The actual plot area (NOT the X/Y axis!)
• Axis: The X or Y axis with ticks and labels
• Artist: Everything visible (lines, text, patches)
Two Interfaces
1. Pyplot (state-based): Implicit, convenient for quick plots
plt.plot(x, y)
plt.xlabel('Time')
plt.show()
2. Object-Oriented (OO): Explicit, recommended for complex plots
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel('Time')
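Under the Hood: Affine Transformations
How does Matplotlib convert your data coordinates (e.g., $x \in [0, 1000]$) into physical pixels on your screen? It chains a pipeline of affine transformation matrices, each acting on homogeneous coordinates:
$$\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = T \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}, \qquad T = \begin{pmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \\ 0 & 0 & 1 \end{pmatrix}$$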
This matrix $T$ scales ($s_x, s_y$) and translates ($t_x, t_y$) data points. The transformation pipeline is: Data $\rightarrow$ Axes (relative 0-1) $\rightarrow$ Figure (inches) $\rightarrow$ Display (pixels based on DPI).
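The following sketch applies a single scale-and-translate matrix by hand, assuming a hypothetical 800×600 pixel canvas and made-up data ranges. It illustrates the idea only; it is not Matplotlib's internal code, and `make_affine` and `apply_affine` are invented helper names.

```python
def make_affine(sx, sy, tx, ty):
    """3x3 homogeneous matrix that scales by (sx, sy) and then translates by (tx, ty)."""
    return [
        [sx, 0.0, tx],
        [0.0, sy, ty],
        [0.0, 0.0, 1.0],
    ]


def apply_affine(T, x, y):
    """Multiply T by the homogeneous column vector (x, y, 1)."""
    vec = (x, y, 1.0)
    x_new = sum(T[0][j] * vec[j] for j in range(3))
    y_new = sum(T[1][j] * vec[j] for j in range(3))
    return x_new, y_new


# Assumed setup: x in [0, 1000] and y in [0, 50] mapped onto 800x600 pixels.
T = make_affine(sx=800 / 1000, sy=600 / 50, tx=0.0, ty=0.0)
px_x, px_y = apply_affine(T, 500, 25)  # the midpoint of the data range

print(px_x, px_y)
```

In Matplotlib itself the composed data-to-display transform is available as `ax.transData`, and `ax.transData.transform((x, y))` returns the pixel position for a data point.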
import matplotlib.pyplot as plt
# 1. Create the Figure (the canvas) and Axes (the plot area)
fig, ax = plt.subplots(figsize=(10, 6), dpi=100)
# 2. Draw on the Axes
ax.plot([1, 2, 3], [4, 5, 2], marker='o', label='Data A')
# 3. Configure the Axes (Anatomy elements)
ax.set_title("My First OOP Plot", fontsize=16, fontweight='bold')
ax.set_xlabel("X-Axis (Units)", fontsize=12)
ax.set_ylabel("Y-Axis (Units)", fontsize=12)
# Set limits and ticks
ax.set_xlim(0, 4)
ax.set_ylim(0, 6)
ax.grid(True, linestyle='--', alpha=0.7)
# 4. Add accessories
ax.legend(loc='upper right')
# 5. Render or Save
plt.tight_layout() # Prevent clipping
plt.show()
# fig.savefig('my_plot.png', dpi=300)
Basic Matplotlib Plots
Master the fundamental plot types that form the foundation of data visualization.
Code Examples
Line Plot:
ax.plot(x, y, color='blue', linestyle='--', marker='o', label='Series A')
Scatter Plot:
ax.scatter(x, y, c=colors, s=sizes, alpha=0.7, cmap='viridis')
Bar Chart:
ax.bar(categories, values, color='steelblue', edgecolor='black')
Histogram:
ax.hist(data, bins=30, edgecolor='white', density=True)
Under the Hood: The Freedman-Diaconis Rule
How should a histogram's bin width be chosen when you don't specify bins? Matplotlib simply defaults to 10 bins, while statistically minded tools can apply the Freedman-Diaconis rule (for example via bins='fd'), which approximately minimizes the integrated squared difference between the histogram and the true underlying probability density:
$$h = 2 \cdot \frac{\text{IQR}}{n^{1/3}}$$
Where $h$ is the bin width, $\text{IQR}$ is the Interquartile Range and $n$ is the number of observations. Because it relies on the IQR rather than on the sample size alone (as Sturges' rule does), it is more robust to heavy-tailed distributions and outliers.
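A minimal standard-library sketch of the Freedman-Diaconis bin width, $2 \cdot \text{IQR} / n^{1/3}$. The function name and the toy data are assumptions for illustration. The quartiles use the "inclusive" method, which interpolates linearly between data points.

```python
import statistics


def freedman_diaconis_width(data):
    """Bin width = 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return 2 * iqr / len(data) ** (1 / 3)


# Toy data: the integers 1..27, then the same data with one extreme outlier.
clean = list(range(1, 28))
with_outlier = clean + [10_000]

width_clean = freedman_diaconis_width(clean)
width_outlier = freedman_diaconis_width(with_outlier)

print(f"clean data:   {width_clean:.3f}")
print(f"with outlier: {width_outlier:.3f}")
```

The single extreme value barely moves the bin width, because the IQR ignores the tails. With NumPy installed, `np.histogram_bin_edges(data, bins='fd')` applies the same rule, and `ax.hist(data, bins='fd')` passes it through.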
import matplotlib.pyplot as plt
import numpy as np
fig, axs = plt.subplots(1, 2, figsize=(15, 5))
# 1. Scatter Plot (Color & Size mapping)
x = np.random.randn(100)
y = x + np.random.randn(100)*0.5
sizes = np.random.uniform(10, 200, 100)
colors = x
sc = axs[0].scatter(x, y, s=sizes, c=colors, cmap='viridis', alpha=0.7)
axs[0].set_title('Scatter Plot')
fig.colorbar(sc, ax=axs[0], label='Color Value')
# 2. Bar Chart (with Error Bars)
categories = ['Group A', 'Group B', 'Group C']
values = [10, 22, 15]
errors = [1.5, 3.0, 2.0]
axs[1].bar(categories, values, yerr=errors, capsize=5, color='coral', alpha=0.8)
axs[1].set_title('Bar Chart with Error Bars')
for i, v in enumerate(values):
    axs[1].text(i, v + 0.5, str(v), ha='center')
plt.tight_layout()
plt.show()
Subplots & Multi-panel Layouts
Combine multiple visualizations into a single figure for comprehensive analysis.
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
gs = fig.add_gridspec(3, 3); ax = fig.add_subplot(gs[0, :])
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
fig = plt.figure(figsize=(10, 8))
gs = gridspec.GridSpec(3, 3, figure=fig)
# 1. Main large plot (spans 2x2 grid)
ax_main = fig.add_subplot(gs[0:2, 0:2])
ax_main.set_title('Main View')
# 2. Side plots (Top right, Bottom right)
ax_side1 = fig.add_subplot(gs[0, 2])
ax_side2 = fig.add_subplot(gs[1, 2])
# 3. Bottom wide plot (spans 1x3 grid)
ax_bottom = fig.add_subplot(gs[2, :])
ax_bottom.set_title('Timeline View')
plt.tight_layout()
plt.show()
Styling & Professional Themes
Transform basic plots into publication-quality visualizations.
plt.style.available  # Lists all built-in styles
plt.style.use('seaborn-v0_8-whitegrid')
with plt.style.context('dark_background'):
Color Palettes
Sequential: Blues, Greens, Oranges (for magnitude)
Diverging: coolwarm, RdBu (for +/- deviations)
Categorical: tab10, Set2, Paired (discrete groups)
Under the Hood: Perceptually Uniform Colors (CIELAB)
Why do we use "viridis" instead of "rainbow" colormaps? A color map is a mathematical function mapping data $f(x) \rightarrow (R, G, B)$. However, standard RGB math doesn't match human perception (Euclidean distance in RGB $\neq$ perceived color distance).
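Colormaps like viridis are designed in a perceptually uniform color space (viridis itself was built in CAM02-UCS, a more recent space in the same spirit as CIELAB $L^*a^*b^*$). In such a space, the Euclidean distance between two colors, for example $\Delta E^*_{ab} = \sqrt{(\Delta L^*)^2 + (\Delta a^*)^2 + (\Delta b^*)^2}$ in CIELAB, approximates how different they look, so equal steps in the data appear as roughly equal steps in color.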
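To make "distance in color space" concrete, this sketch computes the CIE76 color difference, which is simply the Euclidean distance between two $(L^*, a^*, b^*)$ triples. The Lab values below are invented for illustration and are not sampled from any real colormap. CIE76 is the simplest $\Delta E$ formula; later revisions such as CIEDE2000 refine it.

```python
import math


def delta_e_cie76(lab1, lab2):
    """CIE76 color difference: Euclidean distance between two (L*, a*, b*) triples."""
    return math.dist(lab1, lab2)


# Hypothetical colors, chosen so the arithmetic is easy to follow.
color_a = (50.0, 10.0, 10.0)
color_b = (53.0, 14.0, 10.0)   # slightly lighter and slightly redder
color_c = (80.0, -30.0, 40.0)  # a clearly different color

small_step = delta_e_cie76(color_a, color_b)
large_step = delta_e_cie76(color_a, color_c)

print(f"a vs b: {small_step:.2f}")
print(f"a vs c: {large_step:.2f}")
```

A perceptually uniform colormap aims to keep this kind of distance roughly constant between neighboring colors, so no part of the data range looks artificially emphasized.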
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Apply a global Seaborn theme
sns.set_theme(style="whitegrid", palette="muted")
# 2. Customize fonts globally
plt.rcParams.update({
    'font.family': 'sans-serif',
    'font.sans-serif': ['Helvetica', 'Arial'],
    'axes.titleweight': 'bold',
    'axes.titlesize': 16,
    'axes.labelsize': 12,
    'lines.linewidth': 2
})
# 3. Plotting with the new theme
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot([1, 2, 3], [4, 5, 2], label='Data')
ax.legend()
# 4. Remove top and right spines (cleaner look)
sns.despine(ax=ax)
plt.show()
Seaborn: Statistical Visualization
Seaborn is a high-level library built on Matplotlib that makes statistical graphics beautiful and easy.
- Beautiful default styles and color palettes
- Works seamlessly with Pandas DataFrames
- Statistical estimation built-in (confidence intervals, regression)
- Faceting for multi-panel figures
- Functions organized by plot purpose
Seaborn Function Categories
Figure-level: Create their own figure and support faceting (relplot, displot, catplot)
Axes-level: Draw on specific axes (histplot, scatterplot, boxplot)
By Purpose:
• Distribution: histplot, kdeplot, ecdfplot, rugplot
• Relationship: scatterplot, lineplot, regplot
• Categorical: stripplot, swarmplot, boxplot, violinplot, barplot
• Matrix: heatmap, clustermap
Distribution Plots
Visualize the distribution of a single variable or compare distributions across groups.
• Histogram: Discrete bins, shows raw counts
• KDE: Smooth curve, estimates probability density
• Use both together:
sns.histplot(data, kde=True)
Under the Hood: Kernel Density Estimation (KDE)
A KDE plot is not just a smoothed line; it's a mathematical sum of continuous probability distributions (kernels) placed at every single data point $x_i$:
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$
Here, $n$ is the number of data points, $K$ is typically the standard normal (Gaussian) density function, and $h$ is the bandwidth parameter. If $h$ is too small, the curve is jagged (overfit); if $h$ is too large, it hides important statistical features (underfit).
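The sketch below builds a Gaussian KDE by hand so the effect of the bandwidth $h$ is visible without a plotting library. The sample values, the grid, and the helper names are illustrative assumptions. Seaborn delegates the real computation to its own estimator, so treat this as an explanation of the formula rather than its implementation.

```python
import math


def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(x, samples, h):
    """Average of kernels centered on each sample, scaled by the bandwidth h."""
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (len(samples) * h)


def count_peaks(values):
    """Number of strict local maxima in a sequence."""
    return sum(1 for a, b, c in zip(values, values[1:], values[2:]) if a < b > c)


samples = [1.0, 2.0, 2.5, 4.0, 7.0]  # hypothetical observations
step = 0.01
grid = [i * step for i in range(-1000, 2000)]  # covers -10 to 20

narrow = [kde(x, samples, h=0.2) for x in grid]  # small bandwidth: jagged
wide = [kde(x, samples, h=2.0) for x in grid]    # large bandwidth: smooth

area_narrow = sum(narrow) * step  # Riemann sum; a density should integrate to about 1
area_wide = sum(wide) * step

print(count_peaks(narrow), count_peaks(wide))
print(round(area_narrow, 3), round(area_wide, 3))
```

The narrow bandwidth produces a separate bump for almost every observation, while the wide one merges them, yet both curves still integrate to roughly one.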
import seaborn as sns
import matplotlib.pyplot as plt
penguins = sns.load_dataset("penguins")
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# 1. Step histogram per species (density normalized within each group)
sns.histplot(
    data=penguins, x="flipper_length_mm", hue="species",
    element="step", stat="density", common_norm=False,
    ax=axes[0]
)
axes[0].set_title("Histogram with Step Fill")
# 2. KDE Plot with Rug Plot
sns.kdeplot(
    data=penguins, x="body_mass_g", hue="species",
    fill=True, common_norm=False, palette="crest",
    alpha=0.5, linewidth=1.5, ax=axes[1]
)
sns.rugplot(
    data=penguins, x="body_mass_g", hue="species",
    palette="crest",  # match the KDE colors
    height=0.05, ax=axes[1]
)
axes[1].set_title("KDE Density + Rug Plot")
sns.despine()
plt.show()
Relationship Plots
Explore relationships between two or more continuous variables.
sns.scatterplot(data=df, x='x', y='y', hue='category', size='magnitude')
sns.regplot(data=df, x='x', y='y', scatter_kws={'alpha':0.5})
sns.pairplot(df, hue='species', diag_kind='kde')
Under the Hood: Ordinary Least Squares (OLS)
When you use sns.regplot, Seaborn calculates the line of best fit by minimizing the sum of the squared residuals ($\sum_i e_i^2$). The closed-form matrix solution for the coefficients $\hat{\boldsymbol{\beta}}$ is:
$$\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$$
Here $X$ is the design matrix (a column of ones plus the predictor) and $\mathbf{y}$ is the vector of observed responses.
The shaded region around the line represents the 95% confidence interval, which Seaborn computes by bootstrapping: if we repeated the sampling many times and rebuilt the band each time, about 95% of those bands would contain the true regression line.
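For a single predictor, the normal equations $(X^\top X)\hat{\boldsymbol{\beta}} = X^\top \mathbf{y}$ reduce to a 2×2 system that can be solved by hand. The sketch below does exactly that with made-up data; `ols_fit` is a hypothetical helper, not a Seaborn function.

```python
def ols_fit(x, y):
    """Solve (X^T X) beta = X^T y for X = [1, x]; returns (intercept, slope)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    # X^T X = [[n, sum_x], [sum_x, sum_xx]] and X^T y = [sum_y, sum_xy]
    det = n * sum_xx - sum_x * sum_x
    intercept = (sum_xx * sum_y - sum_x * sum_xy) / det
    slope = (n * sum_xy - sum_x * sum_y) / det
    return intercept, slope


# Hypothetical data lying close to the line y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = ols_fit(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"intercept={intercept:.3f}, slope={slope:.3f}")
print(f"sum of residuals={sum(residuals):.2e}")
```

A useful sanity check on any OLS fit with an intercept is that the residuals sum to zero up to floating-point error, as the last line shows.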
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset("tips")
# 1. Advanced Scatter (4 dimensions: x, y, color, size)
plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=tips, x="total_bill", y="tip",
    hue="time", size="size", sizes=(20, 200),
    palette="deep", alpha=0.8
)
plt.title("4D Scatter Plot (Total Bill vs Tip)")
plt.show()
# 2. Regression Plot with Subplots (Using lmplot)
# lmplot is a figure-level function that creates multiple subplots automatically
sns.lmplot(
    data=tips, x="total_bill", y="tip", col="time", hue="smoker",
    height=5, aspect=1.2, scatter_kws={'alpha':0.5}
)
plt.show()
# 3. Pairplot (Explore all pairwise relationships)
sns.pairplot(
    data=tips, hue="smoker",
    diag_kind="kde", markers=["o", "s"]
)
plt.show()
Categorical Plots
Visualize distributions and comparisons across categorical groups.
• Strip/Swarm: Show all data points (small datasets)
• Box: Summary statistics (median, quartiles, outliers)
• Violin: Full distribution shape + summary
• Bar: Mean/count with error bars
Under the Hood: The IQR Outlier Rule
Box plots identify "outliers" (the individual dots beyond the whiskers) with a fixed numerical rule rather than by eye. They use John Tukey's Interquartile Range (IQR) method:
$$\text{IQR} = Q_3 - Q_1, \qquad \text{Lower} = Q_1 - 1.5 \cdot \text{IQR}, \qquad \text{Upper} = Q_3 + 1.5 \cdot \text{IQR}$$
Any point strictly outside $[\text{Lower}, \text{Upper}]$ is plotted as an outlier. Fun fact: in a perfectly normal distribution $\mathcal{N}(\mu, \sigma^2)$, about 0.70% of the data falls outside these fences and is flagged as an outlier even though nothing is wrong with it.
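A small sketch of Tukey's fences, $Q_1 - 1.5 \cdot \text{IQR}$ and $Q_3 + 1.5 \cdot \text{IQR}$, using only the standard library. It also checks the share of a normal distribution that falls outside them. The data values and function name are illustrative assumptions.

```python
import statistics


def tukey_fences(data, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical measurements with one suspicious value.
data = [2, 3, 4, 5, 6, 7, 8, 9, 30]
lower, upper = tukey_fences(data)
outliers = [v for v in data if v < lower or v > upper]
print(lower, upper, outliers)

# Share of a standard normal distribution lying outside the fences.
z = statistics.NormalDist()
z_q1, z_q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
z_upper = z_q3 + 1.5 * (z_q3 - z_q1)
tail_share = 2 * (1 - z.cdf(z_upper))  # both tails, by symmetry
print(f"{tail_share:.2%} of normal data is flagged")
```

The fences depend only on the quartiles, so the extreme value does not inflate them the way it would inflate a mean-and-standard-deviation rule.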
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset("tips")
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# 1. Violin Plot (Distribution density across categories)
sns.violinplot(
    data=tips, x="day", y="total_bill", hue="sex",
    split=True, inner="quart", palette="muted",
    ax=axes[0]
)
axes[0].set_title("Violin Plot (Split by Sex)")
# 2. Boxplot + Swarmplot Overlay
# Good for showing summary stats PLUS underlying data points
sns.boxplot(
    data=tips, x="day", y="total_bill", color="white",
    width=.5, showfliers=False, ax=axes[1]  # hide boxplot outliers to avoid overlap
)
sns.swarmplot(
    data=tips, x="day", y="total_bill", hue="time",
    size=6, alpha=0.7, ax=axes[1]
)
axes[1].set_title("Boxplot + Swarmplot Overlay")
plt.tight_layout()
plt.show()
Heatmaps & Correlation Matrices
Visualize matrices of values using color intensity. Essential for EDA correlation analysis.
• Always annotate with values: annot=True
• Use diverging colormap for correlation: cmap='coolwarm', center=0
• Mask upper/lower triangle: mask=np.triu(np.ones_like(corr))
• Square cells: square=True
Under the Hood: Correlation Coefficients
Correlation heatmaps display the strength of linear relationships between variables, typically mapping the Pearson correlation coefficient ($r$) to a continuous color gradient:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
For non-linear but monotonic relationships, switch pandas to Spearman's rank correlation ($\rho$) with corr(method='spearman'), which converts raw values to ranks $R(x_i)$ before applying the same formula. Both coefficients are bounded in $[-1, 1]$.
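The difference between Pearson and Spearman is easiest to see on data that is monotonic but not linear. This sketch implements both by hand. The helper names are assumptions, and the ranking function deliberately ignores ties to stay short.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = float(rank)
    return result


def spearman_rho(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but clearly non-linear

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Spearman reports a perfect relationship because the ordering is identical, while Pearson is pulled below one by the curvature.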
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Load data and calculate correlation matrix
penguins = sns.load_dataset("penguins")
# Select only numerical columns for correlation
numerical_df = penguins.select_dtypes(include=[np.number])
corr = numerical_df.corr()
# Create a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(8, 6))
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(
    corr,
    mask=mask,
    cmap='coolwarm',
    vmax=1, vmin=-1,
    center=0,
    square=True,
    linewidths=.5,
    annot=True,
    fmt=".2f",
    cbar_kws={"shrink": .8}
)
plt.title("Penguin Feature Correlation")
plt.tight_layout()
plt.show()
Plotly Express: Interactive Visualization
Plotly creates interactive, web-based visualizations with zoom, pan, hover tooltips, and more.
- Interactive out of the box (zoom, pan, select)
- Hover tooltips with data details
- Export as HTML, PNG, or embed in dashboards
- Works in Jupyter, Streamlit, Dash
- plotly.express is the high-level API (like Seaborn for Matplotlib)
px.scatter(df, x='x', y='y', color='category', size='value', hover_data=['name'])
px.line(df, x='date', y='price', color='stock')
px.bar(df, x='category', y='count', color='group', barmode='group')
px.histogram(df, x='value', nbins=50, marginal='box')
Animated Visualizations
Add time dimension to your visualizations with animations.
Plotly Animation:
px.scatter(df, x='gdp', y='life_exp', animation_frame='year', animation_group='country', size='pop', color='continent')
Matplotlib Animation:
from matplotlib.animation import FuncAnimation
ani = FuncAnimation(fig, update_func, frames=100, interval=50)
import plotly.express as px
df = px.data.gapminder()
# Plotly makes animations incredibly easy with two arguments:
# 'animation_frame' (the time dimension) and 'animation_group' (the entity)
fig = px.scatter(
    df,
    x="gdpPercap", y="lifeExp",
    animation_frame="year", animation_group="country",
    size="pop", color="continent",
    hover_name="country",
    log_x=True, size_max=55,
    range_x=[100,100000], range_y=[25,90],
    title="Global Development 1952 - 2007"
)
fig.show()
Interactive Dashboards with Streamlit
Build interactive web apps for data exploration without web development experience.
streamlit run app.py
import streamlit as st
st.title("My Dashboard")
st.slider("Select value", 0, 100, 50)
st.selectbox("Choose", ["A", "B", "C"])
st.plotly_chart(fig)
import streamlit as st
import pandas as pd
import plotly.express as px
# 1. Page Configuration
st.set_page_config(page_title="Sales Dashboard", layout="wide")
st.title("Interactive Sales Dashboard")
# 2. Sidebar Filters
st.sidebar.header("Filters")
category = st.sidebar.selectbox("Select Category", ["Electronics", "Clothing", "Home"])
min_sales = st.sidebar.slider("Minimum Sales", 0, 1000, 200)
# Mock Data Generation
df = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=30),
    'Sales': [x * 10 for x in range(30)],
    'Category': [category] * 30
})
filtered_df = df[df['Sales'] >= min_sales]
# 3. Layout with Columns
col1, col2 = st.columns(2)
# KPI Metric
col1.metric("Total Filtered Sales", f"${filtered_df['Sales'].sum()}")
# 4. Insert Plotly Chart
fig = px.line(filtered_df, x='Date', y='Sales', title=f"{category} Sales Trend")
col2.plotly_chart(fig, use_container_width=True)
Geospatial Visualization
Visualize geographic data with maps, choropleth, and point plots.
• Plotly: px.choropleth(df, locations='country', color='value')
• Folium: Interactive Leaflet maps
• Geopandas + Matplotlib: Static maps with shapefiles
• Kepler.gl: Large-scale geospatial visualization
Under the Hood: Geospatial Math
Visualizing data on a map requires mathematically converting a 3D spherical Earth into 2D screen pixels. The Web Mercator Projection (used by Google Maps and Plotly) achieves this by preserving angles (conformal) but heavily distorting sizes near the poles. For longitude $\lambda$ and latitude $\varphi$ (in radians) on a sphere of radius $R$:
$$x = R\,\lambda, \qquad y = R \ln\!\left[\tan\!\left(\frac{\pi}{4} + \frac{\varphi}{2}\right)\right]$$
Furthermore, when calculating distances between two GPS coordinates (e.g., to color a density heatmap), you cannot use straight Euclidean distance $d = \sqrt{(\Delta x)^2 + (\Delta y)^2}$ on raw longitude/latitude values. Geospatial libraries typically use the Haversine formula to find the great-circle distance over the sphere.
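Both ideas fit in a few lines of standard-library Python. The sketch below assumes a spherical Earth with a mean radius of 6371 km and uses invented helper names. Real mapping libraries add ellipsoid corrections and tile scaling that are omitted here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a perfectly spherical Earth


def web_mercator(lon_deg, lat_deg, radius=1.0):
    """Spherical Mercator: x grows linearly with longitude, y stretches with latitude."""
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    x = radius * lam
    y = radius * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lam = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Equal 30-degree steps in latitude take up more and more vertical map space.
_, y30 = web_mercator(0, 30)
_, y60 = web_mercator(0, 60)
print(f"y at 30 deg: {y30:.3f}, y at 60 deg: {y60:.3f}")

# A quarter of the equator, from (0, 0) to (0, 90E).
quarter_equator = haversine_km(0, 0, 0, 90)
print(f"{quarter_equator:.1f} km")
```

The projected height at 60° is more than double the height at 30°, which is the polar stretching described above, while the Haversine result stays a true surface distance.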
import plotly.express as px
# Plotly includes built-in geospatial data
df = px.data.gapminder().query("year==2007")
# Create a choropleth map
# 'locations' takes ISO-3 country codes by default
fig = px.choropleth(
    df,
    locations="iso_alpha",  # Geopolitical boundaries
    color="lifeExp",  # Data to map to color
    hover_name="country",  # Tooltip label
    color_continuous_scale=px.colors.sequential.Plasma,
    title="Global Life Expectancy (2007)"
)
# Customize the map projection type
fig.update_geos(
    projection_type="orthographic",  # "natural earth", "mercator", etc.
    showcoastlines=True,
    coastlinecolor="DarkBlue"
)
fig.show()
3D Visualization
Visualize three-dimensional relationships with surface plots, scatter plots, and more.
Under the Hood: 3D Perspective Projection Matrix
To render 3D data $(x, y, z)$ on a 2D browser screen, libraries like Plotly.js apply a perspective projection matrix. This creates the optical illusion of depth by scaling $x$ and $y$ inversely with distance $z$. In its simplest pinhole form, with focal distance $d$:
$$\begin{pmatrix} x' \\ y' \\ z' \\ w \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1/d & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}$$
Once multiplied out, $w = z/d$ and the final screen coordinates are $(x'/w, y'/w)$. When you drag to rotate a 3D Plotly graph, your browser's WebGL engine re-applies this kind of transform to every vertex on each frame to update the view in real time.
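A minimal sketch of the perspective divide, assuming the simplest pinhole camera with focal distance `d` looking along the positive $z$ axis. The matrix and the function name are illustrative assumptions; production WebGL pipelines use a fuller matrix with near and far clipping planes.

```python
def perspective_project(point, d=1.0):
    """Project (x, y, z) onto the screen via a 4x4 matrix and division by w."""
    x, y, z = point
    projection = [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0 / d, 0.0],  # this row makes w proportional to depth
    ]
    vec = (x, y, z, 1.0)
    x_p, y_p, _z_p, w = (sum(row[j] * vec[j] for j in range(4)) for row in projection)
    return x_p / w, y_p / w  # the perspective divide (requires z != 0)


# The same (x, y) offset at two depths: the farther point lands closer to the center.
near = perspective_project((1.0, 1.0, 2.0))
far = perspective_project((1.0, 1.0, 4.0))
print(near, far)
```

Doubling the depth halves the projected offset, which is the shrinking-with-distance cue that makes a flat screen read as three-dimensional.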
import plotly.express as px
# Load Iris dataset
df = px.data.iris()
# Create interactive 3D scatter plot
fig = px.scatter_3d(
    df,
    x='sepal_length',
    y='sepal_width',
    z='petal_width',
    color='species',
    size='petal_length',
    size_max=18,
    symbol='species',
    opacity=0.7,
    title="Iris 3D Feature Space"
)
# Tight layout for 3D plot
fig.update_layout(margin=dict(l=0, r=0, b=0, t=40))
fig.show()
Data Storytelling
Transform visualizations into compelling narratives that drive action.
- Context: Why does this matter? Who is the audience?
- Data: What insights did you discover?
- Narrative: What's the storyline (beginning, middle, end)?
- Visual: Which chart best supports the story?
- Call to Action: What should the audience do?
Design Principles
Focus Attention: Use color strategically (grey + accent)
Think Like a Designer: Alignment, white space, hierarchy
Tell a Story: Title = conclusion, not description
Bad: "Sales by Region"
Good: "West Region Sales Dropped 23% in Q4"