Data Visualization Masterclass

🎯 Why Visualize Data?

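Data visualization transforms abstract numbers into visual stories. A widely repeated, though poorly sourced, claim is that the human brain processes images 60,000× faster than text; whatever the exact figure, visual pattern recognition is fast and largely automatic. Visualization helps us explore, analyze, and communicate data effectively.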

Anscombe's Quartet: Four datasets with nearly identical statistical properties (mean, variance, correlation) that look completely different when plotted. This demonstrates why visualization is essential - statistics alone can be misleading!
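To make the point concrete, here is a minimal sketch that computes the summary statistics of the four Anscombe datasets with NumPy. The variable names (`quartet`, `summary`) are illustrative choices, and the values are the published 1973 data typed in by hand.

```python
import numpy as np

# Anscombe's quartet (1973). Datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    summary[name] = {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_y": y.var(ddof=1),            # sample variance
        "corr": np.corrcoef(x, y)[0, 1],   # Pearson correlation
    }
    print(name, {k: round(float(v), 2) for k, v in summary[name].items()})
```

All four rows print nearly the same numbers, yet a scatter plot of each dataset shows a line, a curve, a line with one outlier, and a vertical cluster with one leverage point.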

Three Purposes of Visualization

1. Exploratory: Discover patterns, anomalies, and insights in your data
2. Explanatory: Communicate findings to stakeholders clearly
3. Confirmatory: Verify hypotheses and validate models

💡 "The greatest value of a picture is when it forces us to notice what we never expected to see." — John Tukey

✅ Always start with visualization before building ML models.

👁️ Visual Perception & Pre-attentive Attributes

The human visual system can detect certain visual attributes almost instantly (< 250ms) without conscious effort. These are called pre-attentive attributes.

Pre-attentive Attributes:

Position: Most accurate for quantitative data (use X/Y axes)
Length: Bar charts leverage this effectively
Color Hue: Best for categorical distinctions
Color Intensity: Good for gradients/magnitude
Size: Bubble charts, but humans underestimate area
Shape: Useful for categories, but limit to 5-7 shapes
Orientation: Lines, angles

Cleveland & McGill's Accuracy Ranking

Most Accurate → Least Accurate:
1. Position on common scale (dot plot, scatter plot, aligned bars)
2. Position on non-aligned scale (small multiples with separate axes)
3. Length (stacked bar segments)
4. Angle, Slope
5. Area
6. Volume, Curvature
7. Color saturation, Color hue

⚠️ Pie charts use angle (low accuracy). Bar charts are almost always better!

✅ Use position for most important data, color for categories.

🧠 Under the Hood: The Weber-Fechner Law

Why are humans bad at comparing bubble sizes (area) but great at comparing bar chart heights (length/position)? Under the Weber-Fechner model, perceived magnitude grows roughly logarithmically with physical stimulus intensity, not linearly.

$$ \frac{\Delta I}{I} = k \quad \Rightarrow \quad S = c \ln\left(\frac{I}{I_0}\right) $$

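$I$: Initial stimulus intensity (e.g., initial bubble area)
$I_0$: Threshold intensity, the smallest stimulus that can be perceived at all
$\Delta I$: Just Noticeable Difference (JND) required to perceive a change
$S$: Perceived sensation, with $c$ a scaling constant
$k$: Weber's constant (the Weber fraction). Smaller means finer discrimination: for length/position $k \approx 0.03$ (a 3% change is noticeable), but for area $k \approx 0.10$ to $0.20$ (a 10-20% change is needed).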
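A quick numeric sketch of both halves of the formula, using the approximate $k$ values quoted above. The function names and the choice $c = 1$, $I_0 = 1$ are assumptions made purely for illustration.

```python
import math

def just_noticeable_difference(intensity, k):
    """Weber's law: the smallest detectable change is a fixed fraction k of the stimulus."""
    return k * intensity

def perceived_sensation(intensity, c=1.0, i0=1.0):
    """Fechner's law: sensation grows with the logarithm of intensity (c and i0 are assumed)."""
    return c * math.log(intensity / i0)

# A bar of length 100 vs a bubble of area 100 (arbitrary units)
jnd_length = just_noticeable_difference(100.0, k=0.03)  # about 3 units
jnd_area = just_noticeable_difference(100.0, k=0.15)    # about 15 units
print(jnd_length, jnd_area)

# Doubling the stimulus always adds the same amount of sensation
step_1_to_2 = perceived_sensation(2.0) - perceived_sensation(1.0)
step_2_to_4 = perceived_sensation(4.0) - perceived_sensation(2.0)
print(step_1_to_2, step_2_to_4)
```

Under these assumptions, a bubble has to change about five times as much as a bar before a reader notices the difference.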

📐 The Grammar of Graphics

The Grammar of Graphics (Wilkinson, 1999) is a framework for describing statistical graphics. It's the foundation of ggplot2 (R) and influences Seaborn, Altair, and Plotly.

Components of a Graphic:

Data: The dataset being visualized
Aesthetics (aes): Mapping data to visual properties (x, y, color, size)
Geometries (geom): Visual elements (points, lines, bars, areas)
Facets: Subplots by categorical variable
Statistics: Transformations (binning, smoothing, aggregation)
Coordinates: Cartesian, polar, map projections
Themes: Non-data visual elements (fonts, backgrounds)

💡 Understanding Grammar of Graphics makes you a better visualizer in ANY library.

🧠 Under the Hood: Coordinate Transformations

When mapping data to visuals, the coordinate system applies a mathematical transformation, which need not be linear. For example, a pie chart or Coxcomb plot interprets the data in Polar coordinates $(r, \theta)$, which relate to standard Cartesian coordinates $(x, y)$ by:

$$ r = \sqrt{x^2 + y^2} $$
$$ \theta = \text{atan2}(y, x) $$
$$ \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} r \cos(\theta) \\ r \sin(\theta) \end{bmatrix} $$

This is why pie charts are computationally and perceptually different from bar charts—they apply a non-linear polar transformation to the linear data dimensions.
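The sketch below shows the polar mapping at work for a pie chart: category totals become wedge angles, the wedge midpoints are placed on the unit circle, and the inverse formulas recover the angles. The `values` array is hypothetical sample data.

```python
import numpy as np

values = np.array([30.0, 50.0, 20.0])           # hypothetical category totals
fractions = values / values.sum()

# Wedge boundaries in radians: each category gets an angle proportional to its share
end_angles = 2 * np.pi * np.cumsum(fractions)
start_angles = np.concatenate([[0.0], end_angles[:-1]])
mid_angles = (start_angles + end_angles) / 2    # where a wedge label would go

# Polar -> Cartesian (r = 1 for an ordinary pie chart)
r = 1.0
x = r * np.cos(mid_angles)
y = r * np.sin(mid_angles)

# Cartesian -> Polar, using the inverse formulas
r_back = np.hypot(x, y)
theta_back = np.arctan2(y, x) % (2 * np.pi)     # map (-pi, pi] onto [0, 2*pi)

print(np.degrees(mid_angles))
print(np.degrees(theta_back))
```

The `% (2 * np.pi)` step is needed because `arctan2` returns angles in $(-\pi, \pi]$, while wedge angles are usually counted from 0 to $2\pi$.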

app.py - Grammar of Graphics with Plotnine (Python)

from plotnine import aes, geom_point, geom_smooth, ggplot, labs, theme_minimal
from plotnine.data import mpg  # example dataset bundled with plotnine

# Following Grammar of Graphics exactly:
# Data (mpg) -> Aesthetics (x,y,color) -> Geometries (point, smooth)
plot = (
    ggplot(mpg, aes(x='displ', y='hwy', color='class'))
    + geom_point(size=3, alpha=0.7)
    + geom_smooth(method='lm', se=False) # One regression line per class (color is mapped)
    + theme_minimal()                    # Add theme
    + labs(title='Engine Displacement vs Highway MPG')
)
plot.show()  # plotnine >= 0.13; on older versions use print(plot)

🎨 Choosing the Right Chart

The best visualization depends on your data type and question. Here's a decision guide:

Single Variable (Univariate):
• Continuous: Histogram, KDE, Box plot, Violin plot
• Categorical: Bar chart, Count plot

Two Variables (Bivariate):
• Both Continuous: Scatter plot, Line chart, Hexbin, 2D histogram
• Continuous + Categorical: Box plot, Violin, Strip, Swarm
• Both Categorical: Heatmap, Grouped bar chart

Multiple Variables (Multivariate):
• Pair plot (scatterplot matrix)
• Parallel coordinates
• Heatmap correlation matrix
• Faceted plots (small multiples)

Common Chart Mistakes

⚠️ Pie charts for many categories - Use bar chart instead

⚠️ 3D effects on 2D data - Distorts perception

⚠️ Truncated Y-axis - Exaggerates differences

⚠️ Rainbow color scales - Not perceptually uniform

🧠 Under the Hood: Information Entropy in Visuals

How much data can a chart "handle" before it becomes cluttered? We can use Shannon Entropy ($H$) to quantify the visual information density. If a chart has $n$ visual marks (dots, lines) with probabilities $p_i$ of drawing attention:

$$ H(X) = - \sum_{i=1}^{n} p_i \log_2(p_i) $$

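Takeaway: If you encode too many dimensions (color, size, shape simultaneously) on a single plot, the information a viewer must decode can exceed what people reliably judge at a glance (Miller's classic estimate is roughly $2.5$ bits, about 6-7 distinguishable levels, per visual dimension), leading to chart fatigue. Treat this as a heuristic rather than a precise metric, but it gives "less is more" in dashboard design a quantitative flavor.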
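As a rough illustration of the entropy formula, the sketch below compares a chart where four marks compete equally for attention with one where a single accent mark dominates. The attention probabilities are made-up numbers, not measurements, and `shannon_entropy` is a helper written for this example.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits of a discrete distribution (normalized first)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # 0 * log(0) is treated as 0
    return float(-(p * np.log2(p)).sum())

# Four marks, all equally eye-catching (assumed probabilities)
uniform_4 = shannon_entropy([0.25, 0.25, 0.25, 0.25])   # 2 bits

# Grey context plus one accent color: attention is concentrated
one_accent = shannon_entropy([0.85, 0.05, 0.05, 0.05])

print(uniform_4, one_accent)
```

The accent design has lower entropy, which matches the intuition that a single highlighted mark is easier to read than several marks shouting at once.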

🔬 Matplotlib Figure Anatomy

Understanding Matplotlib's object hierarchy is key to creating professional visualizations.

Hierarchical Structure:
Figure → Axes → Axis → Tick → Label

• Figure: The overall window/canvas
• Axes: The actual plot area (NOT the X/Y axis!)
• Axis: The X or Y axis with ticks and labels
• Artist: Everything visible (lines, text, patches)

Two Interfaces

1. pyplot (MATLAB-style): Quick, implicit state
plt.plot(x, y)
plt.xlabel('Time')
plt.show()

2. Object-Oriented (OO): Explicit, recommended for complex plots
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel('Time')

✅ Always use OO interface for publication-quality plots.

🧠 Under the Hood: Affine Transformations

How does Matplotlib convert your data coordinates (e.g., $x \in [0, 1000]$) into physical pixels on your screen? For linear axes it chains affine transformations, each of which can be written as a matrix (log and other non-linear scales add a non-affine step first):

$$ \begin{bmatrix} x_{\text{display}} \\ y_{\text{display}} \\ 1 \end{bmatrix} = \begin{bmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{\text{data}} \\ y_{\text{data}} \\ 1 \end{bmatrix} $$

This matrix $T$ scales ($s_x, s_y$) and translates ($t_x, t_y$) data points. The main pipeline is: Data $\rightarrow$ Axes (relative 0-1) $\rightarrow$ Display (pixels, determined by figure size in inches $\times$ DPI). Figure-relative (0-1) coordinates are a separate system used for placing elements on the whole canvas.
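You can inspect these transforms directly. The sketch below uses `ax.transData` and `ax.transAxes` to push a data point and the Axes centre through to pixel coordinates, then inverts the mapping. The Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a window
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4), dpi=100)   # 600 x 400 pixel canvas
ax.set_xlim(0, 1000)
ax.set_ylim(0, 1)

# Data coordinates -> display (pixel) coordinates
px_x, px_y = ax.transData.transform((500, 0.5))

# Axes-relative coordinates (0-1) -> display: the centre of the plot area
cx, cy = ax.transAxes.transform((0.5, 0.5))

# Display -> data, using the inverse transform
x_back, y_back = ax.transData.inverted().transform((px_x, px_y))

print((px_x, px_y), (cx, cy), (x_back, y_back))
```

Because the data point sits at the middle of both axis ranges, it lands on the same pixel as the Axes centre.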

plot.py - Matplotlib Object-Oriented Setup

import matplotlib.pyplot as plt

# 1. Create the Figure (the canvas) and the Axes (the plot area)
fig, ax = plt.subplots(figsize=(10, 6), dpi=100)

# 2. Draw on the Axes
ax.plot([1, 2, 3], [4, 5, 2], marker='o', label='Data A')

# 3. Configure the Axes (Anatomy elements)
ax.set_title("My First OOP Plot", fontsize=16, fontweight='bold')
ax.set_xlabel("X-Axis (Units)", fontsize=12)
ax.set_ylabel("Y-Axis (Units)", fontsize=12)

# Set limits and ticks
ax.set_xlim(0, 4)
ax.set_ylim(0, 6)
ax.grid(True, linestyle='--', alpha=0.7)

# 4. Add accessories
ax.legend(loc='upper right')

# 5. Render or Save
plt.tight_layout() # Prevent clipping
plt.show()
# fig.savefig('my_plot.png', dpi=300)

📈 Basic Matplotlib Plots

Master the fundamental plot types that form the foundation of data visualization.

Code Examples

Line Plot:
ax.plot(x, y, color='blue', linestyle='--', marker='o', label='Series A')

Scatter Plot:
ax.scatter(x, y, c=colors, s=sizes, alpha=0.7, cmap='viridis')

Bar Chart:
ax.bar(categories, values, color='steelblue', edgecolor='black')

Histogram:
ax.hist(data, bins=30, edgecolor='white', density=True)

🧠 Under the Hood: The Freedman-Diaconis Rule

When you ask for automatic binning, how does the library choose a bin width? Matplotlib's ax.hist defaults to a fixed 10 bins, but NumPy offers the Freedman-Diaconis rule as bins='fd' (and as one ingredient of bins='auto', which Seaborn's histplot uses by default). The rule approximately minimizes the integrated squared difference between the histogram and the true underlying probability density:

$$ \text{Bin Width } (h) = 2 \frac{\text{IQR}(x)}{\sqrt[3]{n}} $$

Where $\text{IQR}$ is the Interquartile Range and $n$ is the number of observations. Unlike simpler rules (e.g., Sturges' rule) that depend only on $n$, this one uses the IQR, so it is robust to outliers and adapts to the spread of heavy-tailed data.
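The rule is easy to compute by hand and compare with NumPy's `bins="fd"` option. This is a small sketch on synthetic normal data; the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=1000)   # synthetic sample

# Freedman-Diaconis bin width: h = 2 * IQR / n^(1/3)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
h = 2 * iqr / np.cbrt(data.size)

# NumPy applies the same rule, then rounds up to a whole number of equal bins
edges = np.histogram_bin_edges(data, bins="fd")
n_bins = len(edges) - 1
expected_bins = int(np.ceil((data.max() - data.min()) / h))

print(f"h = {h:.3f}, NumPy bins = {n_bins}, range / h rounded up = {expected_bins}")
```

The actual bin width NumPy uses is slightly smaller than `h`, because the data range has to be divided into a whole number of bins.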

basic_plots.py - Common Matplotlib Patterns

import matplotlib.pyplot as plt
import numpy as np

fig, axs = plt.subplots(1, 2, figsize=(15, 5))

# 1. Scatter Plot (Color & Size mapping)
x = np.random.randn(100)
y = x + np.random.randn(100)*0.5
sizes = np.random.uniform(10, 200, 100)
colors = x

sc = axs[0].scatter(x, y, s=sizes, c=colors, cmap='viridis', alpha=0.7)
axs[0].set_title('Scatter Plot')
fig.colorbar(sc, ax=axs[0], label='Color Value')

# 2. Bar Chart (with Error Bars)
categories = ['Group A', 'Group B', 'Group C']
values = [10, 22, 15]
errors = [1.5, 3.0, 2.0]

axs[1].bar(categories, values, yerr=errors, capsize=5, color='coral', alpha=0.8)
axs[1].set_title('Bar Chart with Error Bars')
for i, v in enumerate(values):
    axs[1].text(i, v + 0.5, str(v), ha='center')

plt.tight_layout()
plt.show()

🔲 Subplots & Multi-panel Layouts

Combine multiple visualizations into a single figure for comprehensive analysis.

Methods:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
gs = fig.add_gridspec(3, 3); ax = fig.add_subplot(gs[0, :])

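✅ Use plt.tight_layout() or constrained layout, e.g. plt.subplots(2, 2, layout='constrained'), to prevent overlaps. (fig.set_constrained_layout(True) still works but is deprecated since Matplotlib 3.6.)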

subplots.py - Complex Layouts with GridSpec

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

fig = plt.figure(figsize=(10, 8))
gs = gridspec.GridSpec(3, 3, figure=fig)

# 1. Main large plot (spans 2x2 grid)
ax_main = fig.add_subplot(gs[0:2, 0:2])
ax_main.set_title('Main View')

# 2. Side plots (Top right, Bottom right)
ax_side1 = fig.add_subplot(gs[0, 2])
ax_side2 = fig.add_subplot(gs[1, 2])

# 3. Bottom wide plot (spans 1x3 grid)
ax_bottom = fig.add_subplot(gs[2, :])
ax_bottom.set_title('Timeline View')

plt.tight_layout()
plt.show()

🎨 Styling & Professional Themes

Transform basic plots into publication-quality visualizations.

Available Styles:
plt.style.available → Lists all built-in styles
plt.style.use('seaborn-v0_8-whitegrid')
with plt.style.context('dark_background'):

Color Palettes

Perceptually Uniform: viridis, plasma, inferno, magma, cividis
Sequential: Blues, Greens, Oranges (for magnitude)
Diverging: coolwarm, RdBu (for +/- deviations)
Categorical: tab10, Set2, Paired (discrete groups)

🧠 Under the Hood: Perceptually Uniform Colors (CIELAB)

Why do we use "viridis" instead of "rainbow" colormaps? A color map is a mathematical function mapping data $f(x) \rightarrow (R, G, B)$. However, standard RGB math doesn't match human perception (Euclidean distance in RGB $\neq$ perceived color distance).

$$ \Delta E^* = \sqrt{(\Delta L^*)^2 + (\Delta a^*)^2 + (\Delta b^*)^2} $$

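The CIELAB ($L^*a^*b^*$) color space was designed so that the distance $\Delta E^*$ approximates how different two colors look to a human observer. Colormaps like viridis are built in perceptually uniform spaces of this kind (viridis itself was designed in the related CAM02-UCS space), so equal steps in the data produce roughly equal perceived steps in color and lightness rises steadily from one end to the other. This greatly reduces, though it cannot fully eliminate, visual distortion of the data.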
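One way to see the difference is to compute the CIELAB lightness $L^*$ along each colormap. The sketch below uses the standard sRGB-to-luminance conversion; `cielab_lightness` is a helper written for this example, and `matplotlib.colormaps` assumes Matplotlib 3.5 or newer.

```python
import numpy as np
import matplotlib

def cielab_lightness(rgb):
    """Approximate CIELAB L* (0-100) from sRGB values in [0, 1]."""
    rgb = np.asarray(rgb, dtype=float)
    # Undo the sRGB gamma curve to get linear light
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Relative luminance Y (D65 white, Y_n = 1)
    y = linear @ np.array([0.2126, 0.7152, 0.0722])
    # CIELAB lightness function
    delta = 6 / 29
    f = np.where(y > delta**3, np.cbrt(y), y / (3 * delta**2) + 4 / 29)
    return 116 * f - 16

t = np.linspace(0, 1, 64)
L_viridis = cielab_lightness(matplotlib.colormaps["viridis"](t)[:, :3])
L_jet = cielab_lightness(matplotlib.colormaps["jet"](t)[:, :3])

print("viridis lightness always increasing:", bool(np.all(np.diff(L_viridis) > 0)))
print("jet lightness always increasing:    ", bool(np.all(np.diff(L_jet) > 0)))
```

Viridis gets steadily lighter, so larger values always look brighter. Jet's lightness rises and then falls, which creates false boundaries and hides real ones.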

styling.py - Applying Professional Aesthetics

import matplotlib.pyplot as plt
import seaborn as sns

# 1. Apply a global Seaborn theme
sns.set_theme(style="whitegrid", palette="muted")

# 2. Customize fonts globally
plt.rcParams.update({
    'font.family': 'sans-serif',
    'font.sans-serif': ['Helvetica', 'Arial'],
    'axes.titleweight': 'bold',
    'axes.titlesize': 16,
    'axes.labelsize': 12,
    'lines.linewidth': 2
})

# 3. Plotting with the new theme
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot([1, 2, 3], [4, 5, 2], label='Data')
ax.legend()

# 4. Remove top and right spines (cleaner look)
sns.despine(ax=ax)

plt.show()

🌊 Seaborn: Statistical Visualization

Seaborn is a high-level library built on Matplotlib that makes statistical graphics beautiful and easy.

Why Seaborn?

Beautiful default styles and color palettes
Works seamlessly with Pandas DataFrames
Statistical estimation built-in (confidence intervals, regression)
Faceting for multi-panel figures
Functions organized by plot purpose

Seaborn Function Categories

Figure-level: Create entire figures (displot, relplot, catplot)
Axes-level: Draw on specific axes (histplot, scatterplot, boxplot)

By Purpose:
• Distribution: histplot, kdeplot, ecdfplot, rugplot
• Relationship: scatterplot, lineplot, regplot
• Categorical: stripplot, swarmplot, boxplot, violinplot, barplot
• Matrix: heatmap, clustermap

📊 Distribution Plots

Visualize the distribution of a single variable or compare distributions across groups.

Histogram vs KDE:
• Histogram: Discrete bins, shows raw counts
• KDE: Smooth curve, estimates probability density
• Use both together: sns.histplot(data, kde=True)

💡 ECDF (Empirical Cumulative Distribution Function) avoids binning issues entirely.

🧠 Under the Hood: Kernel Density Estimation (KDE)

A KDE plot is not just a smoothed line; it's a mathematical sum of continuous probability distributions (kernels) placed at every single data point $x_i$:

$$ \hat{f}_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) $$

Here, $K$ is typically the standard normal (Gaussian) density function, and $h$ is the bandwidth parameter. If $h$ is too small, the curve is jagged (undersmoothed, overfit); if $h$ is too large, it smooths away real features such as multiple modes (oversmoothed, underfit).
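The formula translates almost line for line into NumPy. This is a minimal sketch with a Gaussian kernel on synthetic bimodal data; the function name and the two bandwidths are choices made for illustration, not what Seaborn does internally.

```python
import numpy as np

def gaussian_kde_manual(x_grid, samples, h):
    """Evaluate the KDE formula on x_grid with a standard normal kernel."""
    u = (x_grid[:, None] - samples[None, :]) / h          # (x - x_i) / h for every pair
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)    # K(u)
    return kernels.sum(axis=1) / (samples.size * h)       # 1/(n h) * sum K

rng = np.random.default_rng(42)
samples = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(2, 0.5, 150)])
x_grid = np.linspace(-12, 12, 1201)
dx = x_grid[1] - x_grid[0]

density_narrow = gaussian_kde_manual(x_grid, samples, h=0.05)  # jagged
density_wide = gaussian_kde_manual(x_grid, samples, h=2.0)     # modes blurred together

# Both curves are valid densities: they integrate to about 1
print((density_narrow * dx).sum(), (density_wide * dx).sum())
```

Plotting both curves with `ax.plot(x_grid, density_narrow)` and `ax.plot(x_grid, density_wide)` shows the trade-off: the narrow bandwidth produces spikes at individual points, while the wide one fills in the gap between the two groups.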

distributions.py - Visualizing Distributions

import seaborn as sns
import matplotlib.pyplot as plt

penguins = sns.load_dataset("penguins")

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# 1. Step histogram per species (density-normalized within each group)
sns.histplot(
    data=penguins, x="flipper_length_mm", hue="species",
    element="step", stat="density", common_norm=False,
    ax=axes[0]
)
axes[0].set_title("Histogram with Step Fill")

# 2. KDE Plot with Rug Plot
sns.kdeplot(
    data=penguins, x="body_mass_g", hue="species",
    fill=True, common_norm=False, palette="crest",
    alpha=0.5, linewidth=1.5, ax=axes[1]
)
sns.rugplot(
    data=penguins, x="body_mass_g", hue="species",
    palette="crest",  # match the KDE colors above
    height=0.05, ax=axes[1]
)
axes[1].set_title("KDE Density + Rug Plot")

sns.despine()
plt.show()

🔗 Relationship Plots

Explore relationships between two or more continuous variables.

Key Functions:
sns.scatterplot(data=df, x='x', y='y', hue='category', size='magnitude')
sns.regplot(data=df, x='x', y='y', scatter_kws={'alpha':0.5})
sns.pairplot(df, hue='species', diag_kind='kde')

🧠 Under the Hood: Ordinary Least Squares (OLS)

When you use sns.regplot, Seaborn calculates the line of best fit by minimizing the sum of the squared residuals ($e_i^2$). The exact matrix algebra closed-form solution for the coefficients $\hat{\boldsymbol{\beta}}$ is:

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} $$

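The shaded region around the line is a 95% confidence interval for the regression line, which Seaborn estimates by bootstrapping (refitting the model on resampled data). Interpretation: if we repeated the whole sampling-and-fitting procedure many times, about 95% of the bands built this way would contain the true regression line. It is not a band that contains 95% of the data points.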
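The closed-form solution can be checked in a few lines of NumPy. This sketch fits a straight line to synthetic data and compares the result with `np.polyfit`; the true intercept and slope (2.0 and 0.5) are arbitrary choices, and nothing here is taken from Seaborn's internals.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 50)    # synthetic data: intercept 2, slope 0.5

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Normal equations: solve (X^T X) beta = X^T y instead of forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
intercept, slope = beta_hat

# Reference fit from NumPy's least-squares polynomial fitter
slope_ref, intercept_ref = np.polyfit(x, y, deg=1)

print(intercept, slope)
print(intercept_ref, slope_ref)
```

Using `np.linalg.solve` rather than `np.linalg.inv` gives the same $\hat{\boldsymbol{\beta}}$ but is numerically safer when columns of $\mathbf{X}$ are nearly collinear.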

relationships.py - Scatter and Regression

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# 1. Advanced Scatter (4 dimensions: x, y, color, size)
plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=tips, x="total_bill", y="tip",
    hue="time", size="size", sizes=(20, 200),
    palette="deep", alpha=0.8
)
plt.title("4D Scatter Plot (Total Bill vs Tip)")
plt.show()

# 2. Regression Plot with Subplots (Using lmplot)
# lmplot is a figure-level function that creates multiple subplots automatically
sns.lmplot(
    data=tips, x="total_bill", y="tip", col="time", hue="smoker",
    height=5, aspect=1.2, scatter_kws={'alpha':0.5}
)
plt.show()

# 3. Pairplot (Explore all pairwise relationships)
sns.pairplot(
    data=tips, hue="smoker",
    diag_kind="kde", markers=["o", "s"]
)
plt.show()

📦 Categorical Plots

Visualize distributions and comparisons across categorical groups.

When to Use:
• Strip/Swarm: Show all data points (small datasets)
• Box: Summary statistics (median, quartiles, outliers)
• Violin: Full distribution shape + summary
• Bar: Mean/count with error bars

🧠 Under the Hood: The IQR Outlier Rule

Box plots identify "outliers" (the individual dots beyond the whiskers) purely mathematically, not visually. They use John Tukey's Interquartile Range (IQR) method:

$$ \text{IQR} = Q_3 - Q_1 $$ $$ \text{Lower Fence} = Q_1 - 1.5 \times \text{IQR} $$ $$ \text{Upper Fence} = Q_3 + 1.5 \times \text{IQR} $$

Any point strictly outside $[\text{Lower Fence}, \text{Upper Fence}]$ is plotted as an outlier. Fun Fact: for a perfectly normal distribution $\mathcal{N}(\mu, \sigma^2)$, the fences sit at about $\mu \pm 2.7\sigma$, so roughly 0.70% of perfectly ordinary values are expected to be flagged as outliers by this fixed rule!
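The 0.70% figure follows directly from the fence formulas applied to a standard normal distribution. Here is a short check using only Python's standard library (`statistics.NormalDist`, available from Python 3.8).

```python
from statistics import NormalDist

std_normal = NormalDist()            # mean 0, standard deviation 1

q1 = std_normal.inv_cdf(0.25)        # about -0.674
q3 = std_normal.inv_cdf(0.75)        # about +0.674
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability mass beyond either fence
outlier_fraction = std_normal.cdf(lower_fence) + (1 - std_normal.cdf(upper_fence))

print(f"fences at ±{upper_fence:.3f} sigma")
print(f"expected outlier share: {outlier_fraction:.2%}")   # about 0.70%
```

In a sample of 10,000 normally distributed values you should therefore expect around 70 dots beyond the whiskers, even though nothing is wrong with the data.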

categorical.py - Categories and Factor Variables

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# 1. Violin Plot (Distribution density across categories)
sns.violinplot(
    data=tips, x="day", y="total_bill", hue="sex",
    split=True, inner="quart", palette="muted",
    ax=axes[0]
)
axes[0].set_title("Violin Plot (Split by Sex)")

# 2. Boxplot + Swarmplot Overlay
# Good for showing summary stats PLUS underlying data points
sns.boxplot(
    data=tips, x="day", y="total_bill", color="white",
    width=.5, showfliers=False, ax=axes[1] # hide boxplot outliers to avoid overlap
)
sns.swarmplot(
    data=tips, x="day", y="total_bill", hue="time",
    size=6, alpha=0.7, ax=axes[1]
)
axes[1].set_title("Boxplot + Swarmplot Overlay")

plt.tight_layout()
plt.show()

🔥 Heatmaps & Correlation Matrices

Visualize matrices of values using color intensity. Essential for EDA correlation analysis.

Best Practices:
• Always annotate with values: annot=True
• Use diverging colormap for correlation: cmap='coolwarm', center=0
• Mask upper/lower triangle: mask=np.triu(np.ones_like(corr, dtype=bool))
• Square cells: square=True

💡 Clustermap automatically clusters similar rows/columns together.

🧠 Under the Hood: Correlation Coefficients

Correlation heatmaps display the strength of linear relationships between variables, typically mapping the Pearson Correlation Coefficient ($r$) to a continuous color gradient:

$$ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} $$

For non-linear but monotonic relationships, switch to Spearman's Rank Correlation ($\rho$), e.g. df.corr(method='spearman') in pandas, which converts raw values to ranks $R(x_i)$ before applying the same formula. Both coefficients are bounded in $[-1, 1]$.
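A small sketch of the difference, using a made-up exponential relationship: the data are perfectly monotonic but far from linear, so the two coefficients disagree. The column names `x` and `y` are placeholders.

```python
import numpy as np
import pandas as pd

x = np.linspace(1, 10, 50)
df = pd.DataFrame({"x": x, "y": np.exp(x)})   # monotonic, strongly non-linear

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"Pearson r    = {pearson:.3f}")
print(f"Spearman rho = {spearman:.3f}")
```

Spearman reports a perfect relationship because the ranks of `x` and `y` line up exactly, while Pearson is pulled down by the curvature.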

heatmaps.py - Correlation Matrix

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Load data and calculate correlation matrix
penguins = sns.load_dataset("penguins")
# Select only numerical columns for correlation
numerical_df = penguins.select_dtypes(include=[np.number]) 
corr = numerical_df.corr()

# Create a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

plt.figure(figsize=(8, 6))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(
    corr, 
    mask=mask, 
    cmap='coolwarm', 
    vmax=1, vmin=-1, 
    center=0,
    square=True, 
    linewidths=.5, 
    annot=True, 
    fmt=".2f",
    cbar_kws={"shrink": .8}
)

plt.title("Penguin Feature Correlation")
plt.tight_layout()
plt.show()

🚀 Plotly Express: Interactive Visualization

Plotly creates interactive, web-based visualizations with zoom, pan, hover tooltips, and more.

Why Plotly?

Interactive out of the box (zoom, pan, select)
Hover tooltips with data details
Export as HTML, PNG, or embed in dashboards
Works in Jupyter, Streamlit, Dash
plotly.express is the high-level API (like Seaborn for Matplotlib)

Common Functions:
px.scatter(df, x='x', y='y', color='category', size='value', hover_data=['name'])
px.line(df, x='date', y='price', color='stock')
px.bar(df, x='category', y='count', color='group', barmode='group')
px.histogram(df, x='value', nbins=50, marginal='box')

🎬 Animated Visualizations

Add time dimension to your visualizations with animations.

Plotly Animation:
px.scatter(df, x='gdp', y='life_exp', animation_frame='year', animation_group='country', size='pop', color='continent')

Matplotlib Animation:
from matplotlib.animation import FuncAnimation
ani = FuncAnimation(fig, update_func, frames=100, interval=50)
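The two lines above leave `fig` and `update_func` undefined, so here is a minimal, self-contained sketch of how they might fit together. The travelling sine wave and the phase step of 0.1 per frame are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

x = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots()
(line,) = ax.plot(x, np.sin(x))

def update_func(frame):
    """Shift the sine wave slightly on every frame."""
    line.set_ydata(np.sin(x + 0.1 * frame))
    return (line,)   # artists that changed (required when blit=True)

# Keep a reference to the animation, otherwise it can be garbage-collected
ani = FuncAnimation(fig, update_func, frames=100, interval=50, blit=True)

# plt.show()                             # view interactively
# ani.save("wave.gif", writer="pillow")  # or save to a file (requires Pillow)
```

`interval` is the delay between frames in milliseconds, so 100 frames at 50 ms gives a loop of about five seconds.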

✅ Hans Rosling's Gapminder is the classic example of animated scatter plots!

animation_example.py - Gapminder Scatter

import plotly.express as px

df = px.data.gapminder()

# Plotly makes animations incredibly easy with two arguments:
# 'animation_frame' (the time dimension) and 'animation_group' (the entity)
fig = px.scatter(
    df, 
    x="gdpPercap", y="lifeExp", 
    animation_frame="year", animation_group="country",
    size="pop", color="continent", 
    hover_name="country",
    log_x=True, size_max=55, 
    range_x=[100,100000], range_y=[25,90],
    title="Global Development 1952 - 2007"
)

fig.show()

📱 Interactive Dashboards with Streamlit

Build interactive web apps for data exploration without web development experience.

Streamlit Basics:
streamlit run app.py

import streamlit as st
st.title("My Dashboard")
st.slider("Select value", 0, 100, 50)
st.selectbox("Choose", ["A", "B", "C"])
st.plotly_chart(fig)

💡 Streamlit auto-reruns when input changes - no callbacks needed!

app.py - Minimal Streamlit Dashboard

import streamlit as st
import pandas as pd
import plotly.express as px

# 1. Page Configuration
st.set_page_config(page_title="Sales Dashboard", layout="wide")
st.title("Interactive Sales Dashboard 📊")

# 2. Sidebar Filters
st.sidebar.header("Filters")
category = st.sidebar.selectbox("Select Category", ["Electronics", "Clothing", "Home"])
min_sales = st.sidebar.slider("Minimum Sales", 0, 1000, 200)

# Mock Data Generation
df = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=30),
    'Sales': [x * 10 for x in range(30)],
    'Category': [category] * 30
})
filtered_df = df[df['Sales'] >= min_sales]

# 3. Layout with Columns
col1, col2 = st.columns(2)

# KPI Metric
col1.metric("Total Filtered Sales", f"${filtered_df['Sales'].sum()}")

# 4. Insert Plotly Chart
fig = px.line(filtered_df, x='Date', y='Sales', title=f"{category} Sales Trend")
col2.plotly_chart(fig, use_container_width=True)

🗺️ Geospatial Visualization

Visualize geographic data with maps, choropleth, and point plots.

Libraries:
• Plotly: px.choropleth(df, locations='country', color='value')
• Folium: Interactive Leaflet maps
• Geopandas + Matplotlib: Static maps with shapefiles
• Kepler.gl: Large-scale geospatial visualization

🧠 Under the Hood: Geospatial Math

Visualizing data on a map requires mathematically converting a 3D spherical Earth into 2D screen pixels. The Web Mercator Projection (used by Google Maps and Plotly's tile-based maps) achieves this by preserving angles (conformal) but heavily distorting sizes near the poles. With longitude $\lambda$ and latitude $\varphi$ in radians and Earth radius $R$:

$$ x = R \cdot \lambda \qquad y = R \ln\left[\tan\left(\frac{\pi}{4} + \frac{\varphi}{2}\right)\right] $$

Furthermore, when calculating distances between two GPS coordinates (e.g., to color a density heatmap), you cannot apply the straight Euclidean distance $d = \sqrt{\Delta x^2 + \Delta y^2}$ to raw longitude/latitude values. Geospatial libraries typically use the Haversine formula to find the great-circle distance over a sphere.
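For reference, here is a compact sketch of the Haversine formula in plain Python. It assumes a spherical Earth with a mean radius of 6371 km; `haversine_km` is a name chosen for this example, not a library function.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumes a perfect sphere

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)

    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # min() guards against rounding pushing sqrt(a) just above 1
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

# A quarter of the way around the equator
quarter_equator = haversine_km(0.0, 0.0, 0.0, 90.0)
print(f"{quarter_equator:.1f} km")
```

Because the real Earth is slightly flattened, results from this spherical model can differ from ellipsoidal (geodesic) distances by a fraction of a percent.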

geospatial.py - Plotly Choropleth

import plotly.express as px

# Plotly includes built-in geospatial data
df = px.data.gapminder().query("year==2007")

# Create a choropleth map
# 'locations' takes ISO-3 country codes by default 
fig = px.choropleth(
    df, 
    locations="iso_alpha",   # Column holding ISO-3 country codes
    color="lifeExp",         # Data to map to color
    hover_name="country",    # Tooltip label
    color_continuous_scale=px.colors.sequential.Plasma,
    title="Global Life Expectancy (2007)"
)

# Customize the map projection type
fig.update_geos(
    projection_type="orthographic", # "natural earth", "mercator", etc.
    showcoastlines=True, 
    coastlinecolor="DarkBlue"
)

fig.show()

🎲 3D Visualization

Visualize three-dimensional relationships with surface plots, scatter plots, and more.

⚠️ 3D plots can obscure data. Often, multiple 2D views are more effective.

✅ Use Plotly for interactive 3D (rotate, zoom) instead of static Matplotlib 3D.

🧠 Under the Hood: 3D Perspective Projection Matrix

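To render 3D data $(x, y, z)$ on a 2D screen, WebGL-based libraries like Plotly.js can apply a Perspective Projection Matrix. This creates the optical illusion of depth by scaling $x$ and $y$ inversely with distance $z$. Here $fov$ is the vertical field of view, aspect is the viewport width/height ratio, and $n$ and $f$ are the near and far clipping distances: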

$$ \begin{bmatrix} x' \\ y' \\ z' \\ w \end{bmatrix} = \begin{bmatrix} \frac{1}{\text{aspect} \cdot \tan(\frac{fov}{2})} & 0 & 0 & 0 \\ 0 & \frac{1}{\tan(\frac{fov}{2})} & 0 & 0 \\ 0 & 0 & \frac{f+n}{f-n} & \frac{-2fn}{f-n} \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} $$

Once multiplied out, the final screen coordinates are $(x'/w, y'/w)$. When you drag to rotate a 3D Plotly graph, the camera matrices are rebuilt for each frame, and the GPU (via WebGL) applies them to every vertex, which is how thousands of points can be re-projected in real time!
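The matrix above can be tried out with NumPy. This sketch builds it exactly as written (with $w = z$, so the camera looks along $+z$) and projects two points that differ only in depth. The field of view, clipping planes, and helper names are assumptions for illustration, not Plotly's actual camera settings.

```python
import numpy as np

def perspective_matrix(fov_deg, aspect, near, far):
    """Perspective projection matrix in the form shown above."""
    t = np.tan(np.radians(fov_deg) / 2)
    return np.array([
        [1 / (aspect * t), 0.0,   0.0,                         0.0],
        [0.0,              1 / t, 0.0,                         0.0],
        [0.0,              0.0,   (far + near) / (far - near), -2 * far * near / (far - near)],
        [0.0,              0.0,   1.0,                         0.0],
    ])

def project(point, matrix):
    """Apply the matrix to (x, y, z, 1) and divide by w."""
    x, y, z, w = matrix @ np.append(np.asarray(point, dtype=float), 1.0)
    return np.array([x / w, y / w, z / w])

P = perspective_matrix(fov_deg=90, aspect=1.0, near=1.0, far=100.0)

near_pt = project([1.0, 1.0, 2.0], P)   # screen x, y about 0.5
far_pt = project([1.0, 1.0, 4.0], P)    # twice as far: screen x, y about 0.25

print(near_pt[:2], far_pt[:2])
```

Doubling the depth halves the projected position, which is exactly the "farther away looks smaller" effect. The third component is a normalized depth running from -1 at the near plane to +1 at the far plane.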

3d_plots.py - Interactive 3D Scatter

import plotly.express as px

# Load Iris dataset
df = px.data.iris()

# Create interactive 3D scatter plot
fig = px.scatter_3d(
    df, 
    x='sepal_length', 
    y='sepal_width', 
    z='petal_width',
    color='species',
    size='petal_length', 
    size_max=18,
    symbol='species',
    opacity=0.7,
    title="Iris 3D Feature Space"
)

# Tight layout for 3D plot
fig.update_layout(margin=dict(l=0, r=0, b=0, t=40))
fig.show()

📖 Data Storytelling

Transform visualizations into compelling narratives that drive action.

The Data Storytelling Framework:

Context: Why does this matter? Who is the audience?
Data: What insights did you discover?
Narrative: What's the storyline (beginning, middle, end)?
Visual: Which chart best supports the story?
Call to Action: What should the audience do?

Design Principles

Remove Clutter: Eliminate chartjunk, gridlines, borders
Focus Attention: Use color strategically (grey + accent)
Think Like a Designer: Alignment, white space, hierarchy
Tell a Story: Title = conclusion, not description
Bad: "Sales by Region"
Good: "West Region Sales Dropped 23% in Q4"

💡 "If you can't explain it simply, you don't understand it well enough." (commonly attributed to Einstein)

✅ Read "Storytelling with Data" by Cole Nussbaumer Knaflic