const modules = [ { id: "python-fundamentals", title: "Python Fundamentals for DS", icon: "🐍", category: "Foundations", description: "Data structures, comprehensions, file I/O, virtual environments" }, { id: "numpy", title: "NumPy & Scientific Computing", icon: "🔢", category: "Scientific", description: "ndarrays, broadcasting, vectorization, linear algebra" }, { id: "pandas", title: "Pandas & Data Manipulation", icon: "🐼", category: "Data Wrangling", description: "DataFrames, groupby, pivot, time series, merging" }, { id: "visualization", title: "Data Visualization", icon: "📊", category: "Visualization", description: "Matplotlib, Seaborn, Plotly — from basics to publication-ready" }, { id: "advanced-python", title: "Advanced Python", icon: "🎯", category: "Advanced", description: "OOP, decorators, async, multiprocessing, type hints" }, { id: "sklearn", title: "Python for ML (Scikit-learn)", icon: "🤖", category: "Machine Learning", description: "Pipelines, transformers, cross-validation, hyperparameter tuning" }, { id: "pytorch", title: "Deep Learning with PyTorch", icon: "🔥", category: "Deep Learning", description: "Tensors, autograd, nn.Module, training loops, transfer learning" }, { id: "tensorflow", title: "TensorFlow & Keras", icon: "🧠", category: "Deep Learning", description: "Sequential/Functional API, callbacks, TensorBoard, deployment" }, { id: "production", title: "Production Python", icon: "📦", category: "Engineering", description: "Testing, packaging, logging, FastAPI for model serving" }, { id: "optimization", title: "Performance & Optimization", icon: "⚡", category: "Optimization", description: "Profiling, Numba, Cython, memory optimization, Dask" } ]; const MODULE_CONTENT = { "python-fundamentals": { concepts: `

🐍 Python Fundamentals — Complete Deep Dive

⚡ Python Is Not What You Think
Python is a dynamically-typed, garbage-collected, interpreted language with a C-based runtime (CPython). Everything is an object — integers, functions, even classes. Understanding this object model is what separates beginners from professionals.

1. Data Structures — Complete Reference

Type      | Mutable | Ordered  | Hashable | Use Case
list      | ✓       | ✓        | ✗        | Sequential data, time series, feature lists
tuple     | ✗       | ✓        | ✓        | Fixed records, dict keys, DataFrame rows
dict      | ✓       | ✓ (3.7+) | ✗        | Lookup tables, JSON, config, caches
set       | ✓       | ✗        | ✗        | Unique values, membership testing O(1)
frozenset | ✗       | ✗        | ✓        | Immutable set, usable as dict keys
deque     | ✓       | ✓        | ✗        | O(1) append/pop both ends, sliding windows
bytes     | ✗       | ✓        | ✓        | Binary data, serialization, network I/O
bytearray | ✓       | ✓        | ✗        | Mutable binary buffers

2. Time Complexity — What Every Dev Must Know

Operation           | list                  | dict | set
Lookup by index/key | O(1)                  | O(1) | —
Search (x in ...)   | O(n)                  | O(1) | O(1)
Insert/Append       | O(1) end, O(n) middle | O(1) | O(1)
Delete              | O(n)                  | O(1) | O(1)
Sort                | O(n log n)            | —    | —
Iteration           | O(n)                  | O(n) | O(n)

Real-world impact: Checking if an item exists in a list of 1M elements = ~50ms. In a set = ~0.00005ms. That's 1,000,000x faster. Always use sets/dicts for membership testing.
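A quick way to verify the gap yourself (exact timings vary by machine; this is a minimal sketch):

```python
import timeit

n = 1_000_000
as_list = list(range(n))
as_set = set(as_list)

# Worst case: a target near the end forces a full O(n) scan of the list.
target = n - 1
list_time = timeit.timeit(lambda: target in as_list, number=10)
set_time = timeit.timeit(lambda: target in as_set, number=10)

print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
assert set_time < list_time  # hash lookup beats linear scan
```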

3. Python Memory Model

⚡ Everything Is An Object on the Heap
Variables are references (pointers), not boxes. a = [1,2,3] creates a list on the heap; a points to it. b = a makes both point to the same list. This is aliasing — the #1 source of bugs in beginner Python code.

Reference Counting: Each object tracks how many names reference it. When count = 0, freed immediately. del decrements the count, doesn't necessarily free memory.

Integer Interning: Python caches integers -5 to 256. So a = 100; b = 100; a is b → True. But a = 1000; b = 1000; a is b → may be False. Never use is for value comparison.

Garbage Collection: 3 generations (gen0, gen1, gen2). New objects in gen0. Survivors promoted. Use gc.collect() after deleting large ML models.
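The points above can be checked interactively; a minimal sketch (CPython behavior):

```python
import gc
import sys

# Aliasing: b is just another name for the same heap object.
a = [1, 2, 3]
b = a
b.append(4)
assert a == [1, 2, 3, 4] and a is b

# Reference counting: adding an alias raises the count by one
# (getrefcount itself adds a temporary reference to what it reports).
before = sys.getrefcount(a)
c = a
assert sys.getrefcount(a) == before + 1

# Small-int cache: 100 is interned, so both names share one object.
x, y = 100, 100
assert x is y
# For larger ints identity is an implementation detail — compare with ==.
assert 1000 == 1000

# The generational collector reclaims reference cycles.
gc.collect()  # returns the number of unreachable objects found
```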

4. Generators & Iterators — The Heart of Python

🔄 Lazy Evaluation
yield suspends state, return terminates. A list of 1B items = ~8GB. A generator = ~100 bytes. The Iterator Protocol: any object with __iter__ + __next__. Generator expressions: (x**2 for x in range(10**9)) — O(1) memory.

yield from: Delegates to sub-generator. Forwards send() and throw(). Essential for building composable data pipelines.
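A minimal sketch of delegation with yield from (the function names are illustrative):

```python
def read_block(block):
    # Sub-generator: yields one record at a time.
    for record in block:
        yield record

def read_all(blocks):
    # 'yield from' delegates to each sub-generator in turn,
    # forwarding send() and throw() transparently.
    for block in blocks:
        yield from read_block(block)

blocks = [[1, 2], [3], [4, 5]]
flat = list(read_all(blocks))
print(flat)  # [1, 2, 3, 4, 5]
```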

send(): Two-way communication with generators (coroutines). value = yield result — both receives and produces values.

5. Closures & First-Class Functions

Functions are first-class objects — passed as args, returned, assigned. A closure captures variables from the enclosing scope. Foundation of decorators, callbacks, and functional programming.
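A tiny closure sketch (make_scaler is an illustrative name):

```python
def make_scaler(factor):
    # 'factor' is captured from the enclosing scope: a closure.
    def scale(x):
        return x * factor
    return scale

double = make_scaler(2)
triple = make_scaler(3)
assert double(10) == 20
assert triple(10) == 30
# The captured variable lives in the function's closure cells.
assert double.__closure__[0].cell_contents == 2
```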

6. Critical Python Gotchas for Projects

⚠️ The 5 Deadliest Python Traps
1. Mutable Default Args: def f(x, lst=[]): — list shared across ALL calls. Fix: lst=None.
2. Late Binding Closures: [lambda: i for i in range(5)] — all return 4! Fix: lambda i=i: i.
3. Shallow Copy: list(a) copies the outer list but shares inner objects.
4. String Concatenation: s += "text" in a loop creates a new string every time — O(n²). Use ''.join(parts).
5. Circular Imports: Module A imports B, B imports A → ImportError. Fix: restructure or lazy import.
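Trap 2 is easy to demonstrate (the mutable-default trap is covered in the code tab):

```python
# Late binding: each lambda looks up 'i' when CALLED, not when defined,
# so all five see the final value of the loop variable.
late = [lambda: i for i in range(5)]
assert [f() for f in late] == [4, 4, 4, 4, 4]

# Fix: bind the current value as a default argument at definition time.
bound = [lambda i=i: i for i in range(5)]
assert [f() for f in bound] == [0, 1, 2, 3, 4]
```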

7. Error Handling for Production Projects

πŸ›‘οΈ Exception Hierarchy You Must Know
BaseException β†’ Exception (catch this) β†’ ValueError, TypeError, KeyError, FileNotFoundError, ConnectionError...
Rules: (1) Never catch bare except:. (2) Catch specific exceptions. (3) Use else for success path. (4) finally always runs. (5) Create custom exceptions for your project.

8. collections Module — Power Tools

Class       | Purpose                     | Project Use Case
defaultdict | Dict with default factory   | Group data: defaultdict(list)
Counter     | Count hashable objects      | Label distribution, word frequency
namedtuple  | Lightweight immutable class | Return multiple named values
deque       | Double-ended queue          | Sliding window, BFS, ring buffer
ChainMap    | Stack multiple dicts        | Config layers: defaults → env → CLI
OrderedDict | Ordered dict (legacy)       | move_to_end() for LRU cache

9. itertools — Memory-Efficient Pipelines

Function       | What It Does                 | Project Use
chain()        | Concatenate iterables lazily | Merge data files
islice()       | Slice any iterator           | Take first N from a generator
groupby()      | Group consecutive elements   | Process sorted logs by date
product()      | Cartesian product            | Hyperparameter grid
combinations() | All r-length combos          | Feature interaction pairs
starmap()      | map() with unpacked args     | Apply function to paired data
accumulate()   | Running accumulator          | Cumulative sums, running max
tee()          | Clone iterator N times       | Multiple passes over a stream
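A few of these in action (values are illustrative):

```python
from itertools import accumulate, chain, islice, product

# chain + islice: lazily take the first 5 items across two sources.
merged = list(islice(chain([1, 2, 3], [4, 5, 6]), 5))
assert merged == [1, 2, 3, 4, 5]

# product: a hyperparameter grid without nested loops.
grid = list(product([0.01, 0.1], [32, 64]))
assert grid == [(0.01, 32), (0.01, 64), (0.1, 32), (0.1, 64)]

# accumulate with max: a running maximum.
assert list(accumulate([3, 1, 4, 1, 5], max)) == [3, 3, 4, 4, 5]
```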

10. File I/O for Real Projects

Format  | Read              | Write               | Best For
JSON    | json.load(f)      | json.dump(obj, f)   | Configs, API responses
CSV     | csv.DictReader(f) | csv.DictWriter(f)   | Tabular data (small)
YAML    | yaml.safe_load(f) | yaml.dump(obj, f)   | Config files
Pickle  | pickle.load(f)    | pickle.dump(obj, f) | Python objects, models
Parquet | pd.read_parquet() | df.to_parquet()     | Large DataFrames (fast)
SQLite  | sqlite3.connect() | SQL queries         | Local database

11. pathlib — Modern File Handling

Stop using os.path.join(). Use pathlib.Path: Path('data') / 'train' / 'images'. Methods: .glob(), .read_text(), .mkdir(parents=True), .exists(), .suffix, .stem. Cross-platform, readable, powerful.
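A short sketch of those methods (the directory layout is illustrative):

```python
import tempfile
from pathlib import Path

# Build paths with '/', not string concatenation.
root = Path(tempfile.mkdtemp())
img_dir = root / "data" / "train" / "images"
img_dir.mkdir(parents=True, exist_ok=True)

(img_dir / "cat.png").write_text("fake image")
found = sorted(p.name for p in img_dir.glob("*.png"))
assert found == ["cat.png"]
assert (img_dir / "cat.png").suffix == ".png"
assert (img_dir / "cat.png").stem == "cat"
assert (img_dir / "cat.png").exists()
```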

12. Virtual Environments & Dependency Management

Tool      | Best For             | Key Feature
venv      | Simple projects      | Built-in, lightweight
conda     | DS/ML (C deps)       | Handles CUDA, MKL, OpenCV
poetry    | Modern packaging     | Lock files, deterministic builds
uv        | Speed                | 10-100x faster pip (Rust-based)
pip-tools | Requirements pinning | pip-compile for lock files

13. Project Structure Template

my_project/
├── src/
│   └── my_package/
│       ├── __init__.py
│       ├── data/          # Data loading & processing
│       ├── models/        # Model definitions
│       ├── training/      # Training loops
│       ├── evaluation/    # Metrics & evaluation
│       ├── serving/       # API endpoints
│       └── utils/         # Shared utilities
├── tests/
│   ├── conftest.py        # Shared fixtures
│   ├── test_data.py
│   └── test_models.py
├── configs/               # YAML/JSON configs
├── notebooks/             # EDA notebooks
├── scripts/               # CLI scripts
├── pyproject.toml         # Modern Python packaging
├── Dockerfile
├── Makefile               # Common commands
└── README.md

14. String Operations for Data Cleaning

f-strings (3.6+): f"{accuracy:.2%}" → "95.23%". f"{x=}" (3.8+) → "x=42" for debugging. f"{name!r}" → shows repr. regex: re.compile(pattern) for repeated use. re.sub() for cleaning. re.findall() for extraction. Always compile patterns used in loops.
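The f-string forms above, verified (values are illustrative):

```python
accuracy = 0.9523
x = 42
name = "Ada"

assert f"{accuracy:.2%}" == "95.23%"   # percent with 2 decimal places
assert f"{x=}" == "x=42"               # debug form (3.8+)
assert f"{name!r}" == "'Ada'"          # repr conversion
```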

15. Command-Line Interface (CLI) Tools

argparse: Built-in CLI parsing. click: Decorator-based, more Pythonic. typer: Modern, uses type hints. Every production project needs a CLI for: training, evaluation, data processing, deployment scripts.

`, code: `

💻 Python Fundamentals — Project Code

1. Generator Pipeline — Process Any Size Data

import json
from itertools import islice

def read_jsonl(filepath):
    """Read a JSON Lines file lazily — handles any size."""
    with open(filepath) as f:
        for line in f:
            yield json.loads(line.strip())

def filter_records(records, min_score=0.5):
    for rec in records:
        if rec.get('score', 0) >= min_score:
            yield rec

def batch(iterable, size=64):
    """Batch any iterable into fixed-size chunks."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# Compose into a pipeline — still O(1) memory!
pipeline = batch(filter_records(read_jsonl("data.jsonl")), size=32)
for chunk in pipeline:
    process(chunk)  # only 32 records in memory at a time

2. Coroutine Pattern — Running Statistics

def running_stats():
    """Coroutine that computes running mean & variance."""
    n = 0
    mean = 0.0
    M2 = 0.0
    while True:
        x = yield {'mean': mean, 'var': M2 / n if n > 0 else 0, 'n': n}
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)  # Welford's algorithm — numerically stable

stats = running_stats()
next(stats)      # prime the coroutine
stats.send(10)   # {'mean': 10.0, 'var': 0.0, 'n': 1}
stats.send(20)   # {'mean': 15.0, 'var': 25.0, 'n': 2}

3. Custom Exception Hierarchy for Projects

import pandas as pd

# Define project-specific exceptions
class ProjectError(Exception):
    """Base exception for the project."""

class DataValidationError(ProjectError):
    def __init__(self, column, expected, actual):
        self.column = column
        super().__init__(
            f"Column '{column}': expected {expected}, got {actual}"
        )

class ModelNotTrainedError(ProjectError):
    pass

# Usage with proper error handling
def load_and_validate(path):
    try:
        df = pd.read_csv(path)
    except FileNotFoundError:
        raise DataValidationError("file", "exists", "missing")
    except pd.errors.EmptyDataError:
        raise DataValidationError("data", "non-empty", "empty file")
    else:
        print(f"Loaded {len(df)} rows")
        return df
    finally:
        print("Load attempt complete")

4. Closures & Mutable Default Trap

# ⚠️ THE #1 PYTHON BUG — mutable default argument
def bad_append(item, lst=[]):  # list shared across ALL calls!
    lst.append(item)
    return lst

bad_append(1)  # [1]
bad_append(2)  # [1, 2] ← SURPRISE!

# ✅ CORRECT — use a None sentinel
def good_append(item, lst=None):
    if lst is None:
        lst = []
    lst.append(item)
    return lst

5. collections in Action

from collections import defaultdict, Counter, deque

# defaultdict — group data without KeyError
samples_by_label = defaultdict(list)
for feat, label in zip(features, labels):
    samples_by_label[label].append(feat)

# Counter — class distribution + top-N
dist = Counter(y_train)
print(dist.most_common(3))
imbalance_ratio = dist.most_common()[0][1] / dist.most_common()[-1][1]

# deque — sliding window for streaming
window = deque(maxlen=5)
for val in data_stream:
    window.append(val)
    moving_avg = sum(window) / len(window)

6. CLI Tool with argparse

import argparse

def main():
    parser = argparse.ArgumentParser(description="Train ML model")
    parser.add_argument("--data", required=True, help="Path to data")
    parser.add_argument("--model", choices=["rf", "xgb", "lgbm"], default="rf")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()
    print(f"Training {args.model} on {args.data}")

# python train.py --data data.csv --model xgb --epochs 50
if __name__ == "__main__":
    main()

7. Advanced Comprehensions & Modern Python

# Walrus operator (:=) — assign + use (3.8+)
if (n := len(data)) > 1000:
    print(f"Large dataset: {n} samples")

# Dict merge (3.9+)
config = defaults | overrides

# match-case — structural pattern matching (3.10+)
match command:
    case {"action": "train", "model": model_name}:
        train(model_name)
    case {"action": "predict", "data": path}:
        predict(path)
    case _:
        print("Unknown command")

# Extended unpacking
first, *middle, last = sorted(scores)

# Nested dict comprehension
metrics = {
    model: {metric: score for metric, score in results.items()}
    for model, results in all_results.items()
}

8. Regex for Data Cleaning

import re

# Compile patterns used repeatedly (10x faster)
EMAIL = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
PHONE = re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b')

# Extract all emails from text
emails = EMAIL.findall(text)

# Clean text for NLP
def clean_text(text):
    text = re.sub(r'http\S+', '', text)        # remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)    # keep only letters
    text = re.sub(r'\s+', ' ', text).strip()   # normalize whitespace
    return text.lower()

9. Configuration Management

import json
import yaml
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class Config:
    model_name: str = "random_forest"
    learning_rate: float = 0.001
    batch_size: int = 32
    epochs: int = 100
    data_path: str = "data/train.csv"

    @classmethod
    def from_yaml(cls, path):
        with open(path) as f:
            return cls(**yaml.safe_load(f))

    def save(self, path):
        Path(path).write_text(json.dumps(asdict(self), indent=2))

config = Config.from_yaml("configs/experiment.yaml")
`, interview: `

🎯 Python Fundamentals — Interview Questions

Q1: List vs tuple — when to use which?

Answer: Tuples: immutable, hashable (dict keys), less memory. Lists: mutable, growable. Use tuples for fixed data (coordinates, config). Use lists for collections that change. Tuples signal "this shouldn't be modified."

Q2: How does Python's GIL affect DS?

Answer: The GIL prevents threads from running Python bytecode in parallel, so threading doesn't speed up CPU-bound Python. But NumPy/Pandas release the GIL during C operations. For pure Python CPU work → multiprocessing. For I/O → threading works. For data science, the GIL rarely matters.

Q3: Shallow vs deep copy?

Answer: copy.copy(): outer container copied, inner objects shared. copy.deepcopy(): everything copied recursively. Real trap: df2 = df is NOT a copy — it's aliasing. Use df.copy().

Q4: What is the mutable default argument trap?

Answer: def f(x, lst=[]): — default list created ONCE and shared. Fix: lst=None; if lst is None: lst = []. The #1 Python interview gotcha.

Q5: Why are generators critical for large data?

Answer: O(1) memory. 1B items as list = 8GB. As generator = 100 bytes. Use for: file processing, streaming, batch training. yield from for composition.

Q6: Explain LEGB scope rule.

Answer: Name lookup order: Local → Enclosing → Global → Built-in. nonlocal for enclosing scope, global for module. list = [1] shadows built-in list().

Q7: How to handle a 10GB CSV?

Answer: (1) pd.read_csv(chunksize=N), (2) usecols=['needed'], (3) dtype={'col':'int32'}, (4) Dask, (5) DuckDB for SQL on CSV, (6) Polars for Rust-speed.

Q8: Dict lookup O(1) vs list search O(n)?

Answer: Dicts use hash tables. Key → hash → slot index. O(1) average. Lists scan linearly. x in set is O(1) but x in list is O(n). For 1M items: microseconds vs milliseconds.

Q9: Explain Python's garbage collection.

Answer: (1) Reference counting — freed at count=0. (2) Cyclic GC — detects A→B→A cycles. 3 generations. gc.collect() after deleting large models.

Q10: What is __slots__?

Answer: Replaces per-instance __dict__ with fixed array. ~40% memory savings. Use for millions of small objects. Trade-off: no dynamic attributes.
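A minimal sketch of the trade-off (Point is an illustrative class):

```python
class Point:
    __slots__ = ('x', 'y')  # fixed attribute layout, no per-instance __dict__

    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)
assert (p.x, p.y) == (1, 2)
assert not hasattr(p, '__dict__')   # the per-instance dict is gone

try:
    p.z = 3                         # dynamic attributes are disallowed
except AttributeError:
    pass
else:
    raise AssertionError("expected AttributeError")
```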

Q11: How do you structure a Python project?

Answer: src/package/ layout. pyproject.toml for config. tests/ with pytest. configs/ for YAML. Makefile for common commands. Separate data, models, training, serving.

Q12: What's the difference between is and ==?

Answer: == checks value equality. is checks identity (same memory). Use is only for singletons: x is None, x is True. Integer interning makes 256 is 256 True but 1000 is 1000 may be False.

` }, "numpy": { concepts: `

🔢 NumPy — Complete Deep Dive

⚡ Why NumPy Is 50-100x Faster
(1) Contiguous memory — CPU cache-friendly. (2) Compiled C loops. (3) SIMD instructions — 4-8 floats simultaneously. Python list: array of pointers to objects. NumPy: raw typed data in a block.

1. ndarray Internals

Feature        | Python List         | NumPy ndarray
Storage        | Pointers to objects | Contiguous typed data
Memory per int | ~28 bytes + pointer | 8 bytes (int64)
Operations     | Python loop         | Compiled C/Fortran
SIMD           | Impossible          | CPU vector instructions

2. Memory Layout & Strides

🧠 Strides = The Secret Behind Views
Every ndarray has strides — bytes to jump in each dimension. For (3,4) float64: strides = (32, 8). Slicing creates views (no copy) by adjusting strides. arr[::2] doubles the row stride. C-order (row-major): rows contiguous. Fortran-order: columns contiguous. Iterate along the last axis for best performance.
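Strides and views can be inspected directly; a minimal sketch:

```python
import numpy as np

arr = np.zeros((3, 4), dtype=np.float64)
assert arr.strides == (32, 8)      # 4 cols * 8 bytes per row, 8 bytes per col

view = arr[::2]                    # every other row: a view, not a copy
assert view.strides == (64, 8)     # row stride doubled, no data moved
assert np.shares_memory(arr, view)

view[0, 0] = 99.0                  # writing the view mutates the original
assert arr[0, 0] == 99.0
```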

3. Broadcasting Rules

🎯 Rules (Right to Left)
Two arrays are compatible when, for each trailing dim: dims are equal OR one is 1. (5,3,1) + (1,4) → (5,3,4). The "1" dims stretch virtually — no memory copied. Common: X - X.mean(axis=0) → (1000,5) - (5,) works!
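Both examples from the rule, checked directly:

```python
import numpy as np

a = np.ones((5, 3, 1))
b = np.ones((1, 4))
assert (a + b).shape == (5, 3, 4)   # trailing dims 1 vs 4 → the 1 stretches

X = np.arange(12, dtype=float).reshape(4, 3)
centered = X - X.mean(axis=0)       # (4,3) - (3,) broadcasts per column
assert np.allclose(centered.mean(axis=0), 0.0)
```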

4. Universal Functions (ufuncs)

Vectorized element-wise functions. Advanced methods: .reduce() (fold), .accumulate() (running total), .outer() (outer product), .at() (unbuffered in-place). Create custom with np.frompyfunc().
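The ufunc methods above in miniature:

```python
import numpy as np

x = np.array([1, 2, 3, 4])
assert np.add.reduce(x) == 10                          # fold: 1+2+3+4
assert np.add.accumulate(x).tolist() == [1, 3, 6, 10]  # running total
assert np.multiply.outer([1, 2], [10, 20]).tolist() == [[10, 20], [20, 40]]

# .at(): unbuffered in-place update — correct even with repeated indices
counts = np.zeros(3, dtype=int)
np.add.at(counts, [0, 0, 2], 1)
assert counts.tolist() == [2, 0, 1]
```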

5. dtype Selection for Projects

dtype   | Bytes | When to Use
float32 | 4     | Deep learning, GPU (50% less memory)
float64 | 8     | Default. Statistics, scientific computing
float16 | 2     | Mixed-precision inference
int32   | 4     | Indices, counts
int8    | 1     | Quantized models
bool    | 1     | Masks for filtering

6. np.einsum — One Function for All Tensor Ops

Einstein summation: express ANY tensor operation. Matrix multiply: 'ik,kj->ij'. Batch matmul: 'bij,bjk->bik'. Trace: 'ii->'. Often faster than chaining NumPy calls — avoids intermediate arrays.

7. Linear Algebra for ML Projects
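A minimal sketch of the np.linalg routines most ML work leans on (the shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
b = rng.standard_normal(4)

x = np.linalg.solve(A, b)            # solve Ax = b (prefer over inv(A) @ b)
assert np.allclose(A @ x, b)

U, s, Vt = np.linalg.svd(A)          # SVD: basis of PCA and pseudo-inverses
assert np.allclose(U @ np.diag(s) @ Vt, A)

assert np.isclose(np.linalg.norm([3.0, 4.0]), 5.0)   # Euclidean norm
```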

8. Random Number Generation

Modern: rng = np.random.default_rng(42) (NumPy 1.17+). PCG64 algorithm, thread-safe. Old np.random.seed(42) is global, not thread-safe. Always use default_rng() in projects.

9. Image Processing with NumPy

Images are just 3D arrays: (height, width, channels). Crop: img[100:200, 50:150]. Resize: scipy. Normalize: img / 255.0. Augment: flip img[:, ::-1], rotate with scipy.ndimage. Foundation of all computer vision.
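The same claims as array operations (a random fake image stands in for real pixels):

```python
import numpy as np

img = np.random.default_rng(0).integers(0, 256, size=(64, 48, 3),
                                        dtype=np.uint8)
assert img.shape == (64, 48, 3)            # (height, width, channels)

crop = img[16:32, 8:24]                    # spatial crop: plain slicing
assert crop.shape == (16, 16, 3)

flipped = img[:, ::-1]                     # horizontal flip (mirror)
assert (flipped[:, 0] == img[:, -1]).all()

norm = img / 255.0                         # scale to [0, 1] for models
assert 0.0 <= norm.min() and norm.max() <= 1.0
```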

`, code: `

💻 NumPy Project Code

1. Feature Engineering with Broadcasting

import numpy as np

# Z-score normalization
X = np.random.randn(1000, 5)
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)     # (1000,5) - (5,)

# Min-Max scaling
X_scaled = (X - X.min(0)) / (X.max(0) - X.min(0) + 1e-8)

# Pairwise Euclidean distance matrix
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]  # (N,1,D) - (1,N,D)
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))   # (N,N)

2. Boolean Masking & Advanced Indexing

# Remove outliers (3-sigma rule)
data = np.random.randn(10000)
clean = data[np.abs(data - data.mean()) < 3 * data.std()]

# np.where — conditional replacement
preds = np.array([0.3, 0.7, 0.1, 0.9])
labels = np.where(preds > 0.5, 1, 0)

# np.select — multiple conditions
conditions = [data < -1, data > 1]
choices = ['low', 'high']
category = np.select(conditions, choices, default='mid')

# Fancy indexing — sample without replacement
rng = np.random.default_rng(42)
idx = rng.choice(len(X), size=500, replace=False)
X_sample = X[idx]

3. einsum for Complex Operations

# Matrix multiply
C = np.einsum('ik,kj->ij', A, B)

# Batch matrix multiply (deep learning)
batch_result = np.einsum('bij,bjk->bik', batch_A, batch_B)

# Cosine similarity matrix
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_normed = X / norms
sim = np.einsum('ij,kj->ik', X_normed, X_normed)

4. Implement Linear Regression from Scratch

# Normal equation: w = (X^T X)^(-1) X^T y
# Better: use lstsq for numerical stability
X_b = np.c_[np.ones((len(X), 1)), X]   # add bias column
w, residuals, rank, sv = np.linalg.lstsq(X_b, y, rcond=None)
y_pred = X_b @ w
mse = ((y - y_pred) ** 2).mean()
r2 = 1 - ((y - y_pred)**2).sum() / ((y - y.mean())**2).sum()

5. Memory-Mapped Files for Huge Data

# Process arrays larger than RAM
big = np.memmap('huge.npy', dtype=np.float32, mode='w+',
                shape=(1000000, 100))
subset = big[5000:6000]   # only reads 1000 rows from disk

# Structured arrays — mixed types without Pandas
dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
data = np.array([('Alice', 30, 95.5)], dtype=dt)

6. Implement PCA from Scratch

def pca(X, n_components):
    # Center the data
    X_centered = X - X.mean(axis=0)
    # Covariance matrix
    cov = X_centered.T @ X_centered / (len(X) - 1)
    # Eigendecomposition (eigh: for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Sort by largest eigenvalue
    idx = eigenvalues.argsort()[::-1][:n_components]
    components = eigenvectors[:, idx]
    # Project data
    X_pca = X_centered @ components
    explained_var = eigenvalues[idx] / eigenvalues.sum()
    return X_pca, explained_var, components
`, interview: `

🎯 NumPy Interview Questions

Q1: Why is NumPy faster than Python lists?

Answer: (1) Contiguous memory (cache-friendly). (2) Compiled C loops. (3) SIMD instructions. Together: 50-100x speedup.

Q2: View vs copy?

Answer: Slicing = view (shares data). Fancy indexing = copy. Check: np.shares_memory(a, b). Views are dangerous: modifying view modifies original.

Q3: Broadcasting rules?

Answer: Right-to-left: dims must be equal or one is 1. (3,1) + (1,4) → (3,4). No memory copied. Gotcha: (3,) + (3,4) fails — reshape to (3,1).

Q4: axis=0 vs axis=1?

Answer: axis=0: operate down rows (collapse rows). axis=1: across columns (collapse columns). (100,5): mean(axis=0)→(5,). mean(axis=1)→(100,).

Q5: Implement PCA with NumPy?

Answer: Center, compute covariance, eigendecompose (eigh), sort by eigenvalue, project onto top-k eigenvectors. Or SVD directly.

Q6: np.dot vs @ vs einsum?

Answer: @: clean, broadcasts. np.dot: confusing for 3D+. einsum: most flexible, any tensor op. Use @ for readability.

Q7: How to handle NaN?

Answer: np.isnan() detects. np.nanmean() ignores NaN. Gotcha: NaN == NaN is False (IEEE 754).

Q8: C-order vs Fortran-order?

Answer: C: rows contiguous (default). Fortran: columns contiguous (LAPACK/BLAS). Iterate last axis for speed. Convert: np.asfortranarray().

` }, "pandas": { concepts: `

🐼 Pandas — Complete Deep Dive

⚡ DataFrame Internals — BlockManager
A DataFrame is NOT a 2D array. It uses a BlockManager — same-dtype columns stored in contiguous blocks. Column operations: fast (same block). Row iteration: slow (crosses blocks). This is why df.iterrows() is 100x slower than vectorized ops.
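A small sketch of the claim (relative speed varies; columns and sizes are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': np.arange(1000.0),
                   'qty': np.full(1000, 2.0)})

# Slow: iterrows() builds a Series per row and crosses blocks each step.
slow = [row['price'] * row['qty'] for _, row in df.iterrows()]

# Fast: one vectorized operation inside the float block.
fast = df['price'] * df['qty']
assert np.allclose(slow, fast)
```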

1. The Golden Rules

⚠️ 5 Rules That Prevent 90% of Pandas Bugs
1. Use .loc (label) and .iloc (position) — never chain indexing.
2. df.loc[0:5] includes 5. df.iloc[0:5] excludes 5.
3. df[mask]['col'] = x creates a copy. Use df.loc[mask, 'col'] = x.
4. df2 = df is NOT a copy. Use df2 = df.copy().
5. Always check df.dtypes and df.isna().sum() first.

2. GroupBy — Split-Apply-Combine

Most powerful Pandas operation. (1) Split → (2) Apply function → (3) Combine results. GroupBy is lazy — no computation until aggregation. Key methods:

Method      | Output Shape            | Use Case
agg()       | Reduced (one row/group) | Sum, mean, count per group
transform() | Same as input           | Fill with group mean, normalize within group
filter()    | Subset of groups        | Keep groups with N > 100
apply()     | Flexible                | Custom function per group

3. Pandas 2.0 — Major Changes

Feature        | Before (1.x)       | After (2.0+)
Backend        | NumPy only         | Apache Arrow option
Copy semantics | Confusing          | Copy-on-Write
String dtype   | object             | string[pyarrow] (faster)
Nullable types | NaN for everything | pd.NA (proper null)

4. Polars vs Pandas

Feature     | Pandas                      | Polars
Speed       | 1x                          | 5-50x (Rust)
Parallelism | Single-threaded             | Multi-threaded auto
API         | Eager                       | Lazy + Eager
Ecosystem   | Massive                     | Growing fast
Use when    | EDA, small-med data, legacy | Large data, production

5. Merge/Join Patterns

Method   | How                        | When
merge()  | SQL-style joins on columns | Combine tables on shared keys
join()   | Joins on index             | Index-based combining
concat() | Stack along axis           | Append rows/columns

Common pitfall: Merge produces more rows than expected = many-to-many join. Always check: len(merged) vs len(left).

6. Memory Optimization Strategies

Strategy                 | Savings  | When
Category dtype           | 90%+     | Few unique strings
Downcast numerics        | 50-75%   | int64 → int32/int16
Sparse arrays            | 80%+     | Mostly zeros/NaN
PyArrow backend          | 30-50%   | String-heavy data
Read only needed columns | Variable | usecols=['a','b']

7. Window Functions for Time Series

.rolling(N): fixed sliding window. .expanding(): cumulative. .ewm(span=N): exponentially weighted. All support .mean(), .std(), .apply(). Essential for: lag features, moving averages, volatility, Bollinger bands.
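The three window types side by side (a tiny illustrative series):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

roll = s.rolling(3).mean()       # fixed window: NaN until 3 values are seen
assert roll.iloc[2] == 2.0 and roll.iloc[4] == 4.0

exp = s.expanding().mean()       # cumulative mean over everything so far
assert exp.iloc[4] == 3.0

ewm = s.ewm(span=3).mean()       # recent values weighted more heavily
assert ewm.iloc[-1] > exp.iloc[-1]   # series is rising, so EWM tracks it closer
```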

8. Pivot Tables & Crosstab

df.pivot_table(values, index, columns, aggfunc) — summarize data by two categorical dimensions. pd.crosstab() — frequency table of two categorical columns. Essential for EDA and business reporting.

9. Method Chaining Pattern

Fluent API: .assign() instead of df['col']=. .pipe(func) for custom. .query('col > 5') for readable filters. No intermediate variables = cleaner, reproducible pipelines.

`, code: `

💻 Pandas Project Code

1. Complete Data Loading & Cleaning Pipeline

import pandas as pd
import numpy as np

def load_and_clean(path, config):
    """Production data loading pipeline."""
    df = (
        pd.read_csv(path,
                    usecols=config['columns'],
                    dtype=config.get('dtypes', None),
                    parse_dates=config.get('date_cols', []))
        .rename(columns=str.lower)
        .drop_duplicates()
        .assign(
            date=lambda df: pd.to_datetime(df['date']),
            revenue=lambda df: df['price'] * df['qty']
        )
        .query('revenue > 0')
        .pipe(optimize_dtypes)
    )
    return df

2. GroupBy — Beyond Basics

# Named aggregation
summary = df.groupby('category').agg(
    total=('revenue', 'sum'),
    avg_price=('price', 'mean'),
    n_orders=('order_id', 'nunique'),
    top_product=('product', lambda x: x.value_counts().index[0])
)

# Transform — normalize within groups
df['pct_of_group'] = df.groupby('cat')['rev'].transform(
    lambda x: x / x.sum() * 100
)

# Filter — keep only groups with enough data
df_filtered = df.groupby('user').filter(lambda x: len(x) >= 5)

3. Time Series Feature Engineering

def create_time_features(df, date_col, target_col):
    """Generate time series features for ML."""
    df = df.sort_values(date_col).copy()

    # Lag features
    for lag in [1, 3, 7, 14, 30]:
        df[f'lag_{lag}'] = df[target_col].shift(lag)

    # Rolling statistics
    for window in [7, 14, 30]:
        df[f'rolling_mean_{window}'] = df[target_col].rolling(window).mean()
        df[f'rolling_std_{window}'] = df[target_col].rolling(window).std()

    # Date features
    df['dayofweek'] = df[date_col].dt.dayofweek
    df['month'] = df[date_col].dt.month
    df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)

    # Percentage change
    df['pct_change'] = df[target_col].pct_change()
    return df

4. Memory Optimization

def optimize_dtypes(df):
    """Reduce DataFrame memory by 60-80%."""
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    for col in df.select_dtypes(['int']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(['float']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    for col in df.select_dtypes(['object']).columns:
        if df[col].nunique() / len(df) < 0.5:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Memory: {start_mem:.1f}MB → {end_mem:.1f}MB "
          f"({100*(1-end_mem/start_mem):.0f}% reduction)")
    return df

5. Merge with Validation

# LEFT JOIN with indicator for debugging
merged = pd.merge(orders, customers, on='customer_id',
                  how='left', indicator=True, validate='many_to_one')

# Check for orphan records
orphans = merged[merged['_merge'] == 'left_only']
print(f"Orphan orders: {len(orphans)}")

# Multi-key merge
result = pd.merge(df1, df2, on=['date', 'product_id'],
                  how='inner', suffixes=('_actual', '_predicted'))

6. Pivot Table for Business Reporting

# Revenue by month and category
pivot = df.pivot_table(
    values='revenue',
    index=df['date'].dt.to_period('M'),
    columns='category',
    aggfunc=['sum', 'count'],
    margins=True   # add totals row/column
)

# Crosstab — frequency of two categorical columns
ct = pd.crosstab(df['region'], df['product'], normalize='index')
`, interview: `

🎯 Pandas Interview Questions

Q1: SettingWithCopyWarning?

Answer: Chained indexing modifies copy. Fix: df.loc[mask, 'col'] = val. Pandas 2.0+ Copy-on-Write eliminates this.

Q2: merge vs join vs concat?

Answer: merge: SQL joins on columns. join: on index. concat: stack along axis. Use merge for column joins, concat for appending.

Q3: apply vs map vs transform?

Answer: map: Series element-wise. apply: rows/columns. transform: same-shape output. All slow — prefer vectorized when possible.

Q4: GroupBy transform vs agg?

Answer: agg reduces. transform broadcasts back. Use transform for "fill with group mean" or "normalize within group" patterns.

Q5: How to handle missing data?

Answer: (1) dropna(thresh=N), (2) .ffill() for time series (fillna(method='ffill') is deprecated), (3) fillna(df.median()) for ML, (4) interpolate(method='time'). Always check df.isna().sum() first.

Q6: Pandas vs Polars?

Answer: Polars: 5-50x faster (Rust), multi-threaded, lazy eval. Pandas: mature ecosystem, wide compatibility. New projects with big data → Polars.

Q7: What is MultiIndex?

Answer: Hierarchical indexing. Use for pivot tables, panel data. Access with .xs() or tuple. Reset with .reset_index().

Q8: How to optimize a 5GB DataFrame?

Answer: (1) Read only needed columns. (2) Downcast dtypes. (3) Category for strings. (4) Sparse for zeros. (5) PyArrow backend. (6) Process in chunks. Can reduce 5GB to 1GB.

` }, "visualization": { concepts: `

📊 Data Visualization — Complete Guide

⚡ The Grammar of Graphics
Data + Aesthetics (x, y, color, size) + Geometry (bars, lines, points) + Statistics (binning, smoothing) + Coordinates (cartesian, polar) + Facets (subplots). Every chart = this framework.

1. Choosing the Right Chart

Question          | Chart Type                   | Library
Distribution?     | Histogram, KDE, Box, Violin  | Seaborn
Relationship?     | Scatter, Hexbin, Regression  | Seaborn/Plotly
Comparison?       | Bar, Grouped bar, Violin     | Seaborn
Trend over time?  | Line, Area chart             | Plotly/Matplotlib
Correlation?      | Heatmap                      | Seaborn
Part of whole?    | Pie, Treemap, Sunburst       | Plotly
Geographic?       | Choropleth, Mapbox           | Plotly/Folium
High-dimensional? | Parallel coords, UMAP        | Plotly
ML results?       | Confusion matrix, ROC, SHAP  | Seaborn/SHAP

2. Matplotlib Architecture

Three layers: Backend (rendering), Artist (everything drawn), Scripting (pyplot). Figure → Axes (subplots) → Axis objects. Always use the OO API: fig, ax = plt.subplots().

rcParams: Global defaults. plt.rcParams['font.size'] = 14. Create style files for project consistency. plt.style.use('seaborn-v0_8-whitegrid').

3. Color Theory for Data

💡 Color Guide
Sequential: viridis, plasma (low→high).
Diverging: RdBu, coolwarm (center matters).
Categorical: Set2, tab10 (distinct groups).
Never use rainbow/jet — bad for colorblind, perceptually non-uniform.

4. Seaborn — Statistical Visualization

Three API levels: Figure-level (relplot, catplot, displot), Axes-level (scatterplot, boxplot), Objects API (0.12+). Auto-computes regression lines, confidence intervals, density estimates.

5. Plotly β€” Interactive Dashboards

JavaScript-powered: hover, zoom, selection. plotly.express for quick plots. plotly.graph_objects for control. Integrates with Dash for production dashboards. Supports 3D, maps, animations. Export to HTML.

6. Visualization for ML Projects

What to Visualize     | Chart                         | Why
Class distribution    | Bar chart                     | Detect imbalance
Feature distributions | Histogram/KDE grid            | Find skew, outliers
Feature correlations  | Heatmap (triangular)          | Multicollinearity
Training curves       | Line plot (loss/acc vs epoch) | Detect overfit/underfit
Model comparison      | Box plot of CV scores         | Compare variance
Confusion matrix      | Annotated heatmap             | Error analysis
ROC curve             | Line plot + AUC               | Threshold selection
Feature importance    | Horizontal bar                | Model interpretation
SHAP values           | Beeswarm/waterfall            | Individual predictions

7. Common Mistakes

`, code: `

💻 Visualization Project Code

1. Publication-Quality Multi-Subplot Figure

import matplotlib.pyplot as plt
import numpy as np

# Assumes arrays data, x, y, z, y_mean, y_std, categories, values, errors exist

# Professional style setup
plt.rcParams.update({
    'font.size': 12,
    'axes.titlesize': 14,
    'figure.facecolor': 'white',
    'axes.spines.top': False,
    'axes.spines.right': False
})

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Distribution
axes[0, 0].hist(data, bins=30, alpha=0.7, color='steelblue', edgecolor='white')
axes[0, 0].axvline(data.mean(), color='red', linestyle='--', label='Mean')
axes[0, 0].set_title('Distribution')

# Scatter with colormap
sc = axes[0, 1].scatter(x, y, c=z, cmap='viridis', alpha=0.7)
plt.colorbar(sc, ax=axes[0, 1])

# Line with confidence interval
axes[1, 0].plot(x, y_mean, 'b-', linewidth=2)
axes[1, 0].fill_between(x, y_mean - y_std, y_mean + y_std, alpha=0.2)

# Bar with error bars
axes[1, 1].bar(categories, values, yerr=errors, capsize=5, color='coral')

plt.tight_layout()
plt.savefig('figure.png', dpi=300, bbox_inches='tight')

2. ML Evaluation Dashboard

import seaborn as sns
from sklearn.metrics import confusion_matrix, roc_curve, auc

def plot_model_evaluation(y_true, y_pred, y_proba, model, feature_names):
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    # Confusion Matrix
    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0])
    axes[0].set_title('Confusion Matrix')
    # ROC Curve
    fpr, tpr, _ = roc_curve(y_true, y_proba)
    axes[1].plot(fpr, tpr, label=f'AUC={auc(fpr, tpr):.3f}')
    axes[1].plot([0, 1], [0, 1], 'k--')
    axes[1].set_title('ROC Curve')
    axes[1].legend()
    # Feature Importance
    importance = model.feature_importances_
    idx = importance.argsort()
    axes[2].barh(feature_names[idx], importance[idx])
    axes[2].set_title('Feature Importance')
    plt.tight_layout()

3. Seaborn — EDA in One Call

# Pair plot — all relationships at once
sns.pairplot(df, hue='target', diag_kind='kde', plot_kws={'alpha': 0.6})

# Correlation heatmap (upper triangle masked)
mask = np.triu(np.ones_like(df.corr(), dtype=bool))
sns.heatmap(df.corr(), mask=mask, annot=True, fmt='.2f', cmap='RdBu_r', center=0)

4. Plotly — Interactive Dashboard

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Animated scatter (Gapminder style)
fig = px.scatter(df, x='gdp', y='life_exp', animation_frame='year',
                 size='pop', color='continent', hover_name='country')

# Training curves dashboard
fig = make_subplots(rows=1, cols=2, subplot_titles=['Loss', 'Accuracy'])
fig.add_trace(go.Scatter(y=train_loss, name='Train Loss'), row=1, col=1)
fig.add_trace(go.Scatter(y=val_loss, name='Val Loss'), row=1, col=1)
fig.add_trace(go.Scatter(y=train_acc, name='Train Acc'), row=1, col=2)
fig.add_trace(go.Scatter(y=val_acc, name='Val Acc'), row=1, col=2)
fig.write_html('training_dashboard.html')
`, interview: `

🎯 Visualization Interview Questions

Q1: Matplotlib vs Seaborn vs Plotly?

Answer: Matplotlib: full control, papers. Seaborn: statistical EDA, beautiful defaults. Plotly: interactive, stakeholders. Rule: Seaborn → EDA, Matplotlib → papers, Plotly → stakeholders.

Q2: How to visualize high-dimensional data?

Answer: (1) PCA/t-SNE/UMAP to 2D, (2) Pair plots, (3) Parallel coordinates, (4) Correlation heatmap, (5) SHAP plots.

Q3: Handle overplotting?

Answer: alpha, hexbin, 2D KDE, random sampling, Datashader for millions of points.

Q4: Good viz for non-technical audience?

Answer: Title states conclusion. One insight per chart. Annotate key points. Consistent color. Minimal chart junk. Tell a story.

Q5: Figure vs Axes?

Answer: Figure = canvas. Axes = plot area. fig, axes = plt.subplots(2,2). Use OO API: ax.plot() not plt.plot().

Q6: Accessible visualizations?

Answer: Colorblind palettes (viridis), shapes not just color, sufficient contrast, alt text, 12pt+ fonts.

Q7: How to visualize model performance?

Answer: Training curves (loss/acc vs epoch), confusion matrix (heatmap), ROC/AUC, feature importance (horizontal bars), SHAP for interpretability.

` }, "advanced-python": { concepts: `

🎯 Advanced Python β€” Complete Engineering Guide

1. Decorators β€” Complete Patterns

⚡ Three Levels of Decorators
Level 1: Simple wrapper (timing, logging). Level 2: With arguments (factory). Level 3: Class-based with state. Always use functools.wraps.

Common patterns: Retry with exponential backoff, caching, rate limiting, authentication, input validation, deprecation warnings.

2. Context Managers

Guarantee resource cleanup. Two approaches: (1) Class-based (__enter__/__exit__), (2) @contextlib.contextmanager with yield. Use for: files, DB connections, GPU locks, temporary settings, timers.

3. Dataclasses vs namedtuple vs Pydantic vs attrs

Feature | namedtuple | dataclass | Pydantic | attrs
Mutable | ✗ | ✓ | ✓ (v2) | ✓
Validation | ✗ | ✗ | ✓ (auto) | ✓ (validators)
JSON | ✗ | ✗ | ✓ (built-in) | via cattrs
Performance | Fastest | Fast | Medium | Fast
Use for | Records | Data containers | API models | Complex classes

4. Type Hints β€” Complete Guide

🎯 Why Type Hints Matter for Projects
Enable: IDE autocompletion, mypy static analysis, self-documenting code, Pydantic validation. Python doesn't enforce at runtime β€” they're for tools and humans.
list[int] — list of ints (3.9+), e.g. scores: list[int] = []
dict[str, Any] — dict with str keys, e.g. config: dict[str, Any]
int | None — optional value (3.10+ union syntax), e.g. x: int | None = None
Callable[[int], str] — function type, e.g. callbacks
TypeVar — generics, e.g. generic containers
Literal — exact allowed values, e.g. Literal['train', 'test']
TypedDict — dict with typed keys, e.g. JSON schemas
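A quick sketch of Literal and TypedDict together (RunConfig and mean_score are illustrative names, not a library API):

from typing import Literal, TypedDict

class RunConfig(TypedDict):
    # JSON-style config with typed keys
    name: str
    split: Literal['train', 'test']
    scores: list[float]

def mean_score(cfg: RunConfig) -> float:
    return sum(cfg['scores']) / len(cfg['scores'])

cfg: RunConfig = {'name': 'baseline', 'split': 'train', 'scores': [0.5, 1.0]}
print(mean_score(cfg))  # 0.75

TypedDict adds no runtime checks; mypy flags a misspelled key or a split value outside 'train'/'test'.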

5. async/await β€” Concurrent I/O

For I/O-bound tasks: API calls, DB queries, file reads. NOT for CPU (use multiprocessing). Event loop manages coroutines cooperatively. asyncio.gather() runs concurrently. Game changer: 100 API calls in ~1s vs 100s sequentially.

6. Design Patterns for ML Projects

Pattern | Use Case | Python Implementation
Strategy | Swap algorithms | Pass function/class as argument
Factory | Create objects by name | Registry dict: models['rf']
Observer | Training callbacks | Event system with hooks
Pipeline | Data transformations | Chain of fit → transform
Singleton | Model cache, DB pool | Module-level or metaclass
Template | Training loop | ABC with abstract methods
Registry | Auto-register models | Class decorator + dict

7. Descriptors — How @property Works

Any object implementing __get__/__set__/__delete__. @property is a descriptor. Control attribute access at class level. Used in Django ORM, SQLAlchemy, dataclass fields.
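A minimal sketch of the protocol (Positive and Hyperparams are illustrative names): one descriptor class gives validated attributes to any number of classes, which is exactly what @property does underneath.

class Positive:
    """Descriptor enforcing positive values, reusable across classes."""
    def __set_name__(self, owner, name):
        self.name = name
    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.__dict__[self.name]
    def __set__(self, obj, value):
        if value <= 0:
            raise ValueError(f"{self.name} must be positive")
        obj.__dict__[self.name] = value

class Hyperparams:
    lr = Positive()
    batch_size = Positive()
    def __init__(self, lr, batch_size):
        self.lr = lr          # triggers Positive.__set__
        self.batch_size = batch_size

Hyperparams(-0.1, 32) raises ValueError before the bad value is ever stored.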

8. Metaclasses — Advanced

Classes are objects. Metaclasses define how classes behave. type is the default. Use for: auto-registration, interface enforcement, singleton. Most should use class decorators instead.
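A sketch of the auto-registration use case (RegistryMeta and the model names are illustrative); note that a class decorator achieves the same with less machinery:

class RegistryMeta(type):
    """Metaclass that auto-registers every concrete subclass by name."""
    registry = {}
    def __new__(mcls, name, bases, ns):
        cls = super().__new__(mcls, name, bases, ns)
        if bases:  # skip the abstract base itself
            mcls.registry[name.lower()] = cls
        return cls

class Model(metaclass=RegistryMeta):
    pass

class RandomForest(Model): ...
class XGBoost(Model): ...

# Subclasses registered themselves just by being defined
print(sorted(RegistryMeta.registry))  # ['randomforest', 'xgboost']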

9. __slots__ for Memory Efficiency

Replaces __dict__ with fixed array. ~40% memory savings per instance. Use for millions of small objects. Trade-off: no dynamic attributes.
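A toy sketch of the trade-off (Point and SlotPoint are illustrative):

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

class SlotPoint:
    __slots__ = ('x', 'y')  # fixed attribute set, no per-instance __dict__
    def __init__(self, x, y):
        self.x, self.y = x, y

p, sp = Point(1, 2), SlotPoint(1, 2)
print(hasattr(p, '__dict__'), hasattr(sp, '__dict__'))  # True False

Assigning sp.z raises AttributeError; that lost flexibility is the price of the memory savings.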

10. Multiprocessing for CPU-Bound Work

multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor. Each process has its own GIL. Share data via: multiprocessing.Queue, shared_memory, or serialize (pickle). Overhead: process creation ~100ms. Only use for expensive computations.
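A minimal sketch with ProcessPoolExecutor (cpu_heavy is a stand-in for real work; the __main__ guard matters because worker processes re-import the module):

from concurrent.futures import ProcessPoolExecutor
import math

def cpu_heavy(n):
    # Stand-in for an expensive, pure-CPU computation
    return sum(math.sqrt(i) for i in range(n))

if __name__ == '__main__':
    # Each worker is a separate process with its own GIL
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(cpu_heavy, [200_000] * 4))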

`, code: `

💻 Advanced Python Project Code

1. Production Decorator — Retry with Backoff

from functools import wraps
import time, logging
import requests

def retry(max_attempts=3, delay=1.0, exceptions=(Exception,)):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    if attempt == max_attempts - 1:
                        raise
                    wait = delay * (2 ** attempt)
                    logging.warning(f"Retry {attempt+1}/{max_attempts}: {e}, waiting {wait}s")
                    time.sleep(wait)
        return wrapper
    return decorator

@retry(max_attempts=3, delay=0.5)
def fetch_data(url):
    return requests.get(url, timeout=10).json()

2. Dataclass for ML Experiments

from dataclasses import dataclass, field, asdict
from datetime import datetime
import json

@dataclass
class Experiment:
    name: str
    model: str
    lr: float = 0.001
    epochs: int = 100
    batch_size: int = 32
    tags: list[str] = field(default_factory=list)
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
    metrics: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.lr <= 0:
            raise ValueError("lr must be positive")

    def save(self, path):
        with open(path, 'w') as f:
            json.dump(asdict(self), f, indent=2)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(**json.load(f))

3. Model Registry Pattern

MODEL_REGISTRY = {}

def register_model(name):
    def decorator(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator

@register_model("random_forest")
class RandomForestModel:
    def train(self, X, y): ...

@register_model("xgboost")
class XGBoostModel:
    def train(self, X, y): ...

# Create model by name from config (config: any dict with a "model_name" key)
model = MODEL_REGISTRY[config["model_name"]]()

4. async — Parallel API Calls

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.json()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

# 100 API calls in ~1 second instead of ~100 seconds
results = asyncio.run(fetch_all(urls))  # urls: list of endpoint strings

5. Pydantic for API Data Validation

import numpy as np
from pydantic import BaseModel, Field, field_validator

class PredictionRequest(BaseModel):
    features: list[float] = Field(..., min_length=1)
    model_name: str = "default"
    threshold: float = Field(0.5, ge=0, le=1)

    @field_validator('features')
    @classmethod
    def check_features(cls, v):
        if any(np.isnan(x) for x in v):
            raise ValueError("NaN not allowed")
        return v

# Auto-validates on creation
req = PredictionRequest(features=[1.0, 2.0, 3.0])

6. Context Manager — Timer & GPU Lock

from contextlib import contextmanager
import time

@contextmanager
def timer(name="Block"):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{name}: {elapsed:.4f}s")

with timer("Training"):
    model.fit(X_train, y_train)
`, interview: `

🎯 Advanced Python Interview Questions

Q1: Explain MRO.

Answer: C3 Linearization for multiple inheritance. ClassName.mro() shows order. Subclasses before bases, left-to-right.

Q2: dataclass vs Pydantic?

Answer: dataclass: no validation, fast, standard library. Pydantic: auto-validation, JSON serialization, API models. Use Pydantic for external data, dataclass for internal.

Q3: When async vs threading vs multiprocessing?

Answer: async: I/O-bound, 1000s connections. threading: I/O, simpler. multiprocessing: CPU-bound (bypasses GIL). NumPy releases GIL internally.

Q4: How does @property work?

Answer: It's a descriptor with __get__/__set__. Attribute access triggers descriptor protocol. Used for computed attributes and validation.

Q5: Decorator with parameters?

Answer: Three nested functions: factory(params) → decorator(func) → wrapper(*args). Always use @wraps(func).

Q6: What is __slots__?

Answer: Fixed array instead of __dict__. ~40% less memory. No dynamic attributes. Use for millions of objects.

Q7: Explain closures with use case.

Answer: Function capturing enclosing scope variables. Use: factory functions, decorators, callbacks. make_multiplier(3) returns function multiplying by 3.

Q8: Design patterns in Python vs Java?

Answer: Python makes many patterns trivial: Strategy = pass a function. Singleton = module variable. Factory = dict of classes. Observer = list of callables. Python prefers simplicity.

` }, "sklearn": { concepts: `

🤖 Scikit-learn — Complete ML Engineering

⚡ The Estimator API
Estimators: fit(X, y). Transformers: transform(X). Predictors: predict(X). Consistency allows seamless swapping and composition via Pipelines.

1. Pipelines — The Foundation of Production ML

⚠️ Data Leakage — The #1 ML Mistake
Fitting scaler on ENTIRE dataset before split = test set info leaks into training. Fix: put ALL preprocessing inside Pipeline. Pipeline ensures fit only on training folds during CV.
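A NumPy-only sketch of the leak (synthetic data): scaling the test split with full-dataset statistics produces different values than scaling with train-only statistics, and that difference is exactly the leaked information.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(5.0, 2.0, size=1000)
train, test = data[:800], data[800:]

# WRONG: statistics computed on the full dataset have seen the test split
leaky_mean, leaky_std = data.mean(), data.std()
# RIGHT: statistics fitted on the training split only
train_mean, train_std = train.mean(), train.std()

leaky_test = (test - leaky_mean) / leaky_std
clean_test = (test - train_mean) / train_std
print(np.abs(leaky_test - clean_test).max())  # nonzero: the leak is measurable

A Pipeline makes the RIGHT version automatic: fit() only ever sees the training fold.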

2. ColumnTransformer — Real-World Data

Real data has mixed types. ColumnTransformer applies different transformations per column set: StandardScaler for numerics, OneHotEncoder for categoricals, TfidfVectorizer for text. All in one pipeline.

3. Custom Transformers

Inherit BaseEstimator + TransformerMixin. Implement fit(X, y) and transform(X). TransformerMixin gives fit_transform() free. Use check_is_fitted() for safety.

4. Cross-Validation Strategies

Strategy | When | Key Point
KFold | General | Doesn't preserve class ratios
StratifiedKFold | Imbalanced classification | Preserves class distribution
TimeSeriesSplit | Time-ordered data | Train always before test
GroupKFold | Grouped data (patients) | Same group never in train+test
RepeatedStratifiedKFold | Robust estimation | Multiple random splits

5. Hyperparameter Tuning

Method | Pros | Cons
GridSearchCV | Exhaustive | Exponential with params
RandomizedSearchCV | Faster, continuous dists | May miss optimal
Optuna | Smart search, pruning | Extra dependency
HalvingSearchCV | Successive halving | Newer, less documentation

6. Complete ML Workflow

🎯 The Steps
1. EDA → 2. Train/Val/Test split → 3. Build Pipeline (preprocess + model) → 4. Cross-validate multiple models → 5. Select best → 6. Tune hyperparameters → 7. Final evaluation on test set → 8. Save model → 9. Deploy

7. Feature Engineering

Transformer | Purpose
PolynomialFeatures | Interaction & polynomial terms
FunctionTransformer | Apply any function (log, sqrt)
SplineTransformer | Non-linear feature basis
KBinsDiscretizer | Bin continuous into categories
TargetEncoder | Encode categoricals by target mean

8. Model Selection Guide

Data Size | Model | Why
<1K rows | Logistic/SVM/KNN | Simple, less overfitting
1K-100K | Random Forest, XGBoost | Best accuracy/speed tradeoff
100K+ | XGBoost, LightGBM | Handles large data efficiently
Very large | SGDClassifier/online | Incremental learning
Tabular | Gradient Boosting | Almost always best for tabular

9. Handling Imbalanced Data

Strategy | How
class_weight='balanced' | Built-in for most models
SMOTE | Synthetic oversampling (imblearn)
Threshold tuning | Adjust decision threshold from 0.5
Metrics | Use F1, Precision-Recall AUC (not accuracy)
Ensemble | BalancedRandomForest
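Threshold tuning, sketched with toy labels and probabilities: the default 0.5 cutoff misses moderate-confidence positives that an F1-chosen threshold catches.

import numpy as np

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])   # 30% positives
y_proba = np.array([0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.42, 0.45, 0.9])

def f1_at(threshold):
    y_pred = (y_proba >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Sweep thresholds and keep the F1-maximizing one
best = max(np.linspace(0.1, 0.9, 81), key=f1_at)
# With these toy numbers, a cutoff below 0.42 recovers all three positives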

10. Model Persistence

joblib.dump(model, 'model.pkl') — faster than pickle for NumPy arrays. model = joblib.load('model.pkl'). Always save the entire pipeline (not just the model) so preprocessing is included. Version your models with timestamps.

`, code: `

💻 Scikit-learn Project Code

1. Production Pipeline — Complete Template

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), make_column_selector(dtype_include='number')),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ]), make_column_selector(dtype_include='object'))
])

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, n_jobs=-1))
])

# No data leakage!
scores = cross_val_score(pipe, X, y, cv=5, scoring='f1')
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")

2. Custom Transformer

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted

class OutlierClipper(BaseEstimator, TransformerMixin):
    def __init__(self, factor=1.5):
        self.factor = factor

    def fit(self, X, y=None):
        Q1 = np.percentile(X, 25, axis=0)
        Q3 = np.percentile(X, 75, axis=0)
        IQR = Q3 - Q1
        self.lower_ = Q1 - self.factor * IQR
        self.upper_ = Q3 + self.factor * IQR
        return self

    def transform(self, X):
        check_is_fitted(self)
        return np.clip(X, self.lower_, self.upper_)

3. Model Comparison Framework

import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

models = {
    'Logistic': LogisticRegression(),
    'RF': RandomForestClassifier(n_estimators=100),
    'XGBoost': XGBClassifier(n_estimators=100),
    'LightGBM': LGBMClassifier(n_estimators=100)
}

results = {}
for name, model in models.items():
    pipe = Pipeline([('prep', preprocessor), ('model', model)])
    cv = cross_validate(pipe, X, y, cv=5,
                        scoring=['accuracy', 'f1', 'roc_auc'], n_jobs=-1)
    results[name] = {k: v.mean() for k, v in cv.items()}
    print(f"{name}: F1={cv['test_f1'].mean():.3f}")

pd.DataFrame(results).T.sort_values('test_f1', ascending=False)

4. Hyperparameter Tuning with Optuna

import optuna

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0)
    }
    model = XGBClassifier(**params)
    return cross_val_score(model, X, y, cv=5, scoring='f1').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(f"Best F1: {study.best_value:.3f}")
print(f"Best params: {study.best_params}")

5. Save & Load Pipeline

import joblib
from datetime import datetime

# Save the entire pipeline (includes preprocessing!)
version = datetime.now().strftime('%Y%m%d_%H%M')
joblib.dump(pipe, f'models/pipeline_{version}.pkl')

# Load and predict
pipe = joblib.load('models/pipeline_20240315_1430.pkl')
predictions = pipe.predict(new_data)  # Preprocessing included!
`, interview: `

🎯 Scikit-learn Interview Questions

Q1: What is data leakage?

Answer: Test set info influencing training. Common: fitting scaler before split. Fix: Pipeline ensures fit only on train folds.

Q2: Pipeline vs ColumnTransformer?

Answer: Pipeline: sequential (A → B → C). ColumnTransformer: parallel branches (different processing per column type). Usually a ColumnTransformer inside a Pipeline.

Q3: Which cross-validation when?

Answer: KFold: general. Stratified: imbalanced. TimeSeriesSplit: temporal. GroupKFold: grouped data.

Q4: Grid vs Random vs Bayesian?

Answer: Grid: exhaustive, exponential. Random: better for many params. Bayesian (Optuna): learns, most efficient for expensive models.

Q5: Custom transformer?

Answer: BaseEstimator + TransformerMixin. Implement fit(X,y) and transform(X). TransformerMixin gives fit_transform free.

Q6: How to handle imbalanced data?

Answer: (1) class_weight='balanced'. (2) SMOTE oversampling. (3) Adjust threshold. (4) Use F1/AUC not accuracy. (5) BalancedRandomForest.

Q7: When to use which model?

Answer: Tabular: gradient boosting (XGBoost/LightGBM). Small data: Logistic/SVM. Interpretability: Logistic/trees. Speed: LightGBM. Baseline: Random Forest.

Q8: fit() vs transform() vs predict()?

Answer: fit: learn params from data. transform: apply params. predict: generate predictions. fit on train only, transform/predict on both.

` }, "pytorch": { concepts: `

🔥 Deep Learning with PyTorch — Complete Guide

⚑ PyTorch Philosophy: Define-by-Run
PyTorch builds the computational graph dynamically as operations execute (eager mode). Debug with print(), breakpoints, standard Python control flow.

1. Tensors β€” The Foundation

Concept | What | Key Point
Tensor | N-dimensional array | Like NumPy but GPU-capable
requires_grad | Track for autograd | Only for learnable params
device | CPU or CUDA | .to('cuda') moves to GPU
.detach() | Stop gradient tracking | Use for inference/metrics
.item() | Extract scalar | Use for logging loss
.contiguous() | Ensure contiguous memory | Required after transpose/permute

2. Autograd — How Backpropagation Works

🧠 Computational Graph (DAG)
When requires_grad=True, every operation is recorded. Each tensor stores grad_fn. .backward() traverses the graph in reverse (chain rule). The graph is destroyed after backward() unless retain_graph=True. Gradients ACCUMULATE — call optimizer.zero_grad() before each backward.
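Both behaviors in a few lines (assumes PyTorch is installed):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3 * x          # dy/dx = 2x + 3, which is 7 at x = 2
y.backward()
print(x.grad)             # tensor(7.)

y2 = x**2 + 3 * x         # fresh graph, same leaf tensor
y2.backward()             # gradients ACCUMULATE into x.grad
print(x.grad)             # tensor(14.) until the gradient is zeroed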

3. nn.Module — Building Blocks

Every model inherits nn.Module. Layers in __init__, computation in forward(). model.train()/model.eval() toggle BatchNorm/Dropout. model.parameters() for optimizer. model.state_dict() for save/load. Use nn.Sequential for simple stacks, nn.ModuleList/nn.ModuleDict for dynamic architectures.

4. Training Loop — The Standard Pattern

(1) Forward pass → (2) Compute loss → (3) optimizer.zero_grad() → (4) loss.backward() → (5) optimizer.step(). Add: gradient clipping, LR scheduling, mixed precision, logging, checkpointing.

5. Custom Datasets & DataLoaders

Dataset: override __len__ and __getitem__. DataLoader: batching, shuffling, multi-worker. num_workers>0 for parallel loading. pin_memory=True for faster GPU transfer. Use collate_fn for variable-length sequences.

6. Learning Rate Scheduling

Scheduler | Strategy | When
StepLR | Decay every N epochs | Simple baseline
CosineAnnealingLR | Cosine decay | Standard for vision
OneCycleLR | Warmup + decay | Best for fast training
ReduceLROnPlateau | Decay on stall | When loss plateaus
LinearLR | Linear warmup | Transformer models

7. Mixed Precision Training (AMP)

torch.cuda.amp: forward in float16 (2x faster), gradients in float32. GradScaler prevents underflow. 2-3x speedup. Standard practice for any GPU training.

8. Transfer Learning Patterns

Load pretrained → Freeze base → Replace head → Fine-tune with a smaller LR. Discriminative LR: lower LR for earlier layers. Progressive unfreezing: unfreeze layers one at a time. Both work better than fine-tuning everything at once.

9. Distributed Training (DDP)

DistributedDataParallel: each GPU runs model copy, gradients averaged via all-reduce. Near-linear scaling. Use torchrun to launch. DistributedSampler for data splitting.

10. Debugging & Profiling

Tool | Purpose
register_forward_hook | View intermediate activations
register_full_backward_hook | Monitor gradient magnitudes
torch.profiler | GPU/CPU profiling
torch.cuda.memory_summary() | GPU memory debugging
detect_anomaly() | Find NaN/Inf sources

11. torch.compile (2.x)

JIT compiles model for 30-60% speedup. model = torch.compile(model). Uses TorchDynamo + Triton. Works on existing code. The future of PyTorch performance.

`, code: `

💻 PyTorch Project Code

1. Complete Training Framework

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

class Trainer:
    def __init__(self, model, optimizer, criterion, device='cuda'):
        self.model = model.to(device)
        self.optimizer = optimizer
        self.criterion = criterion
        self.device = device
        self.history = {'train_loss': [], 'val_loss': []}

    def train_epoch(self, loader):
        self.model.train()
        total_loss = 0
        for X, y in loader:
            X, y = X.to(self.device), y.to(self.device)
            self.optimizer.zero_grad()
            loss = self.criterion(self.model(X), y)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
            self.optimizer.step()
            total_loss += loss.item() * len(X)
        return total_loss / len(loader.dataset)

    @torch.no_grad()
    def evaluate(self, loader):
        self.model.eval()
        total_loss = 0
        for X, y in loader:
            X, y = X.to(self.device), y.to(self.device)
            total_loss += self.criterion(self.model(X), y).item() * len(X)
        return total_loss / len(loader.dataset)

    def fit(self, train_loader, val_loader, epochs, patience=5):
        best_loss = float('inf')
        wait = 0
        for epoch in range(epochs):
            train_loss = self.train_epoch(train_loader)
            val_loss = self.evaluate(val_loader)
            self.history['train_loss'].append(train_loss)
            self.history['val_loss'].append(val_loss)
            print(f"Epoch {epoch+1}: train={train_loss:.4f} val={val_loss:.4f}")
            if val_loss < best_loss:
                best_loss = val_loss
                torch.save(self.model.state_dict(), 'best_model.pt')
                wait = 0
            else:
                wait += 1
                if wait >= patience:
                    print("Early stopping!")
                    break

2. Custom Dataset for Any Tabular Data

class TabularDataset(torch.utils.data.Dataset):
    def __init__(self, df, target, cat_cols=None, num_cols=None):
        self.target = torch.FloatTensor(df[target].values)
        self.num = torch.FloatTensor(df[num_cols].values) if num_cols else None
        self.cat = torch.LongTensor(df[cat_cols].values) if cat_cols else None

    def __len__(self):
        return len(self.target)

    def __getitem__(self, idx):
        x = {}
        if self.num is not None:
            x['num'] = self.num[idx]
        if self.cat is not None:
            x['cat'] = self.cat[idx]
        return x, self.target[idx]

3. Mixed Precision + Gradient Accumulation

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
accum_steps = 4  # Effective batch = batch_size × 4

for i, (X, y) in enumerate(loader):
    with autocast():
        loss = criterion(model(X.cuda()), y.cuda()) / accum_steps
    scaler.scale(loss).backward()
    if (i + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

4. Transfer Learning

import torchvision.models as models

model = models.resnet50(weights='IMAGENET1K_V2')
model.requires_grad_(False)  # Freeze all
model.fc = nn.Sequential(
    nn.Dropout(0.3),
    nn.Linear(2048, 512),
    nn.ReLU(),
    nn.Linear(512, num_classes)
)

# Discriminative LR: lower for pretrained layers, higher for the new head
optimizer = torch.optim.AdamW([
    {'params': model.layer4.parameters(), 'lr': 1e-5},
    {'params': model.fc.parameters(), 'lr': 1e-3}
])

5. Model Save/Load Best Practices

# Save everything needed to resume training
checkpoint = {
    'epoch': epoch,
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
    'scheduler_state': scheduler.state_dict(),
    'best_loss': best_loss,
    'config': config
}
torch.save(checkpoint, 'checkpoint.pt')

# Resume training
ckpt = torch.load('checkpoint.pt', map_location=device)
model.load_state_dict(ckpt['model_state'])
optimizer.load_state_dict(ckpt['optimizer_state'])
`, interview: `

🎯 PyTorch Interview Questions

Q1: How does autograd work?

Answer: Records ops in DAG. .backward() traverses reverse, chain rule. Graph destroyed after backward. Dynamic = rebuilt each forward.

Q2: Why zero_grad()?

Answer: Gradients accumulate. Without zeroing, previous batch adds to current. Intentional: enables gradient accumulation for larger effective batch.

Q3: .detach() vs torch.no_grad()?

Answer: detach(): single tensor, shares data. no_grad(): context manager for all ops inside, saves memory. Use no_grad() for inference.

Q4: How to debug vanishing gradients?

Answer: (1) Backward hooks for gradient magnitudes. (2) clip_grad_norm_. (3) TensorBoard histograms. (4) BatchNorm/LayerNorm. (5) Skip connections.

Q5: DataLoader num_workers?

Answer: Rule of thumb: 4 × num_gpus. Too many = CPU overhead. pin_memory=True for faster transfers. Profile to find the sweet spot.

Q6: torch.compile vs eager?

Answer: compile JITs model via TorchDynamo+Triton. 30-60% faster. One line change. The future of PyTorch performance.

Q7: How to save/load models?

Answer: state_dict (weights only) vs full checkpoint (weights + optimizer + epoch). Use state_dict for inference, checkpoint for resuming.

Q8: Mixed precision — how and why?

Answer: autocast(fp16 forward) + GradScaler(fp32 grads). 2-3x speedup. Minimal accuracy loss. Standard for GPU training.

` }, "tensorflow": { concepts: `

🧠 TensorFlow & Keras β€” Complete Guide

⚡ TF2 = Eager by Default + @tf.function for Speed
TF2 defaults to eager mode (like PyTorch). @tf.function compiles to graph for production. Keras is the official API. TF handles the full lifecycle: train → save → serve → monitor.

1. Three Model APIs

API | Use Case | Flexibility
Sequential | Linear stack | Low
Functional | Multi-input/output, branching | Medium (recommended)
Subclassing | Custom forward logic | High

2. tf.data Pipeline

Chains transformations lazily. .map(), .batch(), .shuffle(), .prefetch(AUTOTUNE). Prefetching overlaps loading with GPU execution. .cache() for small datasets. .interleave() for reading multiple files. TFRecord format for large datasets.

3. Callbacks — Training Hooks

Callback | Purpose
ModelCheckpoint | Save best model
EarlyStopping | Stop when metric plateaus
ReduceLROnPlateau | Reduce LR when stuck
TensorBoard | Visualize metrics
CSVLogger | Log to CSV
LambdaCallback | Custom per-epoch logic

4. GradientTape — Custom Training

Record ops → compute gradients → apply. Use for: GANs, RL, custom losses, gradient penalty, multi-loss weighting. Same concept as PyTorch's manual loop.

5. @tf.function — Production Speed

Traces Python → TF graph. Benefits: optimized execution, XLA, export. Gotcha: Python side effects run only during tracing. Use tf.print() inside graphs.

6. SavedModel — Universal Deployment

model.save('path') exports architecture + weights + computation. Ready for: TF Serving (production), TF Lite (mobile), TF.js (browser). One model, any platform.

7. Keras Tuner — Automated Hyperparameter Search

Write a model-building function → the Tuner searches the space. Strategies: Random, Hyperband, Bayesian. Integrates with TensorBoard. Alternative to Optuna for Keras models.

8. TF vs PyTorch — Decision Guide

Choose TF When | Choose PyTorch When
Production deployment at scale | Research & prototyping
Mobile (TFLite mature) | Hugging Face ecosystem
TPU training | GPU research
Edge devices | Custom architectures
Browser (TF.js) | Academic papers
`, code: `

💻 TensorFlow Project Code

1. Functional API — Multi-Input Model

import tensorflow as tf
from tensorflow import keras

text_input = keras.Input(shape=(100,), name='text')
num_input = keras.Input(shape=(5,), name='features')

x1 = keras.layers.Embedding(10000, 64)(text_input)
x1 = keras.layers.GlobalAveragePooling1D()(x1)
x2 = keras.layers.Dense(32, activation='relu')(num_input)

combined = keras.layers.Concatenate()([x1, x2])
x = keras.layers.Dense(64, activation='relu')(combined)
x = keras.layers.Dropout(0.3)(x)
output = keras.layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs=[text_input, num_input], outputs=output)

2. Training with Callbacks

callbacks = [
    keras.callbacks.ModelCheckpoint('best.keras', monitor='val_loss', save_best_only=True),
    keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3),
    keras.callbacks.TensorBoard(log_dir='./logs')
]
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy', keras.metrics.AUC()])
model.fit(X_train, y_train, epochs=50, validation_split=0.2, callbacks=callbacks)

3. Custom Training Loop (GradientTape)

@tf.function
def train_step(model, X, y, optimizer, loss_fn):
    with tf.GradientTape() as tape:
        preds = model(X, training=True)
        loss = loss_fn(y, preds)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

4. tf.data Pipeline

dataset = (
    tf.data.Dataset.from_tensor_slices((X, y))
    .shuffle(10000)
    .batch(64)
    .map(lambda x, y: (augment(x), y), num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

5. Custom Callback for Experiment Logging

class ExperimentLogger(keras.callbacks.Callback):
    def __init__(self, log_path):
        super().__init__()
        self.log_path = log_path
        self.logs_data = []

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.logs_data.append({'epoch': epoch, **logs})
        pd.DataFrame(self.logs_data).to_csv(self.log_path, index=False)
        if logs.get('val_loss', 0) > logs.get('loss', float('inf')) * 1.5:
            print(f"⚠️ Possible overfitting at epoch {epoch}")
`, interview: `

🎯 TensorFlow Interview Questions

Q1: Sequential vs Functional vs Subclassing?

Answer: Sequential: linear. Functional: multi-I/O, branching. Subclassing: full Python control. Use Functional for most projects.

Q2: What does @tf.function do?

Answer: Traces Python → TF graph. Faster, XLA, export. Gotcha: side effects run only during tracing.

Q3: tf.data performance?

Answer: prefetch(AUTOTUNE) overlaps loading+training. cache() for small data. interleave() for multiple files.

Q4: EarlyStopping config?

Answer: monitor='val_loss', patience=5-10, restore_best_weights=True. Combine with ReduceLROnPlateau.

Q5: When GradientTape?

Answer: GANs, RL, custom gradients, multi-loss. When .fit() is too restrictive.

Q6: TF vs PyTorch?

Answer: TF: deployment (Serving, Lite, JS), mobile. PyTorch: research, HuggingFace. Both converging.

Q7: How to deploy TF model?

Answer: SavedModel → TF Serving (REST/gRPC), TFLite (mobile), TF.js (browser). Docker + TF Serving for production.

` }, "production": { concepts: `

📦 Production Python: Complete Engineering Guide

⚡ Production = Reliability + Reproducibility + Observability
Production code must be tested (pytest), typed (mypy), logged (structured), packaged (pyproject.toml), containerized (Docker), and monitored (metrics). The gap between notebook and production is enormous.

1. pytest β€” Professional Testing

Feature | Purpose | Example
fixtures | Reusable test setup | @pytest.fixture
parametrize | Many inputs, same test | @pytest.mark.parametrize
conftest.py | Shared fixtures | DB connections, mock data
monkeypatch | Override functions/env | Mock API calls
tmp_path | Temp directory | Test file I/O
markers | Tag tests | pytest -m "not slow"
coverage | Measure test coverage | pytest --cov

2. Testing ML Code

🎯 What to Test in ML
Unit: data transforms, feature engineering, loss functions.
Integration: full pipeline end-to-end.
Model: output shape, range, determinism with seed.
Data: schema validation, distribution shifts, missing patterns.

3. Logging Best Practices

Level | When
DEBUG | Tensor shapes, intermediate values
INFO | Training started, epoch complete
WARNING | Unexpected but handled (fallback used)
ERROR | Model load failure, API error
CRITICAL | OOM, GPU crash

Never use print(). Use structured logging (JSON format) for production; it is parseable by log aggregators (ELK, Datadog).

4. FastAPI for Model Serving

Modern async framework. Auto-generates OpenAPI docs. Pydantic validation. Deploy with Uvicorn + Docker. Add: health checks, input validation, error handling, rate limiting, request logging.

5. Docker for ML

Containerize everything: Python, CUDA, dependencies. Multi-stage builds: builder (install) → runtime (slim). Pin versions. NVIDIA Container Toolkit for GPU. docker compose for multi-service (API + Redis + DB).

6. pyproject.toml β€” Modern Packaging

Replaces setup.py/cfg. Project metadata, dependencies, build system, tool configs (pytest, mypy, ruff). [project.optional-dependencies] for dev/test extras. pip install -e ".[dev]" for editable installs.
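A minimal pyproject.toml sketch of the pieces described above; the project name, version pins, and tool settings are placeholders, not a prescribed layout:

```toml
[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"

[project]
name = "churn-model"            # hypothetical project name
version = "0.1.0"
requires-python = ">=3.11"
dependencies = ["scikit-learn>=1.4", "pandas>=2.0"]

[project.optional-dependencies]
dev = ["pytest", "pytest-cov", "mypy", "ruff"]

# Tool configs live in the same file
[tool.pytest.ini_options]
addopts = "--cov=src"

[tool.ruff]
line-length = 100
```

With this file in place, pip install -e ".[dev]" installs the package editable plus the dev extras.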

7. Configuration Management

Tool | Best For | Key Feature
Hydra | ML experiments | YAML, CLI overrides, multi-run
Pydantic Settings | App config | Env var loading, validation
python-dotenv | Simple projects | .env file loading
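The env-var loading and validation these tools provide can be sketched with the stdlib alone; this is a toy pattern, not the Pydantic Settings API, and the variable names are illustrative:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    model_path: str
    batch_size: int
    debug: bool

def load_settings(env=os.environ) -> Settings:
    # Read, cast, and validate environment variables in one place
    batch_size = int(env.get("BATCH_SIZE", "32"))
    if batch_size <= 0:
        raise ValueError("BATCH_SIZE must be positive")
    return Settings(
        model_path=env.get("MODEL_PATH", "models/pipeline.pkl"),
        batch_size=batch_size,
        debug=env.get("DEBUG", "0") == "1",
    )

cfg = load_settings({"BATCH_SIZE": "64", "DEBUG": "1"})
print(cfg)  # Settings(model_path='models/pipeline.pkl', batch_size=64, debug=True)
```

Centralizing casting and validation like this is exactly what Pydantic Settings automates, with nested models and .env support on top.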

8. CI/CD for ML

GitHub Actions: lint (ruff) → type check (mypy) → test (pytest) → build (Docker) → deploy. Add a model validation gate: the new model must beat the baseline on test metrics before deployment.

9. Code Quality Tools

Tool | Purpose
ruff | Fast linter + formatter (replaces black, isort, flake8)
mypy | Static type checking
pre-commit | Git hooks for auto-formatting
pytest-cov | Test coverage
bandit | Security linting

10. MLOps β€” Model Lifecycle

Tool | Purpose
MLflow | Experiment tracking, model registry
DVC | Data versioning (like Git for data)
Weights & Biases | Experiment tracking, visualization
Evidently | Data drift & model monitoring
Great Expectations | Data validation

11. Database for ML Projects

DB | Use Case | Python Library
SQLite | Local, small data, prototyping | sqlite3 (built-in)
PostgreSQL | Production, ACID, JSON | psycopg2, SQLAlchemy
Redis | Caching, queues, sessions | redis-py
MongoDB | Flexible schema, documents | pymongo
Pinecone/Weaviate | Vector search (embeddings) | Official SDKs
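For the SQLite row above, the built-in sqlite3 module is enough to prototype a prediction store; the table and column names here are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path in a real project

# Hypothetical schema for storing model outputs
conn.execute(
    "CREATE TABLE predictions (id INTEGER PRIMARY KEY, customer_id INTEGER, score REAL)"
)
conn.executemany(
    "INSERT INTO predictions (customer_id, score) VALUES (?, ?)",
    [(101, 0.87), (102, 0.34)],
)
conn.commit()

# Query back the high-risk customers
high_risk = conn.execute(
    "SELECT customer_id FROM predictions WHERE score > 0.5"
).fetchall()
print(high_risk)  # [(101,)]
```

The same code moves to PostgreSQL largely unchanged if you swap the driver, which is why SQLite is a good prototyping default.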
`, code: `

💻 Production Python Project Code

1. pytest β€” Complete ML Testing

import pytest
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# conftest.py -- shared fixtures
@pytest.fixture
def sample_data():
    np.random.seed(42)
    X = np.random.randn(100, 10)
    y = np.random.randint(0, 2, 100)
    return X, y

@pytest.fixture
def trained_model(sample_data):
    X, y = sample_data
    model = RandomForestClassifier(n_estimators=10)
    model.fit(X, y)
    return model

# Test multiple models with one function
@pytest.mark.parametrize("model_cls", [
    LogisticRegression,
    RandomForestClassifier,
    GradientBoostingClassifier,
])
def test_model_output(model_cls, sample_data):
    X, y = sample_data
    model = model_cls()
    model.fit(X, y)
    preds = model.predict(X)
    assert preds.shape == y.shape
    assert set(np.unique(preds)).issubset({0, 1})

# Test the data pipeline (assumes a `pipeline` fixture defined in conftest.py)
def test_pipeline_no_leakage(sample_data, pipeline):
    X, y = sample_data
    scores = cross_val_score(pipeline, X, y, cv=3)
    assert all(0 <= s <= 1 for s in scores)

2. Structured Logging

import logging, json, sys

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log = {
            'timestamp': self.formatTime(record),
            'level': record.levelname,
            'module': record.module,
            'message': record.getMessage(),
        }
        if record.exc_info:
            log['exception'] = self.formatException(record.exc_info)
        return json.dumps(log)

def setup_logging(level=logging.INFO):
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JSONFormatter())
    logging.root.handlers = [handler]
    logging.root.setLevel(level)

setup_logging()
logger = logging.getLogger(__name__)
logger.info("Training started", extra={'model': 'xgb'})

3. FastAPI β€” Complete ML API

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
import joblib
import numpy as np

app = FastAPI(title="ML Prediction API")
model = None

@app.on_event("startup")
def load_model():
    global model
    model = joblib.load("models/pipeline.pkl")

class PredictRequest(BaseModel):
    features: list[float] = Field(..., min_length=1)

class PredictResponse(BaseModel):
    prediction: int
    probability: float
    model_version: str

@app.post("/predict", response_model=PredictResponse)
async def predict(req: PredictRequest):
    try:
        X = np.array(req.features).reshape(1, -1)
        pred = model.predict(X)[0]
        proba = model.predict_proba(X)[0].max()
        return PredictResponse(
            prediction=int(pred),
            probability=float(proba),
            model_version="v2.1",
        )
    except Exception as e:
        raise HTTPException(500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": model is not None}

4. Dockerfile for ML

# Multi-stage build
FROM python:3.11-slim AS builder
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/deps -r requirements.txt

FROM python:3.11-slim
COPY --from=builder /deps /usr/local/lib/python3.11/site-packages
COPY src/ /app/src/
COPY models/ /app/models/
WORKDIR /app
EXPOSE 8000
HEALTHCHECK CMD curl -f http://localhost:8000/health || exit 1
CMD ["uvicorn", "src.api:app", "--host", "0.0.0.0", "--port", "8000"]

5. Makefile for Project Commands

# Makefile -- run from project root (recipes must be indented with tabs)
.PHONY: install test lint train serve

install:
	pip install -e ".[dev]"

test:
	pytest tests/ -v --cov=src --cov-report=term-missing

lint:
	ruff check src/ tests/
	mypy src/

train:
	python -m src.training.train --config configs/default.yaml

serve:
	uvicorn src.api:app --reload --port 8000

6. MLflow Experiment Tracking

import mlflow

mlflow.set_experiment("customer_churn")
with mlflow.start_run():
    mlflow.log_params({"model": "xgb", "lr": 0.01})
    model.fit(X_train, y_train)
    mlflow.log_metrics({"f1": f1, "auc": auc_score})
    mlflow.sklearn.log_model(pipeline, "model")
`, interview: `

🎯 Production Python Interview Questions

Q1: How to test ML code?

Answer: Unit: transforms, features. Integration: full pipeline. Model: shape, range, determinism. Data: schema, distributions. Use pytest fixtures.

Q2: print() vs logging?

Answer: Logging: levels, file output, structured (JSON), near-zero cost when disabled, thread-safe. print() offers none of these. Production = logging.

Q3: How to serve ML model?

Answer: FastAPI + Docker. Load model at startup. Add health checks, validation, error handling, logging. Async for throughput.

Q4: pyproject.toml vs setup.py?

Answer: pyproject.toml: modern standard, all tools in one file. Pin deps. Use optional deps for dev/test. pip install -e ".[dev]".

Q5: ML experiment configs?

Answer: Hydra: YAML + CLI overrides + multi-run sweeps. Version control configs. Never hardcode hyperparams.

Q6: CI/CD for ML?

Answer: lint β†’ type-check β†’ test β†’ build β†’ deploy. Model validation gate: must beat baseline. GitHub Actions + Docker.

Q7: How to handle model versioning?

Answer: MLflow model registry. DVC for data. Git for code. timestamp + metrics in model filename. A/B testing for rollout.

Q8: What is data drift?

Answer: Input distribution changes post-deployment. Detect: Evidently, statistical tests. Monitor: feature distributions, prediction distributions. Retrain trigger.
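The "statistical tests" idea can be sketched as a two-sample Kolmogorov-Smirnov statistic in pure Python; a real pipeline would use scipy.stats.ks_2samp or Evidently rather than this toy version:

```python
def ks_statistic(a, b):
    """Max gap between two empirical CDFs -- values near 1.0 suggest drift."""
    a, b = sorted(a), sorted(b)
    all_vals = sorted(set(a) | set(b))

    def ecdf(xs, v):
        # Fraction of points <= v
        return sum(1 for x in xs if x <= v) / len(xs)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in all_vals)

# Illustrative feature samples: training window vs two live windows
train = [0.1, 0.2, 0.3, 0.4, 0.5]
live_same = [0.15, 0.25, 0.35, 0.45]
live_shifted = [1.1, 1.2, 1.3, 1.4]

print(ks_statistic(train, live_same))     # small: distributions overlap
print(ks_statistic(train, live_shifted))  # 1.0: complete separation
```

In production you would run a check like this per feature on a schedule and trigger retraining when the statistic exceeds a threshold.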

` }, "optimization": { concepts: `

⚡ Performance & Optimization: Complete Guide

⚡ The Optimization Hierarchy
1. Algorithm (O(n²) → O(n log n)) > 2. Data structures (list → set) > 3. Vectorization (NumPy) > 4. Compilation (Numba/Cython) > 5. Parallelization (multiprocessing) > 6. Hardware (GPU). Always start from the top.

1. Profiling β€” Measure First

Tool | Type | When | Overhead
cProfile | Function-level | Find slow functions | ~2x
line_profiler | Line-by-line | Optimize a hot function | Higher
Py-Spy | Sampling | Production profiling | Near zero
tracemalloc | Memory | Find leaks | Low
memory_profiler | Line memory | Memory per line | High
scalene | CPU+Memory+GPU | Comprehensive | Low

2. The GIL β€” What Every Python Dev Must Know

🔒 Global Interpreter Lock
GIL prevents true multi-threading for CPU-bound Python. BUT: NumPy, Pandas, scikit-learn release the GIL during C operations. Python 3.13: experimental free-threaded CPython (no-GIL).
Task Type | Solution | Why
I/O-bound | asyncio / threading | GIL released during I/O
CPU-bound Python | multiprocessing | Separate processes, separate GILs
CPU-bound NumPy | threading OK | NumPy releases the GIL
Many tasks | concurrent.futures | Simple pool interface

3. Numba β€” JIT Compilation

@numba.jit(nopython=True): compile to machine code. 10-100x speedup for loops. Supports NumPy, math. @numba.vectorize: custom ufuncs. @cuda.jit: GPU kernels. Best for: tight loops that can't be vectorized.

4. Dask β€” Parallel Computing

Pandas/NumPy API for data bigger than memory. dask.dataframe, dask.array, dask.delayed. Lazy execution. Task graph scheduler. Scales from laptop to cluster. Alternative: Polars for single-machine parallel.

5. Ray β€” Distributed ML

General-purpose distributed framework. Ray Tune (hyperparameter tuning), Ray Serve (model serving), Ray Data. Easier than Dask for ML. Used by OpenAI, Uber.

6. Memory Optimization
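Two of the most common levers, generators instead of materialized lists and compact typed containers instead of lists of boxed floats, can be sketched with the stdlib (exact byte counts are platform-dependent):

```python
import sys
from array import array

# A list comprehension materializes every element up front...
squares_list = [i * i for i in range(100_000)]
# ...while a generator holds only its iteration state
squares_gen = (i * i for i in range(100_000))
assert sys.getsizeof(squares_list) > 100 * sys.getsizeof(squares_gen)

# array('d') stores raw 8-byte doubles; a list stores pointers to boxed float objects
floats_array = array('d', range(10_000))
floats_list = [float(i) for i in range(10_000)]
list_total = sys.getsizeof(floats_list) + sum(sys.getsizeof(x) for x in floats_list)
assert sys.getsizeof(floats_array) < list_total  # no per-element object headers
```

The same idea scales up: NumPy float32 vs float64, Pandas category dtype, chunked file reads, and __slots__ (shown in the code tab) all trade flexibility for smaller per-element footprint.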

7. Caching Strategies

Tool | Scope | Use Case
@functools.lru_cache | In-memory, per function | Expensive computations
@functools.cache | Unbounded cache | Pure functions
joblib.Memory | Disk cache | Data processing pipelines
Redis | External cache | Multi-process, API responses
diskcache | Pure-Python disk cache | Simple persistent cache

8. Python 3.12-3.13 Performance

3.12: 5-15% faster, better errors, per-interpreter GIL. 3.13: Free-threaded (no-GIL experimental), JIT compiler (experimental). The future of Python performance is exciting.

9. Common Performance Anti-Patterns

Anti-Pattern | Fix | Speedup
for row in df.iterrows() | Vectorized ops | 100-1000x
s += "text" in a loop | ''.join(parts) | 100x
x in big_list | x in big_set | 1000x
Python list of floats | NumPy array | 50-100x
Imports inside a function | Import at top | Variable
Not using built-ins | sum(), min() | 5-10x
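Two rows from the table, demonstrated as equivalent-result code; actual speedups depend on data size, but the complexity difference is the point:

```python
# Membership: a list scans every element (O(n)); a set hashes (O(1) average)
big_list = list(range(100_000))
big_set = set(big_list)
assert (99_999 in big_list) == (99_999 in big_set)  # same answer, very different cost

# String building: += copies the accumulated string each time (O(n^2) total)
words = ["fast", "python", "code"]
joined = " ".join(words)        # single O(n) pass
accumulated = ""
for w in words:
    accumulated += w + " "      # quadratic pattern
assert accumulated.strip() == joined
```

The fixes never change the result, only the asymptotics, which is why they are safe first-pass optimizations.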
`, code: `

💻 Performance Code Examples

1. Profiling Workflow

import cProfile, pstats

# Profile and find bottlenecks
with cProfile.Profile() as pr:
    result = expensive_pipeline(data)

stats = pstats.Stats(pr)
stats.sort_stats('cumulative')
stats.print_stats(10)  # Top 10 slow functions

# Memory profiling
import tracemalloc
tracemalloc.start()
# ... process data ...
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('filename')[:5]:
    print(stat)

2. Numba JIT

import numba
import numpy as np

@numba.jit(nopython=True)
def pairwise_distance(X):
    n = X.shape[0]
    D = np.empty((n, n))
    for i in range(n):
        D[i, i] = 0.0  # np.empty leaves garbage; zero the diagonal explicitly
        for j in range(i + 1, n):
            d = 0.0
            for k in range(X.shape[1]):
                d += (X[i, k] - X[j, k]) ** 2
            D[i, j] = D[j, i] = d ** 0.5
    return D  # ~100x faster than pure Python

3. concurrent.futures β€” Parallel Processing

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# CPU-bound: processes
with ProcessPoolExecutor(max_workers=8) as ex:
    results = list(ex.map(process_chunk, data_chunks))

# I/O-bound: threads
with ThreadPoolExecutor(max_workers=32) as ex:
    results = list(ex.map(fetch_url, urls))

4. Dask for Large Data

import dask.dataframe as dd

# Read 100GB of CSVs lazily
ddf = dd.read_csv('data/*.csv')

# Same Pandas API, but parallel
result = (
    ddf.groupby('category')
    .agg({'revenue': 'sum', 'qty': 'mean'})
    .compute()  # Only here does it execute
)

5. functools.lru_cache β€” Memoization

from functools import lru_cache

@lru_cache(maxsize=1024)
def expensive_feature(customer_id: int) -> dict:
    # DB query, computation, etc.
    return compute_features(customer_id)

# First call: computes. Second call: instant from cache
print(expensive_feature.cache_info())  # hits, misses, size

6. __slots__ for Memory

class Point:
    __slots__ = ('x', 'y', 'z')

    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

# 1M instances: ~60MB with __slots__ vs ~160MB without
points = [Point(i, i * 2, i * 3) for i in range(1_000_000)]

7. String Performance

# ❌ O(n²): creates a new string each iteration
result = ""
for word in words:
    result += word + " "

# ✅ O(n): single allocation at the end
result = " ".join(words)
`, interview: `

🎯 Performance Interview Questions

Q1: Why the GIL?

Answer: Simplifies reference counting. Makes single-threaded faster. Easier C extensions. Python 3.13 has experimental no-GIL mode.

Q2: Optimize nested loop?

Answer: (1) NumPy vectorize. (2) Numba JIT. (3) Cython. (4) multiprocessing if independent.
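Option (1) can be sketched on the pairwise-distance loop used elsewhere in this module; this uses the broadcasting identity ||a-b||² = ||a||² + ||b||² - 2a·b, assumes NumPy is available, and clips tiny negatives caused by floating-point error:

```python
import numpy as np

def pairwise_distance_vec(X):
    # Squared norms of every row, then all cross products in one matmul
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.sqrt(np.maximum(d2, 0))  # clip float-error negatives before sqrt

X = np.random.default_rng(0).normal(size=(500, 16))
D = pairwise_distance_vec(X)
assert D.shape == (500, 500)
assert np.allclose(D, D.T)                     # symmetric
assert np.allclose(np.diag(D), 0, atol=1e-5)   # zero diagonal up to float error
```

No explicit Python loop runs over the 250,000 pairs; BLAS does the heavy lifting inside the single matrix multiply.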

Q3: Threading vs multiprocessing?

Answer: Threading: I/O-bound (shared memory). Multiprocessing: CPU-bound (bypasses GIL). Downloads → threads. Matrix math → processes.

Q4: What is Numba?

Answer: JIT compiler: Python → machine code via LLVM. @jit(nopython=True). 10-100x for NumPy loops. No Pandas/strings.

Q5: How to profile Python?

Answer: cProfile: functions. line_profiler: lines. Py-Spy: production. tracemalloc: memory. scalene: all-in-one. Profile FIRST, optimize second.

Q6: Dask vs Ray vs Spark?

Answer: Dask: Pandas API, Python-native. Ray: ML-focused. Spark: JVM, TB+ data. Python ML: Dask/Ray. Big data ETL: Spark.

Q7: Top 3 Python performance tips?

Answer: (1) Use sets not lists for lookups. (2) NumPy not Python loops. (3) Generator expressions for memory. Bonus: lru_cache for expensive functions.

Q8: How does lru_cache work?

Answer: Hash-based memoization. Args must be hashable. maxsize=None for unlimited. cache_info() shows hits/misses. Perfect for pure functions.

` } }; function renderDashboard() { const grid = document.getElementById('modulesGrid'); grid.innerHTML = modules.map(module => `
${module.icon}

${module.title}

${module.description}

${module.category}
`).join(''); } // Show specific module function showModule(moduleId) { const module = modules.find(m => m.id === moduleId); const content = MODULE_CONTENT[moduleId]; document.getElementById('dashboard').classList.remove('active'); const moduleHTML = `

${module.icon} ${module.title}

${module.description}

${content.concepts}
${content.code}
${content.interview}
`; document.getElementById('modulesContainer').innerHTML = moduleHTML; } // Switch tabs function switchTab(moduleId, tabName, e) { const moduleEl = document.getElementById(`module-${moduleId}`); // Update tab buttons moduleEl.querySelectorAll('.tab-btn').forEach(btn => btn.classList.remove('active')); if (e && e.target) { e.target.classList.add('active'); } else { // Fallback: find the button by tab name const tabNames = ['concepts', 'code', 'interview']; const idx = tabNames.indexOf(tabName); if (idx !== -1) moduleEl.querySelectorAll('.tab-btn')[idx]?.classList.add('active'); } // Update tab content moduleEl.querySelectorAll('.tab').forEach(tab => tab.classList.remove('active')); document.getElementById(`${moduleId}-${tabName}`).classList.add('active'); } // Back to dashboard function backToDashboard() { document.querySelectorAll('.module').forEach(m => m.remove()); document.getElementById('dashboard').classList.add('active'); } // Initialize document.addEventListener('DOMContentLoaded', renderDashboard);