Python is a dynamically-typed, garbage-collected, interpreted language with a C-based runtime (CPython). Everything is an object — integers, functions, even classes. Understanding this object model is what separates beginners from professionals.
1. Data Structures — Complete Reference

| Type | Mutable | Ordered | Hashable | Use Case |
|---|---|---|---|---|
| list | ✓ | ✓ | ✗ | Sequential data, time series, feature lists |
| tuple | ✗ | ✓ | ✓ | Fixed records, dict keys, DataFrame rows |
| dict | ✓ | ✓ (3.7+) | ✗ | Lookup tables, JSON, config, caches |
| set | ✓ | ✗ | ✗ | Unique values, membership testing O(1) |
| frozenset | ✗ | ✗ | ✓ | Immutable set, usable as dict keys |
| deque | ✓ | ✓ | ✗ | O(1) append/pop both ends, sliding windows |
| bytes | ✗ | ✓ | ✓ | Binary data, serialization, network I/O |
| bytearray | ✓ | ✓ | ✗ | Mutable binary buffers |
2. Time Complexity — What Every Dev Must Know

| Operation | list | dict | set |
|---|---|---|---|
| Lookup by index/key | O(1) | O(1) | — |
| Search (x in ...) | O(n) | O(1) | O(1) |
| Insert/Append | O(1) end, O(n) middle | O(1) | O(1) |
| Delete | O(n) | O(1) | O(1) |
| Sort | O(n log n) | — | — |
| Iteration | O(n) | O(n) | O(n) |
Real-world impact: Checking if an item exists in a list of 1M elements = ~50ms. In a set = ~0.00005ms. That's 1,000,000x faster. Always use sets/dicts for membership testing.
3. Python Memory Model
⚡ Everything Is an Object on the Heap
Variables are references (pointers), not boxes. a = [1,2,3] creates a list on the heap; a points to it. b = a makes both point to the same list. This is aliasing — the #1 source of bugs in beginner Python code.
Reference Counting: Each object tracks how many names reference it. When count = 0, freed immediately. del decrements the count, doesn't necessarily free memory.
Integer Interning: Python caches integers -5 to 256. So a = 100; b = 100; a is b → True. But a = 1000; b = 1000; a is b → may be False. Never use is for value comparison.
Garbage Collection: 3 generations (gen0, gen1, gen2). New objects in gen0. Survivors promoted. Use gc.collect() after deleting large ML models.
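A minimal sketch of aliasing and small-int caching in action (illustrative values; the interning behavior is a CPython implementation detail):

```python
a = [1, 2, 3]
b = a              # aliasing: b points to the SAME list object
b.append(4)
print(a)           # mutation is visible through both names

c = a[:]           # shallow copy: a new, independent list
assert c == a and c is not a

# Small-int caching: CPython interns -5..256
x, y = 256, 256
print(x is y)      # True in CPython (cached object)
# For value comparison, always use ==, never is
assert x == y
```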
4. Generators & Iterators — The Heart of Python
Lazy Evaluation
yield suspends state, return terminates. A list of 1B items = ~8GB. A generator = ~100 bytes. The Iterator Protocol: any object with __iter__ + __next__. Generator expressions: (x**2 for x in range(10**9)) → O(1) memory.
yield from: Delegates to a sub-generator. Forwards send() and throw(). Essential for building composable data pipelines.
send(): Two-way communication with generators (coroutines). value = yield result — both receives and produces values.
5. Closures & First-Class Functions
Functions are first-class objects — passed as args, returned, assigned. A closure captures variables from the enclosing scope. Foundation of decorators, callbacks, and functional programming.
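A short sketch of both ideas — a closure factory and the decorator pattern it enables (make_multiplier and logged are illustrative names):

```python
import functools

def make_multiplier(factor):
    """Closure: 'factor' is captured from the enclosing scope."""
    def multiply(x):
        return x * factor
    return multiply

triple = make_multiplier(3)
assert triple(10) == 30

def logged(func):
    """Decorator = closure over 'func'."""
    @functools.wraps(func)            # preserve name/docstring
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@logged
def add(a, b):
    return a + b

assert add(2, 3) == 5
```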
6. Critical Python Gotchas for Projects
⚠️ The 5 Deadliest Python Traps
1. Mutable default args: def f(x, lst=[]) — the default list is created once and shared across ALL calls. Fix: lst=None.
2. Late-binding closures: [lambda: i for i in range(5)] — all return 4! Fix: lambda i=i: i.
3. Shallow copy: list(a) copies the outer list but shares the inner objects.
4. String concatenation: s += "text" in a loop creates a new string every time — O(n²). Use ''.join(parts).
5. Circular imports: module A imports B, B imports A — ImportError. Fix: restructure or lazy import.
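Traps 1 and 2 are easy to demonstrate; a minimal sketch (bad_append/good_append are illustrative names):

```python
# Trap 1: mutable default argument — one list shared across calls
def bad_append(x, lst=[]):
    lst.append(x)
    return lst

assert bad_append(1) == [1]
assert bad_append(2) == [1, 2]      # previous call's state leaks!

def good_append(x, lst=None):
    if lst is None:
        lst = []                    # fresh list per call
    lst.append(x)
    return lst

assert good_append(2) == [2]

# Trap 2: late-binding closures — every lambda sees the final i
funcs = [lambda: i for i in range(5)]
assert [f() for f in funcs] == [4, 4, 4, 4, 4]

fixed = [lambda i=i: i for i in range(5)]   # default arg binds NOW
assert [f() for f in fixed] == [0, 1, 2, 3, 4]
```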
7. Error Handling for Production Projects
🛡️ Exception Hierarchy You Must Know
BaseException → Exception (catch this) → ValueError, TypeError, KeyError, FileNotFoundError, ConnectionError... Rules: (1) Never use a bare except:. (2) Catch specific exceptions. (3) Use else for the success path. (4) finally always runs. (5) Create custom exceptions for your project.
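The five rules can be sketched in one function (DataValidationError and parse_score are illustrative names):

```python
class DataValidationError(Exception):
    """Rule 5: a project-specific exception."""

def parse_score(raw):
    try:
        value = float(raw)
    except ValueError as exc:              # Rule 2: catch SPECIFIC exceptions
        raise DataValidationError(f"bad score: {raw!r}") from exc
    else:                                  # Rule 3: success path only
        return min(max(value, 0.0), 1.0)   # clamp to [0, 1]
    finally:                               # Rule 4: always runs
        pass                               # release resources here

assert parse_score("0.7") == 0.7
assert parse_score("2.5") == 1.0
try:
    parse_score("oops")
except DataValidationError as e:
    print(e)
```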
8. collections Module — Power Tools

| Class | Purpose | Project Use Case |
|---|---|---|
| defaultdict | Dict with default factory | Group data: defaultdict(list) |
| Counter | Count hashable objects | Label distribution, word frequency |
| namedtuple | Lightweight immutable class | Return multiple named values |
| deque | Double-ended queue | Sliding window, BFS, ring buffer |
| ChainMap | Stack multiple dicts | Config layers: defaults → env → CLI |
| OrderedDict | Ordered dict (legacy) | move_to_end() for LRU cache |
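The three most common of these in one sketch (toy data, illustrative):

```python
from collections import defaultdict, Counter, deque

# defaultdict(list): group records without key-exists checks
records = [("a", 1), ("b", 2), ("a", 3)]
groups = defaultdict(list)
for key, val in records:
    groups[key].append(val)
assert dict(groups) == {"a": [1, 3], "b": [2]}

# Counter: label distribution in one line
labels = ["cat", "dog", "cat", "cat"]
assert Counter(labels).most_common(1) == [("cat", 3)]

# deque(maxlen=N): fixed-size sliding window / ring buffer
window = deque(maxlen=3)
for x in range(5):
    window.append(x)                 # old items fall off the left
assert list(window) == [2, 3, 4]
```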
9. itertools — Memory-Efficient Pipelines

| Function | What It Does | Project Use |
|---|---|---|
| chain() | Concatenate iterables lazily | Merge data files |
| islice() | Slice any iterator | Take first N from generator |
| groupby() | Group consecutive elements | Process sorted logs by date |
| product() | Cartesian product | Hyperparameter grid |
| combinations() | All r-length combos | Feature interaction pairs |
| starmap() | map() with unpacked args | Apply function to paired data |
| accumulate() | Running accumulator | Cumulative sums, running max |
| tee() | Clone iterator N times | Multiple passes over stream |
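A small sketch of the three workhorses; note the groupby() gotcha — it only groups consecutive keys, so sort first (toy log data, illustrative):

```python
from itertools import chain, islice, groupby

# chain + islice: lazily concatenate, then take the first N
merged = chain([1, 2], (3, 4))
assert list(islice(merged, 3)) == [1, 2, 3]

# groupby: groups CONSECUTIVE equal keys — input must be sorted by key
logs = [("2024-01-01", "a"), ("2024-01-01", "b"), ("2024-01-02", "c")]
by_day = {day: [event for _, event in entries]
          for day, entries in groupby(logs, key=lambda rec: rec[0])}
assert by_day == {"2024-01-01": ["a", "b"], "2024-01-02": ["c"]}
```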
10. File I/O for Real Projects
| Format | Read | Write | Best For |
|---|---|---|---|
| JSON | json.load(f) | json.dump(obj, f) | Configs, API responses |
| CSV | csv.DictReader(f) | csv.DictWriter(f) | Tabular data (small) |
| YAML | yaml.safe_load(f) | yaml.dump(obj, f) | Config files |
| Pickle | pickle.load(f) | pickle.dump(obj, f) | Python objects, models |
| Parquet | pd.read_parquet() | df.to_parquet() | Large DataFrames (fast) |
| SQLite | sqlite3.connect() | SQL queries | Local database |
11. pathlib — Modern File Handling
Stop using os.path.join(). Use pathlib.Path: Path('data') / 'train' / 'images'. Methods: .glob(), .read_text(), .mkdir(parents=True), .exists(), .suffix, .stem. Cross-platform, readable, powerful.
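A runnable sketch of those methods (uses a temporary directory so it is self-contained; file names are illustrative):

```python
from pathlib import Path
import tempfile

root = Path(tempfile.mkdtemp())
(root / "data" / "train").mkdir(parents=True)   # nested dirs, one call

cfg = root / "config.json"
cfg.write_text('{"lr": 0.001}')

assert cfg.exists()
assert cfg.suffix == ".json" and cfg.stem == "config"
assert cfg.read_text() == '{"lr": 0.001}'

# glob over the directory, / joins paths on any OS
assert sorted(p.name for p in root.glob("*")) == ["config.json", "data"]
```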
f-strings (3.6+): f"{accuracy:.2%}" → "95.23%". f"{x=}" (3.8+) → "x=42" for debugging. f"{name!r}" → shows repr. regex: re.compile(pattern) for repeated use. re.sub() for cleaning. re.findall() for extraction. Always compile patterns used in loops.
15. Command-Line Interface (CLI) Tools
argparse: Built-in CLI parsing. click: Decorator-based, more Pythonic. typer: Modern, uses type hints. Every production project needs a CLI for: training, evaluation, data processing, deployment scripts.
`,
code: `
💻 Python Fundamentals — Project Code
1. Generator Pipeline — Process Any Size Data
import json
from pathlib import Path

def read_jsonl(filepath):
    """Read a JSON Lines file lazily — handles any size."""
    with open(filepath) as f:
        for line in f:
            yield json.loads(line.strip())

def filter_records(records, min_score=0.5):
    for rec in records:
        if rec.get('score', 0) >= min_score:
            yield rec

def batch(iterable, size=64):
    """Batch any iterable into fixed-size chunks."""
    from itertools import islice
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# Compose into a pipeline — still O(1) memory!
pipeline = batch(filter_records(read_jsonl("data.jsonl")), size=32)
for chunk in pipeline:
    process(chunk)  # Only 32 records in memory at a time
2. Coroutine Pattern — Running Statistics

def running_stats():
    """Coroutine that computes running mean & variance."""
    n = 0
    mean = 0.0
    M2 = 0.0
    while True:
        x = yield {'mean': mean, 'var': M2 / n if n > 0 else 0, 'n': n}
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)  # Welford's algorithm — numerically stable

stats = running_stats()
next(stats)       # Prime
stats.send(10)    # {'mean': 10.0, 'var': 0, 'n': 1}
stats.send(20)    # {'mean': 15.0, 'var': 25.0, 'n': 2}
# Walrus operator (:=) — assign + use (3.8+)
if (n := len(data)) > 1000:
    print(f"Large dataset: {n} samples")

# Dict merge (3.9+)
config = defaults | overrides

# match-case — Structural Pattern Matching (3.10+)
match command:
    case {"action": "train", "model": model_name}:
        train(model_name)
    case {"action": "predict", "data": path}:
        predict(path)
    case _:
        print("Unknown command")

# Extended unpacking
first, *middle, last = sorted(scores)

# Nested dict comprehension
metrics = {
    model: {metric: score for metric, score in results.items()}
    for model, results in all_results.items()
}
8. Regex for Data Cleaning
import re
# Compile patterns used repeatedly (10x faster)
EMAIL = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
PHONE = re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b')
# Extract all emails from text
emails = EMAIL.findall(text)
# Clean text for NLP
def clean_text(text):
    text = re.sub(r'http\S+', '', text)       # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)   # Keep only letters
    text = re.sub(r'\s+', ' ', text).strip()  # Normalize whitespace
    return text.lower()
9. Configuration Management
import json, yaml
from pathlib import Path
from dataclasses import dataclass, asdict
@dataclass
class Config:
    model_name: str = "random_forest"
    learning_rate: float = 0.001
    batch_size: int = 32
    epochs: int = 100
    data_path: str = "data/train.csv"

    @classmethod
    def from_yaml(cls, path):
        with open(path) as f:
            return cls(**yaml.safe_load(f))

    def save(self, path):
        Path(path).write_text(json.dumps(asdict(self), indent=2))

config = Config.from_yaml("configs/experiment.yaml")
`,
interview: `
🎯 Python Fundamentals — Interview Questions
Q1: List vs tuple — when to use which?
Answer: Tuples: immutable, hashable (dict keys), less memory. Lists: mutable, growable. Use tuples for fixed data (coordinates, config). Use lists for collections that change. Tuples signal "this shouldn't be modified."
Q2: How does Python's GIL affect DS?
Answer: GIL prevents multi-threading for CPU-bound Python. But NumPy/Pandas release the GIL during C operations. For pure Python CPU work β multiprocessing. For I/O β threading works. For data science, the GIL rarely matters.
Q3: Shallow vs deep copy?
Answer: copy.copy(): outer container copied, inner objects shared. copy.deepcopy(): everything copied recursively. Real trap: df2 = df is NOT a copy — it's aliasing. Use df.copy().
Q4: What is the mutable default argument trap?
Answer: def f(x, lst=[]) — the default list is created ONCE and shared. Fix: lst=None; if lst is None: lst = []. The #1 Python interview gotcha.
Q5: Why are generators critical for large data?
Answer: O(1) memory. 1B items as list = 8GB. As generator = 100 bytes. Use for: file processing, streaming, batch training. yield from for composition.
Q6: Explain LEGB scope rule.
Answer: Name lookup order: Local β Enclosing β Global β Built-in. nonlocal for enclosing scope, global for module. list = [1] shadows built-in list().
Q7: How to handle a 10GB CSV?
Answer: (1) pd.read_csv(chunksize=N), (2) usecols=['needed'], (3) dtype={'col':'int32'}, (4) Dask, (5) DuckDB for SQL on CSV, (6) Polars for Rust-speed.
Q8: Dict lookup O(1) vs list search O(n)?
Answer: Dicts use hash tables. Key → hash → slot index. O(1) average. Lists scan linearly. x in set is O(1) but x in list is O(n). For 1M items: microseconds vs milliseconds.
Q9: Explain Python's garbage collection.
Answer: (1) Reference counting — freed at count=0. (2) Cyclic GC — detects A→B→A cycles. 3 generations. gc.collect() after deleting large models.
Q10: What is __slots__?
Answer: Replaces per-instance __dict__ with fixed array. ~40% memory savings. Use for millions of small objects. Trade-off: no dynamic attributes.
Q11: How do you structure a Python project?
Answer: src/package/ layout. pyproject.toml for config. tests/ with pytest. configs/ for YAML. Makefile for common commands. Separate data, models, training, serving.
Q12: What's the difference between is and ==?
Answer: == checks value equality. is checks identity (same object in memory). Use is only for singletons: x is None. Integer interning makes 256 is 256 True, but 1000 is 1000 may be False.
`
},
"numpy": {
concepts: `
🔢 NumPy — Complete Deep Dive
⚡ Why NumPy Is 50-100x Faster
(1) Contiguous memory — CPU cache-friendly. (2) Compiled C loops. (3) SIMD instructions — 4-8 floats processed simultaneously. Python list: array of pointers to objects. NumPy: raw typed data in one block.
1. ndarray Internals
| Feature | Python List | NumPy ndarray |
|---|---|---|
| Storage | Pointers to objects | Contiguous typed data |
| Memory per int | ~28 bytes + pointer | 8 bytes (int64) |
| Operations | Python loop | Compiled C/Fortran |
| SIMD | Impossible | CPU vector instructions |
2. Memory Layout & Strides
🧠 Strides = The Secret Behind Views
Every ndarray has strides — bytes to jump in each dimension. For (3,4) float64: strides = (32, 8). Slicing creates views (no copy) by adjusting strides. arr[::2] doubles the row stride. C-order (row-major): rows contiguous. Fortran-order: columns contiguous. Iterate along the last axis for best performance.
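The stride mechanics above can be checked directly (a small sketch; strides assume the default float64/C-order layout):

```python
import numpy as np

arr = np.zeros((3, 4), dtype=np.float64)
assert arr.strides == (32, 8)       # row jump = 4 * 8 bytes, column jump = 8

view = arr[::2]                     # slicing only adjusts strides — no copy
assert view.strides == (64, 8)      # row stride doubled

view[0, 0] = 99.0
assert arr[0, 0] == 99.0            # view shares the same underlying buffer

col_major = arr.T                   # transpose is also just a stride swap
assert col_major.strides == (8, 32)
```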
3. Broadcasting Rules
🎯 Rules (Right to Left)
Two arrays are compatible when, for each trailing dim: dims are equal OR one is 1. (5,3,1) + (1,4) → (5,3,4). The "1" dims stretch virtually — no memory copied. Common: X - X.mean(axis=0) → (1000,5) - (5,) works!
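Both broadcasting cases in a quick sketch (toy shapes):

```python
import numpy as np

# The common case: center columns with a (3,) mean against a (4, 3) matrix
X = np.arange(12.0).reshape(4, 3)
centered = X - X.mean(axis=0)          # (4,3) - (3,) broadcasts row-wise
assert centered.shape == (4, 3)
assert np.allclose(centered.mean(axis=0), 0)

# The general rule: trailing dims 1-vs-4 and 3-vs-1 both stretch virtually
a = np.ones((5, 3, 1))
b = np.ones((1, 4))
assert (a + b).shape == (5, 3, 4)
```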
np.linalg.norm(X, axis=1) — L2 norms for distances
np.linalg.lstsq(X, y) — Stable linear regression
np.linalg.inv() — AVOID! Use solve() instead (numerically stable)
8. Random Number Generation
Modern: rng = np.random.default_rng(42) (NumPy 1.17+). PCG64 algorithm, thread-safe. Old np.random.seed(42) is global, not thread-safe. Always use default_rng() in projects.
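A sketch of the modern API — seeded generators are local and reproducible:

```python
import numpy as np

rng = np.random.default_rng(42)          # PCG64, state local to this object
sample = rng.normal(size=3)

rng2 = np.random.default_rng(42)         # same seed → identical stream
assert np.allclose(sample, rng2.normal(size=3))

# Independent generators don't interfere — unlike the global np.random.seed
idx = rng.permutation(5)
assert sorted(idx) == [0, 1, 2, 3, 4]
```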
9. Image Processing with NumPy
Images are just 3D arrays: (height, width, channels). Crop: img[100:200, 50:150]. Resize: scipy. Normalize: img / 255.0. Augment: flip img[:, ::-1], rotate with scipy.ndimage. Foundation of all computer vision.
Answer: Right-to-left: dims must be equal or one is 1. (3,1) + (1,4) → (3,4). No memory copied. Gotcha: (3,) + (3,4) fails — reshape to (3,1).
Q4: axis=0 vs axis=1?
Answer: axis=0: operate down rows (collapse rows). axis=1: across columns (collapse columns). For (100,5): mean(axis=0) → (5,); mean(axis=1) → (100,).
Q5: Implement PCA with NumPy?
Answer: Center, compute covariance, eigendecompose (eigh), sort by eigenvalue, project onto top-k eigenvectors. Or SVD directly.
Q6: np.dot vs @ vs einsum?
Answer: @: clean, broadcasts. np.dot: confusing for 3D+. einsum: most flexible, any tensor op. Use @ for readability.
Q7: How to handle NaN?
Answer: np.isnan() detects. np.nanmean() ignores NaN. Gotcha: NaN == NaN is False (IEEE 754).
Q8: C-order vs Fortran-order?
Answer: C: rows contiguous (default). Fortran: columns contiguous (LAPACK/BLAS). Iterate last axis for speed. Convert: np.asfortranarray().
`
},
"pandas": {
concepts: `
🐼 Pandas — Complete Deep Dive
⚡ DataFrame Internals — BlockManager
A DataFrame is NOT a 2D array. It uses a BlockManager — same-dtype columns stored in contiguous blocks. Column operations: fast (same block). Row iteration: slow (crosses blocks). This is why df.iterrows() is 100x slower than vectorized ops.
1. The Golden Rules
⚠️ 5 Rules That Prevent 90% of Pandas Bugs
1. Use .loc (label) and .iloc (position) — never chain indexing.
2. df.loc[0:5] includes 5; df.iloc[0:5] excludes 5.
3. df[mask]['col'] = x writes to a copy. Use df.loc[mask, 'col'] = x.
4. df2 = df is NOT a copy. Use df2 = df.copy().
5. Always check df.dtypes and df.isna().sum() first.
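Rules 2-4 in a runnable sketch (toy DataFrame):

```python
import pandas as pd

df = pd.DataFrame({"score": [0.2, 0.8, 0.5], "label": ["a", "b", "c"]})

# Rule 3: single .loc assignment instead of chained indexing
mask = df["score"] > 0.4
df.loc[mask, "label"] = "high"
assert list(df["label"]) == ["a", "high", "high"]

# Rule 2: .loc slicing is label-based and INCLUSIVE of the end label
assert len(df.loc[0:1]) == 2
assert len(df.iloc[0:1]) == 1

# Rule 4: .copy() gives an independent frame
df2 = df.copy()
df2.loc[0, "score"] = 1.0
assert df.loc[0, "score"] == 0.2
```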
2. GroupBy — Split-Apply-Combine
The most powerful Pandas operation. (1) Split → (2) Apply function → (3) Combine results. GroupBy is lazy — no computation until aggregation. Key methods:

| Method | Output Shape | Use Case |
|---|---|---|
| agg() | Reduced (one row/group) | Sum, mean, count per group |
| transform() | Same as input | Fill with group mean, normalize within group |
| filter() | Subset of groups | Keep groups with N > 100 |
| apply() | Flexible | Custom function per group |
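The agg/transform/filter distinction in one sketch (toy data):

```python
import pandas as pd

df = pd.DataFrame({"grp": ["a", "a", "b", "b"],
                   "val": [1.0, 3.0, 10.0, 30.0]})

# agg: one row per group
means = df.groupby("grp")["val"].agg("mean")
assert means["a"] == 2.0 and means["b"] == 20.0

# transform: broadcasts the group statistic back to the original shape
df["grp_mean"] = df.groupby("grp")["val"].transform("mean")
assert list(df["grp_mean"]) == [2.0, 2.0, 20.0, 20.0]

# filter: keep or drop whole groups by a condition
big = df.groupby("grp").filter(lambda g: g["val"].sum() > 10)
assert set(big["grp"]) == {"b"}
```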
3. Pandas 2.0 — Major Changes

| Feature | Before (1.x) | After (2.0+) |
|---|---|---|
| Backend | NumPy only | Apache Arrow option |
| Copy semantics | Confusing | Copy-on-Write |
| String dtype | object | string[pyarrow] (faster) |
| Nullable types | NaN for everything | pd.NA (proper null) |
4. Polars vs Pandas

| Feature | Pandas | Polars |
|---|---|---|
| Speed | 1x | 5-50x (Rust) |
| Parallelism | Single-threaded | Multi-threaded auto |
| API | Eager | Lazy + Eager |
| Ecosystem | Massive | Growing fast |
| Use when | EDA, small-med data, legacy | Large data, production |
5. Merge/Join Patterns

| Method | How | When |
|---|---|---|
| merge() | SQL-style joins on columns | Combine tables on shared keys |
| join() | Joins on index | Index-based combining |
| concat() | Stack along axis | Append rows/columns |
Common pitfall: Merge produces more rows than expected = many-to-many join. Always check: len(merged) vs len(left).
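The many-to-many pitfall, plus merge's built-in guard (validate=) as a sketch with toy tables:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "x": ["a", "b"]})
right = pd.DataFrame({"key": [1, 1, 3], "y": [10, 11, 12]})

merged = pd.merge(left, right, on="key", how="inner")
assert len(merged) == 2          # key=1 matched twice — row count grew!

# validate= raises if the key relationship isn't what you expect
try:
    pd.merge(left, right, on="key", validate="one_to_one")
except pd.errors.MergeError:
    print("duplicate keys detected")
```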
6. Memory Optimization Strategies

| Strategy | Savings | When |
|---|---|---|
| Category dtype | 90%+ | Few unique strings |
| Downcast numerics | 50-75% | int64 → int32/int16 |
| Sparse arrays | 80%+ | Mostly zeros/NaN |
| PyArrow backend | 30-50% | String-heavy data |
| Read only needed columns | Variable | usecols=['a','b'] |
7. Window Functions for Time Series
.rolling(N): fixed sliding window. .expanding(): cumulative. .ewm(span=N): exponentially weighted. All support .mean(), .std(), .apply(). Essential for: lag features, moving averages, volatility, Bollinger bands.
8. Pivot Tables & Crosstab
df.pivot_table(values, index, columns, aggfunc) β summarize data by two categorical dimensions. pd.crosstab() β frequency table of two categorical columns. Essential for EDA and business reporting.
9. Method Chaining Pattern
Fluent API: .assign() instead of df['col']=. .pipe(func) for custom. .query('col > 5') for readable filters. No intermediate variables = cleaner, reproducible pipelines.
Answer: merge: SQL joins on columns. join: on index. concat: stack along axis. Use merge for column joins, concat for appending.
Q3: apply vs map vs transform?
Answer: map: Series element-wise. apply: rows/columns. transform: same-shape output. All slow — prefer vectorized when possible.
Q4: GroupBy transform vs agg?
Answer: agg reduces. transform broadcasts back. Use transform for "fill with group mean" or "normalize within group" patterns.
Q5: How to handle missing data?
Answer: (1) dropna(thresh=N), (2) ffill() for time series (fillna(method='ffill') is deprecated in pandas 2.x), (3) fillna(df.median()) for ML, (4) interpolate(method='time'). Always check df.isna().sum() first.
Q6: Pandas vs Polars?
Answer: Polars: 5-50x faster (Rust), multi-threaded, lazy eval. Pandas: mature ecosystem, wide compatibility. New projects with big data β Polars.
Q7: What is MultiIndex?
Answer: Hierarchical indexing. Use for pivot tables, panel data. Access with .xs() or tuple. Reset with .reset_index().
Q8: How to optimize a 5GB DataFrame?
Answer: (1) Read only needed columns. (2) Downcast dtypes. (3) Category for strings. (4) Sparse for zeros. (5) PyArrow backend. (6) Process in chunks. Can reduce 5GB to 1GB.
`
},
"visualization": {
concepts: `
📊 Data Visualization — Complete Guide
⚡ The Grammar of Graphics
Data + Aesthetics (x, y, color, size) + Geometry (bars, lines, points) + Statistics (binning, smoothing) + Coordinates (cartesian, polar) + Facets (subplots). Every chart = this framework.
1. Choosing the Right Chart
| Question | Chart Type | Library |
|---|---|---|
| Distribution? | Histogram, KDE, Box, Violin | Seaborn |
| Relationship? | Scatter, Hexbin, Regression | Seaborn/Plotly |
| Comparison? | Bar, Grouped bar, Violin | Seaborn |
| Trend over time? | Line, Area chart | Plotly/Matplotlib |
| Correlation? | Heatmap | Seaborn |
| Part of whole? | Pie, Treemap, Sunburst | Plotly |
| Geographic? | Choropleth, Mapbox | Plotly/Folium |
| High-dimensional? | Parallel coords, UMAP | Plotly |
| ML results? | Confusion matrix, ROC, SHAP | Seaborn/SHAP |
2. Matplotlib Architecture
Three layers: Backend (rendering), Artist (everything drawn), Scripting (pyplot). Figure → Axes (subplots) → Axis objects. Always use the OO API: fig, ax = plt.subplots().
rcParams: Global defaults. plt.rcParams['font.size'] = 14. Create style files for project consistency. plt.style.use('seaborn-v0_8-whitegrid').
3. Color Theory for Data
💡 Color Guide
Sequential: viridis, plasma (low→high). Diverging: RdBu, coolwarm (center matters). Categorical: Set2, tab10 (distinct groups).
Never use rainbow/jet β bad for colorblind, perceptually non-uniform.
4. Seaborn β Statistical Visualization
Three API levels: Figure-level (relplot, catplot, displot), Axes-level (scatterplot, boxplot), Objects API (0.12+). Auto-computes regression lines, confidence intervals, density estimates.
5. Plotly β Interactive Dashboards
JavaScript-powered: hover, zoom, selection. plotly.express for quick plots. plotly.graph_objects for control. Integrates with Dash for production dashboards. Supports 3D, maps, animations. Export to HTML.
6. Visualization for ML Projects
| What to Visualize | Chart | Why |
|---|---|---|
| Class distribution | Bar chart | Detect imbalance |
| Feature distributions | Histogram/KDE grid | Find skew, outliers |
| Feature correlations | Heatmap (triangular) | Multicollinearity |
| Training curves | Line plot (loss/acc vs epoch) | Detect overfit/underfit |
| Model comparison | Box plot of CV scores | Compare variance |
| Confusion matrix | Annotated heatmap | Error analysis |
| ROC curve | Line plot + AUC | Threshold selection |
| Feature importance | Horizontal bar | Model interpretation |
| SHAP values | Beeswarm/waterfall | Individual predictions |
7. Common Mistakes
Truncated y-axis exaggerating differences
Pie charts for >5 categories — use bar instead
Rainbow/jet colormap — use viridis/cividis
Overplotting — use alpha, hexbin, KDE, or datashader
Missing labels, titles, units
3D charts without interaction β often misleading
Not saving high-DPI figures — use dpi=300
`,
code: `
💻 Visualization Project Code
1. Publication-Quality Multi-Subplot Figure
import matplotlib.pyplot as plt
import numpy as np
# Professional style setup
plt.rcParams.update({
'font.size': 12, 'axes.titlesize': 14,
'figure.facecolor': 'white',
'axes.spines.top': False, 'axes.spines.right': False
})
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Distribution
axes[0,0].hist(data, bins=30, alpha=0.7, color='steelblue', edgecolor='white')
axes[0,0].axvline(data.mean(), color='red', linestyle='--', label='Mean')
axes[0,0].set_title('Distribution')
# Scatter with colormap
sc = axes[0,1].scatter(x, y, c=z, cmap='viridis', alpha=0.7)
plt.colorbar(sc, ax=axes[0,1])
# Line with confidence interval
axes[1,0].plot(x, y_mean, 'b-', linewidth=2)
axes[1,0].fill_between(x, y_mean-y_std, y_mean+y_std, alpha=0.2)
# Bar with error bars
axes[1,1].bar(categories, values, yerr=errors, capsize=5, color='coral')
plt.tight_layout()
plt.savefig('figure.png', dpi=300, bbox_inches='tight')
Decorators: Level 1: simple wrapper (timing, logging). Level 2: with arguments (factory). Level 3: class-based with state. Always use functools.wraps.
Common patterns: Retry with exponential backoff, caching, rate limiting, authentication, input validation, deprecation warnings.
2. Context Managers
Guarantee resource cleanup. Two approaches: (1) Class-based (__enter__/__exit__), (2) @contextlib.contextmanager with yield. Use for: files, DB connections, GPU locks, temporary settings, timers.
3. Dataclasses vs namedtuple vs Pydantic vs attrs
| Feature | namedtuple | dataclass | Pydantic | attrs |
|---|---|---|---|---|
| Mutable | ✗ | ✓ | ✓ (v2) | ✓ |
| Validation | ✗ | ✗ | ✓ (auto) | ✓ (validators) |
| JSON | ✗ | ✗ | ✓ (built-in) | via cattrs |
| Performance | Fastest | Fast | Medium | Fast |
| Use for | Records | Data containers | API models | Complex classes |
4. Type Hints β Complete Guide
🎯 Why Type Hints Matter for Projects
Enable: IDE autocompletion, mypy static analysis, self-documenting code, Pydantic validation. Python doesn't enforce them at runtime — they're for tools and humans.

| Hint | Meaning | Example |
|---|---|---|
| list[int] | List of ints (3.9+) | scores: list[int] = [] |
| dict[str, Any] | Dict with str keys | config: dict[str, Any] |
| int \| None | Optional (3.10+) | x: int \| None = None |
| Callable[[int], str] | Function type | Callbacks |
| TypeVar | Generic | Generic containers |
| Literal | Exact values | Literal['train','test'] |
| TypedDict | Dict with typed keys | JSON schemas |
5. async/await β Concurrent I/O
For I/O-bound tasks: API calls, DB queries, file reads. NOT for CPU (use multiprocessing). Event loop manages coroutines cooperatively. asyncio.gather() runs concurrently. Game changer: 100 API calls in ~1s vs 100s sequentially.
6. Design Patterns for ML Projects
| Pattern | Use Case | Python Implementation |
|---|---|---|
| Strategy | Swap algorithms | Pass function/class as argument |
| Factory | Create objects by name | Registry dict: models['rf'] |
| Observer | Training callbacks | Event system with hooks |
| Pipeline | Data transformations | Chain of fit→transform |
| Singleton | Model cache, DB pool | Module-level or metaclass |
| Template | Training loop | ABC with abstract methods |
| Registry | Auto-register models | Class decorator + dict |
7. Descriptors β How @property Works
Any object implementing __get__/__set__/__delete__. @property is a descriptor. Control attribute access at class level. Used in Django ORM, SQLAlchemy, dataclass fields.
8. Metaclasses β Advanced
Classes are objects. Metaclasses define how classes behave. type is the default. Use for: auto-registration, interface enforcement, singleton. Most should use class decorators instead.
9. __slots__ for Memory Efficiency
Replaces __dict__ with fixed array. ~40% memory savings per instance. Use for millions of small objects. Trade-off: no dynamic attributes.
10. Multiprocessing for CPU-Bound Work
multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor. Each process has its own GIL. Share data via: multiprocessing.Queue, shared_memory, or serialize (pickle). Overhead: process creation ~100ms. Only use for expensive computations.
from dataclasses import dataclass, field, asdict
import json
from datetime import datetime
@dataclass
class Experiment:
    name: str
    model: str
    lr: float = 0.001
    epochs: int = 100
    batch_size: int = 32
    tags: list[str] = field(default_factory=list)
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
    metrics: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.lr <= 0:
            raise ValueError("lr must be positive")

    def save(self, path):
        with open(path, 'w') as f:
            json.dump(asdict(self), f, indent=2)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(**json.load(f))
3. Model Registry Pattern
MODEL_REGISTRY = {}

def register_model(name):
    def decorator(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator

@register_model("random_forest")
class RandomForestModel:
    def train(self, X, y): ...

@register_model("xgboost")
class XGBoostModel:
    def train(self, X, y): ...
# Create model by name from config
model = MODEL_REGISTRY[config["model_name"]]()
4. async — Parallel API Calls

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.json()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

# 100 API calls in ~1 second vs 100 seconds
results = asyncio.run(fetch_all(urls))
5. Pydantic for API Data Validation
from pydantic import BaseModel, Field, field_validator
import numpy as np

class PredictionRequest(BaseModel):
    features: list[float] = Field(..., min_length=1)
    model_name: str = "default"
    threshold: float = Field(0.5, ge=0, le=1)

    @field_validator('features')
    @classmethod
    def check_features(cls, v):
        if any(np.isnan(x) for x in v):
            raise ValueError("NaN not allowed")
        return v

# Auto-validates on creation
req = PredictionRequest(features=[1.0, 2.0, 3.0])
6. Context Manager — Timer & GPU Lock

from contextlib import contextmanager
import time

@contextmanager
def timer(name="Block"):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{name}: {elapsed:.4f}s")

with timer("Training"):
    model.fit(X_train, y_train)
`,
interview: `
🎯 Advanced Python Interview Questions
Q1: Explain MRO.
Answer: C3 Linearization for multiple inheritance. ClassName.mro() shows order. Subclasses before bases, left-to-right.
Q2: dataclass vs Pydantic?
Answer: dataclass: no validation, fast, standard library. Pydantic: auto-validation, JSON serialization, API models. Use Pydantic for external data, dataclass for internal.
Answer: It's a descriptor with __get__/__set__. Attribute access triggers descriptor protocol. Used for computed attributes and validation.
Q5: Decorator with parameters?
Answer: Three nested functions: factory(params) β decorator(func) β wrapper(*args). Use @wraps(func) always.
Q6: What is __slots__?
Answer: Fixed array instead of __dict__. ~40% less memory. No dynamic attributes. Use for millions of objects.
Q7: Explain closures with use case.
Answer: Function capturing enclosing scope variables. Use: factory functions, decorators, callbacks. make_multiplier(3) returns function multiplying by 3.
Q8: Design patterns in Python vs Java?
Answer: Python makes many patterns trivial: Strategy = pass a function. Singleton = module variable. Factory = dict of classes. Observer = list of callables. Python prefers simplicity.
`
},
"sklearn": {
concepts: `
🤖 Scikit-learn — Complete ML Engineering
⚡ The Estimator API
Estimators: fit(X, y). Transformers: transform(X). Predictors: predict(X). This consistency allows seamless swapping and composition via Pipelines.
1. Pipelines β The Foundation of Production ML
⚠️ Data Leakage — The #1 ML Mistake
Fitting scaler on ENTIRE dataset before split = test set info leaks into training. Fix: put ALL preprocessing inside Pipeline. Pipeline ensures fit only on training folds during CV.
2. ColumnTransformer β Real-World Data
Real data has mixed types. ColumnTransformer applies different transformations per column set: StandardScaler for numerics, OneHotEncoder for categoricals, TfidfVectorizer for text. All in one pipeline.
3. Custom Transformers
Inherit BaseEstimator + TransformerMixin. Implement fit(X, y) and transform(X). TransformerMixin gives fit_transform() free. Use check_is_fitted() for safety.
4. Cross-Validation Strategies
| Strategy | When | Key Point |
|---|---|---|
| KFold | General | Doesn't preserve class ratios |
| StratifiedKFold | Imbalanced classification | Preserves class distribution |
| TimeSeriesSplit | Time-ordered data | Train always before test |
| GroupKFold | Grouped data (patients) | Same group never in train+test |
| RepeatedStratifiedKFold | Robust estimation | Multiple random splits |
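Two of these strategies can be verified directly — a sketch with a tiny imbalanced dataset:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)              # imbalanced: 80/20

# StratifiedKFold keeps the class ratio in every fold
for _, test_idx in StratifiedKFold(n_splits=2).split(X, y):
    assert y[test_idx].sum() == 1            # each test fold gets one positive

# TimeSeriesSplit: training indices always precede test indices
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    assert train_idx.max() < test_idx.min()
```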
5. Hyperparameter Tuning
| Method | Pros | Cons |
|---|---|---|
| GridSearchCV | Exhaustive | Exponential with params |
| RandomizedSearchCV | Faster, continuous dists | May miss optimal |
| Optuna | Smart search, pruning | Extra dependency |
| HalvingSearchCV | Successive halving | Newer, less docs |
6. Complete ML Workflow
🎯 The Steps
1. EDA → 2. Train/Val/Test split → 3. Build Pipeline (preprocess + model) → 4. Cross-validate multiple models → 5. Select best → 6. Tune hyperparameters → 7. Final evaluation on test set → 8. Save model → 9. Deploy
7. Feature Engineering
| Transformer | Purpose |
|---|---|
| PolynomialFeatures | Interaction & polynomial terms |
| FunctionTransformer | Apply any function (log, sqrt) |
| SplineTransformer | Non-linear feature basis |
| KBinsDiscretizer | Bin continuous into categories |
| TargetEncoder | Encode categoricals by target mean |
8. Model Selection Guide
| Data Size | Model | Why |
|---|---|---|
| <1K rows | Logistic/SVM/KNN | Simple, less overfitting |
| 1K-100K | Random Forest, XGBoost | Best accuracy/speed tradeoff |
| 100K+ | XGBoost, LightGBM | Handles large data efficiently |
| Very large | SGDClassifier/online | Incremental learning |
| Tabular | Gradient Boosting | Almost always best for tabular |
9. Handling Imbalanced Data
| Strategy | How |
|---|---|
| class_weight='balanced' | Built-in for most models |
| SMOTE | Synthetic oversampling (imblearn) |
| Threshold tuning | Adjust decision threshold from 0.5 |
| Metrics | Use F1, Precision-Recall AUC (not accuracy) |
| Ensemble | BalancedRandomForest |
10. Model Persistence
joblib.dump(model, 'model.pkl') — faster than pickle for NumPy arrays. model = joblib.load('model.pkl'). Always save the entire pipeline (not just the model) to include preprocessing. Version your models with timestamps.
`,
code: `
💻 Scikit-learn Project Code
1. Production Pipeline β Complete Template
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
preprocessor = ColumnTransformer([
('num', Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
]), make_column_selector(dtype_include='number')),
('cat', Pipeline([
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
]), make_column_selector(dtype_include='object'))
])
pipe = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, n_jobs=-1))
])
# No data leakage!
scores = cross_val_score(pipe, X, y, cv=5, scoring='f1')
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")
Answer: (1) class_weight='balanced'. (2) SMOTE oversampling. (3) Adjust threshold. (4) Use F1/AUC not accuracy. (5) BalancedRandomForest.
Q7: When to use which model?
Answer: Tabular: gradient boosting (XGBoost/LightGBM). Small data: Logistic/SVM. Interpretability: Logistic/trees. Speed: LightGBM. Baseline: Random Forest.
Q8: fit() vs transform() vs predict()?
Answer: fit: learn params from data. transform: apply params. predict: generate predictions. fit on train only, transform/predict on both.
`
},
"pytorch": {
concepts: `
🔥 Deep Learning with PyTorch — Complete Guide
⚡ PyTorch Philosophy: Define-by-Run
PyTorch builds the computational graph dynamically as operations execute (eager mode). Debug with print(), breakpoints, standard Python control flow.
1. Tensors — The Foundation
Concept
What
Key Point
Tensor
N-dimensional array
Like NumPy but GPU-capable
requires_grad
Track for autograd
Only for learnable params
device
CPU or CUDA
.to('cuda') moves to GPU
.detach()
Stop gradient tracking
Use for inference/metrics
.item()
Extract scalar
Use for logging loss
.contiguous()
Ensure contiguous memory
Required after transpose/permute
2. Autograd — How Backpropagation Works
🧠 Computational Graph (DAG)
When requires_grad=True, every operation is recorded and each tensor stores a grad_fn. .backward() traverses the graph in reverse, applying the chain rule. The graph is destroyed after backward() unless retain_graph=True. Gradients ACCUMULATE: call optimizer.zero_grad() before each backward().
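The accumulation behavior can be seen directly on a toy tensor (no model or optimizer needed):

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

# First backward: d(w^2)/dw = 2w = 4
(w ** 2).sum().backward()
first = w.grad.clone()

# Second backward WITHOUT zeroing: the new gradient is ADDED to the old one
(w ** 2).sum().backward()
accumulated = w.grad.clone()

# Zero the gradient (what optimizer.zero_grad() does), then backward again
w.grad.zero_()
(w ** 2).sum().backward()
fresh = w.grad.clone()
# first = 4.0, accumulated = 8.0, fresh = 4.0
```

Forgetting the zeroing step silently mixes gradients from different batches, which is why it belongs at the top of every training-loop iteration.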
3. nn.Module — Building Blocks
Every model inherits nn.Module. Layers in __init__, computation in forward(). model.train()/model.eval() toggle BatchNorm/Dropout. model.parameters() for optimizer. model.state_dict() for save/load. Use nn.Sequential for simple stacks, nn.ModuleList/nn.ModuleDict for dynamic architectures.
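A minimal sketch of these rules (the MLP and its sizes are a hypothetical toy model):

```python
import torch
from torch import nn

class MLP(nn.Module):
    def __init__(self, d_in=8, d_hidden=16, d_out=2):
        super().__init__()
        # Layers in __init__ so their parameters are registered with the module
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):  # computation lives in forward()
        return self.net(x)

model = MLP()
model.eval()               # disables Dropout -> deterministic outputs
x = torch.randn(4, 8)
with torch.no_grad():
    out1, out2 = model(x), model(x)
# In eval() mode out1 == out2; in train() mode Dropout would make them differ
```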
Dataset: override __len__ and __getitem__. DataLoader: batching, shuffling, multi-worker. num_workers>0 for parallel loading. pin_memory=True for faster GPU transfer. Use collate_fn for variable-length sequences.
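The two-method protocol in a toy sketch (SquaresDataset is hypothetical; real datasets load files or rows in __getitem__):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    def __len__(self):
        return 10
    def __getitem__(self, idx):
        # (feature, label) pair; real datasets would read from disk here
        return torch.tensor([float(idx)]), torch.tensor(float(idx ** 2))

# DataLoader adds batching; num_workers>0 would parallelize loading
loader = DataLoader(SquaresDataset(), batch_size=4, shuffle=False)
batches = list(loader)   # 3 batches of sizes 4, 4, 2
```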
6. Learning Rate Scheduling
Scheduler
Strategy
When
StepLR
Decay every N epochs
Simple baseline
CosineAnnealingLR
Cosine decay
Standard for vision
OneCycleLR
Warmup + decay
Best for fast training
ReduceLROnPlateau
Decay on stall
When loss plateaus
LinearLR
Linear warmup
Transformer models
7. Mixed Precision Training (AMP)
torch.cuda.amp: forward in float16 (2x faster), gradients in float32. GradScaler prevents underflow. 2-3x speedup. Standard practice for any GPU training.
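A CPU-safe sketch of the AMP step (autocast + GradScaler); without CUDA both are disabled and this degrades to a plain fp32 step. Newer PyTorch also offers torch.amp.GradScaler('cuda'); the torch.cuda.amp spelling is used here for broader compatibility:

```python
import torch
from torch import nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = nn.Linear(8, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# GradScaler is a no-op when disabled (e.g. no CUDA available)
scaler = torch.cuda.amp.GradScaler(enabled=(device == 'cuda'))

x = torch.randn(16, 8, device=device)
y = torch.randn(16, 1, device=device)

opt.zero_grad()
with torch.autocast(device_type=device, enabled=(device == 'cuda')):
    loss = nn.functional.mse_loss(model(x), y)   # fp16 forward on GPU
scaler.scale(loss).backward()   # scaled loss prevents fp16 gradient underflow
scaler.step(opt)                # unscales gradients, then optimizer step
scaler.update()                 # adjusts the scale factor for the next step
```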
8. Transfer Learning Patterns
Load pretrained → Freeze base → Replace head → Fine-tune with smaller LR. Discriminative LR: lower LR for earlier layers. Progressive unfreezing: unfreeze layers one at a time. Both work better than fine-tuning everything at once.
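A sketch of freezing plus discriminative learning rates, with a stand-in Sequential in place of a real pretrained backbone (sizes and LRs are illustrative):

```python
import torch
from torch import nn

# Stand-in for a pretrained backbone + a fresh task head (hypothetical sizes)
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
head = nn.Linear(64, 10)

# 1) Freeze the pretrained base
for p in backbone.parameters():
    p.requires_grad = False

# 2) Discriminative LRs: much smaller LR for the base (once it is unfrozen)
optimizer = torch.optim.AdamW([
    {'params': backbone.parameters(), 'lr': 1e-5},
    {'params': head.parameters(),     'lr': 1e-3},
])

frozen = sum(1 for p in backbone.parameters() if not p.requires_grad)
```

Progressive unfreezing then just flips requires_grad back to True, layer group by layer group, over the course of training.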
9. Distributed Training (DDP)
DistributedDataParallel: each GPU runs model copy, gradients averaged via all-reduce. Near-linear scaling. Use torchrun to launch. DistributedSampler for data splitting.
10. Debugging & Profiling
Tool
Purpose
register_forward_hook
View intermediate activations
register_backward_hook
Monitor gradient magnitudes
torch.profiler
GPU/CPU profiling
torch.cuda.memory_summary()
GPU memory debugging
detect_anomaly()
Find NaN/Inf sources
11. torch.compile (2.x)
JIT compiles model for 30-60% speedup. model = torch.compile(model). Uses TorchDynamo + Triton. Works on existing code. The future of PyTorch performance.
Q5: How many DataLoader workers (num_workers)?
Answer: Rule of thumb: 4 × num_gpus. Too many = CPU overhead. pin_memory=True for faster transfers. Profile to find the sweet spot.
Q6: torch.compile vs eager?
Answer: compile JITs model via TorchDynamo+Triton. 30-60% faster. One line change. The future of PyTorch performance.
Q7: How to save/load models?
Answer: state_dict (weights only) vs full checkpoint (weights + optimizer + epoch). Use state_dict for inference, checkpoint for resuming.
Q8: Mixed precision β how and why?
Answer: autocast(fp16 forward) + GradScaler(fp32 grads). 2-3x speedup. Minimal accuracy loss. Standard for GPU training.
`
},
"tensorflow": {
concepts: `
🧠 TensorFlow & Keras — Complete Guide
⚡ TF2 = Eager by Default + @tf.function for Speed
TF2 defaults to eager mode (like PyTorch). @tf.function compiles to graph for production. Keras is the official API. TF handles the full lifecycle: train → save → serve → monitor.
1. Three Model APIs
API
Use Case
Flexibility
Sequential
Linear stack
Low
Functional
Multi-input/output, branching
Medium (recommended)
Subclassing
Custom forward logic
High
2. tf.data Pipeline
Chains transformations lazily. .map(), .batch(), .shuffle(), .prefetch(AUTOTUNE). Prefetching overlaps loading with GPU execution. .cache() for small datasets. .interleave() for reading multiple files. TFRecord format for large datasets.
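A toy sketch of the lazy chain, using an in-memory range instead of real files (assumes TF 2.x):

```python
import tensorflow as tf

ds = (tf.data.Dataset.range(10)
        .map(lambda x: x * 2)                 # lazy transformation
        .shuffle(buffer_size=10, seed=42)     # buffered shuffle
        .batch(4)
        .prefetch(tf.data.AUTOTUNE))          # overlap loading with compute

batches = [b.numpy().tolist() for b in ds]    # execution happens only here
```

Nothing runs until the dataset is iterated; each stage returns a new lazy dataset, which is what makes .prefetch() able to pipeline loading behind GPU work.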
3. Callbacks — Training Hooks
Callback
Purpose
ModelCheckpoint
Save best model
EarlyStopping
Stop when metric plateaus
ReduceLROnPlateau
Reduce LR when stuck
TensorBoard
Visualize metrics
CSVLogger
Log to CSV
LambdaCallback
Custom per-epoch logic
4. GradientTape — Custom Training
Record ops → compute gradients → apply. Use for: GANs, RL, custom losses, gradient penalty, multi-loss weighting. Same concept as PyTorch's manual loop.
5. @tf.function — Production Speed
Trace Python → TF graph. Benefits: optimized execution, XLA, export. Gotchas: Python side effects run only during tracing. Use tf.print() in graphs.
6. SavedModel — Universal Deployment
model.save('path') exports architecture + weights + computation. Ready for: TF Serving (production), TF Lite (mobile), TF.js (browser). One model, any platform.
7. Keras Tuner — Automated Hyperparameter Search
Build a model function → the Tuner searches the space. Strategies: Random, Hyperband, Bayesian. Integrates with TensorBoard. Alternative to Optuna for Keras models.
8. TF vs PyTorch — Decision Guide
Choose TF When
Choose PyTorch When
Production deployment at scale
Research & prototyping
Mobile (TFLite mature)
Hugging Face ecosystem
TPU training
GPU research
Edge devices
Custom architectures
Browser (TF.js)
Academic papers
`,
code: `
💻 TensorFlow Project Code
1. Functional API — Multi-Input Model
import tensorflow as tf
from tensorflow import keras
text_input = keras.Input(shape=(100,), name='text')
num_input = keras.Input(shape=(5,), name='features')
x1 = keras.layers.Embedding(10000, 64)(text_input)
x1 = keras.layers.GlobalAveragePooling1D()(x1)
x2 = keras.layers.Dense(32, activation='relu')(num_input)
combined = keras.layers.Concatenate()([x1, x2])
x = keras.layers.Dense(64, activation='relu')(combined)
x = keras.layers.Dropout(0.3)(x)
output = keras.layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs=[text_input, num_input], outputs=output)
@tf.function
def train_step(model, X, y, optimizer, loss_fn):
    with tf.GradientTape() as tape:
        preds = model(X, training=True)
        loss = loss_fn(y, preds)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
📦 Production Python — Complete Engineering Guide
⚡ Production = Reliability + Reproducibility + Observability
Production code must be tested (pytest), typed (mypy), logged (structured), packaged (pyproject.toml), containerized (Docker), and monitored (metrics). The gap between notebook and production is enormous.
1. pytest — Professional Testing
Feature
Purpose
Example
fixtures
Reusable test setup
@pytest.fixture
parametrize
Many inputs, same test
@pytest.mark.parametrize
conftest.py
Shared fixtures
DB connections, mock data
monkeypatch
Override functions/env
Mock API calls
tmp_path
Temp directory
Test file I/O
markers
Tag tests
pytest -m "not slow"
coverage
Measure test coverage
pytest --cov
2. Testing ML Code
🎯 What to Test in ML
Unit: data transforms, feature engineering, loss functions. Integration: full pipeline end-to-end. Model: output shape, range, determinism with seed. Data: schema validation, distribution shifts, missing patterns.
3. Logging Best Practices
Level
When
DEBUG
Tensor shapes, intermediate values
INFO
Training started, epoch complete
WARNING
Unexpected but handled (fallback used)
ERROR
Model load failure, API error
CRITICAL
OOM, GPU crash
Never use print(). Use structured logging (JSON format) in production — parseable by log aggregators (ELK, Datadog).
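A minimal structured-logging sketch using only the stdlib (the JsonFormatter class and the 'trainer' logger name are illustrative; production setups add timestamps, trace IDs, etc.):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    # One JSON object per line -> parseable by ELK/Datadog
    def format(self, record):
        return json.dumps({
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
        })

logger = logging.getLogger('trainer')      # hypothetical logger name
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.info('epoch complete')   # emits: {"level": "INFO", "logger": "trainer", ...}
```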
4. FastAPI for Model Serving
Modern async framework. Auto-generates OpenAPI docs. Pydantic validation. Deploy with Uvicorn + Docker. Add: health checks, input validation, error handling, rate limiting, request logging.
GitHub Actions: lint (ruff) → type check (mypy) → test (pytest) → build (Docker) → deploy. Add a model validation gate: the new model must beat the baseline on test metrics before deployment.
9. Code Quality Tools
Tool
Purpose
ruff
Fast linter + formatter (replaces black, isort, flake8)
mypy
Static type checking
pre-commit
Git hooks for auto-formatting
pytest-cov
Test coverage
bandit
Security linting
10. MLOps — Model Lifecycle
Tool
Purpose
MLflow
Experiment tracking, model registry
DVC
Data versioning (like Git for data)
Weights & Biases
Experiment tracking, visualization
Evidently
Data drift & model monitoring
Great Expectations
Data validation
11. Database for ML Projects
DB
Use Case
Python Library
SQLite
Local, small data, prototyping
sqlite3 (built-in)
PostgreSQL
Production, ACID, JSON
psycopg2, SQLAlchemy
Redis
Caching, queues, sessions
redis-py
MongoDB
Flexible schema, documents
pymongo
Pinecone/Weaviate
Vector search (embeddings)
Official SDKs
`,
code: `
💻 Production Python Project Code
1. pytest — Complete ML Testing
import pytest
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# conftest.py — shared fixtures
@pytest.fixture
def sample_data():
    np.random.seed(42)
    X = np.random.randn(100, 10)
    y = np.random.randint(0, 2, 100)
    return X, y

@pytest.fixture
def trained_model(sample_data):
    X, y = sample_data
    model = RandomForestClassifier(n_estimators=10)
    model.fit(X, y)
    return model

# Test multiple models with one function
@pytest.mark.parametrize("model_cls", [
    LogisticRegression, RandomForestClassifier, GradientBoostingClassifier
])
def test_model_output(model_cls, sample_data):
    X, y = sample_data
    model = model_cls()
    model.fit(X, y)
    preds = model.predict(X)
    assert preds.shape == y.shape
    assert set(np.unique(preds)).issubset({0, 1})

# Test the data pipeline (assumes a `pipeline` fixture defined in conftest.py)
def test_pipeline_no_leakage(sample_data, pipeline):
    X, y = sample_data
    scores = cross_val_score(pipeline, X, y, cv=3)
    assert all(0 <= s <= 1 for s in scores)
GIL prevents true multi-threading for CPU-bound Python. BUT: NumPy, Pandas, scikit-learn release the GIL during C operations. Python 3.13: experimental free-threaded CPython (no-GIL).
Task Type
Solution
Why
I/O-bound
asyncio / threading
GIL released during I/O
CPU-bound Python
multiprocessing
Separate processes, separate GIL
CPU-bound NumPy
threading OK
NumPy releases GIL
Many tasks
concurrent.futures
Simple Pool interface
3. Numba — JIT Compilation
@numba.jit(nopython=True): compile to machine code. 10-100x speedup for loops. Supports NumPy, math. @numba.vectorize: custom ufuncs. @cuda.jit: GPU kernels. Best for: tight loops that can't be vectorized.
4. Dask — Parallel Computing
Pandas/NumPy API for data bigger than memory. dask.dataframe, dask.array, dask.delayed. Lazy execution. Task graph scheduler. Scales from laptop to cluster. Alternative: Polars for single-machine parallel.
5. Ray — Distributed ML
General-purpose distributed framework. Ray Tune (hyperparameter tuning), Ray Serve (model serving), Ray Data. Easier than Dask for ML. Used by OpenAI, Uber.
array module: For simple typed arrays (no NumPy overhead)
7. Caching Strategies
Tool
Scope
Use Case
@functools.lru_cache
In-memory, function
Expensive computations
@functools.cache
Unbounded cache
Pure functions
joblib.Memory
Disk cache
Data processing pipelines
Redis
External cache
Multi-process, API responses
diskcache
Pure Python disk
Simple persistent cache
8. Python 3.12-3.13 Performance
3.12: 5-15% faster, better errors, per-interpreter GIL. 3.13: Free-threaded (no-GIL experimental), JIT compiler (experimental). The future of Python performance is exciting.
9. Common Performance Anti-Patterns
Anti-Pattern
Fix
Speedup
for row in df.iterrows()
Vectorized ops
100-1000x
s += "text" in loop
''.join(parts)
100x
x in big_list
x in big_set
1000x
Python list of floats
NumPy array
50-100x
Imports inside hot functions
Import once at module top
Variable
Not using built-ins
sum(), min()
5-10x
`,
code: `
💻 Performance Code Examples
1. Profiling Workflow
import cProfile, pstats
# Profile and find bottlenecks
with cProfile.Profile() as pr:
    result = expensive_pipeline(data)
stats = pstats.Stats(pr)
stats.sort_stats('cumulative')
stats.print_stats(10)  # Top 10 slow functions

# Memory profiling
import tracemalloc
tracemalloc.start()
# ... process data ...
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('filename')[:5]:
    print(stat)
2. Numba JIT
import numba
import numpy as np

@numba.jit(nopython=True)
def pairwise_distance(X):
    n = X.shape[0]
    D = np.zeros((n, n))              # zeros, so the diagonal is defined
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.0
            for k in range(X.shape[1]):
                d += (X[i, k] - X[j, k]) ** 2
            D[i, j] = D[j, i] = d ** 0.5
    return D
# 100x faster than pure Python!
3. concurrent.futures — Parallel Processing
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# CPU-bound: processes
with ProcessPoolExecutor(max_workers=8) as ex:
    results = list(ex.map(process_chunk, data_chunks))

# I/O-bound: threads
with ThreadPoolExecutor(max_workers=32) as ex:
    results = list(ex.map(fetch_url, urls))
4. Dask for Large Data
import dask.dataframe as dd
# Read 100GB of CSVs — lazy!
ddf = dd.read_csv('data/*.csv')
# Same Pandas API — but parallel
result = (
ddf.groupby('category')
.agg({'revenue': 'sum', 'qty': 'mean'})
.compute() # Only here does it execute
)
5. functools.lru_cache — Memoization
from functools import lru_cache

@lru_cache(maxsize=1024)
def expensive_feature(customer_id: int) -> dict:
    # DB query, computation, etc.
    return compute_features(customer_id)

# First call: computes. Second call: instant from the cache
print(expensive_feature.cache_info())  # hits, misses, size
6. __slots__ for Memory
class Point:
    __slots__ = ('x', 'y', 'z')

    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

# 1M instances: ~60MB vs ~160MB without __slots__
points = [Point(i, i*2, i*3) for i in range(1_000_000)]
7. String Performance
# ❌ O(n²) — creates a new string each iteration
result = ""
for word in words:
    result += word + " "

# ✅ O(n) — single allocation at the end
result = " ".join(words)
`,
interview: `
🎯 Performance Interview Questions
Q1: Why the GIL?
Answer: Simplifies reference counting. Makes single-threaded faster. Easier C extensions. Python 3.13 has experimental no-GIL mode.
Answer: Dask: Pandas API, Python-native. Ray: ML-focused. Spark: JVM, TB+ data. Python ML: Dask/Ray. Big data ETL: Spark.
Q7: Top 3 Python performance tips?
Answer: (1) Use sets not lists for lookups. (2) NumPy not Python loops. (3) Generator expressions for memory. Bonus: lru_cache for expensive functions.
Q8: How does lru_cache work?
Answer: Hash-based memoization. Args must be hashable. maxsize=None for unlimited. cache_info() shows hits/misses. Perfect for pure functions.