kevinkyi/Homework2_Multishot_Prompting


	---
	library_name: transformers
	pipeline_tag: text-classification
	license: mit
	tags:
	- prompting
	- zero-shot
	- few-shot
	- football
	- sentiment
	- adaptive-retrieval
	model_name: Football Sentiment Prompting (0/1/5-shot)
	language:
	- en
	datasets:
	- james-kramer/football_news
	inference: false
	---

	# Method Card — Football Sentiment Prompting (0/1/5-shot)

	## TL;DR
	We compare zero-shot, adaptive one-shot, and adaptive 5-shot prompting for binary sentiment classification of football news.
	We use the same train/val/test splits as the fine-tuning setup, report metrics and confusion matrices, and discuss the quality/latency/cost tradeoffs.

	## Data
	- Dataset: `james-kramer/football_news` (Hugging Face)
	- Task: Binary sentiment (0=negative, 1=positive)
	- Splits: Stratified 80/10/10
	- Cleaning: strip text; drop empty/NA
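
The cleaning and splitting steps listed above can be sketched as follows. This is a minimal, dependency-free illustration, not the author's code: the helper names `clean` and `stratified_split` and the toy rows are hypothetical, and the real pipeline may well use a library routine such as scikit-learn's `train_test_split` with its `stratify` argument instead.

```python
import random
from collections import defaultdict

def clean(rows):
    """Strip whitespace and drop rows whose text is missing or empty."""
    cleaned = []
    for text, label in rows:
        if text is None:
            continue
        text = str(text).strip()
        if text:
            cleaned.append((text, label))
    return cleaned

def stratified_split(rows, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split (text, label) rows into train/val/test, preserving label ratios."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(fractions[0] * len(group))
        n_val = round(fractions[1] * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to test
    return train, val, test

# Toy rows standing in for the real dataset (hypothetical content).
rows = [(f" good result {i} ", 1) for i in range(50)]
rows += [(f"bad result {i}", 0) for i in range(50)]
rows += [("   ", 1), (None, 0)]  # both are removed by clean()
train, val, test = stratified_split(clean(rows))
print(len(train), len(val), len(test))
```

Splitting each label group separately is what keeps the positive/negative ratio the same in all three splits.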

	## Models / APIs
	- LLM used: gpt-4o-mini (OpenAI API, September 2025 snapshot)
	- Similarity backend: sklearn TF-IDF + cosine similarity

	## Prompting Strategy
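	- Zero-shot: instruction plus an output constraint (answer with a single word, "positive" or "negative").
	- Adaptive one-shot: retrieve the most similar train example (TF-IDF cosine similarity) and include it as an exemplar.
	- Adaptive 5-shot: retrieve the top-5 most similar train examples and include them as exemplars.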
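
To make the adaptive retrieval step concrete, here is a small sketch of TF-IDF plus cosine-similarity exemplar selection. It is a dependency-free stand-in for the sklearn backend named above, not the contents of `selection.py`: the whitespace tokenizer, the smoothed IDF formula, the function names, and the toy training pairs are all assumptions made for illustration. In scikit-learn the equivalent pieces would be `TfidfVectorizer` and `cosine_similarity`.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency over the training sentences."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(text, idf):
    """Sparse TF-IDF vector; terms unseen in training are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {term: tf * idf[term] for term, tf in counts.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, train, k=1):
    """Return the k (text, label) training pairs most similar to the query."""
    idf = fit_idf([text for text, _ in train])
    q = tfidf(query, idf)
    scored = [(cosine(q, tfidf(text, idf)), i) for i, (text, _) in enumerate(train)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest similarity first
    return [train[i] for _, i in scored[:k]]

# Hypothetical training pairs; exemplars come from the train split only.
train = [
    ("the striker scored a brilliant winning goal", 1),
    ("the keeper made a costly error in defeat", 0),
    ("fans celebrate a brilliant title win", 1),
    ("the club suffered another heavy defeat", 0),
]
print(retrieve("a heavy defeat for the club", train, k=1))
```

Calling `retrieve(..., k=1)` gives the one-shot exemplar and `k=5` gives the 5-shot set; the IDF weights are fitted on the training sentences only.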

	## Prompt Templates
	### Zero-shot
	You are a concise sentiment classifier.
	Decide if the following football-related sentence is positive or negative.
	Only answer with a single word: "positive" or "negative".

	Sentence: "text",
	Answer:

	### Adaptive One-shot
	You are a concise sentiment classifier for football news.
	Decide if each sentence is positive or negative. Only answer with one word.

	Example: [],
	Sentence: "ex_text",
	Label: "ex_label",

	Now classify the target sentence.
	Sentence: "text",
	Answer:

	### Adaptive K-shot (e.g., K=5)
	You are a concise sentiment classifier for football news.
	Decide if the sentence is positive or negative. Only answer with one word.
	examples: [],
	Sentence: "text",
	Answer:
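
As an illustration of how templates like these might be filled and how the one-word reply could be turned back into a label, consider the sketch below. It assumes that `text`, `ex_text`, and `ex_label` are placeholders substituted at run time and that "negative"/"positive" map to 0/1 as in the Data section. The function names and the exact line layout are hypothetical and may differ from the files under `prompts/`.

```python
LABEL_WORDS = {0: "negative", 1: "positive"}

def build_prompt(text, exemplars=()):
    """Assemble a zero-shot (no exemplars) or K-shot prompt string."""
    lines = [
        "You are a concise sentiment classifier for football news.",
        "Decide if the sentence is positive or negative. Only answer with one word.",
        "",
    ]
    for ex_text, ex_label in exemplars:
        lines += [f'Sentence: "{ex_text}"', f"Label: {LABEL_WORDS[ex_label]}", ""]
    lines += [f'Sentence: "{text}"', "Answer:"]
    return "\n".join(lines)

def parse_answer(reply):
    """Map the model's one-word reply to 1/0; return None if unrecognized."""
    word = reply.strip().strip('".').lower()
    if word == "positive":
        return 1
    if word == "negative":
        return 0
    return None

prompt = build_prompt("A late winner seals the title", [("A heavy defeat at home", 0)])
print(prompt)
print(parse_answer(" Positive.\n"))
```

Returning `None` for unrecognized replies keeps malformed outputs visible instead of silently counting them as one of the two classes.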


	## Evaluation Protocol
	- Metrics: accuracy, precision, recall, F1; confusion matrix
	- Latency: avg wall-clock per example
	- Seed: 42
	- Reproducibility: prompts/selection/eval code in this repo
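
The metrics listed above can all be derived from a 2x2 confusion matrix. The sketch below assumes the common convention of rows as true labels and columns as predicted labels in the order [0, 1]; the card does not state its convention, so treat that as an assumption. The labels are synthetic, chosen only to produce the matrix [[5, 0], [2, 3]], and the function names are not taken from `evaluate_prompting.py`.

```python
def confusion_matrix(y_true, y_pred):
    """2x2 matrix: rows = true label, columns = predicted label, order [0, 1]."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def binary_metrics(cm):
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Synthetic labels (assumed layout) that yield the matrix [[5, 0], [2, 3]].
y_true = [0] * 5 + [1] * 5
y_pred = [0] * 5 + [0, 0, 1, 1, 1]
cm = confusion_matrix(y_true, y_pred)
metrics = binary_metrics(cm)
print(cm, metrics)
```

For this matrix the formulas give accuracy 8/10 = 0.8, precision 3/3 = 1.0, recall 3/5 = 0.6, and F1 = 0.75. Accuracy and F1 are unchanged if the matrix is transposed, so those two figures do not depend on the row/column convention.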

	## Results (Val/Test)
	- Val:
	  - Zero-shot: acc 0.8, F1 0.750, CM [[5, 0], [2, 3]], ~0.416 s/ex
	  - One-shot: acc 0.5, F1 0.286, CM [[4, 1], [4, 1]], ~0.304 s/ex
	  - 5-shot: acc 0.8, F1 0.750, CM [[5, 0], [2, 3]], ~0.451 s/ex
	- Test:
	  - Zero-shot: acc 0.7, F1 0.727, CM [[3, 2], [1, 4]], ~0.282 s/ex
	  - One-shot: acc 0.7, F1 0.727, CM [[3, 2], [1, 4]], ~0.354 s/ex
	  - 5-shot: acc 0.7, F1 0.571, CM [[5, 0], [3, 2]], ~0.449 s/ex

	## Tradeoffs
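	- Quality: on Val, zero-shot and 5-shot tie (acc 0.8, F1 0.75) and both beat one-shot (acc 0.5). On Test, all three reach acc 0.7, but 5-shot trails on F1 (0.571 vs 0.727). Adding exemplars gave no consistent gain on this dataset.
	- Latency: generally increases with K. On Test it rises from ~0.28 s/ex (zero-shot) to ~0.35 s/ex (one-shot) to ~0.45 s/ex (5-shot); on Val, one-shot (~0.30 s/ex) was faster than zero-shot (~0.42 s/ex), so per-call variance is comparable to the effect of K at this sample size.
	- Cost: scales roughly linearly with prompt length (token count). For the ~20 examples evaluated here, 5-shot prompts used ~3× the tokens of zero-shot prompts.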
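
Latency and relative prompt size of the kind discussed above can be measured with a few lines of code. This is a rough sketch: `fake_classify` is a stand-in for the real API call, the prompts are shortened hypothetical strings, and whitespace-separated words are only a crude proxy for the tokens the API actually bills.

```python
import time

def average_latency(classify, texts):
    """Mean wall-clock seconds per call of classify(text)."""
    start = time.perf_counter()
    for text in texts:
        classify(text)
    return (time.perf_counter() - start) / len(texts)

def relative_prompt_size(prompt, baseline):
    """Ratio of whitespace-token counts, a crude proxy for API token usage."""
    return len(prompt.split()) / len(baseline.split())

def fake_classify(text):
    # Stand-in for the API call; a real run would send the prompt to the LLM.
    time.sleep(0.01)
    return 1

# Shortened, hypothetical prompts used only to illustrate the size ratio.
zero_shot = "Classify the sentence. Sentence: late winner seals the title Answer:"
exemplar = "Sentence: example text here Label: positive"
five_shot = " ".join([exemplar] * 5) + " " + zero_shot

print(average_latency(fake_classify, ["a", "b", "c"]))
print(relative_prompt_size(five_shot, zero_shot))
```

With real prompts the ratio depends on how long the retrieved exemplars are relative to the fixed instruction text, so it will differ from this toy value.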

	## Limits & Risks
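	- Leakage: exemplars are retrieved from the train split only, so val/test sentences never appear in a prompt.
	- Bias: sports phrasing may sway sentiment judgments.
	- Small data: each split's confusion matrix covers only 10 examples, so a single changed prediction moves accuracy by 0.1 and the results are unstable.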

	## Reproducibility
	- Code: `prompts/`, `selection.py`, `evaluate_prompting.py`
	- Seed: 42
	- Python ≥ 3.10

	## Usage Disclosure
	This card and pipeline were organized with GenAI assistance; experiments and results were implemented and verified by the author.