Grocery Self Checkout Item Detection

By: Daniel Bagcal

1. Model Description

Context

This model is a YOLOv11 object detector fine-tuned from COCO-pretrained weights to identify 17 grocery product categories in a retail self-checkout environment. It detects common grocery items from an overhead/top-down perspective, mimicking the view of a camera mounted above a self-checkout station. The intended use case is store-specific automatic item detection, assisting with item counting, checkout verification, and loss/theft prevention. The model is best suited for stores whose inventory closely matches the training data, as performance will degrade on unseen brands or product types not represented in it.

2. Training Data

The training dataset is a subset of the RPC-Dataset (rpc-dataset.github.io), a large-scale retail product checkout dataset consisting of 83,699 images across 200 grocery product classes. The working dataset is a subset of this, consisting of 9,616 images across the same 200 classes, sourced via Roboflow (universe.roboflow.com/groceries-jxjfd/grocery-goods).

Annotation Process

The original RPC-Dataset contained 200 product-specific classes, where each class represented a specific product variant (e.g., 100_milk, 101_milk, 102_milk). These classes were collapsed into 17 broader product categories to improve generalization, reduce class imbalance, and better reflect how a self-checkout system categorizes items by type rather than specific SKU. For example, all milk classes were merged into a single milk class, reducing the total class count from 200 to 17. Random samples were reviewed after relabeling to validate annotation quality, with no corrections needed.
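The relabeling step described above can be sketched as a simple lookup from SKU-level names to category names. The helper below is a hypothetical illustration following the `100_milk` naming pattern mentioned in the text, not the actual relabeling script:

```python
# Collapse SKU-level class names (e.g. "100_milk", "101_milk") into broad
# categories by stripping the numeric SKU prefix. Hypothetical sketch of the
# relabeling described above.
def collapse_class(sku_name: str) -> str:
    """Map a product-specific label like '101_milk' to its category 'milk'."""
    prefix, _, category = sku_name.partition("_")
    return category if prefix.isdigit() else sku_name

labels = ["100_milk", "101_milk", "102_milk", "37_drink"]
print(sorted({collapse_class(label) for label in labels}))  # ['drink', 'milk']
```

Because `partition` splits only at the first underscore, multi-word categories such as `instant_noodles` survive intact.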

The final post-annotation dataset used for training is available at https://app.roboflow.com/bdata-497-advanced-topics-in-dv-nqagm/grocery-goods-ezyyb.

Class Distribution

Class Name Total Count Training Count Validation Count Test Count
tissue 4,813 3,369 963 481
dessert 4,372 3,060 874 437
drink 3,760 2,632 752 376
seasoner 3,199 2,239 640 320
puffed_food 3,156 2,209 631 316
chocolate 3,146 2,202 629 315
instant_noodles 3,033 2,123 607 303
canned_food 2,714 1,900 543 271
milk 2,517 1,762 503 252
candy 2,499 1,749 500 250
personal_hygiene 2,495 1,747 499 250
instant_drink 2,492 1,744 498 249
alcohol 2,381 1,667 476 238
dried_fruit 2,368 1,658 474 237
dried_food 2,222 1,555 444 222
gum 1,923 1,346 385 192
stationery 1,466 1,026 293 147

Train/Validation/Test Split

Split Ratio Count
Train 70% 36,928
Validation 20% 10,505
Test 10% 5,276
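As a sanity check, the counts in the split table can be verified against the stated 70/20/10 ratios (a minimal sketch; the counts are taken directly from the table above):

```python
# Verify the train/val/test counts roughly match a 70/20/10 split.
splits = {"train": 36_928, "val": 10_505, "test": 5_276}
total = sum(splits.values())  # 52,709 total, per the table
for name, count in splits.items():
    print(f"{name}: {count / total:.1%}")
```

The ratios come out to roughly 70.1% / 19.9% / 10.0%, consistent with the table.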

Data Augmentation

The following augmentations were applied during training to simulate real-world checkout conditions:

Augmentation Purpose
Rotation Items placed on belt in any orientation
Horizontal/Vertical Flip Additional orientation variation
Mosaic Multiple items on belt simultaneously
HSV Shift (hue, saturation, value) Simulate varied store lighting
Translation & Scale Camera height and position variation
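In Ultralytics YOLO, augmentations like those in the table are typically controlled through training hyperparameters. The values below are a hypothetical sketch mirroring the table, not the exact settings used for this model:

```python
# Hypothetical Ultralytics-style augmentation hyperparameters mirroring the
# table above (the actual values used for this model are not stated).
augmentation = dict(
    degrees=180.0,   # rotation: items can sit at any orientation on the belt
    fliplr=0.5,      # horizontal flip probability
    flipud=0.5,      # vertical flip probability
    mosaic=1.0,      # mosaic: composites simulate multiple items at once
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # HSV shift for lighting variation
    translate=0.1, scale=0.5,           # camera height/position variation
)
```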

Known Biases and Limitations

  • Dataset is predominantly composed of Chinese grocery product packaging, limiting generalizability to Western or European retail environments
  • Fresh and unpackaged produce (such as fruits or vegetables) are not represented in the dataset
  • Limited lighting variation: real checkout environments may have inconsistent lighting that is not well represented in the training images

3. Training Procedure

  • Framework: Ultralytics YOLOv11n
  • Hardware: A100 GPU in Google Colab
  • Epochs: 50
  • Batch Size: 64
  • Image Size: 640x640
  • Patience: 50
  • Training Time: ~36.5 minutes (2,189.69 seconds)
  • Preprocessing: Augmentations applied at training time (see Data Augmentation section)
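The training procedure above can be sketched with the Ultralytics Python API as a configuration fragment. The dataset config path `grocery.yaml` is hypothetical; the other arguments follow the settings listed above:

```python
from ultralytics import YOLO

# Fine-tune COCO-pretrained YOLOv11n on the 17-class grocery dataset.
# "grocery.yaml" is a hypothetical data config path.
model = YOLO("yolo11n.pt")   # COCO-pretrained nano weights
results = model.train(
    data="grocery.yaml",
    epochs=50,
    batch=64,
    imgsz=640,
    patience=50,             # early-stopping patience
)
```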

4. Evaluation Results

Comprehensive Metrics

All files output from the runs/detect/train folder are provided in the Files section. The model was evaluated on a held-out test set of 1,928 images containing 9,825 instances across all 17 classes. The model demonstrates strong performance across all metrics, achieving near-perfect precision and recall with a mAP50 of 0.992.

Metric Value
Precision 0.989
Recall 0.985
mAP50 0.992
mAP50-95 0.862
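For reference, the aggregate F1 score implied by the precision and recall in the table can be computed directly as their harmonic mean:

```python
# F1 = 2PR / (P + R), using the aggregate precision and recall above.
precision, recall = 0.989, 0.985
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # ≈ 0.987
```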

Per-Class Breakdown

Class Test Images Instances Precision Recall mAP50 mAP50-95
all 1,928 9,825 0.989 0.985 0.992 0.862
alcohol 252 503 0.996 0.986 0.995 0.864
candy 257 502 0.988 0.980 0.990 0.815
canned_food 252 545 0.982 0.996 0.990 0.877
chocolate 360 700 0.982 0.984 0.993 0.833
dessert 389 819 0.995 0.991 0.995 0.881
dried_food 244 415 0.982 0.993 0.995 0.877
dried_fruit 263 516 0.986 0.986 0.995 0.887
drink 360 796 0.982 0.990 0.994 0.871
gum 183 360 0.989 0.979 0.991 0.812
instant_drink 271 554 0.984 0.982 0.994 0.886
instant_noodles 302 614 0.989 0.997 0.995 0.888
milk 256 491 0.996 0.990 0.994 0.859
personal_hygiene 255 506 0.990 0.982 0.994 0.854
puffed_food 324 654 0.996 1.000 0.995 0.907
seasoner 302 572 0.986 0.965 0.993 0.849
stationery 162 300 0.986 0.957 0.972 0.785
tissue 482 978 0.999 0.994 0.995 0.909

Visual Examples of Classes

Class Examples

The grid above shows representative examples from the training dataset, organized alphabetically by class (left to right, top to bottom, following the order of the per-class breakdown), with the final three images showing multi-item/multi-class detections. Because many product variants were merged into each category, the examples within a class vary widely in appearance.

Position Class Features
Row 1, Col 1 alcohol Glass bottles, aluminum cans/beer bottles
Row 1, Col 2 candy Small packaging, often cylindrical or box-shaped
Row 1, Col 3 canned_food Cylindrical canned foods
Row 1, Col 4 chocolate Flat packaging, things like Snickers bars
Row 2, Col 1 dessert Cup, boxed, flat packaging. Varies widely.
Row 2, Col 2 dried_food Flat sealed bags, often with food photography on packaging
Row 2, Col 3 dried_fruit Flat sealed bags, clear bags, colored packaging
Row 2, Col 4 drink Plastic bottles such as sodas, aluminum cans
Row 3, Col 1 gum Small box or pouch packaging
Row 3, Col 2 instant_drink Varies widely, small cylinders, boxes, sealed packs
Row 3, Col 3 instant_noodles Like instant ramen packs or cup-noodle packs
Row 3, Col 4 milk Small milk cartons, slim box packaging, bottled packs
Row 4, Col 1 personal_hygiene Items like toothbrushes, mouthwash, toothpaste
Row 4, Col 2 puffed_food Inflated bags such as Cheetos, other chip bags
Row 4, Col 3 seasoner Items vary from soy sauce to small seasoning packets
Row 4, Col 4 stationery Items such as notebooks, paper, pencils, etc.
Row 5, Col 1 tissue Small rectangular box packaging with soft branding
Row 5, Col 2-4 multi-class Multiple items detected simultaneously in a single scene

Key Visualizations

Confusion Matrix

[image: confusion matrix]

F1 Confidence Curve

[image: BoxF1 confidence curve]

Training & Validation Loss Curves

[image: results — training and validation loss curves]

Performance Analysis

The model performs consistently well across all 17 classes on the held-out test set, with the lowest mAP50 being stationery at 0.972. The strongest performing classes were tissue and puffed_food (mAP50-95: 0.909, 0.907), likely due to their distinct packaging shapes and high training sample counts. The weakest performing class was stationery (mAP50: 0.972, mAP50-95: 0.785), which is also the smallest class at 1,466 training images, suggesting performance is partially limited by sample size.

5. Limitations and Biases

D2S Wild Image Test Sample (Failure Case)

When tested on the D2S Dataset, the model struggled significantly with unseen products and environments. The image below shows a representative failure case:

[image: D2S sample with model detections]

In this test image, the model:

  • Completely missed the avocado (no detection)
  • Entirely missed the tea box and the drink beneath it
  • Misclassified a water bottle as instant_noodles (0.63 confidence)
  • Produced a low-confidence dried_fruit detection (0.45) on an incorrect region

This suggests the model learned to recognize specific packaging patterns from its training data rather than generalizing to grocery items as a broader category. It should therefore be deployed store-specifically, on inventory that matches this training data.
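One practical mitigation for low-confidence spurious detections, like the 0.45 dried_fruit box above, is a stricter confidence threshold at inference time. A minimal sketch on hypothetical detections shaped like the failure case:

```python
# Filter detections below a confidence threshold. The detections list is a
# hypothetical example shaped like the D2S failure case described above.
def filter_detections(dets, conf_threshold=0.5):
    return [d for d in dets if d[1] >= conf_threshold]

detections = [("instant_noodles", 0.63), ("dried_fruit", 0.45)]
print(filter_detections(detections))  # drops the 0.45 dried_fruit box
```

Note that thresholding cannot fix confidently wrong detections such as the 0.63 instant_noodles box; those require retraining on representative data.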

Poor Performing Classes

Class mAP50 mAP50-95 Likely Reason
stationery 0.972 0.785 Smallest class (1,466 images)
chocolate 0.993 0.833 Similar packaging; confusion matrix shows 0.13 confusion with background

Data Biases

  • Geographic bias: Dataset is predominantly composed of Chinese grocery product packaging. Model is not generalizable to Western or European retail environments
  • Product bias: Heavily skewed toward packaged and processed goods; fresh produce and unpackaged items classes are entirely absent
  • Environmental bias: Images were collected in controlled photography settings. Does not fully represent real store lighting, shadows, or possible covered items.

Environmental and Contextual Limitations

  • Performance degrades significantly when used on items not present in the training data, as seen with the D2S Dataset
  • Overlapping or partially occluded items in a self-checkout camera may cause missed or incorrect detections
  • Model was designed for overhead/top-down perspective, so differing angles/views could degrade performance

Inappropriate Use Cases

This specific model:

  • Should NOT be deployed in stores whose inventory differs significantly from the training data without retraining; separate models trained on separate data should be used for different inventories
  • Should NOT be used as a standalone loss prevention or security system
  • Should NOT be used to detect fresh produce, unpackaged items, or non-grocery products
  • Should NOT be used in high-stakes applications where misclassification has serious consequences

Ethical Considerations

  • Overhead camera systems at self-checkout may raise customer privacy concerns depending on how image/video data is stored and used
  • Model should not be used to make automated decisions that negatively impact customers without human review, as misclassifications may affect customers purchasing unfamiliar or international products not well represented in the training data

Sample Size Limitations

  • Stationery (1,466 images) is the smallest class and shows the weakest overall performance (albeit still strong). Additional training data would likely improve results
  • No representation of fresh produce, meaning the model has zero capability to detect items like fruits, vegetables, or deli products
  • Performance would likely improve significantly if the model were trained on the full 83,699-image dataset