Grocery Self Checkout Item Detection

By: Daniel Bagcal

1. Model Description

Context

This model is a YOLOv11 object detector fine-tuned from COCO-pretrained weights to identify 17 grocery product categories in a retail self-checkout environment. It detects common grocery items from an overhead/top-down perspective, mimicking the view of a camera mounted above a self-checkout station. The intended use case is store-specific automatic item detection, assisting with item counting, checkout verification, and loss/theft prevention. The model is best suited for stores whose inventory closely matches the training data, as performance will degrade on unseen brands or product types not represented in it.

2. Training Data

The training dataset is a subset of the RPC-Dataset (rpc-dataset.github.io), a large-scale retail product checkout dataset consisting of 83,699 images across 200 grocery product classes. The working dataset is a subset of this, consisting of 9,616 images across the same 200 classes, sourced via Roboflow (universe.roboflow.com/groceries-jxjfd/grocery-goods).

Annotation Process

The original RPC-Dataset contained 200 product-specific classes, where each class represented a specific product variant (e.g., 100_milk, 101_milk, 102_milk). These classes were collapsed into 17 broader product categories to improve generalization, reduce class imbalance, and better reflect how a self-checkout system categorizes items by type rather than specific SKU. For example, all milk classes were merged into a single milk class, reducing the total class count from 200 to 17. Random samples were reviewed after relabeling to validate annotation quality, with no corrections needed.
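The relabeling step described above can be sketched as a simple lookup from SKU-level names to category names. The helper below is a hypothetical illustration following the `100_milk` naming pattern mentioned in the text, not the actual relabeling script:

```python
# Collapse SKU-level class names (e.g. "100_milk", "101_milk") into broad
# categories by stripping the numeric SKU prefix. Hypothetical sketch of the
# relabeling described above.
def collapse_class(sku_name: str) -> str:
    """Map a product-specific label like '101_milk' to its category 'milk'."""
    prefix, _, category = sku_name.partition("_")
    return category if prefix.isdigit() else sku_name

labels = ["100_milk", "101_milk", "102_milk", "37_drink"]
print(sorted({collapse_class(label) for label in labels}))  # ['drink', 'milk']
```

Because `partition` splits only at the first underscore, multi-word categories such as `instant_noodles` survive intact.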

The final post-annotation dataset used for training is available at https://app.roboflow.com/bdata-497-advanced-topics-in-dv-nqagm/grocery-goods-ezyyb.

Class Distribution

Class Name Total Count Training Count Validation Count Test Count
tissue 4,813 3,369 963 481
dessert 4,372 3,060 874 437
drink 3,760 2,632 752 376
seasoner 3,199 2,239 640 320
puffed_food 3,156 2,209 631 316
chocolate 3,146 2,202 629 315
instant_noodles 3,033 2,123 607 303
canned_food 2,714 1,900 543 271
milk 2,517 1,762 503 252
candy 2,499 1,749 500 250
personal_hygiene 2,495 1,747 499 250
instant_drink 2,492 1,744 498 249
alcohol 2,381 1,667 476 238
dried_fruit 2,368 1,658 474 237
dried_food 2,222 1,555 444 222
gum 1,923 1,346 385 192
stationery 1,466 1,026 293 147

Train/Validation/Test Split

Split Ratio Count
Train 70% 36,928
Validation 20% 10,505
Test 10% 5,276
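As a sanity check, the counts in the split table can be verified against the stated 70/20/10 ratios (a minimal sketch; the counts are taken directly from the table above):

```python
# Verify the train/val/test counts roughly match a 70/20/10 split.
splits = {"train": 36_928, "val": 10_505, "test": 5_276}
total = sum(splits.values())  # 52,709 total, per the table
for name, count in splits.items():
    print(f"{name}: {count / total:.1%}")
```

The ratios come out to roughly 70.1% / 19.9% / 10.0%, consistent with the table.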

Data Augmentation

The following augmentations were applied during training to simulate real-world checkout conditions:

Augmentation Purpose
Rotation Items placed on belt in any orientation
Horizontal/Vertical Flip Additional orientation variation
Mosaic Multiple items on belt simultaneously
HSV Shift (hue, saturation, value) Simulate varied store lighting
Translation & Scale Camera height and position variation
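In Ultralytics YOLO, augmentations like those in the table are typically controlled through training hyperparameters. The values below are a hypothetical sketch mirroring the table, not the exact settings used for this model:

```python
# Hypothetical Ultralytics-style augmentation hyperparameters mirroring the
# table above (the actual values used for this model are not stated).
augmentation = dict(
    degrees=180.0,   # rotation: items can sit at any orientation on the belt
    fliplr=0.5,      # horizontal flip probability
    flipud=0.5,      # vertical flip probability
    mosaic=1.0,      # mosaic: composites simulate multiple items at once
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # HSV shift for lighting variation
    translate=0.1, scale=0.5,           # camera height/position variation
)
```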

Known Biases and Limitations

  • Dataset is predominantly composed of Chinese grocery product packaging, limiting generalizability to Western or European retail environments
  • Fresh and unpackaged produce (such as fruits or vegetables) are not represented in the dataset
  • Limited lighting variation: real checkout environments may have inconsistent lighting that is not well represented in the training images

3. Training Procedure

  • Framework: Ultralytics YOLOv11n
  • Hardware: A100 GPU in Google Colab
  • Epochs: 50
  • Batch Size: 64
  • Image Size: 640x640
  • Patience: 50
  • Training Time: ~36.5 minutes (2,189.69 seconds)
  • Preprocessing: Augmentations applied at training time (see Data Augmentation section)
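The training procedure above can be sketched with the Ultralytics Python API as a configuration fragment. The dataset config path `grocery.yaml` is hypothetical; the other arguments follow the settings listed above:

```python
from ultralytics import YOLO

# Fine-tune COCO-pretrained YOLOv11n on the 17-class grocery dataset.
# "grocery.yaml" is a hypothetical data config path.
model = YOLO("yolo11n.pt")   # COCO-pretrained nano weights
results = model.train(
    data="grocery.yaml",
    epochs=50,
    batch=64,
    imgsz=640,
    patience=50,             # early-stopping patience
)
```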

4. Evaluation Results

Comprehensive Metrics

All files output from the runs/detect/train folder are provided in the Files section. The model was evaluated on a held-out test set of 1,928 images containing 9,825 instances across all 17 classes. The model demonstrates strong performance across all metrics, achieving near-perfect precision and recall with a mAP50 of 0.992.

Metric Value
Precision 0.989
Recall 0.985
mAP50 0.992
mAP50-95 0.862
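For reference, the aggregate F1 score implied by the precision and recall in the table can be computed directly as their harmonic mean:

```python
# F1 = 2PR / (P + R), using the aggregate precision and recall above.
precision, recall = 0.989, 0.985
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # ≈ 0.987
```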

Per-Class Breakdown

Class Test Images Instances Precision Recall mAP50 mAP50-95
all 1,928 9,825 0.989 0.985 0.992 0.862
alcohol 252 503 0.996 0.986 0.995 0.864
candy 257 502 0.988 0.980 0.990 0.815
canned_food 252 545 0.982 0.996 0.990 0.877
chocolate 360 700 0.982 0.984 0.993 0.833
dessert 389 819 0.995 0.991 0.995 0.881
dried_food 244 415 0.982 0.993 0.995 0.877
dried_fruit 263 516 0.986 0.986 0.995 0.887
drink 360 796 0.982 0.990 0.994 0.871
gum 183 360 0.989 0.979 0.991 0.812
instant_drink 271 554 0.984 0.982 0.994 0.886
instant_noodles 302 614 0.989 0.997 0.995 0.888
milk 256 491 0.996 0.990 0.994 0.859
personal_hygiene 255 506 0.990 0.982 0.994 0.854
puffed_food 324 654 0.996 1.000 0.995 0.907
seasoner 302 572 0.986 0.965 0.993 0.849
stationery 162 300 0.986 0.957 0.972 0.785
tissue 482 978 0.999 0.994 0.995 0.909

Visual Examples of Classes

Class Examples

The grid above shows representative examples from the training dataset, organized alphabetically by class (left to right, top to bottom, following the order of the per-class breakdown), with the final three images showing multi-item/multi-class detections. Because many product variants were merged into each category, the examples within a class vary widely in appearance.

Position Class Features
Row 1, Col 1 alcohol Glass bottles, aluminum cans/beer bottles
Row 1, Col 2 candy Small packaging, often cylindrical or box-shaped
Row 1, Col 3 canned_food Cylindrical canned foods
Row 1, Col 4 chocolate Flat packaging, things like Snickers bars
Row 2, Col 1 dessert Cup, boxed, flat packaging. Varies widely.
Row 2, Col 2 dried_food Flat sealed bags, often with food photography on packaging
Row 2, Col 3 dried_fruit Flat sealed bags, clear bags, colored packaging
Row 2, Col 4 drink Plastic bottles such as sodas, aluminum cans
Row 3, Col 1 gum Small box or pouch packaging
Row 3, Col 2 instant_drink Varies widely, small cylinders, boxes, sealed packs
Row 3, Col 3 instant_noodles Like instant ramen packs or cup-noodle packs
Row 3, Col 4 milk Small milk cartons, slim box packaging, bottled packs
Row 4, Col 1 personal_hygiene Items like toothbrushes, mouthwash, toothpaste
Row 4, Col 2 puffed_food Inflated bags such as Cheetos, other chip bags
Row 4, Col 3 seasoner Items vary from soy sauce to small seasoning packets
Row 4, Col 4 stationery Items such as notebooks, paper, pencils, etc.
Row 5, Col 1 tissue Small rectangular box packaging with soft branding
Row 5, Col 2-4 multi-class Multiple items detected simultaneously in a single scene

Key Visualizations

Confusion Matrix

[image: confusion matrix]

F1 Confidence Curve

[image: BoxF1 confidence curve]

Training & Validation Loss Curves

[image: results — training and validation loss curves]

Performance Analysis

The model performs consistently well across all 17 classes on the held-out test set, with the lowest mAP50 being stationery at 0.972. The strongest performing classes were tissue and puffed_food (mAP50-95: 0.909, 0.907), likely due to their distinct packaging shapes and high training sample counts. The weakest performing class was stationery (mAP50: 0.972, mAP50-95: 0.785), which is also the smallest class at 1,466 training images, suggesting performance is partially limited by sample size.

5. Limitations and Biases

D2S Wild Image Test Sample (Failure Case)

When tested on the D2S Dataset, the model struggled significantly with unseen products and environments. The image below shows a representative failure case:

[image: D2S sample with model detections]

In this test image, the model:

  • Completely missed the avocado (no detection)
  • Entirely missed the tea box and the drink beneath it
  • Misclassified a water bottle as instant_noodles (0.63 confidence)
  • Produced a low-confidence dried_fruit detection (0.45) on an incorrect region

This suggests the model learned to recognize specific packaging patterns from its training data rather than generalizing to grocery items as a broader category. It should therefore be deployed store-specifically, on inventory that matches this training data.
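One practical mitigation for low-confidence spurious detections, like the 0.45 dried_fruit box above, is a stricter confidence threshold at inference time. A minimal sketch on hypothetical detections shaped like the failure case:

```python
# Filter detections below a confidence threshold. The detections list is a
# hypothetical example shaped like the D2S failure case described above.
def filter_detections(dets, conf_threshold=0.5):
    return [d for d in dets if d[1] >= conf_threshold]

detections = [("instant_noodles", 0.63), ("dried_fruit", 0.45)]
print(filter_detections(detections))  # drops the 0.45 dried_fruit box
```

Note that thresholding cannot fix confidently wrong detections such as the 0.63 instant_noodles box; those require retraining on representative data.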

Poor Performing Classes

Class mAP50 mAP50-95 Likely Reason
stationery 0.972 0.785 Smallest class (1,466 images)
chocolate 0.993 0.833 Similar packaging; confusion matrix shows 0.13 confusion with background

Data Biases

  • Geographic bias: Dataset is predominantly composed of Chinese grocery product packaging. Model is not generalizable to Western or European retail environments
  • Product bias: Heavily skewed toward packaged and processed goods; fresh produce and unpackaged items classes are entirely absent
  • Environmental bias: Images were collected in controlled photography settings. Does not fully represent real store lighting, shadows, or possible covered items.

Environmental and Contextual Limitations

  • Performance degrades significantly when used on items not present in the training data, as seen with the D2S Dataset
  • Overlapping or partially occluded items in a self-checkout camera may cause missed or incorrect detections
  • Model was designed for overhead/top-down perspective, so differing angles/views could degrade performance

Inappropriate Use Cases

This specific model:

  • Should NOT be deployed in stores whose inventory differs significantly from the training data without retraining; separate models trained on separate data should be used for different inventories
  • Should NOT be used as a standalone loss prevention or security system
  • Should NOT be used to detect fresh produce, unpackaged items, or non-grocery products
  • Should NOT be used in high-stakes applications where misclassification has serious consequences

Ethical Considerations

  • Overhead camera systems at self-checkout may raise customer privacy concerns depending on how image/video data is stored and used
  • Model should not be used to make automated decisions that negatively impact customers without human review, as misclassifications may affect customers purchasing unfamiliar or international products not well represented in the training data

Sample Size Limitations

  • Stationery (1,466 images) is the smallest class and shows the weakest overall performance (albeit still strong). Additional training data would likely improve results
  • No representation of fresh produce, meaning the model has zero capability to detect items like fruits, vegetables, or deli products
  • Performance would likely improve significantly if the model were trained on the full 83,699-image dataset