VISTA: Vision Transformer enhanced by U-Net and Image Colorfulness Frame Filtration for Automatic Retail Checkout

Multi-class product counting and recognition identifies product items in images or videos for automated retail checkout. The task is challenging under real-world conditions: product items overlap and occlude one another, items move quickly on the conveyor belt, many of the scanned items look very similar, novel products appear, and misidentifying an item has a negative impact. There is also a domain bias between the training and test sets: the provided training dataset consists of synthetic images, whereas the test videos contain foreign objects such as hands and trays. To address these issues, we propose to segment and classify individual frames from a video sequence. The segmentation method consists of a unified single-product-item and hand segmentation, followed by entropy masking to address the domain bias problem. The multi-class classification method is based on Vision Transformers (ViT). To identify the frames that contain target objects, we use several image processing methods and propose a custom metric to discard frames that contain no product items. Combining all these mechanisms, our best system achieves 3rd place in the AI City Challenge 2022 Track 4 with an F1 score of 0.4545. Code will be available at
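
The abstract does not define the custom frame-filtration metric, so the following is only a minimal sketch of the general idea suggested by the title: score each frame by its colorfulness and discard frames that score low. It assumes the widely used Hasler-Süsstrunk colorfulness measure as a stand-in for the authors' metric. The function names `colorfulness` and `keep_colorful_frames` and the threshold value are hypothetical and are not taken from the paper.

```python
import numpy as np


def colorfulness(frame: np.ndarray) -> float:
    """Hasler-Suesstrunk colorfulness of an RGB frame of shape (H, W, 3).

    Assumption: this stands in for the paper's custom metric, which the
    abstract does not specify.
    """
    # Cast to float first so that uint8 subtraction cannot wrap around.
    rgb = frame.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    # Opponent color channels: red-green and yellow-blue.
    rg = r - g
    yb = 0.5 * (r + g) - b

    # Combine the spread and the mean offset of the opponent channels.
    std_root = np.sqrt(rg.std() ** 2 + yb.std() ** 2)
    mean_root = np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)
    return float(std_root + 0.3 * mean_root)


def keep_colorful_frames(frames, threshold: float = 20.0):
    """Return indices of frames whose colorfulness exceeds the threshold.

    The default threshold is a hypothetical placeholder; in practice it
    would have to be tuned on the target videos.
    """
    return [i for i, frame in enumerate(frames) if colorfulness(frame) > threshold]


if __name__ == "__main__":
    # A uniform gray frame (no color) and a frame that is half red, half blue.
    gray = np.full((8, 8, 3), 128, dtype=np.uint8)
    colorful = np.zeros((8, 8, 3), dtype=np.uint8)
    colorful[:, :4, 0] = 255  # left half red
    colorful[:, 4:, 2] = 255  # right half blue

    print(keep_colorful_frames([gray, colorful]))
```

A score of this kind is cheap to compute per frame, which makes it a plausible pre-filter ahead of segmentation and ViT classification. Whether an empty tray really scores lower than a tray holding a product depends on the footage, so the threshold should be treated as a tunable parameter.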




