Effect of Class Names on VLM Object Detection

Exploring the “language” part of VLM (Florence-2) for object detection
ML
CV
VLM
Author

Zeel B Patel

Published

February 15, 2025

Installation

try:
    import maestro
except ModuleNotFoundError:
    %pip install "maestro[florence_2]"

Imports

# Config
import os
import numpy as np
from tqdm.notebook import tqdm

from roboflow import Roboflow
from maestro.trainer.common.datasets import RoboflowJSONLDataset
from maestro.trainer.models.florence_2.inference import predict
from maestro.trainer.models.florence_2.checkpoints import (
    OptimizationStrategy, load_model)
import supervision as sv
from PIL import Image

from dotenv import load_dotenv
load_dotenv()
True

Ice Breaker

  • VLMs are useful for object detection in zero-shot and fine-tuning settings.
  • Unlike traditional object detection models like YOLO, VLMs are sensitive to the class names due to the “language” part of the model.
  • In other words, VLMs are fundamentally made to connect the visual and language domains.
  • Thus, the following question naturally arises:
    • How does VLM performance vary with the choice of class names?
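One lightweight way to probe this question is to generate controlled wording variants of each class name and compare the model's detections across them. The variant rules below are my own illustrative assumptions, not part of any library API:

```python
# Sketch: generate wording variants of a class name to probe a VLM's
# sensitivity to the "language" side. The specific rules (case changes,
# dropping the preposition, underscores) are illustrative assumptions.

def class_name_variants(name: str) -> list[str]:
    """Return simple wording variants of a class name."""
    variants = {
        name,                       # original, e.g. "9 of clubs"
        name.lower(),               # lowercase
        name.upper(),               # uppercase
        name.replace(" of ", " "),  # drop the preposition
        name.replace(" ", "_"),     # underscore-joined
    }
    return sorted(variants)

print(class_name_variants("9 of clubs"))
```

Running the same image through the detector once per variant then gives a per-class picture of how much the wording alone moves the predictions.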

Build fine-tuning pipeline for Florence-2

This part is adapted from this Roboflow notebook.

Download dataset

ROBOFLOW_API_KEY = os.getenv('ROBOFLOW_API_KEY')
rf = Roboflow(api_key=ROBOFLOW_API_KEY)

project = rf.workspace("roboflow-jvuqo").project("poker-cards-fmjio")
version = project.version(4)
dataset = version.download("florence2-od", "/tmp/poker-cards-fmjio")
loading Roboflow workspace...
loading Roboflow project...
!head -n 1 {dataset.location}/train/annotations.jsonl
{"image":"IMG_20220316_172418_jpg.rf.e3cb4a86dc0247e71e3697aa3e9db923.jpg","prefix":"<OD>","suffix":"9 of clubs<loc_138><loc_100><loc_470><loc_448>10 of clubs<loc_388><loc_145><loc_670><loc_453>jack  of clubs<loc_566><loc_166><loc_823><loc_432>queen of clubs<loc_365><loc_465><loc_765><loc_999>king of clubs<loc_601><loc_440><loc_949><loc_873>"}
| Option | Type | Description |
|---|---|---|
| `--dataset` | TEXT | Path to the dataset used for training [default: None] [required] |
| `--model_id` | TEXT | Identifier for the Florence-2 model [default: microsoft/Florence-2-base-ft] |
| `--revision` | TEXT | Model revision to use [default: refs/pr/20] |
| `--device` | TEXT | Device to use for training [default: auto] |
| `--optimization_strategy` | TEXT | Optimization strategy: lora, freeze, or none [default: lora] |
| `--cache_dir` | TEXT | Directory to cache the model weights locally [default: None] |
| `--epochs` | INTEGER | Number of training epochs [default: 10] |
| `--lr` | FLOAT | Learning rate for training [default: 1e-05] |
| `--batch_size` | INTEGER | Training batch size [default: 4] |
| `--accumulate_grad_batches` | INTEGER | Number of batches to accumulate for gradient updates [default: 8] |
| `--val_batch_size` | INTEGER | Validation batch size [default: None] |
| `--num_workers` | INTEGER | Number of workers for data loading [default: |
| `--val_num_workers` | INTEGER | Number of workers for validation data loading [default: None] |
| `--output_dir` | TEXT | Directory to store training outputs [default: ./training/florence_2] |
| `--metrics` | TEXT | List of metrics to track during training |
| `--max_new_tokens` | INTEGER | Maximum number of new tokens generated during inference [default: 1024] |
| `--random_seed` | INTEGER | Random seed for ensuring reproducibility. If None, no seed is set [default: None] |
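Assuming these options belong to maestro's Florence-2 training entry point (verify the exact subcommand name with `maestro --help` on your installed version), a training run on the downloaded dataset might look like:

```shell
# Hypothetical invocation; the subcommand name and defaults should be
# checked against your installed maestro version.
maestro florence_2 train \
  --dataset "/tmp/poker-cards-fmjio" \
  --epochs 10 \
  --batch_size 4 \
  --optimization_strategy lora
```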