Effect of Class Names on VLM Object Detection

Exploring the “language” part of VLM (Florence-2) for object detection
ML
CV
VLM
Author

Zeel B Patel

Published

February 15, 2025

Installation

try:
    import maestro
except ModuleNotFoundError:
    %pip install "maestro[florence_2]"

Imports

# Config
import os
import numpy as np
from tqdm.notebook import tqdm

from roboflow import Roboflow
from maestro.trainer.common.datasets import RoboflowJSONLDataset
from maestro.trainer.models.florence_2.inference import predict
from maestro.trainer.models.florence_2.checkpoints import (
    OptimizationStrategy, load_model)
import supervision as sv
from PIL import Image

from dotenv import load_dotenv
load_dotenv()
True

Ice Breaker

  • VLMs are useful for object detection in zero-shot and fine-tuning settings.
  • Unlike traditional object detection models like YOLO, VLMs are sensitive to the class names due to the “language” part of the model.
  • In other words, VLMs are fundamentally made to connect the visual and language domains.
  • Thus, the following question naturally arises:
    • How does VLM performance vary with the choice of class names?
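One lightweight way to probe this question is to generate controlled wording variants of each class name and compare the model's detections across them. The variant rules below are my own illustrative assumptions, not part of any library API:

```python
# Sketch: generate wording variants of a class name to probe a VLM's
# sensitivity to the "language" side. The specific rules (case changes,
# dropping the preposition, underscores) are illustrative assumptions.

def class_name_variants(name: str) -> list[str]:
    """Return simple wording variants of a class name."""
    variants = {
        name,                       # original, e.g. "9 of clubs"
        name.lower(),               # lowercase
        name.upper(),               # uppercase
        name.replace(" of ", " "),  # drop the preposition
        name.replace(" ", "_"),     # underscore-joined
    }
    return sorted(variants)

print(class_name_variants("9 of clubs"))
```

Running the same image through the detector once per variant then gives a per-class picture of how much the wording alone moves the predictions.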

Build fine-tuning pipeline for Florence-2

This part is adapted from this Roboflow notebook.

Download dataset

ROBOFLOW_API_KEY = os.getenv('ROBOFLOW_API_KEY')
rf = Roboflow(api_key=ROBOFLOW_API_KEY)

project = rf.workspace("roboflow-jvuqo").project("poker-cards-fmjio")
version = project.version(4)
dataset = version.download("florence2-od", "/tmp/poker-cards-fmjio")
loading Roboflow workspace...
loading Roboflow project...
!head -n 1 {dataset.location}/train/annotations.jsonl
{"image":"IMG_20220316_172418_jpg.rf.e3cb4a86dc0247e71e3697aa3e9db923.jpg","prefix":"<OD>","suffix":"9 of clubs<loc_138><loc_100><loc_470><loc_448>10 of clubs<loc_388><loc_145><loc_670><loc_453>jack  of clubs<loc_566><loc_166><loc_823><loc_432>queen of clubs<loc_365><loc_465><loc_765><loc_999>king of clubs<loc_601><loc_440><loc_949><loc_873>"}
| Option | Type | Description |
|---|---|---|
| `--dataset` | TEXT | Path to the dataset used for training [default: None] [required] |
| `--model_id` | TEXT | Identifier for the Florence-2 model [default: microsoft/Florence-2-base-ft] |
| `--revision` | TEXT | Model revision to use [default: refs/pr/20] |
| `--device` | TEXT | Device to use for training [default: auto] |
| `--optimization_strategy` | TEXT | Optimization strategy: lora, freeze, or none [default: lora] |
| `--cache_dir` | TEXT | Directory to cache the model weights locally [default: None] |
| `--epochs` | INTEGER | Number of training epochs [default: 10] |
| `--lr` | FLOAT | Learning rate for training [default: 1e-05] |
| `--batch_size` | INTEGER | Training batch size [default: 4] |
| `--accumulate_grad_batches` | INTEGER | Number of batches to accumulate for gradient updates [default: 8] |
| `--val_batch_size` | INTEGER | Validation batch size [default: None] |
| `--num_workers` | INTEGER | Number of workers for data loading [default: |
| `--val_num_workers` | INTEGER | Number of workers for validation data loading [default: None] |
| `--output_dir` | TEXT | Directory to store training outputs [default: ./training/florence_2] |
| `--metrics` | TEXT | List of metrics to track during training |
| `--max_new_tokens` | INTEGER | Maximum number of new tokens generated during inference [default: 1024] |
| `--random_seed` | INTEGER | Random seed for ensuring reproducibility. If None, no seed is set [default: None] |
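Assuming these options belong to maestro's Florence-2 training entry point (verify the exact subcommand name with `maestro --help` on your installed version), a training run on the downloaded dataset might look like:

```shell
# Hypothetical invocation; the subcommand name and defaults should be
# checked against your installed maestro version.
maestro florence_2 train \
  --dataset "/tmp/poker-cards-fmjio" \
  --epochs 10 \
  --batch_size 4 \
  --optimization_strategy lora
```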