This content originally appeared on DEV Community and was authored by Henri Wang
Yes, exactly.
In DINO and DINOv2, the DINO loss is applied between the [CLS] tokens of the teacher and student models.
The teacher's [CLS] output is first centered (by subtracting a running mean of teacher outputs, which prevents collapse) and then sharpened with a low-temperature softmax.
The student is trained to match this distribution using cross-entropy loss.
Each view of the same image produces one [CLS] embedding, and the goal is to make the student’s [CLS] output match the teacher’s.
So the comparison is always between [CLS] tokens, across different augmentations of the same image (the teacher sees only global crops, while the student sees both global and local crops).
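The steps above can be sketched in a few lines. This is a minimal NumPy illustration, not the official implementation: the function name, default temperatures, and the toy example are assumptions (the defaults mirror typical values from the DINO paper, but check the paper/repo for the exact recipe).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dino_loss(student_cls, teacher_cls, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution
    and the student distribution over projection-head dimensions.

    student_cls, teacher_cls: (batch, dim) [CLS] projections from two views.
    center: (dim,) running mean of teacher outputs (collapse prevention).
    tau_s, tau_t: temperatures; the teacher's is lower, giving sharper targets.
    Names and defaults here are illustrative, not the reference code.
    """
    t = softmax((teacher_cls - center) / tau_t)           # teacher target distribution
    log_s = np.log(softmax(student_cls / tau_s) + 1e-12)  # student log-probabilities
    return float(-(t * log_s).sum(axis=-1).mean())        # cross-entropy, batch mean

# Toy check: a student whose logits align with the teacher's (after rescaling
# by the temperature ratio) should incur a lower loss than a random student.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
center = np.zeros(8)
loss_match = dino_loss(z * (0.1 / 0.04), z, center)   # distributions coincide
loss_rand = dino_loss(rng.normal(size=(4, 8)), z, center)
print(loss_match, loss_rand)
```

When the two distributions coincide, the cross-entropy reduces to the entropy of the teacher's targets, which is why the matched loss is strictly smaller than the mismatched one; in training, gradients flow only through the student, while the teacher is an EMA of student weights.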