Does DINO loss compare the [CLS] tokens from both teacher and student?



This content originally appeared on DEV Community and was authored by Henri Wang

Yes, exactly.

In DINO and DINOv2, the DINO loss is applied between the [CLS] tokens of the teacher and student models.

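The teacher's [CLS] output, after the projection head, is centered and then passed through a softmax with a low temperature, which sharpens it into a target distribution.
The student's [CLS] output goes through its own softmax, and the student is trained to match the teacher's distribution using cross-entropy loss.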
Each view of the same image produces one [CLS] embedding, and the goal is to make the student’s [CLS] output match the teacher’s.
So, the comparison is always between the [CLS] tokens, across different augmentations of the same image.
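
To make the comparison concrete, here is a minimal sketch of the [CLS]-level loss in plain Python. It is an illustration under stated assumptions, not the official implementation: the function name `dino_cls_loss`, the argument layout (one head-output vector per view, with view `i` being the same crop for both networks), and the default temperatures (0.1 for the student, 0.04 for the teacher, values commonly quoted for DINO) are all choices made for this example. The `center` vector is simply passed in; in practice it is a running average of teacher outputs.

```python
import math

def softmax(logits, temperature):
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits, temperature):
    # Log of the temperature-scaled softmax, computed via log-sum-exp.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def dino_cls_loss(student_views, teacher_views, center,
                  student_temp=0.1, teacher_temp=0.04):
    """Average cross-entropy between teacher and student [CLS] distributions.

    Hypothetical helper for illustration. Each argument is a list of
    head-output vectors, one per view; view i is assumed to be the same
    crop for both networks. At least one cross-view pair is required.
    """
    total, pairs = 0.0, 0
    for t_idx, t_logits in enumerate(teacher_views):
        # Teacher: center first, then sharpen with a low-temperature softmax.
        centered = [t - c for t, c in zip(t_logits, center)]
        p_teacher = softmax(centered, teacher_temp)
        for s_idx, s_logits in enumerate(student_views):
            if s_idx == t_idx:
                continue  # only compare different views of the same image
            log_p_student = log_softmax(s_logits, student_temp)
            # Cross-entropy H(p_teacher, p_student) for this pair of views.
            total += -sum(p * lp for p, lp in zip(p_teacher, log_p_student))
            pairs += 1
    return total / pairs

# Toy example: two views, three output dimensions.
teacher = [[2.0, 0.0, -1.0], [1.5, 0.2, -0.5]]
student = [[1.8, 0.1, -0.9], [1.4, 0.3, -0.4]]
center = [0.0, 0.0, 0.0]
print(dino_cls_loss(student, teacher, center))
```

In a real training loop the teacher side would carry no gradient, and the teacher would typically see only the global crops while the student sees every crop; the sketch leaves both details out to keep the [CLS]-to-[CLS] comparison in focus.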

