1. Core Vision Tasks
- Image classification
- Object detection (bounding boxes)
- Image segmentation (semantic, instance, panoptic segmentation)
- Image retrieval (metric learning, nearest-neighbor search)
- Image description (captioning)
- Image description with visual grounding (phrase grounding)
- Image generation
2. Performance Evaluation
- Classification (Accuracy, Top-k accuracy)
- Detection
  - Intersection over Union (IoU)
  - Precision, recall
  - F1 score
  - Average Precision (AP), mAP
- Segmentation
- Retrieval
- Captioning & grounding
  - BLEU, CIDEr, METEOR
  - Grounding accuracy (IoU ≥ threshold)
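Two of the metrics above are small enough to sketch directly: IoU for detection and top-k accuracy for classification. This is a minimal illustration, assuming `[x1, y1, x2, y2]` box coordinates; the function names are my own.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def top_k_accuracy(logits, labels, k=5):
    """Fraction of samples whose true label is among the k highest logits."""
    top_k = np.argsort(logits, axis=1)[:, -k:]   # indices of the k largest logits
    hits = [labels[i] in top_k[i] for i in range(len(labels))]
    return float(np.mean(hits))
```

The same `iou` function doubles as the grounding-accuracy check: a predicted box counts as correct when `iou(pred, gt)` clears the chosen threshold (commonly 0.5).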
3. Feature Representation
- Hand-crafted features
  - Edge detectors (Sobel, Canny), SIFT, HOG
- Learned features
  - CNN feature maps
  - Patch embeddings
  - Multimodal embeddings (vision-language)
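The Sobel detector is the simplest hand-crafted feature in the list: two fixed 3x3 kernels convolved with the image, no learning involved. A minimal numpy sketch (naive loops for clarity, not speed):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d(image, kernel):
    """Valid-mode 2D correlation of a grayscale image with a 3x3 kernel."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)
    return out

def sobel_magnitude(image):
    """Gradient magnitude: strong responses mark edges."""
    gx = conv2d(image, SOBEL_X)
    gy = conv2d(image, SOBEL_Y)
    return np.sqrt(gx**2 + gy**2)
```

A CNN's first layer does the same operation, except the kernel weights are learned rather than fixed by hand.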
4. Models and Architectures
- Fully connected networks (MLPs)
- Convolutional Neural Networks (CNN)
- Vision Transformers (ViT)
  - Patch tokenization
  - Self-attention
- Contrastive Models (CLIP)
- Large Language Models (LLM)
  - Token prediction
  - Prompting and conditioning
- Vision-Language Models (VLM)
  - LLaVA, Qwen-VL
  - Cross-attention vs unified token spaces
- Grounded vision models
  - Grounding DINO
  - Open-vocabulary detection
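The two ViT ingredients named above fit in a few lines of numpy: split the image into flattened patch tokens, then run one head of scaled dot-product self-attention over them. Shapes and the random projections are illustrative assumptions, not the layout of any specific model.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patch tokens."""
    h, w, c = image.shape
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append(image[i:i+patch, j:j+patch].reshape(-1))
    return np.stack(tokens)            # (num_patches, patch * patch * C)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ v

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))
tokens = patchify(img, 4)              # 4 patches, each 4*4*3 = 48 dims
d = tokens.shape[1]
out = self_attention(tokens, rng.random((d, d)), rng.random((d, d)), rng.random((d, d)))
```

A real ViT adds a learned linear projection of the patches, positional embeddings, a class token, and many stacked attention/MLP blocks; the core token flow is the same.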
5. Training Paradigms
- Supervised learning
- Self-supervised learning
  - Masked image modeling (MAE)
- Weakly supervised learning
- Zero-shot and few-shot learning
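The masking step that defines MAE-style self-supervision is easy to sketch: hide a large fraction of patch tokens and feed only the visible ones to the encoder. The 75% ratio follows the MAE paper; the token array here is a placeholder.

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, seed=0):
    """Return the visible tokens and a boolean mask over all positions."""
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    keep = np.sort(order[:n_keep])
    mask = np.ones(n, dtype=bool)      # True = masked (to be reconstructed)
    mask[keep] = False
    return tokens[keep], mask

tokens = np.arange(16 * 8, dtype=float).reshape(16, 8)   # 16 fake patch tokens
visible, mask = random_masking(tokens)
```

The training signal then comes from reconstructing the masked patches (a pixel-space regression loss in MAE), so no labels are needed.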
6. Loss Functions
- Cross-entropy loss
- Contrastive loss (InfoNCE)
- Focal loss
- IoU-based losses
- Metric learning losses (triplet loss, cosine similarity loss)
- Generative losses (diffusion loss, adversarial loss in GANs)
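Two of the losses above in numpy, as a sketch: the binary focal loss, which reduces to cross-entropy at gamma = 0, and the symmetric InfoNCE loss used by CLIP-style contrastive models. Variable names and the 0.07 temperature are illustrative.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss; p = predicted probability of the positive class."""
    pt = np.where(y == 1, p, 1 - p)          # probability of the true class
    return float(np.mean(-((1 - pt) ** gamma) * np.log(pt)))

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (i, i) image-text pairs are the positives."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(len(logits))
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)             # stable log-softmax
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[idx, idx].mean()
    return float((xent(logits) + xent(logits.T)) / 2)    # image->text and text->image
```

The `(1 - pt) ** gamma` factor is what down-weights easy examples in focal loss; InfoNCE drives matched pairs together and mismatched pairs apart within each batch.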
7. Explainability and Interpretability
- Saliency maps
- Heatmaps (CAM, Grad-CAM, LeGrad)
- Attention visualization
- LogitLens for VLMs
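Of the techniques above, plain CAM is the simplest to sketch: the class activation map is a weighted sum of the final convolutional feature maps, weighted by that class's classifier weights. The feature maps and weights below are random placeholders standing in for a trained network.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """CAM: feature_maps (K, H, W), class_weights (K,) -> heatmap (H, W)."""
    cam = np.tensordot(class_weights, feature_maps, axes=1)  # weighted sum over K
    cam = np.maximum(cam, 0)                                 # keep positive evidence
    return cam / cam.max() if cam.max() > 0 else cam         # normalize to [0, 1]

rng = np.random.default_rng(0)
heatmap = class_activation_map(rng.random((16, 7, 7)), rng.random(16))
```

Grad-CAM generalizes this by replacing the classifier weights with gradients of the class score with respect to the feature maps, which requires no special final-layer structure.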
Sources:
- Stanford CS231n: Deep Learning for Computer Vision (lectures and videos)
- Computer Vision lectures by Svetlana Lazebnik
- Understanding Deep Learning (book and notebooks)
- Vision-Language Models