CIS 5543 Computer Vision


Suggested Lectures


1. Vision Transformers (ViT)

a. Patch tokenization (sketch below)
  - Patch size vs. resolution trade-offs
  - Linear patch embeddings
  - Positional encodings (learned vs. fixed)

b. Self-attention
  - Multi-head self-attention mechanism
  - Global receptive field vs. CNN locality
  - Quadratic computational complexity in the number of tokens, O(N^2)
  - Attention visualization and interpretability

c. ViT architectures
  - ViT-Base / Large
  - Training at scale and data requirements
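
A minimal sketch of the patch tokenization step from 1a, assuming a PyTorch setup; the hyperparameters (224x224 input, 16x16 patches, 768-dim embeddings) follow ViT-Base, and the [CLS] token is omitted for brevity.

  import torch
  import torch.nn as nn

  class PatchEmbed(nn.Module):
      def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
          super().__init__()
          self.num_patches = (img_size // patch_size) ** 2
          # A strided convolution is equivalent to cutting the image into
          # non-overlapping patches and applying a shared linear projection.
          self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
          # Learned positional encodings, one per patch token.
          self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

      def forward(self, x):                    # x: (B, 3, 224, 224)
          x = self.proj(x)                     # (B, 768, 14, 14)
          x = x.flatten(2).transpose(1, 2)     # (B, 196, 768) patch tokens
          return x + self.pos_embed

  tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
  print(tokens.shape)                          # torch.Size([2, 196, 768])

Note that halving the patch size quadruples the number of tokens, which is where the O(N^2) attention cost in 1b becomes a practical constraint.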


2. Object Detection & Segmentation with ViT

a. Bounding box prediction with ViT as a backbone
  - Feature pyramids with transformers
  - ViT vs. CNN backbones

b. Transformer-based detection (sketch below)
  - DETR and query-based detection
  - End-to-end detection pipelines

c. Attaching segmentation heads to ViT models
  - Semantic vs. instance vs. panoptic segmentation
  - Segment Anything Model (SAM)
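
A simplified, DETR-inspired sketch of query-based detection from 2b, assuming PyTorch; the dimensions, 100 object queries, and 91-class output are illustrative assumptions, and real DETR additionally uses Hungarian matching and auxiliary losses not shown here.

  import torch
  import torch.nn as nn

  class QueryDetectionHead(nn.Module):
      def __init__(self, d_model=256, num_queries=100, num_classes=91):
          super().__init__()
          # Learned object queries, one slot per potential detection.
          self.queries = nn.Parameter(torch.randn(num_queries, d_model))
          layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
          self.decoder = nn.TransformerDecoder(layer, num_layers=6)
          self.class_head = nn.Linear(d_model, num_classes + 1)   # +1 for the "no object" class
          self.box_head = nn.Linear(d_model, 4)                   # (cx, cy, w, h), normalized to [0, 1]

      def forward(self, backbone_tokens):                # (B, N, d_model) projected ViT patch features
          B = backbone_tokens.size(0)
          q = self.queries.unsqueeze(0).expand(B, -1, -1)
          hs = self.decoder(q, backbone_tokens)          # queries cross-attend to the image tokens
          return self.class_head(hs), self.box_head(hs).sigmoid()

  logits, boxes = QueryDetectionHead()(torch.randn(2, 196, 256))
  print(logits.shape, boxes.shape)                       # (2, 100, 92) and (2, 100, 4)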


3. Contrastive Models (CLIP, BLIP)

a. Contrastive learning objective (sketch below)
  - Image-text alignment
  - InfoNCE loss

b. CLIP
  - Joint embedding space
  - Zero-shot classification
  - Prompt engineering for vision

c. BLIP
  - Captioning + contrastive pretraining
  - Bootstrapped language supervision
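
A minimal sketch of the symmetric image-text contrastive (InfoNCE) objective from 3a, assuming PyTorch; the random tensors stand in for encoder outputs, and the fixed temperature of 0.07 is an assumption (CLIP learns its temperature during training).

  import torch
  import torch.nn.functional as F

  def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
      # Normalize so the dot products below are cosine similarities.
      image_emb = F.normalize(image_emb, dim=-1)
      text_emb = F.normalize(text_emb, dim=-1)
      logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
      targets = torch.arange(logits.size(0))            # matching pairs lie on the diagonal
      loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
      loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
      return (loss_i2t + loss_t2i) / 2

  loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))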


4. Large Language Models (LLM)

a. Token prediction (sketch below)
  - Autoregressive modeling
  - Next-token prediction loss
  - Logits and softmax

b. Prompting and conditioning
  - In-context learning
  - Few-shot vs. zero-shot prompting
  - System, user, and instruction prompts

c. Architectural components
  - Transformer decoder stack
  - Embedding and unembedding matrices
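
A sketch of next-token prediction from 4a, assuming PyTorch; the embedding/unembedding pair is a stand-in for a full transformer decoder stack, and the vocabulary and model sizes are arbitrary.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  vocab_size, d_model = 1000, 64
  embed = nn.Embedding(vocab_size, d_model)        # embedding matrix
  unembed = nn.Linear(d_model, vocab_size)         # unembedding matrix ("LM head")

  tokens = torch.randint(0, vocab_size, (2, 16))   # (batch, sequence) of token ids
  hidden = embed(tokens)                           # a real LLM would run a decoder stack here
  logits = unembed(hidden)                         # (batch, sequence, vocab)

  # Next-token loss: positions <= t predict token t+1, so compare
  # logits[:, :-1] against the shifted targets tokens[:, 1:].
  loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                         tokens[:, 1:].reshape(-1))
  next_token_probs = F.softmax(logits[:, -1], dim=-1)   # distribution over the next token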


5. Vision-Language Models (VLM)

a. BLIP-2, LLaVA, Qwen-VL (with emphasis on LLaVA)
  - Frozen vision encoder + LLM
  - Query-based visual tokens
  - Instruction-tuned multimodal models

b. Cross-attention (BLIP-2) vs. unified token spaces with self-attention (LLaVA) (sketch below)
  - Cross-modal attention blocks
  - Unified sequence modeling

c. Multimodal reasoning
  - Visual question answering
  - Image-conditioned text generation
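
A hedged sketch of the LLaVA-style unified token space from 5b, assuming PyTorch; the dimensions (1024-dim vision features, 4096-dim LLM embeddings, 576 visual tokens) are assumptions that roughly match LLaVA-1.5, and the frozen vision encoder and LLM themselves are not shown.

  import torch
  import torch.nn as nn

  d_vision, d_llm = 1024, 4096
  projector = nn.Linear(d_vision, d_llm)           # LLaVA uses a linear layer or a small MLP here

  vision_tokens = torch.randn(1, 576, d_vision)    # output of a frozen ViT encoder (24x24 patches)
  text_embeds = torch.randn(1, 32, d_llm)          # embedded prompt tokens

  visual_embeds = projector(vision_tokens)                      # (1, 576, 4096)
  llm_input = torch.cat([visual_embeds, text_embeds], dim=1)    # one unified sequence for self-attention
  print(llm_input.shape)                                        # torch.Size([1, 608, 4096])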


6. Grounded Vision Models

a. Grounding DINO
  - Language-conditioned object queries
  - Text-to-box grounding

b. Open-vocabulary detection (sketch below)
  - Class-agnostic detection
  - Zero-shot and few-shot detection

c. Evaluation and datasets
  - RefCOCO, Visual Genome, COCO-Captions
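
A conceptual sketch of open-vocabulary detection scoring from 6b, assuming PyTorch; the region and text features are random stand-ins for a class-agnostic box proposer and a text encoder such as CLIP's, and the label strings are arbitrary.

  import torch
  import torch.nn.functional as F

  region_feats = F.normalize(torch.randn(100, 512), dim=-1)     # features for 100 class-agnostic boxes
  label_texts = ["a red backpack", "a traffic cone", "a golden retriever"]  # arbitrary open vocabulary
  text_feats = F.normalize(torch.randn(len(label_texts), 512), dim=-1)      # stand-in for text-encoder output

  scores = region_feats @ text_feats.t()      # (100, 3) box-vs-label similarity
  best_label = scores.argmax(dim=-1)          # predicted open-vocabulary class per box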


7. Video Language Models

a. Video representation (sketch below)
  - Frame sampling strategies
  - Temporal attention

b. Video-text alignment
  - Video captioning
  - Video question answering

c. Multimodal temporal reasoning
  - Event understanding
  - Long-context challenges
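
A small sketch of uniform frame sampling from 7a, assuming PyTorch; the clip length, frame rate, and number of sampled frames are arbitrary assumptions.

  import torch

  def sample_frames(video, num_frames=8):
      """video: (T, C, H, W); returns num_frames frames spread uniformly over time."""
      T = video.shape[0]
      idx = torch.linspace(0, T - 1, num_frames).long()
      return video[idx]

  clip = sample_frames(torch.randn(300, 3, 224, 224))   # e.g. a 10-second clip at 30 fps
  print(clip.shape)                                     # torch.Size([8, 3, 224, 224])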


8. Image Retrieval & Image Similarity Measures

a. Embedding-based retrieval (sketch below)
  - Cosine similarity vs. dot product
  - Learned metric spaces

b. Cross-modal retrieval
  - Text-to-image search
  - Image-to-text retrieval

c. Evaluation metrics
  - Recall@K
  - Mean Average Precision (mAP)
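
A sketch of embedding-based retrieval and Recall@K from 8a and 8c, assuming PyTorch; the query/gallery embeddings and ground-truth indices are random stand-ins, so the printed recall is only illustrative.

  import torch
  import torch.nn.functional as F

  queries = F.normalize(torch.randn(50, 256), dim=-1)    # e.g. text embeddings
  gallery = F.normalize(torch.randn(500, 256), dim=-1)   # e.g. image embeddings
  gt = torch.randint(0, 500, (50,))                      # index of the correct gallery item per query

  sims = queries @ gallery.t()                           # cosine similarity (dot product of unit vectors)
  topk = sims.topk(k=5, dim=-1).indices                  # top-5 ranked candidates per query
  recall_at_5 = (topk == gt[:, None]).any(dim=-1).float().mean()
  print(f"Recall@5 = {recall_at_5.item():.3f}")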


9. Self-Supervised Learning

a. Autoencoders
  - Reconstruction loss
  - Bottleneck representations

b. Masked Autoencoders (MAE)
  - Masking strategies (sketch below)
  - ViT-based MAE

c. DINO
  - Self-distillation without labels
  - Teacher-student models

d. Segment Anything Model (SAM)
  - Promptable segmentation
  - Generalization across tasks
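
A sketch of MAE-style random masking from 9b, assuming PyTorch; the 75% mask ratio follows the MAE paper, while the batch and token shapes are assumptions.

  import torch

  def random_masking(tokens, mask_ratio=0.75):
      """tokens: (B, N, D). Returns the visible subset and the indices that were kept."""
      B, N, D = tokens.shape
      num_keep = int(N * (1 - mask_ratio))
      noise = torch.rand(B, N)                            # one random score per token
      keep_idx = noise.argsort(dim=1)[:, :num_keep]       # lowest-scoring tokens stay visible
      visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
      return visible, keep_idx

  visible, keep_idx = random_masking(torch.randn(2, 196, 768))
  print(visible.shape)    # torch.Size([2, 49, 768]) -- only 25% of patches reach the encoder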


10. Image Generation

a. Generative modeling paradigms

b. Diffusion models (sketch below)
  - Forward and reverse diffusion
  - Classifier-free guidance

c. Text-to-image generation
  - Latent diffusion
  - Prompt control and text-conditioned generation
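
A sketch of the forward (noising) diffusion process from 10b, assuming PyTorch; the linear beta schedule and 1,000 timesteps are common defaults but assumptions here, and the denoising network is not shown.

  import torch

  T = 1000
  betas = torch.linspace(1e-4, 0.02, T)                  # noise schedule
  alpha_bar = torch.cumprod(1.0 - betas, dim=0)          # cumulative product of (1 - beta_t)

  def q_sample(x0, t):
      """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
      noise = torch.randn_like(x0)
      a = alpha_bar[t].view(-1, 1, 1, 1)
      return a.sqrt() * x0 + (1 - a).sqrt() * noise, noise

  x0 = torch.randn(4, 3, 64, 64)                         # stand-in for a batch of clean images
  t = torch.randint(0, T, (4,))                          # a random timestep per image
  xt, eps = q_sample(x0, t)                              # the denoiser learns to predict eps from (xt, t)

The reverse process trains a network to predict the added noise from (xt, t); classifier-free guidance then mixes conditional and unconditional predictions at sampling time.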


Topics to be covered

1. Core Vision Tasks

2. Performance Evaluation

3. Feature Representation

4. Models and Architectures

5. Training Paradigms

6. Loss Functions

7. Explainability and Interpretability


Sources:

Stanford CS231n (Deep Learning for Computer Vision): https://cs231n.stanford.edu/ and video lectures: https://youtu.be/2fq9wYslV0A?list=PLoROMvodv4rOmsNzYBMe0gJY2XS8AQg16

Computer Vision Lectures by Svetlana Lazebnik

Understanding Deep Learning book and notebooks: https://udlbook.github.io/udlbook/

Vision-Language Models: https://iaee.substack.com/p/visual-question-answering-with-frozen-large-language-models-353d42791054