CIS 5543 Computer Vision


Suggested Lectures


1. Vision Transformers (ViT)

a. Patch tokenization (sketch below)
  - Patch size vs. resolution trade-offs
  - Linear patch embeddings
  - Positional encodings (learned vs. fixed)

b. Self-attention
  - Multi-head self-attention mechanism
  - Global receptive field vs. CNN locality
  - Quadratic computational complexity in the number of tokens, O(N^2)
  - Attention visualization and interpretability

c. ViT architectures
  - ViT-Base / Large
  - Training at scale and data requirements
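
A minimal sketch of the patch tokenization step from 1a, assuming a PyTorch setup; the hyperparameters (224x224 input, 16x16 patches, 768-dim embeddings) follow ViT-Base, and the [CLS] token is omitted for brevity.

  import torch
  import torch.nn as nn

  class PatchEmbed(nn.Module):
      def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
          super().__init__()
          self.num_patches = (img_size // patch_size) ** 2
          # A strided convolution is equivalent to cutting the image into
          # non-overlapping patches and applying a shared linear projection.
          self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
          # Learned positional encodings, one per patch token.
          self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

      def forward(self, x):                    # x: (B, 3, 224, 224)
          x = self.proj(x)                     # (B, 768, 14, 14)
          x = x.flatten(2).transpose(1, 2)     # (B, 196, 768) patch tokens
          return x + self.pos_embed

  tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
  print(tokens.shape)                          # torch.Size([2, 196, 768])

Note that halving the patch size quadruples the number of tokens, which is where the O(N^2) attention cost in 1b becomes a practical constraint.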


2. Object Detection & Segmentation with ViT

a. Bounding box prediction with ViT as a backbone
  - Feature pyramids with transformers
  - ViT vs. CNN backbones

b. Transformer-based detection (sketch below)
  - DETR and query-based detection
  - End-to-end detection pipelines

c. Attaching segmentation heads to ViT models
  - Semantic vs. instance vs. panoptic segmentation
  - Segment Anything Model (SAM)
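
A simplified, DETR-inspired sketch of query-based detection from 2b, assuming PyTorch; the dimensions, 100 object queries, and 91-class output are illustrative assumptions, and real DETR additionally uses Hungarian matching and auxiliary losses not shown here.

  import torch
  import torch.nn as nn

  class QueryDetectionHead(nn.Module):
      def __init__(self, d_model=256, num_queries=100, num_classes=91):
          super().__init__()
          # Learned object queries, one slot per potential detection.
          self.queries = nn.Parameter(torch.randn(num_queries, d_model))
          layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
          self.decoder = nn.TransformerDecoder(layer, num_layers=6)
          self.class_head = nn.Linear(d_model, num_classes + 1)   # +1 for the "no object" class
          self.box_head = nn.Linear(d_model, 4)                   # (cx, cy, w, h), normalized to [0, 1]

      def forward(self, backbone_tokens):                # (B, N, d_model) projected ViT patch features
          B = backbone_tokens.size(0)
          q = self.queries.unsqueeze(0).expand(B, -1, -1)
          hs = self.decoder(q, backbone_tokens)          # queries cross-attend to the image tokens
          return self.class_head(hs), self.box_head(hs).sigmoid()

  logits, boxes = QueryDetectionHead()(torch.randn(2, 196, 256))
  print(logits.shape, boxes.shape)                       # (2, 100, 92) and (2, 100, 4)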


3. Contrastive Models (CLIP, BLIP)

a. Contrastive learning objective (sketch below)
  - Image-text alignment
  - InfoNCE loss

b. CLIP
  - Joint embedding space
  - Zero-shot classification
  - Prompt engineering for vision

c. BLIP
  - Captioning + contrastive pretraining
  - Bootstrapped language supervision
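
A minimal sketch of the symmetric image-text contrastive (InfoNCE) objective from 3a, assuming PyTorch; the random tensors stand in for encoder outputs, and the fixed temperature of 0.07 is an assumption (CLIP learns its temperature during training).

  import torch
  import torch.nn.functional as F

  def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
      # Normalize so the dot products below are cosine similarities.
      image_emb = F.normalize(image_emb, dim=-1)
      text_emb = F.normalize(text_emb, dim=-1)
      logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
      targets = torch.arange(logits.size(0))            # matching pairs lie on the diagonal
      loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
      loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
      return (loss_i2t + loss_t2i) / 2

  loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))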


4. Large Language Models (LLM)

a. Token prediction (sketch below)
  - Autoregressive modeling
  - Next-token prediction loss
  - Logits and softmax

b. Prompting and conditioning
  - In-context learning
  - Few-shot vs. zero-shot prompting
  - System, user, and instruction prompts

c. Architectural components
  - Transformer decoder stack
  - Embedding and unembedding matrices
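
A sketch of next-token prediction from 4a, assuming PyTorch; the embedding/unembedding pair is a stand-in for a full transformer decoder stack, and the vocabulary and model sizes are arbitrary.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  vocab_size, d_model = 1000, 64
  embed = nn.Embedding(vocab_size, d_model)        # embedding matrix
  unembed = nn.Linear(d_model, vocab_size)         # unembedding matrix ("LM head")

  tokens = torch.randint(0, vocab_size, (2, 16))   # (batch, sequence) of token ids
  hidden = embed(tokens)                           # a real LLM would run a decoder stack here
  logits = unembed(hidden)                         # (batch, sequence, vocab)

  # Next-token loss: positions <= t predict token t+1, so compare
  # logits[:, :-1] against the shifted targets tokens[:, 1:].
  loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                         tokens[:, 1:].reshape(-1))
  next_token_probs = F.softmax(logits[:, -1], dim=-1)   # distribution over the next token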


5. Vision-Language Models (VLM)

a. BLIP-2, LLaVA, Qwen-VL (with emphasis on LLaVA)
  - Frozen vision encoder + LLM
  - Query-based visual tokens
  - Instruction-tuned multimodal models

b. Cross-attention (BLIP-2) vs. unified token spaces with self-attention (LLaVA) (sketch below)
  - Cross-modal attention blocks
  - Unified sequence modeling

c. Multimodal reasoning
  - Visual question answering
  - Image-conditioned text generation
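
A hedged sketch of the LLaVA-style unified token space from 5b, assuming PyTorch; the dimensions (1024-dim vision features, 4096-dim LLM embeddings, 576 visual tokens) are assumptions that roughly match LLaVA-1.5, and the frozen vision encoder and LLM themselves are not shown.

  import torch
  import torch.nn as nn

  d_vision, d_llm = 1024, 4096
  projector = nn.Linear(d_vision, d_llm)           # LLaVA uses a linear layer or a small MLP here

  vision_tokens = torch.randn(1, 576, d_vision)    # output of a frozen ViT encoder (24x24 patches)
  text_embeds = torch.randn(1, 32, d_llm)          # embedded prompt tokens

  visual_embeds = projector(vision_tokens)                      # (1, 576, 4096)
  llm_input = torch.cat([visual_embeds, text_embeds], dim=1)    # one unified sequence for self-attention
  print(llm_input.shape)                                        # torch.Size([1, 608, 4096])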


6. Grounded Vision Models

a. Grounding DINO
  - Language-conditioned object queries
  - Text-to-box grounding

b. Open-vocabulary detection (sketch below)
  - Class-agnostic detection
  - Zero-shot and few-shot detection

c. Evaluation and datasets
  - RefCOCO, Visual Genome, COCO-Captions
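
A conceptual sketch of open-vocabulary detection scoring from 6b, assuming PyTorch; the region and text features are random stand-ins for a class-agnostic box proposer and a text encoder such as CLIP's, and the label strings are arbitrary.

  import torch
  import torch.nn.functional as F

  region_feats = F.normalize(torch.randn(100, 512), dim=-1)     # features for 100 class-agnostic boxes
  label_texts = ["a red backpack", "a traffic cone", "a golden retriever"]  # arbitrary open vocabulary
  text_feats = F.normalize(torch.randn(len(label_texts), 512), dim=-1)      # stand-in for text-encoder output

  scores = region_feats @ text_feats.t()      # (100, 3) box-vs-label similarity
  best_label = scores.argmax(dim=-1)          # predicted open-vocabulary class per box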


7. Video Language Models

a. Video representation (sketch below)
  - Frame sampling strategies
  - Temporal attention

b. Video-text alignment
  - Video captioning
  - Video question answering

c. Multimodal temporal reasoning
  - Event understanding
  - Long-context challenges
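
A small sketch of uniform frame sampling from 7a, assuming PyTorch; the clip length, frame rate, and number of sampled frames are arbitrary assumptions.

  import torch

  def sample_frames(video, num_frames=8):
      """video: (T, C, H, W); returns num_frames frames spread uniformly over time."""
      T = video.shape[0]
      idx = torch.linspace(0, T - 1, num_frames).long()
      return video[idx]

  clip = sample_frames(torch.randn(300, 3, 224, 224))   # e.g. a 10-second clip at 30 fps
  print(clip.shape)                                     # torch.Size([8, 3, 224, 224])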


8. Image Retrieval & Image Similarity Measures

a. Embedding-based retrieval (sketch below)
  - Cosine similarity vs. dot product
  - Learned metric spaces

b. Cross-modal retrieval
  - Text-to-image search
  - Image-to-text retrieval

c. Evaluation metrics
  - Recall@K
  - Mean Average Precision (mAP)
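
A sketch of embedding-based retrieval and Recall@K from 8a and 8c, assuming PyTorch; the query/gallery embeddings and ground-truth indices are random stand-ins, so the printed recall is only illustrative.

  import torch
  import torch.nn.functional as F

  queries = F.normalize(torch.randn(50, 256), dim=-1)    # e.g. text embeddings
  gallery = F.normalize(torch.randn(500, 256), dim=-1)   # e.g. image embeddings
  gt = torch.randint(0, 500, (50,))                      # index of the correct gallery item per query

  sims = queries @ gallery.t()                           # cosine similarity (dot product of unit vectors)
  topk = sims.topk(k=5, dim=-1).indices                  # top-5 ranked candidates per query
  recall_at_5 = (topk == gt[:, None]).any(dim=-1).float().mean()
  print(f"Recall@5 = {recall_at_5.item():.3f}")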


9. Self-Supervised Learning

a. Autoencoders
  - Reconstruction loss
  - Bottleneck representations

b. Masked Autoencoders (MAE)
  - Masking strategies (sketch below)
  - ViT-based MAE

c. DINO
  - Self-distillation without labels
  - Teacher-student models

d. Segment Anything Model (SAM)
  - Promptable segmentation
  - Generalization across tasks
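
A sketch of MAE-style random masking from 9b, assuming PyTorch; the 75% mask ratio follows the MAE paper, while the batch and token shapes are assumptions.

  import torch

  def random_masking(tokens, mask_ratio=0.75):
      """tokens: (B, N, D). Returns the visible subset and the indices that were kept."""
      B, N, D = tokens.shape
      num_keep = int(N * (1 - mask_ratio))
      noise = torch.rand(B, N)                            # one random score per token
      keep_idx = noise.argsort(dim=1)[:, :num_keep]       # lowest-scoring tokens stay visible
      visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
      return visible, keep_idx

  visible, keep_idx = random_masking(torch.randn(2, 196, 768))
  print(visible.shape)    # torch.Size([2, 49, 768]) -- only 25% of patches reach the encoder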


10. Image Generation

a. Generative modeling paradigms

b. Diffusion models (sketch below)
  - Forward and reverse diffusion
  - Classifier-free guidance

c. Text-to-image generation
  - Latent diffusion
  - Prompt control and text-conditioned generation
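
A sketch of the forward (noising) diffusion process from 10b, assuming PyTorch; the linear beta schedule and 1,000 timesteps are common defaults but assumptions here, and the denoising network is not shown.

  import torch

  T = 1000
  betas = torch.linspace(1e-4, 0.02, T)                  # noise schedule
  alpha_bar = torch.cumprod(1.0 - betas, dim=0)          # cumulative product of (1 - beta_t)

  def q_sample(x0, t):
      """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
      noise = torch.randn_like(x0)
      a = alpha_bar[t].view(-1, 1, 1, 1)
      return a.sqrt() * x0 + (1 - a).sqrt() * noise, noise

  x0 = torch.randn(4, 3, 64, 64)                         # stand-in for a batch of clean images
  t = torch.randint(0, T, (4,))                          # a random timestep per image
  xt, eps = q_sample(x0, t)                              # the denoiser learns to predict eps from (xt, t)

The reverse process trains a network to predict the added noise from (xt, t); classifier-free guidance then mixes conditional and unconditional predictions at sampling time.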


Topics to be covered

1. Core Vision Tasks

2. Performance Evaluation

3. Feature Representation

4. Models and Architectures

5. Training Paradigms

6. Loss Functions

7. Explainability and Interpretability


Sources:

Stanford CS231n (Deep Learning for Computer Vision): https://cs231n.stanford.edu/ and video lectures: https://youtu.be/2fq9wYslV0A?list=PLoROMvodv4rOmsNzYBMe0gJY2XS8AQg16

Computer Vision Lectures by Svetlana Lazebnik

Understanding Deep Learning book and notebooks: https://udlbook.github.io/udlbook/

Vision-Language Models: https://iaee.substack.com/p/visual-question-answering-with-frozen-large-language-models-353d42791054