CIS 5543 Computer Vision

Topics

Core Vision Tasks

Image classification
Object detection (bounding boxes)
Image segmentation (semantic, instance, panoptic segmentation)
Image retrieval (metric learning, nearest-neighbor search)
Image description (captioning)
Image description with visual grounding (phrase grounding)
Image generation

Performance Evaluation

Classification (Accuracy, Top-k accuracy)
Detection
Intersection over Union (IoU)
Precision, recall
F1 score
Average Precision (AP), mAP
Segmentation
Mean IoU
Pixel accuracy
Retrieval
Recall@K
Mean reciprocal rank (MRR)
Captioning & grounding
BLEU, CIDEr, METEOR
Grounding accuracy (IoU ≥ threshold)
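
As a small illustration of the detection metrics listed above, the following is a minimal sketch of IoU for two axis-aligned boxes, plus precision, recall, and F1 from raw counts. The function names and the (x1, y1, x2, y2) box convention are assumptions made for this example and are not taken from the course materials.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumes x1 <= x2 and y1 <= y2 (an illustrative convention).
    """
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes yield no overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

In a typical detection evaluation, a prediction counts as a true positive when its IoU with a ground-truth box meets a chosen threshold (0.5 is a common choice); the resulting counts then feed precision_recall_f1.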

Feature Representation

Hand-crafted features
Edge detectors (Sobel, Canny), SIFT, HOG
Learned features
CNN feature maps
Patch embeddings
Multimodal embeddings (vision-language)

Models and Architectures

Fully Connected Networks (multilayer perceptrons, MLP)
Convolutional Neural Networks (CNN)
AlexNet, VGG, ResNet
Vision Transformers (ViT)
Patch tokenization
Self-attention
Contrastive Models (CLIP)
Large Language Models (LLM)
Token prediction
Prompting and conditioning
Vision-Language Models (VLM)
BLIP-2, LLaVA, Qwen-VL
Cross-attention vs unified token spaces
Grounded Vision Models
Grounding DINO
Open-vocabulary detection
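
To make the patch tokenization topic under Vision Transformers concrete, here is a minimal sketch that splits a single-channel image (a nested list) into non-overlapping square patches and flattens each one into a token vector. The name patchify and the nested-list image format are assumptions for this example; a real ViT would also apply a learned linear projection and add position embeddings, which are omitted here.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch tokens.

    Tokens are returned in row-major order of the patch grid.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one patch row by row into a single vector.
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens
```

For a 4 x 4 image and patch size 2, this yields 4 tokens of length 4 each; the sequence length grows as (H / patch) * (W / patch).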

Training Paradigms

Supervised learning
Self-supervised learning
Masked image modeling (MAE)
Weakly supervised learning
Zero-shot and few-shot learning

Loss Functions

Cross-entropy loss
Contrastive loss (InfoNCE)
Focal loss
IoU-based losses
Metric learning losses (Triplet loss, Cosine similarity loss)
Generative losses (diffusion loss, GAN adversarial loss)
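
As an illustration of the contrastive loss entry above, the following is a minimal one-directional InfoNCE sketch in plain Python. It assumes queries[i] and keys[i] form the positive pair, that all other keys in the batch act as negatives, and that every vector is nonzero. The function name and the default temperature of 0.07 are illustrative choices; CLIP-style training typically averages this loss over both directions (image to text and text to image).

```python
import math


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss over a batch of paired embeddings (one direction)."""

    def normalize(v):
        # L2-normalize so dot products become cosine similarities.
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled similarity of query i to every key.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over all keys.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching key as the target class.
        total += log_denom - logits[i]
    return total / len(q)
```

The loss is near zero when each query is much closer to its own key than to the others, and equals log(N) when all N keys are equally similar to the query.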

Explainability and Interpretability

Saliency maps
CAM / Grad-CAM
Attention visualization
Token-level attribution
Logit Lens for VLMs
Failure modes and compositional errors

Sources

Stanford CS231n, Deep Learning for Computer Vision: https://cs231n.stanford.edu/ and video lectures: https://youtu.be/2fq9wYslV0A?list=PLoROMvodv4rOmsNzYBMe0gJY2XS8AQg16
Computer Vision Lectures by Svetlana Lazebnik
Understanding Deep Learning book and notebooks: https://udlbook.github.io/udlbook/
Vision-Language Models: https://iaee.substack.com/p/visual-question-answering-with-frozen-large-language-models-353d42791054