CIS 5543 Computer Vision
Topics
Core Vision Tasks
- Image classification
- Object detection (bounding boxes)
- Image segmentation (semantic, instance, panoptic segmentation)
- Image retrieval (metric learning, nearest-neighbor search)
- Image description (captioning)
- Image description with visual grounding (phrase grounding)
- Image generation
Performance Evaluation
- Classification (Accuracy, Top-k accuracy)
- Detection
  - Intersection over Union (IoU)
  - Precision, recall
  - F1 score
  - Average Precision (AP), mAP
- Segmentation
  - Mean IoU
  - Pixel accuracy
- Retrieval
  - Recall@K
  - Mean reciprocal rank (MRR)
- Captioning & grounding
  - BLEU, CIDEr, METEOR
  - Grounding accuracy (IoU ≥ threshold)
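
As a rough illustration of three of the metrics listed above, the sketch below computes IoU for a pair of boxes, grounding accuracy at an IoU threshold, and mean reciprocal rank. The box format `(x1, y1, x2, y2)`, the default threshold of 0.5, and all function names are assumptions made for this example; they are not taken from the course material.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes, each given as (x1, y1, x2, y2) (assumed format)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes contribute no overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of paired predictions with IoU >= threshold (0.5 is an assumed default)."""
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)


def mean_reciprocal_rank(rankings, relevant_sets):
    """Mean over queries of 1 / rank of the first relevant item (0 if none is found)."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)
```

For example, `box_iou((0, 0, 2, 2), (1, 1, 3, 3))` gives 1/7: the boxes overlap in a unit square and their union covers 7 square units.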
Feature Representation
- Hand-crafted features
  - Edge detectors (Sobel, Canny), SIFT, HOG
- Learned features
  - CNN feature maps
  - Patch embeddings
  - Multimodal embeddings (vision-language)
Models and Architectures
- Fully Connected Networks (MLP)
- Convolutional Neural Networks (CNN)
  - AlexNet, VGG, ResNet
- Vision Transformers (ViT)
  - Patch tokenization
  - Self-attention
- Contrastive Models (CLIP)
- Large Language Models (LLM)
  - Token prediction
  - Prompting and conditioning
- Vision-Language Models (VLM)
  - BLIP-2, LLaVA, Qwen-VL
  - Cross-attention vs unified token spaces
- Grounded Vision Models
  - Grounding DINO
  - Open-vocabulary detection
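
To make the patch tokenization item under Vision Transformers concrete, here is a minimal sketch that splits an image into non-overlapping square patches and flattens each one. It assumes a single-channel image stored as a list of rows, and the name `patchify` is hypothetical. In an actual ViT each flattened patch would then pass through a learned linear projection and receive a position embedding, which this sketch leaves out.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened patches, in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Read the patch row by row and flatten it into one token vector.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches
```

With a 224 x 224 input and `patch_size=16`, this produces 14 x 14 = 196 tokens, each of length 256 per channel.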
Training Paradigms
- Supervised learning
- Self-supervised learning
  - Masked image modeling (MAE)
- Weakly supervised learning
- Zero-shot and few-shot learning
Loss Functions
- Cross-entropy loss
- Contrastive loss (InfoNCE)
- Focal loss
- IoU-based losses
- Metric learning losses (Triplet loss, Cosine similarity loss)
- Generative losses (diffusion loss, GAN adversarial loss)
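
Several of the losses above have short closed forms. The sketch below writes out per-example versions of cross-entropy, focal loss, InfoNCE, and triplet loss in plain Python. The function names and the default values (`gamma=2.0`, `temperature=0.07`, `margin=0.2`) are assumptions for illustration, and the focal loss shown omits the optional class-balancing weight.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target])


def focal_loss(probs, target, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma, which down-weights easy examples."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def info_nce(similarities, positive_index, temperature=0.07):
    """Softmax cross-entropy over similarity scores, with one positive among negatives."""
    logits = [s / temperature for s in similarities]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[positive_index]


def triplet_loss(dist_pos, dist_neg, margin=0.2):
    """Hinge on anchor-positive distance minus anchor-negative distance plus a margin."""
    return max(0.0, dist_pos - dist_neg + margin)
```

Setting `gamma=0.0` makes `focal_loss` equal to `cross_entropy`, which is a quick sanity check on the implementation.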
Explainability and Interpretability
- Saliency maps
- CAM / Grad-CAM
- Attention visualization
- Token-level attribution
- Logit Lens for VLMs
- Failure modes and compositional errors
Sources