Troubleshooting guide
Encountering issues during training or deployment is normal. This guide helps you quickly identify and resolve common problems.
General debugging workflow
When encountering any issue:
1. Identify symptoms clearly
↓
2. Check this guide for matching issue
↓
3. Apply suggested solutions in order
↓
4. Monitor metrics after each change
↓
5. Document what worked
Quick reference table
| Symptom | Most likely cause | First action |
|---|---|---|
| Loss stays flat | Learning rate issue | Adjust learning rate |
| Training crashes | Out of memory | Reduce batch size |
| Good training, poor validation | Overfitting | Increase augmentation |
| Slow inference | Unoptimized model | Optimize batch size |
| Inconsistent production results | Preprocessing mismatch | Verify preprocessing pipeline |
Training issues
Loss not decreasing
Symptoms:
- Training loss stays flat or barely changes
- Validation loss doesn't improve after many epochs
- Model accuracy remains at baseline (random guessing level)
Monitor your training curves. If loss hasn't decreased significantly after 20-30 epochs, you likely have this issue.
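As a rough illustration, the 20-30 epoch rule above can be automated with a small helper. This is a minimal sketch in pure Python; the function name and thresholds are hypothetical, not part of any framework:

```python
def loss_is_flat(losses, window=20, min_rel_drop=0.01):
    """Return True if loss has not dropped by at least `min_rel_drop`
    (relative) over the last `window` epochs."""
    if len(losses) < window:
        return False  # not enough history to judge yet
    start, end = losses[-window], losses[-1]
    return (start - end) < min_rel_drop * abs(start)

# Example: a curve stuck near 2.30 (random-guess cross-entropy for 10 classes)
# versus a curve that decays 5% per epoch
flat = [2.30 - 0.0001 * i for i in range(30)]
improving = [2.30 * (0.95 ** i) for i in range(30)]
print(loss_is_flat(flat))       # flat curve
print(loss_is_flat(improving))  # healthy curve
```

Running a check like this once per epoch lets you catch the problem early instead of discovering it after a long run.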
Root causes & solutions:
| Problem | How to Identify | Solution |
|---|---|---|
| Learning rate too high | Loss oscillates wildly or increases | Reduce learning rate by 10x (e.g., 0.01 → 0.001) |
| Learning rate too low | Loss decreases extremely slowly | Increase learning rate by 2-3x |
| Poor data quality | Loss stuck from the start | Review dataset - check labels, remove corrupted images |
| Class imbalance | High accuracy but poor per-class performance | Enable class balancing or use focal loss |
| Wrong architecture | Loss improves slightly then plateaus | Try a different backbone or model type |
| Bad initialization | Training diverges immediately | Restart with different random seed |
Quick fix checklist:
1. Adjust learning rate (most common fix).
2. Verify dataset correctness.
3. Manually inspect labels for accuracy.
4. Check class distribution balance.
5. Try a simpler/different architecture.
Before changing architecture or data, try adjusting the learning rate; in practice it resolves the majority of "loss not decreasing" issues.
GPU out of memory
Symptoms:
- System freezes during training
- Training aborts with an "out of memory" error
Training fails shortly after starting, often on the first or second batch. Check logs for "out of memory" errors.
Immediate solutions (in order of effectiveness):
| # | Solution | Effectiveness | Action |
|---|---|---|---|
| 1 | Reduce batch size | High | Cut batch size in half (32 → 16 → 8) |
| 2 | Enable mixed precision | High | Enable in training settings |
| 3 | Smaller input dimensions | Medium | Reduce image resolution (1024 → 512) |
| 4 | Lighter backbone | Medium | Switch to a smaller model (Large → Medium → Small) |
| 5 | Increase output downsample | Low | Change from 4x → 8x or 8x → 16x |
| 6 | Clear GPU cache | Low | Restart the training session |
Step-by-step resolution:
1. First attempt: Reduce batch size to 8 or 4
↓ Still failing?
2. Enable mixed precision training
↓ Still failing?
3. Reduce input image size by 50%
↓ Still failing?
4. Switch to smaller backbone (e.g., Medium → Small)
↓ Still failing?
5. Check you're not running other GPU processes
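The retry flow above can be expressed as a small fallback loop that halves the batch size on each out-of-memory failure. This is a pure-Python sketch with a stand-in trainer; real frameworks raise their own OOM exception type (for example, a CUDA out-of-memory error) rather than Python's `MemoryError`:

```python
def train_with_oom_fallback(train_fn, batch_size=32, min_batch=2):
    """Retry training, halving the batch size on out-of-memory errors
    (32 -> 16 -> 8 -> ...), as in step 1 of the resolution flow."""
    while batch_size >= min_batch:
        try:
            return train_fn(batch_size)
        except MemoryError:      # substitute your framework's OOM exception here
            batch_size //= 2     # cut batch size in half and retry
    raise RuntimeError("Out of memory even at the minimum batch size")

# Stand-in trainer that only 'fits' in memory at batch size 8 or below
def fake_train(bs):
    if bs > 8:
        raise MemoryError
    return bs

print(train_with_oom_fallback(fake_train))
```

Note that each halving changes the effective gradient noise, which is why the note below recommends also lowering the learning rate at very small batch sizes.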
Smaller batch sizes may make training less stable. If reducing batch size below 8, consider slightly lowering the learning rate as well.
Prevention:
- Start new projects with conservative settings (batch size 16, medium backbone)
- Monitor GPU memory usage during first few epochs
- Leave 20% GPU memory as buffer for operations
Overfitting
Symptoms:
- Training accuracy keeps improving but validation accuracy plateaus or decreases
- Large gap between training loss and validation loss
- Model performs well on training data but poorly on new images
Monitor the gap between training and validation metrics. A gap larger than 10-15% indicates overfitting.
Example:
- Training accuracy: 95%
- Validation accuracy: 70% → Clear overfitting (25% gap)
Solutions by severity:
| Severity | Gap range | Solutions |
|---|---|---|
| Mild overfitting | 5-10% | Increase data augmentation: enable more aggressive augmentations (stronger brightness/contrast variations, more rotation angles, additional geometric transforms). Add more training data: collect 20-30% more diverse images, focusing on underrepresented scenarios. |
| Moderate overfitting | 10-20% | All mild solutions, plus: Simplify the model architecture: switch to a smaller backbone, reduce model capacity. Add regularization: enable dropout (if available), add weight decay, use early stopping with patience 10-15. |
| Severe overfitting | >20% | All previous solutions, plus: Major data expansion: double your dataset size, ensuring diverse scenarios. Restart with a smaller model: use a Simple or ResNet-Simple backbone and smaller input dimensions. |
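The early-stopping regularizer mentioned in the table is simple enough to show in full. Below is a minimal pure-Python sketch (the class name and interface are illustrative, not a specific library's API); the table suggests a patience of 10-15 epochs, shortened to 3 here so the example is easy to follow:

```python
class EarlyStopping:
    """Stop when validation loss hasn't improved for `patience` epochs."""
    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when it's time to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
history = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74]   # val loss turns at epoch 4
stops = [stopper.step(v) for v in history]
print(stops)
```

In a real loop you would also restore the weights from the best epoch, not just halt training.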
Quick decision tree:
Overfitting detected?
│
├─ Gap < 10% → Increase augmentation
│
├─ Gap 10-20% → Smaller model + more augmentation
│
└─ Gap > 20% → Need more data + much simpler model
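The decision tree above maps directly to a few lines of code. A minimal sketch (function name and return strings are hypothetical):

```python
def overfitting_advice(train_acc, val_acc):
    """Map the train/validation accuracy gap to the decision tree above."""
    gap = train_acc - val_acc
    if gap < 0.05:
        return "no significant overfitting"
    if gap < 0.10:
        return "increase augmentation"
    if gap <= 0.20:
        return "smaller model + more augmentation"
    return "need more data + much simpler model"

# The earlier example: 95% training vs 70% validation accuracy (25% gap)
print(overfitting_advice(0.95, 0.70))
```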
The best cure is prevention. Use strong data augmentation from the start and monitor train/validation gap throughout training.
Production issues
Slow inference
Symptoms:
- Model takes too long to process images in production
- Cannot meet real-time requirements
- High latency in production environment
Measure inference time: If processing one image takes >100ms on your target hardware, you have a performance issue.
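Measuring latency correctly matters: the first calls are often slower (caches, lazy initialization), so discard a few warm-up runs before timing. A minimal sketch using only the standard library, with a stand-in model in place of real inference:

```python
import time

def ms_per_image(infer_fn, images, warmup=2):
    """Average per-image latency in milliseconds, after warm-up runs."""
    for img in images[:warmup]:
        infer_fn(img)                     # discard: first calls are often slower
    start = time.perf_counter()
    for img in images:
        infer_fn(img)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(images)

# Stand-in model: roughly 1 ms of fake work per image
latency = ms_per_image(lambda img: time.sleep(0.001), list(range(20)))
print(f"{latency:.1f} ms/image")
```

Compare the measured number against the >100 ms threshold above on your actual target hardware, not your development machine.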
Optimization strategies:
| Strategy | Effort | Speed improvement | Quality impact |
|---|---|---|---|
| Optimize batch size | Low | 2-5x faster | None |
| Hardware acceleration | Low | 3-10x faster | None |
| Mixed precision inference | Low | 1.5-2x faster | Minimal |
| Model quantization | Medium | 2-4x faster | Small |
| Smaller backbone | Medium | 2-5x faster | Moderate |
| Model pruning | High | 2-3x faster | Small-Moderate |
Step-by-step optimization:
Step 1: Low-hanging fruit (Try these first)
1. Increase batch size if processing multiple images
2. Enable GPU acceleration (if available)
3. Use mixed precision inference
4. Optimize image preprocessing pipeline
Step 2: Model optimization (If still too slow)
1. Switch to more efficient backbone (EfficientNet)
2. Reduce input resolution (if acceptable)
Step 3: Advanced techniques (Last resort)
1. Hardware-specific optimization
Real-world example:
Initial: ResNet, 1024x1024 input → 200ms per image
After batch optimization (batch=8) → 80ms per image
After mixed precision → 50ms per image
After switching to EfficientNet → 15ms per image
For most applications, simply optimizing batch size and enabling hardware acceleration can cut inference time by 50-70%.
Inconsistent results
Symptoms:
- Model works well in testing but poorly in production
- Same image gives different results at different times
- Results vary between development and deployment environments
Compare predictions on the same test image across environments. If results differ significantly, you have a preprocessing mismatch.
Common causes & fixes:
| Issue | Problem / Symptoms | Training Setup | Production Setup | Fix / Prevention |
|---|---|---|---|---|
| Preprocessing mismatch | Production preprocessing differs from training | | | Verify the full preprocessing pipeline (normalization, resizing, input dimensions) matches training exactly |
| Color space issues | Wrong channel order silently degrades predictions | RGB during training | Potential BGR in production | Confirm channel order at the deployment boundary and convert if needed |
| Resolution handling | Inconsistent image scaling, stretched or misaligned inputs | Resize to 512x512 with padding to preserve aspect ratio | Resize must match training exactly | Reuse the training resize code in production |
- Always verify preprocessing matches training exactly.
- Color channel order mismatches are one of the most common deployment issues.
- Resolution and aspect ratio mismatches can severely degrade model performance.
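As an illustration of the resolution point, here is the geometry of an aspect-ratio-preserving resize-with-padding ("letterbox") into a square canvas, matching the 512x512-with-padding setup mentioned above. Pure Python, function name hypothetical; both environments must compute these numbers identically:

```python
def letterbox_geometry(w, h, target=512):
    """Compute the scaled size and padding for an aspect-ratio-preserving
    resize into a target x target square (resize + pad)."""
    scale = target / max(w, h)                    # shrink longest side to target
    new_w, new_h = round(w * scale), round(h * scale)
    pad_w, pad_h = target - new_w, target - new_h  # padding fills the remainder
    return (new_w, new_h), (pad_w, pad_h)

# A 1024x768 image scaled into a 512x512 canvas: 512x384 image + 128 px padding
size, pad = letterbox_geometry(1024, 768)
print(size, pad)
```

If production instead stretches the image to 512x512, objects are distorted by the aspect-ratio change and accuracy drops even though the input shape looks correct.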
Debug checklist
Run through this checklist systematically:
Environment check:
- Same model version in both environments
Preprocessing check:
- Identical normalization values
- Identical input dimensions
- Same color space
- Same resizing method
- Same aspect ratio handling
Data check:
- Input data type matches
- Value range matches
- Channel order matches
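The checklist above can be automated as a parity check that diffs the preprocessing configuration between environments. A minimal sketch; the config keys are hypothetical placeholders for however your pipeline stores these settings:

```python
def preprocessing_parity(train_cfg, prod_cfg):
    """Return the checklist fields that differ between environments."""
    keys = ["normalization", "input_size", "color_space",
            "resize_method", "aspect_ratio_mode", "dtype", "value_range"]
    return [k for k in keys if train_cfg.get(k) != prod_cfg.get(k)]

train = {"normalization": (0.5, 0.5), "input_size": (512, 512),
         "color_space": "RGB", "resize_method": "bilinear",
         "aspect_ratio_mode": "pad", "dtype": "float32",
         "value_range": (0.0, 1.0)}
prod = dict(train, color_space="BGR")   # a classic deployment bug

print(preprocessing_parity(train, prod))
```

Running such a check at deployment time turns a subtle accuracy regression into an explicit, actionable error report.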
Related resources
- Advanced Configuration - Parameter tuning guide
- Best Practices - Prevention strategies
- Model Training Guide - Training fundamentals