Troubleshooting guide
Encountering issues during training or deployment is normal. This guide helps you quickly identify and resolve common problems.
General debugging workflow
When encountering any issue:
1. Identify symptoms clearly
↓
2. Check this guide for matching issue
↓
3. Apply suggested solutions in order
↓
4. Monitor metrics after each change
↓
5. Document what worked
Quick reference table
| Symptom | Most likely cause | First action |
|---|---|---|
| Loss stays flat | Learning rate issue | Adjust learning rate |
| Training crashes | Out of memory | Reduce batch size |
| Good training, poor validation | Overfitting | Increase augmentation |
| Slow inference | Unoptimized model | Optimize batch size |
| Inconsistent production results | Preprocessing mismatch | Verify preprocessing pipeline |
Training issues
Loss not decreasing
Symptoms:
- Training loss stays flat or barely changes
- Validation loss doesn't improve after many epochs
- Model accuracy remains at baseline (random guessing level)
Monitor your training curves. If loss hasn't decreased significantly after 20-30 epochs, you likely have this issue.
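As a rough illustration, the 20-30 epoch rule above can be automated with a small helper. This is a minimal sketch in pure Python; the function name and thresholds are hypothetical, not part of any framework:

```python
def loss_is_flat(losses, window=20, min_rel_drop=0.01):
    """Return True if loss has not dropped by at least `min_rel_drop`
    (relative) over the last `window` epochs."""
    if len(losses) < window:
        return False  # not enough history to judge yet
    start, end = losses[-window], losses[-1]
    return (start - end) < min_rel_drop * abs(start)

# Example: a curve stuck near 2.30 (random-guess cross-entropy for 10 classes)
# versus a curve that decays 5% per epoch
flat = [2.30 - 0.0001 * i for i in range(30)]
improving = [2.30 * (0.95 ** i) for i in range(30)]
print(loss_is_flat(flat))       # flat curve
print(loss_is_flat(improving))  # healthy curve
```

Running a check like this once per epoch lets you catch the problem early instead of discovering it after a long run.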
Root causes & solutions:
| Problem | How to Identify | Solution |
|---|---|---|
| Learning rate too high | Loss oscillates wildly or increases | Reduce learning rate by 10x (e.g., 0.01 → 0.001) |
| Learning rate too low | Loss decreases extremely slowly | Increase learning rate by 2-3x |
| Poor data quality | Loss stuck from the start | Review dataset - check labels, remove corrupted images |
| Class imbalance | High accuracy but poor per-class performance | Enable class balancing or use focal loss |
| Wrong architecture | Loss improves slightly then plateaus | Try a different backbone or model type |
| Bad initialization | Training diverges immediately | Restart with different random seed |
Quick fix checklist:
1. Adjust learning rate (most common fix).
2. Verify dataset correctness.
3. Manually inspect labels for accuracy.
4. Check class distribution balance.
5. Try a simpler/different architecture.
Before changing architecture or data, try adjusting the learning rate; in practice it resolves the majority of "loss not decreasing" issues.
GPU out of memory
Symptoms:
- System freezes during training
- Training aborts with an "out of memory" error
Training fails shortly after starting, often on the first or second batch. Check logs for "out of memory" errors.
Immediate solutions (in order of effectiveness):
| # | Solution | Effectiveness | Action |
|---|---|---|---|
| 1 | Reduce batch size | High | Cut batch size in half (32 → 16 → 8) |
| 2 | Enable mixed precision | High | Enable in training settings |
| 3 | Smaller input dimensions | Medium | Reduce image resolution (1024 → 512) |
| 4 | Lighter backbone | Medium | Switch to a smaller model (Large → Medium → Small) |
| 5 | Increase output downsample | Low | Change from 4x → 8x or 8x → 16x |
| 6 | Clear GPU cache | Low | Restart the training session |
Step-by-step resolution:
1. First attempt: Reduce batch size to 8 or 4
↓ Still failing?
2. Enable mixed precision training
↓ Still failing?
3. Reduce input image size by 50%
↓ Still failing?
4. Switch to smaller backbone (e.g., Medium → Small)
↓ Still failing?
5. Check you're not running other GPU processes
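The retry flow above can be expressed as a small fallback loop that halves the batch size on each out-of-memory failure. This is a pure-Python sketch with a stand-in trainer; real frameworks raise their own OOM exception type (for example, a CUDA out-of-memory error) rather than Python's `MemoryError`:

```python
def train_with_oom_fallback(train_fn, batch_size=32, min_batch=2):
    """Retry training, halving the batch size on out-of-memory errors
    (32 -> 16 -> 8 -> ...), as in step 1 of the resolution flow."""
    while batch_size >= min_batch:
        try:
            return train_fn(batch_size)
        except MemoryError:      # substitute your framework's OOM exception here
            batch_size //= 2     # cut batch size in half and retry
    raise RuntimeError("Out of memory even at the minimum batch size")

# Stand-in trainer that only 'fits' in memory at batch size 8 or below
def fake_train(bs):
    if bs > 8:
        raise MemoryError
    return bs

print(train_with_oom_fallback(fake_train))
```

Note that each halving changes the effective gradient noise, which is why the note below recommends also lowering the learning rate at very small batch sizes.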
Smaller batch sizes may make training less stable. If reducing batch size below 8, consider slightly lowering the learning rate as well.
Prevention:
- Start new projects with conservative settings (batch size 16, medium backbone)
- Monitor GPU memory usage during first few epochs
- Leave 20% GPU memory as buffer for operations
Overfitting
Symptoms:
- Training accuracy keeps improving but validation accuracy plateaus or decreases
- Large gap between training loss and validation loss
- Model performs well on training data but poorly on new images
Monitor the gap between training and validation metrics. A gap larger than 10-15% indicates overfitting.
Example:
- Training accuracy: 95%
- Validation accuracy: 70% → Clear overfitting (25% gap)
Solutions by severity:
| Severity | Gap range | Solutions |
|---|---|---|
| Mild overfitting | 5-10% | Increase data augmentation: enable more aggressive augmentations (stronger brightness/contrast variations, more rotation angles, additional geometric transforms). Add more training data: collect 20-30% more diverse images, focusing on underrepresented scenarios. |
| Moderate overfitting | 10-20% | All mild solutions, plus: Simplify the model architecture: switch to a smaller backbone, reduce model capacity. Add regularization: enable dropout (if available), add weight decay, use early stopping with patience 10-15. |
| Severe overfitting | >20% | All previous solutions, plus: Major data expansion: double your dataset size, ensuring diverse scenarios. Restart with a smaller model: use a Simple or ResNet-Simple backbone and smaller input dimensions. |
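The early-stopping regularizer mentioned in the table is simple enough to show in full. Below is a minimal pure-Python sketch (the class name and interface are illustrative, not a specific library's API); the table suggests a patience of 10-15 epochs, shortened to 3 here so the example is easy to follow:

```python
class EarlyStopping:
    """Stop when validation loss hasn't improved for `patience` epochs."""
    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when it's time to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
history = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74]   # val loss turns at epoch 4
stops = [stopper.step(v) for v in history]
print(stops)
```

In a real loop you would also restore the weights from the best epoch, not just halt training.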
Quick decision tree:
Overfitting detected?
│
├─ Gap < 10% → Increase augmentation
│
├─ Gap 10-20% → Smaller model + more augmentation
│
└─ Gap > 20% → Need more data + much simpler model
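The decision tree above maps directly to a few lines of code. A minimal sketch (function name and return strings are hypothetical):

```python
def overfitting_advice(train_acc, val_acc):
    """Map the train/validation accuracy gap to the decision tree above."""
    gap = train_acc - val_acc
    if gap < 0.05:
        return "no significant overfitting"
    if gap < 0.10:
        return "increase augmentation"
    if gap <= 0.20:
        return "smaller model + more augmentation"
    return "need more data + much simpler model"

# The earlier example: 95% training vs 70% validation accuracy (25% gap)
print(overfitting_advice(0.95, 0.70))
```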
The best cure is prevention. Use strong data augmentation from the start and monitor train/validation gap throughout training.
Production issues
Slow inference
Symptoms:
- Model takes too long to process images in production
- Cannot meet real-time requirements
- High latency in production environment
Measure inference time: If processing one image takes >100ms on your target hardware, you have a performance issue.
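Measuring latency correctly matters: the first calls are often slower (caches, lazy initialization), so discard a few warm-up runs before timing. A minimal sketch using only the standard library, with a stand-in model in place of real inference:

```python
import time

def ms_per_image(infer_fn, images, warmup=2):
    """Average per-image latency in milliseconds, after warm-up runs."""
    for img in images[:warmup]:
        infer_fn(img)                     # discard: first calls are often slower
    start = time.perf_counter()
    for img in images:
        infer_fn(img)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(images)

# Stand-in model: roughly 1 ms of fake work per image
latency = ms_per_image(lambda img: time.sleep(0.001), list(range(20)))
print(f"{latency:.1f} ms/image")
```

Compare the measured number against the >100 ms threshold above on your actual target hardware, not your development machine.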
Optimization strategies:
| Strategy | Effort | Speed improvement | Quality impact |
|---|---|---|---|
| Optimize batch size | Low | 2-5x faster | None |
| Hardware acceleration | Low | 3-10x faster | None |
| Mixed precision inference | Low | 1.5-2x faster | Minimal |
| Model quantization | Medium | 2-4x faster | Small |
| Smaller backbone | Medium | 2-5x faster | Moderate |
| Model pruning | High | 2-3x faster | Small-Moderate |
Step-by-step optimization:
Step 1: Low-hanging fruit (Try these first)
1. Increase batch size if processing multiple images
2. Enable GPU acceleration (if available)
3. Use mixed precision inference
4. Optimize image preprocessing pipeline
Step 2: Model optimization (If still too slow)
1. Switch to more efficient backbone (EfficientNet)
2. Reduce input resolution (if acceptable)
Step 3: Advanced techniques (Last resort)
1. Hardware-specific optimization
Real-world example:
Initial: ResNet, 1024x1024 input → 200ms per image
After batch optimization (batch=8) → 80ms per image
After mixed precision → 50ms per image
After switching to EfficientNet → 15ms per image
For most applications, simply optimizing batch size and enabling hardware acceleration can cut inference time by 50-70%.
Inconsistent results
Symptoms:
- Model works well in testing but poorly in production
- Same image gives different results at different times
- Results vary between development and deployment environments
Compare predictions on the same test image across environments. If results differ significantly, you have a preprocessing mismatch.
Common causes & fixes:
| Issue | Problem / Symptoms | Training Setup | Production Setup | Fix / Prevention |
|---|---|---|---|---|
| Preprocessing mismatch | Production preprocessing differs from training | | | Verify the full preprocessing pipeline (normalization, resizing, input dimensions) matches training exactly |
| Color space issues | Wrong channel order silently degrades predictions | RGB during training | Potential BGR in production | Confirm channel order at the deployment boundary and convert if needed |
| Resolution handling | Inconsistent image scaling, stretched or misaligned inputs | Resize to 512x512 with padding to preserve aspect ratio | Resize must match training exactly | Reuse the training resize code in production |
- Always verify preprocessing matches training exactly.
- Color channel order mismatches are one of the most common deployment issues.
- Resolution and aspect ratio mismatches can severely degrade model performance.
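As an illustration of the resolution point, here is the geometry of an aspect-ratio-preserving resize-with-padding ("letterbox") into a square canvas, matching the 512x512-with-padding setup mentioned above. Pure Python, function name hypothetical; both environments must compute these numbers identically:

```python
def letterbox_geometry(w, h, target=512):
    """Compute the scaled size and padding for an aspect-ratio-preserving
    resize into a target x target square (resize + pad)."""
    scale = target / max(w, h)                    # shrink longest side to target
    new_w, new_h = round(w * scale), round(h * scale)
    pad_w, pad_h = target - new_w, target - new_h  # padding fills the remainder
    return (new_w, new_h), (pad_w, pad_h)

# A 1024x768 image scaled into a 512x512 canvas: 512x384 image + 128 px padding
size, pad = letterbox_geometry(1024, 768)
print(size, pad)
```

If production instead stretches the image to 512x512, objects are distorted by the aspect-ratio change and accuracy drops even though the input shape looks correct.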
Debug checklist
Run through this checklist systematically:
Environment check:
- Same model version in both environments
Preprocessing check:
- Identical normalization values
- Identical input dimensions
- Same color space
- Same resizing method
- Same aspect ratio handling
Data check:
- Input data type matches
- Value range matches
- Channel order matches
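The checklist above can be automated as a parity check that diffs the preprocessing configuration between environments. A minimal sketch; the config keys are hypothetical placeholders for however your pipeline stores these settings:

```python
def preprocessing_parity(train_cfg, prod_cfg):
    """Return the checklist fields that differ between environments."""
    keys = ["normalization", "input_size", "color_space",
            "resize_method", "aspect_ratio_mode", "dtype", "value_range"]
    return [k for k in keys if train_cfg.get(k) != prod_cfg.get(k)]

train = {"normalization": (0.5, 0.5), "input_size": (512, 512),
         "color_space": "RGB", "resize_method": "bilinear",
         "aspect_ratio_mode": "pad", "dtype": "float32",
         "value_range": (0.0, 1.0)}
prod = dict(train, color_space="BGR")   # a classic deployment bug

print(preprocessing_parity(train, prod))
```

Running such a check at deployment time turns a subtle accuracy regression into an explicit, actionable error report.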
Related resources
- Advanced Configuration - Parameter tuning guide
- Best Practices - Prevention strategies
- Model Training Guide - Training fundamentals