
Troubleshooting guide

Encountering issues during training or deployment is normal. This guide helps you quickly identify and resolve common problems.

General debugging workflow

When encountering any issue:

1. Identify symptoms clearly

2. Check this guide for matching issue

3. Apply suggested solutions in order

4. Monitor metrics after each change

5. Document what worked

Quick reference table

| Symptom | Most likely cause | First action |
|---|---|---|
| Loss stays flat | Learning rate issue | Adjust learning rate |
| Training crashes | Out of memory | Reduce batch size |
| Good training, poor validation | Overfitting | Increase augmentation |
| Slow inference | Unoptimized model | Optimize batch size |
| Inconsistent production results | Preprocessing mismatch | Verify preprocessing pipeline |

Training issues

Loss not decreasing

Symptoms:

  • Training loss stays flat or barely changes
  • Validation loss doesn't improve after many epochs
  • Model accuracy remains at baseline (random guessing level)

How to Detect

Monitor your training curves. If loss hasn't decreased significantly after 20-30 epochs, you likely have this issue.

Root causes & solutions:

| Problem | How to Identify | Solution |
|---|---|---|
| Learning rate too high | Loss oscillates wildly or increases | Reduce learning rate by 10x (e.g., 0.01 → 0.001) |
| Learning rate too low | Loss decreases extremely slowly | Increase learning rate by 2-3x |
| Poor data quality | Loss stuck from the start | Review dataset: check labels, remove corrupted images |
| Class imbalance | High accuracy but poor per-class performance | Enable class balancing or use focal loss |
| Wrong architecture | Loss improves slightly then plateaus | Try a different backbone or model type |
| Bad initialization | Training diverges immediately | Restart with a different random seed |
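For the class-imbalance row above, "class balancing" usually means weighting the loss by inverse class frequency. A minimal sketch (the `class_weights` helper is illustrative, not a built-in API):

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights for imbalanced datasets.

    Rarer classes receive larger weights, so the loss cannot be
    minimized by simply ignoring them.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    n_classes = len(counts)
    # Weight for class c: total / (n_classes * count_c)
    return {c: total / (n_classes * n) for c, n in counts.items()}
```

Pass the resulting weights to your framework's weighted loss (e.g., a `weight` argument on a cross-entropy loss) so minority classes contribute proportionally.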

Quick fix checklist:

1. Adjust learning rate (most common fix).
2. Verify dataset correctness.
3. Manually inspect labels for accuracy.
4. Check class distribution balance.
5. Try a simpler/different architecture.

First Step

Before changing architecture or data, try adjusting the learning rate. This solves 60% of "loss not decreasing" issues.
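The learning-rate rows of the table above can be condensed into a small heuristic. This is a sketch, not a platform feature: `suggest_learning_rate` is a hypothetical helper, and the 1% improvement threshold is an illustrative choice.

```python
def suggest_learning_rate(current_lr, loss_history, window=5):
    """Heuristic LR adjustment based on recent loss behavior.

    Mirrors the table above: a loss that rises over the window suggests
    the LR is too high (divide by 10); a nearly flat loss suggests it
    is too low (multiply by 3).
    """
    recent = loss_history[-window:]
    if len(recent) < window:
        return current_lr  # not enough evidence yet
    if recent[-1] > recent[0]:
        # Oscillating or diverging: reduce by 10x (e.g., 0.01 -> 0.001)
        return current_lr / 10
    improvement = (recent[0] - recent[-1]) / recent[0]
    if improvement < 0.01:
        # Less than 1% improvement over the window: LR likely too low
        return current_lr * 3
    return current_lr  # loss is decreasing normally
```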

GPU out of memory

Symptoms:

  • System freezes during training
  • "Out of memory" error message during training

How to Detect

Training fails shortly after starting, often on the first or second batch. Check logs for "out of memory" errors.

Immediate solutions (in order of effectiveness):

| Solution | Effectiveness | Action |
|---|---|---|
| 1. Reduce batch size | High | Cut batch size in half (32 → 16 → 8) |
| 2. Enable mixed precision | High | Enable in training settings |
| 3. Smaller input dimensions | Medium | Reduce image resolution (1024 → 512) |
| 4. Lighter backbone | Medium | Switch to a smaller model (Large → Medium → Small) |
| 5. Increase output downsample | Low | Change from 4x → 8x or 8x → 16x |
| 6. Clear GPU cache | Low | Restart the training session |

Step-by-step resolution:

1. First attempt: Reduce batch size to 8 or 4
↓ Still failing?
2. Enable mixed precision training
↓ Still failing?
3. Reduce input image size by 50%
↓ Still failing?
4. Switch to smaller backbone (e.g., Medium → Small)
↓ Still failing?
5. Check that you're not running other GPU processes

Performance Trade-offs

Smaller batch sizes may make training less stable. If reducing batch size below 8, consider slightly lowering the learning rate as well.
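The halving sequence above (32 → 16 → 8 → 4) can be automated. This sketch assumes a hypothetical `train_one_epoch(batch_size)` callable that raises `MemoryError` when the batch does not fit; real frameworks raise their own OOM error types (e.g., PyTorch surfaces CUDA OOM as a `RuntimeError`), so catch accordingly.

```python
def train_with_oom_backoff(train_one_epoch, batch_size=32, min_batch=4):
    """Retry training with a halved batch size after each OOM failure.

    `train_one_epoch` is a stand-in for your training call; it should
    raise MemoryError (or your framework's OOM error) when the batch
    does not fit on the device.
    """
    while batch_size >= min_batch:
        try:
            train_one_epoch(batch_size)
            return batch_size  # training fit at this batch size
        except MemoryError:
            batch_size //= 2  # 32 -> 16 -> 8 -> 4, as suggested above
    raise RuntimeError("Even the minimum batch size does not fit in memory")
```

Per the trade-off note above, if the surviving batch size drops below 8, consider lowering the learning rate slightly as well.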

Prevention tips
  • Start new projects with conservative settings (batch size 16, medium backbone)
  • Monitor GPU memory usage during first few epochs
  • Leave 20% GPU memory as buffer for operations

Overfitting

Symptoms:

  • Training accuracy keeps improving but validation accuracy plateaus or decreases
  • Large gap between training loss and validation loss
  • Model performs well on training data but poorly on new images

How to Detect

Monitor the gap between training and validation metrics. A gap larger than 10-15% indicates overfitting.

Example:

  • Training accuracy: 95%
  • Validation accuracy: 70% → Clear overfitting (25% gap)
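The gap thresholds used in this section can be encoded directly. A minimal sketch (the function name and band labels are illustrative):

```python
def overfitting_severity(train_acc, val_acc):
    """Classify the train/validation gap per this guide's thresholds:

    <=5% none, 5-10% mild, 10-20% moderate, >20% severe.
    Accuracies are fractions in [0, 1].
    """
    gap = train_acc - val_acc
    if gap > 0.20:
        return "severe"
    if gap > 0.10:
        return "moderate"
    if gap > 0.05:
        return "mild"
    return "none"
```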

Solutions by severity:

Mild overfitting (5-10% gap):

  • Increase data augmentation: enable more aggressive augmentations (stronger brightness/contrast variations, more rotation angles, additional geometric transforms)
  • Add more training data: collect 20-30% more diverse images, focusing on underrepresented scenarios

Moderate overfitting (10-20% gap): all mild solutions, plus:

  • Simplify model architecture: switch to a smaller backbone size, reduce model capacity
  • Add regularization: enable dropout (if available), add weight decay, use early stopping with patience 10-15

Severe overfitting (>20% gap): all previous solutions, plus:

  • Major data expansion: double your dataset size, ensure diverse scenarios
  • Restart with a smaller model: use a Simple or ResNet-Simple backbone and smaller input dimensions
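The early stopping mentioned under moderate overfitting can be sketched in a few lines. Frameworks ship their own implementations; this minimal version just tracks the best validation loss and a patience counter.

```python
class EarlyStopping:
    """Stop training when validation loss stops improving.

    `patience` is how many epochs without improvement to tolerate
    (this guide suggests 10-15); `min_delta` is the smallest change
    that counts as an improvement.
    """

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Call `step(val_loss)` once per epoch and break out of the training loop when it returns True.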

Quick decision tree:

Overfitting detected?
├─ Gap < 10% → Increase augmentation
├─ Gap 10-20% → Smaller model + more augmentation
└─ Gap > 20% → Need more data + much simpler model

Prevention

The best cure is prevention. Use strong data augmentation from the start and monitor train/validation gap throughout training.
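As a toy illustration of the augmentation this guide recommends, here is a flip-plus-brightness jitter applied to a single row of pixel values. Real pipelines use library transforms over full image arrays; all names and parameter values here are illustrative.

```python
import random

def augment(pixels, brightness_jitter=0.2, flip_prob=0.5, rng=None):
    """Toy augmentation: random horizontal flip + brightness jitter.

    `pixels` is one image row as floats in [0, 1]; outputs are clamped
    back into that range after scaling.
    """
    rng = rng or random.Random()
    if rng.random() < flip_prob:
        pixels = pixels[::-1]  # horizontal flip
    factor = 1.0 + rng.uniform(-brightness_jitter, brightness_jitter)
    return [min(1.0, max(0.0, v * factor)) for v in pixels]
```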

Production issues

Slow inference

Symptoms:

  • Model takes too long to process images in production
  • Cannot meet real-time requirements
  • High latency in production environment

How to Detect

Measure inference time: If processing one image takes >100ms on your target hardware, you have a performance issue.

Optimization strategies:

| Strategy | Effort | Speed improvement | Quality impact |
|---|---|---|---|
| Optimize batch size | Low | 2-5x faster | None |
| Hardware acceleration | Low | 3-10x faster | None |
| Mixed precision inference | Low | 1.5-2x faster | Minimal |
| Model quantization | Medium | 2-4x faster | Small |
| Smaller backbone | Medium | 2-5x faster | Moderate |
| Model pruning | High | 2-3x faster | Small-Moderate |

Step-by-step optimization:

Step 1: Low-hanging fruit (Try these first)

1. Increase batch size if processing multiple images
2. Enable GPU acceleration (if available)
3. Use mixed precision inference
4. Optimize image preprocessing pipeline

Step 2: Model optimization (If still too slow)

1. Switch to more efficient backbone (EfficientNet)
2. Reduce input resolution (if acceptable)

Step 3: Advanced techniques (Last resort)

1. Hardware-specific optimization

Real-world example:

Initial: ResNet, 1024x1024 input → 200ms per image
After batch optimization (batch=8) → 80ms per image
After mixed precision → 50ms per image
After switching to EfficientNet → 15ms per image

Quick Win

For most applications, simply optimizing batch size and enabling hardware acceleration can cut inference time by 50-70%.
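Measuring per-image latency before and after each optimization keeps the numbers honest. A small timing harness, where `predict` is a stand-in for your model's batched inference call:

```python
import time

def measure_latency_ms(predict, images, batch_size=8):
    """Average per-image inference latency in milliseconds.

    `predict` is a hypothetical callable that takes a list (batch) of
    images. Larger batches amortize per-call overhead, which is the
    'optimize batch size' step above.
    """
    start = time.perf_counter()
    for i in range(0, len(images), batch_size):
        predict(images[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / len(images)
```

In practice, run a few warm-up batches first so one-time initialization costs do not inflate the measurement.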

Inconsistent results

Symptoms:

  • Model works well in testing but poorly in production
  • Same image gives different results at different times
  • Results vary between development and deployment environments

How to Detect

Compare predictions on the same test image across environments. If results differ significantly, you have a preprocessing mismatch.

Common causes & fixes:

Preprocessing mismatch

Problem: production preprocessing differs from training.

  • Training setup: normalization (mean), resolution 512x512, aspect ratio preserved
  • Production setup: normalization different or missing, resolution 640x480, aspect ratio stretched

Fix / prevention:

  • ✅ Document the preprocessing pipeline
  • ✅ Use identical preprocessing steps in both environments
  • ✅ Test that preprocessing outputs match

Color space issues

Problem: RGB vs BGR confusion; colors look wrong in predictions; the model performs worse than in testing.

  • Training setup: RGB channel order
  • Production setup: potentially BGR channel order

Fix / prevention:

  • ✅ Verify the color channel order matches
  • ✅ Add a conversion if necessary

Resolution handling

Problem: inconsistent image scaling; stretched or misaligned inputs.

  • Training setup: resize to 512x512 with padding to preserve aspect ratio
  • Production setup: resizing must match training exactly

Fix / prevention:

  • ✅ Preserve the aspect ratio
  • ✅ Match the interpolation method
  • ✅ Match the padding strategy
  • ❌ Wrong: direct resize to 512x512 (stretches the image)
  • ✅ Correct: resize with padding to 512x512
Critical Checks
  • Always verify preprocessing matches training exactly.
  • Color channel order mismatches are one of the most common deployment issues.
  • Resolution and aspect ratio mismatches can severely degrade model performance.
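The "resize with padding" fix can be made concrete by computing the letterbox geometry: scale the longer side to the target, then pad the shorter side symmetrically. The helper name is illustrative; apply the returned values with your image library of choice.

```python
def letterbox_geometry(src_w, src_h, target=512):
    """Scale and padding for an aspect-preserving resize to target x target.

    Returns (new_w, new_h, pad_x, pad_y): the scaled image size and the
    left/top padding that centers it in the target square.
    """
    scale = target / max(src_w, src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x = (target - new_w) // 2  # horizontal padding per side
    pad_y = (target - new_h) // 2  # vertical padding per side
    return new_w, new_h, pad_x, pad_y
```

For a 640x480 input and a 512 target, this yields a 512x384 image with 64 pixels of padding above and below, instead of a stretched 512x512 image.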

Debug checklist

Run through this checklist systematically:

Environment check:

  • Same model version in both environments

Preprocessing check:

  • Identical normalization values
  • Identical input dimensions
  • Same color space
  • Same resizing method
  • Same aspect ratio handling

Data check:

  • Input data type matches
  • Value range matches
  • Channel order matches
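One way to run the preprocessing checks above systematically is to fingerprint the preprocessed tensor for the same test image in each environment and compare. This sketch hashes a flat list of float values; the 5-decimal rounding is an arbitrary tolerance chosen to absorb harmless float noise.

```python
import hashlib

def preprocessing_fingerprint(pixels):
    """Fingerprint a preprocessed image given as a flat list of floats.

    Differing fingerprints for the same input image in training and
    production mean the pipelines diverge somewhere (normalization,
    resizing, channel order, ...).
    """
    data = ",".join(f"{v:.5f}" for v in pixels).encode()
    return hashlib.sha256(data).hexdigest()
```

Because the hash covers the values in order, it also catches channel-order mismatches such as RGB vs BGR, not just normalization differences.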