Validation and Trust#

This page summarizes how NRTK is validated today, the available evidence, the remaining gaps, and how users should interpret perturbation-based robustness results (see Concepts of Robustness in Computer Vision). It is not a full T&E manual, but a transparency resource for anyone integrating NRTK into evaluation workflows.

NRTK provides rapid, cost-effective perturbation testing to identify potential model vulnerabilities and robustness gaps. The perturbations are designed to be indicative rather than authoritative. They provide fast, low-cost stress tests to expose potential vulnerabilities, not statistically definitive operational predictions.

Important

NRTK perturbations are designed to complement, not replace, complete model validation. They are one tool in a comprehensive T&E strategy, not a replacement for evaluation with real operational data.

Validation Status#

We’re transparent about what’s verified, what’s in progress, and what’s planned.

Status as of February 2026. Updates occur quarterly.

Validation Aspect

Status

Details

Algorithmic Correctness

✅ Verified

Unit and integration testing; continuous integration

Reproducibility

✅ Verified

Deterministic outputs with fixed seeds; documented test cases

Parameter Validation

✅ Verified

Range checks, unit consistency, fail-fast logic, and default-parameter justification

Cross-Tool Integration

✅ Verified

MAITE compliance; tested with DataEval, XAITK

Operational Realism

⚙️ In Progress

Collecting real-world degraded imagery for comparison

Domain Coverage

⚙️ In Progress

Aerial, maritime, overhead/WAMI, automotive, biometric (long-range)

Modalities Coverage

⚙️ In Progress

Still imagery → FMV (NRTK v1.1+); long-range video in progress

Real-World Benchmarking

⚙️ In Progress

RarePlanes, BDD100k, non-public WAMI and maritime datasets

Independent Validation

⚙️ In Progress

NAML’26 and MSS’26 (March 2026); SPIE’26 (April 2026)

How we validate:

  • Algorithmic: Mathematical correctness of perturbation implementations

  • Empirical: Comparison with real-world degraded imagery where available

  • Operational: Feedback from T&E engineers using NRTK in actual workflows

  • Methodological: Experimentally validated using methodology grounded in academic literature

  • Reproducibility: Consistent outputs across platforms and versions

Note

For module-specific validation details, see:

Each perturbation module page includes parameter documentation and usage examples.

When to Use NRTK#

✅ Good For#

  • Early-stage robustness screening

  • Parameter sensitivity analysis

  • Identifying potential failure modes

  • Data augmentation during training

  • Comparing robustness across models

  • Cost-performance trade-off studies

⚠️ Supplement with Mission-Representative Data#

NRTK is reliable for perturbation-driven insights, but not a substitute for mission-representative data. Combine NRTK results with operational evaluation for:

  • Final deployment decisions

  • Safety-critical systems

  • Novel operational environments

❌ Not Appropriate For#

  • Sole source of model validation

  • Regulatory certification or compliance

  • Precise predictions of real-world performance

Known Limitations#

We document limitations openly to help users make informed decisions:

Current Scope#

  • Optimized for static images (FMV support in development)

  • Primary focus on classification and detection (segmentation/tracking in development)

  • Examples emphasize aerial imaging (expanding to ground/surface domains)

Technical Constraints#

  • Spectral domain assumptions: Defaults assume visible-spectrum RGB imagery. IR/SAR/HSI sensors require domain-appropriate optical parameters; NRTK does not provide full spectral physics for all modalities.

  • Perturbation composition effects: Applying perturbations sequentially may not perfectly replicate real-world conditions where effects occur simultaneously. For example, sensor noise and atmospheric blur interact differently than applying blur then noise in post-processing.

Validation Evidence#

  • Real-world imagery comparison ongoing; results published as available (e.g. ReadTheDocs, GitHub, and academic publications)

  • MSS Parallel’26: pyBSM-based perturbers evaluated on overhead imagery in WAMI format (non-public) and RarePlanes (public)

  • NAML’26: Custom synthetic waterdroplet-on-lens perturbation on maritime/aerial data (non-public)

  • SPIE’26 (in progress): Improving AI Test and Evaluation via Semantic Gap Detection and Generative Augmentation — generative AI perturbation approaches on BDD100k

  • Biometric application (upcoming): Detection of individuals in long-range video; comparative analysis of pyBSM-based ground range simulation against real-world ground range

  • Community feedback on perturbation realism is limited but growing

We track these in our GitHub Issues and prioritize based on community feedback and DoD use-case requirements.

Validation Roadmap#

Embedding-space validation evaluates whether perturbations produce monotonic, stable, and interpretable changes in model representations.

Initiated Nov’25 (Ongoing)#

  • ⚙️ Quantify perturbation effects in embedding space for photometric, geometric, and optical modules using standard baseline models

Planned for Mar’26#

  • 📋 Compare optical-perturbation outputs against real degraded imagery with known atmospheric and sensor parameters — detection of individuals in long-range video with comparative analysis of pyBSM ground range vs real-world ground range

Q1’26 (Dissemination & Reporting)#

  • ⚙️ NAML’26 and MSS Parallel’26 conference presentations (March 2026)

  • ⚙️ Improving AI Test and Evaluation via Semantic Gap Detection and Generative Augmentation — generative AI perturbation approaches on BDD100k (SPIE Defense + Security, April 26–30)

How You Can Help#

Have real-world degraded imagery?#

If you can share operational data with known degradation factors (sensor specs, atmospheric conditions, etc.), contact us at nrtk@kitware.com. This information directly improves our validation evidence.

Found unexpected behavior?#

Report it in GitHub Issues with details about your use case. User feedback is a critical validation input.

Using NRTK in your T&E workflow?#

Share your experience. Case studies help us understand what validation evidence matters most to the community.

Bottom Line#

NRTK accelerates the early stages of robustness evaluation by providing systematic, parametric perturbations. It is not intended to replace operational testing, but to help users identify where deeper evaluation is required. Validation evidence grows continuously, and this page is updated quarterly to reflect new findings.

Questions? nrtk@kitware.com | Last Updated: Feb. 26 2026

Publications & Presentations#

Note

Entries will be updated with full citations after proceedings are released.

Naval Applications of Machine Learning (NAML’26) — March 2–5, 2026

Establishing Trust in Maritime Detection Models with the Natural Robustness Toolkit — Custom synthetic waterdroplet-on-lens perturbation; maritime/aerial domain (non-public data)

Military Sensing Symposia (MSS Parallel’26) — March 2–6, 2026

Understanding Sensor-based Robustness of Object Detection Models for Overhead Imagery — pyBSM-based perturbers; WAMI (non-public) and RarePlanes (public)

SPIE Defense + Security (SPIE’26) — April 26–30, 2026 (in preparation)

Improving AI Test and Evaluation via Semantic Gap Detection and Generative Augmentation — Generative AI perturbation approaches on BDD100k

How to Cite#

When referencing NRTK validation in reports, briefings, or evaluation documentation:

Recommended citation:

Kitware, Inc. (2025). NRTK Validation & Trust Documentation. Natural Robustness Toolkit. Retrieved from https://nrtk.readthedocs.io/en/stable/validation_and_trust.html

BibTeX:

@misc{nrtk_validation_2025,
  title        = {NRTK Validation \& Trust Documentation},
  author       = {{Kitware, Inc.}},
  year         = {2025},
  howpublished = {\url{https://nrtk.readthedocs.io/en/stable/validation_and_trust.html}},
  note         = {Accessed: [Insert Date]}
}

References#