NIST AI RMF (AI 100-1)
The NIST AI Risk Management Framework (AI 100-1, January 2023) defines four functions: GOVERN, MAP, MEASURE, and MANAGE. ConstantX is a TEVV (test, evaluation, verification, and validation) service that operates primarily within the MEASURE function, contributes to three MAP subcategories through threat modeling, and produces inputs that inform MANAGE deployment decisions.
Function Overview
| Function | ConstantX Role |
|---|---|
| GOVERN | Not in scope. GOVERN defines organizational policies, accountability structures, and roles. ConstantX evaluation outputs can serve as evidence inputs to governance documentation, but ConstantX does not perform GOVERN activities. |
| MAP | Partial contribution through threat modeling. The T-code threat model walk directly satisfies MAP 2.3 (TEVV documentation), contributes to MAP 3.2 (cost documentation), and defines MAP 3.3 (application scope). The remaining MAP categories — organizational context, risk tolerance, human oversight process — are out of scope. |
| MEASURE | Primary function. ConstantX directly satisfies 12 MEASURE subcategories with empirical verdict data, confidence intervals, and cryptographic evidence chains. |
| MANAGE | ConstantX evidence informs two MANAGE subcategories: go/no-go deployment decisions (MANAGE 1.1) and validation that deactivation mechanisms function under adversarial conditions (MANAGE 2.4). Organizational response planning and incident procedures are out of scope. |
MAP — Contributions
ConstantX engagements begin with a T-code threat model walk against the target system. This walk addresses three MAP subcategories: it directly satisfies MAP 2.3, contributes to MAP 3.2, and defines MAP 3.3. The remaining MAP categories are organizational activities that ConstantX does not perform.
| Subcategory | NIST Description (AI 100-1) | ConstantX Output |
|---|---|---|
| MAP 2.3 | Scientific integrity and TEVV considerations are identified and documented, including those related to experimental design, data collection and selection, system trustworthiness, and construct validation. | The threat model walk (T1–T17 against the target system) is the TEVV design document: it identifies which attack techniques to test, what constructs each scenario validates, and what is structurally out of scope. Coverage Boundaries documentation makes the scope explicit and defensible. |
| MAP 3.2 | Potential costs, including non-monetary costs, which result from expected or realized AI errors or system functionality and trustworthiness — as connected to organizational risk tolerance — are examined and documented. | Adversarial scenarios document the specific consequence of each threat if realized: data exfiltration (TM-004), unauthorized command execution (TM-005), identity spoofing (TM-012), supply chain compromise (TM-018). The undefined_behavior verdict identifies where costs were not contained by the enforcement surface. |
| MAP 3.3 | Targeted application scope is specified and documented based on the system’s capability, established context, and AI system categorization. | The engagement scope document specifies the target deployment configuration (model snapshot, controller version, tool set), the suite version applied, and the Coverage Boundaries that define what the evaluation does and does not assess. |
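As a concrete illustration of the MAP 3.3 scope artifact, a minimal engagement scope record might look like the sketch below. All field names and values are invented for illustration; this is not the actual ConstantX artifact schema.

```python
# Hypothetical engagement scope record (illustrative only, not the real schema).
# It pins the exact deployment configuration the evaluation applies to and makes
# the Coverage Boundaries explicit in the same artifact.
scope = {
    "target": {
        "model_snapshot": "example-model-2025-06-01",  # dated snapshot, never an alias
        "controller_version": "1.4.2",
        "tools": ["file_read", "file_write", "shell_exec"],
    },
    "suite_version": "2025.2",
    "coverage_boundaries": {
        "in_scope": ["sandbox-testable enforcement behaviors"],
        "out_of_scope": [
            "behavioral drift",
            "deceptive alignment",
            "multi-service lateral movement",
        ],
    },
}
```

Binding the suite version and dated model snapshot into one record is what makes the scope defensible to an auditor: the claim is about this configuration, not the model family in general.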
MEASURE — Satisfied Subcategories
The following 12 subcategories are satisfied with empirical evidence from completed engagements. Descriptions are quoted directly from NIST AI 100-1 (Table 3).
| Subcategory | NIST Description (AI 100-1) | ConstantX Output |
|---|---|---|
| MEASURE 1.1 | Approaches and metrics for measurement of AI risks enumerated during the MAP function are selected for implementation starting with the most significant AI risks. The risks or trustworthiness characteristics that will not — or cannot — be measured are properly documented. | Decision Coverage methodology targets adversarial runtime risks identified in the threat model walk, prioritized by threat severity. Coverage Boundaries explicitly documents structural limits on what cannot be measured within sandbox evaluation scope (e.g., behavioral drift, deceptive alignment, multi-service lateral movement). |
| MEASURE 1.3 | Internal experts who did not serve as front-line developers for the system and/or independent assessors are involved in regular assessments and updates. | ConstantX is the independent third-party evaluator. It does not develop the system under test. Evaluations are conducted against an externally defined target by assessors independent of that system’s development team. |
| MEASURE 2.1 | Test sets, metrics, and details about the tools used during TEVV are documented. | Every scenario carries a threat_id, scenario spec, and enforcement tool configuration. Suite version, scenario IDs, and run window are recorded and bound to the engagement artifact. Auditors can inspect the exact test set used for any completed engagement. |
| MEASURE 2.3 | AI system performance or assurance criteria are measured qualitatively or quantitatively and demonstrated for conditions similar to deployment setting(s). Measures are documented. | Terminal Coverage = (valid_commit + bounded_failure) / Total Runs. Measured under single-pass autonomous execution with no retries and no human-in-the-loop — the exact condition of deployment. Documented in the Decision Coverage report with Wilson 95% CI. |
| MEASURE 2.5 | The AI system to be deployed is demonstrated to be valid and reliable. Limitations of the generalizability beyond the conditions under which the technology was developed are documented. | TC metric with Wilson 95% CI establishes statistical validity bounds for the evaluated configuration. Coverage Boundaries documents generalizability limits: scope is bound to sandbox-testable enforcement behaviors; structural out-of-scope risks are named explicitly, not omitted. |
| MEASURE 2.6 | The AI system is evaluated regularly for safety risks. The AI system to be deployed is demonstrated to be safe, its residual negative risk does not exceed the risk tolerance, and it can fail safely, particularly if made to operate beyond its knowledge limits. | All adversarial scenarios test safe failure under attack conditions. bounded_failure verdict demonstrates that enforcement surfaces contained unsafe actions before completion. Terminal Coverage establishes the rate at which the system fails safely. The undefined_behavior rate with confidence interval quantifies residual risk. |
| MEASURE 2.7 | AI system security and resilience — as identified in the MAP function — are evaluated and documented. | Adversarial scenarios covering prompt injection, tool argument attacks, path traversal, privilege escalation, and step exhaustion. All enforced by OPA policy gates with cryptographic trace evidence. Results mapped to OWASP ASI risk categories and MITRE ATLAS technique IDs. |
| MEASURE 2.13 | Effectiveness of the employed TEVV metrics and processes in the MEASURE function are evaluated and documented. | Published methodology documents the three-state verdict taxonomy, Wilson score confidence intervals, OPA-enforced sandbox, and deterministic trace replay. Explicitly states what Decision Coverage measures and what it does not. Available at constantx.net/paper. |
| MEASURE 3.1 | Approaches, personnel, and documentation are in place to regularly identify and track existing, unanticipated, and emergent AI risks based on factors such as intended and actual performance in deployed contexts. | Append-only engagement index tracks every evaluation run by dated model snapshot. Per-scenario verdict comparison across snapshots surfaces category-level behavioral drift. Gap analysis identifies untested threats per model version, feeding back into scenario authoring for subsequent evaluations. |
| MEASURE 4.1 | Measurement approaches for identifying AI risks are connected to deployment context(s) and informed through consultation with domain experts and other end users. Approaches are documented. | Threat model derived from the specific target system’s architecture and tool configuration — not generic templates. Suite runs against the exact dated model snapshot + controller + tool configuration intended for deployment. Model aliases are not accepted; dated snapshots required. |
| MEASURE 4.2 | Measurement results regarding AI system trustworthiness in deployment context(s) and across the AI lifecycle are informed by input from domain experts and relevant AI actors to validate whether the system is performing consistently as intended. Results are documented. | Decision Coverage report delivers per-scenario verdicts, OWASP ASI coverage, confidence intervals, and full trace bundle as a verifiable artifact. Report is bound to the specific configuration and suite version evaluated and is available for auditor inspection. |
| MEASURE 4.3 | Measurable performance improvements or declines based on consultations with relevant AI actors, including affected communities, and field data about context-relevant risks and trustworthiness characteristics are identified and documented. | Per-scenario verdict comparison across model versions surfaces category-level regressions. Example: Opus 4.5 100.0% TC vs. GPT 5.4 85.85% TC — the 14.15-point gap is concentrated in specific ASI categories driven by systematic multi-action batching behavior, not uniform degradation. |
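The Terminal Coverage metric and Wilson 95% interval cited throughout the table (MEASURE 2.3, 2.5, 2.6) can be computed directly from verdict counts. The sketch below is illustrative: the counts are invented and `wilson_interval` is a minimal reimplementation of the standard Wilson score formula, not the ConstantX codebase.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# Terminal Coverage = (valid_commit + bounded_failure) / total runs.
# Verdict counts below are invented for illustration, not from a real engagement.
valid_commit, bounded_failure, undefined_behavior = 152, 24, 4
total = valid_commit + bounded_failure + undefined_behavior
tc = (valid_commit + bounded_failure) / total
lo, hi = wilson_interval(valid_commit + bounded_failure, total)
print(f"TC = {tc:.2%}, 95% CI [{lo:.2%}, {hi:.2%}]")
```

The Wilson interval is preferred over the normal approximation here because TC is typically close to 1.0, where the normal interval can exceed the [0, 1] bounds and understate uncertainty at small run counts.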
MEASURE — Not in Scope
| Subcategory | Reason Not in Scope |
|---|---|
| MEASURE 1.2 | Appropriateness of AI metrics and effectiveness of existing controls regularly assessed and updated, including reports of errors. Organizational process activity — not a TEVV output. |
| MEASURE 2.2 | Evaluations involving human subjects meet applicable requirements and are representative of the relevant population. ConstantX evaluations do not involve human subjects. |
| MEASURE 2.4 | The functionality and behavior of the AI system are monitored when in production. ConstantX is a point-in-time pre-deployment evaluation, not a continuous production monitoring system. Re-evaluation on new model snapshots detects behavioral drift but does not constitute production monitoring. |
| MEASURE 2.8 | Risks associated with transparency and accountability examined and documented. Transparency and accountability risks are organizational governance concerns addressed in the GOVERN function. |
| MEASURE 2.9 | The AI model is explained, validated, and documented; output interpreted within its context to inform responsible use. Interpretability and explainability are model-level concerns outside adversarial enforcement evaluation scope. |
| MEASURE 2.10 | Privacy risk of the AI system examined and documented. Out of adversarial enforcement scope. |
| MEASURE 2.11 | Fairness and bias evaluated and results documented. Out of adversarial enforcement scope. |
| MEASURE 2.12 | Environmental impact and sustainability assessed and documented. Out of scope for adversarial enforcement evaluation. |
| MEASURE 3.2 | Risk tracking considered for settings where measurement techniques aren’t available. Meta-level framework planning activity, not a TEVV output. |
| MEASURE 3.3 | Feedback processes for end users and impacted communities established and integrated into evaluation metrics. Organizational process activity. |
MANAGE — Inputs
ConstantX evidence informs two MANAGE subcategories. ConstantX does not perform MANAGE activities — risk treatment planning, incident response, and decommissioning procedures are organizational responsibilities. ConstantX provides the empirical input those decisions require.
| Subcategory | NIST Description (AI 100-1) | ConstantX Input |
|---|---|---|
| MANAGE 1.1 | A determination is made as to whether the AI system achieves its intended purposes and stated objectives and whether its development or deployment should proceed. | Terminal Coverage and its Wilson 95% CI provide the quantitative basis for a go/no-go deployment decision. A system with a high undefined_behavior rate has empirically demonstrated it does not fail safely under adversarial conditions. |
| MANAGE 2.4 | Mechanisms are in place and applied, and responsibilities are assigned and understood, to supersede, disengage, or deactivate AI systems that demonstrate performance or outcomes inconsistent with intended use. | ASI-10 (Rogue Agents) scenarios validate that kill paths and timeout enforcements fire correctly under adversarial conditions. bounded_failure verdicts on these scenarios demonstrate that deactivation mechanisms intercept unsafe actions before they complete. |
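The MANAGE 1.1 go/no-go input described above reduces to a simple gate on the interval's lower bound and the residual-risk rate. The function name and thresholds below are hypothetical; risk tolerance is an organizational decision, not something ConstantX prescribes.

```python
def deployment_gate(ci_lower: float, ub_rate: float,
                    tc_floor: float = 0.95, ub_ceiling: float = 0.05) -> bool:
    """Go/no-go sketch: require the Wilson CI lower bound on Terminal Coverage
    to clear the floor AND the undefined_behavior rate to stay under the ceiling.
    Both thresholds are illustrative stand-ins for an organization's risk tolerance."""
    return ci_lower >= tc_floor and ub_rate <= ub_ceiling

# A 97.8% point estimate with a 94.2% CI lower bound still fails a 95% floor:
# the decision keys off the interval bound, not the point estimate.
print(deployment_gate(0.942, 0.022))  # -> False
```

Gating on the lower bound rather than the point estimate is the conservative reading of MEASURE 2.5: the claim "valid and reliable" must hold across the statistical uncertainty of the evaluated run count.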
The Measurement Gap
Most AI governance platforms address NIST GOVERN, MAP, and MANAGE through questionnaires, risk scorecards, and policy documentation. None of those artifacts produces MEASURE 2.3 (performance criteria measured under conditions similar to deployment) or MEASURE 2.6, which requires demonstrating that the system can fail safely under adversarial conditions.
“We reviewed the model and assessed risk as medium” is not a measurement. MEASURE 2.3 requires conditions similar to deployment. MEASURE 2.5 requires demonstrated validity. MEASURE 2.6 requires demonstrated safe failure. ConstantX is the measurement layer. It plugs into existing governance workflows without replacing them.
Enforcement is structural. Alignment is probabilistic. Decision Coverage measures the structural part.