Metamorphic Testing of Vision–Language
Action–Enabled Robots

Mondragon University, University of Seville, Simula Research Laboratory

Overview

Vision-Language Action (VLA) models are multimodal robotic task controllers that, given an instruction and visual inputs, produce a sequence of low-level control actions (or motor commands) enabling a robot to execute the requested task in the physical environment. These systems face the test oracle problem from multiple perspectives. On the one hand, a test oracle must be defined for each instruction prompt, which is a complex and non-generalizable approach. On the other hand, current state-of-the-art oracles typically capture symbolic representations of the world (e.g., robot and object states), enabling the correctness evaluation of a task, but fail to assess other critical aspects, such as the quality with which VLA-enabled robots perform a task. In this paper, we explore whether Metamorphic Testing (MT) can alleviate the test oracle problem in this context. To do so, we propose two metamorphic relation patterns and five metamorphic relations to assess whether changes to the test inputs impact the original trajectory of the VLA-enabled robots. An empirical study involving five VLA models, two simulated robots, and four robotic tasks shows that MT can effectively alleviate the test oracle problem by automatically detecting diverse types of failures, including, but not limited to, uncompleted tasks. More importantly, the proposed MRs are generalizable, making the proposed approach applicable across different VLA models, robots, and tasks, even in the absence of test oracles.


Metamorphic Evaluation Framework

We propose a trajectory-based testing framework using Metamorphic Testing (MT) to assess the behavioral consistency and robustness of VLA-enabled robots. By evaluating the relationship between source and follow-up trajectories, we define two core Metamorphic Relation Patterns (MRP):

MRP1: Trajectory Consistency (TC)

Captures input transformations that should not affect the robot's trajectory, ensuring stability under superficial changes.

  • MR1: Synonym Substitution: Replacing action verbs with synonyms (e.g., "Pick" to "Grab") should yield an invariant trajectory.
  • MR2: Non-Interfering Object Addition: Adding irrelevant objects (>0.1m from target) should not alter the planned path.
  • MR3: Light Brightness Change: Slightly altering scene illumination should not degrade visual grounding or execution consistency.
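The trajectory-consistency check underlying MRP1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the resampling scheme, the mean end-effector deviation metric, the 0.05 m default threshold, and the function names are our assumptions.

```python
import numpy as np

def resample(traj, n=50):
    """Linearly resample a (T, 3) end-effector trajectory to n waypoints
    so that trajectories of different lengths can be compared pointwise."""
    traj = np.asarray(traj, dtype=float)
    t_old = np.linspace(0.0, 1.0, len(traj))
    t_new = np.linspace(0.0, 1.0, n)
    return np.stack(
        [np.interp(t_new, t_old, traj[:, d]) for d in range(traj.shape[1])],
        axis=1,
    )

def tc_violation(source_traj, followup_traj, threshold=0.05):
    """MRP1 (Trajectory Consistency) check: flag a violation when the mean
    pointwise deviation between the source and follow-up trajectories
    exceeds the strictness threshold (here in metres, a placeholder value)."""
    a, b = resample(source_traj), resample(followup_traj)
    deviation = np.linalg.norm(a - b, axis=1).mean()
    return deviation > threshold
```

Under this sketch, an MR1 follow-up ("Grab" instead of "Pick") would be expected to produce `tc_violation(...) == False`; tightening or loosening `threshold` corresponds to the strictness levels discussed in the results.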

MRP2: Trajectory Variation (TV)

Captures transformations that should induce predictable, measurable changes in the robot's trajectory.

  • MR4: Negation or Task Inversion: Negating a command (e.g., "Don't pick") should result in a static pose or a significantly different trajectory.
  • MR5: Target Object Relocation: Moving the target object by $\Delta p$ must result in a proportional trajectory adaptation, bounded by lower and upper constraints.
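The bounded-proportionality requirement of MR5 could be evaluated as sketched below. This is illustrative only: checking proportionality via the final end-effector position, and the lower/upper bound constants, are our assumptions rather than the paper's exact formulation.

```python
import numpy as np

def mr5_holds(source_traj, followup_traj, delta_p, lower=0.5, upper=1.5):
    """MR5 (Target Object Relocation) check: the shift between the final
    end-effector positions of the source and follow-up trajectories should
    be proportional to the target relocation delta_p, with the ratio
    bounded below and above (bound values are placeholders)."""
    end_shift = np.asarray(followup_traj[-1], dtype=float) - np.asarray(source_traj[-1], dtype=float)
    ratio = np.linalg.norm(end_shift) / np.linalg.norm(delta_p)
    return lower <= ratio <= upper
```

A robot that ignores the relocation (ratio near 0) or overshoots wildly (ratio above the upper bound) would violate the relation under this sketch.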

Experimental Setup

We evaluated five VLA models (OpenVLA, Pi0, SpatialVLA, GR00T-N1.5, and EO-1) using the SimplerEnv benchmark. We focused on the 1,864 source test cases in which the VLAs succeeded, generating 9,320 follow-up test cases across all VLA models, MRs, and four tasks: Pick Up, Move Near, Put On, and Put In.


Research Questions

In our evaluation, we aimed to answer the following research questions:

  • RQ1 — Effectiveness: How effective is MT at detecting failures in VLA-enabled robots?
  • RQ2 — Sensitivity: How do different MRs differ in their ability to detect failures?
  • RQ3 — Taxonomy: What types of failures does MT reveal?

Results

RQ1: Effectiveness

MT effectively detects failures in VLA-enabled robots, with its detection capability strongly influenced by the selected strictness threshold. High thresholds maximize the identification of additional failures beyond those captured by symbolic oracles, while low thresholds largely mirror oracle-based detection. Notably, the medium threshold provides the best trade-off by aligning with failures detected by symbolic oracles while still exposing a significant number of additional failure cases.
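The threshold effect described above can be illustrated with a small sketch. The numeric thresholds below are placeholders chosen for illustration, not the study's actual values:

```python
# Hypothetical strictness thresholds (metres of mean trajectory deviation);
# the study's actual values may differ.
THRESHOLDS = {"low": 0.10, "medium": 0.05, "high": 0.01}

def judge(deviation, thresholds=THRESHOLDS):
    """Return which strictness levels would flag an MR violation for a
    given mean trajectory deviation: high thresholds (small tolerances)
    flag more follow-up executions than low ones."""
    return {level: deviation > t for level, t in thresholds.items()}
```

For example, a follow-up trajectory deviating by 0.07 m on average would pass at the low strictness level but be flagged at the medium and high levels.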

RQ2: Sensitivity

Overall, environment-based transformations (MR2) and semantic transformations (MR4) are the most effective at exposing failures across models and tasks, while robustness-oriented relations (MR1 and MR3) are more sensitive to the chosen strictness threshold. This highlights the importance of combining multiple MRs to obtain a comprehensive assessment of VLA-enabled robotic behavior.

RQ3: Taxonomy

Metamorphic testing reveals a more diverse set of failure types than symbolic oracles, particularly in the areas of motion quality and manipulation robustness. While symbolic oracles primarily detect task-level failures, MT uncovers execution-level and behavioral failures that are critical for assessing the reliability and safety of VLA-enabled robotic systems. These results demonstrate that combining symbolic oracles with MT is essential for a comprehensive assessment of VLA-enabled robots.


Figure 1: MR violation rate for each MR across three strictness thresholds, for each model and task.


Figure 2: VLA Failure Taxonomy generated from the analysis of the results.


Failure Case Comparisons

The following pairs highlight cases where the robot passes the Symbolic Oracle in both the source and follow-up cases but nonetheless violates a Metamorphic Relation in the follow-up case.

MR1: Synonym Substitution

Source: "Pick blue plastic bottle"

✔ Success (Symbolic Oracle Pass)

Follow-up: "Let's pick up blue plastic bottle"

✖ MR Violation

Source: "Move blue plastic bottle near carrot"

✔ Success (Symbolic Oracle Pass)

Follow-up: "Can you move blue plastic bottle near carrot?"

✖ MR Violation

MR2: Non-Interfering Object Addition

Source: "3 objects (Put the apple on the plate larger)"

✔ Success (Symbolic Oracle Pass)

Follow-up: "5 objects (Put the apple on the plate larger)"

✖ MR Violation

Source: "1 object (Pick redbull can)"

✔ Success (Symbolic Oracle Pass)

Follow-up: "4 objects (Pick redbull can)"

✖ MR Violation

MR3: Light Brightness Change

Source: "Normal brightness (Put the fanta can into the yellow basket)"

✔ Success (Symbolic Oracle Pass)

Follow-up: "Slightly brighter (Put the fanta can into the yellow basket)"

✖ MR Violation

Source: "Normal brightness (Pick sprite can)"

✔ Success (Symbolic Oracle Pass)

Follow-up: "Slightly darker (Pick sprite can)"

✖ MR Violation

MR4: Negation or Task Inversion

Source: "Move 7up can near sprite can"

✔ Success (Symbolic Oracle Pass)

Follow-up: "Would you avoid moving 7up can near sprite can?"

✖ MR Violation

Source: "Pick blue plastic bottle"

✔ Success (Symbolic Oracle Pass)

Follow-up: "Cease the act of picking blue plastic bottle"

✖ MR Violation

MR5: Target Object Relocation

Source: Original position (Pick 7up can)

✔ Success (Symbolic Oracle Pass)

Follow-up: Relocated +10cm diagonal (Pick 7up can)

✖ Violation: Reachability Failure

Source: Original position (Pick blue plastic bottle)

✔ Success (Symbolic Oracle Pass)

Follow-up: Relocated +10cm horizontal (Pick blue plastic bottle)

✖ MR Violation

Citation

If you find our work helpful, please cite us:
@inproceedings{valle2026mt_of_vlas_icst,
  title        = {Metamorphic Testing of Vision-Language Action-Enabled Robots},
  author       = {Pablo Valle and Sergio Segura and Shaukat Ali and Aitor Arrieta},
  booktitle    = {Proceedings of the 19th IEEE International Conference on Software Testing, Verification and Validation (ICST 2026)},
  year         = {2026},
  month        = may,
  address      = {Daejeon, Republic of Korea},
  organization = {IEEE},
  note         = {\url{https://arxiv.org/abs/2602.22579}},
  url          = {https://arxiv.org/abs/2602.22579}
}