Table 1. Matched 2×2 text×prosody factorial experiment. Negative-control text-only rows report the prosody effect under both safe and unsafe text; audio rows report the safe-text prosody effect only. Values are mean [95% CI].
| Judge | Modality | Measure | N | Prosody effect (safe text) [95% CI] | Prosody effect (unsafe text) [95% CI] |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B-Instruct | text-only | Sens. | 140 | 0 [0, 0] | 0 [0, 0] |
| Llama-3.1-8B-Instruct | text-only | Spec. | 140 | 0 [0, 0] | 0 [0, 0] |
| MERaLiON-2-10B | text-only | Sens. | 252 | 0 [0, 0] | 0 [0, 0] |
| MERaLiON-2-10B | text-only | Spec. | 252 | 0 [0, 0] | 0 [0, 0] |
| Qwen2-Audio-7B-Instruct | text-only | Sens. | 259 | 0 [0, 0] | 0 [0, 0] |
| Qwen2-Audio-7B-Instruct | text-only | Spec. | 259 | 0 [0, 0] | 0 [0, 0] |
| audio-flamingo-3-hf | text-only | Sens. | 123 | 0 [0, 0] | 0 [0, 0] |
| audio-flamingo-3-hf | text-only | Spec. | 123 | 0 [0, 0] | 0 [0, 0] |
| Qwen2-Audio-7B-Instruct | audio-only | Spec. | 7,889 | −0.09 [−0.099, −0.08] | — |
| MERaLiON-2-10B | audio-only | Sens. | 7,896 | −0.056 [−0.065, −0.045] | — |
| MERaLiON-2-10B | audio-text | Sens. | 7,896 | −0.055 [−0.067, −0.042] | — |
| Qwen2-Audio-7B-Instruct | audio-text | Spec. | 7,889 | −0.041 [−0.047, −0.032] | — |
| MERaLiON-2-10B | audio-text | Spec. | 7,803 | −0.04 [−0.046, −0.035] | — |
| audio-flamingo-3-hf | audio-only | Spec. | 3,602 | −0.015 [−0.023, −0.008] | — |
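Each prosody effect above is a paired contrast between two deliveries of the same item, summarized as a mean with a 95% CI. A minimal sketch of one plausible way to compute such a paired mean difference with a percentile-bootstrap CI (the function name, the bootstrap choice, and the resample count are assumptions, not the paper's exact procedure):

```python
import numpy as np

def paired_effect_ci(scores_a, scores_b, n_boot=2000, seed=0):
    """Mean paired difference (a - b) with a percentile-bootstrap 95% CI.

    scores_a, scores_b: per-item judge scores under the two prosody
    conditions, aligned so index i is the same dialogue in both arrays.
    """
    diffs = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    rng = np.random.default_rng(seed)
    n = len(diffs)
    # Resample items with replacement and recompute the mean difference.
    boots = np.array([diffs[rng.integers(0, n, n)].mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return diffs.mean(), (lo, hi)
```

A CI that excludes zero (as in the audio rows above) indicates a prosody effect that resampling alone does not explain.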
Table 2A. Mask validation for 160 ms local time-segment reversal. The mask destroys lexical content while preserving prosodic contour.
| Metric | Mean | Std | Interpretation |
| --- | --- | --- | --- |
| Whisper-Large WER | 1.64 | 2.61 | WER > 1: lexical content destroyed |
| Whisper-Base WER | 5.36 | 6.18 | The coarser ASR model fails as well |
| Duration ratio | 1 | 0 | Exact length preserved |
| Energy-envelope corr | 0.951 | 0.018 | Prosodic contour preserved |
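The 160 ms local time-segment reversal validated in Table 2A can be sketched as follows: the waveform is cut into fixed-length windows and each window is reversed in place, which scrambles phonetic content within a window while exactly preserving duration and largely preserving the slow energy envelope (the prosodic contour). A minimal numpy sketch (the 16 kHz sample rate and simple non-overlapping windowing are assumptions):

```python
import numpy as np

def reverse_segments(wave, sr=16000, seg_ms=160):
    """Reverse each seg_ms window of `wave` in place (local time reversal).

    Lexical content is destroyed within each window, but total length is
    exactly preserved and the coarse energy envelope changes little.
    """
    wave = np.asarray(wave, float).copy()
    seg = int(sr * seg_ms / 1000)              # samples per window
    for start in range(0, len(wave), seg):
        wave[start:start + seg] = wave[start:start + seg][::-1]
    return wave
```

A convenient sanity check: applying the mask twice recovers the original signal, since reversing each window is an involution.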
Table 2B. Prosody effect under clean and masked audio; Status indicates whether the effect survives masking. Values are mean [95% CI].
| Judge | Measure | Unmasked | Masked | Status |
| --- | --- | --- | --- | --- |
| MERaLiON-2-10B | Sens. | −0.055 [−0.065, −0.045] | −0.088 [−0.098, −0.077] | Yes (amplified) |
| MERaLiON-2-10B | Spec. | −0.004 [−0.006, −0.002] | −0.007 [−0.013, −0.002] | Yes (amplified) |
| Qwen2-Audio-7B-Instruct | Sens. | +0.005 [−0.004, +0.019] | −0.007 [−0.016, +0.004] | Baseline ≈ 0 |
| Qwen2-Audio-7B-Instruct | Spec. | −0.087 [−0.096, −0.077] | −0.076 [−0.084, −0.067] | Yes (12% attenuation) |
| audio-flamingo-3-hf | Sens. | −0.001 [−0.015, +0.012] | +0.001 [−0.005, +0.006] | Baseline ≈ 0 |
| audio-flamingo-3-hf | Spec. | −0.015 [−0.022, −0.008] | −0.015 [−0.02, −0.012] | Yes (persists) |
Table 2C. Text effect under clean and masked audio; Attenuation is the fractional reduction in effect magnitude under masking. Values are mean [95% CI].
| Judge | Measure | Unmasked | Masked | Attenuation |
| --- | --- | --- | --- | --- |
| MERaLiON-2-10B | Sens. | −0.27 [−0.295, −0.248] | −0.244 [−0.268, −0.225] | 10% |
| MERaLiON-2-10B | Spec. | −0.076 [−0.088, −0.069] | −0.016 [−0.022, −0.01] | 79% |
| Qwen2-Audio-7B-Instruct | Spec. | −0.125 [−0.14, −0.108] | −0.105 [−0.12, −0.088] | 16% |
| audio-flamingo-3-hf | Sens. | +0.05 [+0.031, +0.069] | +0.029 [+0.016, +0.042] | 42% |
Table 3. GT decomposition: gain-GT = metric(audio-text, GT) − metric(text-only, GT). Values are mean [95% CI].
| Judge | Measure | gain-GT [95% CI] | Status |
| --- | --- | --- | --- |
| Qwen2-Audio-7B-Instruct | Spec. | +0.138 [+0.114, +0.161] | Positive |
| Qwen2.5-Omni-7B | Sens. | +0.061 [+0.003, +0.184] | Positive |
| MiniCPM-o-4.5 | Spec. | +0.021 [−0.007, +0.049] | Positive (CI touches 0) |
| MiniCPM-o-4.5 | Sens. | +0.017 [−0.009, +0.041] | Positive (CI touches 0) |
| audio-flamingo-3-hf | Sens. | +0.011 [−0.054, +0.069] | Positive (wide CI) |
| MERaLiON-2-10B | Sens. | −0.006 [−0.042, +0.029] | Near-zero |
| Qwen2.5-Omni-7B | Spec. | −0.01 [−0.063, +0.071] | Near-zero |
| Gemini-2.5-Flash | Sens. | −0.013 [−0.028, −0.001] | Negative |
| Gemini-2.5-Flash | Spec. | −0.019 [−0.051, +0.002] | Negative (CI touches 0) |
| audio-flamingo-3-hf | Spec. | −0.047 [−0.107, +0.003] | Negative (CI touches 0) |
| Qwen2-Audio-7B-Instruct | Sens. | −0.099 [−0.142, −0.053] | Negative |
| MERaLiON-2-10B | Spec. | −0.159 [−0.197, −0.12] | Negative |
Table 4. Expanded 7-judge overview: Sens. and Spec. across modalities.
| Judge | Type | New? | text-only Sens. | text-only Spec. | audio-only Sens. | audio-only Spec. | audio-text Sens. | audio-text Spec. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-Audio-7B-Instruct | Open LALM | | −0.015 | 1 | +0.011 | 1 | −0.114 | 0.7 |
| MERaLiON-2-10B | Open LALM | | +0.204 | 1 | +0.169 | 1 | +0.198 | 1 |
| audio-flamingo-3-hf | Open LALM | | +0.037 | 1 | −0.064 | 0.4 | +0.048 | 0.9 |
| MiniCPM-o-4.5 | Open LALM | New | −0.127 | 0 | −0.158 | 0 | −0.109 | 0 |
| Qwen2.5-Omni-7B | Open LALM | New | +0.029 | 0.4 | +0.089 | 0.1 | +0.09 | 0 |
| Gemini-2.5-Flash | Closed API | New | +0.08 | 1 | — | — | +0.067 | 1 |
| Llama-3.1-8B-Instruct | Text-only | | +0.331 | 0.3 | — | — | — | — |
Table 5A. Sensitivity across transcript sources (GT, Whisper-Large, Whisper-Base) for judges with multi-source evaluations.
| Judge | Modality | GT | Whisper-Large | Whisper-Base | Direction |
| --- | --- | --- | --- | --- | --- |
| Gemini-2.5-Flash | text-only | +0.08 | +0.015 | −0.009 | Degrades (−0.089) |
| Gemini-2.5-Flash | audio-text | +0.067 | +0.015 | — | Degrades (−0.052) |
| Qwen2.5-Omni-7B | text-only | +0.029 | +0.037 | +0.02 | Stable |
| Qwen2.5-Omni-7B | audio-text | +0.09 | +0.127 | +0.143 | Improves (+0.053) |
| MiniCPM-o-4.5 | text-only | −0.127 | +0.226 | +0.213 | Inverts |
| MiniCPM-o-4.5 | audio-text | −0.109 | −0.149 | −0.15 | Degrades (−0.041) |
Table 5B. Specificity across transcript sources (GT, Whisper-Large, Whisper-Base) for judges with multi-source evaluations.
| Judge | Modality | GT | Whisper-Large | Whisper-Base | Direction |
| --- | --- | --- | --- | --- | --- |
| Gemini-2.5-Flash | text-only | 1 | 0.4 | 0.5 | Collapses |
| Gemini-2.5-Flash | audio-text | 1 | 0.8 | — | Degrades |
| Qwen2.5-Omni-7B | text-only | 0.4 | 0.9 | 0.8 | Improves |
| Qwen2.5-Omni-7B | audio-text | 0 | 0.2 | 0.2 | Improves |
| MiniCPM-o-4.5 | text-only | 0 | 0.9 | 0.8 | Improves |
| MiniCPM-o-4.5 | audio-text | 0 | 0 | 0 | Unchanged |
Table 5C. Gain decomposition and rescue values for judges evaluated under multiple transcript sources; rescue-X = gain-X − gain-GT.
| Judge | Measure | gain-GT | gain-WL | gain-WB | rescue-WL | rescue-WB |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Omni-7B | Sens. | +0.061 | +0.09 | +0.122 | +0.029 | +0.061 |
| Qwen2.5-Omni-7B | Spec. | −0.01 | +0.09 | +0.122 | +0.1 | +0.132 |
| Gemini-2.5-Flash | Sens. | −0.013 | 0 | — | +0.013 | — |
| MiniCPM-o-4.5 | Sens. | +0.017 | −0.375 | −0.363 | −0.392 | −0.38 |
| MiniCPM-o-4.5 | Spec. | +0.021 | −0.374 | −0.356 | −0.395 | −0.377 |
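Tables 3 and 5C reduce to two subtractions per judge and measure. A minimal sketch (function names are illustrative; the rescue definition, gain-X minus gain-GT, is inferred from the table values):

```python
def gain(audio_text_metric, text_only_metric):
    """gain-X = metric(audio-text, X) - metric(text-only, X)
    for a given transcript source X (GT, Whisper-Large, Whisper-Base)."""
    return audio_text_metric - text_only_metric

def rescue(gain_asr, gain_gt):
    """rescue-X = gain-X - gain-GT: how much an ASR transcript source
    changes the audio-text gain relative to ground-truth transcripts."""
    return gain_asr - gain_gt
```

For example, Qwen2.5-Omni-7B Spec. gives rescue-WL = (+0.09) − (−0.01) = +0.10, matching the table.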
Table 6A. Position bias (AAPB of Sens.) across modalities for the 7-judge set.
| Judge | text-only | audio-text | audio-only |
| --- | --- | --- | --- |
| Qwen2-Audio-7B-Instruct | 0.024 | 0.34 | 0.215 |
| MERaLiON-2-10B | 0.124 | 0.071 | 0.106 |
| audio-flamingo-3-hf | 0.091 | 0.143 | 0.075 |
| MiniCPM-o-4.5 | 0.414 | 0.325 | 0.531 |
| Qwen2.5-Omni-7B | 0.059 | 0.097 | 0.047 |
| Gemini-2.5-Flash | 0.027 | 0.042 | — |
| Llama-3.1-8B-Instruct | 0.136 | — | — |
Table 6B. AAPB of Sens. across transcript sources for judges with multi-source evaluations.
| Judge | Modality | GT | Whisper-Large | Whisper-Base |
| --- | --- | --- | --- | --- |
| Gemini-2.5-Flash | text-only | 0.027 | 0.025 | 0.028 |
| Gemini-2.5-Flash | audio-text | 0.042 | — | — |
| Qwen2.5-Omni-7B | text-only | 0.059 | 0.159 | 0.205 |
| Qwen2.5-Omni-7B | audio-text | 0.097 | 0.106 | 0.087 |
| MiniCPM-o-4.5 | text-only | 0.414 | 0.087 | 0.075 |
| MiniCPM-o-4.5 | audio-text | 0.325 | 0.35 | 0.36 |
Table 7A. MERaLiON-2-10B category slices under GT: audio-text vs text-only (Sens. and Spec.).
| Category | text-only Sens. | audio-text Sens. | Δ Sens. | text-only Spec. | audio-text Spec. |
| --- | --- | --- | --- | --- | --- |
| dangerous | +0.218 | +0.12 | −0.098 | 1 | 0.9 |
| deception | +0.101 | +0.086 | −0.015 | 1 | 1 |
| harassment | +0.117 | +0.182 | +0.066 | 1 | 1 |
| hate | +0.456 | +0.446 | −0.01 | 0.3 | 0.4 |
| overall | +0.015 | +0.019 | +0.004 | 1 | 1 |
| self-harm | +0.129 | +0.09 | −0.039 | 1 | 1 |
| sexual | +0.412 | +0.385 | −0.027 | 1 | 0.9 |
| violence | +0.164 | +0.203 | +0.039 | 1 | 0.9 |
Table 7B. Qwen2-Audio-7B-Instruct category slices under GT: audio-text vs text-only (Sens. and Spec.).
| Category | text-only Sens. | audio-text Sens. | Δ Sens. | text-only Spec. | audio-text Spec. |
| --- | --- | --- | --- | --- | --- |
| dangerous | −0.058 | −0.171 | −0.112 | 0.8 | 0.5 |
| deception | −0.111 | +0.128 | +0.239 | 0.6 | 0.8 |
| harassment | +0.15 | +0.183 | +0.033 | 1 | 1 |
| hate | −0.064 | −0.654 | −0.59 | 0.9 | 0.2 |
| overall | +0.054 | +0.026 | −0.028 | 1 | 1 |
| self-harm | −0.012 | +0.029 | +0.041 | 0.9 | 0.9 |
| sexual | −0.03 | −0.325 | −0.296 | 0.4 | 0.8 |
| violence | −0.068 | −0.337 | −0.269 | 1 | 0.2 |
Table 7C. Gemini-2.5-Flash category slices under GT: audio-text vs text-only (Sens. and Spec.).
| Category | text-only Sens. | audio-text Sens. | Δ Sens. | text-only Spec. | audio-text Spec. |
| --- | --- | --- | --- | --- | --- |
| dangerous | +0.095 | +0.132 | +0.037 | 1 | 1 |
| deception | +0.103 | +0.015 | −0.088 | 0.8 | 0.9 |
| harassment | +0.142 | +0.102 | −0.04 | 0.9 | 0.9 |
| hate | +0.063 | +0.053 | −0.01 | 0.9 | 0.9 |
| overall | 0 | 0 | 0 | 0.95 | 0.9 |
| self-harm | +0.017 | +0.01 | −0.007 | 0.9 | 0.9 |
| sexual | +0.135 | +0.13 | −0.005 | 1 | 0.9 |
| violence | +0.017 | +0.025 | +0.008 | 0.9 | 1 |
Table 7D. MiniCPM-o-4.5 category slices under GT: audio-text vs text-only (Sens. and Spec.).
| Category | text-only Sens. | audio-text Sens. | Δ Sens. | text-only Spec. | audio-text Spec. |
| --- | --- | --- | --- | --- | --- |
| dangerous | −0.09 | −0.063 | +0.026 | 0.1 | 0 |
| deception | −0.161 | −0.019 | +0.142 | 0 | 0 |
| harassment | +0.234 | +0.176 | −0.058 | 0 | 0 |
| hate | +0.29 | +0.103 | −0.187 | 0.1 | 0.1 |
| overall | −0.542 | −0.449 | +0.093 | 0 | 0.1 |
| self-harm | +0.275 | +0.24 | −0.035 | 0.1 | 0.2 |
| sexual | −0.443 | −0.397 | +0.045 | 0 | 0 |
| violence | −0.589 | −0.466 | +0.123 | 0 | 0 |
Table 7E. MERaLiON-2-10B audio-only prosody-alone results by category (Sens. and Spec.).
| Category | Sens. | Spec. |
| --- | --- | --- |
| hate | +0.435 | 0.5 |
| self-harm | +0.249 | 1 |
| harassment | +0.239 | 1 |
| dangerous | +0.161 | 1 |
| sexual | +0.088 | 0.3 |
| deception | +0.063 | 1 |
| violence | +0.029 | 0.8 |
| overall | +0.001 | 1 |
Table 8A. Severity-response curves (Sens.) across severity levels s0–s5; Monotonic? flags curves that never increase with severity.
| Judge | Modality | s0 | s1 | s2 | s3 | s4 | s5 | Monotonic? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-Audio-7B-Instruct | text-only | 0.675 | 0.69 | 0.682 | 0.659 | 0.635 | 0.629 | No |
| Qwen2-Audio-7B-Instruct | audio-text | 0.557 | 0.648 | 0.65 | 0.671 | 0.632 | 0.625 | No |
| Qwen2-Audio-7B-Instruct | audio-only | 0.712 | 0.701 | 0.699 | 0.692 | 0.678 | 0.671 | Yes |
| MERaLiON-2-10B | text-only | 0.939 | 0.736 | 0.687 | 0.609 | 0.533 | 0.509 | Yes |
| MERaLiON-2-10B | audio-text | 0.932 | 0.734 | 0.73 | 0.64 | 0.541 | 0.501 | Yes |
| MERaLiON-2-10B | audio-only | 0.848 | 0.679 | 0.647 | 0.609 | 0.555 | 0.543 | Yes |
| audio-flamingo-3-hf | text-only | 0.575 | 0.539 | 0.525 | 0.476 | 0.383 | 0.361 | Yes |
| audio-flamingo-3-hf | audio-text | 0.581 | 0.533 | 0.517 | 0.53 | 0.516 | 0.506 | No |
| audio-flamingo-3-hf | audio-only | 0.598 | 0.632 | 0.651 | 0.662 | 0.644 | 0.65 | No |
| MiniCPM-o-4.5 | text-only | 0.535 | 0.297 | 0.341 | 0.462 | 0.628 | 0.661 | No |
| MiniCPM-o-4.5 | audio-text | 0.546 | 0.396 | 0.42 | 0.493 | 0.612 | 0.656 | No |
| MiniCPM-o-4.5 | audio-only | 0.273 | 0.117 | 0.128 | 0.217 | 0.382 | 0.431 | No |
| Qwen2.5-Omni-7B | text-only | 0.786 | 0.742 | 0.69 | 0.662 | 0.729 | 0.757 | No |
| Qwen2.5-Omni-7B | audio-text | 0.915 | 0.671 | 0.682 | 0.694 | 0.81 | 0.825 | No |
| Qwen2.5-Omni-7B | audio-only | 0.94 | 0.66 | 0.83 | 0.828 | 0.837 | 0.851 | No |
| Gemini-2.5-Flash | text-only | 0.985 | 0.905 | 0.894 | 0.736 | 0.413 | 0.37 | Yes |
| Gemini-2.5-Flash | audio-text | 0.988 | 0.921 | 0.9 | 0.739 | 0.413 | 0.365 | Yes |
| Llama-3.1-8B-Instruct | text-only | 0.611 | 0.227 | 0.214 | 0.209 | 0.259 | 0.28 | No |
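Every row flagged Yes in Table 8A is non-increasing from s0 to s5, so the Monotonic? flag can be read as a simple non-increase check. A minimal sketch, under that assumed definition:

```python
def is_monotone_decreasing(curve, tol=0.0):
    """True if the severity-response curve never increases (within tol)
    as severity rises, i.e. s0 >= s1 >= ... >= s5."""
    return all(b <= a + tol for a, b in zip(curve, curve[1:]))
```

This reproduces the table's flags, e.g. True for MERaLiON-2-10B text-only and False for MiniCPM-o-4.5 text-only.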
Table 8B. Safety scores for severity-0 dialogues.
| Model and modality | Safety score |
| --- | --- |
| Qwen2-Audio-7B-Instruct, audio-text | 0.824 |
| Qwen2-Audio-7B-Instruct, audio-only | 0.8 |
| audio-flamingo-3-hf, audio-text | 0.711 |
| audio-flamingo-3-hf, audio-only | 0.739 |
| MERaLiON-2-10B, audio-text | 0.654 |
| MERaLiON-2-10B, audio-only | 0.672 |