Table 1. Matched 2×2 text×prosody factorial experiment. Negative-control text-only rows report prosody effect under safe and unsafe text; audio rows report the safe-text prosody effect. Values are mean [95% CI].
| Judge | Modality | Measure | N | Prosody effect (safe text) [95% CI] | Prosody effect (unsafe text) [95% CI] |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B-Instruct | text-only | Sens. | 140 | 0 [0, 0] | 0 [0, 0] |
| Llama-3.1-8B-Instruct | text-only | Spec. | 140 | 0 [0, 0] | 0 [0, 0] |
| MERaLiON-2-10B | text-only | Sens. | 252 | 0 [0, 0] | 0 [0, 0] |
| MERaLiON-2-10B | text-only | Spec. | 252 | 0 [0, 0] | 0 [0, 0] |
| Qwen2-Audio-7B-Instruct | text-only | Sens. | 259 | 0 [0, 0] | 0 [0, 0] |
| Qwen2-Audio-7B-Instruct | text-only | Spec. | 259 | 0 [0, 0] | 0 [0, 0] |
| audio-flamingo-3-hf | text-only | Sens. | 123 | 0 [0, 0] | 0 [0, 0] |
| audio-flamingo-3-hf | text-only | Spec. | 123 | 0 [0, 0] | 0 [0, 0] |
| Qwen2-Audio-7B-Instruct | audio-only | Spec. | 7,889 | −0.09 [−0.099, −0.08] | — |
| MERaLiON-2-10B | audio-only | Sens. | 7,896 | −0.056 [−0.065, −0.045] | — |
| MERaLiON-2-10B | audio-text | Sens. | 7,896 | −0.055 [−0.067, −0.042] | — |
| Qwen2-Audio-7B-Instruct | audio-text | Spec. | 7,889 | −0.041 [−0.047, −0.032] | — |
| MERaLiON-2-10B | audio-text | Spec. | 7,803 | −0.04 [−0.046, −0.035] | — |
| audio-flamingo-3-hf | audio-only | Spec. | 3,602 | −0.015 [−0.023, −0.008] | — |
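Each cell above is a mean [95% CI] over matched items. One way such an entry could be computed is a paired-difference estimator with a percentile bootstrap; the sketch below is illustrative only — the function name, seed, and resample count are assumptions, not the paper's procedure.

```python
import numpy as np

def prosody_effect(neutral, unsafe_prosody, n_boot: int = 2000):
    """Mean paired difference (unsafe-prosody minus neutral-prosody score for
    the same matched item) with a percentile-bootstrap 95% CI."""
    diff = np.asarray(unsafe_prosody, dtype=float) - np.asarray(neutral, dtype=float)
    rng = np.random.default_rng(0)
    boot = [rng.choice(diff, size=diff.size, replace=True).mean()
            for _ in range(n_boot)]
    return diff.mean(), (float(np.percentile(boot, 2.5)),
                         float(np.percentile(boot, 97.5)))
```

Under this convention a negative effect means the unsafe-prosody rendition lowers the metric relative to the neutral one, matching the signs in the audio rows.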
Table 2A. Mask validation for 160 ms local time-segment reversal. The mask destroys lexical content while preserving prosodic contour.
| Metric | Mean | Std | Interpretation |
| --- | --- | --- | --- |
| Whisper-Large WER | 1.64 | 2.61 | Words destroyed |
| Whisper-Base WER | 5.36 | 6.18 | Even coarser ASR fails |
| Duration ratio | 1 | 0 | Exact length preserved |
| Energy-envelope corr. | 0.951 | 0.018 | Prosodic contour preserved |
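The 160 ms local reversal mask and the envelope metric in Table 2A can be sketched as follows — a minimal NumPy version, assuming 16 kHz mono audio; the WER rows would be obtained by running ASR on the masked signal, which is omitted here.

```python
import numpy as np

def reverse_local_segments(audio: np.ndarray, sr: int = 16000,
                           seg_ms: float = 160.0) -> np.ndarray:
    """Reverse each non-overlapping 160 ms segment: lexical content becomes
    unintelligible to ASR, while total duration and the coarse energy
    envelope are preserved."""
    seg = int(sr * seg_ms / 1000)  # 2560 samples at 16 kHz
    out = audio.copy()
    for start in range(0, len(audio), seg):
        out[start:start + seg] = audio[start:start + seg][::-1]
    return out

def energy_envelope_corr(a: np.ndarray, b: np.ndarray, sr: int = 16000,
                         frame_ms: float = 160.0) -> float:
    """Pearson correlation of frame-wise RMS energy between two signals."""
    frame = int(sr * frame_ms / 1000)
    n = min(len(a), len(b)) // frame * frame  # drop the trailing partial frame
    rms = lambda x: np.sqrt((x[:n].reshape(-1, frame) ** 2).mean(axis=1))
    return float(np.corrcoef(rms(a), rms(b))[0, 1])
```

With the analysis frame equal to the reversal window, per-frame energy is preserved exactly and the correlation is 1 by construction; the 0.951 mean in the table presumably reflects a finer-grained envelope, where reversal permutes energy within each segment.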
Table 2B. Prosody effect under clean and masked audio. Values are mean [95% CI].
| Judge | Measure | Unmasked | Masked | Status |
| --- | --- | --- | --- | --- |
| MERaLiON-2-10B | Sens. | −0.055 [−0.065, −0.045] | −0.088 [−0.098, −0.077] | Yes (amplified) |
| MERaLiON-2-10B | Spec. | −0.004 [−0.006, −0.002] | −0.007 [−0.013, −0.002] | Yes (amplified) |
| Qwen2-Audio-7B-Instruct | Sens. | +0.005 [−0.004, +0.019] | −0.007 [−0.016, +0.004] | Baseline ≈ 0 |
| Qwen2-Audio-7B-Instruct | Spec. | −0.087 [−0.096, −0.077] | −0.076 [−0.084, −0.067] | Yes (12% attenuation) |
| audio-flamingo-3-hf | Sens. | −0.001 [−0.015, +0.012] | +0.001 [−0.005, +0.006] | Baseline ≈ 0 |
| audio-flamingo-3-hf | Spec. | −0.015 [−0.022, −0.008] | −0.015 [−0.02, −0.012] | Yes (persists) |
Table 2C. Text effect under clean and masked audio. Values are mean [95% CI].
| Judge | Measure | Unmasked | Masked | Attenuation |
| --- | --- | --- | --- | --- |
| MERaLiON-2-10B | Sens. | −0.27 [−0.295, −0.248] | −0.244 [−0.268, −0.225] | 10% |
| MERaLiON-2-10B | Spec. | −0.076 [−0.088, −0.069] | −0.016 [−0.022, −0.01] | 79% |
| Qwen2-Audio-7B-Instruct | Spec. | −0.125 [−0.14, −0.108] | −0.105 [−0.12, −0.088] | 16% |
| audio-flamingo-3-hf | Sens. | +0.05 [+0.031, +0.069] | +0.029 [+0.016, +0.042] | 42% |
Table 3. GT decomposition: gain-GT = metric(audio-text, GT) − metric(text-only, GT). Values are mean [95% CI].
| Judge | Measure | gain-GT [95% CI] | Status |
| --- | --- | --- | --- |
| Qwen2-Audio-7B-Instruct | Spec. | +0.138 [+0.114, +0.161] | Positive |
| Qwen2.5-Omni-7B | Sens. | +0.061 [+0.003, +0.184] | Positive |
| MiniCPM-o-4.5 | Spec. | +0.021 [−0.007, +0.049] | Positive (CI touches 0) |
| MiniCPM-o-4.5 | Sens. | +0.017 [−0.009, +0.041] | Positive (CI touches 0) |
| audio-flamingo-3-hf | Sens. | +0.011 [−0.054, +0.069] | Positive (wide CI) |
| MERaLiON-2-10B | Sens. | −0.006 [−0.042, +0.029] | Near-zero |
| Qwen2.5-Omni-7B | Spec. | −0.01 [−0.063, +0.071] | Near-zero |
| Gemini-2.5-Flash | Sens. | −0.013 [−0.028, −0.001] | Negative |
| Gemini-2.5-Flash | Spec. | −0.019 [−0.051, +0.002] | Negative (CI touches 0) |
| audio-flamingo-3-hf | Spec. | −0.047 [−0.107, +0.003] | Negative (CI touches 0) |
| Qwen2-Audio-7B-Instruct | Sens. | −0.099 [−0.142, −0.053] | Negative |
| MERaLiON-2-10B | Spec. | −0.159 [−0.197, −0.12] | Negative |
Table 4. Expanded 7-judge overview: sensitivity and specificity by modality.
| Judge | Type | New? | text-only Sens. | text-only Spec. | audio-only Sens. | audio-only Spec. | audio-text Sens. | audio-text Spec. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-Audio-7B-Instruct | Open LALM | — | −0.015 | 1 | +0.011 | 1 | −0.114 | 0.7 |
| MERaLiON-2-10B | Open LALM | — | +0.204 | 1 | +0.169 | 1 | +0.198 | 1 |
| audio-flamingo-3-hf | Open LALM | — | +0.037 | 1 | −0.064 | 0.4 | +0.048 | 0.9 |
| MiniCPM-o-4.5 | Open LALM | New | −0.127 | 0 | −0.158 | 0 | −0.109 | 0 |
| Qwen2.5-Omni-7B | Open LALM | New | +0.029 | 0.4 | +0.089 | 0.1 | +0.09 | 0 |
| Gemini-2.5-Flash | Closed API | New | +0.08 | 1 | — | — | +0.067 | 1 |
| Llama-3.1-8B-Instruct | Text-only | — | +0.331 | 0.3 | — | — | — | — |
Table 5A. Sensitivity across transcript sources (GT, Whisper-Large, Whisper-Base) for judges with multi-source evaluations.
| Judge | Modality | GT | Whisper-Large | Whisper-Base | Direction |
| --- | --- | --- | --- | --- | --- |
| Gemini-2.5-Flash | text-only | +0.08 | +0.015 | −0.009 | Degrades (−0.089) |
| Gemini-2.5-Flash | audio-text | +0.067 | +0.015 | — | Degrades (−0.052) |
| Qwen2.5-Omni-7B | text-only | +0.029 | +0.037 | +0.02 | Stable |
| Qwen2.5-Omni-7B | audio-text | +0.09 | +0.127 | +0.143 | Improves (+0.053) |
| MiniCPM-o-4.5 | text-only | −0.127 | +0.226 | +0.213 | Inverts |
| MiniCPM-o-4.5 | audio-text | −0.109 | −0.149 | −0.15 | Degrades (−0.041) |
Table 5B. Specificity across transcript sources (GT, Whisper-Large, Whisper-Base) for judges with multi-source evaluations.
| Judge | Modality | GT | Whisper-Large | Whisper-Base | Direction |
| --- | --- | --- | --- | --- | --- |
| Gemini-2.5-Flash | text-only | 1 | 0.4 | 0.5 | Collapses |
| Gemini-2.5-Flash | audio-text | 1 | 0.8 | — | Degrades |
| Qwen2.5-Omni-7B | text-only | 0.4 | 0.9 | 0.8 | Improves |
| Qwen2.5-Omni-7B | audio-text | 0 | 0.2 | 0.2 | Improves |
| MiniCPM-o-4.5 | text-only | 0 | 0.9 | 0.8 | Improves |
| MiniCPM-o-4.5 | audio-text | 0 | 0 | 0 | Unchanged |
Table 5C. Gain decomposition and rescue values for judges evaluated under multiple transcript sources.
| Judge | Measure | gain-GT | gain-WL | gain-WB | rescue-WL | rescue-WB |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Omni-7B | Sens. | +0.061 | +0.09 | +0.122 | +0.029 | +0.061 |
| Qwen2.5-Omni-7B | Spec. | −0.01 | +0.09 | +0.122 | +0.1 | +0.132 |
| Gemini-2.5-Flash | Sens. | −0.013 | 0 | — | +0.013 | — |
| MiniCPM-o-4.5 | Sens. | +0.017 | −0.375 | −0.363 | −0.392 | −0.38 |
| MiniCPM-o-4.5 | Spec. | +0.021 | −0.374 | −0.356 | −0.395 | −0.377 |
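The decomposition in Tables 3 and 5C reduces to two subtractions: gain-src = metric(audio-text, src) − metric(text-only, src), and, consistent with the reported values, rescue-src = gain-src − gain-GT. A minimal sketch, using the Qwen2.5-Omni-7B sensitivity numbers from Table 5A; the dict layout and function names are illustrative, not the paper's code.

```python
def gain(metrics: dict, source: str) -> float:
    """gain-src = metric(audio-text, src) - metric(text-only, src)."""
    return metrics[("audio-text", source)] - metrics[("text-only", source)]

def rescue(metrics: dict, source: str) -> float:
    """rescue-src = gain-src - gain-GT: the extra audio benefit that appears
    when ASR transcripts replace ground-truth text."""
    return gain(metrics, source) - gain(metrics, "GT")

# Qwen2.5-Omni-7B sensitivity (Table 5A): GT and Whisper-Large ("WL") sources.
m = {("text-only", "GT"): 0.029, ("audio-text", "GT"): 0.090,
     ("text-only", "WL"): 0.037, ("audio-text", "WL"): 0.127}

round(gain(m, "GT"), 3)    # 0.061, matching gain-GT in Table 5C
round(rescue(m, "WL"), 3)  # 0.029, matching rescue-WL in Table 5C
```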
Table 6A. Position bias (AAPB of Sens.) across modalities for the 7-judge set.
| Judge | text-only | audio-text | audio-only |
| --- | --- | --- | --- |
| Qwen2-Audio-7B-Instruct | 0.024 | 0.34 | 0.215 |
| MERaLiON-2-10B | 0.124 | 0.071 | 0.106 |
| audio-flamingo-3-hf | 0.091 | 0.143 | 0.075 |
| MiniCPM-o-4.5 | 0.414 | 0.325 | 0.531 |
| Qwen2.5-Omni-7B | 0.059 | 0.097 | 0.047 |
| Gemini-2.5-Flash | 0.027 | 0.042 | — |
| Llama-3.1-8B-Instruct | 0.136 | — | — |
Table 6B. AAPB of Sens. across transcript sources for judges with multi-source evaluations.
| Judge | Modality | GT | Whisper-Large | Whisper-Base |
| --- | --- | --- | --- | --- |
| Gemini-2.5-Flash | text-only | 0.027 | 0.025 | 0.028 |
| Gemini-2.5-Flash | audio-text | 0.042 | — | — |
| Qwen2.5-Omni-7B | text-only | 0.059 | 0.159 | 0.205 |
| Qwen2.5-Omni-7B | audio-text | 0.097 | 0.106 | 0.087 |
| MiniCPM-o-4.5 | text-only | 0.414 | 0.087 | 0.075 |
| MiniCPM-o-4.5 | audio-text | 0.325 | 0.35 | 0.36 |
Table 7A. MERaLiON-2-10B category slices under GT (Sens.): audio-text vs text-only.
| Category | text-only Sens. | audio-text Sens. | Δ Sens. | text-only Spec. | audio-text Spec. |
| --- | --- | --- | --- | --- | --- |
| dangerous | +0.218 | +0.12 | −0.098 | 1 | 0.9 |
| deception | +0.101 | +0.086 | −0.015 | 1 | 1 |
| harassment | +0.117 | +0.182 | +0.066 | 1 | 1 |
| hate | +0.456 | +0.446 | −0.01 | 0.3 | 0.4 |
| overall | +0.015 | +0.019 | +0.004 | 1 | 1 |
| self-harm | +0.129 | +0.09 | −0.039 | 1 | 1 |
| sexual | +0.412 | +0.385 | −0.027 | 1 | 0.9 |
| violence | +0.164 | +0.203 | +0.039 | 1 | 0.9 |
Table 7B. Qwen2-Audio-7B-Instruct category slices under GT (Sens.): audio-text vs text-only.
| Category | text-only Sens. | audio-text Sens. | Δ Sens. | text-only Spec. | audio-text Spec. |
| --- | --- | --- | --- | --- | --- |
| dangerous | −0.058 | −0.171 | −0.112 | 0.8 | 0.5 |
| deception | −0.111 | +0.128 | +0.239 | 0.6 | 0.8 |
| harassment | +0.15 | +0.183 | +0.033 | 1 | 1 |
| hate | −0.064 | −0.654 | −0.59 | 0.9 | 0.2 |
| overall | +0.054 | +0.026 | −0.028 | 1 | 1 |
| self-harm | −0.012 | +0.029 | +0.041 | 0.9 | 0.9 |
| sexual | −0.03 | −0.325 | −0.296 | 0.4 | 0.8 |
| violence | −0.068 | −0.337 | −0.269 | 1 | 0.2 |
Table 7C. Gemini-2.5-Flash category slices under GT (Sens.): audio-text vs text-only.
| Category | text-only Sens. | audio-text Sens. | Δ Sens. | text-only Spec. | audio-text Spec. |
| --- | --- | --- | --- | --- | --- |
| dangerous | +0.095 | +0.132 | +0.037 | 1 | 1 |
| deception | +0.103 | +0.015 | −0.088 | 0.8 | 0.9 |
| harassment | +0.142 | +0.102 | −0.04 | 0.9 | 0.9 |
| hate | +0.063 | +0.053 | −0.01 | 0.9 | 0.9 |
| overall | 0 | 0 | 0 | 0.95 | 0.9 |
| self-harm | +0.017 | +0.01 | −0.007 | 0.9 | 0.9 |
| sexual | +0.135 | +0.13 | −0.005 | 1 | 0.9 |
| violence | +0.017 | +0.025 | +0.008 | 0.9 | 1 |
Table 7D. MiniCPM-o-4.5 category slices under GT (Sens.): audio-text vs text-only.
| Category | text-only Sens. | audio-text Sens. | Δ Sens. | text-only Spec. | audio-text Spec. |
| --- | --- | --- | --- | --- | --- |
| dangerous | −0.09 | −0.063 | +0.026 | 0.1 | 0 |
| deception | −0.161 | −0.019 | +0.142 | 0 | 0 |
| harassment | +0.234 | +0.176 | −0.058 | 0 | 0 |
| hate | +0.29 | +0.103 | −0.187 | 0.1 | 0.1 |
| overall | −0.542 | −0.449 | +0.093 | 0 | 0.1 |
| self-harm | +0.275 | +0.24 | −0.035 | 0.1 | 0.2 |
| sexual | −0.443 | −0.397 | +0.045 | 0 | 0 |
| violence | −0.589 | −0.466 | +0.123 | 0 | 0 |
Table 7E. MERaLiON-2-10B audio-only prosody-alone results by category (Sens. and Spec.).
| Category | Sens. | Spec. |
| --- | --- | --- |
| hate | +0.435 | 0.5 |
| self-harm | +0.249 | 1 |
| harassment | +0.239 | 1 |
| dangerous | +0.161 | 1 |
| sexual | +0.088 | 0.3 |
| deception | +0.063 | 1 |
| violence | +0.029 | 0.8 |
| overall | +0.001 | 1 |
Table 8A. Severity-response curves (Sens.).
| Judge | Modality | s0 | s1 | s2 | s3 | s4 | s5 | Monotonic? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-Audio-7B-Instruct | text-only | 0.675 | 0.69 | 0.682 | 0.659 | 0.635 | 0.629 | No |
| Qwen2-Audio-7B-Instruct | audio-text | 0.557 | 0.648 | 0.65 | 0.671 | 0.632 | 0.625 | No |
| Qwen2-Audio-7B-Instruct | audio-only | 0.712 | 0.701 | 0.699 | 0.692 | 0.678 | 0.671 | Yes |
| MERaLiON-2-10B | text-only | 0.939 | 0.736 | 0.687 | 0.609 | 0.533 | 0.509 | Yes |
| MERaLiON-2-10B | audio-text | 0.932 | 0.734 | 0.73 | 0.64 | 0.541 | 0.501 | Yes |
| MERaLiON-2-10B | audio-only | 0.848 | 0.679 | 0.647 | 0.609 | 0.555 | 0.543 | Yes |
| audio-flamingo-3-hf | text-only | 0.575 | 0.539 | 0.525 | 0.476 | 0.383 | 0.361 | Yes |
| audio-flamingo-3-hf | audio-text | 0.581 | 0.533 | 0.517 | 0.53 | 0.516 | 0.506 | No |
| audio-flamingo-3-hf | audio-only | 0.598 | 0.632 | 0.651 | 0.662 | 0.644 | 0.65 | No |
| MiniCPM-o-4.5 | text-only | 0.535 | 0.297 | 0.341 | 0.462 | 0.628 | 0.661 | No |
| MiniCPM-o-4.5 | audio-text | 0.546 | 0.396 | 0.42 | 0.493 | 0.612 | 0.656 | No |
| MiniCPM-o-4.5 | audio-only | 0.273 | 0.117 | 0.128 | 0.217 | 0.382 | 0.431 | No |
| Qwen2.5-Omni-7B | text-only | 0.786 | 0.742 | 0.69 | 0.662 | 0.729 | 0.757 | No |
| Qwen2.5-Omni-7B | audio-text | 0.915 | 0.671 | 0.682 | 0.694 | 0.81 | 0.825 | No |
| Qwen2.5-Omni-7B | audio-only | 0.94 | 0.66 | 0.83 | 0.828 | 0.837 | 0.851 | No |
| Gemini-2.5-Flash | text-only | 0.985 | 0.905 | 0.894 | 0.736 | 0.413 | 0.37 | Yes |
| Gemini-2.5-Flash | audio-text | 0.988 | 0.921 | 0.9 | 0.739 | 0.413 | 0.365 | Yes |
| Llama-3.1-8B-Instruct | text-only | 0.611 | 0.227 | 0.214 | 0.209 | 0.259 | 0.28 | No |
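Every Yes/No flag in Table 8A is consistent with reading "Monotonic?" as: the curve never increases from one severity level to the next (scores should fall as severity rises). A trivial check under that assumption:

```python
def is_monotone_decreasing(scores: list) -> bool:
    """True if the severity-response curve never increases (s0 >= s1 >= ... >= s5)."""
    return all(a >= b for a, b in zip(scores, scores[1:]))

# MERaLiON-2-10B text-only (flagged Yes) vs Qwen2-Audio text-only (flagged No):
is_monotone_decreasing([0.939, 0.736, 0.687, 0.609, 0.533, 0.509])  # True
is_monotone_decreasing([0.675, 0.69, 0.682, 0.659, 0.635, 0.629])   # False
```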
Table 8B. Safety scores for severity-0 dialogues.
| Model | Modality | Safety score |
| --- | --- | --- |
| Qwen2-Audio-7B-Instruct | audio-text | 0.824 |
| Qwen2-Audio-7B-Instruct | audio-only | 0.8 |
| audio-flamingo-3-hf | audio-text | 0.711 |
| audio-flamingo-3-hf | audio-only | 0.739 |
| MERaLiON-2-10B | audio-text | 0.654 |
| MERaLiON-2-10B | audio-only | 0.672 |