Table 1. Matched 2×2 text×prosody factorial experiment. Negative-control text-only rows report prosody effect under safe and unsafe text; audio rows report the safe-text prosody effect. Values are mean [95% CI].
| Judge | Modality | Measure | N | Prosody effect (safe text) [95% CI] | Prosody effect (unsafe text) [95% CI] |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B-Instruct | text-only | Sens. | 140 | 0 [0, 0] | 0 [0, 0] |
| Llama-3.1-8B-Instruct | text-only | Spec. | 140 | 0 [0, 0] | 0 [0, 0] |
| MERaLiON-2-10B | text-only | Sens. | 252 | 0 [0, 0] | 0 [0, 0] |
| MERaLiON-2-10B | text-only | Spec. | 252 | 0 [0, 0] | 0 [0, 0] |
| Qwen2-Audio-7B-Instruct | text-only | Sens. | 259 | 0 [0, 0] | 0 [0, 0] |
| Qwen2-Audio-7B-Instruct | text-only | Spec. | 259 | 0 [0, 0] | 0 [0, 0] |
| audio-flamingo-3-hf | text-only | Sens. | 123 | 0 [0, 0] | 0 [0, 0] |
| audio-flamingo-3-hf | text-only | Spec. | 123 | 0 [0, 0] | 0 [0, 0] |
| Qwen2-Audio-7B-Instruct | audio-only | Spec. | 7,889 | −0.09 [−0.099, −0.08] | — |
| MERaLiON-2-10B | audio-only | Sens. | 7,896 | −0.056 [−0.065, −0.045] | — |
| MERaLiON-2-10B | audio-text | Sens. | 7,896 | −0.055 [−0.067, −0.042] | — |
| Qwen2-Audio-7B-Instruct | audio-text | Spec. | 7,889 | −0.041 [−0.047, −0.032] | — |
| MERaLiON-2-10B | audio-text | Spec. | 7,803 | −0.04 [−0.046, −0.035] | — |
| audio-flamingo-3-hf | audio-only | Spec. | 3,602 | −0.015 [−0.023, −0.008] | — |
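Each cell above is a mean [95% CI] over matched items. One way such an entry could be computed is a paired-difference estimator with a percentile bootstrap; the sketch below is illustrative only — the function name, seed, and resample count are assumptions, not the paper's procedure.

```python
import numpy as np

def prosody_effect(neutral, unsafe_prosody, n_boot: int = 2000):
    """Mean paired difference (unsafe-prosody minus neutral-prosody score for
    the same matched item) with a percentile-bootstrap 95% CI."""
    diff = np.asarray(unsafe_prosody, dtype=float) - np.asarray(neutral, dtype=float)
    rng = np.random.default_rng(0)
    boot = [rng.choice(diff, size=diff.size, replace=True).mean()
            for _ in range(n_boot)]
    return diff.mean(), (float(np.percentile(boot, 2.5)),
                         float(np.percentile(boot, 97.5)))
```

Under this convention a negative effect means the unsafe-prosody rendition lowers the metric relative to the neutral one, matching the signs in the audio rows.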
Table 2A. Mask validation for 160 ms local time-segment reversal. The mask destroys lexical content while preserving prosodic contour.
| Metric | Mean | Std | Interpretation |
| --- | --- | --- | --- |
| Whisper-Large WER | 1.64 | 2.61 | Words destroyed |
| Whisper-Base WER | 5.36 | 6.18 | Even coarser ASR fails |
| Duration ratio | 1 | 0 | Exact length preserved |
| Energy-envelope corr. | 0.951 | 0.018 | Prosodic contour preserved |
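The 160 ms local reversal mask and the envelope metric in Table 2A can be sketched as follows — a minimal NumPy version, assuming 16 kHz mono audio; the WER rows would be obtained by running ASR on the masked signal, which is omitted here.

```python
import numpy as np

def reverse_local_segments(audio: np.ndarray, sr: int = 16000,
                           seg_ms: float = 160.0) -> np.ndarray:
    """Reverse each non-overlapping 160 ms segment: lexical content becomes
    unintelligible to ASR, while total duration and the coarse energy
    envelope are preserved."""
    seg = int(sr * seg_ms / 1000)  # 2560 samples at 16 kHz
    out = audio.copy()
    for start in range(0, len(audio), seg):
        out[start:start + seg] = audio[start:start + seg][::-1]
    return out

def energy_envelope_corr(a: np.ndarray, b: np.ndarray, sr: int = 16000,
                         frame_ms: float = 160.0) -> float:
    """Pearson correlation of frame-wise RMS energy between two signals."""
    frame = int(sr * frame_ms / 1000)
    n = min(len(a), len(b)) // frame * frame  # drop the trailing partial frame
    rms = lambda x: np.sqrt((x[:n].reshape(-1, frame) ** 2).mean(axis=1))
    return float(np.corrcoef(rms(a), rms(b))[0, 1])
```

With the analysis frame equal to the reversal window, per-frame energy is preserved exactly and the correlation is 1 by construction; the 0.951 mean in the table presumably reflects a finer-grained envelope, where reversal permutes energy within each segment.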
Table 2B. Prosody effect under clean and masked audio. Values are mean [95% CI].
| Judge | Measure | Unmasked | Masked | Status |
| --- | --- | --- | --- | --- |
| MERaLiON-2-10B | Sens. | −0.055 [−0.065, −0.045] | −0.088 [−0.098, −0.077] | Yes (amplified) |
| MERaLiON-2-10B | Spec. | −0.004 [−0.006, −0.002] | −0.007 [−0.013, −0.002] | Yes (amplified) |
| Qwen2-Audio-7B-Instruct | Sens. | +0.005 [−0.004, +0.019] | −0.007 [−0.016, +0.004] | Baseline ≈ 0 |
| Qwen2-Audio-7B-Instruct | Spec. | −0.087 [−0.096, −0.077] | −0.076 [−0.084, −0.067] | Yes (12% attenuation) |
| audio-flamingo-3-hf | Sens. | −0.001 [−0.015, +0.012] | +0.001 [−0.005, +0.006] | Baseline ≈ 0 |
| audio-flamingo-3-hf | Spec. | −0.015 [−0.022, −0.008] | −0.015 [−0.02, −0.012] | Yes (persists) |
Table 2C. Text effect under clean and masked audio. Values are mean [95% CI].
| Judge | Measure | Unmasked | Masked | Attenuation |
| --- | --- | --- | --- | --- |
| MERaLiON-2-10B | Sens. | −0.27 [−0.295, −0.248] | −0.244 [−0.268, −0.225] | 10% |
| MERaLiON-2-10B | Spec. | −0.076 [−0.088, −0.069] | −0.016 [−0.022, −0.01] | 79% |
| Qwen2-Audio-7B-Instruct | Spec. | −0.125 [−0.14, −0.108] | −0.105 [−0.12, −0.088] | 16% |
| audio-flamingo-3-hf | Sens. | +0.05 [+0.031, +0.069] | +0.029 [+0.016, +0.042] | 42% |
Table 3. GT decomposition: gain-GT = metric(audio-text, GT) − metric(text-only, GT). Values are mean [95% CI].
| Judge | Measure | gain-GT [95% CI] | Status |
| --- | --- | --- | --- |
| Qwen2-Audio-7B-Instruct | Spec. | +0.138 [+0.114, +0.161] | Positive |
| Qwen2.5-Omni-7B | Sens. | +0.061 [+0.003, +0.184] | Positive |
| MiniCPM-o-4.5 | Spec. | +0.021 [−0.007, +0.049] | Positive (CI touches 0) |
| MiniCPM-o-4.5 | Sens. | +0.017 [−0.009, +0.041] | Positive (CI touches 0) |
| audio-flamingo-3-hf | Sens. | +0.011 [−0.054, +0.069] | Positive (wide CI) |
| MERaLiON-2-10B | Sens. | −0.006 [−0.042, +0.029] | Near-zero |
| Qwen2.5-Omni-7B | Spec. | −0.01 [−0.063, +0.071] | Near-zero |
| Gemini-2.5-Flash | Sens. | −0.013 [−0.028, −0.001] | Negative |
| Gemini-2.5-Flash | Spec. | −0.019 [−0.051, +0.002] | Negative (CI touches 0) |
| audio-flamingo-3-hf | Spec. | −0.047 [−0.107, +0.003] | Negative (CI touches 0) |
| Qwen2-Audio-7B-Instruct | Sens. | −0.099 [−0.142, −0.053] | Negative |
| MERaLiON-2-10B | Spec. | −0.159 [−0.197, −0.12] | Negative |
Table 4. Expanded 7-judge overview: sensitivity and specificity by modality.
| Judge | Type | New? | text-only Sens. | text-only Spec. | audio-only Sens. | audio-only Spec. | audio-text Sens. | audio-text Spec. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-Audio-7B-Instruct | Open LALM | — | −0.015 | 1 | +0.011 | 1 | −0.114 | 0.7 |
| MERaLiON-2-10B | Open LALM | — | +0.204 | 1 | +0.169 | 1 | +0.198 | 1 |
| audio-flamingo-3-hf | Open LALM | — | +0.037 | 1 | −0.064 | 0.4 | +0.048 | 0.9 |
| MiniCPM-o-4.5 | Open LALM | New | −0.127 | 0 | −0.158 | 0 | −0.109 | 0 |
| Qwen2.5-Omni-7B | Open LALM | New | +0.029 | 0.4 | +0.089 | 0.1 | +0.09 | 0 |
| Gemini-2.5-Flash | Closed API | New | +0.08 | 1 | — | — | +0.067 | 1 |
| Llama-3.1-8B-Instruct | Text-only | — | +0.331 | 0.3 | — | — | — | — |
Table 5A. Sensitivity across transcript sources (GT, Whisper-Large, Whisper-Base) for judges with multi-source evaluations.
| Judge | Modality | GT | Whisper-Large | Whisper-Base | Direction |
| --- | --- | --- | --- | --- | --- |
| Gemini-2.5-Flash | text-only | +0.08 | +0.015 | −0.009 | Degrades (−0.089) |
| Gemini-2.5-Flash | audio-text | +0.067 | +0.015 | — | Degrades (−0.052) |
| Qwen2.5-Omni-7B | text-only | +0.029 | +0.037 | +0.02 | Stable |
| Qwen2.5-Omni-7B | audio-text | +0.09 | +0.127 | +0.143 | Improves (+0.053) |
| MiniCPM-o-4.5 | text-only | −0.127 | +0.226 | +0.213 | Inverts |
| MiniCPM-o-4.5 | audio-text | −0.109 | −0.149 | −0.15 | Degrades (−0.041) |
Table 5B. Specificity across transcript sources (GT, Whisper-Large, Whisper-Base) for judges with multi-source evaluations.
| Judge | Modality | GT | Whisper-Large | Whisper-Base | Direction |
| --- | --- | --- | --- | --- | --- |
| Gemini-2.5-Flash | text-only | 1 | 0.4 | 0.5 | Collapses |
| Gemini-2.5-Flash | audio-text | 1 | 0.8 | — | Degrades |
| Qwen2.5-Omni-7B | text-only | 0.4 | 0.9 | 0.8 | Improves |
| Qwen2.5-Omni-7B | audio-text | 0 | 0.2 | 0.2 | Improves |
| MiniCPM-o-4.5 | text-only | 0 | 0.9 | 0.8 | Improves |
| MiniCPM-o-4.5 | audio-text | 0 | 0 | 0 | Unchanged |
Table 5C. Gain decomposition and rescue values for judges evaluated under multiple transcript sources.
| Judge | Measure | gain-GT | gain-WL | gain-WB | rescue-WL | rescue-WB |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Omni-7B | Sens. | +0.061 | +0.09 | +0.122 | +0.029 | +0.061 |
| Qwen2.5-Omni-7B | Spec. | −0.01 | +0.09 | +0.122 | +0.1 | +0.132 |
| Gemini-2.5-Flash | Sens. | −0.013 | 0 | — | +0.013 | — |
| MiniCPM-o-4.5 | Sens. | +0.017 | −0.375 | −0.363 | −0.392 | −0.38 |
| MiniCPM-o-4.5 | Spec. | +0.021 | −0.374 | −0.356 | −0.395 | −0.377 |
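The decomposition in Tables 3 and 5C reduces to two subtractions: gain-src = metric(audio-text, src) − metric(text-only, src), and, consistent with the reported values, rescue-src = gain-src − gain-GT. A minimal sketch, using the Qwen2.5-Omni-7B sensitivity numbers from Table 5A; the dict layout and function names are illustrative, not the paper's code.

```python
def gain(metrics: dict, source: str) -> float:
    """gain-src = metric(audio-text, src) - metric(text-only, src)."""
    return metrics[("audio-text", source)] - metrics[("text-only", source)]

def rescue(metrics: dict, source: str) -> float:
    """rescue-src = gain-src - gain-GT: the extra audio benefit that appears
    when ASR transcripts replace ground-truth text."""
    return gain(metrics, source) - gain(metrics, "GT")

# Qwen2.5-Omni-7B sensitivity (Table 5A): GT and Whisper-Large ("WL") sources.
m = {("text-only", "GT"): 0.029, ("audio-text", "GT"): 0.090,
     ("text-only", "WL"): 0.037, ("audio-text", "WL"): 0.127}

round(gain(m, "GT"), 3)    # 0.061, matching gain-GT in Table 5C
round(rescue(m, "WL"), 3)  # 0.029, matching rescue-WL in Table 5C
```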
Table 6A. Position bias (AAPB of Sens.) across modalities for the 7-judge set.
| Judge | text-only | audio-text | audio-only |
| --- | --- | --- | --- |
| Qwen2-Audio-7B-Instruct | 0.024 | 0.34 | 0.215 |
| MERaLiON-2-10B | 0.124 | 0.071 | 0.106 |
| audio-flamingo-3-hf | 0.091 | 0.143 | 0.075 |
| MiniCPM-o-4.5 | 0.414 | 0.325 | 0.531 |
| Qwen2.5-Omni-7B | 0.059 | 0.097 | 0.047 |
| Gemini-2.5-Flash | 0.027 | 0.042 | — |
| Llama-3.1-8B-Instruct | 0.136 | — | — |
Table 6B. AAPB of Sens. across transcript sources for judges with multi-source evaluations.
| Judge | Modality | GT | Whisper-Large | Whisper-Base |
| --- | --- | --- | --- | --- |
| Gemini-2.5-Flash | text-only | 0.027 | 0.025 | 0.028 |
| Gemini-2.5-Flash | audio-text | 0.042 | — | — |
| Qwen2.5-Omni-7B | text-only | 0.059 | 0.159 | 0.205 |
| Qwen2.5-Omni-7B | audio-text | 0.097 | 0.106 | 0.087 |
| MiniCPM-o-4.5 | text-only | 0.414 | 0.087 | 0.075 |
| MiniCPM-o-4.5 | audio-text | 0.325 | 0.35 | 0.36 |
Table 7A. MERaLiON-2-10B category slices under GT (Sens.): audio-text vs text-only.
| Category | text-only Sens. | audio-text Sens. | Δ Sens. | text-only Spec. | audio-text Spec. |
| --- | --- | --- | --- | --- | --- |
| dangerous | +0.218 | +0.12 | −0.098 | 1 | 0.9 |
| deception | +0.101 | +0.086 | −0.015 | 1 | 1 |
| harassment | +0.117 | +0.182 | +0.066 | 1 | 1 |
| hate | +0.456 | +0.446 | −0.01 | 0.3 | 0.4 |
| overall | +0.015 | +0.019 | +0.004 | 1 | 1 |
| self-harm | +0.129 | +0.09 | −0.039 | 1 | 1 |
| sexual | +0.412 | +0.385 | −0.027 | 1 | 0.9 |
| violence | +0.164 | +0.203 | +0.039 | 1 | 0.9 |
Table 7B. Qwen2-Audio-7B-Instruct category slices under GT (Sens.): audio-text vs text-only.
| Category | text-only Sens. | audio-text Sens. | Δ Sens. | text-only Spec. | audio-text Spec. |
| --- | --- | --- | --- | --- | --- |
| dangerous | −0.058 | −0.171 | −0.112 | 0.8 | 0.5 |
| deception | −0.111 | +0.128 | +0.239 | 0.6 | 0.8 |
| harassment | +0.15 | +0.183 | +0.033 | 1 | 1 |
| hate | −0.064 | −0.654 | −0.59 | 0.9 | 0.2 |
| overall | +0.054 | +0.026 | −0.028 | 1 | 1 |
| self-harm | −0.012 | +0.029 | +0.041 | 0.9 | 0.9 |
| sexual | −0.03 | −0.325 | −0.296 | 0.4 | 0.8 |
| violence | −0.068 | −0.337 | −0.269 | 1 | 0.2 |
Table 7C. Gemini-2.5-Flash category slices under GT (Sens.): audio-text vs text-only.
| Category | text-only Sens. | audio-text Sens. | Δ Sens. | text-only Spec. | audio-text Spec. |
| --- | --- | --- | --- | --- | --- |
| dangerous | +0.095 | +0.132 | +0.037 | 1 | 1 |
| deception | +0.103 | +0.015 | −0.088 | 0.8 | 0.9 |
| harassment | +0.142 | +0.102 | −0.04 | 0.9 | 0.9 |
| hate | +0.063 | +0.053 | −0.01 | 0.9 | 0.9 |
| overall | 0 | 0 | 0 | 0.95 | 0.9 |
| self-harm | +0.017 | +0.01 | −0.007 | 0.9 | 0.9 |
| sexual | +0.135 | +0.13 | −0.005 | 1 | 0.9 |
| violence | +0.017 | +0.025 | +0.008 | 0.9 | 1 |
Table 7D. MiniCPM-o-4.5 category slices under GT (Sens.): audio-text vs text-only.
| Category | text-only Sens. | audio-text Sens. | Δ Sens. | text-only Spec. | audio-text Spec. |
| --- | --- | --- | --- | --- | --- |
| dangerous | −0.09 | −0.063 | +0.026 | 0.1 | 0 |
| deception | −0.161 | −0.019 | +0.142 | 0 | 0 |
| harassment | +0.234 | +0.176 | −0.058 | 0 | 0 |
| hate | +0.29 | +0.103 | −0.187 | 0.1 | 0.1 |
| overall | −0.542 | −0.449 | +0.093 | 0 | 0.1 |
| self-harm | +0.275 | +0.24 | −0.035 | 0.1 | 0.2 |
| sexual | −0.443 | −0.397 | +0.045 | 0 | 0 |
| violence | −0.589 | −0.466 | +0.123 | 0 | 0 |
Table 7E. MERaLiON-2-10B audio-only prosody-alone results by category (Sens. and Spec.).
| Category | Sens. | Spec. |
| --- | --- | --- |
| hate | +0.435 | 0.5 |
| self-harm | +0.249 | 1 |
| harassment | +0.239 | 1 |
| dangerous | +0.161 | 1 |
| sexual | +0.088 | 0.3 |
| deception | +0.063 | 1 |
| violence | +0.029 | 0.8 |
| overall | +0.001 | 1 |
Table 8A. Severity-response curves (Sens.).
| Judge | Modality | s0 | s1 | s2 | s3 | s4 | s5 | Monotonic? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-Audio-7B-Instruct | text-only | 0.675 | 0.69 | 0.682 | 0.659 | 0.635 | 0.629 | No |
| Qwen2-Audio-7B-Instruct | audio-text | 0.557 | 0.648 | 0.65 | 0.671 | 0.632 | 0.625 | No |
| Qwen2-Audio-7B-Instruct | audio-only | 0.712 | 0.701 | 0.699 | 0.692 | 0.678 | 0.671 | Yes |
| MERaLiON-2-10B | text-only | 0.939 | 0.736 | 0.687 | 0.609 | 0.533 | 0.509 | Yes |
| MERaLiON-2-10B | audio-text | 0.932 | 0.734 | 0.73 | 0.64 | 0.541 | 0.501 | Yes |
| MERaLiON-2-10B | audio-only | 0.848 | 0.679 | 0.647 | 0.609 | 0.555 | 0.543 | Yes |
| audio-flamingo-3-hf | text-only | 0.575 | 0.539 | 0.525 | 0.476 | 0.383 | 0.361 | Yes |
| audio-flamingo-3-hf | audio-text | 0.581 | 0.533 | 0.517 | 0.53 | 0.516 | 0.506 | No |
| audio-flamingo-3-hf | audio-only | 0.598 | 0.632 | 0.651 | 0.662 | 0.644 | 0.65 | No |
| MiniCPM-o-4.5 | text-only | 0.535 | 0.297 | 0.341 | 0.462 | 0.628 | 0.661 | No |
| MiniCPM-o-4.5 | audio-text | 0.546 | 0.396 | 0.42 | 0.493 | 0.612 | 0.656 | No |
| MiniCPM-o-4.5 | audio-only | 0.273 | 0.117 | 0.128 | 0.217 | 0.382 | 0.431 | No |
| Qwen2.5-Omni-7B | text-only | 0.786 | 0.742 | 0.69 | 0.662 | 0.729 | 0.757 | No |
| Qwen2.5-Omni-7B | audio-text | 0.915 | 0.671 | 0.682 | 0.694 | 0.81 | 0.825 | No |
| Qwen2.5-Omni-7B | audio-only | 0.94 | 0.66 | 0.83 | 0.828 | 0.837 | 0.851 | No |
| Gemini-2.5-Flash | text-only | 0.985 | 0.905 | 0.894 | 0.736 | 0.413 | 0.37 | Yes |
| Gemini-2.5-Flash | audio-text | 0.988 | 0.921 | 0.9 | 0.739 | 0.413 | 0.365 | Yes |
| Llama-3.1-8B-Instruct | text-only | 0.611 | 0.227 | 0.214 | 0.209 | 0.259 | 0.28 | No |
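Every Yes/No flag in Table 8A is consistent with reading "Monotonic?" as: the curve never increases from one severity level to the next (scores should fall as severity rises). A trivial check under that assumption:

```python
def is_monotone_decreasing(scores: list) -> bool:
    """True if the severity-response curve never increases (s0 >= s1 >= ... >= s5)."""
    return all(a >= b for a, b in zip(scores, scores[1:]))

# MERaLiON-2-10B text-only (flagged Yes) vs Qwen2-Audio text-only (flagged No):
is_monotone_decreasing([0.939, 0.736, 0.687, 0.609, 0.533, 0.509])  # True
is_monotone_decreasing([0.675, 0.69, 0.682, 0.659, 0.635, 0.629])   # False
```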
Table 8B. Safety scores for severity-0 dialogues.
| Model | Modality | Safety score |
| --- | --- | --- |
| Qwen2-Audio-7B-Instruct | audio-text | 0.824 |
| Qwen2-Audio-7B-Instruct | audio-only | 0.8 |
| audio-flamingo-3-hf | audio-text | 0.711 |
| audio-flamingo-3-hf | audio-only | 0.739 |
| MERaLiON-2-10B | audio-text | 0.654 |
| MERaLiON-2-10B | audio-only | 0.672 |