Is the training data contaminated, and are the prompts leaked? #12

@Qoboty

Thanks for open-sourcing step-audio-r1. In my use, I have found that its audio comprehension surpasses that of the other open-source large audio models released around the same time. You guys did an amazing job!

However, I find that step-audio-r1 sometimes outputs "\n\n{"model":"gemini-2.5-pro-vision-provider"". My guess is that the training data was contaminated when JSON responses from gemini-2.5-pro were not parsed correctly. The contaminated data may also leak the prompts used to call gemini-2.5-pro, because I sometimes get responses like this:

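\n</think>\n{\"model\":\"gemini-2.5-pro-vision-provider\",\"prompt\":\"请仔细聆听音频内容,根据音频内容,以中文进行转写和分析。请先将音频内容准确转写为文字,然后对说话人的特征(如性别、年龄、情绪状态等)进行详细描述。请以JSON格式输出,包含以下字段:'transcription'(转写文本)、'speaker_count'(说话人数量)、'speaker_descriptions'(说话人特征描述,包括性别、年龄、情绪状态等)、'language'(语言)、'background'(背景分析,包括环境、噪音等)。请注意,说话人特征描述需详细,且需符合音频内容。

English translation of the "prompt" value above: "Please listen carefully to the audio and, based on its content, transcribe and analyze it in Chinese. First transcribe the audio accurately into text, then describe the speaker's characteristics (such as gender, age, and emotional state) in detail. Output in JSON format with the following fields: 'transcription' (transcribed text), 'speaker_count' (number of speakers), 'speaker_descriptions' (speaker characteristics, including gender, age, emotional state, etc.), 'language' (language), 'background' (background analysis, including environment, noise, etc.). Note that the speaker descriptions must be detailed and must match the audio content."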

Here is the corresponding screenshot:

Image

You can reproduce it with the following call (audio from either Emilia or AudioCaps reproduces it):

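import traceback


def uac_test(model, wav_path):
    """Test universal audio caption generation with detailed analysis."""
    # The system prompt is kept in Chinese, exactly as used to reproduce the issue.
    # English: "You are an experienced audio analysis expert, skilled at in-depth,
    # detailed analysis of all kinds of speech audio. Your task is not only to
    # transcribe the audio accurately, but also to comprehensively describe the
    # speaker's voice characteristics (such as gender, age, and emotional state),
    # background sounds, environmental information, and any events that may be
    # involved. Please complete each analysis and transcription in detail and
    # accurately, from a professional and objective perspective."
    messages = [
        {"role": "system", "content": "你是一位经验丰富的音频分析专家,擅长对各种语音音频进行深入细致的分析。你的任务不仅仅是将音频内容准确转写为文字,还要对说话人的声音特征(如性别、年龄、情绪状态)、背景声音、环境信息以及可能涉及的事件进行全面描述。请以专业、客观的视角,详细、准确地完成每一次分析和转写。"},
        {"role": "human", "content": [{"type": "audio", "audio": wav_path}]},
        # Open the assistant turn inside the thinking block; "eot": False leaves it unfinished.
        {"role": "assistant", "content": "<think>\n", "eot": False},
    ]
    full_text = ""
    try:
        # Accumulate only the streamed text; the other two values are unused here.
        for response, text, audio in model.stream(messages, max_tokens=1024, temperature=0.5, top_p=0.9, stop_token_ids=[151665]):
            if text:
                full_text += text
    except Exception as e:
        print(f"Error during streaming: {e}")
        traceback.print_exc()
    print("\n\nFull response:", full_text)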
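
To estimate how often this happens over a set of files, the collected outputs could be scanned for the leaked fragment. The following is only a rough sketch: `LEAK_PATTERN`, `find_leak`, and `leak_rate` are hypothetical names I made up for illustration, and the pattern is based solely on the two outputs quoted above (it accepts both the plain and the backslash-escaped quoting), so other variants of the leak may be missed.

```python
import re

# Marker taken from the outputs quoted in this issue. Each quote may be
# preceded by a backslash, as in the escaped variant of the response.
LEAK_PATTERN = re.compile(
    r'\{\\?"model\\?"\s*:\s*\\?"gemini-2\.5-pro-vision-provider\\?"'
)


def find_leak(full_text):
    """Return the text from the leaked JSON fragment onward, or None if absent."""
    match = LEAK_PATTERN.search(full_text)
    return full_text[match.start():] if match else None


def leak_rate(outputs):
    """Return the fraction of outputs that contain the leaked fragment."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return sum(find_leak(text) is not None for text in outputs) / len(outputs)
```

For example, if `uac_test` were changed to return `full_text` instead of only printing it, the returned strings for a folder of wav files could be passed to `leak_rate` to get a rough contamination rate.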

Looking forward to the official response. Thank you!
