Accelerating Deployment of the MiniCPM-o-4.5 Omnimodal Model with OpenVINO™

openlab_96bf3613 · updated 9 hours ago

Introduction

MiniCPM-o-4.5 is the latest end-to-end omnimodal model from OpenBMB. Built from SigLip2 + Whisper-medium + CosyVoice2 + Qwen3-8B, it has only 9B total parameters, yet averages 77.6 across the eight OpenCompass vision-language benchmarks, surpassing GPT-4o and approaching Gemini 2.5 Flash in visual capability. It also supports bilingual realtime speech conversation, voice cloning, and pioneering full-duplex multimodal realtime streaming interaction (seeing, listening, and speaking at the same time, without blocking).

This article describes in detail how to optimize and deploy MiniCPM-o-4.5 with the Intel® OpenVINO™ toolkit: converting all 15 of its sub-models to OpenVINO™ IR format, applying INT4/INT8 quantization, and running efficient inference on Intel CPU and GPU platforms.

Step 1: Environment Setup

The commands below set up the Python environment (a virtual environment is recommended).

Bash

python -m venv py_venv
./py_venv/Scripts/activate.bat
# On Linux/macOS, activate with: source py_venv/bin/activate

Python

from pip_helper import pip_install

# Install in batches to avoid pip resolver conflicts

# Batch 1: PyTorch (with CPU index URL)
print("Installing PyTorch...")
pip_install(
    "-q",
    "torch==2.8.0",
    "torchvision==0.23.0",
    "torchaudio==2.8.0",
    "--extra-index-url",
    "https://download.pytorch.org/whl/cpu",
)

# Batch 2: OpenVINO stack
print("Installing OpenVINO and transformers...")
pip_install(
    "-q",
    "transformers==4.51.0",
    "openvino>=2026.0.0",
    "openvino-tokenizers>=2026.0",
    "nncf>=2.16",
)

# Batch 3: Audio, UI, and utility packages
print("Installing audio, UI, and utility packages...")
pip_install(
    "-q",
    "Pillow",
    "librosa",
    "soundfile",
    "gradio==6.0.0",
    "accelerate",
    "einops",
    "onnxruntime",
    "hyperpyyaml",
    "minicpmo-utils>=1.0.5",
    "setuptools<82",
    "onnx",
)

print("All dependencies installed successfully!")

(The helper scripts notebook_utils.py, pip_helper.py, and gradio_helper.py can be downloaded from the OpenVINO™ notebooks repository.)

Step 2: Model Conversion and Quantization

MiniCPM-o-4.5 is composed of 15 sub-models (the LLM, vision encoder, audio encoder, the TTS modules, and others).

We use the conversion function provided by minicpm_o_4_5_helper to export them all to OpenVINO™ IR and quantize them in one step:

  • LLM: INT4 symmetric quantization (group_size=64, ratio=1.0)

  • Vision: INT8 symmetric quantization

  • Audio Encoder: stateful KV cache enabled (efficient streaming)

  • Remaining small models: kept in FP16 (small footprint, no quantization needed)
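To see why INT4 with group_size=64 is attractive for the ~8B-parameter LLM backbone, here is a rough size estimate. This is a sketch under the assumption of one FP16 scale per quantization group; the parameter count and the resulting numbers are illustrative, not measured from the converted model.

```python
# Back-of-envelope size of the quantized LLM weights.
# Assumption: grouped quantization stores one FP16 scale (2 bytes) per group.

def weight_bytes(n_params, bits, group_size=None):
    """Bytes for quantized weights plus one FP16 scale per group (if grouped)."""
    data = n_params * bits / 8
    scales = (n_params / group_size) * 2 if group_size else 0.0
    return data + scales

GB = 1024 ** 3
llm_params = 8e9  # Qwen3-8B backbone, approximate

print(f"FP16 weights:            ~{weight_bytes(llm_params, 16) / GB:.1f} GB")
print(f"INT4 weights (group=64): ~{weight_bytes(llm_params, 4, group_size=64) / GB:.1f} GB")
```

Even with the per-group scales included, INT4 cuts the LLM weight footprint to roughly a quarter of FP16, which is what makes CPU/iGPU deployment practical.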

The conversion code is as follows (the first run automatically downloads the original model and converts it):

Python

from pathlib import Path

from minicpm_o_4_5_helper import convert_minicpmo_model

# Path to the original model (downloaded from HuggingFace/ModelScope)
model_id = "openbmb/MiniCPM-o-4_5"

# Output path for converted OpenVINO models
ov_model_path = Path("MiniCPM-o-4_5-OV")

if not (ov_model_path / "openvino_llm_embedding_model.xml").exists():
    # Quantization config:
    # - "vision": INT8 for vision encoder (moderate compression)
    # - "llm": INT4_SYM with group_size=64, ratio=1.0 for LLM (best quality/size)
    # - "text": No quantization for other text/audio/TTS models
    quantization_config = {
        "vision": {
            "mode": "int8_sym",
        },
        "llm": {
            "mode": "int4_sym",
            "group_size": 64,
            "ratio": 1.0,
        },
        "text": None,
    }
    convert_minicpmo_model(
        model_id=model_id,
        output_dir=str(ov_model_path),
        quantization_config=quantization_config,
        use_stateful_apm=True,  # Export Audio Encoder with stateful KV cache
    )
    print("All 15 sub-models converted successfully!")
else:
    print(f"Converted models already exist at {ov_model_path}")
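After conversion, it can be useful to sanity-check which IR models actually landed on disk. The sketch below relies only on the standard OpenVINO IR layout (each model is an .xml topology paired with a .bin weights file); apart from openvino_llm_embedding_model.xml, the file names in the example are made up for illustration.

```python
from pathlib import Path
import tempfile

def list_ir_models(model_dir):
    """Return stem names of complete OpenVINO IR models (paired .xml/.bin) in a directory."""
    root = Path(model_dir)
    stems = []
    for xml in sorted(root.glob("*.xml")):
        # A complete IR model has both a topology (.xml) and weights (.bin)
        if xml.with_suffix(".bin").exists():
            stems.append(xml.stem)
    return stems

# Demo with a throwaway directory standing in for MiniCPM-o-4_5-OV
with tempfile.TemporaryDirectory() as d:
    for name in ["openvino_llm_embedding_model", "some_other_model"]:
        (Path(d) / f"{name}.xml").touch()
        (Path(d) / f"{name}.bin").touch()
    (Path(d) / "orphan.xml").touch()  # no .bin -> not a complete IR model
    print(list_ir_models(d))  # ['openvino_llm_embedding_model', 'some_other_model']
```

Running `list_ir_models` on the real output directory should report 15 entries if the conversion finished cleanly.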

Step 3: Model Deployment

Once conversion is complete, the OVMiniCPMO class from minicpm_o_4_5_helper loads the OpenVINO™-optimized models and runs multimodal inference (visual chat, speech ASR, half-duplex/full-duplex streaming, realtime speech conversation, and more). The full pipeline initialization and inference flow is shown below (taken directly from the notebook example):

  • Model initialization

Python

from minicpm_o_4_5_helper import OVMiniCPMO

# `device` and `tts_device` are the device-selection widgets created earlier
# in the notebook (e.g. via notebook_utils.device_widget()); their .value is a
# device string such as "CPU" or "GPU".
ov_model = OVMiniCPMO(
    model_path=str(ov_model_path),
    device=device.value,
    tts_device=tts_device.value,
)
  • Multimodal chat example (single image + text)

Python

from io import BytesIO

import requests
from PIL import Image

# Download a sample image
image_url = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
response = requests.get(image_url, timeout=30)
image = Image.open(BytesIO(response.content)).convert("RGB")

question = "What is in the image?"
msgs = [{"role": "user", "content": [image, question]}]

answer = ov_model.chat(
    msgs=msgs,
    use_tts_template=False,
)
print(answer)
  • Multi-turn conversation example (with context memory):

Python

# Multi-turn: follow up on the previous image
# Reset LLM state for fresh generation
ov_model.llm._ov_language.reset_state()
ov_model.llm._past_length = 0

msgs = [
    {"role": "user", "content": [image, "What is in this image?"]},
    {"role": "assistant", "content": [answer]},
    {"role": "user", "content": ["What season do you think this picture was taken in? Why?"]},
]

answer2 = ov_model.chat(
    msgs=msgs,
    use_tts_template=False,
    enable_thinking=False,
)
print(answer2)
  • Audio understanding example (ASR / speaker analysis):

Python

import librosa
import numpy as np
import soundfile as sf
from IPython.display import Audio, display

audio_path = "assets/system_ref_audio.wav"

audio_input, _ = librosa.load(str(audio_path), sr=16000, mono=True)
print(f"Audio loaded: {len(audio_input)/16000:.1f}s @ 16kHz")

# Display input audio
print("Input Audio:")
display(Audio(audio_input, rate=16000))

# Reset LLM state
ov_model.llm._ov_language.reset_state()
ov_model.llm._past_length = 0

# ASR task (aligned with original model README)
task_prompt = "Please listen to the audio snippet carefully and transcribe the content."
msgs = [{"role": "user", "content": [task_prompt, audio_input]}]

asr_result = ov_model.chat(
    msgs=msgs,
    do_sample=True,
    max_new_tokens=512,
    use_tts_template=True,
    temperature=0.3,
)
print("ASR Result:", asr_result)
  • Half-duplex realtime speech conversation:

Realtime speech interaction is simulated by splitting the user's audio into 1-second chunks and feeding them through the streaming pipeline one by one. The model listens to all the chunks and then generates a text (and optionally an audio) response.

Python

import librosa
import numpy as np
import soundfile as sf
import torch
from IPython.display import Audio, display

ov_model = ov_model.as_duplex()

# Set reference audio for voice style
ref_audio_path = "assets/system_ref_audio.wav"
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)

# Example system msg for English conversation
sys_msg = {
    "role": "system",
    "content": [
        "Clone the voice in the provided audio prompt.",
        ref_audio,
        "Please assist users while maintaining this voice style. Please answer the user's questions seriously and in a high quality. Please chat with the user in a highly human-like and oral style in English. You are a helpful assistant developed by ModelBest: MiniCPM-Omni",
    ],
}

# You can use each type of system prompt mentioned above in streaming speech conversation

# Reset state
ov_model.reset_session(reset_token2wav_cache=True)
ov_model.init_token2wav_cache(prompt_speech_16k=ref_audio)

session_id = "demo"

# First, prefill the system turn
ov_model.streaming_prefill(
    session_id=session_id,
    msgs=[sys_msg],
    omni_mode=False,
    is_last_chunk=True,
)

# Here we simulate realtime speech conversation by splitting the whole user input audio into chunks of 1s.
# Use the system reference audio as a demo input (replace with your own audio file)
user_audio, _ = librosa.load("assets/system_ref_audio.wav", sr=16000, mono=True)

IN_SAMPLE_RATE = 16000  # input audio sample rate, fixed value
CHUNK_SAMPLES = IN_SAMPLE_RATE  # samples per 1s chunk
OUT_SAMPLE_RATE = 24000  # output audio sample rate, fixed value
MIN_AUDIO_SAMPLES = 16000

total_samples = len(user_audio)
num_chunks = (total_samples + CHUNK_SAMPLES - 1) // CHUNK_SAMPLES

for chunk_idx in range(num_chunks):
    start = chunk_idx * CHUNK_SAMPLES
    end = min((chunk_idx + 1) * CHUNK_SAMPLES, total_samples)
    chunk_audio = user_audio[start:end]

    is_last_chunk = chunk_idx == num_chunks - 1
    if is_last_chunk and len(chunk_audio) < MIN_AUDIO_SAMPLES:
        chunk_audio = np.concatenate([chunk_audio, np.zeros(MIN_AUDIO_SAMPLES - len(chunk_audio), dtype=chunk_audio.dtype)])
    user_msg = {"role": "user", "content": [chunk_audio]}

    # For each 1s audio chunk, perform streaming_prefill once to reduce first-token latency
    ov_model.streaming_prefill(
        session_id=session_id,
        msgs=[user_msg],
        omni_mode=False,
        is_last_chunk=is_last_chunk,
    )

# Let the model generate its response in a streaming manner
generate_audio = True
iter_gen = ov_model.streaming_generate(
    session_id=session_id,
    generate_audio=generate_audio,
    use_tts_template=True,
    enable_thinking=False,
    do_sample=True,
    max_new_tokens=512,
    length_penalty=1.1,  # For realtime speech conversation mode, we suggest length_penalty=1.1 to improve response content
)

audios = []
text = ""

output_audio_path = "output_realtime.wav"
if generate_audio:
    for wav_chunk, text_chunk in iter_gen:
        audios.append(wav_chunk)
        text += text_chunk

    generated_waveform = torch.cat(audios, dim=-1)[0]
    sf.write(output_audio_path, generated_waveform.cpu().numpy(), samplerate=24000)

    print("Text:", text)
    print(f"Audio saved to {output_audio_path}")
    display(Audio(generated_waveform.cpu().numpy(), rate=24000))
else:
    for text_chunk, is_finished in iter_gen:
        text += text_chunk
    print("Text:", text)
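The 1-second chunking and tail-padding logic from the loop above can be isolated into a small helper and exercised on a dummy waveform. `split_into_chunks` is a name introduced here for illustration, not part of the helper module; the constants mirror the snippet.

```python
import numpy as np

IN_SAMPLE_RATE = 16000       # 1 s of 16 kHz audio per chunk
CHUNK_SAMPLES = IN_SAMPLE_RATE
MIN_AUDIO_SAMPLES = 16000    # zero-pad the final chunk up to this length

def split_into_chunks(user_audio):
    """Split a waveform into 1 s chunks, zero-padding a short final chunk."""
    total = len(user_audio)
    num_chunks = (total + CHUNK_SAMPLES - 1) // CHUNK_SAMPLES
    chunks = []
    for i in range(num_chunks):
        chunk = user_audio[i * CHUNK_SAMPLES:(i + 1) * CHUNK_SAMPLES]
        if i == num_chunks - 1 and len(chunk) < MIN_AUDIO_SAMPLES:
            pad = np.zeros(MIN_AUDIO_SAMPLES - len(chunk), dtype=chunk.dtype)
            chunk = np.concatenate([chunk, pad])
        chunks.append(chunk)
    return chunks

# 2.5 s dummy waveform -> 3 chunks, the last one padded to exactly 1 s
chunks = split_into_chunks(np.ones(40000, dtype=np.float32))
print([len(c) for c in chunks])  # [16000, 16000, 16000]
```

Padding the last chunk keeps every `streaming_prefill` call seeing at least `MIN_AUDIO_SAMPLES` of audio, so the audio encoder never receives a degenerate input.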
  • Full-duplex mode:

In full-duplex mode the model listens and speaks simultaneously, processing multimodal input (video frames + audio chunks) in real time. This is the most advanced interaction mode, well suited to realtime conversations in which the AI should start responding while the user is still speaking.

Python

import librosa
import requests
import torch
from minicpmo.utils import generate_duplex_video, get_video_frame_audio_segments
from IPython.display import Video, display

# Ensure duplex mode for the full-duplex API
ov_model = ov_model.as_duplex() if hasattr(ov_model, "as_duplex") else ov_model

# Load video and reference audio
cn_video_path = "assets/omni_duplex1.mp4"
ref_audio_path = "assets/HT_ref_audio.wav"
en_video_url = "https://github.com/user-attachments/assets/f6ab8ed5-5829-4ce2-9feb-3a8bf7374dbb"
en_video_path = "assets/omni_duplex_en.mp4"

try:
    resp = requests.get(en_video_url, timeout=30)
    resp.raise_for_status()
    if not resp.content:
        raise ValueError("Downloaded file is empty.")
    with open(en_video_path, "wb") as f:
        f.write(resp.content)
    video_path = en_video_path
    print(f"en_video downloaded: {en_video_path}")
except Exception as e:
    video_path = cn_video_path
    print(f"Download failed, fallback to {cn_video_path}\n{e}")

ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)

ov_model.reset_session(reset_token2wav_cache=True)

# Extract video frames and audio segments
video_frames, audio_segments, stacked_frames = get_video_frame_audio_segments(
    video_path, stack_frames=1, use_ffmpeg=True, adjust_audio_length=True
)

# Prepare duplex session with system prompt and voice reference
ov_model.prepare(
    prefix_system_prompt="Streaming Omni Conversation.",
    ref_audio=ref_audio,
    prompt_wav_path=ref_audio_path,
)

results_log = []
timed_output_audio = []

# Process each chunk in streaming fashion
for chunk_idx in range(len(audio_segments)):
    audio_chunk = audio_segments[chunk_idx] if chunk_idx < len(audio_segments) else None
    frame = video_frames[chunk_idx] if chunk_idx < len(video_frames) else None
    frame_list = []
    if frame is not None:
        frame_list.append(frame)
        if stacked_frames is not None and chunk_idx < len(stacked_frames) and stacked_frames[chunk_idx] is not None:
            frame_list.append(stacked_frames[chunk_idx])

    # Step 1: Streaming prefill
    ov_model.streaming_prefill(
        audio_waveform=audio_chunk,
        frame_list=frame_list,
        max_slice_nums=1,  # Increase for HD mode (e.g., [2, 1] for stacked frames)
        batch_vision_feed=False,  # Set True for faster processing
    )

    # Step 2: Streaming generate
    result = ov_model.streaming_generate(
        prompt_wav_path=ref_audio_path,
        max_new_speak_tokens_per_chunk=20,
        decode_mode="sampling",
    )

    if result["audio_waveform"] is not None:
        timed_output_audio.append((chunk_idx, result["audio_waveform"]))

    chunk_result = {
        "chunk_idx": chunk_idx,
        "is_listen": result["is_listen"],
        "text": result["text"],
        "end_of_turn": result["end_of_turn"],
        "current_time": result["current_time"],
        "audio_length": len(result["audio_waveform"]) if result["audio_waveform"] is not None else 0,
    }
    results_log.append(chunk_result)

    print("listen..." if result["is_listen"] else f"speak> {result['text']}")

# Generate output video with AI responses
# Please install Chinese fonts (fonts-noto-cjk or fonts-wqy-microhei) to render CJK subtitles correctly.
# apt-get install -y fonts-noto-cjk fonts-wqy-microhei
# fc-cache -fv
generate_duplex_video(
    video_path=video_path,
    output_video_path="duplex_output.mp4",
    results_log=results_log,
    timed_output_audio=timed_output_audio,
    output_sample_rate=24000,
)

# Play the generated video
display(Video("duplex_output.mp4", embed=True, width=640))
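To make the control flow above easier to follow, here is a toy sketch of the per-chunk prefill/generate interleaving. `DummyDuplexModel` is a hypothetical stub introduced only for this illustration (it is not part of minicpm_o_4_5_helper); it mimics the listen/speak decision the real model makes each tick.

```python
# Each tick, the model ingests one chunk (prefill) and immediately decides
# whether to keep listening or to emit a speech fragment (generate).

class DummyDuplexModel:
    def __init__(self, speak_after):
        self.heard = 0
        self.speak_after = speak_after  # chunks of listening before replying

    def streaming_prefill(self, audio_chunk):
        self.heard += 1  # ingest one chunk of input context

    def streaming_generate(self):
        if self.heard < self.speak_after:
            return {"is_listen": True, "text": ""}
        return {"is_listen": False, "text": f"reply@{self.heard}"}

model = DummyDuplexModel(speak_after=3)
log = []
for chunk_idx in range(5):
    model.streaming_prefill(audio_chunk=None)
    result = model.streaming_generate()
    log.append("listen" if result["is_listen"] else result["text"])
print(log)  # ['listen', 'listen', 'reply@3', 'reply@4', 'reply@5']
```

The key property, which the real pipeline shares, is that generation is attempted after every chunk rather than once at the end, so the model can start speaking before the input stream is finished.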


The full-duplex mode output can be seen in the video embedded in the original post.

Summary

This article has walked through deploying the MiniCPM-o-4.5 omnimodal model with OpenVINO™. OpenVINO™ provides a substantial performance boost (INT4/INT8 quantization plus a stateful KV cache), making it well suited to edge deployment on Intel CPUs and GPUs with realtime multimodal interaction. As a leading lightweight omnimodal model, MiniCPM-o-4.5 shows strong practical value on Intel platforms.

Resources

Try the OpenVINO™-accelerated MiniCPM-o-4.5 now and experience full-duplex multimodal interaction!

OpenVINO assistant on WeChat: OpenVINO-China

For questions or discussion, add the OpenVINO assistant on WeChat to join the dedicated community group and interact with technical experts in real time.
