Jing Peng

News 🎉

🌟 The story began on 2024/5/20 with the first self-recommendation letter I wrote to Prof. Kai Yu

🥳 May 2026

Four papers accepted to Interspeech 2026!

TASU2
SURE
RAS
VISA (Agent Track)

📖 Nov 2025

Survey published in IEEE JSTSP

A Survey on Speech Large Language Models for Understanding

🎊 Oct 2025

Three papers accepted to ICASSP 2026!

TASU (Oral)
MOSA (Poster)
ISA-Bench (Oral)

🔥 Aug 2025

Two papers accepted to ASRU 2025!

Low-Resource Domain Adaptation
Fewer Hallucinations, More Verification

I am a Zhiyuan Honor Ph.D. student in the X-LANCE Lab at Shanghai Jiao Tong University (SJTU), advised by Prof. Kai Yu and co-advised by Prof. Shinji Watanabe. I also collaborate closely with Prof. Shuai Wang.

My research focuses on Speech Large Language Models (Speech LLMs), with an emphasis on building well-aligned speech understanding systems that are robust to domain shift and multi-speaker conditions.

Research Interests

My research centers on building robust and practical speech understanding systems, ranging from foundational automatic speech recognition (ASR) to modern Speech LLMs.

🧠 Speech Large Language Models for Understanding

📊 Survey & Benchmark

Building reproducible experimentation frameworks and benchmarks to measure what speech understanding systems can and cannot do.

Representative: SURE, ISA-Bench, Survey

🔗 Speech-Text Alignment

Aligning speech representations with language models through controllable simulation and text-only adaptation techniques.

Representative: TASU, TASU2

🤖 Agentic Systems

Equipping speech and audio systems with agentic reasoning, multi-modal evidence, and reliable multi-agent collaboration.

Representative: Audio-Mind, VISA, XFlow

🌍 Multilingual and Multispeaker

Tackling complex real-world scenarios with multiple speakers and multiple languages under unified frameworks.

Representative: G-STAR, MOSA

🎙️ Automatic Speech Recognition (Traditional)

Alongside Speech LLM research, I continue to work on foundational ASR problems.

🎙️ Streaming & Non-streaming ASR

Unified architectures such as TC-BiMamba that bridge streaming and non-streaming recognition.

Representative: TC-BiMamba

✍️ ASR Error Correction & Controllability

LLM-based error correction and controllable contextual speech recognition.

Representative: Fewer Hallucinations, Joint Decoding

📏 Reliability & Evaluation

Metrics such as RAS that assess the reliability of ASR outputs beyond word error rate (WER).

Representative: RAS

Research Experience

🎙️ Speech LLMs for Speech Understanding

AISpeech, Suzhou, Jiangsu
I work on ASR and multimodal alignment methods that connect speech representations with language model reasoning and instruction following.

🗣️ SA-ASR with Speech LLMs

Shenzhen Research Institute of Big Data, Remote
I explore Speech LLM-based frameworks for speaker-attributed transcription, aiming to improve speaker consistency and controllability in multi-speaker scenarios.

👥 Speaker Discrimination on Omni/SLM

Hi Lab, Xiaohongshu, Shanghai
I study explicit speaker discrimination and implicit speaker selection strategies for multi-speaker understanding, with an eye toward robust speaker identity modeling under real-world conditions.