News 🎉

🌟 The story began on 2024/5/20: the first self-recommendation letter I wrote to Prof. Kai Yu

I am now a Zhiyuan Honor Ph.D. student at Shanghai Jiao Tong University (SJTU), X-LANCE Lab, advised by Prof. Kai Yu and co-advised by Prof. Shinji Watanabe. I also collaborate closely with Prof. Shuai Wang.

My research focuses on Speech Large Language Models (Speech LLMs), with an emphasis on building well-aligned speech understanding systems that are robust to domain shift and multi-speaker conditions.

Research Interests

My research centers on building robust and practical speech understanding systems, spanning from foundational ASR to modern Speech Large Language Models.

🧠 Speech Large Language Models for Understanding
📊 Survey & Benchmark

Building reproducible experimentation frameworks and benchmarks to measure what speech understanding systems can and cannot do.

Representative: SURE, ISA-Bench, Survey

🔗 Speech-Text Alignment

Aligning speech representations with language models through controllable simulation and text-only adaptation techniques.

Representative: TASU, TASU2

🤖 Agentic Systems

Equipping speech and audio systems with agentic reasoning, multi-modal evidence, and reliable multi-agent collaboration.

Representative: Audio-Mind, VISA, XFlow

🌍 Multilingual and Multispeaker

Tackling complex real-world scenarios with multiple speakers and multiple languages under unified frameworks.

Representative: G-STAR, MOSA

🎙️ Automatic Speech Recognition (Traditional)

Alongside Speech LLM research, I continue to work on foundational ASR problems.

🎙️ Streaming & Non-streaming ASR

Unified architectures such as TC-BiMamba that bridge streaming and non-streaming recognition.

Representative: TC-BiMamba

✍️ ASR Error Correction & Controllability

LLM-based error correction and controllable contextual speech recognition.

Representative: Fewer Hallucinations, Joint Decoding

📏 Reliability & Evaluation

Metrics like RAS that measure the reliability of ASR outputs, going beyond word error rate (WER) alone.

Representative: RAS

Research Experience

🎙️ Speech LLMs for Speech Understanding

AISpeech, Suzhou, Jiangsu
I work on ASR and multimodal alignment methods that connect speech representations with language model reasoning and instruction following.

🗣️ SA-ASR with Speech LLMs

Shenzhen Research Institute of Big Data, Remote
I explore Speech LLM-based frameworks for speaker-attributed transcription, aiming to improve speaker consistency and controllability in multi-speaker scenarios.

👥 Speaker Discrimination on Omni/SLM

Hi Lab, Xiaohongshu, Shanghai
I study explicit speaker discrimination and implicit speaker selection strategies for multi-speaker understanding, with an eye toward robust speaker identity modeling under real-world conditions.

Publications (Selected)

TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs
Jing Peng*, C. Wang*, Y. Yang, L. Qian, J. Li, Y. Xi, S. Wang, K. Yu
arXiv:2604.08384 · Accepted by Interspeech 2026
Audio-Mind: An Auditable Agentic Framework for Audio Understanding
Y. Wang*, Jing Peng*, H. Li, C. Wang, W. Tu, Y. Xi, Z. Sun, K. Yu, S. Wang
arXiv:2605.28480 · Submitted to EMNLP 2026
XFlow: An Executable Protocol Programming System for Reliable Multi-Agent Workflows
H. Li*, Jing Peng*, Z. Wang, L. Chen, K. Yu
arXiv:2606.14790
G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition
Jing Peng*, Z. Chen*, H. Li*, Y. Wang, D. Ma, M. Li, Y. Du, D. Xu, K. Yu, S. Wang
arXiv:2603.10468 · Submitted to EMNLP 2026
TC-BiMamba: Trans-Chunk bidirectionally within BiMamba for unified streaming and non-streaming ASR
Jing Peng*, Q. She*, Y. Fang, Y. Xi, K. Yu
arXiv:2602.11546 · Submitted to EMNLP 2026
A Unified and Reproducible Experimentation Framework for Speech Understanding
Jing Peng*, J. Du*, C. Wang*, H. Li*, Y. Yang*, et al.
arXiv:2605.30899 · Accepted by Interspeech 2026
RAS: a Reliability Oriented Metric for Automatic Speech Recognition
W. Huang, Y. Qiu, B. Li, Y. Guo, Jing Peng, H. Wang, X. Chen, K. Yu
arXiv:2604.24278 · Accepted by Interspeech 2026
A Survey on Speech Large Language Models for Understanding
Jing Peng*, Y. Wang*, Y. Fang, Y. Xi, X. Li, X. Zhang, K. Yu
arXiv:2410.18908 · Accepted by IEEE JSTSP
TASU: Text-Only Alignment for Speech Understanding
Jing Peng, Y. Yang, X. Li, Y. Xi, Q. Tang, Y. Fang, J. Li, K. Yu
arXiv:2511.03310 · Accepted by ICASSP 2026
Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning
Y. Fang*, Jing Peng*, X. Li, Y. Xi, C. Zhang, G. Zhong, K. Yu
arXiv:2506.05671 · Accepted by ASRU 2025
MOSA: Mixtures of Simple Adapters Outperform Monolithic Approaches in LLM-based Multilingual ASR
Junjie Li, Jing Peng, Yangui Fang, Shuai Wang, Kai Yu
arXiv:2508.18998 · Accepted by ICASSP 2026

Contact Information

I am always happy to chat and collaborate on the topics above. You can contact me by: