I am a Zhiyuan Honor Ph.D. student in the X-LANCE Lab at Shanghai Jiao Tong University (SJTU), advised by Prof. Kai Yu and co-advised by Prof. Shinji Watanabe.
My research focuses on Speech Large Language Models (Speech LLMs), with an emphasis on building well-aligned speech understanding systems that are robust to domain shift and multi-speaker conditions.
Research Interests
- Multimodal alignment between speech and text for instruction-following speech systems
- Efficient adaptation for low-resource / cross-domain settings
- Speaker-attributed ASR (SA-ASR) and multi-speaker understanding
Publications (Selected)
* indicates equal contribution.
TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs
Jing Peng*, C. Wang*, Y. Yang, L. Qian, J. Li, Y. Xi, S. Wang, K. Yu.
arXiv:2604.08384.
Accepted by Interspeech 2026.
Audio-Mind: An Auditable Agentic Framework for Audio Understanding
Y. Wang*, Jing Peng*, H. Li, C. Wang, W. Tu, Y. Xi, Z. Sun, K. Yu, S. Wang.
arXiv:2605.28480.
Submitted to EMNLP 2026.
G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition
Jing Peng*, Z. Chen*, H. Li*, Y. Wang, D. Ma, M. Li, Y. Du, D. Xu, K. Yu, S. Wang.
arXiv:2603.10468.
Submitted to EMNLP 2026.
TC-BiMamba: Trans-Chunk bidirectionally within BiMamba for unified streaming and non-streaming ASR
Jing Peng*, Q. She*, Y. Fang, Y. Xi, K. Yu.
arXiv:2602.11546.
Submitted to EMNLP 2026.
A Unified and Reproducible Experimentation Framework for Speech Understanding
Jing Peng*, J. Du*, C. Wang*, H. Li*, Y. Yang*, Y. Wang, X. Gu, G. Chen, Y. Wang, J. Li, Z. Zhao, H. Wang, W. Tu, H. Li, D. Ma, L. Qian, Y. Xi, W. Wen, J. Guo, H. Zhang, S. Fan, W. Jiang, S. Wang, K. Yu.
arXiv:2605.30899.
Accepted by Interspeech 2026.
RAS: a Reliability Oriented Metric for Automatic Speech Recognition
W. Huang, Y. Qiu, B. Li, Y. Guo, Jing Peng, H. Wang, X. Chen, K. Yu.
arXiv:2604.24278.
Accepted by Interspeech 2026.
A Survey on Speech Large Language Models for Understanding
Jing Peng*, Y. Wang*, Y. Fang, Y. Xi, X. Li, X. Zhang, K. Yu.
arXiv:2410.18908.
Accepted by IEEE JSTSP.
TASU: Text-Only Alignment for Speech Understanding
Jing Peng, Y. Yang, X. Li, Y. Xi, Q. Tang, Y. Fang, J. Li, K. Yu.
arXiv:2511.03310.
Accepted by ICASSP 2026.
Contact Information
Publications
* indicates equal contribution.
🧠 Speech Large Language Models for Understanding
📊 Survey & Benchmark
A Unified and Reproducible Experimentation Framework for Speech Understanding
Jing Peng*, J. Du*, C. Wang*, H. Li*, Y. Yang*, Y. Wang, X. Gu, G. Chen, Y. Wang, J. Li, Z. Zhao, H. Wang, W. Tu, H. Li, D. Ma, L. Qian, Y. Xi, W. Wen, J. Guo, H. Zhang, S. Fan, W. Jiang, S. Wang, K. Yu
Interspeech 2026 (accepted), 2026
A Survey on Speech Large Language Models for Understanding
Jing Peng*, Y. Wang*, Y. Fang, Y. Xi, X. Li, X. Zhang, K. Yu
IEEE JSTSP (accepted), 2024
ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models
B. Li, W. Huang, Y. Qiu, Y. Guo, H. Wang, Z. Li, Jing Peng, Z. Ma, X. Chen, K. Yu
ICASSP 2026 (accepted), 2026
🔗 Speech-Text Alignment
TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs
Jing Peng*, C. Wang*, Y. Yang, L. Qian, J. Li, Y. Xi, S. Wang, K. Yu
Interspeech 2026 (accepted), 2026
TASU: Text-Only Alignment for Speech Understanding
Jing Peng, Y. Yang, X. Li, Y. Xi, Q. Tang, Y. Fang, J. Li, K. Yu
ICASSP 2026 (accepted), 2026
Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning
Y. Fang*, Jing Peng*, X. Li, Y. Xi, C. Zhang, G. Zhong, K. Yu
ASRU 2025 (accepted), 2025
🤖 Agentic Systems
Audio-Mind: An Auditable Agentic Framework for Audio Understanding
Y. Wang*, Jing Peng*, H. Li, C. Wang, W. Tu, Y. Xi, Z. Sun, K. Yu, S. Wang
submitted to EMNLP 2026, 2026
VISA: A Visual Information Strengthened Audio-Reasoning System for the Interspeech 2026 ARC Agent Track
W. Tu, J. Gao, Y. Huo, Y. Wang, Jing Peng, B. Li, Z. Ma, T. Liu, S. Fan, K. Yu, X. Chen, Z. Zheng
Interspeech 2026 (accepted), 2026
XFlow: An Executable Protocol Programming System for Reliable Multi-Agent Workflows
H. Li*, Jing Peng*, Z. Wang, L. Chen, K. Yu
arXiv preprint, 2026
Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness
Z. Wang, H. Li, Z. Yang, Z. Hu, S. Zuo, Y. Zhang, D. Ma, D. Luo, C. Wang, Jing Peng, T. Huang, S. Guo, H. Wang, Z. Zhu, S. Han, Y. Cao, K. Yu, L. Chen
arXiv preprint, 2026
🌍 Multilingual and Multispeaker
G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition
Jing Peng*, Z. Chen*, H. Li*, Y. Wang, D. Ma, M. Li, Y. Du, D. Xu, K. Yu, S. Wang
submitted to EMNLP 2026, 2026
MOSA: Mixtures of Simple Adapters Outperform Monolithic Approaches in LLM-based Multilingual ASR
J. Li, Jing Peng, Y. Fang, S. Wang, K. Yu
ICASSP 2026 (accepted), 2026
🎙️ Automatic Speech Recognition (Traditional)
TC-BiMamba: Trans-Chunk bidirectionally within BiMamba for unified streaming and non-streaming ASR
Jing Peng*, Q. She*, Y. Fang, Y. Xi, K. Yu
submitted to EMNLP 2026, 2026
Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction
Y. Fang, B. Cheng, Jing Peng, X. Li, Y. Xi, C. Zhang, G. Zhong
ASRU 2025 (accepted), 2025
Joint Decoding Method for Controllable Contextual Speech Recognition Based on Speech LLM
Y. Fang*, Jing Peng*, Y. Xi, X. Li, H. Li, C. Zhang, G. Zhong, K. Yu
arXiv preprint, 2025
RAS: a Reliability Oriented Metric for Automatic Speech Recognition
W. Huang, Y. Qiu, B. Li, Y. Guo, Jing Peng, H. Wang, X. Chen, K. Yu
Interspeech 2026 (accepted), 2026