I am a Zhiyuan Honor Program Ph.D. student at Shanghai Jiao Tong University (SJTU), working in the X-LANCE Lab, advised by Prof. Kai Yu and co-advised by Prof. Shinji Watanabe.

My research focuses on Speech Large Language Models (Speech LLMs), with an emphasis on building well-aligned speech understanding systems that are robust to domain shift and multi-speaker conditions.


Research Interests

Broadly, I focus on Speech Large Language Models (Speech LLMs) for speech understanding and reasoning:

  • Multimodal alignment between speech and text for instruction-following speech systems (a minimal sketch follows this list)
  • Efficient adaptation for low-resource / cross-domain settings
  • Speaker-attributed ASR (SA-ASR) and multi-speaker understanding
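
For concreteness, here is a minimal sketch of the alignment idea behind the first bullet: a lightweight projector maps features from a frozen speech encoder into the embedding space of a text LLM, so speech can be consumed alongside text instructions. The module name, dimensions, and downsampling scheme below are illustrative assumptions, not code from any of my papers.

```python
import torch
import torch.nn as nn

class SpeechProjector(nn.Module):
    """Maps speech-encoder features into an LLM's embedding space.

    Hypothetical sketch: a frame-stacking projector, a common way to
    connect a frozen speech encoder to a frozen or LoRA-tuned LLM.
    """

    def __init__(self, speech_dim=1024, llm_dim=4096, stack=4):
        super().__init__()
        self.stack = stack  # concatenate `stack` frames to shorten the sequence
        self.proj = nn.Sequential(
            nn.Linear(speech_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats):  # feats: (batch, frames, speech_dim)
        b, t, d = feats.shape
        t = t - t % self.stack  # drop ragged tail frames
        feats = feats[:, :t, :].reshape(b, t // self.stack, d * self.stack)
        return self.proj(feats)  # (batch, frames // stack, llm_dim)

# The projected "soft tokens" are concatenated with text-instruction
# embeddings and fed to the LLM; typically only the projector (and
# optionally LoRA adapters) is trained on paired speech-text data.
speech_feats = torch.randn(2, 100, 1024)  # e.g. from a frozen encoder
soft_tokens = SpeechProjector()(speech_feats)
print(soft_tokens.shape)  # torch.Size([2, 25, 4096])
```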

Research Experience

My recent work spans both academic and industry research labs:

  • Speech LLMs for Speech Understanding (AISpeech, Suzhou, Jiangsu)
    I work on ASR and multimodal alignment methods that connect speech representations with language model reasoning and instruction following.

  • SA-ASR with Speech LLMs (Shenzhen Research Institute of Big Data, Remote)
    I explore Speech LLM-based frameworks for speaker-attributed transcription, aiming to improve speaker consistency and controllability in multi-speaker scenarios (see the sketch after this list).

  • Speaker Discrimination on Omni/SLM (Hi Lab, Xiaohongshu, Shanghai)
    I study explicit speaker discrimination and implicit speaker selection strategies for multi-speaker understanding, with an eye toward robust speaker identity modeling under real-world conditions.
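
To illustrate the SA-ASR setting from the second item above, one common formulation serializes a multi-speaker conversation into a single token stream with speaker tags, which a Speech LLM decoder is trained to emit so that transcription and speaker attribution are predicted jointly. The tag format and helper below are illustrative assumptions, not the exact scheme used in my work.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # speaker label, e.g. "alice"
    start: float   # segment start time in seconds
    text: str

def serialize(segments, tag="<spk{}>"):
    """Serialize speaker-attributed segments into one target string.

    Hypothetical format: segments sorted by start time, each prefixed
    with a speaker tag assigned in order of first appearance.
    """
    speakers, parts = {}, []
    for seg in sorted(segments, key=lambda s: s.start):
        idx = speakers.setdefault(seg.speaker, len(speakers) + 1)
        parts.append(f"{tag.format(idx)} {seg.text}")
    return " ".join(parts)

segments = [
    Segment("alice", 0.0, "how was the demo"),
    Segment("bob", 2.1, "it went well"),
    Segment("alice", 3.8, "great"),
]
print(serialize(segments))
# <spk1> how was the demo <spk2> it went well <spk1> great
```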


Publications (Selected)

The full list is available on the Publications page.

(* indicates equal contribution.)
  • A Survey on Speech Large Language Models for Understanding
    Jing Peng, Y. Wang, Y. Fang, Y. Xi, X. Li, X. Zhang, K. Yu.
    arXiv:2410.18908. Accepted by IEEE JSTSP.
    https://arxiv.org/abs/2410.18908

  • TASU: Text-Only Alignment for Speech Understanding
    Jing Peng, Y. Yang, X. Li, Y. Xi, Q. Tang, Y. Fang, J. Li, K. Yu.
    arXiv:2511.03310. Accepted by ICASSP 2026.
    https://arxiv.org/abs/2511.03310

  • Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning
    Y. Fang, Jing Peng, X. Li, Y. Xi, C. Zhang, G. Zhong, K. Yu.
    arXiv:2506.05671. Accepted by ASRU 2025.
    https://arxiv.org/abs/2506.05671

  • MOSA: Mixtures of Simple Adapters Outperform Monolithic Approaches in LLM-based Multilingual ASR
    J. Li, Jing Peng, Y. Fang, S. Wang, K. Yu.
    arXiv:2508.18998. Accepted by ICASSP 2026.
    https://arxiv.org/abs/2508.18998


Contact Information

I am always happy to chat and collaborate on the topics above. You can reach me via: