About Me

Hi! I am Yitian Gong, a master's student at the FudanNLP Lab at Fudan University, advised by Prof. Xipeng Qiu. I am currently interning at MOSI and the Shanghai Innovation Institute, where I work on multimodal foundation models, multimodal representation learning, speech generation, end-to-end speech models, and large-scale distributed training.

My research interests focus on Audio Foundation Models, Multimodal Representation Learning, and Large-Scale Distributed Training. I have worked on several speech and audio foundation model projects, including MOSS-TTS-Family, MOSS-Audio-Tokenizer, SpeechGPT2-preview, and XY-Tokenizer.

I expect to graduate in 2027 and am seeking Ph.D. and job opportunities in related research areas. I am also open to academic collaboration. Please feel free to contact me at ytgong24@m.fudan.edu.cn if you are interested.

Keywords: Audio Tokenizer, Speech Representation Learning, TTS, ASR, Speech Large Language Models, Native Multimodal Foundation Models, Large-Scale Distributed Training.

News

2026.04: We released MOSS-TTS-Nano, a 0.1B-parameter model designed for real-time speech generation that can run directly on CPU without a GPU.
2026.04: We released MOSS-Audio-Tokenizer-Nano, a 20M-parameter ultra-lightweight audio tokenizer that supports 48 kHz stereo input and output and serves as the audio tokenizer for MOSS-TTS-Nano.
2026.03: Our XY-Tokenizer was accepted to the ACL 2026 main conference. See you in San Diego.
2026.02: We released MOSS-TTS-Family, designed for high-fidelity, highly expressive, and complex real-world speech generation scenarios, including long-form speech, multi-speaker dialogue, voice and character design, environmental sound effects, and real-time streaming TTS.
2026.02: We released MOSS-Audio-Tokenizer, a purely Transformer-based audio tokenizer designed for next-generation native audio language models and used in MOSS-TTS-Family.
2026.01: We released MOSS Transcribe Diarize, an end-to-end model for multi-speaker speech transcription.
2025.10: I gave invited talks at the OPPO Speech Group on end-to-end speech tokenizers and speech-to-speech LLMs.
2025.09: We released MOSS-Speech, an end-to-end speech language model that can generate speech without first producing text.
2025.06: We released MOSS-TTSD, which directly generates high-quality conversational speech from multi-speaker dialogue text and accurately models conversational prosody and intonation.
2025.06: We released XY-Tokenizer, a SOTA speech tokenizer at around 1 kbps that effectively mitigates the semantic-acoustic conflict in low-bitrate speech codecs.
2025.01: We released SpeechGPT2-Preview, a GPT-4o-level speech LLM and our first human-like real-time interaction system as we move toward context intelligence.
2024.04: I joined the FudanNLP Lab as a master's student.

Publications

  1. MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models
    Yitian Gong, Kuangwei Chen, Zhaoye Fei, Xiaogui Yang, Ke Chen, Yang Wang, Kexin Huang, Mingshu Chen, Ruixiao Li, Qingyuan Cheng, Shimin Li, Xipeng Qiu
    Technical Report
    MOSS-Audio-Tokenizer is a Causal Transformer-based audio tokenizer built on the CAT architecture. Trained on 3M hours of diverse audio, it supports streaming and variable bitrates, delivering SOTA reconstruction and strong performance in generation and understanding—serving as a unified interface for next-generation native audio language models.
    Role: Project Leader
  2. MOSS-TTS Technical Report
    Yitian Gong*, Botian Jiang*, Yiwei Zhao, Yucheng Yuan, Kuangwei Chen, Yaozhou Jiang, Cheng Chang, Dong Hong, Mingshu Chen, Ruixiao Li, et al.
    * Equal contribution.
    Technical Report
    MOSS-TTS Family is an open-source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It is designed for high-fidelity, highly expressive speech generation in complex real-world scenarios, covering stable long-form speech, multi-speaker dialogue, voice/character design, environmental sound effects, and real-time streaming TTS.
    Role: Co-Lead
  3. XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs
    Yitian Gong, Luozhijie Jin, Kuangwei Chen, Dong Zhang, Ruifan Deng, Xiaogui Yang, Xin Zhang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu
    ACL 2026 Main
    XY-Tokenizer is a low-bitrate speech codec designed to balance semantic alignment and acoustic fidelity for speech-language modeling. Trained with a structured multi-stage, multi-task strategy, it mitigates the semantic-acoustic conflict in prior codecs while preserving fine-grained details for reconstruction, delivering strong speech understanding and generation performance alongside high-quality reconstruction in both clean and out-of-distribution settings.
    Role: Project Leader
  4. MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance
    Xingjian Zhao, Zhe Xu, Qinyuan Cheng, Zhaoye Fei, Luozhijie Jin, Yang Wang, Hanfu Chen, Yaozhou Jiang, Qinghui Gao, Ke Chen, Ruixiao Li, Yitian Gong, et al.
    Technical Report
    MOSS-Speech introduces true end-to-end speech interaction by directly generating speech without first producing text. It removes the text bottleneck of cascaded or text-guided systems while inheriting knowledge from pretrained text language models for more natural and efficient speech-to-speech dialogue.
  5. MOSS Transcribe Diarize Technical Report
    Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Hanfu Chen, Jingqi Chen, Ke Chen, Liwei Fan, Yi Jiang, Jie Zhu, Muchen Li, Wenxuan Wang, Yang Wang, Zhe Xu, Yitian Gong, et al.
    Technical Report
    MOSS Transcribe Diarize is a unified multimodal large language model for end-to-end speaker-attributed, time-stamped transcription. With large-scale real-world training data and a 128k context window for long-form audio, it delivers strong robustness and outperforms state-of-the-art commercial systems across multiple benchmarks.

Projects

  1. SpeechGPT2-preview
    SpeechGPT2-preview is an end-to-end real-time spoken interaction system trained on million-hour-scale speech data, designed for low-latency, interruptible, and human-like dialogue. It supports expressive style control, role-playing, and strong text-aligned capabilities such as tool use and knowledge integration, with current training focused on Chinese speech.
  2. MOSS-TTSD: Text to Spoken Dialogue Generation
    MOSS-TTSD is a spoken dialogue generation model that enables expressive dialogue speech synthesis in both Chinese and English, supporting zero-shot multi-speaker voice cloning, voice event control, and long-form speech generation.
  3. MOSS-TTS-Nano
    MOSS-TTS-Nano is an open-source multilingual tiny speech generation model from MOSI.AI and the OpenMOSS team. With only 0.1B parameters, it is designed for real-time speech generation, can run directly on CPU without a GPU, and keeps the deployment stack simple enough for local demos, web serving, and lightweight product integration.
    Role: Project Leader

Education

Fudan University

M.S. in Computer Science and Technology, advised by Prof. Xipeng Qiu.

Sept. 2024 - Jun. 2027 expected

University of Jinan

B.S. in Data Science and Big Data Technology.

Sept. 2020 - Jun. 2024

Experience

MOSI (模思智能)

Research Intern. Research and engineering on multimodal foundation models, speech tokenizers, speech understanding, speech generation, and large-scale distributed training.

Apr. 2025 - present

Shanghai Innovation Institute (SII)

Research Intern. Research and engineering on multimodal foundation models, speech tokenizers, speech understanding, speech generation, and large-scale distributed training.

Apr. 2025 - present

OpenMOSS

Member of OpenMOSS.

Apr. 2024 - present

Honors

🏅 ICPC Asia Regional, Silver Medal, 2022; EC-Final Bronze Medal, 2023.
🏅 CCPC, Silver Medal, 2022; National Finals Bronze Medal, 2023.
🏅 Lanqiao Cup, National First Prize, 2022, 2023, 2024.
🏅 RAICOM Programming Skills, National First Prize, 2022, 2023.
🏅 ICPC Shandong Provincial Contest, Gold Medal, 2022, 2023, 2024.

Skills

Programming: C, Python.

Deep Learning: PyTorch, Transformers, Megatron, large-scale distributed training.

Speech: SpeechTokenizer, VAE, ASR, TTS, audio tokenization.

Languages: Mandarin, Wu Chinese, English.