About Me

I am a researcher at ByteDance, where my work focuses on Large Language Models (LLMs). My research interests lie in developing stronger architectures for LLMs. Before joining ByteDance, I completed my master's degree at the Chinese Academy of Sciences in 2020, focusing on generative AI research. Prior to that, I earned my bachelor's degree from Northeastern University in 2017.

Selected Publications

Virtual Width Networks
Seed Team Project Lead
Tech Report

We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token prediction and 3 times for next-2-token prediction. The advantage grows over the course of training, as both the loss gap and the convergence-speedup ratio increase, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.
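
As a rough illustration of the decoupling idea, here is a minimal PyTorch sketch: a wide "virtual" residual stream is read into a narrower backbone layer and written back, so the heavy compute stays at the backbone width. The class name, projection layout, and hyperparameters are my own illustrative choices, not the VWN implementation.

```python
# Minimal sketch (not the official implementation): decouple a wide residual
# stream ("virtual width") from a narrower backbone width.
import torch
import torch.nn as nn

class VirtualWidthBlock(nn.Module):
    """Keeps a wide residual stream while running the backbone layer
    at a narrower hidden size (illustrative hyperparameters)."""
    def __init__(self, virtual_dim: int, backbone_dim: int):
        super().__init__()
        self.down = nn.Linear(virtual_dim, backbone_dim, bias=False)  # read from the wide stream
        self.layer = nn.Sequential(                                   # backbone compute at narrow width
            nn.Linear(backbone_dim, 4 * backbone_dim),
            nn.GELU(),
            nn.Linear(4 * backbone_dim, backbone_dim),
        )
        self.up = nn.Linear(backbone_dim, virtual_dim, bias=False)    # write back to the wide stream

    def forward(self, x_wide: torch.Tensor) -> torch.Tensor:
        # Residual update lives in the wide (virtual) space; heavy compute stays narrow.
        return x_wide + self.up(self.layer(self.down(x_wide)))

# Usage: an 8x virtual-width expansion over a 512-wide backbone.
block = VirtualWidthBlock(virtual_dim=8 * 512, backbone_dim=512)
tokens = torch.randn(2, 16, 8 * 512)   # (batch, sequence, virtual width)
print(block(tokens).shape)             # torch.Size([2, 16, 4096])
```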

Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts
Yike Yuan, Ziyu Wang, Zihao Huang, Defa Zhu, Xun Zhou, Jingyi Yu, Qiyang Min
ICML 2025

Diffusion models have emerged as a mainstream framework in visual generation. Building on this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race.
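
To give a flavor of flexible routing, the toy sketch below implements a global top-k routing step in PyTorch, where all token-expert pairs compete ("race") for a fixed budget of assignments instead of each token picking its own top-k. The function name, scoring, and gating choices are illustrative assumptions; the Expert Race routing in the paper differs in its details (e.g., normalization and load balancing).

```python
# Toy sketch of a "global top-k" MoE routing step; not the paper's exact method.
import torch

def global_topk_routing(scores: torch.Tensor, k: int):
    """scores: (num_tokens, num_experts) router logits.
    Instead of picking top-k experts per token, let all token-expert pairs
    compete and keep the k highest-scoring pairs overall."""
    num_tokens, num_experts = scores.shape
    flat = scores.flatten()                       # (num_tokens * num_experts,)
    topk_vals, topk_idx = torch.topk(flat, k)
    token_idx = topk_idx // num_experts           # which token each winning pair belongs to
    expert_idx = topk_idx % num_experts           # which expert it is routed to
    gate = torch.softmax(topk_vals, dim=-1)       # assumed gating over the winners
    return token_idx, expert_idx, gate

# Usage: 6 tokens, 4 experts, keep the 8 best token-expert pairs overall.
scores = torch.randn(6, 4)
t, e, g = global_topk_routing(scores, k=8)
print(t.tolist(), e.tolist())
```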

Frac-Connections: Fractional Extension of Hyper-Connections
Defa Zhu, Hongzhi Huang, Jundong Zhou, Zihao Huang, Yutao Zeng, Banggu Wu, Qiyang Min, Xun Zhou
Tech Report

Residual connections are central to modern deep learning architectures. Hyper-Connections recently generalized residual connections by introducing multiple connection strengths. In this paper, we propose Frac-Connections, a novel approach that divides hidden states into multiple parts rather than expanding their width.
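
The sketch below shows one simplified way to read this idea in PyTorch: the hidden state is viewed as n fractions, each scaled by a learnable connection strength when forming the layer input and when receiving the layer output. This is an assumption-laden toy, not the Frac-Connections formulation; the class name and mixing scheme are illustrative.

```python
# Simplified sketch: split a width-d hidden state into n fractions with
# learnable read/write strengths, keeping the total residual width at d.
import torch
import torch.nn as nn

class FracConnection(nn.Module):
    def __init__(self, hidden_dim: int, n_frac: int, layer: nn.Module):
        super().__init__()
        assert hidden_dim % n_frac == 0
        self.n = n_frac
        self.layer = layer                               # operates at the full hidden_dim
        self.read = nn.Parameter(torch.ones(n_frac))     # fraction -> layer input strength
        self.write = nn.Parameter(torch.ones(n_frac))    # layer output -> fraction strength

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden); view the hidden state as n fractions of width hidden/n.
        b, s, d = x.shape
        parts = x.view(b, s, self.n, d // self.n)
        # Layer input: concatenation of the scaled fractions (still width d).
        layer_in = (self.read.view(1, 1, self.n, 1) * parts).view(b, s, d)
        out = self.layer(layer_in).view(b, s, self.n, d // self.n)
        # Residual write-back: each fraction receives its scaled share of the output.
        parts = parts + self.write.view(1, 1, self.n, 1) * out
        return parts.view(b, s, d)

# Usage: hidden size 512 split into 4 fractions of width 128.
frac = FracConnection(512, 4, nn.Linear(512, 512))
print(frac(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```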

Over-Tokenized Transformer: Vocabulary Is Generally Worth Scaling
Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou
ICML 2025

Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance remains underexplored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance.
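
As a hedged illustration of decoupled vocabularies, the PyTorch sketch below gives the input embedding a much larger (hashed-bigram) table than the output softmax, so the input vocabulary can be scaled independently of the output one. The hashing scheme and class name are my own simplifications, not necessarily the construction used in the paper.

```python
# Simplified sketch: input embeddings draw on a larger vocabulary than the output head.
import torch
import torch.nn as nn

class DecoupledVocabEmbedding(nn.Module):
    def __init__(self, base_vocab: int, input_vocab: int, dim: int):
        super().__init__()
        self.unigram = nn.Embedding(base_vocab, dim)   # standard token embedding
        self.bigram = nn.Embedding(input_vocab, dim)   # large input-only table
        self.base_vocab = base_vocab
        self.input_vocab = input_vocab

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq) token ids over the base vocabulary.
        prev = torch.roll(ids, shifts=1, dims=1)
        prev[:, 0] = 0                                  # no previous token at position 0
        bigram_ids = (prev * self.base_vocab + ids) % self.input_vocab  # hash the (prev, cur) pair
        return self.unigram(ids) + self.bigram(bigram_ids)

base_vocab, dim = 32_000, 512
embed = DecoupledVocabEmbedding(base_vocab, input_vocab=8 * base_vocab, dim=dim)
lm_head = nn.Linear(dim, base_vocab)                    # output vocabulary stays at the base size
ids = torch.randint(0, base_vocab, (2, 16))
print(lm_head(embed(ids)).shape)                        # torch.Size([2, 16, 32000])
```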

Ultra-Sparse Memory Network
Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou
ICLR 2025

It is widely acknowledged that the performance of Transformer models is exponentially related to their number of parameters and computational complexity. This work introduces UltraMem, which incorporates a large-scale, ultra-sparse memory layer to address the limitations of MoE. Our approach significantly reduces inference latency while maintaining model performance.
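
A minimal PyTorch sketch of the general idea of a sparse memory layer: each token scores a large value table but reads only its top-k slots, so most memory parameters are untouched per token. The actual UltraMem design is considerably more involved (e.g., in how slots are addressed); the class and parameters below are illustrative only.

```python
# Sketch of a sparse key-value memory layer with top-k slot access per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMemoryLayer(nn.Module):
    def __init__(self, dim: int, num_slots: int, topk: int):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim) / dim ** 0.5)
        self.values = nn.Embedding(num_slots, dim)   # large, sparsely accessed table
        self.topk = topk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        scores = x @ self.keys.t()                             # (batch, seq, num_slots)
        top_scores, top_idx = scores.topk(self.topk, dim=-1)   # keep k best slots per token
        weights = F.softmax(top_scores, dim=-1)                # (batch, seq, k)
        selected = self.values(top_idx)                        # (batch, seq, k, dim)
        return x + (weights.unsqueeze(-1) * selected).sum(dim=-2)

mem = SparseMemoryLayer(dim=256, num_slots=65_536, topk=16)
print(mem(torch.randn(2, 8, 256)).shape)  # torch.Size([2, 8, 256])
```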

Hyper-Connections
Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, Xun Zhou
ICLR 2025

We present hyper-connections, a simple yet effective method that can serve as an alternative to residual connections. This approach specifically addresses common drawbacks observed in residual connection variants, such as the seesaw effect between gradient vanishing and representation collapse.
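
The PyTorch sketch below captures the basic shape of the idea described above: n parallel residual streams, a learnable mix of streams feeding the layer, and learnable strengths writing the layer output back to every stream. Initialization, dynamic (input-dependent) connections, and other details from the paper are omitted; names and shapes are illustrative.

```python
# Simplified sketch of multi-stream residual connections with learnable strengths.
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    def __init__(self, dim: int, n_streams: int, layer: nn.Module):
        super().__init__()
        self.layer = layer
        # Static connection weights (the paper also studies dynamic, input-dependent ones).
        self.read = nn.Parameter(torch.ones(n_streams) / n_streams)  # streams -> layer input
        self.write = nn.Parameter(torch.ones(n_streams))             # layer output -> streams
        self.mix = nn.Parameter(torch.eye(n_streams))                # stream-to-stream mixing

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (n_streams, batch, seq, dim)
        layer_in = torch.einsum("n,nbsd->bsd", self.read, h)         # weighted read of the streams
        layer_out = self.layer(layer_in)
        h = torch.einsum("mn,nbsd->mbsd", self.mix, h)               # remix the residual streams
        return h + self.write.view(-1, 1, 1, 1) * layer_out          # weighted write to every stream

n, dim = 4, 512
hc = HyperConnection(dim, n, nn.Linear(dim, dim))
streams = torch.randn(n, 2, 10, dim)
print(hc(streams).shape)  # torch.Size([4, 2, 10, 512])
```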