Research

Cacheflow: Cross-layer KV Cache Parallelism for LLM Serving at Scale

August 2024 – present; Supervisor: Fan Lai @ GAEA Lab

  • Identified the KV cache as a growing memory and latency bottleneck in large-scale LLM serving systems.
  • Developed a novel cross-layer parallelism framework combining token-wise, layer-wise, and multi-GPU parallelism for hybrid KV cache restoration (illustrated in the sketch after this list).
  • Implemented SLO-aware scheduling that manages I/O contention across requests, batches, and models while maintaining performance guarantees.
  • Addressed limitations of existing systems under long-context and concurrent-restoration workloads.
  • Designed a cross-layer management system that optimally balances computation and communication for KV cache restoration at cluster scale.
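
A minimal sketch of the cross-layer restoration idea, assuming hypothetical helpers (fetch_layer_kv, attend_with_kv) rather than Cacheflow's actual API: while attention for layer i runs on already-restored KV blocks, the KV blocks for layer i+1 are prefetched from slower storage, overlapping restoration I/O with compute.

```python
# Toy sketch of layer-wise pipelined KV cache restoration. All names are
# illustrative placeholders, not the actual Cacheflow interface.
from concurrent.futures import ThreadPoolExecutor


def fetch_layer_kv(request_id: str, layer: int) -> bytes:
    """Simulate restoring one layer's KV blocks from host memory or disk."""
    return f"{request_id}-kv-layer-{layer}".encode()


def attend_with_kv(layer: int, kv_blocks: bytes) -> None:
    """Stand-in for running attention for `layer` over restored KV blocks."""
    _ = (layer, kv_blocks)


def restore_and_prefill(request_id: str, num_layers: int) -> None:
    # Single-slot pipeline: layer i's compute overlaps layer i+1's restore I/O.
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        next_kv = io_pool.submit(fetch_layer_kv, request_id, 0)
        for layer in range(num_layers):
            kv_blocks = next_kv.result()      # block only if I/O lags compute
            if layer + 1 < num_layers:
                next_kv = io_pool.submit(fetch_layer_kv, request_id, layer + 1)
            attend_with_kv(layer, kv_blocks)  # compute while the next restore proceeds


if __name__ == "__main__":
    restore_and_prefill("req-0", num_layers=32)
```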

Tempo: Application-aware LLM Serving with Mixed SLO Requirements

August 2024 – present; Supervisor: Fan Lai @ GAEA Lab

  • Identified three LLM request patterns with distinct SLO requirements: latency-sensitive, throughput-intensive, and collective requests.
  • Developed an SLO-aware scheduler that maximizes service gain by allocating just enough serving bandwidth to meet each request's SLO while preserving residual capacity for other requests.
  • Implemented a hybrid approach using quantile regression forests for response-length estimation and dependency-graph matching for collective request coordination.
  • Created a service density-based prioritization algorithm that balances requests across different SLO constraints without starving low-priority workloads (sketched after this list).
  • Achieved 1.3×–8.3× improvements in service gain and 4.0×–10.3× improvements in SLO goodput over existing systems across diverse workloads and models.
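
A minimal sketch of service density-based prioritization under stated assumptions (the Request fields, aging weight, and schedule helper are illustrative, not Tempo's interface): requests are ranked by service gain per unit of capacity needed to meet their SLO, with a small aging term so low-density requests are not starved.

```python
# Toy sketch of service-density prioritization under a capacity budget.
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    gain: float        # service gain if the SLO is met
    demand: float      # capacity (e.g., token bandwidth) needed to meet the SLO
    wait_time: float   # seconds since the request was admitted


def schedule(requests: list[Request], capacity: float, aging_weight: float = 0.01) -> list[str]:
    """Pick requests to serve this round, highest service density first."""
    def density(r: Request) -> float:
        # Gain per unit of demanded capacity, plus aging to avoid starvation.
        return r.gain / max(r.demand, 1e-9) + aging_weight * r.wait_time

    admitted, used = [], 0.0
    for r in sorted(requests, key=density, reverse=True):
        if used + r.demand <= capacity:   # allocate just enough to meet the SLO
            admitted.append(r.req_id)
            used += r.demand
    return admitted


if __name__ == "__main__":
    reqs = [Request("chat-1", gain=1.0, demand=2.0, wait_time=0.5),
            Request("batch-7", gain=3.0, demand=10.0, wait_time=12.0),
            Request("agent-3", gain=2.0, demand=3.0, wait_time=1.0)]
    print(schedule(reqs, capacity=12.0))
```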

Andes: Defining and Enhancing Quality-of-Experience in LLM Serving

May 2023 – April 2024; Supervisor: Mosharaf Chowdhury @ Symbiotic Lab

  • Identified that in LLM text-streaming services, delivering tokens at least as fast as the user reads them is crucial to user experience, a factor prior serving metrics overlooked.
  • Defined Quality of Experience (QoE) for LLM serving by tracking token delivery at each generation step and the user's experience over the entire streaming session (a simplified sketch follows this list).
  • Formulated QoE optimization as a knapsack problem and developed a scheduling algorithm that maximizes aggregate QoE by allocating serving resources efficiently.
  • Built Andes, an LLM serving system on top of vLLM that integrates the scheduling algorithm to improve QoE in real-time LLM services.
  • Co-authored the paper “Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services” as second author.
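
A simplified sketch of a token-level QoE measure in the spirit of Andes (the reading-speed model and the qoe function are illustrative assumptions, not the paper's exact formulation): QoE is the fraction of tokens delivered no later than the moment the user would need them at their reading speed.

```python
# Toy token-level QoE: how often delivery keeps pace with the user's reading speed.


def qoe(delivery_times: list[float], reading_speed: float) -> float:
    """delivery_times[i] is when token i reached the user (seconds);
    reading_speed is the user's pace in tokens per second."""
    if not delivery_times:
        return 0.0
    on_time = 0
    for i, t in enumerate(delivery_times):
        expected_by = (i + 1) / reading_speed  # when the user needs token i to keep reading
        if t <= expected_by:
            on_time += 1
    return on_time / len(delivery_times)


if __name__ == "__main__":
    # Tokens delivered in an early burst and then stalled: QoE drops even
    # though average throughput may look fine.
    print(qoe([0.1, 0.2, 0.3, 6.0, 7.0], reading_speed=1.0))  # 0.6
```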

Publications