Distributed full-graph training of Graph Neural Networks (GNNs) over lar...
When communicating with elders with cognitive impairment, cognitive stim...
We present Rhino, a system for accelerating tensor programs with automat...
As the size of deep learning models gets larger and larger, training tak...
This paper presents TAG, an automatic system to derive optimized DNN tra...
Inductive node-wise graph incremental learning is a challenging task due...
Machine learning (ML) tasks are one of the major workloads in today's ed...
This paper proposes DisCo, an automatic deep learning compilation module...
Distributed training using multiple devices (e.g., GPUs) has been widely...
To train modern large DNN models, pipeline parallelism has recently emer...
Fueled by advances in distributed deep learning (DDL), recent years have...
Efficient scheduling of distributed deep learning (DL) jobs in large GPU...
Graph neural networks (GNNs) have extended the success of deep neural ne...
Online algorithms are an important branch of algorithm design. Designing o...
Deep learning frameworks such as TensorFlow and PyTorch provide a produc...
Recent years have witnessed a rapid growth of distributed machine learni...
In recent years, to sustain the resource-intensive computational needs f...
It is a challenging task to train large DNN models on sophisticated GPU ...
Many emerging AI applications request distributed machine learning (ML) ...
Modern deep learning models have been exploited in various domains, incl...
More and more companies have deployed machine learning (ML) clusters, wh...
Resilience functionality, including failure resilience and flow migratio...
Nowadays large-scale distributed machine learning systems have been depl...
Optimization algorithms for training deep models not only affect the co...