Fast and Accurate Deep Neural Network Training - School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University
Venue: Room 3-414, SEIEE Building
Time: June 17 (Thursday), 14:30-15:30
Speaker: Yang You
About the speaker:
Yang You is a Presidential Young Professor in the Department of Computer Science at the National University of Singapore. He received his Ph.D. in Computer Science from the University of California, Berkeley. His research interests include high-performance computing, parallel algorithms, and machine learning. His current research focuses on the distributed optimization of large-scale deep learning training algorithms. He has set world records for ImageNet and BERT training speed, which were widely reported by dozens of media outlets including ScienceDaily, The Next Web, and i-programmer. The algorithms he designed are widely used by technology giants such as Google, Microsoft, Intel, and NVIDIA. Over the past three years, he has published more than ten first-author papers at major international conferences and journals such as NIPS, ICLR, Supercomputing, IPDPS, and ICS. As first author, he received the Best Paper Award of the International Parallel and Distributed Processing Symposium (IPDPS, 0.8% award rate) and the Best Paper Award of the International Conference on Parallel Processing (ICPP, 0.3% award rate). He was named an Outstanding Graduate of Tsinghua University and of Beijing, received the National Scholarship, and won the Siebel Scholarship, at the time the largest scholarship in Tsinghua University's Department of Computer Science. In 2017 he received the ACM-IEEE CS George Michael Memorial HPC Fellowship, the only award listed on the ACM website that is given to current Ph.D. students. He also received the Lotfi A. Zadeh Prize, awarded to outstanding Berkeley graduates. He was nominated by UC Berkeley for the ACM Doctoral Dissertation Award (2 nominees chosen from the 81 UC Berkeley EECS Ph.D. graduates of 2020). In 2021, he was named to the Forbes 30 Under 30 (Asia) list. For more information, please visit his research group's homepage (https://ai.comp.nus.edu.sg/).
Abstract:
In the last three years, supercomputers have become increasingly popular in leading AI companies. Amazon built a High Performance Computing (HPC) cloud. Google released its first 100-petaflop supercomputer (TPU Pod). Facebook made a submission to the Top500 supercomputer list. Why do they like supercomputers? Because the computation of deep learning is very expensive. For example, even with 16 TPUs, BERT training takes more than 3 days. On the other hand, supercomputers can process 10^17 floating-point operations per second. So why don't we just use supercomputers and finish the training of deep neural networks in a very short time? The reason is that deep learning does not have enough parallelism to make full use of the thousands or even millions of processors in a typical modern supercomputer. There are two directions for parallelizing deep learning: model parallelism and data parallelism. Model parallelism is very limited. For data parallelism, current optimizers cannot scale to thousands of processors because large-batch training tends to converge to sharp minima that generalize poorly. In this talk, I will introduce the LARS (Layer-wise Adaptive Rate Scaling) and LAMB (Layer-wise Adaptive Moments for Batch training) optimizers, which can find more parallelism for deep learning. They not only make deep learning systems scale well, but also help real-world applications achieve higher accuracy.
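For context on how LARS finds more parallelism: its key mechanism is a layer-wise "trust ratio" that scales each layer's step by the ratio of its weight norm to its gradient norm, so no single layer's update blows up when the global batch size and learning rate are increased. The sketch below illustrates that idea only; it is not the speaker's implementation, and the function name, hyperparameter defaults, and momentum form are illustrative assumptions.

```python
# A minimal sketch of LARS-style layer-wise learning rate scaling in PyTorch.
# Illustration only; names and defaults are assumptions, not a reference implementation.
import torch

def lars_step(params, grads, momentum_buffers,
              lr=0.1, momentum=0.9, weight_decay=1e-4,
              trust_coefficient=0.001, eps=1e-9):
    """Apply one LARS-style update to each layer's parameter tensor."""
    for w, g, buf in zip(params, grads, momentum_buffers):
        w_norm = torch.norm(w)
        g_norm = torch.norm(g)
        # Layer-wise trust ratio: ||w|| / (||g|| + weight_decay * ||w||),
        # scaled by a small trust coefficient, so the effective step size
        # of every layer stays proportional to the size of its weights.
        if w_norm > 0 and g_norm > 0:
            local_lr = trust_coefficient * w_norm / (g_norm + weight_decay * w_norm + eps)
        else:
            local_lr = 1.0
        # Momentum on the locally scaled gradient (plus weight decay),
        # then the usual SGD-style step with the global learning rate.
        update = local_lr * lr * (g + weight_decay * w)
        buf.mul_(momentum).add_(update)
        w.sub_(buf)
```

In practice this rule is packaged as an optimizer class and combined with learning-rate warmup so that very large batches train stably; roughly speaking, LAMB applies the same layer-wise trust ratio on top of Adam-style moment estimates instead of plain momentum.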
Since 2017, all the ImageNet training speed world records have been achieved using LARS. LARS was added to MLPerf, the industry benchmark for fast deep learning. Google used LAMB to reduce BERT training time from 3 days to 76 minutes and achieve new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks. The approaches introduced in this talk have been used in state-of-the-art distributed systems at Google, Intel, NVIDIA, Sony, Tencent, and others.