{"id":920775,"date":"2023-02-18T14:57:06","date_gmt":"2023-02-18T22:57:06","guid":{"rendered":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/?post_type=msr-project&#038;p=920775"},"modified":"2023-02-18T14:57:07","modified_gmt":"2023-02-18T22:57:07","slug":"merak-an-analytical-performance-simulator-for-large-scale-distributed-training","status":"publish","type":"msr-project","link":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/project\/merak-an-analytical-performance-simulator-for-large-scale-distributed-training\/","title":{"rendered":"Merak: An Analytical Performance Simulator for Large-scale Distributed Training"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background bg-gray-200 has-background- card-background--full-bleed\">\n\t\t\t\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 \">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 w-lg-col-5\">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h2 id=\"\"><\/h2>\n\n\n\n<p>The growing computational demand for training deep neural networks (DNNs) makes it a standard practice to<br>adopt distributed training. Though existing training systems use multiple devices to achieve high degrees of<br>data parallelism, linear speedup of the performance of large-scale distributed training cannot be promised. A<br>major challenge faced by practitioners is that they cannot learn the precise efficiency of the task unless they<br>deploy the model and profile its performance in a cluster. However, deployment and profiling are tedious and<br>cost inefficient. We address this problem by introducing Merak, a DAG-based simulator, which vividly replays<br>the training process and accurately predicts the step time. We draw attention to the communication operations<br>in distributed training and report two critical problems in existing simulation work. (1) We propose a running<br>time formulation for all-reduce kernels, which features the cost of data propagation and reduce operation. (2)<br>We design and train an ML-based prediction model to capture the interference between computation kernels and<br>all-reduce kernels. We adopt the profile-and-predict approach to derive the step time of a large-scale distributed<br>task from the knowledge of a small-scale task. We implement Merak for PyTorch with NCCL communication<br>library and evaluate the performance on Nvidia Ampere A100 clusters. Extensive experiments on various DNN<br>models show that the average accuracy of Merak\u2019s prediction is up to 98.25%<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n<p><\/p>\n\n\n","protected":false},"excerpt":{"rendered":"<p>The growing computational demand for training deep neural networks (DNNs) makes it a standard practice toadopt distributed training. Though existing training systems use multiple devices to achieve high degrees ofdata parallelism, linear speedup of the performance of large-scale distributed training cannot be promised. 
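For the second contribution, the abstract describes an ML-based predictor of compute/communication interference but not its features or learning algorithm. The sketch below is a hypothetical stand-in under assumed choices: a gradient-boosted regressor over made-up features (compute-kernel FLOPs, all-reduce message size, overlap fraction), trained on synthetic data where real profiled samples would go.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Stand-in for profiled samples; in practice each row would come from
# overlapping a computation kernel with an all-reduce on real hardware.
# Features (all hypothetical): compute FLOPs, message bytes, and the
# fraction of the kernel's runtime overlapped with communication.
n = 512
X = np.column_stack([
    rng.uniform(1e9, 1e12, n),   # compute kernel FLOPs
    rng.uniform(1e6, 1e9, n),    # all-reduce message size (bytes)
    rng.uniform(0.0, 1.0, n),    # overlap fraction
])
# Synthetic slowdown factors (>= 1.0) standing in for measured labels.
y = 1.0 + 0.3 * X[:, 2] * (X[:, 1] / 1e9)

model = GradientBoostingRegressor().fit(X, y)

# When the simulator replays an overlapped region of the DAG, the
# predicted slowdown scales the kernel's standalone running time.
slowdown = model.predict([[5e11, 4e8, 0.6]])[0]
```

This also illustrates the profile-and-predict workflow the abstract describes: the regressor is fit on measurements from a small-scale task, then queried during simulation to extrapolate step times for a large-scale deployment that was never run.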