{"id":700210,"date":"2020-10-22T18:00:57","date_gmt":"2020-10-23T01:00:57","guid":{"rendered":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/?post_type=msr-research-item&#038;p=700210"},"modified":"2020-10-22T20:36:45","modified_gmt":"2020-10-23T03:36:45","slug":"rammer-enabling-holistic-deep-learning-compiler-optimizations-with-rtasks","status":"publish","type":"msr-research-item","link":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/publication\/rammer-enabling-holistic-deep-learning-compiler-optimizations-with-rtasks\/","title":{"rendered":"Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks"},"content":{"rendered":"<p>Performing Deep Neural Network (DNN) computation on hardware accelerators efficiently is challenging. Existing DNN frameworks and compilers often treat the DNN operators in a data flow graph (DFG) as opaque library functions and schedule them onto accelerators to be executed individually. They rely on another layer of scheduler, often implemented in hardware, to exploit the parallelism available in the operators. Such a two-layered approach incurs significant scheduling overhead and often cannot fully utilize the available hardware resources. In this paper, we propose RAMMER, a DNN compiler design that optimizes the execution of DNN workloads on massively parallel accelerators. RAMMER generates an efficient static spatio-temporal schedule for a DNN at compile time to minimize scheduling overhead. It maximizes hardware utilization by holistically exploiting parallelism through inter- and intra-operator co-scheduling. RAMMER achieves this by proposing several novel, hardware-neutral, and clean abstractions for the computation tasks and the hardware accelerators. These abstractions expose a much richer scheduling space to RAMMER, which employs several heuristics to explore this space and find efficient schedules. We implement RAMMER for multiple hardware backends such as NVIDIA GPUs, AMD GPUs, and Graphcore IPU. 
Experiments show RAMMER significantly outperforms state-of-the-art compilers such as TensorFlow XLA and TVM by up to 20.1\u00d7. It also outperforms TensorRT, a vendor optimized proprietary DNN inference library from NVIDIA, by up to 3.1\u00d7.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Performing Deep Neural Network (DNN) computation on hardware accelerators efficiently is challenging. Existing DNN frameworks and compilers often treat the DNN operators in a data flow graph (DFG) as opaque library functions and schedule them onto accelerators to be executed individually. They rely on another layer of scheduler, often implemented in hardware, to exploit the [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"14th USENIX Symposium on Operating Systems Design and Implementation 
(OSDI'20)","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2020-11-1","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[13547],"msr-publication-type":[193716],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-700210","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-systems-and-networking","msr-locale-en_us"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2020-11-1","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationur
l":"","msr_doi":"","msr_publication_uploader":[{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/www.usenix.org\/conference\/osdi20\/presentation\/ma","label_id":"243109","label":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[],"msr-author-ordering":[{"type":"text","value":"Lingxiao Ma","user_id":0,"rest_url":false},{"type":"text","value":"Zhiqiang Xie","user_id":0,"rest_url":false},{"type":"text","value":"Zhi Yang","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Jilong Xue","user_id":36987,"rest_url":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Jilong Xue"},{"type":"user_nicename","value":"Youshan Miao","user_id":35038,"rest_url":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Youshan Miao"},{"type":"user_nicename","value":"Wei Cui","user_id":38859,"rest_url":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Wei Cui"},{"type":"user_nicename","value":"Wenxiang Hu","user_id":39763,"rest_url":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Wenxiang Hu"},{"type":"user_nicename","value":"Fan Yang","user_id":31782,"rest_url":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Fan Yang"},{"type":"user_nicename","value":"Lintao Zhang","user_id":32693,"rest_url":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Lintao Zhang"},{"type":"user_nicename","value":"Lidong Zhou","user_id":32673,"rest_url":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Lidong 
Zhou"}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[510017,920469,922377],"msr_project":[555282],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":555282,"post_title":"Deep Learning Compiler and Optimizer","post_name":"deep-learning-compiler-and-optimizer","post_type":"msr-project","post_date":"2018-12-04 18:10:52","post_modified":"2023-07-10 03:41:13","post_status":"publish","permalink":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/project\/deep-learning-compiler-and-optimizer\/","post_excerpt":"Project Overview This project aims to build a deep learning compiler and optimizer infrastructure that can provide automatic scalability and efficiency optimization for distributed and local execution.\u00a0 Overall, this stack covers two types of general optimizations: fast distributed training over large-scale servers and efficient local execution on various hardware devices.\u00a0 Currently, our optimizations focus on many different parts of the system stack, such as fast distributed training over RDMA, automatic computation placement across devices, 
automatic&hellip;","_links":{"self":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/555282"}]}}]},"_links":{"self":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/700210","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":1,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/700210\/revisions"}],"predecessor-version":[{"id":700213,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/700210\/revisions\/700213"}],"wp:attachment":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/media?parent=700210"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=700210"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=700210"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=700210"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=700210"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=700210"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=700210"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option
?post=700210"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=700210"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=700210"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=700210"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=700210"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=700210"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}