{"id":852753,"date":"2022-06-16T06:06:19","date_gmt":"2022-06-16T13:06:19","guid":{"rendered":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/?post_type=msr-project&#038;p=852753"},"modified":"2022-06-16T08:22:13","modified_gmt":"2022-06-16T15:22:13","slug":"offlinerlalgo","status":"publish","type":"msr-project","link":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/project\/offlinerlalgo\/","title":{"rendered":"Offline Reinforcement Learning Algorithms"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background bg-gray-200 has-background- card-background--full-bleed\">\n\t\t\t\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 align-self-center\">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 w-lg-col-5\">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h1 id=\"offline-reinforcement-learning-algorithms\">Offline Reinforcement Learning Algorithms<\/h1>\n\n\n\n<p>On this page, we describe the algorithmic landscape of Offline RL and enumerate some of the algorithmic development efforts made by MSR in this space.<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n<p>In a&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/aka.ms\/offrl\">tutorial lecture<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/aka.ms\/offlinerl\">Offline RL<span class=\"sr-only\"> (opens in new 
tab)<\/span><\/a>, we analyze its algorithmic landscape and propose a classification into five categories:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><span style=\"text-decoration: underline\">Multiple-MDP algorithms<\/span>&nbsp;account for the multiplicity of plausible MDPs.<\/li><li><span style=\"text-decoration: underline\">Pessimistic algorithms<\/span>&nbsp;augment the value function with a penalty for taking actions with high uncertainty.<\/li><li><span style=\"text-decoration: underline\">Conservative algorithms<\/span>&nbsp;constrain the set of candidate policies to remain close to the behavioral policy.<\/li><li><span style=\"text-decoration: underline\">Early-stopping algorithms<\/span>&nbsp;limit the number of updates allowed when training the new policy.<\/li><li><span style=\"text-decoration: underline\">Reward-conditioned supervised learning<\/span>&nbsp;learns a model of the action distribution that yields a target return.<\/li><\/ul>\n\n\n\n<p>At MSR, we have undertaken several algorithmic development efforts:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Pessimistic algorithms: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/proceedings.neurips.cc\/paper\/2020\/hash\/0dc23b6a0e4abc39904388dd3ffadcd1-Abstract.html\">Provably Good Batch Off-Policy Reinforcement Learning Without Great Exploration<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> [NeurIPS\u201920]. 
Assumptions are necessary when training a policy from a fixed dataset without any environment interactions. Previous offline algorithms made an unreasonable \u201cconcentrability\u201d assumption and could fail catastrophically when it was violated. In this work, we showed that a suitably pessimistic algorithm does not require concentrability and fails gracefully as the quality of the training dataset degrades.<\/li><li>Conservative algorithms: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/aka.ms\/spibb\">Safe Policy Improvement with Baseline Bootstrapping<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> is a family of conservative algorithms that allows changes to the behavioral policy only when sufficient statistical evidence is provided. This thread of research led us to develop <a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/publication\/safe-policy-improvement-with-baseline-bootstrapping-2\/\">its theoretical foundations<\/a>, <a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/publication\/safe-policy-improvement-with-soft-baseline-bootstrapping\/\">algorithmic improvements<\/a>, additional guarantees with an <a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/publication\/safe-policy-improvement-with-an-estimated-baseline-policy\/\">estimated behavioral policy<\/a> (Markovian or <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2205.13950\">non-Markovian<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>), its <a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/publication\/multi-objective-spibb-seldonian-offline-policy-improvement-with-safety-constraints-in-finite-mdps\/\">extension to Multi-objective Offline RL<\/a>, and its application to deep Offline RL (<a 
href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/04\/RLDM___SPIBB_DQN-2.pdf\">with a heuristic uncertainty measure<\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2206.01085\">with a trained uncertainty measure<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>).<\/li><li>Other algorithms: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/proceedings.neurips.cc\/paper\/2021\/hash\/70d31b87bd021441e5e6bf23eb84a306-Abstract.html\">Heuristic-Guided Reinforcement Learning<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> [NeurIPS\u201921]. The typical output of an offline RL algorithm is a learned policy. Are there other useful outputs for downstream decision-making? We can additionally extract heuristics (e.g., as commonly used in A* search) that can, for instance, accelerate subsequent RL algorithms. A good heuristic can substantially reduce the effective horizon for decision-making and simplify the RL problem. 
In this work, we showed how to extract good heuristics (closely linked to the pessimism and conservatism principles in offline RL) and how to use them effectively (closely linked to Blackwell optimality and potential-based reward shaping).<\/li><\/ul>\n\n\n","protected":false},"excerpt":{"rendered":"<p>On this page, we describe the algorithmic landscape of Offline RL and enumerate some of the algorithmic development efforts made by MSR in this space. In a&nbsp;tutorial lecture (opens in new tab) on Offline RL (opens in new tab), we analyze its algorithmic landscape and propose a classification into five categories: Multiple-MDP algorithms&nbsp;account for the multiplicity [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-852753","msr-project","type-msr-project","status-publish","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[],"related-downloads":[],"related-videos":[],"related-groups":[896463,1148823],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[],"msr_research_lab":[437514,1148609],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/852753","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":3,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/852753\/revisions
"}],"predecessor-version":[{"id":852807,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/852753\/revisions\/852807"}],"wp:attachment":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/media?parent=852753"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=852753"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=852753"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=852753"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=852753"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}