Bayesian policy reuse

Rosman, Benjamin S; Hawasly, M; Ramamoorthy, S

dc.contributor.author	Rosman, Benjamin S
dc.contributor.author	Hawasly, M
dc.contributor.author	Ramamoorthy, S
dc.date.accessioned	2017-05-16T10:19:45Z
dc.date.available	2017-05-16T10:19:45Z
dc.date.issued	2016-02
dc.identifier.citation	Rosman, B.S., Hawasly, M. and Ramamoorthy, S. 2016. Bayesian policy reuse. Machine Learning, vol. 104(1): 99-127. DOI: 10.1007/s10994-016-5547-y	en_US
dc.identifier.issn	0885-6125
dc.identifier.uri	DOI: 10.1007/s10994-016-5547-y
dc.identifier.uri	http://link.springer.com/article/10.1007/s10994-016-5547-y
dc.identifier.uri	http://hdl.handle.net/10204/9043
dc.description	© The Author(s) 2016. This is a pre-print version of the article. The definitive published version can be obtained from http://link.springer.com/article/10.1007/s10994-016-5547-y#enumeration	en_US
dc.description.abstract	A long-lived autonomous agent should be able to respond online to novel instances of tasks from a familiar domain. Acting online requires 'fast' responses, in terms of rapid convergence, especially when the task instance has a short duration such as in applications involving interactions with humans. These requirements can be problematic for many established methods for learning to act. In domains where the agent knows that the task instance is drawn from a family of related tasks, albeit without access to the label of any given instance, it can choose to act through a process of policy reuse from a library in contrast to policy learning. In policy reuse, the agent has prior experience from the class of tasks in the form of a library of policies that were learnt from sample task instances during an offline training phase. We formalise the problem of policy reuse and present an algorithm for efficiently responding to a novel task instance by reusing a policy from this library of existing policies, where the choice is based on observed 'signals' which correlate to policy performance. We achieve this by posing the problem as a Bayesian choice problem with a corresponding notion of an optimal response, but the computation of that response is in many cases intractable. Therefore, to reduce the computation cost of the posterior, we follow a Bayesian optimisation approach and define a set of policy selection functions, which balance exploration in the policy library against exploitation of previously tried policies, together with a model of expected performance of the policy library on their corresponding task instances. We validate our method in several simulated domains of interactive, short-duration episodic tasks, showing rapid convergence in unknown task variations.	en_US
dc.description.sponsorship	This research has benefitted from support by the UK Engineering and Physical Sciences Research Council (Grant Number EP/H012338/1) and the European Commission (TOMSY and SmartSociety grants).	en_US
dc.language.iso	en	en_US
dc.publisher	Springer Verlag	en_US
dc.rights	CC0 1.0 Universal	*
dc.rights.uri	http://creativecommons.org/publicdomain/zero/1.0/	*
dc.subject	Policy Reuse	en_US
dc.subject	Reinforcement Learning	en_US
dc.subject	Online bandits	en_US
dc.subject	Transfer learning	en_US
dc.subject	Bayesian Optimisation	en_US
dc.subject	Bayesian Decision Theory	en_US
dc.title	Bayesian policy reuse	en_US
dc.type	Article	en_US
dc.identifier.apacitation	Rosman, B. S., Hawasly, M., & Ramamoorthy, S. (2016). Bayesian policy reuse. http://hdl.handle.net/10204/9043	en_ZA
dc.identifier.chicagocitation	Rosman, Benjamin S, M Hawasly, and S Ramamoorthy. "Bayesian policy reuse." (2016). http://hdl.handle.net/10204/9043	en_ZA
dc.identifier.vancouvercitation	Rosman BS, Hawasly M, Ramamoorthy S. Bayesian policy reuse. 2016; http://hdl.handle.net/10204/9043.	en_ZA
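The abstract describes choosing a policy from a library by maintaining a Bayesian belief over which known task the new instance resembles, and updating that belief from observed signals. A minimal sketch of that loop follows. It is not the authors' implementation: it assumes the signal is the episode utility, that utilities are Gaussian around a known mean per (task type, policy) pair, and it uses a plain greedy expected-utility rule in place of the exploration-aware selection functions the paper defines. All names (`perf_mean`, `perf_std`, `select_policy`, `update_belief`, `run_episodes`) are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def select_policy(belief, perf_mean):
    """Greedy rule (an assumption): highest expected utility under the belief.

    perf_mean[t][p] is the assumed mean utility of policy p on task type t.
    """
    n_policies = len(perf_mean[0])
    expected = [
        sum(b * perf_mean[t][p] for t, b in enumerate(belief))
        for p in range(n_policies)
    ]
    return max(range(n_policies), key=lambda p: expected[p])


def update_belief(belief, policy, utility, perf_mean, perf_std):
    """Bayes rule: posterior over task types after observing one utility."""
    unnorm = [
        b * gaussian_pdf(utility, perf_mean[t][policy], perf_std)
        for t, b in enumerate(belief)
    ]
    total = sum(unnorm)
    if total == 0.0:
        # Every likelihood underflowed; keep the previous belief.
        return list(belief)
    return [u / total for u in unnorm]


def run_episodes(true_task, perf_mean, perf_std, n_episodes, rng):
    """Simulate reuse on a task whose type is hidden from the agent."""
    n_tasks = len(perf_mean)
    belief = [1.0 / n_tasks] * n_tasks  # uniform prior over task types
    history = []
    for _ in range(n_episodes):
        policy = select_policy(belief, perf_mean)
        # Simulated signal: noisy utility of the chosen policy on the true task.
        utility = rng.gauss(perf_mean[true_task][policy], perf_std)
        belief = update_belief(belief, policy, utility, perf_mean, perf_std)
        history.append(policy)
    return belief, history


if __name__ == "__main__":
    # Toy library: policy i was trained on task type i and does well only there.
    perf_mean = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    belief, history = run_episodes(2, perf_mean, 0.1, 10, random.Random(0))
    print("final belief:", belief)
    print("policies tried:", history)
```

In this toy setting, a poor utility from one policy rules out the task type that policy was trained on, so the belief concentrates after a few short episodes. A greedy rule can stall when the performance model is less informative, which is the situation the exploration-aware selection functions mentioned in the abstract are meant to address.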

Files in this item

Name: Rosman_2016.pdf

Size: 669.2Kb

Format: PDF

Description: Pre-print article

The following license files are associated with this item:

Creative Commons

This item appears in the following Collection(s)

Journal Articles

Except where otherwise noted, this item's license is described as CC0 1.0 Universal
