TY - GEN
T1 - A Joint Imitation-Reinforcement Learning Framework for Reduced Baseline Regret
AU - Dey, Sheelabhadra
AU - Pendurkar, Sumedh
AU - Sharon, Guni
AU - Hanna, Josiah P.
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/1/1
Y1 - 2021/1/1
N2 - In various control task domains, existing controllers provide a baseline level of performance that, though possibly suboptimal, should be maintained. Reinforcement learning (RL) algorithms that rely on extensive exploration of the state and action space can be used to optimize a control policy. However, fully exploratory RL algorithms may decrease performance below the baseline level during training. In this paper, we address the problem of optimizing a control policy online while minimizing regret with respect to a baseline policy's performance. We present a joint imitation-reinforcement learning framework, denoted JIRL. The learning process in JIRL assumes the availability of a baseline policy and is designed with two objectives in mind: (a) leveraging demonstrations from the baseline policy during training to minimize regret with respect to that policy, and (b) eventually surpassing the baseline performance. JIRL addresses these objectives by initially learning to imitate the baseline policy and gradually shifting control from the baseline to an RL agent. Experimental results show that JIRL effectively accomplishes both objectives in several continuous action-space domains. The results demonstrate that JIRL is comparable to a state-of-the-art algorithm in final performance while incurring significantly lower baseline regret during training. Moreover, the results show a reduction factor of up to 21 in baseline regret over a trust-region-based approach that guarantees monotonic policy improvement.
AB - In various control task domains, existing controllers provide a baseline level of performance that, though possibly suboptimal, should be maintained. Reinforcement learning (RL) algorithms that rely on extensive exploration of the state and action space can be used to optimize a control policy. However, fully exploratory RL algorithms may decrease performance below the baseline level during training. In this paper, we address the problem of optimizing a control policy online while minimizing regret with respect to a baseline policy's performance. We present a joint imitation-reinforcement learning framework, denoted JIRL. The learning process in JIRL assumes the availability of a baseline policy and is designed with two objectives in mind: (a) leveraging demonstrations from the baseline policy during training to minimize regret with respect to that policy, and (b) eventually surpassing the baseline performance. JIRL addresses these objectives by initially learning to imitate the baseline policy and gradually shifting control from the baseline to an RL agent. Experimental results show that JIRL effectively accomplishes both objectives in several continuous action-space domains. The results demonstrate that JIRL is comparable to a state-of-the-art algorithm in final performance while incurring significantly lower baseline regret during training. Moreover, the results show a reduction factor of up to 21 in baseline regret over a trust-region-based approach that guarantees monotonic policy improvement.
UR - http://www.scopus.com/inward/record.url?scp=85124368758&partnerID=8YFLogxK
U2 - 10.1109/IROS51168.2021.9636294
DO - 10.1109/IROS51168.2021.9636294
M3 - Conference contribution
AN - SCOPUS:85124368758
T3 - IEEE International Conference on Intelligent Robots and Systems
SP - 3485
EP - 3491
BT - IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2021
PB - Institute of Electrical and Electronics Engineers
T2 - 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2021
Y2 - 27 September 2021 through 1 October 2021
ER -