TY - GEN
T1 - Taming the noise in reinforcement learning via soft updates
AU - Fox, Roy
AU - Pakman, Ari
AU - Tishby, Naftali
PY - 2016/1/1
Y1 - 2016/1/1
N2 - Model-free reinforcement learning algorithms, such as Q-learning, perform poorly in the early stages of learning in noisy environments, because much effort is spent unlearning biased estimates of the state-action value function. The bias results from selecting, among several noisy estimates, the apparent optimum, which may actually be suboptimal. We propose G-learning, a new off-policy learning algorithm that regularizes the value estimates by penalizing deterministic policies in the beginning of the learning process. We show that this method reduces the bias of the value-function estimation, leading to faster convergence to the optimal value and the optimal policy. Moreover, G-learning enables the natural incorporation of prior domain knowledge, when available. The stochastic nature of G-learning also makes it avoid some exploration costs, a property usually attributed only to on-policy algorithms. We illustrate these ideas in several examples, where G-learning results in significant improvements of the convergence rate and the cost of the learning process.
UR - http://www.scopus.com/inward/record.url?scp=85001976707&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85001976707
T3 - 32nd Conference on Uncertainty in Artificial Intelligence 2016, UAI 2016
SP - 202
EP - 211
BT - 32nd Conference on Uncertainty in Artificial Intelligence 2016, UAI 2016
A2 - Janzing, Dominik
A2 - Ihler, Alexander
PB - Association for Uncertainty in Artificial Intelligence (AUAI)
T2 - 32nd Conference on Uncertainty in Artificial Intelligence 2016, UAI 2016
Y2 - 25 June 2016 through 29 June 2016
ER -