TY - GEN
T1 - Restless Multi-Armed Bandits under Exogenous Global Markov Process
AU - Gafni, Tomer
AU - Yemini, Michal
AU - Cohen, Kobi
N1 - Publisher Copyright:
© 2022 IEEE
PY - 2022/1/1
Y1 - 2022/1/1
N2 - We consider an extension of the restless multi-armed bandit (RMAB) problem with unknown arm dynamics, in which an unknown exogenous global Markov process governs the reward distribution of each arm. Under each global state, the reward process of each arm evolves according to an unknown Markovian rule that is nonidentical among arms. At each time, a player chooses one of N arms to play and receives a random reward from a finite set of reward states. The arms are restless; that is, their local states evolve regardless of the player's actions. The objective is to find an arm-selection policy that minimizes the regret, defined as the reward loss with respect to a player who knows the dynamics of the problem and plays at each time t the arm that maximizes the expected immediate value. We develop the Learning under Exogenous Markov Process (LEMP) algorithm, show that it achieves a logarithmic regret order with time, and establish a finite-sample bound on the regret. Simulation results support the theoretical study and demonstrate the strong performance of LEMP.
AB - We consider an extension of the restless multi-armed bandit (RMAB) problem with unknown arm dynamics, in which an unknown exogenous global Markov process governs the reward distribution of each arm. Under each global state, the reward process of each arm evolves according to an unknown Markovian rule that is nonidentical among arms. At each time, a player chooses one of N arms to play and receives a random reward from a finite set of reward states. The arms are restless; that is, their local states evolve regardless of the player's actions. The objective is to find an arm-selection policy that minimizes the regret, defined as the reward loss with respect to a player who knows the dynamics of the problem and plays at each time t the arm that maximizes the expected immediate value. We develop the Learning under Exogenous Markov Process (LEMP) algorithm, show that it achieves a logarithmic regret order with time, and establish a finite-sample bound on the regret. Simulation results support the theoretical study and demonstrate the strong performance of LEMP.
KW - Markov processes
KW - Restless multi-armed bandit
KW - sequential decision making
KW - sequential learning
UR - http://www.scopus.com/inward/record.url?scp=85126695352&partnerID=8YFLogxK
U2 - 10.1109/ICASSP43922.2022.9746078
DO - 10.1109/ICASSP43922.2022.9746078
M3 - Conference contribution
AN - SCOPUS:85126695352
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 5218
EP - 5222
BT - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022 - Proceedings
PB - Institute of Electrical and Electronics Engineers
T2 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022
Y2 - 22 May 2022 through 27 May 2022
ER -