TY - GEN
T1 - Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning
T2 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
AU - Wu, Jingfeng
AU - Braverman, Vladimir
AU - Yang, Lin F.
N1 - Publisher Copyright:
© 2021 Neural information processing systems foundation. All rights reserved.
PY - 2021/1/1
Y1 - 2021/1/1
N2 - In this paper we consider multi-objective reinforcement learning, where the objectives are balanced using preferences. In practice, the preferences are often given in an adversarial manner, e.g., customers can be picky in many applications. We formalize this problem as an episodic learning problem on a Markov decision process, where the transitions are unknown and the reward function is the inner product of a preference vector with pre-specified multi-objective reward functions. We consider two settings. In the online setting, the agent receives an (adversarial) preference every episode and proposes policies to interact with the environment. We provide a model-based algorithm that achieves a nearly minimax optimal regret bound Õ(√(min{d, S} · H²SAK)), where d is the number of objectives, S is the number of states, A is the number of actions, H is the length of the horizon, and K is the number of episodes. Furthermore, we consider preference-free exploration, i.e., the agent first interacts with the environment without specifying any preference and is then able to accommodate an arbitrary preference vector up to ε error. Our proposed algorithm is provably efficient with a nearly optimal trajectory complexity Õ(min{d, S} · H³SA/ε²). This result partly resolves an open problem raised by Jin et al. [2020].
AB - In this paper we consider multi-objective reinforcement learning, where the objectives are balanced using preferences. In practice, the preferences are often given in an adversarial manner, e.g., customers can be picky in many applications. We formalize this problem as an episodic learning problem on a Markov decision process, where the transitions are unknown and the reward function is the inner product of a preference vector with pre-specified multi-objective reward functions. We consider two settings. In the online setting, the agent receives an (adversarial) preference every episode and proposes policies to interact with the environment. We provide a model-based algorithm that achieves a nearly minimax optimal regret bound Õ(√(min{d, S} · H²SAK)), where d is the number of objectives, S is the number of states, A is the number of actions, H is the length of the horizon, and K is the number of episodes. Furthermore, we consider preference-free exploration, i.e., the agent first interacts with the environment without specifying any preference and is then able to accommodate an arbitrary preference vector up to ε error. Our proposed algorithm is provably efficient with a nearly optimal trajectory complexity Õ(min{d, S} · H³SA/ε²). This result partly resolves an open problem raised by Jin et al. [2020].
UR - https://www.scopus.com/pages/publications/85128389252
M3 - Conference contribution
AN - SCOPUS:85128389252
T3 - Advances in Neural Information Processing Systems
SP - 13112
EP - 13124
BT - Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
A2 - Ranzato, Marc'Aurelio
A2 - Beygelzimer, Alina
A2 - Dauphin, Yann
A2 - Liang, Percy S.
A2 - Wortman Vaughan, Jenn
PB - Neural information processing systems foundation
Y2 - 6 December 2021 through 14 December 2021
ER -