R-MAX - A general polynomial time algorithm for near-optimal reinforcement learning

    Research output: Contribution to journal › Conference article › peer-review

    Abstract

    R-MAX is a simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time. In R-MAX, the agent always maintains a complete, but possibly inaccurate model of its environment and acts based on the optimal policy derived from this model. The model is initialized in an optimistic fashion: all actions in all states return the maximal possible reward (hence the name). During execution, the model is updated based on the agent's observations. R-MAX improves upon several previous algorithms: (1) It is simpler and more general than Kearns and Singh's E^3 algorithm, covering zero-sum stochastic games. (2) It has a built-in mechanism for resolving the exploration vs. exploitation dilemma. (3) It formally justifies the "optimism under uncertainty" bias used in many RL algorithms. (4) It is much simpler and more general than Brafman and Tennenholtz's LSG algorithm for learning in single controller stochastic games. (5) It generalizes the algorithm by Monderer and Tennenholtz for learning in repeated games. (6) It is the only algorithm for near-optimal learning in repeated games known to be polynomial, providing a much simpler and more efficient alternative to previous algorithms by Banos and by Megiddo.
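
    The optimistic model-plus-planning loop described in the abstract can be sketched in tabular form as follows. This is an illustrative sketch, not the paper's exact construction: the parameter names (`m`, `gamma`), the visit-count threshold for marking a state-action pair "known", and the use of a fictitious max-reward absorbing state for unknown pairs are assumptions chosen to match the abstract's description.

    ```python
    import numpy as np

    class RMax:
        """Tabular R-MAX sketch (hypothetical parameter names).

        Unknown (s, a) pairs are modeled as jumping to a fictitious
        absorbing state that pays r_max forever; once a pair has been
        tried m times, the empirical model replaces the optimistic one.
        """

        def __init__(self, n_states, n_actions, r_max, m=3, gamma=0.95):
            self.nS, self.nA = n_states, n_actions
            self.r_max, self.m, self.gamma = r_max, m, gamma
            self.counts = np.zeros((n_states, n_actions), dtype=int)
            self.r_sum = np.zeros((n_states, n_actions))
            self.t_counts = np.zeros((n_states, n_actions, n_states), dtype=int)
            # Row nS is the fictitious absorbing state; everything starts
            # at the optimistic value r_max / (1 - gamma).
            self.q = np.full((n_states + 1, n_actions), r_max / (1 - gamma))

        def act(self, s):
            # Greedy action under the current (optimistic) model.
            return int(np.argmax(self.q[s]))

        def update(self, s, a, r, s2):
            if self.counts[s, a] >= self.m:
                return                      # (s, a) is already "known"
            self.counts[s, a] += 1
            self.r_sum[s, a] += r
            self.t_counts[s, a, s2] += 1
            if self.counts[s, a] == self.m:
                self._plan()                # re-solve when a pair becomes known

        def _plan(self, iters=500):
            # Value iteration on the mixed empirical/optimistic model.
            for _ in range(iters):
                v = self.q.max(axis=1)
                q = np.empty_like(self.q)
                q[self.nS, :] = self.r_max + self.gamma * v[self.nS]
                for s in range(self.nS):
                    for a in range(self.nA):
                        if self.counts[s, a] >= self.m:
                            p = self.t_counts[s, a] / self.m
                            q[s, a] = (self.r_sum[s, a] / self.m
                                       + self.gamma * p @ v[:self.nS])
                        else:
                            q[s, a] = self.r_max + self.gamma * v[self.nS]
                self.q = q
    ```

    On a toy two-state MDP (reward 1 for staying in state 1, 0 otherwise), feeding the agent `m` samples of every state-action pair makes the whole model known, after which the greedy policy switches to state 1 and stays there. Before any pair is known, every action looks maximally rewarding, which is what drives systematic exploration.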

    Original language: English
    Pages (from-to): 953-958
    Number of pages: 6
    Journal: IJCAI International Joint Conference on Artificial Intelligence
    State: Published - 1 Dec 2001
    Event: 17th International Joint Conference on Artificial Intelligence, IJCAI 2001 - Seattle, WA, United States
    Duration: 4 Aug 2001 - 10 Aug 2001

    ASJC Scopus subject areas

    • Artificial Intelligence
