TY - GEN
T1 - Bounded-Error Policy Optimization for Mixed Discrete-Continuous MDPs via Constraint Generation in Nonlinear Programming
AU - Gimelfarb, Michael
AU - Taitler, Ayal
AU - Sanner, Scott
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025/1/1
Y1 - 2025/1/1
N2 - We propose the Constraint-Generation Policy Optimization (CGPO) framework to optimize policy parameters within compact and interpretable policy classes for mixed discrete-continuous Markov Decision Processes (DC-MDP). CGPO can not only provide bounded policy error guarantees over an infinite range of initial states for many DC-MDPs with expressive nonlinear dynamics, but it can also provably derive optimal policies in cases where it terminates with zero error. Furthermore, CGPO can generate worst-case state trajectories to diagnose policy deficiencies and provide counterfactual explanations of optimal actions. To achieve such results, CGPO proposes a bilevel mixed-integer nonlinear optimization framework for optimizing policies in defined expressivity classes (e.g. piecewise linear) and reduces it to an optimal constraint generation methodology that adversarially generates worst-case state trajectories. Furthermore, leveraging modern nonlinear optimizers, CGPO can obtain solutions with bounded optimality gap guarantees. We handle stochastic transitions through chance constraints, providing high-probability performance guarantees. We also present a roadmap for understanding the computational complexities of different expressivity classes of policy, reward, and transition dynamics. We experimentally demonstrate the applicability of CGPO across various domains, including inventory control, management of a water reservoir system, and physics control. In summary, CGPO provides structured, compact and explainable policies with bounded performance guarantees, enabling worst-case scenario generation and counterfactual policy diagnostics.
AB - We propose the Constraint-Generation Policy Optimization (CGPO) framework to optimize policy parameters within compact and interpretable policy classes for mixed discrete-continuous Markov Decision Processes (DC-MDP). CGPO can not only provide bounded policy error guarantees over an infinite range of initial states for many DC-MDPs with expressive nonlinear dynamics, but it can also provably derive optimal policies in cases where it terminates with zero error. Furthermore, CGPO can generate worst-case state trajectories to diagnose policy deficiencies and provide counterfactual explanations of optimal actions. To achieve such results, CGPO proposes a bilevel mixed-integer nonlinear optimization framework for optimizing policies in defined expressivity classes (e.g. piecewise linear) and reduces it to an optimal constraint generation methodology that adversarially generates worst-case state trajectories. Furthermore, leveraging modern nonlinear optimizers, CGPO can obtain solutions with bounded optimality gap guarantees. We handle stochastic transitions through chance constraints, providing high-probability performance guarantees. We also present a roadmap for understanding the computational complexities of different expressivity classes of policy, reward, and transition dynamics. We experimentally demonstrate the applicability of CGPO across various domains, including inventory control, management of a water reservoir system, and physics control. In summary, CGPO provides structured, compact and explainable policies with bounded performance guarantees, enabling worst-case scenario generation and counterfactual policy diagnostics.
KW - Chance Constraints
KW - Constraint Generation
KW - Control
KW - Mixed Discrete-Continuous MDP
KW - Piecewise-Linear Policy
KW - Planning
KW - Policy Optimization
KW - Sequential Decision Optimization
UR - https://www.scopus.com/pages/publications/105010193517
U2 - 10.1007/978-3-031-95973-8_15
DO - 10.1007/978-3-031-95973-8_15
M3 - Conference contribution
AN - SCOPUS:105010193517
SN - 9783031959721
T3 - Lecture Notes in Computer Science
SP - 239
EP - 255
BT - Integration of Constraint Programming, Artificial Intelligence, and Operations Research - 22nd International Conference, CPAIOR 2025, Proceedings
A2 - Tack, Guido
PB - Springer Science and Business Media Deutschland GmbH
T2 - 22nd International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, CPAIOR 2025
Y2 - 10 November 2025 through 13 November 2025
ER -