Online Adaptive Optimal Control For Continuous-time Systems

Date

2010-03-03T23:30:47Z

Publisher

Electrical Engineering

Abstract

This work makes two major contributions.

* First, in the field of computational intelligence, it develops reinforcement learning controllers (i.e., approximate dynamic programming algorithms) for continuous-time systems, whereas reinforcement learning has previously been developed mainly for discrete-time systems.
* Second, in the field of control systems engineering, it develops online optimal adaptive controllers, whereas optimal control has previously been an offline design tool and online adaptive controllers have not been optimal.

The online algorithms presented herein are reinforcement learning schemes that provide online synthesis of optimal control for a class of nonlinear systems with an unknown drift term. The results are direct adaptive control algorithms that converge to the optimal control solution without using an explicit, a priori obtained model of the drift dynamics of the system. The online algorithms are implemented using two function approximation structures in an actor-critic interconnection. In this continuous-time formulation the result is a hybrid control structure consisting of a continuous-time controller and a supervisory adaptation structure that operates on data sampled from the plant and from the continuous-time performance dynamics. Such a control structure is unlike the standard forms of controllers previously seen in the literature.

The research begins with the development of an adaptive controller that solves the linear quadratic regulation (LQR) problem online. The online procedure finds the solution of the algebraic Riccati equation (ARE) underlying the LQR problem without requiring exact knowledge of the drift term of the controlled system, using only discrete measurements of the system's states and performance. From the perspective of computational intelligence, this algorithm is a new data-based continuous-time policy iteration (PI) approach to the solution of the optimization problem.

It then became interesting to develop an online method that provides control solutions for systems with nonlinear dynamics. In this case the theoretical development is more involved, since the equation underlying the optimal control problem is the Hamilton-Jacobi-Bellman (HJB) equation, a nonlinear partial differential equation which in general cannot be solved analytically and most often does not have a smooth solution. The new online data-based approach to adaptive optimal control is extended to provide a local approximate optimal control solution for nonlinear systems. Convergence of the online algorithm is guaranteed under the realistic assumption that the two function approximators involved in the online policy iteration procedure, namely the actor and the critic, do not provide perfect representations of the nonlinear control and cost functions. In this case, too, the algorithm reaches the solution without using any information on the form of the drift term in the system dynamics. At each step of the online iterative algorithm, a generalized HJB (GHJB) equation is solved using measured data and a reinforcement learning technique based on temporal differences. It then became interesting to see whether these GHJB equations can themselves be solved by iterative means. This evolved into a new formulation of the PI algorithm that allowed the development of the generalized policy iteration (GPI) algorithm for continuous-time systems.
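To make the data-based continuous-time policy iteration for the LQR problem concrete, the following is a minimal sketch in Python of an integral form of such an algorithm: the value of the current policy is fit from sampled state and integrated-cost measurements over short reinforcement intervals, and the policy is then improved using only the input matrix, never the drift matrix. The plant matrices A and B, the weights Q and R, the reinforcement interval T, and the random restarts used for excitation are illustrative assumptions, not values from this work; A appears only in the simulator that stands in for the measured plant.

```python
# Minimal sketch (not this work's exact implementation) of data-based
# policy iteration for the continuous-time LQR problem.  A, B, Q, R, the
# reinforcement interval T and the Euler step dt are illustrative assumptions.
# The drift matrix A is used only inside collect(), which plays the role of
# the measured plant; the learning updates themselves never reference it.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [-1.0, -2.0]])     # "unknown" drift dynamics
B = np.array([[0.0], [1.0]])                 # known input matrix
Q, R = np.eye(2), np.eye(1)
T, dt = 0.1, 1e-3                            # reinforcement interval, integration step
rng = np.random.default_rng(0)

def collect(x, K):
    """Run the closed loop u = -Kx over [0, T]; return x(T) and the integrated cost."""
    cost = 0.0
    for _ in range(int(T / dt)):
        u = -K @ x
        cost += (x @ Q @ x + u @ R @ u) * dt
        x = x + (A @ x + B @ u) * dt
    return x, cost

def quad_basis(x):
    """Independent entries of x x^T, the critic's quadratic basis."""
    return np.array([x[0]**2, 2.0 * x[0] * x[1], x[1]**2])

K = np.zeros((1, 2))                         # initial stabilizing policy (A itself is stable here)
for _ in range(10):                          # policy iteration loop
    Phi, d = [], []
    for _ in range(20):                      # gather data tuples for policy evaluation
        x0 = rng.standard_normal(2)          # fresh state: a crude form of persistent excitation
        xT, cost = collect(x0, K)
        Phi.append(quad_basis(x0) - quad_basis(xT))
        d.append(cost)
    p, *_ = np.linalg.lstsq(np.array(Phi), np.array(d), rcond=None)
    P = np.array([[p[0], p[1]], [p[1], p[2]]])   # critic: value of the current policy
    K = np.linalg.solve(R, B.T @ P)              # actor update: uses B only, not A

print("learned P:\n", P)
print("ARE solution:\n", solve_continuous_are(A, B, Q, R))  # offline sanity check only
```

The final comparison against SciPy's algebraic Riccati solver is only a sanity check; the learning loop itself never uses the drift matrix.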
The GPI represents a spectrum of algorithms with the exact policy iteration (PI) algorithm at one end and a variant of the value iteration (VI) algorithm at the other. In the middle of the spectrum lies the so-called optimistic policy iteration (OPI) algorithm for continuous-time systems. From this perspective the new continuous-time GPI provides a unified view of the approximate dynamic programming (ADP) algorithms that deal with continuous-time systems. The appropriate formulation of the value iteration algorithm in a continuous-time framework is then straightforward, and understanding the relation between the PI and VI algorithms becomes of utmost importance. The analysis is done here for linear systems with a quadratic cost index. The value iteration algorithm provides computational means for generating a sequence of positive definite matrices which converges to the unique positive definite solution of the ARE. While the PI algorithm is a Newton method, the VI algorithm is a quasi-Newton method. The VI algorithm does not require the solution of a Lyapunov equation at each step of the iteration, so the stringent requirement of an initial stabilizing control policy is removed.

The last result provides an online approach to the solution of zero-sum differential games with linear dynamics and a quadratic cost index. It is known that the solution of the zero-sum differential game can be obtained by means of iteration on Riccati equations. Here we exploit our first result to find the saddle point of the game in an online fashion. This work provides the equilibrium solution for the game, in an online fashion, when either the control actor or the disturbance actor is actively learning. At every stage of the game one player learns an optimal policy online to counteract the constant policy of its opponent. The learning procedure is based only on discrete-time measurements of the states of the system and of the value function of the game, without requiring exact parametric information about the drift term of the system.
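As a complement, the contrast drawn above between policy iteration as a Newton method and value iteration as a quasi-Newton recursion on the ARE can be illustrated with a short model-based sketch; it uses the plant matrices directly, so it illustrates the underlying recursions rather than the online data-based schemes themselves. The matrices A, B, Q, R, the iteration counts, and the step size h are illustrative assumptions.

```python
# Model-based illustration of the two ends of the GPI spectrum for the LQR case:
# policy iteration as a Newton method that solves a Lyapunov equation per step,
# and value iteration as a quasi-Newton recursion that only evaluates the
# Riccati residual.  A, B, Q, R, the iteration counts and h are illustrative.
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
Rinv = np.linalg.inv(R)

def riccati_residual(P):
    """Residual of the ARE: A'P + PA + Q - P B R^{-1} B' P."""
    return A.T @ P + P @ A + Q - P @ B @ Rinv @ B.T @ P

# Policy iteration (Newton): needs a stabilizing K0 and a Lyapunov solve per step.
K = np.zeros((1, 2))                       # stabilizing here because A itself is stable
for _ in range(8):
    Acl = A - B @ K
    P_pi = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))  # policy evaluation
    K = Rinv @ B.T @ P_pi                                        # policy improvement

# Value iteration (quasi-Newton): start from P = 0, no stabilizing policy needed.
P_vi, h = np.zeros((2, 2)), 0.01
for _ in range(20000):
    P_vi = P_vi + h * riccati_residual(P_vi)

P_are = solve_continuous_are(A, B, Q, R)
print("||P_pi - P_are|| =", np.linalg.norm(P_pi - P_are))
print("||P_vi - P_are|| =", np.linalg.norm(P_vi - P_are))
```

Because the value-iteration recursion here is just a forward sweep of the Riccati residual, it needs neither a stabilizing initial policy nor a Lyapunov solve at each step, which is exactly the distinction made above.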
