2005 IEEE/RSJ International Conference on Intelligent Robots and Systems
Multi-Agent Quadrotor Testbed Control Design: Integral Sliding Mode vs. Reinforcement Learning∗
Steven L. Waslander†, Gabriel M. Hoffmann†,‡
Ph.D. Candidates, Aeronautics and Astronautics, Stanford University
{stevenw, gabeh}@stanford.edu

Jung Soon Jang
Research Associate, Aeronautics and Astronautics, Stanford University
[email protected]

Claire J. Tomlin
Associate Professor, Aeronautics and Astronautics, Stanford University
[email protected]
Abstract—The Stanford Testbed of Autonomous Rotorcraft for Multi-Agent Control (STARMAC) is a multi-vehicle testbed currently comprised of two quadrotors, also called X4-flyers, with capacity for eight. This paper presents a comparison of control design techniques, specifically for outdoor altitude control, in and above ground effect, that accommodate the unique dynamics of the aircraft. Due to the complex airflow induced by the four interacting rotors, classical linear techniques failed to provide sufficient stability. Integral Sliding Mode and Reinforcement Learning control are presented as two design techniques for accommodating the nonlinear disturbances. Both methods result in greatly improved performance over classical control techniques.
I. INTRODUCTION
As first introduced by the authors in [1], the Stanford Testbed of Autonomous Rotorcraft for Multi-Agent Control (STARMAC) is an aerial platform intended to validate novel multi-vehicle control techniques and present real-world problems for further investigation. The base vehicle for STARMAC is a four-rotor aircraft with fixed-pitch blades, referred to as a quadrotor, or an X4-flyer. They are capable of 15 minute outdoor flights in a 100 m square area [1].
Fig. 1. One of the STARMAC quadrotors in action.
There have been numerous projects involving quadrotors to date, with the first known hover occurring in October 1922 [2]. Recent interest in the quadrotor concept has been sparked by commercial remote control versions, such as the DraganFlyer IV [3]. Many groups [4]–[7] have seen significant success in developing autonomous quadrotor vehicles. To date, however, STARMAC is the only operational multi-vehicle quadrotor platform capable of autonomous outdoor flight, without tethers or motion guides.

∗Research supported by ONR under the MURI contract N00014-02-1-0720 called "CoMotion: Computational Methods for Collaborative Motion", by the NASA Joint University Program under grant NAG 2-1564, and by NASA grant NCC 2-5536.
†These authors contributed equally to this work.
‡Funding provided by National Defense Science and Engineering Grant.
The first major milestone for STARMAC was autonomous hover control, with closed loop control of attitude, altitude, and position. Using inertial sensing, the attitude of the aircraft is simple to control by applying small variations in the relative speeds of the blades. In fact, standard integral LQR techniques were applied to provide reliable attitude stability and tracking for the vehicle. Position control was also achieved with an integral LQR, with careful design in order to ensure spectral separation of the successive loops. Unfortunately, altitude control proves less straightforward. There are many factors that specifically affect the altitude loop and do not lend themselves to classical control techniques. Foremost is the highly nonlinear and destabilizing effect of the four interacting rotor downwashes. In our experience, this effect becomes critical when motion is not damped by motion guides or tethers. Empirical observation during manual flight revealed a noticeable loss in thrust upon descent through the highly turbulent flow field. Similar aerodynamic phenomena have been studied extensively for helicopters [8], but not for the quadrotor, due to its relative obscurity and complexity. Other factors that introduce disturbances into the altitude control loop include blade flex, ground effect, and battery discharge dynamics. Although these effects are also present in generating attitude controlling moments, the differential nature of the control input eliminates much of the absolute thrust disturbance that complicates altitude control. Additional complications arise from the limited choice of low cost, high resolution altitude sensors. An ultrasonic ranging device [9] was used, which suffers from non-Gaussian noise: false echoes and dropouts. The resulting raw data stream includes spikes and echoes that are difficult to mitigate, and are most successfully handled by rejection of infeasible measurements prior to Kalman filtering.
In order to accommodate this combination of noise and disturbances, two distinct approaches are adopted. Integral Sliding Mode (ISM) control [10]–[12] takes the approach that the disturbances cannot be modeled, and instead designs a control law that is guaranteed to be robust to disturbances as long as they do not exceed a certain magnitude. Model-based reinforcement learning [13] creates a dynamic model based on recorded inputs and responses, without any knowledge of the underlying dynamics, and then seeks an optimal control law using an optimization technique based on the learned model. This paper presents an exposition of both methods and contrasts the techniques from both a design and an implementation point of view.
II. SYSTEM DESCRIPTION
STARMAC consists of a fleet of quadrotors and a ground station. The system communicates over a Bluetooth Class 1 network. The core of each aircraft is a set of microcontroller circuit boards designed and assembled at Stanford for this project. The microcontrollers run real-time control code, interface with sensors and the ground station, and supervise the system. The aircraft are capable of sensing position, attitude, and proximity to the ground. The differential GPS receiver is the Trimble Lassen LP, operating on the L1 band and providing 1 Hz updates. The IMU is the MicroStrain 3DM-G, a low cost, light weight unit that delivers attitude, attitude rate, and acceleration readings at 76 Hz. The distance from the ground is found using ultrasonic ranging at 12 Hz.

The ground station consists of a laptop computer, to interface with the aircraft, and a GPS receiver, to provide differential corrections. It also has a battery charger, and joysticks for control-augmented manual flight, when desired.
III. QUADROTOR DYNAMICS
The derivation of the nonlinear dynamics is performed in North-East-Down (NED) inertial and body-fixed coordinates. Let {e_N, e_E, e_D} denote the inertial axes, and {x_B, y_B, z_B} denote the body axes, as defined in Figure 2. Euler angles of the body axes are {φ, θ, ψ} with respect to the e_N, e_E, and e_D axes, respectively, and are referred to as roll, pitch, and yaw. Let r be defined as the position vector from the inertial origin to the vehicle center of gravity (CG), and let ω_B be defined as the angular velocity in the body frame. The current velocity direction is referred to as e_v in inertial coordinates.

Fig. 2. Free body diagram of a quadrotor aircraft.

The rotors, numbered 1-4, are mounted outboard on the x_B, y_B, −x_B, and −y_B axes, respectively, with position vectors r_i with respect to the CG. Each rotor produces an aerodynamic torque, Q_i, and a thrust, T_i, both parallel to the rotor's axis of rotation, and both used for vehicle control. Here, T_i ≈ u_i, where u_i is the voltage applied to the motors, as determined from a load cell test. In flight, T_i can vary greatly from this approximation. The torques, Q_i, are proportional to the rotor thrust, and are given by Q_i = k_r T_i. Rotors 1 and 3 rotate in the opposite direction from rotors 2 and 4, so that counteracting aerodynamic torques can be used independently for yaw control. Horizontal velocity results in a moment on the rotors, R_i, about −e_v, and a drag force, D_i, in the direction −e_v.

The body drag force is defined as D_B, the vehicle mass as m, the acceleration due to gravity as g, and the inertia matrix as I ∈ R^{3×3}. A free body diagram is depicted in Figure 2. The total force, F, and moment, M, can be summed as,

F = -D_B e_v + m g e_D + \sum_{i=1}^{4} \left( -T_i z_B - D_i e_v \right) \quad (1)

M = \sum_{i=1}^{4} \left( Q_i z_B - R_i e_v - D_i (r_i \times e_v) + T_i (r_i \times z_B) \right) \quad (2)

The full nonlinear dynamics can be described as,

m \ddot{r} = F, \qquad I \dot{\omega}_B + \omega_B \times I \omega_B = M \quad (3)

where the total angular momentum of the rotors is assumed to be near zero, because they are counter-rotating. Near hover conditions, the contributions of rolling moment and drag can be neglected in Equations (1) and (2). Define the total thrust as T = \sum_{i=1}^{4} T_i. The translational motion is then defined by,

m \ddot{r} = F = -R_\psi R_\theta R_\phi T z_B + m g e_D \quad (4)
where R_φ, R_θ, and R_ψ are the rotation matrices for roll, pitch, and yaw, respectively. Applying the small angle approximation to the rotation matrices,

m \begin{bmatrix} \ddot{r}_x \\ \ddot{r}_y \\ \ddot{r}_z \end{bmatrix} = \begin{bmatrix} 1 & -\psi & \theta \\ \psi & 1 & -\phi \\ -\theta & \phi & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ -T \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ mg \end{bmatrix} \quad (5)

Finally, assuming that the total thrust approximately counteracts gravity, T ≈ \bar{T} = mg, except in the e_D axis,

m \begin{bmatrix} \ddot{r}_x \\ \ddot{r}_y \\ \ddot{r}_z \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ mg \end{bmatrix} + \begin{bmatrix} 0 & -\bar{T} & 0 \\ \bar{T} & 0 & 0 \\ 0 & 0 & -1 \end{bmatrix} \begin{bmatrix} \phi \\ \theta \\ T \end{bmatrix} \quad (6)

For small angular velocities, the Euler angle accelerations are determined from Equation (3) by dropping the second order term, \omega_B \times I \omega_B, and expanding the thrust into its four constituents. The angular equations become,

\begin{bmatrix} I_x \ddot{\phi} \\ I_y \ddot{\theta} \\ I_z \ddot{\psi} \end{bmatrix} = \begin{bmatrix} 0 & l & 0 & -l \\ l & 0 & -l & 0 \\ K_r & -K_r & K_r & -K_r \end{bmatrix} \begin{bmatrix} T_1 \\ T_2 \\ T_3 \\ T_4 \end{bmatrix} \quad (7)
where the moment arm length l = ||r_i × z_B|| is identical for all rotors due to symmetry. The resulting linear models can now be used for control design.
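To make the linearized attitude model concrete, the following sketch shows how Equation (7) maps rotor thrusts to Euler angle accelerations. It is an illustrative Python fragment, not STARMAC flight code; the inertias, arm length, and K_r are invented placeholder values.

```python
import numpy as np

# Placeholder physical parameters (illustrative values, not STARMAC's).
I_x, I_y, I_z = 0.02, 0.02, 0.04   # inertias [kg m^2]
l = 0.3                             # moment arm [m]
K_r = 0.1                           # thrust-to-torque coefficient

# Mixer matrix from Equation (7): maps rotor thrusts to body torques.
MIX = np.array([
    [0.0,  l,    0.0, -l  ],   # roll torque,  I_x * phi_ddot
    [l,    0.0, -l,    0.0],   # pitch torque, I_y * theta_ddot
    [K_r, -K_r,  K_r, -K_r],   # yaw torque,   I_z * psi_ddot
])

def euler_accelerations(thrusts):
    """Angular accelerations [phi_ddot, theta_ddot, psi_ddot] per Eq. (7)."""
    torques = MIX @ thrusts
    return torques / np.array([I_x, I_y, I_z])

# Example: an imbalance between rotors 2 and 4 produces a pure roll response.
print(euler_accelerations(np.array([2.0, 1.9, 2.0, 2.1])))
```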
IV. ESTIMATION AND CONTROL DESIGN
Applying the concept of spectral separation, inner loop control of attitude and altitude is performed by commanding motor voltages, and outer loop position control is performed by commanding attitude requests for the inner loop. Accurate attitude control of the plant in Equation (7) is achieved with an integral LQR controller design to account for thrust biases. Position estimation is performed using a navigation filter that combines horizontal position and velocity information from GPS, vertical position and estimated velocity information from the ultrasonic ranger, and acceleration and angular rates from the IMU in a Kalman filter that includes bias estimates. Integral LQR techniques are applied to the horizontal components of the linear position plant described in Equation (6). The resulting hover performance is shown in Figure 6.
As described above, altitude control suffers exceedingly from unmodeled dynamics. In fact, manual command of the throttle for altitude control remains a challenge for the authors to this day. Additional complications arise from the ultrasonic ranging sensor, which has frequent erroneous readings, as seen in Figure 3. To alleviate the effect of this noise, rejection of infeasible measurements is used to remove much of the non-Gaussian noise component. This is followed by altitude and altitude rate estimation by Kalman filtering, which adds lag to the estimate. This section proceeds with a derivation of two control techniques that can be used to overcome the unmodeled dynamics and the remaining noise.
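To illustrate the measurement handling just described, here is a minimal sketch of infeasible-measurement rejection followed by a constant-velocity Kalman filter. It is a hypothetical Python example; the gating threshold and noise covariances are assumed values, not those used on STARMAC.

```python
import numpy as np

DT = 1.0 / 12.0        # ultrasonic update period [s] (12 Hz, per Section II)
MAX_JUMP = 0.5         # assumed max plausible altitude change per sample [m]

class AltitudeFilter:
    """Constant-velocity Kalman filter with infeasible-measurement rejection."""

    def __init__(self):
        self.x = np.zeros(2)                  # state: [altitude, altitude rate]
        self.P = np.eye(2)
        self.F = np.array([[1.0, DT], [0.0, 1.0]])
        self.Q = np.diag([1e-4, 1e-2])        # process noise (illustrative)
        self.H = np.array([[1.0, 0.0]])
        self.R = np.array([[0.04]])           # ranger noise (illustrative)
        self.initialized = False

    def step(self, z):
        # Predict.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Reject spikes and false echoes: measurements far from the prediction
        # are deemed infeasible and skipped, leaving the propagated estimate.
        if abs(z - self.x[0]) < MAX_JUMP or not self.initialized:
            y = z - self.H @ self.x
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T / S
            self.x = self.x + (K * y).ravel()
            self.P = (np.eye(2) - K @ self.H) @ self.P
            self.initialized = True
        return self.x
```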
Fig. 3. Characteristic unprocessed ultrasonic ranging data, displaying spikes, false echoes and dropouts. Powered flight commences at 185 seconds.

A. Integral Sliding Mode Control

A linear approximation to the altitude error dynamics of a quadrotor aircraft in hover is given by,

\dot{x}_1 = x_2, \qquad \dot{x}_2 = u + \xi(g, x) \quad (8)

where {x_1, x_2} = {(r_{z,des} − r_z), (ṙ_{z,des} − ṙ_z)} are the altitude error states, u = \sum_{i=1}^{4} u_i is the control input, and ξ(·) is a bounded model of disturbances and dynamic uncertainty. It is assumed that ξ(·) satisfies ||ξ|| ≤ γ, where γ is the upper bound on the norm of ξ(·).

In early attempts to stabilize this system, it was observed that LQR control was not able to address the instability and performance degradation due to ξ(g, x). Sliding Mode Control (SMC) was adopted to provide a systematic approach to the problem of maintaining stability and consistent performance in the face of modeling imprecision and disturbances. However, until the system dynamics reach the sliding manifold, such nice properties of SMC are not assured. In order to provide robust control throughout the flight envelope, the Integral Sliding Mode (ISM) technique is applied.

The ISM control is designed in two parts. First, a standard successive loop closure is applied to the linear plant. Second, integral sliding mode techniques are applied to guarantee disturbance rejection. Let

u = u_p + u_d, \qquad u_p = -K_p x_1 - K_d x_2 \quad (9)

where K_p and K_d are proportional and derivative loop gains that stabilize the linear dynamics without disturbances. For disturbance rejection, a sliding surface, s, is designed,

s = s_0(x_1, x_2) + z, \qquad s_0 = \alpha (x_2 + k x_1) \quad (10)

such that state trajectories are forced towards the manifold s = 0. Here, s_0 is a conventional sliding mode design, z is an additional term that enables integral control to be included, and α, k ∈ R are positive constants. Based on the following Lyapunov function candidate, V = s²/2, the condition \dot{V} < 0 guarantees convergence to the sliding manifold,

\dot{V} = s \dot{s} = s \left( \alpha (\dot{x}_2 + k \dot{x}_1) + \dot{z} \right) \quad (11)

Choosing \dot{z} = -\alpha (u_p + k x_2), with z(0) = -s_0(0), and substituting the dynamics of Equation (8), \dot{V} can be guaranteed to satisfy,

s \left( u_d + \xi(g, x) \right) < 0 \quad (12)

Since the disturbances, ξ(g, x), are bounded by γ, define u_d to be u_d = −λs with λ ∈ R. Equation (11) becomes,

\dot{V} = s \alpha \left( -\lambda s + \xi(g, x) \right) \le \alpha \left( -\lambda |s|^2 + \gamma |s| \right) < 0 \quad (13)

As a result, for u_p and u_d as above, the sliding mode condition holds when,

\lambda |s| > \gamma \quad (14)

With the input derived above, the dynamics are guaranteed to evolve such that s decays to within the boundary layer, |s| ≤ γ/λ, of the sliding manifold. Additionally, the system does not suffer from input chatter as conventional sliding mode controllers do, as the control law does not include a switching function along the sliding mode.
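A discrete-time sketch of the ISM law above follows, in Python. The gains and time step are illustrative placeholders, not the flight values; the structure follows Equations (9), (10), and (13), with the integral state z initialized so that s(0) = 0 and propagated as ż = −α(u_p + k x_2), which reduces the Lyapunov condition to Equation (12).

```python
class IntegralSlidingModeAltitude:
    """Sketch of the ISM altitude law of Equations (8)-(14); all gains are
    illustrative placeholders, not the STARMAC flight values."""

    def __init__(self, Kp=2.0, Kd=1.5, alpha=1.0, k=1.0, lam=3.0, dt=0.02):
        self.Kp, self.Kd = Kp, Kd          # Eq. (9) loop-closure gains
        self.alpha, self.k = alpha, k      # Eq. (10) sliding surface constants
        self.lam, self.dt = lam, dt        # lam sized so lam*|s| > gamma (Eq. 14)
        self.z = None                      # integral term of Eq. (10)

    def command(self, x1, x2):
        """x1 = r_z,des - r_z, x2 = rdot_z,des - rdot_z, per Equation (8)."""
        u_p = -self.Kp * x1 - self.Kd * x2        # Eq. (9)
        s0 = self.alpha * (x2 + self.k * x1)      # Eq. (10)
        if self.z is None:
            self.z = -s0                          # start on the manifold: s(0) = 0
        s = s0 + self.z
        u_d = -self.lam * s                       # continuous law, so no chatter
        # Propagate z_dot = -alpha*(u_p + k*x2) so that s_dot = alpha*(u_d + xi).
        self.z += -self.alpha * (u_p + self.k * x2) * self.dt
        return u_p + u_d
```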
V. REINFORCEMENT LEARNING CONTROL
An alternate approach is to implement a reinforcement learning controller. Much work has been done on continuous state-action space reinforcement learning methods [13], [14]. For this work, a nonlinear, nonparametric model of the system is first constructed using flight data, approximating the system as a stochastic Markov process [15], [16]. Then a model-based reinforcement learning algorithm uses the model in policy iteration to search for an optimal control policy that can be implemented on the embedded microprocessors.

In order to model the aircraft dynamics as a stochastic Markov process, a Locally Weighted Linear Regression (LWLR) approach is used to map the current state, S(t) ∈ R^{n_s}, and input, u(t) ∈ R^{n_u}, onto the subsequent state estimate, Ŝ(t+1). In this application, S = [r_z ṙ_z r̈_z V], where V is the battery level. In the altitude loop, the input, u ∈ R, is the summed motor power, u = \sum_{i=1}^{4} u_i. The subsequent state mapping is the traditional LWLR estimate using the current state and input, with a random vector, v ∈ R^{n_s}, representing unmodeled noise. The value for v is drawn from the distribution of output error as determined by using a maximum likelihood estimate [16] of the Gaussian noise in the LWLR estimate. Although the true distribution is not perfectly Gaussian, this model is found to be adequate. The LWLR method [17] is well suited to this problem, as it fits a non-parametric curve to the local structure of the data. The scheme extends least squares by assigning weights to each training data point according to its proximity to the input value for which the output is to be computed. The technique requires a sizable set of training data in order to reflect the full dynamics of the system; the data are captured from flights flown under both automatic and manually controlled thrust, with the attitude states under automatic control.
For m training data points, the input training samples are stored in X ∈ R^{m×(n_s+n_u+1)}, and the outputs corresponding to those inputs are stored in Y ∈ R^{m×n_s}. These matrices are defined as,

X = \begin{bmatrix} 1 & S(t_1)^T & u(t_1)^T \\ \vdots & \vdots & \vdots \\ 1 & S(t_m)^T & u(t_m)^T \end{bmatrix}, \qquad Y = \begin{bmatrix} S(t_1+1)^T \\ \vdots \\ S(t_m+1)^T \end{bmatrix} \quad (15)
The column of ones in X enables the inclusion of a constant offset in the solution, as used in linear regression.
The diagonal weighting matrix W ∈ R^{m×m}, which acts on X, has one diagonal entry for each training data point. That entry gives more weight to training data points that are close to the S(t) and u(t) for which Ŝ(t+1) is to be computed. The distance measure used in this work is

W_{i,i} = \exp\left( -\frac{||x^{(i)} - x||^2}{2\tau^2} \right) \quad (16)

where x^{(i)} is the i-th row of X, x is the vector [1 S(t)^T u(t)^T], and the fit parameter τ is used to adjust the range of influence of training points. The value for τ can be tuned by cross validation to prevent over- or under-fitting the data. Note that it may be necessary to scale the columns before taking the Euclidean norm to prevent undue influence of one state on the W matrix.
The subsequent state estimate is computed by summing the LWLR estimate with v,

\hat{S}(t+1) = \left( (X^T W X)^{-1} X^T W Y \right)^T x + v \quad (17)

Because W is a continuous function of x and X, as x is varied the resulting estimate is a continuous non-parametric curve capturing the local structure of the data. The matrix computations, in code, exploit the large diagonal matrix W; as each W_{i,i} is computed, it is multiplied by row x^{(i)} and stored in WX.
The matrix being inverted is poorly conditioned, because weakly related data points have little influence, so their contribution cannot be accurately numerically inverted. To compute the inversion more accurately, one can perform a singular value decomposition, X^T W X = U Σ V^T. Numerical error during inversion can then be avoided by using only the n singular values σ_i with values of σ_i > σ_max/C, where C is chosen by cross validation. In this work, C ≈ 10 was found to minimize numerical error, and the condition was typically satisfied by n = 1. The inverse can be directly computed using the n largest singular values in the diagonal matrix Σ_n ∈ R^{n×n} and the corresponding singular vectors, U_n and V_n. Thus, the stochastic Markov model becomes,

\hat{S}(t+1) = \left( V_n \Sigma_n^{-1} U_n^T X^T W Y \right)^T x + v \quad (18)
Next, model-based reinforcement learning is implemented, incorporating the stochastic Markov model, to design a controller. A quadratic reward function is used,

R(S, S_{ref}) = -c_1 (r_z - r_{z,ref})^2 - c_2 \dot{r}_z^2 \quad (19)

where R : R^{2n_s} → R, c_1 > 0 and c_2 > 0 are constants giving reward for accurate tracking and good damping, respectively, and S_ref = [r_{z,ref} ṙ_{z,ref} r̈_{z,ref} V_ref] is the reference state desired for the system.
The control policy maps the observed state S onto the input command u. In this work, the state space has the constraint r_z ≥ 0, and the input command has the constraint 0 ≤ u ≤ u_max. The control policy is chosen to be,

\pi(S, w) = w_1 + w_2 (r_z - r_{z,ref}) + w_3 \dot{r}_z + w_4 \ddot{r}_z \quad (20)

where w ∈ R^{n_c} is the vector of policy coefficients w_1, ..., w_{n_c}. Linear functions were sufficient to achieve good stability and performance. Additional terms, such as battery level and the integral of altitude error, could be included to make the policy more resilient to differing flight conditions.
Policy iteration is performed as explained in Algorithm 1. The algorithm aims to find the value of w that yields the greatest total reward, R_total, as determined by simulating the system over a finite horizon from a set of random initial conditions and summing the values of R(S, S_ref) at each state encountered.
Algorithm 1 Model-Based Reinforcement Learning
1: Generate set S_0 of random initial states
2: Generate set T of random reference trajectories
3: Initialize w to reasonable values
4: R_best ← −∞, w_best ← w
5: repeat
6:   R_total ← 0
7:   for s_0 ∈ S_0, t ∈ T do
8:     S(0) ← s_0
9:     for t = 0 to t_max − 1 do
10:      u(t) ← π(S(t), w)
11:      S(t+1) ← LWLR(S(t), u(t)) + v
12:      R_total ← R_total + R(S(t+1))
13:    end for
14:  end for
15:  if R_total > R_best then
16:    R_best ← R_total, w_best ← w
17:  end if
18:  Add Gaussian random vector to w_best, store as w
19: until w_best converges
In policy iteration, a fixed set of random initial conditions and reference trajectories is used to simulate flights at each iteration, with a given policy parameterized by w. It is necessary to use the same random set at each iteration in order for convergence to be possible [15]. After each iteration, the new value of w is stored as w_best if it outperforms the previous best policy, as determined by comparing R_total to R_best, the previous best reward encountered. Then, a Gaussian random vector is added to w_best. The result is stored as w, and the simulation is performed again. This is iterated until the value of w_best remains fixed for an appropriate number of iterations, as determined by the particular application. The simulation results must be examined to predict the likely performance of the resulting control policy.

By using a Gaussian update rule for the policy weights, w, it is possible to escape local maxima of R_total. The highest probability steps are small, and result in refinement of a solution near a local maximum of R_total. However, if the algorithm is not at the global maximum, and is allowed to continue, there exists a finite probability that a sufficiently large Gaussian step will be performed such that the algorithm can keep ascending.
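As a minimal illustration, Algorithm 1 might be coded as follows. This is a hypothetical Python sketch: the reward constants, horizon, input limit, noise scale, and perturbation size are placeholder values, and `model` stands for the learned LWLR step of Equation (18) (for example, a closure over `lwlr_predict` from the earlier sketch).

```python
import numpy as np
from itertools import product

C1, C2 = 1.0, 0.1       # Eq. (19) reward weights (placeholders)
U_MAX = 1.0             # input constraint 0 <= u <= u_max (placeholder)
T_MAX = 100             # rollout horizon (placeholder)
N_S = 4                 # state dimension: [r_z, rdot_z, rddot_z, V]
STEP_SIGMA = 0.05       # std dev of the Gaussian policy perturbation

def reward(S, S_ref):
    # Eq. (19): penalize altitude tracking error and vertical rate.
    return -C1 * (S[0] - S_ref[0])**2 - C2 * S[1]**2

def policy(S, S_ref, w):
    # Eq. (20): linear policy, clipped to the input constraint.
    u = w[0] + w[1] * (S[0] - S_ref[0]) + w[2] * S[1] + w[3] * S[2]
    return np.clip(u, 0.0, U_MAX)

def policy_iteration(model, initial_states, references, w0, iters=1000):
    """Random-search policy iteration per Algorithm 1. The initial states,
    references, and noise draws v are frozen across iterations so that
    total rewards are comparable between candidate policies [15]."""
    rng = np.random.default_rng(0)
    pairs = list(product(initial_states, references))
    v = rng.normal(scale=0.01, size=(len(pairs), T_MAX, N_S))  # frozen noise
    w, w_best, R_best = w0.copy(), w0.copy(), -np.inf
    for _ in range(iters):
        R_total = 0.0
        for j, (s0, S_ref) in enumerate(pairs):
            S = s0.copy()
            for t in range(T_MAX):
                u = policy(S, S_ref, w)            # Algorithm 1, line 10
                S = model(S, u) + v[j, t]          # line 11: LWLR step plus v
                R_total += reward(S, S_ref)        # line 12
        if R_total > R_best:
            R_best, w_best = R_total, w.copy()     # lines 15-16
        w = w_best + rng.normal(scale=STEP_SIGMA, size=w_best.shape)  # line 18
    return w_best
```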
VI. FLIGHT TEST RESULTS
A. Integral Sliding Mode
The results of an outdoor flight test with ISM control can be seen in Figure 4. The response time is on the order of 1-2 seconds, with a 5 second settling time, and little to no steady state offset. An oscillatory character can also be seen in the response, which is most likely triggered by the nonlinear aerodynamic effects and sensor data spikes described earlier.
Compared to the linear control design techniques implemented on the aircraft, the ISM control proves a significant enhancement. By explicitly incorporating bounds on the unknown disturbance forces in the derivation of the control law, it is possible to maintain stable altitude on a system that has evaded standard approaches.

B. Reinforcement Learning Control
One of the most exciting aspects of RL control design is its ease of implementation. The policy iteration algorithm arrived at the implemented control law after only 3 hours on a Pentium IV computer. Figure 5 presents flight test results for the controller. The high fidelity model of the system, used for RL control design, provides a useful tool for comparison of the RL control law with other controllers. In fact, in simulation with linear controllers that proved unstable on the quadrotor, flight paths with growing oscillations were predicted that closely matched real flight data.
The locally weighted linear regression model revealed many relations that were not reflected in the linear model, but that reflect the physics of the system well. For instance, with all other states held fixed, an upward velocity results in more acceleration at the subsequent time step for a given throttle level, and a downward velocity yields the opposite effect. This is essentially negative damping. The model also shows a strong ground effect: with all other states held fixed, the closer the vehicle is to the ground, the more acceleration it has at the subsequent time step for a given throttle level.
Fig. 5. Reinforcement learning controller response to manually applied step input, in outdoor flight test. Spikes in state estimates are from sensor noise passing through the Kalman filter.
The reinforcement learning control law is susceptible to system disturbances for which it is not trained. In particular, varying battery levels and blade degradation may cause a reduction in stability or a steady state offset. Addition of an integral error term to the control policy may prove an effective means of mitigating steady state disturbances, as was seen in the ISM control law.

Comparison of the step response for ISM and RL control reveals both stable performance and similar response times, although the transient dynamics of the ISM control are more pronounced. RL does, however, have the advantage that it incorporates accelerometer measurements into its control, and as such uses a more direct measurement of the disturbances imposed on the aircraft.

C. Autonomous Hover

Applying ISM altitude control and integral LQR position control techniques, flight tests were performed to achieve the goal of autonomous hover. Position response was maintained within a 3 m circle for the duration of a two minute flight (see Figure 6), which is well within the expected error bound for the L1 band differential GPS used.

Fig. 6. Autonomous hover flight recorded position, with 3 m error circle.

VII. CONCLUSION
This paper summarizes the development of an autonomous quadrotor capable of extended outdoor trajectory tracking control. This is the first demonstration of such capabilities on a quadrotor known to the authors, and represents a critical step in developing a novel, easy to use, multi-vehicle testbed for validation of multi-agent control strategies for autonomous aerial robots. Specifically, two design approaches were presented for the altitude control loop, which proved a challenging hurdle. Both techniques resulted in stable controllers with similar response times, and were a significant improvement over linear controllers that failed to stabilize the system adequately.
Acknowledgments
The authors would like to thank Dev Gorur Rajnarayan and David Dostal for their many contributions to STARMAC development and testing, as well as Prof. Andrew Ng of Stanford University for his advice and guidance in developing the Reinforcement Learning control.
REFERENCES

[1] Hoffmann, G., Rajnarayan, D. G., Waslander, S. L., Dostal, D., Jang, J. S., and Tomlin, C. J., "The Stanford Testbed of Autonomous Rotorcraft for Multi-Agent Control (STARMAC)," 23rd Digital Avionics Systems Conference, Salt Lake City, UT, November 2004.
[2] Lambermont, P., Helicopters and Autogyros of the World, 1958.
[3] DraganFly-Innovations, "www.rctoys.com," 2003.
[4] Pounds, P., Mahony, R., Hynes, P., and Roberts, J., "Design of a Four-Rotor Aerial Robot," Australian Conference on Robotics and Automation, Auckland, November 2002.
[5] Altug, E., Ostrowski, J. P., and Taylor, C. J., "Quadrotor Control Using Dual Camera Visual Feedback," ICRA, Taipei, September 2003.
[6] Bouabdallah, S., Murrieri, P., and Siegwart, R., "Design and Control of an Indoor Micro Quadrotor," ICRA, New Orleans, April 2004.
[7] Castillo, P., Dzul, A., and Lozano, R., "Real-Time Stabilization and Tracking of a Four-Rotor Mini Rotorcraft," IEEE Transactions on Control Systems Technology, Vol. 12, No. 4, 2004.
[8] Bramwell, A., Done, G., and Blamford, D., Bramwell's Helicopter Dynamics, Butterworth-Heinemann, 2nd ed., 2001.
[9] Devantech, SRF08 Ultrasonic Ranger, http://www.robot-electronics.co.uk/htm/srf08tech.shtml.
[10] Utkin, V., Güldner, J., and Shi, J., Sliding Mode Control in Electromechanical Systems, Taylor & Francis, 1999.
[11] Khalil, H. K., Nonlinear Systems, Prentice Hall, 1996.
[12] Jang, J. S., Nonlinear Control Using Discrete-Time Dynamic Inversion Under Input Saturation: Theory and Experiment on the Stanford DragonFly UAVs, Ph.D. thesis, Stanford University, 2004.
[13] Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[14] Doya, K., Samejima, K., Katagiri, K., and Kawato, M., "Multiple Model-based Reinforcement Learning," Tech. rep. KDB-TR-08, Kawato Dynamic Brain Project, Japan Science and Technology Corporation, June 2000.
[15] Ng, A. Y. and Jordan, M. I., "PEGASUS: A policy search method for large MDPs and POMDPs," Uncertainty in Artificial Intelligence, 2000.
[16] Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and Liang, E., "Autonomous inverted helicopter flight via reinforcement learning," International Symposium on Experimental Robotics, 2004.
[17] Atkeson, C. G., Moore, A. W., and Schaal, S., "Locally Weighted Learning," Artificial Intelligence Review, Vol. 11, No. 1-5, 1997, pp. 11–73.