Reinforcement Learning for UAV Trajectory Optimization

Avoiding jammers while optimizing wireless connectivity through reinforcement learning


Drones have found extensive use in military applications owing to their inconspicuousness and ease of deployment. However, two major challenges remain: maintaining seamless connectivity with the ground station to ensure command and control (C2) link continuity, and avoiding RF jammers that can saturate receivers and inhibit communication.

This work addresses these challenges by developing a reinforcement learning (RL) algorithm that optimizes wireless connectivity while avoiding jammers along the drone route—ensuring minimal deviation from the planned path even in the presence of hostile jamming.

The training process can be completely accomplished pre-flight. The drone only needs to carry a lookup table (Q-table) during flight, dramatically reducing onboard computing requirements while maintaining sophisticated decision-making capabilities.

The Challenge: Dual Threats to UAV Operations

Modern drone operations face two critical challenges that must be addressed simultaneously for mission success.

CHALLENGE 01

Maintaining Wireless Connectivity

Maintaining a good quality C2 link over the drone route is critical for sensitive missions and may take priority over shortest distance. For surveillance missions, continuous streaming capability requires the drone trajectory to always lie in regions of good coverage.

CHALLENGE 02

Avoiding RF Jammers

Drones face constant threat from jamming—focused high-power antenna beams that saturate the front-end receiver, making it impossible to communicate with ground stations or GPS satellites. Jammer locations are often unknown a priori, requiring adaptive response.

Why Traditional Approaches Fall Short

Classical anti-jamming techniques include adaptive filters that attenuate the jammer signal and Frequency Hopping Spread Spectrum (FHSS), in which the carrier frequency varies dynamically to avoid corruption. However, such hardware implementations may not conform to the Size, Weight, Power, and Cost (SWaP-C) constraints imposed by drone form factors.

The fundamental problem is the dynamic nature of the wireless channel. Wireless channels are affected by geography, weather, user traffic, and more, making it impossible to build standard channel models that would yield optimal routes a priori. This challenge is further compounded by jammers whose locations may be unknown.

The PhoenixAI Solution: Reinforcement Learning

PhoenixAI proposes a reinforcement learning approach that allows drones to fly in regions of good connectivity while avoiding multiple jammers on their route. The algorithm deviates minimally from pre-planned paths in the vicinity of jammers—critical for sensitive missions.

Why Reinforcement Learning?

01

Model-Free Learning

RL does not require a priori knowledge of the environment, making it ideal for highly dynamic and unpredictable wireless environments.

02

Adaptive Intelligence

The drone learns key characteristics of the wireless channel, enabling it to fly in regions of good coverage while avoiding threats.

03

Pre-Flight Training

Training occurs on the ground where computation and power are less constrained. The drone carries only a lookup table during flight.

04

Minimal SWaP-C

The deployed algorithm's computational requirements are minimal, so it can run even on drones with modest processing capabilities.

Technical Approach: Wireless Link Modeling

To generate data for training and evaluating the RL algorithm, PhoenixAI developed a representative RF simulator that captures fundamental trends in signal power at drone flight altitudes.

Wireless Heatmap Generation

The link between drone and base station is modeled using the well-known Friis equation, which accounts for transmit power, antenna gains, wavelength, and distance. For the base station antenna, a 15-element linear array with 0.5λ spacing operating at 2.0 GHz provides peak directivity of 18 dBi with a 7-degree down-tilt.
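As an illustration of this link budget, here is a minimal sketch of the Friis equation in decibel form. The function name and the transmit-power and gain values in the example call are assumptions for illustration, not the simulator's actual parameters.

```python
import math

def friis_rx_power_dbm(tx_power_dbm, tx_gain_dbi, rx_gain_dbi,
                       distance_m, freq_hz=2.0e9):
    """Free-space received power (dBm) from the Friis equation:
    P_rx = P_tx + G_tx + G_rx + 20*log10(lambda / (4*pi*d))."""
    wavelength_m = 3.0e8 / freq_hz                       # ~0.15 m at 2.0 GHz
    path_loss_db = 20.0 * math.log10(4.0 * math.pi * distance_m / wavelength_m)
    return tx_power_dbm + tx_gain_dbi + rx_gain_dbi - path_loss_db

# Illustrative call: a base-station sidelobe (18 dBi peak minus the ~13 dB
# sidelobe suppression) serving a drone 1 km away. Values are assumptions.
p_rx = friis_rx_power_dbm(tx_power_dbm=43.0, tx_gain_dbi=5.0,
                          rx_gain_dbi=0.0, distance_m=1000.0)   # ~ -50 dBm
```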

Key Insight: Sidelobe Operations

A drone flying in the sky never sees the main beam of the base station antenna—only the sidelobes, which are at least 13 dB lower than the main beam. This makes drones extremely sensitive to high-power jammer signals that can shadow cellular signals.

For each drone position, the simulator evaluates power received from each sector of each base station. The maximum power becomes the received signal, with powers from all other base stations and sectors treated as interference. The Signal to Interference and Noise Ratio (SINR) is then calculated, accounting for receiver noise.
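A minimal sketch of that SINR calculation, assuming the per-sector received powers come from a Friis-style helper like the one above; the -114 dBm noise floor corresponds to thermal noise in the stated 1 MHz bandwidth, with the receiver noise figure omitted.

```python
import math

def sinr_db(rx_powers_dbm, noise_dbm=-114.0):
    """SINR at one drone position. rx_powers_dbm holds the received power
    (dBm) from every sector of every base station; the strongest entry is
    the serving signal and everything else plus noise is interference."""
    linear_mw = [10 ** (p / 10.0) for p in rx_powers_dbm]
    signal_mw = max(linear_mw)
    interference_mw = sum(linear_mw) - signal_mw
    noise_mw = 10 ** (noise_dbm / 10.0)
    return 10.0 * math.log10(signal_mw / (interference_mw + noise_mw))
```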

Jammer Modeling

Jammers are modeled as highly directive antennas pointed directly upward toward the drone. A 10×10 planar array with 0.5λ spacing generates a main beam with 25 dBi directivity and 5-degree beamwidth in both principal planes.
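As a back-of-the-envelope consistency check (an aperture approximation, not the simulator's pattern model), a 10×10 array at 0.5λ spacing spans roughly a 5λ × 5λ aperture, which gives:

D \approx \frac{4\pi A}{\lambda^{2}} = \frac{4\pi (5\lambda)(5\lambda)}{\lambda^{2}} = 100\pi \approx 314, \qquad 10\log_{10}(314) \approx 25\ \text{dBi}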

Key Differentiator
The jammer's main beam points directly at the sky, making the power density radiated toward the drone several orders of magnitude higher than the base station antenna's sidelobe radiation. This dramatically increases interference in SINR computation.
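To make that concrete, here is a hedged comparison reusing the friis_rx_power_dbm sketch from above; all transmit powers and ranges are purely illustrative assumptions.

```python
# Jammer main beam (25 dBi) aimed up at a drone at 150 m altitude, versus a
# base-station sidelobe (<= 18 - 13 = 5 dBi) roughly 1 km away.
jam_dbm = friis_rx_power_dbm(tx_power_dbm=40.0, tx_gain_dbi=25.0,
                             rx_gain_dbi=0.0, distance_m=150.0)    # ~ -17 dBm
bs_dbm = friis_rx_power_dbm(tx_power_dbm=43.0, tx_gain_dbi=5.0,
                            rx_gain_dbi=0.0, distance_m=1000.0)    # ~ -50 dBm
# With these illustrative numbers, the jammer term entering the SINR
# denominator is roughly 33 dB (over three orders of magnitude) stronger than
# the serving sidelobe signal, which is why SINR collapses near the jammer.
```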

Reinforcement Learning Implementation

PhoenixAI's algorithm uses Q-learning, a model-free RL technique. During training, an estimate Q of the optimal state-action value function is built up as the drone explores the environment. The Q-function maps each state-action pair to the value of taking action a in state s.

Q-Learning Fundamentals

The training process consists of iterative updates to the Q-learning equation, derived from the Bellman optimality condition. At each iteration, if the agent is in state s and takes action a, causing state change to s', the state-action estimate is updated based on the immediate reward and discounted future rewards.
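In standard tabular form (with learning rate α and discount factor γ; the paper's exact hyperparameters are not reproduced here), that update is:

Q(s,a) \leftarrow Q(s,a) + \alpha \left[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\right]

where r is the immediate reward received on the transition from s to s'.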

After training, a policy that selects an action given a state is defined by choosing the action that maximizes the Q-function for that state. The output of training is a Q-table that serves as a lookup table for executing the policy on the drone during flight.
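Executing that policy in flight is just a table lookup. A minimal sketch, assuming the Q-table is stored as a dictionary keyed by (state, action) pairs (the storage format is an assumption):

```python
def greedy_action(q_table, state, actions):
    """In-flight policy: return the action with the largest Q-value for this
    state. q_table is loaded from the pre-flight training run; no learning
    happens on board."""
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```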

Constrained Q-Learning Innovation

Smart Action Space Restriction

Unlike classic Q-learning with fixed action sets, PhoenixAI employs constrained Q-learning that restricts the action space at any iteration to only actions that won't increase distance from destination and respect geofencing. This leads to superior policies and ensures the drone will never fail to arrive due to poorly chosen actions.
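A minimal sketch of such an action mask, assuming a grid of candidate moves and a geofence given as a set of permitted positions (the action set and helper names are assumptions):

```python
import math

# Illustrative 4-connected move set on the flight grid.
ACTIONS = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

def allowed_actions(pos, destination, geofence):
    """Constrained Q-learning mask: keep only moves that do not increase the
    distance to the destination and that stay inside the geofenced region."""
    current_dist = math.dist(pos, destination)
    allowed = []
    for name, (dx, dy) in ACTIONS.items():
        nxt = (pos[0] + dx, pos[1] + dy)
        if math.dist(nxt, destination) <= current_dist and nxt in geofence:
            allowed.append(name)
    return allowed
```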

Epsilon-Greedy Exploration

During training, actions balance exploration and exploitation. With probability 1-ε, the action maximizes the Q function; otherwise, a random action is chosen. Starting with ε near 1 for initial exploration, the value decreases as training progresses to favor convergence.
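A minimal sketch of that selection rule over the constrained action set from the previous sketch; the decay schedule is an illustrative assumption that reuses the 40,000-episode training budget reported below.

```python
import random

def epsilon_greedy(q_table, state, actions, epsilon):
    """With probability epsilon explore (random allowed action); otherwise
    exploit the current Q-table estimate."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

def epsilon_for_episode(episode, total_episodes=40_000, eps_min=0.05):
    """Illustrative linear decay from ~1 toward mostly-greedy behaviour."""
    return max(eps_min, 1.0 - episode / total_episodes)
```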

Intelligent State Space Design

The selection of state space, reward, and action space is the critical design decision. An intuitive approach might define state by position with SINR as reward—but this creates a fundamental problem.

The Generalization Challenge

If the state is simply the drone's position and the reward is the local SINR, the learned policy amounts to a list of what to do at particular locations. When the SINR landscape at deployment differs from the one seen during training, the drone keeps taking actions optimized for the training environment, which are not necessarily optimal for deployment.

The state space must therefore include quantities that generalize, rather than quantities tied to any particular training or deployment environment.

PhoenixAI's Enhanced State Space

To enable generalization, PhoenixAI defines the state space with four parameters that describe the drone's situation in environment-agnostic terms.

This state space allows the drone to learn trends of antenna patterns and associated signal quality, making the algorithm applicable to a wide variety of environments. The use of CQI (quantized SINR) makes the approach more amenable to Q-learning by reducing dynamic range.
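As one concrete ingredient of this state space, here is a hedged sketch of quantizing SINR into a small integer CQI index; the bin edges and level count are illustrative assumptions, not the 3GPP CQI mapping.

```python
def cqi_from_sinr(sinr_db, sinr_min=-10.0, sinr_max=30.0, levels=16):
    """Quantize SINR (dB) into a small integer CQI, shrinking the dynamic
    range the Q-table has to represent. Bin edges are illustrative."""
    clipped = min(max(sinr_db, sinr_min), sinr_max)
    step = (sinr_max - sinr_min) / (levels - 1)
    return round((clipped - sinr_min) / step)
```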

Performance Results: Dynamic Path Optimization

The algorithm was evaluated in a hypothetical urban environment with nine cellular base stations placed uniformly in a 5 km grid, each with three sectorized antennas serving 120-degree sectors. A drone was trained over 40,000 episodes to fly from the bottom-left to the upper-right corner at an altitude of 150 m.

Baseline Performance: Same Environment

When evaluated in the same environment used for training, the RL algorithm strives to keep drones in regions of good connectivity. The RL path achieved 0.2 dB better average SINR than the straight-line path. While modest, this demonstrates the algorithm's ability to optimize connectivity.

Dynamic Environment Performance

The true power of the RL algorithm emerges in dynamic environments. When the model trained on the original configuration was evaluated in a scenario where the center base station was removed (creating a low-coverage region), the results were dramatic:

Straight Path
Flew directly over the poor coverage region, experiencing significant SINR degradation and potential loss of connectivity.
RL Path
Completely avoided the center region and took an alternative path, maintaining good SINR throughout the mission. CDF curves showed significantly better performance than the straight path.

Jammer Avoidance: Adaptive Response

To demonstrate jammer avoidance, three jammers (A, B, C) were introduced directly along the path originally predicted by the RL algorithm. The jammers appear as deep blue spots in the SINR heatmap because of the dramatically increased interference.

Intelligent Evasion Behavior

PHASE 01

Initial Flight Path

The drone initially follows exactly the same path as if no jammer were present, making steady progress toward the destination.

PHASE 02

Jammer Detection & Avoidance

As the drone senses SINR dropping near Jammer A, it immediately changes direction to avoid the main beam, flying through sidelobes instead—providing much lower interference.

PHASE 03

Multiple Threat Navigation

The path change from Jammer A plus RL optimization enables complete avoidance of Jammer B. The algorithm ensures the drone only flies over regions of good SINR.

PHASE 04

Path Convergence

Once past Jammer B, the drone almost immediately returns to the originally planned path, continuing until reaching Jammer C where it performs similar avoidance maneuvers.

The path deviation in the presence of jammers is highly localized—the drone skirts the jammer area and quickly returns to the original path. This minimal deviation is critical for sensitive missions with specific waypoint requirements.

Complex Scenario: Poor Coverage + Jammers

To fully demonstrate applicability to dynamically changing wireless environments, the same trained model was evaluated in an environment where the center base station was removed AND jammers were present.

Results showed the drone followed a path identical to the poor-coverage scenario until reaching jammer locations. Upon encountering jammers, it performed localized avoidance maneuvers while maintaining the overall strategy of avoiding poor coverage zones.

Comparison with Straight Path

The straight path in this scenario would avoid the jammer but fly directly through the poor coverage zone, leading to connectivity drops. The RL optimized path simultaneously optimizes wireless connectivity while avoiding jammers—a capability no simple heuristic can match.

Key Advantages and Strategic Value

  • Dual Optimization: Simultaneously optimizes wireless connectivity and avoids jammers without requiring a priori knowledge of jammer locations.
  • Environment Generalization: Intelligent state space design enables the trained model to perform well in environments different from training scenarios.
  • Minimal Path Deviation: Jammer avoidance is highly localized—the drone quickly returns to planned paths after threat avoidance.
  • Pre-Flight Training: All training occurs on the ground with unconstrained computation. Flight execution requires only a lookup table.
  • SWaP-C Compliant: Minimal computational requirements during flight enable deployment on drones with modest processing capabilities.
  • Mission-Appropriate: Suitable for highly sensitive operations where maintaining planned routes and continuous connectivity are critical.
  • Adaptive Learning: The drone learns wireless channel characteristics and antenna patterns, enabling intelligent decisions in new environments.

Implementation Details

Training Configuration

  • Training Platform: Apple MacBook Air with M1 CPU
  • Training Episodes: 40,000
  • Environment: 9 base stations in a 5 km grid
  • Drone Altitude: 150 meters
  • Bandwidth: 1 MHz
  • Base Station Frequency: 2.0 GHz
  • Base Station Antenna Array: 15-element linear array, 0.5λ spacing
  • Peak Directivity: 18 dBi with 7° down-tilt
  • Jammer Configuration: 10×10 planar array, 25 dBi, 5° beamwidth

Future Directions

While the current results clearly illustrate the potential of PhoenixAI's algorithm, several areas remain for future exploration.

Conclusion

Drone jamming poses a serious threat to strategic missions. Concurrently, many drone applications require continuous connectivity for data streaming and C2 link maintenance. PhoenixAI's RL algorithm equips drones with intelligence to avoid jammers while simultaneously flying only in regions of good coverage.

Using constrained Q-learning, PhoenixAI developed a lookup table that allows drones to take optimal actions during flight. The state space and reward structure enable generalization to new environments. Path deviation in the presence of jammers is highly localized, leading to minimal deviation from originally planned paths.

Critically, this RL model allows drones to be trained completely before flight. The result is a lookup table easily loaded into the drone, adhering to SWaP-C constraints. This represents a significant advancement in autonomous UAV operations, providing military and civilian operators with intelligent, adaptive trajectory optimization that maintains mission effectiveness even in contested electromagnetic environments.

Advancing UAV Autonomy Through AI

PhoenixAI continues to push the boundaries of reinforcement learning for autonomous systems. Our trajectory optimization algorithm demonstrates how AI can solve complex, multi-objective problems in dynamic environments—enabling safer, more effective drone operations in contested scenarios.