Approximate Dynamic Programming: An Introduction

2025-01-02

Editor's Note

If the following explanations remind you of Reinforcement Learning, you are correct. The concept is the same, except RL comes from the field of Machine Learning, while ADP originates from Operations Research.


1. Why ADP?

Classical DP vs. ADP:

Dynamic Programming (DP) is a powerful tool for finding optimal decision rules in sequential, interdependent decision processes. However, classical DP often fails in realistic scenarios because of the so-called "Curse of Dimensionality", which arises when, for instance:

  • State space: Extremely large (e.g., many variables, wide ranges).
  • Action space: Complex or nearly infinite (e.g., continuous action spaces).
  • Future costs: Computationally expensive to calculate, making exact solutions impractical.

Approximate Dynamic Programming (ADP) addresses this issue by combining techniques from Reinforcement Learning and Operations Research. Using approximation and sampling, it keeps the computation manageable without having to evaluate every possible state exactly.


2. Basic Idea: Markov Decision Process (MDP)

 S_i  -- Action a_i -->  S_(i+1)
   \                     /
    \        Cost       /
     \------ R_i ------/

Both classical DP and ADP are based on the concept of the Markov Decision Process (MDP):

  • State (S_i): Describes the system at a given time (e.g., current inventory, vehicle location, machine health).
  • Actions: Decisions made in state S_i that lead to a new state S_(i+1).
  • Transition function: Describes how to move from S_i to S_(i+1).
  • Cost (or reward): Every transition or action incurs (future) costs or yields rewards.

Goal: Find a policy that minimizes total costs or maximizes total rewards over time.
For small problems, exact DP can compute an optimal policy. For larger, complex problems, ADP is used.
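
For small problems, the recursion can be coded directly. Below is a minimal sketch (not taken from the source paper) of exact value iteration on a hypothetical four-level inventory MDP; the cost terms, demand distribution, and discount factor are illustrative assumptions.

```python
# Exact value iteration on a tiny, hypothetical inventory MDP (illustrative only).
import numpy as np

n_states, n_actions = 4, 2   # inventory level 0..3; action 0 = do not order, 1 = order one unit
gamma = 0.95                 # discount factor

def cost(s, a):
    # holding cost per unit in stock, fixed ordering cost, shortage penalty when empty and not ordering
    return 1.0 * s + 2.0 * a + (5.0 if s == 0 and a == 0 else 0.0)

# P[a, s, s']: transition probabilities with demand of 0 or 1 unit, each with probability 0.5
P = np.zeros((n_actions, n_states, n_states))
for a in range(n_actions):
    for s in range(n_states):
        s_post = min(s + a, n_states - 1)      # stock right after the ordering decision
        P[a, s, s_post] += 0.5                 # demand = 0
        P[a, s, max(s_post - 1, 0)] += 0.5     # demand = 1

V = np.zeros(n_states)
for _ in range(500):                           # apply the Bellman operator until (near) convergence
    Q = np.array([[cost(s, a) + gamma * P[a, s] @ V for a in range(n_actions)]
                  for s in range(n_states)])
    V = Q.min(axis=1)                          # minimise expected total discounted cost

print("V =", np.round(V, 2), "optimal action per state =", Q.argmin(axis=1))
```

With only four states, the tables for V and the transition probabilities are tiny; the curse of dimensionality appears as soon as they no longer fit into memory, which is where ADP takes over.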


3. What Makes ADP Different?

Classical DP vs. Approximate DP:

+-------------------+              +--------------------+
| All states        |              | Sampling of        |
| exhaustively      |              | "representative"   |
| calculated        |              | states             |
+-------------------+              +--------------------+
              |                                  |
              v                                  v
      (Curse of Dim.)                      (Scales better)

ADP focuses on approximations where exact computation would be too costly. Key concepts include:

Value Function Approximation

  • The exact value function \( V(S) \), which assigns the "true" future cost or reward to each state, is often impossible to compute or store.
  • Instead, an approximate function \( \hat{V}(S; \theta) \) with far fewer parameters is used (e.g., a linear model or a neural network).
  • The parameters \( \theta \) are learned iteratively through simulation (e.g., Monte Carlo methods), as sketched below.
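
As a rough illustration, the following sketch assumes a hand-crafted linear feature model and simulated one-step transitions; the state dynamics, cost model, and step size are made up for this example.

```python
# Linear value function approximation V_hat(S; theta) = theta . features(S),
# with parameters updated from sampled transitions (temporal-difference style).
import numpy as np

def features(s):
    # hypothetical hand-crafted features of a scalar state s, scaled to stay small
    x = s / 10.0
    return np.array([1.0, x, x * x])

theta = np.zeros(3)                 # parameters of the approximation
alpha, gamma = 0.05, 0.95           # step size and discount factor
rng = np.random.default_rng(0)

for _ in range(5000):
    s = rng.uniform(0, 10)                       # sample a state
    observed_cost = s + rng.normal(0, 0.5)       # assumed one-step cost observation
    s_next = max(s - rng.uniform(0, 2), 0.0)     # sampled successor state
    target = observed_cost + gamma * theta @ features(s_next)   # bootstrapped estimate
    error = target - theta @ features(s)
    theta += alpha * error * features(s)         # gradient-style parameter adjustment

print("learned parameters:", np.round(theta, 3))
```

The same scheme carries over to richer models; only the feature map and the update rule change.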

Post-Decision States

  • After a decision has been made, an intermediate state is considered that isolates the effect of the chosen action.
  • This simplifies estimating how the system will evolve: the deterministic effect of the decision is already accounted for, and only the exogenous randomness remains (see the small example below).
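
A tiny example in an assumed inventory setting makes this concrete: the post-decision state is the stock level right after the ordering decision, before the random demand is observed.

```python
# Post-decision state in a simple inventory setting (illustrative).
import random

def post_decision_state(inventory, order_qty):
    # deterministic effect of the decision: stock right after ordering
    return inventory + order_qty

def next_state(post_state, demand):
    # exogenous transition: only the randomness (demand) remains
    return max(post_state - demand, 0)

s_post = post_decision_state(inventory=3, order_qty=2)      # decision taken -> post-decision state 5
s_next = next_state(s_post, demand=random.randint(0, 4))    # randomness realised afterwards
print(s_post, s_next)
```

Attaching \( \hat{V} \) to the post-decision state means an action can be evaluated without computing an expectation over the randomness at decision time.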

Sampling Instead of Exhaustive Search

  • States and transitions are sampled (e.g., via Monte Carlo simulation) rather than enumerated systematically.
  • Learning then focuses on these samples to estimate how good an action is in a given state (see the sketch below).
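
A minimal sketch of this idea, with an assumed demand distribution and cost structure:

```python
# Estimating the expected cost of an action by Monte Carlo sampling
# instead of enumerating every possible outcome (illustrative).
import random

def sampled_cost(state, action, n_samples=1000):
    total = 0.0
    for _ in range(n_samples):
        demand = random.randint(0, 5)                             # assumed demand distribution
        leftover = max(state + action - demand, 0)
        shortage = max(demand - state - action, 0)
        total += 1.0 * leftover + 5.0 * shortage + 2.0 * action   # holding + shortage + ordering cost
    return total / n_samples

print(round(sampled_cost(state=2, action=1), 2))
```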

4. Application Examples

Inventory Management

  • States: Current inventory levels, expected demand, lead times.
  • Actions: Determine order quantities or delivery schedules.
  • ADP Solution: Approximate future costs (e.g., holding costs, shortage penalties) with a value function approximation learned online or through simulation (see the sketch below).
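
A compressed sketch of the resulting decision rule; the quadratic \( \hat{V} \) below is a stand-in for whatever approximation has been learned, and all cost figures are illustrative.

```python
# Ordering decision using an approximate value function (illustrative sketch):
# choose the order quantity that minimises immediate cost + V_hat(post-decision state).

def v_hat(post_inventory):
    # stand-in for a learned approximation of expected future costs
    # (high for low stock because of expected shortages, high for large stock because of holding)
    return 2.0 * (post_inventory - 3) ** 2

def order_decision(inventory, max_order=5, order_cost=2.0, holding_cost=1.0):
    best_q, best_value = 0, float("inf")
    for q in range(max_order + 1):
        post = inventory + q                                   # post-decision state
        value = order_cost * q + holding_cost * post + v_hat(post)
        if value < best_value:
            best_q, best_value = q, value
    return best_q

print(order_decision(inventory=1))   # -> 1 with these illustrative numbers
```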

Vehicle Routing and Fleet Management

  • States: Current location, available capacity, list of tasks.
  • Actions: Choose the next customer to visit or prioritize routes.
  • ADP Solution: Train cost estimates on sampled scenarios (randomly generated tasks and locations) so that dispatch decisions can be made quickly (see the sketch below).
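
One way such a dispatch rule could look; the cost-to-go estimate and its parameters are illustrative stand-ins for a trained approximation.

```python
# Greedy next-customer selection with an approximate cost-to-go (illustrative sketch).
import math

def cost_to_go(remaining_tasks, spare_capacity, theta=(1.5, -0.3)):
    # stand-in for a learned estimate of future routing costs
    return theta[0] * len(remaining_tasks) + theta[1] * spare_capacity

def next_customer(position, capacity, tasks):
    # tasks: dict customer -> (location, demand); pick the feasible customer with the
    # lowest immediate travel cost plus approximate future cost
    best, best_value = None, float("inf")
    for customer, (loc, demand) in tasks.items():
        if demand > capacity:
            continue
        remaining = {c: t for c, t in tasks.items() if c != customer}
        value = math.dist(position, loc) + cost_to_go(remaining, capacity - demand)
        if value < best_value:
            best, best_value = customer, value
    return best

tasks = {"A": ((2, 0), 1), "B": ((0, 3), 2), "C": ((5, 5), 1)}
print(next_customer(position=(0, 0), capacity=3, tasks=tasks))
```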

Machine Maintenance and Repair

  • States: Current age or "health" of the machine, past failures.
  • Actions: Schedule maintenance (now or later?).
  • ADP Solution: Compare "maintenance now" vs. "maintenance later" using an approximate value function that accounts for maintenance, downtime, and repair costs (see the sketch below).
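
A compressed sketch of that comparison; the failure model, cost figures, and value approximation are assumptions for illustration.

```python
# "Maintain now" vs. "maintain later" via an approximate value function (illustrative).

def v_hat(machine_age):
    # stand-in for learned expected future costs: older machines fail (and cost) more
    return 10.0 + 3.0 * machine_age

def failure_prob(machine_age):
    return min(0.05 * machine_age, 0.9)           # assumed failure model

def decide(machine_age, maintenance_cost=40.0, breakdown_cost=200.0):
    maintain_now = maintenance_cost + v_hat(0)                       # maintenance resets the age to 0
    wait = failure_prob(machine_age) * (breakdown_cost + v_hat(0)) \
         + (1 - failure_prob(machine_age)) * v_hat(machine_age + 1)
    return "maintain now" if maintain_now < wait else "wait"

for age in (2, 8, 14):
    print(age, decide(age))
```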

5. How to Implement ADP

      +-------------------------------------+
      | 1. Initial Approximation of         |
      |    Value Function (e.g., random)    |
      +------------------+------------------+
                         |
                         v
      +-------------------------------------+
      | 2. Simulation / Sampling            |
      |    (States & Actions)               |
      +------------------+------------------+
                         |
                         v
      +-------------------------------------+
      | 3. Evaluate Results                 |
      |    (e.g., Costs, Rewards)           |
      +------------------+------------------+
                         |
                         v
      +-------------------------------------+
      | 4. Update Approximation             |
      |    (e.g., Adjust Parameters)        |
      +------------------+------------------+
                         |
                         v
      +-------------------------------------+
      | 5. Derive New Policy                |
      +------------------+------------------+
                         |
                         +-----> (Back to Step 2)

Steps:

  1. Formulate the Problem: Define the state space, action space, transition functions (or at least a simulation), and cost structure.
  2. Choose an Approximation: Start simple (e.g., a linear model \( \hat{V}(S) = \alpha \cdot \text{Feature}(S) + \beta \)) and move to more complex models (e.g., neural networks) only if needed.
  3. Simulate or Sample: Generate representative states, execute actions, and adjust the parameters \( \theta \) based on the observed future costs.
  4. Iterative Learning: Improve the value function step by step using methods such as forward dynamic programming or Monte Carlo simulation; adjust the policy and keep testing.
  5. Monitor Performance: Compare the ADP policy with simple heuristics or exact solutions (for smaller instances) to refine the approximation as needed. A compact end-to-end sketch of this loop follows below.
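
Putting the steps together, here is an end-to-end sketch of the loop above; the inventory model, feature choice, and step sizes are illustrative assumptions rather than the setup from the base paper.

```python
# End-to-end ADP loop for a small inventory problem (illustrative sketch):
# 1) initialise V_hat, 2) simulate, 3) evaluate costs, 4) update parameters, 5) derive policy.
import numpy as np

rng = np.random.default_rng(1)
gamma, alpha = 0.95, 0.02
max_inv, max_order = 10, 5

def features(s):
    x = s / max_inv
    return np.array([1.0, x, x * x])

theta = np.zeros(3)                                   # step 1: initial (here: zero) approximation

def step_cost(s, q, demand):
    leftover = min(max(s + q - demand, 0), max_inv)
    shortage = max(demand - s - q, 0)
    return 2.0 * q + 1.0 * leftover + 5.0 * shortage, leftover

def greedy_action(s):
    # step 5 (also used inside the loop): derive the policy from the current approximation
    def value(q):
        return np.mean([step_cost(s, q, d)[0] + gamma * theta @ features(step_cost(s, q, d)[1])
                        for d in rng.integers(0, 6, size=20)])
    return min(range(max_order + 1), key=value)

for episode in range(200):                            # outer loop: back to step 2 each time
    s = int(rng.integers(0, max_inv + 1))
    for _ in range(30):                               # step 2: simulate states and actions
        q = greedy_action(s)
        demand = int(rng.integers(0, 6))
        cost, s_next = step_cost(s, q, demand)        # step 3: evaluate the result
        target = cost + gamma * theta @ features(s_next)
        theta += alpha * (target - theta @ features(s)) * features(s)   # step 4: update parameters
        s = s_next

print("theta =", np.round(theta, 2))
print("policy:", {s: greedy_action(s) for s in range(max_inv + 1)})
```

In practice, step 5 also includes comparing this learned policy against simple heuristics or, for small instances, the exact DP solution.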

Base Paper: University of Twente Research
