To Block Or Not To Block

The Reinforcement Learning Challenge#

I’ve been working on RLMatrix, a deep reinforcement learning framework in C# using TorchSharp. While developing a Unity plugin, I ran into an interesting timing issue in the agent-environment interaction.

[GitHub repository: asieradzk/RL_Matrix]

The main challenge is balancing simulation speed with decision-making time. We want the simulation to run smoothly and quickly, but the deep RL algorithms need time to process, which can slow things down.

The Data Flow#

While working on a Unity plugin, I’ve been questioning the timing of this flow:

  1. Capture simulation state
  2. Get action for this state
  3. Perform action in simulation
  4. Get reward
  5. Re-arrange into “transition” and send to experience buffer
NOTE: The critical issue arises between steps 2 and 3. The simulation might continue running while we’re deciding on an action, potentially making our chosen action suboptimal for the new state.
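
For reference, the “transition” in step 5 is just the observed state, the chosen action, the reward, and the next state bundled together before it goes to the experience buffer. A minimal sketch of what such a record could look like (the type and field names here are illustrative, not RLMatrix’s actual API):

// Illustrative only: one transition as it might be stored in an experience buffer.
public readonly record struct Transition<TState>(
    TState State,           // observation captured in step 1
    int[] DiscreteActions,  // action chosen in step 2 and applied in step 3
    float Reward,           // reward observed in step 4
    TState NextState,       // observation after the action took effect
    bool Done);             // whether the episode ended on this step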

Here’s the basic workflow in RLMatrix:

// Step 1: capture the current state of every environment in parallel.
foreach (var env in _environments)
{
    var stateTask = GetStateAsync(env.Key, env.Value);
    stateTaskList.Add(stateTask);
}
var stateResults = await Task.WhenAll(stateTaskList);

// Step 2: batch the states and ask the agent for one action per environment.
List<(Guid environmentId, TState state)> payload = stateResults.ToList();
var actions = await GetActionsBatchAsync(payload, isTraining);

// Steps 3-4: apply each action to its environment and collect the resulting reward.
foreach (var action in actions)
{
    var env = _environments[action.Key];
    var rewardTask = env.Step(action.Value.discreteActions, action.Value.continuousActions)
        .ContinueWith(t => (action.Key, t.Result));
    rewardTaskList.Add(rewardTask);
}

This setup has a potential issue: the state might change between when we observe it and when we act on it. This means we might be taking actions based on outdated information, and we can’t be sure we’re recording the exact state-action-reward relationships unless we pause the simulation.

The Unity Timing Puzzle#

One simple solution might be to block Unity’s main thread:

void Update()
{
    // Blocks the main thread every frame until the agent has chosen and applied an action.
    myAgent.StepSync();
}

But this ties the learning process to the frame rate, which doesn’t work well for physics-based simulations.

Another approach is to use FixedUpdate(), which runs at set intervals:

private void FixedUpdate()
{
    if (myAgent != null && !isFixedUpdateBusy)
    {
        isFixedUpdateBusy = true;

        // Nearly freeze the simulation while the agent decides,
        // so the state drifts as little as possible between observing and acting.
        var cachedTimeScale = Time.timeScale;
        Time.timeScale = 0.0000001f;

        if (stepCounter % poolingRate == poolingRate - 1)
        {
            // Every poolingRate-th tick: run a full observe-decide-act step.
            myAgent.StepSync();
        }
        else
        {
            // In-between ticks: advance the environments without requesting a new decision.
            foreach (var env in myEnvs)
            {
                env.GhostStep();
            }
        }

        Time.timeScale = cachedTimeScale;
        stepCounter = (stepCounter + 1) % poolingRate;
        isFixedUpdateBusy = false;
    }
}

This doesn’t block the main update loop, but it means our actions might be slightly outdated. To help with this, I slow down the timescale to 0.0000001f during the step. This should reduce how much the simulation state changes while we’re deciding on an action.

The Third Way: Manual Time Steps#

After working with both approaches in Unity and Godot, I found neither solution provided satisfactory results. The timing issues and headaches led me to a third approach: manually counting time and implementing fixed intervals. Most of the time, a 0.02-second interval works well as a starting point:

// Fields on the MonoBehaviour driving the agent.
private float accumulatedTime;
private float stepInterval = 0.02f; // agent step interval; 0.02 s is a reasonable default

void Update()
{
    accumulatedTime += Time.deltaTime;

    // Run as many agent steps as fit into the time that has elapsed.
    while (accumulatedTime >= stepInterval / Time.timeScale)
    {
        PerformStep();
        accumulatedTime -= stepInterval / Time.timeScale;
    }
}
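
PerformStep() is left open above; in my case it plays the same role as the body of the FixedUpdate version, i.e. it either runs a full agent step or a GhostStep depending on the pooling counter. A rough sketch under that assumption (not RLMatrix’s actual implementation):

// Illustrative sketch: one manual step, mirroring the pooling logic from the FixedUpdate approach.
private void PerformStep()
{
    if (stepCounter % poolingRate == poolingRate - 1)
    {
        myAgent.StepSync(); // full observe-decide-act step
    }
    else
    {
        foreach (var env in myEnvs)
        {
            env.GhostStep(); // advance the environment without a new decision
        }
    }
    stepCounter = (stepCounter + 1) % poolingRate;
}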
NOTE: The only drawback is that the accumulator approach will clearly fail if Update can’t keep up with stepInterval. However, this can be addressed by dynamically adjusting the timescale or step interval based on performance in the rare situations where it’s necessary.
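
As a sketch of what such an adjustment could look like (the threshold and scaling factor are arbitrary choices for illustration, not values from RLMatrix), called from Update after the stepping loop:

// Illustrative only: if the step loop is falling behind, trade simulation speed for stability.
private void AdjustForPerformance()
{
    // If several intervals' worth of time has piled up, Update is not keeping up.
    if (accumulatedTime > 3f * stepInterval / Time.timeScale)
    {
        // Either slow the simulation down...
        Time.timeScale = Mathf.Max(0.1f, Time.timeScale * 0.5f);
        // ...or step less often; pick whichever suits the training setup.
        // stepInterval *= 1.5f;
    }
}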

Conclusion#

After experimenting with various approaches to timing in reinforcement learning environments, the manual time step method has proven most reliable. While both the main thread blocking and FixedUpdate approaches seemed promising initially, they each came with their own set of problems that became more apparent during extended use. The manual time step approach provides a good balance of control and simplicity. It’s easier to reason about, more predictable in its behavior, and gives us the flexibility to adjust timing dynamically when needed. The 0.02-second interval has worked well as a default, though your specific needs might require different values.

This is still an area that could use improvement, but for now, this approach has given me the most consistent results across different game engines and scenarios.

Adrian
