Types
Data classes used throughout tinyrl.
PolicyOutput
Structured return type for policy functions.
from tinyrl import PolicyOutput
# just an action
PolicyOutput(action=2)
# with training info
PolicyOutput(action=2, logprob=-0.3, entropy=1.2)
| Field | Type | Default | Description |
|---|---|---|---|
| action | int \| np.ndarray | required | The action to take |
| logprob | float \| None | None | Log-probability of the action |
| entropy | float \| None | None | Entropy of the policy distribution |
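As an illustration of how a policy function might fill in all three fields, the sketch below samples from a softmax over logits. The `PolicyOutput` dataclass here is a stand-in that only mirrors the documented fields (it is not tinyrl's own class), and `softmax_policy` is a hypothetical name. It uses only the standard library so it runs without tinyrl installed.

```python
import math
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PolicyOutput:
    """Stand-in mirroring the documented fields; not tinyrl's own class."""
    action: int
    logprob: Optional[float] = None
    entropy: Optional[float] = None


def softmax_policy(logits: Sequence[float], rng: random.Random) -> PolicyOutput:
    """Hypothetical policy: sample from softmax(logits) and report training info."""
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Sample one action index according to the probabilities.
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]

    # Log-probability of the sampled action and entropy of the distribution.
    logprob = math.log(probs[action])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return PolicyOutput(action=action, logprob=logprob, entropy=entropy)


out = softmax_policy([0.0, 0.0, 0.0], random.Random(0))
```

With three equal logits the distribution is uniform, so whichever action is sampled, `logprob` is `log(1/3)` and `entropy` is `log(3)`.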
Step
A single environment transition, stored in trajectories.
| Field | Type | Default | Description |
|---|---|---|---|
| obs | np.ndarray | required | Observation before the action |
| action | int \| np.ndarray | required | Action taken |
| reward | float | required | Reward received |
| next_obs | np.ndarray | required | Observation after the action |
| done | bool | required | Whether the episode ended |
| logprob | float \| None | None | Log-probability (if policy provided it) |
| entropy | float \| None | None | Entropy (if policy provided it) |
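The difference between required and optional fields can be sketched with a stand-in dataclass that mirrors the table above. This is not tinyrl's own `Step`: observations are plain lists instead of `np.ndarray` to keep the sketch dependency-free, and `has_training_info` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """Stand-in mirroring the documented fields; not tinyrl's own class."""
    obs: List[float]          # np.ndarray in tinyrl; a list in this sketch
    action: int
    reward: float
    next_obs: List[float]     # np.ndarray in tinyrl; a list in this sketch
    done: bool
    logprob: Optional[float] = None
    entropy: Optional[float] = None


def has_training_info(step: Step) -> bool:
    """Hypothetical helper: True when the policy supplied both optional fields."""
    return step.logprob is not None and step.entropy is not None


# A transition from a policy that returned only an action.
plain = Step(obs=[0.0, 1.0], action=2, reward=1.0, next_obs=[0.1, 0.9], done=False)

# The same transition from a policy that also returned logprob and entropy.
with_info = Step(obs=[0.0, 1.0], action=2, reward=1.0, next_obs=[0.1, 0.9],
                 done=False, logprob=-0.3, entropy=1.2)
```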
Trajectory
A full episode trajectory. Supports indexing, iteration, and batched property access.
result = runner.run_episode(policy, return_trajectory=True)
traj = result.trajectory
traj[0] # first Step
len(traj) # number of steps
for step in traj:  # iterate
    ...
Batched properties
| Property | Shape | Description |
|---|---|---|
| traj.obs | (length, state_dim) | All observations |
| traj.actions | (length,) or (length, action_dim) | All actions |
| traj.rewards | (length,) | All rewards |
| traj.logprobs | (length,) or None | All log-probs (if provided) |
| traj.entropies | (length,) or None | All entropies (if provided) |
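One way a container could offer indexing, iteration, and batched properties is sketched below. Both classes are stand-ins, not tinyrl's implementation. The sketch returns plain lists rather than NumPy arrays, and it assumes that an optional property is `None` whenever any step lacks the value; the library's actual rule may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class Step:
    """Stand-in mirroring the documented Step fields."""
    obs: List[float]
    action: int
    reward: float
    next_obs: List[float]
    done: bool
    logprob: Optional[float] = None
    entropy: Optional[float] = None


class Trajectory:
    """Stand-in sequence of steps with batched accessors (lists, not arrays)."""

    def __init__(self, steps: Iterable[Step]) -> None:
        self._steps = list(steps)

    def __len__(self) -> int:
        return len(self._steps)

    def __getitem__(self, index: int) -> Step:
        return self._steps[index]

    def __iter__(self) -> Iterator[Step]:
        return iter(self._steps)

    def _optional(self, name: str) -> Optional[List[float]]:
        # Assumption: the batched value is None if any step is missing it.
        values = [getattr(s, name) for s in self._steps]
        return None if any(v is None for v in values) else values

    @property
    def obs(self) -> List[List[float]]:
        return [s.obs for s in self._steps]          # (length, state_dim)

    @property
    def actions(self) -> List[int]:
        return [s.action for s in self._steps]       # (length,)

    @property
    def rewards(self) -> List[float]:
        return [s.reward for s in self._steps]       # (length,)

    @property
    def logprobs(self) -> Optional[List[float]]:
        return self._optional("logprob")

    @property
    def entropies(self) -> Optional[List[float]]:
        return self._optional("entropy")


traj = Trajectory([
    Step([0.0, 0.0], 1, 1.0, [0.1, 0.0], False, logprob=-0.7),
    Step([0.1, 0.0], 0, 0.0, [0.2, 0.1], False, logprob=-0.7),
    Step([0.2, 0.1], 1, 2.0, [0.3, 0.1], True, logprob=-0.7),
])
```

Here every step carries a `logprob` but none carries an `entropy`, so `traj.logprobs` is a list of three values while `traj.entropies` is `None`.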
EpisodeResult
Return type for Runner.run_episode().
| Field | Type | Description |
|---|---|---|
| reward | float | Total episode reward |
| steps | int | Number of steps taken |
| trajectory | Trajectory \| None | Full trajectory (if return_trajectory=True) |
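Because `trajectory` may be `None`, calling code usually needs to check it before use. The sketch below shows that check with a stand-in `EpisodeResult`. The default of `None` for `trajectory` is an assumption (the table lists no defaults), a plain list stands in for `Trajectory`, and `describe` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EpisodeResult:
    """Stand-in mirroring the documented fields; not tinyrl's own class."""
    reward: float
    steps: int
    trajectory: Optional[List[float]] = None  # stand-in for Trajectory | None


def describe(result: EpisodeResult) -> str:
    """Hypothetical helper: summarize a result, handling a missing trajectory."""
    summary = f"reward={result.reward:.1f} over {result.steps} steps"
    if result.trajectory is None:
        # No trajectory was requested, so only the totals are available.
        return summary + " (no trajectory)"
    return summary + f" ({len(result.trajectory)} transitions recorded)"


without_traj = describe(EpisodeResult(reward=3.0, steps=3))
with_traj = describe(EpisodeResult(reward=3.0, steps=3, trajectory=[1.0, 0.0, 2.0]))
```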