OnPoint project page

Implementation Guide

OnPoint: Offline-to-Online Multi-Level Distillation for Point-Supervised Online Temporal Action Localization · ECCV 2026

NDA notice. Full source code cannot be released due to an active Non-Disclosure Agreement. This guide is derived from the paper and supplementary material and is intended to let researchers re-implement OnPoint from public components. Paper: arXiv:2607.00289.
If you have any questions or need support while implementing this work, please contact the author at reza.s@northeastern.edu.

1. Overview & pipeline

OnPoint is a two-stage framework:

  1. Train a point-supervised offline teacher on full videos (HR-Pro for THUMOS'14, TSASPC for EGTEA/HOI4D-O).
  2. Freeze the teacher and train an online student (built on OAT) with multi-level distillation plus direct point supervision.

Video features
    │
    ├─► [Offline teacher] ──► CAS sequence A
    │                         Pseudo segments (s̃, ẽ, ã, p̃)
    │
    └─► [Online student, sliding window at each t]
            Transformer encoder → F_t
                ├─ CASS predictor → Â_t          (distill from A_t^teacher)
                ├─ Anticipation head → ŷ         (distill from future CAS)
                ├─ Anchor decoder → Z            (actionness-calibrated)
                │     ├─ Instance cls/reg heads  (supervised by pseudo segments)
                │     └─ Point cls/reg heads     (supervised by point GT)
                └─ ONMS post-processing → final proposals

Recommended starting point: fork OAT for the online backbone and anchor pipeline, then add the four OnPoint modules described below.

2. Problem definition (POTAL)

Point-Supervised Online Temporal Action Localization (POTAL) detects action segments in a streaming video using only one temporal point + class label per action instance at training time.

3. Phase 1 — Data & features

3.1 Datasets

Dataset | Notes | Point labels
THUMOS'14 | 413 sports videos, 20 classes, ~15 instances/video | Use point annotations from Lee et al. (LACP), provided in HR-Pro repo
EGTEA | Egocentric kitchen, 22 classes, dense actions | Sample one point per instance from Gaussian centered at midpoint, σ = duration/6
HOI4D-O | 553 egocentric office videos, 12 classes | Same Gaussian point sampling as EGTEA
EPIC-Kitchens-100, FineAction | Additional benchmarks in supplement | Same point generation procedure
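The Gaussian point sampling used for EGTEA and HOI4D-O can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `sample_point` is hypothetical, and clamping the sample to the segment boundaries is an assumption the guide does not state.

```python
import random

def sample_point(start, end, rng=random):
    """Draw one annotated point for an action instance [start, end].

    The point comes from a Gaussian centered at the segment midpoint
    with sigma = duration / 6. Clamping to the segment is an assumption.
    """
    duration = end - start
    midpoint = 0.5 * (start + end)
    point = rng.gauss(midpoint, duration / 6.0)
    return min(max(point, start), end)

# Fixed seed so the sampled points are reproducible across runs.
rng = random.Random(0)
points = [sample_point(10.0, 16.0, rng) for _ in range(1000)]
```

With σ = duration/6, the segment boundaries sit three standard deviations from the midpoint, so clamping changes only a small fraction of samples.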

ActivityNet v1.3 is excluded from OnTAL evaluation (videos usually contain one long action).

3.2 Pre-extracted features

Dataset | Feature extractor | Source
THUMOS (online student) | Two-stream TSN, Kinetics-pretrained | OAT repo
THUMOS (offline teacher) | I3D features + point labels | HR-Pro repo
EGTEA | I3D at 24 FPS, stride 12 (2 vectors/sec) | HAT repo
HOI4D-O | CLIP-ViT features | Follow HAT / prior HOI4D feature extraction

Project all features to dimension D = 1024 before the online transformer encoder (match OAT feature dim).

4. Phase 2 — Offline teacher

Step 4.1: Choose backbone

  • THUMOS'14 (sparse actions): train HR-Pro with point labels. Use repo defaults.
  • EGTEA / HOI4D-O (dense/procedural): train TSASPC with stages=4, layers=10, feature maps=64, feature dim=2048, batch=8, lr=0.0005.

Step 4.2: Train with point supervision only

Follow each repo's point-supervised training recipe. The teacher outputs:

  • Per-frame CAS: A ∈ ℝ^{T×(C+1)} (C action classes + background)
  • After offline post-processing: pseudo segments (s̃, ẽ, ã, p̃)

Step 4.3: Freeze teacher for online training

The offline teacher is not updated during online student training. Pre-compute and cache teacher outputs per video to speed up training.

5. Phase 3 — Generate pseudo labels for online training

THUMOS'14 (HR-Pro)

  1. Run HR-Pro on each training video to obtain CAS A.
  2. Extract intermediate reliable proposals (not final merged output) as pseudo segment GT for instance distillation.
  3. Slice CAS to each online window: A_t^teacher = A[t−W+1 : t].
  4. Slice future CAS for anticipation: A[t+1 : t+W′] → binarize with threshold 0.5 → multi-hot target y.
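Steps 3 and 4 can be sketched in plain Python as below. The helper names are hypothetical. Two details are assumptions rather than statements from the guide: frames before the video start are simply dropped (no padding), and a class counts as active in the multi-hot target if any of the W′ future frames exceeds the threshold.

```python
def teacher_window(cas, t, W):
    """Return the teacher CAS slice A[t-W+1 : t], inclusive at both ends.

    cas is a list of T rows, each holding C+1 scores (last entry = background).
    Frames before the video start are dropped; padding is left to the caller.
    """
    lo = max(0, t - W + 1)
    return cas[lo : t + 1]

def anticipation_target(cas, t, W_future, num_classes, thr=0.5):
    """Multi-hot target y over the C action classes, built from A[t+1 : t+W'].

    Assumption: class c is active if any future frame scores above thr.
    """
    future = cas[t + 1 : t + 1 + W_future]
    return [
        1 if any(row[c] > thr for row in future) else 0
        for c in range(num_classes)
    ]

# Toy CAS with T = 6 frames and C = 2 action classes plus background.
cas = [
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.0, 0.7, 0.3],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
]
window = teacher_window(cas, t=2, W=2)
y = anticipation_target(cas, t=2, W_future=2, num_classes=2)
```

Because the teacher is frozen, both slices can be computed once per video and cached alongside the features.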

EGTEA / HOI4D-O (TSASPC)

  1. Run TSASPC to obtain per-frame CAS.
  2. Build pseudo segments by grouping temporally adjacent frames with the same predicted class.
  3. Align pseudo segments and CAS slices to each sliding window as above.
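Step 2 (grouping adjacent frames with the same predicted class) can be illustrated with a short sketch. The function name is hypothetical, and using the mean class probability as the segment score p̃ is an assumption; the guide does not specify how the score is derived.

```python
from itertools import groupby

def frames_to_segments(cas, background_index):
    """Group adjacent frames sharing the same argmax class into pseudo segments.

    Returns (start, end, class, score) tuples with inclusive frame indices.
    The score is the mean class probability over the segment (an assumption).
    """
    labels = [max(range(len(row)), key=row.__getitem__) for row in cas]
    segments, start = [], 0
    for cls, run in groupby(labels):
        length = len(list(run))
        end = start + length - 1
        if cls != background_index:  # background runs produce no segment
            score = sum(cas[i][cls] for i in range(start, end + 1)) / length
            segments.append((start, end, cls, score))
        start = end + 1
    return segments

# Toy CAS: 6 frames, 2 action classes, background stored in the last column.
cas = [
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.0, 0.7, 0.3],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
]
segments = frames_to_segments(cas, background_index=2)
```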

Anchor–pseudo matching (instance supervision)

Follow OAT's anchor assignment: match each pseudo segment to anchors in the current window by temporal IoU / center overlap. Positive anchors receive classification + regression targets from the matched pseudo segment (same protocol as fully supervised OAT, but labels come from teacher).
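A rough sketch of IoU-based anchor assignment is shown below. It is not OAT's code: the function names, the 0.5 threshold, and the choice of anchors that end at the current timestep are all assumptions made for illustration, so the actual assignment rule should be taken from the OAT repository.

```python
def temporal_iou(a, b):
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_anchors(anchors, pseudo_segments, iou_thr=0.5):
    """Assign each anchor its best-overlapping pseudo segment, or None.

    anchors: list of (start, end).
    pseudo_segments: list of (start, end, cls, score) from the teacher.
    iou_thr is a hypothetical value, not taken from the guide.
    """
    targets = []
    for anchor in anchors:
        best, best_iou = None, 0.0
        for seg in pseudo_segments:
            iou = temporal_iou(anchor, seg[:2])
            if iou > best_iou:
                best, best_iou = seg, iou
        targets.append(best if best_iou >= iou_thr else None)
    return targets

# Assumed layout: anchors of several lengths, all ending at the current time t.
t = 64
anchors = [(t - length, t) for length in (4, 8, 16)]
pseudo = [(55, 64, 3, 0.9)]
targets = match_anchors(anchors, pseudo)
```

Anchors matched to a segment become positives and take their class and boundary targets from it; unmatched anchors are treated as background.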

6. Phase 4 — Online student architecture

At each timestep t, form the window X_t ∈ ℝ^{W×D} from the last W feature vectors. Pass through:

6.1 Transformer encoder (from OAT)

6.2 CASS predictor (new)

L_cass = (1/W) Σ_{i=1}^{W} ‖Â_t[i] − A_t^teacher[i]‖₂²
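As a minimal sketch, the CASS distillation loss above is the squared L2 distance between student and teacher score rows, averaged over the W frames of the window. The function name is hypothetical; in a real training loop this would be a tensor operation inside the autograd graph.

```python
def cass_loss(student_cas, teacher_cas):
    """Mean over the W window frames of the squared L2 distance between rows."""
    if len(student_cas) != len(teacher_cas):
        raise ValueError("student and teacher windows must have the same length")
    W = len(student_cas)
    total = 0.0
    for s_row, t_row in zip(student_cas, teacher_cas):
        total += sum((s - t) ** 2 for s, t in zip(s_row, t_row))
    return total / W

# Two frames, two channels: only the first frame disagrees with the teacher.
student = [[0.5, 0.5], [1.0, 0.0]]
teacher = [[1.0, 0.0], [1.0, 0.0]]
loss = cass_loss(student, teacher)
```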

6.3 Anticipation head (new)

L_ant = − Σ_{c=1}^{C} [ y_c log ŷ_c + (1 − y_c) log(1 − ŷ_c) ]

Use fixed W′ per dataset (see hyperparameter table).
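The anticipation loss is a binary cross-entropy summed over the C action classes. The sketch below assumes ŷ already holds probabilities (for example after a sigmoid); the function name and the clamping constant `eps` are illustrative choices, not values from the guide.

```python
import math

def anticipation_loss(y, y_hat, eps=1e-7):
    """Binary cross-entropy summed over the C action classes.

    y: multi-hot targets (0 or 1); y_hat: predicted probabilities in [0, 1].
    """
    loss = 0.0
    for target, prob in zip(y, y_hat):
        prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= target * math.log(prob) + (1 - target) * math.log(1.0 - prob)
    return loss

# An uninformative prediction of 0.5 per class costs log(2) per class.
loss = anticipation_loss([1, 0], [0.5, 0.5])
```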

6.4 Actionness-calibrated anchor decoder (new)

Start from OAT's anchor decoder (5 transformer blocks, 4 heads). Replace standard cross-attention with calibrated attention:

  1. Initialize learnable anchor queries Q ∈ ℝ^{M×D} (M anchors with fixed temporal scales).
  2. Actionness from CASS background channel: r = 1 − σ(A_t[:, C+1]).
  3. Calibration: r̄ = r + log(r) (element-wise; clamp r for numerical stability).
  4. K = Linear(Ft), V = Linear(Ft).
  5. Each decoder layer:
    • Self-attention on Q.
    • Cross-attention: logits = QKᵀ/√D + r̄, α = Softmax(logits), context = α·V.
    • Q ← FeedForward(Q + context).
  6. Output anchor features Z = Q.
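The calibration and cross-attention steps can be sketched as below. This is a single-head, pure-Python simplification for illustration only: the guide specifies 4 heads and learned K/V projections, which are omitted here, and the function names and the clamp value `eps` are assumptions. The bias r̄ has one entry per window frame and is added to every anchor's row of logits.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def calibration_bias(background_logits, eps=1e-6):
    """r = 1 - sigmoid(background), then r_bar = r + log(r), per window frame."""
    bias = []
    for b in background_logits:
        r = min(max(1.0 - sigmoid(b), eps), 1.0)  # clamp before taking the log
        bias.append(r + math.log(r))
    return bias

def calibrated_cross_attention(Q, K, V, bias):
    """Single-head attention: softmax(Q K^T / sqrt(D) + r_bar) V.

    Q: M x D anchor queries; K, V: W x D frame features (already projected);
    bias: length-W vector shared by all anchors.
    """
    D = len(Q[0])
    context = []
    for q in Q:
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(D) + b
            for k, b in zip(K, bias)
        ]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(v - peak) for v in logits]
        norm = sum(weights)
        alpha = [w / norm for w in weights]
        context.append(
            [sum(a * v[d] for a, v in zip(alpha, V)) for d in range(len(V[0]))]
        )
    return context

# Frame 0 looks like action (low background logit), frame 1 like background.
bias = calibration_bias([-10.0, 10.0])
Q = [[0.0, 0.0]]  # zero query, so the logits reduce to the bias alone
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
context = calibrated_cross_attention(Q, K, V, bias)
```

Since r lies in (0, 1], log(r) is at most zero and grows strongly negative as r approaches zero, so frames the CASS predictor considers background are pushed down before the softmax while confident action frames are left almost unchanged.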

6.5 Instance prediction heads (from OAT)

On each anchor feature in Z:

6.6 Anchor-level point prediction (new)

For each anchor, two two-layer MLP heads (ReLU between layers), input = anchor embedding ∈ ℝD:

L_pnt = L_pc + L_pr,   L_pr = (1/N_p) Σ_j | d̂_j − |c_a − p_j| / l_a |

Assign point labels to anchors whose temporal span contains the annotated point.
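The point regression target |c_a − p_j| / l_a and the L1 loss L_pr can be sketched as follows, reading c_a as the anchor center and l_a as the anchor length. The function names are hypothetical, and picking the point closest to the anchor center when several points fall inside one anchor is an assumption the guide does not cover.

```python
def point_targets(anchors, points):
    """For each anchor, return (normalized distance, class) or None.

    anchors: list of (start, end); points: list of (time, cls).
    An anchor is positive when its span contains an annotated point.
    With several points inside, the closest to the center wins (an assumption).
    """
    targets = []
    for start, end in anchors:
        center, length = 0.5 * (start + end), end - start
        inside = [
            (abs(center - p) / length, cls)
            for p, cls in points
            if start <= p <= end
        ]
        targets.append(min(inside) if inside else None)
    return targets

def point_regression_loss(pred_dist, targets):
    """L1 loss between predicted and target distances over positive anchors."""
    pairs = [(d, tgt[0]) for d, tgt in zip(pred_dist, targets) if tgt is not None]
    if not pairs:
        return 0.0
    return sum(abs(d - g) for d, g in pairs) / len(pairs)

anchors = [(56, 64), (60, 64)]
points = [(58.0, 3)]
targets = point_targets(anchors, points)
loss = point_regression_loss([0.5, 0.1], targets)
```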

7. Phase 5 — Losses & training

7.1 Total objective

L_total = α·L_ins + β·L_cass + γ·L_ant + δ·L_pnt

Recommended weights: α = β = δ = 1. Tune only γ (anticipation).
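As a small illustration of the weighting scheme, the sketch below fixes α = β = δ = 1 as defaults and leaves γ as the only required weight. The function name and the example loss values are hypothetical.

```python
def total_loss(l_ins, l_cass, l_ant, l_pnt, gamma, alpha=1.0, beta=1.0, delta=1.0):
    """Weighted sum of the four OnPoint loss terms; only gamma is tuned."""
    return alpha * l_ins + beta * l_cass + gamma * l_ant + delta * l_pnt

# Example with the anticipation weight set to 0.8.
loss = total_loss(l_ins=1.0, l_cass=0.5, l_ant=2.0, l_pnt=0.25, gamma=0.8)
```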

7.2 Training loop (per iteration)

  1. Sample a training video and timestep t (or a batch of windows).
  2. Load feature window Xt, teacher CAS slice, future CAS, pseudo segments, and point labels for that window.
  3. Forward online student → compute all four losses.
  4. Backpropagate L_total; teacher remains frozen.

7.3 Optimizer

8. Phase 6 — Inference & evaluation

8.1 Online inference

  1. Stream features frame-by-frame; maintain a buffer of the last W features.
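  2. At each t, run encoder + anchor decoder + instance heads. The anticipation and point heads are used at train time only; instance heads drive inference. Note that the calibrated attention in Section 6.4 takes its actionness from the CASS background channel, so the CASS predictor must also run if that calibration is kept at inference.
  3. Decode anchors to proposals {(ŝ_k, ê_k, â_k, p̂_k)} following OAT.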
  4. Apply Online NMS (ONMS) from MATR codebase:
    • Suppress proposals with high temporal overlap with already-selected ones.
    • Remove proposals whose end time ê > current time t (wait for boundary confirmation).
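The two ONMS rules listed above can be sketched as a small stateful filter. This is not the MATR implementation: the class name, the 0.3 overlap threshold, and restricting suppression to proposals of the same class are assumptions made for illustration.

```python
def temporal_iou(a, b):
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

class OnlineNMS:
    """Greedy suppression against proposals already emitted at earlier steps."""

    def __init__(self, iou_thr=0.3):  # hypothetical threshold
        self.iou_thr = iou_thr
        self.kept = []

    def update(self, proposals, t):
        """Filter proposals (start, end, cls, score) decoded at time t."""
        emitted = []
        for prop in sorted(proposals, key=lambda p: p[3], reverse=True):
            start, end, cls, _ = prop
            if end > t:
                continue  # end lies in the future: wait for confirmation
            overlaps = any(
                k[2] == cls and temporal_iou((start, end), k[:2]) > self.iou_thr
                for k in self.kept
            )
            if not overlaps:
                self.kept.append(prop)
                emitted.append(prop)
        return emitted

nms = OnlineNMS(iou_thr=0.3)
first = nms.update([(10, 20, 1, 0.9), (11, 21, 1, 0.8), (18, 30, 1, 0.7)], t=22)
second = nms.update([(18, 30, 1, 0.7)], t=30)
```

In the example, the second proposal is suppressed by the first, and the third is held back at t = 22 because its end lies in the future; it is emitted once the stream reaches t = 30.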

8.2 Metrics

9. Distillation-free baseline

To ablate distillation, remove L_ins and train with anchor-level point supervision only (L_pnt as primary signal):

10. Hyperparameters

Per-dataset online settings

Dataset | W | M | Anchor lengths | W′ | γ | Offline teacher
THUMOS'14 | 64 | 6 | {4, 8, 16, 32, 48, 64} | 16 | 1.0 | HR-Pro
EGTEA | 24 | 7 | {2, 4, 6, 8, 12, 16, 24} | 12 | 0.8 | TSASPC
HOI4D-O | 56 | 6 | {4, 8, 16, 24, 48, 56} | 16 | 0.8 | TSASPC
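One way to carry the per-dataset settings into code is a plain configuration dictionary, sketched below with the values read from the table above. The dictionary layout and key names are hypothetical. The two consistency checks reflect patterns visible in the table (M equals the number of anchor lengths, and the longest anchor equals W) rather than rules stated by the authors.

```python
ONLINE_CONFIG = {
    "THUMOS14": {
        "W": 64, "M": 6, "anchor_lengths": (4, 8, 16, 32, 48, 64),
        "W_future": 16, "gamma": 1.0, "teacher": "HR-Pro",
    },
    "EGTEA": {
        "W": 24, "M": 7, "anchor_lengths": (2, 4, 6, 8, 12, 16, 24),
        "W_future": 12, "gamma": 0.8, "teacher": "TSASPC",
    },
    "HOI4D-O": {
        "W": 56, "M": 6, "anchor_lengths": (4, 8, 16, 24, 48, 56),
        "W_future": 16, "gamma": 0.8, "teacher": "TSASPC",
    },
}

def check_config(cfg):
    """Return True when M matches the anchor list and the longest anchor is W."""
    lengths = cfg["anchor_lengths"]
    return cfg["M"] == len(lengths) and max(lengths) == cfg["W"]

all_consistent = all(check_config(cfg) for cfg in ONLINE_CONFIG.values())
```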

Shared online architecture

11. Environment & seeds

Hardware

Software

Random seeds (single run per experiment)

Public code dependencies

Repo | Purpose
HR-Pro | Offline teacher (THUMOS), features & point labels
TSASPC | Offline teacher (EGTEA, HOI4D-O)
OAT | Online backbone, anchors, sliding window
MATR | ONMS post-processing
HAT | EGTEA features & evaluation protocol

12. Implementation checklist

Use this list to track a from-scratch re-implementation: