OnPoint project page

Implementation Guide

OnPoint: Offline-to-Online Multi-Level Distillation for Point-Supervised Online Temporal Action Localization · ECCV 2026

NDA notice. Full source code cannot be released due to an active Non-Disclosure Agreement. This guide is derived from the paper and supplementary material and is intended to let researchers re-implement OnPoint from public components. Paper: arXiv:2607.00289.

If you have any questions or need support while implementing this work, please contact the author at reza.s@northeastern.edu.

1. Overview & pipeline

OnPoint is a two-stage framework:

Train a point-supervised offline teacher on full videos (HR-Pro for THUMOS'14, TSASPC for EGTEA/HOI4D-O).
Freeze the teacher and train an online student (built on OAT) with multi-level distillation plus direct point supervision.

Video features
    │
    ├─► [Offline teacher] ──► CAS sequence A
    │                         Pseudo segments (s̃, ẽ, ã, p̃)
    │
    └─► [Online student, sliding window at each t]
            Transformer encoder → F_t
                ├─ CASS predictor → Â_t          (distill from A_t^teacher)
                ├─ Anticipation head → ŷ         (distill from future CAS)
                ├─ Anchor decoder → Z            (actionness-calibrated)
                │     ├─ Instance cls/reg heads  (supervised by pseudo segments)
                │     └─ Point cls/reg heads     (supervised by point GT)
                └─ ONMS post-processing → final proposals

Recommended starting point: fork OAT for the online backbone and anchor pipeline, then add the four OnPoint modules described below.

2. Problem definition (POTAL)

Point-Supervised Online Temporal Action Localization (POTAL) is the task of detecting action segments in a streaming video when training uses only one temporal point and one class label per action instance.

Input at time t: partial video V_1:t (no future frames).
Output: proposals (ŝ, ê, â, p̂) with start/end times, class, confidence.
Proposals are emitted online and cannot be revised once published.
Goal: recover the full action set Ψ = {(ŝ_k, ê_k, â_k)} sequentially.

3. Phase 1 — Data & features

3.1 Datasets

Dataset	Notes	Point labels
THUMOS'14	413 sports videos, 20 classes, ~15 instances/video	Use point annotations from Lee et al. (LACP), provided in HR-Pro repo
EGTEA	Egocentric kitchen, 22 classes, dense actions	Sample one point per instance from Gaussian centered at midpoint, σ = duration/6
HOI4D-O	553 egocentric office videos, 12 classes	Same Gaussian point sampling as EGTEA
EPIC-Kitchens-100, FineAction	Additional benchmarks in supplement	Same point generation procedure

ActivityNet v1.3 is excluded from online TAL (OnTAL) evaluation because its videos usually contain a single long action.
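The Gaussian point sampling listed for EGTEA and HOI4D-O in the table above could be sketched as follows. This is a minimal illustration, not the authors' script: the function name `sample_point`, the fixed seed, and the clamping of the sample to the instance boundaries are assumptions that the guide does not specify.

```python
import random


def sample_point(start, end, rng):
    """Draw one point label for the action instance [start, end]."""
    duration = end - start
    midpoint = start + duration / 2.0
    sigma = duration / 6.0  # from the guide: sigma = duration / 6
    point = rng.gauss(midpoint, sigma)
    # Assumption (not in the guide): clamp so the point stays inside the instance.
    return min(max(point, start), end)


rng = random.Random(0)  # hypothetical seed, used only for reproducibility
points = [sample_point(10.0, 16.0, rng) for _ in range(1000)]
mean_point = sum(points) / len(points)
```

With σ = duration/6 the instance boundaries sit at ±3σ from the midpoint, so the clamp changes only a very small fraction of the draws.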

3.2 Pre-extracted features

Dataset	Feature extractor	Source
THUMOS (online student)	Two-stream TSN, Kinetics-pretrained	OAT repo
THUMOS (offline teacher)	I3D features + point labels	HR-Pro repo
EGTEA	I3D at 24 FPS, stride 12 (2 vectors/sec)	HAT repo
HOI4D-O	CLIP-ViT features	Follow HAT / prior HOI4D feature extraction

Project all features to dimension D = 1024 before the online transformer encoder (match OAT feature dim).

4. Phase 2 — Offline teacher

Step 4.1

Choose backbone

THUMOS'14 (sparse actions): train HR-Pro with point labels. Use repo defaults.
EGTEA / HOI4D-O (dense/procedural): train TSASPC — stages=4, layers=10, feature maps=64, feature dim=2048, batch=8, lr=0.0005.

Step 4.2

Train with point supervision only

Follow each repo's point-supervised training recipe. The teacher outputs:

Per-frame class activation sequence (CAS): A ∈ ℝ^{T×(C+1)} (C action classes plus one background channel)
After offline post-processing: pseudo segments (s̃, ẽ, ã, p̃)

Step 4.3

Freeze teacher for online training

The offline teacher is not updated during online student training. Pre-compute and cache teacher outputs per video to speed up training.

5. Phase 3 — Generate pseudo labels for online training

THUMOS'14 (HR-Pro)

Run HR-Pro on each training video to obtain CAS A.
Extract intermediate reliable proposals (not final merged output) as pseudo segment GT for instance distillation.
Slice CAS to each online window: A_t^teacher = A[t−W+1 : t].
Slice future CAS for anticipation: A[t+1 : t+W′] → binarize with threshold 0.5 → multi-hot target y.
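The two slicing steps above can be sketched in plain Python. Several details are assumptions rather than statements from the guide: indices are 0-based, slices are truncated at the video boundaries instead of padded, and a channel is marked positive when any future frame exceeds the threshold (the guide only says to binarize at 0.5). The name `slice_teacher_targets` is hypothetical.

```python
def slice_teacher_targets(cas, t, window, horizon, threshold=0.5):
    """Return the teacher CAS window ending at t and the anticipation target.

    cas: list of T rows, each a list of C+1 scores (background last).
    t:   0-based index of the current frame.
    """
    lo = max(0, t - window + 1)
    past = cas[lo : t + 1]                 # A[t-W+1 : t], inclusive of t
    future = cas[t + 1 : t + 1 + horizon]  # A[t+1 : t+W']
    num_channels = len(cas[0])
    # Assumption: max over the future frames, then threshold.
    target = [
        1 if any(row[c] > threshold for row in future) else 0
        for c in range(num_channels)
    ]
    return past, target


cas = [
    [0.1, 0.0, 0.9],
    [0.2, 0.1, 0.8],
    [0.7, 0.1, 0.2],
    [0.9, 0.0, 0.1],
    [0.1, 0.6, 0.3],
    [0.0, 0.1, 0.9],
]
past, target = slice_teacher_targets(cas, t=2, window=2, horizon=2)
```

Because the teacher is frozen, these slices can be computed once per video and cached alongside the features.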

EGTEA / HOI4D-O (TSASPC)

Run TSASPC to obtain per-frame CAS.
Build pseudo segments by grouping temporally adjacent frames with the same predicted class.
Align pseudo segments and CAS slices to each sliding window as above.
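Grouping adjacent frames with the same predicted class could look like the sketch below. It is an illustration only: taking the per-frame argmax as the predicted class, dropping background runs, using an exclusive end index, and scoring a segment by its mean class activation are assumptions, and `cas_to_segments` is a hypothetical name.

```python
from itertools import groupby


def cas_to_segments(cas, background):
    """Turn a per-frame CAS into (start, end, cls, score) pseudo segments.

    cas: list of T rows with C+1 scores; background: index of the background
    channel. The end index is exclusive.
    """
    # Predicted class per frame = channel with the highest activation.
    labels = [max(range(len(row)), key=row.__getitem__) for row in cas]
    segments, start = [], 0
    for cls, run in groupby(labels):
        length = len(list(run))
        end = start + length
        if cls != background:
            # Assumption: confidence = mean activation of the class in the run.
            score = sum(cas[i][cls] for i in range(start, end)) / length
            segments.append((start, end, cls, score))
        start = end
    return segments


cas = [
    [0.1, 0.1, 0.8],
    [0.7, 0.1, 0.2],
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],
    [0.2, 0.6, 0.2],
]
segments = cas_to_segments(cas, background=2)
```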

Anchor–pseudo matching (instance supervision)

Follow OAT's anchor assignment: match each pseudo segment to anchors in the current window by temporal IoU / center overlap. Positive anchors receive classification + regression targets from the matched pseudo segment (same protocol as fully supervised OAT, but labels come from teacher).
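A temporal-IoU matching step in the spirit of the paragraph above is sketched here. It does not reproduce OAT's assignment code: the 0.5 threshold, the best-IoU rule, and the layout in which every anchor ends at the current time t are assumptions made for the example, and the function names are hypothetical.

```python
def temporal_iou(a, b):
    """tIoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, segments, iou_threshold=0.5):
    """For each anchor, return the index of its best pseudo segment or None."""
    matches = []
    for anchor in anchors:
        best_idx, best_iou = None, iou_threshold
        for idx, (start, end, _cls, _score) in enumerate(segments):
            iou = temporal_iou(anchor, (start, end))
            if iou >= best_iou:
                best_idx, best_iou = idx, iou
        matches.append(best_idx)
    return matches


t = 64.0
anchor_lengths = [4, 8, 16, 32, 48, 64]  # THUMOS'14 row of the hyperparameter table
# Assumption: every anchor ends at the current time t.
anchors = [(t - length, t) for length in anchor_lengths]
segments = [(50.0, 64.0, 3, 0.9)]  # one teacher pseudo segment (start, end, cls, score)
matches = match_anchors(anchors, segments)
```

Anchors with a match are the positives that receive classification and regression targets from the matched pseudo segment; the remaining anchors are treated as background.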

6. Phase 4 — Online student architecture

At each timestep t, form the window X_t ∈ ℝ^{W×D} from the last W feature vectors and pass it through the following modules:

6.1 Transformer encoder (from OAT)

3 transformer blocks, 8 attention heads each, hidden dim D=1024.
Output window features F_t ∈ ℝ^{W×D}.

6.2 CASS predictor (new)

Two-layer MLP per frame: Linear(D→D) → ReLU → Linear(D→C+1).
Output Â_t ∈ ℝ^{W×(C+1)}.
Use L2 (MSE) loss against teacher CAS slice (do not normalize to a probability simplex).

L_cass = (1/W) Σ_{i=1}^{W} ‖Â_t[i] − A_t^teacher[i]‖₂²
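As a quick check of the normalization in L_cass, the loss can be written out in plain Python. This is a sketch for verifying a framework implementation on small inputs, and `cass_loss` is a hypothetical name.

```python
def cass_loss(pred, teacher):
    """L_cass: squared L2 distance per frame, averaged over the W frames.

    pred, teacher: lists of W rows, each with C+1 raw (unnormalized) scores.
    """
    num_frames = len(pred)
    total = 0.0
    for pred_row, teacher_row in zip(pred, teacher):
        total += sum((p - q) ** 2 for p, q in zip(pred_row, teacher_row))
    return total / num_frames


pred = [[0.5, 0.5], [1.0, 0.0]]
teacher = [[0.5, 0.0], [0.0, 0.0]]
loss = cass_loss(pred, teacher)  # (0.25 + 1.0) / 2
```

Note that PyTorch's `torch.nn.functional.mse_loss` with the default `reduction="mean"` averages over every element, so it returns this value divided by C+1. The two differ only by a constant factor, which can be absorbed into β.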

6.3 Anticipation head (new)

Linear(D → D/4) on F_t, flatten, then FC → ReLU → FC → Sigmoid.
Output ŷ ∈ [0,1]^{C+1} (multi-label).
Target: binarize teacher CAS over the next W′ frames (threshold 0.5) → y ∈ {0,1}^{C+1}.
Loss: binary cross-entropy summed over the C action classes; the background channel (index C+1) does not enter the sum.

L_ant = − Σ_{c=1}^{C} [ y_c log ŷ_c + (1−y_c) log(1−ŷ_c) ]

Use fixed W′ per dataset (see hyperparameter table).
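The anticipation loss can be checked by hand with a few lines of Python. The sketch assumes the background channel is stored last and is skipped in the sum, and it clamps probabilities with a small `eps` to avoid log(0); both the clamp and the name `anticipation_loss` are assumptions.

```python
import math


def anticipation_loss(y_hat, y, eps=1e-7):
    """Binary cross-entropy summed over the C action classes.

    y_hat, y: length C+1 with the background channel last (skipped here).
    """
    num_actions = len(y) - 1
    loss = 0.0
    for c in range(num_actions):
        p = min(max(y_hat[c], eps), 1.0 - eps)  # assumption: clamp for stability
        loss -= y[c] * math.log(p) + (1 - y[c]) * math.log(1.0 - p)
    return loss


y_hat = [0.5, 0.5, 0.9]  # two action classes + background
y = [1, 0, 1]
loss = anticipation_loss(y_hat, y)  # equals 2 * ln(2)
```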

6.4 Actionness-calibrated anchor decoder (new)

Start from OAT's anchor decoder (5 transformer blocks, 4 heads). Replace standard cross-attention with calibrated attention:

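Initialize learnable anchor queries Q ∈ ℝ^{M×D} (M anchors with fixed temporal scales).
Actionness from the background channel of the student's CASS prediction: r = 1 − σ(Â_t[:, C+1]), r ∈ ℝ^W. The teacher CAS is not available at inference, so the predicted Â_t from 6.2 is used here.
Calibration: r̄ = r + log(r) (element-wise; clamp r to a small positive minimum before the log for numerical stability).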
K = Linear(F_t), V = Linear(F_t).
Each decoder layer:
- Self-attention on Q.
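- Cross-attention: logits = QK^⊤/√D + r̄^⊤ (r̄ is broadcast across the M anchor rows), α = Softmax(logits) over the W frames, context = α·V. Here α denotes attention weights, not the loss weight of 7.1.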
- Q ← FeedForward(Q + context).
Output anchor features Z = Q.
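The calibrated cross-attention step can be illustrated with a single-head, pure-Python sketch. It is a simplification under stated assumptions: the guide uses 4 heads and learned K/V projections, whereas here K and V are passed in directly, and the clamp value `eps` and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def calibrated_attention(Q, K, V, background_logits, eps=1e-6):
    """Single-head cross-attention with the actionness bias r_bar = r + log(r).

    Q: M x D anchor queries; K, V: W x D window keys/values;
    background_logits: length-W background channel of the CASS prediction.
    """
    dim = len(Q[0])
    # Actionness r = 1 - sigmoid(background), clamped before the log.
    r = [min(max(1.0 - sigmoid(b), eps), 1.0) for b in background_logits]
    bias = [ri + math.log(ri) for ri in r]
    weights, context = [], []
    for q in Q:
        logits = [
            sum(qd * kd for qd, kd in zip(q, k)) / math.sqrt(dim) + b
            for k, b in zip(K, bias)
        ]
        peak = max(logits)  # subtract the max for a stable softmax
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        alpha = [e / total for e in exps]
        weights.append(alpha)
        context.append(
            [sum(a * v[d] for a, v in zip(alpha, V)) for d in range(len(V[0]))]
        )
    return weights, context


Q = [[0.0, 0.0]]  # one anchor query
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
background_logits = [-4.0, -4.0, 12.0]  # third frame looks like background
weights, context = calibrated_attention(Q, K, V, background_logits)
```

Since exp(r + log r) = r·e^r, the bias multiplies each frame's unnormalized attention weight by r·e^r, so frames with near-zero actionness are almost removed from the softmax. In the example the third frame receives a negligible weight and the context is close to the average of the first two value rows.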

6.5 Instance prediction heads (from OAT)

On each anchor feature in Z:

Classification head: anchor class + foreground/background (follow OAT).
Regression head: temporal offsets/scales to refine anchor boundaries.
Supervised by matched pseudo segments → L_ins = L_ic + L_ir (same as OAT).

6.6 Anchor-level point prediction (new)

For each anchor, two two-layer MLP heads (ReLU between layers), input = anchor embedding ∈ ℝ^D:

Classification: cross-entropy — does this anchor contain a point label, and which class?
Regression: predict the normalized distance d̂ between the anchor center and the point; the target is d = |c_a − p| / l_a (c_a: anchor center, p: point time, l_a: anchor length).

L_pnt = L_pc + L_pr,  L_pr = (1/N_p) Σ_{j=1}^{N_p} | d̂_j − |c_a − p_j| / l_a |

Assign point labels to anchors whose temporal span contains the annotated point.
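The point-to-anchor assignment and the L1 regression term could be sketched as below. Two choices are assumptions that the guide leaves open: when several points fall inside one anchor the point closest to the anchor center is used, and the loss is averaged over the positive anchors. Function names are hypothetical.

```python
def point_targets(anchors, points):
    """Assign point labels to anchors whose span contains the point.

    anchors: list of (start, end); points: list of (time, cls).
    Returns, per anchor, (cls, normalized_distance) or None.
    """
    targets = []
    for start, end in anchors:
        center = (start + end) / 2.0
        length = end - start
        inside = [
            (abs(center - time) / length, cls)
            for time, cls in points
            if start <= time <= end
        ]
        if inside:
            # Assumption: keep the point closest to the anchor center.
            distance, cls = min(inside)
            targets.append((cls, distance))
        else:
            targets.append(None)
    return targets


def point_regression_loss(predicted, targets):
    """L_pr: mean absolute error over anchors that own a point."""
    errors = [
        abs(d_hat - target[1])
        for d_hat, target in zip(predicted, targets)
        if target is not None
    ]
    return sum(errors) / len(errors) if errors else 0.0


anchors = [(56.0, 64.0), (48.0, 64.0), (0.0, 64.0), (0.0, 4.0)]
points = [(62.0, 3)]  # one annotated point of class 3
targets = point_targets(anchors, points)
loss = point_regression_loss([0.25, 0.5, 0.46875, 0.0], targets)
```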

7. Phase 5 — Losses & training

7.1 Total objective

L_total = α·L_ins + β·L_cass + γ·L_ant + δ·L_pnt

Recommended weights: α = β = δ = 1. Tune only γ (anticipation).

7.2 Training loop (per iteration)

Sample a training video and timestep t (or a batch of windows).
Load feature window X_t, teacher CAS slice, future CAS, pseudo segments, and point labels for that window.
Forward online student → compute all four losses.
Backpropagate L_total; teacher remains frozen.

7.3 Optimizer

Adam, lr = 1e-4, weight decay = 1e-4, batch size = 128.
Train until convergence using the same epoch count / schedule as OAT for the target dataset.

8. Phase 6 — Inference & evaluation

8.1 Online inference

Stream features frame-by-frame; maintain a buffer of the last W features.
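At each t, run the encoder, the CASS predictor, the anchor decoder, and the instance heads. The CASS predictor stays active because its background channel supplies the actionness bias of the calibrated decoder (6.4); the anticipation and point heads are training-only, and the instance heads produce the proposals.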
Decode anchors to proposals {(ŝ_k, ê_k, â_k, p̂_k)} following OAT.
Apply Online NMS (ONMS) from MATR codebase:
- Suppress proposals with high temporal overlap with already-selected ones.
- Remove proposals whose end time ê > current time t (wait for boundary confirmation).
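The two ONMS rules above can be illustrated with a small greedy routine. This is a sketch of the idea and not the MATR implementation: class-aware suppression, the 0.5 overlap threshold, and the score-ordered greedy pass are assumptions, and `online_nms_step` is a hypothetical name.

```python
def temporal_iou(a, b):
    """tIoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def online_nms_step(candidates, published, t, iou_threshold=0.5):
    """Publish confirmed, non-overlapping proposals at time t.

    candidates, published: lists of (start, end, cls, score).
    Published proposals are never revised or removed.
    """
    # Rule 2: drop proposals whose end lies in the future.
    confirmed = [c for c in candidates if c[1] <= t]
    for cand in sorted(confirmed, key=lambda c: c[3], reverse=True):
        # Rule 1: suppress high overlap with already selected proposals.
        overlaps = any(
            cand[2] == kept[2]
            and temporal_iou(cand[:2], kept[:2]) > iou_threshold
            for kept in published
        )
        if not overlaps:
            published.append(cand)
    return published


published = []
online_nms_step(
    [(2.0, 8.0, 1, 0.9), (3.0, 9.0, 1, 0.8), (6.0, 12.0, 1, 0.95)],
    published,
    t=10.0,
)
after_t10 = list(published)
online_nms_step([(6.0, 12.0, 1, 0.95)], published, t=12.0)
```

At t = 10 only the highest-scoring confirmed proposal survives; the proposal ending at 12 is held back until t = 12 and is then published because it overlaps the earlier one only weakly.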

8.2 Metrics

Primary: mAP at tIoU thresholds {0.1, 0.2, 0.3, 0.4, 0.5} (and 0.6, 0.7 on THUMOS).
A prediction is correct iff class matches and tIoU ≥ threshold.
Secondary: instance-level F1 at tIoU = 0.5.

9. Distillation-free baseline

To ablate distillation, remove L_ins and train with anchor-level point supervision only (L_pnt as primary signal):

At inference, use point prediction head: if any class activation > 0.1, treat anchor as proposal.
Use anchor boundaries + regressed center offset as segment; apply ONMS.
Without pseudo segments, the model cannot learn proper temporal scale, so performance is substantially lower than the full distillation-based model.

10. Hyperparameters

Per-dataset online settings

Dataset	W (window length)	M (anchors)	Anchor lengths	W′ (anticipation horizon)	γ	Offline teacher
THUMOS'14	64	6	{4,8,16,32,48,64}	16	1.0	HR-Pro
EGTEA	24	7	{2,4,6,8,12,16,24}	12	0.8	TSASPC
HOI4D-O	56	6	{4,8,16,24,48,56}	16	0.8	TSASPC
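For a re-implementation it may help to keep the table above as a config dictionary with a sanity check. The key names and the `validate` helper are hypothetical. The check that the longest anchor equals W is an observation that holds for all three rows, not a rule stated in the guide.

```python
ONLINE_CONFIGS = {
    "THUMOS14": {
        "W": 64, "M": 6, "anchor_lengths": [4, 8, 16, 32, 48, 64],
        "W_future": 16, "gamma": 1.0, "teacher": "HR-Pro",
    },
    "EGTEA": {
        "W": 24, "M": 7, "anchor_lengths": [2, 4, 6, 8, 12, 16, 24],
        "W_future": 12, "gamma": 0.8, "teacher": "TSASPC",
    },
    "HOI4D-O": {
        "W": 56, "M": 6, "anchor_lengths": [4, 8, 16, 24, 48, 56],
        "W_future": 16, "gamma": 0.8, "teacher": "TSASPC",
    },
}


def validate(config):
    """Raise ValueError if the anchor set disagrees with M or W."""
    lengths = config["anchor_lengths"]
    if len(lengths) != config["M"]:
        raise ValueError("number of anchor lengths must equal M")
    if max(lengths) != config["W"]:
        raise ValueError("longest anchor is expected to span the window")
    if lengths != sorted(lengths):
        raise ValueError("anchor lengths should be increasing")


for name, config in ONLINE_CONFIGS.items():
    validate(config)
```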

Shared online architecture

D = 1024, encoder: 3 blocks × 8 heads, decoder: 5 blocks × 4 heads
Batch size = 128, Adam lr = 1e-4, weight decay = 1e-4
α = β = δ = 1; tune γ only

11. Environment & seeds

Hardware

Training: NVIDIA L4, 24 GB, Driver 535.183.06
Efficiency benchmark: RTX 4090 (same checkpoint)

Software

Python 3.10.18, PyTorch 2.5.1, CUDA 11.8, TorchVision 0.20.1
NumPy 2.0.1, SciPy 1.15.3, Pandas 2.3.1, Matplotlib 3.10.3, TensorBoard 2.19.0

Random seeds (single run per experiment)

Online model: 52
Offline HR-Pro: 0
Offline TSASPC: 1538574472

Public code dependencies

Repo	Purpose
HR-Pro	Offline teacher (THUMOS), features & point labels
TSASPC	Offline teacher (EGTEA, HOI4D-O)
OAT	Online backbone, anchors, sliding window
MATR	ONMS post-processing
HAT	EGTEA features & evaluation protocol

12. Implementation checklist

Use this list to track a from-scratch re-implementation:

☐ Download dataset features and point labels for target benchmark
☐ Train / obtain checkpoint for offline teacher (HR-Pro or TSASPC)
☐ Pre-compute and cache per-video CAS and pseudo segments
☐ Implement sliding-window dataloader with teacher label alignment
☐ Port OAT transformer encoder + anchor instance heads
☐ Add CASS predictor + L2 distillation loss
☐ Add anticipation head + BCE on binarized future CAS
☐ Modify anchor decoder cross-attention with r̄ = r + log(r) bias
☐ Add anchor-level point cls/reg heads + L1 regression
☐ Combine losses with α=β=δ=1, tune γ
☐ Integrate ONMS at inference
☐ Evaluate mAP@tIoU (and optionally instance-level F1)