NDA notice. Full source code cannot be released due to an active Non-Disclosure Agreement. This guide is derived from the paper and supplementary material and is intended to let researchers re-implement OnPoint from public components. Paper: arXiv:2607.00289.
If you have any questions or need support while implementing this work, please contact the author at reza.s@northeastern.edu.
1. Overview & pipeline
OnPoint is a two-stage framework:
Train a point-supervised offline teacher on full videos (HR-Pro for THUMOS'14, TSASPC for EGTEA/HOI4D-O).
Freeze the teacher and train an online student (built on OAT) with multi-level distillation plus direct point supervision.
Video features
│
├─► [Offline teacher] ──► CAS sequence A
│ Pseudo segments (s̃, ẽ, ã, p̃)
│
└─► [Online student, sliding window at each t]
Transformer encoder → F_t
├─ CASS predictor → Â_t (distill from A_t^teacher)
├─ Anticipation head → ŷ (distill from future CAS)
├─ Anchor decoder → Z (actionness-calibrated)
│ ├─ Instance cls/reg heads (supervised by pseudo segments)
│ └─ Point cls/reg heads (supervised by point GT)
└─ ONMS post-processing → final proposals
Recommended starting point: fork OAT for the online backbone and anchor pipeline, then add the four OnPoint modules described below.
2. Problem definition (POTAL)
Point-Supervised Online Temporal Action Localization (POTAL) detects action segments in a streaming video using only one temporal point + class label per action instance at training time.
Input at time t: partial video V_{1:t} (no future frames).
Follow each repo's point-supervised training recipe. The teacher outputs:
Per-frame CAS: A ∈ ℝ^{T×(C+1)} (C action classes + background)
After offline post-processing: pseudo segments (s̃, ẽ, ã, p̃)
Step 4.3
Freeze teacher for online training
The offline teacher is not updated during online student training. Pre-compute and cache teacher outputs per video to speed up training.
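Because the teacher is frozen, its outputs only need to be computed once per video. A minimal caching sketch follows. The callable `run_teacher`, the `.npz` file layout, and the assumption that pseudo segments are stored as an (N, 4) array are all hypothetical and not taken from the paper or from the teacher repositories.

```python
from pathlib import Path

import numpy as np


def load_or_compute_teacher(video_id, cache_dir, run_teacher):
    """Return (cas, segments) for one video, running the teacher at most once.

    run_teacher is a hypothetical callable: video_id -> (cas, segments), where
    cas has shape (T, C+1) and segments has shape (N, 4). Both shapes are
    assumptions of this sketch.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{video_id}.npz"

    # Cache hit: read the stored arrays and skip the teacher entirely.
    if path.exists():
        with np.load(str(path)) as data:
            return data["cas"], data["segments"]

    # Cache miss: run the frozen teacher once and store its outputs.
    cas, segments = run_teacher(video_id)
    cas = np.asarray(cas, dtype=np.float32)
    segments = np.asarray(segments, dtype=np.float32)
    np.savez_compressed(str(path), cas=cas, segments=segments)
    return cas, segments
```

Calling this inside the dataset's loading code keeps the training loop free of teacher forward passes after the first epoch.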
5. Phase 3 — Generate pseudo labels for online training
THUMOS'14 (HR-Pro)
Run HR-Pro on each training video to obtain CAS A.
Extract intermediate reliable proposals (not final merged output) as pseudo segment GT for instance distillation.
Slice CAS to each online window: A_t^teacher = A[t−W+1 : t].
Slice future CAS for anticipation: A[t+1 : t+W′] → binarize with threshold 0.5 → multi-hot target y.
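The two slicing steps above can be sketched with NumPy as follows. Two details are not specified in the text and are assumptions here: frames before the start of the video are zero-padded, and the future slice is reduced to one multi-hot vector by taking the maximum over the W′ frames before thresholding. The function name and arguments are illustrative.

```python
import numpy as np


def slice_teacher_targets(cas, t, window, future, thresh=0.5):
    """Build the window CAS target and the anticipation target at time t.

    cas:    (T, C+1) teacher class activation sequence for the whole video.
    t:      current frame index (0-based); the window covers t-window+1 .. t.
    future: W', the number of future frames used for anticipation.
    """
    num_ch = cas.shape[1]
    start = t - window + 1

    # Assumption: positions before frame 0 are filled with zeros.
    past = np.zeros((window, num_ch), dtype=cas.dtype)
    src = cas[max(start, 0): t + 1]
    past[window - len(src):] = src

    # Assumption: a class counts as upcoming if any of the next W' frames
    # exceeds the threshold (max over time, then binarize).
    nxt = cas[t + 1: t + 1 + future]
    if len(nxt) == 0:
        y = np.zeros(num_ch, dtype=np.float32)
    else:
        y = (nxt.max(axis=0) > thresh).astype(np.float32)
    return past, y
```

Near the end of a video the future slice is shorter than W′ or empty; this sketch uses whatever frames remain and returns an all-zero target when none are left.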
EGTEA / HOI4D-O (TSASPC)
Run TSASPC to obtain per-frame CAS.
Build pseudo segments by grouping temporally adjacent frames with the same predicted class.
Align pseudo segments and CAS slices to each sliding window as above.
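Grouping adjacent frames by predicted class can be sketched as a single pass over the per-frame arg-max labels. This sketch assumes that the background channel is the last one, that background runs are discarded, and that the fourth tuple element is a confidence score computed as the mean class score over the segment. None of these choices is confirmed by the text above.

```python
import numpy as np


def cas_to_segments(cas, background=-1):
    """Group consecutive frames with the same arg-max class into segments.

    cas: (T, C+1) per-frame scores.
    Returns a list of (start, end, cls, score) with inclusive frame indices.
    """
    bg = background % cas.shape[1]  # resolve -1 to the last channel index
    labels = cas.argmax(axis=1)
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the video or on a label change.
        if i == len(labels) or labels[i] != labels[start]:
            cls = int(labels[start])
            if cls != bg:
                # Assumption: score = mean activation of the predicted class.
                score = float(cas[start:i, cls].mean())
                segments.append((start, i - 1, cls, score))
            start = i
    return segments
```

In practice a minimum-length filter or a score threshold may be needed to suppress one-frame segments; such filtering is left out here because the text does not describe it.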
Anchor–pseudo matching (instance supervision)
Follow OAT's anchor assignment: match each pseudo segment to anchors in the current window by temporal IoU / center overlap. Positive anchors receive classification + regression targets from the matched pseudo segment (same protocol as fully supervised OAT, but labels come from teacher).
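For orientation only, a generic temporal-IoU assignment could look like the sketch below. It is not OAT's actual rule: the max-IoU matching, the 0.5 positive threshold, the use of -1 for negative anchors, and the raw (start, end) regression targets are placeholders, and OAT's own assignment code should take precedence when forking it.

```python
import numpy as np


def temporal_iou(anchors, segments):
    """Pairwise temporal IoU between (A, 2) anchors and (S, 2) segments."""
    a_start, a_end = anchors[:, None, 0], anchors[:, None, 1]
    s_start, s_end = segments[None, :, 0], segments[None, :, 1]
    inter = np.clip(np.minimum(a_end, s_end) - np.maximum(a_start, s_start), 0, None)
    union = (a_end - a_start) + (s_end - s_start) - inter
    return inter / np.maximum(union, 1e-8)


def assign_anchors(anchors, segments, classes, pos_thresh=0.5):
    """Give each anchor the class and boundaries of its best pseudo segment.

    anchors:  (A, 2) array of (start, end) in window coordinates.
    segments: (S, 2) array of teacher pseudo segments, same coordinates.
    classes:  (S,) integer class of each pseudo segment.
    Returns (cls_target, reg_target); cls_target is -1 for negative anchors.
    """
    num_anchors = len(anchors)
    cls_target = np.full(num_anchors, -1, dtype=np.int64)
    reg_target = np.zeros((num_anchors, 2), dtype=np.float32)
    if len(segments) == 0:
        return cls_target, reg_target

    iou = temporal_iou(anchors, segments)
    best = iou.argmax(axis=1)
    # Placeholder rule: an anchor is positive if its best IoU reaches pos_thresh.
    positive = iou[np.arange(num_anchors), best] >= pos_thresh
    cls_target[positive] = classes[best[positive]]
    reg_target[positive] = segments[best[positive]]
    return cls_target, reg_target
```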
6. Phase 4 — Online student architecture
At each timestep t, form window X_t ∈ ℝ^{W×D} from the last W feature vectors. Pass through:
6.1 Transformer encoder (from OAT)
3 transformer blocks, 8 attention heads each, hidden dim D=1024.
Output window features F_t ∈ ℝ^{W×D}.
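Before the encoder runs, the window of the last W feature vectors has to be assembled from the stream. A minimal buffer sketch is given below; the class name is illustrative, and zero-padding on the left until W frames have arrived is an assumption, since the handling of the first W−1 timesteps is not stated here.

```python
from collections import deque

import numpy as np


class SlidingWindow:
    """Keep the last W feature vectors of a stream and expose them as one window."""

    def __init__(self, window, dim):
        self.window = window
        self.dim = dim
        self.buffer = deque(maxlen=window)  # oldest frame is dropped automatically

    def push(self, feature):
        """Add the newest frame feature and return the window, shape (W, D)."""
        feature = np.asarray(feature, dtype=np.float32)
        if feature.shape != (self.dim,):
            raise ValueError(f"expected shape ({self.dim},), got {feature.shape}")
        self.buffer.append(feature)

        x_t = np.zeros((self.window, self.dim), dtype=np.float32)
        # Assumption: left-pad with zeros until W frames have been observed.
        x_t[self.window - len(self.buffer):] = np.stack(list(self.buffer))
        return x_t
```

Each call to `push` corresponds to one timestep t, so the returned array can be fed straight to the encoder without ever touching future frames.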
6.2 CASS predictor (new)
Two-layer MLP per frame: Linear(D→D) → ReLU → Linear(D→C+1).
Output Â_t ∈ ℝ^{W×(C+1)}.
Use L2 (MSE) loss against teacher CAS slice (do not normalize to a probability simplex).
L_cass = (1/W) Σ_{i=1}^{W} ‖Â_t[i] − A_t^teacher[i]‖₂²
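A possible PyTorch rendering of the predictor and its loss is sketched below. The class and function names are illustrative, and the added batch dimension is an assumption. The loss sums squared errors over the C+1 channels and averages over frames, which follows the formula above; `F.mse_loss` with default settings would additionally divide by C+1.

```python
import torch
from torch import nn


class CASSPredictor(nn.Module):
    """Per-frame two-layer MLP mapping window features to C+1 class scores."""

    def __init__(self, dim, num_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, num_classes + 1),  # C action classes + background
        )

    def forward(self, feats):
        # feats: (B, W, D) -> (B, W, C+1). nn.Linear acts on the last dimension,
        # so every frame is processed independently. No softmax is applied,
        # in line with the note about not normalizing to a simplex.
        return self.mlp(feats)


def cass_loss(pred, teacher_cas):
    """Squared L2 distance per frame, averaged over the W frames and the batch."""
    return ((pred - teacher_cas) ** 2).sum(dim=-1).mean()
```

With `teacher_cas` taken from the cached teacher slice, no gradient flows into the teacher, so nothing further is needed to keep it frozen.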
6.3 Anticipation head (new)
Linear(D → D/4) on Ft, flatten, then FC → ReLU → FC → Sigmoid.
Output ŷ ∈ [0,1]^{C+1} (multi-label).
Target: binarize teacher CAS over next W′ frames (threshold 0.5) → y ∈ {0,1}^{C+1}.
Loss: binary cross-entropy over action classes.
L_ant = − Σ_{c=1}^{C} [ y_c log ŷ_c + (1−y_c) log(1−ŷ_c) ]
Use fixed W′ per dataset (see hyperparameter table).
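The anticipation head and its loss might be written in PyTorch as below. The width of the hidden FC layer (`hidden=256`) is not given in the text and is a placeholder, as are the class and function names and the batch dimension. The loss skips the background channel, which this sketch assumes to be the last one, so that the sum runs over the C action classes as in the formula above.

```python
import torch
from torch import nn
import torch.nn.functional as F


class AnticipationHead(nn.Module):
    """Linear(D -> D/4) per frame, flatten, then FC -> ReLU -> FC -> Sigmoid."""

    def __init__(self, window, dim, num_classes, hidden=256):
        super().__init__()
        self.reduce = nn.Linear(dim, dim // 4)
        self.fc = nn.Sequential(
            nn.Linear(window * (dim // 4), hidden),  # hidden width is an assumption
            nn.ReLU(),
            nn.Linear(hidden, num_classes + 1),
            nn.Sigmoid(),
        )

    def forward(self, feats):
        # feats: (B, W, D) -> reduced (B, W, D/4) -> flattened (B, W*D/4) -> (B, C+1)
        x = self.reduce(feats).flatten(start_dim=1)
        return self.fc(x)


def anticipation_loss(y_hat, y):
    """Binary cross-entropy summed over action classes, averaged over the batch.

    Assumption: the background channel is the last one and is excluded.
    """
    per_class = F.binary_cross_entropy(y_hat[:, :-1], y[:, :-1], reduction="none")
    return per_class.sum(dim=-1).mean()
```

Because the flattened input size depends on W, the head has to be rebuilt if the window length changes between datasets.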
6.4 Actionness-calibrated anchor decoder (new)
Start from OAT's anchor decoder (5 transformer blocks, 4 heads). Replace standard cross-attention with calibrated attention: