IMU-SelfSupEncoder-v1

A self-supervised Transformer encoder for Human Activity Recognition (HAR) from IMU sensor data. It is trained on the WISDM smartphone+smartwatch dataset with a masked-prediction objective, a supervised contrastive (SupCon) loss, and an LMM frequency-domain loss.

Model Overview

  • Architecture: Vision Transformer (ViT) style with Conv-stem patch embedding + time-frequency fusion
  • Input: 6-channel IMU window (accel x/y/z + gyro x/y/z), 200 timesteps @ 20 Hz (10 seconds)
  • Output: 192-dim CLS token + 20 patch embeddings (192-dim each) + intermediate layer outputs
  • Params: ~1.4M
  • Training: Self-supervised masked prediction (teacher-student) + SupCon contrastive + LMM frequency loss

Usage

import torch
from modeling_imu_encoder import IMUMaskedEncoder

model = IMUMaskedEncoder.from_pretrained("NikoKKK/IMU-SelfSupEncoder-v1")
model.eval()

# Input: (batch, 6 channels, 200 timesteps)
x = torch.randn(8, 6, 200)

with torch.no_grad():
    patch_out, intermediates, cls_out, global_freq = model(x)

# cls_out: (8, 192), use for classification
# patch_out: (8, 20, 192), per-patch features
# intermediates: {2: (8, 20, 192), 4: (8, 20, 192)}

Linear Probe Example

import torch.nn as nn

# Freeze encoder
for p in model.parameters():
    p.requires_grad = False
model.eval()

# Simple classifier on CLS token
classifier = nn.Sequential(
    nn.Linear(192, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 18),  # 18 activity classes
)

# Extract frozen CLS features, then train the classifier on them.
# imu_windows: float tensor of shape (N, 6, 200)
with torch.no_grad():
    cls_features = model.encode(imu_windows)  # (N, 192)

Training Details

Pretraining Objective

The model was trained with a self-supervised masked prediction approach:

  1. x-encoder (student): Sees masked input with [MASK] tokens at masked positions
  2. y-encoder (teacher): Sees full original signal, EMA-updated from x-encoder
  3. Predictor: Takes the x-encoder output and predicts the y-encoder representations at masked positions
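
The EMA update of the y-encoder in step 2 can be sketched as follows. This is a framework-agnostic illustration on plain Python floats, not the training code for this model; the function name `ema_update` and the dict-of-lists parameter layout are hypothetical stand-ins for real tensors.

```python
def ema_update(teacher, student, tau):
    """Move each teacher parameter toward the student: t <- tau*t + (1-tau)*s.

    `teacher` and `student` map parameter names to lists of floats
    (a stand-in for real tensors). The teacher is updated in place.
    """
    for name, s_values in student.items():
        t_values = teacher[name]
        for i, s in enumerate(s_values):
            t_values[i] = tau * t_values[i] + (1.0 - tau) * s
    return teacher


# Toy example: one "weight" vector per encoder.
student = {"w": [1.0, 2.0]}
teacher = {"w": [0.0, 0.0]}
ema_update(teacher, student, tau=0.999)
```

With tau close to 1, the teacher changes slowly, which keeps the prediction targets stable while the student trains.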

Masking Strategy

Multi-mask with 4 views per sample:

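| Mask Type  | Probability | Description                                            |
|------------|-------------|--------------------------------------------------------|
| Time block | 50%         | Blocks of 3-8, 10-18, or 20-30 patches                 |
| Channel    | 25%         | Mask 1-2 of 6 sensor channels                          |
| Frequency  | 25%         | Mask 30% of STFT frequency bins (bias toward mid-high) |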
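
As a rough illustration of the probabilities above, the sketch below draws a mask type for each of the 4 views and builds a channel mask. It assumes each view samples its type independently, which the description does not confirm; `sample_mask_types` and `sample_channel_mask` are hypothetical names, not functions from this repository.

```python
import random

# Probabilities taken from the masking table.
MASK_TYPES = ["time_block", "channel", "frequency"]
MASK_PROBS = [0.50, 0.25, 0.25]


def sample_mask_types(num_views=4, rng=random):
    """Draw one mask type per view according to MASK_PROBS (assumed independent)."""
    return rng.choices(MASK_TYPES, weights=MASK_PROBS, k=num_views)


def sample_channel_mask(num_channels=6, rng=random):
    """Pick 1 or 2 sensor channels to hide; True marks a masked channel."""
    n_masked = rng.randint(1, 2)
    masked = set(rng.sample(range(num_channels), n_masked))
    return [c in masked for c in range(num_channels)]


rng = random.Random(0)  # seeded for reproducibility
views = sample_mask_types(rng=rng)
channel_mask = sample_channel_mask(rng=rng)
```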

Loss Functions

| Component         | Weight   | Purpose                                                         |
|-------------------|----------|-----------------------------------------------------------------|
| L_pred (MSE)      | 1.0      | Predict teacher representations at masked positions             |
| L_lmm (frequency) | 0.1      | Reconstruct original signal patches with frequency-domain loss  |
| L_supcon          | 0.15     | Supervised contrastive loss on CLS tokens                       |
| L_sigreg          | adaptive | Prevent representation collapse                                 |
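
A minimal sketch of how these components might be combined, assuming a plain weighted sum (the card lists weights but does not state the combination rule). The function `total_loss` is hypothetical, and the L_sigreg weight is left as an argument because the rule for adapting it is not given.

```python
def total_loss(l_pred, l_lmm, l_supcon, l_sigreg, w_sigreg):
    """Weighted sum of the four loss components, using the listed weights.

    w_sigreg is supplied by the caller because its weight is adaptive;
    how it is adapted is not specified here.
    """
    return 1.0 * l_pred + 0.1 * l_lmm + 0.15 * l_supcon + w_sigreg * l_sigreg


# Made-up component values, only to show the arithmetic.
loss = total_loss(l_pred=0.8, l_lmm=2.0, l_supcon=1.0, l_sigreg=0.5, w_sigreg=0.2)
```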

Dataset

  • Source: WISDM smartphone+smartwatch dataset (watch accelerometer + gyroscope)
  • Subjects: 6 training (1600-1605), 2 validation (1606-1607), 2 test (1608-1609)
  • Activities: 18 classes (walking, jogging, stairs, sitting, standing, typing, brushing teeth, eating/drinking, sports, writing, clapping, folding clothes)
  • Sampling: 20 Hz, 10-second windows with 2.5s stride (75% overlap)
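
The windowing arithmetic can be checked with a short sketch. It assumes only full windows are kept (no padding of the final partial window); `window_starts` is a hypothetical helper, and the 60-second recording is an invented example.

```python
SAMPLE_RATE_HZ = 20
WINDOW = int(10.0 * SAMPLE_RATE_HZ)  # 10 s  -> 200 samples
STRIDE = int(2.5 * SAMPLE_RATE_HZ)   # 2.5 s -> 50 samples


def window_starts(n_samples, window=WINDOW, stride=STRIDE):
    """Start indices of all full windows in a recording of n_samples."""
    if n_samples < window:
        return []
    return list(range(0, n_samples - window + 1, stride))


overlap = 1.0 - STRIDE / WINDOW               # fraction shared by adjacent windows
starts = window_starts(60 * SAMPLE_RATE_HZ)   # hypothetical 60 s recording
```

Each start index `s` selects samples `s` to `s + 199` on all 6 channels, giving one `(6, 200)` model input.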

Training Config

| Parameter     | Value                       |
|---------------|-----------------------------|
| Epochs        | 12                          |
| Batch size    | 128                         |
| Learning rate | 3e-4 (cosine to 1e-5)       |
| Warmup epochs | 2                           |
| Optimizer     | AdamW (weight_decay=0.05)   |
| EMA tau       | 0.999 → 0.9999 (cosine)     |
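
The two schedules can be sketched as below. Several details are assumptions rather than facts from this card: the warmup is taken to be linear from zero, the cosine decay is taken to start when warmup ends, and the tau schedule is taken to span all 12 epochs. `lr_at` and `tau_at` are hypothetical names.

```python
import math


def lr_at(epoch, total_epochs=12, warmup_epochs=2, peak=3e-4, floor=1e-5):
    """Learning rate at a (possibly fractional) epoch."""
    if epoch < warmup_epochs:
        return peak * epoch / warmup_epochs  # linear warmup (assumed)
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


def tau_at(epoch, total_epochs=12, start=0.999, end=0.9999):
    """EMA momentum, rising from start to end along a cosine curve."""
    progress = epoch / total_epochs
    return end - 0.5 * (end - start) * (1.0 + math.cos(math.pi * progress))
```

In practice these would usually be evaluated per optimizer step with a fractional epoch, for example `lr_at(step / steps_per_epoch)`.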

Model Architecture

Input: (B, 6, 200)
    │
    ├── Conv1d Stem (6→96, kernel=10, stride=10)
    │   └── Time tokens: (B, 20, 96)
    │
    ├── Per-patch FFT → Linear
    │   └── Freq tokens: (B, 20, 96)
    │
    ├── Concat + Fusion → (B, 20, 192)
    │
    ├── Global FFT (full 200-pt) → Linear → (B, 1, 192)
    │
    ├── Position Embedding (learned, 21 positions)
    │
    └── Transformer Encoder (4 layers, 6 heads, 192-dim, MLP ratio 3.0)
        ├── Layer 2 → intermediate output
        ├── Layer 4 → intermediate output
        └── CLS token + 20 patch tokens + global_freq token

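The token counts in the diagram follow from simple shape arithmetic, sketched below. It assumes the stem uses no padding and that the FFT branches use a real-input FFT, which returns n // 2 + 1 bins; neither detail is stated in this card, and the helper names are hypothetical.

```python
def conv1d_out_len(length, kernel, stride, padding=0):
    """Output length of an un-dilated 1-D convolution."""
    return (length + 2 * padding - kernel) // stride + 1


def rfft_bins(n):
    """Number of frequency bins a real-input FFT of length n returns."""
    return n // 2 + 1


num_patches = conv1d_out_len(200, kernel=10, stride=10)  # patches per window
patch_bins = rfft_bins(10)    # bins per 10-sample patch (assumes real FFT)
global_bins = rfft_bins(200)  # bins for the full 200-sample window
head_dim = 192 // 6           # per-head width in the Transformer
```

Under these assumptions each window yields 20 patches, matching the `(B, 20, 96)` time tokens, and each attention head works on 32 dimensions.
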
Citation

@misc{imu-selfsup-encoder,
  author = {Li, Yu},
  title = {IMU-SelfSupEncoder-v1: Self-Supervised Transformer for IMU Activity Recognition},
  year = {2026},
  url = {https://huggingface.co/NikoKKK/IMU-SelfSupEncoder-v1},
}

@article{weiss2019wisdm,
  title = {Smartphone and Smartwatch-Based Biometrics Using Activities of Daily Living},
  author = {Weiss, Gary M. and Yoneda, Kenichi and Hayajneh, Thaier},
  journal = {IEEE Access},
  volume = {7},
  pages = {133190--133202},
  year = {2019},
  publisher = {IEEE}
}