Multi-Modal Spatial-Temporal American Sign Language Translation with Transformers

We develop an end-to-end framework that translates American Sign Language (ASL) video into natural language text. By fusing DinoV3 self-supervised visual embeddings with SAM-3D skeletal coordinates, the system captures both the semantic nuance of hand shapes and the spatial-temporal dynamics of body movement. We leverage a sliding-window CNN architecture and a fine-tuned M2M100 decoder, and evaluate performance on the How2Sign benchmark using BLEU and METEOR scores.
Bence Jordan Dankó
Last updated March 18, 2026 at 3:00 PM
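
To make the pipeline concrete, here is a minimal PyTorch sketch of how per-frame DinoV3 embeddings and SAM-3D skeletal coordinates could be fused by a sliding-window temporal CNN and passed to M2M100 as encoder embeddings. The feature dimensions, the 33-keypoint skeleton, the `SlidingWindowFusion` module, and the use of `inputs_embeds` are illustrative assumptions, not the project's actual implementation.

```python
# Minimal sketch of the fusion + decoding pipeline described above.
# Assumptions (not from the source): DinoV3 frame embeddings are 1024-d,
# SAM-3D skeletal coordinates are flattened to 3 * 33 values per frame,
# and the fused sequence is fed to M2M100 as encoder `inputs_embeds`.
import torch
import torch.nn as nn
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer


class SlidingWindowFusion(nn.Module):
    """Fuses per-frame visual and skeletal features with a temporal 1D CNN."""

    def __init__(self, visual_dim=1024, skel_dim=3 * 33, hidden=1024, window=9):
        super().__init__()
        self.proj = nn.Linear(visual_dim + skel_dim, hidden)
        # Sliding-window convolution over the frame axis, downsampling in time.
        self.temporal_cnn = nn.Conv1d(hidden, hidden, kernel_size=window,
                                      stride=2, padding=window // 2)

    def forward(self, visual, skeletal):
        # visual: (B, T, visual_dim), skeletal: (B, T, skel_dim)
        x = self.proj(torch.cat([visual, skeletal], dim=-1))      # (B, T, hidden)
        x = self.temporal_cnn(x.transpose(1, 2)).transpose(1, 2)  # (B, T', hidden)
        return x


tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", tgt_lang="en")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
fusion = SlidingWindowFusion(hidden=model.config.d_model)

# Dummy clip of 64 frames; these tensors stand in for real DinoV3 / SAM-3D outputs.
visual = torch.randn(1, 64, 1024)
skeletal = torch.randn(1, 64, 3 * 33)
inputs_embeds = fusion(visual, skeletal)

# Training-style forward pass: the fused sequence replaces token embeddings on the
# encoder side, and the target English sentence supervises the decoder.
labels = tokenizer(text_target="hello how are you", return_tensors="pt").input_ids
out = model(inputs_embeds=inputs_embeds, labels=labels)
print(out.loss)
```

In practice the fusion module and the M2M100 weights would be trained jointly on How2Sign clips, with BLEU and METEOR computed on the generated translations.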
