Project Proposal: Symmetric and Directional Structures in Vision and Diffusion Transformers
1. Background
Recent research has provided a novel mathematical framework for understanding the internal workings of Transformer models, which are central to modern AI. The key insight from the study by Saponati et al. (2025) is that the training objective fundamentally shapes the structure of the self-attention mechanism. Specifically, the paper demonstrates that:
- Encoder-only models (like BERT), which are trained with a bidirectional objective (predicting a masked word using context from both left and right), naturally develop symmetric structures in their query-key (W_qk) matrices. This reflects the balanced nature of the training task.
- Decoder-only models (like GPT), trained with an autoregressive objective (predicting the next word based only on past words), develop directional structures, in which a few columns of the W_qk matrix become dominant.
These findings also translate into a useful inductive bias: initializing the W_qk matrix to be symmetric at the start of training was found to speed up convergence and improve the performance of encoder-only models.
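For concreteness, the sketch below shows one simple way such structure could be quantified, given the query and key projection matrices of a single attention head. The symmetry score and column-dominance measure are plausible Frobenius-norm-based choices; the exact metrics used by Saponati et al. (2025) may differ in detail.

```python
import torch

def symmetry_score(W: torch.Tensor) -> float:
    """Return a score in [-1, 1]: +1 for a perfectly symmetric matrix,
    -1 for a perfectly antisymmetric one, based on the norms of its
    symmetric and antisymmetric parts."""
    S = 0.5 * (W + W.T)   # symmetric part
    A = 0.5 * (W - W.T)   # antisymmetric part
    return ((S.norm() ** 2 - A.norm() ** 2) / (W.norm() ** 2 + 1e-12)).item()

def column_dominance(W: torch.Tensor, k: int = 5) -> float:
    """Return the fraction of the total squared column norm carried by the
    k largest columns; values near 1 indicate a directional structure."""
    col_energy = (W ** 2).sum(dim=0)
    return (col_energy.topk(k).values.sum() / (col_energy.sum() + 1e-12)).item()

# Toy example for one attention head: W_qk = W_q W_k^T is the bilinear form
# that produces the attention logits x W_qk y^T between token embeddings x, y.
d_model, d_head = 768, 64
W_q = torch.randn(d_model, d_head) / d_model ** 0.5
W_k = torch.randn(d_model, d_head) / d_model ** 0.5
W_qk = W_q @ W_k.T
print(symmetry_score(W_qk), column_dominance(W_qk))
```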
2. Motivation & Project Objectives
The original study established a theoretical and empirical basis for these structural properties in pure transformers trained on language tasks. This project aims to extend that analytical and empirical framework to diffusion transformers and, more broadly, to transformers applied in the vision domain. Specifically, the project will:
1. Analyze Diffusion Transformers (DiTs): Diffusion models are the state of the art in image generation. This objective investigates whether the symmetric structures observed in encoder-only models also emerge in DiTs, by analyzing the W_qk matrices of pre-trained DiTs to identify structural patterns (see the analysis sketch after this list).
2. Explore Structures in Vision Transformers (ViTs): While the original paper touched on vision models, a deeper analysis is needed. This objective involves a comprehensive study of various ViT architectures (e.g., for classification and segmentation) to determine how different training objectives in the visual domain influence the W_qk structure.
3. Test Inductive Biases in New Domains: Based on the findings of this analysis, the project will test whether enforcing symmetry or directionality at initialization improves training efficiency and performance for diffusion and vision transformers (see the initialization sketch after this list). For instance, could a symmetrically initialized DiT learn to generate images faster or with higher fidelity? 🖼️
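As a concrete starting point for objectives 1 and 2, the sketch below shows how per-head W_qk matrices could be extracted from a pretrained vision transformer and scored. It assumes a timm-style ViT with a fused qkv projection; the reference DiT implementation (Peebles & Xie, 2022) uses a similar layout, but attribute names may differ across codebases.

```python
import timm
import torch

def symmetry_score(W: torch.Tensor) -> float:
    # Same metric as in the Background sketch above.
    S, A = 0.5 * (W + W.T), 0.5 * (W - W.T)
    return ((S.norm() ** 2 - A.norm() ** 2) / (W.norm() ** 2 + 1e-12)).item()

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

with torch.no_grad():
    for i, blk in enumerate(model.blocks):
        W = blk.attn.qkv.weight                  # (3*d_model, d_model), rows ordered [q; k; v]
        d_model = W.shape[1]
        d_head = d_model // blk.attn.num_heads
        W_q, W_k, _ = W.split(d_model, dim=0)    # per-projection linear-layer weights

        scores = []
        for h in range(blk.attn.num_heads):
            Wq_h = W_q[h * d_head:(h + 1) * d_head].T   # (d_model, d_head): embeddings -> queries
            Wk_h = W_k[h * d_head:(h + 1) * d_head].T
            scores.append(symmetry_score(Wq_h @ Wk_h.T))
        print(f"block {i}: mean symmetry score {sum(scores) / len(scores):.3f}")
```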
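For objective 3, one simple way to impose symmetry at initialization is to tie the key projection to the query projection, so that every head starts with W_qk = W_q W_q^T, which is symmetric (and positive semi-definite). This is a sketch of one possible scheme under the fused-qkv layout assumed above; the initialization actually used by Saponati et al. (2025) may differ.

```python
import timm
import torch

@torch.no_grad()
def symmetric_qk_init(attn) -> None:
    """Copy the query rows of a fused qkv projection into the key rows so that
    each head's W_qk = W_q W_q^T is symmetric at the start of training."""
    d_model = attn.qkv.weight.shape[1]
    attn.qkv.weight[d_model:2 * d_model].copy_(attn.qkv.weight[:d_model])
    if attn.qkv.bias is not None:
        attn.qkv.bias[d_model:2 * d_model].copy_(attn.qkv.bias[:d_model])

# Apply to a freshly initialized (untrained) model before training begins.
model = timm.create_model("vit_base_patch16_224", pretrained=False)
for blk in model.blocks:
    symmetric_qk_init(blk.attn)
```

A directional initialization for autoregressive or DiT settings could be sketched analogously, e.g., by concentrating the norm of W_qk in a few columns before training.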
3. Requirements
This project is well-suited for a master's student with a strong interest in the theoretical foundations of deep learning and generative AI. The ideal candidate should have:
- Strong Programming Skills: Proficiency in Python and experience with a major deep learning framework (PyTorch preferred).
- Solid Mathematical Foundation: A good understanding of linear algebra, calculus, and probability is essential to grasp the concepts from the source paper.
- Experience with Transformers: Prior hands-on experience with training and analyzing transformer-based models is highly desirable.
Contact
Matteo Saponati (masapo@ini.ethz.ch): I am a research scientist in Machine Learning and Neuromorphic Computing. I am currently working as a postdoctoral researcher at the Institute of Neuroinformatics (ETH/UZH) in the labs of Prof. Benjamin Grewe and Prof. Giacomo Indiveri. I did my PhD at the Max Planck Institute for Brain Research and at the Ernst Strüngmann Institute for Neuroscience, and I received my PhD degree from Radboud University in Nijmegen (NL). My main research interests are reasoning models and test-time computation, mechanistic interpretability, neuromorphic engineering, and neural computation. Read more: https://matteosaponati.github.io/
Pascal Sager (sage@zhaw.ch): Pascal is a PhD candidate at the Centre for Artificial Intelligence at ZHAW and a visiting student at the Institute of Neuroinformatics at UZH/ETH. His research focuses on model-based reinforcement learning, specifically on how AI systems can construct robust world models through structured latent representations. Before starting his PhD, he gained industry experience as a hardware and software engineer. Read more: linkedin.com/in/sagerpascal/ and sagerpascal.github.io
Yassine Taoudi-Benchekroun (ytaoudi@ethz.ch): Yassine is a second-year PhD student under the supervision of Prof. Benjamin Grewe and Prof. Melika Payvand at the Institute of Neuroinformatics. His main research interests include compositionality, modularity, and reasoning. Read more: https://yassine.fyi
Starting date + Duration:
(Short) semester project, with the possibility of extension to a thesis. Starting date possible from September 1, 2025.
References
1. Saponati, M., Sager, P., Vilimelis Aceituno, P., Stadelmann, T., & Grewe, B. (2025). The underlying structures of self-attention: symmetry, directionality, and emergent dynamics in Transformer training. arXiv:2502.10927.
2. Peebles, W., & Xie, S. (2022). Scalable Diffusion Models with Transformers. arXiv:2212.09748.
3. Dosovitskiy, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929.
4. Vaswani, A., et al. (2017). Attention Is All You Need. arXiv:1706.03762.