Automated identification of interest areas in neural networks through training loss trajectories
Proposal in one sentence:
To understand how neural networks represent input dimensions, we can use per-token loss trajectories over training to identify where interpretability methods should be targeted.
Description of the project and what problem it is solving:
As part of the third Alignment Jam, an AI testing hackathon, Alex Foote used the token trajectories method from Olsson et al. (2022) to look more closely at induction heads and to find points in the training dynamics where interpretability methods can be applied. We can compare the loss on the 500th token with the loss on the 50th token over training steps and use the difference as a proxy for how the prior for that token updates. This lets us see which types of tokens and neurons update significantly at different points in training.
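As a rough illustration of the comparison described above, the sketch below computes the mean loss at the 500th token minus the mean loss at the 50th token for each checkpoint, then flags the checkpoint where that difference changes most. It is a minimal sketch, not the project's actual pipeline: the array layout, the function names `late_minus_early_loss` and `largest_change_step`, and the synthetic loss values are all assumptions made for this example.

```python
import numpy as np


def late_minus_early_loss(per_token_loss, early=50, late=500):
    """Mean loss at the `late` token minus mean loss at the `early` token.

    per_token_loss: array of shape [n_checkpoints, n_sequences, seq_len]
    holding the loss on each token (hypothetical layout for this sketch).
    Token positions are 1-indexed, as in "the 50th token".
    Returns an array of shape [n_checkpoints].
    """
    losses = np.asarray(per_token_loss, dtype=float)
    diff = losses[:, :, late - 1] - losses[:, :, early - 1]
    return diff.mean(axis=1)


def largest_change_step(scores, steps):
    """Training step of the checkpoint with the largest jump from its predecessor."""
    deltas = np.abs(np.diff(scores))
    return int(steps[int(np.argmax(deltas)) + 1])


# Synthetic example: 5 checkpoints, 8 sequences, 512 tokens each.
steps = np.array([0, 1000, 2000, 3000, 4000])
losses = np.full((5, 8, 512), 4.0)
# Assumed for illustration: loss on the 500th token drops from step 3000 onward.
losses[3:, :, 499] = 2.0

scores = late_minus_early_loss(losses)
change_step = largest_change_step(scores, steps)
print(scores, change_step)
```

In practice, the synthetic `losses` array would be replaced by per-token losses logged from saved training checkpoints, and the flagged step would mark a candidate region for closer interpretability analysis.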
This project will focus on completing the resulting paper for conference submission:
- Automating parts of the training dynamics analysis
- Finishing the paper with Alex Foote
- Submitting to a workshop at ICML or to the main track of ICLR
- Twitter: fazlbarez
- Discord: Fazl#3700
- Twitter: esbenkc
- Discord: Arvino Bibulus#9302
- Alex Foote
Additional notes for proposals