LossID: Identifying targets for interpretability tools in neural networks


Automated identification of interest areas in neural networks through training loss trajectories

Proposal in one sentence:

To understand how neural networks represent input dimensions, we can use token loss trajectories over the input to identify where to target our interpretability methods.

Description of the project and what problem it is solving:

As part of the third Alignment Jam, the AI testing hackathon, Alex Foote used the token trajectories method from Olsson et al. (2022) to look more closely at induction heads and to find points in the training dynamics where interpretability methods can intervene. We can track the difference in token loss between the 500th and the 50th token over training steps and use it as a proxy for how the prior for that token updates. This lets us see which types of tokens and neurons update significantly at different points in training.
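The loss comparison described above can be sketched in a few lines. This is a minimal illustration, not the project's actual pipeline: it assumes per-token losses have already been saved for each training checkpoint as an array of shape (checkpoints, sequences, sequence length). The function names `token_loss_gap` and `largest_update_steps`, the 1-indexed positions, and the sign convention (later token minus earlier token) are all assumptions made for this example.

```python
import numpy as np


def token_loss_gap(per_token_loss, early_pos=50, late_pos=500):
    """Mean loss at `late_pos` minus mean loss at `early_pos`, per checkpoint.

    per_token_loss: array-like of shape (n_checkpoints, n_sequences, seq_len).
    Positions are 1-indexed (an assumption of this sketch).
    """
    losses = np.asarray(per_token_loss, dtype=float)
    if losses.ndim != 3:
        raise ValueError("expected shape (n_checkpoints, n_sequences, seq_len)")
    if not (1 <= early_pos <= losses.shape[2] and 1 <= late_pos <= losses.shape[2]):
        raise ValueError("token positions must lie within the sequence length")
    # Average over sequences so each checkpoint yields one number per position.
    early = losses[:, :, early_pos - 1].mean(axis=1)
    late = losses[:, :, late_pos - 1].mean(axis=1)
    return late - early


def largest_update_steps(gap, steps, top_k=3):
    """Training steps at which the gap changed most since the previous checkpoint."""
    deltas = np.abs(np.diff(gap))
    order = np.argsort(deltas)[::-1][:top_k]
    # diff index i compares checkpoint i with checkpoint i + 1.
    return [steps[i + 1] for i in order]
```

Checkpoints returned by `largest_update_steps` are candidates for a closer look with interpretability tools, since they mark where the gap between late and early token loss shifted most abruptly.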

This project will focus on finishing the resulting paper for conference submission.

Grant Deliverables:

  • Automating parts of the training dynamics analysis
  • Finishing the paper with Alex Foote
  • Submitting to a workshop at ICML or to the main track of ICLR


Squad Lead:

  • Twitter: fazlbarez
  • Discord: Fazl#3700

Squad members:

  • Twitter: esbenkc
  • Discord: Arvino Bibulus#9302
  • Alex Foote

Additional notes for proposals