Proposal: Buzz - Whisper Transcription and Beyond

Buzz - Whisper based transcription and beyond

Proposal in One Sentence

Use OpenAI’s newly released Whisper model to transcribe content such as meetings, then build on the resulting transcripts to perform tasks like topic detection, summarization, etc.

Description of the project and what problem it is solving:

Meetings and online content offer a wealth of information. However, unlike text, most of this content isn’t easily searchable or discoverable. Such content, especially technical and educational material, can also be harder to comprehend without proper captions or transcripts. Transcripts increase accessibility and allow further text processing, such as translation, to be carried out on top of them.

OpenAI recently released Whisper, a general-purpose speech recognition model that can perform speech recognition, speech translation, and language identification. This enables transcription and translation in languages other than English, so content on non-English Discord channels can be handled as well. The aim of the project is to use the Whisper model to transcribe content such as Algovera’s recorded meetings, making it easier for the community to search through the meeting contents, and to provide this service as a Discord bot. An additional goal is to create a pipeline that combines this multilingual speech recognition with downstream natural language processing tasks: topic modelling to find out which major topics were discussed, summarization to automatically generate Minutes of Meeting (from TL;DR to Too Long; Didn’t Listen), emotion detection, etc. An example demo of one such application, where emotion is detected directly from voice/speech in many languages, can be found here
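As a rough sketch of the transcribe-then-search idea, assuming the open-source `whisper` Python package (`load_model` and `transcribe`, whose result contains a `segments` list with `start`, `end`, and `text` fields): the helper names `transcribe_file`, `search_segments`, and `format_hit` are hypothetical and only here for illustration, not a committed design.

```python
from typing import Dict, List


def transcribe_file(path: str, model_name: str = "base", translate: bool = False) -> Dict:
    """Run Whisper on an audio file and return its result dict."""
    # Imported lazily so the search helpers below work without the package installed.
    import whisper

    model = whisper.load_model(model_name)
    # task="translate" asks Whisper to translate the speech into English.
    task = "translate" if translate else "transcribe"
    return model.transcribe(path, task=task)


def search_segments(segments: List[Dict], query: str) -> List[Dict]:
    """Return the segments whose text contains the query (case-insensitive)."""
    needle = query.lower()
    return [seg for seg in segments if needle in seg["text"].lower()]


def format_hit(seg: Dict) -> str:
    """Render a segment as '[MM:SS] text' using its start time in seconds."""
    minutes, seconds = divmod(int(seg["start"]), 60)
    return f"[{minutes:02d}:{seconds:02d}] {seg['text'].strip()}"
```

A Discord bot command could then call something like `search_segments(transcribe_file("meeting.mp3")["segments"], "budget")` and reply with the formatted hits; the file name here is a placeholder.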

Grant Deliverables:

  • Transcribed demo files (such as recorded meetings/podcasts) based on the choice of the community
  • ML pipeline containing Voice2X (where X is one of: topic detection/summary/emotion)
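One possible shape for the Voice2X pipeline, as a minimal sketch: transcribe once, then run any number of downstream steps on the same transcript. The names `voice2x` and `top_keywords` are hypothetical, and the keyword counter is only a crude stand-in for a real topic model or summarizer.

```python
import re
from collections import Counter
from typing import Callable, Dict, List

# Tiny illustrative stopword list; a real pipeline would use a proper one.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "on",
             "is", "are", "we", "it", "for", "this", "that", "with"}


def top_keywords(text: str, k: int = 3) -> List[str]:
    """Crude topic stand-in: the k most frequent non-stopword tokens."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS and len(w) > 2]
    return [word for word, _ in Counter(words).most_common(k)]


def voice2x(audio_path: str,
            transcribe: Callable[[str], str],
            steps: Dict[str, Callable[[str], object]]) -> Dict[str, object]:
    """Transcribe once, then run every downstream step on the same transcript."""
    transcript = transcribe(audio_path)
    results: Dict[str, object] = {"transcript": transcript}
    for name, step in steps.items():
        results[name] = step(transcript)
    return results
```

With Whisper, `transcribe` could be something like `lambda p: model.transcribe(p)["text"]`, and a summarization or emotion-detection model would simply be another entry in `steps`.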

Spread the Love:

I am currently working solo on this, and I would be delighted to welcome anyone with similar interests to work on it together. Suggestions and contributions are always welcome.


Ram - Machine Learning Researcher with experience in Reinforcement Learning and a new kid on the Block.

Discord : shinjeki007#8768


Hey, interesting idea. Whisper is definitely an under-utilised recent AI model. I have wanted accurate transcription for medical/clinical use for a long time now.
Since you’ll get to build this, it’ll be helpful if we catch up after you’ve built something on this. I’m sure colab gateways would need a collab on this.

Definitely agree that Whisper has a lot of potential. I was also thinking about medical and legal speech transcription as possible avenues to explore. A pipeline combining speech recognition/transcription with Whisper and something like named entity recognition models (to find important disease terms and so on) for the medical/clinical domain could be a really useful application, and this could potentially be multilingual as well.

I think there is honestly a good business to be had in court transcription services, if you choose to go with a SaaS model for certified court reporters.

I kind of did a similar thing for my bachelor’s project, attached

  • speaker diarization
  • speech to text
  • summarization

gotta check robustness of whisper