Teaching Pre-Trained Language Models to Master Idiomatic Expressions (aka Munchies Index)

Stonewright.AI · April 15, 2022, 4:10am

The Stonewright.AI team has an updated plan to take the Munchies Index™ to new highs.

The Munchies Index is a collection of scripts that mine text data from cannabis social media. Our goal has been to teach a pre-trained language model to communicate using the idiomatic patterns of a niche audience. The cannabis community is an ideal audience because of its distinct communication style and fast-growing online presence.

Stonewright first started gathering cannabis social media data in May 2019. We’re going to celebrate the three-year Munchies Index™ anniversary by open-sourcing our data and our code.

Please help us make this possible. Our plan back in January was to (a) gather more data and (b) develop preliminary designs for our next-generation reporting tool. With the help of Arshy and the Algovera community, we grew our text database by about 70%, from 82M to 140M words. Thanks for the righteous lift, dudes!

Recently we selected Highcharts as a tech partner for the project. Their JS libraries will help us roll our own beautiful reporting interface.

We’ve also chosen Estuary (and IPFS/Filecoin) to be our decentralized data partner. Soon the full raw Munchies Index™ dataset will be made accessible via their gateway.

For next steps, Arshy has already begun exploring some ideas via Hugging Face, while Mnkyntigr and I are trimming the early Munchies Index™ R scripts to launch an initial repo.

Thanks again to everyone in the Algovera community. We appreciate your friendship & support.

Funding requested: $1,000