Pet generator: Research for a web app using practical, modified textual inversion

Name of Project: Roommate’s dog generator

Proposal in one sentence:

For the Roommate’s dog generator, I’ll be doing open-source research on 3 main techniques for making textual inversion fast enough for deployment and general use.

Description of the project:

For the Roommate’s dog generator in round 6, I looked at the existing methods for generating pet images from a few reference photos and a prompt, and I noticed that they weren’t very practical for deployment:

  1. Textual inversion needs only a tiny amount of data to represent the concept: a single learned embedding (a minimal sketch follows this list). However, it takes too long to train and the results are highly inconsistent.
  2. Dreambooth takes a shorter time to train and gives better results. However, since it fine-tunes the entire model, each concept would need about 4 GB of space in the cloud, which is not practical, and it is also very susceptible to overfitting.
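
To make the tradeoff concrete, here is a fully toy sketch of the textual inversion setup. The tiny stand-in "denoiser" network, shapes, and random data are my assumptions, not the real Stable Diffusion pipeline; the point is only that the generative model stays frozen and a single new token embedding is the only thing optimized, which is why storage per concept is tiny but training still takes many steps.

```python
# Toy sketch of textual inversion: the generative model is FROZEN and we
# optimize only one new token embedding. The "denoiser" below is a tiny
# stand-in network, not Stable Diffusion; shapes and data are illustrative.
import torch
import torch.nn as nn

embed_dim, latent_dim = 768, 64

# Frozen stand-in for the text-conditioned denoiser (the UNet in the real setup).
denoiser = nn.Sequential(nn.Linear(latent_dim + embed_dim, 256),
                         nn.ReLU(),
                         nn.Linear(256, latent_dim))
for p in denoiser.parameters():
    p.requires_grad_(False)

# The ONLY trainable parameter: the embedding of the new "<roommates-dog>" token.
concept = torch.randn(embed_dim, requires_grad=True)
optimizer = torch.optim.AdamW([concept], lr=1e-2)

for step in range(1000):                       # thousands of steps -> slow in practice
    latents = torch.randn(8, latent_dim)       # toy stand-in for noised image latents
    noise = torch.randn(8, latent_dim)         # target noise to predict
    cond = concept.expand(8, -1)               # prompt embedding containing the concept
    pred = denoiser(torch.cat([latents, cond], dim=-1))
    loss = nn.functional.mse_loss(pred, noise) # standard "predict the noise" objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Deployment cost per concept: just `concept` (768 floats), not a multi-GB model.
```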

To solve these issues and arrive at a practical method for deployment, we have three main ideas we want to test:

1. Having a model generate concepts for textual inversion.

My main idea here is to train a ViT that, given a series of dog photos, outputs the embeddings that textual inversion would normally learn for a concept. This is the approach I am most interested in, as it would allow one-shot generation without needing any per-concept training.
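
A minimal sketch of this idea, assuming torchvision's ViT-B/16 as the image encoder; the projection head, mean pooling, and 768-dimensional target size are my assumptions, not a finished design. Given a handful of dog photos, the encoder predicts a single textual-inversion-style embedding in one forward pass, so no per-concept optimization is needed at deployment time.

```python
# Sketch of a ViT "concept encoder" that maps a few pet photos to one
# textual-inversion-style pseudo-token embedding.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

class ConceptEncoder(nn.Module):
    def __init__(self, text_embed_dim: int = 768):
        super().__init__()
        self.vit = vit_b_16(weights=None)            # ViT backbone (pretrained in practice)
        self.vit.heads = nn.Identity()               # expose the 768-dim [CLS] feature
        self.proj = nn.Linear(768, text_embed_dim)   # map into the text-embedding space

    def forward(self, photos: torch.Tensor) -> torch.Tensor:
        # photos: (num_photos, 3, 224, 224) reference images of one pet
        feats = self.vit(photos)                     # (num_photos, 768)
        concept = self.proj(feats).mean(dim=0)       # average into one concept embedding
        return concept                               # pseudo-token for "<roommates-dog>"

# Usage: training would optimize this encoder with the usual diffusion denoising
# loss (frozen UNet and text encoder); inference is a single forward pass.
encoder = ConceptEncoder()
dog_photos = torch.randn(5, 3, 224, 224)             # toy stand-in for 5 reference photos
pseudo_token = encoder(dog_photos)
print(pseudo_token.shape)                            # torch.Size([768])
```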

2. Using image inversion with cross-attention prompt-to-prompt editing

Cross-attention control for prompt-to-prompt editing was introduced in the Google paper “Prompt-to-Prompt Image Editing with Cross Attention Control”.

The basic idea is that, given an image and its prompt, you can modify the prompt and get an edited image out. Recently, there have been advances such as DDIM inversion that allow us to encode an existing image into the model’s noise space so we can regenerate it later; this technique is called image inversion. I recommend checking out CrossAttentionControl/InverseCrossAttention_Release.ipynb at main · bloc97/CrossAttentionControl · GitHub to get the main idea!

In my case, the results were way too inconsistent. My main idea for correcting this would be to encode multiple images and to choose more detailed initial prompts.
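
For reference, here is a toy sketch of the DDIM inversion step that the bloc97 notebook builds on. The noise predictor is a placeholder and the beta schedule is a toy linear one; in the real pipeline the predictor would be the Stable Diffusion UNet conditioned on the initial prompt. The point is only the update rule: because DDIM sampling is deterministic, running it "backwards" encodes an existing image latent into a noise latent that later regenerates approximately the same image.

```python
# Toy DDIM image inversion loop (clean latent -> noise latent).
import torch

T = 50                                               # number of inversion steps
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t schedule

def predict_noise(x_t, t):
    # Placeholder for unet(x_t, t, prompt_embeds); returns zeros so the sketch runs.
    return torch.zeros_like(x_t)

x = torch.randn(1, 4, 64, 64)                        # stand-in for the VAE latent of the image
for t in range(T - 1):
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
    eps = predict_noise(x, t)
    # Predicted clean latent from the current latent.
    x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    # Deterministic DDIM step run in the "noising" direction (t -> t+1).
    x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps

# `x` is now the inverted noise latent; sampling forward from it with an edited
# prompt (plus cross-attention control) reproduces or edits the original image.
```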

3. Continual learning using Dreambooth

The main reason Dreambooth is impractical for us is that, apart from the inconsistency, each new concept requires a new model. However, one idea is to experiment with papers like https://proceedings.neurips.cc/paper/2020/file/b704ea2c39778f07c617f6b7ce480e9e-Paper.pdf to see whether it’s practical for us to keep adding concepts to a single model. That paper in particular uses a technique called Dark Experience Replay, which replays past examples together with the outputs the model originally produced for them, so the model keeps remembering its earlier concepts.
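
Below is a minimal sketch of Dark Experience Replay on a toy classifier rather than Dreambooth: the replay buffer, reservoir sampling, and the logit-matching loss are the paper's idea, while the tiny network and random data are stand-ins. For our setting, the stored "logits" would instead be the UNet's noise predictions for earlier pet concepts, so one model can keep absorbing new concepts without forgetting the old ones.

```python
# Toy Dark Experience Replay (DER): current-task loss + MSE to logits
# stored in a reservoir-sampled buffer of past examples.
import random
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
buffer, buffer_size, seen, alpha = [], 200, 0, 0.5

for step in range(500):
    x = torch.randn(16, 32)                     # current-task batch (toy data)
    y = torch.randint(0, 10, (16,))
    logits = model(x)
    loss = nn.functional.cross_entropy(logits, y)

    if buffer:                                  # DER term: match logits recorded earlier
        bx, blogits = zip(*random.sample(buffer, min(16, len(buffer))))
        bx, blogits = torch.stack(bx), torch.stack(blogits)
        loss = loss + alpha * nn.functional.mse_loss(model(bx), blogits)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Reservoir sampling keeps a bounded, unbiased sample of past (input, logits).
    for xi, li in zip(x, logits.detach()):
        seen += 1
        if len(buffer) < buffer_size:
            buffer.append((xi, li))
        else:
            j = random.randrange(seen)
            if j < buffer_size:
                buffer[j] = (xi, li)
```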

Grant Deliverables:

Testing out the above techniques and arriving at a textual inversion technique that can actually be deployed.

Squad Lead: Isamu Isozaki
Discord: Chad Kensington
Team Mate: Daniel Schwartz
Discord: dsbuddy


Thanks! I’m just reading the paper, and I see what you mean by the CLIP guidance approach.