Some bitter experiences with diffusion model fine-tuning
Time: 2024-04-15 ~ 2024-06-15
Aims:
- Fine-tune the Stable Diffusion model on an image dataset matching our local needs, so that the fine-tuned model can be steered to generate the specific type of images we want;
- Fine-tune the model for Monster Hunter-style images;
Contents:
- Bitter memories of dataset construction
- Fine-tuning principles and experimental details
- Model Evaluation Methodology and Performance Improvement Record
- Some lessons learned, shortcomings and regrets
- Reference
1. Bitter memories of dataset construction
After deciding to perform full fine-tuning of the diffusion model's UNet, the first step was to look for open-source txt2img datasets. We found that there is not much open-source data on Hugging Face, and much of what exists is of very low quality. Although sites such as CivitAI and Hugging Face host many successful LoRA fine-tuned models and write-ups, they say little about how their datasets were constructed, which left us somewhat confused when preparing our data.
Much of the subsequent time was lost to this lack of knowledge about the data. Only as we collected more and more images did our understanding of the dataset deepen, including the importance of image quality, of high-quality prompts, and of the length limit of CLIP's text encoder (77 tokens).
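That token limit is easy to check directly with the tokenizer. A minimal sketch, assuming the standard openai/clip-vit-large-patch14 tokenizer used by SD 1.5 (the caption is an illustrative example):

```python
from transformers import CLIPTokenizer

# SD 1.5 uses the CLIP ViT-L/14 text encoder, whose context length is 77 tokens
# (including the begin/end-of-text tokens).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

caption = "a monster hunter style full-body character illustration, ornate armor, clean background"
token_ids = tokenizer(caption).input_ids
print(len(token_ids), tokenizer.model_max_length)

if len(token_ids) > tokenizer.model_max_length:
    print("caption will be truncated by the text encoder")
```

Anything past the 77-token limit is silently truncated, so overly long captions waste labeling effort.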
Open-source dataset – a first try with Pxovela
- Dataset: pxovela/cosmopolitan. Profile: 11,475 image-text pairs, split into a training set and a validation set, with 1,722 images (15% of the total) held out as validation data.
- full-fine-tuning-guide-for-sd15
With this reference case in hand, we proceeded to run fine-tuning experiments on that dataset with the following base models:
- sd-1-5;
- sd-2-1;
- SDXL;
Inexperience, together with the gap between practical LLM fine-tuning experience and text-to-image training, led to unsatisfactory progress in the early stages, until around 2024-05 when we began to adjust our approach:
- Targeting high-quality open source datasets
- 200 high-quality images (source: imgsys) were selected for overfitting training over 200 epochs (a number that, in hindsight, was still too conservative);
- Monster Hunter datasets for local needs
- 100 locally collected Monster Hunter images that looked high quality at the time (on later review, still poor) were selected for overfitting fine-tuning over 300 epochs.
A total of five fine-tuning runs (2024-04-15 ~ 2024-05-10) were conducted to follow up on the adjustments above:
- Run-1
- Experimental design: base model SDXL, with longer fine-tuned training on the extended dataset pxovela_2778_images_size_1024_v2, 50 epochs -> 100 epochs;
- Evaluation: the overfitting was not as effective as expected; the model reproduces only about 70% of the visual quality of the training data.
- Run-2
- Experimental design: base model SDXL, trained for 200 epochs of overfitting on a hand-picked imgsys-200_size_1024 set;
- Evaluation: the fine-tuned model learns more efficiently on the high-quality open-source dataset, and txt2img reproduces the training data at roughly 85%;
- Supplementary inference test: to demonstrate that the model really learned from the training data, we also generated images with the un-fine-tuned base SDXL using the same prompts and sampling settings; the large gap between those images and the training data confirms the effectiveness of the fine-tuning.
- Run-3
- Experimental design: base model SDXL, with 300 epochs of fine-tuned training on the locally picked monster-hunter_100_size_1024 set;
- Evaluation: the model did learn the basic features of the training images, but only coarsely; detail and noisy-background removal were poor (at this point we still had not realized how fundamentally image quality limits fine-tuning).
- Run-4
- Experimental design: base model SDXL; dataset: a mixed set consisting of (1) the 200-image imgsys high-quality dataset and (2) the 100-image local Monster Hunter dataset, image size 768 x 768, 200 epochs. This experiment mainly compares how fine-tuning on data from different sources (the core difference being image quality) affects model performance;
- Evaluation: the fine-tuned model performs very well on the high-quality open-source data, but poorly on the low-quality local Monster Hunter data.
- Run-5
- Experimental design: to further support the findings of Run-4, this experiment keeps the mixed dataset but trains longer: 600 epochs;
- Evaluation: learning on the open-source data improves further, with more than 90% reproduction; on the local Monster Hunter data the model still struggles to learn and the generated image quality remains poor;
From the bitter results of the above experiments, our team gradually came to realize how important image quality in the training dataset is; building a high-quality Monster Hunter-themed dataset became the next focus.
Midjourney Data Synthesis – Building a High-Quality Monster Hunter Dataset
Here we use the most successful closed-source image-generation model, Midjourney Niji, to enhance our locally collected, low-quality Monster Hunter dataset.
Cleaning of local data
Before using Midjourney for image synthesis, we needed to process the locally collected Monster Hunter data, mainly with Photoshop and image super-resolution tools, to obtain character stand-up images with clean backgrounds. This consumed quite a lot of manpower for a small team. After this step we ran some fine-tuning training on the cleaned dataset, and the results showed that clean data does benefit model fine-tuning.
Start Midjourney Synthetic Testing
As we moved forward with Midjourney image synthesis, we were pleasantly surprised to find that the midjourney.com/imagine web page offers image synthesis directly, freeing us from the constraints of Discord and letting several people synthesize images at the same time, which sped up our data production.
When using the Midjourney Niji-v6 model for image synthesis, we tested a variety of ways to synthesize data and eventually arrived at the prompts and reference images that gave the Monster Hunter look we wanted; we then produced a variety of high-quality synthetic data through image prompting, blending, and other features.
This experience shows that when high-quality, diverse image data is hard to obtain locally, closed-source models can help us build our own dataset, and do so efficiently and effectively.
Attempts to build high-quality Prompts
We generated tags with WD-tagger, used those tags as prompts for the MJ model to generate images, and then ran the tagger again on the MJ output; the two tag sets were strongly similar, and WD-tagger's inference on anime-style (2D) data proved the more effective. Building on existing open-source image-understanding models such as CLIP, WD-1.4-tagger, BLIP, DeepSeek-VL, and LLaVA, we constructed the high-quality prompts needed for Stable Diffusion training datasets.
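As a concrete example of the image-to-text step, here is a minimal captioning sketch with BLIP, using the Salesforce/blip-image-captioning-large checkpoint referenced at the end of this post; the image path is a placeholder, and in practice the raw caption would still be edited or merged with WD-tagger tags.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

image = Image.open("monster_hunter_sample.png").convert("RGB")  # placeholder path
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(out[0], skip_special_tokens=True))       # raw caption to be refined
```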
Subsequent reorientation
As the Midjourney-synthesized dataset grew, manually editing prompts became impractical, so we subsequently switched to labeling with GPT-4V, which basically ensures that the generated prompts are not too low in quality and maintains the quality of the training dataset to a reasonable degree.
Look at the dataset (coming soon)
2. Fine-tuning principles and experimental details
Diffusion Modeling
- Paper: High-Resolution Image Synthesis with Latent Diffusion Models
Components of the Diffusion model
- VAE: autoencoder that decodes latent-space vectors back into image space;
- Tokenizer & Text_encoder: tokenize and encode the text prompt;
- UNet: the denoising network that generates the latent variables;
- Scheduler: controls the noise schedule used when adding and removing noise.
The Autoencoder (AE)
The autoencoder is trained to compress the image into a smaller representation and then reconstruct the image from the compressed version; e.g., a 3 × 512 × 512 image is compressed into a 4 × 64 × 64 latent, so each 3 × 8 × 8 block of pixels in the input image is compressed into four numbers (4 × 1 × 1).
Explanation of the reasons for using an autoencoder:
- We could do diffusion in pixel space – the model would take all the image data as input and produce an output prediction of the same shape – but this means dealing with a large amount of data, so generating high-resolution images is very computationally expensive. One workaround is to diffuse at a low resolution (e.g., 64px) and then train a separate model to upscale the result (DALL-E 2 / Imagen).
- Latent diffusion instead performs diffusion in "latent space", using the AE's compressed representation rather than the raw image. These representations are rich in information yet small enough to be processed on consumer-grade hardware; once we have generated new "images" as latent representations, the autoencoder converts these final latent outputs into actual pixels (a minimal encode/decode sketch follows this list).
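A minimal sketch of the compression described above, using AutoencoderKL from diffusers; the checkpoint name and the random stand-in image are illustrative only.

```python
import torch
from diffusers import AutoencoderKL

# Load only the VAE from an SD 1.5-style checkpoint (illustrative repo id).
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # stand-in for a normalized RGB image in [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor  # -> (1, 4, 64, 64)
    decoded = vae.decode(latents / vae.config.scaling_factor).sample              # -> (1, 3, 512, 512)
print(latents.shape, decoded.shape)
```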
The UNet
The actual diffusion model is a typical UNet that takes the noisy latent (x) and predicts the noise. We use a conditional model that is additionally conditioned on the time step (t) and the text embedding (a.k.a. the encoder hidden states); feeding all of these into the model looks like this:
noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings)["sample"]
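For context, here is a self-contained sketch of how the inputs to that call are typically produced and how the training loss is formed in the standard diffusers text-to-image recipe; the checkpoint name and the random stand-in tensors are illustrative only.

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
noise_scheduler = DDPMScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")

latents = torch.randn(1, 4, 64, 64)        # stand-in for VAE-encoded image latents
text_embeddings = torch.randn(1, 77, 768)  # stand-in for CLIP text-encoder hidden states

noise = torch.randn_like(latents)
t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (1,))
noisy_latents = noise_scheduler.add_noise(latents, noise, t)  # forward diffusion step

noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeddings).sample
loss = F.mse_loss(noise_pred, noise)       # the UNet learns to predict the added noise
```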
When fine-tuning a diffusion model, it is common to adjust only the parameters of the UNet and leave the other components unchanged (a short sketch of this setup follows the list below), for the following reasons:
- UNet is the core component of the diffusion model, responsible for stepwise denoising and image generation, and adjusting UNet can directly affect the model’s ability to generate domain-specific or style-specific images;
- adjusting only UNet can significantly reduce the number of parameters that need to be updated and reduce the demand for computational resources, thus speeding up the fine-tuning process;
- the VAE (Variational Autoencoder), tokenizer, and text encoder have typically learned general-purpose feature representations during pre-training; keeping them frozen preserves the base capabilities of the model and avoids problems such as catastrophic forgetting;
- For task-specific fine-tuning, UNet adjustments are usually sufficient to accommodate new data distributions or styles without changing the entire model architecture.
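A short sketch of what "adjust only the UNet" means in practice with diffusers; the checkpoint name and learning rate are illustrative.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Freeze everything except the UNet.
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)
pipe.unet.requires_grad_(True)

# Only UNet parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(pipe.unet.parameters(), lr=1e-5)
```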
Fine-tuning experiments based on the Monster Hunter dataset
The production of high-quality Monster Hunter data was iterative as we performed further fine-tuning. Phase 1: 100 images were produced initially and later increased to 300, used to validate that high-quality Monster Hunter data is effective for fine-tuning. Phase 2: the data was expanded to 800 and then 1,400 images. Phase 3: to further improve facial and clothing detail in the generated images, the Midjourney Niji model was used to produce 400 additional high-quality half-body images; mixing the full-body stand-up images with these high-quality half-body images let the model improve the detail of characters in subsequently generated images.
After these initial attempts in dataset production and exploration, we had acquired a 1.8k-image high-quality Monster Hunter-style dataset. Early experience also showed that sd-v1-5 is easier to fine-tune and learns better than sd-v2-1 and SDXL, so subsequent fine-tuning was based mainly on sd-v1-5.
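Since most of the effort went into assembling image-caption pairs, it is worth noting how such a dataset is usually laid out for the diffusers text-to-image scripts: an image folder plus a metadata.jsonl with a caption column. A minimal sketch; the folder name and caption field are illustrative.

```python
# monster_hunter_dataset/
#   metadata.jsonl   <- one JSON object per line: {"file_name": "0001.png", "text": "..."}
#   0001.png
#   0002.png
#   ...
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="monster_hunter_dataset", split="train")
print(dataset[0]["image"], dataset[0]["text"])  # PIL image and its caption
```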
The experimental records are as follows
- Run-20240513
- Experimental design: base model sd-v1-5, using 100 high-quality Monster Hunter-style images extended with MJ Niji-v6;
- Model performance evaluation: confirmed that (1) high-quality synthetic data brings a more significant performance improvement to the fine-tuned model, and (2) raw low-quality data is hard to learn quickly and effectively;
- Run-20240514
- Experimental design: base model sd-v1-5, fine-tuned on 1,294 local Monster Hunter images (512 x 512) for 600 epochs, with the fine-tuned checkpoints evaluated at different epoch counts;
- Model performance evaluation: confirmed that (1) the model continues to learn, but later starts to overfit and becomes less effective, and (2) the fine-tuned model follows the prompt better, indicating that it is learning efficiently;
- Run-20240522
- Experimental design: base model sd-v1-5; this experiment uses 500+ Monster Hunter images extended with MJ Niji-v6, of which 300+ high-quality images were selected for fine-tuning;
- Model performance evaluation: compared with the earlier run-36 fine-tuning on raw Monster Hunter data, the quality of generated details improved considerably, demonstrating the effectiveness of using Midjourney to synthesize high-quality data for fine-tuning.
- Run-20240523-mix-datasets
- Experimental design: base model sd-v1-5; this experiment mixes three image datasets of different quality in a single run: (1) the native Monster Hunter dataset, (2) the Monster Hunter dataset generated with Midjourney, and (3) other high-quality open-source datasets;
- Model performance evaluation: the results show that the higher the image quality, the better the learning effect. Takeaway: we should further study using Midjourney to expand the Monster Hunter dataset with high-quality images, so as to improve the model's learning on Monster Hunter data.
- Run-20240531
- Experimental design: base model sd-v1-5; this experiment uses 850 high-quality Monster Hunter images (768 x 768) synthesized with the MJ Niji-v6 model, fine-tuned for a total of 1,000 epochs with batch size 16, about 52,000 fine-tuning steps in total;
- Model performance evaluation: the fine-tuned model learned the training dataset well, reaching new heights in character poses, image colors, and clothing styles, and the backgrounds of the generated images are also cleaner;
- Improvements
- Further improvement in the detail generation of clothing, weapons, and characters;
- Further improvement in the model's prompt-following performance;
- Further reduction of cluttered background noise;
- Drawbacks
- Facial details need to be further improved and optimized;
- The generation quality of male characters is poorer than that of female characters;
- Run-20240602
- Experimental design: base model sd-v1-5, using the same dataset as run-20240531, with fewer fine-tuning steps (600 epochs, 22,000 steps) and larger batch size and learning rate;
- Model performance evaluation
- Improvements
- Compared with the run-20240531 model, the run-20240602 model follows prompts better and is more stable;
- The background noise is further reduced;
- Drawbacks
- Details are learned less well than in the run-20240531 model;
- Run-20240614-mix-datasets
- Experimental design: base model sd-v1-5; the dataset mixes 1.4k full-body stand-up images with 400+ HD half-body images, fine-tuned for 600 epochs; the goal was to verify whether HD half-body images, which carry more detail, can improve the fine-tuned model's generation of full-body stand-ups;
- Model performance evaluation: the fine-tuned model learned well overall, but full-body stand-up performance still needs work; the model does not control well whether it generates a half-body or a full-body image, and the presence of mixed data makes the expected output less controllable;
- Run-20240616-mix-datasets
- Experimental design: continuing with the training data and base model from the Run-20240614-mix-datasets experiment, hyperparameters such as the learning rate were adjusted to see whether they have a large impact on the final performance of the fine-tuned model;
- Model performance evaluation: lowering the learning rate improves the model's ability to learn details and gives better results on half-body images, but the poor controllability between full-body stand-ups and half-body images remains, albeit with some improvement;
The following conclusions can be drawn from the above fine-tuning experiments:
- Image quality and diversity remain the key factors limiting fine-tuning performance. For localized fine-tuning, you must have a reliable way to collect or produce a large number of high-quality images that match the final requirement; only on top of high-quality, diverse data can model fine-tuning move forward;
- Before committing to a fine-tuning run, work step by step and verify your ideas with small, quick experiments. For example, to check whether MJ-synthesized images are useful for fine-tuning, you do not need to synthesize more than 1,000 images first: with a small team, hand-producing 1,000 images can take about a week, and not all of them will be usable, since selection and filtering are still needed and a 50% pass rate would already be very good. Instead, produce 300 images first, select 100 high-quality ones for an overfitting fine-tuning run, and check whether the fine-tuned model does well on this small dataset; if it does not, there is a bug in the setup or the code that needs further troubleshooting;
- When designing experiments, think clearly about the purpose of each experiment. If you run experiments blindly without a clear goal, you may neither understand the results nor draw accurate conclusions, and gain nothing; with a clear purpose you go looking for the answer to a specific question, which makes the experiment itself worthwhile;
Notes on fine-tuning hyperparameter settings
For the fine-tuning experiments themselves, the hyperparameters were set as follows (a small sketch of the resulting step counts follows this list):
- learning rate
- A learning rate of `1e-5` fits the data faster and the loss curve drops more quickly, but attention is needed because image details may not be learned well enough;
- A learning rate of `3e-6` takes longer to train and the loss curve drops more slowly, but it generally learns the details of the training images better; the final effect still needs careful evaluation;
- A learning rate of `5e-7` is lower still, takes even longer to fine-tune, and the loss curve drops slowly; in practice, direct evaluation is needed to judge the final effect.
- batch size
- For batch size, direct evaluation is needed to determine the final effect. In practice, consider the local GPU situation and set it as high as memory allows to make the most of GPU acceleration;
- num_train_epochs
- Diffusion model fine-tuning differs greatly from LLM fine-tuning, and experience cannot simply be transferred between them; for the diffusion model to learn the image data well, many more epochs are typically needed (500 ~ 1000, sometimes even more);
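The sketch below puts illustrative values from the ranges above into one place and shows how epochs, batch size, and dataset size translate into optimizer steps; all numbers are hypothetical, not the exact settings of any single run.

```python
# Illustrative settings in the ranges discussed above (not the values of any specific run).
config = dict(
    learning_rate=1e-5,     # 3e-6 or 5e-7 for slower runs that favor fine detail
    train_batch_size=16,    # set as high as GPU memory allows
    num_train_epochs=600,   # diffusion fine-tuning typically needs hundreds of epochs
    resolution=768,
)

num_images = 850                                                 # hypothetical dataset size
steps_per_epoch = -(-num_images // config["train_batch_size"])   # ceiling division
total_steps = steps_per_epoch * config["num_train_epochs"]
# Rough estimate only: gradient accumulation and multi-GPU data parallelism change this.
print(f"{steps_per_epoch} steps/epoch, ~{total_steps} optimizer steps total")
```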
To achieve the best results when fine-tuning a diffusion model, you usually have to run more experiments and verify more of your own ideas in order to find the best fine-tuning recipe; this requires accumulating and exchanging experience rather than directly copying other people's practices.
3. Model Evaluation Methodology and Performance Improvement Record
Diffusion model evaluation
- Image quality assessment
- Quantitative methods
- FID (Frechet Inception Distance)
- IS (Inception Score)
- SSIM (Structural Similarity Index)
- PSNR (Peak Signal-to-Noise Ratio)
- Qualitative methods
- Manual evaluation
- A/B testing
- Status analysis
- FID and IS are still the most common metrics, but they have limitations;
- Researchers are exploring better evaluation metrics such as Clean FID and Improved Precision and Recall;
- Text-image consistency assessment
- Methods
- CLIP Score (a minimal scoring sketch follows this list)
- CLIP-I
- Current status
- These methods assess the consistency of generated images with their text descriptions reasonably well, but they still have limitations and struggle with complex semantics;
- Diversity assessment
- Methods
- LPIPS (Learned Perceptual Image Patch Similarity)
- Diversity Score
- Status
- Diversity assessment is important for measuring the creativity and generalization ability of a model, but existing methods cannot fully capture diversity as humans perceive it;
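Of these metrics, CLIP Score is the easiest to reproduce locally. A minimal sketch using the openai/clip-vit-base-patch32 checkpoint (chosen here purely for illustration; the image path and prompt are placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated_sample.png").convert("RGB")  # placeholder path
prompt = "a monster hunter style character in ornate armor, full body, clean background"

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Cosine similarity between the image and text embeddings (higher = better alignment).
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
print((image_emb @ text_emb.T).item())
```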
In actual fine-tuning experiments of diffusion models, researchers usually adopt some relatively direct and practical methods to evaluate the performance of the fine-tuned models, and manual visual evaluation is still an indispensable method. Researchers and developers will directly compare the images generated before and after fine-tuning, and observe the changes in image quality, detail performance, and stylistic consistency.
Overall, practical evaluations usually combine multiple methods, considering both quantitative metrics and emphasizing qualitative assessments. The choice of which assessment methods to use often depends on the specific application scenario, available resources, and research objectives. It is important to ensure that the assessment methods provide a comprehensive picture of the model’s performance improvement on the target task.
Evaluation methodology for the fine-tuning model used in this project
Given the limitations of the existing assessment methods, this project mainly relies on manual assessment, which, despite the time cost, ensures the objectivity and accuracy of the evaluation.
Open Source inference Framework
The framework provides a rich set of sampling strategies, which lets us evaluate the fine-tuned model fairly comprehensively and basically ensures the validity of the evaluation. For each fine-tuned model, we test several commonly used samplers and vary the number of sampling steps and the CFG Scale to probe how well the model has learned the training data. When picking generated images for the record, we use a selection rate of roughly 1/2 or 1/3 (1/3 meaning one prompt generates three images at a time and the best one is kept).
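A sketch of the kind of inference sweep used for this manual evaluation; the checkpoint path, prompt, and parameter values are illustrative, but the pattern (several CFG values, a fixed step count, three images per prompt) matches the selection procedure described above.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/fine-tuned-sd-1-5",   # placeholder for the fine-tuned checkpoint
    torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

prompt = "a monster hunter style character, full body, ornate armor, clean background"
for cfg in (5.0, 7.5, 9.0):
    images = pipe(prompt, num_inference_steps=30, guidance_scale=cfg,
                  num_images_per_prompt=3).images   # 3 images per prompt, keep the best
    for i, img in enumerate(images):
        img.save(f"eval_cfg{cfg}_{i}.png")
```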
How to better evaluate model performance?
I also follow some researchers working on diffusion model fine-tuning on Twitter, including @cloneofsimo and @meltenx, and learn a lot of useful information from what they share:
- @meltenx: the Proteus model, a diffusion fine-tune, was reportedly trained on over 400K GPT-4V-annotated images plus 10K DPO pairs. Proteus is a sophisticated enhancement of OpenDalleV1.1, leveraging its core functionality to deliver superior results, with key advances in prompt responsiveness and creativity. To achieve this, it was fine-tuned on approximately 220,000 GPT-4V-captioned images from copyright-free stock photos (including some anime), which were then normalized. In addition, DPO (Direct Preference Optimization) was applied using 10,000 carefully selected pairs of high-quality AI-generated images. Composition of the high-quality dataset:
- Fine-grained image data
- Order of magnitude: 220k
- High-quality prompts
- Labeled with GPT-4V
- @cloneofsimo: Lavenderflow 6.8B just reached DALL-E 3 level on GenEval, but I suspect this was a lucky run. Still, it's getting somewhere close to DALL-E 3 (and SD3-8B), which makes me happy. Going to do Parti-prompts GPT-V eval soon! (GenEval is not a solid nor comprehensive benchmark, so I don't want to overhype.) GenEval task breakdown:
- single object
- colors
- color attr
- counting
- position
- two object
Given the current state of diffusion technology, methods for evaluating model performance are still being explored; for now, the most realistic and objective way to evaluate a fine-tuned model is still human judgment, directly perceiving the effect of fine-tuning through observation.
4. Some lessons learned, shortcomings and regrets
Lessons learned
When fine-tuning the diffusion model, my first thought was to look for reference datasets on Hugging Face Datasets, but I found very little data in this area. Although there are a large number of fine-tuned models on CivitAI, dataset sharing is also rare. This indirectly shows that the data itself is of great value. Therefore, collecting high-quality and diverse image data is the most important thing for fine-tuning a diffusion model. Never ignore the data: "look at your dataset" is not just a joke.
As time went on, after entering May, fine-tuning experience and open-source data gradually appeared on Twitter, which brought us a lot of valuable information and reaffirmed that our fine-tuning approach was sound. However, for our local need of fine-tuning a Monster Hunter-style diffusion model, although we collected a lot of image data (2k+), the quality and repetitiveness of that data deeply dragged down the performance gains in subsequent fine-tuning experiments.
The situation improved after introducing the Midjourney model to improve the quality of the existing data. However, images synthesized with Midjourney are themselves constrained by the training data of the Niji-v6 model, so it is difficult to keep improving the Monster Hunter data, and this becomes the ceiling of our fine-tuning performance. If we can find ways to further improve the local Monster Hunter dataset, the performance of the fine-tuned model can still keep improving; this will have to be left for future exploration.
Some deficiencies
The fine-tuning practice of diffusion modeling in our team has very significant limitations in the following ways:
- Size limitation of the training dataset: the final Monster Hunter dataset produced by our team contains 1.8k character stand-up images, far smaller than the datasets used by other fine-tuners on Twitter;
- Quality limitation of the training images: our data relies on the Midjourney Niji-v6 model for img2img synthesis and quality enhancement of the local dataset, but the quality is capped by the capability of the Niji-v6 model itself; objectively, the quality of the final training dataset is medium and cannot be considered a very high-quality image dataset;
- Limitation of prompt quality: because the team is small and our accumulated expertise in the diffusion direction is not deep, our practical ability in prompt engineering and quality improvement is average. Labeling of the later dataset relied on GPT-4V, which makes prompt quality hard to assess; compared with manually editing and adjusting prompts, the quality could be improved further, but due to limited manpower this was not pursued.
The code for these experiments was adopted from Hugging Face's diffusers library, and no major modifications were made, since the experiments focused mainly on optimizing the parameters of the UNet in the diffusion model; this part should not be the bottleneck of the practice.
Most of this work was fine-tuned on 768 x 768 images, mainly due to hardware constraints; and since the company's H800 server was shared, throughput varied a lot at times, which somewhat limited how quickly we could train and evaluate.
5. Reference
Hugging Face Datasets
- pxovela/MJ_Discord_Datasets
- pxovela/cosmopolitan
Diffusion model fine-tuning experience reference
- Medium: Fine-Tuning Stable Diffusion With Validation
- Medium: The LR Finder Method for Stable Diffusion
- Blog: Find Optimal Learning Rates for Stable Diffusion Fine-tunes
- Cosmopolitan: Full Fine-tuning Guide for SD1.5 General Purpose Models
Open source fine-tuning framework reference
- Hugging Face Diffusers fine-tuning library documentation, GitHub code repository;
- Hugging Face Blogs – Diffusion models fine-tune:
- Hugging Face Stable diffusion : Dataset create;
- GitHub :
- Colab:
Image-to-labels
- WD-tagger-v1.4
- Github:
- Blip / blip2
- GitHub: salesforce / LAVIS / projects / blip2
- Hugging Face Space: tonyassi/blip-image-captioning-large
- Hugging Face Models: Salesforce/blip-image-captioning-large
- LLaVA
- GitHub: camenduru / LLaVA-colab
- Colab