# CogView3 & CogView-3Plus Fine-Tuning Code Walkthrough

[Read this in Chinese](./README_zh.md)

<div align="center">
<img src=resources/logo.svg width="50%"/>
</div>

<p align="center">
Experience the CogView3-Plus-3B model online on <a href="https://huggingface.co/spaces/THUDM-HF-SPACE/CogView3-Plus-3B-Space" target="_blank">🤗 Huggingface Space</a>
</p>
<p align="center">
📚 Check out the <a href="https://arxiv.org/abs/2403.05121" target="_blank">paper</a>
</p>
<p align="center">
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a>
</p>
<p align="center">
📍 Visit <a href="https://chatglm.cn/main/gdetail/65a232c082ff90a2ad2f15e2?fr=osm_cogvideox&lang=zh">Qingyan</a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial image generation models.
</p>

## Project Updates

- 🔥🔥 ```2024/10/13```: We have adapted **CogView-3Plus-3B** for the [diffusers](https://github.com/huggingface/diffusers) library and open-sourced it. You can [try it online](https://huggingface.co/spaces/THUDM-HF-SPACE/CogView3-Plus-3B-Space).
- 🔥 ```2024/9/29```: We have open-sourced **CogView3** and **CogView-3Plus-3B**. **CogView3** is a cascaded text-to-image system built on a relay diffusion framework. **CogView-3Plus** is a family of newly developed text-to-image models based on Diffusion Transformers.

## Model Introduction

CogView-3Plus builds on CogView3 (ECCV'24) and adopts the latest DiT framework for further overall performance
improvements. It uses Zero-SNR diffusion noise scheduling and a joint text-image attention mechanism.
Compared with the commonly used MMDiT structure, this design effectively reduces training and inference costs while
preserving the model's core capabilities. CogView-3Plus uses a VAE with a latent dimension of 16.
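
As a quick sanity check of the architecture described above, the key components can be inspected from the released diffusers weights. This is an illustrative sketch that assumes the standard diffusers config attributes; it is not required for normal use:

```python
import torch
from diffusers import CogView3PlusPipeline

# Load the released weights (BF16 is the recommended precision for CogView3-Plus-3B).
pipe = CogView3PlusPipeline.from_pretrained("THUDM/CogView3-Plus-3B", torch_dtype=torch.bfloat16)

# The VAE encodes images into a 16-channel latent space.
print(pipe.vae.config.latent_channels)   # expected: 16

# The denoiser is a Diffusion Transformer (DiT), not a UNet.
print(type(pipe.transformer).__name__)
```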

The table below lists the text-to-image models we currently offer, along with their basic information.

<table style="border-collapse: collapse; width: 100%;">
  <tr>
    <th style="text-align: center;">Model Name</th>
    <th style="text-align: center;">CogView3-Base-3B</th>
    <th style="text-align: center;">CogView3-Base-3B-distill</th>
    <th style="text-align: center;">CogView3-Plus-3B</th>
  </tr>
  <tr>
    <td style="text-align: center;">Model Description</td>
    <td style="text-align: center;">The base and relay-stage models of CogView3, supporting 512x512 text-to-image generation and 2x super-resolution generation.</td>
    <td style="text-align: center;">The distilled version of CogView3, sampling with 4 and 1 steps in the two stages (or 8 and 2 steps).</td>
    <td style="text-align: center;">The DiT-based image generation model, supporting resolutions from 512 to 2048.</td>
  </tr>
  <tr>
    <td style="text-align: center;">Resolution</td>
    <td colspan="2" style="text-align: center;">512 * 512</td>
    <td style="text-align: center;">
      512 &le; H, W &le; 2048 <br>
      H * W &le; 2<sup>21</sup> <br>
      H and W must be multiples of 32
    </td>
  </tr>
  <tr>
    <td style="text-align: center;">Inference Precision</td>
    <td colspan="2" style="text-align: center;"><b>FP16 (recommended)</b>, BF16, FP32</td>
    <td style="text-align: center;"><b>BF16* (recommended)</b>, FP16, FP32</td>
  </tr>
  <tr>
    <td style="text-align: center;">Memory Usage (bs = 4)</td>
    <td style="text-align: center;">17G</td>
    <td style="text-align: center;">64G</td>
    <td style="text-align: center;">30G (2048 * 2048) <br> 20G (1024 * 1024)</td>
  </tr>
  <tr>
    <td style="text-align: center;">Prompt Language</td>
    <td colspan="3" style="text-align: center;">English*</td>
  </tr>
  <tr>
    <td style="text-align: center;">Maximum Prompt Length</td>
    <td colspan="2" style="text-align: center;">225 Tokens</td>
    <td style="text-align: center;">224 Tokens</td>
  </tr>
  <tr>
    <td style="text-align: center;">Download Link (SAT)</td>
    <td colspan="3" style="text-align: center;"><a href="./sat/README.md">SAT</a></td>
  </tr>
  <tr>
    <td style="text-align: center;">Download Link (Diffusers)</td>
    <td colspan="2" style="text-align: center;">Not Adapted</td>
    <td style="text-align: center;">
      <a href="https://huggingface.co/THUDM/CogView3-Plus-3B">🤗 HuggingFace</a><br>
      <a href="https://modelscope.cn/models/ZhipuAI/CogView3-Plus-3B">🤖 ModelScope</a><br>
      <a href="https://wisemodel.cn/models/ZhipuAI/CogView3-Plus-3B">🟣 WiseModel</a>
    </td>
  </tr>
</table>
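
The CogView3-Plus-3B resolution constraints above can be checked before calling the pipeline. Below is a small illustrative helper (not part of the repository) that encodes exactly those three conditions:

```python
def check_cogview3_plus_resolution(height: int, width: int) -> None:
    """Validate a target resolution against the CogView3-Plus-3B constraints:
    512 <= H, W <= 2048, H * W <= 2**21, and H, W both multiples of 32."""
    if not (512 <= height <= 2048 and 512 <= width <= 2048):
        raise ValueError("height and width must each be between 512 and 2048")
    if height * width > 2 ** 21:
        raise ValueError("height * width must not exceed 2**21 (2,097,152)")
    if height % 32 != 0 or width % 32 != 0:
        raise ValueError("height and width must both be multiples of 32")


check_cogview3_plus_resolution(1024, 1024)  # OK
check_cogview3_plus_resolution(1536, 1024)  # OK: 1,572,864 <= 2**21
```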

**Data Explanation**

+ All inference tests were run on a single A100 GPU with a batch size of 4,
  with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` set to save memory (see the snippet after this list).
+ The models only support English input; prompts in other languages can be translated into English while being refined by a large language model.
+ This test environment uses the `SAT` framework. Many optimizations are not yet complete, and we will work with
  the community to create a `diffusers` version of the model. Once the `diffusers` repository adds
  support, we will test with `diffusers`. The release is expected in November 2024.

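For reference, the allocator option above is just an environment variable; a minimal way to set it before running any of the inference scripts (the script name below is a placeholder):

```shell
# Reduce CUDA memory fragmentation via PyTorch's expandable-segments allocator.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python your_inference_script.py
```
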
## Quick Start

### Prompt Optimization

Although the CogView3 series models are trained on long image descriptions, we highly recommend rewriting prompts with a
large language model (LLM) before text-to-image generation, as this significantly improves generation quality.

We provide an [example script](prompt_optimize.py) and suggest running it to refine the prompt:

```shell
python prompt_optimize.py --api_key "Zhipu AI API Key" --prompt "your prompt" --base_url "https://open.bigmodel.cn/api/paas/v4" --model "glm-4-plus"
```

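If you prefer to call the API directly instead of running the script, the refinement step is a single chat-completion request. Below is a minimal sketch assuming the endpoint is OpenAI-compatible (as the `base_url` above suggests) and the `openai` Python package is installed; the system prompt is illustrative and not the one used in `prompt_optimize.py`:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your Zhipu AI API key",                  # placeholder
    base_url="https://open.bigmodel.cn/api/paas/v4",
)

def refine_prompt(prompt: str) -> str:
    """Expand a short user prompt into a detailed English image description."""
    response = client.chat.completions.create(
        model="glm-4-plus",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's prompt as one richly detailed English image description."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

print(refine_prompt("a red sports car parked by the sea"))
```
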
### Inference Model (Diffusers)

First, ensure the `diffusers` library is installed **from source**:

```shell
pip install git+https://github.com/huggingface/diffusers.git
```

Then, run the following code:

```python
from diffusers import CogView3PlusPipeline
import torch

# BF16 is the recommended inference precision for CogView3-Plus-3B.
pipe = CogView3PlusPipeline.from_pretrained("THUDM/CogView3-Plus-3B", torch_dtype=torch.bfloat16).to("cuda")

# Optional: reduce GPU memory usage.
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."
image = pipe(
    prompt=prompt,
    guidance_scale=7.0,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview3.png")
```
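
To make runs reproducible, you can pass a seeded `torch.Generator` to the pipeline. This is a small variation on the example above (it reuses `pipe` and `prompt`) and relies only on standard diffusers pipeline arguments:

```python
# Fix the random seed so repeated runs produce the same images.
generator = torch.Generator(device="cuda").manual_seed(42)

images = pipe(
    prompt=prompt,
    guidance_scale=7.0,
    num_images_per_prompt=2,   # two candidates per prompt
    num_inference_steps=50,
    width=1024,
    height=1024,
    generator=generator,
).images

for i, img in enumerate(images):
    img.save(f"cogview3_seed42_{i}.png")
```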

For more inference code, please refer to [inference/cli_demo.py](inference/cli_demo.py). The same folder also contains a simple web UI built with Gradio.

### Inference Model (SAT)

Please check the [sat](sat/README.md) tutorial for step-by-step instructions on model inference.

### Open Source Plan

Since the project is in its early stages, we are working on the following:

+ [ ] Fine-tuning the SAT version of CogView3-Plus-3B, including SFT and LoRA fine-tuning
+ [X] Inference with the Diffusers library version of the CogView3-Plus-3B model
+ [ ] Fine-tuning the Diffusers library version of the CogView3-Plus-3B model
+ [ ] Related work for the CogView3-Plus-3B model, including ControlNet and other tasks

## CogView3 (ECCV'24)

Official repository for the paper: [CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://arxiv.org/abs/2403.05121)

CogView3 is a novel text-to-image generation system based on relay diffusion. It breaks the generation of
high-resolution images into multiple stages: in the relay super-resolution stage, Gaussian noise is added to the
low-resolution generation results, and the diffusion process restarts from these noisy images. Our results show that
CogView3 outperforms SDXL with a win rate of 77.0%. Additionally, through progressive distillation of the diffusion
model, CogView3 can generate comparable results while requiring only 1/10 of SDXL's inference time.
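
Conceptually, the relay stage does not start from pure noise; it starts denoising from a noised, upsampled copy of the base-stage output. The sketch below illustrates only that starting point, in simplified pixel-space notation with a generic variance-preserving noise mix; it is not the actual CogView3 implementation, which operates on latents with its own relay noise schedule:

```python
import torch
import torch.nn.functional as F

def relay_start(low_res: torch.Tensor, t_relay: float) -> torch.Tensor:
    """Build the starting point of the relay super-resolution stage.

    low_res: base-stage output, shape (B, C, H, W), values roughly in [-1, 1].
    t_relay: intermediate noise level in (0, 1); the relay stage denoises from
             this level down to 0 instead of starting from pure noise.
    """
    # 1. Upsample the low-resolution result to the target resolution (2x here).
    upsampled = F.interpolate(low_res, scale_factor=2, mode="bilinear", align_corners=False)

    # 2. Partially re-noise it to an intermediate diffusion timestep
    #    (variance-preserving mix of signal and Gaussian noise).
    alpha = 1.0 - t_relay
    noisy = (alpha ** 0.5) * upsampled + ((1.0 - alpha) ** 0.5) * torch.randn_like(upsampled)

    # The relay diffusion model then denoises `noisy`, conditioned on the text
    # prompt, to add high-resolution detail.
    return noisy
```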

Comparison results from human evaluations:

## Citation

🌟 If you find our work helpful, feel free to cite our paper and leave a star.

```bibtex
@article{zheng2024cogview3,
  title={Cogview3: Finer and faster text-to-image generation via relay diffusion},
  author={Zheng, Wendi and Teng, Jiayan and Yang, Zhuoyi and Wang, Weihan and Chen, Jidong and Gu, Xiaotao and Dong, Yuxiao and Ding, Ming and Tang, Jie},
  journal={arXiv preprint arXiv:2403.05121},
  year={2024}
}
```

We welcome your contributions! Click [here](resources/contribute.md) for more information.

## Model License

This codebase is released under the [Apache 2.0 License](LICENSE).

The CogView3-Base, CogView3-Relay, and CogView3-Plus models (including the UNet module, Transformers module, and VAE
module) are released under the [Apache 2.0 License](LICENSE).