
Commit d071d8b
committed 2024-10-22 14:28:46
1 parent 7739d40

52 files changed, 11,939 insertions(+), 0 deletions(-)

# Raise a Valuable PR

## Caution:

Users should keep the following points in mind when submitting PRs:

1. Ensure that your code meets the requirements in the [specification](../../resources/contribute.md) ([Chinese version](../../resources/contribute_zh.md)).
2. Keep each PR focused: if you have multiple ideas or optimizations, split them into separate PRs.

## PRs That Should Not Be Proposed

A PR may be closed or rejected if it falls into any of the following categories:

1. It does not describe the proposed improvement.
2. It combines multiple unrelated issues in one PR.
3. It largely duplicates an already existing PR.

# Check Your PR

- [ ] Have you read the Contributor Guidelines, Pull Request section?
- [ ] Has this been discussed/approved via a GitHub issue or forum? If so, add a link.
- [ ] Did you make sure you updated the documentation with your changes? Here are the Documentation Guidelines, and here are the Documentation Formatting Tips.
- [ ] Did you write any required new tests?
- [ ] Does your PR address only one issue?

docs/cogview3-finetune/README.md

# CogView3 & CogView-3Plus Fine-tuning Code Analysis

[Read this in Chinese](./README_zh.md)

<div align="center">
<img src=resources/logo.svg width="50%"/>
</div>

<p align="center">
Experience the CogView3-Plus-3B model online on <a href="https://huggingface.co/spaces/THUDM-HF-SPACE/CogView3-Plus-3B-Space" target="_blank">🤗 Hugging Face Space</a>
</p>
<p align="center">
📚 Check out the <a href="https://arxiv.org/abs/2403.05121" target="_blank">paper</a>
</p>
<p align="center">
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a>
</p>
<p align="center">
📍 Visit <a href="https://chatglm.cn/main/gdetail/65a232c082ff90a2ad2f15e2?fr=osm_cogvideox&lang=zh">Qingyan</a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> for larger-scale commercial video generation models.
</p>
## Project Updates

- 🔥🔥 ```2024/10/13```: We have adapted and open-sourced the **CogView-3Plus-3B** model in the [diffusers](https://github.com/huggingface/diffusers) version. You can [experience it online](https://huggingface.co/spaces/THUDM-HF-SPACE/CogView3-Plus-3B-Space).
- 🔥 ```2024/9/29```: We have open-sourced **CogView3** and **CogView-3Plus-3B**. **CogView3** is a text-to-image system based on cascaded diffusion, utilizing a relay diffusion framework. **CogView-3Plus** is a series of newly developed text-to-image models based on Diffusion Transformers.
## Model Introduction

CogView-3-Plus builds upon CogView3 (ECCV'24) by introducing the latest DiT framework for further overall performance improvements. CogView-3-Plus uses Zero-SNR diffusion noise scheduling and incorporates a joint text-image attention mechanism. Compared to the commonly used MMDiT structure, it effectively reduces training and inference costs while maintaining the model's core capabilities. CogView-3Plus uses a VAE with a latent dimension of 16.
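For intuition, Zero-SNR scheduling rescales the noise schedule so that the final training timestep carries zero signal (pure noise). The sketch below shows the standard zero-terminal-SNR rescaling from Lin et al., "Common Diffusion Noise Schedules and Sample Steps Are Flawed", which `diffusers` exposes on several schedulers as `rescale_betas_zero_snr`; it illustrates the idea and is not CogView's own code:

```python
import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a beta schedule so the last timestep has zero SNR (pure noise)."""
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    sqrt_ac = alphas_cumprod.sqrt()

    first, last = sqrt_ac[0].clone(), sqrt_ac[-1].clone()
    sqrt_ac -= last                     # shift: terminal value becomes 0
    sqrt_ac *= first / (first - last)   # rescale: initial value is preserved

    alphas_cumprod = sqrt_ac ** 2
    alphas = alphas_cumprod[1:] / alphas_cumprod[:-1]  # undo the cumulative product
    alphas = torch.cat([alphas_cumprod[0:1], alphas])
    return 1.0 - alphas

# Example: rescale a linear schedule with 1000 steps
betas = torch.linspace(1e-4, 0.02, 1000)
betas_zero_snr = rescale_zero_terminal_snr(betas)
```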
The table below lists the text-to-image models we currently offer, along with their basic information.
<table style="border-collapse: collapse; width: 100%;">
  <tr>
    <th style="text-align: center;">Model Name</th>
    <th style="text-align: center;">CogView3-Base-3B</th>
    <th style="text-align: center;">CogView3-Base-3B-distill</th>
    <th style="text-align: center;">CogView3-Plus-3B</th>
  </tr>
  <tr>
    <td style="text-align: center;">Model Description</td>
    <td style="text-align: center;">The base and relay stage models of CogView3, supporting 512x512 text-to-image generation and 2x super-resolution generation.</td>
    <td style="text-align: center;">The distilled version of CogView3, with 4- and 1-step sampling in the two stages (or 8 and 2 steps).</td>
    <td style="text-align: center;">The DiT-based image generation model, supporting resolutions from 512 to 2048.</td>
  </tr>
  <tr>
    <td style="text-align: center;">Resolution</td>
    <td colspan="2" style="text-align: center;">512 × 512</td>
    <td style="text-align: center;">
      512 ≤ H, W ≤ 2048 <br>
      H × W ≤ 2<sup>21</sup> <br>
      H and W must be divisible by 32
    </td>
  </tr>
  <tr>
    <td style="text-align: center;">Inference Precision</td>
    <td colspan="2" style="text-align: center;"><b>FP16 (recommended)</b>, BF16, FP32</td>
    <td style="text-align: center;"><b>BF16* (recommended)</b>, FP16, FP32</td>
  </tr>
  <tr>
    <td style="text-align: center;">Memory Usage (batch size = 4)</td>
    <td style="text-align: center;">17 GB</td>
    <td style="text-align: center;">64 GB</td>
    <td style="text-align: center;">30 GB (2048 × 2048) <br> 20 GB (1024 × 1024)</td>
  </tr>
  <tr>
    <td style="text-align: center;">Prompt Language</td>
    <td colspan="3" style="text-align: center;">English*</td>
  </tr>
  <tr>
    <td style="text-align: center;">Maximum Prompt Length</td>
    <td colspan="2" style="text-align: center;">225 Tokens</td>
    <td style="text-align: center;">224 Tokens</td>
  </tr>
  <tr>
    <td style="text-align: center;">Download Link (SAT)</td>
    <td colspan="3" style="text-align: center;"><a href="./sat/README.md">SAT</a></td>
  </tr>
  <tr>
    <td style="text-align: center;">Download Link (Diffusers)</td>
    <td colspan="2" style="text-align: center;">Not Adapted</td>
    <td style="text-align: center;">
      <a href="https://huggingface.co/THUDM/CogView3-Plus-3B">🤗 HuggingFace</a><br>
      <a href="https://modelscope.cn/models/ZhipuAI/CogView3-Plus-3B">🤖 ModelScope</a><br>
      <a href="https://wisemodel.cn/models/ZhipuAI/CogView3-Plus-3B">🟣 WiseModel</a>
    </td>
  </tr>
</table>
**Data Explanation**

+ All inference tests were conducted on a single A100 GPU with a batch size of 4, using `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to save memory.
+ The models only support English input; prompts in other languages can be translated into English when refining them with large language models.
+ This test environment uses the `SAT` framework. Many optimization points are not yet complete, and we will work with the community to create a version of the model for the `diffusers` library. Once the `diffusers` repository is supported, we will test using `diffusers`. The release is expected in November 2024.
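For example, the allocator setting above is passed as an environment variable when launching an inference script, such as the CLI demo shipped in this repository (the demo's own arguments are omitted here):

```shell
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python inference/cli_demo.py
```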
## Quick Start

### Prompt Optimization

Although the CogView3 series models are trained on long image descriptions, we highly recommend rewriting the prompt with a large language model (LLM) before text-to-image generation, as this significantly improves generation quality.

We provide an [example script](prompt_optimize.py) and suggest running it to refine the prompt:

```shell
python prompt_optimize.py --api_key "Zhipu AI API Key" --prompt {your prompt} --base_url "https://open.bigmodel.cn/api/paas/v4" --model "glm-4-plus"
```
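For a sense of what such a rewriting step involves, here is a minimal sketch of a comparable call against the OpenAI-compatible endpoint at the `base_url` above; the system prompt wording is illustrative, not the exact text used by `prompt_optimize.py`:

```python
from openai import OpenAI

client = OpenAI(
    api_key="Zhipu AI API Key",  # the same key passed via --api_key
    base_url="https://open.bigmodel.cn/api/paas/v4",
)

# Ask the LLM to expand a short prompt into a detailed image description.
response = client.chat.completions.create(
    model="glm-4-plus",
    messages=[
        {"role": "system", "content": "Rewrite the user's input into a single, "
                                      "detailed English image description for a "
                                      "text-to-image model."},
        {"role": "user", "content": "a red sports car by the sea"},
    ],
)
print(response.choices[0].message.content)
```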
### Inference Model (Diffusers)

First, ensure the `diffusers` library is installed **from source**.

```
pip install git+https://github.com/huggingface/diffusers.git
```

Then, run the following code:

```python
from diffusers import CogView3PlusPipeline
import torch

# BF16 is the recommended inference precision for CogView3-Plus-3B (see the table above).
pipe = CogView3PlusPipeline.from_pretrained("THUDM/CogView3-Plus-3B", torch_dtype=torch.bfloat16)

# Optional memory savings: offload submodules to CPU between uses and
# slice/tile the VAE. `enable_model_cpu_offload` handles device placement,
# so no explicit `.to("cuda")` call is needed when it is enabled.
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."
image = pipe(
    prompt=prompt,
    guidance_scale=7.0,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview3.png")
```
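Before calling the pipeline at a non-default resolution, you can check the CogView3-Plus-3B constraints listed in the table above. This small helper is illustrative and not part of the repository:

```python
def check_resolution(width: int, height: int) -> None:
    """Validate the CogView3-Plus-3B resolution constraints from the model table."""
    if not (512 <= width <= 2048 and 512 <= height <= 2048):
        raise ValueError("width and height must each lie in [512, 2048]")
    if width * height > 2**21:
        raise ValueError("width * height must not exceed 2**21 (2,097,152 pixels)")
    if width % 32 or height % 32:
        raise ValueError("width and height must be divisible by 32")

check_resolution(1024, 1024)  # OK
check_resolution(2048, 1024)  # OK: 2048 * 1024 equals 2**21 exactly
```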
For more inference code, see [inference](inference/cli_demo.py). That folder also contains a simple web UI built with Gradio.
### Inference Model (SAT)

Please check the [sat](sat/README.md) tutorial for step-by-step instructions on model inference.
### Open Source Plan

Since the project is in its early stages, we are working on the following:

+ [ ] Fine-tuning the SAT version of CogView3-Plus-3B, including SFT and LoRA fine-tuning
+ [X] Inference with the Diffusers library version of the CogView3-Plus-3B model
+ [ ] Fine-tuning the Diffusers library version of the CogView3-Plus-3B model
+ [ ] Related work for the CogView3-Plus-3B model, including ControlNet and other tasks
## CogView3 (ECCV'24)

Official paper repository: [CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://arxiv.org/abs/2403.05121)

CogView3 is a novel text-to-image generation system using relay diffusion. It breaks the generation of high-resolution images into multiple stages: in the relay super-resolution stage, Gaussian noise is added to the low-resolution generation results, and the diffusion process starts from these noisy images. Our results show that CogView3 outperforms SDXL with a win rate of 77.0%. Additionally, through progressive distillation of the diffusion model, CogView3 can generate comparable results while reducing inference time to only 1/10th of SDXL's.
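To make the relay idea concrete, the sketch below shows the basic mechanism of starting diffusion from a noised upsample of the low-resolution result rather than from pure noise. It is a simplification for illustration: `scheduler` stands for any diffusers-style scheduler with an `add_noise` method, and CogView3's actual relay process uses its own noising schedule rather than this standard forward process:

```python
import torch
import torch.nn.functional as F

def relay_start(low_res: torch.Tensor, scheduler, t_start: int) -> torch.Tensor:
    """Noise a 2x-upsampled low-resolution result to timestep t_start,
    giving the starting point for the super-resolution diffusion stage."""
    upsampled = F.interpolate(low_res, scale_factor=2, mode="bilinear")
    noise = torch.randn_like(upsampled)
    # Sample from the forward process q(x_t | x_0): blend the upsampled
    # image with Gaussian noise at the chosen intermediate timestep.
    return scheduler.add_noise(upsampled, noise, torch.tensor([t_start]))
```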
![CogView3 Showcase](resources/CogView3_showcase.png)
![CogView3 Pipeline](resources/CogView3_pipeline.jpg)

Comparison results from human evaluations:

![CogView3 Evaluation](resources/CogView3_evaluation.png)
## Citation

🌟 If you find our work helpful, feel free to cite our paper and leave a star.

```
@article{zheng2024cogview3,
  title={CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion},
  author={Zheng, Wendi and Teng, Jiayan and Yang, Zhuoyi and Wang, Weihan and Chen, Jidong and Gu, Xiaotao and Dong, Yuxiao and Ding, Ming and Tang, Jie},
  journal={arXiv preprint arXiv:2403.05121},
  year={2024}
}
```
We welcome your contributions! Click [here](resources/contribute.md) for more information.

## Model License

This codebase is released under the [Apache 2.0 License](LICENSE).

The CogView3-Base, CogView3-Relay, and CogView3-Plus models (including the UNet module, Transformers module, and VAE module) are released under the [Apache 2.0 License](LICENSE).
