A small question about Qwen VL 2B LoRA fine-tuning #361

Open
zzt941006 opened this issue Feb 13, 2025 · 5 comments

Comments

@zzt941006

zzt941006 commented Feb 13, 2025

I followed the fine-tuning tutorial step by step.
The images, CSV, and JSON for the dataset have all been uploaded to the server.
But when I run:
train_ds = Dataset.from_json("/lora_qwen/data_vl_train.json")
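Reading the JSON with Dataset this way is extremely slow. Could it be that the images are already on the server but were not detected, so it still went online to download them?

Fine-tuning does finish in the end, but this behavior looks abnormal, so I would like some ideas on how to troubleshoot it. Any advice is much appreciated, thanks!
One more detail: on my local machine I could not load this batch of images with modelscope either. The images and the original CSV were generated on another computer and sent to my machine. I then generated the data_vl JSON locally and uploaded everything to the server to run these steps.

The generated data_vl_test.json contains only 4 images, but loading it is just as slow: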
[
  {
    "id": "identity_497",
    "conversations": [
      {
        "from": "user",
        "value": "COCO Yes: <|vision_start|>/coco_2014_caption/374916.jpg<|vision_end|>"
      },
      {
        "from": "assistant",
        "value": "Apples and leaves on the ground with a cat in the background."
      }
    ]
  },
  {
    "id": "identity_498",
    "conversations": [
      {
        "from": "user",
        "value": "COCO Yes: <|vision_start|>/coco_2014_caption/322562.jpg<|vision_end|>"
      },
      {
        "from": "assistant",
        "value": "A boat that is in a lake by some houses."
      }
    ]
  },
  {
    "id": "identity_499",
    "conversations": [
      {
        "from": "user",
        "value": "COCO Yes: <|vision_start|>/coco_2014_caption/467727.jpg<|vision_end|>"
      },
      {
        "from": "assistant",
        "value": "Elephant standing in front a fence next to a house."
      }
    ]
  },
  {
    "id": "identity_500",
    "conversations": [
      {
        "from": "user",
        "value": "COCO Yes: <|vision_start|>/coco_2014_caption/191327.jpg<|vision_end|>"
      },
      {
        "from": "assistant",
        "value": "A line of motorcycles are parked in a row ranging from vintage to rocket."
      }
    ]
  }
]
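To rule out missing files as a cause, one could check that every image path referenced in the JSON actually exists on the server. The sketch below is only one possible way to do that: the helper names `extract_image_paths` and `missing_images` are hypothetical, and it assumes the paths sit between `<|vision_start|>` and `<|vision_end|>` exactly as in the sample above.

```python
import json
import os
import re

# Matches the path between the vision markers used in the sample records.
VISION_RE = re.compile(r"<\|vision_start\|>(.*?)<\|vision_end\|>")


def extract_image_paths(records):
    """Collect every image path referenced in user turns."""
    paths = []
    for record in records:
        for turn in record["conversations"]:
            if turn["from"] == "user":
                paths.extend(VISION_RE.findall(turn["value"]))
    return paths


def missing_images(json_path):
    """Return the referenced image paths that do not exist on disk."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    return [p for p in extract_image_paths(records) if not os.path.exists(p)]
```

Calling `missing_images("/lora_qwen/data_vl_test.json")` would return an empty list if all four images are where the JSON says they are.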

Later I did set the environment variable:
export HF_DATASETS_OFFLINE=1
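and only then could the prepared offline data be read normally.
This confirms that the problem was that it first tried to download from the Hugging Face site again.
But why does this happen? Even with everything already prepared, it still prefers to fetch from Hugging Face first. I don't quite understand.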
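The same workaround can be applied from inside the training script. A minimal sketch, assuming the variable should be set before `datasets` is imported so the library sees it when it initializes; `load_local_json` is a hypothetical helper name, and this does not explain why the online lookup happens in the first place.

```python
import os

# Assumption: set this before importing `datasets`, so the library sees
# the offline flag when it initializes its configuration.
os.environ["HF_DATASETS_OFFLINE"] = "1"


def load_local_json(path):
    """Load a local JSON file as a Dataset with offline mode requested."""
    from datasets import Dataset  # imported late, after the flag is set

    return Dataset.from_json(path)
```

With this in place, `train_ds = load_local_json("/lora_qwen/data_vl_train.json")` would replace the direct `Dataset.from_json` call.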

@InTheFuture7

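@zzt941006 Hello, did you run into any errors when fine-tuning Qwen2-VL-2B? When I run the steps from downloading the data to generating the CSV, I get the error below. The environment is the autodl image provided in the tutorial.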

FileNotFoundError: https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/coco/2014/train2014/COCO_train2014_000000011195.jpg

@zzt941006
Author

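> @zzt941006 Hello, did you run into any errors when fine-tuning Qwen2-VL-2B? When I run the steps from downloading the data to generating the CSV, I get the error below. The environment is the autodl image provided in the tutorial.
>
> FileNotFoundError: https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/coco/2014/train2014/COCO_train2014_000000011195.jpg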

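At first glance, this error looks like the image is missing at that path in the autodl image. Did the image download fail? I think that is worth checking: first make sure the data can actually be downloaded.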

@InTheFuture7

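@zzt941006 I traced the problem to ds[i]: that is where the error is raised, and everything before it runs fine. Did you modify the code before running it?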

    for i in range(total):
        # Get the fields of each sample
        item = ds[i]
        image_id = item['image_id']
        caption = item['caption']
        image = item['image']
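One way to see how many samples are affected is to wrap the `ds[i]` access and record the indices that fail. This is only a diagnostic sketch: `find_failing_indices` is a hypothetical helper, and it assumes `ds` supports `len()` and integer indexing and that a failed image fetch raises `FileNotFoundError`, as in the error reported above.

```python
def find_failing_indices(ds, limit=None):
    """Return (index, message) pairs for samples that raise FileNotFoundError."""
    total = len(ds) if limit is None else min(limit, len(ds))
    failures = []
    for i in range(total):
        try:
            ds[i]  # the thread reports that the error is raised on this access
        except FileNotFoundError as exc:
            failures.append((i, str(exc)))
    return failures
```

If the returned list covers every index, the download source itself is probably unreachable; if only a few indices fail, individual files are more likely missing.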

@InTheFuture7

@zzt941006 Could you share the contents that were downloaded into the coco_2014_caption folder directly? Many thanks!

@InTheFuture7

@zzt941006 The problem is solved, thanks!
