--resume fails after 1 epoch with PyTorch 1.0 release #476

Using `--resume` with the PyTorch 1.0 release fails after 1 epoch with the error below. I tried this with both resnet50 and resnet18.
```
Traceback (most recent call last):
  File "main.py", line 398, in <module>
    main()
  File "main.py", line 110, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/tools/anaconda3-5.3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
    while not spawn_context.join():
  File "/home/tools/anaconda3-5.3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 114, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/tools/anaconda3-5.3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/space8T/mdflickner/pytorch/examples/imagenet/main.py", line 241, in main_worker
    is_best = acc1 > best_acc1
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorMathCompareT.cu:15
```
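The traceback points at the comparison `acc1 > best_acc1` in `main_worker`. One plausible reading, which the report does not confirm, is that `best_acc1` comes back from the checkpoint as a CUDA tensor on one GPU while `acc1` lives on the GPU of the current process. A minimal sketch of a workaround under that assumption is to compare plain Python floats instead of tensors. The helper names `as_float` and `is_new_best` are hypothetical and not part of `main.py`:

```python
def as_float(value):
    """Return a plain Python float for a number or a one-element tensor-like object."""
    # Assumption: tensor-like inputs expose .item(), as torch.Tensor does,
    # which copies a one-element tensor to the host as a Python scalar.
    if hasattr(value, "item"):
        return float(value.item())
    return float(value)


def is_new_best(acc1, best_acc1):
    """Compare two accuracies on the host so their devices no longer matter."""
    return as_float(acc1) > as_float(best_acc1)
```

In `main_worker` the failing line could then read `is_best = is_new_best(acc1, best_acc1)`. An alternative, also untested here, would be to move the restored value onto the process's device with `Tensor.to(...)` right after the checkpoint is loaded.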
