[Core] Support aarch64 -- causing docker on M1 build and runtime errors #28103


Closed
lundybernard opened this issue Aug 25, 2022 · 17 comments · Fixed by #31522
Labels
bug (Something that is supposed to be working; but isn't) · core (Issues that should be addressed in Ray Core) · P1 (Issue that should be fixed within a few weeks)

Comments

@lundybernard

What happened + What you expected to happen

We are building a docker-compose config that runs Ray as a single node or a cluster, connects it to additional services (logging, S3 storage, etc.), and provides an "easy button" for our users. Many of our users are now stuck on Apple M1 (arm64) hardware, and we need to support them.

Best Case:

use the rayproject/ray (or ray-ml) container directly

version: "3"

services:
  ray-head:
    image: rayproject/ray
    command: "ray start -v --head --port=6377 --redis-shard-ports=6380,6381 --object-manager-port=22345 --node-manager-port=22346 --dashboard-host=0.0.0.0 --block"

Running docker-compose up results in a TimeoutError:

ray-head_1    | 2022-08-25 09:32:31,135 INFO scripts.py:715 -- Local node IP: 172.26.0.2
ray-head_1    | Traceback (most recent call last):
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 307, in __init__
ray-head_1    |     self.redis_password,
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 397, in wait_for_node
ray-head_1    |     raise TimeoutError("Timed out while waiting for node to startup.")
ray-head_1    | TimeoutError: Timed out while waiting for node to startup.
ray-head_1    | 
ray-head_1    | During handling of the above exception, another exception occurred:
ray-head_1    | 
ray-head_1    | Traceback (most recent call last):
ray-head_1    |   File "/home/ray/anaconda3/bin/ray", line 8, in <module>
ray-head_1    |     sys.exit(main())
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 2339, in main
ray-head_1    |     return cli()
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
ray-head_1    |     return self.main(*args, **kwargs)
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1053, in main
ray-head_1    |     rv = self.invoke(ctx)
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
ray-head_1    |     return _process_result(sub_ctx.command.invoke(sub_ctx))
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
ray-head_1    |     return ctx.invoke(self.callback, **ctx.params)
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 754, in invoke
ray-head_1    |     return __callback(*args, **kwargs)
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py", line 852, in wrapper
ray-head_1    |     return f(*args, **kwargs)
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 738, in start
ray-head_1    |     ray_params, head=True, shutdown_at_exit=block, spawn_reaper=block
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 311, in __init__
ray-head_1    |     "The current node has not been updated within 30 "
ray-head_1    | Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup.
dc_ray-head_1 exited with code 1

docker run --platform linux/x86_64 rayproject/ray ray start -v --head --port=6377 --redis-shard-ports=6380,6381 --object-manager-port=22345 --node-manager-port=22346 --dashboard-host=0.0.0.0 --block results in a raylet "Some Ray subprocesses exited unexpectedly" error:

2022-08-25 09:27:57,870 INFO scripts.py:715 -- Local node IP: 172.17.0.2
2022-08-25 09:28:07,345 SUCC scripts.py:757 -- --------------------
2022-08-25 09:28:07,345 SUCC scripts.py:758 -- Ray runtime started.
2022-08-25 09:28:07,345 SUCC scripts.py:759 -- --------------------
2022-08-25 09:28:07,346 INFO scripts.py:761 -- Next steps
2022-08-25 09:28:07,346 INFO scripts.py:762 -- To connect to this Ray runtime from another node, run
2022-08-25 09:28:07,346 INFO scripts.py:767 --   ray start --address='172.17.0.2:6377'
2022-08-25 09:28:07,346 INFO scripts.py:770 -- Alternatively, use the following Python code:
2022-08-25 09:28:07,347 INFO scripts.py:772 -- import ray
2022-08-25 09:28:07,347 INFO scripts.py:785 -- ray.init(address='auto')
2022-08-25 09:28:07,347 INFO scripts.py:789 -- To connect to this Ray runtime from outside of the cluster, for example to
2022-08-25 09:28:07,347 INFO scripts.py:793 -- connect to a remote cluster from your laptop directly, use the following
2022-08-25 09:28:07,348 INFO scripts.py:796 -- Python code:
2022-08-25 09:28:07,348 INFO scripts.py:798 -- import ray
2022-08-25 09:28:07,348 INFO scripts.py:804 -- ray.init(address='ray://<head_node_ip_address>:10001')
2022-08-25 09:28:07,349 INFO scripts.py:810 -- If connection fails, check your firewall settings and network configuration.
2022-08-25 09:28:07,349 INFO scripts.py:816 -- To terminate the Ray runtime, run
2022-08-25 09:28:07,349 INFO scripts.py:817 --   ray stop
2022-08-25 09:28:07,349 INFO scripts.py:892 -- --block
2022-08-25 09:28:07,350 INFO scripts.py:894 -- This command will now block until terminated by a signal.
2022-08-25 09:28:07,350 INFO scripts.py:897 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly.
2022-08-25 09:28:10,364 ERR scripts.py:907 -- Some Ray subprcesses exited unexpectedly:
2022-08-25 09:28:10,365 ERR scripts.py:914 -- raylet [exit code=1]
2022-08-25 09:28:10,365 ERR scripts.py:919 -- Remaining processes will be killed.

Workable Solution:

Build a Ray image locally by installing the package in a Dockerfile.

native build

#FROM --platform=linux/amd64 python:3.9
#FROM --platform=linux/amd64 continuumio/miniconda3
FROM continuumio/miniconda3

#RUN conda install python=3.10
#RUN conda update conda && conda update pip

# fix grpcio for M1
#RUN pip uninstall grpcio; conda install grpcio

# Install Ray
#RUN pip install --no-cache-dir ray[default]~=1.13 ray[serve]~=1.13
RUN uname -m && \
    uname -a && \
    python --version && \
    pip --version && \
    conda install -c conda-forge ray-core
    #pip install ray

This fails to build when using pip:

 > [5/5] RUN uname -m &&     uname -a &&     python --version &&     pip --version &&     pip install setuptools &&     pip install ray:                                                                                                  
#8 0.216 aarch64                                                                                                     
#8 0.216 Linux buildkitsandbox 5.10.109-0-virt #1-Alpine SMP Mon, 28 Mar 2022 11:20:52 +0000 aarch64 GNU/Linux       
#8 0.218 Python 3.10.4                                                                                               
#8 0.373 pip 22.1.2 from /opt/conda/lib/python3.10/site-packages/pip (python 3.10)
#8 0.600 Requirement already satisfied: setuptools in /opt/conda/lib/python3.10/site-packages (63.4.1)
#8 0.643 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
#8 1.244 ERROR: Could not find a version that satisfies the requirement ray (from versions: none)
#8 1.244 ERROR: No matching distribution found for ray

and when using conda:

 > [5/5] RUN uname -m &&     uname -a &&     python --version &&     pip --version &&     pip install setuptools &&     conda install -c conda-forge ray-core #pip install --no-cache-dir ray:                                            
#8 0.343 aarch64                                                                                                     
#8 0.343 Linux buildkitsandbox 5.10.109-0-virt #1-Alpine SMP Mon, 28 Mar 2022 11:20:52 +0000 aarch64 GNU/Linux       
#8 0.350 Python 3.10.4                                                                                               
#8 0.484 pip 22.1.2 from /opt/conda/lib/python3.10/site-packages/pip (python 3.10)
#8 0.686 Requirement already satisfied: setuptools in /opt/conda/lib/python3.10/site-packages (63.4.1)
#8 0.724 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
#8 4.660 Collecting package metadata (current_repodata.json): ...working... done
#8 7.763 Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
#8 7.764 Collecting package metadata (repodata.json): ...working... done
#8 21.87 Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
#8 21.88 
#8 21.88 PackagesNotFoundError: The following packages are not available from current channels:
#8 21.88 
#8 21.88   - ray-core
#8 21.88 
#8 21.88 Current channels:
#8 21.88 
#8 21.88   - https://conda.anaconda.org/conda-forge/linux-aarch64
#8 21.88   - https://conda.anaconda.org/conda-forge/noarch
#8 21.88   - https://repo.anaconda.com/pkgs/main/linux-aarch64
#8 21.88   - https://repo.anaconda.com/pkgs/main/noarch
#8 21.88   - https://repo.anaconda.com/pkgs/r/linux-aarch64
#8 21.88   - https://repo.anaconda.com/pkgs/r/noarch
#8 21.88 
#8 21.88 To search for alternate channels that may provide the conda package you're
#8 21.88 looking for, navigate to
#8 21.88 
#8 21.88     https://anaconda.org
#8 21.88 
#8 21.88 and use the search bar at the top of the page.
#8 21.88 
#8 21.88 

platform=linux/amd64 emulation build

Setting the image to emulate amd64 with platform=linux/amd64, we can build successfully using pip or conda; however, at runtime we get the raylet subprocess error.
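The emulation can also be pinned per-service in the compose file itself, so users don't have to pass flags by hand. A minimal sketch (the platform service key is honored by recent docker compose releases; the image and command are abbreviated from the example above):

```yaml
version: "3"

services:
  ray-head:
    image: rayproject/ray
    # Force x86_64 emulation on arm64 hosts. Slow under QEMU, and the
    # raylet still crashes at runtime, as described above.
    platform: linux/amd64
    command: "ray start -v --head --port=6377 --dashboard-host=0.0.0.0 --block"
```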

Worst Case:

Build Ray from source in a local image. Undesirable due to build time and extra work for users, but I am willing to test it on M1 hardware with some guidance.

raylet Logs:

> cat logs/session_2022-08-25_15-27-56_654718_1/logs/raylet.*
terminate called after throwing an instance of 'boost::wrapexcept<boost::system::system_error>'
  what():  bind: Operation not permitted
[2022-08-25 15:27:59,090 E 53 68] logging.cc:414: *** Aborted at 1661441279 (unix time) try "date -d @1661441279" if you are using GNU date ***
[2022-08-25 15:27:59,092 E 53 68] logging.cc:414: PC: @                0x0 (unknown)
[2022-08-25 15:27:59,094 E 53 68] logging.cc:414: *** SIGABRT (@0x35) received by PID 53 (TID 0x40379c7700) from PID 53; stack trace: ***
[2022-08-25 15:27:59,103 E 53 68] logging.cc:414:     @       0x400062e55f google::(anonymous namespace)::FailureSignalHandler()
[2022-08-25 15:27:59,104 E 53 68] logging.cc:414:     @       0x40027c7140 (unknown)
[2022-08-25 15:27:59,105 E 53 68] logging.cc:414:     @       0x400282fce1 gsignal
[2022-08-25 15:27:59,106 E 53 68] logging.cc:414:     @       0x4002819537 abort
[2022-08-25 15:27:59,110 E 53 68] logging.cc:414:     @       0x40025a5872 __gnu_cxx::__verbose_terminate_handler()
[2022-08-25 15:27:59,111 E 53 68] logging.cc:414:     @       0x40025a3f6f __cxxabiv1::__terminate()
[2022-08-25 15:27:59,162 E 53 68] logging.cc:414:     @       0x40025a3fb1 std::terminate()
[2022-08-25 15:27:59,164 E 53 68] logging.cc:414:     @       0x40025a419a __cxa_throw
[2022-08-25 15:27:59,167 E 53 68] logging.cc:414:     @       0x400012b564 boost::throw_exception<>()
[2022-08-25 15:27:59,168 E 53 68] logging.cc:414:     @       0x4000963d8d boost::asio::detail::do_throw_error()
[2022-08-25 15:27:59,170 E 53 68] logging.cc:414:     @       0x4000290087 plasma::PlasmaStore::PlasmaStore()
[2022-08-25 15:27:59,171 E 53 68] logging.cc:414:     @       0x4000297741 plasma::PlasmaStoreRunner::Start()
[2022-08-25 15:27:59,173 E 53 68] logging.cc:414:     @       0x4000236bac std::thread::_State_impl<>::_M_run()
[2022-08-25 15:27:59,182 E 53 68] logging.cc:414:     @       0x40025c0039 execute_native_thread_routine
[2022-08-25 15:27:59,183 E 53 68] logging.cc:414:     @       0x40027bbea7 start_thread
[2022-08-25 15:27:59,184 E 53 68] logging.cc:414:     @       0x40028f1def clone
[2022-08-25 15:27:58,990 I 53 53] io_service_pool.cc:36: IOServicePool is running with 1 io_service.
[2022-08-25 15:27:59,008 I 53 53] store_runner.cc:31: Allowing the Plasma store to use up to 0.59385GB of memory.
[2022-08-25 15:27:59,009 I 53 53] store_runner.cc:44: Starting object store with directory /dev/shm and huge page support disabled
[2022-08-25 15:27:59,090 E 53 68] logging.cc:414: *** Aborted at 1661441279 (unix time) try "date -d @1661441279" if you are using GNU date ***
[2022-08-25 15:27:59,092 E 53 68] logging.cc:414: PC: @                0x0 (unknown)
[2022-08-25 15:27:59,094 E 53 68] logging.cc:414: *** SIGABRT (@0x35) received by PID 53 (TID 0x40379c7700) from PID 53; stack trace: ***
[2022-08-25 15:27:59,103 E 53 68] logging.cc:414:     @       0x400062e55f google::(anonymous namespace)::FailureSignalHandler()
[2022-08-25 15:27:59,104 E 53 68] logging.cc:414:     @       0x40027c7140 (unknown)
[2022-08-25 15:27:59,105 E 53 68] logging.cc:414:     @       0x400282fce1 gsignal
[2022-08-25 15:27:59,106 E 53 68] logging.cc:414:     @       0x4002819537 abort
[2022-08-25 15:27:59,110 E 53 68] logging.cc:414:     @       0x40025a5872 __gnu_cxx::__verbose_terminate_handler()
[2022-08-25 15:27:59,111 E 53 68] logging.cc:414:     @       0x40025a3f6f __cxxabiv1::__terminate()
[2022-08-25 15:27:59,162 E 53 68] logging.cc:414:     @       0x40025a3fb1 std::terminate()
[2022-08-25 15:27:59,164 E 53 68] logging.cc:414:     @       0x40025a419a __cxa_throw
[2022-08-25 15:27:59,167 E 53 68] logging.cc:414:     @       0x400012b564 boost::throw_exception<>()
[2022-08-25 15:27:59,168 E 53 68] logging.cc:414:     @       0x4000963d8d boost::asio::detail::do_throw_error()
[2022-08-25 15:27:59,170 E 53 68] logging.cc:414:     @       0x4000290087 plasma::PlasmaStore::PlasmaStore()
[2022-08-25 15:27:59,171 E 53 68] logging.cc:414:     @       0x4000297741 plasma::PlasmaStoreRunner::Start()
[2022-08-25 15:27:59,173 E 53 68] logging.cc:414:     @       0x4000236bac std::thread::_State_impl<>::_M_run()
[2022-08-25 15:27:59,182 E 53 68] logging.cc:414:     @       0x40025c0039 execute_native_thread_routine
[2022-08-25 15:27:59,183 E 53 68] logging.cc:414:     @       0x40027bbea7 start_thread
[2022-08-25 15:27:59,184 E 53 68] logging.cc:414:     @       0x40028f1def clone

Versions / Dependencies

Macintosh M1 hardware

uname -a
Darwin MacBook-Pro.local 21.4.0 Darwin Kernel Version 21.4.0: Mon Feb 21 20:35:58 PST 2022; root:xnu-8020.101.4~2/RELEASE_ARM64_T6000 arm64

Python: 3.9, 3.10
docker image arch: linux/amd64, linux/aarch64
Ray: 1.13, 2.0

Reproduction script

Repro requires a working docker + compose installation on Apple M1 hardware.

Issue Severity

High: It blocks me from completing my task.

@lundybernard lundybernard added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 25, 2022
@DmitriGekhtman DmitriGekhtman added the core Issues that should be addressed in Ray Core label Aug 25, 2022
@lundybernard
Author

Additional information:

I changed the ray worker's target address from the "redis" port 6379 to the ray head node port 10001. Now, instead of the generic "Unable to reach GCS at ray-head" error, I get the following gRPC error:

ray-worker_1  | 2022-08-31 13:58:00,017 ERROR utils.py:1249 -- Internal KV Get failed
ray-worker_1  | Traceback (most recent call last):
ray-worker_1  |   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/utils.py", line 1236, in internal_kv_get_with_retry
ray-worker_1  |     result = gcs_client.internal_kv_get(key, namespace)
ray-worker_1  |   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/gcs_utils.py", line 137, in wrapper
ray-worker_1  |     return f(self, *args, **kwargs)
ray-worker_1  |   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/gcs_utils.py", line 209, in internal_kv_get
ray-worker_1  |     reply = self._kv_stub.InternalKVGet(req)
ray-worker_1  |   File "/home/ray/anaconda3/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
ray-worker_1  |     return _end_unary_response_blocking(state, call, False, None)
ray-worker_1  |   File "/home/ray/anaconda3/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
ray-worker_1  |     raise _InactiveRpcError(state)
ray-worker_1  | grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
ray-worker_1  |         status = StatusCode.UNIMPLEMENTED
ray-worker_1  |         details = ""
ray-worker_1  |         debug_error_string = "{"created":"@1661979480.013344060","description":"Error received from peer ipv4:172.64.0.2:22346","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"","grpc_status":12}"
ray-worker_1  | >
ray-head_1    | 2022-08-31 13:58:00,388 ERR scripts.py:907 -- Some Ray subprcesses exited unexpectedly:
ray-head_1    | 2022-08-31 13:58:00,389 ERR scripts.py:911 -- dashboard [exit code=-9]
ray-head_1    | 2022-08-31 13:58:00,390 ERR scripts.py:919 -- Remaining processes will be killed.

@kleecmt

kleecmt commented Sep 24, 2022

Were you able to get any workaround?

@richardliaw richardliaw added the QS Quantsight triage label label Oct 11, 2022
@rakeshrm

rakeshrm commented Nov 8, 2022

Facing the same problem, do we have a workaround yet?

@louis-dv

I'm facing the same issue.

@hora-anyscale hora-anyscale added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Dec 14, 2022
@richardliaw richardliaw removed the QS Quantsight triage label label Dec 20, 2022
@richardliaw
Contributor

Could you all say a bit more about what exactly is the issue here?

My understanding is that:

  1. You are trying to run a Ray docker image on an M1 machine
  2. Ray does not work for a variety of reasons -- either it cannot be found on pip (because there is no aarch64 wheel), or the docker image crashes when you try running it?

@richardliaw richardliaw self-assigned this Dec 20, 2022
@romilbhardwaj
Member

@richardliaw

You are trying to run a Ray docker image on an M1 machine.

Correct. More specifically, I have my own docker image inside which I'm trying to install Ray, all on an M1 Mac.

Ray does not work for a variety of reasons -- either cannot be found on pip (due to no aarch64 wheel) or when you try running the docker image it crashes?

Yes. On an M1 Mac, installing Ray inside a container (in both ubuntu and continuumio/miniconda3 images) fails:

root@e3f2ee7d4203:/# pip install ray
ERROR: Could not find a version that satisfies the requirement ray (from versions: none)
ERROR: No matching distribution found for ray
root@e3f2ee7d4203:/# pip install ray[default]
ERROR: Could not find a version that satisfies the requirement ray[default] (from versions: none)
ERROR: No matching distribution found for ray[default]
root@e3f2ee7d4203:/# pip3 install ray[default]
ERROR: Could not find a version that satisfies the requirement ray[default] (from versions: none)
ERROR: No matching distribution found for ray[default]

A workaround to the installation issue is to use the --platform=linux/amd64 flag when launching the container. With that, installation using pip install ray succeeds. However, import ray; ray.init() fails with some ray processes crashing (GCS?), as @lundybernard noted above. For example, one of our users reported this stack trace:

E 12-16 11:43:34 subprocess_utils.py:70] 2022-12-16 16:43:24,612        INFO cli.py:28 -- Job submission server address: http://127.0.0.1:8265/
E 12-16 11:43:34 subprocess_utils.py:70] Traceback (most recent call last):
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/bin/ray", line 8, in <module>
E 12-16 11:43:34 subprocess_utils.py:70]     sys.exit(main())
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/ray/scripts/scripts.py", line 2588, in main
E 12-16 11:43:34 subprocess_utils.py:70]     return cli()
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
E 12-16 11:43:34 subprocess_utils.py:70]     return self.main(*args, **kwargs)
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/click/core.py", line 1053, in main
E 12-16 11:43:34 subprocess_utils.py:70]     rv = self.invoke(ctx)
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
E 12-16 11:43:34 subprocess_utils.py:70]     return _process_result(sub_ctx.command.invoke(sub_ctx))
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
E 12-16 11:43:34 subprocess_utils.py:70]     return _process_result(sub_ctx.command.invoke(sub_ctx))
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
E 12-16 11:43:34 subprocess_utils.py:70]     return ctx.invoke(self.callback, **ctx.params)
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/click/core.py", line 754, in invoke
E 12-16 11:43:34 subprocess_utils.py:70]     return __callback(*args, **kwargs)
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/ray/autoscaler/_private/cli_logger.py", line 852, in wrapper
E 12-16 11:43:34 subprocess_utils.py:70]     return f(*args, **kwargs)
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/ray/dashboard/modules/job/cli.py", line 184, in submit
E 12-16 11:43:34 subprocess_utils.py:70]     job_id = client.submit_job(
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/ray/dashboard/modules/job/sdk.py", line 156, in submit_job
E 12-16 11:43:34 subprocess_utils.py:70]     self._raise_error(r)
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 253, in _raise_error
E 12-16 11:43:34 subprocess_utils.py:70]     raise RuntimeError(
E 12-16 11:43:34 subprocess_utils.py:70] RuntimeError: Request failed with status code 500: Traceback (most recent call last):
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/ray/dashboard/optional_utils.py", line 277, in decorator
E 12-16 11:43:34 subprocess_utils.py:70]     raise e from None
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/ray/dashboard/optional_utils.py", line 270, in decorator
E 12-16 11:43:34 subprocess_utils.py:70]     ray.init(
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
E 12-16 11:43:34 subprocess_utils.py:70]     return func(*args, **kwargs)
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/ray/_private/worker.py", line 1479, in init
E 12-16 11:43:34 subprocess_utils.py:70]     _global_node = ray._private.node.Node(
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/ray/_private/node.py", line 226, in __init__
E 12-16 11:43:34 subprocess_utils.py:70]     node_info = ray._private.services.get_node_to_connect_for_driver(
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/ray/_private/services.py", line 395, in get_node_to_connect_for_driver
E 12-16 11:43:34 subprocess_utils.py:70]     return global_state.get_node_to_connect_for_driver(node_ip_address)
E 12-16 11:43:34 subprocess_utils.py:70]   File "/home/sky/.local/lib/python3.9/site-packages/ray/_private/state.py", line 729, in get_node_to_connect_for_driver
E 12-16 11:43:34 subprocess_utils.py:70]     node_info_str = self.global_state_accessor.get_node_to_connect_for_driver(
E 12-16 11:43:34 subprocess_utils.py:70]   File "python/ray/includes/global_state_accessor.pxi", line 155, in ray._raylet.GlobalStateAccessor.get_node_to_connect_for_driver
E 12-16 11:43:34 subprocess_utils.py:70] RuntimeError: b'GCS has started but no raylets have registered yet.'

@concretevitamin
Contributor

Thanks @richardliaw. I believe this is impacting @pounde as well.

@richardliaw richardliaw changed the title [Core] Docker on M1 build and runtime errors [Core] Support aarch64 -- causing docker on M1 build and runtime errors Dec 21, 2022
@pounde

pounde commented Dec 21, 2022

Thanks @richardliaw. I believe this is impacting @pounde as well.

Correct. Currently developing on Mac M1. Happy to help where able.

@kleecmt

kleecmt commented Dec 23, 2022

Re the new title: AFAIK, neither linux/amd64 nor linux/aarch64 works inside Docker hosted on an M1, but they fail differently (see @romilbhardwaj's post).

@richardliaw
Contributor

richardliaw commented Dec 23, 2022 via email

@kleecmt

kleecmt commented Dec 23, 2022

On linux/arm64 (i.e., the 'native'/default Linux build on an M1 Mac), pip fails to find a wheel to install or to build one from source.

  • I didn't try to compile one from git source
  • I suspect it would work somehow (AWS Glue for Ray runs on Graviton ARM chips)

On linux/amd64 (i.e., x86_64 emulation; note this is what people were trying as a "workaround"), it fails with an error related to the raylet (see the error logs posted by others).

  • it "installs", but errors when you actually run it.

(hopefully this is consistent with what others are seeing, feel free to correct me)
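Since the naming comes up repeatedly in this thread: arm64 and aarch64 are two spellings of the same 64-bit ARM ISA, just as amd64 and x86_64 both name the Intel/AMD one. A tiny illustrative sketch of the mapping (the arch_family helper is made up for this example; it is not Ray or docker code):

```shell
# Map the various spellings reported by `uname -m` or used in docker
# --platform values onto the two ISA families involved in this issue.
arch_family() {
  case "$1" in
    arm64|aarch64)  echo "arm64"  ;;   # Apple M1, AWS Graviton, etc.
    amd64|x86_64)   echo "x86_64" ;;   # Intel/AMD
    *)              echo "unknown" ;;
  esac
}

arch_family aarch64   # -> arm64
arch_family amd64     # -> x86_64
```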

@richardliaw
Contributor

OK, got it: so arm64 == aarch64, but both are different from amd64. Would it be OK for us to support only aarch64 first?

@louis-dv

On linux/arm64 (i.e., the 'native'/default Linux build on an M1 Mac), pip fails to find a wheel to install or to build one from source.

  • I didn't try to compile one from git source
  • I suspect it would work somehow (AWS Glue for Ray runs on Graviton ARM chips)

On linux/amd64 (i.e., x86_64 emulation; note this is what people were trying as a "workaround"), it fails with an error related to the raylet (see the error logs posted by others).

  • it "installs", but errors when you actually run it.

(hopefully this is consistent with what others are seeing, feel free to correct me)

This is exactly what I am experiencing as well

@pounde

pounde commented Dec 26, 2022

On linux/arm64 (i.e., the 'native'/default Linux build on an M1 Mac), pip fails to find a wheel to install or to build one from source.

  • I didn't try to compile one from git source
  • I suspect it would work somehow (AWS Glue for Ray runs on Graviton ARM chips)

On linux/amd64 (i.e., x86_64 emulation; note this is what people were trying as a "workaround"), it fails with an error related to the raylet (see the error logs posted by others).

  • it "installs", but errors when you actually run it.

(hopefully this is consistent with what others are seeing, feel free to correct me)

This is exactly what I am experiencing as well

Same here as well.

@krfricke
Contributor

krfricke commented Jan 9, 2023

We are now building aarch64 images (#31522), which covers the "workable solution" above. I'll file a separate issue to track support for a platform-native docker image (i.e., using rayproject/ray directly).

@jtlz2

jtlz2 commented Nov 30, 2023

@lundybernard Did your build from source idea ever work?
