Setting up Local Cluster
Prerequisites
To ensure sky nodes can communicate with each other, SkyPilot On-prem requires the system admin to open all ports from 10001 to 19999, inclusive, on all nodes. This is how SkyPilot differentiates input/output for multiple worker processes on a single node. In addition, SkyPilot requires port 8265 on all nodes for the Ray Dashboard. On the head node, SkyPilot also requires port 6379 for Ray's GCS server.
For further reference, here are the required ports directly from the Ray docs.
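Before proceeding, it can help to confirm the required ports are reachable. The sketch below is an illustrative helper (not part of SkyPilot) that probes a node's ports over TCP; the constant names are assumptions based on the port list above.

```python
import socket

# Ports SkyPilot On-prem needs open on every node (worker I/O and Ray
# Dashboard), plus the Ray GCS port on the head node, per the list above.
WORKER_PORTS = range(10001, 20000)   # 10001-19999, inclusive
DASHBOARD_PORT = 8265                # all nodes
GCS_PORT = 6379                      # head node only

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, once the services are up, `port_open(head_ip, GCS_PORT)` run from a worker node should return `True`. Note this only verifies reachability of ports that already have a listener; firewall rules for the idle worker port range still need to be checked with your firewall tooling.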
Installing SkyPilot dependencies
SkyPilot On-prem requires python3, ray==2.0.1, and sky to be set up on all local nodes and globally available to all users.
To install Ray and SkyPilot for all users, run the following commands on all local nodes:
$ pip3 install ray[default]==2.0.1
$ # SkyPilot requires python >= 3.6.
$ pip3 install skypilot
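After installing, a quick sanity check on each node can catch an interpreter that is too old before deployment. This is an illustrative snippet, not a SkyPilot command; the version floor comes from the note above (Python >= 3.6).

```python
import sys

# Fail loudly if this node's Python is older than SkyPilot's minimum.
if sys.version_info < (3, 6):
    raise RuntimeError(
        f"Python {sys.version.split()[0]} is too old; SkyPilot needs >= 3.6")
print("Python version OK:", sys.version.split()[0])
```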
Launching SkyPilot services
For SkyPilot to automatically launch the cluster manager, the system administrator needs to fill out a private cluster YAML file, such as the example below:
# Header for cluster-specific data.
cluster:
  # List of IPs/hostnames in the cluster. The first element is the head node.
  ips: [my.local.cluster.hostname, 3.20.226.96, 3.143.112.6]
  name: my-local-cluster
# How the system admin authenticates into the local cluster.
auth:
  ssh_user: ubuntu
  ssh_private_key: ~/.ssh/ubuntu.pem
Next, the system admin runs:
$ sky admin deploy my-cluster-config.yaml
SkyPilot will automatically perform the following four tasks:
Check that the local cluster environment is set up correctly
Profile the cluster for custom resources, such as GPUs
Launch SkyPilot’s cluster manager
Generate a public distributable cluster YAML, conveniently stored in ~/.sky/local/my-local-cluster.yaml
Finally, to check if SkyPilot services have been installed correctly, run the following on the head node:
$ # Check if Ray cluster is launched on all nodes
$ ray status
======== Autoscaler status: 2022-04-27 08:53:44.995448 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_788952ec7fb0c6c5cfac0015101952b6593f10913df9bccef44ea346
 1 node_ec653cdb9bc6d4e2d982fa39485f6e4a90be947288ca6c1e5accd843
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/64.0 CPU
 0.0/8.0 GPU
 0.0/8.0 V100
 0.00/324.119 GiB memory
 0.00/142.900 GiB object_store_memory
The console should display a list of healthy nodes equal in number to the size of the local cluster.
Publishing cluster YAML
Under the hood, sky admin deploy automatically stores a public, distributable cluster YAML in ~/.sky/local/my-local-cluster.yaml. This cluster YAML follows the same structure as the private cluster YAML, with the admin's authentication replaced by placeholder values (for regular users to fill in):
# Do NOT modify ips; OK to modify name.
cluster:
  ips: [my.local.cluster.hostname, 3.20.226.96, 3.143.112.6]
  name: my-local-cluster
auth:
  ssh_user: PLACEHOLDER
  ssh_private_key: PLACEHOLDER
# Path to the python binary to be used by SkyPilot. Must be the same on all
# nodes and executable by all users.
python: /usr/bin/python3
The distributable cluster YAML can be published on the company's website or sent privately between users. Regular users store this YAML in ~/.sky/local/ and replace each PLACEHOLDER with their own credentials.
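Before using the YAML, a user can confirm that no placeholder was left unfilled. The helper below is an illustrative sketch (not a SkyPilot feature); it scans the YAML text for auth fields still set to PLACEHOLDER.

```python
import re

def unfilled_fields(yaml_text):
    """Return the names of fields still set to the literal PLACEHOLDER."""
    return re.findall(r"^\s*(\w+):\s*PLACEHOLDER\s*$", yaml_text, re.MULTILINE)

# Example: this user filled in their key but forgot ssh_user.
example = """\
auth:
  ssh_user: PLACEHOLDER
  ssh_private_key: ~/.ssh/my-key.pem
"""
print(unfilled_fields(example))  # → ['ssh_user']
```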