Setting up a Local Cluster

Prerequisites

To ensure nodes can communicate with each other, SkyPilot On-prem requires the system admin to open all ports from 10001 to 19999, inclusive, on all nodes. SkyPilot uses these ports to differentiate input/output for multiple worker processes on a single node. In addition, SkyPilot requires port 8265 on all nodes for the Ray Dashboard.

For the head node, SkyPilot requires port 6379 for the GCS server on Ray.

For further reference, here are the required ports directly from the Ray docs.
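Once the firewall has been configured, reachability of these ports can be spot-checked from another machine. Below is a minimal stdlib-only sketch; the function names are illustrative, not part of SkyPilot:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_skypilot_ports(host: str, is_head: bool = False, sample: int = 5):
    """Spot-check a few worker ports (10001+) plus the Ray Dashboard
    port 8265; on the head node, also check port 6379 for the GCS server."""
    ports = list(range(10001, 10001 + sample)) + [8265]
    if is_head:
        ports.append(6379)
    return {p: port_is_open(host, p) for p in ports}
```

Note that a port only tests as open while a process (such as a Ray worker) is actually listening on it, so this check is most useful after the cluster is up.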

Installing SkyPilot dependencies

SkyPilot On-prem requires python3, ray==2.0.1, and sky to be set up on all local nodes and globally available to all users.

To install Ray and SkyPilot for all users, run the following commands on all local nodes:

$ pip3 install "ray[default]==2.0.1"

$ # SkyPilot requires python >= 3.6.
$ pip3 install skypilot

Launching SkyPilot services

For SkyPilot to automatically launch the cluster manager, the system administrator needs to fill out a private cluster YAML file. An example is shown below:

# Header for cluster specific data.
cluster:
  # List of IPs/hostnames in the cluster. The first element is the head node.
  ips: [my.local.cluster.hostname, 3.20.226.96, 3.143.112.6]
  name: my-local-cluster

# How the system admin authenticates into the local cluster.
auth:
  ssh_user: ubuntu
  ssh_private_key: ~/.ssh/ubuntu.pem

Next, the system admin runs:

$ sky admin deploy my-cluster-config.yaml

SkyPilot will automatically perform the following four tasks:

  • Check if the local cluster environment is setup correctly

  • Profile the cluster for custom resources, such as GPUs

  • Launch SkyPilot’s cluster manager

  • Generate a public distributable cluster YAML, conveniently stored in ~/.sky/local/my-local-cluster.yaml

Finally, to check if SkyPilot services have been installed correctly, run the following on the head node:

$ # Check if Ray cluster is launched on all nodes
$ ray status

======== Autoscaler status: 2022-04-27 08:53:44.995448 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_788952ec7fb0c6c5cfac0015101952b6593f10913df9bccef44ea346
 1 node_ec653cdb9bc6d4e2d982fa39485f6e4a90be947288ca6c1e5accd843
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/64.0 CPU
 0.0/8.0 GPU
 0.0/8.0 V100
 0.00/324.119 GiB memory
 0.00/142.900 GiB object_store_memory

The console should display a number of healthy nodes equal to the size of the local cluster.
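To automate this check across many clusters, the healthy-node count can be extracted from the `ray status` output with a small stdlib-only helper. This is an illustrative sketch (the function name is ours, not part of Ray or SkyPilot):

```python
import re


def count_healthy_nodes(status_output: str) -> int:
    """Count the node_... entries listed under the 'Healthy:' section
    of `ray status` output. Returns 0 if no healthy section is found."""
    healthy = re.search(r"Healthy:\n((?:\s+\d+ node_\S+\n?)+)", status_output)
    if not healthy:
        return 0
    return len(re.findall(r"node_[0-9a-f]+", healthy.group(1)))
```

Running this on the sample output above would report 2 healthy nodes, which a wrapper script could compare against the number of IPs in the cluster YAML.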

Publishing cluster YAML

Under the hood, sky admin deploy automatically stores a public distributable cluster YAML in ~/.sky/local/my-local-cluster.yaml. This cluster YAML follows a structure similar to that of the private cluster YAML, with the admin authentication replaced by a placeholder value (for regular users to fill in):

# Do NOT modify ips, OK to modify name
cluster:
  ips: [my.local.cluster.hostname, 3.20.226.96, 3.143.112.6]
  name: my-local-cluster

auth:
  ssh_user: PLACEHOLDER
  ssh_private_key: PLACEHOLDER

# Path to the python binary to be used by SkyPilot. Must be the same on all nodes and executable by all users.
python: /usr/bin/python3

The distributable cluster YAML can be published on the company’s website or sent privately between users. Regular users store this YAML in ~/.sky/local/ and replace the PLACEHOLDER values with their own credentials.
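The substitution itself is simple enough to script. A minimal sketch, assuming the placeholders appear exactly as in the YAML above (plain string replacement; a YAML library would also work):

```python
def fill_placeholders(yaml_text: str, ssh_user: str, ssh_key: str) -> str:
    """Replace the PLACEHOLDER auth fields in a distributable cluster
    YAML with a user's own SSH credentials."""
    filled = yaml_text.replace("ssh_user: PLACEHOLDER",
                               f"ssh_user: {ssh_user}")
    filled = filled.replace("ssh_private_key: PLACEHOLDER",
                            f"ssh_private_key: {ssh_key}")
    return filled
```

A user would read ~/.sky/local/my-local-cluster.yaml, pass its contents through this function with their own username and key path, and write the result back.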