.. _dist-jobs:

Distributed Jobs on Many VMs
================================================

SkyPilot supports multi-node cluster provisioning and distributed execution on many VMs.

For example, here is a simple PyTorch Distributed training task:

.. code-block:: yaml
  :emphasize-lines: 6-6

  name: resnet-distributed-app

  resources:
    accelerators: V100:4

  num_nodes: 2

  setup: |
    pip3 install --upgrade pip
    git clone https://github.com/michaelzhiluo/pytorch-distributed-resnet
    cd pytorch-distributed-resnet
    # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
    pip3 install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
    mkdir -p data && mkdir -p saved_models && cd data && \
      wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
    tar -xvzf cifar-10-python.tar.gz

  run: |
    cd pytorch-distributed-resnet

    num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
    master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
    python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
      --nnodes=$num_nodes --node_rank=${SKYPILOT_NODE_RANK} --master_addr=$master_addr \
      --master_port=8008 resnet_ddp.py --num_epochs 20

In the above, :code:`num_nodes: 2` specifies that this task is to be run on 2 nodes, with each node having 4 V100s.

Environment variables
-----------------------------------------

SkyPilot exposes these environment variables that can be accessed in a task's ``run`` commands:

- :code:`SKYPILOT_NODE_RANK`: rank (an integer ID from 0 to :code:`num_nodes-1`) of the node executing the task.
- :code:`SKYPILOT_NODE_IPS`: a string of IP addresses of the nodes reserved to execute the task, where each line contains one IP address.

  - You can retrieve the number of nodes with :code:`echo "$SKYPILOT_NODE_IPS" | wc -l` and the IP address of the third node with :code:`echo "$SKYPILOT_NODE_IPS" | sed -n 3p`.
  - To manipulate these IP addresses, you can also store them in a file in the :code:`run` command with :code:`echo $SKYPILOT_NODE_IPS >> ~/sky_node_ips`.

- :code:`SKYPILOT_NUM_GPUS_PER_NODE`: number of GPUs reserved on each node to execute the task; the same as the count in ``accelerators: <name>:<count>`` (rounded up if a fraction).

Launching a multi-node task (new cluster)
-------------------------------------------------

When using ``sky launch`` to launch a multi-node task on **a new cluster**, the following happens in sequence:

1. Nodes are provisioned. (barrier)
2. Workdir/file_mounts are synced to all nodes. (barrier)
3. ``setup`` commands are executed on all nodes. (barrier)
4. ``run`` commands are executed on all nodes.

Launching a multi-node task (existing cluster)
-------------------------------------------------

When using ``sky launch`` to launch a multi-node task on **an existing cluster**, the cluster may have more nodes than the current task's ``num_nodes`` requirement. The following happens in sequence:

1. SkyPilot checks that the runtime on all nodes is up-to-date. (barrier)
2. Workdir/file_mounts are synced to all nodes. (barrier)
3. ``setup`` commands are executed on **all nodes** of the cluster. (barrier)
4. ``run`` commands are executed on **the subset of nodes** scheduled to execute the task, which may be fewer than the cluster size.

.. tip::

  To skip rerunning the setup commands, use either ``sky launch --no-setup ...`` (performs steps 1, 2, and 4 above) or ``sky exec`` (performs step 2, syncing the workdir only, and step 4).
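For example, with a hypothetical existing cluster named ``mycluster`` and task file ``task.yaml``, either of the following avoids rerunning the ``setup`` commands:

.. code-block:: console

  $ # Re-launch on the existing cluster, skipping setup (steps 1, 2, 4).
  $ sky launch --no-setup -c mycluster task.yaml

  $ # Or sync the workdir and rerun only the run commands (steps 2 and 4).
  $ sky exec mycluster task.yaml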
Executing a task on the head node only
-----------------------------------------

To execute a task on the head node only (a common scenario for tools like ``mpirun``), use the ``SKYPILOT_NODE_RANK`` environment variable as follows:

.. code-block:: yaml

  ...

  num_nodes: <n>

  run: |
    if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
        # Launch the head-only command here.
    fi
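As a minimal sketch of the ``mpirun`` case, the head node can build a hostfile from ``SKYPILOT_NODE_IPS`` and launch all workers itself. This assumes MPI has been installed in ``setup``, the nodes can reach each other over SSH, and ``./my_program`` is a placeholder for your MPI program:

.. code-block:: yaml

  num_nodes: 2

  run: |
    if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
        # SKYPILOT_NODE_IPS lists one IP per line, matching the hostfile format.
        echo "$SKYPILOT_NODE_IPS" > ~/hostfile
        num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
        total_procs=$((num_nodes * SKYPILOT_NUM_GPUS_PER_NODE))
        # Placeholder invocation; adjust process counts and arguments to your workload.
        mpirun --hostfile ~/hostfile -np $total_procs ./my_program
    fi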