Task YAML

SkyPilot provides an intuitive YAML interface to specify a task (resource requirements, setup commands, run commands, file mounts, storage mounts, and so on).

Task YAMLs can be used with the CLI or with the programmatic API (sky.Task.from_yaml()).
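
For example, a task defined in my_task.yaml can be launched from the CLI (sky launch -c mycluster my_task.yaml) or via the Python API. A minimal sketch, where my_task.yaml and the cluster name mycluster are placeholders:

import sky

# Load the task definition from the YAML file.
task = sky.Task.from_yaml('my_task.yaml')

# Launch the task on a new (or existing) cluster named 'mycluster'.
sky.launch(task, cluster_name='mycluster')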

Available fields:

# Task name (optional), used for display purposes.
name: my-task

# Working directory (optional), synced to ~/sky_workdir on the remote cluster
# each time `sky launch` or `sky exec` is run with this YAML file.
#
# Commands in "setup" and "run" will be executed under it.
#
# If a .gitignore file (or a .git/info/exclude file) exists in the working
# directory, files and directories listed in it will be excluded from syncing.
workdir: ~/my-task-code

# Number of nodes (optional; defaults to 1) to launch, including the head node.
#
# A task can set this to a smaller value than the size of a cluster.
num_nodes: 4

# Per-node resource requirements (optional).
resources:
  cloud: aws  # The cloud to use (optional).

  # The region to use (optional). Auto-failover will be disabled
  # if this is specified.
  region: us-east-1

  # The zone to use (optional). Auto-failover will be disabled
  # if this is specified.
  zone: us-east-1a

  # Accelerator name and count per node (optional).
  #
  # Use `sky show-gpus` to view available accelerator configurations.
  #
  # Format: <name>:<count> (or simply <name>, short for a count of 1).
  accelerators: V100:4

  # Instance type to use (optional). If 'accelerators' is specified,
  # the corresponding instance type is automatically inferred.
  instance_type: p3.8xlarge

  # Whether the cluster should use spot instances (optional).
  # If unspecified, defaults to False (on-demand instances).
  use_spot: False

  # The recovery strategy for spot jobs (optional).
  # `use_spot` must be True for this to have any effect. For now, only
  # the `FAILOVER` strategy is supported.
  spot_recovery: FAILOVER

  # Disk size in GB to allocate for the OS (mounted at /). Increase this if you
  # have a large working directory or tasks that write out large outputs.
  disk_size: 256

  # Additional accelerator metadata (optional); only used for TPU node
  # and TPU VM.
  # Example usage:
  #
  #   To request a TPU node:
  #     accelerator_args:
  #       tpu_name: ...
  #
  #   To request a TPU VM:
  #     accelerator_args:
  #       tpu_vm: True
  #
  # By default, the value for "runtime_version" is chosen based on which of
  # the two is requested, and should work in either case. If an incompatible
  # version is passed in, GCP will throw an error during provisioning.
  accelerator_args:
    # Default is "2.5.0" for TPU node and "tpu-vm-base" for TPU VM.
    runtime_version: 2.5.0
    tpu_name: mytpu
    tpu_vm: False  # False to use TPU nodes (the default); True to use TPU VMs.

  # Custom image id (optional, advanced). The image id used to boot the
  # instances. Only supported for AWS and GCP. If not specified, SkyPilot
  # will use the default Debian-based image suitable for machine learning tasks.
  #
  # AWS
  # To find AWS AMI ids: https://leaherb.com/how-to-find-an-aws-marketplace-ami-image-id
  # You can also change the default OS version by choosing from the following image tags provided by SkyPilot:
  #   image_id: skypilot:gpu-ubuntu-2004
  #   image_id: skypilot:k80-ubuntu-2004
  #   image_id: skypilot:gpu-ubuntu-1804
  #   image_id: skypilot:k80-ubuntu-1804
  # It is also possible to specify a per-region image id (failover will only go through the regions specified as keys;
  # useful when you have custom images in multiple regions):
  #   image_id:
  #     us-east-1: ami-0729d913a335efca7
  #     us-west-2: ami-050814f384259894c
  image_id: ami-0868a20f5a3bf9702
  # GCP
  # To find GCP images: https://cloud.google.com/compute/docs/images
  # image_id: projects/deeplearning-platform-release/global/images/family/tf2-ent-2-1-cpu-ubuntu-2004

file_mounts:
  # Uses rsync to sync local files/directories to all nodes of the cluster.
  #
  # If symlinks are present, they are copied as symlinks, and their targets
  # must also be synced using file_mounts to ensure correctness.
  /remote/dir1/file: /local/dir1/file
  /remote/dir2: /local/dir2

  # Uses SkyPilot Storage to create an S3 bucket named sky-dataset, upload the
  # contents of /local/path/datasets to it, and mark the bucket as persistent
  # (it will not be deleted after the task completes).
  # Symlink contents are copied over.
  #
  # Mounts the bucket at /datasets-storage on every node of the cluster.
  /datasets-storage:
    name: sky-dataset  # Name of storage, optional when source is bucket URI
    source: /local/path/datasets  # Source path, can be local or s3/gcs URL. Optional, do not specify to create an empty bucket.
    store: s3  # Could be either 's3' or 'gcs'; default: None. Optional.
    persistent: True  # Defaults to True; can be set to False. Optional.
    mode: MOUNT  # Either MOUNT or COPY. Optional.
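
  # A bucket URI can also be given directly as the source, in which case the
  # existing bucket is mounted without copying and the storage name can be
  # omitted. A minimal sketch; s3://my-existing-bucket is a placeholder for
  # your own bucket.
  /existing-bucket-storage:
    source: s3://my-existing-bucket
    mode: MOUNT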

  # Copies a cloud object store URI to the cluster. The bucket may be private.
  /datasets-s3: s3://my-awesome-dataset

# Setup script (optional) to execute on every `sky launch`.
# This is executed before the 'run' commands.
#
# The '|' indicator denotes a multiline string. To specify a single command:
#   setup: pip install -r requirements.txt
setup: |
  echo "Begin setup."
  pip install -r requirements.txt
  echo "Setup complete."

# Main program (optional, but recommended) to run on every node of the cluster.
run: |
  echo "Beginning task."
  python train.py
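
The same task can also be constructed directly with the programmatic API, without a YAML file. A minimal sketch mirroring a subset of the fields above (the field values are the same placeholders used in this example):

import sky

# Equivalent of the name, workdir, num_nodes, setup, and run fields.
task = sky.Task(
    name='my-task',
    workdir='~/my-task-code',
    num_nodes=4,
    setup='pip install -r requirements.txt',
    run='python train.py',
)

# Equivalent of the resources field.
task.set_resources(sky.Resources(cloud=sky.AWS(), accelerators='V100:4'))

sky.launch(task, cluster_name='mycluster')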