Python API

SkyPilot offers a programmatic API in Python, which is used under the hood by the CLI.

Note

The Python API contains more experimental functions/classes than the CLI. That said, users have successfully built several Python libraries on top of it.

For questions or requests for support, please reach out to the development team. Your feedback is much appreciated in evolving this API!

Core API

sky.launch

sky.launch(task, cluster_name=None, retry_until_up=False, idle_minutes_to_autostop=None, dryrun=False, down=False, stream_logs=True, backend=None, optimize_target=OptimizeTarget.COST, detach_setup=False, detach_run=False, no_setup=False)[source]

Launch a task.

The task’s setup and run commands are executed under the task’s workdir (when specified, it is synced to the remote cluster). The task undergoes job queue scheduling on the cluster.

Currently, the first argument must be a sky.Task, or (EXPERIMENTAL advanced usage) a sky.Dag. In the latter case, it must currently contain a single task; support for pipelines/general DAGs is in experimental branches.

Parameters
  • task (Union[Task, Dag]) – sky.Task, or sky.Dag (experimental; 1-task only) to launch.

  • cluster_name (Optional[str]) – name of the cluster to create/reuse. If None, auto-generate a name.

  • retry_until_up (bool) – whether to retry launching the cluster until it is up.

  • idle_minutes_to_autostop (Optional[int]) – automatically stop the cluster after this many minutes of idleness, i.e., no running or pending jobs in the cluster’s job queue. Idleness gets reset whenever setting-up/running/pending jobs are found in the job queue. Setting this flag is equivalent to running sky.launch(..., detach_run=True, ...) and then sky.autostop(idle_minutes=<minutes>). If not set, the cluster will not be autostopped.

  • down (bool) – Tear down the cluster after all jobs finish (successfully or abnormally). If idle_minutes_to_autostop is also set, the cluster will be torn down after the specified idle time. Note that if errors occur during provisioning/data syncing/setting up, the cluster will not be torn down for debugging purposes.

  • dryrun (bool) – if True, do not actually launch the cluster.

  • stream_logs (bool) – if True, show the logs in the terminal.

  • backend (Optional[Backend]) – backend to use. If None, use the default backend (CloudVMRayBackend).

  • optimize_target (OptimizeTarget) – target to optimize for. Choices: OptimizeTarget.COST, OptimizeTarget.TIME.

  • detach_setup (bool) – If True, run setup in non-interactive mode as part of the job itself. You can safely ctrl-c to detach from logging, and it will not interrupt the setup process. To see the logs again after detaching, use sky logs. To cancel setup, cancel the job via sky cancel. Useful for long-running setup commands.

  • detach_run (bool) – If True, as soon as a job is submitted, return from this function and do not stream execution logs.

  • no_setup (bool) – if True, do not re-run setup commands.

Example

import sky
task = sky.Task(run='echo hello SkyPilot')
task.set_resources(
    sky.Resources(cloud=sky.AWS(), accelerators='V100:4'))
sky.launch(task, cluster_name='my-cluster')
Return type

None

sky.exec

sky.exec(task, cluster_name, dryrun=False, down=False, stream_logs=True, backend=None, detach_run=False)[source]

Execute a task on an existing cluster.

This function performs two actions:

  1. workdir syncing, if the task has a workdir specified;

  2. executing the task’s run commands.

All other steps (provisioning, setup commands, file mounts syncing) are skipped. If any of those specifications changed in the task, this function will not reflect those changes. To ensure a cluster’s setup is up to date, use sky.launch() instead.

Execution and scheduling behavior:

  • The task will undergo job queue scheduling, respecting any specified resource requirement. It can be executed on any node of the cluster with enough resources.

  • The task is run under the workdir (if specified).

  • The task is run non-interactively (without a pseudo-terminal or pty), so interactive commands such as htop do not work. Use ssh my_cluster instead.

Parameters
  • task (Union[Task, Dag]) – sky.Task, or sky.Dag (experimental; 1-task only) containing the task to execute.

  • cluster_name (str) – name of an existing cluster to execute the task.

  • down (bool) – Tear down the cluster after all jobs finish (successfully or abnormally). If idle_minutes_to_autostop is also set, the cluster will be torn down after the specified idle time. Note that if errors occur during provisioning/data syncing/setting up, the cluster will not be torn down for debugging purposes.

  • dryrun (bool) – if True, do not actually execute the task.

  • stream_logs (bool) – if True, show the logs in the terminal.

  • backend (Optional[Backend]) – backend to use. If None, use the default backend (CloudVMRayBackend).

  • detach_run (bool) – if True, detach from logging once the task has been submitted.
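
Example (a sketch; assumes a cluster named 'my-cluster' is already UP, e.g., created by the sky.launch() example above, and that eval.py exists in the local workdir):

```python
import sky

# Only the workdir is re-synced; provisioning and setup are skipped,
# so submission is much faster than sky.launch().
task = sky.Task(run='python eval.py', workdir='.')
sky.exec(task, cluster_name='my-cluster')
```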

Raises

ValueError – if the specified cluster does not exist or is not in UP status.

Return type

None

sky.stop

sky.stop(cluster_name, purge=False)[source]

Stop a cluster.

Data on attached disks is not lost when a cluster is stopped. Billing for the instances will stop, while the disks will still be charged. Those disks will be reattached when restarting the cluster.

Currently, spot instance clusters cannot be stopped.

Parameters
  • cluster_name (str) – name of the cluster to stop.

  • purge (bool) – whether to ignore cloud provider errors (if any).
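
Example (assumes a non-spot cluster named 'my-cluster' exists):

```python
import sky

# Stop the cluster: instance billing stops, attached disks are kept
# (and still billed) and will be reattached on restart.
sky.stop('my-cluster')
```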

Raises
  • ValueError – the specified cluster does not exist.

  • RuntimeError – failed to stop the cluster.

  • sky.exceptions.NotSupportedError – if the specified cluster is a spot cluster, or a TPU VM Pod cluster, or the managed spot controller.

Return type

None

sky.start

sky.start(cluster_name, idle_minutes_to_autostop=None, retry_until_up=False, down=False, force=False)[source]

Restart a cluster.

If a cluster is previously stopped (status is STOPPED) or failed in provisioning/runtime installation (status is INIT), this function will attempt to start the cluster. In the latter case, provisioning and runtime installation will be retried.

Auto-failover provisioning is not used when restarting a stopped cluster. It will be started on the same cloud, region, and zone that were chosen before.

If a cluster is already in the UP status, this function has no effect.

Parameters
  • cluster_name (str) – name of the cluster to start.

  • idle_minutes_to_autostop (Optional[int]) – automatically stop the cluster after this many minutes of idleness, i.e., no running or pending jobs in the cluster’s job queue. Idleness gets reset whenever setting-up/running/pending jobs are found in the job queue. Setting this flag is equivalent to running sky.launch(..., detach_run=True, ...) and then sky.autostop(idle_minutes=<minutes>). If not set, the cluster will not be autostopped.

  • retry_until_up (bool) – whether to retry launching the cluster until it is up.

  • down (bool) – Autodown the cluster: tear it down after the specified minutes of idle time once all jobs finish (successfully or abnormally). Requires idle_minutes_to_autostop to be set.

  • force (bool) – whether to force start the cluster even if it is already up. Useful for upgrading SkyPilot runtime.
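
Example (assumes a stopped cluster named 'my-cluster'):

```python
import sky

# Restart the cluster on its original cloud/region/zone, and
# autostop it again after 10 idle minutes.
sky.start('my-cluster', idle_minutes_to_autostop=10)
```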

Raises
  • ValueError – argument values are invalid: (1) the specified cluster does not exist; (2) if down is set to True but idle_minutes_to_autostop is None; (3) if the specified cluster is the managed spot controller, and either idle_minutes_to_autostop is not None or down is True (omit them to use the default autostop settings).

  • sky.exceptions.NotSupportedError – if the cluster to restart was launched using a non-default backend that does not support this operation.

  • sky.exceptions.ClusterOwnerIdentityMismatchError – if the cluster to restart was launched by a different user.

Return type

None

sky.down

sky.down(cluster_name, purge=False)[source]

Tear down a cluster.

Tearing down a cluster will delete all associated resources (all billing stops), and any data on the attached disks will be lost. Accelerators (e.g., TPUs) that are part of the cluster will be deleted too.

For local on-prem clusters, this function does not terminate the local cluster, but instead removes the cluster from the status table and terminates the calling user’s running jobs.

Parameters
  • cluster_name (str) – name of the cluster to down.

  • purge (bool) – whether to ignore cloud provider errors (if any).
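
Example (assumes a cluster named 'my-cluster' exists; this is irreversible):

```python
import sky

# Tear down the cluster: all associated resources are deleted and
# data on attached disks is lost.
sky.down('my-cluster')
```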

Raises
  • ValueError – the specified cluster does not exist.

  • sky.exceptions.NotSupportedError – the specified cluster is the managed spot controller.

Return type

None

sky.status

sky.status(refresh=False)[source]

Get all cluster statuses.

Each returned value has the following fields:

{
    'name': (str) cluster name,
    'launched_at': (int) timestamp of last launch on this cluster,
    'handle': (ResourceHandle) an internal handle to the cluster,
    'last_use': (str) the last command/entrypoint that affected this
      cluster,
    'status': (sky.ClusterStatus) cluster status,
    'autostop': (int) idle time before autostop,
    'to_down': (bool) whether autodown is used instead of autostop,
    'metadata': (dict) metadata of the cluster,
}

Each cluster can have one of the following statuses:

  • INIT: The cluster may be live or down. It can happen in the following cases:

    • Ongoing provisioning or runtime setup. (A sky.launch() has started but has not completed.)

    • Or, the cluster is in an abnormal state, e.g., some cluster nodes are down, or the SkyPilot runtime is unhealthy. (To recover the cluster, try sky launch again on it.)

  • UP: Provisioning and runtime setup have succeeded and the cluster is live. (The most recent sky.launch() has completed successfully.)

  • STOPPED: The cluster is stopped and the storage is persisted. Use sky.start() to restart the cluster.

Autostop column:

  • The autostop column shows the number of idle minutes (no jobs running) after which the cluster will be autostopped. If to_down is True, the cluster will be autodowned rather than autostopped.

Getting up-to-date cluster statuses:

  • In normal cases where clusters are entirely managed by SkyPilot (i.e., no manual operations in cloud consoles) and no autostopping is used, the table returned by this command will accurately reflect the cluster statuses.

  • In cases where the clusters are changed outside of SkyPilot (e.g., manual operations in cloud consoles; unmanaged spot clusters getting preempted) or for autostop-enabled clusters, use refresh=True to query the latest cluster statuses from the cloud providers.

Parameters

refresh (bool) – whether to query the latest cluster statuses from the cloud provider(s).
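
Example (a sketch; field names follow the dict schema above):

```python
import sky

# refresh=True queries the cloud providers for the latest statuses,
# which is needed for autostop-enabled or externally modified clusters.
clusters = sky.status(refresh=True)
for cluster in clusters:
    print(cluster['name'], cluster['status'])
```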

Return type

List[Dict[str, Any]]

Returns

A list of dicts, with each dict containing the information of a cluster.

sky.autostop

sky.autostop(cluster_name, idle_minutes, down=False)[source]

Schedule an autostop/autodown for a cluster.

Autostop/autodown will automatically stop or teardown a cluster when it becomes idle for a specified duration. Idleness means there are no in-progress (pending/running) jobs in a cluster’s job queue.

Idleness time of a cluster is reset to zero, whenever:

  • A job is submitted (sky.launch() or sky.exec()).

  • The cluster has restarted.

  • An autostop is set when there is no active setting. (Namely, either there’s never any autostop setting set, or the previous autostop setting was canceled.) This is useful for restarting the autostop timer.

Example: say a cluster without any autostop set has been idle for 1 hour, then an autostop of 30 minutes is set. The cluster will not be immediately autostopped. Instead, the idleness timer starts counting only after the autostop setting is applied.

When multiple autostop settings are specified for the same cluster, the last setting takes precedence.

Parameters
  • cluster_name (str) – name of the cluster.

  • idle_minutes (int) – the number of minutes of idleness (no pending/running jobs) after which the cluster will be stopped automatically. Setting to a negative number cancels any autostop/autodown setting.

  • down (bool) – if true, use autodown (tear down the cluster; non-restartable), rather than autostop (restartable).
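
Example (assumes an UP cluster named 'my-cluster'):

```python
import sky

# Autostop after 30 idle minutes (restartable with sky.start()).
sky.autostop('my-cluster', idle_minutes=30)

# Or autodown instead: the cluster is torn down, not restartable.
sky.autostop('my-cluster', idle_minutes=30, down=True)

# A negative value cancels any existing autostop/autodown setting.
sky.autostop('my-cluster', idle_minutes=-1)
```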

Raises
  • ValueError – if the cluster does not exist.

  • sky.exceptions.ClusterNotUpError – if the cluster is not UP.

  • sky.exceptions.NotSupportedError – if the cluster is not based on CloudVmRayBackend or the cluster is a TPU VM Pod.

  • sky.exceptions.ClusterOwnerIdentityMismatchError – if the current user is not the same as the user who created the cluster.

  • sky.exceptions.CloudUserIdentityError – if we fail to get the current user identity.

Return type

None

Task

class sky.Task(name=None, *, setup=None, run=None, envs=None, workdir=None, num_nodes=None, docker_image=None)[source]

Task: a computation to be run on the cloud.

__init__(name=None, *, setup=None, run=None, envs=None, workdir=None, num_nodes=None, docker_image=None)[source]

Initializes a Task.

All fields are optional. Task.run is the actual program: either a shell command to run (str) or a command generator for different nodes (lambda; see below).

Optionally, call Task.set_resources() to set the resource requirements for this task. If not set, a default CPU-only requirement is assumed (the same as sky cpunode).

All setters of this class, Task.set_*(), return self, i.e., they are fluent APIs and can be chained together.

Example

# A Task that will sync up local workdir '.', containing
# requirements.txt and train.py.
sky.Task(setup='pip install -r requirements.txt',
         run='python train.py',
         workdir='.')

# An empty Task for provisioning a cluster.
task = sky.Task(num_nodes=n).set_resources(...)

# Chaining setters.
sky.Task().set_resources(...).set_file_mounts(...)
Parameters
  • name (Optional[str]) – A string name for the Task for display purposes.

  • setup (Optional[str]) – A setup command, which will be run before the run commands, and is executed under workdir.

  • run (Union[str, Callable[[int, List[str]], Optional[str]], None]) – The actual command for the task. If not None, either a shell command (str) or a command generator (callable). If the latter, it must take a node rank and a list of node addresses as input and return a shell command (str) (it is valid to return None for some nodes, in which case no commands are run on them). Run commands will be run under workdir. Note that the command generator should be a self-contained lambda.

  • envs (Optional[Dict[str, str]]) – A dictionary of environment variables to set before running the setup and run commands.

  • workdir (Optional[str]) – The local working directory. This directory will be synced to a location on the remote VM(s), and setup and run commands will be run under that location (thus, they can rely on relative paths when invoking binaries).

  • num_nodes (Optional[int]) – The number of nodes to provision for this Task. If None, treated as 1 node. If > 1, each node will execute its own setup/run command, where run can either be a str, meaning all nodes get the same command, or a lambda, with the semantics documented above.

  • docker_image (Optional[str]) – (EXPERIMENTAL: Only in effect when LocalDockerBackend is used.) The base docker image that this Task will be built on. Defaults to ‘gpuci/miniforge-cuda:11.4-devel-ubuntu18.04’.

static from_yaml(yaml_path)[source]

Initializes a task from a task YAML.

Example

task = sky.Task.from_yaml('/path/to/task.yaml')
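
The referenced YAML might look like the following (an illustrative sketch; see the task YAML documentation for the full schema):

```yaml
# task.yaml
resources:
  accelerators: V100:1

workdir: .

setup: |
  pip install -r requirements.txt

run: |
  python train.py
```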
Parameters

yaml_path (str) – file path to a valid task yaml file.

Raises

ValueError – if the path gets loaded into a str instead of a dict; or if there are any other parsing errors.

Return type

Task

set_envs(envs)[source]

Sets the environment variables for use inside the setup/run commands.

Parameters

envs (Union[None, Tuple[Tuple[str, str]], Dict[str, str]]) – (optional) either a list of (env_name, value) or a dict {env_name: value}.
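
Example (a sketch; the variable name is illustrative):

```python
import sky

# USER_NAME becomes available to both the setup and run commands.
task = sky.Task(run='echo "Hello, $USER_NAME"')
task.set_envs({'USER_NAME': 'skypilot'})
```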

Returns

The current task, with envs set.

Return type

self

Raises

ValueError – if invalid inputs are detected.

set_resources(resources)[source]

Sets the required resources to execute this task.

If this function is not called for a Task, default resource requirements will be used (8 vCPUs).

Parameters

resources (Union[Resources, Set[Resources]]) – either a sky.Resources, or a set of them. The latter case is EXPERIMENTAL and indicates asking the optimizer to “pick the best of these resources” to run this task.
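
Example (a sketch; accelerator names are illustrative):

```python
import sky

task = sky.Task(run='python train.py')

# A single resource requirement.
task.set_resources(sky.Resources(cloud=sky.GCP(), accelerators='A100:1'))

# EXPERIMENTAL: a set of candidates; the optimizer picks the best.
task.set_resources({
    sky.Resources(accelerators='V100:1'),
    sky.Resources(accelerators='T4:1'),
})
```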

Returns

The current task, with resources set.

Return type

self

set_file_mounts(file_mounts)[source]

Sets the file mounts for this task.

Useful for syncing datasets, dotfiles, etc.

File mounts are a dictionary: {remote_path: local_path/cloud URI}. Local (or cloud) files/directories will be synced to the specified paths on the remote VM(s) where this Task will run.

Neither source nor destination paths may end with a slash.

Example

task.set_file_mounts({
    '~/.dotfile': '/local/.dotfile',
    # /remote/dir/ will contain the contents of /local/dir/.
    '/remote/dir': '/local/dir',
})
Parameters

file_mounts (Optional[Dict[str, str]]) – an optional dict of {remote_path: local_path/cloud URI}, where remote means the VM(s) on which this Task will eventually run, and local means the node from which the task is launched.

Returns

the current task, with file mounts set.

Return type

self

Raises

ValueError – if input paths are invalid.

update_file_mounts(file_mounts)[source]

Updates the file mounts for this task.

Different from set_file_mounts(), this function updates the existing file_mounts (calls dict.update()) rather than overwriting it.

This should be called before provisioning in order to take effect.

Example

task.update_file_mounts({
    '~/.config': '~/Documents/config',
    '/tmp/workdir': '/local/workdir/cnn-cifar10',
})
Parameters

file_mounts (Dict[str, str]) – a dict of {remote_path: local_path/cloud URI}, where remote means the VM(s) on which this Task will eventually run, and local means the node from which the task is launched.

Returns

the current task, with file mounts updated.

Return type

self

Raises

ValueError – if input paths are invalid.

set_storage_mounts(storage_mounts)[source]

Sets the storage mounts for this task.

Storage mounts are a dictionary: {mount_path: sky.Storage object}, each of which mounts a sky.Storage object (a cloud object store bucket) to a path inside the remote cluster.

A sky.Storage object can be created by uploading from a local directory (setting source), or backed by an existing cloud bucket (setting name to the bucket name; or setting source to the bucket URI).

Example

task.set_storage_mounts({
    '/remote/imagenet/': sky.Storage(name='my-bucket',
                                     source='/local/imagenet'),
})
Parameters

storage_mounts (Optional[Dict[str, Storage]]) – an optional dict of {mount_path: sky.Storage object}, where mount_path is the path inside the remote VM(s) where the Storage object will be mounted.

Returns

The current task, with storage mounts set.

Return type

self

Raises

ValueError – if input paths are invalid.

update_storage_mounts(storage_mounts)[source]

Updates the storage mounts for this task.

Different from set_storage_mounts(), this function updates the existing storage_mounts (calls dict.update()) rather than overwriting it.

This should be called before provisioning in order to take effect.

Parameters

storage_mounts (Dict[str, Storage]) – a dict of {mount_path: sky.Storage object}, where mount_path is the path inside the remote VM(s) where the Storage object will be mounted.
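
Example (mirroring the set_storage_mounts() example; the bucket name is illustrative):

```python
# Adds to (or overrides entries in) the existing storage mounts,
# leaving previously set mounts intact.
task.update_storage_mounts({
    '/remote/outputs': sky.Storage(name='my-output-bucket'),
})
```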

Returns

The current task, with storage mounts updated.

Return type

self

Raises

ValueError – if input paths are invalid.