Python API¶
SkyPilot offers a programmatic API in Python, which is used under the hood by the CLI.
Note
The Python API contains more experimental functions/classes than the CLI. That said, users have used it to develop several Python libraries.
For questions or requests for support, please reach out to the development team. Your feedback is much appreciated as we evolve this API!
Core API¶
sky.launch¶
- sky.launch(task, cluster_name=None, retry_until_up=False, idle_minutes_to_autostop=None, dryrun=False, down=False, stream_logs=True, backend=None, optimize_target=OptimizeTarget.COST, detach_setup=False, detach_run=False, no_setup=False)[source]¶
Launch a task.
The task’s setup and run commands are executed under the task’s workdir (when specified, it is synced to remote cluster). The task undergoes job queue scheduling on the cluster.
Currently, the first argument must be a sky.Task, or (EXPERIMENTAL advanced usage) a sky.Dag. In the latter case, it must currently contain a single task; support for pipelines/general DAGs is in experimental branches.
- Parameters
task (Union[Task, Dag]) – sky.Task, or sky.Dag (experimental; 1-task only) to launch.
cluster_name (Optional[str]) – name of the cluster to create/reuse. If None, auto-generate a name.
retry_until_up (bool) – whether to retry launching the cluster until it is up.
idle_minutes_to_autostop (Optional[int]) – automatically stop the cluster after this many minutes of idleness, i.e., no running or pending jobs in the cluster’s job queue. Idleness gets reset whenever setting-up/running/pending jobs are found in the job queue. Setting this flag is equivalent to running sky.launch(..., detach_run=True, ...) and then sky.autostop(idle_minutes=<minutes>). If not set, the cluster will not be autostopped.
down (bool) – tear down the cluster after all jobs finish (successfully or abnormally). If idle_minutes_to_autostop is also set, the cluster will be torn down after the specified idle time. Note that if errors occur during provisioning/data syncing/setting up, the cluster will not be torn down for debugging purposes.
dryrun (bool) – if True, do not actually launch the cluster.
stream_logs (bool) – if True, show the logs in the terminal.
backend (Optional[Backend]) – backend to use. If None, use the default backend (CloudVMRayBackend).
optimize_target (OptimizeTarget) – target to optimize for. Choices: OptimizeTarget.COST, OptimizeTarget.TIME.
detach_setup (bool) – if True, run setup in non-interactive mode as part of the job itself. You can safely ctrl-c to detach from logging, and it will not interrupt the setup process. To see the logs again after detaching, use sky logs. To cancel setup, cancel the job via sky cancel. Useful for long-running setup commands.
detach_run (bool) – if True, as soon as a job is submitted, return from this function and do not stream execution logs.
no_setup (bool) – if True, do not re-run setup commands.
Example
import sky
task = sky.Task(run='echo hello SkyPilot')
task.set_resources(
    sky.Resources(cloud=sky.AWS(), accelerators='V100:4'))
sky.launch(task, cluster_name='my-cluster')
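As a further illustration, here is a sketch combining autostop with a detached run. The cluster name and the `train.py` script are placeholders, and running it requires valid cloud credentials:

```python
import sky

# Hypothetical task; 'train.py' stands in for your own program.
task = sky.Task(run='python train.py', workdir='.')
task.set_resources(sky.Resources(cloud=sky.AWS(), accelerators='V100:4'))

# Submit the job, return immediately (detach_run=True), and autostop the
# cluster after 10 idle minutes -- equivalent to a detached sky.launch()
# followed by sky.autostop(idle_minutes=10).
sky.launch(task,
           cluster_name='my-cluster',
           detach_run=True,
           idle_minutes_to_autostop=10)
```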
- Return type
sky.exec¶
- sky.exec(task, cluster_name, dryrun=False, down=False, stream_logs=True, backend=None, detach_run=False)[source]¶
Execute a task on an existing cluster.
This function performs two actions:
workdir syncing, if the task has a workdir specified;
executing the task’s
run
commands.
All other steps (provisioning, setup commands, file mounts syncing) are skipped. If any of those specifications changed in the task, this function will not reflect those changes. To ensure a cluster’s setup is up to date, use sky.launch() instead.
Execution and scheduling behavior:
The task will undergo job queue scheduling, respecting any specified resource requirement. It can be executed on any node of the cluster with enough resources.
The task is run under the workdir (if specified).
The task is run non-interactively (without a pseudo-terminal or pty), so interactive commands such as htop do not work. Use ssh my_cluster instead.
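A minimal sketch of executing on an existing cluster; the cluster name and `eval.py` script are placeholders, and the cluster must already be UP:

```python
import sky

# Assumes 'my-cluster' was previously created (e.g., by sky.launch()).
# Only the workdir is re-synced; provisioning and setup are skipped.
task = sky.Task(run='python eval.py', workdir='.')
sky.exec(task, cluster_name='my-cluster')
```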
- Parameters
task (Union[Task, Dag]) – sky.Task, or sky.Dag (experimental; 1-task only) containing the task to execute.
cluster_name (str) – name of an existing cluster to execute the task.
down (bool) – tear down the cluster after all jobs finish (successfully or abnormally). If idle_minutes_to_autostop is also set, the cluster will be torn down after the specified idle time. Note that if errors occur during provisioning/data syncing/setting up, the cluster will not be torn down for debugging purposes.
dryrun (bool) – if True, do not actually execute the task.
stream_logs (bool) – if True, show the logs in the terminal.
backend (Optional[Backend]) – backend to use. If None, use the default backend (CloudVMRayBackend).
detach_run (bool) – if True, detach from logging once the task has been submitted.
- Raises
ValueError – if the specified cluster does not exist or is not in UP status.
- Return type
sky.stop¶
- sky.stop(cluster_name, purge=False)[source]¶
Stop a cluster.
Data on attached disks is not lost when a cluster is stopped. Billing for the instances will stop, while the disks will still be charged. Those disks will be reattached when restarting the cluster.
Currently, spot instance clusters cannot be stopped.
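A minimal usage sketch; 'my-cluster' is a placeholder name:

```python
import sky

# Stop the cluster. Attached disks are kept (and still billed), and the
# cluster can later be restarted with sky.start('my-cluster').
sky.stop('my-cluster')
```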
- Parameters
- Raises
ValueError – the specified cluster does not exist.
RuntimeError – failed to stop the cluster.
sky.exceptions.NotSupportedError – if the specified cluster is a spot cluster, or a TPU VM Pod cluster, or the managed spot controller.
- Return type
sky.start¶
- sky.start(cluster_name, idle_minutes_to_autostop=None, retry_until_up=False, down=False, force=False)[source]¶
Restart a cluster.
If a cluster is previously stopped (status is STOPPED) or failed in provisioning/runtime installation (status is INIT), this function will attempt to start the cluster. In the latter case, provisioning and runtime installation will be retried.
Auto-failover provisioning is not used when restarting a stopped cluster. It will be started on the same cloud, region, and zone that were chosen before.
If a cluster is already in the UP status, this function has no effect (unless force=True is passed).
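A usage sketch; 'my-cluster' is a placeholder name for a previously stopped cluster:

```python
import sky

# Restart the cluster, retrying across transient capacity errors
# (retry_until_up=True), and re-arm a 60-minute autostop.
sky.start('my-cluster',
          idle_minutes_to_autostop=60,
          retry_until_up=True)
```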
- Parameters
cluster_name (str) – name of the cluster to start.
idle_minutes_to_autostop (Optional[int]) – automatically stop the cluster after this many minutes of idleness, i.e., no running or pending jobs in the cluster’s job queue. Idleness gets reset whenever setting-up/running/pending jobs are found in the job queue. Setting this flag is equivalent to running sky.launch(..., detach_run=True, ...) and then sky.autostop(idle_minutes=<minutes>). If not set, the cluster will not be autostopped.
retry_until_up (bool) – whether to retry launching the cluster until it is up.
down (bool) – autodown the cluster: tear down the cluster after the specified minutes of idle time after all jobs finish (successfully or abnormally). Requires idle_minutes_to_autostop to be set.
force (bool) – whether to force start the cluster even if it is already up. Useful for upgrading the SkyPilot runtime.
- Raises
ValueError – argument values are invalid: (1) the specified cluster does not exist; (2) down is set to True but idle_minutes_to_autostop is None; (3) the specified cluster is the managed spot controller, and either idle_minutes_to_autostop is not None or down is True (omit them to use the default autostop settings).
sky.exceptions.NotSupportedError – if the cluster to restart was launched using a non-default backend that does not support this operation.
sky.exceptions.ClusterOwnerIdentitiesMismatchError – if the cluster to restart was launched by a different user.
- Return type
sky.down¶
- sky.down(cluster_name, purge=False)[source]¶
Tear down a cluster.
Tearing down a cluster will delete all associated resources (all billing stops), and any data on the attached disks will be lost. Accelerators (e.g., TPUs) that are part of the cluster will be deleted too.
For local on-prem clusters, this function does not terminate the local cluster, but instead removes the cluster from the status table and terminates the calling user’s running jobs.
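A usage sketch; 'my-cluster' is a placeholder name:

```python
import sky

# Irreversibly tear down the cluster: all billing stops and any data on
# the attached disks is lost.
sky.down('my-cluster')
```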
- Parameters
- Raises
ValueError – the specified cluster does not exist.
sky.exceptions.NotSupportedError – the specified cluster is the managed spot controller.
- Return type
sky.status¶
- sky.status(refresh=False)[source]¶
Get all cluster statuses.
Each returned value has the following fields:
{
    'name': (str) cluster name,
    'launched_at': (int) timestamp of last launch on this cluster,
    'handle': (ResourceHandle) an internal handle to the cluster,
    'last_use': (str) the last command/entrypoint that affected this cluster,
    'status': (sky.ClusterStatus) cluster status,
    'autostop': (int) idle time before autostop,
    'to_down': (bool) whether autodown is used instead of autostop,
    'metadata': (dict) metadata of the cluster,
}
Each cluster can have one of the following statuses:
INIT: The cluster may be live or down. It can happen in the following cases:
Ongoing provisioning or runtime setup. (A sky.launch() has started but has not completed.)
Or, the cluster is in an abnormal state, e.g., some cluster nodes are down, or the SkyPilot runtime is unhealthy. (To recover the cluster, try sky launch again on it.)
UP: Provisioning and runtime setup have succeeded and the cluster is live. (The most recent sky.launch() has completed successfully.)
STOPPED: The cluster is stopped and the storage is persisted. Use sky.start() to restart the cluster.
Autostop column:
The autostop column indicates the number of idle minutes (no jobs running) after which the cluster will be autostopped. If to_down is True, the cluster will be autodowned, rather than autostopped.
Getting up-to-date cluster statuses:
In normal cases where clusters are entirely managed by SkyPilot (i.e., no manual operations in cloud consoles) and no autostopping is used, the table returned by this command will accurately reflect the cluster statuses.
In cases where the clusters are changed outside of SkyPilot (e.g., manual operations in cloud consoles; unmanaged spot clusters getting preempted) or for autostop-enabled clusters, use refresh=True to query the latest cluster statuses from the cloud providers.
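A sketch of consuming the returned records. The filtering helper below is our own illustration, not part of the API; only the 'name' and 'status' fields documented above are accessed:

```python
def up_cluster_names(records):
    """Return the names of clusters whose status is UP.

    Works on the status record dicts documented above; only the 'name'
    and 'status' fields are accessed, so plain dicts work too.
    """
    return [r['name'] for r in records
            if getattr(r['status'], 'name', str(r['status'])) == 'UP']

# Usage against live clusters (requires cloud credentials):
#   import sky
#   records = sky.status(refresh=True)  # slower, but accurate for
#   print(up_cluster_names(records))    # autostop-enabled clusters
```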
sky.autostop¶
- sky.autostop(cluster_name, idle_minutes, down=False)[source]¶
Schedule an autostop/autodown for a cluster.
Autostop/autodown will automatically stop or teardown a cluster when it becomes idle for a specified duration. Idleness means there are no in-progress (pending/running) jobs in a cluster’s job queue.
Idleness time of a cluster is reset to zero, whenever:
A job is submitted (sky.launch() or sky.exec()).
The cluster has restarted.
An autostop is set when there is no active setting. (Namely, either there’s never any autostop setting set, or the previous autostop setting was canceled.) This is useful for restarting the autostop timer.
Example: say a cluster without any autostop set has been idle for 1 hour, then an autostop of 30 minutes is set. The cluster will not be immediately autostopped. Instead, the idleness timer only starts counting after the autostop setting was set.
When multiple autostop settings are specified for the same cluster, the last setting takes precedence.
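A usage sketch; 'my-cluster' is a placeholder name for a cluster that is UP:

```python
import sky

sky.autostop('my-cluster', idle_minutes=30)             # stop after 30 idle minutes
sky.autostop('my-cluster', idle_minutes=60, down=True)  # autodown instead; last setting wins
sky.autostop('my-cluster', idle_minutes=-1)             # negative value cancels the setting
```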
- Parameters
cluster_name (str) – name of the cluster.
idle_minutes (int) – the number of minutes of idleness (no pending/running jobs) after which the cluster will be stopped automatically. Setting to a negative number cancels any autostop/autodown setting.
down (bool) – if True, use autodown (tear down the cluster; non-restartable), rather than autostop (restartable).
- Raises
ValueError – if the cluster does not exist.
sky.exceptions.ClusterNotUpError – if the cluster is not UP.
sky.exceptions.NotSupportedError – if the cluster is not based on CloudVmRayBackend, or the cluster is a TPU VM Pod.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the current user is not the same as the user who created the cluster.
sky.exceptions.CloudUserIdentityError – if we fail to get the current user identity.
- Return type
Task¶
- class sky.Task(name=None, *, setup=None, run=None, envs=None, workdir=None, num_nodes=None, docker_image=None)[source]¶
Task: a computation to be run on the cloud.
- __init__(name=None, *, setup=None, run=None, envs=None, workdir=None, num_nodes=None, docker_image=None)[source]¶
Initializes a Task.
All fields are optional.
Task.run is the actual program: either a shell command to run (str) or a command generator for different nodes (lambda; see below).
Optionally, call Task.set_resources() to set the resource requirements for this task. If not set, a default CPU-only requirement is assumed (the same as sky cpunode).
All setters of this class, Task.set_*(), return self, i.e., they are fluent APIs and can be chained together.
Example
# A Task that will sync up local workdir '.', containing
# requirements.txt and train.py.
sky.Task(setup='pip install -r requirements.txt',
         run='python train.py',
         workdir='.')

# An empty Task for provisioning a cluster.
task = sky.Task(num_nodes=n).set_resources(...)

# Chaining setters.
sky.Task().set_resources(...).set_file_mounts(...)
- Parameters
name (Optional[str]) – A string name for the Task for display purposes.
setup (Optional[str]) – A setup command, which will be run before executing the run commands (run), and executed under workdir.
run (Union[str, Callable[[int, List[str]], Optional[str]], None]) – The actual command for the task. If not None, either a shell command (str) or a command generator (callable). If the latter, it must take a node rank and a list of node addresses as input and return a shell command (str); it is valid to return None for some nodes, in which case no commands are run on them. Run commands will be run under workdir. Note that the command generator should be a self-contained lambda.
envs (Optional[Dict[str, str]]) – A dictionary of environment variables to set before running the setup and run commands.
workdir (Optional[str]) – The local working directory. This directory will be synced to a location on the remote VM(s), and setup and run commands will be run under that location (thus, they can rely on relative paths when invoking binaries).
num_nodes (Optional[int]) – The number of nodes to provision for this Task. If None, treated as 1 node. If > 1, each node will execute its own setup/run command, where run can either be a str, meaning all nodes get the same command, or a lambda, with the semantics documented above.
docker_image (Optional[str]) – (EXPERIMENTAL: Only in effect when LocalDockerBackend is used.) The base docker image that this Task will be built on. Defaults to ‘gpuci/miniforge-cuda:11.4-devel-ubuntu18.04’.
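To make the command-generator form of run concrete, here is a sketch; the script name, flags, and two-node shape are hypothetical:

```python
from typing import List, Optional

def distributed_run(rank: int, ips: List[str]) -> Optional[str]:
    """Command generator for `run`: takes a node rank and the list of
    node addresses, and returns the shell command for that node."""
    head_ip = ips[0]
    if rank == 0:
        # Head node coordinates training.
        return f'python train.py --role head --head-ip {head_ip}'
    # Worker nodes connect to the head node.
    return f'python train.py --role worker --head-ip {head_ip}'

# Usage (requires SkyPilot and cloud credentials):
#   import sky
#   task = sky.Task(run=distributed_run, num_nodes=2)
#   sky.launch(task, cluster_name='my-cluster')
```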
- static from_yaml(yaml_path)[source]¶
Initializes a task from a task YAML.
Example
task = sky.Task.from_yaml('/path/to/task.yaml')
- Parameters
yaml_path (str) – file path to a valid task yaml file.
- Raises
ValueError – if the path gets loaded into a str instead of a dict; or if there are any other parsing errors.
- Return type
- set_resources(resources)[source]¶
Sets the required resources to execute this task.
If this function is not called for a Task, default resource requirements will be used (8 vCPUs).
- set_file_mounts(file_mounts)[source]¶
Sets the file mounts for this task.
Useful for syncing datasets, dotfiles, etc.
File mounts are a dictionary: {remote_path: local_path/cloud URI}. Local (or cloud) files/directories will be synced to the specified paths on the remote VM(s) where this Task will run.
Neither source nor destination paths can end with a slash.
Example
task.set_file_mounts({
    '~/.dotfile': '/local/.dotfile',
    # /remote/dir/ will contain the contents of /local/dir/.
    '/remote/dir': '/local/dir',
})
- Parameters
file_mounts (Optional[Dict[str, str]]) – an optional dict of {remote_path: local_path/cloud URI}, where remote means the VM(s) on which this Task will eventually run, and local means the node from which the task is launched.
- Returns
the current task, with file mounts set.
- Return type
self
- Raises
ValueError – if input paths are invalid.
- update_file_mounts(file_mounts)[source]¶
Updates the file mounts for this task.
Different from set_file_mounts(), this function updates into the existing file_mounts (calls dict.update()), rather than overwriting it.
This should be called before provisioning in order to take effect.
Example
task.update_file_mounts({
    '~/.config': '~/Documents/config',
    '/tmp/workdir': '/local/workdir/cnn-cifar10',
})
- Parameters
file_mounts (Dict[str, str]) – a dict of {remote_path: local_path/cloud URI}, where remote means the VM(s) on which this Task will eventually run, and local means the node from which the task is launched.
- Returns
the current task, with file mounts updated.
- Return type
self
- Raises
ValueError – if input paths are invalid.
- set_storage_mounts(storage_mounts)[source]¶
Sets the storage mounts for this task.
Storage mounts are a dictionary: {mount_path: sky.Storage object}, each of which mounts a sky.Storage object (a cloud object store bucket) to a path inside the remote cluster.
A sky.Storage object can be created by uploading from a local directory (setting source), or backed by an existing cloud bucket (setting name to the bucket name; or setting source to the bucket URI).
Example
task.set_storage_mounts({
    '/remote/imagenet/': sky.Storage(name='my-bucket',
                                     source='/local/imagenet'),
})
- Parameters
storage_mounts (Optional[Dict[str, Storage]]) – an optional dict of {mount_path: sky.Storage object}, where mount_path is the path inside the remote VM(s) where the Storage object will be mounted.
- Returns
The current task, with storage mounts set.
- Return type
self
- Raises
ValueError – if input paths are invalid.
- update_storage_mounts(storage_mounts)[source]¶
Updates the storage mounts for this task.
Different from set_storage_mounts(), this function updates into the existing storage_mounts (calls dict.update()), rather than overwriting it.
This should be called before provisioning in order to take effect.
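A sketch paralleling the set_storage_mounts() example above; the bucket names, paths, and run command are placeholders:

```python
import sky

task = sky.Task(run='python train.py')
task.set_storage_mounts({
    '/remote/imagenet/': sky.Storage(name='my-bucket'),
})
# Merge in an additional mount without clobbering the one above.
task.update_storage_mounts({
    '/remote/checkpoints/': sky.Storage(name='my-ckpt-bucket'),
})
```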
- Parameters
storage_mounts (Dict[str, Storage]) – a dict of {mount_path: sky.Storage object}, where mount_path is the path inside the remote VM(s) where the Storage object will be mounted.
- Returns
The current task, with storage mounts updated.
- Return type
self
- Raises
ValueError – if input paths are invalid.