Grid Search
To submit multiple trials with different hyperparameters to a cluster:
$ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-3
$ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 3e-3
$ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-4
$ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-2
$ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-6
Options used:
--gpus: specify the resource requirement for each job.
-d/--detach: detach the run and logging from the terminal, allowing multiple trials to run concurrently.
If there are only 4 V100 GPUs on the cluster, SkyPilot will run 4 of the 5 jobs in parallel and queue the remaining one. Once a running job finishes, the queued job will begin executing immediately. Refer to Job Queue for more details on SkyPilot’s scheduling behavior.
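The five submissions above differ only in the learning rate, so they can also be generated by a loop. The following is a minimal sketch rather than an official SkyPilot utility: the function name grid_search and the RUN environment variable are assumptions made for illustration, and the sky exec invocation is copied from the commands above. By default the sketch only prints the commands so they can be reviewed; setting RUN=1 submits them.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SkyPilot): one `sky exec` per learning rate.
# Default is a dry run that prints each command; set RUN=1 to actually submit.
grid_search() {
  local cluster="$1"
  shift
  local lr
  for lr in "$@"; do
    if [ "${RUN:-0}" = "1" ]; then
      # Same invocation as the hand-written commands above.
      sky exec "$cluster" --gpus V100:1 -d -- python train.py --lr "$lr"
    else
      echo "sky exec $cluster --gpus V100:1 -d -- python train.py --lr $lr"
    fi
  done
}

grid_search mycluster 1e-3 3e-3 1e-4 1e-2 1e-6
```

Because each job is submitted with -d, the loop returns as soon as the jobs are handed to the cluster, and the scheduling behavior described above applies unchanged.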
Multiple trials per GPU
To run multiple trials per GPU, use fractional GPUs in the resource requirement.
For example, use --gpus V100:0.5 to make 2 trials share 1 GPU:
$ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 1e-3
$ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 3e-3
...
When sharing a GPU, ensure that the GPU’s memory is not oversubscribed (otherwise, out-of-memory errors could occur).
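One way to choose the fraction is to estimate how many trials fit in GPU memory before submitting. The sketch below is an assumption-laden illustration: trials_per_gpu is a hypothetical helper, and both memory figures (in MiB) are example values that you would need to measure yourself. It considers memory only and does not query the GPU or SkyPilot.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SkyPilot): how many trials fit on one GPU
# by memory alone. Both arguments are in MiB and must be supplied by the user.
trials_per_gpu() {
  local gpu_mem_mib="$1"
  local trial_mem_mib="$2"
  # Shell integer division rounds down, so the result never oversubscribes.
  echo $(( gpu_mem_mib / trial_mem_mib ))
}

# Illustrative numbers only: a 16384 MiB GPU, trials peaking at 7000 MiB each.
n="$(trials_per_gpu 16384 7000)"
echo "trials per GPU: $n"   # prints "trials per GPU: 2"
```

With these example numbers, two trials fit per GPU, which corresponds to the --gpus V100:0.5 request shown above. If the result is 1 or less, sharing the GPU would risk the out-of-memory errors mentioned above.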