Tutorial: DNN Training
This example uses SkyPilot to train a Transformer-based language model from HuggingFace.
First, define a task YAML with resource requirements, the setup commands, and the commands to run:
# dnn.yaml

name: huggingface

resources:
  accelerators: V100:4

# Optional: upload a working directory to remote ~/sky_workdir.
# Commands in "setup" and "run" will be executed under it.
#
# workdir: .

# Optional: upload local files.
# Format:
#   /remote/path: /local/path
#
# file_mounts:
#   ~/.vimrc: ~/.vimrc
#   ~/.netrc: ~/.netrc

setup: |
  set -e  # Exit if any command fails.
  git clone https://github.com/huggingface/transformers/ || true
  cd transformers
  pip install .
  cd examples/pytorch/text-classification
  pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

run: |
  set -e  # Exit if any command fails.
  cd transformers/examples/pytorch/text-classification
  python run_glue.py \
    --model_name_or_path bert-base-cased \
    --dataset_name imdb \
    --do_train \
    --max_seq_length 128 \
    --per_device_train_batch_size 32 \
    --learning_rate 2e-5 \
    --max_steps 50 \
    --output_dir /tmp/imdb/ --overwrite_output_dir \
    --fp16
Then, launch training:
$ sky launch -c lm-cluster dnn.yaml
This will provision a cluster with the requested resources, execute the setup commands, and then execute the run commands.
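Once the cluster is up, you can inspect it and submit additional jobs to it without re-provisioning. A quick sketch using the SkyPilot CLI (the cluster name matches the launch command above):

# Show all clusters managed by SkyPilot, including lm-cluster.
$ sky status

# Re-run only the "run" commands on the existing cluster,
# skipping provisioning and setup.
$ sky exec lm-cluster dnn.yaml

sky exec is useful for iterating: edit the run section locally and resubmit without paying the setup cost again.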
After the training job starts running, you can safely Ctrl-C to detach from logging; the job will continue to run remotely on the cluster. To stop the job, use the sky cancel <cluster_name> <job_id> command (refer to the CLI reference).
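To find a job's ID, list the jobs on the cluster first. A short sketch (the job ID 1 below is illustrative, assuming this is the first job submitted to the cluster):

# List jobs on the cluster, with their IDs and statuses.
$ sky queue lm-cluster

# Re-attach to the logs of a running job.
$ sky logs lm-cluster 1

# Stop the job.
$ sky cancel lm-cluster 1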
After training, transfer artifacts such as logs and checkpoints using familiar tools.
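For example, SkyPilot sets up SSH access under the cluster's name, so standard rsync or scp should work; the paths below are illustrative, with /tmp/imdb/ taken from the --output_dir in the YAML above:

# Copy the checkpoint directory from the cluster to the local machine.
$ rsync -Pavz lm-cluster:/tmp/imdb/ ./imdb-output/

# When finished, tear down the cluster to stop incurring charges.
$ sky down lm-cluster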