Launching the ENOT framework on multiple GPUs (or CPUs)
The ENOT framework uses the torch.distributed package to parallelize computations over multiple training workers. For a complete description of the PyTorch distributed package, please see its documentation.
Note
If you are familiar with distributed training, we recommend starting with the Distributed (multi-GPU / multi-node) pretrain example.
Warning
Currently, only the pretrain phase supports distributed training. We plan to add distributed search in the near future.
Launching distributed training
The easiest way to launch your application in a distributed manner is to use the launch utility (the torch.distributed.launch script).
A simple example bash script using the launch utility:
# Will use four GPUs: 0, 1, 4 and 6.
CUDA_VISIBLE_DEVICES=0,1,4,6 \
python -m torch.distributed.launch \
--nproc_per_node=4 \
/path/to/your_main_script.py \
--lr 0.1 \
--data ../data/mnist/
For a more detailed description of the launch utility, please refer to its documentation.
There are other ways to run your program in a distributed manner as well. Elastic Launch is a newer, rapidly developing utility for launching distributed runs in an error-resilient manner. You can also spawn worker processes manually, following the PyTorch distributed package tutorial.
Distributed training requirements
Warning
DO NOT WRAP your model or search space with the DistributedDataParallel module.
To adapt your program for distributed training:
Your main script should correctly parse the local_rank argument (see the launch utility documentation).
Initialize the distributed process group. The best option is to use the enot.distributed.init_torch() function; please see its documentation for more information. ENOT distributed functions require that the default process group is initialized and the default CUDA device is set when running multi-GPU training (this is done automatically by init_torch()). A simple manual initialization might look like:
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
dist.init_process_group(backend="nccl", init_method="env://")
After creation, you should broadcast your search space from the master process to all other processes. This can be easily done with the enot.distributed.sync_model() function. Example synchronization code snippet:
device = ...
search_space = SearchSpaceModel(model).to(device=device)
sync_model(search_space, reduce_parameters=False, reduce_buffers=False)
Warning
Always call this function with the reduce flags set to False!
Restrict all IO operations to run only in the master (main) process. When launching distributed training on multiple machines, there are multiple local master processes (one per machine) and a single global master process, which is unique for the whole process group (by default, the process group contains all of your distributed worker processes). You should probably log to the console only in local master processes, and dump model files only in the global master process.
This can be achieved by guarding the selected code sections with the following statements (using the is_master() and is_local_master() functions):
if is_master():
    # Do the master-only work here
    # (usually IO operations and other stuff not related to network training).
    ...

if is_local_master():
    # Do console logging here.
    ...
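For intuition only, rough equivalents of these two checks built on raw torch.distributed might look like the sketch below. This is an assumption about their semantics, not the actual ENOT implementation; the LOCAL_RANK environment variable is set per worker by the launch utility.

```python
import os

import torch.distributed as dist


def is_master() -> bool:
    # Global master: rank 0 in the default process group.
    return dist.get_rank() == 0


def is_local_master() -> bool:
    # Local master: rank 0 on the current machine.
    # LOCAL_RANK is set by the launch utility for each worker process.
    return int(os.environ.get("LOCAL_RANK", "0")) == 0
```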
Dataloaders should use DistributedSampler correctly. You can find example usage in the PyTorch documentation (see the torch.utils.data.distributed.DistributedSampler class documentation).
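A minimal sketch of wiring a DistributedSampler into a DataLoader. The dataset and batch size are placeholders; num_replicas and rank are normally inferred from the initialized process group and are passed explicitly here only so the sketch runs standalone. Remember to call set_epoch() at the start of every epoch so that each epoch gets a different shuffle:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset: 10 samples with 4 features each.
dataset = TensorDataset(torch.randn(10, 4))

# In real distributed code, num_replicas/rank default to the world size
# and rank of the initialized default process group.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for epoch in range(3):
    # Reseed the sampler so the shuffle order differs across epochs.
    sampler.set_epoch(epoch)
    for (batch,) in loader:
        pass  # training step goes here
```

Each of the two replicas sees its own half of the dataset (5 samples here), so together they cover the full dataset once per epoch.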
Useful utilities
You can find all of the ENOT framework's distributed utilities in the Enot distributed package.