Launching the ENOT framework on multiple GPUs (or CPUs)

The ENOT framework uses the torch.distributed package to parallelize computations over multiple training workers. For a complete description of the PyTorch distributed package, please see its documentation.

Note

If you are familiar with distributed training, we recommend starting with the Distributed (multi-gpu / multi-node) pretrain example.

Warning

Currently, only the pretrain phase supports distributed training. We plan to add distributed search in the near future.

Launching distributed training

The easiest way to launch your application in a distributed manner is to use the launch utility (the torch.distributed.launch script).

A simple example bash script using the launch utility:

# Will use four GPUs: 0, 1, 4 and 6.
CUDA_VISIBLE_DEVICES=0,1,4,6            \
  python -m torch.distributed.launch    \
    --nproc_per_node=4                  \
    /path/to/your_main_script.py        \
      --lr 0.1                          \
      --data ../data/mnist/

For a more detailed description of the launch utility, please refer to its documentation.

There are other ways to run your program in a distributed manner, too. First of all, Elastic Launch is a newer, rapidly developing utility for launching distributed runs in an error-resilient manner. You can also spawn worker processes manually, following the PyTorch distributed package tutorial.
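As a rough illustration of the manual approach, the sketch below spawns one worker process per GPU with torch.multiprocessing and initializes the default process group in each worker. It follows the generic pattern from the PyTorch tutorial rather than any ENOT-specific API; the address, port, and world size are placeholder values, and it assumes four GPUs on a single machine.

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 4  # number of worker processes (one per GPU)


def worker(rank: int) -> None:
    # Each spawned process initializes the default process group itself.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # placeholder address
    os.environ.setdefault("MASTER_PORT", "29500")      # placeholder port
    torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        rank=rank,
        world_size=WORLD_SIZE,
    )
    # ... build the model / search space and train here ...
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, nprocs=WORLD_SIZE, join=True)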

Distributed training requirements

Warning

DO NOT WRAP your model or search space with the DistributedDataParallel module.

To adapt your program for distributed training:

  1. Your main script should correctly parse the local_rank argument (see the launch utility documentation).

  2. Initialize the distributed process group. The best option is to use the enot.distributed.init_torch() function; please see its documentation for more information. ENOT distributed functions require that the default process group is initialized and the default CUDA device is set when running multi-GPU training (init_torch() does both automatically). A simple manual initialization might look like:

    # LOCAL_RANK is exported by the launch utility as a string.
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group(backend="nccl", init_method="env://")
    
  3. After creating the search space, broadcast it from the master process to all other processes. This can easily be done with the enot.distributed.sync_model() function. Example synchronization code snippet:

    device = ...
    search_space = SearchSpaceModel(model).to(device=device)
    # Broadcast parameters and buffers from the master process to all workers.
    sync_model(search_space, reduce_parameters=False, reduce_buffers=False)
    

    Warning

    Always call this function with the reduce flags set to False!

  4. Restrict all IO operations to run only in the master (main) process. When launching distributed training on multiple machines, there are several local master processes (one per machine) and a single global master process, which is unique across the whole process group (by default, the process group contains all of your distributed worker processes). You should typically log to the console only in the local master processes, and dump model files only in the global master process.

    This can be achieved by guarding the selected code sections with the following statements (using the is_master() and is_local_master() functions):

    if is_master():
        # Do the master-only work here.
        # (usually, IO operations and different stuff not related to network training)
    
    if is_local_master():
        # Do console logging here.
    
  5. Dataloaders should use DistributedSampler correctly. You can find example usage in the PyTorch documentation (see the torch.utils.data.distributed.DistributedSampler class documentation) and in the combined sketch after this list.
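To show how the steps above fit together, here is a minimal skeleton of an adapted main script. It is a sketch under a few assumptions rather than a complete ENOT pretrain script: the import paths for sync_model, is_master, is_local_master, and SearchSpaceModel are assumed (matching the function references above), the build_model() and build_dataset() helpers are hypothetical placeholders, and the actual pretrain step is omitted.

import os

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Assumed import paths; verify them against the ENOT documentation.
from enot.distributed import is_local_master, is_master, sync_model
from enot.models import SearchSpaceModel


def main():
    # Steps 1-2: parse the local rank set by the launch utility and initialize
    # the default process group (or simply use enot.distributed.init_torch()).
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")
    device = torch.device("cuda", local_rank)

    # Step 3: build the search space and broadcast it from the master process.
    model = build_model()  # hypothetical helper returning your nn.Module
    search_space = SearchSpaceModel(model).to(device=device)
    sync_model(search_space, reduce_parameters=False, reduce_buffers=False)

    # Step 5: shard the dataset across workers with DistributedSampler.
    dataset = build_dataset()  # hypothetical helper returning your Dataset
    sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, batch_size=64, sampler=sampler)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # gives each epoch a different shuffling
        for batch in dataloader:
            ...  # the ENOT pretrain step goes here (omitted)

        # Step 4: IO only in master processes.
        if is_local_master():
            print(f"Epoch {epoch} finished.")  # console logging, once per machine
        if is_master():
            torch.save(search_space.state_dict(), "checkpoint.pth")  # once per job


if __name__ == "__main__":
    main()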

Useful utilities

You can find all of the ENOT framework's useful distributed utilities in the Enot distributed package.
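For orientation, the utilities referenced on this page might be imported as follows; the exact module layout is an assumption, so verify it against the package documentation.

# Assumed import paths for the utilities referenced on this page.
from enot.distributed import init_torch, is_local_master, is_master, sync_model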