#####################################################
Launching ENOT framework on multiple GPUs (or CPUs)
#####################################################

ENOT framework utilizes the `torch.distributed package `_ to parallelize
computations across multiple training workers. For a complete description of
the PyTorch distributed package, please see its `documentation `_.

.. note::

    **If you are familiar with distributed training, we recommend starting with the**
    :ref:`Distributed (multi-gpu / multi-node) pretrain example `.

.. warning::

    Currently, only the pretrain phase supports distributed training. We plan
    to add distributed search in the near future.

******************************
Launching distributed training
******************************

The easiest way to launch your application in a distributed manner is to use
the `Launch utility `_ (the ``torch.distributed.launch`` script).

A simple example bash script with the launch utility:

.. code-block:: bash

    # Will use four GPUs: 0, 1, 4 and 6.
    CUDA_VISIBLE_DEVICES=0,1,4,6 \
    python -m torch.distributed.launch \
        --nproc_per_node=4 \
        /path/to/your_main_script.py \
            --lr 0.1 \
            --data ../data/mnist/

For a more detailed description of the launch utility, please refer to its
documentation.

There are other ways to run your program in a distributed manner, too. First
of all, `Elastic Launch `_ is a new, rapidly developing utility for launching
distributed runs in an error-resilient manner. You can also spawn worker
processes manually, following the `PyTorch distributed package tutorial `_.

*********************************
Distributed training requirements
*********************************

.. warning::

    DO NOT WRAP your model or search space with the `DistributedDataParallel module `_.

To adapt your program for distributed training:

1. Your main script should correctly parse the ``local_rank`` argument (see the
   `Launch utility documentation `_); a minimal parsing sketch is shown after
   this list.

2. Initialize the distributed process group. The best option is to use the
   :func:`enot.distributed.init_torch` function; please see its documentation
   for more information. ENOT distributed functions require that the default
   process group is initialized and the default CUDA device is set when running
   multi-GPU training (this is done automatically by the :func:`.init_torch`
   function). A simple manual initialization might look like:

   .. code-block:: python

      import os

      import torch
      import torch.distributed as dist

      torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
      dist.init_process_group(backend="nccl", init_method="env://")

3. After creation, you should broadcast your search space from the master
   process to all other processes. This can easily be done with the
   :func:`enot.distributed.sync_model` function. Example synchronization code
   snippet:

   .. code-block:: python

      device = ...
      search_space = SearchSpaceModel(model).to(device=device)
      sync_model(search_space, reduce_parameters=False, reduce_buffers=False)

   .. warning::

      Always call this function with the ``reduce`` flags set to False!

4. Restrict all IO operations to run only in the master (main) process. When
   launching distributed training on multiple machines, there are multiple
   **local master** processes (one per machine) and a single **global master**
   process, which is unique for the whole process group (by default, the
   process group contains all of your distributed worker processes). You should
   probably log to the console only in local master processes, and dump model
   files only in the global master process. This can be achieved by guarding
   the selected code sections with the following checks (using the
   :func:`.is_master` and :func:`.is_local_master` functions):

   .. code-block:: python

      if is_master():
          # Do the master-only work here
          # (usually, IO operations and other work not related to network training).
          ...

      if is_local_master():
          # Do console logging here.
          ...

5. Dataloaders should use **DistributedSampler** correctly. You can find
   example usage in the `PyTorch documentation `_ (see the
   ``torch.utils.data.distributed.DistributedSampler`` class documentation);
   a minimal sketch also follows this list.
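The following is a minimal sketch of the argument parsing from step 1, not a
required implementation. It assumes that the launch utility passes
``--local_rank`` on the command line (newer launchers may only set the
``LOCAL_RANK`` environment variable, which is used here as a fallback); the
``--lr`` and ``--data`` options simply mirror the bash example above:

.. code-block:: python

    import argparse
    import os


    def parse_args():
        parser = argparse.ArgumentParser()
        # torch.distributed.launch passes --local_rank to every worker process;
        # fall back to the LOCAL_RANK environment variable if it is absent.
        parser.add_argument("--local_rank", type=int, default=int(os.environ.get("LOCAL_RANK", 0)))
        # Training options from the bash example above.
        parser.add_argument("--lr", type=float, default=0.1)
        parser.add_argument("--data", type=str, default="../data/mnist/")
        return parser.parse_args()


    args = parse_args()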
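The following is a minimal sketch of the ``DistributedSampler`` usage from
step 5. Here ``train_dataset`` and ``num_epochs`` are placeholders for your own
dataset and training schedule, and the per-process batch size of 64 is
arbitrary:

.. code-block:: python

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # `train_dataset` and `num_epochs` are placeholders for your own dataset and schedule.
    # Each worker receives its own shard of the dataset; the sampler shuffles by default.
    train_sampler = DistributedSampler(train_dataset)
    train_loader = DataLoader(
        train_dataset,
        batch_size=64,          # per-process batch size
        sampler=train_sampler,  # do not combine a sampler with shuffle=True
    )

    for epoch in range(num_epochs):
        # Re-seed the sampler so that shuffling differs between epochs
        # and stays consistent across all workers.
        train_sampler.set_epoch(epoch)
        for batch in train_loader:
            ...

Calling ``set_epoch`` before every epoch is the important part: without it,
each worker iterates over its shard in the same order every epoch.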
****************
Useful utilities
****************

You can find all useful distributed utilities from ENOT framework in :doc:`distributed`.

.. seealso::

    :ref:`Distributed (multi-gpu / multi-node) pretrain example `

    :doc:`distributed`

    `torch.distributed package documentation `_

    `torch.distributed package tutorial `_