Enot distributed package

The enot.distributed package contains distributed training utilities. If you use distributed training with the ENOT framework, it is recommended to use the functions from this package.

Initialization

init_process_group(backend='nccl', init_method='env://', timeout=datetime.timedelta(seconds=1800))

Creates the default distributed process group if necessary and initializes the enot.distributed package.

Process group initialization relies on environment variables. Specifically, two environment variables must be set: “WORLD_SIZE” and “LOCAL_RANK”. They are set automatically by PyTorch launching scripts such as torch.distributed.launch and torch.distributed.run. If you use your own distributed training launch script, set these environment variables in the same way as the aforementioned scripts do.

The distributed process group is initialized only when the “WORLD_SIZE” environment variable is greater than 1. Before initialization, a number of checks are performed, and eventually torch.distributed.init_process_group is called to create the default process group.

In a distributed setting with the nccl backend, this function additionally sets the default GPU device index to the process local rank and initializes CUDA.

Parameters:
  • backend (str, optional) – “backend” argument of torch.distributed.init_process_group function. Default value is “nccl”.

  • init_method (str, optional) – “init_method” argument of torch.distributed.init_process_group function. Default value is “env://”.

  • timeout (timedelta, optional) – “timeout” argument of torch.distributed.init_process_group function. Default value is 30 minutes (1800 seconds).

Raises:

DistributedRuntimeException – On process group initialization errors.

Return type:

None
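
A minimal usage sketch, assuming the script is launched with torchrun (which sets “WORLD_SIZE” and “LOCAL_RANK” automatically); the training code itself is a placeholder:

    # launched e.g. as: torchrun --nproc_per_node=4 train.py
    from enot.distributed import init_process_group

    def main():
        # Creates the default process group only when WORLD_SIZE > 1;
        # in a single-process run this call is effectively a no-op.
        init_process_group(backend='nccl')
        ...  # build the model and dataloaders, run training

    if __name__ == '__main__':
        main()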

init_torch(seed=None, cuda_optimize_for_speed=False, cuda_reproducible=False, backend='nccl', init_method='env://', timeout=datetime.timedelta(seconds=1800))

Initializes the PyTorch framework for training in both single- and multi-process setups.

This function does the following:

  • prepares the random seed for the experiment, if seed is specified;

  • specifies certain cuda flags when cuda_optimize_for_speed or cuda_reproducible are set to True;

  • initializes distributed process group for multi-gpu (multi-process) training.

When specified, seed is set in Python’s random library, in NumPy’s random library, and in PyTorch.

Process group initialization relies on environment variables. Specifically, two environment variables must be set: “WORLD_SIZE” and “LOCAL_RANK”. They are set automatically by PyTorch launching scripts such as torch.distributed.launch and torch.distributed.run. If you use your own distributed training launch script, set these environment variables in the same way as the aforementioned scripts do.

The distributed process group is initialized only when the “WORLD_SIZE” environment variable is greater than 1. Before initialization, a number of checks are performed, and eventually torch.distributed.init_process_group is called to create the default process group.

In a distributed setting with the nccl backend, this function additionally sets the default GPU device index to the process local rank and initializes CUDA.

Parameters:
  • seed (int or None, optional) – Random seed for all randomizers. Defaults to None, which disables manual seeding.

  • cuda_optimize_for_speed (bool, optional) – Whether to set cuda-related optimization flags to enhance training speed. Mutually exclusive with cuda_reproducible flag. Makes training non-deterministic (non-reproducible runs even when the same seed is specified). Default value is False.

  • cuda_reproducible (bool, optional) – Whether to set cuda-related optimization flags to make training runs reproducible. Mutually exclusive with cuda_optimize_for_speed flag. May slightly decrease training speed. Default value is False.

  • backend (str, optional) – “backend” argument of torch.distributed.init_process_group function. Default value is “nccl”.

  • init_method (str, optional) – “init_method” argument of torch.distributed.init_process_group function. Default value is “env://”.

  • timeout (timedelta, optional) – “timeout” argument of torch.distributed.init_process_group function. Default value is 30 minutes (1800 seconds), matching the torch.distributed default.

Raises:
  • RuntimeError – When both cuda_optimize_for_speed and cuda_reproducible flags are set.

  • DistributedRuntimeException – On process group initialization errors.

Return type:

None
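
A typical call, sketched under the assumption that init_torch is imported directly from enot.distributed; the chosen seed and flags are illustrative:

    from enot.distributed import init_torch

    # Seed Python, NumPy and PyTorch RNGs, request reproducible CUDA behavior,
    # and initialize the process group if WORLD_SIZE > 1.
    # Setting both cuda_optimize_for_speed and cuda_reproducible raises RuntimeError.
    init_torch(
        seed=42,
        cuda_reproducible=True,
        backend='nccl',
    )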

Common distributed utilities

get_world_size()

Returns the number of parallel PyTorch processes in multi-process training.

Returns:

Number of processes.

Return type:

int

get_local_rank()

Returns the rank of the current process within its node.

Returns:

Local process rank.

Return type:

int
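
A short sketch of how these getters are commonly used; the device selection and global_batch_size are illustrative, not part of the enot API:

    import torch

    from enot.distributed import get_local_rank, get_world_size

    # Bind this process to its GPU by local rank (matches what
    # init_process_group does for the nccl backend).
    device = torch.device('cuda', get_local_rank())

    # Scale the per-process batch size by the number of processes.
    global_batch_size = 256
    per_process_batch_size = global_batch_size // get_world_size()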

is_dist()

Checks if distributed mode is enabled.

Returns:

Whether torch.distributed is initialized and world size is bigger than 1.

Return type:

bool

is_master()

Checks whether the current process is the global master of the torch.distributed process group.

Returns:

Whether the current process is the global master process.

Return type:

bool

is_local_master()

Checks whether the current process is the local (per-node) master of the torch.distributed process group.

Returns:

Whether the current process is a local master process.

Return type:

bool
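
A sketch of typical guards built from these checks; save_checkpoint, download_dataset and model are hypothetical placeholders:

    from enot.distributed import is_dist, is_local_master, is_master

    if is_master():
        # Write logs and checkpoints exactly once per job.
        save_checkpoint(model)

    if is_local_master():
        # Download data once per node to avoid concurrent writes.
        download_dataset()

    if is_dist():
        print('running in multi-process mode')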

master_only(fn)

Decorator which makes a procedure run only in the global master process.

Parameters:

fn (callable which returns None) – Function to call only in the master process.

Returns:

Function wrapped with is_master().

Return type:

callable

dist_only(fn)

Decorator which makes a procedure run only in a distributed (multi-process) setup.

Parameters:

fn (callable which returns None) – Function to call only in a distributed setup.

Returns:

Function wrapped with is_dist().

Return type:

callable
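
A sketch showing both decorators applied to procedures that return None; the logging functions are illustrative:

    from enot.distributed import dist_only, master_only

    @master_only
    def log_metrics(step, loss):
        # Runs only in the global master process.
        print(f'step={step} loss={loss:.4f}')

    @dist_only
    def report_distributed_run():
        # Runs only when distributed mode is enabled.
        print('distributed run detected')

    log_metrics(10, 0.1234)    # no-op on non-master processes
    report_distributed_run()   # no-op in single-process runs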

Model synchronization

sync_model(model, sync_parameters=True, sync_buffers=True, reduce_parameters=True, reduce_buffers=True)

Synchronizes model parameters and buffers across distributed processes.

After executing this function, all distributed worker processes hold identical copies of the model.

Parameters:
  • model (torch.nn.Module) – Model whose parameters will be synchronized.

  • sync_parameters (bool, optional) – Whether to synchronize model parameters or not. Default value is True.

  • sync_buffers (bool, optional) – Whether to synchronize model buffers or not. Default value is True.

  • reduce_parameters (bool, optional) – Whether to use reduction (cross-process averaging) to synchronize model parameters or not. When set to False, broadcasts parameters from the master process to all other processes instead. Default value is True.

  • reduce_buffers (bool, optional) – Whether to use reduction (cross-process averaging) to synchronize model buffers or not. When set to False, broadcasts buffers from the master process to all other processes instead. Default value is True.

Return type:

None
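
A sketch of synchronizing freshly initialized weights by broadcasting from the master process instead of averaging; the model is a placeholder and CUDA availability is assumed:

    import torch

    from enot.distributed import init_torch, sync_model

    init_torch(seed=0)

    model = torch.nn.Linear(128, 10).cuda()

    # Broadcast the master's parameters and buffers so that every worker
    # starts from an identical copy of the model.
    sync_model(
        model,
        reduce_parameters=False,
        reduce_buffers=False,
    )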