hivemind.optim

This module contains decentralized optimizers that wrap regular PyTorch optimizers to collaboratively train a shared model. Depending on its exact type, an optimizer may average model parameters with peers, exchange gradients, or follow a more complicated distributed training strategy.

class hivemind.optim.CollaborativeOptimizer(opt: torch.optim.optimizer.Optimizer, *, dht: hivemind.dht.DHT, prefix: str, target_batch_size: int, batch_size_per_step: Optional[int] = None, scheduler: Optional[torch.optim.lr_scheduler._LRScheduler] = None, min_refresh_period: float = 0.5, max_refresh_period: float = 30, default_refresh_period: float = 3, expected_drift_peers: float = 3, expected_drift_rate: float = 0.2, performance_ema_alpha: float = 0.1, metadata_expiration: float = 60.0, averaging_timeout: Optional[float] = None, load_state_timeout: float = 600.0, step_tolerance: int = 1, reuse_grad_buffers: bool = False, accumulate_grads_on: Optional[torch.device] = None, client_mode: bool = False, verbose: bool = False, **kwargs)[source]

An optimizer that performs model updates only after peers have collectively accumulated a (large) target batch size.

These optimizers use the DHT to track how much progress the collaboration has made toward the target batch size. Once enough samples have been accumulated, the optimizers compute a weighted average of their statistics.
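The bookkeeping described above can be sketched in plain Python. This is an illustrative sketch, not hivemind code: the function names are hypothetical, each peer's statistics are represented as a flat list of floats, and the averaging weights are assumed to be per-peer sample counts (the text above does not say which weights are used).

```python
from typing import Dict, List


def ready_for_step(peer_samples: Dict[str, int], target_batch_size: int) -> bool:
    """True once the peers have collectively accumulated the target batch size."""
    return sum(peer_samples.values()) >= target_batch_size


def weighted_average(
    peer_vectors: Dict[str, List[float]], peer_samples: Dict[str, int]
) -> List[float]:
    """Average per-peer vectors, weighting each peer by its sample count (assumption)."""
    total = sum(peer_samples[peer] for peer in peer_vectors)
    dim = len(next(iter(peer_vectors.values())))
    return [
        sum(peer_vectors[peer][i] * peer_samples[peer] for peer in peer_vectors) / total
        for i in range(dim)
    ]


# Hypothetical progress reports, as they might be read back from shared metadata
peer_samples = {"peer_a": 96, "peer_b": 32}
peer_grads = {"peer_a": [1.0, 2.0], "peer_b": [5.0, -2.0]}

if ready_for_step(peer_samples, target_batch_size=128):
    averaged = weighted_average(peer_grads, peer_samples)  # [2.0, 1.0]
```

A peer that contributed three times as many samples pulls the average three times as hard, which is why the two peers above meet at [2.0, 1.0] rather than at the midpoint.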

Note:

This optimizer differs from regular PyTorch optimizers in two ways:

  • calling .step will zero out the gradients w.r.t. model parameters after each optimizer step
  • several .step calls may pass without updating model parameters while the optimizer waits for peers to accumulate enough samples
Parameters:
  • opt – a standard PyTorch optimizer, preferably a large-batch one such as LAMB or LARS
  • dht – a running hivemind.DHT daemon connected to other peers
  • prefix – a common prefix for all metadata stored by CollaborativeOptimizer in the DHT
  • target_batch_size – perform an optimizer step after all peers collectively accumulate this many samples
  • batch_size_per_step – before each call to .step, the user should accumulate gradients over this many samples
  • min_refresh_period – wait for at least this many seconds before fetching new collaboration state
  • max_refresh_period – wait for at most this many seconds before fetching new collaboration state
  • default_refresh_period – if no peers are detected, attempt to fetch collaboration state this often (seconds)
  • expected_drift_peers – assume that this many new peers can join between steps
  • expected_drift_rate – assume that this fraction of the current collaboration can join or leave between steps
  • bandwidth – peer’s network bandwidth for the purpose of load balancing (recommended: internet speed in Mbps)
  • step_tolerance – a peer can temporarily be delayed by this many steps without being deemed out of sync
  • performance_ema_alpha – smoothing value used to estimate this peer’s performance (training samples per second)
  • averaging_expiration – peer’s requests for averaging will be valid for this many seconds
  • metadata_expiration – peer’s metadata (e.g. samples processed) is stored onto DHT for this many seconds
  • averaging_timeout – if an averaging step hangs for this long, it will be cancelled.
  • load_state_timeout – wait for at most this many seconds before giving up on load_state_from_peers
  • scheduler – if specified, use this scheduler to update optimizer learning rate
  • reuse_grad_buffers – if True, use model’s .grad buffers for gradient accumulation. This is more memory efficient, but it requires that the user does NOT call model/opt zero_grad at all
  • accumulate_grads_on – if specified, accumulate gradients on this device. By default, this will use the same device as model parameters. One can specify a different device (e.g. ‘cpu’ vs ‘cuda’) to save device memory at the cost of extra time per step. If reuse_grad_buffers is True, this parameter has no effect.
  • client_mode – if True, runs training without incoming connections, in a firewall-compatible mode
  • kwargs – additional parameters forwarded to DecentralizedAverager
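As an illustration of what performance_ema_alpha controls, the sketch below smooths a samples-per-second estimate with a plain exponential moving average. The class name is hypothetical, and the library's own estimator may differ in detail (for example in how it initializes or corrects the average).

```python
from typing import Optional


class SamplesPerSecondEMA:
    """Plain exponential moving average of training throughput (illustrative only)."""

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, num_samples: int, elapsed_seconds: float) -> float:
        rate = num_samples / elapsed_seconds
        if self.value is None:
            self.value = rate  # the first measurement seeds the estimate
        else:
            # A larger alpha reacts faster to changes; a smaller alpha is smoother
            self.value = self.alpha * rate + (1 - self.alpha) * self.value
        return self.value


ema = SamplesPerSecondEMA(alpha=0.5)
ema.update(num_samples=32, elapsed_seconds=1.0)  # 32.0
ema.update(num_samples=64, elapsed_seconds=1.0)  # 48.0
```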
Note:

The expected collaboration drift parameters (expected_drift_peers and expected_drift_rate) adjust how often this optimizer refreshes the collaboration-wide statistics, so that it does not miss the moment when the next step should run.
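One plausible way to turn these parameters into a refresh interval is sketched below. It is an assumption about how such a rule could look, not a statement of the library's exact formula: the function name and its first four arguments are hypothetical, while the keyword defaults are copied from the constructor signature.

```python
def time_to_next_fetch(
    samples_accumulated: int,
    target_batch_size: int,
    samples_per_second: float,
    num_peers: int,
    *,
    min_refresh_period: float = 0.5,
    max_refresh_period: float = 30.0,
    default_refresh_period: float = 3.0,
    expected_drift_peers: float = 3.0,
    expected_drift_rate: float = 0.2,
) -> float:
    """Seconds to wait before fetching the collaboration state again."""
    if num_peers == 0 or samples_per_second <= 0:
        return default_refresh_period  # nobody detected: poll at the default rate

    remaining = max(0, target_batch_size - samples_accumulated)
    eta = remaining / samples_per_second

    # Pessimistic peer count: new peers may join and speed the collaboration up
    expected_max_peers = max(
        num_peers + expected_drift_peers, num_peers * (1 + expected_drift_rate)
    )
    eta_with_drift = eta * num_peers / expected_max_peers

    # Never poll more often than min_refresh_period or less often than max_refresh_period
    return min(max(eta_with_drift, min_refresh_period), max_refresh_period)
```

With 2 peers processing 100 samples per second in total and 1000 samples remaining, the naive estimate is 10 seconds, but allowing for up to 5 peers shortens the wait to 4 seconds.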

Note:

If you are using CollaborativeOptimizer with a learning rate scheduler, it is recommended to pass the scheduler explicitly into this class. Otherwise, the scheduler may not be synchronized between peers.
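A minimal construction sketch that passes the scheduler explicitly, as recommended above. It assumes a hivemind release that still ships CollaborativeOptimizer with the signature shown in this section. The prefix, batch sizes, learning rate and decay schedule are illustrative values, the helper names are hypothetical, and start=True is an assumption: it is not part of the signature above and is expected to be forwarded through **kwargs to the averager.

```python
def collaboration_kwargs() -> dict:
    """Keyword arguments named in the signature above; the values are illustrative."""
    return dict(
        prefix="my_experiment",     # must be identical on every peer
        target_batch_size=4096,
        batch_size_per_step=32,
        verbose=True,
    )


def build_collaborative_optimizer(model, initial_peers=()):
    # Imported lazily so this sketch can be read and loaded without torch or hivemind
    import torch
    from hivemind import DHT
    from hivemind.optim import CollaborativeOptimizer

    dht = DHT(initial_peers=list(initial_peers), start=True)
    base_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        base_opt, lr_lambda=lambda step: 0.95 ** step
    )
    return CollaborativeOptimizer(
        base_opt,
        dht=dht,
        scheduler=scheduler,  # passed in explicitly so it can stay in sync across peers
        start=True,           # assumption: forwarded via **kwargs to the averager
        **collaboration_kwargs(),
    )
```

The scheduler is handed to the wrapper rather than stepped by the training loop, so the caller should not call scheduler.step() separately.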

step(batch_size: Optional[int] = None, grad_scaler: Optional[hivemind.optim.grad_scaler.HivemindGradScaler] = None, **kwargs)[source]

Report that gradients were accumulated over batch_size additional samples and, once the collaboration has accumulated enough samples, update model parameters.

Parameters:
  • batch_size – optional override for batch_size_per_step from init
  • grad_scaler – if amp is enabled, this must be a hivemind-aware gradient scaler
Note:

This .step differs from that of normal PyTorch optimizers in several key ways. See __init__ for details.
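The calling pattern can be sketched as a small loop that reports the true size of each batch through the batch_size override. The train function and its backward callback are hypothetical. The loop only relies on the wrapper exposing step(batch_size=...) and zero_grad(), and it assumes reuse_grad_buffers is False, since zero_grad must not be called otherwise.

```python
from typing import Callable, Iterable, Sequence


def train(
    collab_opt,
    batches: Iterable[Sequence],
    backward: Callable[[Sequence], None],
) -> int:
    """Run one pass over `batches`; return the number of .step calls made."""
    calls = 0
    for batch in batches:
        collab_opt.zero_grad()  # only valid when reuse_grad_buffers=False (assumed)
        backward(batch)         # e.g. compute the loss and call loss.backward()
        # Override batch_size_per_step when batches vary in size, so that the
        # collaboration-wide sample count stays accurate
        collab_opt.step(batch_size=len(batch))
        calls += 1
    return calls
```

The return value counts .step calls, not parameter updates: an update happens only on the calls where the collaboration has reached target_batch_size.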

class hivemind.optim.CollaborativeAdaptiveOptimizer(opt: torch.optim.optimizer.Optimizer, average_opt_statistics: Sequence[str], **kwargs)[source]

Behaves exactly like CollaborativeOptimizer, except that it:

  • averages adaptive learning rates of an optimizer
  • doesn’t average gradients
Parameters:
  • average_opt_statistics – names of the optimizer statistics in the optimizer's state dict that should be averaged across peers
  • kwargs – options for CollaborativeOptimizer
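To illustrate what averaging statistics by name means, the sketch below averages only the listed entries of several per-peer optimizer states. It is not hivemind code: the function is hypothetical, states are plain dictionaries of float lists, and a uniform average is assumed for simplicity. The name "exp_avg_sq" is the key under which PyTorch's Adam stores its second-moment estimate, which is what determines its adaptive learning rates.

```python
from typing import Dict, List, Sequence


def average_named_statistics(
    peer_states: List[Dict[str, List[float]]],
    average_opt_statistics: Sequence[str],
) -> Dict[str, List[float]]:
    """Uniformly average only the named statistics; all other entries are ignored."""
    averaged = {}
    for name in average_opt_statistics:
        columns = zip(*(state[name] for state in peer_states))
        averaged[name] = [sum(column) / len(peer_states) for column in columns]
    return averaged


# Hypothetical Adam-like states from two peers
peer_states = [
    {"exp_avg": [0.1, 0.2], "exp_avg_sq": [1.0, 4.0]},
    {"exp_avg": [0.3, 0.4], "exp_avg_sq": [3.0, 0.0]},
]
result = average_named_statistics(peer_states, average_opt_statistics=("exp_avg_sq",))
# result == {"exp_avg_sq": [2.0, 2.0]}; "exp_avg" stays local to each peer
```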