hivemind.optim

This module contains decentralized optimizers that wrap regular PyTorch optimizers to collaboratively train a shared model. Depending on its exact type, an optimizer may average model parameters with peers, exchange gradients, or follow a more complicated distributed training strategy.

class hivemind.optim.CollaborativeOptimizer(opt: torch.optim.optimizer.Optimizer, *, dht: hivemind.dht.DHT, prefix: str, target_batch_size: int, batch_size_per_step: Optional[int] = None, scheduler: Optional[torch.optim.lr_scheduler._LRScheduler] = None, min_refresh_period: float = 0.5, max_refresh_period: float = 30, default_refresh_period: float = 3, expected_drift_peers: float = 3, expected_drift_rate: float = 0.2, performance_ema_alpha: float = 0.1, metadata_expiration: float = 60.0, averaging_timeout: Optional[float] = None, step_tolerance: int = 1, reuse_grad_buffers: bool = False, accumulate_grads_on: Optional[torch.device] = None, client_mode: bool = False, verbose: bool = False, **kwargs)

An optimizer that performs model updates after collaboratively accumulating a (large) target batch size across peers.

These optimizers use the DHT to track how much progress the collaboration has made toward the target batch size. Once enough samples have been accumulated, the optimizers compute a weighted average of their statistics.
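The weighted average mentioned above can be illustrated with a minimal sketch. It assumes each peer's weight is the number of samples it accumulated, which this page does not state explicitly. The `weighted_average` helper and the peer data are hypothetical and are not part of hivemind.

```python
from typing import List, Sequence


def weighted_average(values: Sequence[List[float]], weights: Sequence[float]) -> List[float]:
    """Average per-peer vectors, weighting each peer by its sample count (an assumption)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one peer must contribute samples")
    dim = len(values[0])
    return [sum(w * v[i] for v, w in zip(values, weights)) / total for i in range(dim)]


# Hypothetical collaboration: three peers with different numbers of accumulated samples.
peer_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
peer_samples = [100, 300, 600]
averaged = weighted_average(peer_grads, peer_samples)
```

With these weights, a peer that processed more samples pulls the result further toward its own values than a peer that processed few.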

Note:	This optimizer behaves unlike regular PyTorch optimizers in two ways:
calling .step will periodically zero out gradients w.r.t. model parameters after each step
it may take multiple .step calls without updating model parameters, waiting for peers to accumulate enough samples
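The two differences in this note can be seen in a toy, single-process sketch. A local counter stands in for the DHT-based progress tracking, and `ToyCollaborativeStep` is a hypothetical class for illustration only, not hivemind's implementation.

```python
class ToyCollaborativeStep:
    """Toy stand-in for the .step behavior described in the note; not hivemind API."""

    def __init__(self, target_batch_size: int, batch_size_per_step: int):
        self.target_batch_size = target_batch_size
        self.batch_size_per_step = batch_size_per_step
        self.samples_accumulated = 0
        self.num_updates = 0

    def step(self, batch_size=None) -> bool:
        """Report new samples; return True only when a parameter update happens."""
        self.samples_accumulated += self.batch_size_per_step if batch_size is None else batch_size
        if self.samples_accumulated < self.target_batch_size:
            return False  # keep accumulating; parameters are left untouched
        self.num_updates += 1         # the wrapped optimizer would update parameters here
        self.samples_accumulated = 0  # accumulated gradients would be zeroed here
        return True


opt = ToyCollaborativeStep(target_batch_size=64, batch_size_per_step=16)
results = [opt.step() for _ in range(8)]
```

In this sketch only every fourth call reaches the target of 64 samples, so most calls to `step` return without touching the parameters.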
Parameters:
opt – a standard PyTorch optimizer, preferably a large-batch one such as LAMB, LARS, etc.
dht – a running hivemind.DHT daemon connected to other peers
prefix – a common prefix for all metadata stored by CollaborativeOptimizer in the DHT
target_batch_size – perform an optimizer step after all peers collectively accumulate this many samples
batch_size_per_step – before each call to .step, the user should accumulate gradients over this many samples
min_refresh_period – wait for at least this many seconds before fetching new collaboration state
max_refresh_period – wait for at most this many seconds before fetching new collaboration state
default_refresh_period – if no peers are detected, attempt to fetch collaboration state this often (seconds)
expected_drift_peers – assume that this many new peers can join between steps
expected_drift_rate – assume that this fraction of the current collaboration can join or leave between steps
bandwidth – peer’s network bandwidth for the purpose of load balancing (recommended: internet speed in Mbps)
step_tolerance – a peer can temporarily be delayed by this many steps without being deemed out of sync
performance_ema_alpha – smoothing value used to estimate this peer’s performance (training samples per second)
averaging_expiration – peer’s requests for averaging will be valid for this many seconds
metadata_expiration – peer’s metadata (e.g. samples processed) is stored in the DHT for this many seconds
averaging_timeout – if an averaging step hangs for this long, it will be cancelled
scheduler – if specified, use this scheduler to update the optimizer learning rate
reuse_grad_buffers – if True, use the model’s .grad buffers for gradient accumulation. This is more memory efficient, but it requires that the user does NOT call model/opt zero_grad at all
accumulate_grads_on – if specified, accumulate gradients on this device. By default, this will use the same device as model parameters. One can specify a different device (e.g. ‘cpu’ vs ‘cuda’) to save device memory at the cost of extra time per step. If reuse_grad_buffers is True, this parameter has no effect.
client_mode – if True, runs training without incoming connections, in a firewall-compatible mode
kwargs – additional parameters forwarded to DecentralizedAverager
Note:	The expected collaboration drift parameters are used to adjust how often this optimizer refreshes the collaboration-wide statistics, so that it does not miss the moment to run the next step.
Note:	If you are using CollaborativeOptimizer with a learning rate scheduler, it is recommended to pass the scheduler explicitly into this class. Otherwise, the scheduler may not be synchronized between peers.
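To make the interplay of the refresh-period and drift parameters more concrete, here is a hedged sketch of one way such a delay could be derived. The formula is an illustrative assumption, not a statement of what hivemind computes. `next_refresh_delay` and its first three arguments are hypothetical names, while the keyword defaults mirror the constructor signature above.

```python
def next_refresh_delay(
    samples_remaining: int,
    samples_per_second: float,
    num_peers: int,
    *,
    min_refresh_period: float = 0.5,
    max_refresh_period: float = 30.0,
    default_refresh_period: float = 3.0,
    expected_drift_peers: float = 3.0,
    expected_drift_rate: float = 0.2,
) -> float:
    """Seconds to wait before fetching collaboration state again (illustrative heuristic)."""
    if num_peers == 0 or samples_per_second <= 0:
        return default_refresh_period  # no peers detected: poll at the default rate
    time_to_next_step = max(0, samples_remaining) / samples_per_second
    # Pessimistic peer count: allow for either an absolute or a relative influx of peers.
    expected_max_peers = max(num_peers + expected_drift_peers, num_peers * (1 + expected_drift_rate))
    # Extra peers would finish the batch sooner, so shrink the estimate accordingly.
    delay = time_to_next_step * num_peers / expected_max_peers
    # Clamp the result to the configured refresh window.
    return min(max(delay, min_refresh_period), max_refresh_period)


typical = next_refresh_delay(samples_remaining=1000, samples_per_second=100.0, num_peers=2)
almost_done = next_refresh_delay(samples_remaining=0, samples_per_second=100.0, num_peers=2)
far_away = next_refresh_delay(samples_remaining=100000, samples_per_second=10.0, num_peers=20)
no_peers = next_refresh_delay(samples_remaining=1000, samples_per_second=0.0, num_peers=0)
```

Larger drift values make the sketch refresh sooner, trading extra DHT traffic for a lower chance of missing the next step.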

step(batch_size: Optional[int] = None, **kwargs)

Report that gradients were accumulated over batch_size additional samples and optionally update model parameters.

Parameters:	batch_size – optional override for batch_size_per_step from init
Note:	this .step is different from normal pytorch optimizers in several key ways. See __init__ for details.

class hivemind.optim.CollaborativeAdaptiveOptimizer(opt: torch.optim.optimizer.Optimizer, average_opt_statistics: Sequence[str], **kwargs)

Behaves exactly like CollaborativeOptimizer, except that it:

averages adaptive learning rates of an optimizer
doesn’t average gradients

Parameters:
average_opt_statistics – average the optimizer statistics stored under these names in the optimizer’s state dict
kwargs – options for CollaborativeOptimizer
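As a rough illustration of averaging only selected optimizer statistics, the sketch below takes a plain element-wise mean over the named entries and ignores everything else. The uniform mean, the `average_named_statistics` helper, and the flat dict layout are all assumptions made for this example. The keys `exp_avg` and `exp_avg_sq` are the names Adam-style PyTorch optimizers use for their moment estimates.

```python
from typing import Dict, List, Sequence


def average_named_statistics(
    peer_states: Sequence[Dict[str, List[float]]],
    average_opt_statistics: Sequence[str],
) -> Dict[str, List[float]]:
    """Element-wise mean of the selected statistics across peers; other entries are ignored."""
    averaged = {}
    for name in average_opt_statistics:
        # Group the i-th element of this statistic from every peer, then average each group.
        columns = zip(*(state[name] for state in peer_states))
        averaged[name] = [sum(col) / len(peer_states) for col in columns]
    return averaged


# Hypothetical per-peer optimizer statistics, flattened to plain lists for brevity.
peer_states = [
    {"exp_avg_sq": [0.2, 0.4], "exp_avg": [1.0, 1.0]},
    {"exp_avg_sq": [0.4, 0.8], "exp_avg": [3.0, 3.0]},
]
result = average_named_statistics(peer_states, ["exp_avg_sq"])
```

Only `exp_avg_sq` appears in the result here, because it is the only name passed in `average_opt_statistics`.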