hivemind.moe.client

This module lets you connect to a distributed Mixture-of-Experts or to individual experts hosted on remote machines.
class hivemind.moe.client.RemoteExpert(uid, endpoint: str)[source]

A simple module that runs the forward/backward pass of an expert hosted on a remote machine. Works seamlessly with pytorch autograd (this is essentially a simple RPC function).

Warning: RemoteExpert currently assumes that you provide it with correct input shapes. Sending inputs with wrong shapes can cause RemoteExpert to freeze indefinitely due to an error in the runtime.

Parameters:
  uid – unique expert identifier
  endpoint – network endpoint of a server that serves that expert, e.g. "201.123.321.99:1337" or "[::]:8080"

class hivemind.moe.client.RemoteMixtureOfExperts(*, in_features, grid_size: Tuple[int, ...], dht: hivemind.dht.DHT, uid_prefix: str, k_best: int, k_min: int = 1, forward_timeout: Optional[float] = None, timeout_after_k_min: Optional[float] = None, backward_k_min: int = 1, backward_timeout: Optional[float] = None, detect_anomalies: bool = False, allow_zero_outputs: bool = False, **dht_kwargs)[source]

A torch module that performs Mixture-of-Experts inference with a local gating function and multiple remote experts. Natively supports pytorch autograd.

Note: By default, not all experts are guaranteed to perform a forward pass. Moreover, not all of those that ran the forward pass are guaranteed to perform a backward pass. In the latter case, gradients will be averaged without the missing experts.

Parameters:
  in_features – common input size for experts and the gating function
  grid_size – dimensions that form an expert uid (see below)
  uid_prefix – common prefix for all expert uids (must end with '.')
  dht – a DHT instance used to search for the best experts
  k_best – average this many highest-scoring experts to compute activations
  k_min – make sure at least this many experts returned output (i.e. didn't fail)
  timeout_after_k_min – wait for this many seconds after k_min experts return results. Any expert that didn't manage to return output after that delay is considered unavailable
  detect_anomalies – whether to check input/output tensors for NaN and infinity values
  allow_zero_outputs – whether to return zeros if no experts respond on the forward pass

Note: expert uid follows the pattern {uid_prefix}.{0…grid_size[0]}.{0…grid_size[1]}…{0…grid_size[-1]}
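The uid pattern above can be made concrete with a short, hivemind-free sketch. The helper name `enumerate_expert_uids` is hypothetical (not part of the library) and merely lists every uid implied by a `uid_prefix` and a `grid_size`:

```python
# Hypothetical helper (not part of hivemind): enumerate all expert uids
# implied by a uid_prefix and grid_size, i.e. one uid per grid coordinate
# tuple (i_0, i_1, ...) with 0 <= i_d < grid_size[d].
from itertools import product
from typing import List, Tuple


def enumerate_expert_uids(uid_prefix: str, grid_size: Tuple[int, ...]) -> List[str]:
    assert uid_prefix.endswith("."), "uid_prefix must end with '.'"
    return [
        uid_prefix + ".".join(str(i) for i in coords)
        for coords in product(*(range(dim) for dim in grid_size))
    ]


# e.g. enumerate_expert_uids("expert.", (2, 2))
# -> ['expert.0.0', 'expert.0.1', 'expert.1.0', 'expert.1.1']
```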

forward(input: torch.Tensor, *args, **kwargs)[source]

Choose the k best experts with beam search, then call the chosen experts and average their outputs. The input tensor is averaged over all dimensions except the first and last (we assume that extra dimensions represent sequence length or image height/width).

Parameters:
  input – a tensor of values used to estimate the gating function, batch-first
  args – extra positional parameters that will be passed to each expert after input, batch-first
  kwargs – extra keyword parameters that will be passed to each expert, batch-first

Returns: averaged predictions of all experts that delivered results on time, a nested structure of batch-first tensors
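As a rough illustration of the return value, here is a plain-Python sketch (an assumption about the averaging rule, not hivemind's actual torch implementation): outputs of experts that failed to respond in time are dropped, and the surviving outputs are averaged with renormalized gating weights so the missing experts do not bias the result.

```python
# Illustrative sketch (not library code): average the outputs of the experts
# that responded in time. `None` marks an expert that timed out; its gating
# weight is redistributed among the survivors by renormalization.
from typing import List, Optional


def average_responses(outputs: List[Optional[List[float]]],
                      weights: List[float]) -> List[float]:
    alive = [(out, w) for out, w in zip(outputs, weights) if out is not None]
    if not alive:
        # roughly what allow_zero_outputs=False guards against
        raise RuntimeError("no experts responded in time")
    total = sum(w for _, w in alive)
    dim = len(alive[0][0])
    return [sum(out[i] * (w / total) for out, w in alive) for i in range(dim)]


# e.g. with the middle expert timed out:
# average_responses([[1.0, 2.0], None, [3.0, 4.0]], [0.5, 0.3, 0.5]) -> [2.0, 3.0]
```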

compute_expert_scores(grid_scores: List[torch.Tensor], batch_experts: List[List[hivemind.moe.client.expert.RemoteExpert]]) → torch.Tensor[source]

Compute scores for each expert by adding up grid scores, autograd-friendly.

Parameters:
  grid_scores – list of torch tensors; the i-th tensor contains scores for the i-th grid dimension
  batch_experts – list (batch) of lists of up to k experts selected for this batch

Returns: a tensor of scores, float32[batch_size, k]

Note: if some rows in the batch have fewer than the maximum number of experts, their scores will be padded with -inf
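The scoring rule can be sketched in plain Python. This is an illustrative assumption about the semantics (experts identified by their grid-coordinate tuples, lists standing in for tensors), not the library's autograd implementation:

```python
# Sketch (assumption, not library code): an expert's score is the sum of its
# coordinate's gating score along each grid dimension; rows with fewer than
# the maximum number of experts are padded with -inf.
from typing import List, Tuple


def compute_expert_scores(grid_scores: List[List[List[float]]],
                          batch_experts: List[List[Tuple[int, ...]]]) -> List[List[float]]:
    # grid_scores[d][b][i]: score of coordinate i in grid dimension d
    # for batch sample b; each expert is a tuple of grid coordinates.
    max_k = max(len(row) for row in batch_experts)
    scores = []
    for b, experts in enumerate(batch_experts):
        row = [sum(grid_scores[d][b][c] for d, c in enumerate(coords))
               for coords in experts]
        row += [float("-inf")] * (max_k - len(row))  # pad short rows
        scores.append(row)
    return scores
```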

class hivemind.moe.client.RemoteSwitchMixtureOfExperts(*, grid_size: Tuple[int, ...], utilization_alpha: float = 0.9, grid_dropout: float = 1.0, jitter_eps: float = 0.01, k_best=1, k_min=0, backward_k_min=0, allow_zero_outputs=True, **kwargs)[source]

A module implementing Switch Transformers [1] Mixture-of-Experts inference with remote experts.

  [1] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.
  William Fedus, Barret Zoph, Noam Shazeer. https://arxiv.org/abs/2101.03961

Note: By default, not all experts are guaranteed to perform a forward pass. Moreover, not all of those that ran the forward pass are guaranteed to perform a backward pass. In the latter case, gradients will be averaged without the missing experts.

Parameters:
  in_features – common input size for experts and the gating function
  grid_size – dimensions that form an expert uid (see below)
  uid_prefix – common prefix for all expert uids (must end with '.')
  dht – a DHT instance used to search for the best experts
  k_best – average this many highest-scoring experts to compute activations
  k_min – make sure at least this many experts returned output (i.e. didn't fail)
  timeout_after_k_min – wait for this many seconds after k_min experts return results. Any expert that didn't manage to return output after that delay is considered unavailable
  detect_anomalies – whether to check input/output tensors for NaN and infinity values
  allow_zero_outputs – whether to return just the input if no experts respond on the forward pass

Note: expert uid follows the pattern {uid_prefix}.{0…grid_size[0]}.{0…grid_size[1]}…{0…grid_size[-1]}

forward(input: torch.Tensor, *args, **kwargs)[source]

Choose the k best experts with beam search, then call the chosen experts and average their outputs. The input tensor is averaged over all dimensions except the first and last (we assume that extra dimensions represent sequence length or image height/width).

Parameters:
  input – a tensor of values used to estimate the gating function, batch-first
  args – extra positional parameters that will be passed to each expert after input, batch-first
  kwargs – extra keyword parameters that will be passed to each expert, batch-first

Returns: averaged predictions of all experts that delivered results on time, a nested structure of batch-first tensors

compute_expert_scores(grid_probs: List[torch.Tensor], batch_experts: List[List[hivemind.moe.client.expert.RemoteExpert]]) → torch.Tensor[source]

Compute scores for each expert by multiplying grid probabilities, autograd-friendly.

Parameters:
  grid_probs – list of torch tensors; the i-th tensor contains probabilities for the i-th grid dimension
  batch_experts – list (batch) of lists of up to k experts selected for this batch

Returns: a tensor of scores, float32[batch_size, k]

Note: if some rows in the batch have fewer than the maximum number of experts, their scores will be padded with -inf
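The difference from the additive variant above can be sketched the same way (again an illustrative assumption, with coordinate tuples and plain lists standing in for experts and tensors): per-dimension gating probabilities are multiplied rather than added.

```python
# Sketch (assumption, not library code): the Switch variant scores an expert
# as the product of its per-dimension gating probabilities; short rows are
# padded with -inf as in the additive variant.
import math
from typing import List, Tuple


def compute_switch_scores(grid_probs: List[List[List[float]]],
                          batch_experts: List[List[Tuple[int, ...]]]) -> List[List[float]]:
    max_k = max(len(row) for row in batch_experts)
    scores = []
    for b, experts in enumerate(batch_experts):
        row = [math.prod(grid_probs[d][b][c] for d, c in enumerate(coords))
               for coords in experts]
        row += [float("-inf")] * (max_k - len(row))
        scores.append(row)
    return scores
```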