hivemind.moe.client

This module lets you connect to a distributed Mixture-of-Experts or to individual experts hosted on remote machines (e.g. in the cloud or on a volunteer's computer).

class hivemind.moe.client.RemoteExpert(expert_info: hivemind.moe.expert_uid.ExpertInfo, p2p: hivemind.p2p.p2p_daemon.P2P)[source]

A simple module that runs the forward/backward pass of an expert hosted on a remote machine. Works seamlessly with PyTorch autograd (this is essentially a simple RPC function). Warning: RemoteExpert currently assumes that you provide it with correct input shapes. Sending inputs with wrong shapes can cause RemoteExpert to freeze indefinitely due to an error in the remote runtime.

Parameters
  • expert_info – ExpertInfo with the expert's uid and the server's PeerInfo

  • p2p – P2P instance connected to the running p2pd

forward(*args, **kwargs)[source]

Call RemoteExpert for the specified inputs and return its output(s). Compatible with torch.autograd.
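To illustrate the calling convention, here is a sketch that uses a local torch.nn.Linear as a stand-in for a remote expert (constructing a real RemoteExpert requires a running p2p daemon, so this is only an illustration of the autograd contract, not the network call): batch-first tensors go in, batch-first tensors come out, and gradients flow back through forward.

```python
import torch
import torch.nn as nn

# Local stand-in for an expert. A real RemoteExpert would forward this
# call to a server over p2p, but from autograd's point of view it is
# used the same way: a module called on batch-first tensors.
expert = nn.Linear(16, 16)

x = torch.randn(4, 16, requires_grad=True)  # batch-first input
y = expert(x)                               # "forward" of the expert
y.sum().backward()                          # gradients reach the input

print(x.grad.shape)  # torch.Size([4, 16])
```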

class hivemind.moe.client.RemoteMixtureOfExperts(*, in_features, grid_size: Tuple[int, ...], dht: hivemind.dht.dht.DHT, uid_prefix: str, k_best: int, k_min: int = 1, forward_timeout: Optional[float] = None, timeout_after_k_min: Optional[float] = None, backward_k_min: int = 1, backward_timeout: Optional[float] = None, detect_anomalies: bool = False, allow_zero_outputs: bool = False, **dht_kwargs)[source]

A torch module that performs Mixture-of-Experts inference with a local gating function and multiple remote experts. Natively supports PyTorch autograd.

Note

By default, not every expert is guaranteed to perform a forward pass. Moreover, not every expert that ran the forward pass is guaranteed to perform a backward pass. In the latter case, gradients will be averaged without the missing experts.

Parameters
  • in_features – common input size for experts and gating function

  • grid_size – dimensions that form expert uid (see below)

  • uid_prefix – common prefix for all expert uids (must end with ‘.’)

  • dht – a DHT instance used to search for best experts

  • k_best – average this many highest-scoring experts to compute activations

  • k_min – make sure at least this many experts returned output (i.e. didn’t fail)

  • timeout_after_k_min – wait for this many seconds after k_min experts returned results. Any expert that didn’t manage to return output after that delay is considered unavailable

  • detect_anomalies – whether to check input/output tensors for NaN and infinity values

  • allow_zero_outputs – whether to return zeros if no experts respond on forward pass

Note

expert uid follows the pattern {uid_prefix}.{0…grid_size[0]}.{0…grid_size[1]}…{0…grid_size[-1]}
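The uid pattern above can be made concrete with a few lines of plain Python. Here we assume a prefix ending in '.' (as the parameter list requires) and dot-separated grid coordinates; the exact prefix "expert." is an illustrative choice, not mandated by the library.

```python
from itertools import product

uid_prefix = "expert."  # must end with '.'
grid_size = (2, 3)      # 2 x 3 = 6 experts in total

# Enumerate every expert uid on the grid: one integer coordinate
# per grid dimension, joined with dots and appended to the prefix.
uids = [
    uid_prefix + ".".join(map(str, coords))
    for coords in product(*map(range, grid_size))
]
print(uids)
# ['expert.0.0', 'expert.0.1', 'expert.0.2', 'expert.1.0', 'expert.1.1', 'expert.1.2']
```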

forward(input: torch.Tensor, *args: torch.Tensor, **kwargs: torch.Tensor)[source]

Choose the k best experts with beam search, then call the chosen experts and average their outputs. The input tensor is averaged over all dimensions except the first and last (we assume that the extra dimensions represent sequence length or image height/width).

Parameters
  • input – a tensor of values that are used to estimate gating function, batch-first.

  • args – extra positional parameters that will be passed to each expert after input, batch-first

  • kwargs – extra keyword parameters that will be passed to each expert, batch-first

Returns

averaged predictions of all experts that delivered results on time, as a batch-first nested structure
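The dimension-averaging used to build the gating input (everything except the batch and feature dimensions) can be sketched in plain torch. This is an illustration of the stated behavior, not the library's internal code:

```python
import torch

def pool_gating_input(x: torch.Tensor) -> torch.Tensor:
    # Average over all dimensions except the first (batch) and last
    # (features): e.g. [batch, seq_len, features] -> [batch, features].
    if x.ndim <= 2:
        return x
    return x.mean(dim=tuple(range(1, x.ndim - 1)))

x = torch.randn(4, 10, 16)         # [batch, seq_len, features]
print(pool_gating_input(x).shape)  # torch.Size([4, 16])
```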

compute_expert_scores(grid_scores: List[torch.Tensor], batch_experts: List[List[hivemind.moe.client.expert.RemoteExpert]]) torch.Tensor[source]

Compute scores for each expert by adding up grid scores, autograd-friendly.

Parameters
  • grid_scores – list of torch tensors, i-th tensor contains scores for i-th grid dimension

  • batch_experts – list(batch) of lists(k) of up to k experts selected for this batch

Returns

a tensor of scores, float32[batch_size, k]

Note

if some rows in batch have less than the max number of experts, their scores will be padded with -inf
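The additive scoring rule can be sketched in dependency-free Python (the library operates on torch tensors and RemoteExpert objects; here experts are represented by their grid coordinates): each expert's score is the sum of its coordinate's score along every grid dimension, and rows with fewer than k experts are padded with -inf.

```python
import math

def compute_expert_scores(grid_scores, batch_coords, k):
    """Sketch of additive expert scoring.

    grid_scores[dim][row][index] is the gating score of `index` along
    grid dimension `dim` for batch row `row`. batch_coords[row] is a
    list of up to k coordinate tuples (one index per grid dimension),
    standing in for the selected RemoteExperts.
    Returns a [batch_size x k] list of lists, padded with -inf.
    """
    scores = [[-math.inf] * k for _ in range(len(batch_coords))]
    for row, experts in enumerate(batch_coords):
        for j, coords in enumerate(experts):
            # additive scoring: sum the per-dimension gating scores
            scores[row][j] = sum(
                grid_scores[dim][row][index] for dim, index in enumerate(coords)
            )
    return scores

# two grid dimensions, one batch row, two selected experts
grid_scores = [[[1.0, 2.0]], [[0.5, 0.25]]]
coords = [[(0, 1), (1, 0)]]  # experts "0.1" and "1.0"
print(compute_expert_scores(grid_scores, coords, k=2))
# [[1.25, 2.5]]
```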

class hivemind.moe.client.RemoteSwitchMixtureOfExperts(*, grid_size: Tuple[int, ...], utilization_alpha: float = 0.9, grid_dropout: float = 1.0, jitter_eps: float = 0.01, k_best=1, k_min=0, backward_k_min=0, allow_zero_outputs=True, **kwargs)[source]

A module implementing Switch Transformers [1] Mixture-of-Experts inference with remote experts.

[1] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.

William Fedus, Barret Zoph, Noam Shazeer. https://arxiv.org/abs/2101.03961

Note

By default, not every expert is guaranteed to perform a forward pass. Moreover, not every expert that ran the forward pass is guaranteed to perform a backward pass. In the latter case, gradients will be averaged without the missing experts.

Parameters
  • in_features – common input size for experts and gating function

  • grid_size – dimensions that form expert uid (see below)

  • uid_prefix – common prefix for all expert uids (must end with ‘.’)

  • dht – a DHT instance used to search for best experts

  • k_best – average this many highest-scoring experts to compute activations

  • k_min – make sure at least this many experts returned output (i.e. didn’t fail)

  • timeout_after_k_min – wait for this many seconds after k_min experts returned results. Any expert that didn’t manage to return output after that delay is considered unavailable

  • detect_anomalies – whether to check input/output tensors for NaN and infinity values

  • allow_zero_outputs – whether to return just the input if no experts respond on forward pass

Note

expert uid follows the pattern {uid_prefix}.{0…grid_size[0]}.{0…grid_size[1]}…{0…grid_size[-1]}

forward(input: torch.Tensor, *args: torch.Tensor, **kwargs: torch.Tensor)[source]

Choose the k best experts with beam search, then call the chosen experts and average their outputs. The input tensor is averaged over all dimensions except the first and last (we assume that the extra dimensions represent sequence length or image height/width).

Parameters
  • input – a tensor of values that are used to estimate gating function, batch-first.

  • args – extra positional parameters that will be passed to each expert after input, batch-first

  • kwargs – extra keyword parameters that will be passed to each expert, batch-first

Returns

averaged predictions of all experts that delivered results on time, as a batch-first nested structure

compute_expert_scores(grid_probs: List[torch.Tensor], batch_experts: List[List[hivemind.moe.client.expert.RemoteExpert]]) torch.Tensor[source]

Compute scores for each expert by multiplying grid probabilities, autograd-friendly.

Parameters
  • grid_probs – list of torch tensors, i-th tensor contains scores for i-th grid dimension

  • batch_experts – list(batch) of lists(k) of up to k experts selected for this batch

Returns

a tensor of scores, float32[batch_size, k]

Note

if some rows in batch have less than the max number of experts, their scores will be padded with -inf
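The difference from the additive variant above is that the switch router multiplies per-dimension probabilities instead of summing scores. A dependency-free sketch (experts again represented by grid coordinates; the library itself operates on torch tensors):

```python
import math

def compute_switch_scores(grid_probs, batch_coords, k):
    # Sketch of multiplicative scoring for the switch router:
    # an expert's score is the product of its per-dimension gating
    # probabilities; missing slots are padded with -inf.
    scores = [[-math.inf] * k for _ in range(len(batch_coords))]
    for row, experts in enumerate(batch_coords):
        for j, coords in enumerate(experts):
            prob = 1.0
            for dim, index in enumerate(coords):
                prob *= grid_probs[dim][row][index]
            scores[row][j] = prob
    return scores

grid_probs = [[[0.9, 0.1]], [[0.6, 0.4]]]
coords = [[(0, 0)]]  # a single routed expert, as in switch-style MoE (k_best=1)
print(compute_switch_scores(grid_probs, coords, k=1))
```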