Quick start

This page will eventually become a tutorial on how to host a hivemind node or connect to an existing one.
What do I need to run it?

- One or several computers, each equipped with at least one GPU
- Each computer should have at least two open ports (if not, consider SSH port forwarding)
- Some popular Linux x64 distribution
  - Tested on Ubuntu 16.04; it should work fine on any popular x64 Linux and even macOS
  - Running on Windows natively is not supported; please use a VM or Docker
How do I run it?

Currently there is no easy way to run it. There are some tests (see ./tests/benchmark_throughput.py or the CI logs), and we want to expand them. If you want to do something complex with hivemind, please contact us by opening an issue (less preferred: Telegram).
hivemind quick tour

Trainer process:

- RemoteExpert (hivemind/client/remote_expert.py) behaves like a PyTorch module with autograd support but actually sends requests to a remote runtime (see the usage sketch after this list).
- RemoteMixtureOfExperts (hivemind/client/remote_moe.py) finds the best experts for a given input and either returns them as RemoteExpert instances or applies them right away.
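As an illustration, here is a minimal usage sketch. It assumes a RemoteExpert is already being served at 127.0.0.1:8080; the constructor arguments and import path are assumptions and may differ in your version (check hivemind/client/remote_expert.py):

    import torch
    from hivemind.client.remote_expert import RemoteExpert  # import path assumed

    # uid/host/port are assumptions; see the class definition for the actual
    # signature in your version
    expert = RemoteExpert(uid="expert.0", host="127.0.0.1", port=8080)

    x = torch.randn(4, 512, requires_grad=True)
    y = expert(x)          # the forward pass is executed by the remote runtime
    y.sum().backward()     # backward likewise sends a request to the runtime
    print(x.grad.shape)    # gradients arrive as if this were a local module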
Runtime process:

- Runtime (hivemind/runtime/__init__.py) aggregates batches and performs inference/training of experts according to their priority (a minimal sketch of this idea follows the list).
- Server (hivemind/server/__init__.py) wraps a runtime and periodically uploads experts into the DHT.
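To make the batching behavior concrete, here is a self-contained sketch of priority-ordered batch aggregation. It illustrates the idea only, not Runtime's actual code; all names here are hypothetical:

    import heapq

    class TaskPool:
        """Collects incoming tasks and forms batches in priority order."""

        def __init__(self, max_batch_size: int):
            self.max_batch_size = max_batch_size
            self.heap = []      # (priority, arrival_order, task); lower pops sooner
            self.arrival = 0

        def submit(self, priority: float, task) -> None:
            heapq.heappush(self.heap, (priority, self.arrival, task))
            self.arrival += 1

        def form_batch(self) -> list:
            # pop the most urgent tasks until the batch is full or the pool is empty
            batch = []
            while self.heap and len(batch) < self.max_batch_size:
                _, _, task = heapq.heappop(self.heap)
                batch.append(task)
            return batch

    pool = TaskPool(max_batch_size=2)
    pool.submit(0.9, "inference for peer A")
    pool.submit(0.1, "training step for peer B")
    pool.submit(0.5, "inference for peer C")
    print(pool.form_batch())  # ['training step for peer B', 'inference for peer C']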
DHT:

- DHT (hivemind/dht/__init__.py) is a node of a Kademlia-based DHT that stores metadata used by the trainer and runtime (see the illustration after this list).
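The following self-contained snippet illustrates the XOR distance metric that Kademlia-based DHTs use to decide which nodes are responsible for a given key; it is a conceptual illustration, not hivemind's DHT API:

    import hashlib

    def kad_id(name: str) -> int:
        # Kademlia assigns every node and key a fixed-size identifier
        return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

    def xor_distance(a: int, b: int) -> int:
        # the distance between two identifiers is their bitwise XOR
        return a ^ b

    # metadata for a key is stored on the nodes whose ids are closest to it
    node_ids = {f"node{i}": kad_id(f"node{i}") for i in range(8)}
    key = kad_id("expert.0")
    closest = sorted(node_ids, key=lambda name: xor_distance(node_ids[name], key))
    print(closest[:3])  # the three nodes responsible for "expert.0"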
Limitations

DHT:

- DHT functionality is severely limited by its inability to traverse NAT. Because of this, all features that require the DHT are in a deep pre-alpha state and cannot be used without special setup.
Runtime:

- You can achieve 4x less network load by passing quantized uint8 activations across experts. Implement your own quantization or wait for hivemind v0.8 (a minimal sketch follows this list).
- Currently the runtime can form batches that exceed the maximum batch_size by up to task_size - 1; for example, with batch_size = 16 and task_size = 4, a batch can grow to 19 samples. We will fix this in an upcoming patch.
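For the quantization point above, here is a minimal sketch of min-max affine quantization to uint8; your own scheme may look different:

    import torch

    def quantize_uint8(x: torch.Tensor):
        # affine min-max quantization: float32 -> uint8, a 4x smaller payload
        lo, hi = x.min(), x.max()
        scale = (hi - lo).clamp_min(1e-8) / 255.0
        q = torch.round((x - lo) / scale).to(torch.uint8)
        return q, lo.item(), scale.item()

    def dequantize_uint8(q: torch.Tensor, lo: float, scale: float) -> torch.Tensor:
        # approximate reconstruction on the receiving side
        return q.to(torch.float32) * scale + lo

    x = torch.randn(4, 512)
    q, lo, scale = quantize_uint8(x)
    x_hat = dequantize_uint8(q, lo, scale)
    print((x - x_hat).abs().max())  # quantization error is on the order of scale/2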