Contributing to vLLM Spyre¶
Thank you for your interest in contributing to the Spyre plugin for vLLM! There are several ways you can contribute:
- Identify and report any issues or bugs.
- Suggest or implement new features.
- Improve documentation or contribute a how-to guide.
Issues¶
If you encounter a bug or have a feature request, please search existing issues first to see if it has already been reported. If not, please create a new issue, providing as much relevant information as possible.
You can also reach out for support in the `#sig-spyre` channel in the vLLM Slack workspace.
Docs¶
Building the docs with MkDocs¶
Install MkDocs and Plugins¶
Install MkDocs along with the plugins used in the vLLM Spyre documentation.
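For example, with pip (a minimal sketch; the repository's docs requirements are the authoritative list of plugins, `mkdocs-awesome-nav` is simply the one named in this guide):

```bash
# Install MkDocs plus the one plugin mentioned in this guide; add any other
# plugins listed in the repository's docs requirements.
pip install mkdocs mkdocs-awesome-nav
```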
Note
Ensure that your Python version is compatible with the plugins (e.g., `mkdocs-awesome-nav` requires Python 3.10+).
Start the Development Server¶
MkDocs comes with a built-in dev-server that lets you preview your documentation as you work on it.
Make sure you're in the same directory as the `mkdocs.yml` configuration file in the vllm-spyre repository, and then start the server by running the `mkdocs serve` command:
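```bash
# Serve the docs with live reload at http://127.0.0.1:8000/
mkdocs serve
```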
Example output:
```text
INFO - Documentation built in 106.83 seconds
INFO - [22:02:02] Watching paths for changes: 'docs', 'mkdocs.yaml'
INFO - [22:02:02] Serving on http://127.0.0.1:8000/
```
View in Your Browser¶
Open up http://127.0.0.1:8000/ in your browser to see a live preview.
Learn More¶
For additional features and advanced configurations, refer to the official MkDocs Documentation.
Testing¶
Tip
When running tests, if errors occur, they can be analyzed and debugged by setting `DISABLE_ASSERTS = True` in `spyre_util.py` and rerunning the test with `pytest --capture=no tests/spyre/test_spyre_basic.py`. After debugging, `DISABLE_ASSERTS` should be reset to `False`.
Testing Locally on CPU (No Spyre card)¶
Optionally, download the `ibm-ai-platform/micro-g3.3-8b-instruct-1b` model:

```bash
python -c "from transformers import pipeline; pipeline('text-generation', model='ibm-ai-platform/micro-g3.3-8b-instruct-1b')"
```
Caution
The Hugging Face API download does not work on `arm64`.
By default, the model is saved to `.cache/huggingface/hub/models--ibm-ai-platform--micro-g3.3-8b-instruct-1b`.
Then, source the environment variables:
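As a hypothetical sketch only (the repository's own environment setup is authoritative; the variable names and values below are assumptions):

```bash
# Hypothetical values for illustration; use the settings shipped with vllm-spyre.
export VLLM_SPYRE_DYNAMO_BACKEND=eager   # assumed: selects the CPU/eager backend
export MASTER_ADDR=localhost             # torch.distributed rendezvous for local runs
export MASTER_PORT=12355
```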
Optionally, install development dependencies:
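A sketch, assuming a development dependency group or extra is defined in `pyproject.toml` (the actual name may differ):

```bash
# Assumes a "dev" group/extra; check pyproject.toml for the real name.
uv sync --frozen --group dev
# or, with pip:
pip install -e ".[dev]"
```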
Now, you can run the tests:
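For example, from the repository root (the test path follows the layout referenced earlier in this guide):

```bash
python -m pytest -v tests/
```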
Here is a list of `pytest` markers you can use to filter them:
```toml
markers = [
    "skip_global_cleanup",
    "e2e: Tests using end-to-end engine spin-up",
    "basic: Basic correctness tests",
    "cb: Continuous batching tests",
    "cpu: Tests using CPU (i.e. eager) backend",
    "compat: backward compatibility tests",
    "spyre: Tests using Spyre hardware backend",
    "decoder: Tests for decoder models",
    "embedding: Tests for embedding models",
    "quantized: Tests for quantized models",
    "multi: Tests that require >1 cards",
    "utils: Tests for utility functions",
    "worker: Tests for worker logic",
]
```
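Markers can be combined with pytest's `-m` expression syntax, for example:

```bash
# Run only the basic correctness tests on the CPU (eager) backend
python -m pytest -v -m "cpu and basic" tests/
# Skip anything that requires Spyre hardware
python -m pytest -v -m "not spyre" tests/
```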
Testing Continuous Batching¶
Run the continuous batching tests:
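A sketch using the `cb` marker from the list above (the exact invocation may differ):

```bash
python -m pytest -v -m cb tests/
```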
Debugging¶
Tip
You can `oc edit` a pod and change the image without having the pod schedule to a different node. This can be useful for testing whether software or hardware is the issue.
- The script `/opt/sentient/bin/aiu-query-devices` in the pod can be used to see the connectivity between the AIUs on the machine. You can also infer this from environment variables with names like `AIU_TIER_\d_SET_\d_RANK_\d`.
- `SPYRE_DEVICES` can be used to select which devices will be selected for each `RANK`. This is similar to how `CUDA_VISIBLE_DEVICES` works for GPUs.
    - Example: `0,2,4,6` will assign rank `0` to AIU index `0`, rank `1` to AIU index `2`, rank `2` to AIU index `4`, and rank `3` to AIU index `6`.
    - An alternative is to use `AIU_WORLD_RANK_\d=0000:aa:00.0` to explicitly map ranks to `PCI` addresses (make sure there are no duplicates used at runtime).
- A bash script that uses `/opt/sentient/senlib/bin/senlib_unit_test` to check each AIU allocated to the pod to see if they work for a basic test:

    ```bash
    #!/bin/bash
    # Check each AIU allocated to the pod with a basic senlib unit test.

    cleanup_done=0
    cleanup() {
        if [ "$cleanup_done" -eq 0 ] && [ -f ~/.senlib.json.bak ]; then
            echo "Restoring .senlib.json from backup"
            cp ~/.senlib.json.bak ~/.senlib.json
            cleanup_done=1
        fi
        kill -- -$PPID
        wait
        exit
    }
    trap cleanup EXIT SIGINT

    # Create backup .senlib.json if it doesn't exist
    if [ -f "$HOME"/.senlib.json ]; then
        if [ ! -f "$HOME"/.senlib.json.bak ]; then
            echo "Creating backup of $HOME/.senlib.json"
            cp "$HOME"/.senlib.json "$HOME"/.senlib.json.bak
        else
            echo "$HOME/.senlib.json.bak already exists"
        fi
    fi

    for device_id in $(jq -r .GENERAL.sen_bus_id[] /etc/aiu/senlib_config.json); do
        echo "======================================================================"
        echo "Checking AIU ${device_id}"
        echo "======================================================================"
        jq -n '{"GENERAL": { "sen_bus_id": "'"${device_id}"'" }}' > .senlib.json
        # Run in background so the unit test does not override the bash signal handler
        timeout 10 /opt/sentient/senlib/bin/senlib_unit_test --gtest_filter=SmlPF1VF0.Open &
        wait
    done
    ```
Logging levels¶
Various log levels can be configured:

- `DTLOG_LEVEL` - `TRACE`, `DEBUG`, `INFO`, `WARNING`, `ERROR`
- `TORCH_SENDNN_LOG` - `WARNING`, `CRITICAL`
- `VLLM_LOGGING_LEVEL` - `DEBUG`, `INFO`, `WARNING`, `ERROR`
- `DT_DEEPRT_VERBOSE` - `0`, `-1`
Tip
`DTLOG_LEVEL=INFO` (piped to a file) can help you see which device addresses are actually in use. Look for the string `Opened: SEN:VFIO`.
Tip
Set `DT_DEEPRT_VERBOSE` to `0` to enable verbose compiler prints for debugging.
Tip
In order to stop massive log spew, this configuration is ideal:
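The values below are an assumption based on the variables listed above, not an official recommendation:

```bash
# Quiet the noisiest log sources (assumed values; tune as needed)
export DTLOG_LEVEL=ERROR
export TORCH_SENDNN_LOG=CRITICAL
export VLLM_LOGGING_LEVEL=INFO
```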
For tensor-parallel debugging, you can enable an option to redirect all log output from each rank to an individual file.
Set `VLLM_SPYRE_WORKER_LOG_REDIRECT_DIR` to a local directory, and each rank will redirect its stdout and stderr into its own file inside that directory.
This can be helpful to avoid having interleaved stack dumps from different ranks in stderr.
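For example (the directory used here is only a placeholder):

```bash
# Each rank writes its stdout/stderr to a separate file in this directory
export VLLM_SPYRE_WORKER_LOG_REDIRECT_DIR=/tmp/vllm-spyre-rank-logs
```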
Topology Aware Allocation¶
This section is specific to the AIU operator and scheduling workloads onto specific cards.
(TODO: link to docs once they exist)
- This mode lets users request a special set of AIU cards based on `PCI` topology. By using this mode, we can guarantee to pick up AIU cards of a particular class in the node:
    - `Tier0` provides a set of cards on the same `PCI` switch.
    - `Tier1` provides a set of cards from a `PCI` switch at most one hop away.
    - `Tier2` provides a set of cards from a `PCI` switch at most two hops away.
- Running a Multi AIU Job using `ibm.com/aiu_pf_tier0,tier1,tier2`:
    - This resource type is used to pick up a topology-aware card set, which is required to run tensor-parallel (`TP`) workloads more effectively. By using a `tierX` class resource, `TP` users automatically get the best-performing card set for the workload.
- The maximum number of allocatable resources in each tier depends on the platform & cluster, but we can get up to:
    - `Tier0` - `4` cards
    - `Tier1` - `8` cards
    - `Tier2` - `16` cards
- Devices in `tier0` can do peer-to-peer (P2P) RDMA; devices on different trees use Host DMA, sharing files through `/dev/shm`.

Warning

If you request more cards than the switch supports, the pod will never be scheduled. For example, if you specify `ibm.com/aiu_pf_tier0: 5` in your yaml, the pod will never be scheduled because `tier0` provides at most `4` cards.
Pull Requests¶
Linting¶
When submitting a PR, please make sure your code passes all linting checks. You can install the linting requirements using either `uv` or `pip`.
Using `uv`:
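A hypothetical sketch, assuming a `lint` dependency group is defined in `pyproject.toml`:

```bash
uv sync --frozen --group lint
```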
Using `pip`:
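A hypothetical sketch; the actual requirements file or extra may be named differently:

```bash
pip install -r requirements/lint.txt
```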
After installing the requirements, run the formatting script:
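A hypothetical invocation; use whichever formatting entry point the repository provides:

```bash
./format.sh
```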
Then, make sure to commit any changes made by the formatter:
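For example:

```bash
git add -u
git commit -s -m "Apply formatting fixes"
```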
DCO and Signed-off-by¶
When contributing, you must agree to the DCO. Commits must include a `Signed-off-by:` header which certifies agreement with the terms of the DCO.
Using `-s` with `git commit` will automatically add this header.
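For example:

```bash
git commit -s -m "Describe your change"
```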
License¶
See LICENSE.