feat(bootstrap,cli): switch GPU injection to CDI where supported#495
/// | Input        | Output                                                       |
/// |--------------|--------------------------------------------------------------|
/// | `[]`         | `[]` — no GPU                                                |
/// | `["legacy"]` | `["legacy"]` — pass through                                  |
/// | `["auto"]`   | `["nvidia.com/gpu=all"]` if CDI supported, else `["legacy"]` |
/// | `[cdi-ids…]` | unchanged                                                    |
pub(crate) fn resolve_gpu_device_ids(gpu: &[String], docker_version: Option<&str>) -> Vec<String> {
It feels weird to me to overload this flag with legacy and auto if it is meant to be the list of device_ids in the end.
Would it make more sense to add a new flag for the mode that accepts auto, legacy, or cdi, and then have --gpus (or your new --devices alias) accept the CDI devices if and only if it's in cdi mode (or auto mode choosing cdi)?
Is the concern backwards compatibility with the existing semantics of the --gpu boolean flag?
Yes, the reason I wanted to extend the existing flag is to maintain backward compatibility. I was initially going to add a separate --device flag to mirror what we have done in other runtimes, but this would require more user engagement.
It should also be noted that --gpu is equivalent to --gpu="auto", so the UX does not change. I also only added legacy as an option to allow users to explicitly opt out in cases where CDI injection is not doing what it should. In the medium term, I would expect legacy to be removed entirely (or just mapped to nvidia.com/gpu=all).
Use an explicit CDI device request (driver="cdi", device_ids=["nvidia.com/gpu=all"]) when the Docker daemon reports CDI spec directories via GET /info (SystemInfo.CDISpecDirs). This makes device injection declarative and decouples spec generation from consumption.

When the daemon reports no CDI spec directories, fall back to the legacy NVIDIA device request (driver="nvidia", count=-1) which relies on the NVIDIA Container Runtime hook.

Failure modes for both paths are equivalent: a missing or stale NVIDIA Container Toolkit installation will cause container start to fail.

CDI spec generation is out of scope for this change; specs are expected to be pre-generated out-of-band, for example by the NVIDIA Container Toolkit.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
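The selection rule in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `DeviceRequest` here is a simplified, hypothetical stand-in for the real Docker API type, and `gpu_device_request` is an assumed helper name.

```rust
/// Hypothetical, simplified stand-in for a Docker device request.
/// The real API type carries more fields than shown here.
#[derive(Debug, PartialEq)]
pub struct DeviceRequest {
    pub driver: String,
    pub count: Option<i64>,
    pub device_ids: Vec<String>,
}

/// Pick the GPU device request from the CDI spec directories the daemon
/// reports (the `CDISpecDirs` field of `GET /info`).
/// `gpu_device_request` is an assumed name used only for illustration.
pub fn gpu_device_request(cdi_spec_dirs: &[String]) -> DeviceRequest {
    if cdi_spec_dirs.is_empty() {
        // No CDI spec directories: legacy request served by the
        // NVIDIA Container Runtime hook (count = -1 means "all GPUs").
        DeviceRequest {
            driver: "nvidia".to_string(),
            count: Some(-1),
            device_ids: Vec::new(),
        }
    } else {
        // CDI enabled: request devices declaratively by qualified name.
        DeviceRequest {
            driver: "cdi".to_string(),
            count: None,
            device_ids: vec!["nvidia.com/gpu=all".to_string()],
        }
    }
}
```

Keeping the decision in one function of the reported directories makes both paths easy to unit-test without a running daemon.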
The --gpu flag on `gateway start` now accepts an optional value:

  --gpu          Auto-select: CDI on Docker >= 28.2.0, legacy otherwise
  --gpu=legacy   Force the legacy nvidia DeviceRequest (driver="nvidia")

Internally, the gpu bool parameter to ensure_container is replaced with a device_ids slice. resolve_gpu_device_ids resolves the "auto" sentinel to a concrete device ID list based on the Docker daemon version, keeping the resolution logic in one place at deploy time.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
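A rough sketch of how the "auto" sentinel could be resolved under the version rule stated in this commit message (CDI on Docker >= 28.2.0, legacy when the version is older or unknown). The signature matches the one quoted in the review thread; the body, `parse_version`, and `MIN_CDI_VERSION` are assumptions for illustration only, and the PR description suggests the final change keys off CDI spec directories rather than the version.

```rust
/// Assumed threshold taken from the commit message (Docker >= 28.2.0).
const MIN_CDI_VERSION: (u64, u64, u64) = (28, 2, 0);

/// Hypothetical helper: parse a leading "major.minor.patch" triple,
/// ignoring any "-suffix" or "+build" part. Returns None if malformed.
fn parse_version(v: &str) -> Option<(u64, u64, u64)> {
    let core = v.split(|c: char| c == '-' || c == '+').next()?;
    let mut parts = core.split('.').map(|p| p.parse::<u64>().ok());
    Some((parts.next()??, parts.next()??, parts.next()??))
}

/// Resolve the `auto` sentinel to a concrete device ID list.
/// Every other input (empty, `legacy`, explicit CDI IDs) passes through.
pub fn resolve_gpu_device_ids(gpu: &[String], docker_version: Option<&str>) -> Vec<String> {
    match gpu {
        [only] if only.as_str() == "auto" => {
            // An unknown or unparsable version falls back to legacy.
            let cdi_supported = docker_version
                .and_then(parse_version)
                .map_or(false, |v| v >= MIN_CDI_VERSION);
            if cdi_supported {
                vec!["nvidia.com/gpu=all".to_string()]
            } else {
                vec!["legacy".to_string()]
            }
        }
        other => other.to_vec(),
    }
}
```

Comparing versions as integer tuples avoids the string-ordering pitfall where "28.10.0" would sort before "28.2.0".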
Explicit CDI device IDs can now be passed:

  --gpu=nvidia.com/gpu=all                        single CDI device
  --gpu=nvidia.com/gpu=0 --gpu=nvidia.com/gpu=1   multiple CDI devices

parse_gpu_flag validates the input and rejects mixing legacy/auto with CDI device names or specifying them more than once.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
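The validation described for parse_gpu_flag might look something like the sketch below. The signature, return type, and error strings are hypothetical, since the commit message does not show them; only the two rejection rules come from the text above.

```rust
/// Hypothetical signature: takes the raw `--gpu` values in the order given
/// and returns them unchanged when they form a valid combination.
pub fn parse_gpu_flag(values: &[String]) -> Result<Vec<String>, String> {
    // `auto` and `legacy` select a mode; everything else is treated as a
    // CDI device name.
    let is_mode = |v: &str| v == "auto" || v == "legacy";
    let modes = values.iter().filter(|v| is_mode(v.as_str())).count();

    if modes > 1 {
        // Covers repeats such as `auto auto` and pairs such as `auto legacy`.
        return Err("`auto`/`legacy` may be specified at most once".to_string());
    }
    if modes == 1 && values.len() > 1 {
        return Err("`auto`/`legacy` cannot be mixed with CDI device names".to_string());
    }
    Ok(values.to_vec())
}
```

Under these assumptions, `--gpu=nvidia.com/gpu=0 --gpu=nvidia.com/gpu=1` is accepted, while `--gpu=legacy --gpu=nvidia.com/gpu=0` is rejected before any container is created.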
Summary
Switch GPU device injection in cluster bootstrap to use CDI (Container Device Interface) when it is enabled in Docker, that is, when the `docker info` endpoint returns a non-empty list of CDI spec directories. Otherwise, the existing `--gpus all` NVIDIA `DeviceRequest` path is used as a fallback. The `--gpu` flag on `gateway start` is extended to let users control the injection mode, pass explicit CDI device names, or force the legacy `--gpus all` path.

Related Issue
Part of #398

Changes
- feat(bootstrap): Auto-select CDI (`driver="cdi"`, `device_ids=["nvidia.com/gpu=all"]`) if CDI is enabled; fall back to legacy `driver="nvidia"` on older daemons or when version is unknown
- feat(cli): `--gpu` now accepts an optional value: omit for auto-select, `--gpu=legacy` to force legacy, or `--gpu=<cdi-device>` for an explicit CDI device name (e.g. `nvidia.com/gpu=all`, `nvidia.com/gpu=0`)
- feat(cli): `--device` added as an alias for `--gpu`
- Reject mixing `legacy`/`auto` with explicit CDI names or specifying them more than once

Testing
- `mise run pre-commit` passes

Checklist