Releases · NVIDIA-NeMo/Run

16 Apr 20:13

svcnvidia-nemo-ci

v0.9.0

045e8b8

NVIDIA NeMo Run 0.9.0 Latest

Others

beep boop 🤖: Bumping nemo_run to v0.9.0rc1 #489

16 Apr 19:58

svcnvidia-nemo-ci

v0.9.0rc0.dev0

64b91e0

NVIDIA NeMo Run 0.9.0rc0.dev0 Pre-release

Executors

Add special resolution for /$nemo_run in mounts for slurm and docker #155
Add support for heterogeneous job group indices in SlurmExecutor #158
Fix logging for packaging jobs in Slurm executor #160
Add SlurmRay launcher and transform API for launchers #159
Add error handling for executor deserialization in dgxcloud scheduler #166
refactor: Improve packaging job handling in SlurmExecutor #171
Fix docker scheduler creation #174
Add slurm dependency type section to execution guide #181
Slurm add --segment argument #186
Add DGXCloudExecutor docs and update execution guide #192
Support torchrun multi node on local executor #143
zozhang/dgxc executor data mover #206
Add support for job groups for local executor #220
Add cancel to docker executor #233
Add LeptonExecutor support #224
Add RayJob and Slurm support for Ray APIs + integration with run.Experiment #236
Add storage mount options to LeptonExecutor #237
Update to latest Lepton SDK #248
Upgrade skypilot executor with 0.9.2 #246
Support for %j in slurm log retrieval #252
Sync job code in local tunnel for Slurm Ray job #254
Support overlapped srun commands in Slurm Ray #263
[Bugfix - LeptonExecutor] Setting names to be lowercase and shortened for length #274
Allow customizing folder for SlurmRayRequest #281
Add logs dir to container mount for ray slurm #287
finetune on dgxcloud with nemo-run and deploy on bedrock example #286
Fix skypilot archive mount bug #288
Fixes for multi-node execution with torchrun + LocalExecutor in Slurm environment #251
Upgrade skypilot to v0.10.0, introduce network_tier #297
Remove breaking torchrun config for single-node runs #292
Added Pre-Launch Commands Support to LeptonExecutor #312
Add image pull secrets param for lepton #330
Add node reservations for LeptonExecutor #336
[SkyPilot] Fix nodes -> num_nodes for SkyPilotExecutor in docs #338
[SkyPilot] Add retry_until_up as an optional arg to SkyPilot Executor #340
Support SkyPilot Storage configurations in file_mounts for automatic cloud sync #335
[SkyPilot] Update YAML dump imports + backward compatibility for SkyPilot <=0.10.3 #339
Create SkypilotJobsExecutor to allow running managed jobs #343
fix: exit code docker runs #365
fix(typo): exit_code prints empty #379
fix: limit docker hostname to 32 characters #378
add secrets to lepton #383
Add RayCluster support for DGX Cloud Lepton #389
Fix AssertionError: no app_id collisions expected when scheduling JobGroups locally #404
feat: add het-job support for ray slurm #407
feat: use slurm executor to get ray template name #410
feat: support container-image None in slurm #409
Honor executor srun_args for Ray COMMAND srun #440
fix: Flaky Slurm network issues #445
fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures #458
fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe() #460
feat: poll and print SLURM job estimated start time while pending #464
feat: add KubeflowExecutor for Kubeflow Training Operator (TrainJob CRD) #462
fix: guard SLURM start-time polling behind a feature flag #469
cp: feat: add extra_resource_requests and extra_resource_limits to KubeflowExecutor (479) into r0.9.0 #484

Ray Integration

Add SlurmRay launcher and transform API for launchers #159
Add RayCluster API with Kuberay support #222
Add RayJob and Slurm support for Ray APIs + integration with run.Experiment #236
Import guard k8s import in Ray Cluster and Job #245
Add user scoping for k8s backend and log level support for Ray APIs #247
Add KubeRay tests for Ray APIs #249
Sync job code in local tunnel for Slurm Ray job #254
Support overlapped srun commands in Slurm Ray #263
Allow customizing folder for SlurmRayRequest #281
Add logs dir to container mount for ray slurm #287
Add nsys patch in ray sub template #318
Add ray head start timeout #324
Remove ray deprecated dashboard-grpc-port arg #325
Update ray template #375
fix ray templates by using --exclusive to launch ray nodes #380
Revert "fix ray templates by using --exclusive to launch ray nodes (#380)" #384
Add RayCluster support for DGX Cloud Lepton #389
Update ray_enroot template #406
feat: add het-job support for ray slurm #407
feat: use slurm executor to get ray template name #410
Honor executor srun_args for Ray COMMAND srun #440

CLI & Configuration

Slurm add --segment argument #186
Add --cuda-event-trace=false to nsys command #180
Adding support for ForwardRef in CLI #176
Fix bug in CLI with calling a factory-fn inside a list #214
Fix some bugs for --lazy in CLI #179
Fix bug with a CLI overwrite #235
Support overlapped srun commands in Slurm Ray #263
Added Pre-Launch Commands Support to LeptonExecutor #312
Honor executor srun_args for Ray COMMAND srun #440

Experiment & Job Management

Add support for heterogeneous job group indices in SlurmExecutor #158
Fix logging for packaging jobs in Slurm executor #160
refactor: Improve packaging job handling in SlurmExecutor #171
add clean mode for experiment to avoid printing any NeMo-Run specific… #208
Handle ctx in entrypoint for experiment #213
Ensure job directory creation for various schedulers #216
Add support for job groups for local executor #220
Add RayJob and Slurm support for Ray APIs + integration with run.Experiment #236
Import guard k8s import in Ray Cluster and Job #245
Add storage mount options to LeptonExecutor [#237](https://github.c...

21 Mar 00:10

svcnvidia-nemo-ci

v0.8.1

35f7add

NVIDIA NeMo Run 0.8.1

Executors

cp: fix: Flaky Slurm network issues (445) into r0.8.0 #447
cp: fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures (458) into r0.8.0 #459
cp: fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe() (460) into r0.8.0 #463

Bug Fixes

cp: fix: Remove outdated nvrx arg (441) into r0.8.0 #442
cp: fix: Flaky Slurm network issues (445) into r0.8.0 #447
cp: fix: Add GROUP_RANK (448) into r0.8.0 #449
cp: fix: Catch OSError with exponential backoff (450) into r0.8.0 #452
cp: fix: Catch can't start new thread (453) into r0.8.0 #456
cp: fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures (458) into r0.8.0 #459
cp: fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe() (460) into r0.8.0 #463
cp: fix: PRE_RELEASE variable (470) into r0.8.0 #471

Others

chore: Bump version #443
cp: build: Bump cryptography and urllib3 (455) into r0.8.0 #457

26 Feb 02:21

svcnvidia-nemo-ci

v0.8.0

8c55dcc

NVIDIA NeMo Run 0.8.0

Executors

fix(typo): exit_code prints empty #379
fix: limit docker hostname to 32 characters #378
add secrets to lepton #383
Add RayCluster support for DGX Cloud Lepton #389
Fix AssertionError: no app_id collisions expected when scheduling JobGroups locally #404
feat: add het-job support for ray slurm #407
feat: use slurm executor to get ray template name #410

Ray Integration

Update ray template #375
fix ray templates by using --exclusive to launch ray nodes #380
Revert "fix ray templates by using --exclusive to launch ray nodes (#380)" #384
Add RayCluster support for DGX Cloud Lepton #389
Update ray_enroot template #406
feat: add het-job support for ray slurm #407
feat: use slurm executor to get ray template name #410

Experiment & Job Management

Add RayCluster support for DGX Cloud Lepton #389
Fix AssertionError: no app_id collisions expected when scheduling JobGroups locally #404
feat: add het-job support for ray slurm #407

Documentation

fix: limit docker hostname to 32 characters #378
fix: Update README.md #388
fix broken links in README.md #386
docs: Fix broken links in README and CONTRIBUTING #390
Add RayCluster support for DGX Cloud Lepton #389
docs: Release docs #412
cp: ci: Update release-docs workflow to use FW-CI-templates v0.72.0 (423) into r0.8.0 #424
cp: ci: Update release workflow to include changelog and docs (426) into r0.8.0 #427
docs: Update docs for 0.8.0 #428
docs: Update docs to include nightly and use latest #431

CI/CD

Update ray template #375
Update changelog for r0.7.0 #396
cp: ci: Update release-docs workflow to use FW-CI-templates v0.72.0 (423) into r0.8.0 #424
cp: ci: Update release workflow to include changelog and docs (426) into r0.8.0 #427

Bug Fixes

fix host #373
fix ray templates by using --exclusive to launch ray nodes #380
fix(typo): exit_code prints empty #379
fix: limit docker hostname to 32 characters #378
fix: Update README.md #388
fix broken links in README.md #386
Revert "fix ray templates by using --exclusive to launch ray nodes (#380)" #384
docs: Fix broken links in README and CONTRIBUTING #390
fix: Retry polling token #392
fix: DGXC streaming #401
Fix AssertionError: no app_id collisions expected when scheduling JobGroups locally #404
fix: remove unexpected side effect in get_srun_flags #408
fix: Search for incluster config if no kubeconfig is given #411
fix: Pass DGXC to ft_launcher #402
cp: Fix uv sync error (#422) into r0.8.0 #425

Others

Version bump to 0.8.0rc0.dev0 #368
feat: add copyright check #369
feat: copyright check #370
Add port parameter to SSHTunnel #372
update copyright check version #376
feat: Stream DGXC logs #377
feat: Stream logs to disk #393
Update nvidia-sphinx-theme #398

03 Dec 23:54

chtruong814

v0.7.0

bbdea4c

NVIDIA NeMo Run 0.7.0

Detailed Changelogs:

Executors

Add image pull secrets param for lepton #330
Add node reservations for LeptonExecutor #336
[SkyPilot] Fix nodes -> num_nodes for SkyPilotExecutor in docs #338
[SkyPilot] Add retry_until_up as an optional arg to SkyPilot Executor #340
Support SkyPilot Storage configurations in file_mounts for automatic cloud sync #335
[SkyPilot] Update YAML dump imports + backward compatibility for SkyPilot <=0.10.3 #339
Create SkypilotJobsExecutor to allow running managed jobs #343
fix: exit code docker runs #365

Ray Integration

Add ray head start timeout #324
Remove ray deprecated dashboard-grpc-port arg #325

Experiment & Job Management

add a grace for Jobs that may start in Unknown #291
Create SkypilotJobsExecutor to allow running managed jobs #343

Packaging & Deployment

Support SkyPilot Storage configurations in file_mounts for automatic cloud sync #335
Refactor tar packaging logic to work for submodule and extra repo #347

Documentation

Add broken links check in docs #333
[SkyPilot] Fix nodes -> num_nodes for SkyPilotExecutor in docs #338
Documentation Restructuring #350
Fix spelling in docstring #359
fix: exit code docker runs #365

CI/CD

Update cherry-pick workflow to use version 0.63.0 #344
fix: exit code docker runs #365

Bug Fixes

[SkyPilot] Fix nodes -> num_nodes for SkyPilotExecutor in docs #338
Fix spelling in docstring #359
fix: exit code docker runs #365

Others

chore: Bump to version 0.7.0rc0.dev0 #322
Update community-bot to add community issues to shared project #321
Bump community-bot to 0.54.4 #332
remove custom dir #351
Bumping to 0.5.0 #352
Update release notes header in changelog build #355
add changelog-config #356
Changelog 0.6.0 #357
feat: new changelog-build #367

03 Dec 23:25

chtruong814

v0.7.0rc0.dev0

dc86aea

NVIDIA NeMo Run 0.7.0rc0.dev0 Pre-release

Prerelease: NVIDIA NeMo Run 0.7.0rc0.dev0 (2025-12-03)

09 Oct 16:13

chtruong814

v0.6.0

030f862

NVIDIA NeMo Run 0.6.0

Detailed Changelogs:

Executors

Added Pre-Launch Commands Support to LeptonExecutor #312
Remove breaking torchrun config for single-node runs #292
Upgrade skypilot to v0.10.0, introduce network_tier #297
Fixes for multi-node execution with torchrun + LocalExecutor #251
Add option to specify --container-env for srun #293
Fix skypilot archive mount bug #288
finetune on dgxcloud with nemo-run and deploy on bedrock example #286

Ray Integration

Add nsys patch in ray sub template #318
Add logs dir to container mount for ray slurm #287
Allow customizing folder for SlurmRayRequest #281

CLI & Configuration

Experiment & Job Management

Use thread pool for status, run methods inside experiment + other fixes #295

Packaging & Deployment

Correctly append tar files for packaging #317

Documentation

Create CHANGELOG.md #314
docs: Fixing doc build issue #290
fix docs tutorial links and add intro to guides/index.md #285
README #277

CI/CD

changelog workflow #315
Update release.yml #306
ci(fix): Use GITHUB_TOKEN for community bot #302
ci: Add community-bot #300

Bug Fixes

[Bugfix] Adding a check for name length #273
misc fixes #280
adding fix for lowercase and name length k8s requirements #274

Others

Specify nodes for gpu metrics collection and split data to each rank #320
Apply '_enable_goodbye_message' check to both goodbye messages. #319
Update refs #278
chore: Bump to version 0.6.0rc0.dev0 #272

09 Oct 05:53

chtruong814

v0.6.0rc0.dev0

d01f76a

NVIDIA NeMo Run 0.6.0rc0.dev0 Pre-release

Prerelease: NVIDIA NeMo Run 0.6.0rc0.dev0 (2025-10-09)

04 Aug 21:10

pablo-garay

v0.5.0

b234cfd

NVIDIA NeMo Run 0.5.0

Features and improvements

09 May 00:58

ko3n1g

v0.4.0

33458c8

NVIDIA NeMo Run 0.4.0

Features and improvements.
