Releases: NVIDIA-NeMo/Run
NVIDIA NeMo Run 0.9.0
Others
- beep boop 🤖: Bumping nemo_run to v0.9.0rc1 #489
NVIDIA NeMo Run 0.9.0rc0.dev0
Executors
- Add special resolution for /$nemo_run in mounts for slurm and docker #155
- Add support for heterogeneous job group indices in SlurmExecutor #158
- Fix logging for packaging jobs in Slurm executor #160
- Add SlurmRay launcher and transform API for launchers #159
- Add error handling for executor deserialization in dgxcloud scheduler #166
- refactor: Improve packaging job handling in SlurmExecutor #171
- Fix docker scheduler creation #174
- Add slurm dependency type section to execution guide #181
- Slurm add `--segment` argument #186
- Add DGXCloudExecutor docs and update execution guide #192
- Support torchrun multi node on local executor #143
- zozhang/dgxc executor data mover #206
- Add support for job groups for local executor #220
- Add cancel to docker executor #233
- Add LeptonExecutor support #224
- Add RayJob and Slurm support for Ray APIs + integration with run.Experiment #236
- Add storage mount options to LeptonExecutor #237
- Update to latest Lepton SDK #248
- Upgrade skypilot executor with 0.9.2 #246
- Support for %j in slurm log retrieval #252
- Sync job code in local tunnel for Slurm Ray job #254
- Support overlapped srun commands in Slurm Ray #263
- [Bugfix - LeptonExecutor] Setting names to be lowercase and shortened for length #274
- Allow customizing folder for SlurmRayRequest #281
- Add logs dir to container mount for ray slurm #287
- finetune on dgxcloud with nemo-run and deploy on bedrock example #286
- Fix skypilot archive mount bug #288
- Fixes for multi-node execution with torchrun + LocalExecutor in Slurm environment #251
- Upgrade skypilot to v0.10.0, introduce network_tier #297
- Remove breaking torchrun config for single-node runs #292
- Added Pre-Launch Commands Support to LeptonExecutor #312
- Add image pull secrets param for lepton #330
- Add node reservations for LeptonExecutor #336
- [SkyPilot] Fix nodes -> num_nodes for SkyPilotExecutor in docs #338
- [SkyPilot] Add retry_until_up as an optional arg to SkyPilot Executor #340
- Support SkyPilot Storage configurations in `file_mounts` for automatic cloud sync #335
- [SkyPilot] Update YAML dump imports + backward compatibility for SkyPilot <=0.10.3 #339
- Create SkypilotJobsExecutor to allow running managed jobs #343
- fix: exit code docker runs #365
- fix(typo): exit_code prints empty #379
- fix: limit docker hostname to 32 characters #378
- add secrets to lepton #383
- Add RayCluster support for DGX Cloud Lepton #389
- Fix AssertionError: no app_id collisions expected when scheduling JobGroups locally #404
- feat: add het-job support for ray slurm #407
- feat: use slurm executor to get ray template name #410
- feat: support container-image None in slurm #409
- Honor executor srun_args for Ray COMMAND srun #440
- fix: Flaky Slurm network issues #445
- fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures #458
- fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe() #460
- feat: poll and print SLURM job estimated start time while pending #464
- feat: add KubeflowExecutor for Kubeflow Training Operator (TrainJob CRD) #462
- fix: guard SLURM start-time polling behind a feature flag #469
- cp: `feat: add extra_resource_requests and extra_resource_limits to KubeflowExecutor (479)` into `r0.9.0` #484
Ray Integration
- Add SlurmRay launcher and transform API for launchers #159
- Add RayCluster API with Kuberay support #222
- Add RayJob and Slurm support for Ray APIs + integration with run.Experiment #236
- Import guard k8s import in Ray Cluster and Job #245
- Add user scoping for k8s backend and log level support for Ray APIs #247
- Add KubeRay tests for Ray APIs #249
- Sync job code in local tunnel for Slurm Ray job #254
- Support overlapped srun commands in Slurm Ray #263
- Allow customizing folder for SlurmRayRequest #281
- Add logs dir to container mount for ray slurm #287
- Add nsys patch in ray sub template #318
- Add ray head start timeout #324
- Remove ray deprecated dashboard-grpc-port arg #325
- Update ray template #375
- fix ray templates by using --exclusive to launch ray nodes #380
- Revert "fix ray templates by using --exclusive to launch ray nodes (#380)" #384
- Add RayCluster support for DGX Cloud Lepton #389
- Update ray_enroot template #406
- feat: add het-job support for ray slurm #407
- feat: use slurm executor to get ray template name #410
- Honor executor srun_args for Ray COMMAND srun #440
CLI & Configuration
- Slurm add `--segment` argument #186
- Add --cuda-event-trace=false to nsys command #180
- Adding support for ForwardRef in CLI #176
- Fix bug in CLI with calling a factory-fn inside a list #214
- Fix some bugs for --lazy in CLI #179
- Fix bug with a CLI overwrite #235
- Support overlapped srun commands in Slurm Ray #263
- Added Pre-Launch Commands Support to LeptonExecutor #312
- Honor executor srun_args for Ray COMMAND srun #440
Experiment & Job Management
- Add support for heterogeneous job group indices in SlurmExecutor #158
- Fix logging for packaging jobs in Slurm executor #160
- refactor: Improve packaging job handling in SlurmExecutor #171
- add clean mode for experiment to avoid printing any NeMo-Run specific… #208
- Handle ctx in entrypoint for experiment #213
- Ensure job directory creation for various schedulers #216
- Add support for job groups for local executor #220
- Add RayJob and Slurm support for Ray APIs + integration with run.Experiment #236
- Import guard k8s import in Ray Cluster and Job #245
- Add storage mount options to LeptonExecutor [#237](https://github.c...
NVIDIA NeMo Run 0.8.1
Executors
- cp: `fix: Flaky Slurm network issues (445)` into `r0.8.0` #447
- cp: `fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures (458)` into `r0.8.0` #459
- cp: `fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe() (460)` into `r0.8.0` #463
Bug Fixes
- cp: `fix: Remove outdated nvrx arg (441)` into `r0.8.0` #442
- cp: `fix: Flaky Slurm network issues (445)` into `r0.8.0` #447
- cp: `fix: Add GROUP_RANK (448)` into `r0.8.0` #449
- cp: `fix: Catch OSError with exponential backoff (450)` into `r0.8.0` #452
- cp: `fix: Catch "can't start new thread" (453)` into `r0.8.0` #456
- cp: `fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures (458)` into `r0.8.0` #459
- cp: `fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe() (460)` into `r0.8.0` #463
- cp: `fix: PRE_RELEASE variable (470)` into `r0.8.0` #471
Others
NVIDIA NeMo Run 0.8.0
Executors
- fix(typo): exit_code prints empty #379
- fix: limit docker hostname to 32 characters #378
- add secrets to lepton #383
- Add RayCluster support for DGX Cloud Lepton #389
- Fix AssertionError: no app_id collisions expected when scheduling JobGroups locally #404
- feat: add het-job support for ray slurm #407
- feat: use slurm executor to get ray template name #410
Ray Integration
- Update ray template #375
- fix ray templates by using --exclusive to launch ray nodes #380
- Revert "fix ray templates by using --exclusive to launch ray nodes (#380)" #384
- Add RayCluster support for DGX Cloud Lepton #389
- Update ray_enroot template #406
- feat: add het-job support for ray slurm #407
- feat: use slurm executor to get ray template name #410
Experiment & Job Management
- Add RayCluster support for DGX Cloud Lepton #389
- Fix AssertionError: no app_id collisions expected when scheduling JobGroups locally #404
- feat: add het-job support for ray slurm #407
Documentation
- fix: limit docker hostname to 32 characters #378
- fix: Update README.md #388
- fix broken links in README.md #386
- docs: Fix broken links in README and CONTRIBUTING #390
- Add RayCluster support for DGX Cloud Lepton #389
- docs: Release docs #412
- cp: `ci: Update release-docs workflow to use FW-CI-templates v0.72.0 (423)` into `r0.8.0` #424
- cp: `ci: Update release workflow to include changelog and docs (426)` into `r0.8.0` #427
- docs: Update docs for 0.8.0 #428
- docs: Update docs to include nightly and use latest #431
CI/CD
- Update ray template #375
- Update changelog for `r0.7.0` #396
- cp: `ci: Update release-docs workflow to use FW-CI-templates v0.72.0 (423)` into `r0.8.0` #424
- cp: `ci: Update release workflow to include changelog and docs (426)` into `r0.8.0` #427
Bug Fixes
- fix host #373
- fix ray templates by using --exclusive to launch ray nodes #380
- fix(typo): exit_code prints empty #379
- fix: limit docker hostname to 32 characters #378
- fix: Update README.md #388
- fix broken links in README.md #386
- Revert "fix ray templates by using --exclusive to launch ray nodes (#380)" #384
- docs: Fix broken links in README and CONTRIBUTING #390
- fix: Retry polling token #392
- fix: DGXC streaming #401
- Fix AssertionError: no app_id collisions expected when scheduling JobGroups locally #404
- fix: remove unexpected side effect in get_srun_flags #408
- fix: Search for incluster config if no kubeconfig is given #411
- fix: Pass DGXC to ft_launcher #402
- cp: `Fix uv sync error(#422)` into `r0.8.0` #425
Others
NVIDIA NeMo Run 0.7.0
Detailed Changelogs:
Executors
- Add image pull secrets param for lepton #330
- Add node reservations for LeptonExecutor #336
- [SkyPilot] Fix nodes -> num_nodes for SkyPilotExecutor in docs #338
- [SkyPilot] Add retry_until_up as an optional arg to SkyPilot Executor #340
- Support SkyPilot Storage configurations in `file_mounts` for automatic cloud sync #335
- [SkyPilot] Update YAML dump imports + backward compatibility for SkyPilot <=0.10.3 #339
- Create SkypilotJobsExecutor to allow running managed jobs #343
- fix: exit code docker runs #365
Ray Integration
Experiment & Job Management
- add a grace for Jobs that may start in Unknown #291
- Create SkypilotJobsExecutor to allow running managed jobs #343
Packaging & Deployment
- Support SkyPilot Storage configurations in `file_mounts` for automatic cloud sync #335
- Refactor tar packaging logic to work for submodule and extra repo #347
Documentation
- Add broken links check in docs #333
- [SkyPilot] Fix nodes -> num_nodes for SkyPilotExecutor in docs #338
- Documentation Restructuring #350
- Fix spelling in docstring #359
- fix: exit code docker runs #365
CI/CD
Bug Fixes
- [SkyPilot] Fix nodes -> num_nodes for SkyPilotExecutor in docs #338
- Fix spelling in docstring #359
- fix: exit code docker runs #365
Others
- chore: Bump to version 0.7.0rc0.dev0 #322
- Update community-bot to add community issues to shared project #321
- Bump community-bot to 0.54.4 #332
- remove custom dir #351
- Bumping to 0.5.0 #352
- Update release notes header in changelog build #355
- add changelog-config #356
- Changelog 0.6.0 #357
- feat: new changelog-build #367
NVIDIA NeMo Run 0.7.0rc0.dev0
Prerelease: NVIDIA NeMo Run 0.7.0rc0.dev0 (2025-12-03)
NVIDIA NeMo Run 0.6.0
Detailed Changelogs:
Executors
- Added Pre-Launch Commands Support to LeptonExecutor #312
- Remove breaking torchrun config for single-node runs #292
- Upgrade skypilot to v0.10.0, introduce network_tier #297
- Fixes for multi-node execution with torchrun + LocalExecutor #251
- Add option to specify --container-env for srun #293
- Fix skypilot archive mount bug #288
- finetune on dgxcloud with nemo-run and deploy on bedrock example #286
Ray Integration
- Add nsys patch in ray sub template #318
- Add logs dir to container mount for ray slurm #287
- Allow customizing folder for SlurmRayRequest #281
CLI & Configuration
Experiment & Job Management
- Use thread pool for status, run methods inside experiment + other fixes #295
Packaging & Deployment
- Correctly append tar files for packaging #317
Documentation
- Create CHANGELOG.md #314
- docs: Fixing doc build issue #290
- fix docs tutorial links and add intro to guides/index.md #285
- README #277
CI/CD
- changelog workflow #315
- Update release.yml #306
- ci(fix): Use GITHUB_TOKEN for community bot #302
- ci: Add community-bot #300
Bug Fixes
- [Bugfix] Adding a check for name length #273
- misc fixes #280
- adding fix for lowercase and name length k8s requirements #274
Others
NVIDIA NeMo Run 0.6.0rc0.dev0
Prerelease: NVIDIA NeMo Run 0.6.0rc0.dev0 (2025-10-09)
NVIDIA NeMo Run 0.5.0
Features and improvements
NVIDIA NeMo Run 0.4.0
Features and improvements.