analyze graph: fix incorrect cross-package module resolution in ruff analyze graph (#24182)#24183
analyze graph: fix incorrect cross-package module resolution in ruff analyze graph (#24182)#24183markjm wants to merge 1 commit intoastral-sh:mainfrom
analyze graph: fix incorrect cross-package module resolution in ruff analyze graph (#24182)#24183Conversation
|
ruff analyze graph (#24182)analyze graph: fix incorrect cross-package module resolution in ruff analyze graph (#24182)
There was a problem hiding this comment.
Thank you. This overall makes sense. I left two comments: I'm not sure if what I propose works but ty's module resolver has evolved since we wrote ruff analyze and it's plausible that some of the workarounds implemented in analyze are no longer necessary.
The main caveat of this change is that analyze now behaves differently from lint, but I think that's fine.
| ); | ||
| // Collect `src` paths from all discovered configs (root + hierarchically discovered). | ||
| // This is more efficient than iterating over files since we only process unique configs. | ||
| for settings in resolver.settings() { |
There was a problem hiding this comment.
I guess this is still somewhat prone to packages resolving across project boundaries. Ultimately, Ruff's hierarchical configuration here is directly add odds with requiring a global src_roots.
It makes me wonder if ty's desperate module resolution works well enough in most cases and, if not, could we make it so that it works out of the box?
There was a problem hiding this comment.
I tried this but i think it doesnt exactly work for the uv workspace case. desperate module resolution simply walks up to the root, but we also need to be aware of subprojects with pyproject.toml files setting src-s to a sibling directory.
Please feel free to give it a shot and let me know if im missing something, but removing this and relying on ty desperate resolution "later" demonstrated failures in the uv workspace testcase which is already here (that I am modifying)
crates/ruff_graph/src/lib.rs
Outdated
| // Compute the module path relative to the src_root (package's parent). | ||
| // For a file at `src_root/ruff/foo/bar.py`, the module path is `["ruff", "foo", "bar"]`. | ||
| // We use package.parent() because `package` points to the package directory itself, | ||
| // and we need to include the package name in the module path for relative imports to work. | ||
| let module_path = package.and_then(|package| { | ||
| let src_root = package.parent()?; | ||
| path.as_std_path() | ||
| .strip_prefix(src_root.as_std_path()) | ||
| .ok()? | ||
| .iter() | ||
| .map(|component| { | ||
| Path::new(component) | ||
| .file_stem() | ||
| .and_then(|s| s.to_str()) | ||
| .map(String::from) | ||
| }) | ||
| .collect::<Option<Vec<String>>>() | ||
| }); |
There was a problem hiding this comment.
This makes me wonder if we could use path_to_module here instead
There was a problem hiding this comment.
this we can do and i think I made the change!
|
Thanks for the review, I think I have addressed your feedback @MichaReiser in a way where tests are unaffected and the changes still look right in my jumbo repo. But i will let you resolve your comments when comfortable because I'm new to the tool and am not 100% sure I've done things right |
2f6da0b to
af4ec41
Compare
As discussed in astral-sh#24183 (specifically astral-sh#24183 (comment)) In this change, we pull out the heuristics for finding search paths (currently in in `options.rs`) into a seperate function `discover_src_layout_roots` so that we can also use it in `absolute_desperate_search_paths` and `relative_desperate_search_paths`. Now, when desperate resolution encounters a `pyproject.toml` or `ty.toml`, it will also checks for these well-known src-layout subdirectories "on the way up" as we are searching up the ancestors. Note: naming is hard so may also consider some function rename to indicate that these are kind of just heuristics 🤷
af4ec41 to
7ba919b
Compare
Typing conformance resultsNo changes detected ✅Current numbersThe percentage of diagnostics emitted that were expected errors held steady at 86.59%. The percentage of expected errors that received a diagnostic held steady at 80.96%. The number of fully passing files held steady at 68/132. |
Memory usage reportMemory usage unchanged ✅ |
7ba919b to
1f8c5a2
Compare
|
1f8c5a2 to
687f745
Compare
| let package_roots = resolver | ||
| .package_roots( | ||
| &paths | ||
| .iter() | ||
| .flatten() | ||
| .map(ResolvedFile::path) | ||
| .collect::<Vec<_>>(), | ||
| ) | ||
| .into_iter() | ||
| .map(|(path, package)| { | ||
| ( | ||
| path.to_path_buf(), | ||
| package.map(PackageRoot::path).map(Path::to_path_buf), | ||
| ) | ||
| }) |
There was a problem hiding this comment.
I'm just spitballing ideas here to find a way to implement this performance optimization without breaking with Ruff's understanding of packages.
Instead of ignoring all package roots, could we filter out roots that we know overlap with what ty does in its module resolver? In your project, the issue seems to be mainly about adding a src root to every directory containing an __init__.py. Could we skip all PackageRoot::Nested? Or does some other heuristic work?
There was a problem hiding this comment.
Thanks for checking back @MichaReiser -
I did investigate that and the results also have some incorrect cross-package module resolution.
I think the crux of the issue here is that ruff only cares about pyproject.toml files which specifically set some settings in tool.ruff.
ruff/crates/ruff_workspace/src/resolver.rs
Line 455 in 55f667b
One additional way we can solve this issue to to make ruff consider pyproject.toml files under a ruff-enabled root as being a ruff-related config file. This would allow this whole thing to be simplified because we would naturally slot into how ruff already handles hierarchical project resolution.
Now
`pyproject.toml` with `tool.ruff`
`src`
`__init__.py`
`a.py`
`packages`
`other`
`pyproject.toml` no `tool.ruff` <- not considered when ruff does its hierarchical config finding
Potential Change
`pyproject.toml` with `tool.ruff`
`src`
`__init__.py`
`a.py`
`packages`
`other`
`pyproject.toml` no `tool.ruff` <- would essentially be considered the same as if `tool.ruff` was set with no overridden values
There is some complexity in the code to make sure such pyproject.toml files inherit the right configs, which may not look so nice (example here, see resolver.rs note the test snapshot differences, but now actual output tests change https://github.com/astral-sh/ruff/compare/main...markjm:ruff:mmolinaro/analyze-pyproject-inheritance?expand=1)
Its also possible to hack in this heuristic just to analyze_graph. This resolves the issue for analyze in a slightly hacked-in way while leaving everything else along https://github.com/astral-sh/ruff/compare/main...markjm:ruff:mmolinaro/just-graph?expand=1
Note: please use those examples as directions for approach, not complete/tested PRs. thats why i left them as just branch compares for now
However, it seems like isort ruff rules kind of fall prey to this same issue of blank pyproject.tomls messing with its resolution. Specifically, the setting https://docs.astral.sh/ruff/settings/#lint_isort_detect-same-package seems to be there to address exactly this type of issue. Looking back at the example workspace in the test I changed, if i were to see how isort handles it:
isort categorization for albatross/__init__.py:
| Import | Category | Reason | Note |
|---|---|---|---|
| albatross | FirstParty | SamePackage | Only works because of the same-package heuristic |
I believe its not a problem with isort because that setting defaults to true :)
Additionally, if we look specifically at the example in the test which was changed
https://github.com/astral-sh/uv/blob/aa629c4a54c31d6132ab1655b90dd7542c17d120/scripts/workspaces/albatross-virtual-workspace/packages/albatross/src/albatross/__init__.py
ruff check packages/albatross/src/albatross/__init__.py --show-settings | grep linter.src -A3
linter.src = [
"/tmp/uv",
"/tmp/uv/src",
]
we see that ruff indeed does not naturally consider packages/albatross/ or packages/albatross/src to be a linter.src entry
Edit: editted a few times for more info or clarity
There was a problem hiding this comment.
Put succinctly:
I believe that that the original approach which caused that test to need to change its settings was the "most correct" (and also the fastest :) ) because it most closely matches how ruff resolves settings to a particular file today. The examples with isort and the uv workspace example above, I think, are evidence towards that direction.
The need to include the test change was not because we suddenly stopped including configured or inferred src_roots based on the configs available, but only because analyze graph over-adds src roots so much so now that those directories just happened to be included in spite of a ruff their missing ruff configurations.
There was a problem hiding this comment.
The time I can invest into this PR is very limited at the moment. That's why I'd prefer a local, ideally minimal, change.
I did investigate that and the results also have some incorrect cross-package module resolution.
I don't think this can be fully avoided. It's inherent to having a single search path and mimics Python's behavior.
I changed the venv test and added a packages/albatross/src/albatross/sub/__init__.py. As suspected, package_roots now returns an entry for the sub directory, which means src_roots will have as many entries as there are directories with an __init__.py, leading to horrible performance.
13 │+[crates/ruff/src/commands/analyze_graph.rs:55:25] resolver.package_roots(&paths.iter().flatten().map(ResolvedFile::path).collect::<Vec<_>>(),) = {
14 │+ "/private/var/folders/7s/n00zmc2d0qgchs1ns43xq47h0000gn/T/.tmpK2kE6t/packages/albatross": None,
15 │+ "/private/var/folders/7s/n00zmc2d0qgchs1ns43xq47h0000gn/T/.tmpK2kE6t/packages/albatross/src/albatross": Some(
16 │+ Root {
17 │+ path: "/private/var/folders/7s/n00zmc2d0qgchs1ns43xq47h0000gn/T/.tmpK2kE6t/packages/albatross/src/albatross",
18 │+ },
19 │+ ),
20 │+ "/private/var/folders/7s/n00zmc2d0qgchs1ns43xq47h0000gn/T/.tmpK2kE6t/packages/albatross/src/albatross/sub": Some(
21 │+ Root {
22 │+ path: "/private/var/folders/7s/n00zmc2d0qgchs1ns43xq47h0000gn/T/.tmpK2kE6t/packages/albatross/src/albatross",
23 │+ },
24 │+ ),
25 │+}
I think what we should do is:
- Always add
PackageRoot::Nestedbecause (or decide we don't wnat to support this case, but we should add a test for it, unless there's a pre-existing test) - For
PackageRoot::Root: Only add the outermost directories usingdeduplicate_nested_paths
Unless @charliermarsh has some ideas. He originally wrote all this :)
There was a problem hiding this comment.
Hmm, my comment about the sub package is incorrect. The IndexMap already handles it. @markjm can you share an example of the paths that the auto discovery adds in your repository that shouldn't be added (there should be a very large number of them, given the performance issues you're running into).
One solution here could be to not add the discovered package roots if a project explicitly sets src_roots.
There was a problem hiding this comment.
I cant share the specific code, but i pulled these numbers
| Approach | Src Roots | Package Discovery | Runtime | Correctness |
|---|---|---|---|---|
| Hierarchical (mmolinaro/analyze) | 2 | Explicit config: ["libs", "."] (just where our code happens to ve configured) |
~4.5s | ✅ No false dependencies |
| Main branch (original) | 174 | Auto-discover all packages (10,539 found → deduplicated to 174) | ~27s | ❌ False cross-service dependencies |
| keep nested + dedup roots | 174 | 8,427 Root → deduplicated to 1 outermost + nested | ~13s | ❌ False cross-service dependencies, and as slow as main branch |
Also pulled some stats on # of resolutions:
- Hierarchical (2 roots): 691,261 total imports, 434,637 resolved (62.9%), 256,624 failed (37.1%)
- Main (174 roots): 691,575 total imports, 436,721 resolved (63.1%), 254,854 failed (36.9%)
The performance difference comes from checking 2 roots vs 174 roots for every import resolution attempt (~691K attempts). The major performance hit comes from all the failed resolutions which are needing to look through all those src roots.
First, I changed the PR back to close-to the original approach (pre-messing with ty desperate resolution) in eb29f54.
Since the "include nested" approach mentioned above acts the same as main, I updated to an approach which just adds all package_roots but relies on deduplicate_nested_paths to remove most of them (though it removes the nested packages discussed above, it does mean the test you were concerned about now passes :) ). This change also ensures we dont touch anything outside of ruff analyze graph (scoped as requested above!).
I am happy to either (1) the eb29f54 approach, (2) limit to only when src_roots are explicitely set (though i dont know if there is any current signal if a setting is explicit vs defaulted), or (3) this approach shown (perhaps close to "decide we don't wnat to support this case"). Whatever you would prefer, I can make happen.
P.S I made this small demo which just shows the type of resolution issues which can happen with current approach (I know we both agree this can happen, but i figured its helpful to have an example)
markjm/demos-by-commit@347feb2#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711
Edit: made a bunch of edits to this message as i explored and experimented
There was a problem hiding this comment.
can you share an example of the paths that the auto discovery adds in your repository that shouldn't be added (there should be a very large number of them, given the performance issues you're running into).
There was a problem hiding this comment.
Hm, I thought the demo does show this
markjm/demos-by-commit@347feb2
I made another with the exact src we have configured here, with some naming changes, the file structure is just ripped from the repo
markjm/demos-by-commit@49e27fe
You will notice in this case that the conflict is finding a package named openai (an existing 3p pip package) because all/most other imports in the repo are absolute (from services.llm_actions.src.llm_actions.openai import call_openai), making these conflicts much more likely with 3p package names
The major performance hit comes from all the failed resolutions which are needing to look through all those src roots.
I was sharing this ⬆️ , though in response to ⬇️
(there should be a very large number of them, given the performance issues you're running into)
because there actually are not a very large number of them. Instead there are a relatively small number of them, as they represent cases where the resolution with so many extra roots find some arbitrary name-matching root elsewhere, while the vast majority of cases dont and just have a much slower exhaustive "failed resolution" case where each resolution doesn't find anything but does exhaustively search all the many auto-discovered roots.
There was a problem hiding this comment.
Sorry, I somehow missed the demo. I don't think I've the time now to play and look into the demo myself. I can help drive this forward if we have a clear analysis which extra paths are added and if there's any pattern (all nested roots, something else?).
687f745 to
62c6e8a
Compare
|
((reverted the branch back to the state it was in before exploring the ty resolver changes)) in eb29f54 |
62c6e8a to
a44bcd7
Compare
a44bcd7 to
e0bc18b
Compare
Problem
The previous implementation added all detected package roots to the module search paths (
src_roots). In a monorepo with many packages, this causedimport fooin one package to incorrectly resolve to an unrelatedfoopackage elsewhere in the repository.Solution
Two key changes:
src_roots collection: Instead of adding all package roots, collect
srcpaths from discovered configs usingresolver.settings(). This uses ruff's hierarchical config discovery - each file's settings come from its closest pyproject.toml, and only explicitly configuredsrcpaths are used for resolution.module_path computation: Fix the path calculation in lib.rs to use
package.parent()as the src_root when computing the module path. This ensures relative imports resolve correctly (the package directory must be included in the module path, not used as the strip prefix).Performance
The reduced search space dramatically improves performance on a large
monorepo:
Notes
uvand modified the test to more representatively look like auvworkspacemainto represent the issueSee #24182