Content-hash-based change tracking for data imports #3199
Draft
jonathangreen wants to merge 9 commits into main from
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #3199 +/- ##
=======================================
Coverage 93.30% 93.31%
=======================================
Files 497 497
Lines 46144 46193 +49
Branches 6318 6322 +4
=======================================
+ Hits 43055 43103 +48
+ Misses 2004 2003 -1
- Partials 1085 1087 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
Force-pushed from 4947ef2 to 7d42839
Fixes all broken tests, mypy errors, and incomplete source changes from the initial WIP commit (bde0829). This commit contains all Claude-authored work.

Problems fixed:
- LicensePool model was missing `updated_at` and `created_at` columns referenced by new circulation code, causing 49 test failures
- 31 mypy errors across json.py, bibliographic.py, circulation.py, and integration importers
- Incomplete rename of `has_changed` → `needs_apply` left stale calls in bibliographic.py, circulation.py, and three integration importers
- `data_source_last_updated` still referenced in bibliographic.py, two OPDS extractors, and the Boundless parser/conftest
- Missing alembic migration for all new DB columns
- `LinkData.content` (`bytes | str` field) caused UnicodeDecodeError when hashing bibliographic data containing embedded binary images
- `_canonicalize` / `_canonicalize_sort_key` lacked type annotations
- ODL reimport of expired licenses was incorrectly skipped because license expiry is time-dependent, not detectable by content hash

Changes by file:

src/palace/manager/sqlalchemy/model/licensing.py
- Add `created_at` and `updated_at` columns to LicensePool

src/palace/manager/data_layer/base/mutable.py
- Fix `should_apply_to` condition: `<=` → `<` so equal timestamps still trigger a hash check rather than an unconditional skip

src/palace/manager/data_layer/link.py
- Add `@field_serializer("content", when_used="json")` to base64-encode binary bytes in the `bytes | str | None` union field

src/palace/manager/data_layer/bibliographic.py
- Replace `data_source_last_updated` with `updated_at` throughout
- Replace `has_changed` calls with `should_apply_to` in apply() / apply_edition_only(); `_update_edition_timestamp` now also stores `updated_at_data_hash` on the edition

src/palace/manager/data_layer/circulation.py
- Replace remaining `has_changed` / `last_checked` references
- Set `pool.updated_at` alongside `pool.updated_at_data_hash` after apply
- Early-return skip is bypassed when `self.licenses is not None` (ODL-style pools) so time-expired licenses are always reprocessed; inner availability block gets the same treatment

src/palace/manager/util/json.py
- Add `int` type annotations to all `float_precision` parameters

src/palace/manager/integration/license/{opds,boundless,overdrive}/importer.py
- `has_changed` → `needs_apply`

src/palace/manager/integration/license/{opds1,odl}/extractor.py
src/palace/manager/integration/license/boundless/parser.py
- `data_source_last_updated=` → `updated_at=`

alembic/versions/20260402_57d824b34167_add_change_tracking_hash_columns.py
- New migration: `updated_at_data_hash` on editions and licensepools, `created_at` / `updated_at` on licensepools

tests/manager/data_layer/test_bibliographic.py
- Replace `data_source_last_updated` with `updated_at`; rewrite test_apply_no_changes_needed for hash-based semantics; rename test_data_source_last_updated_updates_timestamp

tests/manager/data_layer/test_measurement.py
- Update test_taken_at: taken_at now defaults to None

tests/manager/integration/license/{opds,overdrive}/test_importer.py
tests/manager/integration/license/boundless/conftest.py
- Update mock/fixture references from has_changed / last_checked to needs_apply / updated_at
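As an illustration of the `LinkData.content` fix above, a minimal sketch assuming Pydantic v2; the model here is simplified to the one field in question, not the real class:

```python
import base64

from pydantic import BaseModel, field_serializer


class LinkData(BaseModel):
    # Simplified: the real model carries more fields than this.
    content: bytes | str | None = None

    @field_serializer("content", when_used="json")
    def _serialize_content(self, value: bytes | str | None) -> str | None:
        # Binary payloads (e.g. embedded images) are not valid UTF-8 text,
        # so base64-encode them rather than letting JSON serialization fail
        # with UnicodeDecodeError during hashing.
        if isinstance(value, bytes):
            return base64.b64encode(value).decode("ascii")
        return value


data = LinkData(content=b"\x89PNG\r\n")
# model_dump_json() now succeeds, emitting base64 text instead of raising.
print(data.model_dump_json())
```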
- Exclude `updated_at` from hash calculation in `fields_excluded_from_hash` so that identical content with different timestamps does not trigger spurious re-imports.
- Fix `_canonicalize_sort_key` crash when sorting sequences containing multiple `None` values (`None < None` raises TypeError in Python). Use a stable sentinel `""` as the second element of the sort key instead.
- Move `_CANONICALIZE_TYPE_ORDER` to a module-level constant to avoid rebuilding the dict on every recursive call.
- Cache `calculate_hash()` result on the instance via `PrivateAttr` and invalidate on field mutation, avoiding a redundant SHA-256 computation per `apply()` cycle.
- Remove redundant `should_apply_to` guard inside `CirculationData.apply`; the early-return path already handles all the same conditions.
- Fix misleading log message when skipping a circulation data update.
- Add docstrings to `json_hash`, `BibliographicData.needs_apply`, and `CirculationData.needs_apply`.
- Add tests for `json_hash`, multiple-None sequence sorting, and unsupported type errors in `_canonicalize_sort_key`.
- Add a note to the migration explaining the first-import-after-deploy performance impact.
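A hedged sketch of the `None`-sorting fix described above; the type table and error message are illustrative, not the exact util/json.py code:

```python
from typing import Any

# Module-level type order, per the commit above: built once instead of on
# every recursive call. The exact ordering here is illustrative.
_CANONICALIZE_TYPE_ORDER: dict[type, int] = {
    type(None): 0,
    bool: 1,
    int: 2,
    float: 3,
    str: 4,
}


def _canonicalize_sort_key(value: Any) -> tuple[int, Any]:
    # All None values map to the same sentinel "" so that comparing two
    # sort keys never evaluates None < None (which raises TypeError).
    if value is None:
        return (_CANONICALIZE_TYPE_ORDER[type(None)], "")
    try:
        return (_CANONICALIZE_TYPE_ORDER[type(value)], value)
    except KeyError:
        raise TypeError(f"Unsupported type for canonical sorting: {type(value)!r}")


# sorted([None, "a", None]) raises TypeError; with the key it is stable:
assert sorted([None, "a", None], key=_canonicalize_sort_key) == [None, None, "a"]
```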
…ction

The `opds_import_task` was not passing `apply_circulation` to `importer.import_feed`, making the fallback path for "bibliographic unchanged, circulation changed" completely dead code. Pass `apply.circulation_apply.delay` to restore that path.

Add a `needs_apply` guard to the `elif` branch in `import_feed_from_response` so `apply_circulation` is only queued when the circulation data has actually changed, preventing redundant tasks on every re-import of unchanged content.

Fix `CirculationData.needs_apply` to always return `True` when `self.licenses` is not None (ODL-style pools). License expiry is time-dependent and cannot be detected by content hashing alone; this mirrors the existing exception already present in the `apply()` early-return guard.
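A minimal sketch of the `needs_apply` guard this commit describes; the dataclass fields and method signature are simplified assumptions, not the real `CirculationData` API:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CirculationData:
    # Simplified stand-in: `licenses` is None for plain feeds and a list
    # of license records for ODL-style imports.
    licenses: list | None
    content_hash: str

    def needs_apply(self, stored_hash: str | None) -> bool:
        # ODL-style pools: license expiry is time-dependent and invisible
        # to a content hash, so always reprocess them.
        if self.licenses is not None:
            return True
        # Plain pools apply only on first import or when content changed.
        return stored_hash is None or self.content_hash != stored_hash


assert CirculationData(licenses=[], content_hash="abc").needs_apply("abc") is True
assert CirculationData(licenses=None, content_hash="abc").needs_apply("abc") is False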
Force-pushed from a1f0909 to f58ea44
Description
Replaces the timestamp-only change-detection logic in the data import pipeline
with a content-hash-based system. Previously,
`BibliographicData` and `CirculationData` used only the data source's "last updated" timestamp to decide whether to re-apply incoming data to an `Edition` or `LicensePool`. This caused two problems: identical content re-published under a newer timestamp triggered redundant database writes and work creation, while changed content published under an unchanged timestamp could be skipped entirely.
This branch reverts the original timestamp-throttling PR (#3198, "Fix BibliographicData.has_changed to throttle updates when data sourc…") and replaces it
with a proper content-hash approach. A SHA-256 hash of the canonical, serialized
form of the incoming data is stored on the database record after each import.
Subsequent imports compare both the timestamp and the hash before deciding whether
to apply an update.
Key changes:
- `json_hash()` / `json_canonical()` utilities (util/json.py) produce a stable, order-independent SHA-256 fingerprint of any JSON-serializable structure (see the sketch after this list).
- `BaseMutableData` gains `updated_at`, `created_at`, `as_of_timestamp`, `calculate_hash()`, and `should_apply_to()`. The `should_apply_to()` method is now the single decision point for both bibliographic and circulation data.
- `BibliographicData.has_changed()` and `CirculationData.has_changed()` are removed and replaced by the shared `should_apply_to()` logic.
- `Edition` and `LicensePool` each gain an `updated_at_data_hash` column. `LicensePool` also gains `created_at` and `updated_at` columns to track when its `CirculationData` was first and most recently imported.
- Pools carrying license-level data (`self.licenses is not None`, i.e. ODL-style pools) are always reprocessed even when the hash matches, because license availability can change as licenses expire independently of feed content.
- f98e4049c87d adds all four new columns.
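The following is not the actual util/json.py implementation, just a minimal sketch of the canonicalize-then-hash idea behind `json_canonical()` / `json_hash()` (the real utilities also normalize float precision and sequence ordering):

```python
import hashlib
import json
from typing import Any


def json_canonical(value: Any) -> str:
    # Sorted keys and compact separators make the serialized form
    # independent of dict insertion order.
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


def json_hash(value: Any) -> str:
    # A stable SHA-256 fingerprint of the canonical form.
    return hashlib.sha256(json_canonical(value).encode("utf-8")).hexdigest()


assert json_hash({"a": 1, "b": [2, 3]}) == json_hash({"b": [2, 3], "a": 1})
```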
Motivation and Context
The original `has_changed()` implementation only compared timestamps, which is insufficient: a data source can re-publish identical content with a newer timestamp, or publish changed content with the same timestamp. Content hashing is the correct primitive for detecting genuine data changes and avoiding redundant imports.
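To make the decision concrete, a hedged sketch of the `should_apply_to()` flow described above; the free-function form and parameter names are assumptions (the real method lives on `BaseMutableData`):

```python
from datetime import datetime


def should_apply_to(
    incoming_updated_at: datetime | None,
    incoming_hash: str,
    stored_updated_at: datetime | None,
    stored_hash: str | None,
) -> bool:
    # Bootstrap: no hash has ever been recorded, so apply unconditionally.
    if stored_hash is None:
        return True
    # Strictly older incoming data is skipped; equal timestamps still fall
    # through to the hash check (the `<=` -> `<` fix noted in the commits).
    if (
        incoming_updated_at is not None
        and stored_updated_at is not None
        and incoming_updated_at < stored_updated_at
    ):
        return False
    # Same or newer timestamp: apply only if the content actually changed.
    return incoming_hash != stored_hash
```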
How Has This Been Tested?
- Unit tests for `BibliographicData` and `CirculationData` cover the new `should_apply_to()` logic, including the null-hash bootstrap case, the timestamp-is-older short-circuit, and the hash-match skip.
- Tests for `json_canonical()` and `json_hash()` verify ordering stability across dict keys, list items, and float precision.
- Existing suites pass with the updated field names (`updated_at` in place of `data_source_last_updated`).
- Full suite run with `tox -e py312-docker -- --no-cov`.

Checklist