Content-hash-based change tracking for data imports #3199
Draft
jonathangreen wants to merge 9 commits into main from
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #3199 +/- ##
=======================================
Coverage 93.30% 93.31%
=======================================
Files 497 497
Lines 46144 46193 +49
Branches 6318 6322 +4
=======================================
+ Hits 43055 43103 +48
+ Misses 2004 2003 -1
- Partials 1085 1087 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
Force-pushed from 4947ef2 to 7d42839
Fixes all broken tests, mypy errors, and incomplete source changes from the initial WIP commit (bde0829). This commit contains all Claude-authored work.

Problems fixed:
- LicensePool model was missing `updated_at` and `created_at` columns referenced by new circulation code, causing 49 test failures
- 31 mypy errors across json.py, bibliographic.py, circulation.py, and integration importers
- Incomplete rename of `has_changed` → `needs_apply` left stale calls in bibliographic.py, circulation.py, and three integration importers
- `data_source_last_updated` still referenced in bibliographic.py, two OPDS extractors, and the Boundless parser/conftest
- Missing alembic migration for all new DB columns
- `LinkData.content` (`bytes | str` field) caused UnicodeDecodeError when hashing bibliographic data containing embedded binary images
- `_canonicalize` / `_canonicalize_sort_key` lacked type annotations
- ODL reimport of expired licenses was incorrectly skipped because license expiry is time-dependent, not detectable by content hash

Changes by file:

src/palace/manager/sqlalchemy/model/licensing.py
- Add `created_at` and `updated_at` columns to LicensePool

src/palace/manager/data_layer/base/mutable.py
- Fix `should_apply_to` condition: `<=` → `<` so equal timestamps still trigger a hash check rather than an unconditional skip

src/palace/manager/data_layer/link.py
- Add `@field_serializer("content", when_used="json")` to base64-encode binary bytes in the `bytes | str | None` union field

src/palace/manager/data_layer/bibliographic.py
- Replace `data_source_last_updated` with `updated_at` throughout
- Replace `has_changed` calls with `should_apply_to` in apply() / apply_edition_only(); `_update_edition_timestamp` now also stores `updated_at_data_hash` on the edition

src/palace/manager/data_layer/circulation.py
- Replace remaining `has_changed` / `last_checked` references
- Set `pool.updated_at` alongside `pool.updated_at_data_hash` after apply
- Early-return skip is bypassed when `self.licenses is not None` (ODL-style pools) so time-expired licenses are always reprocessed; inner availability block gets the same treatment

src/palace/manager/util/json.py
- Add `int` type annotations to all `float_precision` parameters

src/palace/manager/integration/license/{opds,boundless,overdrive}/importer.py
- `has_changed` → `needs_apply`

src/palace/manager/integration/license/{opds1,odl}/extractor.py
src/palace/manager/integration/license/boundless/parser.py
- `data_source_last_updated=` → `updated_at=`

alembic/versions/20260402_57d824b34167_add_change_tracking_hash_columns.py
- New migration: `updated_at_data_hash` on editions and licensepools, `created_at` / `updated_at` on licensepools

tests/manager/data_layer/test_bibliographic.py
- Replace `data_source_last_updated` with `updated_at`; rewrite test_apply_no_changes_needed for hash-based semantics; rename test_data_source_last_updated_updates_timestamp

tests/manager/data_layer/test_measurement.py
- Update test_taken_at: taken_at now defaults to None

tests/manager/integration/license/{opds,overdrive}/test_importer.py
tests/manager/integration/license/boundless/conftest.py
- Update mock/fixture references from has_changed / last_checked to needs_apply / updated_at
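As an illustration of the `LinkData.content` fix above, a minimal sketch assuming Pydantic v2; the model here is simplified to the one field in question, not the real class:

```python
import base64

from pydantic import BaseModel, field_serializer


class LinkData(BaseModel):
    # Simplified: the real model carries more fields than this.
    content: bytes | str | None = None

    @field_serializer("content", when_used="json")
    def _serialize_content(self, value: bytes | str | None) -> str | None:
        # Binary payloads (e.g. embedded images) are not valid UTF-8 text,
        # so base64-encode them rather than letting JSON serialization fail
        # with UnicodeDecodeError during hashing.
        if isinstance(value, bytes):
            return base64.b64encode(value).decode("ascii")
        return value


data = LinkData(content=b"\x89PNG\r\n")
# model_dump_json() now succeeds, emitting base64 text instead of raising.
print(data.model_dump_json())
```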
- Exclude `updated_at` from hash calculation in `fields_excluded_from_hash` so that identical content with different timestamps does not trigger spurious re-imports.
- Fix `_canonicalize_sort_key` crash when sorting sequences containing multiple `None` values (`None < None` raises TypeError in Python). Use a stable sentinel `""` as the second element of the sort key instead.
- Move `_CANONICALIZE_TYPE_ORDER` to a module-level constant to avoid rebuilding the dict on every recursive call.
- Cache `calculate_hash()` result on the instance via `PrivateAttr` and invalidate on field mutation, avoiding a redundant SHA-256 computation per `apply()` cycle.
- Remove redundant `should_apply_to` guard inside `CirculationData.apply`; the early-return path already handles all the same conditions.
- Fix misleading log message when skipping a circulation data update.
- Add docstrings to `json_hash`, `BibliographicData.needs_apply`, and `CirculationData.needs_apply`.
- Add tests for `json_hash`, multiple-None sequence sorting, and unsupported type errors in `_canonicalize_sort_key`.
- Add a note to the migration explaining the first-import-after-deploy performance impact.
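A hedged sketch of the `None`-sorting fix described above; the type table and error message are illustrative, not the exact util/json.py code:

```python
from typing import Any

# Module-level type order, per the commit above: built once instead of on
# every recursive call. The exact ordering here is illustrative.
_CANONICALIZE_TYPE_ORDER: dict[type, int] = {
    type(None): 0,
    bool: 1,
    int: 2,
    float: 3,
    str: 4,
}


def _canonicalize_sort_key(value: Any) -> tuple[int, Any]:
    # All None values map to the same sentinel "" so that comparing two
    # sort keys never evaluates None < None (which raises TypeError).
    if value is None:
        return (_CANONICALIZE_TYPE_ORDER[type(None)], "")
    try:
        return (_CANONICALIZE_TYPE_ORDER[type(value)], value)
    except KeyError:
        raise TypeError(f"Unsupported type for canonical sorting: {type(value)!r}")


# sorted([None, "a", None]) raises TypeError; with the key it is stable:
assert sorted([None, "a", None], key=_canonicalize_sort_key) == [None, None, "a"]
```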
…ction

The `opds_import_task` was not passing `apply_circulation` to `importer.import_feed`, making the fallback path for "bibliographic unchanged, circulation changed" completely dead code. Pass `apply.circulation_apply.delay` to restore that path.

Add a `needs_apply` guard to the `elif` branch in `import_feed_from_response` so `apply_circulation` is only queued when the circulation data has actually changed, preventing redundant tasks on every re-import of unchanged content.

Fix `CirculationData.needs_apply` to always return `True` when `self.licenses` is not None (ODL-style pools). License expiry is time-dependent and cannot be detected by content hashing alone; this mirrors the existing exception already present in the `apply()` early-return guard.
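A minimal sketch of the `needs_apply` guard this commit describes; the dataclass fields and method signature are simplified assumptions, not the real `CirculationData` API:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CirculationData:
    # Simplified stand-in: `licenses` is None for plain feeds and a list
    # of license records for ODL-style imports.
    licenses: list | None
    content_hash: str

    def needs_apply(self, stored_hash: str | None) -> bool:
        # ODL-style pools: license expiry is time-dependent and invisible
        # to a content hash, so always reprocess them.
        if self.licenses is not None:
            return True
        # Plain pools apply only on first import or when content changed.
        return stored_hash is None or self.content_hash != stored_hash


assert CirculationData(licenses=[], content_hash="abc").needs_apply("abc") is True
assert CirculationData(licenses=None, content_hash="abc").needs_apply("abc") is False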
Force-pushed from a1f0909 to f58ea44
Description
Replaces the timestamp-only change-detection logic in the data import pipeline
with a content-hash-based system. Previously,
`BibliographicData` and `CirculationData` used only the data source's "last updated" timestamp to decide whether to re-apply incoming data to an `Edition` or `LicensePool`. This caused two problems: identical content re-published under a newer timestamp triggered redundant database writes and work creation, while changed content published under an unchanged timestamp could be skipped entirely.
This branch reverts the original timestamp-throttling PR (#3198, "Fix BibliographicData.has_changed to throttle updates when data sourc…") and replaces it
with a proper content-hash approach. A SHA-256 hash of the canonical, serialized
form of the incoming data is stored on the database record after each import.
Subsequent imports compare both the timestamp and the hash before deciding whether
to apply an update.
Key changes:
- `json_hash()` / `json_canonical()` utilities (util/json.py) produce a stable, order-independent SHA-256 fingerprint of any JSON-serializable structure (see the sketch after this list).
- `BaseMutableData` gains `updated_at`, `created_at`, `as_of_timestamp`, `calculate_hash()`, and `should_apply_to()`. The `should_apply_to()` method is now the single decision point for both bibliographic and circulation data.
- `BibliographicData.has_changed()` and `CirculationData.has_changed()` are removed and replaced by the shared `should_apply_to()` logic.
- `Edition` and `LicensePool` each gain an `updated_at_data_hash` column. `LicensePool` also gains `created_at` and `updated_at` columns to track when its `CirculationData` was first and most recently imported.
- Pools carrying license-level data (`self.licenses is not None`, i.e. ODL-style pools) are always reprocessed even when the hash matches, because license availability can change as licenses expire independently of feed content.
- f98e4049c87d adds all four new columns.
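The following is not the actual util/json.py implementation, just a minimal sketch of the canonicalize-then-hash idea behind `json_canonical()` / `json_hash()` (the real utilities also normalize float precision and sequence ordering):

```python
import hashlib
import json
from typing import Any


def json_canonical(value: Any) -> str:
    # Sorted keys and compact separators make the serialized form
    # independent of dict insertion order.
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


def json_hash(value: Any) -> str:
    # A stable SHA-256 fingerprint of the canonical form.
    return hashlib.sha256(json_canonical(value).encode("utf-8")).hexdigest()


assert json_hash({"a": 1, "b": [2, 3]}) == json_hash({"b": [2, 3], "a": 1})
```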
Motivation and Context
The original `has_changed()` implementation only compared timestamps, which is insufficient: a data source can re-publish identical content with a newer timestamp, or publish changed content with the same timestamp. Content hashing is the correct primitive for detecting genuine data changes and avoiding redundant imports.
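To make the decision concrete, a hedged sketch of the `should_apply_to()` flow described above; the free-function form and parameter names are assumptions (the real method lives on `BaseMutableData`):

```python
from datetime import datetime


def should_apply_to(
    incoming_updated_at: datetime | None,
    incoming_hash: str,
    stored_updated_at: datetime | None,
    stored_hash: str | None,
) -> bool:
    # Bootstrap: no hash has ever been recorded, so apply unconditionally.
    if stored_hash is None:
        return True
    # Strictly older incoming data is skipped; equal timestamps still fall
    # through to the hash check (the `<=` -> `<` fix noted in the commits).
    if (
        incoming_updated_at is not None
        and stored_updated_at is not None
        and incoming_updated_at < stored_updated_at
    ):
        return False
    # Same or newer timestamp: apply only if the content actually changed.
    return incoming_hash != stored_hash
```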
How Has This Been Tested?
- Unit tests for `BibliographicData` and `CirculationData` cover the new `should_apply_to()` logic, including the null-hash bootstrap case, the timestamp-is-older short-circuit, and the hash-match skip.
- Tests for `json_canonical()` and `json_hash()` verify ordering stability across dict keys, list items, and float precision.
- Existing suites pass with the updated field names (`updated_at` in place of `data_source_last_updated`).
- Full suite run with `tox -e py312-docker -- --no-cov`.

Checklist