
Content-hash-based change tracking for data imports#3199

Draft
jonathangreen wants to merge 9 commits into main from feature/change-tracking

Conversation

@jonathangreen
Member

@jonathangreen jonathangreen commented Apr 2, 2026

Description

Replaces the timestamp-only change-detection logic in the data import pipeline
with a content-hash-based system. Previously, BibliographicData and
CirculationData used only the data source's "last updated" timestamp to decide
whether to re-apply incoming data to an Edition or LicensePool. This caused
two problems:

  • Re-publishing the same content with a newer timestamp triggered unnecessary
    writes and work creation.
  • A real change arriving with the same (or missing) timestamp was silently
    skipped.

This branch reverts the original timestamp-throttling PR (Fix
BibliographicData.has_changed to throttle updates when data sourc… #3198) and
replaces it with a proper content-hash approach. A SHA-256 hash of the
canonical, serialized form of the incoming data is stored on the database
record after each import. Subsequent imports compare both the timestamp and
the hash before deciding whether to apply an update.

Key changes:
  • New json_hash() / json_canonical() utilities (util/json.py) produce a
    stable, order-independent SHA-256 fingerprint of any JSON-serializable structure.
  • BaseMutableData gains updated_at, created_at, as_of_timestamp,
    calculate_hash(), and should_apply_to(). The should_apply_to() method
    is now the single decision point for both bibliographic and circulation data.
  • BibliographicData.has_changed() and CirculationData.has_changed() are
    removed and replaced by the shared should_apply_to() logic.
  • Edition and LicensePool each gain an updated_at_data_hash column.
    LicensePool also gains created_at and updated_at columns to track when
    its CirculationData was first and most recently imported.
  • Individual-license pools (e.g. ODL) always re-apply availability even when the
    hash matches, because license availability can change as licenses expire
    independently of feed content.
  • Database migration f98e4049c87d adds all four new columns.
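
The json_canonical() / json_hash() utilities described above can be sketched
roughly as follows. This is a minimal illustration, not the actual
util/json.py implementation — the PR also supports float-precision control and
a type-ordering scheme for heterogeneous sequences, which are omitted here:

```python
import hashlib
import json
from typing import Any


def json_canonical(data: Any) -> str:
    """Serialize a JSON-compatible structure deterministically:
    sorted keys, no insignificant whitespace, fixed separators."""
    return json.dumps(data, sort_keys=True, separators=(",", ":"))


def json_hash(data: Any) -> str:
    """Stable SHA-256 fingerprint of the canonical serialization."""
    return hashlib.sha256(json_canonical(data).encode("utf-8")).hexdigest()


# Key order does not affect the fingerprint, so a re-serialized but
# otherwise identical record hashes the same.
assert json_hash({"title": "Moby-Dick", "year": 1851}) == json_hash(
    {"year": 1851, "title": "Moby-Dick"}
)
```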

Motivation and Context

The original has_changed() implementation only compared timestamps, which is
insufficient: a data source can re-publish identical content with a newer timestamp,
or publish changed content with the same timestamp. Content hashing is the correct
primitive for detecting genuine data changes and avoiding redundant imports.
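
The decision logic this implies can be sketched as a free function (the real
should_apply_to() is a method on BaseMutableData; this signature and body are
assumptions made for illustration):

```python
from datetime import datetime
from typing import Optional


def should_apply_to(
    incoming_updated_at: Optional[datetime],
    incoming_hash: str,
    stored_updated_at: Optional[datetime],
    stored_hash: Optional[str],
) -> bool:
    """Decide whether incoming data should be applied to the stored record."""
    # Bootstrap: no hash stored yet (first import after deploy), always apply.
    if stored_hash is None:
        return True
    # A strictly older timestamp is stale data; skip it. Equal timestamps
    # fall through to the hash comparison rather than being skipped.
    if (
        incoming_updated_at is not None
        and stored_updated_at is not None
        and incoming_updated_at < stored_updated_at
    ):
        return False
    # Equal, newer, or missing timestamp: apply only if the content changed.
    return incoming_hash != stored_hash
```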

How Has This Been Tested?

  • Updated unit tests for BibliographicData and CirculationData cover the
    new should_apply_to() logic, including the null-hash bootstrap case, the
    timestamp-is-older short-circuit, and the hash-match skip.
  • New unit tests for json_canonical() and json_hash() verify ordering
    stability across dict keys, list items, and float precision.
  • All existing integration tests for Boundless, OPDS, ODL, and Overdrive importers
    pass with the updated field names (updated_at in place of
    data_source_last_updated).
  • Full test suite run via tox -e py312-docker -- --no-cov.

Checklist

  • I have updated the documentation accordingly.
  • All new and existing tests passed.

@jonathangreen jonathangreen added the feature New feature label Apr 2, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.31%. Comparing base (c825a11) to head (f58ea44).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3199   +/-   ##
=======================================
  Coverage   93.30%   93.31%           
=======================================
  Files         497      497           
  Lines       46144    46193   +49     
  Branches     6318     6322    +4     
=======================================
+ Hits        43055    43103   +48     
+ Misses       2004     2003    -1     
- Partials     1085     1087    +2     


@dbernstein dbernstein changed the title WIP: Content-hash-based change tracking for data imports Content-hash-based change tracking for data imports Apr 6, 2026
@dbernstein dbernstein force-pushed the feature/change-tracking branch 4 times, most recently from 4947ef2 to 7d42839 Compare April 8, 2026 17:10
dbernstein and others added 9 commits April 10, 2026 14:03
Fixes all broken tests, mypy errors, and incomplete source changes from
the initial WIP commit (bde0829).

This commit contains all Claude-authored work.

- LicensePool model was missing `updated_at` and `created_at` columns
  referenced by new circulation code, causing 49 test failures
- 31 mypy errors across json.py, bibliographic.py, circulation.py,
  and integration importers
- Incomplete rename of `has_changed` → `needs_apply` left stale calls
  in bibliographic.py, circulation.py, and three integration importers
- `data_source_last_updated` still referenced in bibliographic.py,
  two OPDS extractors, and the Boundless parser/conftest
- Missing alembic migration for all new DB columns
- `LinkData.content` (bytes | str field) caused UnicodeDecodeError when
  hashing bibliographic data containing embedded binary images
- `_canonicalize` / `_canonicalize_sort_key` lacked type annotations
- ODL reimport of expired licenses was incorrectly skipped because
  license expiry is time-dependent, not detectable by content hash
src/palace/manager/sqlalchemy/model/licensing.py
- Add `created_at` and `updated_at` columns to LicensePool
src/palace/manager/data_layer/base/mutable.py
- Fix `should_apply_to` condition: `<=` → `<` so equal timestamps
  still trigger a hash check rather than an unconditional skip
src/palace/manager/data_layer/link.py
- Add `@field_serializer("content", when_used="json")` to base64-encode
  binary bytes in the `bytes | str | None` union field
src/palace/manager/data_layer/bibliographic.py
- Replace `data_source_last_updated` with `updated_at` throughout
- Replace `has_changed` calls with `should_apply_to` in apply() /
  apply_edition_only(); `_update_edition_timestamp` now also stores
  `updated_at_data_hash` on the edition
src/palace/manager/data_layer/circulation.py
- Replace remaining `has_changed` / `last_checked` references
- Set `pool.updated_at` alongside `pool.updated_at_data_hash` after apply
- Early-return skip is bypassed when `self.licenses is not None`
  (ODL-style pools) so time-expired licenses are always reprocessed;
  inner availability block gets the same treatment
src/palace/manager/util/json.py
- Add `int` type annotations to all `float_precision` parameters
src/palace/manager/integration/license/{opds,boundless,overdrive}/importer.py
- `has_changed` → `needs_apply`
src/palace/manager/integration/license/{opds1,odl}/extractor.py
src/palace/manager/integration/license/boundless/parser.py
- `data_source_last_updated=` → `updated_at=`
alembic/versions/20260402_57d824b34167_add_change_tracking_hash_columns.py
- New migration: `updated_at_data_hash` on editions and licensepools,
  `created_at` / `updated_at` on licensepools
tests/manager/data_layer/test_bibliographic.py
- Replace `data_source_last_updated` with `updated_at`; rewrite
  test_apply_no_changes_needed for hash-based semantics; rename
  test_data_source_last_updated_updates_timestamp
tests/manager/data_layer/test_measurement.py
- Update test_taken_at: taken_at now defaults to None
tests/manager/integration/license/{opds,overdrive}/test_importer.py
tests/manager/integration/license/boundless/conftest.py
- Update mock/fixture references from has_changed / last_checked
  to needs_apply / updated_at
- Exclude `updated_at` from hash calculation in `fields_excluded_from_hash`
  so that identical content with different timestamps does not trigger
  spurious re-imports.
- Fix `_canonicalize_sort_key` crash when sorting sequences containing
  multiple `None` values (`None < None` raises TypeError in Python).
  Use a stable sentinel `""` as the second element of the sort key instead.
- Move `_CANONICALIZE_TYPE_ORDER` to a module-level constant to avoid
  rebuilding the dict on every recursive call.
- Cache `calculate_hash()` result on the instance via `PrivateAttr` and
  invalidate on field mutation, avoiding a redundant SHA-256 computation
  per `apply()` cycle.
- Remove redundant `should_apply_to` guard inside `CirculationData.apply`;
  the early-return path already handles all the same conditions.
- Fix misleading log message when skipping a circulation data update.
- Add docstrings to `json_hash`, `BibliographicData.needs_apply`, and
  `CirculationData.needs_apply`.
- Add tests for `json_hash`, multiple-None sequence sorting, and unsupported
  type errors in `_canonicalize_sort_key`.
- Add a note to the migration explaining the first-import-after-deploy
  performance impact.
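
The multiple-None sort-key fix above can be sketched like this. The names are
simplified (the PR's function is `_canonicalize_sort_key` and its type table
`_CANONICALIZE_TYPE_ORDER`); the rank values here are assumptions:

```python
# Canonicalization must sort heterogeneous sequences deterministically, but
# Python 3 raises TypeError on None < None (and None < anything). The sort
# key therefore maps every value to a (type-rank, comparable-value) pair.
_TYPE_ORDER = {type(None): 0, bool: 1, int: 2, float: 2, str: 3}


def _sort_key(value: object) -> tuple[int, object]:
    rank = _TYPE_ORDER[type(value)]  # KeyError for unsupported types
    # None has no natural ordering; substitute a stable sentinel "" as the
    # second element so two Nones compare equal instead of raising TypeError.
    if value is None:
        return (rank, "")
    return (rank, value)


# Without the sentinel, a sequence containing two Nones would crash the sort.
assert sorted([None, 2, None, 1], key=_sort_key) == [None, None, 1, 2]
```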
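
The PrivateAttr hash-caching pattern described above can be sketched on a toy
model (illustrative only — not the real BaseMutableData; the field set and
invalidation hook are assumptions):

```python
import hashlib
from typing import Optional

from pydantic import BaseModel, PrivateAttr


class HashedData(BaseModel):
    """Toy model showing cache-and-invalidate with a pydantic PrivateAttr."""

    title: str
    _hash_cache: Optional[str] = PrivateAttr(default=None)

    def calculate_hash(self) -> str:
        # Compute the SHA-256 digest at most once per content state.
        if self._hash_cache is None:
            self._hash_cache = hashlib.sha256(
                self.model_dump_json().encode("utf-8")
            ).hexdigest()
        return self._hash_cache

    def __setattr__(self, name: str, value: object) -> None:
        # Any public field mutation invalidates the cached digest.
        if not name.startswith("_"):
            super().__setattr__("_hash_cache", None)
        super().__setattr__(name, value)
```

Repeated calls within one apply() cycle then reuse the cached digest instead
of re-serializing and re-hashing the whole record.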
…ction

The `opds_import_task` was not passing `apply_circulation` to
`importer.import_feed`, making the fallback path for "bibliographic
unchanged, circulation changed" completely dead code. Pass
`apply.circulation_apply.delay` to restore that path.
Add a `needs_apply` guard to the `elif` branch in
`import_feed_from_response` so `apply_circulation` is only queued when
the circulation data has actually changed, preventing redundant tasks on
every re-import of unchanged content.
Fix `CirculationData.needs_apply` to always return `True` when
`self.licenses` is not None (ODL-style pools). License expiry is
time-dependent and cannot be detected by content hashing alone; this
mirrors the existing exception already present in the `apply()` early-
return guard.
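
The ODL exception can be sketched as follows (an illustrative stand-in: the
class shape and stored-hash parameter are assumptions, but the rule mirrors
the one described in the commit message):

```python
from typing import Optional, Sequence


class CirculationDataSketch:
    """Toy stand-in for CirculationData's needs_apply() decision."""

    def __init__(
        self,
        content_hash: str,
        licenses: Optional[Sequence[object]] = None,
    ) -> None:
        self.content_hash = content_hash
        self.licenses = licenses

    def needs_apply(self, stored_hash: Optional[str]) -> bool:
        # ODL-style pools track individual licenses whose expiry is
        # time-dependent and invisible to a content hash: always reprocess.
        if self.licenses is not None:
            return True
        # Aggregate pools: apply only when the content actually changed.
        return self.content_hash != stored_hash
```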
@dbernstein dbernstein force-pushed the feature/change-tracking branch from a1f0909 to f58ea44 Compare April 10, 2026 21:09