Read Opt. #754

Open: ColinLeeo wants to merge 1 commit into develop from read_opt

ColinLeeo commented Mar 27, 2026

TsFile C++ Read Path Performance Optimization — Overview

Background

The current TsFile C++ read path uses row-by-row decoding with a row-oriented result set API. In full-scan and filtered query scenarios, throughput falls behind Parquet+Arrow. This optimization aims to make TsFile batch read throughput significantly exceed Parquet+Arrow while maintaining interface compatibility.

Summary of Optimizations

The optimizations span four layers:

1. Batch Decode Infrastructure

  • Added read_batch_int32/int64/float/double and skip_* batch interfaces to Decoder (PLAIN / TS2DIFF / Gorilla), processing 129 values per call instead of making one virtual call per value.
  • Added satisfy_batch_time batch filter interface to Filter, evaluating an entire batch of timestamps at once.
  • Eliminated intermediate stack buffer copies in TS2DIFF batch decode — reads directly from the wrapped ByteStream buffer pointer.
  • PLAIN batch decode now uses __builtin_bswap64/32 (compiles to a single ARM REV instruction) and skips the read_buf intermediate copy.
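
As a rough illustration of the PLAIN batch decode described above, the sketch below decodes a run of 8-byte values in one call, reading straight from the source pointer and byte-swapping each value. The function name `read_batch_int64_plain` and its signature are hypothetical and are not the Decoder interface from this PR; it also assumes big-endian storage, a little-endian host, and GCC/Clang for `__builtin_bswap64`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not the PR's Decoder API): decode up to `count`
// big-endian int64 values from `src` into `dst` in a single call.
// Assumes a little-endian host and a GCC/Clang compiler.
inline std::size_t read_batch_int64_plain(const std::uint8_t* src,
                                          std::size_t src_len,
                                          std::int64_t* dst,
                                          std::size_t count) {
    assert(src != nullptr && dst != nullptr);
    const std::size_t available = src_len / sizeof(std::int64_t);
    const std::size_t n = count < available ? count : available;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t raw;
        // memcpy keeps the load safe for unaligned source addresses.
        std::memcpy(&raw, src + i * sizeof(raw), sizeof(raw));
        dst[i] = static_cast<std::int64_t>(__builtin_bswap64(raw));
    }
    return n;  // number of values actually decoded
}
```

The caller pays one call per batch rather than one per value, and the loop body is simple enough for the compiler to unroll or vectorize.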

2. Single-Column Batch Read Path

  • Added DECODE_TV_BATCH method in ChunkReader / AlignedChunkReader: decodes time + value in batches of 129 rows, applies batch filter, and writes results into TsBlock.
  • SingleDeviceTsBlockReader adapted to the batch path, supporting get_next_tsblock to return TsBlock directly to the user.
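
The decode, filter, and write steps of the batch path can be sketched roughly as follows. `TimeRangeFilter`, `emit_filtered_batch`, and the use of `std::vector` as the output are hypothetical stand-ins for illustration; the real Filter, ChunkReader, and TsBlock types in this PR are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a time-range filter with a batch interface.
struct TimeRangeFilter {
    std::int64_t lo;  // inclusive lower bound
    std::int64_t hi;  // inclusive upper bound

    // Evaluate a whole batch of timestamps, writing one 0/1 flag per row.
    // Returns the number of rows that pass.
    std::size_t satisfy_batch_time(const std::int64_t* times, std::size_t n,
                                   std::uint8_t* mask) const {
        std::size_t pass = 0;
        for (std::size_t i = 0; i < n; ++i) {
            mask[i] = static_cast<std::uint8_t>(times[i] >= lo && times[i] <= hi);
            pass += mask[i];
        }
        return pass;
    }
};

// Filter one already-decoded (time, value) batch and append the passing rows
// to the output columns, which stand in for a TsBlock here.
inline std::size_t emit_filtered_batch(const std::int64_t* times,
                                       const double* values, std::size_t n,
                                       const TimeRangeFilter& filter,
                                       std::vector<std::int64_t>& out_times,
                                       std::vector<double>& out_values) {
    assert(times != nullptr && values != nullptr);
    std::vector<std::uint8_t> mask(n);
    const std::size_t pass = filter.satisfy_batch_time(times, n, mask.data());
    for (std::size_t i = 0; i < n; ++i) {
        if (mask[i]) {
            out_times.push_back(times[i]);
            out_values.push_back(values[i]);
        }
    }
    return pass;
}
```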

3. Multi-Value Column Merged Read

  • Introduced MultiAlignedTimeseriesIndex to allow a single AlignedChunkReader to hold 1 time column + N value columns simultaneously.
  • Time column is decoded only once; N value columns share the decoded timestamps and filter mask.
  • VectorMeasurementColumnContext wraps a multi-value SSI; SingleDeviceTsBlockReader automatically detects and merges multiple measurements within the same device.
  • Fixed double-delete bug in SingleDeviceTsBlockReader::close() where multiple map entries pointed to the same VectorMeasurementColumnContext.
  • Fixed per-column buffer size tracking in get_cur_page_header. Previously, a single shared file_data_value_buf_size_ caused a heap-buffer-overflow when columns had different page sizes.
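
The double-delete scenario above arises whenever several map keys alias one heap object. One common way to handle it, shown here as a minimal sketch and not necessarily the fix applied in this PR, is to collect the distinct pointers first and delete each one once. `ColumnContext` and `close_contexts` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for VectorMeasurementColumnContext.
struct ColumnContext {
    int id;
};

// Frees every distinct context exactly once, even when several measurement
// names map to the same object. Returns the number of objects freed.
inline std::size_t close_contexts(
    std::map<std::string, ColumnContext*>& by_measurement) {
    // Pass 1: gather unique pointers while they are all still valid.
    std::unordered_set<ColumnContext*> unique;
    for (const auto& entry : by_measurement) {
        if (entry.second != nullptr) {
            unique.insert(entry.second);
        }
    }
    by_measurement.clear();
    // Pass 2: delete each object once.
    for (ColumnContext* ctx : unique) {
        delete ctx;
    }
    return unique.size();
}
```

Collecting before deleting avoids comparing against pointers that have already been freed.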

4. Parallel Decode + Batch Append Fast Path

  • Introduced DecodeThreadPool for page-level parallel decompression of N value columns (Snappy decompress in parallel).
  • In the scatter phase of multi_DECODE_TV_BATCH, when all rows pass the filter and no column has nulls, the per-row row_appender.append() loop is bypassed — each column's decoded batch is written to the Vector buffer in a single memcpy.
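
The batch append fast path can be illustrated with the sketch below. `ColumnBuffer` and `append_batch` are hypothetical names standing in for the Vector buffer and the scatter step. The sketch assumes the column has no nulls, so it checks only whether every row passed the filter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical output column: a flat byte buffer standing in for Vector.
struct ColumnBuffer {
    std::vector<std::uint8_t> bytes;
};

// Appends one decoded batch to the column. Returns true when the fast path
// (a single memcpy of the whole batch) was taken.
// Assumption: the column contains no nulls.
inline bool append_batch(ColumnBuffer& col, const std::int64_t* values,
                         const std::uint8_t* mask, std::size_t n,
                         std::size_t pass_count) {
    assert(values != nullptr && mask != nullptr);
    if (pass_count == n) {
        // Fast path: every row passed, so copy the batch in one go.
        const std::size_t old_size = col.bytes.size();
        col.bytes.resize(old_size + n * sizeof(std::int64_t));
        if (n > 0) {
            std::memcpy(col.bytes.data() + old_size, values,
                        n * sizeof(std::int64_t));
        }
        return true;
    }
    // Slow path: append only the rows whose mask flag is set.
    for (std::size_t i = 0; i < n; ++i) {
        if (mask[i]) {
            const auto* p = reinterpret_cast<const std::uint8_t*>(&values[i]);
            col.bytes.insert(col.bytes.end(), p, p + sizeof(std::int64_t));
        }
    }
    return false;
}
```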

Test Dataset

| Parameter | Value |
|---|---|
| Table | bench_table |
| Devices | 10 |
| Total rows | 1,000,000 (100,000 per device) |
| Columns | time, id1 (TAG), id2 (TAG), s1 (INT64), s2 (DOUBLE), s3 (FLOAT), s4 (INT32) |
| Encoding | Time: TS2DIFF; Values: PLAIN |
| Compression | Snappy |
| Platform | macOS ARM64 (Apple Silicon), clang -O3 |

Benchmark Results

TAG_FILTER — filter by device id, read 100,000 rows × 4 value columns from a single device:

| Read Mode | Throughput | vs Baseline |
|---|---|---|
| TsFile (row, pre-optimization baseline) | ~4.5M rows/s | 1.0x |
| TsFile (batch, single-column) | ~9.5M rows/s | 2.1x |
| TsFile (batch, multi-value + parallel + batch append) | ~21M rows/s | 4.7x |
| Parquet+Arrow | ~1.7M rows/s | |

TIME_FILTER — filter by time range, read 333,333 rows × 4 value columns across all devices:

| Read Mode | Throughput | vs Baseline |
|---|---|---|
| TsFile (row, pre-optimization baseline) | ~4.5M rows/s | 1.0x |
| TsFile (batch, single-column) | ~9.2M rows/s | 2.0x |
| TsFile (batch, multi-value + parallel + batch append) | ~19.5M rows/s | 4.3x |
| Parquet+Arrow | ~6.4M rows/s | |

Phase Timing Breakdown (Post-Optimization)

Instrumented timing of each phase within multi_DECODE_TV_BATCH:

| Phase | % of Total | Description |
|---|---|---|
| Time decode (TS2DIFF) | ~5% | 128-value block bit-unpacking + prefix sum |
| Filter + value decode (PLAIN bswap) | ~95% | Batch time filter + 4-column byte-swap decode |
| Scatter (write to TsBlock) | ~0% | Eliminated by batch append fast path |
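
For readers unfamiliar with the "prefix sum" step in the time decode phase, the sketch below shows the idea under an assumed block layout: a first value, a per-block minimum delta, and non-negative residuals that have already been bit-unpacked. The function `ts2diff_prefix_sum` and this exact layout are assumptions for illustration, and the bit-unpacking itself is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: rebuild absolute values from delta-encoded data.
// Assumed layout: out[i] = previous value + min_delta + residuals[i],
// where the value before out[0] is `first`.
inline void ts2diff_prefix_sum(std::int64_t first, std::int64_t min_delta,
                               const std::uint64_t* residuals, std::size_t n,
                               std::int64_t* out) {
    assert(residuals != nullptr && out != nullptr);
    std::int64_t prev = first;
    for (std::size_t i = 0; i < n; ++i) {
        // Running sum of deltas: each output depends on the previous one.
        prev += min_delta + static_cast<std::int64_t>(residuals[i]);
        out[i] = prev;
    }
}
```

For regularly sampled timestamps the residuals are mostly zero, which is why this phase takes only a small share of the total time.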

PR Plan

The work is split into 5 PRs, merged in dependency order:

PR 1: Batch Decode Infrastructure
│     decoder.h, plain_decoder.h, ts2diff_decoder.h, gorilla_decoder.h
│     filter.h, and_filter.h, or_filter.h, time_operator.h/.cc
│     gorilla_codec_test.cc
│
└─► PR 2: Single-Column Batch Read Path
      │   chunk_reader.cc/.h, aligned_chunk_reader.cc/.h
      │   tsfile_series_scan_iterator.cc/.h
      │   single_device_tsblock_reader.cc/.h
      │   result_set.h, tsblock.h
      │
      └─► PR 3: Multi-Value Column Merged Read
            │   tsfile_common.h (MultiAlignedTimeseriesIndex)
            │   tsfile_io_reader.cc/.h
            │   aligned_chunk_reader.cc/.h (ValueColumnState, multi-value methods)
            │   single_device_tsblock_reader.cc/.h (VectorMeasurementColumnContext)
            │   vector.h
            │
            └─► PR 4: Parallel Decode + Batch Append Fast Path
                  thread_pool.h (new file)
                  aligned_chunk_reader.cc/.h (parallel decompress, batch append)

PR 5: Benchmark Tooling + Decoder Micro-Optimizations (independent)
      bench_read.cpp/.h (new files), examples CMakeLists
      plain_decoder.h (__builtin_bswap, direct pointer access)
      ts2diff_decoder.h (eliminate stack copy)
      third_party/simde (portable SIMD library)

PR 1 → 2 → 3 → 4 have sequential dependencies and must be merged in order. PR 5 has no dependencies and can be merged independently.

Correctness Verification

  • All 9 TableModel tests pass (including MultiLargePage large-data test).
  • All PLAIN / TS2DIFF / Gorilla codec tests pass.
  • Remaining reader/writer test results are consistent with the develop branch (10 pre-existing failures unaffected).
