Read Opt. by ColinLeeo · Pull Request #754 · apache/tsfile

ColinLeeo · 2026-03-27T08:55:58Z

TsFile C++ Read Path Performance Optimization — Overview

Background

The current TsFile C++ read path uses row-by-row decoding with a row-oriented result set API. In full-scan and filtered query scenarios, throughput falls behind Parquet+Arrow. This optimization aims to make TsFile batch read throughput significantly exceed Parquet+Arrow while maintaining interface compatibility.

Summary of Optimizations

The optimizations span four layers:

1. Batch Decode Infrastructure

Added read_batch_int32/int64/float/double and skip_* batch interfaces to Decoder (PLAIN / TS2DIFF / Gorilla), processing 129 values per call instead of paying one virtual dispatch per value.
Added satisfy_batch_time batch filter interface to Filter, evaluating an entire batch of timestamps at once.
Eliminated intermediate stack buffer copies in TS2DIFF batch decode — reads directly from the wrapped ByteStream buffer pointer.
PLAIN batch decode now uses __builtin_bswap64/32 (compiles to a single ARM REV instruction) and skips the read_buf intermediate copy.
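
As an illustration of the byte-swap batch decode described above, here is a minimal sketch. It assumes that PLAIN stores INT64 values big-endian, and read_batch_int64_be is a hypothetical helper name, not the actual Decoder interface in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not the TsFile Decoder API): decode `count` big-endian
// int64 values straight from `src` into `dst`, with no intermediate buffer.
// Returns the number of bytes consumed.
inline std::size_t read_batch_int64_be(const std::uint8_t* src,
                                       std::size_t src_len,
                                       std::int64_t* dst,
                                       std::size_t count) {
    assert(src_len >= count * sizeof(std::int64_t));
    (void)src_len;  // only checked in debug builds
    for (std::size_t i = 0; i < count; ++i) {
        std::uint64_t raw;
        // memcpy sidesteps unaligned-access undefined behaviour; compilers
        // typically lower it to a single load.
        std::memcpy(&raw, src + i * sizeof(raw), sizeof(raw));
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        // GCC/Clang builtin; swaps the 8 bytes of the loaded word.
        raw = __builtin_bswap64(raw);
#endif
        dst[i] = static_cast<std::int64_t>(raw);
    }
    return count * sizeof(std::int64_t);
}
```

The loop body has no virtual call and no per-value bounds check, which is the property the batch interfaces are after. The length is validated once per batch.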

2. Single-Column Batch Read Path

Added DECODE_TV_BATCH method in ChunkReader / AlignedChunkReader: decodes time + value in batches of 129 rows, applies batch filter, and writes results into TsBlock.
Adapted SingleDeviceTsBlockReader to the batch path so that get_next_tsblock can return a TsBlock directly to the user.
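
The decode, filter, and write sequence described above can be sketched roughly as follows. TimeRangeFilter and decode_tv_batch_sketch are hypothetical stand-ins; the real Filter and ChunkReader signatures in the PR may differ, and the batches here are assumed to be already decoded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical time-range filter with a batch entry point.
struct TimeRangeFilter {
    std::int64_t lo;  // inclusive lower bound
    std::int64_t hi;  // inclusive upper bound

    // Evaluate a whole batch: writes 1 (pass) or 0 (reject) per row into
    // `mask` and returns how many rows passed.
    std::size_t satisfy_batch(const std::int64_t* times, std::size_t n,
                              std::uint8_t* mask) const {
        std::size_t passed = 0;
        for (std::size_t i = 0; i < n; ++i) {
            mask[i] = static_cast<std::uint8_t>((times[i] >= lo) &
                                                (times[i] <= hi));
            passed += mask[i];
        }
        return passed;
    }
};

// Filter one decoded (time, value) batch and append the surviving rows to
// the output columns. Returns the number of rows appended.
inline std::size_t decode_tv_batch_sketch(
    const TimeRangeFilter& filter,
    const std::vector<std::int64_t>& times,
    const std::vector<double>& values,
    std::vector<std::int64_t>& out_times,
    std::vector<double>& out_values) {
    assert(times.size() == values.size());
    std::vector<std::uint8_t> mask(times.size());
    const std::size_t passed =
        filter.satisfy_batch(times.data(), times.size(), mask.data());
    for (std::size_t i = 0; i < times.size(); ++i) {
        if (mask[i]) {
            out_times.push_back(times[i]);
            out_values.push_back(values[i]);
        }
    }
    return passed;
}
```

Keeping the mask separate from the append step means the same mask could be reused for further value columns without re-evaluating the filter.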

3. Multi-Value Column Merged Read

Introduced MultiAlignedTimeseriesIndex to allow a single AlignedChunkReader to hold 1 time column + N value columns simultaneously.
Time column is decoded only once; N value columns share the decoded timestamps and filter mask.
VectorMeasurementColumnContext wraps a multi-value SSI (series scan iterator); SingleDeviceTsBlockReader automatically detects and merges multiple measurements within the same device.
Fixed double-delete bug in SingleDeviceTsBlockReader::close() where multiple map entries pointed to the same VectorMeasurementColumnContext.
Fixed per-column buffer size tracking in get_cur_page_header (a single shared file_data_value_buf_size_ previously caused a heap-buffer-overflow when columns had different page sizes).
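
The double-delete scenario above arises whenever several map keys alias one heap object. One common remedy, shown here as a sketch with hypothetical types (ColumnContext and close_contexts are not names from the PR, and the actual fix may be structured differently), is to collect the distinct pointers before freeing them.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for a shared column context.
struct ColumnContext {
    int id;
};

// Free every context exactly once, even when several measurement names
// map to the same object. Returns the number of objects deleted.
inline std::size_t close_contexts(
    std::map<std::string, ColumnContext*>& by_name) {
    std::set<ColumnContext*> unique;
    for (const auto& kv : by_name) {
        unique.insert(kv.second);  // duplicates collapse here
    }
    for (ColumnContext* ctx : unique) {
        delete ctx;  // deleting a null pointer is a no-op
    }
    by_name.clear();  // drop the now-dangling pointers
    return unique.size();
}
```

With three measurement names of which two share a context, this deletes two objects rather than three.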

4. Parallel Decode + Batch Append Fast Path

Introduced DecodeThreadPool for page-level parallel decompression of the N value columns (Snappy decompression runs concurrently across columns).
In the scatter phase of multi_DECODE_TV_BATCH, when all rows pass the filter and no column has nulls, the per-row row_appender.append() loop is bypassed — each column's decoded batch is written to the Vector buffer in a single memcpy.
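
A minimal sketch of the batch append fast path, assuming an INT64 column with no nulls and a byte mask from the filter step. append_batch is a hypothetical name, and std::vector stands in for the TsBlock Vector buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Append one decoded batch to `column`.
// `mask[i]` is 1 when row i passed the filter; `passed` is the mask's
// population count, computed once by the filter step.
inline void append_batch(std::vector<std::int64_t>& column,
                         const std::int64_t* decoded,
                         const std::uint8_t* mask,
                         std::size_t n, std::size_t passed) {
    assert(passed <= n);
    if (n == 0) {
        return;
    }
    if (passed == n) {
        // Fast path: every row survives, so copy the whole batch at once.
        const std::size_t old_size = column.size();
        column.resize(old_size + n);
        std::memcpy(column.data() + old_size, decoded,
                    n * sizeof(std::int64_t));
        return;
    }
    // Slow path: scatter the surviving rows one by one.
    for (std::size_t i = 0; i < n; ++i) {
        if (mask[i]) {
            column.push_back(decoded[i]);
        }
    }
}
```

The fast-path check costs one comparison per batch, so taking it is essentially free when it does not apply.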

Test Dataset

Parameter	Value
Table	bench_table
Devices	10
Total rows	1,000,000 (100,000 per device)
Columns	time, id1(TAG), id2(TAG), s1(INT64), s2(DOUBLE), s3(FLOAT), s4(INT32)
Encoding	Time: TS2DIFF, Values: PLAIN
Compression	Snappy
Platform	macOS ARM64 (Apple Silicon), clang -O3

Benchmark Results

TAG_FILTER — filter by device id, read 100,000 rows × 4 value columns from a single device:

Read Mode	Throughput	vs Baseline
TsFile (row, pre-optimization baseline)	~4.5M rows/s	1.0x
TsFile (batch, single-column)	~9.5M rows/s	2.1x
TsFile (batch, multi-value + parallel + batch append)	~21M rows/s	4.7x
Parquet+Arrow	~1.7M rows/s	—

TIME_FILTER — filter by time range, read 333,333 rows × 4 value columns across all devices:

Read Mode	Throughput	vs Baseline
TsFile (row, pre-optimization baseline)	~4.5M rows/s	1.0x
TsFile (batch, single-column)	~9.2M rows/s	2.0x
TsFile (batch, multi-value + parallel + batch append)	~19.5M rows/s	4.3x
Parquet+Arrow	~6.4M rows/s	—
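
The "vs Baseline" column is throughput divided by the row-mode baseline. The same arithmetic can be applied against the Parquet+Arrow rows. Taking the approximate figures above at face value, the fully optimized path comes out at roughly 12x Parquet+Arrow for TAG_FILTER (21 / 1.7) and roughly 3x for TIME_FILTER (19.5 / 6.4). The helpers below are illustrative only and are not part of the benchmark tooling.

```cpp
#include <cassert>
#include <cmath>

// Ratio of two throughputs expressed in the same unit (rows per second).
inline double speedup(double rows_per_s, double reference_rows_per_s) {
    assert(reference_rows_per_s > 0.0);
    return rows_per_s / reference_rows_per_s;
}

// Wall-clock time in milliseconds to scan `rows` at a given throughput.
inline double scan_ms(double rows, double rows_per_s) {
    assert(rows_per_s > 0.0);
    return rows / rows_per_s * 1000.0;
}
```

For instance, scan_ms(100000, 21.0e6) gives about 4.8 ms for the single-device TAG_FILTER read on the fully optimized path.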

Phase Timing Breakdown (Post-Optimization)

Instrumented timing of each phase within multi_DECODE_TV_BATCH:

Phase	% of Total	Description
Time decode (TS2DIFF)	~5%	128-value block bit-unpacking + prefix sum
Filter + value decode (PLAIN bswap)	~95%	Batch time filter + 4-column byte-swap decode
Scatter (write to TsBlock)	~0%	Eliminated by batch append fast path
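
The "prefix sum" step of the time decode can be sketched as follows. It assumes a delta layout in which each block stores a first value, a minimum delta, and bit-packed residuals (delta minus minimum delta). The residuals are taken as already unpacked, and prefix_sum_decode is a hypothetical name rather than a function from ts2diff_decoder.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rebuild timestamps from delta-encoded data.
//   first     : first timestamp of the block
//   min_delta : smallest delta in the block, subtracted before bit-packing
//   residuals : unpacked (delta - min_delta) values, `n` of them
//   out       : receives n + 1 timestamps
inline void prefix_sum_decode(std::int64_t first, std::int64_t min_delta,
                              const std::uint64_t* residuals, std::size_t n,
                              std::int64_t* out) {
    assert(out != nullptr);
    out[0] = first;
    for (std::size_t i = 0; i < n; ++i) {
        // Each timestamp is the previous one plus the reconstructed delta.
        out[i + 1] =
            out[i] + min_delta + static_cast<std::int64_t>(residuals[i]);
    }
}
```

The loop carries a dependency from one iteration to the next, but it performs only two additions per value, which fits the small share of time attributed to this phase in the table above.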

PR Plan

Split into 5 PRs, merged in dependency order:

PR 1: Batch Decode Infrastructure
│     decoder.h, plain_decoder.h, ts2diff_decoder.h, gorilla_decoder.h
│     filter.h, and_filter.h, or_filter.h, time_operator.h/.cc
│     gorilla_codec_test.cc
│
└─► PR 2: Single-Column Batch Read Path
      │   chunk_reader.cc/.h, aligned_chunk_reader.cc/.h
      │   tsfile_series_scan_iterator.cc/.h
      │   single_device_tsblock_reader.cc/.h
      │   result_set.h, tsblock.h
      │
      └─► PR 3: Multi-Value Column Merged Read
            │   tsfile_common.h (MultiAlignedTimeseriesIndex)
            │   tsfile_io_reader.cc/.h
            │   aligned_chunk_reader.cc/.h (ValueColumnState, multi-value methods)
            │   single_device_tsblock_reader.cc/.h (VectorMeasurementColumnContext)
            │   vector.h
            │
            └─► PR 4: Parallel Decode + Batch Append Fast Path
                  thread_pool.h (new file)
                  aligned_chunk_reader.cc/.h (parallel decompress, batch append)

PR 5: Benchmark Tooling + Decoder Micro-Optimizations (independent)
      bench_read.cpp/.h (new files), examples CMakeLists
      plain_decoder.h (__builtin_bswap, direct pointer access)
      ts2diff_decoder.h (eliminate stack copy)
      third_party/simde (portable SIMD library)

PR 1 → 2 → 3 → 4 have sequential dependencies and must be merged in order. PR 5 has no dependencies and can be merged independently.

Correctness Verification

All 9 TableModel tests pass (including MultiLargePage large-data test).
All PLAIN / TS2DIFF / Gorilla codec tests pass.
Remaining reader/writer test results are consistent with the develop branch (10 pre-existing failures unaffected).

Commit: Read Opt. (5ceeeba)

ColinLeeo wants to merge 1 commit into develop from read_opt
