Tracking Issue: Vectors

This is a tracking issue for adding vectors to Vortex.

Related: https://github.com/vortex-data/vortex/issues/4492
RFC: TODO LINK (https://github.com/vortex-data/vortex/pull/4726)

# Vectors

Since "vector" is a heavily overloaded term, we we define a vector in Vortex specifically as:

- A fully decompressed (which we call "Canonical" in Vortex) array of contiguous data
- A datatype that can hold itself as children (a vector can hold other child vectors)
- The data is _recursively_ **de**compressed, which means all child vectors are also fully decompressed (this is different from the current _canonical encodings_, which allow compressed children)
- Vectors are either borrowed or owned (`Vector` and `VectorMut`), mimicking the popular `Bytes` and `BytesMut` from the [`bytes`](https://docs.rs/bytes/latest/bytes/) crate.
- The owned version of vectors (`VectorMut`) is fully mutatable

See RFC 0001 (TODO LINK) for more information on why we need vectors. To summarize, instead of potentially compressed `Array`s being the intermediate representation (IR) between edges in our decompression plans (which is the model we use for our current compute functions), we want to enforce that fully decompressed arrays (which will be vectors) are the IR so that we can:

- Guarantee progress in decompression at every step (right now, compute functions can do whatever they want) 
- Redefine `Array`s to represent logical query plans to decompress data. This allows us to perform better optimizations (pushdown and common subexpression elimination) if we know exactly what data is flowing between nodes in our decompression plan

## Why not Arrow?

_Copied and pasted from the RFC_

There are a few reasons we are not directly using Arrow arrays:

* Vortex vectors are recursively built around variable-width primitives, allowing us to use the narrowest type possible
  (e.g. we can narrow dictionary codes, run-end offsets, list view sizes, etc.).
* Internally we use the `vortex-buffer` crate, meaning we preserve our capabilities around runtime alignment.
* We reserve the right to diverge from Arrow layouts in the future. For example, a `VarBinVector` could use in-memory
  pointers like DuckDB, rather than logical offsets. This makes the vectors cheaper to concatenate and slice, but
  incurs a conversion cost to/from Arrow.

We may re-visit this decision in the future prior to finalizing the Vortex extension API.

# Steps / History

- [x] https://github.com/vortex-data/vortex/pull/5010
- [x] https://github.com/vortex-data/vortex/pull/5056
- [x] https://github.com/vortex-data/vortex/pull/5093
- [x] Implement all `Vector` variants (1 per `DType`)
  - [x] `Null` 
  - [x] `Bool`
  - [x] `Primitive`
    - [x] https://github.com/vortex-data/vortex/pull/5079
    - [x] https://github.com/vortex-data/vortex/pull/5114
    - [x] https://github.com/vortex-data/vortex/pull/5287
  - [x] `Decimal`
    - [x] https://github.com/vortex-data/vortex/pull/5124
  - [x] `VarBin` / `Binary`
    - [X] https://github.com/vortex-data/vortex/pull/5115
    - [x] https://github.com/vortex-data/vortex/pull/5133
  - [x] `List`
    - [x] https://github.com/vortex-data/vortex/pull/5190
  - [x] `FixedSizeList`
    - [x] https://github.com/vortex-data/vortex/pull/5127
  - [X] `Struct`
    - [x] https://github.com/vortex-data/vortex/pull/5094
    - [X] https://github.com/vortex-data/vortex/pull/5105
  - [x] `Extension`
- [ ] Add conversion methods to Arrow (feature flags)
- [ ] Replace the current `ArrayBuilder`s with `Vector`s
- [ ] Refactor `vortex-scalar` to be wrappers around `Vector` of length 1 (we could also potentially flip this and make the scalars the elements of vectors instead)

## Chores

- [x] https://github.com/vortex-data/vortex/pull/5029
- [x] https://github.com/vortex-data/vortex/pull/5031
- [x] https://github.com/vortex-data/vortex/pull/5070
- [x] https://github.com/vortex-data/vortex/pull/5072
- [X] https://github.com/vortex-data/vortex/pull/5075
- [x] https://github.com/vortex-data/vortex/pull/5112
- [x] https://github.com/vortex-data/vortex/pull/5118
- [x] https://github.com/vortex-data/vortex/pull/5123
- [x] https://github.com/vortex-data/vortex/pull/5126
- [x] https://github.com/vortex-data/vortex/pull/5165
- [x] https://github.com/vortex-data/vortex/pull/5185
- [x] https://github.com/vortex-data/vortex/pull/5186
- [x] https://github.com/vortex-data/vortex/pull/5188

# Unresolved Questions

- N/A

# Public API

```rust
// Borrowed Vector
pub enum Vector {
    Null(NullVector),
    Bool(BoolVector),
    Primitive(PrimitiveVector),
    // ...
}

// Example Vector
pub struct BoolVector {
    bits: BitBuffer,
    validity: Option<Mask>,
}

// Owned and mutable Vector
enum VectorMut {
    Null(NullVectorMut),
    Bool(BoolVectorMut),
    Primitive(PrimitiveVectorMut),
    // ...
}

impl VectorMut {
    fn freeze(self) -> Vector { ... }
    fn split_off(&mut self, at: usize) -> VectorMut { ... }
    fn unsplit(&mut self, other: VectorMut) -> VectorMut { ... }
}
```

All vectors will implement one of these two helper traits:

```rust
/// Common operations for immutable vectors (all the variants of [`Vector`]).
pub trait VectorOps: private::Sealed + Into<Vector> {
    /// The mutable equivalent of this immutable vector.
    type Mutable: VectorMutOps<Immutable = Self>;

    /// Returns the number of elements in the vector, also referred to as its "length".
    fn len(&self) -> usize;

    /// Returns the validity mask of the vector, where `true` represents a _valid_ element and
    /// `false` represents a `null` element.
    fn validity(&self) -> &Mask;

    /// Tries to convert `self` into a mutable vector (implementing [`VectorMutOps`]).
    fn try_into_mut(self) -> Result<Self::Mutable, Self>
    where
        Self: Sized;
}

/// Common operations for mutable vectors (all the variants of [`VectorMut`]).
pub trait VectorMutOps: private::Sealed + Into<VectorMut> {
    /// The immutable equivalent of this mutable vector.
    type Immutable: VectorOps<Mutable = Self>;

    /// Returns the number of elements in the vector, also referred to as its "length".
    fn len(&self) -> usize;

    /// Returns the total number of elements the vector can hold without reallocating.
    fn capacity(&self) -> usize;

    /// Reserves capacity for at least `additional` more elements to be inserted in the given
    /// vector.
    fn reserve(&mut self, additional: usize);

    /// Extends the vector by appending elements from another vector.
    fn extend_from_vector(&mut self, other: &Self::Immutable);

    /// Appends `n` null elements to the vector (if it is nullable).
    fn append_nulls(&mut self, n: usize);

    /// Converts `self` into an immutable vector.
    fn freeze(self) -> Self::Immutable;

    /// Splits the vector into two at the given index.
    fn split_off(&mut self, at: usize) -> Self;

    /// Absorbs a mutable vector that was previously split off.
    fn unsplit(&mut self, other: Self);
}
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tracking Issue: Vectors #5028

Vectors

Why not Arrow?

Steps / History

Chores

Unresolved Questions

Public API

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Tracking Issue: Vectors #5028

Description

Vectors

Why not Arrow?

Steps / History

Chores

Unresolved Questions

Public API

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions