This is a tracking issue for adding vectors to Vortex.
Related: #4492
RFC: TODO LINK (#4726)
Vectors
Since "vector" is a heavily overloaded term, we we define a vector in Vortex specifically as:
- A fully decompressed (which we call "Canonical" in Vortex) array of contiguous data
- A datatype that can hold itself as children (a vector can hold other child vectors)
- The data is recursively decompressed, which means all child vectors are also fully decompressed (this is different from the current canonical encodings, which allow compressed children)
- Vectors are either borrowed or owned (
Vector and VectorMut), mimicking the popular Bytes and BytesMut from the bytes crate.
- The owned version of vectors (
VectorMut) is fully mutatable
See RFC 0001 (TODO LINK) for more information on why we need vectors. To summarize, instead of potentially compressed Arrays being the intermediate representation (IR) between edges in our decompression plans (which is the model we use for our current compute functions), we want to enforce that fully decompressed arrays (which will be vectors) are the IR so that we can:
- Guarantee progress in decompression at every step (right now, compute functions can do whatever they want)
- Redefine
Arrays to represent logical query plans to decompress data. This allows us to perform better optimizations (pushdown and common subexpression elimination) if we know exactly what data is flowing between nodes in our decompression plan
Why not Arrow?
Copied and pasted from the RFC
There are a few reasons we are not directly using Arrow arrays:
- Vortex vectors are recursively built around variable-width primitives, allowing us to use the narrowest type possible
(e.g. we can narrow dictionary codes, run-end offsets, list view sizes, etc.).
- Internally we use the
vortex-buffer crate, meaning we preserve our capabilities around runtime alignment.
- We reserve the right to diverge from Arrow layouts in the future. For example, a
VarBinVector could use in-memory
pointers like DuckDB, rather than logical offsets. This makes the vectors cheaper to concatenate and slice, but
incurs a conversion cost to/from Arrow.
We may re-visit this decision in the future prior to finalizing the Vortex extension API.
Steps / History
Chores
Unresolved Questions
Public API
// Borrowed Vector
pub enum Vector {
Null(NullVector),
Bool(BoolVector),
Primitive(PrimitiveVector),
// ...
}
// Example Vector
pub struct BoolVector {
bits: BitBuffer,
validity: Option<Mask>,
}
// Owned and mutable Vector
enum VectorMut {
Null(NullVectorMut),
Bool(BoolVectorMut),
Primitive(PrimitiveVectorMut),
// ...
}
impl VectorMut {
fn freeze(self) -> Vector { ... }
fn split_off(&mut self, at: usize) -> VectorMut { ... }
fn unsplit(&mut self, other: VectorMut) -> VectorMut { ... }
}
All vectors will implement one of these two helper traits:
/// Common operations for immutable vectors (all the variants of [`Vector`]).
pub trait VectorOps: private::Sealed + Into<Vector> {
/// The mutable equivalent of this immutable vector.
type Mutable: VectorMutOps<Immutable = Self>;
/// Returns the number of elements in the vector, also referred to as its "length".
fn len(&self) -> usize;
/// Returns the validity mask of the vector, where `true` represents a _valid_ element and
/// `false` represents a `null` element.
fn validity(&self) -> &Mask;
/// Tries to convert `self` into a mutable vector (implementing [`VectorMutOps`]).
fn try_into_mut(self) -> Result<Self::Mutable, Self>
where
Self: Sized;
}
/// Common operations for mutable vectors (all the variants of [`VectorMut`]).
pub trait VectorMutOps: private::Sealed + Into<VectorMut> {
/// The immutable equivalent of this mutable vector.
type Immutable: VectorOps<Mutable = Self>;
/// Returns the number of elements in the vector, also referred to as its "length".
fn len(&self) -> usize;
/// Returns the total number of elements the vector can hold without reallocating.
fn capacity(&self) -> usize;
/// Reserves capacity for at least `additional` more elements to be inserted in the given
/// vector.
fn reserve(&mut self, additional: usize);
/// Extends the vector by appending elements from another vector.
fn extend_from_vector(&mut self, other: &Self::Immutable);
/// Appends `n` null elements to the vector (if it is nullable).
fn append_nulls(&mut self, n: usize);
/// Converts `self` into an immutable vector.
fn freeze(self) -> Self::Immutable;
/// Splits the vector into two at the given index.
fn split_off(&mut self, at: usize) -> Self;
/// Absorbs a mutable vector that was previously split off.
fn unsplit(&mut self, other: Self);
}
This is a tracking issue for adding vectors to Vortex.
Related: #4492
RFC: TODO LINK (#4726)
Vectors
Since "vector" is a heavily overloaded term, we we define a vector in Vortex specifically as:
VectorandVectorMut), mimicking the popularBytesandBytesMutfrom thebytescrate.VectorMut) is fully mutatableSee RFC 0001 (TODO LINK) for more information on why we need vectors. To summarize, instead of potentially compressed
Arrays being the intermediate representation (IR) between edges in our decompression plans (which is the model we use for our current compute functions), we want to enforce that fully decompressed arrays (which will be vectors) are the IR so that we can:Arrays to represent logical query plans to decompress data. This allows us to perform better optimizations (pushdown and common subexpression elimination) if we know exactly what data is flowing between nodes in our decompression planWhy not Arrow?
Copied and pasted from the RFC
There are a few reasons we are not directly using Arrow arrays:
(e.g. we can narrow dictionary codes, run-end offsets, list view sizes, etc.).
vortex-buffercrate, meaning we preserve our capabilities around runtime alignment.VarBinVectorcould use in-memorypointers like DuckDB, rather than logical offsets. This makes the vectors cheaper to concatenate and slice, but
incurs a conversion cost to/from Arrow.
We may re-visit this decision in the future prior to finalizing the Vortex extension API.
Steps / History
vortex-vector#5010FromIterator<bool>implementation #5056IntoIterimplementation for vectors #5093Vectorvariants (1 perDType)NullBoolPrimitivePVectorMutsimilar tostd::vec::Vec#5079PrimitiveVectortoPrimitiveArray#5114spare_capacity_muttoPVectorMut#5287DecimalVarBin/BinaryListListVector#5190FixedSizeListFixedSizeListVector#5127StructOperatorVTable::bind#5105ExtensionArrayBuilders withVectorsvortex-scalarto be wrappers aroundVectorof length 1 (we could also potentially flip this and make the scalars the elements of vectors instead)Chores
from_option_iter#5031match_vector_pairmacro #5123validityfor mutable vectors #5188Unresolved Questions
Public API
All vectors will implement one of these two helper traits: