add byte range options to openReadStream #38

@max-mapper

Sort of thinking out loud here. I have a use case where I want to 'mount' a compressed archive and access bytes randomly without decompressing the whole archive up front. Basically I want to:

  1. Efficiently get the entry that matches some filename
  2. Read a byte range from that entry
  3. Repeat this many times, potentially reading the same entry multiple times

The yauzl API seems to be geared toward single-pass unzipping, which makes sense. One approach I was considering is to collect all on('entry') entries up front and keep them in memory; then, when a byte range request comes in, I can use the stored entry to retrieve the byte range. But I ran into problems with that. It would be much nicer to be able to lazily consult the central directory as opposed to having to read it all up front.
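
For reference, the keep-everything-in-memory approach might look roughly like this. It is only a sketch: `buildEntryIndex` is a made-up helper, and it assumes the zipfile was opened with yauzl's `lazyEntries: true` option (so entries arrive only when `readEntry()` is called) and `autoClose: false` (so the file stays open for later reads).

```javascript
// Sketch only: collect every entry into a Map keyed by file name.
// `zipfile` is assumed to behave like a yauzl ZipFile opened with
// { lazyEntries: true, autoClose: false }. buildEntryIndex is hypothetical.
function buildEntryIndex(zipfile, callback) {
  const index = new Map();

  zipfile.on('entry', (entry) => {
    index.set(entry.fileName, entry);
    zipfile.readEntry(); // request the next central directory record
  });

  // 'end' fires once the whole central directory has been read.
  zipfile.once('end', () => callback(null, index));
  zipfile.once('error', callback);

  zipfile.readEntry(); // kick off the first read
}
```

A lookup is then `index.get('some/file.txt')`, and the stored entry should be usable with `zipfile.openReadStream(entry, callback)` whenever a request comes in. The whole central directory still has to be read once, which is the cost a lazy lookup would avoid.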

The other issue is related to Deflate, which requires decompression from the beginning of the entry. I guess an alternative compression format like BGZF would make arbitrary byte range lookups much faster, but it wouldn't be compatible with many implementations. However! I found another technique where you do a single pass over the entry and build an index (https://github.com/madler/zlib/blob/master/examples/zran.c). I think this would be acceptable for my use case.
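
To make the Deflate limitation concrete, here is a minimal sketch using only Node's built-in `zlib` module. `readRangeNaive` is a hypothetical helper, not part of yauzl: it inflates the entire raw Deflate stream and then slices out the requested range, since inflation cannot start in the middle of the stream without extra saved state.

```javascript
const zlib = require('node:zlib');

// Hypothetical helper: naive byte-range read from a raw Deflate stream.
// Everything is inflated from byte 0, even if only a few bytes are wanted.
function readRangeNaive(compressed, start, end) {
  const whole = zlib.inflateRawSync(compressed);
  return whole.subarray(start, end); // [start, end) in uncompressed offsets
}

// Stand-in for one zip entry's compressed data.
const original = Buffer.from('0123456789'.repeat(1000));
const compressed = zlib.deflateRawSync(original);

const slice = readRangeNaive(compressed, 5000, 5010);
console.log(slice.toString());
```

A streaming version could stop once `end` is reached, but it would still have to inflate everything before `start`. As far as I understand it, a zran-style index avoids that by saving decompressor state at checkpoints during a single pass, so later reads can resume from the nearest checkpoint.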

Implementing zran-style indexing on top of yauzl would require some API changes, I think, e.g. a way to get a single entry from the central directory by name, and a lower-level way to control the decompression state to support zran. Before I get too deep, I wanted to sanity-check this use case: does it seem doable?
