Technical Notes

Memory Overhead

Inevitably there is an overhead to using an in-memory cache. This overhead is proportional to the number of blocks held, regardless of their individual sizes.

_images/size_of_overhead.png

This works out at a minimum of 154 bytes per block, tending asymptotically to 64 bytes per block.
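
This can be measured directly. A minimal sketch, assuming size_of(), num_bytes() and num_blocks() accessors (the method names are assumptions here; the quantities correspond to the 'sizeof', 'bytes' and 'Blocks' figures reported by the simulator later in these notes):

import svfsc

svf = svfsc.cSVF('id')
block_size = 8
for i in range(1024):
    # Non-adjacent writes so the blocks are not coalesced.
    svf.write(i * 2 * block_size, b' ' * block_size)
overhead = svf.size_of() - svf.num_bytes()
print(f'Overhead: {overhead} bytes, {overhead / svf.num_blocks():.1f} bytes/block')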

Pickling

Pickling

A Sparse Virtual File (svfsc.cSVF) can be pickled:

import pickle
import svfsc

svf = svfsc.cSVF('id')
svf.write(21, b'ABCDEF')
pickle_result = pickle.dumps(svf)
# Save pickle_result somewhere

Un-Pickling

And to un-pickle:

import pickle
import svfsc

svf = svfsc.cSVF('id')
svf.write(21, b'ABCDEF')
pickle_result = pickle.dumps(svf)
new_svf = pickle.loads(pickle_result)
assert id(new_svf) != id(svf)
assert new_svf.id() == svf.id()
assert new_svf.file_mod_date() == svf.file_mod_date()
assert new_svf.blocks() == svf.blocks()

Pickling is versioned with an integer. As svfsc progresses this ensures that pickles from previous svfsc versions can be detected and either rejected or, where possible, read and migrated.

Using pickletools

import pickle
import pickletools
import svfsc

svf = svfsc.cSVF('id')
svf.write(1, b' ')
svf.write(12, b' ')
pickle_result = pickle.dumps(svf)
pickletools.dis(pickle_result)

The result will be something like:

    0: \x80 PROTO      4
    2: \x95 FRAME      106
   11: \x8c SHORT_BINUNICODE 'svfs'
   17: \x94 MEMOIZE    (as 0)
   18: \x8c SHORT_BINUNICODE 'cSVF'
   24: \x94 MEMOIZE    (as 1)
   25: \x93 STACK_GLOBAL
   26: \x94 MEMOIZE    (as 2)
   27: )    EMPTY_TUPLE
   28: \x81 NEWOBJ
   29: \x94 MEMOIZE    (as 3)
   30: }    EMPTY_DICT
   31: \x94 MEMOIZE    (as 4)
   32: (    MARK
   33: \x8c     SHORT_BINUNICODE 'id'
   37: \x94     MEMOIZE    (as 5)

   8<---- Snip --->8

   95: \x8c     SHORT_BINUNICODE 'pickle_version'
  111: \x94     MEMOIZE    (as 14)
  112: K        BININT1    1
  114: u        SETITEMS   (MARK at 32)
  115: b    BUILD
  116: .    STOP
highest protocol among opcodes = 4

Pickling Overhead

The following table shows the overhead, in bytes, when pickling. The overhead is the length of the result of .dumps() minus the sum of the length of all blocks. The table shows the overhead for different block counts and sizes.

Pickling Overhead (bytes)

Block Count   Block Size 1   Block Size 256   Block Size 4096
          1            109              112               112
         16            215              278               287
        256           2023             2927              3542
       4096          32743            52982             55622
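
The overhead of any particular SVF can be measured directly. A minimal sketch (the exact values depend on block positions, so expect figures only in the region of the table above):

import pickle
import svfsc

def pickle_overhead(block_count: int, block_size: int) -> int:
    """Length of the pickle minus the sum of the block lengths."""
    svf = svfsc.cSVF('id')
    for i in range(block_count):
        # Non-adjacent writes so the blocks are not coalesced.
        svf.write(i * 2 * block_size, b' ' * block_size)
    return len(pickle.dumps(svf)) - block_count * block_size

print(pickle_overhead(16, 256))  # Expect a value near the 278 in the table.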

Detecting File Changes

This is tricky. If the remote file changes there is no sure way for the SVF to know of this. However, there are a couple of ways that the user of an SVF can detect it.

File Modification Time

On construction the SVF can take an optional file modification time as a float. The user can query this with file_mod_time(), compare it with the latest file modification time and act accordingly (for example, call .clear() and reload as necessary).
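
A minimal sketch, assuming the constructor takes the modification time as its second argument and that the file is visible through a hypothetical local path:

import os

import svfsc

file_path = 'example.dat'  # Hypothetical file.
svf = svfsc.cSVF('id', os.stat(file_path).st_mtime)
# ... later, check whether the file has changed:
if os.stat(file_path).st_mtime != svf.file_mod_time():
    svf.clear()  # Discard stale data, then reload as necessary.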

Cautious Overwrite

On construction the SVF can take an optional flag compare_for_diff. If True then, when a write() detects a data difference on an overwrite, an IOError is raised. This is a weak detection technique and adds about 25% to the cost of an overlapping write.
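
A minimal sketch, assuming compare_for_diff is accepted as a keyword argument to the constructor:

import svfsc

svf = svfsc.cSVF('id', compare_for_diff=True)
svf.write(21, b'ABCDEF')
try:
    svf.write(21, b'ABCXEF')  # Overwrite with different data.
except IOError:
    # The remote file has probably changed: discard and reload.
    svf.clear()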

Greedy Gets

With a high latency connection it is expensive to make many small requests, so it makes sense to make a smaller number of larger GETs, a form of cache prefetching. This is done by passing a greedy_length value to need(), which coalesces the result of need() where possible.

For example, take an SVF holding these (file_position, length) blocks:

((8,  4), (16, 4), (32, 4))

Requesting 40 bytes from file position 8 gives this minimal block set by need(8, 40):

((12, 4), (20, 12), (36, 12),)

The same request with need(8, 40, greedy_length=64) gives this block set:

((12, 64),)

The single, larger, request may well be cheaper than three small ones. This can be explored with a simulator.
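
The worked example above as code, assuming need() accepts greedy_length as an optional keyword argument:

import svfsc

svf = svfsc.cSVF('id')
for file_position in (8, 16, 32):
    svf.write(file_position, b'    ')  # Three 4 byte blocks.
print(svf.need(8, 40))                    # ((12, 4), (20, 12), (36, 12))
print(svf.need(8, 40, greedy_length=64))  # ((12, 64),)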

Network Simulator

In src/cpy/simulator.py there is a simulator that can reproduce the effect of network latency, network bandwidth and server seek/read times when writing data to an SVF. The default configuration is:

  • Network latency (each way): 10 milliseconds.

  • Network bandwidth: 50 million bits per second.

  • Server seek speed: 10 gigabytes per second.

  • Server read speed: 50 million bytes per second.

The simulator can also take a greedy-length argument which allows you to tune your GET requests.

Some pre-built simulation requests are in src/cpy/sim_examples.py:

  • A simple read of 32 bytes of data every 64 bytes up to a size of 20,480 bytes.

  • The actual seek/read operations for reading the TIFF metadata of TIFF files up to around 2 GB. This has a more detailed analysis of performance (below).

Synthetic File

Here is the read time using different greedy_length values:

_images/greedy_length_synthetic.png

Reading TIFF Metadata

The second example comprises all the seek/read operations needed to get all the TIFF metadata from selected TIFF files. For each file the table gives:

  • The file size in MB.

  • The number of seek()/read() operations needed to read the TIFF metadata.

  • The size of the TIFF metadata in bytes and as a proportion of the file size.

Selected TIFF Files

File               Size (MB)   seek()/read() ops   Metadata bytes   Metadata %
CMU-1.tiff               195              62,615          256,566       0.126%
TUPAC-TR-001.svs       2,146           1,051,242        4,208,118       0.187%
TUPAC-TR-002.svs         657              84,845          483,582       0.070%
TUPAC-TR-003.svs         563              59,936          242,436       0.041%
TUPAC-TR-004.svs         744             291,302        1,311,074       0.168%
TUPAC-TR-005.svs         955             176,754          709,714       0.071%
TUPAC-TR-006.svs         945             254,948        1,165,658       0.118%

Given these sample files, the time taken to read the TIFF metadata for various greedy read lengths is:

_images/py_sim_greedy.png

The performance improvement is because SVF.has() is far more likely to succeed at larger greedy_length values. Here are some file examples with the count of cache hits (SVF.has() succeeds) and cache misses (SVF.has() fails) for different greedy_length values.

_images/py_sim_greedy_hits_misses.png

The minor drawback is that more bytes are read than strictly necessary. For example, with CMU-1.tiff and greedy_length=0 the minimal byte set is 256,566 bytes in total; with greedy_length=131072 the total number of bytes read is 1,179,648. That is about 4.6x the minimal read but still about 1/200 of the original file.

Here are examples of the total amount of data read for different greedy_length values:

Note: linear scale.

_images/py_sim_greedy_overhead.png

A Comparison Against a Local File Read

This is a comparison of the time it takes to read TIFF metadata when the file is on the local file system against the simulator time for the same file accessed remotely, with the network connection described above and a greedy length of 64 KB.

Selected TIFF Files

File               Size (MB)   Local (s)   Remote (s)   Ratio
CMU-1.tiff               195       0.139        0.413    3.0 x
TUPAC-TR-001.svs       2,146       2.14         3.22     1.5 x
TUPAC-TR-002.svs         657       0.183        0.582    3.2 x
TUPAC-TR-003.svs         563       0.130        0.512    3.9 x
TUPAC-TR-004.svs         744       0.597        1.10     1.8 x
TUPAC-TR-005.svs         955       0.361        0.815    2.3 x
TUPAC-TR-006.svs         945       0.521        1.01     1.9 x

So choosing a decent greedy length can get the remote performance within hailing distance of the local file performance.

The Effect of Simulated Network Latency

With the simulator we can experiment with various values of network latency (each way), bandwidth and greedy reads. For example here is the result of reading TIFF metadata with different network latencies.

The ZLIB curve represents Zero Latency, Infinite Bandwidth and is thus the network performance floor. As expected, the greedy read length has little effect there, since svfsc is an optimisation for slow networks:

_images/py_sim_greedy_latency.png

As reading TIFF metadata usually involves a large number of scattered small reads, network latency has a dominant effect. The poor performance of high latency networks can be greatly improved by using greedy reads: large (64 KB) greedy reads can bring high latency (50 ms) networks to about 10x their ZLIB time.

The Effect of Simulated Network Bandwidth

Here is the result of different bandwidths for a network latency (each way) of 10 ms.

_images/py_sim_greedy_bandwidth.png

With this level of network latency the bandwidth is almost irrelevant. As usual, high greedy lengths compensate, and it is only when they are above 10,000 bytes or so that the bandwidth becomes significant. Large (64 KB) greedy reads can bring low bandwidth (10 Mbps) networks to about 10x their ZLIB time.

Here is the result of different bandwidths for a network latency (each way) of 1 ms.

_images/py_sim_greedy_bandwidth_latency_1.png

With this level of network latency the bandwidth becomes more significant. Again, medium greedy reads (optimum around 8 to 32 KB) can bring low bandwidth (10 Mbps) networks to about 10x their ZLIB time.

Amazon AWS Cloud Example

Here is an example simulation where the TIFF files are on an AWS server with a typical connection latency (each way) of 100 ms and a bandwidth of 1 MB/s (8 Mb/s).

_images/py_sim_greedy_AWS.png

These values are very close to some measured data of TIFF files on AWS.

Running the Simulator

Here is the help information for the simulator:

$ python src/cpy/simulator.py -h
usage: src/cpy/simulator.py
       [-h] [-l LOG_LEVEL] [--latency LATENCY]
       [--bandwidth BANDWIDTH] [--seek-rate SEEK_RATE]
       [--read-rate READ_RATE] [--greedy-length GREEDY_LENGTH]
       [--realtime]

Simulate reading into a SVF.

options:
  -h, --help            show this help message and exit
  -l LOG_LEVEL, --log-level LOG_LEVEL
                        Log level.
  --latency LATENCY     Communications channel latency (NOTE: one way)
                        in ms. [default: 10]
  --bandwidth BANDWIDTH
                        Communications channel bandwidth in
                        million bits per second. Zero is infinite
                        bandwidth. [default: 50]
  --seek-rate SEEK_RATE
                        Server seek rate in million bytes per
                        second. [default: 10000]
  --read-rate READ_RATE
                        Server read rate in million bytes per
                        second. [default: 50]
  --greedy-length GREEDY_LENGTH
                        The greedy length to read fragments from
                        the server. Zero means read every
                        fragment. Default is to run through a
                        range of greedy lengths and report the
                        performance. [default: -1]
  --realtime            Run in realtime (may be slow).
                        [default: 0]

The simulator uses data in src/cpy/sim_examples.py, which contains several example files. Each example is just a tuple of (file_position, length) values, Run Length Encoded for compactness.

With no arguments the simulator runs through the pre-prepared set of values with a range of greedy-length values. If greedy-length is given then the simulator runs with just that value. For example, exploring the simulator with a greedy_length of 64 KB:

$ python src/cpy/simulator.py --greedy-length=65536
Simulator setup:
Network latency 10.000 (ms) bandwidth 50.000 (M bits/s)
Server seek rate 10000.000 (M bytes/s) read rate 50.000 (M bytes/s)
2023-05-09 13:00:46,285 - simulator.py#256  - INFO     - Running EXAMPLE_FILE_POSITIONS_LENGTHS_TIFF_CMU_1 with 62483 file actions and greedy_length 65536
2023-05-09 13:00:46,724 - simulator.py#153  - INFO     - has(): hits: 62472 misses: 11
2023-05-09 13:00:46,724 - simulator.py#154  - INFO     - Blocks: 8 bytes: 682936 sizeof: 683314 overhead: 378
2023-05-09 13:00:46,724 - simulator.py#159  - INFO     - Comms time :    335.412 (ms) ( 81.7%) +++++++++++++++++++++++++++++++++++++++++
2023-05-09 13:00:46,724 - simulator.py#164  - INFO     - Server time:     34.843 (ms) (  8.5%) ++++
2023-05-09 13:00:46,724 - simulator.py#169  - INFO     - SVF time   :     40.268 (ms) (  9.8%) +++++
2023-05-09 13:00:46,724 - simulator.py#179  - INFO     - Total      :    410.523 (ms) (100.0%)
2023-05-09 13:00:46,724 - simulator.py#180  - INFO     - SVF contents: 682936 Execution time: 0.411 (s) 1.587 (Mb/s)
2023-05-09 13:00:46,725 - simulator.py#256  - INFO     - Running EXAMPLE_FILE_POSITIONS_LENGTHS_TUPAC_TR_001_svs with 1051153 file actions and greedy_length 65536
2023-05-09 13:00:52,913 - simulator.py#153  - INFO     - has(): hits: 1051080 misses: 73
2023-05-09 13:00:52,913 - simulator.py#154  - INFO     - Blocks: 10 bytes: 4784128 sizeof: 4784570 overhead: 442
2023-05-09 13:00:52,913 - simulator.py#159  - INFO     - Comms time :   2225.938 (ms) ( 69.6%) +++++++++++++++++++++++++++++++++++
2023-05-09 13:00:52,913 - simulator.py#164  - INFO     - Server time:    320.664 (ms) ( 10.0%) +++++
2023-05-09 13:00:52,913 - simulator.py#169  - INFO     - SVF time   :    649.409 (ms) ( 20.3%) ++++++++++
2023-05-09 13:00:52,913 - simulator.py#179  - INFO     - Total      :   3196.010 (ms) (100.0%)
2023-05-09 13:00:52,913 - simulator.py#180  - INFO     - SVF contents: 4784128 Execution time: 3.196 (s) 1.428 (Mb/s)
EXAMPLE_FILE_POSITIONS_LENGTHS_TIFF_CMU_1:
 greedy_length   Time(ms)     Hits     Miss    Hits%   Min. Bytes   Act. Bytes  Act. / Min.     sizeof Overhead  sizeof / Act.
         65536      410.5    62472       11  99.982%       256566       682936     266.183%     683314     +378       100.055%
EXAMPLE_FILE_POSITIONS_LENGTHS_TUPAC_TR_001_svs:
 greedy_length   Time(ms)     Hits     Miss    Hits%   Min. Bytes   Act. Bytes  Act. / Min.     sizeof Overhead  sizeof / Act.
         65536     3196.0  1051080       73  99.993%      4208118      4784128     113.688%    4784570     +442       100.009%
Execution time:      6.636 (s)

Thread Safety

If compiled with SVF_THREAD_SAFE and SVFS_THREAD_SAFE defined, a C++ mutex is introduced to preserve thread safety.

The Python implementation does not set SVF_THREAD_SAFE and SVFS_THREAD_SAFE; instead it uses Python mutexes, using the technique described at https://pythonextensionpatterns.readthedocs.io/en/latest/thread_safety.html.

Warning

Thread safety is strictly limited to making each API call atomic to the calling thread.

There is no contention resolution among API calls. For example, thread A could call need() on an SVF and then thread B could call, say, write(), erase() or clear(), which might invalidate the need() information held by thread A.
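
If a single SVF is shared between threads, the caller can provide that resolution itself, for example by serialising the whole has()/need()/write()/read() sequence under its own lock. A minimal sketch, with has() as used in these notes and a hypothetical fetch_from_server() callable:

import threading

import svfsc

svf = svfsc.cSVF('id')
svf_lock = threading.Lock()

def read_locked(file_position: int, length: int, fetch_from_server) -> bytes:
    # Hold the lock across the whole sequence so that no other thread can
    # invalidate the need() information before the read().
    with svf_lock:
        if not svf.has(file_position, length):
            for fpos, flen in svf.need(file_position, length):
                svf.write(fpos, fetch_from_server(fpos, flen))
        return svf.read(file_position, length)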

Thread Safety In C++

If compiled with SVF_THREAD_SAFE then SVFS::SparseVirtualFile will have a mutex:

#ifdef SVF_THREAD_SAFE
    /// Thread mutex. This adds about 5-10% execution time compared with a single threaded version.
    mutable std::mutex m_mutex;
#endif

The mutex is acquired in any relevant method by invoking it in RAII form:

#ifdef SVF_THREAD_SAFE
    std::lock_guard<std::mutex> mutex(m_mutex);
#endif

This only protects the internal data structures from modification for a single API call. It does not protect those structures from multiple, possibly interleaving, API calls.

Thread Safety In Python

The Python build is usually configured without SVF_THREAD_SAFE defined; instead it has its own locking mechanism. There are several steps to this; the example here is svfsc.cSVF.

Adding a Lock to the Python Object

A Python thread lock, PyThread_type_lock, is added to the type, conditional on Python being thread safe:

typedef struct {
    PyObject_HEAD
    SVFS::SparseVirtualFile *pSvf;
#ifdef PY_THREAD_SAFE
    PyThread_type_lock lock;
#endif
} cp_SparseVirtualFile;

This needs to be initialised in the __init__ equivalent, with error checking:

#ifdef PY_THREAD_SAFE
    self->lock = PyThread_allocate_lock();
    if (self->lock == NULL) {
        delete self->pSvf;
        PyErr_SetString(PyExc_MemoryError, "Unable to allocate thread lock.");
        return -2;
    }
#endif

This needs to be freed on de-allocation, with error checking:

#ifdef PY_THREAD_SAFE
    if (self->lock) {
        PyThread_free_lock(self->lock);
        self->lock = NULL;
    }
#endif

Using the Lock in a Function

To use this lock a C++ RAII class is used that takes a pointer to the SVF; this acquires the lock on construction and releases it when the object goes out of scope:

#ifdef PY_THREAD_SAFE

/**
 * @brief A RAII wrapper around the PyThread_type_lock for the CPython SVF.
 *
 * See https://pythonextensionpatterns.readthedocs.io/en/latest/thread_safety.html
 * */
class AcquireLockSVF {
public:
    /**
     * Acquire the lock on the Python cp_SparseVirtualFile
     *
     * @param pSVF The Python cp_SparseVirtualFile.
     */
    explicit AcquireLockSVF(cp_SparseVirtualFile *pSVF) : _pSVF(pSVF) {
        assert(_pSVF);
        assert(_pSVF->lock);
        if (!PyThread_acquire_lock(_pSVF->lock, NOWAIT_LOCK)) {
            Py_BEGIN_ALLOW_THREADS
                PyThread_acquire_lock(_pSVF->lock, WAIT_LOCK);
            Py_END_ALLOW_THREADS
        }
    }

    /**
     * Release the lock on the Python cp_SparseVirtualFile
     *
     * @param pSVF The Python cp_SparseVirtualFile.
     */
    ~AcquireLockSVF() {
        assert(_pSVF);
        assert(_pSVF->lock);
        PyThread_release_lock(_pSVF->lock);
    }

private:
    cp_SparseVirtualFile *_pSVF;
};

#else
/** Make the class a NOP which should get optimised out. */
class AcquireLockSVF {
public:
    AcquireLockSVF(cp_SparseVirtualFile *) {}
};
#endif

And is used thus:

PyObject *some_function(cp_SparseVirtualFile *self) {
    AcquireLockSVF _lock(self);

    // Do stuff...

} // Lock is released.

A similar lock is used for the SVFS data structure; the lock class is AcquireLockSVFS.

Cache Punting

The current design has the following characteristics:

  • There is no punting; the cache always expands.

  • Blocks are coalesced where possible. This is enforced by integrity checks that do not allow adjacent blocks (see the sketch after this list).

  • The SVF is not a read-through cache; it is only advisory to the caller.
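
Block coalescing can be seen directly with write() and blocks():

import svfsc

svf = svfsc.cSVF('id')
svf.write(0, b'AB')
svf.write(4, b'EF')
print(svf.blocks())  # Two blocks, something like ((0, 2), (4, 2))
svf.write(2, b'CD')  # Fills the gap so all three blocks coalesce.
print(svf.blocks())  # One block: ((0, 6),)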

Problems With Punting

The basic use case is that the caller checks the SVF by calling SparseVirtualFile::has() and there will be one of two outcomes:

All the Data is in the Cache

Then the caller is free to call SparseVirtualFile::read() which is (must be) guaranteed to succeed. If this call is immediate, or at least happens before any other call that might trigger punting, then this presents no challenge: the block with the data is merely marked with a fresh access time/integer. No punting is required.

If the calls to SparseVirtualFile::has() and SparseVirtualFile::read() are interleaved with a call that triggers a punt then the SparseVirtualFile::read() is no longer guaranteed to succeed. This presents a problem for the caller.

Some of the Data is not in the Cache

If SparseVirtualFile::has() returns false then the canonical behaviour of the caller is to:

  • Call SparseVirtualFile::need()

  • Go and get the data somehow.

  • Call SparseVirtualFile::write()

  • Call SparseVirtualFile::read() which gives the caller a copy of the data.

This sequence is expected to always succeed. However we must consider that other sequences of events might occur; for example, the caller decides not to write, or even read. Alternatively, a caller might go part way through that sequence but then call SparseVirtualFile::read() on another, or smaller, part of the file. This call might touch blocks previously touched by the original SparseVirtualFile::need() call.

One approach is that any punting must happen at the end of a SparseVirtualFile::read() once the copy has been made. This would mean that the memory used might exceed the cache limit in the intervening SparseVirtualFile::write().

Another approach would be to mark the blocks somehow when SparseVirtualFile::need() is called and then unmark them when a SparseVirtualFile::read() affecting those blocks is made. Meanwhile, any punting during a SparseVirtualFile::write() would have to ignore marked blocks.

Both mean that a second SparseVirtualFile::read() to the same place is not guaranteed to succeed.

Multi-threading Guarantees

These problems are especially acute in a multi-threaded environment.

Compromise Design

The compromise is to provide APIs that can assist the caller, who has full knowledge of its own state and that of all its threads. The caller is responsible for deciding when, and how much, to punt.

The SVF marks each block with an integer that represents the age of last use (the so-called ‘touch’ integer). This integer starts at 0 and monotonically increases with each read/write, so older blocks have lower values. When blocks are coalesced the resulting block is marked as newest, regardless of the touch values of the previous blocks.

The API is lru_punt() for both C++ and Python. This takes a single integer argument: the upper bound of the number of bytes held by the cache. It prunes older blocks, those with low touch values, until the cache is the required size, although at least one block will always remain. It returns the number of bytes removed from the cache.
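
A minimal sketch of caller-controlled punting, assuming a num_bytes() accessor for the bytes held (as in the earlier memory overhead sketch):

import svfsc

svf = svfsc.cSVF('id')
for i in range(1024):
    svf.write(i * 2 * 4096, b'\x00' * 4096)  # 4 MB in non-adjacent blocks.
bytes_removed = svf.lru_punt(1024 * 1024)  # Prune the cache to at most 1 MB.
print(bytes_removed, svf.num_bytes())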