#### *4.1. BPC*

Internally, the BPC generates CxD pairs for four samples at a time (a full column of the zig-zag pattern), following the techniques of [27]. To generate the CxD pairs for a sample, flags from a 3 × 3 neighborhood around the current sample are taken into account. When dealing with columns, this neighborhood grows to 6 × 3. This is seen in Figure 6.
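The neighborhood gathering can be sketched behaviorally; for a full four-sample column the same flags come from a 6 × 3 window, since the strip shares rows with its neighbors. A minimal sketch (the names are illustrative, not the paper's RTL):

```python
# Behavioral sketch (not the RTL) of gathering the 3x3 significance
# neighborhood around one sample; all names here are illustrative.

def neighborhood(sig, x, y):
    """Return the 8 significance flags around sample (x, y).

    sig is a 2D list of 0/1 significance states; out-of-bounds
    neighbors count as insignificant, as in JPEG 2000.
    """
    h, w = len(sig), len(sig[0])

    def at(i, j):
        return sig[j][i] if 0 <= i < w and 0 <= j < h else 0

    return [at(x + dx, y + dy)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dx, dy) != (0, 0)]
```

In hardware these flags are simply wires selected from the state memory; the context is then obtained by lookup on them.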

The main problem is that the context generation modules are slow. They are based on lookup tables that require many LUT levels in the FPGA implementation, and they must be cascaded to generate the output significance required by the following bit strip. To avoid these delays, a dedicated module predicts the next significance state faster; it is shown in Figure 7.
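The predictor's core idea can be illustrated as follows, resolved sequentially here for clarity (in hardware it is a short combinational chain rather than a loop). The sketch deliberately ignores pass membership and sign coding, and all names are assumptions:

```python
# Sketch of the predictor's idea (not the RTL): the significance a
# sample acquires this cycle feeds the decision for the next sample
# in the same strip, so the whole strip resolves in one pass.

def predict_strip(sig, m, other_sig_neighbor):
    """sig: current significance of the 4 strip samples.
    m: magnitude bit of each sample in this bit plane.
    other_sig_neighbor: whether each sample has a significant
    neighbor outside the strip (an illustrative simplification).
    Returns (new_significance, newly_acquired)."""
    new_sig = list(sig)
    acquired = [False] * len(sig)
    for i in range(len(sig)):
        above = new_sig[i - 1] if i > 0 else 0
        coded = not sig[i] and (other_sig_neighbor[i] or above)
        if coded and m[i] == 1:
            new_sig[i] = 1       # first 1-bit makes it significant
            acquired[i] = True   # flag: acquired this very cycle
    return new_sig, acquired
```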

The other modules are straightforward: context generation is a ROM that outputs the context associated with the neighborhood by simple lookup. The cleanup predictor works much like the significance predictor, locating the first nonzero bit and marking it and the following ones as significant when needed.
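The cleanup predictor's scan reduces to a first-set-bit search over the strip; a minimal sketch with illustrative names:

```python
# Sketch of the cleanup predictor's scan (illustrative, not the RTL):
# locate the first nonzero magnitude bit in the strip; run-length
# mode ends there and the remaining bits are coded normally.

def cleanup_predict(m):
    """Return the index of the first 1-bit in the strip, or None
    when the whole strip stays in run-length mode."""
    for i, bit in enumerate(m):
        if bit:
            return i
    return None
```

In hardware this is a priority encoder over the four magnitude bits rather than a loop.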

Up to 10 *CxD* pairs are generated per cycle. A serializer is used to order them sequentially before sending them to the MQ-coder (see Figure 8).

With a sufficiently large vector queue, idle cycles are avoided when few CxD pairs are generated (e.g., during the first refinement pass or the last significance and cleanup passes), since the buffer keeps feeding the MQ-coder. The serializer outputs one pair per cycle as long as the vector queue is not empty, so it can feed the MQ-coder without forcing a stall.
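Functionally, the serializer behaves like a FIFO that accepts a vector of pairs per cycle and drains one pair per cycle; a minimal sketch, with illustrative names:

```python
from collections import deque

# Functional sketch of the serializer (illustrative names): a vector
# of up to 10 CxD pairs enters per cycle; one pair leaves per cycle
# while the queue is non-empty.

class CxDSerializer:
    def __init__(self):
        self.q = deque()

    def push_vector(self, pairs):
        """Accept the 0-10 valid pairs produced this cycle."""
        self.q.extend(pairs)

    def pop(self):
        """Emit one pair toward the MQ-coder, or None when idle."""
        return self.q.popleft() if self.q else None
```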

**Figure 6.** BPC structure. *sg* indicates sign bits for the current strip. *m* indicates magnitude bits for the current strip. *ic* is the coded flag for the strip, indicating whether each bit is coded. *fr* is a further flag indicating, for each bit, whether it is being refined for the first time. *s* is the significance status for the neighborhood of the strip. *passx* are flags indicating the current pass (significance, cleanup, refinement). Flags are updated for the next pass and output to memory. The contexts for each pass exit the context generation (*CG*) modules for all three passes (*spc*, *mrc*, *sbc*) along with the xor bit for the sign. The result is output as a vector of triplets *oc*, *ob*, *ov*, containing the context–data (CxD) pairs as well as a valid bit.

**Figure 7.** Significance predictor module. It cascades the possible significance conversion of all bits in the strip by using a small critical path, instead of waiting for the cascade of *CG* modules. The significances *s*, magnitude bits *m*, and sign bits *sgn* are used to produce the new significance *s<sup>s</sup><sub>f</sub>*. It is output along with a flag *s<sup>s</sup><sub>a</sub>*, indicating if it is a newly acquired significance this cycle.

**Figure 8.** BPC architecture. The BPC core generates up to 10 CxD pairs per cycle, which are then serialized before sending them to the CxD FIFO.

#### *4.2. MQ-Coder*

The extensive work seen in Section 3.2 can be summed up in two different approaches: pipelining and dual-symbol processing. Since the target platform is an FPGA, pipelining is ideal to avoid a long critical path, which on reconfigurable hardware carries a higher timing penalty than on fixed silicon. As will be seen, the bottleneck lies in CxD generation, so this area-efficient approach is the right choice: it can keep up with the previous stages.

The one step common to all pipelining approaches is separating the byte-out procedure into a final stage. Register updates are often split into two stages, treating the *A* and *C* updates independently. Memory access is also split in some implementations. In short, the algorithm has no obvious stage boundaries, given the strong dependencies between its steps. In fact, most designs implement the most likely execution path and must stall on an unexpected input.

Most implementations offer a theoretical throughput that varies with the data being compressed. The goal of this paper was to design an implementation that consistently processes CxD pairs at a fixed rate. By pipelining the design and inserting queues between stages, any potential stall in one stage is absorbed by the queues and does not affect the others. The result can be seen in Figure 9.
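The decoupling idea can be modeled in a few lines: each stage advances only when its input queue has data and its output queue has room, so a stall is contained locally instead of rippling through the pipeline. A toy sketch (not the RTL):

```python
from collections import deque

# Toy model of decoupled pipeline stages: a stalled stage simply
# skips its cycle, and its neighbors keep working until the
# inter-stage queues fill up or drain.

def step(stage_fn, in_q, out_q, depth):
    """Advance one stage by one cycle if input is available and the
    output queue (of capacity `depth`) has room."""
    if in_q and len(out_q) < depth:
        out_q.append(stage_fn(in_q.popleft()))
```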

**Figure 9.** MQ-coder architecture. Interval updates are fused when possible, reducing the number of bound updates that could stall the pipeline.

Four main stages can be seen.

#### 4.2.1. First and Second Stages

First, the context memory is accessed. This memory outputs the "context information" for the incoming context: its prediction, the MPS and LPS transitions, the XOR bit, and the probability.

This memory is written with the context from the third stage, so care is taken whenever the same context is encountered twice within a three-context window.


All in all, reading is split across two stages, without the stalls that can occur in implementations such as [49].

The second stage is the most complex one and holds the critical path. First, the context information to use is selected: it comes from the context memory unless the same context appeared twice in a row, in which case it is forwarded from the state memory and the previous prediction.

State and prediction changes are fed to the state and context memories.
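This forwarding can be sketched as a small bypass check against in-flight writes; the window depth and names below are assumptions, not the paper's exact logic:

```python
# Sketch of read-after-write forwarding for the context memory: when
# the context being read matches one written in the last few cycles,
# the in-flight value is used instead of the stale memory word.

def read_context(memory, cx, inflight):
    """memory: context index -> context information.
    inflight: recent (context, info) writes, newest last."""
    for wcx, winfo in reversed(inflight):
        if wcx == cx:
            return winfo     # bypass: newest pending write wins
    return memory[cx]        # no hazard: read the memory word
```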

The prediction is adjusted, and the *A* register is updated. This can happen in one of four ways: the *A* register is not shifted, shifted once, shifted twice, or assigned from memory. In the last case, the number of shifts is calculated in advance.

The number of shifts *s̄*, the value *p̄* to add to the *C* register, and the hit flag *h* (indicating that *C* must be updated with *p̄*) are sent to the next stage.
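Computing the shift count in advance amounts to counting how many left shifts bring the 16-bit *A* register back above 0x8000; a sketch:

```python
# Sketch: the renormalization shift count is the number of left
# shifts needed to bring the 16-bit A register back to >= 0x8000,
# so it can be precomputed from the value that will be loaded.

def renorm_shift(a):
    s = 0
    while a < 0x8000:
        a <<= 1
        s += 1
    return s
```

In hardware this loop is a leading-zero count on the value to be loaded, not an iterative circuit.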

Both stages are seen in Figure 10.

**Figure 10.** MQ-coder first and second stages. The context and bit from the BPC are input, and a probability, shift, and hit (correct prediction or not) are output.

#### 4.2.2. Third Stage

The fourth stage can stall the pipeline. To minimize that risk, the inputs from several cycles are combined into a single update whenever possible, so that the minimum number of updates is sent ahead. This can be done under two scenarios:


Both these merging techniques can be done recursively.
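Since the two scenarios are not reproduced here, the sketch below assumes two natural fusible cases for updates of the form C = (C + p) << s; it is an illustration of the merging idea, not the paper's exact rule:

```python
# Sketch of update fusion (an assumption about the elided scenarios):
# each update is (p, s), meaning C = (C + p) << s, with p = 0 when
# the hit flag is clear. Fusing is exact in two cases.

def try_merge(u1, u2):
    """Try to fuse two consecutive C-register updates.
      * the first update has no shift (s1 == 0): the adds combine;
      * the second update has no add (p2 == 0): the shifts combine.
    Returns the fused (p, s), or None when no merge applies."""
    p1, s1 = u1
    p2, s2 = u2
    if s1 == 0:
        return (p1 + p2, s2)
    if p2 == 0:
        return (p1, s1 + s2)
    return None
```

Because the fused update has the same (p, s) form as its inputs, the result can be merged again with the next update, which is what makes the technique recursive.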

#### 4.2.3. Fourth Stage

The *C* register is updated by adding *p̄* and shifting it. Whenever it fills up, a byte is output. When the register overflows or the special byte 0xFF is output, bit stuffing is needed to avoid producing the special markers used to indicate the end of the stream. Up to three bytes might be output per update, and the control logic covering all possibilities would make this stage too slow.

In order to avoid that problem, shifting is done byte by byte. If the shift amount exceeds one byte, the pipeline stalls for one cycle. Studies [48] have shown this penalty to be negligible (<1% of the time). This stage is seen in Figure 11.
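The 0xFF rule itself can be sketched as follows; this is a simplified illustration of the stuffing (the actual coder handles it at the carry/shift level inside the *C* register), with illustrative names:

```python
# Simplified sketch of the 0xFF stuffing rule: after an 0xFF byte,
# the next byte carries only 7 data bits, so no two-byte marker in
# the 0xFF90-0xFFFF range can appear inside the code stream.

def byte_out(stream, byte):
    if stream and stream[-1] == 0xFF:
        byte &= 0x7F  # stuffed byte: MSB forced to 0
    stream.append(byte)
```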

**Figure 11.** MQ-coder last stage. It interfaces with two FIFOs, reading updates from the previous stage and making sure space is available at the output to send bytes out.
