Using De Bruijn sequences for faster count leading zeros (CLZ)

The count leading zeros (CLZ) operation is an important function in many computing applications, especially on microcontrollers and other resource-constrained devices. CLZ counts the number of leading zero bits before the first 1 bit in a word. An efficient CLZ implementation allows faster bit scanning and manipulation operations. On ARM Cortex-M microcontrollers, CLZ is provided as a single cycle instruction, CLZ r0, r1, which counts the leading zeros in register r1 and stores the result in r0. However, in software implementations or on other architectures, CLZ can be challenging to optimize. This article explains how De Bruijn sequences can be used to accelerate CLZ in software.

Contents

The CLZ operation Naive CLZ implementations De Bruijn sequences for constant time CLZ Generating De Bruijn sequences Computing the CLZ outputs Handling variable word sizes Other considerations Performance comparison Conclusion

The CLZ operation

The CLZ operation takes a word input and returns the number of leading 0 bits before the first 1 bit. For example:

CLZ(00110001) = 2

CLZ(10000111) = 0
CLZ(11111111) = 0

CLZ is useful for many algorithms like normalizing floats, finding the highest/lowest set bit, calculating logarithms, etc. An efficient CLZ is important for high performance embedded and general purpose computing. However, most processors do not have a CLZ instruction and it must be implemented in software.

Naive CLZ implementations

A simple CLZ implementation is to loop through each bit, shifting and testing until the first 1 bit is found. This takes O(n) time where n is the word size. For 32-bit words, this is too slow for most applications. Unrolled loops can improve throughput but increase code size exponentially.

Another approach is binary search – test the midpoint, then search higher or lower half. This reduces average complexity to O(log n) but adds branching overhead. For small word sizes, the branches often mispredict, stalling pipelines.

Lookup tables are fast and give O(1) search time. But for large word sizes like 64-bit, the table size becomes prohibitive. Also, tables cannot be updated easily for changing word sizes.

De Bruijn sequences for constant time CLZ

De Bruijn sequences can generate an O(1) CLZ by using clever bit patterns. A De Bruijn sequence of order n contains every possible n-bit sequence exactly once in a cyclic pattern. We can leverage this to create a CLZ using:

Look up the word’s leading bits in a De Bruijn table, returning an index.
Use the index to lookup a precomputed CLZ for that bit pattern.

For example, with a 4-bit De Bruijn sequence:

0000, 0001, 0011, 0010, 0110, 0111, 0101, 0100

If the input word starts with 0010 (index 4), the table indicates it has 2 leading zeros. This lookup takes O(1) time. The difficulties are constructing the compact De Bruijn table, and computing the CLZ outputs.

Generating De Bruijn sequences

While algorithms exist to generate De Bruijn sequences, for CLZ we can construct them manually. To build a table of size 2^n, create a binary counting sequence from 0000.. to 111..1. Then take each n-bit segment and rotate it left until it forms a cyclic chain:

4-bit sequence:

0000, 0001, 0010, 0011

Rotated left by 1:

0000, 0001, 0011, 0010

Repeating generates the complete De Bruijn sequence. For CLZ, the table size should match the word size. Larger tables give diminishing improvements due to caching effects.

Computing the CLZ outputs

Next, we need to populate each table entry with the CLZ count for its bit pattern. One approach is to directly count zeros on each entry. But we can optimize this by reusing computations.

Observe that rotating the bit pattern left or right changes the CLZ count by 1. E.g. CLZ(0010) = 2, CLZ(0100) = 3. We can organize the table so adjacent entries differ by 1 leading zero. Then we only need compute CLZ for the first entry, and increment each subsequent one.

Further, by ordering the table specially, we can minimize the number of distinct CLZ entries needed. For example, the 4-bit table only requires 3 base CLZ counts that are incremented:

0000, 0011, 0010, 0001 [CLZ = 3, 2, 1, 0]

This reduces initialization effort and storage requirements.

Handling variable word sizes

A limitation of De Bruijn CLZ is that the table is fixed for a specific word size. To handle variable widths, we can generate tables for powers of two sizes like 32, 64, etc. and select the appropriate one at runtime. Alternatively, mask the input word to the maximum table width.

The lookup cost scales logarithmically with table size. So while larger tables provide faster CLZ for big words, they may be overkill for smaller sizes. To optimize across word sizes, we can create multiple tables of different orders and select the best fit for each input.

Other considerations

Some additional design points for efficiently implementing De Bruijn CLZ in practice:

Store the table in fast cacheable memory like SRAM instead of flash.

Organize as a single lookup to avoid pipeline stalls.
Precompute values during initialization to reduce runtime overhead.
Optimize memory layout to exploit alignment and packing.

Use smaller bit widths like bytes to simplify population and access.

Of course, hardware CLZ instructions will always be faster than software where available. But De Bruijn sequences let us approach that speed in microcontrollers and other environments without CLZ support.

Performance comparison

Compared to other methods, De Bruijn CLZ provides a good blend of simplicity, speed, size, and scalability:

Method	Performance	Memory
Loop	O(n)	Tiny code
Binary search	O(log n)	Tiny code
Table lookup	O(1)	2^n entries
De Bruijn	O(1)	n entries

The constant overhead of De Bruijn lookup is very low compared to other O(1) techniques. And the table scales linearly rather than exponentially in the word size.

Conclusion

De Bruijn sequences provide an efficient way to implement count leading zeros in software. By using clever bit patterns and lookup tables, they can determine the CLZ in O(1) time with minimal memory. This allows performance close to hardware CLZ instructions without dedicated support. De Bruijn CLZ is fast, compact, and flexible enough for most embedded and application uses.

Using De Bruijn sequences for faster count leading zeros (CLZ)

The CLZ operation

Naive CLZ implementations

De Bruijn sequences for constant time CLZ

Generating De Bruijn sequences

Computing the CLZ outputs

Handling variable word sizes

Other considerations

Performance comparison

Conclusion

More ARM insights right in your inbox

Leave a Reply Cancel reply

You Might Also Like

Estimating Stack Memory Needs for Arm Cortex-M Applications

Differences between Thumb-16 and Thumb-2 instruction sets

Basepri Example

Execution Latency of ITCM vs External RAM on Cortex-M1