Well, what I did was make a lookup array. I got an improvement of a few percentage points, but less than I had hoped. Either something is not being inlined, or the problem is that the array can't be const because it needs to be initialized. That applies to the lookup array, anyway; the one with the sizes is a different matter.
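
For illustration, here is a minimal sketch of how a lookup array like this could be made const by building it at compile time (needs C++17). The size classes, the 16-byte granule and every name in it are assumptions made up for the example, not taken from the actual allocator.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical size classes; the real allocator's sizes are not given here.
constexpr std::array<std::size_t, 8> kClassSizes = {16, 32, 64, 128, 256, 512, 1024, 2048};
constexpr std::size_t kGranule = 16;                  // assumed rounding granularity
constexpr std::size_t kMaxSize = kClassSizes.back();  // largest size the table covers

// Build the table at compile time. Entry i holds the index of the smallest
// class whose size is at least i * kGranule bytes.
constexpr auto makeLookup() {
    std::array<std::uint8_t, kMaxSize / kGranule + 1> table{};
    std::size_t cls = 0;
    for (std::size_t i = 0; i < table.size(); ++i) {
        while (kClassSizes[cls] < i * kGranule) ++cls;
        table[i] = static_cast<std::uint8_t>(cls);
    }
    return table;
}

// A true constant: no run-time initialization, so the compiler is free to
// place it in read-only data and fold lookups with constant arguments.
constexpr auto kLookup = makeLookup();

// Precondition: bytes <= kMaxSize. One add, one shift, one load.
constexpr std::size_t sizeClassFor(std::size_t bytes) {
    return kLookup[(bytes + kGranule - 1) / kGranule];
}

static_assert(sizeClassFor(17) == 1, "17 bytes should round up to the 32-byte class");
```

Whether this actually helps would have to be measured; if the compiler was already keeping the table in cache, the gain from const alone may be small.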

With a linear search or a vector, the number of search steps would be the same or lower, but each step might take longer, and this code is already pretty close to the metal.

Basically, I want my memory manager to handle many fast allocations and deallocations, and I was hoping to shave off a little more time to get a more impressive benchmark.

Right now I can do about 100 000 000 allocations and deallocations per CPU in 1.3 seconds, and I can't imagine that would be a bottleneck in any real-world app.

The weird thing is it takes twice as long for larger blocks, though. I thought this lookup was the reason why but there must be something else I am not seeing. As far as I can tell it should not matter what size the block is.

Maybe the performance drop comes from the block header info being in a separate memory area from the data? I suppose there's not much to do about that without introducing a lot of overhead.
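
For comparison, this is a rough sketch of the other layout, where the header sits immediately in front of the payload so that touching one tends to pull the other into cache. It is not how the allocator above works; `BlockHeader`, its fields and the helper names are hypothetical, and `std::malloc` only stands in for whatever the real backing store is. The cost is extra bytes per block, which is exactly the kind of overhead mentioned above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical header, padded to max_align_t so the payload that follows it
// keeps the same alignment malloc guarantees for the raw block.
struct alignas(std::max_align_t) BlockHeader {
    std::uint32_t sizeClass;
    std::uint32_t flags;
};
static_assert(sizeof(BlockHeader) % alignof(std::max_align_t) == 0,
              "payload must stay suitably aligned");

// Allocate header + payload as one contiguous block and return the payload.
inline void* allocWithHeader(std::size_t bytes, std::uint32_t sizeClass) {
    void* raw = std::malloc(sizeof(BlockHeader) + bytes);
    if (raw == nullptr) return nullptr;
    BlockHeader* h = new (raw) BlockHeader{sizeClass, 0};
    return h + 1;  // payload starts right after the header
}

// Recover the header from a payload pointer with plain pointer arithmetic.
inline BlockHeader* headerOf(void* payload) {
    return static_cast<BlockHeader*>(payload) - 1;
}

// Release a block obtained from allocWithHeader.
inline void freeWithHeader(void* payload) {
    if (payload != nullptr) std::free(headerOf(payload));
}
```

Timing the same benchmark with both layouts for small and large blocks would show whether the separate header area is really what makes the larger blocks slower.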