"Faster locks" is not the way to go. Instead, think of how you can reduce lock contention.

For example, reader-writer contention could be reduced by having the reader deque the entire contents of the Q, instead of one element at a time.

Another example: writer-writer contention could be removed completely by having each writer threads use a thread-local Q. Then the reader thread would iterate over all writer Q's, consuming the entire Q at a time.

gg