Hello all,

I have a program that handles large numbers held in arrays of qwords. All of the math stuff is implemented in assembly language. All of the routines request memory for at least the return value, and some of them request memory for temporary values as well. I found that my calls to malloc were slowing the alorithms down quite a bit, so I attempted to write my own memory management routine.

The requirements for my allocation method were: 1. I dont know in advance how much memory will be requested. 2. I do not know for how long an allocation will be used before it is returned to the pool (the return values can live indefinately). I know this leaves me with the potential for fragmentation and I am allready thinking of how to handle that though it has not been an issue as of yet.

My problem is that in a single threaded program it works great, and gave me a better boost in speed then I thought it would, however, in a multi-threaded app it really bogs down. I was wondering if yall might be able to guide me as to how to improve it, or what I should be doing different.

The first code listing gives just enough to show where the bottleneck is, the second gives the entire routine with an explanation of how I pictured it working. This is my first attempt at memory management so please dont laugh to hard. Any feedback would be appreciated.

I know my bottleneck is at the lock cmp xchg, but as far as I can see only one thread at a time is allowed in that section of code, else I end up with multiple threads assigning themselves the same blocks of memory (as I found out the hard way).

Code:
LOCAL _size:qword
		LOCAL _othersize:qword
		LOCAL _firstpass:qword
		LOCAL _startingset:qword
		LOCAL _rollset

		push	rbx
		push	rdx
		push	rdi
		push	rsi
		push	r8
		push	r9
		push	r10
		push	r11
		push	r12
		push	r13
		push	r14
		push	r15
		cmp		Initialized, 0
		jne		AllreadyInitialized
		call	Initialize
		mov		Initialized, 1

AllreadyInitialized:

		cmp		_usemalloc, 0				;use malloc or my algorithm
		je		NotMalloc
		sub		rsp, 28h
		clc
		call	malloc
		add		rsp, 28h
		pop		r15
		pop		r14
		pop		r13
		pop		r12
		pop		r11
		pop		r10
		pop		r9
		pop		r8
		pop		rsi
		pop		rdi
		pop		rdx
		pop		rbx

		RET

NotMalloc:

		mov		_firstpass, 1
		mov		_rollset, 1
		add		rcx, 16
		mov		_size, rcx		
		mov		rax, rcx					; mov amount req to rax
		xor		rdx, rdx					;prep for div
		div		_blocksize					;div by block size
		cmp		rdx, 0						;if there is a remainder
		je		NoInc
		inc		rax							;inc rax by one

NoInc:

		push	rax

TryAgain:

		mov		rax, 0					;have to prep the compare and exchange each loop
		mov		rdx, 0
		mov		rbx, 1
		mov		rcx, 0
		lock	cmpxchg8b	_gm
		jnz		TryAgain
		inc		_memcnt					;get rid of the following instructions in the working module
		cmp		_memcnt, 1
		je		DontWorry
		mov		rax, rax

DontWorry:
before any algorithm runs, there is an initialization routine that uses malloc to request memory (4194304 bytes, thats 65536 sets of 64 byte blocks) and initializes the memory. The address of the first byte is then stored in an array of pointers, I used an array of pointers so I could grow my pool on the fly.

So, I have a variable that I can change in the data section so that is skips my routine and just uses malloc, this is the very first check my routine does.

After that, it takes the amount of memory requested and adds 16 to it (this is for management info and is used to make assigning and returning the memory faster the layout will be explained below). This new number is then divided by 64 (that is the size of blocks that I chose to allocate) and if there is a remainder this is inc by one. This tells me how many blocks to look for.

I have 2 other variables that are then loaded the first is an index into the array of pointers, this is the current set I am assigning from. The second variable points to the next block after the last assignment I made. This block that it points to is then checked for zero (zero indicates it is unused), if it is zero it then checks how many contiguous blocks it has free, if there are enough to handle the request it marks the first few bytes, and the last 2 bytes, it updates the next free block with how many free blocks are left, and the address is modified to point to the first byte after the management bytes and is returned to the calling algorithm.

If the first byte is not zero, it checks how many blocks in this set are in use, adds this to the base address, and checks the block following this set to see if it is free. If there are not enough free blocks to handle the request, it starts at block zero and checks for any set of free blocks that are big enough for the request. If none are found, it inc the index into the array of pointers and checks the next set. If there are no more sets, then it makes a call to malloc to get another set and marks the blocks.

The management blocks are marked as follows:

;A used block will have the following pattern:
; high bit of byte zero will be set if it is asigned from the experimental unit, will be cleared otherwise
; bytes 0 and 1 will hold a word value of the memory set being used (which index into the memptr array)
; bytes 2 and 3 will hold a word value of the starting block
; bytes 4 and 5 will hold a word value of the number of blocks being used
; bytes (starting address +((starting block + number of blocks used) * blocksize)) -1 and -2 will hold a word value of the number of blocks being used

;an unused block will have the following pattern
; bytes 0 through 7 will be zerod out
; bytes 8 through 15 holds a qword value of the number of contiguous blocks unused
; bytes (starting address + ((starting block + numberof blocks) * blocksize) -1 and -2 will hold a word value of the number of contiguous blocks left


;when a memory request comes in 16 is added to the request right off the bat, this number is divided by the blocksize and if rdx is not zero the result is incremented one
;the current set is loaded into a register, and the current block is loaded into another register
;a check is made to make sure the first qword is zero and if it is the second qword is compared to the amount requested
;if the qword is equal to or larger then the first 6 bytes are set according to the pattern above
;8 is added to the beggining of the first block and this will be what is returned
;next the requested number of blocks is subtracted from the value held in the second qword above
;the number of contiguous blocks is checked to make sure it is not zero
;if it is not then the first qword after the blocks being returned is zerod
;and the second qword has the number of contiguous blocks left moved into it
;bytes (starting address + ((starting block + numberof blocks) * blocksize) -1 and -2 will hold a word value of the number of contiguous blocks left



;for the freeing of the memory
;8 will be subtracted from the starting address
;the word at bytes 0 and 1 will be moved into a register, memptr will be moved into another register, and the base address will be pulled from them
;the word at bytes 2 and 3 will be moved into a register and the word at bytes 4 and 5 will be moved into another register
;the first qword will be zerod, and the second qword will recieve the qword representation of the word value pulled from bytes 4 and 5
;the last byte address will be calculated and a word stored at bytes -1 and -2 will hold the word value pulled fom bytes 4 and 5
;next, the starting addrress + (number of blocks * blocksize) -2 will be moved into a word value
;this will be subtracted from the base address of the blocks being freed
;if the first qword at this address is zero then the word value at bytes 4 and 5 will be added to the qword value in the second spot
;that same value will be stored at the end of the blocks being freed as a word value
;next the first qword after the end of the blocks being freed is checked for zero
;if it is then the second qword here is added to the second qword at the beggining of the set (this will either be the second qword of the blocks being freed or the second qword from the new beggining)
;this will also be stored as a word value at the end of this upper set -1 -2

Just reading over what I wrote sounds complicated, but it works and it is faster than malloc, but Id like to make it better.

Thanks in advance.


Code:
GetMem64	PROC	tsize:qword

		LOCAL _size:qword
		LOCAL _othersize:qword
		LOCAL _firstpass:qword
		LOCAL _startingset:qword
		LOCAL _rollset

		push	rbx
		push	rdx
		push	rdi
		push	rsi
		push	r8
		push	r9
		push	r10
		push	r11
		push	r12
		push	r13
		push	r14
		push	r15
		cmp		Initialized, 0
		jne		AllreadyInitialized
		call	Initialize
		mov		Initialized, 1

AllreadyInitialized:

		cmp		_usemalloc, 0				;use malloc or my algorithm
		je		NotMalloc
		sub		rsp, 28h
		clc
		call	malloc
		add		rsp, 28h
		pop		r15
		pop		r14
		pop		r13
		pop		r12
		pop		r11
		pop		r10
		pop		r9
		pop		r8
		pop		rsi
		pop		rdi
		pop		rdx
		pop		rbx

		RET

NotMalloc:

		mov		_firstpass, 1
		mov		_rollset, 1
		add		rcx, 16
		mov		_size, rcx		
		mov		rax, rcx					; mov amount req to rax
		xor		rdx, rdx					;prep for div
		div		_blocksize					;div by block size
		cmp		rdx, 0						;if there is a remainder
		je		NoInc
		inc		rax							;inc rax by one

NoInc:

		push	rax

TryAgain:

		mov		rax, 0					;have to prep the compare and exchange each loop
		mov		rdx, 0
		mov		rbx, 1
		mov		rcx, 0
		lock	cmpxchg8b	_gm
		jnz		TryAgain
		inc		_memcnt					;get rid of the following instructions in the working module
		cmp		_memcnt, 1
		je		DontWorry
		mov		rax, rax

DontWorry:

		mov		rdi, _cset
		mov		_startingset, rdi
		mov		r10, _memptr
		mov		r10, qword ptr[r10+rdi*8]
		mov		rdi, _cblock
		xor		rsi, rsi
		pop		rbx

CheckLoop:

		cmp		rdi, _setsize
		jnb		Next
		mov		rax, rdi
		mul		_blocksize
		add		rax, r10
		cmp		qword ptr[rax], 0
		ja		CheckNext
		inc		rsi
		cmp		rsi, rbx
		jnb		FoundIt 
		inc		rdi
		jmp		CheckLoop

CheckNext:

		xor		rsi, rsi
		add		rax, 4
		xor		rdx, rdx
		mov		dx,word ptr[rax]
		add		rdi, rdx
		inc		rdi
		mov		_cblock, rdi
		jmp		CheckLoop

FoundIt:

		mov		rcx, rax
		mov		rdx, _cset
		mov		word ptr[rax], dx			;save the memory set being used
		or		word ptr[rax], 2147483648   ; set the high bit to mark it as assigned from here
		inc		rax
		inc		rax
		mov		rdx, _cblock
		mov		word ptr[rax], dx			;save the starting block being used
		inc		rax
		inc		rax
		mov		word ptr[rax], bx			;save the number of blocks being used
		mov		rax, rcx
		add		rax, 16
		inc		rdi
		mov		_cblock, rdi
		mov		_gm, 0
		pop		r15
		pop		r14
		pop		r13
		pop		r12
		pop		r11
		pop		r10
		pop		r9
		pop		r8
		pop		rsi
		pop		rdi
		pop		rdx
		pop		rbx

		RET

Next:

		cmp		_firstpass, 1
		jne		NextSet
		mov		_firstpass, 0
		xor		rdi, rdi
		xor		rsi, rsi
		mov		_cblock, 0
		jmp		CheckLoop

NextSet:

		mov		r10, _memptr
		mov		rax, _cset
		cmp		qword ptr[r10], rax
		je		AddAnotherSet
		cmp		rax, _startingset
		jne		NotYet
		cmp		_rollset, 1
		jne		GetMore

NotYet:

		inc		rax
		mov		_cset, rax
		xor		rsi, rsi
		xor		rdi, rdi
		mov		_cblock, rdi
		mov		r10, _memptr
		mov		r10, qword ptr[r10+rax*8]
		jmp		CheckLoop					;continue checking

AddAnotherSet:

		cmp		_rollset, 1
		jne		GetMore
		mov		rax, _startingset
		cmp		rax, _cset
		je		GetMore
		mov		_cset, 0
		mov		_rollset, 0
		xor		rsi, rsi
		xor		rdi, rdi
		mov		_cblock, rdi
		mov		r10, _memptr
		mov		r10, qword ptr[r10+rdi*8]
		jmp		CheckLoop

GetMore:

		push	rbx							;save amt req
		mov		rcx, _memreq				;mov 65536 dwords into rcx
		sub		rsp, 28h					;adjust stack
		clc
		call	malloc						;call malloc
		add		rsp, 28h					;readjust stack
		mov		r10, _memptr				;move memslot ptr into r10
		mov		rbx, qword ptr[r10]			;move current number of memslots into rbx
		inc		rbx							;inc rbx
		mov		_cset, rbx
		mov		_cblock, 0
		mov		qword ptr[r10], rbx			;move new num memslots into memslot
		mov		qword ptr[r10+rbx*8], rax
		xor		rdi, rdi

ClearEmLoop:

		mov		qword ptr[rax+rdi*8], 0
		add		rdi, 8
		cmp		rdi, _setsize
		jna		ClearEmLoop
		xor		rdi, rdi
		xor		rsi, rsi
		mov		r10, rax
		pop		rbx
		jmp		CheckLoop

	GetMem64 ENDP