-
November 16th, 2012, 03:31 AM
#1
[RESOLVED] A little advice on a memory management routine would be appreciated
Hello all,
I have a program that handles large numbers held in arrays of qwords. All of the math stuff is implemented in assembly language. All of the routines request memory for at least the return value, and some of them request memory for temporary values as well. I found that my calls to malloc were slowing the alorithms down quite a bit, so I attempted to write my own memory management routine.
The requirements for my allocation method were: 1. I dont know in advance how much memory will be requested. 2. I do not know for how long an allocation will be used before it is returned to the pool (the return values can live indefinately). I know this leaves me with the potential for fragmentation and I am allready thinking of how to handle that though it has not been an issue as of yet.
My problem is that in a single threaded program it works great, and gave me a better boost in speed then I thought it would, however, in a multi-threaded app it really bogs down. I was wondering if yall might be able to guide me as to how to improve it, or what I should be doing different.
The first code listing gives just enough to show where the bottleneck is, the second gives the entire routine with an explanation of how I pictured it working. This is my first attempt at memory management so please dont laugh to hard. Any feedback would be appreciated.
I know my bottleneck is at the lock cmp xchg, but as far as I can see only one thread at a time is allowed in that section of code, else I end up with multiple threads assigning themselves the same blocks of memory (as I found out the hard way).
Code:
LOCAL _size:qword
LOCAL _othersize:qword
LOCAL _firstpass:qword
LOCAL _startingset:qword
LOCAL _rollset
push rbx
push rdx
push rdi
push rsi
push r8
push r9
push r10
push r11
push r12
push r13
push r14
push r15
cmp Initialized, 0
jne AllreadyInitialized
call Initialize
mov Initialized, 1
AllreadyInitialized:
cmp _usemalloc, 0 ;use malloc or my algorithm
je NotMalloc
sub rsp, 28h
clc
call malloc
add rsp, 28h
pop r15
pop r14
pop r13
pop r12
pop r11
pop r10
pop r9
pop r8
pop rsi
pop rdi
pop rdx
pop rbx
RET
NotMalloc:
mov _firstpass, 1
mov _rollset, 1
add rcx, 16
mov _size, rcx
mov rax, rcx ; mov amount req to rax
xor rdx, rdx ;prep for div
div _blocksize ;div by block size
cmp rdx, 0 ;if there is a remainder
je NoInc
inc rax ;inc rax by one
NoInc:
push rax
TryAgain:
mov rax, 0 ;have to prep the compare and exchange each loop
mov rdx, 0
mov rbx, 1
mov rcx, 0
lock cmpxchg8b _gm
jnz TryAgain
inc _memcnt ;get rid of the following instructions in the working module
cmp _memcnt, 1
je DontWorry
mov rax, rax
DontWorry:
before any algorithm runs, there is an initialization routine that uses malloc to request memory (4194304 bytes, thats 65536 sets of 64 byte blocks) and initializes the memory. The address of the first byte is then stored in an array of pointers, I used an array of pointers so I could grow my pool on the fly.
So, I have a variable that I can change in the data section so that is skips my routine and just uses malloc, this is the very first check my routine does.
After that, it takes the amount of memory requested and adds 16 to it (this is for management info and is used to make assigning and returning the memory faster the layout will be explained below). This new number is then divided by 64 (that is the size of blocks that I chose to allocate) and if there is a remainder this is inc by one. This tells me how many blocks to look for.
I have 2 other variables that are then loaded the first is an index into the array of pointers, this is the current set I am assigning from. The second variable points to the next block after the last assignment I made. This block that it points to is then checked for zero (zero indicates it is unused), if it is zero it then checks how many contiguous blocks it has free, if there are enough to handle the request it marks the first few bytes, and the last 2 bytes, it updates the next free block with how many free blocks are left, and the address is modified to point to the first byte after the management bytes and is returned to the calling algorithm.
If the first byte is not zero, it checks how many blocks in this set are in use, adds this to the base address, and checks the block following this set to see if it is free. If there are not enough free blocks to handle the request, it starts at block zero and checks for any set of free blocks that are big enough for the request. If none are found, it inc the index into the array of pointers and checks the next set. If there are no more sets, then it makes a call to malloc to get another set and marks the blocks.
The management blocks are marked as follows:
;A used block will have the following pattern:
; high bit of byte zero will be set if it is asigned from the experimental unit, will be cleared otherwise
; bytes 0 and 1 will hold a word value of the memory set being used (which index into the memptr array)
; bytes 2 and 3 will hold a word value of the starting block
; bytes 4 and 5 will hold a word value of the number of blocks being used
; bytes (starting address +((starting block + number of blocks used) * blocksize)) -1 and -2 will hold a word value of the number of blocks being used
;an unused block will have the following pattern
; bytes 0 through 7 will be zerod out
; bytes 8 through 15 holds a qword value of the number of contiguous blocks unused
; bytes (starting address + ((starting block + numberof blocks) * blocksize) -1 and -2 will hold a word value of the number of contiguous blocks left
;when a memory request comes in 16 is added to the request right off the bat, this number is divided by the blocksize and if rdx is not zero the result is incremented one
;the current set is loaded into a register, and the current block is loaded into another register
;a check is made to make sure the first qword is zero and if it is the second qword is compared to the amount requested
;if the qword is equal to or larger then the first 6 bytes are set according to the pattern above
;8 is added to the beggining of the first block and this will be what is returned
;next the requested number of blocks is subtracted from the value held in the second qword above
;the number of contiguous blocks is checked to make sure it is not zero
;if it is not then the first qword after the blocks being returned is zerod
;and the second qword has the number of contiguous blocks left moved into it
;bytes (starting address + ((starting block + numberof blocks) * blocksize) -1 and -2 will hold a word value of the number of contiguous blocks left
;for the freeing of the memory
;8 will be subtracted from the starting address
;the word at bytes 0 and 1 will be moved into a register, memptr will be moved into another register, and the base address will be pulled from them
;the word at bytes 2 and 3 will be moved into a register and the word at bytes 4 and 5 will be moved into another register
;the first qword will be zerod, and the second qword will recieve the qword representation of the word value pulled from bytes 4 and 5
;the last byte address will be calculated and a word stored at bytes -1 and -2 will hold the word value pulled fom bytes 4 and 5
;next, the starting addrress + (number of blocks * blocksize) -2 will be moved into a word value
;this will be subtracted from the base address of the blocks being freed
;if the first qword at this address is zero then the word value at bytes 4 and 5 will be added to the qword value in the second spot
;that same value will be stored at the end of the blocks being freed as a word value
;next the first qword after the end of the blocks being freed is checked for zero
;if it is then the second qword here is added to the second qword at the beggining of the set (this will either be the second qword of the blocks being freed or the second qword from the new beggining)
;this will also be stored as a word value at the end of this upper set -1 -2
Just reading over what I wrote sounds complicated, but it works and it is faster than malloc, but Id like to make it better.
Thanks in advance.
Code:
GetMem64 PROC tsize:qword
LOCAL _size:qword
LOCAL _othersize:qword
LOCAL _firstpass:qword
LOCAL _startingset:qword
LOCAL _rollset
push rbx
push rdx
push rdi
push rsi
push r8
push r9
push r10
push r11
push r12
push r13
push r14
push r15
cmp Initialized, 0
jne AllreadyInitialized
call Initialize
mov Initialized, 1
AllreadyInitialized:
cmp _usemalloc, 0 ;use malloc or my algorithm
je NotMalloc
sub rsp, 28h
clc
call malloc
add rsp, 28h
pop r15
pop r14
pop r13
pop r12
pop r11
pop r10
pop r9
pop r8
pop rsi
pop rdi
pop rdx
pop rbx
RET
NotMalloc:
mov _firstpass, 1
mov _rollset, 1
add rcx, 16
mov _size, rcx
mov rax, rcx ; mov amount req to rax
xor rdx, rdx ;prep for div
div _blocksize ;div by block size
cmp rdx, 0 ;if there is a remainder
je NoInc
inc rax ;inc rax by one
NoInc:
push rax
TryAgain:
mov rax, 0 ;have to prep the compare and exchange each loop
mov rdx, 0
mov rbx, 1
mov rcx, 0
lock cmpxchg8b _gm
jnz TryAgain
inc _memcnt ;get rid of the following instructions in the working module
cmp _memcnt, 1
je DontWorry
mov rax, rax
DontWorry:
mov rdi, _cset
mov _startingset, rdi
mov r10, _memptr
mov r10, qword ptr[r10+rdi*8]
mov rdi, _cblock
xor rsi, rsi
pop rbx
CheckLoop:
cmp rdi, _setsize
jnb Next
mov rax, rdi
mul _blocksize
add rax, r10
cmp qword ptr[rax], 0
ja CheckNext
inc rsi
cmp rsi, rbx
jnb FoundIt
inc rdi
jmp CheckLoop
CheckNext:
xor rsi, rsi
add rax, 4
xor rdx, rdx
mov dx,word ptr[rax]
add rdi, rdx
inc rdi
mov _cblock, rdi
jmp CheckLoop
FoundIt:
mov rcx, rax
mov rdx, _cset
mov word ptr[rax], dx ;save the memory set being used
or word ptr[rax], 2147483648 ; set the high bit to mark it as assigned from here
inc rax
inc rax
mov rdx, _cblock
mov word ptr[rax], dx ;save the starting block being used
inc rax
inc rax
mov word ptr[rax], bx ;save the number of blocks being used
mov rax, rcx
add rax, 16
inc rdi
mov _cblock, rdi
mov _gm, 0
pop r15
pop r14
pop r13
pop r12
pop r11
pop r10
pop r9
pop r8
pop rsi
pop rdi
pop rdx
pop rbx
RET
Next:
cmp _firstpass, 1
jne NextSet
mov _firstpass, 0
xor rdi, rdi
xor rsi, rsi
mov _cblock, 0
jmp CheckLoop
NextSet:
mov r10, _memptr
mov rax, _cset
cmp qword ptr[r10], rax
je AddAnotherSet
cmp rax, _startingset
jne NotYet
cmp _rollset, 1
jne GetMore
NotYet:
inc rax
mov _cset, rax
xor rsi, rsi
xor rdi, rdi
mov _cblock, rdi
mov r10, _memptr
mov r10, qword ptr[r10+rax*8]
jmp CheckLoop ;continue checking
AddAnotherSet:
cmp _rollset, 1
jne GetMore
mov rax, _startingset
cmp rax, _cset
je GetMore
mov _cset, 0
mov _rollset, 0
xor rsi, rsi
xor rdi, rdi
mov _cblock, rdi
mov r10, _memptr
mov r10, qword ptr[r10+rdi*8]
jmp CheckLoop
GetMore:
push rbx ;save amt req
mov rcx, _memreq ;mov 65536 dwords into rcx
sub rsp, 28h ;adjust stack
clc
call malloc ;call malloc
add rsp, 28h ;readjust stack
mov r10, _memptr ;move memslot ptr into r10
mov rbx, qword ptr[r10] ;move current number of memslots into rbx
inc rbx ;inc rbx
mov _cset, rbx
mov _cblock, 0
mov qword ptr[r10], rbx ;move new num memslots into memslot
mov qword ptr[r10+rbx*8], rax
xor rdi, rdi
ClearEmLoop:
mov qword ptr[rax+rdi*8], 0
add rdi, 8
cmp rdi, _setsize
jna ClearEmLoop
xor rdi, rdi
xor rsi, rsi
mov r10, rax
pop rbx
jmp CheckLoop
GetMem64 ENDP
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|