  1. #1
    Join Date
    Aug 2009
    Location
    Finally back in Alaska
    Posts
    141

    optimizing an algorithm or two

    Hello all,

    Same project I've been posting about. After the replies I got to my last post, I've been spending some time redoing parts of my library. When I got to the assembly code and read about optimizing it, I implemented some of the ideas and doubled the speed of the algorithms. Comparing my code to Microsoft's BigInteger library, I am running neck and neck in the division and modular reduction algorithms, about 5 times faster in my multiplication algorithm, and ten to fifteen times faster in my exponentiation and modular exponentiation algorithms. However, in my addition algorithm I am running between 10 and 20 times slower than Microsoft, and about 5 to 10 times slower in my subtraction algorithm, and for the life of me I am not seeing how they do it.

    My tests are run from managed code (I am running Microsoft's BigInteger library from the System.Numerics namespace). I generated two arrays, one of Microsoft's BigIntegers and the other of my BI library's values, with a few thousand elements in each. The arrays are built up before running the tests; once in the tests, the program simply feeds random entries from the arrays and measures the time it takes to run a predefined number of iterations of each algorithm.
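    In native terms the harness boils down to something like this (a rough sketch only; the timing loop shape and std::chrono are stand-ins for what the managed tests actually do, and the AAdd64 prototype is my declaration, not from any header):

    Code:
    	#include <chrono>
    	#include <cstdlib>
    	#include <vector>
    
    	// Hypothetical declaration of the routine under test.
    	extern "C" unsigned __int64* AAdd64(unsigned __int64* v1, unsigned __int64* v2);
    
    	// Times `iterations` additions over randomly chosen pre-built operands.
    	double TimeAdd(std::vector<unsigned __int64*>& vals, int iterations)
    	{
    		auto start = std::chrono::high_resolution_clock::now();
    		for (int i = 0; i < iterations; ++i)
    		{
    			unsigned __int64* r = AAdd64(vals[rand() % vals.size()],
    			                             vals[rand() % vals.size()]);
    			// free r with whatever matches the MEMGET allocator
    		}
    		auto stop = std::chrono::high_resolution_clock::now();
    		return std::chrono::duration<double>(stop - start).count();
    	}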

    The addition algorithm:

    Code:
    	AAdd64 PROC aval1:PTR qword, aval2:PTR qword
    	
    		;The add alg is pretty standard: find the size of the larger array, then create and copy the larger array one element larger for the result,
    		;then add the smaller val to it cell by cell till you reach the final cell in the smaller array.
    		;The alg will then adc through the result array until the carry bit clears.
    		;The final part checks to make sure none of the high order cells are zero; if they are, it decrements the array size till it hits a non-zero cell
    
    		;save the registers we change
    
    		push	rdi
    		push	rsi
    		push	r8
    		push	r11
    		push	r12
    
    		ENTERF								;this sets up the stack for us
    
    		mov		r11, rcx					;move val1 into r11
    		mov		r12, rdx					;move val2 into r12
    		mov		rcx, qword ptr[r11]			;move the element count of val1 into rcx
    		cmp		rcx, qword ptr[r12]			;compare it to the element count of val2
    		ja		V1Big						;if val1 has more elements than val2 jump to V1Big
    		xchg	r11, r12					;otherwise swap val1 and val2
    		mov		rcx, qword ptr[r11]			;move the element count of val1 into rcx
    
    V1Big:
    
    		add		rcx, 5						;add some padding for the return value
    		shl		rcx, 3						;multiply it by 8
    		MEMGET								;call our getmem routine
    		cmp		rax, 0						;see if there was an error getting our memory
    		je		MallocError					;if there was jump to mallocerror
    		mov		r8, rax						;move the pointer into r8
    		mov		rdi, r8						;and move it into rdi
    		mov		rsi, r11					;move the pointer to val1 into rsi
    		mov		rcx, qword ptr[rsi]			;move the element count into rcx
    		inc		rcx							;increment the element count so that it also copies the element count
    		mov		qword ptr[r8+rcx*8], 0		;zero out the byte above our most significant byte in r8
    		pushf								;save the flags register
    		cld									;clear the direction flag so it increments
    rep		movsq								;copy val1 into our return value
    		popf								;restore the flags register
    		mov		rdi, 1						;point to the first element in the arrays
    		mov		rcx, qword ptr[r12]			;move the element count of val2 into rcx
    		clc									;clear the carry flag
    		
    AddLoop:
    
    		mov		rax, qword ptr[r12+rdi*8]	;move the element pointed to by rdi into rax
    		adc		qword ptr[r8+rdi*8], rax	;add with carry rax to the element pointed to by rdi in the return val
    		inc		rdi							;increment rdi
    		loop	AddLoop						;loop until we reach the last element in val2
    		jnc		GoOut						;if there wasn't a carry on the last addition then we are done
    		dec		rdi							;otherwise decrement rdi
    		cmp		rdi, qword ptr[r8]			;compare it to our element count in the return value
    		ja		GoOut						;if it is above then we are done
    		inc		rdi							;restore rdi to the original value
    		mov		rcx, qword ptr[r8]			;move the element count in our return value into rcx
    		sub		rcx, qword ptr[r12]			;subtract the element count in val2 (this gives us how many times we may have to loop)
    		cmovz	rcx, _one					;if the subtraction result was zero then move one into rcx (from the qword variable _one; cmov takes no immediates) since we want this to run at least once since there was a carry (if rcx were zero, the loop instruction would wrap it to 0xffffffffffffffff and run on)
    		stc									;set the carry flag since we couldn't have gotten here without a carry from above
    
    AddLoop2:
    
    		jnc		GoOut						;if there is no carry then go out
    		adc		qword ptr[r8+rdi*8], 0		;all we are dealing with are the bytes in the return val that are above the most significant byte in val2 so we add with carry 0
    		inc		rdi							;increment rdi to point to the next element
    		loop	AddLoop2					;loop until we reach the last element
    
    GoOut:
    
    		mov		rdi, qword ptr[r8]			;move the element count into rdi
    		inc		rdi							;increment it by one to see if we changed the byte above the current MSB to one
    		cmp		qword ptr[r8+rdi*8], 0
    		je		Keep						;if it is zero then keep the element count we have
    		inc		qword ptr[r8]				;otherwise increment the element count by one
    
    Keep:
    
    		mov		rcx, qword ptr[r8]			;move the element count in the return val into rcx
    
    CheckLoop:
    
    		mov		qword ptr[r8], rcx			;move the current count into the return val
    		cmp		qword ptr[r8+rcx*8], 0		;compare the current element to 0
    		ja		CheckOut					;if it's not zero then jump out
    		loop	CheckLoop					;otherwise loop until rcx=0
    
    CheckOut:
    
    		mov		rax, r8						;move our return value into rax
    
    		EXITF								;restore the stack
    
    		;restore the registers we used
    
    		pop		r12
    		pop		r11
    		pop		r8
    		pop		rsi
    		pop		rdi
    
    		RET
    
    MallocError:
    
    		;if there was an error requesting memory try to get the error code, store it in the return value and send it back to managed code
    
    		mov		rcx, 32
    		MEMGET					;call our getmem routine
    		cmp		rax, 0
    		je		ErrorOut
    		mov		r8, rax
    		call	GetLastError
    		mov		qword ptr[r8], 0
    		mov		qword ptr[r8+8], 0
    		mov		qword ptr[r8+16], rax
    
    ErrorOut:
    
    		mov		rax, rax
    		jmp		GoOut
    
    	AAdd64 ENDP
    and the subtraction algorithm:

    Code:
    	ASub64 PROC sval1:PTR qword, sval2:PTR qword
    	
    		;This alg is a standard subtraction alg
    		;it creates an answer array the size of val1
    		;then subtracts the arrays cell by cell till it reaches the final cell of the smaller array
    		;at which time it continues doing subtractions with borrow on the larger till all of the cells are transferred
    		;the final part does the zero check
    
    		;save the registers we use
    
    		push	rdi
    		push	rsi
    		push	r8
    		push	r11
    		push	r12
    
    		ENTERF								;This macro sets up the stack
    
    		mov		r11, rcx					;move val1 into r11
    		mov		r12, rdx					;move val2 into r12
    
    ItsAllGood:
    
    		mov		rcx, qword ptr[r11]			;move the element count of val1 into rcx
    		add		rcx, 10						;give the return value some padding
    		shl		rcx,3						;multiply the result from above by 8
    		MEMGET								;call our getmem routine
    		mov		r8, rax						;move the returned pointer into r8
    		cmp		r8, 0						;check if r8 is zero
    		je		MallocError					;if it is then the memory manager errored out
    		mov		rsi, r11					;move r11 into rsi
    		mov		rdi, r8						;move r8 into rdi
    		mov		rcx, qword ptr[r11]			;get the element count of val1
    		inc		rcx							;increment it by one so it also copies the element count
    		mov		qword ptr[r8+rcx*8], 0		;move zero into the byte above the last expected byte to be returned (this is done so we get the correct count at the end of the alg)
    		pushf								;save the flags register
    		cld									;clear the direction flag so it increments
    rep		movsq								;copy val1 into our return val
    		popf								;restore the flags register
    		inc		qword ptr[r8]				;increment the element count in r8 (so it points to 0)
    		mov		rdi, 1						;set rdi to point to the first element
    		mov		rcx, qword ptr[r12]			;set rcx to the element count in val2
    		clc									;clear the carry flag
    
    SubLoop:
    
    		mov		rax, qword ptr[r12+rdi*8]	;move the element pointed to by rdi into rax
    		sbb		qword ptr[r8+rdi*8], rax	;subtract with borrow rax from the element pointed to by rdi
    		inc		rdi							;increment rdi
    		loop	SubLoop						;loop until we reach the last element in val2
    		jnc		CheckForZ					;if the carry flag is clear we are done
    		cmp		rdi, qword ptr[r8]			;if rdi equals the element count in r8 or is bigger then we are done
    		jnb		CheckForZ
    		mov		rcx, qword ptr[r8]			;otherwise move r8's element count into rcx
    		sub		rcx, qword ptr[r12]			;and subtract the element count in val2 from it (this gives us how many elements in val1 are left to deal with)
    		stc									;set the carry flag since we couldn't have gotten here unless there was a borrow from the last subtraction above
    
    CarryLoop:
    
    		jnc		CheckForZ					;once the carry flag is clear we are done
    		sbb		qword ptr[r8+rdi*8], 0		;at this point we are only dealing with the elements in val1 and the borrow, so we subtract with borrow 0
    		inc		rdi							;increment rdi to point to the next element
    		loop	CarryLoop					;loop until we reach the last element in val1
    
    CheckForZ:
    
    		mov		rcx, qword ptr[r8]			;move the element count in r8 into rcx
    
    CheckLoop:
    
    		mov		qword ptr[r8], rcx			;move the element count into r8
    		cmp		qword ptr[r8+rcx*8], 0		;if the current element is not zero we are done
    		ja		GoOut
    		loop	CheckLoop					;loop until rcx=0
    		jmp		GoOut						;go out
    
    PrepSwitchEm:
    
    		popf
    		xchg	r11, r12
    		jmp		ItsAllGood
    
    PrepReturnZero:
    
    		popf
    
    ReturnZero:
    
    		mov		rcx, 32
    		MEMGET					;call our getmem routine
    		cmp		rax, 0
    		je		MallocError
    		mov		qword ptr[rax], 1
    		mov		qword ptr[rax+8], 0
    		mov		r8, rax
    		jmp		GoOut
    
    MallocError:
    
    		;if there was an error requesting memory try to get the error code, store it in the return value and send it back to managed code
    
    		mov		rcx, 32
    		MEMGET					;call our getmem routine
    		cmp		rax, 0
    		je		ErrorOut
    		mov		r8, rax
    		call	GetLastError
    		mov		qword ptr[r8], 0
    		mov		qword ptr[r8+8], 0
    		mov		qword ptr[r8+16], rax
    
    ErrorOut:
    
    		mov		rax, rax
    
    GoOut:
    
    		mov		rax, r8				;move our return value into rax
    
    		EXITF						;restore the stack
    
    		pop		r12
    		pop		r11
    		pop		r8
    		pop		rsi
    		pop		rdi
    
    		RET
    	
    	ASub64 ENDP
    I managed to change the algorithms around to take advantage of the adc and sbb instructions, which got rid of a lot of ugly code, but I can't for the life of me figure out how to squeeze any more out of these algorithms.

    Thanks

  2. #2
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: optimizing an algorithm or two

    Comparisons between this and MS BigInt might change a lot depending on the average size of the bigints. Some libs are optimized assuming "small" bigints, while others don't start to shine unless you get into massively big numbers.
    You can compensate by having functions for both, but there's a small amount of overhead in testing which path will be the best, of course.

    If you assume most bigints will be small, then any extra "overhead" will end up being noticeable and a simpler routine might have been better. This is a typical drawback of unrolling, which may make things faster for "many times through the loop" by adding some overhead up front, but will work against you when you only need to loop once or twice.

    That's one of the primary lessons of optimising: "you only have the best code for a given set of data or a given set of assumptions".

    Also, when dealing with assembly code, "fewer lines of code" doesn't equate to "faster code".
    It's all about getting your code to make best use of the execution pipeline, and that makes your code dependent on the type and brand of CPU. So "your best code for a given set of data" won't be the "best on another type of CPU". Again, you can compensate by having code paths for various CPUs, but that'll add some overhead somewhere as well.


    The first simple solutions:
    - Align your critical loops. Just adding an "align 8" or "align 16" before the "AddLoop:" could be a noticeable improvement.
    - adc mem,reg is a 3-step instruction: it reads memory, adds with carry, and then writes memory. This will "block" the CPU for all three steps, while it could do better when split up into 3 times a 1-step instruction (at the cost of needing an extra register).
    - When split up into simpler instructions: reorganising to remove dependencies and/or maximise the execution pipelines is the next step; this is where unrolling will really shine if you can use separate registers for everything.
    - Your "falling out" of the AddLoop is not optimal at all. Consider for example that the AddLoop2 JNC jumps out if no carry is set, and the ADC 0 following it does nothing if it isn't. This "screams" out that it can be done better.
    I can't tell by the code, but I'm assuming the target buffer (r8) is at least 1 qword larger than the possible last carry (which it should be, or you have an error when carrying into the last qword when that qword is 0xFFFFFFFFFFFFFFFF).
    Continuing the carry from before:
    Code:
           ;stc   ; no longer needed, we're doing a regular ADD 1 as long as we know carry is still set
    AddLoop2:
           add  qword ptr [r8+rdi*8], 1
           inc  rdi
           jc   AddLoop2
    The loop instruction is removed; the carry now continues the loop. Again, the add mem,1 is a 3-step instruction that could be simplified into 3x 1-step instructions.

    Applying that "split up into 1 step instructions" to the AddLoop: you could write it as follows.
    Code:
                lea r12,[r12+rdi*8] ; add rdi*8 into r12
                lea r8,[r8+rdi*8] ; add rdi*8 into r8
                ; rdi is now free
    Align 8   ; 16 might work better, could be worse on smaller numbers
    AddLoop:
                mov rdi, qword ptr [r8]; reusing rdi
                mov rax, qword ptr [r12]
                lea r12,[r12+8]
                adc rax,rdi
                mov qword ptr [r8],rax
                lea r8,[r8+8]
                loop AddLoop
    Split up into simpler 1-step instructions: better CPU pipeline usage, no more need for rdi inside the loop, and the pipeline stalls removed.
    The above does mean you'll need to account for the changes to r12 and r8 in the follow-up code, of course.

    You also seem to be doing: allocate buffer, copy the biggest bigint into the buffer, then add-with-carry the other bigint into the new buffer. The better approach would be to read the 2 source bigints one qword at a time and store the added result qword into the buffer. Basically this removes the rep movsq and makes the AddLoop: do that work (again, needs an extra register). The loop will appear "bigger", but that single "rep movsq" is a hidden timebomb, so to speak. Similarly, the AddLoop2 will then need to continue the carry-copy (and extend into a flat copy for the remaining part). This is a change that goes beyond merely changing a few instructions, of course: added complexity for more performance.
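    In C-ish terms the fused loop looks something like this (a sketch using MSVC's _addcarry_u64 intrinsic from <intrin.h> in place of hand-written adc; the count-in-slot-0 layout follows your convention, but the function name and preconditions are made up):

    Code:
    	#include <intrin.h>
    
    	// Fused add: reads both sources one qword at a time and writes the sum
    	// into dst, instead of copying the larger value first and adding into it.
    	// Slot 0 holds the element count. Assumes big has at least as many
    	// elements as small, and dst has room for count(big)+1 elements plus
    	// the count slot.
    	void FusedAdd(const unsigned __int64* big, const unsigned __int64* small,
    	              unsigned __int64* dst)
    	{
    		unsigned __int64 nBig = big[0], nSmall = small[0];
    		unsigned char carry = 0;
    		unsigned __int64 i = 1;
    		for (; i <= nSmall; ++i)			// add the overlapping part
    			carry = _addcarry_u64(carry, big[i], small[i], &dst[i]);
    		for (; i <= nBig; ++i)				// carry-copy, decaying into a flat copy
    			carry = _addcarry_u64(carry, big[i], 0, &dst[i]);
    		dst[nBig + 1] = carry;				// possible new most significant qword
    		dst[0] = carry ? nBig + 1 : nBig;	// final element count
    	}
    Once it has this shape, unrolling by 2 or 4 is just repeating the body, with the carry chained through the intrinsic's return value.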

    Optimizing the flow from adding the 2 bigints to continuing the carry into the result would be my next step.
    Unrolling the loops 2 or 4 times while maximising register usage would be the step after that.
    Last edited by OReubens; November 20th, 2013 at 11:49 AM.

  3. #3
    Join Date
    Aug 2009
    Location
    Finally back in Alaska
    Posts
    141

    Re: optimizing an algorithm or two

    Comparisons between this and MS BigInt might change a lot depending on the average size of the bigints. Some libs are optimized assuming "small" bigints, while others don't start to shine unless you get into massively big numbers.
    Yeah, that's why my comparison program uses three different sets of numbers: one set contains arrays with between 1 and 10 elements, the second set contains arrays with between 25 and 50 elements, and the third set contains arrays with between 100 and 200 elements. When my program generates the values it creates equal values in my BI class and Microsoft's BigInteger class.

    Align your critical loops. Just adding an "align 8" or "align 16" before the "AddLoop:" could be a noticeable improvement.
    I was just researching how to get a function to start at a specific alignment because I had read about that. I have noticed that sometimes I can make a change in one function but notice a marked improvement in a totally unrelated function, and though I am not sure, I suspect it may have been aligning just right after the change.

    adc mem,reg is a 3-step instruction: it reads memory, adds with carry, and then writes memory. This will "block" the CPU for all three steps, while it could do better when split up into 3 times a 1-step instruction (at the cost of needing an extra register).
    That's info I haven't come across yet. I've read that the loop instruction is more like a macro that gets expanded inside the processor (kind of), and I have written some code to try to investigate the differences between using loop and dec rcx / jnz target, but I haven't noticed anything major yet.

    When split up into simpler instructions: reorganising to remove dependencies and/or maximise the execution pipelines is the next step; this is where unrolling will really shine if you can use separate registers for everything.
    I have been thinking about ways to unroll some of the loops, and have copied a few algorithms and renamed them to try unrolling the loops, but they are buggy as yet so I am waiting to see. I've been thinking about the execution pipeline but have not even started on that optimization yet; I've been trying to get a better understanding of it first. I know it is processor specific, but I would think there are also some generalizations that can be applied to my code to cover a broader spectrum of processors.

    Consider for example that the AddLoop2 JNC jumps out if no carry is set, and the ADC 0 following it does nothing if it isn't. This "screams" out that it can be done better.
    I can't tell by the code, but I'm assuming the target buffer (r8) is at least 1 qword larger than the possible last carry (which it should be, or you have an error when carrying into the last qword when that qword is 0xFFFFFFFFFFFFFFFF).
    The adc 0 should never be hit unless the carry flag is set, but I do agree that it would be better to have the jc AddLoop2 at the bottom of the loop. In fact it used to be that way till I started messing around with it; before this last change it would always fall into the AddLoop2 loop, and that's why I moved the jnc to the top, for the cases where there was no carry.

    You also seem to be doing: allocate buffer, copy the biggest bigint into the buffer, then add-with-carry the other bigint into the new buffer. The better approach would be to read the 2 source bigints one qword at a time and store the added result qword into the buffer. Basically this removes the rep movsq and makes the AddLoop: do that work (again, needs an extra register). The loop will appear "bigger", but that single "rep movsq" is a hidden timebomb, so to speak. Similarly, the AddLoop2 will then need to continue the carry-copy (and extend into a flat copy for the remaining part). This is a change that goes beyond merely changing a few instructions, of course: added complexity for more performance.
    Again, the way you just mentioned is how it was running until not too long ago. I thought the rep movsq instruction would be quicker, but I don't believe it is. I am in the middle of profiling the algorithms right now so that I can compare the results after I make a change.

    Optimizing the flow from adding the 2 bigints to continuing the carry into the result would be my next step.
    Unrolling the loops 2 or 4 times while maximising register usage would be the step after that.
    As mentioned above, that happens to be what I am working on at the moment, because I was thinking that might be an improvement to these two algorithms. It's kind of sad that the two easiest algorithms to implement seem to be the hardest to get optimal.

    Thanks for the response, you've given me a few more things to think about.

  4. #4
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: optimizing an algorithm or two

    It's kind of sad that the two easiest algorithms to implement seem to be the hardest to get optimal.
    This is a strange remark, especially since both the multiplication and division will end up doing an add (subtract) loop somewhere in their pipelines.

    The thing to remember with optimisation is that "instruction counting" or "cycle counting" and tricks like loop unrolling and pipeline optimisation are the last steps you should take, not the first.
    The above tricks will typically only net you a few percent of extra performance; rarely do you get something like a 2x or more.

    "orders of magnitude" type changes don't come from pure "bit twiddling", but come from taking a leap in approach to make use of a instruction that you can't do via a high level language like C or C++ (such as adc, and such as mul 64x64 -> 128, and div 128 -> 64 remainder and 64 dividend).
    Or by making more strict assumptions. You already made one here by making all bigints be multiples of 8 bytes in size which makes qword access easier. If you want to get into loop unrolling, then making all bigints multiples of 16, 32 or 64 might ease up the unrolling significantly.
    Or more importantly by using better algorithms. There's only so much you can do to make a linear search run faster, but a binary search, even a non optimal one will beat it hands down except for maybe very very small tables.



    If Microsoft's BigInt takes 5x longer to do a multiplication, then I would assume that:

    - For some reason they didn't bother to optimize it at all,
    - or it is working with 32-bit math rather than 64 (in this particular case, double-sized multiplications and extra registers make a big difference, whereas an add-carry loop is much less CPU bound and gets its bottleneck from the memory accesses instead),
    - or... you have a bug you haven't yet detected.

  5. #5
    Join Date
    Aug 2009
    Location
    Finally back in Alaska
    Posts
    141

    Re: optimizing an algorithm or two

    or it is working with 32-bit math rather than 64 (in this particular case, double-sized multiplications and extra registers make a big difference, whereas an add-carry loop is much less CPU bound and gets its bottleneck from the memory accesses instead)
    - or... you have a bug you haven't yet detected.
    I want to say they are using a byte array to represent a two's complement number. I tried having my arrays represent a two's complement number, but, at the time, the code looked cleaner using the arrays to represent an absolute value and keeping track of the sign in another variable. I am sure if I were to compensate for that difference my multiplication's lead would disappear lol

  6. #6
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: optimizing an algorithm or two

    They may appear to offer byte-sized access, but it would surprise me somewhat if that's how things actually work internally. A bigint multiplication based on byte-sized access would be MUCH, much slower than 5x vs a decent 64-bit implementation.


    Some bigint libs work on positive numbers with a sign indicator, some use two's complement (I've even seen one using one's complement).
    There are advantages and disadvantages to either method.

    A multiplication or division in 2's complement shouldn't be very significantly slower or more complex than an absolute value implementation.

    From a pure technical POV, 2's complement will offer more means to optimise the code at the low level, mainly less branching (at the cost of implementation complexity).

    Then again, if you're only interested in positive numbers (i.e. a BigUInt) then this can be done faster than having to also account for negative numbers. And negatives can be tacked on as an extension to a BigUInt (in effect, a BigInt is a derived class of a BigUInt plus a sign). Extending further, a floating point is then a BigInt plus an exponent, and a fixed point is a BigInt plus a scale (typically decimal, but binary is possible too); see the sketch below.
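    As a rough type sketch of that layering (names hypothetical, members elided):

    Code:
    	// Hypothetical layering of the number types described above.
    	struct BigUInt  { /* magnitude only: qword array, the fast base case */ };
    	struct BigInt   { BigUInt magnitude; int sign; };          // BigUInt plus a sign
    	struct BigFloat { BigInt significand; __int64 exponent; }; // BigInt plus a binary exponent
    	struct BigFixed { BigInt value; int scale; };              // BigInt plus a (decimal) scale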

    Although realistically, if you want a fixed point with decimal scale, you're probably better off using a BCD math package than a BigInt one.

  7. #7
    Join Date
    Aug 2009
    Location
    Finally back in Alaska
    Posts
    141

    Re: optimizing an algorithm or two

    BCD? Binary coded decimal? I implemented a BigDecimal using the BigInt with a decimal place variable attached. For addition and subtraction there is some prep code that has to run (in order to line up the decimal points), but multiplication, division, modular reduction, exponentiation and all the rest are almost as fast as the BigInt. My biggest problem with that class was figuring out how to minimize the rounding errors (which I did figure out finally, but not before I wrote a BigFraction class to eliminate the problem).
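    The prep code boils down to scaling the side with fewer decimal places up by a power of ten. A minimal sketch, with plain 64-bit values standing in for the bigint magnitudes (the function name is made up):

    Code:
    	// Lining up decimal points before add/subtract: scale the value with
    	// fewer fractional digits up by a power of ten until the scales match.
    	// Plain 64-bit values stand in for the bigint magnitudes here.
    	void AlignScales(unsigned __int64& a, int& aScale,
    	                 unsigned __int64& b, int& bScale)
    	{
    		while (aScale < bScale) { a *= 10; ++aScale; }
    		while (bScale < aScale) { b *= 10; ++bScale; }
    	}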

    I originally planned on the array being a two's complement representation of the number because that seemed like it would lead to the most efficient algorithms, but there was a lot of testing code that I had to implement to see if the final array needed a leading element with a value of zero so that the number being represented would be correct. I am sure there is a more elegant solution than what I came up with.

    I'm currently running tests on some of the various ideas you threw out there, while reading about how to optimize the C++ code at the same time. For some reason turning on optimizations in the compiler completely breaks the library.

  8. #8
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: optimizing an algorithm or two

    Quote Originally Posted by AKRichard View Post
    BCD? Binary coded decimal? I implemented a BigDecimal using the BigInt with a decimal place variable attached. For addition and subtraction there is some prep code that has to run (in order to line up the decimal points), but multiplication, division, modular reduction, exponentiation and all the rest are almost as fast as the BigInt. My biggest problem with that class was figuring out how to minimize the rounding errors (which I did figure out finally, but not before I wrote a BigFraction class to eliminate the problem).
    If you need "fixed point with decimal scale", then that most probably means you're dealing with a finance/accounting type package, and in that case BCD is the better way to go.
    BCD can do addition and subtraction "almost" as fast as bigint, and it can do multiplication and division "fast enough".
    In finance, multiplication and division aren't anywhere near as common as addition/subtraction, so that's not a real problem. The BIG advantage is:
    No database I'm aware of can store bigints in a logical format. They can store string values, and some can do BCD monetary values.
    Converting to/from decimal representation is fast and easy in BCD, whereas it is complex/slow in bigint.

    Also: no matter how many extra bits you add to a floating point or fixed point with binary scale, it cannot ever accurately work with decimal numbers. There's always the problem of rounding and approximation, and you ALWAYS have to be aware of this for every single operation you code with them. Even all the above technical reasons aside, this is the big reason to prefer BCD over BigInt approaches for finance/accounting.
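    The classic demonstration (a self-contained snippet, nothing to do with the library under discussion):

    Code:
    	#include <cstdio>
    
    	// 0.1 has no finite binary expansion, so repeated addition drifts.
    	int main()
    	{
    		double sum = 0.0;
    		for (int i = 0; i < 10; ++i)
    			sum += 0.1;
    		printf("%.17g\n", sum);	// prints 0.99999999999999989, not 1
    		return 0;
    	}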

    A bigint with decimal scale would work as well, but typically whatever speed benefit you have in the single operations is going to be negated (by far) by all the conversions to/from string representation. This is a big difference from math/3D/security/encryption/statistics/science type problems, which typically have a lot of operations and few (if any) conversions to/from decimal format at all.
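    The conversion cost is easy to see: every decimal digit of output needs a full divide-by-10 pass over the whole number. A sketch (_udiv128 is a newer-MSVC intrinsic wrapping the 128/64 DIV instruction; shown purely for illustration):

    Code:
    	#include <intrin.h>
    
    	// Produces one decimal digit: divide the whole n-qword value by 10
    	// in place (most significant word first), keep the remainder.
    	unsigned int DivMod10(unsigned __int64* v, unsigned __int64 n)
    	{
    		unsigned __int64 rem = 0;
    		for (unsigned __int64 i = n; i-- > 0; )
    			v[i] = _udiv128(rem, v[i], 10, &rem);	// (rem:v[i]) / 10
    		return (unsigned int)rem;	// the next least significant decimal digit
    	}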


    I'm currently running tests on some of the various ideas you threw out there, while reading about how to optimize the C++ code at the same time. For some reason turning on optimizations in the compiler completely breaks the library.
    1) Never profile code optimisations in debug builds.
    2) If your code works in a debug build and doesn't in a release build, or vice versa, then you have a bug. A typical problem is memory that's not being initialised, which can sometimes be hidden by library debug code that fills all "empty" memory with a "random" value.

  9. #9
    Join Date
    Aug 2009
    Location
    Finally back in Alaska
    Posts
    141

    Re: optimizing an algorithm or two

    Never profile code optimisations in debug builds.
    Why? I thought running the debug code outside of the debugger was pretty much the same as running the release code. Anyway, I just added some code to the ENTER/EXIT macros to take a clock reading, subtract the two, then add it to a running total while incrementing a counter.

    If your code works in a debug build and doesn't in a release build, or vice versa, then you have a bug. A typical problem is memory that's not being initialised, which can sometimes be hidden by library debug code that fills all "empty" memory with a "random" value.
    Not initializing memory returned from malloc could cause that? There are a few places in my code that get some memory but don't "initialize" it. Those are usually used as temporary storage for intermediate calculations (instead of creating a new object every run through the loop, I get a big enough chunk to hold the final value, then delete it at the end of the algorithm). I haven't run across anything about initializing memory before using it, or am I missing what you mean there?

  10. #10
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: optimizing an algorithm or two

    malloc does not initialise the memory to anything. A debug build "might" initialise it to something.

    This can cause bugs if you have assumed that all memory allocated from malloc is set to zeros. You can be deluded into believing malloc zero-fills, because an allocation that ends up going through the OS and actually increases the heap will be zero-filled, while a later reuse of the memory via the CRT will leave the contents as they were.
    If you want/need zero initialisation, use calloc() (or malloc and initialise the buffer yourself), as below.
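    A minimal illustration (the 0xCD fill mentioned in the comment is the MSVC debug heap's marker for fresh allocations):

    Code:
    	#include <cstdlib>
    	#include <cstring>
    
    	void AllocExamples()
    	{
    		// malloc: contents indeterminate; the MSVC debug heap fills fresh
    		// blocks with 0xCD, which is how a debug build can mask the bug.
    		unsigned __int64* p = (unsigned __int64*)malloc(8 * sizeof(unsigned __int64));
    
    		// calloc: guaranteed zero-filled.
    		unsigned __int64* q = (unsigned __int64*)calloc(8, sizeof(unsigned __int64));
    
    		// or malloc plus an explicit zero-fill:
    		unsigned __int64* r = (unsigned __int64*)malloc(8 * sizeof(unsigned __int64));
    		if (r) memset(r, 0, 8 * sizeof(unsigned __int64));
    
    		free(p); free(q); free(r);
    	}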


    Also, in a debug build, malloc will probably be much more elaborate and provide additional feedback and stack tracing options (even if you're not using them), as well as leak detection features; all those things will affect the speed of your code.

    Your code may also have debug options active, and so do many C++ library functions and STL containers.

  11. #11
    Join Date
    Aug 2009
    Location
    Finally back in Alaska
    Posts
    141

    Re: optimizing an algorithm or two

    Anywhere my code expects zeroed memory I explicitly zero it out; the places I do not initialize the memory are the places where it is just there to hold the results of a calculation (and the calculation does NOT use said memory).

    I took the time to figure out why it wasn't running in Release mode; for some reason turning on optimizations in the linker causes it to break. I turned them off and it ran just fine.

    I do build with the debug info turned on in release mode (it helps when I use WinDbg for debugging).

  12. #12
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: optimizing an algorithm or two

    It still means you have a bug somewhere, and it could be anything. Common reasons why something works in debug but doesn't in release are:
    - variables not being initialised (or initialised differently) in debug or release
    - memory overwrites
    - code that's in an ASSERT() type macro that's only executed in debug, but is required for proper functioning in release as well
    - a whole series of potential problems related to multithreading and synchronisation that only pop up in release
    - floating point results that differ between debug and release, because intermediates in debug go into variables (and are thus forced to double format), while in release they may remain on the FPU stack (as 10-byte "long doubles")
    - ...

    They tend to be very annoying/difficult types of bugs to discover.

  13. #13
    Join Date
    Aug 2009
    Location
    Finally back in Alaska
    Posts
    141

    Re: optimizing an algorithm or two

    I don't use any floating point variables or intermediates anywhere.
    I don't use any ASSERT statements.
    Multithreading is always a possible issue, and a big pain to track down.
    Memory overwrites can be a problem, though in my test platform I have a memory manager that tacks on memory before and after the pointer it returns, and I fill those areas with a specific pattern so I can check them after the memory is freed (a sketch of the idea follows below). I haven't run it with that turned on in a while, though, since it slows things down considerably.
    Variables not being initialised is a possibility.

    But what I was wondering about is the optimizations in the C/C++ section of the compiler (turning these optimizations on breaks it). From what I've read it will try to optimize register usage and even streamline function calls; how does that work when dealing with .asm files that are built with an assembler? I have it set to produce an asm file after compilation, but to be honest I haven't gone through it thoroughly enough to see if that could be the problem.
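    For reference, the guard-band idea mentioned above boils down to something like this (a sketch only; the pad size and fill pattern here are made up, not my memory manager's actual values):

    Code:
    	#include <cstdlib>
    	#include <cstring>
    
    	const size_t GUARD = 32;            // bytes of padding on each side (assumed)
    	const unsigned char PATTERN = 0xA5; // fill value (assumed)
    
    	// Pad each block with a known pattern so overwrites can be detected.
    	void* GuardAlloc(size_t bytes)
    	{
    		unsigned char* raw = (unsigned char*)malloc(bytes + 2 * GUARD);
    		if (!raw) return 0;
    		memset(raw, PATTERN, GUARD);                // front guard band
    		memset(raw + GUARD + bytes, PATTERN, GUARD);// rear guard band
    		return raw + GUARD;
    	}
    
    	// Verify the guard bands before freeing; false means an overwrite.
    	bool GuardCheck(void* p, size_t bytes)
    	{
    		unsigned char* raw = (unsigned char*)p - GUARD;
    		for (size_t i = 0; i < GUARD; ++i)
    			if (raw[i] != PATTERN || raw[GUARD + bytes + i] != PATTERN)
    				return false;
    		return true;
    	}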

  14. #14
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: optimizing an algorithm or two

    When interfacing C++ and asm, the asm code needs to adhere to the rules set by the C++ compiler.

    Since this is VC++ Win64, all your routines that interface to C++ need to follow the Win64 calling convention (look it up on MSDN). Basically RBX, RBP, RDI, RSI, RSP, and R12-R15 must be preserved; all other registers are volatile and don't need preserving.

    In a debug build, the code will tend to be slightly more relaxed about register usage, so any issues may not be immediately apparent. In an optimized build, the compiler will make much more elaborate use of register storage, so any register that is assumed to be preserved but gets destroyed will be more of an issue.
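    Applied to the routines in this thread, the contract looks like this (the C++ prototypes are my reading of the asm, not taken from actual headers):

    Code:
    	// Win64 calling convention as it applies to AAdd64/ASub64:
    	// the two pointer args arrive in RCX and RDX, and the result pointer
    	// is returned in RAX. RAX, RCX, RDX and R8-R11 are volatile (the asm
    	// may clobber them freely); RBX, RBP, RDI, RSI, RSP and R12-R15 must
    	// be saved and restored if the asm touches them.
    	extern "C" unsigned __int64* AAdd64(unsigned __int64* val1, unsigned __int64* val2);
    	extern "C" unsigned __int64* ASub64(unsigned __int64* val1, unsigned __int64* val2);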
