CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic VB Forums Developer.com
Results 1 to 7 of 7
  1. #1
    Join Date
    Aug 2009
    Location
    Finally back in Alaska
    Posts
    141

    My Big Number library

    Hello all,

    Ive been away for a few months now but finally moved back to Alaska yay. I am picking up my pet project again and would like to put it out there for anyone who wishes to take a look to do so and give some advice on How to improve it. The library is fairly compex (to me at least) so Ill make a few posts as to whats in it and how to use it. The library is only available in 64 bit mode, I keep meaning to implement a 32 bit mode of it but.....

    The best way to get an idea of the internals is if you have Doxygen, all files are commented to create full documentation with Doxygen, even the .asm files are commnented though to use it you need the asmtodoxy script (dont remember where I found it but will try to find it again and post the link as well as my versions of the .bat to convert the asm files for Doxygen and the Doxy file to create the documents).

    There are just a few things to note, the assembly language routines uses a memory management routine that I wrote and you have to be aware of this if you are going to change the source in the libraries, there are routines in the managed and native libraries to get and free memory. You can change the running behavior by going into the innards file and change the _usemalloc global variable to 1, this will force all memory requests to use malloc and free (you should be able to change this variable in the middle of running so that the arrays that were filled using my routine before the change can work with the arrays filled using malloc without any problems, you should be able to replace the arrays that were filled using my routine with one that you create using malloc and my free routine should recognize it and free it correctly.

    The complex versions of the classes are not operational yet.

    All of the managed classes have implicit conversions from the built in types (as well as string types) and explicit conversions to those types though not all of the explicit conversions are debugged yet. This means you can use members of my library with all the built in types without worrying about conversions as long as you expect on of my classes as a return value.

    The managed classes are by far more fully implemented, the native versions of the classes are mainly used as containers so that the managed classes can access the assembly language routines directly.

    BI is the managed version of BigInteger, BD is the managed version of BigDecimal, and BF is the managed version of BigFraction. All native classes are preceded with a C and all assembly language versions are preceded by an A (though the native and assembly language versions are the same and used interchangeably).

    All numbers are kept in an array as an unisgned number using Big Endian, the sign is kept in a variable names _sign (0=positive 1=Negative), the decimal versions also have a _decimalpoint variable to keep track of the position of the decimal point in the number, the decimal number is kept in an array as if there was no decimal point, the arrays are changed internally to make the decimal points line up for the algs that need it.

    Ill answer any other questions if you just post them, I had to strip everything out in order to get it to fit, so none of my addon libraries are included as well as what I had for the 32 bit stuff.

    Anyways, thanks for any insight as to how to improve it.
    Attached Files Attached Files
    • File Type: zip BI.zip (214.8 KB, 1317 views)

  2. #2
    Join Date
    Aug 2009
    Location
    Finally back in Alaska
    Posts
    141

    Re: My Big Number library

    ok for the files to generate the documentation.

    the doxygen link:
    http://www.stack.nl/~dimitri/doxygen/download.html

    there is a doxyfile which is to be used also but it is too long to fit in one post. you have to edit it yourself to generate the documentation. If someone needs it I will upload my version but you will still have to edit it for your directory structure.

    for the asm to c script
    http://rudy.mif.pg.gda.pl/~bogdro/inne/asm4doxy.txt

    the above script must reside in the same directory as the .asm files along with the following .bat file

    Code:
    asm4doxy.pl -od -ud -ns AdvancedAlgs64.asm Asm64.asm Asm64e.asm BIAsm64.asm Debug64.asm Innards64.asm NumberTheory64.asm Macros.asm
    Last edited by AKRichard; September 26th, 2013 at 02:11 AM.

  3. #3
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: My Big Number library

    I only just took a quick glance over it all, may take a deeper look sometime later.

    Maybe doing everything in ASM was part of the whole plan for you, but it seems strange. THe majority in something like this is "management" and this isn't the kind of tedious work assembly is very good at, I would have imagined that having memory management in a higher language (C, C++, C#) and just using asm for the "core calculation work" makes more sense. The "strange thing" about the memory management is that it is locked into C/C++ anyway by having a hard call into malloc() anyway. If you'd want to use this from some other language, it wouldn't link for that reason alone. So since this is committed to being used from C/C++ this approach seems extra strange. There may be a reason behind it, but I would imaging that more asm, and custom allocation routines are just more areas that need debugging/development.
    For most of your librarywork. Stick to somethign like C/C++, and use asm for the time critical parts. 90% of the code in a big int/decimal lib isn't time critical.

    This is especially true for the "advancedalgs" in as far as I looked at it this benefits little, if at all, from asm, but having this in higher level would mean you only need 1 set of routines, and the compiler would provide the 64 or 32bit version for you. Maybe less of a concern as long as you only want to support 32bit, but could be a big thing if you wanted a lib that's usable in a variety of platforms.

    I'm also somewhat confused at the ENTERFUNC and EXITFUNC approach. A bunch of code is added, that does a lot of conditional jumps, this sort of defeats a large portion of why you'd want to use ASM in the first place (code size and performance). I can understand the need for an ENTERFUNC/EXITFUNC type approach, but doing it with runtime conditional jumps seems strange, doing this with compiletime conditionals would be the right approach.
    it's weird you have these ENTERFUNC/EXITFUNC in macro's however, again, this sort of defeats the purpose of using asm macro's.


    Also, this is tailored for 64bit (which is a big restriction in and of itself, but understandable), but it's strange that since this is comitted to Win64 anyway, you're not using stuff like SSE for some of the things (while SSE may not be available on all flavours of Win32 processors, it is always available for Win64).

    as to implementation:
    BIE64, BINE64, BILT64, BILTOE64, BIGT64, BIGTOE64
    you really only need 1 asm version. One that takes 2 ABI's and returns -1, 0 or 1, all the higher level ones can be evaluated from there (using inlines this would even remove extra calls).


    I'm also somewhat confused at the core critical loops. This would appear to be the whole reason for using ASM for this kind of calculation, but this code is not anywhere close to being optimal at all.
    Take Add64:
    There's a single per-int64 loop to add the 2 abi's into the result ABI.
    lets analyze:
    Code:
                                mov		rax, qword ptr[r12+rdi*8]
    		add		qword ptr[r8+rdi*8], rax
    		cmovc	rdx, _one
    		inc		rdi
    		add		qword ptr[r8+rdi*8], rdx
    		cmovnc	rdx, _zero
    		cmp		rdi, _t1
    		jna		AddLoop
    		cmp		_t5, 1
    		ja		CheckForZs
    for some reason... I'm getting the impression you don't know about the ADC instruction which is specifically designed for this purpose. you now have a convoluted approach that uses rdx as an intermediate to handle carry. strance because you are using ADC in the multiply loop.
    ideally you'd also want to unroll this loop a few times to reduce the overhead for the compare/jumps, and in doing so, you can also remove the mov rax,mem + add mem,rax which doesn't overlap since the 2Nd instruction needs to wait for completion of the 1st.

    this whole loop could be written as
    Code:
                                ...
                               clc
    AddLoop:              mov rax, qword ptr [r12+rdi*8]
                               adc  qword ptr [r8+rdi*8],rax 
                               inc rdi
                               cmp rdi,_t1
                                jna addloop
                                ...
    you can remove the explicit clc by having a regular add to start the loop.
    3/5 (60%) of the above code is loop-control, unrolling it 4 times would reduce that to 3/11 (27%) in reality it's even more dramatic since "jumps are bad". with a loop this tight, unrolling even more is feasible.


    the multiply loop is even worse with a strange way to handle the carry on the adds, as well as a strange way to handle the the upper 64 bits of the multiply.
    I might have missed something in the quick glance over, but are you sure the multiply works correctly? I'm getting the impression it's missing a loop.

  4. #4
    Join Date
    Aug 2009
    Location
    Finally back in Alaska
    Posts
    141

    Re: My Big Number library

    sorry took so long to reply still working to get set up here again.

    Maybe doing everything in ASM was part of the whole plan for you, but it seems strange. THe majority in something like this is "management" and this isn't the kind of tedious work assembly is very good at, I would have imagined that having memory management in a higher language (C, C++, C#) and just using asm for the "core calculation work" makes more sense.
    I have mixed feelings about the memory management routine. It appears to make the algs run faster, almost every single assembly language routine calls for memory at least once some as many as 9 times for variables, and when I profiled the classes the code was spending a hefty amount of time in malloc calls. The funny thing is in single threaded mode there appears to be a greater increase in speed in the memory requests, but in multi-threaded mode the bottle neck reappears but not as badly as using malloc. The reason why I decided to implement it in assembly was because I have a table that keeps track of what memory is in use and what is free and the assembly language implementation was quite a bit faster.

    The "strange thing" about the memory management is that it is locked into C/C++ anyway by having a hard call into malloc() anyway
    Yes, when using my routine it will make 4 calls into malloc initially, the first two calls create 2 tables, one holds pointers to the tables used in the management routine, the other holds the starting address of all the blocks of memory being used. the third gets the memory that will be handed out (default is set to 4 megs). The other request is for the table with a very simple format. If the block (in the 4 megs)that a specific element in the table relates to is free, then the element in the table is set to 0. If the block is in use, then it stores the starting block to fill the request and the number of blocks being used to fill the request. This way when a block is in use the program doesnt have to check every single element, it simply extracts the starting block and adds the number of blocks then sets the program to start looking at the element the addition points too. The only other time a call to malloc will be made is if there is not enough free blocks at which point it makes 2 calls into malloc (for the table and the block to be handed out) adds those entries into the management table lists and continues.

    I was actually amazed the memory management ran as effectively as it does so much so that I rarely use the _usemalloc global variable. Im sure there is a more efficient way to implement it but for the requirements that I had it also became more complex than Id of likes it to be, like not knowing what size the rquests will be or what order they will be freed (because you usually cant determine variable lifetime at designtime) I couldnt figure out how to use a simpler method. My memory management free routine uses the same algorithm nno matter the state of the _usemalloc variable. It allways starts by comparing the address being freed to the starting address of each of the blocks in the management table, if it is within the 4 megs it simply subtracts that starting address from the address being returned then divides the result by 64 which will tell you which element in the table to start at to free (which table itself is deteremined by the index used in the management tables). when it frees the address (by zeroing out the respective elements in the management array) it does so by starting at the highest element, the memory management routine that hands out memory allways marks the elements starting at the lowest this way I dont have any conflicts between the two routines.

    This is especially true for the "advancedalgs" in as far as I looked at it this benefits little, if at all, from asm
    yes, that particular file was disappointing, I couldnt find any measurable benefit to including it.

    I'm also somewhat confused at the ENTERFUNC and EXITFUNC approach.
    Those particular macros are whats left over from my attempting to add error handling to the assembly language routines, which Im disappointed in that I never got that to work correctly. It was originally meant to setup/cleanup stack frames (which storing of used non-volatiles is a part of) then it morphed into its current state slowly over time. I thought it would make it easier for mee to keep track of what registers were being used and to make sure that all the registers that were pushed were also popped. It made it easier all right, but you are correct it is an ugly section of code.

    you're not using stuff like SSE for some of the things (while SSE may not be available on all flavours of Win32 processors, it is always available for Win64).
    Not long before I moved up here I started looking into possible benefits of sse. I dont think it can be used for everything (such as anything that requires taking care of a carry from a previous bounce through the loop.

    I'm also somewhat confused at the core critical loops. This would appear to be the whole reason for using ASM for this kind of calculation, but this code is not anywhere close to being optimal at all.
    ........
    for some reason... I'm getting the impression you don't know about the ADC instruction which is specifically designed for this purpose. you now have a convoluted approach that uses rdx as an intermediate to handle carry. strance because you are using ADC in the multiply loop.
    One of the things that really troubles me in assembly is that the carry flag is used for practically everything. So after many frustrating experiences early in my learning assembly I picked up the bad habit of only expecting the carry flag to be valid for the calculation for the instruction immediately after a calculation instruction that either set or cleared the flag. Your version of what the alg should look like looks almost exactly as a version I tried running, the problem was that cmp rdi, t1 command at the bottom of the loop destroys the carry flag so that it cannot be used with the adc instruction at the head of the loop.

    ideally you'd also want to unroll this loop a few times to reduce the overhead for the compare/jumps
    I am not completely clear on the concept of unrolling the loops. I am sure, as its name implies, its geared towards reducing/removing iterations from a loop. How do you know how many times you can "unroll" a loop when you dont know how many iterations the loop will run till runtime. I am assuming that the term "unroll" means something along the lines of inserting the contents of the loop ahead of the loop itself.

    the multiply loop is even worse with a strange way to handle the carry on the adds, as well as a strange way to handle the the upper 64 bits of the multiply.
    yes that was a changed to its current form from right before I left California. The idea in its weirdness was to get rid of the conditional moves that were in their (they were conditional jumps in the original version). From the way you are talking there must be a way around the problem of the compare instructions (usually at the bottom of the loop to test if the end of the array has been reached). I had tried using the loop instruction in another version of the algs but ran into problems that I couldnt figure out at the time, but I think I know what I did wrong before.

    This is exactly why I posted the library, I wanted to get an idea on some of the design and implementation issues and what I should have done for them.

    Having said that, the core parts of the library, everything in the asm64 files, the innards file, and most of whats in the BIAsm64 file have been thouroughly tested, I have a program that generates random numbers of random sizes, creates a variable of my class and microsofts BigInteger class then runs those numbers through the various airthmetic operations then compares the results of those operations(addition, subtraction, multiplication, division, modular reduction, exponentiation, and modular exponentiation) and I let it run all night while Im asleep and all day while Im at work.

    I was hoping someone was able to understand my division alg (which Im proud to say is on the order of 8 times faster than microsofts) and that is just a standard implementation (with a little extra logic) of the standard alg, I didnt use fft's or anything like that. I managed to almost completely eliminate the renormalization problem (I think thats what they call it at least), if my alg adjusts its almost allways only on the first round of the algorithm then extremely rarely ever needed again. It worked so well I use the exact same alg for the modular reduction routine as well. When comparing the runtime of my algs to microsofts microsofts is about five times faster for addition and subtraction, about equal for multiplication, my algs are about 8 times faster for division and modular reduction, and my algs are 10+ times faster for exponentiation and modular exponentiation. Though I think microsoft uses an array of bytes to keep their large numbers, since I am using 64 bit (quadword) arrays may have something to do with it, I am not sure how to "even the playing field" without changing mine to using arrays of bytes as well.

    Anyways, thanks for taking the time to look it over and putting your thoughts into the post.
    Last edited by AKRichard; September 29th, 2013 at 10:19 AM.

  5. #5
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: My Big Number library

    Had some more looks at it over the weekend.
    For starterts, "hats off to you sir" for doing something like this. I've been there myself so I know some of the stuff in there takes quite a bit of effort, even just getting everything to work is reason for a pat on the back. Getting it to work efficiently is challenging :-)



    The memory management still "seems strange" to do in ASM. It works so "good for you", but it undoubtably has taken a lot of effort to do, where a compiler would have done equally well. I understand what you did, it's pretty much just doing a custom allocator, but you could have done this in C++ as well. In my implementation I created a BigInt class using std::vector as a base for the 'big int'. It works well enough as it is for the majority of cases, and I've done it with a customized allocator a few times as well where such made a big impact.

    implementation wise, I'd say your biggest drawback is the awkward interface:
    - all the operator takes a CBI instead of a const CBI& (inefficient)
    - they also all return a CBI* instead of a CBI (typicaly returning this), this makes usage in compound statements awkward, and it requires care on the usage side to clean up the allocated data (or not, who owns the returned pointer).

    when I do something like
    CBI result = cbiA * cbiB;
    I don't expect that the above result is a pointer rather than a object (the above won't even compile);
    I also don't expect that the above statement changes cbiA to hold the result.

    I can "sort of" understand where this comes from, but the above approach is "an accident waiting to happen" from someone using your class.

    I suggest you have a look at std::complex and see how that class is designed, then use that as a base for your own CBI and CBD classes, the fraction class, I would design much liks std::complex to take either CBI or CBD (or any other compatible type for that matter) as a storage type for the fraction class.



    Most of the asm seems to be very similar to what a compiler would generate. or in other words, it is "straightforward". While it works, it means you aren't really leveraging the capabilities of what you can do in assembly to make the code shine. Properly using the carry flag in this type of application is part of that. It's all about "working around" the instructions that do change the carry flag.
    You're right about the CMP in the loop and it's wrong in my solution, I overlooked it. The whole trick here is to avoid the CMP entirely: you count DOWN to zero (at which point you can JNZ the loop or JZ out) which is why you use DEC/INC instead of SUB/ADD because INC/DEC doesn't change carry flag. Or you use rcx as counter and use LOOP.
    This sort of "clever use" of the available ASM syntax is precisely why ASM can do things a compiler can't, and it can pay off big in performance.


    unrolling a loop.
    This means that instead of having
    Code:
    next:   <some operation>
              dec reg
              jnz next
    where you have 2 'loop control' instructions for each actual bit of work being done.

    unrolling the loop 4 times means:
    Code:
    next:  <some operation>
              <some operation>
              <some operation>
              <some operation>
              dec rec
              jnz next
    so now you have 2 Loop control instructions only for each 4 bits of actual work. The effect of this is more dramatic the more compact <some operation> is. And you can unroll as many times as you want, although there'll be a tradeoffpoint somewhere that another unroll will not gain you much performance (or even slow the code if it exceeds code cache) for the effort.
    of course, the above means that you have "somehow" made sure that the stuff you do is indeed compatible with the above unrolled loop.

    you can do this by making sure that every BI is always a multiple of 4 (with the above 4x unrolling) ints large.
    or you can do this by having some preparation code that first handles the case if it's 1, 2 or 3 ints more than a multiple of 4. there's many ways to do this, but a jumptable is common...
    something like this (for 4x unroll, adjust as necessary for 8x 16x....):
    Code:
               mov reg,count
               mov reg2,reg
               and  reg2,3        ; reg2 = count % 4
               shr   reg,2         ; reg = reg / 4
               inc    reg            
               jmp  jumptable[reg2*8]
    jumptable DQ  next0, next1, next2, next3   ; addresses of jump target for different remainders
    next0: <some operation>
    next3: <some operation>
    next2: <some operation>
    next1: <some operation>
               dec reg
                jnz next0
    so suppose count is 7. you divide by 4 (being 1) and remainder is 3.
    you jump to next3, do the <some operation> 3 times, go into the loop once for another 4 times <some operation>

    unrolling either means data constraints (requiring data to be multiples of 4ints in the above case) or it means extra setup/overhead to get it working, which you gain by saving on the actual loops. if you mainly have "short" CBI's of only a few ints large, the overhead may make this approach slower, on really large CBI's the gains will be noticable.



    Bugs: your casts back to POD's are erroneous for signed types. take
    operator short():
    you're comparing with 65536 which is wrong.
    it's also strange in how it works. depending on the value you either copy the value if it fits or return 0. this isn't how a int behaves in C++ code
    I would expect this to either work like a regular int cast with clipping or throwing an exception (of a mix of both configurable via a global runtime or compiletime flag)



    GCD: uses a ton of divides (which are slow), there is a GCD calculation for binary numbers that only uses shifts, adds and subtractions. it's excellently suited for bigint type approach (won't help at all for CBD) do a google search for "binary gcd"
    Last edited by OReubens; September 30th, 2013 at 10:26 AM.

  6. #6
    Join Date
    Aug 2009
    Location
    Finally back in Alaska
    Posts
    141

    Re: My Big Number library

    For starterts, "hats off to you sir" for doing something like this. I've been there myself so I know some of the stuff in there takes quite a bit of effort, even just getting everything to work is reason for a pat on the back. Getting it to work efficiently is challenging :-)
    Thank you, This project has evolved greatly over the years. Started out as a vb dll to expand on javas BigInteger, then came a BigDecimal and BigFraction (which didnt turn out as well as Id hoped), then I finally decided to create my own from scratch, which is where I learned c++. Unfortunately I concentrated on managed c++, then learned enough native to get everything to work together. Its kind of funny, because the more advanced this project became, the more compact the code became (for the most part). Anyways, not bad for a self taught carpenter huh? thanks again.

    The memory management still "seems strange" to do in ASM.
    To tell the truth, I didnt even try to implement it in c++, I just knew the calls to malloc where a bottleneck for me and wanted my routine as fast as possible, thats why I went straight to ASM with it. Actually, implementing it didnt take as long as youd think. I spent more time trying different things till I hit one I was happy with, once I focused on how I wanted it to run, it didnt take all that long to implement.

    implementation wise, I'd say your biggest drawback is the awkward interface:
    Ya, the native side of things is a mess, I just started working on it before leaving CA, until recently there really wasnt any native implementation. I had just hit upon the idea of creating a native container so that the managed classes could access the asm algs directly. After that, it only seemed natural I should have a native implementation. But again, I am still learning c++ and Id say 90% of what Ive learned so far is on the managed side. So now that Im picking it back up, Im off to reading about how a native value class should look.

    As far as unrolling the loops go, like Id mentioned, I had thought about that before (though I wasnt aware that was unrolling), though I didnt actually implement it, I did think I saw some ways it could pay off, so that will probably be included sometime soon.

    Bugs: your casts back to POD's are erroneous for signed types. take
    operator short():
    ya, when I finally learned about explicit and implicit casts (managed c++) I knew how I wanted the implicit casts from the runtime types to mine to behave (and I implemented an implicit for all basic types to mine) but wasnt sure how I the explicit casts from mine to the basic types should run. Im not surprised that I didnt get it correct, as I really didnt spend that much time on the explicit casts, but that shouldnt really be that hard to implement correctly.

    GCD: uses a ton of divides (which are slow), there is a GCD calculation for binary numbers that only uses shifts, adds and subtractions. it's excellently suited for bigint type approach (won't help at all for CBD) do a google search for "binary gcd
    I am constantly looking at different algs to try out. I am well aware mine are not optimal, but still proud to have them running as well as they do. But I do attempt implementing some of the things I see if I think they miight be better. Ive been looking for a different way to implement the trig functions. Right now I am using the Taylor series, however I dont like the results Ive got with it, comparing my results to microsofts, there is too much difference, of course I am assuming that the microsoft implementation is correct out to however many decimal places it shows. And I am especially interested in different implementations of the division alg.

    One of the things Ive really been looking into is SSE, I can see where that would be a big boon, but I do not have a clear idea of how to implement it yet.


    Anyways, thanks for taking the time to look at it, I do appreciate the pointers

  7. #7
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: My Big Number library

    If you really want to look at what advanced "CBI" can do.

    check out the GMP project. www.gmplib.org. The problem with that is of course it's GPL licence which may or may not be suitable to your needs. (this is why I needed my own).

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  





Click Here to Expand Forum to Full Width

Featured