From: chughes@maths.tcd.ie (Conrad Hughes) Subject: Re: fast divide Date: 31 Oct 91 13:15:24 GMT DEFFNdiv(A,B,C,R):LOCALA%:[OPTpass:MOVS C,A,LSR#31:RSBNE A,A,#0 :ADDS C,C,A,LSL#16:MOV A,A,LSR#16:ADDS A,B,A:SUBCC A,A,B:] FORA%=1TO31:[OPTpass:ADCS C,C,C:ADCS A,B,A,LSL#1:SUBCC A,A,B:]NEXT [OPTpass:ADCS R,C,C:RSBCC R,R,#0:]=0 This is the BASIC assembler code to do the division.. Use it this way: FNdiv(reg_top,reg_bot,reg_misc,reg_res) while in the BASIC assembler.. It inlines the code. Register reg_res contains 32768*(reg_top/reg_bot); reg_bot _must_ be negative (if you don't know what sign it'll be, tack the following in at the beginning: CMP B,#0:RSBPL B,B,#0:RSBPL A,A,#0 ...) , and reg_top can be either sign.. If you need slightly higher accuracy (I think I can increase the accuracy by half a bit) and don't worry about registers, tell me. reg_res can be any register, including reg_top, reg_bot or reg_misc. If you want to turn it into a loop, do so.. But remember that the fundamental unit of of the routine takes 3 clock counts, a branch-loop takes five! Also, it'll be straightforward enough to turn into a subroutine.. If you need any clarification of it, ask me... Someone posted another version which is faster for a logarithmic distribution of numbers, with 0 and 1 occuring most frequently; I haven't got it to hand, but could dig it up if nobody else posts it. Conrad From: torbenm@diku.dk (Torben AEgidius Mogensen) Date: 31 Oct 91 15:34:21 GMT Subject: Re: fast divide Sender: torbenm@freke.diku.dk Here is one I cooked up. It is from memory, so there might be slight errors. The comparisons before the first SUB ensure that no overflow happens on the ASL. Worst case time is 5 cycles per bit. Arguments in p,q. on exit, r = p div q, p = p mod q .div MOV r,#0 CMP p,q BLT nodiv CMP p,q ASL #1 BLT div0 CMP p,q ASL #2 BLT div1 ... \ repeat for 3..30 CMP p,q ASL #31 SUBGE p,q ASL #31 ADDGE r,#2^31 CMP p,q ASL #30 .div30 SUBGE p,q ASL #30 ADDGE r,#2^30 CMP p,q ASL #29 .div29 ... \ repeat for 29..1 .div0 SUBGE p,q ASL #0 \ equiv. to SUBGE p,q ADDGE r,#2^0 .nodiv MOV PC,Link From: gcwillia@daisy.waterloo.edu (Graeme Williams) Date: 2 Nov 91 20:13:28 GMT Subject: Re: fast divide Organization: University of Waterloo Ok, here is the fastest unsigned divide routine I know of. The 3 instruction loop (unrolled) wasn't my idea - but the piece of code that figures out whereabouts to jump into the unrolled loop is. On entry d and n should be both +ve. CMP d,#0 : BEQ end% ; Remove zero divisors CMP d,n : MOVGT m,n : MOVGT n,#0 : BGT end% ; Exit if denominator > numerator ; This next bit of code figures out roughly how many bits the quotient is ; going to have and hence how many times you need to run through the ; divide loop proper - its only worth finding out to the nearest 4 times .start% MOV cnt,#28 : MOV m,n,LSR#4 CMP d,m,LSR#12 : SUBLE cnt,cnt,#16 : MOVLE m,m,LSR#16 CMP d,m,LSR#4 : SUBLE cnt,cnt,#8 : MOVLE m,m,LSR#8 CMP d,m : SUBLE cnt,cnt,#4 : MOVLE m,m,LSR#4 MOV n,n,LSL cnt ADDS n,n,n : RSB d,d,#0 ; Have to flip the sign of d for the divide loop ADD cnt,cnt,cnt,LSL#1 ; Multiply cnt by 3 ADD pc,pc,cnt,LSL#2 ; Jump cnt instructions MOV R0,R0 ; Dummy instruction to take care of pipelining 32 copies of { ADCS m,d,m,LSL#1 : SUBCC m,m,d : ADCS n,n,n } .end% The best times occur when the numerator and denominator of your division are of similar size. The worst times when the numerator is very large and the denominator small. (The case where the denominator is larger than the numerator is killed off very early.) Suppose one has a distribution of numbers wherein the highest bit of the numbers is equally distributed between bit 0 and bit 31, that is, there is the same number of numbers between 2^n and 2^(n+1) for every n. (I think this is actually the case for most purposes) Now if a numerator and denominator are selected at random from this distribution subject only to the requirement that denom <= numer then the above routine will execute on average in 58.25 cycles (from the point start% in the code). The worst case is 116 cycles (top bit of quotient is bit 28 or higher) the best case is 32 cycles (top bit of quotient is bit 0 to 3) The inbetween cases get 12 cycles worse for every 4 extra bits in the quotient. Note that the average case isn't the simple average of the various cases but has to be weighted toward the better cases - this is because there are more ways (27) for numer and denom to differ by say 5 bits than there are (7) for them to differ by say 25 bits. Have fun. Graeme Williams From: zrzm0370@rusmv1.rus.uni-stuttgart.de (Joerg Scheurich) Subject: integer division Date: 10 Jun 91 11:25:36 GMT Hi So i know, there is no Assembler-command for integer-division in the ARM. Therefor a fast division-routine is a thing of common interest. In a earlier Posting i wrote about the bad quick hacked division-routine for the demo of my formula-tool (math. formula ==> assemblermacros). Now i wrote a better routine, but i don't know, if there is a better algorithm. The shorter and slower 32-Bit/32-Bit-division-routine requires 382 Cycles in 100 Bytes, with usage of 4 32-Bit-registers and 4 Byte memory. With unrolling the loop, the faster and longer 32-Bit/32-Bit-division-routine requires 182 Cycles in 692 Bytes, with usage of 4 32-Bit-registers. This is very faster then the old routine (which requires 2**33 Cycles in worst case and requires all time of the universe, if there is a division by zero ... ) Cause the INTERNAL-assembler-command of the INTEL 8086 for a 32-Bit/16-Bit-division requires 165-184 Cycles in 2-4 Bytes, with usage of 3 16-Bit-registers, i think, the result is ok. The result is nothing against the INTERNAL-assembler-command of the ARM for a 32-Bit*32-Bit-multiplication which requires 17 Cycles in 4 Bytes, with usage of 1-3 32-Bit-registers, but my Professor in digital electronics used to say: "If a program contain much divisions, the programmer is a idiot" ... Know someone a better (faster or shorter) algorithm ? so long MUFTI ( internet: zrzm0111@helpdesk.rus.uni-stuttgart.de ( from janet: zrzm0111%de.uni-stuttgart.rus.helpdesk@NFS-RELAY ) bitnet: ZRZM AT DS0RUS1I ) div-fast: --------- ; division routine ; 1. calculate flag if result is negate and convert operands to positiv ; 2. second "divide" with unsigned suczessive Approximation ; 3. fix the sign of result .MACRO INTdivide LDR R2,%2 LDR R3,%3 MOV R0,#0 CMP R2,#0 RSBLT R2,R2,#0 SUBLT R0,R0,#1 CMP R3,#0 RSBLT R3,R3,#0 MVNLT R0,R0 MOV R1,#0 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l2 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l3 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l4 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l5 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l6 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l7 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l8 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l9 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l10 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l11 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l12 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l13 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l14 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l15 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l16 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l17 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l18 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l19 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l20 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l21 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l22 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l23 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l24 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l25 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l26 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l27 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l28 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l29 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l30 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l31 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 \l32 ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 CMP R0,#0 RSBNE R2,R2,#0 STR R2,%1 .ENDM div-short: ---------- ; division routine ; 1. calculate flag if result is negate and convert operands to positiv ; 2. second "divide" with unsigned suczessive Approximation ; 3. fix the sign of result .MACRO INTdivide LDR R2,%2 LDR R3,%3 MOV R0,#0 CMP R2,#0 RSBLT R2,R2,#0 SUBLT R0,R0,#1 CMP R3,#0 RSBLT R3,R3,#0 MVNLT R0,R0 STR R0,minusflag MOV R0,#32. MOV R1,#0 \loop ADDS R2,R2,R2 ADCS R1,R1,R1 CMP R1,R3 SUBGE R1,R1,R3 ADDGE R2,R2,#1 SUB R0,R0,#1 CMP R0,#0 BNE \loop LDR R0,minusflag CMP R0,#0 RSBNE R2,R2,#0 STR R2,%1 .ENDM From: chughes@maths.tcd.ie (Conrad Hughes) Subject: Re: integer division Date: 11 Jun 91 14:10:31 GMT zrzm0370@rusmv1.rus.uni-stuttgart.de (Joerg Scheurich) writes: >382 Cycles in 100 Bytes, with usage of 4 32-Bit-registers >and 4 Byte memory. >With unrolling the loop, >the faster and longer 32-Bit/32-Bit-division-routine requires >182 Cycles in 692 Bytes, with usage of 4 32-Bit-registers. This divides a 64bit number by a 32bit one, with 32bit result, in 97 instructions/388 bytes/97 clock cycles.. Entry: A=MSW of 64bit no., B=LSW of 64bit no., C=32bit no. C _must_ be negative (if it's positive, RSB C,C,#0), A _must_ be positive (if not, RSBS B,B,#0:RSC A,A,#0) ADDS B,B,B ADCS A,C,A,LSL#1:SUBCC A,A,C:ADCS B,B,B [repeat these three 32 times] Exit: A=Remainder of division, B=result, C unchanged. ..if you don't need the remainder, get rid of the last SUBCC A,A,C. The result will be unsigned, and flags will (obviously) reflect the last addition. Adjust as you will to get 32/32 division (Initialise A to 0, or AB = numerator shifted some way) or any other variety you like.. Rolling up the loop it could get slightly complicated since every instruction requires flag information from the previous one ;-) Something like SUB count,count,#1:TEQ count,#0:BNE loop with count initialised to #32 might do the trick, but it'll slow it down to at least 288 cycles (so what if it only uses 32 bytes?? ;-}), which is pretty unacceptable unless you're on an ARM3.. This is off the top of my head and I'm nowhere near the Arc, so if it doesn't work (it should - I've typed it in enough times now..) flame me.. Conrad From: gcwillia@daisy.waterloo.edu (Graeme Williams) Subject: Re: really FAST integer division Summary: Avg. exec. time < 60 cycles - the fastest div in the West? Date: 28 Jun 91 19:07:58 GMT Organization: University of Waterloo Following several posts regarding division routines, in particular one with a lovely 3 instr. central loop I modified my old division routine into the following (supersonic, lightning fast, ZR rated, :-) ) 32bit routine: ;Essentially what's happening here is we are figuring out how far left ;we can shift the numerator before any actual dividing begins and thus save ;a significant number of useless trips through the (unrolled) 3 instr. loop. ; - Note we only go to the nearest 4 bits, trying to save another 2 bits ; results in 3 extra instrs. for a saving of 6 instrs. half the time ; and 0 instrs. the other half - i.e. no nett benefit. MOV cnt,#28 : MOV mod,num,LSR#4 CMP den,mod,LSR#12 : SUBLE cnt,cnt,#16 : MOVLE mod,mod,LSR#16 CMP den,mod,LSR#4 : SUBLE cnt,cnt,#8 : MOVLE mod,mod,LSR#8 CMP den,mod : SUBLE cnt,cnt,#4 : MOVLE mod,mod,LSR#4 MOV num,num,LSL cnt : RSB den,den,#0 ADDS num,num,num ;Now skip over cnt copies of the 3 instr. loop. ADD cnt,cnt,cnt,LSL#1 : ADD pc,pc,cnt,LSL#2 : MOV R0,R0 {ADCS mod,den,mod,LSL#1 : SUBLO mod,mod,den : ADCS num,num,num} x 32 copies The routine behaves as follows: On Entry: num - is a signed numerator A (but +ve) den - is a signed denominator B (but non-zero and +ve) On Exit: mod - contains A MOD B num - contains A DIV B The routine requires 4 registers and 452 bytes of memory. Execution time: Given numbers A and B (both +ve) chosen at random from a distribution in which the most significant bit of the numbers is uniformly distributed between bits 0 and 31 (i.e. the density of numbers goes as roughly log n) then the average execution time is: For A < B : 32 cycles (no division actually required) For A >= B: 58.25 cycles The best and worst cases are 32 cycles and 116 cycles respectively. Have fun. (Stuff like this ought to be worth good money!) Graeme Williams gcwillia@daisy.waterloo.edu P.S. With any luck there's only a couple of typos :-). From: dseal@armltd.uucp (David Seal) Subject: Re: Dividing Date: 5 Mar 91 19:47:12 GMT In article <8828@castle.ed.ac.uk> ecwu61@castle.ed.ac.uk (R Renwick) writes: > Can anyone help me with this problem: >In my super-duper, wire-frame model animating 'C' program, I represent >floating point numbers as 4-byte integers. This I do by shifting the number >being represented by 15 bits (multiply by 32768) in order that I can keep a >track of the fractional part. The reason for doing this is that I want >to make sure my program runs as fast as possible by not using floating >point arithmetic... > > What I want to do is do a divide a number by another number >and not lose alot of the precision. > The way I do this at the moment is by > result=((a/b)<<15) >But if a and b are close together then (a/b) loses alot of precison >(since the result is an int). > >Does anyone know how I could do an integer division without losing >precision. I cannot use result=(((a<<8)/b)<<7) or any other tricks >because a may be close to 32768<<15 already... >Perhaps if I can find the reciprocal of b in the above representation >and multiply it by a then I'll not lose the as much precision, but I >don't know how to do this (find the reciprocal) :-( Two quick suggestions: (1) The basic problem is that integer division stops when it determines the bit in the units place, whereas you want it to go on and calculate another 15 fractional binary places. So rather than using C's integer divide operator, write a long division routine using subtractions and shifts and force it to go on 15 iterations longer than normal. (This won't be much slower than C's integer divide, which essentially uses a long division routine anyway). If speed is very important, you'll probably want to either (a) write a well-optimised assembler routine; or (b) write it as a C macro, so that the compiler includes it inline. In the latter case, you may want to look carefully at the assembler produced by the compiler for an instance of your macro and experiment a bit - careful choice of the details of the long division algorithm can make a significant difference to the speed of the division. Incidentally, you may want to watch out for significant bits being lost during the last 15 iterations: this is perfectly possible and means that your result has overflowed. (2) Use a Newton-Raphson iteration to find the reciprocal of your divisor and then multiply - this presupposes that you've got an efficient way to do multiplications, both for the Newton-Raphson iteration and for the final multiplication. The iteration to find the reciprocal of a number A is (in pseudo-code - you'll have to translate to C): newx = initial approximation to reciprocal; repeat x = newx; newx = x * (2 - A*x); until x and newx sufficiently close together; The main thing to beware of in this is that your initial approximation must be reasonably close to the reciprocal - it must be greater than 0 and less than twice the reciprocal. Such an approximation can be found by finding the most significant 1 in A and choosing a value based on this - e.g. if A lies in the range 1.0 <= A < 2.0, its most significant 1 is at bit position 16, and a suitable starting point for the iteration would be 0.75. This gives us: Most sign. bit position Range for A Suitable starting point ----------------------------------------------------------------------- : : 9 2^(-7) <= A < 2^(-6) 1.5*2^6 10 2^(-6) <= A < 2^(-5) 1.5*2^5 11 2^(-5) <= A < 2^(-4) 1.5*2^4 12 2^(-4) <= A < 2^(-3) 1.5*2^3 13 2^(-3) <= A < 2^(-2) 1.5*2^2 14 2^(-2) <= A < 2^(-1) 1.5*2^1 15 2^(-1) <= A < 2^0 1.5*2^0 16 2^0 <= A < 2^1 1.5*2^(-1) 17 2^1 <= A < 2^2 1.5*2^(-2) 18 2^2 <= A < 2^3 1.5*2^(-3) 19 2^3 <= A < 2^4 1.5*2^(-4) 20 2^4 <= A < 2^5 1.5*2^(-5) 21 2^5 <= A < 2^6 1.5*2^(-6) 22 2^6 <= A < 2^7 1.5*2^(-7) : : Watch out for very big values - e.g. 2^16 is representable in this number system, but 2^(-16) isn't! Hope these suggestions help. David Seal dseal@acorn.co.uk or dseal@armltd.uucp poppy@poppyfields.net |