Volume 4 -- Issue 10 | July 1984 |

In This Issue...

- 18-Digit Arithmetic, Part 3
- Building Label Tables for DISASM
- Quick Memory Testing
- 68000 Sieve Benchmark
- Updating the 6502 Prime Sifter
- Sorting and Swapping
- Apple //c Gotchas
- Orphans and Widows
- Speed vs. Space

Feedback on our DOSonomy

Our little dossier of DOS names was well received. It may be we will soon have so many names we will need a dosser (a large basket that can be carried on the back) to hold them all. On the other hand, if we keep writing about this our fortunes may reverse, forcing to finding new quarters in a doss house. What is the critical dosage?

Dan Pote offers "Kinda-Sorta-DOS". Which led Bill to coin "MaybeDOS". Randy Horton reminded us of "Ante-DiluviDOS". Chris Balthrop enters MacroDOS and "What's Up DOS". (I think the latter is "Buggy". Or "Bugsy"? Oh, it's not bunny anymore...) If you can take all this, you may be too docile.

Don Lancaster Strikes Again

We just have a little space and a little time to mention Don's new Assembly Cookbook for the Apple II/IIe, which just arrived. It looks like another winner! Look for a full review next month, or check our ad on page 3 for ordering info.

18-digit Arithmetic, Part 3 |
Bob Sander-Cederlof |

Plowing ahead, this installment will offer the division and input conversion subroutines.

You will remember that we covered addition and subtraction in the May 1984 issue, and multiplication in June. Now it's time for division, which completes the fundamental arithmetic operations. All four of these routines are designed to operate on two arguments stored in DAC and ARG, leaving the result in DAC. Addition and subtraction leave "garbage" in ARG. Multiplication leaves ARG unchanged. Division leaves in ARG what was in DAC.

Division is simple enough in concept, but no one would call it simple in implementation. "How many groups of X are in Y?" "If I deal an entire deck of 52 cards to 4 people, how many will each person get?" "If I scramble a dozen eggs, and serve them in equal-size portions to 7 people, how many eggs will each eat?" (Really, I am good cook!)

Suppose I have a pile of pennies, and want to find out how many dollars they represent. I will count out piles of 100 pennies, moving them into separate piles. Then I will count the little piles. Now, suppose I have two 18-digit numbers in my computer and want to divide the one in ARG by the one in DAC.... I will subtract the value in DAC from the one in ARG over and over, until I finally cross zero. Then if I was wise enough to count how many times I did the subtraction, I have the answer.

Let's look at the problem in more detail now. What I want to do is divide the value in ARG by the value in DAC:

numerator (in ARG) -------------------- = quotient (in DAC) denominator (in DAC)

Numbers in DP18 can be positive or negative, so we have to remember the rules of signed division. If the signs of the numerator and denominator are the same, the quotient will be positive; if they are different, the quotient will be negative.

Numbers in DP18 are coded as 18-digit fractions with a power- of-ten exponent. Remembering algebra:

.f * 10^m f --------- = --- * 10^(m-n) .g * 10^n g

The 18-digit fractions are normalized so that there are no leading zeroes. That is, the value will either be all zero, or it will be between .1 and .999999999999999999 (inclusive).

I think it is time now to start looking at the program. In the listing which follows there are references to subroutines and variables which we defined in the previous two installments of this series.

Line 4250 swaps the contents of ARG and DAC. I did it this way because it leaves something possibly useful in ARG after the division is finished. If you wanted to form the reciprocal quotient, DAC=DAC/ARG, you can enter at DDIVR, which skips the swapping step.

Lines 4260-4270 check for the illegal case of division by zero. If I divide something into zero-size parts, I get an infinite number of these parts. That's fine, but the DP18 has no representation for infinity; therefore we say it is illegal to divide by zero, just like Applesoft does. Some computers and some software arithmetic packages do represent infinity, but DP18 does not. Zero values are represented by having an exponent byte of zero, so we only have to check one byte here.

Lines 4280-4310 form the sign of the quotient. This is the same as lines 1280-1310 of the DMULT listing given last month, and so we could make them into a subroutine. The subroutine would take 10 bytes, and the two JSR's make another 6. That's 16 bytes, against the 18 bytes for the two versions of in-line code. Saves a total of 2 bytes, at a cost of adding 12 cpu cycles to both multiply and divide. (Small digression into the kind of trade-offs I am continually making....)

Lines 4330-4390 compute the exponent of the quotient, and check for overflow and underflow cases. The special case of the numerator being zero is also caught here, line 4350. Line 4380 restores the bias of $40. Bias? Remember, the exponent is kept in memory with $40 added to it, so that the range -63 through +63 is represented by $01 through $7F.

If the new exponent is still in the range $00 through $7F, we will go ahead and do the division. If not, the quotient is either too small (underflow) or too large (overflow). For example, 10^-40 / 10^40 results in 10^-80, which is too small for DP18. Lines 4410-4470 catch these cases, and change the quotient to zero. If the new exponent is between $80 and $BF, it represents 10^64 or larger, and so we call on the Applesoft OVERFLOW error.

Lines 4500-4550 set up the loop which does the actual division of the fractions. The 6502's decimal mode will be used during this loop. Ten bytes in MAC (defined in DMULT last month) will be used to hold the quotient until we are through with DAC. The X-register will be used to count out the 20 digits. The other end of the loop is in lines 4920-4930, where X is decremented and tested.

The body of the loop is really a lot simpler than it looks. Basically, ARG is subtracted from DAC until DAC goes negative. The number of subtractions is counted in MAC+9. Then ARG is added back to DAC to make it positive again, and MAC+9 decremented. The result is a quotient digit in MAC+9, and a remainder in DAC. One extra digit is needed, extending DAC on the left end. This digit is carried in the stack. See it pushed at line 4710, pulled at line 4790.

After each digit of the quotient is determined, both MAC and DAC are shifted left one digit place. This might shift a significant digit out of DAC (the remainder), so it is lifted out first and saved on the stack (lines 4570-4630). If the first two digits of the remainder (happen to be "00", then we know without subtracting that the quotient digit in this position will also be "0". (Remember that the leading digit of the denominator in ARG is NEVER zero.) This fact can speed up divisions, so it is tested for at line 4580, with lines 4670-4680.

After all 20 digits are formed, the loop terminates. Line 4950 then returns us to binary mode. Line 4960 adds one to the quotient exponent, adjusting for the normalization step. (.9/.1 = 9, but we want to represent it as .9*10^1.) If the exponent now is negative ($80), it may be still in range if the leading digit of the quotient is zero (.1/.9 = 0.1111...). This test takes place at lines 4970-5000.

Lines 5020-5060 copy the quotient from MAC to DAC. These are the same as lines 1330-1370 in DMULT, so they could be made into a subroutine. Two other candidates for subroutines are lines 4720-4780, which are identical to lines 1680-1740 of DADD (May 1984); and lines 4830-4890, which are the same as 1530-1590 of DADD.

Finally, DDIV exits by jumping to NORMALIZE.DAC.

Doesn't all this take a lot of time? You bet it does! I timed it in the full DP18 package with a program that looked like this:

&DP:INPUT X(0) : INPUT X(2) FOR I = 1 TO 100 &DP:X(4) = X(0)/X(2) NEXT

I determined the loop overhead by entering a value zero for X(0). Since this case skips around nearly everything in DDIV, I called its time the loop overhead time. After subtracting out the loop overhead, the times look like this:

0/anything 0 x/x 12 msec 1/9=.1111... 23 msec 8/9=.8888... 49 msec 1/7=.142857... 35 msec

It looks like the maximum time, which would be for a quotient with all 20 digits = 9, would be about 53 msec. The average time, about 35 msec. This compares with an average Applesoft 9-digit division time of about 7 msec.

1000 *SAVE S.DP18 DIVIDE 4220 *-------------------------------- 4230 * DAC = ARG / DAC 4240 *-------------------------------- 4250 DDIV JSR SWAP.ARG.DAC ...CHANGE TO DAC = DAC/ARG 4260 DDIVR LDA ARG.EXPONENT CHECK FOR ZERO DENOMINATOR 4270 BEQ .2 ...X/0 IS ILLEGAL 4280 *---FORM SIGN OF QUOTIENT-------- 4290 LDA DAC.SIGN 4300 EOR ARG.SIGN 4310 STA DAC.SIGN 4320 *---COMPUTE EXPONENT OF QUOTIENT- 4330 SEC 4340 LDA DAC.EXPONENT 4350 BEQ .0 ...0/X=0 4360 SBC ARG.EXPONENT 4370 CLC 4380 ADC #$40 ADJUST OFFSET 4390 STA DAC.EXPONENT 4400 *---CHECK OVER/UNDERFLOW--------- 4410 BPL .3 ...NEITHER 4420 ASL SEE WHICH... 4430 BPL .1 ...OVERFLOW 4440 .0 LDA #0 ...UNDERFLOW, SET RESULT = 0 4450 STA DAC.SIGN 4460 STA DAC.EXPONENT 4470 RTS 4480 .1 JMP AS.OVRFLW 4490 .2 JMP AS.ZRODIV DIVISION BY ZERO ERROR 4500 *---SET UP QUOTIENT LOOP--------- 4510 .3 SED DECIMAL MODE 4520 LDA #0 4530 STA MAC+9 CLEAR FIRST QUOTIENT DIGIT 4540 LDX #20 DO 20 DIGITS 4550 BNE .5 ...ALWAYS 4560 *---CONTINUE QUOTIENT LOOP------- 4570 .4 LDA DAC.HI 4580 PHP SAVE ZERO STATUS 4590 LSR 4600 LSR 4610 LSR 4620 LSR 4630 PHA DAC LEFT EXTENSION 4640 JSR SHIFT.DAC.LEFT.ONE 4650 JSR SHIFT.MAC.LEFT.ONE 4660 PLA DAC LEFT EXTENSION 4670 PLP SEE IF FIRST TWO DIGITS = 0 4680 BEQ .9 ...YES, SO QUOTIENT IS ALSO ZERO 4690 *---SUBTRACT WHILE POSSIBLE------ 4700 .5 INC MAC+9 COUNT 1 SUBTRACTION 4710 PHA DAC LEFT EXTENSION 4720 SEC DO A TRIAL SUB 4730 LDY #9 4740 .7 LDA DAC.HI,Y 4750 SBC ARG.HI,Y 4760 STA DAC.HI,Y 4770 DEY 4780 BPL .7 4790 PLA DAC LEFT EXTENSION 4800 SBC #0 4810 BCS .5 NO BORROW 4820 *---OVERSHOT, SO RESTORE--------- 4830 LDY #9 BORROW,SO ADD IT BACK IN 4840 CLC 4850 .8 LDA DAC.HI,Y 4860 ADC ARG.HI,Y 4870 STA DAC.HI,Y 4880 DEY 4890 BPL .8 4900 DEC MAC+9 BACK OFF QUOTIENT DIGIT, TOO 4910 *---NEXT DIGIT------------------- 4920 .9 DEX ALL DIGITS? 4930 BNE .4 ...NOT YET, KEEP GOING 4940 *---ADJUST EXP, CHECK OVERFLOW--- 4950 CLD BINARY MODE 4960 INC DAC.EXPONENT ADJUST FOR OFFSET 4970 BPL .10 ...NO OVERFLOW PROBLEM 4980 LDA MAC COULD BE OVERFLOW 4990 AND #$F0 5000 BNE .1 ...OVERFLOW 5010 *---COPY QUOTIENT TO DAC--------- 5020 .10 LDY #9 5030 .11 LDA MAC,Y 5040 STA DAC.HI,Y 5050 DEY 5060 BPL .11 5070 JMP NORMALIZE.DAC 5080 *-------------------------------- 5090 * SHIFT 20 DIGITS IN MAC RIGHT ONE PLACE 5100 *-------------------------------- 5110 SHIFT.MAC.LEFT.ONE 5120 LDY #4 5130 .1 ASL MAC+9 5140 ROL MAC+8 5150 ROL MAC+7 5160 ROL MAC+6 5170 ROL MAC+5 5180 ROL MAC+4 5190 ROL MAC+3 5200 ROL MAC+2 5210 ROL MAC+1 5220 ROL MAC 5230 DEY 5240 BNE .1 5250 RTS 5260 *-------------------------------- |

DP18 Input Conversion

The input conversion subroutine processes characters from memory to produce a value in DAC. This is analogous to what the equivalent subroutine in Applesoft ROMs does.

It is so analogous, in fact, that I even depend upon CHRGET and CHRGOT to fetch successive characters from memory. It is a lot faster than Applesoft conversion, however, because it is BCD coded rather than binary. This means that, stripping away the frills such as sign, exponent part, and decimal point, it even easier than an ASCII to hex conversion.

Of course, we need all those frills. Look ahead to the program listing which follows: Lines 1200-1220, just those three little lines, handle the conversion of digits. All the rest of the page is for frills! Well, to be honest about it, two of the three lines call subroutines, but still, the frills predominate.

The acceptable format of numbers is basically the same as that which normal Applesoft accepts. A leading sign is optional. The numeric portion can be more than 20 digits long, but only the first 20 will be accumulated (not counting leading zeroes). A decimal point is optional anywhere in the numeric portion. An exponent part can be appended to the numeric portion, and consists of the letter "E", and optional sign, and one or two digits. The exponent can be up to 81, just so the final number evaluates between .1*10^-63 and .9999...9*10^63. Numbers smaller than .1*10^-63 will be changed to zero, and numbers larger than .9999...9*10^63 will cause an OVERFLOW ERROR.

Looking at the program, lines 1040-1080 clear a working area which comprises DAC and four other variables: SGNEXP, EXP, DGTCNT, and DECFLG. SGNEXP will be used to hold the sign of the exponent part; EXP will hold the value of the exponent part; DGTCNT will count the digits in the numeric portion; and DECFLG will flag the occurrence of a decimal point. DAC includes DAC.SIGN. Note that the X-register will be left with $FF, which fact is important at line 1170 below.

Lines 1090-1100 preset the DAC.EXPONENT to $40, which indicates 10^0. This will be incremented along with DGTCNT until a decimal point is encountered.

Lines 1110-1180 handle the optional leading sign. DAC.SIGN has already been cleared above, indicating the positive case. If a minus sign is in front of the number, line 1170 sets DAC.SIGN negative. Note that calling CHRGOT and CHRGET to retrieve characters automatically eliminates (ignores) blanks. CHRGOT/CHRGET also checks whether the character retrieved is a digit or not, and indicates digits by carry clear. If the first non-blank character is a digit, we immediately jump to the numeric loop at line 1200. If not, the subroutine FIN.SIGN checks for a + or - character. The + or - may or may not be tokenized, depending on whether the string is from an INPUT statement or is a constant embedded in a program, so we have to check for both the character and the token form of both signs. FIN.SIGN handles this checking.

If that first character is neither a digit nor a sign, it may be a letter "E" or a decimal point; so, we go down to lines 1240-1270 to check for those two cases. If neither of these either, we must be at the end of the number. If it is a decimal point, lines 1630-1650 record the fact that a decimal point was found and also check whether this is the first one found or not. If the first, back we go to continue looking for digits. If not the first, it must be the end of the number, so we fall into the final processing section at line 1670.

Exponents are more difficult, because the value actually must be converted from ASCII to binary. Lines 1290-1610 do the work, including handling of the optional sign, and range checking.

Lines 1670-1730 compute the final exponent value. This is the number of digits before the decimal point (not counting any leading zeroes you may have typed to confuse me) plus the exponent computed in the optional "E" field. If the result is negative, between $C0 and $FF, it indicates underflow; in this case, the value is changed to zero. If there were no non-zero digits in the numeric portion, the value is set to zero regardless of any "E" field. If the resulting exponent is between $80 and $BF, it indicates OVERFLOW.

Lines 1840-2130 accumulate individual digits. DGTCNT is used to index into the nybbles of DAC, and the digit is stored directly into place. Leading zeroes on the numeric field are handled here (lines 2090-2120). Leading zeroes before a decimal point are entirely ignored, while leading zeroes after a decimal point cause the DAC.EXPONENT to be decremented. The incrementation of DAC.EXPONENT for each significant digit on the left of the decimal point is also taken care of here (lines 2020-2070).

This complete the third installment of DP18. We are well on the way to a working subset of the entire package. We still need output conversion and some sort of linkage to Applesoft before we can begin to see it all run. The entire DP18 package really exists, and works, now. It includes PRINT USING, very fancy input screen handling, full expression parsing, and all the math functions. Several of you have been very anxious to get the whole package for use in projects of your own, so we have offered a source code license to DP18 on an "as is" basis for only $200.

1000 *SAVE S.DP18 FIN 1010 *-------------------------------- 1020 * DP18 INPUT CONVERSION 1030 *-------------------------------- 1040 FIN LDA #0 CLEAR WORK AREA 1050 LDX #WRKSZ-1 (DAC, SGNEXP, EXP, 1060 .1 STA WORK,X DGTCNT, & DECFLG) 1070 DEX 1080 BPL .1 LEAVE X=$FF WHEN FINISHED 1090 LDA #$40 1100 STA DAC.EXPONENT 1110 *---HANDLE LEADING SIGN---------- 1120 JSR AS.CHRGOT 1130 BCC .2 IF DIGIT 0-9 1140 JSR FIN.SIGN ...SEE IF + OR - SIGN 1150 BNE .4 ...NEITHER + NOR - 1160 BCC .3 ...+ 1170 STX DAC.SIGN ...-, SET TO $FF 1180 BCS .3 ...ALWAYS 1190 *---GET DIGITS TILL NON-DIGIT---- 1200 .2 JSR ACCUMULATE.DIGIT 1210 .3 JSR AS.CHRGET GET NEXT CHARACTER 1220 BCC .2 ...DIGIT 1230 *---".", "E", OR END------------- 1240 .4 CMP #'. DECIMAL POINT? 1250 BEQ .9 YES 1260 CMP #'E LETTER E 1270 BNE .10 END OF NUMBER 1280 *---HANDLE EXPONENT FIELD-------- 1290 JSR AS.CHRGET 1300 BCC .6 ...DIGIT, ASSUME POSITIVE 1310 JSR FIN.SIGN ...SEE IF + OR - SIGN 1320 BNE .8 ...NEITHER + NOR - 1330 BCC .5 ...+ 1340 ROR SGNEXP ...-, SO SET SGNEXP NEGATIVE 1350 .5 JSR AS.CHRGET GET FIRST DIGIT OF EXP 1360 BCS .8 ...NO DIGITS! 1370 .6 AND #$0F ...ISOLATE EXP 1ST DIGIT 1380 STA EXP 1390 JSR AS.CHRGET GET 2ND DIGIT OF EXP, IF ANY 1400 BCS .8 ...NO MORE DIGITS 1410 AND #$0F ISOLATE 2ND DIGIT 1420 PHA SAVE ON STACK 1430 LDA EXP MULTIPLY 1ST DIGIT BY 10 1440 ASL 1450 ASL (CLEARS CARRY TOO) 1460 ADC EXP *5 1470 ASL *10 (CARRY STILL CLEAR) 1480 STA EXP ADD 2ND DIGIT 1490 PLA 1500 ADC EXP 1510 STA EXP 2 DIGIT EXP 1520 CMP #64+18 ALLOW .00000000000000001E+82 1530 BCS .7 OR 999999999999999999E-82 1540 JSR AS.CHRGET GET NEXT CHAR 1550 BCS .8 NO MORE DIGITS 1560 .7 JMP AS.OVRFLW OVERFLOW ERROR 1570 .8 ASL SGNEXP CHECK SIGN OF EXP 1580 BCC .10 ...POSITIVE 1590 LDA #0 ...NEGATIVE, SO COMPLEMENT EXP 1600 SBC EXP 1610 JMP .11 ...ALWAYS 1620 *---FOUND DECIMAL POINT---------- 1630 .9 ROR DECFLG SET DECIMAL POINT FLAG 1640 BIT DECFLG CHECK FOR TWO DECIMAL POINTS 1650 BVC .3 NO 1660 *---COMPUTE FINAL EXPONENT------- 1670 .10 LDA EXP GET EXPLICIT EXPONENT 1680 .11 CLC 1690 ADC DAC.EXPONENT 1700 LDX DGTCNT SEE IF ANY SIGNIFICANT DIGITS 1710 BNE .12 ...YES 1720 TXA ...NO, MAKE EXPONENT ZERO 1730 .12 STA DAC.EXPONENT 1740 TAX TEST RANGE OF EXPONENT 1750 BMI .13 ...NOT IN RANGE 0...7F 1760 RTS 1770 *---EITHER UNDER- OR OVER-FLOW--- 1780 .13 ASL UNDER, OR OVER? 1790 BCC .7 ...OVERFLOW 1800 LDA #0 1810 STA DAC.SIGN 1820 BEQ .12 ...ALWAYS 1830 *-------------------------------- 1840 ACCUMULATE.DIGIT 1850 AND #$0F ISOLATE DIGIT 1860 BEQ .4 ZERO DIGIT 1870 TAX SAVE DIGIT IN X-REG 1880 LDA DGTCNT NO MORE THAN 20 SIGNIFICANT DIGITS 1890 CMP #20 1900 BCS .2 DISCARD EXTRA DIGITS 1910 *---STORE THE DIGIT IN DAC------- 1920 LSR ODD/EVEN TO CARRY 1930 TAY INDEX TO Y-REG 1940 TXA GET DIGIT FROM X-REG 1950 BCS .1 ODD DIGIT ON RIGHT SIDE 1960 ASL EVEN DIGIT MUST BE SHIFTED 1970 ASL 1980 ASL 1990 ASL 2000 .1 ORA DAC.HI,Y MERGE 2010 STA DAC.HI,Y 2020 *---COUNT THE DIGIT-------------- 2030 .2 INC DGTCNT COUNT SIGNIFICANT DIGIT 2040 LDA DECFLG SEE IF IN FRACTION 2050 BMI .3 YES 2060 INC DAC.EXPONENT NO 2070 .3 RTS 2080 *---DIGIT = 0-------------------- 2090 .4 LDA DGTCNT SEE IF LEADING ZERO 2100 BNE .2 NO 2110 LDA DECFLG SEE IF PART OF FRACTION 2120 BPL .5 NO, COMPLETELY IGNORE IT 2130 DEC DAC.EXPONENT 2140 .5 RTS 2150 *-------------------------------- 2160 * SCAN + OR - SIGN 2170 * ------------------- 2180 * + .EQ., .CC. 2190 * - .EQ., .CS. 2200 * OTHER .NE. 2210 *-------------------------------- 2220 FIN.SIGN 2230 CMP #'-' 2240 BEQ .2 2250 CMP #TKN.MINUS 2260 BEQ .2 2270 CMP #'+' 2280 BEQ .1 2290 CMP #TKN.PLUS 2300 .1 CLC 2310 .2 RTS |

Building Label Tables for DISASM |
Bob Sander-Cederlof |

RAK-Ware's DISASM has the nice feature of being able to used a list of pre-defined labels when you are disassembling a block of code. I needed to turn the //c monitor ROM ($F800-$FFFF) into source code, and Apple sent me a list of all their labels in this area.

The format of the label table, or name table, is very simple. Each entry takes eight bytes: the first two are the value, high byte first; the remaining six are the label name, in ASCII with high bit set. If the name is less than six characters long, zeroes are used to fill out the entry.

Very simple to explain, but how do you enter things like that in the S-C Macro Assembler? The example on the DISASM disk does it this way:

1000 .HS FDED 1010 .AS -/COUT/ 1020 .HS 00000000 1030 .HS FDF0 1040 .AS -/COUT1/ 1050 .HS 000000 and so on.

That works, but it is so error prone and time wasting that I gave up before I started. However, there is an easy way using macros and abbreviations.

Start by defining a macro which will build one entry:

1000 .MA LBL 1010 .HS ]1 1020 .AS -/]2/ 1030 .BS *+7/8*8-* 1040 .EM

The macro is named LBL, and will be used like this:

1050 >LBL FDED,COUT 1060 >LBL FDF0,COUT1

Line 1030 is the tricky one. This .BS will pad out an entry to an even multiple of 8 bytes. Now, assuming the origin started at an even multiple of 8, and assuming you are writing the table on a target file, that macro builds the kind of entries DISASM wants. Instead of just assuming, lets add:

0900 .OR $4000 0910 .TF B.NAMETBL

I also mentioned abbreviations above. I even get tired of typing "tab>LBL ", you know. Usually when I have a lot of lines to type that have a common element, I use some special character that is easy to type and not present in the lines I plan to type. Then after all the lines are in, I use the REPLACE command to substitute the longer string for the single-character abbreviation I have used. Thus, I can type:

1050 .FDED,COUT 1060 .FDF0,COUT1 et cetera

and after many lines type

REP /./ >LBL /1050,A

I was about up to FA90 when it dawned on me that I could break the symbols into blocks within a page, and include the page value in my abbreviation:

1050 .ED,COUT 1060 .F0,COUT1 REP /./ >LBL FD/1050,A

With all these shortcuts, I was able to enter over 400 label names and definitions in less than an hour.

Let the computer work FOR you!

Quick Memory Testing |
Bob Sander-Cederlof |

What do you do when a friend brings his Apple over with an intermittent memory failure? You KNOW you have a memory test program somewhere, but WHERE?

Here is a quick way to test normal RAM between $7D0 and $BFFF. (RAM in //e hyperspace or banked into ROM space is another matter.) Turn on your friend's computer, and hit reset to abort the booting sequence. We don't need or want DOS around while we are testing memory. Type HOME and CALL-151 to get into the monitor. Then type the following monitor command:

*N 7D0:00 N 7D1<7D0.BFFEM 7D1<7D0.BFFEV 7D0:55 N 7D1<7D0.BFFEM 7D1<7D0.BFFEV 7D0:AA N 7D1<7D0.BFFEM 7D1<7D0.BFFEV 7D0:FF N 7D1<7D0.BFFEM 7D1<7D0.BFFEV 34:0

The "*" is the monitor prompt, so don't you type it. There are no carriage returns in the line above, it just wraps around the 40-column screen that way. There must be one trailing blank after the "34:0" at the end. This makes the monitor repeat the whole command line forever.

I started the test at $7D0 so there will be some visible feedback, but most of the screen will stay clear. If you begin testing at a lower address, any errors displayed on the screen might be overwritten as soon as they show up.

When you type the RETURN key you will see a line of inverse at-signs at the bottom of the screen. After a few seconds, this will change to flashing U. Then *, and then some other character, depending on what kind of Apple you have. Then the cycle will start over again.

Until a memory error is detected. Any error will cause two lines to be printed, giving the address before the error with its contents and the contents of the error byte, and the address of the error byte with its actual contents and should-be contents. For example, if you were in the "AA" phase, and $8123 came up with $AB, you would see:

8122-AA (AB) byte before error 8123-AB (AA) error byte

If any error lines start printing, note which bit is bad and which 16K bank it is in. Then you can point directly to the bad chip.

7 6 5 4 3 2 1 0 7D0...3FFF C10 C9 C8 C7 C6 C5 C4 C3 4000...7FFF D10 D9 D8 D7 D6 D5 D4 D3 8000...BFFF E10 E9 E8 E7 E6 E5 E4 E3

68000 Sieve Benchmark |
Peter J. McInerney New Zealand |

Here are two versions of the Sieve of Eratosthenes for the MC68000. They provide ample justification for the power claimed for this chip.

The first version is a fairly straightforward translation of the algorithm as presented in the November 1982 AAL, by Tony Brightwell. Tony's best time in the 6502 was 183 seconds for 1000 repetitions; in my 12.5 MHz DTACK GROUNDED attached processor, 1000 repetitions took only 40 seconds.

Compare the 68000 code with the 6502 code, and I'm sure you will agree the 68000 version is much easier to understand. Note the use of long instructions in the array clearing loop and the two-dimensional indexing in lines 1230 and 1310. Other nice things are the shift left by 3 (multiply by 8) in line 1270 and the decrement & branch instructions in lines 1120 and 1400. Also very useful is the postincrement address mode, which automatically increments the address kept in the referenced register by 1, 2, or 4 depending on the size of the operation. This is used for popping off (downward growing) stacks or as here to advance through memory. There is also a predecrement mode but I did not use it in these example programs.

The second version uses a modified algorithm. The changes I made should apply to the 6502 version also, improving it in about the same proportion.

- Since we are ignoring even numbers, we may as well leave them out of the array entirely, thus halving the array size.
- We can therefore simplify the formula for odd squares from S*8+1 to S*4.
- We can even do away with the *4 part by adding 4 each time rather than 1.
- The initial array clearing loop can be made faster by using more than one CLR instruction per loop.

This modified version does 1000 iterations in only 33 seconds! It is only slightly harder to follow than the first version, and only slightly larger. In fact, if we forego the final modification above, the code is actually shorter. I think most of the speedup comes from halving the array size.

If you have a Macintosh, and can manage to load machine code into it, you should find everything running about half as fast as my DTACK GROUNDED board.

[ We tried the program on our QWERTY Q-68 board, and it took roughly 10 times as long as Peter's DG board. Understandable, since it was using Apple memory at .5MHz rate for all work. (Bill&Bob) ]

1000 *SAVE SIEVE OF ERATOSTHENES.1 1010 *-------------------------------- 1020 * CODED BY PETER J. MCINERNEY, NEW ZEALAND 1030 *-------------------------------- 1040 .OR $3800 1050 ARRAY .EQ $4000 1060 *-------------------------------- 1070 SIEVE MOVE #999,D6 DO 1000 TIMES 1080 *---CLEAR WORKING ARRAY---------- 1090 .1 MOVE #ARRAY,A0 CLEAR ARRAY FROM 1100 MOVE #$FFF,D0 $4000 TO $7FFF 1110 .2 CLR.L (A0)+ 1120 DBF D0,.2 1130 *---INIT VARIABLES--------------- 1140 MOVEQ #3,D0 START AT 3 1150 MOVEQ #1,D1 SUM OF ODD NUMBERS 1160 MOVEQ #1,D2 COUNT OF ODD NUMBERS 1170 MOVEQ #1,D3 USED FOR STRIKING NON-PRIMES 1180 MOVE #ARRAY,A0 START OF ARRAY 1190 BRA.S .4 JUMP INTO LOOP 1200 *---START SIFTING---------------- 1210 .3 ADDQ #1,D2 COUNT ODD NUMBERS 1220 ADD D2,D1 GET SUM OF ODDS 1230 .4 CMPI.B #0,0(A0,D0) IS THIS A PRIME? 1240 BNE.S .6 NO 1250 *---STRIKE OUT MULTIPLES--------- 1260 MOVE D1,D4 GET 8*S+1 = N*N 1270 ASL #3,D4 1280 ADDQ #1,D4 1290 MOVE D0,D5 ONLY STRIKE ODD MULTIPLES 1300 ASL #1,D5 1310 .5 MOVE.B D3,0(A0,D4) STRIKE ONE 1320 ADD D5,D4 NEXT STRIKE 1330 CMPI #$4000,D4 ...FINISHED? 1340 BLS .5 ...NO 1350 *---GET NEXT SIEVE SIZE---------- 1360 .6 ADDQ #2,D0 NEXT ODD NUMBER 1370 CMPI #127,D0 UNTIL SQRT $4000-1 1380 BLS .3 1390 *---DO IT ALL 1000 TIMES--------- 1400 DBF D6,.1 NEXT TIME 1410 RTS 1000 *SAVE SIEVE OF ERATOSTHENES.2 1010 *-------------------------------- 1020 * CODED BY PETER J. MCINERNEY, NEW ZEALAND 1030 *-------------------------------- 1040 .OR $3800 1050 ARRAY .EQ $4000 1060 *-------------------------------- 1070 SIEVE MOVE #999,D6 DO 1000 TIMES 1080 *---CLEAR WORKING ARRAY---------- 1090 .1 MOVE #ARRAY,A0 CLEAR ARRAY FROM 1100 MOVE #$FF,D0 $4000 TO $7FFF 1110 .2 CLR.L (A0)+ 1120 CLR.L (A0)+ 1130 CLR.L (A0)+ 1140 CLR.L (A0)+ 1150 CLR.L (A0)+ 1160 CLR.L (A0)+ 1170 CLR.L (A0)+ 1180 CLR.L (A0)+ 1190 DBF D0,.2 1200 *---INIT VARIABLES--------------- 1210 MOVEQ #3,D0 START AT 3 1220 MOVEQ #4,D4 CORRESPONDS TO 9 1230 MOVEQ #4,D2 DELTA 1240 MOVEQ #1,D3 USED FOR STRIKING NON-PRIMES 1250 MOVE #ARRAY+1,A0 POSITION OF 3 1260 MOVE #ARRAY,A1 START OF ARRAY 1270 BRA.S .4 JUMP INTO LOOP 1280 *---START SIFTING---------------- 1290 .3 ADDQ #4,D2 UPDATE DIFFERENCE 1300 ADD D2,D4 UPDATE SQUARE POINTER 1310 .4 CMPI.B #0,(A0)+ IS THIS A PRIME? 1320 BNE.S .6 NO 1330 *---STRIKE OUT MULTIPLES--------- 1340 MOVE D4,D5 GET LATEST SQUARE 1350 .5 MOVE.B D3,0(A1,D5) STRIKE ONE 1360 ADD D0,D5 NEXT STRIKE 1370 CMPI #$2000,D5 ...FINISHED? 1380 BLS .5 ...NO 1390 *---GET NEXT SIEVE SIZE---------- 1400 .6 ADDQ #2,D0 NEXT ODD NUMBER 1410 CMPI #127,D0 UNTIL SQRT $4000-1 1420 BLS .3 1430 *---DO IT ALL 1000 TIMES--------- 1440 DBF D6,.1 NEXT TIME 1450 RTS |

Updating the 6502 Prime Sifter |
Bob Sander-Cederlof |

I spent a half day applying Peter's algorithm improvements to the November 1982 6502 version, and refining the program as much as I could. It now runs in 175 milliseconds per iteration, or 1000 iterations in 175 seconds. Still way behind the 68000, of course. On the other hand, a 6MHz 6502, with fast enough RAM for no wait states, would be faster than a 12.5 MHz 68000. And it remains to be seen what the 65802 could do.

In the process of running various versions and various tests, I discovered that the innermost loop, at lines 1820-1850, is executed 10277 times. This means that, while marking out the odd non-primes between 1 and 16383, a total of 10277 such marks are made. Since only odd numbers are assigned slots in the working array, giving only 8192 such slots, you can see that some numbers get stricken more than once. These are the numbers with more than one prime factor. The most-stricken number is 3*5*7*11*13 = 15015, which gets five strikes. The loop takes 11 cycles as written, and I don't see any way to shorten it any further or to reduce the number of times it is used. Do you?

The loop time is 11*10277 is 121297 cycles, or about 120 msec out of the total 175. The array clearing accounts for another 41 msecs, leaving only 14 msec for all the rest of the program. Not bad!

Here is a little Applesoft program which will make a nice neat listing of primes from the working array, assuming it runs from $6000 through $7FFF.

100 HIMEM:24576 110 FOR A = 24576 TO 32767 120 IF PEEK (A) = 0 THEN PRINT RIGHT$(" "+STR$((A-24576)*2+1,7);: N = N + 1 130 IF N = 10 THEN PRINT : N = 0 140 NEXT

1000 .LI MOFF 1010 *SAVE S.SUPER-FAST PRIMES IMPROVED 1020 .OR $8000 SAFELY OUT OF WAY 1030 *-------------------------------- 1040 BASE .EQ $6000 BASE OF PRIME ARRAY 1050 BEEP .EQ $FF3A BEEP THE SPEAKER 1060 SQZZZZ .EQ 0,1 1070 START .EQ 2 1080 COUNT .EQ 4,5 1090 *-------------------------------- 1100 .MA ZERO 1110 STA ]1+$000,X 1120 STA ]1+$100,X 1130 STA ]1+$200,X 1140 STA ]1+$300,X 1150 STA ]1+$400,X 1160 STA ]1+$500,X 1170 STA ]1+$600,X 1180 STA ]1+$700,X 1190 .EM 1200 *-------------------------------- 1210 * MAIN CALLING ROUTINE 1220 * 1230 MAIN LDA #-100 DO 1000 TIMES SO WE CAN MEASURE 1240 STA COUNT THE TIME IT TAKES 1250 LDA /-100 1260 STA COUNT+1 1270 JSR BEEP ANNOUNCE START 1280 .1 JSR PRIME 1290 INC COUNT 1300 BNE .1 1310 INC COUNT+1 1320 BNE .1 1330 JMP BEEP SAY WE'RE DONE 1340 *-------------------------------- 1350 * PRIME ROUTINE 1360 * SETS ARRAY STARTING AT BASE 1370 * TO $FF IF NUMBER IS NOT PRIME 1380 * CHECKS ONLY ODD NUMBERS > 3 1390 * INC = INCREMENT OF KNOCKOUT 1400 * N = KNOCKOUT VARIABLE 1410 *-------------------------------- 1420 PRIME 1430 LDX #0 1440 TXA CLEAR WORKING ARRAY 1450 .1 >ZERO BASE 1460 >ZERO BASE+$0800 1470 >ZERO BASE+$1000 1480 >ZERO BASE+$1800 1490 INX 1500 BNE .1 NOT FINISHED CLEARING 1530 *-------------------------------- 1540 LDA /BASE+4 POINT AT FIRST PRIME-SQUARED 1550 STA SQZZZZ+1 (WHICH IS 3*3=9) 1560 LDA #BASE+4 1570 STA SQZZZZ 1580 LDA #1 POINT AT FIRST PRIME (3) 1590 BNE .4 ...ALWAYS 1600 *-------------------------------- 1610 .2 TXA 1620 ASL 1630 ASL 1640 ADC SQZZZZ 1650 STA SQZZZZ 1660 BCC .3 1670 INC SQZZZZ+1 1680 .3 LDA BASE,X GET A POSSIBLE PRIME 1690 BNE .8 THIS ONE HAS BEEN KNOCKED OUT 1700 TXA 1710 *-------------------------------- 1720 .4 STA START 1730 ASL INC = START*2 + 1 1740 ADC #1 1750 STA .7+1 1760 LDA SQZZZZ+1 MOVE MULT TO N 1770 STA .6+2 1780 LDA SQZZZZ 1790 .5 TAX 1800 BEQ .9 ...SPECIAL CASE FOR X=0 1810 *---STRIKE OUT MULTIPLES--------- 1820 .6 STA $FF00,X REMEMBER THAT N IS REALLY AT .6+2 1830 .7 ADC #*-* N = N + INC 1840 TAX 1850 BCC .6 DONT'T BOTHER TO ADD, NO CARRY 1860 CLC 1870 INC .6+2 INC HIGH ORDER 1880 BPL .5 ...NOT FINISHED 1890 *-------------------------------- 1900 LDX START GET OUR NEXT KNOCKOUT 1910 .8 INX POINT AT NEXT ODD NUMBER 1920 CPX #64 UP TO 127 1930 BCC .2 WE'RE DONE IF X>127 1940 RTS 1950 *-------------------------------- 1960 .9 LDA .6+2 1970 STA .10+2 1980 .10 STA $FF00 1990 TXA 2000 BEQ .7 ...ALWAYS 2010 *-------------------------------- |

Sorting and Swapping |
Bob Sander-Cederlof |

Jack McDonald, writing in the July 1984 Software News, posed a puzzle for programmers: using nothing more than a series of calls to a SWAP, sort five items into ascending order. SWAP compares two items according to the indexes supplied, and exchanges the items if they are out of order. For example, calls on SWAP which follow the pattern of a "Bubble Sort" would look like this:

SWAP (1,2) SWAP (1,2) SWAP (1,2) SWAP (1,2) SWAP (2,3) SWAP (2,3) SWAP (2,3) SWAP (3,4) SWAP (3,4) SWAP (4,5)

That is ten swaps, which is more than necessary. You can do it in nine, which was McDonalds Puzzle. He gave an answer, and I found another. It was fun writing some quick code to test various swap-lists.

First I wrote a macro named "S" which loaded the two index numbers into X and Y, and called a subroutine named SWAP. See it in lines 1030-1070.

Then I coded SWAP (lines 1200-1290), which compared two bytes at BASE,X and BASE,Y; if they were out of order, I swapped them around. To make things easy for me, I put BASE at $500, which just happens to be the third line on the video screen. That way I could watch everything happen without struggling to code I/O routines.

I wrote a program which would initialize a 5-byte string to all $01 (no program, really just a data definition at line 1670); another which copies the string to BASE (LOAD, lines 1590-1650); another which counts up from 0101010101 to 0505050505, so that all possible combinations would be run through (NEXT, lines 1770-1870); and another to do all these in connection with SORT, which performed a list of SWAP calls. The result was a method for visualizing and checking various groups of SWAPs to see if they could sort any initial permutation into ascending order. Assemble, and type MGO NEXT to see it all work.

Here is the code, with two possible SWAP orders which work, of nine steps each.

1000 *SAVE S.SWAP AND SORT 1010 .LIST MOFF,CON 1020 *-------------------------------- 1030 .MA S 1040 LDX #]1 1050 LDY #]2 1060 JSR SWAP 1070 .EM 1080 *-------------------------------- 1090 .MA INC 1100 INC PERM+]1 1110 LDA PERM+]1 1120 CMP #6 1130 BCC :1 1140 LDA #1 1150 STA PERM+]1 1160 :1 1170 .EM 1180 *-------------------------------- 1190 * SWAP (X,Y) 1200 *-------------------------------- 1210 SWAP LDA BASE,X 1220 CMP BASE,Y 1230 BCC .1 1240 PHA 1250 LDA BASE,Y 1260 STA BASE,X 1270 PLA 1280 STA BASE,Y 1290 .1 RTS 1300 *-------------------------------- 1310 * SORT BY SWAPS 1320 *-------------------------------- 1330 SORT 1340 .DO 0 CHANGE TO 1 TO SELECT MCDONALD'S LIST 1350 >S 4,5 MCDONALD'S ORDER 1360 >S 3,5 1370 >S 3,4 1380 >S 1,2 1390 >S 1,4 1400 >S 1,3 1410 >S 2,5 1420 >S 2,4 1430 >S 2,3 1440 .ELSE 1450 >S 1,4 MY ORDER 1460 >S 2,5 1470 >S 1,3 1480 >S 3,5 1490 >S 2,4 1500 >S 1,2 1510 >S 2,3 1520 >S 3,4 1530 >S 4,5 1540 .FIN 1550 RTS 1560 *-------------------------------- 1570 BASE .EQ $500 1580 *-------------------------------- 1590 LOAD LDX #5 COPY PERM LIST TO BASE ON SCREEN 1600 .1 LDA PERM,X 1610 STA BASE,X 1620 STA BASE+128,X 1630 DEX 1640 BNE .1 1650 RTS 1660 *-------------------------------- 1670 PERM .HS 000101010101 1680 *-------------------------------- 1690 CHECK LDX #4 CHECK IF LIST IS SORTED 1700 .1 LDA BASE+1,X 1710 CMP BASE,X 1720 BCC .2 1730 DEX 1740 BNE .1 1750 .2 RTS 1760 *-------------------------------- 1770 NEXT >INC 5 INCREMENT PERM LIST 1780 BCC .1 EACH BYTE RANGES FROM 1790 >INC 4 01 TO 05 1800 BCC .1 1810 >INC 3 1820 BCC .1 1830 >INC 2 1840 BCC .1 1850 >INC 1 1860 BCC .1 1870 RTS FINISHED 1880 .1 JSR LOAD COPY PERMLIST TO SCREEN 1890 JSR SORT SORT IT ON THE SCREEN 1900 JSR CHECK CHECK IF SORTED 1910 BCS NEXT ...SORTED, TRY NEXT SEQUENCE 1920 RTS ...NOT SORTED 1930 *-------------------------------- |

I also got interested in permutation generation, and came up with the following macros and code to generate all 120 permutations of five items, without any extra steps, each step being the simple interchange of two items. Assemble, and type MGO PERMUTE to see it generate 120 strings of the letters ABCDE in different arrangements.

1940 .MA SS 1950 LDX #]1 1960 LDY #]2 1970 JSR EXCHANGE 1980 .EM 1990 *-------------------------------- 2000 EXCHANGE 2010 LDA PERM,X 2020 PHA 2030 LDA PERM,Y 2040 STA PERM,X 2050 PLA 2060 STA PERM,Y 2070 LDX #1 2080 .1 LDA PERM,X 2090 ORA #$C0 2100 JSR $FDED 2110 INX 2120 CPX #6 2130 BCC .1 2140 LDA #$A0 2150 JSR $FDED 2160 RTS 2170 *-------------------------------- 2180 .MA S3 2190 >SS 1,2 2200 >SS 1,3 2210 >SS 1,2 2220 >SS 1,3 2230 >SS 1,2 2240 .EM 2250 *-------------------------------- 2260 .MA S4 2270 >S3 2280 JSR $FD8E 2290 >SS 1,4 2300 >S3 2310 JSR $FD8E 2320 >SS 2,4 2330 >S3 2340 JSR $FD8E 2350 >SS 3,4 2360 >S3 2370 JSR $FD8E 2380 .EM 2390 *-------------------------------- 2400 PERMUTE 2410 LDX #5 2420 .1 TXA 2430 STA PERM,X 2440 DEX 2450 BNE .1 2460 *-------------------------------- 2470 >SS 1,1 2480 >S4 2490 >SS 1,5 2500 >S4 2510 >SS 1,5 2520 >S4 2530 >SS 1,5 2540 >S4 2550 >SS 1,5 2560 >S4 2570 *-------------------------------- 2580 RTS |

Our //c came in,and we love it. However... |
Bob Sander-Cederlof |

The //c package does not include any DOS 3.3 master. Everything is ProDOS. Of course you do get a DOS 3.3 with most software you purchase. And of course ProDOS includes a disk copier that is supposed to be able to copy DOS 3.3 disks when you need to back up your DOS-based software. However...

The ProDOS disk copier which is being shipped with the //c has a serious bug. When you are copying a DOS-based disk it ignores the volume number recorded on the source disk, and forces the copy to be volume 254. That is fine if the source just happened to be volume 254 also, but chances are it isn't. I have many disks around here which are volume 1. The DOS image and the VTOC both think the disk copied by //c ProDOS is volume 1, but RWTS discovers it is volume 254 and refuses to cooperate any further.

I guess the solution is to use the old faithful COPYA from your DOS 3.3 System Master. Since that doesn't come with a //c system, we are including licensed copies of COPYA and FID on our Macro 1.1 disks now.

More gotchas.... Apple decided it was time to rewrite large chunks of the monitor. Necessarily so, because the disassembler now has to cope with 27 new opcodes and address modes. They removed four entries from the monitor command table, and changed its starting point. This throws off the "$" command in the S-C Macro Assemblers, all versions.

If you have Macro 1.1, the //e version is the one you should be running in your //c. You can fix the "$" command with these patches:

$1000 $D000 old new version version value value ------- ------- ----- ----- $147B $D47B $17 $13 $1486 $D486 $CC $CD $148B $D48B $15 $11

A more elegant patch is possible, which automatically adjusts for whether you are in a //e or //c. If you want this, and have a 1.1 version prior to serial # 675, send us $5 for an update.

We have tried RAK-Ware's DISASM 2.2e on our //c, and it works fine. It even picks up the 27 new opcodes and address modes automatically, because DISASM links to the monitor disassembler. Older versions of DISASM will not run on a //e or //c.

Orphans and Widows |
Bob Sander-Cederlof |

James, a brother of Jesus Christ, wrote: "Pure religion and undefiled before God and the Father is this, to visit the fatherless and widows in their affliction, and to keep himself unspotted from the world." (chapter 1, verse 27, King James Version)

Of course, he was referring to real life and to real people with real needs, but it still serves to introduce this little announcement.

"Orphans" and "widows" are also terms used in word processing to describe the lamentable situation of one line of a paragraph being left all alone on one page, while the rest is on another page. If that one line is the last line of a paragraph which won't quite fit, "she" is forced to the top of the next page, and is a widow. If the lonely line is the first line of a paragraph, dwelling at the bottom of a page, bereft of the rest of its family on the following page, he or she is indeed an orphan.

High class word processors give you the option of automatically "visiting" orphans and widows "in their affliction". Thanks to Bobby Deen, this feature is now (as of June 29th) included in the S-C Word Processor (whether high class or not). When the feature is selected (by the "!or1" directive), orphans get moved to the next page and widows get squeezed onto the current page.

Bobby is also working on, and he says it is now functional but somewhat unfinished, a version that fully uses the 80-column display on the Apple //e. We already had 80-column preview, but he is developing 80-column text display during edit/entry mode.

Speed vs. Space |
Bob Sander-Cederlof |

There are always tradeoffs. If you have plenty of memory, you can write faster code. If you have plenty of time, you can write smaller code. In an "academic" situation you may have plenty of both, so you can write "creative" code, stretching the frontiers of knowledge. In a "real" world it seems there is never enough time or memory, so jobs have to be finished on a very short schedule, fit in a tiny ROM or RAM, and run like greased lightning.

A case in point is last month's segment of the DP18 series: the SHIFT.MAC.RIGHT.ONE subroutine on page 8 takes about 1827 clock cycles, and fits in 25 bytes. Upon reflection, I see a way to write a 34-byte version that takes only 1029 cycles. If I can use nine more bytes, I can shave about 800 microseconds off each and every multiply. (Maybe a total of a whole minute per day!) That might be important, or it might not; but seeing the two techniques side-by-side is probably valuable.

1970 SHIFT.MAC.RIGHT.ONE 1980 LDY #4 4 BITS RIGHT 1990 .0 LDX #1 20 BYTES 2000 LSR MAC 2010 .1 ROR MAC,X 2020 INX NEXT BYTE 2030 PHP 2040 CPX #20 2050 BCS .2 NO MORE BYTES 2060 PLP 2070 JMP .1 2080 .2 PLP 2090 DEY NEXT BIT 2100 BNE .0 2110 RTS 1970 SHIFT.MAC.RIGHT.ONE 1980 LDX #0 FOR X=0 TO 19 1990 TXA NEW 1ST NYBBLE = 0 2000 .1 STA TEMP SAVE FOR HI NYBBLE 2010 LDA MAC,X MOVE LOW NYBBLE 2020 ASL TO HI SIDE 2030 ASL 2040 ASL 2050 ASL 2060 PHA SAVE ON STACK 2070 LDA MAC,X MOVE HI NYBBLE 2080 LSR TO LOW SIDE 2090 LSR 2100 LSR 2110 LSR 2120 ORA TEMP MERGE WITH NEW 2130 STA MAC,X HI NYBBLE 2140 PLA HI NYBBLE OF NEXT BYTE 2150 INX NEXT X 2160 CPX #20 2170 BCC .1 2180 RTS |

The smaller method uses two nested loops. The inner loop shifts all 20 bytes of MAC right one bit. The outer loop does the inner loop four times. If I counted cycles correctly, the time is 4*(19*23+18)+7. The faster method uses one loop to scan through the twenty bytes one time. The timing works out as 20*51+9.

Upon still further reflection, it dawned on me that a 38 byte version could run in 840 cycles! This version processes the bytes from right to left instead of left to right; eliminates the PHA-PLA and STA-ORA TEMP of the second version above; and loops only 19 times rather than 20. The timing is 19*43+23.

1970 SHIFT.MAC.RIGHT.ONE 1980 LDX #19 FOR X = 19 TO 1 STEP -1 1990 .1 LDA MAC,X SHIFT HI- TO LO- 2000 LSR 2010 LSR 2020 LSR 2030 LSR 2040 STA MAC,X SAVE IN FORM 0X 2050 LDA MAC-1,X GET LO- OF HIGHER BYTE 2060 ASL 2070 ASL 2080 ASL 2090 ASL 2100 ORA MAC,X MERGE THE NYBBLES 2110 STA MAC,X 2120 DEX NEXT X 2130 BNE .1 ...UNTIL 0 2140 LDA MAC PROCESS HIGHEST BYTE 2150 LSR INTRODUCE LEADING ZERO 2160 LSR 2170 LSR 2180 LSR 2190 STA MAC 2200 RTS |

Of course an even faster approach is to emulate the loops I wrote for shifting 10-bytes left or right 4-bits. The program would look like this:

1970 SHIFT.MAC.RIGHT.ONE 1980 LDY #4 1990 .1 LSR MAC 2000 LSR MAC+1 . . . 2180 LSR MAC+19 2190 DEY 2200 BNE .1 2210 RTS |

This version takes 2+3*20+4 = 66 bytes. Yet the timing is only (4*6+5)*20+7 = 587 clock cycles. And by writing out the four loops all the way, we use 4*3*20 = 240 bytes; the time would be 4*6*20 or 480 cycles. How about another example? The MULTIPLY.ARG.BY.N subroutine on the same page last month was nice and short, but very slow. The subroutine is called once for each non-zero digit in the multiplier, or up to 20 times. What it does is add the multiplicand to MAC the number of times corresponding to the current multplier digit. If we assume the distribution of digits is random, with equal probablility for any digit 1...9 in any position, the average number of adds will be 5. Actually there will be zero digits too, so the average will be 4.5 instead of 5, with the subroutine not even being called for zero digits.

For 20 digits, 4.5 addition loops per digit, that is an average of 90 addition loops. And a maximum, when all digits are 9, of 180 addition loops.

Now, if there is enough RAM around, we can pre-calculate all partial products from 1 to 9 of the multiplicand and save them in a buffer area. Each partial product will take 11 bytes. We already have the first one in ARG, so for 2...9 we will need 8*11 or 88 bytes of storage. It will take 8 addition loops to form these partial products. Once they are all stored, the MULTIPLY.ARG.BY.N subroutine will always do exactly one addition loop no matter what the non-zero digit is. Therefore the maximum number of addition loops is 8+20 or 28, compared to 180! And the average (assuming there will be 2 zero digits out of 20 on the average) will be 26 addition loops.

The inner loop in MULTIPLY.ARG.BY.N, called "addition loop" above, takes 20 cycles. If we implement this new method, we will have shortened the average case from 1800 to 520 cycles, and the maximum from 3600 to 560 cycles. Of course the whole DMULT routine includes more time-consuming code, but this subroutine was the biggest factor. Taking the SHIFT.MAC.RIGHT.ONE improvements also, we have shortened the overall time in the average case by 2078 cycles, or 2 milliseconds per multiply. In the maximum case, the savings is nearly 4 milliseconds.

Of course, it takes more code space as well as the 88-byte partial product buffer for the new method. And it will take more time to write such a program. You have to make tradeoffs.

1000 *SAVE S.DP18 FASTER MULTIPLY 1010 *-------------------------------- 1020 * DAC = ARG * DAC 1030 *-------------------------------- 1040 DMULT LDA DAC.EXPONENT IF DAC=0, EXIT 1050 BEQ .3 1060 LDA ARG.EXPONENT IF ARG=0, SET DAC=0 AND EXIT 1070 BEQ .4 1080 *---CLEAR RESULT REGISTER-------- 1090 LDA #0 1100 LDY #19 1110 .1 STA MAC,Y 1120 DEY 1130 BPL .1 1140 *---FORM PRODUCT OF FRACTIONS---- 1150 JSR MULTIPLY.BY.LOW.DIGITS 1160 JSR SHIFT.MAC.RIGHT.ONE 1170 JSR SHIFT.DAC.RIGHT.ONE 1180 JSR MULTIPLY.BY.LOW.DIGITS 1190 *---ADD THE EXPONENTS------------ 1200 LDA DAC.EXPONENT 1210 CLC 1220 ADC ARG.EXPONENT 1230 CMP #$C0 CHECK FOR OVERFLOW 1240 BCS .5 ...OVERFLOW 1250 SBC #$3F ADJUST OFFSET 1260 BMI .4 ...UNDERFLOW 1270 STA DAC.EXPONENT 1280 *---FORM SIGN OF PRODUCT--------- 1290 LDA DAC.SIGN 1300 EOR ARG.SIGN 1310 STA DAC.SIGN 1320 *---MOVE MAC TO DAC-------------- 1330 LDY #9 1340 .2 LDA MAC,Y 1350 STA DAC.HI,Y 1360 DEY 1370 BPL .2 1380 *---NORMALIZE DAC---------------- 1390 JSR NORMALIZE.DAC 1400 LDA MAC IF LEADING DIGIT=0, 1410 AND #$F0 THEN GET ANOTHER DIGIT 1420 BNE .3 1430 LDA MAC+10 1440 LSR 1450 LSR 1460 LSR 1470 LSR 1480 ORA DAC.HI+9 1490 STA DAC.HI+9 1500 .3 RTS 1510 .4 LDA #0 1520 STA DAC.SIGN 1530 STA DAC.EXPONENT 1540 RTS 1550 .5 JMP AS.OVRFLW 1560 *-------------------------------- 1570 * MULTIPLY BY EVERY OTHER DIGIT 1580 *-------------------------------- 1590 MULTIPLY.BY.LOW.DIGITS 1600 SED DECIMAL MODE 1610 LDX #9 1620 LDY #19 1630 .1 LDA DAC.HI,X 1640 AND #$0F ISOLATE NYBBLE 1650 BEQ .2 0, SO NEXT DIGIT 1660 JSR MULTIPLY.ARG.BY.N 1670 .2 DEY NEXT MAC POSITION 1680 DEX NEXT DAC DIGIT 1690 BPL .1 DO NEXT DIGIT 1700 CLD BINARY MODE 1710 RTS DONE 1720 *-------------------------------- 1730 MULTIPLY.ARG.BY.N 1740 STA DIGIT N = 1...9 1750 STY TEMP SAVE Y 1760 STX TEMP+1 SAVE X 1770 .1 LDX #9 INDEX INTO ARG 1780 CLC 1790 .2 LDA ARG.HI,X 1800 ADC MAC,Y ADD IT 1810 STA MAC,Y 1820 DEY NEXT MAC 1830 DEX NEXT ARG 1840 BPL .2 NEXT DIGIT 1850 BCC .4 NO CARRY 1860 .3 LDA #0 PROPAGATE CARRY 1870 ADC MAC,Y 1880 STA MAC,Y 1890 DEY 1900 BCS .3 MORE CARRY 1910 .4 LDY TEMP GET POSITION IN MAC 1920 .5 DEC DIGIT NEXT DIGIT 1930 BNE .1 1940 LDX TEMP+1 1950 RTS DONE 1960 *-------------------------------- 1970 SHIFT.MAC.RIGHT.ONE 1980 LDY #4 4 BITS RIGHT 1990 .0 LDX #1 20 BYTES 2000 LSR MAC 2010 .1 ROR MAC,X 2020 INX NEXT BYTE 2030 PHP 2040 CPX #20 2050 BCS .2 NO MORE BYTES 2060 PLP 2070 JMP .1 2080 .2 PLP 2090 DEY NEXT BIT 2100 BNE .0 2110 RTS 2120 *-------------------------------- |

Apple Assembly Line is published monthly by S-C SOFTWARE CORPORATION, P.O. Box 280300, Dallas, Texas 75228. Phone (214) 324-2050. Subscription rate is $18 per year in the USA, sent Bulk Mail; add $3 for First Class postage in USA, Canada, and Mexico; add $12 postage for other countries. Back issues are available for $1.80 each (other countries add $1 per back issue for postage).

All material herein is copyrighted by S-C SOFTWARE, all rights reserved. Unless otherwise indicated, all material herein is authored by Bob Sander-Cederlof. (Apple is a registered trademark of Apple Computer, Inc.)