Apple Assembly Line
Volume 6 -- Issue 4 January 1986

In This Issue...

New Low Diskette Price

When we first started offering blank diskettes for sale, back in May of 1981, we were able to sell them for the then below retail price of only $50 for 20 diskettes. There have been quite a few changes in this industry in the last five years, and prices have continued to fall. We have a new supplier and can now offer you quality diskettes for only $20 per package of 20, a reduction of 37% since last month.

New Monthly Disks

Since diskette prices have fallen so far, we are now planning to send out disks containing the source code from AAL on a monthly basis, instead of quarterly as we have in the past. See the note on page 11 for the details.


Sears Financial Network is launching a new credit card this year, called Discover. They offer lower interest rates and no yearly charge to the consumer, and better rates to the merchants as well, so we are pleased to be able to accept this card now. You can use it just like any other card for your phone and mail orders.

Convert Lo-Res Pictures to Hi-Res Bob Sander-Cederlof

Most Apple dot-matrix printer interfaces now include the firmware to print hi-res graphics pictures directly from a screen image. However, most do not provide any way to print lo-res graphics pictures. With the program presented here you can convert a lo-res graphics image into a hi-res picture, ready to be printed by your interface firmware.

Even if you don't ever plan to do such a thing, there are some neat coding tricks in the following program, which you might be able to apply to other hi-res programs.

Lines 1070-1200 demonstrate the use of my lo- to hi-res conversion program. Notice that I started with the label "T". I find I am using that label all the time, lately. I think I started using it as a short mnemonic for "TEST". It is convenient, because in the S-C Macro Assembler environment I can test the program I just assembled by typing "MGO" and the label I want to start at. I find my fingers can now type "MGOT" without my brain even realizing it happened.

The first thing my demo does is call PLOT to create a lo-res picture. I didn't have any real lo-res art around, so I simply drew a 4-by-4 pattern showing all 16 lo-res colors. PLOT fills 16 (4x4) pixels with color 0, 16 with color 1, and so on:

lo-res   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

  0-3    0  0  0  0  4  4  4  4  8  8  8  8  C  C  C  C
  4-7    1  1  1  1  5  5  5  5  9  9  9  9  D  D  D  D
 8-11    2  2  2  2  6  6  6  6  A  A  A  A  E  E  E  E
12-15    3  3  3  3  7  7  7  7  B  B  B  B  F  F  F  F

The rest of the lo-res screen I did not change, so it will normally show the lo-res equivalent of whatever text was on the screen. Of course if you were really trying to use my CONVERT program you would draw your real lo-res picture.
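As a cross-check on the chart above, here is a minimal Python sketch that builds the same 16-by-16 block of color numbers. Python is not part of the original article and the function name is hypothetical; it only mirrors what the PLOT listing stores (a new base color every four columns, plus one for every four rows).

```python
def plot_pattern():
    """Return a 16x16 grid (rows x columns) of lo-res color numbers."""
    grid = [[0] * 16 for _ in range(16)]
    for col in range(16):
        base = (col // 4) * 4              # 0, 4, 8, 12 across the columns
        for row in range(16):
            grid[row][col] = base + row // 4   # +0, +1, +2, +3 down the rows
    return grid


if __name__ == "__main__":
    for row in plot_pattern():
        print(" ".join(format(color, "X") for color in row))
```

Every color from 0 through F appears in exactly one 4-by-4 square.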

Lines 1090-1120 turn on the lo-res graphics display, and line 1130 waits until I press a key on the keyboard. After running this much of the program, and studying the dot patterns on the screen, I realized that it is not possible to exactly reproduce the lo-res colors on the hi-res screen (unless I used //e or //c double hi-res). However, by mixing various patterns of dots within the 28 dots (7x4) each lo-res pixel maps to, I can come close to the same color. I don't really know how close I came, because I do not have a color monitor. However, I can at least tell by inspection that all 16 colors map to different dot patterns that will be distinguishable colors.

The PAUSE.FOR.ANY.KEY subroutine will return EQ status if the key I press is RETURN, and NE status if it is any other key. Line 1140 will terminate my test program if RETURN was typed. If it was not RETURN, line 1150 turns on hi-res graphics and line 1160 calls the convert program. Then line 1170 waits for another keypress. Again, RETURN will terminate the test, and any other key will flop back to let me see the lo-res display again. Line 1190 turns the text display back on.

CONVERT is very straightforward. The outer loop, using the X-register, runs from 23 down to 0. This corresponds to the 24 text lines on the screen, or 48 lo-res rows. If your lo-res picture does not use the bottom 4 lines (8 rows), change line 1300 to "LDX #19".

The inner loop, using the Y-register, runs from 39 down to 0, corresponding to the 40 columns of lo-res pixels. Each of the 960 bytes addressed by X and Y contains two lo-res pixels. The top lo-res row (HLIN 0,39 AT 0) is in the low-order nybble of each of 40 bytes starting at $400. The second row (HLIN 0,39 AT 1) is in the high-order nybble of the same 40 bytes. The third and fourth rows are in the 40 bytes starting at $480, and so on. The starting addresses for each row-pair are exactly the same as those for the 24 lines of the text screen. They also happen to be very closely related to the starting addresses for the corresponding rows on the hi-res screen.

I stored the 24 starting addresses in two tables, LOL and LOH. LOL contains the low-half of each address, and LOH the high. Lines 1320-1360 pick up the base address for the current row-pair and put it in pointer LBAS. Lines 1340 and 1370-1380 set up a similar pointer for the hi-res screen. Note that the only difference is that the lo-res screen starts at $400, and the hi-res screen starts at $2000. This address points at the first byte (first seven dots) of the top line of the eight hi-res lines that are in the same position as the lo-res row-pair.

Each lo-res pixel will be mapped to four lines on the hi-res screen, and will be seven dots (or one byte) wide. Each of the 960 lo-res bytes has two pixels, so each byte uses eight lines on the hi-res screen. The right lo-res nybble will be the top four lines, and the left nybble will be the bottom four. After studying the tables of hi-res addresses, I noticed that each set of eight lines follows a very regular pattern. Given the address for the leftmost byte of the top line of a set of eight lines, I can compute the addresses for the next seven lines by successively adding 4 to the high byte of the address. Thus the base addresses for the first eight lines are $2000, $2400, $2800, $2C00, $3000, $3400, $3800, and $3C00. I can always get the base address for the first of the eight by subtracting $400 and adding $2000 to the corresponding lo-res pointer. Line 1370 does that operation in one step with "EOR #$24".
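The address arithmetic described above can be modeled in a few lines of Python. This is only a sketch: the function names are hypothetical, and the row formula is the standard text page 1 layout, standing in for the LOL/LOH table lookup the program actually uses.

```python
def text_row_base(row):
    """Base address of text line (lo-res row-pair) 0..23 on page 1."""
    return 0x400 + 0x80 * (row % 8) + 0x28 * (row // 8)


def hires_line_bases(row):
    """Addresses of the eight hi-res lines covering the same screen area."""
    top = text_row_base(row) ^ 0x2400      # same effect as EOR #$24 on the high byte
    return [top + 0x400 * n for n in range(8)]   # add 4 to the high byte each time


if __name__ == "__main__":
    for row in (0, 1, 8, 23):
        print(row, [format(a, "04X") for a in hires_line_bases(row)])
```

The XOR works because every lo-res base address has the $0400 bit set and the $2000 bit clear, so toggling both bits is the same as subtracting $400 and adding $2000.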

Lines 1400-1480 pick up the current lo-res byte and feed first the right nybble and then the left nybble to PROCESS.NYBBLE. For indexing purposes I multiply the nybble by 8, so that the lo-res color is in the A-register like this: xCCCCxxx. More on that later. Lines 1490-1530 are the south ends of the two nested loops, equivalent to NEXT Y and NEXT X. By the way, please don't get confused by the terms Y and X. They refer in my program to 6502 registers, not Cartesian coordinates. Just to keep your minds nimble, I use the Y-register for the X-coordinate. The X-register is half the lo-res Y-coordinate.

I mentioned above the problem of coming up with patterns of 28 dots to approximate the lo-res colors. There are only six solid hi-res colors, which correspond exactly to lo-res colors 0, 3, 6, 9, 12, and 15. The other 10 lo-res colors take double the normal hi-res resolution to reproduce exactly. However, as Don Lancaster explains in detail in his "Enhancing Your Apple II -- Volume I", you can produce thousands of shades in hi-res by using dot patterns. I picked 12 of his patterns based on the names he gave them, since I did not have a color monitor. His patterns fit in a 28-dot by two line array. Since each byte stores seven dots, it takes 28 dots before some of the patterns repeat. Using two lines with different or offset patterns gives even more variety.

The table SHADES in lines 1900-2050 gives sixteen patterns. The first four bytes of each color are for the first line of 28 dots, and the other four bytes give the second line of 28 dots. Each lo-res pixel will use only one pair of bytes from the set of eight, depending on which column it is in. The last two bits of the lo-res column number (in the Y-register) select which byte pair to use.

Lines 1650-1700 build an index to the byte pair by merging those two bits with the color*8. Then by addressing "SHADES,X" I get the first byte of a pair, and by using "SHADES+4,X" I get the second one. Each lo-res pixel will use the four hi-res bytes by repeating the pair selected from SHADES.
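The index building can be sketched in Python as follows. The function names are hypothetical; the shifts and the $78 mask are taken from the listing, and the second function simply restates the same result in plain arithmetic.

```python
def shifted_nybbles(lores_byte):
    """Mimic the 6502 code: three ASLs (or one LSR), then AND #$78."""
    top = ((lores_byte << 3) & 0xFF) & 0x78    # low nybble  -> 0CCCC000
    bottom = (lores_byte >> 1) & 0x78          # high nybble -> 0CCCC000
    return top, bottom


def shade_indexes(lores_byte, column):
    """Return (top_index, bottom_index) of the first byte of each pair."""
    phase = column & 3                         # low two bits of the column
    top_color = lores_byte & 0x0F              # upper pixel
    bottom_color = lores_byte >> 4             # lower pixel
    return top_color * 8 + phase, bottom_color * 8 + phase


if __name__ == "__main__":
    top, bottom = shade_indexes(0xA5, 6)
    print(top, top + 4, bottom, bottom + 4)    # byte pairs for each pixel
```

The second byte of each pair sits four bytes further on, which is why the listing can reach it with SHADES+4,X and the same X-register.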

The rest of the code in PROCESS.NYBBLE involves putting the selected color bytes into the hi-res area. HBAS points at the top line of the four to be stored into, and the Y-register points to the byte on that line; so "STA (HBAS),Y" will store into that byte. COMMON.CODE (so named because of a lack of creativity on my part this morning when I discovered that the same eight lines appeared twice) gets and puts two color bytes. The first byte goes into (HBAS),Y; then I add 4 to the high byte of HBAS (since I KNOW that bit is zero, ORA can be used to add it) and store the second byte at the new (HBAS),Y. The "EOR #$0C" at line 1720 changes $24 to $28 or $34 to $38. Similarly, the "EOR #$1C" at line 1750 changes $2C to $30 or $3C to $20. This last possibility leaves HBAS prepared for the next column, automatically!
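To see the pointer stepping at a glance, here is a small Python trace of the high byte of HBAS as one lo-res byte is processed. It is a sketch with hypothetical function names; only the ORA #4, EOR #$0C, and EOR #$1C steps come from the listing.

```python
def process_nybble(high_byte):
    """Return the four line high bytes written for one nybble, plus the exit value."""
    lines = []
    for toggle in (0x0C, 0x1C):        # the two EORs in PROCESS.NYBBLE
        lines.append(high_byte)        # even line stored here
        high_byte |= 0x04              # ORA #4 (the bit is known to be clear)
        lines.append(high_byte)        # odd line stored here
        high_byte ^= toggle            # step to the next pair of lines
    return lines, high_byte


def trace_column(start=0x20):
    """Trace both nybbles (eight hi-res lines) of one lo-res byte."""
    top, high_byte = process_nybble(start)
    bottom, high_byte = process_nybble(high_byte)
    return top + bottom, high_byte


if __name__ == "__main__":
    lines, final = trace_column(0x20)
    print([format(h, "02X") for h in lines], format(final, "02X"))
```

Starting from $20 the trace visits $20, $24, $28, $2C, $30, $34, $38, and $3C, and ends back at $20, ready for the next column.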

Some of the same tricks could be used in writing a program to copy text from the text screen to the hi-res graphics screen, or for a general purpose routine to write characters onto the hi-res screen. Instead of using a color map, we would need a table of dot-matrix characters. Maybe this is just how everyone does it, but I don't recall seeing all of these tricks in any previous code. Especially the idea of getting the hi-res base pointer by merely toggling two bits in the equivalent text pointer, and the idea of generating successive hi-res pointers by merely adding 4 to that base pointer.

When I wrote this program I wasn't really worrying about speed or space. Nevertheless, as you can see, it is fairly compact. As for speed, it takes less than a second.

  1010 *--------------------------------
  1020 LBAS   .EQ $26,27
  1030 HBAS   .EQ $2A,2B
  1040 SAVEX  .EQ $2E
  1050 COLOR  .EQ $30
  1060 *--------------------------------
  1070 T
  1080        JSR PLOT
  1090        LDA $C050    GRAPHICS
  1100        LDA $C052    SOLID (40 BY 48 PIXELS)
  1110        LDA $C054    PRIMARY PAGE
  1120 .1     LDA $C056    LO-RES
  1130        JSR PAUSE.FOR.ANY.KEY
  1140        BEQ .2       ...<RETURN>
  1150        LDA $C057    HIRES
  1160        JSR CONVERT
  1170        JSR PAUSE.FOR.ANY.KEY
  1180        BNE .1       ...NOT <RETURN>
  1190 .2     LDA $C051    TEXT
  1200        RTS
  1210 *--------------------------------
  1220 PAUSE.FOR.ANY.KEY
  1230 .1     LDA $C000    WAIT FOR ANY KEY
  1240        BPL .1       ...NOT YET
  1250        STA $C010    CLEAR STROBE
  1260        CMP #$8D     SET .EQ. IF <RETURN>
  1270        RTS
  1280 *--------------------------------
  1290 CONVERT
  1300        LDX #23      OR #19 IF MIXED MODE
  1310 .1     LDY #39      COLUMNS 0...39
  1320        LDA LOL,X    LOW BYTE OF ROW-PAIR BASE
  1330        STA LBAS
  1340        STA HBAS     SAME FOR HI-RES
  1350        LDA LOH,X
  1360        STA LBAS+1
  1370        EOR #$24     SHIFT FROM $400 TO $2000 FOR HI-RES
  1380        STA HBAS+1
  1390        STX SAVEX    SAVE X-REG
  1400 .2     LDA (LBAS),Y GET TWO LO-RES PIXELS
  1410        PHA          SAVE FOR LOWER ONE
  1420        ASL          UPPER PIXEL * 8
  1430        ASL
  1440        ASL
  1450        JSR PROCESS.NYBBLE
  1460        PLA          GET LOWER PIXEL
  1470        LSR          TIMES 8
  1480        JSR PROCESS.NYBBLE
  1490        DEY          NEXT COLUMN
  1500        BPL .2       ...ANOTHER ONE
  1510        LDX SAVEX    RESTORE X-REG
  1520        DEX          NEXT LINE, SCANNING BOTTOM TO TOP
  1530        BPL .1       ...ANOTHER ONE
  1540        RTS          FINISHED!
  1550 *--------------------------------
  1560 LOH    .HS 04.04.05.05.06.06.07.07  HIGH BYTES
  1570        .HS 04.04.05.05.06.06.07.07  OF SCRN PNTRS
  1580        .HS 04.04.05.05.06.06.07.07  (TEXT OR LO-RES)
  1590 *--------------------------------
  1600 LOL    .HS 00.80.00.80.00.80.00.80  LOW BYTES
  1610        .HS 28.A8.28.A8.28.A8.28.A8  OF SCRN PNTRS
  1620        .HS 50.D0.50.D0.50.D0.50.D0
  1630 *--------------------------------
  1640 PROCESS.NYBBLE
  1650        AND #$78     MASK THE SHIFTED NYBBLE
  1660        STA COLOR
  1670        TYA          LO-RES COLUMN
  1680        AND #3       LOW 2 BITS
  1690        ORA COLOR         0CCCC0YY
  1700        TAX
  1710        JSR COMMON.CODE
  1720        EOR #$0C          3RD LINE OF 4
  1730        STA HBAS+1
  1740        JSR COMMON.CODE
  1750        EOR #$1C          NEXT LINE
  1760        STA HBAS+1
  1770        RTS
  1780 *--------------------------------
  1790 COMMON.CODE
  1800        LDA SHADES,X      EVEN LINE
  1810        STA (HBAS),Y
  1820        LDA HBAS+1
  1830        ORA #4
  1840        STA HBAS+1
  1850        LDA SHADES+4,X    ODD LINE
  1860        STA (HBAS),Y
  1870        LDA HBAS+1
  1880        RTS
  1890 *--------------------------------
  1900 SHADES .HS  0--BLACK
  1910        .HS AA.D5.AA.D5.55.2A.55.2A  1--MAGENTA
  1920        .HS 91.A2.C4.88.C4.88.91.A2  2--DARK BLUE
  1930        .HS  3--PURPLE
  1940        .HS 2A.55.2A.55.2A.55.2A.55  4--DARK GREEN
  1950        .HS 33.66.4C.19.4C.19.33.66  5--GRAY 1
  1960        .HS D5.AA.D5.AA.D5.AA.D5.AA  6--MEDIUM BLUE
  1970        .HS DD.BB.F7.EE.F7.EE.DD.BB  7--LIGHT BLUE
  1980        .HS A2.C4.  8--BROWN
  1990        .HS AA.D5.AA.D5.AA.D5.AA.D5  9--ORANGE
  2000        .HS B3.E6.CC.99.CC.99.B3.E6  A--GRAY 2
  2010        .HS D5.AA.D5.AA.AA.D5.AA.D5  B--PINK
  2020        .HS 6E.5D.3B.77.3B.77.6E.5D  C--LIGHT GREEN
  2030        .HS 2A.55.2A.55.AA.D5.AA.D5  D--YELLOW
  2040        .HS 2A.55.2A.55.D5.AA.D5.AA  E--AQUAMARINE
  2050        .HS 7F.7F.7F.7F.7F.7F.7F.7F  F--WHITE
  2060 *--------------------------------
  2080 *--------------------------------
  2090 PLOT   LDY #0
  2100        STY COLOR
  2110 .1     LDX #3
  2120 .2     LDA COLOR    00, 44, 88, CC
  2130        STA $400,Y   GR ROWS 0-3
  2140        STA $480,Y
  2150        CLC
  2160        ADC #$11     11, 55, 99, DD
  2170        STA $500,Y   GR ROWS 4-7
  2180        STA $580,Y
  2190        ADC #$11     22, 66, AA, EE
  2200        STA $600,Y   GR ROWS 8-11
  2210        STA $680,Y
  2220        ADC #$11     33, 77, BB, FF
  2230        STA $700,Y   GR ROWS 12-15
  2240        STA $780,Y
  2250        INY
  2260        DEX
  2270        BPL .2
  2280        ADC #$11     .., 44, 88, CC, END
  2290        STA COLOR
  2300        BCC .1       ...MORE
  2310        RTS
  2320 *--------------------------------

A Question About BRUN Bob Sander-Cederlof

Mike Lawrie, a reader in South Africa, reports that he tried our prime benchmark (Sep 85 AAL) in a Titan Accelerator card equipped with a 65802. It ran 1000 times in 41 seconds, which correlates very nicely with my predictions in the article. The Titan card runs at 3.58 MHz, and I predicted .35 seconds for 10 repetitions at 4 MHz.

Mike also asked an interesting question, which has been asked by a lot of you at one time or another. Why is it that some assembly language programs can be BLOADed and CALLed, but not BRUN? Even the following very simple program will not return from a BRUN, while it will from a BLOAD followed by a CALL:

       JSR $FF3A   Ring the bell
       RTS

The problem is inside the DOS BRUN command. This command does not use a JSR command to jump to the binary code just loaded. Rather, it uses a JMP command. No return address is left on the stack. When the RTS at the end of the program is executed, it pops garbage off the stack and returns wherever that garbage indicates. What will happen is rather unpredictable.

The Applesoft CALL command does use JSR, and so it works. So does the monitor G command, and the S-C Macro Assembler MGO command. In ProDOS, the BRUN processor works correctly, using a JSR.
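The difference between entering by JSR and entering by JMP can be modeled with a toy stack in Python. This is a sketch, not DOS code: the function names, the addresses, and the "leftover" stack bytes are all hypothetical. The only 6502 facts relied on are that JSR pushes the address of its own last byte (high byte first) and that RTS pulls that address and adds one.

```python
def jsr(stack, jsr_address, target):
    """Model JSR: push the address of the JSR's last byte, then jump."""
    return_minus_one = (jsr_address + 2) & 0xFFFF
    stack.append(return_minus_one >> 8)       # high byte is pushed first
    stack.append(return_minus_one & 0xFF)
    return target


def jmp(stack, target):
    """Model JMP: jump without touching the stack."""
    return target


def rts(stack):
    """Model RTS: pull low byte, then high byte, and add one."""
    low = stack.pop()
    high = stack.pop()
    return (((high << 8) | low) + 1) & 0xFFFF


if __name__ == "__main__":
    leftover = [0x12, 0x34]                   # hypothetical bytes already on the stack
    called = list(leftover)
    jsr(called, 0x9000, 0x0300)
    print(format(rts(called), "04X"))         # back to the byte after the JSR
    jumped = list(leftover)
    jmp(jumped, 0x0300)
    print(format(rts(jumped), "04X"))         # wherever the leftover bytes point
```

In the JMP case the RTS consumes whatever happened to be on the stack, which is why the result under DOS 3.3 is unpredictable.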

This leaves the question: How should a BRUNnable program end under DOS 3.3? If it is to return to the command prompt (] for Applesoft) then the last line should be JMP $3D0. If the BRUN command came from a machine language program (unlikely) then the called program should end with a JMP to a known entry point in the calling program. The most likely case is an Applesoft program that uses a machine language routine. The best way to handle this is to use BLOAD and CALL.

Monthly AAL Source Disk Subscriptions Now Available Bob Sander-Cederlof

We have always made the source code of all the programs published in Apple Assembly Line available on disks. We have collected the programs from three issues together on Quarterly Disks, priced at $15 each or $45/year.

Now that diskettes are so much less expensive, we have decided to try another approach. For those who are interested in getting the source code on disk, we would like to send the source disk along with each newsletter. We will still collect three issues onto Quarterly Disks, for late comers. But those of you who have Quarterly Disk subscriptions will start getting the Monthly Disks. We will send the disk and newsletter in the same envelope, First Class Mail.

The price for combined newsletter/disk subscriptions will be $64 in the USA, Canada, and Mexico. For other countries the postage is higher, so the fee will be $87.

If you want to synchronize your newsletter and Monthly Disk subscriptions, you can pro-rate the Monthly Disk at $3.75 per month ($4.75 for overseas). You can check the length remaining on your current newsletter subscription by looking at the mailing label: the number in the upper-right corner is the year and month of the last issue of your current subscription.

If you currently are receiving the Quarterly Disk by an automatic charge to your credit card each quarter there will be no change: you will still get the Quarterly Disk rather than the monthly disk.

Text File Transfer Using DOS 3.3 File Manager Bob Potts

Transferring text files from one drive to another can be frustrating and time consuming. The standard procedure is to read from the file on the source drive and write to the file on the target drive. One possible solution is to use FID, but you must BRUN FID and cannot use it from inside an Applesoft program.

With this in mind, I set out to write a utility which will transfer a text file using the DOS 3.3 File Manager (FM) routines. FM has been a part of every release of DOS, but very little documentation has been written about these powerful routines. While RWTS concerns itself with tracks and sectors, FM deals with whole files, be they binary, text, or Applesoft. I recalled that a couple of years ago, Bob Sander-Cederlof had assisted me with a communications program and had used the FM routines to read and store the file. I located the listing we used, analyzed the code, and here is the result.

The entire program could have been written in assembler, but since most of my programs are in Applesoft (with machine language support routines), I decided to write it as simply as possible. The name of the file to be transferred and the OPEN, READ, WRITE, and CLOSE commands are all handled by a short Applesoft front end program. The machine language portion is broken down as follows.

Lines 1130-1150 are simply easily accessible jump vectors to the two routines which will be CALLed from inside an Applesoft driver.

Lines 1190-1320 clear the buffer, in this case $2000-95FF, to zeroes. This gives us a buffer of 30,208 bytes, which should be large enough for most text files. (This is 118 sectors.) Lines 1330-1340 reset the base buffer address, for use later to find the end of the data in the buffer.

Lines 1360-1460 load the file that has been OPENed by the Applesoft driver. The process of setting up a FM parameter block is simplified by using a preset data area called RD.BLK, lines 1790-1800. Calling FM.SETUP sets up the Y- and A-registers properly, and then calling FM.ENTRY reads the text file.

Lines 1500-1580 search through the data buffer for the first occurrence of a 00 byte, which will signal the end of data. By subtracting the base buffer address in lines 1660-1710 we get the actual length of the data. Lines 1600-1650 copy in the initial parameter values for writing, and lines 1660-1710 set up the length.
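The buffer arithmetic and the end-of-data search can be checked with a short Python sketch. The names are hypothetical, and the guard for a buffer with no 00 byte is an addition for safety here; the 6502 loop in the listing has no such check.

```python
BUFFER_START = 0x2000
BUFFER_END = 0x9600            # one past the last buffer byte ($95FF)


def buffer_size():
    """Size of the buffer in bytes."""
    return BUFFER_END - BUFFER_START


def data_length(buffer):
    """Length of the data = offset of the first 00 byte in the cleared buffer."""
    end = buffer.find(b"\x00")
    if end < 0:                # no terminator found: treat the whole buffer as data
        end = len(buffer)
    return end


def length_bytes(length):
    """Split the length into the low and high bytes stored at FM.BLK+6 and +7."""
    return length & 0xFF, length >> 8


if __name__ == "__main__":
    buffer = bytearray(buffer_size())          # cleared to zeroes
    buffer[:6] = b"HELLO\r"
    print(buffer_size(), buffer_size() // 256, data_length(buffer))
```

Because the buffer starts on a page boundary, the offset of the first 00 byte is the data length directly, which is why the listing only has to subtract the high byte of MY.BUFFER.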

Lines 1720-1740 call on FM to actually write the data on the file that has been opened in the Applesoft driver.

The time saving using this transfer is significant. A text file containing 8000 bytes took 49 seconds to read and write using pure Applesoft. Using the FM the same operation was accomplished in only 17 seconds.

Since the program is only 120 bytes long, it can be placed almost anywhere there is free space, especially on page 3. If you are working from a larger Applesoft program, the starting point for the buffer could be moved as needed to load your text file.

  1010 *--------------------------------
  1020        .OR $300
  1025        .TF TEXT.TRANSFER.OBJ
  1030 *--------------------------------
  1040 MY.BUFFER  .EQ $2000
  1050 *--------------------------------
  1080 *--------------------------------
  1090 FM.SETUP   .EQ $3DC      INITIALIZE Y & A
  1120 *--------------------------------
  1150        JMP FIND.END.AND.WRITE
  1160 *--------------------------------
  1180 *--------------------------------
  1200        LDA #MY.BUFFER
  1210        STA BUFFER        LSB
  1220        LDA /MY.BUFFER
  1230        STA BUFFER+1      MSB
  1240        LDY #0
  1250 .1     LDA #0            CLEAR BUFFER UP TO $95FF
  1260 .2     STA (BUFFER),Y
  1270        INY               NEXT BYTE IN THIS PAGE
  1280        BNE .2            ...STILL IN THE PAGE
  1290        INC BUFFER+1      NEXT PAGE
  1300        LDA BUFFER+1
  1310        CMP #$96          AT END OF STORAGE?
  1320        BNE .1            ...NO, KEEP CLEARING
  1330        LDA /MY.BUFFER    RESET BUFFER POINTER
  1340        STA BUFFER+1
  1350 *--------------------------------
  1370        LDX #9            10 BYTES
  1380 .1     LDA RD.BLK,X
  1390        STA FM.BLK,X
  1400        DEX
  1410        BPL .1
  1420        JSR FM.SETUP
  1430        JSR FM.ENTRY
  1440        LDA FM.BLK+10     GET RETURN CODE
  1460        RTS               RETURN TO APPLESOFT
  1470 *--------------------------------
  1490 *--------------------------------
  1500 FIND.END.AND.WRITE
  1510        LDY #0            SEARCH FOR 00 BYTE
  1520 .1     LDA (BUFFER),Y
  1530        BEQ .2            ...FOUND END
  1540        INY
  1550        BNE .1            ...NEXT BYTE IN SAME PAGE
  1560        INC BUFFER+1      NEXT PAGE
  1570        BNE .1            ...ALWAYS
  1580 .2     STY BUFFER        LSB OF EOF BYTE
  1590 *--------------------------------
  1610        LDX #9            10 BYTES
  1620 .1     LDA WR.BLK,X
  1630        STA FM.BLK,X
  1640        DEX
  1650        BPL .1
  1660        LDA BUFFER        LSB
  1670        STA FM.BLK+6      LSB OF FILE LENGTH
  1680        SEC
  1690        LDA BUFFER+1
  1700        SBC /MY.BUFFER
  1710        STA FM.BLK+7      MSB OF FILE LENGTH
  1720        JSR FM.SETUP
  1730        LDX #1            IF NO FILE, ALLOCATE ONE
  1740        JSR FM.ENTRY      WRITE THE FILE
  1750        LDA FM.BLK+10     RETURN CODE
  1770        RTS               RETURN TO APPLESOFT
  1780 *--------------------------------
  1790 RD.BLK .HS 03.02.0000.0000
  1800        .DA $9600-MY.BUFFER,MY.BUFFER
  1810 WR.BLK .HS 04.02.0000.0000
  1820        .DA $9600-MY.BUFFER,MY.BUFFER
  1830 *--------------------------------
     99  REM -------------------
     100  HIMEM: 8192: REM  $2000-95FF IS MY BUFFER
     110 D$ =  CHR$ (4)
     120  PRINT D$"NOMON I,O,C"
     199  REM -------------------
     300  REM ----------------------
     320  PRINT "------------------"
     330  INPUT "ENTER FILE NAME:  ";F$
     410  PRINT D$"OPEN"F$",D1
     420  PRINT D$"READ"F$
     430  CALL RF
     440  IF  PEEK (RC) = 5 THEN 500
     450  PRINT D$"CLOSE"
     470  STOP 
     510  PRINT D$"CLOSE"
     520  PRINT D$"OPEN"F$",D2
     530  PRINT D$"WRITE"F$
     540  CALL WF
     550  IF  PEEK (RC) = 0 THEN 600
     560  PRINT D$"CLOSE"
     580  STOP 
     600  REM FINISHED
     610  PRINT D$"CLOSE"
     630  END 

Fast 6502 & 65802 Multiply Routines Bob Sander-Cederlof

Since multiplication is not a built-in function in the 6502, 65C02, or 65802, many of us have written our own subroutines for the purpose. I will present some efficient subroutines here, to handle the 8-bit and 16-bit cases.

I will assume both arguments are the same length (either 8-bits or 16-bits) and that we want the full product. If the arguments are only 8-bits long, the product will be 16-bits long. If the arguments are 16-bits long, the product will be 32-bits long. I will also assume the arguments are unsigned values. Thus $FF times $FF will be $FE01 (in decimal, 255x255 = 65025).

Way back in February, 1981, I published an article with Brooke Boering's fast 16-bit multiplication subroutine. His subroutine duplicated the functions of the subroutine in the original Apple Monitor ROM, but was nearly twice as fast. Brooke's programs were originally published in the December, 1980, Micro magazine (now defunct). He included an 8-bit multiply subroutine with an average time of only 192 cycles.

Damon Slye wrote an article for Call APPLE, published June, 1983. He introduced some coding tricks which allow an 8-bit multiply in an average of 160 cycles. I have reproduced Damon's program below, in lines 1010-1300. His trick involves eliminating a CLC opcode from the loop in lines 1210-1260. Ordinarily you would need a CLC before the ADC instruction; Damon decremented the multiplicand by one before starting the loop, so that adding with carry set works. He does the decrementing in lines 1130-1160. Note that if the original multiplicand was zero, he skips all the rest of the code and just returns the answer: 0.
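To make the decrement trick concrete, here is a Python model of the FAST.8X8.SLYE listing that tracks the carry bit explicitly. The function name and the way the carry is modeled are my own; the instruction sequence follows the listing.

```python
def fast_8x8_slye(multiplier, multiplicand):
    """Model of the 6502 routine; returns the 16-bit product."""
    if multiplicand == 0:
        return 0                              # A*0=0 shortcut
    cand = (multiplicand - 1) & 0xFF          # DEX: pre-decrement the multiplicand
    carry = multiplier & 1                    # LSR: first multiplier bit into carry
    plier = multiplier >> 1
    a = 0
    for _ in range(8):
        if carry:                             # BCC skips the add when carry is clear
            total = a + cand + 1              # ADC with carry set adds cand+1
            a = total & 0xFF
            carry = total >> 8
        carry, a = a & 1, (carry << 7) | (a >> 1)              # ROR A
        carry, plier = plier & 1, (carry << 7) | (plier >> 1)  # ROR PLIER
    return (a << 8) | plier                   # high byte in A, low byte in PLIER


if __name__ == "__main__":
    print(format(fast_8x8_slye(0xFF, 0xFF), "04X"))
```

The add only happens when the carry is already set, so adding the decremented multiplicand plus the carry gives exactly the original multiplicand with no CLC needed.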

I had to go at least one step faster, so I partially "un-wrapped" the 8-step loop. I changed it to loop only four times, but handled two bits of the multiplier each time. This runs in an average time of 140 cycles. You could unwrap it all the way, writing out the BCC-ADC-ROR-ROR lines a total of 8 times, and cut the average time down to only 111 cycles.

Let me stop here and say what I mean by average time. I am stating time in terms of "cycles", rather than seconds or microseconds. The Apple has two different cycle times, depending on the video timing logic. The average Apple speed is 1020488 cycles per second. The multiply algorithms will vary in speed depending on the number of bits in the multiplier which equal "1". If the multiplier = $FF (all ones) the algorithm will take the maximum time. If the multiplier is $00, it will take the minimum time. On the average for random arguments, the multiplier will have four zeroes and four ones, so the average time is equal to the average of the minimum and maximum times. For all of the subroutines, I included the cycles for a JSR to call them, and for the RTS at the end.
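For readers who prefer microseconds, the conversion is simple arithmetic. The Python sketch below uses hypothetical function names; the clock rate is the figure quoted above, and the sample numbers are the minimum and maximum given for Damon Slye's routine in the summary table.

```python
CYCLES_PER_SECOND = 1020488        # average Apple speed quoted in the article


def average_cycles(minimum, maximum):
    """Average time, assuming four ones and four zeroes in the multiplier."""
    return (minimum + maximum) / 2


def cycles_to_microseconds(cycles):
    """Convert a cycle count to microseconds at the average clock rate."""
    return cycles * 1_000_000 / CYCLES_PER_SECOND


if __name__ == "__main__":
    cycles = average_cycles(152, 168)
    print(cycles, round(cycles_to_microseconds(cycles), 1))
```

At this rate a 160-cycle multiply takes a little under 157 microseconds.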

I programmed an 8-bit multiply using 65802 opcodes, as shown below in lines 1560-1790. The program is slightly shorter (one byte), but that really isn't a fair comparison. The arguments and product are handled differently, and so the effort to call the program may be more or less than that for the 6502 version. Rather than passing the multiplicand in the X-register, I have it in the low byte of the A-register. I pass the multiplier in the high byte of the A-register. Since X is not used for passing any values, I saved and restored it (lines 1620 and 1770). I assumed the program would be called from the 6502 mode, which of course it was as long as I was testing it. In "real life" it might be written to be called from Native 65802 mode, since the larger program it was a part of would also be taking advantage of all the 65802 features.

I used a couple of tricks to save space and time. One you may justly complain about is that I store the multiplicand directly into the operand field of the ADC instruction at line 1720. This definitely saves time, but it also could have serious drawbacks. (For example, it would not work if executed from ROM.) Since I enter in 6502 Emulation mode, line 1640 only loads 0 into the low byte of the A-register. Lines 1650-1660 enter the 65802 Native mode. Line 1680 sets the A-register to 16-bit mode.

In line 1690 I form the inverse (one's complement) of the multiplier. This is just another way of eliminating the CLC from the loop. Note that the multiplier is in the high byte of A, and the product is going to be accumulated in the low byte. The loop runs from line 1700 through line 1740. Line 1700 shifts to the left both the partial product and what remains of the multiplier, putting the highest remaining bit of the multiplier into the carry status bit. If that bit = 1, then the original bit in the multiplier before complementing was a zero, so we do not add the multiplicand to the current partial product. As we continue through the loop, the bits of the multiplier keep shifting out just ahead of the ever-growing partial product, until finally we have the answer.
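The complement-and-shift loop can be modeled in Python with a single 16-bit accumulator. This is a sketch with a hypothetical function name; it follows the description above (complemented multiplier in the high byte, product growing up from the low byte) and does not model the mode switching or the self-modified operand.

```python
def multiply_8x8_65802(multiplier, multiplicand):
    """Model of the 16-bit accumulator loop; returns the 16-bit product."""
    a = (multiplier ^ 0xFF) << 8          # complemented multiplier, low byte = 0
    for _ in range(8):
        carry = a >> 15                   # bit shifted out by the 16-bit ASL
        a = (a << 1) & 0xFFFF
        if not carry:                     # complemented 0 means original bit was 1
            a = (a + multiplicand) & 0xFFFF   # ADC with carry known to be clear
    return a


if __name__ == "__main__":
    print(format(multiply_8x8_65802(0xFF, 0xFF), "04X"))
```

The partial product never collides with the remaining multiplier bits: after k passes it needs at most 8+k bits, and the unused multiplier bits sit above that.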

Lines 1750-1780 restore the machine state to the 6502 Emulation mode and restore the original X-register value. The full product is now in the A-register. If I wanted to print out the product, I might do it like this:

       XBA             GET HIGH BYTE INTO LOW-A
       JSR $FDDA       PRINT HIGH BYTE
       XBA             GET LOW BYTE INTO LOW-A
       JMP $FDDA       PRINT LOW BYTE AND RETURN

Here is a summary of the execution times (in cycles) for the three 8-bit multiply subroutines:

               Minimum  Maximum  Average
        Slye     152      168      160
        RBSC     132      148      140
        65802    119      135      127

The 65802 version would be seven cycles faster if we did not require saving and restoring the X-register. If you want to change the 65802 version for calling from Native mode, delete lines 1650, 1660, 1750, and 1760. Then insert the following:

       1612         PHP
       1614         SEP #$30
       1772         PLP

These changes add one cycle to the time.

  1010 *--------------------------------
  1020 CAND   .EQ 2
  1030 PLIER  .EQ 3
  1040 PROD   .EQ 4,5
  1050 *--------------------------------
  1070 *   CALL APPLE, JUNE 1983, P45-48.
  1080 *      (A-REG) = MULTIPLIER
  1090 *      (X-REG) = MULTIPLICAND
  1110 *--------------------------------
  1120 FAST.8X8.SLYE
  1130        CPX #0
  1140        BEQ .3       A*0=0
  1150        DEX          DECR. CAND TO AVOID
  1160        STX CAND        THE CLC BEFORE ADC CAND
  1170        LSR          PREPARE FIRST BIT
  1180        STA PLIER
  1190        LDA #0
  1200        LDX #8
  1210 .1     BCC .2       NO ADD
  1220        ADC CAND
  1230 .2     ROR
  1240        ROR PLIER
  1250        DEX
  1260        BNE .1
  1270        LDX PLIER
  1280        RTS
  1290 .3     TXA
  1300        RTS
  1310 *--------------------------------
  1320 FAST.8X8.RBSC
  1330        CPX #0
  1340        BEQ .3       A*0=0
  1350        DEX          DECR. CAND TO AVOID
  1360        STX CAND        THE CLC BEFORE ADC CAND
  1370        LSR          PREPARE FIRST BIT
  1380        STA PLIER
  1390        LDA #0
  1400        LDX #4
  1410 .1     BCC .2       NO ADD
  1420        ADC CAND
  1430 .2     ROR
  1440        ROR PLIER
  1450        BCC .25      NO ADD
  1460        ADC CAND
  1470 .25    ROR
  1480        ROR PLIER
  1490        DEX
  1500        BNE .1
  1510        LDX PLIER
  1520        RTS
  1530 .3     TXA
  1540        RTS
  1550 *--------------------------------
  1560        .OP 65816
  1570 *--------------------------------
  1580 *      MULTIPLIER IN A(15-8), MULTIPLICAND IN A(7-0)
  1590 *      RETURN PRODUCT IN A(15-0)
  1600 *--------------------------------
  1610 MULTIPLY.8X8.65802
  1620        PHX
  1630        STA .2+1     SAVE MULTIPLICAND
  1640        LDA #0
  1650        CLC
  1660        XCE
  1670        LDX #8
  1680        REP #$20     A-REG 16 BITS
  1690        EOR ##$FF00  COMPLEMENT THE MULTIPLIER
  1700 .1     ASL
  1710        BCS .3       ...IF ORIGINAL BIT=0
  1720 .2     ADC ##0      ADD MULTIPLICAND
  1730 .3     DEX
  1740        BNE .1
  1750        SEC
  1760        XCE
  1770        PLX
  1780        RTS
  1790 *--------------------------------
  1800        .LIF

I will also show four sample 16-bit multiply subroutines. The first one is a copy of Brooke Boering's code. The second is a direct conversion of Brooke's code to 65802 code, with emphasis on space. The third modifies the second with the tricks of Damon Slye; it takes more space, but it is faster. The fourth, in a separate listing, splits the multiplier in half and is usually the fastest of all.

The first three of these subroutines are modeled after the code in the original Apple monitor ROM. The arguments are expected in page zero locations, low-byte first. The result will also be in page zero locations. The function performed is actually a little more than just multiplication, because it is possible to specify an addend as well. The final result will be PRODUCT = ADDEND + (MULTIPLIER * MULTIPLICAND). PRODUCT is stored in four consecutive bytes, backwards. The highest byte is at PRODUCT+1, the next at PRODUCT, the next at PLIER+1, and the lowest at PLIER. The fourth subroutine differs in that the product does not overlap the multiplier.
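As a rough cross-check of that description, here is a small Python model of the shift-right-and-add loop. It is not part of the original listings: the function names are made up for this sketch, Python integers stand in for the page zero bytes, and it assumes the addend is preloaded into PRODUCT before the call (the listing never clears PRODUCT, which is what suggests this).

```python
def multiply_16x16_model(plier, plicand, addend=0):
    """Model of the shift-and-add loop: returns addend + plier * plicand.

    plier, plicand and addend are unsigned 16-bit values. PRODUCT starts
    out holding the addend and ends up holding the high 16 bits.
    """
    product = addend & 0xFFFF
    plier &= 0xFFFF
    plicand &= 0xFFFF
    for _ in range(16):
        carry = 0
        if plier & 1:                      # low multiplier bit set?
            total = product + plicand      # 17-bit sum
            product = total & 0xFFFF
            carry = total >> 16            # carry out of the addition
        # Rotate the 33-bit quantity carry:PRODUCT:PLIER right one bit.
        plier = (plier >> 1) | ((product & 1) << 15)
        product = (product >> 1) | (carry << 15)
    return (product << 16) | plier


def result_bytes(value):
    """Split a 32-bit result into PLIER, PLIER+1, PRODUCT, PRODUCT+1."""
    return [(value >> shift) & 0xFF for shift in (0, 8, 16, 24)]
```

Each pass consumes one multiplier bit from the bottom of PLIER and feeds one finished product bit in at the top, which is why the multiplier and the low half of the product can share the same two bytes.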

Looking at Brooke's version (lines 1000-1270) you can see that the loop contains a 16-bit addition (lines 1130-1190). There are also two 16-bit ROR shifts, at lines 1200-1230. These are the likely candidates for shortening via 65802 code. My first version for the 65802 made no other changes in the loop. I merely prefixed Brooke's code with CLC-XCE-REP to get into the 16-bit Native mode, and suffixed it with SEC-XCE to get back to Emulation mode. Then I noticed another shortcut, and the result is in lines 1300-1480.

By moving the LDA PRODUCT up before the BCC opcode in lines 1370-1380, I was able to change a ROR PRODUCT to a simple ROR on the A-register followed by a STA PRODUCT. This saves a net six cycles when the multiplier bit is "1", and costs two cycles when the multiplier bit is "0". With ones and zeroes equally likely, the average savings for random multipliers is two cycles per pass, inside a loop that runs 16 times.

The faster version, in lines 1500-1780, merely implements Damon Slye's trick of pre-decrementing the multiplicand so as to avoid an explicit CLC opcode inside the 16-time loop. It costs 12 cycles for the extra setup, but it saves two cycles for each one-bit in the multiplier.
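The arithmetic behind the pre-decrement trick can be sketched in a few lines of Python. This is only a model with hypothetical helper names: it relies on the fact that the ADC is reached only when the BCC falls through, so carry is already set there. It also shows why a zero multiplicand has to be handled separately.

```python
def adc16(acc, operand, carry_in):
    """16-bit add with carry, like ADC with a 16-bit A-register."""
    total = acc + operand + carry_in
    return total & 0xFFFF, total >> 16     # (result, carry out)


def add_with_clc(acc, plicand):
    """CLC followed by ADC PLICAND."""
    return adc16(acc, plicand, 0)


def add_predecremented(acc, plicand):
    """ADC of the pre-decremented multiplicand with carry already set."""
    return adc16(acc, (plicand - 1) & 0xFFFF, 1)
```

For any non-zero multiplicand the two helpers return the same sum and the same carry out. With a multiplicand of zero the decrement wraps around to $FFFF and the carry out comes back wrong, so that case needs its own path.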

The fourth version, in the separate listing as lines 1000-1430, uses the trick of splitting the multiplier in half. In effect, two parallel 8-bit by 16-bit multiplies are accomplished, with the result usually taking less time than any of the other algorithms. By deleting line 1130 (which shaves off another four cycles) the feature of allowing an addend can be included.
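Here is a hedged Python model of the split-multiplier idea, written from a reading of the listing rather than taken from it. The function name is hypothetical, a single Python integer stands in for the four product bytes, and the model covers only the case where the product is cleared on entry.

```python
def multiply_split_model(a, b):
    """Model of the split-multiplier loop: returns a * b.

    a is the 16-bit multiplier, b the 16-bit multiplicand. The low and
    high bytes of a are scanned together, most significant bit first.
    """
    a &= 0xFFFF
    b &= 0xFFFF
    p = 0                                  # product cleared on entry
    for step in range(8):
        if step:                           # first pass hops over the shift
            p = (p << 1) & 0xFFFFFFFF      # double the product
        if a & 0x0080:                     # top remaining bit of low byte
            p += b                         # add at the bottom of the product
        carry = a >> 15                    # top remaining bit of high byte
        a = (a << 1) & 0xFFFF              # shift multiplier
        if carry:
            p += b << 8                    # add one byte higher up
    return p & 0xFFFFFFFF
```

Eight passes are enough because each pass handles one bit from each half of the multiplier. The bit from the high byte is simply added one byte further up in the product.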

Here is a summary of the execution cycles for the four 16-bit multiply subroutines:

               Minimum  Maximum  Average
       Boering   541      845      693
       Smaller   519      599      559
       Faster    531      579      555
       Fourth    332      684      508 (usually fastest)

Note that the third subroutine goes faster still when the multiplicand is zero, because the bulk of the code is skipped.
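A quick bit of arithmetic on the table, sketched in Python. The midpoint interpretation is an observation from the numbers, not something stated above: the Average column matches the midpoint of Minimum and Maximum, and dividing the spread by 16 gives a rough extra cost per multiplier bit.

```python
timings = {
    "Boering": (541, 845),
    "Smaller": (519, 599),
    "Faster": (531, 579),
    "Fourth": (332, 684),
}

# Midpoint of minimum and maximum for each subroutine.
averages = {name: (low + high) // 2 for name, (low, high) in timings.items()}

# Spread between best and worst case, divided over 16 multiplier bits.
per_bit = {name: (high - low) // 16 for name, (low, high) in timings.items()}

for name in timings:
    print(name, averages[name], per_bit[name])
```

The per-bit figures for Smaller and Faster differ by two cycles, which agrees with the two cycles saved for each one-bit by the pre-decrement trick.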

These are pretty good subroutines, but I have no doubt that they can be improved upon. Why not try your hand? If you can significantly improve either space or time or features, send your code to AAL. We'll publish the best ones, and help advance the state of the art. And if you have some classy division subroutines, they are welcome too!

  1000 *SAVE S.MULTIPLY 16X16
  1010 *--------------------------------
  1020 PLICAND    .EQ $00,01    MULTIPLICAND
  1030 PLIER      .EQ $02,03    MULTIPLIER, LO-16 OF PRODUCT
  1040 PRODUCT    .EQ $04,05    HI-16 OF PRODUCT
  1050 *--------------------------------
  1060        .OP 6502
  1070 *--------------------------------
  1080 MULTIPLY.16X16.6502
  1090        LDX #16
  1100 .1     LDA PLIER    GET LOW BYTE OF MULTIPLIER
  1110        LSR
  1120        BCC .2       ...DON'T ADD MULTIPLICAND
  1130        CLC
  1140        LDA PRODUCT
  1150        ADC PLICAND
  1160        STA PRODUCT
  1170        LDA PRODUCT+1
  1180        ADC PLICAND+1
  1190        STA PRODUCT+1
  1200 .2     ROR PRODUCT+1
  1210        ROR PRODUCT
  1220        ROR PLIER+1
  1230        ROR PLIER
  1240        DEX
  1250        BNE .1
  1260        RTS
  1270 *--------------------------------
  1280        .OP 65802
  1290 *--------------------------------
  1300 MULTIPLY.16X16.65802.SMALLER
  1310        CLC
  1320        XCE          NATIVE MODE
  1330        REP #$20     A-REG 16-BITS
  1340        LDX #16      LOOP 16 TIMES
  1350 .1     LDA PLIER    GET MULTIPLIER
  1360        LSR
  1370        LDA PRODUCT  GET HI-16 OF PRODUCT
  1380        BCC .2       ...DO NOT NEED TO ADD
  1390        CLC
  1400        ADC PLICAND
  1410 .2     ROR
  1420        STA PRODUCT
  1430        ROR PLIER    USE FOR LO-16 OF PRODUCT
  1440        DEX
  1450        BNE .1
  1460        SEC
  1470        XCE          BACK TO EMULATION MODE
  1480        RTS
  1490 *--------------------------------
  1500 MULTIPLY.16X16.65802.FASTER
  1510        CLC
  1520        XCE          NATIVE MODE
  1530        REP #$20     A-REG 16-BITS
  1540        LDA PLICAND
  1550        BEQ .3       0*ANYTHING=0
  1560        DEC
  1570        STA PLICAND
  1580        LDX #16      LOOP 16 TIMES
  1590 .1     LDA PLIER    GET MULTIPLIER
  1600        LSR
  1610        LDA PRODUCT  GET HI-16 OF PRODUCT
  1620        BCC .2       ...DO NOT NEED TO ADD
  1630        ADC PLICAND
  1640 .2     ROR
  1650        STA PRODUCT
  1660        ROR PLIER    USE FOR LO-16 OF PRODUCT
  1670        DEX
  1680        BNE .1
  1690        SEC
  1700        XCE          BACK TO EMULATION MODE
  1710        RTS
  1730        STA PLIER    LOW 16 OF PRODUCT
  1750        SEC
  1760        XCE          BACK TO EMULATION MODE
  1770        RTS
  1780 *--------------------------------
  1790        .LIF

And here is the fourth subroutine, a 65802 version that is usually even faster.

  1000 *SAVE S.MUL16X1665802
  1010 *--------------------------------
  1020        .OP 65802
  1030 *--------------------------------
  1040 A      .EQ 0,1
  1050 B      .EQ 2,3
  1060 P      .EQ 4,5,6,7
  1070 *--------------------------------
  1080 MUL.EVEN.FASTER
  1090        CLC
  1100        XCE          ENTER NATIVE MODE
  1110        REP #$20     16-BIT A-REGISTER
  1120        STZ P+2      MAKE SURE NO ADDEND IN HI-16
  1130        STZ P   (DELETE IF WANT AN ADDEND IN LO-16)
  1140        LDX #8
  1150        BRA .2       ...HOP OVER SHIFTS
  1160 *--------------------------------
  1170 .1     ASL P        DOUBLE THE PRODUCT
  1180        ROL P+2
  1190 .2     LDA A
  1200        AND ##$0080  LOOK AT SIGN OF LO-BYTE
  1210        BEQ .3       ...DON'T ADD MULTIPLICAND
  1220        CLC
  1230        LDA P
  1240        ADC B
  1250        STA P
  1260        BCC .3
  1270        INC P+2      ADD CARRY TO HI-16
  1280 *--------------------------------
  1290 .3     ASL A        SHIFT MULTIPLIER
  1300        BCC .4
  1310        CLC
  1320        LDA P+1      ADD TO MIDDLE OF PRODUCT
  1330        ADC B
  1340        STA P+1
  1350        BCC .4
  1360        INC P+3      (NEVER BOTHERS P+4)
  1370 *--------------------------------
  1380 .4     DEX
  1390        BNE .1
  1400        SEC
  1410        XCE
  1420        RTS
  1430 *--------------------------------
  1440 T
  1450        JSR MUL.EVEN.FASTER
  1460        LDA P+3
  1470        JSR PRB
  1480        LDA P+2
  1490        JSR PRB
  1500        LDA P+1
  1510        JSR PRB
  1520        LDA P+0
  1530 PRB    JMP $FDDA
  1540 *--------------------------------
  1550        .LIF

RAMWORKS Compatible Auxiliary MOVE Routine Harvey Brown
Spirit River, Alberta, CANADA

The MOVE routine inside the Apple //c and //e ROM transfers data conveniently to and from the auxiliary 48K area, but it does not work with the upper 16K area. Also, it does not work with the extra banks of RAM available in cards such as the RAMWORKS from Applied Engineering.

I needed that ability, so I wrote my own MOVE subroutine. Mine uses the page at $200 in main RAM as a buffer, to simplify the movement code. If you want to move from an arbitrary bank to another arbitrary bank, my program will require you to use $200 in main RAM as an intermediate buffer. (Somewhat like stopping at Chicago on your way from Toronto to Dallas.) My program also assumes you are always moving exactly 256 bytes (one page). This simplifies the code and the calling sequence, and is probably a reasonable restriction.

The program begins by copying itself into every bank you are using. The bank numbers must be assembled into the list in lines 1800-1870. Notice that I use bank number $FF to signal the main RAM, and banks from $00 up to signal the banks of Auxiliary RAM. This code needs to reside in the same location in every bank that will be switched on, because when you move from an auxiliary bank to the main RAM that auxiliary bank will be set up so that all RAM reads come from it. This includes reads for the program, so the program had better be there.

Once the program has been initialized, you can JSR MOVE (or JSR $C03 if you want to use a "frozen" entry point) to move a page. At the time of the JSR MOVE, you should have the high byte of the Auxiliary RAM address in the A-register, and the bank number in the X-register. Set carry (SEC opcode) to indicate moving from main $200, or clear carry (CLC opcode) to indicate moving into main $200. Thinking in terms of a ramdisk application, SEC for a write or CLC for a read.
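The calling sequence can be modeled in Python to show how a bank-to-bank copy breaks down into two calls. This is a sketch only: the function names are hypothetical, each bank is treated as a flat 64K array, and the model ignores the soft switches and the special page numbering for the second $D000 bank.

```python
MAIN = 0xFF                    # bank number the article uses for main RAM
BUFFER = 0x200                 # fixed one-page buffer in main RAM
PAGE = 256


def make_banks(bank_numbers):
    """Hypothetical flat model: each bank is 64K of zeroed bytes."""
    return {n: bytearray(65536) for n in bank_numbers}


def move(banks, page, bank, write):
    """Model of JSR MOVE: A = page, X = bank, carry = write."""
    far = page * PAGE
    if write:                  # SEC: main $200 -> (page) in bank
        banks[bank][far:far + PAGE] = banks[MAIN][BUFFER:BUFFER + PAGE]
    else:                      # CLC: (page) in bank -> main $200
        banks[MAIN][BUFFER:BUFFER + PAGE] = banks[bank][far:far + PAGE]


def copy_page(banks, src_bank, src_page, dst_bank, dst_page):
    """Bank-to-bank copy takes two calls, staging through main $200."""
    move(banks, src_page, src_bank, write=False)
    move(banks, dst_page, dst_bank, write=True)
```

In ramdisk terms, copy_page is one read followed by one write, with the page at $200 left holding a copy of the data afterwards.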

Warning: my program assumes you are calling from a program that runs with the Applesoft ROM selected (see line 1780). If you plan to run it with RAM selected in the upper 16K, you will have to make appropriate changes. You could save the status of the LCRAM and LCBANK soft switches ($C012 and $C011 respectively) before changing them. These partially indicate the status of the $C08x switches. You can tell whether RAM or ROM was selected, and restore the proper one after MOVE is finished. You can also tell which $D000 bank was selected. However, you cannot tell whether the RAM was write-enabled or not; also, you cannot tell if it was in the special mode in which you read ROM and write RAM.
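One way to picture the save-and-restore idea is the Python sketch below. It is an illustration, not a tested patch to MOVE: the function name is hypothetical, and because the write-enable state cannot be read back, the choice of the read/write RAM switches ($C083 and $C08B) when RAM was selected is purely an assumption of this sketch.

```python
ROMIN_BANK2 = 0xC082   # read ROM, no RAM write, $D000 bank 2
ROMIN_BANK1 = 0xC08A   # read ROM, no RAM write, $D000 bank 1
RAM_BANK2 = 0xC083     # read RAM (write-enabled after two reads), bank 2
RAM_BANK1 = 0xC08B     # read RAM (write-enabled after two reads), bank 1


def restore_switch(lcram, lcbank):
    """Pick the $C08x address to touch after MOVE is finished.

    lcram and lcbank are the bytes read from $C012 and $C011 before
    MOVE changed anything; only bit 7 of each is meaningful.
    """
    reading_ram = bool(lcram & 0x80)   # bit 7 set: RAM was being read
    bank2 = bool(lcbank & 0x80)        # bit 7 set: $D000 bank 2 selected
    if reading_ram:
        return RAM_BANK2 if bank2 else RAM_BANK1
    return ROMIN_BANK2 if bank2 else ROMIN_BANK1
```

A caller that had the RAM write-protected, or was in the read-ROM-write-RAM mode, would need to remember that for itself and pick a different switch.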

  1010 *--------------------------------
  1020 * MOVE by H. Brown
  1030 * Jan 18/86
  1040 *--------------------------------
  1050 PTR    .EQ $00,01
  1060 BUFFER .EQ $200
  1070 RAMRD  .EQ $C002
  1080 RAMWRT .EQ $C004
  1090 ALTZP  .EQ $C008
  1110 ROM    .EQ $C082
  1120 RAM1   .EQ $C08B
  1130 RAM2   .EQ $C083
  1140 *--------------------------------
  1150        .OR $C00     ORG  AT BEGINNING OF A PAGE
  1160 *--------------------------------
  1190        JMP MOVE     NORMAL ENTRY
  1200 *--------------------------------
  1210 *   INIT copies COMMONPG to all 64K banks
  1220 *--------------------------------
  1240 .1     LDA BANKS,X  GET  BANK #
  1250        STA BNKSEL   SELECT 64K BANK
  1270        LDY #0       COPY PAGE
  1280 .2     LDA COMMONPG,Y
  1290        STA COMMONPG,Y
  1300        INY
  1310        BNE .2       LOOP TO END OF PAGE
  1320        DEX
  1330        BNE .1       LOOP TO START OF TABLE
  1350 *--------------------------------
  1360 *   enter MOVE with A = page (CX for 2nd DX)
  1370 *                   X = 64K bank #
  1380 *                   Carry SET for write, CLEAR for read
  1390 *--------------------------------
  1400 MOVE   BCS .3       BRANCH IF WRITING
  1410        CMP #$C0
  1420        BCS .1       BRANCH IF UPPER 16K
  1430        CPX #$FF     ---  READ 48K ---
  1440        BEQ .2       SKIP IF MAIN 64K
  1450        STA RAMRD+1  READ FROM AUX 48K
  1460        STX BNKSEL   SELECT 64K BANK
  1470        BNE .2
  1480 .1     JSR SEL16K   ---  READ 16K ---
  1490        CPX #$FF
  1500        BEQ .2       SKIP IF MAIN 64K
  1510        STX BNKSEL   SELECT 64K BANK
  1520        STA ALTZP+1  SELECT AUX 16K
  1530 .2     CLC
  1540        JSR COPYPAGE
  1550        BEQ EXIT
  1560 *--------------------------------
  1570 *   WRITING
  1580 *--------------------------------
  1590 .3     CMP #$C0
  1600        BCS .4       BRANCH IF UPPER 16K
  1610        CPX #$FF     ---  WRITE 48K ---
  1620        BEQ .5       SKIP IF MAIN 64K
  1630        STX BNKSEL
  1640        STA RAMWRT+1 WRITING TO AUX 48K
  1650        BNE .5
  1660 .4     JSR SEL16K   ---  WRITE 16K ---
  1670        CPX #$FF
  1680        BEQ .5
  1690        STX BNKSEL
  1700        STA ALTZP+1
  1710 .5     SEC
  1720        JSR COPYPAGE
  1730 *--------------------------------
  1750        STA RAMWRT   MAIN 48K
  1760        STA RAMRD
  1770        STA ALTZP    MAIN 16K
  1780        LDA ROM
  1790        RTS
  1800 *--------------------------------
  1810 *   BANKS is a table of 64K bank #'s, where
  1820 *      FF = main 64k, 00 = alt 64K when no RAMWORKS
  1830 *      00,04,08,0C = banks of a 256K RAMworks
  1840 *      1st entry is # of banks
  1850 *--------------------------------
  1860 BANKS  .HS 05       Five banks all told
  1870        .HS FF.
  1880 *--------------------------------
  1890 *   COPYPAGE copies 256 bytes
  1900 *      from (PTR) in specified bank to motherboard $200
  1910 *   or from motherboard $200 to (PTR) in specified bank
  1920 *--------------------------------
  1940        STA PTR+1
  1950        LDY #0
  1960        STY PTR
  1970        BCS .2 
  1980 .1     LDA (PTR),Y
  1990        STA BUFFER,Y
  2000        INY
  2010        BNE .1 
  2020        RTS
  2030 .2     LDA BUFFER,Y
  2040        STA (PTR),Y
  2050        INY
  2060        BNE .2 
  2070        RTS
  2080 *--------------------------------
  2090 *   SEL16K selects the appropriate bank in 16K area
  2100 *--------------------------------
  2110 SEL16K CMP #$D0
  2120        BCS .1 
  2130        LDY RAM2     C0   -> AUX D0
  2140        LDY RAM2
  2150        ADC #$10
  2160        RTS
  2170 .1     LDY RAM1     SELECT RD/WRT RAM
  2180        LDY RAM1
  2190        RTS
  2200 *--------------------------------

Correction to DOS/ProDOS Double Init Bob Sander-Cederlof

The Sep 85 (V5N12) issue of AAL included an article and program to initialize a disk with both DOS and ProDOS catalogs in separate halves of the disk. After trying to use Catalog Arranger on a disk we made with DOUBLE.INIT, we discovered that program has a bug.

The DOS catalog is written in track $11, starting with sector 15 and going backwards to sector 1. The second and third bytes in each catalog sector are supposed to point to the next catalog sector, with the exception of those bytes in the LAST catalog sector. In the last catalog sector, the link bytes should both be $00, to signal to anyone who tries to read the catalog that this is indeed the last sector. DOUBLE.INIT stored $11 in the first link byte, and so some catalog reading programs such as Catalog Arranger get very confused.
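The intended chain of link bytes can be written out with a short Python sketch. The function name is hypothetical; it simply tabulates what the paragraph above describes for sectors 15 down to 1 of track $11.

```python
CATALOG_TRACK = 0x11


def catalog_links(first_sector=15, last_sector=1):
    """Return {sector: (link_track, link_sector)} for the catalog chain."""
    links = {}
    for sector in range(first_sector, last_sector - 1, -1):
        if sector == last_sector:
            links[sector] = (0x00, 0x00)       # end-of-catalog marker
        else:
            links[sector] = (CATALOG_TRACK, sector - 1)
    return links
```

The bug amounts to sector 1 carrying a link track of $11 instead of $00.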

The fix is to add the following lines to the program, where the line numbers correspond to those in the printed listing in AAL:

       2201      BNE .5
       2202      STY C.TRACK   (Y=0)

Add the label ".5" to line 2210, so that it reads:

       2210 .5   JSR CALL.RWTS

If you have already created some disks with DOUBLE.INIT, we suggest you use a program such as Bag of Tricks, CIA, or some other disk zap program to clear the second byte of track $11, sector $01 on those disks.

An Interesting Bit of Trivia Bill Parker

Some time ago I asked Bob S-C if he knew the origin of the term "6502". Why was this particular number chosen? Bob didn't know, but referred me to Bill Mensch at Western Design Center.

Bill worked at Motorola and was on the design team that created the 6800, which later led to the development of the 68000. He left Motorola with a few others and formed MOS Technology (now absorbed into Commodore), where they developed a new microprocessor which was supposed to be an improved version of the 6800. Hence the decision was made to use a number in the 6000 series. As for the hundreds digit, Commodore already had chips that used just about every digit, except "5". Thus, the "6500" series of chips was born.

As history tells us, the first chip in the series, the 6501, was too close to Motorola's design, and had to be revised. The result was the 6502.

Apple Assembly Line is published monthly by S-C SOFTWARE CORPORATION, P.O. Box 280300, Dallas, Texas 75228. Phone (214) 324-2050. Subscription rate is $18 per year in the USA, sent Bulk Mail; add $3 for First Class postage in USA, Canada, and Mexico; add $14 postage for other countries. Back issues are available for $1.80 each (other countries add $1 per back issue for postage).

All material herein is copyrighted by S-C SOFTWARE CORPORATION, all rights reserved. (Apple is a registered trademark of Apple Computer, Inc.)