
This document has been created to provide the net.community with some detailed information about mathematical coprocessors for the Intel 80x86 CPU family. It may also help to answer some of the FAQs (frequently asked questions) about this topic. The primary focus of this document is on 80387-compatible chips, but there is also some information on the other chips in the 80x87 family and the Weitek family of coprocessors. Care was taken to make the information included as accurate as possible. If you think you have discovered erroneous information in this text, or think that a certain detail needs to be clarified, or want to suggest additions, feel free to contact me at:

       S_JUFFA@IRAVCL.IRA.UKA.DE

       or at my SnailMail address:

       Norbert Juffa
       Wielandtstr. 14
       7500 Karlsruhe 1
       Germany

This is the fifth version of this document (dated 01-13-93) and I'd like to thank those who have helped to improve it by commenting on the previous versions:

       Fred Dunlap (cyrix!fred@texsun.Central.Sun.COM), Peter Forsberg
       (peter@vnet.ibm.com), Richard Krehbiel (richk@grevyn.com), Arto
       Viitanen (av@cs.uta.fi), Jerry Whelan (guru@stasi.bradley.edu),
       Eric Johnson (johnson%camax01@uunet.UU.NET), Warren Ferguson
       (ferguson@seas.smu.edu), Bengt Ask (f89ba@efd.lth.se), Thomas Hoberg
       (tmh@prosun.first.gmd.de), Nhuan Doduc (ndoduc@framentec.fr), John
       Levine (johnl@iecc.cambridge.ma.us), David Hough (dgh@validgh.com),
       Duncan Murdoch (dmurdoch@mast.QueensU.CA), Benjamin Eitan
       (benny.iil.intel.com)

A very special thanks goes to David Ruggiero (osiris@halcyon.halcyon.com), who did a great job editing and formatting this article. Thanks David!

Contents of this document

 1) What are math coprocessors?
 2) How PC programs use a math coprocessor
 3) Which applications benefit from a math coprocessor
 4) Potential performance gains with a math coprocessor
 5) How various math coprocessors work
 6) Coprocessor emulator software
 7) Installing a math coprocessor
 8) Detailed description and specifications for all available math
    coprocessor chips
 9) Finding out which coprocessor you have (the COMPTEST program)
10) Current coprocessor prices and purchasing advice
11) The coprocessor benchmark programs (performance comparisons of
    available math coprocessors using various CPUs)
12) Clock-cycle timings for each coprocessor instruction
13) Accuracy tests and IEEE-754 conformance for various coprocessors
14) Accuracy of transcendental function calculations for various coprocessors
15) Compatibility tests with Intel's 387DX / the SMDIAG program
16) References (literature)
17) Addresses of manufacturers of math coprocessors
18) Appendix A: Test programs for partial compatibility and accuracy checks
19) Appendix B: Benchmark programs TRNSFORM and PEAKFLOP

What are math coprocessors?

A coprocessor in the traditional sense is a processor, separate from the main CPU, that extends the capabilities of a CPU in a transparent manner. This means that from the program's (and programmer's) point of view, the CPU and coprocessor together look like a single, unified machine.

The 80x87 family of math coprocessors (also known as MCPs [Math CoProcessors], NDPs [Numerical Data Processors], NPXs [Numerical Processor eXtensions], or FPUs [Floating-Point Units], or simply "math chips") are typical examples of such coprocessors. The 80x86 CPUs, with the exception of the 80486 (which has a built-in FPU), can only handle 8, 16, or 32 bit integers as their basic data types. However, many PC-based applications require not only integers, but also floating-point numbers. Simply put, floating-point numbers provide a binary representation not only of integers, but also of fractional values over a wide range. Floating-point numbers are common in scientific applications, where very small numbers (e.g., Planck's constant) and very large numbers (e.g., the speed of light) must be accurately expressed. But floating-point numbers are also useful for business applications such as computing interest, and in the geometric calculations inherent in CAD/CAM processing.

Because the instruction sets of all 80x86 CPUs directly support only integers and calculations upon integers, floating-point numbers and operations on them must be programmed indirectly by using a series of CPU integer instructions. This means that computations on floating-point numbers are far slower than ordinary integer calculations. And this is where the 80x87 coprocessors come in: adding an 80x87 to an 80x86-based system augments the CPU architecture with eight floating-point registers, five additional data types and over 70 additional instructions, all designed to deal directly with floating-point numbers as a basic data type. This removes the 'penalty' for floating-point computations, and greatly increases overall system performance for applications which depend heavily on these calculations.

In addition to being able to quickly execute load/store operations on floating-point numbers, the 80x87 coprocessors can directly perform all the basic arithmetic operations on them. Besides "knowing" how to add, subtract, multiply and divide floating-point numbers, they can also operate on them to perform comparisons, square roots, transcendental functions (such as logarithms and sine/cosine/tangent), and compute their absolute value and remainder.

Like most things in life, floating-point arithmetic has been standardized. The relevant standard (to which I will refer quite often in this document) is the "IEEE-754 Standard for Binary Floating-Point Arithmetic" [10,11]. The standard specifies numeric formats, value sets and how the basic arithmetic (+,-,*,/,sqrt, remainder) has to work. All the coprocessors covered in this document claim full or at least partial compliance with the IEEE-754 standard.

How PC programs use 80x87 and Weitek coprocessors

The basic data type used by all 80x87 coprocessors is an 80-bit long floating-point number. This data type (called "temporary real" or "double extended precision") can directly represent numbers which range in size between 3.36*10^-4932 and 1.19*10^4932 (3.65*10^-4951 to 1.19*10^4932 including denormal numbers) where '^' denotes the power operator. (For those familiar with floating-point formats, this format has 64 mantissa bits, 15 exponent bits and 1 sign bit, for a total of 80 bits.) This format provides a precision of about 19 decimal places. 80x87s can also handle additional data types that are converted to/from the internal format upon being loaded or stored to/from the coprocessor. These include 16 bit, 32 bit, and 64 bit integers as well as a BCD (binary coded decimal) data type that occupies 10 bytes and provides 18 decimal digits.
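The range limits quoted above follow from the field widths alone. As a rough sketch (Python is used here purely for illustration, and the function name `format_limits` is made up for this example), they can be recomputed with arbitrary-precision decimal arithmetic:

```python
from decimal import Decimal, getcontext

getcontext().prec = 40  # far more digits than the three quoted in the text


def format_limits(exp_bits, frac_bits):
    """Return (smallest denormal, smallest normal, largest) magnitudes.

    frac_bits counts the mantissa bits to the right of the binary point.
    For the 80-bit format this is 63, because one of its 64 mantissa
    bits is the explicitly stored integer bit.
    """
    two = Decimal(2)
    bias = 2 ** (exp_bits - 1) - 1
    emin, emax = 1 - bias, bias
    smallest_denormal = two ** (emin - frac_bits)
    smallest_normal = two ** emin
    largest = (two - two ** (-frac_bits)) * two ** emax
    return smallest_denormal, smallest_normal, largest


if __name__ == "__main__":
    # 15 exponent bits, 63 fraction bits: the 80x87 double extended format
    for value in format_limits(15, 63):
        print(format(value, ".2E"))
```

The same function should work for any binary format of this general layout, given its exponent width and the number of fraction bits.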

The 80x87 also supports two additional floating-point types. The short real data type (also called "single-precision") has 32 bits that split into 23 mantissa bits, 8 exponent bits and a sign bit. By using the "hidden bit" technique, the effective length of the mantissa is increased to 24 bits. (The hidden bit technique exploits the fact that for normalized floating-point numbers, the mantissa m always is in the range 1 <= m < 2. Since the first mantissa bit represents the integer part of the mantissa, it is always set for normalized numbers, and therefore need not be stored, as it is guaranteed to always be 1.) The IEEE single-precision format provides a precision of about 6-7 decimal places and can represent numbers between 1.17*10^-38 and 3.40*10^38 (1.40*10^-45 to 3.40*10^38 including denormal numbers). The long real, or double-precision, data type has 64 bits, consisting of 52 mantissa bits, 11 exponent bits, and the sign bit. It provides 15-16 decimal digits of precision and can handle numbers from 2.22*10^-308 to 1.79*10^308 (4.94*10^-324 to 1.79*10^308 including denormal numbers). (This format also uses the hidden bit technique to provide effectively 53 mantissa bits.)
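To make the hidden bit technique concrete, here is a small illustrative sketch in Python (not part of any coprocessor toolchain; the name `decode_single` is invented for this example). It splits a single-precision number into its three fields and rebuilds the value, restoring the hidden bit for normalized numbers:

```python
import struct


def decode_single(x):
    """Split x into IEEE single-precision fields and rebuild its value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits, stored with bias 127
    fraction = bits & 0x7FFFFF       # the 23 stored mantissa bits
    if exponent == 0xFF:
        raise ValueError("infinity or NaN: not handled in this sketch")
    if exponent == 0:
        # Denormal (or zero): there is no hidden bit, exponent fixed at -126.
        mantissa, scale = fraction, -126 - 23
    else:
        # Normalized: put the hidden leading 1 back (24 effective bits).
        mantissa, scale = (1 << 23) | fraction, exponent - 127 - 23
    value = (-1) ** sign * mantissa * 2.0 ** scale
    return sign, exponent, fraction, value
```

For example, `decode_single(1.5)` yields a stored fraction of 400000 hexadecimal (only the bit for 0.5 is set); the leading 1 of the mantissa 1.5 never appears in the stored bits.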

The eight registers in the 80x87 are organized in a stack-like manner, which takes some getting used to if one programs the coprocessor directly in assembly language. However, nowadays the compilers or interpreters for most high-level languages (HLLs) give a programmer easy access to the coprocessor's data types and instructions, so there is not much need to deal directly with the rather unusual architecture of the 80x87.

The architecture of the Weitek chips differs significantly from the 80x87. Strictly speaking, the Weitek Abacus 3167 and 4167 are not coprocessors in that they do not transparently extend the CPU architecture; rather, they could be described as highly-specialized, memory-mapped IO devices. But as the term "coprocessor" has been traditionally used for these chips, they will be referred to as such here.

The Weitek coprocessors have a RISC-like architecture which has been tuned for maximum performance. Only a small instruction set has been implemented in the chip, but each instruction executes at a very high speed (usually only a few clock cycles each). Instructions available include load/store, add, subtract, subtract reverse, multiply, multiply and negate, multiply and accumulate, multiply and take absolute value, divide reverse, negate, absolute value, compare/test, convert fix/float, and square root. In contrast to the 80x87 family, the Weitek Abacus does not support a double extended format, has no built-in transcendental functions, and does not support denormals. The resources required to implement such features have instead been devoted to implement the basic arithmetic operations as fast as possible.

While the 80x87 coprocessors perform all internal calculations in double extended precision and therefore have about the same performance for single and double-precision calculations, the Weitek features explicit single and double-precision operations. For applications that require only single-precision operations, the Weitek can therefore provide very high performance, as single-precision operations are about twice as fast as their double-precision counterparts. Also, since the Weitek Abacus has more registers than the 80x87 coprocessors (31 versus 8), values can be kept in registers more often and have to be loaded from memory less frequently. This also leads to performance gains.

The Weitek's register file consists of 31 32-bit registers, each one capable of holding an IEEE single-precision number. Pairs of consecutive single-precision registers can also be used as 64-bit IEEE double-precision registers; thus there are 15 double-precision registers. The Weitek register file has a conventional organization, like the register file of the 80386, not the special stack-like organization of the 80x87 coprocessors.

To the main CPU, the Weitek Abacus appears as a 64 KB block of memory starting at physical address 0C0000000h. Each address in this range corresponds to a coprocessor instruction. Accessing a specified memory location within this block with a MOV instruction causes the corresponding Weitek instruction to be executed. (The instructions have been cleverly assigned to memory locations in such a way that loads to consecutive coprocessor registers can make use of the 386/486 MOVS string instruction.) This memory-mapped interface is much faster than the IO-oriented protocol that is used to couple the CPU to an 80287 or 80387 coprocessor. The Weitek's memory block can actually be assigned to any logical address using the MMU (memory management unit) in the 386/486's protected and virtual modes. This also means that the Weitek Abacus *cannot* be used in the real mode of those processors, since its physical starting address (0C0000000h) is not within the 1 MByte address range and the MMU is inoperable in real mode. However, DOS programs can make use of the Weitek by using a DOS extender or a memory manager (such as QEMM or EMM386) that runs in protected/virtual mode itself and can therefore map the Weitek's memory block to any desired location in the 1 MByte address range.

Typically the FS segment register is then set up to point to the Weitek's memory block. On the 80486, this technique has severe drawbacks, as using the FS: prefix takes an additional clock cycle, thereby nearly halving the performance of the 4167. Most DOS-based compilers exhibit this problem, so the only way around it is to code in assembly language [75]. The Weitek Abacus 3167 and 4167 are also supported by the UNIX operating system [33].

Which application programs benefit from a math coprocessor

According to the Intel 387DX User's Guide, there are more than 2100 commercial programs that can make use of a 387-compatible coprocessor. Every program that uses floating-point arithmetic somewhere and contains the instructions to support an 80x87 or Weitek chip can gain speed by installing one. However, the speedup will vary from program to program (and even within the same program) depending on how computation-intensive the program or operation within the program is. Typical applications that benefit from the use of a math coprocessor are:

CAD programs (AutoCAD, VersaCAD, GenericCAD)
Spreadsheet programs (Lotus 1-2-3, Excel, Quattro, Wingz)
Business graphics programs (Arts&Letters, Freedom of Press, Freelance)
Mathematical analysis and statistical programs (Mathematica, TKSolver, SPSS/PC, Statgraphics)

Database programs (dBase IV, FoxBase, Paradox, Revelation)

Note that for spreadsheets and databases, a coprocessor only helps if some kind of floating-point computation is performed; this is true more often for spreadsheets than for databases. Also note that the speed of many programs depends quite heavily on factors such as the speed of the graphics adapter (CAD) or the disk performance (databases), so the computational performance is only a (small) part of the total performance of the application. There are some programs that won't run without a coprocessor, among them AutoCAD (R10 and later) and Mathematica.

Most GUIs (graphical user interfaces) such as Microsoft Windows or the OS/2 Presentation Manager do *not* gain additional speed from using a *mathematical* coprocessor, since their graphics operations only use integer arithmetic [71]. They *will* benefit from a graphics board with a graphics "coprocessor" that speeds up certain common graphics operations such as BitBlt or line drawing. A few GUIs used on PCs, such as X-Windows, use a certain amount of floating-point operations for operations such as arc drawing. However, the use of floating-point operations in X-Windows seems to have decreased significantly in versions after X11R3, so the overall performance impact of a coprocessor is small [72]. Applications running under any GUI may take advantage of a math coprocessor, of course (for example, Microsoft Excel running under Windows).

While support for 80x87 coprocessors is very common in application programs, the Weitek Abacus coprocessors do not enjoy such widespread support. Due to their higher price, only a few high-end PCs have been equipped with Weitek coprocessors. Some machines, such as IBM's PS/2 series, do not even have sockets to accommodate them. Therefore, most of the programs that support these coprocessors are also high-end products, like AutoCAD and VersaCAD-386.

Potential performance gains with a coprocessor

The Intel Math Coprocessor Utilities Disk that accompanies the Intel 387DX coprocessor has a demonstration program that shows the speedup of certain application programs when run with the Intel coprocessor versus a system with no coprocessor:

       Application       Time w/o 387   Time w/387    Speedup

       Arts&Letters         87.0 sec      34.8 sec     150%
       Quattro Pro           8.0 sec       4.0 sec     100%
       Wingz                17.9 sec       9.1 sec      97%
       Mathematica         420.2 sec     337.0 sec      25%

       The following table is an excerpt from [70]:

       Application        Time w/o 387   Time w/387  Speedup

       Corel Draw          471.0 sec     416.0 sec      13%
       Freedom Of Press    163.0 sec      77.0 sec     112%
       Lotus 1-2-3         257.0 sec      43.0 sec     597%

       The following table is an excerpt from [25]:

       Application        Time w/o 387   Time w/387  Speedup

       Design CAD, Test1    98.1 sec      50.0 sec      96%
       Design CAD, Test2    75.3 sec      35.0 sec     115%
       Excel, Test 1         9.2 sec       6.8 sec      35%
       Excel, Test 2        12.6 sec       9.3 sec      35%
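The "Speedup" column in these tables appears to express how much faster the run with a coprocessor is, that is (time without / time with - 1) in percent. This reading is inferred from the figures rather than stated in the sources. A minimal Python sketch (the function name is made up for this example) that recomputes a few rows:

```python
def speedup_percent(time_without, time_with):
    """Percentage by which the run with a coprocessor is faster."""
    return (time_without / time_with - 1.0) * 100.0


if __name__ == "__main__":
    # (application, seconds without 387, seconds with 387), from the tables
    rows = [
        ("Quattro Pro", 8.0, 4.0),
        ("Wingz", 17.9, 9.1),
        ("Mathematica", 420.2, 337.0),
    ]
    for name, t_without, t_with in rows:
        print(f"{name:12s} {speedup_percent(t_without, t_with):5.0f}%")
```

Under this definition, a speedup of 100% means the run time has been halved.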

Note that coprocessor performance also depends on the motherboard or, more specifically, on the chipset used on the motherboard. In [34] and [35], identically configured motherboards using different 386 chipsets were tested. Among other tests, a coprocessor benchmark based on a fractal computation was run and its execution time recorded. The following tables, copied from these articles in abridged form, show how coprocessor performance varies with the chipset:

                Cyrix                                   Cyrix
  chip set      387+                 chip set           83D87

  Opti, 40 MHz  24.57 sec   97.0%    PC-Chips, 33 MHz  26.97 sec   93.0%
  Elite,40 MHz  24.46 sec   97.4%    UMC,      33 MHz  27.69 sec   90.5%
  ACT,  40 MHz  23.84 sec  100.0%    Headland, 33 MHz  25.08 sec  100.0%
  Forex,40 MHz  23.84 sec  100.0%    Eteq,     33 MHz  27.38 sec   91.6%

This shows that performance of the same coprocessor can vary by up to ~10% depending on the chipset used on your board, at least for 386 motherboards (similar numbers for 286, 386SX, and 486 are, unfortunately, not available). The benchmarks for this article were run on a motherboard with the Forex chip set, one of the fastest 386 chip sets available, and not only with respect to floating-point performance [35].

How various math coprocessors work

In any 80x86 system with an 80x87 math coprocessor, CPU instructions and coprocessor instructions are executed concurrently. This means that the CPU can execute CPU instructions while the coprocessor executes a coprocessor instruction at the same time. The concurrency is restricted somewhat by the fact that the CPU has to aid the coprocessor in certain operations. As the CPU and the coprocessor are fed from the same instruction stream and both processors may operate on the same data, there has to be a synchronizing mechanism between the CPU and the coprocessor.

The 8087

In 8086/8088 systems with 8087 coprocessors, both chips look at every opcode coming in from the bus. To do this, both chips have the same BIU (bus interface unit) and the 8086 BIU sends the status signals of its prefetch queue to the 8087 BIU. This ensures that both processors always decode the same instructions in parallel. Since all coprocessor instructions start with the bit pattern 11011, it is easy for the 8087 to ignore all other instructions. Likewise the CPU ignores all coprocessor instructions, unless they access memory. In this case, the CPU computes the address of the LSB (least significant byte) of the memory operand and does a dummy read. The 8087 then takes the data from the data bus. If more than one memory access is needed to load a memory operand, the 8087 requests the bus from the CPU, generates the consecutive addresses of the operand's bytes and fetches them from the data bus. After completing the operation, the 8087 hands bus control back to the CPU. Since the 8087 and the CPU are hooked up to the same synchronous bus, they must run at the same speed. This means that with the 8087, only synchronous operation of CPU and coprocessor is possible.
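The check for the 11011 bit pattern amounts to masking the upper five bits of the first opcode byte. The following Python fragment is only a software illustration of that test (the 8087 does this in hardware, and the names used are invented for this example):

```python
ESC_MASK = 0xF8     # selects the upper five bits of the opcode byte
ESC_PATTERN = 0xD8  # 11011000 binary: the pattern 11011 in those five bits


def is_coprocessor_opcode(first_byte):
    """True if the byte starts a coprocessor (ESC) instruction."""
    return (first_byte & ESC_MASK) == ESC_PATTERN
```

Exactly eight byte values pass this test, D8 through DF hexadecimal.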

Another 8087 coprocessor instruction can only be started if the previous one has been completed in the NEU (numerical execution unit) of the 8087. To prevent the 8086 from decoding a new coprocessor instruction while the 8087 is still executing the previous coprocessor instruction, a coding mechanism is employed: All 8087-capable compilers and assemblers automatically generate a WAIT instruction before each coprocessor instruction. The WAIT instruction tests the CPU's /TEST pin and suspends execution until its input becomes "LOW". In all 8086/8087 systems, the 8086 /TEST pin is connected to the 8087 BUSY pin. As long as the NEU executes a coprocessor instruction, it forces its BUSY pin "HIGH"; thus, the WAIT opcode preceding the coprocessor instruction stops the CPU until any still-executing coprocessor instruction has finished.

The same synchronization is used before the CPU accesses data that was written by the coprocessor. A WAIT instruction after any coprocessor instruction that writes to memory causes the CPU to stop until the coprocessor has completed transfer of the data to memory, after which the CPU can safely access it.
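As a rough model of the first kind of synchronization, the sketch below shows what an 8087-capable assembler effectively does to a stream of already-encoded instructions. It is a simplification with invented names: each instruction is given as a separate byte string, segment-override prefixes are ignored, and the WAITs needed after coprocessor stores are not modelled.

```python
WAIT = 0x9B  # opcode of the WAIT instruction


def insert_waits(instructions):
    """Put a WAIT in front of every coprocessor instruction.

    instructions is a list of bytes objects, one per machine instruction.
    """
    result = []
    for instruction in instructions:
        if (instruction[0] & 0xF8) == 0xD8:  # starts with bit pattern 11011
            result.append(bytes([WAIT]))
        result.append(instruction)
    return result


if __name__ == "__main__":
    program = [
        bytes([0xD9, 0x07]),  # FLD dword ptr [BX]
        bytes([0x90]),        # NOP (an ordinary CPU instruction)
        bytes([0xD8, 0xC1]),  # FADD ST, ST(1)
    ]
    print(b"".join(insert_waits(program)).hex(" "))
```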

The 80287

The 80287 coprocessor-CPU interface is totally different from the 8087 design. Since the 80286 implements memory protection via an MMU based on segmentation, it would have been much too expensive to duplicate the whole memory protection logic on the coprocessor, which an interface solution similar to the 8087 would have required. Instead, in an 80286/80287 system, the CPU fetches and stores all opcodes and operands for the coprocessor. Information is then passed through the CPU ports F8h-FFh. (As these ports are accessible under program control, care must be taken in user programs not to accidentally perform write operations to them, as this could corrupt data in the math coprocessor.)

The 8086/8087 combination can be characterized as a cooperation of partners with equal rights, while the 80286/287 is more a master-slave relationship. This makes synchronization easier, since the complete instruction and data flow of the coprocessor goes through the CPU. Before executing most coprocessor instructions, the 80286 tests its /BUSY pin, which is tied to the 287 coprocessor and signals if the 80287 is still executing a previous coprocessor instruction or has encountered an exception. The 80286 then waits until the /BUSY signal goes "low" before loading the next coprocessor instruction into the 80287. Therefore, a WAIT instruction before every coprocessor instruction is not required. These WAITs are permissible, but not necessary, in 80287 programs. The second form of WAIT synchronization (after the coprocessor has written a memory operand) *is* still necessary on 286/287 systems.

The execution unit of the 80287 is practically identical to that of the 8087; that is, nearly all coprocessor instructions execute in the same number of clock cycles on both coprocessors. However, due to the additional overhead of the 80287's CPU/coprocessor interface (at least ~40 clock cycles), an 8 MHz 80286/80287 combination can have lower floating-point performance than an 8086/8087 system running at the same speed. Additionally, older 286 boards were often configured to run the coprocessor at only 2/3 the speed of the CPU, making use of the ability of the 80287 to run asynchronously: The 80287 has a CKM pin that causes the incoming system clock to be divided by three for the coprocessor if it is tied to ground. The 80286 always divides the system clock by two internally, hence the final ratio of 2/3. However, when the CKM (ClocK Mode) pin is tied high on the 80287, it does not divide the CLK input. This feature has been exploited by the makers of coprocessor speed sockets. These sockets tie CKM high and supply their own CLK signal with a built-in oscillator, thereby allowing the 80287 or compatible to run at a much higher speed than the CPU. With an IIT or Cyrix 287 one can have a 20 MHz coprocessor running with an 8 MHz 80286! Note, however, that the floating-point performance of such a configuration does not scale linearly with the coprocessor clock, since all the data has to be passed through the much slower CPU. If the coprocessor executes mostly simple instructions (such as addition and multiplication), doubling the coprocessor clock to 20 MHz in a 10 MHz system does not show any performance increase at all [24].
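The 2/3 ratio is simple divider arithmetic. As a small sketch (the 16 MHz system clock is an assumed example value and the function name is made up), using exact fractions in Python:

```python
from fractions import Fraction


def clocks(system_clock_mhz, fpu_divisor=3):
    """Return (CPU clock, coprocessor clock, coprocessor/CPU ratio).

    The 80286 divides the system clock by two; an 80287 with CKM tied to
    ground divides it by three (the default fpu_divisor here).
    """
    cpu = Fraction(system_clock_mhz) / 2
    fpu = Fraction(system_clock_mhz) / fpu_divisor
    return cpu, fpu, fpu / cpu


if __name__ == "__main__":
    cpu, fpu, ratio = clocks(16)
    print(f"CPU {float(cpu):.2f} MHz, 80287 {float(fpu):.2f} MHz, ratio {ratio}")
```

With a 16 MHz system clock, the 80286 runs at 8 MHz and the 80287 at about 5.33 MHz.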

The Intel 80287XL, the Cyrix 82S87, and the IIT 2C87 contain the internals of a 387 coprocessor, but are pin-compatible with the original 287. These chips divide the system clock by two internally, as opposed to three in the original 80287. Since the 80286 also divides the system clock by two, they usually run synchronously with respect to the CPU, although they can also be run asynchronously.

The 80387

The coprocessor interface in 80386/80387 systems is very similar to the one found in 286/287 systems. However, to prevent corruption of the coprocessor's contents by programming errors, the IO ports 800000F8h-800000FFh are used, which are not accessible to programs. The CPU/coprocessor interface has been optimized and uses full 32-bit transfers; the interface overhead has been reduced to about 14-20 clock cycles. For some operations on the 387 'clones' that take less than about 16 clock cycles to complete, this overhead effectively limits the execution rate of coprocessor instructions. The only sensible solution to provide even higher floating-point performance was to integrate the CPU and coprocessor functionality onto the same chip, which is exactly what Intel did with the 80486 CPU. The FPU in the 486 also benefits from the instruction pipelining and from the on-chip cache.

Coprocessor emulators

In the absence of a coprocessor, floating-point calculations are often performed by a software package that simulates its operations. Such a program is called a coprocessor emulator. Simulating the coprocessor has the advantage for application programs that identical code can be generated for use with either the coprocessor or the emulator, so that it's possible to write programs that run on any system without regard to whether a coprocessor is present or not. Whether the program will use an actual coprocessor or software emulating it can easily be determined at run-time by detecting the presence or absence of the coprocessor chip.

Two approaches to interfacing an 80x87 emulator to programs are common. The first method makes use of the fact that all coprocessor instructions start with the same five bit pattern 11011. Thus the first byte of a coprocessor instruction will be in the range D8-DF hexadecimal. In addition, coprocessor instructions usually are preceded by a WAIT instruction (opcode 9Bh) which is one byte long (the reason for doing this has been described in the previous chapter dealing with the operating details of the 80x87). One common approach is to replace the WAIT instruction and the first byte of the coprocessor instruction with one out of eight interrupt instructions; the remaining bytes of the coprocessor instruction are left unchanged. Interrupts 34 to 3B hexadecimal are used for this emulation technique. (Note that the sequences 9B D8 … 9B DF can be easily converted to the interrupt instructions CD 34 … CD 3B by simple addition and subtraction of constants.) The compiler or assembler initially produces code that contains the appropriate interrupt calls instead of the coprocessor instructions. If a hardware coprocessor is detected at run-time, the emulator interrupts point to a short routine that converts the interrupt calls back to coprocessor instructions (yes, this is known as "self-modifying code"). If no coprocessor is found, the interrupts point to the emulation package, which examines the byte(s) following the interrupt instruction to determine which floating-point operation to perform. This method is used by many compilers, including those from Microsoft and Borland. It works with every 80x86 CPU from the 8086/8088 on.
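The conversion by "simple addition and subtraction of constants" can be sketched as follows. This Python fragment only illustrates the byte arithmetic (add 32h to the first byte, subtract A4h from the second); it is not taken from any compiler's run-time library, and the function names are invented:

```python
WAIT, INT = 0x9B, 0xCD  # opcodes of WAIT and of the INT instruction


def to_interrupt(code):
    """Rewrite 'WAIT + coprocessor opcode' (9B D8..DF) as CD 34..3B."""
    if len(code) < 2 or code[0] != WAIT or (code[1] & 0xF8) != 0xD8:
        raise ValueError("not a WAIT followed by a coprocessor instruction")
    return bytes([code[0] + 0x32, code[1] - 0xA4]) + bytes(code[2:])


def to_coprocessor(code):
    """Inverse patch, as applied when a hardware coprocessor is found."""
    if len(code) < 2 or code[0] != INT or not 0x34 <= code[1] <= 0x3B:
        raise ValueError("not an emulator interrupt call")
    return bytes([code[0] - 0x32, code[1] + 0xA4]) + bytes(code[2:])
```

Any bytes after the first two (the rest of the coprocessor instruction) pass through unchanged in both directions.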

The second method of interfacing an emulator is only available on 286/386/486 machines. If the emulation bit in the machine status word of these processors is set, the processors will generate an interrupt 7 whenever a coprocessor instruction is encountered. The vector for this interrupt will have been set up to point at an emulation package that decodes the instruction and performs the desired operation. This approach has the advantage that the emulator doesn't have to be included in the program code, but can be loaded once (as a TSR or device driver) and then used by every program that requires a coprocessor. Emulation via interrupt 7 is transparent, which means that programs containing coprocessor instructions execute just as if a coprocessor were present, only more slowly. This approach is taken by the public domain EM87 emulator, the shareware program Q387, and the commercial Franke387 emulator, for example. Even programs that require a coprocessor to run, such as AutoCAD, are 'fooled' into believing that a coprocessor is present by emulators using INT 7.

Operating systems such as OS/2 2.0 and Windows 3.1 provide coprocessor emulations using INT 7 automatically if they do not find a coprocessor to be installed. The emulator in Windows doesn't seem to be very fast, as people who have ported their Turbo Pascal programs from the TP 6.0 DOS compiler (using the emulation built into the TP 6.0 run-time library) to the TPW 1.5 Windows compiler (using MS Windows' emulator) have noticed. Slowdowns of as much as a factor of five have been reported [79].

The size of the emulator used by TP 6.0 is about 9.5 KB, while EM87 occupies about 15.8 KB as a TSR, and Franke387 uses about 13.4 KB as a device driver. Note that Franke387 and especially EM87 model a real coprocessor much more closely than Turbo Pascal's emulator does. In particular, EM87 supports denormal numbers, precision control, and rounding control. The emulator in TP 6.0 does not implement these features. The version of Franke387 tested (V2.4) supports denormals in single and double-precision, but not double extended precision, and it supports precision control, but not rounding control. The recently introduced shareware program Q387 only runs on 386, 386SX, 486SX and compatible processors. The program loads completely into extended memory and uses about 330 KB. To enable INT 7 trapping to a service routine in extended memory it needs to run with a memory manager (e.g. EMM386, QEMM, or 386MAX). The huge size of the program stems from the fact that it was solely optimized for speed, assuming that extended memory is a cheap resource. Presumably it uses large tables to speed computations. Intel's E80287 program is supposed to be a 100% exact emulation of the 80287 coprocessor [44]. Note that the more closely a real coprocessor is modelled by the emulator, the slower the emulator runs and the larger the code for the emulator gets.

       Relative execution times of coprocessor vs. software emulators
       for selected coprocessor instructions

                      Intel 387DX    TP 6.0 Emulator   EM87 Emulator

       FADD ST, ST(0)       1              26                104
       FDIV [DWord]         1              22                136
       FXAM                 1              10                 73
       FYL2X                1              33                102
       FPATAN               1              36                110
       F2XM1                1              38                110

       The following table is an excerpt from [44]:

                      Intel 80287  Intel E80287 Emulator

       FADD ST, ST(0)       1              42
       FDIV [DWord]         1             266
       FXAM                 1             139
       FYL2X                1              99
       FPATAN               1             153
       F2XM1                1              41

       The following has been adapted from [43] and merged with my own
       data:

                      Intel 8087  TP 6.0 Emul. (8086)  Intel Emul. (8086)

       FADD ST, ST(0)       1              20                 94
       FDIV [DWord]         1              22                 82
       FPTAN                1              18                144
       F2XM1                1               6                171
       FSQRT                1              44                544

One of the reasons emulators are so slow is that they are often designed to run with every CPU from the 8086/8088 on upwards. This is the case with the emulators built into the compiler libraries of Turbo Pascal 6.0 (also used by Turbo C/C++) and Microsoft C 6.0 (probably also used in other Microsoft products), and it is also true for the public domain EM87 emulator. By using code that can run on an 8086/8088, these emulators forego the speed advantage offered by the additional instructions and architectural enhancements (such as 32-bit registers) of the more advanced Intel 80x86 processors. A notable exception to this is the Franke387 emulator, a commercial emulator that is also sold as shareware. It uses 386-specific 32-bit code and only runs on 386/386SX/486SX computers.

Besides being slow, coprocessor emulators have other drawbacks when compared with real coprocessors. Most of the emulators do not support the additional instructions that the 387-compatible coprocessors offer over the 80287. Often, some of the low-level stack-manipulating instructions like FDECSTP are not emulated. For example, [76] lists the coprocessor instructions not emulated by Microsoft's emulator (included in the MS-C and MS-FORTRAN libraries) as follows:

       FCOS         FRSTOR      FSINCOS      FXTRACT
       FDECSTP      FSAVE       FUCOM
       FINCSTP      FSETPM      FUCOMP
       FPREM1       FSIN        FUCOMPP

Additionally, some parts of the coprocessor architecture, like the status register, are often not emulated at all or only partially emulated. Some emulators do not conform to the IEEE-754 standard in their implementation of the basic arithmetic functions, while the hardware coprocessors do. Also, they sometimes lack support for denormals (a special class of floating-point numbers), although it is required by the standard. Not all the 80x87 emulators support rounding control and precision control, which are also features required by IEEE-754. Most of these omissions are aimed at making the emulator faster and smaller. Because of the performance gap and these other shortcomings of coprocessor emulators, a real coprocessor is a must for anybody planning to do some serious computations. (At today's prices, this shouldn't pose much of a problem to anybody!)
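
To illustrate what support for denormals means in practice, here is a small Python sketch. It assumes that Python's float type is an IEEE-754 double, which holds on virtually all current machines, and it is not related to any of the emulators discussed here. The helper name is_denormal is made up for this example.

```python
import sys

def is_denormal(x):
    """Return True if x is nonzero but smaller than the smallest normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min

smallest_normal = sys.float_info.min   # 2**-1022 for IEEE-754 doubles
halved = smallest_normal / 2.0         # gradual underflow produces a denormal

# With denormal support the result is tiny but nonzero. An implementation
# without denormals would have to flush it to zero instead.
print(halved > 0.0, is_denormal(halved))
```

An emulator that lacks denormals would typically return zero for the division above, so a later multiplication by 2 could not recover the original value.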

Nhuan Doduc (ndoduc@framentec.fr) has tested a number of standalone coprocessor emulators for PCs, among them the two emulators, EM87 and Franke387 V2.4, already mentioned. He found Franke387 to be the best in terms of reliability, speed, and accuracy.

Installing a math coprocessor

Usually, installing a coprocessor doesn't pose much of a problem, as every coprocessor comes with installation instructions and a diagnostic disk that lets you check its correct operation after installation. In addition, the user manuals of most computers have a section on coprocessor installation.

1) Make sure to buy the right coprocessor for your system. An 8087 works
   together with 8086, 8088, V20, and V30 CPUs. An 80287, 287XL or
   compatible works with an 80286 CPU. (There are also some old 386
   motherboards that accept an 80287 coprocessor, but they usually also
   provide a socket for the 387; given today's pricing, it makes no sense
   not to get a 387 for these systems.) An 80387, 387DX or compatible
   coprocessor is for 386-based systems, as is the Intel RapidCAD. 387
   coprocessors also work with the Cyrix 486DLC CPU (which, despite its
   name, does not include an FPU). Similarly, the 387SX or compatible
   coprocessors go into systems whose CPU is a 386SX or Cyrix 486SLC.

   The Weitek Abacus 3167 works with a 386 CPU but requires a 121-pin EMC
   socket in the system; this is *not* the same socket used by an 80387 or
   compatible chip, and some computers, such as IBM's PS/2s, don't have
   this socket. The Weitek Abacus 4167 works together with the 486 and
   requires a special 142-pin socket to be present.

2) Always install a coprocessor that's rated at the same clock speed as the
   CPU. For example, in a 40 MHz 386 system using an AMD Am386-40, install
   a coprocessor rated for 40 MHz such as a Cyrix 83D87-40, C&T 38700DX-40,
   IIT 3C87-40, or ULSI 83C87-40. Running a coprocessor above its specified
   frequency rating may cause it to produce false results, which you might
   fail to recognize as such. (I have personally experienced this problem
   with a Cyrix 83D87-33 that I tried to push to 40 MHz. It passed all the
   diagnostic benchmarks on the Cyrix diagnostic disk and the tests of some
   commercial system test programs. However, I found it to fail the
   Whetstone and Linpack benchmarks, which include accuracy checks.)
   Although there is usually no problem with overheating when pushing a
   coprocessor over the specified maximum frequency rating, be warned that
   operation of a coprocessor above the maximum ratings stated by the
   manufacturer may make its operation unreliable.

   Some 386 boards allow the coprocessor to be clocked differently than the
   CPU. This is called "asynchronous operation" and allows you, for
   example, to run the coprocessor at 33 MHz while the CPU runs at 40 MHz.
   Of the currently available math coprocessors, only the Intel 80387 and
   387DX support asynchronous operation. The 387-compatible "clones" from
   Cyrix, C&T, IIT and ULSI always run at the full speed of the CPU, even
   if you have set up your motherboard for asynchronous operation.

3) Once you've got the correct coprocessor for your system you can start
   the actual installation process. Turn off the computer's power switch
   and unplug the power cord from the wall outlet, remove the case, and
   locate the math coprocessor socket. This socket is always located right
   next to the main CPU, which can be identified by the printing on top of
   the chip. (It's also usually one of the biggest chips on the board.) The
   8087 and 80287 DIL sockets are rectangular sockets with 20 pin holes on
   each of the longer sides. The 387SX PLCC socket is a square socket that
   has 17 vertical connector strips on the 'wall' of each side. The 387 PGA
   socket is square and has two rows of pin holes on each side. The EMC
   socket for the Weitek 3167 is similar but has three rows of holes on
   each side. The PGA socket for the Weitek 4167 is also square with three
   rows of holes on each side. If you can't find the math coprocessor
   socket, consult your owner's manual, your computer dealer, or a
   knowledgeable friend.

   If you are installing the Intel RapidCAD chipset in a 386 system, you
   will have to remove the 386 CPU first. Intel provides an easy-to-use
   chip extractor and a storage box for the 386 chip for this purpose. Just
   follow the instructions in the RapidCAD installation manual.

   On many systems, the motherboard is supported only at a small number of
   points. Since considerable force is required to insert a pin grid chip
   like the 80387, RapidCAD, or Weitek Abacus 3167 into its socket, the
   board may bend quite a lot due to the insertion pressure. This could
   cause cracks in the board's conductive traces that may render it
   intermittently or completely inoperable. Damage done to the board in
   this way is usually not covered by the computer's warranty! Therefore,
   it may be a good idea to first check how much the board bends by
   pressing on the math coprocessor socket with your finger. If you find it
   to bend easily, try to put something under the board directly beneath
   the coprocessor socket. If this is impossible, as it is in many desktop
   cases, consider removing the whole motherboard from the case, and
   placing it on a hard, flat surface free of static electricity. (You will
   also have to do this if your system's CPU and coprocessor socket are on
   a separate card rather than on the motherboard, as is typical in many
   modular systems.)

   Be sure you are properly grounded before you remove the coprocessor from
   its antistatic box, as even a tiny jolt of static electricity can ruin
   the coprocessor. Make sure you do not touch the pins on the bottom of
   the chip.

   Check the pins and make sure none are bent; if some are, you can
   *carefully* straighten them with needle-nose pliers or tweezers.

4) Match the coprocessor's orientation with the orientation of the socket.
   Correct orientation of the coprocessor is absolutely essential, because
   if you insert it the wrong way it may be damaged.

   8087 and 287 coprocessors have a notch on one of the shorter sides of
   their rectangular DIL package that should be matched with the notch of
   the coprocessor socket. Usually the 286 CPU and the 287 coprocessor are
   placed alongside each other and both have the same orientation (that
   is, their respective notches point in the same direction). 387SX
   coprocessors feature a white dot or similar mark that matches with some
   sort of marking on the socket. 387 coprocessors have a bevelled corner
   that is also marked with a white dot or similar marking. This should be
   matched with the bevelled or otherwise marked corner of the socket. If
   your system has only a large EMC socket and you are installing a 387 in
   it, you will leave one row of pin holes free on each side of the chip.

   Once you have found the correct orientation, place the chip over the
   socket and make sure all pins are correctly aligned with their
   respective holes. Press firmly and evenly on the chip -- you may have to
   press hard to seat the coprocessor all the way. Again, make sure your
   motherboard does not bend more than slightly under the insertion
   pressure. For 8087, 287, and 387 coprocessors it is normal that the
   coprocessor does not go all the way in; about one millimeter (1/25 inch)
   of space is usually left between the socket and the bottom of the
   coprocessor chip. (This allows the insertion of an extraction device
   should it become necessary to remove the chip. Note that the
   construction of the 387SX's PLCC socket makes it next-to-impossible to
   remove the coprocessor once fully inserted, as the top of the chip is
   level with the socket's 'walls'.)

5) Check your computer's manual for the proper position of any jumpers or
   switches that need to be set to tell the system it now has a coprocessor
   (and possibly, which kind it has). Put the cover back on the system
   unit, reconnect the power, and turn on your computer. Depending on your
   system's BIOS, you may now have to run a setup or configuration program
   to enable the coprocessor. Finally, run the programs supplied on the
   diagnostic disk (included with your coprocessor) to check for its
   correct operation.
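
The selection rules from steps 1 and 2 can be summarized as a small lookup table. The Python sketch below covers only the main CPU/coprocessor pairings named above (it leaves out the Weitek chips, the RapidCAD, and old 386 boards with a 287 socket). The dictionary keys and the function name pick_coprocessor are illustrative choices, not an established naming scheme.

```python
# Main CPU-to-coprocessor pairings from step 1.
COPROCESSOR_FAMILY = {
    "8086": "8087", "8088": "8087", "V20": "8087", "V30": "8087",
    "80286": "80287 / 287XL",
    "386": "80387 / 387DX",
    "486DLC": "80387 / 387DX",
    "386SX": "387SX",
    "486SLC": "387SX",
}

def pick_coprocessor(cpu, cpu_mhz):
    """Describe the coprocessor to buy: right family, same clock rating as the CPU."""
    family = COPROCESSOR_FAMILY[cpu]
    return f"{family} compatible, rated for {cpu_mhz} MHz"

print(pick_coprocessor("386", 40))
```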

Descriptions of available coprocessors, CPU+FPU (as of 01-11-93):

Intel 8087

   [43] This was the first coprocessor that Intel made available for the
   80x86 family. It was introduced in 1980 and therefore does not have full
   compatibility with the IEEE-754 standard for floating-point arithmetic
   (which was finally released in 1985). It complements the 8088 and 8086
   CPUs and can also be interfaced to the 80188 and 80186 processors.

   The 8087 is implemented using NMOS. It comes in a 40-pin CERDIP (ceramic
   dual inline package). It is available in 5 MHz, 8 MHz (8087-2), and 10
   MHz (8087-1) versions. Power consumption is rated at max. 2400 mW [42].

   A neat trick to enhance the processing power of the 8087 for
   computations that use only the basic arithmetic operations (+,-,*,/) and
   do not require high precision is to set the precision control to single-
   precision. This gives one a performance increase of up to 20%. For
   details about programming the precision control, see program PCtrl in
   appendix A.
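
   The precision control is a two-bit field (bits 8 and 9) in the
   coprocessor's control word. As a rough illustration of the bit
   manipulation involved, and not as a replacement for program PCtrl, the
   Python sketch below builds a control word with the field set to single-
   precision. The function name and the example value 0x037F are
   assumptions made for this illustration. Actually loading the value into
   the coprocessor requires the FLDCW instruction, which Python cannot
   issue.

```python
PC_MASK = 0x0300       # precision control field, bits 8 and 9
PC_SINGLE = 0x0000     # 24-bit mantissa
PC_DOUBLE = 0x0200     # 53-bit mantissa
PC_EXTENDED = 0x0300   # 64-bit mantissa

def with_precision(control_word, pc_bits):
    """Return the 16-bit control word with its precision control field replaced."""
    return (control_word & ~PC_MASK & 0xFFFF) | pc_bits

# 0x037F is used here only as an example starting value.
print(hex(with_precision(0x037F, PC_SINGLE)))
```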

   With the help of an additional chip, the 8087 can in theory be
   interfaced to an 80186 CPU [36]. The 80186 was used in some PCs (e.g.
   from Philips, Siemens) in the 1982/1983 time frame, but with IBM's
   introduction of the 80286-based AT in 1984, it soon lost all
   significance for the PC market.

Intel 80187

   The 80187 is a rather new coprocessor designed to support the 80C186
   embedded controller (a CMOS version of the 80186 CPU; see above). It was
   introduced in 1989 and implements the complete 80387 instruction set. It
   is available in a 40 pin CERDIP (ceramic dual inline package) and a 44
   pin PLCC (plastic leaded chip carrier) for 12.5 and 16 MHz operation.
   Power consumption is rated at max. 675 mW for the 12.5 MHz version and
   max. 780 mW for the 16 MHz version [37].

Intel 80287

   [44] This is the original Intel coprocessor for the 80286, introduced in
   1983. It uses the same internal execution unit as the 8087 and therefore
   has the same speed (actually, it is sometimes slower due to additional
   overhead in CPU-coprocessor communication). As with the 8087, it does
   not provide full compatibility with the IEEE-754 floating point standard
   released in 1985.

   The 80287 was manufactured in NMOS technology, and is packaged in a 40-
   pin CERDIP (ceramic dual inline package). There are 6 MHz, 8 MHz, and 10
   MHz versions. Power consumption can be estimated to be the same as that
   for the 8087, which is 2400 mW max.

   The 80287 has been replaced in the Intel 80x87 family with its faster
   successor, the CMOS-based Intel 287XL, which was introduced in 1990 (see
   below). There may still be a few of the old 80287 chips on the market,
   however.

Intel 80287XL

   This chip is Intel's second-generation 287, first introduced in 1990.
   Since it is based on the 80387 coprocessor core, it features full IEEE
   754 compatibility and faster instruction execution. Intel claims about
   50% faster operation than the 80287 for typical benchmark tests such as
   Whetstone [45]. Comparison with benchmark results for the AMD 80C287,
   which is identical to the Intel 80287, supports this claim [1]: The
   Intel 287XL performed 66% faster than the AMD 80C287 on a fractal
   benchmark and 66% faster on the Whetstone benchmark in these tests.
   Whetstone results from [46] show the Intel 287XL at 12.5 MHz to perform
   552 kWhets/sec as opposed to the AMD 80C287's 289 kWhets/sec, a 91%
   performance increase. A benchmark using the MathPak program showed the
   Intel 287XL to be 59% faster than the Intel 80287 (6.9 sec. vs. 11.0
   sec.) [26]. Since the 287XL has all the additional instructions and
   enhancements of a 387, most software automatically identifies it as an
   80387-compatible coprocessor and therefore can make use of extra 387-
   only features, such as the FSIN and FCOS instructions.
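
   The percentages quoted above come from two kinds of measurements:
   throughput (kWhets/sec, higher is better) and run time (seconds, lower
   is better). The Python sketch below shows how "percent faster" can be
   computed in each case, using the Whetstone and MathPak figures from the
   text. The function names are made up for this example.

```python
def percent_faster_by_rate(new_rate, old_rate):
    """Speed gain in percent when both results are throughput figures."""
    return (new_rate / old_rate - 1.0) * 100.0

def percent_faster_by_time(new_time, old_time):
    """Speed gain in percent when both results are run times."""
    return (old_time / new_time - 1.0) * 100.0

print(round(percent_faster_by_rate(552, 289)))    # Whetstone, kWhets/sec
print(round(percent_faster_by_time(6.9, 11.0)))   # MathPak, seconds
```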

   The 287XL is manufactured in CMOS and therefore uses much less power
   than the older NMOS-based 80287. At 12.5 MHz, the power consumption is
   rated at max. 675 mW, about 1/4 of the 80287 power consumption. The
   287XL is available in either a 40-pin CERDIP (ceramic dual inline
   package) or a 44 pin PLCC (plastic leaded chip carrier). (This latter
   version is called the 287XLT and intended mainly for laptop use.) The
   287XL is rated for speeds of up to 12.5 MHz.

AMD 80C287

   This chip, manufactured by Advanced Micro Devices (AMD), is an exact
   clone of the old Intel 80287, and was first brought to market by AMD in
   1989. It contains the original microcode of the 80287 and is therefore
   100% compatible with it. However, as the name indicates, the 80C287 is
   manufactured in CMOS and therefore uses less power than an equivalent
   Intel 80287. At 12.5 MHz, its power consumption is rated at max. 625 mW
   or slightly less than that of the Intel 80287XL [27]. There is also
   another version called AMD 80EC287 that uses an 'intelligent' power save
   feature to reduce the power consumption below 80C287 levels. Tests at
   10.7 MHz show typical power consumption for the 80EC287 to be at 30 mW,
   compared to 150 mW for the AMD 80C287, 300 mW for the Intel 287XL and
   1500 mW for the Intel 80287 [57]. The 80EC287 is therefore ideally
   suited for low power laptop systems.

   The AMD 80C287 is available in speeds of 10, 12, and 16 MHz. (I have
   only seen it being offered in 10 MHz and 12 MHz versions, however.) At
   about US$ 50, it is currently the cheapest coprocessor available. Note
   that it provides less performance than the newer Intel 287XL (see
   above). The AMD 80C287 is available in 40 pin ceramic and plastic DIPs
   (dual inline package) and as 44 pin PLCC (plastic leaded chip carrier).

   Due to recent legal battles with Intel over the right to use the 287
   microcode, which AMD lost, AMD may have to discontinue this product
   (disclaimer: I am not a legal expert).

Cyrix 82S87

   This 80287-compatible chip was developed from the Cyrix 83D87 (Cyrix's
   80387 'clone') and has been available since 1991. It complies completely
   with the IEEE-754 standard for floating-point arithmetic and features
   nearly total compatibility with Intel's coprocessors, including
   implementation of the full Intel 80387 instruction set. It implements
   the transcendental functions with the same degree of accuracy and the
   superior speed of the Cyrix 83D87. This makes the Cyrix 82S87 the
   fastest [1] and most accurate 287 compatible coprocessor available.
   Documentation by Cyrix [46] rates the 82S87 at 730 kWhets/sec for a 12.5
   MHz system, while the Intel 287XL performs only 552 kWhets/sec. 82S87
   chips manufactured after 1991 use the internals of the Cyrix 387+, which
   succeeds the original 83D87 [73].

   The 82S87 is a fully static CMOS design with very low power requirements
   that can run at speeds of 6 to 20 MHz. Cyrix documentation shows the
   82S87 to consume about the same amount of power as the AMD 80C287 (see
   above). The 82S87 comes in a 40 pin DIP or a 44 pin PLCC (plastic leaded
   chip carrier) compatible with the pinout of the Intel 287XLT and
   ideally suited for laptop use.

IIT 2C87

   This chip was the first 80287 clone available, introduced to the market
   in 1989. It has about the same speed as the Intel 287XL [1]. The 2C87
   implements the full 80387 instruction set [38]. Tests I ran on the 3C87
   seem to indicate that it is not fully compatible with the IEEE-754
   standard for floating-point arithmetic (see below for details), so it
   can be assumed that the 2C87 also fails these tests (as it presumably
   uses the same core as the 3C87).

   The IIT 2C87 provides extra functions not available on any other 287
   chip [38]. It has 24 user-accessible floating-point registers organized
   into three register banks. Additional instructions (FSBP0, FSBP1, FSBP2)
   allow switching from one bank to another. (Transfers between registers
   in different banks are not supported, however, so this feature by itself
   is of limited usefulness. Also, there seems to be only one status
   register (containing the stack top pointer), so it has to be manually
   loaded and stored when switching between banks with a different number
   of registers in use [40]). The register bank's main purpose is to aid
   the fourth additional instruction the 2C87 has (F4X4), which does a full
   multiply of a 4x4 matrix by a 4x1 vector, an operation common in 3D-
   graphics applications [39]. The built-in matrix multiply speeds this
   operation up by a factor of 6 to 8 when compared to a programmed
   solution according to the manufacturer [38]. Tests show the speed-up to
   be indeed in this range [40]. For the 3C87, I measured the execution
   time of F4X4 to be about 280 clock cycles; the execution time on the
   2C87 should be somewhat larger - I estimate it to be around 310 clock
   cycles due to the higher CPU-NDP communication overhead in instruction
   execution in 286/287 systems (~45-50 clock cycles) compared with 386/387
   systems (~16-20 clock cycles). As desirable as the F4X4 instruction may
   seem, however, there are very few applications that make use of it when
   an IIT coprocessor is detected at run time (among them Schroff
   Development's Silver Screen and Evolution Computing's Fast-CAD 3-D
   [25]).
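
   The text does not spell out how the figure of about 310 clock cycles
   was derived. One plausible reading, and it is only an assumption, is
   that the 386/387 communication overhead is subtracted from the measured
   3C87 time and the 286/287 overhead is added instead. The Python sketch
   below computes the resulting range from the numbers quoted above; the
   function name is hypothetical.

```python
def estimate_2c87_cycles(cycles_3c87, overhead_386, overhead_286):
    """Swap the 386/387 communication overhead for the 286/287 overhead."""
    return cycles_3c87 - overhead_386 + overhead_286

low = estimate_2c87_cycles(280, 20, 45)    # smallest overhead difference
high = estimate_2c87_cycles(280, 16, 50)   # largest overhead difference
print(low, high)
```

   Under this assumption, the estimate of around 310 cycles falls inside
   the computed range.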

   The 2C87 is available for speeds of up to 20 MHz. It is implemented in
   an advanced CMOS process and has therefore a low power consumption of
   typically about 500 mW [38].

Intel 80387

   This chip was the first generation of coprocessors designed specifically
   for the Intel 80386 CPU. It was introduced in 1986, about one year after
   the 80386 was brought to market. Early 386 systems were therefore
   equipped with both an 80287 and an 80387 socket. The 80386 does work with
   an 80287, but the numerical performance is hardly adequate for such a
   system.

   The 80387 has itself since been superseded by the Intel 387DX, which was
   quietly introduced in 1989 (see below). You might still find the 80387
   when acquiring an older 386 machine, though. The old 80387 is about 20%
   slower than the newer 387DX.

   The 80387 is packaged in a 68-pin ceramic PGA, and was manufactured
   using Intel's older 1.5 micron CHMOS III technology, giving it moderate
   power requirements. Power consumption at 16 MHz is max. 1250 mW (750 mW
   typical), at 20 MHz max. 1550 mW (950 mW typical), and at 25 MHz max.
   1950 mW (1250 mW typical) [60].

Intel 387DX

   The 387DX is the second-generation Intel 387; it was quietly introduced
   to replace the original 80387 in 1989. This version is done in a more
   advanced CMOS process which enables the coprocessor to run at a maximum
   frequency of 33 MHz (the 80387 was limited to a maximum frequency of 25
   MHz). The 387DX is also about 20% faster than the 80387 on the average
   for the same clock frequency. For a 386/387 system operating at 29 MHz
   the Whetstone benchmark (compiled with the highly optimizing Metaware
   High-C V1.6) runs at 2377 kWhetstones/sec for the 80387 and at 2693
   kWhetstones/sec for the 387DX, a 13% increase. In a fractal calculation
   programmed in assembly language, the 387DX performance was 28% higher
   than the performance of the 80387. The transcendental functions have
   also sped up from the 80387 to the 387DX. In the Savage benchmark
   (again, compiled with Metaware High-C V1.6 and running on a 29 MHz
   system), the 80387 evaluated 77600 function calls/second, while the
   387DX evaluated 97800 function calls/second, a 26% increase [7]. Some
   instructions have been sped up a lot more than the average 20%. For
   example, the performance of the FBSTP instruction has increased by a
   factor of 3.64.

   The Intel 387DX (and its predecessor 80387) are the only 387
   coprocessors that support asynchronous operation of CPU and coprocessor.
   The 387 consists of a bus interface unit and a numerical execution unit.
   The bus interface unit always runs at the speed of the CPU clock
   (CPUCLK2). If the CKM (ClocK Mode) pin of the 387 is strapped to Vcc,
   the numerical execution unit runs at the same speed as the bus interface
   unit. If CKM is tied to ground, the numerical execution unit runs at the
   speed provided by the NUMCLK2 input. The ratio of NUMCLK2 (coprocessor
   clock) to CPUCLK2 (CPU clock) must lie within the range 10:16 to 14:10.
   For example, for a 20 MHz 386, the Intel 387DX could be clocked from
   12.5 MHz to 28 MHz via the NUMCLK2 input. (On the Cyrix 83D87, Cyrix
   387+, ULSI 83C87, and the IIT 387, the CKM pin is not connected. These
   coprocessors are therefore not capable of asynchronous operation and
   always run at the speed of the CPU.)
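
   The permitted clock range for asynchronous operation follows directly
   from the two ratio limits. The Python sketch below reproduces the 20
   MHz example; the function names are made up for this illustration, and
   the check ignores the coprocessor's own maximum frequency rating, which
   applies in addition.

```python
def numclk_range_mhz(cpu_mhz):
    """Permitted coprocessor clock range (in MHz) for a given CPU clock."""
    lowest = cpu_mhz * 10 / 16     # NUMCLK2 : CPUCLK2 = 10:16
    highest = cpu_mhz * 14 / 10    # NUMCLK2 : CPUCLK2 = 14:10
    return lowest, highest

def async_clock_ok(cpu_mhz, fpu_mhz):
    """Check whether a coprocessor clock lies within the permitted ratio range."""
    lowest, highest = numclk_range_mhz(cpu_mhz)
    return lowest <= fpu_mhz <= highest

print(numclk_range_mhz(20))
print(async_clock_ok(40, 33))
```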

   The Intel 387DX is manufactured using Intel's advanced low power CHMOS
   IV technology. Power consumption at 20 MHz is max. 900 mW (525 mW
   typical), at 25 MHz max. 1050 mW (625 mW typical), and at 33 MHz max.
   1250 mW (750 mW typical) [59].

Intel 387SX

   This is the coprocessor paired with the Intel 386SX CPU. The 386SX is an
   Intel 80386 with a 16-bit, rather than 32-bit, data path. This reduces
   (somewhat) the costs to build a 386SX system as compared to a full 32-
   bit design required by a 386DX. (The 386SX's main *marketing* purpose
   was to replace the 80286 CPU, which was being sold more cheaply by other
   manufacturers [such as AMD], and which Intel subsequently stopped
   producing.) Due to the 16-bit data path, the 386SX is slower than the
   386DX and offers about the same speed as an 80286 at the same clock
   frequency for 16-bit applications. But as the 386SX is a complete 80386
   internally, it offers also the possibility to run 32-bit applications
   and supports the virtual 8086 mode (used for example by Windows' 386
   enhanced mode).

   The 387SX has all the features of the Intel 80387, including the ability
   of asynchronous operation of CPU and coprocessor (see Intel 387DX
   information, above). Due to the 16 bit data path between the CPU and the
   coprocessor, the 387SX is a bit slower than an 80387 operating at the
   same frequency. In addition, the 387SX is based on the core of the
   original 80387, which executes instructions slower than the second
   generation 387DX.

   The 387SX comes in a 68-pin PLCC (plastic leaded chip carrier) package
   and is available in 16 MHz and 20 MHz versions. (Coprocessors for faster
   386SX systems based on the Am386SX CPU are available from IIT, Cyrix,
   and ULSI.) Power consumption for the 387SX at 16 MHz is max. 1250 mW
   (740 mW typical); for the 20 MHz version it is max. 1500 mW (1000 mW
   typical) [62].

Intel 387SL

   This coprocessor is designed for use in systems that contain an Intel
   386SL as the CPU. The 386SL is directly derived from the 386SX. It is a
   static CHMOS IV design with very low power requirements that is intended
   to be used in notebook and laptop computers. It features an integrated
   cache controller, a programmable memory controller, and hardware support
   for expanded memory according to the LIM EMS 4.0 standard. The 387SL,
   introduced in early 1992, has been designed to accompany the 386SL in
   machines with low power consumption and substitute the 387SX for this
   purpose. It features advanced power saving mechanisms. It is based on
   the 387DX core, rather than on the older and slower 80387 core (which is
   used by the 387SX).

IIT 3C87

   This IIT chip was introduced in 1989, about the same time as the Cyrix
   83D87. Both coprocessors are faster than Intel's 387DX coprocessor. The
   IIT 3C87 also provides extra functions not available on any other 387
   chip [38]. It has 24 user-accessible floating-point registers organized
   into three register banks. Three additional instructions (FSBP0, FSBP1,
   FSBP2) allow switching from one bank to another. (Transfers between
   registers in different banks are not supported, however, so this feature
   by itself is of limited usefulness. Also, there seems to be only one
   status register [containing the stack top pointer], so it has to be
   manually loaded and stored when switching between banks with a different
   number of registers in use [40]). The register bank's main purpose is to
   aid the fourth additional instruction the 3C87 has (F4X4), which does a
   full multiply of a 4x4 matrix by a 4x1 vector, an operation common in
   3D-graphics applications [39]. The built-in matrix multiply speeds this
   operation up by a factor of 6 to 8 when compared to a programmed
   solution according to the manufacturer [38]. Tests show the speed-up to
   be indeed in this range [40]. I measured the F4X4 to execute in about
   280 clock cycles, during which time it executes 16 multiplications and
   12 additions. The built-in matrix multiply speeds up the matrix-by-
   vector multiply by a factor of 3 compared with a programmed solution
   according to IIT [39]. The results for my own TRNSFORM benchmark support
   this claim (see results below), showing a performance increase by a
   factor of about 2.5. This makes matrix multiplies on the IIT 3C87 nearly
   as fast as on an Intel 486 at the same clock frequency. As desirable as
   the F4X4 instruction may seem, however, there are very few applications
   that make use of it when an IIT coprocessor is detected at run time
   (among them Schroff Development's Silver Screen and Evolution
   Computing's Fast-CAD 3-D [25]).
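
   For readers unfamiliar with the operation, the Python sketch below
   spells out a 4x4 matrix by 4x1 vector multiply and counts the
   arithmetic operations. It shows only the arithmetic that F4X4 has to
   perform, not how the 3C87 carries it out internally or how the operands
   are laid out in the register banks. The function name and the sample
   translation matrix are choices made for this example.

```python
def mat4_times_vec4(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-element vector.

    Returns the result vector together with the number of multiplications
    and additions performed.
    """
    muls = adds = 0
    result = []
    for row in m:
        acc = row[0] * v[0]
        muls += 1
        for a, b in zip(row[1:], v[1:]):
            acc += a * b
            muls += 1
            adds += 1
        result.append(acc)
    return result, muls, adds

# Translate the point (1, 2, 3) by (10, 20, 30) in homogeneous coordinates.
translate = [[1, 0, 0, 10],
             [0, 1, 0, 20],
             [0, 0, 1, 30],
             [0, 0, 0, 1]]
point, muls, adds = mat4_times_vec4(translate, [1, 2, 3, 1])
print(point, muls, adds)
```

   The operation counts match the 16 multiplications and 12 additions
   mentioned above.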

   These IIT-specific instructions also work correctly when using a Chips &
   Technologies 38600DX or a Cyrix 486DLC CPU, which are both marketed as
   faster replacements for the Intel 386DX CPU.

   Tests I ran with the IEEETEST program show that the 3C87 is not fully
   compatible with the IEEE-754 standard for floating-point arithmetic,
   although the manufacturer claims otherwise. It is indeed possible that
   the reported errors are due to personal interpretations of the standard
   by the program's author that have been incorporated into IEEETEST and
   that the standard also supports the different interpretation chosen by
   IIT. On the other hand, the IEEE test vectors incorporated into IEEETEST
   have become somewhat of an industry standard [66] and Intel's 387, 486,
   and RapidCAD chips pass the test without a single failure, so the fact
   that the IIT 3C87 fails some of the tests indicates that it is not fully
   compatible with the Intel 387 coprocessor. My tests also show that the
   IIT 3C87 does not support denormals for the double extended format. It
   is not entirely clear whether the IEEE standard mandates support for
   extended precision denormals, as the IEEE-754 document explicitly only
   mentions single and double-precision denormals. Missing support for
   denormals is not a critical issue for most applications, but there are
   some programs for which support of denormals is at the very least quite
   helpful [41]. In any case, failure of the 3C87 to support extended
   precision denormal numbers does represent an incompatibility with the
   Intel 387 and 486 chips.

   The 3C87 is implemented in an advanced CMOS process and has low power
   requirements, typically about 600 mW. Like the 387 'clones' from Cyrix
   and ULSI, the 3C87 does not support asynchronous operation of the CPU
   and the coprocessor, but always runs at the full speed of the CPU. It is
   available in 16, 20, 25, 33, and 40 MHz versions.

IIT 3C87SX

   This is the version of the IIT 3C87 that is intended for use with
   Intel's 386SX or AMD's Am386SX CPU, and is functionally equivalent to
   the IIT 3C87. Due to the 16-bit data path between the CPU and the
   coprocessor in a 386SX-based system, coprocessor instructions will
   execute somewhat more slowly than on the 3C87. At present, the IIT
   3C87SX is the only 387SX coprocessor that is offered at speeds of 16,
   20, 25, and 33 MHz. (I have read that Cyrix has also announced an 83S87-
   33, but haven't seen it being offered yet.) The 3C87SX is packaged in a
   68-pin PLCC.

Cyrix FasMath 83D87

   This chip was introduced in 1989, only shortly after the coprocessors
   from IIT. It has been found to be the fastest 387-compatible coprocessor
   in several benchmark comparisons [1,7,68,69]. It also came out as the
   fastest coprocessor in my own tests (see benchmark results below).
   Although the Cyrix 83D87 provides up to 50% more performance than the
   Intel 387DX in benchmark comparisons, the speed advantage over other
   387-compatible coprocessors in real applications is usually much
   smaller, because coprocessor instructions represent only a small part of
   the total application code. For example, in a test using the program 3D-
   Studio, the Cyrix 83D87 was 6% faster than the Intel 387DX [1].

   Besides being the fastest 387 coprocessor, the 83D87 also offers the
   most accurate transcendental function results of all coprocessors
   tested (see test results below). The new "387+" version of the 83D87,
   available since November 1991, even surpasses the level of accuracy of
   the original 83D87 design. Note that the name 387+ is used in European
   distribution only. In other parts of the world, the new chip still goes
   by the name 83D87.

   Unlike Intel's coprocessors, which use the CORDIC [18,19] algorithm to
   compute the transcendental functions, Cyrix uses polynomial and rational
   approximations to the functions. In the past the CORDIC method has been
   popular since it requires only shifts and adds, which made it relatively
   easy to implement a reasonably fast algorithm. Recently, the cost for the
   implementation of fast floating-point hardware multipliers has dropped
   significantly (due to the availability of VLSI), making the use of
   polynomial and rational approximations superior to CORDIC for the
   generation of transcendental functions [61]. The Cyrix 83D87 uses a fast
   array multiplier, making its transcendental functions faster than those
   of any other 387-compatible coprocessor. It also uses 75 bits for the
   mantissa in intermediate calculations (as opposed to 68 bits on other
   coprocessors), making its transcendental functions more accurate than
   those of any other coprocessor or FPU (see results below).
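
   To illustrate the difference between the two approaches, here is a
   small sketch that computes the sine both ways. It is not the algorithm
   used in any of the chips discussed: the CORDIC loop is the textbook
   rotation-mode version, and the polynomial is a plain Taylor series
   rather than the tuned approximations a real coprocessor would use. The
   function names and the iteration count are my own choices.

```python
import math


def cordic_sin(theta, iterations=32):
    """Rotation-mode CORDIC for |theta| <= pi/2 (shift-and-add style)."""
    # Table of atan(2**-i) and the accumulated gain correction.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 4.0 ** -i)
    x, y, z = gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0
        # In hardware, the multiplications by 2**-i are plain shifts.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y


def poly_sin(theta):
    """Odd Taylor polynomial up to theta**13, evaluated by Horner's rule."""
    t2 = theta * theta
    coeffs = [(-1) ** (k // 2) / math.factorial(k)
              for k in (13, 11, 9, 7, 5, 3, 1)]
    acc = 0.0
    for c in coeffs:
        acc = acc * t2 + c   # one multiplication per coefficient
    return acc * theta


print(cordic_sin(0.5), poly_sin(0.5), math.sin(0.5))
```

   CORDIC gains roughly one bit of accuracy per iteration but needs no
   multiplier, while the polynomial needs only a handful of
   multiplications, so it wins as soon as a fast multiplier is available.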

   The 83D87 and its successor, the 387+, are the 387 'clones' with the
   highest degree of compatibility with the Intel 387DX. A few minor software
   and hardware incompatibilities have been documented by Cyrix [12]. The
   software differences are caused by some bugs present in the 387DX that
   Cyrix fixed in the 83D87. Unlike the Intel 387DX, the 83D87 (and all
   other 387-compatible chips as well) does not support asynchronous
   operation of CPU and coprocessor. There were also problems in the past
   with the CPU-coprocessor communications, causing the 83D87 to
   occasionally hang on some machines. The reason behind this was that
   Cyrix shaved off a wait state in the communication protocol, which
   caused a communications breakdown between the CPU and the 83D87 for some
   systems running at 25 MHz or faster. (One notable example of this
   behavior was the Intel 302 board.) There were also problems with boards
   based on early revisions of the OPTI chipset. These problems are only
   rarely encountered with the current generation of 386 motherboards, and
   it is possible that they have been entirely eliminated in the 387+, the
   successor to the 83D87.

   To reduce power consumption, the 83D87 includes advanced power-saving
   features. Those portions of the coprocessor that are not needed are
   automatically shut down. If no coprocessor instructions are being
   executed, *all* parts except the bus interface unit are shut down [12].
   Maximum power consumption of the Cyrix 83D87 at 33 MHz is 1900 mW, while
   typical power consumption at this clock frequency is 500 mW [15].

Cyrix EMC87

   This coprocessor is basically a special version of the Cyrix 83D87,
   introduced in 1990. In addition to the normal 387 operating mode, in
   which coprocessor-CPU communication is handled through reserved IO
   ports, it also offers a memory-mapped mode of operation similar to the
   operation principle of the Weitek Abacus. Like the Weitek chip, the
   EMC87 occupies a block of memory starting at physical address C0000000h
   (the Abacus occupies a memory block of 64 KB, while the EMC87 uses only
   4 KB [77]). It can therefore only be accessed in the protected or
   virtual modes of the 386 CPU. DOS programs can access the EMC87 with the
   help of DOS extenders or memory managers like EMM386 which run in
   protected/virtual mode themselves. To implement the memory-mapped
   interface, the usual 80x87 architecture has been slightly expanded with
   three additional registers and eleven additional instructions that can
   only be used if the memory-mapped mode is enabled.

   Using this special mode of the EMC87 provides a significant speed
   advantage. The traditional 387 CPU-coprocessor interface via IO ports
   has an overhead of about 14-20 clock cycles. Since the Cyrix 83D87
   executes some operations like addition and multiplication in much less
   time, its performance is actually limited by the CPU-coprocessor
   interface. Since the memory-mapped mode has much less overhead, it
   allows all coprocessor instructions to be executed at full speed with no
   penalty.

   Originally, Cyrix claimed support for the fast memory-mapped mode of the
   EMC87 from a number of software vendors (including Borland and
   Microsoft). However, there are only very few applications that make use
   of it, among them Evolution Computing's FastCAD 3D, MicroWay Inc.'s NDP
   FORTRAN-386 compiler, Metaware's High-C compiler version 1.6 and newer,
   and Intusoft's Spice [63,73]. Part of the problem in supporting the
   memory-mapped mode is that the application must reserve one of the
   general purpose registers of the CPU to use memory-mapped mode
   instructions that access memory.

   (Note that the EMC87 is *not* compatible with Weitek's Abacus
   coprocessor. They both use the same CPU interface technique [memory
   mapping], but while the EMC87 uses the standard 387 instruction set, the
   Weitek Abacus coprocessors use an entirely different instruction set of
   their own.)

   Since the EMC87 also provides the standard 386/387 CPU interface via IO
   ports, it can be used just like any other 387-compatible coprocessor and
   delivers the same performance as the Cyrix 83D87 in this mode. The EMC87
   even allows mixed use of memory-mapped and traditional instructions in
   the same code. Cyrix has also implemented some additional instructions
   in the EMC87 that are also available in the 387-compatible mode:
   FRICHOP, FRINT2, and FRINEAR. These instructions enable rounding to
   integer without setting the rounding mode by manipulating the
   coprocessor control word, and are intended to make life easier for
   compiler writers.
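
   The following sketch shows the three kinds of rounding to integer that
   these instructions appear to provide. The semantics are an assumption
   inferred from the mnemonics (chop, nearest with ties away from zero,
   nearest with ties to even), not taken from Cyrix documentation, and the
   Python functions merely model the results; they do not execute the
   instructions themselves.

```python
import math


def frichop(x):
    """Assumed FRICHOP behaviour: round toward zero (truncate)."""
    return float(math.trunc(x))


def frinear(x):
    """Assumed FRINEAR behaviour: round to nearest, ties to even."""
    return float(round(x))


def frint2(x):
    """Assumed FRINT2 behaviour: round to nearest, ties away from zero."""
    t = math.trunc(x)
    if abs(x - t) == 0.5:            # exact tie; x - t is computed exactly
        return float(t) + math.copysign(1.0, x)
    return float(round(x))


for value in (2.5, -2.5, 2.7, -2.7):
    print(value, frichop(value), frinear(value), frint2(value))
```

   On a standard 80x87, obtaining each of these results would require
   saving the control word, changing its rounding-control field, executing
   FRNDINT, and restoring the control word afterwards. Infinities and NaNs
   are not handled in this sketch.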

   In a test, the EMC87 at 33 MHz ran the single-precision Whetstone
   benchmark at 7608 kWhetstones/sec, while the Cyrix 83D87 at 33 MHz had a
   speed of only 5049 kWhetstones/sec, an increase of about 50.7% [63]. In
   another test, the EMC87 ran a fractal computation at twice the speed of
   the Cyrix 83D87 and 2.6 times as fast as an Intel 387DX [64]. A third
   test found the EMC87's overall performance to be 20% higher than the
   performance of the Cyrix 83D87 [65].

   The Cyrix FasMath EMC87 has also been marketed as Cyrix AutoMATH; the
   two chips are identical. Unlike the Cyrix 83D87, which fits into the 68-
   pin 387 coprocessor socket, the EMC87 comes in a 121-pin PGA and
   requires the 121-pin EMC (Extended Math Coprocessor) socket. Note that
   not all boards have such a socket (a notable exception being IBM's
   PS/2s, for example). The EMC87 is available in 25 and 33 MHz versions.
   Maximum power consumption at 33 MHz is 2000 mW.

   Cyrix currently appears to be phasing out the EMC87.

Cyrix FasMath 387+

   This chip is the second-generation successor to the Cyrix 83D87. (The
   name "387+" is only used for European distribution; in other parts of
   the world, it goes by the original 83D87 designation.) According to a
   source within Cyrix [73], the 387+ was designed to make a smaller (and
   thus cheaper to manufacture) coprocessor chip that could also be pushed
   to higher frequencies than the original chip: the 387+ is available in
   versions of up to 40 MHz, whereas the original 83D87 could go no faster
   than 33 MHz.

   The Cyrix 387+ is ideally suited to be used with Cyrix's 486DLC CPU,
   which is a 486SX-compatible replacement chip for the Intel 386DX.
   Indeed, Cyrix sells upgrade kits consisting of a 486DLC CPU and a
   Cyrix 387+.

   In my tests, I found the Cyrix 387+ to be about five to ten percent
   *slower* overall than the Cyrix 83D87. Some instructions, like the
   square root (FSQRT), now run at only half the speed at which they ran in
   the 83D87, and most transcendental functions show about a 40% drop in
   performance compared to their 83D87 averages (see performance results,
   below). However, I did find the transcendental functions on the 387+ to
   be a bit *more* accurate than those implemented in the 83D87. The new
   design uses a slower hardware multiplier that needs six clock cycles to
   multiply the floating-point mantissa of an internal precision number,
   while the multiplier in the 83D87 takes only four clocks to accomplish the
   same task. Since the transcendental functions in Cyrix math coprocessors
   are generated by polynomial and rational approximations, this slows them
   down significantly.

   The divide/square root logic has also been changed from the 83D87
   design. The original design used an algorithm that could generate both
   the quotient and square root, so the execution times for these
   instructions were nearly identical. The algorithm chosen for the
   division in the 387+ doesn't allow the square root to be taken so
   easily, so it takes nearly twice as long.

   In the 387+, the available argument range for the FYL2XP1 instruction
   has been extended, from the usual range -1+sqrt(2)/2..sqrt(2)/2 that is
   found on all 80x87 coprocessors, to include all floating-point numbers.
   Also, four additional instructions have been implemented: FRICHOP
   (opcode DD FC), FRINT2 (opcode DB FC), FRINEAR (opcode DF FC), and FTSTP
   (opcode D9 E6).

Cyrix FasMath 83S87

   The 83S87 is the SX version of the Cyrix 83D87. Just as the 83D87 is the
   fastest 387-compatible coprocessor, the Cyrix 83S87 is the fastest of
   the 387SX compatible coprocessors [1], as well as providing the most
   accurate transcendental functions. 83S87 chips manufactured after 1991
   use the internals of the Cyrix 387+, the successor to the original 83D87
   [73] (above). The Cyrix 83S87 is ideally suited to be used with the
   Cyrix Cx486SLC CPU, a 486SX compatible CPU which is a replacement chip
   for the Intel 386SX CPU.

   The 83S87 is packaged in a 68-pin PLCC and is available in 16, 20, and
   25 MHz versions. Due to the advanced power saving features of the Cyrix
   coprocessor, the typical power consumption of the 20 MHz version is only
   about 350 mW [67].

ULSI Math*Co 83C87

   The ULSI 83C87 is an 80387-compatible coprocessor first introduced in
   early 1991, well after the IIT 3C87 and Cyrix 83D87 appeared. Like other
   387 clones, it is somewhat faster than the Intel 387DX, particularly in
   its basic arithmetic functions. The transcendental functions, however,
   show only a slight speed improvement over the Intel 387DX (see benchmark
   results below).

   In my tests, the ULSI had the most inaccurate transcendental functions
   of all tested coprocessors. However, the maximum relative error is still
   within the limits set by Intel, so this is probably not an important
   issue for all but a very few applications. The ULSI 83C87 shows some
   minor flaws in the tests for IEEE 754 compatibility, but this, too, is
   probably unimportant under typical operating conditions. ULSI claims
   that the program IEEETEST, which was used to test for IEEE
   compatibility, contains many personal interpretations of the IEEE
   standard by the program's author and states that there is no ANSI-
   certified IEEE-754 compliance test. While this may be true, it is
   also a fact that the IEEE test vectors used in IEEETEST are a de facto
   industry standard, and that Intel's 387, 486, and RapidCAD chips pass it
   without a single failure, as do the coprocessors from Cyrix. Since the
   ULSI Math*Co 83C87 fails some of the tests, it is certainly less than
   100% compatible with Intel's chips, although this will likely make
   little or no difference in typical operating conditions. (It is
   interesting to note that an ULSI 83S87 manufactured in 92/17 showed
   fewer errors in the IEEETEST test run [74] than the ULSI 83C87,
   manufactured in 91/48, I used in my original test. This indicates that
   ULSI might have applied some quick fixes to newer revisions of their
   math coprocessors.)

   The ULSI 83C87 fails to be compatible with IEEE-754 in that it does
   not implement the "precision control" feature. While all the internal
   operations of 80x87 coprocessors are usually performed with the maximum
   precision available (double-extended precision with 64 mantissa bits),
   the 80x87 architecture also offers the possibility to force lower
   precision to be used for the basic arithmetic functions (add, subtract,
   multiply, divide, and square root). This feature is required by IEEE-754
   for all coprocessors that cannot store results *directly* to a single-
   or double-precision location. Since 80x87 coprocessors lack this storage
   capability, they all implement precision control to provide correctly
   rounded single- and double-precision results according to the floating-
   point standard - except the ULSI chips. For programs that make use of
   precision control (e.g., Interactive UNIX), correct implementation of
   the feature may be essential for correct arithmetic results.
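
   Why rounding once and rounding twice can give different answers is
   easiest to see with exact arithmetic. The sketch below models only the
   mantissa widths (64 bits for double-extended, 53 bits for double
   precision) and ignores the exponent range. The value x is a
   hypothetical exact result that I picked to sit just above a
   double-precision rounding tie; it is not the output of any particular
   instruction, and the helper name round_to_bits is my own.

```python
from fractions import Fraction


def round_to_bits(x, p):
    """Round a positive Fraction to p significant bits, ties to even.

    The exponent range is ignored; only the mantissa width is modelled.
    """
    # Find e with 2**e <= x < 2**(e + 1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    ulp = Fraction(2) ** (e - p + 1)   # spacing of p-bit numbers near x
    q = x / ulp
    n = q.numerator // q.denominator   # floor(q)
    rem = q - n
    if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and n % 2 == 1):
        n += 1
    return n * ulp


# Hypothetical exact result, just above a double-precision tie.
x = 1 + Fraction(1, 2 ** 53) + Fraction(1, 2 ** 70)

# With precision control: a single rounding to 53 bits.
once = round_to_bits(x, 53)
# Without it: rounded to 64 bits first, then again when stored as a double.
twice = round_to_bits(round_to_bits(x, 64), 53)

print("rounded once :", once == 1 + Fraction(1, 2 ** 52))
print("rounded twice:", twice == 1)
```

   The first rounding to 64 bits discards the small term that would have
   broken the tie, so the second rounding goes the other way. The two
   results differ in the last bit of the double-precision mantissa.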

   Like other non-Intel 387 compatibles, the 83C87 does not support
   asynchronous operation of the CPU and the coprocessor. This means that
   the 83C87 always runs at the full speed of the CPU. It is available in
   20, 25, 33, and 40 MHz versions. The ULSI is produced in low power CMOS;
   power consumption at 20 MHz is max. 800 mW (400 mW typical), at 25 MHz
   it is max. 1000 mW (500 mW typical), at 33 MHz it is max. 1250 mW (625
   mW typical), and at 40 MHz it is max. 1500 mW (750 mW typical) [58]. The
   83C87 is packaged in a 68-pin ceramic PGA.

   ULSI coprocessors come with a lifetime warranty. ULSI Systems, Inc.,
   will replace the coprocessor up to three times free of charge should it
   ever fail to function properly.

ULSI Math*Co 83S87

   This chip is the SX version of the ULSI 83C87, for use in systems with
   an Intel 386SX or an AMD Am386SX CPU. It is functionally equivalent to
   the 83C87. To aid low-power laptop designs, the ULSI 83S87 features an
   advanced power saving design with a sleep mode and a standby mode with
   only minimal power requirements. Power consumption under normal
   operating conditions (dynamic mode) is max. 400 mW at 16 MHz (300 mW
   typical), max. 450 mW at 20 MHz (350 mW typical), and max. 500 mW at 25
   MHz (400 mW typical) [58]. The ULSI 83S87 is packaged in a 68-pin PLCC.

C&T SuperMATH 38700DX

   Produced by Chips&Technologies, this is the latest entry into the 387-
   compatible marketplace. Originally announced in October 1991, it was
   apparently not available to end users until the third quarter of
   1992, at least here in Germany. My tests show that its compatibility
   with Intel products is very good, even for the more arcane features of
   the 387DX, and comparable to that of the coprocessors from Cyrix. Like
   these chips, it passes the IEEETEST program without a single failure. It
   passes, of course, all tests in Chips&Technologies' own compatibility
   test program, SMDIAG. However, some of the tests (the transcendental
   functions) in this program are selected in such a way that the C&T 38700
   passes while the Cyrix 83D87 or Intel RapidCAD fail, so they are not
   very useful. (There is also a 'bug' in the test for FSCALE that hides a
   true bug in the C&T 38700.) My tests show the accuracy of the
   transcendental functions on the C&T 38700DX varies. Overall, accuracy of
   the transcendentals is slightly better than on the Intel 387DX.

   In my own speed tests [see below] and those reported in [1], the C&T
   38700DX showed performance at about 90-100% the level of the Cyrix
   83D87, which is the 387 clone with the highest performance. For
   floating-point-intensive benchmarks, the C&T 38700DX provides up to 50%
   more computational performance than the Intel 387DX. However, as with
   all other 387 compatible coprocessors, the speed advantage over the
   Intel 387DX is far less significant in real applications.

   The SuperMATH 38700DX is implemented in 1.2 micron CMOS with on-chip
   power management, which makes for low power consumption. The 38700DX is
   packaged in a 68-pin ceramic PGA (pin grid array) and is available in
   speeds of 16, 20, 25, 33, and 40 MHz.

C&T 38700SX

   This chip is the SX version of the 38700DX and is compatible with the
   Intel 387SX. It provides performance comparable to a Cyrix 83S87 [1],
   the 387SX clone with the highest performance. Compatibility with the
   Intel 387SX is very good and on par with the high degree of
   compatibility found in the Cyrix 83S87.

   The 38700SX has low power consumption. It is packaged in a 68-pin PLCC
   (plastic leaded chip carrier) and available in speeds of 16, 20, and 25
   MHz.

Intel RapidCAD

   The RapidCAD is not a coprocessor, strictly speaking, although it is
   marketed as one. Rather, it is a full replacement for an 80386 CPU:
   basically, an Intel 486DX CPU chip without the internal cache and with a
   standard 386 pinout. RapidCAD is delivered as a set of two chips.
   RapidCAD-1 goes into the 386 socket and contains the CPU and FPU.
   RapidCAD-2 goes into the coprocessor (387) socket and contains a simple
   PAL whose only purpose is to generate the FERR signal normally generated
   by a coprocessor. (This is needed by the motherboard circuitry to provide
   287-compatible coprocessor exception handling in 386/387 systems.) The
   RapidCAD instruction set is compatible with the 386, so it doesn't have
   any newer, 486-specific instructions like BSWAP. However, since the
   RapidCAD CPU core is very similar to the 80486 CPU core, most of the
   register-to-register instructions execute in the same number of clock
   cycles as on the 486.

   RapidCAD's use of the standard 386 bus interface causes instructions
   that access memory to execute at about the same speed as on the 386. The
   integer performance on the RapidCAD is definitely limited by the low
   memory bandwidth provided by this interface (2 clock cycles per bus
   cycle) and the lack of an internal cache. CPU instructions often execute
   faster than they can be fetched from memory, even with a big and fast
   external cache. Therefore, the integer performance of the RapidCAD
   exceeds that of a 386 by *at most* 35%. This value was derived by
   running some programs that use mostly register-to-register operations
   and few memory accesses, and is supported by the SPEC ratings that Intel
   reports for the 386-33 and the RapidCAD-33: while the 386-33 has a
   SPECint of 6.4, the RapidCAD has a SPECint of 7.3 [28], a 14% increase.
   (Note that these tests used the old [1989] SPEC benchmarks suite.)

   While CPU and integer instructions often execute in one clock cycle on
   the RapidCAD, floating-point operations always take more than seven
   clock cycles. They are therefore rarely slowed down by the low-bandwidth
   386 bus interface. My tests show a 70%-100% performance increase for
   floating-point intensive benchmarks over a 386-based system using the
   Intel 387DX math coprocessor. This is consistent with the SPECfp rating
   reported by Intel. The 386/387 at 33 MHz is rated at 3.3 SPECfp, while
   the RapidCAD is rated at 6.1 SPECfp at the same frequency, an 85%
   increase. This means that a system that uses the RapidCAD is faster than
   *any* 386/387 combination, regardless of the type of 387 used, whether
   an Intel 387DX or a faster 387 clone. The diagnostic disk for the
   RapidCAD also gives some application performance data for the RapidCAD
   compared to the Intel 387DX:

           Application      Time w/ 387DX  Time w/ RapidCAD  Speedup

           AutoCAD 11              52 sec         32 sec       63%
           AutoShade/Renderman    180 sec        108 sec       67%
           Mathematica (Windows)  139 sec        103 sec       35%
           SPSS/PC+ 4.01           17 sec         14 sec       21%
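
   The Speedup column can be reproduced from the two timing columns. The
   short sketch below assumes that the column is the ratio of the run
   times minus one (that is, how much more work is done per unit of time),
   rather than the percentage of time saved; the table itself does not say
   which definition was used. The function name is my own.

```python
# Run times in seconds, copied from the table above.
timings = {
    "AutoCAD 11": (52, 32),
    "AutoShade/Renderman": (180, 108),
    "Mathematica (Windows)": (139, 103),
    "SPSS/PC+ 4.01": (17, 14),
}


def speedup_percent(t_387dx, t_rapidcad):
    """Throughput gain in percent: (old time / new time - 1) * 100."""
    return (t_387dx / t_rapidcad - 1.0) * 100.0


for app, (t_old, t_new) in timings.items():
    print(f"{app:22s} {speedup_percent(t_old, t_new):5.1f}%")
```

   With this definition the computed values agree with the table to
   within rounding. Note that the time saved is smaller than the speedup:
   going from 52 to 32 seconds saves about 38% of the run time.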

   RapidCAD is available in 25 MHz and 33 MHz versions. It is distributed
   through different channels than the other Intel math coprocessors, and I
   have therefore been unable to obtain a data sheet for it. [78] gives the
   typical power consumption of the 33 MHz RapidCAD as 3500 mW, which is
   the same as for the 33 MHz 486DX. The RapidCAD-1 chip gets quite hot
   when operating. Therefore, I recommend extra cooling for it (see the
   paragraph below on the 486 for details). The RapidCAD-1 is packaged in a
   132-pin PGA, just like the 80386, and the RapidCAD-2 is packaged in a
   68-pin PGA like an 80387 coprocessor.

Intel 486DX

   The Intel 486DX is, of course, not solely a coprocessor. This chip,
   first introduced by Intel in 1989, functionally combines the CPU (a
   heavily-pipelined implementation of the 386 architecture) with an
   enhanced 387 (the chip's floating-point unit, FPU) and 8 KB of unified
   on-chip code/data cache. (This description is necessarily simplified;
   for a detailed hardware description, see [52].) The 486DX offers about
   two to three times the integer performance of a 386 at the same clock
   frequency, while floating-point performance is about three to four times
   as high as the Intel 387DX at the same clock rate [29]. Since the FPU is
   on the same chip as the CPU, the considerable communication overhead
   between CPU and coprocessor in a 386/387 system is omitted, letting FPU
   instructions run at the full speed permitted by the implementation. The
   FPU also takes advantage of the on-chip cache and the highly pipelined
   execution unit. The concurrent execution of CPU and coprocessor
   instructions typical for 80x86/80x87 systems is still in existence on
   the 486, but some FPU instructions like FSIN have nearly no concurrency
   with CPU instructions, indicating that they make heavy use of both CPU
   and FPU resources [53, 1].

   Besides its higher performance, the 486 FPU provides more accurate
   transcendental functions than the 387DX coprocessor, according to my
   tests (see below). To achieve better interrupt latency, FPU instructions
   with long execution times have been made abortable if an interrupt
   occurs during their execution.

   Due to the considerable amount of heat produced by these chips, and
   taking into consideration the slow air flow provided by the fan in
   garden-variety PC tower cases, I recommend an extra fan directly above
   the CPU for safer operation. If you measure the surface temperature of
   a 486DX after some time of operation in a normal tower case without
   extra cooling, you may well come up with something like 80-90 degrees
   Celsius (that is 175-195 degrees Fahrenheit for those not familiar with
   metric units) [54,55]. You don't need the well known (and expensive)
   IceCap[tm] to effectively cool your CPU; a simple fan mounted directly
   above the CPU can bring the temperature of the chip down to about 50-60
   degrees Celsius (120-140 degrees Fahrenheit), depending on the room
   temperature and the temperature within the PC case (which depends on the
   total power dissipation of all the components and the cooling provided
   by the fan in the system's power supply). According to a simple rule
   known as Arrhenius' Law, lowering the temperature by 10 degrees Celsius
   slows down chemical reactions by a factor of two, so lowering the
   temperature of your CPU by 30 degrees should prolong the life of the
   device by a factor of eight, due to the slower ageing process. If you
   are reluctant to add a fan to your system because of the additional
   noise, settle for a low-noise fan like those available from the German
   manufacturer Pabst (this is not meant to be an advertisement; I am just
   the happy owner of such a fan, and have no other connections to the
   firm).
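
   The rule of thumb above amounts to a simple power of two. The sketch
   below just restates it; it assumes the factor-of-two-per-10-degrees
   rule holds across the whole temperature range, which is a
   simplification, and the function name is my own.

```python
def lifetime_factor(delta_t_celsius, halving_step=10.0):
    """Estimated lifetime gain from running delta_t_celsius cooler.

    Assumes the ageing rate halves for every halving_step degrees Celsius.
    """
    return 2.0 ** (delta_t_celsius / halving_step)


for drop in (10, 20, 30):
    print(f"{drop} degrees cooler: about {lifetime_factor(drop):.0f}x")
```

   A drop of 30 degrees gives three halvings of the ageing rate, hence the
   factor of eight quoted above.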

   The 486DX comes in a 168-pin ceramic PGA (pin grid array). It is
   available in 25 MHz and 33 MHz versions. Since the end of 1991, a 50 MHz
   version has also been available, manufactured in a CHMOS V process (the
   25 MHz and 33 MHz versions use the CHMOS IV process). Maximum
   power consumption is 3500 mW for the 25 MHz 486 (2600 mW typical), 4500
   mW for the 33 MHz version (3500 mW typical), and 5000 mW (3875 mW
   typical) for the 50 MHz chip.

Intel 486DX2

   The 486DX2 represents the latest generation of Intel CPUs. The "DX2"
   suffix (instead of simply DX) is meant to be an indicator that these are
   clock-doubled versions of the basic CPU. A normal 486DX operates at the
   frequency provided by the incoming clock signal. A 486DX2 instead
   generates a new clock signal from the incoming clock by means of a PLL
   (phase locked loop). In the DX2, this clock signal has twice the
   frequency of the incoming clock, hence the name clock-doubler. All
   internal parts of the 486DX2 (cache, CPU core, and FPU) run at this
   higher frequency; only the bus interface runs at the normal (undoubled)
   speed. Using this technique, an Intel 486DX2-50 can run on an unmodified
   motherboard designed for 25 MHz operation. Since motherboards which run
   at 50 MHz are much harder to design and build than those for 25 MHz,
   this makes a 486DX2-50 system cheaper than an 'equivalent' 486DX-50
   system.

   For all operations that don't access off-chip resources (e.g., register
   operations), a 486DX2-50 provides exactly the same performance as a
   486DX-50, and twice the performance of a 486DX-25. However, since the
   main memory in a 486DX2-50 system still operates at 25 MHz, all
   instructions involving memory accesses are potentially slower than in a
   486DX-50 system, whose memory also (presumably) runs at 50 MHz. The
   internal cache of the 486 helps this problem a bit, but overall
   performance of a 486DX2-50 is still lower than that of a 486DX-50.
   Intel's documentation [32] shows this drop to be quite small, although
   it is highly dependent upon the particular application.
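
   A very rough model shows why the 486DX2-50 lands between the 486DX-25
   and the 486DX-50. The sketch splits a program into cycles spent on-chip
   (scaled by the core clock) and cycles spent on the bus (scaled by the
   bus clock). The workload of 900 on-chip and 100 bus cycles is purely
   hypothetical, and the model ignores caches, wait states, and overlap
   between bus and core activity.

```python
def run_time_us(core_cycles, bus_cycles, core_mhz, bus_mhz):
    """Run time in microseconds for a simple two-clock model."""
    return core_cycles / core_mhz + bus_cycles / bus_mhz


# Hypothetical workload: 900 on-chip cycles and 100 bus cycles.
CORE, BUS = 900, 100

dx_25 = run_time_us(CORE, BUS, core_mhz=25, bus_mhz=25)
dx2_50 = run_time_us(CORE, BUS, core_mhz=50, bus_mhz=25)
dx_50 = run_time_us(CORE, BUS, core_mhz=50, bus_mhz=50)

print(f"486DX-25 : {dx_25:.1f} us")
print(f"486DX2-50: {dx2_50:.1f} us")
print(f"486DX-50 : {dx_50:.1f} us")
```

   The larger the share of bus cycles in a program, the further the
   486DX2-50 falls behind the 486DX-50, which is why the drop depends so
   strongly on the application.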

   The truly wonderful thing about the 486DX2 is that it allows easy
   upgrading of 25 and 33 MHz 486 systems, since the 486DX2 is completely
   pin-compatible with the 486DX: you need only take out the 486DX and plug
   in the new 486DX2. Note that power consumption of the 486DX2-50 equals
   that of the 486DX-50 (4000 mW typical, 4750 mW max.), and that the
   486DX2-66 exceeds this by about 25% (4875 mW typical, 6000 mW max.).
   These chips get *really* hot in a standard PC case with no extra
   cooling, even if they come with an attached heat sink by default. (See
   the discussion above for more detailed information on this problem and
   possible solutions.)

Intel 487SX

   The 487SX is the math coprocessor intended for use in 486SX systems. The
   486SX is basically a 486DX without the floating-point unit (FPU) [48,
   50]. (Originally Intel sold 486DXs with a defective FPU as 486SXs but it
   has now completely removed the FPU part from the 486SX mask for mass
   production.) The introduction of the 486SX in 1991 has been viewed by
   many as a marketing 'trick' by Intel to take market share from the 386
   based systems once AMD became successful with their Am386. (AMD has
   taken as much as 40% of the 386 market due to some superior features
   such as higher clock frequency, lower power consumption, fully static
   design, and availability of a 3V version.) A 486SX at 20 MHz delivers
   a bit less integer performance than a 40 MHz Am386.

   To add floating-point capabilities to a 486SX based system, it would
   seem to be easiest to swap the 486SX for a 486DX, which includes the FPU
   on-chip. However, Intel has prevented this easy solution by giving the
   486SX a slightly different pinout [48, 51]. Since only three pins are
   assigned differently, clever board manufacturers have come out with
   boards that accept anything from a 486SX-20 to a 486DX2-50 in their CPU
   socket and by doing so provide a clean upgrade path. A set of three
   jumpers ensures correct signal assignment to the changed pins for either
   CPU type. To upgrade 486SX systems without this feature, you are forced
   to buy a 487SX and install it in the "Performance Upgrade Socket"
   (present in most systems).

   Once the 487SX was available, it was quickly discovered that it is just a
   normal 486DX with a slightly different pinout [49]. Technically
   speaking, the solution Intel chose was the only practical way to provide
   a 486SX system with the high level of floating-point performance the
   486DX offers. The CPU and FPU must be on the same chip; otherwise, the
   FPU cannot make use of the CPU's internal cache and there would be
   considerable overhead in CPU-FPU communication (similar to a 386/387
   system), nullifying most of the arithmetic speedups over the 387. That
   the 486SX, 487SX, and 486DX are *not* pin-compatible seems to be purely
   for marketing reasons.

   To upgrade a 486SX based system, Intel also offers the OverDrive chip,
   which is just the same as a 487SX with internal clock doubling. It also
   goes into the motherboard's "Performance Upgrade Socket". The OverDrive
   roughly doubles the performance of a 486SX/487SX based system. (For an
   explanation of clock doubling, see the description of the Intel 486DX2
   above.)

   Inserting the 487SX effectively shuts down the 486SX in the 486SX/487SX
   system, so the 486SX could be removed once the 487SX is installed. Since
   the shutdown is logical, not electrical, the 486SX still uses power if
   used with the 487SX, although it is inoperative. As with the 486SX,
   the 487SX is currently available in 20 MHz and 25 MHz versions. At 20
   MHz, the 487SX has a power consumption of max. 4000 mW (3250 mW
   typical). It is available in a 169-pin ceramic PGA (pin grid array).

Weitek 1167

   This math coprocessor was the predecessor of the Weitek Abacus 3167. It
   was actually a small printed circuit board with three chips mounted on
   it. In contrast to the Weitek 3167, the 1167 did not have a square root
   instruction; instead, the square root function was computed by means of
   a subroutine in the Weitek transcendental function library. However, the
   1167 did have a mode in which it supported denormal numbers. (The Weitek
   3167 and 4167 only implement the 'fast' mode, in which denormals are not
   supported.) Overall performance of the 1167 is slightly less than that
   of the Weitek 3167.

Weitek 3167

   The 3167 was introduced by Weitek in 1989 and provided the fastest
   floating-point performance possible on a 386 based system at that time.
   The 3167 is not a real coprocessor, strictly speaking, but rather a
   memory-mapped peripheral device. The architecture of the 3167 was
   optimized for speed wherever possible. Besides using the faster memory-
   mapped interface to the CPU (the 80x87 uses IO ports), it does not
   support many of the features of the 80x87 coprocessors, allowing all of
   the chip's resources to be concentrated on the fast execution of the
   basic arithmetic operations. (For a more detailed description of the
   Weitek 3167, see the first chapter of this document.)

   In benchmark comparisons, the Weitek 3167 provided up to 2.5 times the
   performance of an Intel 387DX coprocessor. For example, on a 33 MHz 3167
   the Whetstone benchmark performed at 7574 kWhetstones/sec, compared with
   3743 kWhetstones/sec for the Intel 387DX. (Note, however, that these
   are single-precision results and that the Weitek 3167's performance
   would drop to about half the stated rate for double-precision, while the
   value for the Intel 387DX would change very little.) In any case, before
   the advent of the Intel RapidCAD, the Weitek 3167 usually outperformed
   all 387-compatible coprocessors, even for double-precision operations
   [63,65,69]. For typical applications, the advantage of the Weitek 3167
   over the 387 clones is much smaller. In a benchmark test using
   AutoDesk's 3D-Studio the Weitek 3167 performed at 123% of the Intel
   387DX's performance compared with 106% for the Cyrix FasMath 83D87 and
   118% for the Intel RapidCAD.

   The Weitek Abacus 3167 is packaged in a 121-pin PGA that fits into an
   EMC socket (provided in most 386-based systems). It does *not* fit into
   the normal 68-pin PGA socket intended for a 387 coprocessor.

   To get the best of both worlds, one might want to use a Weitek 3167 and
   a 387 compatible coprocessor in the same system. These coprocessors can
   coexist in the same system without problems; however, most 386-based
   systems contain only one coprocessor socket, usually of the EMC
   (extended math coprocessor) type. Thus, you can install either a 387
   coprocessor or a Weitek 3167, but not both at the same time. There *are*
   small daughter boards available that plug into the EMC socket and
   provide two sockets, an EMC and a standard coprocessor socket.

   At 25 MHz, the Weitek 3167 has a power consumption of max. 1750 mW. At
   33 MHz, max. power consumption is 2250 mW.

Weitek 4167

   The 4167 is a memory-mapped coprocessor that has the same architecture
   as the 3167; it is designed to provide 486-based systems with the
   highest floating-point performance available. It executes coprocessor
   instructions at three to four times the speed of the Weitek 3167.
   Although it is up to 80% faster than the Intel 486 in some benchmarks
   [1,69], the performance advantage for real applications is probably more
   like 10%. The introduction of the 486DX2 processors has more or less
   obliterated the need for a Weitek 4167, since the DX2 CPUs provide the
   same performance as the Weitek, as well as the additional features the
   80x87 architecture has that the Weitek does not.

   The Weitek 4167 is packaged in a 142-pin PGA package that is only
   slightly smaller than the 486's package. At 25 MHz, it has a max. power
   consumption of 2500 mW [32].

Finding out which coprocessor you have

If you are interested in programming techniques which allow the detection and differentiation of the coprocessors described above, I refer you to my COMPTEST program. COMPTEST reliably detects the type and clock frequency of the CPU and coprocessor installed in your machine. The current version is CTEST257.ZIP, with future versions to be called CTEST258, CTEST259 and so on. COMPTEST can correctly identify all of the coprocessors described above, with the exception of the Weitek chips, for which the detection mechanism is not that reliable.

COMPTEST is in the public domain and comes with complete source code. It is available via anonymous ftp from garbo.uwasa.fi and additional ftp sites that mirror garbo.

Current coprocessor prices and purchasing advice

Due to mid-1992 price slashing by Cyrix (and subsequently, Intel) for 387 coprocessors, prices have dropped significantly for all 287 and 387 compatibles, with hardly any price difference between manufacturers. 387DX compatible coprocessors typically sell for ~US$ 80 for all speeds except for 40 MHz versions, which are typically ~US$ 90. 387SX compatible coprocessors sell for ~US$ 70, regardless of speed, with the exception of the 33 MHz versions, which are ~US$ 80. The Intel 287XL sells for ~US$ 90, while the IIT 2C87 and Cyrix 82S87 each sell for about US$ 60. 8087s may be more expensive, the price of an 8087-10 being ~US$ 150. I purchased the Intel RapidCAD for US$ 300 and haven't seen it offered for a better price. I see the Weitek Abacus 3167-33 being offered for US$ 230 and the 4167-33 being offered for US$ 850. The Intel 486SX OverDrive is available for ~US$ 570 for the 20 MHz version, while the Intel 486DX2-50 costs ~US$ 650. This price information reflects the price situation as of 01-11-93; prices can be expected to drop slightly in the near future.

Which coprocessor should you buy?

Several computer magazines have published application-level performance comparisons for various 387 coprocessors and Weitek's ABACUS 3167 and 4167 chips [1,25,68,70]. Applications tested included AutoCAD R11, RenderStar, Quattro Pro, Lotus 1-2-3, and AutoDesk's 3D-Studio. For most tests, performance improvements for the 387 clones over Intel's 387DX were small to marginal, the clones running the applications no more than 5-15% faster than the Intel 387DX. In the test of 3D-Studio, one of the few programs that directly supports the Weitek Abacus, the Weitek 3167 improved performance by 23% over an Intel 387DX and the 4167 improved performance by 10% over the 486DX [1].

If you have a demand for high floating-point performance, you should consider buying a full 486-based system, rather than a 386-based system with an additional coprocessor. Consider: A 386/33 MHz motherboard currently sells for ~US$ 270; together with the coprocessor, the cost totals ~US$ 350. A 486/33 MHz ISA motherboard sells for US$ 650. While this means that the 486 system is 85% more expensive than the 386/387 system, it also provides 100% more integer and floating-point performance (twice the performance), giving it better price/performance for math-intensive applications. As prices for 486 chips fall in the future, the price difference between these two systems should become even smaller.
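The price/performance argument above can be checked with a few lines of arithmetic. The following is only a sketch using the prices quoted in the text; the performance figures are relative units (the 386/387 system is taken as 1.0 and the 486 as 2.0, following the "twice the performance" statement), and all variable names are made up for this illustration.

```python
# Prices as quoted in the text (US$, early 1993).
cost_386_387 = 350.0   # 386/33 motherboard plus 387 coprocessor
cost_486 = 650.0       # 486/33 ISA motherboard

# Relative performance: the text credits the 486 with twice the
# integer and floating-point performance of the 386/387 system.
perf_386_387 = 1.0
perf_486 = 2.0

# Fraction by which the 486 system is more expensive (roughly 0.85).
extra_cost = cost_486 / cost_386_387 - 1.0

# Dollars paid per unit of performance; lower is better.
dollars_per_perf_386 = cost_386_387 / perf_386_387
dollars_per_perf_486 = cost_486 / perf_486

print(f"486 system costs {extra_cost * 100:.1f}% more")
print(f"386/387: US$ {dollars_per_perf_386:.0f} per performance unit")
print(f"486:     US$ {dollars_per_perf_486:.0f} per performance unit")
```

Under these assumptions the 486 system comes out cheaper per unit of performance even though its purchase price is higher.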

If you want to push your 386-based system to its maximum floating-point performance and can't switch to a 486, I recommend the Intel RapidCAD chipset. It is both faster [1] and cheaper than installing a Weitek Abacus 3167 in a 386 system, which used to be the highest performing combination before the RapidCAD was introduced.

In a similar vein, the introduction of the Intel 486DX2 clock-doubler chips has obliterated the need for a Weitek 4167 to get maximum floating-point performance out of a 486-based system. A 486DX2-66 performs at or above the performance level of a 33 MHz Weitek 4167, even if the latter uses single-precision rather than double-precision. The 486DX2-66 is rated by Intel at 24700 double-precision kWhetstones/sec and 3.1 double-precision Linpack MFLOPS. (Of course, these benchmarks used the highest performance compilers available. But even with a Turbo Pascal 6.0 program, I managed to squeeze 1.6 double-precision MFLOPS out of the 486DX2-66 for the LLL benchmark [for a description of these benchmarks, see the paragraph on benchmarks below].) Although I haven't yet seen 486DX2-66 processors being offered to end users for upgrade purposes, I recommend the 486DX2-66 to those who need the highest floating-point performance and are planning to buy a new PC. The price difference between a 33 MHz 486DX motherboard and a 486DX2-66 motherboard is around US$ 450, well below the price for the Weitek Abacus 4167.

The benchmark programs / Coprocessor performance comparisons

The performance statistics below were put together with the help of four widely-known numeric benchmarks and two benchmarks developed by me. Three Pascal programs, one FORTRAN program, and two assembly language programs were used. The assembly language programs were linked with Borland's Turbo Pascal 6.0 for library support, especially to include the coprocessor emulator of the TP 6.0 run-time library. The Pascal programs were compiled with Turbo Pascal 6.0, a non-optimizing compiler that produces 16-bit code. The FORTRAN program was compiled using Microsoft's FORTRAN 5.0, an optimizing compiler that generates 16-bit code. All programs use double-precision variables (except PEAKFLOP and SAVAGE, which use double extended precision).

Note that the use of a highly optimizing compiler producing 32-bit code can give much higher performance for some benchmarks. For example, Intel rates the 33 MHz 386/387DX at 3290 kWhetstones/sec and 0.4 double-precision LINPACK MFLOPS [28,29], and it rates the Intel 486 at 12300 kWhetstones/sec and 1.6 double-precision LINPACK MFLOPS [30]. The compilers used in these benchmarks run by the chip's manufacturer are the ones that give the highest performance available, and sell in the US$ 1000+ price range. Some of them may even be experimental or prereleased versions not available to the general public. The relative performance of one coprocessor to another can and does vary greatly depending on the code generated by compilers. Non-optimizing compilers tend to generate a high percentage of operations which access variables in memory, while optimizing compilers produce code that contains many operations involving registers. Thus it is quite possible that coprocessor A beats coprocessor B running benchmark Z if compiled with compiler C, but B beats A when the same benchmark is compiled using compiler D.

All benchmarks in this overview were run from floppy under a 'bare-bones' MS-DOS 5.0 without the CONFIG.SYS and AUTOEXEC.BAT files. This ensured that no TSR or other program unnecessarily stole computing resources from the benchmarks.

Description of benchmarks

PEAKFLOP is the kernel of a fractal computation. It consists mainly of a tight loop written in assembly code and fine-tuned to give maximum performance. The whole program fits nicely into even a very small CPU cache. All variables are held in the CPU's and coprocessor's registers, so the only memory access is for opcode fetches. The main loop contains three multiplications and five additions/subtractions; this ratio is fairly typical for other floating-point intensive programs as well. Due to the nature of this program, its MFLOPS rate is unlikely to be exceeded by any program that calculates anything useful; thus the name PEAKFLOP. You will find the source code for PEAKFLOP in appendix B.
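To illustrate the operation mix described above, here is a minimal sketch of a fractal kernel with three multiplications and five additions/subtractions per pass. It uses the familiar Mandelbrot iteration, which happens to have exactly this mix; that PEAKFLOP uses this particular formulation is an assumption (the real source is in appendix B), and the function names and pass count are made up for illustration. Run under an interpreter, the reported rate says nothing about the FPU; the point is only how the MFLOPS figure is derived from the operation count.

```python
import time

FLOPS_PER_PASS = 8  # 3 multiplications + 5 additions/subtractions


def mandelbrot_passes(cx, cy, max_passes):
    """Iterate z -> z*z + c and return the number of completed passes."""
    x = y = 0.0
    passes = 0
    while passes < max_passes:
        xx = x * x              # multiplication 1
        yy = y * y              # multiplication 2
        xy = x * y              # multiplication 3
        if xx + yy > 4.0:       # addition 1: escape test on |z|^2
            break
        x = xx - yy + cx        # subtraction 2, addition 3
        y = xy + xy + cy        # additions 4 and 5
        passes += 1
    return passes


def measure_mflops(max_passes=200_000):
    """Time the kernel for c = 0, which never escapes, and return MFLOPS."""
    start = time.perf_counter()
    passes = mandelbrot_passes(0.0, 0.0, max_passes)
    elapsed = time.perf_counter() - start
    return passes * FLOPS_PER_PASS / elapsed / 1e6


if __name__ == "__main__":
    print(f"{measure_mflops():.3f} MFLOPS (interpreter overhead included)")
```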

TRNSFORM multiplies an array of 8191 vectors with a 3D-transformation matrix (a 4x4 matrix). Each vector consists of four double-precision values. Multiplying vectors with a matrix is a typical operation in the manipulation (e.g. rotation) of 3D objects which are made up from many vectors describing the object. This benchmark stresses addition and multiplication as well as memory access. For each vector, 16 multiplications and 12 additions are used, and about 256 KB of data is accessed during the benchmark run.

For the IIT 3C87, a special version of TRNSFORM was written that makes use of the special F4X4 instruction available on that coprocessor. F4X4 does a full multiplication of a 4x4 matrix by a 4x1 vector in a single instruction. TRNSFORM is implemented as an optimized assembler program linked with the Turbo Pascal 6.0 library. The full source code can be found in appendix B.
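The operation counts and the data volume quoted for TRNSFORM follow directly from the shape of the computation. The sketch below is not the benchmark itself (which is written in assembly language); it is a plain matrix-by-vector product with illustrative names, showing where the 16 multiplications, 12 additions and roughly 256 KB come from. The example matrix is an arbitrary translation chosen for this illustration.

```python
VECTOR_COUNT = 8191      # number of vectors, as stated in the text
COMPONENTS = 4           # each vector has four double-precision values
BYTES_PER_DOUBLE = 8


def transform(matrix, vector):
    """Multiply a 4x4 matrix by a 4-component vector.

    Each of the four rows needs 4 multiplications and 3 additions.
    """
    result = []
    for row in matrix:
        acc = row[0] * vector[0]
        for j in range(1, COMPONENTS):
            acc += row[j] * vector[j]
        result.append(acc)
    return result


MULS_PER_VECTOR = COMPONENTS * COMPONENTS          # 4 rows x 4 products
ADDS_PER_VECTOR = COMPONENTS * (COMPONENTS - 1)    # 4 rows x 3 sums
ARRAY_BYTES = VECTOR_COUNT * COMPONENTS * BYTES_PER_DOUBLE

# Example: translate a point by (1, 2, 3) in homogeneous coordinates.
translate = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]
moved = transform(translate, [10.0, 20.0, 30.0, 1.0])

print(moved)
print(MULS_PER_VECTOR, ADDS_PER_VECTOR, f"{ARRAY_BYTES / 1024:.1f} KB")
```

The F4X4 instruction of the IIT 3C87 performs the work of one `transform` call in a single instruction, which explains the large TRNSFORM advantage of the "3C87,4X4" rows in the tables.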

LLL is short for Lawrence Livermore Loops [21], a set of kernels taken from real floating-point intensive programs. Some of these loops are vectorizable, but since we don't deal with vector processors here, this doesn't matter. For this test, LLL was adapted from the FORTRAN original [20] to Turbo Pascal 6.0. By variable overlaying (similar to FORTRAN's EQUIVALENCE statement), memory allocation for data was reduced to 64 KB, so all data fits into a single 64 KB segment. The older version of LLL is used here which contains 14 loops. There also exists a newer, more elaborate version consisting of 24 kernels. The kernels in LLL exercise only multiplication and addition. The MFLOPS rate reported is the average of the MFLOPS rate of all 14 kernels. All floating-point variables in the programs are of type DOUBLE.

Both LLL and Whetstone results (see below) are reported as returned by my COMPTEST test program, in which they have been included as a measure of coprocessor/FPU performance. COMPTEST has been compiled under Turbo Pascal 6.0 with all 'optimizations' on and using my own run-time library, which gives higher performance than the one included with TP 6.0. My library is available as TPL60N18.ZIP from garbo.uwasa.fi and ftp sites that mirror this site.

Linpack [5] is a well known floating-point benchmark that also heavily exercises the memory system. Linpack operates on large matrices and takes up about 570 KB in the version used for this test. This is about the largest program size a pure DOS system can accommodate. Linpack was originally designed to estimate performance of BLAS, a library of FORTRAN subroutines that handles various vector and matrix operations. Note that vendors are free to supply optimized (e.g., assembly language) versions of BLAS. Linpack uses two routines from BLAS which are thought to be typical of the matrix operations used by BLAS. Both routines only use addition/subtraction and multiplication. The FORTRAN source code for Linpack can be obtained from the automated mail server netlib@ornl.gov. Linpack was compiled using MS FORTRAN 5.0 in the HUGE memory model (which can handle data structures larger than 64 KB) and with compiler switches set for maximum optimization. All floating-point variables in the program are of the DOUBLE type. Linpack performs the same test repeatedly. The number reported is the maximum MFLOPS rate returned by Linpack. Linpack MFLOPS ratings for a great number of machines are contained in [6]. This PostScript document is also available from netlib@ornl.gov.

Whetstone [2,3,4] is a synthetic benchmark based upon statistics collected about the use of certain control and data structures in programs written in high level languages. Based on these statistics, it tries to mirror a 'typical' HLL program. Whetstone performance is expressed by how many hypothetical 'whetstone' instructions are executed per second. It was originally implemented in ALGOL. Unlike PEAKFLOP, LLL, and Linpack, Whetstone not only uses addition and multiplication but exercises all basic arithmetic operations as well as some transcendental functions. Whetstone performance depends on the speed of the CPU as well as on the coprocessor, while PEAKFLOP, LLL, and Linpack place a heavier burden on the coprocessor/FPU.

There exist both old and new versions of Whetstone. Note that results from the two versions can differ by as much as 20% for the same test configuration. For this test, the new version in Pascal from [3] was used. It was compiled with Turbo Pascal 6.0 and my own library (see above) with all 'optimizations' on. All computations are performed using the DOUBLE type.

SAVAGE tests the performance of transcendental function evaluation. It is basically a small loop in which the sin, cos, arctan, ln, exp, and sqrt functions are combined in a single expression. While sin, cos, arctan, and sqrt can be evaluated directly with a single 387 coprocessor instruction each, ln and exp need additional preprocessing for argument reduction and result conversion. According to [14], the Savage benchmark was devised by Bill Savage, and is distributed by: The Wohl Engine Company, Ltd., 8200 Shore Front Parkway, Rockaway Beach, NY 11693, USA. Usually, Savage is programmed to make 250,000 passes through the loop. Here only 10,000 loops are executed for a total of 60,000 transcendental function evaluations. The result is expressed in function evaluations per second. SAVAGE source code was taken from [7] and compiled with Turbo Pascal 6.0 and my own run-time library (see above).
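The loop described above can be sketched as follows. The expression used here, a = tan(arctan(exp(ln(sqrt(a*a))))) + 1 with tan written as sin/cos, is the commonly circulated form of the Savage loop; that it matches the source in [7] character for character is an assumption, and the function and constant names are made up for this illustration.

```python
import math

EVALS_PER_LOOP = 6  # sqrt, ln, exp, arctan, sin, cos


def savage(loops):
    """Run the Savage loop; with exact arithmetic the result is loops + 1."""
    a = 1.0
    for _ in range(loops):
        t = math.exp(math.log(math.sqrt(a * a)))   # sqrt, ln, exp
        angle = math.atan(t)                       # arctan
        a = math.sin(angle) / math.cos(angle) + 1.0  # tan as sin/cos
    return a


loops = 10_000
result = savage(loops)
total_evals = loops * EVALS_PER_LOOP

print(f"result = {result!r} (exact value would be {loops + 1})")
print(f"{total_evals} transcendental function evaluations")
```

Dividing the number of evaluations by the elapsed time gives the "Func/sec" figure used in the tables; the deviation of the result from the exact value is also a rough indicator of the accuracy of the transcendental functions.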

Benchmark results using the Intel 386DX CPU and various coprocessors

My benchmark results for 387 coprocessors, coprocessor emulators and the Intel RapidCAD and Intel 486 CPUs, using the programs described above, on an Intel 386DX system:

     33.3 MHz       PEAKFLOP TRNSFORM LLL     Linpack Whetstone Savage
                    MFLOPS   MFLOPS   MFLOPS  MFLOPS  kWhet/sec Func/sec

     Intel 386DX WITH:
     EM87 emulator  0.0070   0.0040   0.0050  0.0050         26      418 ##
     Franke387 emu. 0.0307   0.0246   0.0194  0.0179        137     3335 $$
     TP/MS-FORT emu 0.0263   0.0227   0.0167  0.0158        133     3160 %%
     Q387 emulator  0.0920   0.0664   0.0305  0.0304        251     4796 ((
     Intel 387DX    0.7647   0.6004   0.3283  0.2676       2046    43860
     ULSI 83C87     1.0097   0.6609   0.3239  0.2598       2089    47431
     IIT 3C87       0.8455   0.5957   0.3198  0.2646       2203    49020
     IIT 3C87,4X4   0.8455   1.4334   0.3198  0.2646       2203    49020 @@
     C&T 38700      0.9455   0.6907   0.3338  0.2700       2376    62565
     Cyrix 387+     0.9286   0.6806   0.3293  0.2669       2435    66890
     Cyrix EMC87    1.0400   0.6628   0.3352  0.2808       2540    71685 //

     Intel RapidCAD 1.8572   1.5798   0.6072  0.4533       3953    72464
     Intel 486DX    2.0800   1.7779   0.9387  0.6682       5143    82192

     40 MHz         PEAKFLOP TRNSFORM LLL     Linpack Whetstone Savage
                    MFLOPS   MFLOPS   MFLOPS  MFLOPS  kWhet/sec Func/sec

     Intel 386DX WITH:
     EM87 emulator  0.0084   0.0080   0.0060  0.0060         31      502 ##
     Franke387 emu. 0.0369   0.0295   0.0233  0.0215        164     4002 $$
     TP/MS-FORT emu 0.0316   0.0273   0.0200  0.0190        160     3794 %%
     Q387 emulator  0.1103   0.0798   0.0365  0.0364        301     5758 ((
     Intel 387DX    0.9204   0.7212   0.3932  0.3211       2428    52677
     ULSI 83C87     1.2093   0.7936   0.3890  0.3120       2528    56926
     IIT 3C87       1.0196   0.7145   0.3834  0.3179       2663    58766
     IIT 3C87,4X4   1.0196   1.7244   0.3834  0.3179       2663    58766 @@
     C&T 38700      1.0722   0.7908   0.4007  0.3222       2837    74906
     Cyrix 387+     1.1305   0.8162   0.3945  0.3208       2946    80322
     Cyrix EMC87    1.2381   0.7963   0.4025  0.3324       3061    86083 //

     Intel RapidCAD 2.2128   1.8931   0.7377  0.5432       4810    86957
     Intel 486DX    2.4762   2.1335   1.1110  0.8204       6195    98522

Benchmark results using the Cyrix 486DLC CPU and various coprocessors

The Cyrix 486DLC is the latest entry into the market of 386DX replacement processors. It features an Intel 486SX-compatible instruction set, a 1 KB on-chip cache, and a 16x16 bit hardware multiplier. The RISC-like execution unit of the 486DLC executes many instructions in a single clock cycle. The hardware multiplier multiplies 16-bit quantities in 3 clock cycles, as compared to 12-25 cycles on a standard Intel 386DX. This is especially useful in address calculations (code from non-optimizing compilers may contain many MUL instructions for array accesses) and for software floating-point arithmetic. The 1 KB cache helps the 486DLC to overcome some of the limitations of the 386 bus interface, and although its hit rate averages only about 65% under normal program conditions, a 5-15% overall performance increase can usually be seen for both integer and floating-point-intensive applications when it is enabled.

The 486DLC's internal cache is a unified data/instruction write-through type, and can be configured as either a direct mapped or a 2-way set associative cache. For compatibility reasons, the cache is disabled after a processor reset and must be enabled with the help of a small routine provided by Cyrix. Cyrix has also defined some additional cache control signals for some of the 486DLC pins, intended to improve communication between the on-chip cache and an external cache. Current 386 systems ignore these signals, since they are not defined for the standard Intel 386DX. However, future systems designed with the 486DLC in mind may take advantage of them for increased performance.

In existing 386 systems, DMA transfers (e.g., by a SCSI controller or a soundcard) may cause the 486DLC's entire on-chip cache to be flushed, since no other means exist to enforce consistency between the cache contents and main memory. This reduces the performance of the 486DLC in these cases. The 486DLC on-chip cache does, however, allow specification of up to four non-cacheable regions, which is particularly useful if your system has memory mapped peripherals (e.g., a Weitek coprocessor).

Although I successfully ran my test programs on the Cyrix chip with all coprocessors, not all of them work well with the 486DLC in all circumstances. The IIT 3C87, the Cyrix 83D87 (chips manufactured prior to November 1991), and the Cyrix EMC87 should not be used with the 486DLC, since they may cause the computer to lock up if the FSAVE and FRSTOR instructions are used. (These instructions are typically used in protected mode multiple task environments to save and restore the coprocessor state for each task. Note that Microsoft Windows also fits this description.) According to Cyrix, this problem occurs only with first revision 486DLCs (sample chips) and is fixed on newer ones. To be on the safe side, I recommend using the Cyrix 387+ with the 486DLC, both for assured compatibility and for best performance. Note that 387+ is a 'Europe only' name and that this chip is called 83D87 elsewhere, just like the old version. You need to get an 83D87 produced after about October 1991 to guarantee that it works correctly with any 486DLC; the same caveat applies to the Cyrix 486SLC and the Cyrix 83S87. If you already have a Cyrix coprocessor, use my COMPTEST program to find out whether you have a 'new' or 'old' coprocessor. COMPTEST is available as CTEST257.ZIP via anonymous ftp from garbo.uwasa.fi (in the /systest directory) and other ftp servers that mirror garbo.

The Cyrix 486DLC is currently the 386 'clone' with the highest integer performance. With the internal cache enabled, integer performance of the 486DLC can be up to 80% higher than that of an Intel 386DX at the same clock frequency, with the average speed gain for most applications being about 35%. Floating-point applications are typically accelerated by about 15%-30% when using a Cyrix 486DLC (with its cache enabled) instead of the Intel 386DX.

     33.3 MHz       PEAKFLOP TRNSFORM LLL     Linpack Whetstone Savage
                    MFLOPS   MFLOPS   MFLOPS  MFLOPS  kWhet/sec Func/sec
     Cyrix 486DLC
     (cache off) WITH:
     EM87 emulator  0.0089   0.0082   0.0062  0.0063         31      472 ##
     Franke387 emu. 0.0402   0.0324   0.0258  0.0240        184     4807 $$
     TP/MS-FORT emu 0.0346   0.0288   0.0206  0.0212        173     4401 %%
     Q387 emulator  0.1214   0.0810   0.0368  0.0382        320     6020 ((
     Intel 387DX    0.8455   0.6552   0.3659  0.3033       2249    48780
     ULSI 83C87     1.1818   0.7543   0.3752  0.3026       2381    53476
     IIT 3C87       0.9541   0.6609   0.3653  0.3036       2476    55814
     IIT 3C87,4X4   0.9541   1.4988   0.3653  0.3036       2476    55814 @@
     C&T 38700      1.1183   0.7644   0.3796  0.3087       2703    73350
     Cyrix 387+     1.1305   0.7445   0.3727  0.3060       2731    81967
     Cyrix EMC87    1.2236   0.7593   0.3823  0.3144       2908    88889 //

     Intel RapidCAD 1.8572   1.5798   0.6072  0.4533       3953    72464
     Intel 486DX    2.0800   1.7779   0.9387  0.6682       5143    82192

     40.0 MHz       PEAKFLOP TRNSFORM LLL     Linpack Whetstone Savage
                    MFLOPS   MFLOPS   MFLOPS  MFLOPS  kWhet/sec Func/sec
     Cyrix 486DLC
     (cache off) WITH:
     EM87 emulator  0.0107   0.0098   0.0075  0.0075         37      567 ##
     Franke387 emu. 0.0488   0.0392   0.0311  0.0288        223     5808 $$
     TP/MS-FORT emu 0.0416   0.0345   0.0246  0.0253        208     5284 %%
     Q387 emulator  0.1463   0.0973   0.0442  0.0458        384     7237 ((
     Intel 387DX    1.0196   0.7880   0.4375  0.3644       2712    58479
     ULSI 83C87     1.4247   0.9064   0.4506  0.3630       2868    64171
     IIT 3C87       1.1556   0.7963   0.4399  0.3611       2988    66964
     IIT 3C87,4X4   1.1556   1.7916   0.4399  0.3611       2988    66964 @@
     C&T 38700      1.3333   0.9210   0.4548  0.3708       3254    88106
     Cyrix 387+     1.3507   0.8958   0.4477  0.3754       3297    98361
     Cyrix EMC87    1.4648   0.9136   0.4548  0.3773       3505   106572 //

     Intel RapidCAD 2.2128   1.8931   0.7377  0.5432       4810    86957
     Intel 486DX    2.4762   2.1335   1.1110  0.8204       6195    98522

     33.3 MHz       PEAKFLOP TRNSFORM LLL     Linpack Whetstone Savage
                    MFLOPS   MFLOPS   MFLOPS  MFLOPS  kWhet/sec Func/sec
     Cyrix 486DLC
     (cache on) WITH:
     EM87 emulator  0.0099   0.0089   0.0068  0.0069         35      550 ##
     Franke387 emu. 0.0462   0.0362   0.0288  0.0265        205     5445 $$
     TP/MS-FORT emu 0.0410   0.0330   0.0234  0.0241        198     5339 %%
     Q387 emulator  0.1344   0.0902   0.0389  0.0403        339     6241 ((
     Intel 387DX    0.8525   0.6552   0.3941  0.3279       2332    49834
     ULSI 83C87     1.2093   0.7543   0.4068  0.3270       2478    57197
     IIT 3C87       0.9720   0.6609   0.3959  0.3295       2579    57252
     IIT 3C87,4X4   0.9720   1.5087   0.3959  0.3295       2579    57252 @@
     C&T 38700      1.1305   0.7644   0.4126  0.3343       2839    75949
     Cyrix 387+     1.1429   0.7445   0.4023  0.3310       2866    85349
     Cyrix EMC87    1.2381   0.7593   0.4150  0.3412       3051    93897 //

     Intel RapidCAD 1.8572   1.5798   0.6072  0.4533       3953    72464
     Intel 486DX    2.0800   1.7779   0.9387  0.6682       5143    82192

     40.0 MHz       PEAKFLOP TRNSFORM LLL     Linpack Whetstone Savage
                    MFLOPS   MFLOPS   MFLOPS  MFLOPS  kWhet/sec Func/sec
     Cyrix 486DLC
     (cache on) WITH:
     EM87 emulator  0.0118   0.0107   0.0082  0.0082         42      659 ##
     Franke387 emu. 0.0565   0.0438   0.0350  0.0313        248     6585 $$
     TP/MS-FORT emu 0.0491   0.0395   0.0279  0.0296        238     6408 %%
     Q387 emulator  0.1610   0.1084   0.0470  0.0484        407     7509 ((
     Intel 387DX    1.0297   0.7880   0.4748  0.3937       2801    59821
     ULSI 83C87     1.4445   0.9028   0.4891  0.3926       2976    65789
     IIT 3C87       1.1686   0.7963   0.4734  0.3916       3096    68729
     IIT 3C87,4X4   1.1686   1.8057   0.4734  0.3916       3096    68729 @@
     C&T 38700      1.3685   0.9173   0.4958  0.4012       3401    91185
     Cyrix 387+     1.3867   0.8958   0.4887  0.3962       3448   102564
     Cyrix EMC87    1.4857   0.9100   0.4959  0.4091       3676   112360 //

     Intel RapidCAD 2.2128   1.8931   0.7377  0.5432       4810    86957
     Intel 486DX    2.4762   2.1335   1.1110  0.8204       6195    98522
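The speedups discussed above can be read off the tables by dividing corresponding entries. The sketch below does this for the 33.3 MHz rows with an Intel 387DX installed, comparing the Cyrix 486DLC (cache off and cache on) against the Intel 386DX. The numbers are copied from the tables; the names and the choice of rows are merely illustrative.

```python
BENCHMARKS = ("PEAKFLOP", "TRNSFORM", "LLL", "Linpack", "Whetstone", "Savage")

# 33.3 MHz results with an Intel 387DX, copied from the tables above.
intel_386dx = (0.7647, 0.6004, 0.3283, 0.2676, 2046, 43860)
cyrix_cache_off = (0.8455, 0.6552, 0.3659, 0.3033, 2249, 48780)
cyrix_cache_on = (0.8525, 0.6552, 0.3941, 0.3279, 2332, 49834)


def gain_percent(new, old):
    """Performance gain of 'new' over 'old' in percent."""
    return (new / old - 1.0) * 100.0


gains_off = [gain_percent(n, o) for n, o in zip(cyrix_cache_off, intel_386dx)]
gains_on = [gain_percent(n, o) for n, o in zip(cyrix_cache_on, intel_386dx)]

for name, off, on in zip(BENCHMARKS, gains_off, gains_on):
    print(f"{name:10s} cache off: {off:5.1f}%   cache on: {on:5.1f}%")
```

The same calculation works for any other pair of rows, for example to compare two coprocessors on the same CPU.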

Benchmark results using the C&T 38600DX CPU and various coprocessors

The Chips&Technologies 38600DX CPU is marketed as a 100% compatible replacement for the Intel 386DX CPU. Unlike AMD's Am386, which uses microcode that is identical to the Intel 386DX's, the C&T 38600DX uses microcode developed independently by C&T using "clean-room" techniques. C&T even included the 386DX's "undocumented" LOADALL386 instruction in the instruction set to provide full compatibility with the 386DX. In my tests, however, I observed that the 38600DX has severe problems with the CPU-coprocessor communication, which causes the floating-point performance to drop below that of the Intel 386DX/Intel 387DX for most programs. This problem exists with all available 387-compatible coprocessors (ULSI 83C87, IIT 3C87, Cyrix EMC87, Cyrix 83D87, Cyrix 387+, C&T 38700, Intel 387DX). A net.acquaintance also did tests with the 38600DX and arrived at similar results. He contacted C&T and they said that they were aware of the problem.

Some instructions execute faster on the C&T 38600DX than on the 386DX, giving an average speedup of 5-10% for integer applications. C&T also produces a 38605DX CPU that includes a 512 byte instruction cache and provides a further performance increase. However, the 38605DX needs a bigger socket (144-pin PGA) and is therefore *not* pin-compatible with the 386DX. Tests using the 38600DX were run at 33.3 MHz, as a 40 MHz version was not available as of 09-17-92 and running the 33 MHz chip version at 40 MHz locked up the machine frequently. Unfortunately, tests using the Intel 387DX consistently locked up in the TRNSFORM benchmark when run at 33.3 MHz. It ran fine at 20 MHz, and the results were scaled to show expected performance at 33.3 MHz.
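The text does not say how the 20 MHz TRNSFORM result was scaled. The simplest approach, and the one assumed in this sketch, is linear scaling with clock frequency. The 20 MHz input value below is hypothetical (it was chosen for illustration and is not a measured figure from this document), and the function name is made up.

```python
def scale_to_clock(result, measured_mhz, target_mhz):
    """Scale a benchmark rate linearly with clock frequency (an assumption)."""
    return result * target_mhz / measured_mhz


# Hypothetical 20 MHz TRNSFORM result in MFLOPS, for illustration only.
measured_at_20 = 0.3375
expected_at_33 = scale_to_clock(measured_at_20, 20.0, 33.3)

print(f"{expected_at_33:.4f} MFLOPS expected at 33.3 MHz")
```

Linear scaling is only reasonable when CPU, coprocessor and memory system are all clocked up together; a benchmark limited by memory wait states would scale less than linearly.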

     33.3 MHz       PEAKFLOP TRNSFORM LLL     Linpack Whetstone Savage
                    MFLOPS   MFLOPS   MFLOPS  MFLOPS  kWhet/sec Func/sec

     C&T 38600DX WITH:
     Intel 387DX    0.7376   0.5620   0.3337  0.2636       2066    45489
     ULSI 83C87     0.5226   0.4690   0.3236  0.2654       2087    43228
     IIT 3C87       0.7879   0.5762   0.3397  0.2674       2263    51195
     IIT 3C87,4X4   0.7879   0.6181   0.3397  0.2674       2263    51195 @@
     C&T 38700      0.5977   0.5572   0.3463  0.2681       2338    63966
     Cyrix 387+     0.5896   0.5508   0.3438  0.2673       2375    66741

     Intel RapidCAD 1.8572   1.5798   0.6072  0.4533       3953    72464
     Intel 486DX    2.0800   1.7779   0.9387  0.6682       5143    82192

     For comparison:

                    PEAKFLOP TRNSFORM LLL     Linpack Whetstone Savage
                    MFLOPS   MFLOPS   MFLOPS  MFLOPS  kWhet/sec Func/sec

     i486DX2-66     4.1601   3.4227   1.6531  1.3010      10655   163934
     i486DX2-50     3.0589   2.6665   1.2537  0.9744       7962   123203
     i387, 20 MHz   0.2253   0.3271   0.1434  0.1171        952    21739 ++
     i387DX, 20 MHz 0.3567   0.4444   0.1484  0.1161       1034    24155 &&
     i80287, 5 MHz  0.0281   0.0310   0.0242  0.0222        150     3261 !!
     i8087,9.54 MHz 0.0636   0.0705   0.0321  0.0219        234     5782 **

Benchmark notes and footnotes

Hardware configuration for test of 387 coprocessors with C&T 38600DX, Intel 386DX, Cyrix 486DLC, and Intel RapidCAD CPUs:

   System A: Motherboard with Forex chip set, 128 KB CPU Cache, 8 MB RAM

Hardware configuration for test of 486 FPU (extra fan for 40 MHz operation):

   System B: Motherboard with SIS chip set, 256 KB CPU Cache, 8 MB RAM

## EM87 V1.2 by Ron Kimball is a public domain coprocessor emulator that
  loads as a TSR. It uses INT 7 traps emitted by 80286, 80386, or 486SX
  systems with no coprocessor upon encountering coprocessor instructions
  to catch coprocessor instructions and emulate them. Whetstone and Savage
  benchmarks for this test were compiled with the original TP 6.0 library,
  as EM87 chokes on the 387 specific FSIN and FCOS instructions used in my
  own library if a 387 is detected. Obviously EM87 identifies itself as a
  387, but it has no support for 387-specific instructions.

$$ Franke387 is a commercial 387 emulator that is also available in a
  shareware version. For this test, shareware version V2.4 was used.
  Franke387, unlike many other emulators, supports all 387 instructions.
  It is loaded as a device driver and uses INT 7 to trap coprocessor
  instructions.

(( Q387 is an emulator that is distributed as a shareware program by
  Quickware of Austin, Texas. As the name implies, this emulator uses
  386 specific code and supports the full 387 instruction set. The
  program is about 330 kByte in size and loads completely into extended
  memory, using absolutely no DOS memory. It is loaded as a TSR and
  requires an EMM (expanded memory manager) to be present. The emulation
  uses the INT 7 mechanism. The version of Q387 used was 3.0a.

%% These benchmarks were run using the built-in coprocessor emulators of the
  TP 6.0 (for Savage, LLL, Whetstone, TRNSFORM, PEAKFLOP) and the MS FORTRAN
  5.0 (for Linpack) run-time libraries. The libraries were forced not to
  use a coprocessor by means of the environment settings NO87=NC and 87=N.

@@ The 3C87 specific F4X4 instruction was used in the vector transformation
  benchmark.

// The EMC87 was used in the 387-compatible mode only. The faster
  memory-mapped mode was *not* used. Times should therefore be identical
  to the Cyrix 83D87.

++ Older motherboard with no chip set (discrete logic), no CPU cache, 16 MB
  RAM

&& System A, CPU cache disabled via extended set-up, turbo-switch set to
  half speed (that is, 20 MHz)

!! 80386 @ 20 MHz / Intel 80287 @ 5 MHz, no CPU cache, 4 MB RAM. Due to the
  fast CPU used here, performance figures are somewhat higher than can be
  expected for a 80286/287 combination, except for the PEAKFLOP benchmark,
  which is basically coprocessor limited.

** 8086/8087 system with 640 KB RAM

Benchmark results for Weitek coprocessors

Since neither a Weitek coprocessor nor a compiler that generates code for the Weitek chips was available to me, performance data for the Weitek Abacus is given here according to [31,32] and scaled to show performance of a 33 MHz system. The benchmarks were compiled using highly-optimizing 32-bit compilers.

                               Single Prec.    Double Prec.    Double Prec.
                               3167    4167    3167    4167     387     486
     Linpack MFLOPS             1.8     5.0     0.8     3.2     0.4     1.6
     Whetstone kWhet/sec       7470   22700    4900   14000    3290   12300

Note that for the Intel coprocessors, running programs in single vs. double-precision doesn't provide much of a performance advantage since all internal calculations are always done in extended precision. Using Weitek coprocessors, however, performance nearly doubles in single-precision mode.
For double-precision calculations using only basic arithmetic, the Weitek
Abacus can at most provide performance at twice the level of the
respective Intel coprocessor (387/486) at the same clock speed.

Comparison of floating-point performance [30,32]

single-precision        Weitek 4167-33   Intel 486-33   Intel 486DX2-66

Linpack MFLOPS                5.0             1.8             3.5
Whetstone kWhet/sec         22700           12700           25500

double-precision        Weitek 4167-33   Intel 486-33   Intel 486DX2-66

Linpack MFLOPS                3.5             1.6             3.1
Whetstone kWhet/sec         14000           12300           24700


=============================================================================
Clock-cycle timings for coprocessor instructions on various coprocessor chips
=============================================================================

Speed of various coprocessor instructions, measured in clock cycles, as
captured by my program 87TIMES. Error is +/- one clock cycle, except for
the Intel 80287. Times for the 80287 were determined on a system with a
20 MHz 80386 and a 5 MHz Intel 80287. Therefore, times may differ from a
genuine 80286/287 system, especially for those instructions that access
an operand in memory. Since the times are stated as the number of
coprocessor clock cycles used, the faster 386, which can execute four
clock cycles where the 80287 executes one clock cycle, may decrease
memory access times as seen by the coprocessor.

The CPU used in testing the 387 coprocessors was an Intel 386DX. Note
that due to the improved coprocessor interface of the Cyrix 486DLC, the
execution time of most coprocessor instructions drops by 2-3 clock
cycles when used with this CPU.
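Because the timings in this section are stated in clock cycles, comparing
chips in wall-clock terms needs one extra division by the clock frequency.
The following Python sketch illustrates that conversion for the register
form of FDIV, using cycle counts from the table in this section. The 33 MHz
clock and the helper name cycles_to_microseconds are assumptions made for
illustration only; they are not part of 87TIMES.

```python
def cycles_to_microseconds(cycles, clock_mhz):
    """Convert a clock-cycle count to microseconds at a given clock rate."""
    return cycles / clock_mhz


# FDIV ST,ST(1) cycle counts as listed in the timing table.
FDIV_CYCLES = {"Intel i486": 73, "Cyrix 83D87": 36, "Intel 80387": 100}

# Assumed coprocessor clock in MHz (illustrative, not a measured setup).
ASSUMED_CLOCK_MHZ = 33.0

for chip, cycles in FDIV_CYCLES.items():
    micros = cycles_to_microseconds(cycles, ASSUMED_CLOCK_MHZ)
    print(f"{chip:12s} {cycles:4d} cycles = {micros:.2f} us")

# In the 80287 test system the CPU runs at 20 MHz and the coprocessor at
# 5 MHz, so four CPU cycles elapse during each coprocessor cycle.
cpu_cycles_per_fpu_cycle = 20 / 5
```

The last line restates the ratio mentioned above: memory accesses performed
by the fast 386 look short when counted in slow 80287 clock cycles.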
                 Intel    Intel Cyrix Cyrix   C&T  ULSI   IIT Intel Intel
                  i486 RapidCAD 83D87  387+ 38700 83C87  3C87 387DX 80387

FLD1                 4        3    14    14    14    18    24    23    26
FLDZ                 4        3    14    14    14    18    24    23    31
FLDPI                7        8    14    15    14    18    24    38    45
FLDLG2               7        8    14    14    14    18    24    33    45
FLDL2T               7        8    14    14    14    19    24    38    45
FLDL2E               7        8    14    14    14    19    24    38    45
FLDLN2               7        8    14    14    14    19    24    38    45
FLD ST(0)            4        4    14    14    14    14    24    20    21
FST ST(1)            3        4    14    14    14    14    19    18    22
FSTP ST(0)           4        4    14    14    14    15    19    19    22
FSTP ST(1)           4        4    15    15    14    15    19    20    22
FLD ST(1)            4        4    14    14    14    14    24    18    21
FXCH ST(1)           4        4    14    20    14    19    24    24    27
FILD [Word]         12       16    33    37    32    42    38    47    62
FILD [DWord]         8       11    26    26    21    32    28    35    45
FILD [QWord]         9       15    30    30    25    36    32    34    54
FLD [DWord]          3        5    26    26    21    23    28    20    25
FLD [QWord]          3        7    30    30    25    27    32    24    35
FLD [TByte]          5       11    46    46    46    46    47    46    57
FBLD [TByte]        83       90    66    86   106   146   197    71   278
FIST [Word]         31       31    37    40    37    42    51    69    90
FIST [DWord]        29       30    35    40    35    40    49    66    84
FST [DWord]          7        7    35    37    32    40    33    37    40
FST [QWord]          8        9    43    43    39    47    40    45    51
FISTP [Word]        32       32    42    40    37    43    46    70    90
FISTP [DWord]       31       31    40    40    35    41    50    67    87
FISTP [QWord]       29       29    44    44    42    48    56    73    92
FSTP [DWord]         8        8    38    36    32    41    35    38    43
FSTP [QWord]         9        9    46    43    39    48    42    46    49
FSTP [TByte]         8        8    50    45    49    50    48    53    58
FBSTP [TByte]      170      172    98    98   114   129   218   144   533
FINIT               17       31    15    16    15    15    16    16    25
FCLEX                7       20    15    16    16    16    16    16    25
FCHS                 7        8    14    15    14    14    19    30    33
FABS                 5        5    14    15    14    14    19    30    33
FXAM                12       13    14    15    14    14    19    39    43
FTST                 5        5    19    25    14    24    24    34    38
FSTENV              67       82   125   125   124   132   124   159   165
FLDENV              44       59   106   106   112   120   106   119   129
FSAVE              181      169   355   355   374   361   376   469   511
FRSTOR             130      203   358   358   385   372   371   420   456
FSTSW [mem]          4        5    14    14    14    14    14    14    17
FSTSW AX             3        4    12    12    11    11    11    11    14
FSTCW [mem]          4        5    14    14    13    13    13    14    18
FLDCW [mem]          4       11    26    26    31    32    27    32    36
FADD ST,ST(0)        8        9    19    20    19    19    24    24    32
FADD ST,ST(1)        9        9    19    20    19    18    24    20    32
FADD ST(1),ST       10       10    19    20    19    18    24    24    37
FADDP ST(1),ST      11       11    19    19    19    16    24    25    37
FADD [DWord]         9       10    25    28    22    23    23    21    34
FADD [QWord]         9       10    32    32    26    27    27    25    38
FIADD [Word]        20       21    34    34    33    40    40    52    80
FIADD [DWord]       20       21    27    28    27    30    30    37    61
FSUB ST(1),ST       10       10    19    20    19    19    24    24    38
FSUBR ST(1),ST       9       10    19    22    19    19    24    27    38
FSUBRP ST(1),ST     10       10    19    19    22    20    24    25    38
FSUB [DWord]        11       12    27    28    27    23    29    27    32
FSUB [QWord]        11       12    32    32    31    27    33    26    44
FISUB [Word]        21       21    34    34    34    40    40    52    80
FISUB [DWord]       21       22    27    28    27    29    30    40    60
FMUL ST,ST(1)       16       17    19    25    24    24    29    38    57
FMUL ST(1),ST       16       17    19    24    24    24    29    40    62
FMULP ST(1),ST      17       17    19    24    24    25    29    40    58
FIMUL [Word]        22       23    40    40    37    46    46    52    80
FIMUL [DWord]       22       23    27    28    27    36    35    45    68
FMUL [DWord]        11       12    27    28    27    28    29    25    45
FMUL [QWord]        14       15    32    32    31    32    33    37    61
FDIV ST,ST(0)       73       74    26    40    59    54    54    89   100
FDIV ST,ST(1)       73       74    36    45    59    54    54    77   100
FDIV ST(1),ST       73       74    36    45    59    55    54    78   102
FDIVR ST(1),ST      73       74    36    45    59    54    54    77   102
FDIVRP ST(1),ST     73       74    36    44    59    55    54    76   106
FIDIV [Word]        84       85    52    58    75    76    76   105   141
FIDIV [DWord]       84       85    45    46    65    65    65   101   123
FDIV [DWord]        73       74    45    46    63    56    59    77   101
FDIV [QWord]        73       74    50    50    67    60    63    78   103
FSQRT (0.0)         25       25    19    19    14    19    24    29    37
FSQRT (1.0)         83       84    36    74    54    89    59   109   132
FSQRT (L2T)         86       87    36    74    54    89    59   104   137
FXTRACT (L2T)       17       17    19    19    19    28    79    53    72
FSCALE (PI,5)       30       30    36    24    24    49    79    59    82
FRNDINT (PI)        31       31    19    29    24    34    29    49    82
FPREM (99,PI)       58       59    54    99    44    54    49    79    96
FPREM1(99,PI)       90       91    54    99    44    59    54   104   121
FCOM                 5        6    15    20    19    25    19    29    32
FCOMP                6        6    15    19    19    25    19    30    33
FCOMPP               7        7    15    19    19    25    19    31    40
FICOM [Word]        16       17    34    34    33    46    34    58    76
FICOM [DWord]       16       16    21    28    21    35    23    45    57
FCOM [DWord]         5        6    21    28    22    23    23    27    34
FCOM [QWord]         5        8    27    32    25    27    27    31    39
FSIN (0.0)          24       24    14    99    14    19    24    39    43
FSIN (1.0)         310      313   114   164   144   494   219   509   596
FSIN (PI)           88       89   118   189    64    64   214   134   152
FSIN (LG2)         292      295    72    89   139   454   184   449   531
FSIN (L2T)         299      302   123   179   164   469   214   454   536
FCOS (0.0)          24       24    19   159    14    19    24    34    42
FCOS (1.0)         302      305    84   104   139   489   214   459   547
FCOS (PI)           88       89   154   254    64    64   224   199   232
FCOS (LG2)         300      303   108   149   139   454   194   504   583
FCOS (L2T)         307      310   159   239   164   469   224   509   601
FSINCOS (0.0)       25       25    14    19    19    18    34    38    55
FSINCOS (1.0)      353      356   124   174   254   493   419   538   636
FSINCOS (PI)       105      106   162   263    79    68   424   228   277
FSINCOS (LG2)      340      343   119   159   249   458   359   533   627
FSINCOS (L2T)      347      350   168   248   274   473   424   538   646
FPTAN (0.0)         25       25    14    19    19    18    29    38    46
FPTAN (1.0)        266      269   119   149   184   538   309   323   396
FPTAN (PI)         145      146   134   228   104   108   304   168   211
FPTAN (LG2)        244      246    94   129   179   498   274   298   363
FPTAN (L2T)        247      249   139   219   204   513   304   298   365
FPATAN (0.0)        38       39    19    24    19    20    29    95    93
FPATAN (1.0)       294      298   124   159    29   375   604   360   433
FPATAN (PI)        304      308   139   188   279   360   424   375   472
FPATAN (LG2)       290      293   128   154   269   365   379   375   448
FPATAN (L2T)       304      308   144   189   274   359   424   375   468
F2XM1 (0.0)         25       25    14    14    14    19    24    34    37
F2XM1 (LN2)        209      211    89   119   169   394   284   299   348
F2XM1 (LG2)        204      206    78   104   159   379   284   294   337
FYL2X (1.0)         60       61    36    39    24    75    94   115   127
FYL2X (PI)         294      297   108   163   249   450   359   395   504
FYL2X (LG2)        311      314   108   159   249   460   339   410   518
FYL2X (L2T)        293      296   108   164   249   439   359   390   501
FYL2XP1 (LG2)      334      337    99   169   234   460   284   435   538


                                       80386 +   80386 +   80386 +   80386 +
                     Intel     Intel      Q387 Franke387    TP 6.0      EM87
                      8087     80287  Emulator  Emulator  Emulator  Emulator

FLD1                    26        55        51       481       422      1626
FLDZ                    21        53        39       480       416      1646
FLDPI                   26        55        51       486       443      1626
FLDLG2                  26        56        51       486       423      1626
FLDL2T                  26        55        51       486       440      1626
FLDL2E                  26        53        52       486       423      1626
FLDLN2                  26        55        52       486       441      1626
FLD ST(0)               31        55        57       493       362      1851
FST ST(1)               26        54        61       489       355      1931
FSTP ST(0)              26        54        46       507       358      2115
FSTP ST(1)              21        55        66       507       356      2116
FLD ST(1)               26        55        54       493       362      1852
FXCH ST(1)              21        57        80       497       486      2187
FILD [Word]             58        90       122       667       712      2259
FILD [DWord]            64        74       121       608       812      2164
FILD [QWord]            74        93       179       652       707      2971
FLD [DWord]             49        44       106       633       473      2077
FLD [QWord]             54        57       118       641       524      2336
FLD [TByte]             59        45       102       607       492      2063
FBLD [TByte]           309       310       736      2019      1512     17827
FIST [Word]             79        72       143       854       766      2418
FIST [DWord]            84        80       136       865       518      2325
FST [DWord]             89        85       124       686       441      2200
FST [QWord]             99        92       135       703       516      2481
FISTP [Word]            79        80       154       864       794      2620
FISTP [DWord]           79        81       144       879       541      2523
FISTP [QWord]           88        75       184       904       916      3226
FSTP [DWord]            89        75       133       713       467      2400
FSTP [QWord]            93        72       142       732       538      2678
FSTP [TByte]            49        21       111       685       467      2124
FBSTP [TByte]          528       472      1124      3305      1555     27013
FINIT                   11        10      1079       742       641      1369
FCLEX                   11        10        48       440       323       912
FCHS                    21        54        45       460       354      1744
FABS                    21        54        43       456       349      1738
FXAM                    21        54        72       481       380      1551
FTST                    51        75        70       585       386      2721
FSTENV                  54        57       827       928       519      2104
FLDENV                  48        50       780      1125       450      1631
FSAVE                  214       244      3929      1949       976      2749
FRSTOR                 209       227      2901      2182       657      2225
FSTSW [mem]             28        10        87       516       401      1189
FSTSW AX               N/A        55        57       451       N/A       N/A
FSTCW [mem]             28        10        74       506       359      1167
FLDCW [mem]             19        47        91       524       437      1584
FADD ST,ST(0)           86       128       136       643       706      2805
FADD ST,ST(1)           85       116       146       707       808      3093
FADD ST(1),ST           92       131       157       664       812      3146
FADDP ST(1),ST          92       129       164       704       799      3143
FADD [DWord]           105       122       221       874       969      3139
FADD [QWord]           115       122       232       888      1021      3396
FIADD [Word]           115       122       238       940      1211      3330
FIADD [DWord]          125       122       239       882      1297      3215
FSUB ST(1),ST           88       130       171       738       817      3156
FSUBR ST(1),ST          96       132       181       740       868      3004
FSUBRP ST(1),ST         99       132       193       733       805      3301
FSUB [DWord]           119       122       230       918      1018      3127
FSUB [QWord]           129       123       242       932      1070      3632
FISUB [Word]           115       123       268       977      1081      3802
FISUB [DWord]          125       125       289       940       980      4161
FMUL ST,ST(1)          145       151       297       810      1368      3924
FMUL ST(1),ST          145       151       296       817      1377      3962
FMULP ST(1),ST         148       168       304       840      1365      4164
FIMUL [Word]           132       151       384      1039      1517      4039
FIMUL [DWord]          141       151       383       980      1643      3976
FMUL [DWord]           125       123       345       948      1480      3445
FMUL [QWord]           175       192       387       991      1602      4416
FDIV ST,ST(0)          201       207       274       726      1536      9789
FDIV ST,ST(1)          203       218       299       808      1658     10332
FDIV ST(1),ST          207       214       299       825      1655     10342
FDIVR ST(1),ST         201       206       302       819      1806     10213
FDIVRP ST(1),ST        201       205       309       845      1803     10409
FIDIV [Word]           237       227       390       980      1779     11225
FIDIV [DWord]          246       227       411       944      1680     11572
FDIV [DWord]           229       226       352       893      1722     10577
FDIV [QWord]           236       227       391       993      1777     10829
FSQRT (0.0)             21        57        60       512       382      1755
FSQRT (1.0)            186       206       294      1106      2504     37836
FSQRT (L2T)            186       207       295      1398      2467     37925
FXTRACT (L2T)           51        56       155       726       571      3326
FSCALE (PI,5)           41        56        95       817       443      3194
FRNDINT (PI)            51        58       136       808       800      7092
FPREM (99,PI)           81       131       322      1696       941      4098
FPREM1(99,PI)          N/A       N/A       384      1625       N/A       N/A
FCOM                    56        75       155       582       483      2799
FCOMP                   61        92       160       616       485      2983
FCOMPP                  61        90       149       661       476      3198
FICOM [Word]            79        77       231       808       861      3654
FICOM [DWord]           89        77       231       750       964      3684
FCOM [DWord]            74        75       214       741       625      3643
FCOM [QWord]            74        76       205       754       667      3771
FSIN (0.0)             N/A       N/A       137       639       N/A       N/A
FSIN (1.0)             N/A       N/A       997      4640       N/A       N/A
FSIN (PI)              N/A       N/A       322      2488       N/A       N/A
FSIN (LG2)             N/A       N/A       978      3911       N/A       N/A
FSIN (L2T)             N/A       N/A      1005      3767       N/A       N/A
FCOS (0.0)             N/A       N/A       182       740       N/A       N/A
FCOS (1.0)             N/A       N/A       988      4777       N/A       N/A
FCOS (PI)              N/A       N/A       337      2557       N/A       N/A
FCOS (LG2)             N/A       N/A       976      4176       N/A       N/A
FCOS (L2T)             N/A       N/A      1001      3905       N/A       N/A
FSINCOS (0.0)          N/A       N/A       225       714       N/A       N/A
FSINCOS (1.0)          N/A       N/A      1841      6049       N/A       N/A
FSINCOS (PI)           N/A       N/A      1167      4091       N/A       N/A
FSINCOS (LG2)          N/A       N/A      1525      5640       N/A       N/A
FSINCOS (L2T)          N/A       N/A      1552      5405       N/A       N/A
FPTAN (0.0)             41        58        90       752      8381      2324
FPTAN (1.0)            581       582      1182      6366     10817     29824
FPTAN (PI)             606       587       292      4388     12410      2300
FPTAN (LG2)            516       513       883      5939     12502     26770
FPTAN (L2T)            576       586       954      5723     12483      2301
FPATAN (0.0)            41        55       123       616      1208     10578
FPATAN (1.0)           736       736       171      1426     13446     34208
FPATAN (PI)            206       207     11115      2835     13305     46903
FPATAN (LG2)           756       736     11077      2490     13319     41312
FPATAN (L2T)           206       204     11117      2922     13364     50149
F2XM1 (0.0)             16        56       102       563       723      1722
F2XM1 (LN2)            631       624       905      4178     11070     33823
F2XM1 (LG2)            611       585       890      4798     11116     32163
FYL2X (1.0)             56        57       136       961      1214      4327
FYL2X (PI)             946       961      1008      8987     12858     40148
FYL2X (LG2)           1081      1038      1035      8933     12748     46821
FYL2X (L2T)            926       886      1089      8982     12712     38986
FYL2XP1 (LG2)         1026      1037      1154     10485     11867     44708


Clock-cycle timings for floating-point operations on Weitek coprocessors
------------------------------------------------------------------------

The Weitek 3167 and 4167 coprocessors only implement the basic arithmetic
functions (add, subtract,
multiply, divide, square root) in hardware; transcendental functions are
implemented by means of a software library supplied by Weitek which uses
the basic hardware instructions to approximate the transcendental
functions (using polynomial and rational approximations). The clock cycle
timings for the transcendental functions are average values, since
execution time can differ with the value of the argument. The speed of
transcendental functions for the 4167 is estimated based on the numbers
in [31,33], from which this timing information has been extracted.

                 Single-precision     Double-precision
                   3167     4167        3167     4167

ABS                   3        2           3        2
NEG                   6        2           6        2
ADD                   6        2           6        2
SUB                   6        2           6        2
SUBR                  6        2           6        2
MUL                   6        2          10        3
DIVR                 38       17          66       31
SQRT                 60       17         118       31
SIN                 146      ~50         292     ~100
COS                 140      ~50         285     ~100
TAN                 188      ~60         340     ~110
EXP                 179      ~60         401     ~130
LOG                 171      ~60         365     ~120
F->ASCII           1000      N/A        1700      N/A   //
ASCII->F           1100      N/A        1800      N/A   //

// rough average of the timings given for different numeric formats by
   Weitek. Note that these conversion routines do much more work than
   the FBLD and FBSTP instructions provided by the 80x87 coprocessors.
   FBLD and FBSTP are useful for conversion routines, but quite a bit
   of additional code is needed for this purpose.


=============================================================================
Accuracy of calculations performed by a coprocessor / The IEEETEST program
=============================================================================

Among the 80x87 coprocessors, the IEEE-754 Standard for Binary
Floating-Point Arithmetic [10,11] was first fully implemented by Intel's
387 coprocessor [17]. Among other things, this means that the add,
subtract, multiply, divide, remainder, and square root operations always
deliver the 'exact' result. By 'exact', the standard means that the
coprocessor always delivers the machine number closest to the real
result, which may not always be representable exactly in the available
numeric format.
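The notion of an 'exact' (correctly rounded) result can be checked on any
machine with IEEE-754 double arithmetic. The Python sketch below is only an
illustration of the definition, not a replacement for IEEETEST: it computes
the true result in exact rational arithmetic, rounds it once to the nearest
double, and compares that with what the hardware delivered. The helper name
is_correctly_rounded is made up for this example, and the check assumes
that the interpreter evaluates float operations in plain double precision.

```python
import operator
from fractions import Fraction


def is_correctly_rounded(op, a, b):
    """Return True if the float result of op(a, b) is the double nearest
    to the infinitely precise result."""
    exact = op(Fraction(a), Fraction(b))  # exact rational arithmetic
    nearest = float(exact)                # one rounding to the nearest double
    return op(a, b) == nearest


BASIC_OPS = (operator.add, operator.sub, operator.mul, operator.truediv)

# 0.1 and 0.2 are not representable exactly, so every result is rounded.
checks = [is_correctly_rounded(op, 0.1, 0.2) for op in BASIC_OPS]
print(checks)

# 'Exact' does not mean 'equal to the decimal value one expects':
print(0.1 + 0.2 == 0.3)
```

The second print statement shows the distinction drawn above: the sum is the
closest machine number to the true sum of the two operands, yet it differs
from the machine number closest to 0.3.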
The 80387 implements the single, double, and double extended formats as
specified in the IEEE standard, as well as all functions required by
it [17].

Note that earlier Intel coprocessors (the 8087 and the 80287) comply with
a draft version of the standard that differs from the final version.
These chips were developed before IEEE-754 was finally accepted in 1985.
As with the 80387, the basic arithmetic in the 8087 and the 80287 is
'exact' in the sense that the computed result is always the machine
number closest to the real result. However, there are some differences
regarding certain operands like infinities, and some operations like the
remainder are defined differently than in the final version of the
standard. Some new instructions were introduced with the 80387, most
notably the FSIN and FCOS operations. The argument range for some
transcendental functions has also been extended [17].

Note that the IEEE-754 standard says nothing about the quality of the
implementation of transcendental functions like sin, cos, tan, arctan,
log. Intel uses a modified CORDIC [18,19] technique to compute the
transcendental functions; Intel claims that the maximum error in the
8087, 80287, and 80387 for all transcendental functions does not exceed
two bits in the mantissa of the double extended format, which features
64 mantissa bits for an overall accuracy of approximately 19 decimal
places [22,23]. This claim has been independently verified by a
competing vendor [13]. This means that at least 62 of the 64 mantissa
bits returned as a result by one of the transcendental function
instructions are guaranteed to be correct.

The Weitek Abacus 3167 and 4167 coprocessors are 'mostly compatible' with
IEEE-754 [31,32,33]. They support the single-precision and
double-precision numeric formats described in the standard, as well as
the four rounding modes required by it.
However, due to Weitek's desire for extremely high-speed operation, some
of the finer points of IEEE-754 have not been implemented. One of the
most notable omissions is the missing support for denormal numbers;
denormals are always flushed to zero on Weitek chips.

The 387 clone manufacturers all claim 100% compatibility with Intel's
80387, so one would reasonably expect the same accuracy from their chips
as from Intel's. For example, the packaging of the IIT 3C87 states that
"...the requirements of ANSI/IEEE standards are fulfilled and exceeded".
Cyrix states that their 83D87 complies fully with the IEEE-754
standard [12], and in fact ships diagnostic software with their
coprocessors that includes the program IEEETEST. This program is based on
the IEEE test vectors from the PhD thesis of Dr. Jerome T. Coonen [9]. A
test using the IEEE test vectors has also been included in the RUNDIAG
program on the Intel RapidCAD diagnostic disk. Rather than performing
random tests, the test vectors check specific cases that may be hard to
get right. Each test vector specifies the operation to be performed, the
operands, precision and rounding mode to be used, and the result
(including flags set) to be expected according to the IEEE-754 standard.

I ran IEEETEST on all the available coprocessors/FPUs. The Intel 486,
Intel RapidCAD, Intel 387, Intel 387DX, Cyrix 83D87, and the Cyrix 387+
passed with no errors. The ULSI 83C87 showed some minor flaws in the
FCOM, FDIV, FMUL, and FSCALE operations, getting flag errors in about 1%
of the tested cases, but no computational errors. However, for the IIT
3C87, the IEEETEST program showed flag *and* some computational errors
(that is, wrong results) for all tested operations except FXTRACT and
FCHS.
The Intel 8087 and 80287 show numerous errors, but this is not
surprising, since they do not comply with IEEE-754 but with an earlier
draft of that standard, so they do some things differently than required
by the final version of the standard. In particular, the Intel 8087/80287
do not feature the IEEE-754 compliant comparison (FUCOM) and remainder
(FPREM1) instructions available on the Intel 80387 and newer
coprocessors, so IEEETEST uses the non-compliant FCOM and FPREM
instructions on these processors. Lack of an IEEE-754 compliant
comparison instruction also causes a good deal of the errors in the
'Next After' test.

Since IEEETEST is written in Turbo Pascal, it was recompiled with the $E+
switch to enable use of the coprocessor emulator built into the TP 6.0
library. Using the emulator, IEEETEST aborted in the following tests with
a division by zero error: 'Comparison', 'Division', 'Next After'. These
tests were removed from the suite and the remaining tests were performed.

The public domain emulator EM87 could be tested, but hung in the last
test, which checks the implementation of the remainder operation. This
problem occurred because EM87 incorrectly identifies itself as a 387-type
coprocessor when run on an 80386. This causes the 387-specific FUCOM
instruction to be used in the 'Comparison' and 'Next After' tests and the
FPREM1 instruction to be used in the 'Remainder' test. Apparently EM87 is
not able to emulate these instructions and therefore crashes upon trying
to execute them. It is interesting to note how the error profile of EM87
matches exactly that of the Intel 80287, so it can be assumed that EM87
is a very good emulation of the 80287 when run on the 80286.

The Franke387 V2.4 emulator hangs in the following tests performed by
IEEETEST: 'Division', 'Multiplication', 'Scalb', 'Remainder'. The cause
of these failures is unknown.

This explanatory text is printed at the start of the IEEETEST program:

   JT Coonen's 1984 UC Berkeley Ph.D. thesis centers around his
   activities as a member of the floating-point working group that
   defined the IEEE 754-1985 Standard for Binary Floating-Point
   Arithmetic. Appendix C of his thesis presents FPTEST, a Pascal
   program written by J Thomas and JT Coonen. IEEETEST is a port of
   FPTEST and runs on PCs whose math coprocessor accepts
   80387-compatible floating-point instructions.

   IEEETEST reads test vectors from the file TESTVECS and compares the
   answer returned by the math coprocessor with the answer listed in
   the test vector. If these answers differ an 'F' is displayed,
   otherwise a '.' is displayed. Answers can differ due to two types of
   failures: numeric failures or flag failures. Numeric failures occur
   when the computed answer has the wrong value. Flag failures occur
   when the status (invalid operation, divide by zero, underflow,
   overflow, inexact) is incorrectly identified.

   TESTVECS is the concatenation of unmodified versions of all the test
   vectors distributed by UC Berkeley. The test data base is
   copyrighted by UC Berkeley (1985) and is being distributed with
   their permission. FPTEST and the test data base can be obtained by
   asking for 'IEEE-754 Test Vector' from UC Berkeley, Electrical
   Engineering and Computer Science, Industrial Liaison Program, 479
   Corey Hall, Berkeley, CA, 94720 (415)643-6687.

   The initial version of this test data base for the proposed IEEE 754
   binary floating-point standard (draft 8.0) was developed for Zilog,
   Inc. and was donated to the floating-point working group for
   dissemination. Errors in or additions to the distributed data base
   should be reported to the agency of distribution, with copies to
   Zilog, Inc., 1315 Dell Avenue, Campbell, CA, 95008.
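The checking loop described in the IEEETEST banner (run the operation,
compare the value, compare the status flags, print 'F' or '.') can be
sketched in a few lines of Python. Everything specific here is an
assumption made for illustration: the vector layout is hypothetical and is
NOT the TESTVECS format, and Python's decimal module stands in for the
coprocessor because it exposes IEEE-style status flags. The arithmetic is
therefore decimal with 7 digits, not binary.

```python
from decimal import (Context, Decimal, DivisionByZero, Inexact,
                     InvalidOperation, Overflow, Underflow, ROUND_HALF_EVEN)

# The five status flags named in the IEEETEST banner.
FLAGS = (InvalidOperation, DivisionByZero, Underflow, Overflow, Inexact)

OPS = {"+": Context.add, "*": Context.multiply, "/": Context.divide}

# Hypothetical vectors: (operation, a, b, expected result, expected flags).
VECTORS = [
    ("+", "1", "2", "3", set()),
    ("/", "1", "3", "0.3333333", {Inexact}),
    ("/", "1", "0", "Infinity", {DivisionByZero}),
    ("*", "2", "3", "6", {Inexact}),  # deliberately wrong flag expectation
]


def run_vectors(vectors):
    """Return the '.'/'F' progress string plus numeric and flag failure counts."""
    marks = []
    numeric_failures = 0
    flag_failures = 0
    for op, a, b, expected, expected_flags in vectors:
        # Fresh context per vector; traps=[] makes errors set flags
        # instead of raising exceptions.
        ctx = Context(prec=7, rounding=ROUND_HALF_EVEN, traps=[])
        result = OPS[op](ctx, Decimal(a), Decimal(b))
        raised = {flag for flag in FLAGS if ctx.flags[flag]}
        numeric_ok = result == Decimal(expected)
        flags_ok = raised == expected_flags
        numeric_failures += not numeric_ok
        flag_failures += not flags_ok
        marks.append("." if numeric_ok and flags_ok else "F")
    return "".join(marks), numeric_failures, flag_failures


print(run_vectors(VECTORS))
```

The last vector carries a wrong flag expectation on purpose, so the run ends
with one flag failure and no numeric failure, which mirrors the two failure
columns of the IEEETEST reports.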
IEEETEST output for Intel 80387, Intel 387DX (manufactured 91/49), Intel
486, C&T 38700 (manufactured 92/19), Cyrix 83D87, Cyrix 387+ (manufactured
92/11), and Intel RapidCAD (manufactured 92/05):
----------------------------------------------------------------------------

IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended
                   |     TESTS     | numeric  TYPE OF FAILURE  flag
Operation     Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value   A |    216      0 |    0    0    0 |    0    0    0
Addition         + |   3528      0 |    0    0    0 |    0    0    0
Comparison       C |   4320      0 |    0    0    0 |    0    0    0
Copy Sign        @ |   1488      0 |    0    0    0 |    0    0    0
Division         / |   4311      0 |    0    0    0 |    0    0    0
Fraction Part    F |    624      0 |    0    0    0 |    0    0    0
Logb             L |    960      0 |    0    0    0 |    0    0    0
Multiplication   * |   3978      0 |    0    0    0 |    0    0    0
Negation         - |    216      0 |    0    0    0 |    0    0    0
Next After       N |   2832      0 |    0    0    0 |    0    0    0
Round to Integer I |    558      0 |    0    0    0 |    0    0    0
Scalb            S |    948      0 |    0    0    0 |    0    0    0
Square Root      V |    744      0 |    0    0    0 |    0    0    0
Subtraction      - |   3528      0 |    0    0    0 |    0    0    0
Remainder        % |   2984      0 |    0    0    0 |    0    0    0
Totals             |  31235      0 |


IEEETEST output for ULSI 83C87 (manufactured 91/48):
----------------------------------------------------

IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended
                   |     TESTS     | numeric  TYPE OF FAILURE  flag
Operation     Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value   A |    216      0 |    0    0    0 |    0    0    0
Addition         + |   3528      0 |    0    0    0 |    0    0    0
Comparison       C |   4312      8 |    0    0    0 |    0    0    8
Copy Sign        @ |   1488      0 |    0    0    0 |    0    0    0
Division         / |   4250     61 |    0    0    0 |   28   28    5
Fraction Part    F |    624      0 |    0    0    0 |    0    0    0
Logb             L |    960      0 |    0    0    0 |    0    0    0
Multiplication   * |   3936     42 |    0    0    0 |   19   19    4
Negation         - |    216      0 |    0    0    0 |    0    0    0
Next After       N |   2828      4 |    0    0    0 |    0    0    4
Round to Integer I |    558      0 |    0    0    0 |    0    0    0
Scalb            S |    930     18 |    0    0    0 |    6    6    6
Square Root      V |    744      0 |    0    0    0 |    0    0    0
Subtraction      - |   3528      0 |    0    0    0 |    0    0    0
Remainder        % |   2984      0 |    0    0    0 |    0    0    0
Totals             |  31102    133 |


IEEETEST output for ULSI 83S87 (manufactured 92/17)
(data kindly supplied by Bengt Ask, f89ba@efd.lth.se):
------------------------------------------------------

IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended
                   |     TESTS     | numeric  TYPE OF FAILURE  flag
Operation     Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value   A |    216      0 |    0    0    0 |    0    0    0
Addition         + |   3528      0 |    0    0    0 |    0    0    0
Comparison       C |   4320      0 |    0    0    0 |    0    0    0
Copy Sign        @ |   1488      0 |    0    0    0 |    0    0    0
Division         / |   4296     15 |    0    0    0 |    5    5    5
Fraction Part    F |    624      0 |    0    0    0 |    0    0    0
Logb             L |    960      0 |    0    0    0 |    0    0    0
Multiplication   * |   3966     12 |    0    0    0 |    4    4    4
Negation         - |    216      0 |    0    0    0 |    0    0    0
Next After       N |   2828      4 |    0    0    0 |    0    0    4
Round to Integer I |    558      0 |    0    0    0 |    0    0    0
Scalb            S |    930     18 |    0    0    0 |    6    6    6
Square Root      V |    744      0 |    0    0    0 |    0    0    0
Subtraction      - |   3528      0 |    0    0    0 |    0    0    0
Remainder        % |   2984      0 |    0    0    0 |    0    0    0
Totals             |  31102     45 |


IEEETEST output for IIT 3C87 (manufactured 92/20):
--------------------------------------------------

IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended
                   |     TESTS     | numeric  TYPE OF FAILURE  flag
Operation     Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value   A |    200     16 |    0    0   16 |    0    0    0
Addition         + |   3336    192 |    0    0  128 |    0    0   96
Comparison       C |   4224     96 |    0    0   96 |    0    0    0
Copy Sign        @ |   1488      0 |    0    0    0 |    0    0    0
Division         / |   4159    152 |    0    0  124 |    0    0  116
Fraction Part    F |    600     24 |    0    0   24 |    0    0   24
Logb             L |    960      0 |    0    0    0 |    0    0    0
Multiplication   * |   3702    276 |    0    0  248 |    0    0  100
Negation         - |    200     16 |    0    0   16 |    0    0    0
Next After       N |   2248    584 |    0    0  584 |    0    0  168
Round to Integer I |    542     16 |    0    0    4 |    0    0   16
Scalb            S |    874     74 |    5    5   44 |    8    8   20
Square Root      V |    688     56 |    0    0   56 |    0    0   56
Subtraction      - |   3336    192 |    0    0  128 |    0    0   96
Remainder        % |   2844    140 |    0    0  140 |    0    0  116
Totals             |  29401   1834 |


IEEETEST output for Intel 80287 run with a 80386 CPU and Intel 8087:
--------------------------------------------------------------------

IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended
                   |     TESTS     | numeric  TYPE OF FAILURE  flag
Operation     Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value   A |    216      0 |    0    0    0 |    0    0    0
Addition         + |   2886    642 |   16   16  112 |  174  174  174
Comparison       C |   3612    708 |  136  136  136 |  228  228  228
Copy Sign        @ |   1488      0 |    0    0    0 |    0    0    0
Division         / |   3777    534 |   18   18   37 |  169  169  165
Fraction Part    F |    552     72 |   24   24   24 |   24   24   24
Logb             L |    900     60 |   12   12   12 |   20   20   20
Multiplication   * |   2944   1034 |  105  105  197 |  303  303  231
Negation         - |    216      0 |    0    0    0 |    0    0    0
Next After       N |    516   2316 |  168  168  332 |  764  764  764
Round to Integer I |    546     12 |    0    0    0 |    4    4    4
Scalb            S |    663    285 |   45   43   26 |  102   98   46
Square Root      V |    720     24 |    4    4    4 |    8    8    8
Subtraction      - |   2886    642 |   16   16  112 |  174  174  174
Remainder        % |   1490   1494 |  432  432  288 |  342  342  230
Totals             |  23412   7823 |


IEEETEST output for EM87 coprocessor emulator run on an Intel 386 CPU:
----------------------------------------------------------------------

IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended
                   |     TESTS     | numeric  TYPE OF FAILURE  flag
Operation     Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value   A |    216      0 |    0    0    0 |    0    0    0
Addition         + |   2886    642 |   16   16  112 |  174  174  174
Comparison       C |      0   4320 | 1324 1324 1324 | 1332 1332 1332
Copy Sign        @ |   1488      0 |    0    0    0 |    0    0    0
Division         / |   3777    534 |   18   18   37 |  169  169  165
Fraction Part    F |    552     72 |   24   24   24 |   24   24   24
Logb             L |    900     60 |   12   12   12 |   20   20   20
Multiplication   * |   2944   1034 |  105  105  197 |  303  303  231
Negation         - |    216      0 |    0    0    0 |    0    0    0
Next After       N |    348   2484 |  768  768  768 |  504  504  526
Round to Integer I |    546     12 |    0    0    0 |    4    4    4
Scalb            S |    663    285 |   45   43   26 |  102   98   46
Square Root      V |    720     24 |    4    4    4 |    8    8    8
Subtraction      - |   2886    642 |   16   16  112 |  174  174  174
Remainder        % | ######## not run since machine hangs #######


IEEETEST output for Franke387 coprocessor emulator run on an Intel 386:
-----------------------------------------------------------------------

IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended
                   |     TESTS     | numeric  TYPE OF FAILURE  flag
Operation     Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value   A |    152     64 |    0    0    8 |   24   24    8
Addition         + |   1587   1941 |  178  178  722 |  508  508  616
Comparison       C |   3696    624 |  208  208  208 |    4    4  108
Copy Sign        @ |   1200    288 |    0    0    0 |  144  144    0
Division         / | ######## not run since machine hangs #######
Fraction Part    F |    624      0 |    0    0    0 |    0    0    0
Logb             L |    908     52 |    0    0   16 |   16   16    4
Multiplication   * | ######## not run since machine hangs #######
Negation         - |    152     64 |    0    0    8 |   24   24    8
Next After       N |   1404   1420 |  404  404  596 |   80   80  172
Round to Integer I |    514     44 |    4    4   20 |    8    8   16
Scalb            S | ######## not run since machine hangs #######
Square Root      V |    569    175 |   14   31   54 |   28   48   72
Subtraction      - |   1827   1701 |   98   98  642 |  452  452  576
Remainder        % | ######## not run since machine hangs #######


IEEETEST output for Q387 coprocessor emulator run on an Intel 386:
------------------------------------------------------------------

IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended
                   |     TESTS     | numeric  TYPE OF FAILURE  flag
Operation     Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value   A |    104    112 |   42   38   16 |   24   24    0
Addition         + |    911   2617 |  746  637  637 |  672  672  380
Comparison       C |   3180   1140 |  380  380  380 |  108  108  108
Copy Sign        @ |    696    792 |  320  280    0 |  288  288    0
Division         / |    900   3411 |  673  574  814 |  977  977  821
Fraction Part    F |    348    276 |  154   82   40 |   24   24   24
Logb             L |    656    304 |  136  100   36 |   24   24   12
Multiplication   * |   1023   2955 |  759  663  857 |  670  670  442
Negation         - |     86    130 |   44   38   32 |   24   24    0
Next After       N |    464   2368 |  780  780  796 |  344  344  320
Round to Integer I |    273    285 |   95   74   52 |   72   72   68
Scalb            S |    254    694 |  217  192  137 |  176  168  136
Square Root      V |    128    616 |  192  180  147 |  196  196  188
Subtraction      - |    911   2617 |  746  637  637 |  672  672  372
Remainder        % |    558   2426 |  903  859  664 |  508  508  220
Totals             |  10492  20743 |


IEEETEST output for TP 6.0 coprocessor emulator:
------------------------------------------------

IEEE-754 Test Vector     Precisions: S=Single D=Double E=Double Extended
                   |     TESTS     | numeric  TYPE OF FAILURE  flag
Operation     Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value   A |    168     48 |   16   16   16 |   16    8    0
Addition         + |   1877   1651 |  294  290  336 |  496  456  416
Comparison       C | ## not run - program aborts with div-by-0 ##
Copy Sign        @ |   1392     96 |   48   48    0 |   48    0    0
Division         / | ## not run - program aborts with div-by-0 ##
Fraction Part    F |    588     36 |   12    0   24 |    0    0    0
Logb             L |    888     72 |   24   24   24 |   12   12   12
Multiplication   * |   2148   1830 |  332  310  528 |  520  360  352
Negation         - |    160     48 |   16   16   16 |   16    8    0
Next After       N | ## not run - program aborts with div-by-0 ##
Round to Integer I |    318    240 |    0    0    4 |   80   80   80
Scalb            S |    564    384 |  108  100   76 |  112   88   56
Square Root      V |    180    564 |  143  157  169 |   72   72  128
Subtraction      - |   1877   1651 |  294  290  336 |  496  456  416
Remainder        % |   1072   1912 |  652  672  524 |  336  288  216


Additional accuracy and compatibility tests
-------------------------------------------

To complement the checks done by IEEETEST, I also wrote the short
programs DENORMTS, RCTRL, PCTRL in Turbo Pascal 6.0 that test the
following coprocessor functions:

   1. support for denormals in all precisions (single, double, extended)
   2. support for the four IEEE rounding modes (up, down, nearest, chop)
   3. support for precision control

Note that passing all tests is required for IEEE conformance, as well as
100% compatibility with Intel's coprocessors.
Precision control forces the results of the FADD, FSUB, FMUL, FDIV, and
FSQRT instructions to be rounded to the specified precision (single,
double, double extended). This feature is provided to obtain
compatibility with certain programming languages [17]. By specifying
lower precision, one effectively nullifies the advantages of extended
precision intermediate results. The IEEE-754 standard for floating-point
arithmetic demands that processors and floating-point packages that
cannot store the result of operations *directly* to single and double
precision locations must provide precision control.

The programs that test precision control and rounding control are
designed to return a different result for each of the modes for the same
sequence of operations. The source code of the programs can be found in
appendix A.

The Intel 8087 and 80287 were not tested with DENORMTS since Turbo Pascal
does not support extended precision denormals on 8087/80287 processors,
so the denormal test fails anyway. (The 8087 and 287 pass the RCTRL and
PCTRL tests without error, however.)
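The effect of rounding control and precision control can be illustrated
without access to the coprocessor control word. The Python sketch below is
not a port of RCTRL or PCTRL; it is a stand-in under stated assumptions.
The decimal module (5 digits, an arbitrary choice) plays the role of the
four rounding modes, and a round trip through the 32-bit float format
approximates what single-precision control does to a double result. Real
precision control rounds only the mantissa and keeps the extended exponent
range, which this approximation ignores. All names are made up for the
example.

```python
import struct
from decimal import (Context, Decimal, ROUND_CEILING, ROUND_DOWN,
                     ROUND_FLOOR, ROUND_HALF_EVEN)

# The four IEEE rounding modes, using the names printed by the test programs.
MODES = {
    "NEAREST": ROUND_HALF_EVEN,  # round to nearest, ties to even
    "DOWN": ROUND_FLOOR,         # toward minus infinity
    "UP": ROUND_CEILING,         # toward plus infinity
    "CHOP": ROUND_DOWN,          # toward zero (truncate)
}


def divide_in_mode(a, b, mode, digits=5):
    """Divide a by b, keeping `digits` significant digits in the given mode."""
    ctx = Context(prec=digits, rounding=MODES[mode])
    return ctx.divide(Decimal(a), Decimal(b))


# One positive and one negative quotient are enough to tell all four
# modes apart, which is the property the test programs rely on.
results = {mode: (divide_in_mode(2, 3, mode), divide_in_mode(-2, 3, mode))
           for mode in MODES}
for mode, pair in results.items():
    print(f"{mode:8s} {pair[0]}  {pair[1]}")


def round_to_single(x):
    """Round a double to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


third = 1.0 / 3.0
print(third, round_to_single(third))
```

Rounding the quotient 1/3 to single precision changes it in the eighth
significant digit, which is the kind of difference the precision-control
results in this section make visible.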
Test Results for the Intel 387, Intel 387DX, Intel 486, Intel RapidCAD,
Cyrix 83D87, Cyrix 387+, C&T 38700, and the EM87 emulator (on an 80386
system):
-------------------------------------------------------------------------------

Precision Control   SINGLE    1.13311278820037842E+0000
                    DOUBLE    1.23456789006442125E+0000
                    EXTENDED  1.23456789012337585E+0000

Rounding Control    NEAREST  -1.23427629010100635E+0100
                    DOWN     -1.23427623555772409E+0100
                    UP       -1.23457760966801097E+0100
                    CHOP     -1.23397493540770643E+0100

Denormal support    SINGLE denormals supported
                    SINGLE denormal prints as:    4.60943116855005E-0041
                    Denormal should be printed as 4.60943...E-0041
                    DOUBLE denormals supported
                    DOUBLE denormal prints as:    8.75000000000016E-0311
                    Denormal should be printed as 8.75...E-0311
                    EXTENDED denormals supported
                    EXTENDED denormal prints as:  1.31640625000000E-4934
                    Denormal should be printed as 1.3164...E-4934

Results for the ULSI 83C87:
---------------------------

Precision Control   SINGLE    1.23456789012337585E+0000
                    DOUBLE    1.23456789012337585E+0000
                    EXTENDED  1.23456789012337585E+0000

Rounding Control    NEAREST  -1.23427629010100635E+0100
                    DOWN     -1.23427623555772409E+0100
                    UP       -1.23457760966801097E+0100
                    CHOP     -1.23397493540770643E+0100

Denormal support    SINGLE denormals supported
                    SINGLE denormal prints as:    4.60943116855005E-0041
                    Denormal should be printed as 4.60943...E-0041
                    DOUBLE denormals supported
                    DOUBLE denormal prints as:    8.75000000000016E-0311
                    Denormal should be printed as 8.75...E-0311
                    EXTENDED denormals supported
                    EXTENDED denormal prints as:  1.31640625000000E-4934
                    Denormal should be printed as 1.3164...E-4934

Results for the IIT 3C87:
-------------------------

Precision Control   SINGLE    1.13311278820037842E+0000
                    DOUBLE    1.23456789006442125E+0000
                    EXTENDED  1.23456789012337585E+0000

Rounding Control    NEAREST  -1.23427629010100635E+0100
                    DOWN     -1.23427623555772409E+0100
                    UP       -1.23457760966801097E+0100
                    CHOP     -1.23397493540770643E+0100

Denormal support    SINGLE denormals supported
                    SINGLE denormal prints as:    4.60943116855005E-0041
                    Denormal should be printed as 4.60943...E-0041
                    DOUBLE denormals supported
                    DOUBLE denormal prints as:    8.75000000000016E-0311
                    Denormal should be printed as 8.75...E-0311
                    EXTENDED denormals not supported

Results for the Turbo Pascal 6.0 coprocessor emulator:
------------------------------------------------------

Precision Control   SINGLE    1.23456789012351396E+0000
                    DOUBLE    1.23456789012351396E+0000
                    EXTENDED  1.23456789012351396E+0000

Rounding Control    NEAREST  -1.23457766383395931E+0100
                    DOWN     -1.23457766383395931E+0100
                    UP       -1.23457766383395931E+0100
                    CHOP     -1.23457766383395931E+0100

Denormal support    SINGLE denormals not supported
                    DOUBLE denormals not supported
                    EXTENDED denormals not supported

Results for the Q387 coprocessor emulator:
------------------------------------------

Precision Control   SINGLE    1.23456789012337614E+0000
                    DOUBLE    1.23456789012337614E+0000
                    EXTENDED  1.23456789012337614E+0000

Rounding Control    NEAREST  -1.23427621117212139E+0100
                    DOWN     -1.23427621117212139E+0100
                    UP       -1.23427621117212139E+0100
                    CHOP     -1.23427621117212139E+0100

Denormal support    SINGLE denormals not supported
                    DOUBLE denormals not supported
                    EXTENDED denormals not supported

The test results show that the IIT 3C87 does not conform to the IEEE-754 floating-point standard in that it does not support denormals in double extended precision. The ULSI 83C87 does not conform to that standard in that it does not support precision control, but uses double extended precision for all operations. The TP 6.0 emulator supports neither precision control nor rounding control nor denormals of any kind, and the same is true of the Q387 emulator. In addition, the basic arithmetic operations of both emulators do not seem to conform to the IEEE standard, as the results of the test programs differ from the results computed by a coprocessor in any mode.
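
The single and double precision parts of the denormal test can be repeated on any machine with IEEE arithmetic. The following Python sketch is an illustration only: it uses the same operands as DENORMTS, rounds the single precision product with the struct module, and inspects the exponent field (an all-zero exponent field with a non-zero fraction marks a denormal). The helper name single_fields is made up for this example. Python has no 80-bit type, so the extended precision part of the test cannot be reproduced this way.

```python
import struct

def single_fields(value):
    """Round value to IEEE single; return (value, exponent field, fraction)."""
    raw = struct.pack("<f", value)
    bits = struct.unpack("<I", raw)[0]
    rounded = struct.unpack("<f", raw)[0]
    return rounded, (bits >> 23) & 0xFF, bits & 0x7FFFFF

# Same operands as in DENORMTS: a number just above the smallest normal
# value, multiplied by 2**-8 so that the product must be a denormal.
s, s_exp, s_frac = single_fields(1.18e-38 * 3.90625e-3)

d = 2.24e-308 * 3.90625e-3
d_bits = struct.unpack("<Q", struct.pack("<d", d))[0]
d_exp = (d_bits >> 52) & 0x7FF

print("SINGLE:", s, "exponent field:", s_exp, "fraction:", s_frac)
print("DOUBLE:", d, "exponent field:", d_exp)
```

A package without denormal support would flush both products to zero, which is exactly what the test programs check for.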
================================================
Accuracy of transcendental function calculations
================================================

With regard to the accuracy of transcendental functions, Cyrix claims that the relative error of the transcendental functions on its 83D87 coprocessor never exceeds 0.5 ULP of the double extended format [13] (ULP = Unit in the Last Place, numeric weight of the least significant mantissa bit). This means that the maximum relative error is below 2**-64, while Intel's published error limit for the 80387 is 2**-62. While Intel uses a modified CORDIC algorithm [18,19] to compute the transcendental functions, Cyrix uses rational approximations that utilize their chip's very fast array multiplier. (For an explanation why this approach is superior to CORDIC with today's technology, see [61].) Also, Cyrix uses an internal 75 bit data path for the mantissa [15], so intermediate computations in the generation of transcendental function values will enjoy some additional accuracy over the 64 bits provided by the double extended format. Using 75 mantissa bits also provides an advantage over other coprocessors like the Intel 387DX and ULSI 83C87 which use only a 68 bit mantissa data path [58,59].

Note that a maximum relative error of 0.5 ULP for the Cyrix coprocessor does not mean that it returns the 'exact' result (machine number closest to infinitely precise result) all the time. Consider the case where the infinitely precise result of a transcendental function falls nearly halfway between two machine numbers. A relative error of 0.5 ULP can cause the result to be either of the numbers after rounding, depending on the direction of the error. But the 83D87 should deliver results that never differ from the 'exact' result by more than one ULP. Also note that the claim of relative error being below 0.5 ULPs is slightly exaggerated; 0.6 ULPs would be a more realistic error limit.
Imagine that the infinitely precise result for some argument to a transcendental function was xxx..xxx1001... (where the xxx...xxx represent the first 64 bits of the result), but that the coprocessor computes the result as xxx..xxx0111 and then rounds this down to xxx..xxx0000. Then the relative error is (1001b-0000b)/10000b = 9/16 = 0.5625 ULPs. I tested some of the transcendental functions of the Cyrix 387+ and found the relative error to be always below 0.6 ULPs.

Cyrix also claims that its transcendental functions satisfy the monotonicity criterion [13], a claim not made by any of the competitors. This does not mean that the transcendental functions on the other 387-compatibles are not monotonic as well. Monotonicity means that for all x1 > x2, it always follows that f(x1) >= f(x2) for an increasing function like sin on [0..pi/4]. Likewise, for a decreasing function like cos on [0..pi/4], for all x1 > x2, it follows that f(x1) <= f(x2).

As previously noted, the Weitek Abacus 3167 and 4167 coprocessors implement only the basic arithmetic operations (add, subtract, negate, multiply, divide, square root) in hardware. Transcendental functions are performed via a software library provided by Weitek. For these library functions Weitek claims a maximum relative error of 5 ULPs [31,33]. This means that the last three bits in the mantissa of a double-precision result can be wrong. Note that the Intel 387 and compatible math coprocessors generate the transcendental functions with a small relative error with regard to the *extended double precision* format. Thus, when rounded to double-precision, their function values are nearly always 'exact'. The problem of 'double rounding' prevents them from being 'exact' in 100% of all cases. 387 type coprocessors in general have superior accuracy when compared with Weitek's coprocessors.
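
The ULP arithmetic used in this section is easy to check by hand or by machine. The following Python sketch is only an illustration with made-up helper names (ulp_error, low_bits_affected). It reproduces the 0.5625 ULP example, shows why 0.5 ULP corresponds to a relative error of 2**-64 for a mantissa in [1,2), and applies the rule of thumb that an error of n ULPs can disturb as many low-order bits as n has binary digits (carries into higher bits are ignored here).

```python
from fractions import Fraction

def ulp_error(exact_tail, returned_tail, tail_bits):
    """Error in ULPs, given the bits that follow the last mantissa bit.

    Both tails are integers read as binary fractions of one ULP, e.g.
    0b1001 with tail_bits=4 stands for 9/16 ULP beyond the 64-bit prefix.
    """
    return Fraction(abs(exact_tail - returned_tail), 2 ** tail_bits)

def low_bits_affected(max_ulps):
    """Rough count of low-order mantissa bits an error of max_ulps can change."""
    return max_ulps.bit_length()

err = ulp_error(0b1001, 0b0000, 4)
print(err, float(err))                      # 9/16 0.5625

# One ULP of a 64-bit mantissa in [1,2) weighs 2**-63, so 0.5 ULP is 2**-64.
half_ulp = Fraction(1, 2) * Fraction(1, 2**63)
print(half_ulp == Fraction(1, 2**64))       # True

print(low_bits_affected(5))                 # 3
```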
The test diskette distributed with early versions of the Cyrix 83D87 contained a program (TRANCK) that checks the accuracy of the transcendental functions in the coprocessor against a more precise software arithmetic [16]. I used this program to compare the accuracy of the transcendental functions on those 287/387/486 coprocessors/FPUs available to me. As TRANCK will not accept negative numbers as interval limits, I tested each function on an interval along the positive x-axis. The functions tested were F2XM1 (2**x-1), FSIN (sine), FCOS (cosine), FPTAN (tangent), FPATAN (arctangent), FYL2X (y * log2 (x)), and FYL2XP1 (y * log2 (x+1)). These are all the transcendental functions implemented on the 80387. Note that the square root (FSQRT) is *not* a transcendental function. For each function, 100,000 arguments were evaluated, with the arguments uniformly distributed within the interval tested.

The EM87 emulator could not be checked with TRANCK, since the multiple precision package in TRANCK would always return with an error message immediately. However, the Franke387 emulator could be tested.

In the test results below, the following statistics are detailed:

%wrong       is the percentage of results that differ from the 'exact'
             result (infinitely precise result rounded to 64 bits)
ULP_hi       is the number of results where the returned result was
             greater than the 'exact' (correctly rounded) result by one
             ULP (the numeric weight of the last mantissa bit, 2**-63 to
             2**-64 depending on the size of the number).
ULPs_hi      is the number of results where the returned result was
             greater than the 'exact' result by two or more ULPs.
ULP_lo       is the number of results where the returned result was
             smaller than the 'exact' (correctly rounded) result by one
             ULP (the numeric weight of the last mantissa bit, 2**-63 to
             2**-64 depending on the size of the number).
ULPs_lo      is the number of results where the returned result was
             smaller than the 'exact' result by two or more ULPs.
max ULP err  is the maximum deviation of a returned result from the
             'exact' answer expressed in ULPs.

Test results for accuracy of transcendental functions for double extended precision as returned by the program TRANCK. 100,000 trials per function:

Franke387 V2.4 emulator
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
SIN     0,pi/4       39.042   25301      708   13029        4        2
COS     0,pi/4       75.714   49827    25887       0        0        3
TAN     0,pi/4       76.976   14230    10029   24323    28394        9
ATAN    0,1          55.826   26028     1529   24044     4225        4
2XM1    0,0.5        96.717       0        0   47910    48807        5
YL2XP1  0,sqrt(2)-1  93.007     578        9   27416    65004        8
YL2X    0.1,10       62.252   16817     4712   37082     3641     2953

Microsoft's coprocessor emulator (part of MS-C and MS-Fortran libraries)
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
SIN     0,pi/4          N/A     N/A      N/A     N/A      N/A      N/A
COS     0,pi/4          N/A     N/A      N/A     N/A      N/A      N/A
TAN     0,pi/4       40.828   27764     1520   11445       99        2
ATAN    0,1          32.307   18893      485   12530      299        2
2XM1    0,0.5        52.163    8585      189   37745     5644        3
YL2XP1  0,sqrt(2)-1  88.801    4714      916   14239    68932       11
YL2X    0.1,10       36.598   13813     3272   13866     5647       11

INTEL 8087, 80287
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
SIN     0,pi/4          N/A     N/A      N/A     N/A      N/A      N/A
COS     0,pi/4          N/A     N/A      N/A     N/A      N/A      N/A
TAN     0,pi/4       37.001   18756      524   17405      316        2
ATAN    0,1           9.666    6065        0    3601        0        1
2XM1    0,0.5        19.920       0        0   19920        0        1
YL2XP1  0,sqrt(2)-1   7.780     868        0    6912        0        1
YL2X    0.1,10        1.287     723        0     564        0        1

INTEL 80387
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
SIN     0,pi/4       28.872    2467        0   26392       13        2
COS     0,pi/4       27.213   27169       35       9        0        2
TAN     0,pi/4       10.532     441        0   10091        0        1
ATAN    0,1           7.088    2386        0    4691        1        2
2XM1    0,0.5        32.024       0        0   32024        0        1
YL2XP1  0,sqrt(2)-1  22.611    3461        0   19150        0        1
YL2X    0.1,10       13.020    6508        0    6512        0        1

INTEL 387DX
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
SIN     0,pi/4       28.873    2467        0   26393       13        2
COS     0,pi/4       27.121   27090       22       9        0        2
TAN     0,pi/4       10.711     457        0   10254        0        1
ATAN    0,1           7.088    2386        0    4691        1        2
2XM1    0,0.5        32.024       0        0   32024        0        1
YL2XP1  0,sqrt(2)-1  22.611    3461        0   19150        0        1
YL2X    0.1,10       13.020    6508        0    6512        0        1

ULSI 83C87
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
SIN     0,pi/4       35.530    4989        6   30238      297        2
COS     0,pi/4       43.989   11193      675   31393      728        2
TAN     0,pi/4       48.539   18880     1015   26349     2295        3
ATAN    0,1          20.858      62        0   20796        0        1
2XM1    0,0.5        21.257       4        0   21253        0        1
YL2XP1  0,sqrt(2)-1  27.893    9446        0   18213      234        2
YL2X    0.1,10       13.603    9816        0    3787        0        1

IIT 3C87
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
SIN     0,pi/4       18.650   11171        0    7479        0        1
COS     0,pi/4        7.700    3024        0    4676        0        1
TAN     0,pi/4       20.973    9681        0   11291        1        2
ATAN    0,1          19.280   13186        0    6094        0        1
2XM1    0,0.5        25.660   17570        0    8090        0        1
YL2XP1  0,sqrt(2)-1  45.830   23503     1896   19654      777        3
YL2X    0.1,10       10.888    5638      357    4845       48        3

C&T 38700DX
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
SIN     0,pi/4        1.821    1272        0     549        0        1
COS     0,pi/4       23.358   12458        0   10901        0        1
TAN     0,pi/4       17.178   10725        0    6453        0        1
ATAN    0,1           9.359    7082        0    2277        0        1
2XM1    0,0.5        15.188    3039        0   12149        0        1
YL2XP1  0,sqrt(2)-1  19.497   12109        0    7388        0        1
YL2X    0.1,10       46.868     261        0   46607        0        1

CYRIX 83D87
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
SIN     0,pi/4        1.554    1015        0     539        0        1
COS     0,pi/4        0.925     143        0     782        0        1
TAN     0,pi/4        4.147     881        0    3266        0        1
ATAN    0,1           0.656     229        0     427        0        1
2XM1    0,0.5         2.628    1433        0    1194        0        1
YL2XP1  0,sqrt(2)-1   3.242     825        0    2417        0        1
YL2X    0.1,10        0.931     256        0     675        0        1

CYRIX 387+
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
SIN     0,pi/4        1.486     864        0     622        0        1
COS     0,pi/4        2.072      12        0    2060        0        1
TAN     0,pi/4        0.602      63        0     539        0        1
ATAN    0,1           0.384      12        0     372        0        1
2XM1    0,0.5         1.985      27        0    1958        0        1
YL2XP1  0,sqrt(2)-1   3.662    1705        0    1957        0        1
YL2X    0.1,10        0.764     367        0     397        0        1

INTEL RapidCAD, Intel 486
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
SIN     0,pi/4       16.991    1517        0   15474        0        1
COS     0,pi/4        9.003    7603        0    1400        0        1
TAN     0,pi/4       10.532     441        0   10091        0        1
ATAN    0,1           7.078    2386        0    4691        1        2
2XM1    0,0.5        32.025       0        0   32025        0        1
YL2XP1  0,sqrt(2)-1  21.800     533        0   21267        0        1
YL2X    0.1,10        3.894    1879        0    2015        0        1

Discussion of the transcendental function tests
-----------------------------------------------

The test results above indicate that none of the 80x87 compatibles exceeds Intel's stated error bound of 3 ULPs for the transcendental functions. However, some coprocessors are more accurate than others. Rating the coprocessors according to the accuracy of their transcendental functions gives the following list (highest accuracy first): Cyrix 387+, Cyrix 83D87, Intel 486, Intel RapidCAD, Intel 80287(!), C&T 38700DX, Intel 387DX, Intel 80387, IIT 3C87, ULSI 83C87. The tests also show that the problems with excessive inaccuracy of the transcendental functions in early versions of the IIT coprocessors with errors of up to 8 ULPs [8] have been corrected. (According to [56], certain problems with the FPATAN instruction on the IIT 3C87 occurring under the UNIX version of AutoCAD were corrected in June, 1990.)

Considering the coprocessor emulators, the Franke387 has acceptable accuracy for the FSIN, FCOS, and FPATAN instructions, taking into consideration that according to its documentation, Franke387 uses only 64 bits of precision for the intermediate results, while coprocessors typically use 68 bits and more. However, the larger errors in the FPTAN, F2XM1, FYL2XP1, and especially the FYL2X operations show that the emulator doesn't use state-of-the-art algorithms, which ensure an error of only a few ULPs even if no extra precise intermediate results are available. Microsoft's emulator, meanwhile, provides transcendental functions with rather good accuracy, except for the logarithmic operations, which contain some minor flaws.
The Q387 emulator, which came out only recently and is the fastest emulator available, could unfortunately not be tested since it caused TRANCK to abort with a GP (general protection) fault for every input that I tried.

======================================================
Intel 387DX compatibility testing / The SMDIAG program
======================================================

Chips and Technologies has included the program SMDIAG on the V1.0 diagnostic disk distributed with its SuperMATH 38700DX coprocessor. Its stated purpose is to test the compatibility of the computational results and flag settings returned by the C&T coprocessor with the Intel 387DX. However, the tests for the transcendental functions seem to have been tweaked to let the C&T 38700DX pass, while coprocessors like the Intel RapidCAD and the Cyrix 83D87 fail. Also, SMDIAG shows failure in the FSCALE test for the Intel RapidCAD, Cyrix 83D87, Cyrix 387+, and ULSI 83C87, even though they return the correct result according to Intel's documentation for the Intel 387DX (Intel's second generation 387), which is indeed returned by the 387DX. (SMDIAG apparently expects the result returned by the original Intel 80387.)

Note that chip manufacturers often make quiet bug fixes, so it wouldn't be surprising if somebody else, using different runs of the same manufacturer's chip, came up with different results than the ones below. The Intel 387 alone seems to have been produced in four different versions that can be told apart by software, and Cyrix, ULSI, and IIT have manufactured at least two versions each of their coprocessors. (The coprocessors I tested have the following manufacturing dates stamped on them. Intel 387DX: 91/49, C&T 38700DX: 92/19, Cyrix 387+: 92/11, Intel RapidCAD: 92/05, ULSI 83C87: 91/48, IIT 3C87: 92/20.)
Results of running the SMDIAG program on 387-compatible coprocessors
(p = passed, f = failed)

                     Intel  Intel Intel Cyrix Cyrix IIT   ULSI  C&T
     Test          RapidCAD 387DX 80387 387+  83D87 3C87  83C87 38700

     1  (fstore)        f     p     p     p     f     f     f     p ##,

     2  (fiall)         p     p     p     p     p     p     f     p
     3  (faddsub)       p     p     p     p     p     p     p     p
     4  (faddsub_nr)    p     p     p     p     f     f     f     p %%
     5  (faddsub_cp)    p     p     p     p     f     f     f     p %%
     6  (faddsub_dn)    p     p     p     p     f     f     f     p %%
     7  (faddsub_up)    p     p     p     p     f     f     f     p %%,&&
     8  (fmul)          p     p     p     p     p     f     f     p
     9  (fdivn)         p     p     p     p     p     p     p     p
     10 (fdiv)          p     p     p     p     p     p     f     p
     11 (fxch)          p     p     p     p     p     p     p     p
     12 (fyl2x)         p     p     p     f     f     f     f     p ++
     13 (fyl2xp1)       f     p     p     f     f     f     f     p ++
     14 (fsqrt)         p     p     p     p     p     p     p     p
     15 (fsincos)       f     p     p     f     f     f     f     p ++
     16 (fptan)         p     p     p     f     p     f     f     p ++
     17 (fpatan)        p     p     p     f     f     f     f     p ++
     18 (f2xm1)         p     p     p     f     f     f     f     p ++
     19 (fscale)        f     f     p     f     f     f     f     p **
     20 (fcom1)         p     p     p     p     p     f     f     p
     21 (fprem)         p     p     p     p     p     p     p     p
     22 (misc1)         p     p     p     p     p     f     f     p
     23 (misc3)         p     p     p     p     p     p     p     p
     24 (misc4)         p     p     p     p     f     f     p     p %%

     failed modules:    4     1     0     7    12    16    17     0

     ## the failure of the Intel RapidCAD is caused by the fact that
        it stores the value of BCD INDEFINITE differently from the
        Intel 387DX. It uses FFFFC000000000000000, while the 387DX uses
        FFFF8000000000000000. However, both encodings are valid according
        to Intel's documentation, which defines the BCD INDEFINITE as
        FFFFUUUUUUUUUUUUUUUU, where U is undefined. So failure of the
        RapidCAD to deliver the same answer as the 387DX is not an
        "error", just a very slight incompatibility.
     ** the FSCALE errors reported for the Intel 387DX, Intel RapidCAD,
        Cyrix 83D87, Cyrix 387+, and ULSI 83C87 are due to a single
        'wrong' result each returned by one of the FSCALE computations.
        SMDIAG expects the result returned by the first generation
        Intel 80387 (and, of course, the C&T 38700DX). However, this
        result is wrong according to Intel's documentation and the
        behavior was corrected in the second generation Intel 387DX.
        Therefore, the Intel RapidCAD, Cyrix 83D87, Cyrix 387+, and ULSI
        83C87 return the correct result compatible with the Intel 387DX.
     %% Failures reported for the Cyrix 83D87 are due to the fact that it
        converts pseudodenormals contained in its registers to normalized
        numbers upon storing them to memory with the FSTP TBYTE PTR
        instruction. Intel's processors store pseudodenormals without
        'normalizing' them. This is an incompatibility, but not an error,
        because both encodings will evaluate to the same value should
        they be reused in a calculation.
     && Two of the failures reported for the Cyrix 83D87 are actual
        errors where the Cyrix 83D87 fails to deliver the correct result.
        1) control word = 0A7F (closure=proj., round=up, precision=53bit)
           ST(0) = 0001 ABCEF9876542101
           ST(1) = 0001 800000000345FFF
           instruction: FSUBRP ST(1), ST
           result should be: 0000 2BCEF987650EC800, status word = 3A30
           83D87 returns:    0000 3BCEF987650EC000, status word = 3830
        2) control word = 0A7F (closure=proj., round=up, precision=53bit)
           ST(0) = 0001 ABCEF9876542101
           ST(1) = 0001 800000000000000
           instruction: FSUB ST, ST(1)
           result should be: 0000 2BCEF98765432800, status word = 3A30
           83D87 returns:    0000 3BCEF98765432000, status word = 3830
     ++ The failures for the test of transcendental functions are caused
        by the tested coprocessor returning results that differ from the
        ones returned by the Intel 387DX. On the Cyrix 83D87, Cyrix 387+,
        and Intel RapidCAD, this is simply due to the improved accuracy
        these coprocessors provide over the Intel 387DX. The failures of
        the IIT 3C87 and ULSI 83C87 are mainly due to the lesser accuracy
        in the transcendental functions of these coprocessors, but for
        the IIT 3C87 an additional source of failures is its inability to
        handle extended-precision denormals.
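
The statements in notes ## and %% can be checked with a little arithmetic on the bit patterns. The following Python sketch is an illustration only, with made-up function names. It assumes the usual layout of the 80-bit format (15-bit exponent field with a bias of 16383, 64-bit mantissa with an explicit integer bit) and that an exponent field of zero is evaluated like an exponent field of one. It handles non-negative finite numbers only and takes BCD images as hex strings, most significant byte first, as printed in note ##.

```python
from fractions import Fraction

BIAS = 16383

def extended_value(exp_field, mantissa):
    """Exact value of a non-negative, finite 80-bit number.

    exp_field is the 15-bit biased exponent, mantissa the 64-bit
    significand including the explicit integer bit (bit 63).
    """
    effective = max(exp_field, 1)       # exponent field 0 counts as 1
    return Fraction(mantissa) * Fraction(2) ** (effective - BIAS - 63)

def is_pseudodenormal(exp_field, mantissa):
    """Exponent field zero, yet the integer bit is set."""
    return exp_field == 0 and (mantissa >> 63) == 1

def is_bcd_indefinite(image_hex):
    """True if a 10-byte packed BCD image starts with FFFF (rest undefined)."""
    return len(image_hex) == 20 and image_hex[:4].upper() == "FFFF"

m = 0x8000000000000001
print(is_pseudodenormal(0, m))                              # True
print(extended_value(0, m) == extended_value(1, m))         # True
print(is_bcd_indefinite("FFFFC000000000000000"),
      is_bcd_indefinite("FFFF8000000000000000"))            # True True
```

The second line of output is the reason why storing a pseudodenormal in normalized form changes the encoding but not the value.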

Another compatibility issue that has been discussed on Usenet is the behavior of the math coprocessors under protected-mode operating systems. I have seen postings claiming that coprocessors from ULSI, IIT, and Cyrix locked up the machine when a protected mode operating system (several UNIX derivatives were also mentioned) was run on them. However, there have also been reports that several 486-based systems have this problem, while others do not. Therefore, I think most of these problems are caused by poor motherboard design, especially wrong handling of error interrupts coming from the coprocessor. There could also be bugs in the exception handlers of the operating system.

==========
References
==========

[1] Schnurer, G.: Zahlenknacker im Vormarsch. c't 1992, Heft 4, Seiten 170-
[2] Curnow, H.J.; Wichmann, B.A.: A synthetic benchmark. Computer Journal,
     Vol. 19, No. 1, 1976, pp. 43-49
[3] Wichmann, B.A.: Validation code for the Whetstone benchmark. NPL Report
     DITC 107/88, National Physical Laboratory, UK, March 1988
[4] Curnow, H.J.: Whither Whetstone? The Synthetic Benchmark after 15 Years.
     In: Aad van der Steen (ed.): Evaluating Supercomputers. London: Chapman
     and Hall 1990
[5] Dongarra, J.J.: The Linpack Benchmark: An Explanation. In: Aad van der
     Steen (ed.): Evaluating Supercomputers. London: Chapman and Hall 1990
[6] Dongarra, J.J.: Performance of Various Computers Using Standard Linear
     Equations Software. Report CS-89-85, Computer Science Department,
     University of Tennessee, March 11, 1992
[7] Huth, N.: Dichtung und Wahrheit oder Datenblatt und Test. Design &
     Elektronik 1990, Heft 13, Seiten 105-110
[8] Ungerer, B.: Sockelfolger. c't 1990, Heft 4, Seiten 162-163
[9] Coonen, J.T.: Contributions to a Proposed Standard for Binary
     Floating-Point Arithmetic. Ph.D. thesis, University of California,
     Berkeley, 1984
[10] IEEE: IEEE Standard for Binary Floating-Point Arithmetic. SIGPLAN
     Notices, Vol. 22, No. 2, 1985, pp. 9-25
[11] IEEE Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std
     754-1985. New York, NY: Institute of Electrical and Electronics
     Engineers 1985
[12] FasMath 83D87 Compatibility Report. Cyrix Corporation, Nov. 1989.
     Order No. B2004
[13] FasMath 83D87 Accuracy Report. Cyrix Corporation, July 1990. Order No.
     B2002
[14] FasMath 83D87 Benchmark Report. Cyrix Corporation, June 1990. Order No.
     B2004
[15] FasMath 83D87 User's Manual. Cyrix Corporation, June 1990. Order No.
     L2001-003
[16] Brent, R.P.: A FORTRAN multiple-precision arithmetic package. ACM
     Transactions on Mathematical Software, Vol. 4, No. 1, March 1978, pp.
     57-70
[17] 387DX User's Manual, Programmer's Reference. Intel Corporation, 1989.
     Order No. 231917-002
[18] Volder, J.E.: The CORDIC Trigonometric Computing Technique. IRE
     Transactions on Electronic Computers, Vol. EC-8, No. 5, September 1959,
     pp. 330-334
[19] Walther, J.S.: A unified algorithm for elementary functions. AFIPS
     Conference Proceedings, Vol. 38, SJCC 1971, pp. 379-385
[20] Esser, R.; Kremer, F.; Schmidt, W.G.: Testrechnungen auf der IBM 3090E
     mit Vektoreinrichtung. Arbeitsbericht RRZK-8803, Regionales
     Rechenzentrum an der Universität zu Köln, Februar 1988
[21] McMahon, H.H.: The Livermore Fortran Kernels: A test of the numerical
     performance range. Technical Report UCRL-53745, Lawrence Livermore
     National Laboratory, USA, December 1986
[22] Nave, R.: Implementation of Transcendental Functions on a Numerics
     Processor. Microprocessing and Microprogramming, Vol. 11, No. 3-4,
     March-April 1983, pp. 221-225
[23] Yuen, A.K.: Intel's Floating-Point Processors. Electro/88 Conference
     Record, Boston, MA, USA, 10-12 May 1988, pp. 48/5-1 - 48/5-7
[24] Stiller, A.; Ungerer, B.: Ausgerechnet. c't 1990, Heft 1, Seiten 90-92
[25] Rosch, W.L.: Handfeste Hilfe oder Seifenblase? PC Professionell, Juni
     1991, Seiten 214-237
[26] Intel 80286 Hardware Reference Manual. Intel Corporation, 1987. Order
     No. 210760-002
[27] AMD 80C287 80-bit CMOS Numeric Processor. Advanced Micro Devices, June
     1989. Order No. 11671B/0
[28] Intel RapidCAD™ Engineering CoProcessor Performance Brief. Intel
     Corporation, 1992
[29] i486™ Microprocessor Performance Report. Intel Corporation, April
     1990. Order No. 240734-001
[30] Intel486™ DX2 Microprocessor Performance Brief. Intel Corporation,
     March 1992. Order No. 241254-001
[31] Abacus 3167 Floating-Point Coprocessor Data Book. Weitek Corporation,
     July 1990. DOC No. 9030
[32] WTL 4167 Floating-Point Coprocessor Data Book. Weitek Corporation, July
     1989. DOC No. 8943
[33] Abacus Software Designer's Guide. Weitek Corporation, September 1989.
     DOC No. 8967
[34] Stiller, A.: Cache & Carry. c't 1992, Heft 6, Seiten 118-130
[35] Stiller, A.: Cache & Carry, Teil 2. c't 1992, Heft 7, Seiten 28-34
[36] Palmer, J.F.; Morse, S.P.: Die mathematischen Grundlagen der
     Numerik-Prozessoren 8087/80287. München: tewi 1985
[37] 80C187 80-bit Math Coprocessor Data Sheet. Intel Corporation, September
     1989. Order No. 270640-003
[38] IIT-2C87 80-bit Numeric Co-Processor Data Sheet. IIT, May 1990
[39] Engineering note 4x4 matrix multiply transformation. IIT, 1989
[40] Tscheuschner, E.: 4 mal 4 auf einen Streich. c't 1990, Heft 3, Seiten
     266-276
[41] Goldberg, D.: Computer Arithmetic. In: Hennessy, J.L.; Patterson, D.A.:
     Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan
     Kaufmann 1990
[42] 8087 Math Coprocessor Data Sheet. Intel Corporation, October 1989.
     Order No. 205835-007
[43] 8086/8088 User's Manual, Programmer's and Hardware Reference. Intel
     Corporation, 1989. Order No. 240487-001
[44] 80286 and 80287 Programmer's Reference Manual. Intel Corporation, 1987.
     Order No. 210498-005
[45] 80287XL/XLT CHMOS III Math Coprocessor Data Sheet. Intel Corporation,
     May 1990. Order No. 290376-001
[46] Cyrix FasMath™ 82S87 Coprocessor Data Sheet. Cyrix Corporation, 1991.
     Document 94018-00 Rev. 1.0
[47] IIT-3C87 80-bit Numeric Co-Processor Data Sheet. IIT, May 1990
[48] 486™SX™ Microprocessor/487™SX™ Math CoProcessor Data Sheet.
     Intel Corporation, April 1991. Order No. 240950-001
[49] Schnurer, G.: Die große Verlade. c't 1991, Heft 7, Seiten 55-57
[50] Schnurer, G.: Eine 4 für alle. c't 1991, Heft 6, Seite 25
[51] Intel486™DX Microprocessor Data Book. Intel Corporation, June 1991.
     Order No. 240440-004
[52] i486™ Microprocessor Hardware Reference Manual. Intel Corporation,
     1990. Order No. 240552-001
[53] i486™ Microprocessor Programmer's Reference Manual. Intel
     Corporation, 1990. Order No. 240486-001
[54] Ungerer, B.: Kalte Hüte. c't 1992, Heft 8, Seiten 140-144
[55] Ungerer, B.: Heiße Sache. c't 1991, Heft 4, Seiten 104-108
[56] Rosch, W.L.: Handfeste Hilfe oder Seifenblase? PC Professionell, Juni
     1991, Seiten 214-237
[57] Niederkrüger, W.: Lebendige Vergangenheit. c't 1990, Heft 12, Seiten
     114-116
[58] ULSI Math*Co Advanced Math Coprocessor Technical Specification. ULSI
     System, 5/92, Rev. E
[59] 387™DX Math CoProcessor Data Sheet. Intel Corporation, September
     1990. Order No. 240448-003
[60] 387™ Numerics Coprocessor Extension Data Sheet. Intel Corporation,
     February 1989. Order No. 231920-005
[61] Koren, I.; Zinaty, O.: Evaluating Elementary Functions in a Numerical
     Coprocessor Based on Rational Approximations. IEEE Transactions on
     Computers, Vol. C-39, No. 8, August 1990, pp. 1030-1037
[62] 387™ SX Math CoProcessor Data Sheet. Intel Corporation, November 1989.
     Order No. 240225-005
[63] Frenkel, G.: Coprocessors Speed Numeric Operations. PC-Week, August 27,
[64] Schnurer, G.; Stiller, A.: Auto-Matt. c't 1991, Heft 10, Seiten 94-96
[65] Grehan, R.: FPU Face-Off. Byte, November 1990, pp. 194-200
[66] Tang, P.T.P.: Testing Computer Arithmetic by Elementary Number Theory.
     Preprint MCS-P84-0889, Mathematics and Computer Science Division,
     Argonne National Laboratory, August 1989
[67] Ferguson, W.E.: Selecting math coprocessors. IEEE Spectrum, July 1991,
     pp. 38-41
[68] Schnabel, J.: Viermal 387. Computer Persönlich 1991, Heft 22, Seiten
     153-156
[69] Hofmann, J.: Starke Rechenknechte. mc 1990, Heft 7, Seiten 64-67
[70] Woerrlein, H.; Hinnenberg, R.: Die Lust an der Power. Computer Live
     1991, Heft 10, Seiten 138-149
[71] email from Peter Forsberg (peterf@vnet.ibm.com), email from Alan Brown
     (abrown@Reston.ICL.COM)
[72] email from Eric Johnson (johnsone%camax01@uunet.UU.NET), email from
     Jerry Whelan (guru@stasi.bradley.edu), email from Arto Viitanen
     (av@cs.uta.fi), email from Richard Krehbiel (richk@grebyn.com)
[73] email from Fred Dunlap (cyrix!fred@texsun.Central.Sun.COM)
[74] correspondence with Bengt Ask (f89ba@efd.lth.se)
[75] email from Thomas Hoberg (tmh@prosun.first.gmd.de)
[76] Microsoft Macro Assembler Programmer's Guide Version 6.0, Microsoft
     Corporation, 1991. Document No. LN06556-0291
[77] FasMath EMC87 User's Manual, Rev. 2. Cyrix Corporation, February 1991.
     Order No. 90018-00
[78] Persson, C.: Die 32-Bit-Parade. c't 1992, Heft 9, Seiten 150-156
[79] email from Duncan Murdoch (dmurdoch@mast.QueensU.CA)

========================
Manufacturer's addresses
========================

Intel Corporation
3065 Bowers Avenue
Santa Clara, CA 95051
USA

IIT Integrated Information Technology, Inc.
2540 Mission College Blvd.
Santa Clara, CA 95054
USA

ULSI Systems, Inc.
58 Daggett Drive
San Jose, CA 95134
USA

Chips & Technologies, Inc.
3050 Zanker Road
San Jose, CA 95134
USA

Weitek Corporation
1060 East Arques Avenue
Sunnyvale, CA 94086
USA

AMD Advanced Micro Devices, Inc.
901 Thompson Place
P.O.B. 3453
Sunnyvale, CA 94088-3453
USA

Cyrix Corporation
P.O.B. 850118
Richardson, TX 75085
USA

===============================
Appendix A: Test program source
===============================

{$N+,E+}
PROGRAM PCtrl;

VAR B,c: EXTENDED;
    Precision, L: WORD;

PROCEDURE SetPrecisionControl (Precision: WORD);
(* This procedure sets the internal precision of the NDP. Available *)
(* precision values:  0  -  24 bits (SINGLE)                        *)
(*                    1  -  n.a. (mapped to single)                 *)
(*                    2  -  53 bits (DOUBLE)                        *)
(*                    3  -  64 bits (EXTENDED)                      *)

VAR CtrlWord: WORD;

BEGIN {SetPrecisionControl}
   IF Precision = 1 THEN         { PC value 1 is not available, use single }
      Precision := 0;
   Precision := Precision SHL 8; { make mask for PC field in ctrl word}
   ASM
      FSTCW    [CtrlWord]        { store NDP control word }
      MOV      AX, [CtrlWord]    { load control word into CPU }
      AND      AX, 0FCFFh        { mask out precision control field }
      OR       AX, [Precision]   { set desired precision in PC field }
      MOV      [CtrlWord], AX    { store new control word }
      FLDCW    [CtrlWord]        { set new precision control in NDP }
   END;
END; {SetPrecisionControl}

BEGIN {main}
   FOR Precision := 1 TO 3 DO BEGIN
      B := 1.2345678901234567890;
      SetPrecisionControl (Precision);
      FOR L := 1 TO 20 DO BEGIN
         B := Sqrt (B);
      END;
      FOR L := 1 TO 20 DO BEGIN
         B := B*B;
      END;
      SetPrecisionControl (3);   { full precision for printout }
      WriteLn (Precision, B:28);
   END;
END.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$N+,E+}
PROGRAM RCtrl;

VAR B,c: EXTENDED;
    RoundingMode, L: WORD;

PROCEDURE SetRoundingMode (RCMode: WORD);
(* This procedure selects one of four available rounding modes *)
(* 0  -  Round to nearest (default)                            *)
(* 1  -  Round down (towards negative infinity)                *)
(* 2  -  Round up (towards positive infinity)                  *)
(* 3  -  Chop (truncate, round towards zero)                   *)

VAR CtrlWord: WORD;

BEGIN
   RCMode := RCMode SHL 10;  { make mask for RC field in control word}
   ASM
      FSTCW    [CtrlWord]        { store NDP control word }
      MOV      AX, [CtrlWord]    { load control word into CPU }
      AND      AX, 0F3FFh        { mask out rounding control field }
      OR       AX, [RCMode]      { set desired rounding mode in RC field }
      MOV      [CtrlWord], AX    { store new control word }
      FLDCW    [CtrlWord]        { set new rounding control in NDP }
   END;
END;

BEGIN
   FOR RoundingMode := 0 TO 3 DO BEGIN
      B := 1.2345678901234567890e100;
      SetRoundingMode (RoundingMode);
      FOR L := 1 TO 51 DO BEGIN
         B := Sqrt (B);
      END;
      FOR L := 1 TO 51 DO BEGIN
         B := -B*B;
      END;
      SetRoundingMode (0);        { round to nearest for printout }
      WriteLn (RoundingMode, B:28);
   END;
END.
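
SetPrecisionControl and SetRoundingMode follow the same pattern: a two-bit field of the control word is cleared with AND, and the shifted new value is then inserted with OR. The sketch below repeats only that bit arithmetic in plain Pascal, without touching the coprocessor, so the masks can be checked by hand. The names WithPrecision and WithRounding and the sample control word $037F are assumptions made for illustration; they are not part of the programs above.

```pascal
PROGRAM CWFields;

{ Illustrative helpers (hypothetical names): they only compute the   }
{ new control word value, they do not load it into the NDP.          }

FUNCTION WithPrecision (CtrlWord, Precision: WORD): WORD;
BEGIN
   IF Precision = 1 THEN
      Precision := 0;                     { 1 is n.a., map to single }
   { PC field = bits 8..9: clear with $FCFF, then insert new value }
   WithPrecision := (CtrlWord AND $FCFF) OR (Precision SHL 8);
END;

FUNCTION WithRounding (CtrlWord, RCMode: WORD): WORD;
BEGIN
   { RC field = bits 10..11: clear with $F3FF, then insert new value }
   WithRounding := (CtrlWord AND $F3FF) OR (RCMode SHL 10);
END;

BEGIN
   { $037F is an assumed sample control word: PC = 3, RC = 0 }
   WriteLn (WithPrecision ($037F, 2));    { 639  = $027F, PC = 2 (double) }
   WriteLn (WithRounding  ($037F, 3));    { 3967 = $0F7F, RC = 3 (chop)   }
END.
```

Keeping the mask arithmetic separate from the FSTCW/FLDCW pair makes it possible to test the field handling on a machine without a coprocessor.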

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$N+,E+}

PROGRAM DenormTs;

VAR E: EXTENDED;
    D: DOUBLE;
    S: SINGLE;

BEGIN
   WriteLn ('Testing support and printing of denormals');
   WriteLn;
   Write ('Coprocessor is: ');
   CASE Test8087 OF
      0: WriteLn ('Emulator');
      1: WriteLn ('8087 or compatible');
      2: WriteLn ('80287 or compatible');
      3: WriteLn ('80387 or compatible');
   END;
   WriteLn;
   S := 1.18e-38;
   S := S * 3.90625e-3;
   IF S = 0 THEN
      WriteLn ('SINGLE denormals not supported')
   ELSE BEGIN
      WriteLn ('SINGLE denormals supported');
      WriteLn ('SINGLE denormal prints as:   ', S);
      WriteLn ('Denormal should be printed as 4.60943...E-0041');
   END;
   WriteLn;
   D := 2.24e-308;
   D := D * 3.90625e-3;
   IF D = 0 THEN
      WriteLn ('DOUBLE denormals not supported')
   ELSE BEGIN
      WriteLn ('DOUBLE denormals supported');
      WriteLn ('DOUBLE denormal prints as:   ', D);
      WriteLn ('Denormal should be printed as 8.75...E-0311');
   END;
   WriteLn;
   E := 3.37e-4932;
   E := E * 3.90625e-3;
   IF E = 0 THEN
      WriteLn ('EXTENDED denormals not supported')
   ELSE BEGIN
      WriteLn ('EXTENDED denormals supported');
      WriteLn ('EXTENDED denormal prints as: ', E);
      WriteLn ('Denormal should be printed as 1.3164...E-4934');
   END;
END.
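
The starting values in DenormTs appear to be chosen just above the smallest normalized number of each format (2**-126 for SINGLE, 2**-1022 for DOUBLE, 2**-16382 for EXTENDED), and the factor 3.90625e-3 is exactly 2**-8, so each product has to land in the denormal range. A minimal sketch that prints these limits for comparison follows. The helper TwoToMinus is a name invented here, and the sketch assumes an 80-bit EXTENDED type as provided by Turbo Pascal under {$N+}.

```pascal
{$N+,E+}
PROGRAM MinNorm;

{ Hypothetical helper: returns 2**-N by repeated halving, which is  }
{ exact as long as the result is still a normalized EXTENDED.       }
FUNCTION TwoToMinus (N: INTEGER): EXTENDED;
VAR X: EXTENDED;
    I: INTEGER;
BEGIN
   X := 1.0;
   FOR I := 1 TO N DO
      X := X / 2.0;
   TwoToMinus := X;
END;

BEGIN
   WriteLn ('Smallest normalized SINGLE:   ', TwoToMinus (126));
   WriteLn ('Smallest normalized DOUBLE:   ', TwoToMinus (1022));
   WriteLn ('Smallest normalized EXTENDED: ', TwoToMinus (16382));
   WriteLn ('Scale factor 2**-8:           ', TwoToMinus (8));
END.
```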

====================================
Appendix B: Benchmark program source
====================================

; FILE: APFELM4.ASM
; assemble with MASM /e APFELM4 or TASM /e APFELM4

CODE        SEGMENT BYTE PUBLIC 'CODE'
            ASSUME  CS: CODE

            PAGE    ,120

            PUBLIC  APPLE87;

APPLE87     PROC    NEAR
            PUSH    BP                  ; save caller's base pointer
            MOV     BP, SP              ; make new frame pointer
            PUSH    DS                  ; save caller's data segment
            PUSH    SI                  ; save register
            PUSH    DI                  ;  variables
            LDS     BX, [BP+04]         ; pointer to parameter record
            FINIT                       ; init 80x87          FSP->R0
            FILD   WORD  PTR [BX+02]    ; maxrad              FSP->R7
            FLD    QWORD PTR [BX+08]    ; qmax                FSP->R6
            FSUB   QWORD PTR [BX+16]    ; qmax-qmin           FSP->R6
            DEC    WORD  PTR [BX+04]    ; ymax-1
            FIDIV  WORD  PTR [BX+04]    ; (qmax-qmin)/(ymax-1)FSP->R6
            FSTP   QWORD PTR [BX+16]    ; save delta_q        FSP->R7
            FLD    QWORD PTR [BX+24]    ; pmax                FSP->R6
            FSUB   QWORD PTR [BX+32]    ; pmax-pmin           FSP->R6
            DEC    WORD  PTR [BX+06]    ; xmax-1
            FIDIV  WORD  PTR [BX+06]    ; delta_p             FSP->R6
            MOV    AX, [BX]             ; save maxiter,[BX] needed for
            MOV    [BX+2], AX           ;  80x87 status now
            XOR    BP, BP               ; y=0
            FLD    QWORD PTR [BX+08]    ; qmax                FSP->R5
            CMP    WORD  PTR [BX+40], 0 ; fast mode on 8087 desired ?
            JE     yloop                ; no, normal mode
            FSTCW  [BX]                 ; save NDP control word
            AND    WORD PTR [BX], 0FCFFh; set PCTRL = single-precision
            FLDCW  [BX]                 ; get back NDP control word
yloop:      XOR    DI, DI               ; x=0
            FLD    QWORD PTR [BX+32]    ; pmin                FSP->R4
xloop:      FLDZ                        ; j**2= 0             FSP->R3
            FLDZ                        ; 2ij = 0             FSP->R2
            FLDZ                        ; i**2= 0             FSP->R1
            MOV    CX, [BX+2]           ; maxiter
            MOV    DL, 41h              ; mask for C0 and C3 cond.bits
iteration:  FSUB   ST, ST(2)            ; i**2-j**2           FSP->R1
            FADD   ST, ST(3)            ; i**2-j**2+p = i     FSP->R1
            FLD    ST(0)                ; duplicate i         FSP->R0
            FMUL   ST(1), ST            ; i**2                FSP->R0
            FADD   ST, ST(0)            ; 2i                  FSP->R0
            FXCH   ST(2)                ; 2*i*j               FSP->R0
            FADD   ST, ST(5)            ; 2*i*j+q = j         FSP->R0
            FMUL   ST(2), ST            ; 2*i*j               FSP->R0
            FMUL   ST, ST(0)            ; j**2                FSP->R0
            FST    ST(3)                ; save j**2           FSP->R0
            FADD   ST, ST(1)            ; i**2+j**2           FSP->R0
            FCOMP  ST(7)                ; i**2+j**2 > maxrad? FSP->R1
            FSTSW  [BX]                 ; save 80x87 cond.codeFSP->R1
            TEST   BYTE PTR [BX+1], DL  ; test carry and zero flags
            LOOPNZ iteration            ; until maxiter if not diverg.
            MOV    DX, CX               ; number of loops executed
            NEG    CX                   ; carry set if CX <> 0
            ADC    DX, 0                ; adjust DX if no. of loops<>0

            ; plot point here (DI = X, BP = y, DX has the color)

            FSTP   ST(0)                ; pop i**2            FSP->R2
            FSTP   ST(0)                ; pop 2ij             FSP->R3
            FSTP   ST(0)                ; pop j**2            FSP->R4
            FADD   ST,ST(2)             ; p=p+delta_p         FSP->R4
            INC    DI                   ; x:=x+1
            CMP    DI, [BX+6]           ; x > xmax ?
            JBE    xloop                ; no, continue on same line
            FSTP   ST(0)                ; pop p               FSP->R5
            FSUB   QWORD PTR [BX+16]    ; q=q-delta_q         FSP->R5
            INC    BP                   ; y:=y+1
            CMP    BP, [BX+4]           ; y > ymax ?
            JBE    yloop                ; no, picture not done yet

groesser:   POP    DI                   ; restore
            POP    SI                   ;  register variables
            POP    DS                   ; restore caller's data segm.
            POP    BP                   ; restore caller's base pointer
            RET    4                    ; pop parameters and return
APPLE87     ENDP

CODE        ENDS

END

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

UNIT Time;

INTERFACE

FUNCTION Clock: LONGINT;          { same as VMS; time in milliseconds }

IMPLEMENTATION

FUNCTION Clock: LONGINT; ASSEMBLER;
ASM
           PUSH    DS            { save caller's data segment }
           XOR     DX, DX        { initialize data segment to }
           MOV     DS, DX        {  access ticker counter }
           MOV     BX, 46Ch      { offset of ticker counter in segm.}
           MOV     DX, 43h       { timer chip control port }
           MOV     AL, 4         { freeze timer 0 }
           PUSHF                 { save caller's int flag setting }
           STI                   { allow update of ticker counter }
           LES     DI, DS:[BX]   { read BIOS ticker counter }
           OUT     DX, AL        { latch timer 0 }
           LDS     SI, DS:[BX]   { read BIOS ticker counter }
           IN      AL, 40h       { read latched timer 0 lo-byte }
           MOV     AH, AL        { save lo-byte }
           IN      AL, 40h       { read latched timer 0 hi-byte }
           POPF                  { restore caller's int flag }
           XCHG    AL, AH        { correct order of hi and lo }
           MOV     CX, ES        { ticker counter 1 in CX:DI:AX }
           CMP     DI, SI        { ticker counter updated ? }
           JE      @no_update    { no }
           OR      AX, AX        { update before timer freeze ? }
           JNS     @no_update    { no }
           MOV     DI, SI        { use second }
           MOV     CX, DS        {  ticker counter }
@no_update:NOT     AX            { counter counts down }
           MOV     BX, 36EDh     { load multiplier }
           MUL     BX            { W1 * M }
           MOV     SI, DX        { save W1 * M (hi) }
           MOV     AX, BX        { get M }
           MUL     DI            { W2 * M }
           XCHG    BX, AX        { AX = M, BX = W2 * M (lo) }
           MOV     DI, DX        { DI = W2 * M (hi) }
           ADD     BX, SI        { accumulate }
           ADC     DI, 0         {  result }
           XOR     SI, SI        { load zero }
           MUL     CX            { W3 * M }
           ADD     AX, DI        { accumulate }
           ADC     DX, SI        {  result in DX:AX:BX }
           MOV     DH, DL        { move result }
           MOV     DL, AH        {  from DL:AX:BX }
           MOV     AH, AL        {   to }
           MOV     AL, BH        {    DX:AX:BH }
           MOV     DI, DX        { save result }
           MOV     CX, AX        {  in DI:CX }
           MOV     AX, 25110     { calculate correction }
           MUL     DX            {  factor }
           SUB     CX, DX        { subtract correction }
           SBB     DI, SI        {  factor }
           XCHG    AX, CX        { result back }
           MOV     DX, DI        {  to DX:AX }
           POP     DS            { restore caller's data segment }
END;

BEGIN
   Port [$43] := $34;           { need rate generator, not square wave}
   Port [$40] := 0;             { generator as prog. by some BIOSes }
   Port [$40] := 0;             { for timer 0 }
END. { Time }
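
As far as can be read from the code, Clock combines the BIOS tick counter at 0040:006C with the latched value of timer 0 into one count of timer clocks and scales it to milliseconds with integer multiplications. The same conversion can be sketched in floating point. This is an illustration only, assuming the standard PC timer input clock of 1.193182 MHz and timer 0 running as a rate generator with a period of 65536 clocks; TicksToMs and its parameters are hypothetical names used only here.

```pascal
{$N+,E+}
PROGRAM TickDemo;

CONST ClocksPerMs = 1193.182;  { assumed: 1.193182 MHz timer input clock }

{ BiosTicks: 32-bit tick counter, Latched: latched timer 0 value.  }
{ Timer 0 counts down, so the clocks elapsed in the current tick   }
{ are 65535 - Latched (the NOT AX in Clock).                       }
FUNCTION TicksToMs (BiosTicks: LONGINT; Latched: WORD): EXTENDED;
VAR Clocks: EXTENDED;
BEGIN
   Clocks := BiosTicks * 65536.0 + (65535 - Latched);
   TicksToMs := Clocks / ClocksPerMs;
END;

BEGIN
   { one complete BIOS tick, about 54.9 ms }
   WriteLn (TicksToMs (1, 65535):10:3);
END.
```

The integer version in Clock avoids any dependence on the coprocessor being timed, which is presumably why the unit does not use floating point itself.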

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$A+,B-,R-,I-,V-,N+,E+}
PROGRAM PeakFlop;

USES Time;

TYPE ParamRec = RECORD
                   MaxIter, MaxRad, YMax, XMax: WORD;
                   Qmax, Qmin, Pmax, Pmin: DOUBLE;
                   FastMod: WORD;
                   PlotFkt: POINTER;
                   FLOPS:LONGINT;
                END;

VAR Param: ParamRec;
    Start: LONGINT;

{$L APFELM4.OBJ}

PROCEDURE Apple87 (VAR Param: ParamRec);     EXTERNAL;

BEGIN
   WITH Param DO BEGIN
      MaxIter:= 50;
      MaxRad := 30;
      YMax   := 30;
      XMax   := 30;
      Pmin   :=-2.1;
      Pmax   := 1.1;
      Qmin   :=-1.2;
      Qmax   := 1.2;
      FastMod:= Word (FALSE);
      PlotFkt:= NIL;
      Flops  := 0;
   END;
   Start := Clock;
   Apple87 (Param);         { executes 104002 FLOP }
   Start := Clock - Start;  { elapsed time in milliseconds }
   WriteLn ('Peak-MFLOPS: ', 104.002 / Start);
END.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

; FILE: M4X4.ASM
;
; assemble with TASM /e M4X4 or MASM /e M4X4

CODE      SEGMENT BYTE PUBLIC 'CODE'

          ASSUME  CS:CODE

          PUBLIC  MUL_4x4
          PUBLIC  IIT_MUL_4x4

FSBP0     EQU     DB  0DBh, 0E8h        ; declare special IIT
FSBP1     EQU     DB  0DBh, 0EBh        ;  instructions
FSBP2     EQU     DB  0DBh, 0EAh
F4X4      EQU     DB  0DBh, 0F1h

;---------------------------------------------------------------------
;
; MUL_4x4 multiplies a four-by-four matrix by an array of
; four-dimensional vectors. This operation is needed for 3D
; transformations in graphics data processing. There are arrays for
; each component of a vector. Thus there is an array containing all
; the x components, another containing all the y components and so on.
; Each component is an 8 byte IEEE floating-point number. Two indices
; into the array of vectors are given. The first is the index of the
; vector that will be processed first, the second is the index of the
; vector processed last.
;
;---------------------------------------------------------------------

MUL_4x4   PROC    NEAR

          AddrX   EQU DWORD PTR [BP+24] ; address of X component array
          AddrY   EQU DWORD PTR [BP+20] ; address of Y component array
          AddrZ   EQU DWORD PTR [BP+16] ; address of Z component array
          AddrW   EQU DWORD PTR [BP+12] ; address of W component array
          AddrT   EQU DWORD PTR [BP+8]  ; addr. of 4x4 transform. mat.
          F       EQU WORD  PTR [BP+6]  ; first vector to process
          K       EQU WORD  PTR [BP+4]  ; last vector to process
          RetAddr EQU WORD  PTR [BP+2]  ; return address saved by call
          SavdBP  EQU WORD  PTR [BP+0]  ; saved frame pointer
          SavdDS  EQU WORD  PTR [BP-2]  ; caller's data segment

          PUSH    BP                    ; save TURBO-Pascal frame ptr
          MOV     BP, SP                ; new frame pointer
          PUSH    DS                    ; save TURBO-Pascal data segmnt

          MOV     CX, K                 ; final index
          SUB     CX, F                 ; final index - start index
          JNC     $ok                   ; must not
          JMP     $nothing              ;  be negative
$ok:      INC     CX                    ; number of elements

          MOV     SI, F                 ; init offset into arrays
          SHL     SI, 1                 ; each
          SHL     SI, 1                 ;  element
          SHL     SI, 1                 ;   has 8 bytes

          LDS     DI, AddrT             ; addr. of transformation mat.
          FLD     QWORD PTR [DI]        ; load a[0,0]   = R7
          FLD     QWORD PTR [DI+8]      ; load a[0,1]   = R6

$mat_mul: LES     BX, AddrX             ; addr. of x component array
          FLD     QWORD PTR ES:[BX+SI]  ; load x[a]     = R5
          LES     BX, AddrY             ; addr. of y component array
          FLD     QWORD PTR ES:[BX+SI]  ; load y[a]     = R4
          LES     BX, AddrZ             ; addr. of z component array
          FLD     QWORD PTR ES:[BX+SI]  ; load z[a]     = R3
          LES     BX, AddrW             ; addr. of w component array
          FLD     QWORD PTR ES:[BX+SI]  ; load w[a]     = R2

          FLD     ST(5)                 ; load a[0,0]   = R1
          FMUL    ST, ST(4)             ; a[0,0] * x[a] = R1
          FLD     ST(5)                 ; load a[0,1]   = R0
          FMUL    ST, ST(4)             ; a[0,1] * y[a] = R0
          FADDP   ST(1), ST             ; a[0,0]*x[a]+a[0,1]*y[a]=R1
          FLD     QWORD PTR [DI+16]     ; load a[0,2]   = R0
          FMUL    ST, ST(3)             ; a[0,2] * z[a] = R0
          FADDP   ST(1), ST             ; a[0,0]*x[a]...a[0,2]*z[a]=R1
          FLD     QWORD PTR [DI+24]     ; load a[0,3]   = R0
          FMUL    ST, ST(2)             ; a[0,3] * w[a] = R0
          FADDP   ST(1), ST             ; a[0,0]*x[a]...a[0,3]*w[a]=R1
          LES     BX, AddrX             ; get address of x vector
          FSTP    QWORD PTR ES:[BX+SI]  ; write new x[a]

          FLD     QWORD PTR [DI+32]     ; load a[1,0]   = R1
          FMUL    ST, ST(4)             ; a[1,0] * x[a] = R1
          FLD     QWORD PTR [DI+40]     ; load a[1,1]   = R0
          FMUL    ST, ST(4)             ; a[1,1] * y[a] = R0
          FADDP   ST(1), ST             ; a[1,0]*x[a]+a[1,1]*y[a]=R1
          FLD     QWORD PTR [DI+48]     ; load a[1,2]   = R0
          FMUL    ST, ST(3)             ; a[1,2] * z[a] = R0
          FADDP   ST(1), ST             ; a[1,0]*x[a]...a[1,2]*z[a]=R1
          FLD     QWORD PTR [DI+56]     ; load a[1,3]   = R0
          FMUL    ST, ST(2)             ; a[1,3] * w[a] = R0
          FADDP   ST(1), ST             ; a[1,0]*x[a]...a[1,3]*w[a]=R1
          LES     BX, AddrY             ; get address of y vector
          FSTP    QWORD PTR ES:[BX+SI]  ; write new y[a]

          FLD     QWORD PTR [DI+64]     ; load a[2,0]   = R1
          FMUL    ST, ST(4)             ; a[2,0] * x[a] = R1
          FLD     QWORD PTR [DI+72]     ; load a[2,1]   = R0
          FMUL    ST, ST(4)             ; a[2,1] * y[a] = R0
          FADDP   ST(1), ST             ; a[2,0]*x[a]+a[2,1]*y[a]=R1
          FLD     QWORD PTR [DI+80]     ; load a[2,2]   = R0
          FMUL    ST, ST(3)             ; a[2,2] * z[a] = R0
          FADDP   ST(1), ST             ; a[2,0]*x[a]...a[2,2]*z[a]=R1
          FLD     QWORD PTR [DI+88]     ; load a[2,3]   = R0
          FMUL    ST, ST(2)             ; a[2,3] * w[a] = R0
          FADDP   ST(1), ST             ; a[2,0]*x[a]...a[2,3]*w[a]=R1
          LES     BX, AddrZ             ; get address of z vector
          FSTP    QWORD PTR ES:[BX+SI]  ; write new z[a]

          FLD     QWORD PTR [DI+96]     ; load a[3,0]   = R1
          FMULP   ST(4), ST             ; a[3,0] * x[a] = R5
          FLD     QWORD PTR [DI+104]    ; load a[3,1]   = R1
          FMULP   ST(3), ST             ; a[3,1] * y[a] = R4
          FLD     QWORD PTR [DI+112]    ; load a[3,2]   = R1
          FMULP   ST(2), ST             ; a[3,2] * z[a] = R3
          FLD     QWORD PTR [DI+120]    ; load a[3,3]   = R1
          FMULP   ST(1), ST             ; a[3,3] * w[a] = R2
          FADDP   ST(1), ST             ; a[3,3]*w[a]+a[3,2]*z[a]=R3
          FADDP   ST(1), ST             ; a[3,3]*w[a]...a[3,1]*y[a]=R4
          FADDP   ST(1), ST             ; a[3,3]*w[a]...a[3,0]*x[a]=R5
          LES     BX, AddrW             ; get address of w vector
          FSTP    QWORD PTR ES:[BX+SI]  ; write new w[a]

          ADD     SI, 8                 ; new offset into arrays
          DEC     CX                    ; decrement element counter
          JZ      $done                 ; no elements left, done
          JMP     $mat_mul              ; transform next vector

$done:    FSTP     ST(0)                ; clear
          FSTP     ST(0)                ;  FPU stack
$nothing: POP      DS                   ; restore TP data segment
          POP      BP                   ; restore TP frame pointer
          RET      24                   ; pop parameters and return

MUL_4x4   ENDP

;---------------------------------------------------------------------
;
; IIT_MUL_4x4 multiplies a four-by-four matrix by an array of
; four-dimensional vectors. This operation is needed for 3D
; transformations in graphics data processing. There are arrays for
; each component of a vector. Thus there is an array containing all
; the x components, another containing all the y components and so on.
; Each component is an 8 byte IEEE floating-point number. Two indices
; into the array of vectors are given. The first is the index of the
; vector that will be processed first, the second is the index of the
; vector processed last. This subroutine uses the special instructions
; only available on IIT coprocessors to provide fast matrix multiply
; capabilities, so make sure to use it only on IIT coprocessors.
;
;---------------------------------------------------------------------

IIT_MUL_4x4   PROC    NEAR

          AddrX   EQU DWORD PTR [BP+24] ; address of X component array
          AddrY   EQU DWORD PTR [BP+20] ; address of Y component array
          AddrZ   EQU DWORD PTR [BP+16] ; address of Z component array
          AddrW   EQU DWORD PTR [BP+12] ; address of W component array
          AddrT   EQU DWORD PTR [BP+8]  ; addr. of 4x4 transf. matrix
          F       EQU WORD  PTR [BP+6]  ; first vector to process
          K       EQU WORD  PTR [BP+4]  ; last vector to process
          RetAddr EQU WORD  PTR [BP+2]  ; return address saved by call
          SavdBP  EQU WORD  PTR [BP+0]  ; saved frame pointer
          SavdDS  EQU WORD  PTR [BP-2]  ; caller's data segment
          Ctrl87  EQU WORD  PTR [BP-4]  ; caller's 80x87 control word

          PUSH    BP                    ; save TURBO-Pascal frame ptr
          MOV     BP, SP                ; new frame pointer
          PUSH    DS                    ; save TURBO-Pascal data seg.
          SUB     SP, 2                 ; make local variable
          FSTCW   [Ctrl87]              ; save 80x87 ctrl word
          LES     SI, AddrT             ; ptr to transformation matrix
          FINIT                         ; initialize coprocessor
          FSBP2                         ; set register bank 2
          FLD     QWORD PTR ES:[SI]     ; load a[0,0]
          FLD     QWORD PTR ES:[SI+32]  ; load a[1,0]
          FLD     QWORD PTR ES:[SI+64]  ; load a[2,0]
          FLD     QWORD PTR ES:[SI+96]  ; load a[3,0]
          FLD     QWORD PTR ES:[SI+8]   ; load a[0,1]
          FLD     QWORD PTR ES:[SI+40]  ; load a[1,1]
          FLD     QWORD PTR ES:[SI+72]  ; load a[2,1]
          FLD     QWORD PTR ES:[SI+104] ; load a[3,1]
          FINIT                         ; initialize coprocessor
          FSBP1                         ; set register bank 1
          FLD     QWORD PTR ES:[SI+16]  ; load a[0,2]
          FLD     QWORD PTR ES:[SI+48]  ; load a[1,2]
          FLD     QWORD PTR ES:[SI+80]  ; load a[2,2]
          FLD     QWORD PTR ES:[SI+112] ; load a[3,2]
          FLD     QWORD PTR ES:[SI+24]  ; load a[0,3]
          FLD     QWORD PTR ES:[SI+56]  ; load a[1,3]
          FLD     QWORD PTR ES:[SI+88]  ; load a[2,3]
          FLD     QWORD PTR ES:[SI+120] ; load a[3,3]

                                        ; transformation matrix loaded

          MOV     AX, F                 ; index of first vector
          MOV     DX, K                 ; index of last vector

          MOV     BX, AX                ; index 1st vector to process
          MOV     CL, 3                 ; component has 8 (2**3) bytes
          SHL     BX, CL                ; compute offset into arrays

          FINIT                         ; initialize coprocessor
          FSBP0                         ; set register bank 0

$mat_loop:LES     SI, AddrW             ; addr. of W component array
          FLD     QWORD PTR ES:[SI+BX]  ; W component current vector
          LES     SI, AddrZ             ; addr. of Z component array
          FLD     QWORD PTR ES:[SI+BX]  ; Z component current vector
          LES     SI, AddrY             ; addr. of Y component array
          FLD     QWORD PTR ES:[SI+BX]  ; Y component current vector
          LES     SI, AddrX             ; addr. of X component array
          FLD     QWORD PTR ES:[SI+BX]  ; X component current vector
          F4X4                          ; mul 4x4 matrix by 4x1 vector
          INC     AX                    ; next vector
          MOV     DI, AX                ; next vector
          SHL     DI, CL                ; offset of vector into arrays

          FSTP    QWORD PTR ES:[SI+BX]  ; store X comp. of curr. vect.
          LES     SI, AddrY             ; address of Y component array
          FSTP    QWORD PTR ES:[SI+BX]  ; store Y comp. of curr. vect.
          LES     SI, AddrZ             ; address of Z component array
          FSTP    QWORD PTR ES:[SI+BX]  ; store Z comp. of curr. vect.
          LES     SI, AddrW             ; address of W component array
          FSTP    QWORD PTR ES:[SI+BX]  ; store W comp. of curr. vect.

          MOV     BX, DI                ; ofs nxt vect. in comp. arrays
          CMP     AX, DX                ; nxt vector past upper bound?
          JLE     $mat_loop             ; no, transform next vector
          FLDCW   [Ctrl87]              ; restore orig 80x87 ctrl word

          ADD      SP, 2                ; get rid of local variable
          POP      DS                   ; restore TP data segment
          POP      BP                   ; restore TP frame pointer
          RET      24                   ; pop parameters and return
IIT_MUL_4x4   ENDP

CODE      ENDS

END

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$N+,E+}

PROGRAM Trnsform;

USES Time;

CONST VectorLen = 8190;

TYPE  Vector    = ARRAY [0..VectorLen] OF DOUBLE;
      VectorPtr = ^Vector;
      Mat4      = ARRAY [1..4, 1..4] OF DOUBLE;

VAR   X: VectorPtr;
      Y: VectorPtr;
      Z: VectorPtr;
      W: VectorPtr;
      T: Mat4;
      K: INTEGER;
      L: INTEGER;
      First: INTEGER;
      Last:  INTEGER;
      Start: LONGINT;
      Elapsed:LONGINT;

PROCEDURE MUL_4X4     (X, Y, Z, W: VectorPtr;
                       VAR T: Mat4; First, Last: INTEGER); EXTERNAL;
PROCEDURE IIT_MUL_4X4 (X, Y, Z, W: VectorPtr;
                       VAR T: Mat4; First, Last: INTEGER); EXTERNAL;

{$L M4X4.OBJ}

BEGIN
   WriteLn ('Test8087 = ', Test8087);
   New (X);
   New (Y);
   New (Z);
   New (W);
   FOR L := 1 TO VectorLen DO BEGIN
      X^ [L] := Random;
      Y^ [L] := Random;
      Z^ [L] := Random;
      W^ [L] := Random;
   END;
   X^ [0] := 1;
   Y^ [0] := 1;
   Z^ [0] := 1;
   W^ [0] := 1;
   FOR K := 1 TO 4 DO BEGIN
      FOR L := 1 TO 4 DO BEGIN
         T [K, L] := (K-1)*4 + L;
      END;
   END;
   First := 0;
   Last  := 8190;
   Start := Clock;
   MUL_4X4 (X, Y, Z, W, T, First, Last);
   { IIT_MUL_4X4 (X, Y, Z, W, T, First, Last); }
   Elapsed := Clock - Start;
   WriteLn ('Number of vectors: ', Last-First+1);
   WriteLn ('Time: ', Elapsed, ' ms');
   WriteLn ('Equivalent to ', (28.0*(Last-First+1)/1e6)/
            (Elapsed*1e-3):0:4, ' MFLOPS');
   WriteLn;
   WriteLn ('Last vector:');
   WriteLn;
   WriteLn (X^[Last]);
   WriteLn (Y^[Last]);
   WriteLn (Z^[Last]);
   WriteLn (W^[Last]);
END.
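
To check the output of MUL_4X4 (or IIT_MUL_4X4), the transformation of a single vector can also be written in plain Pascal. The sketch below is a straightforward reference, not part of the original benchmark; the names Vec4 and Transform are invented here. It uses the same test matrix as Trnsform and the vector (1, 1, 1, 1) that the program stores at index 0. Because the summation order differs from the assembler routine, results for arbitrary vectors may differ in the last bit.

```pascal
{$N+,E+}
PROGRAM RefMul;

TYPE Vec4 = ARRAY [1..4] OF DOUBLE;
     Mat4 = ARRAY [1..4, 1..4] OF DOUBLE;

VAR T: Mat4;
    V, R: Vec4;
    K, L: INTEGER;

{ R := T * V, one dot product per matrix row }
PROCEDURE Transform (VAR T: Mat4; VAR V, R: Vec4);
VAR Row, Col: INTEGER;
    Sum: DOUBLE;
BEGIN
   FOR Row := 1 TO 4 DO BEGIN
      Sum := 0;
      FOR Col := 1 TO 4 DO
         Sum := Sum + T [Row, Col] * V [Col];
      R [Row] := Sum;
   END;
END;

BEGIN
   FOR K := 1 TO 4 DO
      FOR L := 1 TO 4 DO
         T [K, L] := (K-1)*4 + L;       { same matrix as in Trnsform }
   FOR K := 1 TO 4 DO
      V [K] := 1;                       { same as vector 0 in Trnsform }
   Transform (T, V, R);
   FOR K := 1 TO 4 DO
      WriteLn (R [K]:6:1);              { 10.0, 26.0, 42.0, 58.0 }
END.
```

Running Trnsform with First = Last = 0 and printing X^[0] to W^[0] should give the same four values if the assembler routine is linked correctly.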
