EVERYTHING YOU ALWAYS WANTED TO KNOW ABOUT MATH COPROCESSORS
This document has been created to provide the net.community with some
detailed information about mathematical coprocessors for the Intel 80x86 CPU
family. It may also help to answer some of the FAQs (frequently asked
questions) about this topic. The primary focus of this document is on 80387-
compatible chips, but there is also some information on the other chips in
the 80x87 family and the Weitek family of coprocessors. Care was taken to
make the information included as accurate as possible. If you think you have
discovered erroneous information in this text, or think that a certain detail
needs to be clarified, or want to suggest additions, feel free to contact me
at:
S_JUFFA@IRAVCL.IRA.UKA.DE
or at my SnailMail address:
Norbert Juffa
Wielandtstr. 14
7500 Karlsruhe 1
Germany
This is the fifth version of this document (dated 01-13-93) and I'd like
to thank those who have helped improve it by commenting on the previous
versions:
Fred Dunlap (cyrix!fred@texsun.Central.Sun.COM), Peter Forsberg
(peter@vnet.ibm.com), Richard Krehbiel (richk@grevyn.com), Arto
Viitanen (av@cs.uta.fi), Jerry Whelan (guru@stasi.bradley.edu),
Eric Johnson (johnson%camax01@uunet.UU.NET), Warren Ferguson
(ferguson@seas.smu.edu), Bengt Ask (f89ba@efd.lth.se), Thomas Hoberg
(tmh@prosun.first.gmd.de), Nhuan Doduc (ndoduc@framentec.fr), John
Levine (johnl@iecc.cambridge.ma.us), David Hough (dgh@validgh.com),
Duncan Murdoch (dmurdoch@mast.QueensU.CA), Benjamin Eitan
(benny.iil.intel.com)
A very special thanks goes to David Ruggiero (osiris@halcyon.halcyon.com),
who did a great job editing and formatting this article. Thanks David!
Contents of this document
1) What are math coprocessors?
2) How PC programs use a math coprocessor
3) Which applications benefit from a math coprocessor
4) Potential performance gains with a math coprocessor
5) How various math coprocessors work
6) Coprocessor emulator software
7) Installing a math coprocessor
8) Detailed description and specifications for all available math
coprocessor chips
9) Finding out which coprocessor you have (the COMPTEST program)
10) Current coprocessor prices and purchasing advice
11) The coprocessor benchmark programs (performance comparisons of
available math coprocessors using various CPUs)
12) Clock-cycle timings for each coprocessor instruction
13) Accuracy tests and IEEE-754 conformance for various coprocessors
14) Accuracy of transcendental function calculations for various coprocessors
15) Compatibility tests with Intel's 387DX / the SMDIAG program
16) References (literature)
17) Addresses of manufacturers of math coprocessors
18) Appendix A: Test programs for partial compatibility and accuracy checks
19) Appendix B: Benchmark programs TRNSFORM and PEAKFLOP
What are math coprocessors?
A coprocessor in the traditional sense is a processor, separate from the main
CPU, that extends the capabilities of a CPU in a transparent manner. This
means that from the program's (and programmer's) point of view, the CPU and
coprocessor together look like a single, unified machine.
The 80x87 family of math coprocessors (also known as MCPs [Math
CoProcessors], NDPs [Numerical Data Processors], NPXs [Numerical Processor
eXtensions], or FPUs [Floating-Point Units], or simply "math chips") are
typical examples of such coprocessors. The 80x86 CPUs, with the exception of
the 80486 (which has a built-in FPU), can only handle 8, 16, or 32 bit
integers as their basic data types. However, many PC-based applications
require the use of not only integers, but floating-point numbers. Simply put,
the use of floating-point numbers enables a binary representation of not only
integers, but also fractional values over a wide range. A common application
of floating-point numbers is in scientific applications, where very small
(e.g., Planck's constant) and very large numbers (e.g., speed of light) must
be accurately expressed. But floating-point numbers are also useful for
business applications such as computing interest, and in the geometric
calculations inherent in CAD/CAM processing.
Because the instruction sets of all 80x86 CPUs directly support only integers
and calculations upon integers, floating-point numbers and operations on them
must be programmed indirectly by using series of CPU integer instructions.
This means that computations involving floating-point numbers are far
slower than ordinary integer calculations. And this is where the 80x87
coprocessors come in: adding an 80x87 to an 80x86-based system augments the
CPU architecture with eight floating-point registers, five additional data
types and over 70 additional instructions, all designed to deal directly with
floating-point numbers as a basic data type. This removes the 'penalty' for
floating-point computations, and greatly increases overall system performance
for applications which depend heavily on these calculations.
In addition to being able to quickly execute load/store operations on
floating-point numbers, the 80x87 coprocessors can directly perform all the
basic arithmetic operations on them. Besides "knowing" how to add, subtract,
multiply and divide floating-point numbers, they can also operate on them to
perform comparisons, square roots, transcendental functions (such as logarithms
and sine/cosine/tangent), and compute their absolute value and remainder.
Like most things in life, floating-point arithmetic has been standardized.
The relevant standard (to which I will refer quite often in this document) is
the "IEEE-754 Standard for Binary Floating-Point Arithmetic" [10,11]. The
standard specifies numeric formats, value sets and how the basic arithmetic
(+,-,*,/,sqrt, remainder) has to work. All the coprocessors covered in this
document claim full or at least partial compliance with the IEEE-754
standard.
How PC programs use 80x87 and Weitek coprocessors
The basic data type used by all 80x87 coprocessors is an 80-bit long
floating-point number. This data type (called "temporary real" or "double
extended precision") can directly represent numbers which range in size
between 3.36*10^-4932 and 1.19*10^4932 (3.65*10^-4951 to 1.19*10^4932
including denormal numbers) where '^' denotes the power operator. (For those
familiar with floating-point formats, this format has 64 mantissa bits, 15
exponent bits and 1 sign bit, for the total of 80 bits.) This format provides
a precision of about 19 decimal places. 80x87s can also handle additional
data types, which are converted to the internal format when they are loaded
into the coprocessor and converted back when they are stored to memory. These
include 16 bit, 32 bit, and 64 bit integers as well as a BCD (binary coded
decimal) data type that occupies 10 bytes and provides 18 decimal digits.
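The range figures quoted above follow directly from the bit counts of the
format. Here is a minimal Python sketch of that arithmetic, assuming the usual
IEEE-style exponent bias of 2^(e-1)-1 for e exponent bits. The function names
are my own, and the decimal module is used only because these values lie far
outside the range of a Python float.

```python
from decimal import Decimal, getcontext
import math

getcontext().prec = 30  # carry more digits than we print


def format_limits(exponent_bits, significand_bits):
    """Return (smallest denormal, smallest normal, largest normal).

    significand_bits counts every mantissa bit, including the leading
    integer bit (which the 80-bit format stores explicitly).
    """
    bias = 2 ** (exponent_bits - 1) - 1      # 16383 for 15 exponent bits
    min_exp = 1 - bias                       # exponent of the smallest normal
    two = Decimal(2)
    smallest_normal = two ** min_exp
    smallest_denormal = two ** (min_exp - (significand_bits - 1))
    largest_normal = (two - two ** (1 - significand_bits)) * two ** bias
    return smallest_denormal, smallest_normal, largest_normal


def decimal_digits(significand_bits):
    """Approximate number of decimal digits carried by the mantissa."""
    return significand_bits * math.log10(2)


den, lo, hi = format_limits(exponent_bits=15, significand_bits=64)
print(f"smallest denormal {den:.2E}")
print(f"smallest normal   {lo:.2E}")
print(f"largest normal    {hi:.2E}")
print(f"about {decimal_digits(64):.1f} decimal digits")
```

Called with 15 exponent bits and 64 mantissa bits, this should reproduce the
limits given in the text for the double extended format, and 64 bits times
log10(2) gives the "about 19 decimal places".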
The 80x87 also supports two additional floating-point types. The short real
data type (also called "single-precision") has 32 bits that split into 23
mantissa bits, 8 exponent bits and a sign bit. By using the "hidden bit"
technique, the effective length of the mantissa is increased to 24 bits. (The
hidden bit technique exploits the fact that for normalized floating-point
numbers, the mantissa m always is in the range 1 <= m < 2. Since the first
mantissa bit represents the integer part of the mantissa, it is always set
for normalized numbers, and therefore need not be stored, as it is guaranteed
to always be 1.) The IEEE single-precision format provides a precision of
about 6-7 decimal places and can represent numbers between 1.17*10^-38 and
3.40*10^38 (1.40*10^-45 to 3.40*10^38 including denormal numbers). The long
real, or double-precision, data type has 64 bits, consisting of 52 mantissa
bits, 11 exponent bits, and the sign bit. It provides 15-16 decimal digits of
precision and can handle numbers from 2.22*10^-308 to 1.79*10^308 (4.94*10^-
324 to 1.79*10^308 including denormal numbers). (This format also uses the
hidden bit technique to provide effectively 53 mantissa bits.)
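To make the hidden bit technique a little more concrete, here is a small
Python sketch that splits a single-precision number into the three fields
described above and then rebuilds its value. The function names are made up
for this illustration, and the rebuild step only handles normalized numbers
(no zeros, denormals, infinities or NaNs).

```python
import struct


def decode_single(x):
    """Split an IEEE single-precision number into its three bit fields."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31                   # 1 sign bit
    exponent = (bits >> 23) & 0xFF      # 8 exponent bits, biased by 127
    fraction = bits & 0x7FFFFF          # 23 stored mantissa bits
    return sign, exponent, fraction


def rebuild_single(sign, exponent, fraction):
    """Rebuild the value of a normalized number from its fields."""
    mantissa = 1 + fraction / 2**23     # the hidden bit supplies the leading 1
    return (-1) ** sign * mantissa * 2.0 ** (exponent - 127)


sign, exponent, fraction = decode_single(6.5)
print(sign, exponent, hex(fraction))
print(rebuild_single(sign, exponent, fraction))
```

For 6.5 (which is 1.625 * 2^2) the fields come out as sign 0, exponent 129 and
fraction 500000 hexadecimal. The leading 1 of the mantissa appears nowhere in
the stored bits; it is added back in rebuild_single.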
The eight registers in the 80x87 are organized in a stack-like manner which
takes some time getting used to if one programs the coprocessor directly in
assembly language. However, nowadays the compilers or interpreters for most
high level languages (HLLs) can give a programmer easy access to the
coprocessor's data types and use its instructions, so there is not much
need to deal directly with the rather unusual architecture of the 80x87.
The architecture of the Weitek chips differs significantly from the 80x87.
Strictly speaking, the Weitek Abacus 3167 and 4167 are not coprocessors in
that they do not transparently extend the CPU architecture; rather, they
could be described as highly-specialized, memory-mapped IO devices. But as
the term "coprocessor" has been traditionally used for these chips, they will
be referred to as such here.
The Weitek coprocessors have a RISC-like architecture which has been tuned
for maximum performance. Only a small instruction set has been implemented in
the chip, but each instruction executes at a very high speed (usually only a
few clock cycles each). Instructions available include load/store, add,
subtract, subtract reverse, multiply, multiply and negate, multiply and
accumulate, multiply and take absolute value, divide reverse, negate,
absolute value, compare/test, convert fix/float, and square root. In contrast
to the 80x87 family, the Weitek Abacus does not support a double extended
format, has no built-in transcendental functions, and does not support
denormals. The resources required to implement such features have instead
been devoted to implement the basic arithmetic operations as fast as
possible.
While the 80x87 coprocessors perform all internal calculations in double
extended precision and therefore have about the same performance for single
and double-precision calculations, the Weitek features explicit single and
double-precision operations. For applications that require only single-
precision operations, the Weitek can therefore provide very high performance,
as single-precision operations are about twice as fast as their double-
precision counterparts. Also, since the Weitek Abacus has more registers than
the 80x87 coprocessors (31 versus 8), values can be kept in registers more
often and have to be loaded from memory less frequently. This also leads to
performance gains.
The Weitek's register file consists of 31 32-bit registers, each one capable
of holding an IEEE single-precision number. Pairs of consecutive single-
precision registers can also be used as 64-bit IEEE double-precision
registers; thus there are 15 double-precision registers. The Weitek register
file is organized conventionally, like the register file of the 80386, and
does not use the special stack-like organization of the 80x87 coprocessors.
To the main CPU, the Weitek Abacus appears as a 64 KB block of memory
starting at physical address 0C0000000h. Each address in this range
corresponds to a coprocessor instruction. Accessing a specified memory
location within this block with a MOV instruction causes the corresponding
Weitek instruction to be executed. (The instructions have been cleverly
assigned to memory locations in such a way that loads to consecutive
coprocessor registers can make use of the 386/486 MOVS string instruction.)
This memory-mapped interface is much faster than the IO-oriented protocol
that is used to couple the CPU to an 80287 or 80387 coprocessor. The Weitek's
memory block can actually be assigned to any logical address using the MMU
(memory management unit) in the 386/486's protected and virtual modes. This
also means that the Weitek Abacus *cannot* be used in the real mode of those
processors, since its physical starting address (0C0000000h) is not within
the 1 MByte address range and the MMU is inoperable in real mode. However,
DOS programs can make use of the Weitek by using a DOS extender or a memory
manager (such as QEMM or EMM386) that runs in protected/virtual mode itself
and can therefore map the Weitek's memory block to any desired location in
the 1 MByte address range.
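The address arithmetic behind this restriction is simple enough to sketch in
a few lines of Python. The constants are taken from the text; the function
names are hypothetical, and the sketch deliberately says nothing about which
offset within the block selects which Weitek instruction.

```python
WEITEK_BASE = 0xC0000000        # physical start address of the block
WEITEK_SIZE = 64 * 1024         # the block is 64 KB long
REAL_MODE_LIMIT = 1 << 20       # real mode reaches only the first 1 MByte


def in_weitek_block(physical_address):
    """True if the address falls inside the coprocessor's 64 KB window."""
    return WEITEK_BASE <= physical_address < WEITEK_BASE + WEITEK_SIZE


def reachable_in_real_mode(physical_address):
    """True if a real-mode program could address this location directly."""
    return physical_address < REAL_MODE_LIMIT


print(in_weitek_block(WEITEK_BASE), reachable_in_real_mode(WEITEK_BASE))
```

Every address inside the window fails the real-mode test, which is why a
memory manager or DOS extender has to remap the block.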
Typically the FS segment register is then set up to point to the Weitek's
memory block. On the 80486, this technique has severe drawbacks, as using the
FS: prefix takes an additional clock cycle, thereby nearly halving the
performance of the 4167. Most DOS-based compilers exhibit this problem, so
the only way around it is to code in assembly language [75]. The Weitek
Abacus 3167 and 4167 are also supported by the UNIX operating system [33].
Which application programs benefit from a math coprocessor
According to the Intel 387DX User's Guide, there are more than 2100
commercial programs that can make use of a 387-compatible coprocessor. Every
program that uses floating-point arithmetic somewhere and contains the
instructions to support an 80x87 or Weitek chip can gain speed by installing
one. However, the speedup will vary from program to program (and even within
the same program) depending on how computation-intensive the program or
operation within the program is. Typical applications that benefit from the
use of a math coprocessor are:
CAD programs (AutoCAD, VersaCAD, GenericCAD)
Spreadsheet programs (Lotus 1-2-3, Excel, Quattro, Wingz)
Business graphics programs (Arts&Letters, Freedom of Press, Freelance)
Mathematical analysis and statistical programs (Mathematica, TKSolver,
SPSS/PC, Statgraphics)
Database programs (dBase IV, FoxBase, Paradox, Revelation)
Note that for spreadsheets and databases, a coprocessor only helps if some
kind of floating-point computation is performed; this is true more often for
spreadsheets than for databases. Also note that the speed of many programs
depends quite heavily on factors such as the speed of the graphics adapter
(CAD) or the disk performance (databases), so the computational performance
is only a (small) part of the total performance of the application. There are
some programs that won't run without a coprocessor, among them AutoCAD (R10
and later) and Mathematica.
Most GUIs (graphical user interfaces) such as Microsoft Windows or the OS/2
Presentation Manager do *not* gain additional speed from using a
*mathematical* coprocessor, since their graphics operations only use integer
arithmetic [71]. They *will* benefit from a graphics board with a graphics
"coprocessor" that speeds up certain common graphics operations such as
BitBlt or line drawing. A few GUIs used on PCs, such as X-Windows, use a
certain amount of floating-point operations for operations such as arc
drawing. However, the use of floating-point operations in X-Windows seems to
have decreased significantly in versions after X11R3, so the overall
performance impact of a coprocessor is small [72]. Applications running under
any GUI may take advantage of a math coprocessor, of course (for example,
Microsoft Excel running under Windows).
While support for 80x87 coprocessors is very common in application programs,
the Weitek Abacus coprocessors do not enjoy such widespread support. Due to
their higher price, only a few high-end PCs have been equipped with Weitek
coprocessors. Some machines, such as IBM's PS/2 series, do not even have
sockets to accommodate them. Therefore, most of the programs that support
these coprocessors are also high-end products, like AutoCAD and VersaCAD-386.
Potential performance gains with a coprocessor
The Intel Math Coprocessor Utilities Disk that accompanies the Intel 387DX
coprocessor has a demonstration program that shows the speedup of certain
application programs when run with the Intel coprocessor versus a system with
no coprocessor:
Application          Time w/o 387    Time w/387    Speedup
Arts&Letters           87.0 sec       34.8 sec      150%
Quattro Pro             8.0 sec        4.0 sec      100%
Wingz                  17.9 sec        9.1 sec       97%
Mathematica           420.2 sec      337.0 sec       25%
The following table is an excerpt from [70]:
Application          Time w/o 387    Time w/387    Speedup
Corel Draw            471.0 sec      416.0 sec       13%
Freedom of Press      163.0 sec       77.0 sec      112%
Lotus 1-2-3           257.0 sec       43.0 sec      498%
The following table is an excerpt from [25]:
Application          Time w/o 387    Time w/387    Speedup
Design CAD, Test 1     98.1 sec       50.0 sec       96%
Design CAD, Test 2     75.3 sec       35.0 sec      115%
Excel, Test 1           9.2 sec        6.8 sec       35%
Excel, Test 2          12.6 sec        9.3 sec       35%
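The "Speedup" column in tables like these can be recomputed from the two
timings. The following Python sketch assumes that speedup means the increase
in execution rate, that is (time without / time with - 1) * 100, which is the
definition that matches the Intel figures quoted above. The function name is
my own.

```python
def speedup_percent(time_without, time_with):
    """Percentage increase in execution rate gained from the coprocessor."""
    return (time_without / time_with - 1.0) * 100.0


# Timings in seconds, taken from the Intel demonstration program table
timings = {
    "Quattro Pro": (8.0, 4.0),
    "Wingz": (17.9, 9.1),
    "Mathematica": (420.2, 337.0),
}

for application, (without_387, with_387) in timings.items():
    print(f"{application:12s} {speedup_percent(without_387, with_387):4.0f}%")
```

Note that under this definition a program that runs twice as fast shows a
speedup of 100%, not 200%.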
Note that coprocessor performance also depends on the motherboard, or more
specifically, the chipset used on the motherboard. In [34] and [35]
identically configured motherboards using different 386 chipsets were tested.
Among other tests, a coprocessor benchmark based on a fractal computation was
run and its execution time recorded. The following tables, which show how
coprocessor performance varies with the chipset, have been copied from these
articles in abridged form:
                Cyrix                                    Cyrix
chip set        387+                   chip set          83D87
Opti,  40 MHz   24.57 sec   97.0%      PC-Chips, 33 MHz  26.97 sec   93.0%
Elite, 40 MHz   24.46 sec   97.4%      UMC,      33 MHz  27.69 sec   90.5%
ACT,   40 MHz   23.84 sec  100.0%      Headland, 33 MHz  25.08 sec  100.0%
Forex, 40 MHz   23.84 sec  100.0%      Eteq,     33 MHz  27.38 sec   91.6%
This shows that performance of the same coprocessor can vary by up to ~10%
depending on the chipset used on your board, at least for 386 motherboards
(similar numbers for 286, 386SX, and 486 are, unfortunately, not available).
The benchmarks for this article were run on a motherboard with the Forex chip
set, one of the fastest 386 chip sets available, and not only with respect to
floating-point performance [35].
How various math coprocessors work
In any 80x86 system with an 80x87 math coprocessor, CPU instructions and
coprocessor instructions are executed concurrently. This means that the CPU
can execute CPU instructions while the coprocessor executes a coprocessor
instruction at the same time. The concurrency is restricted somewhat by the
fact that the CPU has to aid the coprocessor in certain operations. As the
CPU and the coprocessor are fed from the same instruction stream and both
may operate on the same data, there has to be a synchronizing mechanism
between the CPU and the coprocessor.
The 8087
In 8086/8088 systems with 8087 coprocessors, both chips look at every opcode
coming in from the bus. To do this, both chips have the same BIU (bus
interface unit) and the 8086 BIU sends the status signals of its prefetch
queue to the 8087 BIU. This insures that both processors always decode the
same instructions in parallel. Since all coprocessor instruction start with
the bit pattern 11011, it is easy for the 8087 to ignore all other
instructions. Likewise the CPU ignores all coprocessor instructions, unless
they access memory. In this case, the CPU computes the address of the LSB
(least significant byte) of the memory operand and does a dummy read. The
8087 then takes the data from the data bus. If more than one memory access is
needed to load a memory operand, the 8087 requests the bus from the CPU,
generates the consecutive addresses of the operand's bytes and fetches them
from the data bus. After completing the operation, the 8087 hands bus control
back to the CPU. Since the 8087 and the CPU are hooked up to the same synchronous
bus, they must run at the same speed. This means that with the 8087, only
synchronous operation of CPU and coprocessor is possible.
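Recognizing a coprocessor instruction by its leading bit pattern takes a
single mask-and-compare. The Python sketch below is only an illustration of
that test (the function name is mine); it says nothing about how the 8087
does it in hardware.

```python
ESC_MASK = 0b11111000       # look only at the top five bits of the opcode
ESC_PATTERN = 0b11011000    # 11011xxx, i.e. the bytes D8h through DFh


def is_coprocessor_opcode(first_byte):
    """True if the first opcode byte starts with the bit pattern 11011."""
    return (first_byte & ESC_MASK) == ESC_PATTERN


escape_bytes = [b for b in range(256) if is_coprocessor_opcode(b)]
print([f"{b:02X}" for b in escape_bytes])
```

Exactly eight of the 256 possible first bytes pass the test, namely D8
through DF hexadecimal.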
Another 8087 coprocessor instruction can only be started if the previous one
has been completed in the NEU (numerical execution unit) of the 8087. To
prevent the 8086 from decoding a new coprocessor instruction while the 8087
is still executing the previous coprocessor instruction, a coding mechanism
is employed: All 8087-capable compilers and assemblers automatically
generate a WAIT instruction before each coprocessor instruction. The WAIT
instruction tests the CPU's /TEST pin and suspends execution until its input
becomes "LOW". In all 8086/8087 systems, the 8086 /TEST pin is connected to
the 8087 BUSY pin. As long as the NEU executes a coprocessor instruction, it
forces its BUSY pin "HIGH"; thus, the WAIT opcode preceding the coprocessor
instruction stops the CPU until any still-executing coprocessor instruction
has finished.
The same synchronization is used before the CPU accesses data that was
written by the coprocessor. A WAIT instruction after any coprocessor
instruction that writes to memory causes the CPU to stop until the
coprocessor has completed transfer of the data to memory, after which the CPU
can safely access it.
The 80287
The 80287 coprocessor-CPU interface is totally different from the 8087
design. Since the 80286 implements memory protection via an MMU based on
segmentation, it would have been much too expensive to duplicate the whole
memory protection logic on the coprocessor, which an interface solution
similar to the 8087 would have required. Instead, in an 80286/80287 system,
the CPU fetches and stores all opcodes and operands for the coprocessor.
Information is then passed through the CPU ports F8h-FFh. (As these ports are
accessible under program control, care must be taken in user programs not to
accidentally perform write operations to them, as this could corrupt data in
the math coprocessor.)
The 8086/8087 combination can be characterized as a cooperation of partners
with equal rights, while the 80286/287 is more a master-slave relationship.
This makes synchronization easier, since the complete instruction and data
flow of the coprocessor goes through the CPU. Before executing most
coprocessor instructions, the 80286 tests its /BUSY pin, which is tied to the
287 coprocessor and signals if the 80287 is still executing a previous
coprocessor instruction or has encountered an exception. The 80286 then waits
until the /BUSY signal goes to "low" before loading the next coprocessor
instruction into the 80287. Therefore, a WAIT instruction before every
coprocessor instruction is not required. These WAITs are permissible, but not
necessary, in 80287 programs. The second form of WAIT synchronization (after
the coprocessor has written a memory operand) *is* still necessary on 286/287
systems.
The execution unit of the 80287 is practically identical to that of the 8087;
that is, nearly all coprocessor instructions execute in the same number of
clock cycles on both coprocessors. However, due to the additional overhead of
the 80287's CPU/coprocessor interface (at least ~40 clock cycles), an 8 MHz
80286/80287 combination can have lower floating-point performance than an
8086/8087 system running at the same speed. Additionally, older 286 boards
were often configured to run the coprocessor at only 2/3 the speed of the
CPU, making use of the ability of the 80287 to run asynchronously: The 80287
has a CKM pin that causes the incoming system clock to be divided by three
for the coprocessor if it is tied to ground. The 80286 always divides the
system clock by two internally, hence the final ratio of 2/3. However, when
the CKM (ClocK Mode) pin is tied high on the 80287, it does not divide the
CLK input. This feature has been exploited by the makers of coprocessor speed
sockets. These sockets tie CKM high and supply their own CLK signal with a
built-in oscillator, thereby allowing the 80287 or compatible to run at a
much higher speed than the CPU. With an IIT or Cyrix 287 one can have a 20
MHz coprocessor running with an 8 MHz 80286! Note, however, that the
floating-point performance of such a configuration does not scale linearly
with the coprocessor clock, since all the data has to be passed through the
much slower CPU. If the coprocessor executes mostly simple instructions (such
as addition and multiplication), doubling the coprocessor clock to 20 MHz in
a 10 MHz system does not show any performance increase at all [24].
The Intel 80287XL, the Cyrix 82S87, and the IIT 2C87 contain the internals of
a 387 coprocessor, but are pin-compatible with the original 287. These chips
divide the system clock by two internally, as opposed to three in the
original 80287. Since the 80286 also divides the system clock by two, they
usually run synchronously with respect to the CPU, although they can also be
run asynchronously.
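The clock ratios described in this section can be checked with a few lines of
Python. This is a sketch under the assumptions stated in the text: the 80286
always divides the system clock by two, the original 80287 divides it by
three when CKM is tied to ground, and the 387-based 287 replacements divide
it by two. The 24 MHz system clock is just an example value I picked.

```python
from fractions import Fraction


def clocks(system_clock_mhz, fpu_divider):
    """Return (CPU clock, coprocessor clock, coprocessor/CPU ratio)."""
    cpu = Fraction(system_clock_mhz) / 2            # 80286 divides by two
    fpu = Fraction(system_clock_mhz) / fpu_divider  # depends on the chip
    return cpu, fpu, fpu / cpu


# Original 80287 with CKM grounded: divide by three
print(clocks(24, fpu_divider=3))
# 287XL-style chip: divide by two, so it runs in step with the CPU
print(clocks(24, fpu_divider=2))
```

With a 24 MHz system clock the CPU runs at 12 MHz and the original 80287 at
8 MHz, which is the 2/3 ratio mentioned above; with a divide-by-two part the
ratio is 1.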
The 80387
The coprocessor interface in 80386/80387 systems is very similar to the one
found in 286/287 systems. However, to prevent corruption of the coprocessor's
contents by programming errors, the IO ports 800000F8h-800000FFh are used,
which are not accessible to programs. The CPU/coprocessor interface has been
optimized and uses full 32-bit transfers; the interface overhead has been
reduced to about 14-20 clock cycles. For some operations on the 387 'clones'
that take less than about 16 clock cycles to complete, this overhead
effectively limits the execution rate of coprocessor instructions. The only
sensible solution to provide even higher floating-point performance was to
integrate the CPU and coprocessor functionality onto the same chip, which
is exactly what Intel did with the 80486 CPU. The FPU in the 486 also benefits
from the instruction pipelining and from the on-chip cache.
Coprocessor emulator software
In the absence of a coprocessor, floating-point calculations are often
performed by a software package that simulates its operations. Such a program
is called a coprocessor emulator. Simulating the coprocessor has the
advantage for application programs that identical code can be generated for
use with either the coprocessor or the emulator, so that it's possible to
write programs that run on any system without regard to whether a coprocessor
is present or not. Whether the program will use an actual coprocessor or
software emulating it can easily be determined at run-time by detecting the
presence or absence of the coprocessor chip.
Two approaches to interface an 80x87 emulator to programs are common. The
first method makes use of the fact that all coprocessor instructions start
with the same five-bit pattern 11011. Thus the first byte of a coprocessor
instruction will be in the range D8-DF hexadecimal. In addition, coprocessor
instructions usually are preceded by a WAIT instruction (opcode 9Bh) which is
one byte long (the reason for doing this has been described in the previous
chapter dealing with the operating details of the 80x87). One common approach
is to replace the WAIT instruction and the first byte of the coprocessor
instruction with one out of eight interrupt instructions; the remaining bytes
of the coprocessor instruction are left unchanged. Interrupts 34 to 3B
hexadecimal are used for this emulation technique. (Note that the sequences
9B D8 … 9B DF can be easily converted to the interrupt instructions CD 34
… CD 3B by simple addition and subtraction of constants.) The compiler or
assembler initially produces code that contains these appropriate interrupt
calls instead of the coprocessor instructions. If a hardware coprocessor is
detected at run-time, the emulator interrupts point to a short routine that
converts the interrupt calls back to coprocessor instructions (yes, this
is known as "self-modifying code"). If no coprocessor is found the interrupts
point to the emulation package, which examines the byte(s) following the
interrupt instruction to determine which floating-point operation to perform.
This method is used by many compilers, including those from Microsoft and
Borland. It works with every 80x86 CPU from the 8086/8088 on.
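The byte arithmetic behind this first method can be sketched in Python as
follows. The two functions convert a single instruction in each direction.
Their names are my own, and a real run-time library patches the code in
place using its own bookkeeping rather than calling anything like this, so
take it purely as an illustration of the constants involved.

```python
WAIT_OPCODE = 0x9B      # one-byte WAIT instruction
INT_OPCODE = 0xCD       # first byte of the two-byte INT n instruction
FIRST_ESC = 0xD8        # coprocessor opcodes D8h..DFh
FIRST_VECTOR = 0x34     # emulator interrupts 34h..3Bh


def wait_esc_to_int(instruction):
    """Turn 9B D8..DF xx.. into CD 34..3B xx.. (remaining bytes unchanged)."""
    if (len(instruction) < 2 or instruction[0] != WAIT_OPCODE
            or (instruction[1] & 0xF8) != FIRST_ESC):
        raise ValueError("not a WAIT-prefixed coprocessor instruction")
    vector = instruction[1] - FIRST_ESC + FIRST_VECTOR
    return bytes([INT_OPCODE, vector]) + bytes(instruction[2:])


def int_to_wait_esc(instruction):
    """Turn CD 34..3B xx.. back into 9B D8..DF xx.. (the run-time fixup)."""
    if (len(instruction) < 2 or instruction[0] != INT_OPCODE
            or not FIRST_VECTOR <= instruction[1] < FIRST_VECTOR + 8):
        raise ValueError("not an emulator interrupt call")
    opcode = instruction[1] - FIRST_VECTOR + FIRST_ESC
    return bytes([WAIT_OPCODE, opcode]) + bytes(instruction[2:])


fsqrt = bytes([0x9B, 0xD9, 0xFA])           # WAIT followed by FSQRT
patched = wait_esc_to_int(fsqrt)
print(patched.hex(" "), "->", int_to_wait_esc(patched).hex(" "))
```

The example turns 9B D9 FA into CD 35 FA and back again. Because both forms
are the same length, the conversion can be done in place without moving any
of the surrounding code.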
The second method to interface an emulator is only available on 286/386/486
machines. If the emulation bit in the machine status word of these processors
is set, the processors will generate an interrupt 7 whenever a coprocessor
instruction is encountered. The vector for this interrupt will have been set
up to point at an emulation package that decodes the instruction and performs
the desired operation. This approach has the advantage that the emulator
doesn't have to be included in the program code, but can be loaded once (as a
TSR or device driver) and then used by every program that requires a
coprocessor. Emulation via interrupt 7 is transparent, which means that
programs containing coprocessor instructions execute just as if a coprocessor
were present, only slower. This approach is taken by the public domain EM87
emulator, the shareware program Q387, and the commercial Franke387 emulator,
for example. Even programs that require a coprocessor to run, like AutoCAD,
are 'fooled' into believing that a coprocessor is present by emulators using
INT 7.
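The decision the CPU makes in this second method can be modelled in a few
lines of Python. This is a toy model, not a description of the actual
hardware: it looks only at the emulation bit (bit 2 of the machine status
word) and ignores everything else that influences the real processors. The
function name and the returned strings are invented for the example.

```python
MSW_EM = 1 << 2     # emulation bit (EM) in the machine status word


def route_instruction(first_byte, machine_status_word):
    """Say where an instruction goes, judging by its first opcode byte."""
    is_coprocessor_opcode = (first_byte & 0xF8) == 0xD8   # 11011xxx
    if not is_coprocessor_opcode:
        return "cpu"
    if machine_status_word & MSW_EM:
        return "interrupt 7"        # handled by the emulation package
    return "coprocessor"


print(route_instruction(0xD9, MSW_EM))   # emulator installed
print(route_instruction(0xD9, 0))        # real coprocessor in use
print(route_instruction(0x90, MSW_EM))   # ordinary CPU instruction
```

The program being run is the same in both cases; only the state of the
emulation bit decides whether the hardware or the emulator does the work.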
Operating systems such as OS/2 2.0 and Windows 3.1 provide coprocessor
emulations using INT 7 automatically if they do not find a coprocessor to be
installed. The emulator in Windows doesn't seem to be very fast, as people
who have ported their Turbo Pascal programs from the TP 6.0 DOS compiler
(using the emulation built into the TP 6.0 run-time library) to the TPW 1.5
Windows compiler (using MS Windows' emulator) have noticed. Slowdowns of as
much as a factor of five have been reported [79].
The size of the emulator used by TP 6.0 is about 9.5 KB, while EM87 occupies
about 15.8 KB as a TSR, and Franke387 uses about 13.4 KB as a device driver.
Note that Franke387 and especially EM87 model a real coprocessor much more
closely than Turbo Pascal's emulator does. In particular, EM87 supports
denormal numbers, precision control, and rounding control. The emulator in TP
6.0 does not implement these features. The version of Franke387 tested (V2.4)
supports denormals in single and double-precision, but not double extended
precision, and it supports precision control, but not rounding control.
The recently introduced shareware program Q387 only runs on 386, 386SX, 486SX
and compatible processors. The program loads completely into extended memory
and uses about 330 KB. To enable INT 7 trapping to a service routine in
extended memory it needs to run with a memory manager (e.g. EMM386, QEMM,
or 386MAX). The huge size of the program stems from the fact that it was
solely optimized for speed, assuming that extended memory is a cheap resource.
Presumably it uses large tables to speed computations. Intel's E80287 program
is supposed to be a 100% exact emulation of the 80287 coprocessor [44]. Note
that the more closely a real coprocessor is modelled by the emulator, the
slower the emulator runs and the larger the code for the emulator gets.
Relative execution times of coprocessor vs. software emulators
for selected coprocessor instructions
                   Intel 387DX    TP 6.0 Emulator    EM87 Emulator
FADD ST, ST(0)          1               26                104
FDIV [DWord]            1               22                136
FXAM                    1               10                 73
FYL2X                   1               33                102
FPATAN                  1               36                110
F2XM1                   1               38                110
The following table is an excerpt from [44]:
                   Intel 80287    Intel E80287 Emulator
FADD ST, ST(0)          1                  42
FDIV [DWord]            1                 266
FXAM                    1                 139
FYL2X                   1                  99
FPATAN                  1                 153
F2XM1                   1                  41
The following has been adapted from [43] and merged with my own
data:
                   Intel 8087    TP 6.0 Emul. (8086)    Intel Emul. (8086)
FADD ST, ST(0)          1                20                     94
FDIV [DWord]            1                22                     82
FPTAN                   1                18                    144
F2XM1                   1                 6                    171
FSQRT                   1                44                    544
One of the reasons emulators are so slow is that they are often designed to
run with every CPU from the 8086/8088 on upwards. This is the case with the
emulators built into the libraries of the Turbo Pascal 6.0 (also used by
Turbo C/C++) and Microsoft C 6.0 (probably also used in other Microsoft
products) compilers and is also true for the public domain EM87 emulator. By
using code that can run on an 8086/8088, these emulators forgo the speed
advantage offered by the additional instructions and
architectural enhancements (such as 32-bit registers) of the more advanced
Intel 80x86 processors. A notable exception to this is the Franke387
emulator, a commercial emulator that is also sold as shareware. It uses 386-
specific 32-bit code and only runs on 386/386SX/486SX computers.
Besides being slow, coprocessor emulators have other drawbacks when compared
with real coprocessors. Most of the emulators do not support the additional
instructions that the 387-compatible coprocessors offer over the 80287.
Often, some of the low-level stack-manipulating instructions like FDECSTP are
not emulated. For example, [76] lists the coprocessor instructions not
emulated by Microsoft's emulator (included in the MS-C and MS-FORTRAN
libraries) as follows:
FCOS       FRSTOR     FSINCOS    FXTRACT
FDECSTP    FSAVE      FUCOM
FINCSTP    FSETPM     FUCOMP
FPREM1     FSIN       FUCOMPP
Additionally, some parts of the coprocessor architecture, like the status
register, are often emulated only partially or not at all. Some emulators do
not conform to the IEEE-754 standard in their implementation of the basic
arithmetic functions, while the hardware coprocessors do. Also, emulators
sometimes lack support for denormals (a special class of floating-point
numbers), although it is required by the standard. Not all the 80x87 emulators
support rounding control and precision control, also features required by
IEEE-754. Most of these omissions are aimed at making the emulator faster and
smaller. Because of the performance gap and these other shortcomings of
coprocessor emulators, a real coprocessor is a must for anybody planning to
do some serious computations. (At today's prices, this shouldn't pose much of
a problem to anybody!)
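To give a rough idea of what support for denormals buys, here is a small
Python sketch. It is not taken from any of the emulators discussed and it
assumes that the host's float type is an IEEE-754 double, which is the case
on common platforms. The variable names are chosen for illustration only.

```python
import sys

# Smallest positive *normalized* double (about 2.2e-308 for IEEE-754 doubles).
smallest_normal = sys.float_info.min

# With denormal support, values below that limit lose precision gradually
# instead of collapsing to zero.
denormal = smallest_normal / 4

# Two distinct tiny numbers: without denormals their difference would have
# to be flushed to zero, although the operands are not equal.
a = smallest_normal * 1.5
b = smallest_normal
difference = a - b

print("smallest normal:", smallest_normal)
print("a denormal     :", denormal)
print("a - b          :", difference)
```

With denormals, a - b is zero only if a equals b. An emulator that flushes
such results to zero gives up this property, which some numerical programs
rely on.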
Nhuan Doduc (ndoduc@framentec.fr) has tested a number of standalone
coprocessor emulators for PCs, among them the two emulators, EM87 and
Franke387 V2.4, already mentioned. He found Franke387 to be the best in terms
of reliability, speed, and accuracy.
Installing a math coprocessor
Usually, installing a coprocessor doesn't pose much of a problem, as every
coprocessor comes with installation instructions and a diagnostic disk that
lets you check its correct operation after installation. In addition, the
user manuals of most computers have a section on coprocessor installation.
1) Make sure to buy the right coprocessor for your system. An 8087 works
together with 8086, 8088, V20, and V30 CPUs. An 80287, 287XL or
compatible works with an 80286 CPU. (There are also some old 386
motherboards that accept an 80287 coprocessor, but they usually also
provide a socket for the 387; given today's pricing, it makes no sense
not to get a 387 for these systems.) An 80387, 387DX or compatible
coprocessor is for 386-based systems, as is the Intel RapidCAD. 387
coprocessors also work with the Cyrix 486DLC CPU (which, despite its
name, does not include an FPU). Similarly, the 387SX or compatible
coprocessors go into systems whose CPU is a 386SX or Cyrix 486SLC.
The Weitek Abacus 3167 works with a 386 CPU but requires a 121-pin EMC
socket in the system; this is *not* the same socket used by an 80387 or
compatible chip, and some computers, such as IBM's PS/2s, don't have
this socket. The Weitek Abacus 4167 works together with the 486 and
requires a special 142-pin socket to be present.
2) Always install a coprocessor that's rated at the same clock speed as the
CPU. For example, in a 40 MHz 386 system using an AMD Am386-40, install
a coprocessor rated for 40 MHz such as a Cyrix 83D87-40, C&T 38700DX-40,
IIT 3C87-40, or ULSI 83C87-40. Running a coprocessor above its specified
frequency rating may cause it to produce false results, which you might
fail to recognize as such. (I have personally experienced this problem
with a Cyrix 83D87-33 that I tried to push to 40 MHz. It passed all the
diagnostic benchmarks on the Cyrix diagnostic disk and the tests of some
commercial system test programs. However, I found it to fail the
Whetstone and Linpack benchmarks, which include accuracy checks.)
Although there is usually no problem with overheating when pushing a
coprocessor over the specified maximum frequency rating, be warned that
operation of a coprocessor above the maximum ratings stated by the
manufacturer may make its operation unreliable.
Some 386 boards allow the coprocessor to be clocked differently than the
CPU. This is called "asynchronous operation" and allows you, for
example, to run the coprocessor at 33 MHz while the CPU runs at 40 MHz.
Of the currently available math coprocessors, only the Intel 80387 and
387DX support asynchronous operation. The 387-compatible "clones" from
Cyrix, C&T, IIT and ULSI always run at the full speed of the CPU, even
if you have set up your motherboard for asynchronous operation.
3) Once you've got the correct coprocessor for your system you can start
the actual installation process. Turn off the computer's power switch
and unplug the power cord from the wall outlet, remove the case, and
locate the math coprocessor socket. This socket is always located right
next to the main CPU, which can be identified by the printing on top of
the chip. (It's also usually one of the biggest chips on the board.) The
8087 and 80287 DIL sockets are rectangular sockets with 20 pin holes on
each of the longer sides. The 387SX PLCC socket is a square socket that
has 17 vertical connector strips on the 'wall' of each side. The 387 PGA
socket is square and has two rows of pin holes on each side. The EMC
socket for the Weitek 3167 is similar but has three rows of holes on
each side. The PGA socket for the Weitek 4167 is also square with three
rows of holes on each side. If you can't find the math coprocessor
socket, consult your owner's manual, your computer dealer, or a
knowledgeable friend.
If you are installing the Intel RapidCAD chipset in a 386 system, you
will have to remove the 386 CPU first. Intel provides an easy-to-use
chip extractor and a storage box for the 386 chip for this purpose. Just
follow the instructions in the RapidCAD installation manual.
On many systems, the motherboard is supported only at a small number of
points. Since considerable force is required to insert a pin grid chip
like the 80387, RapidCAD, or Weitek Abacus 3167 into its socket, the
board may bend quite a lot due to the insertion pressure. This could
cause cracks in the board's conductive traces that may render it
intermittently or completely inoperable. Damage done to the board in
this way is usually not covered by the computer's warranty! Therefore,
it may be a good idea to first check how much the board bends by
pressing on the math coprocessor socket with your finger. If you find it
to bend easily, try to put something under the board directly beneath
the coprocessor socket. If this is impossible, as it is in many desktop
cases, consider removing the whole motherboard from the case, and
placing it on a hard, flat surface free of static electricity. (You will
also have to do this if your system's CPU and coprocessor socket are on
a separate card rather than on the motherboard, as is typical in many
modular systems.)
Be sure you are properly grounded before you remove the coprocessor from
its antistatic box, as even a tiny jolt of static electricity can ruin
the coprocessor. Make sure you do not touch the pins on the bottom of
the chip.
Check the pins and make sure none are bent; if some are, you can
*carefully* straighten them with needle-nose pliers or tweezers.
4) Match the coprocessor's orientation with the orientation of the socket.
Correct orientation of the coprocessor is absolutely essential, because
if you insert it the wrong way it may be damaged.
8087 and 287 coprocessors have a notch on one of the shorter sides of their
rectangular DIL package that should be matched with the notch of the
coprocessor socket. Usually the 286 CPU and the 287 coprocessor are
placed alongside each other and both have the same orientation (that
is, their respective notches point in the same direction). 387SX
coprocessors feature a white dot or similar mark that matches with some
sort of marking on the socket. 387 coprocessors have a bevelled corner
that is also marked with a white dot or similar marking. This should be
matched with the bevelled or otherwise marked corner of the socket. If
your system has only a large EMC socket and you are installing a 387 in
it, you will leave one row of pin holes free on each side of the chip.
Once you have found the correct orientation, place the chip over the
socket and make sure all pins are correctly aligned with their
respective holes. Press firmly and evenly on the chip -- you may have to
press hard to seat the coprocessor all the way. Again, make sure your
motherboard does not bend more than slightly under the insertion
pressure. For 8087, 287, and 387 coprocessors it is normal that the
coprocessor does not go all the way in; about one millimeter (1/25 inch)
of space is usually left between the socket and the bottom of the
coprocessor chip. (This allows the insertion of an extraction device
should it become necessary to remove the chip. Note that the
construction of the 387SX's PLCC socket makes it next-to-impossible to
remove the coprocessor once fully inserted, as the top of the chip is
level with the socket's 'walls'.)
5) Check your computer's manual for the proper position of any jumpers or
switches that need to be set to tell the system it now has a coprocessor
(and possibly, which kind it has). Put the cover back on the system
unit, reconnect the power, and turn on your computer. Depending on your
system's BIOS, you may now have to run a setup or configuration program
to enable the coprocessor. Finally, run the programs supplied on the
diagnostic disk (included with your coprocessor) to check for its
correct operation.
Descriptions of available coprocessors, CPU+FPU (as of 01-11-93):
Intel 8087
[43] This was the first coprocessor that Intel made available for the
80x86 family. It was introduced in 1980 and therefore does not have full
compatibility with the IEEE-754 standard for floating-point arithmetic
(which was finally released in 1985). It complements the 8088 and 8086
CPUs and can also be interfaced to the 80188 and 80186 processors.
The 8087 is implemented using NMOS. It comes in a 40-pin CERDIP (ceramic
dual inline package). It is available in 5 MHz, 8 MHz (8087-2), and 10
MHz (8087-1) versions. Power consumption is rated at max. 2400 mW [42].
A neat trick to enhance the processing power of the 8087 for
computations that use only the basic arithmetic operations (+,-,*,/) and
do not require high precision is to set the precision control to single-
precision. This gives one a performance increase of up to 20%. For
details about programming the precision control, see program PCtrl in
appendix A.
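The program PCtrl itself is not reproduced here. As a rough sketch of the
idea only, the Python fragment below computes a control word whose
precision-control field (bits 8 and 9 of the 80x87 control word) selects
single precision. The helper name with_precision and the sample value 037Fh
are assumptions made for illustration. Actually loading the value into the
coprocessor requires the FSTCW/FLDCW instructions, from assembly language or
a compiler's run-time library, which Python cannot do.

```python
# Precision-control (PC) field of the 80x87 control word: bits 8 and 9.
PC_MASK = 0x0300
PC_SINGLE = 0x0000    # 24-bit mantissa
PC_DOUBLE = 0x0200    # 53-bit mantissa
PC_EXTENDED = 0x0300  # 64-bit mantissa

def with_precision(control_word, pc_bits):
    """Return control_word with its precision-control field replaced."""
    return (control_word & ~PC_MASK & 0xFFFF) | pc_bits

# Illustrative value: all exceptions masked, extended precision selected.
example_cw = 0x037F
single_cw = with_precision(example_cw, PC_SINGLE)
print(f"{example_cw:04X}h -> {single_cw:04X}h")
```

A real program would read the current control word with FSTCW, change only
the PC field as shown, and write it back with FLDCW, so that the exception
masks and the rounding mode stay as they were.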
With the help of an additional chip, the 8087 can in theory be
interfaced to an 80186 CPU [36]. The 80186 was used in some PCs (e.g.
from Philips, Siemens) in the 1982/1983 time frame, but with IBM's
introduction of the 80286-based AT in 1984, it soon lost all
significance for the PC market.
Intel 80187
The 80187 is a rather new coprocessor designed to support the 80C186
embedded controller (a CMOS version of the 80186 CPU; see above). It was
introduced in 1989 and implements the complete 80387 instruction set. It
is available in a 40 pin CERDIP (ceramic dual inline package) and a 44
pin PLCC (plastic leaded chip carrier) for 12.5 and 16 MHz operation.
Power consumption is rated at max. 675 mW for the 12.5 MHz version and
max. 780 mW for the 16 MHz version [37].
Intel 80287
[44] This is the original Intel coprocessor for the 80286, introduced in
1983. It uses the same internal execution unit as the 8087 and therefore
has the same speed (actually, it is sometimes slower due to additional
overhead in CPU-coprocessor communication). As with the 8087, it does
not provide full compatibility with the IEEE-754 floating point standard
released in 1985.
The 80287 was manufactured in NMOS technology, and is packaged in a 40-
pin CERDIP (ceramic dual inline package). There are 6 MHz, 8 MHz, and 10
MHz versions. Power consumption can be estimated to be the same as that
for the 8087, which is 2400 mW max.
The 80287 has been replaced in the Intel 80x87 family with its faster
successor, the CMOS-based Intel 287XL, which was introduced in 1990 (see
below). There may still be a few of the old 80287 chips on the market,
however.
Intel 80287XL
This chip is Intel's second-generation 287, first introduced in 1990.
Since it is based on the 80387 coprocessor core, it features full IEEE
754 compatibility and faster instruction execution. Intel claims about
50% faster operation than the 80287 for typical benchmark tests such as
Whetstone [45]. Comparison with benchmark results for the AMD 80C287,
which is identical to the Intel 80287, supports this claim [1]: The Intel
287XL performed 66% faster than the AMD 80C287 on a fractal benchmark
and 66% faster on the Whetstone benchmark in these tests. Whetstone
results from [46] show the Intel 287XL at 12.5 MHz to perform 552
kWhets/sec as opposed to the AMD 80C287's 289 kWhets/sec, a 91%
performance increase. A benchmark using the MathPak program showed the
Intel 287XL to be 59% faster than the Intel 80287 (6.9 sec. vs. 11.0
sec.) [26]. Since the 287XL has all the additional instructions and
enhancements of a 387, most software automatically identifies it as an
80387-compatible coprocessor and therefore can make use of extra 387-
only features, such as the FSIN and FCOS instructions.
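The percentages quoted above follow from the raw figures in two slightly
different ways, depending on whether a benchmark reports a rate (higher is
better) or a run time (lower is better). The short Python sketch below
redoes the arithmetic; the function names are mine and are not taken from
any benchmark program.

```python
def speedup_from_rates(new_rate, old_rate):
    """Percent speed-up when results are throughput figures (higher is better)."""
    return (new_rate / old_rate - 1.0) * 100.0

def speedup_from_times(new_time, old_time):
    """Percent speed-up when results are run times (lower is better)."""
    return (old_time / new_time - 1.0) * 100.0

whetstone = speedup_from_rates(552, 289)   # kWhets/sec, 287XL vs. AMD 80C287
mathpak = speedup_from_times(6.9, 11.0)    # seconds, 287XL vs. Intel 80287
print(f"Whetstone: {whetstone:.0f}%  MathPak: {mathpak:.0f}%")
```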
The 287XL is manufactured in CMOS and therefore uses much less power
than the older NMOS-based 80287. At 12.5 MHz, the power consumption is
rated at max. 675 mW, about 1/4 of the 80287 power consumption. The
287XL is available in either a 40-pin CERDIP (ceramic dual inline
package) or a 44 pin PLCC (plastic leaded chip carrier). (This latter
version is called the 287XLT and intended mainly for laptop use.) The
287XL is rated for speeds of up to 12.5 MHz.
AMD 80C287
This chip, manufactured by Advanced Micro Devices (AMD), is an exact
clone of the old Intel 80287, and was first brought to market by AMD in
1989. It contains the original microcode of the 80287 and is therefore
100% compatible with it. However, as the name indicates, the 80C287 is
manufactured in CMOS and therefore uses less power than an equivalent
Intel 80287. At 12.5 MHz, its power consumption is rated at max. 625 mW
or slightly less than that of the Intel 80287XL [27]. There is also
another version called AMD 80EC287 that uses an 'intelligent' power save
feature to reduce the power consumption below 80C287 levels. Tests at
10.7 MHz show typical power consumption for the 80EC287 to be at 30 mW,
compared to 150 mW for the AMD 80C287, 300 mW for the Intel 287XL and
1500 mW for the Intel 80287 [57]. The 80EC287 is therefore ideally
suited for low power laptop systems.
The AMD 80C287 is available in speeds of 10, 12, and 16 MHz. (I have
only seen it being offered in 10 MHz and 12 MHz versions, however.) At
about US$ 50, it is currently the cheapest coprocessor available. Note
that it provides less performance than the newer Intel 287XL (see
above). The AMD 80C287 is available in 40 pin ceramic and plastic DIPs
(dual inline package) and as 44 pin PLCC (plastic leaded chip carrier).
Due to recent legal battles with Intel over the right to use the 287
microcode, which AMD lost, AMD may have to discontinue this product
(disclaimer: I am not a legal expert).
Cyrix 82S87
This 80287-compatible chip was developed from the Cyrix 83D87 (Cyrix's
80387 'clone') and has been available since 1991. It complies completely
with the IEEE-754 standard for floating-point arithmetic and features
nearly total compatibility with Intel's coprocessors, including
implementation of the full Intel 80387 instruction set. It implements
the transcendental functions with the same degree of accuracy and the
superior speed of the Cyrix 83D87. This makes the Cyrix 82S87 the
fastest [1] and most accurate 287 compatible coprocessor available.
Documentation by Cyrix [46] rates the 82S87 at 730 kWhets/sec for a 12.5
MHz system, while the Intel 287XL performs only 552 kWhets/sec. 82S87
chips manufactured after 1991 use the internals of the Cyrix 387+, which
succeeds the original 83D87 [73].
The 82S87 is a fully static CMOS design with very low power requirements
that can run at speeds of 6 to 20 MHz. Cyrix documentation shows the
82S87 to consume about the same amount of power as the AMD 80C287 (see
above). The 82S87 comes in a 40 pin DIP or a 44 pin PLCC (plastic leaded
chip carrier) compatible with the pinout of the Intel 287XLT and
ideally suited for laptop use.
IIT 2C87
This chip was the first 80287 clone available, introduced to the market
in 1989. It has about the same speed as the Intel 287XL [1]. The 2C87
implements the full 80387 instruction set [38]. Tests I ran on the 3C87
seem to indicate that it is not fully compatible with the IEEE-754
standard for floating-point arithmetic (see below for details), so it
can be assumed that the 2C87 also fails these tests (as it presumably
uses the same core as the 3C87).
The IIT 2C87 provides extra functions not available on any other 287
chip [38]. It has 24 user-accessible floating-point registers organized
into three register banks. Additional instructions (FSBP0, FSBP1, FSBP2)
allow switching from one bank to another. (Transfers between registers
in different banks are not supported, however, so this feature by itself
is of limited usefulness. Also, there seems to be only one status
register (containing the stack top pointer), so it has to be manually
loaded and stored when switching between banks with a different number
of registers in use [40]). The register bank's main purpose is to aid
the fourth additional instruction the 2C87 has (F4X4), which does a full
multiply of a 4x4 matrix by a 4x1 vector, an operation common in 3D-
graphics applications [39]. The built-in matrix multiply speeds this
operation up by a factor of 6 to 8 when compared to a programmed
solution according to the manufacturer [38]. Tests show the speed-up to
be indeed in this range [40]. For the 3C87, I measured the execution
time of F4X4 to be about 280 clock cycles; the execution time on the
2C87 should be somewhat larger - I estimate it to be around 310 clock
cycles due to the higher CPU-NDP communication overhead in instruction
execution in 286/287 systems (~45-50 clock cycles) compared with 386/387
systems (~16-20 clock cycles). As desirable as the F4X4 instruction may
seem, however, there are very few applications that make use of it when
an IIT coprocessor is detected at run time (among them Schroff
Development's Silver Screen and Evolution Computing's Fast-CAD 3-D
[25]).
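The estimate of about 310 clock cycles for F4X4 on the 2C87 can be
retraced with a few lines of Python. This is only a sketch: it assumes that
the CPU-NDP communication overhead is paid once per F4X4 instruction and
that the rest of the execution time is the same on both chips.

```python
f4x4_on_3c87 = 280        # measured clock cycles on the 3C87 (386/387 system)
overhead_386 = (16, 20)   # CPU-NDP overhead range in 386/387 systems
overhead_286 = (45, 50)   # CPU-NDP overhead range in 286/287 systems

# Assumption: swap the 386/387 overhead for the 286/287 overhead, once.
low = f4x4_on_3c87 - overhead_386[1] + overhead_286[0]
high = f4x4_on_3c87 - overhead_386[0] + overhead_286[1]
print(f"estimated F4X4 time on the 2C87: {low} to {high} clock cycles")
```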
The 2C87 is available for speeds of up to 20 MHz. It is implemented in
an advanced CMOS process and therefore has a low power consumption of
typically about 500 mW [38].
Intel 80387
This chip was the first generation of coprocessors designed specifically
for the Intel 80386 CPU. It was introduced in 1986, about one year after
the 80386 was brought to market. Early 386 systems were therefore
equipped with both an 80287 and an 80387 socket. The 80386 does work with
an 80287, but the numerical performance is hardly adequate for such a
system.
The 80387 has itself since been superseded by the Intel 387DX, which was
quietly introduced in 1989 (see below). You might find it when acquiring
an older 386 machine, though. The old 80387 is about 20% slower than the
newer 387DX.
The 80387 is packaged in a 68-pin ceramic PGA, and was manufactured
using Intel's older 1.5 micron CHMOS III technology, giving it moderate
power requirements. Power consumption at 16 MHz is max. 1250 mW (750 mW
typical), at 20 MHz max. 1550 mW (950 mW typical), and at 25 MHz max.
1950 mW (1250 mW typical) [60].
Intel 387DX
The 387DX is the second-generation Intel 387; it was quietly introduced
to replace the original 80387 in 1989. This version is done in a more
advanced CMOS process which enables the coprocessor to run at a maximum
frequency of 33 MHz (the 80387 was limited to a maximum frequency of 25
MHz). The 387DX is also about 20% faster than the 80387 on the average
for the same clock frequency. For a 386/387 system operating at 29 MHz
the Whetstone benchmark (compiled with the highly optimizing Metaware
High-C V1.6) runs at 2377 kWhetstones/sec for the 80387 and at 2693
kWhetstones/sec for the 387DX, a 13% increase. In a fractal calculation
programmed in assembly language, the 387DX performance was 28% higher
than the performance of the 80387. The transcendental functions have
also sped up from the 80387 to the 387DX. In the Savage benchmark
(again, compiled with Metaware High-C V1.6 and running on a 29 MHz
system), the 80387 evaluated 77600 function calls/second, while the
387DX evaluated 97800 function calls/second, a 26% increase [7]. Some
instructions have been sped up a lot more than the average 20%. For
example, the performance of the FBSTP instruction has increased by a
factor of 3.64.
The Intel 387DX (and its predecessor 80387) are the only 387
coprocessors that support asynchronous operation of CPU and coprocessor.
The 387 consists of a bus interface unit and a numerical execution unit.
The bus interface unit always runs at the speed of the CPU clock
(CPUCLK2). If the CKM (ClocK Mode) pin of the 387 is strapped to Vcc,
the numerical execution unit runs at the same speed as the bus interface
unit. If CKM is tied to ground, the numerical execution unit runs at the
speed provided by the NUMCLK2 input. The ratio of NUMCLK2 (coprocessor
clock) to CPUCLK2 (CPU clock) must lie within the range 10:16 to 14:10.
For example, for a 20 MHz 386, the Intel 387DX could be clocked from
12.5 MHz to 28 MHz via the NUMCLK2 input. (On the Cyrix 83D87, Cyrix
387+, ULSI 83C87, and the IIT 387, the CKM pin is not connected. These
coprocessors are therefore not capable of asynchronous operation and
always run at the speed of the CPU.)
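The permitted clock range is easy to work out for any CPU speed. The
following Python sketch applies the 10:16 to 14:10 limits quoted above; the
function names are made up for this example, and exact fractions are used
so that the boundary cases come out right.

```python
from fractions import Fraction

MIN_RATIO = Fraction(10, 16)   # lowest allowed NUMCLK2 : CPUCLK2
MAX_RATIO = Fraction(14, 10)   # highest allowed NUMCLK2 : CPUCLK2

def numclk_range_mhz(cpu_mhz):
    """Return the (lowest, highest) coprocessor clock for a given CPU clock."""
    return float(cpu_mhz * MIN_RATIO), float(cpu_mhz * MAX_RATIO)

def ratio_allowed(cpu_mhz, fpu_mhz):
    """True if the coprocessor-to-CPU clock ratio lies within the limits."""
    return MIN_RATIO <= Fraction(fpu_mhz) / Fraction(cpu_mhz) <= MAX_RATIO

print(numclk_range_mhz(20))     # the 20 MHz example from the text
print(ratio_allowed(40, 33))    # 33 MHz coprocessor with a 40 MHz CPU
```

Note that the ratio is only one condition. The coprocessor must of course
also be rated for the resulting clock frequency.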
The Intel 387DX is manufactured using Intel's advanced low power CHMOS
IV technology. Power consumption at 20 MHz is max. 900 mW (525 mW
typical), at 25 MHz max. 1050 mW (625 mW typical), and at 33 MHz max.
1250 mW (750 mW typical) [59].
Intel 387SX
This is the coprocessor paired with the Intel 386SX CPU. The 386SX is an
Intel 80386 with a 16-bit, rather than 32-bit, data path. This reduces
(somewhat) the costs to build a 386SX system as compared to a full 32-
bit design required by a 386DX. (The 386SX's main *marketing* purpose
was to replace the 80286 CPU, which was being sold more cheaply by other
manufacturers [such as AMD], and which Intel subsequently stopped
producing.) Due to the 16-bit data path, the 386SX is slower than the
386DX and offers about the same speed as an 80286 at the same clock
frequency for 16-bit applications. But as the 386SX is a complete 80386
internally, it also offers the possibility of running 32-bit applications
and supports the virtual 8086 mode (used for example by Windows' 386
enhanced mode).
The 387SX has all the features of the Intel 80387, including support
for asynchronous operation of CPU and coprocessor (see Intel 387DX
information, above). Due to the 16-bit data path between the CPU and the
coprocessor, the 387SX is a bit slower than an 80387 operating at the
same frequency. In addition, the 387SX is based on the core of the
original 80387, which executes instructions more slowly than the second
generation 387DX.
The 387SX comes in a 68-pin PLCC (plastic leaded chip carrier) package
and is available in 16 MHz and 20 MHz versions. (Coprocessors for faster
386SX systems based on the Am386SX CPU are available from IIT, Cyrix,
and ULSI.) Power consumption for the 387SX at 16 MHz is max. 1250 mW
(740 mW typical); for the 20 MHz version it is max. 1500 mW (1000 mW
typical) [62].
Intel 387SL
This coprocessor is designed for use in systems that contain an Intel
386SL as the CPU. The 386SL is directly derived from the 386SX. It is a
static CHMOS IV design with very low power requirements that is intended
to be used in notebook and laptop computers. It features an integrated
cache controller, a programmable memory controller, and hardware support
for expanded memory according to the LIM EMS 4.0 standard. The 387SL,
introduced in early 1992, has been designed to accompany the 386SL in
machines with low power consumption and to replace the 387SX for this
purpose. It features advanced power saving mechanisms. It is based on
the 387DX core, rather than on the older and slower 80387 core (which is
used by the 387SX).
IIT 3C87
This IIT chip was introduced in 1989, about the same time as the Cyrix
83D87. Both coprocessors are faster than Intel's 387DX coprocessor. The
IIT 3C87 also provides extra functions not available on any other 387
chip [38]. It has 24 user-accessible floating-point registers organized
into three register banks. Three additional instructions (FSBP0, FSBP1,
FSBP2) allow switching from one bank to another. (Transfers between
registers in different banks are not supported, however, so this feature
by itself is of limited usefulness. Also, there seems to be only one
status register [containing the stack top pointer], so it has to be
manually loaded and stored when switching between banks with a different
number of registers in use [40]). The register bank's main purpose is to
aid the fourth additional instruction the 3C87 has (F4X4), which does a
full multiply of a 4x4 matrix by a 4x1 vector, an operation common in
3D-graphics applications [39]. The built-in matrix multiply speeds this
operation up by a factor of 6 to 8 when compared to a programmed
solution according to the manufacturer [38]. Tests show the speed-up to
be indeed in this range [40]. I measured the F4X4 to execute in about
280 clock cycles, during which time it executes 16 multiplications and
12 additions. The built-in matrix multiply speeds up the matrix-by-
vector multiply by a factor of 3 compared with a programmed solution
according to IIT [39]. The results for my own TRNSFORM benchmark support
this claim (see results below), showing a performance increase by a
factor of about 2.5. This makes matrix multiplies on the IIT 3C87 nearly
as fast as on an Intel 486 at the same clock frequency. As desirable as
the F4X4 instruction may seem, however, there are very few applications
that make use of it when an IIT coprocessor is detected at run time
(among them Schroff Development's Silver Screen and Evolution
Computing's Fast-CAD 3-D [25]).
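For readers who want to see what a "programmed solution" has to do, here is
a plain Python sketch of the 4x4 matrix by 4x1 vector multiply. It only
illustrates the arithmetic (16 multiplications and 12 additions); it is not
IIT's microcode and does not show how F4X4 lays out its operands in the
register banks. The function name matvec4 is an assumption for this example.

```python
def matvec4(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-element vector.

    Returns the result vector together with the number of
    multiplications and additions performed.
    """
    result, mults, adds = [], 0, 0
    for row in m:
        acc = row[0] * v[0]
        mults += 1
        for k in range(1, 4):
            acc += row[k] * v[k]
            mults += 1
            adds += 1
        result.append(acc)
    return result, mults, adds

# Example: a translation by (2, 3, 4) applied to the point (1, 1, 1),
# written in homogeneous coordinates as is usual in 3D graphics.
translate = [[1, 0, 0, 2],
             [0, 1, 0, 3],
             [0, 0, 1, 4],
             [0, 0, 0, 1]]
point = [1, 1, 1, 1]
transformed, mults, adds = matvec4(translate, point)
print(transformed, mults, adds)
```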
These IIT-specific instructions also work correctly when using a Chips &
Technologies 38600DX or a Cyrix 486DLC CPU, which are both marketed as
faster replacements for the Intel 386DX CPU.
Tests I ran with the IEEETEST program show that the 3C87 is not fully
compatible with the IEEE-754 standard for floating-point arithmetic,
although the manufacturer claims otherwise. It is indeed possible that
the reported errors are due to personal interpretations of the standard
by the program's author that have been incorporated into IEEETEST and
that the standard also supports the different interpretation chosen by
IIT. On the other hand, the IEEE test vectors incorporated into IEEETEST
have become somewhat of an industry standard [66] and Intel's 387, 486,
and RapidCAD chips pass the test without a single failure, so the fact
that the IIT 3C87 fails some of the tests indicates that it is not fully
compatible with the Intel 387 coprocessor. My tests also show that the
IIT 3C87 does not support denormals for the double extended format. It
is not entirely clear whether the IEEE standard mandates support for
extended precision denormals, as the IEEE-754 document explicitly only
mentions single and double-precision denormals. Missing support for
denormals is not a critical issue for most applications, but there are
some programs for which support of denormals is at the very least quite
helpful [41]. In any case, failure of the 3C87 to support extended
precision denormal numbers does represent an incompatibility with the
Intel 387 and 486 chips.
The 3C87 is implemented in an advanced CMOS process and has low power
requirements, typically about 600 mW. Like the 387 'clones' from Cyrix
and ULSI, the 3C87 does not support asynchronous operation of the CPU
and the coprocessor, but always runs at the full speed of the CPU. It is
available in 16, 20, 25, 33, and 40 MHz versions.
IIT 3C87SX
This is the version of the IIT 3C87 that is intended for use with
Intel's 386SX or AMD's Am386SX CPU, and is functionally equivalent to
the IIT 3C87. Due to the 16-bit data path between the CPU and the
coprocessor in a 386SX-based system, coprocessor instructions will
execute somewhat more slowly than on the 3C87. At present, the IIT
3C87SX is the only 387SX coprocessor that is offered at speeds of 16,
20, 25, and 33 MHz. (I have read that Cyrix has also announced an 83S87-
33, but haven't seen it being offered yet.) The 3C87SX is packaged in a
68-pin PLCC.
Cyrix FasMath 83D87
This chip was introduced in 1989, only shortly after the coprocessors
from IIT. It has been found to be the fastest 387-compatible coprocessor
in several benchmark comparisons [1,7,68,69]. It also came out as the
fastest coprocessor in my own tests (see benchmark results below).
Although the Cyrix 83D87 provides up to 50% more performance than the
Intel 387DX in benchmark comparisons, the speed advantage over other
387-compatible coprocessors in real applications is usually much
smaller, because coprocessor instructions represent only a small part of
the total application code. For example, in a test using the program 3D-
Studio, the Cyrix 83D87 was 6% faster than the Intel 387DX [1].
Besides being the fastest 387 coprocessor, the 83D87 also offers the
most accurate transcendental function results of all coprocessors
tested (see test results below). The new "387+" version of the 83D87,
available since November 1991, even surpasses the level of accuracy of
the original 83D87 design. Note that the name 387+ is used in European
distribution only. In other parts of the world, the new chip still goes
by the name 83D87.
Unlike Intel's coprocessors, which use the CORDIC [18,19] algorithm to
compute the transcendental functions, Cyrix uses polynomial and rational
approximations to the functions. In the past the CORDIC method has been
popular since it requires only shifts and adds, which made it relatively
easy to implement a reasonably fast algorithm. Recently, the cost for the
implementation of fast floating-point hardware multipliers has dropped
significantly (due to the availability of VLSI), making the use of
polynomial and rational approximations superior to CORDIC for the
generation of transcendental functions [61]. The Cyrix 83D87 uses a fast
array multiplier, making its transcendental functions faster than those
of any other 387 compatible coprocessor. It also uses 75 bits for the
mantissa in intermediate calculations (as opposed to 68 bits on other
coprocessors), making its transcendental functions more accurate than
those of any other coprocessor or FPU (see results below).
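The difference between the two approaches can be shown with a small Python
sketch (Python 3.8 or later, because of math.prod). It is a textbook
illustration only and reflects neither Intel's nor Cyrix's actual microcode.
The CORDIC routine works on fixed-point integers and uses nothing but
shifts, adds, and a small table of arctangents. The polynomial routine uses
plain Taylor coefficients for simplicity, whereas a real implementation
would use coefficients chosen to minimize the maximum error. All names and
the choice of 30 fractional bits are assumptions made for this example.

```python
import math

FRAC_BITS = 30                 # fixed-point format with 30 fractional bits
ONE = 1 << FRAC_BITS
STEPS = 30                     # roughly one iteration per result bit

# Table of atan(2^-i), the only constants the CORDIC loop needs.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * ONE) for i in range(STEPS)]

# The rotations stretch the vector by a fixed gain. Starting with
# x = 1/gain cancels it, so the loop itself needs no multiplication.
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(STEPS))
X_START = round(ONE / GAIN)

def cordic_sin_cos(theta):
    """Return (sin, cos) of theta for theta within [-pi/2, pi/2]."""
    x, y, z = X_START, 0, round(theta * ONE)
    for i in range(STEPS):
        if z >= 0:   # rotate counter-clockwise by atan(2^-i)
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:        # rotate clockwise by atan(2^-i)
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    return y / ONE, x / ONE

# Taylor coefficients of sin(x) for the powers x^1, x^3, ..., x^13.
SIN_COEFFS = [(-1) ** k / math.factorial(2 * k + 1) for k in range(7)]

def poly_sin(theta):
    """Polynomial approximation of sin(theta) for theta within [-pi/2, pi/2]."""
    t2 = theta * theta
    acc = 0.0
    for c in reversed(SIN_COEFFS):   # Horner's rule: one multiply per term
        acc = acc * t2 + c
    return acc * theta

s, c = cordic_sin_cos(0.5)
print(s, c, poly_sin(0.5), math.sin(0.5))
```

CORDIC needs one loop iteration per bit of the result, while the polynomial
needs only a handful of multiply-add steps. That is why the polynomial
method wins once a fast hardware multiplier is available.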
The 83D87 (and its successor, the 387+) are the 387 'clones' with the
highest degree of compatibility to the Intel 387DX. A few minor software
and hardware incompatibilities have been documented by Cyrix [12]. The
software differences are caused by some bugs present in the 387DX that
Cyrix fixed in the 83D87. Unlike the Intel 387DX, the 83D87 (and all
other 387-compatible chips as well) does not support asynchronous
operation of CPU and coprocessor. There were also problems in the past
with the CPU-coprocessor communications, causing the 83D87 to
occasionally hang on some machines. The reason behind this was that
Cyrix shaved off a wait state in the communication protocol, which
caused a communications breakdown between the CPU and the 83D87 for some
systems running at 25 MHz or faster. (One notable example of this
behavior was the Intel 302 board.) Also there were problems with boards
based on early revisions of the OPTI chipset. These problems are only
rarely encountered with the current generation of 386 motherboards, and
it is possible that they have been entirely eliminated in the 387+, the
successor to the 83D87.
To reduce power consumption the 83D87 features advanced power saving
features. Those portions of the coprocessor that are not needed are
automatically shut down. If no coprocessor instructions are being
executed, *all* parts except the bus interface unit are shut down [12].
Maximal power consumption of the Cyrix 83D87 at 33 MHz is 1900 mW, while
typical power consumption at this clock frequency is 500 mW [15].
Cyrix EMC87
This coprocessor is basically a special version of the Cyrix 83D87,
introduced in 1990. In addition to the normal 387 operating mode, in
which coprocessor-CPU communication is handled through reserved IO
ports, it also offers a memory-mapped mode of operation similar to the
operation principle of the Weitek Abacus. Like the Weitek chip, the
EMC87 occupies a block of memory starting at physical address C0000000h
(the Abacus occupies a memory block of 64 KB, while the EMC87 uses only
4 KB [77]). It can therefore only be accessed in the protected or
virtual modes of the 386 CPU. DOS programs can access the EMC87 with the
help of DOS extenders or memory managers like EMM386 which run in
protected/virtual mode themselves. To implement the memory-mapped
interface, the usual 80x87 architecture has been slightly expanded with
three additional registers and eleven additional instructions that can
only be used if the memory-mapped mode is enabled.
Using this special mode of the EMC87 provides a significant speed
advantage. The traditional 387 CPU-coprocessor interface via IO ports
has an overhead of about 14-20 clock cycles. Since the Cyrix 83D87
executes some operations like addition and multiplication in much less
time, its performance is actually limited by the CPU-coprocessor
interface. Since the memory-mapped mode has much less overhead, it
allows all coprocessor instructions to be executed at full speed with no
penalty.
Originally, Cyrix claimed support for the fast memory-mapped mode of the
EMC87 from a number of software vendors (including Borland and
Microsoft). However, there are only very few applications that make use
of it, among them Evolution Computing's FastCAD 3D, MicroWay Inc.'s NDP
FORTRAN-386 compiler, Metaware's High-C compiler version 1.6 and newer,
and Intusoft's Spice [63,73]. Part of the problem in supporting the
memory-mapped mode is that the application must reserve one of the
general purpose registers of the CPU to use memory-mapped mode
instructions that access memory.
(Note that the EMC87 is *not* compatible with Weitek's Abacus
coprocessor. They both use the same CPU interface technique [memory
mapping], but while the EMC87 uses the standard 387 instruction set, the
Weitek Abacus coprocessors use an entirely different instruction set of
their own.)
Since the EMC87 also provides the standard 386/387 CPU interface via IO
ports, it can be used just like any other 387-compatible coprocessor and
delivers the same performance as the Cyrix 83D87 in this mode. The EMC87
even allows mixed use of memory-mapped and traditional instructions in
the same code. Cyrix has also implemented some additional instructions
in the EMC87 that are also available in the 387-compatible mode:
FRICHOP, FRINT2, and FRINEAR. These instructions enable rounding to
integer without setting the rounding mode by manipulating the
coprocessor control word, and are intended to make life easier for
compiler writers.
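The three kinds of rounding can be illustrated outside the coprocessor.
The following Python sketch is an illustration only. The function names
are hypothetical, and the mapping of FRICHOP to truncation, FRINEAR to
round-to-nearest-even, and FRINT2 to round-half-away-from-zero is an
assumption based on the mnemonics, not taken from Cyrix documentation.

```python
import math

def chop(x):
    """Round toward zero (assumed behaviour of FRICHOP)."""
    return float(math.trunc(x))

def nearest_even(x):
    """Round to nearest, ties go to the even integer (assumed FRINEAR)."""
    return float(round(x))

def nearest_away(x):
    """Round to nearest, ties go away from zero (assumed FRINT2)."""
    t = float(math.trunc(x))
    if abs(x - t) >= 0.5:
        t += math.copysign(1.0, x)
    return t

# Print each sample value with its three rounded results.
for value in (2.5, -2.5, 3.5, 2.7, -2.7):
    print(value, chop(value), nearest_even(value), nearest_away(value))
```

On a standard 80x87, the FRNDINT instruction rounds according to the
rounding mode currently set in the control word, so code that needs a
different rounding for a single conversion has to save, change, and
restore the control word around it.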
In a test, the EMC87 at 33 MHz ran the single-precision Whetstone
benchmark at 7608 kWhetstones/sec, while the Cyrix 83D87 at 33 MHz had a
speed of only 5049 kWhetstones/sec, an increase of 50.7% [63]. In
another test, the EMC87 ran a fractal computation at twice the speed of
the Cyrix 83D87 and 2.6 times as fast as an Intel 387DX [64]. A third
test found the EMC87's overall performance to be 20% higher than the
performance of the Cyrix 83D87 [65].
The Cyrix FasMath EMC87 has also been marketed as Cyrix AutoMATH; the
two chips are identical. Unlike the Cyrix 83D87, which fits into the 68-
pin 387 coprocessor socket, the EMC87 comes in a 121-pin PGA and
requires the 121-pin EMC (Extended Math Coprocessor) socket. Note that
not all boards have such a socket (a notable exception being IBM's
PS/2s, for example). The EMC87 is available in 25 and 33 MHz versions.
Maximum power consumption at 33 MHz is 2000 mW.
Cyrix appears currently to be phasing out the EMC87.
Cyrix FasMath 387+
This chip is the second-generation successor to the Cyrix 83D87. (The
name "387+" is only used for European distribution; in other parts of
the world, it goes by the original 83D87 designation.) According to a
source within Cyrix [73], the 387+ was designed to make a smaller (and
thus cheaper to manufacture) coprocessor chip that could also be pushed
to higher frequencies than the original chip: the 387+ is available in
versions of up to 40 MHz, whereas the original 83D87 could go no faster
than 33 MHz.
The Cyrix 387+ is ideally suited to be used with Cyrix's 486DLC CPU,
which is a 486SX compatible replacement chip for the Intel 386DX.
Indeed Cyrix sells upgrade kits consisting of a 486DLC CPU and a
Cyrix 387+.
In my tests, I found the Cyrix 387+ to be about five to ten percent
*slower* than the Cyrix 83D87 overall. Some instructions fare worse: the
square root (FSQRT) now runs at only half the speed at which it ran in
the 83D87, and most transcendental functions show about a 40% drop in
performance compared to their 83D87 averages (see performance results,
below). However, I did find the transcendental functions on the 387+ to
be a bit *more* accurate than those implemented in the 83D87. The new
design uses a slower hardware multiplier that needs six clock cycles to
multiply the floating-point mantissa of an internal precision number,
while the multiplier in the 83D87 takes only four clocks to accomplish the
same task. Since the transcendental functions in Cyrix math coprocessors
are generated by polynomial and rational approximations, this slows them
down significantly.
The divide/square root logic has also been changed from the 83D87
design. The original design used an algorithm that could generate both
the quotient and square root, so the execution times for these
instructions were nearly identical. The algorithm chosen for the
division in the 387+ doesn't allow the square root to be taken so
easily, so it takes nearly twice as long.
In the 387+, the available argument range for the FYL2XP1 instruction
has been extended, from the usual range -1+sqrt(2)/2..sqrt(2)/2 that is
found on all 80x87 coprocessors, to include all floating-point numbers.
Also, four additional instructions have been implemented: FRICHOP
(opcode DD FC), FRINT2 (opcode DB FC), FRINEAR (opcode DF FC), and FTSTP
(opcode D9 E6).
Cyrix FasMath 83S87
The 83S87 is the SX version of the Cyrix 83D87. Just as the 83D87 is the
fastest 387-compatible coprocessor, the Cyrix 83S87 is the fastest of
the 387SX compatible coprocessors [1], as well as providing the most
accurate transcendental functions. 83S87 chips manufactured after 1991
use the internals of the Cyrix 387+, the successor to the original 83D87
[73] (above). The Cyrix 83S87 is ideally suited to be used with the
Cyrix Cx486SLC CPU, a 486SX compatible CPU which is a replacement chip
for the Intel 386SX CPU.
The 83S87 is packaged in a 68-pin PLCC and is available in 16, 20, and
25 MHz versions. Due to the advanced power saving features of the Cyrix
coprocessor, the typical power consumption of the 20 MHz version is only
about 350 mW [67].
ULSI Math*Co 83C87
The ULSI 83C87 is an 80387-compatible coprocessor first introduced in
early 1991, well after the IIT 3C87 and Cyrix 83D87 appeared. Like other
387 clones, it is somewhat faster than the Intel 387DX, particularly in
its basic arithmetic functions. The transcendental functions, however,
show only a slight speed improvement over the Intel 387DX (see benchmark
results below).
In my tests, the ULSI had the least accurate transcendental functions
of all tested coprocessors. However, the maximum relative error is still
within the limits set by Intel, so this is probably not an important
issue for all but a very few applications. The ULSI 83C87 shows some
minor flaws in the tests for IEEE 754 compatibility, but this, too, is
probably unimportant under typical operating conditions. ULSI claims
that the program IEEETEST, which was used to test for IEEE
compatibility, contains many personal interpretations of the IEEE
standard by the program's author and states that there is no ANSI-
certified IEEE-754 compliance test. While this may be true, it is
also a fact that the IEEE test vectors used in IEEETEST are a de facto
industry standard, and that Intel's 387, 486, and RapidCAD chips pass it
without a single failure, as do the coprocessors from Cyrix. Since the
ULSI Math*Co 83C87 fails some of the tests, it is certainly less than
100% compatible with Intel's chips, although this will likely make
little or no difference in typical operating conditions. (It is
interesting to note that a ULSI 83S87 manufactured in 92/17 showed
fewer errors in the IEEETEST test run [74] than the ULSI 83C87,
manufactured in 91/48, I used in my original test. This indicates that
ULSI might have applied some quick fixes to newer revisions of their
math coprocessors.)
The ULSI 83C87 fails to be compatible with IEEE-754 in that it does
not implement the "precision control" feature. While all the internal
operations of 80x87 coprocessors are usually performed with the maximum
precision available (double-extended precision with 64 mantissa bits),
the 80x87 architecture also offers the possibility to force lower
precision to be used for the basic arithmetic functions (add, subtract,
multiply, divide, and square root). This feature is required by IEEE-754
for all coprocessors that cannot store results *directly* to a single
or double-precision location. Since 80x87 coprocessors lack this storage
capability, they all implement precision control to provide correctly
rounded single- and double-precision results according to the floating-
point standard - except the ULSI chips. For programs that make use of
precision control (e.g., Interactive UNIX), correct implementation of
the feature may be essential for correct arithmetic results.
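The effect that precision control guards against can be shown with exact
rational arithmetic. The sketch below is an illustration only, not one of
the tests used for this document. round_to_bits is a hypothetical helper
that rounds a positive value to a given number of mantissa bits
(round-to-nearest-even), and the sample value was picked by hand so that
rounding to 64 bits first and then to 53 bits gives a different result
than rounding to 53 bits directly.

```python
from fractions import Fraction

def round_to_bits(x, bits):
    """Round a positive Fraction to `bits` mantissa bits, ties to even."""
    # Find e such that 2**e <= x < 2**(e+1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    # Scale so the mantissa becomes an integer of `bits` bits, then round.
    scale = Fraction(2) ** (bits - 1 - e)
    return Fraction(round(x * scale)) / scale  # round() is half-to-even

# Hand-picked exact result of some operation (hypothetical sample value).
exact = 1 + Fraction(1, 2**53) + Fraction(1, 2**65)

direct = round_to_bits(exact, 53)                      # double precision
twice = round_to_bits(round_to_bits(exact, 64), 53)    # extended, then double

print("rounded once :", float(direct))
print("rounded twice:", float(twice))
print("results agree:", direct == twice)
```

With precision control set to double precision, the coprocessor rounds
the result of a basic operation only once, to 53 mantissa bits, which
corresponds to the first case.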
Like other non-Intel 387 compatibles, the 83C87 does not support
asynchronous operation of the CPU and the coprocessor. This means that
the 83C87 always runs at the full speed of the CPU. It is available in
20, 25, 33, and 40 MHz versions. The ULSI is produced in low power CMOS;
power consumption at 20 MHz is max. 800 mW (400 mW typical), at 25 MHz
it is max. 1000 mW (500 mW typical), at 33 MHz it is max. 1250 mW (625
mW typical), and at 40 MHz it is max. 1500 mW (750 mW typical) [58]. The
83C87 is packaged in a 68-pin ceramic PGA.
ULSI coprocessors come with a lifetime warranty. ULSI Systems, Inc.,
will replace the coprocessor up to three times free of charge should it
ever fail to function properly.
ULSI Math*Co 83S87
This chip is the SX version of the ULSI 83C87, for use in systems with
an Intel 386SX or an AMD Am386SX CPU. It is functionally equivalent to
the 83C87. To aid low-power laptop designs, the ULSI 83S87 features an
advanced power saving design with a sleep mode and a standby mode with
only minimal power requirements. Power consumption under normal
operating conditions (dynamic mode) is max. 400 mW at 16 MHz (300 mW
typical), max. 450 mW at 20 MHz (350 mW typical), and max. 500 mW at 25
MHz (400 mW typical) [58]. The ULSI 83S87 is packaged in a 68-pin PLCC.
C&T SuperMATH 38700DX
Produced by Chips&Technologies, this is the latest entry into the
387-compatible marketplace. Originally announced in October 1991, it was
apparently not available to end-users before the third quarter of
1992, at least here in Germany. My tests show that its compatibility
with Intel products is very good, even for the more arcane features of
the 387DX, and comparable to that of the coprocessors from Cyrix. Like these
chips, it passes the IEEETEST program without a single failure. It
passes, of course, all tests in Chips&Technologies' own compatibility
test program, SMDIAG. However, some of the tests (the transcendental
functions) in this program are selected in such a way that the C&T 38700
passes while the Cyrix 83D87 or Intel RapidCAD fail, so they are not
very useful. (There is also a 'bug' in the test for FSCALE that hides a
true bug in the C&T 38700.) My tests show the accuracy of the
transcendental functions on the C&T 38700DX varies. Overall, accuracy of
the transcendentals is slightly better than on the Intel 387DX.
In my own speed tests [see below] and those reported in [1], the C&T
38700DX showed performance at about 90-100% the level of the Cyrix
83D87, which is the 387 clone with the highest performance. For
floating-point-intensive benchmarks, the C&T 38700DX provides up to 50%
more computational performance than the Intel 387DX. However, as with
all other 387 compatible coprocessors, the speed advantage over the
Intel 387DX is far less significant in real applications.
The SuperMATH 38700DX is implemented in 1.2 micron CMOS with on-chip
power management, which makes for low power consumption. The 38700DX is
packaged in a 68-pin ceramic PGA (pin grid array) and available in speeds
of 16, 20, 25, 33, and 40 MHz.
C&T 38700SX
This chip is the SX version of the 38700DX and compatible with the Intel
387SX. It provides performance comparable to a Cyrix 83S87 [1], the
387SX clone with the highest performance. Compatibility with the Intel
387SX is very good and on par with the high degree of the compatibility
found in the Cyrix 83S87.
The 38700SX has low power consumption. It is packaged in a 68-pin PLCC
(plastic leaded chip carrier) and available in speeds of 16, 20, and 25
MHz.
Intel RapidCAD
The RapidCAD is not a coprocessor, strictly speaking, although it is
marketed as one. Rather, it is a full replacement for an 80386 CPU:
basically, an Intel 486DX CPU chip without the internal cache and with a
standard 386 pinout. RapidCAD is delivered as a set of two chips.
RapidCAD-1 goes into the 386 socket and contains the CPU and FPU.
RapidCAD-2 goes into the coprocessor (387) socket and contains a simple
PAL whose only purpose is to generate the FERR signal normally generated
by a coprocessor. (This is needed by the motherboard circuitry to provide
287 compatible coprocessor exception handling in 386/387 systems.) The
RapidCAD instruction set is compatible with the 386, so it doesn't have
any newer, 486-specific instructions like BSWAP. However, since the
RapidCAD CPU core is very similar to the 80486 CPU core, most of the
register-to-register instructions execute in the same number of clock
cycles as on the 486.
RapidCAD's use of the standard 386 bus interface causes instructions
that access memory to execute at about the same speed as on the 386. The
integer performance on the RapidCAD is definitely limited by the low
memory bandwidth provided by this interface (2 clock cycles per bus
cycle) and the lack of an internal cache. CPU instructions often execute
faster than they can be fetched from memory, even with a big and fast
external cache. Therefore, the integer performance of the RapidCAD
exceeds that of a 386 by *at most* 35%. This value was derived by
running some programs that use mostly register-to-register operations
and few memory accesses, and is supported by the SPEC ratings that Intel
reports for the 386-33 and the RapidCAD-33: while the 386-33 has a
SPECint of 6.4, the RapidCAD has a SPECint of 7.3 [28], a 14% increase.
(Note that these tests used the old [1989] SPEC benchmark suite.)
While CPU and integer instructions often execute in one clock cycle on
the RapidCAD, floating-point operations always take more than seven
clock cycles. They are therefore rarely slowed down by the low-bandwidth
386 bus interface. My tests show a 70%-100% performance increase for
floating-point intensive benchmarks over a 386-based system using the
Intel 387DX math coprocessor. This is consistent with the SPECfp rating
reported by Intel. The 386/387 at 33 MHz is rated at 3.3 SPECfp, while
the RapidCAD is rated at 6.1 SPECfp at the same frequency, an 85%
increase. This means that a system that uses the RapidCAD is faster than
*any* 386/387 combination, regardless of the type of 387 used, whether
an Intel 387DX or a faster 387 clone. The diagnostic disk for the
RapidCAD also gives some application performance data for the RapidCAD
compared to the Intel 387DX:
Application             Time w/ 387DX   Time w/ RapidCAD   Speedup
AutoCAD 11                     52 sec             32 sec       63%
AutoShade/Renderman           180 sec            108 sec       67%
Mathematica (Windows)         139 sec            103 sec       35%
SPSS/PC+ 4.01                  17 sec             14 sec       21%
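The speedup column can be recomputed from the two time columns. The short
Python sketch below assumes that "speedup" means how much more work is
done per unit of time, i.e. (old time / new time - 1) * 100. The function
name and the dictionary are introduced here for illustration only.

```python
def speedup_percent(time_old, time_new):
    """Percentage gain in throughput when run time drops to time_new."""
    return (time_old / time_new - 1.0) * 100.0

# Run times in seconds (387DX, RapidCAD), copied from the table above.
timings = {
    "AutoCAD 11": (52, 32),
    "AutoShade/Renderman": (180, 108),
    "Mathematica (Windows)": (139, 103),
    "SPSS/PC+ 4.01": (17, 14),
}

for name, (t_387, t_rapidcad) in timings.items():
    print("%-22s %5.1f%%" % (name, speedup_percent(t_387, t_rapidcad)))
```

The computed values agree with the table to within rounding to whole
percent.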
RapidCAD is available in 25 MHz and 33 MHz versions. It is distributed
through different channels than the other Intel math coprocessors, and I
have therefore been unable to obtain a data sheet for it. [78] gives the
typical power consumption of the 33 MHz RapidCAD as 3500 mW, which is
the same as for the 33 MHz 486DX. The RapidCAD-1 chip gets quite hot
when operating. Therefore, I recommend extra cooling for it (see the
paragraph below on the 486 for details). The RapidCAD-1 is packaged in a
132-pin PGA, just like the 80386, and the RapidCAD-2 is packaged in a
68-pin PGA like an 80387 coprocessor.
Intel 486DX
The Intel 486DX is, of course, not solely a coprocessor. This chip,
first introduced by Intel in 1989, functionally combines the CPU (a
heavily-pipelined implementation of the 386 architecture) with an
enhanced 387 (the chip's floating-point unit, FPU) and 8 KB of unified
on-chip code/data cache. (This description is necessarily simplified;
for a detailed hardware description, see [52].) The 486DX offers about
two to three times the integer performance of a 386 at the same clock
frequency, while floating-point performance is about three to four times
as high as the Intel 387DX at the same clock rate [29]. Since the FPU is
on the same chip as the CPU, the considerable communication overhead
between CPU and coprocessor in a 386/387 system is omitted, letting FPU
instructions run at the full speed permitted by the implementation. The
FPU also takes advantage of the on-chip cache and the highly pipelined
execution unit. The concurrent execution of CPU and coprocessor
instructions typical for 80x86/80x87 systems is still in existence on
the 486, but some FPU instructions like FSIN have nearly no concurrency
with CPU instructions, indicating that they make heavy use of both CPU
and FPU resources [53, 1].
Besides its higher performance, the 486 FPU provides more accurate
transcendental functions than the 387DX coprocessor, according to my
tests (see below). To achieve better interrupt latency, FPU instructions
with long execution times have been made abortable if an interrupt
occurs during their execution.
Due to the considerable amount of heat produced by these chips, and
taking into consideration the slow air flow provided by the fan in
garden-variety PC tower cases, I recommend an extra fan directly above
the CPU for safer operation. If you measure the surface temperature of
a 486DX after some time of operation in a normal tower case without
extra cooling, you may well come up with something like 80-90 degrees
Celsius (that is 175-195 degrees Fahrenheit for those not familiar with
metric units) [54,55]. You don't need the well known (and expensive)
IceCap[tm] to effectively cool your CPU; a simple fan mounted directly
above the CPU can bring the temperature of the chip down to about 50-60
degrees Celsius (120-140 degrees Fahrenheit), depending on the room
temperature and the temperature within the PC case (which depends on the
total power dissipation of all the components and the cooling provided
by the fan in the system's power supply). According to a simple rule
known as Arrhenius' Law, lowering the temperature by 10 degrees Celsius
slows down chemical reactions by a factor of two, so lowering the
temperature of your CPU by 30 degrees should prolong the life of the
device by a factor of eight, due to the slower ageing process. If you
are reluctant to add a fan to your system because of the additional
noise, settle for a low-noise fan like those available from the German
manufacturer Papst (this is not meant to be an advertisement; I am just
the happy owner of such a fan, and have no other connections to the
firm).
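The rule of thumb quoted above (a factor of two per 10 degrees Celsius)
can be written as a one-line formula. The Python sketch below simply
evaluates that rule and the Celsius-to-Fahrenheit conversion used in the
text. It is not a reliability model for any particular chip, and the
function names are made up for this example.

```python
def life_factor(delta_celsius):
    """Estimated life extension for a temperature drop of delta_celsius,
    using the rule of thumb of a factor of two per 10 degrees Celsius."""
    return 2.0 ** (delta_celsius / 10.0)

def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0

# A 30 degree drop, e.g. from about 85 C without a fan to about 55 C with one.
print("life factor for a 30 degree drop:", life_factor(30))
for temp in (50, 60, 80, 90):
    print(temp, "C =", celsius_to_fahrenheit(temp), "F")
```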
The 486DX comes in a 168-pin ceramic PGA (pin grid array). It is
available in 25 MHz and 33 MHz versions. Since the end of 1991, a 50 MHz
version has also been available, manufactured using a CHMOS V process (the
25 MHz and 33 MHz versions are produced using the CHMOS IV process). Maximum
power consumption is 3500 mW for the 25 MHz 486 (2600 mW typical), 4500
mW for the 33 MHz version (3500 mW typical), and 5000 mW (3875 mW
typical) for the 50 MHz chip.
Intel 486DX2
The 486DX2 represents the latest generation of Intel CPUs. The "DX2"
suffix (instead of simply DX) is meant to be an indicator that these are
clock-doubled versions of the basic CPU. A normal 486DX operates at the
frequency provided by the incoming clock signal. A 486DX2 instead
generates a new clock signal from the incoming clock by means of a PLL
(phase locked loop). In the DX2, this clock signal has twice the
frequency of the incoming clock, hence the name clock-doubler. All
internal parts of the 486DX2 (cache, CPU core, and FPU) run at this
higher frequency; only the bus interface runs at the normal (undoubled)
speed. Using this technique, an Intel 486DX2-50 can run on an unmodified
motherboard designed for 25 MHz operation. Since motherboards which run
at 50 MHz are much harder to design and build than those for 25 MHz,
this makes a 486DX2-50 system cheaper than an 'equivalent' 486DX-50
system.
For all operations that don't access off-chip resources (e.g., register
operations), a 486DX2-50 provides exactly the same performance as a
486DX-50, and twice the performance of a 486DX-25. However, since the
main memory in a 486DX2-50 system still operates at 25 MHz, all
instructions involving memory accesses are potentially slower than in a
486DX-50 system, whose memory also (presumably) runs at 50 MHz. The
internal cache of the 486 alleviates this problem a bit, but overall
performance of a 486DX2-50 is still lower than that of a 486DX-50.
Intel's documentation [32] shows this drop to be quite small, although
it is highly dependent upon the particular application.
The truly wonderful thing about the 486DX2 is that it allows easy
upgrading of 25 and 33 MHz 486 systems, since the 486DX2 is completely
pin-compatible with the 486DX: you need only take out the 486DX and plug
in the new 486DX2. Note that power consumption of the 486DX2-50 equals
that of the 486DX-50 (4000 mW typical, 4750 mW max.), and that the
486DX2-66 exceeds this by about 25% (4875 mW typical, 6000 mW max.).
These chips get *really* hot in a standard PC case with no extra
cooling, even if they come with an attached heat sink by default. (See
the discussion above for more detailed information on this problem and
possible solutions).
Intel 487SX
The 487SX is the math coprocessor intended for use in 486SX systems. The
486SX is basically a 486DX without the floating-point unit (FPU) [48,
50]. (Originally Intel sold 486DXs with a defective FPU as 486SXs but it
has now completely removed the FPU part from the 486SX mask for mass
production.) The introduction of the 486SX in 1991 has been viewed by
many as a marketing 'trick' by Intel to take market share from the 386
based systems once AMD became successful with their Am386. (AMD has
taken as much as 40% of the 386 market due to some superior features
such as higher clock frequency, lower power consumption, fully static
design, and availability of a 3V version). A 486SX at 20 MHz delivers
a bit less integer performance than a 40 MHz Am386.
To add floating-point capabilities to a 486SX based system, it would
seem to be easiest to swap the 486SX for a 486DX, which includes the FPU
on-chip. However, Intel has prevented this easy solution by giving the
486SX a slightly different pinout [48, 51]. Since only three pins are
assigned differently, clever board manufacturers have come out with
boards that accept anything from a 486SX-20 to a 486DX2-50 in their CPU
socket and by doing so provide a clean upgrade path. A set of three
jumpers ensures correct signal assignment to the changed pins for either
CPU type. To upgrade 486SX systems without this feature, you are forced
to buy a 487SX and install it in the "Performance Upgrade Socket"
(present in most systems).
Once the 487SX was available, it was quickly discovered that it is just a
normal 486DX with a slightly different pinout [49]. Technically
speaking, the solution Intel chose was the only practical way to provide
a 486SX system with the high level of floating-point performance the
486DX offers. The CPU and FPU must be on the same chip; otherwise, the
FPU cannot make use of the CPU's internal cache and there would be
considerable overhead in CPU-FPU communication (similar to a 386/387
system), nullifying most of the arithmetic speedups over the 387. That
the 486SX, 487SX, and 486DX are *not* pin-compatible seems to be purely
for marketing reasons.
To upgrade a 486SX based system, Intel also offers the OverDrive chip,
which is just the same as a 487SX with internal clock doubling. It also
goes into the motherboard's "Performance Upgrade Socket". The OverDrive
roughly doubles the performance of a 486SX/487SX based system. (For an
explanation of clock doubling, see the description of the Intel 486DX2
above.)
Inserting the 487SX effectively shuts down the 486SX in the 486SX/487SX
system, so the 486SX could be removed once the 487SX is installed. Since
the shut down is logical, not electrical, the 486SX still uses power if
used with the 487SX, although it is inoperative. As with the 486SX,
the 487SX is currently available in 20 MHz and 25 MHz versions. At 20
MHz, the 487SX has a power consumption of max. 4000 mW (3250 mW
typical). It is available in a 169-pin ceramic PGA (pin grid array).
Weitek 1167
This math coprocessor was the predecessor of the Weitek Abacus 3167. It
was actually a small printed circuit board with three chips mounted on
it. In contrast to the Weitek 3167, the 1167 did not have a square root
instruction; instead, the square root function was computed by means of
a subroutine in the Weitek transcendental function library. However, the
1167 did have a mode in which it supported denormal numbers. (The Weitek
3167 and 4167 only implement the 'fast' mode, in which denormals are not
supported.) Overall performance of the 1167 is slightly less than that
of the Weitek 3167.
Weitek 3167
The 3167 was introduced by Weitek in 1989 and provided the fastest
floating-point performance possible on a 386 based system at that time.
The 3167 is not a real coprocessor, strictly speaking, but rather a
memory-mapped peripheral device. The architecture of the 3167 was
optimized for speed wherever possible. Besides using the faster
memory-mapped interface to the CPU (the 80x87 uses IO ports), it does not
support many of the features of the 80x87 coprocessors, allowing all of
the chip's resources to be concentrated on the fast execution of the
basic arithmetic operations. (For a more detailed description of the
Weitek 3167, see the first chapter of this document.)
In benchmark comparisons, the Weitek 3167 provided up to 2.5 times the
performance of an Intel 387DX coprocessor. For example, on a 33 MHz 3167
the Whetstone benchmark performed at 7574 kWhetstones/sec compared with
3743 kWhetstones/sec for the Intel 387DX. (Note, however, that these
are single-precision results and that the Weitek 3167's performance
would drop to about half the stated rate for double-precision, while the
value for the Intel 387DX would change very little.) In any case, before
the advent of the Intel RapidCAD, the Weitek 3167 usually outperformed
all 387-compatible coprocessors, even for double-precision operations
[63,65,69]. For typical applications, the advantage of the Weitek 3167
over the 387 clones is much smaller. In a benchmark test using
AutoDesk's 3D-Studio the Weitek 3167 performed at 123% of the Intel
387DX's performance compared with 106% for the Cyrix FasMath 83D87 and
118% for the Intel RapidCAD.
The Weitek Abacus 3167 is packaged in a 121-pin PGA that fits into an
EMC socket (provided in most 386-based systems). It does *not* fit into
the normal 68-pin PGA socket intended for a 387 coprocessor.
To get the best of both worlds, one might want to use a Weitek 3167 and
a 387 compatible coprocessor in the same system. These coprocessors can
coexist in the same system without problems; however, most 386-based
systems contain only one coprocessor socket, usually of the EMC
(extended math coprocessor) type. Thus, you can install either a 387
coprocessor or a Weitek 3167, but not both at the same time. There *are*
small daughter boards available that plug into the EMC socket and
provide two sockets, an EMC and a standard coprocessor socket.
At 25 MHz, the Weitek 3167 has a power consumption of max. 1750 mW. At
33 MHz, max. power consumption is 2250 mW.
Weitek 4167
The 4167 is a memory-mapped coprocessor that has the same architecture
as the 3167; it is designed to provide 486-based systems with the
highest floating-point performance available. It executes coprocessor
instructions at three to four times the speed of the Weitek 3167.
Although it is up to 80% faster than the Intel 486 in some benchmarks
[1,69], the performance advantage for real applications is probably more
like 10%. The introduction of the 486DX2 processors has more or less
obliterated the need for a Weitek 4167, since the DX2 CPUs provide the
same performance as the Weitek, as well as the additional features the
80x87 architecture has that the Weitek does not.
The Weitek 4167 is packaged in a 142-pin PGA package that is only
slightly smaller than the 486's package. At 25 MHz, it has a max. power
consumption of 2500 mW [32].
Finding out which coprocessor you have
If you are interested in programming techniques which allow the detection and
differentiation of the coprocessors described above, I refer you to my
COMPTEST program. COMPTEST reliably detects the type and clock frequency of
the CPU and coprocessor installed in your machine. The current version is
CTEST257.ZIP, with future versions to be called CTEST258, CTEST259 and so on.
COMPTEST can correctly identify all of the coprocessors described above, with
the exception of the Weitek chips, for which the detection mechanism is not
that reliable.
COMPTEST is in the public domain and comes with complete source code. It is
available via anonymous ftp from garbo.uwasa.fi and additional ftp sites that
mirror garbo.
Current coprocessor prices and purchasing advice
Due to mid-1992 price slashing by Cyrix (and subsequently, Intel) for 387
coprocessors, prices have dropped significantly for all 287 and 387
compatibles, with hardly any price difference between manufacturers. 387DX
compatible coprocessors typically sell for ~US$ 80 for all speeds except for
40 MHz versions, which are typically ~US$ 90. 387SX compatible coprocessors
sell for ~US$ 70 at all speeds, with the exception of the 33 MHz
versions, which are ~US$ 80. The Intel 287XL sells for ~US$ 90, while the
IIT 2C87 and Cyrix 82S87 each sell for about US$ 60. 8087s may be more
expensive, the price of an 8087-10 being ~US$ 150. I purchased the Intel
RapidCAD for US$ 300 and haven't seen it offered for a better price. I see the
Weitek Abacus 3167-33 being offered for US$ 230 and the 4167-33 being offered
for US$ 850. The Intel 486SX OverDrive is available for ~US$ 570 for the 20 MHz
version, while the Intel 486DX2-50 costs ~US$ 650. This price information
reflects the price situation as of 01-11-93; prices can be expected to drop
slightly in the near future.
Which coprocessor should you buy?
Several computer magazines have published application-level performance
comparisons for various 387 coprocessors and Weitek's Abacus 3167 and 4167
chips [1,25,68,70]. Applications tested included AutoCAD R11, RenderStar,
Quattro Pro, Lotus 1-2-3, and AutoDesk's 3D-Studio. For most tests,
performance improvements for the 387 clones over Intel's 387DX were small to
marginal, the clones running the applications no more than 5-15% faster than
the Intel 387DX. In the test of 3D-Studio, one of the few programs that
directly supports the Weitek Abacus, the Weitek 3167 improved performance by
23% over an Intel 387DX and the 4167 improved performance by 10% over the
486DX [1].
If you have a demand for high floating-point performance, you should consider
buying a full 486-based system, rather than a 386-based system with an
additional coprocessor. Consider: A 386/33 MHz motherboard currently sells for
~US$ 270; together with the coprocessor, the cost totals ~US$ 350. A 486/33 MHz
ISA motherboard sells for US$ 650. While this means that the 486 system is 85%
more expensive than the 386/387 system, it also provides 100% more integer
and floating-point performance (twice the performance), giving it better
price/performance for math-intensive applications. As prices for 486 chips
fall in the future, the price difference between these two systems should
become even smaller.
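The arithmetic behind this comparison can be checked with a few lines of
Python. This is only a sketch using the prices and the "twice the
performance" figure quoted above; the names extra_cost and perf_per_dollar
are made up for this illustration.

```python
# Prices (US$) and relative performance as quoted in the text above.
COST_386_387 = 350.0   # 386/33 motherboard plus 387 coprocessor
COST_486 = 650.0       # 486/33 ISA motherboard
PERF_386_387 = 1.0     # baseline
PERF_486 = 2.0         # "twice the performance"


def extra_cost(cost_new, cost_old):
    """Fractional price increase of the new system over the old one."""
    return cost_new / cost_old - 1.0


def perf_per_dollar(perf, cost):
    """Relative performance delivered per US$ spent."""
    return perf / cost


premium = extra_cost(COST_486, COST_386_387)
advantage = (perf_per_dollar(PERF_486, COST_486)
             / perf_per_dollar(PERF_386_387, COST_386_387))

print(f"486 price premium:           {premium:.0%}")
print(f"486 price/performance ratio: {advantage:.2f}x")
```

With these inputs the 486 board costs roughly 85% more but still delivers
slightly more performance per dollar, which is the point made above.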
If you want to push your 386-based system to its maximum floating-point
performance and can't switch to a 486, I recommend the Intel RapidCAD
chipset. It is both faster [1] and cheaper than installing a Weitek Abacus
3167 in a 386 system, which used to be the highest performing combination
before the RapidCAD was introduced.
In a similar vein, the introduction of the Intel 486DX2 clock-doubler chips
has obliterated the need for a Weitek 4167 to get maximum floating-point
performance out of a 486-based system. A 486DX2-66 performs at or above the
performance level of a 33 MHz Weitek 4167, even if the latter uses
single-precision rather than double-precision arithmetic. The 486DX2-66 is
rated by Intel at 24700 double-precision kWhetstones/sec and 3.1
double-precision Linpack MFLOPS. (Of course, these benchmarks used the
highest performance compilers available. But even with a Turbo Pascal 6.0
program, I managed to squeeze 1.6 double-precision MFLOPS out of the
486DX2-66 for the LLL benchmark [for a description of these benchmarks, see
the paragraph on benchmarks below].)
Although I haven't yet seen 486DX2-66 processors being offered to end users
for upgrade purposes, I recommend the 486DX2-66 to those who need the highest
floating-point performance and are planning to buy a new PC. The price
difference between a 33 MHz 486DX motherboard and a 486DX2-66 motherboard is
around US$ 450, well below the price for the Weitek Abacus 4167.
The benchmark programs / Coprocessor performance comparisons
The performance statistics below were put together with the help of four
widely-known numeric benchmarks and two benchmarks developed by me. Three
Pascal programs, one FORTRAN program, and two assembly language programs were
used. The assembly language programs were linked with Borland's Turbo Pascal
6.0 for library support, especially to include the coprocessor emulator of
the TP 6.0 run-time library. The Pascal programs were compiled with Turbo
Pascal 6.0, a non-optimizing compiler that produces 16-bit code. The FORTRAN
program was compiled using Microsoft's FORTRAN 5.0, an optimizing compiler
that generates 16-bit code. All programs use double-precision variables
(except PEAKFLOP and SAVAGE, which use double extended precision).
Note that the use of a highly optimizing compiler producing 32-bit code can
give much higher performance for some benchmarks. For example, Intel rates
the 33 MHz 386/387DX at 3290 kWhetstones/sec and 0.4 double-precision LINPACK
MFLOPS [28,29], and it rates the Intel 486 at 12300 kWhetstones/sec and 1.6
double-precision LINPACK MFLOPS [30]. The compilers used in these benchmarks
run by the chip's manufacturer are the ones that give the highest performance
available, and sell in the US$ 1000+ price range. Some of them may even be
experimental or prereleased versions not available to the general public. The
relative performance of one coprocessor to another can and does vary greatly
depending on the code generated by compilers. Non-optimizing compilers tend
to generate a high percentage of operations which access variables in memory,
while optimizing compilers produce code that contains many operations
involving registers. Thus it is quite possible that coprocessor A beats
coprocessor B running benchmark Z if compiled with compiler C, but B beats A
when the same benchmark is compiled using compiler D.
All benchmarks in this overview were run from floppy under a 'bare-bones'
MS-DOS 5.0 without the CONFIG.SYS and AUTOEXEC.BAT files. This ensured that
no TSR or other program unnecessarily stole computing resources from the
benchmarks.
Description of benchmarks
PEAKFLOP is the kernel of a fractal computation. It consists mainly of a
tight loop written in assembly code and fine-tuned to give maximum
performance. The whole program fits nicely into even a very small CPU cache.
All variables are held in the CPU's and coprocessor's registers, so the only
memory access is for opcode fetches. The main loop contains three
multiplications and five additions/subtractions; this ratio is fairly
typical for other floating-point intensive programs as well. Due to the
nature of this program, its MFLOPS rate is unlikely to be exceeded by any
program that calculates anything useful; thus the name PEAKFLOP. You will
find the source code for PEAKFLOP in appendix B.
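To illustrate the operation mix, here is a Python sketch of a fractal inner
loop. It assumes a Mandelbrot-style iteration (z -> z*z + c), which happens
to need three multiplications and five additions/subtractions per pass; the
text above does not say which fractal PEAKFLOP computes, so treat the
formula and the names fractal_kernel and mflops as illustrative only. The
real benchmark is the assembly program in appendix B.

```python
def fractal_kernel(cx, cy, max_iter=100):
    """Iterate z -> z*z + c and count the floating-point operations used.

    Assumed Mandelbrot-style kernel; not taken from the PEAKFLOP source.
    Returns (iterations, multiplications, additions_or_subtractions).
    """
    x = y = 0.0
    mults = adds = iterations = 0
    for _ in range(max_iter):
        x2 = x * x               # multiplication 1
        y2 = y * y               # multiplication 2
        xy = x * y               # multiplication 3
        r2 = x2 + y2             # add 1: squared magnitude for escape test
        x = (x2 - y2) + cx       # adds 2 and 3: new real part
        y = (xy + xy) + cy       # adds 4 and 5: 2*x*y formed by an addition
        mults += 3
        adds += 5
        iterations += 1
        if r2 > 4.0:             # point has escaped; stop iterating
            break
    return iterations, mults, adds


def mflops(flop_count, seconds):
    """Millions of floating-point operations per second."""
    return flop_count / seconds / 1e6


iters, mults, adds = fractal_kernel(0.0, 0.0, max_iter=50)
print(iters, mults, adds, (mults + adds) // iters)
```

Every pass costs eight floating-point operations, so an MFLOPS figure
follows directly from the pass count and the elapsed time.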
TRNSFORM multiplies an array of 8191 vectors with a 3D-transformation matrix
(a 4x4 matrix). Each vector consists of four double-precision values.
Multiplying vectors with a matrix is a typical operation in the manipulation
(e.g. rotation) of 3D objects which are made up from many vectors describing
the object. This benchmark stresses addition and multiplication as well as
memory access. For each vector, 16 multiplications and 12 additions are used,
and about 256 KB of data is accessed during the benchmark run.
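The operation counts and the data size quoted above can be reproduced with a
short Python sketch. It is not the benchmark itself (TRNSFORM is an assembly
program); the function name transform and the constants are chosen for this
illustration only.

```python
def transform(matrix, vector):
    """Multiply a 4x4 matrix by a 4-component vector, counting operations.

    Returns (result_vector, multiplications, additions).
    """
    result = []
    mults = adds = 0
    for row in matrix:
        acc = row[0] * vector[0]       # first product starts the sum
        mults += 1
        for k in range(1, 4):
            acc += row[k] * vector[k]  # one multiplication, one addition
            mults += 1
            adds += 1
        result.append(acc)
    return result, mults, adds


IDENTITY = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

VECTORS = 8191             # number of vectors in the benchmark
COMPONENTS = 4             # values per vector
BYTES_PER_DOUBLE = 8       # IEEE double precision
data_bytes = VECTORS * COMPONENTS * BYTES_PER_DOUBLE

out, mults, adds = transform(IDENTITY, [1.0, 2.0, 3.0, 4.0])
print(out, mults, adds, data_bytes)
```

Each row of the matrix contributes four multiplications and three additions,
which gives the 16 and 12 per vector stated above, and 8191 vectors of four
doubles occupy just under 256 KB.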
For the IIT 3C87, a special version of TRNSFORM was written that makes use of
the special F4X4 instruction available on that coprocessor. F4X4 does a full
multiplication of a 4x4 matrix by a 4x1 vector in a single instruction.
TRNSFORM is implemented as an optimized assembler program linked with the
Turbo Pascal 6.0 library. The full source code can be found in appendix B.
LLL is short for Lawrence Livermore Loops [21], a set of kernels taken from
real floating-point intensive programs. Some of these loops are vectorizable,
but since we don't deal with vector processors here, this doesn't matter. For
this test, LLL was adapted from the FORTRAN original [20] to Turbo Pascal
6.0. By variable overlaying (similar to FORTRAN's EQUIVALENCE statement),
memory allocation for data was reduced to 64 KB, so all data fits into a
single 64 KB segment. The older version of LLL is used here which contains 14
loops. There also exists a newer, more elaborate version consisting of 24
kernels. The kernels in LLL exercise only multiplication and addition. The
MFLOPS rate reported is the average of the MFLOPS rate of all 14 kernels.
All floating-point variables in the programs are of type DOUBLE.
Both LLL and Whetstone results (see below) are reported as returned by my
COMPTEST test program, in which they have been included as a measure of
coprocessor/FPU performance. COMPTEST has been compiled under Turbo Pascal
6.0 with all 'optimizations' on and using my own run-time library, which
gives higher performance than the one included with TP 6.0. My library is
available as TPL60N18.ZIP from garbo.uwasa.fi and ftp sites that mirror this
site.
Linpack [5] is a well known floating-point benchmark that also heavily
exercises the memory system. Linpack operates on large matrices and takes up
about 570 KB in the version used for this test. This is about the largest
program size a pure DOS system can accommodate. Linpack was originally
designed to estimate performance of BLAS, a library of FORTRAN subroutines
that handles various vector and matrix operations. Note that vendors are
free to supply optimized (e.g., assembly language) versions of BLAS. Linpack
uses two routines from BLAS which are thought to be typical of the matrix
operations used by BLAS. Both routines only use addition/subtraction and
multiplication. The FORTRAN source code for Linpack can be obtained from
the automated mail server netlib@ornl.gov. Linpack was compiled using MS
FORTRAN 5.0 in the HUGE memory model (which can handle data structures
larger than 64 KB) and with compiler switches set for maximum optimization.
All floating-point variables in the program are of the DOUBLE type. Linpack
performs the same test repeatedly. The number reported is the maximum MFLOPS
rate returned by Linpack. Linpack MFLOPS ratings for a great number of
machines are contained in [6]. This PostScript document is also available
from netlib@ornl.gov.
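The text above does not name the two BLAS routines. As an illustration of
the kind of operation involved, the following Python sketch shows DAXPY
(y := a*x + y), a standard BLAS vector routine that uses exactly one
multiplication and one addition per element. Whether it is one of the two
routines meant above is an assumption on my part, and the helper
daxpy_flops is hypothetical.

```python
def daxpy(a, x, y):
    """Return a*x + y element by element (BLAS-style DAXPY, pure Python)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


def daxpy_flops(n):
    """Floating-point operations for a length-n DAXPY: one mul, one add each."""
    return 2 * n


x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
print(daxpy(2.0, x, y), daxpy_flops(len(x)))
```

Because a routine like this streams through long vectors with very little
arithmetic per element, its speed depends as much on the memory system as on
the coprocessor, which matches the description of Linpack above.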
Whetstone [2,3,4] is a synthetic benchmark based upon statistics collected
about the use of certain control and data structures in programs written in
high level languages. Based on these statistics, it tries to mirror a
'typical' HLL program. Whetstone performance is expressed by how many
hypothetical 'whetstone' instructions are executed per second. It was
originally implemented in ALGOL. Unlike PEAKFLOP, LLL, and Linpack,
Whetstone not only uses addition and multiplication but exercises all basic
arithmetic operations as well as some transcendental functions. Whetstone
performance depends on the speed of the CPU as well as on the coprocessor,
while PEAKFLOP, LLL, and Linpack place a heavier burden on the coprocessor/FPU.
There exist both old and new versions of Whetstone. Note that results from
the two versions can differ by as much as 20% for the same test configuration.
For this test, the new version in Pascal from [3] was used. It was compiled
with Turbo Pascal 6.0 and my own library (see above) with all 'optimizations'
on. All computations are performed using the DOUBLE type.
SAVAGE tests the performance of transcendental function evaluation. It is
basically a small loop in which the sin, cos, arctan, ln, exp, and sqrt
functions are combined in a single expression. While sin, cos, arctan, and
sqrt can be evaluated directly with a single 387 coprocessor instruction
each, ln and exp need additional preprocessing for argument reduction and
result conversion. According to [14], the Savage benchmark was devised by
Bill Savage, and is distributed by: The Wohl Engine Company, Ltd., 8200 Shore
Front Parkway, Rockaway Beach, NY 11693, USA. Usually, Savage is programmed
to make 250,000 passes through the loop. Here only 10,000 passes are executed
for a total of 60,000 transcendental function evaluations. The result is
expressed in function evaluations per second. SAVAGE source code was taken
from [7] and compiled with Turbo Pascal 6.0 and my own run-time library
(see above).
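For readers who have not seen the benchmark, here is a Python sketch of the
loop in its commonly cited form, a := tan(arctan(exp(ln(sqrt(a*a))))) + 1.
The tangent is written as sin/cos to match the six functions listed above.
This is my reconstruction, not the source from [7], and the names savage and
evaluations are made up for the illustration.

```python
import math


def savage(passes=10_000):
    """Run the Savage-style loop and return the final value of a.

    With exact arithmetic, a would end up at passes + 1; the deviation
    gives a rough idea of the accuracy of the transcendental functions.
    """
    a = 1.0
    for _ in range(passes):
        t = math.atan(math.exp(math.log(math.sqrt(a * a))))
        a = math.sin(t) / math.cos(t) + 1.0   # tan(t) + 1
    return a


def evaluations(passes):
    """Six function calls per pass: sin, cos, arctan, ln, exp, sqrt."""
    return 6 * passes


result = savage(10_000)
print(result, evaluations(10_000))
```

Dividing the evaluation count by the elapsed time gives the function
evaluations per second reported in the tables.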
Benchmark results using the Intel 386DX CPU and various coprocessors
My benchmark results for 387 coprocessors, coprocessor emulators and the
Intel RapidCAD and Intel 486 CPUs, using the programs described above, on
an Intel 386DX system:
33.3 MHz          PEAKFLOP  TRNSFORM  LLL      Linpack  Whetstone  Savage
                  MFLOPS    MFLOPS    MFLOPS   MFLOPS   kWhet/sec  Func/sec
Intel 386DX WITH:
EM87 emulator     0.0070    0.0040    0.0050   0.0050          26       418  ##
Franke387 emu.    0.0307    0.0246    0.0194   0.0179         137      3335  $$
TP/MS-FORT emu    0.0263    0.0227    0.0167   0.0158         133      3160  %%
Q387 emulator     0.0920    0.0664    0.0305   0.0304         251      4796  ((
Intel 387DX       0.7647    0.6004    0.3283   0.2676        2046     43860
ULSI 83C87        1.0097    0.6609    0.3239   0.2598        2089     47431
IIT 3C87          0.8455    0.5957    0.3198   0.2646        2203     49020
IIT 3C87,4X4      0.8455    1.4334    0.3198   0.2646        2203     49020  @@
C&T 38700         0.9455    0.6907    0.3338   0.2700        2376     62565
Cyrix 387+        0.9286    0.6806    0.3293   0.2669        2435     66890
Cyrix EMC87       1.0400    0.6628    0.3352   0.2808        2540     71685  //
Intel RapidCAD    1.8572    1.5798    0.6072   0.4533        3953     72464
Intel 486DX       2.0800    1.7779    0.9387   0.6682        5143     82192
40 MHz            PEAKFLOP  TRNSFORM  LLL      Linpack  Whetstone  Savage
                  MFLOPS    MFLOPS    MFLOPS   MFLOPS   kWhet/sec  Func/sec
Intel 386DX WITH:
EM87 emulator     0.0084    0.0080    0.0060   0.0060          31       502  ##
Franke387 emu.    0.0369    0.0295    0.0233   0.0215         164      4002  $$
TP/MS-FORT emu    0.0316    0.0273    0.0200   0.0190         160      3794  %%
Q387 emulator     0.1103    0.0798    0.0365   0.0364         301      5758  ((
Intel 387DX       0.9204    0.7212    0.3932   0.3211        2428     52677
ULSI 83C87        1.2093    0.7936    0.3890   0.3120        2528     56926
IIT 3C87          1.0196    0.7145    0.3834   0.3179        2663     58766
IIT 3C87,4X4      1.0196    1.7244    0.3834   0.3179        2663     58766  @@
C&T 38700         1.0722    0.7908    0.4007   0.3222        2837     74906
Cyrix 387+        1.1305    0.8162    0.3945   0.3208        2946     80322
Cyrix EMC87       1.2381    0.7963    0.4025   0.3324        3061     86083  //
Intel RapidCAD    2.2128    1.8931    0.7377   0.5432        4810     86957
Intel 486DX       2.4762    2.1335    1.1110   0.8204        6195     98522
Benchmark results using the Cyrix 486DLC CPU and various coprocessors
The Cyrix 486DLC is the latest entry into the market of 386DX replacement
processors. It features an Intel 486SX-compatible instruction set, a 1 KB on-
chip cache, and a 16x16 bit hardware multiplier. The RISC-like execution unit
of the 486DLC executes many instructions in a single clock cycle. The
hardware multiplier multiplies 16-bit quantities in 3 clock cycles, as
compared to 12-25 cycles on a standard Intel 386DX. This is especially useful
in address calculations (code from non-optimizing compilers may contain many
MUL instructions for array accesses) and for software floating-point
arithmetic. The 1 KB cache helps the 486DLC to overcome some of the
limitations of the 386 bus interface, and although its hit rate averages only
about 65% under normal program conditions, a 5-15% overall performance
increase can usually be seen for both integer and floating-point-intensive
applications when it is enabled.
The 486DLC's internal cache is a unified data/instruction write-through type,
and can be configured as either a direct mapped or a 2-way set associative
cache. For compatibility reasons, the cache is disabled after a processor
reset and must be enabled with the help of a small routine provided by
Cyrix. Cyrix has also defined some additional cache control signals for some
of the 486DLC pins, intended to improve communication between the on-chip
cache and an external cache. Current 386 systems ignore these signals, since
they are not defined for the standard Intel 386DX. However, future systems
designed with the 486DLC in mind may take advantage of them for increased
performance.
In existing 386 systems, DMA transfers (e.g., by a SCSI controller or a
soundcard) may cause the 486DLC's entire on-chip cache to be flushed, since
no other means exist to enforce consistency between the cache contents and
main memory. This reduces the performance of the 486DLC in these cases. The
486DLC on-chip cache does, however, allow specification of up to four non-
cacheable regions, which is particularly useful if your system has memory
mapped peripherals (e.g., a Weitek coprocessor).
Although I successfully ran my test programs on the Cyrix chip with all
coprocessors, not all of them work well with the 486DLC in all circumstances.
The IIT 3C87, the Cyrix 83D87 (chips manufactured prior to November 1991),
and the Cyrix EMC87 should not be used with the 486DLC, since they may cause
the computer to lock up if the FSAVE and FRSTOR instructions are used. (These
instructions are typically used in protected-mode multitasking environments
to save and restore the coprocessor state for each task. Note that Microsoft
Windows also fits this description.) According to Cyrix, this problem occurs
only with first revision 486DLCs (sample chips) and is fixed on newer ones.
To be on the safe side, I recommend using the Cyrix 387+ with the 486DLC,
both for assured compatibility and for best performance. Note that 387+ is a
'Europe only' name and that this chip is called 83D87 elsewhere, just like
the old version. You need to get an 83D87 produced after about October 1991
to guarantee that it works correctly with any 486DLC; the same caveat applies
to the Cyrix 486SLC and the Cyrix 83S87. If you already have a Cyrix
coprocessor, use my COMPTEST program to find out whether you have a 'new' or
'old' coprocessor. COMPTEST is available as CTEST257.ZIP via anonymous ftp
from garbo.uwasa.fi (in the /systest directory) and other ftp servers that
mirror garbo.
The Cyrix 486DLC is currently the 386 'clone' with the highest integer
performance. With the internal cache enabled, integer performance of the
486DLC can be up to 80% higher than that of an Intel 386DX at the same clock
frequency, with the average speed gain for most applications being about 35%.
Floating-point applications are typically accelerated by about 15%-30% when
using a Cyrix 486DLC (with its cache enabled) instead of the Intel 386DX.
33.3 MHz          PEAKFLOP  TRNSFORM  LLL      Linpack  Whetstone  Savage
                  MFLOPS    MFLOPS    MFLOPS   MFLOPS   kWhet/sec  Func/sec
Cyrix 486DLC
(cache off) WITH:
EM87 emulator     0.0089    0.0082    0.0062   0.0063          31       472  ##
Franke387 emu.    0.0402    0.0324    0.0258   0.0240         184      4807  $$
TP/MS-FORT emu    0.0346    0.0288    0.0206   0.0212         173      4401  %%
Q387 emulator     0.1214    0.0810    0.0368   0.0382         320      6020  ((
Intel 387DX       0.8455    0.6552    0.3659   0.3033        2249     48780
ULSI 83C87        1.1818    0.7543    0.3752   0.3026        2381     53476
IIT 3C87          0.9541    0.6609    0.3653   0.3036        2476     55814
IIT 3C87,4X4      0.9541    1.4988    0.3653   0.3036        2476     55814  @@
C&T 38700         1.1183    0.7644    0.3796   0.3087        2703     73350
Cyrix 387+        1.1305    0.7445    0.3727   0.3060        2731     81967
Cyrix EMC87       1.2236    0.7593    0.3823   0.3144        2908     88889  //
Intel RapidCAD    1.8572    1.5798    0.6072   0.4533        3953     72464
Intel 486DX       2.0800    1.7779    0.9387   0.6682        5143     82192
40.0 MHz          PEAKFLOP  TRNSFORM  LLL      Linpack  Whetstone  Savage
                  MFLOPS    MFLOPS    MFLOPS   MFLOPS   kWhet/sec  Func/sec
Cyrix 486DLC
(cache off) WITH:
EM87 emulator     0.0107    0.0098    0.0075   0.0075          37       567  ##
Franke387 emu.    0.0488    0.0392    0.0311   0.0288         223      5808  $$
TP/MS-FORT emu    0.0416    0.0345    0.0246   0.0253         208      5284  %%
Q387 emulator     0.1463    0.0973    0.0442   0.0458         384      7237  ((
Intel 387DX       1.0196    0.7880    0.4375   0.3644        2712     58479
ULSI 83C87        1.4247    0.9064    0.4506   0.3630        2868     64171
IIT 3C87          1.1556    0.7963    0.4399   0.3611        2988     66964
IIT 3C87,4X4      1.1556    1.7916    0.4399   0.3611        2988     66964  @@
C&T 38700         1.3333    0.9210    0.4548   0.3708        3254     88106
Cyrix 387+        1.3507    0.8958    0.4477   0.3754        3297     98361
Cyrix EMC87       1.4648    0.9136    0.4548   0.3773        3505    106572  //
Intel RapidCAD    2.2128    1.8931    0.7377   0.5432        4810     86957
Intel 486DX       2.4762    2.1335    1.1110   0.8204        6195     98522
33.3 MHz          PEAKFLOP  TRNSFORM  LLL      Linpack  Whetstone  Savage
                  MFLOPS    MFLOPS    MFLOPS   MFLOPS   kWhet/sec  Func/sec
Cyrix 486DLC
(cache on) WITH:
EM87 emulator     0.0099    0.0089    0.0068   0.0069          35       550  ##
Franke387 emu.    0.0462    0.0362    0.0288   0.0265         205      5445  $$
TP/MS-FORT emu    0.0410    0.0330    0.0234   0.0241         198      5339  %%
Q387 emulator     0.1344    0.0902    0.0389   0.0403         339      6241  ((
Intel 387DX       0.8525    0.6552    0.3941   0.3279        2332     49834
ULSI 83C87        1.2093    0.7543    0.4068   0.3270        2478     57197
IIT 3C87          0.9720    0.6609    0.3959   0.3295        2579     57252
IIT 3C87,4X4      0.9720    1.5087    0.3959   0.3295        2579     57252  @@
C&T 38700         1.1305    0.7644    0.4126   0.3343        2839     75949
Cyrix 387+        1.1429    0.7445    0.4023   0.3310        2866     85349
Cyrix EMC87       1.2381    0.7593    0.4150   0.3412        3051     93897  //
Intel RapidCAD    1.8572    1.5798    0.6072   0.4533        3953     72464
Intel 486DX       2.0800    1.7779    0.9387   0.6682        5143     82192
40.0 MHz          PEAKFLOP  TRNSFORM  LLL      Linpack  Whetstone  Savage
                  MFLOPS    MFLOPS    MFLOPS   MFLOPS   kWhet/sec  Func/sec
Cyrix 486DLC
(cache on) WITH:
EM87 emulator     0.0118    0.0107    0.0082   0.0082          42       659  ##
Franke387 emu.    0.0565    0.0438    0.0350   0.0313         248      6585  $$
TP/MS-FORT emu    0.0491    0.0395    0.0279   0.0296         238      6408  %%
Q387 emulator     0.1610    0.1084    0.0470   0.0484         407      7509  ((
Intel 387DX       1.0297    0.7880    0.4748   0.3937        2801     59821
ULSI 83C87        1.4445    0.9028    0.4891   0.3926        2976     65789
IIT 3C87          1.1686    0.7963    0.4734   0.3916        3096     68729
IIT 3C87,4X4      1.1686    1.8057    0.4734   0.3916        3096     68729  @@
C&T 38700         1.3685    0.9173    0.4958   0.4012        3401     91185
Cyrix 387+        1.3867    0.8958    0.4887   0.3962        3448    102564
Cyrix EMC87       1.4857    0.9100    0.4959   0.4091        3676    112360  //
Intel RapidCAD    2.2128    1.8931    0.7377   0.5432        4810     86957
Intel 486DX       2.4762    2.1335    1.1110   0.8204        6195     98522
Benchmark results using the C&T 38600DX CPU and various coprocessors
The Chips&Technologies 38600DX CPU is marketed as a 100% compatible
replacement for the Intel 386DX CPU. Unlike AMD's Am386, which uses microcode
that is identical to the Intel 386DX's, the C&T 38600DX uses microcode
developed independently by C&T using "clean-room" techniques. C&T even
included the 386DX's "undocumented" LOADALL386 instruction in the
instruction set to provide full compatibility with the 386DX. In my tests,
however, I observed that the 38600DX has severe problems with the CPU-
coprocessor communication, which causes the floating-point performance to
drop below that of the Intel 386DX/Intel 387DX for most programs. This
problem exists with all available 387-compatible coprocessors (ULSI 83C87,
IIT 3C87, Cyrix EMC87, Cyrix 83D87, Cyrix 387+, C&T 38700, Intel 387DX). A
net.acquaintance also did tests with the 38600DX and arrived at similar
results. He contacted C&T and they said that they were aware of the problem.
Some instructions execute faster on the C&T 38600DX than on the 386DX, giving
an average speedup of 5-10% for integer applications. C&T also produces a
38605DX CPU that includes a 512 byte instruction cache and provides a further
performance increase. However, the 38605DX needs a bigger socket (144-pin
PGA) and is therefore *not* pin-compatible with the 386DX. Tests using the
38600DX were run at 33.3 MHz, as a 40 MHz version was not available as of 09-
17-92 and running the 33 MHz chip version at 40 MHz locked up the machine
frequently. Unfortunately, tests using the Intel 387DX consistently locked up
in the TRNSFORM benchmark when run at 33.3 MHz. It ran fine at 20 MHz, and
the results were scaled to show expected performance at 33.3 MHz.
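The scaling mentioned above presumably assumes that throughput grows
linearly with clock frequency. A minimal Python sketch of that assumption
follows; the function name scale_to_clock and the sample value are
hypothetical and are not the figures actually measured at 20 MHz.

```python
def scale_to_clock(result, measured_mhz, target_mhz):
    """Scale a throughput figure linearly from one clock rate to another.

    Assumes performance is proportional to clock frequency, which ignores
    memory wait states and other effects that do not scale with the clock.
    """
    if measured_mhz <= 0 or target_mhz <= 0:
        raise ValueError("clock frequencies must be positive")
    return result * target_mhz / measured_mhz


# Hypothetical example: 0.3000 MFLOPS measured at 20 MHz, scaled to 33.3 MHz.
print(round(scale_to_clock(0.3000, 20.0, 33.3), 4))
```

The Intel 386DX tables above suggest the assumption is reasonable for this
kind of benchmark: the Intel 387DX PEAKFLOP result of 0.7647 MFLOPS at
33.3 MHz scales to about 0.919 MFLOPS at 40 MHz, within about 1% of the
0.9204 MFLOPS actually measured.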
33.3 MHz          PEAKFLOP  TRNSFORM  LLL      Linpack  Whetstone  Savage
                  MFLOPS    MFLOPS    MFLOPS   MFLOPS   kWhet/sec  Func/sec
C&T 38600DX WITH:
Intel 387DX       0.7376    0.5620    0.3337   0.2636        2066     45489
ULSI 83C87        0.5226    0.4690    0.3236   0.2654        2087     43228
IIT 3C87          0.7879    0.5762    0.3397   0.2674        2263     51195
IIT 3C87,4X4      0.7879    0.6181    0.3397   0.2674        2263     51195  @@
C&T 38700         0.5977    0.5572    0.3463   0.2681        2338     63966
Cyrix 387+        0.5896    0.5508    0.3438   0.2673        2375     66741
Intel RapidCAD    1.8572    1.5798    0.6072   0.4533        3953     72464
Intel 486         2.0800    1.7779    0.9387   0.6682        5143     82192
For comparison:
                  PEAKFLOP  TRNSFORM  LLL      Linpack  Whetstone  Savage
                  MFLOPS    MFLOPS    MFLOPS   MFLOPS   kWhet/sec  Func/sec
i486DX2-66        4.1601    3.4227    1.6531   1.3010       10655    163934
i486DX2-50        3.0589    2.6665    1.2537   0.9744        7962    123203
i387, 20 MHz      0.2253    0.3271    0.1434   0.1171         952     21739  ++
i387DX, 20 MHz    0.3567    0.4444    0.1484   0.1161        1034     24155  &&
i80287, 5 MHz     0.0281    0.0310    0.0242   0.0222         150      3261  !!
i8087,9.54 MHz    0.0636    0.0705    0.0321   0.0219         234      5782  **
Benchmark notes and footnotes
Hardware configuration for test of 387 coprocessors with C&T 38600DX, Intel
386DX, Cyrix 486DLC, and Intel RapidCAD CPUs:
System A: Motherboard with Forex chip set, 128 KB CPU Cache, 8 MB RAM
Hardware configuration for test of 486 FPU (extra fan for 40 MHz operation):
System B: Motherboard with SIS chip set, 256 KB CPU Cache, 8 MB RAM
## EM87 V1.2 by Ron Kimball is a public domain coprocessor emulator that
loads as a TSR. On 80286, 80386, or 486SX systems with no coprocessor, the
CPU raises an INT 7 trap whenever it encounters a coprocessor instruction;
EM87 hooks this trap to catch such instructions and emulate them. Whetstone
and Savage benchmarks for this test were compiled with the original TP 6.0
library, as EM87 chokes on the 387-specific FSIN and FCOS instructions used
in my own library if a 387 is detected. Evidently EM87 identifies itself as
a 387, but it has no support for 387-specific instructions.
$$ Franke387 is a commercial 387 emulator that is also available in a
shareware version. For this test, shareware version V2.4 was used.
Franke387, unlike many other emulators, supports all 387 instructions.
It is loaded as a device driver and uses INT 7 to trap coprocessor
instructions.
1)