Very long instruction word |
A very long instruction word (VLIW) CPU architecture
implements a form of instruction-level
parallelism. Like a superscalar architecture, it uses several
execution units of the same type (e.g. two multipliers), which enables
the CPU to execute several operations at the same time (e.g. two multiplications).
Since the very earliest days of computer architecture, some CPUs have included several arithmetic logic units (ALUs)
to run in parallel. Superscalar CPUs use hardware to decide which operations can run in parallel. VLIW CPUs use software (the
compiler) to decide which operations can run in parallel.
For instance, the CPU might be able to perform two multiplications at the same time. However, the second multiplication may
depend on the result of the first. If so, the second of the two units "stalls" while it waits for the first one to finish. In a conventional
CPU, such stalling is implemented in hardware. In a VLIW, the compiler predetermines the schedule of operations: while one
multiplier is working on the first result, the compiler has scheduled a NOP (no operation) for the other
multiplier until the result of the first multiply is ready. This reduces the hardware complexity substantially.
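As a rough illustration of this compile-time scheduling, the following Python sketch packs operations into issue slots for a hypothetical machine with two multipliers and a two-cycle multiply latency, writing an explicit NOP wherever no operation is ready. The names `Op` and `schedule`, the latency, and the greedy policy are assumptions made for the example and do not describe any particular compiler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    dest: str          # name of the value this operation produces
    srcs: tuple = ()   # names of the values it reads


def schedule(ops, units=2, latency=2):
    """Greedy scheduler: returns one list of `units` slots per cycle.

    Assumes every value is written once and dependencies are acyclic.
    """
    produced = {op.dest for op in ops}   # values computed by the program itself
    ready_at = {}                        # value name -> first cycle it can be read
    pending = list(ops)
    cycles = []
    while pending:
        cycle = len(cycles)
        issued = []
        for op in pending:
            if len(issued) == units:
                break
            # Program inputs are always ready; computed values only after `latency` cycles.
            if all(s not in produced or ready_at.get(s, float("inf")) <= cycle
                   for s in op.srcs):
                issued.append(op)
        for op in issued:
            pending.remove(op)
            ready_at[op.dest] = cycle + latency
        # Unused slots are filled with explicit NOPs, as the compiler would emit them.
        cycles.append([op.dest for op in issued] + ["nop"] * (units - len(issued)))
    return cycles


ops = [Op("p", ("a", "b")), Op("q", ("p", "c")), Op("r", ("d", "e"))]
for slots in schedule(ops):
    print(slots)
```

Here `q` needs the result `p`, so the schedule issues `p` and `r` together, then a cycle of NOPs while the multiply completes, then `q` alongside a NOP. The hardware never has to detect the dependency because the schedule already accounts for it.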
A similar problem occurs when the result of such an instruction is used as input for a branch. Most modern CPUs "guess" which
branch will be taken even before the calculation is complete, so that they can load up the instructions for the branch, or (in
some architectures) even start to compute them
speculatively. If the CPU guesses wrong, all of these instructions and their context need to be "flushed" and the correct
ones loaded, which is time-consuming.
This has led to increasingly complex branch-prediction hardware that attempts to guess right, and the simplicity of the original RISC designs has been eroded.
In a VLIW, the compiler uses heuristics or profile information to guess the direction of a branch. This allows it to move and
preschedule operations speculatively before the branch is taken, favoring the most likely path it expects through the branch. If
the branch goes the unexpected way, the compiler has already generated compensation code to discard speculative results in order
to preserve program semantics.
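The idea can be shown at source level with a toy example. In the Python sketch below, `original` computes a multiply after a branch, while `speculated` hoists the multiply above the branch into a fresh temporary and discards it on the unlikely path. Both function names and the choice of which path is "rare" are hypothetical; a real VLIW compiler performs this transformation on machine operations, not on source code.

```python
def original(a, b, take_rare_path):
    r1 = a
    if take_rare_path:       # branch the compiler guesses is rarely taken
        return r1            # the rare path needs the old value of r1
    r1 = r1 * b              # the multiply sits after the branch
    return r1 + 1


def speculated(a, b, take_rare_path):
    r1 = a
    tmp = r1 * b             # multiply hoisted above the branch (speculative)
    if take_rare_path:
        return r1            # compensation: tmp is discarded, r1 is untouched
    r1 = tmp                 # the likely path commits the speculative result
    return r1 + 1


# Both versions agree on every input, so program semantics are preserved.
same = all(
    original(a, b, f) == speculated(a, b, f)
    for a in range(-3, 4) for b in range(-3, 4) for f in (False, True)
)
print(same)
```

Writing the speculative result to a temporary rather than to `r1` is what makes the discard trivial here. When a speculative operation can fault or overwrite a live value, the compensation code has more work to do.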
The term VLIW and the VLIW architecture concept were invented by Prof. Josh Fisher in his research group at Yale University in the early 1980s. He originally developed Trace Scheduling, the compilation
technique used for VLIW, when he was a graduate student at New York University. Prior to VLIW, the notion of prescheduling functional units and instruction-level
parallelism in software was well established in the practice of developing horizontal microcode. Fisher's innovations centered on
developing a compiler that could target horizontal microcode from programs written in an ordinary programming language. He realized
that to get good performance and to target a wide-issue machine, it would be necessary to find parallelism beyond that
generally found within basic blocks. He developed region scheduling techniques to identify such parallelism. Trace Scheduling is one such
technique: it schedules the most likely path of basic blocks first, inserts compensation code to deal with speculative
motions, schedules the second most likely trace, and so on until the schedule is complete.
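The trace-selection step described above can be sketched in a few lines of Python. This is a simplified illustration, not Fisher's algorithm as published: it assumes per-block execution counts from a profile, grows each trace forward only, and omits the scheduling and compensation-code steps. The function `pick_traces` and the example control-flow graph are hypothetical.

```python
def pick_traces(successors, counts):
    """Split a control-flow graph into traces, most frequently executed first.

    successors: block name -> list of successor block names
    counts:     block name -> profiled execution count (assumed to be given)
    """
    unscheduled = set(counts)
    traces = []
    while unscheduled:
        # Seed the next trace with the hottest block not yet placed in a trace.
        seed = max(sorted(unscheduled), key=lambda b: counts[b])
        trace = [seed]
        unscheduled.remove(seed)
        block = seed
        while True:
            candidates = [s for s in successors.get(block, []) if s in unscheduled]
            if not candidates:
                break
            # Follow the most frequently executed successor.
            block = max(candidates, key=lambda b: counts[b])
            trace.append(block)
            unscheduled.remove(block)
        traces.append(trace)
    return traces


# A diamond-shaped graph: A branches to B (likely) or C (unlikely); both rejoin at D.
successors = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
counts = {"A": 100, "B": 90, "C": 10, "D": 100}
print(pick_traces(successors, counts))
```

The likely path A, B, D becomes the first trace and is scheduled as if it were one long basic block. The rarely executed block C forms its own trace and is scheduled afterwards.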
Fisher's second innovation was the notion that the target CPU architecture should be designed to be a reasonable target
for a compiler: the compiler and the architecture for VLIW must be co-designed. This was partly inspired by the difficulty Fisher
observed at Yale of compiling for architectures like the Floating Point Systems FPS164, which had a complex instruction set architecture that
separated instruction initiation from the instructions that saved the result, leading to the need for very complicated
scheduling algorithms. Fisher developed a set of principles characterizing a proper VLIW, such as self-draining pipelines, wide
multi-port register files, and memory architectures. These principles made it easier for compilers to generate fast code.
The first VLIW compiler was described in a Ph.D. thesis by John Ellis, supervised by Fisher. John Ruttenberg also developed
certain important algorithms for scheduling.
Fisher left Yale in 1984 to found a startup company, Multiflow, along with co-founders John O'Donnell and John Ruttenberg.
Multiflow produced the TRACE series of VLIW minisupercomputers, shipping its first machines in 1987. Multiflow's VLIW could
issue 28 operations in parallel per instruction. Multiflow failed as a business in 1990. The reasons for any business failure
are complex; part of the challenge faced by Multiflow was timing with respect to hardware implementation technology. Multiflow
implemented its VLIW in an MSI/LSI/VLSI mix packaged in cabinets, a technology that fell out of favor when it became possible to
integrate all of the components of a processor (excluding memory) on a single chip. Multiflow was too early to catch the
following wave, when chip technology made multiple-issue CPUs possible. The major semiconductor companies recognized the value of
Multiflow technology in this context, and the compiler and architecture were subsequently licensed to most of these
companies.
There are instances of the Multiflow Trace machines in the computer museum.
One of the licensees of the technology is Hewlett-Packard, which
Fisher joined after Multiflow's failure. In the 1990s, Hewlett-Packard researched VLIW designs as a side effect of ongoing work on its PA-RISC processor family. The researchers found that the CPU could be greatly simplified by removing the complex
dispatch logic from the CPU and placing it into the compiler. Today's compilers are much more complex than those from the 1980s,
so this added complexity in the compiler is considered to be a small cost.
VLIW CPUs are typically built from RISC-like execution units, usually four to eight of them. After compiling the program normally, the VLIW
compiler reorders the code into groups of operations that have no dependencies on one another. These are then sliced into sets of four or more operations (one for
each unit of the CPU) and packaged together into one larger instruction, with additional information indicating which
operation should run on which unit. The result is a single, much longer instruction word (hence the term "very long").
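To make the packaging step concrete, the sketch below packs up to four 32-bit operation encodings into one 128-bit instruction word, using the slot position to indicate which unit runs each operation. The slot count, field width, and zero NOP encoding are assumptions chosen for the example and are not taken from any real VLIW instruction set.

```python
SLOTS = 4        # hypothetical machine with four execution units
OP_BITS = 32     # each unit receives one 32-bit operation per instruction
NOP = 0          # assumed encoding of "do nothing"


def pack_bundle(ops):
    """Pack {unit index: 32-bit operation} into one 128-bit instruction word."""
    if any(unit not in range(SLOTS) for unit in ops):
        raise ValueError("unit index out of range")
    word = 0
    for unit in range(SLOTS):
        op = ops.get(unit, NOP)          # units with no work get a NOP
        if not 0 <= op < (1 << OP_BITS):
            raise ValueError("operation does not fit in its slot")
        word |= op << (unit * OP_BITS)   # the slot position selects the unit
    return word


def unpack_bundle(word):
    """Split an instruction word back into one operation per unit."""
    mask = (1 << OP_BITS) - 1
    return [(word >> (unit * OP_BITS)) & mask for unit in range(SLOTS)]


bundle = pack_bundle({0: 0x11, 2: 0x22})
print(hex(bundle))
print([hex(op) for op in unpack_bundle(bundle)])
```

Because every unit has a fixed field in the word, the hardware can route each field directly to its unit without examining the other fields. The cost is that idle units still occupy space in the instruction as NOPs.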
Philips' TriMedia processor and Intel's Itanium (IA-64) processor
are examples of VLIW CPUs.
Some people felt (though there is no general agreement) that an early problem with VLIW processors was that they did not scale
well to different price points. Both CISC and RISC
machines can be implemented in many ways to save varying amounts of money (indeed, most CISC processors are now implemented as
RISC processors with a hardware instruction-set-translation front end). A VLIW machine was perceived to have fewer options.
Transmeta addresses this issue by including a binary-to-binary software
compiler layer (termed Code Morphing) in its Crusoe implementation of the x86 architecture. This mechanism is advertised to recompile, optimize, and translate x86 opcodes at runtime
into the CPU's internal machine code. Thus, the Transmeta chip is internally a VLIW processor, effectively decoupled
from the x86 CISC instruction set that it executes.
As the number of transistors on a chip has grown, the perceived disadvantages of the VLIW have diminished in importance. The
VLIW architecture is growing in popularity, particularly in the embedded market, where it is possible to customize a processor
for an application in an embedded system-on-a-chip (SoC). Embedded VLIW products are available from several vendors, including Fujitsu and
STMicroelectronics. The Texas Instruments DSP line has evolved in its C6 line to look more like a VLIW, in contrast to the earlier
C5 line.
|