Chips Work, Stock is Moving, Mood is Upbeat
MathStar's new field-programmable object arrays
Product Update from MathStar (Digital MediaCom)
Digital MediaCom Merger
Test & Measurement World article showing that the MathStar chip works and meets customer needs
Computational vision: Testing MathStar’s field-programmable object arrays
Test is a key part of moving field-programmable object arrays from design into manufacturing.
Rick Nelson, Chief Editor -- Test & Measurement World, 2/1/2007
HILLSBORO, OR—Traditional approaches for extracting computational power from silicon are insufficient for meeting the requirements of machine-vision, professional-video, medical-imaging, radar-processing, and other demanding applications. That's the view of the founders of MathStar, who envisioned a silicon computational device that could provide the performance of ASICs at the development costs of FPGAs. The device they invented is the field-programmable object array (FPOA)—a high-performance, reprogrammable integrated circuit based on the company's proprietary silicon object technology, which in its second-generation Arrix family of devices can process logic functions at a clock rate up to 1 GHz.
Underlying the FPOA's development was the Minneapolis-based founders' extensive knowledge of how to efficiently implement in silicon the massively parallel processing required to quickly execute the complex algorithms required in machine-vision and other demanding applications. MathStar founder and president Douglas Pihl said the concept arose when he was working with a mathematician on DARPA-funded projects related to making supercomputing platforms sufficiently fast to serve in very high-performance computing radar detectors. “It made sense,” said Pihl, “to start a company that would build chips to serve these applications.”
MathStar founder and president Douglas Pihl said the company moved its headquarters to Hillsboro, OR, to tap the area’s wealth of semiconductor production talent. Courtesy of MathStar.
“We didn't believe that FPGAs were going to be able to keep up with the performance demands. We originally thought we would build ASICs. But we saw that ASIC development costs were at half a million dollars and rising, and we didn't believe the algorithms we planned to implement would serve enough market segments to make an ASIC implementation profitable.”
Pihl said he did believe, though, that “the market would be receptive to a product that filled the gap between FPGAs and ASICs—one embodying a programmable architecture coupled with gigahertz-rate performance. A lot of people were saying they couldn't afford to spend a half million dollars to make an ASIC. A lot of people have adopted FPGAs, which have been very useful, but we saw that technology plateauing. People were finding that their application would require two or three or four FPGAs—and that's just too bulky and cumbersome. So, we thought the market would be very receptive to a new technology.”
He added, “We also felt our new device had to be programmable, or we would have the same development cost problem facing ASICs. We worked a long time on developing an architecture thoroughly focused on performance and programmability—maintaining our goal of getting it to run at a gigahertz and still be a very programmable architecture.”
Figure 1. a) In an FPOA, a periphery surrounds a core consisting of arithmetic logic units, multiply/accumulators, and register files. The core (inside the red line) operates at 1 GHz and requires BIST; other components operate at slower speeds and are amenable to ATPG techniques.
Unlike FPGAs, FPOAs are not user-programmable at the gate level but rather at a higher, silicon object level. Object types now include arithmetic logic units (ALUs), register files (RFs), and multiply/accumulate units (MACs), each of which is programmable (Figure 1a). The objects themselves are arranged in an internal grid pattern, surrounded by functions such as memory and I/O. The object grid is overlaid with a patented, high-speed interconnect system that a customer programs to serve unique applications.
The FPOA architecture has benefits in addition to performance. Explained Pihl, “Some of our very early customers were in the mil/aerospace area and were always looking for highest performance. Several years ago, we started talking with Honeywell's space electronics group. They had looked very seriously at FPGAs, but they wanted to make them rad hard. Because of massive amounts of SRAM in an FPGA, that's very difficult. They would have had to use triple-mode redundancy to get around the soft-failure problem. Because of the space and weight and power-supply limitations, it just wasn't practical. Our device is much more attractive in this respect because it has only 10 to 15% of the SRAM in FPGAs. Ultimately, we expect them to port our device over to their rad-hard facility in Minneapolis.”
Tim Teckman, engineering VP at MathStar.
Engineering VP Tim Teckman noted that FPOAs require less SRAM because the hard-wired silicon objects don't require gate-level programmability. “Because the functions that we deliver in objects are higher-level functions, there are fewer programmable states. An ALU, for example, has 30-some instructions and a number of states that we can program with a few hundred bits. If you are going to build an ALU out of CLBs and LUTs, you are going to have literally thousands or maybe millions of bits of configuration information. Because we have already done the detailed design and optimization, the end user doesn't have to program those functions.”
Added Pihl, “The same applies to the interconnect structure. With an FPGA, you are making connections all over an array, and each path includes a whole bunch of pass transistors, each of which takes a configuration bit. With our chips, you need just a few bits to specify the destination and operation, and off it goes. Also, we use the same basic PROM-loader or JTAG-loader mechanism as do FPGAs, but since we have fewer configuration bits, configuration time is much shorter.”
You might expect that deep-pocketed military and aerospace prime contractors would opt for rad-hard ASICs, but Pihl explained that even for the government, ASICs are becoming too expensive. And a bigger problem, said Pihl, is that invariably a development team will get halfway through an ASIC design, and system designers and mathematicians will say, “wait, we improved the algorithm—do it this way,” resulting in huge cost and schedule delays. “They want programmable solutions,” said Pihl. “They even talk about reprogramming satellites in space.”
MathStar developed its original technology in Minneapolis and maintains a design team there. But Pihl believed that to grow the company and move the product from prototype to production, it would be necessary to move to an area offering a wealth of semiconductor talent. “A lot of ASIC designers know how to work with standard libraries, but finding people that really understood silicon—especially at the transistor level—was tough. After we got through the initial development and produced our prototype chip, I felt that if this was really going to work, we had to move to a different area. It's one thing to build a prototype chip that works, but quite another to make a million.”
So in early 2005, the company established its Hillsboro, OR, headquarters and brought onboard, among others, engineering VP Teckman, COO Dan Sweeney, and marketing VP Sean Riley, all of whom had worked at Intel. Teckman's job was to reorganize the engineering team to be more process oriented with a much heavier emphasis on test and simulation and timing extraction. Having shipped its first prototypes in April 2005, the company in July 2006 received its first 1-GHz production units from foundry TSMC.
Teckman said that the parts that went into production in the fourth quarter of 2006 use a 130-nm process. Added Pihl, “When we started, 90 nm was very early edge. We heard horror stories, and we said we didn't want to take the process risk along with the architecture risk. Now, with Tim, we have more process knowledge, but we still feel we can get the performance through the architecture, so we can afford to hang back a process step or so.”
Dick Reohr, senior architect, evaluates FPOA prototypes in MathStar’s lab.
Added Teckman, “We are doing some interesting things with the process to maximize the performance we get out of it. We've started work in the 90-nm process space, but we prefer to stay off the bleeding edge of process technology as a smaller company, because we are able to hold back and still have 2X or so clock-rate performance improvement.
“That said, we don't think our technology is in any way tied to any particular process. We can use the same sort of architecture and design techniques on the processes as they mature. But there is a huge cost in terms of time and resources and dollars associated with being on the leading edge, and we don't think we need to be there. We are delivering the performance without it.”
BIST supports gigahertz performance
The FPOAs' built-in self-test (BIST) implementation helps MathStar maximize performance while providing 98% fault coverage for both stuck-at and speed-related faults, according to Dick Reohr, senior architect. He described the FPOA structure as a prelude to explaining how test works: “The object array is a matrix of symmetrical objects that connect together by abutment, so we can easily change the size and shape and tiling pattern of the array, which operates in sync with a common 1-GHz clock. Surrounding the array is a ring of circuitry we use for BIST.”
Figure 1. b) MISRs and LFSRs communicate with objects in a rectangle under test over party lines.
That circuitry (Figure 1b, top inset) consists primarily of linear-feedback shift-register (LFSR) pseudorandom number generators, which provide stimulus traffic into the object array, and multiple-input signature registers (MISRs), which record and compress results.
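The LFSR-plus-MISR arrangement can be illustrated with a small software model. The following Python sketch is not MathStar's circuit: the 16-bit width, the tap positions (a textbook maximal-length polynomial), the seed, and the toy "logic under test" are all assumptions chosen for illustration. It shows the general idea only: an LFSR supplies pseudorandom stimulus, a MISR folds every response into one compact signature, and a fault shows up as a signature that differs from the known-good one.

```python
SEED = 0xACE1  # arbitrary nonzero seed (assumption, not from the article)


def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR by one clock.

    Taps implement x^16 + x^14 + x^13 + x^11 + 1, a common textbook
    maximal-length polynomial; MathStar's actual polynomial is not published
    in the article.
    """
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def misr16_step(signature: int, response: int) -> int:
    """Advance a 16-bit MISR: shift with feedback, then XOR in the response."""
    return lfsr16_step(signature) ^ (response & 0xFFFF)


def good_logic(x: int) -> int:
    """Hypothetical fault-free logic under test (a stand-in for an object)."""
    return (x + (x >> 4)) & 0xFFFF


def faulty_logic(x: int) -> int:
    """Same logic with output bit 3 stuck at 0."""
    return good_logic(x) & ~0x0008 & 0xFFFF


def run_bist(logic, seed: int = SEED, n_patterns: int = 16) -> int:
    """Apply n_patterns pseudorandom stimuli and return the final signature."""
    lfsr, signature = seed, 0
    for _ in range(n_patterns):
        signature = misr16_step(signature, logic(lfsr))
        lfsr = lfsr16_step(lfsr)
    return signature


if __name__ == "__main__":
    golden = run_bist(good_logic)
    observed = run_bist(faulty_logic)
    print(f"golden signature:   {golden:#06x}")
    print(f"observed signature: {observed:#06x}")
    print("PASS" if observed == golden else "FAIL")
```

Because only the final signature is read out, the tester never has to observe the high-speed responses themselves, which is what lets the stimulus and capture run at the full internal clock rate.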
“Outside that collar,” Reohr continued, referring to the red square in Figure 1a, “are periphery interfaces that operate with many independent and substantially lower-speed clocks. For example, the external memory operates at 175- to 266-MHz DDR, the parallel LVDS at up to 500-MHz DDR, and the GPIO at 100 MHz. The RTL code describing the periphery interfaces is all written in Verilog, and we use a scan ATPG methodology to test the periphery devices.”
Within the collar, Reohr continued, “test is a little bit tougher. We faced a tradeoff between performance and adding testability hooks, and because of the delays scan muxes would impose, we chose not to add scan. So, we needed to come up with some other way to test the internal array. We had to come up with a BIST algorithm that was scalable—that would allow us to change the shape of the object array as well as the tiling pattern. Each object is unique, but from a BIST point of view it must be self-testable using the same BIST algorithm—that is, the BIST algorithm must be independent of the tiling pattern. We also need to run tests at the full gigahertz clock rate, so we had to disassociate ourselves from any particular tester—we shift test data in and out at slow speeds, turn BIST on, and let the chip manage all the high-speed signals itself.”
One of the biggest challenges, Reohr said, stemmed from the fact that BIST “configures the part in a random way that's not realistic”—that is, that doesn't mirror how the device will run during normal operation—“and the amount of power that we could potentially consume could be too great, so we had to make sure we limited test power to levels that would permit wafer tests without requiring cooling. Another challenge involved the fact that random configurations established by the BIST circuitry could conceivably require a signal to traverse the full array in a single clock cycle, whereas single-clock data transfers by design occur only within local neighborhoods.”
Reohr employed the concept of a “rectangle under test” to address these challenges. In Figure 1b, the yellow objects represent the rectangle under test, and they are fully powered up. The white objects are powered off to limit chip power consumption, while the red objects are partially powered up to allow the party-line (PL) buses that connect all objects together to carry data into and out of the rectangle under test. The lower inset in Figure 1b details an object's east/west PL bus.
The rectangle under test, Reohr said, can range from 1×1 to n×8, with the exact configuration downloadable via the JTAG interface. Large rectangles help to verify timing over varying multicycle paths, but they reduce observability. Test typically begins, he said, with the rectangle in the lower left. The rectangle is then marched from left to right and then moved up one row, with the process repeating until test is complete. Table 1 outlines the BIST algorithm.
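The marching order Reohr describes amounts to a raster scan that starts at the lower left. The Python sketch below is illustrative only; the function name, the array dimensions, and the one-object step size are assumptions, since the article does not give the real array size or step.

```python
from typing import Iterator, Tuple


def rut_positions(array_cols: int, array_rows: int,
                  rut_cols: int, rut_rows: int) -> Iterator[Tuple[int, int]]:
    """Yield the (column, row) of the rectangle's lower-left corner.

    Row 0 is the bottom row and column 0 is the leftmost column. The
    rectangle marches left to right, then moves up one row and repeats.
    """
    if rut_cols > array_cols or rut_rows > array_rows:
        raise ValueError("rectangle under test does not fit in the array")
    for row in range(array_rows - rut_rows + 1):
        for col in range(array_cols - rut_cols + 1):
            yield (col, row)


if __name__ == "__main__":
    # Hypothetical 4-column by 3-row array with a 2x1 rectangle under test.
    for position in rut_positions(4, 3, 2, 1):
        print(position)
```

A larger rectangle yields fewer positions to visit, which mirrors the tradeoff in the text: more multicycle paths are exercised per position, but each signature covers more logic and so localizes a failure less precisely.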
Test and simulation
Device test is only part of the challenge of getting customers' products up and running. MathStar also needs to support its customers' efforts to develop and test FPOA applications.
To that end, Teckman said, the company offers a variety of applications support. “We have several types of development kits that provide a chip and programming tools and some applications libraries along with training. We offer one development board that connects to a PC, and for military/aerospace applications, we provide a ruggedized rack of DIN cards that includes a control processor and an FPOA. We have an internal IP development team that builds things like FIR filters and FFTs for the test-and-measurement applications—for front-ending an oscilloscope probe, for instance. In addition, third-party IP developers like Barco Silex and Cadre Codesign, which have made a business of building IP on top of FPGAs, have now started to migrate to our technology.”
Steve Kassel, director of silicon/system engineering, explained the application development process. “Tool flow starts with Summit Design's Visual Elite—that's our design entry tool, which we use to generate OHDL, or Object Hardware Description Language, a Verilog-like syntax we use to pass information between our various design tools. Customers then use COAST,” MathStar's Connection and Assignment Tool (Figure 2), “for floor-planning and placement. When the layout is complete, a compiler combines OHDL and mapping files to generate the image that's downloaded onto the target chip.”
Figure 2. Users can convert a logic design into a physical FPOA implementation using MathStar’s COAST tool, which provides a graphical user interface for placement and connection of the silicon objects. Courtesy of MathStar.
Teckman said that customers can't rearrange the silicon objects for chips they already have—they can only reprogram the connections. He explained, however, that an Excel spreadsheet and a Perl script are all that are necessary for specifying a custom array that MathStar can build. “We do that for our internal development, and Barco Silex and end customers all get access to this capability.” Ultimately, he said, the goal is to take a simulation in a program like Matlab from The MathWorks and synthesize directly to objects, with automatic place-and-route.
As an example, said Teckman, “We are doing something like that today with Barco Silex, which has built a complementary FPGA that sits next to an FPOA. Barco Silex uses our tool with Visual Elite from Summit and ModelSim from Mentor Graphics to form an environment where they simulate a complete design.”
Teckman emphasized that throughout the design process, the customer is not dealing with clocks or DLLs (delay-locked loops) that could affect timing closure. Data transfer timing between nearest neighbors or over party lines is completely deterministic, with nearest neighbors and all party lines synchronized to the exact same clock.
Said Pihl, “An advantage of our architecture and our tools is that we have deterministic timing. We can load a program, and it runs at the full 1-GHz clock rate. You won't have the clock-rate degradation as in FPGAs, which have a place-and-route and timing-closure iterative loop you have to go through. With FPOAs you can literally simulate the timing and performance of your application, because we have cycle-accurate models of each one of the objects in our chip. A customer can know exactly what his timing is before implementing real hardware, making for a very efficient design process.”
Pihl added, “We want to evolve the toolset over time to make them easier to use. Our current tools are based around Summit Design's Visual Elite, and we are trying to expand that to deal with a whole range of EDA companies, so customers can use the front-end design tools they already have.” The goal is full support for ESL, he said. “We want customers to have the ability to do SystemC level simulations and move down the hierarchy toward automated development of the actual code that runs on our chips.”
Table 1. High-level BIST algorithm
1. Disable I/O.
2. Shift rectangle-under-test (RUT) infrastructure into entire array, with RUT at bottom left.
3. Shift in random seeds to LFSRs; initialize MISRs and signature register.
4. Enable full-speed clocks for burst of activity; provide random stimulus and record outputs to MISRs.
5. Shift LFSR chains to reconfigure objects under test.
6. Repeat steps 4 and 5 a configurable number of times.
7. Record state of MISRs into final BIST signature.
8. Shift all objects one column to the right. RUT moves east one column.
9. Repeat steps 4 through 8 a configurable number of times.
10. Repeat steps 2 through 9, moving RUT up one row, a configurable number of times.
11. Read and compare final BIST signature.
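The nesting of the "repeat" steps in Table 1 can be made explicit with a short Python sketch that only prints the order of operations; it does not model any hardware. The function name and the three repeat counts are hypothetical, and the sketch assumes each "configurable number of times" means the total number of passes through that loop.

```python
from typing import List


def bist_schedule(n_rows: int, n_cols: int, n_bursts: int) -> List[str]:
    """Return the ordered list of Table 1 steps for the given repeat counts."""
    steps = ["Step 1: disable I/O"]
    for row in range(n_rows):                      # step 10 loop
        steps.append(f"Step 2: shift in RUT infrastructure (RUT row {row})")
        steps.append("Step 3: seed LFSRs; initialize MISRs and signature")
        for col in range(n_cols):                  # step 9 loop
            for _ in range(n_bursts):              # step 6 loop
                steps.append("Step 4: full-speed burst; capture in MISRs")
                steps.append("Step 5: shift LFSR chains to reconfigure")
            steps.append("Step 7: record MISR state into BIST signature")
            steps.append(f"Step 8: shift RUT east from column {col}")
    steps.append("Step 11: read and compare final BIST signature")
    return steps


if __name__ == "__main__":
    for line in bist_schedule(n_rows=2, n_cols=3, n_bursts=2):
        print(line)
```

Only steps 4 and 5 involve the full-speed clocks; everything else is slow shifting, which is consistent with Reohr's point that the tester handles data at low speed while the chip manages the high-speed signals itself.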