Thursday, April 24, 2008

[HiPerCoPS] FFT

To test the reconfiguration mechanism of my simulator, I'll use a 32x32 cell device and implement an FFT circuit. These are the modules that comprise the FFT:
  • 4x4-cell multipliers (12)
  • 2x4-cell adders (11)
  • 2x4-cell subtractors (11)
  • 4x4-cell memories (8)
The experimental variable will be the number of faults, starting at zero and increasing by one until the system can not longer be mapped to the device. The dependent variable will be the percentage of time that the system is unable to be mapped to the device (the "failure rate").

The tests will be performed on a number of versions of my simulator:
  1. No fault avoidance. The placer simply maps modules without knowing/caring if a cell is faulty or not.
  2. Simple fault avoidance. Modules are mapped in the order that they were added to the device, avoiding faults.
  3. Size-aware fault avoidance. Modules are mapped starting with the largest and ending with the smallest, avoiding faults.
  4. GPD fault avoidance. Simple fault avoidance, but faulty cells can be used if the desired functionality is still present.
  5. GPD size-aware fault avoidance. Size-aware fault avoidance, but faulty cells can be used if the desired functionality is still present.
Of course, it will be important to automate this test. Some simple loops should do it though. I'll run a few tests by hand to decide how long each trial takes. Based on that information (and how much time I have at my disposal), I'll decide how many trials to run for a given number of faults. Somewhere between 10^3 and 10^4 trials for a given number of faults per simulation version should suffice, I would think.

Before I move on to the experiment, though, I need to investigate the phenomenon of "pseudo-clustering." Dr. Delgado told me that modern fault models tend to group faults in bunches rather than disperse them equally across a system. Apparently, this model is a closer approximation to the real world. One thing to consider, though, is whether pseudo-clustering is likely to occur in our architecture.

[HiPerCoPS] 562 Project

For most of this semester, I've been working on a simulator for our architecture. I developed an algorithm to determine the cell types in multiplier modules of arbitrary size and granularity. Then I worked on simulating the cell interconnects so the multipliers would actually function. There are still a few bugs in the overall system, but it works pretty well for the most part.

Now, as part of my Cpt S 562: Fault Tolerant Computing class, I'm working on a mechanism that maps these multipliers to reconfigurable device. The algorithms are pretty simple so far, but they're also a bit clunky. I'm looking into ways to make the process more elegant and efficient, but the real focus is on making it fault tolerant. My goal is to add fault recovery and avoidance capabilities to the device.

So far, I can define a device with certain dimensions and place modules on it. There's also a function that induces permanent faults in the cells. Fault placement is random, but the number of faults to be induced is taken as a parameter.  My module placer knows that it can't map modules over faulty cells, so it looks for fault-free areas that are big enough for the "footprint" of a module.

Sometimes, if there are too many faults or the modules are too large, placement will be impossible. My system is smart enough to realize this report an error whenever it happens. There are a number ways to maximize the likelihood of being able to place 100% of the modules for a given system. First of all, the module placer can attempt to map the modules in order of size, from biggest to smallest. This way, the modules that are least likely to have enough space (the largest modules) have a better chance of finding unused real estate on the device. The smaller modules, then, can "settle into the cracks" that remain. Second, it is possible that fault cells can still be used in some contexts. Since the atomic unit of our architecture is medium grain, perhaps a single burnt-out transistor won't cause the entire cell to fail. In the event that a subset of the functionality remains, maybe certain types of cells could still be mapped to this location. This is an example of graceful performance degradation, which could greatly increase the lifetime of the device.

One of the things I'd like to do is consider using a simulated annealing algorithm in the module placer. This approach is used in most FPGA placers. Although there are some fundamental differences between FPGAs and our architecture, I still think that simulated annealing could be a promising option for more efficient module placement.

Thursday, April 17, 2008

[HiPerCoPS] Google "in binary" Searches

During my research, I've come to rely heavily on the Google Calculator. Since I usually have a web browser, this is the easiest way for me to do conversions involving decimal, binary, and hexadecimal numbers. Here are some examples:
 
"256 in binary" gives
 

256 = 0b100000000

 
"255 in hexadecimal" gives 
 

255 = 0xF


"0b00101011101010101010101101010100101 in decimal" gives
 

0b00101011101010101010101101010100101 = 5 860 842 149


Very useful. Thanks, Google!