4.1. Modeling and Prototyping Using DIF
 We start with an application specification that describes the DSP algorithm under consideration (in this case, the TDD) along with proper input and output interfaces. The application is specified using the DIF language. This DIF specification consists of topological information about the dataflow graph — interconnections between the actors along with input and output interfaces. The DIF specification is a platform-independent, high-level application specification. The specification can be used, for example, to simulate the application, given the library of actors from which the specification is constructed.
 Depending upon the application under consideration, the designer can select among a variety of dataflow models of computation in DIF to effectively capture relevant aspects of the application dynamics. It should be noted that the designer does not always need to specify the model in advance. The CFDF model can be used to describe individual modules (actors) in the application, and the DIF package can analyze the CFDF representation (CFDF modes, to be specific) of the actors, as specified by the designer through the actor code, and annotate the actors with additional dataflow information using various techniques for identifying specialized forms of dataflow behavior [e.g., see Plishker et al., 2010]. This step requires the functionality of individual actors to be specified in CFDF semantics. The designer can use the existing blocks from the Java actor library in DIF or develop his or her own library of CFDF actors.
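The CFDF semantics described above (actors with modes, each mode having fixed token rates, with firing governed by an enable/invoke protocol) can be illustrated with a small sketch. This is a hypothetical minimal actor, not code from the DIF Java library; the `Accumulator` actor, its mode names, and the list-based FIFO are all illustrative assumptions:

```python
class Accumulator:
    """Hypothetical CFDF actor with two modes: 'load' consumes one token
    (a count N) and produces nothing; 'sum' consumes N tokens and
    produces their sum as one output token."""

    def __init__(self):
        self.mode, self.n = "load", None

    def rates(self):
        # (tokens consumed, tokens produced) for the current mode
        return (1, 0) if self.mode == "load" else (self.n, 1)

    def enable(self, fifo):
        # fireable iff the input buffer holds the current mode's consumption rate
        return len(fifo) >= self.rates()[0]

    def invoke(self, fifo, out):
        # fire the current mode, then transition to the next mode
        if self.mode == "load":
            self.n = fifo.pop(0)
            self.mode = "sum"
        else:
            out.append(sum(fifo.pop(0) for _ in range(self.n)))
            self.mode = "load"
```

Because each mode has fixed rates, a tool can inspect the per-mode rate table to detect specialized dataflow behavior, which is the kind of analysis the DIF package performs on CFDF representations.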
 In terms of tunability, the key components of the TDD as seen from Figure 3 are the tunable FIR filter and decimation filter blocks. The tunable decimation filter (TDF) block is of particular interest, considering that it is the only multirate block in the system. Its behavior resembles that described in section 3.1.3. In view of this, we have identified PSDF and PCSDF as candidate dataflow models for efficient implementation of the targeted TDD system. For this system, we have to take into account the multiple inputs and outputs to actors, as mentioned in section 2.
 To illustrate details of the dataflow behavior of a decimator actor based on such specifications, Figures 10a and 10b show one such decimator actor with 4 inputs, 4 outputs, and a decimation factor of 6. The decimator simultaneously receives 4 consecutive samples from its 4 inputs. It outputs every sixth input sample, starting with the first, and each of these output samples appears on a successive output of the decimator.
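The observable input/output behavior just described can be sketched in a few lines. This models only what the decimator does to a sample stream, not the DIF actor itself; the function name and list-based representation are illustrative assumptions:

```python
def decimate_parallel(samples, num_lanes=4, factor=6):
    """Model the decimator's I/O behavior: samples stream in num_lanes at
    a time across parallel inputs; every `factor`-th sample, starting with
    the first, is kept, and each kept sample is routed to the next output
    lane in round-robin order."""
    kept = samples[::factor]                 # every sixth input sample
    outputs = [[] for _ in range(num_lanes)]
    for i, s in enumerate(kept):
        outputs[i % num_lanes].append(s)     # successive output lanes
    return outputs
```

For 24 consecutive samples, the kept samples are those at indices 0, 6, 12, and 18, each appearing on a different one of the 4 outputs.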
Figure 10. Dataflow behavior of a Decimator actor with 4 inputs and outputs for a decimation factor of 6 using (a) SDF and (b) CSDF models.
 For the sake of simplicity and clarity, we have excluded the other single rate blocks from the application graphs in these figures. In our implementation, we extend this behavior for an actor with 8 inputs and outputs. We have created a DIF prototype using PSDF and PCSDF as underlying models for equivalent CFDF representation of actor blocks. We have also developed a Java library of actors in DIF adhering to CFDF semantics for all of the blocks.
 We then used DIF for software prototyping, analysis, and functional simulation. The DIF package uses the DIF specification to generate an intermediate graph representation, which can then be used as an input for further graph transformations including a scheduling transformation, which determines the schedule for an application. Here, by a schedule, we mean the assignment of actors to processing resources, and the execution ordering of actors that share the same resource. The functional simulation capabilities provided in DIF can be used to analyze and estimate buffer requirements in terms of the numbers of tokens accumulated on the buffers that correspond to dataflow graph edges. This provides an estimate of total memory requirements as well as specifications for individual buffers when porting the application to the targeted implementation platform.
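The buffer analysis just described can be sketched as a token-counting replay over a single dataflow edge. This is a minimal, hypothetical model of the quantity DIF's functional simulation tracks, not DIF code; the rate dictionaries and function name are illustrative assumptions:

```python
def peak_tokens(schedule, produced, consumed):
    """Replay a schedule over one dataflow edge and report the maximum
    token accumulation: a minimal model of per-edge buffer estimation.
    `produced` and `consumed` map actor names to their token rates on
    this edge (absent actors have rate 0)."""
    tokens = peak = 0
    for actor in schedule:
        tokens += produced.get(actor, 0)   # producer firing adds tokens
        peak = max(peak, tokens)
        tokens -= consumed.get(actor, 0)   # consumer firing drains them
    return peak
```

For example, a schedule that fires a unit-rate source six times before a consumer that takes 6 tokens requires a buffer of 6 tokens, whereas strictly interleaved firings of unit-rate actors require only 1.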
 Figure 11 shows the TDD application graph generated using DIF. This is based on the TDD block diagram shown in Figure 3 with addition of some actors that handle parameter configuration for the actors. We discard one of the two sets of outputs (more specifically, sine output) of the localOsc actor as we have employed a real mixer in our design. The complexity of the graph, which is increased due to multiple parallel edges between two actors, can easily be captured through a DIF specification that makes use of topological patterns. We have shown one of the possible specifications of the graph topology in DIF using topological patterns in Figure 12.
 For our design, we have used parameterized looped schedules (PLSs) [Ko et al., 2007] for PSDF and PCSDF models to determine the total buffer requirements. Using the TDD specification, we construct PLSs for the TDD application. Figure 13a shows a PLS for a TDD application in which the decimator actor has the underlying SDF model, while Figure 13b shows one in which the decimator actor employs the CSDF model. We have used the generalized schedule tree (GST) representation for the PLSs [Ko et al., 2007]. An internal node of a GST denotes a loop count, while a leaf node represents an actor. The execution of a schedule involves traversing the GST in a depth-first manner, and during this traversal, the sub-schedule rooted at any internal node is executed as many times as specified by the loop count of that node. As annotated in these GSTs, loop counts p0, p1, and p2 are parameterizable. The loop count p0 is set to a user-specified number of iterations, while the loop counts p1 and p2 are tuned based upon the decimation factor as well as the underlying dataflow model for the decimator. Figures 13a and 13b, in particular, show values of the parameterizable loop counts set for a decimator with a decimation factor of 11. This PLS can be viewed as providing CFDF-based execution for the given parameterized dataflow (PDF) based actor specification.
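The GST execution rule above (depth-first traversal, repeating each sub-schedule by its node's loop count) can be captured in a short sketch. This is a hypothetical minimal re-implementation, not the DIF package's GST classes; the node structure and actor-name leaves are illustrative assumptions:

```python
class GstNode:
    """Generalized schedule tree sketch: an internal node repeats its
    children `count` times; a leaf carries an actor, fired once per visit."""

    def __init__(self, count=1, children=(), actor=None):
        self.count, self.children, self.actor = count, list(children), actor

    def execute(self, trace):
        for _ in range(self.count):
            if self.actor is not None:
                trace.append(self.actor)   # leaf: fire the actor
            for child in self.children:    # internal: depth-first traversal
                child.execute(trace)
```

A parameterizable loop count is simply an internal-node count bound at run-time; for instance, a count analogous to p1 can be tuned to the decimation factor before the tree is executed.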
Figure 13. PLSs for the TDD application configured for a decimation factor of 11, and decimator actor employing the (a) PSDF and (b) PCSDF models of computation.
 Table 1 shows the total buffer requirements using PLSs shown in Figures 13a and 13b for various configurations of decimation factors. Note that for a given configuration (setting of graph parameters), a PSDF or PCSDF graph behaves like an SDF or CSDF graph, respectively. It can be seen that for the SDF model, the total buffer requirements vary with the decimation factor, and this is due to input buffers to the TDD block that need to accumulate varying numbers of tokens. Thus, employing the PSDF model will require tuning buffer sizes for different decimation factors if one wants to provide for optimized buffer sizes in terms of graph parameters.
Table 1. Total Buffer Requirements From a DIF Prototype for Different Decimation Factors Using Parameterized Looped Schedules
| Model | Total buffer requirements (number of tokens), by increasing decimation factor | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| SDF | 132 | 140 | 148 | 156 | 164 | 172 | 180 | 188 |
| CSDF | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
 We have used the CASPER tool flow for developing our platform-specific implementation, as explained later in section 4.2. This implementation is targeted to an FPGA. Our objective here is to support tuning the decimation factor without regenerating hardware code. A dataflow buffer can be implemented using a FIFO or dual-port random access memory (RAM) block in the targeted FPGA device. The size of an available FIFO block can be set to 2^n, where n ≥ 1. This gives limited control over setting the FIFO size, and may increase resource utilization. At the same time, tuning the sizes of FIFO or dual-port RAM blocks is not possible during run-time. It is, in general, possible to set the size of a FIFO or dual-port RAM block to the maximum required value, and access only part of it using a tunable address counter during run-time. This, however, again may lead to unnecessarily increased resource utilization. The ADC output is of a streaming nature (data is produced or consumed at every clock cycle without any synchronization signal), as is the DSP subsystem downstream of the TDD.
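The over-allocation forced by the power-of-two FIFO depth constraint can be quantified with a short sketch (a hypothetical helper, not part of the CASPER tools):

```python
def fifo_depth(required):
    """Smallest FIFO depth of the form 2**n (n >= 1) that can hold
    `required` tokens, illustrating the over-allocation the power-of-two
    constraint can force."""
    n = 1
    while 2 ** n < required:
        n += 1
    return 2 ** n
```

For instance, the 132-token SDF buffer requirement from Table 1 would force a 256-deep FIFO, nearly doubling the allocated storage.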
 In order to meet the throughput constraint imposed by the maximum data rate of the ADC output stream, SDF buffers need to be pipelined, which is inefficient using RAM blocks. Thus, we use the CSDF model, which does not require tuning of dataflow buffer sizes to meet the maximum throughput constraint, as observed from our DIF-based prototype. The TDD generates a synchronization or enable signal indicating valid output data. This can be used as a clock to drive the downstream DSP system.
 We use our DIF prototype as a reference while integrating the design with the current CASPER tool flow for the target implementation on the IBOB. Section 4.2 further elaborates on this approach along with implementation results.
4.2. Integration With the CASPER Tool Flow
 The CASPER tool flow is based on the BEE_XPS tool flow [Parsons et al., 2006]. This tool flow requires that an application be specified as a Simulink model using XSG [Parsons et al., 2006]. Since there is no automated tool for transforming a DIF representation into an equivalent Simulink model, porting the DIF specification to Simulink/XSG requires manual transcoding of the DIF specification. This also requires implementing parameterizable actor blocks that are currently not available in the XSG, CASPER, or BEE_XPS libraries.
 Each actor gets transformed into an equivalent functional XSG block. For each of the Simulink actor blocks, we provide a pre-synthesis parameterization that allows changing block parameters before hardware synthesis (see Parsons et al. for more details on Simulink scripting). In order to implement our objective of tunability, namely post-synthesis parameterization, we use the software register mechanism in the BEE_XPS library to specify parameters that change during run-time (that is, after hardware code is generated, and depending upon user requirements).
 Software registers can be accessed and set during run-time from the TinyShell interface available for IBOB. This allows tuning TDD parameters without re-synthesizing the hardware each time the parameters change from the previous setting. Each block has an enable input signal. Through systematic transformations, an application graph in DIF can be converted into an equivalent Simulink/XSG model. We have developed an interface software package using C programs, and Bash and Python scripts to compute software register values for the required TDD configuration, and set these values on the IBOB over a telnet connection, which is used for remote access to the hardware platform at NRAO.
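A small sketch of the interface software's role is shown below: given a requested configuration, it formats the TinyShell commands that would set the software registers. The `regwrite` idiom and the register names (`tdd_decim`, `tdd_en`) are illustrative placeholders, not the design's actual register map, and no telnet connection is made here:

```python
def tinyshell_commands(decim):
    """Format TinyShell commands for a requested decimation factor.
    Register names and the regwrite idiom are hypothetical placeholders
    standing in for the design's actual software register map."""
    return [f"regwrite tdd_decim {decim}",  # set the decimation factor
            "regwrite tdd_en 1"]            # enable the TDD
```

In the actual flow, such command strings would be sent to the IBOB over the telnet connection described above.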
 On the targeted FPGA device, we have employed the NCO using dual-port RAM blocks that are loaded with pre-computed sinusoidal signal values of the required precision. Each of these dual-port RAM blocks is used to simultaneously read sine and cosine values from both of its ports. The oscillator frequency is set using a software register, and depends upon the desired output signal band.
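Generating the pre-computed sinusoid for such a RAM-based NCO is straightforward; the following is a sketch under the assumption of a full sine period quantized to signed fixed-point values (the depth and bit width shown are illustrative, not the design's actual parameters):

```python
import math

def nco_table(depth=1024, bits=18):
    """Pre-computed sinusoid for a dual-port RAM NCO: `depth` samples of
    one sine period, quantized to signed `bits`-bit integer values."""
    scale = 2 ** (bits - 1) - 1   # full-scale signed value
    return [round(scale * math.sin(2 * math.pi * i / depth))
            for i in range(depth)]
```

Reading the second port at an address offset of depth/4 (one quarter period) yields the cosine, which is how a single table can feed both mixer phases simultaneously.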
 In our current implementation, the TDF block (see Figure 3) can have up to 16 filter taps. We have also implemented a tunable FIR filter block, which does not decimate, shown in Figure 3. This block can have up to 8 taps in our implementation. These, again, are set using software registers. Figure 4b shows the schematic of a TDF. As shown in this figure, we have employed two filter banks (16-tap units) inside our design of a TDF block that operate in tandem to allow maximum throughput (that is, the maximum data rate of the ADC output stream). Hence, our TDF block has 32 multiplication operations. As mentioned earlier, our TDF design employs a polyphase implementation as described in Vaidyanathan. The software computes the sequence in which the input signals should be routed to an appropriate filter tap for a given decimation factor. This information is then fed to the signal routing scheme using software registers.
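The software-computed routing sequence can be sketched as follows: with parallel input lanes and a decimation factor D, the input sample with global index m belongs to polyphase branch m mod D, and the pattern repeats after lcm(lanes, D)/lanes clock cycles. This is a sketch of the routing computation only, not the actual register encoding (and `math.lcm` assumes Python 3.9+):

```python
import math

def routing_sequence(decim, lanes=8):
    """Polyphase branch assignment for each of `lanes` parallel input
    samples, per clock cycle, over one full routing period."""
    period = math.lcm(lanes, decim) // lanes   # cycles before the pattern repeats
    return [[(cycle * lanes + lane) % decim for lane in range(lanes)]
            for cycle in range(decim and period)]
```

For a decimation factor of 11 with 8 lanes, the routing pattern repeats every 11 cycles, which is the kind of sequence that would be written into the routing registers.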
 Table 2 shows results for the TDD implementation on the IBOB using the Xilinx EDK 7.1.2. We have used this hardware platform and tool for all of the experiments reported in the remainder of the paper. Design 1 shows some of the device utilization parameters for a TDD that supports only baseband modes. This design does not include the tunable FIR filter, NCO, and mixer blocks shown in Figure 3. Design 2 is based on the block diagram of a TDD shown in Figure 3. As evaluation metrics for hardware cost, we have used the utilization of FPGA slices, 4-input look-up tables (LUTs), block RAM units, and the number of embedded multipliers. Note that neither of these two designs uses any of the available embedded multipliers for multiplication. Designs 3 and 4 are modified versions of designs 1 and 2, respectively, in that they employ embedded 18 × 18 multipliers. It can be seen that using embedded multipliers does not provide significant improvements in hardware cost. We observe that use of embedded multipliers, in fact, needs to be accompanied by addition of extra latency in the design to achieve timing closure. We have been able to achieve maximum throughput using an implementation based on the PCSDF model.
Table 2. Implementation Summary for TDD Designs
| Parameter | Design 1 | Design 2 | Design 3 | Design 4 |
|---|---|---|---|---|
| FPGA slices (out of 23616) | 12234 (52%) | 13315 (56%) | 12322 (52%) | 14232 (60%) |
| 4-input LUTs (out of 47232) | 14139 (29%) | 16123 (34%) | 12123 (25%) | 15035 (31%) |
| Block RAMs (out of 232) | 41 (17%) | 48 (20%) | 41 (17%) | 48 (20%) |
| 18 × 18 multipliers (out of 232) | – | – | 32 (13%) | 95 (40%) |
4.3. Platform-Specific Analysis Using DIF
 It is common to go back and forth between a high-level prototype and a corresponding platform-specific implementation while designing an embedded DSP system. Such alternation in design phases is common, for example, when one is developing a platform-specific library or tool flow. In support of such a design methodology, it is desirable for a high-level design tool to support platform-specific analysis. This can be achieved by annotating the high-level application specification with platform-specific implementation parameters, which are derived from device data sheets, experimentation, or a combination of the two.
 DIF supports specifying user-defined actor parameters. We use this feature in DIF to annotate actors with two relevant implementation parameters: the latency constraint and the number of embedded multipliers. This allows estimating results based on the DIF prototype itself instead of determining them from the constructed design, which is generally time consuming. We have verified the accuracy of the metrics estimated by our DIF model against the actual hardware synthesis results shown in Table 2.
 Developers of tool flows and DSP libraries can profile their library blocks to determine a wide variety of platform-specific implementation parameters. DIF can use such information to estimate implementation parameters at a high-level of abstraction, and earlier in the design cycle to help efficiently prune segments of the design space. Support for estimation of various platform-specific resources for different platforms is beyond the scope of this paper. It is, however, an important direction toward developing alternative model based design flows and open access tool flows for astronomical DSP solutions.