[1] Improvements in sampling and recording technology have made it possible to acquire data at rates higher than 1 Gbit/s. In very long baseline interferometry, wide-bandwidth data produce a high SNR, which is proportional to √(BT) (B is bandwidth and T is integration time). In astronomical applications, wide-bandwidth data acquisition can be used to detect very small flux densities of cosmic radio sources. It is also useful in geodetic applications. The current correlation processing algorithm is bit serial. In serial data processing, however, the data-processing speed is restricted by the correlation device clock, and as a result the device speed prevents the whole (channel) bandwidth from being used for observations. To overcome this problem, a new correlation processing algorithm for parallel bit stream set processing has been developed. This article focuses on how to carry the serial data processing algorithms over to a parallel bit stream.

[2] Very long baseline interferometry (VLBI) is a kind of radio interferometry in which the baseline may be as long as the diameter of the Earth. In the field of astronomy, the goal is generally to make high-resolution maps of radio sources. This is accomplished by combining data from two or more antennas, which act as individual interferometer elements of the synthesized aperture. A single antenna effectively gathers its signal by summing the signals from the various parts of its surface. If the actual single antennas are distantly separated, the synthesized antenna has a beam pattern with the highest angular resolution achievable by any known technique. Wide-bandwidth data enable observation of very weak sources [Rogers and Moran, 1981; Rogers et al., 1983, 1984].

[3] For geodesy, the goal is usually to measure the time-of-arrival difference of signals from a radio source. Assuming that the radio source is a point object and that the clocks at the individual stations are exactly synchronized, the time-of-arrival difference, τ(t), can be measured by comparing the signals with different trial delays until they match. In practice, this is done by multiplying them together with different trial delays until a maximum is found at the actual delay. Wide-bandwidth data lead to high-SNR observations.
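The trial-delay search described above can be illustrated with a minimal sketch. It assumes 1-bit (±1) samples, integer trial delays, and a synthetic signal; the function name find_delay, the 20% sign-flip "noise," and the search range are hypothetical choices made only for this illustration.

```python
import random

def find_delay(x, y, max_lag):
    """Return the trial lag that maximizes the sum of x[t] * y[t + lag]."""
    best_lag, best_sum = 0, None
    for lag in range(-max_lag, max_lag + 1):
        total = 0
        for t in range(len(x)):
            u = t + lag
            if 0 <= u < len(y):
                total += x[t] * y[u]
        if best_sum is None or total > best_sum:
            best_lag, best_sum = lag, total
    return best_lag

rng = random.Random(1)
x = [rng.choice((-1, 1)) for _ in range(2000)]
true_delay = 7
# y is x delayed by true_delay samples; 20% of the signs are flipped
# to imitate independent receiver noise (an assumption of this sketch).
y = [rng.choice((-1, 1)) for _ in range(true_delay)] + [
    -s if rng.random() < 0.2 else s for s in x
]
estimated = find_delay(x, y, max_lag=16)
```

The product sum is large only at the trial delay that aligns the two streams, which is why the maximum identifies the delay.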

[4] A VLBI correlator can be built to process data in the “XF” (correlation followed by Fourier transform) or “FX” (Fourier transform followed by multiplication) configuration. The “X” stands for cross correlation and “F” for Fourier transform. In the XF correlator, the cross-correlation function is calculated first, followed by Fourier transformation. In the FX correlator, the signals are first Fourier transformed, and then the voltage spectra are cross multiplied. It is operated when a bit shift occurs during an FFT calculation period. One of the attractive features of the FX design is that it is more naturally “station based.” A station-based delay is applied to both X and Y in a manner similar to the XF correlator, and quadrature rotation is applied to remove the gross station-based fringe rate.

[5] This paper focuses on the XF correlator, but the parallel bit stream approach is also effective for the FX correlator, especially at high fringe rotation rates.

2. Serial Data Correlation Algorithm

[6] Before introducing the parallel bit stream correlation method, we review the current serial bit stream correlation method [Rogers, 1970; Whitney, 2000; Kiuchi and Kondo, 1996], which has the least coherence loss, as a basis for comparison. The basic feature of the VLBI correlator is the fringe rotator, which cancels the effect of the Earth's rotation (fringe stopping) and operates on the data stream before correlation. The data are multiplied by a three-level approximation to the sine and cosine of the fringe stopping phase, which results in two modified data streams. These data streams are correlated with the data stream from the other station, yielding a complex correlation function. The simple real correlator consists of shift registers and an exclusive NOR (EXNOR). The correlated values are integrated by counters. This correlation function is then transformed to give the cross-power spectrum. It is important to realize that the correlation function must be calculated for a range of delays even if the clock error is precisely known; it is necessary to discard the negative frequency part of the cross-power spectrum. The continuum fringe visibility can be calculated as the average of the cross-power spectrum over positive frequency or as a convolution operation on the correlation function.

2.1. Fringe Rotation

[7] In the case of correlation between the X signal and Y signal, the X signal is multiplied by sine and cosine waveforms that correspond to the “fringe rate,” and the resulting two signals are individually correlated with the delayed Y signal. This type of sine/cosine multiplication is known as “quadrature mixing” and is effectively a single-sideband mixing operation that shifts the frequency spectrum of the X signal to match that of the Y signal. The correlator has a “cosine” and “sine” sum at each lag; such a correlator is known as a “complex correlator” since the cosine/sine sum pair form a complex number.

[8] In the correlator, the fringe generator consists of a modulated numerically controlled oscillator (MNCO) with a phase acceleration register. The fringe phase is maintained in a phase register that maps a single turn of phase from 0° to 360° onto the full range of register values. A Δϕ register holds the value by which the phase register is incremented at every sampling period. The phase acceleration control register holds the value by which the phase rate is incremented at the beginning of each sampling period.

[9] The cosine/sine waveforms are approximated by three-level values. One turn of phase maps onto 16 discrete values, which are determined only by the four most significant bits (MSBs) of the phase register. The three-level signal is resolved into a sign bit (indicated by ±1) signal and a blank (indicated by 0) signal. The coherence loss of the 1-bit correlation is less than 15% [Rogers and Moran, 1981] in SNR. The lost energy is scattered into data mixed with the higher harmonics of the cosine/sine waveforms and is seen as a slight additional noise. The correlated values are integrated by counters using an enable signal. The counter enable signal is controlled by the blank signal from the fringe rotator.
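The phase register and three-level lookup can be sketched as follows. This is only an illustration under stated assumptions: the 32-bit register width, the blanking threshold sin(π/8), and the exact assignment of +1, 0, and −1 to the 16 phase steps are not given in the text and are chosen here for concreteness, as is the order in which the registers are updated within a sample.

```python
import math

PHASE_BITS = 32  # hypothetical phase register width

def three_level_table(threshold=math.sin(math.pi / 8)):
    """Three-level cosine for the 16 phase steps selected by the 4 MSBs.

    Each step is evaluated at its center; values with magnitude below
    the threshold are blanked (0). The threshold is an assumption.
    """
    table = []
    for step in range(16):
        c = math.cos(2 * math.pi * (step + 0.5) / 16)
        table.append(0 if abs(c) < threshold else (1 if c > 0 else -1))
    return table

COS3 = three_level_table()

def fringe_samples(phase, dphi, accel, count):
    """Return (cos3, sin3) pairs for `count` sampling periods."""
    mask = (1 << PHASE_BITS) - 1
    out = []
    for _ in range(count):
        dphi = (dphi + accel) & mask    # phase rate updated first
        phase = (phase + dphi) & mask   # then the phase register
        step = phase >> (PHASE_BITS - 4)  # only the 4 MSBs are used
        # sin(phase) = cos(phase - 90 deg), and 90 deg is 4 steps
        out.append((COS3[step], COS3[(step - 4) % 16]))
    return out

samples = fringe_samples(phase=0, dphi=1 << 28, accel=0, count=4)
```

In this sketch a table value of 0 plays the role of the blank signal, and the sign of a nonzero value plays the role of the sign bit.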

2.2. Delay Tracking

[10] If the delay of the Y signal with respect to the X signal varies, the delay applied to the Y signal must change as a function of time in a quantized fashion, as shown in Figure 1. The dashed line represents the “model” (or desired) delay, but the actual Y delay must be stepped in integral sample periods as shown. This results in a sawtooth “baseline-delay error” shown at the bottom of Figure 1, which varies between +0.5 and −0.5 sample periods, because the correlator delay cannot exactly track the desired delay.
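The quantized tracking and the resulting sawtooth error can be reproduced numerically. The following is a minimal sketch, assuming the applied delay is the model delay rounded to the nearest integer sample period; the linear delay model and the function name track_delay are hypothetical.

```python
import math

def track_delay(model_delays):
    """Quantize model delays (in sample periods) to integer samples.

    Returns the applied integer delays and the delay errors
    (model minus applied), which form a sawtooth.
    """
    applied = [math.floor(d + 0.5) for d in model_delays]
    errors = [d - a for d, a in zip(model_delays, applied)]
    return applied, errors

# Hypothetical model: delay drifting by 0.003 sample periods per sample.
model = [0.003 * t for t in range(1000)]
applied, errors = track_delay(model)
steps = [b - a for a, b in zip(applied, applied[1:])]
```

Each entry of steps is 0 or 1, so the applied delay changes in single sample periods while the error stays within half a sample period.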

2.3. Fractional Bit

[11] The frequency of the phase generator is usually chosen to be the fringe rate at the center of the sampled analog signal (RF). This choice minimizes the average SNR loss over a correlation interval.

[12] During each period of time for which the delay is held constant (during the interval over which the quantized delay error tracks within ±0.5 sample period), the phase rotator operates at a phase rate corresponding to the fringe rate at the middle of the channel.

[13] The correlation processor compensates for the fringe phase at the center frequency of the receiving band in the time domain. This fringe phase is not perfectly compensated over the entire video bandwidth. Because the overall processing should be referenced to the DC edge of the video band, the phase of the phase rotator must be stepped by exactly 90° (action d in Figure 2) at each instant that the delay is shifted by one sample period (delay tracking in Figure 1). The sign of the phase shift depends on the sideband and the delay rate. A 90° phase shift is appropriate for Nyquist-sampled data. If a channel is oversampled, the phase shift is correspondingly reduced by the oversampling factor.
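The size of the phase step can be checked with a short calculation. The sketch below assumes that the step equals the phase change, at the center of the video band, produced by a delay of one sample period; the bandwidth is normalized to 1 and the function name is hypothetical. Only the magnitude is computed, since the sign depends on the sideband and the delay rate.

```python
def phase_jump_degrees(oversampling=1.0):
    """Magnitude of the phase step for a one-sample delay shift."""
    bandwidth = 1.0                                  # normalized video bandwidth B
    sample_rate = 2.0 * bandwidth * oversampling     # Nyquist rate is 2B
    center = bandwidth / 2.0                         # video band center frequency
    # phase = 360 deg * frequency * delay, with delay = 1 / sample_rate
    return 360.0 * center / sample_rate

nyquist_jump = phase_jump_degrees(1.0)
oversampled_jump = phase_jump_degrees(2.0)
```

For Nyquist sampling the result is 90°, and for a channel oversampled by a factor of 2 it is 45°.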

[14] The digital processing losses in this system include the following factors: (1) 0.64 for two-level signal representation, (2) 0.96 for three-level fringe rotation in one data path, (3) 0.975 for a seven-delay correlator (in the case of eight lags), and (4) 0.966 for the loss due to the discrete alignment of the data streams.
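As a rough guide, the four factors can be combined numerically. The sketch below assumes that the factors are independent and multiply, which the list above does not state explicitly; the dictionary keys are labels chosen for this illustration.

```python
# Loss factors as listed in the text; treating them as independent
# multiplicative factors is an assumption of this sketch.
losses = {
    "two-level signal representation": 0.64,
    "three-level fringe rotation": 0.96,
    "seven-delay correlator (eight lags)": 0.975,
    "discrete alignment of data streams": 0.966,
}

total = 1.0
for factor in losses.values():
    total *= factor
```

Under this assumption the combined factor is about 0.58.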

3. Current Parallel Bit Stream Processing Algorithm

[15] The current gigabit correlation processing algorithm [Koyama et al., 2000] operates on a parallel bit stream set (a block diagram is shown in Figure 3). The delay of the parallel bit stream is controlled by using a buffer memory, and real cross correlation is then performed. After cross correlation, fringe stopping is performed by phase switching.

[16] However, because fringe stopping is done in parallel bit stream steps rather than in bit steps, there is some loss of coherence. The continuous delay-tracking range is limited to 8064 bits. This method has slightly more loss than the serial method.

4. New Parallel Bit Stream Processing Algorithm for Gigabit Rate

[17] This section focuses on how to derive parallel bit stream versions of the serial data processing algorithms. Note that even if wide-bandwidth high-speed processing is possible, loss of coherence remains a problem. Parallel bit stream processing requires that the coherence of the serial data be maintained and that a 1-bit step algorithm (delay tracking, fringe rotation, and 90° phase jump) be used. In other words, the resolution of the data processing must be higher than that of the parallel bit stream clock. The serial data processing algorithm of the Mark-III [Rogers et al., 1983] and the Mark-IV [Whitney et al., 2004] (section 2) is sophisticated, so this scheme is adapted for the parallel bit stream correlation. In this paper, correlation for 1-bit quantization is discussed. Multibit quantization correlation, however, can be achieved with a four-channel 1-bit correlator. In the case of the 1-bit (two-level) mode, correlation is done between the X and Y data. In the case of the 2-bit (four-level) mode [Cooper, 1970], the correlated data of each unit are MSB(X) * MSB(Y), LSB(X) * MSB(Y), MSB(X) * LSB(Y), and LSB(X) * LSB(Y). After 1-bit correlation, the correlated results are multiplied by their weight values and the sums are calculated. Therefore a four-channel 1-bit (two-level) correlator is equivalent to a one-channel 2-bit (four-level) one.
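The equivalence between one 2-bit correlation and four weighted 1-bit correlations can be verified with a small example. The sketch assumes four equally spaced levels (−3, −1, +1, +3), so that each sample equals 2 × MSB + LSB with MSB and LSB taking the values ±1; the resulting weights 4, 2, 2, and 1 follow from that assumption and are not taken from the text, which does not state the weight values.

```python
def split(sample):
    """Split a four-level sample into (MSB, LSB), each +1 or -1,
    such that sample == 2 * MSB + LSB."""
    msb = 1 if sample > 0 else -1
    lsb = sample - 2 * msb
    return msb, lsb

def one_bit_correlate(a, b):
    """Sum of products of two +/-1 streams (one 1-bit channel)."""
    return sum(p * q for p, q in zip(a, b))

def two_bit_correlate(x, y):
    """2-bit correlation built from four 1-bit correlations."""
    xm, xl = zip(*(split(s) for s in x))
    ym, yl = zip(*(split(s) for s in y))
    return (4 * one_bit_correlate(xm, ym)    # MSB(X) * MSB(Y)
            + 2 * one_bit_correlate(xl, ym)  # LSB(X) * MSB(Y)
            + 2 * one_bit_correlate(xm, yl)  # MSB(X) * LSB(Y)
            + 1 * one_bit_correlate(xl, yl)) # LSB(X) * LSB(Y)

x = [3, -1, 1, -3, 3, 1]
y = [1, -3, 3, -1, -3, 1]
combined = two_bit_correlate(x, y)
direct = sum(a * b for a, b in zip(x, y))
```

Other level spacings would change only the weights, not the structure of the four 1-bit channels.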

4.1. Delay Tracking Based on Parallel Bit Stream

[18] A block diagram of the delay tracker is shown in Figure 4. Here n is the number of parallel bits, and k (or j) denotes the kth (jth) set of parallel data bits.

[19] Consider the case of correlation between station X (reference) and station Y, and assume that there are n parallel data stream samples. The circuit consists of parallel shift registers A and B; (n + 1) data selectors, each selecting one out of n possible values and denoted (n:1); parallel bit stream buffer memories C and D (for the station Y data), which operate in toggle mode; a control counter; register Y′; n data selectors, each selecting one out of two possible values and denoted (2:1); a bit select control register; and register Y. The (n + 1) (n:1) data selectors are connected to the parallel shift registers; they are controlled by the control counter, and their data are output to register Y′. The data output to register Y′ (for correlation with the station X data) are shown in the area between the two lines in Figures 5 and 6. The status of registers A, B, and Y′ is shown in Figures 5 (positive delay rate) and 6 (negative delay rate). From the selected (n + 1) data, n data are reselected by the n (2:1) data selectors and output to register Y. The n (2:1) data selectors are controlled by the bit select control register. All of the circuits operate on the parallel bit stream clock. Each operation is explained as follows.

4.1.1. Operation 1

[20] Initially, the (k − 1)th data are loaded into parallel shift register B from parallel data buffer C (or D), and older data bits are selected. These buffer memories work in a pipeline sequence.

4.1.2. Operation 2

[21] The kth parallel data bits are then loaded into parallel shift register B from the parallel data buffer D (or C) at the same time as the data in shift register B are shifted to shift register A. As a result of this operation, the sampled data are lined up in order from the left-end bit of register A to the right-end bit of register B.

[22] The (n + 1) (n:1) data selectors, which are controlled by the control counter, select data from the shift registers and output them to register Y′. The control counter shows the bit-select status of the (n:1) data selectors. After that, the selected (n + 1) data in register Y′ are reselected as n data by the n (2:1) data selectors and are output to register Y. The n (2:1) data selectors are controlled by the bit select control register.

[23] The delay for the current parallel bit stream clock cycle is compared with that for the previous cycle; the two cycles are separated in time by n sampling periods (n is the number of parallel data). When the delay difference is more than 1 bit, a fractional delay (delay tracking in Figure 1) occurs within the parallel data. Usually, the values of all bits of the bit select control register are “0.” If there is a fractional delay in the parallel bits, “0” and “1” are sent to the bit select control register. The fractional bit shift timing is indicated by the boundary between the “0” and “1” bits (Figures 7 and 8). After that, the control counter value increases or decreases according to the delay tracking sequence, and the bit select control register is reset on the next clock. Whether the control counter value increases or decreases corresponds to the sign of the delay rate. In this way, parallel data can be bit shifted in 1-bit steps.

4.1.3. Operation 3

[24] Here k is increased step by step, and operation 2 is repeated. The choice of buffer memory C or D is a toggle selection.

4.1.4. Operation 4

[25] When the control counter value is full or zero, which happens when operation 2 is finished, either of the two following operations is done.

4.1.5. Operation 5

[26] When the control counter's value reaches its maximum (the right end of register B; see Figure 5), the next data bit is loaded into the parallel shift registers from the parallel data buffers, and the conditions of the registers are simultaneously set to “the least” at the beginning of the next data clock cycle.

4.1.6. Operation 6

[27] When the control counter reaches zero (the left end of register A; see Figure 6), the next data bits are not loaded into the parallel shift registers from the parallel data buffer, and the conditions of the registers are simultaneously set to “full” at the beginning of the next data clock cycle. By this algorithm, the delay tracking in Figure 1 is performed on the parallel data set.
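The operations above can be modeled in software to check that word-wise processing reproduces bit-serial delay tracking. The following is a simplified sketch, not the circuit of Figure 4: it treats one direction of delay change only, counts the control counter from 0 instead of 1, labels each sample with its index instead of a data bit, and uses hypothetical names (parallel_track, shift_at, and so on).

```python
def serial_track(y, offsets):
    """Bit-serial reference: output sample t is y[t + offsets[t]]."""
    return [y[t + offsets[t]] for t in range(len(offsets))]

def parallel_track(y, n, num_words, shift_at):
    """Word-wise delay tracking with 1-bit shifts inside a word.

    shift_at maps a word index to the position of the boundary between
    "0" and "1" in the bit select control register for that word.
    """
    out = []
    base = 0      # index in y of the first sample held in register A
    counter = 0   # control counter for the (n:1) selectors
    for k in range(num_words):
        ab = y[base:base + 2 * n]               # registers A and B
        y_prime = ab[counter:counter + n + 1]   # (n + 1) selected data
        boundary = shift_at.get(k)
        if boundary is None:
            select = [0] * n                    # no shift: all "0"
        else:
            select = [0] * boundary + [1] * (n - boundary)
        # (2:1) selectors: take Y'[i] or Y'[i + 1] for each output bit
        out.extend(y_prime[i + select[i]] for i in range(n))
        base += n                               # B moves to A, new word into B
        if boundary is not None:
            counter += 1
            if counter == n:                    # counter full: reset and reload
                counter = 0
                base += n
    return out

n, num_words = 4, 10
y = list(range(100))                 # sample labels instead of data bits
shift_at = {1: 2, 3: 0, 4: 3, 5: 1, 6: 2}

# Equivalent per-sample offsets for the bit-serial reference.
offsets, shifts = [], 0
for t in range(n * num_words):
    k, i = divmod(t, n)
    if shift_at.get(k) == i:
        shifts += 1
    offsets.append(shifts)

parallel_out = parallel_track(y, n, num_words, shift_at)
serial_out = serial_track(y, offsets)
```

The two outputs agree sample by sample, including across the word in which the counter wraps around.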

4.2. Bit Shift Timing

[28] In this subsection, the bit shift timing is described in more detail. The delay tracking timing chart is shown in Table 1. During these operations, the sampled data are lined up in order from the left-end bit of register A to the right-end bit of register B.

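Table 1. Delay tracking timing chart. The (n:1) selector value ranges from 1 to n. (C) (or (D)) indicates a loading operation from register C (or D) of the buffer memory. The bit shift timing (α, β, γ, δ, ε, and ζ) is calculated by a priori software and is set to the (2:1) selector. The boundary between “0” and “1” represents the bit shift timing.

      |       | Positive delay rate                          | Negative delay rate
Clock | Shift | Register A | Register B | (2:1) | (n:1)       | Register A | Register B | (2:1) | (n:1)
------+-------+------------+------------+-------+-------------+------------+------------+-------+------
0     |       | k − 2      | k − 1 (C)  | all 0 | 1           | k − 2      | k − 1 (C)  | all 1 | n
1     |       | k − 1      | k (D)      | all 0 | 1           | k − 1      | k (D)      | all 1 | n
2     | 1     | k          | k + 1 (C)  | α     | 1           | k          | k + 1 (C)  | δ     | n
3     |       | k + 1      | k + 2 (D)  | all 0 | 2           | k + 1      | k + 2 (D)  | all 1 | n − 1
4     |       | k + 2      | k + 3 (C)  | all 0 | 2           | k + 2      | k + 3 (C)  | all 1 | n − 1
5     |       | k + 3      | k + 4 (D)  | all 0 | 2           | k + 3      | k + 4 (D)  | all 1 | n − 1
6     | 1     | k + 4      | k + 5 (C)  | β     | 2           | k + 4      | k + 5 (C)  | ε     | n − 1
7     |       | k + 5      | k + 6 (D)  | all 0 | 3           | k + 5      | k + 6 (D)  | all 1 | n − 2
8     |       | k + 6      | k + 7 (C)  | all 0 | 3           | k + 6      | k + 7 (C)  | all 1 | n − 2
...   |       |            |            |       |             |            |            |       |
      |       | j          | j + 1 (C)  | all 0 | n           | j          | j + 1 (C)  | all 1 | 1
      | 1     | j + 1      | j + 2 (D)  | γ     | n           | j + 1      | j + 2 (D)  | ζ     | 1
      |       | j + 3 (C)  | j + 4 (D)  | all 0 | 1           | j + 1      | j + 2      | all 1 | n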

4.3. Parallel Fringe Rotation

[29] The built-in controller executes a priori calculations for every cycle of the parallel bit stream clock. The fringe rotator is shown in Figure 9. The 4-bit fringe phase is generated by the fringe NCO, and the 90° jump control is a counter of the number of bit shifts in Figure 4. The bit selection control register is the same register as in Figure 4.

[30] Usually, the bits of the fractional fringe phase control register are all “0.” The fringe phase for the current parallel bit stream clock cycle is compared with that for the previous cycle, which is n (the parallel width) sampling periods earlier. When the phase difference is more than π/8, “0” and “1” are set in the phase control register. The boundary between the “0” and “1” bits indicates the fractional phase shift timing, at which the fringe phase must be changed by ±π/8 radian. The position of the boundary is determined (Figure 7) by the division circuits.

[31] It is assumed that the dividend value (ϕ) is less than the divisor value (Δϕ) and that n is the number of parallel data. Then ϕ/Δϕ × n is calculated using the six-layered division circuit shown in Figure 8. The number of layers is given by log(n)/log(2); in Figure 8, n is 64. The circuit in each layer consists of a shift register (for 1/2), a subtraction circuit, a (2:1) selector, and an inverter. The operation is as follows: (1) The divisor is divided in two by the shift register. (2) The dividend minus the shift register output is calculated by the subtraction circuit. (3) If a borrow has occurred (“1”) in the subtraction circuit, then the (2:1) selector selects the dividend value; otherwise, the selector selects the output value of the subtraction circuit. (4) The selected value is used as the dividend value in the next layer. (5) The borrow signal is inverted to give an answer bit. (6) In the next layer, the operation is repeated from steps 1 to 5. (7) The answer is [pd5] × 2^{5} + [pd4] × 2^{4} + [pd3] × 2^{3} + [pd2] × 2^{2} + [pd1] × 2^{1} + [pd0], where [pd0] to [pd5] are shown in Figure 8.
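Steps 1 to 7 can be followed in a short software model of the layered divider. This is a sketch under stated assumptions: integer register values are used, and both dividend and divisor are pre-scaled by 2^layers so that repeated halving of the divisor loses no bits; how the hardware handles register width is not described in the text. The function name layered_divide is hypothetical.

```python
def layered_divide(phi, dphi, layers=6):
    """Compute floor(phi / dphi * 2**layers) by layered subtraction.

    Requires 0 <= phi < dphi. Each layer halves the divisor, subtracts
    it from the dividend, and emits the inverted borrow as one answer
    bit, most significant bit first.
    """
    if not 0 <= phi < dphi:
        raise ValueError("the dividend must be smaller than the divisor")
    # Pre-scaling keeps the halving exact (an assumption of this sketch).
    dividend = phi << layers
    divisor = dphi << layers
    answer = 0
    for _ in range(layers):
        divisor >>= 1                  # step 1: divisor divided in two
        diff = dividend - divisor      # step 2: subtraction
        borrow = diff < 0
        if not borrow:                 # step 3: (2:1) selector
            dividend = diff            # step 4: passed to the next layer
        answer = (answer << 1) | (0 if borrow else 1)  # step 5
    return answer

quarter = layered_divide(1, 4)         # boundary position for phi/dphi = 1/4
three_quarters = layered_divide(3, 4)
```

With six layers (n = 64), a ratio ϕ/Δϕ of 1/4 places the boundary at position 16 and a ratio of 3/4 at position 48.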

4.4. Parallel 90° Phase Jump

[32] Fringe stopping is performed at the band center frequency; the 90° phase jump and the bit shift are done simultaneously. The bit select control register, which controls delay tracking on the parallel bit stream in 1-bit steps, also controls the 90° phase jump in Figure 9. The boundary between the “0” and “1” bits of the bit select control register (Figure 7) indicates the fractional bit shift timing, and the 90° phase jump corresponds to action d in Figure 2. By these algorithms, the fringe stopping in Figure 2 is performed on the parallel data set.

4.5. Parallel Integration

[33] A simple real correlator/integrator consists of shift registers and an exclusive NOR (EXNOR). Correlated values are integrated by counters. The three-level approximation of the fringe signal (2-bit signals) is shown in Figure 10. In these signals, one bit is a sign bit and the other is a blank control bit. The sign bit is used for fringe stopping, and the blank control bit is used for controlling integration. Integration of a parallel bit stream by using counters is inefficient, so a summation table stored in ROM is used for the integration of the parallel bit stream, as shown in Figure 10.

[34] The correlated parallel bits are applied to the address bus of the summation table, and the value stored at the designated address is equivalent to the summation of the input data. The summation table is more effective for multibit quantization. In this case, the weighted and summed value is stored in the summation table. Therefore the multiplication and the summation are performed simultaneously on each parallel data clock.
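The summation table can be illustrated with a lookup table indexed by the correlated bits. The sketch below assumes an 8-bit address and a 64-bit parallel word packed into an integer; the address width, the function name correlate_word, and the use of an enable word (the complement of the blank signal) are assumptions made for this example.

```python
ADDRESS_BITS = 8  # hypothetical ROM address width

# Summation table: the entry at each address is the number of "1" bits
# in that address, i.e. the sum of the input bits.
SUM_TABLE = [bin(address).count("1") for address in range(1 << ADDRESS_BITS)]

def correlate_word(x_word, y_word, enable_word, width=64):
    """Correlate one parallel word; width must be a multiple of ADDRESS_BITS.

    Returns (matches, enabled): the number of enabled bit positions in
    which X and Y agree (EXNOR) and the number of enabled positions.
    """
    mask = (1 << width) - 1
    address_mask = (1 << ADDRESS_BITS) - 1
    agree = ~(x_word ^ y_word) & mask & enable_word   # EXNOR gated by enable
    matches = enabled = 0
    for shift in range(0, width, ADDRESS_BITS):
        matches += SUM_TABLE[(agree >> shift) & address_mask]
        enabled += SUM_TABLE[(enable_word >> shift) & address_mask]
    return matches, enabled

x_word = 0x0123456789ABCDEF
y_word = 0x0F0F0F0F0F0F0F0F
enable_word = 0xFFFF0000FFFF0000
matches, enabled = correlate_word(x_word, y_word, enable_word)
```

The correlation sum for the word is then 2 × matches − enabled, obtained from a few table reads instead of one counter update per bit.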

4.6. Data Synchronization

[35] The correlator is equipped with a function for automatic data synchronization so there is no need for external units. Time stamps composed of indicators of year, day, hour, minute, second, and the synchronous pattern (SYNC) code used in the time code recognition are inserted in the data at regular intervals. To absorb the transmission path delay, signals are stored in the buffer memory at the same time as the time stamp is received. This time stamp is generated by the asynchronous transfer mode (ATM) interface unit [Kiuchi et al., 2000]. Readout starts immediately after the time stamps from all observation stations have arrived, and this allows for timing synchronization, which is shown in Figure 11. The data are only output to the correlation part after the timing has been synchronized, so the output data for each station are correct up to the time at which the time stamp is applied. The size of the buffer memory is 64 Mbits/channel. The same data synchronization function is used for tape-based correlation.
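The buffering scheme can be sketched as a set of per-station queues that release data only when every station has delivered the same time stamp. This is an illustration only: the class name SyncBuffer, the integer time stamps, and the rule of discarding frames older than the common start time are assumptions and do not describe the actual buffer control.

```python
from collections import deque

class SyncBuffer:
    """Per-station frame buffers with time-stamp alignment."""

    def __init__(self, stations):
        self.queues = {station: deque() for station in stations}

    def push(self, station, time_stamp, data):
        """Store a frame as soon as its time stamp is received."""
        self.queues[station].append((time_stamp, data))

    def read(self):
        """Return (time_stamp, {station: data}) or None if not ready."""
        if any(not q for q in self.queues.values()):
            return None                      # some station has not arrived yet
        start = max(q[0][0] for q in self.queues.values())
        for q in self.queues.values():
            while q and q[0][0] < start:     # drop frames before common start
                q.popleft()
        if any(not q or q[0][0] != start for q in self.queues.values()):
            return None
        return start, {s: q.popleft()[1] for s, q in self.queues.items()}

buf = SyncBuffer(["Kashima", "Koganei"])
buf.push("Kashima", 10, "a")
buf.push("Kashima", 11, "b")
first = buf.read()                 # Koganei has not arrived yet
buf.push("Koganei", 10, "x")
second = buf.read()
third = buf.read()                 # Koganei frame for stamp 11 is missing
```

Readout is held back until all stations are present, which is how the transmission path delay is absorbed in this model.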

4.7. A Priori Calculation

[36] In current correlation systems, to remove the effects of the Earth's rotation, an a priori calculation (Earth's rotation parameters: wobble, diurnal polar motion, diurnal rotation, nutation, precession, aberration, time difference, etc.) is done on a host computer, and then the data are synchronized to a parameter reference time in the correlator. In the parallel bit stream correlator, an a priori calculation using the time code of the input data and the parameter setting is done by the built-in control unit. The following parameters are given by the control computer: (1) frequency table of the channels; (2) station positions (X, Y, Z); (3) star position (right ascension, declination); (4) Earth rotation parameters (ERP): UT1 and the X and Y components of wobble; (5) correlation start time, stop time, and parameter reference time (PRT); and (6) video bandwidth.

[37] Specific features of the correlator are that it outputs its files directly to the host computer via the network file system (NFS) and that it controls the start and stop of correlation by itself according to the input data time code. This algorithm is indispensable for a stand-alone correlation processor in a decentralized processing system and is especially useful in a real-time system [Kiuchi et al., 2000], which is realized by using high-speed networks. In a decentralized processing system, single-baseline correlators are installed in some nodes of the network. Each correlator is an element of the multibaseline correlator.

5. System Evaluation

[38] The gigabit system consists of a gigabit sampler, an ATM interface unit, and a gigabit correlator.

5.1. Gigabit Sampler

[39] The gigabit sampler is based on a commercially available digital oscilloscope (Tektronix TDS784). The oscilloscope has four analog-to-digital sampler chips, each of which operates at a maximum speed of 1 gigasample/s with a quantization level of 8 bits for each sample. The two MSBs of each sample (channel) are extracted from the digital oscilloscope and sent to the ATM interface unit. A block diagram of the gigabit sampler is shown in Figure 12. The sampling rate was increased by a factor of 25.6/25, so that the original 1 gigasample/s sampler could be operated as a 1.024 gigasamples/s sampler for compatibility with the existing system. Sample rates from 256 to 1024 megasamples/s are used with 1- and 2-bit quantization.

[40] The above 25.6/25 modification was made to the A/D sampled-signal pickup daughter board in the oscilloscope. The output signal is sent via high-speed parallel coaxial cable, and the 4 channels × (256, 512, or 1024 megasamples/s) × 2-bit sampling data are output from the pickup daughter board.

[41] The oscilloscope can calibrate itself by using signal path compensation. A self-calibration function is also used to calibrate the DC offsets of the A/D converters according to changes in the ambient temperature. This function is useful in multibit sampling.

5.2. ATM Interface Unit

[42] The ATM interface has a real-time clock that is phase locked to the data clock of the gigabit sampler. The data input from the gigabit sampler are formatted, and a time code is inserted. Channel selection (1/2/4 channels) and quantization bit selection (1 or 2 bits) are performed in the formatting section. The data output rate to the recorder or ATM line is selected from among four rates, ranging from 256 to 2048 Mbits/s.

5.3. Correlator Specifications

[43] Figure 13 is a photograph of the correlation system. The number of parallel data bits is 64. The maximum speed of processing is 2048 megabits/s/channel. Each of the four channels has 1024 complex lags. The integration circuit of each lag has a 31-bit accumulator. This correlator is a four-channel system; it accepts 2-bit quantization data because a four-channel 1-bit (two-level) correlator is equivalent to a one-channel 2-bit (four-level) one.

5.4. Real-Time Fringe Detection Using an ATM Network

[44] A real-time experiment [Kiuchi et al., 2000] between the Koganei and Kashima stations (109.1 km) of the Keystone project (KSP) [Koyama et al., 1998; Kiuchi et al., 1997] was carried out. The IF signal was down converted to a wide-bandwidth video signal by using a KSP local oscillator. The wide-bandwidth video signal was sampled by the gigabit sampler and then was formatted by the ATM interface unit. The formatted signal was transmitted from Kashima to Koganei via the ATM line. After that, correlation processing was performed. The experimental result is shown in Figure 14, displaying 512 lags around the fringe. The result shows that fringes were obtained; they are consistent with those of the current serial (low data rate) bit stream system in delay and delay rate. The correlated amplitudes are consistent within 10^{−4}, and the interval of both experiments is 15 min.

6. Conclusions

[45] The wide-bandwidth VLBI system is required to provide sensitivity sufficient for measuring very small flux densities of cosmic radio sources in astronomical applications and for achieving high accuracy in geodetic applications. The key factors determining the accuracy of geodetic measurements by VLBI are the total instantaneous bandwidth and the actual value of the observation frequency. The first factor is strongly related to the measurement uncertainty of the group delay, while the second is related to the ultimate resolution achievable. In current systems, the bandwidth is limited by the processing speed. In this paper, the effectiveness of a wide-bandwidth data processing system was investigated. By improving sampling and correlation technologies, it is possible to achieve data acquisition with a data rate of over 1 Gbit/s. The author established a parallel data processing algorithm that prevents coherence deterioration in wide-bandwidth data acquisition and high-speed processing. The algorithm is a bit-by-bit correlation algorithm in parallel data-step processing.

Acknowledgments

[46] First, I am deeply indebted to A. E. E. Rogers and A. Whitney, who designed the Mark-III and Mark-IV systems and framed the bandwidth synthesis theory. The Mark-III system and the bandwidth synthesis theory impressed me deeply. I am indebted to J. Amagai, T. Kondo, and Y. Takahashi, staff members of the National Institute of Information and Communications Technology, for their helpful technical discussions. I appreciate the help I received from Cosmo Research, Inc., and SONY, Inc., in putting together the VLBI system.