Latency Added by the NoC - Audio Processing Latency on a Multicore Platform

6.5 Audio Processing Latency on a Multicore Platform

6.5.2 Latency Added by the NoC

The TDM period (P_{TDM}) in the Argo NoC is proportional to the number of channels and the bandwidth values required. At this point, it is useful to recall the worst-case packet latency term introduced in Section 3.4.2, L_P_wc = P_{TDM} + 8, measured in clock cycles. The NoC latency is proportional to L_P_wc.

The latency of a NoC channel C_i,j between effects i and j on different cores, L_C_i,j, is calculated as shown in Equation 6.5, where BW_i,j is the bandwidth of the channel, measured in samples per TDM period, and B_i,j is the buffer size assigned to the channel.

Equation 6.5 gives an idea of how many TDM periods are required for the full buffer to be transferred from source to destination. This latency value is measured in the same unit as the packet latency, clock cycles. To convert it to samples, the number of clock cycles per sample period must be known.

L_{C_{i,j}} = \frac{B_{i,j}}{BW_{i,j}} \cdot L_{P_{wc}} \quad \text{[clock cycles]} \qquad (6.5)

As a simple example, if a channel has a buffer of 16 samples and a bandwidth of 2 samples, then 2 samples are transferred in each TDM period, so it takes 8 periods for the full buffer to arrive at the destination.
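The per-channel calculation of Equation 6.5 can be sketched in a few lines of Python. The function name and parameter names are hypothetical, chosen only for illustration. The worst-case packet latency of 18 clock cycles is borrowed from the example given later in this section, and plain division is assumed, since the text does not say how a buffer size that is not a multiple of the bandwidth would be handled.

```python
def channel_latency_cycles(buffer_samples, bandwidth_samples, packet_latency_wc):
    """Worst-case latency of one NoC channel in clock cycles (Equation 6.5).

    buffer_samples    -- B_i,j, buffer size assigned to the channel
    bandwidth_samples -- BW_i,j, samples transferred per TDM period
    packet_latency_wc -- L_P_wc, worst-case packet latency in clock cycles
    """
    if bandwidth_samples <= 0:
        raise ValueError("bandwidth must be positive")
    # Number of TDM periods needed to move the whole buffer.
    tdm_periods = buffer_samples / bandwidth_samples
    return tdm_periods * packet_latency_wc


# Buffer of 16 samples, bandwidth of 2 samples: 8 TDM periods.
# With an assumed L_P_wc of 18 cycles this gives 8 * 18 = 144 cycles.
example_latency = channel_latency_cycles(16, 2, 18)
print(example_latency)
```

The result is in clock cycles; converting it to samples requires the sampling period in clock cycles, as discussed next.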

To calculate the overall latency added by the NoC in a chain of audio effects, the latencies of all the NoC channels need to be added. This is shown in Equation 6.6 for a set of N effects (0, 1, 2, ..., N−1) that form a chain in which all the connections between consecutive effects are NoC channels. The value T_CC is the sampling period measured in clock cycles, which allows converting the NoC latency from clock cycles to samples. If there are parallel chains, only one of them needs to be considered, as the buffer sizes are equal in all chains.

L_{NoC} = \left\lceil \frac{L_{C_{0,1}} + L_{C_{1,2}} + L_{C_{2,3}} + \dots + L_{C_{N-2,N-1}}}{T_{CC}} \right\rceil \quad \text{[samples]} \qquad (6.6)

The L_{NoC} value shown is a worst-case value, because the worst-case packet latency was used to calculate it. In many cases, this value will be under one sample (because the sum of the NoC channel latencies is below one sampling period). Even in this case, a latency of 1 sample should be considered. In general, the L_{NoC} value should be rounded up to the next integer.
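The rounding behaviour of Equation 6.6 can be illustrated with a short Python sketch. The function name is hypothetical, and the channel latencies in the usage example are made-up illustrative values, not measurements from the platform.

```python
import math


def noc_latency_samples(channel_latencies_cycles, sampling_period_cycles):
    """Worst-case NoC latency of an effect chain in samples (Equation 6.6).

    channel_latencies_cycles -- L_C values of the NoC channels in the chain
    sampling_period_cycles   -- T_CC, sampling period in clock cycles
    """
    if sampling_period_cycles <= 0:
        raise ValueError("sampling period must be positive")
    total_cycles = sum(channel_latencies_cycles)
    # The ceiling rounds any fraction of a sampling period up to a full sample.
    return math.ceil(total_cycles / sampling_period_cycles)


# Illustrative chain: three channels of 288 cycles each, T_CC = 1536 cycles.
# The sum (864 cycles) is below one sampling period, yet counts as 1 sample.
short_chain = noc_latency_samples([288, 288, 288], 1536)
print(short_chain)
```

Note that in this sketch a chain with no NoC channels yields 0 samples, since the sum of latencies is then exactly zero.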

As an example, the setup presented in Figure 6.9 can be considered again. The values of the NoC parameters can be assumed, for instance, as a bandwidth of 1 sample per packet, with a worst-case packet latency of 18 clock cycles. Following Equation 6.5, the worst-case latency added by each channel of the system can be determined. The values are calculated for all the channels in the system shown in Figure 6.9:

L_{C_{0,1}} = L_{C_{2,4}} = L_{C_{6,7}} = \frac{16}{1} \cdot 18 = 288 \ \text{clock cycles}

L_{C_{4,5}} = L_{C_{5,6}} = \frac{64}{1} \cdot 18 = 1152 \ \text{clock cycles}

Channels C_0,3 and C_3,4 are not considered, as one of the parallel chains has already been accounted for. Channel C_1,2 is not considered either, as it is not a NoC channel.

If the latencies of all channels are added, we get the total NoC latency in clock cycles, according to Equation 6.6:


L_{NoC} = 288 \cdot 3 + 1152 \cdot 2 = 3168 \ \text{clock cycles}

This value is then divided by the sampling period measured in clock cycles, which on the current platform is obtained by dividing the processor clock frequency, 80 MHz, by the sampling frequency of 52.083 kHz. The result is 1536 cycles per sample, so the NoC latency measured in samples is only ⌈3168/1536⌉ = 3 samples.
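The arithmetic of this worked example can be checked with a small Python script. The variable names are hypothetical, and rounding the clock-to-sample-rate ratio to the nearest integer is an assumption made here to match the 1536 cycles per sample stated in the text.

```python
import math

CLOCK_HZ = 80_000_000      # processor clock frequency, 80 MHz
SAMPLE_RATE_HZ = 52_083    # sampling frequency, 52.083 kHz
PACKET_LATENCY_WC = 18     # worst-case packet latency L_P_wc, clock cycles
BANDWIDTH = 1              # samples per packet on every channel

# Sampling period in clock cycles (T_CC), rounded to the nearest integer.
t_cc = round(CLOCK_HZ / SAMPLE_RATE_HZ)

# Buffer sizes of the NoC channels that are counted in the example.
buffer_sizes = {"C_0,1": 16, "C_2,4": 16, "C_6,7": 16, "C_4,5": 64, "C_5,6": 64}

# Equation 6.5 applied to each channel.
channel_latencies = {
    name: (size // BANDWIDTH) * PACKET_LATENCY_WC
    for name, size in buffer_sizes.items()
}

# Equation 6.6: sum the channel latencies, then round up to whole samples.
total_cycles = sum(channel_latencies.values())
latency_samples = math.ceil(total_cycles / t_cc)

print(t_cc, total_cycles, latency_samples)  # 1536 3168 3
```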

Chapter 7 Implementation and WCET Analysis of the Platform

This chapter shows the implementation of the audio processing platform and the WCET analysis performed to guarantee its real-time functionality. Here, the individual effects presented in Chapter 5 are combined in the multiprocessor system, forming sequential and parallel audio effect chains. The effects are allocated to cores following the concepts and rules explained in Chapter 6. As previously explained, the platform used for the implementation is T-CREST, which is why this chapter represents a contribution to the T-CREST project.

Section 7.1 presents the architecture and technical details of the chosen audio processing platform implementation. First, how the system takes care of the signal latency is explained, and then the software architecture is shown, in the form of C data structures and functions. Section 7.2 shows both WCET analysis and experimental execution time measurements done for each effect. The results of the experimental measurements are used by the implemented effect allocator to decide how to map audio effects to cores. The allocator is also presented here.

7.1 Architecture and Technical Details

The chosen T-CREST platform implementation is the 2-by-2 bitorus topology with 4 Patmos processors, as the one previously shown in Figure 3.1, running on the Altera DE2-115 FPGA board. There is a master core, which is a Patmos processor connected to the audio interface component as an I/O device, and 3 slaves, standard Patmos cores. Therefore, the master core is in charge of audio input/output, and the slaves are only used to process effects. All the cores have the same cache and SPM sizes. The instruction cache has proved to be the main bottleneck of the system, as its size and associativity values needed to be increased to be able to compute all the effects in real-time. This is the main reason why a platform with more cores has not been used: the large associativity value of the instruction cache is a limitation for meeting the timing requirements in the FPGA platform.

Subsection 7.1.1 first explains how the master core takes care of the signal latency, a concept already discussed in Section 6.5. After that, the architecture of the chosen implementation is described in Subsection 7.1.2, where the general data structure of the effects is shown, and the main setup and audio processing functions are overviewed.

In document Audio Processing on a Multicore Platform (pages 95-100)