Parallel rendering concepts - Design for Scalability in 3D Computer Graphics Architectures

Rendering a complex scene is so computationally intensive that it requires billions of floating-point, fixed-point and integer operations for each frame. Real-time interactive rendering places even higher demands on the rendering system, as a minimum frame rate has to be maintained. Note that real-time rendering is usually a soft real-time problem, where the response time may be stretched slightly without major problems. Most real-time implementations do not require a hard real-time guaranteed response time to function (although that would be nice). To obtain the processing power needed for interactive rendering of complex scenes, we can rely on Moore’s law and wait until microprocessors and graphics accelerators become fast enough to solve the problem. Alternatively, if the processing power is wanted now, the only option left is parallel processing, in this case parallel rendering. Some concepts which are important for parallel rendering are discussed in the following sections.

Figure 2.1: Spatial coherence in a screen space region (32x32 tile). [Figure labels: span, scanline interpolation, span interpolation, span coherence, scanline coherence.]

2.2.1 Coherence

The term coherence is used in computer graphics to describe the tendency of features that are nearby in space or time to have similar properties [37, 215]. Coherence is useful for reducing the computational requirements of rendering, because it allows a sequential algorithm to process data incrementally. It is therefore important for a parallel rendering algorithm to preserve coherence, or it will suffer from computational overhead. Several types of coherence can be identified in rendering:

Spatial coherence refers to the property that pixels tend to have values similar to their neighbors in both the horizontal and vertical directions: when stepping from one scanline to the next (scanline coherence), and between pixels within a span (span coherence). Figure 2.1 illustrates these types of spatial coherence. A sequential rendering algorithm can use these kinds of coherence to reduce computation costs while interpolating parameters between triangle vertices during scan conversion. A popular incremental linear interpolation algorithm is forward differencing, or DDA (Digital Differential Analyzer) [58]. In a parallel renderer which partitions the screen into regions, coherence may be forfeited at region boundaries. Because of this, triangles which overlap several regions may cause computational overhead in a parallel renderer.
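As an illustration of the incremental interpolation mentioned above, the following minimal Python sketch interpolates one parameter across a span by forward differencing. The function names and the idea of restarting a span at a region boundary are assumptions made for illustration; they are not taken from the cited work.

```python
def forward_difference_span(v_left, v_right, n_pixels):
    """Interpolate a parameter across a span of n_pixels (n_pixels >= 2)."""
    delta = (v_right - v_left) / (n_pixels - 1)  # per-pixel increment, computed once
    values = []
    v = v_left
    for _ in range(n_pixels):
        values.append(v)
        v += delta  # one addition per pixel instead of a full re-evaluation
    return values


def span_segment(v_left, delta, x_start, x_end):
    """Interpolate only pixels x_start .. x_end-1 of a span (hypothetical helper).

    A renderer that owns just one screen region cannot continue from its
    neighbor's running value, so it repeats the setup at the boundary.
    """
    v = v_left + delta * x_start  # extra setup work per region
    values = []
    for _ in range(x_start, x_end):
        values.append(v)
        v += delta
    return values


full = forward_difference_span(0.0, 8.0, 5)
left = span_segment(0.0, 2.0, 0, 3)   # pixels 0..2 in one region
right = span_segment(0.0, 2.0, 3, 5)  # pixels 3..4 in the neighboring region
```

The two segments together reproduce the full span, but each region pays its own setup cost. This is one way to picture the overhead caused by triangles that overlap several regions.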

Figure 2.2: A process parallelized over three processors by using three different types of parallelism. a) Functional parallelism. b) Data parallelism. c) Temporal parallelism. [Figure labels: pipeline stages 1 to 3 on processors 1 to 3, connected by FIFOs between input and output.]

Temporal coherence is based on the observation that consecutive frames in an animation or interactive application tend to be similar. This may be useful for predicting workloads in a parallel renderer in order to improve load balancing. A good example of temporal coherence is MPEG video compression, which relies on incremental frame-to-frame encoding and motion compensation to achieve its high compression ratios.
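One possible way to use temporal coherence for load balancing is sketched below: the cost measured for each screen region in the previous frame serves as the prediction for the next frame, and regions are handed out greedily to the least-loaded worker. The greedy heuristic, the function name and the cost values are assumptions chosen for illustration, not a method described in the text.

```python
import heapq


def assign_regions(prev_frame_cost, n_workers):
    """Assign regions to workers using last frame's cost as the prediction.

    prev_frame_cost: dict mapping a region id to its measured cost.
    Returns a dict mapping worker index to a list of region ids.
    """
    heap = [(0, w) for w in range(n_workers)]  # (predicted load, worker)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    # Place the most expensive regions first, each on the least-loaded worker.
    by_cost = sorted(prev_frame_cost.items(), key=lambda kv: kv[1], reverse=True)
    for region, cost in by_cost:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(region)
        heapq.heappush(heap, (load + cost, worker))
    return assignment


# Hypothetical per-region costs measured while rendering the previous frame.
costs = {"A": 8, "B": 5, "C": 4, "D": 3}
assignment = assign_regions(costs, 2)
```

The prediction is only as good as the similarity between consecutive frames, so a sudden scene change would make it inaccurate for a frame.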

Data coherence is a more abstract term, but can for example be described as the tendency for multiple triangles or other data to contribute to nearby pixel regions. Data coherence is improved by locally caching data and reusing the cached data. It is related to both spatial and temporal coherence, as multiple triangles contributing to the same screen region can be grouped together to improve communication efficiency and usage of cached pixels. These properties are important for the efficiency of a parallel renderer.

Statistical studies of workload characteristics, examining different forms of coherence in various rendering tasks, were published in [150, 32].

2.2.2 Parallelism in rendering

Several types of parallelism can be exploited in rendering: functional parallelism, data parallelism and temporal parallelism. Figure 2.2 presents an overview. These basic types of parallelism can be used alone or combined into hybrid systems to exploit multiple forms of parallelism at once.

Figure 2.3: Standard rendering pipeline. [Stages: Application, Geometry, Rasterization, Composition, Display.]

Figure 2.4: A pipelined parallel renderer using FIFOs to queue data between stages. [Stages: Application, Geometry, Rasterization, Composition, Display, with a FIFO between consecutive stages.]

Functional parallelism – Pipelining

In computer graphics, 3D surface rendering can easily be expressed as a pipeline, where triangles are fed into the pipeline, data is passed serially from one processing unit to the next in the data path, and pixels are produced at the end. The standard rendering pipeline (figure 2.3) is an obvious candidate for functional parallelism, as each individual stage may be mapped to an individual processor. Several early commercial hardware renderers [5, 45] were based on functional parallelism by physically arranging programmable floating-point microprocessors in a pipeline, and mapping different stages of the rendering pipeline to different microprocessors (or “Geometry Engines”).

Functional parallelism has some major drawbacks, though. The overall speed of the pipeline is limited by its slowest stage, and it is susceptible to pipeline stalls.

Most pipelines use FIFO⁴ queues to balance the pipeline loads by queuing data between pipeline stages (figure 2.4), allowing upstream stages to continue working while a downstream stage is busy. A pipeline stall occurs when one pipeline stage takes longer to complete its task than the others, for example when a rasterizer is busy filling a huge triangle. Small pipeline stalls can be avoided by using FIFO queues to balance the load, provided that the processed data stream gives the pipeline an even workload distribution when averaged over time.
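The effect of queuing can be illustrated with a small idealized timing model. The Python sketch below compares a pipeline whose stages advance in lockstep with one whose stages are decoupled by FIFOs. The model, the assumption of unbounded queues, the function names and the cost values are all chosen for illustration only.

```python
def lockstep_time(costs):
    """Total time when all stages advance together, with no queues.

    costs[i][s] is the time item i needs in stage s. Every step lasts
    as long as the slowest stage that is busy in that step.
    """
    n_items, n_stages = len(costs), len(costs[0])
    total = 0
    for step in range(n_items + n_stages - 1):
        busy = [costs[step - s][s]
                for s in range(n_stages) if 0 <= step - s < n_items]
        total += max(busy)
    return total


def fifo_time(costs):
    """Total time when stages are decoupled by (assumed unbounded) FIFOs."""
    n_stages = len(costs[0])
    finish = [0] * n_stages  # when each stage finished its previous item
    for item in costs:
        ready = 0  # when this item left the previous stage
        for s in range(n_stages):
            start = max(ready, finish[s])  # wait for the data and for the stage
            finish[s] = start + item[s]
            ready = finish[s]
    return finish[-1]


# Two stages: the first item is slow in stage 2, the last is slow in stage 1.
costs = [[1, 5], [1, 1], [1, 1], [5, 1]]
t_lockstep = lockstep_time(costs)
t_fifo = fifo_time(costs)
```

In this example both stages have the same total work (8 time units), so the workload is even when averaged over time. The queue lets the first stage run ahead while the second is busy, and the stalls that the lockstep pipeline suffers largely disappear. A real FIFO has finite depth, so larger imbalances would still stall the pipeline.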

The level of parallelism achieved by functional parallelism is proportional to the number of pipeline stages. Functional parallelism does not scale well, since the pipeline has to be redesigned for a different number of pipeline stages each time the system is scaled. To achieve higher levels of performance, an alternative strategy is required.

⁴ First-In-First-Out

Data parallelism

While it may be simple to perform rendering using a single data stream through multiple specialized pipelined processors, it may be preferable to split the load into several data streams. This allows us to process multiple data items concurrently by replicating a number of identical processing units. Data parallelism is necessary to build scalable renderers because large numbers of processors can be utilized, making massively parallel systems possible. It is also possible to build different versions of a data parallel system, scaling the performance levels to match the required tasks simply by varying the number of processing elements.

Data parallelism can be implemented in rendering in many different ways. Two basic classes of data parallelism in rendering may be conceived: object parallelism and image parallelism.

Object-parallel rendering refers to an architecture which splits the rendering workload so each processor works independently on individual geometric objects in a scene.

Image-parallel rendering refers to partitioning the processing of the pixels for the final image. Each processor is responsible for its own set of pixels, and works independently of the other processors.
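The difference between the two classes can be sketched in Python as two ways of splitting the same triangle list. The round-robin split, the bounding-box test and all names below are assumptions made for illustration. The bounding-box test is conservative and may assign a triangle to a tile it does not actually cover.

```python
import math


def object_parallel_split(triangles, n_workers):
    """Object parallelism: each worker gets its own subset of the triangles."""
    return [triangles[w::n_workers] for w in range(n_workers)]


def image_parallel_bins(triangles, tile_size, tiles_x, tiles_y):
    """Image parallelism: each tile lists the triangles that may touch it.

    Triangles are lists of three (x, y) screen-space vertices. A triangle
    is assigned to every tile overlapped by its bounding box.
    """
    bins = {}
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        tx0 = max(0, math.floor(min(xs)) // tile_size)
        tx1 = min(tiles_x - 1, math.floor(max(xs)) // tile_size)
        ty0 = max(0, math.floor(min(ys)) // tile_size)
        ty1 = min(tiles_y - 1, math.floor(max(ys)) // tile_size)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins


# A hypothetical 64x64 screen split into four 32x32 tiles.
triangles = [
    [(2, 2), (10, 4), (5, 12)],      # lies inside tile (0, 0)
    [(20, 20), (40, 22), (30, 40)],  # straddles all four tiles
]
per_worker = object_parallel_split(triangles, 2)
bins = image_parallel_bins(triangles, tile_size=32, tiles_x=2, tiles_y=2)
```

In the object-parallel split every triangle is handled exactly once. In the image-parallel split the second triangle appears in four bins, which shows how primitives that overlap several regions add work.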

Object parallelism and image parallelism can be combined: object-parallel computations are performed at the front-end of the rendering pipeline, and image parallelism is exploited at the back-end of the rendering pipeline. Load balancing between the front-end and back-end must be handled. The workload must also be balanced between the individual workers in each stage. Communication patterns between front-end and back-end are crucial for the scalability of such a system, which will be discussed later. Functional and data parallelism can also be combined to gain additional speed, e.g. by building a pipeline of processor farms. Theory behind pipelined processor farms (PPF) is covered in the recently published book [62]. Figure 2.5 shows how a pipeline of parallel processor farms can be used to parallelize the 3D graphics pipeline. Note that the data communication between the pipeline stages is not simple. Data redistribution or sorting is required between some of the pipeline stages to implement a parallel renderer.

Temporal parallelism

Figure 2.5: Parallel feed-forward rendering pipeline. [Stages in dataflow direction: Application, Geometry, Rasterization, Composition, Display, with a Distribute step between consecutive stages.]

Temporal parallelism works by rendering several different frames of an animation concurrently. Batch renderers, such as those used for rendering 3D animated special effects for Hollywood movies, typically use temporal parallelism to distribute the workload over a “rendering farm” of workstations, each rendering their own set of animation frames. This is the only type of parallel rendering that is considered to be embarrassingly parallel, as each worker process executes the entire rendering pipeline implemented as a sequential 3D renderer.
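The batch form of temporal parallelism can be sketched in a few lines of Python. Here `render_frame` is a hypothetical stand-in for a complete sequential renderer, and a thread pool stands in for the separate workstations of a rendering farm. Both are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def render_frame(frame_number):
    """Hypothetical stand-in for a full sequential 3D renderer."""
    return f"frame_{frame_number:04d}"


def render_animation(first_frame, last_frame, n_workers):
    """Render frames independently; no data is exchanged between workers."""
    frames = range(first_frame, last_frame + 1)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns the results in frame order regardless of which
        # worker finishes first.
        return list(pool.map(render_frame, frames))


rendered = render_animation(1, 4, n_workers=2)
```

Because no frame depends on another, the only coordination is handing out frame numbers and collecting results.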

When used for interactive real-time rendering, temporal parallelism can exploit the latency in the rendering pipeline to overlap the rendering of two consecutive frames by using two separate graphics pipelines. This technique requires that the frame rate is high enough to hide the effect of latency. Note that in this case the achieved parallelism can also be thought of as high-level pipelining.
