This chapter first discusses some general issues regarding system-specific tuning, then provides tuning information that is relevant for particular Silicon Graphics systems. Use these techniques as needed if you expect your program to be used primarily on one kind of system, or a group of systems. The chapter discusses:
Some points are also discussed in earlier chapters but repeated here because they result in particularly noticeable performance improvement on certain platforms.
|Note: To determine your particular hardware configuration, use /usr/gfx/gfxinfo. See the reference page for gfxinfo for more information. You can also call glGetString() with a GL_RENDERER argument. See the reference page for information about the renderer strings for different systems.|
Many of the performance tuning techniques discussed in the previous chapters (such as minimizing the number of state changes and disabling features that are not required) are a good idea no matter what system you are running on. Other tuning techniques need to be customized for a particular system. For example, before you sort your database based on state changes, you need to determine which state changes are the most expensive for each system you are interested in running on.
In addition, you may want to modify the behavior of your program depending on which modes are fast. This is especially important for programs that must run at a particular frame rate. To maintain the frame rate on certain systems, you may need to disable some features. For example, if a particular texture mapping environment is slow on one of your target systems, you have to disable texture mapping or change the texture environment whenever your program is running on that platform.
Before you can tune your program for each of the target platforms, you have to do some performance measurements. This is not always straightforward. Often a particular device can accelerate certain features, but not all at the same time. It is therefore important to test the performance for combinations of features that you will be using. For example, a graphics adapter may accelerate texture mapping but only for certain texture parameters and texture environment settings. Even if all texture modes are accelerated, you have to experiment to see how many textures you can use at the same time without causing the adapter to page textures in and out of the local memory.
A more complicated situation arises if the graphics adapter has a shared pool of memory that is allocated to several tasks. For example, the adapter may not have a framebuffer deep enough to contain a depth buffer and a stencil buffer. In this case, the adapter would be able to accelerate both depth buffering and stenciling but not at the same time. Or perhaps, depth buffering and stenciling can both be accelerated but only for certain stencil buffer depths.
Typically, per-platform testing is done at initialization time. You should do some trial runs through your data with different combinations of state settings and calculate the time it takes to render in each case. You may want to save the results in a file so your program does not have to do this test each time it starts up. You can find an example of how to measure the performance of particular OpenGL operations and save the results using the isfast program from the OpenGL web site.
This section discusses how you can get the best results from your application on low-end graphics systems, such as the Indy, Indigo, and Indigo2 XL systems (but not other Indigo2 systems). It discusses the following topics:
By emphasizing features implemented in hardware, you can significantly influence the performance of your application. As a rule of thumb, consider the following:
Hardware-supported features: Lines, filled rectangles, color shading, alpha blending, alpha function, antialiased lines (color-indexed and RGB), line and stippling patterns, color plane masks, color dithering, logical operations selected with glLogicOp(), pixel reads and writes, screen to screen copy, and scissoring.
Software-supported features: All features not in hardware, such as stencil and accumulation buffer, fogging and depth queuing, transforms, lighting, clipping, depth buffering, and texturing. Triangles and polygons are partially software supported.
The low-end graphics systems' FIFO allows the CPU and the graphics subsystem to work in parallel. For optimum performance, follow these guidelines:
Make sure the graphics subsystem always has enough in the queue.
Let the CPU perform preprocessing or non-graphic aspects of the application while the graphics hardware works on the commands in the FIFO.
Note that FIFOs in low-end systems are much smaller than those in high-end systems. Not all graphics processing happens in hardware, and the time spent therefore differs greatly. To detect imbalances between the CPU and the graphics FIFO, execute the gr_osview command and observe gfxf in the CPU bar and fiwt and finowt in the gfx bar.
If your application seems transform limited on a low end system, you can improve it by considering the tips in this section. The section starts with some general points, then discusses optimizing line drawing and using triangles and polygons effectively.
To improve performance in the geometry subsystem, follow these guidelines:
Even on low-end systems, lines can provide real-time interactivity. Consider these guidelines:
Wide lines are drawn as multiple parallel offset lines.
The hardware can usually draw lines faster than the software can produce commands, though long or antialiased lines can cause a backup in the graphics pipeline.
Maximize the number of vertices between glBegin() and glEnd().
Use connected primitives (triangle, quad, and line strips). Use triangle strips wherever possible and draw as many triangles as possible per glBegin()/glEnd() pair.
When rendering solid triangles, consider the following:
Color shading and alpha blending are performed in hardware on Indy and Indigo2 XL systems. Consult system-specific documentation for information on other low-end systems.
Larger triangles have a better overall fill rate than smaller ones because CPU setup per triangle is independent of triangle size.
This section looks at some things you can do if your application is fill limited on a low-end system. It provides information about getting the optimum fill rates and about using pixel operations effectively.
The hardware accelerates drawing rectangles that have their vertical and horizontal edges parallel to those of the window. The OpenGL glRect() call (unlike the IRIS GL rect() call) specifies a rectangle that, depending on the current matrix transformations, may not be screen-aligned. If the matrices are such that the rectangle drawn by glRect() is screen-aligned, OpenGL detects this and uses the accelerated mode for screen-aligned rectangles.
Using dithering, shading, patterns, logical operations, writemasks, stencil buffering, and depth buffering (and alpha blending on some systems) slows down an application.
In any OpenGL matrix mode, low-end systems check for transforms that only scale, and have no rotations or perspective. The system looks at the specified matrices, and if they only scale and have no rotation or perspective, performs optimizations that speed up transformation of vertices to device coordinates. One way to specify this is as follows:
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
gluOrtho2D(0, xsize, 0, ysize);
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
glShadeModel(GL_FLAT);
You also have to use a glVertex2fv() call to specify 2D vertices.
glEnable(GL_TEXTURE_2D);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_REPEAT);
glHint(GL_PERSPECTIVE_CORRECTION_HINT, GL_FASTEST);
In addition, follow these guidelines:
For RGB textures, make sure the texture environment mode, set with glTexEnv(), is either GL_REPLACE or GL_DECAL.
For RGBA textures, make sure the texture environment mode is GL_REPLACE.
Note that the above fast path does not apply when stenciling is enabled, or when depth buffering and alpha testing are both enabled.
|Note: Avoid using depth buffering whenever possible, because the fill rates for depth buffering are slow. In particular, avoid using depth-buffered lines. The depth buffer is as large as the window, so use the smallest possible window to minimize the amount of memory allocated to the depth buffer. The same applies for stencil buffers.|
Write your OpenGL program to use the combinations of pixel formats and types listed in Table 16-1, for which the hardware can use DMA. The CPU has to reformat pixels in other format and type combinations.
Here are some additional guidelines for optimizing pixel operations:
When you scroll something, such as a teleprompter text scroll or an EKG display, use glCopyPixels() to shift the part of the window in the scrolling direction, and draw only the area that is newly exposed. Using glCopyPixels() is much faster than completely redrawing each frame.
Minimizing calls. Make each pixel operation draw as much data as possible. For each call, a certain amount of setup is required; you cut down on that time if you minimize the number of calls.
Depth and scissoring. Low-end systems use an accelerated algorithm that makes clearing the depth buffer virtually free. However, this has slowed enabling and disabling scissoring and changing the scissor rectangle. The larger the scissor rectangle, the longer the delay. As a result:
Rendering while scissoring is turned on is fast.
Calling glEnable() and glDisable() with GL_SCISSOR_TEST, calling glScissor(), and pushing and popping attributes that cause a scissor change are slow.
For Indy and Indigo2 XL systems, an extension has been developed that increases fill rate by drawing pixels as N x N rectangles (effectively lowering the window resolution). This “framezoom” extension, SGIX_framezoom, is available as a separate patch under both IRIX 5.3 and IRIX 6.2 (and later).
|Note: This extension is experimental. The interface and supported systems may change in the future.|
When using the extension, consider the following performance tips:
The extension works best when texturing is enabled. When pixels are zoomed up by N, you can expect the fill rate to go up by about N^2/2. This number is an estimate; a speedup of this magnitude occurs only if texturing performance has been optimized as explained in the last bullet of “Getting the Optimum Fill Rates”.
When texturing is not enabled, framezoomed rendering, although faster than the textured case, is relatively slow compared to non-framezoomed rendering; with a framezoom value of 2, it is actually slower than with framezoom disabled. The reason is that the graphics hardware in low-end systems is optimized for flat or interpolated spans, not for cases where the color changes from pixel to pixel (as with texturing). When pixels are bigger (as with the framezoom extension), this optimization cannot be used.
The framezoom factor can be changed on a frame-to-frame basis, so you can render with framezoom set to a larger value when you are moving around a scene, and lower the value, or turn framezoom off, when there are no changes in the scene.
An O2 system is similar to previous low-end systems in that it divides operations in the OpenGL pipeline between the host CPU and dedicated graphics hardware. However, unlike previous systems, graphics hardware on the O2 handles more of the graphics pipeline in hardware. In particular, it is capable of rasterizing triangle-based primitives directly without the host having to split them into spans, and it performs all of the OpenGL per-fragment operations. The CPU is still responsible for vertex and normal transformation, clipping, lighting, and primitive-level set-up.
In addition to using the CPU for geometry operations and the Graphics Rendering Engine (GRE) for per-fragment operations, a number of imaging extensions and pixel transfer operations are accelerated by the Imaging and Compression Engine (ICE).
The section “Optimizing Performance on Low-End Graphics Systems” lists recommendations in “Using Geometry Operations Effectively”. Many of these recommendations apply to the O2 system as well. There are, however, some differences worth mentioning:
Generic 3D transformations with perspective are comparable in speed to 2D transformations because the floating-point pipeline in the R5000 and R10000 CPUs is much faster than previous-generation CPUs. However, always put perspective in the projection matrix and not in the modelview matrix to allow for faster normal transformation.
Minimize attribute setup; attribute setup for each primitive is performed on the CPU. For example:
Use flat shading if the color of the model changes rarely or does not change within the primitives that make up the model.
Don't enable depth-buffering when rendering lines.
Turn off polygon offset when not in use.
Choose a visual with no destination alpha planes if destination alpha blending is not used.
When using fog, set the param argument to GL_LINEAR instead of GL_EXP or GL_EXP2. GL_LINEAR uses vertex fogging, which is hardware accelerated on O2 systems, instead of per-pixel fogging, which is not.
When continuously rendering a large amount of static geometry elements, consider storing the geometry elements in display lists. When vertices and vertex attributes are stored in display lists, the R10000 CPU can prefetch the data in anticipation of its use and thus reduce read latency for data that cannot fit in the caches.
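The fog recommendation above amounts to a few lines of state setup. The start and end distances below are placeholder values to be tuned to the depth range of the scene.

```c
/* Prefer linear (per-vertex) fog on O2 systems; GL_EXP and GL_EXP2
 * use per-pixel fogging, which is not hardware accelerated there. */
glEnable(GL_FOG);
glFogi(GL_FOG_MODE, GL_LINEAR);
glFogf(GL_FOG_START, 10.0f);   /* example distance; tune to scene */
glFogf(GL_FOG_END,  100.0f);   /* example distance; tune to scene */
```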
The n32 version of the OpenGL library is somewhat faster than the o32 version because of the more efficient parameter-passing convention and the larger number of floating-point registers that the n32 compilation mode offers. Furthermore, using n32 can improve application speed because the compiler can generate MIPS IV code and schedule instructions optimally for the R5000 or the R10000 CPU.
Lighting on O2 systems is faster than on previous low-end systems because of the better floating-point performance of its CPUs. However, the larger the number or the more complex the lights (local lights, for instance), the larger the amount of work the CPU has to perform. Two-sided lighting is not a “free” operation, so consider using single-sided lighting, if possible.
Line drawing for low-end systems is discussed in some detail in “Optimizing Line Drawing”. On O2 systems, almost all line rendering (rasterization) modes are hardware supported.
The following kinds of lines need to be rasterized by the CPU and will perform significantly slower:
anti-aliased RGB lines that are either wide (line width greater than 1) or stippled
all types of anti-aliased color-index lines
Triangle drawing for low-end systems is discussed in some detail in “Optimizing Triangles and Polygons”. Note the following points:
Triangle strips are the most efficient triangle path through the OpenGL pipeline. Maximize the number of vertices between glBegin() and glEnd().
Polygon stippling is not hardware supported. Because a stippled polygon has to be rasterized on the CPU, enabling polygon stippling will cause a significant performance degradation.
If the application is using polygon stippling for screen-door transparency effects, consider instead using alpha blending to emulate transparency. If using alpha blending is not possible, consider setting the GLCRMSTIPPLEOPT environment variable. Setting this variable enables an optimization that uses the stencil planes to emulate polygon stippling if the application does not use the stencil planes. However, note that if the stipple pattern changes often during the rendering of a frame, the performance benefits may be lost to the time spent repainting the stencil planes with the different patterns.
This section discusses how to use per-fragment operations effectively on O2 systems.
The rasterization hardware has the same fill rates whether the shading model is smooth or flat. If the application is rendering very large areas, there should be little difference in the performance between smooth and flat shading. However, remember that setting up smooth-shaded primitives is more expensive on the CPU side.
The framebuffer on O2 systems can be configured four different ways (16, 16+16, 32, 32+32), allowing applications to trade memory usage and rendering speed against image quality. Apart from pixel depth, the main difference between these framebuffer types is where the back-buffer pixels of a double-buffered visual reside. For the 16-bit and 32-bit framebuffers, the front and back buffers share the same pixel, each buffer taking half of it. For the 16+16 and 32+32 framebuffers, the back buffer is allocated as needed in a region separate from the main framebuffer. As a result, the 16+16 and 32+32 configurations can offer single-buffered and double-buffered visuals with the same color depth, but they need more memory when double buffering is used.
The framebuffer's configuration (size and depth) affects fill rate performance. In general, the deeper the framebuffer, the more data the GRE (graphics rendering engine) needs to write to memory to update the framebuffer and the more data the graphics back-end needs to read from the framebuffer to refresh the screen. Note that for double-buffered applications, better fill rates can be achieved with the split 16+16 framebuffer than with the 32-bit framebuffer. This is because the new color information can be written to the pixels directly instead of having to be combined with what is in the framebuffer. This is especially important for fill-rate limited texture mapping operations, buffer clears and pixel DMAs.
An O2 system stores texture maps in system memory. The amount of texture storage is therefore limited only by the amount of physical memory on the system. Texture memory is partitioned into, and made available in, 64 KB tiles; the actual memory usage for a texture is rounded up to a multiple of 64 KB, and is therefore higher than if the same texture were packed optimally in memory.
Tile-based texture memory also means that the minimum memory usage for any texture is one tile and the amount of “wasted” texture memory can quickly add up if the application uses a large number of small textures. In that case, consider combining small textures into larger ones and using different texture coordinates to access different sections of the larger texture map.
The following texture formats are supported directly by the graphics hardware and require no conversion when specified by the application:
8-bit luminance or intensity
16-bit luminance-alpha (8:8 format)
16-bit RGBA (5:5:5:1 format)
16-bit RGBA (4:4:4:4 format)
32-bit RGBA (8:8:8:8 format)
Applications that use more than one texture should use texture objects, now part of OpenGL 1.1, for faster switching between multiple textures. Although fast, binding a texture is not a free operation and judicious minimization of its use during frame rendering will increase performance. This can be achieved by rendering all the primitives that use the same texture object at the same time.
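A minimal sketch of this pattern, using the standard OpenGL 1.1 texture object calls; the drawing helpers in the comments are hypothetical.

```c
/* Create texture objects once, at setup time. */
GLuint tex[2];
glGenTextures(2, tex);

glBindTexture(GL_TEXTURE_2D, tex[0]);
/* glTexImage2D(...), glTexParameteri(...) for the first texture */

glBindTexture(GL_TEXTURE_2D, tex[1]);
/* ... define the second texture ... */

/* Per frame: draw everything that uses tex[0], then everything that
 * uses tex[1], instead of alternating binds per primitive. */
glBindTexture(GL_TEXTURE_2D, tex[0]);
/* draw_all_primitives_using_texture_0();  -- hypothetical helper */
glBindTexture(GL_TEXTURE_2D, tex[1]);
/* draw_all_primitives_using_texture_1();  -- hypothetical helper */
```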
The texture filters GL_NEAREST and GL_LINEAR result in the best texture fill rates, whereas GL_LINEAR_MIPMAP_LINEAR results in the worst fill rate. In cases where texture maps are being minified and only bilinear filtering is required, consider using mipmaps with the minification filter set to GL_LINEAR_MIPMAP_NEAREST. This filter gives the graphics engine better cache locality and better fill rates.
The 3D texture mapping, texture color table, and texture scale bias extensions are supported by the O2 OpenGL implementation, but are not hardware accelerated. Enabling one of these extensions will therefore result in significantly slower rendering.
The graphics rendering engine does not allow updating both the front and back buffers at the same time (glDrawBuffer(GL_FRONT_AND_BACK)). In order to support this functionality, the OpenGL needs to specify the primitive being rendered to the graphics hardware twice, once for both the front and back buffer. This is an expensive operation and applications should try to avoid using concurrent updates to both front and back buffers.
The following is a table of pixel types and formats for which hardware DMA can be used.
The pixel DMA paths support stencil, depth, and alpha tests, fogging, blending, and texturing.
Stencil indices can be sent via DMA as 32-bit unsigned int values, where the most significant 8 bits are transferred, using a stencil shift value of –24 for draw operations and 24 for read operations.
Depth components can be sent via DMA as 24-bit unsigned int values, using a depth scale of 256 for draw operations and 1/256.0 for read operations. For draw operations, the depth test must be enabled with a function of GL_ALWAYS, and the color buffer must be set to GL_NONE.
Pixel zooms are accelerated for whole integer factors from -16 to 16, and integer fractions from -1/16 to 1/16 on all DMA paths.
Most read pixel operations on O2 will be significantly faster when the destination buffer and row lengths are 32-byte aligned.
O2 systems contain a multipurpose compute ASIC called the Imaging and Compression Engine (ICE), which serves both DCT-based compression algorithms and OpenGL image processing. All the elements of the OpenGL imaging pipeline (that is, the pixel transfer modes) are implemented on ICE, but some functions (such as convolution and color matrix multiplication) benefit greatly, while others (such as histogram and color table) benefit less. This section discusses the support provided by ICE and gives some programming tips.
Pixel Formats. ICE supports the 8-bit GL_RGBA, GL_RGB, and GL_LUMINANCE pixel formats. Because the O2 graphics hardware does not support an RGB framebuffer type, RGB pixels have to be converted to RGBA before they can be displayed. Instead of using the CPU to perform this conversion, glDrawPixels() uses the wide loads and stores and DMA engine on ICE. It is possible to use other pixel formats such as luminance-alpha or color index, but for those formats, the CPU performs all image processing calculations.
64 KB Tiles. The memory system natively allocates memory for the framebuffer and pbuffers in 64 KB tiles. ICE takes advantage of this by having a translation look-aside buffer (TLB) in the DMA engine that maps 64 KB tiles.
Buffer-to-buffer fast path. Because ICE can directly map tiles without further manipulation, buffer-to-buffer transfers (that is, glCopyPixels()) are the fastest path through the imaging pipeline on O2. While not explicitly an imaging operation, ICE supports span conversion between GL_RGBA and GL_LUMINANCE on the pixel transfer path, including glCopyPixels(). glDrawPixels() is the next fastest path.
Image Width. Any image width up to 2048 pixels is permitted, but image widths that are a multiple of 16 pixels are optimal. If the image width is not a multiple of 16, the CPU uses bcopy() to pad the image to the closest modulo-16 width. Note that row pack and row unpack settings and certain clipping and zoom combinations can cause the internal image width to change from one that was modulo 16 pixels.
Number Formats. The vector processor on ICE dictates to a large extent the numerical representation of coefficients that can be used. There are two number formats on ICE: integer and fixed point (s.15). Values should therefore be either in [-1.0, 1.0) or strictly integer. Numbers outside this range force the library to perform the calculations on the CPU. Developers have not found this too restricting, because a multiplication by 1.9, for example, can also be expressed as a multiplication by 0.95 followed by a multiplication by 2.0. OpenGL allows this trick through the post color scale functions.
Memory. Some programming restrictions arise from the need to balance the amount of state kept on the chip and the amount of memory available for image data. The 6 KB of data RAM is organized into 3 banks. Bank C is 2 KB and is used for storing color tables, histogram, convolution coefficients, and 256 bytes of internal state. In order to remain on the fast path, the total bytes used for items in Bank C must be less than 2 KB. Because of that limitation, two color tables specified as GLbyte and GL_RGBA will not be hardware accelerated. This is not a problem if the application can specify the color tables as GL_LUMINANCE or GL_LUMINANCE_ALPHA.
Color Tables. The number, type, and format of color tables is important to keep the application on the fast path. Up to two color tables or one color table and one histogram can be accelerated on the O2 imaging fast path. The internal format of the color table can be luminance, luminance-alpha, or rgba. The color table type must be GL_BYTE. While the texture color table is not supported, using the color table extension on texture load is an alternative.
Convolution. Both general and separable convolutions are hardware accelerated on O2. Convolution kernel sizes that are accelerated are 3x3, 5x5, and 7x7. Applications can gain further performance improvement by specifying the kernel as GL_INTENSITY (note that this is different from GL_LUMINANCE). O2 systems cannot accelerate convolutions and histograms at the same time. See “EXT_convolution—The Convolution Extension” for more information.
Separating Components. On other graphics architectures, there is a significant advantage to processing image components one at a time, and some OpenGL implementations use the color matrix multiply function to separate out components. There is no such advantage on O2; the matrix multiply is intended for color correction. Unlike color scale and bias and convolution values, matrix multiply values must be in the [-1.0, 1.0) range for hardware acceleration.
Histograms. Histograms are internally calculated with 16-bit bins, and the internal format is always GL_RGBA. While an application can request different formats, the histogram is always calculated as RGBA.
Pixel Extensions: EXT_abgr, EXT_packed_pixels, SGIX_interlace
Blending Extensions: EXT_blend_color, EXT_blend_logic_op, EXT_blend_minmax, EXT_blend_subtract
Imaging extensions: EXT_convolution, EXT_histogram, SGI_color_matrix, SGI_color_table
Buffer and Pbuffer extensions: EXT_import_context, EXT_visual_info, EXT_visual_rating, SGIX_dm_pbuffer, SGIX_fbconfig, SGIX_pbuffer
Texture extensions: EXT_texture3D, SGIS_texture_border_clamp, SGIS_texture_color_table, SGIS_texture_edge_clamp
Supported only on O2 systems: SGIS_generate_mipmap, SGIS_texture_scale_bias. These two extensions are not discussed in this manual.
Video and swap control extensions: SGI_swap_control, SGI_video_sync, SGIX_video_source.
This section discusses optimizing performance for two of the Silicon Graphics mid-range systems: Elan graphics and Extreme graphics. For information on Indigo2 IMPACT systems, see “Optimizing Performance on Indigo2 IMPACT and OCTANE Systems”.
The following general performance tips apply to mid-range graphics systems:
Data size. Mid-range graphics systems are optimized for word-sized and word-aligned data (one word is four bytes). Pixel read and draw operations are fast if the data is word aligned and each row is an integral number of words.
Extensions. The ABGR extension is hardware accelerated (see “EXT_abgr—The ABGR Extension”).
Other available extensions are implemented in software.
In double buffer mode, it is not necessary to call glFlush(); the glXSwapBuffers() call automatically flushes the pipeline (implicit flushing).
Consider the following points when optimizing geometry operations for a mid-range system:
Consider the following issues when optimizing per-fragment operations for a mid-range system:
Alpha Blending. Mid-range graphics systems support alpha blending in hardware. All primitives can be blended, with the exception of antialiased lines and points, which use the blending hardware to determine pixel coverage. The alpha value is ignored for these primitives. Pixel blends work best in 24-bit, single-buffered RGB mode. In double-buffered RGB mode, the blend quality degrades.
Dithering. Dithering is used to expand the range of colors that can be created from a group of color components and to provide smooth color transitions. Dithering is enabled by default; disabling it with glDisable(GL_DITHER) can improve the performance of glClear().
Fog. If your application uses fog, call
glHint(GL_FOG_HINT, GL_FASTEST)
to select the fastest available fog computation.
Pixel formats. The GL_ABGR_EXT pixel format is much faster than the GL_RGBA pixel format. For details, see “EXT_abgr—The ABGR Extension”.
The combinations of types and formats shown in Table 16-3 are the fastest.
Elan Graphics accelerates depth buffer operations on systems that have depth buffer hardware installed (default on Elan, optional on XS and XS24, not available on Indigo2 systems).
depth buffer is cleared to 1 and the depth test is GL_LEQUAL
depth buffer is cleared to 0 and the depth test is GL_GEQUAL
This section provides performance tips for Indigo2 IMPACT and OCTANE graphics systems. All information applies to all Indigo2 IMPACT and OCTANE systems, except the sections on texture mapping, which do not apply to the Indigo2 Solid IMPACT and the OCTANE SI (without hardware texture mapping). The section discusses the following topics:
This section provides some general tips for improving overall rendering performance. It also lists some features that are much faster than on previous systems and may now be used by applications that could not consider them before.
Fill-rate limited applications. Because per-primitive operations (transformations, lighting, and so on) are very efficient, applications may find that they are fill-rate limited when drawing large polygons (more than 50 pixels per triangle). In that case, you can actually increase the complexity of per-primitive operations at no cost to overall performance. For example, additional lights or two-sided lighting may come for free.
For general advice on improving performance for fill-rate limited applications, see “Tuning the Raster Subsystem”. Note in this context that texture-mapping is greatly accelerated on Indigo2 IMPACT and OCTANE systems with hardware texture-mapping.
Geometry-limited applications. For applications that draw many small polygons, consider a different approach: Use textures to avoid drawing so many triangles. See “Using Textures”.
Clipping. For optimum performance, avoid clipping. Special hardware supports clipping within a small range outside of the viewport. By keeping geometry within this range, you may be able to significantly reduce clipping overhead.
Antialiasing. Antialiased lines on Indigo2 IMPACT systems are high quality and fast. Applications that did not use antialiased lines before because of the performance penalty may now be able to take advantage of them. All antialiased lines are rendered with the same high quality, regardless of the settings of GL_LINE_SMOOTH_HINT. Although available, wide antialiased lines (width greater than 1.0) are not supported in hardware and should be avoided. Wide antialiased points are supported in hardware with good performance.
Rendering of primitives is especially fast if you follow these recommendations:
Note that the hardware allows mixing of different lengths of triangle strips. Grouping like primitives is highly recommended.
Use glLoadIdentity() to put identity matrices on the stack. The system can optimize the pipeline when the identity matrix is used, but it does not check whether a matrix loaded by glLoadMatrix() is the identity matrix.
Indigo2 High IMPACT
OCTANE SI with hardware textures
Indigo2 Maximum IMPACT
Texture-mapping is greatly accelerated on systems with hardware texture, and is only slightly slower than non-textured fill rates. It also significantly improves image quality for your application. To get the most benefit from textures, use the extensions to OpenGL for texture management as follows:
Use texture objects to keep as many textures resident in texture memory as possible. You can bind a texture to a name, then use it as needed (similar to the way you define and call a display list). The extension also allows you to specify a set of textures and prioritize which textures should be resident in texture memory.
Texture objects are part of OpenGL 1.1. For OpenGL 1.0, they were implemented as the texture object extension (EXT_texture_object).
Use the texture-LOD extension to clamp LOD values, which has the side effect of communicating to the system which mipmap levels it needs to keep resident in texture memory. For more information, see “SGIS_texture_lod—The Texture LOD Extension”.
Use subtextures to make texture definitions more efficient. For example, assume an application uses several large textures, all of the same size and component type. Instead of declaring multiple textures, declare one, then use glTexSubImage2D() to redefine the image as needed.
Subtextures are part of OpenGL 1.1. They were implemented as the subtexture extension (EXT_subtexture) in OpenGL 1.0.
Use the GL_RGBA4 internal format to improve performance and conserve memory. This format is especially important if you have a large number of textures. The quality is reduced, but you can fit more textures into memory because they use less space.
Internal formats are part of OpenGL 1.1. They were implemented as part of the texture extension in OpenGL 1.0.
Use the GL_RGBA4 internal format and the packed pixels extension to minimize disk space and improve download rate (see “EXT_packed_pixels—The Packed Pixels Extension”).
Use the 3D texture extension for volume rendering. Note, however, that due to the large amount of data, you typically have to tile the texture. You can set up the texture as a volume and slice through it as needed. For more information, see “EXT_texture3D—The 3D Texture Extension”.
If you use GL_LUMINANCE and GL_LUMINANCE_ALPHA textures, you can speed up loading by using the texture-select extension (see “SGIS_texture_select—The Texture Select Extension”).
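The subtexture approach described above (one large texture redefined piecewise with glTexSubImage2D()) needs simple bookkeeping for tile offsets. A hypothetical sketch, with invented names:

```c
#include <assert.h>

/* Hypothetical atlas bookkeeping: one large texture holds a grid of
   tiles_per_row x tiles_per_row images, each tile_size x tile_size
   texels. Instead of creating a new texture per image, the application
   overwrites one tile with
   glTexSubImage2D(target, 0, xoffset, yoffset, w, h, format, type, data). */
typedef struct { int xoffset, yoffset; } TileOrigin;

TileOrigin tile_origin(int tile_index, int tiles_per_row, int tile_size)
{
    TileOrigin o;
    o.xoffset = (tile_index % tiles_per_row) * tile_size;  /* column */
    o.yoffset = (tile_index / tiles_per_row) * tile_size;  /* row    */
    return o;
}
```

With a 4 x 4 grid of 64-texel tiles, tile 5 lands at offsets (64, 64); the texture matrix can then be used to select that tile's texture coordinates.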
For Indigo2 IMPACT graphics, data coherence enhances performance. For example:
When you draw your geometry, cluster points, short lines, or very small triangles so that you are not jumping around the texture (you want to maintain texture data coherency).
If any minification is done to the texture, mipmaps result in improved performance.
When you use the pixel texture extension, performance varies based on the coherency of the lookup of pixel color data as texture coordinates. Applications have no control over this.
On many systems, a program encounters a noticeable performance cliff when a specific feature (for example, depth buffering) is turned on, or when the number of modes or components exceeds a certain limit.
On Indigo2 IMPACT systems, performance scales with the number of components. For example, on some systems, a switch from RGBA to RGB may not result in a change in performance, while on Indigo2 IMPACT systems, you should expect a performance improvement of 25%. (Note that while this applies to loading textures, it does not apply to using loaded textures.)
Here are some additional hints for optimizing image processing:
Instead of glPixelMap(), use the Silicon Graphics color table extension, discussed in “SGI_color_table—The Color Table Extension”, especially when working with GL_LUMINANCE or GL_LUMINANCE_ALPHA images.
OpenGL requires expansion of pixels using formats other than GL_RGBA to GL_RGBA. Conceptually, this expansion takes place before any pixel operation is applied. Indigo2 IMPACT systems attempt to postpone expansion as long as possible: this improves performance (operations must be performed on all components present in an image—a non-expanded image has fewer components and therefore requires less computation). Because pixel maps are inherently four components, GL_LUMINANCE and GL_LUMINANCE_ALPHA images must be expanded (a different lookup table is applied to the red, green, and blue components derived from the luminance value). However, if the internal format of an image matches the internal format of the color table, Indigo2 IMPACT hardware postpones the expansion, which speeds up processing.
The convolution extension, discussed in “EXT_convolution—The Convolution Extension” has been optimized. If possible, use the extension with separable convolution filters.
Indigo2 IMPACT systems are tuned for 3 x 3, 5 x 5, and 7 x 7 convolution kernels. If you choose a kernel size not in that set, performance is comparable to that of the closest member of the set. For example, if you specify 2 x 7, performance is similar to using 7 x 7.
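Based on the rounding behavior described above, an application can predict which tuned kernel a given request maps to. The helper below is an illustrative model, not an actual API; it assumes the cost matches the smallest tuned square kernel that covers the larger dimension of the requested kernel:

```c
#include <assert.h>

/* Illustrative model of the tuned-kernel behavior described above:
   the cost of a w x h convolution is assumed to match the smallest
   tuned square kernel (3x3, 5x5, or 7x7) that covers max(w, h).
   For example, a 2x7 kernel is assumed to cost about as much as 7x7. */
int effective_kernel_size(int width, int height)
{
    int largest = (width > height) ? width : height;
    if (largest <= 3) return 3;
    if (largest <= 5) return 5;
    return 7;
}
```

Under this model, there is little point in shrinking a kernel from 7 x 7 to 6 x 6, but shrinking it to 5 x 5 or padding a 2 x 7 kernel out to 7 x 7 costs nothing extra.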
Use texture-based zooming instead of glPixelZoom().
Texture loading and interpolation is fast on Indigo2 IMPACT, and texture-based zooming therefore results in a speed increase and higher-quality, more controllable results.
Where possible, minimize color table and histogram sizes and the number of color tables activated. If you don't, you may experience performance loss because the color table and the histogram compete for limited resources with other OpenGL applications.
Linear color space conversion. Use the color matrix extension to handle linear color space conversion, such as CMY to RGB, in hardware. This extension is also useful for reassigning or duplicating components. See “SGI_color_matrix—The Color Matrix Extension” for more information.
Non-linear color space conversions. Use the 3D and 4D texture extension for color conversion (for example, RGBA to CMYK). Using the glPixelTexGenSGIX() command, you can direct pixels into the lookup table and get other pixels out. Performance has been optimized.
If you work on a CAD application or other application that uses relatively static data, and therefore find it useful to use display lists instead of immediate mode, you can benefit from the display list implementation on Indigo2 IMPACT systems:
When the display list is compiled, most OpenGL functions are stored in a format that the hardware can use directly. At execution time, these display list segments are simply copied to the graphics hardware with little CPU overhead.
A further optimization is that a DMA mechanism can be used for a subset of display lists. By default, the CPU feeds the called list to the graphics hardware. Using DMA display lists, the host gives up control of the bus and Indigo2 IMPACT uses DMA to feed the contents to the graphics pipeline. The speed improvement at the bus is fourfold; however, a setup cost makes this improvement irrelevant for very short lists. The break-even point varies depending on the list you are working with, whether it is embedded in other lists, and other factors.
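The break-even point can be reasoned about with simple arithmetic. In the sketch below, only the fourfold bus-speed ratio comes from the text; the per-word transfer time and DMA setup cost are invented placeholders for illustration:

```c
#include <assert.h>

/* Hypothetical break-even estimate for DMA display lists. The 4x bus
   speedup is from the text above; usec_per_word and setup_usec are
   placeholder figures an application would have to measure. */
double cpu_feed_time(long words, double usec_per_word)
{
    return (double)words * usec_per_word;
}

double dma_feed_time(long words, double usec_per_word, double setup_usec)
{
    /* DMA moves data four times as fast, but pays a fixed setup cost. */
    return setup_usec + (double)words * usec_per_word / 4.0;
}
```

With these placeholder numbers, DMA wins only for lists above a few thousand words; short lists are cheaper to feed from the CPU, which matches the text's warning about very short lists.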
The functions that are direct (use hardware formats) will change over time. The following items are currently NOT compiled to direct form:
glCallLists() and glListBase()
all imaging functions
all texture functions
glHint(), glClear(), and glScissor()
glEnable() and glDisable()
glPushAttrib() and glPopAttrib()
all evaluator functions
most OpenGL extensions
If a display list meets certain criteria, Indigo2 IMPACT systems use DMA to transfer data from the CPU to the graphics pipeline. This is useful if an application is bus limited. It can also be an advantage in a multi-threaded application, because the CPU can do some other work while the graphics subsystem pulls the display list over.
The DMA method is used under the following conditions:
Note that the system tests recursively whether the DMA model is appropriate: If an embedded display list meets the criteria, it can be used in DMA mode even if the higher-level list is processed by the CPU.
Offscreen rendering can be accelerated using the pixel buffer extension discussed in “SGIX_pbuffer—The Pixel Buffer Extension”.
Here are some tips for improving RealityEngine geometry performance:
Primitive length. Most systems are optimized for a characteristic primitive length. On RealityEngine systems, multiples of 3 vertices are preferred, and 12 vertices (for example, a triangle strip consisting of 10 triangles) yield the best performance.
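The vertex arithmetic behind this rule is straightforward: a strip of n triangles uses n + 2 vertices, so a 10-triangle strip hits the preferred 12-vertex length. A small sketch with illustrative helper names:

```c
#include <assert.h>

/* A triangle strip of n triangles consumes n + 2 vertices, so a
   10-triangle strip is 12 vertices -- the sweet spot noted above. */
int strip_vertex_count(int triangles)
{
    return triangles + 2;
}

/* Illustrative check: does this strip length land on a multiple of
   3 vertices, the preferred granularity on RealityEngine systems? */
int is_preferred_multiple(int triangles)
{
    return strip_vertex_count(triangles) % 3 == 0;
}
```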
This section discusses optimizing rasterization. While it points out a few things to watch out for, it also provides information on features that were expensive on other systems but are acceptable on RealityEngine systems:
After a clear command (or a command to fill a large polygon), send primitives to the geometry engine for processing. Geometry can be prepared as the clear or fill operations take place.
Texturing is free on a RealityEngine if you use a 16-bit texel internal texture format. There are 16-bit texel formats for each number of components. Using a 32-bit texel format yields half the fill rate of the 16-bit texel formats. Internal formats are part of OpenGL 1.1; they were part of the texture extension in OpenGL 1.0.
The use of detail texture and sharpen texture usually incurs no additional cost and can greatly improve image quality. Note, however, that texture management can become expensive if a detail texture is applied to many base textures. Use detail texture, but keep detail and base textures paired, and apply detail to only a few base textures. See “SGIS_sharpen_texture—The Sharpen Texture Extension” and “SGIS_detail_texture—The Detail Texture Extension”.
If textures are changing frequently, use subtextures to incrementally load texture data. RealityEngine systems are optimized for 32 x 32 subimages.
There is no penalty for using the highest-quality mipmap filter (GL_LINEAR_MIPMAP_LINEAR) if 16-bit texels are used (for example, the GL_RGBA4 internal format, which is part of OpenGL 1.1 and part of the texture extension for OpenGL 1.0).
Local lighting or multiple lights are possible without an unacceptable degradation in performance. As you turn on more lights, performance degrades slowly.
Simultaneous clearing of depth and color buffers is optimized in hardware.
Vertex arrays were implemented as an extension to OpenGL 1.0 and are part of OpenGL 1.1. If you use vertex arrays, the following cases are currently accelerated for RealityEngine (each line corresponds to a different special case). To get the accelerated routine, you need to make sure your vertices correspond to the given format by using the correct size and type in your enable routines, and also by enabling the proper arrays:
glColor4f glTexCoord2f glVertex3f
glColor3f glNormal3f glVertex3f
glColor4f glNormal3f glVertex3f
glNormal3f glTexCoord2f glVertex3f
glColor4f glTexCoord2f glNormal3f glVertex3f
Multisampling provides full-scene antialiasing with performance sufficient for a real-time visual simulation application. However, it is not free and it adds to the cost of some fill operations. With RealityEngine graphics, some fragment processing operations (blending, depth buffering, stenciling) are essentially free if you are not multisampling, but do reduce performance if you use a multisample-capable visual. Texturing is an example of a fill operation that can be free on a RealityEngine and is not affected by the use of multisampling. Note that when using a multisample-capable visual, you pay the cost even if you disable multisampling.
Below are guidelines for optimizing performance for multisampling:
Multisampling offers an additional performance optimization that helps balance its cost: a virtually free screen clear operation. Technically, this operation doesn't really clear the screen, but rather allows you to set the depth values in the framebuffer to be undefined. Therefore, use of this clear operation requires that every pixel in the window be rendered every frame; pixels that are not touched remain unchanged. This clear operation is invoked with glTagSampleBufferSGIX() (see the reference page for more information).
When multisampling, using a smaller number of samples and color resolution results in better performance. Eight samples with 8-bit RGB components and a 24-bit depth buffer usually result in good performance and quality; 32-bit depth buffers are rarely needed.
Multisampling with stenciling is expensive. If it becomes too expensive, use the polygon offset extension to handle decal tasks (for example, runway strips).
Polygon offsets are supported in OpenGL 1.1 and were part of the Polygon Offset extension in OpenGL 1.0.
There are two ways of achieving transparency on a RealityEngine system: alpha blending and subpixel screen-door transparency using glSampleMaskSGIS(). Alpha blending may be slower, because more buffer memory may need to be accessed. For more information about screen-door transparency, see “SGIS_multisample—The Multisample Extension”.
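The storage cost behind these multisampling guidelines can be estimated with simple arithmetic. The helper below is illustrative; the per-sample byte counts are assumptions for the example, not a published framebuffer layout:

```c
#include <assert.h>

/* Illustrative per-pixel storage estimate for a multisampled visual:
   each sample carries its own color and depth values, so framebuffer
   cost grows linearly with the sample count. Byte sizes are assumed. */
long multisample_bytes_per_pixel(int samples, int color_bytes,
                                 int depth_bytes)
{
    return (long)samples * (color_bytes + depth_bytes);
}
```

For example, the recommended 8 samples with 8-bit RGB components (3 bytes) and a 24-bit depth buffer (3 bytes) comes to roughly 48 bytes per pixel under this model, which is why fewer samples and smaller color resolution pay off.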
Unsigned color types are faster than signed or float types.
Smaller component types (for example, GL_UNSIGNED_BYTE) require less bandwidth from the host to the graphics pipeline and are faster than larger types.
The slow pixel drawing path is used when fragment operations (depth or alpha testing, and so on) are used, or when the format is GL_DEPTH_COMPONENT, or when multisampling is enabled and the visual has a multisample buffer.
Your application might perform RGBA imaging operations (for example, convolution or histogram) on a single-component basis. This is the case either when processing grayscale (monochrome) images or when different color components are processed differently.
RealityEngine systems currently do not support RGBA-capable monochrome visuals (a feature that is introduced by the framebuffer configuration extension; see “SGIX_fbconfig—The Framebuffer Configuration Extension”). You must therefore use a four-component RGBA visual even when performing monochrome processing. Even when monochrome RGBA-capable visuals are supported, you may find it beneficial to use four-component visuals in some cases, depending on your application, to avoid the overhead of the glXMakeCurrent() or glXMakeCurrentReadSGI() call.
On RealityEngine systems, monochrome imaging pipeline operations are about four times as fast as four-component processing, because only a quarter of the data has to be processed or transported, whether from the host to the graphics subsystem (for example, for glDrawPixels()) or from the framebuffer to the graphics engines (for example, for glCopyPixels()).
The RealityEngine implementation detects monochrome processing by examining the color matrix (see “Tuning the Imaging Pipeline”) and the color writemask.
The following operations are optimized under the set of circumstances listed below:
glDrawPixels() with convolution enabled and
the pixel format is GL_LUMINANCE or GL_LUMINANCE_ALPHA
the color matrix is such that the active source component is red
glCopyPixels() when the absolute values of GL_ZOOM_X and GL_ZOOM_Y are 1.
In addition, the following conditions must all be met:
All pixel maps and fragment operations are disabled.
The color matrix does not scale any of the components.
The post color matrix scales and biases for all components are 1 and 0, respectively.
Either write is enabled only for a single component (R, G, B, or A), or alpha-component write is disabled.
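The writemask condition above can be expressed as a small predicate. This helper is hypothetical and encodes only the last rule, as a sketch of what an application might check before relying on the monochrome fast path:

```c
#include <assert.h>

/* Hypothetical check of the writemask rule above: the monochrome fast
   path requires either exactly one of R, G, B, A enabled, or the
   alpha write disabled. Each argument is nonzero if that component's
   write is enabled (as set by glColorMask). */
int writemask_allows_monochrome(int r, int g, int b, int a)
{
    int enabled = (r != 0) + (g != 0) + (b != 0) + (a != 0);
    return enabled == 1 || a == 0;
}
```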
Using the texture_object extension (OpenGL 1.0) or texture objects (OpenGL 1.1) usually yields better performance than using display lists.
OpenGL will make a copy of your texture if needed for context switching, so deallocate your own copy as soon as possible after loading it. Note that this behavior differs from RealityEngine behavior.
Note that RealityEngine and InfiniteReality systems differ here:
On RealityEngine systems, there is one copy of the texture on the host, one on the graphics pipeline. If you run out of texture memory, OpenGL sends the copy from the host to the graphics pipeline after appropriate cleanup.
On InfiniteReality systems, only the copy on the graphics pipe exists. If you run out of texture memory, OpenGL has to save the texture that did not fit from the graphics pipe to the host, clean up texture memory, and then reload the texture. To avoid these multiple moves of the texture, always free textures you no longer need so that you do not run out of texture memory.
This approach has the advantage of very fast texture loading because no host copy is made.
To load a texture immediately, follow this sequence of steps:
1. Enable texturing.
2. Bind your texture.
3. Define the texture data with glTexImage*().
To define a texture without loading it into the hardware until the first time it is referenced, follow this sequence of steps:
1. Disable texturing.
2. Bind your texture.
3. Define the texture data with glTexImage*().
Note that in this case, a copy of your texture is placed in main memory.
Don't overflow texture memory, or texture swapping will occur.
If you want to implement your own texture memory management policy, use subtexture loading. You have two options. For both options, it is important that after initial setup, you never create and destroy textures but reuse existing ones:
Allocate one large empty texture, then call glTexSubImage*() to load it piecewise, and use the texture matrix to select the relevant portion.
Allocate several textures, then fill them in by calling glTexSubImage*() as appropriate.
Use 16-bit texels whenever possible; RGBA4 can be twice as fast as RGBA8. As a rule, remember that bigger formats are slower.
If you need a fine color ramp, start with 16-bit texels, then use a texture lookup table and texture scale/bias.
For loading textures, use pixel formats on the host that match texel formats on the graphics system.
Avoid OpenGL texture borders; they consume large amounts of texture memory. For clamping, use the GL_CLAMP_TO_EDGE_SGIS style defined by the SGIS_texture_edge_clamp extension (see “SGIS_texture_edge/border_clamp—Texture Clamp Extensions”). This extension is identical to the old IRIS GL clamping semantics on RealityEngine.
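The memory argument for 16-bit texels is easy to quantify. The sketch below is illustrative; it assumes a square mipmapped texture and simply sums the levels:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative arithmetic (not from the original text): footprint of a
   square mipmapped texture at 16-bit (for example, GL_RGBA4) versus
   32-bit (GL_RGBA8) texels. Halving the texel size halves both the
   storage needed and, per the rule above, roughly the fill cost. */
size_t texture_bytes(int base_size, int bytes_per_texel)
{
    size_t total = 0;
    int level_size = base_size;
    while (level_size >= 1) {        /* sum all mipmap levels down to 1x1 */
        total += (size_t)level_size * level_size * bytes_per_texel;
        level_size /= 2;
    }
    return total;
}
```

A 256 x 256 mipmapped texture needs about 171 KB at 2 bytes per texel and exactly twice that at 4 bytes per texel, so GL_RGBA4 fits twice as many textures in the same texture memory.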
InfiniteReality systems support offscreen rendering through a combination of extensions to GLX:
pbuffers are offscreen pixel arrays that behave much like windows, except that they're invisible. See “SGIX_pbuffer—The Pixel Buffer Extension”.
fbconfigs (framebuffer configurations) define color buffer depths, determine presence of Z buffers, and so on. See “SGIX_fbconfig—The Framebuffer Configuration Extension”.
glXMakeCurrentReadSGI() allows you to read from one window or pbuffer while writing to another. See “EXT_make_current_read—The Make Current Read Extension”.
In addition, glCopyTexImage*() allows you to copy from pbuffer or window to texture memory. This function is supported through an extension in OpenGL 1.0 but is part of OpenGL 1.1.
For framebuffer memory management, consider the following tips:
If you have deep windows, such as multisampled or quad-buffered windows, you have less space in the framebuffer for pbuffers.
pbuffers are swappable (to avoid collisions with windows), but not completely virtualized, that is, there is a limit to the number of pbuffers you can allocate. The sum of all allocated pbuffer space cannot exceed the size of the framebuffer.
pbuffers can be volatile (subject to destruction by window operations) or nonvolatile (swapped to main memory to avoid destruction). Volatile pbuffers are recommended because swapping is slow. Treat volatile pbuffers as if they were windows, subject to exposure events.
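Because the sum of allocated pbuffer space cannot exceed the framebuffer size, an application may want to track its own budget before allocating. A hypothetical bookkeeping sketch, with invented names:

```c
#include <assert.h>

/* Hypothetical pbuffer budget tracker: the sum of allocated pbuffer
   pixels may not exceed the framebuffer capacity, so check the budget
   before attempting an allocation through the SGIX_pbuffer extension. */
typedef struct {
    long capacity_pixels;   /* total framebuffer pixels available  */
    long allocated_pixels;  /* pixels already granted to pbuffers  */
} PbufferBudget;

int can_allocate_pbuffer(const PbufferBudget *b, int width, int height)
{
    long request = (long)width * height;
    return b->allocated_pixels + request <= b->capacity_pixels;
}
```

This only models the pixel-count limit; deep (multisampled or quad-buffered) windows shrink the real capacity further, as noted above.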
As a rule, it is more efficient to change state when the relevant feature is disabled than when it is enabled. For example, when changing the width of antialiased lines, call glLineWidth() before calling glEnable(GL_LINE_SMOOTH). With that ordering, the line filter table is computed just once, when line antialiasing is enabled. If you instead set the line width while GL_LINE_SMOOTH is already enabled, the table may be computed twice: once when antialiasing is enabled, and again when the line width is changed. In general, it may be best to disable a feature, change the related state, and then re-enable the feature.
The following mode changes are fast: sample mask, logic op, depth function, alpha function, stencil modes, shade model, cullface, texture environment, matrix transforms.
The following mode changes are slow: texture binding, matrix mode, lighting, point size, line width.
For best results, map the near clipping plane to 0.0 and the far clipping plane to 1.0 (the default). A different mapping, for example 0.0 and 0.9, still yields good results. A reversed mapping, such as near = 1.0 and far = 0.0, noticeably decreases depth-buffer precision.
When using a visual with a 1-bit stencil, it is faster to clear both the depth buffer and stencil buffer than it is to clear the depth buffer alone.
Use the color matrix extension for swapping and smearing color channels. The implementation is optimized for cases in which the matrix is composed of zeros and ones.
Be sure to check for the usual things: indirect contexts, drawing images with depth buffering enabled, and so on.
Triangle strips that are multiples of 10 triangles (12 vertices) are best.
InfiniteReality systems optimize 1-component pixel draw operations. They are also faster when the pixel host format matches the destination format.
Bitmaps have high setup overhead. Consider these approaches:
If possible, draw text using textured polygons. Put the entire font in a texture and use texture coordinates to select letters.
To use bitmaps efficiently, compile them into display lists. Consider combining more than one into a single bitmap to save overhead.
Avoid drawing bitmaps with invalid raster positions. Pixels are eliminated late in the pipeline and drawing to an invalid position is almost as expensive as drawing to a valid position.
Minimize the amount of data sent to the pipeline.
Use display lists as a cache for geometry. Using display lists is critical on Onyx systems; it is less critical, but still recommended, on Onyx2 systems. The two systems' performance differs because the bus between the host and the graphics is faster on Onyx2 systems.
The display list priority extension (see “SGIX_list_priority—The List Priority Extension”) can be used to manage display list memory efficiently.
Use small, aligned data types for immediate-mode drawing (such as RGBA color packed into a 32-bit word, surface normals packed as three shorts, or texture coordinates packed as two shorts). Smaller data types mean, in effect, less data to transfer.
Use the packed vertex array extension.
Render with exactly one thread per pipe.
Use multiple OpenGL rendering contexts sparingly. The rendering context-switching rate is about 60,000 calls per second, assuming no texture swapping, so each call to glXMakeCurrent() costs the equivalent of 100 textured triangles or 800 32-bit pixels.
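The small packed formats recommended above (RGBA in a 32-bit word, normals as three shorts) can be sketched as plain C helpers; the struct and function names are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative packing of the small immediate-mode formats suggested
   above: an RGBA color in one 32-bit word, and a unit normal as three
   16-bit signed shorts (components scaled by 32767). */
uint32_t pack_rgba(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return ((uint32_t)r << 24) | ((uint32_t)g << 16) |
           ((uint32_t)b << 8)  |  (uint32_t)a;
}

typedef struct { int16_t x, y, z; } PackedNormal;

PackedNormal pack_normal(float x, float y, float z)
{
    PackedNormal n;
    n.x = (int16_t)(x * 32767.0f);
    n.y = (int16_t)(y * 32767.0f);
    n.z = (int16_t)(z * 32767.0f);
    return n;
}
```

A packed normal is 6 bytes instead of 12 for three floats, halving the data the host must push across the bus for that attribute.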