Methods for Gigabit Networking Performance Measurements

Mark Foster

NASA Research & Education Network

NASA Ames Research Center / Recom Technologies

 

This document provides a brief overview of mechanisms that can be used to measure and confirm Gigabit-level data communications. We define Gigabit Networking as data rates that are in the 500 Mbps to 1 Gbps range. At present there are, in essence, two approaches to achieving these levels of performance: a single stream between two very high performance systems, and multiple concurrent (parallel) streams between cooperating systems (generally clusters of two or more machines at each endpoint). Partial descriptions of the elements of these approaches are given and a handful of recent experiences are summarized. At the end of this document are some of the references from which details in this paper were derived.

System Elements:

System communication performance is achieved through fast processors, low latency memory, and high performance peripheral busses. Key characteristics and constraints of these elements are described below.

The Peripheral Component Interconnect (PCI) bus commonly used in current systems has a theoretical maximum throughput (burst rate) as given in Table 1. Note that the continuous transfer rate is about 60% of the burst rate; thus, a burst capacity of 1 Gbps yields a continuous rate of about 600 Mbps. In most systems, a fully saturated bus cannot be maintained for very long periods, due to other system-level activities such as protocol processing, memory management, and disk I/O.

bus speed \ bus width      32 bits      64 bits

33 MHz                     1 Gbps       2 Gbps
66 MHz                                  4 Gbps
100 MHz                                 6 Gbps
133 MHz                                 8 Gbps

Table 1. PCI Bus burst rates
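
As a rough check on the figures in Table 1, the burst rate is simply the bus width multiplied by the clock rate, and the sustainable rate is taken to be roughly 60% of that. The short Python sketch below just illustrates this arithmetic; the 60% derating factor is the approximation used above, not a measured constant.

    # Rough PCI throughput arithmetic (illustrative only).
    # Burst rate = bus width (bits) x clock rate; the sustained rate is taken
    # to be ~60% of burst, per the approximation in the text above.

    def pci_rates(width_bits, clock_mhz, derate=0.60):
        burst_gbps = width_bits * clock_mhz * 1e6 / 1e9
        return burst_gbps, burst_gbps * derate

    for width in (32, 64):
        for clock in (33, 66, 100, 133):
            burst, sustained = pci_rates(width, clock)
            print(f"{width}-bit @ {clock} MHz: "
                  f"burst ~{burst:.1f} Gbps, sustained ~{sustained:.1f} Gbps")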

Many systems use a "Northbridge" chip (the circuit that connects the CPU, memory, and PCI bus). Early versions of this chip imposed a limit of 50 Mbytes/second (~400 Mbps) on bus-to-memory transfers. Current versions, such as the Intel 440BX, can theoretically support transfers up to 800 Mbytes/second using a 100 MHz bus, although most PCI cards are currently limited to 66 MHz. Ways to further increase system throughput include the use of multiple independent PCI busses. InfiniBand, a higher performance bus specification, proposes roughly a two-fold increase over the fastest PCI performance; systems and adapter cards that use this specification are not yet available.

Network interfaces that support OC12 (622 Mbps), Gigabit Ethernet (1 Gbps), or above must be used for single-stream measurements. These interfaces must also support large frames (typically an MTU of 4470 or 9000 bytes) to reduce CPU overhead and to provide high throughput on long-latency links. On systems that support logical grouping or "bonding" of interfaces into a "fat pipe", multiple slower interfaces could achieve throughput in excess of 500 Mbps using multiple concurrent streams. We do not expect to explore such bonded interfaces for high performance measurements, and will focus on the higher performance network interface cards.
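
To see why large frames matter, consider the per-packet (and, absent interrupt coalescing, per-interrupt) rate implied by a given line rate and MTU. The Python sketch below is purely illustrative and pessimistically assumes one interrupt per packet and that every byte is payload:

    # Packets per second needed to fill a link at a given MTU.
    # Assumes every byte is payload and (pessimistically) one interrupt per
    # packet, i.e. no interrupt coalescing; an illustrative upper bound.

    def packets_per_second(rate_bps, mtu_bytes):
        return rate_bps / (mtu_bytes * 8)

    for rate_bps, label in ((622e6, "OC12"), (1e9, "GigE")):
        for mtu in (1500, 4470, 9000):
            pps = packets_per_second(rate_bps, mtu)
            print(f"{label} at MTU {mtu}: ~{pps:,.0f} packets/s")

At 1 Gbps, moving from a 1500-byte to a 9000-byte MTU drops the packet rate from roughly 83,000 to about 14,000 packets per second, which is the source of the CPU-overhead reduction noted above.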

Processor speeds become an issue for handling interrupts and processing the protocol stack. Systems with ~400 MHz processors or slower will generally be bandwidth limited unless interrupt and protocol processing are offloaded onto the network interface cards, or protocol processing is minimized (e.g., by using UDP rather than TCP).

Without offloading, even processors twice as fast are readily saturated when performing bulk data transfers at high speeds, leaving few cycles for computation. Memory subsystems should use page sizes that conveniently hold an MTU worth of data; this generally means 4K or 8K bytes. MTUs that are sized appropriately can be used more effectively by zero-copy socket code, thereby reducing the amount of per-packet processing required.
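
A quick way to see the fit between MTU and page size is to count how many pages a frame spans and how much slack is left over; the figures below are simple arithmetic, not measurements, and ignore link-layer and protocol headers.

    # Pages spanned by an MTU-sized buffer, and leftover slack (illustrative).
    # Zero-copy page-remapping schemes work best when a frame maps cleanly
    # onto whole pages.
    import math

    for page_bytes in (4096, 8192):
        for mtu in (1500, 4470, 9000):
            pages = math.ceil(mtu / page_bytes)
            slack = pages * page_bytes - mtu
            print(f"page {page_bytes // 1024}K, MTU {mtu}: "
                  f"{pages} page(s), {slack} bytes of slack")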

The operating system's communications processing (e.g., TCP protocol handling with multi-copy, single-copy, or zero-copy I/O) leads to dramatic differences in overall throughput (as much as 4x). Chase et al. describe approaches to end-system tuning and packet-size optimization that increase network throughput in http://www.cs.duke.edu/ari/publications/end-system.pdf. They indicate that the key techniques to exploit are interrupt suppression, checksum offloading, and zero-copy data movement by page remapping, in addition to the use of large (jumbo) frames.

Measurement Tools:

The evaluation of end-to-end throughput can be accomplished with three sets of tools: traffic generation, path characterization, and packet sampling/capture. The first two are critical; sampling of a test flow can be used to help validate what the generator/collector processes report and, in some cases, may help identify where a loss, if any, is occurring. Latency comes into play particularly for TCP over long-delay links, and very minor loss rates (< 10^-6) can significantly affect very high bandwidth TCP sessions as well. Characterizing the path as part of the bandwidth performance testing is therefore crucial to understanding the measured performance.
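
One way to quantify that sensitivity to loss and latency is the well-known Mathis et al. approximation for steady-state TCP throughput, roughly C * MSS / (RTT * sqrt(p)) for loss probability p and a constant C near 1 (about 1.22 in the usual formulation). The sketch below applies it to a jumbo-frame MSS at a few RTTs and loss rates; the specific RTT, MSS, and loss values are illustrative, not measurements from our tests.

    # Rough TCP throughput bound from the Mathis et al. approximation:
    #   rate ~= C * MSS / (RTT * sqrt(p))
    # where p is the packet loss probability and C is a constant near 1.
    # All input values below are illustrative.
    from math import sqrt

    def mathis_rate_bps(mss_bytes, rtt_s, loss_prob, c=1.22):
        return c * mss_bytes * 8 / (rtt_s * sqrt(loss_prob))

    mss = 8960   # roughly a 9000-byte MTU minus TCP/IP headers
    for rtt_ms in (10, 40, 80):
        for p in (1e-4, 1e-6, 1e-8):
            rate = mathis_rate_bps(mss, rtt_ms / 1000.0, p)
            print(f"RTT {rtt_ms} ms, loss {p:.0e}: ~{rate / 1e6:,.0f} Mbps")

For example, at a 40 ms RTT the bound falls below 1 Gbps once the loss rate rises much above 10^-6, which is why such small loss rates matter at these speeds.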

traffic generation:
    iperf                  - sends/receives TCP and UDP packets; a client/server user interface simplifies end-to-end testing
    nttcp                  - similar functionality to iperf; somewhat different parameters
    gen_send, gen_receive  - effective at saturating a link with TCP or UDP packets
    smartbits (with GbE interface) - external hardware with sophisticated generation and sampling features

loss estimation & latency:
    ping                   - uses ICMP to compute round-trip times
    sting                  - clever use of the TCP protocol to assess one-way loss
    pchar                  - path characterization (bandwidth, latency, loss)

sampling & capture:
    tcpdump                - packet capture; can select specific flows, or a percentage of all flows on a lightly loaded machine
    tcptrace               - helpful for analyzing tcpdump output and examining the TCP protocol exchange
    PCMon                  - passive packet capture that does not impact the systems under test; may only be capable of sampling at higher speeds
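
For orientation, the kind of memory-to-memory test the traffic generators above perform can be approximated in a few lines of socket code. The sketch below is a stripped-down, single-stream TCP sender/receiver; the port number, transfer size, and write size are arbitrary illustrative choices, and it is in no way a substitute for iperf or nttcp, which add UDP, window control, parallel streams, and reporting.

    # Minimal single-stream TCP memory-to-memory throughput test (illustrative).
    # Run with no arguments on the receiving host, then with the receiver's
    # hostname as the only argument on the sending host.
    import socket, sys, time

    PORT = 5001                  # arbitrary example port
    CHUNK = 64 * 1024            # 64 KB per write
    TOTAL = 1 * 1024 ** 3        # send 1 GB in total

    def receiver():
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.bind(("", PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        received, start = 0, time.time()
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            received += len(data)
        elapsed = time.time() - start
        print(f"received {received} bytes in {elapsed:.1f} s "
              f"= {received * 8 / elapsed / 1e6:.1f} Mbps")

    def sender(host):
        sock = socket.create_connection((host, PORT))
        payload = b"\0" * CHUNK
        sent = 0
        while sent < TOTAL:
            sock.sendall(payload)
            sent += len(payload)
        sock.close()

    if __name__ == "__main__":
        if len(sys.argv) > 1:
            sender(sys.argv[1])
        else:
            receiver()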

 

 

Recent Experiences:

During Supercomputing 2000, several groups participated in coordinated demonstrations to show how their applications could fill a WAN link provisioned at 1.5 Gbps. Two notable instances, summarized below, were the efforts of LBL and SLAC.

SLAC – bulk throughput measurements

Staff from the Stanford Linear Accelerator Center (SLAC) and Fermi National Accelerator Laboratory collaborated to illustrate the needs and challenges of data-intensive science, in particular for the Particle Physics Data Grid (PPDG). At the show, the SLAC booth had connectivity via SCinet to NTON, and they measured the throughput on this link during the show. The NTON link to SCinet was an OC48 (2.4 Gbps) packet-over-SONET link (limited to 1.5 Gbps per the HSCC agreement with Qwest).

Their iperf-based tests showed a peak transfer rate from the booth in Dallas to SLAC via NTON of around 990 Mbit/s, achieved using two PCs on the show floor and two Suns (pharlap and datamove5) at SLAC. The best results were obtained with a 128 KB window size and 25 parallel streams; larger window sizes caused noticeable performance degradation. The 990 Mbit/s was measured as a few-second peak using 2-second sampling intervals (as opposed to the 5-second intervals used in the SC00 Network Challenge).
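
These window and stream counts are consistent with simple bandwidth-delay-product reasoning: each TCP stream is limited to roughly window/RTT, so the aggregate is about streams * window / RTT. The calculation below is only a plausibility check; the 25 ms RTT is an assumed illustrative value for a Dallas-to-SLAC path, not a figure reported from the tests.

    # Aggregate throughput estimate: streams * window / RTT (illustrative).
    window_bytes = 128 * 1024    # 128 KB TCP window, as used in the tests
    streams = 25                 # parallel streams, as used in the tests
    rtt_s = 0.025                # assumed Dallas-to-SLAC round-trip time

    per_stream_bps = window_bytes * 8 / rtt_s
    aggregate_bps = streams * per_stream_bps
    print(f"per stream: ~{per_stream_bps / 1e6:.0f} Mbps, "
          f"aggregate: ~{aggregate_bps / 1e6:.0f} Mbps")

With these assumptions each stream tops out near 40 Mbps, and 25 such streams land close to the ~990 Mbit/s peak observed.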

In the last test they sent data via UDP to SLAC as fast as possible. With two PCs they achieved about 1.25 Gbit/s (1 Gbit/s from a Dell PowerEdge and 250 Mbit/s from the other PC; they do not fully understand why the second rate was only 250 Mbit/s). Of the 1 Gbit/s of raw bandwidth sent, about 975 Mbit/s (5-minute average) was received at the GSR at SLAC.

More details can be found at http://www-iepm.slac.stanford.edu/monitoring/bulk/sc2k.html.

 

LBL - Visapult

The Lawrence Berkeley National Lab-led Visapult application won the "Fastest and Fattest" award in the SC00 Network Challenge (http://www-fp.mcs.anl.gov/sc2000_netchallenge). The team sought to set two high-water marks: remote visualization of a dataset exceeding 1 TB in size, and application performance exceeding 1 Gb/s in sustained network bandwidth consumption.

Visapult combines network-based data caches, such as the DPSS (http://www-didc.lbl.gov/DPSS), high-speed networks, domain-decomposed parallel rendering in software, and a lightweight viewer to achieve interactive visualization of large scientific data sets. High throughput rates are achieved by parallelizing I/O at each stage in the application and by pipelining the visualization process. Visapult consists of two components: a viewer and a back end. The back end is a parallel application that loads large scientific data sets using a domain decomposition and performs software volume rendering on each subdomain, producing an image. The viewer, also a parallel application, implements image-based assisted volume rendering using the imagery produced by the back end. On the display device, graphics interactivity is effectively decoupled from the latency inherent in network applications.

More details can be found at http://vis.lbl.gov/projects/visapult

 

Additional References:

PCI Bus information:

http://www.adaptec.com/technology/whitepapers/64bitpci.html

http://www.pcisig.com/data/news/press_kit/pk_the_successor_to_pci.pdf

FreeBSD zero copy info:

http://people.freebsd.org/~ken/zero_copy/