PowerCast ™ Switching Fabric
The interconnect fabric is one of the most critical design
components in a high performance switching system. Under heavy
network load, internal switching fabric capacity can easily
become the bottleneck limiting overall throughput for a switch
or router operating in the core of a network. Enterasys's
Xpedition family of products is built around a scalable,
multi-gigabit non-blocking switching fabric, PowerCast, that
eliminates the backplane interconnect bottleneck and delivers
continuous wire-speed throughput with low latency even under
heavy network load.
Recent Trends
Until recently, most switches and routers were built around a
shared bus interconnect. The shared bus architecture provided a
simple, centralized solution with sufficient capacity as long as
data rates, port utilization and port densities were relatively
low. However, advances in LAN technologies and the explosive
growth of Internet and intranet applications have increased
switching requirements beyond the limits of shared bus
architecture.
- LAN data rates have increased by a factor of 100 in less
than a decade. The maximum data rate for Ethernet has gone
from 10Mb/s to 1000Mb/s in just a few years. Network data
rates are increasing faster than advances in memory and
electrical signalling technology.
- Advances in microprocessor technology and computer
architecture have enabled new generations of systems to
generate and process larger amounts of data over the
network. Internet traffic volume is doubling every 10 to 12
months.
- Client-Server applications, the world-wide web, multimedia
and multicast applications have increased the amount of
traffic in the corporate intranet. New classes of
applications are creating new classes of service
expectations such as quality of service, efficient
multicasting capability and predictable network latency.
- New computing paradigms have changed access patterns over
the LAN, breaking the 80/20 rule which stated that 80
percent of network traffic was limited to a single switched
domain. Hierarchical switching fabric architectures based on
traffic locality assumptions are no longer acceptable.
- Shared media hubs are being replaced by switches resulting
in higher average utilization at aggregation points in the
network backbone. Routers which cannot support wire-speed
traffic are obsolete.
- The average number of ports on routers has increased from a
handful (4..16) to several dozen (48..120). This trend is
expected to continue as corporations continue to centralize
servers and network services around the network backbone.
- The price of routed ports is declining rapidly with the
introduction of layer-3 switching technology. Decreasing
prices for line cards and control modules result in lower
total system cost expectations and severe cost pressure on
the switching fabric.
Xpedition's switching fabric, PowerCast, is designed to handle
these emerging network requirements by delivering the highest
possible interconnect performance at the lowest cost per port
and supporting wire-speed switching of Gigabit network
interfaces.
Fabric Design Goals
The PowerCast ™ switching fabric was designed to meet the
following goals:
- Sustain 100% line utilization under heavy and bursty load
conditions.
- Provide multiple priority levels to support mission
critical applications.
- Deliver true wire-speed performance on single Gigabit
Ethernet interfaces.
- Provide an architecture that scales up to hundreds of
ports.
- Support a family of products at various price/performance
points.
- Enable reliable and fault-tolerant solutions suitable for
enterprise class products.
- Support hardware replication of packets for efficient
multicasting.
The primary goal for the PowerCast ™ fabric was to deliver
100% throughput in a 16 channel system that is fully populated
with single gigabit or octal 100Base-T modules. The design
constraints were chosen to ensure sustained wire-rate
performance even under worst case scenarios:
- For worst case load, all ports were assumed to be full
duplex and running at 100% utilization.
- There were no assumptions about typical packet size. The
switch had to deliver full wire-speed performance even for
minimum size packets.
- No assumptions were made about the locality of the
traffic. The performance of the switching fabric had to be
independent of input and output port assignments.
Fabric Architecture
Various fabric architectures were investigated during the design
phase, including bus-based backplanes and shared memory
interconnects.
Bus-Based Backplane
Since the goal was to scale the system up to 16 channels, a
shared-bus implementation was not considered because of the
electrical risk factors associated with a wide, high-speed
backplane bus. To achieve the 64Gb/s backplane bandwidth of the
Xpedition8600, a 256-bit wide bus running at 250MHz or a 512-bit
wide bus running at 125MHz would be necessary.
A hierarchical bus-based architecture, which is used in some of
the current generation switches and routers from Cisco, was also
not considered because of its limited bisection
bandwidth. In a hierarchical bus configuration, such as the one
shown in Figure 1, the usable bandwidth of the main backplane
bus is typically significantly less than the aggregate bandwidth
of all the ports. Hierarchical bus-based switching systems
operate under the assumption that most traffic can be switched
locally, so that traffic going through the backplane is
limited. Even though this assumption may have been valid in the
past, it no longer holds. When the traffic is not localized,
the backplane bus typically becomes the bottleneck limiting
system throughput. Moreover, requiring port assignments to be
made based on locality of communication would make network
management and reconfiguration difficult by introducing
unnecessary constraints on the topology of the network.
Shared Memory Switch
The shared memory switch architecture is used in many of
the newer switching and routing products. A shared memory switch
is one particular implementation of a more general class of
switching fabrics known as output buffered switches, which is
shown in Figure 2.
In an output buffered switch, a packet is placed in the
output queue of the target output port. Use of a separate queue
for each output keeps the packet flows to different outputs
isolated from each other and eliminates packet loss due to
contention effects unless an output port is oversubscribed. Even
in the case of oversubscription, an output buffered switch
confines packet loss to the oversubscribed output
channels. By eliminating contention-related delays and using a
single queueing point, output buffered switching systems also
make it possible to control the latency of packets through the
system, which is very important for supporting QoS in a switch
or router.
Output buffered switches need to run the internal switching
fabric at N times the input port rate, where N is the number of
input ports. Moreover, the memories used for output buffers need
to support N+1 times the bandwidth of the input ports if the
output buffers are distributed, or 2N times the bandwidth of the
input ports for a shared memory implementation. Due to these
bandwidth constraints, output buffered switches are neither
practical nor cost efficient for switches that carry large
amounts of traffic or support large numbers of ports. For
example, a 16x16 switch supporting 2 Gigabit Ethernet ports per
channel would require a usable memory bandwidth of 64Gb/s. To
achieve this bandwidth, a 256-bit SRAM interface running at
250MHz or a 1024-bit DRAM interface running at 100MHz would be
necessary. Increasing line rates (OC-48, OC-192) make the
bandwidth problem even more difficult.
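The bandwidth figures quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the PowerCast design: the function names are hypothetical, and the interface calculation assumes one transfer per clock cycle with no refresh or turnaround overhead.

```python
def output_buffer_bandwidth_gbps(channels, channel_rate_gbps, shared=True):
    """Memory bandwidth needed by an output buffered switch.

    Distributed output buffers: up to N writes plus 1 read per memory,
    i.e. (N + 1) x channel rate.
    Shared memory: N writes plus N reads, i.e. 2N x channel rate.
    """
    factor = 2 * channels if shared else channels + 1
    return factor * channel_rate_gbps


def interface_bandwidth_gbps(width_bits, clock_mhz):
    """Raw bandwidth of a memory interface, assuming (hypothetically)
    one transfer per clock and no protocol overhead."""
    return width_bits * clock_mhz / 1000.0


# 16 channels, each carrying 2 Gigabit Ethernet ports (2 Gb/s per channel).
shared_gbps = output_buffer_bandwidth_gbps(16, 2.0, shared=True)
distributed_gbps = output_buffer_bandwidth_gbps(16, 2.0, shared=False)

sram_gbps = interface_bandwidth_gbps(256, 250)
dram_gbps = interface_bandwidth_gbps(1024, 100)

print(f"shared memory requirement:      {shared_gbps} Gb/s")
print(f"distributed buffer requirement: {distributed_gbps} Gb/s per memory")
print(f"256-bit SRAM @ 250MHz (raw):    {sram_gbps} Gb/s")
print(f"1024-bit DRAM @ 100MHz (raw):   {dram_gbps} Gb/s")
```

Under these assumptions the raw DRAM figure comes out above 64Gb/s. The text asks for usable bandwidth, so the gap presumably allows for DRAM access overhead; that reading is an inference and is not stated in the source.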
Input Buffered Switch
In an input buffered switch, packets are queued at input ports
after they arrive at the system. Each input port has a channel
that runs at line rate into a switching fabric. Access to the
switching fabric is controlled by a fabric arbiter which
resolves output contention and schedules packet transfers across
the fabric. When the switching fabric runs at line-rate, input
and output memories only need to run at the maximum port
bandwidth rate. Since memory bandwidth is not proportional to
the number of ports, it is possible to build scalable switching
systems that can support a large number of ports with relatively
low cost memory components.
The main problem associated with input buffered switches is
head-of-line (HOL) blocking, which can severely limit
throughput. If each input buffer is maintained as a simple FIFO,
HOL blocking can limit the throughput of an input
buffered switch to 58% of the maximum aggregate input rate when
all input ports are driven at 100% utilization with traffic that
is uniformly distributed over all output ports [1].
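The effect can be observed with a small slot-based simulation. The sketch below is an illustrative model only: the function name, port count, slot count and random tie-breaking rule are all assumptions. Every input is kept saturated, each head-of-line packet gets a uniformly random destination, and each output serves one contending input per slot. The 58% figure is the limit for large switches; a 16-port model should settle slightly above it.

```python
import random


def fifo_saturation_throughput(n_ports=16, slots=20000, seed=1):
    """Per-port throughput of a FIFO input-queued switch under saturation."""
    rng = random.Random(seed)
    # Every input always has a head-of-line (HOL) packet; we only track
    # the output port that the HOL packet is destined for.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    delivered = 0
    for _ in range(slots):
        # Group inputs by the output their HOL packet wants.
        contenders = {}
        for inp, out in enumerate(hol):
            contenders.setdefault(out, []).append(inp)
        # Each requested output grants exactly one input per slot.
        for inputs in contenders.values():
            winner = rng.choice(inputs)
            # The winner's next packet moves to the head of its FIFO.
            hol[winner] = rng.randrange(n_ports)
            delivered += 1
        # Losing inputs keep the same HOL packet and block everything behind it.
    return delivered / (slots * n_ports)


throughput_16 = fifo_saturation_throughput(n_ports=16)
throughput_2 = fifo_saturation_throughput(n_ports=2)
print(f"16 ports: {throughput_16:.3f}")
print(f" 2 ports: {throughput_2:.3f}")
```

Increasing `n_ports` moves the result toward the asymptotic limit, while very small switches do noticeably better because fewer inputs compete for each output.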
It has been shown that HOL blocking can be eliminated by
using non-FIFO scheduling algorithms. With certain algorithms,
which may not be well suited to high-speed hardware
implementation, it is possible to eliminate HOL blocking
entirely and achieve 100% throughput [2].
Another method that has been used to increase the throughput
of an input buffered switch is to run the switching fabric at a
faster rate than the input ports. This implementation is also
known as the combined input-output queued (CIOQ) switch since
packets need to be buffered both before and after the switching
fabric. Various studies have shown that speedup makes it
possible to achieve throughput in excess of 99% using CIOQ
architectures [3][4][5][6]. Moreover, more recent results show
that "a CIOQ switch can behave identically to an output
queued switch, or one using centralized shared memory" [7]
and "only a moderate speedup factor (at most two) is
necessary to approach the delay and throughput performance of
pure output queueing switches" [8]. [9] presents a novel
crossbar arbitration algorithm which is work conserving for all
traffic patterns and switch sizes for a speedup of only 2 and,
finally, [10] and [11] show that a CIOQ switch can match the
packet latency behavior of an output buffered switch with a
speedup of 2.
PowerCast
PowerCast ™ was architected around a non-blocking combined
input-output queued (CIOQ) switching fabric. At the core of
PowerCast is a multipoint switch that provides concurrent access
to output channels from any input. In addition to being capable
of transferring packets from multiple inputs to multiple outputs
simultaneously, a multipoint switch can also multicast packets
from input channels to multiple sets of outputs. The decision to
go with a dynamic multipoint switch was primarily based on the
objective of delivering a scalable switching architecture that
met wire-speed packet forwarding requirements even at Gigabit
Ethernet speeds. Unlike shared memory based solutions, which run
into bandwidth bottlenecks due to centralized buffer management,
a multipoint switch can provide full throughput for very large
numbers of ports by distributing buffering bandwidth
requirements over multiple channels. The basic architecture of
the PowerCast switching fabric is shown in Figure 3.
PowerCast achieves line-rate throughput by:
- Running the interconnect channels at overspeed, i.e.,
significantly faster than the maximum aggregate input rate.
- Employing a sophisticated arbitration algorithm to
eliminate HOL blocking.
- Pipelining arbitration to guarantee full throughput even
for the smallest packet sizes.
- Providing ample buffering to eliminate the effects of
short term contention and bursty traffic patterns.
Head-of-Line Blocking Avoidance
PowerCast eliminates head-of-line blocking by maintaining
multiple outstanding requests per channel and using a dynamic
scheduling algorithm that provides significant performance
improvement over the simple FIFO scheduling algorithm. This
dynamic scheduling method allows packets destined for available
output channels to bypass older packets which are waiting for
busy output channels. In addition, at points where multiple low
bandwidth ports are aggregated into a high speed channel,
requests from input ports are ordered such that sequential
channel connection requests are directed to non-overlapping
destination ports. This essentially ensures that fabric
utilization is maximized under most real-life network traffic
distributions.
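The benefit of letting packets bypass a blocked head-of-line packet can be illustrated with a generic lookahead model. This is a sketch under stated assumptions and not the PowerCast arbitration algorithm, whose details are not given here: each saturated input exposes a hypothetical window of its oldest pending requests, inputs are visited in a rotating order, and each input sends the oldest windowed packet whose output is still free in that slot. A window of 1 reduces to plain FIFO scheduling.

```python
import random


def windowed_throughput(n_ports=16, window=4, slots=5000, seed=1):
    """Per-port throughput when each input may send any of its `window`
    oldest packets (window=1 is plain FIFO)."""
    rng = random.Random(seed)
    # Saturated inputs: each queue always holds `window` pending
    # destinations, oldest first.
    queues = [[rng.randrange(n_ports) for _ in range(window)]
              for _ in range(n_ports)]
    delivered = 0
    for slot in range(slots):
        busy_outputs = set()
        for k in range(n_ports):
            # Rotate the starting input each slot so no input is favored.
            inp = (slot + k) % n_ports
            queue = queues[inp]
            for pos, out in enumerate(queue):
                if out not in busy_outputs:
                    # Oldest packet with a free output bypasses blocked ones.
                    busy_outputs.add(out)
                    del queue[pos]
                    queue.append(rng.randrange(n_ports))  # refill the window
                    delivered += 1
                    break
    return delivered / (slots * n_ports)


fifo = windowed_throughput(window=1)
lookahead = windowed_throughput(window=4)
print(f"window=1 (FIFO): {fifo:.3f}")
print(f"window=4:        {lookahead:.3f}")
```

In this model throughput rises with the window size, which is the qualitative behavior described above; the actual gain in a real fabric depends on the arbitration algorithm and traffic pattern.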
Overspeed is designed into the interconnect as an additional
measure to improve throughput. Packets are removed from the
input queues faster than they arrive at the input ports. For an
8-port 100Base-T module, the fabric channel provides a speedup of
2X, and for a single-port Gigabit Ethernet module, the fabric
channel provides a speedup of 1.5X. Switching packets faster
across the fabric not only minimizes the time packets wait at
the input queues for outputs to become available but also
reduces the total port-to-port packet latency across the switch.
The combination of the intelligent scheduling algorithm and the
overspeed interconnect channels enables PowerCast to deliver
near 100% throughput even under 100% input load distributed
randomly and uniformly over all output ports as shown in Figure
4. Even under output oversubscription, the PowerCast arbitration
algorithm minimizes throughput loss compared to a basic input
buffered switch.
Traffic Management
At the core of the network, rate mismatches between input and
output ports, bursty traffic patterns and network hot spots can
result in output wire overload. To manage such overload, all
critical queueing points in the system use multiple logical
priority queues to support various classes of service. Even
within the same priority class, the PowerCast switching fabric
ensures that no input
port can occupy the entire bandwidth of an output port in the
presence of traffic targeted at the same output port from other
input ports. Round robin arbitration at multiple levels of the
fabric guarantees that input ports which are trying to send
limited amounts of traffic to an output port cannot be blocked
by high density traffic coming in from different input ports.
Large input and output queues, implemented using high capacity,
low cost memory components, can hold several thousand packets to
smooth bursty traffic patterns across the switch. Unlike
per-flow queueing systems, which cannot deal with large bursts of
a single flow due to buffer capacity shortage, PowerCast
provides sufficient buffer capacity to enable efficient
application level flow control through high level protocols like
TCP/IP without packet loss.
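The fairness property attributed to round robin arbitration above can be shown with a single-output arbiter. This is a minimal sketch of the general technique, assuming one arbiter per output and a simple rotating pointer; the class and method names are hypothetical, and the multi-level arrangement used in the fabric is not modeled.

```python
class RoundRobinArbiter:
    """Grants one requesting input per cycle, starting the search just
    past the most recently granted input."""

    def __init__(self, n_inputs):
        self.n_inputs = n_inputs
        self.pointer = 0  # input with the highest priority for the next grant

    def grant(self, requests):
        """`requests` is a collection of input indices that want the output.
        Returns the granted input, or None if nobody is requesting."""
        requesting = set(requests)
        for offset in range(self.n_inputs):
            candidate = (self.pointer + offset) % self.n_inputs
            if candidate in requesting:
                # The winner becomes the lowest priority input next time.
                self.pointer = (candidate + 1) % self.n_inputs
                return candidate
        return None


arbiter = RoundRobinArbiter(4)
# Input 0 sends heavy traffic and requests on every cycle.
# Input 2 has a single packet waiting.
first = arbiter.grant({0, 2})   # input 0 is served first
second = arbiter.grant({0, 2})  # input 2 is served next, ahead of input 0
third = arbiter.grant({0})      # input 0 resumes once input 2 is done
idle = arbiter.grant(set())     # no requests, no grant
print(first, second, third, idle)
```

Because the pointer moves past each winner, a lightly loaded input waits at most one grant per competing input, no matter how much traffic the others offer.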
Wirespeed Multicast
In addition to providing wire-rate unicast performance, the
fabric is capable of replicating packets to multiple output
ports for efficient handling of multicast and broadcast traffic.
PowerCast technology can sustain a large number of multicast
streams, with possible overlaps among the targets of these
multicast groups. Multicast streams can be handled at the input
wire rate. Even under output port overload, packet loss is
localized to output ports which receive more traffic than the
wire can handle, while other members of the multicast group
continue to receive their streams without disruption.
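The localized-loss behavior described above can be modeled in software. The sketch below is illustrative only: the bitmask encoding of the multicast group, the `OutputPort` class and the queue capacities are assumptions made for the example, and the real fabric performs replication in hardware.

```python
from collections import deque


class OutputPort:
    """Output queue with a fixed capacity (hypothetical sizes)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0

    def enqueue(self, packet):
        if len(self.queue) >= self.capacity:
            self.dropped += 1  # loss stays local to this port
            return False
        self.queue.append(packet)
        return True


def replicate(packet, port_mask, ports):
    """Copy `packet` to every output whose bit is set in `port_mask`.
    Returns the indices of the ports that accepted a copy."""
    accepted = []
    for index, port in enumerate(ports):
        if port_mask & (1 << index) and port.enqueue(packet):
            accepted.append(index)
    return accepted


# Four outputs; port 1 is overloaded and can hold only one packet.
ports = [OutputPort(8), OutputPort(1), OutputPort(8), OutputPort(8)]
group = 0b0111  # multicast group with members on ports 0, 1 and 2

results = [replicate(f"pkt{i}", group, ports) for i in range(3)]
print(results)
print([len(p.queue) for p in ports], [p.dropped for p in ports])
```

Only the overloaded member loses copies. The other members of the group receive every packet, and ports outside the group are never touched.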
With efficient multicasting, the network can easily be turned
into a highly efficient broadcast medium. The potential demand
for one-to-many applications is proven by the success of
streaming technologies, one of the fastest growing segments of
network usage. Web-casting, video teleconferencing, push
technology and software distribution are just a few examples of
one-to-many networking technologies that are becoming popular.
However, most streaming applications are currently limited to
unicast packet flows inside corporate networks or over the
Internet backbone due to severe restrictions imposed on
multicast traffic bandwidth by the current generation of routers.
Multicast based on hardware replication of packets eliminates
the performance bottlenecks inherent in traditional
software-only routers. Wire-speed multicasting provisions the network
infrastructure to distribute large amounts of data to a large
number of clients reliably and simultaneously. With a
distribution network of only two levels of wire-speed multicast
capable switching routers from Enterasys, it is possible to
sustain multicast rates of tens of megabits per second to
thousands of ports.
References
[1] M. Karol, M. Hluchyj, and S.
Morgan, "Input Versus Output Queueing on a Space Division
Switch", IEEE Trans. Comm., 35(12) pp.1347-1356
[2] N. McKeown, V. Anantharam, J. Walrand,
"Achieving 100% Throughput in an Input-Queued Switch",
INFOCOM'96, pp.296-302
[3] Y. Oie, M. Murata, K. Kubota and H. Miyahara,
"Effect of Speedup in Non-blocking Packet Switch",
Proc. ICC'89, Boston, MA, June 1989, pp.410-414
[4] A.L. Gupta and N.D. Georganas, "Analysis of a
Packet Switch with Input and Output Buffers and Speed
Constraints", INFOCOM'91, Bal Harbour, FL, April 1991,
pp.694-700
[5] J.S.C. Chen and T.E. Stern, "Throughput
analysis, optimal buffer allocation and traffic imbalance study
of a generic non-blocking packet switch", IEEE Journal of
Selected Areas in Communications, Vol. 9, No.3, April 1991,
pp.439-449
[6] N. McKeown, B. Prabhakar, and M. Zhu,
"Matching Output Queueing with Combined Input and Output
Queueing", Proc. 35th Annual Allerton Conf. on Comm.,
Monticello, IL, October 1997
[7] B. Prabhakar, N. McKeown, "On the speedup
required for combined input and output queued switching",
Stanford University Computer Systems Lab Technical Report,
CSL-TR-97-738, November 1997
[8] R. Guerin and K.N. Sivarajan, "Delay and
Throughput Performance of Speeded-up Input-Queueing Packet
Switches", IBM Research Report RC20892, March 1998
[9] P. Krishna, N.S. Patel, A. Charny and R. Simcoe, "On the
Speedup Required for Work Conserving Crossbar Switches",
IWQoS'98, May 1998
[10] S. Chuang et al., "Matching Output Queueing
with a Combined Input Output Queued Switch", Stanford
University Computer Systems Lab Technical Report, CSL-TR-98-758
[11] I. Stoica and H. Zhang, "Exact Emulation of an
Output Queueing Switch by a Combined Input Output Queueing
Switch", IWQoS'98, May 1998