1
High Performance Data Transfer
  • Phillip Dykstra
  • Chief Scientist
  • WareOnEarth Communications Inc.
  • phil@sd.wareonearth.com
2
Online Copy
  • This tutorial can be found at
    • http://www.wcisd.hpc.mil/~phil/sc2004/
3
Motivation
4
Tutorial Focus
  • Moving a lot of data
    • Quickly
    • Securely
    • Error free
5
Unique HPC Environment
  • The Internet has been optimized for
    • millions of users behind low speed connections
    • thousands of high bandwidth servers serving millions of low speed streams
  • Single high-speed to high-speed flows get little commercial attention
6
Objectives
  • Look at current high performance networks
  • Fundamental understanding of delay, loss, bandwidth, routes, MTU, windows
  • Learn what is required for high speed data transfer and what to expect
  • Look at many useful tools and how to use them for performance testing and debugging
  • Examine TCP dynamics and TCP’s future
  • Look at alternatives to TCP and higher level approaches to data transfer
7
Topics
  • Fundamental Concepts            Slide #
    • Layers, Circuits                    9
    • High Performance Networks          18
    • Delay                              36
    • Routes                             57
    • Packet Size                        76
    • Bandwidth                          86
    • Windows                           110
  • TCP Performance                     126
  • Performance Tuning                  168
8
Topics
  • Testing and Debugging               184
  • Data Integrity                      249
  • TCP Security                        272
  • TCP Improvements                    286
  • Alternatives to TCP                 306
  • Quality of Service                  319
  • Storage Area Networks               338
  • Peer to Peer                        351
  • Abstract Storage Layers             366
9
Quick Layer Review
10
WAN Circuits
  • TDM – Time Division Multiplexing
    • T1, 1.544 Mbps
    • T3, 44.736 Mbps
  • SONET – Synchronous Optical Network
    • OC3, 155.520 Mbps
    • OC12, 622.080 Mbps
    • OC48, 2.488 Gbps
    • OC192, 9.953 Gbps (“10 Gbps”)
    • OC768, 39.813 Gbps (“40 Gbps”)
11
Relative Bandwidths
12
Fiber Optic Attenuation
13
Wave Division Multiplexing (WDM)
  • Use color / frequency to separate channels
  • Coarse (CWDM) and Dense (DWDM) varieties
  • Usually in the 1550 nm band
14
WDM Evolution
Toward 100 Tbps per fiber
15
DWDM Progress
  • Example DWDM Technology Deployment
    • 1996, 8 channels, 2.4 Gbps each (19 Gbps)
    • 2000, 40 channels, 10 Gbps each (400 Gbps)
    • 2004, 80 channels, 40 Gbps each (3.2 Tbps)
    • 2010, 320 channels, 160 Gbps each (51 Tbps)
  • Cost is linear with # of channels
  • ~2.5x cost for 4x bps within a TDM channel
  • The return of wider channels (for higher rates)?
  • A lot of fiber in the ground today will be obsolete
16
Ethernet World Domination
  • Over 95% of the world’s computers connect via Ethernet
  • 1 Gbps on copper can be had for $10/port
    • Fiber is at least 10x more expensive
  • 10 Gbps fiber NICs run $2000 - $6000 (Oct 2004)
    • $10k/switch port, $100k/router port
  • Why aren’t WANs Ethernet too?
17
Encapsulation
18
High Performance Networks in the USA
19
NGI Architecture
20
Abilene Backbone
Sep 2004
21
 
22
 
23
MCI vBNS+
24
DREN Map
25
DOE ESnet, March 2004
26
ESnet Future
27
NREN Testbed
28
Target NREN Backbone
December 2005
29
National LambdaRail Architecture
30
 
31
 
32
NLR Layer 2 Network
Nationwide Gigabit Ethernet
33
Network Speeds Over Time
34
Commercial Networks
  • The top ten networks defined by number of peers


  • Rank  Peers   ASN  Description
  •  1    2,347   701  UUNET Technologies, Inc.
  •  2    1,902  7018  AT&T WorldNet Services
  •  3    1,732  1239  Sprint
  •  4    1,171  3356  Level 3 Communications, LLC
  •  5    1,092   209  Qwest
  •  6      636  2914  Verio, Inc.
  •  7      623   174  Cogent Communications
  •  8      597  3549  Global Crossing
  •  9      549  6461  Abovenet Communications, Inc
  • 10      533  4513  Globix Corporation


  • Source: www.fixedorbit.com, Sep 2004
35
Network References
  • Abilene, abilene.internet2.edu
  • DREN, www.hpcmo.hpc.mil/Htdocs/DREN
  • ESnet, www.es.net
  • NLR, www.nlr.net
  • NREN, www.nren.nasa.gov
  • vBNS+, www.vbns.net
36
Delay
  • a.k.a. Latency
37
Speed Limit
  • The speed of light is a constant
  • It seems slower every year
38
Speed of Light in Media
  • ~3.0 x 10^8 m/s in free space
  • ~2.3 x 10^8 m/s in copper
  • ~2.0 x 10^8 m/s in fiber = 200 km / ms
    •   [100 km of distance = 1 ms of round trip time]
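As a rough sketch of the rule of thumb above, the minimum round trip time over fiber can be estimated from the path length alone. The function name and the 4,000 km example path are illustrative assumptions; real fiber routes are usually longer than the straight-line distance, and queueing adds to this floor.

```python
FIBER_KM_PER_MS = 200.0  # ~2.0 x 10^8 m/s in fiber, as stated on the slide


def min_rtt_ms(path_km: float) -> float:
    """Lower bound on round trip time (ms) for a fiber path of path_km."""
    # The signal covers the path twice: out and back.
    return 2.0 * path_km / FIBER_KM_PER_MS


print(min_rtt_ms(100))   # the 100 km rule of thumb
print(min_rtt_ms(4000))  # a hypothetical 4,000 km cross-country path
```

Comparing this floor with a measured ping time gives a quick feel for how much of the delay is propagation and how much is routing detours or queues.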
39
Light Speed Delay in Fiber
40
High “Speed” Networks
41
Packet Durations and Lengths
1500 Byte Packets in Fiber
42
Observations on Packet Lengths
  • A 56k packet could wrap around the earth!



  • A 10GigE packet fits in the convention center
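The two observations above can be checked with a short calculation. This is a minimal sketch that assumes a 1500 byte packet, a propagation speed of 2.0 x 10^8 m/s in fiber, and nominal line rates (for example, 56,000 bps for a "56k" link); the function names are illustrative.

```python
FIBER_M_PER_S = 2.0e8  # propagation speed in fiber, from the earlier slide


def packet_duration_s(size_bytes: int, rate_bps: float) -> float:
    """Time needed to clock one packet onto a link of the given rate."""
    return size_bytes * 8 / rate_bps


def packet_length_m(size_bytes: int, rate_bps: float) -> float:
    """Physical length of the packet while it is travelling in fiber."""
    return packet_duration_s(size_bytes, rate_bps) * FIBER_M_PER_S


for name, rate in [("56k", 56e3), ("T1", 1.544e6),
                   ("100 Mbps", 100e6), ("10GigE", 10e9)]:
    duration = packet_duration_s(1500, rate)
    length = packet_length_m(1500, rate)
    print(f"{name:>9}: {duration * 1e6:12.1f} us  {length / 1000:12.3f} km")
```

At 56 kbps the packet is roughly 43,000 km long, more than the circumference of the earth, while at 10 Gbps it is a few hundred meters.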
43
Observations on Packet Lengths
  • Each store and forward hop adds the packet duration to the delay
    • In the old days (< 10 Mbps) such hops dominated delay
    • Today (> 10 Mbps) store and forward delays on WANs are minimal compared to propagation
44
Observations on Packet Lengths
  • ATM cells (and TCP ACK packets) are ~1/30th as long, 30x as many per second
    • One of the reasons we haven’t seen OC48 SAR until recently (2002)
  • Jumbo Frames (9000 bytes) are 6x longer, 1/6th as many per second
45
Router Queues
  • The major source of variable delay
  • Handle temporary inequalities between arrival rate and output interface speed
  • Small queues minimize delay variation
  • Large queues minimize packet drop
46
Passive Queue Management (PQM)
  •   When full, choose a packet to drop
    • Tail-Drop – arriving packet
    • Drop-From-Front – packet at front of queue
    • Push-Out – packet at back of queue
    • Random-Drop – pick any packet
47
Active Queue Management (AQM)
  • Addresses lock-out and full queue problems
  • Incoming packet drop probability is a function of average queue length
  • Random Early Detection (RED)
    • Recommended Practice, RFC 2309, Apr 98
    • Minth, Maxth, Mindrop, Maxdrop, w
    • www.icir.org/floyd/red.html
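To make the "drop probability is a function of average queue length" idea concrete, here is a simplified sketch of a RED-style calculation. It assumes a linear ramp from mindrop to maxdrop between the two thresholds and an exponentially weighted moving average with weight w; the class name is hypothetical, and the full algorithm in RFC 2309 and the RED papers includes further details (such as spacing drops by packet count) that are left out here.

```python
class RedSketch:
    """Simplified RED drop-probability calculation (not a full implementation)."""

    def __init__(self, minth, maxth, maxdrop, w, mindrop=0.0):
        self.minth = minth      # below this average length, never drop
        self.maxth = maxth      # at or above this average length, always drop
        self.mindrop = mindrop  # drop probability at minth
        self.maxdrop = maxdrop  # drop probability just below maxth
        self.w = w              # averaging weight, 0 < w <= 1
        self.avg = 0.0          # moving average of the queue length

    def update(self, queue_len):
        """Fold the instantaneous queue length into the moving average."""
        self.avg = (1.0 - self.w) * self.avg + self.w * queue_len
        return self.avg

    def drop_probability(self):
        if self.avg < self.minth:
            return 0.0
        if self.avg >= self.maxth:
            return 1.0
        frac = (self.avg - self.minth) / (self.maxth - self.minth)
        return self.mindrop + frac * (self.maxdrop - self.mindrop)


red = RedSketch(minth=5, maxth=15, maxdrop=0.1, w=0.002)
for _ in range(2000):
    red.update(12)  # a queue that sits at 12 packets for a while
print(round(red.avg, 2), round(red.drop_probability(), 3))
```

A small w makes the average respond slowly, so short bursts pass through untouched while a persistently long queue starts to see early drops.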
48
Measuring Delay: Ping
  • $ ping -s 56 cisco.com
  • PING cisco.com (198.133.219.25) from 63.196.71.246 : 56(84) bytes of data.
  • 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=1 ttl=241 time=25.6 ms
  • 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=2 ttl=241 time=25.5 ms
  • 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=3 ttl=241 time=25.1 ms
  • 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=4 ttl=241 time=26.1 ms
  • 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=5 ttl=241 time=25.0 ms
  • 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=6 ttl=241 time=25.8 ms
  • 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=8 ttl=241 time=25.4 ms
  • 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=9 ttl=241 time=25.1 ms
  • 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=10 ttl=241 time=26.2 ms


  • --- cisco.com ping statistics ---
  • 10 packets transmitted, 9 received, 10% loss, time 9082ms
  • rtt min/avg/max/mdev = 25.053/25.585/26.229/0.455 ms


49
Ping Observations
  • Ping packet  = 20 bytes IP + 8 bytes ICMP + “user data” (first 8 bytes = timestamp)
  • Default = 56 user bytes = 64 byte IP payload = 84 total bytes
  • Small pings (-s 8 = 36 bytes) take less time than large pings (-s 1472 = 1500 bytes)
50
Ping Observations
  • TTL = 241 indicates 255-241 = 14 hops
  • Delay variation indicates congestion or system load
  • Not good at measuring small loss
    • An HPC network should show zero ping loss
  • Depends on ICMP ECHO which is sometimes blocked for “security”
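The hop estimate above depends on knowing the initial TTL the remote host used. The sketch below assumes the initial value is the smallest of 64, 128, or 255 that is not below the received TTL; that list of common defaults and the function name are assumptions, and a host using some other initial value would give a wrong answer.

```python
COMMON_INITIAL_TTLS = (64, 128, 255)  # assumed typical host defaults


def estimate_hops(received_ttl: int) -> int:
    """Guess the number of router hops on the return path from a reply's TTL."""
    initial = min(t for t in COMMON_INITIAL_TTLS if t >= received_ttl)
    return initial - received_ttl


print(estimate_hops(241))  # the cisco.com example: 255 - 241
```

Note that this counts hops on the path the reply took, which may differ from the forward path when routes are asymmetric.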
51
Delay Changes
  • Delay jumping between two distinct values
  • Does not show up in traceroute as a route change
  • Layer 2 problem: SONET protect failover
52
Delay Creep
When layer 2 “heals” itself
53
One-Way Active Measurement Protocol (OWAMP)
  •         http://e2epi.internet2.edu/owamp
54
Surveyor
  • Showed that one-way is interesting
55
Real-Time Delay
56
Bandwidth*Delay Product
  • The number of bytes in flight to fill the entire path
  • Includes data in queues if they contributed to the delay
  • Example
    • 100 Mbps path
    • ping shows a 75 ms rtt
    • BDP = 100 Mbps * 0.075 s = 7.5 million bits (about 916 KB)
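The example above can be reproduced with a one-line calculation. This is a minimal sketch; the function name is illustrative, and the KB figure uses 1 KB = 1024 bytes, which is what the slide's 916 KB implies.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth*delay product in bytes for a path."""
    return bandwidth_bps * rtt_s / 8


bdp = bdp_bytes(100e6, 0.075)   # 100 Mbps path, 75 ms rtt
print(round(bdp))               # bytes in flight needed to fill the path
print(round(bdp / 1024))        # the same figure in KB
```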
57
Routes
  • The path taken by your packets
58
How Routers Choose Routes
  • Within a network
    • Smallest number of hops
    • Highest bandwidth paths
    • Usually ignore latency and utilization
  • From one network to another
    • Often “hot potato” routing, i.e. pass to the other network ASAP
59
IP Routing Hierarchy
60
“Scenic” Routes
61
Asymmetric Routes
62
Level3 Dark Fiber
63
Qwest Fiber Routes
64
MCI Fiber Routes
65
NGI Architecture
66
Path Performance: Latency vs. Bandwidth
67
A Modern Traceroute
  • ftp://ftp.login.com/pub/software/traceroute/
  • Supports
    • MTU discovery (-M)
    • Report ASNs (-A)
    • Registered owners (-O)
    • Set IP TOS field (-t)
    • Microsecond timestamps (-u)
    • Set IP protocol (-I)
68
Traceroute Example
  • [phil@damp-ssc phil]$ traceroute -A mit.edu
  • traceroute to mit.edu (18.7.22.69), 64 hops max, 40 byte packets
  •  1  ge-0-1-0.sandiego.dren.net (138.18.190.1) [AS668]  0 ms  0 ms  0 ms
  •  2  so-0-0-0.ngixeast.dren.net (138.18.1.55) [AS668]  76 ms  75 ms  76 ms
  •  3  Abilene-peer.ngixeast.dren.net (138.18.47.34) [AS668]  76 ms  77 ms  76 ms
  •  4  nycmng-washng.abilene.ucaid.edu (198.32.8.84) [<NONE>]  80 ms  88 ms  80 ms
  •  5  ATM10-420-OC12-GIGAPOPNE.nox.org (192.5.89.9) [<NONE>]  87 ms  86 ms  85 ms
  •  6  192.5.89.90 (192.5.89.90) [<NONE>]  85 ms  86 ms  104 ms
  •  7  W92-RTR-1-BACKBONE.MIT.EDU (18.168.0.25) [AS3]  85 ms  85 ms  85 ms
  •  8  WEB.MIT.EDU (18.7.22.69) [AS3]  86 ms  86 ms  85 ms


69
How Traceroute Works
www.caida.org/outreach/resources/animations/
  • Sends UDP packets to ports (-p) 33434 and up, TTL of 1 to 30
  • Each router hop decrements the TTL
  • When the TTL reaches 0, that node returns an ICMP TTL Expired
  • The destination host returns an ICMP Port Unreachable
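The mechanism described above can be illustrated with a small simulation. This sketch sends no packets; it only models the TTL handling along a made-up path, and the node names and function name are hypothetical.

```python
def simulate_traceroute(path, max_ttl=30):
    """Model traceroute over `path`: routers first, destination host last.

    Returns a list of (probe_ttl, responding_node, reply_type) tuples.
    """
    results = []
    for ttl in range(1, max_ttl + 1):
        remaining = ttl
        for router in path[:-1]:
            remaining -= 1  # each router hop decrements the TTL
            if remaining == 0:
                # The probe dies here and the router reports it.
                results.append((ttl, router, "ICMP TTL Expired"))
                break
        else:
            # The probe survived every router and reached the host,
            # where nothing is listening on the high UDP port.
            results.append((ttl, path[-1], "ICMP Port Unreachable"))
            return results
    return results


for hop in simulate_traceroute(["router-a", "router-b", "router-c", "host"]):
    print(hop)
```

Each probe with a larger TTL gets one hop further, which is how the tool builds its list of hops, and the Port Unreachable reply is what tells it to stop.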
70
Traceroute Observations
  • Shows the return interface addresses of the forwarding path
  • You can’t see hops through switches or over tunnels (e.g. ATM VC’s, GRE, MPLS)
  • The required ICMP replies are sometimes blocked for “security”, or not generated, or sent without resetting the TTL
71
Matt’s Traceroute
www.bitwizard.nl/mtr/
72
GTrace – Graphical Traceroute
www.caida.org/tools/visualization/gtrace/
73
tcptraceroute
  • http://michael.toren.net/code/tcptraceroute/
  • Useful when ICMP is blocked
    • But still depends on TTL expired replies
  • Can select the TCP port number
    • Helps locate firewall or ACL blocks
    • Defaults to port 80 (http)
74
Routing vs. Switching
  • IP routing requires a longest prefix match
    • Harder than switching, but now wire speed
    • Inspired IP switching / MPLS
  • Switches have gained features
    • Some even route
  • Simplicity is good
    • Use switched infrastructure where you can, routers where you need better separation or control
75
Multi-Protocol Label Switching (MPLS)
  • Adds switched (layer 2) paths below IP
    • Useful for traffic engineering, VPN’s, QoS control, high speed switching
  • IP packets get wrapped in MPLS frames and “labeled”
  • MPLS routers switch the packets along Label Switched Paths (LSP’s)
  • Being generalized for optical switching
76
Packet Sizes
  • MTU
77
Maximum Transmission Unit (MTU)
  • Maximum packet size that can be sent as one unit (no fragmentation)
  • Usually “IP MTU” which is the largest IP datagram including the IP header
    • Shown with ifconfig on Unix/Linux
  • Sometimes a layer 2 MTU
    • IP MTU 1500 = Ethernet MTU 1518 (or 1514, or 1522)
78
IP MTU Sizes
79
Ethernet Jumbo Frames
  • Non-standard Ethernet extension
    • Usually 9000 bytes (IP MTU)
    • Sometimes anything over 1500
  • Usually only used with 1GigE and above
  • Requires separate LANs or VLANs to accommodate non-jumbo equipment
  • http://sd.wareonearth.com/~phil/jumbo.html
80
Reasons for Jumbo
  • Reduce system and network overhead
  • Handle 8KB NFS or SAN packets
  • Improve TCP performance!
    • Greater throughput
    • Greater loss tolerance
    • Reduced router queue loads
81
Vendor Jumbo Support
  • MTU 9000 interoperability demonstrated:
    • Allied Telesyn, Avaya, Cisco, Extreme, Foundry, NEC, Nortel
  • Juniper = 9178 (9192 – 14)
  • Cisco (5000/6000) = 9216
  • Foundry JetCore 1 & 10 GbE = 14000
  • NetGear GA621 = 16728
  • http://darkwing.uoregon.edu/~joe/jumbo-clean-gear.html


82
Path MTU (PMTU)
  • Minimum MTU of all hops in a path
  • Hosts can do Path MTU Discovery to find it
    • Depends on ICMP replies
  • Without PMTU Discovery hosts should assume PMTU is only 576 bytes
    • Some hosts falsely assume 1500!
83
Check the MTU
84
Using Ping to Check the MTU
85
Things You Can Do
  • Use only large MTU interfaces/routers/links
    • Gigabit Ethernet with Jumbo Frames (9000)
    • Packet over SONET (POS) (4470, 9000+)
    • ATM CLIP (9180)
  • Never reduce the MTU (or bandwidth) on the path between each/every host and the WAN
  • Make sure your TCP uses Path MTU Discovery
86
Bandwidth
  • and throughput
87
Hops of Different Bandwidth
88
Throughput Limit
  • throughput <= available bandwidth (the “tight link” is the one with the minimum unused bandwidth)


    • A high performance network should be lightly loaded (<50%)
    • A loaded high speed network is no better to the end user than a lightly loaded slow one
89
Bandwidth Determination
  • Ask the routers/switches
    • SNMP
    • RMON / NetFlow / sFlow
  • Passive Measurement
    • tcpdump
    • OCxmon
  • Active Measurement
    • Light weight tests
    • Full bandwidth tests
90
SNMP
  • Simple Network Management Protocol
    • Version 1: RFC 1155-1157 (1990)
    • Version 2c: RFC 1901-1908 (1996)
    • Version 3: RFC 2571-2574 (1999)
  • Net-SNMP (was UCD-SNMP)
    • http://www.net-snmp.org/
    • Library and utilities
91
Net-SNMP Example
  • $ snmpget -v 2c -c public router.dren.net system.sysUpTime.0
  • SNMPv2-MIB::sysUpTime.0 = Time ticks: (1407699551) 162 days, 22:16:35.51


  • $ snmpwalk -v 2c -c public router.dren.net IF-MIB::ifXTable
  • IF-MIB::ifName.1 = STRING: fxp0
  • IF-MIB::ifName.2 = STRING: fxp1
  • …
  • IF-MIB::ifHCInOctets.2 = Counter64: 4049716647
  • IF-MIB::ifHCOutOctets.2 = Counter64: 3264374754
  • IF-MIB::ifHighSpeed.2 = Gauge32: 100
  • IF-MIB::ifAlias.2 = STRING: Router 1 LAN
  • …
92
SNMP Counter Wrap
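A sketch of the counter wrap issue, assuming octet counters that are either 32 or 64 bits wide and that wrap at most once between two polls. The function names are illustrative and not part of Net-SNMP.

```python
def counter_delta(old: int, new: int, bits: int = 32) -> int:
    """Difference between two counter samples, allowing for a single wrap."""
    return (new - old) % (1 << bits)


def rate_bps(old_octets: int, new_octets: int, interval_s: float,
             bits: int = 32) -> float:
    """Average bit rate between two octet-counter samples."""
    return counter_delta(old_octets, new_octets, bits) * 8 / interval_s


def wrap_time_s(link_bps: float, bits: int = 32) -> float:
    """How long a saturated link takes to wrap an octet counter."""
    return (1 << bits) * 8 / link_bps


print(round(wrap_time_s(100e6)))  # seconds at 100 Mbps
print(round(wrap_time_s(1e9)))    # seconds at 1 Gbps
```

If the polling interval is longer than the wrap time, a 32-bit counter can wrap more than once and the computed rate is silently wrong, which is one reason to poll the 64-bit ifHC counters on fast links.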
93
SNMP Resolution
  • Juniper caches interface statistics for five seconds
  • Reads provide new data only if cache > 5 seconds old
  • No timestamps on the data
94
Real-Time DREN Traffic
95
MRTG
  • www.mrtg.org
  • Extremely popular network monitoring tool
  • Most common display:
    • Five minute average link utilizations
    • Green into interface
    • Blue out of interface
96
MRTG Example
97
RRDtool
  • Round Robin Database Tool
  • Reimplementation of MRTG’s backend
    • Generalized data storage, retrieval, graphing
    • Many front ends available, including MRTG
  • www.rrdtool.org
98
Abilene Weather Map
99
tcpdump
  • www.tcpdump.org
    • Raw packet capture / decode from hosts
    • libpcap
  • www.ethereal.com
    • Nice GUI protocol analyzer
100
RMON / NetFlow / sFlow
  • RMON, 1993
    • SNMP controllable packet capture
  • NetFlow, 1997
    • Cisco flow records or samples
  • sFlow, 2001
    • Low level packet samples and statistics
101
NetFlow
  • Cisco flow records (also on Juniper)
  • Version 5 and 9 formats common today
    • www.cisco.com/go/netflow
    • www.ietf.org/internet-drafts/draft-claise-netflow-version9-07.txt
  • Many tools
    • Cflowd, www.caida.org/tools/measurement/cflowd/
    • flow-tools, www.splintered.net/sw/flow-tools/
    • FlowScan, net.doit.wisc.edu/~plonka/FlowScan/
102
sFlow
  • RFC 3176, Sep 2001
  • Switch or router level agent
    • Controlled by SNMP
  • Statistical flow samples
    • A flow is one ingress port to one or more egress ports
    • Up to 256 packet header bytes, or summary
  • Periodic counter statistics
    • typical 20-120 sec interval
  • www.sflow.org, www.ntop.org
103
Bandwidth Estimation – Single Packet
104
Bandwidth Estimation – Multi Packet
  • Packet pairs or trains are sent
  • The slower link causes packets to spread
  • The packet spread indicates the bandwidth
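The packet-pair idea can be expressed as a simple formula: the bottleneck rate is the packet size divided by the spacing between back-to-back packets on arrival. The sketch below assumes ideal conditions (no cross traffic, accurate timestamps), and the function name and sample numbers are illustrative.

```python
def bottleneck_bps(packet_bytes: int, spread_s: float) -> float:
    """Estimate the slowest link's rate from the arrival spacing of a packet pair."""
    return packet_bytes * 8 / spread_s


# Two 1500 byte packets sent back to back arrive 120 microseconds apart.
print(round(bottleneck_bps(1500, 120e-6) / 1e6), "Mbps")
```

Real tools send many pairs or trains and filter the results statistically, because cross traffic can both compress and stretch the spacing.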
105
Bandwidth Measurement Tools
  • pathchar – Van Jacobson, LBL
    • ftp://ftp.ee.lbl.gov/pathchar/
  • clink – Allen Downey, Wellesley College
    • http://rocky.wellesley.edu/downey/clink/
  • pchar – Bruce A. Mah, Sandia/Cisco
    • http://www.employees.org/~bmah/Software/pchar/
106
Bandwidth Measurement Tools
  • pipechar - Jin Guojun, LBL
    • http://www.didc.lbl.gov/pipechar/
  • nettimer - Kevin Lai, Stanford University
    • http://gunpowder.stanford.edu/~laik/projects/nettimer/
  • pathrate/pathload - Constantinos Dovrolis, Georgia Tech
    • http://www.cc.gatech.edu/fac/Constantinos.Dovrolis/bwmeter.html
107
abing
  • http://www-iepm.slac.stanford.edu/tools/abing/
  • Sends 20 pairs (40 packets) of 1450 byte UDP to a reflector on port 8176, estimates (in ~1 second):
    • ABw: Available Bandwidth
    • Xtr: Cross traffic
    • DBC: Dominating Bottleneck Capacity
108
Abing Example 1
109
Abing Example 2
110
Windows
  • Flow/rate control and error recovery
111
Windows
  • Windows control the amount of data that is allowed to be “in flight” in the network
  • Maximum throughput is one window full per round trip time
  • The sender, receiver, and the network each determine a different window size
112
Window Sizes 1,2,3
113
MPing – A Windowed Ping
www.psc.edu/~mathis/wping/
  • Excellent tool to view the packet forwarding and loss properties of a path under varying load!
  • Sends windows full of ICMP Echo or UDP packets
  • Treats ICMP Echo_Reply or Port_Unreachable packets as “ACKs”
  • Make sure destination responds well to ICMP
  • Consumes a lot of resources: use with care
114
How MPing Works
115
MPing on a “Normal” Path
116
MPing on a “Normal” Path
117
Some MPing Results #1
118
Some MPing Results #2
119
Some MPing Results #3
120
Some MPing Results #4
121
Ethernet Duplex Problems
An Internet Epidemic!
  • Ethernet “auto-negotiation” can select the speed and duplex of a connected pair
  • If only one end is doing it:
    • It can get the speed right
    • but it will assume half-duplex
  • Mismatch loss only shows up under load
    • Can’t see it with ping
122
Windows and TCP Throughput
  • TCP Throughput is at best one window per round trip time (window/rtt)
  • The maximum TCP window of the receiving or sending host is the most common limiter of performance
  • Examples
    • 8 KB window / 87 msec ping time = 753 Kbps
    • 64 KB window / 14 msec rtt = 37 Mbps
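The two examples above follow directly from window/rtt. A minimal sketch, assuming 1 KB = 1024 bytes; the function name is illustrative.

```python
def max_tcp_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on TCP throughput: one window per round trip time."""
    return window_bytes * 8 / rtt_s


print(round(max_tcp_throughput_bps(8 * 1024, 0.087) / 1e3), "Kbps")
print(round(max_tcp_throughput_bps(64 * 1024, 0.014) / 1e6, 1), "Mbps")
```

Turning the formula around, the window needed for a target rate is rate * rtt / 8 bytes, which is the bandwidth*delay product.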
123
Maximum TCP/IP Data Rate
With 64KB window
124
Receive Windows for 1 Gbps
125
Observed Receiver Window Sizes
  • ATM traffic from the Pittsburgh Gigapop
  • 50% had windows < 20 KB
    • These are obsolete systems!
  • 20% had 64 KB windows
    • Limited to ~ 8 Mbps coast-to-coast
  • ~9% are assumed to be using window scale
126
TCP
  • The Internet’s transport
127
Important Points About TCP
  • TCP is adaptive
  • It is constantly trying to go faster
  • It slows down when it detects a loss


  • How much it sends is controlled by windows
  • When it sends is controlled by received ACK’s (or timeouts)
128
TCP Throughput vs. Time
129
The Three TCP Windows
130
TCP Receive Window
  • A flow control mechanism
  • The amount of data the receiver is ready to receive
    • Updated with each ACK
  • Loosely set by receive socket buffer size
  • Application has to drain the buffer fast enough
  • System has to wait for missing or out of order data (reassembly queue)
131
Sender Socket Buffers
  • Must hold two round trip times of data
    • One set for the current round trip time
    • One set for possible retransmits from the last round trip time (retransmit queue)
  • A system maximum often limits size
  • The application must write data quickly enough
132
TCP Congestion Window
  • Flow control, calculated by sender, based on observations of the data transfer (loss, rtt, etc.)
  • In the old days, there was none
    • TCP would send a full receive window in one round trip time
  • TCP now has two modes
    • Slow-start: cwnd increases exponentially
    • Congestion-Avoidance: cwnd increases linearly
133
TCP Congestion Window (cwnd)
  • Congestion window controls startup and limits throughput in the face of loss
  • cwnd gets larger after every new ACK
  • cwnd gets smaller when loss is detected
  • cwnd amounts to TCP’s estimate of the available bandwidth at any given time
134
Cwnd During Slow Start
135
Slow Start and Congestion Avoidance Together
136
AIMD
  • Additive Increase, Multiplicative Decrease
    • of cwnd, the congestion window
  • The core of TCP’s congestion avoidance phase, or “steady state”
    • Standard increase = +1.0 MSS per loss free rtt
    • Standard decrease = *0.5 (i.e. halve cwnd on loss)
  • Avoids congestion collapse
  • Promotes fairness among flows
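The AIMD rule above can be sketched as a per-rtt update of cwnd. This is a toy model that assumes cwnd is counted in MSS units and that loss is detected once per marked round trip; it ignores slow start, timeouts, and fast recovery details, and the function name is hypothetical.

```python
def aimd(rtts, loss_rtts, cwnd=1.0, increase=1.0, decrease=0.5):
    """Return the cwnd (in MSS) in effect during each round trip."""
    history = []
    for rtt in range(rtts):
        history.append(cwnd)
        if rtt in loss_rtts:
            cwnd = max(1.0, cwnd * decrease)  # multiplicative decrease
        else:
            cwnd += increase                  # additive increase
    return history


# Start at 10 MSS and lose a packet during the fourth round trip.
print(aimd(6, {3}, cwnd=10))
```

The resulting sawtooth is why a single loss on a long, fast path is so expensive: the window is halved at once but grows back only one MSS per round trip.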
137
Iperf : TCP California to Ohio
138
TCP Examples from Maui HI
139
TCP Acceleration (MSS/rtt^2)
(Congestion avoidance rate increase, MSS = 1448)
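The MSS/rtt^2 figure follows from congestion avoidance adding one MSS to the window per round trip: the rate grows by MSS/rtt every rtt, so by MSS/rtt^2 per second. A minimal sketch, assuming the slide's MSS of 1448 bytes and a hypothetical 100 ms path:

```python
def tcp_acceleration_bps_per_s(mss_bytes: int, rtt_s: float) -> float:
    """Rate of throughput increase during congestion avoidance."""
    return mss_bytes * 8 / rtt_s ** 2


def recovery_time_s(from_bps, to_bps, mss_bytes, rtt_s):
    """Time to climb from one rate to another with no further loss."""
    return (to_bps - from_bps) / tcp_acceleration_bps_per_s(mss_bytes, rtt_s)


accel = tcp_acceleration_bps_per_s(1448, 0.100)
print(round(accel / 1e6, 2), "Mbps per second")
print(round(recovery_time_s(500e6, 1e9, 1448, 0.100)), "seconds to recover")
```

Under these assumptions a flow halved from 1 Gbps needs several minutes without loss to get back to full rate, and doubling the rtt makes that four times worse.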
140
Bandwidth*Delay Product and TCP
  • TCP needs a receive window (rwin) equal to or greater than the BW*Delay product to achieve maximum throughput
  • TCP needs sender side socket buffers of 2*BW*Delay to recover from errors
  • You need to send about 3*BW*Delay bytes for TCP to reach maximum speed
141
Delayed ACKs
  • TCP receivers send ACK’s:
    • after every second segment
    • after a delayed ACK timeout
    • on every segment after a loss (missing segment)
  • A new segment sets the delayed ACK timer
    • Typically 0-200 msec
  • A second segment (or timeout) triggers an ACK and clears the delayed ACK timer
142
ACK Clocking
  • A queue forms in front of a slower speed link
  • The slower link causes packets to spread
  • The spread packets result in spread ACK’s
  • The spread ACK’s end up clocking the source packets at the slower link rate
143
Detecting Loss
  • Packets get discarded when queues are full (or nearly full)
  • Duplicate ACK’s get sent after missing or out of order packets
  • Most TCP’s retransmit after the third duplicate ACK (“triple duplicate ACK”)
    • Windows XP now uses 2nd dup ACK
144
TCP Throughput Model
Once recv window size and available bandwidth aren’t the limit

  • Rate = ~0.7 * MSS / (RTT * sqrt(pkt_loss))
    • MSS = Max Segment Size, RTT = Round Trip Time
    • M. Mathis, et al.


  • Double the MTU, double the throughput
  • Halve the latency, double the throughput
    • shortest path matters
  • Halve the loss rate, 40% higher throughput
  • www.psc.edu/networking/papers/model_abstract.html
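The scaling rules above can be checked numerically with the model. A minimal sketch, using the 0.7 constant from the slide; the function name and the sample MSS, rtt, and loss values are illustrative assumptions.

```python
import math


def mathis_rate_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Loss-limited TCP throughput estimate: 0.7 * MSS / (rtt * sqrt(p))."""
    return 0.7 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


base = mathis_rate_bps(1460, 0.070, 1e-6)
print(round(base / 1e6, 1), "Mbps at MSS 1460, 70 ms rtt, 1e-6 loss")
print(round(mathis_rate_bps(2920, 0.070, 1e-6) / base, 2), "x with double the MSS")
print(round(mathis_rate_bps(1460, 0.035, 1e-6) / base, 2), "x with half the rtt")
print(round(mathis_rate_bps(1460, 0.070, 5e-7) / base, 2), "x with half the loss")
```

The last ratio is 1/sqrt(0.5), about 1.41, which is where the "40% higher throughput" figure comes from.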
145
Round Trip Time (RTT)
rate = 0.7 * MSS / (rtt * sqrt(p))
  • If we could halve the delay we could double throughput!
  • Most delay is caused by speed of light in fiber (~200 km/msec)
  • “Scenic routing” and fiber paths raise the minimum
  • Congestion (queueing) adds delay
146
Max Segment Size (MSS)
rate = 0.7 * MSS / (rtt * sqrt(p))
  • MSS = MTU – packet headers
  • Most often, MSS = 1448 or 1460
  • Jumbo frame => ~6x throughput increase
  • The Abilene and DREN WANs support MTU 9000 everywhere
    • Most sites still have a lot of work to do
147
DREN MTU Increase
  • IP Maximum Transmission Unit (MTU) on the DREN WAN is now 9146 bytes
  • After much debate, exactly 9000 is being used toward the sites with GigE interfaces
    • Sites choose 1500, 4470, or 9000; others by exception
  • Sites are encouraged to support 9000 on their GigE infrastructures
148
Impact of Jumbo, into Ohio
Jumbo enabled 23 July 2003
149
Impact of Jumbo, San Diego to DC
Jumbo enabled 17 July 2003
150
Packet Loss (p)
 rate = 0.7 * MSS / (rtt * sqrt(p))

  • Loss dominates throughput !
  • At least 6 orders of magnitude observed on the Internet
  • 100 Mbps throughput requires O(10^-6)
  • 1 Gbps throughput requires O(10^-8)
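The loss requirements above come from solving the throughput model for p. The sketch below assumes an MSS of 1460 bytes and a 70 ms cross-country rtt; both values and the function name are assumptions chosen for illustration.

```python
def max_loss_for_rate(target_bps: float, mss_bytes: int, rtt_s: float) -> float:
    """Largest packet loss rate that still allows target_bps.

    Solves rate = 0.7 * MSS / (rtt * sqrt(p)) for p.
    """
    return (0.7 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2


for rate in (100e6, 1e9):
    p = max_loss_for_rate(rate, 1460, 0.070)
    print(f"{rate / 1e6:6.0f} Mbps needs loss below about {p:.1e}")
```

Because p depends on the square of the rate, every 10x in target throughput demands 100x less loss.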
151
Loss Limits for 1 Gbps
152
Specifying Loss
  • TCP loss limits for 1 Gbps across country are O(10^-8), i.e. 0.000001% packet loss
    • About 1 “ping” packet every three years
    • Systems like AMP would never show loss
    • Try to get 10^-8 in writing from a provider!
    • Most providers won’t guarantee < 0.01%
153
Specifying Throughput
  • Require the provider to demonstrate TCP throughput
    • DREN contract requires ½ line rate TCP flow sustained for 10 minutes cross country (e.g. ~300 Mbps on OC12)
  • A low loss requirement comes with this!
154
Concerns About Bit Errors
  • Bit Error Rate (BER) specs for networking interfaces/circuits may not be low enough
    • E.g. 10^-12 BER => 10^-8 packet ER (1500 bytes)
    • 10 hops => 10^-7 packet drop rate
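The conversion from bit error rate to packet error rate can be sketched as follows, assuming independent bit errors, that any single bit error loses the whole packet, and that every hop has the same error rate. The function names are illustrative.

```python
import math


def packet_error_rate(ber: float, packet_bytes: int = 1500) -> float:
    """Probability that at least one bit of a packet is corrupted."""
    bits = packet_bytes * 8
    # 1 - (1 - ber)**bits, computed in a way that stays accurate for tiny ber.
    return -math.expm1(bits * math.log1p(-ber))


def path_loss_rate(per_hop_loss: float, hops: int) -> float:
    """Probability that a packet is lost on at least one of `hops` links."""
    return -math.expm1(hops * math.log1p(-per_hop_loss))


per_hop = packet_error_rate(1e-12)
print(f"{per_hop:.1e} per hop, {path_loss_rate(per_hop, 10):.1e} over 10 hops")
```

For small error rates this is very close to simply multiplying: BER times bits per packet, then times the number of hops.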
155
Example Bit Error Rate Specs
  • Hard Disk, 10^-14
    • One error in 10 TB
  • SONET Equipment, 10^-12
  • SONET Circuits, 10^-10
  • 1000Base-T, 10^-10
    • One bit error every 10 seconds at line rate
  • Copper T1, 10^-6
156
Is 10^-8 Packet Loss Reasonable?
  • Requires a bit error rate (BER) of 10^-12 or better!
  • Perhaps TCP is expecting too much from the network
  • Loss isn’t always congestion
  • Modify TCP’s congestion control
157
TCP Throughput Model Revisited
  • Original formula assumptions
    • constant packet loss rate
    • dominated by congestion from other flows
    • 6 x MTU provides 6 x throughput
  • HPC environment (low congestion)
    • Loss may be dominated by bit error rates
    • MSS becomes sqrt(MSS)
    • 6 x MTU provides 2.4 x throughput
158
More About TCP
  • Some high performance options
159
High Performance TCP Features
  • Window Scale
  • Timestamps
  • Selective Acknowledgement (SACK)
  • Path MTU Discovery (PMTUD)
  • Explicit Congestion Notification (ECN)
160
TCP Window Scale
  • 16-bit window, 65535 byte maximum
  • RFC1323 window scale option in SYN packets
  • Creates a 32-bit window by left shifting 0 to 14 bit positions
    • New max window is 1 GB (2^30)
    • Granularity: 1B, 2B, 4B, …, 16KB
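A sketch of how the shift count relates to the window a host wants to advertise, assuming the 16-bit field maximum of 65535 and the shift limit of 14 from the slide. The function name is illustrative.

```python
MAX_UNSCALED = 65535  # largest value of the 16-bit window field
MAX_SHIFT = 14        # largest window scale shift allowed


def window_scale_shift(window_bytes: int) -> int:
    """Smallest shift that lets the 16-bit field cover window_bytes."""
    for shift in range(MAX_SHIFT + 1):
        if (MAX_UNSCALED << shift) >= window_bytes:
            return shift
    raise ValueError("window is larger than a shift of 14 can advertise")


print(window_scale_shift(65535))            # fits without scaling
print(window_scale_shift(8 * 1024 * 1024))  # an 8 MB window
print(MAX_UNSCALED << MAX_SHIFT)            # largest advertisable window
```

The advertised window can only be a multiple of 2^shift bytes, which is the granularity listed above.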
161
TCP Timestamps
  • 12 byte option
    • drops MSS of 1460 to 1448
  • Allows better Round Trip Time Measurement (RTTM)
  • Prevention Against Wrapped Sequence numbers (PAWS)
    • 32 bit sequence number wraps in 17 sec at 1 Gbps
    • TCP assumes a Maximum Segment Lifetime (MSL) of 120 seconds
162
Selective ACK
  • TCP originally could only ACK the last in sequence byte received (cumulative ACK)
  • RFC2018 SACK allows ranges of bytes to be ACK’d
    • Sender can fill in holes without resending everything
    • Up to four blocks fit in the TCP option space (three with the timestamp option)
  • Surveys have shown that SACK implementations often have errors
    • RFC3517 addresses how to respond to SACK
163
Duplicate SACK
  • RFC2883 extended TCP SACK option to allow duplicate packets to be SACK’d
  • D-SACK is a SACK block that reports duplicates
  • Allows the sender to infer packet reordering
164
Forward ACK Algorithm
  • Forward ACK (FACK) is the forward-most segment acknowledged in a SACK
  • FACK TCP uses this to keep track of outstanding data
  • During fast recovery, keeps sending as long as outstanding data < cwnd
  • Generally a good performer - recommended
165
Path MTU Discovery
  • Probes for the largest end-to-end unfragmented packet
  • Uses don’t-fragment option bit, waits for must-fragment replies
  • Possible black holes
    • some devices will discard the large packets without sending a must-fragment reply
    • Problems are discussed in RFC2923
166
Explicit Congestion Notification (ECN)
  • RFC3168, Proposed Standard
  • Indicates congestion without packet discard
  • Uses last two bits of the IP Type of Service (TOS) field
  • Black hole warning
    • Some devices silently discard packets with these bits set
167
SYN Option Check Server
  • http://syntest.psc.edu:7961/
  • telnet syntest.psc.edu 7960
168
System Tuning
  • Interfaces, routes, buffers, etc.
169
Things You Can Do
  • Throw out your low speed interfaces and networks!


  • Make sure routes and DNS report high speed interfaces
  • Don’t over-utilize your links (<50%)
  • Use routers sparingly, host routers not at all
    • routed -q
170
Things You Can Do
  • Make sure your HPC apps offer sufficient receive windows and use sufficient send buffers
    • But don’t run your system out of memory
    • Find out the rtt with ping, compute BDP
    • Tune system wide, by application, or automatically
  • Check your TCP for high performance features
  • Look for sources of loss
    • Watch out for duplex problems
171
TCP Performance
  • Maximum TCP throughput is one window per round trip time
  • System default and maximum window sizes are usually too small for HPC
  • The #1 cause of slow TCP transfers!
172
Minimum TCP Window Sizes
173
Increase Your Max Window Size
Linux example
  • Command line:
    • sysctl -w net.core.rmem_max=8388608
    • sysctl -w net.core.wmem_max=8388608

  • In /etc/sysctl.conf  (makes it permanent)
    • net.core.rmem_max = 8388608
    • net.core.wmem_max = 8388608
174
Increase Your Default Window Size
Linux example
  • In /etc/sysctl.conf
    • net.ipv4.tcp_rmem = 4096 349520 4194304
    • net.ipv4.tcp_wmem = 4096  65536 4194304
  • Three values are: minimum, default, maximum for automatic buffer allocation on Linux
175
Application Buffer Sizes
  • Before doing a connect or accept
    • setsockopt(fd, SOL_SOCKET, SO_SNDBUF, …)
    • setsockopt(fd, SOL_SOCKET, SO_RCVBUF, …)
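The same calls are available from Python's standard socket module. This is a minimal sketch that sets both buffers before any connect; the 1 MiB size and the function name are illustrative assumptions, and the buffer should really be sized from the bandwidth*delay product of the path.

```python
import socket

BUF_BYTES = 1 << 20  # 1 MiB, an illustrative value only


def make_tuned_socket(buf_bytes: int = BUF_BYTES) -> socket.socket:
    """Create a TCP socket with larger send and receive buffers requested."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Request the buffers before connect() or listen() so that the window
    # scale negotiated in the SYN can reflect the larger receive buffer.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf_bytes)
    return sock


sock = make_tuned_socket()
# Read back what the system actually granted; it may differ from the request.
print("send buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("recv buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
sock.close()
```

Always read the value back: the system may silently cap the request at its configured maximum, and Linux reports a value larger than requested because it accounts for bookkeeping overhead.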
176
Dr. TCP
A TCP Stack Tuner for Windows
http://www.dslreports.com/front/drtcp.html
177
Tuning Other Systems
  • See “Enabling High Performance Data Transfers”
    • http://www.psc.edu/networking/projects/tcptune/
178
FTP Buffer Sizes
  • Many FTP’s allow the user to set buffer sizes
    • The commands are different everywhere!
  • kftp buffer size commands
    • lbufsize 8388608      (local)
    • rbufsize 8388608      (remote)
    • Other kftp commands
      • protect/cprotect (clear|safe|private)
      • passive – client opens the data channel connection
  • For other versions of FTP, see
    • http://dast.nlanr.net/Projects/FTP.html
179
vsftpd FTP server
  • Small, fast, secure
  • 2.x has SSL / TLS support
  • http://vsftpd.beasts.org/
180
GridFTP
  • FTP Protocol Extensions
    • SBUF and ABUF – set / auto buffer sizes
    • SPOR and SPAS – striped port / passive
181
Autobuf – An Auto-tuning FTP
http://dast.nlanr.net/Projects/Autobuf/
  • Measures the spread of a burst of ICMP Echo packets to estimate BDP, sets bufs
182
DREN Success Story
  • Rome NY to Maui HI, kftp
  • Before tuning: 3.2 Mbps
  • After tuning: 40 Mbps (line rate)


  • A very common story…
183
Good Tuning References
  • Users Guide to TCP Windows
    • www.ncsa.uiuc.edu/People/vwelch/net_perf/tcp_windows.html
  • TCP Tuning Guide
    • www-didc.lbl.gov/TCP-tuning/
  • Enabling High Performance Data Transfers on Hosts
    • www.psc.edu/networking/projects/tcptune/
  • WAN Tuning and Troubleshooting
    • www.internet2.edu/~shalunov/writing/tcp-perf.html
184
Throughput Tests
185
Treno Throughput Test
www.psc.edu/networking/treno_info.html
  • Tells you what a good TCP should be able to achieve (Bulk Transfer Capacity)
186
Treno Observations
  • Easy 10 second test, no remote access or receiver process required
  • Emulates TCP but doesn’t use TCP
    • Problems with host TCP or tuning are avoided
  • Does Path MTU Discovery
  • Reports rtt and loss rates
  • A zero equilibrium result means there was too much packet loss to exit “slow start”
187
Treno Observations
  • Can send ICMP (-i) or UDP (default)
    • ICMP replies (ECHO or UNREACH) could be blocked for “security”
  • Routers send ICMP replies very slowly
    • So don’t test routers with treno
  • ICMP is often rate limited now by hosts
  • Port numbers can wrap around (and look like port scans)
188
TCP Throughput Tests
  • ttcp – the original, many variations
    • http://sd.wareonearth.com/~phil/net/ttcp/
  • nuttcp – great successor to ttcp (recommended)
    • ftp://ftp.lcp.nrl.navy.mil/pub/nuttcp/
  • Iperf – great TCP/UDP tool (recommended)
    • http://dast.nlanr.net/Projects/Iperf/
  • netperf – dated but still in wide use
    • http://www.netperf.org/
  • ftp – nothing beats a real application
189
Throughput Testing Notes
  • Network data rates (bps) are powers of 10, not powers of 2 as used for Bytes
    • E.g. 100 Mbps ethernet is 100,000,000 bits/sec
    • Some tools wrongly use powers of 2 (e.g. ttcp)
  • User payload data rates are reported by tools
    • No TCP, IP, Ethernet, etc. headers are included
    • E.g. 100 Mbps ethernet max is 97.5293 Mbps
      • http://sd.wareonearth.com/~phil/net/overhead/
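  • A small worked calculation of where such a ceiling comes from. It assumes standard Ethernet framing overhead of 38 bytes per frame (preamble, header, FCS, inter-frame gap) and a 1500 byte MTU. Under those assumptions the 97.5293 figure is reproduced when only the Ethernet framing is removed; removing 20 byte IP and TCP headers as well gives a lower number. The function names are illustrative.

```python
# Per-frame Ethernet overhead in bytes: preamble+SFD (8), header (14),
# FCS (4) and inter-frame gap (12). Assumes no VLAN tag.
ETH_OVERHEAD = 8 + 14 + 4 + 12


def ip_rate_mbps(line_rate_mbps: float, mtu: int = 1500) -> float:
    """Rate left for IP packets once Ethernet framing is removed."""
    return line_rate_mbps * mtu / (mtu + ETH_OVERHEAD)


def tcp_payload_rate_mbps(line_rate_mbps: float, mtu: int = 1500,
                          ip_hdr: int = 20, tcp_hdr: int = 20) -> float:
    """Rate left for user data, assuming no IP or TCP options."""
    payload = mtu - ip_hdr - tcp_hdr
    return line_rate_mbps * payload / (mtu + ETH_OVERHEAD)


print(round(ip_rate_mbps(100), 4))           # 97.5293
print(round(tcp_payload_rate_mbps(100), 4))  # 94.9285
```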
190
nuttcp
  • My favorite TCP/UDP test tool
  • Get the latest .c file
    • ftp://ftp.lcp.nrl.navy.mil/pub/nuttcp/
  • Compile it:
    • cc -O3 -o nuttcp nuttcp-v5.1.11.c
  • Start a server:
    • nuttcp -S
    • Allows remote users to run tests to/from that host without accounts
191
A Permanent nuttcp Server
RedHat Linux example
  • Create a file /etc/xinetd.d/nuttcp


    • # default: off
    • # description: nuttcp
    • service nuttcp
    • {
    •         socket_type     = stream
    •         wait            = no
    •         user            = nobody
    •         server          = /usr/local/bin/nuttcp
    •         server_args     = -S
    •         disable         = no
    • }

192
Testing a Path
Example
  • San Diego CA to Washington DC
  • OC12 path
193
Traceroute
  • [phil@damp-ssc phil]$ nuttcp -xt damp-nrl
  • traceroute to damp-nrl-ge (138.18.23.37), 30 hops max, 38 byte packets
  •  1  ge-0-1-0.sandiego.dren.net (138.18.190.1)  0.255 ms  0.237 ms  0.238 ms
  •  2  so12-0-0-0.nrldc.dren.net (138.18.1.7)  72.185 ms  72.162 ms  72.164 ms
  •  3  damp-nrl-ge (138.18.23.37)  72.075 ms  72.069 ms  72.070 ms


  • traceroute to 138.18.190.5 (138.18.190.5), 30 hops max, 38 byte packets
  •  1  ge-0-1-0.nrldc.dren.net (138.18.23.33)  0.239 ms  0.199 ms  0.184 ms
  •  2  so12-0-0-0.sandiego.dren.net (138.18.4.21)  72.210 ms  72.979 ms  72.217 ms
  •  3  damp-ssc-ge (138.18.190.5)  72.087 ms  72.079 ms  72.068 ms


194
Check Path RTT and MTU
  • [phil@damp-ssc phil]$ ping damp-nrl
  • PING damp-nrl-ge (138.18.23.37) from 138.18.190.5 : 56(84) bytes of data.
  • 64 bytes from damp-nrl-ge (138.18.23.37): icmp_seq=1 ttl=62 time=72.0 ms
  • 64 bytes from damp-nrl-ge (138.18.23.37): icmp_seq=2 ttl=62 time=72.0 ms
  • 64 bytes from damp-nrl-ge (138.18.23.37): icmp_seq=3 ttl=62 time=72.0 ms
  • 64 bytes from damp-nrl-ge (138.18.23.37): icmp_seq=4 ttl=62 time=72.0 ms
  • 64 bytes from damp-nrl-ge (138.18.23.37): icmp_seq=5 ttl=62 time=72.0 ms


  • --- damp-nrl-ge ping statistics ---
  • 5 packets transmitted, 5 received, 0% loss, time 4035ms
  • rtt min/avg/max/mdev = 72.071/72.082/72.092/0.007 ms



  • [phil@damp-ssc phil]$ tracepath damp-nrl
  •  1?: [LOCALHOST]     pmtu 9000
  •  1:  ge-0-1-0.sandiego.dren.net (138.18.190.1)            asymm  2   1.060ms
  •  2:  so12-0-0-0.nrldc.dren.net (138.18.1.7)               asymm  3  72.905ms
  •  3:  damp-nrl-ge (138.18.23.37)                            72.747ms reached
  •      Resume: pmtu 9000 hops 3 back 3


195
Do the Math
  • Bandwidth * Delay Product
    • BDP = 622000000/8 * 0.072 = 5.6 MB
  • The TCP receive window needs to be at least this large
  • The sender buffer should be twice this size
  • We choose 8MB for both
    • Knowing that Linux will double it for us!
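  • The same arithmetic as a short Python sketch. Rounding up to a power of two is only an assumption about how one might arrive at a convenient size such as 8 MB; the helper names are illustrative.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth * delay product in bytes."""
    return bandwidth_bps / 8 * rtt_s


def round_up_pow2(n: float) -> int:
    """Smallest power of two that is >= n."""
    size = 1
    while size < n:
        size <<= 1
    return size


bdp = bdp_bytes(622_000_000, 0.072)   # OC12 path, 72 ms rtt
print(round(bdp / 1e6, 1))            # 5.6 (MB)
print(round_up_pow2(bdp))             # 8388608 (8 MB)
```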
196
Check Buffer Sizes
  • [phil@damp-ssc phil]$ nuttcp -v -t -T10 -w8m damp-nrl
  • nuttcp-t: v5.1.10: socket
  • nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> damp-nrl
  • nuttcp-t: time limit = 10.00 seconds
  • nuttcp-t: connect to 138.18.23.37
  • nuttcp-t: send window size = 16777216, receive window size = 103424
  • nuttcp-t: 608.0786 MB in 10.00 real seconds = 62288.32 KB/sec = 510.2659 Mbps
  • nuttcp-t: 9730 I/O calls, msec/call = 1.05, calls/sec = 973.33
  • nuttcp-t: 0.0user 0.9sys 0:10real 9% 0i+0d 0maxrss 2+0pf 0+0csw


  • nuttcp-r: v5.1.10: socket
  • nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
  • nuttcp-r: accept from 138.18.190.5
  • nuttcp-r: send window size = 103424, receive window size = 16777216
  • nuttcp-r: 608.0786 MB in 10.20 real seconds = 61039.55 KB/sec = 500.0360 Mbps
  • nuttcp-r: 71224 I/O calls, msec/call = 0.15, calls/sec = 6981.97
  • nuttcp-r: 0.0user 0.5sys 0:10real 5% 0i+0d 0maxrss 3+0pf 0+0csw


197
Things to Check
  • The receiver’s (nuttcp-r) receive buffer size
  • The transmitter’s (nuttcp-t) send buffer size
  • The receiver’s reported throughput
    • The transmitter number is often too high
  • The CPU utilization of both sender and receiver
    • Make sure you didn’t run out of CPU
198
Throughput Test (with -i)
  • [phil@damp-ssc phil]$ nuttcp -t -T10 -i1 -w8m damp-nrl
  •      1.5531 MB /   1.00 sec =   13.0935 Mbps
  •    44.3144 MB /   1.00 sec =  371.7978 Mbps
  •    67.9265 MB /   1.00 sec =  569.9192 Mbps
  •    67.5937 MB /   1.00 sec =  567.1195 Mbps
  •    67.5339 MB /   1.00 sec =  566.6178 Mbps
  •    67.5937 MB /   1.00 sec =  567.1184 Mbps
  •    67.6363 MB /   1.00 sec =  567.4661 Mbps
  •    67.5937 MB /   1.00 sec =  567.1195 Mbps
  •    67.7558 MB /   1.00 sec =  568.4850 Mbps
  •    67.8753 MB /   1.00 sec =  569.4834 Mbps


  •  601.0000 MB /  10.19 sec =  494.5603 Mbps 8 %TX 4 %RX
199
Reverse Throughput Test (-r)
  • [phil@damp-ssc phil]$ nuttcp -r -T10 -i1 -w8m damp-nrl
  •     2.2614 MB /   1.00 sec =   18.9842 Mbps
  •    48.4872 MB /   1.00 sec =  406.7883 Mbps
  •    38.9809 MB /   1.00 sec =  327.0350 Mbps
  •    29.8672 MB /   1.00 sec =  250.5731 Mbps
  •    51.3801 MB /   1.00 sec =  431.0547 Mbps
  •    50.1768 MB /   1.00 sec =  420.9644 Mbps
  •    24.6874 MB /   1.00 sec =  207.1180 Mbps
  •    15.0616 MB /   1.00 sec =  126.3602 Mbps
  •    15.8211 MB /   1.00 sec =  132.7315 Mbps
  •    16.5891 MB /   1.00 sec =  139.1766 Mbps


  •   303.6979 MB /  10.60 sec =  240.3490 Mbps 4 %TX 2 %RX
200
UDP Test (-u -l -R)
  • [phil@damp-ssc phil]$ nuttcp -t -u -l8000 -R500m -T10 -i1 -w8m damp-nrl
  •    59.2346 MB /   0.99 sec =  499.6586 Mbps     0 /  7764 ~drop/pkt  0.00 ~%loss
  •    59.6008 MB /   1.00 sec =  500.0570 Mbps     0 /  7812 ~drop/pkt  0.00 ~%loss
  •    59.5932 MB /   1.00 sec =  499.9935 Mbps     0 /  7811 ~drop/pkt  0.00 ~%loss
  •    59.5932 MB /   1.00 sec =  500.0000 Mbps     0 /  7811 ~drop/pkt  0.00 ~%loss
  •    59.5932 MB /   1.00 sec =  499.9950 Mbps     0 /  7811 ~drop/pkt  0.00 ~%loss
  •    59.5932 MB /   1.00 sec =  499.9905 Mbps     0 /  7811 ~drop/pkt  0.00 ~%loss
  •    59.5932 MB /   1.00 sec =  499.9925 Mbps     0 /  7811 ~drop/pkt  0.00 ~%loss
  •    59.5932 MB /   1.00 sec =  499.9950 Mbps     0 /  7811 ~drop/pkt  0.00 ~%loss
  •    59.5932 MB /   1.00 sec =  499.9930 Mbps     0 /  7811 ~drop/pkt  0.00 ~%loss
  •    59.5932 MB /   1.00 sec =  499.9920 Mbps     0 /  7811 ~drop/pkt  0.00 ~%loss


  •   595.9778 MB /  10.01 sec =  499.5061 Mbps 99 %TX 3 %RX 0 / 78116 drop/pkt 0.00 %loss
201
Reverse UDP Test
  • [phil@damp-ssc phil]$ nuttcp -r -u -l8000 -R500m -T10 -i1 -w8m damp-nrl
  •    56.8237 MB /   0.99 sec =  481.1408 Mbps     0 /  7448 ~drop/pkt  0.00 ~%loss
  •    59.5169 MB /   1.00 sec =  499.3279 Mbps     1 /  7802 ~drop/pkt  0.01 ~%loss
  •    59.2651 MB /   1.00 sec =  497.2097 Mbps     0 /  7768 ~drop/pkt  0.00 ~%loss
  •    59.0591 MB /   1.00 sec =  495.4825 Mbps     0 /  7741 ~drop/pkt  0.00 ~%loss
  •    58.9828 MB /   1.00 sec =  494.8439 Mbps     0 /  7731 ~drop/pkt  0.00 ~%loss
  •    60.7300 MB /   1.00 sec =  509.5001 Mbps     0 /  7960 ~drop/pkt  0.00 ~%loss
  •    59.4482 MB /   1.00 sec =  498.7424 Mbps     2 /  7794 ~drop/pkt  0.03 ~%loss
  •    59.7839 MB /   1.00 sec =  501.5677 Mbps     2 /  7838 ~drop/pkt  0.03 ~%loss
  •    58.8760 MB /   1.00 sec =  493.9463 Mbps     0 /  7717 ~drop/pkt  0.00 ~%loss
  •    59.8526 MB /   1.00 sec =  502.1347 Mbps     0 /  7845 ~drop/pkt  0.00 ~%loss


  •   595.8939 MB /  10.01 sec =  499.4702 Mbps 100 %TX 3 %RX 5 / 78110 drop/pkt 0.01 %loss
202
Is Loss the Limiter?
  • bps = 0.7 * MSS / (rtt * sqrt(loss))
  •        = 0.7 * 8948*8 / (0.072 * sqrt(5/78110))
  •        = 87 Mbps


  • Conclusion: Too much loss on DC to CA path
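  • The loss-limited estimate above, written as a Python function so other paths can be plugged in. The constant 0.7 and the inputs are taken from this slide; the function name is illustrative.

```python
import math


def loss_limited_bps(mss_bytes: int, rtt_s: float, loss: float,
                     c: float = 0.7) -> float:
    """Approximate TCP throughput ceiling: c * MSS / (rtt * sqrt(loss))."""
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


# Reverse UDP test: 5 drops out of 78110 packets, 72 ms rtt, 8948 byte MSS
rate = loss_limited_bps(8948, 0.072, 5 / 78110)
print(round(rate / 1e6))   # 87 (Mbps)
```

  • Because the estimate scales with 1/sqrt(loss), cutting the loss rate by a factor of 100 raises the ceiling by a factor of 10.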
203
Testing Notes
  • When UDP testing
    • Be careful of fragmentation (path MTU)
    • Most systems are better at TCP than UDP
    • Setting large socket buffers helps
    • Raising process priority helps
    • Duplex problems don’t show up with UDP!
  • Having debugged test points is critical
    • Local and remote
  • A performance history is valuable
204
Duplex Problem Symptoms
  • Will never see it with pings or UDP tests
  • TCP throughput is sometimes ½ of expected rate, sometimes under 1 Mbps
  • Worse performance usually results when the mismatch is near the receiver of a TCP flow
  • Sometimes reported as “late collisions”
205
Checking Systems for Errors
  • Don’t forget
    • ifconfig
    •         UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
    •         RX packets:414639041 errors:0 dropped:0 overruns:0 frame:0
    •         TX packets:378926231 errors:0 dropped:0 overruns:0 carrier:0
    •         collisions:0 txqueuelen:1000
    • netstat -s
      • Tcp: 59450 segments retransmited
    • dmesg or /var/log
      • ixgb: eth2 NIC Link is Up 10000 Mbps Full Duplex
206
Nuttcp vs. Iperf
  • Iperf is probably better for
    • Parallel stream reporting (-P)
    • Bi-directional tests (-d)
    • MSS control (-M) and reporting (-m)
  • Nuttcp is better at
    • Getting server results back to the client
    • Third party support
    • Traceroute (-xt)
    • Setting priority (-xP)
    • Undocumented instantaneous rate limit (-Ri)
207
Example Measurement Infrastructure
208
DREN Active Measurement Program (DAMP)
  • Grew out of the NLANR AMP project
    • http://amp.nlanr.net/
  • Long term active performance monitoring
    • Delay, loss, routes, throughput
    • Dedicated hardware; known performance
    • User accessible test points
  • Invaluable at identifying and debugging problems
209
DREN AMP Systems
210
10GigE Test Systems
  • Dell 2650, 3.0 GHz Xeon, 533 MHz FSB, 512 KB L2 cache, 1 MB L3 cache, PCI-X
    • Linux 2.4.26-web100
    • Two built-in BCM5703 10/100/1000 copper NICs, tg3 driver
    • Intel PXLA8590LR 10GigE NIC, ixgb 1.0.65 driver
  • 4.6 Gbps TCP on LAN after tuning (7.7 Gbps loopback)
    • 2.2 Gbps TCP from MD to MS over OC48
  • New NICs: Intel, S2io X-Frame, Chelsio T110/N110
  • PCI Express solution early 2005?
211
Precise Time
  • 10 msec NTP accuracy w/o clock
    • asymmetric paths and clock hopping
  • 10 usec NTP accuracy with CDMA clock
    • EndRun Technologies Praecis Ct ($1100)
    • Serial port, tagged events
  • Would love to see NTP nanokernel patch adopted by Linux!
212
AMP Tests on Demand
  • Web page interface
  • 10 second TCP throughput test
    • Uses nuttcp third party support
  • Can also access them directly
213
Need to test over a distance
  • Many local problems won’t be seen over a short round trip time (i.e. within your site!)
    • Window size won’t matter
    • TCP error recovery is extremely quick
  • Firewalls and screening routers (ACLs) could be part of the problem
214
Test Pairs
215
Problems We Have Found
  • Small windows on hosts
  • Duplex problems
  • Bad cables (patch panels, fiber, cat5)
  • Bad switches or router modules
  • Slow firewalls or routers
    • Excessive logging
  • WAN bit errors
  • Routing issues
  • Heat problems!
216
Advanced Debugging
  • TCP Traces and Testrig
217
TCP/IP Analysis Tools
  • tcpdump
    • www.tcpdump.org
  • ethereal - GUI tcpdump (protocol analyzer)
    • www.ethereal.com
  • tcptrace – stats/graphs of tcpdump data
    • www.tcptrace.org
  • testrig – tcpdump, tcptrace, xplot, etc.
    • www.ncne.nlanr.net/research/tcp/testrig/
218
“A Preconfigured TCP Test Rig”
219
TCP Connection Establishment
  • Three-way handshake
    • SYN, SYN+ACK, ACK
  • Use tcpdump, look for performance features
    •  window sizes, window scale, timestamps, MSS, SackOK, Don’t-Fragment (DF)
220
Tcpdump of TCP Handshake
221
Collecting A TCP Trace
  • tcpdump -p -w trace.out -s 100 host $desthost and port 5001 &
  • nuttcp -T10 -n200m -i1 -w10m $desthost
  • tcptrace -l -r trace.out
  • tcptrace -S trace.out         (or -G for all graphs)
  • xplot *.xpl
222
tcptrace -l -r
  • TCP connection 1:
  •         host a:        damp-nrl-ge:58310
  •         host b:        damp-ssc-ge:5001
  •         complete conn: no       (SYNs: 2)  (FINs: 0)
  •         first packet:  Wed Oct 13 17:30:30.272579 2004
  •         last packet:   Wed Oct 13 17:30:39.289916 2004
  •         elapsed time:  0:00:09.017336
  •         total packets: 36063
  •         filename:      nrl-ssc.trace
  •    a->b:                              b->a:
  •      total packets:         23473           total packets:         12590
  •      ack pkts sent:         23472           ack pkts sent:         12590
  •      pure acks sent:            1           pure acks sent:        12589
  •      sack pkts sent:            0           sack pkts sent:         1774
  •      dsack pkts sent:           0           dsack pkts sent:           0
  •      max sack blks/ack:         0           max sack blks/ack:         1
  •      unique bytes sent: 209714276           unique bytes sent:         0
  •      actual data pkts:      23471           actual data pkts:          0
  •      actual data bytes: 210018508           actual data bytes:         0
  •      rexmt data pkts:          34           rexmt data pkts:           0
  •      rexmt data bytes:     304232           rexmt data bytes:          0
  •      zwnd probe pkts:           0           zwnd probe pkts:           0
  •      zwnd probe bytes:          0           zwnd probe bytes:          0
  •      outoforder pkts:           0           outoforder pkts:           0
  •      pushed data pkts:        555           pushed data pkts:          0


223
tcptrace -l -r (cont.)
  •      SYN/FIN pkts sent:       1/0           SYN/FIN pkts sent:       1/0
  •      req 1323 ws/ts:          Y/Y           req 1323 ws/ts:          Y/Y
  •      adv wind scale:            7           adv wind scale:            7
  •      req sack:                  Y           req sack:                  Y
  •      sacks sent:                0           sacks sent:             1774
  •      urgent data pkts:          0 pkts      urgent data pkts:          0 pkts
  •      urgent data bytes:         0 bytes     urgent data bytes:         0 bytes
  •      mss requested:          8960 bytes     mss requested:          8960 bytes
  •      max segm size:          8948 bytes     max segm size:             0 bytes
  •      min segm size:          8948 bytes     min segm size:             0 bytes
  •      avg segm size:          8947 bytes     avg segm size:             0 bytes
  •      max win adv:           17920 bytes     max win adv:         8388480 bytes
  •      min win adv:           17920 bytes     min win adv:           17792 bytes
  •      zero win adv:              0 times     zero win adv:              0 times
  •      avg win adv:           17920 bytes     avg win adv:         8044910 bytes
  •      initial window:        17896 bytes     initial window:            0 bytes
  •      initial window:            2 pkts      initial window:            0 pkts
  •      ttl stream length:        NA           ttl stream length:        NA
  •      missed data:              NA           missed data:              NA
  •      truncated data:    209220494 bytes     truncated data:            0 bytes
  •      truncated packets:     23471 pkts      truncated packets:         0 pkts
  •      data xmit time:        8.945 secs      data xmit time:        0.000 secs
  •      idletime max:          141.4 ms        idletime max:           73.8 ms
  •      throughput:         23256786 Bps       throughput:                0 Bps
224
tcptrace -l -r (cont.)
  •      RTT samples:           10814           RTT samples:               1
  •      RTT min:                72.1 ms        RTT min:                 0.0 ms
  •      RTT max:               144.2 ms        RTT max:                 0.0 ms
  •      RTT avg:               102.1 ms        RTT avg:                 0.0 ms
  •      RTT stdev:              28.9 ms        RTT stdev:               0.0 ms


  •      RTT from 3WHS:          72.1 ms        RTT from 3WHS:           0.0 ms


  •      RTT full_sz smpls:     10813           RTT full_sz smpls:         1
  •      RTT full_sz min:        72.7 ms        RTT full_sz min:         0.0 ms
  •      RTT full_sz max:       144.2 ms        RTT full_sz max:         0.0 ms
  •      RTT full_sz avg:       102.1 ms        RTT full_sz avg:         0.0 ms
  •      RTT full_sz stdev:      28.9 ms        RTT full_sz stdev:       0.0 ms


  •      post-loss acks:            4           post-loss acks:            0
  •           For the following 5 RTT statistics, only ACKs for
  •           multiply-transmitted segments (ambiguous ACKs) were
  •           considered.  Times are taken from the last instance
  •           of a segment.
  •      ambiguous acks:           30           ambiguous acks:            0
  •      RTT min (last):         72.7 ms        RTT min (last):          0.0 ms
  •      RTT max (last):         74.9 ms        RTT max (last):          0.0 ms
  •      RTT avg (last):         73.3 ms        RTT avg (last):          0.0 ms
  •      RTT sdv (last):          0.8 ms        RTT sdv (last):          0.0 ms
  •      segs cum acked:        12508           segs cum acked:            0
  •      duplicate acks:         1742           duplicate acks:            0
  •      triple dupacks:            4           triple dupacks:            0
  •      max # retrans:             1           max # retrans:             0
  •      min retr time:          73.7 ms        min retr time:           0.0 ms
  •      max retr time:         140.2 ms        max retr time:           0.0 ms
  •      avg retr time:          89.2 ms        avg retr time:           0.0 ms
  •      sdv retr time:          10.3 ms        sdv retr time:           0.0 ms


225
TCP Startup
226
TCP Startup – Detail 1
227
TCP Startup – Detail 2
228
TCP Single Loss/Retransmit
229
TCP Sender Pause
230
Linux 2.4.26-web100 Example
231
Detail 1: Startup
232
Detail 2: Timeout
233
Detail 3: Single SACK Loss
234
Detail 4: Multiple SACK Loss
235
Normal TCP Scallops
236
A Little More Loss
237
Excessive Timeouts
238
Bad Window Behavior
239
Receiving Host/App Too Slow
240
Web100
www.web100.org
  • Set out to make 100 Mbps TCP common
  • “TCP knows what’s wrong with the network”
    • The sender side knows the most
  • Instruments the TCP stack for diagnostics
  • Enhanced TCP MIB (IETF Draft)
  • Linux 2.4 kernel patches + library and tools
  • /proc/web100 file system
    • e.g. /proc/web100/1010/{read,spec,spec-ascii,test,tune}
241
Web100 – Connection Selection
242
Web100 - Tool/Variable Selection
243
Web100 – Variable Display, Triage Chart
244
Network Diagnostic Tool (NDT)
  • http://e2epi.internet2.edu/ndt/
  • Java applet runs a test and uses Web100 stats to diagnose the end to end path
  • Developed by Richard Carlson, ANL
  • Try it out on damp-ssc (San Diego)
245
NDT Example
http://miranda.ctd.anl.gov:7123/
246
NDT Statistics
247
Iperf with Web100, Clean Link
248
Iperf with Web100, Lossy Link
249
Data Integrity
250
Corrupted Packets
  • V. Paxson
    • Claims most corruption caused by T1’s
    • 1 in 5000 packets corrupted
    • TCP checksum would then let about 1 in 300M packets pass with undetected corruption
      • Roughly two errors per terabyte
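  • A rough reconstruction of this arithmetic in Python. It assumes a corrupted packet slips past a 16-bit checksum with probability 1/65536 and, to turn packets into terabytes, an average packet size of 1500 bytes. The packet size is an assumption of this sketch, not a figure from the slide.

```python
def undetected_one_in(corrupt_one_in: int = 5000,
                      checksum_bits: int = 16) -> int:
    """Packets per undetected error, if 1 in corrupt_one_in is corrupted."""
    return corrupt_one_in * 2 ** checksum_bits


def undetected_per_terabyte(packet_bytes: int = 1500,
                            corrupt_one_in: int = 5000) -> float:
    """Expected undetected errors per 10^12 bytes transferred."""
    packets = 1e12 / packet_bytes
    return packets / undetected_one_in(corrupt_one_in)


print(undetected_one_in())                  # 327680000, i.e. about 1 in 300M
print(round(undetected_per_terabyte(), 1))  # 2.0
```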
251
More Corrupted Packets
  • J. Stone, C. Partridge, SIGCOMM 2000
    • “Traces of Internet packets over the last two years show that 1 in 30,000 packets fails the TCP checksum, even on links where link-level CRCs should catch all but 1 in 4 billion errors.”
    • DMA transfers account for many of these errors
    • Conclude that 1 in 200M TCP packets pass with undetected errors
252
Data Integrity
  • How do you know that your transferred copy is correct?
  • A simple approach would be to copy it two (or more) times and compare the results
  • But you would like to check it with something smaller than the entire dataset
253
Checksums, CRC’s, Hashes
  • All map large blocks of data into smaller keys
  • Checksums are simple algebraic sums
  • Cyclic Redundancy Checks (CRC’s) are good at detecting multiple bit errors
  • Hashes are good at randomizing the key space
    • Good for database lookups, sorting, etc.
    • Cryptographic hashes are good for data integrity checks
254
Simple Checksums
  • Can not detect several kinds of errors
    • Reordering of bytes
    • Inserting or deleting zero valued bytes
    • Multiple errors that cancel (same sum)
255
The Internet Checksum
  • 16 bits long, summed 16 bits at a time
  • For UDP/TCP, includes a 12 byte “pseudo header” from the IP layer with the network addresses and payload length
  • RFC 1071 – Computing the Internet Checksum
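  • A minimal Python sketch of the 16-bit one's complement sum described in RFC 1071, applied to a plain byte string. It leaves out the UDP/TCP pseudo header, and the function name is illustrative. The second call shows the reordering weakness of simple sums: swapping 16-bit words leaves the result unchanged.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of data."""
    if len(data) % 2:
        data += b"\x00"              # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:               # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


words = bytes.fromhex("0001f203f4f5f6f7")     # example bytes from RFC 1071
swapped = bytes.fromhex("f2030001f6f7f4f5")   # same words, different order
print(hex(internet_checksum(words)))          # 0x220d
print(internet_checksum(words) == internet_checksum(swapped))   # True
```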
256
IP Data Protection
  • IPv4 uses the 16-bit Internet Checksum over the header information only
    • This checksum is recalculated/rewritten hop by hop, since e.g. the TTL changes
  • IPv6 has none, not even over the header
    • Assumes layer 2 will protect it
257
IPv4 and IPv6 Headers
258
UDP Data Protection
  • 16-bit Internet Checksum covers the UDP header and payload
  • Optional for IPv4 UDP
  • Required for IPv6 UDP
259
TCP Data Protection
  • 16-bit Internet Checksum over the entire segment
  • Roughly every 65536th corrupted segment goes undetected
260
The Danger Of Offloading
  • Many NICs today support TCP Checksum offloading
  • Studies have shown many errors occur during DMA transfers
    • Offloading would miss these
  • Nothing beats reading the data back from disk, e.g. md5sum
261
Ethernet Data Protection
  • IEEE 802 CRC32
  • Strong error detection for 1500 byte packets
  • Detects all triple bit errors up to 11455 bytes
  • Roughly 1 in 4 billion errors go undetected, but strength is non-uniform
  • http://www.cse.ohio-state.edu/%7Ejain/papers/xie1.htm
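  • For comparison with simple sums, a short Python sketch using zlib.crc32, which computes a CRC-32 with the same generator polynomial as the IEEE 802 frame check. Details of how Ethernet hardware orders bits on the wire are ignored here, and the wrapper name is illustrative.

```python
import zlib


def crc32_ieee(data: bytes) -> int:
    """CRC-32 of data as an unsigned 32-bit value."""
    return zlib.crc32(data) & 0xFFFFFFFF


print(hex(crc32_ieee(b"123456789")))   # 0xcbf43926, the standard check value

# Swapping two 16-bit words goes unnoticed by a 16-bit sum,
# but changes the CRC.
a = bytes.fromhex("0001f203")
b = bytes.fromhex("f2030001")
print(crc32_ieee(a) != crc32_ieee(b))  # True
```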
262
Cryptographic Hashes
  • One-way functions
    • Easy to compute, nearly impossible to invert
  • Collision free
    • Infeasible to find two inputs that result in the same output
  • The workhorses of modern cryptography
263
Cryptographic Hash History
  • 1990, MD4, Ron Rivest
  • 1992, MD5
  • 1993, SHA, NSA (a.k.a. SHA-0)
  • 1995, SHA-1
  • 2002, SHA-256, SHA-384, SHA-512
  • 2004, SHA-224
  • 2010 (planned), NIST phases out SHA-1
264
Common Hashes
  • MD5 – Message Digest algorithm #5
    • 128 bit hash, RFC1321 (April 1992)
    • md5sum --check md5sums
  • SHA-1 – Secure Hash Algorithm #1
    • 160 bit hash
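  • A sketch of what md5sum does, using Python's hashlib and reading the file back from disk in chunks so large datasets do not need to fit in memory. The function name and the 1 MB chunk size are arbitrary choices for illustration.

```python
import hashlib


def file_digest(path: str, algorithm: str = "md5",
                chunk_size: int = 1 << 20) -> str:
    """Hex digest of a file, read in chunk_size pieces."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

  • Run it on both ends of a transfer, e.g. file_digest("/tmp/file", "sha1"), and compare the two strings.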
265
Stronger Hashes
  • Recent work has shown some weakness in MD5 and SHA-0 (Crypto2004 conference)
  • NIST plans to phase out SHA-1 by 2010
  • NIST FIPS 180-2, Secure Hash Standard, Aug 2002, also includes:
    • SHA-224, SHA-256, SHA-384, SHA-512
    • http://csrc.nist.gov/CryptoToolkit/tkhash.html
266
HMAC
  • Keyed-Hash Message Authentication Codes
  • Combines a secret key and a cryptographic hash
  • HMAC-SHA1, HMAC-MD5, HMAC-RIPEMD
  • RFC2104, Feb 1997, ANS X9.71
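  • A minimal HMAC sketch with Python's standard hmac module. The helper names are illustrative, and the key and message in the usage lines are the widely published "Jefe" test inputs rather than anything from this tutorial.

```python
import hmac


def make_tag(key: bytes, message: bytes, digest: str = "sha1") -> str:
    """Hex HMAC of message under key, using the named hash."""
    return hmac.new(key, message, digest).hexdigest()


def verify_tag(key: bytes, message: bytes, tag: str,
               digest: str = "sha1") -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(make_tag(key, message, digest), tag)


tag = make_tag(b"Jefe", b"what do ya want for nothing?")
print(verify_tag(b"Jefe", b"what do ya want for nothing?", tag))  # True
print(verify_tag(b"Jefe", b"what do ya want for nothing!", tag))  # False
```

  • Unlike a bare hash, the tag cannot be recomputed by someone who alters the data without knowing the key.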
267
Advanced Encryption Standard (AES)
  • AES (Rijndael) data encryption/decryption
  • Federal Information Processing Standards Publication, FIPS-197
  • 128, 192, 256 bit keys, 128 bit data blocks
  • Can be used as a Message Authentication Code (MAC)
    • AES-XCBC-MAC-96, RFC 3566, Sep 2003
  • scp defaults to aes128-cbc and hmac-md5
268
Encryption/Hash Speeds
  • “openssl speed” OpenSSL 0.9.7d, 2.6 GHz Xeon, 400 MHz FSB, 512KB L2 cache


  • type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
  • md2               2335.38k     5298.12k     7781.61k     8827.85k     9161.67k
  • mdc2              6480.09k     7374.90k     7717.70k     7677.92k     7790.82k
  • md4              19268.23k    63363.31k   173880.13k   303676.32k   391984.43k
  • md5              16325.52k    54524.02k   143885.24k   246730.38k   311775.74k
  • hmac(md5)        17483.15k    58603.05k   151600.95k   252513.14k   313778.51k
  • sha1             15804.80k    51292.97k   126032.04k   199600.09k   240482.89k
  • rmd160           11629.85k    31237.76k    62984.82k    84785.10k    93663.74k
  • rc4              84407.69k    93179.20k    95925.54k    96411.00k    96438.66k
  • des cbc          43456.93k    43846.81k    43784.50k    43868.37k    43858.06k
  • des ede3         17158.70k    17168.26k    17128.48k    17191.63k    17186.59k
  • idea cbc         24765.82k    25599.68k    25818.65k    25979.85k    25950.68k
  • rc2 cbc          24287.21k    24957.35k    25301.88k    25439.74k    25317.71k
  • rc5-32/12 cbc    91339.13k    89333.21k    90386.24k    90881.38k    90675.90k
  • blowfish cbc     77119.43k    78330.17k    78255.31k    78479.08k    77552.78k
  • cast cbc         48512.65k    50893.86k    51452.11k    51747.77k    51385.43k
  • aes-128 cbc      70686.37k    65725.38k    65752.90k    65180.02k    64708.38k
  • aes-192 cbc      55336.89k    57739.22k    58446.44k    57441.14k    57590.31k
  • aes-256 cbc      56045.36k    52542.96k    52584.07k    51628.97k    51985.99k


269
Local Transfer Examples
  • 744 MB file, 2.6 GHz Xeon
    • Disk read, loopback interface, no disk write
    • both encrypt and decrypt for scp
  • wget -O /dev/null ftp://localhost/tmp/file
    • 0:13 sec, 455 Mbps
  • scp localhost:/tmp/file /dev/null
    • AES-128 (default), 0:49 sec, 121 Mbps
    • 3DES, 1:53, 53 Mbps
    • Blowfish, 0:39, 153 Mbps
270
Remote Transfer Examples
  • Aberdeen, MD to San Diego, CA, 744 MB
  • scp 14 minutes
    •  compared to 49 seconds on localhost
  • wget v1.9.1, 4 minutes
    • from vsftpd 1.2.2 server at Aberdeen
  • wget --passive-ftp, 15 seconds
    • wget with large window modification
271
Summary: Layers of Protection
  • Application: Hashes
  • Transport: TCP/UDP checksum
    • Optional: SSL/TLS
  • Network: IP header checksum
    • Optional: IPsec
  • Link: Ethernet CRC32 or POS FCS-32
  • Physical: symbols, ECC, FEC
272
TCP Security
273
TCP Security – Sequence Numbers
  • Connections identified by
    • (src addr, src port, dst addr, dst port)
  • 32 bit sequence numbers (byte counts)
    • One for each direction
    • Random starting value for each connection
274
TCP Security Proposals
  • Ephemeral Port Randomization
    • 16 bit port space subset
  • SYN Cookies
  • Timestamps as Cookies
  • Tighter sequence number checking
275
Guessing A Valid Seq Number
  • Today we only need one within the window
    • 2^32 / window_size
    • E.g. 32 KB window = 128 k guesses
    • Even at DSL rates, can reset a connection in about a minute
  • Proposal to require an exact match, i.e. the next number
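  • The guess count above as a short Python sketch. The time estimate assumes the attacker already knows the address and port 4-tuple, sends bare 40 byte segments, and has 1 Mbps of upstream bandwidth; these three values are assumptions for illustration, not figures from the slide.

```python
def guesses_needed(window_bytes: int, seq_bits: int = 32) -> int:
    """Guesses that cover the whole sequence space, one per window."""
    return 2 ** seq_bits // window_bytes


def seconds_to_cover(window_bytes: int, link_bps: float,
                     packet_bytes: int = 40) -> float:
    """Time to send one guess per window at the given link rate."""
    return guesses_needed(window_bytes) * packet_bytes * 8 / link_bps


print(guesses_needed(32 * 1024))                         # 131072 (128 k)
print(round(seconds_to_cover(32 * 1024, 1_000_000), 1))  # 41.9 seconds
```

  • Larger windows, which high speed transfers need, make the attack proportionally cheaper.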
276
TCP Security
  • Blind reset attack, RST bit
    • If we receive a RST within the window, close our connection
  • Blind reset attack, SYN bit
    • If we receive a SYN within the window of an open connection, send a RST
  • Blind data injection
    • Guessing a valid sequence number and ACK lets us inject bogus data segments
  • draft-ietf-tcpm-tcpsecure-01.txt
277
TCP Security
  • Blind ACK DoS attack
    • ACK segments not yet received
    • Client can cause server to saturate its connection by sending too fast
    • draft-azcorra-tcpm-tcp-blind-ack-dos-01.txt
  • Related: ACK every byte, to open sender’s window faster
    • Addressed by “Appropriate Byte Counting”
278
TCP Security
  • TCP MD5 Signature Option, RFC2385
    • Manually configured shared keys
    • Often used between BGP peers
    • RFC3562 provides guidance on keys
  • IPsec
    • Layer 3 solution
  • SSL / TLS
    • Layer 5 “solution”
279
IPsec History
  • IPsec work began in the early 1990s
  • Three revisions
    • 1995: RFC 1825-1829 - no key management
    • 1998: RFC 2401-2412 - IKE v1 key management
    • Current: RFC TBD - IKE v2 key management
  • At least 20 related RFC’s
280
IPsec Security Services
  • Authentication Header (AH)
    • Authentication of data source
    • Integrity of message content and ordering
    • Replay Detection
    • Access Control (i.e. packet filtering)
  • Encapsulating Security Payload (ESP)
    • Above plus encryption of packet data
    • Limited traffic flow confidentiality
281
IPsec Transport and Tunnel Modes
  • Transport Mode
    • AH and/or ESP is placed between IP header and next higher level header
      • IP-TCP  ->  IP-AH-TCP
      • IP-TCP  ->  IP-ESP-TCP
    • Can only be used host-to-host
  • Tunnel Mode
    • A new IP header is created
      • IP-UDP  ->  IP’-AH-IP-UDP
      • IP-UDP  ->  IP’-ESP-IP-UDP
    • Common way to implement VPNs
282
IPsec: Host-to-Host Modes
283
Host-to-Gateway Tunnel Mode
284
Gateway-to-Gateway Tunnel Mode
285
IPsec Support
  • Native support for IPsec is required for IPv6
    • That’s where “IPv6 is secure” comes from
    • But many vendors don’t support it yet
  • For Linux, see
    • www.openswan.org
    • www.strongswan.org
    • www.linux-ipv6.org
286
Going Faster
  • Cheating Today, Improving TCP Tomorrow
287
Parallel TCP Streams
  • PSockets (Parallel Sockets library)
    • http://citeseer.nj.nec.com/386275.html
  • GridFTP – enhanced parallel wu-ftpd
    • http://www.globus.org/datagrid/gridftp.html
  • bbFTP – parallel ‘ftp’ uses ssh or GSI
    • http://doc.in2p3.fr/bbftp/
  • MulTCP – a TCP that acts like N TCP’s
    • UCL
288
Parallel TCP Stream Performance
289
Parallel TCP Streams
Throughput and RTT by Window Size
290
TCP Keeps Evolving
  • TCP, RFC 793, Sep 1981
  • Reno, BSD, 1990
  • Path MTU Discovery, RFC 1191, Nov 1990
  • Window Scale, PAWS, RFC 1323, May 1992
  • SACK, RFC 2018, Oct 1996
  • NewReno, RFC 2582, April 1999
  • D-SACK, RFC 2883, July 2000
  • RTO Calculation, RFC 2988, Nov 2000
  • ECN, RFC 3168, Sep 2001
  • Limited Transmit, RFC 3042, Jan 2001
  • Increased Initial Windows, RFC 3390, Oct 2002
  • SACK loss recovery, RFC 3517, Apr 2003
291
TCP Reno
  • Most modern TCP’s are “Reno” based, from the 1990 BSD Reno Unix release
  • Reno defined (refined) four key mechanisms
    • Slow Start
    • Congestion Avoidance
    • Fast Retransmit
    • Fast Recovery
  • NewReno refined fast retransmit/recovery when non-SACK partial acknowledgements are available
    • Proposed Standard, RFC3782, Apr 2004
292
The Future of TCP/IP
  • Different retransmit/recovery schemes
    • TCP Tahoe, Vegas, Peach, Westwood, …
  • Pacing - removing burstiness by spreading the packets over a round trip time
    • Vegas, FAST, BLUE
  • Autotuning windows and buffer space
  • Modifications to prevent “cheating”
  • Kick-starting TCP after timeouts
293
Increased Initial Windows
  • Allows ~4KB initial window rather than one or two segments
    • min(4*MSS, max(2*MSS, 4380 bytes))
  • RFC 3390, Oct 2002, Proposed Standard
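  • The formula above evaluated for a few segment sizes, as a quick Python check. The function name is illustrative.

```python
def initial_window_bytes(mss: int) -> int:
    """Upper bound on the initial window: min(4*MSS, max(2*MSS, 4380))."""
    return min(4 * mss, max(2 * mss, 4380))


for mss in (536, 1460, 8948):
    iw = initial_window_bytes(mss)
    print(mss, iw, iw // mss)   # MSS, window in bytes, whole segments
# 536 2144 4
# 1460 4380 3
# 8948 17896 2
```

  • Small segments get four, standard Ethernet segments three, and jumbo frame segments two.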
294
Appropriate Byte Counting
  • When an ACK is received, increase cwin based on the number of new bytes ACK’d
  • Prevents receiver from “cheating” and making the sender open cwin too quickly
    • e.g. receiver ACKs every byte
  • Increases by at most 2*MSS bytes per ACK
    • To avoid bursts when one ACK covers a huge number of bytes
  • RFC 3465, Feb 2003
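  • A toy Python comparison of per-ACK counting and byte counting during slow start, following the bullets above. The function name is illustrative, and the model ignores everything except the cwin increase.

```python
def cwin_growth(ack_sizes, mss: int, abc: bool = True, limit: int = 2) -> int:
    """Total cwin increase in bytes for a sequence of ACKs in slow start.

    ack_sizes lists the number of new bytes each ACK acknowledges.
    """
    growth = 0
    for acked in ack_sizes:
        if abc:
            # Byte counting: credit the bytes ACKed, capped at limit * MSS
            growth += min(acked, limit * mss)
        else:
            # Classic behavior: one MSS per ACK, however little it covers
            growth += mss
    return growth


mss = 1460
one_byte_acks = [1] * mss                           # ACK every byte of a segment
print(cwin_growth(one_byte_acks, mss, abc=False))   # 2131600
print(cwin_growth(one_byte_acks, mss, abc=True))    # 1460
print(cwin_growth([10 * mss], mss))                 # 2920, capped at 2*MSS
```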
295
Limited Slow-Start
  • In slow-start, the congestion window (cwin) doubles each round trip time
  • For large cwins, this doubling can cause massive packet loss (and network load)
  • Limited slow-start adds max_ssthresh (proposed value of 100 MSS)
  • Above max_ssthresh cwin opens slower, never bursts more than 100 MSS
    • Per ACK: cwin += (0.5*max_ssthresh/cwin) * MSS
  • RFC 3742, Apr 2004
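The effect of the formula above can be seen by applying it ACK by ACK over one round trip. This sketch works in units of segments and assumes one ACK per segment (no delayed ACKs). It uses the continuous form shown on the slide rather than the integer arithmetic of the RFC, and the function names are illustrative.

```python
def per_ack_increase(cwnd: float, max_ssthresh: float = 100.0) -> float:
    """cwnd growth in segments for one ACK during slow-start."""
    if cwnd <= max_ssthresh:
        return 1.0                       # standard slow-start: +1 MSS per ACK
    return 0.5 * max_ssthresh / cwnd     # limited slow-start


def one_round_trip(cwnd: float, max_ssthresh: float = 100.0) -> float:
    """Apply the per-ACK rule for every segment outstanding at the start of the RTT."""
    for _ in range(int(cwnd)):
        cwnd += per_ack_increase(cwnd, max_ssthresh)
    return cwnd


if __name__ == "__main__":
    cwnd = 1.0
    for rtt in range(1, 16):
        cwnd = one_round_trip(cwnd)
        print(f"after RTT {rtt:2d}: cwnd = {cwnd:8.1f} segments")
```

Below max_ssthresh the window still doubles each round trip. Above it, the window grows by a little under max_ssthresh/2 segments per round trip instead of doubling.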
296
Limited Slow-Start Example
297
Quick-Start
draft-amit-quick-start-01.txt
  • IP option in the TCP SYN specifies desired initial sending rate
    • Routers on the path decrement a TTL counter and decrease initial sending rate if necessary
  • If all routers participated, receiver tells the sender the initial rate in the SYN+ACK pkt
  • The sender can set cwin based on the rtt of the SYN and SYN+ACK packets
298
Experimental TCP Modifications
  • TCP Westwood
    • Rate Estimation from ACK stream (cwnd = RE x RTTmin)
  • HighSpeed TCP (HSTCP)
    • Scaled AIMD parameters
  • Scalable TCP (STCP)
  • Binary Increase Congestion Control (BIC/CUBIC)
  • TCP Vegas / FAST TCP
    • Queuing delay based congestion control
299
HighSpeed TCP
www.icir.org/floyd/hstcp.html
  • Changes the AIMD parameters
  • Identical to standard TCP for loss rates above 10^-3 for fairness (cwin <= 38)
  • Allows cwin to reach 83000 segments for 10^-7 loss rates
    • Good for 10 Gbps over 100 msec rtt
    • Std TCP would be limited to ~440 Mbps
  • RFC 3649, Dec 2003
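The numbers above can be checked with the standard TCP response function w = 1.2/sqrt(p), which gives the average window in segments for a loss rate p. This is a rough sketch; the 1500-byte packet size is an assumption, and the function names are illustrative.

```python
import math


def standard_tcp_window(loss_rate: float) -> float:
    """Average congestion window (segments) of standard TCP at a given loss rate."""
    return 1.2 / math.sqrt(loss_rate)


def throughput_mbps(window_segments: float, packet_bytes: int, rtt_s: float) -> float:
    """One window of data delivered per round trip, in Mbps."""
    return window_segments * packet_bytes * 8 / rtt_s / 1e6


if __name__ == "__main__":
    rtt = 0.100      # 100 msec, as on the slide
    packet = 1500    # assumed packet size in bytes
    print("std TCP window at 1e-3 loss:", round(standard_tcp_window(1e-3)), "segments")
    print("std TCP at 1e-7 loss: %.0f Mbps"
          % throughput_mbps(standard_tcp_window(1e-7), packet, rtt))
    print("HSTCP with 83000 segments: %.0f Mbps"
          % throughput_mbps(83000, packet, rtt))
```

Under these assumptions standard TCP lands in the mid 400s of Mbps and an 83000-segment window comes to just under 10 Gbps, consistent with the figures quoted above.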
300
HighSpeed TCP: response
301
HighSpeed TCP: fairness
302
Standard vs. HighSpeed TCP
303
Cwin Opening Comparison
304
Scalable TCP (STCP)
  • Loss recovery time is independent of sending rate
  • http://www-lce.eng.cam.ac.uk/~ctk21/scalable/
305
FAST TCP
  • Delay based congestion control
    • “multi-bit” feedback vs. binary loss signal
    • Avoids oscillations of cwnd and unnecessary loss
  • http://netlab.caltech.edu/FAST/
306
UDP Transfer Protocols
307
Why Not TCP?
  • Slow adaptation after loss
    • 0 to 500 Mbps takes
      • 36 seconds, mtu=9000, rtt=72msec (DC to San Diego)
      • 15 minutes, mtu=1500, rtt=150msec (DC to Maui)
    • Short flow throughput is determined by slow start
  • Loss != congestion
  • Large TCP queues increase latency
  • End users never tune their systems
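The recovery times quoted above follow from congestion avoidance adding about one segment per round trip. The sketch below is a back-of-the-envelope estimate. It assumes a 40-byte TCP/IP header overhead, ignores slow start and delayed ACKs, and uses an illustrative function name.

```python
def linear_ramp_seconds(target_bps: float, mtu_bytes: int, rtt_s: float,
                        header_bytes: int = 40) -> float:
    """Time for cwnd to grow from ~0 to the window needed for target_bps,
    gaining one MSS per round trip."""
    mss = mtu_bytes - header_bytes
    window_segments = target_bps * rtt_s / (8 * mss)  # bandwidth*delay in segments
    return window_segments * rtt_s                     # one segment gained per RTT


if __name__ == "__main__":
    t1 = linear_ramp_seconds(500e6, 9000, 0.072)
    t2 = linear_ramp_seconds(500e6, 1500, 0.150)
    print(f"mtu=9000, rtt=72 msec : {t1:.0f} seconds")
    print(f"mtu=1500, rtt=150 msec: {t2 / 60:.0f} minutes")
```

The estimate gives roughly 36 seconds for the jumbo-frame path and roughly a quarter of an hour for the 1500-byte path. The time scales with the square of the rtt and inversely with the packet size.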
308
NETBLT
  • RFC969, 1985
  • Block transfer protocol (not streaming)
    • Send a block at predetermined rate
    • Wait for lost packet list
    • Resend those, etc.
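The send / report-losses / resend cycle above can be sketched as a small simulation. This is not the NETBLT wire protocol: the lossy channel is modelled with a seeded random number generator, rate control is omitted, and all names are illustrative.

```python
import random


def transfer_block(packets, loss_rate, rng):
    """Deliver one block over a simulated lossy channel.

    Each round sends every packet still missing, then the receiver's
    lost-packet list drives the next round. loss_rate must be < 1.
    Returns the reassembled block and the number of rounds used.
    """
    received = {}
    missing = list(range(len(packets)))
    rounds = 0
    while missing:
        rounds += 1
        for seq in missing:
            if rng.random() >= loss_rate:      # packet survived the channel
                received[seq] = packets[seq]
        # Receiver reports what it still lacks.
        missing = [seq for seq in missing if seq not in received]
    block = [received[i] for i in range(len(packets))]
    return block, rounds


if __name__ == "__main__":
    data = [bytes([i]) * 1000 for i in range(64)]
    block, rounds = transfer_block(data, loss_rate=0.1, rng=random.Random(2004))
    print("block intact:", block == data, "after", rounds, "round(s)")
```

Because the sender only reacts to the lost-packet list at the end of each round, throughput depends on the chosen rate and the loss rate rather than on a per-packet ACK clock.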
309
RBUDP
  • Reliable Blast UDP
  • http://www.evl.uic.edu/cavern/quanta
  • Similar to NETBLT but in active development
310
Tsunami
  • http://www.anml.iu.edu/anmlresearch.html
  • Last release, Dec 2002
  • UDP data, TCP control, rate adaptive
    • Loss rate controls sending rate
  • File transfer protocol, no API
  • Transferred 1 TB of data at ~1 Gbps over a 12000 km “light path” (Vancouver to Geneva), Sep 2002
  • Was created because TCP over that path was getting only 10’s to 100’s of Mbps!
311
SABUL
  • Simple Available Bandwidth Utilization Library
  • National Center for Data Mining (NCDM) at UIC (University of Illinois at Chicago)
  • http://www.dataspaceweb.net/sabul.htm
  • 2000-2003, now in third generation
  • UDP data, TCP control, rate adaptive
  • Streaming protocol, window + AIMD rate control (not rtt dependent)
  • Includes FTP like application, API
312
UDT
  • UDP-based Data Transport
  • http://sourceforge.net/projects/dataspace
  • UDP for data and control
  • Grew out of SABUL work, 2003+
313
UDP Protocol Security!
  • Session hijacking, corruption, encryption
    • Learn something from 802.11 wireless?
314
TCP Revisited
  • “Anything you can do I can do too”
    • It’s just algorithms (e.g. congestion control)
  • The real UDP advantage:
    • User space implementations, i.e. easier experimentation and deployment!
    • But user space is not as efficient
315
SCTP
  • Proposed Standard transport protocol
    • Peer of TCP and UDP
  • Message oriented vs. TCP’s byte stream
  • Multi-streaming
  • Multi-homing
  • Cookie based DoS protection
  • RFC2960, RFC3309, RFC3286
  • http://www.sctp.org/
316
SCTP
  • Allows out of order delivery
    • Each message can be marked as ordered or not
    • Avoids TCP’s head-of-line blocking
  • CRC32c data protection
    • Changed from Adler-32 checksum
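For reference, CRC32c (the Castagnoli polynomial) can be computed with a short bitwise routine. This is a slow, minimal sketch for illustration. It does not cover how SCTP places the value in the packet header, and production code would use a table-driven or hardware implementation.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC32c: reflected polynomial 0x82F63B78, initial value and final XOR of all ones."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0x82F63B78
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    # "123456789" is the conventional check string for CRC algorithms.
    print(hex(crc32c(b"123456789")))
```

The standard check value for the string "123456789" is 0xE3069283, which is a convenient way to validate any CRC32c implementation.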
317
Remote Direct Data Placement
  • RDDP, a.k.a. Remote DMA
  • Something SCTP is better at than TCP
    • Easy for the sender (“zero copy”)
    • Hard for the receiver
  • Marker PDU Aligned framing for TCP (MPA)
318
Is TCP The Right Choice?
  • TCP assures 100% in-order delivery
  • If your app can handle out-of-order delivery, try SCTP
  • If your app can handle some loss, use something that isn’t 100% reliable
    • TCP stalls are sometimes worse than a loss
319
Quality of Service (QoS)
320
What is QoS?
  • Quality of Service
    • “The collective effect of service performances which determines the degree of satisfaction of the user” (ITU)
  • A user perception of how good the service is
321
Why Improve QoS?
  • Performance Guarantees
    • SLA’s, VPN’s, Voice, Real-time, etc.
  • TCP builds large queues
    • Increases delay, can cause burst losses
  • Circuit emulation or ATM QoS over MPLS
    • Preservation of expected behavior
  • Critical services
    • DoS protection
322
How Do You Improve QoS?
  • Improve the entire network (egalitarian)
    • larger pipes, congestion control (RED, ECN), traffic engineering (MPLS)
  • Discriminate flows into service classes
    • Can be signaled/reserved, or marked
    • All members of a class get the same type of service
323
Classes of Service (CoS)
  • Signaled
    • ATM
      • CBR, VBR, ABR, UBR
    • IP Integrated Services (IntServ)
      • Best-effort (BE), Controlled-load (CLS), Guaranteed (GS)
  • Marked
    • IP Differentiated Services (DiffServ)
      • Expedited, Assured 1, 2, 3, 4
324
Integrated Services (IntServ)
  • IETF working group since 1993
  • Resource Reservation
    • RSVP is the most common choice
  • Packet Classification (multi field)
  • Admission Control (capacity and/or policy)
  • Packet Scheduling
  • QoS based routing
325
Resource Reservation Protocol (RSVP)
  • “Soft State”
    • Typical: PATH messages every 30 seconds, soft-state timeout in 90 seconds
  • RSVP only blocks when congested
    • ATM often blocks allocations even if unused
  • RSVP is in Windows 2000, but removed from XP in favor of DiffServ marking
  • NSIS may be the next generation of RSVP
326
Multi Field Classification
  • Every RSVP router classifies every packet by inspecting up to seven fields
    • dest addr, dest port, proto, src addr, src port
      • IPv4: ToS (8 bits)
      • IPv6: Traffic Class (8 bits), Flow label (20 bits)
  • DiffServ reduces this burden by marking packets
  • MPLS can help!
327
Differentiated Services (DiffServ)
  • IETF working group since 1998
  • Keeps the core stateless
    • Pushes any/all marking, policing, shaping, signaling to the edges (hosts or edge routers)
  • In the core, DiffServ Code Points map to Per Hop Behaviors (PHB)
  • Cisco Committed Access Rate (CAR) was an early implementation of DiffServ
328
DiffServ Classes (PHB Groups)
  • Expedited Forwarding (EF)
    • Emulates virtual leased line
    • Absolute assurance, admission controlled
    • One PHB: Priority or Class Based Queue
  • Assured Forwarding (AF)
    • Emulates unloaded network
    • Statistical assurance
    • 12 PHB’s (4 classes/queues, each with three drop priorities)
329
IPv4 and IPv6 CoS fields
330
IP TOS Field (8 bits)
  • D = Minimize Delay
  • T = Maximize Throughput
  • R = Maximize Reliability
  • M = Minimize Monetary Cost (RFC1349)
  • Maximum Security = 1111 (RFC1455)
  • RFC791, RFC795, RFC1349
331
IP TOS Precedence
  • 3 bit precedence in the TOS byte:
      • 0 - Routine
      • 1 - Priority
      • 2 - Immediate
      • 3 - Flash
      • 4 - Flash Override
      • 5 - CRITIC/ECP
      • 6 - Internetwork Control
      • 7 - Network Control
332
DiffServ Code Points
  • CS0  000000 [RFC2474]
  • CS1  001000 [RFC2474]
  • CS2  010000 [RFC2474]
  • CS3  011000 [RFC2474]
  • CS4  100000 [RFC2474]
  • CS5  101000 [RFC2474]
  • CS6  110000 [RFC2474]
  • CS7  111000 [RFC2474]
  • AF11 001010 [RFC2597]
  • AF12 001100 [RFC2597]
  • AF13 001110 [RFC2597]
  • AF21 010010 [RFC2597]
  • AF22 010100 [RFC2597]
  • AF23 010110 [RFC2597]
  • AF31 011010 [RFC2597]
  • AF32 011100 [RFC2597]
  • AF33 011110 [RFC2597]
  • AF41 100010 [RFC2597]
  • AF42 100100 [RFC2597]
  • AF43 100110 [RFC2597]
  • EF   101110 [RFC3246]
  • Six bit field, goes in the IPv4 TOS octet or the IPv6 Traffic Class
  • Last two bits are being used by ECN
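The code point table above follows a regular bit pattern, which the sketch below reproduces. It also shows one way an application might mark its own traffic. The helper names are illustrative, and mark_socket assumes a platform that exposes socket.IP_TOS and permits the application to set it.

```python
import socket

EF = 0b101110  # Expedited Forwarding


def class_selector(n: int) -> int:
    """CSn: the old 3-bit precedence value in the top three DSCP bits."""
    return n << 3


def assured_forwarding(af_class: int, drop_precedence: int) -> int:
    """AFxy: class x (1-4) in the top three bits, drop precedence y (1-3) below it."""
    return (af_class << 3) | (drop_precedence << 1)


def tos_octet(dscp: int, ecn: int = 0) -> int:
    """Place the 6-bit DSCP above the two ECN bits of the TOS / Traffic Class octet."""
    return (dscp << 2) | ecn


def mark_socket(sock: socket.socket, dscp: int) -> None:
    """Ask the OS to send this socket's IPv4 packets with the given DSCP."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos_octet(dscp))


if __name__ == "__main__":
    print("AF41 =", format(assured_forwarding(4, 1), "06b"))
    print("CS5  =", format(class_selector(5), "06b"))
    print("EF TOS octet =", hex(tos_octet(EF)))
```

Whether the mark survives depends on the network: edge routers are free to re-mark or clear it.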
333
Comparison to ATM: Policing
  • ATM uses a leaky bucket to determine and enforce conformance
    • Smoothes traffic
  • IntServ and DiffServ use a token bucket
    • Preserves packet spacing and bursts
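A token bucket policer can be sketched in a few lines. The class and parameter names are illustrative, and time is passed in explicitly so the behaviour is easy to follow. Real routers implement this in hardware with their own units and burst semantics.

```python
class TokenBucket:
    """Tokens accumulate at `rate` bytes/second up to `depth` bytes.

    A packet conforms if enough tokens are available when it arrives;
    a burst of up to `depth` bytes passes through unsmoothed.
    """

    def __init__(self, rate_bytes_per_s: float, depth_bytes: float):
        self.rate = rate_bytes_per_s
        self.depth = depth_bytes
        self.tokens = depth_bytes   # start with a full bucket
        self.last = 0.0

    def conforms(self, size_bytes: int, now: float) -> bool:
        # Refill for the time elapsed since the previous packet.
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_bytes_per_s=1000, depth_bytes=1500)
    for t, size in [(0.0, 1500), (0.0, 100), (1.0, 1000), (1.0, 1)]:
        print(f"t={t:.1f}s size={size:4d} conforms={bucket.conforms(size, t)}")
```

Unlike a leaky bucket, which releases traffic at a constant rate, this policer lets a conforming burst through with its original packet spacing.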
334
Routing vs. Switching
  • Does my switch have QoS controls?
    • Some QoS in 802.1p(D)/Q
      • 3 bit User Priority in VLAN tag
      • Simple priority queuing (possible starvation)
    • IETF defined Subnet Bandwidth Manager (SBM)
  • Routers have better queue management, classification, and tagging capability
335
Classes of Encrypted Traffic?
  • Encryption can make multi field inspection impossible, and can hide DiffServ marks
  • Need a way to identify service classes to the WAN provider
    • 802.1Q User Priority?
    • MPLS label or Exp field (in shim header)?
    • Separate port?
336
QoS Recommendations
  • Implement DiffServ where needed
  • Utilize MPLS+TE where available
  • Reduce TCP queues/bursts (pacing)
  • Better Active Queue Management
    • Have routers penalize “bad” flows
337
QoS/CoS Summary
  • QoS is a user perception of performance
  • CoS discriminates flows into service types
    • CoS can be signaled (IntServ/RSVP)
    • CoS can be marked (DiffServ)
  • Current research could improve QoS even without CoS
338
Storage Area Networks
  • SANs and IP Storage
339
Storage Area Network (SAN)
  • Dedicated network for accessing Network Attached Storage (NAS)
  • Usually Fibre Channel
    • Fibre Channel Protocol (FCP) = SCSI Commands over FC
  • IP Storage is coming
340
Fibre Channel
  • Began in 1988 to improve HIPPI connections
    • First ANSI standard in 1994
    • 1 Gbps (100 MBps) standard in 1996
  • Runs on both fiber and copper
  • Ring or mesh (switch) topology
  • FC disks have dual / redundant connections
341
Fibre Channel Architecture
342
FCIP
  • Fibre Channel over TCP/IP
    • Encapsulates FC frames in TCP/IP
  • Provides Inter-Switch Links (ISLs)
  • Allows multiple FC SAN’s to be connected via the internet
  • RFC 3821, July 2004
343
iFCP
  • Internet Fibre Channel Protocol
  • FC-4 layer interface with a TCP/IP sublayer
    • Replaces the FC SAN with an IP network
  • Gateways interface between FC and IP devices
  • Emulates fabric services
  • Work in progress
    • http://www.ietf.org/html.charters/ips-charter.html
344
iSCSI
  • Internet Small Computer Systems Interface
    • Serialized SCSI over TCP
    • Initiator to target (tcp port 3260)
  • Removes 25 meter distance limit
  • Includes initiator/target authentication via iSCSI Login PDUs
    • Challenge-Handshake Authentication Protocol (CHAP)
    • Secure Remote Password (SRP)
  • Uses CRC32c (but optional!)
  • RFC 3720, Apr 2004
345
iSNS
  • Internet Storage Name Server
  • Discovery and management of FCP and iSCSI storage devices
  • Required by iFCP, optional in iSCSI
  • Work in progress
    • http://www.ietf.org/html.charters/ips-charter.html
346
Remote Direct Memory Access (RDMA)
  • Zero-copy end-to-end data movement
  • Available in InfiniBand
  • RDMA Consortium defined it for TCP/IP via iWarp
    • www.rdmaconsortium.org
347
iWarp
  • Collective name given to three protocols
    • MPA – Marker PDU Aligned framing for TCP
    • DDP – Direct Data Placement
    • RDMAP – RDMA Protocol
348
Enhanced Network Interface Cards
349
IP SAN Security
  • FC SANs provide little to no security
  • FCIP, iFCP, iSCSI all depend on IPsec for per-packet
    • Authentication
    • Confidentiality
    • Integrity
    • Replay protection
  • RFC 3723, Securing Block Storage Protocols over IP, Apr 2004
350
SANs in Perspective
  • Came out in the 10/100 Mbps Ethernet era
    • IP networks simply weren’t fast enough, nor did they have QoS controls
  • Held on through the introduction of 1Gbps Ethernet
    • Claimed to still be faster and cheaper
  • Today: Can no longer compete with 1 GigE and soon 10 GigE pricing
  • IP attached iSCSI storage is coming
351
Peer to Peer (P2P)
File Transfer
352
P2P File Sharing Networks
  • FastTrack
    • Closed source network
    • Morpheus, KaZaa, Grokster, Limewire
  • Gnutella (G1)
    • Open source FastTrack like implementation
  • eDonkey2000
    • eDonkey, eMule
  • Direct Connect
    • started Nov 1999, now over 12 PB of data!
    • i2Hub
353
Gnutella (G1)
  • Open source FastTrack like system
  • No central database (unlike Napster)
  • Flooding, order N lookups
354
eDonkey 2000 (ed2k)
  • Did for movies what Napster did for music
  • One of the first to allow partial file uploads
  • Many clients, including eMule
355
Direct Connect
  • 12 PB of data on 250k hosts (6 Oct 04)
  • http://www.neo-modus.com/
356
i2Hub
  • www.i2hub.com
  • Direct Connect P2P network on Internet2
  • “faster because it uses Internet2”
  • Reality check:
    • “at UCLA, i2hub downloads run 30 kBps to 200 kBps, vs. 600 kBps for a good server”
357
BitTorrent
  • http://bitconjurer.org/BitTorrent/
  • http://btfaq.com/
358
BitTorrent
359
BitTorrent Terminology
  • Peer – a client with a file
    • “Seed” if complete file
    • “Leech” if partial file
  • Swarm – the collection of all peers with parts or all of a given file
  • “Swarming” – uploading partial files
360
BitTorrent – How it Works
  • .torrent file
    • Tracker location(s)
    • 256 KB file pieces, SHA1 hash of each piece
  • Downloads
    • 16 KB sub-piece transfers
    • 5 pipelined requests (~80 KB window)
    • 4 active peers, reselected every 10 seconds
      • New opportunistic unchoke every 30 seconds
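The per-piece hashing described above can be sketched with the standard hashlib module. This only illustrates how a file is cut into pieces and how a downloaded piece is checked. It does not build a real .torrent file, and the function names are illustrative.

```python
import hashlib

PIECE_SIZE = 256 * 1024      # piece size quoted on the slide
SUB_PIECE_SIZE = 16 * 1024   # size of one request
PIPELINE_DEPTH = 5           # outstanding requests per peer


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE) -> list:
    """SHA1 digest of each piece, in file order."""
    return [hashlib.sha1(data[i:i + piece_size]).digest()
            for i in range(0, len(data), piece_size)]


def verify_piece(index: int, piece: bytes, hashes: list) -> bool:
    """Accept a downloaded piece only if its SHA1 matches the published hash."""
    return hashlib.sha1(piece).digest() == hashes[index]


if __name__ == "__main__":
    content = bytes(600 * 1024)              # stand-in for a 600 KB file
    hashes = piece_hashes(content)
    print("pieces:", len(hashes))
    print("piece 0 ok:", verify_piece(0, content[:PIECE_SIZE], hashes))
    print("data in flight per peer:", SUB_PIECE_SIZE * PIPELINE_DEPTH // 1024, "KB")
```

The in-flight figure is worth noting for long fat paths: about 80 KB outstanding per peer acts much like a small fixed window, whatever the underlying TCP window allows.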
361
BitTorrent – Reality Check
  • One seed, One leech, Local LAN
    • 14 Mbps to 40 Mbps
  • Two seeds, One leech, Local LAN
    • 14 Mbps + 14 Mbps = 30 Mbps
  • One seed, One leech, CA to MD
    • 3.4 Mbps
362
mod_bt – Integrated Web/BitTorrent
  • Combines tracker and seed functions in an Apache module
  • Worst case, downloads like a web server
  • BitTorrent function can only help
  • Not quite ready for prime time
363
konspire[2b]
  • Similar to BitTorrent
  • Distribution tree, clients subscribe
  • Does not require balanced upload/download
  • In theory, more efficient than BitTorrent for large numbers of clients
  • http://konspire.sourceforge.net/
364
Distributed Hash Tables (DHT)
  • Hot topic since 2001
    • CAN, Chord, Pastry, Tapestry
    • Could be used to replace DNS, web caches, etc.
  • O(log N) vs. O(N) lookups
  • Provable properties
  • Many research projects in DHT’s
    • FreeNet, OceanStore
    • Automated P2P backup projects
      • Pastiche and PAST built on Pastry
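The key-placement idea behind ring-based DHTs such as Chord can be sketched on a single machine: nodes and keys are hashed onto the same identifier ring, and a key belongs to the first node at or after its position. This is a simplified illustration with made-up node names. A real DHT reaches the owner in O(log N) network hops using per-node routing tables, which are not modelled here.

```python
import bisect
import hashlib


def ring_id(name: str, bits: int = 32) -> int:
    """Hash a node name or key onto a 2**bits identifier ring."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class Ring:
    def __init__(self, nodes):
        self.points = sorted((ring_id(n), n) for n in nodes)
        self.ids = [p[0] for p in self.points]

    def lookup(self, key: str) -> str:
        """Return the successor node of the key, wrapping around the ring."""
        i = bisect.bisect_left(self.ids, ring_id(key)) % len(self.ids)
        return self.points[i][1]


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    ring = Ring(["node-a", "node-b", "node-c", "node-d"])
    for key in ("movie.avi", "dataset.tar", "paper.pdf"):
        print(key, "->", ring.lookup(key))
```

One useful property of this placement is that adding a node only moves the keys that now fall to the new node. Every other key stays where it was.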
365
Newer P2P Networks
  • Gnutella2 (G2)
    • DHT network from Shareaza
  • Overnet
    • DHT network from eDonkey2000
  • NEONet
    • DHT network from Morpheus
366
Abstract Storage Layers
  • Internet caches
  • DataSpace Transfer Protocol (DSTP)
  • I2 Logistical Networking
  • Decouple LAN/WAN tuning issues
  • Should storage be a network resource?
367
Internet Web Caches
  • IRCache project 1995-2000
  • Internet Cache Protocol (ICP)
    • RFC2186, RFC2187, Sep 1997
    • Used in Squid and several web cache products
    • http://icp.ircache.net/
  • Hyper Text Caching Protocol (HTCP)
    • RFC2756, Jan 2000
368
Logistical Networking
  • http://loci.cs.utk.edu/
  • Internet Backplane Protocol (IBP)
  • Logistical Backbone (L-Bone)
    • Directory of IBP depots
  • exNodes
    • Collection of IBP allocations
  • Logistical Runtime System (LoRS)
    • Upload and Download services
369
IBP
  • Implements append-only byte arrays
  • Often anonymous creation, limited time storage
  • Need capabilities to access data
    • Crypto secure URL’s: read/write/manage
  • Supports Data Mover plugins
  • 4 GB allocation limit?
370
IBP Commands
  • Storage Management
    • IBP_allocate, IBP_manage
  • Data Transfer
    • IBP_store, IBP_load, IBP_copy, IBP_mcopy
  • Depot Management
    • IBP_status
371
exNodes
  • Holds information and capabilities about storage on one or more IBP depots
    • think inode for L-Bone storage
  • XML representation
    • Can be emailed, etc.
372
L-Bone
373
Logistical Run Time System
  • LoRS tools for using the L-Bone
  • Finds allocations, creates exNodes, etc.
  • Command line, visual, and java versions
374
Example Application: IBPvo
375
PlanetLab
  • Think world-wide shared Linux cluster
  • Over 100 universities doing research in overlay networks, routing, P2P, security, performance, etc.
  • Over half of the L-Bone servers are PlanetLab hosts
  • Internet2 has PlanetLab hosts on the backbone
376
Review
  • High performance comes from all levels
    • They complement each other
  • Network capacity vs. speed
  • How TCP throughput depends on delay, loss, packet size
  • Importance of window and buffer sizes and how to tune them
377
Review
  • How to test and debug network performance
  • Protect your data, the network isn’t perfect
  • TCP continues to improve, notably in congestion control and security modifications
  • Alternatives exist: SCTP and UDP transport experiments
  • Quality of Service: class based and egalitarian
  • Storage Area Networks and IP storage
378
Review
  • Peer to Peer is where the file transfer action is
    • But most P2P assumes poor network performance
    • High performance BitTorrent hasn’t been written yet
  • There is a lot of work today in DHT and abstract storage layers
  • Is storage a future network function?
379
Recommended Resources
  • System tuning details
    • http://www.psc.edu/networking/projects/tcptune/
  • Tom Dunigan’s Network Performance Links
    • http://www.csm.ornl.gov/~dunigan/netperf/netlinks.html
  • SLAC’s Network Monitoring pages
    • http://www-iepm.slac.stanford.edu/
  • CAIDA Internet Measurement Tool Taxonomy
    • http://www.caida.org/tools/
380
Recommended Resources
  • Nuttcp and Iperf for TCP and UDP testing
    • ftp://ftp.lcp.nrl.navy.mil/pub/nuttcp/
    • http://dast.nlanr.net/Projects/Iperf/
  • Testrig for TCP traces
    • http://ncne.nlanr.net/research/tcp/testrig/
  • Web100 / Net100
    • http://www.web100.org/
    • http://www.csm.ornl.gov/~dunigan/net100/
381
Resources
  • Request For Comments (RFC)
    • http://www.rfc-editor.org/
    • ftp://ftp.rfc-editor.org/in-notes/rfc968.txt
  • Internet Engineering Task Force (IETF)
    • http://www.ietf.org/
    • http://www.ietf.org/html.charters/wg-dir.html
382
W. Richard Stevens’ Books
  • TCP/IP Illustrated, 3 volumes
  • Unix Network Programming, 2 volumes
  • http://www.kohala.com/start/
383
Resources
  • Stream Control Transmission Protocol (SCTP)
  • Randall R. Stewart, Qiaobing Xie
  • November, 2001
  • ISBN: 0-201-72186-4
384
Resources
  • High Performance TCP/IP Networking
  • M. Hassan, R. Jain
  • October, 2003
  • ISBN: 0130646342
385
Thank You!
  • Phillip Dykstra
  • WareOnEarth Communications Inc.
  • 2109 Mergho Impasse
  • San Diego, CA 92110
  • phil@sd.wareonearth.com
  • 619-574-7796