|
1
|
- Phillip Dykstra
- Chief Scientist
- WareOnEarth Communications Inc.
- phil@sd.wareonearth.com
|
|
2
|
- This tutorial can be found at
- http://www.wcisd.hpc.mil/~phil/sc2004/
|
|
3
|
|
|
4
|
- Moving a lot of data
- Quickly
- Securely
- Error free
|
|
5
|
- The Internet has been optimized for
- millions of users behind low speed connections
- thousands of high bandwidth servers serving millions of low speed
streams
- Single high-speed to high-speed flows get little commercial attention
|
|
6
|
- Look at current high performance networks
- Fundamental understanding of delay, loss, bandwidth, routes, MTU,
windows
- Learn what is required for high speed data transfer and what to expect
- Look at many useful tools and how to use them for performance testing
and debugging
- Examine TCP dynamics and TCP’s future
- Look at alternatives to TCP and higher level approaches to data transfer
|
|
7
|
- Fundamental Concepts Slide #
- Layers, Circuits 9
- High Performance Networks 18
- Delay 36
- Routes 57
- Packet Size 76
- Bandwidth 86
- Windows 110
- TCP Performance 126
- Performance Tuning 168
|
|
8
|
- Testing and Debugging 184
- Data Integrity 249
- TCP Security 272
- TCP Improvements 286
- Alternatives to TCP 306
- Quality of Service 319
- Storage Area Networks 338
- Peer to Peer 351
- Abstract Storage Layers 366
|
|
9
|
|
|
10
|
- TDM – Time Division Multiplexing
- T1, 1.544 Mbps
- T3, 44.736 Mbps
- SONET – Synchronous Optical Network
- OC3, 155.520 Mbps
- OC12, 622.080 Mbps
- OC48, 2.488 Gbps
- OC192, 9.953 Gbps (“10 Gbps”)
- OC768, 39.813 Gbps (“40 Gbps”)
|
|
11
|
|
|
12
|
|
|
13
|
- Use color / frequency to separate channels
- Coarse (CWDM) and Dense (DWDM) varieties
- Usually in the 1550 nm band
|
|
14
|
|
|
15
|
- Example DWDM Technology Deployment
- 1996, 8 channels, 2.4 Gbps each (19 Gbps)
- 2000, 40 channels, 10 Gbps each (400 Gbps)
- 2004, 80 channels, 40 Gbps each (3.2 Tbps)
- 2010, 320 channels, 160 Gbps each (51 Tbps)
- Cost is linear with # of channels
- ~2.5x cost for 4x bps within a TDM channel
- The return of wider channels (for higher rates)?
- A lot of fiber in the ground today will be obsolete
|
|
16
|
- Over 95% of the world's computers connect via Ethernet
- 1 Gbps on copper can be had for $10/port
- Fiber is at least 10x more expensive
- 10 Gbps fiber NICs run $2000 - $6000 (Oct 2004)
- $10k/switch port, $100k/router port
- Why aren’t WANs Ethernet too?
|
|
17
|
|
|
18
|
|
|
19
|
|
|
20
|
|
|
21
|
|
|
22
|
|
|
23
|
|
|
24
|
|
|
25
|
|
|
26
|
|
|
27
|
|
|
28
|
|
|
29
|
|
|
30
|
|
|
31
|
|
|
32
|
|
|
33
|
|
|
34
|
- The top ten networks defined by number of peers
- Rank Peers ASN Description
- 1 2,347 701 UUNET Technologies, Inc.
- 2 1,902 7018 AT&T WorldNet Services
- 3 1,732 1239 Sprint
- 4 1,171 3356 Level 3 Communications, LLC
- 5 1,092 209 Qwest
- 6 636 2914 Verio, Inc.
- 7 623 174 Cogent Communications
- 8 597 3549 Global Crossing
- 9 549 6461 Abovenet Communications, Inc
- 10 533 4513 Globix Corporation
- Source: www.fixedorbit.com, Sep 2004
|
|
35
|
- Abilene, abilene.internet2.edu
- DREN, www.hpcmo.hpc.mil/Htdocs/DREN
- ESnet, www.es.net
- NLR, www.nlr.net
- NREN, www.nren.nasa.gov
- vBNS+, www.vbns.net
|
|
36
|
|
|
37
|
- The speed of light is a constant
- It seems slower every year
|
|
38
|
- ~3.0x10^8 m/s in free space
- ~2.3x10^8 m/s in copper
- ~2.0x10^8 m/s in fiber = 200 km / ms
- [100 km of distance = 1 ms of round trip time]
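The rule of thumb above follows directly from the propagation speeds. A minimal sketch, assuming the signal travels the same distance out and back and ignoring queueing and store and forward delays; the function names and the 4000 km example path are illustrative only:

```python
# Propagation speeds from the slide, in metres per second.
SPEED_M_PER_S = {
    "free_space": 3.0e8,
    "copper": 2.3e8,
    "fiber": 2.0e8,
}


def one_way_delay_ms(distance_km, medium="fiber"):
    """One-way propagation delay in milliseconds over distance_km."""
    seconds = distance_km * 1000.0 / SPEED_M_PER_S[medium]
    return seconds * 1000.0


def round_trip_ms(distance_km, medium="fiber"):
    """Round trip propagation delay: out and back over the same distance."""
    return 2.0 * one_way_delay_ms(distance_km, medium)


if __name__ == "__main__":
    print(round_trip_ms(100))   # 100 km of fiber
    print(round_trip_ms(4000))  # hypothetical 4000 km cross-country path
```

Real fiber paths are longer than the straight-line distance, so measured round trip times are normally higher than this lower bound.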
|
|
39
|
|
|
40
|
|
|
41
|
|
|
42
|
- A 56k packet could wrap around the earth!
- A 10GigE packet fits in the convention center
|
|
43
|
- Each store and forward hop adds the packet duration to the delay
- In the old days (< 10 Mbps) such hops dominated delay
- Today (> 10 Mbps) store and forward delays on WANs are minimal
compared to propagation
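The packet duration that each store and forward hop adds is the packet size in bits divided by the link rate. The sketch below also converts that duration to a length in fiber, which is where the "wrap around the earth" comparison comes from. The 1500 byte packet size is an assumption, and the function names are illustrative:

```python
FIBER_M_PER_S = 2.0e8  # propagation speed in fiber, as quoted earlier


def packet_duration_s(packet_bytes, link_bps):
    """Time to clock one packet onto a link (serialization delay)."""
    return packet_bytes * 8 / link_bps


def packet_length_m(packet_bytes, link_bps):
    """Physical length of one packet while in flight on fiber."""
    return packet_duration_s(packet_bytes, link_bps) * FIBER_M_PER_S


if __name__ == "__main__":
    for name, bps in [("56k modem", 56e3), ("10 Mbps", 10e6), ("10GigE", 10e9)]:
        duration = packet_duration_s(1500, bps)
        length_km = packet_length_m(1500, bps) / 1000.0
        print(f"{name}: {duration * 1e3:.4f} ms per hop, {length_km:.3f} km long")
```

At 56 kbps a 1500 byte packet lasts about 214 ms and spans roughly 43,000 km of fiber, slightly more than the circumference of the earth; at 10 Gbps it lasts 1.2 microseconds and spans about 240 m.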
|
|
44
|
- ATM cells (and TCP ACK packets) are ~1/30th as long, 30x as
many per second
- One of the reasons we haven’t seen OC48 SAR until recently (2002)
- Jumbo Frames (9000 bytes) are 6x longer, 1/6th as many per
second
|
|
45
|
- The major source of variable delay
- Handle temporary inequalities between arrival rate and output interface
speed
- Small queues minimize delay variation
- Large queues minimize packet drop
|
|
46
|
- When full, choose a packet to drop
- Tail-Drop – arriving packet
- Drop-From-Front – packet at front of queue
- Push-Out – packet at back of queue
- Random-Drop – pick any packet
|
|
47
|
- Addresses lock-out and full queue problems
- Incoming packet drop probability is a function of average queue length
- Random Early Detection (RED)
- Recommended Practice, RFC 2309, Apr 98
- Minth, Maxth, Mindrop, Maxdrop, w
- www.icir.org/floyd/red.html
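How the listed parameters fit together can be sketched as follows. This is a simplified illustration of the usual RED shape, not a full implementation: it assumes the drop probability rises linearly from Mindrop at Minth to Maxdrop at Maxth, that everything is dropped above Maxth, and it omits the packet-count correction used in the published algorithm. All parameter values in the example are hypothetical.

```python
def update_average(avg, queue_len, w):
    """Exponentially weighted moving average of the queue length (weight w)."""
    return (1.0 - w) * avg + w * queue_len


def drop_probability(avg, min_th, max_th, min_drop, max_drop):
    """Drop probability for an arriving packet given the average queue length."""
    if avg < min_th:
        return 0.0  # short average queue: never drop
    if avg >= max_th:
        return 1.0  # long average queue: drop every arrival
    fraction = (avg - min_th) / (max_th - min_th)
    return min_drop + fraction * (max_drop - min_drop)


if __name__ == "__main__":
    avg = 0.0
    for queue_len in [5, 40, 40, 40, 40]:  # hypothetical instantaneous queue lengths
        avg = update_average(avg, queue_len, w=0.25)
        print(round(avg, 2), drop_probability(avg, 10, 30, 0.0, 0.1))
```

Because the decision uses the average rather than the instantaneous queue length, short bursts pass through while sustained congestion produces early, randomized drops.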
|
|
48
|
- $ ping -s 56 cisco.com
- PING cisco.com (198.133.219.25) from 63.196.71.246 : 56(84) bytes of data.
- 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=1 ttl=241 time=25.6 ms
- 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=2 ttl=241 time=25.5 ms
- 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=3 ttl=241 time=25.1 ms
- 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=4 ttl=241 time=26.1 ms
- 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=5 ttl=241 time=25.0 ms
- 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=6 ttl=241 time=25.8 ms
- 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=8 ttl=241 time=25.4 ms
- 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=9 ttl=241 time=25.1 ms
- 64 bytes from www.cisco.com (198.133.219.25): icmp_seq=10 ttl=241 time=26.2 ms
- --- cisco.com ping statistics ---
- 10 packets transmitted, 9 received, 10% loss, time 9082ms
- rtt min/avg/max/mdev = 25.053/25.585/26.229/0.455 ms
|
|
49
|
- Ping packet = 20 bytes IP + 8 bytes ICMP + “user data” (first 8 bytes = timestamp)
- Default = 56 user bytes = 64 byte IP payload = 84 total bytes
- Small pings (-s 8 = 36 bytes) take less time than large pings (-s 1472 =
1500 bytes)
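The size arithmetic above can be checked with a few lines. A small sketch, assuming an IPv4 header without options; the function name is illustrative:

```python
IP_HEADER_BYTES = 20   # IPv4 header without options
ICMP_HEADER_BYTES = 8  # ICMP echo header


def ping_ip_bytes(user_bytes):
    """Total IP datagram size for a ping carrying user_bytes of data (-s value)."""
    return IP_HEADER_BYTES + ICMP_HEADER_BYTES + user_bytes


if __name__ == "__main__":
    for size in (8, 56, 1472):
        print(f"-s {size} -> {ping_ip_bytes(size)} byte IP datagram")
```

With a 1500 byte IP MTU, `-s 1472` is therefore the largest ping that fits in one unfragmented packet.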
|
|
50
|
- TTL = 241 indicates 255-241 = 14 hops
- Delay variation indicates congestion or system load
- Not good at measuring small loss
- An HPC network should show zero ping loss
- Depends on ICMP ECHO which is sometimes blocked for “security”
|
|
51
|
- Delay jumping between two distinct values
- Does not show up in traceroute as a route change
- Layer 2 problem: SONET protect failover
|
|
52
|
|
|
53
|
- http://e2epi.internet2.edu/owamp
|
|
54
|
- Showed that one-way is interesting
|
|
55
|
|
|
56
|
- The number of bytes in flight to fill the entire path
- Includes data in queues if they contributed to the delay
- Example
- 100 Mbps path
- ping shows a 75 ms rtt
- BDP = 100 * 0.075 = 7.5 million bits (916 KB)
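The example above as a small calculation. A sketch only; the function names are illustrative, and KB is taken as 1024 bytes, which is what reproduces the 916 KB figure:

```python
def bdp_bits(bandwidth_bps, rtt_s):
    """Bandwidth * Delay Product in bits."""
    return bandwidth_bps * rtt_s


def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth * Delay Product in bytes."""
    return bdp_bits(bandwidth_bps, rtt_s) / 8


if __name__ == "__main__":
    bits = bdp_bits(100e6, 0.075)            # 100 Mbps path, 75 ms rtt
    kbytes = bdp_bytes(100e6, 0.075) / 1024
    print(f"{bits / 1e6:.1f} million bits, {kbytes:.0f} KB")
```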
|
|
57
|
- The path taken by your packets
|
|
58
|
- Within a network
- Smallest number of hops
- Highest bandwidth paths
- Usually ignore latency and utilization
- From one network to another
- Often “hot potato” routing, i.e. pass to the other network ASAP
|
|
59
|
|
|
60
|
|
|
61
|
|
|
62
|
|
|
63
|
|
|
64
|
|
|
65
|
|
|
66
|
|
|
67
|
- ftp://ftp.login.com/pub/software/traceroute/
- Supports
- MTU discovery (-M)
- Report ASNs (-A)
- Registered owners (-O)
- Set IP TOS field (-t)
- Microsecond timestamps (-u)
- Set IP protocol (-I)
|
|
68
|
- [phil@damp-ssc phil]$ traceroute -A mit.edu
- traceroute to mit.edu (18.7.22.69), 64 hops max, 40 byte packets
- 1 ge-0-1-0.sandiego.dren.net (138.18.190.1) [AS668] 0 ms 0 ms 0 ms
- 2 so-0-0-0.ngixeast.dren.net (138.18.1.55) [AS668] 76 ms 75 ms 76 ms
- 3 Abilene-peer.ngixeast.dren.net (138.18.47.34) [AS668] 76 ms 77 ms 76 ms
- 4 nycmng-washng.abilene.ucaid.edu (198.32.8.84) [<NONE>] 80 ms 88 ms 80 ms
- 5 ATM10-420-OC12-GIGAPOPNE.nox.org (192.5.89.9) [<NONE>] 87 ms 86 ms 85 ms
- 6 192.5.89.90 (192.5.89.90) [<NONE>] 85 ms 86 ms 104 ms
- 7 W92-RTR-1-BACKBONE.MIT.EDU (18.168.0.25) [AS3] 85 ms 85 ms 85 ms
- 8 WEB.MIT.EDU (18.7.22.69) [AS3] 86 ms 86 ms 85 ms
|
|
69
|
- Sends UDP packets to ports (-p) 33434 and up, TTL of 1 to 30
- Each router hop decrements the TTL
- If the TTL=0, that node returns an ICMP TTL Expired
- The destination host returns an ICMP Port Unreachable
|
|
70
|
- Shows the return interface addresses of the forwarding path
- You can’t see hops through switches or over tunnels (e.g. ATM VC’s, GRE,
MPLS)
- The required ICMP replies are sometimes blocked for “security”, or not
generated, or sent without resetting the TTL
|
|
71
|
|
|
72
|
|
|
73
|
- http://michael.toren.net/code/tcptraceroute/
- Useful when ICMP is blocked
- But still depends on TTL expired replies
- Can select the TCP port number
- Helps locate firewall or ACL blocks
- Defaults to port 80 (http)
|
|
74
|
- IP routing requires a longest prefix match
- Harder than switching, but now wire speed
- Inspired IP switching / MPLS
- Switches have gained features
- Simplicity is good
- Use switched infrastructure where you can, routers where you need better separation or control
|
|
75
|
- Adds switched (layer 2) paths below IP
- Useful for traffic engineering, VPN’s, QoS control, high speed
switching
- IP packets get wrapped in MPLS frames and “labeled”
- MPLS routers switch the packets along Label Switched Paths (LSP’s)
- Being generalized for optical switching
|
|
76
|
|
|
77
|
- Maximum packet size that can be sent as one unit (no fragmentation)
- Usually “IP MTU” which is the largest IP datagram including the IP
header
- Shown with ifconfig on Unix/Linux
- Sometimes a layer 2 MTU
- IP MTU 1500 = Ethernet MTU 1518 (or 1514, or 1522)
|
|
78
|
|
|
79
|
- Non-standard Ethernet extension
- Usually 9000 bytes (IP MTU)
- Sometimes anything over 1500
- Usually only used with 1GigE and above
- Requires separate LANs or VLANs to accommodate non-jumbo equipment
- http://sd.wareonearth.com/~phil/jumbo.html
|
|
80
|
- Reduce system and network overhead
- Handle 8KB NFS or SAN packets
- Improve TCP performance!
- Greater throughput
- Greater loss tolerance
- Reduced router queue loads
|
|
81
|
- MTU 9000 interoperability demonstrated:
- Allied Telesyn, Avaya, Cisco, Extreme, Foundry, NEC, Nortel
- Juniper = 9178 (9192 – 14)
- Cisco (5000/6000) = 9216
- Foundry JetCore 1 & 10 GbE = 14000
- NetGear GA621 = 16728
- http://darkwing.uoregon.edu/~joe/jumbo-clean-gear.html
|
|
82
|
- Minimum MTU of all hops in a path
- Hosts can do Path MTU Discovery to find it
- Without PMTU Discovery hosts should assume PMTU is only 576 bytes
- Some hosts falsely assume 1500!
|
|
83
|
|
|
84
|
|
|
85
|
- Use only large MTU interfaces/routers/links
- Gigabit Ethernet with Jumbo Frames (9000)
- Packet over SONET (POS) (4470, 9000+)
- ATM CLIP (9180)
- Never reduce the MTU (or bandwidth) on the path between each/every host
and the WAN
- Make sure your TCP uses Path MTU Discovery
|
|
86
|
|
|
87
|
|
|
88
|
- throughput <= available bandwidth
- (“tight link” with the
minimum unused bandwidth)
- A high performance network should be lightly loaded (<50%)
- A loaded high speed network is no better to the end user than a lightly
loaded slow one
|
|
89
|
- Ask the routers/switches
- SNMP
- RMON / NetFlow / sFlow
- Passive Measurement
- Active Measurement
- Light weight tests
- Full bandwidth tests
|
|
90
|
- Simple Network Management Protocol
- Version 1: RFC 1155-1157 (1990)
- Version 2c: RFC 1901-1908 (1996)
- Version 3: RFC 2571-2574 (1999)
- Net-SNMP (was UCD-SNMP)
- http://www.net-snmp.org/
- Library and utilities
|
|
91
|
- $ snmpget -v 2c -c public router.dren.net system.sysUpTime.0
- SNMPv2-MIB::sysUpTime.0 = Time ticks: (1407699551) 162 days, 22:16:35.51
- $ snmpwalk -v 2c -c public router.dren.net IF-MIB::ifXTable
- IF-MIB::ifName.1 = STRING: fxp0
- IF-MIB::ifName.2 = STRING: fxp1
- …
- IF-MIB::ifHCInOctets.2 = Counter64: 4049716647
- IF-MIB::ifHCOutOctets.2 = Counter64: 3264374754
- IF-MIB::ifHighSpeed.2 = Gauge32: 100
- IF-MIB::ifAlias.2 = STRING: Router 1 LAN
- …
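Counters such as ifHCInOctets only become a utilization figure once two samples are differenced. A minimal sketch, assuming two reads of the same 64-bit counter taken a known number of seconds apart and ifHighSpeed reported in units of 1,000,000 bits per second; the second sample and the 300 second interval are hypothetical:

```python
COUNTER64_MODULUS = 2 ** 64


def utilization_percent(octets_start, octets_end, interval_s, if_high_speed_mbps):
    """Average link utilization between two ifHC*Octets samples."""
    # Modular subtraction tolerates a single counter wrap between samples.
    delta_octets = (octets_end - octets_start) % COUNTER64_MODULUS
    bits_per_s = delta_octets * 8 / interval_s
    return 100.0 * bits_per_s / (if_high_speed_mbps * 1_000_000)


if __name__ == "__main__":
    first = 4049716647            # ifHCInOctets.2 from the walk above
    second = first + 375_000_000  # hypothetical sample taken 300 s later
    print(utilization_percent(first, second, 300, 100))
```

This is essentially the calculation behind five minute average utilization graphs. It relies on knowing the sampling interval accurately, which matters when a device caches its counters or provides no timestamps.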
|
|
92
|
|
|
93
|
- Juniper caches interface statistics for five seconds
- Reads provide new data only if cache > 5 seconds old
- No timestamps on the data
|
|
94
|
|
|
95
|
- www.mrtg.org
- Extremely popular network monitoring tool
- Most common display:
- Five minute average link utilizations
- Green into interface
- Blue out of interface
|
|
96
|
|
|
97
|
- Round Robin Database Tool
- Reimplementation of MRTG’s backend
- Generalized data storage, retrieval, graphing
- Many front ends available, including MRTG
- www.rrdtool.org
|
|
98
|
|
|
99
|
- www.tcpdump.org
- Raw packet capture / decode from hosts
- libpcap
- www.ethereal.com
- Nice GUI protocol analyzer
|
|
100
|
- RMON, 1993
- SNMP controllable packet capture
- NetFlow, 1997
- Cisco flow records or samples
- sFlow, 2001
- Low level packet samples and statistics
|
|
101
|
- Cisco flow records (also on Juniper)
- Version 5 and 9 formats common today
- www.cisco.com/go/netflow
- www.ietf.org/internet-drafts/draft-claise-netflow-version9-07.txt
- Many tools
- Cflowd, www.caida.org/tools/measurement/cflowd/
- flow-tools, www.splintered.net/sw/flow-tools/
- FlowScan, net.doit.wisc.edu/~plonka/FlowScan/
|
|
102
|
- RFC 3176, Sep 2001
- Switch or router level agent
- Statistical flow samples
- Flow is one ingress port to one egress port(s)
- Up to 256 packet header bytes, or summary
- Periodic counter statistics
- typical 20-120 sec interval
- www.sflow.org, www.ntop.org
|
|
103
|
|
|
104
|
- Packet pairs or trains are sent
- The slower link causes packets to spread
- The packet spread indicates the bandwidth
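The relationship between packet spread and bandwidth can be sketched as below. This assumes back-to-back packets of equal size and no cross traffic between them; real tools filter many samples, and the median used here is only one simple choice. The function names and sample values are hypothetical.

```python
from statistics import median


def capacity_bps(packet_bytes, dispersion_s):
    """Bottleneck capacity implied by the arrival gap of one packet pair."""
    return packet_bytes * 8 / dispersion_s


def estimate_capacity_bps(packet_bytes, dispersions_s):
    """Estimate from several pairs, using the median gap to damp outliers."""
    return capacity_bps(packet_bytes, median(dispersions_s))


if __name__ == "__main__":
    # Hypothetical arrival gaps (seconds) for 1500 byte pairs;
    # the last pair was delayed by cross traffic.
    gaps = [0.00012, 0.00012, 0.0005]
    print(estimate_capacity_bps(1500, gaps) / 1e6, "Mbps")
```

A 1500 byte packet takes 120 microseconds to cross a 100 Mbps link, so a 120 microsecond gap between the packets of a pair points to a 100 Mbps bottleneck.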
|
|
105
|
- pathchar – Van Jacobson, LBL
- ftp://ftp.ee.lbl.gov/pathchar/
- clink – Allen Downey, Wellesley College
- http://rocky.wellesley.edu/downey/clink/
- pchar – Bruce A. Mah, Sandia/Cisco
- http://www.employees.org/~bmah/Software/pchar/
|
|
106
|
- pipechar - Jin Guojun, LBL
- http://www.didc.lbl.gov/pipechar/
- nettimer - Kevin Lai, Stanford University
- http://gunpowder.stanford.edu/~laik/projects/nettimer/
- pathrate/pathload - Constantinos Dovrolis, Georgia Tech
- http://www.cc.gatech.edu/fac/Constantinos.Dovrolis/bwmeter.html
|
|
107
|
- http://www-iepm.slac.stanford.edu/tools/abing/
- Sends 20 pairs (40 packets) of 1450 byte UDP to a reflector on port
8176, estimates (in ~1 second):
- ABw: Available Bandwidth
- Xtr: Cross traffic
- DBC: Dominating Bottleneck Capacity
|
|
108
|
|
|
109
|
|
|
110
|
- Flow/rate control and error recovery
|
|
111
|
- Windows control the amount of data that is allowed to be “in flight” in
the network
- Maximum throughput is one window full per round trip time
- The sender, receiver, and the network each determine a different window
size
|
|
112
|
|
|
113
|
- Excellent tool to view the packet forwarding and loss properties of a
path under varying load!
- Sends windows full of ICMP Echo or UDP packets
- Treats ICMP Echo_Reply or Port_Unreachable packets as “ACKs”
- Make sure destination responds well to ICMP
- Consumes a lot of resources: use with care
|
|
114
|
|
|
115
|
|
|
116
|
|
|
117
|
|
|
118
|
|
|
119
|
|
|
120
|
|
|
121
|
- Ethernet “auto-negotiation” can select the speed and duplex of a
connected pair
- If only one end is doing it:
- It can get the speed right
- but it will assume half-duplex
- Mismatch loss only shows up under load
|
|
122
|
- TCP Throughput is at best one window per round trip time (window/rtt)
- The maximum TCP window of the receiving or sending host is the most
common limiter of performance
- Examples
- 8 KB window / 87 msec ping time = 753 Kbps
- 64 KB window / 14 msec rtt = 37 Mbps
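The two examples above can be reproduced directly. A sketch, assuming KB means 1024 bytes (which matches the quoted results); the function name is illustrative:

```python
def max_throughput_bps(window_bytes, rtt_s):
    """Upper bound on TCP throughput: one window per round trip time."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    print(max_throughput_bps(8 * 1024, 0.087) / 1e3, "Kbps")   # 8 KB window, 87 ms
    print(max_throughput_bps(64 * 1024, 0.014) / 1e6, "Mbps")  # 64 KB window, 14 ms
```

The bound is independent of link speed: a faster link does not help until the window grows.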
|
|
123
|
|
|
124
|
|
|
125
|
- ATM traffic from the Pittsburgh Gigapop
- 50% had windows < 20 KB
- These are obsolete systems!
- 20% had 64 KB windows
- Limited to ~ 8 Mbps coast-to-coast
- ~9% are assumed to be using window scale
|
|
126
|
|
|
127
|
- TCP is adaptive
- It is constantly trying to go faster
- It slows down when it detects a loss
- How much it sends is controlled by windows
- When it sends is controlled by received ACK’s
- (or timeouts)
|
|
128
|
|
|
129
|
|
|
130
|
- A flow control mechanism
- The amount of data the receiver is ready to receive
- Loosely set by receive socket buffer size
- Application has to drain the buffer fast enough
- System has to wait for missing or out of order data (reassembly queue)
|
|
131
|
- Must hold two round trip times of data
- One set for the current round trip time
- One set for possible retransmits from the last round trip time
(retransmit queue)
- A system maximum often limits size
- The application must write data quickly enough
|
|
132
|
- Flow control, calculated by sender, based on observations of the data
transfer (loss, rtt, etc.)
- In the old days, there was none
- TCP would send a full receive window in one round trip time
- TCP now has two modes
- Slow-start: cwnd increases exponentially
- Congestion-Avoidance: cwnd increases linearly
|
|
133
|
- Congestion window controls startup and limits throughput in the face of
loss
- cwnd gets larger after every new ACK
- cwnd gets smaller when loss is detected
- cwnd amounts to TCP’s estimate of the available bandwidth at any given
time
|
|
134
|
|
|
135
|
|
|
136
|
- Additive Increase, Multiplicative Decrease
- of cwnd, the congestion window
- The core of TCP’s congestion avoidance phase, or “steady state”
- Standard increase = +1.0 MSS per loss free rtt
- Standard decrease = *0.5 (i.e. halve cwnd on loss)
- Avoids congestion collapse
- Promotes fairness among flows
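The sawtooth that AIMD produces can be illustrated with a toy model. This is a sketch under simplifying assumptions, not a TCP implementation: cwnd is counted in MSS units, time advances one round trip per step, slow-start and timeouts are ignored, and the round trips in which loss occurs are supplied by the caller.

```python
def aimd(initial_cwnd, rtts, loss_rtts, increase=1.0, decrease=0.5):
    """Return cwnd (in MSS) after each of `rtts` round trips.

    loss_rtts is the set of round trip indices in which a loss is detected.
    """
    cwnd = initial_cwnd
    history = []
    for rtt in range(rtts):
        if rtt in loss_rtts:
            cwnd = max(1.0, cwnd * decrease)  # multiplicative decrease
        else:
            cwnd += increase                  # additive increase
        history.append(cwnd)
    return history


if __name__ == "__main__":
    print(aimd(10.0, 6, {3}))
```

After a single halving it takes cwnd/2 loss free round trips to get back to the previous rate, which is why recovery is slow on paths with a large bandwidth * delay product.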
|
|
137
|
|
|
138
|
|
|
139
|
|
|
140
|
- TCP needs a receive window (rwin) equal to or greater than the BW*Delay
product to achieve maximum throughput
- TCP needs sender side socket buffers of 2*BW*Delay to recover from
errors
- You need to send about 3*BW*Delay bytes for TCP to reach maximum speed
|
|
141
|
- TCP receivers send ACK’s:
- after every second segment
- after a delayed ACK timeout
- on every segment after a loss (missing segment)
- A new segment sets the delayed ACK timer
- A second segment (or timeout) triggers an ACK and clears the delayed ACK
timer
|
|
142
|
- A queue forms in front of a slower speed link
- The slower link causes packets to spread
- The spread packets result in spread ACK’s
- The spread ACK’s end up clocking the source packets at the slower link
rate
|
|
143
|
- Packets get discarded when queues are full
- (or nearly full)
- Duplicate ACK’s get sent after missing or out of order packets
- Most TCP’s retransmit after the third duplicate ACK (“triple duplicate
ACK”)
- Windows XP now uses 2nd dup ACK
|
|
144
|
- Rate = ~0.7 * Max Segment Size (MSS) / (Round Trip Time * sqrt(pkt_loss))
- M. Mathis, et al.
- Double the MTU, double the throughput
- Halve the latency, double the throughput
- Halve the loss rate, 40% higher throughput
- www.psc.edu/networking/papers/model_abstract.html
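The three rules of thumb fall out of the formula directly. A minimal sketch, using the ~0.7 constant from the slide; the function name and the sample MSS, rtt, and loss values are illustrative:

```python
import math


def mathis_bps(mss_bytes, rtt_s, loss, c=0.7):
    """Approximate steady-state TCP throughput in bits per second."""
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


if __name__ == "__main__":
    base = mathis_bps(1460, 0.070, 1e-5)  # hypothetical path
    print("double the MSS:   x", mathis_bps(2920, 0.070, 1e-5) / base)
    print("halve the rtt:    x", mathis_bps(1460, 0.035, 1e-5) / base)
    print("halve the loss:   x", mathis_bps(1460, 0.070, 0.5e-5) / base)
```

Halving the loss rate multiplies throughput by sqrt(2), about 1.41, which is the "40% higher" figure.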
|
|
145
|
- If we could halve the delay we could double throughput!
- Most delay is caused by speed of light in fiber (~200 km/msec)
- “Scenic routing” and fiber paths raise the minimum
- Congestion (queueing) adds delay
|
|
146
|
- MSS = MTU – packet headers
- Most often, MSS = 1448 or 1460
- Jumbo frame => ~6x throughput increase
- The Abilene and DREN WANs support MTU 9000 everywhere
- Most sites still have a lot of work to do
|
|
147
|
- IP Maximum Transmission Unit (MTU) on the DREN WAN is now 9146 bytes
- After much debate, exactly 9000 is being used to the sites with GigE
interfaces
- Sites choose 1500, 4470, or 9000; others by exception
- Sites are encouraged to support 9000 on their GigE infrastructures
|
|
148
|
|
|
149
|
|
|
150
|
- Loss dominates throughput !
- At least 6 orders of magnitude observed on the Internet
- 100 Mbps throughput requires a loss rate of O(10^-6)
- 1 Gbps throughput requires a loss rate of O(10^-8)
|
|
151
|
|
|
152
|
- TCP loss limits for 1 Gbps across country are O(10^-8), i.e. 0.000001% packet loss
- About 1 “ping” packet every three years
- Systems like AMP would never show loss
- Try to get 10^-8 in writing from a provider!
- Most providers won’t guarantee < 0.01%
|
|
153
|
- Require the provider to demonstrate TCP throughput
- DREN contract requires ½ line rate TCP flow sustained for 10 minutes cross country (e.g. ~300 Mbps on OC12)
- A low loss requirement comes with this!
|
|
154
|
- Bit Error Rate (BER) specs for networking interfaces/circuits may not be
low enough
- E.g. 10^-12 BER => 10^-8 packet ER (1500 bytes)
- 10 hops => 10^-7 packet drop rate
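The step from bit error rate to packet error rate can be made explicit. A sketch, assuming independent bit errors, that any single bit error loses the whole packet, and that every hop has the same error rate; the function names are illustrative:

```python
import math


def packet_error_rate(ber, packet_bytes):
    """Probability that at least one bit of a packet is corrupted."""
    bits = packet_bytes * 8
    # 1 - (1 - ber)**bits, computed in a numerically stable way for tiny ber.
    return -math.expm1(bits * math.log1p(-ber))


def path_loss_rate(per_hop_rate, hops):
    """Probability that a packet is lost on at least one of `hops` links."""
    return -math.expm1(hops * math.log1p(-per_hop_rate))


if __name__ == "__main__":
    per = packet_error_rate(1e-12, 1500)
    print(per)                      # on the order of 10^-8
    print(path_loss_rate(per, 10))  # on the order of 10^-7
```

For small error rates this is very close to simply multiplying: BER times bits per packet, then times the number of hops.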
|
|
155
|
- Hard Disk, 10^-14
- SONET Equipment, 10^-12
- SONET Circuits, 10^-10
- 1000Base-T, 10^-10
- One bit error every 10 seconds at line rate
- Copper T1, 10^-6
|
|
156
|
- Requires a bit error rate (BER) of 10^-12 or better!
- Perhaps TCP is expecting too much from the network
- Loss isn’t always congestion
- Modify TCP’s congestion control
|
|
157
|
- Original formula assumptions
- constant packet loss rate
- dominated by congestion from other flows
- 6 x MTU provides 6 x throughput
- HPC environment (low congestion)
- Loss may be dominated by bit error rates
- MSS becomes sqrt(MSS)
- 6 x MTU provides 2.4 x throughput
|
|
158
|
- Some high performance options
|
|
159
|
- Window Scale
- Timestamps
- Selective Acknowledgement (SACK)
- Path MTU Discovery (PMTUD)
- Explicit Congestion Notification (ECN)
|
|
160
|
- 16-bit window, 65535 byte maximum
- RFC1323 window scale option in SYN packets
- Creates a 32-bit window by left shifting 0 to 14 bit positions
- New max window is 1 GB (2^30)
- Granularity: 1B, 2B, 4B, …, 16KB
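To see which shift a given window needs, the sketch below picks the smallest shift count whose scaled 16-bit field can express the requested size. It illustrates the arithmetic only; real stacks choose the shift from their configured maximum buffer size, and the function names are illustrative.

```python
MAX_UNSCALED_WINDOW = 65535  # largest value of the 16-bit window field
MAX_SHIFT = 14               # largest shift the option allows


def window_scale_shift(window_bytes):
    """Smallest shift count whose scaled window can cover window_bytes."""
    shift = 0
    while (MAX_UNSCALED_WINDOW << shift) < window_bytes and shift < MAX_SHIFT:
        shift += 1
    return shift


def granularity_bytes(shift):
    """Window sizes can only be expressed in multiples of 2**shift bytes."""
    return 1 << shift


if __name__ == "__main__":
    for window in (65535, 8 * 1024 * 1024, 2 ** 30):
        shift = window_scale_shift(window)
        print(window, "-> shift", shift, "granularity", granularity_bytes(shift))
```

The largest expressible window is 65535 << 14, which is 1,073,725,440 bytes, just under 2^30.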
|
|
161
|
- 12 byte option
- drops MSS of 1460 to 1448
- Allows better Round Trip Time Measurement (RTTM)
- Prevention Against Wrapped Sequence numbers (PAWS)
- 32 bit sequence number wraps in 17 sec at 1 Gbps
- TCP assumes a Maximum Segment Lifetime (MSL) of 120 seconds
|
|
162
|
- TCP originally could only ACK the last in sequence byte received
(cumulative ACK)
- RFC2018 SACK allows ranges of bytes to be ACK’d
- Sender can fill in holes without resending everything
- Up to four blocks fit in the TCP option space (three with the timestamp
option)
- Surveys have shown that SACK implementations often have errors
- RFC3517 addresses how to respond to SACK
|
|
163
|
- RFC2883 extended TCP SACK option to allow duplicate packets to be SACK’d
- D-SACK is a SACK block that reports duplicates
- Allows the sender to infer packet reordering
|
|
164
|
- Forward ACK (FACK) is the forward-most segment acknowledged in a SACK
- FACK TCP uses this to keep track of outstanding data
- During fast recovery, keeps sending as long as outstanding data <
cwnd
- Generally a good performer - recommended
|
|
165
|
- Probes for the largest end-to-end unfragmented packet
- Uses don’t-fragment option bit, waits for must-fragment replies
- Possible black holes
- some devices will discard the large packets without sending a
must-fragment reply
- Problems are discussed in RFC2923
|
|
166
|
- RFC3168, Proposed Standard
- Indicates congestion without packet discard
- Uses last two bits of the IP Type of Service (TOS) field
- Black hole warning
- Some devices silently discard packets with these bits set
|
|
167
|
- http://syntest.psc.edu:7961/
- telnet syntest.psc.edu 7960
|
|
168
|
- Interfaces, routes, buffers, etc.
|
|
169
|
- Throw out your low speed interfaces and networks!
- Make sure routes and DNS report high speed interfaces
- Don’t over-utilize your links (<50%)
- Use routers sparingly, host routers not at all
|
|
170
|
- Make sure your HPC apps offer sufficient receive windows and use
sufficient send buffers
- But don’t run your system out of memory
- Find out the rtt with ping, compute BDP
- Tune system wide, by application, or automatically
- Check your TCP for high performance features
- Look for sources of loss
- Watch out for duplex problems
|
|
171
|
- Maximum TCP throughput is one window per round trip time
- System default and maximum window sizes are usually too small for HPC
- The #1 cause of slow TCP transfers!
|
|
172
|
|
|
173
|
- Command line:
- sysctl -w net.core.rmem_max=8388608
- sysctl -w net.core.wmem_max=8388608
- In /etc/sysctl.conf (makes it permanent)
- net.core.rmem_max = 8388608
- net.core.wmem_max = 8388608
|
|
174
|
- In /etc/sysctl.conf
- net.ipv4.tcp_rmem = 4096 349520 4194304
- net.ipv4.tcp_wmem = 4096 65536 4194304
- Three values are: minimum, default, maximum for automatic buffer
allocation on Linux
|
|
175
|
- Before doing a connect or accept
- setsockopt(fd, SOL_SOCKET, SO_SNDBUF, …)
- setsockopt(fd, SOL_SOCKET, SO_RCVBUF, …)
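The same two calls through Python's standard socket module, as a sketch. The buffers are set right after the socket is created, before any connect, and then read back, because the operating system may cap the request at its configured maximum or (on Linux) report a doubled value. The helper names and the 256 KB request are illustrative.

```python
import socket


def make_tuned_socket(buffer_bytes):
    """Create a TCP socket and request send/receive buffers before connecting."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    return sock


def granted_buffers(sock):
    """Return the (send, receive) buffer sizes the OS actually granted."""
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    )


if __name__ == "__main__":
    s = make_tuned_socket(256 * 1024)
    print(granted_buffers(s))  # compare with the requested size
    s.close()
```

If the granted sizes are smaller than requested, the system-wide maximums need to be raised first.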
|
|
176
|
|
|
177
|
- See “Enabling High Performance Data Transfers”
- http://www.psc.edu/networking/projects/tcptune/
|
|
178
|
- Many FTP’s allow the user to set buffer sizes
- The commands are different everywhere!
- kftp buffer size commands
- lbufsize 8388608 (local)
- rbufsize 8388608 (remote)
- Other kftp commands
- protect/cprotect (clear|safe|private)
- passive – client opens the data channel connection
- For other versions of FTP, see
- http://dast.nlanr.net/Projects/FTP.html
|
|
179
|
- Small, fast, secure
- 2.x has SSL / TLS support
- http://vsftpd.beasts.org/
|
|
180
|
- FTP Protocol Extensions
- SBUF and ABUF – set / auto buffer sizes
- SPOR and SPAS – striped port / passive
|
|
181
|
- Measures the spread of a burst of ICMP Echo packets to estimate BDP,
sets bufs
|
|
182
|
- Rome NY to Maui HI, kftp
- Before tuning: 3.2 Mbps
- After tuning: 40 Mbps (line rate)
- A very common story…
|
|
183
|
- Users Guide to TCP Windows
- www.ncsa.uiuc.edu/People/vwelch/net_perf/tcp_windows.html
- TCP Tuning Guide
- www-didc.lbl.gov/TCP-tuning/
- Enabling High Performance Data Transfers on Hosts
- www.psc.edu/networking/projects/tcptune/
- WAN Tuning and Troubleshooting
- www.internet2.edu/~shalunov/writing/tcp-perf.html
|
|
184
|
|
|
185
|
- Tells you what a good TCP should be able to achieve (Bulk Transfer
Capacity)
|
|
186
|
- Easy 10 second test, no remote access or receiver process required
- Emulates TCP but doesn’t use TCP
- Problems with host TCP or tuning are avoided
- Does Path MTU Discovery
- Reports rtt and loss rates
- A zero equilibrium result means there was too much packet loss to exit
“slow start”
|
|
187
|
- Can send ICMP (-i) or UDP (default)
- ICMP replies (ECHO or UNREACH) could be blocked for “security”
- Routers send ICMP replies very slowly
- So don’t test routers with treno
- ICMP is often rate limited now by hosts
- Port numbers can wrap around (and look like port scans)
|
|
188
|
- ttcp – the original, many variations
- http://sd.wareonearth.com/~phil/net/ttcp/
- nuttcp – great successor to ttcp (recommended)
- ftp://ftp.lcp.nrl.navy.mil/pub/nuttcp/
- Iperf – great TCP/UDP tool (recommended)
- http://dast.nlanr.net/Projects/Iperf/
- netperf – dated but still in wide use
- ftp – nothing beats a real application
|
|
189
|
- Network data rates (bps) are powers of 10, not powers of 2 as used for
Bytes
- E.g. 100 Mbps ethernet is 100,000,000 bits/sec
- Some tools wrongly use powers of 2 (e.g. ttcp)
- User payload data rates are reported by tools
- No TCP, IP, Ethernet, etc. headers are included
- E.g. 100 Mbps ethernet max is 97.5293 Mbps
- http://sd.wareonearth.com/~phil/net/overhead/
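The 97.5293 Mbps figure is consistent with dividing a 1500 byte IP datagram by the 1538 bytes it occupies on the wire. A sketch of that arithmetic, assuming standard Ethernet framing (8 byte preamble, 14 byte header, 4 byte FCS, 12 byte inter-frame gap) and no VLAN tag; the 1448 byte case assumes TCP with the timestamp option, and the function name is illustrative:

```python
PREAMBLE = 8         # preamble plus start-of-frame delimiter
ETH_HEADER = 14      # destination, source, type
FCS = 4              # frame check sequence
INTERFRAME_GAP = 12  # minimum idle time between frames, in byte times


def max_rate_mbps(link_mbps, payload_bytes, ip_mtu=1500):
    """Best-case rate for payload_bytes carried in each full-size frame."""
    wire_bytes = ip_mtu + PREAMBLE + ETH_HEADER + FCS + INTERFRAME_GAP
    return link_mbps * payload_bytes / wire_bytes


if __name__ == "__main__":
    print(round(max_rate_mbps(100, 1500), 4))  # whole IP datagrams
    print(round(max_rate_mbps(100, 1448), 4))  # TCP payload with timestamps
```

A tool that reports user payload over TCP will therefore top out a few percent lower than the IP-level figure.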
|
|
190
|
- My favorite TCP/UDP test tool
- Get the latest .c file
- ftp://ftp.lcp.nrl.navy.mil/pub/nuttcp/
- Compile it:
- cc -O3 -o nuttcp nuttcp-v5.1.11.c
- Start a server:
- nuttcp -S
- Allows remote users to run tests to/from that host without accounts
|
|
191
|
- Create a file /etc/xinetd.d/nuttcp
- # default: off
- # description: nuttcp
- service nuttcp
- {
- socket_type = stream
- wait = no
- user = nobody
- server = /usr/local/bin/nuttcp
- server_args = -S
- disable = no
- }
|
|
192
|
- San Diego CA to Washington DC
- OC12 path
|
|
193
|
- [phil@damp-ssc phil]$ nuttcp -xt damp-nrl
- traceroute to damp-nrl-ge (138.18.23.37), 30 hops max, 38 byte packets
- 1 ge-0-1-0.sandiego.dren.net (138.18.190.1) 0.255 ms 0.237 ms 0.238 ms
- 2 so12-0-0-0.nrldc.dren.net (138.18.1.7) 72.185 ms 72.162 ms 72.164 ms
- 3 damp-nrl-ge (138.18.23.37) 72.075 ms 72.069 ms 72.070 ms
- traceroute to 138.18.190.5 (138.18.190.5), 30 hops max, 38 byte packets
- 1 ge-0-1-0.nrldc.dren.net (138.18.23.33) 0.239 ms 0.199 ms 0.184 ms
- 2 so12-0-0-0.sandiego.dren.net (138.18.4.21) 72.210 ms 72.979 ms 72.217 ms
- 3 damp-ssc-ge (138.18.190.5) 72.087 ms 72.079 ms 72.068 ms
|
|
194
|
- [phil@damp-ssc phil]$ ping damp-nrl
- PING damp-nrl-ge (138.18.23.37) from 138.18.190.5 : 56(84) bytes of data.
- 64 bytes from damp-nrl-ge (138.18.23.37): icmp_seq=1 ttl=62 time=72.0 ms
- 64 bytes from damp-nrl-ge (138.18.23.37): icmp_seq=2 ttl=62 time=72.0 ms
- 64 bytes from damp-nrl-ge (138.18.23.37): icmp_seq=3 ttl=62 time=72.0 ms
- 64 bytes from damp-nrl-ge (138.18.23.37): icmp_seq=4 ttl=62 time=72.0 ms
- 64 bytes from damp-nrl-ge (138.18.23.37): icmp_seq=5 ttl=62 time=72.0 ms
- --- damp-nrl-ge ping statistics ---
- 5 packets transmitted, 5 received, 0% loss, time 4035ms
- rtt min/avg/max/mdev = 72.071/72.082/72.092/0.007 ms
- [phil@damp-ssc phil]$ tracepath damp-nrl
- 1?: [LOCALHOST] pmtu 9000
- 1: ge-0-1-0.sandiego.dren.net (138.18.190.1) asymm 2 1.060ms
- 2: so12-0-0-0.nrldc.dren.net (138.18.1.7) asymm 3 72.905ms
- 3: damp-nrl-ge (138.18.23.37) 72.747ms reached
- Resume: pmtu 9000 hops 3 back 3
|
|
195
|
- Bandwidth * Delay Product
- BDP = 622000000/8 * 0.072 = 5.6 MB
- The TCP receive window needs to be at least this large
- The sender buffer should be twice this size
- We choose 8MB for both
- Knowing that Linux will double it for us!
|
|
196
|
- [phil@damp-ssc phil]$ nuttcp -v -t -T10 -w8m damp-nrl
- nuttcp-t: v5.1.10: socket
- nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> damp-nrl
- nuttcp-t: time limit = 10.00 seconds
- nuttcp-t: connect to 138.18.23.37
- nuttcp-t: send window size = 16777216, receive window size = 103424
- nuttcp-t: 608.0786 MB in 10.00 real seconds = 62288.32 KB/sec = 510.2659 Mbps
- nuttcp-t: 9730 I/O calls, msec/call = 1.05, calls/sec = 973.33
- nuttcp-t: 0.0user 0.9sys 0:10real 9% 0i+0d 0maxrss 2+0pf 0+0csw
- nuttcp-r: v5.1.10: socket
- nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
- nuttcp-r: accept from 138.18.190.5
- nuttcp-r: send window size = 103424, receive window size = 16777216
- nuttcp-r: 608.0786 MB in 10.20 real seconds = 61039.55 KB/sec = 500.0360 Mbps
- nuttcp-r: 71224 I/O calls, msec/call = 0.15, calls/sec = 6981.97
- nuttcp-r: 0.0user 0.5sys 0:10real 5% 0i+0d 0maxrss 3+0pf 0+0csw
|
|
197
|
- The receiver’s (nuttcp-r) receive buffer size
- The transmitter’s (nuttcp-t) send buffer size
- The receiver’s reported throughput
- The transmitter number is often too high
- The CPU utilization of both sender and receiver
- Make sure you didn’t run out of CPU
|
|
198
|
- [phil@damp-ssc phil]$ nuttcp -t -T10 -i1 -w8m damp-nrl
- 1.5531 MB / 1.00 sec = 13.0935 Mbps
- 44.3144 MB / 1.00 sec = 371.7978 Mbps
- 67.9265 MB / 1.00 sec = 569.9192 Mbps
- 67.5937 MB / 1.00 sec = 567.1195 Mbps
- 67.5339 MB / 1.00 sec = 566.6178 Mbps
- 67.5937 MB / 1.00 sec = 567.1184 Mbps
- 67.6363 MB / 1.00 sec = 567.4661 Mbps
- 67.5937 MB / 1.00 sec = 567.1195 Mbps
- 67.7558 MB / 1.00 sec = 568.4850 Mbps
- 67.8753 MB / 1.00 sec = 569.4834 Mbps
- 601.0000 MB / 10.19 sec = 494.5603 Mbps 8 %TX 4 %RX
|
|
199
|
- [phil@damp-ssc phil]$ nuttcp -r -T10 -i1 -w8m damp-nrl
- 2.2614 MB / 1.00 sec = 18.9842 Mbps
- 48.4872 MB / 1.00 sec = 406.7883 Mbps
- 38.9809 MB / 1.00 sec = 327.0350 Mbps
- 29.8672 MB / 1.00 sec = 250.5731 Mbps
- 51.3801 MB / 1.00 sec = 431.0547 Mbps
- 50.1768 MB / 1.00 sec = 420.9644 Mbps
- 24.6874 MB / 1.00 sec = 207.1180 Mbps
- 15.0616 MB / 1.00 sec = 126.3602 Mbps
- 15.8211 MB / 1.00 sec = 132.7315 Mbps
- 16.5891 MB / 1.00 sec = 139.1766 Mbps
- 303.6979 MB / 10.60 sec = 240.3490 Mbps 4 %TX 2 %RX
|
|
200
|
- [phil@damp-ssc phil]$ nuttcp -t -u -l8000 -R500m -T10 -i1 -w8m damp-nrl
- 59.2346 MB / 0.99 sec = 499.6586 Mbps 0 / 7764 ~drop/pkt 0.00 ~%loss
- 59.6008 MB / 1.00 sec = 500.0570 Mbps 0 / 7812 ~drop/pkt 0.00 ~%loss
- 59.5932 MB / 1.00 sec = 499.9935 Mbps 0 / 7811 ~drop/pkt 0.00 ~%loss
- 59.5932 MB / 1.00 sec = 500.0000 Mbps 0 / 7811 ~drop/pkt 0.00 ~%loss
- 59.5932 MB / 1.00 sec = 499.9950 Mbps 0 / 7811 ~drop/pkt 0.00 ~%loss
- 59.5932 MB / 1.00 sec = 499.9905 Mbps 0 / 7811 ~drop/pkt 0.00 ~%loss
- 59.5932 MB / 1.00 sec = 499.9925 Mbps 0 / 7811 ~drop/pkt 0.00 ~%loss
- 59.5932 MB / 1.00 sec = 499.9950 Mbps 0 / 7811 ~drop/pkt 0.00 ~%loss
- 59.5932 MB / 1.00 sec = 499.9930 Mbps 0 / 7811 ~drop/pkt 0.00 ~%loss
- 59.5932 MB / 1.00 sec = 499.9920 Mbps 0 / 7811 ~drop/pkt 0.00 ~%loss
- 595.9778 MB / 10.01 sec = 499.5061 Mbps 99 %TX 3 %RX 0 / 78116 drop/pkt 0.00 %loss
|
|
201
|
- [phil@damp-ssc phil]$ nuttcp -r -u -l8000 -R500m -T10 -i1 -w8m damp-nrl
- 56.8237 MB / 0.99 sec = 481.1408 Mbps 0 / 7448 ~drop/pkt 0.00 ~%loss
- 59.5169 MB / 1.00 sec = 499.3279 Mbps 1 / 7802 ~drop/pkt 0.01 ~%loss
- 59.2651 MB / 1.00 sec = 497.2097 Mbps 0 / 7768 ~drop/pkt 0.00 ~%loss
- 59.0591 MB / 1.00 sec = 495.4825 Mbps 0 / 7741 ~drop/pkt 0.00 ~%loss
- 58.9828 MB / 1.00 sec = 494.8439 Mbps 0 / 7731 ~drop/pkt 0.00 ~%loss
- 60.7300 MB / 1.00 sec = 509.5001 Mbps 0 / 7960 ~drop/pkt 0.00 ~%loss
- 59.4482 MB / 1.00 sec = 498.7424 Mbps 2 / 7794 ~drop/pkt 0.03 ~%loss
- 59.7839 MB / 1.00 sec = 501.5677 Mbps 2 / 7838 ~drop/pkt 0.03 ~%loss
- 58.8760 MB / 1.00 sec = 493.9463 Mbps 0 / 7717 ~drop/pkt 0.00 ~%loss
- 59.8526 MB / 1.00 sec = 502.1347 Mbps 0 / 7845 ~drop/pkt 0.00 ~%loss
- 595.8939 MB / 10.01 sec = 499.4702 Mbps 100 %TX 3 %RX 5 / 78110 drop/pkt 0.01 %loss
|
|
202
|
- bps = 0.7 * MSS / (rtt * sqrt(loss))
- = 0.7 * 8948*8 / (0.072 * sqrt(5/78110))
- = 87 Mbps
- Conclusion: Too much loss on DC to CA path
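The estimate above can be reproduced with a few lines of Python. This is a minimal sketch of the formula exactly as written on the slide; the function name `mathis_bps` is only illustrative, and the inputs are the MSS, round trip time, and drop counts quoted above.

```python
import math

def mathis_bps(mss_bytes, rtt_s, loss, c=0.7):
    """Loss-limited TCP throughput estimate in bits per second."""
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss))

# Values from the slide: 8948-byte MSS, 72 ms RTT, 5 drops in 78110 packets.
estimate = mathis_bps(8948, 0.072, 5 / 78110)
print(f"{estimate / 1e6:.0f} Mbps")
```

Because throughput scales with 1/sqrt(loss), cutting the loss rate by a factor of 100 would raise the estimate by a factor of 10.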
|
|
203
|
- When UDP testing
- Be careful of fragmentation (path MTU)
- Most systems are better at TCP than UDP
- Setting large socket buffers helps
- Raising process priority helps
- Duplex problems don’t show up with UDP!
- Having debugged test points is critical
- A performance history is valuable
|
|
204
|
- Will never see it with pings or UDP tests
- TCP throughput is sometimes ½ of expected rate, sometimes under 1 Mbps
- Worse performance usually results when the mismatch is near the receiver
of a TCP flow
- Sometimes reported as “late collisions”
|
|
205
|
- Don’t forget
- ifconfig
- UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
- RX packets:414639041 errors:0 dropped:0 overruns:0 frame:0
- TX packets:378926231 errors:0 dropped:0 overruns:0 carrier:0
- collisions:0 txqueuelen:1000
- netstat -s
- Tcp: 59450 segments retransmited
- dmesg or /var/log
- ixgb: eth2 NIC Link is Up 10000 Mbps Full Duplex
|
|
206
|
- Iperf is probably better for
- Parallel stream reporting (-P)
- Bi-directional tests (-d)
- MSS control (-M) and reporting (-m)
- Nuttcp is better at
- Getting server results back to the client
- Third party support
- Traceroute (-xt)
- Setting priority (-xP)
- Undocumented instantaneous rate limit (-Ri)
|
|
207
|
|
|
208
|
- Grew out of the NLANR AMP project
- Long term active performance monitoring
- Delay, loss, routes, throughput
- Dedicated hardware; known performance
- User accessible test points
- Invaluable at identifying and debugging problems
|
|
209
|
|
|
210
|
- Dell 2650, 3.0 GHz Xeon, 533 MHz FSB, 512 KB L2 cache, 1 MB L3 cache,
PCI-X
- Linux 2.4.26-web100
- Two built-in BCM5703 10/100/1000 copper NICs, tg3 driver
- Intel PXLA8590LR 10GigE NIC, ixgb 1.0.65 driver
- 4.6 Gbps TCP on LAN after tuning (7.7 Gbps loopback)
- 2.2 Gbps TCP from MD to MS over OC48
- New NICs: Intel, S2io X-Frame, Chelsio T110/N110
- PCI Express solution early 2005?
|
|
211
|
- 10 msec NTP accuracy w/o clock
- asymmetric paths and clock hopping
- 10 usec NTP accuracy with CDMA clock
- EndRun Technologies Praecis Ct ($1100)
- Serial port, tagged events
- Would love to see NTP nanokernel patch adopted by Linux!
|
|
212
|
- Web page interface
- 10 second TCP throughput test
- Uses nuttcp third party support
- Can also access them directly
|
|
213
|
- Many local problems won’t be seen over a short round trip time (i.e.
within your site!)
- Window size won’t matter
- TCP error recovery is extremely quick
- Firewalls and screening routers (ACLs) could be part of the problem
|
|
214
|
|
|
215
|
- Small windows on hosts
- Duplex problems
- Bad cables (patch panels, fiber, cat5)
- Bad switches or router modules
- Slow firewalls or routers
- WAN bit errors
- Routing issues
- Heat problems!
|
|
216
|
|
|
217
|
- tcpdump
- ethereal - GUI tcpdump (protocol analyzer)
- tcptrace – stats/graphs of tcpdump data
- testrig – tcpdump, tcptrace, xplot, etc.
- www.ncne.nlanr.net/research/tcp/testrig/
|
|
218
|
|
|
219
|
- Three-way handshake
- Use tcpdump, look for performance features
- window sizes, window scale,
timestamps, MSS, SackOK, Don’t-Fragment (DF)
|
|
220
|
|
|
221
|
- tcpdump -p -w trace.out -s 100 host $desthost and port 5001 &
- nuttcp -T10 -n200m -i1 -w10m $desthost
- tcptrace -l -r trace.out
- tcptrace -S trace.out (or -G for all graphs)
- xplot *.xpl
|
|
222
|
- TCP connection 1:
- host a: damp-nrl-ge:58310
- host b: damp-ssc-ge:5001
- complete conn: no (SYNs: 2) (FINs: 0)
- first packet: Wed Oct 13 17:30:30.272579 2004
- last packet: Wed Oct 13 17:30:39.289916 2004
- elapsed time: 0:00:09.017336
- total packets: 36063
- filename: nrl-ssc.trace
- a->b: b->a:
- total packets: 23473 total packets: 12590
- ack pkts sent: 23472 ack pkts sent: 12590
- pure acks sent: 1 pure acks sent: 12589
- sack pkts sent: 0 sack pkts sent: 1774
- dsack pkts sent: 0 dsack pkts sent: 0
- max sack blks/ack: 0 max sack blks/ack: 1
- unique bytes sent: 209714276 unique bytes sent: 0
- actual data pkts: 23471 actual data pkts: 0
- actual data bytes: 210018508 actual data bytes: 0
- rexmt data pkts: 34 rexmt data pkts: 0
- rexmt data bytes: 304232 rexmt data bytes: 0
- zwnd probe pkts: 0 zwnd probe pkts: 0
- zwnd probe bytes: 0 zwnd probe bytes: 0
- outoforder pkts: 0 outoforder pkts: 0
- pushed data pkts: 555 pushed data pkts: 0
|
|
223
|
- SYN/FIN pkts sent: 1/0 SYN/FIN pkts sent: 1/0
- req 1323 ws/ts: Y/Y req 1323 ws/ts: Y/Y
- adv wind scale: 7 adv wind scale: 7
- req sack: Y req sack: Y
- sacks sent: 0 sacks sent: 1774
- urgent data pkts: 0 pkts urgent data pkts: 0 pkts
- urgent data bytes: 0 bytes urgent data bytes: 0 bytes
- mss requested: 8960 bytes mss requested: 8960 bytes
- max segm size: 8948 bytes max segm size: 0 bytes
- min segm size: 8948 bytes min segm size: 0 bytes
- avg segm size: 8947 bytes avg segm size: 0 bytes
- max win adv: 17920 bytes max win adv: 8388480 bytes
- min win adv: 17920 bytes min win adv: 17792 bytes
- zero win adv: 0 times zero win adv: 0 times
- avg win adv: 17920 bytes avg win adv: 8044910 bytes
- initial window: 17896 bytes initial window: 0 bytes
- initial window: 2 pkts initial window: 0 pkts
- ttl stream length: NA ttl stream length: NA
- missed data: NA missed data: NA
- truncated data: 209220494 bytes truncated data: 0 bytes
- truncated packets: 23471 pkts truncated packets: 0 pkts
- data xmit time: 8.945 secs data xmit time: 0.000 secs
- idletime max: 141.4 ms idletime max: 73.8 ms
- throughput: 23256786 Bps throughput: 0 Bps
|
|
224
|
- RTT samples: 10814 RTT samples: 1
- RTT min: 72.1 ms RTT min: 0.0 ms
- RTT max: 144.2 ms RTT max: 0.0 ms
- RTT avg: 102.1 ms RTT avg: 0.0 ms
- RTT stdev: 28.9 ms RTT stdev: 0.0 ms
- RTT from 3WHS: 72.1 ms RTT from 3WHS: 0.0 ms
- RTT full_sz smpls: 10813 RTT full_sz smpls: 1
- RTT full_sz min: 72.7 ms RTT full_sz min: 0.0 ms
- RTT full_sz max: 144.2 ms RTT full_sz max: 0.0 ms
- RTT full_sz avg: 102.1 ms RTT full_sz avg: 0.0 ms
- RTT full_sz stdev: 28.9 ms RTT full_sz stdev: 0.0 ms
- post-loss acks: 4 post-loss acks: 0
- For the following 5 RTT statistics, only ACKs for multiply-transmitted segments (ambiguous ACKs) were considered. Times are taken from the last instance of a segment.
- ambiguous acks: 30 ambiguous acks: 0
- RTT min (last): 72.7 ms RTT min (last): 0.0 ms
- RTT max (last): 74.9 ms RTT max (last): 0.0 ms
- RTT avg (last): 73.3 ms RTT avg (last): 0.0 ms
- RTT sdv (last): 0.8 ms RTT sdv (last): 0.0 ms
- segs cum acked: 12508 segs cum acked: 0
- duplicate acks: 1742 duplicate acks: 0
- triple dupacks: 4 triple dupacks: 0
- max # retrans: 1 max # retrans: 0
- min retr time: 73.7 ms min retr time: 0.0 ms
- max retr time: 140.2 ms max retr time: 0.0 ms
- avg retr time: 89.2 ms avg retr time: 0.0 ms
- sdv retr time: 10.3 ms sdv retr time: 0.0 ms
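A quick cross-check of the tcptrace summary above: the reported throughput should be close to the unique bytes sent divided by the elapsed time, and the retransmitted share of data packets gives a rough loss indicator. The sketch below uses only numbers from the a->b column; the function names are illustrative.

```python
def bytes_per_second(unique_bytes, elapsed_s):
    """Average payload rate over the whole connection."""
    return unique_bytes / elapsed_s

def retransmit_fraction(rexmt_pkts, data_pkts):
    """Share of data packets that were retransmissions."""
    return rexmt_pkts / data_pkts

rate = bytes_per_second(209714276, 9.017336)   # unique bytes sent, elapsed time
print(f"{rate:.0f} Bps = {rate * 8 / 1e6:.1f} Mbps")
print(f"retransmitted: {retransmit_fraction(34, 23471):.3%}")
```

The result agrees with the 23256786 Bps line in the summary to within rounding of the elapsed time.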
|
|
225
|
|
|
226
|
|
|
227
|
|
|
228
|
|
|
229
|
|
|
230
|
|
|
231
|
|
|
232
|
|
|
233
|
|
|
234
|
|
|
235
|
|
|
236
|
|
|
237
|
|
|
238
|
|
|
239
|
|
|
240
|
- Set out to make 100 Mbps TCP common
- “TCP knows what’s wrong with the network”
- The sender side knows the most
- Instruments the TCP stack for diagnostics
- Enhanced TCP MIB (IETF Draft)
- Linux 2.4 kernel patches + library and tools
- /proc/web100 file system
- e.g. /proc/web100/1010/{read,spec,spec-ascii,test,tune}
|
|
241
|
|
|
242
|
|
|
243
|
|
|
244
|
- http://e2epi.internet2.edu/ndt/
- Java applet runs a test and uses Web100 stats to diagnose the end to end
path
- Developed by Richard Carlson, ANL
- Try it out on damp-ssc (San Diego)
|
|
245
|
|
|
246
|
|
|
247
|
|
|
248
|
|
|
249
|
|
|
250
|
- V. Paxson
- Claims most corruption caused by T1’s
- 1 in 5000 packets corrupted
- TCP checksum would then let about 1 in 300M packets pass with undetected corruption
- Roughly two errors per terabyte
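The arithmetic behind these figures can be sketched as follows. Two assumptions are not stated on the slide: corruption is random enough that a 16-bit checksum misses about 1 in 65536 corrupted packets, and packets are about 1500 bytes.

```python
CORRUPTED_PER_PACKET = 1 / 5000      # from the slide
CHECKSUM_MISS = 1 / 65536            # assumption: 16-bit checksum, random errors
PACKET_BYTES = 1500                  # assumption: typical Ethernet-sized packet

def undetected_rate():
    """Probability that a packet is corrupted and still passes the checksum."""
    return CORRUPTED_PER_PACKET * CHECKSUM_MISS

def errors_per_terabyte():
    """Expected number of undetected bad packets in 10**12 bytes."""
    return (1e12 / PACKET_BYTES) * undetected_rate()

print(f"1 in {1 / undetected_rate():,.0f} packets")
print(f"{errors_per_terabyte():.1f} errors per terabyte")
```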
|
|
251
|
- J. Stone, C. Partridge, SIGCOMM 2000
- “Traces of Internet packets over the last two years show that 1 in
30,000 packets fails the TCP checksum, even on links where link-level
CRCs should catch all but 1 in 4 billion errors.”
- DMA transfers account for many of these errors
- Conclude that 1 in 200M TCP packets pass with undetected errors
|
|
252
|
- How do you know that your transferred copy is correct?
- A simple approach would be to copy it two (or more) times and compare the
results
- But you would like to check it with something smaller than the entire
dataset
|
|
253
|
- All map large blocks of data into smaller keys
- Checksums are simple algebraic sums
- Cyclic Redundancy Checks (CRC’s) are good at detecting multiple bit
errors
- Hashes are good at randomizing the key space
- Good for database lookups, sorting, etc.
- Cryptographic hashes are good for data integrity checks
|
|
254
|
- Cannot detect several kinds of errors
- Reordering of bytes
- Inserting or deleting zero valued bytes
- Multiple errors that cancel (same sum)
|
|
255
|
- 16 bits long, summed 16 bits at a time
- For UDP/TCP, includes a 12 byte “pseudo header” from the IP layer with
the network addresses and payload length
- RFC 1071 – Computing the Internet Checksum
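A minimal sketch of the RFC 1071 algorithm, written here in Python for illustration, makes the weaknesses of a simple sum concrete. It covers only the one's complement sum over a byte string; building the UDP/TCP pseudo header is left out.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

original = bytes.fromhex("0001f203f4f5f6f7")
reordered = bytes.fromhex("f4f5f6f70001f203")    # 16-bit words swapped
padded = original + b"\x00\x00"                  # zero-valued bytes inserted
print(hex(internet_checksum(original)))
print(internet_checksum(original) == internet_checksum(reordered))
print(internet_checksum(original) == internet_checksum(padded))
```

Reordered 16-bit words and inserted zero bytes produce the same checksum as the original data, so neither change would be detected.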
|
|
256
|
- IPv4 uses the 16-bit Internet Checksum over the header information only
- This checksum is recalculated/rewritten hop by hop, since e.g. the TTL
changes
- IPv6 has none, not even over the header
- Assumes layer 2 will protect it
|
|
257
|
|
|
258
|
- 16-bit Internet Checksum covers the UDP header and payload
- Optional for IPv4 UDP
- Required for IPv6 UDP
|
|
259
|
- 16-bit Internet Checksum over the entire segment
- Roughly every 65536th corrupted segment goes undetected
|
|
260
|
- Many NICs today support TCP Checksum offloading
- Studies have shown many errors occur during DMA transfers
- Offloading would miss these
- Nothing beats reading the data back from disk, e.g. md5sum
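An end-to-end check in the spirit of md5sum can be sketched with Python's standard `hashlib` module. The function name `file_digest`, the default algorithm, and the chunk size are illustrative choices, not something the slides prescribe. Run it on both the source and the destination copy and compare the results.

```python
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file read back from disk, one chunk at a time."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Reading in chunks keeps memory use constant for multi-gigabyte files. Passing `algorithm="md5"` gives the same hex string that md5sum prints.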
|
|
261
|
- IEEE 802 CRC32
- Strong error detection for 1500 byte packets
- Detects all triple bit errors up to 11455 bytes
- Roughly 1 in 4 billion errors go undetected, but strength is non-uniform
- http://www.cse.ohio-state.edu/%7Ejain/papers/xie1.htm
|
|
262
|
- One-way functions
- Easy to compute, nearly impossible to invert
- Collision free
- Infeasible to find two inputs that result in the same output
- The workhorses of modern cryptography
|
|
263
|
- 1990, MD4, Ron Rivest
- 1992, MD5
- 1993, SHA, NSA (a.k.a. SHA-0)
- 1995, SHA-1
- 2002, SHA-256, SHA-384, SHA-512
- 2004, SHA-224
- 2010, NIST no longer endorses SHA-1
|
|
264
|
- MD5 – Message Digest algorithm #5
- 128 bit hash, RFC1321 (April 1992)
- md5sum --check md5sums
- SHA-1 – Secure Hash Algorithm #1
|
|
265
|
- Recent work has shown some weakness in MD5 and SHA-0 (Crypto2004
conference)
- NIST plans to phase out SHA-1 by 2010
- NIST FIPS 180-2, Secure Hash Standard, Aug 2002, also includes:
- SHA-224, SHA-256, SHA-384, SHA-512
- http://csrc.nist.gov/CryptoToolkit/tkhash.html
|
|
266
|
- Keyed-Hash Message Authentication Codes
- Combines a secret key and a cryptographic hash
- HMAC-SHA1, HMAC-MD5, HMAC-RIPEMD
- RFC2104, Feb 1997, ANS X9.71
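As a rough illustration of how a keyed hash is used, the sketch below computes and verifies an HMAC-SHA1 tag with Python's standard `hmac` module. The helper names `make_tag` and `verify_tag` are hypothetical; key distribution is outside its scope.

```python
import hashlib
import hmac

def make_tag(key: bytes, message: bytes) -> str:
    """HMAC-SHA1 tag over the message, as a hex string."""
    return hmac.new(key, message, hashlib.sha1).hexdigest()

def verify_tag(key: bytes, message: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(make_tag(key, message), tag)

key = b"shared secret"
tag = make_tag(key, b"block 17 of dataset")
print(verify_tag(key, b"block 17 of dataset", tag))   # unmodified message
print(verify_tag(key, b"block 18 of dataset", tag))   # altered message
```

Unlike a plain hash, the tag cannot be recomputed by someone who alters the data without knowing the key.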
|
|
267
|
- AES (Rijndael) data encryption/decryption
- Federal Information Processing Standards Publication, FIPS-197
- 128, 192, 256 bit keys, 128 bit data blocks
- Can be used as a Message Authentication Code (MAC)
- AES-XCBC-MAC-96, RFC 3566, Sep 2003
- scp defaults to aes128-cbc and hmac-md5
|
|
268
|
- “openssl speed” OpenSSL 0.9.7d, 2.6 GHz Xeon, 400 MHz FSB, 512KB L2
cache
- type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
- md2 2335.38k 5298.12k 7781.61k 8827.85k 9161.67k
- mdc2 6480.09k 7374.90k 7717.70k 7677.92k 7790.82k
- md4 19268.23k 63363.31k 173880.13k 303676.32k 391984.43k
- md5 16325.52k 54524.02k 143885.24k 246730.38k 311775.74k
- hmac(md5) 17483.15k 58603.05k 151600.95k 252513.14k 313778.51k
- sha1 15804.80k 51292.97k 126032.04k 199600.09k 240482.89k
- rmd160 11629.85k 31237.76k 62984.82k 84785.10k 93663.74k
- rc4 84407.69k 93179.20k 95925.54k 96411.00k 96438.66k
- des cbc 43456.93k 43846.81k 43784.50k 43868.37k 43858.06k
- des ede3 17158.70k 17168.26k 17128.48k 17191.63k 17186.59k
- idea cbc 24765.82k 25599.68k 25818.65k 25979.85k 25950.68k
- rc2 cbc 24287.21k 24957.35k 25301.88k 25439.74k 25317.71k
- rc5-32/12 cbc 91339.13k 89333.21k 90386.24k 90881.38k 90675.90k
- blowfish cbc 77119.43k 78330.17k 78255.31k 78479.08k 77552.78k
- cast cbc 48512.65k 50893.86k 51452.11k 51747.77k 51385.43k
- aes-128 cbc 70686.37k 65725.38k 65752.90k 65180.02k 64708.38k
- aes-192 cbc 55336.89k 57739.22k 58446.44k 57441.14k 57590.31k
- aes-256 cbc 56045.36k 52542.96k 52584.07k 51628.97k 51985.99k
|
|
269
|
- 744 MB file, 2.6 GHz Xeon
- Disk read, loopback interface, no disk write
- both encrypt and decrypt for scp
- wget -O /dev/null ftp://localhost/tmp/file
- scp localhost:/tmp/file /dev/null
- AES-128 (default), 0:49 sec, 121 Mbps
- 3DES, 1:53, 53 Mbps
- Blowfish, 0:39, 153 Mbps
|
|
270
|
- Aberdeen, MD to San Diego, CA, 744 MB
- scp 14 minutes
- compared to 49 seconds on localhost
- wget v1.9.1, 4 minutes
- from vsftpd 1.2.2 server at Aberdeen
- wget --passive-ftp, 15 seconds
- wget with large window modification
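To compare these transfer times with link speeds it helps to convert them to average rates. The sketch assumes 1 MB = 10^6 bytes, which is consistent with the localhost scp figure of 744 MB in 49 seconds being about 121 Mbps.

```python
def mbps(megabytes, seconds):
    """Average rate in Mbps, assuming 1 MB = 10**6 bytes."""
    return megabytes * 8 / seconds

for label, seconds in [("scp", 14 * 60),
                       ("wget", 4 * 60),
                       ("wget, large window", 15)]:
    print(f"{label}: {mbps(744, seconds):.0f} Mbps")
```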
|
|
271
|
- Application: Hashes
- Transport: TCP/UDP checksum
- Network: IP header checksum
- Link: Ethernet CRC32 or POS FCS-32
- Physical: symbols, ECC, FEC
|
|
272
|
|
|
273
|
- Connections identified by
- (src addr, src port, dst addr, dst port)
- 32 bit sequence numbers (byte counts)
- One for each direction
- Random starting value for each connection
|
|
274
|
- Ephemeral Port Randomization
- SYN Cookies
- Timestamps as Cookies
- Tighter sequence number checking
|
|
275
|
- Today we only need one within the window
- 2^32 / window_size
- E.g. 32 KB window = 128 k guesses
- Even at DSL rates, can reset a connection in about a minute
- Proposal to require an exact match, i.e. the next number
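The guess count and the time to send that many packets can be estimated as below. The packet size (40 bytes for a bare IP plus TCP header) and the 768 kbps uplink are assumptions chosen for illustration, and the sketch assumes the attacker already knows both addresses and both ports.

```python
def guesses_needed(window_bytes):
    """Packets needed so that one is sure to land inside the window."""
    return 2**32 // window_bytes

def seconds_to_cover(window_bytes, link_bps, packet_bytes=40):
    """Time to send every guess at the given link rate."""
    return guesses_needed(window_bytes) * packet_bytes * 8 / link_bps

print(guesses_needed(32 * 1024))                       # 32 KB window
print(f"{seconds_to_cover(32 * 1024, 768_000):.0f} s at 768 kbps")
```

Larger windows, which high-speed transfers need, shrink the search space further.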
|
|
276
|
- Blind reset attack, RST bit
- If we receive a RST within the window, close our connection
- Blind reset attack, SYN bit
- If we receive a SYN within the window of an open connection, send a RST
- Blind data injection
- Guessing a valid sequence number and ACK lets us inject bogus data
segments
- draft-ietf-tcpm-tcpsecure-01.txt
|
|
277
|
- Blind ACK DoS attack
- ACK segments not yet received
- Client can cause server to saturate its connection by sending too fast
- draft-azcorra-tcpm-tcp-blind-ack-dos-01.txt
- Related: ACK every byte, to open sender’s window faster
- Addressed by “Appropriate Byte Counting”
|
|
278
|
- TCP MD5 Signature Option, RFC2385
- Manually configured shared keys
- Often used between BGP peers
- RFC3562 provides guidance on keys
- IPsec
- SSL / TLS
|
|
279
|
- IPsec work began in the early 1990s
- Three revisions
- 1995: RFC 1825-1829 - no key management
- 1998: RFC 2401-2412 - IKE v1 key management
- Current: RFC TBD - IKE v2 key management
- At least 20 related RFC’s
|
|
280
|
- Authentication Header (AH)
- Authentication of data source
- Integrity of message content and ordering
- Replay Detection
- Access Control (i.e. packet filtering)
- Encapsulating Security Payload (ESP)
- Above plus encryption of packet data
- Limited traffic flow confidentiality
|
|
281
|
- Transport Mode
- AH and/or ESP is placed between IP header and next higher level header
- IP-TCP -> IP-AH-TCP
- IP-TCP -> IP-ESP-TCP
- Can only be used host-to-host
- Tunnel Mode
- A new IP header is created
- IP-UDP -> IP’-AH-IP-UDP
- IP-UDP -> IP’-ESP-IP-UDP
- Common way to implement VPNs
|
|
282
|
|
|
283
|
|
|
284
|
|
|
285
|
- Native support for IPsec is required for IPv6
- That’s where “IPv6 is secure” comes from
- But many vendors don’t support it yet
- For Linux, see
- www.openswan.org
- www.strongswan.org
- www.linux-ipv6.org
|
|
286
|
- Cheating Today, Improving TCP Tomorrow
|
|
287
|
- PSockets (Parallel Sockets library)
- http://citeseer.nj.nec.com/386275.html
- GridFTP – enhanced parallel wu-ftpd
- http://www.globus.org/datagrid/gridftp.html
- bbFTP – parallel ‘ftp’ uses ssh or GSI
- http://doc.in2p3.fr/bbftp/
- MulTCP – a TCP that acts like N TCP’s
|
|
288
|
|
|
289
|
|
|
290
|
- TCP, RFC 793, Sep 1981
- Reno, BSD, 1990
- Path MTU Discovery, RFC 1191, Nov 1990
- Window Scale, PAWS, RFC 1323, May 1992
- SACK, RFC 2018, Oct 1996
- NewReno, RFC 2582, April 1999
- D-SACK, RFC 2883, July 2000
- RTO Calculation, RFC 2988, Nov 2000
- Limited Transmit, RFC 3042, Jan 2001
- ECN, RFC 3168, Sep 2001
- Increased Initial Windows, RFC 3390, Oct 2002
- SACK loss recovery, RFC 3517, Apr 2003
|
|
291
|
- Most modern TCP’s are “Reno” based, from the 1990 BSD Reno Unix release
- Reno defined (refined) four key mechanisms
- Slow Start
- Congestion Avoidance
- Fast Retransmit
- Fast Recovery
- NewReno refined fast retransmit/recovery when non-SACK partial
acknowledgements are available
- Proposed Standard, RFC3782, Apr 2004
|
|
292
|
- Different retransmit/recovery schemes
- TCP Tahoe, Vegas, Peach, Westwood, …
- Pacing - removing burstiness by spreading the packets over a round trip
time
- Autotuning windows and buffer space
- Modifications to prevent “cheating”
- Kick-starting TCP after timeouts
|
|
293
|
- Allows ~4KB initial window rather than one or two segments
- min(4*MSS, max(2*MSS, 4380 bytes))
- RFC 3390, Oct 2002, Proposed Standard
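The formula above is easy to evaluate for a few segment sizes. This short sketch uses an illustrative function name and example MSS values that are assumptions, apart from 8948, which is the segment size seen in the earlier jumbo-frame traces.

```python
def initial_window(mss):
    """Initial congestion window in bytes, per the RFC 3390 formula."""
    return min(4 * mss, max(2 * mss, 4380))

for mss in (536, 1460, 8948):
    iw = initial_window(mss)
    print(f"MSS {mss}: {iw} bytes ({iw // mss} segments)")
```

With a 1460-byte MSS this gives three segments, while a jumbo-frame MSS still starts with only two.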
|
|
294
|
- When an ACK is received, increase cwin based on the number of new bytes
ACK’d
- Prevents receiver from “cheating” and making the sender open cwin too
quickly
- e.g. receiver ACKs every byte
- Increases by at most 2*MSS bytes per ACK
- To avoid bursts when one ACK covers a huge number of bytes
- RFC 3465, Feb 2003
|
|
295
|
- In slow-start, the congestion window (cwin) doubles each round trip
time
- For large cwins, this doubling can cause massive packet loss (and
network load)
- Limited slow-start adds max_ssthresh (proposed value of 100 MSS)
- Above max_ssthresh cwin opens slower, never bursts more than 100 MSS
- cwin += (0.5*max_ssthresh/cwin) * MSS
- RFC 3742, Apr 2004
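A small simulation shows the effect of the per-ACK rule above. It is a simplified sketch, not the RFC's exact integer arithmetic: the window is counted in segments, every segment in flight is assumed to produce one ACK, and there is no loss.

```python
def limited_slow_start(rtts, max_ssthresh=100.0, cwnd=1.0):
    """Return cwnd (in segments) at the start of each round trip."""
    history = [cwnd]
    for _ in range(rtts):
        acks = int(cwnd)              # assumption: one ACK per segment in flight
        for _ in range(acks):
            if cwnd <= max_ssthresh:
                cwnd += 1.0           # ordinary slow start: doubles per RTT
            else:
                cwnd += 0.5 * max_ssthresh / cwnd
        history.append(cwnd)
    return history

print([round(w) for w in limited_slow_start(12)])
```

Below max_ssthresh the window doubles each round trip; above it, the window grows by at most max_ssthresh/2 segments per round trip.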
|
|
296
|
|
|
297
|
- IP option in the TCP SYN specifies desired initial sending rate
- Routers on the path decrement a TTL counter and decrease initial
sending rate if necessary
- If all routers participated, receiver tells the sender the initial rate
in the SYN+ACK pkt
- The sender can set cwin based on the rtt of the SYN and SYN+ACK packets
|
|
298
|
- TCP Westwood
- Rate Estimation from ACK stream (cwnd = RE x RTTmin)
- HighSpeed TCP (HSTCP)
- Scalable TCP (STCP)
- Binary Increase Congestion Control (BIC/CUBIC)
- TCP Vegas / FAST TCP
- Queuing delay based congestion control
|
|
299
|
- Changes the AIMD parameters
- Identical to standard TCP for loss rates above 10^-3 for
fairness (cwin <= 38)
- Allows cwin to reach 83000 segments for 10^-7 loss rates
- Good for 10 Gbps over 100 msec rtt
- Std TCP would be limited to ~440 Mbps
- RFC 3649, Dec 03
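The window sizes quoted above follow from the two response functions in RFC 3649: about 1.2/sqrt(p) segments for standard TCP and 0.12/p^0.835 for HighSpeed TCP at low loss rates. The sketch below evaluates both; the 1460-byte segment size used for the rate conversion is an assumption.

```python
import math

def standard_window(p):
    """Average window (segments) of standard TCP at loss rate p."""
    return 1.2 / math.sqrt(p)

def hstcp_window(p):
    """HighSpeed TCP response function; standard TCP above 10^-3 loss."""
    if p >= 1e-3:
        return standard_window(p)
    return 0.12 / p ** 0.835

def rate_bps(window_segments, rtt_s, mss_bytes=1460):
    """Sending rate for a given window and round trip time."""
    return window_segments * mss_bytes * 8 / rtt_s

print(round(standard_window(1e-3)))                       # crossover window
print(round(hstcp_window(1e-7)))                          # HSTCP at 10^-7 loss
print(round(rate_bps(standard_window(1e-7), 0.1) / 1e6))  # standard TCP, Mbps
```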
|
|
300
|
|
|
301
|
|
|
302
|
|
|
303
|
|
|
304
|
- Loss recovery time is independent of sending rate
- http://www-lce.eng.cam.ac.uk/~ctk21/scalable/
|
|
305
|
- Delay based congestion control
- “multi-bit” feedback vs. binary loss signal
- Avoids oscillations of cwnd and unnecessary loss
- http://netlab.caltech.edu/FAST/
|
|
306
|
|
|
307
|
- Slow adaptation after loss
- 0 to 500 Mbps takes
- 36 seconds, mtu=9000, rtt=72msec (DC to San Diego)
- 15 minutes, mtu=1500, rtt=150msec (DC to Maui)
- Short flow throughput is determined by slow start
- Loss != congestion
- Large TCP queues increase latency
- End users never tune their systems
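The ramp-up times quoted above are consistent with congestion avoidance adding one segment to the window per round trip. This sketch assumes the window starts near zero, that every round trip adds exactly one segment, and that the MSS is the MTU minus 52 bytes of IP, TCP, and timestamp headers.

```python
def ramp_time_s(target_bps, mtu, rtt_s, header_bytes=52):
    """Seconds for the window to grow linearly until it sustains target_bps."""
    mss = mtu - header_bytes
    window_segments = target_bps * rtt_s / 8 / mss   # segments needed at target
    return window_segments * rtt_s                   # one segment added per RTT

print(f"{ramp_time_s(500e6, 9000, 0.072):.0f} s")            # DC to San Diego
print(f"{ramp_time_s(500e6, 1500, 0.150) / 60:.0f} min")     # DC to Maui
```

The time grows with the square of the round trip time and inversely with the segment size, which is why the long-delay path with a 1500-byte MTU takes so much longer.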
|
|
308
|
- RFC969, 1985
- Block transfer protocol (not streaming)
- Send a block at predetermined rate
- Wait for lost packet list
- Resend those, etc.
|
|
309
|
- Reliable Blast UDP
- http://www.evl.uic.edu/cavern/quanta
- Similar to NETBLT but in active development
|
|
310
|
- http://www.anml.iu.edu/anmlresearch.html
- Last release, Dec 2002
- UDP data, TCP control, rate adaptive
- Loss rate controls sending rate
- File transfer protocol, no API
- Transferred 1 TB of data at ~1 Gbps over a 12000 km “light path”
(Vancouver to Geneva), Sep 2002
- Was created because TCP over that path was getting only 10’s to 100’s of
Mbps!
|
|
311
|
- Simple Available Bandwidth Utilization Library
- National Center for Data Mining (NCDM) at UIC (University of Illinois at
Chicago)
- http://www.dataspaceweb.net/sabul.htm
- 2000-2003, now in third generation
- UDP data, TCP control, rate adaptive
- Streaming protocol, window + AIMD rate control (not rtt dependent)
- Includes FTP like application, API
|
|
312
|
- UDP-based Data Transport
- http://sourceforge.net/projects/dataspace
- UDP for data and control
- Grew out of SABUL work, 2003+
|
|
313
|
- Session hijacking, corruption, encryption
- Learn something from 802.11 wireless?
|
|
314
|
- “Anything you can do I can do too”
- It’s just algorithms (e.g. congestion control)
- The real UDP advantage:
- User space implementations, i.e. easier experimentation and deployment!
- But user space is not as efficient
|
|
315
|
- Proposed Standard transport protocol
- Message oriented vs. TCP’s byte stream
- Multi-streaming
- Multi-homing
- Cookie based DoS protection
- RFC2960, RFC3309, RFC3286
- http://www.sctp.org/
|
|
316
|
- Allows out of order delivery
- Each message can be marked as ordered or not
- Avoids TCP’s head-of-line blocking
- CRC32c data protection
- Changed from Adler-32 checksum
|
|
317
|
- RDDP, a.k.a. Remote DMA
- Something SCTP is better at than TCP
- Easy for the sender (“zero copy”)
- Hard for the receiver
- Marker PDU Aligned framing for TCP (MPA)
|
|
318
|
- TCP assures 100% in-order delivery
- If your app can handle out-of-order delivery, try SCTP
- If your app can handle some loss, use something that isn’t 100% reliable
- TCP stalls are sometimes worse than a loss
|
|
319
|
|
|
320
|
- Quality of Service
- “The collective effect of service performances which determines the degree of satisfaction of the user” (ITU)
- A user perception of how good the service is
|
|
321
|
- Performance Guarantees
- SLA’s, VPN’s, Voice, Real-time, etc.
- TCP builds large queues
- Increases delay, can cause burst losses
- Circuit emulation or ATM QoS over MPLS
- Preservation of expected behavior
- Critical services
|
|
322
|
- Improve the entire network (egalitarian)
- larger pipes, congestion control (RED, ECN), traffic engineering (MPLS)
- Discriminate flows into service classes
- Can be signaled/reserved, or marked
- All members of a class get the same type of service
|
|
323
|
- Signaled
- ATM
- IP Integrated Services (IntServ)
- Best-effort (BE), Controlled-load (CLS), Guaranteed (GS)
- Marked
- IP Differentiated Services (DiffServ)
- Expedited, Assured 1, 2, 3, 4
|
|
324
|
- IETF working group since 1993
- Resource Reservation
- RSVP is the most common choice
- Packet Classification (multi field)
- Admission Control (capacity and/or policy)
- Packet Scheduling
- QoS based routing
|
|
325
|
- “Soft State”
- Typical: PATH messages every 30 seconds, soft-state timeout in 90
seconds
- RSVP only blocks when congested
- ATM often blocks allocations even if unused
- RSVP is in Windows 2000, but removed from XP in favor of DiffServ
marking
- NSIS may be the next generation of RSVP
|
|
326
|
- Every RSVP router classifies every packet by inspecting up to seven
fields
- dest addr, dest port, proto, src addr, src port
- IPv4: ToS (8 bits)
- IPv6: Traffic Class (8 bits), Flow label (20 bits)
- DiffServ reduces this burden by marking packets
- MPLS can help!
|
|
327
|
- IETF working group since 1998
- Keeps the core stateless
- Pushes any/all marking, policing, shaping, signaling to the edges
(hosts or edge routers)
- In the core, DiffServ Code Points map to Per Hop Behaviors (PHB)
- Cisco Committed Access Rate (CAR) was an early implementation of
DiffServ
|
|
328
|
- Expedited Forwarding (EF)
- Emulates virtual leased line
- Absolute assurance, admission controlled
- One PHB: Priority or Class Based Queue
- Assured Forwarding (AF)
- Emulates unloaded network
- Statistical assurance
- 12 PHB’s (4 classes/queues, each with three drop priorities)
|
|
329
|
|
|
330
|
- D = Minimize Delay
- T = Maximize Throughput
- R = Maximize Reliability
- M = Minimize Monetary Cost (RFC1349)
- Maximum Security = 1111 (RFC1349)
- RFC791, RFC795, RFC1349
|
|
331
|
- 3 bit precedence in the TOS byte:
- 0 - Routine
- 1 - Priority
- 2 - Immediate
- 3 - Flash
- 4 - Flash Override
- 5 - CRITIC/ECP
- 6 - Internetwork Control
- 7 - Network Control
|
|
332
|
- CS0 000000 [RFC2474]
- CS1 001000 [RFC2474]
- CS2 010000 [RFC2474]
- CS3 011000 [RFC2474]
- CS4 100000 [RFC2474]
- CS5 101000 [RFC2474]
- CS6 110000 [RFC2474]
- CS7 111000 [RFC2474]
- AF11 001010 [RFC2597]
- AF12 001100 [RFC2597]
- AF13 001110 [RFC2597]
- AF21 010010 [RFC2597]
- AF22 010100 [RFC2597]
- AF23 010110 [RFC2597]
- AF31 011010 [RFC2597]
- AF32 011100 [RFC2597]
- AF33 011110 [RFC2597]
- AF41 100010 [RFC2597]
- AF42 100100 [RFC2597]
- AF43 100110 [RFC2597]
- EF 101110 [RFC3246]
- Six bit field, goes in the IPv4 TOS octet or the IPv6 Traffic Class
- Last two bits are being used by ECN
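The code points in the table follow a simple bit layout, which the sketch below reproduces: class selector n is n shifted left by three bits, AFxy is class x in the top three bits with drop precedence y in the next two, and the six-bit DSCP sits above the two ECN bits in the TOS or Traffic Class octet. The function names are illustrative.

```python
EXPEDITED_FORWARDING = 0b101110

def class_selector(n):
    """CSn code point: the old 3-bit precedence in the top bits."""
    return n << 3

def assured_forwarding(cls, drop):
    """AFxy code point for class x (1-4) and drop precedence y (1-3)."""
    return (cls << 3) | (drop << 1)

def tos_octet(dscp, ecn=0):
    """Full 8-bit TOS / Traffic Class value: DSCP above the two ECN bits."""
    return (dscp << 2) | ecn

print(f"AF11 = {assured_forwarding(1, 1):06b}")
print(f"AF43 = {assured_forwarding(4, 3):06b}")
print(f"EF as TOS octet = {tos_octet(EXPEDITED_FORWARDING):#04x}")
```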
|
|
333
|
- ATM uses a leaky bucket to determine and enforce conformance
- IntServ and DiffServ use a token bucket
- Preserves packet spacing and bursts
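A token bucket can be sketched in a few lines. This is a generic illustration rather than any particular router's implementation: tokens accumulate at the committed rate up to a burst size, and a packet conforms if enough tokens are available when it arrives. The class and parameter names are assumptions.

```python
class TokenBucket:
    """Token bucket meter: rate in bits per second, burst in bytes."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0      # refill rate in bytes per second
        self.capacity = burst_bytes
        self.tokens = burst_bytes       # start with a full bucket
        self.last = 0.0

    def conforms(self, now, size):
        """Return True and consume tokens if a packet of `size` bytes fits."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False

bucket = TokenBucket(rate_bps=8000, burst_bytes=3000)
print([bucket.conforms(0.0, 1500) for _ in range(3)])   # burst of three packets
print(bucket.conforms(1.0, 1000))                       # after one second of refill
```

A back-to-back burst passes unchanged as long as it fits within the bucket, which is how the original packet spacing is preserved.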
|
|
334
|
- Does my switch have QoS controls?
- Some QoS in 802.1p(D)/Q
- 3 bit User Priority in VLAN tag
- Simple priority queuing (possible starvation)
- IETF defined Subnet Bandwidth Manager (SBM)
- Routers have better queue management, classification, and tagging
capability
|
|
335
|
- Encryption can make multi field inspection impossible, and can hide
DiffServ marks
- Need a way to identify service classes to the WAN provider
- 802.1Q User Priority?
- MPLS label or Exp field (in shim header)?
- Separate port?
|
|
336
|
- Implement DiffServ where needed
- Utilize MPLS+TE where available
- Reduce TCP queues/bursts (pacing)
- Better Active Queue Management
- Have routers penalize “bad” flows
|
|
337
|
- QoS is a user perception of performance
- CoS discriminates flows into service types
- CoS can be signaled (IntServ/RSVP)
- CoS can be marked (DiffServ)
- Current research could improve QoS even without CoS
|
|
338
|
|
|
339
|
- Dedicated network for accessing Network Attached Storage (NAS)
- Usually Fibre Channel
- Fibre Channel Protocol (FCP) = SCSI Commands over FC
- IP Storage is coming
|
|
340
|
- Began in 1988 to improve HIPPI connections
- First ANSI standard in 1994
- 1 Gbps (100 MBps) standard in 1996
- Runs on both fiber and copper
- Ring or mesh (switch) topology
- FC disks have dual / redundant connections
|
|
341
|
|
|
342
|
- Fibre Channel over TCP/IP
- Encapsulates FC frames in TCP/IP
- Provides Inter-Switch Links (ISLs)
- Allows multiple FC SAN’s to be connected via the internet
- RFC 3821, July 2004
|
|
343
|
- Internet Fibre Channel Protocol
- FC-4 layer interface with a TCP/IP sublayer
- Replaces the FC SAN with an IP network
- Gateways interface between FC and IP devices
- Emulates fabric services
- Work in progress
- http://www.ietf.org/html.charters/ips-charter.html
|
|
344
|
- Internet Small Computer Systems Interface
- Serialized SCSI over TCP
- Initiator to target (tcp port 3260)
- Removes 25 meter distance limit
- Includes initiator/target authentication via iSCSI Login PDUs
- Challenge Response Authentication Protocol (CHAP)
- Secure Remote Password (SRP)
- Uses CRC32c (but optional!)
- RFC 3720, Apr 2004
|
|
345
|
- Internet Storage Name Server
- Discovery and management of FCP and iSCSI storage devices
- Required by iFCP, optional in iSCSI
- Work in progress
- http://www.ietf.org/html.charters/ips-charter.html
|
|
346
|
- Zero-copy end-to-end data movement
- Available in InfiniBand
- RDMA Consortium defined it for TCP/IP via iWarp
|
|
347
|
- Collective name given to three protocols
- MPA – Marker PDU Alignment for TCP
- DDP – Direct Data Placement
- RDMAP – RDMA Protocol
|
|
348
|
|
|
349
|
- FC SANs provide little to no security
- FCIP, iFCP, iSCSI all depend on IPsec for per-packet
- Authentication
- Confidentiality
- Integrity
- Replay protection
- RFC 3723, Securing Block Storage Protocols over IP, Apr 2004
|
|
350
|
- Came out in the 10/100 Mbps Ethernet era
- IP networks simply weren’t fast enough, nor did they have QoS controls
- Held on through the introduction of 1Gbps Ethernet
- Claimed to still be faster and cheaper
- Today: Can no longer compete with 1 GigE and soon 10 GigE pricing
- IP attached iSCSI storage is coming
|
|
351
|
|
|
352
|
- FastTrack
- Closed source network
- Morpheus, KaZaa, Grokster, Limewire
- Gnutella (G1)
- Open source FastTrack like implementation
- eDonkey2000
- Direct Connect
- started Nov 1999, now over 12 PB of data!
- i2Hub
|
|
353
|
- Open source FastTrack like system
- No central database (unlike Napster)
- Flooding, order N lookups
|
|
354
|
- Did for movies what Napster did for music
- One of the first to allow partial file uploads
- Many clients, including eMule
|
|
355
|
- 12 PB of data on 250k hosts (6 Oct 04)
- http://www.neo-modus.com/
|
|
356
|
- www.i2hub.com
- Direct Connect P2P network on Internet2
- “faster because it uses Internet2”
- Reality check:
- “at UCLA, i2hub downloads run 30 kBps to 200 kBps, vs. 600 kBps for a
good server”
|
|
357
|
- http://bitconjurer.org/BitTorrent/
- http://btfaq.com/
|
|
358
|
|
|
359
|
- Peer – a client with a file
- “Seed” if complete file
- “Leech” if partial file
- Swarm – the collection of all peers with parts or all of a given file
- “Swarming” – uploading partial files
|
|
360
|
- .torrent file
- Tracker location(s)
- 256 KB file pieces, SHA1 hash of each piece
- Downloads
- 16 KB sub-piece transfers
- 5 pipelined requests (~80 KB window)
- 4 active peers, reselected every 10 seconds
- New opportunistic unchoke every 30 seconds
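The per-piece hashing described above can be sketched with Python's `hashlib`. This is only an illustration of the idea, not BitTorrent's file format or wire protocol: the data is split into 256 KB pieces, each piece gets a SHA-1 digest, and a downloader can check any piece on its own.

```python
import hashlib

PIECE_SIZE = 256 * 1024   # piece size from the slide

def piece_hashes(data, piece_size=PIECE_SIZE):
    """SHA-1 digest of each fixed-size piece (the last piece may be short)."""
    return [hashlib.sha1(data[i:i + piece_size]).digest()
            for i in range(0, len(data), piece_size)]

def piece_ok(piece, expected_digest):
    """Check one downloaded piece against its published digest."""
    return hashlib.sha1(piece).digest() == expected_digest

data = bytes(600 * 1024)                      # stand-in for a 600 KB file
hashes = piece_hashes(data)
print(len(hashes))                            # number of pieces
print(piece_ok(data[:PIECE_SIZE], hashes[0]))
```

Verifying each piece independently lets a peer accept data from untrusted sources and re-request only the pieces that fail.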
|
|
361
|
- One seed, One leech, Local LAN
- Two seeds, One leech, Local LAN
- 14 Mbps + 14 Mbps = 30 Mbps
- One seed, One leech, CA to MD
|
|
362
|
- Combines tracker and seed functions in an Apache module
- Worst case, downloads like a web server
- BitTorrent function can only help
- Not quite ready for prime time
|
|
363
|
- Similar to BitTorrent
- Distribution tree, clients subscribe
- Does not require balanced upload/download
- In theory, more efficient than BitTorrent for large numbers of clients
- http://konspire.sourceforge.net/
|
|
364
|
- Hot topic since 2001
- CAN, Chord, Pastry, Tapestry
- Could be used to replace DNS, web caches, etc.
- O(log N) vs. O(N) lookups
- Provable properties
- Many research projects in DHT’s
- FreeNet, OceanStore
- Automated P2P backup projects
- Pastiche and PAST built on Pastry
|
|
365
|
- Gnutella2 (G2)
- DHT network from Shareaza
- Overnet
- DHT network from eDonkey2000
- NEONet
- DHT network from Morpheus
|
|
366
|
- Internet caches
- DataSpace Transfer Protocol (DSTP)
- I2 Logistical Networking
- Decouple LAN/WAN tuning issues
- Should storage be a network resource?
|
|
367
|
- IRCache project 1995-2000
- Internet Cache Protocol (ICP)
- RFC2186, RFC2187, Sep 1997
- Used in Squid and several web cache products
- http://icp.ircache.net/
- Hyper Text Caching Protocol (HTCP)
|
|
368
|
- http://loci.cs.utk.edu/
- Internet Backplane Protocol (IBP)
- Logistical Backbone (L-Bone)
- exNodes
- Collection of IBP allocations
- Logistical Runtime System (LoRS)
- Upload and Download services
|
|
369
|
- Implements append-only byte arrays
- Often anonymous creation, limited time storage
- Need capabilities to access data
- Crypto secure URLs: read/write/manage
- Supports Data Mover plugins
- 4 GB allocation limit?
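The storage model above (append-only byte arrays, time-limited, reached only through separate read/write/manage capabilities) can be mocked up in memory. The class below is a toy stand-in, not the IBP client API: `Allocation`, its method names, and the token format are all hypothetical.

```python
import secrets
import time

class Allocation:
    """Toy stand-in for one depot allocation: append-only bytes with a lifetime."""

    def __init__(self, max_size: int, lifetime_s: float):
        self.data = bytearray()
        self.max_size = max_size
        self.expires = time.time() + lifetime_s
        # One unguessable token per right; holding the token is the permission.
        self.caps = {right: secrets.token_urlsafe(16)
                     for right in ("read", "write", "manage")}

    def _check(self, right: str, token: str) -> None:
        if not secrets.compare_digest(self.caps[right], token):
            raise PermissionError(f"token does not grant {right}")
        if time.time() > self.expires:
            raise TimeoutError("allocation expired")

    def append(self, token: str, chunk: bytes) -> int:
        """Append bytes; existing content can never be overwritten."""
        self._check("write", token)
        if len(self.data) + len(chunk) > self.max_size:
            raise ValueError("allocation full")
        self.data += chunk
        return len(self.data)

    def load(self, token: str, offset: int, length: int) -> bytes:
        self._check("read", token)
        return bytes(self.data[offset:offset + length])

    def extend(self, token: str, extra_s: float) -> None:
        """Lengthen the lifetime; requires the manage capability."""
        self._check("manage", token)
        self.expires += extra_s

alloc = Allocation(max_size=1024, lifetime_s=60)
alloc.append(alloc.caps["write"], b"hello ")
alloc.append(alloc.caps["write"], b"depot")
print(alloc.load(alloc.caps["read"], 0, 11))
```

Splitting the rights into three tokens means a writer can hand out the read capability alone, so readers can fetch the data but cannot append to it or change its lifetime.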
|
|
370
|
- Storage Management
- Data Transfer
- IBP_store, IBP_load, IBP_copy, IBP_mcopy
- Depot Management
|
|
371
|
- Holds information and capabilities about storage on one or more IBP
depots
- think inode for L-Bone storage
- XML representation
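The inode analogy can be made concrete with a small data structure that maps byte ranges of a logical file onto depot allocations. This is a rough sketch only: the field names, depot host names, and capability strings are hypothetical and do not reflect the actual exNode XML schema.

```python
from dataclasses import dataclass, field

@dataclass
class Mapping:
    offset: int      # byte offset within the logical file
    length: int      # number of bytes held by this allocation
    depot: str       # hypothetical depot host name
    read_cap: str    # hypothetical read capability for the allocation

@dataclass
class ExNode:
    name: str
    size: int
    mappings: list[Mapping] = field(default_factory=list)

    def replicas_for(self, pos: int) -> list[Mapping]:
        """All allocations that hold the byte at position pos."""
        return [m for m in self.mappings if m.offset <= pos < m.offset + m.length]

    def is_complete(self) -> bool:
        """True if the mappings together cover every byte of the file."""
        covered = 0
        for m in sorted(self.mappings, key=lambda m: m.offset):
            if m.offset > covered:
                return False      # gap before this mapping
            covered = max(covered, m.offset + m.length)
        return covered >= self.size

ex = ExNode("data.bin", 300, [
    Mapping(0, 200, "depot-a.example", "cap-a"),
    Mapping(100, 200, "depot-b.example", "cap-b"),
])
print([m.depot for m in ex.replicas_for(150)], ex.is_complete())
```

Overlapping mappings act as replicas, so a reader can pick whichever depot is closest or fastest for each range.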
|
|
372
|
|
|
373
|
- LoRS tools for using the L-Bone
- Finds allocations, creates exNodes, etc.
- Command line, visual, and Java versions
|
|
374
|
|
|
375
|
- Think world-wide shared Linux cluster
- Over 100 universities doing research in overlay networks, routing, P2P,
security, performance, etc.
- Over half of the L-Bone servers are PlanetLab hosts
- Internet2 has PlanetLab hosts on the backbone
|
|
376
|
- High performance comes from all levels
- They complement each other
- Network capacity vs. speed
- How TCP throughput depends on delay, loss, packet size
- Importance of window and buffer sizes and how to tune them
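Two of these points reduce to short formulas: the window needed to fill a path is the bandwidth-delay product, and the Mathis et al. model bounds loss-limited TCP throughput at roughly (MSS / RTT) * C / sqrt(p). The sketch below evaluates both; the path parameters are illustrative assumptions, and C = sqrt(3/2) is the constant for the simple periodic-loss model.

```python
from math import sqrt

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return bandwidth_bps * rtt_s / 8

def mathis_bps(mss_bytes: int, rtt_s: float, loss_rate: float,
               c: float = sqrt(1.5)) -> float:
    """Approximate upper bound on TCP throughput under random loss (Mathis model)."""
    return c * mss_bytes * 8 / (rtt_s * sqrt(loss_rate))

RTT = 0.070    # assumed 70 ms cross-country round trip
LOSS = 1e-4    # assumed 0.01% packet loss

print(f"Window to fill 1 Gbps at 70 ms: {bdp_bytes(1e9, RTT) / 1e6:.2f} MB")
for mss in (1460, 8960):   # standard and jumbo-frame segment sizes
    print(f"MSS {mss}: about {mathis_bps(mss, RTT, LOSS) / 1e6:.0f} Mbps at this loss rate")
```

The bound scales linearly with packet size and inversely with delay, but only with the inverse square root of loss, which is why larger MTUs and shorter paths pay off so directly.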
|
|
377
|
- How to test and debug network performance
- Protect your data, the network isn’t perfect
- TCP continues to improve, notably in congestion control and security
modifications
- Alternatives exist: SCTP and UDP transport experiments
- Quality of Service: class based and egalitarian
- Storage Area Networks and IP storage
|
|
378
|
- Peer to Peer is where the file transfer action is
- But most P2P assumes poor network performance
- High performance BitTorrent hasn’t been written yet
- There is a lot of work today in DHT and abstract storage layers
- Is storage a future network function?
|
|
379
|
- System tuning details
- http://www.psc.edu/networking/projects/tcptune/
- Tom Dunigan’s Network Performance Links
- http://www.csm.ornl.gov/~dunigan/netperf/netlinks.html
- SLAC’s Network Monitoring pages
- http://www-iepm.slac.stanford.edu/
- CAIDA Internet Measurement Tool Taxonomy
- http://www.caida.org/tools/
|
|
380
|
- Nuttcp and Iperf for TCP and UDP testing
- ftp://ftp.lcp.nrl.navy.mil/pub/nuttcp/
- http://dast.nlanr.net/Projects/Iperf/
- Testrig for TCP traces
- http://ncne.nlanr.net/research/tcp/testrig/
- Web100 / Net100
- http://www.web100.org/
- http://www.csm.ornl.gov/~dunigan/net100/
|
|
381
|
- Request For Comments (RFC)
- http://www.rfc-editor.org/
- ftp://ftp.rfc-editor.org/in-notes/rfc968.txt
- Internet Engineering Task Force (IETF)
- http://www.ietf.org/
- http://www.ietf.org/html.charters/wg-dir.html
|
|
382
|
- TCP/IP Illustrated, 3 volumes
- Unix Network Programming, 2 volumes
- http://www.kohala.com/start/
|
|
383
|
- Stream Control Transmission Protocol (SCTP)
- Randall R. Stewart, Qiaobing Xie
- November, 2001
- ISBN: 0-201-72186-4
|
|
384
|
- High Performance TCP/IP Networking
- M. Hassan, R. Jain
- October, 2003
- ISBN: 0130646342
|
|
385
|
- Phillip Dykstra
- WareOnEarth Communications Inc.
- 2109 Mergho Impasse
- San Diego, CA 92110
- phil@sd.wareonearth.com
- 619-574-7796
|