Tuesday, August 28, 2012

Severe latency bottleneck detected on ISL / Trunk port

How do you troubleshoot the error "Severe latency bottleneck detected" on an ISL/trunk port?
What can cause this problem, and how can the root cause be found?


I faced this problem at one of my customers.
We received this alert on a trunk built from two 8 Gbit ISL ports between two Brocade 5100 switches.

Here is the alert message:
Time:       Mon Aug 06 2012 20:32:05 CEST
Level:      Warning
Message:    Severe latency bottleneck detected at slot 0 port 35.
Service:    Switch
Number:     1241
Count:      1
Message ID: AN-1010
Switch:     XSAN01


The port in the alert message is an ISL port belonging to one of the trunk groups.
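
To double-check what is actually behind the port from the alert, switchshow and portshow can be used first (on a fixed-port switch like the 5100, "slot 0 port 35" simply means port 35; the commands below are a quick sketch, not output from my fabric):

XSAN01:admin> switchshow
XSAN01:admin> portshow 35

switchshow confirms the port type (E_Port) and the neighbour it connects to, while portshow gives the port state and basic counters for that single port.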

Trunk performance:
XSAN01:admin> trunkshow -perf
  1:  1->  7 10:00:00:05:1e:36:38:62 100 deskew 15 MASTER
      0->  6 10:00:00:05:1e:36:38:62 100 deskew 24
    Tx: Bandwidth 8.00Gbps, Throughput 37.44Kbps (0.00%)
    Rx: Bandwidth 8.00Gbps, Throughput 51.94Kbps (0.00%)
    Tx+Rx: Bandwidth 16.00Gbps, Throughput 89.38Kbps (0.00%)

  2:  5-> 71 10:00:00:05:1e:36:38:62 100 deskew 16 MASTER
      4-> 70 10:00:00:05:1e:36:38:62 100 deskew 15
    Tx: Bandwidth 8.00Gbps, Throughput 33.12Kbps (0.00%)
    Rx: Bandwidth 8.00Gbps, Throughput 58.08Kbps (0.00%)
    Tx+Rx: Bandwidth 16.00Gbps, Throughput 91.20Kbps (0.00%)

  3: 35-> 35 10:00:00:05:33:ce:61:f5 203 deskew 15 MASTER => trunk with alerts
     39-> 39 10:00:00:05:33:ce:61:f5 203 deskew 16
    Tx: Bandwidth 16.00Gbps, Throughput 442.46Kbps (0.00%)
    Rx: Bandwidth 16.00Gbps, Throughput 433.73Kbps (0.00%)
    Tx+Rx: Bandwidth 32.00Gbps, Throughput 876.19Kbps (0.00%)

Port errors on the ISL ports of the affected trunk:
porterrshow
          frames      enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy
       tx     rx      in    err    g_eof  shrt   long   eof     out   c3    fail    sync   sig
     =========================================================================================================
35:  374.0m 115.7m   0      0      0      0      0      0      0     70      0      1      2      0      0
39:    3.1g   3.8g   0      2      0      0      0      0      0    204      0      1      2      0      0

 
According to the Brocade docs, this looks like a buffer credit problem:

Data Center Fabric Resiliency Best Practices:
Bottleneck Detection can detect ports that are blocked due to lost credits and generate special “stuck VC” and “lost
credit” alerts for the E_Port with the lost credits (available in FOS 6.3.1b and later).
Example of a “stuck VC” alert on an E_Port:
2010/03/16-03:40:48, [AN-1010], 21761, FID 128, WARNING, sw0, Severe latency bottleneck detected at slot 0 port 38.

Data Center Bottleneck Detection Best Practices Guide:
"timestamp", [AN-1010], "sequence-number",, WARNING, "system-name", Severe latency bottleneck detected at Slot "slot number" port "port number within slot number".
This message identifies the date and time of a credit loss on a link, the platform and port affected, and the number of seconds that triggered the threshold.
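
If Bottleneck Detection is the feature raising the AN-1010 alerts, its configuration and the history of the suspect port can be checked directly on the switch. The syntax below is from my FOS 6.x notes, so verify the exact options against the Command Reference for your release:

XSAN01:admin> bottleneckmon --status
XSAN01:admin> bottleneckmon --show 35

bottleneckmon --status shows whether the feature and its alerting are enabled and with which thresholds; bottleneckmon --show prints the recent bottleneck statistics for the given port.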

But what can cause buffer credit loss?
A common cause is a slow-drain device somewhere in the fabric.
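
A slow-drain device typically shows up on an F_Port that keeps running out of transmit credits. One way to look for it is the credit-zero counter in portstatsshow on the F_Ports of both switches (the port below is just a placeholder; the tim_txcrd_z counter name is what FOS 6.x prints and may differ in other releases):

XSAN01:admin> portstatsshow <F_Port number>

A tim_txcrd_z value that keeps increasing on an F_Port points to a device behind that port which is not returning credits fast enough.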

The root cause of this problem in my case was a faulty port on the second switch (domain ID 203).

XSAN01:admin> porterrshow
          frames      enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy
       tx     rx      in    err    g_eof  shrt   long   eof     out   c3    fail    sync   sig
     =========================================================================================================
28:    0      0      0      0      0      0      0      0    13.9k    0      0      0      0      0      0


The SFP was identified as the failing item in the fabric. After it was replaced, the problem was gone.
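
One way to check the SFP health and to verify the fix afterwards (a sketch of the usual steps, not necessarily the exact ones used here) is shown below; the prompt is a placeholder for the second switch (domain ID 203) and the commands are from my FOS 6.x notes:

switch:admin> sfpshow 28
switch:admin> statsclear
switch:admin> porterrshow

sfpshow prints the SFP diagnostics (temperature, voltage, RX/TX power), statsclear resets the hardware counters (portstatsclear 28 would do it for a single port), and a follow-up porterrshow should show that enc out is no longer incrementing after the replacement.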

Source:
Severe latency bottleneck detected on ISL / Trunk port
HP Storageworks B-series SAN Switches - How to Interpret the Brocade porterrshow Output
HP StorageWorks B-Series Switches - Identifying if SFP or the Cable is the Cause for Loss of Link
