Monday, 17 October 2016

Troubleshooting network congestion problems; hints and tips.

Anyone in an IT networks operational role will, at some stage of their career, have been involved in looking into an issue related to network latency. Most IT engineers also know that issues related to high response times, overutilisation and network bottlenecks can be notoriously hard to crack. This post is by no means a silver bullet; I am merely trying to hand out some tools, tips and methodologies with which you can equip yourself. Other engineers have other tools that might work just as well; it's up to you to combine these and come up with what works best for you. I am just describing how I do things. If you have any comments, please leave them at the end of this post and I will always consider your input. So let's get cracking.


The first thing you need to do, when someone reports a network performance related issue, is to ask questions. End users will rarely give you technically relevant information; on the contrary, they will provide you with a symptom, like "my internet is slow" or "pulling down files from the payroll server share takes ages". A good engineer does not get annoyed by these types of problem descriptions. It is up to the engineer to distill more relevant information out of them and use the power of elimination, so ask questions like:
  • When did this start happening (time)?
  • Does anyone else in your office suffer from this issue, and if so, who?
  • Is it just internet, or are there other applications you are trying to access that show bad response? If so, which ones?
Anything that could possibly exclude causes and decrease possibilities should be asked. There are simply no rules when it comes to eliminating possible causes. This is probably the single most important part of your diagnostics, because not asking the right questions could lead to spending large amounts of time analyzing useless information that will never get you any closer to identifying a cause.

I am assuming that you have some sort of network monitoring tool available, for instance Solarwinds NPM or some sort of freeware thingy you pulled down. It doesn't really matter; what you really need to be looking at is some of the following symptoms on whatever tool you are using. Again, the aim is to eliminate links, devices and ports.

You could look at the following items:
  • response times vs. time
  • traffic patterns
  • switch throughput
  • port throughput

Port throughput in particular can be very useful information for locating bottlenecks. I typically set up my network monitoring platforms to report throughput on trunk links and links to WAN routers and ESXi hosts, as this is typically where various traffic types aggregate. (There is very little point in continuously monitoring access ports.) Once you have identified the source of the bottleneck, for instance if you have a 10Mbps WAN link and you see a monitored trunk port pumping out 10Mbps of traffic at a certain time, then you would want to drill into that trunk port (let's call it trunk A) and quantify its traffic.
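If your monitoring tool only hands you raw interface octet counters (for instance two SNMP samples), the utilisation figure is simple arithmetic. Below is a minimal sketch of that calculation; the function name, the sample values and the assumption of a 32-bit counter that wraps at most once between samples are mine, purely for illustration.

```python
def link_utilisation(octets_start, octets_end, interval_s, link_bps, counter_bits=32):
    """Return link utilisation as a percentage between two octet counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = octets_end - octets_start
    if delta < 0:
        # Assumption: the counter wrapped around exactly once between samples.
        delta += 2 ** counter_bits
    bits_per_second = delta * 8 / interval_s
    return 100.0 * bits_per_second / link_bps


# Hypothetical samples taken 60 seconds apart on a 10 Mbps WAN link.
print(round(link_utilisation(1_000_000, 76_000_000, 60, 10_000_000), 1))
```

A result hovering near 100 for minutes on end is the kind of port you would want to drill into.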

Knowing your OSI model can be very beneficial when diagnosing these types of issues, so let's continue with our example. After you have successfully identified which Layer 1 link is the bottleneck (trunk A), you would then like to know who or what is generating all that traffic. Could be a user, not aware of Google, torrenting all editions of Encyclopedia Britannica in PDF, or some big ass database trying to replicate with an off-site peer; could be anything really.

The next step would be to use a tool that can do more in-depth analysis of Layer 3 (IP) and Layer 4 (TCP/UDP) traffic and that can identify top talkers. Let me mention a few methods and tools:

  • ip accounting: can be configured on Cisco devices and applied to interfaces. It is very rudimentary and has the drawback that it can only be configured on a Layer 3 interface, so it might not be useful on a pure L2 trunk port, for instance.
  • NetFlow: the export is configured at a global level (on most Cisco platforms you still enable flow collection per interface), and it works in conjunction with a management tool by loading its data into an external database. NetFlow will instantly provide you with a list of top talkers and TCP/IP streams, but it really needs an external database for intelligent data collection.
  • Wireshark: any network engineer should have it in their toolbox. If you don't have it, go download it! There is no cost and it's the best in the business.

ip accounting is perhaps the easiest one to quickly spin up, but it has its limitations. NetFlow is pretty good too, but not as lightweight as Wireshark.
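Whichever tool you pick, "top talkers" boils down to summing bytes per source address and sorting. Here is a small sketch of that idea, assuming you have already exported flow records as (source IP, destination IP, bytes) tuples; the record layout and the addresses are made up for illustration and are not what any particular collector produces.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Sum bytes per source IP and return the n largest senders."""
    totals = Counter()
    for src, _dst, nbytes in flows:
        totals[src] += nbytes
    return totals.most_common(n)


# Made-up flow records: (source IP, destination IP, bytes).
flows = [
    ("10.0.0.15", "203.0.113.9", 48_000_000),
    ("10.0.0.22", "198.51.100.4", 1_200_000),
    ("10.0.0.15", "203.0.113.77", 21_000_000),
    ("10.0.0.31", "198.51.100.4", 300_000),
]
for ip, nbytes in top_talkers(flows):
    print(f"{ip:<12} {nbytes / 1e6:8.1f} MB")
```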

So let us consider using Wireshark for network congestion analysis. By now you have pinpointed the port that carries most of the traffic and is maxing out your WAN link. The next step would be to find out what sort of traffic it is and, even better, tie it to a process or app running somewhere locally (a file share, an illegal torrent server, could be anything).
The first thing you need to do is drag yourself over to the troubled location or switch and configure a SPAN session for the port (trunk A in our example), mirroring it to a port on that same switch to which you connect your laptop with the Wireshark client. Capture all traffic for a relatively short amount of time while the congestion is occurring.

Once you stop your capture, after a minute or two, in Wireshark, go to:

Fig.1 - End list summary Wireshark

The next screen displays an example of the top endpoints generating most of the traffic, including their IP addresses.


Fig.2 - Top talkers summary
You could, if you wanted, go straight to TCP in Figure 1, but I leave that up to you. Now that you know the main traffic generator's IP address, it is possible that you already have an idea of what application or process is causing the problem. Let us for the sake of this post assume that you don't, which means you will need to go on to the next step.
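If you would rather script the endpoint summary than click through the GUI, the same per-address byte totals can be pulled out of a saved capture. The sketch below is only an approximation of what the Wireshark endpoints view shows. It assumes the capture was saved in the classic pcap format (not Wireshark's default pcapng), with microsecond timestamps, Ethernet framing and IPv4 traffic; it steps over a single 802.1Q tag, since a SPAN of a trunk may include one. The function name is my own.

```python
import socket
import struct
from collections import Counter


def endpoint_bytes(pcap_data):
    """Return a Counter of on-the-wire bytes seen per IPv4 address."""
    magic = pcap_data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"
    else:
        raise ValueError("not a classic microsecond pcap (pcapng is not handled)")
    linktype = struct.unpack(endian + "I", pcap_data[20:24])[0]
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled")

    totals = Counter()
    offset = 24  # first packet record follows the 24-byte global header
    while offset + 16 <= len(pcap_data):
        _sec, _usec, incl_len, orig_len = struct.unpack(
            endian + "IIII", pcap_data[offset:offset + 16])
        frame = pcap_data[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        if len(frame) < 14:
            continue
        ethertype = struct.unpack("!H", frame[12:14])[0]
        ip_start = 14
        if ethertype == 0x8100 and len(frame) >= 18:
            # Skip one 802.1Q VLAN tag.
            ethertype = struct.unpack("!H", frame[16:18])[0]
            ip_start = 18
        if ethertype != 0x0800 or len(frame) < ip_start + 20:
            continue  # not IPv4, or truncated before the addresses
        src = socket.inet_ntoa(frame[ip_start + 12:ip_start + 16])
        dst = socket.inet_ntoa(frame[ip_start + 16:ip_start + 20])
        # Count the original frame length against both endpoints.
        totals[src] += orig_len
        totals[dst] += orig_len
    return totals


# Usage, with a hypothetical file name:
#   with open("trunk_a.pcap", "rb") as f:
#       print(endpoint_bytes(f.read()).most_common(10))
```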

For the next step, I recommend running TCPView on the machine you identified as the main traffic generator. TCPView is by the boys from Sysinternals (well, these boys were bought out by MS ages ago, so they will be on some Caribbean beach, but anyway). It is free of charge and is simply a great tool.

Below is an example of what TCPView looks like


Essentially, TCPView is a graphical representation of running netstat on the command line and tying TCP/IP sockets to process IDs. The good thing about TCPView is that it lets you sort sockets by bytes in and out, allowing you to identify your most bandwidth-hungry processes.
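To make the netstat comparison concrete, here is a rough sketch that parses the text output of `netstat -ano` on Windows and lists which process IDs hold connections to a given remote address. The sample output, addresses and PIDs are made up, and the function names are my own. Note that netstat does not report byte counts, so unlike TCPView this only tells you who is connected, not who is moving the most data.

```python
SAMPLE = """
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       888
  TCP    10.0.0.15:49712        203.0.113.9:443        ESTABLISHED     4312
  UDP    0.0.0.0:5353           *:*                                    1720
"""


def parse_netstat(text):
    """Turn `netstat -ano` style text into a list of socket dictionaries."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[0] == "TCP":
            proto, local, remote, state, pid = parts
        elif len(parts) == 4 and parts[0] == "UDP":
            proto, local, remote, pid = parts
            state = ""  # UDP sockets have no state column
        else:
            continue  # headers and blank lines
        rows.append({"proto": proto, "local": local, "remote": remote,
                     "state": state, "pid": int(pid)})
    return rows


def pids_talking_to(rows, remote_ip):
    """Return the sorted PIDs that own sockets connected to remote_ip."""
    return sorted({r["pid"] for r in rows
                   if r["remote"].rsplit(":", 1)[0] == remote_ip})


rows = parse_netstat(SAMPLE)
print(pids_talking_to(rows, "203.0.113.9"))
```

With the PID in hand, Task Manager or `tasklist` will give you the process name.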

Now that you have done that, you could turn off the culprit process, just for testing, and see if the traffic on your bottleneck decreases.

As I said at the start, this post just describes a bunch of thoughts and a methodology that I find useful and that I think are definitely worth sharing. I'd love to hear your feedback and suggestions.

Good luck

