How to troubleshoot network faults
May 30, 2023
We know that switches are important network devices in local area networks, and their operational status directly affects whether client systems can get online.
However, in practical work, the status of switches can easily be affected by external factors, resulting in various network faults in the local area network.
To ensure stable network operation, we must properly manage and maintain switches in our daily work to prevent switch failures.
In this article, we share the experience of a senior low-voltage systems expert in troubleshooting a switch fault. While maintaining the local area network of an office building, he encountered a fault in which a floor switch could not be pinged because of improper physical connections, and the troubleshooting process proved to be quite challenging.
Since this fault is relatively typical and the troubleshooting approach can be referenced, it is shared here for everyone's benefit.
1. Fault Scenario:
The office building I was responsible for at the time housed several companies. To ensure that each company could access the internet independently, without being affected by the others, I chose a routing (Layer 3) switch as the core switch of the building network.
At the same time, a separate virtual working subnet (VLAN) was set up on the switch for each unit.
Since the units were spread across different floors and the number of companies per floor varied, some floors had two or three units while others had as many as five or six.
The working subnets of the units on each floor were connected to the building's local area network through the corresponding floor switch and reached the Internet through the hardware firewall in the building network.
To improve network management efficiency, network administrators would usually manage and maintain the switches through remote connections.
One morning, while scanning and diagnosing the working status of the ports on the local area network's core switch, I found that one of the switch ports was in a down state.
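As a side note, this kind of port-status scan can be scripted. Below is a minimal sketch, assuming the core switch allows SNMPv2c reads and the net-snmp command-line tools are installed; the management IP, community string, and the exact snmpwalk output format are assumptions to adapt to your own environment.

```python
#!/usr/bin/env python3
"""Minimal sketch: list core-switch ports reported as "down" via SNMP.

Assumptions: net-snmp tools installed, SNMPv2c read access with a
hypothetical community string, and snmpwalk output lines that look like
"IF-MIB::ifOperStatus.12 = INTEGER: down(2)".
"""
import subprocess

CORE_SWITCH = "192.168.0.1"   # hypothetical management IP of the core switch
COMMUNITY = "public"          # hypothetical read-only community string

def scan_down_ports():
    # Walk the operational-status column of the standard interfaces table.
    result = subprocess.run(
        ["snmpwalk", "-v2c", "-c", COMMUNITY, CORE_SWITCH, "IF-MIB::ifOperStatus"],
        capture_output=True, text=True, check=True,
    )
    down_ports = []
    for line in result.stdout.splitlines():
        if "down(2)" in line:
            # The interface index is the number after "ifOperStatus."
            if_index = line.split("=")[0].strip().rsplit(".", 1)[-1]
            down_ports.append(if_index)
    return down_ports

if __name__ == "__main__":
    print("Ports reported down:", scan_down_ports() or "none")
```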
So I checked the network management records and found that this port was connected to one of the floor switches on the fifth floor.
When I tried to log in to that floor switch remotely, I could not log in successfully, and when I used the ping command to test the switch's IP address, it returned "Request timed out".
Just as I was wondering why no one had reported a fault, the phone rang: users on the fifth floor had started reporting network problems one after another.
Based on the above fault symptoms, I suspected that there might be an unexpected issue with the floor switch.
So I rushed to the scene of the faulty switch, disconnected its power supply, waited for a while, and then reconnected the power supply to restart it.
After the restart operation was completed, I used the ping command to test the IP address of the switch again.
This time, the results returned were normal, and remote login operations could proceed smoothly.
However, half an hour later, the faulty switch exhibited the same fault symptoms again, and when I tested it with the ping command, it returned abnormal results once more.
Later on, feeling uneasy, I repeated the process of restarting and testing, only to find that the faulty switch still couldn't be pinged normally.
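For an intermittent fault like this, it can also help to leave a simple reachability monitor running so the exact moment of failure is recorded. Here is a minimal sketch, assuming a Unix-style ping that accepts the -c and -W options; the floor switch's management IP is a hypothetical placeholder.

```python
#!/usr/bin/env python3
"""Minimal sketch: ping the floor switch once a minute and log every change
in reachability, to capture the "works for half an hour, then fails" pattern.

Assumptions: a Unix-style ping accepting "-c 2 -W 2" (count / timeout in
seconds), and a hypothetical management IP for the floor switch.
"""
import subprocess
import time
from datetime import datetime

FLOOR_SWITCH = "192.168.5.254"  # hypothetical IP of the fifth-floor switch
INTERVAL_SECONDS = 60

def is_reachable(host: str) -> bool:
    # ping exits with code 0 only if at least one reply was received.
    return subprocess.run(
        ["ping", "-c", "2", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

if __name__ == "__main__":
    last_state = None
    while True:
        state = is_reachable(FLOOR_SWITCH)
        if state != last_state:
            print(f"{datetime.now():%H:%M:%S}  {FLOOR_SWITCH} "
                  f"{'reachable' if state else 'UNREACHABLE'}")
            last_state = state
        time.sleep(INTERVAL_SECONDS)
```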
2. In-depth Troubleshooting:
Since repeated restarts did not solve the problem, I judged that the cause of the fault was more complicated than it first appeared. Because this type of fault is frequently encountered in network management work, I carried out in-depth troubleshooting along the following lines:
Considering that, in the entire building network, only this floor switch on the fifth floor exhibited the phenomenon, I initially judged that the floor switch itself might be at fault.
In order to accurately identify the cause of the fault, I planned to replace the faulty switch with a properly functioning one and observe if the fault still persisted.
At the same time, I would connect the suspected problematic switch to an independent network environment.
After half an hour of testing and observation, I saw that the faulty switch, which was connected to the isolated network environment, was functioning normally, and its IP address could be pinged in that network environment.
However, the newly replaced switch, when connected to the building network, couldn't be pinged normally.
Based on these observations, I concluded that the possibility of the fifth-floor switch itself having a problem was almost negligible. After ruling out factors related to the faulty switch's own status, I reviewed the network structure and status of the entire building network.
While users on other floors of the building could access the internet normally, a portion of the fifth-floor users couldn't.
Upon checking the networking information for the fifth floor, I found that there were five units on that floor. At the time, the network administrator had installed two floor switches on the fifth floor and connected them in a cascade configuration.
Additionally, five virtual working subnets were created on these two switches to ensure that each unit could work independently in their respective virtual subnets.
Since the corresponding port on the core switch had gone down, in theory all units on the fifth floor should have been unable to access the internet. So why had only some users reported the fault?
As soon as the workday started, I contacted the companies that had not reported network faults. They replied that they had only just noticed the abnormal network access and were about to ask the building's network administrator for help.
That being the case, all units on the fifth floor were indeed unable to access the Internet, so the cause of the fault most likely lay within these units' virtual working subnets.
After narrowing the troubleshooting scope down to the five units on the fifth floor, I recalled that restarting the affected floor switch temporarily restored the network.
However, after half an hour, the same network fault would reappear.
Given this pattern, I suspected that a network broadcast storm was congesting the switch over time and eventually blocking the corresponding port on the core switch.
To facilitate the analysis of the fault, I used network monitoring tools to analyze the network packet transmission on the cascade ports of the fifth-floor switch.
The results showed that both inbound and outbound packet traffic were extremely high, exceeding normal values by roughly 100 times. This indicated network congestion in the fifth-floor network.
So, was the congestion caused by a network virus, or by a network loop?
I planned to observe the status changes of the faulty switch's cascade ports, especially the changes in the output broadcast-packet counters. If the output broadcast count kept increasing every second, it was highly likely that there was a network loop in the fifth-floor network.
Based on this analysis approach, I directly connected to the faulty switch using a console control cable and logged into the system backend as a system administrator.
Using the "display" command, I checked the changes in the output broadcast packets of the cascade ports of the switch, examining the results every second and comparing them.
After repeated testing, I discovered that the size of the output broadcast packets from the faulty switch was indeed continuously increasing.
This indicated that there was definitely a network loop among the five units on the fifth floor.
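The per-second comparison I did by hand at the console can also be automated. The following is a minimal sketch that samples the outbound-broadcast counter of the cascade port once per second and prints the increase, assuming the switch exposes the standard IF-MIB::ifHCOutBroadcastPkts counter over SNMPv2c and that the net-snmp snmpget tool is installed; the switch IP, interface index, and community string are hypothetical.

```python
#!/usr/bin/env python3
"""Minimal sketch: sample the outbound-broadcast counter of a cascade port
once per second and print the per-second increase. A counter that keeps
climbing steeply is consistent with a broadcast storm caused by a loop.

Assumptions: net-snmp tools installed, SNMPv2c read access, a hypothetical
switch IP and interface index, and the standard IF-MIB counter (replace
with your vendor's OID if it differs).
"""
import subprocess
import time

SWITCH = "192.168.5.254"   # hypothetical floor-switch management IP
IF_INDEX = "25"            # hypothetical ifIndex of the cascade port
COMMUNITY = "public"       # hypothetical read-only community string

def read_out_broadcasts() -> int:
    # -Oqv prints only the counter value, e.g. "123456".
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", SWITCH,
         f"IF-MIB::ifHCOutBroadcastPkts.{IF_INDEX}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.split()[0])

if __name__ == "__main__":
    previous = read_out_broadcasts()
    for _ in range(30):                  # observe for 30 seconds
        time.sleep(1)
        current = read_out_broadcasts()
        print(f"output broadcasts +{current - previous}/s")
        previous = current
```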
Upon careful examination of the two switches on the fifth floor, I found that their physical connection was normal.
Furthermore, the various switch ports of these two switches were directly connected to the wall network sockets in the rooms on the fifth floor.
In theory, as long as the rooms do not use switches for unauthorized cascading, there should be no network loop.
Now that it was clear there was a network loop in the fifth-floor network, someone must have been expanding the network with unauthorized switches. By finding the added switches and inspecting their physical connections, I could quickly identify the specific faulty node.
So, I contacted the network administrators of the various units on the fifth floor by phone, requesting them to inspect each office room and report the rooms using subordinate switches.
It didn't take long for the inspection results to be reported to me, and surprisingly, around 10 rooms were using subordinate switches for network expansion.
At this point, I knew that there was a high probability of a network loop in these 10 rooms. But which room exactly?
Do I have to visit each room and inspect their network connections one by one?
After careful consideration, I retrieved the network documentation and identified the port numbers used by these 10 rooms.
Next, I directly connected network cables to these ports and, in the view mode of these ports, I sequentially pinged the IP address of the faulty switch.
When I reached the sixth port, I found that it couldn't be pinged successfully.
To determine if this port was indeed problematic, I used the "display" command in the view mode of the port to check its status information.
Analyzing the results, I found that the input and output packet counts on this port were clearly abnormal, so I concluded that this port was the cause of the faulty switch's abnormal working state.
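As an aside, the same per-port comparison can be scripted instead of re-patching a cable into each port: sample the traffic counters of all ten candidate switch ports twice and compare the rates, and the looped port stands out as an obvious outlier. A minimal sketch, again assuming SNMPv2c read access and the net-snmp tools, with a hypothetical switch IP and hypothetical ifIndex values:

```python
#!/usr/bin/env python3
"""Minimal sketch: sample in+out octet counters of the ten suspect ports
twice, ten seconds apart, and print each port's byte rate so the abnormal
(looped) port is easy to spot.

Assumptions: net-snmp tools, SNMPv2c read access with a hypothetical
community string, and hypothetical switch IP / ifIndex values.
"""
import subprocess
import time

SWITCH = "192.168.5.254"   # hypothetical floor-switch management IP
COMMUNITY = "public"
SUSPECT_PORTS = ["3", "5", "8", "11", "14", "17", "20", "22", "24", "26"]  # hypothetical ifIndexes

def read_octets(if_index: str) -> int:
    # Sum the 64-bit in and out octet counters for one interface.
    total = 0
    for column in ("IF-MIB::ifHCInOctets", "IF-MIB::ifHCOutOctets"):
        out = subprocess.run(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", SWITCH,
             f"{column}.{if_index}"],
            capture_output=True, text=True, check=True,
        ).stdout
        total += int(out.split()[0])
    return total

if __name__ == "__main__":
    first = {p: read_octets(p) for p in SUSPECT_PORTS}
    time.sleep(10)                       # sample again ten seconds later
    for port in SUSPECT_PORTS:
        rate = (read_octets(port) - first[port]) / 10
        print(f"port ifIndex {port}: {rate:,.0f} bytes/s")
```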
After referring to the file records, I quickly identified the corresponding room based on that port number.
Upon arrival at the scene, I discovered that the two available network ports in that room were both connected to small hubs, and these two hubs were connected to several computers.
To make matters worse, there was a network cable directly connecting them together, creating a network loop between the two hubs.
This loop caused a broadcast storm, which ultimately blocked the cascade port of the faulty switch and left the fifth-floor network unable to access the internet properly.
3. Fault Resolution:
After removing the extra network cable, I rechecked the status information of the switch port. The results showed that the input and output packet sizes had returned to normal.
When I checked the status of the corresponding port on the core switch again, I found that its previous "down" state had changed to "up". At this point, I was also able to ping the faulty switch on the fifth floor successfully.
This confirmed that the problem was indeed caused by the unauthorized use of hubs in one of the rooms on the fifth floor. Through further inquiry with the users, I learned that their room had been cleaned the night before, and all the Ethernet cables had been unplugged during the cleaning.
After the cleaning was finished, the users, who knew little about network connections, plugged the cables back in at random, which created the network loop. As network engineers, we also need to keep this kind of scenario in mind when carrying out maintenance work.