The TPG representative on the forum listed some tests that he'd like us to perform to help track down the problem: ideally ping, traceroute, and nslookup, run both before and after the problem starts. I don't know much about the infrastructure of the internet or how routing occurs, but I'm familiar with these commands and what they mean. I wanted to help out, but because the problem was intermittent I kept missing the right time to run the tests, and after a week or so of trying I was fed up.
I had an ancient laptop sitting beside me running an outdated version of Ubuntu that I decided to put to use. By automating the process of running the tests and logging the output I would be able to capture the exact moment of failure and hopefully pinpoint the problem. After all, the internet without Twitter just isn't the same.
Recording the result of the tests was easy; a quick and dirty bash script did the job.
logger.sh
#!/bin/bash
DATE=$(date +%Y_%m_%d_%H_%M_%S).log # get date YYYY_MM_DD_HH_MM_SS (%d, not %e, so the day is zero-padded and the filename has no spaces)
printf "ping -c 5 twitter.com \n\n" > "PING_twitter_com_$DATE"
printf "%s\n" "$(ping -c 5 twitter.com)" >> "PING_twitter_com_$DATE"
printf "\n\n\n" >> "PING_twitter_com_$DATE"
echo $DATE
The script is simple. It gets the date in a human-readable, sortable format and writes the command being run as a string to a file named PING_twitter_com_YYYY_MM_DD_HH_MM_SS.log. It then appends the output of the command to the same file. More tests can be added by replicating the above code for each command. By issuing the following command
while true; do ./logger.sh; sleep 300; done;
logger.sh will be run every 5 minutes (actually every 5 minutes plus the time taken to run the tests) until stopped. This generates a log file for each execution containing the data we want. An example is below.
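If that drift matters, the fixed sleep can be replaced with a sleep aligned to the clock. This sketch computes how long to wait until the next 5-minute boundary:

```shell
#!/bin/bash
# Seconds remaining until the next 5-minute (300 s) clock boundary.
# Using this in place of "sleep 300" keeps runs aligned to the clock
# even though the tests themselves take time to execute.
REMAINING=$(( 300 - $(date +%s) % 300 ))
echo "$REMAINING"
```

The loop then becomes `while true; do ./logger.sh; sleep $(( 300 - $(date +%s) % 300 )); done`.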
PING_twitter_com_2013_04_18_20_17_59.log
ping -c 5 twitter.com
PING twitter.com (199.59.148.10) 56(84) bytes of data.
64 bytes from twitter.com (199.59.148.10): icmp_req=1 ttl=51 time=256 ms
64 bytes from twitter.com (199.59.148.10): icmp_req=2 ttl=51 time=256 ms
64 bytes from twitter.com (199.59.148.10): icmp_req=3 ttl=51 time=257 ms
64 bytes from twitter.com (199.59.148.10): icmp_req=4 ttl=51 time=257 ms
64 bytes from twitter.com (199.59.148.10): icmp_req=5 ttl=51 time=257 ms
--- twitter.com ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 256.601/257.101/257.432/0.464 ms
This is great, but it's a bit hard to see what's going on. By extracting the average ping value and graphing it over time, a better picture of the problem can be created. A command line graph can be generated from a directory of log files with the following command.
cat PING_twitter_com*.log | awk -F '[ /]' '/rtt/{ print $8 }' | gnuplot -persist -e "set terminal dumb 121 28; set yrange[200:400]; plot '-' with impulses title 'Ping (ms)';"
At first it seems a little complicated, but once you break it down things become clear. The log files are read with cat; the wildcard expands them in sorted order, so they are read chronologically. The output is piped to awk, which looks for lines matching a pattern. rtt is at the start of the line that contains the average ping, so I've set awk to match that, but you could use mdev or any other string unique to the line you're after. The -F option sets awk's field separators to space and forward slash, which makes extracting the average ping easy: breaking the line up at spaces and slashes, the average is the eighth field, so awk is told to print $8. This in turn is piped to gnuplot.
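To check that the average really is the eighth field, you can feed a captured summary line straight through the same awk program:

```shell
# The summary line from the log, split on spaces and slashes, gives:
#   $1=rtt  $2=min  $3=avg  $4=max  $5=mdev  $6='='
#   $7=min value  $8=avg value  $9=max value  $10=mdev value  $11=ms
LINE='rtt min/avg/max/mdev = 256.601/257.101/257.432/0.464 ms'
echo "$LINE" | awk -F '[ /]' '/rtt/{ print $8 }'
# → 257.101
```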
gnuplot needs to be told to draw in the terminal; the "set terminal dumb" command does this. Plotting data from stdin is done with the plot '-' command. That gives the basic graph, but it can be spruced up a bit with some extra formatting, ie, changing the range, using impulses, enlarging the graph, and giving the data a title. It's up to you to decide what looks best.
A graph of ping results to twitter.com every 5 minutes
As you can see from the graph above, I didn't encounter the problem tonight. The average ping result remained remarkably stable at around 250 ms. Who knows, they may have fixed the problem. But I'll run this for the next couple of nights just to make sure.
This is great but I couldn't quite follow your steps. Could you guide me a little? I'm trying to monitor my AT&T T1 traffic so I can see when it spikes.
First of all, are you using Linux? If you're using Windows I'm not sure how to go about doing it. If you're using Linux, let me know what part you're having trouble with and I'll try to help.
Your script didn't work for me very well, but your idea was really good, so I did this:
HORA=./hora    # hora (time): file of timestamps
DATOS=./datos  # datos (data): file of ping averages
LOGS=/scripts/graph/PING_outside*.log
echo > $HORA
echo > $DATOS
for FILE in $LOGS; do
echo $FILE | awk 'BEGIN { FS = "_" } ; { print $6":"$7":"$8 }' | sed 's/.log//g' >> $HORA
cat $FILE | awk -F '[ /]' '/rtt/{ print $8 }' >> $DATOS
done
sed -i '/^$/d' $DATOS
sed -i '/^$/d' $HORA
gnuplot -persist << EOF
filename="$DATOS";
set terminal dumb 121 28;
set title 'Ping (ms)';
plot filename with impulses title "Ping (ms)"
EOF
for LINEA in `seq 1 $(wc -l $DATOS | awk '{print $1}')`; do
VALX=`head -n $LINEA $DATOS | tail -n 1`
VALY=`head -n $LINEA $HORA | tail -n 1`
echo "$VALX - $VALY"
done;
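As an aside, the final head/tail loop re-reads both files once per line; `paste` joins two files line by line in a single pass. A sketch, with hypothetical sample data standing in for the contents of $DATOS and $HORA:

```shell
# Hypothetical sample files in the same shape as $DATOS and $HORA.
printf '257.101\n256.890\n' > datos
printf '20:17:59\n20:22:59\n' > hora
# paste pairs line N of the first file with line N of the second,
# separated by the delimiter given to -d (a tab by default).
paste -d' ' datos hora
# → 257.101 20:17:59
#   256.890 20:22:59
```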