Wednesday, May 22, 2019

Interoperable QOS Woes

Just when I think I've gotten a handle on QOS, something knocks me down.

I'm currently in an MPLS environment where we need to work within a four-queue structure with our carrier. We do get to choose among the queuing structures and algorithms they offer.

We were running zero QOS before, but predictably, when we hit congestion we got tons of user reports about critical traffic being dropped, and the support desk's response was to hunt down everyone browsing YouTube on their breaks and shut them down... This wasn't sustainable.

So I put in a decent amount of time trying to finally learn a command other than auto-qos, and we put in a hierarchical policy that I thought looked pretty good:

Example Parent Policy

policy-map 200MB_SHAPE_CC_EDGE_WAN
 class class-default
  shape average 200000000
  service-policy CC_EDGE_TO_MPLS

Example Child Policy

policy-map CC_EDGE_TO_MPLS
 class VOIP
  priority percent 30
 class NCONTROL
  priority percent 5
 class CRITICAL
  bandwidth remaining percent 60
  random-detect dscp-based
  random-detect ecn
 class class-default
  random-detect dscp-based
  random-detect ecn
  fair-queue

We chose this method because it lets us keep a parent policy for each of our different bandwidth tiers. Our largest site is 700 Mbps and our smallest is 50 Mbps.
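For example, a 50 Mbps site just gets its own parent shaper wrapped around the same child policy (the policy name here is illustrative, following the same convention):

policy-map 50MB_SHAPE_CC_EDGE_WAN
 class class-default
  shape average 50000000
  service-policy CC_EDGE_TO_MPLS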

After we implemented this, all was right with the world for the last six months. We no longer got user reports when we maxed out our bandwidth.

Until this week. This week one of our larger sites, a 200 Mbps one, finally started hitting its max, and our telecom team came running over showing that the RTP streams were experiencing loss.

Weird, I thought... I checked the VOIP queue and saw no drops. I reached out to our carrier and, surprisingly, they were very helpful: they pointed out that the issue may be with some of our QOS settings that aren't viable for a 200 Mbps link.
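For the record, the per-class drop counters I was checking come from the standard command (interface name here is just a placeholder):

show policy-map interface GigabitEthernet0/0/0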

Specifically, they stated the following three items:

The rate is correctly set at 200M, but the Bc and Be are over-scaled at 800000 bits each, equaling a 100 KB burst... the tolerance is 64 KB... they recommend manually setting the Bc and Be values to 512000 each, if the CPE will allow it.

Raise the queue limit from 833 to at least 1000, as it is too small for the 200M service.

As for the nested QOS, it is not an exact match to the ordered network profile of 30-06-42-22. The output shows the CPE is set to 30-05 with bandwidth remaining 60%. This entails that your AF tagging is not allocating 42% of your CIR; rather, your AF and BE combined are claiming the remaining 60% of the bandwidth, which could account for further drops.

So I have a few concerns/questions I have been trying to Google and understand, but it seems like every recommendation points in a different direction. Hopefully someone here with much more experience can weigh in and give me a hand.

Platform is ASR1001

On the ASR, when going to set the Bc and Be values, the context-sensitive help literally recommends against setting them manually, saying an algorithm will find the best value. Is this safe to ignore? Is there another Cisco feature I should be using to scale this correctly and safely to match the carrier?
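If I take the carrier's advice literally, the change is just the optional Bc and Be arguments on the shaper (shape average takes cir, Bc, and Be in bits, and 512000 bits works out to the 64 KB they asked for):

policy-map 200MB_SHAPE_CC_EDGE_WAN
 class class-default
  shape average 200000000 512000 512000
  service-policy CC_EDGE_TO_MPLS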

Queue limits - I don't seem to have any control over the queue-limit values on the priority queues or the parent shaper. Is there something else I can do here?
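On the classes that do accept it, the knob itself is simple enough; a sketch of raising a non-priority queue in the child policy:

policy-map CC_EDGE_TO_MPLS
 class CRITICAL
  queue-limit 1000 packets

But that does nothing for the priority queues or the parent shaper, which is where I'm stuck.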

The nested QOS is a fair issue. Cisco allows two priority queues, and everything I found suggested VOIP and network control traffic should go in those priority queues. Our carrier has only one priority queue, reserved for EF traffic; CS7/CS6 traffic would go into their P2 queue.

Is it better to just move our network control out of priority so it's easier to match the carrier? How do you all handle differing carrier policies and queues?
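For what it's worth, the closest on-paper match I've come up with moves NCONTROL out of priority and rescales the remaining-bandwidth percentages against the 70% left after the EF queue: 9/60/31 of the remainder works out to roughly 6/42/22 of the CIR, which lines up with the carrier's 30-06-42-22 profile. A sketch with our class names (WRED lines omitted for brevity):

policy-map CC_EDGE_TO_MPLS
 class VOIP
  priority percent 30
 class NCONTROL
  bandwidth remaining percent 9
 class CRITICAL
  bandwidth remaining percent 60
 class class-default
  bandwidth remaining percent 31
  fair-queue

Whether pulling CS6/CS7 out of our priority queue bites us somewhere else is exactly what I can't find a straight answer on.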


