Table of Contents
Purpose
Overview of CoS in FlexVNF
Case 1: Troubleshooting QoS rule definitions and traffic mapping
Case 2: Troubleshooting CoS drops
2.1: QoS-policy/policer induced drops
2.2: Troubleshooting scheduler/shaper induced drops
2.3: Drops owing to bursty traffic
2.4: Drop-profile/WRED and queue-depth
Case 3: Troubleshooting DSCP re-write scenarios
Case 4: Troubleshooting packet loss due to thread utilization
Purpose
The purpose of this document is to illustrate aspects of troubleshooting issues pertaining to class-of-service, or quality of service, on Versa OS.
Overview of CoS in FlexVNF
Working with class-of-service revolves around prioritizing and rate-limiting traffic to ensure that high-priority traffic is given preference and that the traffic volume passing through the FlexVNF meets the bandwidth restrictions planned for a tenant.
A packet walk, from the perspective of class-of-service, entails the following general modules of processing.
A succinct explanation of the functionality of each of the above modules follows.
Classification – When a packet enters the FlexVNF, whether from the LAN side or the WAN side, it needs to be associated with a “forwarding-class” (FC), using various match criteria like src/dst IP, DSCP value, application, etc., for it to be enqueued to a specific queue later. This is done via a qos-profile, which is later associated with a qos-policy.
Policing – The traffic can be policed, if needed, through the same qos-policy discussed above. This is useful if one wants to police/rate-limit traffic on the LAN side of a tenant, in a multi-tenant node, and further shape/schedule it via a common queue (shared by multiple tenants), since we cannot use a tenant-specific scheduler in 16.1R2 (though this feature is available in 20.2).
Note: You can employ a policer in either direction of the traffic, be it WAN to LAN or LAN to WAN. The usual practice is to police the traffic incoming from the LAN and employ a shaper towards the LAN, but the reverse is also a valid configuration.
Remarking – On FlexVNF, you can re-write the DSCP, as well as dot1p (ethernet), values at two levels: the inner and the outer. The outer refers to the ip-header of the SD-WAN packets (the customer packet encapsulated with ipsec/vxlan, followed by the ip-header) and the inner refers to the ip-header of the actual customer packet. Re-writing the outer ensures that the customer packet’s DSCP is retained; re-writing the inner changes the DSCP values of the customer’s packet.
Queueing – This module allows for shaping the traffic by enqueueing the packets, classified into specific forwarding-classes, into queues (the default queue-length being 64). There are 16 queues in total per pipe, corresponding to the 16 forwarding-classes into which the traffic can be classified. For ease of configuration, there are 4 traffic-classes, each holding a group of 4 forwarding-classes; the mapping is as shown below.
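For reference, a sketch of that grouping expressed as a Python dictionary; the exact grouping is inferred from the FC numbers referenced later in this document (fc_ef = FC 4, fc_af = FC 8, forwarding-class-12 = BE), so verify it against the defaults of your release:

    # Hypothetical sketch of the TC -> FC grouping (4 FCs per TC); verify on your release.
    tc_to_fc = {
        "tc0_network_control":      [0, 1, 2, 3],      # highest priority
        "tc1_expedited_forwarding": [4, 5, 6, 7],      # fc_ef is FC 4
        "tc2_assured_forwarding":   [8, 9, 10, 11],    # fc_af is FC 8
        "tc3_best_effort":          [12, 13, 14, 15],  # forwarding-class-12 is BE
    }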
You can also associate a drop profile with the queue to define the WRED min/max values or just configure the “tail-drop” value (min=max)
Scheduling – This module implements the de-queuing of the above-mentioned queues through the use of schedulers, which are defined by their PIR (peak-rate), CIR (or guaranteed rate) and burst-size. The scheduling function follows strict prioritization, where TC[0] > TC[1] > TC[2] > TC[3], which means that traffic enqueued into queues belonging to TC[0] has the highest priority and is dequeued first (using the configured scheduler rate), followed by TC[1] and so forth.
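To make the strict prioritization concrete, here is a minimal conceptual sketch (not Versa code) of a strict-priority dequeue loop across the four traffic classes:

    from collections import deque

    # Four TC queues, TC0 (highest priority) .. TC3 (lowest).
    tc_queues = [deque() for _ in range(4)]

    def dequeue():
        # Always scan from TC0 down: a lower TC is served only
        # when every higher-priority TC queue is empty.
        for tc, q in enumerate(tc_queues):
            if q:
                return tc, q.popleft()
        return None  # all queues empty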
Within a TC group, the queues can be configured with different weights to influence the scheduler bias between them. For example, consider the configuration below
Here the configured PIR is 20000 kbps, and it needs to be distributed between the 3 queues with weights in the ratio 3:16:14 – the distribution is derived using the formula below.
3 + 16 + 14 = 33
For queue 0, the allocated PIR is derived as follows:
3 * 100/33 = 9.09% (3 being the weight of queue 0)
20000 * 9.09/100 ≈ 1818 kbps
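The same arithmetic for all three queues, as a quick Python check (weights taken from the example above):

    pir_kbps = 20000
    weights = [3, 16, 14]   # queue 0, 1, 2
    total = sum(weights)    # 33
    for q, w in enumerate(weights):
        print(f"queue {q}: {w}/{total} = {w/total:.2%} -> {pir_kbps * w / total:.0f} kbps")
    # queue 0: 3/33 = 9.09% -> 1818 kbps
    # queue 1: 16/33 = 48.48% -> 9697 kbps
    # queue 2: 14/33 = 42.42% -> 8485 kbps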
The same calculation holds for the “guaranteed-rate” when defined. The queue-depth is 64 by default; however, it can be increased to a value of 256 if required. It should be noted that an increase in queue-depth also entails an increase in latency and is hence not advisable for jitter/latency-sensitive traffic.
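To see why a deeper queue adds latency, consider the worst-case drain time of a full queue; a rough back-of-the-envelope sketch (the packet size and the 20000 kbps drain rate are assumed for illustration):

    queue_depth_pkts = 256      # increased from the default of 64
    pkt_bytes = 1500            # assume MTU-size packets
    drain_rate_kbps = 20000     # assumed scheduler rate for illustration

    # bits / (kbit/s) yields milliseconds
    delay_ms = queue_depth_pkts * pkt_bytes * 8 / drain_rate_kbps
    print(f"worst-case queueing delay: {delay_ms:.1f} ms")  # ~153.6 ms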
The above concludes the overview of CoS architecture in Versa OS. The rest of the document discusses various scenarios pertaining to troubleshooting class-of-service.
Note:
The following adds more clarity on the Guaranteed Rate (GR) and Transmit Rate (TR) configuration, with examples. Assume a 100 Mbps link for the following examples.
1. If the Guaranteed Rate is:
a. EF 100%
b. AF 100%
c. BE 100%
Then the Traffic Class (TC) is prioritized: if a higher traffic class takes all of its GR, there will not be any bandwidth (BW) left for lower classes. However, if there is excess BW left unused by higher classes, it will be available to lower classes.
Here, if traffic for NC arrives, it will be prioritized irrespective of any other traffic class, since the default GR is 100% for any traffic class that is not explicitly configured.
2. If the Guaranteed Rate is:
a. EF 35%
b. AF 35%
c. BE 30%
Then each TC will be able to use a maximum of its GR; lower classes will be able to use their GR, since the higher traffic classes don’t have a 100% GR, and excess BW will be made available for TR in traffic-class priority order.
A traffic class has a default Transmit Rate of 100% and will be catered to in TC priority order once the GR has been met. If the Transmit Rate is explicitly configured, that value is honored as an upper bound.
3. If the Guaranteed Rate is:
a. EF 35%
b. AF 35%
c. BE 30%
If EF is sending 80 Mbps of traffic and AF is sending 40 Mbps (assuming no traffic on the other TCs), then EF will use its 35 Mbps guaranteed rate, AF will use its 35 Mbps guaranteed rate, and the remaining 30 Mbps of BW will be made available back to EF, as it has more traffic to send and its Transmit Rate is the default 100%.
4. If the Guaranteed Rate and Transmit Rate are:
       GR   TR
a. EF  35%  40%
b. AF  35%  50%
c. BE  30%  50%
If EF is sending 50 Mbps of traffic, it will use the 35 Mbps of its GR; if AF is sending 50 Mbps, it will also use its 35 Mbps GR. Then, for TR, EF will get its additional 5% (40% TR - 35% GR), i.e. another 5 Mbps of EF traffic; in the same way, AF will get another 15 Mbps.
The remaining 10% of BW will stay unused, as TR is an upper bound and traffic cannot exceed it (assuming no traffic on the other TCs). The allocation logic is sketched in code after this list.
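The behaviour in examples 1–4 can be summarized as a two-pass allocation: first each TC receives up to its GR in strict priority order, then leftover bandwidth is handed out in priority order up to each TC’s TR. A minimal Python sketch (not Versa code) that reproduces example 4:

    link_mbps = 100

    # (name, offered_mbps, GR fraction, TR fraction) -- values from example 4,
    # listed in TC priority order (highest first)
    classes = [
        ("EF", 50, 0.35, 0.40),
        ("AF", 50, 0.35, 0.50),
        ("BE",  0, 0.30, 0.50),
    ]

    alloc = {}
    remaining = link_mbps

    # Pass 1: guaranteed rate, honored in strict TC priority order
    for name, offered, gr, tr in classes:
        got = min(offered, gr * link_mbps, remaining)
        alloc[name] = got
        remaining -= got

    # Pass 2: leftover bandwidth by priority, capped by each class's TR
    for name, offered, gr, tr in classes:
        extra = min(offered - alloc[name], tr * link_mbps - alloc[name], remaining)
        alloc[name] += max(extra, 0)
        remaining -= max(extra, 0)

    print(alloc, "unused:", remaining)
    # {'EF': 40.0, 'AF': 50.0, 'BE': 0} unused: 10.0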
Case 1: Troubleshooting QoS rule definitions and traffic mapping
You can define various profiles, using qos-profile entries, specifying the forwarding-class and loss-priority, along with policer attributes (if required) such as the peak-rate and burst-size. You can also specify whether you want to re-write the DSCP of the traffic stream associated with the profile.
Once you’ve defined all the required profiles, you need to associate each profile with the appropriate traffic stream – this is done by defining qos-policies, where you specify the “match” criteria to identify the traffic stream of interest and then enforce the qos-profile on it.
For example, in the screenshot below, a qos-policy matches the source-zone “intf-LAN-zone” and enforces the qos-profile (defined above), which ends up classifying all the traffic from the LAN side as forwarding-class-12 (BE).
You can use more granular match criteria to differentiate traffic streams: source/destination IP prefix, custom services, header values (like DSCP), and even schedules (to classify traffic into different forwarding classes based on the time of day).
There is also an option to match various applications (predefined and custom) and URL categories, thus allowing for an L7 match criterion, through the use of an “App QoS” policy.
Note: The major difference between a qos-policy and app-qos-policy is that the latter performs evaluation in a bi-directional manner – so, if you define a policer in the app-qos-policy, it will be applicable to both directions of the flow, once the flow is identified/matched.
If you face an issue where a certain traffic is not hitting the rule that you’ve defined, the starting point would be to check the below aspects
- Re-check your configuration and make sure you’ve defined the correct match criteria for the traffic, especially the source-zone/destination-zone if you are using zones to match the direction of the traffic
- Re-start the traffic a few times and check the below output, in the respective tenant org, for either the qos-policy or the app-qos-policy, and check whether the hits against your rule increment (the hit count increments for each new flow)
- If you’ve defined a new rule, make sure you clear any existing session (for the matching traffic) for it to hit the rule – an existing session will not start matching a rule that was added after the session was established.
- Initiate a “tcpdump” on the LAN interface while troubleshooting traffic in the LAN-to-WAN direction and save the pcap file as shown below; you can match on the “host” that is generating the traffic. Check the pcap and determine whether the traffic meets your match criteria – for example, if you’ve defined a rule to match traffic with “dscp CS7” and the incoming traffic has a “dscp” value of CS0, there will be no hits.
- Check the below output (session extensive from the respective org) and validate that the incoming/outgoing interface, src/dest and application are what you expect them to be. You can use any “select” criteria as a filter (in the example below we are filtering on application). In the below example, there is a match, the forwarding class is being set to “fc_ef”, and there are also some forward drops through the impact of the configured CoS attributes (drop-module qos).
- You can also dump the VSF session by connecting to vsmd, as shown below. You can check the app-ids associated with your flow in case you are using app-qos policies, and also confirm whether there are hits against the qos-profile (id/rule-id are internal values).
<snipped some of the output>
QoS policies can be set in either direction, LAN-to-WAN or WAN-to-LAN, with the intention of classifying the traffic into a certain forwarding class, enabling DSCP re-write, as well as employing policer-based rate-limiting.
It’s important to verify the configuration and confirm the direction in which the qos rule is applied – remember that the qos policy match is always applicable to the “org” to which the session is bound, so you will not see hits if the session belongs to a child-org and you apply the qos policy under the parent org.
Case 2: Troubleshooting CoS drops
When you apply class-of-service aspects to a traffic stream, there are two points at which QoS drops can occur
- The policer defined in the qos-profile, with peak pps/kbps and burst-size configuration, can cause drops if the traffic volume exceeds the set limits
- The schedulers defined in the scheduler-map, associated with the respective traffic-classes/forwarding-classes, can drop traffic when the concerned “queue” gets congested
Case 2.1: Troubleshooting qos-policy and app-qos-policy induced drops (policer drops)
One of the ways to ascertain whether traffic is being dropped by a policer is to check the output below (in this example the “select” filter has been applied to match the application “youtube”, but you can use other filters like destination-ip, source-ip or port).
As can be seen below, the drop module is “qos” and the dropped-reverse-pkt-count is incrementing.
Note: In the above example, an app-qos-policy was applied matching the source-zone as the LAN side – however, as you can see above, the reverse packet drops are incrementing, which is proof that the app-qos-policy evaluates traffic bi-directionally.
You can determine the rule that’s dropping the packets by first clearing the class-of-service app-qos-stats and then checking the rule statistics.
You can also check the same output via the monitor tab in the Director as shown below
Analytics can provide useful data with respect to the volume of traffic being generated by specific applications, as well as users, which can help determine whether any specific app or user is hogging the bandwidth/queue.
A lot of information can be obtained from the “Applications” tab in Analytics, which can be checked per tenant – there are three panels/reports available (as you scroll down), as shown below.
Note: The Rx bps refers to the aggregate “reverse byte” rate for sessions with the application “youtube”, and the Tx bps refers to the aggregate “forward byte” rate for the same. This data is critical to understanding the reason for scheduler/policer drops, as it provides direct information about the aggregate bandwidth utilization.
You can also check the “users” tab, in analytics, which gives you insight into the top users utilizing the bandwidth (you can determine the presence of a spurious user this way)
Note: If you have a qos-policy and an app-qos-policy which match the same traffic, or a subset of it, the qos-policy is evaluated first, after which the app-qos-policy is evaluated – in sequence – and hence the app-qos-policy will always influence the final evaluation.
Case 2.2: Troubleshooting Scheduler/Shaper induced drops
Schedulers can be defined per tenant, in case you want to shape/rate-limit the LAN-side traffic (or rather the traffic exiting the LAN ports), and/or they can be defined under the provider/parent org (to which the WAN interfaces belong) when you want to rate-limit the traffic exiting the WAN interfaces. The latter is a case of aggregated rate-limiting, where you end up shaping the aggregate egress traffic, towards the WAN, of all the tenants configured on the branch.
It’s important to understand that if you plan on deploying schedulers to shape your LAN-side traffic for a tenant, you will need to configure a qos-policy for this tenant to match traffic in the direction towards the LAN (destination-zone LAN-side), or just use an app-qos-policy, in which there is bi-directional evaluation/classification.
Note: In 20.2 there is a feature which allows you to apply schedulers per tenant on the WAN side, defining an aggregated bandwidth for the provider org and a subset of this bandwidth for the tenant orgs, which is then used by the respective schedulers.
Checking for scheduler drops is relatively easy; the schedulers are bound to their egress interfaces (which are the pipes with queues defined) and their stats can be checked as below.
Below is the output for a WAN-side interface
You can clear the stats first to get a better understanding of the current state of the traffic profile on a concerned interface.
The below example shows the output pertaining to a LAN-side interface (vni-0/2), which has schedulers applied for a specific tenant.
As can be seen above, tc3, which caters to the best-effort forwarding-class, is registering scheduler drops, specific to pipe vni-0/2.0.
Each traffic-class has 4 queues associated with it. If you’ve defined forwarding-classes that employ these queues in a TC, you can get a more granular view of which queue is dropping traffic using the below.
You can also see a field called “Length”, which shows the queue-depth being utilized at the time the command was executed – this is real-time data. Consistent utilization of the queue-depth on any specific queue does indicate the possibility of bursty traffic subjected to the scheduler.
If the scheduler is applied on the WAN-side interface in a multi-tenant branch, it will display the aggregate traffic (from all tenants) egressing the WAN interface – so drops observed on a specific queue will not give you insight into the actual volume from each tenant (you will not be able to determine whether any specific tenant is a major contributor to the congestion). In such a case, it’s useful to first check the interface statistics on the LAN side, per tenant, as below, and work out the volume per tenant.
You can get a better picture by checking the LAN usage (tab: VRF) to see whether any specific tenant is hogging the WAN bandwidth in a spurious fashion, as shown below.
Note: Be careful while assigning a “guaranteed rate” to schedulers; the best practice is to ensure that the aggregate of the guaranteed rates, across all the schedulers mapped to all the TCs, never exceeds the line-rate. Over-subscribing the peak-rate is fine, but over-subscribing the guaranteed rate would mean that a higher traffic-class could hog the scheduler (even if its traffic rate is much lower than the specified guaranteed rate) and thus deprive the other traffic-classes. A quick sanity check is sketched below.
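A trivial sanity check for this best practice, with hypothetical per-TC guaranteed rates:

    # Hypothetical guaranteed rates (Mbps) assigned to each TC's schedulers
    line_rate_mbps = 100
    guaranteed_mbps = {"tc0": 10, "tc1": 35, "tc2": 35, "tc3": 20}

    total = sum(guaranteed_mbps.values())
    if total > line_rate_mbps:
        print(f"WARNING: aggregate GR {total} Mbps exceeds line-rate {line_rate_mbps} Mbps")
    else:
        print(f"OK: aggregate GR {total}/{line_rate_mbps} Mbps")  # OK: 100/100 Mbps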
Case 2.3: Troubleshooting drops owing to bursty traffic
Certain applications, or traffic streams, have a tendency to generate micro-bursts, or sporadic bursts, that last from a few milliseconds to a few seconds. These bursts can be much higher than the configured peak-rate, and usually such bursts are accommodated by the “burst-size” configuration associated with a policer or scheduler, as shown below.
Setting burst size on a Policer (associated with qos-profile)
Setting burst-size on a Scheduler
As seen below, the burst-size is associated with the pipe to which the schedulers are mapped
The formula to determine an approximate burst-size to set for a corresponding peak-rate is as below:
Burst-size = Peak-rate (kbps)/8
In 16.2R10, the burst-size for policers is set to MAX(15000, Peak-rate/8) by default.
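As a quick worked example of the formula and the 16.2R10 default (the 20000 kbps peak-rate is illustrative):

    def approx_burst_size(peak_rate_kbps):
        # Burst-size = Peak-rate (kbps) / 8
        return peak_rate_kbps / 8

    def default_policer_burst(peak_rate_kbps):
        # 16.2R10 policer default: MAX(15000, Peak-rate/8)
        return max(15000, peak_rate_kbps / 8)

    print(approx_burst_size(20000))      # 2500.0
    print(default_policer_burst(20000))  # 15000 -- the floor applies at lower rates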
However, some traffic can be highly bursty in nature, and there can be drops despite the above burst-size being set. In such cases, the customer is advised to look into options for reducing the burstiness of the traffic (if possible); if not, we can look to increase the burst-size further, to 2 to 5 times the value calculated above.
One should realize, though, that increasing the burst-size to exorbitant values is not optimal, as this would task the poller/worker thread, which can lead to drops in other schedulers that are catering to normal traffic. The recommendation is to first look at options for reducing the bursty traffic, possibly by introducing “shaping” (queue-based scheduling, setting a queue-depth that accommodates the bursts and shapes the traffic accordingly) along the path of the traffic, if possible.
The burst-size allocated to an interface (scheduler) can be checked using the below command.
Drops owing to traffic exceeding the peak-rate/burst-size can be seen in the commands mentioned in the sections above – the “session extensive” output, the qos-policy rule stats, “show class-of-service-interface”, as well as the below output taken from vsmd (through which you can tell that there are scheduler-related drops occurring in the branch).
To view the micro-bursts observed on the VOS, you can run a pcap on the LAN side – for a specific source or destination, or for a specific duration – open the pcap in Wireshark, and plot an I/O graph with an interval of 10 ms.
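If you prefer scripting over Wireshark, the same 10 ms I/O view can be approximated from (timestamp, length) pairs exported from the pcap; a toy sketch with made-up data:

    from collections import defaultdict

    # (timestamp_ms, frame_bytes) pairs, e.g. exported via tshark; toy data here
    packets = [(0, 1500), (2, 1500), (4, 1500), (120, 500)]

    interval_ms = 10  # 10 ms buckets, matching the Wireshark I/O graph interval
    buckets = defaultdict(int)
    for ts_ms, size in packets:
        buckets[ts_ms // interval_ms] += size

    for b in sorted(buckets):
        mbps = buckets[b] * 8 / (interval_ms / 1000) / 1e6
        print(f"{b * interval_ms:>5} ms: {mbps:.2f} Mbps")  # spikes here are micro-bursts
    #     0 ms: 3.60 Mbps
    #   120 ms: 0.40 Mbps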
Default burst-sizes for the corresponding interface port-speeds are as below:
10 Mbps: 12500
100 Mbps: 12500
1 Gbps: 125k
10 Gbps: 1250k
Case 2.4: Drop profile and queue-depth
Along with increasing the burst-size, you can also look at increasing the queue-depth for schedulers; a deeper queue allows for better shaping of bursty traffic, which in turn can help the connected peer (by sending less bursty traffic towards it).
You can increase the queue-depth by setting the max values in the drop-profile (the default queue-depth is 64).
Note: You can set max = min (the same values for max and min), in which case you will effectively create a “tail-drop” queue instead of a WRED queue.
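Conceptually, the min/max thresholds drive the drop decision as in this simplified sketch (real WRED typically works on an averaged queue length, and this is not Versa code):

    import random

    def wred_drop(queue_len, wred_min, wred_max, max_p=1.0):
        if queue_len < wred_min:
            return False          # below min: never drop
        if queue_len >= wred_max:
            return True           # at or above max: always drop (tail-drop)
        # between min and max: drop probability ramps up linearly
        p = max_p * (queue_len - wred_min) / (wred_max - wred_min)
        return random.random() < p

    # With min == max the ramp disappears and this degenerates into a
    # pure tail-drop queue, as the note above describes.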
You will need to associate these drop-profiles with the scheduler as shown below.
Case 3: Troubleshooting DSCP re-write scenarios
In Versa OS, we have four levels of DSCP re-write available
- Copying the dscp values from the LAN-side ip-header to WAN-side (sdwan/vxlan) ip-header
- Copying the dscp values from sdwan ip-header to the LAN-side ip-header
- Changing the dscp/dot1p values of the LAN-side ip-header to custom values mapped to an FC (this changes the dscp value on the customer’s traffic)
- Enforcing a specific dscp/dot1p value on the sdwan ip-header without changing the dscp value of the customer’s traffic
The first two functions can be achieved through the use of the “copy from inner” and “copy from outer” options, respectively, in the RW rule defined under a tenant.
Note: By default, the customer’s dscp value is not copied to the sdwan header (tos bits are not set in the ip-header of the sdwan encapsulation carrying the customer’s traffic – tos 0x0, or dscp CS0).
DSCP re-write issues can be troubleshot through the use of tcpdump, on the LAN side as well as the WAN side.
For example, consider a topology as below
(client) --- vni-0/2 (spoke) vni-0/0 ---------- vni-0/0 (hub) (dia vni-0/3)--- internet
Tcpdump has been enabled on vni-0/2 of the spoke node to capture certain packets from the client side (the client is sending pings with size 400 and tos 104 – 0x68) – the echo-reply has tos 0x68 too, as is expected from the ping application.
After enabling “copy-from-inner”, you will notice that the sdwan traffic leaving the spoke has tos 0x68 set on its vxlan header. Also, you can see that the echo-reply packets have tos 0x0, since the Hub is not configured to “copy-from-inner”.
Notice what happens to the echo-reply packets egressing vni-0/2 when “copy-from-outer” is also enabled: the tos values from the sdwan header are written into the dscp field of the echo-reply packet (the customer packet is modified as a result – tos 0x68 is replaced with tos 0x0).
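When reading these captures, remember that the tos byte carries the DSCP in its top six bits; a small helper to decode the values seen in this section (the name mapping covers only these values):

    def tos_to_dscp(tos):
        return tos >> 2  # DSCP is the top 6 bits of the tos byte

    names = {26: "AF31", 10: "AF11", 0: "CS0"}  # just the values seen in this section
    for tos in (0x68, 0x28, 0x00):
        dscp = tos_to_dscp(tos)
        print(f"tos 0x{tos:02x} -> dscp {dscp} ({names.get(dscp, '?')})")
    # tos 0x68 -> dscp 26 (AF31)
    # tos 0x28 -> dscp 10 (AF11)
    # tos 0x00 -> dscp 0 (CS0)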
Below is an example of achieving the functionality defined in “point 3”, which is to change the dscp value of the incoming customer traffic before it’s sent over the sdwan tunnel. For this, the dscp re-write option should be enabled under the respective qos-profile, which ensures that all the customer traffic mapped to this qos-profile (via qos-policy) will have its dscp value over-written.
Note: As seen in step 3 below, the RW rule is applied to the “tunnel” interface under the tenant, in order to over-write the dscp on the tenant’s traffic.
Step 1
Step 2
Step 3
The impact of the above configuration is as seen below: the customer’s traffic arrives on vni-0/2 (spoke) with tos 0x68. It is classified into forwarding-class-ef by the qos-policy/qos-profile, after which the dscp re-write is applied – the effect is seen in the traffic leaving the Hub’s dia interface vni-0/3, where you can see that the tos value is 0x28 (AF11).
Tcpdump on Spoke vni-0/2
Tcpdump on Hub vni-0/3
Also, note that the Hub is receiving this traffic via sdwan, with the sdwan ip-header displaying tos value 0x28 (check the 2nd packet below, 16.1.1.6 is spoke’s wan ip) – this is because “copy-from-outer” is still enabled
Note: When “copy-from-inner” is enabled and you perform a dscp-rewrite of the customer packet using the above configuration (dscp-rewrite on tunnel), then the dscp-rewrite rule will change the dscp value on the customer traffic first after which it would be copied to the sdwan ip-header.
Use the below configuration on the WAN interface, instead of “tunnel”, to avoid changing the dscp values on the customer’s packet while still modifying them on the sdwan ip-header (as discussed in point 4).
Indigo is the parent org in this setup, and the WAN interface belongs to it. A DSCP RW rule is configured under the parent org and is associated with the network/WAN interface – all the traffic mapped to the forwarding class matched in the RW rule will have its sdwan ip-header dscp value re-written with the value defined in the RW rule.
From the vsmd prompt (from the shell, “vsh connect vsmd”) you can run the below two commands, which will give you a succinct summary of the RW rules in place, per tenant/interface, and the modification applied by each rule.
If you want to verify the FC name associated with an FC number, you can run the below command from the CLI – FC number 4, referenced in the table above, is fc_ef, and FC number 8 is fc_af, as seen in the table below.
Case 4: Troubleshooting packet loss due to thread utilization
Though this topic is not directly related to CoS, there can be cases where a poller thread is busy handling bursty traffic (with a burst-size setting much higher than the calculated value) at a traffic volume close to line-rate, which causes it to be overwhelmed. There can also be other reasons (like core/thread utilization due to a stuck process) that cause CoS drops, due to the low CPU cycles available to a scheduler, even though the traffic is not exceeding the peak-rate.
The poller passes on the traffic, from the respective interfaces/ports, to the worker threads. If a worker thread is busy (or stuck) and is unable to cater to the traffic being passed on by the poller, you are likely to see THRM drops in the stats below, in the vsmd prompt (from the shell, vsh connect vsmd).
There can also be a case (though not very likely in a production scenario) where there is an uneven distribution of sessions between the worker threads, which causes a certain worker thread to be busy; this can be verified through the command below, which shows the session count per worker thread.
Check the core/thread utilization on the branch using “htop -H” as shown below,
Note: the “thread number” starts from 1 here, whereas in the vsf outputs taken above the thread number starts from 0.
In the below output there are just two threads (taken from a VM), but on bare metal there can be several threads, with multiple worker threads and pollers. Use the below output to determine whether there are any utilization spikes on any of the threads.
In systems with multiple worker threads, you can check the below mapping to determine which thread is associated with a traffic class. In the output below you can see that traffic class NC is mapped to worker 0, while traffic class BE is mapped to cores 0 and 1. If you see a spike, in the “htop” output above, on a thread that is mapped to a certain traffic class, and you see QoS/scheduler drops for that traffic class, a correlation can be made.
In some cases – especially an overloaded system, or a lab system running max throughput/performance tests, that is experiencing performance issues – one can consider enabling isolcpu via the CLI, as shown below (please raise a case with TAC if such considerations arise, and avoid experimenting with this option in a production environment).
On a Linux system, enabling isolcpu ensures that the kernel does not use any cores other than core-0 for control functions, thus ensuring that the other cores are free to perform their tasks (worker/poller) without interference, improving traffic throughput performance. As mentioned, this option should not be exercised without consulting Versa TAC.