This KB article describes how to troubleshoot branch staging and post-staging issues during the Zero Touch Provisioning (ZTP) process.
There are three Zero Touch Provisioning methods:
- Script-based ZTP
- URL-based ZTP
- Global ZTP
Irrespective of the ZTP method adopted, the communication flow between the branch and the headend components (Director and Controller) remains the same from the start to the end of a successful onboarding.
Assuming the workflow template and workflow device have already been provisioned on the Director, this article describes the issues encountered after ZTP (via any of the three methods) is initiated.
For simplicity, the simple topology below is used for the explanations that follow.
ZTP consists of two phases, staging and post-staging, before the branch is fully onboarded.
Staging Phase:
The branch authenticates with the Controller using the data provided via the staging script, forms a standard IKE/IPsec tunnel, and receives a temporary management IP through the IKE payload. Once IPsec is formed between the branch and the Controller, the Controller verifies multiple parameters before it sends out a branch-connect notification to the Director.
Post-staging Phase:
Once the Director receives the branch-connect notification from the Controller, it initiates the post-staging process. Over a Netconf session, the Director pushes to the device the configuration that was already provisioned for this staging branch on the Director, in the form of the workflow template and device bind data.
Staging Issues:
Importance of Global Tenant-ID
Before you execute the staging script, verify the global tenant ID of the parent org on the Controller through which you want to stage the branch device. In the example below, Provider is the parent organisation.
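A hedged way to check this on the Controller CLI; the exact configuration path can vary by release, and Provider here is the illustrative parent org name:
> show configuration orgs org Provider | display set
Look for the org/global tenant ID value in the output.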
If the value is anything other than 1, execute the staging script with the global tenant ID argument set to the value seen in the Controller output above. If the global tenant ID is already 1, you need not specify it; the staging script defaults to 1.
Miscellaneous:
The parent org ID is usually 1, as it is the first org created on the Director. In a few instances the org ID can differ from 1, for example when you create a parent org, delete it for some reason, and immediately recreate it with the same or a different name. Depending on the time between deletion and recreation of the parent org, the tenant ID may end up with a value other than 1.
If you execute the staging script without the global tenant ID argument while the global tenant ID of the parent org on the Controller is not 1, onboarding fails at the staging phase, even though everything else is fine, from end-to-end reachability to the configs in the Director templates.
Symptoms of staging failure due to Global-Tenant-ID:
- IKE will be successful/done
- The branch will have received the temporary mgmt IP
- IPsec with the Controller will be up
- But the Controller will not send the sdwan-branch-connect notification (check this via the show alarms output described later in this article)
Symptom 4 clearly indicates a failure of the staging phase.
To overcome this issue:
- Verify the Global-Tenant-ID on the Controller; if it is not equal to 1:
- Erase the running config on the branch
- Re-run the staging script on the branch with the Global-Tenant-ID argument (-gt) set to the value seen on the Controller
Provided you are aware of the Global-Tenant-ID setting, you can build the correct staging script and execute it, as in the example below.
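For reference, a representative staging-script invocation with the -gt argument; the script path and the -w/-l/-r/-c flags reflect the commonly used form, so adjust the names and values to your deployment:
cd /opt/versa/scripts/
sudo ./staging.py -w 0 -l SDWAN-Branch@<ParentOrg>.com -r Controller-1-staging@<ParentOrg>.com -c <Controller-WAN-IP> -gt <Global-Tenant-ID>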
A. Staging script executed but staging does not start:
1. Once you execute the staging script, first check the interface brief output to see whether the branch has received the staging/temporary IP. This typically takes 10-15 seconds (depending on underlay connectivity), and IKE should come up before this IP is received. If you still do not see this IP after about a minute, there is a problem forming IKE.
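For example, on the branch CLI (the staging/temporary IP typically shows up on a tvi interface):
> show interfaces brief | tab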
2. Next, check the IKE status using the command below; in the example here, IKE has failed, so we need to check the connectivity between the Controller and the branch.
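One commonly used form on the branch CLI, shown here as an assumption since the exact command can vary by release (the org and profile names are illustrative):
> show orgs org-services <org-name> ipsec vpn-profile <staging-profile> ike history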
3. First check connectivity to the immediate gateway, then check reachability all the way to the Controller WAN IP.
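A representative gateway check from the branch (the gateway IP is illustrative):
> ping <next-hop-gateway-IP>
> show arp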
4. The output above shows that the ping to the next-hop gateway succeeds and the ARP entry for the next hop is learned from the gateway, confirming the next-hop gateway is fine.
If the ping does not work or ARP is not learned, the issue is with the ISP next hop; involve the ISP to debug further at this point.
5. After confirming the next-hop gateway is fine, check Controller reachability from the branch. Initiate a ping from the branch global routing instance to the Controller WAN IP, and run tcpdump on both the Controller and the branch WAN interfaces.
100.100.100.1 – Controller WAN IP
115.115.115.26 – Branch WAN IP
6. Branch Side:
The ping failed: the ICMP echo request is sent out from the branch, but no response comes back from the Controller.
The traceroute above may help identify where the packets are being dropped on the underlay.
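For reference, hedged forms of the branch-side commands used here (vni-0/x is the branch WAN interface; the IP is from the example above):
tcpdump vni-0/x filter "'host 100.100.100.1 and icmp'"
> traceroute 100.100.100.1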
7. Controller side:
tcpdump vni-0/x filter "'host <Branch WAN IP> and icmp'"
The tcpdump below shows no packets from the branch IP arriving at the Controller WAN interface, so it is evident the underlay is dropping packets between the branch and the Controller. Involve the concerned ISP to identify the problem.
B. Ping successful but IKE still fails between the Controller and the branch
1. There are scenarios where ping succeeds but IKE is still down/failed. If ping works between the branch and the Controller but IKE still does not come up between them, capture the IKE packets on both the branch and the Controller.
2. Branch side: Performing IKE capture on the branch WAN interface
> tcpdump vni-0/x filter "port 500"
3. Controller side: Performing IKE capture on the Controller WAN interface
> tcpdump vni-0/x filter "'host <Branch IP> and port 500'"
4. In the capture above, IKE packets from the branch do not make it to the Controller, and IKE packets from the Controller do not make it to the branch. This clearly indicates the IKE packets are being dropped on the underlay, so involve the ISP to check for an IKE block in the path.
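If there is a NAT device in the path, IKE may switch to NAT-T on UDP 4500, so it can be worth widening the capture filter; a hedged variant of the capture above:
tcpdump vni-0/x filter "'host <Branch IP> and (port 500 or port 4500)'"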
Once the above issues are fixed, the onboarding process should move to the post-staging phase.
Post-staging Issues:
1. Check that the Controller has sent the branch-connect notification to the Director:
> show alarms last-n 10 | match branch-connect
2. If you see the sdwan-branch-connect notification generated on the Controller, the Controller has sent the notification to the Director. Next, check that the Director has received the branch-connect notification; on the Director you can filter on the branch mgmt IP.
3. This confirms staging is successful and the ZTP process has proceeded to the next phase, known as post-staging. Onboarding of a branch can fail at the post-staging stage for multiple reasons, such as a config mismatch between the templates and the device, a routing issue on the headend towards the branch, or a discrepancy in the VNF-manager IPs on the Controller and the branch.
4. When onboarding fails at this stage, it is important to first check the task status on the Director to get a clue about what failed the post-staging process.
Below is a common example where post-staging fails with the reason “Connecting to appliance failed”.
5. Such issues indicate a connectivity break between the branch and the Director, even though branch-to-Controller connectivity is fine (the staging process succeeded, which confirms Controller-branch connectivity).
To debug this further, please read below:
C. Director Debugging:
1. The Director must have reachability towards the branch device. To ensure the Director has a way to reach the branch management IP, check that it has a route in its routing table to the branch via its southbound (SB) interface.
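The Director is a Linux host, so standard Linux commands apply; for example:
ip route get <Branch-mgmt-IP>
route -n | grep <Branch-mgmt-subnet>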
2. The snippet above confirms the Director has a route to the branch staging IP. Next, initiate a ping from the Director to this branch mgmt IP and check that it is successful.
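For example, sourcing the ping from the SB interface (eth1 as the SB interface is illustrative):
ping -c 5 -I eth1 <Branch-mgmt-IP>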
3. If the ICMP ping fails, proceed with the Versa-built netconf-check based ping:
cd /opt/versa/vnms/scripts/
sudo ./netconf-check.sh <Branch-mgmt IP> <admin-passwd>
Miscellaneous: the netconf-check ping verifies not only IP reachability but also the MTU on the path, port requirements, asymmetric routing while load-balancing on the headend, and Netconf/SSH connectivity.
Please make sure you collect this information before reaching out to Versa support when escalating such a post-staging issue.
D. Controller Debugging:
1. From the above, it seems the branch mgmt IP address is not reachable from the Director, so capture packets on the Controller to see whether it is receiving the packets from the Director and forwarding them to the branch.
The snippet below shows the Director eth1 interface, which has the IP 192.168.1.1 configured on it.
2. Controller side:
Run tcpdump on the control VR interface facing the Director southbound.
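A hedged example, assuming vni-0/2 is the Controller interface in the control VR facing the Director (192.168.1.1 is the Director SB IP shown above):
tcpdump vni-0/2 filter "'host 192.168.1.1 and icmp'"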
The tcpdump below shows the Controller receiving the packets from the Director SB IP.
3. Now check whether the Controller is forwarding the packets to the branch device.
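One way to locate the session, shown as an assumption since the exact command form varies by release:
> show orgs org-services <org-name> sessions extensive | match <Branch-mgmt-IP>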
The session extensive output below shows the packets are dropped on the Controller. It is important to note that these ICMP packets are dropped in the forward direction in the provider control VR.
4. Although we can see the packets are dropped on the Controller, we cannot yet see the exact reason why the Controller is dropping them.
5. Enable a packet trace on the Controller to trace these packets.
The syntax to enable a packet trace for the specific flow discussed above is shown below. To fill in the packet-trace parameters, use the output of the session extensive command above, which shows all the parameters involved in this flow.
> request debug session filter-create filter-name F1 source-prefix <Src-IP/32> destination-prefix <Dst-IP/32> protocol <protocol-number>
The output of this debug is collected in the /var/log/versa/versa-pkttrace.log file.
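You can follow the trace from the Controller shell while reproducing the traffic:
tail -f /var/log/versa/versa-pkttrace.log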
6. The packet trace log above shows the traffic is dropped/blackholed on the Controller due to a routing issue (no route). Fix the routing and proceed.
Having verified everything up to the Controller, proceed to debug the next element down the line: the branch.
E. Branch Debugging:
1. Check that the branch has a reverse route to the Director southbound IP. The Director SB IPs are given to the branch by the Controller during the staging phase as part of the IKE payload.
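A hedged example; the control VR name follows the usual <org>-Control-VR pattern, and 192.168.1.1 is the Director SB IP from the earlier snippet:
> show route routing-instance <org-name>-Control-VR | match 192.168.1.1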
2. From the output above, we can see the branch has a reverse route to the Director SB IPs. The last thing to verify here is whether the ICMP packets from the Director are hitting the branch Linux or not (packets from the Director first hit the Linux global namespace, then the branch infmgr service taps the packets and forwards them to VSMD).
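To check this, capture in the branch Linux global namespace from the shell (standard tcpdump; 192.168.1.1 is the Director SB IP):
sudo tcpdump -i any host 192.168.1.1 and icmp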
3. The output above shows the branch is not receiving the packets from the Controller; they are dropped somewhere on the path. Below is the output of a successful packet capture.
4. If you have confirmed up to this point that reachability is not a problem, post-staging should go through fine, unless you have seen problems with a Director bind-data discrepancy.
XXXXXXXXXX------XXXXXXXXX-------XXXXXXXXXX-------XXXXXXXXXX
F. Actual problem behind this whole troubleshooting exercise:
Below is an example where all the end-to-end reachability is fine but the branch onboarding/post-staging still fails, specifically due to a wrong serial number configured in the device bind data.
XXXXXXXXXX------XXXXXXXXX-------XXXXXXXXXX-------XXXXXXXXXX
G. Branch not coming up after successful onboarding:
1. After post-staging is successful, the branch reboots, the Director creates a Creation-baremetal successful task, the branch comes back up, and gradually all the IPsec tunnels to the Controllers come up. In some cases, however, the ptvi tunnels on the branch (the IPsec tunnels towards the Controller) remain down after the reboot.
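You can list the ptvi tunnel interfaces and their state on the branch, for example:
> show interfaces brief | match ptvi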
2. In this case, capture the VXLAN packets on both the Controller and the branch, filtering on the WAN IP address of each end.
Controller side:
Check whether the Controller is responding to the VXLAN (port 4790) packets from the branch WAN IP, or is only receiving the packets from the branch IP without responding back:
tcpdump vni-0/x filter "'host <Branch WAN IP> and port 4790'"
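Branch side: the mirror-image capture on the branch WAN interface (a hedged example):
tcpdump vni-0/x filter "'host <Controller WAN IP> and port 4790'"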
3. If this is the case, the infmgr service needs to be checked, so please raise a case with Versa support to guide you further.
4. Please raise the case with the outputs below from both the Controller and the branch, along with the logs from the troubleshooting above.
> show orgs org <org-name> sdwan detail <branch-name>        (take this output from the Controller)
> show orgs org <org-name> sdwan detail <controller-name>    (take this output from the branch)