(1) How to check the IPSec statistics on the VOS?
-> Check with the command: show orgs org-services <Org-Name> ipsec statistics
-> Clear the stats once using the command: request clear statistics ipsec org org-name <Org-Name> vpn-profile <VPN-Profile-Type>
Check whether any packet drop counter is incrementing. >> This gives you an overall view of IPsec module performance at a per-Org level. A quick way to confirm that a counter is actively incrementing is shown after the sample output below.
admin@Branch-cli> show orgs org-services Tenant-VSA ipsec statistics
Statistics:
Inbound Statistics:
# IKE packets : 0
# IKE Packets - Trigger PM : 0
# ESP/AH Packets (to decap) : 4
# ESP/AH Bytes (to decap) : 358
# ESP Packets (to decap) : 0
# AH Packets (to decap) : 0
# Packets - After IPsec processing : 4
# Anti-replay failure - out of order : 0
# Anti-replay failure - duplicate : 0
# NAT-T packets : 0
# NAT-T keep-alive packets : 0
# Packets dropped - Invalid IKE : 0
# Packets dropped - Unknown SPI : 0
# Packets dropped - Invalid SPI : 0
# Packets dropped - No SA : 0
# Packets dropped - Anti-replay : 0
# Packets dropped - Auth failure : 0
# Packets dropped - Invalid : 0
Outbound Statistics:
# IKE packets : 0
# Packets - Trigger PM : 0
# Packets - Not hitting any rule : 3
# Packets - Hitting existing SA : 0
# Packets - Before IPsec processing : 3
# Bytes - Before IPsec processing : 318
# Packets - ESP/AH : 3
# Packets - Submitted to pre-fragment : 0
# Packets - Successful pre-fragment : 0
# Packets - Pre-fragmented : 0
# Packets dropped - PM rate limit : 0
# Packets dropped - No SA : 0
# Packets dropped - No mbuf : 0
# Packets dropped - Coalesce failure : 0
# Packets dropped - Invalid : 0
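To confirm that a drop counter is actively incrementing rather than showing a stale count, capture the drop lines from two runs of the show command taken a minute or so apart and diff them. A minimal sketch using standard shell tools from the appliance shell; the file names are illustrative, and saving the CLI output to a file (for example via your terminal emulator's session logging) is assumed:
[admin@Branch: ~] $ grep -i "dropped" /tmp/ipsec-stats-run1.txt > /tmp/drops-run1.txt
[admin@Branch: ~] $ grep -i "dropped" /tmp/ipsec-stats-run2.txt > /tmp/drops-run2.txt
[admin@Branch: ~] $ diff /tmp/drops-run1.txt /tmp/drops-run2.txt
Any counter that shows up in the diff output has incremented between the two runs.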
(2) How to check if the DPD is causing the tunnel to flap or timeout?
-> To check the IKE and IPsec history towards the Controller, please use the below commands:
The Branch is always the initiator towards the Controller by default.
For example:
admin@Branch-cli> show orgs org-services Tenant-VSA ipsec vpn-profile Controller-VSA-Profile ike history
Local Gateway: 10.0.0.99 Remote Gateway: 10.0.0.3
Last Known State : Active
Last State Timestamp : 2023-11-15T16:29:05.376375-08:00
Event History:
0. Event : IKE Done
Timestamp : 2023-11-15T16:29:05.376377-08:00
Role : initiator
Inbound SPI : 0x200d03bca885057
Outbound SPI : 0x20003172492fbae
If there is a reachability issue, where the Event shows Error: Timed out, check the reachability between the tunnel interface IPs. For this example, the Local Gateway IP is 10.0.0.99 and the Remote Gateway IP is 10.0.0.3.
admin@Branch-cli> ping 10.0.0.3 routing-instance Tenant-VSA-Control-VR source 10.0.0.99
PING 10.0.0.3 (10.0.0.3) from 10.0.0.99 : 56(84) bytes of data.
64 bytes from 10.0.0.3: icmp_seq=1 ttl=64 time=3.51 ms
64 bytes from 10.0.0.3: icmp_seq=2 ttl=64 time=10.8 ms
64 bytes from 10.0.0.3: icmp_seq=3 ttl=64 time=12.1 ms
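If individual pings succeed but the tunnel still flaps, quantify packet loss over a longer run. A minimal sketch reusing the count and rapid options that appear in the ping command under question (3) below:
admin@Branch-cli> ping 10.0.0.3 routing-instance Tenant-VSA-Control-VR source 10.0.0.99 count 1000 rapid enable
A non-zero loss percentage in the summary points to an intermittent underlay problem that can also cause DPD to time out.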
How to check if the tunnel to the remote Controller is down due to an underlay issue, where DPD responses are not received back from the Controller?
Sample:
admin@Branch-cli> show orgs org-services Tenant-VSA ipsec vpn-profile Controller-VSA-Profile ike history
Local Gateway: 10.0.0.99 Remote Gateway: 10.0.0.3
Last Known State : Failed
Last State Timestamp : 2023-11-15T22:32:09.439176-08:00
Event History:
0. Event : IKE Failed
Timestamp : 2023-11-15T22:32:09.439179-08:00
Role : initiator
Inbound SPI : 0x2002e47c87ff086
Outbound SPI : 0x20003172492fbae
Error : Timed out
[admin@Branch: ~] $ grep -i "PEER (Dead) :" /var/log/versa/versa-ipsec.log | grep <Local or Remote Gateway IP>
Sample Snippet:
In the versa-ipsec-ctrl.log, we will see this error logged: IKE peer '<Remote Gateway IP>' was found DEAD when performing IKE Dead Peer Detecting algorithm.
This means that if there are no responses from the remote site (in this case the Controller) for 60 seconds, one PEER (Dead) entry is written to the log.
We should also see the IPsec Disconnected and IPsec DPD Triggered counters increment by 1 per Controller.
admin@Branch-cli> show orgs org-services <Org-Name> ipsec statistics
Statistics:
# IPsec Disconnected : 1
# IPsec DPD Triggered : 1
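To see how many times DPD has declared this peer dead, and when the most recent event occurred, the same log grep can be counted and tailed with standard shell tools (replace 10.0.0.3 with your Local or Remote Gateway IP):
[admin@Branch: ~] $ grep -i "PEER (Dead) :" /var/log/versa/versa-ipsec.log | grep 10.0.0.3 | wc -l
[admin@Branch: ~] $ grep -i "PEER (Dead) :" /var/log/versa/versa-ipsec.log | grep 10.0.0.3 | tail -1
A count that keeps growing across the day indicates repeated DPD timeouts rather than a single isolated flap.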
(3) How to check if the packets are indeed making it to the Controller, and whether it is the Controller that is not responding or, if the Controller is responding, an underlay router that is dropping these packets?
On the Branch:
admin@Branch-cli> ping <Remote-Gateway-IP> routing-instance <Tenant>-Control-VR source <Local-Gateway-IP> count 1000 packet-size 1000 df-bit enable rapid enable
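If this large, df-bit-enabled ping fails while small pings succeed, there may be a path MTU problem on the underlay. A minimal sketch that steps the packet size down using the same options shown above (the size values are illustrative):
admin@Branch-cli> ping <Remote-Gateway-IP> routing-instance <Tenant>-Control-VR source <Local-Gateway-IP> count 10 packet-size 1400 df-bit enable
Repeat with smaller packet-size values until the ping succeeds; the largest size that works indicates the usable path MTU.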
>> How to find the remote and local gateway IP from the VOS appliance cli?
Sample Output:
admin@Branch-cli> show orgs org-services <Tenant-Name> ipsec vpn-profile <Controller-Profile> ike history
Local Gateway: 10.0.0.99 Remote Gateway: 10.0.0.3
Last Known State : Failed
Last State Timestamp : 2023-11-19T17:51:08.872841-08:00
Event History:
0. Event : IKE Failed
Timestamp : 2023-11-19T17:51:08.872844-08:00
Role : initiator
Inbound SPI : 0x200759c7351fa05
Outbound SPI : 0x2000cac7ea2eecb
Error : Timed out
On the Controller, in parallel:
*vni-0/X is the WAN interface on the Controller where the tcpdump will be taken. Please kill the tcpdump before closing the Controller appliance shell session; we do not want the tcpdump left running in the background. Measures to auto-kill the tcpdump have been added in the latest 22.1.3/21.2.3 releases.
admin@Controller-cli> tcpdump vni-0/<X> filter "'host <WAN-IP of the Spoke/Branch> and greater 1000 -c 200'"
In case the Controller receives the requests but does not reply, please gather the details collected up to this point and share them with the TAC engineer.
For example, here we only see the packets arriving on the Controller, but the Controller is not responding to any of the requests coming from the VOS. Please also share whether this VOS appliance was re-onboarded or RMA'ed just before it lost the tunnel.
admin@Controller-cli> tcpdump vni-0/2 filter "'host 10.40.92.14 and greater 1000'"
Starting capture on vni-0/2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on _vni_0_2, link-type EN10MB (Ethernet), capture size 262144 bytes
17:49:34.920556 14:58:d0:ad:50:c0 > 52:54:00:af:03:ce, ethertype IPv4 (0x0800), length 1094: 10.40.92.14.55274 > 10.48.87.21.4790: VXLAN-GPE, flags [IP], vni 29202: ERROR: unknown-next-protocol
17:49:35.920556 14:58:d0:ad:50:c0 > 52:54:00:af:03:ce, ethertype IPv4 (0x0800), length 1094: 10.40.92.14.55274 > 10.48.87.21.4790: VXLAN-GPE, flags [IP], vni 29202: ERROR: unknown-next-protocol
17:49:36.932561 14:58:d0:ad:50:c0 > 52:54:00:af:03:ce, ethertype IPv4 (0x0800), length 1094: 10.40.92.14.55274 > 10.48.87.21.4790: VXLAN-GPE, flags [IP], vni 29202: ERROR: unknown-next-protocol
If you see the response going back from the Controller but do not see it arriving on the Branch, then underlay routing is dropping this UDP 4790 (or NATed) traffic on the return path from the Controller to the Branch. Please work with your ISP if the underlay is not a Versa-managed appliance.
To verify if the Branch is indeed receiving the packets from the Controller, please run the below tcpdump command:
admin@Branch-cli> tcpdump vni-0/0 filter "'host <Controller-WAN-IP> -c 200'"
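If you need to attach a capture to the support ticket, a standard Linux tcpdump run from the appliance shell can write the packets to a pcap file. A minimal sketch, assuming the kernel interface name for vni-0/2 is _vni_0_2 (as shown in the capture banner above) and that shell and sudo access are available; remember to stop this tcpdump as well before closing the shell:
[admin@Controller: ~] $ sudo tcpdump -i _vni_0_2 -c 200 -w /tmp/branch-tunnel.pcap 'host 10.40.92.14 and udp port 4790'
The resulting /tmp/branch-tunnel.pcap can then be copied off the appliance and shared with the TAC engineer.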
(4) The tunnel is down due to an authentication failure; what should I check next?
admin@Branch-cli> show orgs org-services Tenant-VSA ipsec vpn-profile Controller-VSA-Profile ike history
Local Gateway: 10.0.0.99 Remote Gateway: 10.0.0.3
Last Known State : Failed
Last State Timestamp : 2023-11-19T19:23:40.133994-08:00
Event History:
0. Event : IKE Failed
Timestamp : 2023-11-19T19:23:40.133997-08:00
Role : initiator
Inbound SPI : 0x2002fd848b886f3
Outbound SPI : 0x2005559fbde6270
Error : Authentication failed
In /var/log/versa/versa-ipsec.log, we can check when this authentication failure started to show up.
[admin@Branch: ~] $ grep -i "N(AUTHENTICATION_FAILED)" /var/log/versa/versa-ipsec.log | more
Sample:
Using the above grep output, we should be able to see when the failures started.
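To quickly pull out the first and the most recent occurrence from that grep, bracket the output with head and tail (standard shell tools; the timestamps on these two lines show when the failures began and whether they are still ongoing):
[admin@Branch: ~] $ grep -i "N(AUTHENTICATION_FAILED)" /var/log/versa/versa-ipsec.log | head -1
[admin@Branch: ~] $ grep -i "N(AUTHENTICATION_FAILED)" /var/log/versa/versa-ipsec.log | tail -1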
Now on the Branch, please check whether any recent commits may have mistakenly updated the key used towards one of the Controllers. Please check this on the Controller as well, in case somebody mistakenly updated the key there and caused the tunnels to go down with Error: Authentication failed.
admin@Branch-cli> show commit list
Let's find a way to tally the key used on the Branch against the one on the Controller to confirm that they match.
On the Branch, please review the local and remote auth keys configured.
Sample Snippet:
On the Controller, please review the local and remote auth keys configured.
Please make sure the local and remote auth keys on the respective appliances are correct, and that there have been no recent commits changing the key which may have triggered the tunnel to go down.
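Because the keys themselves should not be pasted into a ticket or chat, one way to tally them is to hash the configured key string on each appliance and compare the digests. A minimal sketch using standard shell tools; 'MySharedKey123' is a placeholder for the actual key value taken from the configuration, and echo -n is used so that no trailing newline is hashed:
[admin@Branch: ~] $ echo -n 'MySharedKey123' | sha256sum
[admin@Controller: ~] $ echo -n 'MySharedKey123' | sha256sum
If the two digests differ, the local and remote auth keys do not match and the authentication failure is expected.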
We will continue to add more scenarios here. After these checks, if you still see the IPsec tunnel from a Branch towards a Controller down, please raise a Versa Support ticket.