Troubleshooting Crash
=====================

A crash happens when a process hits an unexpected fault, such as a CPU exception or a failed assertion, and is terminated. To troubleshoot it, we need to find out which function triggered the fault and the chain of function calls that led to it. When a crash happens, the system writes a copy of the crashing process's memory to a core dump file. The core dump therefore holds the state of the process, including the failing function and its callers, at the moment of failure.


Follow the steps below:
-----------------------

(1) Every crash should generate a core dump file. Ask the customer to collect the core dump and upload it using the link provided by Versa.

If the customer does not have access to the upload URL, create one or ask a colleague for help.

Core files are stored in "/var/tmp/versa-cores" and can be listed from the CLI:


admin@CPE-1-cli> show coredumps

total 1.2G

-rw-rw-r-x 1 root root 635K Sep 25  2017 core.versa-vsmd.2579.versa-flexvnf.1506405622.gz

-rw-rw-rw- 1 root root  46M Aug 13 03:05 core.versa-certd.2400.CPE-1.1534154716.gz

-rw-rw-rw- 1 root root 3.4M Aug 16 00:17 core.versa-vmod.2391.CPE-1.1534403835.gz

-rw-rw-rw- 1 root root 587M Aug 16 00:22 core.versa-vsmd.2249.CPE-1.1534403863.gz

-rw-rw-rw- 1 root root 3.4M Aug 16 00:29 core.versa-vmod.2092.CPE-1.1534404581.gz

-rw-rw-rw- 1 root root 530M Aug 16 00:31 core.versa-vsmd.1895.CPE-1.1534404590.gz
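
The core file names in the listing appear to follow the pattern core.<process>.<pid>.<hostname>.<epoch seconds>.gz. That pattern is inferred from the listing, not taken from Versa documentation, so treat the sketch below as an assumption. It splits a file name into its parts and converts the epoch value into a UTC timestamp, which helps when matching a core file to log entries. The function name parse_core_name is hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumed naming scheme, inferred from the listing above:
#   core.<process>.<pid>.<hostname>.<epoch-seconds>[.gz]
CORE_NAME = re.compile(
    r"^core\.(?P<process>[^.]+)\.(?P<pid>\d+)\.(?P<host>.+)\.(?P<epoch>\d+)(?:\.gz)?$"
)


def parse_core_name(filename):
    """Split a core file name into process, PID, hostname and UTC crash time."""
    match = CORE_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected core file name: {filename}")
    return {
        "process": match["process"],
        "pid": int(match["pid"]),
        "host": match["host"],
        # The last numeric field is treated as seconds since the Unix epoch.
        "crash_time_utc": datetime.fromtimestamp(int(match["epoch"]), tz=timezone.utc),
    }


info = parse_core_name("core.versa-vsmd.1895.CPE-1.1534404590.gz")
print(info["process"], info["pid"], info["host"], info["crash_time_utc"].isoformat())
# prints: versa-vsmd 1895 CPE-1 2018-08-16T07:29:50+00:00
```

The listing shows file times in the device's local time zone, so the UTC value can differ from the listed time by the zone offset.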


(2) Take a backtrace of the core file. It lists the functions that were on the call stack when the process crashed. Look for the frame right after the "__assert_fail" line: that is the function whose assertion failed. In the example below, it is "lef_process_collector_grp_update" (frame #4). The remaining frames show the chain of callers. Once you know which functions are involved in the crash, search Bugzilla and Freshdesk for a bug or case that matches them. If you don't find a matching case or bug, create a PR.


admin@CPE-1-cli> show backtrace corefile core.versa-vsmd.1895.CPE-1.1534404590.gz

[New LWP 2943]

[New LWP 2941]

[New LWP 2946]

[New LWP 3128]

[New LWP 3142]

[New LWP 2939]

[New LWP 3145]

[New LWP 1895]

[Thread debugging using libthread_db enabled]

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Core was generated by `/opt/versa/bin/versa-vsmd -N -H 2 '.

Program terminated with signal SIGABRT, Aborted.

#0  0x00007f055225fc37 in raise () from /lib/x86_64-linux-gnu/libc.so.6

#0  0x00007f055225fc37 in raise () from /lib/x86_64-linux-gnu/libc.so.6

#1  0x00007f0552263028 in abort () from /lib/x86_64-linux-gnu/libc.so.6

#2  0x00007f0552258bf6 in ?? () from /lib/x86_64-linux-gnu/libc.so.6

#3  0x00007f0552258ca2 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6

#4  0x00000000008e4d93 in lef_process_collector_grp_update (tenant=0x7f04d350af00, coll_grp_cfg=coll_grp_cfg@entry=0x7f04d35136c0) at ../usr/module/lef/lef_cfg_process.c:1582

#5  0x00000000008e17cd in lef_cfg_clnt_process_collector_grp_cfg (msg_len=12, msg=0x7f0495be174e "ROUP\020\001\030\001", msg_params=0x7f0495be1740) at ../usr/module/lef/lef_cfg_clnt.c:142

#6  lef_cfg_clnt_process_cfg (msg_params=msg_params@entry=0x7f0495be1740, msg=msg@entry=0x7f0495be174a "\n\006CGROUP\020\001\030\001", msg_len=<optimized out>) at ../usr/module/lef/lef_cfg_clnt.c:216

#7  0x00000000008f2ee7 in lef_itc_cfg_process (cfg_itc_msg_len=<optimized out>, cfg_itc_msg=<optimized out>) at ../usr/module/lef/lef_itc.c:40

#8  lef_itc_msg_process (data=0x7f0495be173c, len=<optimized out>) at ../usr/module/lef/lef_itc.c:497

#9  0x000000000146f7cf in vs_thrm_itc_process_evmsg (evmsg=evmsg@entry=0x7f04d401e6e0, ret_msg_type=ret_msg_type@entry=0x7f054f33e75c, tstamp_diff=tstamp_diff@entry=0x7f054f33e760, ret_opq_data=ret_opq_data@entry=0x7f054f33e780) at ../usr/lib/libvsthrm/vs_thrm_itc.c:79

#10 0x00000000006b667a in vsm_process_thrm_work (works=0x7f054f33e8e0, nworks=<optimized out>, work_type=<optimized out>) at ../usr/sbin/vsm/vsm_thrm.c:2928

#11 0x000000000147201b in process_ev_n_tmr_workqs (tmr_flag=<optimized out>, ev_flag=<optimized out>, tinfo=0x7f055084e200, ctx=<optimized out>) at ../usr/lib/libvsthrm/vs_worker_threads.c:265

#12 vs_worker_thread_routine (arg=0x7f055084e200) at ../usr/lib/libvsthrm/vs_worker_threads.c:604

#13 0x000000000146a873 in vs_generic_thread_routine (arg=0x7f055084e200) at ../usr/lib/libvsthrm/vs_thrm.c:198

#14 0x000000000146d0f3 in vs_thrm_start_routine (handle=<optimized out>, tgid=<optimized out>, tgid@entry=VS_THREAD_GROUP_MAX, shared_cpu=shared_cpu@entry=false) at ../usr/lib/libvsthrm/vs_thrm.c:1511

#15 0x00000000006bdfd7 in vsm_thrm_start (arg=<optimized out>) at ../usr/sbin/vsm/vsm_thrm.c:4504

#16 0x000000000167fe85 in eal_thread_loop (arg=<optimized out>) at ../usr/lib/DPDK/lib/librte_eal/linuxapp/eal/eal_thread.c:184

#17 0x00007f0555903184 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

#18 0x00007f0552326ffd in clone () from /lib/x86_64-linux-gnu/libc.so.6

[ok][2018-08-21 02:36:25]
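
Picking out the frame after "__assert_fail" can be done by hand, but when there are many cores to triage a small parser may help. The following is a minimal sketch that assumes the backtrace text has been saved from the CLI and uses the standard gdb frame layout seen above. The helper names parse_frames and frame_after_assert are hypothetical.

```python
import re

# Matches gdb frame lines, with or without an address, for example:
#   "#3  0x00007f0552258ca2 in __assert_fail () from /lib/..."
#   "#6  lef_cfg_clnt_process_cfg (msg_params=...) at ../file.c:216"
FRAME = re.compile(r"^#(?P<num>\d+)\s+(?:0x[0-9a-fA-F]+ in )?(?P<func>\S+) \(")
LOCATION = re.compile(r" at (?P<file>\S+):(?P<line>\d+)$")


def parse_frames(backtrace_text):
    """Return one dict per stack frame found in the backtrace text."""
    frames = []
    for raw in backtrace_text.splitlines():
        line = raw.strip()
        match = FRAME.match(line)
        if match is None:
            continue  # skip "[New LWP ...]" and other non-frame lines
        loc = LOCATION.search(line)
        frames.append({
            "num": int(match["num"]),
            "func": match["func"],
            "file": loc["file"] if loc else None,
            "line": int(loc["line"]) if loc else None,
        })
    return frames


def frame_after_assert(frames):
    """Return the frame that follows __assert_fail, or None if there is none."""
    for index, frame in enumerate(frames):
        if frame["func"] == "__assert_fail" and index + 1 < len(frames):
            return frames[index + 1]
    return None


SAMPLE = """
[New LWP 2943]
#0  0x00007f055225fc37 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f0552263028 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f0552258bf6 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f0552258ca2 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00000000008e4d93 in lef_process_collector_grp_update (tenant=0x7f04d350af00, coll_grp_cfg=coll_grp_cfg@entry=0x7f04d35136c0) at ../usr/module/lef/lef_cfg_process.c:1582
"""

culprit = frame_after_assert(parse_frames(SAMPLE))
print(culprit["func"], culprit["file"], culprit["line"])
# prints: lef_process_collector_grp_update ../usr/module/lef/lef_cfg_process.c 1582
```

The function name and source location it returns are the search terms to use in Bugzilla or Freshdesk.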


(3) Unzip the core file and run gdb on it together with the binary that crashed. Type "bt" at the gdb prompt to print the backtrace. The output is similar to the backtrace above.

admin@Branch1:.../tmp/versa-cores$ sudo gunzip core.versa-vsmd.25919.Branch1.1477388878.gz

admin@Branch1:.../tmp/versa-cores$ sudo gdb /opt/versa/bin/versa-vsmd -c core.versa-vsmd.25919.Branch1.1477388878
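
When the core has been copied to an analysis host, the same two actions can be scripted. This is a sketch only: it assumes gdb and a matching versa-vsmd binary are available on that host, and the helper names decompress_core and gdb_backtrace_command are hypothetical. Unlike gunzip, the decompression helper leaves the original .gz file in place. The gdb options used (-batch, -ex, -c) are standard gdb options.

```python
import gzip
import shutil
from pathlib import Path


def decompress_core(gz_path):
    """Write an uncompressed copy next to the .gz file and return its path."""
    gz_path = Path(gz_path)
    core_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(core_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return core_path


def gdb_backtrace_command(binary, core_path):
    """Build a non-interactive gdb command that prints the backtrace and exits."""
    return ["gdb", "-batch", "-ex", "bt", str(binary), "-c", str(core_path)]


# Only builds and prints the command; nothing is executed here.
command = gdb_backtrace_command(
    "/opt/versa/bin/versa-vsmd",
    "core.versa-vsmd.25919.Branch1.1477388878",
)
print(" ".join(command))
```

The command list can be passed to subprocess.run(command, capture_output=True, text=True) to capture the backtrace as text.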


(4) Get the release info.

admin@CPE-1-cli> show system package-info


  Package             Versa FlexVNF software

  Release             16.1-R2

  Build               S3

  Release date        20180808

  Package id          6e92440

  Package name        versa-flexvnf-20180808-212106-6e92440-16.1R2S3

  Branch              16.1R2

  Creator


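(5) Check the "versa-service.log" file for anything suspicious around the time of the crash. In the excerpt below, the failed assertion names the same function and source line as frame #4 of the backtrace.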

cat /var/log/versa/versa-service.log


134331 sdb memory limit is 2147483648 bytes

134332 sdb timeout is 1 days

134333 hscan_exceed_memory is FALSE

134334 ips memory limit is 1073741824 bytes

134335 2018-08-16 00:29:50.707 NOTIC [0x201] [VSN:0] rfm_cfg_tenant_add: Tenant 3 config add

134336 2018-08-16 00:29:50.707 ERROR [0x201] vfp_ev_handler:636 Unknown event recieved

134337 versa-vsmd: ../usr/module/lef/lef_cfg_process.c:1582: lef_process_collector_grp_update: Assertion `!collector->lef_coll_grp_ptr' failed.

134338 versa-vsmd: ../usr/module/lef/lef_cfg_process.c:1582: lef_process_collector_grp_update: Assertion `!collector->lef_coll_grp_ptr' failed.
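
The assertion lines above use the standard glibc format: program, source file, line number, function, and the expression that failed. A short sketch for pulling those fields out of a saved log follows. The helper name find_assertions is hypothetical, and reading the log from a file is left to the caller.

```python
import re

# glibc prints failed assertions as:
#   <program>: <file>:<line>: <function>: Assertion `<expr>' failed.
ASSERT_LINE = re.compile(
    r"(?P<program>\S+): (?P<file>\S+):(?P<line>\d+): (?P<func>\w+): "
    r"Assertion `(?P<expr>.*)' failed\."
)


def find_assertions(log_lines):
    """Return unique (program, file, line, function, expression) tuples in order."""
    seen = []
    for text in log_lines:
        match = ASSERT_LINE.search(text)
        if match is None:
            continue
        entry = (
            match["program"],
            match["file"],
            int(match["line"]),
            match["func"],
            match["expr"],
        )
        if entry not in seen:  # the same assertion is often logged twice
            seen.append(entry)
    return seen


LOG = [
    "134336 2018-08-16 00:29:50.707 ERROR [0x201] vfp_ev_handler:636 Unknown event recieved",
    "134337 versa-vsmd: ../usr/module/lef/lef_cfg_process.c:1582: lef_process_collector_grp_update: Assertion `!collector->lef_coll_grp_ptr' failed.",
    "134338 versa-vsmd: ../usr/module/lef/lef_cfg_process.c:1582: lef_process_collector_grp_update: Assertion `!collector->lef_coll_grp_ptr' failed.",
]

for program, source, line, func, expr in find_assertions(LOG):
    print(f"{program}: {func} at {source}:{line} asserted {expr}")
```

The function and source line extracted here should agree with the frame that follows "__assert_fail" in the backtrace.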


(6) Analyse the syslog files that were written around the time of the incident and look for anything relevant.

cd /var/log

[admin@CPE-1: log] # ls -ltrh | grep syslog

-rw-r----- 1 syslog adm    13K Jul 23 02:29 kern.log.4.gz

-rw-r----- 1 syslog adm    22K Aug  5 05:51 kern.log.3.gz

-rw-r--r-- 1 syslog adm   208K Aug 12 09:28 cloud-init.log

-rw-r----- 1 syslog adm    64K Aug 12 09:30 kern.log.2.gz

-rw-r----- 1 syslog adm    14K Aug 15 00:17 syslog.7.gz

-rw-r----- 1 syslog adm   195K Aug 16 00:17 syslog.6

-rw-r----- 1 syslog adm   562K Aug 17 00:17 syslog.5

-rw-r----- 1 syslog adm    23K Aug 18 00:17 syslog.4.gz

-rw-r----- 1 syslog adm    14K Aug 19 00:17 syslog.3.gz

-rw-r----- 1 syslog adm    33K Aug 19 00:51 kern.log.1

-rw-r----- 1 syslog adm    12K Aug 20 00:17 syslog.2.gz

-rw-r----- 1 syslog adm   222K Aug 21 00:17 syslog.1

-rw-r----- 1 syslog adm    416 Aug 21 04:02 kern.log

-rw-r----- 1 syslog adm   103K Aug 21 07:37 syslog


[admin@CPE-1: log] # sudo cat syslog.6 | grep 2018-08-16
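
Some of the rotated files in the listing are gzip-compressed, so plain cat and grep cannot search them directly. The sketch below searches every file whose name starts with a given prefix, compressed or not, for a date string. The helper names open_log and grep_logs are hypothetical, and the script needs enough privileges to read the files in /var/log.

```python
import gzip
from pathlib import Path


def open_log(path):
    """Open a plain or gzip-compressed log file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "rt", encoding="utf-8", errors="replace")


def grep_logs(log_dir, name_prefix, needle):
    """Return (file name, line) pairs for every line that contains needle."""
    hits = []
    for path in sorted(Path(log_dir).glob(name_prefix + "*")):
        if not path.is_file():
            continue
        with open_log(path) as handle:
            for line in handle:
                if needle in line:
                    hits.append((path.name, line.rstrip("\n")))
    return hits
```

For example, grep_logs("/var/log", "syslog", "2018-08-16") would search syslog, syslog.1, syslog.2.gz and so on for entries from the day of the crash.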


(7) If relevant, check for configuration changes made around the time of the crash:

admin@CPE-1-cli> show commit list

2018-08-21 07:41:01

SNo. ID       User       Client      Time Stamp          Label       Comment

~~~~ ~~       ~~~~       ~~~~~~      ~~~~~~~~~~          ~~~~~       ~~~~~~~

0    10228    1001       system      2018-08-21 04:02:42

1    10227    admin      netconf     2018-08-17 00:31:32

2    10226    admin      netconf     2018-08-17 00:22:50

3    10223    admin      netconf     2018-08-13 03:13:12

4    10222    admin      netconf     2018-08-13 03:09:39

5    10221    admin      netconf     2018-08-13 03:05:16

6    10220    admin      netconf     2018-08-13 02:55:56
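
To see which commits preceded a crash, the commit list can be filtered by time. This is a minimal sketch: it assumes the column layout shown above, assumes the commit timestamps and the crash time are in the same time zone, and uses hypothetical helper names (parse_commit_list, commits_before). The three-day window is an arbitrary example value.

```python
from datetime import datetime, timedelta


def parse_commit_list(text):
    """Parse 'show commit list' output into a list of commit dicts."""
    commits = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows: SNo, ID, User, Client, date, time (Label/Comment may be empty).
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        try:
            commit_id = int(fields[1])
            stamp = datetime.strptime(fields[4] + " " + fields[5], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
        commits.append(
            {"id": commit_id, "user": fields[2], "client": fields[3], "time": stamp}
        )
    return commits


def commits_before(commits, crash_time, window):
    """Return commits made within 'window' before (or exactly at) crash_time."""
    return [c for c in commits if crash_time - window <= c["time"] <= crash_time]


COMMIT_LIST = """
2018-08-21 07:41:01
SNo. ID       User       Client      Time Stamp          Label       Comment
~~~~ ~~       ~~~~       ~~~~~~      ~~~~~~~~~~          ~~~~~       ~~~~~~~
0    10228    1001       system      2018-08-21 04:02:42
1    10227    admin      netconf     2018-08-17 00:31:32
2    10226    admin      netconf     2018-08-17 00:22:50
3    10223    admin      netconf     2018-08-13 03:13:12
4    10222    admin      netconf     2018-08-13 03:09:39
5    10221    admin      netconf     2018-08-13 03:05:16
6    10220    admin      netconf     2018-08-13 02:55:56
"""

# Crash time taken from the versa-service.log entries just before the assertion.
crash = datetime(2018, 8, 16, 0, 29, 50)
for commit in commits_before(parse_commit_list(COMMIT_LIST), crash, timedelta(days=3)):
    print(commit["id"], commit["user"], commit["client"], commit["time"])
```

Commits made after the crash, such as 10226 and 10227 in this list, are excluded because they cannot have caused it.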


(8) Collect tech-support.

request system tech-support


(9) Collect general system information.

show system package-info

show system uptime

show system status

show system details