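On Android, when native code misbehaves, the kernel sends a signal to the application process. The process can install its own handler for that signal and use it to collect native crash information, among other things.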

This article roughly covers the following topics.

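Linux signals

In computing, signals are a limited form of inter-process communication used in Unix, Unix-like, and other POSIX-compliant operating systems. A signal is an asynchronous notification that tells a process an event has occurred. When a signal is sent to a process, the operating system interrupts the process's normal flow of control, and any non-atomic operation can be interrupted at that point. If the process has registered a handler for the signal, that handler runs; otherwise the default action is taken.

Signals are similar to interrupts. The difference is that interrupts are raised by the processor and handled by the kernel, whereas signals are generated by the kernel (possibly as the result of a system call) and handled by the user process. The kernel may also turn an interrupt into a signal and deliver it to the process that caused it.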

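Common signals

Signals that cannot be caught

SIGKILL: cannot be caught or ignored, and the receiving process gets no chance to do any cleanup
SIGSTOP: likewise cannot be caught or ignored; it suspends the process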
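
The restriction on SIGKILL and SIGSTOP can be observed directly: sigaction() refuses to install a handler for either of them. The sketch below is a minimal illustration; can_install_handler and noop_handler are hypothetical names introduced here, not part of the NativeCatcher code.

```cpp
#include <signal.h>

// Hypothetical do-nothing handler, used only to probe whether installation succeeds.
static void noop_handler(int) {}

// Hypothetical helper: returns true if a handler can be installed for `signo`.
bool can_install_handler(int signo) {
    struct sigaction action {};
    struct sigaction previous {};
    action.sa_handler = noop_handler;
    sigemptyset(&action.sa_mask);
    if (sigaction(signo, &action, &previous) != 0) {
        // For SIGKILL and SIGSTOP the call fails (errno is EINVAL).
        return false;
    }
    // Put the previous disposition back so the probe has no lasting effect.
    sigaction(signo, &previous, nullptr);
    return true;
}
```

Calling can_install_handler(SIGKILL) or can_install_handler(SIGSTOP) returns false, while an ordinary signal such as SIGINT returns true.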

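Sending signals

From the terminal
- Ctrl-C sends SIGINT, which terminates the process by default
- Ctrl-Z sends SIGTSTP, which suspends the process
- Ctrl-\ sends SIGQUIT, which terminates the process and writes a core dump to disk
From a program
- Through the kill() system call
- Exceptions such as division by zero or a segmentation fault generate signals
- The kernel can send signals to a process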
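
As a small illustration of sending a signal from a program, the sketch below installs a temporary handler for SIGUSR1 and then sends that signal to itself. It uses raise(), which targets the calling thread and does not return until the handler has run; kill(getpid(), SIGUSR1) would be the process-directed equivalent. The names send_and_catch_sigusr1, record_signal and g_last_signal are made up for this example.

```cpp
#include <signal.h>

// Written from the handler; sig_atomic_t is the type that is safe to write there.
static volatile sig_atomic_t g_last_signal = 0;

static void record_signal(int signo) { g_last_signal = signo; }

// Hypothetical helper: installs a temporary SIGUSR1 handler, sends SIGUSR1 to
// the calling thread, and returns the signal number the handler recorded.
int send_and_catch_sigusr1() {
    struct sigaction action {};
    struct sigaction previous {};
    action.sa_handler = record_signal;
    sigemptyset(&action.sa_mask);
    if (sigaction(SIGUSR1, &action, &previous) != 0) {
        return -1;
    }

    g_last_signal = 0;
    raise(SIGUSR1);  // the handler has returned by the time raise() returns

    // Restore whatever was installed before.
    sigaction(SIGUSR1, &previous, nullptr);
    return g_last_signal;
}
```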

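Handling signals

sigaction

The sigaction() system call changes the action a process takes when it receives a specific signal, and it is how a signal handler is installed. If no handler has been installed for a signal, the default action applies; otherwise the process catches the signal and the installed handler is called.

Here I define a NativeCatcher namespace to collect native crashes.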

#include <signal.h>

namespace NativeCatcher {
    // Only catch the signals that terminate the process
    const int SIGNALS_LEN = 7;
    const int signal_array[] = {SIGILL, SIGABRT, SIGBUS, SIGFPE, SIGSEGV, SIGSTKFLT, SIGSYS};
    // Stores the handlers that were installed before ours (the system defaults)
    struct sigaction old_signal_handlers[SIGNALS_LEN];

    void init();

    void signal_handler(int, siginfo_t *, void *);

    void make_crash();
}

init is called from JNI_OnLoad to install the signal handler:

static jclass CLASS = nullptr;

extern "C" jint JNI_OnLoad(JavaVM *vm, void *reserved) {
    NativeCatcher::init();
    //...
    return JNI_VERSION_1_4;
}

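Register the handler and keep the previous (default) one. struct sigaction is the argument type of the sigaction() system call. Its sa_flags field configures, among other things, what data is delivered with the signal: if sa_flags contains SA_SIGINFO, the handler is stored in sa_sigaction and must have the type void (*)(int, siginfo_t *, void *); otherwise it is stored in sa_handler and must have the type void (*)(int).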

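void NativeCatcher::init() {
    // Assign the fields one by one: the member order of struct sigaction can
    // differ between ABIs, so order-sensitive designated initializers are fragile.
    struct sigaction handler = {};
    handler.sa_sigaction = NativeCatcher::signal_handler;
    handler.sa_flags = SA_SIGINFO;
    sigemptyset(&handler.sa_mask);
    for (int i = 0; i < SIGNALS_LEN; ++i) {
        sigaction(signal_array[i], &handler, &old_signal_handlers[i]);
    }
}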
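
To make the difference between the two handler forms concrete, here is a minimal sketch that registers each form for SIGUSR2 and raises the signal once. It runs on any Linux system and does not touch the crash signals above. All names in it (plain_handler, info_handler, raise_with, demo_plain_handler, demo_info_handler) are hypothetical and exist only for this illustration.

```cpp
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t g_plain_calls = 0;
static volatile sig_atomic_t g_info_signo = 0;
static volatile sig_atomic_t g_info_sender_is_self = 0;

// One-argument form, used when SA_SIGINFO is not set.
static void plain_handler(int) { g_plain_calls = g_plain_calls + 1; }

// Three-argument form, used when SA_SIGINFO is set.
static void info_handler(int, siginfo_t *info, void *) {
    g_info_signo = info->si_signo;
    // For a signal sent with raise()/kill(), si_pid is the sender's pid.
    g_info_sender_is_self = (info->si_pid == getpid());
}

// Hypothetical helper: installs `action` for SIGUSR2, raises the signal once,
// then restores the previous disposition.
static bool raise_with(struct sigaction action) {
    struct sigaction previous {};
    sigemptyset(&action.sa_mask);
    if (sigaction(SIGUSR2, &action, &previous) != 0) {
        return false;
    }
    raise(SIGUSR2);
    return sigaction(SIGUSR2, &previous, nullptr) == 0;
}

bool demo_plain_handler() {
    struct sigaction action {};
    action.sa_handler = plain_handler;  // sa_flags stays 0
    g_plain_calls = 0;
    return raise_with(action) && g_plain_calls == 1;
}

bool demo_info_handler() {
    struct sigaction action {};
    action.sa_sigaction = info_handler;
    action.sa_flags = SA_SIGINFO;
    g_info_signo = 0;
    g_info_sender_is_self = 0;
    return raise_with(action) && g_info_signo == SIGUSR2 && g_info_sender_is_self == 1;
}
```

For a crash collector the SA_SIGINFO form is the useful one, because only that form receives the siginfo_t and the context pointer.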

The signal handler:

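void NativeCatcher::signal_handler(int signal, siginfo_t *info, void *context) {
    // Do our own processing first. For fault signals such as SIGSEGV, si_addr
    // holds the faulting address; si_pid and si_uid are only meaningful for
    // signals sent by a process (kill(), sigqueue()).
    const int code = info->si_code;
    LOG_D("handler signal %d, code: %d, fault addr: %p, pid: %d, tid: %d",
          signal,
          code,
          info->si_addr,
          getpid(),
          gettid()
    );

    // Find the previously installed handler for this signal
    int index = -1;
    for (int i = 0; i < SIGNALS_LEN; ++i) {
        if (signal_array[i] == signal) {
            index = i;
            break;
        }
    }
    if (index == -1) {
        LOG_E("Not found match handler");
        _exit(code);  // _exit() is async-signal-safe, exit() is not
    }
    struct sigaction old = old_signal_handlers[index];
    // Call the previous handler, honoring the way it was registered
    if (old.sa_flags & SA_SIGINFO) {
        old.sa_sigaction(signal, info, context);
    } else if (old.sa_handler != SIG_DFL && old.sa_handler != SIG_IGN) {
        old.sa_handler(signal);
    } else {
        // No real handler to call: restore the old disposition and re-raise
        sigaction(signal, &old, nullptr);
        raise(signal);
    }
}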

Simulating a crash:

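void NativeCatcher::make_crash() {
    // volatile keeps the compiler from folding the division away; integer
    // division by zero raises SIGFPE on x86/x86_64
    volatile int a = 0;
    int i = 10 / a;
    (void) i;
}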
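
When experimenting with crashes it helps to see which signal actually terminated the process. The sketch below is an illustration only and not part of NativeCatcher: it runs a crashing function in a forked child and reads the terminating signal from the parent with waitpid(). terminating_signal_of and crash_with_abort are hypothetical names.

```cpp
#include <cstdlib>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: runs `crasher` in a forked child. Returns the number of
// the signal that terminated the child, 0 if it exited normally, -1 on error.
int terminating_signal_of(void (*crasher)()) {
    pid_t child = fork();
    if (child < 0) {
        return -1;
    }
    if (child == 0) {
        crasher();
        _exit(0);  // only reached if crasher() did not crash
    }
    int status = 0;
    if (waitpid(child, &status, 0) != child) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

// abort() raises SIGABRT, one of the signals in signal_array.
void crash_with_abort() { std::abort(); }
```

terminating_signal_of(crash_with_abort) reports SIGABRT. Passing a function like make_crash would be expected to report SIGFPE on x86, but the result of an integer division by zero depends on the architecture.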

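Getting crash data

The third argument of the signal handler, void *context, is what we use to collect crash data. It points to a ucontext_t.

The ucontext_t type is a structure type suitable for holding the context for a user thread of execution. A thread’s context includes its stack, saved registers, and list of blocked signals

The ucontext_t structure holds the context of the thread in which the fault occurred:

The execution stack
The saved registers
The list of blocked signals

The individual fields:

uc_link: the context to resume when the current context returns (if uc_link is NULL, the thread exits when the current context returns)
uc_sigmask: the blocked signals
uc_stack: the execution stack
uc_mcontext: the saved registers (this field depends on the processor architecture)

Registers and related state differ between processor architectures. This is the definition of ucontext_t on ARM: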

#if defined(__arm__)

#define NGREG 18 /* Like glibc. */

typedef int greg_t;
typedef greg_t gregset_t[NGREG];
typedef struct user_fpregs fpregset_t;

#include <asm/sigcontext.h>
typedef struct sigcontext mcontext_t;

typedef struct ucontext {
  unsigned long uc_flags;
  struct ucontext* uc_link;
  stack_t uc_stack;
  mcontext_t uc_mcontext;
  sigset_t uc_sigmask;
  /* Android has a wrong (smaller) sigset_t on ARM. */
  uint32_t __padding_rt_sigset;
  /* The kernel adds extra padding after uc_sigmask to match glibc sigset_t on ARM. */
  char __padding[120];
  unsigned long uc_regspace[128] __attribute__((__aligned__(8)));
} ucontext_t;

As the definition shows, NGREG is 18 on ARM, so gregset_t holds 18 registers, and mcontext_t is the struct sigcontext defined in the ARM version of asm/sigcontext.h:

struct sigcontext {
  unsigned long trap_no;
  unsigned long error_code;
  unsigned long oldmask;
  unsigned long arm_r0;
  unsigned long arm_r1;
  unsigned long arm_r2;
  unsigned long arm_r3;
  unsigned long arm_r4;
  unsigned long arm_r5;
  unsigned long arm_r6;
  unsigned long arm_r7;
  unsigned long arm_r8;
  unsigned long arm_r9;
  unsigned long arm_r10;
  unsigned long arm_fp;
  unsigned long arm_ip;
  unsigned long arm_sp;
  unsigned long arm_lr;
  unsigned long arm_pc;
  unsigned long arm_cpsr;
  unsigned long fault_address;
};

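arm_pc in sigcontext is the ARM processor's PC register.

Locating the faulting code

When native code crashes, we would like to see the source file and line number of the faulting code directly in the output. That involves the file format of so files (ELF) and part of how native programs run (the program counter).

The program counter is usually called the PC. Processors generally have a dedicated register for it, the PC register, which holds the memory address of the instruction currently being executed. Once we have that address, the addr2line tool can map it to a source line.

The following function returns the address stored in the program counter: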

// uintptr_t rather than uint, so that 64-bit addresses are not truncated
uintptr_t NativeCatcher::get_pc(const void *context);

Each processor architecture exposes the program counter through a different field, so it has to be read per architecture:

const auto *ucxt = static_cast<const ucontext_t *>(context);
// uintptr_t is wide enough for an address on every ABI; 0 means the
// architecture is not handled below
uintptr_t absolute_pc = 0;
#if defined(__arm__)
    absolute_pc = ucxt->uc_mcontext.arm_pc;
#elif defined(__aarch64__)
    absolute_pc = ucxt->uc_mcontext.pc;
#elif defined(__i386__)
    absolute_pc = ucxt->uc_mcontext.gregs[REG_EIP];
#elif defined(__mips__)
    absolute_pc = ucxt->uc_mcontext.pc;
#elif defined(__x86_64__)
    absolute_pc = ucxt->uc_mcontext.gregs[REG_RIP];
#endif
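
The same idea can be tried on a desktop Linux machine without crashing anything. The following sketch raises SIGUSR1 and reads the program counter from the context passed to the handler. It is an assumption-laden illustration: it covers only the x86 and ARM families, uses the glibc/bionic field names shown above, and the names pc_from_context, capture_handler and capture_pc_via_signal are made up here.

```cpp
#include <cstdint>
#include <signal.h>
#include <ucontext.h>

// Written from the handler. uintptr_t is not sig_atomic_t, which is acceptable
// for a single-threaded sketch but not a guarantee the standard gives.
static volatile uintptr_t g_captured_pc = 0;

// Same idea as get_pc() in the article, limited to the architectures listed.
static uintptr_t pc_from_context(const void *context) {
    const auto *ucxt = static_cast<const ucontext_t *>(context);
#if defined(__x86_64__)
    return static_cast<uintptr_t>(ucxt->uc_mcontext.gregs[REG_RIP]);
#elif defined(__i386__)
    return static_cast<uintptr_t>(ucxt->uc_mcontext.gregs[REG_EIP]);
#elif defined(__aarch64__)
    return static_cast<uintptr_t>(ucxt->uc_mcontext.pc);
#elif defined(__arm__)
    return static_cast<uintptr_t>(ucxt->uc_mcontext.arm_pc);
#else
    (void) ucxt;
    return 0;  // architecture not covered by this sketch
#endif
}

static void capture_handler(int, siginfo_t *, void *context) {
    g_captured_pc = pc_from_context(context);
}

// Hypothetical helper: raises SIGUSR1 and returns the interrupted pc.
uintptr_t capture_pc_via_signal() {
    struct sigaction action {};
    struct sigaction previous {};
    action.sa_sigaction = capture_handler;
    action.sa_flags = SA_SIGINFO;
    sigemptyset(&action.sa_mask);
    if (sigaction(SIGUSR1, &action, &previous) != 0) {
        return 0;
    }
    g_captured_pc = 0;
    raise(SIGUSR1);
    sigaction(SIGUSR1, &previous, nullptr);
    return g_captured_pc;
}
```

Because the signal is raised explicitly, the captured pc points into the C library code that sent it rather than at a faulting instruction; in a real crash handler it would be the address of the instruction that faulted.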

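The program counter holds the absolute address of the current instruction in memory, but addr2line needs the address relative to the so that contains the instruction. So we first find the address at which the faulting shared object (so) was loaded, then subtract that load address from the absolute address to get the offset: relative address = absolute address (pc) - load address of the so. The dladdr library function finds the so that an absolute address belongs to, along with the address at which that so was loaded: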

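Dl_info dl_info;
LOG_D("calculate pc(%lu)", static_cast<unsigned long>(absolute_pc));
int result = dladdr(reinterpret_cast<void *>(absolute_pc), &dl_info);
if (result && dl_info.dli_fname) {
    // Address at which the so was loaded
    uintptr_t base = reinterpret_cast<uintptr_t>(dl_info.dli_fbase);
    // Name of the function the pc belongs to (dli_sname may be null)
    LOG_D("symbol is %s", dl_info.dli_sname ? dl_info.dli_sname : "?");
    // Compute the relative address
    uintptr_t relative_pc = absolute_pc - base;
    return relative_pc;
}
return 0;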

Output:

D: calculate pc(613568945)
D: share object is /data/app/.../lib/x86_64/libnative-catcher.so
D: symbol is _ZN13NativeCatcher10make_crashEv
D: relative pc register: 00000000000011b1
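
The subtraction described above can also be wrapped in a small standalone function. The sketch below is one possible shape for it, assuming a dynamically linked program on Linux or Android (older glibc versions need -ldl at link time). ResolvedAddress and resolve_address are hypothetical names, not something the article's code defines.

```cpp
#include <cstdint>
#include <dlfcn.h>

// Hypothetical result type for this sketch.
struct ResolvedAddress {
    bool found = false;
    const char *object_path = nullptr;  // dli_fname: file containing the address
    const char *symbol = nullptr;       // dli_sname: nearest symbol, may be null
    uintptr_t relative = 0;             // absolute address minus load address
};

// Maps an absolute address to (containing object, offset inside that object).
ResolvedAddress resolve_address(const void *absolute) {
    ResolvedAddress out;
    Dl_info info {};
    // dladdr() returns 0 when no loaded object contains the address.
    if (dladdr(absolute, &info) == 0 || info.dli_fname == nullptr) {
        return out;
    }
    out.found = true;
    out.object_path = info.dli_fname;
    out.symbol = info.dli_sname;
    out.relative = reinterpret_cast<uintptr_t>(absolute) -
                   reinterpret_cast<uintptr_t>(info.dli_fbase);
    return out;
}
```

Feeding it the pc taken from the signal context yields the relative value that is then passed to addr2line.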

Use addr2line to find the function and line number for the pc:

▶ x86_64_addr2line -e libnative-catcher.so -f 00000000000011b1
_ZN13NativeCatcher10make_crashEv
??:?

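Debug builds

The output shows ??:? because the line number could not be found. The Android Gradle Plugin strips so files by default when building native code, which removes all debug information from the so. Stripping can be disabled for debug builds: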

android {
    ...
    buildTypes {
        ...
        debug {
            packagingOptions {
                doNotStrip "*/x86_64/*.so"
            }
        }
    }
    ...
}

With the so from that debug build, addr2line now reports the source line of the faulting code:

▶ x86_64_addr2line -e libnative-catcher.so -f 00000000000011b1
_ZN13NativeCatcher10make_crashEv
.../src/main/cpp/native_catcher.cpp:92

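Release builds

As mentioned above, stripping is what prevents us from mapping the fault to a source line. But an unstripped so contains a lot of symbol and debug information that is not needed at run time, which makes the so, and therefore the APK, much larger; shipping debug information in the so is also a security concern. So how can a release build be stripped and still let us locate the faulting code with addr2line?

Our project currently uses bugly for crash reporting, and bugly can even turn native stacks from release builds back into source lines. I could not help wondering how it does that. After some searching, I found that the bugly SDK uses a tool called SymtabToolAndroid to collect symbol tables. bugly's strategy is this: collect the debug information from the debug so that still has its symbol table and store it in a database; when the release so crashes in production, first determine the pc (the native program counter), then look up the source line in the previously collected debug symbol table.

I found the SymtabToolAndroid library in bugly's maven repository and tried exporting a symbol table with it: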

static String path = ".../libnative-catcher.so";

public static void main(String[] args) {
    SymtabToolAndroid.main(new String[]{"-i", path});
    System.out.println(SymtabToolAndroid.symtabFileName);
}

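Result:

[Figure: the symbol file exported by bugly]

With this result in hand, the next step was to find out how bugly generates the file. After some digging:

The breakpoint marks the core of it: bugly stores these parts of the so as a mapping from memory address to source line.

An so file is an ELF file, and .debug_info, .debug_line, .debug_str, and .debug_ranges are names of sections in an ELF file.

ELF file sections

As an executable and linkable binary format, ELF lets whoever produces the file write multiple sections into it. We only look at the sections named above; there are others that are not covered here.

The readelf tool in the Android NDK shows the detailed contents of an ELF file. readelf -S lists its sections: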

▶ x86_64_readelf libnative-catcher.so -S
There are 38 section headers, starting at offset 0xec78:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [25] .debug_str        PROGBITS         0000000000000000  0000206d
       0000000000005229  0000000000000001  MS       0     0     1
  [28] .debug_info       PROGBITS         0000000000000000  00007ce5
       0000000000005a51  0000000000000000           0     0     1
  [29] .debug_ranges     PROGBITS         0000000000000000  0000d736
       0000000000000070  0000000000000000           0     0     1
  [33] .debug_line       PROGBITS         0000000000000000  0000dec9
       00000000000005ed  0000000000000000           0     0     1
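
For readers who want to see where that listing comes from, the section names can also be read programmatically. The sketch below walks the section header table using the structures from <elf.h>. It is a simplified illustration: it only handles files whose ELF class matches the machine it runs on, it ignores the extended section numbering used by very large files, and list_sections and has_section are hypothetical helpers.

```cpp
#include <cstdio>
#include <cstring>
#include <elf.h>
#include <link.h>  // ElfW(): picks Elf32_* or Elf64_* for the native word size
#include <string>
#include <vector>

// Hypothetical helper: returns the names of all sections, empty on failure.
std::vector<std::string> list_sections(const char *path) {
    std::vector<std::string> names;
    std::FILE *file = std::fopen(path, "rb");
    if (file == nullptr) {
        return names;
    }

    // 1. Read and sanity-check the ELF header.
    ElfW(Ehdr) header {};
    bool ok = std::fread(&header, sizeof(header), 1, file) == 1 &&
              std::memcmp(header.e_ident, ELFMAG, SELFMAG) == 0 &&
              header.e_shentsize == sizeof(ElfW(Shdr)) &&
              header.e_shstrndx < header.e_shnum;

    // 2. Read the section header table located at e_shoff.
    std::vector<ElfW(Shdr)> sections;
    if (ok) {
        sections.resize(header.e_shnum);
        ok = std::fseek(file, static_cast<long>(header.e_shoff), SEEK_SET) == 0 &&
             std::fread(sections.data(), sizeof(ElfW(Shdr)), sections.size(), file) ==
                 sections.size();
    }

    // 3. Read the section name string table and resolve each sh_name offset.
    if (ok) {
        const ElfW(Shdr) &strtab = sections[header.e_shstrndx];
        std::vector<char> strings(strtab.sh_size + 1, '\0');
        ok = std::fseek(file, static_cast<long>(strtab.sh_offset), SEEK_SET) == 0 &&
             std::fread(strings.data(), 1, strtab.sh_size, file) == strtab.sh_size;
        if (ok) {
            for (const auto &section : sections) {
                if (section.sh_name < strtab.sh_size) {
                    names.emplace_back(strings.data() + section.sh_name);
                }
            }
        }
    }
    std::fclose(file);
    return names;
}

// Hypothetical helper: true if the file has a section called `name`.
bool has_section(const char *path, const char *name) {
    for (const auto &section : list_sections(path)) {
        if (section == name) {
            return true;
        }
    }
    return false;
}
```

Checking has_section(path, ".debug_line") is a quick way to tell whether a given so still carries line information or has been stripped.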

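How do we read these sections out of an ELF file (an so, in our case)? While looking for a way, I came across a new term: DWARF. It turns out that the debug information in ELF files is read and written according to a standard called DWARF, which is described in detail on its official site, http://dwarfstd.org/.

readelf's --debug-dump option prints the contents of a given section:

debug_line

--debug-dump=decodedline decodes and prints the contents of .debug_line, which records the address each source line was compiled to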

▶ x86_64_readelf libnative-catcher.so --debug-dump=decodedline
Decoded dump of debug contents of section .debug_line:

File name                            Line number    Starting address
../native_catcher.cpp:
native_catcher.cpp                            91              0x11a0
native_catcher.cpp                            92              0x11a9
native_catcher.cpp                            93              0x11b0
native_catcher.cpp                            93              0x11b4
native_catcher.cpp                            94              0x11b7

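This is the .debug_line section of the debug so. It contains the 0x11b0 row we were looking for: the faulting address 0x11b1 falls in the range that starts there, which maps to source line 93, within one line of the addr2line result above.

So, to map a faulting address to a source line for a release build that carries no symbol information, we can export and store the contents of the .debug_line section of the so that still has debug information, then match the faulting address against it.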
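
The matching step can be sketched as a search over the exported rows. The code below is a minimal illustration, not bugly's implementation: it assumes the rows have already been parsed into (address, line) pairs sorted by address, and it picks the last row whose starting address is not greater than the faulting address. LineRow, line_for_address and sample_rows are hypothetical names; the sample rows are copied from the decodedline dump shown earlier.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// One row of the decoded line table: a starting address and its source line.
struct LineRow {
    uintptr_t address;
    int line;
};

// Returns the source line for `address`, or -1 if it lies before the first row.
// `rows` must be sorted by address. A real line table also marks where each
// sequence ends; this sketch ignores that and treats the last row as open-ended.
int line_for_address(const std::vector<LineRow> &rows, uintptr_t address) {
    // First row whose starting address is greater than `address`.
    auto next = std::upper_bound(
        rows.begin(), rows.end(), address,
        [](uintptr_t value, const LineRow &row) { return value < row.address; });
    if (next == rows.begin()) {
        return -1;
    }
    // The row just before it covers `address`.
    return std::prev(next)->line;
}

// Rows taken from the decodedline output for native_catcher.cpp.
std::vector<LineRow> sample_rows() {
    return {{0x11a0, 91}, {0x11a9, 92}, {0x11b0, 93}, {0x11b4, 93}, {0x11b7, 94}};
}
```

With these rows, line_for_address(sample_rows(), 0x11b1) returns 93, because 0x11b1 lies between the rows that start at 0x11b0 and 0x11b4.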

PS: the --debug-dump output for .debug_line is grouped by function. Each block holds the lines of one function, and the blocks are separated by the file path:

File name                            Line number    Starting address
cpp/native_catcher.cpp:
native_catcher.cpp                            16               0xcf0

cpp/native_catcher.cpp:
native_catcher.cpp                            26               0xda0

cpp/native_catcher.cpp:
native_catcher.cpp                            55               0xf30

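Getting the call stack

There are several ways to obtain the call stack. The most widely used at the moment is Google's breakpad.

The system <unwind.h> library
The system libcorkscrew.so
The open-source library coffeecatch
Google breakpad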
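
As an illustration of the first option, the sketch below collects the return addresses of the current call stack with _Unwind_Backtrace from <unwind.h>. It captures the stack of the calling function in normal code; unwinding from inside a signal handler through the signal frame needs extra care that this example does not attempt. BacktraceState, on_frame and capture_backtrace are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <unwind.h>

namespace {

// Cursor into the caller-provided output buffer.
struct BacktraceState {
    uintptr_t *current;
    uintptr_t *end;
};

// Called by the unwinder once per stack frame.
_Unwind_Reason_Code on_frame(struct _Unwind_Context *context, void *arg) {
    auto *state = static_cast<BacktraceState *>(arg);
    uintptr_t pc = _Unwind_GetIP(context);
    if (pc != 0) {
        if (state->current == state->end) {
            return _URC_END_OF_STACK;  // buffer full, stop unwinding
        }
        *state->current++ = pc;
    }
    return _URC_NO_REASON;  // continue with the next frame
}

}  // namespace

// Fills `buffer` with up to `max_frames` program counter values, innermost
// frame first, and returns how many were written.
size_t capture_backtrace(uintptr_t *buffer, size_t max_frames) {
    BacktraceState state{buffer, buffer + max_frames};
    _Unwind_Backtrace(on_frame, &state);
    return static_cast<size_t>(state.current - buffer);
}
```

Each captured address can then go through the same dladdr and addr2line steps that were used for the pc earlier.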

Breakpad

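Breakpad consists of several main parts:

client: a library that developers link into their application. It captures the state of the current threads and the identities of the loaded executable and shared libraries, and writes them to a minidump file. Developers can configure the client to write a minidump when a crash occurs or when explicitly requested
symbol dumper: a program that reads the debug information produced by the compiler and generates a symbol file in Breakpad's own format
processor: a program that reads a minidump file, finds the matching symbol files for the executable and shared libraries it refers to, and produces a readable C/C++ call stack

How Android handles native crashes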