Posts in category: Linux

Common strace commands

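strace is a command-line tool built on the Linux ptrace system call. It is very helpful for black-box triage when you have no source code. It works by intercepting and analyzing the interaction between a process and the kernel's system calls, and printing what it sees.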

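Common commands:
strace -p 26380 //attach to the running process with PID 26380 and print its system calls
strace -p 26380 -c //count time, calls, and errors per system call and print a summary on exit
sudo strace -p 4599 -e trace=all //trace every system call (the default filter), as root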

It can also be used for fault injection.

Syscall classes that -e trace= accepts:

-e trace=%desc     Trace all file descriptor related system calls.
         %file     Trace all system calls which take a file name as an argument.
         %fstat    Trace fstat and fstatat syscall variants.
         %fstatfs  Trace fstatfs, fstatfs64, fstatvfs, osf_fstatfs, and osf_fstatfs64 system calls.
         %ipc      Trace all IPC related system calls.
         %lstat    Trace lstat syscall variants.
         %memory   Trace all memory mapping related system calls.
         %network  Trace all the network related system calls.
         %process  Trace all system calls which involve process management.
         %pure     Trace syscalls that always succeed and have no arguments.
         %signal   Trace all signal related system calls.
         %stat     Trace stat syscall variants.
         %statfs   Trace statfs, statfs64, statvfs, osf_statfs, and osf_statfs64 system calls.
         %%stat    Trace syscalls used for requesting file status.
         %%statfs  Trace syscalls related to file system statistics.
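
As a rough sketch of how a class filter and the fault injection mentioned earlier fit together on one command line: the helper names inject_expr and trace_with_fault are hypothetical, and the sketch assumes a reasonably recent strace that supports both the %file class syntax and -e inject=, plus permission to use ptrace.

```shell
# Build the -e expression that makes SYSCALL fail with ERRNO.
# WHEN follows strace's "when=" syntax, e.g. 3+ means "from the 3rd call on".
inject_expr() {
    printf 'inject=%s:error=%s:when=%s\n' "$1" "$2" "$3"
}

# Run a command under strace, tracing only file-name syscalls (%file)
# and injecting the fault built above. Hypothetical helper for illustration.
trace_with_fault() {
    syscall=$1 errno=$2 when=$3
    shift 3
    strace -f -e trace=%file -e "$(inject_expr "$syscall" "$errno" "$when")" "$@"
}

inject_expr openat ENOENT 3+
# To try it for real (needs strace installed and ptrace allowed):
#   trace_with_fault openat ENOENT 3+ cat /etc/hostname
```

The injected syscall has to be part of the traced set for the fault to fire, which is why openat is paired with %file here.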

VM CPU spike caused by a memory leak in a hypervisor driver

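Today some developers reported that certain servers in one cluster, all running the same version, were showing very high JVM CPU, while JVM CPU on the other servers stayed normal. They said this had never happened before, and CPU usage on the affected servers was so much higher than on the healthy ones that an internal monitoring tool restarted them automatically. They suspected the machines were being scanned by an internal vulnerability scanner but could not confirm it, and asked SRE to help find the cause.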

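SRE first established that the high CPU usage had essentially nothing to do with the internal vulnerability scan: the scan traffic is intercepted at the application framework layer and barely reaches the application's own code, so it should not drive CPU usage up. In addition, other servers that were being scanned showed no CPU spike.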

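SRE also saw clearly that the affected servers (all of them VMs provisioned through OpenStack) had an overall CPU usage of about 40%, versus about 3% on the healthy servers. JVM CPU usage was about 8% on the affected servers and about 1% on the healthy ones. So most of the CPU was not being consumed by the JVM, although the JVM was affected to some extent.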

Looking further, all the affected servers turned out to be on the same hypervisor, and the other VMs on that hypervisor also showed elevated CPU.

After logging in to that hypervisor, the command below showed that it had a kernel memory leak:

admin@hv-8hhy:~$ smem -twk
Area                           Used      Cache   Noncache
firmware/hardware                 0          0          0
kernel image                      0          0          0
kernel dynamic memory        159.2G       6.5G     152.7G
userspace memory             139.3G     196.2M     139.1G
free memory                   15.6G      15.6G          0
----------------------------------------------------------
                             314.1G      22.3G     291.8G

In the Noncache column of the kernel dynamic memory row, 152.7G is in use, which is clearly a problem. For the Cloud team this is a known issue, and they provided a link to the kernel fix:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972
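
To put a number on the smem reading, a small awk sketch can compute what share of all non-cache memory the kernel row holds. The function name kernel_noncache_share is hypothetical, and the sketch assumes that the kernel row and the totals row use the same unit, as they do in the output quoted above.

```shell
# Print the share of total non-cache memory held by kernel dynamic memory.
# Assumes `smem -twk` style input where both values use the same unit (G here).
kernel_noncache_share() {
    awk '
        /^kernel dynamic memory/ { kernel = $NF + 0 }   # "152.7G" -> 152.7
        NF > 0                   { last = $NF + 0 }     # totals row comes last
        END { if (last > 0) printf "%.1f%%\n", 100 * kernel / last }
    '
}

# Sample input: the smem output quoted above.
kernel_noncache_share <<'EOF'
Area                           Used      Cache   Noncache
firmware/hardware                 0          0          0
kernel image                      0          0          0
kernel dynamic memory        159.2G       6.5G     152.7G
userspace memory             139.3G     196.2M     139.1G
free memory                   15.6G      15.6G          0
----------------------------------------------------------
                             314.1G      22.3G     291.8G
EOF
```

For the figures above this prints 52.3%, meaning the kernel holds more non-cache memory than all userspace processes (the VMs) combined.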

The Linux find command

  1. find ~/ -name a.txt
  2. find . -type f -name tech //find file named tech
  3. find . -type d -name tech //find dir named tech
  4. find . -iname tech //ignore case, find both TECH, tech, Tech, etc
  5. find /var -name "*.log"
  6. find . -type f -perm 0777 -print //find all files whose permission is 777
  7. find / -type f ! -perm 777 //find all the files without permission 777
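  8. find / -perm /u=r // find files that their owner can read
  9. find / -perm /a=x // find files with at least one execute bit set (user, group, or other)
  10. find / -name foo.bar -print 2>/dev/null //send "Permission denied" errors to /dev/null
  11. find . -maxdepth 2 -name "*.bar" -print //descend at most 2 levels; the quotes keep the shell from expanding *.bar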
  12. find ./dir1 ./dir2 -name foo.bar -print //search 2 dirs
  13. find /some/directory -type l -print //find symbolic links
    Values for -type:
    b    block (buffered) special
    c    character (unbuffered) special
    d    directory
    p    named pipe (FIFO)
    f    regular file
    l    symbolic link
    s    socket
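
A few of the commands above can be tried safely in a throwaway directory. This is only a sketch: the variable name demo and the file names are made up for the illustration, and -iname and -maxdepth assume GNU or BSD find.

```shell
# Build a throwaway tree so the commands are safe to run anywhere.
demo=$(mktemp -d)
mkdir -p "$demo/tech" "$demo/a/b/c"
touch "$demo/a/Tech" "$demo/a/b/one.bar" "$demo/a/b/c/deep.bar"
ln -s "$demo/a/Tech" "$demo/link"

find "$demo" -type d -name tech          # only the directory named tech
find "$demo" -iname tech                 # tech and Tech, regardless of type
find "$demo" -maxdepth 3 -name "*.bar"   # one.bar (depth 3), not deep.bar (depth 4)
find "$demo" -type l                     # only the symbolic link
```

Depth is counted from the starting point, which is depth 0, so a file three directories down needs -maxdepth 3 or more.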

Other expressions you can use are listed below. A numeric argument n can be written as +n (more than n), -n (less than n), or n (exactly n):

-amin n - The file was last accessed n minutes ago
-anewer file - The file was last accessed more recently than file was modified
-atime n - The file was last accessed n days (n*24 hours) ago
-cmin n - The file's status was last changed n minutes ago
-cnewer file - The file's status was last changed more recently than file was modified
-ctime n - The file's status was last changed n days (n*24 hours) ago
-empty - The file is empty
-executable - The file is executable
-false - Always false
-fstype type - The file is on the specified file system
-gid n - The file belongs to group with the ID n
-group groupname - The file belongs to the named group
-ilname pattern - Like -lname, but the match is case-insensitive
-iname pattern - Like -name, but the match is case-insensitive
-inum n - The file has inode number n
-ipath pattern - Like -path, but the match is case-insensitive
-iregex pattern - Like -regex, but the match is case-insensitive
-links n - The file has n hard links
-lname pattern - The file is a symbolic link whose target matches pattern
-mmin n - File's data was last modified n minutes ago
-mtime n - File's data was last modified n days ago
-name name - Search for a file with the specified name
-newer name - Search for a file edited more recently than the file given
-nogroup - The file's numeric group ID does not correspond to any group
-nouser - The file's numeric user ID does not correspond to any user
-path path - Search for a path
-readable - Find files which are readable
-regex pattern - Search for files matching a regular expression
-type type - Search for a particular type
-uid n - The file's numeric user ID is n
-user name - File is owned by user specified
-writable - Search for files that can be written to
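
The time and emptiness tests listed above are easiest to understand on files with known timestamps. A minimal sketch, assuming GNU or BSD find (-mmin and -empty are not in POSIX); the variable name work and the file names are invented for the illustration.

```shell
work=$(mktemp -d)
: > "$work/empty.txt"                        # zero-length file
echo data > "$work/fresh.log"                # modified just now
echo data > "$work/old.log"
touch -t 202001010000 "$work/old.log"        # backdate mtime to 2020-01-01

find "$work" -type f -empty                  # empty.txt
find "$work" -type f -mtime +30              # old.log: modified over 30 full days ago
find "$work" -type f -mmin -5 ! -empty       # fresh.log: recent and not empty
find "$work" -type f -newer "$work/old.log"  # empty.txt and fresh.log
```

Tests written next to each other are combined with a logical AND, and ! negates the test that follows it.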

Linux local port usage

While debugging a problem today I ran into the following exception:
java.net.ConnectException: Cannot assign requested address

Most explanations online say it means the local ports available for outbound connections have been used up.

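  1. First check the local ulimit settings to see whether they are too low
    $ ulimit -a
  2. If they are not too low, check the current port (socket) usage
    $ ss -s
  3. Check the configured range of local ports for outbound connections:
    $ cat /proc/sys/net/ipv4/ip_local_port_range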
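
The range file holds two numbers, the lowest and highest local port, so the number of usable ports is a simple subtraction. A minimal sketch (port_count is a hypothetical helper name; the sample range 32768 60999 is a commonly seen default and may differ on your host):

```shell
# Number of ports in an inclusive "low high" range, the format used by
# /proc/sys/net/ipv4/ip_local_port_range.
port_count() {
    echo "$1" | {
        read -r low high
        echo $(( high - low + 1 ))
    }
}

port_count "32768 60999"     # prints 28232

# On a Linux host, apply it to the live setting.
if [ -r /proc/sys/net/ipv4/ip_local_port_range ]; then
    port_count "$(cat /proc/sys/net/ipv4/ip_local_port_range)"
fi
```

Each outbound connection from one source address to the same destination address and port needs its own local port, so this count is roughly the ceiling on such connections, including those still in TIME_WAIT.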

More Linux network configuration parameters: https://www.tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.kernel.obscure.html#AEN1252

Reference:
https://ma.ttias.be/linux-increase-ip_local_port_range-tcp-port-range/

About the Linux ps command

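I use it all the time, but I had not realized how much information it can provide. ps is short for Process Status. The output of top is similar to that of ps, except that it refreshes in real time.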

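ps --help all //show all command-line options
ps L //list all available format specifiers
ps H 16705 //show the threads of a specific process
ps --forest //show parent-child relationships between processes as a tree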

ps -o ppid,pid,lwp,nlwp,%cpu,%mem,cputime,cmd,args k -%cpu H 16705 //list all threads of one process in a custom format, sorted by CPU usage (%cpu) in descending order

About nlwp in the format list: Number of Lightweight Processes, which basically amounts to the number of threads a program has running.
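
One way to cross-check nlwp on Linux is the Threads: line in /proc/PID/status, which reports the same count. A small sketch, with thread_count as a hypothetical helper name:

```shell
# Thread count of a process, read from the "Threads:" line of
# /proc/PID/status (Linux only). Should agree with: ps -o nlwp= -p PID
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

thread_count $$    # thread count of the current shell
```

Running it against a JVM's PID usually shows a much larger number, since the JVM starts GC, compiler, and application threads.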

Between https://www.pslinux.online/index.php and ps --help all you can usually find the option you need.