
Diagnosing very high CPU usage caused by the puppet agent

Problem description:
One machine stood out: its CPU usage was noticeably higher than its peers, yet it was not doing any more work than they were:
[screenshot: PerfMon.png]

Collecting data:
Log in to the machine and check per-process CPU usage; the puppet process running as root is the prime suspect:
top -i
[screenshot: top.png]
A closer look at the suspect:
ps auxwww
[screenshot: pdetail.png]
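If you prefer a one-shot snapshot instead of the interactive top view, something like this works (standard procps options, nothing specific to this incident):

ps aux --sort=-pcpu | head -10     # processes sorted by CPU usage, heaviest first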


Because strace attaches to a single thread (task), if the process has multiple threads you first need to find out which thread is burning the CPU; otherwise you may end up tracing the wrong one.
After starting top -i, press capital H to switch to thread mode,
or use top -H -p <pid>
[screenshot: threads.png]

Then run strace against that thread, as sketched below.
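A rough sketch of the same steps without top (PID 14194 is the one from the strace example below; adjust for your own process):

ps -L -p 14194 -o pid,lwp,pcpu,stat,comm --sort=-pcpu | head   # list the threads (LWP = thread id), busiest first
sudo strace -p <lwp> -c                                        # then attach strace to the hottest thread by its LWP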

Diagnosis:
strace showed the process calling sched_yield in a tight loop over a short period; a quick Google search turned up a known issue:
sudo strace -p 14194 -c
[screenshot: strace.png]
https://bugs.launchpad.net/debian/+source/ruby2.3/+bug/1834072

Fix:

  1. A restart fixes it temporarily.
  2. Upgrading ruby fixes it for good.

Commonly used strace commands

strace is a command-line tool built on the Linux ptrace system call. It is very helpful for black-box triage when you do not have the source code: it intercepts the interaction between a process and the kernel's system calls and prints it out.

Common commands:
strace -p 26380                      # attach to PID 26380 and print every syscall as it happens
strace -p 26380 -c                   # attach and, on detach, print a summary table of syscall counts and times
sudo strace -p 4599 -e trace=all     # attach and trace all syscall classes explicitly

It can also be used for fault injection, for example:
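A minimal sketch (this assumes a reasonably recent strace that supports -e inject=; the target command and the count are only illustrations):

# Make the 2nd openat() of "cat /etc/hosts" fail with ENOENT and watch how the program copes:
strace -e trace=openat -e inject=openat:error=ENOENT:when=2 cat /etc/hosts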

-e trace=%desc     Trace all file descriptor related system calls.
         %file     Trace all system calls which take a file name as an argument.
         %fstat    Trace fstat and fstatat syscall variants.
         %fstatfs  Trace fstatfs, fstatfs64, fstatvfs, osf_fstatfs, and osf_fstatfs64 system calls.
         %ipc      Trace all IPC related system calls.
         %lstat    Trace lstat syscall variants.
         %memory   Trace all memory mapping related system calls.
         %network  Trace all the network related system calls.
         %process  Trace all system calls which involve process management.
         %pure     Trace syscalls that always succeed and have no arguments.
         %signal   Trace all signal related system calls.
         %stat     Trace stat syscall variants.
         %statfs   Trace statfs, statfs64, statvfs, osf_statfs, and osf_statfs64 system calls.
         %%stat    Trace syscalls used for requesting file status.
         %%statfs  Trace syscalls related to file system statistics.
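For example, to watch only the network-related syscalls of a running process (the PID is illustrative):

sudo strace -f -p 26380 -e trace=%network   # -f also follows the process's threads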

High VM CPU caused by a memory leak in a hypervisor driver

Today some developers reported that, within the same cluster, a few servers running the same version of the same service showed very high JVM CPU usage while the others stayed normal. They said this had never happened before; the affected servers' CPU usage was so much higher than the rest that an internal monitoring tool automatically restarted them. They suspected the machines might be getting hit by an internal vulnerability scanner, but could not confirm it, and asked SRE to help find the cause.

SRE first confirmed that the high CPU usage had essentially nothing to do with the internal vulnerability scans: the scan traffic barely reaches the application code because it is blocked at the application-framework layer, so it contributes almost no CPU. Moreover, other servers that were also being scanned showed no CPU spike.

SRE also observed that the affected servers (all of them VMs provisioned through OpenStack) sat at roughly 40% CPU usage, while healthy servers were around 3%. The JVM on the affected servers used roughly 8% CPU, versus about 1% on healthy ones. So the rough conclusion was that most of the CPU was not being consumed by the JVM, although the JVM was clearly affected as well.

Further investigation showed that all of the affected servers lived on the same hypervisor, and the other VMs on that hypervisor showed elevated CPU as well.

Logging in to that hypervisor and running the command below showed that it had a kernel memory leak:

admin@hv-8hhy:~$ smem -twk
Area                           Used      Cache   Noncache
firmware/hardware                 0          0          0
kernel image                      0          0          0
kernel dynamic memory        159.2G       6.5G     152.7G
userspace memory             139.3G     196.2M     139.1G
free memory                   15.6G      15.6G          0
----------------------------------------------------------
                             314.1G      22.3G     291.8G

The Noncache column of the kernel dynamic memory row shows 152.7G in use, which is clearly abnormal. For the Cloud team this was a known issue, and they pointed to the kernel fix:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972
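To narrow down where kernel memory is going on a box like this, a couple of standard follow-up checks (nothing here is specific to this incident):

grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo   # how much kernel memory sits in slab caches, and how much of it is unreclaimable
sudo slabtop -o -s c | head -20                        # top kernel slab caches by size; a leaking driver often shows up here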

Linux find command

  1. find ~/ -name a.txt
  2. find . -type f -name tech //find file named tech
  3. find . -type d -name tech //find dir named tech
  4. find . -iname tech //ignore case, find both TECH, tech, Tech, etc
  5. find /var -name "*.log"
  6. find . -type f -perm 0777 -print //find all files whose permission is 777
  7. find / -type f ! -perm 777 //find all the files without permission 777
  8. find / -perm /u=r // find files readable by their owner
  9. find / -perm /a=x // find files executable by someone (user, group, or others)
  10. find / -name foo.bar -print 2>/dev/null //"Permission Denied" messages sent to /dev/null
  11. find . -maxdepth 2 -name "*.bar" -print //only search 2 directories deep (quote the pattern so the shell does not expand it)
  12. find ./dir1 ./dir2 -name foo.bar -print //search 2 dirs
  13. find /some/directory -type l -print // find symbolic links
    type:
b    block (buffered) special 
c    character (unbuffered) special 
d    directory 
p    named pipe (FIFO) 
f     regular file 
l     symbolic link 
s    socket 

There are, however, other expressions you can use as follows:

-amin n - The file was last accessed n minutes ago
-anewer file - The file was last accessed more recently than file was modified
-atime n - The file was last accessed more than n days ago
-cmin n - The file's status was last changed n minutes ago
-cnewer file - The file's status was last changed more recently than file was modified
-ctime n - The file's status was last changed more than n days ago
-empty - The file is empty
-executable - The file is executable
-false - Always false
-fstype type - The file is on the specified file system
-gid n - The file belongs to group with the ID n
-group groupname - The file belongs to the named group
-ilname pattern - Search for a symbolic link but ignore case
-iname pattern - Search for a file but ignore case
-inum n - Search for a file with the specified inode number
-ipath path - Search for a path but ignore case
-iregex expression - Search for files matching a regular expression but ignore case
-links n - Search for a file with the specified number of links
-lname name - Search for a symbolic link
-mmin n - File's data was last modified n minutes ago
-mtime n - File's data was last modified n days ago
-name name - Search for a file with the specified name
-newer name - Search for a file edited more recently than the file given
-nogroup - Search for a file with no group id
-nouser - Search for a file with no user attached to it
-path path - Search for a path
-readable - Find files which are readable
-regex pattern - Search for files matching a regular expression
-type type - Search for a particular type
-uid uid - The file's numeric user ID is uid
-user name - The file is owned by the specified user
-writable - Search for files that can be written to
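A couple of sketches that combine these expressions (the paths, sizes, and patterns below are only illustrative):

find /var/log -type f -size +100M -mtime -2 -print               # files under /var/log larger than 100 MB, modified in the last 2 days
find /tmp -type f -name "*.tmp" -mtime +7 -delete 2>/dev/null     # delete *.tmp files older than 7 days, hiding permission errors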

Linux local port usage

Ran into the following exception while troubleshooting a problem today:
java.net.ConnectException: Cannot assign requested address

Most results online say this means the local ports available for outgoing connections have been exhausted.

  1. First check whether the local ulimit settings are too small:
    $ ulimit -a
  2. If they are not, look at the current port/socket usage:
    $ ss -s
  3. Check the local ephemeral port range used for outgoing connections (see the sketch after this list):
    $ cat /proc/sys/net/ipv4/ip_local_port_range
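If the range really is the bottleneck, widening it looks roughly like this (the values are illustrative; see the reference below for details):

sudo sysctl -w net.ipv4.ip_local_port_range="15000 64000"                            # takes effect immediately
echo 'net.ipv4.ip_local_port_range = 15000 64000' | sudo tee -a /etc/sysctl.conf     # persist across reboots
sudo sysctl -p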

More Linux network tuning parameters: https://www.tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.kernel.obscure.html#AEN1252

Reference:
https://ma.ttias.be/linux-increase-ip_local_port_range-tcp-port-range/