[prev in list] [next in list] [prev in thread] [next in thread] 

List:       squid-users
Subject:    Re: [squid-users] Squid losing connectivity for 30 seconds
From:       Elie Merhej <emerhej () wise ! net ! lb>
Date:       2011-11-29 8:43:58
Message-ID: 4ED49B4E.8010309 () wise ! net ! lb
[Download RAW message or body]


> > > > > > Hi,
> > > > > > 
> > > > > > I am currently facing a problem that I wasn't able to find a 
> > > > > > solution for in the mailing list or on the internet,
> > > > > > My squid is dying for 30 seconds every one hour at the same exact 
> > > > > > time, squid process will still be running,
> > > > > > I lose my wccp connectivity, the cache peers detect the squid as 
> > > > > > a dead sibling, and the squid cannot server any requests
> > > > > > The network connectivity of the sever is not affected (a ping to 
> > > > > > the squid's ip doesn't timeout)
> > > > > > 
> > > > > > The problem doesn't start immediately when the squid is installed 
> > > > > > on the server (The server is dedicated as a squid)
> > > > > > It starts when the cache directories starts to fill up,
> > > > > > I have started my setup with 10 cache directors, the squid will 
> > > > > > start having the problem when the cache directories are above 50% 
> > > > > > filled
> > > > > > when i change the number of cache directory (9,8,...) the squid 
> > > > > > works for a while then the same problem
> > > > > > cache_dir aufs /cache1/squid 90000 140 256
> > > > > > cache_dir aufs /cache2/squid 90000 140 256
> > > > > > cache_dir aufs /cache3/squid 90000 140 256
> > > > > > cache_dir aufs /cache4/squid 90000 140 256
> > > > > > cache_dir aufs /cache5/squid 90000 140 256
> > > > > > cache_dir aufs /cache6/squid 90000 140 256
> > > > > > cache_dir aufs /cache7/squid 90000 140 256
> > > > > > cache_dir aufs /cache8/squid 90000 140 256
> > > > > > cache_dir aufs /cache9/squid 90000 140 256
> > > > > > cache_dir aufs /cache10/squid 80000 140 256
> > > > > > 
> > > > > > I have 1 terabyte of storage
> > > > > > Finally I created two cache dircetories (One on each HDD) but the 
> > > > > > problem persisted
> > > > > 
> > > > > You have 2 HDD?  but, but, you have 10 cache_dir.
> > > > > We repeatedly say "one cache_dir per disk" or similar. In 
> > > > > particular one cache_dir per physical drive spindle (for "disks" 
> > > > > made up of multiple physical spindles) wherever possible with 
> > > > > physical drives/spindles mounting separately to ensure the 
> > > > > pairing. Squid performs a very unusual pattern of disk I/O which 
> > > > > stress them down to the hardware controller level and make this 
> > > > > kind of detail critical for anything like good speed. Avoiding 
> > > > > cache_dir object limitations by adding more UFS-based dirs to one 
> > > > > disk does not improve the situation.
> > > > > 
> > > > > That is a problem which will be affecting your Squid all the time 
> > > > > though, possibly making the source of the pause worse.
> > > > > 
> > > > > From teh description I believe it is garbage collection on the 
> > > > > cache directories. The pauses can be visible when garbage 
> > > > > collecting any caches over a few dozen GB. The squid default 
> > > > > "swap_high" and "swap_low" values are "5" apart, with at minimum 
> > > > > being a value of 0 apart. These are whole % points of the total 
> > > > > cache size, being erased from disk in a somewhat random-access 
> > > > > style across the cache area. I did mention uncommon disk I/O 
> > > > > patterns, right?
> > > > > 
> > > > > To be sure what it is, you can use the "strace" tool to the squid 
> > > > > worker process (the second PID in current stable Squids) and see 
> > > > > what is running. But given the hourly regularity and past 
> > > > > experience with others on similar cache sizes, I'm almost certain 
> > > > > its the garbage collection.
> > > > > 
> > > > > Amos
> > > > > 
> > > > 
> > > > Hi Amos,
> > > > 
> > > > Thank you for your fast reply,
> > > > I have 2 HDD (450GB and 600GB)
> > > > df -h displays that i have 357Gb and 505GB available
> > > > In my last test, my cache dir where:
> > > > cache_swap_low 90
> > > > cache_swap_high 95
> > > 
> > > This is not. For anything more than 10-20 GB I recommend setting it 
> > > to no more than 1 apart, possibly the same value if that works.
> > > Squid has a light but CPU-intensive and possibly long garbage 
> > > removal cycle above cache_swap_low, and a much more aggressive but 
> > > faster and less CPU intensive removal above cache_swap_high. On 
> > > large caches it is better in terms of downtime going straight to the 
> > > aggressive removal and clearing disk space fast, despite the 
> > > bandwidth cost replacing any items the light removal would have left.
> > > 
> > > 
> > > Amos
> > > 
> > Hi Amos,
> > 
> > I have changed the swap_high  90 and swap_low 90 with two cache dir 
> > (one for each HDD), i still have the same problem,
> > I did an strace (when the problem occured)
> > ------ ----------- ----------- --------- --------- ----------------
> > 23.06    0.004769           0     85681        96 write
> > 21.07    0.004359           0     24658         5 futex
> > 19.34    0.004001         800         5           open
> > 6.54    0.001352           0      5101      5101 connect
> > 6.46    0.001337           3       491           epoll_wait
> > 5.34    0.001104           0     51938      9453 read
> > 3.90    0.000806           0     39727           close
> > 3.54    0.000733           0     86400           epoll_ctl
> > 3.54    0.000732           0     32357           sendto
> > 2.02    0.000417           0     56721           recvmsg
> > 1.84    0.000381           0     24064           socket
> > 0.96    0.000199           0     56264           fcntl
> > 0.77    0.000159           0      6366       329 accept
> > 0.53    0.000109           0     24033           bind
> > 0.52    0.000108           0     30085           getsockname
> > 0.21    0.000044           0     11200           stat
> > 0.21    0.000044           0      6998       359 recvfrom
> > 0.09    0.000019           0      5085           getsockopt
> > 0.06    0.000012           0      2887           lseek
> > 0.00    0.000000           0        98           brk
> > 0.00    0.000000           0        16           dup2
> > 0.00    0.000000           0     10314           setsockopt
> > 0.00    0.000000           0         4           getdents
> > 0.00    0.000000           0         3           getrusage
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.020685                560496     15343 total
> > 
> > this is the strace of squid when it is working normally:
> > ------ ----------- ----------- --------- --------- ----------------
> > 24.88    0.015887           0    455793       169 write
> > 13.72    0.008764           0    112185           epoll_wait
> > 11.67    0.007454           0    256256     27158 read
> > 8.47    0.005408           0    169133           sendto
> > 6.94    0.004430           0    159596           close
> > 6.85    0.004373           0    387359           epoll_ctl
> > 6.42    0.004102           0     19651     19651 connect
> > 5.54    0.003538           0    290289           recvmsg
> > 3.81    0.002431           0    116515           socket
> > 3.53    0.002254           0    164750           futex
> > 1.68    0.001075           0    207688           fcntl
> > 1.53    0.000974           0     95228     23139 recvfrom
> > 1.29    0.000821           0     33408     12259 accept
> > 1.14    0.000726           0     46582           stat
> > 1.11    0.000707           0    110826           bind
> > 0.85    0.000544           0    137574           getsockname
> > 0.32    0.000204           0     21642           getsockopt
> > 0.26    0.000165           0     39502           setsockopt
> > 0.01    0.000007           0      8092           lseek
> > 0.00    0.000000           0       248           open
> > 0.00    0.000000           0         4           brk
> > 0.00    0.000000           0        88           dup2
> > 0.00    0.000000           0        14           getdents
> > 0.00    0.000000           0         6           getrusage
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.063864               2832429     82376 total
> > 
> > Do you have any suggestions to solve the issue, can I run the garbage 
> > collector more frequently, is it better to change the cache_dir type 
> > from aufs to something else?
> > Do you see the problem in the strace?
> > 
> > Thank you,
> > Elie
> > 
> > 
> Hi,
> 
> Please note that squid is facing the same problem even when their is 
> no activity or any clients connected to it
> 
> Regards
> Elie
> 
Hi,

here is the strace result
-----------------------------------------------------------------------------------------------------
 connect(7595, {sa_family=AF_INET, sin_port=htons(80), 
sin_addr=inet_addr("2.22.50.75")}, 16) = -1 EINPROGRESS (Operation now 
in progress)
epoll_ctl(6, EPOLL_CTL_ADD, 7595, {EPOLLOUT|EPOLLERR|EPOLLHUP, 
{u32=7595, u64=13133095884688465323}}) = 0
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0, 
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1)  = 1
write(35, 
"\2\0\0\0\375\254\21\0002r\316N\0\0\0\0002r\316N\0\0\0\0\262\245\257P\0\0\0\0"..., 
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0, 
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1)  = 1
write(35, 
"\2\0\0\0\306\254\21\0000r\316N\0\0\0\0000r\316N\0\0\0\0\377\377\377\377\377\377\377\377"..., \
 72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0, 
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1)  = 1
write(35, 
"\2\0\0\0\262\251\21\0\360\376\315N\0\0\0\0002r\316N\0\0\0\0\377\377\377\377\377\377\377\377"..., \
 72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0, 
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1)  = 1
write(35, 
"\2\0\0\0\2\255\21\0002r\316N\0\0\0\0003r\316N\0\0\0\0\321\376\316N\0\0\0\0"..., 
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0, 
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1)  = 1
write(35, 
"\2\0\0\0\3\255\21\0002r\316N\0\0\0\0003r\316N\0\0\0\0002\347\340N\0\0\0\0"..., 
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0, 
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1)  = 1
write(35, 
"\2\0\0\0\371\254\21\0002r\316N\0\0\0\0002r\316N\0\0\0\0\377\377\377\377\377\377\377\377"..., \
 72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0, 
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1)  = 1
write(35, 
"\2\0\0\0\n\255\21\0003r\316N\0\0\0\0003r\316N\0\0\0\0\263\245\257P\0\0\0\0"..., 
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0, 
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1)  = 1
write(35, 
"\2\0\0\0\v\255\21\0003r\316N\0\0\0\0003r\316N\0\0\0\0\323\370\317N\0\0\0\0"..., 
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0, 
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1)  = 1
write(35, 
"\2\0\0\0\f\255\21\0003r\316N\0\0\0\0003r\316N\0\0\0\0003\377\365N\0\0\0\0"..., 
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0, 
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1)  = 1
write(35, 
"\2\0\0\0\334\251\21\0@\\\316N\0\0\0\0003r\316N\0\0\0\0\377\377\377\377\377\377\377\377"..., \
 72) = 72
read(165, "!", 256)                     = 1
close(379)                              = 0
epoll_ctl(6, EPOLL_CTL_DEL, 9525, {0, {u32=9525, 
u64=25160128873375029}}) = 0
epoll_ctl(6, EPOLL_CTL_ADD, 9525, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=9525, 
u64=13133094175291483445}}) = 0
epoll_ctl(6, EPOLL_CTL_MOD, 9525, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, 
{u32=9525, u64=23686740342482229}}) = 0
epoll_ctl(6, EPOLL_CTL_MOD, 4554, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, 
{u32=4554, u64=549755818442}}) = 0
close(68)                               = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 68
fcntl(68, F_GETFD)                      = 0
fcntl(68, F_SETFD, FD_CLOEXEC)          = 0
setsockopt(68, SOL_IP, 0x13 /* IP_??? */, [1], 4) = 0
bind(68, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("95.141.62.17")}, 16) = 0
fcntl(68, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(68, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
setsockopt(68, SOL_TCP, TCP_NODELAY, [1], 4) = 0
getsockname(68, {sa_family=AF_INET, sin_port=htons(60303), 
sin_addr=inet_addr("95.141.62.17")}, [16]) = 0
socket(PF_NETLINK, SOCK_RAW, 0)         = 379
bind(379, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
getsockname(379, {sa_family=AF_NETLINK, pid=11594, groups=00000000}, 
[12]) = 0
sendto(379, "\24\0\0\0\26\0\1\3t\232\324N\0\0\0\0\0\0\0\0", 20, 0, 
{sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
recvmsg(379, {msg_name(12)={sa_family=AF_NETLINK, pid=0, 
groups=00000000}, 
msg_iov(1)=[{"0\0\0\0\24\0\2\0t\232\324NJ-\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., \
 4096}], msg_controllen=0, msg_flags=0}, 0) = 108
recvmsg(379, {msg_name(12)={sa_family=AF_NETLINK, pid=0, 
groups=00000000}, 
msg_iov(1)=[{"@\0\0\0\24\0\2\0t\232\324NJ-\0\0\n\200\200\376\1\0\0\0\24\0\1\0\0\0\0\0"..., \
 4096}], msg_controllen=0, msg_flags=0}, 0) = 128
recvmsg(379, {msg_name(12)={sa_family=AF_NETLINK, pid=0, 
groups=00000000}, 
msg_iov(1)=[{"\24\0\0\0\3\0\2\0t\232\324NJ-\0\0\0\0\0\0\1\0\0\0\24\0\1\0\0\0\0\0"..., \
 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
close(379)                              = 0
connect(68, {sa_family=AF_INET, sin_port=htons(80), 
sin_addr=inet_addr("2.22.50.65")}, 16) = -1 EINPROGRESS (Operation now 
in progress)
epoll_ctl(6, EPOLL_CTL_ADD, 68, {EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=68, 
u64=13133092173836714052}}) = 0
epoll_ctl(6, EPOLL_CTL_MOD, 3658, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, 
{u32=3658, u64=549755817546}}) = 0
sendto(10, "\33\221\1\0\0\1\0\0\0\0\0\0\nliveupdate\vgocyberl"..., 44, 
0, {sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 44
sendto(10, "\275\251\1\0\0\1\0\0\0\0\0\0\2tc\2v3\6cache3\1c\4pack"..., 
48, 0, {sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 48
sendto(10, "\267\355\1\0\0\1\0\0\0\0\0\0\3crl\6thawte\3com\0\0\34\0\1", 
32, 0, {sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 32
sendto(10, "\312\237\1\0\0\1\0\0\0\0\0\0\2p4\rbhzfrwddqtb66\20cg"..., 
79, 0, {sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 79
sendto(10, "\261\216\1\0\0\1\0\0\0\0\0\0\4apps\10facebook\3com\0\0"..., 
35, 0, {sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 35
sendto(10, "\237\10\1\0\0\1\0\0\0\0\0\0\6sn121w\6snt121\4mail\4"..., 45, 
0, {sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 45
sendto(10, "FR\1\0\0\1\0\0\0\0\0\0\3api\6js-kit\3com\0\0\1\0\1", 32, 0, 
{sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 32
sendto(10, "\207$\1\0\0\1\0\0\0\0\0\0\1l\naddthiscdn\3com\0\0\1"..., 34, 
0, {sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 34
sendto(10, "\265#\1\0\0\1\0\0\0\0\0\0\5usage\7hosting\7toolb"..., 60, 0, 
{sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 60
sendto(10, "hZ\1\0\0\1\0\0\0\0\0\0\3www\ntargetgulf\3com\0"..., 36, 0, 
{sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 36
sendto(10, "\355\345\1\0\0\1\0\0\0\0\0\0\10services\4apps\7condu"..., 
43, 0, {sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 43
sendto(10, "O\254\1\0\0\1\0\0\0\0\0\0\0060-jw-w\7channel\10face"..., 45, 
0, {sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 45
sendto(10, "\330\310\1\0\0\1\0\0\0\0\0\0\2t1\7gstatic\3com\0\0\1\0\1", 
32, 0, {sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 32
sendto(10, "\16Z\1\0\0\1\0\0\0\0\0\0\4mail\3aol\3com\0\0\1\0\1", 30, 0, 
{sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 30
sendto(10, "\327\216\1\0\0\1\0\0\0\0\0\0\3www\10bisara7a\3com\0\0\1"..., 
34, 0, {sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 34
sendto(10, "o\201\1\0\0\1\0\0\0\0\0\0\003317\7channel\10faceboo"..., 42, 
0, {sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 42
sendto(10, "\344?\1\0\0\1\0\0\0\0\0\0\7email03\fsecureserve"..., 42, 0, 
{sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.249.209.102")}, 16) = 42
sendto(21, 
"\0\0\0\n\2\0\0|\0\0\0\4\0\0\0\0\0\1\0\30\0017\360\6\0\0\0002\0P\0\0"..., 132, 
0, NULL, 0) = 132
sendto(21, 
"\0\0\0\n\2\0\0|\0\0\0\4\0\0\0\0\0\1\0\30\1<\360\6\0\0\0\21\0P\0\0"..., 
132, 0, NULL, 0) = 132
open("/cache2/pool6/squid/13/AB", 
O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 379
getdents(379, /* 254 entries */, 32768) = 8112
getdents(379, /* 0 entries */, 32768)   = 0
close(379)                              = 0
read(165, "!", 256)                     = 1
----------------------------------------------------------------------------------------------------------------------------------------------------
 Squid is freezing at this point
Please advise

Thank you
Elie


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic