'A' web server takes another "time out"
tye
created: 2006-05-03 14:17:13

You may have noticed PerlMonks becoming non-responsive from time to time. Of late, this is usually due to our 'A' web server taking a "time out" to indulge in some recreational "extreme swapping" for quite a few minutes. This appears to have started happening after pair.com upgraded the OS, Apache, etc., early this year.

After several attempts, I finally captured some output from 'top' that shows much about the problem. Anyone care to offer interpretations / insights regarding this output, which comes from 'top' trying to dump the list of processes every 60 seconds when PerlMonks' 'A' web server "goes away"? Notice how one update there takes 5 1/2 minutes, not "a bit over 60 seconds". See the load average climb. See lots of 'httpd' processes appear, starving lots of older httpd processes for real RAM.
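For anyone who wants to check those gaps without reading the dumps by eye, here is a minimal Python sketch. It pulls the clock time and the 1-minute load average out of each 'top' header line and reports the seconds between snapshots. The function name `snapshot_gaps` and the regular expression are my own assumptions about the header layout; the four sample lines are copied from the output in this post.

```python
import re
from datetime import datetime

# Matches the first line of each 'top' snapshot, for example:
# "last pid: 96025;  load averages:  4.81,  2.80,  2.16  up 6+23:19:50    10:08:06"
HEADER = re.compile(
    r"last pid:\s*\d+;\s*load averages:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)"
    r"\s+up\s+\S+\s+(\d{1,2}:\d{2}:\d{2})"
)


def snapshot_gaps(log_text):
    """Return a list of (clock_time, load_1min, seconds_since_previous)."""
    rows = []
    previous = None
    for match in HEADER.finditer(log_text):
        load_1min = float(match.group(1))
        stamp = datetime.strptime(match.group(4), "%H:%M:%S")
        if previous is None:
            gap = None  # nothing to compare the first snapshot against
        else:
            # The modulo keeps the gap positive if the log crosses midnight.
            gap = int((stamp - previous).total_seconds()) % 86400
        rows.append((match.group(4), load_1min, gap))
        previous = stamp
    return rows


# Header lines copied from the snapshots in this post.
SAMPLE = """\
last pid: 96025;  load averages:  4.81,  2.80,  2.16  up 6+23:19:50    10:08:06
last pid: 96043;  load averages: 13.13,  5.56,  3.24  up 6+23:21:32    10:09:48
last pid: 96100;  load averages: 12.36,  7.48,  4.29  up 6+23:26:59    10:15:15
last pid: 96223;  load averages: 17.47, 13.70,  8.74  up 6+23:28:32    10:16:48
"""

if __name__ == "__main__":
    for clock, load_1min, gap in snapshot_gaps(SAMPLE):
        print(clock, load_1min, gap)
```

On the sample headers the gaps come out as 102, 327 and 93 seconds, and the 327-second one is the "5 1/2 minutes" update.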

I'll look at this more as I find more time, but I'd be interested in well-considered theories about possible sources for such behavior.

I wish 'top' would show the parent of each process so I could tell which process is creating all of these extra processes. 'top' isn't particularly flexible, but it is still the best tool I've found available on this system so far (no, I don't have root access, and I seriously doubt that pair.com would give it to me). Perhaps the next iteration of this logging should add periodic "ps" output to the logs to get that parent/child information, though, based on past iterations, I bet cron trying to start up "ps" would take so long when the problem is happening that it'd miss seeing the problem. ;)
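As a rough idea of what that "ps" logging could feed into, here is a Python sketch that groups processes by parent PID and ranks the parents by number of children. It assumes output in the shape produced by `ps -axo pid,ppid,comm` (a PID column, a PPID column, then the command name). The function names are hypothetical, and the sample table is invented purely for illustration; it is not data from the server.

```python
import subprocess
from collections import defaultdict


def children_by_parent(ps_text):
    """Parse 'pid ppid command' rows into {ppid: [pid, ...]} and {pid: command}."""
    tree = defaultdict(list)
    names = {}
    for line in ps_text.splitlines():
        parts = line.split(None, 2)
        # Skip the header row and anything that is not 'number number name'.
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        pid, ppid, command = int(parts[0]), int(parts[1]), parts[2].strip()
        names[pid] = command
        tree[ppid].append(pid)
    return tree, names


def busiest_parents(ps_text, limit=5):
    """Return up to 'limit' tuples of (ppid, parent_command, child_count)."""
    tree, names = children_by_parent(ps_text)
    ranked = sorted(tree.items(), key=lambda item: len(item[1]), reverse=True)
    return [(ppid, names.get(ppid, "?"), len(kids)) for ppid, kids in ranked[:limit]]


def run_ps():
    """Capture a live process table (assumes 'ps' accepts these options)."""
    result = subprocess.run(
        ["ps", "-axo", "pid,ppid,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Invented sample, only to show the input shape.
SAMPLE = """\
  PID  PPID COMMAND
    1     0 init
  100     1 httpd
  101   100 httpd
  102   100 httpd
  103   100 httpd
  200     1 cron
  201   200 sh
"""

if __name__ == "__main__":
    for ppid, name, count in busiest_parents(SAMPLE, limit=2):
        print(ppid, name, count)
```

To use it against a live system, call `busiest_parents(run_ps())`. The same caveat applies as with cron starting "ps": if the box is deep in swap, the capture itself may arrive too late to be useful.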

One difference between the 'A' and 'B' web servers is that the 'A' web server gets quite a lot of traffic from search engine spiders indexing PerlMonks via "http://someotherhostname/~monkads/?...". I disabled this for msnbot, as it was making twice as many hits as the next-busiest robot and was requesting a lot of bizarre URLs. I may soon disable it for all robots, since the problem continues.

The 'top' output is in spoiler tags, since the usual tags alone would make it impossible to view the whole thread of discussion w/o the data "in the way". So "reveal" the spoilers to see the output.
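Since the dumps are long, here is a small Python sketch for tallying one snapshot: it counts the processes of a given user by STATE and sums their RES column. Everything about it is an assumption on my part: the name `tally`, the regular expression for the row layout, the restriction to sizes printed in K (as they are in these dumps), and filtering on the user 'nobody' because the COMMAND column is blank on some rows. The sample rows are copied from the output in this post.

```python
import re
from collections import Counter

# One process row of 'top' output, for example:
# "84338 nobody     2 -10 49620K 29096K accept   6:14  0.42%  0.39% httpd"
# The COMMAND column may be empty, so it is optional here.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(-?\d+)\s+(-?\d+)\s+(\d+)K\s+(\d+)K\s+(\S+)\s+"
    r"(\d+:\d+)\s+([\d.]+)%\s+([\d.]+)%\s*(\S*)\s*$"
)


def tally(snapshot_text, user="nobody"):
    """Return (Counter of STATE values, total RES in KB) for one user's rows."""
    states = Counter()
    resident_kb = 0
    for line in snapshot_text.splitlines():
        match = ROW.match(line)
        if match is None or match.group(2) != user:
            continue  # header lines, blank lines, or another user's process
        states[match.group(7)] += 1
        resident_kb += int(match.group(6))
    return states, resident_kb


# Rows copied from the snapshots in this post.
SAMPLE = """\
  PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
84338 nobody     2 -10 49620K 29096K accept   6:14  0.42%  0.39% httpd
84049 nobody   -22 -10 49396K 26724K swread   6:49  0.00%  0.00% httpd
84032 nobody     2 -10 45952K     0K RUN      6:02  0.00%  0.00%
  153 root       2 -10  4300K  2016K select   0:20  0.00%  0.00% httpd
"""

if __name__ == "__main__":
    states, resident_kb = tally(SAMPLE)
    print(dict(states), resident_kb)
```

Running it once per snapshot would make it easier to compare how many processes sit in 'swread' and how much resident memory 'nobody' holds from one update to the next.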

last pid: 96025;  load averages:  4.81,  2.80,  2.16  up 6+23:19:50    10:08:06
107 processes: 1 running, 106 sleeping
CPU states: 29.6% user,  0.0% nice,  6.6% system,  2.3% interrupt, 61.5% idle
Mem: 806M Active, 75M Inact, 91M Wired, 27M Cache, 112M Buf, 4284K Free
Swap: 4096M Total, 317M Used, 3779M Free, 7% Inuse, 3088K In, 94M Out

  PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
84338 nobody     2 -10 49620K 32660K sbwait   6:13  0.00%  0.00% httpd
84049 nobody     2 -10 49396K 31396K sbwait   6:49  0.00%  0.00% httpd
84046 nobody     2 -10 48112K 31092K sbwait   6:55  0.00%  0.00% httpd
84044 nobody     2 -10 46144K 30724K sbwait   5:57  0.00%  0.00% httpd
84032 nobody     2 -10 45952K 31824K sbwait   6:01  0.00%  0.00% httpd
84041 nobody     2 -10 45220K 32836K sbwait   6:02  0.20%  0.20% httpd
84042 nobody     2 -10 44948K 30600K sbwait   6:39  0.00%  0.00% httpd
84036 nobody     2 -10 44776K 37852K sbwait   5:58  0.00%  0.00% httpd
84034 nobody     2 -10 44104K 30128K sbwait   6:26  0.00%  0.00% httpd
84051 nobody     2 -10 43768K 38004K sbwait   5:50  0.00%  0.00% httpd
84031 nobody     2 -10 43732K 29484K sbwait   5:24  0.00%  0.00% httpd
84048 nobody     2 -10 43732K 29248K sbwait   6:14  0.00%  0.00% httpd
84030 nobody     2 -10 43672K 29372K sbwait   5:16  0.00%  0.00% httpd
84202 nobody     2 -10 43516K 29040K sbwait   6:15  0.00%  0.00% httpd
84045 nobody     2 -10 43508K 16640K sbwait   5:33  0.00%  0.00% httpd
84047 nobody     2 -10 43444K 29488K sbwait   5:50  0.00%  0.00% httpd
84043 nobody     2 -10 43216K 21444K sbwait   6:27  0.00%  0.00% httpd
84037 nobody     2 -10 42288K 30280K sbwait   6:19  0.00%  0.00% httpd
84033 nobody     2 -10 41368K 29708K sbwait   6:24  0.00%  0.00% httpd
84339 nobody     2 -10 41156K 28720K sbwait   5:41  0.00%  0.00% httpd
84035 nobody     2 -10 40252K 28340K sbwait   5:40  0.00%  0.00% httpd
84039 nobody     2 -10 40232K 27472K sbwait   6:12  0.00%  0.00% httpd
84040 nobody     2 -10 40000K 27264K sbwait   6:08  0.00%  0.00% httpd
84038 nobody     2 -10 39632K 28464K sbwait   5:53  0.00%  0.00% httpd
95995 nobody     2 -10 32848K 31240K sbwait   0:01  0.82%  0.78% httpd
  230 root       2  10 21324K     0K select   0:01  0.00%  0.00% 
95899 nobody     2 -10 12952K 10860K sbwait   0:03  0.00%  0.00% httpd
95825 nobody     2 -10 12840K  9216K sbwait   0:04  0.00%  0.00% httpd
95820 nobody     2 -10 12832K  9472K sbwait   0:03  0.00%  0.00% httpd
95898 nobody     2 -10 12436K  8884K accept   0:03  0.00%  0.00% httpd
95848 nobody     2 -10 12228K  8996K sbwait   0:02  0.00%  0.00% httpd
95902 nobody     2 -10 12124K  8608K sbwait   0:02  0.00%  0.00% httpd
95900 nobody     2 -10 11708K  9780K sbwait   0:01  0.00%  0.00% httpd
95960 nobody     2 -10 10956K  9712K sbwait   0:01  0.00%  0.00% httpd
95962 nobody     2 -10 10828K  9420K sbwait   0:01  0.00%  0.00% httpd
95961 nobody     2 -10 10804K  9392K sbwait   0:01  0.00%  0.00% httpd
95963 nobody     2 -10 10720K  9448K sbwait   0:01  0.00%  0.00% httpd
95958 nobody     2 -10 10596K  9396K sbwait   0:01  0.00%  0.00% httpd
95975 nobody     2 -10  9848K  8164K sbwait   0:00  0.00%  0.00% httpd
95973 nobody     2 -10  9736K  8320K sbwait   0:00  0.00%  0.00% httpd
95996 nobody     2 -10  9512K  8324K sbwait   0:00  0.00%  0.00% httpd
95976 nobody     2 -10  9484K  7524K accept   0:00  0.00%  0.00% httpd
95974 nobody     2 -10  9416K  8016K sbwait   0:00  0.00%  0.00% httpd
95972 nobody     2 -10  9344K  7944K sbwait   0:01  0.00%  0.00% httpd
95993 nobody     2 -10  9336K  7548K sbwait   0:00  0.00%  0.00% httpd
95977 nobody     2 -10  8940K  7220K sbwait   0:00  0.00%  0.00% httpd
95957 nobody     2 -10  8868K  7592K sbwait   0:00  0.00%  0.00% httpd
95959 nobody     2 -10  8676K  7420K accept   0:00  0.00%  0.00% httpd
95980 nobody     2 -10  8576K  6728K sbwait   0:00  0.00%  0.00% httpd
95998 nobody     2 -10  8404K  7180K sbwait   0:00  0.00%  0.00% httpd
95971 nobody     2 -10  8380K  6992K sbwait   0:00  0.00%  0.00% httpd
95997 nobody     2 -10  8132K  6968K sbwait   0:00  0.00%  0.00% httpd
95994 nobody     2 -10  8056K  6696K sbwait   0:00  0.00%  0.00% httpd
  183 root      10  10  5532K   648K nanslp   0:23  0.00%  0.00% perl
  261 root       2 -15  5476K  1000K sbwait   3:31  0.00%  0.00% perl
96019 nobody     2 -10  4300K  2664K accept   0:00  0.00%  0.00% httpd
96022 nobody     2 -10  4300K  2664K accept   0:00  0.00%  0.00% httpd
96020 nobody     2 -10  4300K  2664K accept   0:00  0.00%  0.00% httpd
96021 nobody     2 -10  4300K  2664K accept   0:00  0.00%  0.00% httpd
96023 nobody     2 -10  4300K  2664K accept   0:00  0.00%  0.00% httpd
96025 nobody     2 -10  4300K  2664K accept   0:00  0.00%  0.00% httpd
96024 nobody     2 -10  4300K  2664K accept   0:00  0.00%  0.00% httpd
  153 root       2 -10  4300K  2100K select   0:20  0.00%  0.00% httpd
  211 root      10 -20  3412K  1096K nanslp   3:17  0.00%  0.00% perl
  366 root       2   4  3112K     0K poll     0:00  0.00%  0.00% 
  274 root      10   4  3044K   600K nanslp   0:07  0.00%  0.00% perl
  270 root      10   4  3044K     0K nanslp   0:01  0.00%  0.00% 
  268 root      10   4  3044K     0K nanslp   0:01  0.00%  0.00% 
  273 root      10   4  3044K     0K nanslp   0:01  0.00%  0.00% 
  234 root       2   0  2332K     0K select   0:06  0.00%  0.00% 
  191 root      10   4  2280K  1196K nanslp   0:13  0.00%  0.00% ncftpd
77300 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
90242 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
48258 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
90215 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
75810 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
  239 root      10   0  2088K     0K wait     0:00  0.00%  0.00% 
95759 monkads   31   0  2028K   380K RUN      0:02 16.00%  0.78% top
  262 root       2   0  1356K     0K select   0:01  0.00%  0.00% 
  201 root      -6   1  1264K   216K piperd   0:00  0.00%  0.00% ncftpd
   95 root       2   0  1056K   344K poll     0:06  0.00%  0.00% syslog-ng
  197 root       2   6  1016K     0K accept   0:00  0.00%  0.00% 
95749 root      -6   0  1008K   388K piperd   0:00  0.00%  0.00% cron
  102 root      10   0   992K   236K nanslp   0:03  0.00%  0.00% cron
  207 qmails     2   0   952K     0K select   0:04  0.00%  0.00% 
  319 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  321 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  315 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  316 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  320 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  322 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  318 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  317 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  217 qmaill    -6   0   896K   160K piperd   0:00  0.00%  0.00% splogger
  218 root       2   0   896K     0K select   0:00  0.00%  0.00% 
  219 qmailr     2   0   896K     0K select   0:00  0.00%  0.00% 
  220 qmailq    -6   0   884K   128K piperd   0:00  0.00%  0.00% qmail-clean
  189 root      10   0   864K     0K wait     0:00  0.00%  0.00% 
95752 monkads   10   0   632K     0K wait     0:00  0.00%  0.00% 
    1 root      10   0   544K     0K wait     0:00  0.00%  0.00% init
   23 root      18   0   212K     0K pause    0:00  0.00%  0.00% 
    2 root     -18   0     0K     0K psleep  16:17  1.81%  1.81% pagedaemon
    5 root      18   0     0K     0K syncer   3:01  0.00%  0.00% syncer
    3 root      18   0     0K     0K psleep   1:00  0.00%  0.00% vmdaemon
    4 root     -18   0     0K     0K psleep   0:02  0.00%  0.00% bufdaemon
    6 root      -2   0     0K     0K vlruwt   0:02  0.00%  0.00% vnlru
    0 root     -18   0     0K     0K sched    0:00  0.00%  0.00% swapper

last pid: 96043;  load averages: 13.13,  5.56,  3.24  up 6+23:21:32    10:09:48
124 processes: 34 running, 90 sleeping
CPU states: 40.3% user,  0.0% nice, 15.0% system,  3.1% interrupt, 41.6% idle
Mem: 830M Active, 47M Inact, 92M Wired, 33M Cache, 112M Buf, 1664K Free
Swap: 4096M Total, 465M Used, 3630M Free, 11% Inuse, 28M In, 170M Out

  PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
84338 nobody     2 -10 49620K 29096K accept   6:14  0.42%  0.39% httpd
84049 nobody   -22 -10 49396K 26724K swread   6:49  0.00%  0.00% httpd
84046 nobody   -22 -10 48112K 29184K swread   6:56  0.00%  0.00% httpd
84044 nobody     2 -10 46144K   264K accept   5:58  0.00%  0.00% httpd
84032 nobody     2 -10 45952K     0K RUN      6:02  0.00%  0.00% 
84041 nobody     2 -10 45220K 31024K sbwait   6:05  0.78%  0.78% httpd
84042 nobody     2 -10 44948K     0K RUN      6:39  0.00%  0.00% 
84036 nobody     2 -10 44776K     0K RUN      5:59  0.00%  0.00% 
84034 nobody   -22 -10 44104K 26892K swread   6:27  0.00%  0.00% httpd
84051 nobody     2 -10 43768K 27180K sbwait   5:51  0.05%  0.05% httpd
84048 nobody     2 -10 43732K   260K accept   6:14  0.00%  0.00% httpd
84031 nobody     2 -10 43732K     0K RUN      5:24  0.00%  0.00% 
84030 nobody     2 -10 43672K     0K RUN      5:16  0.00%  0.00% 
84202 nobody     2 -10 43516K 26348K sbwait   6:16  0.10%  0.10% httpd
84045 nobody   -22 -10 43508K 15972K swread   5:33  0.00%  0.00% httpd
84047 nobody     2 -10 43444K     0K RUN      5:50  0.00%  0.00% 
84043 nobody     2 -10 43216K     0K RUN      6:28  0.00%  0.00% 
84037 nobody   -22 -10 42288K 27340K swread   6:20  0.00%  0.00% httpd
84033 nobody     2 -10 41368K     0K RUN      6:25  0.00%  0.00% 
84339 nobody     2 -10 41156K     0K RUN      5:41  0.00%  0.00% 
84035 nobody     2 -10 40252K     0K RUN      5:40  0.00%  0.00% 
84039 nobody     2 -10 40232K     0K RUN      6:13  5.60%  0.78% 
84040 nobody     2 -10 40000K     0K RUN      6:08  0.00%  0.00% 
84038 nobody     2 -10 39640K     0K RUN      5:54  0.00%  0.00% 
95977 nobody   -22 -10 34100K 10328K swread   0:02  0.00%  0.00% httpd
95995 nobody     2 -10 33700K  9128K sbwait   0:02  0.15%  0.15% httpd
96025 nobody   -22 -10 30876K 18576K swread   0:01  1.27%  1.17% httpd
  230 root       2  10 21324K     0K select   0:01  0.00%  0.00% 
95962 nobody   -22 -10 17952K 12460K swread   0:02  0.00%  0.00% httpd
95899 nobody   -22 -10 13928K  2936K swread   0:03  0.00%  0.00% httpd
95825 nobody     2 -10 12984K  7552K sbwait   0:04  0.00%  0.00% httpd
95820 nobody     2 -10 12852K     0K RUN      0:03  0.00%  0.00% 
95900 nobody     2 -10 12784K     0K RUN      0:01  0.00%  0.00% 
95848 nobody     2 -10 12548K     0K RUN      0:03  0.00%  0.00% 
95898 nobody     2 -10 12436K     0K RUN      0:03  2.45%  0.34% 
95960 nobody     2 -10 12376K     0K RUN      0:02  1.05%  0.15% 
95902 nobody     2 -10 12124K     0K RUN      0:02  0.00%  0.00% 
95958 nobody   -22 -10 11804K  6296K swread   0:01  0.15%  0.15% httpd
95963 nobody   -22 -10 11672K  6144K swread   0:02  0.00%  0.00% httpd
95959 nobody     2 -10 11588K  7720K accept   0:01  0.00%  0.00% httpd
95961 nobody     2 -10 11192K  6116K sbwait   0:01  0.00%  0.00% httpd
95998 nobody     2 -10 11144K     0K RUN      0:01  2.80%  0.39% 
95997 nobody     2 -10 11140K     0K RUN      0:01  0.00%  0.00% 
95957 nobody   -22 -10 11116K  5336K swread   0:01  0.00%  0.00% httpd
95972 nobody     2 -10 11072K  5936K sbwait   0:01  0.00%  0.00% httpd
95994 nobody     2 -10 11068K  3016K sbwait   0:01  0.00%  0.00% httpd
96029 nobody     2 -10 11044K     0K RUN      0:01  0.00%  0.00% 
95996 nobody   -22 -10 11020K  5648K swread   0:01  0.00%  0.00% httpd
96020 nobody     2 -10 10972K     0K RUN      0:01  0.00%  0.00% 
95973 nobody   -22 -10 10956K  4208K swread   0:01  0.00%  0.00% httpd
95980 nobody     2 -10 10948K     0K RUN      0:01  0.00%  0.00% 
96021 nobody   -22 -10 10932K  7396K swread   0:01  0.00%  0.00% httpd
95975 nobody     2 -10 10908K  6176K accept   0:01  0.00%  0.00% httpd
96026 nobody     2 -10 10868K  3820K sbwait   0:01  0.00%  0.00% httpd
96023 nobody     2 -10 10864K  7420K accept   0:01  0.00%  0.00% httpd
96022 nobody     2 -10 10852K     0K RUN      0:01  1.75%  0.24% 
95971 nobody     2 -10 10844K     0K RUN      0:01  0.00%  0.00% 
96030 nobody     2 -10 10840K     0K RUN      0:01  3.15%  0.44% 
96028 nobody     2 -10 10768K     0K RUN      0:01  0.00%  0.00% 
95993 nobody     2 -10 10744K     0K RUN      0:01  0.00%  0.00% 
96027 nobody     2 -10 10708K     0K RUN      0:01  0.00%  0.00% 
96024 nobody     2 -10 10668K     0K RUN      0:01  0.00%  0.00% 
96019 nobody   -22 -10  9884K  6196K swread   0:01  0.00%  0.00% httpd
95976 nobody     2 -10  9724K     0K RUN      0:01  0.00%  0.00% 
95974 nobody   -22 -10  9424K  3360K swread   0:00  0.00%  0.00% httpd
96045 nobody    -5 -10  7528K  6180K sysctl   0:00  0.00%  0.00% httpd
96044 nobody    -5 -10  7528K  6176K sysctl   0:00  0.00%  0.00% httpd
96036 nobody    -5 -10  7524K  6188K sysctl   0:00  0.00%  0.00% httpd
96043 nobody    -5 -10  7524K  5796K sysctl   0:00  0.00%  0.00% httpd
96047 nobody    -5 -10  7508K  6160K sysctl   0:00  0.00%  0.00% httpd
96037 nobody    -5 -10  7508K  6140K sysctl   0:00  0.00%  0.00% httpd
96038 nobody    -5 -10  7508K  6140K sysctl   0:00  0.00%  0.00% httpd
96046 nobody    -5 -10  7504K  6164K sysctl   0:00  0.00%  0.00% httpd
96048 nobody    -5 -10  7504K  5716K sysctl   0:00  0.00%  0.00% httpd
96049 nobody    -5 -10  7496K  6148K sysctl   0:00  0.00%  0.00% httpd
  183 root      10  10  5532K     0K RUN      0:23  0.00%  0.00% 
  261 root       2 -15  5476K   992K sbwait   3:31  0.00%  0.00% perl
96050 nobody    -5 -10  4300K  2656K accept   0:00  0.00%  0.00% httpd
  153 root       2 -10  4300K  2016K select   0:20  0.00%  0.00% httpd
  211 root      -6 -20  3412K   808K piperd   3:17  0.00%  0.00% perl
  366 root       2   4  3112K     0K poll     0:00  0.00%  0.00% 
  274 root      -6   4  3044K   588K piperd   0:07  0.00%  0.00% perl
  270 root      10   4  3044K     0K nanslp   0:01  0.00%  0.00% 
  268 root      10   4  3044K     0K nanslp   0:01  0.00%  0.00% 
  273 root      10   4  3044K     0K nanslp   0:01  0.00%  0.00% 
  234 root       2   0  2332K     0K select   0:06  0.00%  0.00% 
  191 root      10   4  2280K  1080K nanslp   0:13  0.00%  0.00% ncftpd
77300 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
90242 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
48258 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
90215 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
75810 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
  239 root      10   0  2088K     0K wait     0:00  0.00%  0.00% 
95759 monkads   28   0  2048K   308K RUN      0:03  0.00%  0.00% top
  262 root       2   0  1356K     0K select   0:01  0.00%  0.00% 
  201 root      -6   1  1264K   168K piperd   0:00  0.00%  0.00% ncftpd
   95 root       2   0  1056K     0K poll     0:06  0.00%  0.00% 
  197 root       2   6  1016K     0K accept   0:00  0.00%  0.00% 
95749 root      -6   0  1008K   236K piperd   0:00  0.00%  0.00% cron
96039 root      -5   0   992K   356K sysctl   0:00  0.00%  0.00% cron
  102 root      10   0   992K     0K nanslp   0:03  0.00%  0.00% 
  207 qmails     2   0   952K     0K select   0:04  0.00%  0.00% 
  319 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  321 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  315 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  316 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  320 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  322 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  318 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  317 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  217 qmaill    -6   0   896K   116K piperd   0:00  0.00%  0.00% splogger
  218 root       2   0   896K     0K select   0:00  0.00%  0.00% 
  219 qmailr     2   0   896K     0K select   0:00  0.00%  0.00% 
  220 qmailq    -6   0   884K    88K piperd   0:00  0.00%  0.00% qmail-clean
  189 root      10   0   864K     0K wait     0:00  0.00%  0.00% 
95752 monkads   10   0   632K     0K wait     0:00  0.00%  0.00% 
    1 root      10   0   544K     0K wait     0:00  0.00%  0.00% init
   23 root      18   0   212K     0K pause    0:00  0.00%  0.00% 
    2 root     -18   0     0K     0K wswbuf  16:24  3.12%  3.12% pagedaemon
    3 root      18   0     0K     0K psleep   1:02  1.61%  1.61% vmdaemon
    5 root      18   0     0K     0K syncer   3:01  0.00%  0.00% syncer
    4 root     -18   0     0K     0K psleep   0:02  0.00%  0.00% bufdaemon
    6 root      -2   0     0K     0K vlruwt   0:02  0.00%  0.00% vnlru
    0 root     -22   0     0K     0K swread   0:00  0.00%  0.00% swapper

last pid: 96100;  load averages: 12.36,  7.48,  4.29  up 6+23:26:59    10:15:15
159 processes: 1 running, 158 sleeping
CPU states: 28.4% user,  0.0% nice,  7.7% system,  3.0% interrupt, 60.9% idle
Mem: 880M Active, 17M Inact, 93M Wired, 12M Cache, 112M Buf, 1664K Free
Swap: 4096M Total, 683M Used, 3413M Free, 16% Inuse, 66M In, 241M Out

  PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
84338 nobody   -22 -10 49620K 21480K swread   6:14  0.00%  0.00% httpd
84049 nobody   -22 -10 49396K 21332K swread   6:49  0.00%  0.00% httpd
84046 nobody   -22 -10 48112K 22696K swread   6:56  0.00%  0.00% httpd
84044 nobody   -22 -10 46144K 21812K swread   5:58  0.00%  0.00% httpd
84032 nobody   -22 -10 45952K 24084K swread   6:02  0.00%  0.00% httpd
84041 nobody   -22 -10 45220K 26016K swread   6:05  0.00%  0.00% httpd
84042 nobody   -22 -10 44948K 21868K swread   6:40  0.00%  0.00% httpd
84036 nobody   -22 -10 44776K 23204K swread   6:00  0.00%  0.00% httpd
84034 nobody   -22 -10 44104K 24428K swread   6:27  0.00%  0.00% httpd
84051 nobody   -22 -10 43768K 22316K swread   5:51  0.00%  0.00% httpd
84048 nobody   -22 -10 43732K 16604K swread   6:15  0.00%  0.00% httpd
84031 nobody   -22 -10 43732K 12844K swread   5:24  0.00%  0.00% httpd
84030 nobody     2 -10 43672K 22964K sbwait   5:17  0.00%  0.00% httpd
84202 nobody   -22 -10 43516K 21788K swread   6:16  0.00%  0.00% httpd
84045 nobody   -22 -10 43508K 13804K swread   5:33  0.00%  0.00% httpd
84047 nobody   -22 -10 43444K 21072K swread   5:51  0.00%  0.00% httpd
84043 nobody   -22 -10 43216K 21724K swread   6:28  0.00%  0.00% httpd
84037 nobody   -22 -10 42288K 24248K swread   6:20  0.00%  0.00% httpd
84033 nobody   -22 -10 41368K 21056K swread   6:25  0.00%  0.00% httpd
84339 nobody   -22 -10 41156K 16192K swread   5:41  0.00%  0.00% httpd
84035 nobody   -22 -10 40252K 22736K swread   5:40  0.00%  0.00% httpd
84039 nobody   -22 -10 40232K 16004K swread   6:13  0.00%  0.00% httpd
84040 nobody   -22 -10 40000K 16204K swread   6:08  0.00%  0.00% httpd
84038 nobody   -22 -10 39640K 24008K swread   5:55  0.00%  0.00% httpd
96070 nobody   -22 -10 37316K 32432K swread   0:06  0.59%  0.59% httpd
96069 nobody   -22 -10 36860K 32116K swread   0:04  0.54%  0.54% httpd
95977 nobody   -22 -10 34100K  6892K swread   0:02  0.00%  0.00% httpd
95995 nobody   -22 -10 33740K  7176K swread   0:02  0.00%  0.00% httpd
96025 nobody   -22 -10 32972K  3128K swread   0:02  0.00%  0.00% httpd
  230 root       2  10 21324K     0K select   0:01  0.00%  0.00% 
96082 nobody     2 -10 19172K   276K accept   0:05  0.00%  0.00% httpd
95962 nobody   -22 -10 18144K  3616K swread   0:02  0.00%  0.00% httpd
95998 nobody   -22 -10 17328K  5088K swread   0:01  0.00%  0.00% httpd
95994 nobody   -22 -10 17284K  5212K swread   0:01  0.00%  0.00% httpd
96026 nobody   -22 -10 17152K  6100K swread   0:01  0.00%  0.00% httpd
96050 nobody   -22 -10 16908K  6072K swread   0:01  0.00%  0.00% httpd
96054 nobody   -22 -10 14024K  7732K swread   0:03  0.00%  0.00% httpd
95899 nobody   -22 -10 13884K  6792K swread   0:04  0.00%  0.00% httpd
96049 nobody   -22 -10 13388K  8284K swread   0:02  0.34%  0.34% httpd
95825 nobody   -22 -10 13152K  7640K swread   0:05  0.00%  0.00% httpd
95820 nobody   -22 -10 12876K  7028K swread   0:03  0.00%  0.00% httpd
95960 nobody   -22 -10 12840K  6724K swread   0:02  0.00%  0.00% httpd
95900 nobody   -22 -10 12804K  5572K swread   0:02  0.00%  0.00% httpd
95898 nobody   -22 -10 12604K  6360K swread   0:04  0.00%  0.00% httpd
95848 nobody     2 -10 12548K  6344K sbwait   0:03  0.00%  0.00% httpd
95971 nobody   -22 -10 12516K  5708K swread   0:01  0.00%  0.00% httpd
96029 nobody   -22 -10 12444K  5088K swread   0:01  0.00%  0.00% httpd
96066 nobody   -22 -10 12392K  6732K swread   0:01  0.00%  0.00% httpd
96080 nobody   -22 -10 12284K   916K swread   0:02  0.00%  0.00% httpd
96044 nobody     2 -10 12260K  7152K sbwait   0:02  0.00%  0.00% httpd
96067 nobody   -22 -10 12188K  7488K swread   0:02  0.00%  0.00% httpd
95972 nobody     2 -10 12132K     0K accept   0:02  0.00%  0.00% httpd
95902 nobody   -22 -10 12124K  5216K swread   0:02  0.00%  0.00% httpd
95958 nobody   -22 -10 12100K  6196K swread   0:01  0.00%  0.00% httpd
95957 nobody     2 -10 12028K  5672K sbwait   0:02  0.34%  0.34% httpd
96036 nobody   -22 -10 11900K  8356K swread   0:01  0.00%  0.00% httpd
96020 nobody     2 -10 11804K  6160K sbwait   0:01  0.00%  0.00% httpd
96074 nobody   -22 -10 11772K  1488K swread   0:01  0.00%  0.00% httpd
95997 nobody   -22 -10 11720K  6036K swread   0:01  0.00%  0.00% httpd
95980 nobody   -22 -10 11672K  5968K swread   0:02  0.00%  0.00% httpd
95963 nobody   -22 -10 11672K  5056K swread   0:02  0.00%  0.00% httpd
96022 nobody   -22 -10 11652K  6044K swread   0:01  0.00%  0.00% httpd
95996 nobody   -22 -10 11648K  6500K swread   0:02  0.00%  0.00% httpd
95959 nobody   -22 -10 11588K  5408K swread   0:01  0.00%  0.00% httpd
96055 nobody   -22 -10 11564K  6880K swread   0:01  0.00%  0.00% httpd
95975 nobody   -22 -10 11404K  6316K swread   0:01  0.00%  0.00% httpd
95993 nobody   -22 -10 11348K  6736K swread   0:01  0.00%  0.00% httpd
96048 nobody   -22 -10 11320K  6436K swread   0:01  0.00%  0.00% httpd
95961 nobody   -22 -10 11320K  4200K swread   0:01  0.00%  0.00% httpd
96051 nobody   -22 -10 11244K  6740K swread   0:01  0.00%  0.00% httpd
96043 nobody   -22 -10 11228K  5988K swread   0:01  0.00%  0.00% httpd
96046 nobody   -22 -10 11148K  6128K swread   0:01  0.00%  0.00% httpd
96021 nobody   -22 -10 11128K  5560K swread   0:01  0.00%  0.00% httpd
96038 nobody   -22 -10 11100K  6244K swread   0:01  0.00%  0.00% httpd
96045 nobody   -22 -10 11052K  5876K swread   0:01  0.00%  0.00% httpd
96024 nobody   -22 -10 11024K  4348K swread   0:01  0.00%  0.00% httpd
96028 nobody   -22 -10 11000K  5292K swread   0:01  0.00%  0.00% httpd
95973 nobody   -22 -10 10972K  5256K swread   0:01  0.00%  0.00% httpd
96030 nobody   -22 -10 10924K  5460K swread   0:01  0.00%  0.00% httpd
96027 nobody   -22 -10 10868K  4608K swread   0:01  0.00%  0.00% httpd
96023 nobody   -22 -10 10864K  5412K swread   0:01  0.00%  0.00% httpd
96019 nobody   -22 -10 10648K  5228K swread   0:01  0.00%  0.00% httpd
95976 nobody   -22 -10 10020K  4664K swread   0:01  0.00%  0.00% httpd
96052 nobody   -22 -10  9720K  5980K swread   0:00  0.00%  0.00% httpd
96037 nobody   -22 -10  9432K  4760K swread   0:00  0.00%  0.00% httpd
95974 nobody   -22 -10  9424K  2924K swread   0:00  0.00%  0.00% httpd
96077 nobody   -22 -10  9096K  5168K swread   0:00  0.00%  0.00% httpd
96071 nobody   -22 -10  8708K  4636K swread   0:00  0.00%  0.00% httpd
96047 nobody   -22 -10  8668K   776K swread   0:00  0.00%  0.00% httpd
96078 nobody   -22 -10  8568K  5004K swread   0:00  0.00%  0.00% httpd
96053 nobody   -22 -10  8396K  2956K swread   0:00  0.00%  0.00% httpd
96076 nobody   -22 -10  8356K  3336K swread   0:00  0.00%  0.00% httpd
96065 nobody   -22 -10  8048K  3004K swread   0:00  0.00%  0.00% httpd
96079 nobody    -5 -10  7528K  3752K sysctl   0:00  0.00%  0.00% httpd
96096 nobody    -5 -10  7528K  3624K sysctl   0:00  0.00%  0.00% httpd
96095 nobody    -5 -10  7528K  3592K sysctl   0:00  0.00%  0.00% httpd
96087 nobody    -5 -10  7524K  3608K sysctl   0:00  0.00%  0.00% httpd
96083 nobody    -5 -10  7508K  3620K sysctl   0:00  0.00%  0.00% httpd
96094 nobody    -5 -10  7508K  3584K sysctl   0:00  0.00%  0.00% httpd
96081 nobody    -5 -10  7496K  3512K sysctl   0:00  0.00%  0.00% httpd
  183 root      10  10  5532K   636K nanslp   0:23  0.00%  0.00% perl
  261 root       2 -15  5476K  1108K sbwait   3:31  0.00%  0.00% perl
96101 root      -5 -10  4300K  2072K pfault   0:00  0.00%  0.00% httpd
  153 root       2 -10  4300K  1960K select   0:20  0.00%  0.00% httpd
  211 root      -6 -20  3412K   700K piperd   3:17  0.00%  0.00% perl
  366 root       2   4  3112K     0K poll     0:00  0.00%  0.00% 
  274 root      -6   4  3044K   412K piperd   0:07  0.00%  0.00% perl
  270 root      -6   4  3044K   324K piperd   0:01  0.00%  0.00% perl
  268 root      10   4  3044K     0K nanslp   0:01  0.00%  0.00% 
  273 root      10   4  3044K     0K nanslp   0:01  0.00%  0.00% 
  234 root       2   0  2332K     0K select   0:06  0.00%  0.00% 
  191 root      10   4  2280K  1072K nanslp   0:13  0.00%  0.00% ncftpd
77300 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
90242 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
48258 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
90215 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
75810 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
95759 monkads   28   0  2092K   300K RUN      0:03  0.00%  0.00% top
  239 root      10   0  2088K     0K wait     0:00  0.00%  0.00% 
  262 root       2   0  1356K     0K select   0:01  0.00%  0.00% 
  201 root      -6   1  1264K   168K piperd   0:00  0.00%  0.00% ncftpd
   95 root     -22   0  1056K   116K swread   0:06  0.00%  0.00% syslog-ng
  197 root       2   6  1016K     0K accept   0:00  0.00%  0.00% 
96039 root      -6   0  1008K   336K piperd   0:00  0.00%  0.00% cron
95749 root      -6   0  1008K   184K piperd   0:00  0.00%  0.00% cron
96072 root      -6   0  1008K    80K piperd   0:00  0.00%  0.00% cron
  102 root      10   0   992K     0K nanslp   0:03  0.00%  0.00% 
  207 qmails     2   0   952K     0K select   0:04  0.00%  0.00% 
  321 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  319 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  315 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  316 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  320 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  322 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  318 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  317 root       3   0   952K     0K ttyin    0:00  0.00%  0.00% 
  217 qmaill    -6   0   896K   124K piperd   0:00  0.00%  0.00% splogger
  218 root       2   0   896K     0K select   0:00  0.00%  0.00% 
  219 qmailr     2   0   896K     0K select   0:00  0.00%  0.00% 
  220 qmailq    -6   0   884K    88K piperd   0:00  0.00%  0.00% qmail-clean
  189 root      10   0   864K     0K wait     0:00  0.00%  0.00% 
96097 root      -5   0   724K    16K sysctl   0:00  0.00%  0.00% hps
96084 root      10   0   632K     0K wait     0:00  0.00%  0.00% 
96063 root      10   0   632K     0K wait     0:00  0.00%  0.00% 
95752 monkads   10   0   632K     0K wait     0:00  0.00%  0.00% 
96061 root      10   0   628K     0K wait     0:00  0.00%  0.00% 
96073 root      10   0   628K     0K wait     0:00  0.00%  0.00% 
96098 root      10   4   628K     0K wait     0:00  0.00%  0.00% 
96075 root      10   4   628K     0K wait     0:00  0.00%  0.00% 
    1 root      10   0   544K     0K wait     0:00  0.00%  0.00% init
96089 root      -5   4   380K     8K sysctl   0:00  0.00%  0.00% ps
96090 root      -5   4   228K    40K sysctl   0:00  0.00%  0.00% tail
   23 root      18   0   212K     0K pause    0:00  0.00%  0.00% 
    2 root     -18   0     0K     0K wswbuf  16:33  1.86%  1.86% pagedaemon
    5 root      18   0     0K     0K syncer   3:01  0.00%  0.00% syncer
    3 root      18   0     0K     0K psleep   1:03  0.00%  0.00% vmdaemon
    4 root     -18   0     0K     0K psleep   0:02  0.00%  0.00% bufdaemon
    6 root      -2   0     0K     0K vlruwt   0:02  0.00%  0.00% vnlru
    0 root     -18   0     0K     0K sched    0:00  0.00%  0.00% swapper

last pid: 96223;  load averages: 17.47, 13.70,  8.74  up 6+23:28:32    10:16:48
255 processes: 36 running, 217 sleeping, 2 zombie
CPU states: 14.7% user,  0.0% nice,  4.0% system,  2.4% interrupt, 78.9% idle
Mem: 872M Active, 27M Inact, 96M Wired, 6424K Cache, 112M Buf, 1664K Free
Swap: 4096M Total, 1133M Used, 2963M Free, 27% Inuse, 190M In, 518M Out

  PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
84338 nobody   -22 -10 49620K 18376K swread   6:14  0.00%  0.00% httpd
84049 nobody   -22 -10 49396K 21124K swread   6:50  0.00%  0.00% httpd
84046 nobody   -22 -10 48112K 21748K swread   6:56  0.00%  0.00% httpd
84044 nobody   -22 -10 46144K 18480K swread   5:58  0.00%  0.00% httpd
84032 nobody   -22 -10 45952K 22016K swread   6:02  0.00%  0.00% httpd
84041 nobody   -22 -10 45220K 15280K swread   6:05  0.00%  0.00% httpd
84042 nobody   -22 -10 44948K 18696K swread   6:40  0.00%  0.00% httpd
84036 nobody   -22 -10 44776K 11600K swread   6:00  0.00%  0.00% httpd
84034 nobody   -22 -10 44104K 21084K swread   6:27  0.00%  0.00% httpd
84051 nobody   -22 -10 43768K 10052K swread   5:51  0.00%  0.00% httpd
84048 nobody   -22 -10 43732K 13084K swread   6:15  0.00%  0.00% httpd
84031 nobody   -22 -10 43732K  8996K swread   5:24  0.00%  0.00% httpd
84030 nobody   -22 -10 43672K 12428K swread   5:17  0.00%  0.00% httpd
84202 nobody   -22 -10 43516K 20212K swread   6:16  0.00%  0.00% httpd
84045 nobody   -22 -10 43508K  9856K swread   5:33  0.00%  0.00% httpd
84047 nobody   -22 -10 43444K 19036K swread   5:51  0.00%  0.00% httpd
84043 nobody   -22 -10 43216K 19704K swread   6:28  0.00%  0.00% httpd
84037 nobody   -22 -10 42288K 20356K swread   6:20  0.00%  0.00% httpd
84033 nobody   -22 -10 41368K 17776K swread   6:25  0.00%  0.00% httpd
84339 nobody   -22 -10 41156K 12856K swread   5:41  0.00%  0.00% httpd
84035 nobody   -22 -10 40252K 20056K swread   5:40  0.00%  0.00% httpd
84039 nobody   -22 -10 40232K  7588K swread   6:13  0.00%  0.00% httpd
84040 nobody   -22 -10 40000K 13296K swread   6:08  0.00%  0.00% httpd
84038 nobody   -22 -10 39640K 23008K swread   5:55  0.00%  0.00% httpd
96070 nobody   -22 -10 37316K 16016K swread   0:06  0.00%  0.00% httpd
96069 nobody   -22 -10 36860K 29684K swread   0:04  0.00%  0.00% httpd
95977 nobody   -22 -10 34100K  6032K swread   0:02  0.00%  0.00% httpd
95995 nobody   -22 -10 33740K  6692K swread   0:02  0.00%  0.00% httpd
96025 nobody   -22 -10 32972K  2588K swread   0:02  0.00%  0.00% httpd
  230 root       2  10 21324K     0K select   0:01  0.00%  0.00% 
96082 nobody   -22 -10 19172K  6828K swread   0:05  0.00%  0.00% httpd
96049 nobody   -22 -10 18932K  7140K swread   0:03  0.50%  0.49% httpd
95962 nobody   -22 -10 18144K  4012K swread   0:02  0.00%  0.00% httpd
95998 nobody   -22 -10 17328K  4856K swread   0:01  0.00%  0.00% httpd
95994 nobody   -22 -10 17284K  4764K swread   0:01  0.00%  0.00% httpd
96026 nobody   -22 -10 17152K  5684K swread   0:01  0.00%  0.00% httpd
96050 nobody   -22 -10 16908K  5540K swread   0:01  0.00%  0.00% httpd
96044 nobody   -22 -10 14156K  7476K swread   0:03  0.00%  0.00% httpd
96054 nobody   -22 -10 14024K  6920K swread   0:03  0.00%  0.00% httpd
95899 nobody   -22 -10 13884K  6256K swread   0:04  0.00%  0.00% httpd
95825 nobody   -22 -10 13152K  6884K swread   0:05  0.00%  0.00% httpd
95820 nobody   -22 -10 12876K  5660K swread   0:04  0.00%  0.00% httpd
95960 nobody   -22 -10 12840K  6312K swread   0:02  0.00%  0.00% httpd
95900 nobody   -22 -10 12804K  4056K swread   0:02  0.00%  0.00% httpd
95898 nobody   -22 -10 12624K  6084K swread   0:04  0.00%  0.00% httpd
95848 nobody   -22 -10 12548K  6152K swread   0:03  0.00%  0.00% httpd
95971 nobody   -22 -10 12516K  5124K swread   0:01  0.00%  0.00% httpd
96067 nobody   -22 -10 12468K  4880K swread   0:02  0.00%  0.00% httpd
96029 nobody   -22 -10 12444K  4384K swread   0:01  0.00%  0.00% httpd
96174 nobody     2 -10 12408K 11060K sbwait   0:02  3.13%  3.12% httpd
96066 nobody   -22 -10 12392K  6036K swread   0:01  0.00%  0.00% httpd
95958 nobody   -22 -10 12288K  5760K swread   0:01  0.00%  0.00% httpd
96080 nobody   -22 -10 12284K  3476K swread   0:02  0.00%  0.00% httpd
95957 nobody   -22 -10 12168K  5676K swread   0:02  0.00%  0.00% httpd
95972 nobody   -22 -10 12132K  2976K swread   0:02  0.00%  0.00% httpd
95902 nobody   -22 -10 12124K  4908K swread   0:02  0.00%  0.00% httpd
96125 nobody     2 -10 12084K 10408K sbwait   0:01  1.95%  1.95% httpd
96052 nobody   -22 -10 12032K  7244K swread   0:01  0.88%  0.88% httpd
96036 nobody   -22 -10 11900K  6604K swread   0:01  0.00%  0.00% httpd
96180 nobody     2 -10 11880K 10600K sbwait   0:01  1.82%  1.81% httpd
96020 nobody   -22 -10 11804K  5168K swread   0:02  0.00%  0.00% httpd
95980 nobody   -22 -10 11772K  6960K swread   0:02  0.00%  0.00% httpd
96074 nobody   -22 -10 11772K  3264K swread   0:01  0.00%  0.00% httpd
96055 nobody   -22 -10 11740K  6416K swread   0:01  0.00%  0.00% httpd
95997 nobody   -22 -10 11720K  5636K swread   0:01  0.00%  0.00% httpd
95963 nobody   -22 -10 11672K  4784K swread   0:02  0.00%  0.00% httpd
96022 nobody   -22 -10 11652K  4888K swread   0:02  0.00%  0.00% httpd
95996 nobody   -22 -10 11648K  4864K swread   0:02  0.00%  0.00% httpd
95959 nobody   -22 -10 11620K  5252K swread   0:01  0.00%  0.00% httpd
95976 nobody     2 -10 11588K  5648K sbwait   0:01  0.00%  0.00% httpd
95975 nobody   -22 -10 11404K  6024K swread   0:01  0.00%  0.00% httpd
95993 nobody   -22 -10 11400K  4532K swread   0:01  0.00%  0.00% httpd
96048 nobody   -22 -10 11340K  5908K swread   0:01  0.00%  0.00% httpd
95961 nobody   -22 -10 11328K  3808K swread   0:01  0.00%  0.00% httpd
96051 nobody   -22 -10 11252K  4712K swread   0:01  0.00%  0.00% httpd
96043 nobody     2 -10 11228K     0K RUN      0:01  0.00%  0.00% 
96024 nobody   -22 -10 11216K  5176K swread   0:01  0.00%  0.00% httpd
96045 nobody   -22 -10 11196K  5528K swread   0:01  0.00%  0.00% httpd
96121 nobody     2 -10 11192K  9752K sbwait   0:01  2.39%  2.39% httpd
96046 nobody   -22 -10 11148K  5368K swread   0:01  0.00%  0.00% httpd
96038 nobody   -22 -10 11128K  5812K swread   0:01  0.00%  0.00% httpd
96021 nobody   -22 -10 11128K  3928K swread   0:01  0.00%  0.00% httpd
96028 nobody   -22 -10 11100K  4908K swread   0:01  0.00%  0.00% httpd
96167 nobody     2 -10 11076K  9616K sbwait   0:01  1.17%  1.17% httpd
96019 nobody   -22 -10 10992K  5520K swread   0:01  0.00%  0.00% httpd
96186 nobody   -14 -10 10976K  9784K inode    0:01  0.74%  0.73% httpd
95973 nobody   -22 -10 10972K  4960K swread   0:01  0.00%  0.00% httpd
96030 nobody   -22 -10 10924K  3700K swread   0:01  0.00%  0.00% httpd
96027 nobody   -22 -10 10868K  4656K swread   0:01  0.00%  0.00% httpd
96023 nobody   -22 -10 10864K  5756K swread   0:01  0.00%  0.00% httpd
96183 nobody     2 -10 10592K  9436K sbwait   0:01  0.39%  0.39% httpd
96179 nobody     2 -10 10512K  9152K sbwait   0:01  0.64%  0.63% httpd
96116 nobody     2 -10 10320K  8952K sbwait   0:01  0.98%  0.98% httpd
96117 nobody   -18 -10  9632K  8320K spread   0:00  0.00%  0.00% httpd
96037 nobody   -22 -10  9436K  4392K swread   0:00  0.00%  0.00% httpd
95974 nobody   -22 -10  9424K  2624K swread   0:00  0.00%  0.00% httpd
96077 nobody   -22 -10  9160K  4804K swread   0:00  0.00%  0.00% httpd
96187 nobody     2 -10  9080K  7668K sbwait   0:01  0.74%  0.73% httpd
96071 nobody   -22 -10  8752K  4452K swread   0:00  0.00%  0.00% httpd
96182 nobody     2 -10  8716K     0K RUN      0:00  0.00%  0.00% 
96047 nobody   -22 -10  8668K  1520K swread   0:00  0.00%  0.00% httpd
96078 nobody   -22 -10  8616K  3724K swread   0:00  0.00%  0.00% httpd
96184 nobody     2 -10  8556K     0K RUN      0:00  0.00%  0.00% 
96053 nobody   -22 -10  8396K  2424K swread   0:00  0.00%  0.00% httpd
96185 nobody     2 -10  8396K     0K RUN      0:00  0.00%  0.00% 
96157 nobody   -22 -10  8376K  6284K swread   0:00  0.00%  0.00% httpd
96076 nobody   -22 -10  8356K  3112K swread   0:00  0.00%  0.00% httpd
96065 nobody   -22 -10  8048K  2884K swread   0:00  0.00%  0.00% httpd
96175 nobody     2 -10  7988K     0K RUN      0:00  0.00%  0.00% 
96159 nobody   -22 -10  7596K  5240K swread   0:00  0.00%  0.00% httpd
96161 nobody   -22 -10  7572K  5068K swread   0:00  0.00%  0.00% httpd
96122 nobody     2 -10  7536K     0K RUN      0:00  0.00%  0.00% 
96129 nobody     2 -10  7536K     0K RUN      0:00  0.00%  0.00% 
96160 nobody   -22 -10  7532K  5068K swread   0:00  0.00%  0.00% httpd
96162 nobody   -22 -10  7532K  5068K swread   0:00  0.00%  0.00% httpd
96134 nobody   -22 -10  7532K  3240K swread   0:00  0.00%  0.00% httpd
96171 nobody     2 -10  7532K     0K RUN      0:00  0.00%  0.00% 
96126 nobody     2 -10  7532K     0K RUN      0:00  0.00%  0.00% 
96166 nobody     2 -10  7532K     0K RUN      0:00  0.00%  0.00% 
96163 nobody   -22 -10  7528K  5100K swread   0:00  0.00%  0.00% httpd
96109 nobody   -22 -10  7528K  3192K swread   0:00  0.00%  0.00% httpd
96105 nobody   -22 -10  7528K  3188K swread   0:00  0.00%  0.00% httpd
96096 nobody   -22 -10  7528K  3120K swread   0:00  0.00%  0.00% httpd
96132 nobody   -22 -10  7528K  3116K swread   0:00  0.00%  0.00% httpd
96095 nobody   -22 -10  7528K  3044K swread   0:00  0.00%  0.00% httpd
96133 nobody   -22 -10  7528K  3040K swread   0:00  0.00%  0.00% httpd
96107 nobody   -22 -10  7528K  2976K swread   0:00  0.00%  0.00% httpd
96114 nobody   -22 -10  7528K  2720K swread   0:00  0.00%  0.00% httpd
96123 nobody   -22 -10  7528K  2688K swread   0:00  0.00%  0.00% httpd
96079 nobody   -22 -10  7528K  2588K swread   0:00  0.00%  0.00% httpd
96087 nobody   -22 -10  7524K  3296K swread   0:00  0.00%  0.00% httpd
96128 nobody     2 -10  7524K     0K RUN      0:00  0.00%  0.00% 
96124 nobody     2 -10  7516K     0K RUN      0:00  0.00%  0.00% 
96176 nobody     2 -10  7516K     0K RUN      0:00  0.00%  0.00% 
96131 nobody   -22 -10  7512K  4800K swread   0:00  0.00%  0.00% httpd
96106 nobody   -22 -10  7512K  3116K swread   0:00  0.00%  0.00% httpd
96112 nobody   -22 -10  7512K  2760K swread   0:00  0.00%  0.00% httpd
96130 nobody   -18 -10  7508K  4932K spread   0:00  0.00%  0.00% httpd
96094 nobody   -22 -10  7508K  3164K swread   0:00  0.00%  0.00% httpd
96083 nobody   -22 -10  7508K  3144K swread   0:00  0.00%  0.00% httpd
96118 nobody   -22 -10  7508K  2812K swread   0:00  0.00%  0.00% httpd
96113 nobody   -22 -10  7504K  2780K swread   0:00  0.00%  0.00% httpd
96101 nobody   -22 -10  7496K  3212K swread   0:00  0.00%  0.00% httpd
96102 nobody   -22 -10  7496K  3180K swread   0:00  0.00%  0.00% httpd
96081 nobody   -22 -10  7496K  3076K swread   0:00  0.00%  0.00% httpd
96111 nobody   -22 -10  7496K  2712K swread   0:00  0.00%  0.00% httpd
96127 nobody     2 -10  7496K     0K RUN      0:00  0.00%  0.00% 
96164 nobody     2 -10  7496K     0K RUN      0:00  0.00%  0.00% 
96165 nobody     2 -10  7496K     0K RUN      0:00  0.00%  0.00% 
96158 nobody     2 -10  7496K     0K RUN      0:00  0.00%  0.00% 
96169 nobody     2 -10  7496K     0K RUN      0:00  0.00%  0.00% 
96170 nobody     2 -10  7496K     0K RUN      0:00  0.00%  0.00% 
96168 nobody     2 -10  7496K     0K RUN      0:00  0.00%  0.00% 
96120 nobody   -22 -10  7492K  3064K swread   0:00  0.00%  0.00% httpd
96115 nobody   -22 -10  7492K  2720K swread   0:00  0.00%  0.00% httpd
96103 nobody   -22 -10  7112K  2308K swread   0:00  0.00%  0.00% httpd
  183 root      10  10  5532K     0K RUN      0:23  0.00%  0.00% 
  261 root       2 -15  5476K  1004K sbwait   3:31  0.00%  0.00% perl
96190 nobody   -18 -10  5100K  3656K spread   0:00  0.00%  0.00% httpd
96189 nobody   -14 -10  5020K  3692K inode    0:00  0.00%  0.00% httpd
96192 nobody    -6 -10  4968K  3536K biord    0:00  0.00%  0.00% httpd
96195 nobody   -14 -10  4888K  3600K inode    0:00  0.00%  0.00% httpd
96193 nobody    -6 -10  4888K  3596K biord    0:00  0.00%  0.00% httpd
96191 nobody   -14 -10  4888K  3596K inode    0:00  0.00%  0.00% httpd
96119 nobody   -14 -10  4888K  3372K inode    0:00  0.00%  0.00% httpd
  153 root       2 -10  4300K  1964K select   0:20  0.00%  0.00% httpd
96188 nobody   -22 -10  4300K   648K swread   0:00  0.00%  0.00% httpd
96194 nobody     2 -10  4300K     0K RUN      0:00  0.00%  0.00% 
96198 nobody     2 -10  4300K     0K RUN      0:00  0.00%  0.00% 
96196 nobody     2 -10  4300K     0K RUN      0:00  0.00%  0.00% 
96197 nobody     2 -10  4300K     0K RUN      0:00  0.00%  0.00% 
  211 root     -22 -20  3412K   784K swread   3:17  0.00%  0.00% perl
  366 root       2   4  3112K     0K poll     0:00  0.00%  0.00% 
96216 root      36   4  3044K   716K RUN      0:00  0.00%  0.00% perl
  273 root      -6   4  3044K   532K piperd   0:01  0.00%  0.00% perl
  270 root      36   4  3044K   432K RUN      0:01  0.00%  0.00% perl
  274 root      36   4  3044K   412K RUN      0:07  0.00%  0.00% perl
  268 root      10   4  3044K     0K nanslp   0:01  0.00%  0.00% 
  234 root       2   0  2332K     0K select   0:06  0.00%  0.00% 
  191 root      10   4  2280K     0K RUN      0:13  0.00%  0.00% 
77300 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
90242 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
48258 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
90215 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
75810 root       2   4  2280K     0K accept   0:00  0.00%  0.00% 
95759 monkads   30   0  2196K   532K RUN      0:03  1.84%  0.83% top
  239 root      10   0  2088K     0K wait     0:00  0.00%  0.00% 
  262 root       2   0  1356K     0K select   0:01  0.00%  0.00% 
96209 root     -18   0  1312K   548K spread   0:00  0.00%  0.00% perl
  201 root      -6   1  1264K   156K piperd   0:00  0.00%  0.00% ncftpd
96211 root     -14   0  1212K   536K inode    0:00  0.00%  0.00% perl
   95 root       2   0  1056K     0K poll     0:06  0.00%  0.00% 
  197 root       2   6  1016K     0K accept   0:00  0.00%  0.00% 
96206 root      -6   0  1008K   488K piperd   0:00  0.00%  0.00% cron
96039 root      -6   0  1008K   232K piperd   0:00  0.00%  0.00% cron
95749 root      -6   0  1008K   164K piperd   0:00  0.00%  0.00% cron
96072 root      -6   0  1008K    80K piperd   0:00  0.00%  0.00% cron
  102 root      10   0   992K     0K nanslp   0:03  0.00%  0.00% 
96104 root      10   0   992K     0K RUN      0:00  0.00%  0.00% 
96140 root      10   0   992K     0K RUN      0:00  0.00%  0.00% 

Update: Looking at the http access_log for around the time that the problem appears to start has not revealed any "smoking gun" evil URLs that somehow cause the receiving httpd to become a fork bomb, but that haystack is rather large and the data recorded isn't ideal for finding such things. A more Everything-aware log of accesses is on my to-do list...

- [tye]        

Re: 'A' web server takes another "time out"
created: 2006-05-03 14:58:36
I don't have any guesses as to what's causing so many httpds, but perhaps you can fix it by changing your Apache configuration? It seems like a well-considered MaxClients could prevent this kind of explosion.

-sam

Re^2: 'A' web server takes another "time out" (MaxClients)
tye
created: 2006-05-03 15:50:37

Thanks for the pointer.

It looks like I'd need read access to /usr/pair/apache/... in order to check that but I don't have it. An older copy of httpd.conf that I requested (before the upgrade) had MaxClients set to 100. I'll have to ask for a new copy...

- tye        

Re^3: 'A' web server takes another "time out" (MaxClients)
created: 2006-05-03 16:04:07
If PerlMonks is running under mod_perl you should be able to use the Apache API to examine the current setting. You might even be able to dynamically change it!

-sam

Re^4: 'A' web server takes another "time out" (MaxClients)
tye
created: 2006-05-04 03:02:57

If someone wants to take the time to be quite specific about how to check and/or set things like MaxClients and RLimitNPROC from mod_perl (or such), then I'll try them out. I did some looking but didn't come close to an answer in the short time I was able to devote to the search.

Update: Corion reports looking into this and finding nothing useful for Apache1 (or at least that works on our web server).

- tye        

Re^2: 'A' web server takes another "time out"
created: 2006-05-05 13:57:42
In my experience, it is very important to have correct values for:
  • MaxClients -> 100 is OK (it can be more)
  • MinSpareServers -> min free instances, suggest the same as StartServers
  • MaxSpareServers -> max free instances, suggest more than 60% of MaxClients
  • MaxKeepAliveRequests -> never all instances, suggest 50% of MaxClients
  • MaxRequestsPerChild -> max requests before the process is killed, set if required...

Some time ago I wrote a node, Monitor instances of Apache Web server, with a script that shows how Apache's web connections are being used live. To see historical usage, I suggest using it or Apache-Tools (from Apache-Security).

Judging by your load averages, swap, and CPU states, in my opinion tuning Apache would give good results... Look at the running times of your httpd processes:

$ grep httpd 547234 | awk '{print $8}' | sort | uniq -c | head -n 5
    125 0:00
    110 0:01
     41 0:02
     14 0:03
      9 0:04
But the really useful info is "Parent Server Generation: XX" in server-status ... You really need to enable this module ;)

Current Time: Friday, 05-May-2006 10:56:47 PDT
Restart Time: Tuesday, 02-May-2006 10:24:02 PDT
Parent Server Generation: 3
Server uptime: 3 days 32 minutes 45 seconds
Total accesses: 16557075 - Total Traffic: 349.2 GB
CPU Usage: u170.547 s310.375 cu0 cs0 - .184% CPU load
63.4 requests/sec - 1.4 MB/second - 22.1 kB/request
175 requests currently being processed, 81 idle workers

CKWWCKK_K_K_CKC_KK_K___KKK__K_KKKKKKKC_K_CC_KWK__WKKK_K_WKKK__WK
GGG.GG.G.GGG...GGGGG..GGG.GGG.GG..GG.G..GG..G.....GWGGGGGGGGGGGG
.G...W.GG.....G.GG.G..G.GG.........GGWG.G..G.G.....G...WG....G.G
_K__KKCKCKWCK_WK__KKK_K_KW_KC__W___KKKK_KKKCK_KKWKKC_KCKKWKCKKWC

--
Marco Antonio
Rio-PM

Re^3: 'A' web server takes another "time out"
created: 2006-05-06 22:15:09
MaxClients of 100 seems pretty high to me. Commodity hardware isn't going to deal with 100 simultaneous mod_perl jobs very well! Even if you have the memory to handle that many jobs, you probably don't have the CPU.

MaxClients can be high on a front-end server which serves static content and does a reverse proxy to the mod_perl backend. Those servers do much less work per-request and a given machine can run more of them simultaneously.

-sam

Re: 'A' web server takes another "time out"
created: 2006-05-03 16:26:59
See lots of 'httpd' processes appear

I'm not sure what the usual ratio is between available system resources and system resources needed to keep up with Perlmonks requests, but _if_ the ratio were to dip below a certain critical point, then the number of new processes would grow faster than the old processes could finish. If that were the case, the total number of running processes could be expected to increase steadily, further dividing the system resources (notably RAM) available to each, in a vicious cycle, which would explain the extremeness of the symptoms you describe.

However, that leaves open the question of what happens to trigger the event in the first place. If the available system resources were just barely adequate for handling normal (or normal peak) traffic, then a slightly-more-than-normal traffic spike could trigger it, but it seems like if the system were that close to maxed out all the time you'd probably already know it. Are there things users can do that cause substantially more activity on the server than a normal request? Too many Super Search queries at once, perhaps, or something along those lines?
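
The vicious cycle described above can be illustrated with a toy model. This is only a sketch under invented assumptions: the RAM size, per-process footprint, completion rate, and the quadratic slowdown once RAM is overcommitted are all made up for illustration and are not measurements from the PerlMonks servers. The function name simulate_httpd is likewise hypothetical.

```shell
# Toy model of the feedback loop described above. Every number here is invented
# for illustration; none of them is a measurement from the PerlMonks servers.
# Usage: simulate_httpd ARRIVALS_PER_TICK TICKS  -> prints the final process count
simulate_httpd() {
    awk -v arrivals="$1" -v ticks="$2" 'BEGIN {
        ram = 1000      # hypothetical MB of RAM available to httpd
        per_proc = 40   # hypothetical MB each httpd wants resident
        n = 0           # httpd processes currently alive
        for (t = 0; t < ticks; t++) {
            rate = 0.5  # fraction of processes that finish per tick when all fit in RAM
            if (n * per_proc > ram) {
                # Overcommitted: paging slows every process, so each one
                # finishes less often the more processes there are.
                fit = ram / (n * per_proc)
                rate = 0.5 * fit * fit
            }
            n = n + arrivals - n * rate
        }
        printf "%d\n", n
    }'
}

simulate_httpd 10 60   # settles below the 25 processes that fit in RAM
simulate_httpd 14 60   # keeps growing once it passes that point
```

In this model a modest increase in the arrival rate (10 to 14 per tick) is the difference between a stable process count and one that climbs without bound, which is the kind of tipping point the paragraph above speculates about. It says nothing about what actually triggers the event.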

but 'top' isn't particularly flexible but is still the best tool I've found available on this system so far

My immediate thought here is to look for process-related stuff on the CPAN, looking for something that doesn't just shell out to ps, preferably something Unix-oriented and written in pure Perl. I don't have much experience working with process tables, though, beyond what can be done with ps and top. update: My second thought is that I'm sure you're already aware some versions of top can show considerably more columns than they do show by default. The version I have here (on FreeBSD) is quite impoverished, but ISTR that the version of top that I used on Mandrake 9 had rather a lot of optional columns and a loose marble rolling around in the back of my head suggests it _may_ (it's been several months...) have had an option for showing the parent process. I mention this only on the off chance that you haven't already checked for it. Hit ? in top to see a list.


Sanity? Oh, yeah, I've got all kinds of sanity. In fact, I've developed whole new kinds of sanity. Why, I've got so much sanity it's driving me crazy.
Re^2: 'A' web server takes another "time out" (root)
tye
created: 2006-05-04 13:24:23

I'd be a bit disappointed if a mature system like FreeBSD contained this feedback loop in resource allocation. No system is perfect, but I'd come to expect better behavior when memory becomes scarce than such a feedback loop that makes the problem keep getting worse while trying to let each part continue to fight to do its thing such that nothing at all can get done and it takes so long before the system finally gives up and reboots (or is it that the system never gives up and pair.com notices the lock-up and eventually cycles power?). I recall much older systems noticing a problem and selecting processes to be completely "swapped out" (different from "paging", a more accurate term for what is often mislabeled "swapping") such that they stop fighting and other luckier processes get a chance to finish such that the resource exhaustion might pass or at least the system is capable of getting something done such that someone can "get in" in order to clean up "by hand". Note that when this happens to the 'A' web server, there is no hope of logging in to the system.

But perhaps this is just a case of bad tuning such that Apache fights too hard and it takes a while for FreeBSD to overcome it... Perhaps that is why many processes go to "0K" resident memory usage, though I'd expect a state much different than "RUN" to be reported for a swapped-out process. This led me to notice again the angle brackets such as on "<httpd>", and searching "man top" for what those mean, I find "COMMAND is the name of the command that the process is currently running (if the process is swapped out, this column is marked '<swapped>')", which isn't completely clear but somewhat supports that interpretation.

Since I don't have root access, I don't think trying to roll my own replacement for 'top' or 'sar' will be possible. At least, my assumption was that I'd not have access to what 'top' and 'ps' use to get all of that information about other processes. Indeed, I don't have any access to /proc (symlink to /root/proc and I have no access to even /root). But I see that neither 'top' nor 'ps' are set-UID nor set-GID so I'm not sure how the security is arranged. 'man ps' mentions needing procfs mounted (and referencing /proc and /dev/kmem). So would a self-built 'top' on an unprivileged FreeBSD account be useful? If not, I think just adding "ps" output to the existing "top" output would be one of the next steps.

- tye        

Re^3: 'A' web server takes another "time out" (root)
created: 2006-05-04 14:58:41
I don't have any access to /proc [...] 'man ps' mentions needing procfs mounted [...] I think just adding "ps" output to the existing "top" output would be one of the next steps.

If top took 5.5 minutes to show output between two of the snapshots above, I think adding ps won't improve the situation, because the ps data won't be correlated at all with top's. My bet would be to play with ps's o argument, which allows you to get the information top gives and more. Setting PERSONALITY to "bsd" on this Linux machine allows me to run ps as if I were on FreeBSD. I hope...

$ PERSONALITY=bsd ps faxo pid,euid,egid,ni:2,vsz:6,rss:6,pcpu,pmem,stat:3=ST,tname:6,stime,bsdtime,args
  PID  EUID  EGID NI    VSZ    RSS %CPU %MEM ST  TTY    STIME   TIME COMMAND
    1     0     0  0   1924    652  0.0  0.0 S   ?      19:24   0:00 init [2]  
    2     0     0 19      0      0  0.0  0.0 SN  ?      19:24   0:00 [ksoftirqd/0]
    3     0     0 -5      0      0  0.0  0.0 S<  ?      19:24   0:00 [events/0]
[...]
 1368   111   111  0  26580    912  0.0  0.0 Ssl ?      19:26   0:00 /usr/sbin/ippl -c /var/run/ippl/ippl.conf
 1423     0     0  0   4800   1608  0.0  0.1 Ss  ?      19:26   0:00 /usr/lib/postfix/master
 1428   101   104  0   4812   1604  0.0  0.1 S   ?      19:26   0:00  \_ pickup -l -t fifo -u -c

You can s/args$/comm/ in order not to show parameters of commands:

 1368   111   111  0  26580    912  0.0  0.0 Ssl ?      19:26   0:00 ippl
 1423     0     0  0   4800   1608  0.0  0.1 Ss  ?      19:26   0:00 master
 1428   101   104  0   4812   1604  0.0  0.1 S   ?      19:26   0:00  \_ pickup

HTH.

--
David Serrano

Re^4: 'A' web server takes another "time out" (root)
tye
created: 2006-05-04 15:47:26

Heh, but that doesn't show me the one thing I'm interested in, the parent PID. The 'top' and 'ps' output don't have to be in sync; I just need a snapshot of 'ps' output at some point during the "bad time" in order to see who owns the newest 'httpd' processes.

FYI, your hoping wasn't enough (:

ps: euid: keyword not found
ps: egid: keyword not found
ps: ni:2: keyword not found
ps: vsz:6: keyword not found
ps: rss:6: keyword not found
ps: stat:3: keyword not found
ps: tname:6: keyword not found
ps: stime: keyword not found
ps: bsdtime: keyword not found
ps: args: keyword not found
  PID %CPU %MEM
    0  0.0  0.0
    1  0.0  0.0
    2  0.0  0.0
...

- [tye]        

Re^5: 'A' web server takes another "time out" (root)
created: 2006-05-04 15:56:14
ps -axo pid,ppid,command

    --k.


Re^6: 'A' web server takes another "time out" (root)
created: 2006-05-04 16:17:12
Kanji,
Some of the other information suggested is probably useful to tye. In most cases, the specification of width doesn't work as pointed out below. In a few others, the keywords were invalid for FreeBSD's ps. The interesting thing is that 'args' is listed as a valid key word in TFM but still complained even though no width was specified.

Cheers - L~R

Re^5: 'A' web server takes another "time out" (root)
created: 2006-05-04 16:15:34
ps: euid: keyword not found
ps: egid: keyword not found
ps: ni:2: keyword not found

Great :^(. It seems that that ps doesn't support field widths. The field for the PPID is surprisingly ;^) called "ppid". After searching for the manpage on Google, I'd try something like ps -j, ps -l and ps -a -x -o pid,ppid (this last one is just for testing if ppid works).

--
David Serrano

Re^5: 'A' web server takes another "time out" (root)
created: 2006-05-04 21:30:56
Well, "ppid" is one of the values to be specified for option "-o", "ps(1)". Try something like (tested on one of the Pair shared hosts running FreeBSD 4.8-STABLE) ...
ps -wwax -o ppid,pid,pgid,rss,vsz,nice,%mem,%cpu,rgid,ruser,user,stat,command \
| sort -k1,1n -k2,2n
... there are other options listed related to paging & swapping, and (real & saved) user & group id. If you specify the "-c" option along with "-o command", only command name will show up (w/o the arguments).
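
Once a snapshot with a "ppid" column has been captured, the question of which process owns the newest httpd children can be answered from the saved file. A minimal sketch, assuming the snapshot has PPID, PID, and COMMAND columns in that order (as from `ps -ax -o ppid,pid,command`) with a header line first; the function name and the snapshot file name are made up for illustration:

```shell
# Count httpd processes per parent PID from saved ps output on stdin.
# Assumes columns PPID, PID, COMMAND with one header line (an assumption
# about how the snapshot was taken, not something the thread specifies).
count_httpd_by_parent() {
    awk 'NR > 1 && $3 ~ /httpd/ { kids[$1]++ }
         END { for (p in kids) print kids[p], p }' |
    sort -k1,1nr -k2,2n
}

# Hypothetical usage:
#   ps -ax -o ppid,pid,command > ps.snapshot
#   count_httpd_by_parent < ps.snapshot
```

Each output line is "count parent-pid", busiest parent first. If nearly all httpd processes share one parent, that is the main Apache process doing the spawning; a large count under some other PID would point at a child acting as a fork bomb.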
Re: 'A' web server takes another "time out" (root)
created: 2006-05-04 22:19:09
I'd be a bit disappointed if a mature system like FreeBSD contained this feedback loop in resource allocation

Oh, is the perlmonks server running FreeBSD? I didn't realize. In that case, top doesn't appear to show parent process IDs, unless I'm missing something. There are things I like about FreeBSD, but its version of top is not one of them. The ps that comes with FreeBSD is rather better, but in a scenario where you can't start a new process, top could be already running, and I don't know of a way to make ps do that (i.e., be already running and report output periodically).

I recall much older systems noticing a problem and selecting processes to be completely "swapped out"

I've observed on my desktop that FreeBSD will kill a process if it consumes too much RAM (in situations where Linux wouldn't, although Linux since circa 2.2 will also do this if the entire system is low on RAM, which is better than the Linux 2.0 behavior; but FreeBSD will kill a process for this even when there's unused swap space, if it surpasses some per-process memory usage quota). However, one process using lots of RAM is a very different scenario from many processes being spawned. I don't know what FreeBSD does with that. I could test that here with a forkbomb, I suppose...

Indeed, I don't have any access to /proc

That could make it hard to get a good look at the process tree.

So would a self-built 'top' on an unprivileged FreeBSD account be useful?

I don't know. It also seems like there _ought_ to be a tool designed to prepare a process ahead of time (preload it into RAM, go ahead and ask the operating system for a process table entry, and so forth) to be launched quickly, which might allow you to set up ps to run and then, when the problem is noticed, trigger it to go ahead. I do not, however, actually know of such a utility.

I feel your pain. Having to work around the lack of root access to accomplish things that would be much easier _with_ root access is certainly something that can be annoying. (I can also understand why the hosting company doesn't want to hand out root access, of course, but that doesn't make your situation any less frustrating.)


Sanity? Oh, yeah, I've got all kinds of sanity. In fact, I've developed whole new kinds of sanity. Why, I've got so much sanity it's driving me crazy.
Re: 'A' web server takes another "time out"
created: 2006-05-03 17:50:08

Is it simply possible that someone is running some sort of scheduled DoS attack? I know it's a stupid, obvious question, but it's the first thing that comes to mind and maybe no one asked simply because it was so obvious. I do wonder why MaxClients isn't set low enough to stop this from happening though.


___________
Eric Hodges
Re: 'A' web server takes another "time out"
created: 2006-05-03 19:55:20
If you're capturing regular sar data (with the sa1/sa2 scripts), this could provide a lot of useful information beyond what top provides. (You can profile the performance of a system quite extensively with good sar output.) It would be helpful if you could make the sar data from the last week or so available for download, if it is available.

The files are usually located in /var/adm/sa and should be readable from userland. I've found that FreeBSD or Linux boxes don't usually have sar enabled, (unlike a lot of commercial *NIXes) but it's worth a shot. Just tar 'em up and put them somewhere for download.

If the data is indeed available but you don't feel comfortable sharing it, there are some utilities available to analyse the data directly, such as:

Sadly it requires a commercial license and I can't think of any cost-free alternatives. Readers please chime in if you know of any similar analysis utilities.

Hoping to help,

m.att

Re^2: 'A' web server takes another "time out" (sar)
tye
created: 2006-05-04 02:33:10

Yes, 'sar' was what I first reached for, realizing that it is far better to compactly collect all of the performance data so that after the fact you can view slices of it this way and that to try to figure out what the matter is...

 $ sar
ksh: sar: not found
 $ ls -l /var/adm
ls: adm: Permission denied
 $ ls -ld /var/adm
drwxr-x---  3 root  wheel  512 Jan 22  2001 /var/adm/
 $

And I'm certainly not 'root' nor in 'wheel'. (:

- [tye]        

Re^3: 'A' web server takes another "time out" (sar)
created: 2006-05-04 14:24:25
Well, that's a bust.. too bad.

How about capturing some regular snapshots with vmstat? Maybe

vmstat 60

and a

vmstat -d 60

piped to a file for a few days (or at least a good bit of time before, during and after the event in question). (These commands may require different syntax if you're on FreeBSD, which I can't test with -- we're basically looking for VM stuff and IO/disk stuff... also see iostat) Maybe also throw in a few vmstat -s's for good measure. This would at least provide a little bit more detail around swap in/out and IO.
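
A small sketch of how such a capture could be timestamped so the samples can be matched against the access_log later. The helper name stamp_lines and the log path are hypothetical, and the vmstat invocation is the one suggested above, untested on FreeBSD:

```shell
# Prefix each line read from stdin with the current time, so samples can be
# lined up with the access_log and the 'top' snapshots afterwards.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Hypothetical usage (the log path is only an example):
#   vmstat 60 | stamp_lines >> "$HOME/vmstat.log"
```

Note that this starts one `date` process per sample, which may itself be slow to launch while the machine is paging heavily; the timestamp then records when the line was finally processed, not when vmstat produced it.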

m.att

Re^4: 'A' web server takes another "time out" (sar)
tye
created: 2006-05-04 14:46:16

But I think that'd mostly be less information than I'm currently getting from 'top'. In any case, it doesn't give me the main system data I'm now wanting: which process(es) is creating the flood of new 'httpd's. I tell you, Win32's PerfMon sure looks great in comparison to this hacking....

- tye        

Re: 'A' web server takes another "time out"
created: 2006-05-03 23:10:42

You might have a look at RLimitNPROC, as well. According to the Apache documentation, "Limits the number of processes that can be launched by processes launched by Apache children." It would be nice if you could get your hands on the httpd.conf file...

Re: 'A' web server takes another "time out"
created: 2006-05-04 07:56:56

I guess you (or Pair in case you don't have access to access_log) should do some statistics to see if there's a significant hits/second ratio difference when everything is OK and when the forking occurs.

Another point to check is whether the kernel version you are using has bugs concerning swap allocation.

Also, while this wouldn't help to determine the exact nature of the problem, maybe it can help to avoid DoS - mod_evasive

Of course, Pair should agree to install/use it ;)

Dodge This!
Re: 'A' web server takes another "time out"
created: 2006-05-04 11:50:04
The system load is very high but the CPU is 40% idle; I've often seen this in I/O-bound situations. Is it possible that the system disk (especially the swap or database disk) is abnormally slow? Perhaps DMA isn't working properly?
Re^2: 'A' web server takes another "time out" (disk)
tye
created: 2006-05-04 12:53:29

One of my prior working theories was a disk "going bad" (having much experience with the fact that the manufacturers of commonly-used disk drives, drivers, and controllers only took away half of the point of "fault tolerance"1 resulting in drives "going bad" extremely silently, the only "evidence" being a particular pattern of slow-down).

But that was when I didn't see good evidence of lots of swapping going on. Of course, "lots" is a relative term so, anyone, please feel free to make some calculations of disk speed based on the amount of swapping reported above and let us know if, in order to explain the CPU idleness, we'd need to have an unusually slow disk in the mix as well.

There is no database disk on this system.

- tye        

1 The major point of the fault tolerance movement was to prevent things from suddenly failing. The point was that you could spend more and more resources making things more and more reliable, probably reducing how frequently something just "falls down" but you'd still end up having things suddenly fail, likely at a very inconvenient time, and you'd have to spend a lot of down time running around in a panic trying to replace / repair what failed. A "better way" was seen: Don't have single points of failure so that when something fails, things can continue on and you can schedule to replace the failed part at a convenient time, perhaps without even requiring down time. And the key to this working is that someone must be notified that a failure happened! Unfortunately, so many common modern systems include features that are tolerant of faults but provide no means of notification and often even prevent you from ever being able to tell, no matter how hard you look, that a fault happened. Hard disks are a great example of this, in my experience.

It used to be that a hard disk going bad would start recording faults in your syslog and the frequency of these reports would rise, very slowly at first but following a geometric curve, and you'd replace the disk before it catastrophically died. Now most disks start to fail by slowing down from time-to-time, more and more dramatically, eventually nearly locking up while the disk retries reading the sector that is going bad but eventually fails, then the driver/controller retries which causes the disk to do a whole nother round of retries, then the operating system multiplies the number of retries yet again with its own retries... and eventually we just get lucky and the CRC "passes" and no hard evidence that anything at all went wrong remains.

I'd point you to a google search for the "S.M.A.R.T." acronym but google no longer treats searching for "s m a r t" differently from searching for "s-m-a-r-t" and so you'd just get a huge list of pages containing the word "smart". That system lets you query some internal counters kept nearly hidden inside the disk drive that likely include a count of at least some types of retries. It is the only way I've been able to find any real evidence (usually still quite vague) that a disk is starting to fail. But note that most S.M.A.R.T. tools try to be "smart" and just figure out for you whether or not the disk is about to suddenly fail (making nearly the identical mistake mentioned above) and thus usually don't tell you a single thing until the disk is within minutes of failing (usually while you aren't using the computer, and often only after the failure has already become catastrophic). So you have to jump through hoops to look at the raw S.M.A.R.T. data and make guesses at what some of the values mean... Which has a lot to do with why you've probably not heard of S.M.A.R.T. before (or only heard bad things about it).

And then there is the other extreme: parity checking of memory. When your memory is working just fine 99.999% of the time but a single bit error is noticed and reported to you by virtue of the fact that your entire computer system has suddenly become a frozen brick displaying the notification on the console. Being blissfully unaware of the rare single-bit error starts to look good when compared to having all of the in-progress work, most (probably all) of which would be unaffected by that one bit, being sent to evaporate for the sake of providing notification of a fault...

Yes, I understand that the plumbing of notifications is hard and that is why this plumbing of notifications is so often not done or is done so badly.

Re: 'A' web server takes another "time out"
created: 2006-05-20 13:48:17

I believe the two webservers are 209.197.123.153 and 66.39.54.27, right? But which one of them is called 'A'? Or is this some other distinction?

Re: 'A' web server takes another "time out"
created: 2006-05-24 12:45:48

Could you please post the output from: uname -a

There is a lot here that is not stock FreeBSD behavior. Looking at the code for top I cannot see where it would insert brackets around the process name. Top simply reports what it finds in the process record; so I can only assume Apache puts the brackets there.

In a similar vein, man top asserts that swapped processes are marked as <swapped>, but I don't know how it can assert this, as this is OS-dependent behavior.

The FreeBSD virtual memory manager usually operates very transparently. I wrote a perl program to hog all the memory then ran several instances. I saw none of the behavior exhibited in your listing. The swapping was transparent. The individual process lines showed normal running with full memory allocation. Only the swap statistics showed the swap being used. When I ran out of swap the process was killed. I am going to play with this more. It seems that Apache does its own VM. I can't confirm this.


s//----->\t/;$~="JAPH";s//\r<$~~/;{s|~$~-|-~$~|||s |-$~~|$~~-|||s,<$~~,<~$~,,s,~$~>,$~~>,, $|=1,select$,,$,,$,,1e-1;print;redo}
Re^2: 'A' web server takes another "time out" (virtual vs real)
tye
created: 2006-05-25 03:24:00
Could you please post the output from: uname -a
FreeBSD $FQDN 4.8-STABLE FreeBSD 4.8-STABLE #0: Fri Apr 15 13:34:52 EDT 2005
     $USER@$HOST:/usr/src/sys/compile/PAIRqsv  i386

with 3 items replaced by Perl scalars for privacy reasons.

Looking at the code for top I cannot see where it would insert brackets around the process name. Top simply reports what it finds in the process record; so I can only assume Apache puts the brackets there.

Thanks for diving into the code. But I think your assumption above is less likely than my stated guess, and I think you even provide more evidence:

In a similar vein, man top asserts that swapped processes are marked as <swapped>, but I don't know how it can assert this, as this is OS-dependent behavior.

So it could certainly be the case that the OS, instead of replacing the program name with the literal string "<swapped>", puts angle brackets around the program name. This makes even more sense as a literal "<swapped>" would leave you wondering what the heck got swapped out on you.

The FreeBSD virtual memory manager usually operates very transparently. I wrote a perl program to hog all the memory then ran several instances. I saw none of the behavior exhibited in your listing. The swapping was transparent. The individual process lines showed normal running with full memory allocation. Only the swap statistics showed the swap being used. When I ran out of swap the process was killed.

You were only hogging swap space. That causes much different problems than hogging real memory. Note that "the process was killed" has a body buried in it, as one can't blame a specific process for exhausting the swap space and so, on a good operating system, heuristics are involved (on a bad operating system, the process unlucky enough to be the first to try to grab more space after none is available gets killed -- early Ultrix comes to mind).

In order to hog real memory, you have to keep using the pages of memory that you've allocated. See [id://55609]; the one labeled "Memory" just tests allocating lots of virtual memory, that is, tests using a lot of swap space. The one labeled "Swap" will cause a lot of swapping (more accurately, "paging" though swapping out would likely eventually happen if you ran enough of them) because it tries to use lots of real memory. (So, yes, the labels are backward, depending on how you look at it.)
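
The distinction can be sketched as follows. This is not the code from the node linked above, just a minimal illustration under the assumption (typical, but OS-dependent) that an anonymous mapping costs only virtual address space until its pages are written. The names `reserve`, `touch`, and `hog` are hypothetical.

```python
import mmap

PAGE = mmap.PAGESIZE

def reserve(n_pages):
    """Map anonymous memory. On most systems this only reserves virtual
    space; physical pages are not committed until they are touched."""
    return mmap.mmap(-1, n_pages * PAGE)

def touch(buf):
    """Write one byte in every page so each page needs real backing
    (RAM, or swap once RAM runs short). Returns the pages touched."""
    touched = 0
    for offset in range(0, len(buf), PAGE):
        buf[offset] = 1
        touched += 1
    return touched

def hog(buf, rounds):
    """Keep re-touching every page; repeated use is what competes
    with other processes for real memory."""
    total = 0
    for _ in range(rounds):
        total += touch(buf)
    return total

if __name__ == "__main__":
    buf = reserve(1024)        # virtual allocation only
    print(hog(buf, rounds=3))  # pages written across all rounds
    buf.close()
```

Several copies that only call `reserve` with a large size mostly consume swap reservation; copies that loop in `hog` over more pages than fit in RAM are what would be expected to force paging.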

It seems that Apache does its own VM. I can't confirm this.

I won't say that I know for sure that Apache does not, but I'd bet real money on it.

- [tye]        

perlmonks.org content © perlmonks.org and ambrus, eric256, Hue-Bond, jonadab, Kanji, Limbic~Region, m.att, mda2, parv, samtregar, spiritway, starbolin, tye, Ultra, wazoox
