During this week we were hit by several problems that affected the performance of the Tier-0, leading to a high failure rate in the system.

Paused Jobs in the Tier0 this week

Authentication Error with EOS
  • For several weeks some jobs have been failing to authenticate with EOS. We opened a ticket and started a discussion in HN. The problem continued until last Wednesday, when a workaround was proposed: increasing the TTL for the manager connection from 20 minutes to 24 hours (HN).
  • The workaround was tested and then applied to the Tier0 configuration (Elog). This fixed the problem for new workflows but not for existing ones, so their sandboxes had to be patched.
  • Since then we have not seen the problem again.

Jobs failing with 137 error code
  • These jobs are killed by the OOM killer (as seen in HN). They receive signal 9 (SIGKILL), which is reported as exit code 137 (128 + 9). It looks like an infrastructure problem. After being restarted, the jobs eventually succeeded.
  • The problem stopped appearing once the authentication error was solved.
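
Exit code 137 follows the common shell convention of 128 plus the number of the signal that terminated the process. The sketch below shows how such a code can be decoded. The function name decode_exit_code is hypothetical and is not part of the Tier0 software; it assumes a POSIX system.

```python
import signal


def decode_exit_code(code):
    """Decode an exit code that follows the shell's 128+N convention.

    Returns (signal_number, signal_name) when the code is above 128,
    otherwise None. The name is None if the platform does not know
    the signal number.
    """
    if code <= 128:
        return None
    signum = code - 128
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = None
    return signum, name


if __name__ == "__main__":
    # 137 - 128 = 9, i.e. the job was terminated with SIGKILL.
    print(decode_exit_code(137))
```

A SIGKILL on its own does not prove that the OOM killer was responsible; that conclusion comes from the HN discussion, not from the exit code.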

Frontier T0 Proxies problem
  • We found many paused jobs showing a Frontier problem (Elog):
Can not get data (Additional Information: [frontier.c:1111]: No more proxies. Last error was: Request 1319 on
chan 2 failed at Sat Jun 13 22:24:06 2015: -9 [fn-socket.c:111]: connect to 128.142.201.184 timed out after 5
seconds) ( CORAL : "coral::FrontierAccess::Statement::execute" from "CORAL/RelationalPlugins/frontier" )
  • This particular problem was caused by the CMS T0 proxy servers having difficulty responding (HN).
  • It was a temporary problem: after we resumed the affected jobs, it did not reappear.

Memory allocation problem
  • We then noticed another error related to Frontier:
A std::exception was thrown.
Can not get data (Additional Information: [frontier.c:997]: Request 1254 on chan 2 failed at Thu Jun 18 01:49:51 2015: -2 [payload.c:145]: 
cannot allocate memory of size 133461220) ( CORAL : "coral::FrontierAccess::Statement::execute" from "CORAL/RelationalPlugins/frontier" )
  • We opened a discussion on HN about this. The Frontier team explained that it was not related to the Frontier server and could be a client-side problem.
  • The problem is no longer occurring, although in principle nothing changed except the fix for the authentication error. The main theory is that it was a direct consequence of filling the VMs with jobs that eventually failed.
  • Memory cleanup after a job fails may be leaving leftovers that affect other jobs (Elog).
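
The two Frontier messages quoted in this report look similar but point to different causes (a proxy timeout versus a client-side allocation failure). As a rough sketch, they could be told apart with a small parser like the one below. The function name, the regular expressions, and the category labels are illustrative assumptions based only on the two messages above; they are not part of the Frontier client or of the Tier0 tooling.

```python
import re

# Patterns derived from the two error messages quoted in this report.
ALLOC_RE = re.compile(r"cannot allocate memory of size (\d+)")
TIMEOUT_RE = re.compile(r"connect to (\S+) timed out after (\d+) seconds")


def classify_frontier_error(message):
    """Classify a Frontier error message into a hypothetical category."""
    # Collapse line wrapping so patterns match across line breaks.
    text = " ".join(message.split())

    match = ALLOC_RE.search(text)
    if match:
        size = int(match.group(1))
        return {
            "kind": "client_alloc_failure",
            "bytes": size,
            "mib": round(size / 2**20, 1),
        }

    match = TIMEOUT_RE.search(text)
    if match:
        return {
            "kind": "proxy_timeout",
            "host": match.group(1),
            "timeout_s": int(match.group(2)),
        }

    return {"kind": "unknown"}


if __name__ == "__main__":
    alloc_msg = (
        "[frontier.c:997]: Request 1254 on chan 2 failed at "
        "Thu Jun 18 01:49:51 2015: -2 [payload.c:145]:\n"
        "cannot allocate memory of size 133461220"
    )
    print(classify_frontier_error(alloc_msg))
```

For the allocation message this reports a failed request of 133461220 bytes, roughly 127 MiB, which is consistent with the Frontier team's view that the failure happened on the client side.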


Topic revision: r2 - 2015-06-19 - LuisContreras
 