Incremental stripping campaign in progress and MC productions ongoing
T0:
T1:
GridKa: Ticket (GGUS:95135) opened for a slow staging rate. It is the only site lagging behind in the restripping, staging at about 5 TB/day since last weekend; before that we had a good staging rate of ~250 MB/s for a couple of days (see the rate-conversion sketch after this list).
SARA: No more jobs failing with "segmentation violation". (GGUS:95056)
RAL: CVMFS problem (Stratum 1 is down at the moment); site contacts are aware and working on a solution. No GGUS ticket submitted yet. MC jobs affected (see the availability-check sketch after this list).
IN2P3-CPPM: recurring problem of aborted pilots (GGUS:94890); the blparser gets stuck and is restarted almost every day.
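As a sanity check on the two GridKa rates quoted above, the short sketch below converts between TB/day and MB/s (assuming decimal units, 1 TB = 10^12 bytes and 1 MB = 10^6 bytes); it only illustrates the arithmetic and is not part of any monitoring tool.

```python
# Rough unit conversion to compare the two staging rates quoted above
# (decimal units assumed: 1 TB = 1e12 bytes, 1 MB = 1e6 bytes).

SECONDS_PER_DAY = 86_400

def tb_per_day_to_mb_per_s(tb_per_day: float) -> float:
    """Convert a TB/day rate to MB/s."""
    return tb_per_day * 1e6 / SECONDS_PER_DAY

def mb_per_s_to_tb_per_day(mb_per_s: float) -> float:
    """Convert an MB/s rate to TB/day."""
    return mb_per_s * SECONDS_PER_DAY / 1e6

print(f"5 TB/day ~= {tb_per_day_to_mb_per_s(5):.0f} MB/s")     # ~58 MB/s
print(f"250 MB/s ~= {mb_per_s_to_tb_per_day(250):.1f} TB/day")  # ~21.6 TB/day
```

At ~58 MB/s, the current GridKa staging rate is roughly a quarter of the ~250 MB/s seen earlier in the week.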
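For the RAL CVMFS item, a simple external check of a Stratum 1 is to fetch the repository manifest (.cvmfspublished) that every CVMFS replica serves over plain HTTP. The sketch below is illustrative only: the host name for the RAL Stratum 1 and the lhcb.cern.ch repository are assumptions, not taken from the report.

```python
# Minimal availability-check sketch for a CVMFS Stratum 1: every replica serves
# the repository manifest at /cvmfs/<repo>/.cvmfspublished over HTTP.
# The host below is an assumed name for the RAL Stratum 1 (not from the report).
import urllib.request

STRATUM1 = "http://cernvmfs.gridpp.rl.ac.uk"
REPO = "lhcb.cern.ch"
url = f"{STRATUM1}/cvmfs/{REPO}/.cvmfspublished"

try:
    with urllib.request.urlopen(url, timeout=10) as resp:
        print(f"{url} -> HTTP {resp.status}, manifest {len(resp.read())} bytes")
except OSError as exc:
    print(f"{url} unreachable: {exc}")
```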
24th June 2013 (Monday)
Incremental stripping campaign in progress and MC productions ongoing
T0:
T1:
GridKa: The server for the major staging pool is out of operation; we are still staging at a reasonable rate, but 30% of jobs fail to access data from that pool (Input Data Resolution).
SARA: Jobs fail with "segmentation violation" (GGUS:95056); all failed jobs were at worker node v33-39.gina.sara.nl.
CERN: SetupProject.sh timeouts around midnight; seems to have recovered now.
CNAF: Pilots failing at CNAF (GGUS:95059), resolved quickly. (Service that sometimes gets stuck on some WNs)
20th June 2013 (Thursday)
Incremental stripping campaign in progress and MC productions ongoing
T0:
T1:
GridKa: Pilots going through the GridKa brokers fail due to the sandbox being lost (GGUS:94462)
GridKa: Jobs at GridKa failing due to local directory problems (GGUS:94471). This affects many jobs at GridKa, and merging jobs with large numbers of input files have a very low success rate.
RAL: Problem starting jobs; seems to have gone away for now (since this morning). Internal tickets opened.
Other: Web Server overloading significantly alleviated with a 3rd web frontend (GGUS:94824). Affects jobs at many sites, some more than others.
17th June 2013 (Monday)
Incremental stripping campaign in progress and MC productions ongoing
T0:
T1:
GridKa: IDLE pilots preventing new jobs from starting (GGUS:94898). Ticket opened yesterday and escalated to ALARM this morning. Solved now - possibly due to a black-hole node.
GridKa: Pilots going through the GridKa brokers fail due to the sandbox being lost (GGUS:94462)
GridKa: Jobs at GridKa failing due to local directory problems (GGUS:94471). This affects many jobs at GridKa, and merging jobs with large numbers of input files have a very low success rate.
*Other
Web Server overloading continues (GGUS:94824) - the problem may now be due to the underlying AFS server. Under investigation; affects jobs at many sites, some more than others.
14th June 2013 (Thursday)
Incremental stripping campaign in progress and MC productions ongoing
T0:
T1:
*IN2P3
Alarm ticket (GGUS:94810) "SE return wrong tURL". The problem was fixed very quickly.
*Other
Web Server overloaded (GGUS:94824), under investigation.
10th June 2013 (Monday)
Incremental stripping campaign in progress and MC productions ongoing
T0:
T1:
*GRIDKA
Problem with staging during the weekend; solved.
6th June 2013 (Thursday)
Incremental stripping campaign in progress and MC productions ongoing
T0:
T1:
3rd June 2013 (Monday)
Incremental stripping campaign in progress and MC productions ongoing
T0:
T1/T2: many sites publish queues with a MaxCPUTime of "999,999,999" (see the query sketch below).
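One way to list the affected queues is to ask a top-level BDII which CEs publish the 999,999,999 placeholder as GlueCEPolicyMaxCPUTime. The sketch below is illustrative only: it assumes the python-ldap package is available and uses lcg-bdii.cern.ch as an example BDII endpoint.

```python
# Minimal sketch (not an official tool): query a top-level BDII for CE queues
# that publish the 999999999 placeholder as GlueCEPolicyMaxCPUTime.
# Assumes the python-ldap package and network access to a BDII
# (lcg-bdii.cern.ch is used here as an example endpoint).
import ldap

BDII_URI = "ldap://lcg-bdii.cern.ch:2170"
BASE_DN = "o=grid"
FILTER = "(&(objectClass=GlueCE)(GlueCEPolicyMaxCPUTime=999999999))"

conn = ldap.initialize(BDII_URI)
results = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE, FILTER,
                        ["GlueCEUniqueID", "GlueCEPolicyMaxCPUTime"])

for dn, attrs in results:
    ce = attrs.get("GlueCEUniqueID", [b"?"])[0].decode()
    max_cpu = attrs.get("GlueCEPolicyMaxCPUTime", [b"?"])[0].decode()
    print(f"{ce}: MaxCPUTime = {max_cpu}")
```

In the Glue 1.3 schema GlueCEPolicyMaxCPUTime is expressed in minutes, so 999,999,999 is clearly a placeholder rather than a real limit.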