Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
T0:
T1: SARA : downtime extended for CPU. IN2P3 : thanks to IN2P3 for their additional 200 TB of tape.
26th February 2013 (Tuesday)
Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
T0:
T1: IN2P3 : (GGUS:91760) : authentication problem with one certificate used for Production: fixed (tomcat restarted). SARA : downtime
DashBoard : In the "Site Groups" drop down box, RHUL does not appear if you select "All sites". However if you pick "Tier 0/1/2", then you do see LCG.UKI-LT2-RHUL.uk.
Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
T0:
T1: IN2P3 : (GGUS:91760) : authentication problem with one certificate used for Production. SARA : downtime
DashBoard : In the "Site Groups" drop down box, RHUL does not appear if you select "All sites". However if you pick "Tier 0/1/2", then you do see LCG.UKI-LT2-RHUL.uk.
Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
T0:
ALARM ticket (GGUS:91690) for the AFS-hosted web service which was not responding: understood and fixed
T1: NTR
21st February 2013 (Thursday)
Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
T0:
ALARM ticket (GGUS:91690) for the AFS-hosted web service which is not responding. It serves grid jobs for configuration and setup purposes
Many failures in CASTOR->EOS migration because of different checksums in LFC and CASTOR
T1: NTR
20th February 2013 (Wednesday)
Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
T0: NTR
Migration CASTOR -> EOS progressing, estimated to last for another 6 weeks
T1:
IN2P3 : NAGIOS problem still ongoing (GGUS:91126). Log files of the failed SAM probes seem to indicate that the probe is killed by the batch system (logs uploaded to the GGUS ticket)
19th February 2013 (Tuesday)
Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
RAL: Job timeouts trying to set up environment on the worker node (internal ticket). Continuing problems with batch system (GGUS:91251).
GridKa : Continuing issue with srm / SE / network (GGUS:91474). Jobs failing to resolve input data multiple times at GridKa. Jobs at JINR waiting for a long time for data from GridKa, before being killed by the batch system there. One strange DNS problem fixed yesterday.
13th February 2013 (Wednesday)
Ongoing activity as before: reprocessing, prompt-processing, MC and user jobs.
RAL: Problems with batch system came back (GGUS:91251)
GridKa : Possible issue with srm / SE / network (GGUS:91474). Jobs failing to resolve input data multiple times at GridKa. Jobs at JINR waiting for a long time for data from GridKa, before being killed by the batch system there.
12th February 2013 (Tuesday)
Ongoing activity as before: reprocessing, prompt-processing, MC and user jobs.
T0: NTR
T1:
IN2P3: NAGIOS problem still ongoing (GGUS:91126). No idea who to follow up with.
RAL: Problems with batch system seem to be resolved (GGUS:91251)
11th February 2013 (Monday)
Ongoing activity as before: reprocessing, prompt-processing, MC and user jobs.
T0:
NTR
T1:
IN2P3: NAGIOS problem still being investigated (GGUS:91126). Also, low level problem with access to data (input data resolution) - under investigation by IN2P3 contact.
RAL: Continuing problems with batch system (GGUS:91251)
FZK : Problem with FTS transfers solved over the weekend (GGUS:91315).
8th February 2013 (Friday)
Ongoing activity as before: reprocessing, prompt-processing, MC and user jobs.
Ongoing activity as before: reprocessing, prompt-processing, MC and user jobs.
T0:
NAGIOS test for swdir at IN2P3 still not running as frequently as others (once in the last 24 hours) (GGUS:91126)
T1:
RAL: FTS transfers now going through without issue. Not sure what solved it, but some tests are still to be run.
RAL: Seen an increase in SetupProject errors. Not a major problem, but is there any possible reason for this (e.g. AFS decommissioning)?
IN2P3: Yesterday and last night a number of 'Bus errors' were reported across many WNs. The problem has gone away now, but we were wondering if there was a possible CVMFS glitch?
4th February 2013 (Monday)
Activity as last week: reprocessing, prompt-processing, MC and user jobs.
T0:
An issue with a NAGIOS test not running in the last few days (GGUS:91126)
T1:
RAL: Some FTS transfers are failing due to a strange timeout during transfer, only on some files. Experts are investigating.
1st February 2013 (Friday)
Nothing new to report, just a few tickets for pilots aborting at Tier2s. New LFC is ok.