Received some data last night (3 out of 7 runs sent to OFFLINE); much more is expected tonight. A false alarm was raised when data had not yet been migrated to tape on CASTOR. No disturbance observed following the intervention on SRM.
GridKA: GGUS:60647. Pilots aborting at GridKA; the problem seems related to 2 (out of 3) CREAMCEs there.
SARA: GGUS:60603. Reported timeouts retrieving tURL information for a set of files on the M-DST space token. It seems to have been a general problem on their SE from 2am to 7am UTC, and they claim the timeout in DIRAC is a bit too aggressive. We have never seen this problem before.
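As an illustration of what a less aggressive client-side timeout for the tURL lookup mentioned above could look like, here is a minimal Python sketch that retries the lookup with a back-off up to a configurable deadline. This is not the actual DIRAC code; resolve_turl is a hypothetical placeholder for the underlying SRM call.

    import time

    def get_turl_with_deadline(resolve_turl, surl, deadline=600, pause=30):
        """Retry resolving a tURL for `surl` until `deadline` seconds elapse."""
        start = time.time()
        last_error = None
        while time.time() - start < deadline:
            try:
                return resolve_turl(surl)   # placeholder for the SRM tURL request
            except Exception as exc:        # SE overloaded or request timed out
                last_error = exc
                time.sleep(pause)           # back off before retrying
        raise RuntimeError("tURL resolution for %s failed after %ss: %s"
                           % (surl, deadline, last_error))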
Many MC production jobs running (15K in the last 24 hours) plus user analysis. Data reprocessing is running to completion, mainly at the NL-T1 site, where 90% of the jobs are failing to resolve input data.
MC production running at full steam (27K jobs run in the last 24 hours); reprocessing of data ongoing reasonably smoothly almost everywhere. Many user jobs are also running in the system, which makes LHCb the most active VO this week in terms of CPU time consumed.
CNAF: LFC replication problem. It has been excluded that this is related to the recent migration of the 3D database. GGUS:60458
pic: many jobs (440) failed suddenly at 1am on Sunday (GGUS:60451). This was due to a fraction of the farm having to be halted because of a cooling problem in the center.
T2 site issues:
Jobs failing at UKI-LT2-UCL-HEP and DORTMUND. Shared area issue at Grif.
RAL: Suffering from jobs failing because they exceed the memory limit (which is a bit lower than what is required with the new data). Requested to increase the memory limit to 3GB (GGUS:60413).
SARA: AT RISK downtime; many users affected when accessing data. The storage was banned yesterday. What is the status of this intervention? (We need to run reconstruction at SARA.)
IN2p3: Shared area issue: 25% of reconstruction jobs failed today because of it (GGUS:59880).
IN2p3: CREAMCE cccreamceli03.in2p3.fr. The information has been corrected, to be confirmed (GGUS:60366).
A lot of MC production ran in the last 24 hours, and there is a lot of new data to be reconstructed and made available to users for ICHEP. Adjusting internal priorities to boost the reconstruction activity.
All T1 site admins supporting LHCb are invited to update their WMSes with patch 3621. It has been running happily in production at CERN for weeks and would cure the CREAMCE problems.
Some MC production is ongoing and a few remaining runs are still to be reco-stripped. Activity is dominated by user jobs, and a new group (lhcb_conf) has been added in DIRAC with the highest priority, to secure resources for jobs that must run with the highest precedence ahead of the forthcoming ICHEP conference.
The cooling problem in the vault reported yesterday also affected some LHCb VOBOXes, some of which host critical services for LHCb (the sandboxing service). Through CDB templates the priority of these nodes has been raised, while other VOBOX priorities have been relaxed.
Keeping, for traceability, the still-ongoing VOMS-VOMRS synchronization issue (GGUS:60017); awaiting the experts to come back.
T1 site issues:
SARA: due to an AT RISK intervention on the SARA storage, some user jobs were failing: they were declared stalled and then killed.
RAL: AT RISK intervention on various services today; the 3D database for LHCb went down for a while. No problem rebooting and rolling back.
T2 site issues:
It seems that we are again hitting the problem of uploading output from UK sites (and not only: Barcelona first and foremost).
Reconstruction productions are progressing despite an internal issue in DIRAC, with many jobs (70K) waiting in the central Task Queue. Some MC productions are running concurrently (~10K jobs running in the system).
The VOMS-VOMRS synchronization problem is still there (GGUS:60017).
Problem with the motherboard of one of the CERN VOBOXes (volhcb12).
T1 site issues:
GridKA: now running at full capacity the reconstruction of the runs associated to the site, according to its CPU share. No problems.
NIKHEF: "no HOME set" issue (GGUS:60211). Restarted the nscd daemons that were stuck on some WNs.
IN2p3: shared area instabilities (GGUS:59880). The problem seems related to a site-internal script that makes WNs temporarily unavailable. Requested to increase the timeouts to 20-30 minutes.
SARA: received a list of files lost following the recent data corruption incident.
pic: in scheduled downtime tomorrow.
RAL: received the early announcement of a major upgrade of their CASTOR instance to 2.1.9. LHCb will provide whatever help is needed to validate it, and by the time of this intervention an instance of the LHCb HC will most likely be in place to contribute to the validation.
Reconstruction production, on a selected number of runs (only T1 sites involved), is running. The only problems are at GridKA (power cut) and IN2P3 (shared area).
T2 sites issues:
Shared area quota not sufficient at INFN-TORINO; site now out of the mask: GGUS:59912
Shared area problem at INFN-LNS; site out of the mask: GGUS:59917
Problems with SARA storage: since there is a possibility of data corruption, some space tokens have been excluded. LHCb can still write to tape and stage from it. We can write to T1D1 (100TB will not be available). User disk is not available.
Reconstruction and stripping of recent (and forthcoming) data have been halted, as the LHCb applications are taking an anomalous amount of time to process each event.
Essentially MC production running (20K jobs finished in the last 24 hours, with a failure rate of ~7%).
Issue with CREAMCE requires the intervention of the developers (all proxies are in fact expired inside CREAM itself) (GGUS:59559).
T1 site issues:
(cf. CNAF-T2) many production jobs are also aborting at the T1.
T2 sites issues:
UK sites issue uploading to CERN and the T1s: people from Glasgow seem to have found a solution (see UK_sites_issue.docx) that boils down to network tuning (an illustrative sketch follows after this list).
CNAF-T2: misconfiguration of the ce02; many jobs are failing because pilots are aborting (GGUS:59621).
Sheffield: many jobs failing there; the reason is not yet clear (it seems to be a shared area issue); site banned.
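For context on what the "network tuning" for the UK upload issue typically involves, below is a minimal Python sketch that compares the current Linux TCP buffer settings of a node against a set of illustrative target values. The values shown are placeholders only; the authoritative recommendation remains the one in UK_sites_issue.docx, which is not reproduced here.

    import os

    # Hypothetical target values (bytes); not the Glasgow recommendation itself.
    SUGGESTED = {
        "net/core/rmem_max": "4194304",
        "net/core/wmem_max": "4194304",
        "net/ipv4/tcp_rmem": "4096 87380 4194304",
        "net/ipv4/tcp_wmem": "4096 65536 4194304",
    }

    # Read the live kernel settings from /proc/sys and print them side by side
    # with the illustrative targets, so an admin can see what would change.
    for key, wanted in sorted(SUGGESTED.items()):
        path = os.path.join("/proc/sys", key)
        try:
            current = open(path).read().strip()
        except IOError:
            current = "unavailable"
        print("%-22s current: %-28s suggested: %s" % (key, current, wanted))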