Another 18 disk servers are to be rebooted today to avoid a network card problem.
GRIDKA
12 files lost at GridKa disk servers; a first analysis shows that the files were deleted, and how this could happen is currently under investigation (GGUS:81322)
24th April 2012 (Tuesday)
DataReprocessing of 2012 data with the new alignment was launched at the T1s today
~2500 LHCb jobs were SIGSTOPed during last night. They were suspected (though not confirmed) of killing the batch system with too many queries, although nothing has changed on the DIRAC side with respect to batch system queries in the last year. The jobs were restarted this morning
waiting for an update on the files lost due to a broken CASTOR disk server (GGUS:80973); the last update was on Monday
T1
GRIDKA: job submissions to the GridKa WMS are failing (GGUS:81405). Ongoing.
SARA:
ticket (GGUS:81457) for pilots aborted with Reason=999
asked to upgrade to the latest CernVM-FS version (GGUS:81462)
19th April 2012 (Thursday)
Prompt data reconstruction, data stripping and user analysis are ongoing at the Tier1s and T0. Yesterday we stopped the reconstruction and stripping productions using the previous application version; they were taking a very long time, causing jobs to hit the queue limits at sites and be rescheduled.
some production jobs failed due to no space left on disk. A ticket was opened and has already been solved: 20 GB of scratch disk space is guaranteed for LHCb jobs (according to the VO card); a minimal check is sketched below
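A minimal sketch (hypothetical helper, not part of DIRAC) of how a job wrapper could verify the 20 GB scratch guarantee before starting the payload, failing early instead of dying mid-job with "no space left on device":

    import shutil
    import sys

    # Scratch space guaranteed per LHCb job by the VO card (20 GB, as stated above).
    REQUIRED_SCRATCH_BYTES = 20 * 1024**3

    def enough_scratch(path="."):
        """Return True if the work directory has at least the guaranteed scratch space free."""
        usage = shutil.disk_usage(path)
        return usage.free >= REQUIRED_SCRATCH_BYTES

    if __name__ == "__main__":
        if not enough_scratch():
            sys.exit("Less than 20 GB of scratch space available; refusing to start the job.")
        print("Scratch space check passed.")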
T1
IN2P3: site banned for 3 hours yesterday due to an unscheduled SRM downtime
General for all Tier1s: a new version of the CernVM-FS client is available with a fix for the cache problem (see ticket); it should be deployed ASAP
CNAF: ticket for updating the WMS version. Already solved.
GEANT network problem: all jobs started being reported as stalled and had Configuration Service authentication issues. A fix was applied for the CS and, after the network recovered, all jobs went back to normal
Will there be an incident report covering the issues?
CVMFS/broken-file problem: a fix in the CVMFS client has been prepared and is currently being rolled out to the CERN WNs.
T1
RAL: Minor SRM glitch this morning but recovered OK
IN2P3: SAM jobs identified as a possible cause of many do-nothing pilots. A fix is ready and will be rolled out ASAP.
T2
12th April 2012 (Thursday)
New production going through. Some site issues (see below), but generally significantly better than previously
A fix has been validated for the Conditions DB issue: jobs will now check the local DB install and, if it is not complete, download the DB via the web server (squid/proxy); see the sketch below
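A rough illustration of the check-local-then-download-via-squid logic described above. All paths, hostnames and the proxy URL are hypothetical placeholders; the real locations are site- and release-specific:

    import os
    import urllib.request

    # Hypothetical locations (illustration only).
    LOCAL_CONDDB = "/opt/lhcb/conddb/LHCBCOND.db"
    CONDDB_URL = "http://conddb-server.example.org/LHCBCOND.db"
    SQUID_PROXY = "http://squid.example.site:3128"

    def conddb_path():
        """Use the local Conditions DB install if it looks complete, otherwise fetch it via the site squid."""
        if os.path.isfile(LOCAL_CONDDB) and os.path.getsize(LOCAL_CONDDB) > 0:
            return LOCAL_CONDDB  # local install looks complete

        # Download through the site web proxy (squid) rather than hitting the central server directly.
        handler = urllib.request.ProxyHandler({"http": SQUID_PROXY})
        opener = urllib.request.build_opener(handler)
        target = "/tmp/LHCBCOND.db"
        with opener.open(CONDDB_URL) as resp, open(target, "wb") as out:
            out.write(resp.read())
        return target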
Any more news on the RAID controller? (GGUS:80973)
T1
GridKa have updated their LFC and it has been marked online again.
T2
10th April 2012 (Tuesday)
Prompt Reconstruction continued over Easter
Stripping jobs (for both Re-Stripping 17b & Pre-Stripping 18) are 99.9% complete (last few files going through)
MC simulation at Tier2s ongoing
Had issues with the ONLINE farm: DIRAC Removal & Transfer agents were hanging and thus slowing the distribution of data. Investigations ongoing.
Due to different trigger settings, quite a few of these early runs are taking a long time to process/strip, which results in long jobs. We have identified the problem and are in the process of stopping the current production, marking the bad runs and creating a new one.
There is another issue with the 3-4 hour delay between a CondDB update and its propagation to the WNs. As a short-term measure we will put a 6-hour delay between new Conditions arriving and job creation (see the sketch below); a permanent solution is in progress.
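A minimal sketch of the proposed short-term hold-off, under the assumption that job creation can filter runs by the time of their last Conditions update; names and data structures are illustrative, not the actual DIRAC code:

    from datetime import datetime, timedelta

    # Do not create jobs for a run until its Conditions update is old enough to
    # have propagated to the WNs (3-4 h observed, so a 6 h margin is used).
    PROPAGATION_MARGIN = timedelta(hours=6)

    def runs_ready_for_jobs(runs, now=None):
        """Keep only runs whose CondDB update time is at least PROPAGATION_MARGIN in the past.

        `runs` is an iterable of (run_number, conddb_update_time) tuples (illustrative format).
        """
        now = now or datetime.utcnow()
        return [run for run, updated in runs if now - updated >= PROPAGATION_MARGIN]

    # Example: only run 111000 is old enough to be scheduled.
    runs = [(111000, datetime.utcnow() - timedelta(hours=7)),
            (111001, datetime.utcnow() - timedelta(hours=2))]
    print(runs_ready_for_jobs(runs))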
Finally, after investigating issues with slow file access at GridKa and IN2P3, we would like to request that lcg-cp handle the dcap protocol (it currently uses GridFTP only). This would be preferable both for the sites (fewer GridFTP connections) and for us (faster transfers).