Week of 100524
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- The scod rota for the next few weeks is at ScodRota
WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments
General Information
Monday:
- No meeting - CERN closed!
Tuesday:
Attendance: local(Harry(chair), Akshat, Ricardo, Dirk, Oliver, Jean-Philippe, JanI, Lola, Eva, Patricia, Roberto, Maarten, MariaDZ, Jamie); remote(Jon(FNAL), Gang(ASGC), Vera(NDGF), Jeff(NL-T1), John+Brian(RAL), Rolf(IN2P3), Angela(KIT), Alessandro(INFN), Jeremy(GRIDPP), Rob(OSG)).
Experiments round table:
- CMS reports -
T0 Highlights: Software change in the T0; all datasets except RAW were bumped to new processing versions because of event content compatibility. KIT, IN2P3, PIC and ASGC have not yet approved the transfer requests.
T1 Highlights: 1) Power outage at PIC on Friday/Saturday with a very short recovery time (thanks). No measures such as moving custodial assignments of primary datasets had to be taken. 2) The request for 50 million MinBias events is being run at the T1 level, since the T2 level is saturated with other MC requests; it is almost done. 3) Expect to start pre-production for another re-reconstruction pass soon; still waiting for software and conditions.
T2 Highlights: 1) MC production as usual. 2) Starting large scale pile-up simulations at T2 sites, expect jobs with higher than normal I/O load at sites.
Weekly-scope Operations plan [Data Ops]:
Tier-0: data taking.
Tier-1: Expected re-reconstruction request for all 2010 data using new software release including skimming, possibly also re-reconstruction of corresponding MC.
Tier-2: Large scale pile-up production at most T2s
Weekly-scope Operations plan [Facilities Ops]:
Final webtools services migration to SL5 occurred last Friday. During this week the SL4 nodes will be switched off and deprecated.
VOC working to provide CRC-on-duty access to all CMS critical machines.
Note 1: Sites should provide SL5 UI/VOBoxes for CMS to run PhEDEx soon. By the end of June this will be mandatory, as PhEDEx_3_4_0 will be released for SL5 only. Sites are asked to provide SL5 UI/VOBoxes by that deadline.
Note 2: The CMS VO card (CIC portal) will be upgraded soon.
- ALICE reports -
GENERAL INFORMATION: Very intense MC production ran during the weekend, with peaks of over 18K concurrent jobs. In general, all Grid services at all sites behaved well.
Transfers: Raw data transfers have been ongoing for the last 3 days, with 18 TB transferred at an average speed of around 55 MB/s.
General Issue: During the last night (from 04:00 to 06:00) no ALICE jobs were registered by MonALISA; all sites show a glitch in that time window. The cause is one of the ALICE central machines, which crashed during the weekend, so no jobs were recorded by MonALISA.
T0 site: Peaks of over 3000 concurrent jobs; good behaviour of the CREAM-CE and LCG-CE resources.
T1 sites:
CCIN2P3: Different environment setups were found during the weekend on the two ALICE VOBOXes available at this site. As a result, one VOBOX is working perfectly while the second is out of production due to the wrong environment setup. Reported to the ALICE expert at the site; the experts are looking for differences between the environment setups of the two VOBOXes (a sketch of such a comparison follows after the NIKHEF item below).
NIKHEF: The local service responsible for software installation (PackMan) is failing at this site. The same issue was found this morning in Cagliari. The problem has been reported to the AliEn experts before warning the site.
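For context, the kind of check the experts are doing can be sketched as follows: capture the output of `env` on each VOBOX and diff the two dumps. This is only an illustrative Python sketch, not the actual procedure used at CCIN2P3; the file names vobox1.env and vobox2.env are hypothetical.

```python
# Minimal sketch: compare two captured `env` dumps (one per VOBOX) and report
# the variables that differ. File names are hypothetical examples.
import sys

def load_env(path):
    """Parse KEY=VALUE lines from an `env` dump into a dict."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if "=" in line:
                key, _, value = line.partition("=")
                env[key] = value
    return env

def diff_envs(a, b):
    """Return variables missing on either side, plus those with different values."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, changed

if __name__ == "__main__":
    # usage: python diff_vobox_env.py vobox1.env vobox2.env
    env1, env2 = load_env(sys.argv[1]), load_env(sys.argv[2])
    only1, only2, changed = diff_envs(env1, env2)
    for k in only1:
        print("only on VOBOX 1: %s" % k)
    for k in only2:
        print("only on VOBOX 2: %s" % k)
    for k in changed:
        print("differs: %s (%r vs %r)" % (k, env1[k], env2[k]))
```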
T2 sites
Kolkata: During the weekend the AliEn software was updated at this site. The local ALICE site admin had alerted the ALICE core team that an obsolete AliEn version was running there.
IPNL: CREAM-CE out of production (submission failing). GGUS-58478.
- LHCb reports -
Experiment activities: Very intense data taking activity, doubling the 2010 statistics. MC production ongoing.
T0 site issues:
On Monday the M-DST space showed many queued transfers and users reported their jobs hanging (and then being killed by the watchdog). SLS showed this problem yesterday; today it has recovered. We are also seeing continued degradation of service on the service class where RAW data resides. Can we have some indication of what happened and why? Jan Iven reported that the M-DST pool is too small and LHCb should request that more resources be added to it. For the raw data pool, looking at the example file and job, he saw a normal file access response of 2 seconds; the job was in fact killed on the CPU time limit.
On Saturday afternoon the default pool was overloaded, triggering an alarm on SLS; it recovered by itself.
T1 site issues:
IN2P3: Opened a GGUS ticket because ~10% of the jobs were failing yesterday with shared area issues (GGUS 58283). The problem was reproduced. Any news from IN2P3 on the AFS shared software area? Rolf replied that they were testing a workaround for the AFS cache problems that very afternoon.
RAL: 1) Request to increase the current limit of 6 parallel transfers allowed in the FTS for the SARA-RAL channel, in order to clear the current backlog, which is draining too slowly (see the back-of-envelope sketch after the T2 item below). Details of this requirement will be discussed offline. 2) Lost a disk server (the same one as last time). Files have been recovered.
CNAF: Some FTS transfers seem to fail with the error below. CNAF people discovered a bug in StoRM's clean-up of failed transfers at the root of this problem: SOURCE error during TRANSFER_PREPARATION phase: [INVALID_PATH] Requested file is still in SRM_SPACE_AVAILABLE state!
PIC: A power failure over the weekend caused the site to be unavailable.
T2 site issues: CSCS and PSNC are both failing jobs with software area issues.
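Regarding the SARA-RAL request above: the FTS channel file limit caps how many transfers run concurrently, so the time to drain a backlog scales roughly inversely with it. The Python sketch below is only a back-of-envelope illustration; the backlog size, average file size and per-file rate are hypothetical assumptions, not measured SARA-RAL values.

```python
# Back-of-envelope estimate of how long an FTS channel backlog takes to drain.
# All numbers below are illustrative assumptions, not measurements.

def drain_time_hours(backlog_files, avg_file_gb, per_file_mb_s, concurrent_files):
    """Time to move the whole backlog, assuming each transfer sustains
    per_file_mb_s and concurrent_files transfers run in parallel."""
    total_mb = backlog_files * avg_file_gb * 1024.0
    aggregate_mb_s = per_file_mb_s * concurrent_files
    return total_mb / aggregate_mb_s / 3600.0

if __name__ == "__main__":
    backlog, size_gb, rate = 5000, 4.0, 20.0   # hypothetical backlog and per-file rate
    for files in (6, 12, 20):                  # current channel limit vs. possible increases
        print("%2d concurrent files -> %.1f hours" %
              (files, drain_time_hours(backlog, size_gb, rate, files)))
```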
Sites / Services round table:
- NL-T1: Jeff queried a report that LHCb were changing CPU processing shares at NL-T1. Roberto explained this was an internal LHCb issue: NIKHEF and SARA have different roles for LHCb, and they are preparing to move analysis storage from dCache to DPM but did not propagate a matching change in CPU shares when they should have. Jeff asked them to let NL-T1 know if CPU servers need to be moved around.
- ASGC: Seeing intermittent failures of CMS and test jobs for the last few days. Have opened ticket 114622 in the CMS Savannah.
- NDGF: Performed a successful dCache upgrade today. Queried when the new FTS would be released. Maarten reported we are still waiting for test sites (including CERN) to report, but TRIUMF has already put it in production. RAL will be testing the checksum case-sensitivity fix (an illustrative sketch of such a comparison follows after this round table). Hopefully it will come out this week; more news tomorrow.
- IN2P3: Not quite understanding the ALICE problem of the different VOBOX environments. Patricia has sent log files to R.Vernets but has not opened a ticket yet, as it is not clear where the problem really lies.
- KIT: Over the weekend a disk partition of one of their CEs filled up and job submission was disabled until 14.00 today. The partition was filled by CMS and ATLAS pilot job logs, all less than 5 days old, so this was not a failure of log rotation. They are investigating why a normal number of jobs managed to generate so much logging.
- CERN CASTORATLAS: Harry queried an ATLAS operator alarm that morning on degradation of the T0MERGE pool. Jan Iven reported this was in fact provoked by a log daemon being stuck. SRMATLAS may also have been affected in the period from 04.00 to 10.00.
- CERN databases: Overnight a node of the ATLAS production database rebooted - being investigated. The DB stayed available but some sessions failed over to other instances. There was also a problem with one of the nodes of the CMS production DB running out of space in the file system.
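On the checksum case-sensitivity fix that RAL will be testing: the underlying issue is that hex checksum strings reported by different storage systems may differ only in letter case, so the comparison has to ignore case. Below is a minimal, illustrative Python sketch of such a comparison using Adler-32 (commonly used for grid file checksums); it is not the actual FTS code.

```python
# Illustrative sketch of a case-insensitive checksum comparison, not the actual
# FTS code: hex strings such as "2D1B0A9F" and "2d1b0a9f" denote the same
# Adler-32 value and must be treated as equal.
import zlib

def adler32_hex(path):
    """Compute the Adler-32 checksum of a file as an 8-character hex string."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)  # mask: zlib may return signed ints

def checksums_match(local_hex, remote_hex):
    """Compare two hex checksum strings ignoring letter case."""
    return local_hex.strip().lower() == remote_hex.strip().lower()

if __name__ == "__main__":
    print(checksums_match("2D1B0A9F", "2d1b0a9f"))  # True: differs only in case
    print(checksums_match("2d1b0a9f", "3e2c1b0a"))  # False: different values
```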
AOB:
Wednesday
Attendance: local(Miguel, Harry(chair), Lola, JanI, Eduardo, Ricardo, Jean-Philippe, Flavia, Eva, Maarten, Steve, Simone, Akshat, MariaD, Pavel); remote(Ian(CMS), Jon(FNAL), Michel(BNL), Angela(KIT), Onno(NL-T1), Tiju(RAL), Rob(OSG), Gang(ASGC), IN2P3).
Experiments round table:
- ATLAS reports - Lots of MC and reprocessing activity. Several GGUS tickets for data export issues: ASGC needs to allow more FTS jobs; NL-T1 has been in an intervention; the RAL problem has been closed. A monitoring database server at CERN is overloaded; otherwise smooth running.
- CMS reports -
T0 Highlights: 1) Preparing for 900 GeV running tonight; potentially high rates. 2) Software change in the T0: all datasets except RAW were bumped to new processing versions because of event content compatibility. All sites appear to have approved the requests.
T1 Highlights: 1) New CMSSW release working its way to the sites. Full reprocessing of the data is expected to follow shortly. 2) New ticket just opened at CNAF, where some Monte Carlo files do not have the expected checksums.
T2 Highlights: 1) MC production as usual. 2) Starting large-scale pile-up simulations at T2 sites; expect jobs with higher than normal I/O load at sites. 3) A couple of sites have seen problems updating CRLs for DOEGrids. Unclear whether it is a transient or a regional DNS problem; it seemed to affect Spanish and Portuguese Tier-2 sites first.
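On the CRL update problem: the symptom a site would see is locally installed CRLs whose nextUpdate time has passed. The Python sketch below is purely illustrative; the certificates directory, the .r0 file naming and the PEM format are assumptions based on the usual grid layout, and keeping CRLs fresh is normally the job of the fetch-crl tool rather than a script like this.

```python
# Minimal sketch: flag locally installed CRLs that are stale, the symptom of a
# failing CRL update. Directory and *.r0 naming are assumed (standard grid layout).
import glob
import subprocess
from datetime import datetime

CRL_DIR = "/etc/grid-security/certificates"   # assumed standard location

def crl_next_update(path):
    """Return the nextUpdate time of a CRL file, parsed from the openssl CLI."""
    out = subprocess.Popen(
        ["openssl", "crl", "-in", path, "-noout", "-nextupdate"],
        stdout=subprocess.PIPE).communicate()[0]
    # openssl prints e.g.: nextUpdate=May 30 12:00:00 2010 GMT
    stamp = out.decode().strip().split("=", 1)[1]
    return datetime.strptime(stamp, "%b %d %H:%M:%S %Y GMT")

if __name__ == "__main__":
    now = datetime.utcnow()
    for crl in sorted(glob.glob(CRL_DIR + "/*.r0")):
        try:
            nxt = crl_next_update(crl)
        except Exception as exc:
            print("%s: could not parse (%s)" % (crl, exc))
            continue
        if nxt < now:
            print("STALE: %s expired on %s" % (crl, nxt))
```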
- ALICE reports -
GENERAL INFORMATION: Decrease in the number of running jobs due to the end of the MC cycles started during the last weekend. In addition, the usual reconstruction and analysis activities are going on.
T0 site: Last Friday we reported the creation of a new LanDB set including about 15 ALICE CAF nodes which require common, specific connectivity. The name of this set had to be modified to follow the standard naming convention defined by the PES experts. The agreed procedure was to define a new set, this time with the right name, and to replace the current one as soon as the new set appeared properly populated with the node names. It still appears empty on the network page; the PES experts have been contacted.
T1 sites:
CCIN2P3: Issue reported yesterday concerning the different environments found on the two ALICE VOBOXes. The problem requires further, deeper investigation on the ALICE side to understand whether it is associated with the ALICE environment.
NIKHEF: The issue reported yesterday concerning the bad behaviour of the local PackMan service is solved. The VOBOX required an update of the local AliEn version. The same procedure was applied to Cagliari.
T2 site: The IPNL GGUS ticket reported yesterday is solved (the local CREAM-CE required a restart of Tomcat). Services have been restarted at the local VOBOX.
- LHCb reports - 26th May 2010 (Wednesday)
Experiment activities: 1) Reconstruction of recent data ongoing at T0/1s. 2) MC production ongoing at T2s.
T0 site issues: Ticket against CASTOR closed. Not a CASTOR problem. Some shared software area problems currently appearing.
T1 site issues: 1) PIC: PIC-USER space token is full. 2) NL-T1: SARA dCache is banned due to ongoing maintenance.
Sites / Services round table:
- NL-T1: The observations of ATLAS and LHCb are because we are migrating from 12 dCache pool nodes to 12 new ones, trying a new procedure to keep the service up. This required a dCache reconfiguration and restarts, which caused some transfer failures, but none since this morning. The whole operation will take a few days, and it was agreed to document the new procedure for other dCache sites.
- IN2P3: Will have a scheduled downtime on 8 June to move HPSS and GPFS services, so there will be no tape access. AFS servers will also be affected, so software releases (i.e. of new AFS volumes) will not be possible. A transparent Oracle intervention will also be made.
- CERN FTS: Good progress on the new version. Up at CERN in a pilot but have not exercised data transfers yet - experiments are also encouraged to exercise the pilot.
- INFN: Successfully completed Oracle and batch system upgrades.
- CERN GPN: The CERN external firewall was overloaded from about 17.00 yesterday with xrootd traffic to Tier-2 sites, causing packet losses, especially of UDP packets. This traffic was diverted to the HTAR route from 10.00 today. Maarten added that this had hurt exporting of the CERN top-level BDII and that 3/4 of the Tier-2 sites were failing the lcg replication test (local file to reference CERN SE). The availability statistics will be corrected and we will also look at moving BDII export to HTAR. It was understood later that this traffic came from the new ALICE CAF nodes.
- CERN CASTORATLAS: An ATLAS groupdisk server has some inaccessible files and needs a file system repair.
AOB: An LHC technical stop is scheduled from 31 May to 2 June inclusive. To be confirmed.
Thursday
Attendance: local();remote().
Experiments round table:
Sites / Services round table:
- CERN FTS: The endpoint on https://fts-patch4084.cern.ch:8443/glite-data-transfer-fts/services/.... is now available with 2.2.4. CMS have already done some basic testing. Ready whenever to upgrade the production T0 and T2 service. PATCH:4084
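For reference, the sort of basic test an experiment could run against the pilot is a single transfer submitted with the standard gLite FTS client and polled until it leaves the active states. The Python sketch below is illustrative only: the SURLs are hypothetical placeholders, the endpoint completes the truncated URL above with the conventional FileTransfer service name (an assumption), a valid grid proxy is assumed to exist, and the state names checked are the commonly seen FTS ones rather than an authoritative list.

```python
# Illustrative smoke test of the FTS 2.2.4 pilot: submit one transfer with the
# standard gLite client and poll its state. SURLs are hypothetical placeholders;
# a valid grid proxy is assumed to be in place.
import subprocess
import time

# The minutes truncate the endpoint path; "FileTransfer" is the conventional
# service name and is an assumption here.
ENDPOINT = "https://fts-patch4084.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer"
SRC = "srm://source.example.org/dpm/example.org/home/vo/testfile"   # placeholder SURL
DST = "srm://dest.example.org/dpm/example.org/home/vo/testfile"     # placeholder SURL

def run(cmd):
    """Run a command and return its stripped stdout as text."""
    return subprocess.Popen(cmd, stdout=subprocess.PIPE).communicate()[0].decode().strip()

if __name__ == "__main__":
    job_id = run(["glite-transfer-submit", "-s", ENDPOINT, SRC, DST])
    print("submitted job %s" % job_id)
    for _ in range(20):                       # poll for up to ~10 minutes
        state = run(["glite-transfer-status", "-s", ENDPOINT, job_id])
        print("state: %s" % state)
        if state not in ("Submitted", "Pending", "Ready", "Active"):
            break                             # treat anything else as terminal
        time.sleep(30)
```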
AOB: (MariaDZ) A round of the periodic ALARM tests (full chain) will take place next week, Monday to Wednesday. Only the Tier-0 is concerned this time. Steps to follow are in https://savannah.cern.ch/support/?114705 and the services to test are listed in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGCriticalServices
Friday
Attendance: local();remote().
Experiments round table:
Sites / Services round table:
AOB:
-- JamieShiers - 20-May-2010