Week of 100524

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE ATLAS CMS LHCb
  • SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

  • No meeting - CERN closed!

Tuesday:

Attendance: local(Harry(chair), Akshat, Ricardo, Dirk, Oliver, Jean-Philippe, JanI, Lola, Eva, Patricia, Roberto, Maarten, MariaDZ, Jamie);remote(Jon(FNAL), Gang(ASGC), Vera(NDGF), Jeff(NL-T1), John+Brian(RAL), Rolf(IN2P3), Angela(KIT), Alessandro(INFN), Jeremy(GRIDPP), Rob(OSG)).

Experiments round table:

  • CMS reports -

T0 Highlights: Software change in the T0; all datasets except RAW were bumped to new processing versions because of event content compatibility. KIT, IN2P3, PIC and ASGC have not yet approved the transfer requests.

T1 Highlights: 1) Power outage at PIC on Friday/Saturday with a very short recovery time (thanks). No measures such as moving custodial assignments of primary datasets had to be taken. 2) The request for 50 million MinBias events will be run at the T1 level, as the T2 level is saturated with other MC requests; almost done. 3) Expect to start pre-production for another re-reconstruction pass soon; still waiting for software and conditions.

T2 Highlights: 1) MC production as usual. 2) Starting large scale pile-up simulations at T2 sites, expect jobs with higher than normal I/O load at sites.

Weekly-scope Operations plan [Data Ops]:

Tier-0: data taking.

Tier-1: Expected re-reconstruction request for all 2010 data using new software release including skimming, possibly also re-reconstruction of corresponding MC.

Tier-2: Large scale pile-up production at most T2s

Weekly-scope Operations plan [Facilities Ops]:

The final webtools services migration to SL5 took place last Friday. During this week the SL4 nodes will be switched off and deprecated.

VOC working to provide CRC-on-duty access to all CMS critical machines.

Note 1: Sites should provide SL5 UI/VOBoxes for CMS to run PhEDEx soon. By the end of June this will be mandatory, as PhEDEx_3_4_0 will be an SL5-only release. Sites are asked to provide the SL5 UI/VOBoxes by that deadline.

Note2: CMS VOcard (CIC-portal) to be upgraded soon.

  • ALICE reports -

GENERAL INFORMATION: Very intensive MC production ran during the weekend, with peaks of over 18K concurrent jobs. Good behavior in general of all Grid services at all sites.

Transfers: Raw data transfers have been running for the last 3 days, with 18 TB transferred at an average speed of around 55 MB/s.

General Issue: During last night (from 04:00 to 06:00) no ALICE jobs were registered by MonALISA; in fact all sites show a glitch in that time window. The problem came from one of the ALICE central machines, which crashed during the weekend; as a result no jobs were recorded by MonALISA.

T0 site: Peaks of over 3000 concurrent jobs; good behavior of the CREAM-CE and the LCG-CE resources.

T1 sites

CCIN2P3: During the weekend the two ALICE VOBOXes available at this site were found to have different environment setups. As a result, one VOBox is working perfectly while the second is out of production due to the wrong environment setup. The issue has been reported to the ALICE expert at the site, and the experts are looking into possible differences in the environment setup of the two VOBOXes.

NIKHEF: The local service responsible for software installation (PackMan) is failing at this site. The same issue was found this morning at Cagliari. The problem has been reported to the AliEn experts before notifying the site.

T2 sites

Kolkata: During the weekend the AliEn software was updated at this site. The local ALICE site admin had warned the ALICE core team that an obsolete AliEn version was running at this site.

IPNL: CREAM-CE out of production (submission failing). GGUS-58478

  • LHCb reports -

Experiment activities: Very intense data taking activity, doubling the 2010 statistics. MC production ongoing.

T0 site issues:

On Monday the M-DST space had many transfers queued and users reported their jobs hanging (and then being killed by the watchdog). SLS clearly showed this problem yesterday; today it has recovered. We are also seeing continued degradation of service on the service class where the RAW data reside. Can we have some indication of what happened and why? Jan Iven reported that the M-DST pool is too small and LHCb should request that more resources be added to it. For the raw data pool, looking at the example file and job, he saw a normal file access response of 2 seconds; the job was in fact killed on the CPU time limit.

On Saturday afternoon the default pool was overloaded, triggering an alarm on SLS; it recovered by itself.

T1 site issues:

IN2P3: Opened a GGUS ticket because ~10% of the jobs were failing yesterday with shared area issues (GGUS:58283). The problem has been reproduced. Any news from IN2P3 on the AFS shared software area? Rolf replied that they were testing a workaround for the AFS cache problems that very afternoon.

RAL: 1) Request to increase the current limit of 6 parallel transfers allowed in the FTS for the SARA-RAL channel, as the current backlog is draining too slowly. Details of this requirement will be discussed offline. 2) Lost a disk server (the same one as last time). Files have been recovered.

CNAF: Some FTS transfers seem to fail with the error below. CNAF discovered that a bug in StoRM's clean-up of failed transfers is at the root of this problem: SOURCE error during TRANSFER_PREPARATION phase: [INVALID_PATH] Requested file is still in SRM_SPACE_AVAILABLE state!

PIC: A power failure over the weekend caused the site to be unavailable.

T2 site issues: CSCS and PSNC are both failing jobs with software area issues.

Sites / Services round table:

  • NL-T1: Jeff queried a report that LHCb were changing CPU processing shares at NL-T1. Roberto explained this was an internal LHCb issue: NIKHEF and SARA have different roles for them, and they are preparing to move analysis storage from dCache to DPM but did not propagate a matching change in CPU shares when they should have. Jeff asked them to let NL-T1 know if CPU servers need to be moved around.

  • ASGC: Seeing intermittent failures of CMS and test jobs for a few days. Have opened ticket 114622 in the CMS Savannah.

  • NDGF: Performed a successful dCache upgrade today. Queried when the new FTS would be released. Maarten reported that we are still waiting on test sites (including CERN) to report, but TRIUMF had already put it in production. RAL will be testing the checksum case sensitivity fix. Hopefully it will come out this week - more news tomorrow.

  • IN2P3: Not quite understanding the ALICE problem of differing VOBox requirements. Patricia has sent log files to R.Vernets but has not opened a ticket yet, as it is not clear where the problem really lies.

  • KIT: Over the weekend a disk partition of one of their CEs filled up and job submission was disabled until 14:00 today. It was filled by CMS and ATLAS pilot job logs, all less than 5 days old, so this was not a failure of log rotation. They are investigating why a normal number of jobs managed to generate so much logging.

  • CERN CASTORATLAS: Harry queried an ATLAS operator alarm that morning on degradation of the T0MERGE pool. Jan Iven reported this was in fact provoked by a log daemon being stuck. SRMATLAS may also have been affected in the period from 04.00 to 10.00.

  • CERN databases: Overnight a node of the ATLAS production database rebooted - being investigated. The DB stayed available but some sessions failed over to other instances. There was also a problem with one of the nodes of the CMS production DB running out of space in the file system.

AOB:

Wednesday

Attendance: local(Miguel, Harry(chair), Lola, JanI, Eduardo, Ricardo, Jean-Philippe, Flavia, Eva, Maarten, Steve, Simone, Akshat, MariaD, Pavel);remote(Ian(CMS), Jon(FNAL), Michel(BNL), Angela(KIT), Onno(NL-T1), Tiju(RAL), Rob(OSG), Gang(ASGC), IN2P3).

Experiments round table:

  • ATLAS reports - Lots of MC and reprocessing activity. Several GGUS tickets for data export issues: ASGC needs to allow more FTS jobs, NL-T1 has been in an intervention, and the RAL problem has been closed. A monitoring database server at CERN is overloaded; otherwise smooth running.

  • CMS reports -

T0 Highlights: 1) Preparing for 900 GeV running tonight; potentially high rates. 2) Software change in the T0; all datasets except RAW were bumped to new processing versions because of event content compatibility. All sites appear to have approved the requests.

T1 Highlights: 1) New CMSSW release working its way to sites. Full reprocessing of the data expected to follow shortly. 2) New ticket just opened at CNAF where some MonteCarlo files do not have the expected checksums.

T2 Highlights: 1) MC production as usual. 2) Starting large-scale pile-up simulations at T2 sites; expect jobs with higher than normal I/O load at sites. 3) A couple of sites have seen problems updating the CRL for DOE Grids. Unclear whether it is a transient or a regional DNS problem. It seemed to affect Spanish and Portuguese Tier 2s first.

  • ALICE reports -

GENERAL INFORMATION: Decrease in the number of running jobs due to the end of the MC cycles started during the last weekend. In addition, the usual reconstruction and analysis activities are going on.

T0 site: Last Friday we reported the creation of a new LanDB set including about 15 ALICE CAF nodes which require a common, specific connectivity. The name of that set had to be modified to follow the standard naming convention defined by the PES experts. The agreed procedure was to define a new set, this time with the correct name, which would replace the current one as soon as it appeared properly populated with the names of the nodes. It still appears empty on the network page; the PES experts have been contacted.

T1 sites:

CCIN2P3: The issue reported yesterday concerning the different environments found in the two ALICE VOBOXes requires further, deeper investigation on the ALICE side in order to understand whether it is a problem associated with the ALICE environment.

NIKHEF: The issue reported yesterday concerning the bad behavior of the local PackMan service is solved. The VOBox required an update of the local AliEn version. The same procedure was applied at Cagliari.

T2 site: The IPNL GGUS ticket reported yesterday is solved (the local CREAM-CE required a restart of Tomcat). Services have been restarted on the local VOBox.

  • LHCb reports - 26th May 2010 (Wednesday)

Experiment activities: 1) Reconstruction of recent data ongoing at T0/1s. 2) MC production ongoing at T2s.

T0 site issues: Ticket against CASTOR closed. Not a CASTOR problem. Some shared software area problems currently appearing.

T1 site issues: 1) PIC: PIC-USER space token is full. 2) NL-T1: SARA dCache is banned due to ongoing maintenance.

Sites / Services round table:

  • NL-T1: The observations of ATLAS and LHCb are because we are migrating from 12 dCache pool nodes to 12 new ones, trying a new procedure to keep the service up. This required a dCache reconfiguration and restarts, which caused some transfer failures, but none since this morning. The whole operation will take a few days, and it was agreed to document the new procedure for other dCache sites.

  • IN2P3: Will have a scheduled downtime on 8 June to move HPSS and GPFS services, so there will be no tape access. AFS servers will also be affected, so software releases (i.e. of new AFS volumes) will not be possible. A transparent Oracle intervention will also be made.

  • CERN FTS: Good progress on the new version. Up at CERN in a pilot but have not exercised data transfers yet - experiments are also encouraged to exercise the pilot.

  • INFN: Successfully completed Oracle and batch system upgrades.

  • CERN GPN: The CERN external firewall was overloaded from about 17:00 yesterday with xrootd traffic to Tier-2 sites, causing packet losses, especially of UDP packets. This traffic was diverted to the HTAR route from 10:00 today. Maarten added that this had hurt the export of the CERN top-level BDII and that 3/4 of the Tier-2 sites were failing the lcg replication test (local file to reference CERN SE). The availability statistics will be corrected, and we will also look at moving the BDII export to HTAR. It was later understood that this traffic was from the new ALICE CAF nodes.

  • CERN CASTORATLAS: An ATLAS groupdisk server has some inaccessible files and needs a file system repair.

AOB: An LHC technical stop is scheduled from 31 May to 2 June inclusive. To be confirmed.

Thursday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

  • CERN FTS endpoint on https://fts-patch4084.cern.ch:8443/glite-data-transfer-fts/services/.... is now available with 2.2.4. CMS have already done some basic testing. Ready whenever to upgrade the production T0 and T2 service. PATCH:4084.
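As a hedged illustration (not an official test procedure), a minimal way an experiment or site could exercise the 2.2.4 pilot, assuming the standard gLite FTS command-line clients (glite-transfer-submit / glite-transfer-status); the endpoint path and SURLs below are placeholders, not real values:

    # Point the client at the pilot service (use the full endpoint path as announced above)
    FTS_PILOT="https://fts-patch4084.cern.ch:8443/<full-service-path>"
    # Submit a single test transfer between two placeholder SURLs; the command prints a job ID
    JOBID=$(glite-transfer-submit -s "$FTS_PILOT" \
        srm://source.example.org/some/path/testfile \
        srm://dest.example.org/some/path/testfile.copy)
    # Poll the job until it reaches a terminal state; -l lists per-file status
    glite-transfer-status -s "$FTS_PILOT" -l "$JOBID"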

AOB: (MariaDZ) A round of the periodic ALARM tests (full chain) will take place next week, Mon-Wed. Only the Tier-0 is concerned this time. Steps to follow are in https://savannah.cern.ch/support/?114705 and the services to test are in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGCriticalServices

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 20-May-2010
