Week of 130506

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: Alessandro, Ben, Felix, Luca C, Luca M, Maarten, Maria D, Raja
  • remote: Alexander, Christian, David, Gareth, Lisa, Michael, Paolo, Pepe, Rob, Rolf, Xavier

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • NTR
    • T0/1s
      • CERN-PROD GGUS:93811: since May 1, many jobs at CERN-PROD have been failing with the pilot error "Get error: Staging input file failed" (Savannah:137261). The 4-digit DB release file was only on CERN-PROD_SPECIALDISK, which was overloaded. The dataset has been replicated to EOS.

  • CMS reports (raw view) -
    • In general a calm happy weekend.
    • Preparing reprocessing of the 2011 data at CERN, with very successful prestaging and transfers last week.
      • Luca: which storage system will be used for that? As CASTOR is being drained, it has less space available
      • David: probably EOS will be used instead, will check
    • Reprocessing of MC occurring at T1 at low level.
    • CMS-only Tier-2s are encouraged to make the transition to SL6; sites shared with other VOs need to wait until June 1 to do this.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Incremental stripping campaign in progress: running between 1k and 3k stripping and merging jobs, with 98% of execution completed
    • T0: NTR
    • T1:
      • Possible slowdown in staging files at GridKa. Wait and see.

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • NET2 (Northeast T2 == Boston Univ. + Harvard) are moving to a new computer center; expected to be up again on May 11
      • Alessandro: downtime declared in OIM?
      • Rob: yes
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT
    • tomorrow at-risk downtime for Oracle security patch
  • NDGF
    • downtime Wed afternoon: some servers will be moved, causing some data to be temporarily unavailable
  • NLT1 - ntr
  • OSG
    • BDII issue reported last week was fixed on Fri
  • PIC
    • tomorrow morning at-risk downtime for tape system; some network cables will be moved as well
  • RAL
    • We have a scheduled outage (in the GOC DB) for Wednesday when we are switching over our standby & production castor databases.

  • databases
    • a regular patch campaign for the integration databases is foreseen for next week, with the production databases to follow 2 weeks later
  • grid services
    • tomorrow ce207 and ce208 will be upgraded to EMI-2, leaving ce206 still to be done
    • ce208 will support submission to SLC6 resources, currently making up 10% of the batch capacity
    • this morning the lxplus.cern.ch alias was changed to point to lxplus6 (i.e. SLC6) nodes; the lxplus5 alias remains available for SLC5 nodes and currently still has most of the lxplus capacity
  • storage
    • the lxplus alias change caused authorization errors in CASTOR, as some of the lxplus6 nodes are on a network that was not considered local; this was fixed around 11:15

AOB:

  • Next meeting on Friday

Thursday: Ascension holiday

  • The meeting will be held on Friday instead.

Friday

Attendance:

  • local: Alessandro, Andrew, Ben, Eva, Felix, Luca M, Maarten, Raja
  • remote: Catalin, John, Michael, Rob, Rolf, Stefano, Tommaso, WooJin

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • NTR
    • T0/1s
      • IN2P3-CC GGUS:93932 (and then a duplicate, GGUS:93963). One diskserver has HW issues (45.64TiB, 294729 files), which hopefully will be fixed on Monday.

  • CMS reports (raw view) -
    • CERN: jobs going through ce208 (i.e. SL6 WNs) are failing because VO_CMS_SW_DIR is not set on these WNs. Details are in TEAM ticket GGUS:93965, which is still NOT solved. As a result, the CERN site is currently blacklisted for CMS analysis jobs.

  • ALICE -
    • CERN: submission to SLC6 queue stopped Tue evening after job failure reports, possibly due to presence of incompatible Xrootd client code in standard directories; other SL6 sites do not show such problems and probably do not have Xrootd clients installed; to be debugged...
    • CNAF: currently no normal jobs after VOBOX got upgraded from SL5 to SL6 yesterday; where the shared area is used, the VOBOX OS needs to be the same as the WN OS
      • Stefano: currently there are only a few WN on SL6, behind a dedicated queue; we have no schedule yet for upgrading the farm; we will look into the different options for getting the ALICE job submission working again
      • after the meeting: the VOBOX was switched to the use of Torrent and jobs are ramping up again, thanks!

  • LHCb reports (raw view) -
    • Incremental stripping campaign in progress and MC productions starting
    • T0: (GGUS:93975) voms-proxy-init does not work on lxplus(sl6)+cvmfs.
    • T1: NTR
    • Other: (GGUS:93966) Request to GGUS to allow fine-grained SE reporting by sites. For example, we do not need to ban a full site when only the tape system is down (as happened recently at PIC, where the SRM was indicated as down).

Sites / Services round table:

  • ASGC
    • intervention for FTS DB HW migration Mon May 13 01:00-13:00 UTC
  • BNL
    • scheduled maintenance Tue May 28 through Wed May 29; main objectives:
      • move the farm to SL6
      • upgrade dCache to the 2.2.x golden release
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • EMI-2 CREAM (S)GE publisher crashes when there are more than 36k jobs in total (running + waiting), a patch is being worked on (GGUS:93506); KIT admins have been informed as well
  • KIT - ntr
  • OSG - ntr
  • RAL - ntr

  • databases - ntr
  • grid services - ntr
  • storage - ntr

AOB:

Topic revision: r8 - 2013-05-10 - MaartenLitmaath
 