Week of 090831

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Harry, Patricia, Andrea, Jan, Lola, Gang, Olof, Jamie, Markus, Maria D, Dirk(chair));remote(Cedric, Gonzalo, Michel, Daniele, Ronald).

Experiments round table:

  • ATLAS - (Cedric) The test of the alarm ticket procedure to BNL on Friday afternoon was successful. During the weekend there were some problems in Lyon with dCache and at ASGC with CASTOR - both were quickly fixed. CMS experienced problems with transfers to Milano because of missing FTS channels - CNAF has now added the channels, which are also required for other T2 transfers. ATLAS asks Tier 1 sites to proactively follow up on T2 channel configurations.

  • CMS reports - (Daniele) Problem with gLite WMS (CMS ticket 109692), relevant people are looking at solutions. CMS closed several T1/T2 tickets (more details on the CMS wiki). Tape migration in ASGC is still an issue. Several new T2 problems came up with transfer errors, or FTS submission - new tickets are being created. Michel: problem with globus libs on T2 WN on SL5: workaround implies copies to CMS s/w area. Markus: should be discussed at GDB level. Gang: ASGC Castor problem has been escalated to CASTOR team at CERN.

  • ALICE - (Patricia) Stable production over the weekend. SARA is currently not usable for ALICE due to a VObox issue. ALICE have created ticket 51238, which describes the few steps needed to solve the issue.

Sites / Services round table:

  • Ronald/NIKHEF: scheduled tape maintenance ongoing. NIKHEF WN had to be switched off due to unexpected power intervention. Now back up again.
  • Gonzalo/PIC, Michel/GRIF, Gang/ASGC : NTR
AOB:

Tuesday:

Attendance: local(Gang, Jamie, Jan, Maria (chair), Harry, Cedric, Olof, Patricia, Roberto);remote(Daniele, Gonzalo, Andreas, Ron, Michel).

Experiments round table:

  • ATLAS (Cedric)- nothing to report

  • CMS reports - (Daniele) The CRAFT09 exercise ends successfully today. There are just a few job failures in the express streams to report, which are being investigated. The August production round is also over, with more than 300M raw events, using 12 to 15k slots in a stable way. Generally speaking (for the sites supporting the CMS VO), the T1 sites are in good shape (apart from the migration at ASGC) and the T2s are working on open tickets.

  • ALICE - The problem mentioned yesterday for SARA VObox is solved. Migration to SL5 on WN is under way. More news tomorrow.

  • LHCb reports - MC production is ongoing very smoothly with 50M events. Ramping up now. See report for more details on the T1 site issues.

Sites / Services round table:

  • ASGC (Gang): 600 TB disk space for ATLAS and CMS for CASTOR pool will be made available this week.

  • PIC (Gonzalo): Nothing to report.

  • FZK (Andreas): A problem with PBS required a reboot. Related job submission errors were observed.

  • SARA (Ron): Maintenance in disk space took longer than foreseen but is now over. The tape system is also back. ATLAS dcache problems reported and solved. (ATLAS - Cedric): ticket submitted this morning.

  • GRIF (Michel): Some CEs upgraded to SL5.

  • FIO OPS (Jan): Successful CASTOR public upgrade to 2.1.8-10. SAM test failures on CEs from yesterday understood and fixed.
AOB:

Wednesday

Attendance: local(Jamie, Maria, Gang, Cedric, Harry, Sophie, Andrea, Julia, Roberto, Antonio, Jean-Philippe, MariaD, Lola);remote(Daniele, Tiju, Gonzalo, Fabio, Gareth, Jos).

Experiments round table:

  • ATLAS (Cedric) - No news received from SARA about the tape services, which were in maintenance yesterday.

  • CMS reports (Daniele) - No major problems to report, apart from the open tickets described in full detail in the attached report. ASGC tape migration is progressing. Backlog being digested. Contacts with IN2P3 are also being re-established after the low responsiveness reported in August.

  • ALICE - No report.

  • LHCb reports - Very actively running 15k concurrent jobs as required by the physics groups for MC production. Also to report: degraded performance to CERN, with transfers failing. The problem is being followed up (a GGUS ticket will be opened soon). Four new tickets have been opened for the Tier1 sites. Details are in the attached report. Worth mentioning is the request by LHCb to PIC, IN2P3, FZK and NL-T1 for a dCache client version update on the WNs. Acknowledged by PIC and IN2P3, to be done (at low priority).

Sites / Services round table:

  • ASGC (Gang): progressing well with the tape migration for CMS.

  • RAL (Gareth): ATLAS 3D database downtime announced for 08.09 (3h) for migration to 64bit. Also an answer to the question that arose at the MB yesterday on the impact of the BNL site name change: confirmed that no changes are needed.

  • PIC (Gonzalo): dCache client update acknowledged - will be done. Can it come through s/w area? LHCb prefer sites to install on WN. Some CEs have ~6K jobs queuing in local batch q - to be checked. Add extra space on MC-M-DST space. Now at 25TB total w 6TB free (for LHCb)

  • IN2P3 (Fabio) - acknowledged dCache client update ticket - will look into it (Check feasibility of installing outside scope of gLite m/w stack). Failures of SAM tests (see full LHCb report of Tuesday) because of correlation between short CPU queues and low memory h/w. Could memory requirement be put in JDL?
    Scheduled downtime 22/09 for 2 days for electrical equipment change - to be added to GOCDB

  • FZK (Jos) - nothing to report

  • Dashboards (Julia) - require move of non-VO specific application from INTR to LCGR - to be scheduled.

  • CERN (Sophie) - LFC public upgraded to 1.7.2-4 (supports ATLAS bulk methods). ATLAS announced for next week. ATLAS still testing their site services and will confirm when needed for T1s.

Release report: deployment status wiki page

AOB:

Thursday

Attendance: local(Gang, Cedric, Stephane, Jamie, Jan, Gav, Jean-Philippe, Julia, Roberto, Harry, Simone);remote(Jeremy, Gonzalo, Gareth).

Experiments round table:

  • ATLAS (Cedric) - 5 T2 in unsched down - quite a lot! Most have storage problems. Working to fix it. Deployment of new LFC for T1s - good if sites can deploy asap. Preferably before end Sep - already done in some T1s, e.g. Lyon.

  • ALICE -

  • LHCb reports - Confirm activity of last days proceeding. 14-15k jobs concurrently running. 3 new tickets: CERN, CNAF & a T2. CERN: some critical test jobs fail due to VOMS cert not properly updated. Issue with CASTOR@CERN still open: device/resource busy, investigation ongoing. T1 issue: CNAF - similar to PIC "wrong BDII publication" on running/waiting jobs. Affects ranking expression. Site appears "attractive" despite huge # jobs. GGUS tickets against both sites. CNAF: transfers failing due to exhausted diskspace - data in wrong space token, being sorted out LHCb+CNAF - details in Twiki.

Sites / Services round table:

  • PIC (Gonzalo) - brief intervention on firewall this am. Everything went ok - lasted only 10' (not the 2 hours foreseen). Did not close qs, just suspended running jobs & resumed 10-15' later. Seems AOK.

  • CERN (Gav) - all WMSs upgraded to gLite 3.2. Looking at LHCb CASTOR issue - seems to be localized to a few disk servers.

AOB:

  • Thursday next week is a public holiday in Geneva - no call that day.

Friday

Attendance: local(Jamie, Harry, Maria, Carlos, Gang, Jan, Alessandro, Jean-Philippe, Roberto, Olof, Simone, Stephane);remote(Gareth).

Experiments round table:

  • ATLAS (Stephane) - some Tier2 problems. All ok with Tier0 and tier1 sites.

  • CMS reports - Daniele apologized for not being able to connect today.

  • ALICE - finishing the SL5 VOBOX testing to provide IT-GD with feedback. The VOBOXes need to be registered in the list of trusted nodes of myproxy.cern.ch. The email has been sent to the corresponding list; it would be appreciated if the request could be speeded up, so that the certification of this upgraded service can be finished. Jan from FIO-OPS clarified that ALICE should be able to do this by themselves. Will be done this afternoon by FIO. (Patricia after the meeting): Request already fulfilled by FIO.

  • LHCb reports - Various MC productions (simulation + reconstruction + merging @T1s) and user distributed analysis are running smoothly, with 12-13K jobs running in the system. Also proceeding with testing of SL5 for DIRAC validation. CNAF is working on the change of the local space token definition. At Tier0 there are CASTOR problems when trying to upload files to the lhcbrdst service class.

Sites / Services round table:

  • ASGC (Gang) - for the CMS farmpool, 5 disk servers and 130 TB have been added (another farmpool has 13 disk servers and 284 TB of disk space). For the atlasStage pool, 2 disk servers and 9 TB of disk space have been added; this pool now has 3 disk servers and 28 TB.

  • RAL (Gareth) - for clarification after yesterday's meeting: the gLite version is 3.1.28 and LFC is 1.7.2-4.

  • CS (Carlos) - On Monday 7th from 14:00 to 18:00 CS plans to reconnect the new router to the LCG core in order to gather more information about the performance degradation observed (as requested by F10 engineers). Service managers are invited to report any network anomaly.

  • FIO Ops (Jan and Olof) - Next Tuesday morning there will be an at-risk intervention for LSF to use the new licence servers, and an intervention with downtime from 14:00 to 15:30 to upgrade c2cernt3 (castoratlast3, castorcmst2) to CASTOR version 2.1.8-10. Both interventions are on the IT status board.

  • Databases (Maria) - Migration to new hardware for the SARA DBs with a downtime from 8.00 to 12.00 on 11.09.

AOB: Simone asked when the lxplus alias will be pointing to SL5. Olof said that it will be discussed at the GDB and that tentatively it will happen on 05.10.09.

-- DirkDuellmann - 2009-08-31

Topic revision: r11 - 2009-09-04 - MariaGirone
 