Week of 110425

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO summaries of site usability: ALICE ATLAS CMS LHCb
SIRs, open issues & broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

  • GGUS Information
  • LHC Machine Information
  • CERN IT status board
  • M/W PPSCoordinationWorkLog
  • WLCG Baseline Versions
  • WLCG Blogs
  • Sharepoint site - Cooldown Status - News


Monday:

  • No meeting - CERN closed.

Tuesday:

Attendance: local(Eva, Nilo, Ricardo, Ignacio, Maarten, MariaG, Stephane, Ale, Fernando, Dan, Dirk); remote(Michael/BNL, Gareth/RAL, Renato/LHCb, Rob/OSG, Xavier/KIT, Gonzalo/PIC, Ian/CMS, Jon/FNAL, Huang/ASGC, Christian/NDGF, CNAF, Jeff/NL-T1).

Experiments round table:

  • ATLAS reports - Fernando
    • In a nutshell: Physics all day (data11_7TeV) with short calibration periods and a few interruptions
    • Peak luminosity record broken for a hadron collider
    • T0
    • T1s
    • Central Services
      • Downtime collector stuck - the quattor template of the machine had been incorrectly modified, which blocked any activity by the user account under which the collectors run. This is why RAL was not automatically re-included in DDM activity for a couple of hours after coming back from their downtime on Thursday ~12:00AM.
    • Ale: still need a better understanding of the second CASTOR problem. Ignacio: will follow up with an incident report once it is fully understood. Several possible causes are still being investigated (DB backup, scaling issues with the CASTOR file-name dump, overwhelmed rsyslog) - the most likely candidate is the rsyslog problem.
    • Stephane: VOMS failover after the BNL problem did not work as expected. Maarten: not clear, as the failover timeout is 3 minutes and people may not have waited long enough. Ale: main message for the upcoming VOMS intervention: the old service should immediately reject client requests to avoid confusion, as also discussed at the T1 service coordination meeting last week.

  • CMS reports - Ian
    • LHC / CMS detector
      • Excellent running over the weekend. Good live time
    • CERN / central services
      • Nothing to report
    • Tier-0 / CAF
      • Last 24 hours CMS has been averaging 85% utilization of the Tier-0
    • Tier-1
      • 2010 Reprocessing launched at Tier-1s
      • Reprocessed some limited 2011 datasets over the weekend. Good responses by Tier-1
      • Data Ops has requested a consistency check of the site contacts
    • Tier-2
      • MC production and analysis in progress. Reduced effort over the holiday
    • Other
      • CRC-on-Duty : Peter Kreuzer as of this evening.
    • Ricardo: swap was seen full at the T0 - who is working on this on the CMS side? Ian: the current workflow has high memory consumption. David Lang is following this closely, but consumption should go down by 400 MB - 1 GB during the next 48h.

  • ALICE reports - Maarten
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports - Renato
    • RAW data distribution and FULL reconstruction are ongoing at most Tier-1s.
    • A lot of MC continues to run.
    • T0
      • Problem SOLVED: problems staging files from tape (72 files). The files were requested yesterday and had been expected online by now.
      • Yesterday, files were missing from OFF-LINE storage due to a hardware failure. Files have now started to move to OFF-LINE again.
    • T1
      • IN2P3: problems with software installation; the "share" is set to zero. A solution is in progress.
      • RAL: Storage Elements full (RAW and RDST, which use the same Space Token). It was reported that "some tape drives becoming stuck and not working", which seems to be fixed. However, there is still a big backlog.
    • T2

Sites / Services round table:

  • Michael/BNL - the long-standing BNL - CNAF network issue is closed! Yesterday morning at 9:00 there was a hiccup with the VOMS tomcat server (a probe exists to detect these problems); the service was restarted at 10:00. The Oracle RAC for conditions needs to move to another data center: a 4h outage is scheduled for tomorrow. Maarten: the VOMS service was not dead, but no longer handled connections, so a modified connection timeout would not have helped much.
  • Gareth/RAL - problems with the ATLAS sw server, also with CVMFS (corruption); the CVMFS issues should be resolved in the latest CVMFS client version. CASTOR: DB back-end problems. The LHCb CASTOR area became full after tape-drive problems: the disk cache filled up and, as a knock-on effect, reads from tape were also affected, since garbage collection cleaned newly recalled files before LHCb could access them. Additional resources have been added and the system is now recovering.
  • Rob/OSG - upstream probe problem with sam collector last night. Data will be resent to complete the missing time window.
  • Xavier/KIT - FTS Oracle back-end problems on Sat morning - FTS channel restart fixed the problem.
  • Gonzalo/PIC - ntr
  • Jon/FNAL - ntr
  • Huang/ASGC - Reminder: CASTOR in downtime for upgrade until Friday night. Also planned network maintenance with limited connectivity from 22:00 to 04:00 UTC.
  • CNAF - CE experienced "out of memory" problems - now back to normal
  • Jeff/NL-T1 - ntr
  • Mat/IN2P3 - the dCache ticket for the recent ATLAS problem is still open with no answer yet. Operations are monitoring the system and will restart it if the problem recurs.
  • Christian/NDGF - ntr
  • CERN VOMS service: the certificate for the LHC VOMS services on voms.cern.ch will be updated on Wednesday, April 27th, around 10:00 CEST. The current version of lcg-vomscerts is 6.4.0, released two weeks ago; it should certainly be applied to gLite 3.1 WMS and FTS services, and has been in the release for those services for a few weeks. T1s running FTS services should make sure they have the latest version of the RPM.
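A site admin could sanity-check the installed lcg-vomscerts against the 6.4.0 baseline mentioned above with a simple version comparison. This is a minimal sketch with an illustrative hard-coded version value; on a real WMS/FTS node the installed version would instead come from an rpm query such as `rpm -q --qf '%{VERSION}\n' lcg-vomscerts`.

```shell
# Required baseline from the WLCG announcement above.
required="6.4.0"

# Illustrative example value; on a real node, substitute the output of:
#   rpm -q --qf '%{VERSION}\n' lcg-vomscerts
installed="6.4.0"

# sort -V orders version strings numerically; if the lowest of the two
# is the required baseline, the installed version is new enough.
lowest=$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n1)

if [ "$lowest" = "$required" ]; then
    echo "lcg-vomscerts $installed meets baseline $required"
else
    echo "lcg-vomscerts $installed is older than baseline $required"
fi
```

With the example value above this prints "lcg-vomscerts 6.4.0 meets baseline 6.4.0"; an older version such as 6.3.1 would trigger the warning branch instead.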

AOB:

Wednesday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 19-Apr-2011

Topic revision: r5 - 2011-04-26 - DirkDuellmann
 