Week of 130930

WLCG Operations Call details

To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: Simone (SCOD), Alessandro (ATLAS), Ivan (dashboards), Ulrich (CERN/PES), Maarten (ALICE), Ken (CMS)
  • remote: Michael (BNL), Thomas (NDGF), Vladimir (LHCb), Lisa (FNAL), Rolf (IN2P3), Sang-Un (KISTI), Wei-Jen (ASGC), Gareth (RAL), Rob (OSG), Pepe (PIC), Salvatore (CNAF), Onno (NL-T1)

Experiments round table:

  • CMS reports (raw view) -
    • CVMFS Stratum 1 is down, although this doesn't appear to be affecting our operations.
    • On Friday there was trouble with the BDII used for SAM tests; it has since been resolved.
    • Two unresolved tickets, INC:395272 and INC:396136. Both arrived over the weekend and had seen no action as of this morning (but these probably need to be dealt with on the CMS side).
    • The story on the network link to Russia is now understood.

  • ALICE -
    • NDGF: 35k files (96% production, 4% organized analysis, some >2 years old) no longer found in dCache, probably lost; being investigated
    • many sites moving to CVMFS

  • LHCb reports (raw view) -
    • Main activities are MC productions.
      • The fall incremental stripping campaign will be launched this week (see also the WLCG Operations Coordination meeting, 19 Sep).
    • T0:
      • NTR
    • T1:
      • RAL: one disk server is down and technicians are looking into the issue; for the moment the affected files are declared unreachable within LHCb
      • GRIDKA: problem with staging during stress test

Sites / Services round table:

  • RAL:
    • 50% of the batch resources have been moved from Torque to Condor (all on SL6). The remaining 50% will be upgraded to SL6 on Wednesday but will remain under Torque for some weeks.
    • The RAL link to the UK academic network will be upgraded tomorrow (AT RISK).
    • RAL UPS test on Wednesday (AT RISK).
  • PIC: a CVMFS problem affecting LHCb was reported. Now stable.
  • NL-T1: some tape problems last week caused by lots of incoming data, all is fine now.
  • KIT: next Thursday is a national holiday in Germany. ALARM tickets will get the normal response; less urgent tickets will have to wait until Friday.
  • CERN:
    • 2 new CEs are available (301 and 302, both pointing to SLC6 resources) but are not yet used
    • more worker nodes are being deployed at the Wigner Centre. ATLAS would like a dedicated discussion on this; it will take place next Monday in the ATLAS-IT monthly meeting.
  • CERN CVMFS
    • During the weekend the CVMFS stratum 1 service at CERN suffered from a full partition.
    • CVMFS repositories mirrored only at CERN became completely unavailable, i.e. the non-production /cvmfs/lhcb-conddb.cern.ch.
    • For other repositories, e.g. /cvmfs/atlas.cern.ch and /cvmfs/cms.cern.ch, all clients, including those at CERN, will by now have switched away from CERN to alternate Stratum 1 servers (see the failover sketch after this round table).
    • Current situation at 13:00 on Monday:
      • All cvmfs repositories except for /cvmfs/atlas.cern.ch are again served correctly from cvmfs-stratum-one.cern.ch.
      • /cvmfs/atlas.cern.ch is currently not available from CERN.
    • With the exception of /cvmfs/lhcb-conddb.cern.ch, the incident is transparent to all users, both readers and writers of CVMFS.
  • Dashboards and Monitoring: the MyWLCG and SUM interfaces will be down tomorrow from 7:00 to 15:00 UTC for their upgrade, so there will be no interface to check the results of SAM tests. Metric results will be queued and, after the release, availability will be recalculated. No data will be lost, but there will be a delay.
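
A minimal sketch of the client-side failover described in the CERN CVMFS item above, assuming illustrative replica URLs (only cvmfs-stratum-one.cern.ch is taken from the report; the alternate hosts and the probing logic are assumptions). Every CVMFS repository publishes its signed manifest as .cvmfspublished, so probing that file is one way to tell which Stratum 1 still serves a repository:

    # Probe a list of Stratum 1 replicas and return the first one that still
    # serves the repository manifest; this mimics, in spirit, how clients
    # switched away from CERN during the incident.
    import urllib.request

    REPLICAS = [
        "http://cvmfs-stratum-one.cern.ch/cvmfs",  # CERN (hit by the full partition)
        "http://stratum1.example.org/cvmfs",       # hypothetical alternate Stratum 1
        "http://stratum1.example.net/cvmfs",       # hypothetical alternate Stratum 1
    ]

    def first_healthy(repo, timeout=5.0):
        """Return the base URL of the first replica serving 'repo', else None."""
        for base in REPLICAS:
            url = "%s/%s/.cvmfspublished" % (base, repo)
            try:
                with urllib.request.urlopen(url, timeout=timeout) as r:
                    if r.status == 200:
                        return base
            except OSError:
                continue  # unreachable or erroring replica: try the next one
        return None

    print(first_healthy("atlas.cern.ch"))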

AOB:

Thursday

Attendance:

  • local: Simone (SCOD), Xavi (CERN-DSS), Maarten (ALICE), Alessandro (ATLAS), Sang-Un (KISTI), Yeo (KISTI), Park (KISTI), Ulrich (CERN-PES), Maria (GGUS)
  • remote: Vladimir (LHCb), Lisa (FNAL), David (CMS), Kyle (OSG), Rolf (IN2P3), Gareth (RAL), Thomas (NDGF), Ronald (NL-T1), Wei-Jen(ASGC), Pepe (PIC)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • the BNL VOMS server had to be switched off: its host certificate is now provided by the new (DigiCert) CA, which results in a different DN for the VOMS server (/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=vo.racf.bnl.gov). This DN is hardcoded in a file at the EGI sites (see the sketch after this report); a broadcast is being prepared to have this file updated.
    • Tier0/1s
      • IN2P3-CC: storage issue GGUS:97685; this was a hardware problem, now solved.
      • RAL-LCG2: all transfers failing for ~1h early in the morning of 3 October (GGUS:97731); as reported in the ticket, RAL had a network issue 02:45 - 03:30 UTC.
      • INFN-T1: GGUS:97687, a few transfers failing between the UAM site and INFN-T1, with problems contacting the INFN-T1 SRM. Not understood yet.
      • TRIUMF: GGUS:97698, not sure yet whether this is related to the BNL VOMS issue, but most probably it is. If the site can confirm, the ticket can be closed.
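
A minimal sketch of the check implied by the VOMS item above, assuming the conventional vomsdir layout on EGI nodes (the .lsc path and VO name are assumptions for illustration; only the new DN comes from the report). An .lsc file lists the server DN on its first line, followed by the issuer (CA) DN, so the requested update amounts to replacing that first line:

    # Check whether the local .lsc trust file for the BNL VOMS server already
    # carries the new DigiCert DN; the path below is an assumed example.
    from pathlib import Path

    NEW_DN = ("/DC=com/DC=DigiCert-Grid/O=Open Science Grid"
              "/OU=Services/CN=vo.racf.bnl.gov")
    lsc = Path("/etc/grid-security/vomsdir/atlas/vo.racf.bnl.gov.lsc")  # assumed path

    lines = lsc.read_text().splitlines() if lsc.exists() else []
    if lines and lines[0] == NEW_DN:
        print("lsc file already updated for the new host certificate")
    else:
        print("lsc file missing or still carrying the old DN; update needed")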

  • CMS reports (raw view) -
    • Continuing to chew through the legacy re-reco of the 2011 dataset; otherwise quiet on most fronts. GGUS activity of the last few days:
      • GGUS:97714 WMS configuration at DESY was causing CERN WMS nodes to overload; it was an old configuration, now solved.
      • GGUS:97705 User experiencing a mapping (?) issue at Bari; ongoing.
      • GGUS:97677 CMS production glideins becoming held due to proxy-to-local-user delegation at KIT.
      • GGUS:97732 Not being able to override the default 5.5-hour timeout for CREAM CE sites causes them to fail CMS SAM tests more often than they probably should.
        • Maarten: ALICE implemented its own version of the probe to overcome the hardcoded timeout (a sketch of the approach follows this report). The same will be suggested to CMS.
          • After the meeting: the CMS problem is different; it used to work fine.
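
A minimal sketch of the workaround Maarten describes, assuming a Nagios-style probe wrapper (the command-line interface and the default value are assumptions; the hardcoded 5.5-hour limit is from the report). The VO ships its own probe that enforces its own timeout and maps the outcome to the usual Nagios exit codes:

    #!/usr/bin/env python3
    # Run the real test command under a VO-chosen timeout instead of the
    # framework's hardcoded 5.5 hours; exit with Nagios-style status codes.
    import argparse
    import subprocess
    import sys

    OK, CRITICAL = 0, 2  # standard Nagios exit codes

    def main():
        p = argparse.ArgumentParser(description="VO probe with its own timeout")
        p.add_argument("command", nargs="+", help="the actual test to run")
        p.add_argument("--timeout", type=float, default=3600.0,
                       help="seconds before the probe gives up (VO-chosen)")
        args = p.parse_args()
        try:
            res = subprocess.run(args.command, timeout=args.timeout)
        except subprocess.TimeoutExpired:
            print("CRITICAL: test exceeded %.0fs" % args.timeout)
            return CRITICAL
        if res.returncode == 0:
            print("OK: test finished")
            return OK
        print("CRITICAL: test failed with code %d" % res.returncode)
        return CRITICAL

    if __name__ == "__main__":
        sys.exit(main())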

  • ALICE -
    • CVMFS
      • T0/T1: CNAF and IN2P3 switched (joining KIT and RAL)
      • T2/T3: 23 sites
    • NDGF: the dCache file name mapping turned out to be broken due to a misconfiguration a few weeks ago; some renaming remains to be done, after which the missing files should be available again

  • LHCb reports (raw view) -
    • Main activities are MC productions.
      • The fall incremental stripping campaign will be launched this week (see also the WLCG Operations Coordination meeting, 19 Sep).
    • T0:
    • T1:
      • GRIDKA: staging stress test finished successfully
    • Discussion: CERN will reply to GGUS:97736 concerning ce202 failing all jobs. If LHCb have the perception that not enough jobs are being executed at CERN, they will open a different ticket.

Sites / Services round table:

  • KISTI is currently in a scheduled 2-week downtime, but a problem with the delivery of Cisco switches will extend it by 10 extra days.
  • OSG: need to understand why the MyWLCG downtime on Tuesday was not sufficiently propagated to OSG. Also, the RSV -> SAM uploader will be upgraded next Tuesday, October 8.
  • RAL: a network outage cut the site's connectivity to the outside for 45 minutes this morning. At 7:00 in the morning a router got into trouble (because of the other problem). All problems should be solved now. The batch farm has been upgraded to SL6 as planned.
  • PIC: over the last day there have been some internal network issues (packet loss; some services might be affected).
  • CERN: a problem was discovered in the configuration of newly created VMs (solved). WNs are being upgraded to EMI-3 on SL6; 100 nodes (800 job slots) are in Wigner.

  • GGUS: the September GGUS release has been merged with the October one, on 23 October.

  • Storage Services: CASTOR v14 rollout. The intervention is transparent for the central services and the nameserver, but a downtime of ~5h is required for the stagers. Tentative schedule:
    • 7/10/2013 10h to 12h Nameserver and Central Services. Transparent intervention.
    • 14/10/2013 (tentative) 9h to 14h CASTORATLAS. Intervention not transparent. Downtime required.
    • 21/10/2013 (tentative) 9h to 14h CASTORALICE. Intervention not transparent. Downtime required.
    • 22/10/2013 (tentative) 9h to 14h CASTORCMS and CASTORLHCb. Intervention not transparent. Downtime required.

AOB:
