Week of 130729

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: AndreaV/SCOD, Eddie/Dashboard, Zbyszek/Databases, Belinda/Storage, Maarten/ALICE
  • remote: Sang-Un/KISTI, Michael/BNL, Saverio/CNAF, Brian/RAL, Wei-Jen/ASGC, Tiju/RAL, Lisa/FNAL, Pavel/KIT, Rob/OSG, Pepe/PIC, IN2P3/IN2P3, Federico/LHCb, Stephane/ATLAS, Stefano/CMS

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • NTR
    • T1
      • RAL FTS3: All UK sites are now managed by RAL FTS3, in addition to DE and a few other sites. If the hardware limit is not reached, it is planned to include FR sites by the middle of this week. [Brian: working on RAL FTS3 also for CMS, will follow up issues and plans with ATLAS after the meeting]

  • CMS reports (raw view) -
    • Continuing 2011 legacy rereco activity and some Upgrade MC generation, everything pretty quiet
    • https://cern.service-now.com/service-portal/view-incident.do?n=INC347643 ARGUS service at CERN does not accept GermanGrid CA (but CERN CEs do!). Opened Friday, no reply yet. Should we file a GGUS ticket instead? [Maarten: the fact that CEs accept a certificate and Argus does not is strange, since CEs accept certificates via Argus. Maarten: GGUS is normally faster than SNOW because it ensures that more people see a ticket than just the specific team assigned to a SNOW category. AndreaV: will follow up with IT-PES as the SNOW ticket is still not answered.]
    • GGUS:96212 to T1_UK_RAL slow disk-tape migration, issue with FTS3 under investigation
    • GGUS:96178 to T1_US_FNAL file access error, likely transient, in CMS hands to retry

  • ALICE -
    • KIT: low efficiency since coming out of maintenance; seems fixed now after increasing the TCP connection backlog on the Xrootd redirectors around noon

  • LHCb reports (raw view) -
    • Mostly MC productions ongoing, tail of reprocessing and restripping campaign
    • T0:
      • CERN: NTR
    • T1:
      • GRIDKA: still problems with SEs (GGUS:95135). [Pavel/KIT: there were problems with the LHCb SRM on the /var partition after the downtime. Xavier is looking at the LHCb ticket, the issue was solved one hour ago. Federico: the issue does not seem solved yet. Pavel: will follow up with Xavier and ask him to provide all relevant info in the ticket.]

Sites / Services round table:

  • Sang-Un/KISTI: ntr
  • Michael/BNL: ntr
  • Saverio/CNAF: ntr
  • Wei-Jen/ASGC: there will be two network interventions in August following a change of network provider contract; both are declared in GOCDB, and bandwidth will be 20 Gbps afterwards
  • Tiju/RAL: reminder, CASTOR upgrade for ALICE tomorrow
  • Lisa/FNAL: ntr
  • Pavel/KIT: upgrade completed successfully on many systems during the downtime (apart from issues with LHCb and ALICE discussed today)
    • firewall upgrade, big improvement in throughput has been seen
    • dCache upgrade with SHA2 compatibility, experiments were informed of some configuration changes
    • FTS update
  • Rob/OSG: ntr
  • Pepe/PIC: ntr
  • IN2P3/IN2P3: ntr

  • Eddie/Dashboard: ntr
  • Zbyszek/Databases: there will be an intervention on Thursday from 2.30pm to 6.30pm on the storage behind ATLAS offline/online/ADC and LHCb offline databases, to address the issues observed last week, which according to the vendor were caused by problems on the motherboard of one controller
  • Belinda/Storage: ntr
  • Maarten/GGUS: web service refused CERN certificates on Sun due to expired CRL, fixed Mon morning (GGUS:96195). [Pavel/KIT: was on shift this weekend but got no SMS from the alarm system about this issue, will follow up. Maarten: thanks, this is one of two things to follow up, the other being why the CRL expired, which Guenter is looking into.]

AOB: none

Thursday

Attendance:

  • local: AndreaV/SCOD, Maarten/ALICE, Gavin/Grid, Zbyszek/Databases
  • remote: Michael/BNL, Lisa/FNAL, Matteo/CNAF, John/RAL, Wei-Jen/ASGC, Xavier/KIT, Ronald/NLT1, David/IN2P3, Rob/OSG; Stephane/ATLAS, Stefano/CMS, Federico/LHCb

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • NTR
    • T1
      • RAL FTS3 :
        • Monday 29 : Soft upgrade on FTS3 (database access optimisation) : Significant improvement in transfer activity
        • Wednesday 31 : Discovered that setting FTS priority was not effective. Corrected and deployed in RAL FTS3
        • [Stephane: move of French sites to RAL FTS3 discussed on Monday has been postponed]
      • FTS2 TAIWAN and CERN FTS2 for CERN-ASGC: both stuck. Issue raised by a question from ASGC. ASGC restarted FTS2 during the afternoon. For CERN FTS2, only feedback is required (GGUS:96284), but it is fine now
      • INFN-T1 : Broken in late afternoon and recovered during the evening.
      • RAL-LCG2 : SRM issue this morning related to DNS issue. Recovered during the morning.

  • ALICE -
    • KIT: investigating large fluctuations in job efficiency (CPU time over wall-clock time) since coming back out of maintenance on Thu last week; probably related to configuration issues that have affected the local Xrootd SE performance for many months [AndreaV: is this related to the storage issues at KIT reported by LHCb? Maarten: most probably not, unless the issue comes from the network, which is the only correlation]

  • LHCb reports (raw view) -
    • Mostly MC productions ongoing, tail of reprocessing and restripping campaign
    • T0:
      • CERN: NTR
    • T1:
      • GRIDKA: Most problems with SEs (GGUS:95135) are solved

Sites / Services round table:

  • Michael/BNL: ntr
  • Lisa/FNAL: ntr
  • Matteo/CNAF:
    • addition of storage for CMS is ongoing and should finish by August 20, then the other experiments will follow
    • StoRM service for ATLAS was unavailable yesterday for 8h, it has been upgraded to EMI3 at the same time to avoid another downtime
  • John/RAL:
    • Castor upgrade for ALICE was successful
    • there was a problem with a DNS server between 6am and 8am this morning
  • Wei-Jen/ASGC: reminder, scheduled downtimes for network maintenance have been declared in the GOCDB
  • Xavier/KIT: ntr
  • Ronald/NLT1: ntr
  • David/IN2P3:
    • question for ALICE, is it normal that we see no ALICE jobs since yesterday morning? [Maarten: will follow up]
    • question about FTS3, our admin is trying to install this but cannot find the doc about its configuration, where is this? [Maarten: send me or the FTS3 team an email and we'll follow up]
  • Rob/OSG: problems for the past few days with connections between OSG and the APEL accounting system, will create a ticket about this
  • Sang-Un/KISTI [via email]: Torque did not run idle jobs, which prevented new jobs from being submitted by the VOBOX. Torque was made to run the jobs manually, but the reason is not yet known. Investigation ongoing.

  • Zbyszek/Databases: the ongoing 'transparent' interventions turned out to be not completely transparent
    • problems yesterday on the Compass/LCGR database because NAS could not take the full load during the intervention, we manually cut the Compass and Dashboard application load
    • intervention ongoing today for ADCR, will contact ATLAS to also reduce the load there
  • Gavin/Grid:

AOB: none

Topic revision: r10 - 2013-08-02 - AndreaValassi
 