Week of 121001

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here.

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local (AndreaV, Luc, Ian, Jan, Maarten, Luca, Ivan, MariaD); remote (Michael/BNL, Rolf/IN2P3, Saerda/NDGF, Pavel/KIT, Jhen-Wei/ASGC, John/RAL, Jeremy/GridPP, Rob/OSG, Onno/NLT1; Vladimir/LHCb).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • NTR
      • [Luca: DB replication latency has increased; we are investigating the root causes. The backlog has been reabsorbed in any case.]
    • T1
      • RAL: failing jobs because of put errors; also transfer errors, including T0 export, on Sunday. Due to a DB problem. (GGUS:86552) (GGUS:86541)
      • BNL-ASGC transfer errors. Both ends are investigating. (GGUS:86537) [Michael: the issue seems to be due to too many transfers queueing up at the receiving end, causing SRM to stop responding. Jhen-Wei changed the ASGC config for the transfer channels to fix the issue.]

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • A number of glidein issues were reported last night, with jobs not being scheduled. This did not necessarily appear to be a site issue.

  • ALICE reports -
    • CERN: central AliEn services have been down for upgrades since 10:00 CEST; the work should finish later this afternoon.

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • Simulation at T2s
    • Reprocessing at T1s and selected T2s
    • T0:
      • CERN : cleaning TMPDIR on lxbatch (GGUS:86039 )
      • CERN : problem with the protocols supported by EOS (GGUS:86226): only gsiftp is supported, which prevents access to the data from our framework. [Jan: received a new version of the software and will try that out. There is also a workaround in place. Maarten: the workaround is working but is quite ugly. Vladimir: yes, the workaround is working ok.]
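      • A minimal sketch of the gsiftp-only access mentioned above, assuming a valid grid proxy, the globus-url-copy client, and a purely hypothetical endpoint and path (not the actual LHCb configuration):

          #!/usr/bin/env python
          # Hedged sketch: fetch a file from EOS over gsiftp as an interim workaround
          # while other protocols are unavailable (GGUS:86226). Endpoint and path are
          # placeholders; assumes a valid grid proxy and the globus-url-copy client.
          import subprocess

          SRC = "gsiftp://eos.example.cern.ch//eos/lhcb/some/file.dst"  # hypothetical
          DST = "file:///tmp/file.dst"

          def gsiftp_copy(src, dst):
              """Copy src to dst with globus-url-copy, raising on failure."""
              subprocess.check_call(["globus-url-copy", src, dst])

          if __name__ == "__main__":
              gsiftp_copy(SRC, DST)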

Sites / Services round table:

  • Michael/BNL: ntr
  • Rolf/IN2P3: ntr
  • Saerda/NDGF: a couple of issues
    • a cluster in Oslo is down because it is being moved to another location
    • some dCache pools are in maintenance
  • Pavel/KIT: ntr
  • Jhen-Wei/ASGC: as Michael mentioned, reduced the number of transfer channels from BNL to fix the issue. If this looks ok, will keep this config. Will continue discussing offline with Michael.
  • John/RAL: there has been a problem with the DB behind the ATLAS SRM, still being investigated
  • Jeremy/GridPP: ntr
  • Rob/OSG: ntr
    • but still had trouble connecting via Alcatel (the usual issue: was put in an empty chat room and was not called back). [Maarten: the ticket with Alcatel has been closed because they claim that these issues will be fixed in the November release. Another list of workarounds has also been published (in French for the moment); it will be sent around.]
    • [Maria: can you please also check in your email the questions about the DOE CA migration from ESNET to OSG? Rob: yes, we'll discuss this offline.]
  • Onno/NLT1: ntr
  • Saverio/CNAF (via email): ntr

  • Luca/Databases: ntr
  • Ivan/Dashboard: ntr
  • Jan/Storage: several Castor updates upcoming next week, details will be circulated

AOB:

  • GGUS: (MariaDZ)
    • The GGUS test ALARMs to CERN were repeated this morning. All notifications worked correctly. The SNOW-to-GGUS update propagation still didn't work. Comments in Savannah:131998#comment16 and in GGUS:86556, GGUS:86557, GGUS:86558, GGUS:86559. Supporters from the experiments can put comments in the GGUS diary but are kindly asked not to close the tickets, as we wish this to happen from the SNOW side.
    • The following text was sent in individual GGUS tickets to all Tier0/1 centres by G.Grein: GGUS is going to replace the current certificate used for signing alarm notifications by a SHA-2 certificate. The certificate DN is /C=DE/O=GermanGrid/OU=KIT/CN=ggusmail/ggus.eu and will not change. This renewal is scheduled for the next GGUS release on Wednesday, 24th of October. If you expect any problems using a SHA-2 certificate please let us know in time. Example GGUS:86351. [Maarten: this is a potentially disruptive change; if it does not work, we'll need to roll it back, so we should test beforehand. Ian: what is more generally the timescale for this? Had understood from the WLCG OPS meeting that this would be centrally coordinated, rather than having services move to SHA-2 one by one whenever they want. MariaD: yes, we'll follow this up in the WLCG OPS meetings.]
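    • Since the impact of the SHA-2 change depends on whether local tooling accepts SHA-2 signatures, a quick check of a certificate's signature algorithm may help when testing beforehand. A minimal sketch, assuming the openssl CLI is available and using a placeholder file name:

        #!/usr/bin/env python
        # Hedged sketch: report the signature algorithm of an X.509 certificate,
        # e.g. to see whether a signing certificate is SHA-1 or SHA-2 based.
        # "signing-cert.pem" is a placeholder; assumes the openssl CLI.
        import subprocess

        def signature_algorithm(pem_file):
            """Return the 'Signature Algorithm' lines reported by openssl."""
            text = subprocess.check_output(
                ["openssl", "x509", "-in", pem_file, "-noout", "-text"]).decode()
            return [l.strip() for l in text.splitlines() if "Signature Algorithm" in l]

        if __name__ == "__main__":
            for line in signature_algorithm("signing-cert.pem"):
                print(line)  # e.g. "Signature Algorithm: sha256WithRSAEncryption" for SHA-2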

Tuesday

Attendance: local (AndreaV, Luc, Maarten, Ian, Ivan, Jan, Ulrich, LucaC); remote (Saerda/NDGF, Michael/BNL, Gonzalo/PIC, Jhen-Wei/ASGC, Lisa/FNAL, Tiju/RAL, Rolf/IN2P3, Rob/OSG, Ronald/NLT1, Dimitri/KIT; Vladimir/LHCb).

Experiments round table:

  • ATLAS reports -
    • T0/WLCG
      • New Castor libraries and problems with ATLAS software (INC:172716)
    • T1
      • RAL transfer issues fixed (GGUS:86552). Known bug in Oracle
      • BNL-ASGC transfer errors solved (GGUS:86537). Due to BNL cyber security. [Michael: cybersecurity is monitoring all transfers inside and outside the campus and keeping a history. ASGC transfers had a peak that was reported as suspiciously high activity, so one individual node was blocked. Another side effect is that this information was shared with all DOE institutions, so SLAC also blocked the same node. The changes made yesterday by Jhen-Wei can remain if useful, but they are not relevant to fixing this issue. Jhen-Wei: will keep the transfer config reported yesterday in any case. We still have issues with SLAC that are being followed up.]

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • cmsweb scheduled upgrade today. Resulted in some job resubmission at Tier-0
    • Tier-1/2:
      • NTR

  • ALICE reports -
    • CERN: yesterday's AliEn central services upgrade took until very late in the evening; there still is some fallout (failing user jobs) being investigated
    • [Maarten: a second issue is that jobs at CERN have been seen to access a lot of data at Lyon over the LHCONE network (the servers being accessed are on the Tier-2 side), instead of accessing the locally available data at CERN. Jan: this may be related to the Castor to EOS migration, please look at that possibility. Maarten: thanks, will follow it up offline with colleagues.]

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • Simulation at T2s
    • Reprocessing at T1s and selected T2s
    • T0:
      • CERN : cleaning TMPDIR on lxbatch (GGUS:86039 )
      • CERN : problem with the protocols supported by EOS (GGUS:86226): only gsiftp is supported, which prevents access to the data from our framework. [Jan: plan to do some software updates after the meeting, which will also require a 5 minute interruption, is this ok? Vladimir: yes, please go ahead.]
    • T1:
      • RAL: problem with staging; under investigation [Tiju: issue is still not understood and is being investigated]

Sites / Services round table:

  • Saerda/NDGF: ntr
  • Michael/BNL: nta
  • Gonzalo/PIC: ntr
  • Jhen-Wei/ASGC: nta
  • Lisa/FNAL: ntr
  • Tiju/RAL: lost 4 LHCb files from a faulty tape, will communicate the details offline
  • Rolf/IN2P3: ntr
  • Rob/OSG: ntr (was able to connect today by using a different sequence of steps in the Alcatel interface)
  • Ronald/NLT1: ntr
  • Dimitri/KIT: having issues with one cluster, vendor intervention was requested

  • Ulrich/Grid: ongoing upgrades of CEs, will continue in the coming weeks
  • Ivan/Dashboard: ntr
  • Jan/Storage: nta
  • Luca/Database: the high latency in online-to-offline replication was fixed this morning by some structural changes; the situation seems better now

  • GGUS: (MariaDZ)
    • The GGUS test ALARMs to CERN will not be repeated, as the SNOW-to-GGUS update problem was traced to wrong tags used by SNOW to contact the new GGUS web service. Unfortunately, we still see propagation problems in this interface. As they are linked to the new GGUS SOAP interface, we track this issue via the dedicated ticket Savannah:127763
    • Experiments/sites with GGUS tickets that do not receive adequate support are asked to submit them to MariaDZ for presentation at the WLCG Operations meeting this Thursday. Agenda: https://indico.cern.ch/conferenceDisplay.py?confId=211021

AOB: ntr

Wednesday

Attendance: local(AndreaS, Luc/ATLAS, Ivan, Jan, Ulrich, MariaD, Eva, Maarten); remote(Vladimir/LHCb, Michael/BNL, Lisa/FNAL, Gonzalo/PIC, Tiju/RAL, Jhen-Wei/ASGC, Rolf/IN2P3, Ian/CMS, Ron/NL-T1, Rob/OSG, Saerda/NDGF, Salvatore/CNAF).

Experiments round table:

  • ATLAS reports -
    • T0/WLCG
      • New Castor libraries and problems with ATLAS software (INC:172716): ongoing
      • PVSS DCS replication is OK
      • BNL proxy affecting T0 export because PanDA & DDM use an old gridsite version. Upgrade ongoing. Maarten: the BNL VOMS server certificate was updated last Monday evening, but the PanDA nodes still have an old version of the gridsite library that requires the VOMS server certificate. Updating gridsite will prevent this problem from happening again. (A minimal certificate check is sketched after this report.)
    • T1
      • TRIUMF transfer failures GGUS:86551 due to network problems. Ongoing
      • NDGF-T1 LRMS errors (Walltime limit). Cf. GGUS:85144
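    • Related to the BNL VOMS server certificate item above, a minimal sketch of fetching the certificate presented by a server and printing its validity dates; the endpoint is a hypothetical placeholder, and a production VOMS server may require a GSI-aware client instead:

        #!/usr/bin/env python
        # Hedged sketch: fetch the certificate presented by a (hypothetical) server
        # and print its validity window, to spot recently renewed certificates.
        # Assumes Python 3 and the openssl CLI for parsing.
        import ssl
        import subprocess

        HOST, PORT = "voms.example.org", 15004  # hypothetical endpoint

        def cert_validity(host, port):
            """Return the notBefore/notAfter lines of the server certificate."""
            pem = ssl.get_server_certificate((host, port))
            out = subprocess.check_output(["openssl", "x509", "-noout", "-dates"],
                                          input=pem.encode())
            return out.decode()

        if __name__ == "__main__":
            print(cert_validity(HOST, PORT))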

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Jobs in cool-off because resubmission in the Tier-0 test infrastructure takes too long. Not clear whether the cause is LSF or the test infrastructure
    • Tier-1/2:
      • T2_Pakistan seems to be bypassing its local squid, with WNs hitting CERN directly. Investigating with experts; a minimal proxy-vs-direct check is sketched below.
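      • One way to verify whether worker nodes go through the site squid or hit the origin directly is to issue the same request with and without the proxy; a minimal sketch follows, in which both the squid host and the target URL are hypothetical placeholders (not the actual T2 or CERN configuration):

          #!/usr/bin/env python
          # Hedged sketch: compare direct vs. proxied access to an HTTP service,
          # to see which path a worker node actually uses. The squid host and the
          # URL below are placeholders, not real endpoints.
          import urllib.request

          SQUID = "http://squid.example.site:3128"         # hypothetical site squid
          URL = "http://conditions.example.cern.ch/query"  # hypothetical target

          def status(url, proxy=None):
              """Fetch url, optionally through an HTTP proxy; return the HTTP status."""
              handlers = [urllib.request.ProxyHandler({"http": proxy})] if proxy else []
              opener = urllib.request.build_opener(*handlers)
              return opener.open(url, timeout=30).getcode()

          if __name__ == "__main__":
              print("direct   :", status(URL))
              print("via squid:", status(URL, SQUID))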

  • ALICE reports -
    • CERN: the last minor issues with the AliEn central services upgrade were fixed yesterday; remaining problems reported by users were typically due to usage of old clients (now unsupported) or unrelated mistakes
    • IN2P3: yesterday's LHCONE overload, caused by jobs at CERN reading data at IN2P3 (and also CNAF), was an unexpected side effect of demoting the CERN ALICE_DISK SE for writing; this was fixed yesterday early in the evening

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • Simulation at T2s
    • Reprocessing at T1s and selected T2s
    • New GGUS (or RT) tickets
    • T0:
      • CERN : cleaning TMPDIR on lxbatch (GGUS:86039 )
      • CERN : problem with the protocols supported by EOS (GGUS:86226): only gsiftp is supported, which prevents access to the data from our framework. Fixed after a few attempts, but now there is a huge backlog to absorb.
    • T1:
      • RAL: problem with staging; Fixed
      • SARA: files missing from SARA's storage (GGUS:86635)

  • LHCb: stop replication of the ConditionDB and make the LFC read-only. Maarten: actually the LFC replication has not yet been stopped.
    Vladimir confirms that the LFC replication can be stopped as of now and that the Tier-1 sites can decommission the LHCb LFC servers.

Sites / Services round table:

  • ASGC: the previously reported transfer problem with SLAC was finally solved by removing the block on the Taiwan FTS node at SLAC.
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • NL-T1: ntr
  • PIC: ntr
  • RAL: ntr
  • OSG: we received two GGUS tickets (GGUS:86571, GGUS:86572) about GGUS upgrading to a SHA-2 certificate this month, and we are concerned because the transition is planned for 2013, not now; first we should run tests on our integration testbed. MariaD: suggest to close the tickets saying so and to discuss this at tomorrow's WLCG operations coordination meeting. Maarten: in this case the change should be very simple and concerns only alarm notifications; it is not clear why this is urgent for GGUS.
  • CERN batch: since yesterday around 22:00 CERN has been failing all ops SAM glexec tests. Maarten: CMS saw ARGUS errors, it might be related. (A minimal manual glexec check is sketched after this list.)
  • CERN storage: ntr
  • Dashboards: ntr
  • Databases: ntr
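  • A minimal manual check of glexec, related to the CERN batch item above: a hedged sketch of roughly what the SAM probe exercises, assuming glexec is installed at /usr/sbin/glexec, Argus/LCMAPS is configured on the node, and using a placeholder path for the payload proxy:

      #!/usr/bin/env python
      # Hedged sketch: run "id" through glexec under the identity mapped from a
      # payload proxy, roughly as the SAM glexec probe does. The proxy path is a
      # placeholder; assumes glexec at /usr/sbin/glexec and a configured Argus/LCMAPS.
      import os
      import subprocess

      PAYLOAD_PROXY = "/tmp/x509up_payload"  # placeholder path

      def glexec_id(payload_proxy):
          """Invoke 'id' via glexec with the given payload proxy; return its output."""
          env = dict(os.environ, GLEXEC_CLIENT_CERT=payload_proxy)
          return subprocess.check_output(["/usr/sbin/glexec", "/usr/bin/id"], env=env)

      if __name__ == "__main__":
          print(glexec_id(PAYLOAD_PROXY).decode())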

AOB:

Thursday

Attendance: local (AndreaV, Luc, Jan, Ivan, Ulrich, Stephen, Ian, Eva, MariaD, Maarten, MariaG, Alexei); remote (Michael/BNL, Lisa/FNAL, Saerda/NDGF, Jhen-Wei/ASGC, John/RAL, Rob/OSG, Ronald/NLT1, Rolf/IN2P3; Vladimir/LHCb).

Experiments round table:

  • ATLAS reports -
    • T0/WLCG
      • New Castor libraries and problems with ATLAS software (INC:172716): solved
      • Latest gridsite to be deployed on the Central Catalog machines (sr#132631). It was also deployed on the Site Services, which broke (D. Tuckett); rolled back to the old version, OK now.
    • T1
      • SARA_MCTAPE full (10 TB). An extra 10 TB was requested

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • Still reporting problems at T1_TW_ASGC with custodial data transfers of RAW and RECO data from CERN to ASGC, impacting our ability to even make skims. Under investigation by local experts, but this is a persistent problem. [Jhen-Wei: the CMS membership of the Taiwan colleague owning the PhEDEx agent was removed and is being renewed. Ian: please send us the name of this colleague so the membership can be renewed immediately; this is urgent and should not go through the standard channels such as the secretariat.]

  • ALICE reports -
    • CERN: EOS-ALICE crashed yesterday around 16:45 CEST. Taking advantage of the downtime, the affected host was replaced by a machine with more memory and EOS-ALICE was available again around 17:30. After midnight the new setup was found to be failing requests quite often; this was debugged by IT-DSS and fixed around 03:00 - thanks for that prompt effort! [Jan: the machine had actually been down since 10:30. The issue is quite complex but is now completely understood. It only affects ALICE, not the other instances. On Monday there will be an EOS upgrade that will also address some issues related to this. An internal report is being prepared. AndreaV: please publish this report as a SIR.]

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • Simulation at T2s
    • Reprocessing at T1s and selected T2s
    • T0:
    • T1:
      • SARA: files missing from SARA's storage (GGUS:86635)
      • NIKHEF: CVMFS problem (GGUS:86722) [Ronald: the issue is understood; a new config had already been deployed for ATLAS because the CVMFS cache was too small, and now that the same problem has appeared for LHCb we will deploy the same solution as for ATLAS.]
      • GRIDKA: Input data resolution problem (GGUS:86720)

Sites / Services round table:

  • Michael/BNL: ntr
  • Lisa/FNAL: ntr
  • Saerda/NDGF: ntr
  • Jhen-Wei/ASGC: nta
  • John/RAL: ntr
  • Rob/OSG: we had some problems with the GGUS ticket exchange, due to changes in GGUS that we were not aware of, and lost a couple of tickets; we made a few adjustments and resent those tickets. [MariaD: I was not in this thread, please copy me.]
  • Ronald/NLT1: nta
    • [MariaD: are there two people who want to participate in the Hadoop course or not? Ronald: will follow up. Alexei: note that some courses are already full.]
  • Rolf/IN2P3: two problems being followed up
    • high load on ATLAS AMI DB, for reasons not yet understood
    • ALICE high load is still continuing. [Maarten: this is still being investigated, we are in contact with Renaud]

  • Ivan/Dashboard: ntr
  • Eva/Databases: transparent intervention next Tue on storage of some production DB (ATLAS and LHCb offline, PDBR, downstream capture)
  • Ulrich/Grid: ntr
  • Jan/Storage
    • CERN CASTOR - list of updates scheduled for next week (reminder, already on GocDB etc), all to 2.1.13-5:
      CASTORCMS down Mon 14:00-16:00
      CASTORALICE down Tue 10:00-12:00
      CASTORLHCB at risk Tue 14:00-15:00
      CASTORATLAS down Thu 14:00-16:00
    • CERN EOS - list of updates proposed for next week (all to EOS-0.2.18 + xrootd-3.2.5)
      EOSLHCB down Mon 11:00-12:00
      EOSALICE down Mon 14:00-16:00
      EOSATLAS down Wed 10:00-12:00
      EOSCMS down Thu 11:00-12:00

  • GGUS: (MariaDZ)
    • The SNOW-to-GGUS update propagation problem was identified as a number of fields being returned with NULL values. This is not related to the new GGUS SOAP web interface of the last release. Details in Savannah:120423#comment38
    • GGUS postponed the move to a SHA-2 certificate due to IGTF's decision to wait for another year. Details in Savannah:132379#comment9

AOB: There will be a WLCG Operations Coordination meeting at 3.30pm, https://indico.cern.ch/conferenceDisplay.py?confId=211021

Friday

Attendance: local(Eva, Ian, Jan, Luc, Maarten, Mike, Ulrich);remote(Dimitri, Jhen-Wei, John, Lisa, Michael, Onno, Rob, Rolf, Vladimir, Zeeshan).

Experiments round table:

  • ATLAS reports -
    • T0/WLCG
      • ALARM GGUS:86788 & INC:175534 LSF bsub time too big
      • CERN GGUS:86778 & INC:175450 Get error. Problem with the CERN-PROD_TMPDISK token used to store ESD before migration to tape. Staging should be requested (under verification)
      • ATLAS-AMI: better monitoring, and make sure that job requests from T0 use valid credentials.
      • FTS bug: RAL has rolled out the fix
        • Maarten: RAL have just discovered a similar problem after 4 days of running with the patch (GGUS:86775); the developers are aware of it
      • GGUS: channel for interacting with experiments?
        • Maarten: Maria D will keep using this meeting to ask the experiments if there are significant tickets that are not getting enough attention; such tickets must be about issues that are a major nuisance for an experiment
    • T1
      • NDGF-T1 GGUS:86770 Functional test transfers to DATADISK failing. FTS overwrite option issue?
        • Onno: we had the same problem at SARA - after a first write attempt failed, dCache would clean up the corresponding space reservation only a few hours later and further write attempts would fail in the meantime; the dCache developers suggested that the FTS should be initiating the cleanup before trying again
        • Maarten: the overwrite logic is tricky because there can be other scenarios in which a valid file already exists, in which case it certainly must not be deleted by the FTS! The logic is conservative on purpose, delegating potentially dangerous decisions to a higher level, viz. the experiment frameworks. I will ask the developers to look into this case.
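        • To make the conservative logic concrete, a hedged pseudocode sketch of the decision described here (not the actual FTS implementation):

            # Hedged sketch of the conservative overwrite decision discussed above,
            # not the actual FTS code: an existing destination file is never deleted
            # implicitly, because only the experiment framework can tell a valid
            # replica from the debris of an earlier failed attempt.

            def handle_destination(dst_exists, overwrite_requested, cleanup):
                """Decide what to do before (re)trying a transfer to the destination."""
                if not dst_exists:
                    return "transfer"            # nothing in the way
                if overwrite_requested:
                    cleanup()                    # explicit instruction from a higher level
                    return "transfer"
                # Default, conservative path: leave the existing file (or a stale
                # space reservation, as in the dCache case above) for the experiment
                # framework or the storage system to clean up; this attempt fails.
                return "fail: destination exists"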

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Filled Tier-0 job slots with the combination of production and test Tier-0 systems
    • Tier-1/2:
      • Tried to help with the situation around Felix's grid certificate. The problem seems to be at the Users Office.
        • Jhen-Wei: for now we are using the certificate of a colleague who is in CMS already, but does not have the PhEDEx role yet, which causes mapping issues and transfer failures on some channels
        • Ian: we can easily enable such roles for any certificate, let's deal with it offline

  • ALICE reports -
    • IN2P3: remaining high LHCONE traffic is due to files that were moved to EOS and deleted from CASTOR, while the catalog was not yet updated - the jobs fail over to the next nearest replica when the access at CERN fails; the catalog will be updated next week
      • Rolf: the traffic went down already
      • Maarten: today there was little job activity

Sites / Services round table:

  • ASGC - nta
  • BNL - ntr
  • FNAL - ntr
  • IN2P3 - nta
  • KIT - ntr
  • NDGF
    • looking into the ATLAS FTS issue
  • NLT1
    • today there was maintenance on the SARA MSS to fix 2 broken robot arms in the tape library, finished at 14:00 CEST; processing of the backlog should take another hour
    • LHCb ticket about NIKHEF CVMFS: fixed
    • looking into the missing LHCb files at SARA
  • OSG - ntr
  • RAL
    • scheduled outage for CASTOR-LHCb on Tue

  • dashboards - ntr
  • databases - ntr
  • grid services
    • looking into the LSF problem reported by ATLAS
  • storage
    • for next week's EOS updates we did not yet receive an OK from ATLAS and LHCb
      • Luc: OK for ATLAS
      • Vladimir: OK for LHCb
    • an LHCb user asked if we could restore 180 lost files; we asked the user to get in touch with the LHCb data management team first

AOB:

-- JamieShiers - 18-Sep-2012
