Week of 111219

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site (Cooldown Status, News)


Monday:

Attendance: local (Fernando, Gavin, Jan, Kate, Lukasz, Maarten, Maria D, Massimo, Steve); remote (Burt, Giovanni, Gonzalo, Jhen-Wei, Michael, Onno, Pavel, Rob, Tiju, Ulf).

Experiments round table:

  • ATLAS reports
    • T0
      • CERN-PROD ALARM: LSF batch system down Sat 19:04. T0 processing drained. Unresolved during the weekend despite work with the vendor; work resumes Monday. (GGUS:77547)
        • Gavin: worked with the vendor for many hours during the weekend; intermittent downtimes since Sat; public share back OK on Sun, but dedicated shares still had problems until ~13:00 CET today, and it is not understood what changed; some intermittent downtime for debugging is expected in the next hours
      • CERN-PROD: many EOS transfer failures, DE and UK clouds, 'failed to contact on remote SRM', existing ticket updated (GGUS:77333)
        • Jan/Massimo: service publication thought to be fixed since Fri, all CERN BDIIs look OK
        • Fernando: what about the FTS image of the info system?
        • Jan/Steve: FTS XML file was updated as well
        • Fernando: which FTS would be used to transfer to DESY? The FTS at KIT? Other FTS instances would also need to be updated
        • Jan/Steve: let's follow up in the ticket (a BDII query sketch follows this report)
      • Problem with BDII publishing from OSG to WLCG/CERN BDII -- pending (GGUS:77361)
        • Rob: BDII tickets were not updated with the solution
        • Gavin: will do (done)
    • T1 sites
      • Update to BNL AutoPilot services at 8:00 Sun morning to restore US pilot scheduling, which had been disabled as a result of a PanDA service misconfiguration at 22:45 Fri night (GGUS:77551)
        • Michael: this was not a T1 issue; the T1 staff was asked to handle an unscheduled change in the PanDA system
      • Taiwan scheduled outage for network maintenance extended to Sat morning
      • File Transfers to Taiwan not progressing, Sunday 9:30. FTS Oracle DB problem, resolved 4:00 this morning (GGUS:77552)
      • NDGF-T1 scheduled warning, fiber maintenance, some ATLAS data unavailable. Sun 23:00 - Mon 4:00
      • PIC scheduled outage for dCache upgrade 9:00-12:50 today
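
For reference, the endpoint mismatch behind GGUS:77333 can be cross-checked by querying a BDII directly for the published SRM endpoints. Below is a minimal sketch in Python; it assumes the openldap ldapsearch client is installed, and the host, port and base DN are the conventional WLCG ones rather than values taken from the ticket:

    import subprocess

    # Query a top-level BDII for all published SRM service endpoints.
    # Host/port and the Glue base DN are assumptions (standard WLCG
    # conventions); point this at the BDII actually used by each FTS.
    BDII = "ldap://lcg-bdii.cern.ch:2170"
    BASE = "mds-vo-name=local,o=grid"

    out = subprocess.check_output(
        ["ldapsearch", "-x", "-LLL", "-H", BDII, "-b", BASE,
         "(GlueServiceType=SRM)", "GlueServiceEndpoint"])

    # Print just the endpoint URLs, to compare what each FTS sees.
    for line in out.splitlines():
        if line.startswith(b"GlueServiceEndpoint:"):
            print(line.split(b":", 1)[1].strip().decode())

Running the same query against the information source configured in each FTS instance would show whether all of them have picked up the corrected EOS SRM publication.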

  • CMS reports -
    • LHC / CMS detector
      • Shutdown
    • CERN / central services
      • LSF problems observed over the weekend. There still seem to be issues on the Tier-0 queues. TEAM ticket (GGUS:77563) created on Saturday
    • T0
      • MC: LHE production continues, but seems to be impacted by LSF issues (writes to EOS /store/lhe)
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
    • T2 sites:
      • NTR
    • Other:
      • NTR

Sites / Services round table:

  • ASGC
    • there was an issue with the FTS, but it works OK now
  • BNL
    • HPSS upgrade finished OK on Fri
  • CNAF - ntr
  • FNAL - ntr
  • KIT - ntr
  • NDGF
    • 2 pools in Sweden down due to HW problems, affecting ATLAS data
  • NLT1 - ntr
  • OSG - nta
  • PIC
    • dCache upgrade to 1.9.12-14 went OK, batch jobs were paused, transparent
  • RAL
    • there was a site network outage Sat morning 04:30-05:00

  • CASTOR/EOS
    • CMS SRM DB issue on Sat, no ticket
    • adding space to EOS for ATLAS
  • dashboards - ntr
  • databases
    • SRM DB problem was due to storage HW issue, in contact with NetApp
  • GGUS/SNOW
    • see AOB
  • grid services
    • see ATLAS report
    • VOMS UK CA intervention (see last Fri report) will be done this afternoon

AOB: (MariaDZ) Reminder on support situation during the CERN shutdown:

  1. SNOW (concerns the Tier0 only): CERN is officially closed 2011/12/22-2012/01/04 inclusive.
    • The main route for getting notification of critical problems is GGUS ALARM tickets. This is working and does not need any change.
    • For GGUS TEAM tickets, the members of e-group grid-cern-prod-admins are now in the Outside Working Hours (OWH) group in order to have access to the SNOW instance of the ticket at all times.
      • Massimo: team ticket OWH support?
      • Maria D: Xmas period only
    • For the rest (CERN-specific), all services are correctly declared in SNOW with OWH support groups, and in the Services' Data Base (SDB) with the right criticality, so it should be fine.
  2. GGUS (concerns all sites and all Support Units): KIT is officially closed 2011/12/24-2012/01/01 inclusive. GGUS is integrated in the on-call service, so if something goes wrong, the developers will be informed at home.

  • Maria D: checked the timeline of ATLAS alarm GGUS:77547 - the operator response was dispatched 25h late due to a problem on the CERN mail gateway; following up with the mail service team

Tuesday:

Attendance: local (AndreaV, Fernando, Luca, Lukasz, Maarten, Jan, MariaDZ); remote (Michael/BNL, Ulf/NDGF, Tiju/RAL, Jhen-Wei/ASGC, Ronald/NLT1, Giovanni/CNAF, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • T0
    • T1 sites
      • CASTOR outage affecting tapes in ASGC. Tape endpoints blacklisted during Monday night.
      • Still seeing transfer errors from SARA-MATRIX first reported on 30 Nov (GGUS:76920). [Ronald: this GGUS ticket about the SARA T1 is now in the hands of the Irish T2. Fernando: many failed transfers from SARA were also observed to other sites, not just the Irish T2. Fernando will follow up and open a new ticket against SARA if necessary.]
    • Other Services
      • PandaMon port configuration problems caused high load on the ADCR database. Experts are looking into it.
      • GGUS inaccessible from CERN for about one hour starting at 10:20

  • ALICE reports -
    • The LCG-CE nodes at CERN have more than 10k "ghost" jobs for ALICE: known by the WMS, but not by LSF. Probably due to yesterday's LSF troubles and subsequent interventions. Further job submission has been blocked and a cleanup will be applied later today (see the sketch after this report). [Maarten: the LCG-CE is a backup for the CREAM CE; both are running in parallel, but either is able to take the whole load. Will follow this up with IT-PES.]
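
The cleanup amounts to a set difference: jobs the WMS still tracks minus jobs LSF actually knows about. A minimal sketch of that comparison in Python; the two input files (one job identifier per line, with WMS and LSF identifiers already mapped onto each other) are hypothetical:

    # Identify "ghost" jobs: known to the WMS but absent from LSF.
    # File names and the one-ID-per-line format are assumptions; in
    # practice the WMS and LSF job IDs must first be mapped to each other.

    def read_ids(path):
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    wms = read_ids("wms_jobs.txt")    # hypothetical dump from the WMS side
    lsf = read_ids("lsf_jobs.txt")    # hypothetical dump from bjobs output

    ghosts = wms - lsf
    print("%d ghost jobs" % len(ghosts))
    for job in sorted(ghosts):
        print(job)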

Sites / Services round table:

  • Michael/BNL: ntr
  • Ulf/NDGF: there will be a minor intervention in Sweden on Thursday, causing a short unavailability for ATLAS. Will also fix one node that was giving problems this week.
  • Tiju/RAL: ntr
  • Jhen-Wei/ASGC: CASTOR issue has been solved by a system restart.
  • Ronald/NLT1: nta
  • Giovanni/CNAF: ntr
  • Rob/OSG: ntr

  • Luca/Databases: will restart the ATLAS Streams apply process to GridKa.
  • Lukasz/Dashboard: ntr
  • Jan/Storage: EOS ATLAS had to be restarted yesterday, may still experience some short (5 minute) interruptions this afternoon.
    • Maarten: is there not a more recent EOS version that should allow more transparent addition of EOS nodes? Jan: this is already a recent EOS version with better transparency, but ATLAS chose not to deploy the latest EOS version before Christmas, so it will be deployed in 2012.

AOB:

  • Maria: following up on the GGUS unavailability. There is a SNOW ticket (INC:089366), although it is not clear whether everyone can see it. The GGUS team said that there was no alarm and that this could rather be a network problem. Maria herself definitely experienced the problem too (a timeout while connecting; see the probe sketch after this list). It is not clear whether this will be better understood by tomorrow.
  • Andrea: the meeting tomorrow will be the last one in 2011.
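
Since a GGUS outage could not be distinguished from a network problem, a basic client-side probe helps separate "cannot connect at all" from an application-level failure. A small sketch using only the Python standard library; the GGUS hostname is the public portal name, and the timeout value is arbitrary:

    import socket, time

    # A connect timeout points to a network/firewall issue; a successful
    # TCP connect followed by HTTP errors points to the server itself.
    HOST, PORT, TIMEOUT = "ggus.eu", 443, 5.0

    start = time.time()
    try:
        s = socket.create_connection((HOST, PORT), timeout=TIMEOUT)
        s.close()
        print("TCP connect OK in %.2fs" % (time.time() - start))
    except socket.timeout:
        print("connect timed out after %.1fs" % TIMEOUT)
    except OSError as e:
        print("connect failed: %s" % e)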

Wednesday

Attendance: local (AndreaV, Alexei, Fernando, Maarten, Arne, Lukasz, Luca, AleDG); remote (Felix/ASGC, Gonzalo/PIC, Ulf/NDGF, Alexander/NLT1, Burt/FNAL, John/RAL, Giovanni/CNAF, Rob/OSG, Rolf/IN2P3, Pavel/KIT; Ian/CMS).

Experiments round table:

  • ATLAS reports -
    • T0
      • CERN-PROD: All T1s have updated their FTS configuration with the correct EOS URL. We had forgotten about the CERN fts-t2-service (Updated GGUS:77333)
      • ATLAS CERN shares were reconfigured on Tue Dec 20th (Ale DiGi & Gavin McCance). More CERN slots are given to Grid production and analysis. The configuration will be kept as-is until January 4th and then re-evaluated.
    • ATLR database outage (20') at 7:50am due to the recovery space being filled up by a misbehaving process. [Alexei: thanks to Luca for the help! Is it clear what caused this? Luca: it was a problem in the Oracle backups; this is being followed up.] (A monitoring sketch follows this report.)
    • Other Services
      • PandaMon correctly reconfigured and amount of queries to ADCR back to normal.
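
The recovery-space usage behind the ATLR outage can be watched from the database side via the v$recovery_file_dest view. A hedged sketch using the cx_Oracle module; the account and connection string are placeholders, not the actual ATLR settings:

    import cx_Oracle  # third-party Oracle client module

    # Placeholder credentials/DSN -- not the real ATLR connection details.
    conn = cx_Oracle.connect("monitor", "secret", "db-host.example.org/ATLR")
    cur = conn.cursor()

    # v$recovery_file_dest reports the recovery area quota and usage.
    cur.execute("SELECT name, space_limit, space_used, space_reclaimable "
                "FROM v$recovery_file_dest")
    for name, limit, used, reclaimable in cur:
        pct = 100.0 * (used - reclaimable) / limit
        print("%s: %.1f%% of recovery space effectively used" % (name, pct))
    conn.close()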

  • CMS reports -
    • LHC / CMS detector
      • Shutdown
    • CERN / central services
      • LSF issues are resolved. Tier-0 was running well with MC Production.
    • T0
      • MC: LHE production continues
    • T1 sites:
    • T2 sites:
      • NTR
    • Other:
      • Normal production and processing activities expected through the CERN shutdown. CRC and daily shifts continue. CRC is reachable by phone 160777

  • ALICE reports -
    • The cleanup of yesterday's LCG-CE "ghost" jobs is not fully done yet, but should be later today.

Sites / Services round table:

  • Felix/ASGC: ntr
  • Gonzalo/PIC: the GGUS ticket opened by CMS has already been answered; we were not able to reproduce the issue and asked which VOMS proxy was used. It also seems that the GGUS-to-Savannah bridge did not work, so the answer will be copied to Savannah directly.
  • Ulf/NDGF: ntr
  • Alexander/NLT1: ntr
  • Burt/FNAL: ntr
  • John/RAL: advance notice of an outage on January 5th for the database behind CASTOR; will add this to GOCDB
  • Giovanni/CNAF: ntr
  • Rob/OSG: ntr
  • Rolf/IN2P3: there was an unscheduled outage at noon to roll back to the old AFS client, after it was noticed that the new client was generating a lot of traffic to CERN. This was done because otherwise the network would have been cut off, affecting availability anyway.
    • [Arne/Maarten: is it clear why jobs are accessing AFS at CERN? Normally they should not! AleDG: these are ATLAS jobs; the ATLAS software installation should normally not use AFS at CERN, so maybe this is something specific to IN2P3? Rolf: IN2P3 is doing nothing special, apart from having tried a newer AFS client, so the issue is probably in the ATLAS software installation. Andrea: are these jobs using CVMFS? There is a known bug in some of the CVMFS installations, which contain hardcoded paths to CERN AFS. Arne: yes, this may explain it; it should be followed up (a path-scan sketch follows this list).]
  • Pavel/KIT: ntr
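
The suspected hardcoded AFS paths can be looked for with a simple scan of the setup scripts shipped on CVMFS. A minimal sketch; the repository mount point /cvmfs/atlas.cern.ch and the restriction to .sh files are assumptions:

    import os

    # Scan shell setup scripts under a CVMFS mount for hardcoded AFS
    # paths, which would explain grid jobs unexpectedly contacting CERN.
    ROOT = "/cvmfs/atlas.cern.ch"   # assumed repository mount point
    NEEDLE = "/afs/cern.ch"

    for dirpath, _, filenames in os.walk(ROOT):
        for name in filenames:
            if not name.endswith(".sh"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="ignore") as f:
                    for lineno, line in enumerate(f, 1):
                        if NEEDLE in line:
                            print("%s:%d: %s" % (path, lineno, line.strip()))
            except OSError:
                pass  # unreadable file; skip it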

  • Luca/Databases: ntr
  • Arne/Storage: nta
  • Lukasz/Dashboard: ntr
  • CERN FTS: fts-pilot-service.cern.ch is running again and has also been migrated to Oracle 11g, running on the CERN Oracle integration instance. [Fernando: this is related to a GGUS ticket that was opened yesterday; it has been closed today because the issue is gone.]

AOB:

  • Alexei: how should we proceed with alarms during the CERN closure? Should we place alarms normally? Maarten: you should do as during normal operations, i.e. place an ALARM when appropriate and a TEAM ticket otherwise. TEAM tickets will also be answered during normal working hours. AleDG: however, maybe we should escalate them if not answered for 48h? Maarten: yes, then you can also escalate them.
  • Merry Christmas and Happy New Year! The next meeting will be on Thursday January 5th, 2012.

Thursday

CERN closed

Friday

CERN closed

-- JamieShiers - 28-Nov-2011
