Week of 090914

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Olof, Gang, Andrew, MariaG, Jean-Philippe, Alessandro, Markus, Simone, Diana, MariaD, Roberto, Lola, Ewan, Harry, Julia, Dirk(chair));remote(Gonzalo/PIC, Michael/BNL, Brian + Gareth /RAL, Daniele/CMS , Andreas/FZK, Ron/SARA).

Experiments round table:

  • ATLAS - (Alessandro) Over the weekend only a few issues at T2s. The central catalog had one load-balanced node go down due to a kernel panic; the service was restarted and SAM tests were affected for 8h. ATLAS has started ESD-ESD reprocessing; most T1 sites have been validated for the correct ATHENA version. During this activity ATLAS will run a daily phone conference at 15:30 (after this meeting) focussing on site issues (chaired by the expert on call). All participating sites are invited to join (same access code as the ADC meeting).

  • CMS reports - (Daniele) Last week was the physics week in Bologna; now restarting, with a report on the CMS twiki to be completed by tomorrow. Some delay in support replies from IN2P3. Good progress in closing tickets in Asia/India. Still waiting for replies from the Russian T2s on issues outstanding for 2-3 weeks (not acceptable).

  • ALICE -

  • LHCb reports - (Roberto) Active MC production (some 7k jobs running right now) with an issue on the DIRAC side: a large number of failing pilot jobs (WMS overload suspected). Currently debugging many T2 sites (pilot aborts). A T0 network issue this morning (6:30-9:00) affected CASTOR and the BDII. Also an issue at PIC this morning with no access to DIRAC services; the problem disappeared around noon. The reason is not fully understood yet, but apparently not related to the T0 network problem. Also at PIC: LHCb found a large number of stalled user jobs (GGUS ticket from 3rd Sept). Gonzalo: no known issue at PIC, but will check the logs of the DIRAC services. Roberto: connection problems were seen not only from CERN but also from RAL (around 12:00). Alessandro: PIC networking works fine for ATLAS. CNAF ticket: one of the StoRM directories had wrong write protection. SARA: a dCache problem is currently blocking MC09; Ron: waiting for a response from the dCache developers. IN2P3 also had similar dCache issues; they may be able to provide a workaround if the problem turns out to be the same.

Sites / Services round table:

  • Gonzalo/PIC: NTR
  • Michael/BNL: NTR
  • Gareth/RAL: A disk server in the ATLAS data disk area was out until this morning. The upgrade of the LHCb SRM to 2.8.0 is ongoing. Most batch capacity will be migrated to SL5 over the next days. A CASTOR name server upgrade to 2.1.8 is planned.
  • Andreas/FZK: A problem with two ATLAS disk-only pools is being worked on. Thu 17th, 8:00-10:00 local time: at risk for a firmware upgrade (in GOCDB); a short interruption of a few minutes is expected. The ATLAS SRM SAM tests have been failing for a few days: ATLAS now has a new dedicated LFC but the tests still use the old one - ATLAS will change the tests.
  • Ron/SARA: Last Thursday the storage was moved to 10Gb switches; the intervention went smoothly. Last Friday NIKHEF had cooling problems and WNs had to be switched off; the problem is solved now. This Wednesday the WNs will move to 10Gb; next Tuesday (22nd) the grid services will move to the new networking infrastructure.
  • Gang/ASGC: Good news for ATLAS: T1 and T2 will ultimately stay separate - the exact timeline will be given by Jason later. The currently available 6 tape drives will be extended by another 18 drives in mid-October; tape performance will be limited until then. The plan is to dedicate 4-5 drives to the ongoing ATLAS activity. Mail from Jason: CERN<->ASGC network performance problems since the network maintenance at CERN; CERN and T1 experts are investigating the cause.
  • MariaG/CERN: Security patch being applied to LHCb online today.
  • MariaD/CERN and OSG participants: Discussion of OSG tickets related to the LHC experiments will now take place in this meeting (each Monday). Currently there are no items in the escalation queue for OSG. Does ATLAS want to test the alarm ticket procedure already today? Fine for both ATLAS and OSG. ATLAS will create a new alarm ticket after a green light from Diana/MariaD.

AOB:

Tuesday:

Attendance: local(Harry,Jamie, Gavin, Olof, Ewan, Patricia, Julia, Jean-Philippe, Alessandro, Roberto, Diana, Dirk(chair));remote(Gonzalo/PIC, Tiju/RAL, Michael/BNL, Jeremy/GridPP, Daniele/CMS ).

Experiments round table:

  • ATLAS - (Alessandro) Another kernel panic on one of the 3 load-balanced central catalog nodes (experts are investigating), but the service was not affected. The ATLAS reprocessing activity had to be stopped due to unforeseen problems: all tasks have been aborted. More details in the ATLAS daily meeting in 30 minutes.

  • CMS reports - (Daniele) CMS is following the CASTOR intervention, which took longer than expected, but the information flow is good. T1 sites: going systematically through all issues still open from the last two weeks. CMS remarked that CERN and all T1s had very high availability during the last weeks! A number of individual T1/T2 transfer problems are being followed up with good progress; the CMS twiki gives more details and individual ticket references. UCSD is restructuring its storage element: all files are deleted and afterwards retransferred (to measure the time needed for the several TBs hosted).

  • ALICE - (Patricia) ALICE production at all sites has been stopped to prepare for the new MC cycle. Now is a good moment for sites to provide new VOboxes. One instance has been set up at CERN (an SLC5 VObox pointing to CREAM) and should be ready for the next production.

  • LHCb reports - (Roberto) Smooth MC and merging activity at the T1s. As many sites are moving to SL5 while the DIRAC certification for SL5 is still pending, the capacity usable by DIRAC is decreasing. One faulty WN has been observed at CERN. LHCb is still debugging many T2 sites with failing pilot jobs (open GGUS tickets; more details on the LHCb twiki).

Sites / Services round table:

  • Gonzalo/PIC: PIC is in scheduled downtime; today's transfer failures are due to that intervention - back at 5 pm. The SRM outage affects the SE, which is closed as well.
  • Tiju/RAL: Castor name server upgrade encountered a problem - outage extended to tomorrow midday.
  • Michael/BNL: NTR
  • Jeremy/GridPP: NTR
  • Gang/ASGC: NTR - Alessandro: any results from the debugging of the ASGC->T2 transfer problems? Gang: no concrete problem found. Alessandro: Maybe FTS monitoring would help to diagnose the problem; currently the configuration of the ASGC FTS does not allow further diagnostics for ATLAS. Gang: Will follow up.

AOB:

Wednesday

Attendance: local(Gang, Jan, Antonio, Harry, Olof, Patricia, Roberto, Eduardo, Alessandro, MariaD, Jamie, Dirk(chair));remote(Gareth/RAL, Xavier/FZK).

Experiments round table:

  • ATLAS - (Alessandro) The ATLAS dashboard is still quite red, but only due to a few issues. RAL is in extended downtime - waiting for their report. SARA is in scheduled downtime and out of DDM; T1-T1 traffic from/to SARA is also affected. FTS problems with CNAF as destination (e.g. from TRIUMF and other sites). Suspect that CNAF disappeared briefly while the FTS configuration was updated. The problem went away at most sites, but TRIUMF should verify/update their FTS server configuration.

  • ALICE - (Patricia) ALICE is back in production and ramped up quickly: first 13k, now 3k jobs. Two of the new VOboxes are being flagged as in "maintenance mode" to avoid SPMA errors for the operators. The final configuration will be put into Quattor soon.

  • LHCb reports - (Roberto) Little activity at the moment. DIRAC has been certified for SL5, which allows LHCb to progressively use SL5 resources. The proposed CASTOR intervention slot is fine for LHCb. SARA: still a problem with the locality returned by the SRM - a show-stopper for LHCb; asking the site for an update.

Sites / Services round table:

  • Gareth/RAL: The name server update (2.1.8) intervention triggered unrelated problems with disk-to-disk copies. RAL rolled back to the old name server version but the problem persisted. RAL experts are working closely with CASTOR development. A service incident report will be created once the root cause has been identified.
  • Xavier/FZK: Scheduled outage from 8:00-10:00 local time for a firmware upgrade; a short network outage is expected.
  • Gang/ASGC: NTR
  • Jan/CERN: The CASTOR update today went well. The CMS CASTOR intervention yesterday had to be extended due to an index missing after reimporting the DB. A service incident report has been prepared.

Release report: deployment status wiki page

Antonio summarised the status of upcoming releases. Alessandro: DPM 1.7.3 with the SRM patch for SL5 is still in certification, and some sites are already upgrading to SL5 - when will the patch be out? Antonio: no firm date yet, but will follow up offline.

AOB:

MariaD: The test alarm ticket to BNL will be submitted on Monday at 15:00 by ATLAS. ATLAS should report whether the ticket gets successfully assigned to OSG.

Thursday

Attendance: local(Jamie, Gavin, Graeme, JPB, Simone, Lola, Julia, Ewa, Gang, Alessandro, Harry, MariaG, MariaD, Dirk(chair));remote(Daniele, Michael/BNL, Ron/SARA, Gareth/RAL, Jeremy/GridPP, Brian/RAL).

Experiments round table:

  • ATLAS - (Graeme) The T1 situation is better: SARA down until 5 pm, RAL came back this morning and looks OK. ATLAS is commissioning a new SL5 infrastructure this week. SL5: concerns from sites about the SRM 2.2 bug; the disk server meta-package is still missing.

  • CMS reports - (Daniele) gLite WMS problem: the Catania WMS keeps submitting jobs to PIC SEs even though they have not been published for several days. Potential migration backlog at ASGC for CMS, even though the other VOs seem to be fine; about 2k files are waiting for tape streams, which may explain the CMS problem - Jason is checking. Transfer issues (probably) having their root cause at T1 sites: CIEMAT,DESY->IN2P3, Lisbon->PIC. Various other transfer problems are being taken care of (more details on the CMS twiki). CMS continues with T2-T2 link commissioning for Egamma. Harry: increasing the number of grid accounts - should be available on Wednesday. Fine for CMS.

  • ALICE -

Sites / Services round table:

  • Ron/SARA: The network maintenance yesterday went well, renumbering the compute nodes into a bigger subnet. Unexpected problem: the TTL for the name server entries was set to one day, so the site had to extend the downtime to ensure that the new entries were available in caches outside (see the TTL check sketch after this list). Also changed ownership of the data store at SARA to redundant LDAP servers. All looks fine and the site expects to be back at 5 pm. On the LHCb issue: the ticket has been marked solved since yesterday. The problem was caused by a dCache bug (in case no */* protocol has been configured); a workaround has now been put in place which should remove the LHCb problem.
  • Michael/BNL: A tape library issue occurred during the night before last: the tape control software was declaring the whole library unusable. BNL removed a faulty drive and rebooted the library. This issue has likely slowed down processing during that period.
  • Gareth/RAL: The outage finished this morning - the problems have been confirmed to be due to the LSF configuration and not to the CASTOR name server upgrade, which only uncovered them. It is not clear yet why these configuration changes, which happened in June, have only been spotted now. RAL has now re-upgraded the name server to 2.1.8 and also the CMS SRM to 2.8.0 (as planned); the other SRM endpoints will follow, with ATLAS next week. Good progress on the SL5 migration. Next Tuesday (22nd): UPS test in the new building - CASTOR declared unavailable.
  • Jos/FZK: Scheduled outage for a router update: the updates took longer than expected, which caused the GPFS cluster to go down together with the associated dCache pools. The setup was restarted during the morning and afternoon. Unfortunately, 10 minutes ago a rack with disks was switched off accidentally; the site is checking which VOs are affected (CMS is missing some disk-only pools). The scheduled downtime had to be extended.
  • Gang/ASGC: The CMS tape migration problem is being investigated.
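
The SARA downtime extension above hinges on DNS caching: resolvers outside the site keep serving the old addresses until the cached record's TTL (one day in this case) expires. As an illustration only - a minimal sketch assuming the dnspython package (2.x) and a hypothetical hostname, not an actual SARA node - one can check what TTL a record is currently published with:

```python
# Minimal sketch: inspect the TTL of a host's A record before a renumbering.
# Requires the dnspython package (dns.resolver.resolve needs dnspython >= 2.0).
# The hostname below is a placeholder, not a real SARA/NIKHEF node.
import dns.resolver

def record_ttl(hostname: str) -> int:
    """Return the TTL (in seconds) of the A record as seen by the local resolver."""
    answer = dns.resolver.resolve(hostname, "A")
    return answer.rrset.ttl

if __name__ == "__main__":
    host = "se.example-t1.org"  # hypothetical storage node
    print(f"{host}: TTL {record_ttl(host)} s")
```

Lowering the TTL well in advance of a renumbering shortens the window during which outside caches can still serve the stale address.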

AOB:

  • Jos: A message has circulated in the dCache user forum about an additional storage token for ATLAS. Is this a request all T1s should react to? Was the communication channel correct? Simone: the request was broadcast to the ATLAS distributed list, and will also be covered in today's ATLAS distributed computing meeting. Yes, it is a request; a reaction on a time scale of a few weeks to one month would be desired.

  • MariaD: There are currently several BNL entries in GGUS (several pointing to the same correct email). Should we clean these up to avoid confusion? Michael: yes, will follow up from the BNL side. https://savannah.cern.ch/support/?109779

Friday

Attendance: local(Julia, Maria, Lola, Alessandro, Ewan, Jamie, Patricia, Jean-Philippe, Roberto, Gang);remote(Daniele, Michael, Brian, Gareth).

Experiments round table:

  • ATLAS - Nothing major to report.

  • CMS reports - gLite WMS: one WMS in Catania keeps submitting jobs to two PIC CEs which stopped publishing support for the CMS VO several days ago. Andrea Sciabà reports that an internal WMS process had stopped and has been restarted, so the problem should be solved now; waiting for confirmation. UPDATE on Sep 18: after the support by Andrea Sciabà, Christoph Wissing reported that their problem with submission to a CE that got closed for the CMS VO also seems solved now. Case closed (Savannah #109960 --> CLOSED).

T1 sites issues:

Highlights:

  • migration backlog accumulated at ASGC now confirmed: waiting for more info
  • transfer issues in /Prod (probably) having their root cause at T1 sites: CIEMAT,DESY->IN2P3 (no news)

  • ALICE - Many jobs are waiting - why are they not entering running status? An SL5/SL4 issue? Maybe more CEs need to point to SLC5. FIO will follow up.

  • LHCb reports - Very slow deletions - a ticket has been opened against CERN - Ewan passed it to Gavin and will follow up.
    • Experiencing an unprecedented slowness in removing files through SRM and gfal: in chunks of 20 files the removal takes ~3 seconds per file, whereas we remember it once being about 10 Hz (see the timing sketch after this list).
    • To be confirmed: the intervention on the SRM to move to the CASTOR-internal gridftp. Tentatively it could happen during the already agreed intervention on the 22nd, if the version of lcg_utils/gfal to be used by DIRAC in production works with this new configuration.
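
To put a number on the removal rate that can be tracked in the ticket, a small timing harness can be run over a test chunk. This is only a sketch: the lcg-del client and the SURLs below are assumptions for illustration - substitute whatever removal tool and storage paths DIRAC actually uses.

```python
# Minimal sketch: measure the average per-file latency of SRM removals.
# "lcg-del" and the SURLs are placeholders/assumptions, not the exact
# client or paths used by DIRAC in production.
import subprocess
import time

def time_removals(surls, command="lcg-del"):
    """Delete each SURL with the given client and return seconds per file."""
    start = time.time()
    for surl in surls:
        subprocess.run([command, surl], check=True)
    return (time.time() - start) / len(surls)

if __name__ == "__main__":
    # A hypothetical chunk of 20 SURLs, mirroring the chunk size in the report.
    chunk = [f"srm://srm.example-t1.org/lhcb/test/file_{i}.dst" for i in range(20)]
    print(f"average removal latency: {time_removals(chunk):.2f} s/file")
```

A result near 0.1 s/file would correspond to the ~10 Hz rate remembered above; ~3 s/file reproduces the reported slowness.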

Sites / Services round table:

  • DB: Scheduled interventions for the quarterly patches are coming up at PIC, RAL and CERN. A SIR has been requested for the incident at GridKA.

AOB:

  • Accidentally closed the window, so some details were lost. Thank heaven (or Daniele and Roberto) for the experiment daily report wikis!

-- DirkDuellmann - 2009-09-14
