Week of 090803

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Harry, Daniele, Jean-Philippe, Alessandro, David, Jacek, Andrea, Roberto, Gang, Olof);remote(Xavier/FZK, John/RAL, Jeremy/GridPP, Andrea/INFN T1, Ron/SARA).

Experiments round table:

  • ATLAS - Quiet weekend for ATLAS. 1) Job validation was failing at INFN T1 due to an Athena process that had run out of memory. 2) ASGC is failing all assigned jobs because worker nodes have lost part of the ATLAS software (after previously passing validation).

  • CMS reports - 1) Tape migration has been stuck at RAL since yesterday. John reported a mechanical problem with their tape robot. 2) Some progress on the Russia-ASGC link commissioning with inbound traffic to ASGC now working. 3) Still 7 files pending tape migration at ASGC. 4) Continuing Tier 2 site/link commissioning but holidays mean tickets are being treated more slowly.

  • ALICE -

  • LHCb reports - Waiting for extra disk space in the LHCBDATA pool at CERN, expected very soon. Also planning to move some space from the raw data pool. Olof reported that they will first expand LHCBDATA using new servers, as migrating a server from a T1D0 pool to a T0D1 one is non-trivial.

Sites / Services round table:

  • FZK: During the weekend some ATLAS disk pools shut themselves down and had to be restarted twice. Under investigation. Next week, on Tuesday, August 11, Oracle patches will be applied to the FTS and LFC database back-end as well as to the LHCb 3D/LFC and ATLAS 3D databases. The FTS and LFC service will be "down" from 12:00 to 16:00 UTC. The LHCb and ATLAS databases will be "at risk" from 7:00 to 11:00 UTC.

  • RAL: On Friday jobs from a CMS user killed half of their worker nodes by triggering the out-of-memory killer. The user was banned but was restored this morning. Not yet understood by CMS, as such workflows are pre-tested - being analysed by CMS experts.

  • GridPP: Will continue performing the ATLAS hammercloud saturation tests. Running two days a week, probably all of August.

  • SARA: Recovered from last week's excessive dCache logging after the upgrade - their usual cron log-compression job had not been able to keep up. Did not need to roll back dCache.

  • CERN: Deploying the next Linux upgrade on all Quattor managed machines.

  • ASGC: Experiencing I/O errors on a newly installed disk server at their Tier 2, which is affecting the batch service. Recovery is under way.

AOB:

Tuesday:

Attendance: local(Ewan, Jamie, Olof, Jacek, Harry, Jeremy, Daniele, Julia, Alessandro, Roberto, Gang);remote(Xavier, Andrea, Brian, Gareth).

Experiments round table:

  • ATLAS - As announced last week, site validation tasks started this week to validate the Tier 1s before ESD reprocessing (which starts in approximately 10 days). Some troubles yesterday - many failures - under investigation. A problem with T1-T2 subscriptions was solved this morning. Today AMI is down - a downtime has been declared.

  • CMS reports - Brief: no major updates on any of the tickets. New ticket for the Florida T2 - failing the SAM lcg_cp test; possibly an ACL problem. Closed the T2 Rome ticket - it was not transferring to GridKA; a configuration problem. RAL: update from James - migration is moving again, with a backlog of ~6000 files to be digested soon. Work is still in progress at RAL to understand the issues that caused CMS jobs to be killed. Roberto asked whether this was on SL5; Gareth is pretty sure it was SL4.

  • ALICE -

  • LHCb reports - Informed yesterday evening that the 100 TB of extra disk space has been installed. LHCb activities restarted this morning, draining the backlog of data to be copied back to CERN. More information - including a plot - is in the full report (see link). Jobs will restart at the same rate as before the disk-space outage. Asked for one of the VO boxes to be reinstalled - Ewan confirmed this is OK for today. No other major issues. Lyon: investigating the publication of site information in the Information System.

Sites / Services round table:

  • FZK: The problems reported yesterday were due to a BIOS bug that reacts to heat: machines shut themselves down claiming overheating, although they were not actually overheating. Trying to find a solution. 8 ALICE and 4 LHCb disk-only pools in dCache lost disks because a power device was failing - now fixed. Roberto asked whether files were unavailable during this time; answer: yes.

  • CNAF: Experiencing some problems with ATLAS - in touch with Rod and trying to understand. Memory allocation problems? Not really sure - working on it.

  • RAL: 2 points from CMS: not much to add. The batch problem is still being investigated; the tape writing issue is not fully understood either. "At risk" on CASTOR this morning for quarterly patches - others for the 3D DBs tomorrow for the same purpose.

  • ASGC: The well-known dataset not migrated to tape: 1 file has now been migrated and the problem has been identified. Daniele asked whether all the files have the same problem; Gang thinks so.

AOB:

Wednesday

Attendance: local(Jamie, Daniele, Harry, Jean-Philippe, Alessandro, Sophie, Kath, Steve, Gang, Roberto, Oliver, Andrea);remote(Xavier, Michael, Angela, Gareth, Luca).

Experiments round table:

  • ATLAS - 1) File loss also observed at RAL - "almost new files". Intervention related? A Savannah ticket is open - ATLAS has integrated the squad into Savannah - more information later. 2) Pseudo-reprocessing (not the full reprocessing chain: RAW-ESD but not on to DPD etc.) to validate the tape system at FZK - FZK behaved very well: 4500 files yesterday - very positive. 3) OPN problem yesterday - submitted a ticket - a fibre cut between Madrid and Geneva; both primary and secondary links were down. The problem was spotted by ATLAS shifters - monitoring of these issues should be improved. https://gus.fzk.de/pages/ticket_lhcopn_details.php?ticket=50777&from=ID Why, again, is there no OPN presence at this meeting? 4) SLS problem: not available from outside CERN since this morning. Noticed once before - a Remedy ticket has been entered - some procedure improvements are needed. Luca: any clue about the problem of the failing reprocessing jobs? Tried to understand the problem in the IT cloud using the test from Rod - it fails both in Roma and at CNAF but is OK on lxplus at CERN; there are no explicit limits on memory allocation. Ale: locally there are no memory limits, but LSF has a 2.1 GB limit "hard-coded" for each process. Luca: there is nothing in the LSF configuration - the limits were removed some months ago upon ATLAS request; a call has been opened with LSF support. Ale: following the thread together with Lorenzo (the expert at CNAF). Luca: just to update that we are still investigating; probably some information from LSF support tomorrow. Ale will discuss with Sophie and summarise the issue (a small diagnostic sketch for comparing memory limits appears after the experiment reports below). Brian (RAL), on the issue of problems with new files: is a Savannah bug open? Ale: yes. We have procedures in ATLAS to recover lost files. Daniele: on the OPN, transfer quality is relatively green apart from inbound traffic to PIC from T2s - which shouldn't be on the OPN! Strange! Luca: traffic is going to PIC through Geneva - to be followed up.

  • CMS reports - General services: one main issue - the CMS site DB is inaccessible; the web tools team is investigating. T1 sites: the ticket to RAL regarding pending migrations to tape has been closed - migration is moving again. The ticket to ASGC has also been closed: all files in migrated state have been migrated to tape. New transfer problems concerning RAL: a permission problem at the destination? No major progress on the other tickets. T2s: closing the ticket to UCSD on commissioning of transfers to this site - quality is now excellent! New errors from Purdue to the LPC farm at FNAL: see the full report for details. Florida US T2 progress: the first guess of ACL problems may be correct! Gareth: the tape problems are still not fully understood. The problem over the weekend was only for CMS. A change made at lunchtime on Monday - a quota change? - was not the cause. There was also a short intervention on tape at the same time; the problem has gone away, so it is not fully understood, but no hardware issue was reported. (Savannah 109290 for the other RAL-related transfer problems.)

  • ALICE -

  • LHCb reports - Resuming the various production activities frozen before the disk capacity problem at CERN. Also experiencing OPN problems; non-PIC sites are affected as well. A problem reinstalling a VO box at CERN - finally OK. Lyon: publishing of information on clusters/sub-clusters - offline discussion with Steve - nothing to do with the site, but an agent filling an internal LHCb mask - to be fixed on the LHCb side. 0 tickets open today!
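
The CNAF/Roma reprocessing failures discussed above look like a per-process memory (address-space) limit that is present on the worker nodes but not on lxplus. A quick way to compare the two environments is to print the limits a job actually sees and attempt a controlled allocation. The sketch below is purely illustrative (it is not part of the ATLAS or LSF tooling); the 2.2 GB target is an arbitrary value chosen to sit just above the 2.1 GB figure mentioned above.

    #!/usr/bin/env python
    # Illustrative only: print the memory-related limits a batch job sees and
    # attempt a controlled allocation, to compare a worker node with lxplus.
    import resource

    def show(name, limit_id):
        soft, hard = resource.getrlimit(limit_id)
        fmt = lambda v: "unlimited" if v == resource.RLIM_INFINITY else "%d MB" % (v // (1024 * 1024))
        print("%-12s soft=%s hard=%s" % (name, fmt(soft), fmt(hard)))

    show("RLIMIT_AS", resource.RLIMIT_AS)      # virtual address space
    show("RLIMIT_DATA", resource.RLIMIT_DATA)  # data segment
    show("RLIMIT_RSS", resource.RLIMIT_RSS)    # resident set (often not enforced)

    # Try to allocate ~2.2 GB in 100 MB chunks; if a ~2.1 GB per-process cap is
    # enforced, this should raise MemoryError well before completing.
    chunks, chunk_mb, target_mb = [], 100, 2200
    try:
        while chunk_mb * len(chunks) < target_mb:
            chunks.append(bytearray(chunk_mb * 1024 * 1024))
        print("allocated %d MB without hitting a limit" % (chunk_mb * len(chunks)))
    except MemoryError:
        print("MemoryError after allocating %d MB" % (chunk_mb * len(chunks)))

Running the same script on lxplus and inside a batch job on a CNAF/Roma worker node would show directly whether a limit is being imposed by the batch system rather than by the OS image.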

Sites / Services round table:

  • ASGC: iperf tests between ASGC and the Russian T2 are ongoing (a generic throughput-test sketch appears at the end of this round table).

  • FZK: we have split the ATLAS software area.

  • RAL: There was a problem reported yesterday with batch failing for CMS - investigations are still ongoing. Update: Please could I put in a small correction or addition to the notes for yesterday's (Wednesday's) meeting? The ATLAS report refers to 'file loss in RAL'. I did not pick this up at the time. However, there was not an incident with lost files. I believe this refers to a server that was temporarily unavailable (server gdss229, part of the ATLAS MCDISK space token) due to an operational error. The server was unavailable from Monday afternoon until late Tuesday morning.

  • DB: The LHCb online DB is still down (the scheduled power cut was prolonged).

  • Report from CCIN2P3: We had a failure of one cooling system around midday today, so ~300 worker nodes were taken out of production to avoid a temperature increase in the machine room. The cooling system will be repaired next week, so we will stay at 50% of nominal computing capacity until then.
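
For reference, the ASGC-Russian T2 iperf tests mentioned above are plain network throughput measurements. The sketch below shows how such a test is typically driven; the hostname, duration and stream count are placeholders, it assumes the classic iperf (v2) client is installed and a server ("iperf -s") is already running on the remote end, and it is not the actual procedure used by the sites.

    #!/usr/bin/env python
    # Minimal sketch of a throughput test in the spirit of the ASGC<->RU-T2
    # iperf checks. Assumes "iperf -s" is running on the remote host and the
    # classic iperf (v2) client is installed locally. The hostname is a
    # placeholder, not a real endpoint.
    import subprocess

    REMOTE = "iperf-server.example.org"   # placeholder
    DURATION = 60                         # seconds per measurement
    STREAMS = 4                           # parallel TCP streams

    cmd = ["iperf", "-c", REMOTE, "-t", str(DURATION), "-P", str(STREAMS), "-f", "m"]
    print("running:", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)
    if result.returncode != 0:
        print("iperf failed:", result.stderr)

Repeating the measurement in both directions and with different stream counts helps distinguish a genuine link limitation from single-stream TCP window effects.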

AOB:

Thursday

Attendance: local(Jamie, Daniele, Ewan, Roberto, Alessandro, Harry, Olof, Gang);remote(Tiju Idiculla (RAL), Angela Poschlad (FZK), Michael Ernst (BNL), Michel (GRIF), Luca dell'Agnello (INFN)).

Updates since yesterday:

  • IN2P3 - see report added above (yesterday's notes)
  • RAL - see update to yesterday's notes above
  • OPN - some progress on the tickets: the link to Madrid is apparently solved, but there is a new problem towards the US. More in GGUS for the LHCOPN.

Experiments round table:

  • ATLAS - GGUS ticket for the OPN - not able to modify it, but still experiencing problems PIC - INFN; do not see problems PIC - FZK. Sent Edoardo an email on how to follow up. The only problematic channel in the T1-T1 matrix is this one - it seems to be the same symptoms as two days ago. Experiencing some central catalog instabilities, probably due to a new machine introduced into the load-balancing. Solved? To be confirmed. FZK: testing site validation for pseudo-reprocessing - the site is behaving very well: 16K jobs = a gold medal in STEP'09 terms. It is being tested only with ATLAS, so a joint (re-)test will have to be coordinated with CMS.

  • CMS reports - RAL: closing the case of transfer errors FNAL-RAL. New tickets: FZK - low transfer rate FZK-Rome, for T1-T2 and T2-T2 transfers from the new production round - just a request to investigate. Some files - corruption? - at IN2P3. iperf tests ASGC - Russian T2 still in progress. T2s: the problems with Florida are understood. Some additional sites/links are passing the commissioning metrics. US and Brazil: ongoing commissioning. Some tickets are pending due to holidays.

  • ALICE -

  • LHCb reports - Resuming production, still waiting for the full transfer backlog to be drained. In parallel, working to find a consistent set of middleware clients that work with SLC5; VOMS client problems (see details) - are the versions in certification OK? To be confirmed (a minimal proxy-check sketch follows below). Since yesterday 5 T1 tickets and 1 T2 ticket. T1s: mainly transfers to Lyon (disk server outage) and SARA (LHCb filled up the disk space for master MC); expect the same at GridKA, PIC and Lyon - see the disk space usage plots in the full report. The LHCb online system is facing a lot of serious problems, including the elog.
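
As a rough illustration of the kind of client-consistency check described above, the sketch below simply creates a VOMS proxy and dumps its attributes so the output can be compared between middleware client versions (for example on SLC4 and SLC5 installations). It assumes the standard voms-proxy-init / voms-proxy-info command-line tools and a valid grid certificate are available; "lhcb" is used only as the example VO, and this is not an LHCb procedure.

    #!/usr/bin/env python
    # Rough illustration: create a VOMS proxy and print its attributes, so the
    # same check can be repeated with different middleware client versions.
    # Assumes voms-proxy-init / voms-proxy-info are installed and a grid
    # certificate is in place; "lhcb" is just an example VO.
    import subprocess

    # Create a proxy with VOMS attributes (interactive: prompts for the key passphrase).
    init = subprocess.run(["voms-proxy-init", "-voms", "lhcb"])
    if init.returncode == 0:
        # Dump validity, FQANs and key details for comparison across client versions.
        subprocess.run(["voms-proxy-info", "-all"])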

Sites / Services round table:

  • ASGC - Earlier today there was a 10-second network outage due to an operational error; some transfers were affected but should be retried/retransferred.

  • RAL - CMS batch failing: still under investigation.

  • FZK: One ATLAS pool was out (down ~8 hours) after midnight - the same pool as last Saturday: a temperature problem. Rebooting 'solved' it, but will keep an eye on this.

  • BNL: OPN problem with US sites such as BNL and FNAL. The affected links are the backup links - the primary links still work; it is an ESnet issue. At 07:00 EDT moved to a new machine serving site services (the ATLAS data replication machinery) - a successful move, with just an issue with publication being worked on. Long-standing issue with the dCache gridftp doors: stale connections develop, so the doors are restarted from time to time - this was done this morning, with only a small effect.

  • GRIF: Suffering from SQLite-over-NFS problems - mainly affecting LHCb - and having difficulty finding a workaround (a generic mitigation sketch appears at the end of this round table).

  • CNAF: No news on the intrinsic memory limit in LSF found by ATLAS - still working on it. Any news on the OPN incident? Yesterday had to manually reconfigure the router to reach PIC through the GPN. Ale: will the change be rolled back to use the OPN? Update (Luca) - I was wrong: Stefano switched back our edge router configuration this morning and we are now able to connect to PIC via the OPN. But indeed the PIC -> CNAF channel is not green; we are checking both FTS and gridftp.

  • NIKHEF: Just a reminder about the NIKHEF data center move. NIKHEF will start the move next Monday (10 August) as advertised a few weeks ago, and the scheduled two-week downtime will start then. If everything goes smoothly, the whole site will be back online on 21 August as planned.

  • LHCOPN: The (Madrid) ticket was updated and closed: all services have been up and stable since 15:43:17 UTC. The total outage duration was 26 h 17 min 25 s. The outage was caused by a fibre cut due to public construction work.

    At the moment we are investigating an issue in the technical network affecting the PLCs controlling the cooling of the LHC. We are also preparing a major upgrade of the technical network that can only happen this Monday, 10 August (because it will stop access to the LHC tunnels and pits as well as the cooling of the LHC). A DNS vulnerability forced us to upgrade all the servers, but this introduced a secondary issue that we want to fix urgently. The DNS servers are critical for the whole of CERN, WLCG included.

    A new ticket has been opened: the BNL-CERN and FNAL-CERN backup links are down.
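
Regarding the GRIF report above on SQLite over NFS: SQLite relies on POSIX file locking, which is often unreliable on NFS mounts, so a common mitigation is to copy the database to node-local scratch space, work on the local copy, and copy it back if needed. The sketch below is a generic illustration of that pattern using placeholder paths; it is not the GRIF or LHCb workaround.

    #!/usr/bin/env python
    # Generic illustration of working around SQLite-over-NFS locking problems:
    # copy the database to node-local scratch, open it there, then copy it back.
    # Paths are placeholders, not actual LHCb/GRIF locations.
    import os, shutil, sqlite3, tempfile

    NFS_DB = "/nfs/shared/example/jobs.db"          # placeholder shared-area path
    scratch = tempfile.mkdtemp(dir=os.environ.get("TMPDIR", "/tmp"))
    local_db = os.path.join(scratch, "jobs.db")

    shutil.copy2(NFS_DB, local_db)                  # work on a local copy
    conn = sqlite3.connect(local_db)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS status (job TEXT, state TEXT)")
        conn.commit()
    finally:
        conn.close()
    shutil.copy2(local_db, NFS_DB)                  # copy back only if updates are needed
    shutil.rmtree(scratch)

This avoids SQLite ever taking locks on the NFS-mounted file, at the cost of having to serialise the copy-back step if several jobs update the same database.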

AOB:

Friday

Attendance: local(Harry, Daniele, Jean-Philippe, Gang, Alessandro);remote(Xavier, Michael, Jeremy, Gareth, Alexei).

LHC News (Rolf Heuer, DG): The LHC will run for the first part of the 2009-2010 run at 3.5 TeV per beam, with the energy rising later in the run. That's the conclusion that we've just arrived at in a meeting involving the experiments, the machine people and the CERN management. We've selected 3.5 TeV because it allows the LHC operators to gain experience of running the machine safely while opening up a new discovery region for the experiments.

The developments that have allowed us to get to this point are good progress in repairing the damage in sector 3-4 and the related consolidation work, and the conclusion of testing on the 10000 high-current electrical connections last week. With that milestone, every one of the connections has been tested and we now know exactly where we stand.

The latest tests looked at the resistance of the copper stabilizer that surrounds the superconducting cable and carries current away in case of a quench. Many copper splices showing anomalously high resistance have been repaired already, and the tests on the final two sectors revealed no more outliers. That means that no more repairs are necessary for safe running this year and next.

The procedure for the 2009 start-up will be to inject and capture beams in each direction, take collision data for a few shifts at the injection energy, and then commission the ramp to higher energy. The first high-energy data should be collected a few weeks after the first beam of 2009 is injected. The LHC will run at 3.5 TeV per beam until a significant data sample has been collected and the operations team has gained experience in running the machine. Thereafter, with the benefit of that experience, we'll take the energy up towards 5 TeV per beam. At the end of 2010, we'll run the LHC with lead-ions for the first time. After that, the LHC will shut down and we'll get to work on moving the machine towards 7 TeV per beam.

Experiments round table:

  • ATLAS - 1) RAL defined a new alias today for the ATLAS LFC but moving to it gave problems. An alarm ticket was sent and it was quickly resolved. Gareth confirmed that they had restarted the FTS daemon as a result but they did not fully understand since the backend had not changed. Jean-Philippe suggested to look in the LFC logs. (Later report from Graeme Stewart: IMHO the LFC problems at RAL were not connected with ATLAS change to a different LFC DNS alias this morning. I think it was just coincidence. I said as much in the ATLAS eLog (https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/5013 at the bottom) and no one contradicted me. Both aliases were affected because it was one of the backend hosts which was hanging. Derek found as much internally at RAL as well.)
2) ATLAS are requesting that an outage of their LFC at FZK (in order to move it to a dedicated LFC), scheduled for 07:00 to 08:00 UTC on Thursday 13 August (in GOCDB), be postponed, since there will be no ATLAS experts around for the following 4 days. Alexei added that they did not want to risk perturbing an important ATLAS reprocessing. Xavier will pass this on but could not confirm, as it affects other VOs (but not LHCb).

  • CMS reports - Note: due to holidays, in the August 12-24 time window the usual CMS reports at WLCG Ops calls will be reduced/absent. Andrea Sciabà has kindly agreed to give brief reports for some of the days. Usual daily reports will be available again starting from August 25th. 1) FZK T1 low transfer rate to Rome closed. 4 other T1 tickets still open with no news on 3 and tests in progress on 1 (ASGC - Russian T2 link commissioning). 2) Two T2 site tickets closed - FNAL to Florida and Aachen to FZK failing transfers. All other T2 tickets remain open with updates expected next week as site admins return from holidays.

  • ALICE -

  • LHCb reports - Pending MC09 productions corresponding to old requests have been restarted.

Sites / Services round table:

AOB:

-- JamieShiers - 30 Jul 2009
