Week of 100111

LHC Operations

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Eva, Simone, Jean-Philippe, Oliver, Alessandro, Eduardo, Jan, Jamie, Roberto, Andrea, Julia, MariaD, Patricia, MariaG, Nicolo, Daniele, Dirk); remote(Roger/NDGF, Gonzalo/PIC, Angela/KIT, Ron/NL-T1, Kyle/OSG, Gareth/RAL, Michael/BNL, Gang+Jason/ASGC, Rolf/IN2P3, Joseph, Alessandro/INFN).

Experiments round table:

  • ATLAS - (Alessandro) Quiet weekend - SARA/matrix in downtime, a few problems at T2s (details in the elog). Most important: CASTOR problems over the weekend due to a gridftp checksum miscalculation (transfers with CERN as source were failing); the gridftp checksum calculation/verification (new in 2.1.9) has been disabled, a Savannah bug is open and a post mortem has been requested. Jan: it is a s/w bug in the 2.1.9 release affecting checksums for multi-stream gridftp transfers (see the short illustrative sketch after this round table). No operational problems since the new checksum handling was disabled. The CASTOR dev team is preparing a fix. Will continue with the 2.1.9 deployment for CMS with checksums disabled.

  • CMS reports - (Daniele) After changes in the CMS organisation, Nicolo will give the CMS report in the daily meeting and Pepe will give a CMS report with weekly scope on Tuesdays. (Nicolo) No particular issues apart from CERN: a delegation issue with FTS 2.2. It went away after a cron job ran at 6am; before that, all transfers to CERN were failing. FTS 2.2 is not used for production transfers, but the issue affected the CMS site readiness plot. CREAM CE test at IN2P3: restarted test submissions to the site. ASGC repacking ongoing. T2 deployment of CMS SW ongoing: native SL5 still needs the SL4 compat libraries for the frontier distribution (a problem also seen by ATLAS); CMS installs the SL4 compat libs instead of using lib preload. Simone: FTS 2.2 is also used at FZK and FNAL - did you see delegation problems there too? Nicolo: not yet, but will check the logs. Jan: is there a ticket open? If not, please open one for tracking.

  • ALICE - (Patricia) MC production ongoing, with job counts now decreasing (as expected). During the Christmas break the CREAM CE was not used due to a problem with the VObox; it is back in production now. A new bunch of services has been added in myproxy - site admins have been informed.

  • LHCb reports - (Roberto) Not much activity right now. Issue with the WMS reported on Friday at PIC: the ICE queue filled up - now fixed by two new instances which have been tested OK. Similar issue at SARA: the queue has been emptied. Possibly a similar problem at RAL, but it disappeared again. Checksums have been provided to RAL. Lyon: issue with a third-party library for GSI dcap; a patch from the dCache developers is available and will be tested.
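To make the CASTOR 2.1.9 checksum point above more concrete: a streaming checksum such as Adler-32 depends on the order in which bytes are folded in, so a value assembled while parallel gridftp streams deliver their chunks out of order will not match the whole-file checksum. A minimal Python sketch, purely illustrative and not the CASTOR code:

# Illustration only: why multi-stream transfers complicate streaming checksums.
import zlib

def adler32_over(chunks):
    """Fold an iterable of byte chunks into a single Adler-32 value."""
    value = 1  # Adler-32 starting value
    for chunk in chunks:
        value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF

chunks = [b"stream-0 data", b"stream-1 data", b"stream-2 data"]
in_order = adler32_over(chunks)                # checksum of the file as written
out_of_order = adler32_over(reversed(chunks))  # chunks folded in arrival order
print(hex(in_order), hex(out_of_order))        # the two values differ

Recomputing the checksum over the complete file after the transfer avoids the ordering problem, at the cost of an extra read; whether the CASTOR fix takes that route is not stated in the report above.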

Sites / Services round table:

  • Roger/NDGF - ntr
  • Gonzalo/PIC - ntr
  • Angela/KIT - yesterday: a black hole worker node (disk failure) - a few jobs were lost (see the sketch after this round table for how such a node can be flagged). Problem with the SAM job tests after queue draining in preparation for Wednesday's outage; SAM jobs will now be redirected to another queue. Wednesday downtime: the BDII, ATLAS LFC and FTS star channels will stay up, the LHCb LFC will be down.
  • Ron/NL-T1 - migration to Chimera - all according to plan so far. Roberto: outage all week? Ron: yes, as other upgrades, e.g. of the RAID controllers and the 2TB disk firmware, are also being done.
  • Gareth/RAL - fsprobe errors for LHCb - received the checksums from LHCb, thanks. The other machine is still being verified.
  • Michael/BNL - reminder: maintenance tomorrow. Will drain the queues from midnight on. New storage appliance tomorrow morning (1h outage plus some safety margin); should be back up by noon (Eastern time).
  • Jason/ASGC: a power cycle affected several core services of the T1+T2 for 2h.
  • Rolf/IN2P3 - ntr
  • Alessandro/INFN: last Friday: NFS server overload due to a limited number of NFS threads, now increased from 64 to 128 (problem fixed); batch came back after 2h. A new problem yesterday when a bug in a cron job (clean-up of accounting files) accidentally removed a batch system file: all queue entries were lost but running jobs continued. Problem now fixed.

  • Eduardo/CERN: network incident on Friday during a planned maintenance on the LCG and GPN networks, due to a software bug in a router. Also a fibre cut in Frankfurt: the backup capacity was used and was sufficient. Next Saturday there is another scheduled intervention. Transfers between PIC and SARA (via CERN) are affected by a difference in packet size between the two ends. Can this be detected automatically? Yes, but the check is currently disabled due to security concerns.
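As an aside on the "black hole" worker node reported by KIT: such a node typically fails jobs within seconds and so keeps pulling new ones from the queue. The Python sketch below shows one way recent job records could be scanned to flag such a node; the record layout and thresholds are assumptions for illustration, not the KIT batch monitoring.

# Hypothetical sketch: flag "black hole" worker nodes, i.e. nodes that fail
# many jobs in an unusually short time. Record layout and thresholds are
# illustrative assumptions, not the actual batch monitoring at KIT.
from collections import defaultdict

def find_black_holes(records, min_jobs=20, max_mean_runtime=60.0, min_failure_rate=0.9):
    """records: iterable of (worker_node, exit_code, runtime_seconds)."""
    per_node = defaultdict(list)
    for node, exit_code, runtime in records:
        per_node[node].append((exit_code, runtime))

    suspects = []
    for node, jobs in per_node.items():
        if len(jobs) < min_jobs:
            continue  # not enough statistics to judge this node
        failures = sum(1 for code, _ in jobs if code != 0)
        mean_runtime = sum(rt for _, rt in jobs) / len(jobs)
        if failures / len(jobs) >= min_failure_rate and mean_runtime <= max_mean_runtime:
            suspects.append(node)  # fails almost everything, very quickly
    return suspects

# Example: wn042 "eats" jobs in ~5 s while wn017 runs them normally.
records = [("wn042", 1, 5)] * 30 + [("wn017", 0, 4200)] * 10
print(find_black_holes(records))  # ['wn042']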

AOB: (MariaD) Could OSG please update https://gus.fzk.de/ws/ticket_info.php?ticket=54538 (urgent)? Michael clarifies: Harvard and Boston together form one T2 center. MariaD: will discuss ticket routing for this case offline with Kyle and Michael.

Tuesday:

Apologies - no attendance list today, as the in/out flux for the overbooked meeting room was too high.

Experiments round table:

  • ATLAS - (Alessandro) Not much to report. NIKHEF storage problems (more details in the site report); some T2 problems being followed up. CASTOR: a short glitch of the known FTS->SRM problem around 11:10; SRM support/dev has been contacted with the details.

  • CMS reports - (Nicolo) The CASTOR CMS upgrade today went well; no issues have been observed. The FTS delegation problems reported earlier are being followed up. IN2P3: a few new tickets have been opened: stage-out failures for the CMS SAM test (already closed) and a dCache issue affecting reprocessing jobs - acknowledged by the site contact and being worked on. (Pepe - CMS weekly) No major production planned, only testing jobs. Follow-up of the SL5 WN migration and the tape recycling/repack at the ASGC Tier-1, to bring the site back into operations and be ready for the 2010 run. Some sites have been notified to upgrade to the latest squid version (squid-2.7.STABLE7), which has a fix for a performance problem. Since 24 December 2009 the data transfer quality is included in the set of Site Readiness metrics; it was noted that the SSB plots were still using the old readiness results, to be fixed by today. More detailed plans, including T2s, are on the CMS twiki (link above).

  • ALICE - (Patricia) Test with a production-like environment for the last CREAM CE (IN2P3): the test went fine, but the CE cannot enter production yet, as gridftp on the VObox had been removed by the site and it is required by ALICE; ALICE is in contact with the site to resolve this. Have also been working with CNAF on VObox problems - production at the site will restart once some reconfiguration steps on the ALICE side have been done.

  • LHCb reports - (Roberto) Low activity / no problems - CREAM CE will now go into production for LHCb.

Sites / Services round table:

  • NL-T1: In contact with the storage vendor (DDN) about h/w problems. Vendor support is still an issue, also for some standard operations (e.g. getting replacement disks). Communication with the VOs on storage problems seems good. Other sites (who also bought/deployed DDN equipment) are interested in the support and h/w experience; a summary of these issues will be sent later by NL-T1. The Chimera migration is progressing well.
  • Gonzalo/PIC: ntr
  • Angela/KIT: ntr
  • Michael/BNL: Queue draining for upcoming intervention
  • John/RAL: Advance warning of an intervention next week, plus additional info from Gareth:

We have declared an 'At Risk' in the GOC DB for two hours on Thursday (14th January). This is for a maintenance on the UPS.

Not yet in the GOC DB: We are planning an outage on Tuesday/Wednesday next week (19/20 January). This is for several reasons, including:
- We now believe we understand the problems with the disk arrays that led to outages for us during October. These have been traced to noise on the electrical current from the UPS supply. We plan to migrate the Castor, LFC and FTS databases back to these arrays, which will be powered from another source until the UPS problem is fixed.
- Various other work, including checking disk systems (FSCK across all Castor disk servers) plus updates to the batch system that will require a drain of the farm ahead of this intervention.

  • Massimo/INFN: 1h ago - VOMS server outage due to a h/w problem (being worked on)
  • Roger/NDGF: SRM upgrade took place; a downtime is scheduled for tomorrow
  • Rolf(?)/IN2P3: ALICE issues being followed up.

  • Eva/CERN: intervention on Alice online DB planned for coming days (exact time will be given later): storage firmware upgrade
  • Jan/CERN: The CASTOR/CMS and ATLAS/xroot redirector upgrades took place. Plan to upgrade CASTOR/T3 next week - a notification will be sent to the VOs.

AOB:

Wednesday

Attendance: local(Maria, Jamie, Lola, Gavin, Nicolo, Eva, Alessandro, Julia); remote(Massimo Donatelli (INFN-CNAF), John Kelly (RAL), Rolf (IN2P3), Michael Ernst, Roger Oscarsson (NDGF), Angela Poschlad (KIT), Gonzalo Merino (ES-PIC), Joel, Ron, Jason).

Experiments round table:

  • ATLAS reports - Expert on-call: will rotate on Wednesdays. Good news: a link attached to the agenda will be filled in by the expert on call; the level of detail will be tuned. From the report: many scheduled interventions ongoing (NDGF, FZK, SARA); INFN-T1 storage problems from 3am to ~7:30am: "the process on the BE was dead (we are investigating the reasons), now the transfers are running again"; some Tier-2 instabilities: CYFRONET-LCG2 and LIP-COIMBRA, both now fixed. Weekly ATLAS distributed computing operations meeting in 30' - join for more info!

  • CMS reports - General service issues: the FTS 2.2 delegation problems discussed in the last days. Evidence has now been found on FTS 2.2 servers outside CERN (FZK, FNAL - caught "live" last night, gathering debug info). Operations: CREAM CE testing at T1s in progress; no major issues reported so far; processing jobs successful - some errors in merge jobs, but this might be the jobs rather than the CE. CREAM CE tests not yet completed at IN2P3 - jobs still waiting, probably due to a misconfiguration in the submission and not a site issue.

  • ALICE - Nothing to report.

  • LHCb reports - Low activity. There are about 2K user jobs running in the system and 6K more waiting to be picked up by pilots. Just one stripping production is active now, but no jobs have been created yet. T0 issues: the LFC read-only instance shows problems, with many connections just timing out; this is intermittent, however. GGUS ticket open. T1 issues: GRIDKA: the lcgadmin VOMS role is not correctly mapped to the *sgm account via the CREAM CE at GRIDKA. Answer from GRIDKA: they cannot provide a static account, although for Lyon this is OK. Angela: ATLAS also requested this; we cannot map without the DN being in the gridmap file and hence it is always mapped the same way. The CREAM and glexec developers are involved and there is no solution yet; will discuss with the IN2P3 people. Gavin: no information on the LFC issue at hand, but will follow up with the people responsible for the LFC.

Sites / Services round table:

  • CNAF - Confirm that we had a problem with the ATLAS StoRM back-end overnight. Probably due to a bug in lcmaps; a string in the Java VM created an application crash - investigating. The service was restarted this morning. Ale: is there a way to catch this problem and restart the service if it reoccurs? A: yes
  • RAL - ntr
  • IN2P3 - Yesterday ALICE reported that gridftp is not available on the VO box; installation is underway. However, the installation of gridftp is not part of the MoU between ALICE and the site; there is a gridftp service in the CREAM CE. We don't quite understand the need, but will install it.
  • BNL - ntr
  • NDGF - upgrade of dCache today. Seems to work well but some SAM tests still not working. Not directly related to SE. Investigating...
  • KIT - Currently in downtime - going quite well and we expect to finish in time. Otherwise ntr. Ale: LFC & FTS - are they back? Angela: the LFC should have been down only for a few minutes whilst a router was rebooted; the FTS is not yet up - the file servers are still being updated. Channels from T2s to * are continuously available. Ale: waiting for confirmation on the LFC, as the whole cloud was put offline; will re-include the cloud.
  • PIC - ntr
  • NL-T1 - The Chimera migration is still moving forward; a few minor glitches but still in progress. The storage vendor is using the downtime to upgrade the firmware on the RAID controllers and on the 2TB disks. NIKHEF: on Jan 12 new disk servers were shut down due to a very high failure rate; the problem is now being resolved and the disk servers are being restored to service - 6/8 are now operational and work continues on the last 2.
  • ASGC - observe some failures - details to follow.
  • CERN - FTS 2.2: a cron job has been put in place that will snapshot the credentials, so that if the problem occurs again we at least have some trace (a minimal sketch of such a cron-driven check follows below). Info has also been received from FNAL.
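A minimal Python sketch of the kind of cron-driven credential check mentioned above and in Monday's CMS report: it only records the state of the local VOMS proxy with a timestamp and re-delegates when the remaining lifetime gets short. The log path, threshold and re-delegation command are assumptions for illustration, not the actual FTS 2.2 job.

# Hedged sketch of a cron-driven credential trace (run e.g. hourly). This is
# an assumption of what such a job could look like, not the FTS 2.2 script:
# the log path, threshold and re-delegation command are placeholders.
import subprocess
import time

LOG_FILE = "/var/log/fts-credential-trace.log"   # hypothetical location
MIN_LIFETIME_S = 6 * 3600                        # assumed refresh threshold
REDELEGATE_CMD = ["site-fts-redelegate"]         # hypothetical, site-specific command

def proxy_timeleft() -> int:
    """Remaining lifetime of the local VOMS proxy, in seconds."""
    out = subprocess.run(["voms-proxy-info", "-timeleft"],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.strip())

def main() -> None:
    left = proxy_timeleft()
    with open(LOG_FILE, "a") as log:
        # Keep a timestamped trace so that later transfer failures can be
        # correlated with the credential state at the time.
        log.write(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} timeleft={left}s\n")
    if left < MIN_LIFETIME_S:
        subprocess.run(REDELEGATE_CMD, check=True)

if __name__ == "__main__":
    main()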

Release report: deployment status wiki page

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

  • ALICE -

Sites / Services round table:

AOB: The GGUS release is due on February 3rd; can we please conclude on the regular ALARM test day of the month? Details in https://savannah.cern.ch/support/?111475#comment8

Friday

Attendance: local();remote().

Experiments round table:

  • ALICE -

Sites / Services round table:

AOB:

-- JamieShiers - 07-Jan-2010
