Week of 100111

LHC Operations

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The scod rota for the next few weeks is at ScodRota

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Eva, Simone, Jean-Philippe, Oliver, Alessandro, Eduardo, Jan, Jamie, Roberto, Andrea, Julia, MariaD, Patricia, MariaG, Nicolo, Daniele, Dirk);remote(Roger/NDGF, Gonzalo/PIC, Angela/KIT, Ron/NL-T1, Kyle/OSG, Gareth/RAL, Michael/BNL, Gang+Jason/ASGC, Rolf/IN2P3, Joseph, Alessandro/INFN).

Experiments round table:

  • ATLAS - (Alessandro) Quiet weekend - SARA/matrix in downtime, a few problems at T2s (details in the elog). Most important: CASTOR problems over the weekend due to a gridftp checksum miscalculation - transfers with CERN as source were failing, so the gridftp checksum calculation/verification (new in 2.1.9) was disabled; a Savannah bug is open and a post mortem has been requested. Jan: s/w bug in the 2.1.9 release with checksums for multi-stream gridftp transfers (a checksum-verification sketch follows this list). No operational problems since the new checksum handling has been disabled. The CASTOR dev team is preparing a fix. Will continue with the 2.1.9 deployment for CMS with checksums disabled.

  • CMS reports - (Daniele) Following changes in the CMS organization, Nicolo will give the CMS report in the daily meeting and Pepe will give a weekly-scope CMS report on Tuesdays. (Nicolo) No particular issues apart from CERN: a delegation issue with FTS 2.2. It went away after a cron job ran at 6am; before that all transfers to CERN were failing. FTS 2.2 is not used for production transfers, but this affected the CMS site readiness plot. CREAM CE test at IN2P3: restarted test submissions to the site. ASGC repacking ongoing. T2 deployment of CMS SW ongoing: native SL5 still needs SL4 compat libraries for the Frontier distribution (problem also seen by ATLAS); CMS installs the SL4 compat libs instead of using lib preload. Simone: FTS 2.2 is also used at FZK and FNAL - did you see delegation problems there too? Nicolo: not yet, but will check the logs. Jan: is there a ticket open? If not, please open one for tracking.

  • ALICE - (Patricia) MC production ongoing with job counts now decreasing (as expected). During Xmas the CREAM CE was not used due to a problem with the VObox; it is back in production now. A new batch of services has been registered in myproxy - site admins have been informed.

  • LHCb reports - (Roberto) Not much activity right now. Issue with the WMS at PIC reported on Friday: the ICE queue filled up - now fixed by two new instances which have been tested OK. Similar issue at SARA: the queue has been emptied. Possibly a similar problem at RAL, but it disappeared again. Checksums provided to RAL. Lyon: issue with a third-party library for GSI dcap; a patch from the dCache developers is available - to be tested.
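
For illustration of the checksum comparison involved in the CASTOR item above (this is not the CASTOR or FTS code), a minimal sketch of how a whole-file adler32 can be computed and compared against the value recorded at the source; the path and reference value are placeholders:

    import zlib

    def adler32_of_file(path, chunk_size=1 << 20):
        value = 1  # adler32 seed value
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                value = zlib.adler32(chunk, value)
        return value & 0xFFFFFFFF

    expected = 0x1a2b3c4d                          # placeholder: value recorded at the source SE / catalogue
    actual = adler32_of_file("/tmp/replica.root")  # placeholder local path
    print("checksum OK" if actual == expected
          else "mismatch: got %08x, expected %08x" % (actual, expected))

A per-stream or partial-file calculation, as in the 2.1.9 multi-stream bug described above, would disagree with this whole-file value and make an otherwise good transfer look corrupted.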

Sites / Services round table:

  • Roger/NDGF - ntr
  • Gonzalo/PIC - ntr
  • Angela/KIT - yesterday: black-hole worker node (disk failure) - a few jobs lost. Problem with the SAM job test after queue draining to prepare for Wednesday's outage; SAM jobs will now be redirected to another queue. Wednesday downtime: the BDII, the ATLAS LFC and the FTS STAR channels will stay up; the LHCb LFC will be down.
  • Ron/NL-T1 - Migration to Chimera - all according to plan so far. Roberto: outage all week? Ron: yes, since other upgrades (e.g. RAID controllers and 2TB disk firmware) are being done as well.
  • Gareth/RAL - fsprobe errors for LHCb - received checksums from LHCb - thanks. Other machine is still being verified.
  • Michael/BNL - reminder: maintenance tomorrow. Will drain queues from midnight on. New storage appliance tomorrow morning (1h outage plus some safety margin) should be back up by noon (eastern time).
  • Jason/ASGC: power cycle affected several core services for T1+T2 for 2h
  • Rolf/IN2P3 - ntr
  • Alessandro/INFN: last Friday: NFS server overload due to a limited number of NFS threads, now increased from 64 to 128 (problem fixed, see the sketch below); batch came back after 2h. New problem yesterday when a bug in a cron job (clean-up of accounting files) accidentally removed a batch system file. All queue entries were lost, but running jobs continued. Problem now fixed.
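
For reference on the INFN point above, a minimal sketch of how the kernel nfsd thread count can be inspected and raised on a Linux NFS server; whether CNAF used exactly this mechanism is an assumption:

    THREADS_FILE = "/proc/fs/nfsd/threads"   # live nfsd thread count on Linux

    with open(THREADS_FILE) as f:
        current = int(f.read().strip())
    print("nfsd threads currently:", current)

    if current < 128:
        # Needs root; takes effect immediately but is not persistent - on
        # SL/RHEL-era systems the persistent value is RPCNFSDCOUNT in
        # /etc/sysconfig/nfs.
        with open(THREADS_FILE, "w") as f:
            f.write("128\n")
        print("raised nfsd threads to 128")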

  • Eduardo/CERN: network incident on Friday during a planned maintenance of the LCG and GPN networks, due to a software bug in a router. Also a fibre cut in Frankfurt: backup capacity was used and was sufficient. Next Saturday another scheduled intervention. Packet size issue for transfers between PIC and SARA (via CERN): caused by a difference in the maximum packet size along the path. Can this be detected automatically? Yes, but the check is currently disabled due to security concerns (see the probe sketch below).
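
Assuming the packet-size issue refers to a path-MTU mismatch, a minimal sketch (not the CERN network tooling) of how the usable MTU towards a remote endpoint can be probed with non-fragmentable pings on Linux; the host name is a placeholder:

    import subprocess

    def probe_mtu(host, payload_sizes=(8972, 1472, 548)):
        # "-M do" forbids fragmentation; 28 bytes of IP + ICMP header sit on top
        # of the payload, so 8972 probes a 9000-byte jumbo path, 1472 a 1500 path.
        for size in payload_sizes:
            result = subprocess.run(
                ["ping", "-M", "do", "-c", "1", "-s", str(size), host],
                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            if result.returncode == 0:
                return size + 28   # largest packet that passed without fragmentation
        return None

    print(probe_mtu("srm.example.org"))   # placeholder host name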

AOB: (MariaD) Could OSG please update https://gus.fzk.de/ws/ticket_info.php?ticket=54538 (urgent). Michael clarifies: Harvard and Boston together form one T2 center. MariaD: will discuss ticket routing for this case offline with Kyle and Michael.

Tuesday:

Apologies - no attendance list today as the in/out flux for the overbooked meeting room was too high.

Experiments round table:

  • ATLAS - (Alessandro) Not much to report. NIKHEF storage problems (more details in the site report); some T2 problems being followed up. CASTOR: a short glitch of the known FTS->SRM problem around 11:10; SRM support/dev has been contacted with details.

  • CMS reports - (Nicolo) The CASTOR CMS upgrade today went well; no issues have been observed. The FTS delegation problems reported earlier are being followed up. IN2P3: a few new tickets have been opened: stage-out failures for the CMS SAM test (already closed); a dCache issue in reprocessing jobs - acknowledged by the site contact and being worked on. (Pepe - CMS weekly) No major production planned, only test jobs. Following up the SL5 WN migration and the tape recycling/repack at the ASGC Tier-1, to bring the site back into operations and be ready for the 2010 run. Some sites were notified to upgrade to the latest squid version (squid-2.7.STABLE7), which has a fix for a performance problem. Since 24th December 2009 the data transfer qualities have been included in the set of Site Readiness metrics. It was noted that the SSB plots were still using the old readiness results; to be fixed by today. More detailed plans, including T2s, on the CMS twiki (link above).

  • ALICE - (Patricia) Test with a production-like environment for the last CREAM CE (IN2P3): the test went fine, but it cannot enter production yet, as the gridftp server on the VObox had been removed by the site - it is however required by ALICE. ALICE is in contact with the site to resolve this issue. Have been working with CNAF on VObox problems - production at the site will restart once some reconfiguration steps on the ALICE side have been done.

  • LHCb reports - (Roberto) Low activity / no problems - CREAM CE will now go into production for LHCb.

Sites / Services round table:

  • NL-T1: In contact with the storage vendor about h/w problems. Vendor support is still an issue, also for some standard operations (e.g. getting replacement disks). Communication with the VOs on storage problems seems good. Other sites (who also bought/deployed similar equipment) are interested in the support and h/w experience; a summary of these issues will be sent later by NL-T1. The Chimera migration is progressing well.
  • Gonzalo/PIC: ntr
  • Angela/KIT: ntr
  • Michael/BNL: Queue draining for upcoming intervention
  • John/RAL: Advance warning for an intervention next week, plus additional info from Gareth:

We have declared an 'At Risk' in the GOC DB for two hours on Thursday (14th January). This is for a maintenance on the UPS.

Not yet in the GOC DB: We are planning an outage on Tuesday/Wednesday next week (19/20 January). This is for several reasons including:

  • We now believe we understand the problems with the disk arrays that led to outages for us during October. These have been traced to noise on the electrical current from the UPS supply. We plan to migrate the Castor, LFC and FTS databases back to these arrays, which will be powered from another source until the UPS problem is fixed.
  • Various other work including checking disk systems (FSCK across all Castor disk servers) plus updates to the batch system that will require a drain of the farm ahead of this intervention.

  • Massimo/INFN: 1h ago - voms server outage due to h/w problem (being worked on)
  • Roger/NDGF: SRM upgrade took place, downtime scheduled for tomorrow
  • Rolf(?)/IN2P3: ALICE issues being followed up.

  • Eva/CERN: intervention on Alice online DB planned for coming days (exact time will be given later): storage firmware upgrade
  • Jan/CERN: Castor/CMS and ATLAS/xroot redirector upgrades took place. Plan to upgrade CASTOR/T3 next week - notification will be sent to VOs.

AOB:

Wednesday

Attendance: local(Maria, Jamie, Lola, Gavin, Nicolo, Eva, Alessandro, Julia);remote(Massimo Donatelli INFN-CNAF, John Kelly (RAL), IN2P3 (Rolf), Michael Ernst, Roger Oscarsson (NDGF), Angela Poschlad (KIT), Gonzalo Merino (ES-PIC), Joel, Ron, Jason).

Experiments round table:

  • ATLAS reports - Expert on-call: will rotate on Wednesdays. Good news: attached to the agenda is a link which will be filled in by the expert on call; the level of detail will be tuned. From the report: many scheduled interventions ongoing: NDGF, FZK, SARA. INFN-T1 storage problems from 3am to ~7:30am: "the process on the BE was dead (we are investigating the reasons); now the transfers are running again". Some Tier-2 instabilities: CYFRONET-LCG2 and LIP-COIMBRA, both now fixed. Weekly ATLAS distributed computing operations meeting in 30' - join for more info!

  • CMS reports - General service issues: in the last days discussed the FTS 2.2 delegation problems. Found evidence on FTS 2.2 servers outside CERN (FZK, FNAL - caught "live" last night and gathered debug info). Operations: CREAM CE testing at T1s in progress; no major issues reported so far; processing jobs successful - some errors in merge jobs, but this might be due to the jobs rather than the CE. CREAM CE tests not yet completed at IN2P3 - jobs still waiting; probably due to a misconfiguration in the submission and not a site issue.

  • ALICE - Normal operations, MC production ongoing. The issue reported yesterday regarding the CCIN2P3 ALICE VOBOX (no installation of the gridftp server on the machine) has been solved, as the contact person at the site confirmed this morning. Issues reported by the Madrid T2 (CIEMAT) with the ALICE local services (problems starting up the PackMan service) have also been solved from the central services at CERN.

  • LHCb reports - Low activity. There are about 2K user jobs running in the system and 6K more waiting to be picked up by pilots. Just one stripping production is active now, but no jobs created yet. T0 issues: the LFC read-only instance shows problems with many connections just timing out; this is intermittent however. GGUS ticket open. T1 issues: GRIDKA: the lcgadmin VOMS role is not correctly mapped to the *sgm account via the CREAM CE at GRIDKA. Answer from GRIDKA: they cannot provide a static account, although for Lyon this is OK. Angela: ATLAS also requested this; we cannot map without a DN in the gridmapfile and hence always map the same way (a schematic mapping example follows this list). The CREAM and glexec developers are involved but there is no solution yet; will discuss with the IN2P3 people. Gavin: don't have information on the LFC issue but will follow up with the people responsible for the LFC.
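
For context on the mapping discussion above, a schematic example of the kind of FQAN-based account mapping involved (illustrative only - file names, account names and the exact LCMAPS plugin configuration differ per site and middleware version, and this is not GridKa's actual setup):

    # groupmapfile: map the VOMS FQAN to a group (schematic)
    "/lhcb/Role=lcgadmin" lhcbsgm
    "/lhcb/*"             lhcb

    # grid-mapfile: map the same FQAN to an account; a leading dot denotes a
    # pool-account prefix, a bare name would be a single static account
    "/lhcb/Role=lcgadmin" .lhcbsgm
    "/lhcb/*"             .lhcb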

Sites / Services round table:

  • CNAF - Confirm that there was a problem with the ATLAS StoRM back-end overnight, probably due to a bug in lcmaps (string handling in the Java VM) which caused an application crash - investigating. The service was restarted this morning. Ale: is there a way to catch this problem and restart the service if it reoccurs? A: yes.
  • RAL - ntr
  • IN2P3 - Yesterday ALICE reported that gridftp was not available on the VO box; installation is underway. However, installation of gridftp is not part of the MoU between ALICE and the site. There is a gridftp service in the CREAM CE; we don't quite understand the need, but will install it.
  • BNL - ntr
  • NDGF - upgrade of dCache today. Seems to work well but some SAM tests still not working. Not directly related to SE. Investigating...
  • KIT - Currently in downtime - going quite well and expect to finish in time. Else ntr. Ale: LFC and FTS - are they back? Angela: the LFC should have been down only for a few minutes while a router was rebooted; FTS is not yet up - file servers are still being updated; channels from T2 to * are continuously available. Ale: waiting for confirmation on the LFC, as the whole cloud was put offline; will re-include the cloud.
  • PIC - ntr
  • NL-T1 - Chimera migration still moving forward; a few minor glitches but still in progress. The vendor of the storage system is using the downtime to upgrade firmware on the RAID controllers and on the 2TB disks. NIKHEF: on Jan 12 new disk servers were shut down due to a very high failure rate. The problem is now being resolved - restoring disk servers to service: 6 of 8 are now operational and work is ongoing on the last 2.
  • ASGC - observe some failures - details to follow.
  • CERN - FTS 2.2: a cron job has been put in place that snapshots credentials, so that if the problem occurs again there is at least some trace (see the sketch below). Info from FNAL also received.
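
As background to the cron-job workarounds mentioned for the FTS 2.2 delegation issue (here and in Monday's CMS report), a minimal sketch of a periodic proxy-lifetime check; this is not the actual CMS or CERN script, and the threshold and re-delegation step are assumptions:

    import subprocess
    import sys

    THRESHOLD = 6 * 3600   # warn when under six hours remain (assumption)

    out = subprocess.run(["voms-proxy-info", "--timeleft"],
                         capture_output=True, text=True)
    timeleft = int(out.stdout.strip() or 0)

    if timeleft < THRESHOLD:
        # In the real workaround a fresh proxy would be created and re-delegated
        # to the FTS server at this point (e.g. from the 06:00 cron job).
        print("proxy about to expire (%ds left) - re-delegation needed" % timeleft,
              file=sys.stderr)
        sys.exit(1)
    print("proxy ok: %ds left" % timeleft)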

Release report: deployment status wiki page

AOB:

Thursday

Attendance: local( Nicolo, Jean-Philippe, Oliver, Roberto, Jan, Jacek, Andrea, Harry,Julia, Alessandro, Simone, MariaD, MariaG, Jamie, Dirk);remote(Jeremy/GridPP, Angela/KIT, Michael/BNL, SIP, Pepe/CMS, Roger/NDGF, Ros, Rolf/IN2P3, Ronald/NL-T1, Gareth/RAL, Jason/ASGC, Graeme/ATLAS).

Experiments round table:

  • ATLAS reports - (Graeme) FZK came back, normal ops in the German cloud. Also INFN-T1 re-enabled. Today: problem with the ASGC FTS, all jobs rejected (GGUS #54664); a stale proxy was cleared. NIKHEF storage back up, but file loss being chased. The LIP-COIMBRA FTS channel from PIC is inactive (GGUS #54673).

  • CMS reports - (Nicolo) FZK after downtime: SAM test still failing due to job submission failures; checking. T1s: still backfill test submissions. The merge job failures reported yesterday were not site related. Problems in CREAM CE backfill tests at PIC: files not available although dCache reports them online. IN2P3: first CREAM jobs successful. FNAL: stress test of the site - farm full with backfill jobs concurrent with transfers from CERN to check for possible interference. A few tickets for transfer failures from CNAF. Progress in the repack at ASGC: now cleaning up left-over files. Two new T2 sites: Pakistan in transfer test, Krakow will begin tests soon.

  • ALICE - (Reported by Lola after the meeting): 2nd CCIn2P3 ALICE VOBOX entering production today in order to put the local CREAM-CE in production. Small issues with some T2 sites being followed with the site admins. No important issues to report in terms of the MC production

  • LHCb reports - (Roberto) Despite low activity, a few problems. CERN: an out-of-date Persistency interface to the LFC led to one user overloading the RO-LFC server. The Persistency interface should be updated - in discussion with Persistency development. May need to add more LFC h/w and increase the number of threads. Can we get the new ACL scheme into place? FZK: the lcgadmin VOMS role is not correctly mapped to the *sgm account via the CREAM CE at GRIDKA; this problem is not present at other sites. CNAF: StoRM/GPFS problem - will open a GGUS ticket as soon as all information has been collected. Angela: the CREAM issue was investigated with other config options but no success yet.

Sites / Services round table:

  • Angela/KIT: after the downtime, 2 problems (but not related to the downtime); the static info of the CE was changed and enabled again after a reboot.
  • Michael/BNL: ntr
  • Roger/NDGF: dCache upgrade worked OK.
  • Rolf/IN2P3: ntr
  • Ronald/NL-T1: SARA firmware upgrade completed; last steps of the Chimera migration. NIKHEF: will be down on Tue 19 for an upgrade of BDII and batch; queues will be drained on Monday.
  • Gareth/RAL: disk server problem yesterday evening for LHCb: waiting for spare parts; had seen fsprobe errors for some time. Outage scheduled for Tue/Wed in the GOCDB: proposal to delay it by one week, for two reasons: 1. the LFC would be down (a problem for the ATLAS tutorial), 2. more testing is required to prepare the intervention.
  • Jason/ASGC: following ATLAS ticket
  • Massimo/CNAF: the LHCb SAM test failures were due to a glitch - now solved by the storage group; more details from the storage group will follow tomorrow.
  • Pepe/PIC: ATLAS FTS channels problem and CREAM CE errors are being investigated

  • Jan/CERN: scheduled CASTOR T3 non-transparent upgrade for Tue.
  • Jacek/CERN: ALICE DB down for firmware upgrade (since yesterday) - some upgrade problems were encountered, but all ALICE sub-systems are now using the standby DB. Hope to get the main DB cluster back by tomorrow afternoon.

AOB:

  • MariaD: With the GGUS release due on February 3rd, can we please conclude on the regular ALARM test day of the month? Details in https://savannah.cern.ch/support/?111475#comment8 . Discussion on the best time slot: proposal e.g. the Tuesday after the GGUS release, or perhaps shortly after the release.
  • Michael: gridftp only transfer tests in the US: evaluate SE without SRM. Gavin's help for access to log files requested.
  • Simone: US ATLAS requested native gridftp (no SRM) tracked in savannah. This affects WLCG MOU and may need discussion: Markus Schulz has been contacted to follow this up.

Friday

Attendance: local(Nicolo, Jan, Miguel, Graeme, JPB, Alessandro, Lola, Harry, Eva, Giuseppe, Patricia, Roberto, Simone, Dirk);remote(Onno/NL-T1, Jason/ASGC, Jens/NDGF, Gonzalo/PIC, Angela/KIT, Michael/BNL, Rolf/IN2P3, Gang/ASGC, Gareth/RAL, Roger/NDGF).

Experiments round table:

  • ATLAS reports - (Graeme) SARA out of downtime; resumed functional test data transfers to dCache - so far so good. UKI-SOUTHGRID-OX-HEP back in production after the Xmas a/c failure. Re. the proposed upgrade of CASTORATLAST3 on Monday, ATLAS would like a "pre-mortem" before it takes place. Miguel: should discuss what goes into the risk assessment form, but do the T3 upgrade without the form, otherwise there would be too much delay. Fine for ATLAS.

  • CMS reports - (Nicolo) KIT: failures in CMS SAM test submission to KIT fixed; tonight some Maradona errors in SAM tests. CCIN2P3: some backfill test job failures on the CREAM CE, under investigation; 'FTP Door' errors in transfers from T0 and FNAL to CCIN2P3 - an issue in the new space configuration, fix applied by the site admins. CNAF: massive failures tonight in backfill test jobs on the CREAM CE, reason 'Batch System lsf not supported'; CNAF admins are taking care of it. FNAL: rare reprocessing error in opening files, previously also seen at other dCache sites (CCIN2P3); the error message is unhelpful ('Error OK'), might deploy a custom library with increased debugging. Rolf: the CREAM CE issue is due to a config problem introduced with the security patch; a more detailed report will be given on Monday.

  • ALICE reports - (Lola) T0: no problems to report; production through the local CREAM-CE system ongoing with no remarkable issues. CCIN2P3: the CREAM-VOBOX is back in good shape after the reinstallation of the gridftp server inside the VOBOX. A high load on this VOBOX was reported yesterday evening to Renaud; due to this high load (already reported by Renaud to the experts at the site) it has not been possible to put the VOBOX into production. It has however been fully configured and registered in LDAP. MC production ongoing in the rest of the T1 sites. Madrid

  • LHCb reports - (Roberto) There are fewer than a hundred jobs running for the stripping production launched two days ago: these are jobs that did not manage to run at GridKA and CNAF yesterday. A few hundred more in total counting user jobs. T0 issues: LFC read-only problem: any follow-up on the request to add at least one more machine behind LFC-RO at CERN? (JPB: yes, looking into this.) DIRAC is now equipped to support the geographically associated T1 LFC-RO for catalogue queries. Any update on a possible slot to intervene on the LHCb Castor stagers and bring in the new privileges schema? (Miguel: LHCb can choose between Jan 26 or 27 - will be done jointly with the Castor upgrade.) GRIDKA: CREAM CE mapping - any news? (Angela: in contact with IN2P3 but not yet solved.) GRIDKA: noticed yesterday that GridKA was also stalling all jobs: either a problem with the dcap server (a tURL was returned but no data read) or with the shared area. This morning resubmitted jobs were running, and some ran to completion. Anything known to have happened yesterday afternoon on the site side? (Angela: a WN without an active automounter - fixed.) GRIDKA: one GGUS ticket about the shared area being unreachable. (Angela: problems on one of the file servers for the s/w area may have caused overload on the remaining server.) CNAF: confirmed that the problem yesterday was due to GPFS not being available. Problem fixed.

Sites / Services round table:

  • Onno/NL-T1: migration to Chimera completed successfully. Monday: downtime to move to a new kernel (11:00-12:00).
  • Jason/ASGC: working on ticket for job submission problem
  • Jens/NDGF: FTS downtime on Tue (12-15)
  • Angela/KIT: problem with s/w area for ALICE: fixed by node reboot
  • Gonzalo/PIC: next Tue 9:00 intervention on cooling system: will reduce to 400 job slots during intervention. At noon all worker nodes will be available again.
  • Rolf/IN2P3: downtime on 25th-26th (batch DB server will be changed to avoid recent incidents). On 26th storage at risk: no tape access due to hpss maintenance
  • Gareth/RAL: FTS DB problem (logged as an unscheduled outage). The large intervention with batch drain will now be on Jan 27th. Disk servers for LHCb: one server went back into production (checksums checked); one (D1T1) is still out, but all files have been migrated to tape.
  • Michael/BNL: ntr

  • Eva/CERN: ALICE online DB: the disk array firmware has been upgraded; will switch back over to the production cluster at the end of this afternoon.

AOB:

  • WLCG Tier1 Service Coordination calls - agenda for the first meeting on Thursday 21 at 16:00. MariaG: medium- to long-term view for services: DB planning, applications issues (CORAL/Frontier), data management issues.

-- JamieShiers - 07-Jan-2010
