Week of 091116

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The SCOD rota for the next few weeks is at ScodRota

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Daniele, Jean-Philippe, Jamie, Patricia, Roberto, Gang, Harry, Dirk, Miguel, David, Julia, Alessandro, Simone, Andrea, MariaD);remote(Kyle(OSG), Michael, Gareth, Angela, Jason, Daniele (CNAF), Fabio, Ron).

Experiments round table:

  • ATLAS (Alessandro)- Quiet weekend. A couple of bursts of errors with CASTOR at CERN, hopefully fixed by the SRM upgrade. The ATLAS stager at CERN was moved to new hardware; the intervention was not transparent. The Dashboard was down as announced, but other monitoring services such as SLS were up, so it was still possible to monitor the service.

  • CMS reports (Daniele)- The follow-up on the problem of low-quality transfers between CERN and FNAL is slower than expected. Disk-to-disk copy at CERN is very slow and needs follow-up. The bad performance of MC production, connected to a Condor bug, is still being investigated. Shallow retry, supposed to recover some of the failures, actually seems to make things worse. We do not know if the WMS is responsible for some of the failures. Slow transfers to RAL. Globus error at Legnaro after an intervention on the CE; needs to be investigated with Globus experts. 4 Tier2s have problems with the BDII. Miguel commented on the problems at CERN: the first problem is not easy to understand (it may be linked to an SRM upgrade?); the second problem was due to disk-to-disk copies done in parallel and should be OK now.

  • ALICE (Patricia)- Quiet weekend. The SRM upgrade this morning went well. The CREAM CEs are not working at CERN (GGUS 53286). The pending registration of DNs for the MyProxy service has been done. ALICE will publish the list of DNs that need to be authorized; this will help checks by FIO.

  • LHCb reports (Roberto)- Active weekend: MC production, stripping jobs at Tier1s and DST redistribution across Tier1s. No major issues. One GGUS ticket for Lyon and a few issues with shared areas at Tier2s.

Sites / Services round table:

  • Michael: The cause of the SRM exceptions was found by the developers, thanks to the analysis done by BNL. The developers know how to fix it; the fix could be of interest to other sites.
  • Gareth: Tomorrow there will be an outage for FTS and MyProxy (two hours including drain). The UPS maintenance has been cancelled, but a UPS test will take place on Thursday. Nothing else to report.
  • Angela: one CE is offline due to disk failures, but jobs are still arriving. Alessandro and Roberto will check why this CE is still being targeted.
  • Jason: 40 cartridges had incorrect labels. Such tapes are automatically disabled by CASTOR, and the tapes are being relabeled. Robot mount errors have been fixed by recalibrating the second frame.
  • Fabio: NTR
  • Daniele (CNAF): NTR
  • Ron: SARA will be in scheduled maintenance on Wednesday for a network upgrade plus a dCache upgrade (to enable tape protection). The intervention will take at most the full day, but most probably only a few hours. The ALICE VO box will be moved to new hardware.
  • David: PIC is only reachable via GEANT because of a fibre cut between Madrid and Barcelona.
  • Julia: migration of the Dashboard DB for CMS tomorrow.
  • MariaD: NTR
  • Kyle (OSG): NTR

AOB:

Tuesday:

Attendance: local(Olof, Miguel, Jean-Philippe, Daniele, MariaD, Roberto, Jamie, Julia, Gang, Alessandro, Harry, Simone, Nicolo, Wei, MariaG);remote(Angela, Gonzalo, Jason, Jeremy, Daniele(CNAF), Michael, Ronald).

Experiments round table:

  • ATLAS (Alessandro)- Two GGUS tickets were incorrectly assigned by the ATLAS shifters: one for a burst of 1200 errors in a few minutes when transferring to TRIUMF; one assigned to Boston University because the BU SRM does not publish its information in the BDII, which causes FTS failures (see the BDII query sketch after this round table). Michael says that work is in progress and the problem should be fixed in the next few days. ASGC is back in Cosmic Data distribution and will get 5% of the data; they will get standard ESDs and AODs as well. Tonight there will be a raw data taking test for collision data; the test will send 600 MB/s to all Tier1s. BNL will get all ESDs for Cosmic and Collision data. Because of the high data rate, ATLAS needs more disk servers installed at CERN, preferably before the end of this week. Simone will distribute the list of tags by the end of this week.

  • CMS reports (Daniele)- There will be a follow-up meeting just after the WLCG daily meeting to discuss transfer quality between CERN and FNAL. MC production is affected by infrastructure: a Condor problem (losing communication). A patch is available (version 6.8) but needs recompilation on SL3 (???). As some sites are suffering (France, Italy and Estonia for example), CMS has set the retry count to 2. For the Tier1 issues, there is no progress and no update from Legnaro. Some Tier2-to-Tier2 transfer problems.

  • ALICE (Patricia)- (by mail) ALICE is continuing the reconstruction production at the T0, where they are using LCG-CE resources only. The CREAM CEs at CERN are down and a ticket was submitted yesterday: GGUS 53286. There has been no update on the status of this ticket from the experts. Could ALICE get an update on this issue?

  • LHCb reports (Roberto)- High activity (13000 concurrent jobs) but no major problem. Disk space problem at SARA, but disks have been added (problem solved).
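
Regarding the Boston University case above (an SRM not publishing in the BDII and hence breaking FTS transfers), such publication can be checked by querying a top-level BDII over anonymous LDAP. The sketch below is not from the minutes; it assumes the Python ldap3 library, the lcg-bdii.cern.ch top-level BDII alias on port 2170, GLUE 1.3 attribute names (GlueService, GlueServiceType, GlueServiceEndpoint) and a purely hypothetical "*bu.edu*" endpoint pattern for the BU SRM.

    # Minimal sketch: ask a top-level BDII which SRM services matching a site
    # pattern are published (assumptions listed in the paragraph above).
    from ldap3 import Server, Connection, ALL

    # Top-level BDII (assumed alias); BDIIs serve GLUE data over anonymous LDAP on port 2170.
    server = Server('ldap://lcg-bdii.cern.ch:2170', get_info=ALL)
    conn = Connection(server, auto_bind=True)  # anonymous bind

    # GLUE 1.3 service entries sit under the 'o=grid' base; filter on SRM-type
    # services whose endpoint matches the (hypothetical) site pattern.
    conn.search(
        search_base='o=grid',
        search_filter='(&(objectClass=GlueService)(GlueServiceType=SRM)'
                      '(GlueServiceEndpoint=*bu.edu*))',
        attributes=['GlueServiceEndpoint', 'GlueServiceType', 'GlueServiceVersion'],
    )

    if not conn.entries:
        # An empty result would be consistent with the FTS failures reported above.
        print('No matching SRM service published in the BDII')
    for entry in conn.entries:
        print(entry.GlueServiceEndpoint, entry.GlueServiceVersion)

Since FTS relies on the information published in the BDII to resolve SRM endpoints, an empty result for a site's SRM is enough to explain the transfer failures mentioned in the ATLAS report.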

Sites / Services round table:

  • Angela: NTR
  • Gonzalo: NTR
  • Jason: NTR
  • Jeremy: NTR
  • Daniele(CNAF): NTR
  • Michael: Tomorrow there will be a transparent migration of the Conditions DB. It will take a day, but the DB will be available in read-only mode during that time; this is OK with ATLAS. Eva will be in contact with Carlos to do the migration. The migration is needed because BNL is running out of space in the DB.
  • Ronald: Scheduled downtime tomorrow for the network and dCache upgrades (going from 1.9.4 to the golden release 1.9.5-8).

  • Miguel: problem with a tape robot last night; fixed this morning at 11:00. Data in this robot was inaccessible during that time.
  • Julia: Dashboard migration for CMS is taking longer than expected. Should be completed in the next hour.
  • MariaG: the latest Oracle security patch and the recommended set of patches will be applied this week; the merged patches were only received yesterday. The integration and test DBs are done; the dates for patching the production instances have been agreed with the experiments (see the Service Status Board).

AOB:

Wednesday

Attendance: local(Daniele, Andrea, Jean-Philippe, Eva, Patricia, Roberto, Giuseppe, Miguel, Gang, Olof, Jan, Alessandro, Jamie, MariaD, Harry, Julia, Antonio);remote(Michael, Angela, Gonzalo, Tiju, Jeremy, Daniele, Ron, Jason, Fabio).

Experiments round table:

  • ATLAS (Alessandro)- SARA is down as announced, so no DDM activity there. Problem with the Dashboard configuration file after the migration (fixed). SRM instability at CERN last evening (a GGUS ticket was submitted, but actually no new ticket was needed as the problem is similar to the previous one still being worked on). DB-level contention is being investigated as a possible cause. It seems to be a CASTOR stager issue (messages lost between the stager and the SRM), but a workaround can be put in the SRM. A new SRM release will be available today. Kors will decide if ATLAS wants this new version installed as soon as possible, as the load will increase and the problem seems to be load-related. The current rate of failures is about 5-10 per hour.

  • CMS reports (Daniele)- About the bad performance of the transfers between CERN and FNAL: a meeting took place yesterday and another one will take place tomorrow. The SRM patch should go in today. Progress on the MC issue: a patch from Condor is being tested. There was a Dashboard issue: after the DB migration the Oracle execution plan was not correct (DB experts are working on it). Slow progress on the issues at Tier1s. 49 files have been waiting for migration to tape at CNAF for a few days (being worked on). The CE failure at BARI has been fixed. Too many WMS instances launched at IPHC: problem reported to the developers. ATLAS and CMS will get the same CASTOR SRM version.

  • ALICE (Patricia)- ALICE is waiting for the upgrade of the VO boxes at SARA. Non-working CREAM CEs at CERN (GGUS 53286): ce201 is back in production, but not yet the others. The corresponding bug (48144) has been fixed in CREAM 1.5. Sites that have migrated their VO boxes to gLite 3.2 seem to have problems with publication. A fix has been tested at Bari and will be propagated to the other sites.

Sites / Services round table:

Release report: deployment status wiki page

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

-- JamieShiers - 16-Nov-2009
