Week of 090629

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Jean-Philippe, Roberto, Ricardo, Jamie, Gang, Alessandro, Maria, Maria, Nick, Diana, Andrew (SA3), Olof);remote(Gareth, Michael, Jeremy, Daniele, Angela).

Experiments round table:

  • ATLAS - (Alessandro) quiet week from an operations point of view. ASGC has been put back in functional tests and replication went well during the weekend, although there are still some issues with Tier-1 -> Tier-2 replications. Another unscheduled downtime at ASGC was noticed today; such downtimes are difficult for VOs to take into account. The four Italian Tier-2s responded well to a centralized request concerning a disk space issue. Cosmic data processing: all data sets are in 'Open' state, which means that they are not ready for export to Tier-1s. Developers are looking into the issue. Estimated time for recovering the backlog is 24-48 hours. Gang: the unscheduled ASGC downtime today was due to a re-cabling of the diskservers. The downtime is over now.

  • CMS reports - (Daniele) as announced last Friday, all Savannah tickets have been put on hold for ASGC and associated Tier-2s. Have closed urgent tickets for IN2P3 after a reminder last week. No major tickets were opened over the weekend. Thanks for the STEP09 feedback on CASTOR and tape operation experience from the CASTOR teams at CERN.

  • ALICE (Patricia after the meeting) - smooth production ongoing with more than 12000 concurrent jobs. Tests with VM WN at CERN evolving well.

  • LHCb reports - (Roberto) as announced to the LHCb community, the DIRAC intervention took place this morning and is still ongoing. The system is almost empty of jobs. Weekend: all transfers to SARA were failing -> ticket opened this morning and the problem was fixed (full disk). Otherwise the glExec validation activity is still ongoing on the preproduction system.

OPN tickets

  • 49859 none in progress 2009-06-29 2009-06-29 07:13 Planned maintenance in the Milano PoP
  • 49855 none assigned 2009-06-28 2009-06-28 20:24 ESnet maintenance possibly affecting all BNL and F...

Sites / Services round table:

  • RAL (Gareth) - in the middle of the move, so no significant change since Friday. CASTOR and Batch still down. They are scheduled to come back a week from today.
  • BNL (Michael) - NTR
  • GridPP (Jeremy) - NTR
  • FZK (Angela) - finished movement of LHCb software area. Should be ok now.
  • CERN (Ricardo) - a couple of tickets were open for the CEs last week. The issues have been fixed and the CEs are back in production.
  • DB services (Maria) - Currently cleaning the queue of the Streams replication of the ATLAS CERN->T1 conditions by recreating the queue. The operation was required because the spilled LCRs were not properly consumed after the re-synchronization of Taiwan last Friday. A service request has been opened with Oracle to follow up on the issue.

AOB:

  • (Maria Dimou) USAG meeting this Thursday 2009-07-02, agenda: http://indico.cern.ch/conferenceDisplay.py?confId=63109 VO participation is important this time as we will analyse the backlog of GGUS tickets assigned to them. This is the escalation report we will use: https://gus.fzk.de/download/escalationreports/vo/html/20090629_EscalationReport_VOSupport.html

Tuesday:

Attendance: local(Jean-Philippe, Roberto, Maria, Olof (chair), Maria D, Sophie, Alessandro, Gang, Simone, Diana);remote(Angela, Derek, Michael, Fabio).

Experiments round table:

  • ATLAS - (Alessandro) Cosmic data replication: the problem reported yesterday was solved yesterday evening and the backlog at Tier-0 is decreasing. Estimate 24-48 hours to recover.

  • ALICE (Patricia, before the meeting) - Smooth production ongoing. Regarding the VM WN tests at CERN, the situation of the 2 observed issues is the following: 1) on the 26th we observed that jobs were blocked while trying to register a tag file in the FZK SE. The problem is understood in the sense that the VMs are on a private network, i.e. have no access to the outside. It is being followed up by FIO for the next series of tests. 2) In addition ALICE has reported that the VMs are apparently reporting bogus numbers for CPU/Wall time utilization. The jobs are reporting 3x more CPU than Wall. In this case LSF reports the correct numbers, while ALICE MonALISA is receiving bogus ones. To be followed up on the ALICE side.

  • LHCb reports - (Roberto) the intervention on the central DIRAC service took slightly longer than anticipated. Service was back in the afternoon. The intervention brought the system into a better state. Production has recovered and is running fine now. Several issues with gLite WMS instances: PIC, SARA and GridKa. Concerning GridKa, the intervention on the shared area yesterday was successful and Roberto can confirm that the software area works ok. glExec tests also ok on GridKa, thanks to Angela. Several Tier-2 sites were publishing CEs not in production (scheduled downtime in GOCDB) -> tickets opened (CEs should not publish when in scheduled downtime in GOCDB). The same problem was seen at CERN last week.

Sites / Services round table:

  • FZK (Angela) NTR
  • RAL (Derek) NTR
  • BNL (Michael) NTR
  • IN2P3 (Fabio) - the HPSS intervention for adding more tape drives was performed on schedule and the service has been back since 11am with the additional tape drives.
  • CERN (Sophie) - CASTOR SRM upgraded for ALICE and CMS this morning.
  • DB services (Maria G):
    • problem mentioned yesterday by ATLAS - PVSS2COOL was blocked. The problem was understood to be a known behaviour of Oracle, which asks for space in the recycle bin when an application requests a bulk insertion operation. The issue was solved by extending the datafile for the PVSS2COOL table. Lesson learned: should review whether ATLAS really needs this recycle bin, or whether we could rely on disk backups. Post-mortem at https://twiki.cern.ch/twiki/bin/view/LCG/PostMortem27Jun09
    • Tomorrow there will be an exercise by ATLAS: switching off the online database to understand how long the online system survives. Announcement:
      tomorrow at noon (GVA time) we will perform a test to check whether 
      ATLAS is capable of continuing data taking if the online database
      (ATONR) goes down. The intervention is scheduled to last at most 90 
      minutes and will affect ANY user of ATONR. Please inform your 
      respective communities.
              Best regards
                               Giovanna Lehmann Miotto

AOB:

Wednesday:

Attendance: local(Roberto, Harry, Antonio, Gang, Nick, Maria DZ, Ricardo, Diana, Fabio, Olof (chair));remote(Derek, Daniele, Fabio).

Experiments round table:

  • ATLAS - (report submitted by Alessandro before meeting) Cosmic data replication going on till the end of this weekend (as foreseen).

  • CMS reports - (Daniele) No major tickets for Tier-1 sites, except for two new ones plus an already open one for ASGC. While this is the quiet week for ASGC and CMS will not push for tickets, Daniele recommends that Gang have a look at the CMS report Twiki. Gang: will do. Tier-2s: several tickets closed for French Tier-2s. Closed tickets for Nebraska, but a new one was opened for a problem seen with Condor errors. Also one French Tier-2 site with more than one ticket.

  • ALICE (Patricia, before the meeting) - last night around 3:00AM ALICE reported the loss of a bunch of VMs hosted on 2 servers. This issue caused a significant loss of jobs (decreasing from around 12000 yesterday to 1200 this morning). The problem has been found and fixed. No significant issues to report regarding sites.

  • LHCb reports - (Roberto) Issue with the CIC portal, where some information got lost in the VO id-card. Being investigated by the CIC people. Nick: was the data recently changed? No. Internal DIRAC problem: 20k jobs were submitted with the same job-id, causing the same random seed to be used for all jobs, and therefore the production was stopped. Another bug in DIRAC caused several LHCb-specific SAM jobs to fail. Site availability reports shouldn't be affected, but it is a problem for the dashboards. Tier-1 issues: GridKa - a root-owned file in the s/w area caused a lot of crashes. SARA: timeout issue (restart RB service tests). SL5 bug in python binding (more details in the LHCb twiki). Fixed now by Remy.

Sites / Services round table:

  • Middleware (Antonio) - 3.1 update 48 is coming to production with the first release of SCAS + glExec. These have been tried in the PPS pilot. The scripts for managing the environment variables are not part of the release. One patch is also important for CREAM - fixing the mapping of accounts. It is recommended to install this fix.

AOB:

Thursday:

Attendance: local(Gavin, Jamie, Nick, Ricardo, Andrea, Alessandro, Gang, Maria DZ, Olof (chair));remote(Daniele, Gareth, Ron, Fabio, Jos).

Experiments round table:

  • ATLAS - (Alessandro) proxy used by DDM tools was upgraded this morning - no issue. Cosmic data taking: backlog in Tier-0 is still huge. Hope the backlog can be recovered quickly when the data taking ends on Monday.

  • CMS reports - (Daniele) standing item for ASGC with tickets postponed. Will resume at the CMS operations meeting next Monday. Strange problem with Savannah where the automatic mail notification is sometimes not received. This slows down the quick troubleshooting cycle. Olof: contacted savannah.support? No, not yet, but will do if/when the problem is confirmed. Also suffering from a problem for RAL and another Tier-2, possibly related to the downtime notification in the status board - being investigated by Stefano. Still some issues with Tier-2s: Nebraska, Estonia, Florida.

  • ALICE (Patricia, after the meeting) - ALICE production continuing smoothly, nothing remarkable to report.

Sites / Services round table:

  • RAL (Gareth) - the move is proceeding as scheduled. Should be back around the middle of Monday next week. On Tuesday next week there will be another outage of ~3 hours due to a site networking intervention, with FTS drained ahead of the start. Would like to stay stable for the rest of that week. The week after there are a few site-at-risk interventions.
  • SARA (Ron) NTR
  • IN2P3 (Fabio) NTR
  • FZK (Jos) - just delivered a site incident report describing the mishaps at the beginning of STEP09. Can now confirm that the tape system is working ok after the changes last week. Work is now ongoing for the next 2-3 weeks on configuring the dCache stagers to take advantage of the tape system improvements.
  • CERN (Ricardo) - a small degradation of CASTORALICE this morning due to a database overload. Problem investigated with developers. Gavin: at-risk for CASTOR SRM on Monday for a patch.

AOB:

Maria (User Support) reported the following:

  1. a meeting took place yesterday July 1st with OSG on direct notification email addresses for FNAL and BNL. Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=62962 Input material: https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#GGUS_to_OSG_routing Savannah ticket: https://savannah.cern.ch/support/?107531
  2. USAG meeting today - no experiment participation, which is a pity because 2 reminders were announced in this meeting over the last 10 days and the item on the agenda was the analysis of pending tickets assigned to the VOs. Please consult https://gus.fzk.de/pages/metrics/download_escalation_reports_vo.php regularly and clean your backlog.
  3. When wishing to bring a supporter's attention to a GGUS ticket please make a comment in the ticket and use the "involve others", "assign to one person" or cc field, accordingly.

Friday:

Attendance: local(Jean-Philippe, Ricardo, Gang, Andrea, Maria, Olof (chair));remote(Gareth, Stephane).

Experiments round table:

  • ATLAS - (Stephane) continuing cosmics run. Small problems with DDM not being robust against intermittent diskserver problems at CERN, when FTS reports files as temporarily unavailable. In contact with the DDM developers.

  • ALICE -

Sites / Services round table:

  • RAL (Gareth) - on track for getting service back on Monday. Currently tracking down various faults on diskservers after the move.
  • ASGC (Gang) - tape drives and servers have been online since yesterday.
  • CERN (Ricardo) - repeat of the small incident we saw for CASTORALICE yesterday.
  • Databases (Maria) - a transparent intervention is scheduled for next Wednesday (8/7) for reconnecting redundant power supplies on the switch to the public network. An announcement listing the affected services has been posted to the CERN computing status board (http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/).

AOB:

-- JamieShiers - 26 Jun 2009

Topic revision: r13 - 2009-07-03 - OlofBarring
 