Week of 091019

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability: ALICE, ATLAS, CMS, LHCb
SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive

GGUS section

from OSG related to this ticket yet.

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

General Information
CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, Weekly joint operations meeting minutes

Additional Material:

STEP09, ATLAS, ATLAS logbook, CMS, WLCG Blogs


Monday:

Attendance: local(Jamie, Harry, Jean-Philippe, Patricia, Roberto, Andrew, Simone, David, Gang, Olof, Dirk, MariaD, MariaG, Alessandro, Julia);remote(Gonzalo, Michael, Ron, Angela, Gareth).

Experiments round table:

  • ATLAS (Simone) - Loss of files at 2 sites: NLT1 (180 files lost after a drive destroyed a tape) and RAL. They provided a list of 200k files that need to be cleaned from all catalogues. JPB will provide an SQL query to get a list of lost files from the LFC for a crosscheck. The Lyon "hot disk" endpoint is failing all the time. A ticket has been opened, as this space contains conditions data and DB releases.

  • ALICE (Patricia) - Beginning the migration of the VO boxes to SLC5 and gLite 3.2. At CERN, ALICE is mainly using SLC5. ce130 and ce131 are OK (2000 jobs now). cream202 had problems this morning.

  • LHCb reports (Roberto) - Very successful Monte Carlo run during the weekend. From Friday around 19:00 until Sunday afternoon there was a plateau. For the disk server move to SLC5, LHCb prefers to have a single slot for all disk servers in a pool. Agreed to have an 8-hour intervention for that on 29th October. There is most probably a problem in the ranking expression used in DIRAC, as CNAF reports too little activity and KIT too much. 60 files not accessible at CNAF. Space token full at KIT and PIC. KIT will add 10TB this afternoon. Nothing will be added at PIC, as PIC already provides more resources than pledged.
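
The crosscheck mentioned in the ATLAS item above (site-provided list of lost files against the list obtained from the LFC) could be sketched roughly as below. This is only an illustration, not the query or tooling actually used: the function name, the result keys and the example file names are hypothetical, and both inputs are assumed to be plain lists of file names.

```python
def crosscheck(site_lost, lfc_lost):
    """Compare the site's list of lost files with the list obtained from the LFC.

    Both arguments are iterables of file names (hypothetical input format).
    """
    site, lfc = set(site_lost), set(lfc_lost)
    return {
        "confirmed": sorted(site & lfc),  # reported by the site and found in the LFC list
        "site_only": sorted(site - lfc),  # reported by the site but absent from the LFC list
        "lfc_only": sorted(lfc - site),   # in the LFC list but not reported by the site
    }


# Hypothetical example entries, for illustration only.
result = crosscheck(
    ["/grid/atlas/a", "/grid/atlas/b"],
    ["/grid/atlas/b", "/grid/atlas/c"],
)
print(result)
```

Entries that appear in only one of the two lists are the ones that would need a closer look before any catalogue cleanup.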

Sites / Services round table:

  • Gonzalo (PIC): NTR
  • Michael (BNL): NTR
  • Ron (NLT1): Tomorrow, Oracle RAC will be moved to a new infrastructure and SARA will move to SL5.
  • Angela (KIT): Too many jobs sent by LHCb over the weekend (20k). Needed babysitting. Better to do this outside of the weekend. LHCb will work on improving the ranking algorithm.
  • Gareth (RAL): nothing more to report today
  • Gang (ASGC): DB supposed to be resynced this week but Maria says that the DB is still not in good shape. Maria will send a mail to Jason with the error message.

  • Olof: Sophie wants to apply a patch from Massimo to the CREAM CE.
  • MariaG: migration of the ATLAS DB to new hardware is being done. LHCb DB was done in the morning. CMS DB will be done tomorrow and WLCG on Wednesday. The problem on Saturday was due to a problem at the pit.
  • MariaD: will send a mail to OSG: one team ticket (GGUS 52283) was not picked up by OSG.
  • Julia: the Dashboard DB for LHCb will be moved tomorrow.

AOB:

Tuesday:

Attendance: local(Daniele, Gang, Wei, Olof, Sophie, Jean-Philippe, Harry, Jan, Gavin, Jamie, MariaG, Andrew, Julia, Ricardo, Simone, Flavia, MariaD, Patricia, Roberto);remote(Gonzalo, Ronald, Michael, Angela, Gareth, Jeremy, Fabio).

Experiments round table:

  • ATLAS (Simone) - 2 T1s are down for less than 12 hours: PIC and NDGF. They were therefore not taken out of production, which of course produces failures. UAT (User Analysis) has been postponed to 28-30th October, as not enough users were available this week. Agreed to have the SRM upgrade at CERN tomorrow.

  • CMS reports (Daniele)- The CMS Analysis "October Exercise" has been officially declared over on Oct 19th, 9am GVA time. There are some issues installing the software at CNAF (not really understood) and IN2P3 (rpm db corruption). Both problems are being addressed. There are also problems at RAL (being addressed) and FNAL (CMS local contact informed). There are also permissions problems at a Russian Tier2 and corrupted files at Caltech. Finally there are some transfer issues between Tier2s and Tier1s.

  • ALICE (Patricia) - Not in full production today, no issue to report. Registration of VO boxes for the proxy renewal service should be addressed to px.support@cern.ch.

  • LHCb reports (Roberto) - About 10k jobs for simulation and user analysis. No major issue at CERN. dCache being upgraded at PIC. Problem with the publication of CEs at IN2P3 and KIT; this does not fully explain the problems seen last weekend, so LHCb is working on a smarter ranking algorithm. 60 files are in an inconsistent state at CNAF, so they can be neither accessed nor deleted.

Sites / Services round table:

  • Fabio (IN2P3): The problem of incorrect publication of the CEs was related to the resources on SL5. Currently testing the dCache golden release. If everything is OK, the plan is to put it into production on 9th November.
  • Gonzalo (PIC): Nothing to report
  • Ronald (NLT1): Oracle 3D migrated to new hardware. Still a few problems in accessing some channels: work in progress. SARA migration to SL5: work in progress.
  • Michael (BNL): Nothing to report. Just a comment: the GGUS ticket mentioned by Maria yesterday was not for BNL but for Boston University (problem being addressed).
  • Angela (KIT): not convinced that the incorrect publication in BDII was the cause of the overload during last weekend. Working on a fix to correctly publish information even in case of high load.
  • Gareth (RAL): still investigating both the hardware issue with the disk servers and the DB recovery problems. Catalin working on getting from LFC the list of CASTOR lost files.
  • Jeremy (GridPP): Nothing to report
  • Gang (ASGC): introduced a new person (Wei) who will participate in the Atlas and WLCG daily meetings representing ASGC.

  • Jan: SRM upgrade at CERN on 29th October.
  • Sophie: LFCs (except for LHCb), VOMS and FTS services will be down tomorrow morning for 90 minutes for a DB upgrade.
  • Julia: the Dashboard migration that took place today went well.
  • MariaG: successful migration of the CMS cluster today. WLCG cluster will be done tomorrow. Resync of ATLAS, LHCb and LFC DBs after the intervention at SARA. DB still not accessible at ASGC (probably a configuration problem).

AOB:

Wednesday:

Attendance: local(Harry, Simone, Andrea, Roberto, Daniele, Gang, Wei, MariaG, Ikuo, Andrew, Ewan, Maarten, Lola, Edoardo, Alessandro, Olof);remote(Jason, Gareth, Angela, Michael, Ron).

Experiments round table:

  • ATLAS (Simone) - Yesterday around 18:30: problems getting files from CASTOR at CERN. Problem quickly investigated and fixed by CASTOR operations. SRM upgraded at CERN for ATLAS. This morning: LFC, FTS and VOMS DB upgrade, with minimal impact. Problems getting data from ASGC. Jason says that this is due to a DB corruption. As it will take several hours to recover, ASGC was taken out of production by ATLAS. Ewan says that for yesterday's problem at CERN, a hot fix was applied and a proper fix will be applied in the next couple of days.

  • CMS reports (Daniele) - The CMS T0 team reported being unable to reach any of the SLS monitoring pages beyond the main status page from any of the CMS CASTOR instances. This was a problem with SLS and not with CASTOR. Seems to be fixed now. This is important for CMS, as they rely on it. One week delay for the migration of a specific file at CERN (being investigated). IN2P3 has fixed the installation problem reported yesterday. The problem at CNAF in installing one CMSSW release is still being investigated. The installation of other releases at CNAF is postponed until the problem with that release is understood. The installation at other sites is OK and continues unaffected by this. A few permission problems at a Russian Tier2 and at Purdue. The corrupted files at Caltech have been deleted. The dCache pool at Beijing is overloaded. Ewan reports that there was a problem with Remedy and the handling of mails. Alessandro reports that the monitoring for the ATLAS central catalogue stopped working for a day but seems to be OK now.

  • ALICE (Patricia by mail) - Continuing the migration of the local VO boxes at the sites; this operation will be one of the central tasks of ALICE for the next 2-3 weeks. Our operation tasks are concentrating on helping the sites with this migration. For the moment, sites are migrating smoothly and we do not see any major issue with the new service (gLite 3.2 VOBOX) at any site.

  • LHCb reports (Roberto) - 20k jobs running concurrently. One issue at CNAF with 60 files in a strange state. Being investigated by local staff and CERN CASTOR support. LHCb successfully ran stripping jobs at all Tier1s yesterday, so the dCache file access problem (stuck connection) seems to be understood and fixed.

Sites / Services round table:

  • Gareth (RAL): working on post-mortem report. List of CASTOR lost files has been produced using the LFC. Cleanup is ongoing. Actually there was almost no real data loss, as most of the files had a copy elsewhere.
  • Angela (KIT): A server needed a reboot. CE publication in the Information System is now really working.
  • Jason (ASGC): It will take about 6 hours to restore the CASTOR DB.
  • Michael (BNL): Incident on the Conditions DB last night. Reboot needed. There was significant activity from both production and users, but there was also a huge number (3500) of idle connections to the DB, which caused the server to run out of memory. BNL is getting help from the ATLAS DB expert (Gancho). MariaG reports that the COOL reader account was used for most of these idle connections. Problem being investigated.
  • Ron (NLT1): The migration of SARA to SL5 went smoothly. NIKHEF will upgrade to SL5 next week. The Oracle DBs were moved to new hardware.

  • Ewan: the CMS LFC endpoint has been moved from a dedicated endpoint to the shared instance. Problems with the central BDII. Probably due to the old version installed at CERN. A workaround has been put in place waiting for the installation of the new SL5 nodes with recent BDII software. CASTOR SRM was upgraded. There was a problem with WMS/ICE at CERN yesterday afternoon.
  • Maarten: Some ATLAS HammerCloud test failures could also be due to the BDII problem (size of the BDII internal buffer). A new node may be available tomorrow.
  • MariaG: successful migration of the DBs this morning between 10:30 and 11:50. It is now possible to connect to the ASGC DB. An agreement is needed between ASGC, CERN and BNL on a possible date for the resynchronization, as we will be using transportable tablespaces from BNL to ASGC for that.
  • Edoardo: a new link to BNL will be put in production tomorrow. The intervention should be transparent. Then there will be load balancing between the two links.

Release report: deployment status wiki page

AOB:

  • Conference call problems: Apologies to those who had problems connecting. The “usual trick” of resetting the conference details seems to have worked.

  • (MariaDZ) News from Arvind Gopu (OSG) on Monday's issue:
We have assigned the appropriate GOC ticket (7654) to the GGUS ticket
52283. We apologize for the late response but the GGUS ticket appears to
have triggered a GOC ticket creation but somehow the GOC ticket never
finished. I am unable to find logs dating back to Oct 12th at this time,
so I can't debug any further.
Hopefully this winter, with a newer ticket exchange mechanism in place,
we will have fewer problems like this.

Thursday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

Friday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

  • Most European countries return to Winter time on Sunday 25th October: http://www.timeanddate.com/time/dst2009.html.
    • The US and Canada "fall back" on Sunday 1 November.
    • Taiwan: "no DST in 2009".
  • These meetings continue at 15:00 Geneva time!
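
For remote participants, the effect of the clock changes listed above on the local start time of the call can be worked out with a small sketch like the one below. It assumes Python 3.9+ with time zone data available, and uses the IANA zones "Europe/Zurich" for Geneva, "America/New_York" as one example US zone and "Asia/Taipei" for Taiwan; the function name is hypothetical.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

GENEVA = ZoneInfo("Europe/Zurich")  # Geneva local time


def local_call_time(year, month, day, tz_name):
    """Return the local HH:MM at which the 15:00 Geneva call starts in tz_name."""
    call = datetime(year, month, day, 15, 0, tzinfo=GENEVA)
    return call.astimezone(ZoneInfo(tz_name)).strftime("%H:%M")


# Before the European change, between the two changes, and after the US change.
for day in [(2009, 10, 23), (2009, 10, 26), (2009, 11, 2)]:
    print(day,
          local_call_time(*day, "America/New_York"),
          local_call_time(*day, "Asia/Taipei"))
```

Under these assumptions the call moves one hour later for US East Coast participants during the week of 26th October only, and one hour later for Taiwan from 26th October onwards.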

-- JamieShiers - 14-Oct-2009

Topic revision: r10 - 2009-10-22 - JeanPhilippeBaud
 