Week of 091019

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

No reply has been received from OSG related to this ticket (GGUS 52283) yet.

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Jamie, Harry, Jean-Philippe, Patricia, Roberto, Andrew, Simone, David, Gang, Olof, Dirk, MariaD, MariaG, Alessandro, Julia);remote(Gonzalo, Michael, Ron, Angela, Gareth).

Experiments round table:

  • ATLAS (Simone)- loss of files at 2 sites: NLT1 (180 files lost after a drive destroyed a tape) and RAL. RAL provided a list of 200k files that need to be cleaned from all catalogues. JPB will provide an SQL query to get the list of lost files from the LFC as a crosscheck. The Lyon "hot disk" endpoint is failing all the time. A ticket has been opened, as this space contains conditions data and DB releases.
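
    A minimal sketch of the kind of LFC crosscheck query mentioned above, assuming the standard LFC/CNS schema (a Cns_file_replica table with host and sfn columns) and a hypothetical read-only account and SE hostname; it is illustrative only, not the actual query JPB will provide.

        # Illustrative sketch only: dump the SFNs of all replicas registered in
        # the LFC for one storage element, to crosscheck against the site's
        # lost-file list.  Assumes the standard LFC/CNS schema (Cns_file_replica);
        # the account, DSN and SE hostname below are hypothetical placeholders.
        import cx_Oracle

        LFC_DSN = "lfc_reader/secret@lfc-db"      # hypothetical read-only account
        LOST_SE = "srm-atlas.example-t1.org"      # hypothetical SE hostname

        def replicas_at_se(dsn, se_host):
            """Return the storage file names (SFNs) of all replicas at se_host."""
            conn = cx_Oracle.connect(dsn)
            try:
                cur = conn.cursor()
                cur.execute("SELECT sfn FROM Cns_file_replica WHERE host = :host",
                            host=se_host)
                return [row[0] for row in cur]
            finally:
                conn.close()

        if __name__ == "__main__":
            for sfn in replicas_at_se(LFC_DSN, LOST_SE):
                print(sfn)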

  • ALICE (Patricia)- beginning the migration of the VO boxes to SLC5 and gLite 3.2. At CERN Alice is using mainly SLC5. ce130 and ce131 are ok (2000 jobs now). cream202 had problems this morning.

  • LHCb reports (Roberto)- Very successful Monte Carlo run during the weekend: from Friday around 19:00 until Sunday afternoon production ran at a steady plateau. For the disk server move to SLC5, LHCb prefers a single intervention slot for all disk servers in a pool; agreed to have an 8-hour intervention for that on 29th October. There is most probably a problem in the ranking expression used in DIRAC, as CNAF reports too little activity and KIT too much (see the sketch below). 60 files not accessible at CNAF. Space tokens full at KIT and PIC: KIT will add 10 TB this afternoon; nothing will be added at PIC, as PIC already provides more resources than pledged.
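
    As an illustration of the kind of ranking expression being discussed (not DIRAC's actual formula), the sketch below ranks sites by an estimate of free capacity derived from the running and waiting job counts each site publishes, which shows how wrong published numbers leave one site starved and another flooded. All numbers and site names are hypothetical.

        # Illustrative sketch of a site-ranking expression (not DIRAC's real one):
        # rank each site by an estimate of free capacity derived from the running
        # and waiting job counts it publishes; jobs are preferentially sent to the
        # highest-ranked site.  All numbers and site names are hypothetical.

        def rank(running, waiting, max_running):
            """Higher is better: free slots minus a penalty for the waiting backlog."""
            free_slots = max(max_running - running, 0)
            return free_slots - 2 * waiting   # waiting jobs weighted more heavily

        published = {
            # site: (running, waiting, max_running) as published by the site's CEs
            "LCG.CNAF.it":   (800, 5000, 2000),   # inflated 'waiting' => ranked too low
            "LCG.GRIDKA.de": (500,    0, 1000),   # missing 'waiting' => ranked too high
            "LCG.PIC.es":    (900,  200, 1200),
        }

        for site, numbers in sorted(published.items(),
                                    key=lambda item: rank(*item[1]), reverse=True):
            print(site, rank(*numbers))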

Sites / Services round table:

  • Gonzalo (PIC): NTR
  • Michael (BNL): NTR
  • Ron (NLT1): Tomorrow, Oracle RAC will be moved to a new infrastructure and Sara will move to SL5.
  • Angela (KIT): Too many jobs sent by LHCb over the weekend (20k); they needed babysitting. Better to schedule such load outside the weekend. LHCb will work on improving the ranking algorithm.
  • Gareth (RAL): nothing more to report today
  • Gang (ASGC): DB supposed to be resynced this week but Maria says that the DB is still not in good shape. Maria will send a mail to Jason with the error message.

  • Olof: Sophie wants to apply a patch from Massimo to the CREAM CE.
  • MariaG: migration of Atlas DB to new hardware is being done. LHCb DB was done in the morning. CMS DB will be done tomorrow and WLCG on Wednesday. The problem on Saturday was due to a problem at the pit.
  • MariaD: will send a mail to OSG: one team ticket (GGUS 52283) was not picked up by OSG.
  • Julia: the Dashboard DB for LHCb will be moved tomorrow.

AOB:

Tuesday:

Attendance: local(Daniele, Gang, Wei, Olof, Sophie, Jean-Philippe, Harry, Jan, Gavin, Jamie, MariaG, Andrew, Julia, Ricardo, Simone, Flavia, MariaD, Patricia, Roberto);remote(Gonzalo, Ronald, Michael, Angela, Gareth, Jeremy, Fabio).

Experiments round table:

  • ATLAS (Simone)- 2 T1s, PIC and NDGF, are down for less than 12 hours, so they were not taken out of production; this of course produces some failures. UAT (User Analysis Test) has been postponed to 28-30th October, as not enough users were available this week. Agreed to have the SRM upgrade at CERN tomorrow.

  • CMS reports (Daniele)- The CMS Analysis "October Exercise" was officially declared over on Oct 19th at 9am Geneva time. There are some issues installing the software at CNAF (not really understood) and IN2P3 (rpm db corruption); both problems are being addressed. There are also problems at RAL (being addressed) and FNAL (CMS local contact informed), permission problems at a Russian Tier2, and corrupted files at Caltech. Finally there are some transfer issues between Tier2s and Tier1s.

  • ALICE (Patricia)- not in full production today, no issue to report. Registration of VO boxes for the proxy renewal service should be addressed to px.support@cern.ch.

  • LHCb reports (Roberto)- About 10k jobs for simulation and user analysis. No major issue at CERN. dCache being upgraded at PIC. Problem with the publication of CEs at IN2P3 and KIT; this does not fully explain the problems seen last weekend, so LHCb is working on a smarter ranking algorithm. 60 files are in an inconsistent state at CNAF, so they can be neither accessed nor deleted.

Sites / Services round table:

  • Fabio (IN2P3): The problem of incorrect publication of the CEs was related to the resources on SL5. Currently testing the dCache golden release. If everything ok, plan to put it in production on 9th November.
  • Gonzalo (PIC): Nothing to report
  • Ronald (NLT1): Oracle 3D migrated to new hardware. Still a few problems in accessing some channels: work in progress. Sara migration to SL5: work in progress.
  • Michael (BNL): Nothing to report. Just a comment: the GGUS ticket mentioned by Maria yesterday was not for BNL but for Boston University (problem being addressed).
  • Angela (KIT): not convinced that the incorrect publication in BDII was the cause of the overload during last weekend. Working on a fix to correctly publish information even in case of high load.
  • Gareth (RAL): still investigating both the hardware issue with the disk servers and the DB recovery problems. Catalin working on getting from LFC the list of CASTOR lost files.
  • Jeremy (GridPP): Nothing to report
  • Gang (ASGC): introduced a new person (Wei) who will participate in the Atlas and WLCG daily meetings representing ASGC.

  • Jan: SRM upgrade at CERN on 29th October.
  • Sophie: LFCs (except for LHCb), VOMS and FTS services will be down tomorrow morning for 90 minutes for a DB upgrade.
  • Julia: the Dashboard migration that took place today went well.
  • MariaG: successful migration of CMS cluster today. WLCG cluster will be done tomorrow. Resync of Atlas, LHCb and LFC DBs after the intervention at Sara. DB still not accessible at ASGC (probably a configuration problem)

AOB:

Wednesday

Attendance: local(Harry, Simone, Andrea, Roberto, Daniele, Gang, Wei, MariaG, Ikuo, Andrew, Ewan, Maarten, Lola, Edoardo, Alessandro, Olof);remote(Jason, Gareth, Angela, Michael, Ron).

Experiments round table:

  • ATLAS (Simone)- Yesterday around 18:30: problems getting files from Castor at CERN. Problem quickly investigated and fixed by the CASTOR operations team. SRM upgraded at CERN for Atlas. This morning: LFC, FTS and VOMS DB upgrade: minimal impact. Problems getting data from ASGC; Jason says that is due to a DB corruption. As it will take several hours to recover, ASGC has been taken out of production by Atlas. Ewan says that for yesterday's problem at CERN a hot fix was applied and a proper fix will follow in the next couple of days.

  • CMS reports (Daniele)- The CMS T0 team reported being unable to reach any of the SLS monitoring pages beyond the main status page for any of the CMS CASTOR instances. This was a problem with SLS and not with CASTOR, and seems to be fixed now. Important for CMS: they rely on this. There is a one-week delay for the migration of a specific file at CERN (being investigated). IN2P3 has fixed the installation problem reported yesterday. The problem at CNAF in installing one CMSSW release is still being investigated; the installation of other releases at CNAF is postponed until the problem with that release is understood. The installation at other sites is OK and continues unaffected by this. A few permission problems at a Russian Tier2 and at Purdue. The corrupted files at Caltech have been deleted. The dCache pool at Beijing is overloaded. Ewan reports that there was a problem with Remedy mail handling. Alessandro reports that the monitoring for the Atlas central catalogue stopped working for a day but seems to be ok now.

  • ALICE (Patricia by mail)- continuing the migration of the local VO boxes at the sites; this operation will be one of the central tasks of Alice for the next 2-3 weeks, and our operations tasks are concentrating on helping the sites with this migration. For the moment sites are migrating smoothly and we do not see any major issue with the new service (gLite 3.2 VOBOX) at any site.

  • LHCb reports (Roberto)- 20k jobs running concurrently. One issue at CNAF with 60 files in a strange state; being investigated by local staff and CERN CASTOR support. LHCb successfully ran stripping jobs yesterday at all Tier1s, so the dCache file access problem (stuck connections) seems to be understood and fixed.

Sites / Services round table:

  • Gareth (RAL): working on post-mortem report. List of CASTOR lost files has been produced using the LFC. Cleanup is ongoing. Actually there was almost no real data loss, as most of the files had a copy elsewhere.
  • Angela (KIT): A server needed a reboot. CE publication in the Information System is now working correctly.
  • Jason (ASGC): It will take about 6 hours to restore the CASTOR DB.
  • Michael (BNL): Incident on the Conditions DB last night; a reboot was needed. There was significant activity from both production and users, but also a huge number (3500) of idle connections to the DB, which made the server run out of memory. BNL is getting help from the Atlas DB expert (Gancho). MariaG reports that the COOL reader account was used for most of these idle connections. Problem being investigated (see the connection-handling sketch after this list).
  • Ron (NLT1): The migration of SARA to SL5 went smoothly. NIKHEF will upgrade to SL5 next week. The Oracle DBs were moved to new hardware.
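
    As context for the idle-connection problem above, a minimal sketch of the client-side discipline that avoids leaving sessions hanging when a job's payload fails: the connection is always released, even on error. The connect string, table and query are hypothetical; this is not the actual ATLAS/COOL client code.

        # Illustrative sketch: always release the conditions-DB session, even when
        # the job's payload fails, so failed jobs do not leave idle Oracle
        # connections behind.  The connect string, table name and query are
        # hypothetical; this is not the actual ATLAS/COOL client code.
        import cx_Oracle

        CONDDB = "cool_reader/secret@conddb"   # hypothetical read-only account

        def read_conditions(run_number):
            conn = cx_Oracle.connect(CONDDB)
            try:
                cur = conn.cursor()
                cur.execute("SELECT payload FROM conditions_iov WHERE run = :run",
                            run=run_number)
                return cur.fetchall()
            finally:
                # A crash in the caller must not leave the session idle: a few
                # thousand such sessions can exhaust the DB server's memory.
                conn.close()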

  • Ewan: the CMS LFC endpoint has been moved from a dedicated endpoint to the shared instance. Problems with the central BDII. Probably due to the old version installed at CERN. A workaround has been put in place waiting for the installation of the new SL5 nodes with recent BDII software. CASTOR SRM was upgraded. There was a problem with WMS/ICE at CERN yesterday afternoon.
  • Maarten: Some Atlas HammerCloud test failures could also be due to the BDII problem (size of the BDII internal buffer). A new node may be available tomorrow.
  • MariaG: successful migration of the DBs this morning between 10:30 and 11:50. It is now possible to connect to the ASGC DB. An agreement is needed between ASGC, CERN and BNL on a possible date for the resynchronization, as transportable tablespaces from BNL to ASGC will be used for that.
  • Edoardo: a new link to BNL will be put in production tomorrow. The intervention should be transparent. Then there will be load balancing between the two links.

Release report: deployment status wiki page

AOB:

  • Conference call problems: Apologies for those who had problems to connect. The “usual trick” of resetting the conference details seems to have worked.

  • (MariaDZ) News from Arvind Gopu (OSG) on Monday's issue:
We have assigned the appropriate GOC ticket (7654) to the GGUS ticket
52283. We apologize for the late response but the GGUS ticket appears to
have triggered a GOC ticket creation but somehow the GOC ticket never
finished. I am unable to find logs dating back to Oct 12th at this time,
so I can't debug any further.
Hopefully this winter, with a newer ticket exchange mechanism in place,
we will have fewer problems like this.

Thursday

Attendance: local(Jamie, MariaD, Jean-Philippe, Gang, Andrew, Ikuo, Miguel, Jan, Simone, Sophie, Edoardo, Andrea, Alessandro, Roberto, MariaG);remote(Michael, Jeremy, Ronald, Angela, Gareth, Jason, Luca).

Experiments round table:

  • ATLAS (Ikuo)- still suffering from the new CASTOR SRM at CERN. Low latency. No ordinary user can replicate data: the new SRM uses the public pool, which is closed to ordinary Atlas users. Waiting for CASTOR at ASGC to come back; Jason reports that the local team is still investigating, the CERN support team will probably be needed, and the downtime has been extended to tomorrow. Jan proposes 2 options: either roll back to the previous CASTOR SRM version, or put a patch on the current version, as the problem is understood. Atlas (Kors) prefers to roll back, as the detector is currently very active and Atlas users will want to access data. Jamie says that it is good anyway to roll back as an exercise, and in parallel the new version can be installed on PPS for tests. This means an interruption of at least one hour, but Atlas prefers this solution as the data has to be replicated not only at Tier1s but also at calibration sites.

  • CMS reports - apologies from Daniele. Clash with another meeting.

  • ALICE - no report

  • LHCb reports (Roberto)- Monte Carlo production now over. Trying to improve the ranking expression; this will be tested in the coming days. At CNAF, there is still the issue of the 60 files in a strange state: they can be neither deleted nor accessed. Stefano is working on it.

Sites / Services round table:

  • Michael (BNL): follow-up on the Conditions Database access problem reported yesterday: most of these connections were due to jobs running at Tier2s and accessing the DB remotely (many jobs, as US Tier2s have a lot of CPUs). The jobs were failing and left the DB connections hanging. Originally only Tier1s were supposed to run this type of jobs, but the limits have now been set such that Tier2s can also run them. The memory on the DB servers has been reconfigured; currently BNL sees 1600 connections and the jobs run fine.
  • Ronald (NLT1): Squid servers for Atlas have been setup and are ready to use.
  • Jeremy (GRIDPP): Nothing to report
  • Angela (KIT): 2 issues: 1) a server had to be rebooted again as the problem reoccurred 45 minutes ago; the problem seems similar to the one seen at the beginning of the year. 2) the SRM upgrade will be on 27th October because the UAT exercise starts on 28th. Roberto proposes to have only 2 queues at KIT, one for short jobs and one for very long jobs.
  • Gareth (RAL): Nothing special to report but he has provided the link to the SIR report documenting the problem at RAL with the DB hardware.
  • Jason (ASGC): still waiting for information from support to know whether the DB recovery will imply data loss. Jason thanks Maria and her team for the very good support on the DB problem.
  • Luca (CNAF): the problem with the installation of the CMS software is still not understood (one week now). Both the local CMS contact and the central CMS experts have been contacted; maybe Daniele could help speed up a solution? Luca asked whether FTS 2.2 should be installed; Jamie replied that FTS 2.2 is on hold until Christmas. Luca asked whether the wiki page documenting the recommended middleware versions is up to date; Jamie will check. For CMS, CASTOR has been replaced by StoRM at CNAF, which is running fine.
  • Gang (ASGC): nothing to report

  • Miguel (CERN): CERN site will be at risk on Monday between 09:00 and 10:00 as a new constraint will be added to the DB used by the CASTOR Name Server.
  • MariaG: the migration of the DBs to new hardware has been successfully completed. Next Tuesday the ASGC DB will be resynchronized using table spaces from BNL. The intervention will probably take place in the afternoon.
  • Edoardo: the new link to BNL has now been up for 2 hours and is working well. Please report any problem.

AOB:

  • Alessandro: a new SAM test has been written to test Frontier/Squid for Atlas. It is not a critical test, but it is documented in the Atlas Wiki page and it would be nice if sites could check the results.
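
    A minimal sketch of the kind of check such a test performs, assuming only that the site squid is reachable over HTTP as a proxy in front of a Frontier server; the proxy and server URLs are hypothetical placeholders and this is not the actual ATLAS SAM test.

        # Illustrative sketch of a Frontier/Squid check: send a simple HTTP request
        # through the site squid towards a Frontier server and verify that an
        # answer comes back.  The proxy and server URLs are hypothetical
        # placeholders; this is not the actual ATLAS SAM test.
        import urllib.request

        SITE_SQUID   = "http://squid.example-site.org:3128"       # hypothetical site squid
        FRONTIER_URL = "http://frontier.example.org:8000/atlr"    # hypothetical Frontier server

        def check_squid(proxy, url, timeout=10):
            """Return True if the Frontier server answers through the site squid."""
            opener = urllib.request.build_opener(
                urllib.request.ProxyHandler({"http": proxy}))
            try:
                with opener.open(url, timeout=timeout) as resp:
                    return resp.status == 200
            except OSError as exc:
                print("check failed:", exc)
                return False

        if __name__ == "__main__":
            print("Frontier/Squid OK" if check_squid(SITE_SQUID, FRONTIER_URL)
                  else "Frontier/Squid FAILED")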

Friday

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

  • Most European countries return to Winter time on Sunday 25th October: http://www.timeanddate.com/time/dst2009.html.
    • The US and Canada "fall back" on Sunday 1 November.
    • Taiwan: "no DST in 2009".
  • These meetings continue at 15:00 Geneva time!

-- JamieShiers - 14-Oct-2009
