Week of 091026

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Ricardo, Jamie, Patricia, Roberto, Dirk, Harry, Ikuo, Simone, Alessandro, Flavia, Jean-Philippe, David, Eva, Lola, Maria, Olof (chair));remote(Gonzalo, Michael, Angela, Onno, Daniele, Gareth, Brian, Gang, Fabio, Jason).

Experiments round table:

  • ATLAS - (Ikuo) Observed a problem in transfers from BNL to other sites (T1s and T2s). No problem writing into BNL. See the BNL report below for details. RAL has been put in production: for the moment only for the transfer tests, but soon for full production.

  • CMS reports - (Daniele) Tier-0 issue with file not migrating last week is fixed now. A Remedy ticket for the SLS problem is still being investigated by the support line. Tier-1: CNAF problem fixed. The ASGC CASTOR corruption has still not been recovered and there are a few open tickets related to it. Maria: do you have a GGUS ticket number for the Russian Tier-2 issue? Daniele: no, there is only a Savannah ticket for the moment; will open a GGUS ticket.

  • ALICE - (Patricia) problem reported on Friday concerning the CREAM CE configuration at CNAF-T1 was solved in the evening. voalice06 is out of warranty and the services running on it are being migrated to voalice13.

  • LHCb reports - (Roberto) No scheduled activity going on. DIRAC was using an out-of-date SAM client: this has been fixed and the Tier-1s can now trust the results. The two LCG-CEs ce132 and c133 should be submitting to SLC5 at CERN but were incorrectly configured: fixed now. CNAF issue with the RDST and RAW space tokens: GGUS ticket 52400. IN2P3 problem with jobs exceeding memory: the problem for LHCb is that the BQS system does not correctly report the run time to the pilot job, causing it to pull another payload, after which the job gets killed with a confusing 'not enough memory' error. Fabio: do you have a GGUS ticket id for the problem? Roberto will check offline. How do the LHCb jobs get the time left from the batch system? The job queries the batch system for the queue length and calculates the time left.
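
The time-left calculation described in the LHCb report can be sketched roughly as follows. This is only an illustration and not the DIRAC implementation: the function names, the safety margin and the payload estimate are all assumptions introduced here. The example numbers show how an under-reported run time can lead a pilot to accept a payload it should have refused.

```python
def time_left(queue_limit_s, elapsed_s, safety_margin_s=600):
    """Seconds a pilot may still use, given the queue's run-time limit.

    All inputs are assumptions of this sketch: queue_limit_s is the limit
    obtained from the batch system, elapsed_s is the run time the batch
    system reports for the pilot, safety_margin_s is a reserve.
    """
    return max(0, queue_limit_s - elapsed_s - safety_margin_s)


def can_pull_payload(queue_limit_s, elapsed_s, payload_estimate_s,
                     safety_margin_s=600):
    """True if another payload is expected to fit in the remaining time."""
    remaining = time_left(queue_limit_s, elapsed_s, safety_margin_s)
    return remaining >= payload_estimate_s


# Hypothetical numbers: a 24 h queue, a pilot that has really run for 22 h,
# and a payload expected to take 3 h.
QUEUE_LIMIT = 24 * 3600
PAYLOAD = 3 * 3600

# With the correct run time the pilot refuses the payload.
correct = can_pull_payload(QUEUE_LIMIT, 22 * 3600, PAYLOAD)
# If the batch system reports only 2 h, the pilot wrongly accepts it.
underreported = can_pull_payload(QUEUE_LIMIT, 2 * 3600, PAYLOAD)
print(correct, underreported)
```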

Sites / Services round table:

  • PIC (Gonzalo) NTR
  • BNL (Michael) found the SRM problem: it was suffering from a 'catalina'-related error, a Java exception for accessing an array out of bounds. This continued for several hours, and after restarting the SRM cell the service seems to be recovering. Still have to figure out what happened.
  • KIT (Angela) Two items:
    • ATLAS dCache update to 'Golden release'.
    • On Saturday there was a tape robot issue affecting reading from tape (writing was not affected because another library could be used). Fixed now.
  • NL-T1 (Onno) several items:
    • NIKHEF part of NL-T1 has upgraded today to SL5. The UI is up and running as well as most of the WNs.
    • The SARA SRM had a problem with BDII early on Friday morning: the BDII service had stopped, which caused the SRM service to drop out of the BDII system.
    • For ATLAS: planned downtime tomorrow for LFC-ATLAS will have to be postponed to next week.
    • Also for next week: the SARA SRM dCache version is to be upgraded to 1.9.4-4.
  • RAL (Gareth) this morning we upgraded the CASTOR SRMs for ATLAS and LHCb to version 2.8-2. The CMS and GEN instances are planned for tomorrow.
  • ASGC (Jason)
    • CASTOR problems continue: several attempts to restore the DB were made during the weekend. Olof: can you please provide a written report? Jason: will do. Jamie: is there a possibility for a "cold" restart? Dirk: that's probably possible, but it would be best to first understand why the restore is failing. It would be good if ASGC could provide more details.
    • SL5 migration is ongoing
  • IN2P3 (Fabio) target date November 9th for the dCache update to 1.9.5-4 (Gold release)
  • CERN:
    • DB service @ CERN (Eva): ATLAS downstream db problem during the weekend, due to an Oracle bug causing high CPU load and full swap. Have started to deploy the Oracle October security patch on the test dbs this week. The ASGC DB sync using streams is scheduled for tomorrow afternoon.
    • Ricardo: srm-atlas upgrade to 2.8-2 planned for tomorrow.

AOB:

Tuesday:

Attendance: local(Andrew (SA3), Ricardo, Jan, Olof (chair), Daniele, Gang, Jhen Wei, Lola, Jean-Philippe, Harry, Nick, Eva, Maria);remote(Gonzalo, Angela, Ronald, Jeremy, Gareth, Jason, Fabio).

Experiments round table:

  • ATLAS - (written input submitted by I Ueda before meeting) Apologies that both Simone and I cannot attend the meeting today. We are in the process of putting RAL back into ATLAS production, but not fully yet, because we observe small issues now and then. We hope the site performance will stabilize so that we can put it back in full production. Concerning the CASTOR SRM upgrade at CERN, we have not seen any issue after the upgrade this morning.

  • CMS reports - (Daniele) closing the issue of the SAM infrastructure not submitting to the Indian Tier-2. SLS issue at CERN: no further update to the ticket but CMS can confirm it is working ok now. IN2P3: issue with 9 corrupted files on tape. ASGC: ongoing CASTOR issue, which also affects the Taiwan Tier-2. Jeremy: concerning the Imperial College issue, did you get any response? Daniele: yes, Stuart answered, reporting a dCache issue.

  • ALICE - (Lola) testing in voalice06 this afternoon may result in alarms. The operators have been informed.

  • LHCb reports - Apologies from Roberto

Sites / Services round table:

  • RAL (Gareth)
    • problem with CASTOR yesterday afternoon: unscheduled outage of 1 hour. Caused by poor db performance, which in turn seems to be due to suboptimal placement of Oracle processes on the RAC servers.
    • Upgraded the remaining SRM endpoints to 2.8-2.
  • PIC (Gonzalo) NTR
  • GridPP (Jeremy) NTR
  • NL-T1 (Ronald)
    • SARA made some changes to tape system, which caused some problem with the robotics during the second part of the night.
    • Failing SAM tests for LHCb this morning: WN upgraded to gLite 3.2 yesterday does not contain the CERN VOMS server cert anymore and it seems the LHCb SAM jobs depend on it: see GGUS ticket 52694.
  • IN2P3 (Fabio) NTR
  • ASGC (Gang)
    • Early this morning some users received 1000s of mails from ASGC. The local administrator killed the offending process.
    • CASTOR db issues (Jason): some input received from the DB support list. Continuing the attempts to restore the db. Phone conference tomorrow morning with CASTOR and DBA people at CERN. Before the phone conference Jason will submit written input listing the recovery actions attempted so far.
  • KIT (Angela)
    • update of ATLAS dCache instance this morning. In principle ok but there is still some issue with root owned directories (also reported by PIC for an older version).
    • planning to update FTS next week on Wednesday morning. One hour outage.
  • CERN (Ricardo, Jan):
    • Noticed yesterday that LHCb jobs were frequently querying the batch system with bqueues. Since the output from bqueues is constant, at least for the duration of the job, it was recommended to LHCb to only query once.
    • Ongoing LCG-CE upgrade campaign. At the same time more CEs are being pointed to the SLC5 resources.
    • srm-atlas update to 2.8-2 went fine. Will update the CASTOR client to 2.1.8-13 on srm-alice and srm-cms tomorrow (transparent update): this fixes an issue with slow srmRm. srm-lhcb will be updated to 2.8-2 on Thursday.
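
The recommendation to LHCb in the CERN report (query bqueues once per job, since its output does not change during the job) amounts to caching the result. A minimal sketch, assuming a hypothetical QueueInfoCache class: the runner argument is introduced here only so that the cache can be exercised without an LSF installation, and bqueues is invoked without any options.

```python
import subprocess


class QueueInfoCache:
    """Run the queue query at most once and reuse its output afterwards."""

    def __init__(self, runner=None):
        # runner is any zero-argument callable returning the query output;
        # by default the real bqueues command is executed.
        self._runner = runner if runner is not None else self._run_bqueues
        self._output = None

    @staticmethod
    def _run_bqueues():
        # Plain invocation with no options; raises if the command fails.
        result = subprocess.run(["bqueues"], capture_output=True,
                                text=True, check=True)
        return result.stdout

    def get(self):
        # Only the first call reaches the batch system.
        if self._output is None:
            self._output = self._runner()
        return self._output


# Usage with a stand-in runner that counts how often it is invoked.
calls = []

def fake_bqueues():
    calls.append(1)
    return "placeholder queue listing"

cache = QueueInfoCache(runner=fake_bqueues)
first = cache.get()
second = cache.get()
print(len(calls))  # the stand-in was invoked once
```

The cache deliberately has no expiry, matching the statement that the output is constant for the duration of the job.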

AOB:

Wednesday:

Attendance: local(Ikuo, Gang, Wei, Nick, Lola, Antonio, Maria G, Jan, Edoardo, MariaDZ, Simone, Olof (chair));remote(Onno, Angela, Tiju, Fabio, Jason).

Experiments round table:

  • ATLAS - (Ikuo) srm-atlas upgrade went fine, although some small instabilities with timeouts were seen yesterday. A ticket was opened with CASTOR support. It seems to be an intermittent problem. RAL is now in full production. After KIT updated dCache there were FTS failures to find the SE. There was also a problem with directory permissions, the same as seen at PIC. Fabio: any information on the UAT exercise? Ikuo: no, it is supposed to start today but there is no information. There is a meeting later today.

  • CMS reports - Apologies from Daniele who cannot attend because of the "All-CMS" meeting

  • ALICE - (Lola) Tests in voalice06 yesterday will continue also this afternoon. There may be some operator alarms triggered.

  • LHCb reports - Apologies from Creig

Sites / Services round table:

  • ASGC:
    • Maria: a restore was launched this morning and will finish soon. It will be followed by a recovery, which will tell us whether the database can be opened. If so, we will have the name server recovered up to October 13th, and will then iterate up to the 20th, when the db was corrupted. If the recovery is not successful, we will have to consider tomorrow morning whether the database cannot be rescued and we have to start a new instance from scratch. The phone conference tomorrow morning at 09:15 CET = 16:15 Taipei time is confirmed.
    • Jason: for the 3D database, the replication from OSG (BNL) is ongoing. More news tomorrow.
  • NL-T1 (Onno)
    • WMS issue at SARA. Being investigated
  • KIT (Angela)
    • Confirm the problem with dCache directory ownership reported by ATLAS. It is being investigated
  • RAL (Tiju)
    • Problem with the GEN instance affecting the (ALICE) SRM service. Being investigated.
  • IN2P3 (Fabio) NTR
  • CERN (Olof):
    • CASTORLHCB intervention tomorrow
    • Update of CASTOR client on srm endpoints ok this morning
    • Ongoing tests with xrdcp with x509 certificates
    • hardware move for LHCb LFCs tomorrow morning, resulting in a 30-minute downtime for the read-only LFC
  • Middleware (Antonio) gLite 58 announced today.
  • Databases (Maria G): applying the latest security patches to the integration/validation environments for ATLAS, CMS and LHCb.

Release report: deployment status wiki page

AOB: (MariaDZ) USAG meeting tomorrow 2009-10-29 @ 9:30 am CET. Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=72278

Thursday:

Attendance: local(Jamie, Maria, Lola, Jean-Philippe, Ricardo, Ikuo, Simone, Olof (chair));remote(Ronald, Angela, Tiju, Fabio, Jeremy, Jason, Brian).

Experiments round table:

  • ATLAS - (Simone) User Analysis Test (UAT) has started. Users run physics analysis worldwide using the grid. Will stop tomorrow.
    • An xrootd problem at CERN yesterday, which was fixed within a couple of hours. Olof: the xroot redirector had been left with debug logging enabled, which caused a filesystem to fill up unexpectedly fast.
    • A new problem started today with ATLASDATADISK on srm-atlas at CERN.

  • CMS reports - Apologies from Daniele.

  • ALICE - (Lola) continuing the alarm testing with voalice06, which will hopefully finish this afternoon. The services will then be moved from voalice06 to the new machine voalice13.

Sites / Services round table:

  • IN2P3 (Fabio): NTR
  • NL-T1 (Ronald)
    • Failing LHCb SAM test reported last Tuesday: ATLAS has also reported a similar problem. It seems to involve VOMS certificates generated from the UI at lxplus using the old lcg-voms.cern.ch, which should be pointing to the same voms.cern.ch. Olof: will check with the VOMS support.
  • KIT (Angela)
    • New dCache patch installed on the ATLAS instance. Hopefully it will fix the problem seen with root-owned directories.
  • RAL (Tiju) NTR
  • GridPP (Jeremy) NTR. Brian: looking into an issue with UK Tier-2 data shares.
  • ASGC (Jason):
    • Streaming service: recovering archive log mode.
    • CASTOR service: starting with new hardware for 3 db instances (one each for name service, stager and DLF). Maria: up to an hour ago, Luca and Jacek did not have access to the new machines? Jason: the six nodes are empty and will be installed similarly to the streams db servers.
  • CERN DB (Maria G):
    • BNL -> ASGC import successfully completed yesterday. Streams replication has been re-enabled and is working.
    • CASTOR db recovery @ ASGC: have managed to recover the database up to the 21st of October, which is beyond the first corruption. Simone: for the further cleaning, will CASTOR clean up by itself? A phone conference is scheduled for tomorrow morning at 09:15 CET = 16:15 Taipei time.
  • CERN prod (Ricardo): castorlhcb intervention is ongoing.

AOB:

Friday:

Attendance: local(Gang, Wei, Lola, Simone, Jean-Philippe, Jan, Dirk, Ricardo, Maria, Julia, Olof (chair));remote(Christopher (KIT), Jeremy, Fabio, Michael, Gareth, Onno, Jason, Daniele).

Experiments round table:

  • ATLAS - (Simone) NTR

  • CMS reports - (Daniele) Tier-1 issues:
    • ASGC: ongoing CASTOR issue. Gang: recovery ongoing. CASTOR will hopefully be ready for custodial storage again early next week.
    • RAL: new ticket for data not going to tape
    • IN2P3: Frontier squid issues. Fabio: noticed there was a problem with one of the squids. A log had filled the filesystem and log rotation was activated. Will now check the other Frontier server.
    • CNAF: problem with link to Caltech

  • ALICE - Apologies from Patricia who cannot attend but submitted a written report:
    1. Yesterday afternoon a new AliEn (ALICE software) distribution was announced to all sites. This new software version has been distributed to all sites automatically from the ALICE central services at CERN and transparently to the sites. It is however "advisable" that site admins check the good behaviour of the local AliEn services at the voboxes.
    2. This morning ALICE reported that they cannot run more than 700-800 concurrent jobs at CERN on SL5. FIO has already answered: Today we are with somewhat reduced capacity on SLC5 (only ~120 nodes available). However, we have 200 new machines being configured which should be online tomorrow (or Monday at the latest). This will bring the capacity back up to "normal" levels. As an interim solution it has been proposed to use SLC4 resources in the meantime. However, ALICE is already working in SL5-only mode at CERN and will wait until Monday to increase the number of jobs.
    3. A request: several emails have been sent by site admins to px.support@cernNOSPAMPLEASE.ch to register new ALICE voboxes in the myproxy server. Could this process be sped up? It is a show-stopper for the setup of these voboxes: they cannot enter production unless they are registered in myproxy first. The latest request comes from SPBSU in Russia. Jan: there was a backlog of tickets, but as of yesterday evening it should have been handled. There is still a remaining issue to be discussed with ALICE.
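
The Frontier squid incident reported by IN2P3 in the CMS item above was caused by a log filling the filesystem, and the remedy was to activate log rotation. As a generic illustration of size-based rotation, here is a sketch using Python's standard logging module. It is not the squid configuration used at IN2P3; the function name and the limits are hypothetical.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_rotating_logger(path, max_bytes=1_000_000, backups=3):
    """Return a logger whose files on disk stay bounded.

    At most backups + 1 files are kept (path, path.1, ...), and the
    current file is rolled over before a record would take it to
    max_bytes or beyond.
    """
    handler = RotatingFileHandler(path, maxBytes=max_bytes,
                                  backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger = logging.getLogger("rotating-sketch:" + str(path))
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger
    logger.addHandler(handler)
    return logger
```

With such a handler the disk space used by the log is bounded by roughly max_bytes times (backups + 1), however long the service runs.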

Sites / Services round table:

  • KIT (Christopher)
    • dCache patch installed yesterday solves the root owned directory problem
  • BNL (Michael)
    • Incident on one of the storage arrays while the system was rebuilding. Apparently a control problem. The array has now been put in a non-operational mode for rebuilding the RAID sets. This means that there is currently a lack of access to 32TB in MCDISK.
    • There will be a major outage of all Tier-1 services on Tuesday. Work on core components of the local network, replacing switches. All ATLAS services affected for the whole day.
  • IN2P3 (Fabio) NTR
  • RAL (Gareth): GGUS ticket about slow transfers PIC -> RAL. The problem went away at ~11pm last night, probably after some intervention on the OPN.
  • GridPP (Jeremy)
  • NL-T1 (Onno)
    • The SARA compute cluster Torque server crashed. This caused all running jobs (mostly ATLAS) to fail. The jobs may need to be resubmitted.
    • On Monday there is scheduled downtime for the SARA SRM for dCache update.
    • On Tuesday: SARA scheduled downtime for network, lfc-atlas and FTS
  • ASGC
    • Jason: continuing to set up the 3 database servers. The CERN DBAs have been given access to the nodes. The aim is to restore the CASTOR service early next week.
    • Maria: the recovery has now reached Oct 26th at 4am. The export has completed and is now being imported into the CERN test cluster so that we have a copy here. It will be imported to the new db servers at ASGC as soon as the configuration has finished. The installation looks ok and we now need to decide on the RAC configuration and how to do the CASTOR configuration. The suggestion is to do it similarly to the CNAF setup.
  • CERN DB (Maria): what is the status of the dashboard cleanup for ATLAS and CMS? Julia: it has not finished yet but will continue during the weekend and hopefully complete by early next week. We have to check with CMS when the stop can be done.
  • CERN (Jan): the myproxy server will be upgraded next week (move to new hardware). The update should be transparent, but we would like to negotiate a 'good' date in case it goes wrong.

AOB:

-- JamieShiers - 22-Oct-2009

Topic revision: r8 - 2009-10-30 - OlofBarring
 