Week of 090504

WLCG Baseline Versions

WLCG Service Incident Reports

  • Service incident report for the robotic library outage that occurred on April 25th at CC-IN2P3.
  • Service incident report for the tape backend outage that occurred on May 4th at SARA (NL-T1).

GGUS Team / Alarm Tickets during last week

Weekly VO Summaries of Site Availability

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Harry, Julia, Jean-Philippe, Dirk, Alessandro, Olof, Sophie, Nick, Ewan);remote(Gang, Angela, Daniele, Michael, Michel).

Experiments round table:

  • ATLAS reports - Several issues over this long weekend: 1) The machine submitting functional tests and inserting monitoring data into SLS got stuck and was only fixed a few hours ago. Apparently an incorrect quattor template removed the arc command (used for privileged afs commands). ATLAS was nevertheless able to monitor the 4 sites receiving the current cosmics run data. 2) The prodsys dashboard stopped working. ATLAS experts were contacted but did not have sufficient instructions, so it was only restarted this morning. To be followed up. 3) The SRM endpoint of the AGLT2 (Michigan) muon calibration Tier-2 site has been very unstable and there is no agreement for out-of-hours support, so ATLAS need to establish better procedures. 4) There are many FTS transfer timeouts on large (5 GB) files and ATLAS would like some expert advice. DB said CMS have seen such problems and have been looking at increasing timeouts when only a few percent of a file remains to be transferred and data is still being moved (a rough sketch of this idea follows the experiment reports below).

  • CMS reports - DB emphasised that the problem of large files at PIC (solved by deleting the datasets) needs a policy rethink by CMS. These files are input to reprocessing, where the output files are larger than the input, which contributes to the problem. A ticket was submitted to afs support a week ago with no response yet; Olof will follow this up in FIO. CMS are soliciting input from middleware, sites and the other experiments on planning the move to SLC5 and plan to raise this at the next GDB (May 13). Nick reported that worker node middleware is available for SLC5.

  • ALICE -

  • LHCb reports - The report includes an interesting presentation of the results of the LHCb Analysis Tests at Tier-1s, running a user application over large amounts of input data (up to 25 TB/day).

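The timeout extension DB mentioned in the ATLAS item above can be pictured with a short sketch. This is illustrative Python only, not FTS code: the progress callback, thresholds and polling interval are all hypothetical stand-ins for whatever the transfer service actually measures.

  # Illustrative only (not FTS code): keep extending a transfer deadline while
  # data is still flowing and only a small tail of the file remains.
  import time

  def wait_for_transfer(poll_progress, total_bytes,
                        base_timeout=3600,    # seconds before the first deadline
                        tail_fraction=0.05,   # "only a few percent remains"
                        poll_interval=60):
      """poll_progress() is a hypothetical callback returning bytes done so far."""
      deadline = time.time() + base_timeout
      last_done = 0
      while True:
          done = poll_progress()
          if done >= total_bytes:
              return True                         # transfer finished
          if time.time() > deadline:
              still_moving = done > last_done
              nearly_done = (total_bytes - done) < tail_fraction * total_bytes
              if still_moving and nearly_done:
                  deadline += poll_interval       # grant a short extension
              else:
                  return False                    # genuine timeout, give up
          last_done = done
          time.sleep(poll_interval)

The point is simply that a transfer which is 98% done and still moving is not treated the same way as one that has stalled outright.
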
Sites / Services round table:

IN2P3 - message to WLCG from D.Boutigny, Director: CC-IN2P3 has been impacted by a major cooling problem during the weekend. Yesterday evening one of our chillers went down due to an overpressure alarm. This problem triggered a chain reaction in the cooling system and all the chillers went down. Of course the temperature went up very fast in the computer room; we even had temperatures rising to 50 degrees near the disk units before we had time to shut everything down. The cooling system was restored during the night and we are now in the process of restarting the services. We do not know yet how much broken hardware we will have to fix. I will keep you informed of the evolution of the situation.

ASGC: Have had CASTOR errors due to the well-known BigID problem (understood but not yet fixed). File export to PIC has completed, but a few files are stuck in transfer to FNAL. Alessandro asked when they might be ready to restart running the ATLAS functional tests; the answer was the end of this week or the beginning of next, but DPM could be tested very soon.

FZK: Have changed several dCache parameters to try to improve stability. A CMS disk-only pool has developed a disk I/O error; this is being investigated.

CERN: 1) The ATLAS CERN WMS are currently experiencing large backlogs due to resubmission of failing jobs from one DN. The cause of the failures is understood but there is not yet a clear idea of how to clear the backlogs; investigations are continuing. 2) There was an outage of the myproxy service from 08.50 till 11.15 this morning. See the post-mortem at https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090504 3) CASTOR upgrades to 2.1.8 take place this week, with ALICE tomorrow morning at 09.00 and ATLAS on Wednesday from 14.00. Each upgrade involves a 4-hour downtime.

AOB:

Tuesday:

Attendance: local(Julia, Nick, Sophie, Harry, Alessandro, Jean-Philippe);remote(with apologies the conference phone link did not work today).

Experiments round table:

  • ATLAS reports - They would like to check the castor configuration at ASGC, notably the setup of the space tokens. Alessandro will email Jason explaining in detail.

  • CMS reports - CMS were running staging tests at CERN where staging stopped for 7 hours then resumed. To be followed up.

  • ALICE -

  • LHCb reports - 6000 jobs running for the physics MC production MC09. Yesterday's CNAF problem was related to the propagation into StoRM of a new batch of local users (pilot users), followed by an ACL problem.

Sites / Services round table:

  • ASGC: ATLAS data deletion on castor started yesterday evening; around 40-50 TB have now been removed.

  • CERN: This morning's castoralice upgrade completed successfully. A new xrootd plugin is scheduled for Thursday. The castoratlas upgrade is scheduled for 14.00 to 16.00 tomorrow.
AOB:

Wednesday

Attendance: local(Harry, Alessandro, Jean-Philippe, Nick, Olof, MariadZ, Roberto);remote(Jeremy, Gang, Daniele, Ian).

Experiments round table:

  • ATLAS reports - By email from Graeme Stewart yesterday: Panda database migration from the Oracle integration service (INTR) to the ATLAS offline production service (ATLR) will happen tomorrow (meaning Wednesday 6 May), which will involve downtime for the panda servers here at CERN. This will put the panda service "at risk" tomorrow afternoon for all clouds except FR and US.

The intervention is scheduled for 1300 UTC (1500 CEST) and is anticipated to last for two hours. After the service is restored the switch of backend database will be transparent to clients. During the intervention the panda monitor will run - read access will continue, but write access for CERN clouds will fail and the data for CERN clouds will become a bit stale. Paul has extended the timeouts slightly on the pilot, up to one hour; however, we do expect some jobs to fail after this as they cannot update their status in the panda server. Running jobs will be unaffected over the anticipated two-hour downtime period. Shifters will see failed jobs appear as lost heartbeats 6 hours after the intervention.

After tomorrow's downtime the final act, the migration of the FR and US clouds, should take place on Monday 11th May.

AdG ATLAS report: 1) We would like to know the ASGC castor pool names and how they map to space tokens. It was agreed he would mail the question to Gang. 2) There was a 1-hour downtime of the LFC at PIC after they installed a new 64-bit front-end. J-PB explained this was because both 32 and 64-bit clients were installed, requiring /lib/lib_uuid and /lib64/lib_uuid respectively, and the 32-bit version was initially missing. This problem does not happen in client versions 1.7.0 and higher. 3) There was a 2-day delay in SARA responding to a ticket following a failure of SRM to export data. Apparently the delay was due to a public holiday. There was a reported 36-hour tape backend problem affecting SRM at SARA which may have been the cause - a SIR has been requested. 4) ATLAS are now training new shifters at the pit prior to restarting daytime shifts in June, so there will be more elog entries and perhaps poor-quality tickets while the training goes on.

Olof queried how ATLAS publicised this afternoon's castoratlas downtime as they were getting tickets from ATLAS shifters and users. AdG agreed shifters should check the IS service status board and will follow up.

  • CMS reports - RS queried what the file-level investigations at CNAF that led to closing tickets were. DB explained that a small amount of disk data had not been migrated to tape when CNAF started their recent long scheduled downtime, and that on restart these data had been lost. DB also explained that the 3 files reported lost at FZK are now ascribed to CMS prodagent file registration workflow problems.

  • ALICE -

  • LHCb reports - 1) MC09 production and merging is ongoing at Tier-1 sites. 2) There was an ACL problem with StoRM at CNAF that was quickly fixed, but we still have problems accessing the returned TURLs this morning.

Sites / Services round table:

  • ASGC: Cleaning of ATLAS data on ASGC castor has finished, but not yet on DPM. Some storage will then be set up and the ATLAS functional tests can be restarted.

  • FZK (by email from X.Mol): as no one from our site (FZK) can join the daily conference, we want to give a statement regarding our status:

1) Stability of the SRM service: the reason for this problem has been identified: several users have a huge number of active GET TURL requests. They probably use a client which does not release finished requests. We are in touch with the dCache developers to find an appropriate, fast solution in order to reduce the impact on production (a sketch of the intended release pattern follows the site reports below).

2) The CMS-specific SAM tests that failed because of one disk-only pool throwing I/O errors are green again, as this pool has been repaired.

  • CERN Castor: RS asked when castorlhcb will be upgraded; Olof explained that they want to run the upgraded castoralice and castoratlas for a week before deciding.

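For reference, the pattern FZK asks clients to follow above amounts to always releasing a prepared GET TURL request once the data has been read. A minimal sketch against a purely hypothetical SrmClient wrapper (not any real SRM client API):

  # Hypothetical SrmClient wrapper -- illustrates the "always release" pattern,
  # not a real SRM client API.
  class SrmClient:
      def prepare_to_get(self, surl):
          """Would issue an SRM prepare-to-get and return (request_token, turl)."""
          raise NotImplementedError

      def release_files(self, request_token):
          """Would release the pinned copy so the pool slot is freed."""
          raise NotImplementedError

  def read_via_turl(client, surl, process):
      token, turl = client.prepare_to_get(surl)
      try:
          process(turl)                   # read the data through the TURL
      finally:
          client.release_files(token)     # release even if the read fails

A client that skips the final release step is what leaves requests piling up on the server side.
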
AOB:

Thursday

Attendance: local(Harry, Sophie, Gavin, Jean-Philippe, Nick, Julia, Steve, MariadZ);remote(Gang, Angela, Jeremy, Gareth, Michel, Daniele, Roberto).

Experiments round table:

  • CMS reports - The main issue is following up on the impact of the partial restoration of services at IN2P3 (see below). Also, a large RAW file is failing repeatedly in transfer from CERN to FNAL. The file size is 50 GB, so it is probably simply too big to be transferred before the 1-hour timeout on the channel (see the back-of-the-envelope check after the experiment reports below). The central teams are leaving it up to FNAL to decide whether they wish to transfer it manually or take some other action.

  • ALICE -

  • LHCb reports - LHCb are getting close to the end of producing the first batch of MC09 minimum-bias events. After quality tests, generation of 10^9 'no truth' events, expected to take some months, will start. Merging jobs are held back at CNAF due to ACL issues with pilot accounts in StoRM.

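As a back-of-the-envelope check of the 50 GB file against the 1-hour channel timeout mentioned in the CMS report (the file size and timeout are from the report above; the arithmetic is illustrative only):

  # Minimum sustained rate needed to move the file inside the channel timeout.
  file_size_gb = 50
  timeout_s = 3600          # 1 hour

  required_mb_per_s = file_size_gb * 1024 / timeout_s
  print(f"~{required_mb_per_s:.1f} MB/s ({required_mb_per_s * 8:.0f} Mbit/s) "
        f"sustained for the full hour")

That is roughly 14 MB/s (about 110 Mbit/s) with no interruption, which makes the 1-hour limit tight for a single file of this size.
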
Sites / Services round table:

IN2P3 - message to WLCG from D.Boutigny, Director: After the cooling incident we suffered last Sunday, CC-IN2P3 is now partially back in operation, but today, while people were powering on more CPU servers, we experienced another cooling system trip; fortunately this time it was possible to keep the system under control without tripping the whole computing center.

By the beginning of June a new 600 kW chiller will be put into operation. As it is impossible to speed up this installation, we will probably have to live with reduced computing power during the coming month. We are studying the possibility of installing and powering a temporary chiller, but even though we have good hopes it is not yet clear whether this is feasible within such a short time scale.

At the moment CC-IN2P3 is running at a little bit more than 1/3 of its CPU capacity. The full storage capacity that was available before the incident is back online.

CERN: Following the update of the CA rpms to version 1.29-1 on Wednesday 6th May 2009, several IT and experiment services were affected by failing client-based certificate authentication: ATLAS Central Catalog, Nagios services, SAM Portal, Gridview Portal. The problem is due to the increased number of certificate authorities (the combined length of the accepted-CA string). The rpms have been rolled back pending a cleanup of the number of authorities, though this is seen as a temporary solution. A post-mortem is available at https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090507
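
A rough way to gauge the "length of the string" problem described above is to sum the subject DNs of the installed CA certificates. A minimal sketch, assuming the usual /etc/grid-security/certificates layout (hash-named .0 files may duplicate .pem entries, so the total is indicative only):

  # Rough gauge of the combined size of the accepted-CA list on a host.
  # Assumes the standard /etc/grid-security/certificates layout.
  import glob
  import subprocess

  CA_DIR = "/etc/grid-security/certificates"

  total_len = 0
  n_certs = 0
  for cert in glob.glob(f"{CA_DIR}/*.pem") + glob.glob(f"{CA_DIR}/*.0"):
      proc = subprocess.run(
          ["openssl", "x509", "-noout", "-subject", "-in", cert],
          capture_output=True, text=True)
      if proc.returncode == 0:
          total_len += len(proc.stdout.strip())
          n_certs += 1

  print(f"{n_certs} CA certificates, combined subject DN length: {total_len} characters")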

FZK: had a transient NFS problem. Would like ATLAS to note that the FORTRAN compiler available on their SL5 worker nodes is gfortran, not g77. In reply to a question from RS on the SRM problems, FZK replied that they have identified 3 users whose applications are not releasing TURLs and are trying to find out which clients they are using.

AOB: MariaDZ is looking for help on gstat from ASGC. Apparently the expert (Joanna) is currently at CERN working with Lawrence Field. Also they need information from CMS on GGUS/CMS ticket interworking (see https://savannah.cern.ch/support/?106911#comment36).

Friday

Attendance: local();remote().

Experiments round table:

  • ALICE -

Sites / Services round table:

AOB:

-- JamieShiers - 28 Apr 2009
