Week of 091207
WLCG Service Incidents, Interventions and Availability
GGUS section
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
- The SCOD rota for the next few weeks is at ScodRota
Additional Material:
Monday:
Attendance: local(Harry(chair), MariaG, Dirk, Stephane, Ueda, Jean-Philippe, DavidG, Roberto, Patricia, Andrew, Simone, Alessandro); remote(Xavier(KIT), Jens(NDGF), Rolf(IN2P3), Michael(BNL), Gareth+Brian(RAL), Jeremy(GridPP), Kyle(OSG, lost voice contact), Jason(ASGC), Paolo(CNAF)).
Experiments round table:
- ATLAS (SJ) - 1) ATLAS has taken data over the weekend and distributed it with few problems. This also included export to Tier-2 sites on Sunday. Most ongoing problems are related to other activities. 2) On Friday the FTS 2.2 functional tests to the Tier-1s failed with 'could not load client credentials' (GGUS ticket 53887); this has since been fixed. 3) A pending issue for some time is files not accessible at CERN CASTOR, giving 'invalid path' errors (GGUS ticket 53825). Where possible these are recopied from Tier-1s, but some exist only at CERN and more were discovered at the weekend. 4) (SC) There was a burst of SRM timeouts while exporting data from CERN on Friday morning. The developers thought this was contention between the ATLAS scratchdisk and Tier-1 export pools, so higher priority was given to the Tier-1 pool, but it happened again on Sunday when there was no scratchdisk activity. 5) Currently there are timeouts exporting some data from CERN, with CASTOR reporting files to be on tape and disk while SRM reports tape only. Being followed up, but these are thought to be two separate problems.
Following this report Michael Ernst queried whether BNL (and FZK) should continue running FTS 2.2, where they see no local problems, or downgrade to 2.1. Simone thought the weekend problems were not with FTS; more worrying was the previous week, when core dumps were being seen. Dirk Duellmann confirmed that core dumps were also seen in FTS 2.1. Simone said FTS 2.2 has a rare problem, difficult to clear, where a channel agent gets stuck on a particular job. The consensus was for the Tier-1s to stay on FTS 2.2.
- CMS reports - No report - CMS week.
- ALICE - Good quality events using the whole detector were recorded at the weekend, with first-pass reconstruction done at CERN using exclusively the CREAM-CE. Preparation of conditions data for second-pass reconstruction is under way. Four sites (CNAF, KIT, Dubna and Subatech (Nantes)) have taken ESD data from CERN, with their grid resources performing perfectly. On Friday the operational importance of the CERN VO-boxes was increased to 50 (so an alarm will trigger operational escalation) and today CASTOR was upgraded to 2.1.8-17 and SRM to 2.8-5 (both transparently).
- LHCb reports - 1) Interesting weekend with collisions and the magnetic field on: some fully reconstructed data was made available in the LHCb BK for users. LHCb were prepared to receive 1 million collisions but barely got 15,000. Some issues were observed both with the code (Brunel crashes) and with the conditions database (wrong magnetic field). Both are being fixed now and reprocessing will start later today (not a big deal). 2) Following problems observed at Tier-1s, DIRAC will move to what had been decided long ago: copying data locally for reconstruction jobs (see the sketch below). This will guarantee that data can at least be reconstructed. 3) A few MC simulation requests received from the PPG are now running at a low pace in the system. 4) The Tier-0 machine VOLHCB09 again had a swap-full alarm this weekend, preventing data from being accessed through the hosted BK service. The sysadmin found a python process eating the swap; the process has been killed and DIRAC experts are looking into the source of the problem. 5) Observed (and opened a TEAM ticket for) an anomalous delay (10 hours seen) in migrating real data files to tape. LHCb consider this a problem; it triggered a discussion about data-migration policies. Next time we will be more prepared and confident in opening GGUS tickets when we suspect a real problem is happening. 6) SARA: issue accessing raw data (from both NIKHEF and SARA WNs). Also IN2P3: issue accessing raw data, with symptoms similar to those observed at NL-T1. Both use the gsidcap protocol (hence the move to local disk copies for reconstruction jobs). The error (quoting Ron Trompert) seems related to a third-party library that might affect all dCache centres. We shared this information with the various sysadmins via GGUS.
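A minimal sketch of the copy-to-the-WN scheme referred to in point 2) of the LHCb report above, in Python; this is not LHCb's actual DIRAC code, and the use of lcg-cp, the SURL and the paths are illustrative assumptions:

  import os
  import subprocess

  def stage_input_locally(surl, workdir="/tmp"):
      """Copy a raw-data file onto the worker node before opening it,
      instead of reading it remotely over gsidcap.
      The lcg-cp call and the SURL format are assumptions for illustration."""
      local_path = os.path.join(workdir, os.path.basename(surl))
      subprocess.check_call(["lcg-cp", surl, "file://" + local_path])
      return local_path

  # Hypothetical usage: download first, then let the application open the local file.
  # raw = stage_input_locally("srm://srm.example.org/lhcb/data/run123/file.raw")
  # run_reconstruction(raw)

The point of the scheme is that any instability in the remote file-access protocol only affects the copy step, which can be retried, rather than the reconstruction job itself.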
Sites / Services round table:
- KIT: At about 18.00 on Friday a dCache write pool went down, causing 'black hole' behaviour for CMS, but it was fixed promptly.
- NDGF: SAM monitoring broke down over the weekend but production was not affected.
- RAL: Possibly seeing interference between a weekly cron job (04.00 on Mondays) and data flows. Queried if any other sites see similar effects.
- ASGC: Security patching of worker nodes will finish today.
- CNAF: Tomorrow is a public holiday in Italy (Madonna) so CNAF will not attend tomorrow's conf call.
- LHCOPN: 1) Maintenance on the KIT primary link tomorrow from 06.00 for 12 hours. During the maintenance announced by DFN (https://gus.fzk.de/pages/ticket_lhcopn_details.php?ticket=53924), in case of interruption of the link, traffic from/to GRIDKA will use the GRIDKA-IN2P3-CERN path as a backup, with possibly reduced bandwidth. 2) Maintenance at PIC from 23.00 for 7 hours, with possibly reduced bandwidth.
AOB: 1) There were two periods of collisions with stable beams of 4 bunches at 450 GeV over the weekend, but fewer than the hoped-for 1 million events. Currently there is a cryogenics problem expected to be fixed by midnight. 2) One open OSG ticket (see 53915 in GGUS), namely: AGLT2_PRODDISK has 233 errors during transfer_preparation. Michael reported this as having been due to problems with two storage servers that were fixed this morning. 3) ATLAS are making plans for data processing over the Xmas period and ask to know the CERN computer centre planning. It would be useful if all experiments and sites could expose their plans later this week. We suggest an email to wlcg-scod@cern.ch and we will add an information summary page to the minutes pages.
Tuesday:
Attendance: local(Harry(chair), Stephane, Simone, Ueda, MariaG, Jamie, Lola, Jean-Philippe, Dirk, Andrew, MariaDZ, Rob Quick, Julia, Roberto); remote(Xavier(KIT), Rolf(IN2P3), Gareth(RAL), Jeremy(GridPP), Jason(ASGC), Ronald(NL-T1), Michael(BNL)).
Experiments round table:
- ATLAS - 1) The only new incident was a problem with the LFC at IN2P3 from 13.00 till 14.30 today. An alarm ticket was not sent since the shifter was already dealing with it. Transfers to IN2P3 were delayed by 1.5 hours and production there failed. Rolf explained that the problem had been with a DNS server, solved by a reboot but not yet fully understood. Grid services with aliases on this server were affected, including the top BDII of the French cloud. Details will be sent when more is known. 2) xrootd authorisation over X509 has been successfully tested, so this will allow CERN analysis jobs to use xrootd for data access. 3) There is an increasing number of inaccessible CERN CASTOR files, as reported yesterday. Could this be related to the ongoing migration of disk servers from SLC4 to SLC5? 4) The issue of SRM reporting the wrong online/offline status for files is still pending (also reported yesterday). Dirk said that investigation is ongoing, but it is thought this is due to files undergoing disk-to-disk copy (they are reported online if in any disk pool), leading to FTS timeouts. The problem was also present in FTS 2.7. 5) What is the status of pilot roles at CERN? An ATLAS production user is unable to submit jobs to CE103 with the pilot role, while this works if he uses his production role.
- ALICE - The Grid is working like clockwork - the experiment is trusting the system and they have not seen any inconsistency or error last night. There is a small issue with AliRoot seen during the event reconstruction due to an internal bug. A new revision of the software is being built and will be ready by noon today (Tuesday), after which one of the runs will be reprocessed.
- LHCb reports - 1) Reconstruction of the (few) collision data was launched on Sunday, but reprocessing was done only yesterday (in just one hour), after fixing a problem with the Brunel application crashing and once the data conditions had been updated by hand. Everything went like a dream using the scheme of first downloading data to the WN and then opening it locally. 2) Also a big achievement for Moore (the HLT trigger application) and physics stripping (b/c inclusive and minimum bias), which managed to run over all data (100%, except for RAL where data was lost). Merging has been flushed and produced final DSTs. 3) At lower priority (than real-data reconstruction or MC stripping) MC simulation is also running, which keeps non-T1 sites warm. It was discussed at today's TF meeting (and is also visible from the SSB) how roughly 25% of resources can be considered wasted because of an internal problem with LHCb Gaudi applications causing jobs to fail. This has to be addressed by the LHCb core application people. 4) VOLHCB09 (at CERN) had another swap-full alarm and the python process was killed. Suspicion falls on one of the new pieces of code, which has been temporarily disabled. 5) Considering the very low DAQ rate, and in order to demonstrate how quickly the whole system can serve data to end users, the CASTOR team has been asked to relax the tape-migration policy a bit for the remaining days of data taking this year, moving from the current 8 hours to the lowest threshold feasible in CASTOR. 6) The dcap file-access issue at SARA and IN2P3 reported at the weekend has been confirmed to have the same root cause (thanks to Ron and Lionel) and the dCache developers have been alerted.
Sites / Services round table:
- GridPP: Are also interested in Xmas planning. Most of their Tier-2 sites will be on best-efforts support from around 23 December till 4 January.
- ASGC: Repacking and migrating data for CMS to save tape space.
- BNL: 1) Migration of the conditions database to new hardware has been successfully completed with thanks to Eva and Carlos. This was transparent to ATLAS. 2) Urgent work on our power supply is happening over the next 5-6 hours. About 27% of worker nodes, spread evenly across production and analysis usage, have been gracefully shut down.
- CERN dashboard: There are performance problems affecting the CMS Site Status Board. MariaG reported that this was some sort of locking issue.
- CERN SRM: Dirk reported on the SRM timeouts seen by ATLAS last weekend, during which a higher than usual rate of SRM requests, about 1400/minute, was observed. They are trying to understand where the requests were coming from (IP addresses, DNs), as they suspect it was not production usage. Clearly more monitoring is needed in this area (a sketch of such per-client accounting follows this round table).
- CERN physics databases: Currently there is a 1-hour rolling intervention on the ALICE online DB; then on 10 December security patches will be applied to the ATLAS and LHCb downstream capture databases from 10.00 to 12.00.
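As an illustration of the per-client request-rate monitoring mentioned in the CERN SRM item above, a minimal sketch in Python. The log format (one line per request with a timestamp, client IP and DN) and the file name are assumptions, not the actual CASTOR SRM log layout:

  from collections import Counter

  def requests_per_minute_and_dn(log_lines):
      """Count SRM requests per (minute, DN) so that bursts well above the
      normal overall rate (~1400 requests/minute was seen) can be traced
      back to a particular client.
      Assumed line format: "<YYYY-MM-DDTHH:MM:SS> <client_ip> <DN>"."""
      counts = Counter()
      for line in log_lines:
          parts = line.split(None, 2)
          if len(parts) < 3:
              continue                     # skip malformed lines
          timestamp, ip, dn = parts[0], parts[1], parts[2].strip()
          minute = timestamp[:16]          # truncate seconds: YYYY-MM-DDTHH:MM
          counts[(minute, dn)] += 1
      return counts

  # Hypothetical usage against an assumed log file:
  # with open("srm-requests.log") as f:
  #     for (minute, dn), n in requests_per_minute_and_dn(f).most_common(10):
  #         print(minute, n, dn)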
AOB:
- There was a brief discussion on Xmas plans. Current planning is that LHC commissioning will stop on 16 December and resume on 4 February. IT department planning will be published shortly. Various web sites show details of long, medium and short term planning fed by weekly and daily machine and experiment meetings. We will add the most useful links to our permanent set. One idea was to minute short summaries of the daily experiment meeting plans in these minutes.
- (MariaDZ) LHC experiment VO representatives: please join the USAG meeting this Thursday 10/12 at 9:30 in room 28-R-006, where a full fail-safe system for GGUS will be discussed. Please consult the agenda at http://indico.cern.ch/conferenceDisplay.py?confId=73657
Last week's MB agreed on a monthly frequency for the ALARM tests. Progress can be followed via https://savannah.cern.ch/support/?111475
Wednesday
Attendance: local(Harry(chair), Simone, Ueda, Dirk, Nick, Andrew, Jean-Philippe, Olof, Antonio, Miguel, Julia, Giuseppe, Alessandro, MariaG, MariaDZ, Roberto); remote(Gonzalo(PIC), Michael(BNL), John(RAL), Ronald, Jason, Rolf).
Experiments round table:
- ATLAS (IU) - 1) Started using FTS 2.2 for Tier-0 data export yesterday - no problems so far. 2) Errors transferring data to the NDGF SRM this morning, where they had posted a scheduled 'at-risk' rather than a downtime. 3) No more beam expected before tonight.
- ALICE (LSS) - 1) No fresh data was recorded last night, but some of the previous runs were reprocessed with the new AliRoot version. This reprocessing was done using Grid resources at the Tier-0 and they worked perfectly. 2) During the ALICE Task Force meeting this week it will be announced that ANY site which has not migrated its WNs and VO-box to SL5 by the 31st of December will be taken out of production. After the TF meeting this will be announced to the sites via email.
- LHCb (RS) reports - 1) This morning collision09 data was received, reconstructed and migrated to tape in just ~1 hour. 2) MC simulation is running at a very low level (fewer than 1K jobs in the system in total). 3) A permission issue at SARA was found during a clean-up of old test data.
Sites / Services round table:
- BNL (ME): Work on the electrical power infrastructure, announced yesterday, was completed some 2 hours ahead of schedule.
- NL-T1 (RS): 1) Yesterday there were instabilities in a core router; there was a brief interruption last night for a hardware replacement. 2) Newly acquired storage is now ready for production, so NL-T1 is prepared for 2010.
- IN2P3 (RR): Still working on the analysis of yesterday's DNS incident. Today they are starting work on the extension of the computer centre building, scheduled to be completed in February 2011.
- CERN-Castor (MC-S): The CASTORpublic instance has been upgraded to version 2.1.9-3-1. This should be the version deployed for the LHC experiments in January. Simone queried whether all CASTOR disk servers had now been upgraded to SLC5 and whether the upgrade process could explain the invalid-path errors ATLAS are seeing. Miguel replied that about 10% of disk servers are left to do and this is unlikely to explain the errors, but they will look together after the meeting.
- CERN-PhyDB (MG): The CMS Site Status Board performance issues have been fixed, but further work is needed in this area. There is a rolling intervention today on the archive and integration database servers to change motherboard batteries: no downtime, but possible performance degradation.
- CERN gLite (AR): Have decided to postpone the release of the next set of patches into staged rollout until January. Many services are affected so it is prudent to wait. A couple of problems with the current rollout have been reported by sites, one with the SL5 CREAM-CE. Note that the rollout includes a new version of the WMS ICE component that fixes a critical problem for ATLAS (bug 59054). Simone asked if the new release included FTS 2.2.3. The answer was not yet; this will be released to staged rollout in January, with full production release expected by mid-January.
Release report:
deployment status wiki page
AOB:
Thursday
Attendance: local(Harry(chair), Ueda, Simone, Miguel, Jamie, Gavin, Lola, Tim, Jean-Philippe, Olof, MariaG, MariaDZ); remote(Xavier(KIT), Rolf(IN2P3), Gareth(RAL), Michael(BNL), Jason(ASGC), Ronald(NL-T1), Jens(NDGF)).
Experiments round table:
- ATLAS - 1) Could not send GGUS tickets yesterday - fixed this morning. 2) NDGF had indicated a scheduled FTS downtime yesterday, for which there was an EGEE broadcast, but our monitoring showed it was working. Jens explained that when they were about to start the intervention they realised they were not sufficiently prepared and so cancelled it. 3) The LHC is setting up for high-intensity 450 GeV/c beams today. The experiments are hoping for a million events each over the next few days. 4) There are outstanding GGUS tickets for files trapped on a disk server waiting for a vendor intervention (giving the invalid-path errors?). FIO said this should soon be repaired. 5) There are new entries to be made to the CERN atlas-support e-group.
- ALICE - 1) This morning beams circulated from around 6:30 till 8 am and ALICE recorded 8 runs up to around 7:40. Of these 8 runs, those marked as "good" have been reconstructed on the Grid (T0 resources) with no problems observed. 2) The French federation is beginning to provide SL5 gLite 3.2 VO-boxes for ALICE at several sites, with roughly half of the federation already providing this service with nodes tested and put in production. In addition, the whole French federation is already providing WNs under SL5 in all their farms.
- LHCb reports - 1) Two LHCb TEAM GGUS tickets created in the last few days seem to have disappeared from the system after the recent November GGUS portal upgrade. The tickets were opened against GRIF and INFN-MILANO-ATLASC by user Vladimir Romanovskiy. A GGUS ticket has been opened for this problem (54002). 2) IN2P3 have fixed a problem with the published MaxCPUTime limit which was causing the time-left utility to wrongly estimate the remaining time, so that the batch system was wrongly killing jobs for exceeding the time limit (see the sketch below). No more failures due to this problem have been observed.
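To illustrate how a wrongly published MaxCPUTime breaks such an estimate, a minimal sketch in Python; this is not the actual DIRAC time-left utility, and the normalisation and numbers are assumptions:

  def cpu_time_left(published_max_cpu_s, consumed_cpu_s, cpu_power=1.0):
      """Rough time-left estimate for a running job: the queue's published
      CPU limit minus the CPU time already consumed, scaled by the
      (normalised) CPU power of the worker node.
      If the site publishes a MaxCPUTime larger than the real batch limit,
      this over-estimates the time left and the batch system kills the job."""
      remaining = published_max_cpu_s - consumed_cpu_s
      return max(0.0, remaining / cpu_power)

  # Hypothetical numbers: a queue publishing 48h while the real limit is 24h
  # makes a job that has already used 20h look like it has 28h left,
  # when in reality only 4h remain.
  print(cpu_time_left(48 * 3600, 20 * 3600) / 3600.0)   # prints 28.0

The fix on the site side was to correct the published MaxCPUTime so that such an estimate becomes reliable again.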
Sites / Services round table:
- KIT (XM): Preliminary warning of a half-day downtime on 13 January for the whole site to switch a router. It may be extended by testing activities.
- IN2P3 (RR): Some more information on their recent DNS failure - two independent redundant DNS servers broke down at the same time so an external scan is suspected. The same thing happened a few hours later. Stronger criteria have now been applied for automatic restarting. Investigation is ongoing but there is not much logging on such servers.
- RAL (GS): Advance warning of an outage on 5 January for tests of the bypass of their UPS.
- ASGC (JS): Added 200TB of disk space and some 1500 cores. Migration to SL5 continues.
- NDGF (JN): Reboot of dCache servers tomorrow for kernel upgrades.
- CERN-CASTOR (IR): Authentication to CASTOR via xrootd can now be done with a grid certificate, thus supporting more VOMS role mappings. Scale testing of this functionality is now needed. CASTOR operations will be proposing dates for the experiments to upgrade to version 2.1.9 in January.
- CERN-FTS (GM): A patch for the bug causing channel agents to crash in FTS 2.2 (seen in ATLAS transfers to BNL) is now ready.
- USAG (MD): There was a good USAG meeting this morning. Savannah entries made in response to GGUS tickets will no longer be automatically closed but will stay open until the associated bug is closed. GGUS tickets of all types can now be routed directly to ROCs, bypassing the TPM layer.
AOB: A comprehensive SIR on the 2 Dec CERN power cut has been added to the WLCG SIR pages (link at top of this page).
Friday
Attendance: local();remote().
Experiments round table:
Sites / Services round table:
AOB:
--
JamieShiers - 04-Dec-2009