Week of 090126

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).
  • Status can be received, due or open

Site | Date   | Duration | Service  | Impact                | Report            | Assigned to | Status
FZK  | 24 Jan | 3 days   | FTS, LFC | Down                  | PM in preparation | Andreas H   | due
CERN | 23 Jan | 6 days   | lcg-cp   | Loss of functionality | PM - see Thursday | Gavin M     | received

GGUS Team / Alarm Tickets during last week

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Harry, MariaDZ, Nick, Jan, Julia, Patricia, Simone);remote(Gonzalo, Michel, Jeff, Gareth, Michael).

Experiments round table:

CMS: Daniele had to drop out due to phone problems but will upload a report later.

ALICE (PM): Preparing a large production to be run under the latest version of aliroot. They wish to replace the aging voalice05 (castor 2 gateway) and voalice06 (xrootd redirector) with more modern machines by mid-February. Advice from FIO is that this should be straightforward.

ATLAS (SC): Several problems over the last few days. 1) BNL was hit by the FTS proxy delegation bug on Friday; fixed locally. Michael asked about the status of the fix for this. Reply from Nick/Harry: the fix is being back-ported from FTS 2.2 as a patch to FTS 2.1, SLC4 version only. This is now in certification, so unless this fails it should be out in a few weeks. 2) SARA has configured FTS channels for their associated Tier 2s while NIKHEF has not. They are asked to do so now, as MC production from Russia needs to be sent back to NIKHEF. 3) Following the 10-million-file test, IN2P3 have deployed a third load-balanced FTS server addressing CNAF, PIC, RAL and ASGC. Some improvement, though not much, was initially seen. They are currently in scheduled downtime for a dCache upgrade. 4) On Friday they found they could not import data into the CERN MCDISK pool due to a misconfigured disk server, which was then removed. On Saturday importing failed again, reported back as the pool being full: monitoring showed it full while stager and SRM queries did not. JvE explained that removing the misconfigured disk server also took out state information, a known CASTOR problem. The machine is now back in the pool and the state information will automatically be resynchronised. 5) On Thursday/Friday the lcg-cp command (which makes SRM calls) started failing when requested to create 2 levels of new directories. To be followed up. 6) FZK currently have Oracle backend problems (see site report). 7) PIC were down over the weekend due to a storm (see site report).

LHCb (by email from RS): Only one point to report: the CNAF shared area issue. We had a debugging session on Friday and discovered that the GPFS file system, due to an intrinsic limitation of its caching mechanism, runs the LHCb setup script in 10-30 minutes (depending on the load of the WN), while an NFS file system (on top of GPFS storage) runs the same script in the more reasonable time of 3-4 seconds, as at other sites. They told me they have set up this NFS-on-GPFS configuration for LHCb and ATLAS (which is suffering from this problem too) and we are waiting for their green light to start using it.

Sites / Services round table:

NL-T1 (JT): 1) Need to perform an urgent worker node kernel upgrade to fix a local site security problem. Cannot wait for all long ATLAS jobs (the recent agreement was to allow 80 hours of wall clock time), so will do a quasi-rolling upgrade: first draining and then emptying (by killing jobs) the largest cluster (of 4), while allowing the others to drain fully. 2) Have not seen any ATLAS site-specific tests run since 21 January. SC will follow up.

PIC (GM): Barcelona was hit by a storm with strong winds last Saturday and suffered a power cut at about 11.30 which turned off the air conditioning, so they had to shut down. They resumed this morning and were fully back around midday. There were some problems bringing back the Oracle databases following the unclean shutdown.

FZK (by mail from Andreas Heiss): The FTS and LFC services at FZK have been down since Saturday due to a problem with the Oracle backend. The problem is quite complex; Oracle support is involved and currently working on the issue. We hope to get the problem solved today.

ASGC (by email from Jason Shih): Expired host certificate problem in TAIWAN for transfers to LYON. FTS error:

  • FTS Retries [1] Reason [SOURCE error during TRANSFER phase: [PERMISSION] the server sent an error response: 530 530-globus_xio: Server side credential failure530-globus_gsi_gssapi: Error with GSI credential530-globus_gsi_gssapi: Error with gss credential handle530-globus_credential: Error with credent
  • Solved - Jan 26, 8am UTC.
  • All expired host certificates (around 10) were rekeyed around 6AM UTC today, and the on-site operator had already replaced the previously expired host certificates.
  • Review: there was a problem with the local alarm system (the configuration has now been fixed) such that part of the disk servers (around 10) were not able to continue the rekey before the new year holiday break. Recovery is expected tomorrow with assistance from on-site operations. For now we have disabled the pool to avoid the impact, but direct data access to those disk servers will fail with a generic GSS authentication error. Apologies for the late fix of the problem; a sketch of the kind of expiry check such an alarm could run is shown below.
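
A minimal sketch of the kind of certificate-expiry probe such an alarm could run on each disk server; the certificate path and the two-week warning window are illustrative assumptions, not the actual ASGC configuration:

#!/usr/bin/env python
# Sketch only: warn if the grid host certificate expires within WARN_SECONDS.
# "openssl x509 -checkend N" exits 0 if the certificate is still valid N
# seconds from now, non-zero otherwise.
import subprocess
import sys

CERT_PATH = "/etc/grid-security/hostcert.pem"  # assumed standard location
WARN_SECONDS = 14 * 24 * 3600                  # assumed two-week warning window

def cert_expires_within(cert_path, seconds):
    rc = subprocess.call(
        ["openssl", "x509", "-checkend", str(seconds), "-noout", "-in", cert_path]
    )
    return rc != 0

if __name__ == "__main__":
    if cert_expires_within(CERT_PATH, WARN_SECONDS):
        print("WARNING: %s expires within %d days" % (CERT_PATH, WARN_SECONDS // 86400))
        sys.exit(1)
    print("OK: %s is valid for at least %d days" % (CERT_PATH, WARN_SECONDS // 86400))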

- ASGC 3D database

  • Jan 25: finished installing ASM and clusterware.
  • Rebuilt the file system on the new RAID subsystem and abandoned the OCFS2 file system for 3D.
  • We are preparing the DB on the new RAC cluster; the preparation is expected to finish today.
  • The streaming database setup will extend until tomorrow (Tuesday).
  • Future actions:
  • Rebuild clusterware on the new ASM serving grid services - early February.
  • The same setup will also be applied to the CASTOR/SRM DB on the new storage system. The action is expected to finish mid-February, then continue with validation by the CASTOR admin group as well as an extension of the evaluation for another two weeks.

- FTS job submission failures due to hitting the maximum tablespace size

  • Jan 24, 0:30AM UTC, with the following error:
  • 2009-01-24 11:59:45,884 ERROR transfer-agent-dao-oracle-vo - Got SQLException in Executing Update: [0x676] ORA-01654: unable to extend index FTS02.IDX_REPORT_FILE by 8192 in tablespace FTSTBS02
  • Local ops confirmed that the tablespace, at its maximum size, was refusing the extension of the FTS02.IDX_REPORT_FILE index:
  • Query used:

SQL> select file_name, bytes, autoextensible, maxbytes from dba_data_files where tablespace_name='FTSTBS02';

FILE_NAME                           BYTES      AUT MAXBYTES
----------------------------------- ---------- --- ----------
/u02/oradata/gdsdb/ftstbs02.dbf     3.4309E+10 YES 3.4360E+10

  • Local ops took urgent action and extended the tablespace by another 100MB, after which FTS was able to accept new submission requests.
  • Review: we will try adding a new plugin to monitor the size of the tablespace, and also enable the same monitoring in OEM, to avoid the same situation in the future (a minimal sketch of such a check is shown below).
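
A minimal sketch of such a tablespace probe, assuming a cx_Oracle client; the connection string, warning threshold and alarm handling are illustrative placeholders, not the actual ASGC monitoring plugin:

#!/usr/bin/env python
# Sketch only: warn when an autoextensible datafile of the FTS tablespace
# approaches its MAXSIZE limit, mirroring the manual dba_data_files query above.
import cx_Oracle

DSN = "fts_monitor/secret@gdsdb"  # placeholder credentials/DSN
TABLESPACE = "FTSTBS02"
WARN_FRACTION = 0.90              # warn at 90% of the autoextend limit

def tablespace_alarms(dsn, tablespace, warn_fraction):
    conn = cx_Oracle.connect(dsn)
    try:
        cur = conn.cursor()
        cur.execute(
            """SELECT file_name, bytes, maxbytes
                 FROM dba_data_files
                WHERE tablespace_name = :ts
                  AND autoextensible = 'YES'""",
            ts=tablespace,
        )
        alarms = []
        for file_name, current_bytes, max_bytes in cur:
            # Compare the current datafile size against its autoextend limit.
            if max_bytes and float(current_bytes) / max_bytes >= warn_fraction:
                alarms.append("%s at %.0f%% of MAXSIZE"
                              % (file_name, 100.0 * current_bytes / max_bytes))
        return alarms
    finally:
        conn.close()

if __name__ == "__main__":
    for alarm in tablespace_alarms(DSN, TABLESPACE, WARN_FRACTION):
        print("WARNING: " + alarm)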

- TW-LCG2 EXEPANDA_DQ2_STAGEIN (>450 jobs)

  • error:
  • get_data failed. Status=256
Output=httpg://srm2.grid.sinica.edu.tw:8443/srm/managerv2: No type found for id : 36652446 26 Jan 2009 04:57:13| !!FAILED!!2999!! Error in copying (attempt 1): 110 - Copy command returned error code 256 and output: httpg://srm2.grid.sinica.edu.tw:8443/srm/managerv2: No type found for id : 36652446
  • The error 'No type found for id : 36652446' has been observed at three sites deploying CASTOR. The root cause remains unclear, and CASTOR support indicates that it is hard to reproduce the error; RAL opened an SR with Oracle earlier, last Thursday. We hope to narrow down the root cause with assistance from the CASTOR developers, and hope to be able to patch the system soon to avoid the same problem recurring.

AOB:

Tuesday:

Attendance: local(Harry, Gavin, Jean-Philippe, Roberto, Simone, Andrea);remote(Gonzalo, Luca, Jeremy, Michael, Jeff, Daniele, Gareth, Brian).

Experiments round table:

CMS (DB): 1) At CNAF there was a CMS operational misunderstanding that stopped transfers from Tier 2s to CNAF tape storage over the weekend. The correct tapes have now been added to the CMS tape family and migration to tape has been running at 100-120 MB/sec since. 2) IN2P3 have been having difficulties migrating data to their Tier 2s. This is CRAFT data, a 15000-file dataset, so this is slowing down CMS analysis work. It is not clear whether the problems are related to their recent dCache upgrade, finished yesterday, or to other operational issues, e.g. with prestaging, or with transfers to the Nebraska Tier 2 (running at 20 MB/sec) taking too much of the bandwidth.

ATLAS (SC): One outstanding request for help - as reported yesterday, lcg-cp at CERN can no longer create 2 new directory levels and we have 100K files to upload, so we would appreciate a reply to our ticket. Under questions/comments, M.Ernst had seen a mail saying that, following the CERN production MCDISK becoming full, ATLAS tasks with numbers below 20000 would be transferred to other Tier 1s, and asked what storage volume this would require. The reply was that this referred mostly to AOD and ESD data that exist both at CERN and at a Tier 1, where the CERN copy would be deleted. Simone will ask S.Jezeqel to send round detailed numbers. Brian Davies of RAL said they had had many transfer failures due to a full space token at their Oxford Southgrid Tier 2, but that they are currently deploying additional disk for this.

ALICE (by mail from PM): Alice is still preparing the new aliRoot to continue the production, but currently there are no jobs running.

LHCb (RS): 1) FEST09 (Full Experiment System Test) started this week, concerning in this first phase only the ONLINE system. Transfers from the pit and an exercise of the Xpress line will however start soon, and the OFFLINE computing system will then also be involved. 2) The CERN WMS216 is giving openssl assert error messages (seen before at CNAF). Job throughput through this machine has been very low. 3) Queried the status of the shared software areas at CNAF: Luca confirmed they are seeing a much lower network transfer burden using NFS over GPFS, but have only been able to test on a single worker node so far (see CNAF site report).

Sites / Services round table:

CNAF (LdA): Have had a water chiller under maintenance since Monday 19 Jan and so have lost 30% of their worker node capacity, including their test nodes (hence the limited testing of the LHCb shared software area). The work should have been finished last Friday but slipped until today. They now need to verify correct functioning overnight before bringing back full capacity.

FTS 2.1 (GM): have now received the provisional fix for the delegated proxy corruption bug. This will be tested in the PPS then, if ok, put into certification at top priority.

AOB:

Wednesday

Attendance: local(Harry, MariaG, Julia, Roberto, MariaDZ, MiguelCS, Gavin, Jan, Patricia);remote(Gonzalo, Michel, Gareth, Jeremy, Jeff, Luca).

Experiments round table:

CMS (by email from Daniele): Clarification of yesterday's minutes: Nebraska is not - nor are any other T2s - responsible at all for the low-rate transfers out of IN2P3 at the moment: they are not "causing" it. I am not 100% sure about the root cause. I am writing imminently to ask whether they think that dCache at IN2P3, or pre-staging of a specific dataset, may still not be in the expected good shape after the downtime ended. What I can add is that some CMS DataOps issues may add a complication to the IN2P3 picture: the dataset(s) needed for export out of IN2P3 by CMS are 'heavy' datasets for analysis (of the order of ~15k files, ~20 MB each on average). This is due to an 'original sin' at the DataOps level which set a low merging threshold in the production phase (so: too many small files). Unfortunately they also say that, given their priorities, they cannot re-run the merging at the moment, so the situation is a bit heavier for IN2P3 than desirable.

ATLAS (by email from Simone): 1) The problem in CASTOR@CERN using lcg-cp with recursive directory creation is still not solved (at least it was not 2 hours ago). The issue is now CRITICAL. 2) Some FTS jobs were lost in the FTS at SARA (after the update to SLC4). This created problems for the ATLAS DDM site services, which are now being modified to cope with this.

LHCb (RS): The openssl WMS error reported yesterday was in fact due to an error in a configuration file. However the slow throughput issue is not yet fully understood - on Monday pilots (targeted everywhere) were failing submission at CERN with "No compatible sources found", meaning a list-match failure in the WMS. Also confirmed the issue with SRM/CASTOR reported by ATLAS concerning recursive directory creation since the update last week.

ALICE (PM): Still at a low rate of production, with fewer than 100 jobs running, of which 95% are at CERN.

Sites / Services round table:

NL-T1 (JT): 1) Asked when they could expect the waterfall of LHCb FEST09 jobs. The reply was: at least one more week. 2) ATLAS pointed out they could not use the lcg-tags function at Nikhef; this was traced to a file permissions problem, now fixed. LHCb confirmed they use only 1 such tag while ALICE use none. 3) Tenders have gone out for the database upgrades.

CNAF (LdA): Confirmed that the chiller maintenance was successful and they are now bringing back their full capacity. They will next test the scalability of the NFS-based shared software area. If successful, LHCb will need to use a new path but the other VOs will not.

RAL (GS): Reminder of FTS downtime tomorrow 7 till 11 in preparation for eventual move to new building (no date yet, building under acceptance).

CERN lcg-cp (GM): The recursive directory problem probably came in with an SRM feature patch on Thursday (from level 12 to level 14) that supposedly contained only minor bug fixes, but they would like the experiments to confirm when they first saw the problem. They suggest rolling back the PPS version and testing for the problem there.

CERN Databases (MG): They have applied the January Oracle security patch to most of the integration databases and, as usual, will apply it to the production instances in 2 weeks if there are no problems.

CERN Castor (McS): In the next few weeks they will change the stager deployment model to increase the number of stager daemons on a server. This will require all CERN clients to be at level 2.1.7-14 or above. The downlevel clients (unmanaged machines) are known and will be contacted.

AOB: MDZ reminded of the USAG meeting tomorrow with the special issue of direct routing of GGUS tickets.

Thursday

Attendance: local(Nick, Gavin, MariaG, MariaDZ, Jean-Philippe, Harry, Simone, Andrea, Jan, Jamie);remote(Daniele, Gonzalo, Jeremy, Michael, Luca, Jeff).

Experiments round table:

ATLAS (SC): The problem of lcg-cp failing on recursive new directories has now been fixed. An FTS problem has been found at NL-T1 where a non-superuser querying for the existence of an FTS job ID which does not exist gets the reply that he is not authorised to ask. There are two issues to follow up here: whether this is the behaviour we want, and that the user in question, K.Bos, should be registered as a superuser at NL-T1 as was already requested.

CMS (DB): Following up on the slow throughput observed for exports of CRAFT data from IN2P3 to Tier 2s: this is now put down to the small file sizes involved (20 MB on average) following a non-optimal merging operation. The DataOps team has now agreed to redo the merging to create the normal 1 GB files. It will be done at IN2P3, as the custodial site for this data, so CMS is checking that there are sufficient local resources.

LHCb (by email from RS): Some small steps on the FEST09 activity, which from a pure ONLINE exercise over the last days is now entering a 'testing phase' for the OFFLINE infrastructure (i.e. DIRAC services + WLCG). T1s will start receiving data, to begin with at the rate of ...1 file per hour.

The shared area issue at CNAF is on its way to resolution: they have mounted the CNFS file system (NFS on top of GPFS) on all production batch nodes, as announced by Luca at previous meetings, and we have now banned CNAF (T1 and T2) from our mask to drain the site and allow the shared area to be migrated to this new mount point. We are now awaiting CNAF to proceed with the re-installation of the software.

Finally: we updated the CIC VO card with a more descriptive list of "system" libraries for the 32-bit platform that we found, via SAM tests (the "file access test"), to be missing on new OS worker nodes in the Grid (e.g. SL5 at CERN or pure 64-bit SL4 at PIC and CNAF). Deeper investigations (to get a comprehensive view of which system libraries must be installed natively with the OS to be 32-bit compatible) are being carried out by the LHCb Core Software developers, and their preliminary findings match the list available in the CIC card perfectly. As for non-system libraries, they will be shipped with the LHCb applications; again the Core Software team, together with the LCG-AA, are bundling those compatibility libraries (as was done for the SLC3 to SLC4 migration).
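
A minimal sketch of the kind of library-presence check such a test could perform; the library names are illustrative examples, not the actual CIC VO card list, and this simple resolution check does not by itself distinguish 32-bit from 64-bit builds:

#!/usr/bin/env python
# Sketch only: report which of a list of shared libraries the dynamic linker
# cannot resolve on this worker node.
from ctypes import util

# Hypothetical subset of "system" libraries an experiment application might need.
REQUIRED_LIBS = ["ssl", "crypto", "z", "stdc++"]

def missing_libraries(names):
    return [name for name in names if util.find_library(name) is None]

if __name__ == "__main__":
    missing = missing_libraries(REQUIRED_LIBS)
    if missing:
        print("Missing libraries: " + ", ".join(missing))
    else:
        print("All required libraries resolved.")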

Sites / Services round table:

NL-T1 (by email from Ron Trompert to ATLAS): Today we are moving the database of the ATLAS LFC at SARA to Oracle RAC. This downtime was announced, but with the wrong date (i.e. 30-1). Our apologies for the inconvenience this has caused. Everything is going smoothly, so I expect that at some time this afternoon the LFC will be back. (This was reported done by EGEE broadcast at 14.30.) On the phone Jeff reported there were more LHCb jobs today.

CERN SRM (JvE): The ATLAS and LHCb SRM were downgraded to 2.12 yesterday to fix the lcg-cp recursive directory creation failure. The other experiments and public will now be done. A postmortem is visible at https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090126 . Simone asked if CERN was the only site running this SRM level. Jan replied yes and that other sites have now been warned.

CNAF (Luca): The new NFS mounted shared area for LHCb has been mounted on part of their farm and they are setting up write access. LHCb will be informed shortly to start testing. There will be a brief LFC service shutdown next week for recataloguing of data from SRMV1 to SRMV2.

DataBases (MG): The 3D database at ASGC is back in production and will now be resynchronised.

AOB:

Friday

Attendance: local(Harry, Julia, Patricia, Roberto, Alessandro, Jan);remote(Michael, Jeremy, Jeff, Gareth).

Experiments round table:

ALICE (PM): Had a small voms configuration issue with the GSI (Darmstadt) VO-box, now resolved. At Torino they are seeing strange ssh connections on their VO-box - to be followed up.

LHCb (RS): FEST09 file movement for reconstruction is now running at 1 file per hour per Tier 1. Running smoothly to PIC and NIKHEF. 1) CERN CASTOR has a bad disk server causing failures to copy out of CERN. Ticket submitted. 2) A team ticket has been opened at CNAF for LSF scheduler problems. (Resolved - see CNAF site report). 3) Files transferred to IN2P3 become marked as 'unavailable'. 4) At RAL reconstruction jobs were failing trying to reserve a TURL with the Oracle 'Big Id' bug. This was cleared yesterday and jobs are now ok. 5) We have performed a software installation in the new CNAF shared software area and the first jobs are running well.

ATLAS (AdG): At the beginning of next week ATLAS will migrate the DDM site services to the new release. The RAL cloud will be migrated on Monday, the others will follow on Tuesday and Wednesday. More details on Monday. Under questions, Jeff asked why ATLAS jobs at Nikhef had dropped from about 60 hours of CPU each to about 17 minutes each. Alessandro will check this is not a fault.

Sites / Services round table:

NL-T1 (Jeff): After the LFC Oracle migration from a single machine to a RAC server yesterday a severe performance loss was seen and the service is now being reverted (started around 13.00). Alessandro pointed out that ATLAS cannot use the NIKHEF SRM end points when the SARA LFC is down which Jeff confirmed. LHCb is probably not affected since their data management tests were still working.

CNAF (by email from Luca): 1) The move of the shared software area is ongoing. 2) During the afternoon one of the LSF license servers went down. While the farm LSF managed to switch to the other license servers, the CASTOR LSF server was not able to switch and hence stopped working. We had to manually change the order of license servers on the CASTOR LSF server. After this intervention the CASTOR service was restored, and it has been up and running again since 19.00 CET.

AOB:

-- JamieShiers - 22 Jan 2009
