SC3 FTS Christmas running logs

Tracking of SC3 / FTS status during Christmas period 2005/2006.

Thursday 22nd December

16.00: Looking into apparent problems with FTS IN2P3 channel agents. Turned out to be a user proxy issue. Noted some service stopping/starting issues with busy agents.

17.00: Checking: All channel agents up OK on lxshare026d. All VO agents up on lxshare021d. Webservice running OK on sc3-fts-external. IN2P3 running 10 concurrent jobs fine at 100% success rate.

Friday 23rd December

11.30: All agents and webservices up and running. IN2P3 channel running 10 concurrent files at ~100% success rate.

23:00: All agents and webservices up and running.

Saturday 24th December

12:50: All agents and webservices up and running. Activity from Atlas on most of their channels.

  • High failure rate on CERN-INFN with many missing source files, e.g.

srm://castorgridsc.cern.ch/castor/cern.ch/grid/atlas/ddm_tier0/perm/physics.000344.raw/physics.000344.raw._00026

Sunday 25th December

00:25: High activity on the LAPLAND-CERN channel. All elves up and running.

Monday 26th December

17:25: All agents and webservices up and running. Activity from Atlas on most of their channels. High failure rate on CERN-INFN; most errors on missing source files, as on 24th.

Tuesday 27th December

11:30: All agents and webservices up and running. Activity from Atlas on most of their channels. High failure rate on CERN-INFN; most errors on missing source files, as on 24th.

Wednesday 28th December

09:15: All agents and webservices up and running. High failure rate on CERN-INFN; most errors on missing source files, as on 24th, the remainder timing out on the put. Link to SARA down:

[root@lxshare026d logs-archive]# traceroute srm.grid.sara.nl
traceroute to srm.grid.sara.nl (145.100.3.91), 30 hops max, 38 byte packets
 1  l513-c-rftec-1-ip40 (128.142.224.7)  2.395 ms  0.236 ms  0.223 ms

11:00: INFN stager reset. Many files bound for INFN still failing at CERN end ("source does not exist").

12:30: Not clear who can fix the networking problem to SARA. Computer Ops couldn't help; the Network Piquet suggests Computer Ops. Checking whether Computer Ops have a procedure to run for this. If not, we leave it until after the break.

Thursday 29th December

10:00: All agents and webservices up and running. PIC, IN2P3, GRIDKA, TRIUMF, BNL, ASCC running. Some network failures on BNL channel.

  • All SARA jobs failing due to down network link - no critical alarms for this link - leave until after break.

Friday 30th December

11:10: CMS agent down on lxshare026d. Auto-restart failed (the actuator script had not been updated to match the new config). The error came on the daily logrotate restart, from the DB connection:

2005-12-30 04:02:31,279 ERROR  transfer-agent-dao-oracle - Error Executing Check Connection Statement: [0x614d] ORA-24909: call in progress. Current operation cancelled

2005-12-30 04:02:31,279 WARN   transfer-agent-dao-oracle - Connection has been dropped
2005-12-30 04:02:31,280 ERROR  transfer-agent-dao-oracle - An error occurs during Connection termination: [0x614d] ORA-24909: call in progress. Current operation cancelled

Restarted CMS agent manually. Check after holidays.
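For reference, the daily rotation that triggers the restart is typically a logrotate stanza with a postrotate hook; a hypothetical sketch (the paths and script name here are assumptions, not the actual SC3 configuration):

```
/var/log/glite/transfer-agent-*.log {
    daily
    rotate 7
    compress
    missingok
    postrotate
        # restart the agents so they reopen their log files; this is the
        # hook where a stale actuator script fails without being noticed
        /opt/glite/etc/init.d/transfer-agents restart > /dev/null 2>&1
    endscript
}
```

Because the restart's output is discarded, an actuator script that has drifted from the agent configuration fails silently and the agent simply stays down, as happened here.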

All other agents and webservices running OK. BNL, TRIUMF, RAL, PIC, INFN, IN2P3, GRIDKA channels running jobs well.

  • Some jobs failed due to gridFTP network timeouts on CERN-BNL channel. Transfer timeout increased from 1800 to 3600 seconds.

  • All jobs failed on CERN-SARA since network link is still down.

  • All jobs failed on the ASCC channel: various non-contactable-SRM errors and cannot-create-path errors. Mailed ASCC support.

17:20: Trying Atlas CERN-ASCC jobs again.

ASCC (Jason Shih) reset the ASCC SRM after a headnode crash.

Error on web-service database attempting to change state for all Held CERN-ASCC jobs back to Pending:

2005-12-30 17:23:06,374 WARN  [http-8443-Processor25]  Caught SQLException. Exiting:  - OracleFTSDBHelper.changeStateForHeldJobs:1290
java.sql.SQLException: ORA-01000: maximum open cursors exceeded

        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:125)
        at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:305)
        at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:272)
        at oracle.jdbc.driver.T4C8Oall.receive(T4C8Oall.java:626)
        at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:182)
        at oracle.jdbc.driver.T4CPreparedStatement.execute_for_describe(T4CPreparedStatement.java:662)
        at oracle.jdbc.driver.OracleStatement.execute_maybe_describe(OracleStatement.java:894)
        at oracle.jdbc.driver.T4CPreparedStatement.execute_maybe_describe(T4CPreparedStatement.java:694)
        at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:984)
        at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:2885)
        at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:2926)
        at oracle.jdbc.driver.ScrollableResultSet.refreshRowsInCache(ScrollableResultSet.java:228)
        at oracle.jdbc.driver.UpdatableResultSet.execute_updateRow(UpdatableResultSet.java:2383)
        at oracle.jdbc.driver.UpdatableResultSet.updateRow(UpdatableResultSet.java:1513)
        at org.apache.commons.dbcp.DelegatingResultSet.updateRow(DelegatingResultSet.java:487)
        at org.glite.data.transfer.fts.db.OracleFTSDBHelper.changeStateForHeldJobs(OracleFTSDBHelper.java:1275)

Hmm... it was an update of no more than 150 rows, and the web-service isn't busy at all. A cursor leak in the code? Look into it after the holiday.

Reset all Held CERN-ASCC jobs individually - that works OK.
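For the record, ORA-01000 usually means statements or result sets are being opened per row (or per call) and never closed. A minimal sketch of the defensive pattern for the bulk state change, using sqlite3 in place of Oracle; the table and column names (t_job, job_state, channel_name) are illustrative assumptions, not the real FTS schema:

```python
# Bulk Held -> Pending state change using a single, scoped cursor.
# closing() guarantees the cursor is released even if the UPDATE throws,
# which is the property the ORA-01000 stack trace suggests is missing.
import sqlite3
from contextlib import closing

def change_state_for_held_jobs(conn, channel, new_state="Pending"):
    """Move all Held jobs on a channel to new_state; return rows changed."""
    with closing(conn.cursor()) as cur:  # cursor always closed, even on error
        cur.execute(
            "UPDATE t_job SET job_state = ? "
            "WHERE job_state = 'Hold' AND channel_name = ?",
            (new_state, channel),
        )
        conn.commit()
        return cur.rowcount
```

One UPDATE over ~150 rows needs exactly one cursor; if the web-service still exhausts cursors at that load, the leak is almost certainly in some other code path holding statements open.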

100% failure on all the jobs again:

CERN Castor reports that the source file does not exist for many of them (similar to INFN dataset a few days ago). e.g.

srm://castorgridsc.cern.ch/castor/cern.ch/grid/atlas/ddm_tier0/perm/physics.000214.raw/physics.000214.raw._00014

2005-12-30 16:45:03,533 [ERROR] - FINAL:SRM_SOURCE: Failed on SRM get: Failed SRM get  on httpg://castorgridsc.cern.ch:8443/srm/managerv1; id=676005447 call. Error is specified file(s) does not exist%

The remainder still fail on ASCC side:

2005-12-30 16:47:08,390 [ERROR] - FINAL:SRM_DEST: Failed on SRM put: Failed To Put SURL. Error in srm__put: SOAP-ENV:Client - Operation now in progress

  • Mailed ASCC again.

23:45: Bad SRM node removed from ASCC configuration. Transfers reset from Hold to Pending and are completing OK. Remaining failures due to missing source files.

Saturday 31st December

12:13: Running jobs on SARA, TRIUMF, RAL, PIC, INFN, IN2P3, GRIDKA, BNL, ASCC channels. All agents and webservices up OK.

  • Problem recurred at ASCC for Atlas files. It is intermittent; 50% success rate:

2005-12-31 07:15:34,899 [ERROR] - FINAL:SRM_DEST: Failed on SRM put: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Could not open connection !

  • Still a problem at CERN-BNL with timeouts on the gridFTP transfer (even after doubling the timeout to 1 hour). Looking closer, it seems to be a gridFTP progress-marker timeout - need to check at what frequency dCache 1.6.6 should send progress markers. By default, we time out after three minutes if we have received no new gridFTP markers (the default gridFTP server sends one marker every 10 seconds or so). Contacted BNL to look for problems on the door nodes.

  • All SARA jobs failing due to network link down.
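The three-minute marker timeout can be sketched as a small watchdog. This is an illustration of the behaviour as described in this log, not the actual FTS code:

```python
# Watchdog for gridFTP progress markers: if no new marker arrives within
# the timeout window (180 s, per the default described above), the
# transfer is considered stalled and would be aborted.
import time

class MarkerWatchdog:
    def __init__(self, timeout=180.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock          # injectable for testing
        self.last_marker = clock()  # treat start of transfer as a marker

    def on_marker(self):
        """Record that a gridFTP performance marker just arrived."""
        self.last_marker = self.clock()

    def stalled(self):
        """True if no marker has been seen within the timeout window."""
        return self.clock() - self.last_marker > self.timeout
```

If the dCache door sends markers less often than every 180 seconds, or stalls briefly under load, this logic aborts transfers that are in fact still progressing - consistent with the intermittent CERN-BNL failures.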

Sunday 1st January 2006

12:41: Running jobs on SARA, TRIUMF, RAL, PIC, INFN, IN2P3, GRIDKA, BNL, ASCC channels. All agents and webservices up OK albeit with very sore heads from the night before.

  • All SARA jobs failing due to network link down.

  • Still (intermittent) gridFTP timeouts from CERN-BNL. Awaiting response to mail.

  • Still (intermittent) SRM connection errors at ASCC.

Monday 2nd January

12:30: Running jobs on SARA, TRIUMF, RAL, PIC, INFN, IN2P3, GRIDKA, BNL, ASCC channels. All agents and webservices up OK.

  • All SARA jobs failing due to network link down. A few jobs seem to retrieve a TURL(?) but then fail in the gridFTP transfer (no route to host).

  • Still (intermittent) SRM connection errors at ASCC.

  • Errors from BNL SRM on mapfile entry (mailed BNL). e.g.

CERN-BNL__2006-01-02-1125_sgKp1k:2006-01-02 11:25:17,185 [ERROR] - FINAL:SRM_DEST: Failed on SRM put: Failed To Put SURL. Error in srm__put: SOAP-ENV:Server - org.dcache.srm.SRMAuthorizationException: user usatlas1 is not found; also failing to do 'advisoryDelete' on target.%

  • 100% failure on INFN channel - timeout on SRM.put (tested the network and it is OK). Mailed INFN.

18:44: Problems at INFN should be fixed by morning. All agents and webservices up OK.

Tuesday 3rd January

13:15: Running jobs on SARA, TRIUMF, RAL, PIC, INFN, IN2P3, GRIDKA, BNL, ASCC channels. All agents and webservices up OK.

  • All SARA jobs failing due to network link down.

  • BNL report internal network problem resulting in authorization problems (can't reach LDAP server). Problem fixed? Success rate has improved.

  • INFN working 100% now.

  • ASCC working to resolve problems... possibly a dead node behind the DNS alias:

[root@lxshare026d CERN-ASCCfailed]# telnet castorsc.grid.sinica.edu.tw 8443
Trying 140.109.248.8...
telnet: connect to address 140.109.248.8: Connection refused
Trying 140.109.248.4...
Connected to castorsc.grid.sinica.edu.tw.
Escape character is '^]'.
^]
telnet> close
Connection closed.
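The telnet check above only probes the addresses the resolver happens to return, in order. A small sketch that connects to every A record behind an alias and reports per-address status, so a dead node (like 140.109.248.8 above) shows up explicitly; the function name is mine, not an existing tool:

```python
# Probe every IPv4 address behind a DNS alias on a given port.
import socket

def probe_alias(host, port, timeout=5.0):
    """Return {ip: 'open' | 'failed: ...'} for each A record of host."""
    status = {}
    for family, socktype, proto, canon, sockaddr in socket.getaddrinfo(
            host, port, socket.AF_INET, socket.SOCK_STREAM):
        ip = sockaddr[0]
        if ip in status:        # the resolver may list an address twice
            continue
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                status[ip] = "open"
        except OSError as exc:
            status[ip] = f"failed: {exc}"
    return status
```

Run against castorsc.grid.sinica.edu.tw on port 8443, this would have reported .8 as failed and .4 as open in a single pass.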

22:00: All agents and webservices up OK.

  • ASCC dCache node had died: restarted.

  • BNL (Jane Liu) notes timeouts on gridFTP (from the FTS client) degrading the service - see note on 31st December (BNL are running the latest dCache, 1.6.6). Also noted possible problems with the way we abort the gridFTP transfer on a timeout - this may cause issues on the server side.

Wednesday 4th January

11:20: Running jobs on SARA, TRIUMF, RAL, PIC, INFN, GRIDKA, BNL, ASCC channels. All agents and webservices up OK.

  • ASCC problems resolved: channel working 100%.

  • All SARA jobs failing due to network link down.

  • Scheduled downtime at IN2P3 until morning of 5th. Channel closed.

20:05: All agents and webservices up OK.

20:50: IN2P3 (Lionel Schwarz) reports SRM back up after scheduled intervention. Channel opened and jobs completing successfully.

Summary

  • Jobs running on SARA, TRIUMF, RAL, PIC, INFN, IN2P3, GRIDKA, BNL, ASCC channels.

  • ASCC had intermittent contact problems on their SRM for a while - resolved by ASCC.

  • INFN SRM went down a couple of times - resolved by INFN.

  • BNL mapfile problem - resolved by BNL.

  • IN2P3 scheduled intervention went fine. Jobs paused and re-started OK.

  • Network to SARA down for the whole period - there was no procedure to fix this or to get the problem investigated (or none that I was able to trigger).

  • Atlas sent a lot of jobs with missing source files (failed on SRM.get).

Software issues noted:

  • Problems stopping / starting busy agents sometimes.

  • Timeout problem noted on gridFTP transfers to BNL (see 31st December) - should investigate the FTS timeout code and the dCache gridFTP progress-marker logic. Aborting the gridFTP transfer on a timeout also seems to cause trouble on the dCache gridFTP door.

  • Some "maximum open cursors exceeded" DB errors in the FTS web-service logs (with the server not being especially busy) - investigate in the FTS code.

-- GavinMcCance - 22 Dec 2005

Topic revision: r19 - 2006-01-04 - GavinMcCance
 