WLCG Tier1 Service Coordination Minutes - 25 February 2010

Data Management & Other Tier1 Service issues

Status of FTS 2.2.3 deployment

Site reports on 1) version installed, 2) upgrade plans, 3) problems encountered in the upgrade.
  • CERN:
    1. Current Versions:
      | DNS alias | SLS name | OS | Lemon | Version (25 Feb 2010) |
      | fts-t0-export.cern.ch | FTS-T0-EXPORT | SLC4 64-bit | lemon subcluster | FTS 2.1 |
      | fts-t1-import.cern.ch | FTS-T1-IMPORT | SLC4 64-bit | lemon subcluster | FTS 2.1 |
      | fts-t2-service.cern.ch | FTS-T2-SERVICE | SLC4 64-bit | lemon subcluster | FTS 2.1 |
      | fts22-t0-export.cern.ch | FTS22-T0-EXPORT | SLC4 64-bit | lemon subcluster | FTS 2.2.3 |
      • In fact these version numbers are a little imprecise, since CERN FTS has been following individual patches rather than gLite releases. With the next update to CERN FTS we will resync to the gLite releases - they may be the same, but this needs verification. For reference, the currently installed packages are listed in the attachments cern-fts2.1-rpm-list and cern-fts2.2.3-rpm-list.txt.
      • NB: the mapping from FTS version number to gLite release number is not obvious; e.g. 2.2.3 is not mentioned anywhere in glite-FTS_oracle-3.1.20-0-update.
    2. Upgrade Plans: the pilot is already updated; production can be done now if desired, with one week's lead time.
    3. Problems Encountered: none for the recent update of the pilot, but that was done patch by patch, i.e. not a good comparison with a production upgrade.
  • ASGC: 1) FTS 2.0 installed on SLC3 i386. 2) A testbed has been prepared and testing will start shortly; the production instance will be upgraded once testing completes without problems. 3) N/A
  • BNL: 1) FTS 2.2.3 installed. 2) N/A. 3) Missing documentation on the Oracle privileges required (fts_history.start_job needs CREATE JOB; see the sketch after this list).
  • CNAF
    • 1) FTS 2.1 on SL4 x86_64, gLite 3.1 x86_64 Update 55.
    • 2) Upgrade test eventually successful. Will upgrade in the week of March 8th.
    • 3) A first upgrade attempt failed and was rolled back. One problem was a typo in a YAIM configuration file (users.conf), the other was due to an old version of log4cpp. Details in GGUS #55828.
  • FNAL:
    • 1) Version installed = FTS 2.2, installed and working. We see the proxy error extremely rarely.
    • 2) Upgrade plans = Testing of FTS 2.2.3 is completed; currently running 100 MB/s PhEDEx load tests for CERN-FNAL traffic - agents and channels are all working. Anticipating an upgrade to FTS 2.2.3 on Monday.
    • 3) Problems encountered = None
  • IN2P3:
    • 1) Version installed : FTS 2.1 on SL4 x86_64, gLite 3.1 x86_64 Update 54
    • 2) Upgrade plans: Will set up a test instance first to test the upgrade procedure. If we are satisfied with the test, we will upgrade the production instance.
    • 3) N/A
  • KIT:
    • 1) Installed version: FTS 2.2.3 on SL4 x86_64, (Update 61)
    • 2) Update plans: possibly an update of glite-data-delegation-cli from 2.0.0-5 to 2.0.1-4.
    • 3) Problems: rarely, transfers stay in status "Pending" after failing (no connection to MyProxy); in the last few months this affected 6 transfers.
  • NDGF:
    • 1) Installed version: FTS 2.2.2(?) on CentOS 4 i386 (gLite 3.1, update 60).
    • 2) Update to FTS 2.2.3 in update 61 in the coming weeks.
    • 3) N/A
  • NL-T1: no information available.
  • PIC:
    • 1) Version installed in production = FTS 2.1.
    • 2) Upgrade plans: if the issue found in the test instance with "use case 2" is not a stopper for the experiments, we can upgrade the production instance in two weeks, i.e. the week of March 8th (the expert will be away most of next week).
    • 3) Problems encountered: FTS 2.2.3 has been running in a test instance for a couple of weeks and is mostly working OK. One problem detected with file corruption while transferring (use case 2, see GGUS #55610 or Savannah #63518).
  • RAL: 1) FTS 2.1. 2) Will set up a test instance first to test the upgrade procedure; if we are satisfied with the test, we will upgrade the production instance. 3) N/A
  • TRIUMF: 1) FTS 2.1. 2) Will set up a test instance first to test the upgrade procedure; if we are satisfied with the test, we will upgrade the production instance. 3) N/A
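
Regarding the BNL point above: the missing privilege reported there is CREATE JOB, which is what allows the FTS schema owner to create the DBMS_SCHEDULER jobs used by fts_history.start_job. Below is a minimal sketch of the grant, assuming the cx_Oracle bindings; the connect string and the schema name fts_owner are hypothetical placeholders to be adapted per site.

```python
# Minimal sketch of the grant BNL found undocumented: FTS 2.2.3's
# fts_history.start_job creates DBMS_SCHEDULER jobs, which requires the
# CREATE JOB system privilege on the FTS schema owner.
# "system/dba_password@ftsdb" and "fts_owner" are hypothetical placeholders.
import cx_Oracle

conn = cx_Oracle.connect("system/dba_password@ftsdb")  # run as a DBA account
cur = conn.cursor()
cur.execute("GRANT CREATE JOB TO fts_owner")

# Verify the privilege is now in place.
cur.execute("SELECT privilege FROM dba_sys_privs WHERE grantee = 'FTS_OWNER'")
print([row[0] for row in cur])
conn.close()
```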

CASTOR

  • Versions deployed at CERN
    • CASTOR: 2.1.9-3 (ALICE, LHCb), 2.1.9-4 (ATLAS, CMS, PUBLIC)
    • SRM: 2.8-5 (ALICE, LHCb), 2.8-6 (ATLAS, CMS, PUBLIC)
  • Released SRM 2.9-1 (release notes). It mainly addresses scalability problems seen in srm-atlas and it is now certified for SLC5.
  • Released hotfix 2.1.9-4-2. Not critical.
  • Released CASTOR 2.1.9 related XROOT Update 2 (release notes). It mainly cures problems in the virtual socket layer connected with SSL authentication, which led to a service interruption of xroot in CASTORATLAS.
  • Planned upgrades?

dCache

Conditions data access and related services

CORAL and COOL

  • No patch has been received yet from Oracle Support to fix the 11g client bug affecting ATLAS sites running AMD Opteron quad-core nodes. As a consequence, ATLAS decided last week to move back to the 10g client. A new release of CORAL, COOL and POOL (LCG 56e) has been prepared for ATLAS. In parallel the issue continues to be followed up with Oracle Support. (A runtime check of which client a process has loaded is sketched after this list.)
    • The end of basic support for Oracle 10g is presently foreseen for July 2010, but it is expected that one-year extended support until July 2011 can be obtained at no extra cost.
    • No loss of functionality is expected from the downgrade from 11g to 10g. The 11g client does provide new features that are currently being evaluated for CORAL, like client result caching and integration with TimesTen caching, but none of these is used in production yet.
    • The response from Oracle Support to the escalation request has been very disappointing. A request to move to "priority 2 - escalated" was sent two weeks ago, but this was finally done only today, after a reminder to Oracle by phone. The issue should be raised in the next meeting with Oracle in March.
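
The runtime check mentioned above: the Oracle client version actually loaded by a process can be queried at runtime, which is a quick way to verify the 11g to 10g downgrade on the affected nodes. A minimal sketch using the cx_Oracle bindings (an assumption for illustration; not part of the CORAL/COOL release itself):

```python
# Minimal sketch: report which Oracle client library this process loaded,
# to verify the 11g -> 10g client downgrade on the affected ATLAS nodes.
import cx_Oracle

major, minor, update, patch, port = cx_Oracle.clientversion()
version = f"{major}.{minor}.{update}.{patch}.{port}"
if major >= 11:
    print(f"11g client in use ({version}): still exposed to the Opteron bug")
else:
    print(f"10g client in use ({version}): downgrade effective")
```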

Frontier/Squid services

  • ATLAS weekly FroNTier meetings
  • Squid Server: Yesterday, February 24th 2010, version 2.7.STABLE7-8 of the frontier-squid server rpm was announced. It supports quattor-managed machines as well as other machines. WARNING: the rpm now installs in /data/squid by default. For further details please refer to https://twiki.cern.ch/twiki/bin/view/PDBService/SquidRPMsTier1andTier2. (A minimal liveness check for an upgraded squid is sketched after this list.)
  • Frontier Servlet: On February 15th 2010 a new Frontier servlet release, 3.23, was announced. It cures a bug introduced in version 3.22 that caused the server to get completely stuck, as happened on 2 of the 3 CMS production servers. The recommendation was to execute more tests at CERN, and possibly at BNL, and to ask sites to upgrade as soon as possible. See the release notes at http://frontier.cern.ch/dist/servletreleasenotes.txt for details.
  • In the future the recommended versions for the frontier-squid server and for the Frontier servlet will be advertised in the WLCG Baseline Versions Table.
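
The liveness check mentioned above: fetch the same URL twice through the site squid and compare the X-Cache response headers (a HIT on the second request indicates caching is working). A minimal sketch; the squid host and port are hypothetical placeholders, and site proxy ACLs may restrict which destination URLs are allowed.

```python
# Minimal sketch: verify a site squid answers and caches by requesting the
# same URL twice through it and inspecting the X-Cache response header.
# "squid.example.site:3128" is a hypothetical placeholder for the site squid.
import urllib.request

proxy = urllib.request.ProxyHandler({"http": "http://squid.example.site:3128"})
opener = urllib.request.build_opener(proxy)

url = "http://frontier.cern.ch/"  # any URL reachable via the Frontier server
for attempt in ("first", "second"):
    with opener.open(url, timeout=10) as resp:
        cache = resp.headers.get("X-Cache", "<no X-Cache header>")
        print(f"{attempt} request: HTTP {resp.status}, X-Cache: {cache}")
```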

Prolonged Site Downtimes: Strategies

Experiment Database Service Issues

  • ALICE: nothing to report
  • ATLAS:
    • BNL: Tables missing data from the replication were restored on Wednesday 17.02. The BNL apply process did not correctly apply transactions to some tables because they were missing the Streams dictionary. The cause is still unknown (the Streams dictionary is sent at schema level when the Streams capture process is configured); Oracle is still investigating this issue. The rest of the Tier1 sites have been checked and do not show the same problem.
    • GRIDKA: ATLAS replication to GridKa was blocked after the Streams administrator password change coordinated with Tier0: the account was locked by an internal ATLAS monitoring tool (of which Tier0 was not aware). This account MUST NOT be used for monitoring purposes: Streams replication depends on it, and it has administrative privileges the monitoring does not need (see the sketch at the end of this section).
    • A new account has been added to the PVSS Streams replication, and more accounts are expected. This setup is already running at the maximum sustainable LCRs/sec rate; ATLAS must take this into account when adding more load.
  • CMS: The capture process aborted last night after running out of memory due to long and large transactions. Memory has been increased to avoid the problem.
  • LHCb: nothing to report.

  • Tier0 - Streams:
    • The 3D OEM has been patched with the latest patch set 10.2.0.5.2 + 9274655 (specific Streams patch) on Monday 22.02.
    • On Monday night, due to a run of the time zone setting procedure, two jobs in the OMS3D repository got stuck in lock contention. No agents could upload data until Tuesday, when the jobs were re-enabled.
    • The Streams monitoring tool has been configured to use a different account (with fewer privileges) to monitor the databases involved in a replication environment.
  • Tier1 sites reports:
    • BNL: nothing to add
    • CNAF: LHCb cluster migration to new hw next Wednesday. In contact with Eva.
    • IN2P3: memory upgrade on ATLAS cluster last week. PSU to be applied next week.
    • SARA: nothing to report
    • PIC: nothing to report
    • RAL: PSU and PSEs to be installed. Revoked Java privileges.
      • CASTOR: PSU to be installed. One node to be added.
        • Plan to move part of the services onto the 3D and LFC/FTS databases, splitting the load over the 3 systems.
    • GRIDKA: PSU will be applied next week 2nd March.
    • TRIUMF: one node failed and was added back, but the GSD daemon won't start; Oracle suggested it might be corrupted, and the procedure to recover it will be tested.
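
The sketch referred to in the GRIDKA item above: a dedicated read-only account is sufficient for Streams monitoring, so the Streams administrator account never needs to be shared with monitoring tools. A minimal sketch; the account name strmmon, the password and the connect string are all hypothetical placeholders.

```python
# Minimal sketch: create a low-privilege, read-only account for Streams
# monitoring, so monitoring tools never use (or lock) the Streams
# administrator account. All names and credentials are placeholders.
import cx_Oracle

conn = cx_Oracle.connect("system/dba_password@atlas_db")  # run as a DBA
cur = conn.cursor()
cur.execute("CREATE USER strmmon IDENTIFIED BY changeme")
cur.execute("GRANT CREATE SESSION TO strmmon")       # allow login only
cur.execute("GRANT SELECT_CATALOG_ROLE TO strmmon")  # read DBA_*/V$ views
conn.close()
```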

AOB

Topic attachments
| Attachment | Size | Date | Who | Comment |
| ExperimentRequirements.pdf | 2522.7 K | 2010-02-25 | EvaDafonte | Experiment Requirements from the last Distributed Database Workshop @ CERN, November 2009 |
| cern-fts2.1-rpm-list | 2.1 K | 2010-02-25 | SteveTraylen | Current RPMs of FTS 2.1 at CERN |
| cern-fts2.2.3-rpm-list.txt | 2.0 K | 2010-02-25 | SteveTraylen | Current RPMs of FTS 2.2.3 at CERN |