LCGSCM Data Management Status - 2007

December 12, 2007

Transfer service

  • Bug fixes for FTS for SRM 2.2 support - patch coming.

CASTOR

  • No issues.

November 14, 2007

Pilot services

  • For FTS and LFC for ATLAS. To be discussed: what, hardware, support.

Transfer service

  • A patch for FTS is coming through certification, fixing some of the issues found during CSA'07 (Transparent Intervention™). Another patch with further fixes is hot on its heels, coming through integration.
  • Monitoring node (FTM) now available. fts102 will be updated to use this.

LFC

  • LHCb replica deployment: to be discussed.

CASTOR

  • All instances now upgraded to 2.1.4; all diskservers are running SLC4.
  • SRM v2.2 production endpoint for LHCb delivered; we hope to deliver the Atlas endpoint later this week.

October 17, 2007

  • CASTORATLAS was successfully upgraded to 2.1.4 on 10 October. Two days later (12/10) we had problems with the service, with several different types of ORACLE errors reported in the stager logs. The situation returned to normal after restarting all services and moving the stager itself to a different box. The problem is not understood, but the old server is undergoing memory checking (successful so far).

October 10, 2007

CASTOR

  • CASTORALICE and CASTORLHCB upgraded to 2.1.4 (ALICE two weeks ago, LHCb last week). The LHCb upgrade revealed a problem with the root protocol; it was found and fixed (with an urgent patch) the same day.
  • CASTORATLAS being upgraded to 2.1.4 today

August 22, 2007

Transfer Service

  • SL4 testing ongoing.
  • Support for CSA'07:
    • Many issues with FTS channel limits not being set high enough - jobs get 'stuck in the queue' and PhEDEx goes into a cancel/resubmit thrashing loop with FTS (see the sketch after this list).
    • The CERN-FNAL SRM copy channel had problems; CMS had to move the Tier-0 export operations to the FNAL FTS server.
      • 1 config error corrected
      • 1 bug found in DB cache layer.
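
The cancel/resubmit pattern above is generic: cancelling a queued job and resubmitting it loses the job's place in the channel queue and multiplies load on the server. As an illustration only (not PhEDEx's or FTS's actual code; every function name here is a hypothetical placeholder, since the real service was driven via the glite-transfer-* command-line tools), a client-side loop that polls with capped exponential backoff avoids the thrashing:

    import itertools
    import random
    import time

    # Hypothetical stand-ins for FTS client calls -- placeholders only,
    # simulating a job that slowly works its way through the channel queue.
    _states = itertools.chain(['Pending'] * 3, ['Active'], itertools.repeat('Done'))

    def submit_job(surls):
        return 'job-0001'                  # pretend the submission succeeded

    def job_state(job_id):
        return next(_states)               # pretend to poll the job's state

    def wait_with_backoff(job_id, base=30.0, cap=1800.0):
        """Poll a queued job with capped exponential backoff (plus jitter)
        instead of cancelling and resubmitting, which loses the queue
        position and thrashes both client and server."""
        delay = base
        while True:
            state = job_state(job_id)
            if state in ('Done', 'Failed'):
                return state
            time.sleep(delay + random.uniform(0.0, delay / 4.0))
            delay = min(delay * 2.0, cap)

    if __name__ == '__main__':
        jid = submit_job(['srm://example.cern.ch/some/file'])
        print(jid, wait_with_backoff(jid, base=0.1))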

DPM

  • lxdpm101 is a core service for SAM, so it should be moved to production state.
  • A plan is needed for re-installation to increase the /var partition on this box.

LFC

  • Upgrade coming soon for secondary groups.

CASTOR

  • ...

August 22, 2007

CASTOR (Olof)

  • CASTORCMS and CASTORLHCB were successfully upgraded to the 2.1.3-24 release last week. CASTORPUBLIC is being upgraded today.
  • A misconfiguration deployed late on Friday afternoon caused problems (full filesystems) for the COMPASS CDR over the weekend. The problems were solved during Saturday. Although the same misconfiguration was also deployed on the LHC stagers, it was fixed before any damage was done.
  • A patch for some of the reported SRM v2 problems was deployed Monday this week. The 'too many threads' problem has not been solved yet.

Transfer Service (Steve)

August 15, 2007

LFC (Ignacio)

  • Stable running over the last few weeks, no issues

Transfer service (Steve)

  • A patch is pending for the production FTS. It is ready to go in whenever convenient; it will cause some downtime.

Castor (Jan)

  • Generally stable running over the last few weeks
  • Maintenance release 2.1.3-24 being deployed (Alice and Atlas done, CMS today, LHCb tomorrow, Public instance next week)
  • New version 2.1.4 with improved support for 'durable' diskpools is under test.
  • Changed the node type of our SRM endpoints from "SE" to "SRM" in GOCDB; this fixes the network monitoring tool at https://ccenoc.in2p3.fr/DownCollector/?sn=CERN-PROD
  • SRM v2 testing has revealed several bugs, most seem minor, but the experts are absent right now.

July 18, 2007

LFC

  • ...

Transfer service

  • Patch 1232 certified.
  • Pilot service upgraded to the latest patch - channel definitions for SRM 2.2 testing underway, as per the SRM 2.2 testing plan.
  • Service review underway: FtsServiceReview20

Castor

  • ...

July 10, 2007

LFC

  • Streams replicas set up for LHCb

Transfer service

  • FTS 2.0 patch (1126) had a problem. A new patch (1232) was issued and built quickly - now in Ready for Certification.
  • Problem last week on the transfer service: the MyProxy service, on which we depend, was down for one day. The problem is understood.

Castor

  • The cleaning daemon (which should keep the stager database clean) was found not to work properly in the latest release. This led to a change of an Oracle execution plan, which degraded Castorcms on Friday and Castorpublic over the weekend. A workaround is now deployed, and all instances are being closely watched (a sketch of what such a cleaning pass can look like follows this list).
  • A bugfix release (which includes a fix for the cleaning daemon) should become available today. We intend to run tests for a week, and will contact the experiments about a deployment in ~2 weeks from now.
  • We have been hit by nameserver problems on two occasions. The problems are not understood, but are under active investigation.
  • SRM v2:
    • stress tests have started, and have triggered some s/w problems that the developers are looking at
    • We are configuring SRM v2 for the LHCb tests
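
The general shape of such a cleaning pass is worth sketching, purely as an illustration: delete finished requests in small batches with a commit per batch, so each transaction stays short. This is not the actual CASTOR cleaning daemon; the 'requests' table and its columns are invented for the example, and sqlite3 stands in for the production Oracle database.

    import sqlite3  # stand-in for the production Oracle connection

    def clean_finished_requests(conn, older_than_days=7, batch=1000):
        """Delete finished stager requests in small batches, committing
        after each batch so transactions stay short. The schema here is
        hypothetical, not the real CASTOR stager schema."""
        cur = conn.cursor()
        while True:
            cur.execute(
                "DELETE FROM requests WHERE id IN ("
                " SELECT id FROM requests"
                " WHERE status = 'FINISHED'"
                "   AND finished_at < datetime('now', ?)"
                " LIMIT ?)",
                ('-%d days' % older_than_days, batch))
            conn.commit()                  # one short transaction per batch
            if cur.rowcount < batch:
                break                      # nothing (or little) left to delete

    if __name__ == '__main__':
        conn = sqlite3.connect(':memory:')
        conn.execute("CREATE TABLE requests (id INTEGER PRIMARY KEY,"
                     " status TEXT, finished_at TEXT)")
        conn.executemany("INSERT INTO requests (status, finished_at)"
                         " VALUES (?, ?)", [('FINISHED', '2000-01-01')] * 2500)
        clean_finished_requests(conn)
        print(conn.execute("SELECT count(*) FROM requests").fetchone()[0])  # 0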

June 27, 2007

Castor

  • CASTORALICE is being upgraded to 2.1.3-15 today. This finalises this round of upgrades (hurrah!)
  • SAM test failures last Wednesday
    • at ~11:30, stagemappings on Castor gridftp servers got corrupted (human error), causing SAM tests to fail. The problem was solved at 22:00.
    • at ~18:00, SAM tests stopped running (Tomcat problems?)
    • Gridview site availability plots continued to report Castor SE's as down, apparently based on stale SAM test results
  • Castor SRM information provider problems this Monday
    • an upgrade of the Castor SRM information provider (attempting to fix a missing entry) broke the information for the CE's in a non-obvious way. The upgrade passed our (simple) tests, but something still went wrong
    • the roll-back introduced another problem, which was reported by a CMS user
    • the problems were solved on Tuesday morning. Laurence provided a script to test the LDIF, which we asked to be distributed.
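
Laurence's script is not reproduced here, but a minimal LDIF sanity check of this kind might look like the sketch below. It assumes a simplified LDIF grammar and checks only for records missing a dn: and for duplicate dn: values - the sort of non-obvious breakage described above; the real Glue-schema checks are not shown.

    import sys

    def parse_ldif(text):
        """Split LDIF text into records (lists of 'attr: value' lines).
        Basics only: comments, blank-line separators, and continuation
        lines that start with a single space."""
        records, current = [], []
        for raw in text.splitlines():
            if raw.startswith('#'):
                continue
            if not raw.strip():
                if current:
                    records.append(current)
                    current = []
            elif raw.startswith(' ') and current:
                current[-1] += raw[1:]     # unfold continuation line
            else:
                current.append(raw)
        if current:
            records.append(current)
        return records

    def check(records):
        """Report records without a dn: and duplicate dn: values."""
        seen, problems = set(), []
        for rec in records:
            dns = [l[4:].strip() for l in rec if l.lower().startswith('dn: ')]
            if not dns:
                problems.append('record without dn: starts %r' % rec[0][:40])
            elif dns[0] in seen:
                problems.append('duplicate dn: %s' % dns[0])
            else:
                seen.add(dns[0])
        return problems

    if __name__ == '__main__':
        for problem in check(parse_ldif(sys.stdin.read())):
            print(problem)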

June 20, 2007

LFC

  • Planning upgrade of CERN-PROD to LFC 1.6.5. This introduces secondary groups
    • Schema upgrade needed (from current 1.6.3).

Transfer service

  • FTS intervention on Monday
    • Noted fragmentation on the DB: needs to be understood. Another intervention is running this morning (June 20) to defragment tables.
    • A software issue was noted on one channel, causing intermittent downtime - not seen during pilot service testing. We are working to understand it now. It affects LHCb and CMS on the CERN-PIC export channel.
  • Discussions ongoing with 3D group about FTS service: requirements, volume, advice, etc

Castor

  • Successfully upgraded castorpublic, castoratlas and castorlhcb to latest castor release (2.1.3-15)
  • castorcms will be upgraded to 2.1.3-15 tomorrow
  • castorlhcb/lhcbdata 'durable' pool has been increased to 80TB, which is what LHCb will require up to EOY'07
  • Monthly Savannah ticket review meeting will take place this afternoon in 513-1-027 (phone conf arranged for external institutes). Agenda at http://indico.cern.ch/conferenceDisplay.py?confId=16168

May 30, 2007

LFC

  • The lcg-vomscerts RPM was not updated in time, and the service was interrupted on Thu May 24.

Castor

  • Atlas T0 tests progressing well; we are starting to plan upgrades of the production stagers.
  • Gridftp timeouts on Atlas transfers to BNL are being investigated (a sketch of this kind of per-type, per-domain log summarisation follows this list):
    
    Analysed 40241 gridftp requests between Sun May 27 04:03:10 2007 and Mon May 28 04:10:34 2007
    
    *** Transfers ***
    
       *** Type "SIZE"                                   -  20421 requests
          ***  (none)                                    -  20372 transfers
          ***  *.bnl.gov                                 -     15 transfers
          ***  *.pic.es                                  -     15 transfers
          ***  *.gridka.de                               -      7 transfers
          ***  *.in2p3.fr                                -      5 transfers
          ***  *.sara.nl                                 -      5 transfers
          ***  *.ndgf.org                                -      2 transfers
    
       *** Type "RETR"                                   -  19733 requests
          ***  *.bnl.gov                                 -   5980 transfers
          ***  *.gridka.de                               -   4279 transfers
          ***  *.sara.nl                                 -   3271 transfers
          ***  *.in2p3.fr                                -   3206 transfers
          ***  *.pic.es                                  -   1467 transfers
          ***  *.triumf.ca                               -   1385 transfers
          ***  *.ndgf.org                                -    145 transfers
    
       *** Type "STOR"                                   -     87 requests
          ***  (none)                                    -     87 transfers
    
    *** Failures ***
    
       *** Error "421 Timeout"                           -    553 requests
          ***  *.bnl.gov                                 -    542 errors
          ***  *.gridka.de                               -     10 errors
          ***  *.sara.nl                                 -      1 errors
    
       *** Error "451 Local resource failure"            -    320 requests
          ***  *.sara.nl                                 -    127 errors
          ***  *.in2p3.fr                                -     94 errors
          ***  *.gridka.de                               -     60 errors
          ***  *.pic.es                                  -     39 errors
    
    
  • SAM tests now configured for all SRM endpoints, intermittent failures to be investigated
  • High load on castorgrid caused by a few CMS users, who have been contacted.
  • RGMA-based publishing of gridftp records has been stopped
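
A per-type, per-domain breakdown like the one above can be produced by grouping log records; the sketch below assumes a simplified one-record-per-line input ('TYPE client-host ...'), which is not the real gridftp transfer-log layout.

    import collections
    import re

    def summarise(lines):
        """Count requests per (type, client domain), mirroring the grouping
        used in the report above. Each input line is assumed to start with
        'TYPE client-host' - a simplification of the real log format."""
        counts = collections.defaultdict(collections.Counter)
        for line in lines:
            parts = line.split()
            if len(parts) < 2:
                continue
            rtype, host = parts[0], parts[1]
            # Reduce the client host to a domain, e.g. foo.bnl.gov -> *.bnl.gov
            m = re.search(r'\.([^.]+\.[^.]+)$', host)
            domain = '*.' + m.group(1) if m else '(none)'
            counts[rtype][domain] += 1
        return counts

    if __name__ == '__main__':
        sample = ['RETR dcache01.bnl.gov 226', 'RETR se1.gridka.de 226',
                  'SIZE localhost 213']
        for rtype, per_domain in summarise(sample).items():
            print('Type %-6s - %5d requests' % (rtype, sum(per_domain.values())))
            for domain, n in per_domain.most_common():
                print('   %-20s %6d' % (domain, n))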

May 16, 2007

Castor

  • Database moves to new hardware:
    • Castorpublic being moved today
    • Atlas move to be planned, tied to 2.1.3 tests.
  • 2.1.3: Atlas T0 + export tests on C2ATLAST0 ongoing.
  • srm.cern.ch was down last Tuesday/Wednesday because of a h/w problem on the request spool node. A very high request rate from an LHCb user may have caused this.
  • We want to limit access to classic SE castorgrid to non-LHC VO's.
  • SAM tests against the CERN-PROD SRM endpoints are failing for different reasons; a cleanup of the SRM endpoint information in GOCDB is necessary.
  • Q: support for OPS on srm-durable-{atlas,lhcb}? Move srm-v2.cern.ch to PPS? Or try to fix the problem :-)

May 9, 2007

Transfer service

  • Intervention planned for upgrade of tier-0 export to 2.0. Scheduling awaiting validation by CMS and Alice.

LFC

  • Planning to upgrade LFC to 1.6.4.

Castor

  • Database moves to new hardware:
    • Alice, LHCb have now been moved
    • Castorpublic (dteam + ops!) to be moved next Wednesday, May 16
    • Atlas move to be planned, tied to 2.1.3 tests.
  • 2.1.3: Atlas started T0 + export tests on C2ATLAST0. Biggest current issue: migrator speed too low, creating backlogs
  • srm.cern.ch currently down, because of h/w problem on the request spool node (mid-range server...)
  • new glite-yaim rolled out, allowed to remove a few hacks for SE_castor node type

April 25, 2007

Transfer service

  • FTS 2.0 testing: Alice and LHCb OK, though Alice see problems getting the state back into AliEn. CMS are starting this week. Installation at the RAL PPS is underway.
  • Intervention planning underway for FTS 2.0 upgrade.

LFC

  • No new issues.

Castor

  • [Castor 2.1.3]: testing and debugging continues, Atlas Tier0 setup is being prepared in parallel
  • We are planning to move the Alice databases next Wednesday, May 2nd. Details here
  • Issues with Gridview Publisher being followed up by the developers.

April 18, 2007

Transfer service

  • ATLAS and LHCb tested FTS 2.0. CMS and Alice this week.
  • FTS 2.0 being certified.

LFC

  • 1.6.4-2 LFC/DPM working its way through integration (support for secondary groups).
    • Should define rollout plan and work out what ACLs are needed for Atlas.
  • All T1 sites at LFC 1.6.3.

Castor

  • [Castor 2.1.3] (new LSF plugin): testing and debugging continues.
  • Plan to migrate Castor databases to RAC setup on https://twiki.cern.ch/twiki/bin/view/FIOgroup/ScCastorOracleRac.
  • We are planning to move the CMS databases tomorrow morning, Apr 19
  • Two hours of service interruption on Castoratlas yesterday, Apr 17 (stager database dropped by human error).

April 4, 2007

Transfer service

LFC

Castor

  • Castor nameserver database has been moved to DES RAC on Apr 2
  • preparation of Atlas T0 stager with new Castor version (new LSF plugin) ongoing
  • working instruction to regularly clean stager databases being put in place
  • maintenance release of Castor-SRM v1 deployed on the Cern endpoints

March 21, 2007

Transfer service

  • FTS 2.0 release still ongoing.
  • Request from Atlas to set all channels on the CERN T0-export service to a gridFTP timeout of 60 minutes, to reduce the load on Castor Atlas. No objections from the FTS side, but since the channels are shared with other experiments, it should be agreed by the SCM. This was agreed at the SCM; Steve will deploy the change.

LFC

  • 'Errors noted from atlas LFC' understood - they were using the wrong endpoint.

Castor

  • Intensive debugging of Castoratlas problems continues
  • Upgrade of WD firmware campaign is advancing well
  • Bug fix release of Castor srm v1 is underway (thanks to A&A)
  • We are preparing diskpools for LHCb Tier-0 tests, and Compass CDR
  • 'cleaning database': not followed up
  • 'resilience against service failures': not followed up

March 14, 2007

Transfer service

  • Announcement sent for service split.
  • FTS 2.0 release preparation underway.

LFC

  • Errors noted from Atlas on their LFC - it's not clear they are all calling the correct methods; we are following this up with them.
  • Another LHCb request to change SRM endpoint name was directly sent to LFC.Support. Q: are the experiments aware that they are expected to go through the Weekly OPS meeting now?

A: Yes - here is the procedure from the OPS meeting minutes:

All significant interventions (those involving multiple sites, multiple services, or significant work for a single service) requested by VOs should be announced at the operations meeting, in the WLCG section of the meeting. It will be the responsibility of the VO to find a coordinator for the intervention (this could be someone from the CERN EIS team, a service manager, or someone with sufficient knowledge from the VO). The coordinator will create an intervention plan (a template is available), which must be ratified by all parties involved. Once the intervention has been requested through the operations meeting, planned, and agreed, the proper broadcast should be sent.

Castor

  • Main item: Castoratlas problems
    • Thu/Fri: problems with execution plan of the migrator, causing high I/O wait on the stager database, slowing down everything.
    • Mon: replaced LSF plugin by a version w/o logging. No functional change, but it stopped crashing...
    • Now: we observe a high number of requests, coming in bursts, from non-Tier0 activity. This causes a scheduling queue, and Tier-0 cannot efficiently use its resources. Under investigation.
  • SRM endpoints not published on Thu Mar 1, afternoon. Trailing whitespace...

February 28, 2007

Transfer service

  • Main activity: CMS transfers continue; Atlas have just started to ramp up.
  • Service split now complete - load-balanced aliases are now in place. Still to announce the notice period for the switchover (the T0-export service still runs the catch-all channels as well) - goal: all experiments to switch by 1 April.
  • Installing monitoring nodes (webservers) on gridfts cluster: these will present status pages and transfer summarisation pages.
  • Proactively and systematically following up problems with sites that have been detected by the FTS.
  • Pilot installed with FTS 2.0 - still understanding some problems on a couple of channels before opening to experiments.
  • SRM 2.2 tests continue.

LFC

  • LHCb LFC upgraded to latest version.

Castor

  • added support for EELA on the Cern SE
  • WD disk firmware upgrade is starting, "easy" boxes first
  • we will ask the LHC VO's to stop using Castorgrid
  • aim to move Castor nameserver to different hardware, 3rd week of March. To be planned in detail.
  • actively working with LHCb and ROOT team to allow grid user jobs to access Castor files through the ROOT protocol

February 21, 2007

  • RAC intervention on Thursday.

Transfer service

  • FTS 2.0 pilot deployment done. Testing with dteam at low transfer rate on all channels. Will begin testing with experiments asap.
  • FTS service split still to do: waiting on DNS alias.
  • Beginning stress-test of FTS software against SRM 2.2 test instances: DPM and dCache for now.

LFC

  • Plan to upgrade LHCb LFC during LHCb RAC intervention - Monday 26th, 8:00-11:00

Castor

  • WD is at CERN preparing for the disk intervention. First servers were done to understand what is involved.
  • Atlas Tier-0 pools updated to SLC4 and to new hardware that will not be part of WD intervention.

February 14, 2007

Transfer service

  • Production moved to new hardware. A few configuration issues caused degradation on the service.
  • Service split still to do.
  • FTS 2.0 pilot deployment still ongoing.

LFC

  • All LFC's (except LHCb) have been upgraded to 1.6.2 on Feb 13.

Castor

  • Deployed new Gridview gridftp logfile publisher s/w on diskservers, incl monitoring
  • Moved SRM v1 endpoint to more reliable hardware during last week's network intervention
  • Still no tested procedure (thus no timeline) for WD firmware upgrades
  • Are preparing 48 diskservers for Castor-2 instances
  • Castor-2 added to SLS

DPM

  • Three new DPMs deployed - one for SAM tests (lxdpm101), two for interoperability and development work (lxdpm102, lxdpm103). These nodes are fully quattor-managed.

January 31, 2007

Transfer service

  • SRM 2.2 tests ongoing with test SRM systems.
  • Moving FTS service to new hardware this week: no service interruption on main service. Short interruption on T2 CAF service.
  • The T2 CAF service will change endpoint: we will run CERN-STAR channels on both services in parallel for a couple of weeks.

LFC

  • Core dumps in the LFC server: the dteam instance has been instrumented - no incidents recorded yet.
  • LFC/DPM 1.6.1 ready for certification.

Castor

  • we are starting to plan upgrades. Plan should be ready in a week...
  • we are updating lxbatch configurations to map LHCb grid jobs to their Castor-2 instance

Issues

  • ...

January 17, 2007

Transfer service

  • fta_wrong problems on FTS over Christmas understood (a configuration problem on the new FTS setup was locking the DB account).
  • Multi-VO tests starting again.
  • Pilot of FTS 2.0 this week.
  • SRM 2.2 tests ongoing with test SRM systems.

LFC

  • Looking at problems with LFC: unexplained core-dumps in server.

Castor

  • stable running (mostly...)
  • problems with Castoratlas during X-mas break, now again
  • new version to be tested and deployed in coming weeks
  • castorgrid now runs SLC4 :-)

Issues

  • ~120 production diskservers will need their disk firmware upgraded. This will be a major operation, for which planning is starting now. Timescale: February.

Previous reports

Older reports are moved to:
