LCGSCM Data Management Status

Oct 21, 2009

Castor

  • Currently running 2.1.8-12 on all VO instances
  • DB hardware moves ongoing.
  • Nameserver currently downgraded to 2.1.8. Waiting for a fixed version of 2.1.9, planning to deploy this before startup.
  • Startup: 2.1.9-2 will be deployed on public and c2cernt3. Possible upgrade of LHC experiment instances to be discussed per experiment.

  • xroot for Atlas Grid jobs (gssklog converting x509 cert to kerberos) - to be discussed.
  • Issue with checksums reported by CMS (CDR) - to discuss at Castor meeting.

SRM

  • All SRM instances at 2.8 - (LHCb still at 2.8.0 rather than 2.8.1 - transparent upgrade.)

FTS

  • Should aim for the fixed FTS 2.2 (using the old FTS/SRM interaction). New version in testing on PPS. ATLAS will already move. To discuss with other VOs.
  • Leave FTS 2.1 service around for a while.

LFC

  • Running version 1.7.2-4.
  • Hardware and DB moves done.
  • Plan to move support for CMS to prod-lfc-shared-central, and to decommission prod-lfc-cms-central. Important: no data import/export, so all CMS data will be "lost". No date fixed yet.

Oct 7, 2009

Castor

  • Currently running 2.1.8-12 on all VO instances
  • Plans: DB hardware move this month on all instances - shutdown 1 hour - to be scheduled.
  • Startup: 2.1.9-2 will be deployed on public and c2cernt3. Possible upgrade of LHC experiment instances to be discussed per experiment.

SRM

  • CMS, ATLAS and ALICE currently running SRM 2.7-19. LHCb running 2.8.0 (better support for TURLs, checksums and better operational logging).
    • 2.8.1 to be deployed on PPS, LHCb, ALICE asap. CMS and ATLAS to schedule (still testing on PPS). (2.8.1 fixes some checksum issues). Aim to have everyone at >=2.8 for startup.
    • 2.9.0 is in the pipeline (for deployment with Castor 2.1.9, adding syslog logging).
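The checksum fixes above concern how checksums are computed and reported by the storage layer. As a minimal sketch (assuming ADLER32, the checksum type commonly used in these storage systems, reported here as eight lowercase hex digits - the function name is illustrative), this is how a client could compute a file checksum incrementally for comparison against what the SRM reports:

```python
import zlib

def adler32_of_file(path, chunk_size=1 << 20):
    """Compute an ADLER32 checksum chunk by chunk, without
    loading the whole file into memory."""
    value = zlib.adler32(b"")  # seed value (1)
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)
    # Mask to 32 bits and format as 8 hex digits, a common
    # convention for reporting grid file checksums.
    return format(value & 0xFFFFFFFF, "08x")
```

Comparing such a locally computed value against the checksum stored by the storage system is how transfer tools detect corruption.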

FTS

  • FTS 2.2 being stress-tested on large PPS service with ATLAS.
    • First version to provide checksum support.
    • It changes the way it interacts with the SRMs for the first time in 4 years, and therefore requires a significant scale test.
    • To schedule tests with other VOs.
    • Status of T1 testing? (some sites had volunteered).
    • Startup: dependent on test results. Plan B: backport checksums and use FTS 2.1.

LFC

  • Running version 1.7.2-4.
  • Plan to move support for CMS to prod-lfc-shared-central, and to decommission prod-lfc-cms-central. Important: no data import/export, so all CMS data will be "lost". No date fixed yet.
  • All nodes are going out of warranty and will be replaced by new ones (except CMS). No date fixed yet.

Apr 29, 2009

Castor

  • CASTORCERNT3 running 2.1.8-7. To be patched on Wednesday 27th to 2.1.8-8 (latest version)
  • CASTORALICE running 2.1.8-7. To be patched on Wednesday 27th to 2.1.8-8 (latest version)
  • CASTORATLAS running 2.1.8-7. To be patched on Wednesday 27th to 2.1.8-8 (latest version)
  • CASTORCMS running 2.1.8-7. To be patched on Tuesday 2nd to 2.1.8-8 (latest version)
  • CASTORLHCB to be upgraded on Wednesday 27th to 2.1.8-7. To be patched on Tuesday 2nd to 2.1.8-8 (latest version)

  • CASTOR-xroot redirectors running 1.0.6-9 (latest version).

  • Castor central services VMGR, CUPV, VDQM running 2.1.8-3. To be patched on Thursday 28th to 2.1.8-8 (latest version).

SRM

  • All production instances still running version 2.7-15 which has some memory leak and stability problems.
    • 2.7-18 (the fixed version of 2.7-17) will be released today and put on srm-pps to be tested. Time is short for upgrade before STEP'09.
    • 2.8 will be put on the stresstest SRM PPS instance once it is released.

FTS

  • No changes before STEP

LFC

  • No changes before STEP
  • Waiting for LFC version 1.7.2-4 (with methods for ATLAS) to upgrade after STEP

Apr 29, 2009

Castor

  • Deployment of 2.1.8-7 on Tier-0 production instances will start next week with CASTORALICE. The 2.1.7 to 2.1.8 upgrade was successfully exercised on the CASTOR PPS setup last week.
  • Database switch intervention scheduled for Wednesday morning: the CERN-PROD site will be flagged 'At risk'.

SRM

  • Minor release of SRM v22 2.7 fixing problem with socket leak will be deployed in the next weeks

Jan 21, 2009

Castor

  • production services running stably
  • 2.1.8 testing still ongoing on (internal) repack setup

FTS

  • proxy delegation bug hit Atlas last Saturday, S/w fix promised in the next few weeks

SRM

  • production services running stably
  • maintenance release 2.7-14 deployed this week

LFC

  • Reboots of all LFC nodes to get the latest kernel on Friday afternoon:
    • Transparent on all nodes.
    • Except on the standalone read-only LHCb LFC (prod-lfc-lhcb-ro.cern.ch):
      • To avoid having to schedule a reboot with LHCb, we configured another machine to serve as read-only LFC.
      • The configuration wasn't complete: the LHCb VOBOXES were not trusted for 7 minutes (between 18:27 and 18:34).
      • LHCb assured that it did NOT affect them, as their code was robust against this.

Dec 10, 2008

Castor

  • Stager database upgrades to 10.2.0.4 finishing today
  • Testing of 2.1.8 on (internal) repack instance.

SRM

  • production deployment of v 2.7 finishing today
    • biggest problem: BoL behaviour. Degraded srm-atlas on Friday and Monday.
    • expecting a maintenance release addressing most issues discovered since upgrades started Dec 1
  • plan to stop SLC3 nodes next week.

Nov 19, 2008

Castor / SRM

  • Disklevel checksumming is now finally working and put in action on both CASTORATLAS and CASTORCMS and soon also CASTORALICE, CASTORLHCB and CASTORPUBLIC.
  • 2.1.8 is being production tested on the internal c2repack instance. Several bugs have been found and must be fixed before any deployment on the Tier-0 production instances can start.

FTS

  • CERN FTS SLC4 upgraded to latest released patch on Monday. Migration to SLC4 services good.
    • Target for end-of-life of SLC3 FTS services is end of this year. Discuss.
  • Status of new release for T1s - today?
  • Deployment and test of FTS monitoring tool progressing on FTS PPS.

LFC

  • Nothing to report.

Nov 5, 2008

Castor / SRM

  • Production instances:
    • Deploying bug-fix release 2.1.7-21
      • Allows for checksumming of RFIO v3 transfers
      • Atlas yesterday, others to be agreed (hopefully next week)
    • Deploying Castor 2.1.8
      • Planning is starting...
    • Deploying Castor SRM 2.7
      • PPS endpoints are being handed over
      • Atlas tests successful, and will be intensified
      • Production deployments: over the next weeks (barring problems)
    • Upgrade databases to Oracle 10.2.0.4
      • Upgraded Atlas and Alice SRM databases
      • others to be done together with upgrade to Castor 2.1.8

DPM / dCache

  • No report

FTS

  • No report

LFC

  • Smooth running.

Oct 15, 2008

Castor / SRM

  • Production instances: bug-fix release 2.1.7-19-2 has been deployed (ATLAS,CMS already, LHCB/ALICE/Public today)
  • Still testing SRM 2.7 on PPS, setting up pre-prod endpoints for the experiments
  • All SRM v1 instances have now been stopped.
  • Data corruption on one filesystem of one CMS diskserver https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20081008
    • 31 corrupted files saved
    • Improved monitoring being deployed
    • Castor RFIO checksumming was found not to work in the case of file updates, so we cannot configure checksumming currently

DPM / dCache

  • [from now on]

FTS

  • Testing continues at CERN on FTS SLC4 services (20 TB CMS, 20 TB Atlas, 0.5 TB LHCb)
  • Two issues holding up FTS SLC4 release:
    • log4cpp library bug (segfault for >1k messages) - library rereleased, 1 FTS RPM needs rerelease to fix dependency. Fix deployed at CERN.
    • Error message problem (FTS' error categorisation omitted from front of error string) - this is a pain for operations since it loses the information about where the error actually is. https://savannah.cern.ch/bugs/?32942 (January 2008). Fix now deployed at CERN. Postmortem todo.
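The point of the error-categorisation fix above is operational triage: with the category kept at the front of the error string, an operator can see immediately whether the source, the destination, or the transfer itself failed. A hypothetical sketch of the idea (function and category names are illustrative, not FTS code):

```python
def format_transfer_error(scope, phase, detail):
    """Keep the error category (scope) at the front of the message,
    so scanning logs by scope immediately localises the failure.
    scope: e.g. "SOURCE", "DESTINATION" or "TRANSFER" (illustrative)
    phase: e.g. "PREPARATION" or "TRANSFER" (illustrative)
    """
    return "%s error during %s phase: %s" % (scope, phase, detail)
```

Dropping the scope prefix, as in the bug above, forces operators to parse the free-text remainder to work out where the error actually occurred.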

LFC

  • Intervention on LHCb database content at CERN and Tier-1s was successful. 24h delay at RAL (database update did not finish whereas it took 15 minutes maximum at other sites, and the database content had to be imported from CERN again). RAL says they've tuned the database parameters since then, so that the intervention would work next time.

Oct 1, 2008

Castor / SRM

  • Production instances: maintenance release 2.1.7-19 to be deployed (transparently)
  • "T3 instance": depends on first official 2.1.8 release.
  • SRM: version 2.7 (first official SLC4 release) is out, will test it over the next weeks. Deployment plan to be developed with experiments.
  • SRM v1: to be stopped once LHCB LFC entries have been updated.

FTS

  • FTS 2.1 (gLite 3.1 SLC4) installed at CERN for T0-export and T2-service (in parallel to existing service). Experiments are asked to test the new service with a view to switching production to it soon.
  • Problem found in FTS 2.1 release (patch 2048): hold for now. We'll patch the CERN FTS services here.

LFC

  • LHCb SRM update: intervention date not fixed yet. Probably next week
  • LFC on SLC5 before LHC restart under discussion.

Sep 17, 2008

Castor / SRM

  • Production Castor services running v 2.1.7-17, SRM service at v 1.3-28
  • SRM frontend node moves done, planning move of backend servers. This will involve a small service interruption (~15 minutes)

FTS

  • FTS 2.1 (SLC4) installed for T0-EXPORT and T2-SERVICE, awaiting tests.

LFC

  • Staying with 1.6.8 for now (problem reported in newer version, under investigation).

Sep 3, 2008

Castor / SRM

FTS

  • CERN FTS 2.1 installation proceeding (will run in parallel to existing SLC3 based production services).
  • Some node moves to ensure reasonable warranty coverage for existing SLC3 FTS services.

LFC

  • No issues.

Aug 20, 2008

Castor / SRM

  • All Castor instances now upgraded to 2.1.7-14

FTS

  • Planning FTS 2.1 rollout. Still in discussion.

LFC

  • On request from Nick, testing the 1.6.11-3 LFC 64 bits against Oracle.
  • Waiting for a confirmation from Akos/David whether the difference of version for glite-security-voms RPMs is intentional.

Aug 6, 2008

Castor / SRM

  • All SRM endpoints were upgraded to 1.3-28 last week
  • CASTORATLAS upgraded to 2.1.7-14 last Thursday
  • CASTORCMS upgrade scheduled for next Tuesday, August 12th, 09h00-11h30 CERN time

Jul 7, 2008

Castor / SRM

  • We will propose to upgrade Castor ATLAS next week to 2.1.7-10 (allows different GC policy, various fixes)

LFC

  • All LFCs are now running with 60 threads, not only the LHCb ones.
  • Support for vo.sixt.cern.ch added to prod-lfc-shared-central.

Transfer service

  • Testing ongoing for SL4 FTS version. Couple of issues reported from CMS and Atlas - being understood (1 is Castor, the other looks like a config mistake on one of the channels). Software certified and now in Preproduction. New machines at CERN being prepared for installation. Want to wait a bit more to see the results of the pilot before scheduling upgrade. Again, we'd like to run for a couple of weeks at CERN before pushing to T1 sites (though time is short).

  • Following issue of 'missing' FTM data - we'll contact a likely site and see if we can debug there.

Jun 24, 2008

Castor / SRM

  • SRM running stably
    • SRM endpoints now split over dedicated hardware (RACs, frontend, backend servers)
    • bug-fix release 1.3-27 deployed for CMS today, other VOs tomorrow
  • Castor
    • no issues

LFC

  • nothing much

Transfer service

  • nothing much

June 3, 2008

Castor / SRM

  • Q1: we want to configure the use of the Castor "internal gridftp" implementation. This requires an FTS upgrade first. As the tURL format changes, we will contact the VO's to coordinate this change.

Transfer

  • nothing much
  • extracting data from the service for the CCRC post-mortem

21 May 2008

Transfer

  • Various degradations to T0-export as a result of Castor SRM issues.
  • Found out that the version of FTS deployed for CCRC'08 to permit the preferred Castorint gridFTP setup only fixes half of the problem (mis-tag).

CASTOR

  • Various service degradations to the SRM service, summarised at https://prod-grid-logger.cern.ch/elog/Data+Operations/2
    • LHCb tablespace problem on SRM database (ran out of space)
    • Issue of long stager operations - putDones taking a long time (20 mins). Understood: solved in latest Castor version. Upgrading ATLAS today (21 May), CMS tomorrow (22 May).
    • Issue of the SRM server getting stuck ("CGSI: no header returned", i.e. the server accepted the TCP connection but didn't respond), causing serious degradation for ATLAS: not understood. Putting an actuator in place to restart the daemon in case of no response. This is in itself a really bad workaround - the SRM doesn't gracefully service its threads on restart - but it's better than leaving it stuck.
    • Misc. application deadlocks in the SRM database - some resolved by new SRM version. Upgrading ATLAS and CMS tomorrow (22 May).

LFC

  • No issue.

30 April 2008

Transfer

  • FTS upgrade to patch 1740 (supporting Castorint gridFTP configuration) on all CERN-PROD services.
  • DESY channels moved to FTS-T2-SERVICE at CMS' request.

CASTOR

  • Deployed versions:
    VO          Castor   SRM2                   SRM1
    ----------  -------  ---------------------  -------------------------------------
    Atlas       2.1.7    srm-atlas, v 1.3-21    -
    CMS         2.1.7    srm-cms, v 1.3-21      srm.cern.ch
    Alice       2.1.6    srm-alice, v 1.3-21    -
    LHCb        2.1.6    srm-lhcb, v 1.3-21     srm.cern.ch, srm-durable-lhcb.cern.ch
    Ops, Dteam  2.1.7    srm-{ops,dteam}        srm.cern.ch
  • Disk caches:
    • required diskspace for CCRC provided for Atlas, CMS, Alice, will add 50 TB to LHCb today

16 April 2008

CASTOR

  • Preparing for CCRC: Software upgrades, space token creations, access control lists update, ...
  • Castor version 2.1.7 has been deployed on Castoratlas on April 8, and is stabilising. We are planning an upgrade of Castorpublic next week. Other instances will start CCRC on latest release of 2.1.6
  • Decommissioning SRM v1 for Atlas and Castorgrid Classic SE for all LHC VO's

LFC

  • All LFC servers, except the LHCb ones: 3 hours downtime on Thursday afternoon (14:00 to 17:00) to rename SURLs.
  • Unplanned 2 hours downtime of the LHCb LFCs today from 15:00 to 17:00. The LHCb DB was marked down for scheduled intervention, but the associated service wasn't (it wasn't clear that this database intervention was going to affect the service). Communication problems?

Transfer

27 February 2008

CASTOR

  • Current deployed versions:
    • Castoratlas, Castorcms, Castorpublic (as of today): v 2.1.6-10
    • Castoralice, Castorlhcb: v 2.1.4-10 (upgrades early in March?)
    • SRM2 endpoints: v 1.3-14
  • CCRC report

13 February 2008

Transfer Service

  • A fair number of issues have been discovered in FTS
    • Issue with proxy delegation: sometimes the delegated proxy is bad [the private key is corrupt]. (Not understood, workaround in progress). [CMS]
    • Issue with proxy delegation: on the FTS the proxy is indexed by the hash of its DN (it should be by the hash of the DN AND the VOMS roles). This means you can end up not getting a VOMS role when you expect to, which affects the mapping on some SRMs (reported at NIKHEF). [ATLAS]
    • Failed to process transfers in 5 seconds (not understood, busy servers?). This reduces the efficiency of the jobs since they fail and have to be retried. [ATLAS]
    • CERN-ASGC agent stuck for a day on CERN-PROD FTS-T0-EXPORT: monitoring and lemon actuator didn't restart it (Not yet understood). [ATLAS]
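The second delegation bullet above describes a cache-keying bug: keyed on the DN alone, two proxies for the same identity with different VOMS roles collide. A hypothetical sketch of the fix it implies (illustrative names, not FTS code):

```python
import hashlib

def delegation_cache_key(dn, voms_attrs):
    """Key the delegated-proxy cache on the DN *and* the VOMS
    attributes, so proxies with the same identity but different
    roles never collide. (Illustrative sketch, not FTS code.)"""
    h = hashlib.sha1(dn.encode("utf-8"))
    for attr in sorted(voms_attrs):  # sorted: order-independent key
        h.update(b"\x00" + attr.encode("utf-8"))
    return h.hexdigest()
```

With the buggy behaviour (hashing the DN only), both calls in the example below would map to the same cache slot, and one user could silently pick up the other's proxy with the wrong role.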

  • Other issues:
    • Storm: 'someone' doing SRM.Ls is creating excessive load on CNAF Storm - not clear where Ls requests come from. We have confirmed FTS sends a compliant SRM.Ls(numLevels=0).
    • SRM.mkdir issue on dCache understood: in the case where the directory already exists, dCache should return the correct code. Reported to dCache.
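The srmMkdir bullet is about return-code fidelity: the SRM v2.2 specification assigns a distinct status, SRM_DUPLICATION_ERROR, to creating a directory that already exists, which lets clients treat that case as benign rather than as a hard failure. A sketch of the server-side mapping (the status codes are from the SRM v2.2 spec; everything else is illustrative, not dCache's actual implementation):

```python
def srm_mkdir_status(exc):
    """Map a local mkdir failure onto an SRM v2.2 status code.
    Illustrative only; not dCache's actual implementation."""
    if isinstance(exc, FileExistsError):
        # Distinct, recoverable status: the client can decide that
        # the directory already being there is fine.
        return "SRM_DUPLICATION_ERROR"
    if isinstance(exc, FileNotFoundError):
        return "SRM_INVALID_PATH"          # parent path missing
    if isinstance(exc, PermissionError):
        return "SRM_AUTHORIZATION_FAILURE"
    return "SRM_FAILURE"                   # anything else
```

Collapsing all of these into a generic failure code, as reported above, is what forces clients to retry operations that could have succeeded trivially.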

LFC

  • All quiet on the western front. No issues.

CASTOR

  • Current deployed versions:
    • Castoratlas, Castorcms: v 2.1.6-10
    • Castoralice, Castorlhcb, Castorpublic: v 2.1.4-10
    • SRM2 endpoints: v 1.3-13
  • CCRC report
    • Stable running, mainly
    • Several bugs and configuration problems found and fixed in the first week

16 January 2008

Transfer Service

  • CCRC'08 preparations:
    • Awaiting patch 1589
    • Deploying Lyon's transfer service monitor on fts102.

CASTOR

  • Preparations for CCRC ongoing:
    • Creation of SRM v2 space tokens well underway (only LHCB configuration to be finalized)
    • Upgrade of CASTOR central services next Wednesday (downtime during morning)
    • CASTOR s/w v 2.1.6 still under test.

Previous reports

Older reports are moved to:

Topic revision: r223 - 2009-10-21 - GavinMcCance
 