LCGSCM Data Management Status - 2007
12 December 2007
Transfer service
- Bug fixes for FTS for SRM 2.2 support - patch coming.
CASTOR
November 14 2007
Pilot services
- For FTS and LFC for ATLAS. To be discussed: what, hardware, support.
Transfer service
- Patch for FTS coming through certification, fixing some of the issues found during CSA'07 (Transparent Intervention). Another patch, fixing more issues, is coming through integration hot on its heels.
- Monitoring node (FTM) now available; fts102 will be updated to use this.
LFC
- LHCb replica deployment: to be discussed.
CASTOR
- All instances now upgraded to 2.1.4; all diskservers running SLC4.
- SRM v2.2 production endpoint for LHCb delivered; we hope to deliver the Atlas endpoint later this week.
October 17, 2007
- CASTORATLAS was successfully upgraded to 2.1.4 on 10 October. Two days later (12/10) we had problems with the service, with several different types of Oracle errors reported in the stager logs. The situation went back to normal after all services were restarted and the stager itself was moved to a different box. The problem is not understood, but the old server is undergoing memory checking (successful so far).
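As a minimal illustration of how such an incident can be triaged, the sketch below tallies the distinct ORA-NNNNN codes appearing in a log file; the log path is a placeholder, not the real stager log location.

```python
import re
from collections import Counter

ORA = re.compile(r"ORA-\d{5}")

def ora_error_summary(path):
    """Tally the distinct ORA-NNNNN codes in a log file, to see whether
    one Oracle error dominates or several unrelated ones appear."""
    counts = Counter()
    with open(path, errors="replace") as f:
        for line in f:
            counts.update(ORA.findall(line))
    return counts.most_common()

# The path below is a placeholder for the actual stager log.
for code, n in ora_error_summary("/var/log/castor/stager.log"):
    print(f"{code}: {n}")
```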
October 10, 2007
CASTOR
- CASTORALICE and CASTORLHCB updated to 2.1.4 (ALICE two weeks ago and LHCb last week). The LHCb upgrade revealed a problem with the root protocol. The problem was found and fixed (with an urgent patch) the same day.
- CASTORATLAS being upgraded to 2.1.4 today
August 22, 2007
Transfer Service
- SL4 testing ongoing.
- Support for CSA'07:
- Many issues with FTS channel limits not being set high enough, so jobs get 'stuck in the queue' and PhEDEx goes into a cancel/resubmit thrashing loop with FTS (see the toy simulation after this list).
- CERN-FNAL SRM copy channel had problems. CMS had to move the tier-0 export operations to the FNAL FTS server.
- 1 config error corrected
- 1 bug found in DB cache layer.
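To make the thrashing dynamic concrete, here is a toy simulation (our own sketch, not PhEDEx or FTS code): a channel with a fixed number of concurrent-transfer slots, a FIFO queue behind it, and a client that cancels and resubmits any job queued longer than its patience. With too few slots, cancellations swamp useful work; with enough slots they disappear.

```python
from collections import deque

def simulate(slots, arrivals_per_tick, transfer_ticks, patience, ticks=10000):
    """Toy FTS-like channel: at most `slots` concurrent transfers, a FIFO
    queue, and a client that cancels/resubmits jobs queued > `patience`."""
    queue = deque()              # submission tick of each queued job
    active = []                  # remaining ticks of each running transfer
    finished = cancelled = 0
    for now in range(ticks):
        active = [r - 1 for r in active]
        finished += sum(1 for r in active if r == 0)
        active = [r for r in active if r > 0]
        # impatient client: cancel stale jobs, resubmit at the back
        while queue and now - queue[0] > patience:
            queue.popleft()
            cancelled += 1
            queue.append(now)
        # promote queued jobs into free slots
        while queue and len(active) < slots:
            queue.popleft()
            active.append(transfer_ticks)
        for _ in range(arrivals_per_tick):
            queue.append(now)
    return finished, cancelled

# Undersized channel: cancellations dominate.
print(simulate(slots=5,  arrivals_per_tick=1, transfer_ticks=10, patience=30))
# Adequately sized channel: almost no cancellations.
print(simulate(slots=15, arrivals_per_tick=1, transfer_ticks=10, patience=30))
```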
DPM
- lxdpm101 is a core service for SAM, so it should be moved to production state.
- Need a plan for re-installation to increase the /var partition on this box.
LFC
- Upgrade coming soon for secondary groups.
CASTOR (Olof)
- CASTORCMS and CASTORLHCB were successfully upgraded to the 2.1.3-24 release last week. CASTORPUBLIC is being upgraded today
- A misconfiguration deployed late on Friday afternoon caused problems (full filesystems) for the COMPASS CDR over the weekend. The problems were solved during Saturday. Although the same misconfiguration was also deployed on the LHC stagers, it was fixed there before any damage happened (a free-space check sketch follows this list).
- A patch for some of the reported SRM v2 problems was deployed Monday this week. The 'too many threads' problem has not been solved yet.
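A minimal free-space check of the kind that would have flagged the filling filesystems earlier; the mount points and threshold are illustrative, not the actual COMPASS pools.

```python
import shutil

def check_filesystems(mounts, min_free_fraction=0.10):
    """Print a warning for any filesystem below the free-space threshold."""
    for m in mounts:
        usage = shutil.disk_usage(m)
        free = usage.free / usage.total
        status = "OK" if free >= min_free_fraction else "LOW - investigate"
        print(f"{m}: {free:.1%} free [{status}]")

# Hypothetical pool mount points for illustration.
check_filesystems(["/srv/castor/pool1", "/srv/castor/pool2"])
```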
Transfer Service (Steve)
August 15, 2007
LFC (Ignacio)
- Stable running over the last few weeks, no issues
Transfer service (Steve)
- A patch is pending for the production FTS. It is ready to go in now, whenever convenient, and will cause some downtime.
Castor (Jan)
- Generally stable running over the last few weeks
- Maintenance release 2.1.3-24 being deployed (Alice and Atlas done, CMS today, LHCb tomorrow, Public instance next week)
- New version 2.1.4 with improved support for 'durable' diskpools is under test.
- Changed node type of our SRM endpoints from "SE" to "SRM" in GOCDB, fixes network monitoring tool at https://ccenoc.in2p3.fr/DownCollector/?sn=CERN-PROD
- SRM v2 testing has revealed several bugs, most seem minor, but the experts are absent right now.
July 18, 2007
LFC
Transfer service
- Patch 1232 certified.
- Pilot service upgraded to the latest patch; channel definitions for SRM 2.2 testing underway, as per the SRM 2.2 testing plan.
- Service review underway: FtsServiceReview20
Castor
July 10, 2007
LFC
- Streams replica setup for LHCb
Transfer service
- FTS 2.0 patch (1126) had a problem. A new patch was issued (1232) and built quickly - now in 'Ready for Certification'.
- Problem last week on the transfer service: MyProxy, on which we depend, was down for one day. Problem understood.
Castor
- The cleaning daemon (which should keep the stager database clean) was found not to work properly in the latest release. This led to a change of an Oracle execution plan, which degraded Castorcms on Friday and Castorpublic over the weekend. A workaround is now deployed, and all instances are closely watched (a sketch for spotting such plan changes follows this list).
- A bugfix release (which includes a fix for the cleaning daemon) should become available today. We intend to run tests for a week, and will contact the experiments for a deployment in ~2 weeks from now.
- We have been hit by nameserver problems on two occasions. The problems are not understood, but are under active investigation.
- SRM v2:
- stress tests have started, and have triggered some s/w problems that the developers are looking at
- we are configuring SRM v2 for the LHCb tests
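For illustration, a sketch (assuming cx_Oracle access and SELECT privilege on the v$sql view) that lists statements Oracle has executed with more than one plan; a sudden extra plan_hash_value is the symptom behind the degradation described above. Credentials and DSN are placeholders.

```python
import cx_Oracle

def plan_flips(conn):
    """Return statements that have run with more than one execution plan."""
    cur = conn.cursor()
    cur.execute("""
        SELECT sql_id,
               COUNT(DISTINCT plan_hash_value) AS plans,
               SUM(executions) AS execs
        FROM   v$sql
        GROUP  BY sql_id
        HAVING COUNT(DISTINCT plan_hash_value) > 1
        ORDER  BY plans DESC""")
    return cur.fetchall()

conn = cx_Oracle.connect("stager/secret@stagerdb")  # placeholder credentials
for sql_id, plans, execs in plan_flips(conn):
    print(sql_id, plans, execs)
```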
June 27, 2007
Castor
- CASTORALICE being upgraded to 2.1.3-15 today. This finalises this round of upgrades (hurrah!)
- SAM test failures last Wednesday
- at ~11:30, stagemappings on Castor gridftp servers got corrupted (human error), causing SAM tests to fail. The problem was solved at 22:00.
- at ~18:00, SAM tests stopped running (Tomcat problems?)
- Gridview site availability plots continued to report the Castor SEs as down, apparently based on stale SAM test results
- Castor SRM information provider problems this Monday
- an upgrade of the Castor SRM information provider (attempting to fix a missing entry) broke the information for the CEs in a non-obvious way. The upgrade passed our (simple) tests, but something still went wrong
- the roll-back introduced another problem, which was reported by a CMS user
- the problems were solved on Tuesday morning. Laurence provided a script to test the LDIF, which we asked to be distributed.
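For illustration, a minimal LDIF sanity check in the spirit of that script (not the actual script): it verifies that each record starts with a dn: line and that every other line is a comment, a folded continuation, attr: value, or attr:: base64-value.

```python
import base64, re, sys

ATTR = re.compile(r"^[A-Za-z][\w;.-]*$")

def check_ldif(path):
    """Report lines that do not look like valid LDIF."""
    problems, in_record = [], False
    for n, raw in enumerate(open(path), 1):
        line = raw.rstrip("\n")
        if not line:                            # blank line ends a record
            in_record = False
            continue
        if line.startswith("#") or line.startswith(" "):
            continue                            # comment / continuation
        if "::" in line:
            attr, _, value = line.partition("::")
            try:
                base64.b64decode(value.strip())
            except Exception:
                problems.append((n, "bad base64 value"))
        elif ":" in line:
            attr = line.partition(":")[0]
        else:
            problems.append((n, "missing ':' separator"))
            continue
        if not ATTR.match(attr):
            problems.append((n, f"odd attribute name {attr!r}"))
        if attr.lower() == "dn":
            in_record = True
        elif not in_record:
            problems.append((n, "attribute outside any dn: record"))
    return problems

for n, why in check_ldif(sys.argv[1]):
    print(f"line {n}: {why}")
```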
June 20, 2007
LFC
- Planning upgrade of CERN-PROD to LFC 1.6.5. This introduces secondary groups
- Schema upgrade needed (from current 1.6.3).
Transfer service
- FTS intervention on Monday
- Noted fragmentation on the DB: need to understand it. Another intervention running this morning (June 20) to defragment tables (a rough fragmentation-estimate sketch follows this list).
- Software issue noted on one channel causing intermittent downtime - not seen on pilot service testing. Understanding this now. It's affecting lhcb and cms on the CERN-PIC export channel.
- Discussions ongoing with 3D group about FTS service: requirements, volume, advice, etc
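A rough sketch of how table fragmentation can be estimated from optimizer statistics, assuming cx_Oracle access, fresh statistics, and an 8 KiB block size; credentials and the threshold factor are placeholders.

```python
import cx_Oracle

BLOCK = 8192  # assumed block size in bytes

def fragmented_tables(conn, factor=2.0):
    """Yield tables whose allocated blocks exceed `factor` times the
    space their rows should need, a crude fragmentation signal."""
    cur = conn.cursor()
    cur.execute("""
        SELECT table_name, blocks, num_rows, avg_row_len
        FROM   user_tables
        WHERE  num_rows > 0 AND blocks > 0""")
    for name, blocks, rows, rowlen in cur:
        needed = rows * rowlen / BLOCK
        if blocks > factor * max(needed, 1):
            yield name, blocks, round(needed)

conn = cx_Oracle.connect("fts/secret@ftsdb")  # placeholder credentials
for name, have, need in fragmented_tables(conn):
    print(f"{name}: {have} blocks allocated, ~{need} needed")
```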
Castor
- Successfully upgraded castorpublic, castoratlas and castorlhcb to latest castor release (2.1.3-15)
- castorcms will be upgraded to 2.1.3-15 tomorrow
- castorlhcb/lhcbdata 'durable' pool has been increased to 80TB, which is what LHCb will require up to EOY'07
- Monthly Savannah ticket review meeting will take place this afternoon in 513-1-027 (phone conf arranged for external institutes). Agenda at http://indico.cern.ch/conferenceDisplay.py?confId=16168
May 30, 2007
LFC
- The lcg-vomscerts RPM was not updated in time, and the service was interrupted on Thu May 24.
Castor
May 16, 2007
Castor
- Database moves to new hardware:
- Castorpublic being moved today
- Atlas move to be planned, tied to 2.1.3 tests.
- 2.1.3: Atlas T0 + export tests on C2ATLAST0 ongoing.
- srm.cern.ch was down last Tuesday/Wednesday because of a h/w problem on the request spool node. A very high request rate from an LHCb user may have caused this.
- We want to limit access to the classic SE castorgrid to non-LHC VOs.
- SAM tests against CERN-PROD SRM endpoints are failing for different reasons; a cleanup of the SRM endpoint information in GOCDB is necessary.
- Q: support for OPS on srm-durable-{atlas,lhcb}? Move srm-v2.cern.ch to PPS? Or try to fix the problem?
May 9, 2007
Transfer service
- Intervention planned for upgrade of tier-0 export to 2.0. Scheduling awaiting validation by CMS and Alice.
LFC
- Planning to upgrade LFC to 1.6.4.
Castor
- Database moves to new hardware:
- Alice, LHCb have now been moved
- Castorpublic (dteam + ops!) to be moved next Wednesday, May 16
- Atlas move to be planned, tied to 2.1.3 tests.
- 2.1.3: Atlas started T0 + export tests on C2ATLAST0. Biggest current issue: migrator speed too low, creating backlogs
- srm.cern.ch currently down, because of h/w problem on the request spool node (mid-range server...)
- new glite-yaim rolled out, allowing us to remove a few hacks for the SE_castor node type
April 25, 2007
Transfer service
- FTS 2.0 testing: Alice and LHCb OK. Alice sees problems getting the state back into AliEn. CMS starting this week. Installation at RAL PPS underway.
- Intervention planning underway for FTS 2.0 upgrade.
LFC
Castor
- [Castor 2.1.3]: testing and debugging continues, Atlas Tier0 setup is being prepared in parallel
- We are planning to move the Alice databases next Wednesday, May 2nd. Details here
- Issues with Gridview Publisher being followed up by the developers.
April 18, 2007
Transfer service
- ATLAS and LHCb tested FTS 2.0. CMS and Alice this week.
- FTS 2.0 being certified.
LFC
- 1.6.4-2 LFC/DPM working its way through integration (support for secondary groups).
- Should define rollout plan and work out what ACLs are needed for Atlas.
- All T1 sites at LFC 1.6.3.
Castor
- [Castor 2.1.3] (new LSF plugin): testing and debugging continues.
- Plan to migrate Castor databases to RAC setup on https://twiki.cern.ch/twiki/bin/view/FIOgroup/ScCastorOracleRac.
- We are planning to move the CMS databases tomorrow morning, Apr 19
- Two hours of service interruption on Castoratlas yesterday, Apr 17 (stager database dropped by human error).
April 4, 2007
Transfer service
LFC
Castor
- Castor nameserver database has been moved to DES RAC on Apr 2
- preparation of Atlas T0 stager with new Castor version (new LSF plugin) ongoing
- working instruction to regularly clean stager databases being put in place
- maintenance release of Castor-SRM v1 deployed on the Cern endpoints
March 21, 2007
Transfer service
- FTS 2.0 release still ongoing.
- Request from Atlas to set all channels on the CERN T0-export to a gridFTP timeout of 60 minutes, to reduce load on Castor Atlas. No objections from the FTS side, but since the channels are shared with other experiments, it had to be agreed by the SCM. This was agreed at the SCM; Steve will deploy this change.
LFC
- 'Errors noted from Atlas LFC' understood - they were using the wrong endpoint.
Castor
- Intensive debugging of Castoratlas problems continues
- Upgrade of WD firmware campaign is advancing well
- Bug fix release of Castor srm v1 is underway (thanks to A&A)
- We are preparing diskpools for LHCb Tier-0 tests, and Compass CDR
- 'cleaning database': not followed up
- 'resilience against service failures': not followed up
March 14, 2007
Transfer service
- Announcement sent for service split.
- FTS 2.0 release preparation underway.
LFC
- Errors noted from Atlas on their LFC - it's not clear they are all calling the correct methods; following this up with them.
- Another LHCb request to change SRM endpoint name was directly sent to LFC.Support. Q: are the experiments aware that they are expected to go through the Weekly OPS meeting now?
A: Yes - here is the procedure from the OPS meeting minutes:
All significant interventions (those involving multiple sites, multiple services, or significant work for a single service) requested by VOs should be announced at the operations meeting, in the WLCG section of the meeting.
It will be the responsibility of the VO to find a coordinator for the intervention (this could be someone from the CERN EIS team, a service manager, or someone with sufficient knowledge from the VO). The coordinator will create an intervention plan (template available), which must be ratified by all parties involved.
Once the intervention is requested through the operations meeting, planned, and agreed, the proper broadcast should be sent.
Castor
- Main item: Castoratlas problems
- Thu/Fri: problems with execution plan of the migrator, causing high I/O wait on the stager database, slowing down everything.
- Mon: replaced LSF plugin by a version w/o logging. No functional change, but it stopped crashing...
- Now: we observe a high number of requests, coming in bursts, from non-Tier0 activity. This causes a scheduling queue, and Tier-0 cannot efficiently use its resources. Under investigation.
- SRM endpoints not published on Thu Mar 1, afternoon. Trailing whitespace...
February 28th 2007
Transfer service
- Main activity: CMS continues; Atlas has just started to ramp up.
- Service split now complete - load-balanced aliases now in place (a quick alias-resolution check follows this list). Still to announce the notice period for the switchover (the T0-export still runs the catch-all channels as well) - goal: all experiments to switch by 1 April.
- Installing monitoring nodes (webservers) on the gridfts cluster: these will present status pages and transfer summarisation pages.
- Proactively and systematically following up problems with sites that have been detected by the FTS.
- Pilot installed with FTS 2.0 - still understanding some problems on a couple of channels before opening to experiments.
- SRM 2.2 tests continue.
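A quick way to see which machines currently sit behind a load-balanced alias is a plain DNS lookup; the alias name below is illustrative, not the production endpoint.

```python
import socket

def alias_members(alias):
    """Print the canonical name and A records behind a DNS alias."""
    name, _aliases, addrs = socket.gethostbyname_ex(alias)
    print(f"{alias} -> canonical {name}, addresses {addrs}")

alias_members("fts-t0-export.example.ch")  # hypothetical alias
```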
LFC
- LHCb LFC upgraded to latest version.
Castor
- added support for EELA on the Cern SE
- WD disk firmware upgrade is starting, "easy" boxes first
- we will ask the LHC VOs to stop using Castorgrid
- aim to move Castor nameserver to different hardware, 3rd week of March. To be planned in detail.
- actively working with LHCb and ROOT team to allow grid user jobs to access Castor files through the ROOT protocol
February 21st 2007
- RAC intervention on Thursday.
Transfer service
- FTS 2.0 pilot deployment done. Testing with dteam at low transfer rate on all channels. Will begin testing with experiments asap.
- FTS service split still to do: waiting on DNS alias.
- Beginning stress-test of FTS software against SRM 2.2 test instances: DPM and dCache for now.
LFC
- Plan to upgrade LHCb LFC during LHCb RAC intervention - Monday 26th, 8:00-11:00
Castor
- WD is at CERN preparing for the disk intervention. First servers were done to understand what is involved.
- Atlas Tier-0 pools updated to SLC4 and to new hardware that will not be part of WD intervention.
February 14th 2007
Transfer service
- Production moved to new hardware. A few configuration issues caused degradation of the service.
- Service split still to do.
- FTS 2.0 pilot deployment still ongoing.
LFC
- All LFC's (except LHCb) have been upgraded to 1.6.2 on Feb 13.
Castor
- Deployed new Gridview gridftp logfile publisher s/w on diskservers, incl. monitoring (a log-parsing sketch follows this list)
- Moved SRM v1 endpoint to more reliable hardware during last week's network intervention
- Still no tested procedure (thus no timeline) for WD firmware upgrades
- Are preparing 48 diskservers for Castor-2 instances
- Castor-2 added to SLS
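A minimal sketch of parsing a gridftp transfer-log line, assuming the usual space-separated KEY=VALUE layout (the exact fields vary by server version); the sample line is made up.

```python
import re

PAIR = re.compile(r"([\w.]+)=(\S+)")

def parse_transfer_line(line):
    """Turn one KEY=VALUE transfer-log line into a dict."""
    return dict(PAIR.findall(line))

sample = "DATE=20070214120000.1 HOST=disk001 NBYTES=1048576 TYPE=RETR CODE=226"
rec = parse_transfer_line(sample)
if rec.get("CODE") == "226":              # 226 = FTP 'transfer complete'
    print(f"{rec['HOST']} served {int(rec['NBYTES'])} bytes")
```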
DPM
- Three new DPMs deployed - 1 for SAM tests (lxdpm101), 2 for interoperability and development work (lxdpm102, lxdpm103). These nodes are fully quattor-managed.
January 31st 2007
Transfer service
- SRM 2.2 tests ongoing with test SRM systems.
- Moving FTS service to new hardware this week: no service interruption on main service. Short interruption on T2 CAF service.
- The T2 CAF service will change endpoint: we will run the CERN-STAR channels on both services in parallel for a couple of weeks.
LFC
- Core-dumps in the LFC server: the dteam instance has been instrumented (see the sketch after this list) - no incidents yet to record.
- LFC/DPM 1.6.1 ready for certification.
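One simple form of such instrumentation is to raise the process's core-file limit so that a crash actually leaves a core dump to inspect; this is a generic sketch, not the actual change made on the LFC server.

```python
import resource

# Raise the soft core-file limit to the hard limit, so a crashing
# daemon started from this process can write a core dump.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
print("core file size limit now:", resource.getrlimit(resource.RLIMIT_CORE))
```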
Castor
- we are starting to plan upgrades. Plan should be ready in a week...
- we are updating lxbatch configurations to map LHCB grid jobs to their Castor-2 instance
Issues
January 17th 2007
Transfer service
- fta_wrong problems on FTS over Christmas understood (configuration problem on the new FTS setup locking the DB account).
- Multi-VO tests starting again.
- Pilot of FTS 2.0 this week.
- SRM 2.2 tests ongoing with test SRM systems.
LFC
- Looking at problems with LFC: unexplained core-dumps in server.
Castor
- stable running (mostly...)
- problems with Castoratlas during the Christmas break, and now again
- new version to be tested and deployed in coming weeks
- castorgrid now runs SLC4
Issues
- ~120 production diskservers will need to have their disk firmware upgraded. This will be a major operation, for which planning is starting now. Timescale: February.
Previous reports
Older reports have been moved to: