Castor Task Force

Introduction

The Castor Task Force was created on 22 March 2007 to address the current stability and performance problems in the short and medium term. Its mandate is described in a document prepared by Les Robertson.

Document and web pages


Minutes of meetings

Morning meeting every day at 9:30

* 13 August 2007 *

Present : Jan, Miguel, Ignacio, Bernd, Sebastien, Dennis, Ulrich

The new version 2.1.4 has been built and passed the first tests, but there are still problems with RFIO and tape migration. The new version is installed on ITDC and will be installed on C2PPS. More stress tests are needed. As ATLAS and CMS are busy, we have to invent a few more tests to be run over the weekend. Especially the new functionality for the durable pools needs to be tested in detail.

There are still problems with the RPMs from xrootd; they should be rebuilt against 2.1.4. The two instances (ITDC and C2PPS) now have a very similar setup in terms of pools and size; the main difference is that ITDC can be ‘upgraded’ on-the-fly by the developers.

We have to come up with additional test scenarios. Sebastien will start right away to produce release notes, so that the operations team can start to integrate the changes into an upgrade plan.

2.1.4 also requires a small change in the name server setup. This should be tested on the name server test setup we have. Miguel will look into this; ITDC and probably also C2PPS should point to the test name server.

There are also changes in the VDQM and VMGR area, but there seem to be no test suites. Bernd will have a chat with Tim about this.

The new software should be run on the head nodes with SLC4, so that we are sure that it runs with both SLC3 and SLC4. The move to 2.1.4 in September will still use SLC3 head nodes and we will move to SLC4 later. This SLC4 move will also incorporate some other changes, e.g. a better load balancing setup, spreading nodes over different switches, etc.

Ulrich mentioned that there is a new LSF7 version. The decision is to use this in the new 2.1.4 setup, but not to upgrade the 2.1.3 instances.

The move to 2.1.3-24 for CMS and LHCb is planned for Wednesday and Thursday respectively.

A long discussion followed about the note Bernd had prepared last week (Castor_disk_pool_issues_v2.doc).

We agreed on the following:

  • Sebastien will spend one day to integrate and test a ‘simple’ mechanism to check user requests already in the request handler, based on several parameters (user, group, pool, read/write), using a simple black/white listing scheme. This would allow restrictions on the use of pools, e.g. in a durable pool the production user can write files but everybody else can only read. A minimal sketch of such a check is shown after this list.
  • We can’t do anything complicated with the garbage collector and should keep this as simple as possible.
  • Disk-to-disk copies are useful to a certain extent, but because they are not ‘controlled’ they can easily wreck a pool. No easy solution was found; the focus should be on performance throttling.
  • There is a need to regularly have the list of files on disk. Sebastien remarked that this is an expensive operation and can give incorrect results if users have renamed files. Bernd insisted on this point, but also clarified that this is not a tool which should be given to the users. We need again full control in this area.
  • There is a need for more complicated user access control. The read command can of course trigger a tape recall or a disk-to-disk copy. For certain durable pools this should not be allowed. This would require larger changes in the stager which cannot be done easily and have to wait for the next release (2.1.5 ?).
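
As an illustration (not the actual Castor implementation), a minimal sketch of such a black/white-list check in the request handler could look as follows; all names and the rule format are hypothetical:

# Minimal sketch of a per-pool black/white-list check keyed on
# (user, group, access mode). Names and rule format are hypothetical.
from dataclasses import dataclass

@dataclass
class Request:
    user: str
    group: str
    pool: str
    mode: str          # "read" or "write"

# Hypothetical rule table: per pool, a whitelist of (user-or-*, group-or-*, mode)
ACCESS_RULES = {
    "durable_pool": [
        ("prod_user", "*", "write"),   # only the production account may write
        ("*", "*", "read"),            # everybody else may only read
    ],
    # pools with no entry fall back to "allow everything"
}

def is_allowed(req: Request) -> bool:
    """Return True if the request passes the simple white-list check."""
    rules = ACCESS_RULES.get(req.pool)
    if rules is None:
        return True                    # unrestricted pool
    for user, group, mode in rules:
        if (user in ("*", req.user)
                and group in ("*", req.group)
                and mode == req.mode):
            return True
    return False

# Example: a write into the durable pool by a normal user is rejected
print(is_allowed(Request("alice_user", "alice", "durable_pool", "write")))  # False
print(is_allowed(Request("prod_user", "cms", "durable_pool", "write")))     # True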


* 01 August 2007*

Present : Olof, Jan, Dennis, Bernd

In one of the CMS servers there is an accumulation of GridFTP services not finishing correctly causing high load. To be investigated by Jan and Olof.

The ‘stageout’ problem has been fixed and tested and will be part of 2.1.3-24. Dennis will build this release tomorrow and run the usual test suites. Jan has in the meantime tested the upgrade procedures for 2.1.3-23 successfully. The difference between 23 and 24 is minor.

Alice and CMS will be scheduled for the upgrade next week (Wednesday and Thursday).

Another timeout needs to be added to RFIO. This needs some further investigation and more tests. We decided not to include this in the release for next week, but rather do more investigations for 2.1.4.

Olof discovered another bug in the stager which leads to inconsistencies, which means the system can’t find the files anymore. There are already a few hundred in the CMS stager; Olof checks this regularly and has means to rectify the situation. There was a major fix for a similar problem in 2.1.3, but it looks like a new one was introduced.

Olof will give Julia another test suite which tests exceptions in the RFIO. We need more tests of this exceptional kind.

The ‘stageout’ problem raised the question again whether we have enough monitoring and alarms.

Yesterday Bernd participated in a discussion with the T0 team about the near-term plans.

This is a quick summary from today's meeting I had with Michael, Dirk and Peter from the CMS T0 team:

-- there will be a weekly T0 meeting with CMS, starting this Thursday
-- the first point is to sort out the pool configuration for their tests
-- CMS will define specific test suites to be run on the pools, with defined parameters (throughput, #streams, etc.)
-- they observed 3-4 Castor hiccups during the last 2 weeks, which should be analyzed
-- they (Peter) are interested in extensive xrootd tests. I proposed that they first piggy-back on the tests with Andreas from ALICE and have larger tests at the end of September. In the meantime the same clarification in terms of operation and responsibilities has to be done with CMS as has been started with ALICE.

-- key points for Castor (actually the same as for ATLAS):
--> have a regular list of files on disk
--> ensure read-only pools, access restriction to specific pools (user 'white-lists')
(none of the two can wait until 2008...)


* 31 July 2007*

Present : Jan, Olof, Bernd, Dennis

We have a large accumulation of files in the stageout state. A chunk of code was removed from the stager already in 2.1.2 (repack build). Dennis tried to put back the missing part of the code, but it did not work because the structure of the call changed (putdone area). Dennis and Julia are looking into this. Olof is still trying to find a recipe for the cleanup and recovery.

MigHunter does not shut down gracefully. This was fixed and can be part of 2.1.3-24. A segfault of the MigHunter has also been fixed by Olof.

After a switch problem Jan had to do some cleanup in the pools, as quite a few servers and file systems were disabled. This needs further investigation. More logging is necessary for the rmmaster and rmnode daemons (status updates); Dennis will do this for the next release.


* 30 July 2007*

Present : Ignacio, Jan, Olof, Bernd, Dennis

Olof found a problem with files staying in STAGEOUT caused by an interrupted transfer. There are currently about 30000 files in this state in each of the ATLAS, CMS and PUBLIC Castor instances. Dennis and Olof will analyze this problem now. Olof will also invent a mechanism to clean/recover these files. There have been no user complaints, which is a bit odd.

During the next weeks there will be only a thin man-power coverage of the service due to the holiday season.

Jan will still try the upgrade of the PPS system with the latest release (2.1.3-23). The hope is that the ‘STAGEOUT’ file fix will have minimal impact on the procedures. The plan is to upgrade ALICE and ATLAS next week (8-9 August), CMS and LHCb the week after (15-16 August) and PUBLIC on the 23rd of August (SPS MD).

Jan is replacing quite a few machines in c2public (compass and na48). The Transtec server still cannot be used for this.

2.1.4 is still on time.

About 50 non-Castor disk servers need to be cleaned/upgraded by Jan.


* 23 July 2007*

Present : Dennis, Jan, Bernd, Olof, German

The garbage collection is already broken in the running release. Only non-tape files are affected, where a temporary table in the stager is used to speed up the deletion of files. This does not work and as a consequence the files are not deleted in the name server. In 2.1.3-21 the temporary part changed and now causes looping GCs and heavy load on the name server. Dennis will revert to the previous version with a hot-fix; the real fix will only be deployed with 2.1.4.

PPS upgrade procedure for 2.1.3-21 will be tested in the afternoon by Jan and Dennis. The ALICE upgrade is now postponed to Thursday.

The NA48 and Compass disk server problems were traced back to a load balancing problem. The policy tries to distribute the load equally over all disk servers, but there are two different types (E4 and E5) with different performance characteristics. The load then brings the E4 nodes to their knees. The first measure is to decrease the number of slots from 350 to 50 on the E4 nodes and also reduce the slots on the E5 nodes to 250. The problem is particularly severe as these nodes cannot run SLC4; thus they use SLC3 with ext3 as a file system, which under high load tends to lose files.

The Compass problem with a lot of files still to be migrated is also understood. The migration policy will be changed. If there are files to be migrated on only one disk server, the tape is un-mounted after each file, which of course slows down the migration considerably. A PL/SQL hot-fix ‘resolves’ that problem, but could cause ‘over-booking’ of file systems for migration. Dennis will apply this hot-fix right away to PUBLIC, so that Olof and Jan can watch the performance. A more complete and sophisticated fix will be developed for 2.1.4.

German has finished the new NCM component which allows much more flexible log-rotation policies on the disk and tape servers. This needs to be deployed and configured by Jan. We need to decide how long the logfiles should stay (200 days) and which logfiles (rtcopy, rfiod, gridftp, etc.). Bernd will schedule a meeting next week devoted to logfile policies and logfile analysis.


* 23 July 2007*

Present : Olof, Jan, Nilo, Dennis, Bernd, Miguel, German

Some 1200 files were not migrated in Compass, all of them on one disk server. Miguel is checking the problem.

The ALICE upgrade to 2.1.3-21 on Wednesday is still scheduled and some preparations have already started, but Dennis discovered on the ITDC instance some problems with the GC (it seems to loop on already deleted files). To be checked; the decision on the upgrade will be taken tomorrow morning.

Nilo needs to apply an Oracle security update. He will start with the test instances and then do the DLF DBs. The stager and particularly the name server will wait. To prevent possible security incidents, Nilo will restrict the number of hosts which can access the DBs.

High load on Compass and NA48 disk servers; most of them run ext3, which can cause file loss when the server is unresponsive. It is not clear what the reason is. Olof will look into the problem with the disk servers.

Rmmaster restart problems on PUBLIC, probably because the ‘cleaning’ of the running processes beforehand did not work. Needs further investigation; in the meantime the actuator will be disabled and, in case this happens again, the developers will be called for a real-time debugging session. There seems to be a general problem of restarting daemons after crashes or hangs.

A discussion on pool setup followed. Bernd presented some general layouts. One of the basic problems is how to protect pools. How to make a pool read-only for users (including stopping recalls and disk-to-disk copies)? This will become a general requirement for all experiments. Bernd will write a note with more numbers.


16 July 2007

Present : Olof, Jan, Miguel, Bernd, Dennis, German

Discussion about backlogs in the request handler, as Oracle does not keep the order of requests in the table. This led earlier in the year to problems with overwriting files in the ATLAS T0 exercise. It happens a lot when Oracle is very busy. There is a way to ‘fix’ this in Oracle, but it has not been implemented yet.

Still problems with Compass over the weekend; it seems that the tape writing is very slow. Their rate is much higher than expected. The stager ‘died’ on Saturday at 2 in the morning; Miguel restarted it by hand later in the morning. There were no alarms, the stager was probably blocked; further investigations have high priority.

The cleaning problem was fixed as a hotfix to 2.1.3-21-1. Needs a further test in production.

There is also a backlog of 80k files in CMS to be migrated. Need further investigation.

The new release is already on c2test and ITDC, will also be installed on c2pps. Dennis will run more stress tests starting today. Miguel has a summer student to run and develop stress tests through batch clients. Dennis and the summer student will coordinate their activities.

Jan will announce the upgrade to 2.1.3-21 to the LCG DCM and the experiments, so that the deployment can be scheduled during the next 2 weeks. It will take a downtime of about 2h.

The xrootd tests have stopped and seem not to continue during the next 3 weeks because of holiday periods. Needs to be followed up by Bernd.

Quite some activities in the SRM area. Jan has set up the SRM endpoints for LHCb, published in the BDII.

Atlas has restarted the export exercise on a low level.

The bug review is confirmed for Wednesday.

CNAF upgraded last week successfully to 2.1.3-15.


11 July 2007

Present : Jan, Miguel, Maarten, Sebastien, Dennis, Ignacio, Bernd

Jan finished the setup of the LHCb SRM2 endpoint. Disk pool setup discussion with LHCb by Jan today. Decide on a name server directory structure to have good matching of file classes and tape pools.

Flavia continued her tests and found a few more errors that appear regularly.

There have been some more problems with the name server, which degraded the ATLAS service for about 1 hour and affected others more lightly. A thread from ATLAS was blocked, it is not clear why; finally the thread was killed, which resolved the problem. No put statement went through. Nilo looked at the name server briefly and today Miguel and Nilo will go into more detail. Also on automation: Nilo, Maarten and Miguel will look into scripting the collection of debugging information.

Name server needs more attention !

Coming back to the RFIO and name server issues: need to check with JPB.

The new cleaning script is causing problems with the recalls, as it seems to also delete the corresponding sub-requests. More investigation needed.

Release 19 is there and can be installed.


10 July 2007

Present : Jan, Miguel, Ignacio, Bernd, Dennis, Sebastien, Maarten

Preparations for the change of the DB passwords on the stager instances. Ignacio has tried it on the PPS and ITDC instances and found a few problems with password length and blocked accounts. This is now fixed by the DB experts. The change should probably start with ALICE and only during the day, when DB experts are around.

Miguel asked for two name server head nodes, one for test and one for the SLC4 tests.

Dennis did a setup for more SLC4 head nodes, to be put into ITDC for tests. First time we try SLC4 on the head nodes.

SRM overview from Giuseppe: Flavia found some problems and Giuseppe did the debugging. One needs to change the Castor client software, not a complicated problem. Flavia was running with 9 nodes with 30 clients each, producing SRM requests constantly. But this was running only for several minutes, so it is not clear what the level of the problem is. Flavia should run the test for several hours (12h) from now on; Maarten will follow this up.

SRM configuration files: how to configure them? There are some operational issues; more documentation is needed. Some confusion about the space token mapping.

Operations team will define the tools and the behavior and the SRM team will provide the implementation and feedback. Jan will start this process.

Reserve space token discussion. The definition is not clear and also the requirements are not really clear.

Giuseppe will prepare an SRM presentation for next week to get everybody up to speed on the details.

The new release was painful to make, as Sebastien and Giuseppe tried to automate it as much as possible. Need some escalation with the Oracle DB team for better automation. Sebastien will send a message to the DB team to request the necessary changes on their side.

Sebastien applied a hotfix to the existing DBs to cope with the DB problems from the weekend. Reminder to everybody: changes need to be documented and an email sent to the mailing list!

There is now a new cleanup script which covers the one remaining problem. All other problems should have been fixed.

Decision to move to 2.1.3-19 and incorporate the hot fixes, and also to release a hotfix for 2.1.3-15 for the outside institutes.

Nothing has been tested for 2.1.3-18 yet.

Need to check the DB state regularly to understand where problems arise, this is a monitoring issue.

Too many sub-requests in the tables triggered the problem over the weekend. Sebastien has an idea about reducing this number; this will go into 2.1.4.

Sub-requests per se and how to deal with them is an issue; needs further discussion.

Problems with blocked files in Castor are database-load dependent; in the production environment this is ‘solved’.

We need to have migration policies per pool and not only per stager as is the case today; something for 2.1.4.

The plan is still to have 2.1.4 production ready by the first of September.


09 July 2007

Present: Jan, Ignacio, Miguel, Eric, Bernd, Dennis

Heavy database problems over the weekend, with CMS on Friday and PUBLIC on Sunday. This stopped the COMPASS data taking for 2h.

The problem was an accumulation of sub-requests, the cleanup not working, and Oracle changing the execution plan of some commands. The DB response was very, very slow.

The cleaning will be fixed by Sebastien, Dennis and Miguel this afternoon. Nilo will ‘fix’ the execution plan of the Castor commands when he is back on Thursday. There are about 400 different Castor commands in Oracle.

Need also more monitoring in the DB area to develop more alarms.

There is no way to do throttling at this level in the Oracle DB to avoid this kind of problem; it must be done at the application level.

There is a lack of documentation in the SRM area, Giuseppe should be involved.

Alarm issues: the probe in SLS seems not to have been triggered by the problems on the weekend; more alarms need to be deployed, including also more details from the DB.

There are heavy direct user connections from lxplus (COMPASS) with Python to the DB, affecting the overall performance. Jan and Miguel will contact the user and propose a different solution.

The name server problems from last week are not yet fully understood. Before restarting the xrootd tests which showed a strong correlation with the problems, Miguel will setup an additional test name server head node. The xrootd tests will use the new head node.


05 July 2007

Present : German, Jan, Ignacio, Sebastien, Dennis, Bernd, Maarten

Stager_qry option discussion: there is no time to change this in 2.1.3-18. Sebastien will send Miguel's proposal again to the sites and ask for feedback by early next week. Sebastien has already started the implementation of new options and output formats for 2.1.4, based on earlier discussions.

The new release will be built today.

Logfile rotation: German has a new proposal which leaves the default implementation inside the Castor releases untouched and concentrates on the CERN NCM components. He will implement something before the end of July, so that we can dynamically change the policy at any time. The first goal should be to keep logfiles for 9-12 months. There is already a Savannah ticket for the archiving of these logfiles, which needs higher priority.

There was a name server problem yesterday which was most likely caused by the xrootd tests done by Andreas. It looks like all the threads handling name server requests were blocked after some time. Jan and Sebastien will look into the logfiles to do some further investigations and then send the results to Jean-Philippe.

SRM2 tests were restarted by Flavia yesterday, but there are no results yet.


04 July 2007

Present : Jan, Miguel, Sebastien, Dennis, Maarten, Bernd, Ignacio

Nilo thinks he can also host the 5 DLF DBs on the current NAS servers, but this is a very dense packing on the hardware. He proposes to start with one heavily used DLF first and, if there are no problems, a few weeks later the others could be moved.

Small discussion about the Stager_qry output definition; the decision is to wait for 2.1.4.

The new release (2.1.3-18) will be built this afternoon and then moved to c2test and/or to ITDC for further testing. Thus by Monday it could be ready for deployment. The upgrade will take about 1h, during which the corresponding instance will not be available. Jan will contact the experiments for suitable upgrade dates; the upgrades should be done within the next 10 days. We should check what the timeout+retry situation is with the different access methods (RFIO, FTS, GridFTP, SRM, etc.); these should be set to values which enable them to cope with 1-2 Castor downtimes.
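
A minimal sketch of the kind of client-side retry sizing meant here, assuming the retry window has to cover a Castor intervention on the order of 1-2 hours; the transfer callable and the error type are placeholders, not a real RFIO/FTS/GridFTP API:

# Retry a transfer with exponential backoff; the total retry window is sized
# so that a Castor downtime of the assumed length does not fail the client.
import time

def transfer_with_retry(do_transfer, max_window_s=2 * 3600, initial_delay_s=60):
    """Call do_transfer() until it succeeds or the retry window is exhausted."""
    deadline = time.time() + max_window_s
    delay = initial_delay_s
    while True:
        try:
            return do_transfer()
        except IOError:                        # placeholder for a transfer error
            if time.time() + delay > deadline:
                raise                          # give up: downtime longer than assumed
            time.sleep(delay)
            delay = min(delay * 2, 15 * 60)    # cap the backoff at 15 minutes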

The pre-staging implementation needs quite some further discussion. What is the correct way for the experiments? This was triggered by a discussion/complaint from CMS.

We will have a dedicated meeting next week to specify the details for the next version (2.1.4). A possible time frame is an internal release at the end of July, followed by heavy testing until the end of August.

After the morning meeting we had an SRM2 planning meeting with Flavia and Shaun. The idea is to do heavy focused testing during the next 10 days with fast feedback from the developers and operations. The setup is defined and ready.

The xrootd-Castor situation looks very good; the latest tests were successful and now large and long stress tests have been started. Mails from Andreas:


F.Y.I: I have started the test around 11:30.

We have made a summary monitoring page for the i/o on all machines:

http://pcalimonitor.cern.ch/display?page=xrootd/diskio/combined

You can see the i/o of xrootd & the local disk i/o.

Unless you tell me to stop or it breaks it will run until tomorrow at noon.

Cheers Andreas.

PS: there was one illegal value sent by one host at the beginning of the test for the disk write - ignore it - it will be fixed in the next monitoring package.


Hi, I have put the results of today's tests at: https://twiki.cern.ch/twiki/bin/view/FIOgroup/HowtoTestXroot

Short summary:

14 disk servers:

  • read: 1.1 GB/s
  • write: 1.1 GB/s
  • read-write: read 900 MB/s, write 650 MB/s, sum 1550 MB/s

No failures, no hangups. Test duration 1 hour / 30 minutes.

Next plan: - upgrade the monitoring package to add the castor migration/staging bandwidth to the plots

- run a long duration write-test

After: - try & debug if necessary staging behaviour (file preparation etc.)

Cheers Andreas.


02. July 2007

Present : Jan, Miguel, Bernd, Maarten, Dennis, German, Sebastien

Some rogue user activity flooded the castor1 public stager and the LHCb stager. The corresponding users have been blocked from accessing the stagers, and in the LHCb case the LSF parameters have been changed to limit the influence of single users. How to protect against a zillion small file requests?

Compass is happy with their setup and will now also move their full production users.

There was a hiccup with the migration in CMS, the basic fix for this incident will be in the next release.

T0perm has been moved back to 14 migration streams, as a backlog was accumulating and the test with a reduced number of streams failed (no performance change).

The plan is still to have a checkpoint on Wednesday for the new release.

There have also been tape fixes put into CVS by Arne which should be incorporated.


26. June 2007

Present : Ignacio, Jan, Miguel, Bernd, Olof, German, Sebastien, Dennis

The preparations for the ALICE upgrade to 2.1.3-15 have started. Jan has checked and updated the upgrade procedures. The DB upgrade has already been tested and Ignacio will do the clean-up later in the morning.

Compass is running on PUBLIC with increased speed; they have moved from a 1 Gbit link to a 10 Gbit link. They are still using a combination of castor1 and castor2, which will be fixed now, after Jan clarified the details with Compass. There are still data speed fluctuations, as the migration policy is not yet tuned for Compass.

There is an NA48 backlog of files, probably small files. 2 more disk servers have been added and the disk server slot numbers moved from 350 to 200 slots. Probably some more tuning is necessary.

Atlas is running again with export and they reached 900 MB/s for several hours; the total output from t0perm reached 2.3 GB/s. With the spread of disk servers over two main switches this is about the limit we can reach.

There have been recall problems at the 5% level in LHCb. This is due to an ‘ancient’ killer process for stuck recall processes: the original error has been fixed, so the killer was then removing good processes. The killer process is now disabled.

There was some discussion about durable pools and the Castor decision to back them up on tape anyway, to ease the operational issues in case of disk server problems. Bernd remarked that we need some more information about file system unavailability (disk servers) as a prerequisite to discuss with the experiments the usage of durable pools and what to do when data becomes unavailable.

Miguel will reduce the number of migration streams in t0perm to 8 and also reduce the number of dedicated drives to 8, as the fluctuations in the tape system are still very large.

What kind of Castor alarms exist? Miguel will send a pointer to a twiki.

Jan added 100 TB to the CMS pools and also to PUBLIC.

Currently we have 430 disk servers in all pools, 2 PB effective space.

Still missing 86 disk servers in production.


25. June 2007

Present : German, Jan, Sebastien, Dennis, Ignacio, Nilo, Bernd, Olof

Nilo noticed stager DB problems over the weekend, as file systems on the DB servers were filling up their disk space with error messages. This is related to the garbage collection; some changes in 2.1.3-15 obviously caused this, but it seems to have its origin in an Oracle bug. A bug report has been filed. Nilo and Sebastien will ‘fix’ the problem today. Essentially the garbage collection is not working on any of the new stagers.

Nilo has been asked to find out whether it is possible to host the 5 DLF systems also on the new NAS systems, as we have had several Oracle corruption cases. The DLF DBs were put onto new hardware (small disk servers) during the Castor upgrades and these systems have fewer errors than the previous ones.

Problems with NA48; the disk server allocation needs to be upgraded.

Atlas restarted the export and they are reaching high throughput. As the disk servers in t0perm are spread essentially over only two switches (ip226 and ip227), we start to reach network limits. One possibility is to spread the servers over more switches. Bernd will arrange with CS to change the blocking factor for a certain number of switches from 2.4 to 1.2; this will require some IP renumbering.

Jan will add more disk space to CMS to get them much closer to their allocation; the others essentially have their final configuration. CMS needs more space in the second half of June for their planned pre-CSA07 exercise. Expect to need more for the CMS challenge at the end of July (+150 TB).

There is still a lack of old logfiles for debugging purposes. The log rotation is much too short. German will look into this and into how to increase these numbers from days to months. There are frequent reports about inconsistencies between the stager and the name server (user perception).

The current development focus is on fixing the remaining disk1tape0 problems.

Sebastien and Dennis reported that 2.1.3-17 is out and has been tested with the existing test suites. Dennis also updated the upgrade documentation.

Still RPM problems with xrootd; Andreas is looking into this.

SRM testing needs to be followed up.


22. June 2007

Present : Olof, Bernd, Dennis, Jan, Ignacio, German, Sebastien, Maarten

The CMS upgrade went fine. It took a bit longer due to problems with the DLF database; data corruptions could not be repaired and the DB needed to be dropped. Only a few users, who had not been informed inside CMS, complained.

All Elonex disk servers had their controller firmware updated. Also all SLC4 servers were rebooted to receive the latest kernel and network parameter updates.

Stageconf and stagemap configuration issues broke GridFTP: a bug fix for a durable pool issue was introduced which had side effects. Fixed late on Wednesday evening. SAM also stopped working and GridView was not reporting correctly, as it was using old values.

Nilo and Eric will be invited for Monday to have a discussion about the DLF hardware.

Lemon is not working on the production system. A ‘bug’ in Oracle causes high load and the node monitoring values cannot be entered into the database anymore. A RAC instance is already prepared but not yet in production. This instance already receives the Lemon metrics and does not suffer from the high load. Dennis will switch the web-page Lemon interface to this instance today. Miro will be back on Monday to do a complete switch to the new instance.

ALICE has given the go-ahead for the Castor upgrade on next Wednesday. Jan will send the corresponding warning messages and prepare the move.

The ATLAS MigHunter had a problem yesterday and for several hours no tape migration took place. A new alarm will be implemented from the operations side and Olof has filed a bug report. Will be fixed in the next release.


20. June 2007

Present : Olof, German, Ignacio, Bernd, Miguel, Maarten

Miguel continued to clean up after the LHCb upgrade. Several disk servers were ‘imported’ with file systems filled to more than 99% (deficiencies from the old system). The Castor system puts them automatically into draining, because this caused an alarm. Draining mode means that files are read-only, all writes are stopped and each access causes a disk-to-disk copy. This caused heavy load on the system and started to ‘exclude’ user accesses, as there is no priority scheme. Miguel started to use the LSF mechanisms to change that. First the number of jobs per user on a disk server was limited to 10 (with 300 in total on the default pool and 150 on lhcbdata). This did not have the desired effect, so finally the number of concurrent jobs per user was limited to 50.

  • the disk-to-disk copy scheduling takes only the destination into account
  • there should be another disk server state called ‘read-only’ which would avoid the automatic disk-to-disk copy procedure

Tentatively schedule ALICE for next Wednesday. Further discussions with ALICE.

Preparations are ongoing for tomorrow's upgrade of the CMS stager instance. The CMS database had already been extracted last Friday and the procedures have been successfully tested. Ignacio will do the additional clean-up today.

Problem with the state synchronization between the name server and the stager; this causes regular operational problems. Users think they have written files while these are no longer on the stager, and there are wrong entries in the name server (consistency problems).


19. June 2007

Castor extended morning meeting with tape issues 19 June 2007

Present : Miguel, German, Maarten, Bernd, Olof, Ignacio, Ulrich

Tape team : Tim, Vlado, Gordon, Charles, Arne

The LHCb migration went fine. The DB had to be recalled once more, because there were some oddities seen in the number of sub-request entries in the DB. Miguel and Dennis expected many more, but it turned out that 95% had already been deleted by Ignacio running the cleaning procedure.

Proposal from Miguel to put the 5 old DLF DBs on one server to be able to look at older monitoring information. He should go ahead with this.

Later during the day LHCb finally responded with their choice on how to spread their allocated disk space (140 TB) between the disk-only and the disk-tape pools.

Bernd will talk to ALICE this week to discuss the Castor planning with them (migration, data challenges, xrootd, etc.). The team would like to move ALICE next week to complete the upgrade cycle; Bernd wants a few clarifications with ALICE before that.

Olof is on holiday from the 26th for two weeks.

Any changes to the schema of the DBs should first be tested on the various test systems we have. The information should be logged and propagated via email and incorporated into the corresponding upgrade procedures.

A small number of LSF jobs are lost in the new instance. A first investigation showed that they were interrupted from the user side, correlated with long tape queues. CMS has a high number; Atlas has only 3 out of 5100 jobs lost.

The meeting was then extended to include the tape team with Tim Bell. The team brought forward the problems and issues they would like to see fixed to ease operation and improve performance.

What follows is a list of items, not prioritized (many thanks to German for these minutes) :

Top 5

  • Hugo's list
a) easy fixes
• Information on tape drives - restricted characters and lengths. Olof: sounds trivial but isn't, as many changes are ahead (client/server changes). Who can do it? Based on historical knowledge... VMGR knowledge is needed but nobody has this knowledge; not so simple.
b) less easy fixes
• VDQM queue prioritisation and mounting - recall policies; prioritisation. Olof: this is the stager part; the weighting is done there. Olof: the new VDQM should be improved rather than the existing one.
• VDQM changes: not before 2008
• Olof: recall policies are easy to enable, just changing a few lines in the stager; enable calling a script which implements the policy.
• Bernd: enabling them should be done asap; fine-tuning is different. Something simple like "either the file is too old or more than N files or more than N GB".

  • Gordon:
• rtcopyd communication problems. Gordon: "Half or more of all the requests for tape fail". However, this should improve with the new stager. Can we get detailed statistics? Not easy as long as we don't have all stagers (except maybe ALICE) migrated; from Thursday on.

  • Charles:
• Repack: no working repack. Preordering of recalls; performance; error propagation issues (why are things failing).
• Physical file position (#26030) - recall optimisation - changes to name server, rtcopy, measure tape position.
  o Add two (or more) new fields in the tape segment part of the ns to keep track of the physical file position (only for full tapes); do not yet extend the ns API. Have a script which, externally to the existing Castor ns software, populates this information for tapes out of DLF information; this would be incomplete but, in addition to estimation (based on logical block ID??), can help for the recall policies. But recall policies require ns extensions...

  • Vlado:
• CUPV (#16901). Not a useful security function... not easy; comes after strong authentication and then after VOMS integration. Tim: should rather be using lists of nodes/users (roles). Doesn't need to be as fine-grained an access control as now.
  o Tim: the person responsible for CUPV should provide a procedure on how to add tape admins.
  o Tim to update the Savannah ticket with what he wants/needs.
• message daemon: single point of failure. The message daemon is only needed/used by the tape software. Olof doesn't know many details. Originally used for operators etc. Arne should look into this. The tape team (TSI) will look into stopping it.

  • Tim:
• An additional tape status is needed, beyond enabled/disabled. Scenarios: a) a tape is temporarily unreadable - stuck in a drive; need some way of telling users to wait rather than just disabling the tape. b) load balancing tapes from one robot to another needs an additional 'standby' state while tapes are physically moved, rather than failing. Some kind of 'short term unavailable, let's wait and retry later'. Is this bug #21688? Needs changes in VMGR according to Olof. Who knows about VMGR? Jean-Philippe, who wrote it. Maybe Ben?

  • Miguel:
• Load balancing between robots in the different robot instances. Spreading of writing over different instances. There is no smart decision when selecting the tape; it doesn't take the queues into account. Paco did a development for adding a weight (on tapes?), but it was (probably) never integrated/tested; Paco's developments were never put into production. Should at least be able to randomize according to building.
• Problems: max number of tapes per service class; max=10 always goes to take 10 tapes for 10 files. MigHunter - Olof has been maintaining this.
• Another MigHunter improvement: migration policies. Need a combination of policies such as 'start migration if >500 GB OR if more than X files'. Also 'what tape pools (e.g. IBM ones but not STK) to use for small files'. This needs to be defined per service class (a minimal sketch of such a policy script follows this list).
• Tim: we need a file-size based policy per experiment, in particular for ATLAS. We would need two tape pools (IBM, STK), but we anyway already have two pools for ATLAS - reshuffle them. MigHunter calls the policy script. Olof: the policy can be implemented quickly but the tape pools need to be reshuffled (e.g. for ATLAS).
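
As an illustration of the kind of policy script discussed above (MigHunter, and similarly the recall policies, would call an external script), here is a minimal sketch; the thresholds and the function signature are assumptions for illustration, not the real Castor hook:

# Decide whether to start a migration (or recall) stream for a service class:
# enough data accumulated, enough files, or data waiting for too long.
def start_migration(total_bytes, n_files, oldest_file_age_s,
                    min_bytes=500 * 10**9,      # "more than 500 GB"
                    min_files=10000,            # "more than X files" (X assumed here)
                    max_age_s=8 * 3600):        # "file is too old" (8h assumed here)
    return (total_bytes >= min_bytes
            or n_files >= min_files
            or oldest_file_age_s >= max_age_s)

# Example: 200 GB in 300 files, oldest file 10 hours old -> start migrating
print(start_migration(200 * 10**9, 300, 10 * 3600))   # True (age threshold reached)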


14. June 2007

Present : Jan, Miguel, Ignacio, Bernd, Maarten, Sebastien, Dennis

The upgrade of PUBLIC and ATLAS went fine. The procedures were well prepared. There were just two issues with the DB schema moves. In the future all changes which are done on the production system should first be tested on other instances and then be recorded (mail, changelog twiki?). Especially the corresponding upgrade procedures should be modified to cope with the changes. There will be some exceptions (the usual last minute, fire-fighting change on Friday afternoon), but even then the procedures should be followed afterwards.

Post mortem:

  • steps were complete
  • timing estimate was good
  • the preparations last week were good, gave good practice
  • the test of the schema update had some problems, as not all changes were documented
  • LHCb upgrade preparations will start earlier: extract the DB earlier and test the schema update
  • user communication: Jan to update the help desk procedures
  • MigHunter did not restart on PUBLIC; manual configuration, added to the procedures
  • Need to review the instructions for the sysadmins and the operators

The communication inside ATLAS is not working efficiently, as the Castor operations team got quite a few complaints about Castor not working.

ATLAS has not fully restarted yet; need to cross-check with Armin. The export will essentially be stopped during the next few days, while ATLAS is upgrading their DDM/DQ2 system world-wide. This gives us a bit of time to look at the details of the tape migration.

The way to put things into recovery mode (disk servers) has changed and is now fixed.

LHCb upgrade planned for Monday, but some things can already be done today and Friday:

  • Decide on the batch stop or not
  • DB stager copy has already been done, Dennis will test the upgrade procedure
  • cleanup of the DB needed, Ignacio
  • Jan will prepare the LSF7 config

Dennis to look into the cleanup of the 2.1.3 DBs; get rid of leftovers from 2.1.1.

ATLAS DLF response is slow, Nilo and Dennis will check

Bernd will organize a meeting with the tape team on Tuesday to discuss bug fix priorities and additional features needed.

Miguel and Ignacio to coordinate the CMS move on Thursday, Jan and Dennis are on a training course.

Need to have a meeting to improve the procedures for documentation, DB schema changes, PL/SQL changes and workflows.


07. June 2007

Present: Bernd, Ignacio, Dennis, German

Yesterday morning (9-14) Miguel increased the number of job slots on the disk servers in t0perm from 4 to 8. There was (within the fluctuations) no visible effect.

The RFIO logging problem has been solved by Dennis. In the new release (2.1.3-15) all RFIO activity is logged again into the standard logs.

The new release was made yesterday afternoon and deployed onto c2test and ITDC. Julia started the test suites yesterday evening. Miguel will install it on c2pps this morning to verify the installation procedures. We should then upgrade c2atlast0 later in the afternoon; this should take about 15 min.

The new release contains the automatic timeout of pending requests, which is configurable per service class and can be dynamically changed from a config file. Currently the default timeout on the clients is set to eternal; this should be reduced in the future.
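
A minimal sketch of what such a per-service-class timeout could look like; the config format, values and request fields are assumptions for illustration, the real mechanism lives inside the stager:

# Expire pending requests according to a per-service-class timeout.
import time

# Hypothetical config, e.g. loaded from a file: service class -> timeout in seconds
# (missing entry or 0 means "never expire").
PENDING_TIMEOUTS = {
    "t0perm": 4 * 3600,
    "default": 24 * 3600,
}

def expire_pending(requests, now=None):
    """Split pending requests into (kept, expired); each request is assumed to be
    a dict with 'svcclass' and 'submitted' (a Unix timestamp)."""
    now = time.time() if now is None else now
    kept, expired = [], []
    for req in requests:
        timeout = PENDING_TIMEOUTS.get(req["svcclass"], 0)
        if timeout and now - req["submitted"] > timeout:
            expired.append(req)   # would be failed back to the client
        else:
            kept.append(req)
    return kept, expired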

Dennis will coordinate with Rosa and Andreas about the xrootd tests on ITDC. He will also establish with the new release the basic throughput figures of the LSF plugin.


06. June 2007

Present : Miguel, Olof, Ignacio, Bernd, Dennis, Sebastien

Tentative planning for next week: move PUBLIC to the new Castor release on Tuesday and move ATLAS on Wednesday (ATLAS agreed to this yesterday).

Miguel will announce this in the GRID meeting, the morning meeting and the CCSR. Olof to coordinate with Compass and NA48 for the Tuesday intervention. Downtime from 8-18.

Plan for today :

Produce a new release (2.1.3-15) which includes about 20 new items (bug fixes and feature enhancements), including fixes in the tape area. Install it on c2pps, ITDC and later on c2atlast0, after running the usual test suites. Continue the large scale testing. After the re-installation of ITDC, Giuseppe will set it up correctly for the SRM2 tests and coordinate that with Flavia. The idea would be to dedicate it to SRM tests for the next 5 days.

ATLAS has restarted the systematic export test: BNL only during the day, adding SARA and Lyon at 16:00 yesterday. The performance increased. The separate site performances aggregated well, but the RMS of the fluctuations increased by a factor 2.

Today in the morning from 9:30 to 12:00 we will increase the number of disk server slots from 4 to 8, to see possible effects on the export and the general performance of the T0 test.

There are no RFIO messages logged into the rfiod logs, which makes debugging a bit hard. Can this be fixed? Dennis will look into this, possibly through the RFIO config files.


01. June 2007

Present : German, Dennis, Bernd, Ignacio, Miguel, Eric

New LSF7 RPM provided by Ulrich, to be deployed on Monday.

Rack power tripped; SRM was down for 3 hours. Just the sysadmins were involved, no Castor operations team intervention was needed. Need to distribute the SRM nodes over more switches and racks.

Two corruptions yesterday on the old atlas stager, quite some time spent by Nilo and Eric to fix it.

First SRM2 tests have been run on the default pool by Flavia and Giuseppe.

Simple client tests run on the default pool.

The distribution of the migration streams over the available dedicated tape drives has improved (intervention from Charles). The performance per drive is in the 40-50 MB/s range with large fluctuations on short time scales. Maybe we have too many migration streams; to be tested on Monday.

The implementation of LSF user shares will be done on Monday, which should also give the possibility to reject requests in general.

The upgrade of ITDC was successful. Dennis will downgrade it again to the version we are currently using in production at CERN and then upgrade it again to 2.1.3-14.

Miguel started to upgrade the pre-production system and will run the test-suite on it today in the morning.

Bernd will run several ‘interference’ tests over the weekend and coordinate with the SRM tests of Flavia.

We will take a decision on Monday whether to upgrade ATLAS at the end of next week.


31. May 2007

Present: Miguel, Bernd, Ignacio, German, Olof

Jan is on holiday for the next 10 days.

Sebastien made a full production release (2.1.3-14) including all release notes and updating procedures. The first update was tested successfully on ITDC. Miguel will try today/tomorrow to upgrade the pre-production system himself, just using the instructions.

Miguel is still trying to fix the default pool in c2atlast0 for the SRM tests.

We will try a test later during the day, where the number of job slots in t0perm will be increased from 4 to 8.

There is an ‘unexpected’ xrootd meeting organized by ATLAS this afternoon to which Sebastien was invited. We decided to ask Andreas to give this talk.

Andreas reported several problems with his tests yesterday. After talking to Andy most of them were fixed. Small scale tests were run where files were successfully written and read via the Castor-xrootd interface. It is in principle possible to read files in the pool which have not been written via xrootd: this is foreseen in xrootd, but one Castor action is missing (Sebastien to look at it, together with Rosa). As the pre-production system will be used during the following days to test the new release, all xrootd tests are stopped until early next week. We will have a look at the situation on Monday and decide on larger scale tests including the new Castor release. We are also having a meeting early next week on how to tackle the authorization issues.

German briefly explained the outcome of the RAL meeting. They will come up with plans during the next month on how to take over the complete SRM2 part and how to organize, set up and operate the Castor testing infrastructure.


30. May 2007

Present : Sebastien, Dennis, Jan, Miguel, Ulrich, Ignacio, Bernd, Olof

There was another GC problem in the CMS stager overnight. The system was down between 1:00 and 9:00 in the morning.

A front-end node of the name server system has died with power supply problems. There are currently many HP systems suffering from these symptoms.

C2atlast0 was upgraded yesterday afternoon by Miguel and Dennis. This took about 1:30 and was simple, even though the release notes are not yet complete. A problem during the upgrade of the disk servers made about half of them unavailable over night. The whole system kept working, just the export rate dropped considerably and of course lots of error messages were created on the ATLAS and Castor side.

Sebastien has now nearly finished the description of the new procedures and release notes which is a prerequisite for an upgrade of any other production stager.

The tape system for the t0perm pool is not yet working correctly; the number of available drives is only 8 out of 16. The problem with the rtcopy buffer sizes has been understood and fixed, so that nodes can now use the maximum available memory for the input file buffering. The two tape servers with 64-bit SLC4 and 8 GB of memory are part of the tape servers for c2atlast0. From the last 2 days of running one can only see an improvement of 10-15% from doubling the memory on the tape servers.

The default pool in c2atlast0 is not yet working, probably a problem with LSF. Miguel and Ulrich are checking this. The idea is that Giuseppe, Shaun and Flavia will use this for the first SRM2.2 tests.

Jan is doing a detailed error monitoring and analysis of the gridftp errors seen during the last 2 days, especially the ones for BNL.

We had a first planning discussion about the upgrade of the production stagers to the new Castor version 2.1.3-xx. We looked at different upgrade strategies: upgrade in one go with a day of downtime for the existing stagers, prepare a new fresh instance of the new stager and slowly move activities over, or a combination of both. The first evaluation of the pros and cons led to the recommendation to plan for a full upgrade in one go.

There are certain ‘constraints’ for the different experiments.

  • Atlas will re-start their MC production next week. The dress rehearsal should start at the end of July
  • CMS plans to run their large scale pre-CSA07 tests in the second half of July
  • The SPS has a long MD planned for mid-June

A tentative schedule could be the following :

  1. Sebastien and Dennis are testing the upgrade procedure today on ITDC, by having several up- and downgrades on this system
  2. tomorrow the PPS system is upgraded. The operations team will be involved in the next upgrades, so that we have some confidence in the procedure until the weekend
  3. decision on Monday to do an upgrade test by the operations team alone, feedback to the developer
  4. from Monday ITDC can be used for tests again (xrootd)
  5. upgrade the Atlas stager at the end of next week (week 23)
  6. upgrade the public stager in the middle of week 24
  7. upgrade the CMS stager in the middle of week 26 (last week in June)


22. May 2007

Present : Olof, Jan, Nilo, Bernd, Miguel, Ignacio, German, Sebastien, Dennis, Ulrich

ATLAS switched on the full reconstruction and merging part of the T0 scheme yesterday evening. The system is running at nominal ATLAS speed, low error rate (0.3%, due to the LSF wrong cpu reporting), some rfcp wait time (simply load related). The export is still only at the 400-500 MB/s level, seems not to be related to the CERN Castor site. ATLAS will in the afternoon increase the numbers of sites to 10.

The next release (2.1.3-13) is now labeled production. Dennis and Sebastien will build this release today and deploy it on ITDC, then run a few large scale tests over night. Tomorrow c2atlast0 will be upgraded. This release can be downloaded by RAL and CNAF, the detailed release notes will be prepared by Sebastien in parallel until next week. RAL has expressed the wish to move to 2.1.3 very soon.

Miguel will take 4 nodes from t0merge and create a new disk1tape0 pool in c2atlast0.

Nilo is preparing the SRM2.2 DB today.

Jan and Giuseppe will prepare the SRM system, so that Flavia can run a first test suite against the new pool in c2atlast0 on Thursday. A more detailed test plan will be prepared for next week (Flavia and Bernd); this will incorporate Castor scalability and stability tests and SRM. Shaun needs to be informed about these developments; he is back in the office tomorrow.

The new pool will also be used to test the interference between pools and how to tackle problems in that area (more LSF queues, shares, etc.).

We have today 4 instances for tests and pre-production (2 * c2test, c2pps, ITDC). During the next 2 weeks the pre-production system will be upgraded to 2.1.3-13, which gives Sebastien the opportunity to test the Castor upgrade procedures. It was used in the past to debug the RAL and CNAF versions, but these are now so different that these tests take place directly at the corresponding site.

LSF CPU problems: Sebastien automatically checked PIM files with a script and found good examples of wrong CPU time reporting. It looks like LSF and not the system; a few more tests to verify.

Jan successfully ran the SRM1 test suite against c2atlast0.

In the future the goal is to make SRM2.2 part of the standard castor test suite, this needs some work from Flavia.


21. May 2007

Present : Nilo, Olof, Ignacio, Miguel, Jan, Bernd, Dennis, German, Sebastien

22 more servers were added on Friday afternoon and the number of drives for migration was moved back from 10 to 16. The maximum tape migration rate for a few hours was about 900 MB/s, while the average was still below 600 MB/s. The export went up to a maximum of 600 MB/s (average 450 MB/s) with up to 8 sites enabled. There were no other software or configuration changes. The export increased steadily over the weekend while the migration fluctuated.

There was only one migration stream per tape server, because of large file sizes.

In addition about 1500 recalls per day were seen in the ATLAS setup (files exported older than 24h).

Plan :

  • Miguel will initiate the real dedication of 16 tape drives with the tape team
  • Miguel will prepare the t0perm pool in c2atlast0
  • Bernd will contact ATLAS so that they can start the full reconstruction scheme
  • Bernd will talk to Flavia about the details of the SRM2.2 tests
  • Bernd will ask Andreas and Rosa for a report on the xrootd-castor status and a presentation to the castor team
  • Sebastien and Dennis will look into the cases where a file system was 100% full over the weekend
  • Ignacio and Giuseppe will clarify the automatic cleaning of the DB once a week
  • Jan will run a SRM1 test suite on c2atlast0
  • Sebastien will inquire with CNAF and RAL about their test suites
  • Jan will separate and move the current SRM2.2 instance

We are aiming for a production level release of 2.1.3-xx for the end of the week, Sebastien is preparing this (documentation, etc.).


16. May 2007

Present : Jan, Ignacio, Bernd, Olof, Miguel, Sebastien, Dennis, Ulrich

Yesterday at noon all 23 disk servers in c2atlast0 were rebooted to enable the new network driver, the 3ware firmware and the new NIC parameters. There seemed to be some improvement in the export, but not striking.

At 18:30 the new Castor release was deployed on the c2atlast0 system. The migration continued well during the upgrade. Afterwards a few disk servers were ‘wrongly’ disabled and a bit later the LSF queue stopped. At ~22:00 the system restarted and the migration speed improved from 10-15 MB/s to >30 MB/s. There are currently 23 disk servers in the Castor pool with 4 slots per disk server enabled. The new load balancing made things better, but it is still not understood why we can’t reach the ‘nominal’ 60 MB/s per tape server. The scheduling and patterns of the export streams are a strong candidate for additional interference.

The 16 tape drives in c2atlast0 are now dedicated, which makes the monitoring of the tape migration performance much easier. Alasdair deployed iptables monitoring on the disk servers, so that we can distinguish the performance of the different streams (tape, export, daq). It also allows distinguishing between different sites. Very useful! It still needs full integration into Lemon.

Still regular gaps in the lemon plots due to problems with the lemon DB.

Plan for today :

  • Miguel will reduce the number of concurrent migration streams from 16 to 10, to see the aggregate performance change.
  • At midday Miguel will add the additional ~20 disk server (all with new drivers and firmware)
  • In the early afternoon we will switch back to 16 drives
  • Bernd will contact Flavia for a detailed test plan for the SRM2.2 tests
  • Jan (+sysadmin) will finish the head node move of ITDC and bring the instance back to life
  • C2public move to the new NAS DB hardware has started

The head node move, COMPASS setup and other operational activities have been severely hampered by instabilities and heavy load on the CDB server. This is a general problem and the plans to improve the hardware and software need to be accelerated.

We received some answers from Platform concerning the ‘strange’ CPU values in LSF, but no ‘solution’ yet. On their request Ulrich has sent logfiles and debug output to Platform.

New LSF7 release from Ulrich, no urgency, to be deployed maybe next week.


15. May 2007

Present: Jan, Miguel, Olof, Ignacio, Dennis, Sebastien, Bernd

The new Castor release 2.1.3-11 was prepared yesterday evening, but not deployed.

ITDC was running with highly compressible files, which caused some problems for the tape operations. The tapes stayed on the drives for a long time (days) and drive cleaning became a problem; this was identified and fixed by Charles yesterday.

CDB was very slow yesterday and caused some delays for the operation teams.

Lemon failed this morning for several hours.

Olof fixed a configuration issue on the 64bit tape servers, which seems to work now (first stress tests passed successfully). We can now start to use larger memory machines which should improve the tape writing performance (later during the month).

Miguel will ask Charles to dedicate 12 tape drives to the c2atlast0 pool. This will ease the monitoring and understanding of the tape migration considerably. They should be available during the afternoon.

The c2atlas disk servers have the new 3ware firmware (ex_07_1 nodes), the increased ring-buffer configuration entries and a new e1000 network driver (nicely packaged by the Linux team) ready. This will be activated by a staggered reboot of all 23 disk servers. After running for a few hours to see any performance changes, the new Castor release will be installed on c2atlast0 (with improved load balancing). After running for a few more hours Miguel will add about 20 extra disk servers to the pool. This should run over the long weekend.

RAL reported three problems during the last few days (2 bugs and a config issue). All were fixed, but one required a new release which was delivered yesterday evening. Now RAL has started testing.

Changing the load balancing algorithm can be done quickly on ITDC , but would require a full release on a production system. The parameter values are already externalized in a config file. Sebastien and Dennis will change the code so that there can be independent values and algorithms per pool (service class).

Still waiting for feedback from Platform on the ‘strange’ CPU values in LSF.

Sebastien is now focusing the development effort on a feature which is essential for the disk1tape0 area , but also enables throttling in general : being able to kill requests in Castor transparently with correct feedback to the clients.

No answer yet from LHCb about their durable space requirements this year.


14. May 2007

Present : Bernd, Olof, Sebastien, Miguel, Dennis, Jan, Ignacio

The debugging/tuning of the load balancing is still ongoing. There was no new release on Friday. ITDC was running with an improved version (monitoring and algorithm). The overall performance has improved, but certain disk servers stopped being used; it is not yet understood why. The system was running with 600 clients and the disk servers were restricted to 3 slots and later 6 slots only.

It would be good to keep certain daily lemon plots available for longer (e.g. the network in+out plots), but that is difficult from the lemon point of view. Some more thoughts necessary.

The putDone problem, which was a showstopper for a 2.1.3-xx production release, was solved by Giulia.

Plan for today :

  • Fix the problem with the not-used disk servers
  • Make a new release and deploy on c2atlas
  • Continue running over night and evaluate the improvements
  • RAL has problems with their release and need help
  • Jan has to prepare COMPASS with some urgency

Plan for tomorrow

  • Add more disk servers to c2atlast0
  • Jan to check with Tim on how much of the network fixes (new drivers, increased ring-buffers) will be available for the disk servers; plan to reboot the systems correspondingly
  • The ITDC head-nodes will be moved from the critical area to the ‘normal’ area, which is a good test to see how a running system behaves under interruptions

A test with DLF was actually done already last Wednesday, where the DLF DB was switched off for several hours and the systems still continued without problems.


9. May 2007

Present : Jan, Miguel, Sebastien, Olof, Ignacio, Bernd, Dennis, Ulrich, Nilo

Problems with the main SRM 1 server affected all stagers; no SRM commands got through. Jan is looking into this.

Miguel tried to change the GC frequency from 5 to 25 min to test whether this could give better performance for the migration. This failed for unknown reasons: the GC did not restart and overnight the ATLAS input data rate dropped to 0.

Dennis has started to change the monitoring scheme, so that one can gather and use more detailed information per stream as input to the load balancing algorithm. This is still under test and needs to be tried on ITDC.

Plan for today :

  • Fix the GC problem on c2atlast0 (Miguel)
  • Add a little more randomness to the file system selection (Sebastien) ; before lunch time
  • Add 6 more tape streams for some time (Miguel); in the afternoon
  • Prepare more disk servers to be added to c2atlast0 (Miguel); for Monday

The problem of ‘putDone’ not updating the file size at the end of the operation will cause large problems in production: migration works, but no recall is possible and lots of manual intervention is necessary. Giulia will fix this with high priority.

The accumulation of subrequests is fixed and tested and can go into the next 2.1.3-xx release

The public instance will move next Wednesday to the new DB hardware.

Jan will setup COMPASS now with high priority.

Ulrich has contacted Platform Computing to get information about the problem of LSF jobs in Castor getting killed due to a nonsense value in the used CPU time (64 bit issue ?!).


7. May 2007

Present : Jan, Miguel, Olof, Bernd , Dennis, German, Ignacio, Ulrich

C2atlast0 has been running the basic ATLAS T0 test for 4 days; it creates RAW data, ESD and AOD data, exports them and writes them to tape. The file sizes are 3-4 GB for all of them. There is still a problem with the distribution of the migration streams over the disk servers. The allocated 12 migration streams should easily cope with the 350 MB/s input rate, but are only running at about 100 MB/s, while the total number of streams is rather well distributed. Sebastien is looking into this.

The large files also mean that the current tape server setups are not well adapted to this. They are still running SLC3 and have only 4 GB of memory, thus they can’t read several files in parallel to improve the overall speed.

The tape software is not yet 64bit certified.

We are not able to monitor the tape access per pool. This needs to be fixed soon. Miguel will look into this. Miguel and Alasdair will fix the lemon sensor to put the pool tape data into lemon.

Before increasing the ATLAS complexity we need to fix the load balancing issue.

The problem of dropped network packets has been understood and the cure is an update of the GB driver and a configuration change (increase the number of ring-buffer entries from 256 to 4k). This needs to be deployed on the disk servers sometime during the next week(s).
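For reference, a minimal Python sketch of how such a ring-buffer change could be applied and verified on a disk server, assuming the interface is eth0, ethtool is available and the script runs as root; the real deployment will of course go through the standard driver/configuration channels.

    # Minimal sketch (assumptions: interface "eth0", ethtool installed, run as root).
    import subprocess

    IFACE = "eth0"
    TARGET_RX = 4096                      # raise the RX ring from the default 256 to 4k

    def current_rx_ring(iface=IFACE):
        """Parse 'ethtool -g' output; the last 'RX:' line is the current setting."""
        out = subprocess.run(["ethtool", "-g", iface],
                             capture_output=True, text=True, check=True).stdout
        rx_values = [int(l.split()[-1]) for l in out.splitlines()
                     if l.strip().startswith("RX:")]
        return rx_values[-1]

    def enlarge_rx_ring(iface=IFACE, rx=TARGET_RX):
        """Only touch the NIC if the ring is still at the small default."""
        if current_rx_ring(iface) < rx:
            subprocess.run(["ethtool", "-G", iface, "rx", str(rx)], check=True)

    if __name__ == "__main__":
        enlarge_rx_ring()
        print("RX ring is now", current_rx_ring())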

The DB clean-up procedure was not run every week, because it looked like files which are currently in use were also ‘cleaned’, which ‘destroys’ them. Ignacio will take the recipes from Sebastien and implement the regular cleaning procedure, so that we follow the plan of regular DB cleaning every Wednesday.

CMS needs a clean-up right now. CMS is stuck in one pool, looks like GC problems.

Put-done operation does not update file size in the name server.

The Gridftp malloc errors are understood. If there is a problem (a stuck transfer) on the destination site, GridFTP stops, but before that it tries to load the file into memory, which creates memory problems for large files. Jan will implement a memory limit of 50 MB for GridFTP.
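As an illustration only (the mechanism Jan will actually use is not recorded here), a cap of this kind can be expressed as an address-space limit on the transfer process; a minimal Python sketch:

    # Minimal sketch (assumption: the cap is applied as an address-space rlimit on
    # the child process; the real GridFTP setup may use a different mechanism).
    import resource
    import subprocess

    LIMIT_BYTES = 50 * 1024 * 1024        # the 50 MB cap mentioned above

    def _cap_memory():
        # Limit the child's address space so oversized in-memory buffers fail early
        resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))

    def run_transfer(cmd):
        """Run a (hypothetical) transfer command with the memory cap applied."""
        return subprocess.run(cmd, preexec_fn=_cap_memory).returncode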

LHCb is currently being upgraded to the new NAS hardware and DLF moved to a small disk server.

LSF is killing jobs (in versions 6 and 7) because they run into strange CPU time limits; this explains all the errors ATLAS is currently seeing.

Jan and Miguel will start to remove disk servers from the ‘old’ ATLAS stager to be put next week into c2atlast0.

We are still waiting for an answer from LHCb to size their disk1 pool.

C2public will be moved to the new DB hardware next week, before NA48 and COMPASS start their SPS runs.


3. May 2007

Present : Olof, Jan, Miguel, Sebastien, Ignacio, Dennis, German, Nilo

c2atlast0 status: Atlas T0 tests started yesterday around 17:50. A number of issues have been reported by Armin:

  • SRM misconfiguration: Jan - SRM stagetype.conf for atlas was not put in, no functionality tests done. Atlas to re-verify
  • Migration was not working. Two small misconfigurations - the # of configured drives was 0 (despite the setup procedure being followed correctly by JvE; he is trying to recreate this problem with a dummy svcclass), and the CUPV permissions were wrong (required root entry missing). Both problems have been corrected and migrations started before 9AM. During this time a bug (already in Savannah) was causing inconsistent behaviour when selecting files for migration while no stream is available (no reselection).
  • Migration throughput: Miguel would expect 700MB/s, but we have right now ~400-500MB/s peak values. Yesterday an average of 33MB/s per tape server was reached without any other activity going on, which is not good. Load balancing fs-level selection granularity may imply that some disk servers are not used. Still we shouldn't see 6 streams per box (note: rfiod daemons don't count as streams if data transfer hasn't started yet so there might be an overlap effect). Should reach O(60MB/s) with same file sizes and same drives (as seen in past ALICE tests). Migrations won't be efficient if there is an unequal distribution across disk servers. However, yesterday's tests were not conclusive - there was a relatively good fs spreading across disk servers. Miguel will try to reproduce tests done in the past for ALICE.
  • Sebastien will change the equation parameters to take into account the load on disk servers and the free disk space (as a second-order parameter) to try to improve the load balancing. A medium-term improvement is to randomize the file system selection across equal candidates (currently, an implicitly sorted list is used); see the sketch after this list. Another (longer term) improvement would be to split the LSF plugin into two phases, as recommended by the LSF developers: a first phase filters out irrelevant scheduling targets and a second phase performs a finer-grained selection.
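A minimal Python sketch of the randomized-selection idea, with an illustrative score function (the weights and field names are assumptions, not the actual CASTOR equation):

    # Minimal sketch: lowest score wins, ties broken at random instead of list order.
    import random

    def select_filesystem(filesystems, w_load=1.0, w_free=0.1):
        def score(fs):
            # first-order term: running streams; second-order term: free space
            return w_load * fs["streams"] - w_free * fs["free_fraction"]
        best = min(score(fs) for fs in filesystems)
        candidates = [fs for fs in filesystems if score(fs) == best]
        return random.choice(candidates)      # randomize among equal candidates

    pool = [{"name": "fs1", "streams": 2, "free_fraction": 0.4},
            {"name": "fs2", "streams": 2, "free_fraction": 0.4},
            {"name": "fs3", "streams": 5, "free_fraction": 0.9}]
    print(select_filesystem(pool)["name"])    # fs1 or fs2, picked at random

The point is only the tie-breaking: when several file systems score equally well, choosing at random avoids always hitting the head of the implicitly sorted list.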

Atlast0 Next steps:

  • Leave it running with the current configuration, until equation changes have been verified on ITDC.

ITDC status:

  • Dennis reports that running the standard (600 clients) test shows poor migration performance - only 80MB/s.
  • Nilo: 100% CPU load on stager DB. He also reports that a filesystem has become full with tracefiles.
  • Both problems need to be investigated further.

AOB:

  • The multiplicity of test instances (for Castor functional tests, SRM2, Repack2, XROOTD, ATLAS etc.) is causing an important operational burden: Jan: We need a plan of what instances are allocated to what tests; to be run by whom, for how long, what configuration is needed (Castor version, disk servers, tape servers, etc). Instances should be reused whenever feasible as we cannot afford having dedicated instances for each test case.


2. May 2007

Present : Olof, Jan, Miguel, Sebastien, Bernd, Ulrich, Ignacio, Dennis, German

The latest release has been installed (2.1.3-10) on ITDC and c2atlast0. A bug fix needed to be deployed before the system could start.

ITDC configuration: 12 tape migration streams, 15 disk servers, 50 slots per disk server
C2atlast0 configuration: 12 tape migration streams, 23 disk servers, 10 slots per disk server

We need to urgently find a solution to the fact that we are unable to easily disentangle the migration performance from the client read performance : monitoring issue.

The memory leak in the stager is now ‘fixed’ (6 MB in two days), not a focus any more.

Migration was running very slowly, only 20 MB/s aggregate.

Lots of strange file states in ITDC, might explain part of the bad performance.

Heart-beat problem solved; rmmaster is not taking disk servers out as early as before. No crashes.

C2atlast0 test started yesterday late evening and filled the disk servers quickly.

Distribution of jobs over the disk servers looks okay, just one exception

Migration was not working correctly, there is obviously a problem with the correct selection of file systems for new streams.

Sebastien will change the policies for the file system selection during the morning

There are still 100% full file systems, but this is understood. We actually used the ATLAS ‘production’ settings of the GC (delete only files older than 24h).

Plans :

  • Clean ITDC completely , change the balancing policy and restart the tests

  • On c2atlast0, change the balancing policy, reduce the file system space threshold by 5% and see how the still-running test recovers

If this is okay (checkpoint at 14:00), prepare c2atlas for a direct ATLAS T0 test.

There is a large backlog of tickets and problems on the mailing list. Olof estimates that cleaning this up needs a full-time effort for the next 2-3 weeks. We need some prioritization here.


30. April 2007

Present : Nilo, Olof, Ignacio, Ulrich, Jan, Bernd, Dennis, Sebastien

Jan reported on the detailed progress of testing functionality on c2atlast0 with the latest release: moving disk servers worked, tape pools are ok, the migrator is okay and the GC worked. Difficulties: rmnode crashes; Dennis identified the problems and a fix is provided for the next release. One major difference between ITDC and c2atlast0 is that all disk servers in the former instance are on SLC3 and in the latter on SLC4. The automatic recovery procedure from Miguel is in production.

ITDC

Started large scale tests on ITDC on Friday afternoon. First the GC was not working well, causing asymmetries in the space distribution. This was fixed and the time interval for the GC was also reduced to 1 min. The migration started with 12 streams enabled. No input yet, just to see how the migration behaves. Tape streams are running at about 40 MB/s. Some problems with tape pools filling up on Friday night were fixed by Olof; it is not really clear where this came from.

All disk servers were at 50% space usage on Saturday. Dennis started a full test with 600 clients which write a 1GB file and read it back immediately afterwards, plus 12 migration streams running. During the filling of the file systems the 15 disk servers were running at 800 MB/s input and 1 GB/s output speed. As the tape speed is limited and the input speed much larger than that, the system moved into equilibrium after a few hours. Input rate was about 300 MB/s and output (client read + migration) rate about 600 MB/s.

The general problem of not being able even to read a file from a full disk pool has been solved.

The GC worked: usage never exceeded 95%, but sometimes breached the 90% level (scheduling + GC details to be tuned).

There is still a small memory leak in the stager, 15 MB/day.

The test ran successfully for 36h before it was deliberately stopped this morning.

Nilo changed a parameter in the DB to improve the handling of requests. The DB was successfully restarted while the migration was still running.

Looking into more detailed plots from Dennis, a few oddities were discovered. Some disk servers dropped out of production during the night. There is still one case where the space calculation went wrong.

During moderately high load on the disk servers the heart-beat stopped and caused temporarily file systems to be taken out of the load balancing. The heart-beat parameters need to be improved, further investigation necessary.

Plan for today and tomorrow :

  • Sebastien will produce a new release
  • Dennis will continue further tests on ITDC
  • During the late afternoon Jan will install the release on ITDC and c2atlast0
  • The 600 client write+read test + 12 stream migration will be started on c2atlast0
  • Further queue tests on ITDC in parallel

There was a problem with the CMS GC last week (it stopped working). This was fixed by Giuseppe, but it is not clear what the fix was.

LHCb will come up with a size for their disk1tape0 pool for the rest of the year.

As we are moving all disk servers to SLC4, one has to take special care for the GridFTP software, as it has to run in SLC3 compatibility mode.

There is a special version of ROOTD running on the LHCb disk servers (fixes an AFS home directory problem). We wait for further changes (GSI security) before this is deployed on a wider scale.


26. April 2007

Present : Jan, Miguel, Olof, Dennis, Sebastien, German, Nilo, Ignacio

Tests on ITDC involving tapes (Dennis):

  • The results were not good. All the disk servers are full above their limit, some at 99% (exceeding the 95% watermark). One file was reported which was gc'd while being in CANBEMIGR status. Migration was also failing. Olof: only a subset of disk servers (59xx) were used in migration but not the 57xx ones. Sebastien/Dennis to check. After the meeting, Sebastien provided the following status update:


As expected few, already known, causes. The full story is :
  - the update to the stager DB for monitoring info did not work. We
realized that yesterday but could not fix it because of an internal dev
meeting
  - the consequence is that the GC never started and the filesystem
became full.
  - we had 50 jobs of 1.2G per machine. These are 60G that can be
written after a filesystem is declared full which is >
minAllowedFreeSpace (5% of 1T = 50G). So Olof's prediction was right
that we don't respect the hard limit here. The fix is to take streams
into account in the equation as mentioned this morning.

Another point is that the migrations did not work (PL/SQL error, same as
for monitoring update). That also is a reason for filling the
filesystems since all files are CANBEMIGR.

Finally, NO CANBEMIGR FILE WAS GCED ! This is just a bad interpretation
of the logs where a file was overwritten and the gc dropped the old
(invalid) copy. The new one was properly kept.

In summary 3 actions :
  - update of stager to be debugged (I'm on it)
  - migration PL/SQL to be debugged (same error)
  - taking streams in consideration in the freespace equation
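The numbers in Sebastien's note translate into a simple check. A minimal Python sketch of the proposed "take streams into account" free-space equation (names are illustrative, not CASTOR code):

    # Minimal sketch reproducing the ITDC numbers above; not the actual CASTOR code.
    TB = 1000**4
    GB = 1000**3

    capacity        = 1 * TB
    min_free        = 0.05 * capacity      # minAllowedFreeSpace: 5% of 1 TB = 50 GB
    running_streams = 50                   # job slots already writing on the box
    avg_file_size   = 1.2 * GB             # test files of 1.2 GB

    def accepts_new_stream(free_bytes):
        # Reserve space for everything already in flight before comparing to the limit;
        # the naive check (free_bytes > min_free) ignores the 50 * 1.2 GB = 60 GB that
        # can still land on the file system after it is declared "full".
        expected_free = free_bytes - running_streams * avg_file_size
        return expected_free > min_free

    print(accepts_new_stream(free_bytes=120 * GB))   # True:  enough headroom
    print(accepts_new_stream(free_bytes=100 * GB))   # False: would breach the 5% limit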

  • UDP communication problems between stager and migrator (Dennis): partly fixed, not yet completed.

Follow-up on ITDC stress test problems yesterday night:

  • Oracle connections froze and the connection couldn't be restarted, causing the stager to freeze. Sebastien: we still have one case where an Oracle error is not caught correctly. Taking into account all cases for cleaning up before an automated restart can be very difficult. In the short term, we can expect this problem to happen on c2atlast0 sooner or later.

Status of c2atlast0:

  • 2.1.3-9 software installed.
  • LSF upgraded (also on ITDC). Note: log files have moved to /var/log/lsf. Metrics parsing the log files need to be readjusted.
  • DB schema has been created despite initial problems, now fixed in CVS.
  • Disk servers: Are still being reinstalled, Miguel to confirm with sysadmin team. Should be completed today.
  • DB server was suffering from a mismatched MTU setting in the private interconnect network (wrong version of ncm-network used). Problem is understood and Manuel will look into this. (Reported to be fixed after the meeting)
  • Jan reports that Kors was in today's morning meeting - planned AtlasT0 should not involve tape migration. However, ITDC problems above need to be understood first.
  • Jan - the regular (3h) stager restart was removed from the 2.1.3 stager configuration - should we re-enable it? Sebastien recommends to keep a restart once a week as there is still a minor memleak - Jan will configure the restart.
  • Next steps: create svc classes, assign disk/tape pools, fix database server. Dennis to continue with tests once DB is available again

LSF status:

  • Load balancing tests - job distribution analysis (Miguel): Not yet done.

Next steps

  • ITDC problem understanding+fixing
  • c2atlast0 preparation and tests (see above)
  • 2.1.3-10 release still being prepared (Sebastien).


25. April 2007

Present : Jan, Miguel, Olof, Dennis, Sebastien, German

LSF status (Sebastien/Dennis):

  • managed to enter 94K LSF requests on the test setup but this caused a meltdown as the scheduler is killed due to slow response. The test LSF node was swapping (lxb1368) as it has only 1GB of memory. The queue length test was then re-run on ITDC (its LSF server node has 4GB of RAM). After entering 77K jobs the disk servers were reopened. Jobs were dispatched in groups of 300 (instead of continuously) until the queue was reduced to 30K, when continuous job running started. The new 'meltdown' point can thus be considered to be at 30K jobs in the LSF queue.
  • Sebastien has been profiling LSF using Valgrind/Callgrind in order to better understand the scaling limitations: When jobs are entered into the system, mbsched calls the plugin for all the jobs in the queue. With 80K jobs it can take ~ 20minutes which is beyond the killing threshold. When the scheduling phase is entered, the plugin is again called for each job. So for 80K jobs we have 160K calls to the LSF plugin. Callgrind shows ~ 30% of total CPU usage spent in the plugin, the rest is inside LSF (70%).
Phone conference with Platform Computing:
  • During a (very good and constructive) phone conference with LSF developers, it became clear that the maximum queue length ever tested by Platform is 100K jobs, running on "very good" hardware. This figure matches what we are seeing taking into account the LSF plugin overhead and the HW we are running on. Olof suggests to disable the plugin to see what the real limits are. But how many jobs should we expect? Is 20K realistic? In Castor 2.1.3, the only scheduled jobs are file accesses so the total number of jobs will be lower than on the current 2.1.1 production instances.
  • What is the max # of requests per second to enter queue: Answer still to be provided by LSF developers.
  • How to avoid suspension and 'post message' in LSF? The LSF developers have made available a tar ball containing the source code of the bsub command. They don't use 'post message' but an alternative mechanism which functionally does the same but seems to be more efficient, and suspend/resume (which currently triples the load for LSF) is not required any longer.
  • alternatives for the 'read message' mechanism were discussed as well.
  • Platform suggests to modify bkill for our own purposes (eg. contacting the stager); the bkill command source code is available (open source).
  • Medium-term investigations for improving LSF performance: having a two-level queue where only jobs to be scheduled are entered to the LSF queue; revisit internal LSF states

Modifications to Garbage Collection (Miguel):

  • As discussed yesterday, the GC lower watermark was changed and Miguel noticed that reaching the new level took a long time (~1h); however, much data had to be dropped. The GC is still running every 15min only. Sebastien points out that the GC has a first phase where files are marked as deleted, which is centralized and executed sequentially fs by fs; this is difficult to parallelize. Not an issue at the moment but may become one in the future? A sketch of this two-phase structure follows below.
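A minimal Python sketch of the two-phase structure described above (watermarks and field names are illustrative, not the actual CASTOR GC):

    # Minimal sketch: phase 1 marks candidates sequentially per file system,
    # phase 2 deletes them. Watermarks are illustrative.
    HIGH_WATERMARK = 0.90      # start collecting above this fill level
    LOW_WATERMARK  = 0.85      # stop once projected usage drops below this level

    def mark_candidates(fs):
        """Phase 1 (centralized, fs by fs): pick oldest files until the projected
        usage is back under the low watermark."""
        to_drop, projected = [], fs["used_fraction"]
        for f in sorted(fs["files"], key=lambda f: f["last_access"]):
            if projected <= LOW_WATERMARK:
                break
            to_drop.append(f)
            projected -= f["size"] / fs["capacity"]
        return to_drop

    def run_gc(filesystems, delete):
        for fs in filesystems:                      # sequential: hard to parallelize
            if fs["used_fraction"] > HIGH_WATERMARK:
                for f in mark_candidates(fs):
                    delete(f)                       # phase 2: actual removal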

Load balancing:

  • Sebastien points out that the first-order load balancing criterion is based on # streams, and not on free disk space, which may cause uneven distribution. The disk free space is taken into account for load balancing as a second-order criterion (for equal # of streams), as I/O throughput is more critical than free space. For the time being, free space is disabled in castor.conf; moreover Sebastien is considering changing the coefficient from absolute disk free space to expressing it in % (see the sketch after this list).
  • test with 100 clients and space threshold 50-80%: Job distribution was even but considering a big time window; Miguel still needs to look at the logs in order to confirm that within shorter time periods a smooth and even distribution was achieved as well.
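A minimal Python sketch of the two-level ordering described above, with free space expressed as a percentage (field names are illustrative, not the CASTOR implementation):

    def pick_filesystem(filesystems):
        """Fewer running streams wins; free space (in %) only breaks ties."""
        def key(fs):
            free_pct = 100.0 * fs["free_bytes"] / fs["capacity_bytes"]
            return (fs["streams"], -free_pct)
        return min(filesystems, key=key)

    # Example: fs2 wins on streams even though fs1 has more free space.
    print(pick_filesystem([
        {"name": "fs1", "streams": 4, "free_bytes": 600, "capacity_bytes": 1000},
        {"name": "fs2", "streams": 2, "free_bytes": 100, "capacity_bytes": 1000},
    ])["name"])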

Tape writing tests:

  • functionality working; migration monitoring still showing problems
New 2.1.3.9 release:
  • On ITDC; however, LSF7 was not upgraded yet.

Other items:

  • Miguel: set of actuators for automatically taking out disk servers are now working.
  • Miguel requests that stager_query should have a timeout; whenever the stager is restarted the client never gets an answer. To be filed into Savannah.
  • c2atlast0: 19 new disk servers added and with sysadmins for reinstallations, should be ready this morning.
  • Dennis ran Miguel's original stress test which stopped at 4AM with Oracle connections being dropped. Dennis and Sebastien still need to investigate if this was due to Oracle errors or RH/stager deadlocks.

Plans until tomorrow:

  • larger scale tests including migration on c2atlast0: Miguel needs to finish installing c2atlas (with Jan) and run tests on it (LSF and I/O throughput, scheduling as many jobs as possible; load balancing).
  • Tape tests on ITDC to be launched by Dennis now. Tape pools are configured and 4 dedicated drives made available. The idea is to get a high throughput to tape; this will probably need tuning.
  • Test setup: Sebastien to continue LSF testing.
  • A new 2.1.3-10 release is needed for fixing a minor stage_query bug (reintroduce reservedSpace for backward compatibility)
  • Savannah bug status to be cleaned up if time permits


24. April 2007

Present : Jan, Miguel, Ulrich, Ignacio, Bernd, Olof, Dennis, Sebastien, German

C2test stager instance

Sebastien tried to test the queue length limits. He was able to fill the queue without problems with 94000 jobs, but they were not executed due to a problem with an LSF admin command which was stuck over night. The system can absorb about 10 jobs per second. The request handler and the DB were able to handle about 11000 jobs within about 1 minute. No requests were passed to the stager, probably due to a very busy DB, but this is for the moment not a problem (one minute ‘dead’ time for 11000 jobs). All 20 threads of the request handler were used. These are very good results.

Ulrich will look with Sebastien into the admin problem.

The first tape tests were running fine, besides a small problem in feeding back monitoring information from the migration. This can lead to small problems in the overall load balancing in the pool. Should be fixed, but not with high enough priority to delay the next release.

ITDC instance

The problems with the load balancing and the wrong space allocation were due to wrong monitoring information. They were fixed and the software on ITDC was updated by hand. A long run overnight showed a very good distribution of free space over servers and file systems. The GC was running well in addition. Miguel stopped the GC on all nodes this morning and observed a smooth and equally distributed filling of the disk space. The hard limit of 95% was in most cases correctly handled. The over-shooting of space usage is understood and due to the frequency of the GC running plus small deficiencies in the extrapolation of space usage per stream in the monitoring. This can easily be improved by moving the threshold down to 90% and the frequency of the GC to 5 min.

The current test was still running with 600 clients writing 1 GB files and 15 stream slots per disk server (15 disk servers in total).

To test the improved load balancing a new test is started right now with 100 clients and a different space threshold (50-80%)

All these parameters are now in castor.conf on all disk servers and not anymore hard coded in the DB.

Miguel implemented a set of actuators to automatically take out disk servers from the pool which have problems. Under test.

C2atlast0 instance

Miguel has started to drain disk servers from the c2atlas t0perm pool (21) and will add them today to c2atlast0. He will also enable the tape pool.

Sebastien is preparing a new release (2.1.3.9). Ulrich has a new LSF7 release (v15). He will move with Jan to this new version on the TEST instance so that the Castor release can be built against it. The new release is to be installed on ITDC and c2atlast0 tomorrow at midday. Tomorrow afternoon larger scale tests including migration will be started on c2atlast0.

Some discussion about job behavior in case of overload

In case of problems with the disk servers we need measures (see the sketch after this list):

  • time-out for pending jobs
  • limit on the queue length
  • message back to the user if the queue is full
  • a queue per service class to minimize interference (to be checked)
  • cancellation of jobs needs to be propagated into the stager
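A minimal Python sketch of how the first measures could fit together (the limits, names and the notify_stager callback are illustrative, not the actual stager/LSF interface):

    import time

    MAX_QUEUE_LEN   = 20000       # per service class (illustrative)
    PENDING_TIMEOUT = 4 * 3600    # seconds a job may stay pending (illustrative)

    queues = {}                   # service class -> list of (job_id, enqueue_time)

    def submit(service_class, job_id):
        """Refuse new work with a message to the user once the queue is full."""
        q = queues.setdefault(service_class, [])
        if len(q) >= MAX_QUEUE_LEN:
            return "queue for service class '%s' is full, try again later" % service_class
        q.append((job_id, time.time()))
        return "queued"

    def expire_pending(notify_stager):
        """Drop jobs pending too long and propagate the cancellation to the stager."""
        now = time.time()
        for sc, q in queues.items():
            expired = [job for job, t in q if now - t > PENDING_TIMEOUT]
            q[:] = [(job, t) for job, t in q if now - t <= PENDING_TIMEOUT]
            for job in expired:
                notify_stager(sc, job)

Keeping a separate queue per service class keeps an overloaded pool from blocking requests for the other pools.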

New upgrade of Gridview by Jan on all SLC4 disk servers.

Configuration of different tape pools for LHCb for the different data types is necessary.


23. April 2007

Present : Jan, Olof, Miguel, Ignacio, Bernd, Sebastien, Dennis, Ulrich, German

The new version was installed on ITDC on Friday. The rmnode daemons on the disk servers were not restarted; the interface changed, which requires a restart of all daemons everywhere. This was fixed on Saturday.

The major test run from Saturday is still ongoing.

600 clients, 1.2 GB files, 15 slots per disk server. Running since Saturday morning. Nearly no errors in the stager, restart of daemons was not happening. No LSF problems, no daemon restarts but only about 1 request/s now (low load).

Miguel showed some plots about the occupation of file systems. Out of the 45 used FS in ITDC, 13 behaved strangely. They did not ‘obey’ the thresholds used (band between 85 and 90%); we have nodes under 85% and above 95%. There is also a problem with the distribution of space within one node: the FS free space distribution is not equal. The garbage collector ran on all FS. The data have been taken from lemon, which stores the space used per file system per node. The reason is probably a combination of the space reservation not working, the long time interval for starting the GC and the quality of the monitoring information.

Giulia ran the tape test suite on the TEST setup, which worked fine.

Test plan for today :

  • Queue length tests on the TEST setup

  • Leave the ITDC setup running for the space allocation debugging

  • C2atlas t0 to be upgraded with more disk servers

  • Adaptation of latest releases into production by Miguel

  • Consistency scripts rerun to check the disk servers, cross-checking the installation. Ignacio

Problems with the LHCb durable disk pool setup. This needs some decisions on principles. Meeting with them next Friday.

Misplaced file problems should get higher priority, because they cause considerable work in the operation team.

Jan will propose to move ALICE to the new NAS DB next week.

Jan is still working on COMPASS and NA48.

Jan arranged that we can stop the RGMA daemon on all disk servers


20. April 2007

Present : Miguel, Dennis, Nilo, Bernd , Olof, German, Sebastien, Jan

Dennis reported that the problems of reproducing results from last week were solved. The tests showed different behavior because the performance was much better! With ~200 nodes and 20 processes each opening a file and writing 1 byte, these were not able to fill the queue, as all DB dead-locks had been fixed in the meantime and thus the job processing overhead was reduced considerably.

Another test was started yesterday at 18:00 with 600 processes constantly writing 1.2 GB files. This ran smoothly for 6h at 1.3 GB/s aggregate data rate, with 15 disk servers and 15 job slots each. Then Dennis discovered that the garbage collector was not running, due to updating problems in rmmaster. A fix was applied at 23:00, but did not work correctly, so the system started to drop performance until it stopped. In addition, analysis of the monitoring data showed that the threshold for space per file system did not work. Lots of jobs started and finished with a ‘file system full’ error. This then created a large load on the database, which had to be restarted/killed in the morning by Nilo.

A new monitoring variable needs to be visualized: the number of jobs started/finished per time interval. The information is in the meantime collected correctly by DLF, but still needs to be extracted and presented.

The CMS stager move went smoothly. Nilo had to fix a problem on the DB side, which required a restart of the stager. Actually the stager is anyway restarted by a cron job every 3 hours as a workaround for a memory leak, which is fixed in the new version.

The plan for today is :

  • Sebastien is continuing to fix small things, but they don’t affect the larger testing.

  • New release planned for today.

  • Fix the GC and file system threshold problems

  • Restart the tests over the weekend

  • Miguel and Jan are continuing to test the infrastructure of c2atlast0 and add more disk servers.


19. April 2007

Present: Dennis, Jan vE, Olof, German, Uli, Miguel, Sebastien

  • CASTORCMS upgrade status: A block corruption on the current (to-be-replaced) stager DB was found today at 6AM. Miguel stopped all daemons; as a consequence the planned upgrade to the new DB may take longer.

Update on actions :

  • Status of new release deployment (2.1.3-7): Installed on ITDC since 5pm and running LSF stress tests overnight. Dennis couldn't achieve more than 1000 concurrent jobs sustained instead of ~3000 expected; the reasons have not been understood yet. Peaks of 14000 requests per hour were reached but not in a stable manner. Next steps: Understand the reasons for the degradation.

  • The broken Disk2diskcopy unveiled by RAL (as PrepareToGet operations are not scheduled any longer) has been fixed. A pure database fix already covers the general case, but a fix in the code had to be applied for the scenario of a get of a file whose only copy is on a disabled diskserver and no tape copy is found for it. RAL are already testing the PL/SQL fix. Both the 2.1.2 and 2.1.3 Castor series are affected. A new 2.1.3-8 release will be built containing the fix.

  • Adding disk server functionality testing status. Miguel: most important functionality tested and working, but needs more exhaustive tests, feedback to be provided to the development team (low urgency).


18. April 2007

Present : Olof, Jan, Miguel, Sebastien, Nilo, Ulrich, Dennis, German, Bernd

Some work with the Gridview team has been done: memory bug fixes and better automatic installation and updates. The latest version is installed on all disk servers. This took quite some time from the operations team.

The plan for the DB upgrade of the CMS stager has changed. As there is the CMS week ongoing they prefer to have the upgrade this week instead of next week. Thus the move to the NAS system is planned for tomorrow.

Since last night all disk pool activity of CMS stopped and they report problems to access Castor data. Bad block on the CMS data base.

Nilo to send a status of the NAS hardware and castor instance mapping until lunch time.

LSF7 status is now okay. A few problems are left, but they should not affect the Castor system; they are more important for lxbatch. Ulrich described the details in an email.


Dear Olof et al,

I have just release a new set of rpms for all platforms (which should fix the problem below as well as some others), and updated https://twiki.cern.ch/twiki/bin/view/FIOgroup/ScLSFrpm
as well as the project status page at
https://twiki.cern.ch/twiki/bin/view/FIOgroup/FsLSF7Evaluation

Note the following changes made on request of CASTOR people:
- all libs have been moved to the standard library path /usr/lib btw /usr/lib64
- the header files in the devel package are now in /usr/include/lsf (rather than in /usr/lsf/include/lsf)

I made some other improvements which were pending for the LSF6 part concerning log file location and rotation. There are still remaining issues here (see list) but they do not seem to be critical for CASTOR.
One problem _not_ affecting castor is that the melim is not being started which is a problem for batch but not for CASTOR, and I opened a support request for this. Also, there is still an upgrading problem pending which requires an update of the LSF6 rpms. This will be addressed next (not a show stopper I think).

This is the third release since yesterday. The previous version which contained mainly configuration changes seems to work fine from what I have heard so far.

Cheers,
Ulrich


The new releases are now deployed more or less simultaneously on the three instances: c2test, ITDC and c2atlast0. An immediate fix was made for a bug which crashed the stager.

The run over night showed stability in the LSF7 area, but some database dead-locks. We are looking at time-out issues with the Oracle team.

In the operations team a large backlog of tickets has accumulated during the last weeks.

Miguel has been spending some time on the adaptation of the configuration and monitoring for the new release.

Some extensive discussion about the release process. A current release takes about half a day to make:

  • 30 min for preparations and release notes
  • 30 min for the actual make process and compilations
  • 15 min for uploading to the swrep software repository
  • 15 min – 2 h (mostly unpredictable) for entering profiles into CDB
  • 5 min for the quattorized installation on the stager instance

In addition the test suites need to be run for a minimal error check.

The weak point is the CDB area. German is fully aware of this and measures to improve the situation have been under preparation for some time. New and faster hardware will be available during the next 3-4 weeks; software improvements are planned for the next months. Jan proposed a shortcut which avoids waiting for the possibly long CDB update, which Sebastien agrees to and will apply if possible (in some cases this recipe would not work).

The maximum number of release turns is thus about 2 per day.


17. April 2007

Present: Dennis, Jan vE, Olof, German, Uli, Miguel, Sebastien

  • By mistake, the ATLAS main stager DB was dropped instead of the T0 DB. Nilo is trying to restore it. Kors has been informed.

Update on actions from yesterday :

  • New 2.1.3-6 has been released (as internal release). ITDC reinstallation with this release: Miguel is currently busy restoring CASTORATLAS. In the meantime, 2.1.3-6 will be installed on c2test. Seb+Dennis can then help reinstalling ITDC. It is important to not touch c2atlast0 for the time being.
  • Adding disk server functionality: Still to be tested.
  • Discussion: What test subtrees should be run (e.g. Giulia's test suite, tape tests, Miguel's test etc), where (c2test, c2itdc, ...) and when (minor releases, major releases, etc)? A general strategy will be worked out (Giulia, German) and presented in a forthcoming meeting; for the time being it is agreed to run on c2test Giulia's test suite and exercise tape recall, and then on ITDC run parallel file access from 200 clients (LSF tests by Dennis).
  • Miguel reports that ITDC was very loaded overnight; many deadlocks occurred.
  • Problem with load balancing - uneven job dispatching on disk servers: Seb has externalised more parameters onto castor.conf (load, #streams, read/write rates), now ~ 20 parameters available. The load balancing is to be re-checked once ITDC is reinstalled with 2.1.3-6. The goal is to fill half of the available slots for verifying even job distribution (as with all slots full this is difficult to see).
  • LSF7 instance status: Uli - WIP. He has fixed some configuration issues and has produced an intermediate release which Jan will now check. Still some packaging problems to be addressed. Ulrich has figured out how to put the lsb.events and accounting files on the local fs and not on AFS.
  • CASTORCMS -> NAS HW: Announced for April 25, but objected by Daniele Bonacorsi. Jan will find out if can be done tomorrow morning, otherwise Thursday morning.
  • New DB servers: Jan presented his draft plan for proposed configuration: https://twiki.cern.ch/twiki/bin/view/FIOgroup/ScCastorOracleRac
  • Stager DB merging: Sebastien reminds to initiate the base sequence ID very high from the start in order to allow for posterior merging.


16. April 2007

Present : German, Jan, Miguel, Sebastien, Dennis, Bernd, Ignacio, Olof, Nilo, Ulrich

The tests over the weekend were not successful. Lots of problems with the DB backend and the LSF7 installation: 81 restarts of LSF until Sunday morning, even when there is no activity; DB corruption; dead-locks; a single CPU under heavy load; a problem with error feedback to the castor client program.

There was a short stable period where a test ran at ~1.3 GB/s over 15 disk servers, which was pretty good. A problem with the load balancing was identified and will be fixed.

Olof expressed some doubts about the new NAS hardware as it has not yet been tested with a stager instance. Two weeks ago the name server was moved to the NAS hardware and has behaved well so far, but it has a less stressful access pattern than the stagers. Nilo disagrees and has full confidence in this hardware.

Long discussion about how to continue after the problems we have seen in the different areas.

Plan for the next days :

  • High priority of fixing the LSF7 instance; Ulrich to focus on this during the next 2 days in collaboration with Dennis and some help from Jan

  • Jan continues to do the setup of the Castor2 instance for COMPASS and NA48. The decision is still to have them both on the new Castor2 public instance. For the moment they will be the only customers.

  • Jan will send an early warning to CMS, as we are tentatively planning to move them to the new NAS hardware next week. We have seen too many corruptions, and more memory should help the performance and stability.

  • Sebastien will make a new release this afternoon. ITDC will be completely ‘wiped’ and Miguel will help to test the automatic installation and configuration of the new release on ITDC. If this works and all the identified missing features and functionalities (e.g. adding disk servers) are okay, the new release will also be installed on c2atlast0. ITDC will only be used for functionality tests; it will stay with the current database on old hardware. C2atlast0 will now be used for the performance tests and the installation should have production quality, for the moment primarily from the point of view of functionality and configuration.

  • Nilo will prepare three more NAS DB instances until the end of this week

  • After the experience of fixing the outstanding LSF7 problems by Ulrich, we will take a decision on Wednesday morning whether to continue with LSF7 or move back to LSF6 for some time. It will take the dev team about half a day to move the release back to LSF6.

  • Nilo will write a small status report about the NAS situation and how many DBs can be hosted with/without interference between each other.


13. April 2007

Present : German, Jan, Miguel, Sebastien, Dennis, Bernd, Ignacio, Olof

We have a lot of files which are updated after being written the first time (e.g. ROOT files): about 1000 stage-update requests per day (over all 4 LHC stagers).

Presentation by Sebastien about the changes in 2.1.3; it has also been sent by email. Quite some detailed discussion to understand it and make some modifications. Sebastien and Dennis will focus on fixing the problems for the operations teams, as currently the setup of the new instance is stalled due to several problems, e.g. there are still problems with adding new disk servers to the pools with this new release.

A bug which caused crashes was fixed in the shared memory code (LSF plugin, rmmaster communication).

Large instabilities in the LSF7 instance. About every 1-2h a crash of the mbatchd process. No clear error messages. The daemon should have been automatically restarted by LSF. Probably a configuration issue. The LSF expert (Ulrich) is back on Monday, so we live with this until Monday and don’t dig further. Miguel will install the LSF-restart script on ITDC which runs in the background.

Miguel will add a tape pool to ITDC (actually the same as for ATLAS t0export).

Sebastien and Dennis will start a large scale write test in the afternoon to run over the weekend, writing 1 GB files. Use 200 client nodes to run several thousand streams in parallel.

Miguel to start an extra test automatically on Sunday morning, also with large files (5GB), but adding lots of stager queries and other ‘control’ statements.

We already see a high load on the database (1-CPU node in ITDC); this is good for optimizing the queries.

LSF7 setup to be debugged on Monday by Ulrich.

The 2.1.3.x version is needed on an instance to do further SRM 2.2 tests. The plan for next week is to stabilize ITDC and then copy the release to c2atlast0. ATLAS can be involved in the tests and ITDC can be used for the SRM tests. Giuseppe promised an SRM2 endpoint on c2itdc; 2.1.3 is needed for this.

Jan deployed the new Gridview version on all disk servers yesterday at 14:00. Bernd will do some checks.

The 2.1.2.7 release is needed for the RAL bug fixes (stager crashing on prepareToGet). Giuseppe is using the PPS to test this and has to downgrade LSF from 7 to 6.


11. April 2007

Present : Jan, Miguel, Bernd, Olof, Dennis, German, Sebastien

Adding disk servers to the new instance (ITDC, c2atlast0) is not working anymore.

Lots of things have changed in the new release which causes some problems for the operations team. Monitoring is also an issue which does not work as in the old scheme.

A list of changes needs to be provided by the developers immediately, especially dropped functions and new functionality.

Sebastien will send a first list of changes to the operations team around lunch time. Giuseppe is back this afternoon and will help Sebastien to debug and also interface with the operations team for questions+answers concerning the new release.

Some discussion about the release plans. Bernd insisted that the new release gets into production very soon. The developers remarked that there are still known problems, but the idea of waiting another few months until these are fixed is not the way to go. We need a faster release cycle and more interaction between the developers and the operations team on a regular basis. This will create some more overhead for support issues, but there will always be a set of releases to be supported in parallel with different emphasis (old, pro, new).


10. April 2007

Present : Jan, Dennis, Miguel, Bernd, Nilo, Sebastien

The new Gridview program which runs on all disk servers has a large memory leak (250 MB per day) and is creating lots of problems. The errors seen in the export of ATLAS are due to this. Seems to be a lack of testing on the provider side. The program will be stopped until this is fixed.

Nilo fixed a data corruption on the Atlas DB.

The export of ATLAS ran out of data due to a script problem on the Atlas side.

Miguel reported that the LHCb disk1tape0 pool filled up again and he added a disk server. Bernd commented that this is not sustainable and that we should also stop our internal tape writing for these types of pools (to be discussed further).

Sebastien finished the tests on the C2test instance successfully and moved to ITDC, which is a full Castor instance with 15 disk servers.

Jan reported that there are still some LSF7 configuration problems, e.g. where to put the logfiles and accounting files (this affects the start of daemons). All these issues are reported so that Ulrich can have a look at them when he is back on Monday.

Sebastien started on Friday with some large scale tests on ITDC: 15 disk servers configured to hold 50 slots each (750 slots in total). With 4000 clients he submitted simple programs which open files and read 1 byte. The observed request rate (entering the queue) was about 12 per second. Filling the queue with 50000 jobs was no problem. With this queue length about 3 requests per second can be scheduled, as the system takes into account all jobs in the queue. It moves to about 20 requests/s when the queue length goes down to 15000. Some odd behavior of LSF was observed (stuck LSF, jobs disappearing from the queue and reappearing later). One test takes up to 5h. Changing the number of stager threads talking to rmmaster (from 20 to 40) did not change the number of requests per second. The load balancing used worked fine, but was using only one scheduling parameter (number of streams). The tuning of this can be done later and will anyway be a continuous process.

Actions :

Sebastien to spend some more time on understanding the LSF issues

Jan is SMOD this week and thus has to spend at least 2h per day on this activity (in general the task force will NOT override these obligations).

Jan has also to continue the move to Castor2 for NA48 and COMPASS which start data taking at the end of May. And a new SRM version needs to be deployed.

The new c2atlast0 can now be finished (database schema, etc.). ITDC will not be modified further; rather, c2atlast0 will be upgraded to finally replace c2atlas. Larger scale tape tests will not be done on ITDC but on c2atlast0. Miguel will provide the necessary tape pool for this (the same one as for the T0 tests for ATLAS).

Miguel has to contact LHCb for the setup of their pools for the upcoming DAQ-T0 tests


05. April 2007

Present: Jan, Sebastien, German, Miguel, Bernd, Eric

Jan and Sebastien are correcting configuration issues for the move from the TEST instance to the ITDC instance. Some of them are basic and are fed back to the LSF team. Others are related to the new release and require some up-front manual work.

The issue of bug-fix release needs to be discussed.

Database corruptions have been found in PUBLIC and CMS. Eric and Nilo are trying to fix them without interrupting the DB. Before fixing CMS, the clean-up procedure is run again. The DB accumulated 150000 special entries in ~48h. The 'timeout' for finished operations will be reduced to 6h.

The plan is still to move CMS to the NAS hardware in the beginning of week 16. Eric ensures the setup is ready by the end of next week.

Sebastien will send a mail if/when the move to ITDC is finished today, so that maybe further tests can be run over the long weekend


04. April 2007

Present: Jan, Sebastien, German, Miguel, Bernd, Olof, Nilo

A problem with one disk server in the ATLAS stager caused a higher error rate in the ongoing export exercise. File systems get still filled asymmetrically and on SLC3 a full file system causes the node to hang. As the node looks 'empty' this can create also 'black-hole' effects.

The CMS stager cleaning went successfully, no hardware move needed now, but rather move in 2 weeks to the 'final' NAS system.

Sebastien has tested the new release 2.1.3.4 successfully with 400 concurrently running clients and 16000 jobs in the queue. A different client setup is needed to increase the load. Looks very promising. When the next release (2.1.3.5) is done on the TEST setup, Rosa should run some basic tape tests. The file system load balancing problem was identified and fixed

The move from the TEST setup to the much larger ITDC instance will start later today, because this setup is already ready. After further larger scale tests the move from ITDC to the c2atlast0 setup will be straightforward and is probably possible by the end of next week.


03. April 2007

Present: Jan, Sebastien, German, Miguel, Bernd, Olof

The name server move to the new hardware went fine.

There are several releases :

  • 2.1.1 currently installed at CERN, different sub-version installed in CNAF
  • 2.1.2 stable bug-fixed release, but not yet deployed at CERN, being deployed at RAL
  • 2.1.3 new version, including bug-fixes and new LSF-plugin, currently under test

2.1.3 looks rather good now, still a few bug fixes needed. The first test suite run was successful. Load balancing on the file systems is not yet working; it is not clear whether this is a small bug or an architectural issue. More testing ongoing.

The CMS stager is running into trouble because of 3.5 million entries which need to be cleaned; swapping has started on the node, affecting the performance. The standard regular cleaning of the DB does not touch these requests. There are actually several levels of cleaning procedures, of which only one is running automatically and regularly. Bernd asked for the others to be run regularly once per week and not only when the number of 'special' requests reaches a critical level. With the help of Nilo this should later be integrated into a standard database operation. The cleaning itself is already a heavy process on the DB.

Nilo + Giuseppe will start the process and watch it. In parallel Jan will prepare intermediate new STAGER hardware, as the NAS systems will only be ready next week. We will decide on Wednesday morning whether the cleaning was enough or a hardware move is needed.


02. April 2007

Present: Jan, Sebastien, German, Miguel, Bernd, Eric, Nilo

Eric and Nilo will be busy today with the name server move.

Jan has prepared the new LSF7 instance, some configuration issues still to be sorted out.

Sebastien has fixed quite a few problems with the new release, several of which created DB hangs. There are still some core problems left.

A bit of a discussion about the DEV and TEST setup. German will look into the issue of up-to-date DEV installations. It is also necessary to provide a permanent second TEST setup, one for the new release and one for the bug-fix releases.

A first meeting with ATLAS is called for tomorrow to discuss the problems encountered so far with the export of data (Castor, GridFTP, FTS, performance and frequent errors, data flows, etc.).

Dennis is on holiday this week and Giuseppe will leave on Wednesday.


30. March 2007

Present: Jan, Sebastien, German, Miguel, Bernd

Yesterday evening a new release was made by Sebastien. Still more bug-fixes needed, not stable yet.

Jan started the configuration of a new LSF7 instance, needed for each stager instance. A new disk server is being added to the TEST instance to allow more scalability testing at this level.

The move of the name server to the new NAS hardware is planned for Monday.


29. March 2007

Present : Olof, Sebastien, Dennis, Jan, Bernd, Miguel, German, Giuseppe

Status of new Castor software tests :

Sebastien: 10 little things discovered, of which 8 are already fixed. A new release is planned for the end of the morning. In parallel the configurations are prepared for the move to the next stager instance. The first test suite is to be run in the afternoon; tests will be ongoing until Monday.

Jan is preparing the new ATLAS instance: 6 mid-range servers are available and the software is installed, pending final configuration. Issues (NCM component details) are handled in a feedback loop with the developers.

The preparations for the new Oracle hardware and software setup are nearly finished.

General points

  • It takes about half a day to produce a new release
  • in case there are problems with CDB response times , the DEV team should contact German for priority access
  • the test suites used by the developers are not complete (no tape access e.g.)
  • Giulia will provide a twiki with the list of existing test programs
  • The dev team will contact Miguel for more tests to be run in the afternoon
  • the hardware configuration of the TEST instance is not large enough
  • Jan is providing 1-2 more disk servers for the TEST setup
  • Moving disk servers between servers is complicated and very work intensive

The new ATLAS instance needs more disk servers. It turns out the foreseen Transtec servers cannot be used, as they unexpectedly developed new problems which require a firmware update of all disks. We will take systems from the ITDC setup currently used by ATLAS for their T0 test.

  • Bernd will provide a plan for this today, reviewed first internally and than proposed to ATLAS

Some discussion about the need to go to LSF7. Jan explained the theoretically simple setup of a new LSF instance (6 or 7), but experience with 7 does not exist. The plugin has only been tested so far with LSF7, not with LSF6.

  • Jan looks a bit more into the installation, but waits until Monday before any action
  • The dev team will carefully check during their next days of testing whether there are indications of principle problems with LSF7
  • We take a decision on the LSF version on Monday or Tuesday

There are some restrictions for next week :

1. on Monday the name server is moved to a new hardware setup, this will occupy all Castor teams for the morning at least and even further if there are problems

2. next week is short due to the Easter weekend and some people take more holidays to extend the weekend

Special meetings

27. March 2007

* taskforce_meeting_minutes_27March2007.doc:


















-- BerndPanzersteindel - 03 Apr 2007
