Week of 081013

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Olof, Julia, Gavin, Maria, Jamie, Harry, Andrea, Simone, Roberto, Flavia, Jan); remote(Michel, Michael, Derek).

elog review:

Experiments round table:

  • LHCb (Roberto) - 1st point is about last week's LSF problem. LHCb would like a post-mortem of this. Running up to 370 jobs - struggled in the past, would like to understand what was the cause of the multiple problems experienced. Olof - job requirements for local batch jobs resulted in job slots being underused on multi-core nodes. Also swap requirements - will ask Uli for a PM. 2nd point: data consistency & integrity checks run during the w/e. LHCb is not a priori against such checks (obviously) but they 'killed' the service. Consistency check between catalog and SE run by LHCb, 100 concurrent jobs. Olof - the problem was 180K pre-stage requests before analysing the data. A substantial number were on disk servers put in draining mode because they are out of warranty, which triggered server-to-server copies. Tape recalls were not a problem. 4 servers held ~30K files which were being replicated. The problem was more that these servers were not actively drained - normally this is done 'passively' through activity over some time. This is the first of 3 bunches that will be run. LHCb would like to run a few more such tests - will check all DSTs. Maybe the next bunch will be in coordination with the service provider. Olof - the servers are now drained - should be ok. Tape access ok - using more drives than usual. Can go ahead with another 150K but better not on a w/e, when the service is less actively monitored. Olof - will actively drain disk servers for LHCb in future - could have done so on Friday. 3rd point: some strange messages from FTS on a job consistency error. Gav - Paolo responded - a single FTS job can have multiple files but all must be assignable to the same channel (a schematic sketch of this rule follows after this list). Thus spake Gavin.

  • ATLAS (Simone) - a few things from the w/e. 1st: some 5-10% failures reading data from CASTOR@CERN - ticket. Gav just replied - hot disk server - too many popular files. srmget doesn't return, gridftp callwait, end of file at source - are these all due to the same root cause? Gav - draining the disk server to take load off. Simone - not much load now but will resume with functional tests. 2nd: CIC portal - sites put themselves in scheduled downtime but it is not broadcast. ATLAS puts these in a calendar (taken from LHCb), but if there is no broadcast there is no calendar entry, hence shifters risk complaining and sites then say 'hey - we were in scheduled downtime'. Both the dashboard and the calendar - neither displays a downtime if there is no notification. srm problem at ASGC on Sunday - solved in a couple of hours by Jason. CNAF ran out of disk space on DATADISK - cannot collect cosmics any more nor participate in functional tests. Some data cleaned up as usual on Mondays. Moving some space from the MC space token - might make room for a week or so - some more space next week? Just in time... Michael - currently observe >3K datasets (ATLAS DATA) to be replicated to BNL. Most seem to be 'not ready for transfer' (-1) - can you get me an update as to what will happen? The monitoring page that reports on the progress of data(set) replication shows -1. Simone - will check and send email in the next hour.

  • ALICE - a new production release is currently being deployed, hence production currently stopped.
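
For illustration, a minimal sketch of the channel rule Gavin described for FTS jobs, assuming a simple host-to-site map; the hostnames, the map and the function names are hypothetical and this is not the real FTS implementation:

    from urllib.parse import urlparse

    SITE_OF_HOST = {                                  # hypothetical host -> site map
        "srm-lhcb.cern.ch": "CERN",
        "srm-lhcb.gridpp.rl.ac.uk": "RAL",
    }

    def channel_for(src_surl, dst_surl):
        """Name of the channel ('SRCSITE-DSTSITE') serving one source/destination pair."""
        src = SITE_OF_HOST[urlparse(src_surl).hostname]
        dst = SITE_OF_HOST[urlparse(dst_surl).hostname]
        return f"{src}-{dst}"

    def check_job_consistency(file_pairs):
        """Raise if the file pairs of a single job do not all map to one channel."""
        channels = {channel_for(s, d) for s, d in file_pairs}
        if len(channels) != 1:
            raise ValueError(f"job spans several channels: {sorted(channels)}")
        return channels.pop()

    # e.g. check_job_consistency([(src1, dst1), (src2, dst2)]) returns "CERN-RAL"
    # only if every pair in the job maps to the CERN-RAL channel.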

Sites round table:

  • GRIF - killed by last glite update. Site basically destroyed and re-installing.

Core services (CERN) report:

  • CASTORATLAS was upgraded to 2.1.17-19/2 - bug fix release. CMS tomorrow morning.

DB services (CERN) report:

  • A few planned interventions: today at GridKA for the LHCb conditions DB & LFC - to apply a diagnostic patch to try to find the cause of a Streams-related issue which seems to affect only GridKA. Hope for more news asap... An intervention is planned at RAL tomorrow - a transparent rolling intervention to set a parameter needed by Streams (_buffered_publisher_flow_control_threshold); RAL & CNAF are the last sites to apply it... New areas of requests: a request from SAM about handling old data in archive mode (read-only / archive mode with retrieval on demand). Will work on this, trying to understand the best way of proceeding.
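
For illustration only, the generic way a hidden Oracle parameter such as the one above is applied; the threshold value, connection details and use of cx_Oracle are assumptions, not the actual RAL/CNAF intervention procedure:

    import cx_Oracle  # assumes the Oracle client libraries are available

    # Placeholder value: the real threshold used in the intervention is not
    # recorded in these minutes. Hidden ("_") parameters must be double-quoted.
    SQL = ('ALTER SYSTEM SET "_buffered_publisher_flow_control_threshold" = 80000 '
           "SCOPE = SPFILE SID = '*'")

    def set_streams_threshold(dsn, user, password):
        """Apply the parameter on one instance; it takes effect after a restart,
        which is why the change can be rolled through a RAC node by node."""
        conn = cx_Oracle.connect(user, password, dsn)
        try:
            conn.cursor().execute(SQL)
        finally:
            conn.close()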

Monitoring / dashboard report:

Release update:

AOB:

Tuesday:

Attendance: local(Olof, Julia, Jean-Philippe, Patricia, Harry, Nick, Gavin, Roberto); remote(Gareth, Michael, Jeremy, Jeff).

elog review:

Experiments round table:

ATLAS (SC): We have followed up on Michael's question of large numbers of files 'not ready for transfer' to BNL yesterday. Last Friday we switched from closing one dataset per 12-hour run to closing one per 20 minutes and logically merging the 12-hour datasets into a new container definition. This is in order to try and speed up dispatching, which is triggered by closing a dataset. However, this has hit a limit on the number of datasets in the DDM site services and overall the distribution is getting very slow. We have hence increased the time to close a dataset to one hour or every 500 files and have increased from one to three the number of VOboxes serving the distribution. This was only done an hour ago so will take time to feed through. As a test we tried serving only one site, BNL, through a dedicated VObox last night and the throughput went up to 500MB/sec and the BNL backlog has reduced from 1600 to 1250 datasets. Also, to help clear backlogs, functional tests have been stopped for the rest of this week. Other news is that RAL is in scheduled downtime so their datasets have been diverted elsewhere and SARA has a tape subsystem problem. In addition there are some Tier 1 subscriptions for datasets not at CERN that should be cancelled and we are seeing some 5% of srmget at CERN timing out after 180 seconds (GM reported this as due to a hot diskserver). Finally this morning we had problems sending data to the Great Lakes Tier 2 muon calibration site (AGLT2) that turned out to be an incorrect publishing by the CERN FTS information system. Michael queried this as the information was correct when he looked and Gavin explained that the wrong entry came initially from AGLT2 and would have been corrected by the CERN automatic daily update run but that this had failed yesterday (he will check why) and that he had later run it by hand.
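
For illustration, a schematic sketch of the revised closing rule described above (close the open dataset after one hour or once it holds 500 files, whichever comes first); the class and method names are invented and this is not the actual DDM site services code:

    import time

    class DatasetCloser:
        """Close the currently open dataset after max_age_s seconds or
        max_files files, whichever threshold is reached first."""

        def __init__(self, max_age_s=3600, max_files=500):
            self.max_age_s = max_age_s
            self.max_files = max_files
            self.opened_at = time.time()
            self.files = []

        def add_file(self, lfn):
            self.files.append(lfn)

        def should_close(self):
            too_old = time.time() - self.opened_at >= self.max_age_s
            too_big = len(self.files) >= self.max_files
            return too_old or too_big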

LHCb (RS): We are performing data consistency checks. We plan to run a small (about 1000 jobs over all sites), well controlled full chain of jobs under the Dirac3 infrastructure to verify that we can achieve a 100% success rate in running job chains (currently we only see 70-80%). Jeff noted that Nikhef is only seeing a few LHCb jobs and this will be checked. Olof announced that FIO are preparing a post-mortem on the LSF problems reported by LHCb. Roberto is preparing an EGEE broadcast to ask all sites to repopulate the gridmap file for LHCb so that it only contains default roles. This is to get round the problem that adding a new LHCb role in VOMS takes time to propagate, so that for some time the mapping falls back to the gridmap mapping and special users can be wrongly assigned a non-default role. Implementation of this needs either a post-yaim configuration (each time yaim is run) or making changes to gridmap.conf, and these will be explained in the broadcast. Jeff requested that this also be raised at the next TMB meeting (15 October) and this was agreed.

ALICE (PM): They would like to replace all remaining RB servers by gLite 3.1 WMS but many sites still rely on the RB, so this will be done gradually. CERN would probably be the last to convert as it is used by many Tier 2s. They hope to complete this in a month's time. Jeff thought NL-T1 should meet this date.

Sites round table: Jeff reported that the SARA backend tape system had not been working since about 10.30 am, so no recalls from tape were possible at the NL-T1 (as already noticed by ATLAS).

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update (NT): The bdii meta-rpm in the recent gLite 3.1 update 33 was withdrawn last Friday after the CERN ROC found many bdii entries appearing and disappearing. Also some sites reported jobs being assigned to CEs by pattern matching then finding the requested resources were no longer available. The problem may not, however, be in the information system.

AOB:

Wednesday

Attendance: local(Kors, Alessandro, Graeme, Nick, Jamie, Harry, Andrea, Jean-Philippe, Roberto, Gavin, Jan); remote(Gareth, Michael, Jeremy).

elog review:

Experiments round table:

  • ATLAS (Graeme) - problems since moving to the use of containers - smaller datasets but more of them. Site services not happy... Yesterday split the machines doing export from the T0 into 3 (i.e. 3 machines each crunching fewer datasets); also made datasets slightly bigger (on Friday). Rates 'reasonable' but not as high as hoped. Lyon, BNL and ATLDATA have ~20-30TB backlogs. Looking very closely at whether the backlog is increasing or not. Still gathering data... A lot of rate fluctuations - maybe there are periods of small-file datasets - sometimes up to 500MB/s, e.g. this morning. The box serving CERN and the calibration T2s went strange last night and was out from 22:00 - a 10-hour outage. T0 exports are currently served by 2 m/cs. Maybe revert to 3 later today... Monitoring the situation, in touch with FTS/exports, more updates later...

    Downtime at RAL - due to finish at lunchtime - has crept forward to tea-time. Downtime extensions don't get broadcast. ANY DOWNTIME EXTENSIONS SHOULD BE ANNOUNCED. Nick - would like the GOC & CIC stuff to be treated automatically. Extensions should be treated as unscheduled downtime. Kors - rules for scheduled / unscheduled; will issue a reminder... Roberto - proposal at the last operations meeting: announce the downtime and then a reminder when it is about to start. Nick - LHCb and others say it would be useful to have a) when it was entered, b) when it is due, c) when it finishes. Would like to automate this, RSS feeds etc. Passing this to the GOCDB & CIC portal developers. Harry - did Gavin find anything? Gavin - looked at a few channels. An increase in the number of files overloads something - maybe CASTOR - and the rate drops badly. On exports, "failed to get file in 180s" - no obvious correlation with a disk server. Graeme - still see a low level of gridftp timeouts. Gav - related to disk servers, some of them waiting to be retired; not the main source of errors. Harry - anything else for IT to follow up on? Another box? Kors - not needed for now - they're not overloaded. Graeme - about 20% CPU. Can move back to the configuration of 3 boxes this afternoon. Jeremy - there is a link which comes up with the broadcast and downtime procedure; will reiterate it. Jamie - the "original" document (i.e. as approved by the WLCG MB) is attached below.

  • CMS (Andrea) - the global run is ongoing. The magnetic field will be switched on tomorrow. Reconstruction ok, some problems yesterday due to bugs. CERN no longer shows red in the CMS SAM availability - the problem is fixed. One of the checks that recently became critical is that the version of the TFC installed in the site software area must be the same as that in CVS. This was not true for many sites, hence making the test critical caused many sites - including CERN - to fail for a while. It is still a problem for many Tier 2s. Tier 1s - the CNAF squid server is down, so the site is unavailable for CMS. 2-3 days ago a WMS at CERN for CMS (wms104) broke; it was replaced with one updated to gLite 3.1. Harry - do such sites become uncommissioned? Yes, that can happen, though not for CERN & the T1s. If a site is not ok for more than 2 days in the last 7 it becomes uncommissioned (a schematic version of this rule is sketched after this list).

  • LHCb (Roberto) - a couple of points: 1) production meeting -10'. Post-mortem on the CASTOR service? Jan - it was an LHCb-driven consistency check. Harry - even though disk servers were being drained, the system should not have reacted this way. 2) discussion at the core s/w meeting on the possibility of running some stress tests on conditions DB access at the Tier 1s. No schedule yet...
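
For illustration, a rough sketch of the readiness rule quoted in the CMS item above (a site other than CERN or a Tier 1 is flagged uncommissioned if it was not OK on more than two of the last seven days); the function and argument names are invented:

    def is_uncommissioned(daily_ok, is_cern_or_t1=False):
        """daily_ok: booleans for the last seven days (True = site was OK)."""
        if is_cern_or_t1:
            return False              # the rule is not applied to CERN and the Tier 1s
        bad_days = sum(1 for ok in daily_ok[-7:] if not ok)
        return bad_days > 2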

Sites round table:

  • RAL (Gareth) - yes, there was an issue with downtime creep. It became clear we could not get back up on time. There were also problems updating the GOCDB. Glad it's a known problem! Will send an e-mail report.

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

  • Are you (RAL) coming out of downtime this afternoon? Gareth - hope so, need to confirm.

Thursday

Attendance: local(Graeme, Jean-Philippe, Jamie, Harry, Olof, Andrea, Roberto, Gavin, Jan); remote(Gareth, Jeremy, Michael).

elog review:

Experiments round table:

  • ATLAS (Graeme) - still having problems but think we now understand why we got into such a pickle over the w/e. Problem with the ATLAS T0 on Thursday (corrected on Thursday) - twice as much data at one time as expected, rates went very high, didn't clear the backlog. Then on Friday the containers problem kicked in - the whole chain started. To BNL, Lyon, RAL and SARA there is a very significant backlog - now on tape - hence exceedingly painful. Newly subscribed data tends to succeed immediately. DDM has a positive feedback effect - it prefers datasets closer to completion. Losing effective bandwidth - some datasets slip into the backlog. The plan is to cancel all backlog subscriptions, pre-stage and resubscribe. If any RAW is in the backlog it will be done first, then ESD. The DDM site services m/c reported yesterday as faulty is now understood - a config problem - corrected last night. Now back with 3 boxes. 2 issues seen with CASTOR: 1) a lot of problems with hot disk servers since the end of last week. Gav & co investigating. More issues sent in now. In the last 4 hours almost all have disappeared (gridftp 500 cmd failed) - down to 5-6 per mille. 2) The calculation of pool size / rate gives a ~5-day buffer (a back-of-the-envelope version of this estimate is sketched after this list), but some files go much earlier than 5 days. Think this is because CASTOR optimizes garbage collection and writing differently, e.g. cache turnover can vary per disk server. Explained to us by Miguel Coelho. Hard to quantify - have to look at it 'offline'. Lifetime of files on T0ATLAS??? Jamie - this would occur any time a T1 'went down'. Olof - the backlog is not on disk, it is recalled to a different service class. The lifetime of 5 days in T0ATLAS should not be affected as files will be recalled elsewhere... GC is per f/s. The load problem is in the default pool. Have to ask the developers to look at this if it is an issue. Graeme - the model is to export from T0ATLAS before files are only on tape. Once we are recalling files (into t0default) it leads to a lot of difficulty in getting data back and clearing the backlog. If we are losing files before the 5-day mark -> problems. Users can hammer default as well so it gets much messier. Jan - 1) if there are some examples of files that were served from default instead we can investigate - need to understand. 2) srm was configured to bringonline into default to protect t0atlas - backlog clearing was not in mind. Can change the order of service classes. T0atlas has additional protection which prevents 'ordinary users' from staging files there. Could (?) configure so that files go back into t0atlas. 3) In future, if you specify the space token in srmget you will have a handle on where the data goes, but that does not resolve the fact that bringonline is lengthy. Graeme - we did run into big problems when users could read from t0atlas. Jan - it is under the control of the production system; only if you have the privileges to recall into this service class will it go there. Graeme - not requesting any such changes right now; have run into trouble with other configs. A space token on srmget requires a new FTS client. Will dig up examples of files being recalled within the 5-day window. There was a bug in the way CASTOR chose the write disk server.

  • CMS (Andrea) - very little to say. Ramp-up of the magnet began this morning. Latest news: at 13:30 the field was at 2.5T and should have increased since then. 2 problems to report: 1) problems with the PhEDEx file stager agents - problems with overloaded pools in CASTOR. Will send a ticket if the problem persists. 2) a transfer problem from the CERN export pool to FNAL. No details yet... Again, will contact CASTOR support if it persists.

  • LHCb (Roberto) - no problems to report for CASTOR - received the PM, thanks! Andrew restarted the integrity checks; they probably exhausted the LSF batch job slots at CERN... They triggered staging of files yesterday. Plans: waiting for particle gun and alignment + ramp-up of dummy MC production. SAM jobs at CERN - the software manager account has some problem getting resources. Have to relax the requirements on SAM jobs - they go in the 2nd queue, which is probably overkill. With Stuart, started testing glexec on the PPS - ran into a problem using the DIRAC proxy - looks like glexec is not running. Under investigation.
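
For illustration, a back-of-the-envelope version of the cache-turnover estimate discussed in the ATLAS item above: nominal residency is roughly pool capacity divided by ingest rate. The minutes quote only the result (~5 days); the capacity and rate below are illustrative numbers chosen to reproduce it:

    def residency_days(pool_capacity_tb, ingest_rate_mb_s):
        """Nominal time a file stays on disk before garbage collection, assuming
        uniform turnover (real CASTOR GC runs per filesystem, so the actual
        lifetime varies per disk server, as noted above)."""
        seconds = pool_capacity_tb * 1e6 / ingest_rate_mb_s   # TB -> MB
        return seconds / 86400.0

    # Example: a ~130 TB t0atlas pool filled at ~300 MB/s turns over in ~5 days.
    print(round(residency_days(130, 300), 1))   # -> 5.0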

Sites round table:

  • RAL Downtime (John Gordon). The RAL extension WAS broadcast (attached). It is not obvious that it is an extension though - it just looks like a normal announcement, although it was issued round about when we were due back up. Strangely, the start time in the second announcement was 90 minutes earlier than in the first, even though that time had long passed.

  • RAL downtime (Gareth) - I have just been gathering information and, embarrassingly, have been asked to extend the downtime, which I have just done by adding an unscheduled one.
    As part of the patching of the Oracle RAC systems several other tasks were undertaken. Some of these have not gone smoothly. These include:
    • A memory upgrade on one of the nodes was problematic owing to a faulty memory module.
    • It took longer than expected to export/import the databases.
    • A test of failover of the fibrechannel connecting the disk server to the Oracle RACs did not go smoothly.

  • BHAM (Jeremy) - a question from ALICE about the VOBOX: ALICE has requested an update of it.

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

  • Propose to simplify meeting agenda, replacing core services, DB services, etc. with simply "Services". Anything else can always come under "AOB".

Friday

Attendance: local(Harry, Roberto, Andrea, Olof, Maria, Graeme); remote(Gareth, Michael).

elog review:

Experiments round table:

LHCb (RS): Waiting for the new software distribution to be completed. May start running some Monte Carlo at the T1s as well as at the T2s.

CMS (AS): Magnet at 3.8 T and planning to take as much cosmics data as possible. Some repacking and reconstruction jobs from the last two days were not tracked and have to be recovered. FileStager agent back to normal. A problem with the Data Quality Monitor was solved and histograms for several of the recent runs appeared. Two diskservers were down in the c2cms/t0export pool last night.

ATLAS (GS): The situation is healthier today. We have got rid of the old subscriptions, keeping those for raw data, but these are not many. The ESD subscriptions have been resubscribed using other T1s as sources rather than CERN. There are some gridftp timeout (> 180 secs) errors, probably due to hot disk servers. Olof has reported on the ATLAS CASTOR export cache turnover, where the ATLAS model assumes 5 days: he finds 95% of files have a lifetime of > 5 days and 100% have a lifetime of > 4.5 days. Investigations of the files that we thought had disappeared from the disk cache after a few days, so having to be recalled from tape, showed they had disappeared for other reasons. ATLAS would like to have a more intelligent error message from srm in these cases and this could be done through the existing srm-ls call. ATLAS will also take cosmics all this weekend.

Sites round table:

BNL (ME): We have observed, since 02.00, a sharp rise in transfer rates from close to zero up to 1 GB/sec following the ATLAS resubscription of ESD to the T1 sites. A remarkable number of files are being replicated, about 6500 per hour, mostly small ESD files. At the same time the load factors of the pnfs and srm servers are low (the srm servers have a load factor of 2-3), so there is reserve capacity. We have detected a potential for data loss at ASGC, with dashboard monitoring showing there may be a few hundred files with no disk or tape copies. Graeme will look at this.

RAL (GS): The ATLAS and LHCb CASTOR services were down for about an hour today and the databases had to be bounced to bring them back. They have also found bad identifiers in the CMS CASTOR database, as they had last August. Maria asked if RAL had already migrated to Oracle 10.2.0.4 and the answer was yes, so this is worrying. She thought that globally all Oracle applications sharing resources should use fully qualified table names to avoid these potential Oracle caching bugs.

Core services (CERN) report:

DB services (CERN) report: There will be a distributed databases workshop on 11/12 November with participation of the CASTOR DBAs from CERN, RAL, CNAF and ASGC. There will also be a dedicated session with Oracle RAC experts.

Monitoring / dashboard report:

Release update:

AOB:

-- JamieShiers - 10 Oct 2008

Topic attachments
  • RAL-downtime-extension-15oct.pdf (51.7 K, 2008-10-15, JamieShiers)
  • SC4-scheduled-maintenance-June21.pdf (22.9 K, 2008-10-15, JamieShiers) - WLCG Scheduled Maintenance - as approved by WLCG MB in 2006