Week of 090608

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Jamie, Miguel, Andrew, Simone, Alessandro, Eva, Gang, Edoardo, Julia, Harry, Kors, Roberto, Patricia, Olof); remote(Gonzalo, Vera Hansper (NDGF), Michael, Di, Angela, Gareth, Marc, Brian).

Experiments round table:

  • ATLAS - weekend summary (Graeme) - reprocessing - "key STEP activity for ATLAS@T1s": reading of data from tape, recons, then writing back. 6 days so far, good chance for "post mortem". PIC, RAL, TRIUMF get 3 gold stars. PIC & TRIUMF had to stop as they processed all raw files! Clean buffers. CNAF: some probs over w/e - castor? - prestage itself fine. Reprocessing jobs seem to fail on AMD as opposed to Intel nodes? Needs investigation. Otherwise pretty good. SARA pretty good this morning. Tuning of DMF over w/e - flowing much better. R/W better balanced. NDGF problem last week - not triggering prestage properly - seems to be a config problem: prod system doesn't realise files are on tape and hence doesn't trigger prestage. Simple to fix?? Report tomorrow if ok. BNL - running well. Problem sites: FZK still limping badly. Only 5K files - well below the performance required from a T1. Pretty serious that tape problems are still not resolved. Taiwan completely dead in the water. Eva suggested 2nd listener for Oracle on old port. Some activity over w/e but still not ok - not functioning for ATLAS. Lyon - found late Friday that they came out of downtime successfully. Write access to tape ok, read access not before tomorrow. 4 day downtime -> 9 day downtime!!! Read access asap - would like to bring some data back from tape today. Marc - take note. Turning to data distribution - 1 week snapshot, logbook twiki page: 2 tiny red bits: RAL - understand this. Got > MoU share over w/e. Takes time to swallow. Rome calibration site: now fixed - disk server with network problems resolved last night. Big bit of red: FZK: SE below par last week and serious trouble over w/e. Transfer rate commensurate with a 5% T1, not 10%. 850 datasets to catch up! Around 30TB... Very challenging before end of STEP (a rough rate estimate is sketched after this round table). Another serious issue: FZK well below par. Hits merging: started end last week; issue with prod workflow. Sometimes low priority jobs block the system. Run patchily over clouds - more tomorrow. Jobs themselves ok but need to understand blocking. User analysis: Hammercloud completed >500,000 jobs. Some T2s crippled by load. Problems getting accurate numbers from jobs submitted via WMS. Complicated chain... In general CPU efficiency much lower than in smaller scale tests. Big post-mortem focus. UK, DE, FR, IT particularly good... Really do hope CMS will ramp up this week - want to see how reprocessing survives under heavy load.

  • CMS reports - The important point is that we're going to reset our T1 pre-staging/processing and T1->T1 transfer tests today at 4 PM CEST.

Questions we have for CERN-IT and the T1 sites (answers will be added later..):

- T0 tape writing scale tests:
    - We think we wrote to tape at 1 - 1.5 GB/s for some periods during the weekend. As the monitoring of the tape rates is difficult, could CERN-IT have a look and confirm that our observations are correct? (A rough volume estimate is sketched after these questions.)
    - To achieve this high rate without real data from the detector, we have to stage in streamer files, which puts more load on the system in addition to the writing. What do other VOs currently run at CERN, so that we can get an overview of who is using the system in parallel to us?

- T1 pre-staging tests:
    - Reports from RAL say that there is large multi-VO tape activity. Would be interesting to know from the other T1 sites whether they see the same.
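
A rough cross-check of the T0 tape-writing volume quoted above, as a minimal sketch; the 48-hour sustained-weekend duration is an assumption made here for illustration, not a CMS figure:

    # Rough volume cross-check for the quoted 1 - 1.5 GB/s T0 tape-writing rate.
    # Illustrative only: the real sustained periods were shorter than a full
    # weekend, so 48 h is just an upper-bound assumption.
    def volume_tb(rate_gb_per_s, hours):
        """Data volume in TB written at a sustained rate over the given time."""
        return rate_gb_per_s * 3600 * hours / 1000.0

    for rate in (1.0, 1.5):
        print("%.1f GB/s over 48 h -> ~%.0f TB on tape" % (rate, volume_tb(rate, 48)))
    # ~173-259 TB: the order of magnitude the CASTOR tape metrics would need to
    # show to confirm the observation.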

  • ALICE - (Patricia). A change in the production cycles was announced on Saturday the 6th. This change produced an unstable job profile which decreased the number of concurrent jobs to about 4000. Currently back in a stable production regime. This morning one of the central services was down. The problem was detected by Torino-T2, which observed connection problems from the local VOBOX services to the central services. It was a fast operation (to debug & fix) and was not noticed by the rest of the sites. Bad publication of the voview values at CESGA forced ALICE to stop production through this site last Friday evening. Experts working on this issue. CREAM-CE@CERN: ce202.cern.ch is not publishing any information through the corresponding resource BDII, stopping the ALICE production through that specific service. GGUS ticket submitted this morning (#49339) (now solved & CREAM CE back in production). Regarding the ALICE transfers foreseen for this week, the experiment is hitting a snag with the destination, most probably an application problem (FTD) which is preventing the startup of the transfers. Module being debugged currently.

  • LHCb reports - today is day 0 of STEP09 for LHCb. Will start 16:00 after a small intervention at online. Request to clean up disk cache at T1 sites. Some sites already cleaned... Usual MC simulation continues... 1B event production will keep the system busy until first data from LHC. Major issue: request from Joel re files from online to be migrated to tape over w/e? Issue transferring data to STORM at CNAF - gridftp server down. LHCb data incident - update from Olof: 6500 files lost. NIKHEF: tight collaboration with Jeff and Ron investigating raw file access at GSIDCAP sites. Miguel: ticket from Joel to delete files. Usually do not delete files! Interface is the same as for users. Avoid going into the DB for exactly the problems seen on Friday - very easy to make a small mistake. Migrations: side effect from Friday. Bug in one of the daemons: when a node is rebooted the daemon does not work correctly - doesn't move files out to tape.
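
The FZK transfer backlog quoted in the ATLAS report above (~850 datasets, ~30TB) translates into the following back-of-envelope extra rate; the remaining time is an assumption (roughly four days to the end of the STEP09 window), not a number from the minutes:

    # Back-of-envelope for the FZK catch-up mentioned in the ATLAS weekend summary.
    backlog_tb = 30.0   # ~850 datasets, ~30 TB (from the report above)
    days_left = 4.0     # assumed time remaining in the STEP09 window

    extra_rate_mb_s = backlog_tb * 1e6 / (days_left * 24 * 3600)
    print("Clearing %.0f TB in %.0f days needs ~%.0f MB/s on top of the nominal share"
          % (backlog_tb, days_left, extra_rate_mb_s))
    # ~87 MB/s of additional sustained throughput -- hence "very challenging".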

Sites / Services round table:

  • FZK - working on tape problem. Fixed config errors. Newer version of TSM. Consolidated staging - saw improvement over w/e. Still looking to find bottleneck. Graeme - huge backlog of accepting data to disk is another problem! Angela - many timeouts in transfers. Transfers run into timeout - the timeout was increased but it did not help all the time... Graeme - need to decide whether FZK can cope with being a 10% T1 during STEP! Angela - some of the gridftp doors ran into memory problems - investigating. Agree it is not good - really working on it. What else can we do? Graeme - we try to find an operational state that the T1 can handle. Angela - would be best if someone from the storage group attends the meeting tomorrow.

  • IN2P3: don't have much info about read access problem. Looks like bug in dcache. Lionel working on it. GGUS ticket.. Report more tomorrow about followup.

E-mail from Fabio Hernandez added later: here.

  • ASGC: creation of 2nd listener - already working on this but results not good so far, must wait. Might take more time - leaves very little time for STEP09 (ed). Eva - who is working on this? Graeme - see great willingness from Oracle experts at CERN to help - please ask them!

  • RAL - hit a peak on network traffic - getting >1GB/s! Close to maximum on some parts of network. Related to extra ATLAS data we've been shipping out.

  • DB - replication to Taiwan stuck since yesterday when they tried to add 2nd listener. Friday had problem with apply process on ATLAS offline for conditions. Developer made mistake and made a huge table on one of schemas being replicated. Took a total of 6 hours to resolve. Simone - ASGC DB - how long can they stay in "split mode?" - eventually have to resync (again) as from RAL last week.

AOB: (MariaDZ) Can ATLAS comment in https://gus.fzk.de/ws/ticket_info.php?ticket=44585 please?

Tuesday:

Attendance: local(Maria, Markus, Jamie, Oliver, Daniele, Miguel, Gang, Graeme, Olof, Roberto, Patricia, Kors, Simone, Julia);remote(Gareth, David Britton, Fabio, Angela, Brian, Vesper, Jeremy, Michael, Di, Marc).

Experiments round table:

  • ATLAS (Graeme) - Reprocessing: NDGF panda config fixed last night, prestage working, reprocessing much better, 5K jobs in 14h - huge acceleration in workflow! FZK: very much healthier: 5K reprocessing jobs in the last 24h - same as in the whole previous 6 days! IN2P3: clarified srmLs working yesterday - broken over w/e. First prestage test worked overnight. This morning launched bulk - by tomorrow much better idea of whether AOK. Signs good. PIC + TRIUMF - problem with restart of tasks (finished first cycle of reprocessing). ASGC: good news is that late yesterday the listener on the old port was working. Reprocessing kicked off. Prestaging very slow - not enough drives? Additionally very large failure rate - 25-30% - GGUS ticket 49365. No news so far. Data distribution: FZK much better, high rate all yesterday, by this morning had cleared backlog. Hammercloud analysis activity overloaded gridftp doors - problems with SE! (2nd example of user analysis interfering.) BNL: big backlog. Current transfer efficiency ~100%, but rate poor - not eating into backlog. T2s: quite a few with backlog (~12 sites >24h - see logbook). Clouds to check: not enough FTS slots, slow transfers hogging slots, overload from analysis (gridftp works but slowly), I/O contention on disk servers. T1 issues: (RAL) CE blocked pilot submission since last night - ran out of inodes. Known problem - 2nd time observed during STEP. Last night reported files not being written to tape. Fixed? Hits merging not started - now understood. Starts patchily. IT and DE ran nothing last night. Do we want extra hits merging tasks as part of STEP? >1 high priority task difficult - stick with reprocessing. User analysis at 600K jobs. One T2 will partition time between different analysis types to see which analysis backend works best. Maybe other T2s should do likewise. Gang - reprocessing at ASGC: tape drive was stuck. Staging speed not fast enough. Once done speed should increase. Gareth - tape writing issues at RAL: still a problem. Files not migrated - being looked at. Partial solution in that new files are being written to tape but not old ones. Some policy issue. Have enough buffer so should not be a problem for > days. Oliver - FZK are you reading and writing? A: both.

  • CMS reports (Daniele) - tape write test at T0 ramping down (mid-week global run starting tomorrow). Confirm 1 - 1.5 GB/s for a few days. Trying to confirm from the monitoring side that rates are understood. Activity was confirmed. T1 tests: prestaging at most sites ok. Over last 24h PIC >100MB/s, CNAF >117MB/s, ASGC 145MB/s, RAL >190MB/s. FZK & IN2P3 in standby. FNAL some tape issues - write at 1.4GB/s, reading a bit below what had been seen earlier. Recovered over w/e but large backlog. Some T1-T1 tests for FNAL showing pnfs issues (many files/dir). At RAL CMS & ATLAS flat out in parallel. Plots on twiki. Processing: some in parallel at FNAL & PIC. Usual ops issues (permissions at ASGC & CNAF, both fixed). Transfer tests: T1-T1 AOD replication restarted yesterday pm. Input to FNAL not yet started (CMS problem). PIC export not started yet. RAL->IN2P3 works but PhEDEx download agents not reported properly. T1 export to other T1s ok. Analysis: last 24h >20K jobs submitted, 75% completed. 5-6M events processed, 90% job success rate, 33/39 sites >99% success. Twiki: plots on stage-in/removal cycle, T0 export. Remark to comment yesterday at RAL: confirm CMS activity constant over past few days. Graeme - question was "are you satisfied with T1 load? ATLAS is, so if CMS too then great!" Daniele - concurrent load at RAL ok.

  • ALICE - transfer situation: began tests with new FTD instance. Some problems with old files. Created new set of 600 files about 4GB each. Changed FTD to have single instance at CERN. Still debugging. Destinations: NDGF 49379, SARA 49380, RAL 49381: (no space available, permission denied etc.) Transfers to IN2P3, FZK, CNAF running ok. peak of 100MB/s with these 3 sites. Services: overload of WMS at CERN due to misbehaviour of T2 in Spain. Stopped production through that site to drain WMS. Reopen ticket on CREAM CE ce202: still failing re info provider. 49366 ticket. Same as before. Vera - we don't understand ticket. Have updated. Patricia - copy & paste of FTS error. Markus - need to involve FTS support.

  • LHCb reports - STEP09 transfer activity to all T1s started yesterday. Seems very smooth. Reconstruction jobs on the freshly transferred data will be sent around after the express stream gives the green light. Notified by all T1s that the disk cache was cleaned. Reprocessing will be sent around to check staging. >6K MC simulation jobs still running in parallel with user analysis. Major issues: FZK - many users complain and production activities report problems accessing data through dcap. Site or experiment issue? NL-T1: problem accessing raw files still being investigated. WMS@SARA: hanging & failing. Ron on the ticket. Fabio - was informed this morning that IN2P3 also had a dcap problem. Person in charge did some tests and discovered a limitation in the dcache client library. A new, unreleased version is recommended - with it the problem is not reproducible! GGUS ticket updated - same as at NL-T1? FZK: who did you contact? Alexei.

Sites / Services round table:

  • BNL - overnight discovered that 4 servers as write buffers got out of balance - basically 1/4 was being used , others idle. Under investigation. Result is BNL not keeping up. Nothing obvious but must investigate why dcache has preference to just 1 server. Engineers in and working - should be resolved in a few hours. Other DDM instances working at full performance.

  • IN2P3: yes, we have started prestaging yesterday for CMS. Introduced a new component scheduling stager requests before sending them to HPSS. Started yesterday for CMS - very happy with initial results, but must analyse further. The purpose of this new component is to reduce the number of tape mounts (a conceptual sketch follows after this round table). Will then activate it for ATLAS - being done right now. When ATLAS sends its avalanche we will probably be ready with this module; if not, staging will still be done even if not optimized. Batch activity: observed about 2K simultaneous ATLAS jobs, 1200 for CMS in the last 48h. Transfers: in and out seem also ok. Overall doing good but focus now on staging.

  • FZK - didn't change things since yesterday. Working on memory problem! Tape rates are improving - quite stable again. CMS will start recalls. Oliver - still waiting for go ahead.

  • TRIUMF - problems (timeouts) on transfers between NDGF and TRIUMF. The timeout was increased but transfers still time out.

  • ASGC: The second database listener for ATLAS was setup successfully yesterday afternoon at around 5:00, and then ATLAS reprocessing jobs could succeed at ASGC farm.

  • DB: reconfigured the Streams setup in order to use port 1521 for ASGC as we risked exceeding the recovery window again while waiting for the second listener. So, yes, I can confirm that Streams replication is working fine now. (Eva) Now all services should be available.
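
The idea behind the new IN2P3 stager-scheduling component described above can be illustrated with a minimal sketch (this is a conceptual illustration, not IN2P3's actual implementation; file and tape names are made up):

    # Group pending stage requests by the tape holding each file, so every tape is
    # mounted once and read through, instead of mounting tapes in request order.
    from collections import defaultdict

    def schedule_by_tape(requests):
        """requests: iterable of (file_path, tape_label) pairs (hypothetical input).
        Returns (tape_label, [files]) batches, largest batch first."""
        batches = defaultdict(list)
        for path, tape in requests:
            batches[tape].append(path)
        # Mount the tapes with the most pending files first to drain the queue fastest.
        return sorted(batches.items(), key=lambda item: len(item[1]), reverse=True)

    pending = [("/atlas/raw/f1", "T00123"), ("/cms/raw/g7", "T00456"),
               ("/atlas/raw/f2", "T00123"), ("/atlas/raw/f3", "T00123")]
    for tape, files in schedule_by_tape(pending):
        print("%s -> %d file(s) in one mount" % (tape, len(files)))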

AOB:

  • Kors - outlook for overlay with CMS? Daniele - for T0, running tape writing tests on any available days between mid-week global runs and CRUZET. Wrote 1 - 1.5GB/s until today but ramping down. Mid-week global run lasts 2 days, will ramp up tape writing then until the next global run. 1 period of overlap done, 1 to come. T1s: constantly running prestaging for all T1s since 2 Jun. Every day at 16:00 submit processing jobs and clean up old data (except FZK & IN2P3). Seeing nice overlap of CMS and ATLAS at RAL for example. This will continue throughout the exercise. Write output of processing jobs to tape. Custodial cosmics also written to tape. Also read some data from tape at T1 to transfer to T2. T1-T1 not using tapes. Kors - will stop Friday midnight. Oliver - using 5K slots on 6 multi-VO T1s. Simone - problem is space at T1s and T2s. Space getting full. For writing into tape at T0 switched to a recycle pool - this can continue. Oliver - ramp up Friday morning 00:00 and run through until Monday, ramp down Tuesday for the Wed-Thu mid-week global run. This would be 1 day longer than the previous window.

Wednesday

Attendance: local(Jamie, Miguel, Daniele, Graeme, Eva, Simone, Patricia, Kors, Roberto, MariaDZ, Olof);remote(Marc, Angela, Di, Brian, CNAF (Dal Pra - Ronchieri - Veronesi), Gareth, PIC, Vera Hansper/NDGF).

Experiments round table:

  • ATLAS - announce "step down" - decided after consulting with CMS - T1 load has been good over the course of the exercise - stop at 23:00 CEST on Friday. Data export load generator will stop, sites will have 48h to clear backlogs. Many T2s still have challenging backlogs. Hammercloud will also stop; all STEP09 production tasks will be aborted. Major issue last night with IN2P3: team + alarm ticket, resolved about 08:00 UTC - reasons understood. ATLAS i/f problem last night - ATLAS pilot factory ran out of disk space - stopped flow to DE + others. Restarted ~09:00 UTC. Looks like a global rate limitation in the ATLAS pool used for export: 4GB/s. Simone - anything above 3GB/s of total I/O traffic is "challenging". Amount of data pushed in/out is a bit more than nominal rate; ~300MB/s over, so not so much contingency! (Acquire data 14h then drain for 10h.) Have to rediscuss next week whether an increase of resources is foreseen. If at the level of disk servers, a factor 2 is foreseen by the evolution plan. If switches / network then??? Graeme - identified because BNL fixed all problems seen yesterday but could not catch up as fast as expected -> bottleneck. SARA MCDISK almost ran out of space (files deleted); RAL severe tape robot problems this am; Luca from CNAF reported tape robot problems. Operational problem preventing restart at PIC + TRIUMF now ok; RAL + NDGF finished first cycle and started on 2nd; SARA hit by pilot factory issues last night - running fine now. Operationally left behind by T1s having bigger problems. FZK: hit by pilot factory problem - didn't run so many jobs. Many jobs failing on stageout: 49402. IN2P3: pre-staging working well but when jobs finished dCache dead. ASGC: error rate up to 50% - no response? on ticket. Pre-stage a little better but can't get jobs to run properly through the system; only 35 running jobs at ASGC T1. Michael - rate limit observations: time to correct "BNL accumulated backlog due to site issues" - the 2 small problems observed were not the reason for the backlog. Discovered rate limitation yesterday. 2nd important aspect: evidence that the rate to BNL is limited in view of the ceiling whenever other T1 sites demand their share. Rate this morning - despite backlog - limited to a very small share. Needs to be understood. Whenever data is available to be sent to BNL must make sure it gets the nominal rate. Kors: have CASTOR people confirmed the rate limitation? Miguel: pool configured to a much lower rate than it is now delivering - don't expect to give >3GB/s unless we change h/w. Simone: how much can the pool deliver, and in case of congestion BNL should get a rate proportional to share - i.e. 25% of share. No intelligence in the system for that. Can configure # slots of active FTS transfers. In case another site has to catch up it needs enough active transfers. These are currently configured to reflect the max that a site can do - to be followed up (an illustrative slot-allocation sketch follows after this round table). Michael: how to follow up? Simone - involve FTS people in discussion. LSF course means follow-up meeting next week. Michael - from Wednesday on please as some are not available before. Angela: problem with stageout of files might be due to memory problems of the gridftp doors. Working on it - hope for improvement soon.

  • CMS reports - As of today starting MWGR - T0 tests as soon as global run done, i.e. Friday morning. T0 running will continue until early next week. T1 rates ok but CNAF: 50% of rate due to load on tape system. Processing - mostly smooth. ASGC - can't get all slots, not using pledged numbers. Some changes done at site to address. As first step to p-m extracting some CPU efficiency numbers and plots - not yet reported automatically to dashboard. Site status: see detailed report via above link. Kors - how much data could you prestage at ASGC? A: ~4-5TB. Have to check...

  • ALICE - Using 6 T1s for transfers. NDGF: problem was with destination path in FTD. After changing was ok. RAL: site admins needed to fix write permissions. Gavin + Akos support essential. Peak of 500MB/s this morning. Sites: ticket 49366 regarding ce at CERN.

  • LHCb reports - STEP09 activities proceeding. Reconstruction on the fly - express stream jobs completed successfully, pending data quality checks. Reprocessing started on 6/7 sites for LHCb; the 7th site is CNAF - blocked by a tape problem: all stage attempts fail with timeout. Investigations on the dcap issue: m/w issue with the dcap plugin for root. In order to proceed at dCache sites DIRAC downloads data to the WNs. Seems to be running well. STEP09 point: also 5K MC jobs running in parallel. Main issues: staging data at CNAF; issue with LFC due to the Persistency i/f which needs to be optimised. Workaround is to use local instances at T1s. Issue reported by Joel about transfers Lyon-CNAF: thought to be due to STORM but in fact the gridftp servers at the source (Lyon). Simone: dcap and root - which versions? Roberto - seems to be latest root + latest dcap. Simone - ATLAS doesn't see this problem. Graeme - no massive failures for analysis jobs that use dcap. Need to check versions...
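
The FTS slot discussion in the ATLAS report above can be made concrete with a small sketch: today each CERN->T1 channel is configured with the maximum number of concurrent files the site can take, with no logic to scale a site's slots to its MoU share when the export pool saturates. The shares and the total below are illustrative numbers, not the real channel settings:

    # Hypothetical share-proportional allocation of concurrent FTS file slots.
    mou_share = {"BNL": 0.25, "FZK": 0.10, "IN2P3": 0.13, "RAL": 0.10, "CNAF": 0.07}
    total_active_files = 200   # assumed cap the CERN export pool can sustain

    def proportional_slots(shares, total):
        """Scale the total number of concurrent files by each site's relative share."""
        norm = sum(shares.values())
        return {site: max(1, int(round(total * share / norm)))
                for site, share in shares.items()}

    for site, slots in sorted(proportional_slots(mou_share, total_active_files).items()):
        print("%-6s -> %3d concurrent files" % (site, slots))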

Sites / Services round table:

  • FZK: Update on tape access. Memory problems mentioned above. IP conflict for one gridftp door. Transfers seem to be better - saw some channels to FZK hosted in external T1s, e.g. RAL, ASGC. In ASGC 30 files / channel. For RAL 8 files allowed. Brian - will look into this.

  • IN2P3: You may have been informed that a problem with dCache occurred this morning at IN2P3-CC. Since 3:00 AM this morning all gridFTP transfers were aborting. This was due to a failure of the component in charge of routing requests to the gridFTP servers. The corresponding service was restarted at 10:30 AM and dCache is fully available since 11:00 AM. (Probably due to some load - many pending requests.) Unable to announce downtime as there was a problem with GOCDB at the same time - cert problem. SIR asap.

  • CNAF: update on tape library issue:
dear all,
we have been just given 1 (I mean one!!!) cleaning tape and we are going to  restart the 5 tape drives currently out of service.
Additional tapes should be delivered soon.

luca

Luca dell'Agnello wrote:
> dear all,
>
> Tape storage activity for Castor at CNAF could run out of available 
> tape drives (both T10000A or T10000B models) quite soon.
>
> Problem description:
> tape drive functionality depends on periodic head "cleaning" performed 
> by cleaning cartridges, having a maximum limit of 50 operations.
> Because of an unsolved bug in the firmware of the SUN T10000 drives, 
> these cleaning tapes are mounted far too frequently, thus exhausting 
> the usage limits. Once a drive needs a clean operation it refuses to 
> perform any other normal work operation.
>
> New cleaning cartridges are expected to be available only tomorrow, 
> despite having been ordered quite many days ago and despite they 
> should have already been delivered.
>
> In this (quite frustrating) scenario, data migration to tape is not 
> guaranteed (as well as read access of course).
>
> More news tomorrow
>
> luca

Small update - site scheduled at risk up to 14:00, tape library problem concerns castor tapes only. TSM doesn't use these libraries. Yesterday at 16:00 observed first drive to go down, 1st of 11 drives that we have. This morning had 5 drives in down status! Happening during heavy tape activity. Have quite long queue of tapes to be read/written. This afternoon had a cleaning cartridge which permits us to go on - now have all tape servers fully functional. Expecting to get a new cleaning cartridge quite soon so no more problems of this kind.

  • RAL - tape problem - CMS summary v good. Had problem with one handbot which failed about 04:00 - fixed ~10:30. Series of meetings next days which will mean no phonein unless particular issues.

  • BNL - had a stagein record yesterday. 24K stage requests in short period of time, completing at a rate of 6000 per hour - much more than ever before!

  • DB - apply process for LHCb data to CNAF aborted overnight - ORA 600 - restarted apply this morning. Will investigate with CNAF. Propagation problem with LFC to GridKA. Updated service request.

AOB:

Thursday

Attendance: local(Oliver, Jamie, AndreasU, Eva, Harry, Gang, Graeme, Dirk, Wayne, Roberto, Simone, Daniele);remote(Michael, Leif, Di, Jos).

Experiments round table:

  • ATLAS (Graeme) - favourite topic: reprocessing. Good news RAL, PIC, CNAF, TRIUMF, BNL, NDGF (all in the log! https://twiki.cern.ch/twiki/bin/view/Atlas/Step09Logbook#Thursday_11_June) reprocessing metrics passed - 5x data taking rate. IN2P3: after a late start and dCache problems now running very well - 10K reprocessing jobs in the last 24h. Extremely impressive! FZK: a bit healthier, 3K jobs in the last 12h; post-STEP we need a post-mortem from all T1s. SARA: 1500 reprocessing jobs in the last 24h - still some way to go to tune DMF. ASGC - still in crisis. Logbook minutes - 4 ATLAS jobs, GGUS 49406 sent. Two issues: 1) real critical ASGC issue: at the same time as the 4 T1 jobs, ~1000 T2 jobs. They share the same batch system - not protecting the T1 properly for ATLAS. Very serious issue! Some movement - communication with ASGC. Another ATLAS pilot factory (TRIUMF & ASGC) died overnight. But reprocessing way way way behind. Serious problem. Other T1 issues - RAL tape robot. Data distribution: stood down load generators on Thursday as last week. ASGC has problems (GGUS 49423) getting data from CERN & 4 T1s. CNAF: can't get data from TRIUMF or ASGC. Other T1s look healthy. Logbook: snapshot of T2s. Still behind at many sites: DE, SARA T2s. To pass STEP metrics must clear backlog by 21:00 UTC on Sunday. Because ATLAS shares are set using datasets, inhomogeneous data sets can cause surprising variation - some T2s in backlog got a "tough nut". Finally: production - did an awful lot! Analysis challenge: 700K successful jobs. Gang - Geneva-Frankfurt fibre cut, routing shifted via Chicago but that link is 2.5Gbps. Reprocessing jobs: the pilot server in Canada is down. 5-10' ago saw 200 reprocessing jobs. High failure rates on these jobs due to "Big ID" problem. Di - pilot factory did not crash, don't understand why no jobs submitted to ASGC. Oliver: CMS working with ASGC for the last days as not enough jobs running (below fair share). Always saw 1000-1500 jobs. Graeme - this was simulation production running through the T2.

  • CMS reports (Daniele) - OPN cut impact on CMS, mainly links to/from ASGC - was best importing from other T1s at that moment! Prestaging smooth. FNAL had up to 379MB/s prestaging rate! (Full details in link.) 2nd & last STEP OPS meeting for CMS this pm. Will decide ramp-down. Post-mortem activities started...

  • ALICE (Patricia. Report submitted before the meeting) - Issue reported to the FTS experts: in order to ease the transfer tests that ALICE is currently performing for STEP09, they are using the option -o of glite-transfer-submit in order to overwrite the already existing file and not to have to previously delete it. However the error message they get is still: "File exists". Implementation procedure followed up with the FTS experts. FTS transfer exercise currently running at 175MB/s (report corresponding to 11:30 AM). Today ALICE has observed the same issues with RAL and SARA reported a few days ago. Tickets 49380 and 49381 reopened. Regarding the issue reported 3 days ago concerning the CREAM-CE ce202.cern.ch, still waiting for an answer from the experts through ticket 49366.

  • LHCb reports (Roberto) - hit 2nd major issue with m/w (after the dcache library still not released for prod..): the Persistency version of the LFC i/f is not adequate for production. Cannot continue production - kill all jobs and restart using sqlite. LHCb will soon request all sites to wipe the disk cache in front of the MSS and restart.. (JDS - why now?) Seen a few weeks ago; probably too late for the release procedure. Load-related problem not previously observed. Top 3 issues: 1) all jobs failed to be staged at NL-T1; 2) jobs at CNAF failing either contacting LFC or opening the file in CASTOR; 3) IN2P3 transfers failing last night because of a full disk.

Sites / Services round table:

  • FZK - tape storage: CMS started recalls, rates ~50MB/s input which is only half of the promised throughput because one of the stager hosts is faulty. Being investigated. Failing stageout for ATLAS: one gridftp 'door' (dcache) had to be restarted - out of memory. One library down for 2 hours. Fixed. Currently some drives used for writing are not accessible because of a SAN zoning problem (we've hit a FW bug). Will try a reset on 11/6 morning. Another set of staging hosts is prepared to be added to the pool of four (2 ATLAS, 2 CMS). Dirk - Jeff mentioned a similar problem and a possible solution or work-around.

  • BNL - looking into srm server issue. Started to run at almost full CPU shortly after 07:00. # incoming requests ramped up significantly. Primarily due to analysis jobs on farm writing output (copying output files from WN) to SE. lcg_cp used to write into space token area hence through SRM. Not viable method! Need version of dccp client that can do this. Developers say dCache 1.9.2 but becoming an immediate and urgent issue. Load is coming down - almost at normal levels. Would expect transfer errors to go away. But may happen again and will hit all other T1s running dCache as well. Networking: big fibre cut in Frankfurt area affecting also US. For BNL bandwidth reduced from 10 to 4.2 Gbps. Negotiated with USLHCNET to rebalance bandwidth - now at 6Gbps. Also serve TRIUMF through this link .

  • CERN: LFC issue from LHCb as discussed; ALICE ce202: now fixed, bug in CREAM CE that doesn't cleanup sandbox files so one of server partitions full.

INCIDENT:
Our main links provider is suffering a fibre cut between Frankfurt and
Geneva: the following LHCOPN links from CERN are down:
TW-ASGC primary
IT-INFN-CNAF
DE-KIT
NDGF


IMPACT:
The affected sites are using their backup connectivity. The available bandwidth may be reduced.

There is no estimated time for repair at the moment.


Edoardo Martelli

Comments from FZK & NDGF:

On Wed, 10 Jun 2009, Andreas Heiss wrote:

> Hi all,
> CERN <-> FZK traffic is re-routed over SARA.
> We see no bottleneck so far.
>
> http://grid.fzk.de/monitoring/wan.php#cern
> http://grid.fzk.de/monitoring/wan.php#sara

Same automatically happened for NDGF:

http://www.nordu.net/stat-q/load-map/ndgf,2009-06-10,traffic,busy

With clickable links for the load plots:

http://www.nordu.net/stat-q/plot-all/ndgf-cern,2009-06-10,raw,traffic-kbit
http://www.nordu.net/stat-q/plot-all/ndgf-sara,2009-06-10,raw,traffic-kbit

~2Gbit peaks on 10GE links don't seem taxing on our side, as long as sara can keep up with all the extra traffic.

/Mattias Wadenstein

  • IN2P3 (added after meeting)
Hello,

as no one from my site could attend today's daily meeting (because a collision with an internal monthly meeting), I would like to give you some information about our site activities today.

Yesterday afternoon we had some problems as a consequence of a configuration modification of HPSS which made about one third of the tape drives unusable for several hours. In addition to the "natural" load imposed on HPSS by the staging and migration activities on the tier-1 by both ATLAS and CMS, we were hit by a significant load put by CMS analysis jobs running in the tier-2 co-located with the tier-1. This is due to the fact that both the CMS tier-1 and tier-2 at FR-CCIN2P3 still share the same instance and end-point of the storage element (dCache/SRM).

We knew beforehand this was a problem but had not the time to completely modify the configuration before the begining of STEP. So late in the night we decided to suspend the execution of CMS analysis jobs allowing the ones already in execution to finish. The MC jobs were not affected by this measure.
The backlog on the staging requests were absorbed during the night and the system worked well (at least what we can see) during the whole day.
Besides this, the space token LHCb_MC_DST was exhausted (7 TB) during the night and this made the import of LHCb data transfer fail. It was increased early in the morning to 12 TB. We have to check if this is enough.
Finally, average number of jobs in execution during the day (tier-1 & tier-2 summed):
  • ALICE: 623
  • ATLAS: 1967
  • CMS: 1481
  • LHCb: 468
Cheers,
________________________________________________________________________
Fabio Hernandez
Deputy Director/Directeur Adjoint
IN2P3/CNRS Computing Centre - Lyon (FRANCE)           http://cc.in2p3.fr
Tel. +33 4.78.93.08.80 | Mob. +33 6.84.27.14.74 | e-mail: fabio@in2p3.fr

  • ASGC (added after meeting):
Hi Jamie and Harry,


hereafter briefing for the actions/issue of ASGC T1 for last four days:

- the problematic LTO4 drive have been fixed this Wed., after re-calibrating from console of the library. all new procured tape drives are full functional now. majority of the activity refer to read access and from cms, try recalling data from cmsCSAtp pool.

- around 500+ job pending in queue since yesterday, both for atlas and cms. the problem is clear earlier as two of the misconfigured wn add to the pool after resetting the client config yesterday evening. should now all fixed and cluster showing 67% usage.

- 25% reduction observed in atlas production transfer and also less than 40% MC archive found in CMS, as already brief to wlcg-operation. 
congestion of backup route through CHI showing high packet loss rate earlier this afternoon, specific for inbound activity.

- still we need to sort it out urgently for the t1/2 fair share issue as atlas submission to the T2 have four times the number than at T1 CEs and we merge T1/2 using the same batch pool (after relocating to IDC). 
splitting the group config in scheduler and also having different group mapping in different CEs from T1/2 able to balancing the usage.

- draft planning for the facility relocation 11 days later. the IDC contract end at Jun 23, and we'll have one day downtime for computing/storage facility and two days for tape library. as re-installation of IBM TS3500 linear tape system need calibrating the horizontal between all frame sets.


BR,
J

--
Jason Shih
ASGC/OPS
Tel: +886-2-2789-8374
Fax: +886-2-2783-7653

AOB: (MariaDZ) This week's discussion on GGUS ticket 44585, which was waiting for ATLAS experts' reaction for several months, revealed that some ATLAS experts don't know who is behind the relevant GGUS Support Unit. This can be found on page https://gus.fzk.de/pages/resp_unit_info.php . For example, here are the members of VO Support (Atlas) answering GGUS tickets. All VO Support (any LHC experiment) members please review your tickets' backlog from https://gus.fzk.de/pages/metrics/download_escalation_reports_vo.php .

Friday

Attendance: local(Jamie, Miguel, Graeme, Laurence, Harry, Nick, Daniele,Gang, Kors, Eva, Jean-Philippe, Julia, Alessandro, Olof, Dirk, Andrea, Simone);remote(Di, Marc, Gonzalo, Vera, Michel(?)).

Experiments round table:

  • ATLAS - Canadian pilot factory? Thanks to Di we know it was not dead. At TRIUMF they hit the 32k files in an ext3 directory bug, similar to that seen at RAL on Monday/Tuesday (a workaround sketch follows this item). At ASGC the issue is that gram job submission fails with this error: "Job failed, no reason given by GRAM server. Code 2 Subcode 0". I have added this to the ASGC GGUS ticket #49406. Many thanks to Di at TRIUMF for this information. (As in logbook.) Reprocessing: most sites home and dry; Lyon doing extremely well; FZK managed to start a lot of jobs, which died because of a gridftp door problem - being tracked by ATLAS-DE people and FZK; SARA remains slow, with a lack of ready jobs meaning many CPU slots get lost to simulation; ASGC are still well behind - more in the logbook: https://twiki.cern.ch/twiki/bin/view/Atlas/Step09Logbook#Friday_12_June. Problem with the subscription engine: some progress with FR, NL, AP and CA clouds (some subscriptions..) but still not completely resolved. Probably not a crisis until this evening - sites with backlog have 24h to chew on the backlog. If we can't fix it in time we will have to adjust the metrics (sites must get the data that is being sent - if we can't send data maybe extend the window or end the exercise). Analysis challenge: may well hit 800K analysis jobs over the challenge! Step Down:
    • Analysis jobs will now be killed tonight
    • Beer was drunk and pizza was consumed at lunchtime
    • Thanks to the sites and facility people who have contributed to STEP09 for ATLAS
Concerning end of file ticket - local time trying to fix. ASGC currently running 450 ATLAS and 1000 CMS jobs.
Roberto - 800K analysis jobs? Real users or? Graeme - a lot of additional analysis activity. Early messages from users are that STEP'09 has not been a perturbation. The 800K are an increment over the 'background' of real user jobs.
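
The directory limit TRIUMF hit is commonly worked around by fanning files out over hashed subdirectories; a minimal sketch follows (the paths and the two-character fan-out are illustrative choices, not what the pilot factory actually does):

    # Spread output files over hashed subdirectories so no single ext3 directory
    # approaches the ~32k entry limit mentioned in the report above.
    import hashlib
    import os

    def hashed_path(base_dir, filename, fanout=2):
        """Return base_dir/<xx>/filename, with <xx> taken from an md5 of the name."""
        bucket = hashlib.md5(filename.encode("utf-8")).hexdigest()[:fanout]
        target_dir = os.path.join(base_dir, bucket)
        os.makedirs(target_dir, exist_ok=True)   # 256 buckets instead of one flat dir
        return os.path.join(target_dir, filename)

    # e.g. pilot outputs land in <base>/3f/job_123456.out instead of one huge directory
    print(hashed_path("/tmp/pilots_demo", "job_123456.out"))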

  • CMS reports - Trying to understand failures - several hundred jobs failing at IN2P3. IN2P3 has suspended analysis jobs at the T2 because of the load on HPSS. CMS is satisfied with the outcome and the lessons learned from all tests, but the CMS STEP team would like to run the following tests until completion: https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports#12_Jun_2009_Friday: T1-T1 synchronisation (end Sat pm), analysis test, T0 tape tests at CERN, comparison of CPU efficiency with/without prestaging. "Grand summary": ramping down but some tests will continue to completion and some specific tests in addition.

  • ALICE (Patricia, reported before the meeting) - The glite-transfer-submit issue reported yesterday (ALICE wants to use the -o option of this command to ensure files are overwritten at the destination rather than deleting the already existing ones beforehand; however, while using this option the error message obtained was "File already exists") has been followed up with Akos, who gave the advice to use a fully qualified SURL so that FTS uses the SRMv2.2 endpoint of the SE (SRMv1 has no option for 'overwrite'). Once this advice was applied in production the -o option worked perfectly and it has already been implemented in the ALICE transfer system (a minimal submission sketch follows after this round table). T0-T1 transfers are working fine with CNAF, SARA (ticket reopened yesterday already solved), NDGF, CCIN2P3 and FZK. The ticket reopened yesterday concerning RAL (49381) is being followed with the site admins. The last Gridview update was showing for ALICE a transfer speed of around 400MB/s this morning at 10:00AM.

  • LHCb reports - STEP09 will continue during this weekend w/o touching the system. Staging and reprocessing of data will be done w/o the remote ConditionDB access (that currently would require a major re-engineering of the LHCb application), but via the available SQLite CondDB slices. With this same mechanism LHCb currently runs FULL stream reconstruction after the DataQuality group has given the green light on top of the results from the EXPRESS stream reconstruction. Sent a request to all sites (CNAF and RAL already did it) to clean again the disk-cache area in front of the MSS. Please note that we do not require them to also clean the tape-resident files.
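
The SURL fix described in the ALICE report above amounts to submitting against the SRMv2.2 endpoint explicitly; a minimal sketch, with placeholder hostnames, paths and FTS service URL (not real ALICE endpoints):

    # Submit an FTS transfer with overwrite enabled (-o), using fully qualified
    # SRMv2.2 SURLs ("/srm/managerv2?SFN=...") so the overwrite option is honoured.
    import subprocess

    fts_service = "https://fts.example.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer"
    src = "srm://srm.example-t0.ch:8443/srm/managerv2?SFN=/castor/example.ch/alice/step09/file_0001"
    dst = "srm://srm.example-t1.org:8443/srm/managerv2?SFN=/pnfs/example-t1.org/alice/step09/file_0001"

    # -s points the client at the FTS service; -o requests overwrite of an existing
    # destination file instead of failing with "File exists" (SRMv2.2 only).
    job_id = subprocess.check_output(
        ["glite-transfer-submit", "-s", fts_service, "-o", src, dst]).decode().strip()
    print("submitted FTS job", job_id)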

Sites / Services round table:

  • GRIF: discussed with Simone - observed with ATLAS lcg_cp timeouts at all Fr sites (Hammercloud tests). Suspect not site nor load - maybe some race condition? Similar problem reported also with STORM during pre-GDB. Investigating this. Hope for more news next week.

  • DB: replication CERN-FZK for LFC data. The diagnostic patch from Oracle is no longer valid for Oracle 10.2.0.4. Oracle will provide a new diag patch but meanwhile FZK will be split off into archivelog mode: data will reach FZK after ~30', when the archive log switch happens.

  • FZK: SAN switch reset after contact was lost to 8 LTO3 drives. Datasets from LHCb were unavailable, some data from CMS too. Restarted movers that stopped because of this outage. Restarted a mover for ATLAS which was stuck for more than 5 hours - the trigger went unnoticed because of intense other activity. Rates picked up to >100MB/s after the restart. A network problem on a CMS stager node limits throughput; possibly the node is copying data to itself because the stager runs on the same machine as some other pools. Config error on the stager pools: files staged were copied to the tape write pools. Exact life of a file to be determined. Also a CMS pool was recalling data but then never showed them when they arrived - a restart fixed this. 3 of 4 gftp proxies (doors) locked up, cause unknown, restarted. We now have 2 more stager nodes ready that contain no other pools. Experiments, please keep up the load. GridKa will fight to keep up with the traffic and fix problems.

AOB:

-- JamieShiers - 04 Jun 2009

Topic attachments:
  • FZK_update_on_tape_access.pdf (55.6 K, 2009-06-10, JamieShiers)
  • in2p3-hpss-june.pdf (76.1 K, 2009-06-09, JamieShiers) - Comments from Fabio on IN2P3 / HPSS (Monday)