Week of 090601

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability: ALICE, ATLAS, CMS, LHCb
SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board, M/W PPS Coordination Work Log, WLCG Baseline Versions, Weekly joint operations meeting minutes

Additional Material:

STEP09: ATLAS, CMS


Monday:

No meeting today - holiday at CERN & elsewhere

  • ATLAS (Simone) - In the morning there was a problem with the ATLR Oracle database, which hosts several applications such as DDM, Panda and the Dashboards. The affected services could not contact the Oracle backend for several hours; as a rough estimate, the problem lasted from 06:30 to 11:15 CEST. Services have been fully functional since then, thanks to the intervention of the experts, with the exception of a 20-minute glitch of the ATLAS central catalogues from 13:40 to 14:00, possibly a side effect of the earlier problem. STEP09 ramp-up activities are back to normal; the transfers now run at 50% of nominal rate as foreseen. Tomorrow the incident will be discussed with the experts and more information will be provided.

Tuesday:

Attendance: local(Harry, Daniele, Dirk, Nick, Andrea, Patricia, MariaG, MariaDZ, Roberto, Simone, Gang, Jason, Steve, Sophie, Miguel, Markus, Olof, Julia, Jean-Philippe, Graeme, Jamie);remote(Brian, Michael, Jeremy, Gonzalo, Fabio, Angela, Kors, John, Jeff).

Experiments round table:

  • ATLAS - Graeme: Started STEP'09 this morning. 1) There is a logbook page of the main activities and observations at https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/ . 2) The load generators for data movement have now been ramped up to 100%. 3) User analysis has started, with 100000 jobs being sent via the WMS servers. 4) Pseudo-reprocessing jobs reading from tape are now running on 4 clouds and others will be defined this evening. 5) Simone reported on the DB access problems of Monday (see the ATLAS report above and the CERN report below): started around 08:30, fixed at 10:00, but with knock-on effects for some hours. Mail was sent to phydb.support@cern.ch - was this correct? MariaG reported that this was OK, but the DB team had already been alerted by their own monitoring (they have an informal best-efforts phone rotation). 6) There are various transfer problems: failures from IN2P3 to ASGC but not vice versa, TRIUMF to CNAF is slow, TRIUMF to 2 Canadian T2s is slow, failures to Coimbra and Lisbon, and SARA to T2s is slow. 7) ATLAS had not been computing checksums, but some sites have now enabled this so it has been turned on. 8) The BNL DQ2 VO-box had been taking 50% CPU. This was found to be due to a version of GFAL brought in with a DQ2 upgrade that does not work properly; the GFAL component has now been downgraded.

  • CMS reports - 1) The follow-up plans after the broken CERN tape of 28 May are satisfactory to CMS. 2) STEP'09 starts today, focussing on site readiness at the T1s. The plan is to trigger major activities at 16.00 CEST daily, a good time for sites worldwide, so today prestaging will start and tomorrow at 16.00 the reprocessing of what was prestaged in the previous 24 hours. 3) A CRUZET (cosmics at zero tesla) run is taking place this week at CERN with data export to the T1s, so there are no major Tier 0 activities in order not to perturb it. 4) Analysis tests at the T2s are being ramped up but should not interfere with other VOs.

  • ALICE - Began running in STEP'09 mode yesterday with over 15000 concurrent jobs. Will begin FTS transfers next week and decide in the meantime how to use FZK. Expecting to upgrade AliEn soon. WMS issues at CERN due to bad CE shapes at KFKI and MEPHI. Details of the STEP'09 plans for ALICE: GDB presentation

  • LHCb reports - STEP'09 will start next week for them with preparations this week to clear local tape disk caches. Had CRL problems on IN2P3 worker nodes (see below) and an NFS lock on NL-T1 worker nodes (see below). In future LHCb SW will be copied locally to NL-T1 worker nodes.

Sites / Services round table:

ASGC (by email from Jason on 30 May):

- Due to a bandwidth limitation on the prestage disk pool, we are adding another disk server to share the gridftp load with the single one added two weeks ago. Performance is expected to go up to 210+ MB/s, as requested by the ATLAS pre-stage performance validation.
- SCRATCHDISK has been enabled on the T1 MSS; the space token has been confirmed with ATLAS central operations.
- 60 TB have been added to the FTT DPM; the updated space tokens are given below:
74d80644-e027-4319-8668-91ee14ca85bc  ATLASDATADISK  dpm_pool  atlas                  58.59T  58.33T  Inf  REPLICA ONLINE
b9d9a84d-a144-4b61-9b73-405edf14471e  ATLASMCDISK    dpm_pool  atlas/Role=production  14.65T  14.23T  Inf  REPLICA ONLINE
- 5 new LTO4 drives came online this Monday and should be fully functional by Tuesday of this week. The tape capacity online for the time being is around 0.8 PB (pending confirmation of the recovery of the data centre, after which another tape system at the ASGC data centre will be enabled, providing 1.4 PB with the old base frame). The performance observed from the metrics is around 66.02 MB/s including drive overhead, dominated mainly by CMS. Files per mount is around 240 according to the last 4 hours of statistics, which is also good. The average file size is around 1.7 GB for ATLAS and around 3.4 GB for CMS, which also improves the performance of restaging.
- Installation of a new tape library with two S54 high-density frames, in addition to the 3 existing frame sets installed earlier in 2009; the inventory can provide 1.3 PB with all-LTO4 cartridges.
- T1 MSS metrics have been uploaded and confirmed with Alberto (metrics generated by the tapelog reporter and by the XML generation toolkit provided by Tim).

CERN: 1) Report on the ATLR (DB) problem of yesterday (report): the switch connecting several production databases to the GPN network failed on Monday morning around 08:30, causing unavailability of the systems ATLR (ATLAS offline DB), COMPR (COMPASS), ATLDSC (ATLAS downstream capture) and LHCBDSC (LHCb downstream capture). The problem was fixed and the databases were available again around 10:00. Some connectivity instabilities were also possible in the case of the ATONR database (ATLAS online); issues that might have caused such instabilities were addressed around 12:30. A post-mortem from the DB services point of view will be prepared. 2) The migration of the integration databases scheduled for today has been postponed by 2 weeks after an issue was found. 3) There will be (transparent) patches to castorcms and castorlhcb tomorrow and an urgent upgrade of xrootd is being planned. 4) The Linux kernel upgrade is now frozen: it was to be deployed on 15 June but will be delayed until 22 June. 5) The ATLAS SRM problem of last Thursday has not re-occurred and Dirk reported that an individual user was probably the cause.

IN2P3: The HPSS migration has started, with the metadata being moved today. They modified their mechanism for downloading CRLs to the worker nodes last week and it did not finish properly, hence the LHCb problems.

NL-T1: 1) Had an NFS problem affecting LHCb software access on worker nodes after a user somehow caused NFS locks (which should not happen). They had to reboot the lock server plus the affected worker nodes. 2) They were failing ATLAS SAM tests over the weekend because the dCache disk pool 'cost' function was not load balancing correctly across full disk pools.

FZK: Still suffering from tape problems - looking for bottlenecks.

AOB:

Wednesday

Attendance: local(Harry, Daniele, Nick, Andrea, Patricia, MariaG, MariaDZ, Simone, Gang, Jason, Sophie, Olof, Julia, Jean-Phillipe, Graeme, Jamie, Oliver, Ignacio, Diano, Antonio);remote(Reda, Brian, Michael, Jeremy, Gonzalo, Kors, Joel, Andrea, NDGF).

Experiments round table:

  • ATLAS - 1) Simulation at the sites has now started. 2) Data distribution is going pretty well. Only 5 raw streams are being produced, so not all Tier 1s get data each day - TRIUMF and NDGF were the last recipients. 3) The DDM server at CERN ran out of DB connections but has since recovered. 4) CE problem at RAL where they stopped accepting pilots, perhaps due to a file limit on a globus directory. Brian reported RAL have cleared one of their two ATLAS CEs and are using the second for debugging this problem. 5) At ASGC there is serious interference between many stalled analysis jobs and the production jobs - being looked into. 6) Analysis via Panda and WMS is going well, though they have to check why some sites are empty of pilot jobs. Larger sites can have over 1000 analysis jobs at a time. Analysis at CERN will be started tomorrow. 7) Reprocessing has started at RAL/TRIUMF/FZK and CNAF, but these tasks have now been aborted to get more work into the system. 8) Some backlog writing to tape at NDGF, where they got two datasets; they are writing to tape at 60 MB/sec but will probably be excluded from the next distribution. 9) They would like to be sure the data committed to tape at CERN is actually going to tape (the tapes are being recycled).

  • CMS reports - 1) A Remedy ticket was sent to CERN after disk-only CASTOR files became inaccessible; this was due to a disk server that crashed at about 01:30. 2) CRUZET is running at CERN, so no STEP'09 activities at Tier 0. 3) Prestaging tests at the Tier 1s started yesterday at 16.00 and are giving good results; the tests at FNAL have now completed successfully. 4) Workflows for processing at the Tier 1s are prepared and will be triggered at 16.00.

  • ALICE - 1) A new MC cycle is being prepared. 2) RAL-LCG2 CREAM-CE stopped working. Ticket was submitted and problem solved by restarting tomcat server. 3) A new Tier2 site has been added for ALICE, CESGA in Spain.

  • LHCb reports - Had problems this morning ascribed to the castorlhcb upgrade, but in fact that was done from 14.00 to 15.00, so they should check again.

Sites / Services round table:

CNAF: Had an LSF system error last Sunday.

BNL: 1) Had a site services problem affecting transfer performance due to a GFAL library compatibility issue. After fixing this they observed up to 20000 completed SRM transfers per hour, a very good rate. 2) They achieved sustained transfer rates of 700 MB/sec exporting to Tier 2s and 500 MB/sec incoming. 3) Data migration rates of 800 GB/hour from HPSS to tape were observed at the same time as running merge jobs; 16 drives were used for staging and 4 were reserved for migration.

RAL: 1) Achieving 5 Gbit/sec to their Tier 2s. 2) Have a couple of low-level disk server failures. 3) Having transfer troubles from the StoRM SE at QMC to their CASTOR server; will check with CNAF.

NDGF: Have had to increase FTS timeouts.

ASGC: - ATLAS user analysis jobs filled up the queue with 1.4k jobs and degraded the performance of the pilot and production jobs. Part of the user analysis jobs have been forcibly cleaned up, and reservations will later be applied for the different types of analysis jobs, including a reasonable fair share for pilot jobs.

- Rearrangement of the disk servers and pools to fit the STEP requirements. The changes mainly concern the ATLAS-specific service classes and disk cache only. This also includes the new space token added and the two new service classes created two weeks ago, one of which is specific to pre-staging validation.

- New tape drives installed: 5 LTO4, with one showing a hardware error; the other one will be deployed soon next week. Total online tape drives: 7 (with one offline at the moment due to the hardware issue):

* Tape drive hardware issue: the tape can be unloaded normally; already escalated to local IBM.

- Dedication of tape drives for data migration: this was set up earlier to improve the migration of CMS data, with sufficient streams and dedicated tape drives (4 LTO4 drives dedicated to data writing).

- Preparing the standard tape operations manual with assistance from Vlado. We will try tweaking the provided script, which is bound to the CERN local setup based on CDB. We hope to provide an operations manual for the on-site duty staff as soon as possible.

- SRM upgrade from 2.7-17 to 2.7-18 this week, to address the big ID issue. The date needs to be confirmed later this week, taking into account the status at the other T1s.

CERN: Database - RAL have taken over, with our thanks, DB synchronisation to ASGC.

CERN: PPS - Nothing happening in the immediate future.

AOB:

Thursday

Attendance: local(Maarten, Gavin, Andreas U, Ignacio, Harry, Sophie, Graeme, MariaDZ, MariaG, Daniele, Gang, Edoardo M, Steve, Konstantin, Nick, Simone, Jean-Philippe, Jamie, Diana);remote(Vera Hansper + Thomas Bellman - NDGF, P. Veronesi - CNAF, Angela, Reda, Jeremy, Dave, Gareth, Michael, Gonzalo, IN2P3).

Experiments round table:

  • ATLAS - 1) Data distribution to NDGF is slow. Gavin reported they have doubled the FTS concurrent transfer slots from CERN to NDGF from 20 to 40. NDGF reported they have also changed some parameters so that long-running transfers do not break. 2) T1-T1 transfers were scaled down this morning, which had the effect of dropping the success rate, probably because it showed up the tail of problematic transfers. 3) For the T1-T2 transfers, a snapshot taken at midday showed that 35 out of 62 T2s had received at least 90% of their data. The T2s not working well had storage problems or, the more interesting case, had a large analysis load that was slowing down their WAN transfers and building up a backlog. 4) Reprocessing has started. The current inspection shows: ASGC - 1-hour jobs still there after 6 hours; PIC - OK with 557 jobs; SARA - OK with more than 2000 jobs; NDGF - only 177 jobs and they don't seem to be getting prestage requests; CNAF - OK with 900 jobs; FZK - only 16 jobs but a very unhealthy tape system; RAL - very good at 3000 jobs; TRIUMF - none, but doing a heavy high-priority validation run; BNL - also running validation; SLAC - OK but slower since input and output data go via BNL. 5) Note that the reprocessing logs as well as the AODs are being written to tape, but the success metric is the number of files, not the number of GB. Individual site problems are now being chased up.

  • CMS reports - 1) Prestaging continues at the T1s (except FZK and IN2P3) with good results. 2) Reprocessing was started yesterday at 16.00 on all 7 Tier 1s. The executable was misconfigured at RAL and PIC, and the RAL jobs had to be killed and resubmitted. 3) Special PhEDEx setup at PIC to avoid WAN transfers triggering scattered tape mounts; to be tested on the Spanish T1-T2 transfers over the next few days.

  • ALICE (Patricia absent due to the ALICE weekly TF Meeting) - Creation of the new ALICE MC cycle: done. Production came back last night, currently running around 8000 jobs. MonALISA monitoring had problems this morning; the service was temporarily down but the experts are taking care of it. The CESGA site (Santiago de Compostela, Spain) entered production yesterday evening. Currently jobs are aborting at CC-IN2P3: the WMS used for this site is the one at GRIF, which is not able to find the resources at CC-IN2P3. The problem has been reported to the site admins and the service responsible and we are looking into it (GGUS ticket #49250).

  • LHCb reports - 1) Mapping problem at PIC - quickly solved. 2) Discovered a CERN CASTOR problem of inaccessible files that was thought to be due to the last minor upgrade but in fact had been there since the recent major upgrade; it involves data loss after garbage collection was accidentally enabled. Attempts are being made to recover as much as possible. See the post-mortem at https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090603

Sites / Services round table:

  • CERN:
    • The SELinux execheap check has been disabled on the SLC5 WNs (requested by ATLAS). (At CERN an NCM component is used to configure this; the same result can be achieved by executing: setsebool -P allow_execheap=true. See also the sketch after this list.)

  • IN2P3: The ALICE job failures at IN2P3 are thought to be due to a problem with the BDII at GRIF (the WMS used is at GRIF). The HPSS migration should be ending this afternoon, so tape services will be resuming.

  • TRIUMF: Seeing FTS timeouts on transfers from RAL and ASGC especially for the large (3-5 GB) merged AOD files. Have hence increased their timeouts from 30 to 90 minutes. Simone added they also see timeouts from TRIUMF to CNAF (which goes over the GPN in fact).

  • RAL: The CMS executable problem was in fact picked up quickly by local CMS support. They are seeing timeouts on their RAL to BNL FTS channel.

  • BNL: As regards ATLAS reprocessing, BNL had 5000 jobs queued which triggered prestaging; 4000 of these are already done, so the reprocessing jobs will now ramp up as job slots become free.
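
For the SELinux item above: the snippet below is a minimal C sketch, not part of the CERN configuration, showing how the allow_execheap boolean can be inspected and toggled through the libselinux runtime API as an alternative to the setsebool command. Note that, unlike setsebool -P, changes committed through this API affect only the running policy, not the persistent one.

  /* Sketch (see caveats above): query and flip the allow_execheap boolean
   * through the libselinux runtime API.  Build with -lselinux. */
  #include <stdio.h>
  #include <selinux/selinux.h>

  int main(void)
  {
      const char *name = "allow_execheap";

      int before = security_get_boolean_active(name);   /* 0, 1, or -1 on error */
      if (before < 0) {
          perror("security_get_boolean_active");
          return 1;
      }
      printf("%s is currently %d\n", name, before);

      if (before == 0) {
          /* stage the new value, then commit all pending boolean changes
           * to the running policy (not persistent across reboots) */
          if (security_set_boolean(name, 1) < 0 || security_commit_booleans() < 0) {
              perror("enabling allow_execheap");
              return 1;
          }
          printf("%s enabled in the running policy\n", name);
      }
      return 0;
  }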

AOB: (MariaDZ) For ALICE (also sent in email today): please create in VOMRS groups to contain your team and alarm members as per https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#How_to_register_your_TEAM_and_AL For CMS (also in https://savannah.cern.ch/support/index.php?104835#comment68): please change your groups from TEAM and ALARM into "team" and "alarm". Although you were the only experiment that did what the instructions originally recommended, LHCb created Roles in lower case and ATLAS created Groups in lower case. As VOMS is case-sensitive, please be so kind as to also move to lower case, to simplify the GGUS parsing rules.

GOCDB: is now fully moved to the backup service.

Middleware: A. Unterkircher reported a recently found lcg_utils problem whereby gfal_open gives a segmentation fault under certain conditions. Will discuss what to do offline with the experiments.
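
For context, the call in question is the POSIX-like open wrapper of GFAL 1.x. The snippet below is a minimal sketch, not taken from the bug report, of how a client typically exercises it; the gfal_open/gfal_read/gfal_close prototypes and errno-based error reporting are assumed from the GFAL 1.x gfal_api.h header, and the SURL is hypothetical.

  /* Minimal sketch (assumptions noted above): open a grid file via GFAL 1.x,
   * read one buffer and report errors, mirroring the kind of client code
   * that would hit a crash inside gfal_open(). */
  #include <stdio.h>
  #include <string.h>
  #include <errno.h>
  #include <fcntl.h>
  #include <sys/types.h>
  #include "gfal_api.h"              /* assumed GFAL 1.x header */

  int main(int argc, char **argv)
  {
      /* hypothetical SURL for illustration only */
      const char *surl = (argc > 1) ? argv[1]
                         : "srm://some-se.example.org/dpm/example.org/home/atlas/test.file";
      char buf[4096];

      int fd = gfal_open(surl, O_RDONLY, 0);    /* the call reported to segfault */
      if (fd < 0) {
          fprintf(stderr, "gfal_open(%s): %s\n", surl, strerror(errno));
          return 1;
      }

      ssize_t n = gfal_read(fd, buf, sizeof(buf));
      if (n < 0)
          fprintf(stderr, "gfal_read: %s\n", strerror(errno));
      else
          printf("read %zd bytes from %s\n", n, surl);

      gfal_close(fd);
      return (n < 0) ? 1 : 0;
  }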

Friday

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:
