Week of 081208

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).
  • Status can be received, due or open

Site  Date   Duration  Service  Impact             Report  Assigned to  Status
ASGC  6 Dec  -         -        Fire! (see below)  -       Jason        open

GGUS Team / Alarm Tickets during last week

VO ALARM TEAM TOTAL
ALICE 0 0 1
ATLAS 0 11 39
CMS 0 0 6
LHCb 0 0 8

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Olof, Roberto, Maria, Gavin, Miguel, Harry, Jamie, Jean-Philippe, Graeme, Jan, Julia);remote(Michael, Gareth, Luca).

elog review:

Experiments round table:

  • ATLAS (Graeme) - picking up from Friday: the problems seen Friday morning and reported at the meeting were cured by an intervention from the CASTOR experts at about 'tea time', and things ran happily over the weekend. Why did the GGUS team ticket sent to CERN 'get lost'? A response came only this morning - the ticket got confused with today's scheduled ATLAS downtime. After Friday evening things ran OK until late Sunday night, when cosmics to BNL got stuck - 180s "could not prepare file" timeouts and Globus timeouts. Some files were found to be staged on the default pool instead of t0atlas. Resolution was put on hold due to this morning's intervention. (The service came back late) then went sick again - many 180s timeouts. Plans: tomorrow, interventions to upgrade the ATLAS site services and central catalogs. Start the "SRM killer" 10M-file test tomorrow afternoon, conditional on SRM ATLAS being OK at CERN. Miguel - same load on all sites? A: yes, a 7-day test.

  • SRM ATLAS is at the moment suffering a problem similar to Friday's - under investigation. Consequence - indirect - of the CASTOR ATLAS intervention this morning: the DB was not in the expected state at the beginning. 3rd problem - transfers to BNL still to be looked at. Specific to this channel? A: yes, but BNL is the only T1 getting cosmics at the moment, so maybe that... 4th point: the problems on Friday were understood and fixed at dinner time; alerted to the problem by Simone. The ticket got to the CERN ROC at 11:15 Friday but only reached the storage team Monday morning. Analysis of the problem to follow... Olof: what is the expected request load from ATLAS for the 10M-file challenge? A: things will get throttled at some point by FTS, but several lakh (100K) files per day per T1. Olof: 10M files out of CERN? A: 10M test files; whether all are shipped remains to be seen. Some files are already distributed, e.g. reprocessing files. Flavia - dCache sites have a problem with get requests: since clients do not specify an expiry time when making a request, the request stays active. There is a maximum number of concurrent active requests - if you exceed it the request gets dropped, or if the server is not configured this can cause problems (see the sketch below). Gavin - Paolo is looking into this from the FTS side. Flavia - clients should specify a 'sensible' expiry time, but neither FTS nor lcg_utils do; there is a hardcoded default of 24h. Graeme - surprised about the dCache issue; has it been around for long? A: yes, sites have seen this and enlarged the buffer for get requests. The "next release" will have a parameter on the server side, but clients should set a 'sensible' value or issue a release. Gavin - FTS sends a release after a successful transfer.
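
A minimal sketch of the dCache behaviour Flavia describes: get requests with no client-supplied expiry time stay active until a bounded table of concurrent requests overflows, at which point new requests are dropped. The class, the table size and the numbers are illustrative assumptions, not dCache internals; the 24h default lifetime is the one quoted above.

    import time

    DEFAULT_LIFETIME_S = 24 * 3600   # hard-coded 24h client default quoted above
    MAX_ACTIVE_REQUESTS = 1000       # assumed server-side limit on concurrent gets


    class GetRequestTable:
        """Toy model of a server-side table of active get requests."""

        def __init__(self, max_active=MAX_ACTIVE_REQUESTS):
            self.max_active = max_active
            self.active = {}                      # request id -> expiry timestamp

        def submit(self, req_id, lifetime_s=None):
            """Accept a get request, or drop it if the table is already full."""
            self._expire()
            if len(self.active) >= self.max_active:
                return False                      # request dropped, as described above
            self.active[req_id] = time.time() + (lifetime_s or DEFAULT_LIFETIME_S)
            return True

        def release(self, req_id):
            """Explicit release (what FTS sends after a successful transfer)
            frees the slot immediately instead of waiting for the 24h expiry."""
            self.active.pop(req_id, None)

        def _expire(self):
            now = time.time()
            self.active = {r: t for r, t in self.active.items() if t > now}

With a 24h default lifetime and no explicit release, slots are only freed a day after submission, so a steady client stream fills the table long before anything expires; a shorter client-side lifetime or an explicit release after each transfer keeps it drained.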

  • LHCb (Roberto) - ongoing testing of the stripping for LUMI5 before starting it, which will put stress on the T1s. MC simulation proceeding smoothly: 11K jobs running concurrently on Friday - a new record - with very few failing. Testing activity last week showed that the gfal python 2.5 release candidate is fully thread safe; would like a release in the application area. Flavia - ongoing.

Sites round table:

  • BNL (Michael) - authentication issues: Friday night all transfers to the BNL Panda instance failed due to permission problems. Found that FTS failed to get the VOMS role from myproxy. Unable to fix this despite several attempts. Hiro - DDM expert - changed the DQ2 code to skip myproxy; it now uses FTS delegation for the BNL Panda instance. Suggest to the ADC developers to incorporate this fix. Gavin - not aware you could put a VOMS proxy in myproxy, but delegation mode is recommended. Graeme - this mode has been used for almost all DDM transfers for quite some time now; surprised any are left using myproxy. Graeme - configuration of the site services for this in DDM - take offline.

  • RAL (Gareth) - scheduled outage in GOCDB for the ATLAS SRM tomorrow; will go ahead and join the ATLAS tests a bit late. Will move the ATLAS DB back onto the RAC and update to the version of CASTOR already deployed on the other instances.

  • CNAF (Luca) - last Friday we had to put the ATLAS SRM end-point into an unscheduled downtime due to a problem with the underlying file system (GPFS). We recovered during the night between Friday and Saturday (~03:00 CET) and then during this long weekend (today is a holiday in Italy) we reconfigured GPFS on our farm (without stopping the service) to avoid problems on the disk servers. We no longer observe odd latencies in file system access.

  • ASGC (Jason) - From: Jason Shih <Horng-Liang.Shih@cern.ch>
    Date: Sat, Dec 6, 2008 at 4:05 PM
    Subject: fire incident at computing center affect ASGC network connectivity (limited)
    To: wlcg-scod@cern.ch
    Cc: ASGC T1 Ops <asgc-t1-op@lists.grid.sinica.edu.tw>, noc@asgc.net

    We were informed (two hours ago, around Dec 6 05:20 UTC) of a fire incident at the
    Sinica computing center (not ASGC). All facilities have been shut down in the meantime, waiting for the local administration to complete its examination.
    The actual impact for ASGC is that the two WAN connections to the CHI and JP routers are both down, following the urgent shutdown of all facilities (inc. UPS). ASGC NOC has already rerouted all affected links through available paths; network latency will increase significantly on the new paths. The connection to CHI (2.5G) is rerouted through AMS, and some connections to JP sites are rerouted through HK, with some still going through AMS/CHI/SINET.
    Right now we are still waiting for the fire incident report from the Sinica Computing Center; until then, no power (inc. generator) will be restored to sustain the facilities in the data center. Apart from the impact on network connectivity, the impact on transfers and job execution at ASGC from the T0 and other tiers should be limited.

Services round table:

  • FTS on SLC3 (Gavin) - switched off this morning.

  • CASTOR LHCb & CMS - DB upgrade and SRM 2.7 into production; 1 h downtime.

AOB:

Tuesday:

Attendance: local(Nick, Miguel, Harry, Jamie, Simone, Daniele, John, Jan, Gavin, Luca, Olof, Ignacio);remote(Michael, JT).

elog review:

Experiments round table:

  • ATLAS (Simone) - testing plan for the new DDM site services: ATLAS is ready to start a massive test of the site services. Plan: before 17:00 today the missing bits and pieces will be put in place - some scripting plus a few configurations on the dashboard and the site services themselves. This gives RAL and SARA time to finish today's scheduled interventions. The test is T1-T1, with CERN considered as a T1. On average 2000 datasets of 100 files each per site - this is the data volume that can be moved. They will be subscribed from each source to all sites: 100 datasets per T1 source tonight; if no problems -> 2K, then 3K, then the rest. A 1 week - 10 day exercise. Miguel: any benefit to starting today at 17:00 rather than tomorrow? No experts around... A: ATLAS expertise will be there 16x7. Simone - could start tomorrow; it takes ~1 hour to start. Miguel: if something breaks tonight it will probably stay broken overnight... Simone - it is an ATLAS test, but point taken; will send e-mail. Harry: how big are the files? A: tiny - 50-100 MB per file. The test will go on for 10 days and will not stop overnight. Harry: the probability of something breaking overnight on startup is quite high - piquet call -> cost. Mail thread about the problems with CASTOR last Friday: 1) Change in workflow with the new site services. The endpoint at CERN is of type datatape: first try to find where the file is and then move the file; bringonline is issued before srmls. Talked to Miguel Branco - could do srmls first, but srmls with 50 files fails 3 times out of 4 (a schematic of the two orderings is sketched after this item). Needs a developer-to-developer discussion. John: since the new SRM, or always true? A: we were not doing massive srmls before; the main use before was recall from tape. For any file of type datatape we need to determine the file location. One of Olof's comments, that Friday's problems might be load related, is certainly true - there are many more srmls calls now. Why did the ticket from Graeme get to the CERN ROC and then take 6-7 hours to reach the SMOD? (After 'this' meeting). Team tickets should bypass the ROC and go straight to the site. Luca - this happened to us too: saw the e-mail by chance - only received the mail on Monday. John - it seems fairly common that the TPM sends tickets back to the ROC. Simone - the ROC should be cc-ed for traceability. John: ticket or e-mail to the site? A: e-mail. Jan: sent a link for the post-mortem; one of the points to be followed up. The original ticket was submitted at 11:40 by Graeme and arrived quickly at the site but sat there all weekend. The CERN site contact is basically the SMOD, who directs it internally. Nick - will ask Maria Dimou to follow up. Luca: has the test started (arrived late...)? No - 17:00 tonight or tomorrow. Miguel: also a GGUS ticket about connectivity CERN-BNL - seeing failures, but files could be read from other sites, e.g. Oxford, CERN. From BNL some timeouts on the transfer itself. The BNL expert answered the ticket that they could not see any problem on the BNL side; Miguel also sees no problem on the CERN side. Need to do a low-level CERN<->BNL test - asked the BNL expert to contact us; done before using iperf. Have had a network problem at least twice. Jan: another ticket was opened on transfer problems CERN-BNL over the weekend, but no time to follow up yet. 3/4 srmls failures: ticket from Miguel Branco ~2 weeks ago, using an older SRM version - see if it can be reproduced with the new version. Luca: ATLAS issues at CERN - sent ATLAS e-mails about the resolution of the problem; does ATLAS confirm the problems are now fixed? A: problems not seen now but activity is very low; when the test starts we will see...
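
A schematic of the workflow change discussed above for datatape endpoints: the new site services issue bringonline before srmls, whereas the alternative is to query locations first and stage only what is missing (fragile, since a bulk srmls over 50 files reportedly fails 3 times out of 4). The function arguments are hypothetical stand-ins, not the real DDM or SRM client API.

    def locate_then_stage(surls, srm_ls, srm_bring_online):
        """Alternative ordering: bulk srmls first, stage only files not on disk.
        Cheap when most files are disk-resident, but relies on the fragile
        bulk srmls call mentioned above."""
        statuses = srm_ls(surls)                          # bulk metadata query
        to_stage = [s for s, status in statuses.items() if status != "ONLINE"]
        if to_stage:
            srm_bring_online(to_stage)
        return statuses


    def stage_then_locate(surls, srm_bring_online, srm_ls):
        """New site-services ordering: bringonline is issued up front for all
        files, srmls only afterwards - more staging requests, fewer bulk
        metadata calls on the critical path."""
        srm_bring_online(surls)
        return srm_ls(surls)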

  • CMS (Daniele) - this week is CMS week; also a busy one for DM - major PhEDEx upgrade to 3.1. Tested under supervision with the CRAFT reprocessing. Tests so far show no show-stoppers that would prevent the release and upgrade. It was planned for today, to coincide with the CASTOR intervention, but a couple more pre-releases are needed, so now most probably tomorrow morning. Downtime of 4h on all tiers - no data movement during this period. Update after today's dataops meeting. Minor issues at sites being taken care of: BDII glitches that cause some JobRobot jobs at a few T2s to fail. Nothing crucial... following up.

  • LHCb (Roberto) - not much to report; asked why the certificate on the VOBOX expired yesterday - had to ask for a new certificate to be installed. Was there a warning of expiry? Ulrich apparently did not get the reminder. The usual MC simulation is going on smoothly. JT - the dashboard is rather confusing regarding SRM@SARA: erratic behaviour is shown - are you having a problem or not? A: the test is not yet critical - a unit test using the DIRAC framework. A failing test means the endpoint is not usable - will follow up.

Sites round table:

  • ASGC - update on network problem last Saturday:

The network problem should have been resolved about 2 hours after the first event was escalated (around 4:30pm). The service provider was able to bring in a different power source with a separate power generator to avoid exhausting the UPS system supporting the ASGC 10G network. Before that, the two affected links had been rerouted through the AMS and HK routes; latency may increase to two and four times for connections to US and JP (KEK and HIROSHIMA) sites respectively. The connection with Tokyo Univ has minor impact as it is rerouted at HKiX.

The local service provider will submit a quality improvement report as well as a risk assessment asap, to be reviewed by the local network operations administrators.

Below is the summary report, including the timestamps, provided by ASGC NOC:

* events observed by local monitoring system:

* 2008-12-06 11:20:09 CST JP, CHI link down

  • Traffic between ASGC and following sites were re-routed to TPE-AMS-CHI link.
  • FNAL, BNL, TRIUMF, CNAF, part of NDGF, INDIACMS-TIFR, IN-DAE-VECC-01, JP-KEK-CRC-01, JP-KEK-CRC-02, JP_HIROSHIMA_WLCG
  • Traffic between ASGC and following sites were re-routed to TPE-HK link.

* 2008-12-06 13:31:19 CST TWAREN link down

  • Traffic between ASGC and following sites were re-routed to TWGate IP transit link.
  • NIU, NCUHEP, NCUCC, NTCU
  • Traffic between ASGC and Academia Sinica Campus network was re-routed to Internet2-SINET via TPE-AMS-CHI link.

* 2008-12-06 13:31:26 CST ASCC link down

* 2008-12-06 15:47:22 CST JP, CHI link up

  • Traffic between ASGC and following sites were back to TPE-CHI link.
  • FNAL, BNL, TRIUMF, CNAF, part of NDGF, INDIACMS-TIFR, IN-DAE-VECC-01
  • Traffic between ASGC and following sites were back to TPE-JP link.
  • TOKYO-LCG2, JP-KEK-CRC-01, JP-KEK-CRC-02, JP_HIROSHIMA_WLCG
  • Traffic between ASGC and Academia Sinica Campus network was re-routed to SINET via TPE-JP link.

* 2008-12-07 18:38:13 CST TWAREN link up

  • Traffic between ASGC and following sites were back to TWAREN link.
  • NIU, NCUHEP, NCUCC

BR, J

  • BNL (Michael) - LFC failure last night: the daemon died. Not much info left; not clear if we will figure out why... We don't enjoy the same stability with the LFC as with the LRC - need to follow up on this. Some communication with the developers; have received a fix - remains to be seen if it helps. In the meantime, improved monitoring and automated restarts.

  • RAL (Gareth) - in the middle of the CASTOR instance upgrade and the move back to the RAC. May overrun a bit - if so, will notify.

  • CERN (Jan) - the CASTOR upgrades for LHCb and CMS went 'ok', in the sense that Oracle was upgraded to 10.2.0.4 together with the latest SRM. Would appreciate feedback on the situation...

Services round table:

  • SRM - See https://twiki.cern.ch/twiki/bin/view/DSSGroup/CastorMorningMeeting
    • srm-atlas meltdown on Friday - high failure rates and CPU load. Quoting Olof: "The problem is related to the use of internal gridftp, which puts the SRM in the middle of a giant request vortex where accumulations of timeouts in both client and movers effectively inhibit any useful activity at all from taking place: the SRM waits for the gridftp mover to start on the diskserver but since all transfer slots are full, the mover cannot start and while the SRM is waiting the daemon thread is blocked. Eventually the whole thread pool is full. In the meanwhile the clients timeout so when the gridftp mover eventually starts up nobody will connect to it but to be really sure it will sit on its slot for a minute or so and eventually fail... [...] On Monday we could perhaps consider to switch back to external gridftp until there is a SRM release that fixes this problem? The external gridftp doesn't solve the inherent problem of too many transfers to the same pool but rather than putting the SRM in the middle, the transfers will merely accumulate on the diskservers as gridftp processes stuck in rfio_open()." More details can be found here. A toy simulation of this failure mode is sketched below.
    • The timeouts are ~60m in GridFTP, compared to 3m in the metadata layer (FTS/SRM).
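
A toy simulation (editorial sketch, not CASTOR code) of the request vortex Olof describes: SRM daemon threads block waiting for gridftp mover slots that are all taken, so the thread pool fills up while clients time out, and when a mover finally starts nobody connects to it. All sizes and timeouts below are made-up illustration values.

    import queue
    import threading
    import time

    TRANSFER_SLOTS = 4        # gridftp mover slots on the diskserver (assumed)
    SRM_THREADS = 8           # SRM daemon thread pool size (assumed)
    CLIENT_TIMEOUT_S = 0.5    # clients give up quickly (metadata-layer timeout)
    MOVER_HOLD_S = 2.0        # an orphaned mover sits on its slot before failing

    slots = threading.BoundedSemaphore(TRANSFER_SLOTS)
    requests = queue.Queue()


    def srm_daemon_thread():
        while True:
            req_id, submitted = requests.get()
            slots.acquire()                    # block until a mover slot frees up
            try:
                waited = time.time() - submitted
                if waited > CLIENT_TIMEOUT_S:
                    time.sleep(MOVER_HOLD_S)   # mover holds the slot, then fails
                    print(f"req {req_id}: mover started but the client had timed out")
                else:
                    print(f"req {req_id}: served after {waited:.2f}s")
            finally:
                slots.release()


    if __name__ == "__main__":
        for _ in range(SRM_THREADS):
            threading.Thread(target=srm_daemon_thread, daemon=True).start()
        for i in range(50):                    # burst of requests to the same pool
            requests.put((i, time.time()))
        time.sleep(15)

After the first few requests, every daemon thread is either blocked on a slot or babysitting a mover that no client will use, which is the 'no useful activity at all' state described in the quote.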

  • CASTOR ATLAS / SRM (Olof) - The scheduled CASTOR ATLAS stager DB intervention was actually today (Monday). The problems on srm-atlas.cern.ch last Friday were independent and arose after the srm-atlas endpoint had been upgraded to the CASTOR SRM 2.7 release last Wednesday. The 2.7 release is a major upgrade (it also includes an upgrade from SLC3 to SLC4), which had been tested by the 4 LHC VOs for several weeks on the srm--pps endpoints. The upgrade last week was actually implemented as a switch-over of the DNS alias srm-atlas-pps -> srm-atlas in order to minimise the impact for ATLAS. The new SRM worked fine between Wednesday and Friday morning, when ATLAS started to report FTS timeouts, which we traced to an accumulation of requests in the SRM. One of the things that comes with 2.7 is a change in the scheduling of the gridftp transfers, which seems to have been the cause of the problems we had on Friday (I think the developers are working on a full explanation). Why this didn't come up in the VO PPS testing is not clear to me, but it could be load related.
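
Since the upgrade was done by repointing a DNS alias (srm-atlas-pps -> srm-atlas) rather than reinstalling boxes, the quickest sanity check after such a switch-over is simply to see where the alias resolves. A minimal sketch using the standard library; the hostname is the one from the text and the output depends entirely on the local resolver.

    import socket

    def show_alias(alias: str) -> None:
        """Print the canonical name, alias list and addresses behind a DNS alias."""
        canonical, aliases, addresses = socket.gethostbyname_ex(alias)
        print(f"{alias} -> canonical {canonical}, aliases {aliases}, A {addresses}")

    if __name__ == "__main__":
        show_alias("srm-atlas.cern.ch")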

  • Cause understood: the handling of bringonline, which is suboptimal. Will discuss with Simone / Graeme to see if it can be mitigated until fixed in the software. Simone: has asked the DDM developers to move to a situation where atlasdatatape is considered as disk. The caveat is that for files not on disk the FTS handling will be as before... Not sure if this change has been made, but it should be effective 'now'. Jan + Simone to discuss later...

  • Upgrade schedules for CASTOR PUBLIC and ALICE tomorrow: changed - now extended to have 'more realistic' downtimes of 2 hours - will be offset.

  • A report on what went wrong with the other 3 is pending...

AOB:

  • Friday 19th December: 513 R-068 will be used by the operators whilst their room is being painted. Will find an alternative...

  • Over Xmas: will create minutes file and let people add entries so that we have some record over the period.

  • Next year: meetings continue same time same place...

Wednesday

Attendance: local(Luca, Jan, Gavin, Maria, Jamie, Flavia, Jean-Philippe, Simone, Roberto, Maria);remote(Ron, Michael, Gareth).

elog review:

Experiments round table:

  • ATLAS (Simone) - update on the tests. The test started this morning. Last night, despite the fact that ATLAS was supposed to roll back to the previous behaviour (no bringonline), this change was only made ~midnight; before midnight the known problem reappeared, and since the switch the problem reduced and then disappeared. Mail from Giuseppe Lo Presti, who is developing a fix; Jan will try to put it in the PPS later today. Test: 2 issues this morning. The first with FZK - srm get requests: no one could get files from FZK; problem fixed around 10:00, don't know the details. SARA was in scheduled downtime yesterday - after the end, storage started working again but with very high (60-90-100%) failure rates. This continued until ~5am; then unscheduled downtime until lunch, with an intervention on PNFS. Since then the SARA situation has improved a lot: still some SRM timeouts but efficiency ~90%. Plan: 1K datasets injected, i.e. 10K subscriptions, as each dataset is subscribed to 10 places. Half of this has been moved; the other half by end of day. Another 2K will then be injected, and if OK then the remaining ~14K. From the ATLAS point of view, looking at the dashboard, callbacks are a bit slow - experts are investigating; a URL will be provided. Ron - yesterday updated to 1.9.0-6, upgraded Java 1.5 -> 1.6, and moved services (dCache head node) from one single node to two new nodes: PNFS on one, the rest on the other. The upgrade went fine; around 19:00 the maintenance was over and everything looked fine, manual SAM test OK - went home happy. This morning saw some serious PNFS errors -> unscheduled maintenance to fix; shut down the SRM server. Appears to be a postgres-related error; could not fix it with a reset of the transaction log or a reindex. Had made a dump of PNFS prior to yesterday's intervention and restored it this morning. Somehow, in the move of the postgres DB from one machine to the other, the DB was not closed properly(?). Q for Simone - to stop people transferring data we shut down the FTS channels, restarted (active state) only since lunch today; where did the 60% failure rate come from? FTS transfers should not have happened... Simone - if the channel is closed DDM will not transfer files, but the site also holds data to be transferred elsewhere - hence channels at other FTS servers, and equally for the FTS at CERN. Flavia: Q - how did you split the SRM instances? An alias per experiment? A: no, a single SRM node; the PNFS part was split from the other dCache head-node services, and the SRM is a separate node anyway. New dashboard link(?): http://dashb-atlas-data-user.cern.ch/dashboard/request.py/site

  • LHCb (Roberto) - production (dummy MC) ongoing. Some failures in the stripping testing activity, so the production activity is most likely postponed until next week. rfcp transfers from online to CASTOR were hanging - not reproduced; this tested the alarm GGUS ticketing, which worked fine - the operator on duty and the service people were called OK. The problem in the end was a closed port on the online system. Some problems at a CNAF T2 with software not available for user & MC production jobs - a GPFS problem, fixed after a GGUS ticket. Similar problem at BCN - not PIC. Several jobs failing at RAL - killed by the batch system after running out of wallclock time. All fixed following GGUS tickets.

  • CMS (Daniele) - migration to PhEDEx 3.1 centrally done with 0 problems (!). IN2P3 has been the first site to upgrade and runs nicely atm. Others are following, depending on their time availability for the upgrade. The upgrade is mandatory, due to a schema change in the Transfer Management DB.

Sites round table:

  • RAL (Gareth) - castor atlas intervention yesterday overran by about 1.5 h.

  • CERN (Jan) - upgraded the c2public and ALICE stagers at CERN. This completes the upgrades of the stagers to 10.2.0.4 and the rollout of SRM 2.7 in production. The public upgrade overran; ALICE was OK. Babysitting the new SRM version - bringonline and a few other issues; hope to test this afternoon.

Services round table:

  • DB (Maria) - power tests in P5. Started at 06:00 and will finish late tonight. The CMS online DB is powered down from last night until tomorrow 08:30.

AOB:

Thursday

Attendance: local(Harry,Julia,Simone,Jean-Philippe,Jan,Gavin,MariaDZ,Andrea,John Gordon,Flavia);remote(Jeremy,Brian,Gonzalo,Michael,Jeff).

elog review:

Experiments round table:

CMS (AS): Testing the usage of space tokens in CMS data transfers to separate custodial and non-custodial data arriving at the T1s. The procedure for determining the readiness of CMS sites is now properly documented and followed on a daily basis.

LHCb (RS): Would like to use python 2.5 by having the data management clients (gfal and lcg_utils) made available in the AFS Applications Area. Complained that the announcement of the CVS kserver being stopped did not go through the normal channels (it has been agreed to postpone this).

ATLAS (SC): 1) The 10M file test is ongoing with 3000 datasets subscribed. The first problem seen was an accumulation of callbacks, about 1M, to the ATLAS dashboard: the way the site services send callbacks was not optimised. This has now been improved, gaining a factor of 10 reduction, and there are no further queues. The second issue was that the site services and the database were running on the same machine, as this proved best in a previous workflow. This is no longer the case and they have been separated again. This has provoked many temporary 'file exists' errors, which will clear, and currently ATLAS are transferring 12-15K files every 10 minutes (a back-of-envelope rate check is sketched after this item). The test will carry on for another week. Jeff queried why the data rates were so low. Simone replied that these are small files and the rate is limited by queries to the LFC, site services, SRM etc. It is thought the SRM queries are the main limiting factor, as the LFC has 500000 files resolved and the FTS polling is fast enough. Two sites look a bit slow - TRIUMF, which has a conservative FTS setup of 2 slots/channel, and ASGC, which probably has something wrong in its FTS state machine. Brian (RAL) reported their FTS servers are having trouble keeping slots full with such small files, a previously seen problem where 1 slot is often empty. Simone suggested they increase the number of jobs waiting in FTS at RAL from 300 to 500 and will warn when they make this change (probably tomorrow morning). 2) ATLAS have asked for a tool to change ACLs in their LFC and this is becoming urgent. Jean-Philippe will try a script against a copy of the LFC today.
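
A back-of-envelope check of the quoted rate against the 10M-file target, using only the numbers above (12-15K files per 10 minutes):

    # Aggregate-rate check for the 10M-file test, using the figures quoted above.
    for files_per_10min in (12_000, 15_000):
        per_day = files_per_10min * 6 * 24            # 6 ten-minute slots/hour, 24h
        days_for_10m = 10_000_000 / per_day
        print(f"{files_per_10min}/10min -> {per_day:,} files/day, "
              f"~{days_for_10m:.1f} days to move 10M files")

At 12-15K files per 10 minutes the aggregate rate is roughly 1.7-2.2M files per day, so about 5-6 days for 10M files, consistent with the test carrying on for another week.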

Sites round table:

PIC (GM): pointed out that for the SAM tests there is a SAMADMIN page to trigger instantaneous tests and asked if there was a similar one existing, or planned, to trigger experiment SAM tests. Andrea (for CMS) said not as these tests are submitted from dedicated VO systems. Anyway the CMS tests run hourly which Gonzalo thought satisfactory. Jeff then pointed out that not all tests seem to run hourly - the ATLAS CE test has been very erratic at NL-T1 with the last one 2 days ago. Simone said this must be broken and he will check. John Gordon pointed out discussions have been held on splitting out the SAM tests into the regions but it is early days for such changes.

CNAF: Andrea said that 2 of the 3 CMS WMS at CNAF are failing to matchmake and these are the two with the latest certified patch. He will circulate the patch level with a warning.

Services round table:

FIO (JvE): they are struggling with the ongoing network problems in the CERN CC, with a bad set of ATLAS diskservers plus intermittent failures elsewhere. Experts are investigating. The CERN ATLAS and CMS SRM endpoints were upgraded at 14.00, fixing the recent (Monday) bring_online problems.

AOB:

Friday

Attendance: local(Simone, Eva, Maria, Jamie, Miguel, Harry, Jan, Gavin, Andrea, Roberto, Julia);remote(Gonzalo, Michael, JT, John Kelly, Brian).

elog review:

Experiments round table:

  • ATLAS (Simone) - the 3K subscriptions injected in the previous days are almost drained. A quick fix today to the site services - more monitoring. New subscriptions mid-morning. The current situation looks quite good, but: 1) noticed that some FTS jobs at Taiwan & Canada remain in the Active state for a very long time - many hours. This is the bit that has changed in the monitoring - a dump every few hours -> site managers to check (a sketch of such a check follows below). Feedback from ASGC (Jason) - multiple databases being used(?), migrated from one to the other. Still jobs at ASGC for which the state machine seems stuck. 2) For BNL, a problem with the LFC at BNL. Mail to Hiro. Michael - have not seen any message on this so can't comment. Simone -> expert on call to follow up. Michael - no elog message. Simone - will check. Julia: jobs in active state? A: you submit an FTS job and it goes through the state machine - active, finalised etc. The Active state usually lasts for the duration of the file transfer; SRM timeouts can make it longer, but we see jobs in this state for 8 hours! Not a monitoring issue - the FTS at the site. The ATLAS site services know how long ago a transfer was injected. Gavin - write a note to fts-support@cern.ch. Brian@RAL: trying to tweak the SRM to get better loading on the front-end due to the ATLAS high-load test - went to 100% load. Not just from the high-rate test but also from an avalanche of jobs starting on the CE. Trying to debug job recovery, which was enabled on the WNs for the T1. Some problem - debugging with Paul Nielson.
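
A sketch of the kind of check the periodic dumps are meant to enable: flag FTS jobs that have sat in the Active state well beyond a normal transfer time (the 8-hour cases above). The input format assumed here - one "job_id state ISO-timestamp" line per job - is an illustrative assumption, not the real FTS dump layout.

    import sys
    import time
    from datetime import datetime

    STUCK_AFTER_HOURS = 8.0          # matches the 8-hour cases reported above


    def stuck_jobs(lines, now=None):
        """Yield (job_id, age_in_hours) for Active jobs older than the threshold."""
        now = now or time.time()
        for line in lines:
            try:
                job_id, state, submitted = line.split()
            except ValueError:
                continue                              # skip malformed lines
            if state != "Active":
                continue
            age_h = (now - datetime.fromisoformat(submitted).timestamp()) / 3600
            if age_h > STUCK_AFTER_HOURS:
                yield job_id, age_h


    if __name__ == "__main__":
        for job_id, age_h in stuck_jobs(sys.stdin):
            print(f"{job_id} active for {age_h:.1f}h - check the FTS state machine")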

  • CMS (Andrea) - Reprocessing going well at RAL and FNAL. CNAF had a problem with the software area but it is now fixed; reprocessing restarted. SRM issues at IN2P3 seem better after an update of the SRM space manager. At FZK reprocessing is not going at full speed - being investigated. PhEDEx 3.1.1 deployed at all sites - some minor fixes. Sites are now evaluating readiness using the tools developed by the site commissioning team. Since yesterday evening the 2 WMS used by SAM (OPS & CMS) have not been working: OPS reverted to the old WMSs, and CMS moved to a new WMS - now things should be fine, but sites appear to have failed job submission tests. Roberto - correlated with the BDII? No - a failure with the WMSs, which cannot accept submissions. Harry - related to the recent patch? Not that either.

  • LHCb (Roberto) - stripping production pending the release of the LHCb core software; testing ongoing. 9K dummy MC jobs - good; keeping sites 'warm'. Main issue with the dummy MC production: problems with the shared area - also ran into problems at CNAF, Lyon, T2s,... GGUS tickets have been opened. LHCb will prepare for the Xmas shutdown: the plan is to run over Xmas on a best-effort basis, making the dummy MC production 'human free' - no intervention over Xmas. DM activities (staging, transfers etc.) will restart in January. Gonzalo - is this MC production trying to fill all available slots, or is it a fixed load on sites? We see some activity but free slots and no queue. Roberto - the T1s are supposed to run real production, so LHCb is throttling submission of dummy MC to the T1s.

Sites round table:

Services round table:

SRM @ CERN (Jan) - srm public, ALICE & LHCb upgraded to the latest SRM version - this fixes the problem with bringonline reported by ATLAS. Still some low-level problems - some more maintenance releases are expected over the coming weeks.

Net (Miguel) - the network timeouts are no longer seen. Investigation still ongoing - no news; the problem is in a router in the backbone. Reboot of the router? Removal? An intervention was done yesterday afternoon. If any timeouts have been seen since, please report asap. Some packets were corrupted with a wrong checksum, retransmitted and then re-corrupted -> network timeouts.

DB (Eva) - there was a problem with SARA during the last 2 days - the DB was down due to lack of space in the archive log area. Replication stopped for all experiments as SARA has one DB for all replication setups. Fixed this morning - the setup is now resynched. The problem took almost 2 days to fix as the DBAs at SARA were both away... This morning both capture processes on ATLAS online (conditions, PVSS) aborted due to a bug in the log miner, caused by some operations done by the ATLAS DBAs on the online DB without correct prior testing on the integration setup. No patch is available - opened a service request; there are two service requests for the same problem from 2 other customers with no progress - the most recent from July. Testing should be enforced - no immediate solution from Oracle, and duplicating T0+T1s in a test setup is a major thing! ATLAS will clean up the data on the online accounts. Maria - more news Monday. First request: PVSS online to offline. Simone: will the conditions data be scratched and resynchronised? (A: yes). The reprocessing exercise would use conditions at the T1s. Eva - offline to T1s is not affected. Simone - this should be brought to the attention of the people responsible for reprocessing. Maria - there are different recovery scenarios but they are heavy; should investigate and let ATLAS decide. Propose locking of the owner accounts to reduce risk - unlocked on demand.
There are 2 big groups of schemas replicated for ATLAS conditions and only one of them is affected by the Streams bug reported during the meeting: the ATLAS_COOLONL_XXXX schemas, which are replicated from the ONLINE to the OFFLINE databases and from the OFFLINE to the Tier-1 databases. As the capture process running on the ONLINE database is stopped, these schemas are not being replicated towards the Tier-1 sites. The offline schemas, named ATLAS_COOLOFL_XXXX, which are only replicated from the OFFLINE to the Tier-1 databases, are not affected and their data is actually being replicated.
Depending on which data the reprocessing tasks at the destination sites use, they may or may not be affected.
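
The distinction above can be encoded in a few lines; a sketch, with the two schema prefixes taken from the text and everything else purely illustrative.

    # ATLAS_COOLONL_* : ONLINE -> OFFLINE -> Tier-1 (stalled while the ONLINE
    #                   capture process is down)
    # ATLAS_COOLOFL_* : OFFLINE -> Tier-1 only (unaffected)
    def tier1_replication_affected(schema: str) -> bool:
        """Return True if Tier-1 replication of this schema is currently stalled."""
        name = schema.upper()
        if name.startswith("ATLAS_COOLONL_"):
            return True
        if name.startswith("ATLAS_COOLOFL_"):
            return False
        raise ValueError(f"not one of the two conditions schema groups: {schema}")

    if __name__ == "__main__":
        for s in ("ATLAS_COOLONL_XXXX", "ATLAS_COOLOFL_XXXX"):
            status = "stalled" if tier1_replication_affected(s) else "replicating"
            print(f"{s}: Tier-1 replication {status}")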

AOB:

Topic attachments

  • alice-ggus-dec8.pdf (1.8 K, 2008-12-08, JamieShiers)
  • atlas-ggus-dec8.pdf (14.6 K, 2008-12-08, JamieShiers)
  • cms-ggus-dec8.pdf (3.4 K, 2008-12-08, JamieShiers)
  • lhcb-ggus-dec8.pdf (4.3 K, 2008-12-08, JamieShiers)