Week of 081215

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).
  • Status can be received, due or open

Site | Date | Duration | Service | Impact | Report | Assigned to | Status

GGUS Team / Alarm Tickets during last week

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Eva, Maria G, Jamie, Jean-Philippe, Steve, Markus, Harry, Maria D, Dirk, Patricia);remote(Michael, Gonzalo, Gareth, Daniele, Brian).

elog review:

Experiments round table:

  • ATLAS (Simone): 1) This is a summary of the T1-T1 replication activity so far (the 10M files test).

Summary per site (number of COMPLETE datasets, number of INCOMPLETE datasets, completion rate):

Site                  | COMPLETE | INCOMPLETE | Completion rate
TRIUMF-LCG2_DATADISK  | 1408     | 4349       | 0.244571825604
INFN-T1_DATADISK      | 4454     | 1271       | 0.777991266376
TAIWAN-LCG2_DATADISK  | 165      | 5578       | 0.0287306285913
FZK-LCG2_DATADISK     | 4157     | 1565       | 0.726494232786
PIC_DATADISK          | 4760     | 922        | 0.837733192538
BNL-OSG2_DATADISK     | 4000     | 1716       | 0.699790062981
IN2P3-CC_DATADISK     | 2857     | 2887       | 0.497388579387
NDGF-T1_DATADISK      | 4590     | 1159       | 0.798399721691
SARA-MATRIX_DATADISK  | 4184     | 1558       | 0.728665970045
CERN-PROD_DATADISK    | 5042     | 1330       | 0.791274325173
RAL-LCG2_DATADISK     | 2344     | 3424       | 0.406380027739
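
For reference, the completion rate column is consistent with COMPLETE / (COMPLETE + INCOMPLETE). A minimal Python sketch (not part of the original report) that reproduces a few of the quoted values:

    # Recompute the completion rate as COMPLETE / (COMPLETE + INCOMPLETE);
    # this reproduces the table above (e.g. TRIUMF: 0.2446, PIC: 0.8377).
    rows = {
        "TRIUMF-LCG2_DATADISK": (1408, 4349),
        "PIC_DATADISK": (4760, 922),
        "RAL-LCG2_DATADISK": (2344, 3424),
    }
    for site, (complete, incomplete) in rows.items():
        rate = complete / (complete + incomplete)
        print(f"{site:25s} {rate:.4f}")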

Observations: looking at the breakdown by source (which I do not attach here but can provide if needed), the poor replication comes mainly from 3 sources:

    • ASGC: the SRM is unreachable both for put and get. Jason sent a report this morning. It looks like they again had DB problems, in particular ORA-07445.
    • LYON: SRM unreachable over the weekend
    • NDGF: scheduled downtime

In addition, RAL was also showing SRM instability this morning.

The plan is to inject the remaining subscriptions tonight (about 7K datasets) and let the system drain over the week. No issues were encountered in the DDM SS after the latest fixes last Friday.

2) There are still 3 sites which did not pass the "Acrobatic" (read "validation") tests for reprocessing. These sites will need to pass the Acrobatic test before Thursday or they will be excluded from the reprocessing campaign. The sites are CNAF, ASGC and SARA. SARA needs a fix in the pilot job, dealing with the SARA-NIKHEF storage and farm setup; the ASGC failures are caused by the storage not being accessible; CNAF is having problems accessing the software area on GPFS from the WNs.

3) The people responsible for the reprocessing have been alerted about the ONLINE->OFFLINE replication problems and their consequences. I got no answer, but on the other hand, this morning the following appeared on the ATLAS CERN announcements HN from Gancho:

***

Dear All,

As Oracle support didn't provide a feedback so far, we are starting with our procedure on Streams recovery for the following DB schemas on the replica ATONR => ATLR

[..]

as total 27

For the time of the operation the data on the offline DB ATLR will be unavailable. Will let you know when the replica is back in operational mode. Apologize for the inconvenience, Gancho

***

  • Brian (RAL) - looked at FTS in and out rates: 116 files in at a time, 172 read from RAL - FTS balancing.

  • CMS (Daniele) - a couple of new processing rounds are ready for the T1s, both reprocessings of the summer MC production. Two families: RAW->AOD and RECO->ROOT-tuples. All workflows are run by data ops. The final configuration is available and the final green light is being discussed now. The first is likely to start today, the second maybe early next week!

  • ALICE (Patricia) - AliEn upgraded to the latest version, but not yet in production since the latest AliRoot is not yet in production. A few sites are running test jobs.

  • LHCb (Roberto) - an increasing number of sites where jobs fail trying to access the application area. CNAF has been banned precisely for this reason, though it is not evident where the problem lies.
    As said last week, this week is mostly devoted to preparing for the Xmas shutdown, when LHCb will work on a best-effort basis.

  • PIC (Gonzalo) - comment: LHCb opened a GGUS ticket over the weekend. The problem was that jobs couldn't access an SQLite DB - a copy of the conditions data in this format kept in a file on an NFS share. NFS seems to be working fine, but the problems are inherent to SQLite over NFS. Maybe this needs to be followed up - as a site there is not much to do, since NFS is working fine. But is accessing SQLite through NFS not recommended? Any experience?

  • Dirk - is this the known locking issue with SQLite? Gonzalo - "the old SQLite locking issue". Can follow up - the 'workaround' is to globally disable locking (a sketch of such a workaround follows these notes).

  • Maria - ticket number? 44767

  • Michael - had some problems with this in ATLAS, who have abandoned it for the moment. SQLite is convenient but comes at a cost!
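
As a hedged illustration of the "disable locking" workaround mentioned above (not what LHCb actually does): SQLite URI filenames accept a nolock flag, so a read-only copy on an NFS share can be opened without POSIX locks. The file path below is hypothetical, and this is safe only if nothing writes to the file while it is being read.

    import sqlite3

    # Hypothetical path to a read-only conditions DB on an NFS share.
    DB_PATH = "/nfs/conditions/conditions.db"

    # Open read-only with SQLite file locking disabled (nolock=1).
    uri = f"file:{DB_PATH}?mode=ro&nolock=1"
    conn = sqlite3.connect(uri, uri=True)
    try:
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        ).fetchall()
        print(tables)
    finally:
        conn.close()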

Sites round table:

  • ASGC (Jason) - update on the ASGC CASTOR transfer efficiency drop: report - plot

  • RAL (Gareth) - issues affecting the ATLAS test - there were SRM issues there yesterday too. A couple of DB-related issues: one in the DB behind the ATLAS CASTOR stager, another still being investigated. This meant the service was not running effectively from Sunday morning.

Services round table:

  • VOMS (Steve) - voms-core, serving voms-proxy-init commands, was completely unavailable from 00:01 to 07:00 CET on Sunday 14th December. The CERN operators tried to restart it at the beginning of the period without success. The service recovered by itself at around 07:00.
    Cause: A database connection error: " ORA-01652: unable to extend temp segment by 128 in tablespace TEMP. Preparing : SELECT version FROM version"
    Database admins have been alerted and are investigating.
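
Purely as a hedged illustration (not the procedure the CERN DBAs used): ORA-01652 means a statement could not allocate space in the TEMP tablespace, and the sessions currently holding temp segments can be inspected from a monitoring session. The account and DSN below are placeholders.

    import cx_Oracle  # assumes the Oracle client libraries and cx_Oracle are installed

    # Placeholder monitoring account and DSN.
    conn = cx_Oracle.connect("monitor", "secret", "voms-db-alias")
    cur = conn.cursor()

    # v$tempseg_usage lists who currently holds temporary segments and how much.
    cur.execute(
        "SELECT username, tablespace, segtype, SUM(blocks) AS blocks "
        "FROM v$tempseg_usage "
        "GROUP BY username, tablespace, segtype "
        "ORDER BY blocks DESC"
    )
    for username, tablespace, segtype, blocks in cur:
        print(username, tablespace, segtype, blocks)

    cur.close()
    conn.close()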

  • DB (Eva) - working on resync of conditions data online-offline.

  • BDII (Markus) - emergency update. Nothing wrong with the BDII itself but with a consumer: a correction to the BDII schema made the WMS fail. It was easier to change the BDII back while waiting for a fix to the WMS.

AOB:

  • A ticket assigned to ATLAS and copied to Alessandro, 44460, has been open since 5th December. Can someone from ATLAS look at it?

  • List of affected schemas (from ATLAS).
From online to offline:
ATLAS_ATLOG
ATLAS_COCA
ATLAS_CONF_MDT
ATLAS_CONF_TGC
ATLAS_CONF_TRIGGER
ATLAS_CONF_TRIGGER_V2
ATLAS_COOLONL_CALO
ATLAS_COOLONL_CSC
ATLAS_COOLONL_GLOBAL
ATLAS_COOLONL_INDET
ATLAS_COOLONL_LAR
ATLAS_COOLONL_MDT
ATLAS_COOLONL_MUON
ATLAS_COOLONL_MUONALIGN
ATLAS_COOLONL_PIXEL
ATLAS_COOLONL_RPC
ATLAS_COOLONL_SCT
ATLAS_COOLONL_TDAQ
ATLAS_COOLONL_TGC
ATLAS_COOLONL_TILE
ATLAS_COOLONL_TRIGGER
ATLAS_COOLONL_TRT
ATLAS_MDA
ATLAS_MDT_STATUS
ATLAS_OKS_ARCHIVE
ATLAS_RUN_NUMBER
ATLAS_SFO_T0

From online to offline to Tier1 sites:
ATLAS_CONF_TRIGGER
ATLAS_CONF_TRIGGER_V2
ATLAS_COOLONL_CALO
ATLAS_COOLONL_CSC
ATLAS_COOLONL_GLOBAL
ATLAS_COOLONL_INDET
ATLAS_COOLONL_LAR
ATLAS_COOLONL_MDT
ATLAS_COOLONL_MUON
ATLAS_COOLONL_MUONALIGN
ATLAS_COOLONL_PIXEL
ATLAS_COOLONL_RPC
ATLAS_COOLONL_SCT
ATLAS_COOLONL_TDAQ
ATLAS_COOLONL_TGC
ATLAS_COOLONL_TILE
ATLAS_COOLONL_TRIGGER
ATLAS_COOLONL_TRT

From offline to Tier1 sites (in addition to the previous block of accounts):
ATLAS_COOLOFL_CALO
ATLAS_COOLOFL_CSC
ATLAS_COOLOFL_DCS
ATLAS_COOLOFL_GLOBAL
ATLAS_COOLOFL_INDET
ATLAS_COOLOFL_LAR
ATLAS_COOLOFL_MDT
ATLAS_COOLOFL_MUON
ATLAS_COOLOFL_MUONALIGN
ATLAS_COOLOFL_PIXEL
ATLAS_COOLOFL_RPC
ATLAS_COOLOFL_SCT
ATLAS_COOLOFL_TDAQ
ATLAS_COOLOFL_TGC
ATLAS_COOLOFL_TILE
ATLAS_COOLOFL_TRIGGER
ATLAS_COOLOFL_TRT

Tuesday:

Attendance: local(Jamie, Jean-Philippe, Harry, Simone, Olof, Roberto, Patricia, Maria, Steve);remote(Derek, Michael, Gonzalo, Daniele, Gavin, Jan).

elog review:

Experiments round table:

  • CMS (Daniele) - continuing the CRAFT reprocessing: 80% complete at IN2P3, with problems from SRM instability and merge space; MinBias 100% complete at RAL. MC production: the main physics request is almost done, 251M events produced with CMSSW 2_1_7 and 200M events reconstructed. Over the Xmas period MC production will continue on best effort: an OSG operator throughout, and an LCG operator for the EU and Asian sites except 20 Dec - 4 Jan, during which the OSG operator will also look at production at the other sites (very best effort, as they are less in touch with these sites).

  • CMS ( cut&paste from Daniele) - Fast update wrt yesterday's report: CRAFT re-reco status: /Cosmics 80% complete (IN2P3: fixed problem on unmerged space, plus some SRM instability now addressed); /MinimumBias ~100% complete (RAL); /Calo <10% complete (FZK: initially CMSSW installation problems, then submission problems: being investigated). --- MC production. Main physics requests: 251M evts produced (GEN-SIM-RAW, CMSSW_2_1_7). 199M evts reconstructed (CMSSW_2_1_8). Over Christmas: best effort: OSG goes on (1 CMS MC operator), LCG as well - with no LCG operator in the 20/12-04/01 slot though, so only the CMS OSG MC operator will be reachable (very-best effort) for the EU/Asia regions as well in the aforementioned time window.

  • ATLAS (Simone) - status of the tests: see yesterday's table of completion rates. Healthy sites received 70-80% of the datasets from the T1-T1 stress test. Between 5% and 8% will never make it, due to last week's DDM bug: several subscriptions were not treated and were removed; the problem is cured but those datasets are still in the statistics. At least one site is basically down: Lyon. This morning Stephane reported a long list of problems (SRM, gridftp servers, h/w, ...); they are working hard to fix them but efficiency is low, and datasets with Lyon as source have problems. RAL has had problems in the last couple of days; Graeme forwarded information from RAL this morning (comment from RAL below). Taiwan is not fully down but at 50% efficiency, with SRM timeouts on both put and get. PIC started showing SRM overload and timeouts in the last few hours. The situation is not healthy and it makes no sense to inject new subscriptions: with 4 T1s showing problems the results of the tests would be altered, so the decision was taken today to stop the test now. The plan is to restart the default 5% functional tests (T0-T1, T1-T1, T1-T2), running as low-rate tests over Xmas to keep the sites active (already foreseen); they will be displayed in the new dashboard and will run for the next 2 weeks, with the status checked after the break. On Jan 11 the 10M files test will restart. Concerning other activities, the main one being tested is reprocessing: datasets are being delivered now to CA, DE, ES, US, NDGF and UK. 3 sites have not passed validation for reprocessing: SARA, CNAF and ASGC. If they have not passed by Thursday their datasets will be redistributed to healthy sites and they will be out of the test. Over Xmas: functional tests continue for DDM and for the production system; distributed shifters are on duty in 2 time regions (US & EU), covering 16h out of 24; experts for DDM, Panda, dashboard and DB are on best effort. Yesterday's AOB about the GGUS ticket is now solved and verified.
    Brian (RAL) - the issue was the stager DB: the transaction rate meant log files could not be copied off as fast as they were being written, which filled up the stager DB. Some changes were made to try to improve this; things broke again and further changes were made. FTS channel settings were reduced to half the normal rate; with the large 10M files test stopping it should be possible to increase the channels again for the normal production run. One area reduced quite a lot was the T2s. RAL is looking at new h/w and at upgrading the SRM. Simone - is it really load related? Brian - there were many failed transfers from Lyon, plus work on the production side in addition to the tests; jobs were taking longer, so more user status requests came in, which then added to the problem. As rates were reduced we ended up getting more requests, and transactions started occurring.

  • LHCb (Roberto) - a quick overview of Xmas: some real MC production is about to come, but nothing is running yet, just the usual dummy MC. Over Xmas things will run more or less unattended, plus some random user activity. The stripping activity supposed to take place before Xmas is still pending: the LHCb core s/w release is not yet fully validated, hence the delay. Ulrich has set up a PPS instance of a CE pointing to an SLC5 batch farm; it will be tested, and LHCb would like the other VOs to stop using this CE for 2 days. Last point: one disk server at CERN, lxfsrj0502 - all transfers out are failing; a firewall issue? Jan - please update the ticket with more info; timestamps etc. would be useful. Derek (RAL) - will be swapping in new SRMs tomorrow at 11:00 GMT, with 1 hour "at risk" in the GOCDB.

  • ALICE (Patricia) - the LHCb request was just received and ALICE has stopped submission to this CE; the sysadmins are adding 50 more nodes so it will be possible to share it. ALICE will continue production over Xmas; as soon as AliRoot is in place, production is expected to continue as normal over Xmas.

Sites round table:

  • ASGC (Jason) - transfer problems resolved. The only remaining problem is that the subrequest table is full; full table scans are slow, so it is being shrunk to reduce the table size and improve performance. ATLAS production transfers are at the 100% mark. Simone - will check, thanks!

A brief on the CASTOR transfer problem addressed yesterday. I tried to connect to the WLCG concall but had no luck opening the chat window.


Thanks to assistance from the local DBA and Nilo from CERN. The DLF procedure recompilation error has been escalated to Oracle with an SR open; we will try to fix it asap. The most effective workaround so far is shrinking the subrequest table: top activity has been observed with status=6 and the execution plan performs a full table scan, so reducing the size of the table helps.


Besides this, db_cache_advice has been switched off (recommended by Nilo, who confirmed that in the past it caused a lot of contention on the blocks stored in the buffer cache whenever there was high activity in the buffer cache).


The ATLAS dashboard confirms that transfers have been passing since 12:30 CET:
http://dashb-atlas-data-tier0.cern.ch/dashboard/request.py/site?name=&statsInterval=4

SAM SRMv2 probes are passing normally now:
https://lcg-sam.cern.ch:8443/sam/sam.py?funct=ShowHistory&sensors=SRMv2&vo=OPS&nodename=srm2.grid.sinica.edu.tw


The following tickets are now closed:

- #44720 UPDATE for ROC_Asia/Pacific TW-ASGC Problem with file transfer at TAIWAN-LCG2_MCDISK
- #44783 UPDATE for ROC_Asia/Pacific TW-ASGC SRM problems in ASGC
- #44777 UPDATE for ROC_Asia/Pacific srm2.grid.sinica.edu.tw down
- #44720 TEAM TICKET UPDATED - Problem with file transfer at TAIWAN-LCG2_MCDISK
- #44755 UPDATE for ROC_Asia/Pacific TW-ASGC all jobs fail with EXEPANDA_DQ2_STAGEIN
- #44706 UPDATE for ROC_Asia/Pacific transfer errors at RAL and LYON due to srm2.grid.sinica.edu.tw unscheduled downtime


Services round table:

  • SRM ATLAS (Jan) - the missing team ticket: the service managers found 6 GGUS mails in their SPAM folders! Asking mail support to whitelist helpdesk@ggus.org. Cannot say why this mail got flagged as SPAM. Recommend setting spam filters to medium rather than high. Simone - from the test? Yes, 4 of the 6 were team tickets.

AOB:

  • Maria - the 4 VOs are asked to please think about this for the 15 Jan user meeting. We would like to enhance the VO support unit in GGUS to help the TPMs, who do not always know how to match applications to a VO, e.g. "PANDA". If the VOs provide this information it will be made available to the other supporters.

Wednesday

Attendance: local(Harry, Jean-Philippe, Nick, Jan, Julia); remote(Andreas (FZK), Daniele, Michael).

elog review:

Experiments round table:

ATLAS (by email from SC): The functional test at 5% rate has restarted for T0-T1; T1-T1 and T1-T2 will restart on Thursday (the normal weekly turnover). Distribution of cosmic datasets for reprocessing (not delivered in the previous months for whatever reason) is ongoing. No major issue to report.

LHCb (by email from RS): The problem with one disk server at CERN reported yesterday has been further documented; it seems to be a gridftp problem, but we are waiting for confirmation.

CMS (DB and AS):

IN2P3 have identified their srm issue as a bug in srm 1.9.0-4 which affects writing but not reading. Other issues seen since the weekend have been identified and resolved. There was an impact on reprocessing at the site.

CNAF made an intervention which caused CMS SAM tests to fail. There was a possible loss of propagation of conditions data, as all reprocessing was failing. CMS is cleaning the CNAF squid cache and will resubmit the reprocessing work. WMS011 at CNAF started aborting all jobs (mostly analysis) with an authentication error. It has been removed from the configuration and is being debugged.

FZK have started publishing a CREAM-CE which is attracting CMS jobs which then fail. Andreas reported this is in production use by other VOs. Two solutions were suggested - to stop publishing it as a CE available to CMS or change the GlueCEState to not be 'Production' if the other users are using direct submission to the CE. CMS also detected the failure of a central FZK router - see the FZK site report.
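
To make the suggested check concrete: the state a CE publishes can be read back from a BDII. A hedged sketch follows; the BDII endpoint and CE hostname are placeholders, and in GLUE 1.3 the relevant attribute is GlueCEStateStatus (a value other than 'Production' keeps normal brokered submission away from the CE).

    import subprocess

    # Placeholder top-level BDII and CREAM CE host; adjust for the real site.
    BDII = "ldap://lcg-bdii.cern.ch:2170"
    CE_HOST = "cream-ce.example-site.de"

    # Query the published GlueCEStateStatus for the CE's queues.
    cmd = [
        "ldapsearch", "-x", "-LLL", "-H", BDII, "-b", "o=grid",
        f"(&(objectClass=GlueCE)(GlueCEUniqueID=*{CE_HOST}*))",
        "GlueCEUniqueID", "GlueCEStateStatus",
    ]
    print(subprocess.run(cmd, capture_output=True, text=True).stdout)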

Sites round table:

FZK (AH): Just after midnight a router failed in a way that also killed its monitoring, so the problem was not seen overnight. It turned out to be caused (which should not happen) by two 10 GigE fileservers connected to the router. The fileservers had a firmware change some days ago, but how they could bring down a router is not understood. A post mortem analysis will probably be prepared.

ASGC (by email from JS): FYI, another transfer problem was observed earlier this morning:

- First event: starting from '16-Dec-2008 19:35:14' we observed that the ASGC SRM v1/2 were not functioning normally, and both production transfers and SAM probes kept failing with the same error message.

We took urgent action (our DBA working in a different time zone) and applied the recommended shrinking to the subrequest table again to keep the table size smaller for the full table scan. Lock contention observed on the SRM DB has already been reduced a lot, active sessions contributed by the application have been suppressed to fewer than 3-5, and user I/O improved after shrinking the subrequest table. Throughput observed in the local monitoring system is limited to 60MB/s at peak. You can refer to the three figures attached in this thread: fig-ddm is the DDM transfer plot of the last four hours, fig-2 is the DB issue addressed above (able to cover more detail) and fig-4 is the network traffic extracted from gmond.

The efficiency of ATLAS DDM is increasing now, while overall recovery might take another 30 minutes to an hour, including the SAM functional tests carried out by both OPS and ATLAS.

The SAM SRMv2 probe was failing with:

+ lcg-ls -t 120 -b -T srmv2 -l -d 'srm://srm2.grid.sinica.edu.tw:8443/srm/managerv2?SFN=/castor/grid.sinica.edu.tw/d1t0/ops' [SE][Ls] httpg://srm2.grid.sinica.edu.tw:8443/srm/managerv2: CGSI-gSOAP: Could not open connection

Nilo Segura (CERN) has been looking at these problems and sent in his analysis:

- Problems with two indexes caused a database slowdown:

+ A missing index in the SRM installation: after the index was identified, they created it manually.
+ A disabled index in the Stager database: they manually re-created it.

They do not know how the DB ended in this state.

- Some partitions of the SUBREQUEST table were not properly shrunk. This caused some operations to take longer than they should. They manually fixed this problem (running ALTER TABLE SUBREQUEST MODIFY PARTITION XXXX SHRINK SPACE ...).

+ There is a CASTOR database job that should take care of shrinking certain key database tables. Is the job running correctly? No idea; they are going to check.
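
A hedged sketch of the kind of maintenance described above (illustrative only: the credentials and DSN are placeholders, the partition names come from the data dictionary, and SHRINK SPACE generally requires row movement to be enabled on the table):

    import cx_Oracle  # assumes the Oracle client libraries and cx_Oracle are available

    # Placeholder connection to the schema owning the SUBREQUEST table.
    conn = cx_Oracle.connect("stager", "secret", "asgc-stager-db")
    cur = conn.cursor()

    # SHRINK SPACE normally requires row movement to be enabled on the table.
    cur.execute("ALTER TABLE subrequest ENABLE ROW MOVEMENT")

    # List the partitions of SUBREQUEST from the data dictionary, then shrink
    # each one, as in the manual fix described above.
    cur.execute(
        "SELECT partition_name FROM user_tab_partitions "
        "WHERE table_name = 'SUBREQUEST'"
    )
    partitions = [row[0] for row in cur.fetchall()]
    for name in partitions:
        cur.execute(f"ALTER TABLE subrequest MODIFY PARTITION {name} SHRINK SPACE")

    cur.close()
    conn.close()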

Another point we are examining is that they seem to suffer contention on RAC DB events, which happens when several sessions access the same data from different RAC nodes. They claim they do not load balance the DB sessions across the RAC nodes. It could be that some DB jobs were executed on a different node, and that caused the traffic through the RAC interconnect. This is fixed in one of the latest Stager releases (we force the jobs to execute on the right node). I think Giuseppe suggested that they upgrade to the latest 2.1.7-x stable release.

- Some PL/SQL procedures in the DLF database became invalid and cannot be fixed. I told them to send the errors to the castor-external operations mailing list. We have seen these problems at CERN from time to time in DLF, and normally we fix them immediately.

Services round table:

AOB: Sites and experiments are invited to try to enter a test line into next week's daily report to verify access during the two-week Xmas break.

Thursday

Attendance: local(Harry, Jan, Patricia, Roberto, Simone, MariaDZ);remote(J.Kelly(RAL),Michael).

elog review:

Experiments round table:

ALICE (PM): SLC5 testing ongoing. They have been trying for a long time to put a VObox into production in Birmingham (UK), but there is a problem using the right certificate when addressing myproxy. Also, the only WMS they have in France, at DAPNIA, has not been working for a long time, and this needs escalating.

LHCb (RS): 1) Testing SLC5 via the CE in the CERN PPS. gssklog fails under SLC5 inside a full Dirac job but not outside of a Dirac job; it may be a conflict in the OpenSSL library. 2) The gridftp problem announced on Tuesday is now understood to be the same bug in lcg-cp as was found in lcg-rep. Jan said a patch may be ready tomorrow but probably not in production until February; LHCb are happy to use the patched version out of the AFS applications area. 3) For their shared file system problem at CNAF, gpfs will have to be restarted, so they are draining worker nodes for this. ATLAS is also affected, with long times to create files. 4) Many sites see failures accessing an SQLite conditions DB that is NFS-mounted, due to a known locking problem in NFS. Sites can mount using nolock, which they might not like, but with a new version of the LHCb application in January the problem will go away as files will be copied locally.

ATLAS (SC): Plans for Xmas have not changed. An open point is whether they will do bulk reprocessing or not; this should be decided today.

Sites round table:

Services round table:

VOMS (by email from ST): there is now a post mortem for last Sunday's loss of VOMS service at https://twiki.cern.ch/twiki/bin/view/LCG/VomsPostMortem2008x12x14

AOB: MariaDZ is trying to collect contact points inside the experiment support mechanisms to help the TPMs in GGUS pose follow-up questions on tickets. They will be added to the VOid cards, but so far only LHCb has replied. Harry reported that Gareth Smith had, as requested (thanks), tried to edit this Twiki using his external login and his grid certificate, and both failed. It turns out that the edit function needs a CERN login, so apologies for that. The alternative is to mail wlcg-scod@cern.ch, which will be looked at over Xmas.

Friday

Attendance: local(Jean-Philippe);remote(Michael,Jeff).

elog review:

Experiments round table:

ATLAS (by mail from SC): The ATLAS reprocessing exercise will now go ahead and 300K jobs will be defined today. The list of sites participating is still being finalized.

Sites round table:

RAL (by email from GS): We have been having problems with one of our WMS systems (LCGWMS01). Initially there were non-WLCG jobs filling the sandbox; however, it looks like there may also be other problems on the system. It is marked as an unscheduled outage until 28th December as the main support person is away. We still have another WMS (LCGWMS02) running.

Services round table:

AOB:

-- JamieShiers - 09 Dec 2008
