Week of 081201

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).
  • Status can be received, due or open

| Site | Date | Duration | Service | Impact | Report | Assigned to | Status |
| CERN | 25/11 | ~8 hours | cooling | batch | https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20081125 | TNT | received |
| ASGC | - | TBD | CASTOR/Oracle | CASTOR service | - | ASGC | pending |

GGUS Team / Alarm Tickets during last week

  • HINT - save the output of these queries as PDF and attach them to this page for the given week

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Maria D, Julia, Jamie, Nick, Jan, Harry, Simone, Patricia, Roberto, Maria, Olof, Flavia);remote(Gareth, Daniele, Brian).

elog review:

Experiments round table:

  • ATLAS (Simone) - over the w/e quite a few problems with data movement in general. More ATLAS-related - looks like high load / overload of the ATLAS DDM site services. A new version of the site services, about to be tested, should address these issues. Instead of patching the existing version -> testing and deployment of the new version asap. Started this morning. Installed on the VO box which serves RAL, SARA, Lyon and FZK. Functional tests for the sites served by the new site services are running and being monitored. Tests will continue until Thursday - if no major issues, proceed to the massive T1-T1 test with 10M files (described ~2 weeks ago). Q: missing datasets in Canada? A: this is just one of the problems that occurred. Datasets were not dispatched on a short enough time scale and not always completely. Brian - if it works the test will start Thursday and run 7-10 days? A: yes, that's it!

  • CMS (Daniele) - at CERN dataops people are using the new ProdAgent to run some reprocessing on cosmics. At the T1s with custodial data: IN2P3 for cosmics, FZK for calo, RAL for the minimum bias dataset (22-67TB; see table below). Will take some time. FZK starting, but fine. IN2P3 & RAL - will summarize in following calls. Gareth - looks like we've shifted about 1PB of data internally over the w/e! Maria D: an AOB-style question - meeting this Thursday with OSG management to discuss direct routing to OSG sites. No such request from CMS. Daniele will follow up with Matthias.

| Site | New Dataset | Size |
| IN2P3 | /Cosmics/Commissioning08-ReReco-v1/RECO | 69.6TB |
| FZK | /Calo/Commissioning08-ReReco-v1/RECO | 62.2TB |
| RAL | /MinimumBias/Commissioning08-ReReco-v1/RECO | 19.9TB |

  • LHCb (Roberto) - see following plot:

LHCb-cumulative_plot_of_staging_jobs_run-dec1.png
It shows the number of staging jobs in the last week, reflecting the load each T1 went through during this week of the staging exercise. It is impressive how the throughput - read in terms of files staged (a file to be staged = a grid job that reads the internal GUID to verify the file is OK) - has been sustained steadily this week by many sites. GridKA suffered a bit more from the small size of the files used, and IN2P3 pointed out two dCache-related issues that have been filed as dCache bugs (see GGUS 44208 for traceability).

The above exercise is now over. Sites could cope with file staging OK. Will resume the same exercise but with more representative file sizes - e.g. 2GB as for real data. On the other hand this exercise (see plot) shows all Tier1s were well behaved WRT these requests. The number of staging requests was 10x the effective need of LHCb at every T1, so for 2GB files it should be much better. Another point on this "1st phase" is to prepare some documentation for sites - only possible when Andrew is back from the UK. A couple of piquet calls over the w/e: VO boxes - 06 with disk full, 09 with heavy swapping. Operator follow-up was all correct. Olof: are you still using the RB & LB services? For how long? A: phase out by end 2008, plus a margin of a couple of weeks. DIRAC3 can now also support analysis; DIRAC2 is still being used by people doing analysis. Olof: schedule the stop of the service for end Jan?

  • ALICE (Patricia) - not too much to say - the experiment is running 450 jobs, all analysis. Waiting for the latest version of AliEn, still not deployed anywhere. The module for WMS submission is being stressed by analysis users. Deployed at all T1s; some analysis jobs running at these...

Sites round table:

  • RAL (Gareth) - incorrect link in the incident report! 1 disk server for ATLAS is currently out of access following Friday's power glitch; hope to have it back later today. The RAL RBs have been retired (RIP).

Services round table:

  • CASTORSRM (Jan) - upgraded srmpublic to 2.7. Procedure straightforward and smooth; a walkthrough for the experiment upgrades. Problem with lcg-rep - CE tests fail because of this. Understand how to work around it; no experiments have reported any problem with it, not a show-stopper. ATLAS on Wed, LHCb and ALICE Thu, CMS next Tue (PhEDEx upgrade scheduled that day?). Brian - although experiments may not use lcg-rep some sites do, e.g. for clearing problem files. Jan - more info: CASTOR gridftp can be used in 'internal' and 'external' modes. Always configured as 'external' so far, but less efficient than 'internal'. Can be configured on a VO-by-VO basis. Will configure internal mode for all LHC VOs. The SAM test under the OPS VO will be reconfigured to use 'external'. Reconsider the SAM test to use something that the experiments actually use! This is a CE test. Brian - ATLAS SAM tests on the CE use lcg-cr, then lcg-cp, then lcg-del. A ticket is in at the moment as a config problem with CASTOR, since ATLAS don't use space tokens on copy, which causes issues. Jan - the problem with lcg-rep has been reported to the developers. Roberto - upgrade only for SRM? A: yes, the mass-storage back-end upgrade is not coupled to this and won't come out on this timescale. Procedure: put FTS into draining mode, then swap the alias, then test, then re-open. The time taken is governed by FTS draining; might be quicker. Servers "at risk" in GOCDB for 1h. Simone - should we stop sending data to the PPS endpoint? A: yes - you can continue until tomorrow but...

  • DB (Maria) - have patched all the online DBs of LHCb, ALICE and CMS (the ATLAS online DB is in the CC so it was done together with the rest in the CC).

  • Releases (Nick) - gLite 3.1 update 37. It fixes the BDII to prevent the problem seen at GRIF; this had been seen and fixed before but came back again...

  • VOMS (Maria) - Steve upgraded all the services this morning. The patches were out for a long time so we don't expect problems, but...

  • ASGC (Harry) - now working again at 90% efficiency.

AOB:

Tuesday:

Attendance: local(Gavin, Jan, Simone, Steve, Julia, Miguel, Gareth, Roberto, Patricia, Flavia, Ignacio, Marias G&D);remote(Michel, Michael, Gonzalo, JT).

elog review:

Experiments round table:

  • ATLAS (Simone) - testing of the DDM site services is progressing; a few issues are being addressed. Currently functional tests to FZK, Lyon, SARA, RAL as before. Also a new release of the dashboard is being tested; it shows the breakdown of activities into "functional blocks". Monitoring is "a bit fishy" - some clouds on one dashboard, some on another. Hope to converge soon. Basically a testing phase (in case you are wondering why the traffic is not there - see the other dashboard). Reprocessing exercise for ~20% of the cosmic data from the last months: a more or less final list of datasets has been provided to data management operations by physics coordination, distributed evenly to several T1s. 90% are on disk, so reprocessing will probably not trigger a big tape recall; it is driven by physicists' needs. Meant to start in November - late. Problems with the s/w release: tested, problems found, fixed, tested again etc. Target is to start before the Christmas holidays; have to reassess in a week. ASGC - see traffic for the MCDISK endpoint, stable since yesterday; reintroduced this morning in the functional tests. JT: GGUS ticket 443303 - both RAL & SARA are mentioned, but it is assigned to the UKI ROC. Simone: will check. JT: reminder that a site may be in the SARA cloud but a problem is not necessarily for SARA - the problem is with a site in Russia.

  • LHCb (Roberto) - amendment: the staging exercise "was" over but... the person responsible for running it was not attending the meeting and so the tests continue. Looks fine. LCG RB retirement - yesterday's statement was that LHCb would like to keep it at least until Jan. A small community needs to access DC06 data, which is only possible through DIRAC2; not clear that the RB can be retired on this timescale... Dummy MC production: running smoothly. Mainly problems with shared areas at small centres (i.e. non-T1 and non-large T2s). Plan to start the lumi5 stripping activity at the T1s + CERN today or tomorrow. JT: 1) GGUS ticket 44202 - problems getting back some files in the staging exercise. No response from LHCb since last week despite much info from the SARA side. Roberto - ticket opened for traceability? Will look at the ticket and check. JT: dashboard overview for LHCb - SARA still red. Roberto - the whole site shows red because one test (WMS), set as critical in the LHCb dashboard, is failing; the site is usable. Gonzalo: prestage tests finished?? Yes, and... plans to resume but with a realistic file size; will resume soon with 2GB files. Simone - met with Patrick Fuhrmann and decided to disentangle the tests, starting without SRM. Ron will try to pre-stage with native dCache commands and see what the performance is. Ron wants to wait for the end of the LHCb tests before starting this - would like to know when the tests are really over...

  • ALICE (Patricia) - not running, not yet. FZK announced a CREAM CE setup this morning; will start testing today. Michel - have you been able to assess the correct behaviour of the GRIF WMS? A: will let you know today. AFAIR it was tested one week ago and was OK.

Sites round table:

  • BNL (Michael) - yesterday observed an impact on the performance of the storage system, in particular the components based on Sun Thumpers. Caused by "scrubbing", which is used to ensure data integrity on the servers. The operation started simultaneously on many servers due to a common configuration. Whilst this is a critical operation it should not be synchronised, so that the system can always provide sufficient service to applications (see the staggering sketch after this list). It ran for >12h and hence had a visible impact on BNL's performance re data replication. US cloud - problem at Wisconsin with file registration: the LFC component there doesn't work properly, believed to be a Globus bug? Reported - waiting for a fix from Globus.

  • FZK (Harry) - quite high data rate for ~1 week from unregistered VO - might be COMPASS. Jan will check logs.

  • CERN (Jan) - will upgrade the ATLAS SRM endpoint tomorrow at the latest; will confirm ALICE & LHCb for Thursday. Have to apply Oracle patches on the CASTOR + SRM DB servers and would like to do this next week; this results in a downtime of 1h for the CASTOR instances at CERN. SRM PUBLIC was upgraded yesterday and is failing SAM tests; the tests continue to fail because the config change was not consistently applied. Individual CEs are predominantly red but CERN overall is mostly green - "OK" but degraded.

  • DB (Maria) - PIC will have an intervention on 10 Dec 14:00-18:00 for a streams-related patch. Trying to negotiate with the experiments the need for full online DB support over Xmas. Support for the offline DBs is OK; for online, CMS has clarified it is not needed, while ATLAS would like a full production-level service but this cannot be provided - not at night and not on holiday days. Offline will have the agreed coverage.
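
To illustrate the de-synchronisation point from the BNL report above, here is a minimal sketch (not BNL's actual configuration) in which each storage server derives a deterministic per-host delay from its hostname before launching its periodic scrub, so a fleet with an identical configuration no longer starts scrubbing simultaneously. The pool name, the 12 h spread and the use of "zpool scrub" are illustrative assumptions.

<verbatim>
#!/usr/bin/env python
# Illustrative sketch: stagger periodic integrity scrubs across a farm of
# storage servers by deriving a per-host delay from the hostname.
# "tank" and the 12 h spread are made-up values, not from the minutes.

import hashlib
import socket
import subprocess
import time

SPREAD_SECONDS = 12 * 3600    # spread scrub starts over 12 hours
POOL = "tank"                 # placeholder ZFS pool name

def stagger_offset(hostname, spread):
    """Deterministic per-host delay in [0, spread) seconds."""
    digest = hashlib.md5(hostname.encode()).hexdigest()
    return int(digest, 16) % spread

if __name__ == "__main__":
    delay = stagger_offset(socket.gethostname(), SPREAD_SECONDS)
    print("sleeping %d s before starting scrub" % delay)
    time.sleep(delay)
    subprocess.call(["zpool", "scrub", POOL])
</verbatim>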

Services round table:

  • VOMS (Steve) - Summary:
    • 10:00 -> 22:00: voms proxies were limited to the default 86400 seconds rather than the VO-specific values.
    • Sysadmin error; it would have been noticed in 30 seconds if VomsPilot had been tested by ALICE, CMS, LHCb or ATLAS.
    • 10:00 -> 23:00: voms-admin was using the wrong (read-only) database account. Consequence: user requests could be processed but were not implemented. All pending changes have now been implemented.
    • 23:30 -> 00:30: unrelated to the VOMS upgrade, a security scan caused a crash. An operator alarm was raised and the situation was corrected by them. The LinuxHA configuration has been wrong forever, so it did not correct for this automatically.
    • Details: https://twiki.cern.ch/twiki/bin/view/LCG/VomsPostMortem2008x12x01
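
As an illustration of the kind of quick check that would have caught the proxy-lifetime regression, the sketch below compares the attribute-certificate lifetime of a freshly issued proxy against the 86400 s default. It is not part of the VomsPilot tests; the 90000 s threshold is an arbitrary illustrative value, and it assumes voms-proxy-info is available on the path and that a proxy was created with a VO-specific validity longer than one day.

<verbatim>
#!/usr/bin/env python
# Hypothetical sanity check: verify that a freshly issued VOMS proxy carries
# an attribute-certificate (AC) lifetime longer than the 86400 s default.
# Assumes a proxy already exists, e.g. created with:
#   voms-proxy-init -voms atlas -valid 96:00

import subprocess
import sys

EXPECTED_MIN_SECONDS = 90000  # anything above the 86400 s default

def ac_time_left():
    """Return the remaining AC lifetime in seconds, as reported by voms-proxy-info."""
    out = subprocess.check_output(["voms-proxy-info", "-actimeleft"])
    return int(out.strip())

if __name__ == "__main__":
    left = ac_time_left()
    if left <= EXPECTED_MIN_SECONDS:
        sys.exit("AC lifetime is only %d s: VO-specific limit not applied?" % left)
    print("AC lifetime %d s: VO-specific limit looks applied" % left)
</verbatim>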

AOB:

Wednesday

Attendance: local(Harry,Julia,Simone,Gavin,Olof,Roberto,MariaDZ,Jan);remote(Jeremy,Michael,Jeff,John Kelly,Michel).

elog review:

Experiments round table:

ATLAS (SC): Validation of the new DM site services is still pending. Hope to have some progress later today or tomorrow. Last night's functional tests got into trouble with a bad certificate, resolved this morning.

LHCb (RS): Working with the dashboard team to follow up the NL-T1 issue of site criticality display. Issue of gssklog failing at CERN understood to be a problem with 32-bit version running on 64-bit nodes and where several proxy renewals have occurred. Workaround is to run the updates on SLC5 full 64-bit nodes. Staging tests at NL-T1 have definitively stopped so ATLAS can go ahead with their tests. LHCb have lost data at PIC with a broken tape and are looking to see if the data is replicated elsewhere.

Sites round table:

BNL (by email from ME): The IP address change of the BNL SRM server was pushed out (and therefore propagated) to primary name servers outside BNL (e.g. ns1.es.net). Depending on the DNS cache lifetime (TTL) at sites, the updated address is either already in effect (used by services at those sites) or not. Many sites are known to run their cache with a TTL of 24 hours; if they don't refresh their DNS cache manually it may take up to another ~20 hours before transfers will succeed (see the sketch below). Meeting report: things are looking much better this morning following the installation of a dual-ported configuration on the SRM server - one port for internal use and the other directly attached to the LHC OPN address space. This is part of the plan to allow policy-based routing, which has caused problems in the past when external network circuits have failed, to be abandoned. This is the first of 3 steps, with FTS and the file catalog to come. Will try to optimise future steps - the DNS cache issue is technically easy for a site to refresh, more a communications problem among the Tier 1s. A second big step was migrating from the LRC to LFC, which is almost completed. Data replication via LFC is working but not yet the mapping of roles to accounts as needed on the SEs. The Panda developers should have this ready by midday.
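
For sites wanting to check where they stand, the minimal sketch below asks the local resolver (via dig) which address it currently returns for the SRM host and how long it will keep caching it. The hostname is a placeholder, not the real BNL endpoint, and the parsing assumes the standard dig answer format of "name TTL class type rdata".

<verbatim>
#!/usr/bin/env python
# Minimal sketch: show the address the local resolver currently returns for a
# renumbered host and the remaining cache TTL in seconds.
# "srm.example.bnl.gov" is a placeholder, not the real BNL hostname.

import subprocess

HOST = "srm.example.bnl.gov"   # placeholder for the renumbered SRM server

def cached_answer(host):
    """Return (address, remaining_ttl_seconds) as seen by the local resolver."""
    out = subprocess.check_output(["dig", "+noall", "+answer", host, "A"])
    for line in out.decode().splitlines():
        fields = line.split()
        # Expected layout: name  TTL  IN  A  address
        if len(fields) >= 5 and fields[3] == "A":
            return fields[4], int(fields[1])
    return None, None

if __name__ == "__main__":
    addr, ttl = cached_answer(HOST)
    print("resolver returns %s, cached for another %s s" % (addr, ttl))
</verbatim>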

RAL (by email from GS): - On Monday I reported a disk server unavailable. This went back into service yesterday morning (Should have reported this yesterday).
- Problems with Atlas transfers between RAL and Tier2s for about four hours (3pm to 7pm CERN time) yesterday. Atlas were using the new (test) FTS server here. Don't know cause - or indeed why it went away - yet. Understand Atlas will switch back to our production FTS before the 10 million transfers test.
- Assume Atlas will confirm when the 10 million test will start. Answer from SC - will not be today nor, probably, tomorrow. Will let you know.

NL-T1 (JT): Thanks for the LHCb dashboard critical tests work - important for our weekly operations meeting. Have put in a GGUS ticket on a problem with the time limits displayed when SAM tests fail - Julia to follow up. A second issue is the rumour that ATLAS want to clean up all files made before 1.1.08. Simone clarified this - there is a campaign to get rid of data in srmv1. ATLAS physics coordinators will decide what srmv1 data is to be kept; this will then be moved to srmv2 endpoints and the srmv1 endpoints will be decommissioned. There is some early data important for validation to be kept, and sites will get a list of these in 1-2 weeks.

Services round table:

Dashboard (by email from JA): Summarizing the situation with the availability calculated in the Dashboard application, in particular concerning the SARA and NIKHEF sites.
The Dashboard application was developed to give the VOs flexibility in estimating site performance from various perspectives, for example 'can the site be used at all by my VO?', 'can a particular VO activity be performed at the site?', etc. I think not all VOs are currently exploiting this flexibility; ATLAS and LHCb are. The main availability which should be considered by the site, and exposed by high-level tools like the site GridMap, is naturally the one which answers the first question and reflects site usability in general.
Roberto had introduced this availability for LHCb. It is called 'LHCb critical' and the first results are already visible on the UI: http://dashb-lhcb-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=500&sites=LCG.NIKHEF.nl&algoId=82&timeRange=last24 In addition to the new availability type, Roberto asked for modifications in the description of the LHCb topology. This change also concerns NIKHEF-SARA: from the LHCb point of view these two sites should be regarded as a single site, called NIKHEF.nl, which combines the services residing at NIKHEF and SARA. William implemented the change according to Roberto's request.
I'll try to clarify with ALICE and ATLAS whether for these two VOs we expose correct availability in the Site GridMap. For CMS and LHCb it is fine.

lcg/gfal dCache issue (by mail from FD): it has been noted that at many dCache sites, applications making a request for a TURL with a Get operation do not release the TURL after the operations on the TURL are completed. This causes the allocation of request slots in dCache, with long queues. Please, if you use lcg-gt to get a TURL, make sure you use lcg-sd to release the TURL once done. If you use the gfal library, please remember that a gfal_get should be followed by a gfal_release to release the TURL (a sketch of the pattern follows below). I will ask for this to be documented in the WLCG gLite user guide.
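
A minimal sketch of the requested get/release pairing, wrapping the lcg-gt / lcg-sd commands named above from Python: the release is done in a finally clause so the dCache request slot is freed even if the work on the TURL fails. The SURL is a made-up example, and the assumption that lcg-gt prints the TURL, request id and file id as the first three tokens of its output is illustrative, not taken from the minutes.

<verbatim>
#!/usr/bin/env python
# Sketch of the recommended pattern: every lcg-gt (TURL request) is paired
# with an lcg-sd (release), even if the work on the TURL fails.
# Assumption: lcg-gt prints TURL, request id and file id as its first tokens.

import subprocess

SURL = "srm://some.dcache.site/pnfs/example/file"   # placeholder SURL

def with_turl(surl, protocol, work):
    """Get a TURL, run work(turl), and always release the request slot."""
    out = subprocess.check_output(["lcg-gt", surl, protocol]).decode()
    turl, reqid, fileid = out.split()[:3]
    try:
        return work(turl)
    finally:
        # Release the TURL so dCache can free the request slot.
        subprocess.call(["lcg-sd", surl, reqid, fileid])

def show(turl):
    print("would now read %s" % turl)

if __name__ == "__main__":
    with_turl(SURL, "gsiftp", show)
</verbatim>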

CERN CASTOR/srm (JvE): ATLAS srm has been upgraded to the latest version and those of ALICE and LHCb are planned for next week. There should also be a CASTOR upgrade next week. There is a potential srm configuration problem when internal gridftp is used as this requires FTS 2.1. To be followed up.

AOB:

Thursday

Attendance: local(Ewan,Gavin,Miguel,Harry,Julia,Roberto,Olof,Simone,Andrea);remote(Gareth,Jeff,Michael,Brian).

elog review:

Experiments round table:

LHCb (RS): Trying to test the PPS CE at CERN that supports the SLC5 WNs, they find it is not matchable from the UI. Trying to isolate the problem before putting in a ticket. Deployment of the pilot role across the LHCb Tier 1 sites now waits only on CNAF and IN2P3.

ATLAS (SC): 1) Functional tests of the MC endpoints finished today. All Tier 1s get 95% of the data from the other Tier 1s within 24 hours, and 100% of the Tier 2s get the data from their Tier 1, except for Wisconsin which has an already reported library problem. 2) Have had a development meeting on file merging of MC data. This is done at the byte-stream level at the Tier 0 and will be done in 2 steps - first hits then AOD. It can be done with the current machinery and will probably start next week, but only for new files. Log file merging does not fit this model so something for that will be developed over the next few months. 3) Developers have been patching DDM and there are now test deployments on the VO-boxes at SARA, FZK, IN2P3 and RAL. 4) FTS under SLC4 has been running at RAL for a few days with no issues so ATLAS will stay with this version. Brian Davies said RAL did see a 4-hour period of failures to some Tier 2s but thought it was a high-load problem. Simone agreed it was probably individual Tier 2 problems. RAL would, however, like to roll back the version before the Xmas break but hopefully after the ATLAS 10 million file test. Simone said they plan to start this no later than next Tuesday.

CMS (AS): Have been testing the CE supporting SLC5 worker nodes, largely successfully, first with MC production and then extending to CRAB analysis jobs. The SAM framework problem that was stopping the testing of sites with production roles only (no user roles) has now been fixed.

Sites round table:

RAL (GS): Want to schedule a full-day castoratlas outage to move back to the Oracle RAC but are expecting the 10 M files test (which will last 7-10 days). Will schedule it for next Tuesday but back off if ATLAS have already started, rescheduling for 15 Dec. It was agreed to rediscuss this next Monday.

BNL (ME): Had two sites in ATLAS US cloud with FTS proxy delegation problems. The error message referred to targets to be used for ssl connections. The workaround was to stop dq2 at the affected Tier 2, intervene at the Tier 1 then restart at the Tier 2. They will send details to FTS support. BNL has completed migration to using LFC and are in a learning phase of the interaction with roles. They will do some alignment via secondary groups.

NL-T1 (JT): Next Tuesday SARA dcache will be down most of the day migrating to a new head node with higher performance. Tomorrow is the Dutch pre-Xmas present day so reduced presence is likely.

Services round table:

RB (ER) - The CERN RB will be turned off end January in agreement with the experiments.

FTS (GM) - There is no production traffic left under FTS 1 so access will be blocked next week (after an announcement).

GRIDVIEW (JvE) - Following the CERN CASTORATLAS srm upgrade yesterday it was discovered that the technique used by GRIDVIEW to identify the VO executing a transfer no longer works, and for the moment GRIDVIEW is reporting ATLAS transfers out of CERN as belonging to an unregistered VO. A work-around using the disk server name is in preparation and should be ready tomorrow. The ATLAS dashboard display is not affected. The long-term solution should come through using the FTM service (FTS Monitor).

CASTOR (MCS) - The rollout of CASTOR 2.1.8 has been put on hold following issues found with the instance being used for tape repack. ATLAS would like this for its enhanced logging but it will not be released this year. CASTOR operations will send out a plan for upgrades to the current 2.1.7 version.

AOB:

Friday

Attendance: local(Harry,Simone,Julia,Gavin,Jan,Patricia);remote(Gareth,Brian,Jeremy,Michael).

elog review:

Experiments round table:

LHCb (by email from RS): 1) The problem in submitting jobs via the PPS CE, reported yesterday, is finally due to a bug in the WMS, as explained in the open GGUS ticket #44440 and as guessed yesterday. 2) LHCb is testing the thread-safe version of gfal (1.10.18-5) available in the YUM repository for the Python 2.5 preview release candidate: http://egee-jra1-data.web.cern.ch/egee-jra1-data/python25-preview/ LHCb is interested in that version also because it supports Python 2.5, and would ask the gfal developers to release this as a gfal patch and make it available in the Application Area (LCG-AA). 3) Last night 11300 concurrent MC jobs were running on the Grid (now ~10K) with just 5 failures at some sites!

CMS (by email from DB): The first file block of the CRAFT (cosmics run at 4 tesla) reprocessing round has been closed at FZK, and migrated to global DBS plus subscribed to the CAF by DataOps. Things keep moving for the other involved T1's.

ATLAS (SC): 1) Noticed this morning that although the transfer services to BNL are using their LFC, some applications on the CERN VOboxes were still using the LRC at BNL. This has now been corrected. 2) In the datasets being produced for the 10 million files test there was a naming convention overflow and data with the suffix .2 instead of .1 is now being produced. These were not being distributed but this is now corrected. 3) On Monday ATLAS will upgrade the site services on the remaining CERN VOboxes serving Tier 0 to Tier 1 data distribution and run validation tests. This should allow the 10 M files transfer tests to start on Tuesday (aggregate between Tier 1s with some fraction to Tier 0). 4) At the moment there is a 60-70% failure rate with SRM timeouts from prepare_to_get at CERN although the files are in the correct pool. A team ticket has been sent and a backup ticket to castor.support will be sent. This must be fixed or the 10 M files test will be delayed.

ALICE (PM): 1) The new AliEn version is now installed in the central services at CERN and Torino. 2) There is still no WMS available for ALICE in France. 3) Planning to use the SLC4 FTS for the next MC production. 4) 3 sites - FZK, IHEP and Kolkata - run the CREAM CE, supporting a total of 7 queues. Direct submission (i.e. no WMS) will be used to these CEs. 5) ALICE plan to run Monte Carlo over the Xmas break and have already started at CERN and Torino.

Sites round table:

RAL (BD): They have switched back to FTS under SLC3 but still saw some ATLAS traffic through the SLC4 FTS. Simone will check. They queried why no files are currently being sent to the UK and Simone reported this as due to the current CERN CASTOR problem.

Services round table: JvE - we are still working on trying to get GRIDVIEW correct. A castoratlas upgrade is scheduled for Monday. Gavin reminded that CERN SLC3 FTS services will be switched off on Monday.

AOB:

Topic attachments
| Attachment | Size | Date | Who |
| LHCb-cumulative_plot_of_staging_jobs_run-dec1.png | 62.8 K | 2008-12-01 | JamieShiers |
| atlas-ggus-dec1.pdf | 16.6 K | 2008-12-01 | JamieShiers |
| cms-ggus-dec1.pdf | 3.7 K | 2008-12-01 | JamieShiers |
| ggus-summary-dec1.pdf | 10.3 K | 2008-12-01 | JamieShiers |
| lhcb-ggus-dec1.pdf | 4.4 K | 2008-12-01 | JamieShiers |