Week of 090427

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).

GGUS Team / Alarm Tickets during last week

Weekly VO Summaries of Site Availability

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Jean-Philippe, Nick, Jan, Harry, MariaDZ, MariaG, Stephane, Alessandro, Julia, Roberto, Patricia);remote(Jeff, Gareth, Gang, Luca, Michael).

Experiments round table:

  • ATLAS reports - Alessandro and Stephane will share attendance from now on. This is a detector muon week, so ATLAS will probably start cosmics data taking on Wednesday. There is a small issue with the length of time it takes directly-routed GGUS tickets to reach US Tier-2 sites. MariaDZ reported that BNL should be migrating from GOCDB to the OSG OIM, which will affect this. Michael added that this has been delayed by the ATLAS reprocessing campaign but is now scheduled for week 21 (second half of May), and that BNL staff are aware of the ticket-routing issue.

  • ALICE - production is running smoothly with 10000 concurrent jobs. The French WMS (at GRIF) has been downgraded from the (uncertified) 3.2 level and should soon be ready to resume production. A small directory configuration issue was found at IN2P3 over the weekend, so one module of AliEn will be replaced on all AliEn servers during this week.

  • LHCb reports - They will be testing the merging of MC09 files this week in order to optimise the CPU/wall-clock ratio. Several problems at FZK: an unstable WMS, several worker nodes with certificate problems and an unresponsive SRM endpoint. The SRM endpoint at CNAF was not returning the correct TURL. Luca had looked at this ticket and could not see an error; Roberto added that it is the opening of the TURL that is not working, and both agreed to investigate further. JT said LHCb jobs had dried up over the weekend and asked whether their current test was over; the reply was to expect merging jobs later this week.

Sites / Services round table:

  • ASGC: SRM has been going up and down during the past three days due to the 'Big ID' problem. The site has put a cron job in place to fix the missing IDs every 10 minutes, but the problem still persists. This is a CASTOR bug and the CASTOR development team is already working on it; more time is needed.
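
As a rough illustration of the workaround described above (the script name, path and log file below are hypothetical; the actual fix-up procedure used at ASGC is not documented here), an /etc/cron.d entry running such a repair every ten minutes might look like:

    # Hypothetical /etc/cron.d/fix-bigid entry: run a clean-up script every
    # 10 minutes to repair missing id2type entries until the CASTOR patch arrives.
    */10 * * * * root /usr/local/sbin/fix_missing_bigid.sh >> /var/log/fix_bigid.log 2>&1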

  • ASGC (Jason):
    • Catalogue service:
      • LFC purged and now running with the new table space recreated last Friday.
      • Considering migrating the other VOs off the production instance serving ATLAS.
      • LFC redundancy to be added later this week.
      • ACLs updated this morning to add default user/group access control for /atlas/Role=production and /atlas/Role=lcgadmin.
    • FTS upgrade from 2.0 to 2.1 planned for next week.
    • SAM availability drop due to a skewed system clock; fixed last Thursday.
    • S2 (SRM) SAM availability drop:
      • Due to the bigId issue (Savannah ticket opened by CASTOR development, https://savannah.cern.ch/support/?106879): data registration sometimes fails with a missing id type; the majority of the missing IDs refer to the userfile type in the id2type table.
      • Restarting the SRM server/daemon suppresses the error, but a new patch from CASTOR development is needed to resolve the OCCI 'returning into' clause issue.
    • ATLAS:
      • ATLAS data cleanup ongoing.
      • T2 DPM data cleanup: help from J-P with a script handling the purging of data belonging to a specific VO, e.g. ATLAS.
    • SAM availability drop due to the catalogue ACLs after the migration to the new table space.

    • CMS:
      • Some data files could not be migrated to tape (CompOps Savannah ticket); fixed manually by Data Ops.
      • Drop in SAM availability due to incorrect group permissions on the .globus directory (old LCG CEs).

  • NL-T1 (JT): During the scheduled weekend power-off downtime the intention was to keep the site visible by running an LDAP server on the backup power supply, but unfortunately its network switch was not on backup power.

  • CERN Databases (MG): the quarterly Oracle security patches are being prepared.

AOB: (MariaDZ) VO Admins of the LHC experiment VOs should go to https://lcg-voms.cern.ch:8443/vo/YourVOname/vomrs, create the groups "TEAM" and "ALARM", and add the DNs of their authorised teamers and alarmers. If you cannot remember who is who, please contact Guenter.Grein@iwr.fzk.de for the Teamers' list and read the Alarmers' list in the Alarms twiki.

NT asked whether there was interest from the experiments in having a generic SAM test to check whether sites have installed the worker-node 'hostname.lsc' files that obviate the need to renew the annual VOMS server certificates. Apparently voms-proxy-info returns a non-zero return code when these files are missing; LHCb already tests for this, although it does not affect job execution. A possible form for such a probe is sketched below.
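
The following is only a minimal sketch of the idea, assuming a plain Python probe run on a worker node with a valid VOMS proxy already in place; the script name, messages and exit-code convention are assumptions, not an existing SAM test:

    #!/usr/bin/env python
    # check_voms_lsc.py (hypothetical name): report whether the VOMS attributes in
    # the user proxy can be verified on this worker node. If the site's vomsdir
    # lacks the per-VOMS-server .lsc files (and no valid VOMS host certificates
    # are installed), voms-proxy-info is expected to exit non-zero.
    import subprocess
    import sys

    def voms_attributes_ok():
        # Run voms-proxy-info and report its exit status; a non-zero code
        # indicates the VOMS attribute certificate could not be verified.
        proc = subprocess.Popen(["voms-proxy-info"],
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        proc.communicate()
        return proc.returncode

    if __name__ == "__main__":
        rc = voms_attributes_ok()
        if rc == 0:
            print("OK: voms-proxy-info verified the VOMS attributes")
            sys.exit(0)
        print("WARNING: voms-proxy-info exited with code %d "
              "(missing hostname.lsc files?)" % rc)
        sys.exit(1)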

Tuesday:

Attendance: local();remote().

Experiments round table:

  • ATLAS STEP preparation (Graeme Stewart)
Dear All

I have two pieces of news important to sites re. the STEP09 challenges.

The first is for T1s, where a new project, step09, has been defined.
Tape families should be set up for this project with the expectation that all data here can be deleted. This project will produce files in both the ATLASDATATAPE and ATLASMCTAPE areas.

The second more concerns the EGEE Tier-2s. Here we would like sites to

1. Enable support for the /atlas/Role=pilot VOMS role to enable wide testing of analysis through the panda system.

2. Configure their fairshare targets for ATLAS to be:

50% for /atlas/Role=production
25% for /atlas/Role=pilot
25% for /atlas

Further information about STEP09 is being gathered in the twiki, where these requests are recorded. There will be a discussion on STEP09 at this week's ADC Operations meeting.

https://twiki.cern.ch/twiki/bin/view/Atlas/Step09

We will likely ask for one or two sites per cloud to perform a more detailed analysis of which jobs were run during STEP to evaluate the performance of the different infrastructures and their interaction at the sites.

Thanks in advance

Graeme
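
For sites running Maui/Torque, the 50/25/25 split requested above would typically be expressed as fairshare targets on the local groups that the three FQANs are mapped to. The fragment below is only a hedged sketch: the group names atlasprd, atlaspil and atlas are assumptions about the local VOMS-to-Unix-group mapping, and sites running LSF, SGE or Condor would express the same split in their own scheduler's configuration.

    # maui.cfg fragment (sketch only): fairshare targets matching the requested split.
    # The group names are hypothetical and must match whatever local groups the site
    # maps /atlas/Role=production, /atlas/Role=pilot and /atlas onto.
    GROUPCFG[atlasprd] FSTARGET=50
    GROUPCFG[atlaspil] FSTARGET=25
    GROUPCFG[atlas]    FSTARGET=25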

  • ALICE -

Sites / Services round table:

  • ASGC:
    • SRM temporarily recovered from the 'Big ID' problem after restarting the SRM server.
    • Waiting for scripts from the CASTOR team to delete ATLAS data on the SE.

AOB:

Wednesday

Attendance: local();remote().

Experiments round table:

  • ALICE -

Sites / Services round table:

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

  • ALICE -

Sites / Services round table:

AOB:

Friday

  • No meeting - public holiday (at least in some places...)

-- JamieShiers - 24 Apr 2009
