Difference: ProductionOperationsWLCGApril09Reports (1 vs. 3)

Revision 3 2010-06-11 - PeterJones

Line: 1 to 1
 

April 2009 Reports

To the main

30th April (Thursday)

Line: 313 to 313
 RAL (but also Nebraska) had a problem with the new host certificate for VOMS not being properly installed on the WNs, causing failures in verifying the VOMS extensions of user proxies.

Changed:
<
<
Top 3 site-related comments of the day: 1. LFC at CERN timing out 1/3 of the connections from CORAL applications at T1s
>
>
Top 3 site-related comments of the day: 1. LFC at CERN timing out 1/3 of the connections from Persistency applications at T1s
 
2. Most pilots aborted ESA-ESRIN
3. VOMS certificate issue at Nebraska
Line: 328 to 328
 

T0 sites issues:

Changed:
<
<
  • [OPEN- 16th April 2009 ] LFC accessed by CORAL applications at T1s (to retrieve the connection string to ConditionDB) seems to fail in 1/3 of the cases
>
>
  • [OPEN- 16th April 2009 ] LFC accessed by Persistency applications at T1s (to retrieve the connection string to ConditionDB) seems to fail in 1/3 of the cases
  T1 sites issues:

Revision 2 2010-01-29 - unknown

Line: 1 to 1
 

April 2009 Reports

Added:
>
>
To the main
 

30th April (Thursday)

Experiment activities

Revision 1 2009-06-18 - RobertoSantinel

Line: 1 to 1
Added:
>
>

April 2009 Reports

30th April (Thursday)

Experiment activities

  • Testing the MC09

Important news


Top 3 site-related comments of the day:

GGUS tickets

T0 sites issues:

T1 sites issues:

  • [OPEN - 27th of April ] StoRM endpoint returns tURLs that are not accessible. Waiting for an answer
  • [OPEN - 28th of April ] SRM at GridKA problems returning tURLs
  • [OPEN - 28th of April ] IN2P3 gsidcap closing connection. Possibly related: we observe a much poorer CPU/WallClock ratio for analysis jobs at IN2P3 (other dCache sites have improved with the DCACHE_RAHEAD and DCACHE_RA_BUFFER tests)
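
The CPU/WallClock ratio mentioned in the last item is the CPU time a job consumed divided by its elapsed wall-clock time; a low value usually suggests the job spent much of its time waiting, for example on data access. A minimal sketch of the calculation in Python, with a hypothetical helper name and made-up numbers (not measurements from any site):

```python
def cpu_efficiency(cpu_seconds, wallclock_seconds):
    """Return CPU time divided by wall-clock time (hypothetical helper)."""
    if wallclock_seconds <= 0:
        raise ValueError("wallclock_seconds must be positive")
    return cpu_seconds / wallclock_seconds


# Illustrative values only, not taken from any real job.
waiting_on_io = cpu_efficiency(900, 3600)
mostly_computing = cpu_efficiency(3420, 3600)
print(f"{waiting_on_io:.2f} {mostly_computing:.2f}")
```

Comparing this ratio for the same kind of job across sites is one way to spot a site where data access is the bottleneck.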

T2 sites issues:

  • Pilots aborting at small sites; a few remaining sites still have the VOMS server certificate not properly upgraded on the WNs

29th April (Wednesday)

Experiment activities

  • Testing the MC09

Important news

The issue of all WMSes at T1s timing out is still under investigation, but it seems to be caused by a combination of factors: a bug in the UI currently in production (see http://glite.web.cern.ch/glite/packages/R3.1/deployment/glite-UI/glite-UI.asp) and the fact that some VOMS roles used by LHCb (Role=pilot and Role=user) were not properly deployed on these WMSes. They were OK over the weekend, so why did they get out of sync?

There also appears to be an intrinsic bug in the WMS server code (to be fixed with the next mega-patch 3.2), although experts do not exclude network problems either.



Top 3 site-related comments of the day: 1. WMS instances at T1s failing with timeouts

GGUS tickets

T0 sites issues:

T1 sites issues:

  • [OPEN - 27th of April ] StoRM endpoint returns tURLs that are not accessible. Waiting for an answer
  • [OPEN - 27th of April ] WMS services at GridKA, timing out all operations
  • [OPEN - 28th of April ] WMS services at RAL, SARA and CNAF timing out all operations
  • [OPEN - 28th of April ] WMS at PIC not working
  • [OPEN - 28th of April ] SRM at GridKA problems returning tURLs
  • [OPEN - 28th of April ] IN2P3 gsidcap closing connection. Possibly related: we observe a much poorer CPU/WallClock ratio for analysis jobs at IN2P3 (other dCache sites have improved with the DCACHE_RAHEAD and DCACHE_RA_BUFFER tests)

T2 sites issues:

  • Pilots aborting at small sites; a few remaining sites still have the VOMS server certificate not properly upgraded on the WNs

28th April (Tuesday)

Experiment activities

  • Testing the MC09

Top 3 site-related comments of the day: 1. WMSes at RAL, CNAF, GridKA, PIC not working
2. IN2P3 gsidcap closing connection
3. Issue with SRM at GridKA

GGUS tickets

T0 sites issues:

T1 sites issues:

  • [OPEN - 27th of April ] StoRM endpoint returns tURLs that are not accessible. Waiting for an answer
  • [OPEN - 27th of April ] WMS services at GridKA, timing out all operations
  • [OPEN - 28th of April ] WMS services at RAL, SARA and CNAF timing out all operations
  • [OPEN - 28th of April ] WMS at PIC not working
  • [OPEN - 28th of April ] SRM at GridKA problems returning tURLs
  • [OPEN - 28th of April ] IN2P3 gsidcap closing connection

T2 sites issues:

  • Pilots aborting at small sites; a few remaining sites still have the VOMS server certificate not properly upgraded on the WNs

27th April (Monday)

Experiment activities

  • Testing the MC09 merging activity to evaluate the impact of a varying number of input files on the CPU and Wall Clock time

Top 3 site-related comments of the day: 1. INFN-T1: tURL returned but not being opened by LHCb application
2. WMS at GridKA not working
3.

GGUS tickets

T0 sites issues:

T1 sites issues:

  • [OPEN - 27th of April ] StoRM endpoint returns tURLs that are not accessible
  • [OPEN - 27th of April ] WMS services at GridKA timing out all operations
  • Issues with SRM at GridKA returning tURLs and some WNs with the new VOMS server certificate not properly upgraded. Also problems downloading onto some WNs at GridKA.

T2 sites issues:

  • Pilots aborting at small sites; a few remaining sites still have the VOMS server certificate not properly upgraded on the WNs

24th April (Friday)

Experiment activities

  • Preparation for MC09 Production (merging exercise). Chaotic analysis activity

Top 3 site-related comments of the day: 1. ru-PNPI large number of pilots
2. ITWM large number of pilots
3. IL-BGU voms certificate issue

GGUS tickets

  • 4 new tickets today
    • T0: 0
    • T1: 2: problems with WMS at GRIDKA and PIC (problem with upgrade 44)
    • T2: 2
  • List of all open tickets

T0 sites issues:

T1 sites issues:

  • [CLOSED - since 23rd of April ] Both PIC and LCG2-FZK WMS issues.

T2 sites issues:

  • Large number of pilots aborting at several sites. Still a few sites with the VOMS certificate problem.

23rd April (Thursday)

Experiment activities

  • Preparation for MC09 Production (merging exercise). Chaotic analysis activity

Top 3 site-related comments of the day: 1. Problems with WMS at CNAF, GRIDKA and RAL
2.
3.

T0 sites issues:

T1 sites issues:

  • [OPEN - since 23rd of April ] CNAF/RAL/FZK WMS issues.

T2 sites issues:

  • Large number of pilots aborting at several sites. Still a few sites with the VOMS certificate problem.

GGUS tickets

22nd April (Wednesday)

Experiment activities

  • Preparation for MC09 Production. Chaotic analysis activity

Top 3 site-related comments of the day: 1. No relevant issues at T0 and T1; many T2 sites have not updated their VOMS server certificate on the WNs
2.
3.

GGUS tickets

T0 sites issues:

T1 sites issues:


T2 sites issues:

  • [CLOSED - since April 08 ]

21st April (Tuesday)

Experiment activities


  • Preparation for MC09 Production. Chaotic analysis activity

Top 3 site-related comments of the day: 1. No relevant issues at T0 and T1; many T2 sites have not updated their VOMS server certificate on the WNs
2.
3.

GGUS tickets

T0 sites issues:

  • The issue with the Master LFC reported last week seems to be due to suboptimal access to the LFC from CORAL. The authors have been alerted and will look into it.
  • [CLOSED- 20th April 2009 ] FTS share defined for LHCb on the channel CERN-CNAF was preventing transfers there. The problem was due to a misconfiguration of the information published by srm-v2 and has been solved.

T1 sites issues:

  • [CLOSED- 20th April 2009 ] PIC WMS down. Rogue users overloading the WM. Fixed and cleaned up.
  • [CLOSED- 20th April 2009 ] CNAF FTS share defined for LHCb on the channel *-CNAF was preventing transfers there. The problem was due to a misconfiguration of the information published by srm-v2 and has been solved.

T2 sites issues:

  • Some issues regarding shared areas and permissions and sites with pilots failing systematically.

20th April (Monday)


Experiment activities


  • Apart from the usual distributed analysis activity, there are no production activities going on. Preparation for MC09 is now under way; its software must be installed on all sites.

Top 3 site-related comments of the day: 1. WMS at PIC not available
2. Pilot aborting at INFN-MILANO
3. Could not determine shared area at INFN-CAGLIARI

GGUS tickets

T0 sites issues:

  • The issue with the Master LFC reported last week seems to be due to suboptimal access to the LFC from CORAL. Asked the authors to look into it.
  • [OPEN- 20th April 2009 ] FTS share defined for LHCb on the channel CERN-CNAF preventing transfers there

T1 sites issues:

  • [OPEN- 20th April 2009 ] PIC WMS down
  • [OPEN- 20th April 2009 ] CNAF FTS share defined for LHCb on the channel *-CNAF preventing transfers there.

T2 sites issues:

  • Some issues regarding shared areas and permissions and sites with pilots failing systematically.

17th April (Friday)

Experiment activities


  • FEST ongoing

Top 3 site-related comments of the day: 1. Master LFC at CERN exhausted threads.
2. Jobs failing at NIKHEF and IN2P3 due to data access problems. We have changed the policy such that data will be staged to WN disk before processing. Some jobs have been rescheduled and we are waiting to see the results.
3. Pilots aborted at ce02.grid.acad.bg BG04-ACAD

GGUS tickets

T0 sites issues:

[OPEN- 16th April 2009 ] LFC master exhausted the number of threads. Under investigation

T1 sites issues:


T2 sites issues:

  • Some issues regarding shared areas and permissions and sites with pilots failing systematically.

16th April (Thursday)

Experiment activities

  • FEST ongoing
  • MC Dummy production (stopped for FEST) is now draining

Relevant News

Overnight, the CVS service again suffered from the problem seen last Monday, as also reported by SLS. Some alarm system is at least desirable for a service that is extremely important.

RAL (but also Nebraska) had a problem with the new host certificate for VOMS not being properly installed on the WNs, causing failures in verifying the VOMS extensions of user proxies.

Top 3 site-related comments of the day: 1. LFC at CERN timing out 1/3 of the connections from CORAL applications at T1s
2. Most pilots aborted ESA-ESRIN
3. VOMS certificate issue at Nebraska


GGUS tickets

T0 sites issues:

  • [OPEN- 16th April 2009 ] LFC accessed by CORAL applications at T1s (to retrieve the connection string to ConditionDB) seems to fail in 1/3 of the cases

T1 sites issues:

  • [CLOSED - 10th April 2009 ] BDII information for INFN-T1: fixed by setting meaningful values for MaxCPUTime and MaxWallClock

T2 sites issues:

  • Some issues regarding shared areas and permissions and sites with pilots failing systematically.

15th April (Wednesday)

Experiment activities

  • FEST ongoing
  • MC Dummy production ongoing (4K jobs running in the system now)
Relevant news:
Because of a problem with LHCb-specific scripts, all (LHCb-specific) SAM jobs have been failing at every site since yesterday. This was fixed this morning.

Top 3 site-related comments of the day: 1. BDII information for INFN-T1
2. Jobs stalling at IN2P3-LPC

GGUS tickets

T0 sites issues:

T1 sites issues:

  • [CLOSED - 14th April 2009 ] rb03.pic.es failing list match (TEAM)
  • [CLOSED - 14th April 2009 ] wms010.cnaf.infn.it failing list match (TEAM)
  • [CLOSED - 14th April 2009 ] wms.grid.sara.nl failing list match (TEAM)
  • [CLOSED - since January 2009 ] IN2P3: Wrong status set on replicated data (possible solution has been deployed)
  • [CLOSED - 11th April 2009 ] All lhcb jobs to ce-2-fzk.gridka.de failing. Installed missing libraries. SL5 WNs accessible from a production CE? (ALARM)
  • [OPEN - 10th April 2009 ] BDII information for INFN-T1

T2 sites issues:

  • Some minor issues regarding shared areas and disk quotas at a couple of sites.

14th April (Tuesday)

Experiment activities

  • FEST week. T1s can expect data export from CERN.
  • 15000 jobs ran over the Easter weekend.
  • We have closed the long standing open tickets against CNAF-StoRM.

Top 3 site-related comments of the day: 1. All lhcb jobs to ce-2-fzk.gridka.de failing
2. BDII information for INFN-T1
3. Wrong status set on replicated data at IN2P3 (in progress)

GGUS tickets

T0 sites issues:

T1 sites issues:

  • [OPEN - 14th April 2009 ] rb03.pic.es failing list match (TEAM)
  • [OPEN - 14th April 2009 ] wms010.cnaf.infn.it failing list match (TEAM)
  • [OPEN - 14th April 2009 ] wms.grid.sara.nl failing list match (TEAM)
  • [OPEN - 11th April 2009 ] All lhcb jobs to ce-2-fzk.gridka.de failing (ALARM)
  • [OPEN - 10th April 2009 ] BDII information for INFN-T1
  • [CLOSED - 12th April 2009 ] FTS transfers to srmlhcb.pic.es timing out in TRANSFER_PREPARATION phase(ALARM)
  • [OPEN - since January 2009 ] IN2P3: Wrong status set on replicated data (possible solution has been deployed)

T2 sites issues:

  • Some minor issues regarding shared areas and disk quotas at a couple of sites.

9 April 2009

Experiment activities

  • Ran 5000 jobs at T2s overnight.

Top 3 site-related comments of the day: 1. Wrong status set on replicated data at IN2P3 dCache (possible solution to be tested)
2. CNAF-StoRM ticket that has been open for >8 months!
3. Stalled jobs at UKI-NORTHGRID-MAN-HEP

GGUS tickets

T0 sites issues:

T1 sites issues:

  • [OPEN - since February 2009 ] PIC: lhcb SGM jobs blocking the NFS software area
  • [OPEN - since January 2009 ] IN2P3: Wrong status set on replicated data (possible solution to be tested)
  • [OPEN - since July 2008 ] CNAF: gfal + CNAF StoRM = Segmentation fault (can someone comment?)

T2 sites issues:

  • [OPEN - since March 25 ] Manchester: stalled jobs
  • [CLOSED - since April 08 ] CYFRONET: configuration problem meant that $HOME was not defined
  • [NEW - since April 08 ] CY-01-KIMON: gcc libraries missing
  • [CLOSED - since April 07 ] RUL: Jobs were being killed by batch system due to some jobs going to a queue which had a 30min wallclock limit on it (intended for parallel jobs).
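
The RUL item above comes down to a mismatch between how long a job needs and the wall-clock limit of the queue it landed in. A minimal sketch of a pre-submission check, assuming limits are expressed in minutes; the queue names, limits, and function name are hypothetical and not taken from any site's configuration:

```python
def queues_that_fit(queues, needed_minutes):
    """Return the names of queues whose wall-clock limit covers the job.

    `queues` maps a queue name to its wall-clock limit in minutes
    (hypothetical structure, for illustration only).
    """
    return [name for name, limit in queues.items() if limit >= needed_minutes]


# Hypothetical queues: a short "parallel" queue and two longer ones.
queues = {"parallel": 30, "short": 120, "long": 2880}
print(queues_that_fit(queues, 90))
```

With these made-up values, a job needing 90 minutes would be kept away from the 30-minute queue rather than being killed by the batch system partway through.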

8 April 2009 (Wednesday)

Experiment activities

  • Ran 2000 jobs at T2s overnight with >96% success rate.

Top 3 site-related comments of the day: 1. Wrong status set on replicated data at IN2P3 dCache
2. CYFRONET configuration problem meant that $HOME was not defined (now fixed)
3. gcc libraries missing at CY-01-KIMON

GGUS tickets

T0 sites issues:

T1 sites issues:

  • [OPEN - since February 2009 ] PIC: lhcb SGM jobs blocking the NFS software area
  • [OPEN - since January 2009 ] IN2P3: Wrong status set on replicated data
  • [OPEN - since July 2008 ] CNAF: gfal + CNAF StoRM = Segmentation fault

T2 sites issues:

  • [CLOSED - since April 08 ] CYFRONET: configuration problem meant that $HOME was not defined
  • [NEW - since April 08 ] CY-01-KIMON: gcc libraries missing
  • [CLOSED - since April 07 ] RUL: Jobs were being killed by batch system due to some jobs going to a queue which had a 30min wallclock limit on it (intended for parallel jobs).

-- RobertoSantinel - 18 Jun 2009
