
August 2009 Reports

28th August (Friday)

Experiment Activities Still no production running in the system (93 jobs running in the system, SAM included!)

GGUS (or Remedy) tickets since yesterday:

T0 sites issues:

LHCb user community affected by the scheduled intervention on lxbatch for kernel security patch.

T1 sites issues :

pic: problems with pilot jobs aborting there. GGUS ticket #51203 open. Follow-up: the user Graciani has been banned because he was sending jobs that occupy too much disk space, about 80 GB.


Prompt reaction from both pic and NL-T1 about the dCache client issue. They are looking into installing on their WNs the version of the dCache client that works with the LHCb application, reported below for the sake of completeness:

 
 /afs/cern.ch/sw/lcg/external/dcache_client/1.9.3/slc4_amd64_gcc34/dcap/lib
 total 636
 drwxr-xr-x  2 roiser z5   2048 Jun 11 18:38 ./
 drwx------  6 roiser z5   2048 Jun 11 18:38 ../
 -rwxr-xr-x  1 roiser z5 304950 Jun 11 18:38 libdcap1.2.42.so*
 lrwxr-xr-x  1 roiser z5     16 Jun 11 18:38 libdcap.so -> libdcap1.2.42.so*
 -rwxr-xr-x  1 roiser z5 339257 Jun 11 18:38 libpdcap1.2.42.so*
 lrwxr-xr-x  1 roiser z5     17 Jun 11 18:38 libpdcap.so -> libpdcap1.2.42.so*
 

27th August (Thursday)

Experiment Activities No jobs running in the system. No production defined; the old ones are in the output-validation phase.

GGUS (or Remedy) tickets since yesterday:

T0 sites issues:

LHCb confirms that it is not necessary to reduce lhcbraw intentionally if this would cause painful operations; in a few months' time this extra space might be needed again.

T1 sites issues :

IN2P3: some recently produced files are not available through SRM, most likely because of a faulty disk server.

Any comment from the sites pic and NL-T1 about the dCache client issue?

26th August (Wednesday)

Experiment Activities A few MC production/simulation activities are going on right now (a few thousand concurrent jobs).

GGUS (or Remedy) tickets since yesterday:

T0 sites issues: The request to reshuffle disk pools on CASTOR - submitted to FIO on the 14th and re-requested on the 20th of August - has been honored. FIO added extra space on lhcbdata and lhcbuser. Still pending: lhcbhistos (new service class). As far as lhcbraw is concerned, they preferred not to remove disk servers explicitly (beyond the ones running out of warranty) and they discourage doing so if in a few months' time we envisage requesting more space on this service class. To be confirmed: reducing this space by removing old disk servers no longer covered by warranty could be enough, given the now much improved situation for the rest of the space tokens.

T1 sites issues : There is still an open problem reading files at the dCache sites NL-T1 and pic. As discovered and reported during STEP09, it turns out to be due to a compatibility issue between the ROOT application plugin and the dCache client installed at the site and picked up by the LHCb application.

This is proven by the fact that the 1.9.3 dCache libraries that LHCb ship themselves work on these sites. Ideally LHCb would like the sites to deploy the working dCache client version at some point, so that it can be used directly from the local installation on the WN instead of having to ship it. This is also needed to avoid further client/server compatibility issues such as those that have occurred in the past.
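For illustration only (the AFS path is the one quoted in the 28th August entry above; the function name and the fallback logic are hypothetical, not LHCb's actual mechanism), a minimal Python sketch of how a job wrapper could prefer the shipped dcap library over whatever the WN provides:

# Illustrative sketch: prefer a shipped dcap client library if it is present.
# The AFS path is the one quoted above; everything else is hypothetical.
import os

SHIPPED_DCAP_LIBDIR = ("/afs/cern.ch/sw/lcg/external/dcache_client/"
                       "1.9.3/slc4_amd64_gcc34/dcap/lib")

def prefer_shipped_dcap(env=os.environ):
    """Prepend the shipped libdcap directory to LD_LIBRARY_PATH so that the
    ROOT dcap plugin picks it up instead of the site-installed client."""
    if os.path.isfile(os.path.join(SHIPPED_DCAP_LIBDIR, "libdcap.so")):
        current = env.get("LD_LIBRARY_PATH", "")
        env["LD_LIBRARY_PATH"] = SHIPPED_DCAP_LIBDIR + ((":" + current) if current else "")
    return env.get("LD_LIBRARY_PATH", "")

Once a compatible client is deployed on the WNs, such a wrapper could simply skip the prepend and use the local installation directly.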

25th August (Tuesday)

Experiment activities: System almost empty. No production going on; just user activity and SAM.

GGUS (or Remedy) tickets since yesterday:

T0 sites issues:

T1 sites issues :

24th August (Monday)

Experiment activities: The previous MC productions are now in the validation phase (checking that data are properly available). Still running a few thousand remaining physics-production jobs.

GGUS (or Remedy) tickets since Friday:

T0 sites issues:

T1 sites issues :

21st August (Friday)

Experiment activities: About 7K MC simulation jobs are running concurrently in the system, alongside about 1K jobs from user distributed analysis.

Actively testing DIRAC by running the LHCb application at various sites providing SL5 resources. Found a consistent set of SL5-native lcg clients.

GGUS (or Remedy) tickets since yesterday:

T0 sites issues:

T1 sites issues :

CNAF: new StoRM endpoint successfully tested

20th August (Thursday)

Experiment activities:

MC production ongoing. Yesterday the massive cleanup of data (old DC04 and DIG/SIM for MC09) at various T1's freed about 250TB of disk space.

GGUS (or Remedy) tickets since yesterday:

T0 sites issues:


Request to reshuffle the disks at CERN (submitted on the 14th of August). The current situation (after the extra 100TB added to lhcbdata):

POOL lhcbraw          CAPACITY 112.98T    FREE  28.64T (25%)
POOL lhcbrdst         CAPACITY  33.00T    FREE   1.67T ( 5%)
POOL lhcbdata         CAPACITY 337.80T    FREE  69.14T (20%)
POOL lhcbmdst         CAPACITY  19.24T    FREE   8.52T (44%)
POOL lhcbfailover     CAPACITY   6.50T    FREE   4.46T (68%)
POOL lhcbuser         CAPACITY   9.00T    FREE   0.00T ( 0%)
POOL default          CAPACITY  62.47T    FREE   4.23T ( 6%)
                      Total    580.99T

must evolve into this new target situation, which would also prevent the problems observed when trying to retrieve a tURL while a space token is full (Savannah: #54259):

POOL lhcbraw          CAPACITY  35.00T   <-- decrease
POOL lhcbrdst         CAPACITY  33.00T
POOL lhcbdata         CAPACITY 400.00T   <-- increase
POOL lhcbmdst         CAPACITY  19.24T
POOL lhcbfailover     CAPACITY   6.50T
POOL lhcbuser         CAPACITY  15.00T   <-- increase
POOL default          CAPACITY  62.47T
POOL lhcbhistos       CAPACITY   8.00T   <-- NEW
                      Total    579.21T
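As a quick cross-check of the proposal above (illustrative only: the pool names and capacities are copied from the two listings; the script is not part of any LHCb tooling), a short Python sketch computing the per-pool change and verifying the totals:

# Cross-check of the proposed CASTOR pool reshuffle (capacities in TB,
# copied from the current and target listings above).
current = {"lhcbraw": 112.98, "lhcbrdst": 33.00, "lhcbdata": 337.80,
           "lhcbmdst": 19.24, "lhcbfailover": 6.50, "lhcbuser": 9.00,
           "default": 62.47}
target  = {"lhcbraw": 35.00, "lhcbrdst": 33.00, "lhcbdata": 400.00,
           "lhcbmdst": 19.24, "lhcbfailover": 6.50, "lhcbuser": 15.00,
           "default": 62.47, "lhcbhistos": 8.00}

for pool in sorted(set(current) | set(target)):
    delta = target.get(pool, 0.0) - current.get(pool, 0.0)
    if abs(delta) > 0.01:
        print("%-13s %+7.2f T" % (pool, delta))

print("current total: %.2f T" % sum(current.values()))   # 580.99
print("target  total: %.2f T" % sum(target.values()))    # 579.21

The net change is about -1.78T, i.e. the proposal is essentially a redistribution of existing capacity plus the new lhcbhistos service class.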

T1 sites issues :

FZK/SARA: LHCb_MC_M-DST full

IN2P3: to be understood why LHCb jobs consume more than 1.2MB of RAM despite the VO ID-card requirements.

CNAF: set up the new StoRM endpoint for LHCb. Tests in the afternoon.

19th August (Wednesday)

Experiment activities: There is little active MC production going on in the system (fewer than 1K jobs presently). LHCb is also running a massive cleanup of old MC production data at the T1's that is no longer used by the user community. For production, LHCb has (for various reasons) at least half of the Tier-1s out of action (see below for more information).

GGUS (or Remedy) tickets since yesterday:

T0 sites issues:

T1 sites issues :
CNAF: the issue (SQLite DB in the shared area) seems to occur from time to time on some WNs
FZK/SARA: LHCb_MC_M-DST full
IN2P3: some problems with the SAM jobs, which seem to exceed the memory limit of 1272 KB.
NL-T1: for the still active productions the problem was that LHCb had data at NL-T1 and the site stopped accepting pilots due to the SARA downtime; this meant LHCb had to ramp up at all other sites in order to complete the requests, and the data produced at SARA won't be used (as far as we know, no job has run there for weeks).
RAL: back in action

13th August (Thursday)

Weekly report available here

Experiment activities:

MC09 productions ongoing.

GGUS (or Remedy) tickets since yesterday:

T0 sites issues:

T1 sites issues : CNAF issue (SQLite DB in the shared area) appeared again
PIC: LHCb spaces full
FZK: LHCb_MC_M-DST full

10th August (Monday)

Weekly report available here

Experiment activities:

MC09 productions ongoing.

GGUS (or Remedy) tickets since yesterday:

T0 sites issues:

T1 sites issues : CNAF issue (SQLite DB in the shared area) was fixed over the weekend.


T2 sites issues:

Shared area problem at Turin

7th August (Friday)

Experiment activities:

Pending MC09 productions corresponding to the old requests have been restarted. A stocktaking activity is still to be performed on the otherwise completed productions as soon as all the remaining failover requests are cleared. New productions are defined to attempt any T1 SE in the relevant grouping before sending files to failover. Some parameters defined for physics MC generation have to be reviewed, as a couple of productions are failing systematically.

GGUS (or Remedy) tickets since yesterday:

T0 sites issues:

T1 sites issues :

CNAF: because of the SQLite issue reported some weeks ago, the site is out of the LHCb production mask (only temporarily re-integrated to finish a merging production).

The people there have been notified about this still open point.

Sites contacted yesterday promptly added the requested extra space.
T2 sites issues:

site shared area issue

6th August (Thursday)

Experiment activities Resuming production activities. Still waiting for the full transfer backlog to be drained and just few production started to run again at low rate (<1000 jobs in the system right now). Working to find a consistent set of m/w clients (mainly voms clients) to work with SL5. Latest versions (1.9.8-1 and 1.9.8-2) do not seem to provide a consistent set of clients. It seems that currently in certification versions can do the job (pending tests LHCb side)

GGUS (or Remedy) tickets since yesterday:

T0 sites issues:


The LHCb ONLINE system is facing a number of serious H/W problems with the new router, which also affect the LHCb elogbook.

T1 sites issues :

IN2P3: disk server outage causing problems transferring data out of the site.

SARA: the MC_M-DST space token has filled up completely, preventing uploads there.

This space is also getting close to full at pic and GridKA, and to a lesser extent at IN2P3 (see the plots below).

Here is an estimated breakdown, as recently presented to a scrutiny committee by Nick Brook:

PIC-LHCb_MC_M-DST.png

SARA-LHCb_MC_M-DST.png

GRIDKA-LHCb_MC_M-DST.png
T2 sites issues:

site shared area issue

5th August (Wednesday)

Experiment activities Resuming production activities. Waiting for the transfer backlog to be drained.

Service issue:

We are having problems transferring data into and out of many sites because the OPN and its backup are down. Is this a scheduled downtime?

GGUS (or Remedy) tickets since yesterday:

T0 sites issues:

Problems reinstalling one of the VOBOXes at CERN: volhcb05 (hosting some DIRAC services). A system admin intervention has been requested.

T1 sites issues :

The issue at IN2P3 (preventing use of the site in filling mode) had nothing to do with the site, but rather with the agent filling in the information.


T2 sites issues:

4th August (Tuesday)

Experiment activities Ignacio informed us last night that the 100TB were installed and the activities restarted. As a first step, the failover spaces at the T1's are being drained of the huge backlog of transfer requests. This can be seen in the picture, where starting from 7:30 am (UTC) LHCb started transferring data back to CERN. Jobs are not running yet, since the backlog has to be drained first.
Transfer_throughput.png

GGUS (or Remedy) tickets since yesterday:

T0 sites issues:

T1 sites issues :

IN2P3: working with the site admins and the contact person to find a solution to the problem with the publication of the CEs' information in the top BDII, which in turn prevents filling the LHCb mask with all the information about IN2P3 (see the sketch below).
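As an illustration only (the BDII hostname is a placeholder; port 2170, the base DN and the Glue 1.3 attribute are the standard ones, but this is not the actual query the LHCb agent runs), a small Python sketch of how one might check which CEs of a site are published in a top BDII:

# Illustrative only: list a site's CEs as published in a top BDII.
# TOP_BDII is a placeholder hostname; the ldapsearch options are the usual
# ones for a Glue 1.3 top BDII (port 2170, base Mds-Vo-name=local,o=grid).
import subprocess

TOP_BDII = "top-bdii.example.org"  # placeholder

def published_ces(site_substring):
    """Return GlueCEUniqueID values from the top BDII that mention the site."""
    cmd = ["ldapsearch", "-x", "-LLL",
           "-H", "ldap://%s:2170" % TOP_BDII,
           "-b", "Mds-Vo-name=local,o=grid",
           "(objectClass=GlueCE)", "GlueCEUniqueID"]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    return [line.split(":", 1)[1].strip()
            for line in out.splitlines()
            if line.startswith("GlueCEUniqueID:") and site_substring in line]

# e.g. published_ces("in2p3.fr") should be non-empty once the CEs are published again.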

T2 sites issues:
Usual problems with the shared area, or sites wrongly publishing information in the BDII and thereby erroneously attracting jobs there.

3rd August (Monday)

Experiment activities

LHCb is still at rest.
Productions are 'on hold' pending the installation of the agreed 100 TB of disk space on lhcbdata, so that the backlog of already produced data can be cleared before resuming.

Maybe today, at most tomorrow. It was also proposed to migrate some of the disk servers from lhcbraw to lhcbdata (70TB; currently lhcbraw does not need such a large space).

GGUS (or Remedy) tickets since yesterday:

T0 sites issues:

CERN: pending the installation of more disk capacity.

T1 sites issues :


T2 sites issues:
A couple of sites in the UK with a large number of jobs failing; other sites with shared area problems.

-- RobertoSantinel - 2009-09-01

 