June 2009 Reports


30th June (Tuesday)


Experiment activities

  • Yesterday's intervention took longer than expected. Back in production since the afternoon, with much better performance now.
  • MC09 for 10^9 events restarted.
  • LHCb gLExec tests seem to be OK now at GridKA.
  • The GridKA intervention on the shared area yesterday went OK. SAM jobs installed the new software happily.

Top 3 site-related comments of the day:

  1. CIC portal: VO ID card information got lost
  2. Sites in downtime still advertising themselves as in production
  3. WMS issues at GridKA, SARA and pic (timeouts)


GGUS (or Remedy) tickets

T0 sites issues:
T1 sites issues:

The WMS instances at SARA, pic and GridKA are failing job submission with timeouts.

T2 sites issues:

Sites in downtime in GOCDB still advertise themselves as in production in the BDII, attracting jobs that are then bound to fail. This happens not only at small T2 sites but also at more important centres (a quick consistency check is sketched below).
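
A quick way to cross-check is to list the CEs that still publish GlueCEStateStatus=Production in the top-level BDII and compare them by hand against the GOCDB downtime list. A minimal sketch, assuming ldapsearch is installed and that lcg-bdii.cern.ch is the top-level BDII endpoint (hostname and Glue 1.3 attribute names should be verified):

# Sketch: list CEs publishing themselves as "Production" in the BDII,
# to be compared by hand against the GOCDB downtime list.
import subprocess

def production_ces(bdii="ldap://lcg-bdii.cern.ch:2170", base="o=grid"):
    # Simple anonymous LDAP query against the (assumed) top-level BDII endpoint.
    out = subprocess.run(
        ["ldapsearch", "-x", "-LLL", "-H", bdii, "-b", base,
         "(&(objectClass=GlueCE)(GlueCEStateStatus=Production))",
         "GlueCEUniqueID"],
        capture_output=True, text=True, check=True).stdout
    return [line.split(":", 1)[1].strip()
            for line in out.splitlines() if line.startswith("GlueCEUniqueID:")]

if __name__ == "__main__":
    for ce in production_ces():
        print(ce)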

29th June (Monday)


Experiment activities

  • From 9:00 until 12:00 the central DIRAC Workload Management Services will be down (as announced last week) for a migration to more powerful hardware. Jobs already in the system have finished running and no new jobs can be submitted, so LHCb activity on the WLCG infrastructure should currently be close to zero.
  • LHCb gLExec tests are still attempting to run on the PPS (FZK and Lancaster).

Top 3 site-related comments of the day:

  1. SARA transfers failing systematically
  2.
  3.


GGUS (or Remedy) tickets since Friday

T0 sites issues:
T1 sites issues:

All transfers against SARA are failing.

26th June (Friday)


Experiment activities

  • MC physics productions restarted after a fix by the DIRAC developers for the data inconsistency problem. Merging productions are running at the T1's. Tests of the internal prioritization mechanism continue; a large number of waiting jobs is still piling up in the central DIRAC task queues.
  • Tests of the gLExec wrapper script on the PPS at GridKA and Lancaster have started but have been failing (since yesterday) due to a variety of small tunings required to configure gLExec properly. Before asking DIRAC for more complicated tests using the real pilot framework, we would like these annoying issues to be fixed.



GGUS (or Remedy) tickets since yesterday

T0 sites issues:


T1 sites issues:
Issue at GridKA with gLExec wrongly configured on the WNs. Reported to Angela.

25th June (Thursday)


Experiment activities

  • MC physics productions (and the corresponding merging) have been temporarily halted. There appears to be an internal inconsistency in DIRAC between files registered in the Bookkeeping system and files known to the Production DB. Checks will be performed.
  • The prioritization system in DIRAC is being reviewed (this explains the huge number of waiting user jobs reported yesterday).
  • Taking advantage of this break, the Production System will be moved to a new VOBOX machine (volhcb13).
  • Many user jobs are failing with conditions database problems.

Top 3 site-related comments of the day:

  1. DIRAC file registration
  2. DIRAC job prioritization
  3.

GGUS (or Remedy) tickets since yesterday

T0 sites issues:


T1 sites issues:


24th June (Wednesday)


Experiment activities

  • MC physics productions (and the corresponding merging) are running in the system. The goal is to produce 10^9 events.
  • A huge number of user jobs is waiting due to the internal DIRAC prioritization mechanism.

Top 3 site-related comments of the day:

  1. A CE at CERN in downtime but still publishing itself as in production is failing a lot of pilot jobs
  2. DIRAC prioritization mechanism
  3.

GGUS (or Remedy) tickets

T0 sites issues:

- Received this morning some input data resolution problems for analysis jobs at CERN. Checking the files afterwards did not reproduce the problem.

- ce124 is failing a lot of pilots (see ce124.PNG).


T1 sites issues:

Received this morning some input data resolution problems for analysis jobs at GridKA. Checking the files afterwards did not reproduce the problem.

23rd June (Tuesday)

Experiment activities

  • MC physics productions (and corresponding merging) are running in the system

Top 3 site-related comments of the day:

  1.
  2.
  3.

GGUS (or Remedy) tickets since yesterday

T0 sites issues:

T1 sites issues:


22nd June (Monday)

Experiment activities

  • MC physics productions (and corresponding merging) are running in the system

Top 3 site-related comments of the day:
  1. SQLite access problem on the shared area at CNAF
  2.
  3.

GGUS (or Remedy) tickets since Friday

T0 sites issues:

T1 sites issues:

  • CNAF SQLite file access problem on the shared area.


19th June (Friday)

Experiment activities

  • A few remaining reprocessing and FULL stream FEST reconstruction jobs are running. Some application crashes were experienced at all (T1) sites, but it is not obvious whether they are site-dependent (the log SE got full and the logs are not available).
  • MC physics productions (and corresponding merging) are running in the system (about 5K concurrent jobs)
Important News
  • Received a request to upgrade (transparent intervention) the srm-lhcb endpoint at CERN to 2.7-18. Tentatively, Tuesday next week would be OK (unless advised otherwise by LHCb management).
  • After the patch to Persistency it can be claimed that the Oracle access to COOL for LHCb has, for the first time, been fully tested at nominal level!
firstpassjobs.png
Top 3 site-related comments of the day:
  1. SQLite access problem on the shared area at CNAF
  2.
  3.

GGUS (or Remedy) tickets since Friday

T0 sites issues:

Central VOBOX did not recover after the CERN power cut this morning.

T1 sites issues:

  • CNAF SQLite file access problem on the shared area: pending a new release of GAUDI to use the copy-local option.


18th June (Thursday)

Experiment activities

  • Discussions about possible plans to migrate some data in StoRM that LHCb believed to have been put in the right space token, while it was not. This might turn into an LFC service request to change the replica information.
  • Some signal MC production has been delivered. Other productions (and the corresponding merging) are running in the system.
  • Reprocessing jobs are running, as well as FULL stream FEST reconstruction.
Top 3 site-related comments of the day:
  1. Power cut at CERN this morning; not all CERN VOBOXes are back at work
  2. Occasional problems with the shared area at GridKA
  3. Intermittent file access problems at RAL

GGUS (or Remedy) tickets since Friday

T0 sites issues:

Central VOBOX did not recover after the CERN power cut this morning.

T1 sites issues:

  • Intermittent issues accessing data at RAL (seems to be a consequence of the LSF job slot limit for the root protocol, capped at 100, with some disk servers hitting this limit; the limit has been increased).
  • Shared area issue at GridKA.
  • Shared area issue at CNAF. The problem has gone away and close monitoring has been put on the GPFS servers serving the shared area.


17th June (Wednesday)

Experiment activities


  • Successful rerun of FEST (data transfer + reconstruction at the T1's) with the patched version of CORAL. The full stream production is still running.
  • Reprocessing (Staging+ Reconstruction) ongoing on the T1's.
  • Various MC productions requested by the physics group are proceeding smoothly
  • The central VOBOX machines had to be rebooted (kernel upgrade) this morning, not without consequences for the DIRAC services and hence for LHCb activities.
- No big issues to report.
- Answering Jeremy's question from yesterday: the under-usage of UK T2 sites (and not only those) by DIRAC seems to be due to an overload of the central DIRAC services, which cannot push more than 25-30K jobs per day through the system. The increased load is in turn due to an increased number of status updates sent by (on average) shorter jobs; a back-of-envelope sketch is given below.
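
As a purely illustrative back-of-envelope calculation (the update capacity and updates-per-job figures below are assumptions, not measured DIRAC numbers), the cap on jobs/day translates into fewer delivered CPU-hours as jobs get shorter:

# Back-of-envelope: the central DIRAC services can absorb only a limited rate
# of job status updates, which caps the achievable jobs/day roughly
# independently of job length. With shorter jobs the same cap fills fewer
# CPU-hours, which shows up as under-used sites.
# All numbers below are illustrative assumptions.

UPDATE_CAPACITY_PER_DAY = 300_000   # assumed ceiling on status updates/day
UPDATES_PER_JOB = 10                # assumed average number of updates per job

jobs_per_day_cap = UPDATE_CAPACITY_PER_DAY // UPDATES_PER_JOB   # ~30K jobs/day

for job_length_h in (12, 6, 2):     # average job length in hours
    cpu_hours = jobs_per_day_cap * job_length_h
    print(f"{job_length_h}h jobs -> ~{jobs_per_day_cap} jobs/day, ~{cpu_hours} CPU-hours/day filled")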




GGUS (or Remedy) tickets since Friday

T0 sites issues:

T1 sites issues:

  • Sporadic failures are being observed at GridKA, indicating shared area access problems.
  • Intermittent data access failures at RAL, with errors suggesting an authentication problem with the rootd servers.
  • Intermittent data access failures at CNAF.

T2 sites issues:



16th June (Tuesday)

Experiment activities


  • FEST (data transfer + reconstruction at the T1's) restarted yesterday evening with a patch to bypass Persistency accessing the LFC to retrieve CondDB information. The full stream production is running now.
  • Various MC productions requested by the physics groups are running: now 8K jobs in the system, together with normal user analysis jobs. No big issues to report.
  • Reprocessing (Staging+ Reconstruction) ongoing on the T1's.

GGUS (or Remedy) tickets since Friday

T0 sites issues:

T1 sites issues:

T2 sites issues:

Pilots aborted at several small sites.

15th June (Monday)

Experiment activities

  • STEP09 is considered finished. The data reprocessing announced for the weekend did not happen because it was not worthwhile from a physics (DQ) perspective: the online snapshot shipped with the SQLDDB (SQLite) had the magnetic field set to zero. LHCb intends to rerun and test the scalability of the Oracle CondDB access as soon as a Persistency workaround is in place.
  • Two active (MC09) productions running on the system now (up to 7K jobs) without major issues to be reported: the minimum bias (w/o MC truth) and the corresponding merging production.

GGUS (or Remedy) tickets since Friday

T0 sites issues:

T1 sites issues:

T2 sites issues:

Pilots aborted at several small sites.

12th June (Friday)

Experiment activities

  • STEP09 will continue during this weekend without touching the system. Staging and reprocessing of data will be done without the remote ConditionDB access (which would currently require a major re-engineering of the LHCb application), but via the available SQLite CondDB slices. With this same mechanism LHCb currently runs the FULL stream reconstruction once the Data Quality group has given the green light on the results of the EXPRESS stream reconstruction. A request has been sent to all sites (CNAF and RAL have already done it) to clean again the disk-cache area in front of the MSS. Please note that we do not require the tape-resident files to be cleaned as well.


Top 3 site-related comments of the day:
  1. Issue with staging data at SARA: directories were removed from tape too
  2. FULL stream reconstruction failing at CNAF: ConnectData I/O error
  3.

GGUS (or Remedy) tickets since Friday

T0 sites issues:

T1 sites issues:

  • The staging issue at SARA seems related to the fact that all directories to be staged had also been removed from tape.
  • Data access problems on CASTOR at CNAF for the full reconstruction.


11th June (Thursday)

Experiment activities

  • STEP09 reprocessing and staging confirmed that the Persistency interface to LFC urgently needs to be updated. A lot of jobs are failing to contact the LFC server because it is brought down by the inefficient way Persistency queries it. A Savannah bug is open (ref. https://savannah.cern.ch/bugs/?51663). As with the dCache client issue reported a few days ago, this is another important middleware issue that prevents us from continuing the STEP09 exercise. We cannot process data without the ConditionDB, so we will kill the currently running jobs. LHCb will again ask sites to wipe data from the disk cache and rerun the reprocessing using SQLite slices of the Condition DB (instead of accessing it directly), in order to compare the two approaches. Worth noting that with REAL data this would not have been possible!

Top 3 site-related comments of the day:
  1. All jobs failed to be staged at NL-T1
  2. Jobs at CNAF failing either when contacting the LFC or when opening files in CASTOR
  3. IN2p3 transfers failing last night because a disk was full

GGUS (or Remedy) tickets since Friday

T0 sites issues: LFC degraded by CORAL

LFC.png

T1 sites issues:

  • Issue transferring to MC_DST at Lyon: the disk is really full.
  • File access problems on CASTOR at CNAF.
  • Staging at SARA: all files to be reprocessed failed after 4 attempts.


10th June (Wednesday)

Experiment activities

  • STEP09: Reconstruction on-the-fly: Express job completed successfully at CERN. Pending DQ confirmation before full stream data can be processed. (FEST aspect)
  • STEP09 reprocessing and staging: started at 6 of the 7 sites (CNAF issue).
  • Investigation with Jeff has uncovered a middleware issue with the dcap plugin for ROOT. Enabled in DIRAC the configuration to read data after downloading it to the WN (the ATLAS approach); see the sketch after this list. Running smoothly now.
  • MC09: ongoing.
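
For illustration only, a minimal sketch of the "download to the WN, then read locally" fallback mentioned above, assuming lcg-cp is available on the worker node; the SURL and local path are hypothetical and this is not the actual DIRAC code:

# Sketch of the "download to the worker node, then open locally" fallback
# used instead of reading over dcap/gsidcap. Hypothetical SURL and path;
# lcg-cp options may differ per site.
import subprocess

def download_then_open(surl, local_path):
    # Copy the file from the SE to local scratch on the WN.
    subprocess.run(["lcg-cp", "--vo", "lhcb", surl, "file://" + local_path],
                   check=True)
    # The application then opens the local copy instead of a dcap:// URL.
    return open(local_path, "rb")

# Example (hypothetical SURL):
# f = download_then_open("srm://srm.example.org/lhcb/MC09/some.raw", "/tmp/some.raw")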

Top 3 site-related comments of the day:
  1. The dcap plugin for ROOT is buggy. A new, not yet released, version of the client would cure that problem at dCache sites
  2. LFC issue: due again to the Persistency interface
  3. Staging issue at CNAF, in line with the announced tape cleaning intervention

GGUS (or Remedy) tickets since Friday

T0 sites issues:

LFC issue, again due to the old Persistency interface exhausting the threads on the master instance. The wrapper was changed to use the local LFC instance to spread the load (a sketch of the idea is given below).
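
As a minimal sketch of the idea (the hostname is hypothetical and this is not the actual DIRAC wrapper), LFC client tools pick their catalogue instance from the standard LFC_HOST environment variable:

# Sketch: point LFC clients at a local/read-only catalogue instance instead
# of the master, via the standard LFC_HOST variable. Hostname is hypothetical.
import os, subprocess

os.environ["LFC_HOST"] = "lfc-lhcb-ro.example.org"   # local replica (hypothetical)

# Any LFC client started from here on talks to that instance, e.g.:
subprocess.run(["lfc-ls", "-l", "/grid/lhcb"], check=True)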

T1 sites issues:

  • IN2p3-CNAF transfers failing: a ticket was originally sent against StoRM because the FTS message pointed to a disk space problem at the destination, while it seems to be an issue with the gridftp server at the source (as confirmed by the EGEE broadcast sent by IN2p3 people).
  • Staging at CNAF: no reprocessing jobs because of the intervention.
  • MC_DST and DST space tokens at CNAF: it seems that the space token depends on the path, while StoRM should guarantee the independence between space token and namespace path.

T2 sites issues:
Issues with slow shared area and with jobs being killed by the batch system

9th June (Tuesday)

Experiment activities

  • STEP09: all T1's cleaned up the disk cache (see throughputSTEP09.png).
  • STEP09: EXPRESS stream jobs for the STEP reconstruction activity have been submitted (waiting to run at CERN). The FULL stream will go only after the green light from EXPRESS.
  • STEP09: Reprocessing production (staging data from tape) started.
  • MC simulation: 6K jobs in the system.
  • Investigation on RAW file access at gsidcap sites is continuing: problem reproduced by Jeff.

Top 3 site-related comments of the day:
  1. WMS instance at SARA hangs on list-match and fails job submission
  2.
  3.

GGUS (or Remedy) tickets since Friday

T0 sites issues:

T1 sites issues:

  • NIKHEF continuing the investigation on RAW file access at gsidcap sites in close collaboration with Jeff and Ron. Problem reproduced
  • WMS issue at SARA
  • Users report jobs failing to access data at GridKA. The local contact person is looking at it.

T2 sites issues:
Issues with slow shared area and with jobs being killed by the batch system



8th June (Monday)

Experiment activities

  • Simulation is running continuously; currently 6K jobs in the system.
  • STEP09: a request was sent to all T1 sites to clean the disk cache in front of the MSS, to test the sites' staging capability in the reprocessing that will take place this week.
  • STEP09 (data transfer and reprocessing at T0 and T1) will start later this afternoon after an intervention at the ONLINE system.

Top 3 site-related comments of the day:
  1. Cleaning of the disk cache for STEP09
  2. Files from ONLINE during the weekend still had to be migrated to TAPE
  3. Transfers to/from StoRM at CNAF failing. Problem spotted by the SAM tests

GGUS (or Remedy) tickets since Friday

  • 11 new tickets today
    • T0: 1
    • T1: 8 (49323, 49329, 49330, 49331, 49332, 49333, 49334, 49335)
    • T2: 2
  • List of all open tickets

T0 sites issues:

  • (Friday night): more files lost in LHCBDATA: the attempt to create a tape copy of the remaining T0D1 files in lhcbdata failed and a further 6548 files were lost.
  • Files from ONLINE during the weekend still had to be migrated to TAPE (Remedy ticket open).

T1 sites issues:

  • StoRM issue at CNAF preventing files from being copied to/from the endpoint.
  • NIKHEF continuing the investigation on RAW file access at gsidcap sites in close collaboration with Jeff and Ron

T2 sites issues:

  • Issues with slow shared area and with jobs being killed by the batch system

5th June (Friday)

Experiment activities

  • MC09 continuing: 4 different physics productions (and their merging jobs) are ongoing fairly smoothly.
  • STEP09: a request will be sent to all T1 sites to clean the disk cache in front of the MSS, to test the sites' staging capability in the reprocessing that will take place next week.

Top 3 site-related comments of the day:
  1. Thousands of files lost in the lhcbdata pool
  2. dCache client configuration (NIKHEF and IN2p3 issue)
  3. FTS Tier-2 service at CERN down

GGUS (or Remedy) tickets

T0 sites issues:

  • Thousands of files have been lost. It seems related to a bug in the upgrade scripts that accidentally switched on the garbage collector while files in the temp fileclass had not yet been backed up.
  • FTS tiertwo-fts-ws.cern.ch is not responding.

T1 sites issues:

  • It seems that MDF RAW files (i.e. non-ROOT-native files, without a ROOT header) cannot be read at sites where the gsidcap plugin is required (NIKHEF and IN2p3). The guess is that some parameter passed to dcap by the ROOT plugin prevents correct access to the file. While investigating this issue at NIKHEF (IN2p3 currently has its MSS down), we also discovered that even though LHCb unsets the dCache environment variables for read-ahead and/or sets the read buffer size to a more comfortable value (200 KB), the file is opened with a completely different set of parameters, suggesting that the clients internally set default values that override the user settings (see the sketch below).
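
For reference, a sketch of the kind of tuning described above: setting the dcap client read-ahead behaviour through environment variables before the application starts. The variable names (DCACHE_RAHEAD, DCACHE_RA_BUFFER) are quoted from memory of the dcap client documentation and should be verified; the point of the report is precisely that the client appears to override such settings internally.

# Sketch: tune dcap read-ahead via environment variables before starting
# the application (e.g. ROOT with the dcap plugin). Variable names are as
# remembered from the dcap documentation and should be checked.
import os

os.environ["DCACHE_RAHEAD"] = "0"                 # disable client read-ahead
os.environ["DCACHE_RA_BUFFER"] = str(200 * 1024)  # or use a 200 KB read buffer, as in the report

# The application is then launched in this environment; the report notes that
# the files are nevertheless opened with a different set of parameters.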

T2 sites issues:

  • Issues with slow shared area and with jobs being killed by the batch system

4th June (Thursday)

Experiment activities

  • STEP09 preparation: transferring last week's FEST data across the T1's. Sysadmins will be asked to delete these same data from the cache, to test the staging capability in the reprocessing phase during STEP09.
  • User analysis activity

Top 3 site-related comments of the day:
  1. Expiration of the password for the POOL account at CERN
  2. Mapping problem at PIC
  3.

GGUS (or Remedy) tickets

T0 sites issues:

  • Input data resolution at CERN

T1 sites issues:

T2 sites issues:

3rd June (Wednesday)

Experiment activities

  • STEP09 preparation: transferring last week's FEST data across the T1's. Sysadmins will be asked to delete these same data from the cache, to test the staging capability in the reprocessing phase during STEP09.
  • User analysis activity

Top 3 site-related comments of the day:
  1. Possibly related to the intervention, but many user jobs are failing to resolve input data at CERN
  2.
  3.

GGUS (or Remedy) tickets

T0 sites issues:

  • Input data resolution at CERN

T1 sites issues:

T2 sites issues:

2nd June (Tuesday)

Experiment activities

  • The pending MC productions completed during the weekend. Waiting for more information from the physics group to "validate" them (a bug in the reconstruction application is suspected).
  • STEP09 preparation: transferring last week's FEST data across the T1's. Sysadmins will be asked to delete these same data from the cache, to test the staging capability in the reprocessing phase during STEP09.

Top 3 site-related comments of the day:
  1. IN2p3 issue with CRLs on the WNs. Nothing to do with the scheduled MSS downtime
  2.
  3.

GGUS (or Remedy) tickets

T0 sites issues:


T1 sites issues:

  • [CLOSED - since 28 May 09] NFS locking system stuck, causing trouble accessing the SQLite file; resolved by rebooting the NFS server. Testing a new option to copy this file locally.
  • [CLOSED - since 30 May 09] CRL issue with WNs at IN2p3.

T2 sites issues:
