-- JamieShiers - 30 Jun 2006

23 May 2007

LHCb

Slowness in submitting jobs to the central WMS at CERN (rb112 and rb117): a single job submission takes roughly one minute, while the LHCb-dedicated nodes at CNAF (running the same WMS 3.1 tag) are extremely fast. Measurements on rb112 are displayed in sub.png, where you can see 3000 seconds to submit 100 jobs (time until the prompt comes back). The step profile suggests a problem similar to the one we hit on the two CNAF instances wms006/wms007 for single job submission (the LHCb way of accessing the grid), which was fixed by replacing some GSOAP libraries by hand. For more information please get in touch with Fabrizio Pacini.
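
For illustration, a minimal sketch of how such a per-job submission timing could be reproduced from a UI; the command name, the '-a' option and the test.jdl file are assumptions based on the standard gLite 3.1 WMS client and are not taken from the LHCb measurement itself:

#!/usr/bin/env python
# Rough per-job timing of sequential single-job submissions to a WMS.
# Assumptions: glite-wms-job-submit is on the PATH, a valid proxy exists
# and test.jdl describes a trivial job; adapt command and JDL to your setup.
import os
import subprocess
import time

N_JOBS = 100
CMD = ["glite-wms-job-submit", "-a", "test.jdl"]   # '-a': automatic delegation (assumed)

devnull = open(os.devnull, "w")
timings = []
for i in range(N_JOBS):
    start = time.time()
    subprocess.call(CMD, stdout=devnull, stderr=devnull)
    elapsed = time.time() - start
    timings.append(elapsed)
    print("job %3d submitted in %5.1f s" % (i + 1, elapsed))

print("total %.0f s for %d jobs, %.1f s/job on average"
      % (sum(timings), N_JOBS, sum(timings) / N_JOBS))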

25 April 2007

ALICE

  • In the case of ALICE, the production is in scheduled downtime because the new version of AliEn (v2-13) is being deployed now. At this moment the central services at CERN are being updated to the new version, and then we will begin to update all ALICE tiers. Therefore we do not have much to say this time.
  • In terms of transfers, ALICE is already testing the FTS 2.0 version and Gavin can surely give details on the procedure.

LHCb

  • Restarting Reconstruction and Stripping from a clean situation after the disk space clean-up at the T1s (according to the new LHCb policy on data storage) and an extensive debug and test campaign on their (buggy) applications (both GAUSS and Brunel). The main problem (under investigation) discovered while replicating rDSTs from all T1s to CERN (within the T1 disk space clean-up activity) is reported under week 2007-17 of the weekly report on the CIC page (https://cic.gridops.org/index.php?section=vo&page=weeklyreport). The problem has been confirmed at PIC and NIKHEF and seems to affect files present on disk1tape0. We are awaiting the T1 sysadmins' response regarding d0t1 files. Recent findings from PIC show that it might also happen after a disk pool migration (when some race condition is reached). The dCache experts are aware and looking into it; in any case this is a dCache bug that would definitely affect other VOs as well and needs to be followed closely.

18 April 2007

ALICE

  • We have performed several tests to compare the number of running jobs vs. waiting jobs reported by 4 different information sources: the local GRIS, the top BDII, the batch system and the LB. The exercise has been executed at CERN, GridKa and CNAF, triggering scripts which collect the information every minute (a sketch of the BDII probe is given after this list). Results are available from the ALICE ML page.
  • The instabilities in the publication of the CEs at CERN last week also showed up in these tests
  • The exercise will be extended to ATLAS and CMS using the same sources except the LB, since many different people are submitting jobs at each site. Results will also be available from ML
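
A minimal sketch (in Python) of the BDII side of such a probe, assuming a standard top-level BDII on port 2170, the Glue 1.x attributes GlueCEStateRunningJobs/GlueCEStateWaitingJobs and ldapsearch on the path; the host name and the CE filter are placeholders, not the actual scripts used for these tests:

#!/usr/bin/env python
# Periodic probe of the running/waiting job counts published by a top-level BDII.
# Assumptions: ldapsearch is installed, the BDII host below is only a placeholder,
# the standard Glue 1.x schema is used and the CEs of interest can be selected
# with a simple substring filter.
import subprocess
import time

BDII = "ldap://lcg-bdii.cern.ch:2170"   # placeholder top-level BDII
FILTER = "(&(objectClass=GlueCE)(GlueCEUniqueID=*cern.ch*))"
ATTRS = ["GlueCEUniqueID", "GlueCEStateRunningJobs", "GlueCEStateWaitingJobs"]

while True:
    cmd = ["ldapsearch", "-x", "-LLL", "-H", BDII, "-b", "o=grid", FILTER] + ATTRS
    out = subprocess.Popen(cmd, stdout=subprocess.PIPE).communicate()[0]
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for line in out.decode().splitlines():
        if line.startswith("GlueCE"):
            print("%s %s" % (stamp, line.strip()))
    time.sleep(60)   # the comparison scripts collected the numbers once per minute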

4 April 2007

ALICE

ALICE has 2 questions regarding the FTS transfers that are being performed now:
  • It was assumed that FTS would limit transfers both ways, i.e. it would guarantee that the VO does not go below, but also not above, a certain share. If this is the case, the question is open, because since Friday 30/03 morning ALICE has observed the speed jump to 120 MB/s on average.
  • ALICE would like to know if there are any steps taken by the FTS service providers to ensure that all elements of the service (including the SRMs and storage at the sites) are working normally. As in the previous exercise, the experiment is transferring to 3 sites out of 5.

21 March 2007

ALICE

Issues with the accounting at big sites such as CERN or FZK. It seems the information reported by the IS fails from time to time. Also, the information provided by the LB is too slow. This affects the agent submissions of ALICE to the sites.

LHCb

I asked at the operations meeting to close the last issues (triggered by the famous lcg-gt behaviour), because our work-arounds now seem to be working and people (Patrick from dCache) are aware that this is a bug in the dCache implementation of SRM and are working on it. We will restart reprocessing of data at the dCache sites (the only sites where data is available) today, with the new service for staging files in production.

Nothing more to report, all MC simulation having been exhausted since last week. No major issue to be discussed.

14 March 2007

LHCb

This week's burning issue, which has to be followed very closely - a consequence of the lcg-gt and file-not-staged problem (see the weekly operations report for more info) - is the need to open the dcap ports at RAL, IN2P3 and GridKA to the rest of the world. This is to allow dccp commands running from CERN (commands used in turn for staging files onto the disk pools).

ATLAS

Yes - I'll bring it up.

Gavin is presenting the FTS 2 plan at the MB later today. The target is still April 1 for CERN with Tier1s sometime later. The pilot should be available before that.

I'll put this note in the ATLAS section...

Cheers,

-- Jamie


-----Original Message----- From: Simone Campana Sent: Tuesday, March 13, 2007 15:10 To: Jamie Shiers Cc: Alessandro Di Girolamo Subject: Tomorrow SCM: FTS issue.

Hi Jamie.

This morning we discovered an issue with FTS trying to write into DPM, which again has to do with permissions and ACLs. The problem and the follow-up are summarized in today's task force meeting minutes (I append the relevant part below) and in the mail to the ATLAS operations mailing list which I sent a few minutes ago, summarizing a discussion I had with JRA1 (Claudio and Gavin). Unfortunately I will not be able to come to the SCM tomorrow since I have a parallel meeting; I would ask Alessandro to represent ATLAS there.

Can you bring up the issue? The main points would be:

1) Understand the deployment plan of FTS2 (which would solve the problem), both at CERN and at the T1s. There is a tentative plan from Gavin which can be found in the second snippet below, but this should be formalized properly, especially the timescale

2) There is the need for a short term solution. Gavin will discuss with Jean Philippe about this as soon as Jean Philippe comes back.

Please let me know in case of questions/comments. Feel free to append this to the agenda.

Alessandro, can you pass by my office so that we discuss the problem in the technical details? Can you attend the meeting tomorrow?

** From the TF Minutes: description of the problem **

Problem of permissions with FTS (Campana): FTS1 does not handle VOMS proxies i.e. has no knowledge of groups and roles. This means, with FTS it is not possible to write in an area dedicated to production activities where only the production user can write. This is a problem for T1->T2 transfers of AODs. The new version of FTS (FTS2) handles delegation properly, including roles and groups. The deployment schedule of FTS2 is being discussed.

**

** From the discussion with Gavin and Claudio: possible timescale **

Hi Stephane (the mailing list in cc:)

I just had a discussion with Gavin. The good point is that the FTS2 server does in fact support delegation, including VOMS roles and groups. Concerning the deployment, the FTS2 service is being tested at the moment. It is still uncertified, but the plan is to have it certified very shortly and give the VOs the possibility to test it within a week or so. The first test would use the old client with the new server, then the new client (with delegation) with the new server within the experiment framework. After that the server can be put in production at CERN. The deployment of the new server at the T1s will probably be scheduled one month after it is put in production at CERN, so about 2 months from now, considering Easter in the middle.

I have put Gavin in cc: in case I misunderstood some important point or got confused.

Cheers

Simone

07 March 2007

LHCb

  • LHCb ask for the possibility to always have files staged on disk once the SRM endpoint gives back a tURL. It is much more advantageous to wait until the file is on disk, even if lcg-gt (or whatever command is used to query the tURL) is slow, than to have ROOT fail to open the file (gsidcap case) or stage the file (dcap case) only on a subsequent open. This problem (experienced only on the LHCb T1 sites running dcap/gsidcap) has already been reported via GGUS (#19205), via direct contact with the SARA admins, and also at the operations meeting. Other general GGUS tickets have also been opened (19338 and 19398). If sites do not agree, LHCb need a way to stage in files before they are opened by the application at run time. The only ways available today are dccp (with a fake destination file) or lcg-cp (causing a transfer of data), but they are just ugly (and not always workable) hacks; a sketch of such a hack follows this list.
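
A minimal sketch of the stage-in hack mentioned above, assuming the lcg-gt and dccp clients are available and that the first line printed by lcg-gt is the tURL; the SURL is a placeholder and this is not the LHCb production code:

#!/usr/bin/env python
# Force a stage-in of a file by resolving its tURL with lcg-gt and then
# "copying" it to /dev/null with dccp. This mirrors the ugly work-around
# described above; the SURL below is a placeholder, and the assumption that
# the first line of lcg-gt output is the tURL should be checked locally.
import subprocess
import sys

SURL = "srm://srm.example.org/castor/example.org/grid/lhcb/somefile.dst"  # placeholder

# Ask the SRM for a dcap tURL (this may also trigger the stage-in itself).
proc = subprocess.Popen(["lcg-gt", SURL, "dcap"], stdout=subprocess.PIPE)
out, _ = proc.communicate()
if proc.returncode != 0:
    sys.exit("lcg-gt failed for %s" % SURL)
turl = out.decode().splitlines()[0].strip()

# Read the file through dcap and throw the bytes away: the side effect is
# that the file ends up on a disk pool, ready for the real application.
rc = subprocess.call(["dccp", turl, "/dev/null"])
sys.exit(rc)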

28 February 2007

ALICE

  • No special issues to report regarding ALICE, just a reminder of the gLite-CE issues we already reported in the Monday meeting. Also, we have almost finished the SAM implementation, and Vikas is also working on adapting the framework for automatic tests of the sites

CMS

  • Problems found in the latest PhEDEx release, sites had trouble moving produced MC samples to CERN. Measurement of transfer metrics suspended until debugging is complete. No problems found on the WLCG side (FTS, SRM). Very good transfer rates reached (T0->T1 at 800 MB/s for several hours, good T1->T2 rates as well).

ATLAS

1. The T0 throughput test has started. It is still in the ramp-up/debugging phase; I would guess sites will not see a decent data flow for another couple of days. Yesterday there was a 10 MB/s transfer throughput toward Lyon.

2. There is a 5% failure rate writing into CASTOR with rfcp for the T0 exercise. For the production system, this rate is 10% after 3 retries using SRM (srm.cern.ch).

3. The fact that there is no apt- or RPM-based distribution of the UI for SLC4 is becoming a problem. All the ATLAS T0 machinery runs on SLC4 and Alessandro (EIS) is spending a huge amount of time struggling to get the RPMs installed. The tarball solution is not desirable, since some RPMs needed by ATLAS are not included in the tarball distribution (e.g. the LFC clients with bulk methods).

4. The deployment of the LFC clients with bulk methods is a mess. Some days ago the client RPM was rolled back because of a problem in the update of the DB schema (which is on the server side). Everyone had assumed this had been certified 2 weeks ago. Sites need to know when they will be able to upgrade; we (i.e. Alessandro and myself) are a bit uneasy installing an uncertified version of the client, and Dietrich needs the client for the DDM client tools needed for analysis and is getting nervous about the delay.

14 February 2007

ALICE

  • Continuing the PDC07 in production mode
  • Getting information about the dress rehearsal during the TF meeting
  • Implementing specific SAM tests for VOBOXES
  • Issues with myproxy VOMS extensions (followed up)

CMS

  • Two additional LCG RBs at CERN for the Monte Carlo production
  • First CMS tests integrated in SAM (software installation, squid cache, FroNtier test, file stage out for MC production)

31 January 2007

ALICE

  • The SLC4 tests in the PPS at CERN are over. All ALICE jobs ran successfully
  • Beginning to test the gLite-CE; several issues observed in the corresponding CE have so far prevented submissions
  • The issues with the VOBOX tests in SAM are still pending

17 January 2007

LHCb

  • Problems running MC simulation and reconstruction at GridKA.
  • Instability of all T1 SRM endpoints observed over the past weeks. This caused huge backlogs to form in the VO-BOXes used in the failover mechanism, and these backlogs still have to be fully worked off.

ALICE

  • Very good results during the Christmas break, with production averaging 2000 jobs per day
  • Ready to continue the multi-VO exercise
  • Testing SLC4 through voalice03 at CERN using the PPS infrastructure
  • In general good results for all AliEn commands, apart from a compatibility problem between SLC3 and SLC4 in AliRoot. Experts are working on this

06 December 2006

ALICE

  • ALICE stopped the transfers to get ready and follow the Multi-VO exercise
  • During the exercise the VO will continue the CERN-RAL transfers to keep on checking CASTOR@RAL
  • As soon as voalice03 (VOBOX) is back in production, it will be set to check the ALICE software in SL4

15 November 2006

ALICE

  • The central services of ALICE stopped on Sunday night and came back Monday morning
  • Several issues in the AliEn proxy service prevented a smooth ramp-up at all sites
  • Back in smooth production in the afternoon, after 16:30
  • Castor2 crashes at CERN on Monday afternoon affected the ALICE FTS transfers
  • All SE services had to be restarted at all sites
  • SARA issues: the software area was not accessible from the VOBOX, the queue disappeared from the IS, and the destination SE was having SRM problems. FTS transfers were therefore affected. A ticket was submitted; the situation improved on Thursday
  • Overload of the RAL VOBOX forced us to stop all services at that site and reboot the machine; back in production on Thursday
  • The new ALICE VOBOX at CERN is in production
  • Problems with all CERN queues for ALICE: sgm submissions aborted the jobs

8 November 2006

ALICE

  • Setting up voalice03 as the new VOBOX at CERN for ALICE
  • In terms of transfers, ALICE suffered this week from a castor2 downtime (6th November) at CERN and a castor2 DB restart (7th November)
  • Transfer speed at this moment: 170 MB/s
  • Downtime at CNAF (announced)
  • Good transfer efficiencies with the rest of the sites

1 November 2006

ALICE

  • Issues with CE ce102.cern.ch: the queue is having some issues with ALICE jobs, so temporarily this queue will not be used for ALICE production until the problem is solved. The grid_monitor does not consider the jobs as finished although the batch queue no longer knows about them.
  • In terms of FTS transfers, a small issue with the load-balancing system, which continuously sent transfers to the queues during the weekend. Transfers dropped to zero on Saturday night, but recovered in the morning.
  • At this moment, good behaviour of the transfers

25 October 2006

ALICE

  • FZK - still in scheduled downtime
  • CNAF - coming back in production
  • RAL - low number of transfers (as usual), 66% success rate (pool errors). At this moment negotiating ALICE access to CASTOR
  • SARA - now all transfers are failing (it has been reported via GGUS) with errors like: Failed on SRM put: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Error reading token data: Success
  • CCIN2P3 - the only one working 100%
  • What is the status of the new VOBOXES at CERN for ALICE?

The MonALISA page is fixed now, it shows the correct speeds.

18 October 2006

ALICE

  1. Shutdown of the central services this week for updates. Production is back since the evening of the 17th (including FTS transfers)
  2. With this new bunch we will begin to test the gLite RB provided for ALICE at CERN
  3. Following the 17th October report provided by the ARDA Dashboard, this first day of production shows (in terms of FTS transfers) quite good results for CNAF (97% efficiency), CCIN2P3 (98% efficiency) and FZK (100% efficiency). Still bad results for SARA (all transfers failed) and RAL (10% efficiency)

LHCb

  • Still working on getting the LHCb software fully working for Reconstruction and Stripping. Hopefully two weeks from now they will start the Stripping (in parallel with the reco activity). This will imply diskT1-diskT1 transfers. We have restarted monitoring the health status of the T1-T1 matrix (http://santinel.home.cern.ch/santinel/cgi-bin/lhcb).
  • Any update on the ACL modification on LFC? Joel asked me to increase the severity of this task; he would not otherwise be able to use VOMS for running production and thereby prove the readiness of the LHCb computing framework for using VOMS.
Date: Wed, 11 Oct 2006 14:28:19 +0200 (MEST) From: IT Helpdesk Reply <arsystem@sunar01.cern.ch> To: santinel@mailNOSPAMPLEASE.cern.ch Subject: CT372673 REGISTERED [Adding a new ACL on LHCb LFC]

10 October 2006

ALICE

a) ALICE is observing problems in the network performance (it seems to be quite poor) as well as fluctuations in the results. Here are the results of last night:

The system was stable, with the following observations:
  • average aggregated speed around 100 MB/s
  • CNAF: ~30 MB/s
  • FZK: ~20 MB/s
  • LYON: ~40 MB/s
  • RAL: almost 0; 70% of the transfers crash with gridftp errors
  • SARA: ~20 MB/s; all transfers crash because of the VOBOX config

For all the sites the differences in speed from one hour to the next are sometimes a factor of 2. This is an important factor, but the fact that the speed to FZK, for example, is so low on average is more important. We need 60 MB/s on average for each site, yet only LYON reaches peaks of this value (a quick check of these numbers is sketched at the end of this section).

b) Jeff Templon put on the table the problem of sgm submissions. ALICE makes its submissions through the VOBOX and therefore with sgm accounts. It seems the policy now is to give higher submission priority to sgm accounts but fewer resources. This will be quite a problem for ALICE.

c) RB in gLite: at this moment Harry has announced for ALICE a gLite RB shared with LHCb and Geant4. For testing purposes this is fine, but ALICE needs a dedicated RB for production.
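
A trivial back-of-the-envelope check of the numbers quoted in a) above (the 60 MB/s per-site target over 5 sites gives the 300 MB/s aggregate target):

# Back-of-the-envelope check of the rates quoted above (MB/s).
observed = {"CNAF": 30, "FZK": 20, "LYON": 40, "RAL": 0, "SARA": 20}
target_per_site = 60                              # needed on average at each site
target_total = target_per_site * len(observed)    # 300 MB/s aggregate target

total = sum(observed.values())
print("aggregate now: %d MB/s, target: %d MB/s" % (total, target_total))
for site, rate in sorted(observed.items()):
    print("%-5s %3d MB/s (shortfall %d MB/s)" % (site, rate, target_per_site - rate))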

ATLAS

  • Several operations are needed in the CERN LFC: 1) replace the old CASTOR endpoints with the new ones, 2) change ACLs, 3) change the endpoint srm-atlas-durable.cern.ch -> srm.cern.ch for a list of files. 1) and 2) were requested last Friday and this Monday respectively. Any follow-up on those? 3) will be requested soon.
  • SRM instabilities are a problem all over the place. The latest cases are RAL and Lancaster, but many other endpoints suffer daily instabilities.

LHCb

  • Only MC simulation is going on, without major problems. Reconstruction and Stripping are temporarily stopped due to problems in their software.
  • Successfully tested the gssklog mechanism on the CERN gLite CE. It works fine for LHCb, which will thus be able to install software through grid jobs. Just a remark: the FQAN to be used for mapping to lhcbsgm (as already asked and reported at the ops meeting some time ago) should not be the usual lhcb:/Role=lcgadmin (used by all the other VOs) but lhcb:/lhcb/sgm instead.
  • Log & Sandboxes service: Olof and Co. provided a special service class in CASTOR (disk1tape0). Last week I wrote an ad-hoc cgi-bin application that allows browsing of files in the CERN CASTOR storage (so the file-system requirement for this service is no longer an issue). Just for your information, the application is reachable at: https://volhcb01.cern.ch/cgi-bin/castor. This was the last missing piece (and it might be useful for other VOs too).

27 September

ALICE

During last week the FTS transfers of ALICE have been extremely unstable. They had problems with CNAF (CASTOR problems and also DNS issues from the CERN side: the SRM endpoint was not resolved while srmcp was working), timeouts with FZK, and a lack of connections with SARA, where the message said: channel not found or VO not authorized for transfers.

We have also seen that the FTS endpoint from CERN was not published in the IS. In the case of ALICE, if both origin and destination are defined and one of them is CERN, the CERN FTS server is used for the transfer; otherwise the destination one is used. Since CERN was not published, the endpoint at SARA was used, and it seems to be in bad shape.

ATLAS

For ATLAS, the main attention is on LFC. Last week's meeting brought up some points where ATLAS was not using the catalog properly.

1) For a proper use of sessions in the most expensive ATLAS catalog operation (listing the replicas for a dataset), the python getreplica method needs to work (the C API works; there is a problem with SWIG for the python binding). Kristhof is currently working on it and a solution is foreseen by the end of the week. The new client library will be installed in the VOBOXes by ATLAS directly in user space (a sketch of the intended session usage is given at the end of this section).

2) This will also include the timeout and retry in the client, which has been certified and is currently in pre-production.

3) The problem of clients hanging when querying the Taiwan LFC (from the European VOBOXes) has been investigated by Jean Philippe. There might be a timeout parameter in the GSI authentication (on both client and server) which is too short. The parameter is hardcoded. Therefore, Jean Philippe will produce a new version of the server, to be installed at TAIWAN only, and a new version of the client, which will be installed by ATLAS at every VOBOX. If changing the parameter solves the problem, a request will be made to the developers to have this parameter configurable.

4) There is a procedure established for the ACL problem. Write access to the /grid/atlas/dq2 directory will be given to the production role. There will be a transition phase during which both the normal user and the production user will be able to write in that tree. After this transition, write access will be disabled for the normal user.

Before doing that, Miguel will need to clean up several thousand files in every LFC. The physical replicas also need to be removed. This operation with lcg-del takes 5 seconds per file, which is too much. Any other solution is welcome.
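
A minimal sketch of what the session-based replica listing from point 1) could look like once the python binding works, assuming the standard lfc python module with lfc_startsess/lfc_getreplica/lfc_endsess; the LFC host and LFNs are placeholders and the exact call signatures should be checked against the installed client:

#!/usr/bin/env python
# Sketch of listing replicas for a set of LFNs inside a single LFC session,
# which avoids one GSI handshake per call. Assumes the 'lfc' python binding
# (the one discussed above) is importable and LFC_HOST points to the catalog;
# the host and LFN names are placeholders.
import os
import lfc

os.environ.setdefault("LFC_HOST", "prod-lfc-atlas-central.cern.ch")  # placeholder

lfns = [
    "/grid/atlas/dq2/some_dataset/file1.root",   # placeholders
    "/grid/atlas/dq2/some_dataset/file2.root",
]

if lfc.lfc_startsess("", "dataset replica listing") != 0:
    raise RuntimeError("could not start LFC session")
try:
    for lfn in lfns:
        rc, replicas = lfc.lfc_getreplica(lfn, "", "")
        if rc != 0:
            print("%s: lookup failed" % lfn)
            continue
        for rep in replicas:
            print("%s -> %s" % (lfn, rep.sfn))
finally:
    lfc.lfc_endsess()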

LHCb

  1. LHCb are going to integrate the new gLite WMS in their production infrastructure (to use in parallel with the old LCG one). Before that they would like to repeat some tests and try to reproduce what CMS achieved in their tests. For that they require an RB with the same configuration as rb102.cern.ch (same hardware, same version of the middleware installed, same RAID 5 configuration). This is a pretty tight request.
  2. lxfsrk524, the disk server used by the LHCb production UIs and by the old log storage element lxb2003 (with old logs stored there and not yet backed up), crashed a month ago. We asked for it to be replaced with new hardware. What is the status of this request?
  3. Unstable transfers during last week from CERN to the T1s.

20 September

LHCb

The central FTS seems to suffer from intermittent load problems. All requests from LHCb had been hanging since yesterday morning (the TOMCAT hanging issue). There was a short period when FTS was working again (because Gavin had been notified privately yesterday, before going through GGUS), but then, from 12:30 yesterday, no more transfers were going through FTS. A GGUS ticket (#12953) was submitted yesterday, immediately after the second "wave" of problems. Further investigation points to the problem being due to the misconfiguration of a recently added node, affecting not only LHCb.

13 September 2006

LHCb

LHCb week ongoing in Heidelberg. Just a couple of things (Harry is already aware of them):

  • As discussed at the Task Force meeting yesterday, there is currently a requirement from LHCb for another dedicated Storage Element for storing Dirac sandboxes. What was suggested yesterday was to have a single SE shared between logs and sandboxes, in order to reduce the number of servers. It could basically be the current LOG SE storage element that they already have (volhcb01), extended to hold the sandboxes as well, or some new solution (Castor or DPM). For this new service they explicitly require a file system, in order to keep the possibility of browsing the logs through the https server. It has also been specified that it should be possible to easily extend its hardware in a transparent way. A rough computation of the amount of space required for the sandboxes shows a modest consumption of disk space (~1 MB per sandbox, 2 MB per job); a back-of-the-envelope sketch follows this list.
  • At the last ops meeting it was again pointed out that LHCb productions are carried out with the SGM account. This is what we informally call Joel's use case, and the only way to tackle it is to move to VOMS. LHCb is moving in this direction, although we already see several issues on the horizon that should be further discussed and analyzed within the VO and then correctly addressed.
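
A back-of-the-envelope sketch of the sandbox space estimate mentioned in the first item; the 2 MB per job figure is the one quoted above, while the job rate and retention period are purely hypothetical:

# Rough sandbox space estimate. The 2 MB/job figure is the one quoted above;
# the job rate and retention period are hypothetical, only for illustration.
MB_PER_JOB = 2
JOBS_PER_DAY = 10000        # hypothetical
RETENTION_DAYS = 30         # hypothetical

total_gb = MB_PER_JOB * JOBS_PER_DAY * RETENTION_DAYS / 1024.0
print("~%.0f GB of sandbox storage for %d jobs/day kept %d days"
      % (total_gb, JOBS_PER_DAY, RETENTION_DAYS))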

6 September 2006

ATLAS

  • A solution must be found for TAG publication at sites with multiple identical CEs (for fault tolerance and load balancing), like CERN. The tags published on one CE are not propagated to the others automatically, despite the fact that the underlying batch system is the same.

  • ATLAS will need to change the ACLs in every T1 LFC to grant write access to the "production" VOMS role. Jean Philippe has prepared a script which is being tested. The script needs to be applied at CERN and passed to the T1 admins to run on their LFC (root access to the machine is needed). Is there a chance that at some point the VO will be able to do this itself (by granting superuser privileges to the VO admin)? A sketch of the kind of ACL change involved follows this list.

  • There is a new LFC client with the configurable timeout and retry asked for by ATLAS. SARA refused to put it on the ATLAS VOBOX since the client is "untested" and the VOBOX is shared with ALICE (SARA is the site which suffered most from hanging clients). I would like to have this installed on the VOBOX at CERN, so that the client becomes "tested".

  • AFS mirroring of gd/apps/atlas: the ATLAS SW manager agrees with the mirroring and the implications discussed with Harry. To be scheduled.
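
A minimal sketch of the kind of ACL change described in the second bullet, assuming the lfc-setacl command-line tool with setfacl-like syntax and that the VOMS role maps to an LFC group named after the FQAN; this is only an illustration, not the script prepared by Jean Philippe:

#!/usr/bin/env python
# Sketch of granting the ATLAS "production" role write access on the dq2 tree,
# wrapping lfc-setacl. The group-name/FQAN mapping and the need to also set
# default ACL entries are assumptions; the real script may differ, so treat
# this only as an illustration of the idea.
import subprocess
import sys

LFC_PATH = "/grid/atlas/dq2"
PROD_GROUP = "atlas/Role=production"   # assumed LFC group name for the VOMS role

# Grant rwx to the production group, both as a normal ACL entry and as a
# default entry so that newly created sub-directories inherit it.
entries = "g:%s:rwx,d:g:%s:rwx,m:rwx,d:m:rwx" % (PROD_GROUP, PROD_GROUP)
rc = subprocess.call(["lfc-setacl", "-m", entries, LFC_PATH])
sys.exit(rc)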

16 August 2006

ALICE

  1. Lyon, GridKA and CNAF are reasonably well debugged. We have reached 50 MB/s to GridKA for a short while, with a maximum of 100 MB/s. CNAF is also working, but with lower transfer speeds (4 MB/s average, 20 MB/s peak). If we can reach ~60 MB/s sustained per site, we will reach the target rate of 300 MB/s.
  2. Currently the transfers to Lyon are failing with:
    • Transfer failed. ERROR the server sent an error response: 42542 Cannot open port: java.lang.Exception: Pool manager error: Best pool <pool-disk-sc3-10> too high : 2.0E8
    • This is what we have to master in order to get the transfers flowing. In such cases tickets are submitted to GGUS and in parallel to site experts.
    • It is imperative that the GGUS tickets are escalated very quickly, otherwise we lose momentum.
  3. Still to debug SARA and RAL.

24 July 2006

ALICE

  1. ALICE has at this point accepted the space conditions provided by RAL in order to begin making transfers. Waiting for the SRM endpoints to be fixed
  2. Successful transfers CERN-CNAF only
  3. SARA and Lyon still to be tested
  4. FZK: still problems with the VOBOX

LHCb

The status of the running reconstruction jobs is:

  • CERN - jobs seem to run without problems
  • PIC - jobs seem to run without problems
  • RAL - jobs seem to run (we have seen jobs run to completion in the past). Production jobs over the weekend never completed and seem to process only a small number of files. This is under investigation with the LHCb contact and the RAL tech people
  • FZK-GridKa - jobs weren't being picked up at GridKa. (There have also been issues with production)
    • A meeting between GridKa & LHCb experts took place yesterday and problems with the configuration of the GridKa system are being worked through.
    • Reconstruction jobs have run to completion at GridKa
  • IN2P3-Lyon - are now running only secure dcap.
    • This is not supported by ROOT in the AA until the next release. LHCb will need to re-build their applications.
    • Although not advertised, the Lyon disk SE does support insecure dcap. Our "hack" to use this failed - currently under investigation by LHCb
  • CNAF - unable to access data from CASTOR since the weekend. A GGUS ticket was submitted 24 hours ago; as yet no response. I note the FTS people have also reported problems with the CNAF endpoint, which seems to be the same issue. *** Would you be able to escalate this issue? ***
  • NIKHEF/SARA: problems well understood.

30 June 2006

LHCb

  1. GridKA is currently not usable
  2. CERN: a system to transparently install software, as they are used to doing at any other site (the gssklog mechanism hidden from the end user)
  3. CERN: grid users should be able to inherit an environment for using Castor2 (automatically and transparently)