-- HarryRenshall - 25 Jan 2008

Week of 080128

Open Actions from last week: Castor operations to check if write access to the CERN ATLAS CAF disk pool is now restricted. Castor operations to monitor the CMS instance running the new 2.1.6-7 software with a view to scheduling upgrades to the other experiments this week.

Monday:

see the weekly phone conference in Indico

Tuesday:

Experiments
  • ATLAS: Restarting their tests for space tokens.
  • LHCb:
    • Waiting on twiki pages to be set up per site.
    • Problem with the LFC replica @ RAL - LFC developers are in the loop.
    • PIC site admins are ready to migrate from CASTOR tape to Enstore, so they might need to run an LFC script to change the replicas. Need to check that the first migrated data is OK before PIC proceeds with the full migration.

Core Services

  • ATLAS Castor upgrade done. LFC upgrade (to 1.6.8 / SLC4 64-bit) for all VOs is still ongoing.

Databases

  • The DB intervention at CNAF is done. Streams was reconfigured to remove old DBs and add new ones. SARA is still out of the Streams configuration.

Monitoring:

  • RAS

Release Update:

  • RAS

Site Issues:

  • Michel Jouvin: Problem with old versions of lcg_utils against DPM (perhaps a gridftp2 problem). Fixed in the latest versions. We probably need to specify a minimum version of UI/WN to be installed.

  • Michael Ernst: a recent upgrade of VOMS has broken the VOMS XML interface, which is affecting a mirror version of VOMS @ BNL. Contacted the developers and they agree to fix it in the next release. A patch is already available - would like to have it put in place @ CERN. Markus/Jamie were notified last Friday.

  • Problem with the SAM tests failing at BNL (double publication via OSG) - James to follow up and involve Rob Quick/Arvind.

Wednesday

Experiments
  • Alice: RAS
  • ATLAS: SAM Criticality for SE tests changed.
  • LHCb: Some problems with the network server on rb123. Under investigation. The possibility of running analysis on a large T2, e.g. GRIF, is interesting.
  • CMS:

Core Services

  • FTS - intervention ongoing.

Databases

  • Interruption of a few minutes on the ATLAS RAC due to hardware problems on a storage node - this should have been handled automatically by Oracle, but wasn't. Following up with Oracle.

  • itrac315, which hosts the monitoring applications for Streams/OEM, is down due to a hardware failure. Availability information is still available via SLS.

  • Streams - lost the connection to ASGC, likely due to an (announced) intervention on their side - the DBA has restarted it.

  • The Oracle client on Linux modifies the FP rounding-precision setting on the FPU - very minor, probably not an issue for physics calculations, but an issue for reproducibility. A workaround is in place; raised with Oracle as a severity 1 bug to get a patch.

  • Applied the Oracle Critical Patch Update (CPU) successfully on ATLAS, and the LCG integration RAC is now back again.

Monitoring:

  • GridMap will show the ATLAS SEs as degraded, due to a problem in the SAM PI, until the old results time out (7 days). Need to follow up on the impact on the availability calculation in GridView.

Release Update:

Site Issues:

  • GRIF: down due to an upgrade of the SE to gLite 3.1.
  • RAL/LHCb: Issue at RAL with LFC and Streams following the LHCb upgrade. Solved by restarting the LFC daemon.

  • BNL: Markus is following up on isolating the VOMS jar as a single fix.
  • BNL: Meeting with BNL tomorrow to follow up on the SAM issues - a possible solution has been identified; need to work out what is feasible.

Thursday

Experiments
  • Alice: RAS
  • ATLAS: Issues to be covered in the FDR meeting later
  • LHCb: RAS
  • CMS: RAS

Core Services

  • RAS (no attendance)

Databases:

  • 3D Streams and OEM monitoring are still down. FIO is investigating the repeated hardware faults on this. Two Oracle service requests have been filed; no update yet.

Monitoring:

Release Update:

Site Issues:

  • CERN restarted the VOMS server with the new fix applied (by accident!). BNL to check if it improves things for them.

AOB:

  • For the Oracle FPU problem, we will mention it at the Operations meeting, but there will be no action for sites, since the AA will take care of this for the experiments.

Friday

Experiments
  • Alice: RAS (not attending)
  • ATLAS: RAS
  • LHCb: The 64-bit question for T1s was passed to the GDB list.
  • CMS: RAS

Core Services:

  • Finding and fixing small bugs in CASTOR - the service has been running fine. On Wednesday the tape queue subsystem was not working well for ATLAS, and working more or less for CMS.
  • Would like to upgrade LHCb during CCRC if we can schedule it. No pressure on exactly when to do it.

  • Two bugs in FTS
    • DPM/dCache srmCopy doesn't work due to an incompatibility in space-token formats (dCache accepts only an integer, the spec says string, and DPM provides a UUID).
    • FTS SRM space 'user description' bug: FTS gets the space token for a user description from the source and then tries to use it on the destination (it should get it from the destination, not the source). This affects only SRMCOPY channels in push mode.

Databases:

  • A 2-hour DB intervention downtime on the ATLAS RAC to fix the problems from earlier in the week.
  • OEM and 3D Streams monitoring are still down. Escalated to the vendor - the investigation is taking longer than expected.

Monitoring:

  • We need to start putting things into the logger to make reporting easier.

Release Update:

  • RAS

Site Issues:

  • BNL: The VOMS restart has not improved the issue. They are working on their side to implement their functionality in the "standard" gLite client.
	From: Hover, John
	Sent: Thursday, January 31, 2008 11:02 AM
	To: Ernst, Michael
	Subject: Re: CERN ATLAS VOMSAdmin interface bug
	 
	I'm afraid replication is still failing with the same "duplicate attributes" error. It must be that the problem is caused by a similar, but different, bug to that addressed by the patched Axis library. Since I'm just looking at text (XML) output from the server, there is no way to tell exactly what is creating the duplicated part of the XML response.
	 
	Since this has come up, I have learned that EGEE is distributing a VOMSAdmin Python client (written by Andrea C.). His tool is very similar to mine, but uses a different SOAP library to interact with web services (I used SOAPpy, he used something called ZSI--Zolera SOAP Infrastructure). And, of course, his doesn't do any synchronization.

	The quickest, easiest, and probably also the best long-term solution is for me to add in my high-level synchronization function to the official glite client tool, provided his ZSI-based interaction is working. We're probably talking about one day of work for me.
	 
	The best thing about this approach is that the gLite folks test Andrea's client with new versions before release. Provided they will accept my addition to the tool, it will ensure that interface problems like this don't happen again.
	 
	Cheers,
	 
	--john

  • GRIF: gLite 3.1 upgrade went OK.

AOB:

Topic revision: r6 - 2008-02-01 - JamesCasey