Latest Stuff

Date Service Summary Description C5 Worthy
21/02/2012 AI Attended AI Meeting    
21/02/2012 ITTF Preparing VoIP session for 30 March Discussed with Rodrigo Sierra.
Title may have to be changed.
 
20/02/2012   Schedule MARS Agreed with Massimo and Manuel to have it this week. Asked Massimo for a little time to prepare.
20/02/2012 CPCD Proposed to the list of presidents of the CPCD (Commission paritaire consultative de
Discipline).
Discussed with Manuel.
Would like to discuss with Helge.
 
20/02/2012 LFC bdii of prod-lfc-shared-central down. Looked at it with Manuel and Ulrich.
Trick described in https://svnweb.cern.ch/trac/gservices/ticket/17
 
20/02/2012 LCG Attended LCG Ops meeting. Replacing Alex, replacing Philippe.  
20/02/2012 CASTOR Looked into permissions for DB access. Requested by Xavier in the framework of the investigation of checkreplicas not working due to DB schema changes in CASTOR 2.1.12
20/02/2012   Back from holiday last week. Processing mail backlog.  
21/02/2012 ITTF Announcement of ITTF 24 Feb
Agile Infrastructure: Monitoring
Verified that the IT Auditorium is as bad as usual.
Mail to it-dep
 
21/02/2012 batch Updated https://twiki.cern.ch/twiki/bin/view/Batch/QueuePlanning to include the things done for the ATLAS CAT
resource rationalisation as requested by Manuel.
 
21/02/2012   Access Request renewal for the 513 computer room https://edh.cern.ch/Document/General/ACRQ/4878391  
21/02/2012 batch Answered INC106437 problem running OpenFOAM with multiple machines on -R "select[(eng)]" on AFS
21/02/2012   Started preparing my MARS    
22/02/2012 batch Discussed with Ulrich cleanup of processes and/or possible campaign of node draining + restart    
22/02/2012   Working on my MARS To happen Fri afternoon
14:30 with Manuel
around 17:00 (after Arne) with Massimo.
 
23/02/2012   Attended IT-PES-PS section meeting    
23/02/2012   Still worked on my MARS text.    
23/02/2012 batch Request from sysadmins for the retirement of the machines which are in
encra6501 enclosure (rack 65)
Followed up with Ulrich and Eric. Answered that waiting to verify draining procedures with Ricardo and Gavin.  
23/02/2012 AI Attended AI Sprint Meeting Node importance, service criticality, HW homogeneity and KVM.  
24/02/2012 ITTF ITTF session on
Agile Infrastructure: Monitoring
by Markus Schulz and Pedro Andrade  
24/02/2012   MARS interview with Manuel Mainly objectives  
27/02/2012 ITTF Contacted Tim by mail for SNOW Usage Experience ITTF session As discussed on Friday.
He appointed Bruno Lenski.
 
27/02/2012 ITTF Contacted Alistair Bland by mail for ITTF session on BE/CO computing activities. As discussed on Friday.
27/02/2012 CPCD Confirmed to Sigrid and Wisla that OK for the list of presidents of the CPCD (Commission paritaire consultative de
Discipline).
   
27/02/2012   MARS interview with Massimo Mainly results of last year.  
27/02/2012 CASTOR Made Massimo owner of castor-monitoring and castor-monitoring-admins On request of Jan Iven.  
27/02/2012 CASTOR Prompted by a PHP version question for compass02 and compass22, which are not from SWREP, pointed to the documentation for vobox fileservers https://savannah.cern.ch/task/?16223
https://savannah.cern.ch/task/?23045
https://savannah.cern.ch/task/?23222
https://savannah.cern.ch/task/?19559
 
27/02/2012 batch Closed again INC106437 after answer from Nils. Although not a batch issue, this is a hassle. I should have given the ticket to Nils or to the AFS guys from the beginning.
27/02/2012 CASTOR Fixed useraccess configuration of vona4801 and vona4802 that was affected by the problem described at the end of https://twiki.cern.ch/twiki/bin/view/ELFms/ELFmsZuulSLC5 Also told Emmanuele Leonardi that I am not dealing with these boxes anymore, so he should contact Massimo.
28/02/2012 ITTF Contacted Vito Baggiolini by mail for ITTF session on computing in BE/CO. Mentioned this to Nils who is running a coordination meeting with them. Fixed a date and created event in Indico in https://indico.cern.ch/conferenceDisplay.py?confId=180286  
28/02/2012 ITTF Proposed the date of 20 April for the session on Experiences Using Service-Now in IT. Created event in Indico in https://indico.cern.ch/conferenceDisplay.py?confId=180141  
28/02/2012   Attended PES group meeting. With presentation from Alberto Pace on Storage Strategy.
28/02/2012 LFC batch Attended PES-GT meeting. Agreed to revive LFC pps node.
Main information is that the EMI WN is not compatible with the lcg libs so should be avoided for SLC5. If we want ARGUS we should go for a newer LCG release.
 
28/02/2012 Message Brokers Answered question from Lionel about not being able to (re)install. You have to change the driver for the network interface from 'synthetic' to 'emulated' for PXE to work.

'emulated' is required for installation with PXE but 'synthetic' is
faster and better for normal operation.

You can change it with something like
/afs/cern.ch/user/v/vmmaster/bin/vmtool nic lxxxx --type=emulated

Pls note that this implies a reboot of the VM. More info in
https://twiki.cern.ch/twiki/bin/view/PESgroup/VirtualMachineCreation
 
28/02/2012 Message Brokers Lionel still complained that there is no way to check the network drivers and that he has no access to manage the VM. I am to check with Alex for the display question. Suggested to check with Manuel for the access question.
29/02/2012   Still had to work on my MARS text with input from Massimo. Fed the text back to Massimo.  
29/02/2012 ITTF Discussed with Wayne on the possible use of the IT Auditorium for the coming ITTF sessions. He thought that we should not count on it. Proposed to discuss with Tony opening the corridor doors in 513 1-024 to mitigate the risk.
01/03/2012 LFC Worked on bdii startup as I thought that its failure could affect yaim. Found that it is SELinux (configured in enforcing mode) that prevents the startup by not allowing slapd to bind to port 2170
01/03/2012 LFC Investigating https://ggus.eu/ws/ticket_info.php?ticket=78770. Looked into the queries done by lcg-infosites Trying to find out why we seem to publish wrong data for the LFC in the BDII.  
01/03/2012 LFC Looked into https://ggus.eu/ws/ticket_info.php?ticket=77026 that seems to be related to the above.    
01/03/2012 Message Brokers As suggested by Helge called a meeting about the Message Brokers to signal that I will take over their operation. Invited Pedro as monitoring person and later also Lionel and Massimo.  
01/03/2012   Discussed with Tony on using the IT Auditorium for the coming sessions of the ITTF. Tony suggested to contact Frederic to ask for his advice.  
02/03/2012 CASTOR Discussed with Massimo why a user with group def-cg does not get a CASTOR home directory. Basically you need a "resource" group that you can map to an organisational unit, and def-cg is not one.
01/03/2012   Got answer to RQF0072607 : Error. NoAccess
Opened because Massimo cannot access my MARS.
Sent Massimo a printout of the current MARS contents.
02/03/2012 AI After discussion in the morning meeting pointed Tomas to the LDAP query for the batch egroup The thingy that recursively queries LDAP for group membership and fills a sticky cache is in
https://svnweb.cern.ch/cern/wsvn/batchinter/trunk/batch/CERN-CC-LSF/scripts/loadPwentCache.py
We run this from a daily cron job.
The thingy that uses the sticky cache is in
https://svnweb.cern.ch/cern/wsvn/batchinter/trunk/batch/CERN-CC-LSF/scripts/egroup
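For reference, the recursive expansion with a sticky cache that these scripts implement can be sketched roughly like this (a minimal stand-in: `fetch_members` and the toy directory are hypothetical, and a plain dict replaces the on-disk cache that the daily cron job maintains):

```python
# Minimal sketch of recursive e-group expansion with a sticky cache.
# fetch_members() is a hypothetical stand-in for the real LDAP query;
# the actual scripts live in the batchinter SVN repository.

def fetch_members(group, directory):
    """Return the direct members of a group (users and nested groups)."""
    return directory.get(group, [])

def expand_group(group, directory, cache):
    """Recursively resolve a group to its set of user members, memoizing results."""
    if group in cache:              # sticky-cache hit: skip the LDAP round trip
        return cache[group]
    users = set()
    for member in fetch_members(group, directory):
        if member in directory:     # nested group: recurse
            users |= expand_group(member, directory, cache)
        else:                       # leaf entry: a user
            users.add(member)
    cache[group] = users
    return users

# Toy directory: a group containing a user and a nested group.
directory = {
    "batch-users": ["alice", "batch-admins"],
    "batch-admins": ["bob"],
}
cache = {}
print(sorted(expand_group("batch-users", directory, cache)))
```

The cache makes repeated lookups of the same (possibly deeply nested) group cheap, which is the point of filling it once a day from cron.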
 
02/03/2012 LFC As discussed with Manuel yesterday, submitted request RQF0073296 for a VM to test and debug LFC deployment. LFCTEST.
02/03/2012 LFC Pursued the problem of wrong info in bdii for CERN LFCs I found that the yaim function /opt/glite/yaim/functions/config_gip_lfc
provided by glite-yaim-lfc-4.1.1-1 was failing because the INSTALL_ROOT variable was not passed so it was not generating/updating the info provider in
/opt/glite/etc/gip/provider/glite-lfc-provider.

On the other hand I found that some nodes already had /opt/glite/etc/gip/provider/glite-lfc-provider generated by other means
(hand ?) with different params.

I proposed to patch myself the yaim function and generate/update the info provider in all lfc production nodes.
 
05/03/2012 ITTF Discussed with Mats a session on Service Management.
05/03/2012   On rota this week.    
05/03/2012 LFC Pursuing the problem of wrong info in bdii for CERN LFCs, investigated errors when running yaim in production LFC nodes. I realized that the voms errors, no matter how bad they look, do not seem to be fatal for yaim.

The only fatal error was actually the one due to config_gip_only not being defined anymore but still referenced from /opt/glite/yaim/node-info.d/glite-lfc_oracle.

So the voms errors not being fatal help explain why nobody cared to clean the gridgroups entries for lcgadmin in prod/components/yaim_usersconf/defaults.tpl.
 
05/03/2012 Message Brokers Meeting about the Message Brokers signaling that I start working on them. Notes from Pedro in https://twiki.cern.ch/twiki/pub/AgileInfrastructure/AgileInfraDocsMinutes/AI_monitoring_messaging_5th_March_2012.txt  
05/03/2012 CASTOR Followed up on alarm from TSM admin that a backup tape for VOHARP01 has been lost. Started an incremental backup.
Also pointed out that from now on Massimo will take care of supporting VOHARP01.
 
06/03/2012   Doing tickets for the support rota INC110495: myproxy registration request for crab3dev.cern.ch
INC110324: lfc_noread lfc_nowrite on lfcshared01
INC110471: lfcatlas01 lfc_noread
INC109285: No access to lxbsp0501 and lxbsp0502
INC110557: LSF js on lxbsu2014.cern.ch: LSF js: no AFS token
INC110845: GGUS-Ticket-ID: #79939 Ticket "please add new authorized renewer to myproxy configuration"
 
06/03/2012 ITTF Mail discussion with Tony, Wayne and Frederic on what to do with ITTF sessions when the IT Auditorium is off. Ignacio,

Unfortunately it is a little more complicated. The back door is closed as opening it causes problems for people evacuating from the other rooms along that corridor,
all the more so as there is no window in the door so you might open it into somebody.

Although somewhere bigger but less convenient is clearly the best (PS Auditorium?), I would not be against using 024 with a clear mention of the evacuation issue at
the beginning of each meeting - an (unannounced) evacuation exercise was explicitly organised during a post-C5 in the past to see how people would evacuate this room
and things were calm. Thinking further, we could have the rear doors unlocked and have someone explicitly identified as being responsible to open the door carefully
and manage evacuation via that route.

Cheers,
Tony


-----Original Message-----
From: Ignacio Reguero
Sent: 06 March 2012 09:24
To: Wayne Salter
Cc: Ignacio Reguero; Frederic Hemmer; Tony Cass
Subject: Re: IT Auditorium

Hi Wayne, Frederic and Tony,

So the IT Auditorium is actually off and I plan to move all coming ITTF sessions from the IT Auditorium to 513 1-024.

Could we do anything to mitigate the safety risk when having attendance higher than the nominal capacity of the room (open the back door)?

Should we consider other (bigger but less convenient) venues on the site ?

Thanks & cheers ...Ignacio...

On Mon, 5 Mar 2012, Wayne Salter wrote:

> Hi Ignacio,
>
> The official situation (which we learnt from GS on Friday afternoon) is
> that there is no heating, cooling or air renewal in the amphitheatre
> until July at the earliest. Hence, it is not possible to use this room
> for large meetings. It is debatable whether small meeting could be
> held under these circumstances but in any case other rooms for smaller meetings exist.

 
07/03/2012 LFC Pursuing the LFC bdii problem. Answered GGUS tickets 78770 and 77026 corresponding to INC111102: BDII and INC099445 : CERN LFC nodes are badly published in the BDII
06/03/2012 LCG Ops meeting Followed up report from CMS of flakiness in SLS status display for lxbatch. Monitoring problem found and solved by Steve and Gavin. Due to overload of lxplus
07/03/2012   Doing tickets for the support rota INC111042: Cannot kill hanging jobs
INC110874 : lfcatlas03 lfc_nowrite lfc_noread
INC110487: no space on execution hosts
 
07/03/2012 ITTF Announced session on Storage Strategy for next Friday. Prompted by Rainer, checked that bld 30 auditorium not available.
Also checking with the Vidyo people.
 
08/03/2012   Doing tickets for the support rota INC111319: Problem with lxbatch
Follow on to INC109285 : No access to lxbsp0501 and lxbsp0502
 
08/03/2012   Pointed out that BitTorrent for ALICE is a known legitimate case.
Followed up SPAM from Computer Security people.
LXBSU1504: Policy violation detected  
08/03/2012   Ticket Review + short e-group/ldap resolution discussion with Alex Described in detail what we did for the LSF egroup.  
08/03/2012   Attended section meeting. https://indico.cern.ch/getFile.py/access?resId=minutes&materialId=minutes&confId=180534  
08/03/2012 batch Attended meeting to discuss batch urgent stuff.    
09/03/2012   Again, pointed out that BitTorrent for ALICE is a known legitimate case.
Followed up SPAM from Computer Security people.
LXBSU1345: Policy violation detected  
09/03/2012 ITTF IT Technical Forum about “Storage Strategy” Announcements.
Discussion with Tim Smith on Vidyo setup.
 
09/03/2012   Doing tickets for the support rota Follow on INC111291: nfsnobody user and group id's  
12/03/2012   Doing VM creation tickets for the support rota of the week before RQF0073296: Test machine for LFC deployment
RQF0073305: VMs for Jira cluster
RQF0076469: VM for tendering document project
Had to deal with errors due to inconsistent configuration for useraccess in boinc cluster + zillions of
[ERROR] cannot release lock file: /var/lock/quattor/ncm-ncd
Also had to deal with PrepareInstall failing with SINDES error in jira cluster due to missing files.
 
13/03/2012 batch Created usertest/batch namespace as well as 'stages/usertest/batch' template. To provide a standard CDB area for batch tests.  
13/03/2012 Message Brokers Three mails trying to find out how to report the problem in hwcollect seen by Lionel on the Message Brokers.
13/03/2012 batch Ulrich kindly created an updated version of the script to get the RPM list for the glite release that uses a new URL.
The script is in /afs/cern.ch/group/c3/tools/bin/CDB_create-glite-templates.new.
He ran the script to generate prod/cluster/lxbatch/glite_3_2_5-1_glite-glexec_wn.tpl and prod/cluster/lxbatch/glite_3_2_12-1_glite-wn.tpl i.e. for the versions
recommended in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
 
14/03/2012 batch I made usertest/batch/cluster/lxbatch/glite-GLEXEC-wn-x86_64.tpl pointing to the baseline versions and modified profiles/profile_vm64slc5test.tpl as a "normal" batch worker
in 'usertest/batch'
It committed OK;
however, SPMA gave 27 dependency problems.
 
14/03/2012 ITTF Sent reminder for Stefan Lueders session. I would like to invite you to the IT Technical Forum about “Squaring the Circle: Reflections on Identities, Authentication & Authorization at CERN”  
15/03/2012 batch Worked on BI-550: Upgrade batch to gLite-WN 3.2.5-1
until we agreed to leave the worker node at 3.2.1 for now,
once the clash of glite and EMI rpms in SWREP was confirmed.
You can confirm the clash by doing something like the following and comparing the outputs.

# rpm -q --requires -p http://glitesoft.cern.ch/EGEE/gLite/R3.2/glite-WN/sl5/x86_64/RPMS.updates/lcgdm-libs-1.8.2-3sec.sl5.x86_64.rpm
# rpm -q --requires -p http://swrep/swrep/x86_64_slc5/lcgdm-libs-1.8.2-3sec.sl5.x86_64.rpm

# rpm -q --requires -p http://swrep/swrep/x86_64_slc5/CGSI_gSOAP_2.7-1.3.4-2.sl5.x86_64.rpm
# rpm -q --requires -p http://glitesoft.cern.ch/EGEE/gLite/R3.2/glite-WN/sl5/x86_64/RPMS.updates/CGSI_gSOAP_2.7-1.3.4-2.sl5.x86_64.rpm
 
15/03/2012 ITTF Discussion with Tim Smith on Vidyo setup for ITTF. Also installed Vidyo Android client.  
15/03/2012 batch For BI-549: usertest/batch Quattor area for small-scale tests, across multiple host types moved to usertest/batch 1 VM: vm64slc5test and 2 real machines: lxbst1338(vi_10_21) and lxbsu1523(e4_10_20).  
15/03/2012 batch (BI-544) Upgrade batch to SLC5.8 once its available Tried ELFMS_OSDATE in usertest/batch to be current latest 20120309.
Found errors in commit that Steve sorted out.
 
15/03/2012 batch Followed up on the 'Cannot kill hanging jobs' thread triggered by
INC111042.
Let me add something: in this case the jobs were really stuck, so kill -9 on the batch node would not do anything to
them. According to the man page:

If the job cannot be killed, use bkill -r to
remove the job from the LSF system without waiting
for the job to terminate, and free the resources
of the job.

So we would leave the stuck processes behind which is not what we want.

The trick to kill the processes, pointed out by Ricardo, is to try to attach to them with strace. I think that somehow
the signals from strace (SIGTRAP and maybe SIGSTOP) allow the SIGKILL to be processed.

If the problem recurs, we would rather need to understand better how the processes managed to get into this stuck
state and how to release them.
 
15/03/2012   I realized that removing it-dep-fio-smod-alarm is NOT OK because it is included by all the -operator-alarm
egroups so that SMSs are sent to the people there which includes both CASTOR and Batch people. I mean things like
atlas-operator-alarm@cern.ch.
This works because the it-dep-fio-smod-alarm members have the format @mail2sms.cern.ch to use
the SMS gateway.
 
16/03/2012 ITTF IT Technical Forum session titled “Squaring the Circle: Reflections on
Identities, Authentication & Authorization at CERN”.
The speaker is Stefan Lueders.
Announcement, Vidyo setup with Tim, etc.  
16/03/2012 Message Brokers Opened INC114036 : /usr/bin/hwcollect crashes often in SLC6 hwcollect crashes reported by Lionel.  
19/03/2012 Took day off. Kid sick.    
19/03/2012 batch Answered a couple of support questions in the saga of
Job 225473233: Exited
   
19/03/2012 batch Worked on (BI-544) Upgrade batch to SLC5.8 once its available I tried to set ELFMS_OSDATE in usertest/batch to the current latest, 20120316.
The commit worked but SPMA failed with
depcheck: package nfs-utils-lib 1.0.8-7.9.el5 needs nfs-utils >= 1.0.9-45
This is because nfs-utils and nfs-utils-lib are included by prod/os/x86_64_slc5/rpms/20120316/base.tpl
but nfs-utils is deleted in prod/cluster/lxbatch/config.tpl.
Either both nfs-utils and nfs-utils-lib should be deleted, or neither.
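The breakage can be pictured with a toy dependency check (package names from the entry above; the logic is a drastic simplification of what SPMA's depcheck does):

```python
# Toy sketch of the depcheck failure mode above: deleting a package
# (nfs-utils) while keeping another that requires it (nfs-utils-lib)
# leaves a dangling dependency.

def missing_deps(installed, requires):
    """Return (package, dependency) pairs where a required package is absent."""
    return [(pkg, dep)
            for pkg in sorted(installed)
            for dep in requires.get(pkg, [])
            if dep not in installed]

requires = {"nfs-utils-lib": ["nfs-utils"]}
installed = {"nfs-utils-lib"}   # nfs-utils deleted in the cluster config
print(missing_deps(installed, requires))
```

Adding nfs-utils back (or dropping nfs-utils-lib too) makes the check come back empty, which is the "both or neither" conclusion above.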
 
20/03/2012 batch Prompted by Gavin, checked that LSF masters OK if we remove NFS RPMs from the batch workers. I checked and the autofs RPM does not depend on the NFS ones, however you do need the NFS stuff to do the mounts
when you use automount with NFS as we do on the masters:
[root@lxmaster20 ~]# mount |grep nfs
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
caenas01:/vol/LSF/batch/work on /usr/local/lsf/nfs/work type nfs (rw,fstype,nfs3,hard,addr=137.138.144.196)

In any case we seem OK for the masters as I see that they do not include prod/cluster/lxbatch/config.tpl so they
get the NFS rpms from the os rpms base template (prod/os/x86_64_slc5/rpms/20111111/base.tpl).
 
20/03/2012 batch (BI-544) Upgrade batch to SLC5.8 once its available: test in usertest/castor First applied the (NFS free) update to the machines in usertest/batch.
 
20/03/2012 batch (BI-544) Upgrade batch to SLC5.8 once its available: announcement Submitted announcement to IT Service Status Board in
https://itssb.web.cern.ch/service-change/upgrade-system-rpms-lxbatch-nodes/20-03-2012
 
20/03/2012 batch (BI-544) Upgrade batch to SLC5.8 once its available: changed config of preprod nodes <cdbop@cdbserv.cern.ch: ~/cdbfiles> commit
[INFO] '/preprod/cluster/lxbatch/osdateversion': will be updated
[INFO] '/prod/cluster/lxbatch/config': will be updated
please confirm [yes]:
Last comment: Remove WN update stuff
Press [Enter] to confirm the last comment or enter a new one.
Comment: Update ELFMS_OSDATE to "20120316" in lxbatch preprod
[INFO] please wait...
[INFO] commit OK
 
20/03/2012 batch (BI-544) Upgrade batch to SLC5.8 once its available: deployed with spma after testing on a couple of nodes. Did nc-client --cluster lxbatch --stage preprod --tag spma
The spma went OK.
 
20/03/2012 batch Discussed system level being deployed with Linux Supporters >
> We are deploying ELFMS_OSDATE "20120316" to preprod for batch.
> Does this correspond to SLC5.8 ?

Yes, it should: integrated release was prepared 15th of March,
so 16th snapshot is at least as up to date as it (in reality
it is little bit more up to date since packages for
integrated 5.8 were gathered ~ 10th of March: but this
shall not matter)

Cheers

Jarek
 
20/03/2012 ITTF IT Technical Forum about “New Computing
Centre” next Friday
Sent announcement.  
20/03/2012 Message Brokers Answered question from Lionel about sysctl -p giving the following error
error: "net.bridge.bridge-nf-call-ip6tables" is an unknown key
error: "net.bridge.bridge-nf-call-iptables" is an unknown key
error: "net.bridge.bridge-nf-call-arptables" is an unknown key
in the SLC6 machines in mb/agileinf.
The problem is described in
https://bugzilla.redhat.com/show_bug.cgi?id=512206
and
https://bugzilla.redhat.com/show_bug.cgi?id=639821

To summarize
- These entries in /etc/sysctl.conf prevent bridged traffic getting pushed through the host's iptables rules.
- The keys only become valid after "bridge.ko" is "insmoded". (Bridges seem to be instantiated by libvirt on
demand).
- The official solution is to use the '-e' option of sysctl to ignore errors about unknown keys.
- If the machine does not need to bridge we could get rid of the entries in /etc/sysctl.conf.

As a matter of fact /etc/sysctl.conf comes initially from
initscripts-9.03.27-1.el6_2.1.x86_64. It is controlled by the sysctl NCM component; however, the component just
applies the keys defined in CDB to the current /etc/sysctl.conf, which is taken as a template.
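To make the '-e' behaviour concrete, here is a small sketch (a dict stands in for the /proc/sys tree; the key names are just the ones from the errors above):

```python
# Sketch of sysctl's '-e' semantics: apply the keys that exist and skip
# the unknown ones (e.g. the bridge-nf keys before bridge.ko is loaded)
# instead of failing. A dict stands in for the /proc/sys tree.

def apply_sysctl(conf_lines, proc_sys, ignore_unknown=True):
    """Apply 'key = value' lines; with ignore_unknown, skip missing keys."""
    skipped = []
    for line in conf_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key not in proc_sys:
            if ignore_unknown:      # what 'sysctl -e' does
                skipped.append(key)
                continue
            raise KeyError(key)     # what plain 'sysctl -p' complains about
        proc_sys[key] = value
    return skipped

proc_sys = {"net.ipv4.ip_forward": "0"}
conf = [
    "net.ipv4.ip_forward = 1",
    "net.bridge.bridge-nf-call-iptables = 0",  # unknown until bridge.ko loads
]
print(apply_sysctl(conf, proc_sys))
```

The known key gets applied, the bridge key is skipped rather than aborting the run, which mirrors why '-e' is the official workaround in the Red Hat bugs.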
 
21/03/2012 LFC Solved INC113177: Problem on the machine lfclhcbrw02 While the BDII was running fine (I had fixed it not long ago), the lemon-agent was not working properly:

[root@lfclhcbrw02 ~]# lemon-host-check --show-all
[INFO] lemon-host-check version 1.3.6 started by root at Wed Mar 21 11:24:43 2012 on lfclhcbrw02.cern.ch
[ERROR] The following required sensors are not running:
[ERROR] gridlfc

The gridlfc lemon sensor was crashing due to a wrong /etc/ld.so.conf that was preventing the LFC client commands from working. Fixed it.
 
21/03/2012 Message Brokers Answered Lionel questions about CERN-CC-hardwaretools and abrt problems Pointed out that the abrt problems in SLC6 are because dumps from unsigned RPMs are rejected unless you set 'OpenGPGCheck =
no' in /etc/abrt/abrt.conf.

We could set it in the mb/agileinf boxes and see how it goes.

More details in https://bugzilla.redhat.com/show_bug.cgi?id=699152.
 
21/03/2012 batch contributed to (BI-562) Disable netlog on batch for 5.8 upgrade I saw that 'services/netlog/config' is included for non SLC6 nodes from both prod/cluster/lxplus/config.tpl and
prod/cluster/lxbatch/config.tpl so what Steve put into preprod/services/netlog/config.tpl should be OK for lxbatch
as well.

We have to note that for this to go in place we have to do an ncm-ncd run or at least run 'ncm_wrapper.sh
filecopy'.
As we did not do it yesterday for the lxbatch preprod nodes we should eventually do it.
 
21/03/2012 CASTOR Chat with Luca from the CASTOR/EOS team about diskPoolDump features and tricks.    
22/03/2012 Message Brokers Read the instructions for Lionel's machinery to generate the config through RPMs. In /afs/cern.ch/project/tom/mbcg/README  
22/03/2012   Attended IT-PES-PS section meeting.    

27/03/2012 AI Helped Tomas on questions about AFS ACLs. for gitolite  
27/03/2012 AI Got help from Tomas on questions about Puppet Apply. Module definition not enough in puppet apply. Need to call the module. This seems to be not required for runs with Foreman.  
29/03/2012 AI Attended discussion on scheduling aspects for batch and pilot use cases Including an economic approach to be used for all services. Also discussion about scalability and pros/cons of fat nodes.
02/05/2012 AI Attended AI ticket review.    
24/04/2012 AI + LB Attended SCRUM meeting. Discussion in the meeting and over coffee on doing a specific LB server for AI.
Also proposed to check ZooKeeper.
Agreed with Gavin that I'll check with neteng and move forward with this.
 
24/04/2012 AI + LB Mail to neteng on the establishment of an LB server for AI. Date: Tue, 24 Apr 2012 15:01:33 +0200
From: Ignacio Reguero <reguero@mail.cern.ch>
To: neteng@cern.ch
Cc: Gavin Mccance <Gavin.Mccance@cern.ch>, Juan Manuel Guijarro <Juan.Manuel.Guijarro@cern.ch>,
it-support-dns-loadbalancing@cern.ch, Jose Carlos Luna Duran <Jose.Carlos.Luna@cern.ch>
Subject: New LB server for Agile Infrastructure

Dear Neteng,

We would like to make a new Puppet-managed LB server to run in parallel with the current
Quattor-managed lbmaster.
Do you see any problem with this?

We need it in order to get experience running lxplus-like services with Puppet. We may also want
to change the technology used for the LB server. Currently looking into things like Apache
ZooKeeper.
Do you have other suggestions in this area?

Ultimately we would also like to look into interfaces to automate configuration changes, with
something like Apache libcloud: http://libcloud.apache.org/docs/load-balancer-base-api.html
For this question we would have to work with you as LB Alias creation currently requires manual
intervention by the neteng.


Thanks & best regards
...Ignacio...
 
16/04/2012 Back from hols Cleaning mail backlog
22/03/2012 batch Attended meeting about MPI (multi-core jobs on lxbatch) questions. with Frank Schmidt and Cedric Hernalsteens  
22/03/2012 batch Added C5 entry for the RPM upgrades of lxbatch and lxplus. In https://twiki.cern.ch/twiki/bin/viewauth/PESgroup/ItPesServiceReports  
03/04/2012 batch Followed up failed installation of lxbsq1430 in SLC6 due to a libgcc problem. As suggested by Steve, set the latest
variable ELFMS_OSDATE = "20120330";
in profiles/profile_lxbsq1430.tpl and reinstalling.
 
24/04/2012 Batch Asked Ulrich to look into INC123467 about job in VM stuck out of memory vmbst051605
25/04/2012 Batch Answered RQF0090907.
The jobs were hitting the queue runtime limit. Still they seem to come back with rc 255, which is not consistent with getting the SIGXCPU signal, which should give code 152.
Using tools and accounting DB from Jerome and Ricardo which are cool.  
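The inconsistency can be spelled out with the usual shell convention (a rough sketch of the reasoning, not LSF's actual bookkeeping): a job killed by a signal reports 128 plus the signal number, so SIGXCPU (signal 24 on Linux) gives 152, while rc 255 would imply a nonexistent signal 127.

```python
# Sketch of the exit-code reasoning: shell-style return codes above 128
# encode "killed by signal (rc - 128)". SIGXCPU is signal 24 on Linux,
# so a job killed on the CPU-time limit should report 128 + 24 = 152.
import signal

def decode_rc(rc):
    """Interpret a shell-style return code as (kind, number)."""
    if rc > 128:
        return ("signal", rc - 128)
    return ("exit", rc)

print(int(signal.SIGXCPU))              # 24 on Linux
print(decode_rc(128 + signal.SIGXCPU))  # what hitting the limit should give
print(decode_rc(255))                   # would mean "signal 127", which does not exist
```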
26/04/2012 Batch Still following up RQF0090907. Checking whether he has an LD_LIBRARY_PATH problem.
Checking with jobs.py I see that he is still submitting to 8nm
python jobs.py -v 100 -j 239419634,239419660
 
26/04/2012 Batch For tickets INC124300 and INC124432
Resetting VMs with https://twiki.cern.ch/twiki/bin/view/Batch/VMBatchTerminateInstance
VMs out of swap + SPAM problem
27/04/2012 Batch tickets INC124662 and INC124550 VMs out of swap + SPAM problem cleanup using /afs/cern.ch/group/c3/tools/bin/vmcancel  
30/03/2012 batch and CASTOR Answered questions on RQF0081473 about LSFWeb for CASTOR from Ulrich, as the current CASTOR people are not aware. I showed Luca that the cron job
32 * * * * c2adm /afs/cern.ch/user/s/stage/acron/castorsvcacl.py --safemode --quiet castoratlas/atlt3
allows the users in LSF group u_ATLASCAT in the pool ACLs of castoratlas/atlt3. This is the last entry of this kind left. I guess that castoratlas/atlt3 should eventually be migrated to EOS. I let Luca follow that up.
Yes
16/04/2012 C3 admin Added isteers from Procurement team as well as toulevey (and ouleveyt) from Linux team to C3 group.    
05/04/2012 C5 Did C5 report for the work done on prod-lfc-atlas the day before.   Yes
16/04/2012 CASTOR Gave some input to RAL on a question about slots while switching from LSF to TM In the castor-operation-external@cern.ch mailing list
19/04/2012 CASTOR Added Xavier's phone number to it-dep-fio-smod-alarm e-group    
23/03/2012 ITTF ITTF session on new computing centre by Wayne Salter New Computing Centre  
23/03/2012 ITTF Confidentiality question with the slides so they had to be removed while verification takes place. New Computing Centre  
30/03/2012 ITTF ITTF session on VoIP new pilot:
Unified communications: Lync becomes your desk phone
   
30/03/2012 ITTF scheduled session 15 Jun - New IT Web Site. As requested by Tim and announced by Ian Bird.  
30/03/2012 ITTF scheduled session 22 Jun - Agile Infrastructure Project Update. As suggested by Wayne and agreed by Bernd.
05/04/2012 ITTF scheduled session 1 Jun - IPV6 by Edoardo Martelli. As discussed with Jose Carlos Luna.  
16/04/2012 ITTF Scheduled session on Linux Kernel Development by Panos Sakkos on 1st June. https://indico.cern.ch/conferenceDisplay.py?confId=186923  
16/04/2012 ITTF Feedback to Maite on Experiences Using Service-Now in IT. Also replaced Bruno Lenski with Emmanuel Ormancey.  
16/04/2012 ITTF Follow up on session about SCRUM (Agile Methods). As suggested by Miguel proposing coffee with CS people.  
17/04/2012 ITTF Sent reminder of ITTF session “Experiences
Using Service-Now in IT” next Friday 20 April at 10:00 in 513 1-024.

The speakers are Maite Barroso Lopez (IT/PES), Massimo Lamanna (IT/DSS) and
Emmanuel Ormancey (IT/OIS).
 
18/04/2012 ITTF Jamie came asking to organise a Post-CHEP session for beginning of June. Moved IPV6 session to end of June.
Suggested to contact Miguel in order to make it a joint Computing Seminar - ITTF.
 
20/04/2012 ITTF Sent reminder and organized session IT Technical forum today: Experiences Using Service-Now in IT  
20/04/2012 ITTF Meeting with Vito Baggiolini on the session 'Computing in BE/CO' Moved the session to 4 May as requested by some prominent people attending HEPiX next week.  
20/04/2012 ITTF Scheduled ITTF session for 6 July:
CERN mobile web site, and how it was implemented with jQuery Mobile
by Sebastian Lopienski  
23/04/2012 ITTF Answered Veronique, who was asking for the minutes/action list of the last session.
01/11/2007 ITTF Sent announcement for ITTF session on "Computing in BE/CO" by Vito Baggiolini. Put abstract from Vito there.
23/03/2012 LB Discussion with Manuel on work that needs to be done on the LB to fulfill requests from CMS and others. To present status in IT-CMS coord. meeting on 18th April.  
01/04/2012 LB Answered ticket INC117695 from Jan Iven. The problem was that eosatlas and eoscms were not defined as external.  
17/04/2012 LB Got hold of lbserver source (actually Perl stuff) from svn co svn+ssh://svn.cern.ch/reps/ELFms/loadbalancing Started having a look at it to fulfill request from CMS about behaviour when no servers available.  
17/04/2012 LB Looking how to fulfill the tickets opened by CMS and ATLAS for the LB. They are
https://savannah.cern.ch/bugs/index.php?89933
https://savannah.cern.ch/bugs/index.php?89925
https://savannah.cern.ch/bugs/index.php?89901
 
17/04/2012 LB Updated tickets with current status https://savannah.cern.ch/bugs/index.php?89933
https://savannah.cern.ch/bugs/index.php?89925
https://savannah.cern.ch/bugs/index.php?89901
 
18/04/2012 LB Attended CMS meeting Discussed with Jorge what they actually need for the configuration update.
Discussed with Nick, who will follow up how to implement a workflow for the request generated by the record producer.
 
19/04/2012 LB Looked into loadbalancing component and its CDBsql query. Learnt how to do query in lxadm
$ CDBHosts --cl all --query "get_value(hostname,'/software/components/loadbalancing/clustername') is not null" --data="hostname,get_value(hostname,'/software/components/loadbalancing/clustername')"
 
19/04/2012 LB After discussion with Steve and Gavin, decided to change to hourly lbd configuration reloads, as suggested by Manuel a while ago. Updated ticket
https://savannah.cern.ch/bugs/index.php?89933
 
19/04/2012 LB Did the change in lbmaster:
[root@lxservb01 etc]# mv cron.daily/configure-loadbalancing cron.hourly

and also committed it in SVN on top of the CERN-CC-lbd-2.1-5 RPM.
[reguero@lxadm11 trunk]$ svn commit --message "move configure-loadbalancing from cron.daily to cron.hourly"
Sending trunk/lbd.spec
Transmitting file data .
Committed revision 661.
 
19/04/2012 LB Discussed with Nick how to implement new LB alias creation in SNOW.
Went through Hardware Request workflow but not totally automated as we would like.
Nick is going to follow-up with Mats for this question.  
20/04/2012 LB Set atlasddm-mb to become external, as requested by Lionel; long overdue.  
23/04/2012 LB Following up token problem with slsmon job in lxservb06. Seems to be a cache corruption problem.
Steve has the password.
 
23/04/2012 LB Answered comment from Mats on Subject: Re: Question concerning a request: RQF0090008 : Possibility to have tasks within Requests
that can be assigned to different FEs
a) We cannot wait for long. We have had both ATLAS and CMS complaining about this for a while.

b) This is clearly not a Change Management process.
We just have requests (to define an alias on the LB server) that have a prerequisite request on
the CS people (define the LB alias on the DNS server) but they do not have to approve anything.

I think that Nick just thought of using approvals as a trick to implement the workflow in Snow.

Thanks for looking into this.
Cheers ...Ignacio...

On Mon, 23 Apr 2012, CERN Service Desk wrote:

> Dear Nick Ziogas,
>
> More information is needed to work on the resolution of the request RQF0090008.
>
> Question(s) asked:
>
> ________________
> 23-04-2012 16:50:09 CEST - Mats Moller
> Additional comments (Customer View)
> Sounds very much like Change Management. Can we wait until we have CM in place?
> Regards,
> Mats
>
 
24/04/2012 LB Attended meeting with Mats, Nick and Beutel on Snow record producer + workflow to support new LB alias requests. Following up on RQF0090008  
24/04/2012 LB Answered ticket RQF0090222 configure-loadbalancing cronjob mails Also replied directly to Ben and Veronique.  
25/04/2012 LB FYI I had an informal meeting with David Gutierrez Rueda for LB server questions. David explained that each LB alias defines a subtree within the DNS DB that
we can then modify at will from the LB server.

He sees no problem in having an LB server for the Agile Infrastructure
as long as it manages different aliases than the current one.

He does not see any problem either on trying other ways to implement the
LB server.

He said however that they are not ready for something like the Apache
libcloud interface because the action of adding new LB aliases requires an
update on the base cern.ch domain that is delicate security-wise. This same
security concern is what prevented an automated workflow with the form to
request new aliases. I mentioned to him that I am working with the Snow
people to implement the Form and the workflow in Snow.

I also mentioned that I am working on changing the behaviour when no hosts
are pingable in the set and asked about a testing facility. He said
that although we could test on an ad-hoc LB alias on the production LB
server, if we want to be completely isolated from production, he can
provide a different key for "testing" similar to the ones used for
"internal" and "external".
 
25/04/2012 LB As discussed with Steve, I have checked what we use as the protocol to update the alias info in DNS. As can be seen in update_dns() in ./LBCluster.pm, we actually use the DNS protocol, using the Net::DNS
Perl module to handle the interaction. Security is provided by TSIG HMAC-MD5 keys that sign the
update packets. These are the itfio-internal and itfio-external keys mentioned by David.
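For illustration, a TSIG-signed update like the one update_dns() sends can be reproduced with the standard BIND nsupdate tool; the zone, alias, TTL and addresses below are made-up placeholders, not the real cern.ch setup:

```shell
# Sketch of one update_dns() refresh cycle as an nsupdate batch
# (alias name, TTL and addresses are placeholders; the real server
# signs with the itfio-internal/external TSIG keys, which nsupdate
# would take via -k or a 'key' statement).
cat <<'EOF' > /tmp/lb-update.txt
update delete lbalias.example.ch. A
update add lbalias.example.ch. 60 A 192.0.2.10
update add lbalias.example.ch. 60 A 192.0.2.11
send
EOF
# To actually send it (not run here): nsupdate -k /path/to/tsig.key /tmp/lb-update.txt
cat /tmp/lb-update.txt
```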
 
23/04/2012 LDAP Continued debugging of INC094014 about "nss_ldap on SLC5: cannot get more than 1500
entries of a given item (group), needs "ranged results"
Retried test case with additional ldap.conf parameters:
timelimit 0
sizelimit 0
 
02/04/2012 LFC Following up on ticket https://ggus.eu/ws/ticket_info.php?ticket=80780 from LHCB verified versions in prod-lfc-lhcb-central.cern.ch.
It is LFC-server-oracle-1.8.2-0sec.sl5.
However found that the glite-LFC_oracle RPM is
1.8.0-1. This seems wrong. Is there any reason for that?
In any case glite-LFC_oracle only updates some files with version numbers in /opt/glite/release/glite-LFC_oracle.
I do not know whether this is used anywhere...
 
03/04/2012 LFC Looking into SLS for LHC LFCs on request of Stefan Roiser Reported that LFC_ATLAS and LFC_SHARE have the same gap, i.e. they stopped displaying at the same times as LFC_LHCB in SLS. I guess
some infrastructure that is common to the LFC SLS probes must have been in trouble. Still do not understand the details though.
 
03/04/2012 LFC INC118478 from ATLAS. Investigated whether they are hitting the max number of thread limits. I can see in the log that 0.79% of the requests are running with all threads full in 2 of the machines (41356 out of 5214849 in lfcatlas01 and 37623 out of 4758510 in lfcatlas03 in the current log), which indicates that a number of requests have actually been rejected (pls. note that we do not see the rejected ones in the log).
The other machine in the alias (lfcatlas02) does not show as much load. Do the clients dislike it? We would avoid this problem if the load were spread over the three nodes.

We are currently running with NB_THREADS=80 and I heard from Philippe Defert that previous attempts to go above 90 triggered other problems.
I also see that the problem seems to have gone down this morning. (See below the number of requests per hour running with all threads full in the current log of the 2 most loaded servers.)
Could you please confirm whether you still would like an increase of the number of threads?

Best regards ....Ignacio...


[root@lfcatlas01 lfc]# cat /var/log/lfc/log | tr ',' ' ' | awk '{if ($4 == 79) {print} }' | awk -f /afs/cern.ch/user/r/reguero/public/histolog.awk field=2 width=2
04 - 5707: *****
05 - 6827: ******
06 - 7216: *******
07 - 6348: ******
08 - 5299: *****
09 - 8840: ********
10 - 981: *
11 - 74: *
12 - 85: *
Total counts = 41377

[root@lfcatlas03 ~]# cat /var/log/lfc/log | tr ',' ' ' | awk '{if ($4 == 79) {print} }' | awk -f /afs/cern.ch/user/r/reguero/public/histolog.awk field=2 width=2
04 - 5271: ******
05 - 5751: *****
06 - 6786: ******
07 - 6230: ******
08 - 7106: *******
09 - 4840: ****
10 - 1468: *
11 - 60: *
12 - 111: *
Total counts = 37623
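As a sanity check, the 0.79% saturation figure quoted above can be recomputed directly from the raw counts (41356 full-thread requests out of 5214849 on lfcatlas01):

```shell
# Recompute the share of requests that ran with all threads busy
# on lfcatlas01, from the log counts quoted in the ticket text above.
awk 'BEGIN { printf "%.2f%%\n", 100 * 41356 / 5214849 }'
# prints: 0.79%
```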
 
04/04/2012 LFC Still working on INC118478.

Deployed new node lfcatlas04 in production and as agreed after the lcg ops meeting changed best_hosts to 3 for prod-lfc-atlas
Deployed change to NB_THREADS=90 with Philippe.
Philippe asked J.P. Baud whether a thread increase is still possible. He said that it is possible up to 99.
Created 4th node lfcatlas04 for prod-lfc-atlas.cern.ch.
Asked the DBAs (Eva Dafonte and colleagues) to check whether the DB session configuration can support the foreseen increase of threads.
Also made HW request RQF0084661 for the new box, mentioning that it was already created.
 
05/04/2012 LFC Attended LCG Ops meeting to get feedback from ATLAS. Situation looks better now.  
16/04/2012 LFC Reported to Maite on the LFC for the IT problem management meeting - agenda for 17-APR-12: I realize that it is a hassle but there is no progress yet.

Please note that lately we have been busy with 2 more urgent
problems:
- INC099445 CERN LFC nodes are badly published in the BDII.
- INC118478 Slow LFC registration for prod-lfc-atlas.

LFC lemon monitoring improvement is tracked in
https://svnweb.cern.ch/trac/gservices/ticket/14
 
16/04/2012 LFC In the context of INC118478, Jean-Philippe has told us that he gave an LFC client to Simone over 6 months ago to solve the load balancing "unbalance" but Simone was reluctant to deploy it. Following this up with Alessandro and Cedric from ATLAS.  
17/04/2012 LFC Follow up both with Jean-Philippe and the ATLAS people (Alessandro and Simone) of the LFC thread utilization problem visible in INC118478. Jean-Philippe offered to help debug the problem.  
17/04/2012 LFC Tried a bit of LFC log parsing looking for long opened/not closed sessions using my /afs/cern.ch/user/r/reguero/public/parseLFCsess.pl Did not find sessions left open in the current log of lfcatlas04. Found a limited number of sessions that lasted over 100 secs. So it seems not really significant.  
18/04/2012 LFC Verified that figures quoted by Anthony for LFC_NOREAD alarms in RQF0088684 are indeed correct. [reguero@lxadm11 ~]$ lemon-cli -m 30082 -n '*' -s --start 5w |grep remote|egrep -v 'lfcatlasmig1|lfcshared1|lfcmerg|lfclhcbrw02' |awk '{if ($10 == 1) {print} }' |wc -l
111
 
19/04/2012 LFC Had to look into urgent problem in lfcatlas02.
The problem was due to lfcdaemon crashes.
On request of Alessandro I put lfcatlas02 in maintenance so that it does not get into the LB alias.
After preliminary debugging (the backtraces look the same in all cases,
in strchr() called from Cns_chkaclperm), I created ticket INC122340 to report the problem to the LFC developers.
 
20/04/2012 LFC Following up on the INC122340 ticket, I restarted lfcdaemon in lfcatlas02 and put the node back in production so it is eligible for the
prod-lfc-atlas LB alias. I verified that it is already being picked up.
[root@lfcatlas02 ~]# /etc/rc.d/init.d/lfcdaemon status
lfcdaemon dead but pid file exists [FAILED]
[root@lfcatlas02 ~]# /etc/rc.d/init.d/lfcdaemon start
Starting lfcdaemon: [ OK ]
[root@lfcatlas02 ~]# /etc/rc.d/init.d/lfcdaemon status
lfcdaemon (pid 5877) is running... [ OK ]

[reguero@lxadm11 ~]$ sms clear maint other lfcatlas02

[INFO] Authenticating as user `reguero'
waiting for response...
lfcatlas02: OK, state is now production (default state)
[reguero@lxadm11 ~]$

[reguero@lxadm11 ~]$ host prod-lfc-atlas
prod-lfc-atlas.cern.ch has address 128.142.194.103
prod-lfc-atlas.cern.ch has address 128.142.196.43
prod-lfc-atlas.cern.ch has address 137.138.162.158
 
23/04/2012 LFC Follow up LFC problem that looks like an LFC client misconfiguration. Request from Giacomo and Maaike Limper.
Pointed to the LFC UI on AFS:
$ source /afs/cern.ch/project/gd/LCG-share/sl5/etc/profile.d/grid-env.sh
 
23/04/2012 LFC Change of TSM server from "TSM91" to "TSM512".
Backup seems to work OK; however, I keep getting the 'TSM failed backup notification for user lfc-operations' mails.
Question about tsmserver "hostname.BACKUP.CERN.CH"  
24/04/2012 LFC Still helped Maaike Limper and Giacomo with the (python) configuration for the LFC on SLC5. Read the F. script /afs/cern.ch/atlas/offline/external/GRID/ddm/DQ2Clients/setup.sh
After several iterations, solved both a python and a firewall problem.
 
25/04/2012 LFC As suggested by Alex, changed
"/software/components/cron/entries" = list(nlist(
to
"/software/components/cron/entries" = push(nlist(
in prod/cluster/gridlfc/config.tpl
to avoid overwriting the cron entry to restart ntpd in VMs
<cdbop@cdbserv.cern.ch: ~/cdbfiles> commit
[INFO] '/prod/cluster/gridlfc/config': will be updated
please confirm [yes]:
Comment: Do not overwrite cron entries to fix ntpd in VMs
[INFO] please wait...
[INFO] commit OK
<cdbop@cdbserv.cern.ch: ~/cdbfiles>

After testing in lfcatlas04 also did
[reguero@lxadm10 ~]$ wassh -l root -c gridlfc 'ncm_wrapper.sh cron'
 
25/04/2012 LFC Submitted request https://cern.service-now.com/service-portal/view-request.do?n=RQF0091368
requesting an additional node for prod-lfc-atlas
As requested by Cedric Serfon in INC122340
___________________________________
20-04-2012 10:53:32 CEST - Cedric Serfon
Additional comments (Customer View)
Hi Ignacio,
Thanks a lot ! The situation greatly improved after lfcatlas02 is back : we had accumulated a
huge registration backlog yesterday and this backlog is now decreasing. That means we
definitively cannot live with only 3 frontends. Would it be possible to envisage a 5th FE (in
case one of the FE has problems as lfcatlas02) ?
Cheers,
Cedric
__________________________________
 
27/04/2012 LFC INC124738 : lfcatlas02 lfc_nowrite
LFC restart by the Lemon actuator when all requests getting
'Failure getting the request: Name server not active'.
Also found some Oracle errors.
Passing the ticket to the LFC developers.  
02/05/2012 LFC Followed up ticket INC124738 on auto LFC restart. With Fabrizio Furano.  
02/05/2012 LFC Following up lfcatlas02 stuck as reported by Fabrizio.
Due to power problem on NAS disk containing VM images.
Case had been debugged and fixed by Alex.  
02/05/2012 LFC I updated https://twiki.cern.ch/twiki/bin/view/FIOgroup/ScGridLFC
removing obsolete services and changing names and Lemon pointers to the current names.
I put it in https://twiki.cern.ch/twiki/bin/view/PESgroup/ItPSsection.
with the pointer to SLS.
 
26/03/2012 Message Brokers Reading mcollective and activemq doc. To configure activemq for mcollective.  
27/03/2012 Message Brokers Started working with the activemq puppetlabs module. To configure activemq for mcollective.  
28/03/2012 Message Brokers Checked with Steve the procedure to put information in hiera. Needed for activemq usernames and passwords.  
29/03/2012 Message Brokers Got activemq running for mcollective Worked on a puppetlabs module to provide parameter separation, hiera usage, etc.  
02/04/2012 Message Brokers Discussed with Steve current status of mcollective.
Looking how to run arbitrary commands.
Learnt to do
# mco rpc package status package=xxxx
# mco rpc service status service=ntpd
# mco rpc service restart service=ntpd
 
16/04/2012 Message Brokers Attended morning meeting - To do SSL for activemq with Puppet certs.
- To do RabbitMQ for OpenStack when Belmiro is back.
- To do RabbitMQ for Slurm with Nacho.
 
30/04/2012 Message Brokers Added entries for RabbitMQ, ActiveMQ and Apollo to https://agilecat.web.cern.ch/catalog/component-catalog As requested by Pedro.  
02/05/2012 Message Brokers Found in the mcollective 1.3 release notes (AI-366) that it needs features of the latest version of activemq (3.5.6.x), so we may not be able to use Apollo. Subscribed to https://groups.google.com/forum/#!forum/mcollective-users
to ask mcollective people about Apollo.
 
02/05/2012 Message Brokers Follow up discussion with Lionel on https://agileinf.its.cern.ch/jira/browse/AI-260?focusedCommentId=14606#comment-14606    
26/04/2012 myproxy Did INC123976    
26/04/2012 myproxy https://ggus.eu/ws/ticket_info.php?ticket=81671
   
26/03/2012   Afternoon off.    
03/04/2012   Attended GT/PES monthly meeting. With Steve and Ricardo. Discussed with Markus and Oliver points in https://twiki.cern.ch/twiki/bin/view/PESgroup/Tuesday3rdApr2012  
17/04/2012   Attended PES group meeting. Presentation about the Jira service.  
17/04/2012   Had to reset/reboot desktop PC due to browser update. To have new login features.  
20/04/2012   Changed my passwords and grid certificate that were about to expire https://www.cern.ch/account
https://ca.cern.ch
 
24/04/2012   Asked Steve to look into LXPLUS403 for which there was a security warning.    
30/04/2012   Attended morning meeting and AI one.    
03/05/2012   Attended AI + section meeting + ITUM Discussed AI LB + MB plans  
23/04/2012 On rota this week. Attended morning meeting + AI morning meeting.    
26/04/2012 VOBOX Answered INC123815 after getting help from Ben. The problem is that tools like sindes-get-certificate -f do not work because the server sees the requests coming from vocms208-vip rather than from vocms209. Alex explained that this seems to be a virtual IP for an HA setup and that they could fix the problem by changing the metric of the interface with the virtual IP to 2.  
27/04/2012 vobox Answered ticket INC121504: rfcp on amsvoboxXX
The passwd entry for the ams user had a different primary group than the one in the central services.
30/04/2012 VOBOX Followed up via mail to vobox-admins-alice@cernNOSPAMPLEASE.ch the /tmp full alarms in the lxcaf/caf_alice_nodes reported in the morning meeting. Passed the reply to it-support-mm@cernNOSPAMPLEASE.ch  
30/04/2012 VOBOX Explained to Maarten how to disable exception.tmp_full (metric 30010) in the lxcaf/caf_alice_nodes In mail to vobox-admins-alice@cernNOSPAMPLEASE.ch  
02/05/2012 VOBOX Followed up questions from Maarten on how to disable exception.tmp_full (metric 30010) in the lxcaf/caf_alice_nodes In mail to vobox-admins-alice@cernNOSPAMPLEASE.ch  
26/04/2012 VOMSR Passed to Steve INC124027 and INC124043 that seem to be a known problem.    

Date Service Summary Description C5
Worthy
04/05/2012 Message Brokers Contacted R.I. Pienaar with Apollo support and scalability questions for mcollective He answered that Apollo should work but it has no clustering, and we will need clustering to support 300K nodes.  
04/05/2012 ITTF Sent announcement of "Computing in BE/CO" session.    
04/05/2012 LB Follow up by mail with Nick Ziogas on RP for LB alias creation.    
07/05/2012   Attended SA session of Saved Leave, Pensions, etc.    
07/05/2012 Batch LDAP Looked into loadPwentCache.py to correct a misunderstanding of Emmanuel Ormancey. The problem is that some of the (sub)groups are in a different hierarchy than the one with OU=Unix.
The solution suggested by Emmanuel is that the method should search in OU=Workgroups and not in OU=Unix,OU=Workgroups when doing the recursive sub-searches. The first one must search only in OU=Unix. This has been implemented and tested successfully by Gavin.
 
07/05/2012 Batch Answered question from Maite on whether the 2nd level could help on the jobs-stuck and VMs-out-of-swap cases last week. My answer:

There were specific problems (with the partitioning for the AFS cache) in VMs that favored the proliferation of these cases in lxcloud. The problems are being
solved by Ulrich. The case is followed up in https://agileinf.its.cern.ch/jira/browse/BI-637 (Understand memorydeath and spam issues on batch VMs).

AFAIK, (pls. Ricardo and Gavin correct me if wrong), in normal conditions the monitoring should handle machines going to swap full correctly (before they are
actually full).
 
07/05/2012 ITTF Still cannot modify http://information-technology.web.cern.ch/about/meeting/it-technical-forum-ittf

Following up with Cath Noble.
As requested by Frederic.  
07/05/2012 ITTF Scheduled session "Open Software for Open Hardware" for the 13 July with Javier Serrano. Made event page in https://indico.cern.ch/conferenceDisplay.py?confId=190126 As requested by Frederic.  
08/05/2012 Message Brokers Followed up Lionel's comments about (AI-260) Enable MCollective on Apollo.    
08/05/2012   Attended PES group meeting.    
08/05/2012 LB Reported LB work status to Arne for the IT-CMS coordination meeting Hi Arne,
Latest news, not sure if you want to report it all.

cheers ...Ignacio...

- Nick and I had a meeting with Mats Moller and an ITIL consultant on what
we could do as a workflow when several FEs are involved. Unfortunately there
is no possibility of automated routing, but we felt it is better than the
current form and mails.

- Nick currently waiting for permissions to implement Record Producer (form)
in Service-Now for new LB alias creation replacing current form and mails.

- Mentioned to CS group that ultimately we would like an LB as a service
interface as in Apache Libcloud:
http://libcloud.apache.org/docs/load-balancer-base-api.html.
They replied that they are not yet ready, in particular for new alias creation, due
to security concerns.

- After checking with other stakeholders, I changed the cron job
reconfiguring the LB server from daily to hourly to reduce latency when
changing alias members.

- Currently preparing a new LB server for the Agile infrastructure project
that I plan to use for tests of new behaviour when no hosts available.
 
2012/05/08       Yes

Week February 6, 2012

Service Description Impact/Risk
ANY O ANY End of the year...

Other information
  • Mon
    • Attended Agile Infrastructure Meeting.
    • Answered Helge on the 'Where do we raise uid changing' thread. Data obtained with # getent passwd |awk -F\: '{if ($3 < 1000 && $3 > 101) print $0}' |sort -t ':' -k 3 > /tmp/passwd.sort on a batch box. I guess that for RedHat based tools, we are concerned by the ones <=500. They are 216.
    • Followed up with Ulrich /afs/cern.ch/user/l/lsfadmin/scripts/CloudFactory errors in lxadm.
    • For ticket INC094014 'nss_ldap on SLC5: cannot get more than 1500 entries of a given item (group), needs "ranged results"' Produced the output of tcpdump -vvv -X -w /tmp/tcpdump.ldap.out host xldap.cern.ch when doing getent group zp on SLC5 and SLC6 with nscd caching disabled. This was to show that SLC6 is actually using the range protocol.
    • Answered ticket INC101333 about LFC: getting the list of LHCb files with zero replicas from the LFC db. We should not be doing DB queries when there is a method to access the information directly. I could not find any doc myself either on the lfc_getpath method. I guess that this should eventually be reported to the LFC developers. On the other hand, googling around I did find the example called lfc-getreplica-data-1.2.py that uses lfc_getpath. I have tested that it works OK in my environment by doing python ./lfc-getreplica-data-1.2.py srm-public.cern.ch  --lfn .
    • Meeting on LFC issues with Manuel and Philippe.
    • Followed up on the LFC meeting to see that the LFC from the EMI 1 release is SLC5 only. See http://www.eu-emi.eu/emi-1-kebnekaise and http://www.eu-emi.eu/emi-1-kebnekaise-updates/-/asset_publisher/Ir6q/content/update-10-24-11-2011
    • Uploaded material of ITTF session on OpenStack to Indico.
    • Mail discussion with Tim and Helge on possible topics for the ITTF.
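The uid-range filter in the getent one-liner above can be exercised against a few synthetic passwd lines (made-up entries, not real accounts):

```shell
# Run the same awk uid filter (101 < uid < 1000) on synthetic passwd
# data; only the 200 and 450 uids should pass.
printf '%s\n' \
  'sys:x:50:50::/:/sbin/nologin' \
  'svc1:x:200:200::/:/bin/bash' \
  'svc2:x:450:450::/:/bin/bash' \
  'user:x:2000:2000::/:/bin/bash' |
awk -F: '{ if ($3 < 1000 && $3 > 101) print $0 }' | wc -l
# prints: 2
```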
  • Tue
    • Sent mail reminding ITTF material and upcoming events. Most information in Indico in https://indico.cern.ch/categoryDisplay.py?categId=3407
    • Prompted by a warning from Frederic, verified that we can fall back to 513 1-024 for the coming ITTF sessions.
    • Discussed how we could let ATLAS manage their own limits in LSF as they requested in the last coordination meeting.
    • From CMS ticket INC101988 'dpm package on some tier-0 nodes causing problems' we realized that the problems are due to clashes of the dpm packages required by emi-wn-1.0.0-0.sl5. In particular the dpm(-1.8.2-3sec.sl5) RPM contains all the rf* commands, clashing with (actually overwriting) the ones in castor-rfio-client(-2.1.9-3). This seems to be what breaks CMS jobs. So Ulrich has produced dummydpm-0.0.1-1 to fulfill the dependencies of emi-wn-1.0.0-0.sl5. I have reported the problem to linux.support, I have also fixed the template prod/cluster/lxbatch/emi_1_0_0-wn.tpl to use it instead of the dpm RPMs, and I have deployed this to the lxbatch nodes with --stage preprod by doing SPMA, followed by 'rpm -e castor-rfio-client-2.1.9-3', followed by another SPMA. This is required to have the right rf* commands from the castor-rfio-client-2.1.9-3 rpm. I also answered the ticket and committed dummydpm in the batchinter SVN rep.
    • Matthias found that the RPM problem does not happen anymore if we upgrade to at least CASTOR 2.1.9-8. I asked the CASTOR Operations people to which version we should go.
    • Discussed by mail how to answer the question about having preprod nodes exposed to production jobs. On request of Helge, I agreed to attend the IT CMS coordination meeting tomorrow.
  • Wed
    • Attended the AI meeting.
    • In the framework of the EMI-WN problems, looked into the version of the xroot software with Gavin, as well as the CGSI libs.
    • Attended the IT-CMS coord meeting to discuss the benefit of having preprod nodes in the production farms.
    • Helped producing a text for the minutes.
    • Following up with Steve sysadmin remedy tickets for cvmfs boxes:
      • CM000000000464862: lxbst2801. The box is bollocksed I would say, not sure if CVMFS did it. Reboot it for sure.
      • CM000000000463952: That's a new error for me. /etc/mtab~ was present as a lock file to /etc/mtab and that stops other things being mounted. And so mounts were failing. Clearly something is going wrong somewhere.
  • Thu
    • Attended the AI meeting.
    • Explained to Maite the 'Where do we raise uid changing' case for PSM.
    • Ticket INC094014 about nss_ldap problems on SLC5. Provided the tcpdump output again, now with '-s 0'. More specifically [root@lxadm10 ~]# tcpdump -vvv -X -s 0 -w /tmp/tcpdump.ldap.s0.out host 137.138.240.49 or 137.138.142.25 or 137.138.144.149 or 137.138.145.178 or 137.138.145.182 or 137.138.240.48
    • Read slides about Quantum in http://robhirschfeld.com/2012/02/08/quantum-network-virtualization-in-the-openstack-essex-release-2/
    • Section fondue.
    • Attended computing seminar “Defining Computer ‘Speed’: An Unsolved Challenge” by Dr. John Gustafson (Intel Research).
    • Also prompted by INC102803. Verified that the /usr/bin/rfcp binary is OK by doing '$ wassh -t 100 -p 100 -l root -c lxbatch 'ls -lstr /usr/bin/rfcp' |grep -v 'root root 36118 Nov 6 2009 /usr/bin/rfcp'' and found two boxes (lxbrb2516 and lxbrl2316) that had the dpm ones. The boxes were in stage usertest/jhefferm, so they looked like preprod but were left out of the cleanup done. So cleaning them up now with the sequence of spma_wrapper.sh; rpm -e castor-rfio-client-2.1.9-3; spma_wrapper.sh; ncm_wrapper.sh. As a matter of fact, John says the usertest/jhefferm stage looks like a very old thing he is not interested in, so we could move the boxes to prod.
  • Fri
    • Discussed with Stefan the ITTF session for 16 March, Squaring the Circle: Reflections on Identities, Authentication & Authorization at CERN, and prepared the Indico event https://indico.cern.ch/conferenceDisplay.py?confId=177625 and logistics.
    • Attended IT-PES-PS section meeting.
    • For ticket INC094014, provided an SLC5 tcpdump with TLS encryption disabled. To do so I edited /etc/ldap.conf to comment out #tls_checkpeer true and replace ssl start_tls with ssl no. As a matter of fact the SLC6 configuration has TLS encryption disabled. It should use ldaps: instead of ldap: in the URI.
    • Checked with Philippe a couple of pending tickets, in particular the permissions to access the LFC logs for E. Laciotti from LHCB + auto scheduling addition of privileged hosts to ATLAS.
    • Removed the atlascatlong and atlascatshort queues and resources as described in Jira in https://agileinf.cern.ch/jira/browse/BI-369?page=com.atlassian.streams.streams-jira-plugin:activity-stream-issue-tab#issue-tabs

Week January 30, 2012

Service Description Impact/Risk
ANY O ANY End of the year...

Other information
  • Mon
    • Attended Agile Infrastructure Meeting.
    • Following report from Max Baak on Sunday, reconfigured lsf to exclude u_ATLASCAT users from the u_zp limit of 100 slots per user so that they can get to the group limit of 1000 slots with a single user.
    • After another report from Max Baak in the afternoon verified that g_atlascaturgent is actually dedicated to CAT users and pointed out problem with shared resources (jobs not starting). This was finally solved by Ricardo by doing a badmin reconfig.
    • Looked into the mghunterd params on request of Steven Murray. They were the old fashioned ones with -t rather than -s -m. Checked with Jan Iven that there was no reason to keep them and let him update them to modern ones.
    • Pointed out a problem with the slsmon account in lxmaster20 that was triggering errors when trying to connect for the SLS probe. Fixed by Ricardo in the Quattor config.
    • Prepared Indico entries and reserved rooms for coming ITTF sessions in https://indico.cern.ch/categoryDisplay.py?categId=3407.
  • Tue
  • Wed
    • Switched to use this file as a log rather than https://twiki.cern.ch/twiki/bin/view/DSSGroup/FDOWeeklyIgnacioReguero
    • Attended the Agile Infrastructure SCRUM meeting.
    • With the agreement of those involved, moved the ITTF monitoring session to the 24th February.
    • Followed up hiccups in lfcshared01 that triggered the lfc_noread alarm. I saw that the lemon actuator restarted the daemon, seemingly due to problems in the LCGR database reported later. Answered INC100127 and INC100136 about them.
    • Closed obsolete CASTOR tickets INC034978 and INC061931.
    • Attended meeting with Lionel Cons and Massimo Paladin on questions about the Messaging service support workflows and migration to Agile Infrastructure.
    • Question from Massimo on rmAdmiNode operations in CASTOR disk move script.
    • Mail to report to Frederic, Ian and others on the coming ITTF sessions, in particular the new Computing Centre one.
  • Thu
    • Attended Belmiro's rehearsal for the ITTF.
    • Produced C5 report for the ATLAS CAT resource streamlining.
    • Mail discussion with Tim on new proposals for the ITTF.
    • Attended PES group lunch.
    • Attended ITUM.
    • Attended Agile Infrastructure meeting on priorities.
  • Fri
    • Sent reminder for the ITTF.
    • Replied mail on high_load on compass22 and mentioned that not working anymore with fileserver VO boxes.
    • Organised ITTF.
    • Discussed with Alberto and scheduled ITTF session on Storage Strategy for the 9th March.

Week January 23, 2012

Service Description Impact/Risk
ANY O ANY End of the year...

Other information
  • Mon
    • Attended Agile Infrastructure Meeting.
    • Backing up Philippe for the Piquet.
    • Put in production loadPwentCache.py and loadPwentCache.cron. I changed the configuration in CDB, did the SPMA, recreated the cache by hand, and then, as lsfadmin, ran the daily reconfiguration script /usr/bin/lsf_cron_auto_reconfig -C batch -V 7.0 -t -p 300 in lxmaster20.
    • I also verified that the output of bugroup is consistent with /usr/lsf/etc/egroup after the deployment.
    • Checked the status of the DiggiChristmas limit that is set to 0 slots with blimits -w -n DiggiChristmas after the reconfiguration and informed Alessandro from ATLAS.
    • As suggested by Gavin, started working with Ulrich on the EMI Worker Node.
  • Tue
    • Attended Agile Infrastructure Meeting.
    • Did activity report for the section meeting minutes. It is in https://twiki.cern.ch/twiki/bin/view/PESgroup/WrIgnacio so that the latest extract can be seen in https://twiki.cern.ch/twiki/bin/viewauth/PESgroup/PesPsMembersWorkReports. Planning to move to https://twiki.cern.ch/twiki/bin/view/PESgroup/WrIgnacio at the end of the month.
    • Discussed with Ulrich, who is starting the work for the EMI WN software.
    • Meeting of Ricardo, Gavin and me with the ATLAS people (Max Baak + Alessandro, Guido) on the atlascatlong and atlascatshort queue reorganisation.
    • Discussed with Ricardo what is needed in the LSF configuration and set off to implement it.
    • Checked with the DBAs before restarting the LHCB LFC nodes for aliases prod-lfc-lhcb-central and prod-lfc-lhcb-ro. Something like sms set maintenance other 'DB upgrade' lfclhcbro01 lfclhcbro02 lfclhcbro03 lfclhcbrw01 lfclhcbrw02 lfclhcbrw03; wassh -l root lfclhcbro01,lfclhcbro02,lfclhcbro03,lfclhcbrw01,lfclhcbrw02,lfclhcbrw03 service lfcdaemon stop . Verified that the LFC logs are OK. Realized and reported that one node (lfclhcbrw02) has been out of production since end of November. Reported it to the lfc-operations list.
    • Attended the Grid Ops meeting replacing Philippe. Read report from Steve.
    • Also replacing Philippe, on request of Serguei Baranov, reconfigured the alias for pandamonitor from best_hosts = 3 to best_hosts = 2, restarted the LB server and notified Serguei.
    • Sent announcement for the IT Technical Forum next Friday. Tried first it-dep-full@cernNOSPAMPLEASE.ch, which is restricted, so it did not work. On suggestion of Nathalie Thiers, tried afterwards it-dep@cernNOSPAMPLEASE.ch, which allows mail from members.
    • Started working on the Twiki for the IT Technical Forum. Asked Nils (who was off) and then Pete for a public Web after realizing that the existing ones do not fit well what is required for the ITTF.
  • Wed
    • Attended Agile Infrastructure Meeting.
    • Discussed with Ulrich deploying the EMI WN in preprod with SCAS rather than ARGUS. In the end agreed not to do it as we have to solve glexec first.
    • Replacing Philippe, answered the test alarm tickets after the GGUS release.
    • Discussed with Mats the idea of doing an ITTF session on Service Management.
    • Replied to Nils about ITTF Twiki location.
    • Got from Philippe an explanation of the procedure for new LB aliases.
    • Got an explanation from Gavin on how to put out a message when submitting to obsolete queues: use queuename.esub in esub.d, packaged in the CERN-CC-esub-filter RPM.
    • Agreed with Philippe to put lfclhcbrw02 in maintenance until we understand the problem with it.
    • Fulfilled request RQF0059707 from CMS by creating the cmsbpfrontier LB alias (and restarting the LB server).
    • Attended Grid Ops meeting.
  • Thu
    • Got feedback from Ricardo for the reconfiguration steps required for the atlascatlong and atlascatshort queue reorganisation.
    • Attended Section Meeting and took notes.
    • Attended Ticket Review meeting.
    • Contacted Mats to fix date for Service Management session of the ITTF.
    • After a warning from Wayne about possible overcrowding of 513 1-024, contacted the Caltech people who had booked the IT Auditorium to swap rooms and announced the change to it-dep@cern.ch.

  • Fri
    • Organised ITTF: Agile Infrastructure: Configuration Management. 104 attendees.
    • Migration out of dedicated resources atlascatlong and atlascatshort. Produced and deployed a new version of the CERN-CC-esub-filter with queue.atlascatlong and queue.atlascatshort in esub.d so that bsub to these queues is intercepted and a proper message is shown.
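The interception described above (queue.atlascatlong and queue.atlascatshort in esub.d) might look roughly like the sketch below. This is a hypothetical reconstruction, not the actual CERN-CC-esub-filter content; it relies on standard LSF esub behaviour, where bsub parameters arrive as KEY="value" lines in the file named by $LSB_SUB_PARM_FILE and exiting with $LSB_SUB_ABORT_VALUE rejects the submission.

```shell
#!/bin/sh
# Hypothetical sketch of an esub.d filter for the retired ATLAS CAT queues.
# LSF writes the bsub options as KEY="value" lines to $LSB_SUB_PARM_FILE,
# so the file can simply be sourced.
. "$LSB_SUB_PARM_FILE"

case "$LSB_SUB_QUEUE" in
  atlascatlong|atlascatshort)
    # Show the user a proper message and reject the submission.
    echo "Queue $LSB_SUB_QUEUE has been retired (ATLAS CAT resource" \
         "rationalisation); please submit to a public queue instead." >&2
    exit "$LSB_SUB_ABORT_VALUE"
    ;;
esac
exit 0
```

With this in place, a bsub to either retired queue is intercepted before it reaches mbatchd, while submissions to any other queue pass through untouched.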
    • Fixed /afs/cern.ch/user/s/stage/acron/castorsvcacl.py cron job used to generate the pool ACLs for castoratlas/atlt3. This job broke when zp_CAT was replaced by u_ATLASCAT in the batch configuration because of the streamlining of the atlascat dedicated resources.

  • Ignacio still 50% on Castor until the end of January.
  • Asked by Frederic to coordinate the IT Technical Forum. Started with an Agile Infrastructure session on 27 Jan.
  • Produced an LDAP query to fill the egroup sticky cache. egroup is used in LSF reconfigurations to resolve groups to member lists.
  • Packaged all the egroup bits in CERN-CC-LSF-* RPMs. Deployed them in Quattor, tested in batchint and deployed in production on 23/Jan.
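A query of this kind might be sketched as below. This is a hypothetical illustration, not the code shipped in the CERN-CC-LSF-* RPMs: the server (xldap.cern.ch), the base DN and the e-group name are assumptions based on CERN's public LDAP service. The network-dependent query itself is shown as a comment; the runnable part demonstrates reducing its member entries to plain login names for an LSF group file.

```shell
# Hypothetical e-group lookup (requires CERN network access):
#   EGROUP=atlas-cat-users   # hypothetical e-group name
#   ldapsearch -x -LLL -h xldap.cern.ch \
#       -b 'OU=e-groups,OU=Workgroups,DC=cern,DC=ch' \
#       "(cn=$EGROUP)" member
# Its output is then reduced to login names, here shown on sample data:
printf '%s\n' \
  'member: CN=jdoe,OU=Users,OU=Organic Units,DC=cern,DC=ch' \
  'member: CN=asmith,OU=Users,OU=Organic Units,DC=cern,DC=ch' |
sed -n 's/^member: CN=\([^,]*\),.*/\1/p'   # prints jdoe, then asmith
```

The resulting login list is what the sticky cache would hold, so that LSF reconfigurations can expand groups to members without an LDAP round trip each time.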

  • Ignacio now officially in PES-PS section - 50% still on Castor until the end of January. Working on tying up loose ends, in particular the fileserver VOBOXES for COMPASS and HARP.
  • Produced a CERN version of the xroot clients for SLC6 lxbatch and lxplus (in prod/site/cern_cc/rpms/addons/slc6/xrootd-clients.tpl) while it is being put in EPEL by the CERN xroot developers (Lukasz Janyst).
  • Waiting for procedures for a couple of batch merge operations.
  • Working on scripts to hunt for problematic jobs.

-- IgnacioReguero - 06-Dec-2011

-- IgnacioReguero - 08-May-2012

Topic revision: r1 - 2012-05-08 - IgnacioReguero