CMS-VO Working Notes for T2_CH_CSCS and T3_CH_PSI

Inode creation again

Nadya's jobs filling up /scratch



Monday, 25 March 2019 




[root@t3wn26 scratch]# du -sh *
...
210G    nchernya
...
[root@t3wn26 scratch]# cd nchernya/
[root@t3wn26 nchernya]#
[root@t3wn26 nchernya]#
[root@t3wn26 nchernya]# ls
sgejob-7196336  sgejob-7368744  sgejob-7372804  sgejob-7394107  sgejob-7397348  sgejob-7402154  sgejob-7405309
sgejob-7196820  sgejob-7368880  sgejob-7372982  sgejob-7394165  sgejob-7397578  sgejob-7402162  sgejob-7405434
sgejob-7196881  sgejob-7368891  sgejob-7373080  sgejob-7394221  sgejob-7397579  sgejob-7402211  sgejob-7405444
sgejob-7196957  sgejob-7369042  sgejob-7373221  sgejob-7394273  sgejob-7397605  sgejob-7402241  sgejob-7405466
sgejob-7197012  sgejob-7369051  sgejob-7373450  sgejob-7394321  sgejob-7397775  sgejob-7402268  sgejob-7405705
sgejob-7197429  sgejob-7369198  sgejob-7373662  sgejob-7394376  sgejob-7397983  sgejob-7402309  sgejob-7406008
sgejob-7239535  sgejob-7369226  sgejob-7373812  sgejob-7395101  sgejob-7398008  sgejob-7402332  sgejob-7406323
sgejob-7241151  sgejob-7369395  sgejob-7373943  sgejob-7395120  sgejob-7398048  sgejob-7402365  sgejob-7406635
sgejob-7345948  sgejob-7369612  sgejob-7374369  sgejob-7395143  sgejob-7398212  sgejob-7402396  sgejob-7406785
sgejob-7346147  sgejob-7369769  sgejob-7374393  sgejob-7395159  sgejob-7398302  sgejob-7402462  sgejob-7406952
sgejob-7346232  sgejob-7370013  sgejob-7378015  sgejob-7395205  sgejob-7398330  sgejob-7402526  sgejob-7407246
sgejob-7346306  sgejob-7370236  sgejob-7378172  sgejob-7395236  sgejob-7398420  sgejob-7402580  sgejob-7407298
sgejob-7346310  sgejob-7371348  sgejob-7378811  sgejob-7395250  sgejob-7398450  sgejob-7402645  sgejob-7407323
sgejob-7346431  sgejob-7371449  sgejob-7378834  sgejob-7395261  sgejob-7398516  sgejob-7402663  sgejob-7407502
sgejob-7346479  sgejob-7371483  sgejob-7378977  sgejob-7395269  sgejob-7398688  sgejob-7402730  sgejob-7407709
sgejob-7346482  sgejob-7371555  sgejob-7379049  sgejob-7395550  sgejob-7399354  sgejob-7402836  sgejob-7407897
sgejob-7346505  sgejob-7371581  sgejob-7379194  sgejob-7395635  sgejob-7399403  sgejob-7402915  sgejob-7407992
sgejob-7367813  sgejob-7371641  sgejob-7379233  sgejob-7395749  sgejob-7399703  sgejob-7403000  sgejob-7408963
sgejob-7367844  sgejob-7371659  sgejob-7379247  sgejob-7395794  sgejob-7399765  sgejob-7403203  sgejob-7409440
sgejob-7367857  sgejob-7371712  sgejob-7379374  sgejob-7395802  sgejob-7399874  sgejob-7403359  sgejob-7410932
sgejob-7367870  sgejob-7371732  sgejob-7379429  sgejob-7395863  sgejob-7400001  sgejob-7403757  sgejob-7411396
sgejob-7367901  sgejob-7371799  sgejob-7379470  sgejob-7395910  sgejob-7400034  sgejob-7403875  sgejob-7411447
sgejob-7367916  sgejob-7371844  sgejob-7379501  sgejob-7396163  sgejob-7400284  sgejob-7404167  sgejob-7411503
sgejob-7367944  sgejob-7371911  sgejob-7379871  sgejob-7396508  sgejob-7400326  sgejob-7404257  sgejob-7411552
sgejob-7368022  sgejob-7371979  sgejob-7380035  sgejob-7396511  sgejob-7400707  sgejob-7404365  sgejob-7411599
sgejob-7368114  sgejob-7372062  sgejob-7380176  sgejob-7396748  sgejob-7400762  sgejob-7404396  sgejob-7411641
sgejob-7368210  sgejob-7372138  sgejob-7380236  sgejob-7396978  sgejob-7400897  sgejob-7404478  sgejob-7411682
sgejob-7368293  sgejob-7372264  sgejob-7380392  sgejob-7397014  sgejob-7401328  sgejob-7404636  sgejob-7411806
sgejob-7368416  sgejob-7372354  sgejob-7380425  sgejob-7397038  sgejob-7401996  sgejob-7404768
sgejob-7368456  sgejob-7372447  sgejob-7380659  sgejob-7397039  sgejob-7402062  sgejob-7404799
sgejob-7368570  sgejob-7372547  sgejob-7380684  sgejob-7397091  sgejob-7402069  sgejob-7404924
sgejob-7368597  sgejob-7372650  sgejob-7393995  sgejob-7397133  sgejob-7402102  sgejob-7405013
sgejob-7368725  sgejob-7372693  sgejob-7394049  sgejob-7397333  sgejob-7402111  sgejob-7405182
[root@t3wn26 nchernya]#
[root@t3wn26 nchernya]#
[root@t3wn26 nchernya]# ls sgejob-7196336/
CMSSW_9_4_9  output_trees_2016nodes_19_03_2019_GennewtrainigMjj
[root@t3wn26 nchernya]#
[root@t3wn26 nchernya]#
[root@t3wn26 nchernya]# ls sgejob-7196336/output_trees_2016nodes_19_03_2019_GennewtrainigMjj/
output_GluGluToHHTo2B2G_node_12_13TeV-madgraph_2.root  runJobs12.sh.done


T2 Clean up 2019/2018



39.1 TB   : store/user/oiorio
33.4 TB   : store/user/algomez
33.1 TB   : store/user/ytakahas
29.8 TB   : store/user/cgiuglia
22.2 TB   : store/user/decosa
21.1 TB   : store/user/dezhu
16.7 TB   : store/user/vstampf
12.3 TB   : store/user/mschoene
10.5 TB   : store/user/lshchuts
10.0 TB   : store/user/dpinna


39.1 TB   : store/user/oiorio
29.8 TB   : store/user/cgiuglia
23.3 TB   : store/user/ytakahas
21.1 TB   : store/user/dezhu
18.9 TB   : store/user/decosa
16.7 TB   : store/user/vstampf
12.3 TB   : store/user/mschoene
10.5 TB   : store/user/lshchuts
10.0 TB   : store/user/dpinna



Hi Annapaola,

There are multiple ways. The central instruction page is here: https://wiki.chipp.ch/twiki/bin/view/CmsTier3/HowToAccessSe . The most basic command would be gfal-rm:

gfal-rm -r --dry-run srm://storage01.lcg.cscs.ch//pnfs/lcg.cscs.ch/cms/trivcat/store/user/username/path/to/dir
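(--dry-run only lists what would be removed; once the listing looks right, the usual pattern is to repeat the command without it to actually delete, e.g. with the same made-up path:)

gfal-rm -r srm://storage01.lcg.cscs.ch//pnfs/lcg.cscs.ch/cms/trivcat/store/user/username/path/to/dir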

Another particularly easy way is to use uberftp:

uberftp storage01.lcg.cscs.ch

You then enter an interactive session where you can use posix-like commands:

[tklijnsm@t3ui02 ~]$ uberftp storage01.lcg.cscs.ch
220 GSI FTP door ready
200 User :globus-mapping: logged in
UberFTP (2.8)>
UberFTP (2.8)> cd /pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm
UberFTP (2.8)> ls
drwx------  1 cms001     cms001              512 May  4  2017 GenericTTbar
drwx------  1 cms001     cms001              512 Aug 16  2017 SEtest_Aug16
drwx------  1 cms001     cms001              512 Jun 12  2018 SEtest_Jun12
UberFTP (2.8)> cd SEtest_Jun12
UberFTP (2.8)> ls
-r--------  1 cms001     cms001         78122992 Jun 12  2018 smallElectronSample.root
UberFTP (2.8)> rm smallElectronSample.root
UberFTP (2.8)> ls
UberFTP (2.8)>

Be careful, as under the hood these simple-looking commands are still expensive-to-execute SE interactions. Simple deletions should be fine though.

Cheers,
Thomas




Dear CSCS storage user,

The CMS disk space at CSCS is now FULL. Disk pools are full, and will refuse write access to new files for users (bad), but also for CMS central computing workflows (very bad).

Please urgently delete files you no longer need. Unless you are managing datasets intended to be used by multiple people, you should limit your usage to 10TB at most!

39.1 TB   : store/user/oiorio
33.4 TB   : store/user/algomez
33.1 TB   : store/user/ytakahas
29.8 TB   : store/user/cgiuglia
22.2 TB   : store/user/decosa
21.1 TB   : store/user/dezhu
16.7 TB   : store/user/vstampf
12.3 TB   : store/user/mschoene
10.5 TB   : store/user/lshchuts
10.0 TB   : store/user/dpinna

Please clean up today!

Best,

Thomas
(CMS-VO rep. at CSCS)


Monday, 4 February 2019 

        31.4 TB   : store/user/oiorio
        25.8 TB   : store/user/ytakahas
        22.2 TB   : store/user/decosa
        21.1 TB   : store/user/dezhu
        16.7 TB   : store/user/vstampf
        16.3 TB   : store/user/algomez
        13.5 TB   : store/user/cgiuglia
        12.7 TB   : store/user/mschoene
        10.5 TB   : store/user/lshchuts


Thursday, 20 December 2018 

alberto.orso.maria.iorio@cern.ch
annapaola.de.cosa@cern.ch
carme.giugliano@gmail.com


subj: Please clean up on the T2 storage element


Dear CSCS storage user,

As the storage on T2_CH_CSCS is beginning to reach its quota, it is time to clean up old files. You are currently using over 10TB in storage:

31.4 TB   : store/user/oiorio
25.8 TB   : store/user/ytakahas
22.2 TB   : store/user/decosa
21.1 TB   : store/user/dezhu
16.7 TB   : store/user/vstampf
16.3 TB   : store/user/algomez
13.5 TB   : store/user/cgiuglia
12.7 TB   : store/user/mschoene
10.5 TB   : store/user/lshchuts

Please have a look at your files and delete what you do not need anymore. Once the disk usage reaches critical levels important services may crash, so please delete your old files as soon as you can, preferably before the Christmas break if you're still around.

You can list and delete your files using uberftp:

uberftp storage01.lcg.cscs.ch 'ls /pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm'

uberftp -debug 2 storage01.lcg.cscs.ch 'rm -r /pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm/path/to/file'

uberftp storage01.lcg.cscs.ch 'rm -r /pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm/path/to/file'


There is no easy-to-read overview of your disk usage online right now. If you are very interested, you may parse the output from the link here:
http://ganglia.lcg.cscs.ch/ganglia/files_cms.html
(Warning: very big file - better to download and view in a text editor, or grep for your username)

Best regards,
Thomas Klijnsma



Logging in on cscs t2 and looking at created files

Wednesday 12 April 2017 

------------------------------------------------------------
CSCS login

ssh tklijnsm@ela.cscs.ch

With regular password
-> apparently linked to CERN by the looks of it, since the most recent CERN password works

then,

ssh tklijnsm@login.lcg.cscs.ch

then,

ssh wn135


------------------------------------------------------------
Looking at the massive file generation


$ head -n100000 /scratch/lcg/tmp/201704120930.list.list | egrep -o "dir\_.*" | cut -f2,3 -d'/' | sort | uniq -c | sort -n | tail -n10
     83 src/RecoEgamma
     88 src/CMSAnalyses
    129 src/EgammaAnalysis
    169 CMSSW_8_0_26_patch1/src
    176 src/L1Trigger
    379 CMSSW_9_0_0/config
    866 CMSSW_8_0_25/src
    975 CMSSW_9_0_0/.SCRAM
   3258 lheevent/mgbasedir
  86661 lheevent/process
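
The same pipeline spread out, with a comment per step (readability only, identical behaviour):

# head -n100000    : sample the first 100k entries of the file-list dump
# egrep -o "dir_.*": keep everything from the dir_NNNN path component onward
# cut -f2,3 -d'/'  : take the two path components directly below dir_NNNN
# sort | uniq -c   : count identical component pairs
# sort -n | tail   : show the ten most common ones
head -n100000 /scratch/lcg/tmp/201704120930.list.list \
  | egrep -o "dir\_.*" \
  | cut -f2,3 -d'/' \
  | sort | uniq -c \
  | sort -n | tail -n10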


-bash-4.1$ head -n6 /scratch/lcg/tmp/201704120930.list.list
5001:000fffffffffffff:0000000000000fc8:57570de7:0:99f731:10001:0:188!scratch/phoenix4/ARC_sessiondir/joxNDm8hoHqnt3tIep4RIIJmABFKDmABFKDmGXiKDmABFKDmSQOS3m/glide_VdP5JO/execute/dir_10301/lheevent/process/SubProcesses/P5_uxcx_emvexcxsx/GF1/leshouche_info.dat:4!data;1!
5001:000fffffffffffff:0000000000000fcb:468fb955:0:99f731:10001:0:177!scratch/phoenix4/ARC_sessiondir/4YdNDmCueHqndvtIepYPV1GmABFKDmABFKDmTEcKDmABFKDmu7sGhn/glide_7gtkQA/execute/dir_79601/lheevent/process/SubProcesses/P5_uxcx_emvexcxdx/GF2/madinM1:4!data;1!
5001:000fffffffffffff:0000000000000fcc:435604ce:80:99f731:10001:0:181!scratch/phoenix4/ARC_sessiondir/4YdNDmCueHqndvtIepYPV1GmABFKDmABFKDmTEcKDmABFKDmu7sGhn/glide_7gtkQA/execute/dir_79277/lheevent/process/SubProcesses/P5_uxb_emvexbbx/GF1/log_MINT0.txt:4!data;1!
5001:000fffffffffffff:0000000000000fce:39f60fd4:80:99f731:10001:0:184!scratch/phoenix4/ARC_sessiondir/4YdNDmCueHqndvtIepYPV1GmABFKDmABFKDmTEcKDmABFKDmu7sGhn/glide_7gtkQA/execute/dir_77619/lheevent/process/SubProcesses/P5_uxcx_emvexuxsx/GF2/mint_grids_NLO:4!data;1!
5001:000fffffffffffff:0000000000000fcf:313f84f5:80:99f731:10001:0:182!scratch/phoenix4/ARC_sessiondir/4YdNDmCueHqndvtIepYPV1GmABFKDmABFKDmTEcKDmABFKDmu7sGhn/glide_7gtkQA/execute/dir_79277/lheevent/process/SubProcesses/P5_uxb_emvexbbx/GF1/mint_grids_NLO:4!data;1!
5001:000fffffffffffff:0000000000000fd1:31f5d161:0:99f731:10001:0:185!scratch/phoenix4/ARC_sessiondir/4YdNDmCueHqndvtIepYPV1GmABFKDmABFKDmTEcKDmABFKDmu7sGhn/glide_7gtkQA/execute/dir_77619/lheevent/process/SubProcesses/P5_uxux_emvexuxsx/GF3/grid.MC_integer:4!data;1!


For at least some of the jobs, the responsible user seems to be vieri

-bash-4.1$ head -n200 /scratch/lcg/scratch/phoenix4/ARC_sessiondir/joxNDm8hoHqnt3tIep4RIIJmABFKDmABFKDmGXiKDmABFKDmSQOS3m/glide_VdP5JO/execute/*/lheevent/gridpack_generation.log | egrep "/afs/cern.ch/user/.*"
/afs/cern.ch/user/v/vieri/work/genproductions/bin/MadGraph5_aMCatNLO
/afs/cern.ch/user/v/vieri/work/genproductions/bin/MadGraph5_aMCatNLO
/afs/cern.ch/user/v/vieri/work/genproductions/bin/MadGraph5_aMCatNLO
/afs/cern.ch/user/v/vieri/work/genproductions/bin/MadGraph5_aMCatNLO
/afs/cern.ch/user/v/vieri/work/genproductions/bin/MadGraph5_aMCatNLO

Trying to run this on a total wildcard fails:
/scratch/lcg/scratch/phoenix4/ARC_sessiondir/*/execute/*/lheevent/gridpack_generation.log | egrep "/afs/cern.ch/user/.*"

This would probably be a very heavy operation. Need something smarter.


This yields the dir_**** paths:

-bash-4.1$ head -n3 /scratch/lcg/tmp/201704120930.list.list | egrep -o "\!scratch.*dir_[0-9]+" | sed -e "s#\!#/scratch/lcg/#g"
/scratch/lcg/scratch/phoenix4/ARC_sessiondir/joxNDm8hoHqnt3tIep4RIIJmABFKDmABFKDmGXiKDmABFKDmSQOS3m/glide_VdP5JO/execute/dir_10301
/scratch/lcg/scratch/phoenix4/ARC_sessiondir/4YdNDmCueHqndvtIepYPV1GmABFKDmABFKDmTEcKDmABFKDmu7sGhn/glide_7gtkQA/execute/dir_79601
/scratch/lcg/scratch/phoenix4/ARC_sessiondir/4YdNDmCueHqndvtIepYPV1GmABFKDmABFKDmTEcKDmABFKDmu7sGhn/glide_7gtkQA/execute/dir_79277


Awfully long command:

head -n100 /scratch/lcg/tmp/201704120930.list.list | egrep -o "\!scratch.*dir_[0-9]+" | sed -e "s#\!#/scratch/lcg/#g" | sed -e "s/$/\/lheevent\/gridpack_generation.log/" | xargs cat 2>/dev/null | egrep "/afs/cern.ch/user/.*"


Look at random selections of the long file:

head -n200000 /scratch/lcg/tmp/201704120930.list.list | tail -n100 | egrep -o "\!scratch.*dir_[0-9]+" | sed -e "s#\!#/scratch/lcg/#g" | sed -e "s/$/\/lheevent\/gridpack_generation.log/" | xargs cat 2>/dev/null | egrep "/afs/cern.ch/user/.*"

It's this every time:

/afs/cern.ch/user/v/vieri/work/genproductions/bin/MadGraph5_aMCatNLO
/afs/cern.ch/user/v/vieri/work/genproductions/bin/MadGraph5_aMCatNLO/wellnu012j_5f_NLO_FXFX/wellnu012j_5f_NLO_FXFX_gridpack/work




New transfer monitor link on grafana

Old F2F slides

T2 roles for Vinzenz



Hi all,

We should also get Vinzenz registered properly for all the T2 responsibilities... which means going through the pain that is GOCDB and SiteDB.

At the moment I am registered in GOCDB as "Site Administrator" and "Site Operations Manager" for CSCS-LCG2. I think the url for my account is https://goc.egi.eu/portal/index.php?Page_Type=User&id=6014

In SiteDB, things have to be arranged for all four sites separately: T2_CH_CSCS, T2_CH_CSCS_HPC, T0_CH_CSCS_HPC and T3_CH_PSI. The roles on these sites are:

Data Manager
    Thomas Klijnsma
PhEDEx Contact
    Thomas Klijnsma
Site Admin
    Dino Conciatore
    Miguel Gila
    Thomas Klijnsma

Data Manager
    Derek Feichtinger
    Thomas Klijnsma
    Clemens Lange
    Nina Loktionova
PhEDEx Contact
    Thomas Klijnsma
Site Admin
    Dino Conciatore
    Derek Feichtinger
    Miguel Gila
    Thomas Klijnsma
Site Executive
    Dino Conciatore
    Derek Feichtinger
    Nina Loktionova

Data Manager
    Thomas Klijnsma
PhEDEx Contact
    Thomas Klijnsma
Site Admin
    Dino Conciatore
    Miguel Gila
    Thomas Klijnsma

Admin
    Derek Feichtinger
Data Manager
    Derek Feichtinger
    Thomas Klijnsma
PhEDEx Contact
    Thomas Klijnsma
Site Admin
    Derek Feichtinger
    Thomas Klijnsma
    Nina Loktionova
Site Executive
    Derek Feichtinger

I suppose that, to a first approximation, it is probably best to simply add Vinzenz to every role that has my name on it. I don't think I have the rights to perform these changes... and to be honest I don't really know who does either. Derek, I suppose you have the authority for T3_CH_PSI? For the T2 sites we may have to ask the CSCS guys. Unfortunately I don't really recall how we did all this back when I started.

Cheers,
Thomas

Email 48h wall clock CMS

Tuesday, 11 December 2018 

Dear Stephan Lammel and Antonio Perez-Calero Yzquierdo,

As far as I understood you are both part of the CMS FactoryOps team - if I should contact somebody else instead, please let me know, or feel free to forward this email directly.

Among the contact persons from CMS, ATLAS, and LHCb, we have discussed that for a period of time we would like to change the job length at the sites T2_CH_CSCS and T2_CH_CSCS_HPC from 72 hours to 48 hours. We would like to do this to reduce the variability in job waiting time at CSCS (this solution has apparently been tried by other sites as well). We are aware of a slight potential decrease in efficiency due to relatively more draining time, so we would like to test the system under these settings for a few weeks.

Would you be able to change the configuration on the CMS side for the sites T2_CH_CSCS and T2_CH_CSCS_HPC in order to correspond to a max wall clock of 48 hours? Ideally we would start under the new configuration already this Friday, 14 Dec, so that we can monitor the system over the holidays.

Let me know if any further information is required from our side.

Best,

Thomas Klijnsma
(CMS VO-rep for CSCS)

T2 Shares issue





ATLAS actually does get 40%, the only problem is bumpiness in performance?

Average frequency seems to be about a few days




CMS from May to now:

CPU consumption

WC consumption





Seeing queue settings

Tuesday, 2 October 2018 

qconf -sq all.q


qconf in general
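
A few related inspection commands (from memory, not verified on this SGE setup):

qconf -sql           # list all cluster queues
qconf -sq short.q    # show the configuration of a single queue (slots, wall-time limits, ...)
qconf -sel           # list the execution hosts
qconf -se t3wn55     # show the configuration of a single execution host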

Logging in on a worker node (WN) as root

15 Feb 2017 - Logging in on a worker node (WN) as root



First log in on klijnsma_t@balrog.psi.ch, then switch to another terminal



======= adding to ssh-agent , should be done only once

~$ eval `ssh-agent`
Agent pid 70676
~$ 
~$ ssh-add -l
The agent has no identities.

This is a problem: a key needs to be added.


Set permissions (see also http://superuser.com/questions/215504/permissions-on-private-key-in-ssh-folder#215506 )

~$ chmod 700 .ssh/psi/
~$ chmod 644 .ssh/psi/id_rsa.pub 
~$ chmod 600 .ssh/psi/id_rsa

--> Hopefully this doesn't break the ssh reverse tunnelling


Then (note that it's the private key)

~$ ssh-add .ssh/psi/id_rsa
Identity added: .ssh/psi/id_rsa (.ssh/psi/id_rsa)
~$ 
~$ ssh-add -l
2048 SHA256:EYxdQhc6LMtlHlFDgiLXYe8vSXHOqFf2o4n/B+ynkeg .ssh/psi/id_rsa (RSA)

Seems to work.

=========



Then log on to: wmgt >> admin >> worker node:

~$ ssh -A -p 234 -l klijnsma_t -XY wmgt01.psi.ch
Password: 
Last login: Wed Feb 15 15:33:46 2017 from pb-d-128-141-188-193.cern.ch
##########################################################################
##                                                                      ##
##                    KEY FINGERPRINTS FOR WMGT01/02                    ##
##                                                                      ##
##########################################################################

wmgt01.psi.ch:
    2048 e2:b4:0f:3b:11:63:68:a0:0d:73:b3:13:fd:ac:aa:05   (RSA)
    256 0e:51:92:cb:fe:84:93:0f:85:fd:2d:10:16:80:ef:5c   (ECDSA)

wmgt02.psi.ch:
    2048 7e:0c:53:3c:b8:fb:f6:58:0d:b5:53:01:6d:8d:e5:8f   (RSA)
    256 ca:1a:11:9e:53:a9:de:75:b9:a1:19:61:00:a8:0c:ae   (ECDSA) 

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

ALL YOUR ACTIVITY ON THIS SYSTEM WILL BE LOGGED!

  Please report problems to achim.gsell@psi.ch.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

wmgt01:~$ 
wmgt01:~$ 
wmgt01:~$ 
wmgt01:~$ ssh -A root@t3admin01
Last login: Wed Feb 15 15:44:59 2017 from wmgt01
IS5500, E5460, E2760 GUI : /opt/SMgr/client/SMclient
Supermicro IPMI View     : cd /opt/SUPERMICRO/IPMIView && ./IPMIView20.sh
ILOM Java GUIs           : /opt/firefox-3.5/firefox
[root@t3admin01 ~]# 
[root@t3admin01 ~]# 
[root@t3admin01 ~]# ssh -A root@t3wn31
Last login: Wed Feb 15 15:32:46 2017 from t3admin01.psi.ch
Welcome to PSI T3 worker

[root@t3wn31 ~]# 
[root@t3wn31 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:            47         45          1          0          0         40
-/+ buffers/cache:          4         42
Swap:           48          0         48
[root@t3wn31 ~]# 
[root@t3wn31 ~]# 

(top and uptime are also useful)

To get the name corresponding to a user ID (UID):
getent passwd

[root@t3wn31 ~]# getent passwd | grep 521
ursl:x:521:534:Urs Langenegger (PSI) [ursl] GroupResp:/mnt/t3nfs01/data01/shome/ursl:/bin/tcsh
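
(getent can also be keyed directly, which is standard getent behaviour, so the grep over the full list is not strictly needed:)

getent passwd 521     # look up by numeric UID
getent passwd ursl    # look up by username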







Easily debugging a single WN

Tuesday 12 June 2018 


test job worker node



[tklijnsm@t3ui02 ~]$ qlogin -q debug.q@t3wn55.psi.ch
local configuration t3ui02.psi.ch not defined - using global configuration
Your job 2161347 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 2161347 has been successfully scheduled.
Establishing builtin session to host t3wn55.psi.ch ...

[tklijnsm@t3wn55 ~]$ . /mnt/t3nfs01/data01/shome/tklijnsm/Setup_env.sh
/swshare/anaconda/bin/python
/swshare/anaconda/bin/python

In case you run a remote ssh session, restart your ssh session with:
=========>  ssh -Y
[tklijnsm@t3wn55 ~]$

--> Need to do a trick first:

[tklijnsm@t3wn55 ~]$ export DISPLAY=localhost:0.0

--> Then it works:

[tklijnsm@t3wn55 ~]$ root -l
root [0]
root [0] TFile *_file0 = TFile::Open("root://t3dcachedb.psi.ch//pnfs/psi.ch/cms/trivcat/store/user/jpata/tth/May14_v1/TTToSemiLeptonic_TuneCP5_PSweights_13TeV-powheg-pythia8/May14_v1/180514_170730/0002/tree_2654.root")
(TFile *) 0x2ae9510
root [1]
root [1]


Nina ROOT instruction

Monday, 10 September 2018 


Hi Nina,

Sorry for the late reply, I've been on holiday and I am now catching up.

Nowadays sourcing root from cvmfs is much preferred over afs or local installations. Typically I setup my environment as follows:

export VO_CMS_SW_DIR=/cvmfs/cms.cern.ch/
source /cvmfs/cms.cern.ch/cmsset_default.sh
source /swshare/psit3/etc/profile.d/cms_ui_env.sh
source /cvmfs/sft.cern.ch/lcg/views/LCG_84/x86_64-slc6-gcc49-opt/setup.sh

(or some other version, depending on your needs, but LCG_84 is pretty widely used). At this point, you should be able to open the root command line tool and check the version:

[tklijnsm@t3ui02 ~]$ root -l
root [0] gROOT->GetVersion()
(const char *) "6.06/02"

(the "-l" disables the graphics that root tries to display at startup). One can then open root files using all the root tools. For example, using a root file containing a bunch of simulated photons:

root /mnt/t3nfs01/data01/shome/tklijnsm/debug/for_nina/Ntup_10Nov_photon_EB_testing.root

Here is a copy-paste of some basic root commands. For the full instructions I would recommend some sort of tutorial, e.g. https://root.cern.ch/getting-started .

[tklijnsm@t3ui02 ~]$ root /mnt/t3nfs01/data01/shome/tklijnsm/debug/for_nina/Ntup_10Nov_photon_EB_testing.root
root [0]
Attaching file /mnt/t3nfs01/data01/shome/tklijnsm/debug/for_nina/Ntup_10Nov_photon_EB_testing.root as _file0...
(TFile *) 0x318c7e0
root [1]
root [1] .ls
TFile**        /mnt/t3nfs01/data01/shome/tklijnsm/debug/for_nina/Ntup_10Nov_photon_EB_testing.root    
TFile*        /mnt/t3nfs01/data01/shome/tklijnsm/debug/for_nina/Ntup_10Nov_photon_EB_testing.root    
  KEY: TDirectoryFile    een_analyzer;1    een_analyzer
root [2]
root [2] een_analyzer->cd()
(Bool_t) true
root [3]
root [3] .ls
TDirectoryFile*        een_analyzer    een_analyzer
KEY: TTree    PhotonTree;1    Photon data
root [4]
root [4] PhotonTree->Print()
******************************************************************************
*Tree    :PhotonTree: Photon data                                            *
*Entries :  1701509 : Total =      1125072627 bytes  File  Size =  492028615 *
*        :          : Tree compression factor =   2.29                       *
******************************************************************************
*Br    0 :NtupID    : NtupID/I                                               *
*Entries :  1701509 : Total  Size=    6808192 bytes  File Size  =    2742193 *
*Baskets :       18 : Basket Size=     658944 bytes  Compression=   2.48     *
............................................................................*
*Br    1 :eventNumber : eventNumber/I                                        *
*Entries :  1701509 : Total  Size=    6808302 bytes  File Size  =    3093062
*Baskets :       18 : Basket Size=     658944 bytes  Compression=   2.20     *
............................................................................*
*Br    2 :luminosityBlock : luminosityBlock/I                                *
*Entries :  1701509 : Total  Size=    6808390 bytes  File Size  =     773262
*Baskets :       18 : Basket Size=     658944 bytes  Compression=   8.80     *
............................................................................*
...


Opening root files on the storage element isn't very different. Be sure to have an active grid proxy (use voms-proxy-init -voms cms -valid 192:00) and use root as follows:

root [0]
(TFile *) 0x3857530
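
Presumably the open goes through the t3dcachedb xrootd door, as in the earlier qlogin example; a sketch reusing that same file:

root -l root://t3dcachedb.psi.ch//pnfs/psi.ch/cms/trivcat/store/user/jpata/tth/May14_v1/TTToSemiLeptonic_TuneCP5_PSweights_13TeV-powheg-pythia8/May14_v1/180514_170730/0002/tree_2654.root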

The root tools are of course the same.

Hopefully this helps! Let me know if you have any further questions.

Cheers,
Thomas





python3 environment on T3

Must include the usual keras/tf packages


Example of a python3 version:

/cvmfs/sft.cern.ch/lcg/releases/Python/3.5.2-a1f64/x86_64-slc6-gcc7-opt/bin/python3

/cvmfs/sft.cern.ch/lcg/releases/Python/3.5.2-a1f64/x86_64-slc6-gcc7-opt/bin/python3 -c "import tensorflow"



source /cvmfs/sft.cern.ch/lcg/views/LCG_84/x86_64-slc6-gcc49-opt/setup.sh

Slightly newer:
source /cvmfs/sft.cern.ch/lcg/views/LCG_91python3/x86_64-slc6-gcc7-opt/setup.sh




Hi Christoph,

Sorry for the slightly late reply, I just got back from my holiday. I'm not sure to what extent you solved the environment issue; usually I would recommend using our own anaconda installation, but I don't think we have python3 installed there. A decent alternative is then to use the pre-installed python from cvmfs, like Nina suggested. One can set up the environment conveniently using the supplied setup script:

source /cvmfs/sft.cern.ch/lcg/views/LCG_91python3/x86_64-slc6-gcc7-opt/setup.sh

Check:

[tklijnsm@t3ui02 ~]$ which python3
/cvmfs/sft.cern.ch/lcg/views/LCG_91python3/x86_64-slc6-gcc7-opt/bin/python3

Keras and TensorFlow are installed, but the versions seem to be quite old:

[tklijnsm@t3ui02 ~]$ python3.5
Python 3.5.2 (default, Sep 27 2017, 15:19:50)
[GCC 7.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.__version__
'1.0.0'
>>> import keras
Using TensorFlow backend.
>>> keras.__version__
'1.1.0'

You can 'pip install --user' newer versions if needed, and make sure to update the PYTHONPATH variable to find user installed packages.
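
For example (an untested sketch, assuming pip is available in the LCG view and uses the default ~/.local install location):

python3.5 -m pip install --user --upgrade keras tensorflow
export PYTHONPATH=$HOME/.local/lib/python3.5/site-packages:$PYTHONPATH   # so the user-installed versions are picked up first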


Regarding the second part of the question, running some scripts on the worker nodes, the TWiki has a reasonably elaborate instruction on how to set up scripts to interact with the batch here: https://wiki.chipp.ch/twiki/bin/view/CmsTier3/HowToSubmitJobs. For your use case (only small dataframe files as output), I think it's acceptable to simply write directly to your shome. If you start to create bigger files it's indeed better to write to /scratch on the worker node, and copy the output afterwards, but for small files I don't see any objection to staging out directly to shome (do you agree Nina?).

I don't know exactly what your code looks like, but I assume you simply want to run some python script which generates some output. A simple jobscript.sh may look like the following:

<contents of jobscript.sh>
# First two lines specify where the stdout and stderr are dumped to:
#$ -o /mnt/t3nfs01/data01/shome/grab/jobstdout 
#$ -e /mnt/t3nfs01/data01/shome/grab/jobstdout
# Setup the environment:
export VO_CMS_SW_DIR=/cvmfs/cms.cern.ch/
source /cvmfs/cms.cern.ch/cmsset_default.sh
source /swshare/psit3/etc/profile.d/cms_ui_env.sh
source /cvmfs/sft.cern.ch/lcg/views/LCG_91python3/x86_64-slc6-gcc7-opt/setup.sh
# Go to your python script and run it:
cd /mnt/t3nfs01/data01/shome/grab/path/to/pythonscript
python3.5 your_script.py

The paths are made up and the variations are endless, but hopefully this describes the idea to some extent. Then submit the job to the short queue using:

qsub -q short.q jobscript.sh

Typically if I need to submit ~50 jobs, I would create 50 jobscripts, e.g. jobscript_0.sh, ..., jobscript_49.sh.
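
A throwaway loop for that could look like the following (sketch only: the paths are the made-up ones from the jobscript above, and passing the loop index as an argument to your_script.py is an assumption about how the work would be split over the 50 jobs):

for i in $(seq 0 49); do
  cat > jobscript_${i}.sh <<EOF
#$ -o /mnt/t3nfs01/data01/shome/grab/jobstdout
#$ -e /mnt/t3nfs01/data01/shome/grab/jobstdout
export VO_CMS_SW_DIR=/cvmfs/cms.cern.ch/
source /cvmfs/cms.cern.ch/cmsset_default.sh
source /swshare/psit3/etc/profile.d/cms_ui_env.sh
source /cvmfs/sft.cern.ch/lcg/views/LCG_91python3/x86_64-slc6-gcc7-opt/setup.sh
cd /mnt/t3nfs01/data01/shome/grab/path/to/pythonscript
python3.5 your_script.py ${i}
EOF
  qsub -q short.q jobscript_${i}.sh
done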

Hope this helps! Let me know if you have any further questions.

Cheers,

Thomas






Explanation about waiting room, drain, etc.

Friday 13 July 2018 


Life Status

Life Status regulates if a site is removed from the CMS computing grid. There are three states (in addition to undetermined):

* Ok            (O) the site is in service
* Waiting-Room  (WR) the site is temporarily out of service
* Morgue        (M) the site is out of service for an extended period of time

Life Status is currently evaluated early in the morning, after Site Readiness, for the day. Based on the Site Readiness during the previous two weeks, the status is set for each Tier-0, 1, and 2 site.

If a site receives the fifth Site Readiness Error state within two weeks and its previous day Life Status is Ok (or unknown), it is temporarily removed from service by changing its Life Status to Waiting-Room. If a site's previous day Life Status is Waiting-Room (or unknown) and it receives the third Ok state in a row, it is put back into service, i.e. Life Status is set to Ok. For Tier-2 and 3 sites, Error states on weekends are not counted but Ok states on weekends are counted. (Tier-2 and 3 sites are operated 8x5. Errors/failures on weekends are thus not expected to be corrected until Monday. However, we want to quickly put a working site back into service, thus counting weekend Ok states to change Life Status from Morgue or Waiting-Room to Ok.) If a site has a Life Status of Waiting-Room for 30 days, its Life Status changes to Morgue. If a site with Life Status of Morgue receives the fifth Site Readiness Ok state in a row, it is scheduled for recommissioning, i.e. its Life Status is changed to Waiting-Room. Recommissioning will take about a week and the site is held with a manual Life Status override in the Waiting-Room state during that time. After recommissioning the override is cleared and its state can change to Ok based on the above third-Ok-state-in-a-row rule.

Prod Status

Prod Status regulates if a site is enabled for production activities. There are four states (in addition to undetermined):

* enabled    site is fully enabled for production
* drain      new production workflows should exclude the site
* disabled   site disabled for production jobs
* test       site being tested for production

Prod Status is currently evaluated in the early morning, after Life Status, for the day. Based on Site Readiness during the previous ten days the status is set for each Tier-0, 1, and 2 site.

Prod Status can be manually overridden for a site by members of the production team, the site administrators, or the site support team.
In case of no manual override, the Prod Status of sites with a Life Status of Waiting-Room are set to drain and with a Life Status of Morgue to disabled.
In case of scheduled or unscheduled full site downtime(s) of more than 24 hours within the next 48 hours, Prod Status of the site is set to drain.
Otherwise, if a site receives the second Site Readiness Error state within three days and its Prod Status is enabled (or unknown), it is temporarily disabled for new production workflows by setting Prod Status to drain. If a site's previous day Prod Status is drain, disabled, or unknown and it receives the second Ok state in a row, it is reenabled for production activities, i.e. Prod Status is set to enabled. For Tier-2 and 3 sites, Error states on weekends are not counted but Ok states on weekends are counted. (Tier-2 and 3 sites are operated 8x5. Errors/failures on weekends are thus not expected to be corrected until Monday. However, we want to quickly reenable a working site for production, thus counting weekend Ok states to change Prod Status from disabled or drain to enabled.)



T2 VO boxes request





Dear Thomas,

CMS operations (Christoph Wissing) has asked us to add two additional
VO-Box Squid servers for us to be able to do the Tier-0 scale-up tests
we mentioned in the last F2F meeting.

Since the CMS Vo-Box is managed by you, I wanted to ask you if you would
be available and willing to do this. We (CSCS) would provide the VMs to
you the same way we do with the existing one. I don't think we need
Phedex (nor grid certificates) on these two extra nodes, so that would
make things a bit easier for everyone.

We can discuss details and time lines after, since we are also
discussing our side internally, I just wanted to know if you could get
involved in this in the next few days.

Thanks a lot!
BR/Pablo



Hi all,

Adding also Derek.

The management of the CMS VO-box has been mostly done by Derek so far, although indeed I should have taken this over from him a while ago. Nevertheless I think I would learn a lot from trying to set this up, and since Phedex is not needed I think I should be able to handle this.

The timescale I am unfortunately not so sure about; I have a conference talk coming up and a paper to finish, so I will for sure not be able to handle more work before July 26. I am then back to work for a week, and then I participate in a summer school for a week in August. The summer is always a somewhat busy time in particle physics.

Would you be able to give a rough estimate of how much time you think we need to spend on installing these two squid servers, and when you would ultimately need them? 

Cheers,

Thomas

T2 Clean up Tuesday 26 June 2018

--------------------------------------
Tuesday 3 July 2018 

Following up on last week:

347.5 TB  : user
  49.6 TB   : user/cgalloni
  40.1 TB   : user/oiorio
  39.2 TB   : user/ytakahas
  22.4 TB   : user/decosa
  16.4 TB   : user/jpata
  11.8 TB   : user/dpinna
  11.5 TB   : user/sdonato
  10.5 TB   : user/grauco
  10.4 TB   : user/cgiuglia
  10.2 TB   : user/mschoene


Dear CSCS storage user,

I am following up on my request from last week to clean up on the T2 SE. With an increased sense of urgency I implore you all to look critically at what you have stored on the T2, and whether you really need to keep it around, especially if you are using over 20 TB.

Current state:
  49.6 TB   : user/cgalloni
  40.1 TB   : user/oiorio
  39.2 TB   : user/ytakahas
  22.4 TB   : user/decosa
  16.4 TB   : user/jpata
  11.8 TB   : user/dpinna
  11.5 TB   : user/sdonato
  10.5 TB   : user/grauco
  10.4 TB   : user/cgiuglia

Please clean up no later than 10 July. For the instructions, see my email from one week ago.

Best regards,

Thomas Klijnsma


--------------------------------------
Tuesday 26 June 2018 

Follow up on:
Tuesday 3 July 2018


  49.6 TB   : user/cgalloni
  39.2 TB   : user/ytakahas
  39.2 TB   : user/oiorio
  22.4 TB   : user/decosa
  16.4 TB   : user/jpata
  11.8 TB   : user/dpinna
  11.5 TB   : user/sdonato
  10.5 TB   : user/grauco
  10.4 TB   : user/cgiuglia



User 'cgalloni': camilla.galloni@cern.ch
User 'ytakahas': Yuta.Takahashi@cern.ch
User 'oiorio': alberto.orso.maria.iorio@cern.ch
User 'jpata': joosep.pata@cern.ch
User 'dpinna': dpinna@physik.uzh.ch
User 'sdonato': silvio.donato@cern.ch
User 'grauco': giorgia.rauco@cern.ch
User 'cgiuglia': unmatched




Dear CSCS storage user,

As the storage on T2_CH_CSCS is beginning to reach its quota, it is time to clean up old files. You are currently using over 10TB in storage. Please have a look at your files and delete what you do not need anymore. Once the disk usage reaches critical levels important services may crash, so please delete your old files as soon as you can, preferably within a week from now (Tuesday July 3rd 2018).

You can list and delete your files using uberftp:

uberftp storage01.lcg.cscs.ch 'ls /pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm'

uberftp -debug 2 storage01.lcg.cscs.ch 'rm -r /pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm/path/to/file'

uberftp storage01.lcg.cscs.ch 'rm -r /pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm/path/to/file'


There is no easy-to-read overview of your disk usage online right now. If you are very interested, you may parse the output from the link here:
http://ganglia.lcg.cscs.ch/ganglia/files_cms.html
(Warning: very big file - better to download and view in a text editor, or grep for your username)

Best regards,
Thomas Klijnsma





Phedex cleaning May 2018

Thursday 17 May 2018 

To delete:


done:
<submitted>




2018 - keep this around to be sure
https://cmsweb.cern.ch/phedex/prod/Request::View?request=852161  kx
https://cmsweb.cern.ch/phedex/prod/Request::View?request=1140967  k (not 100%)



-----------------------------------

Only for transfer to PSI; deleting from CSCS.


VHbb analysis, might be slower than originally thought.





Link to del requests:



Cleaning up PhEDEx

Tuesday 18 July 2017






All datasets:

D: delete, K: keep

<< all relatively recent >>


Threw away the following:


245406 349428 384339 464390 570779 584620 589972 591984 600665 601841 601875 608035 613468 775016 788773 809466


Deletion request here:
and another small one here:
https://cmsweb.cern.ch/phedex/prod/Request::View?request=1045317


-------------------------------------------------------------
Old note on PhEDEx clean up:




Data -> Subscriptions

[10:51]  
Select Data

[10:51]  
group: local
--> group local IS IMPORTANT!

you should make sure that you're only looking at datasets in the "local" group though
the official ones coming via Dynamo we can't touch
they are handled by a script



[10:51]  
select CSCS in the list

under replica/move choose replica



Example of process:

Order by size, look at something old
Use physics knowledge --> reco can be removed if rereco is done for example

Look at the retention date or contact the requestor

Make a query in the form of /A/B/C to find all corresponding datasets, and hit delete

Approve the deletion request



F2F


- Be on new ticket mailing list
- Be on SAM-warning mailing list

- Is there really a lack of CMS job pressure?
    Who within CMS do I ask about this?

- Who is responsible for the CMS monitoring?
    Who can I contact for questions reg. the 
    dashboard?

- Write monpack that simply gets the live SAM status, and writes it to a .txt file:

    <timestamp>;<SAM>;<other_relevant_var>

    Use other software to trigger the emails.
    That way also the SAM history is nicely kept
    (Although also easily available in the db ofc)

    Do this also for scratch/shome/T3SE/T2SE
    overviews

- Next F2F: January

- EGI membership:
    Apparently costs 62k per year
    for basically a ticketing system
    Not sure how to run without gocdb;
    lose the direct comm. with CMS/ATLAS,
    you lose voting rights, and (important!) 
    the security scans, which is the only thing
    worth something.

- Phoenix to be shut down some time next year
    Take over completely with Piz Daint



-----------------------------------------
CSCS
35
92.24%

HPC
46
78.18%



cpu efficiency:

NOW
0.661850981401,T2_CH_CSCS
0.616431656383,T2_CH_CSCS_HPC

Last F2F meeting:
0.673308534196,T2_CH_CSCS
0.676412051542,T2_CH_CSCS_HPC

cvmfs mount dead on wn33

Monday 11 June 2018 

[tklijnsm@t3ui02 differentialCombination2017]$ cat out/Scan_Yukawa_Jun11_hzz_NONscalingbbH_couplingdependentBRs/job__SCAN_hzz_Yukawa_reweighted_couplingdependentBRs_0_0.sh.e2104485
/gridware/sge/default/spool/t3wn33/job_scripts/2104485: line 4: /cvmfs/cms.cern.ch/cmsset_default.sh: No such file or directory
/gridware/sge/default/spool/t3wn33/job_scripts/2104485: line 10: scramv1: command not found
/gridware/sge/default/spool/t3wn33/job_scripts/2104485: line 12: combine: command not found
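
A quick way to check the mount on the node itself (as root; a sketch, assuming the standard cvmfs client tools are installed on the WN):

cvmfs_config probe cms.cern.ch             # should report OK if the repository mounts
ls /cvmfs/cms.cern.ch/cmsset_default.sh    # the file the job script failed to source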


Renewing grid certificate

Friday 8 June 2018 

Expired again, so renewed again. The steps still work fine.

One thing to watch out for:

openssl pkcs12 -in mycert.p12 -nocerts -out userkey.pem

asks for a password again; this time I used the same one as the login (W)


---------------------------------

Thursday 4 May 2017 


Go to https://ca.cern.ch/ca/ , New grid certificate
Took the same password as the login; without a password, further commands gave some error messages

Save old certificates in a directory to be sure

On UI (both psi and lxplus), copy to ~/.globus/mycert.p12 

Then follow steps outlined here:

openssl pkcs12 -in mycert.p12 -clcerts -nokeys -out usercert.pem
openssl pkcs12 -in mycert.p12 -nocerts -out userkey.pem
chmod 400 userkey.pem
chmod 400 usercert.pem


----------------------------------------
From old note:

At this point the debug command should already work:

[tklijnsm@lxplus009 .globus]$ grid-proxy-init -debug -verify

User Cert File: /afs/cern.ch/user/t/tklijnsm/.globus/usercert.pem
User Key File: /afs/cern.ch/user/t/tklijnsm/.globus/userkey.pem

Trusted CA Cert Dir: /etc/grid-security/certificates

Output File: /tmp/x509up_u77787
Your identity: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tklijnsm/CN=764859/CN=Thomas Klijnsma
Enter GRID pass phrase for this identity:
Creating proxy .............................++++++
........++++++
Done
Proxy Verify OK
Your proxy is valid until: Fri Apr 29 04:48:03 2016


4 Registered with the VO CMS: https://voms2.cern.ch
This probably never needs to be done again

5
[tklijnsm@lxplus009 .globus]$ voms-proxy-init -voms cms
Enter GRID pass phrase for this identity:
Contacting lcg-voms2.cern.ch:15002 [/DC=ch/DC=cern/OU=computers/CN=lcg-voms2.cern.ch] "cms"...
Remote VOMS server contacted succesfully.


Created proxy in /tmp/x509up_u77787.

Your proxy is valid until Fri Apr 29 04:54:21 CEST 2016




Command to list running/queued per user

Wednesday 30 May 2018 

qstat -u \* | tail -n +3 | awk '{if($5=="r"){r[$4]++} j[$4]++} END { for(n in j) printf "%7s / %-5s - %s\n",r[n],j[n],n }'
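
The same command spread out (readability only, identical behaviour):

# tail -n +3 skips the two-line qstat header
qstat -u \* \
  | tail -n +3 \
  | awk '{
        if ($5 == "r") r[$4]++    # $5 = job state, $4 = owner: count running jobs per user
        j[$4]++                   # count all jobs (running + queued) per user
      }
      END {
        for (n in j) printf "%7s / %-5s - %s\n", r[n], j[n], n   # "running / total - owner"
      }'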


New T3 login procedure

Thursday 3 May 2018 

(In 'Method 2', the hop step essentially becomes absorbed in the subsequent ssh command. Probably not worth the effort.)


#1 ---------------------------------------------------
make psiagent

~$ psiagent
Agent pid 38383
Identity added: /Users/thomas/.ssh/psi/id_rsa (/Users/thomas/.ssh/psi/id_rsa)
Prepared ssh-agent:
2048 SHA256:ieyRXTOj48KeQigTbb2x5hXD09HSk83fQ3Vud2HoRs8 /Users/thomas/.ssh/id_rsa (RSA)
2048 SHA256:EYxdQhc6LMtlHlFDgiLXYe8vSXHOqFf2o4n/B+ynkeg /Users/thomas/.ssh/psi/id_rsa (RSA)

(this runs the following under the hood:
eval `ssh-agent`
ssh-add ~/.ssh/psi/id_rsa
echo "Prepared ssh-agent:"
ssh-add -l
)

Without the agent, you can get up to wmgt01, but access to the T3 will be blocked.
balrog seems to be fully deprecated.


#2 ---------------------------------------------------
to hop.psi.ch

ssh -A klijnsma_t@hop.psi.ch -i .ssh/psi/id_rsa -vvv

--> Note: the password is not the usual one! See pwnotes

Creating home directory for klijnsma_t.
Last failed login: Thu May  3 11:38:31 CEST 2018 from pb-d-128-141-188-193.cern.ch on ssh:notty
There were 2 failed login attempts since the last successful login.
**************************************************************************
*                                                                        *
*                SSH Gateway, Paul Scherrer Institute (PSI)              *
*                                                                        *
**************************************************************************
*                                                                        *
*                            NOTICE TO USERS                             *
*                            ---------------                             *
*                                                                        *
*  This system is the property of PSI.  It is for authorized use only.   *
*  Users (authorized or unauthorized) have no explicit or implicit       *
*  expectation of privacy.                                               *
*                                                                        *
*  Unauthorized or improper use of this system may result in             *
*  administrative disciplinary action and civil and criminal penalties.  *
*  By continuing to use this system you indicate your awareness of and   *
*  consent to these terms and conditions of use.  LOG OFF IMMEDIATELY    *
*  if you do not agree to the conditions stated in this message.         *
*                                                                        *
**************************************************************************

[klijnsma_t@hop]$

#3 ---------------------------------------------------
to wmgt01.psi.ch


--> This is the password with the pin+rsa code; own pin comes first, then the rsa token

klijnsma_t@wmgt01.psi.ch's password: debug3: Received SSH2_MSG_IGNORE
debug3: Received SSH2_MSG_IGNORE
...
debug3: Received SSH2_MSG_IGNORE

Last login: Mon Dec 18 16:34:02 2017 from pb-d-128-141-188-193.cern.ch
##########################################################################
##                                                                      ##
##                    KEY FINGERPRINTS FOR WMGT01/02                    ##
##                                                                      ##
##########################################################################

wmgt01.psi.ch:
    2048 e2:b4:0f:3b:11:63:68:a0:0d:73:b3:13:fd:ac:aa:05   (RSA)
    256 0e:51:92:cb:fe:84:93:0f:85:fd:2d:10:16:80:ef:5c   (ECDSA)

wmgt02.psi.ch:
    2048 7e:0c:53:3c:b8:fb:f6:58:0d:b5:53:01:6d:8d:e5:8f   (RSA)
    256 ca:1a:11:9e:53:a9:de:75:b9:a1:19:61:00:a8:0c:ae   (ECDSA)

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

    ALL YOUR ACTIVITY ON THIS SYSTEM WILL BE LOGGED!

    Please report problems to achim.gsell@psi.ch.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

wmgt01:~$


#4 ---------------------------------------------------
to root@t3admin01

wmgt01:~$ ssh -A root@t3admin01
Last login: Wed May  2 16:05:36 2018 from wmgt01
IS5500, E5460, E2760 GUI : /opt/SMgr/client/SMclient
Supermicro IPMI View     : cd /opt/SUPERMICRO/IPMIView && ./IPMIView20.sh
ILOM Java GUIs           : /opt/firefox-3.5/firefox
[root@t3admin01 ~]#
[root@t3admin01 ~]#

Should not ask for a password; if it does, the agent was not passed on correctly.
Make sure all the subsequent ssh calls to get here were called with the -A flag.


Check output of 'ssh-add -l'; should be something like this (this has verbose output):

[root@t3admin01 ~]# ssh-add -l
debug1: client_input_channel_open: ctype auth-agent@openssh.com rchan 2 win 65536 max 16384
debug1: channel 1: new [authentication agent connection]
debug1: confirm auth-agent@openssh.com
debug1: channel 1: FORCE input drain
2048 bc:ae:26:f5:a6:20:9c:94:b7:6d:76:09:f9:0d:ae:12 /Users/thomas/.ssh/id_rsa (RSA)
2048 bd:6d:18:0d:04:fd:78:1e:5c:e1:11:d3:d4:f1:b8:ac /Users/thomas/.ssh/psi/id_rsa (RSA)
[root@t3admin01 ~]# debug1: channel 1: free: authentication agent connection, nchannels 2




Logging on admin

Tuesday 14 March 2017 

Make sure to add the specific PSI key to the agent:
~$
~$ eval `ssh-agent`
Agent pid 73273
~$ ssh-add -l
2048 SHA256:ieyRXTOj48KeQigTbb2x5hXD09HSk83fQ3Vud2HoRs8 /Users/thomas/.ssh/id_rsa (RSA)
~$
~$ ssh-add .ssh/psi/id_rsa
Identity added: .ssh/psi/id_rsa (.ssh/psi/id_rsa)
~$

Then log in:
~$ ssh -A -p 234 -l klijnsma_t -XY wmgt01.psi.ch -v -i .ssh/psi/id_rsa
OpenSSH_6.9p1, LibreSSL 2.1.8
debug1: Reading configuration data /Users/thomas/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 21: Applying options for
debug1: /etc/ssh/ssh_config line 56: Applying options for *
debug1: Connecting to wmgt01.psi.ch [129.129.190.25] port 234.
debug1: Connection established.
debug1: identity file .ssh/psi/id_rsa type 1
debug1: key_load_public: No such file or directory
debug1: identity file .ssh/psi/id_rsa-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_6.9
debug1: Remote protocol version 2.0, remote software version OpenSSH_6.6.1
debug1: match: OpenSSH_6.6.1 pat OpenSSH_6.6.1 compat 0x04000000
debug1: Authenticating to wmgt01.psi.ch:234 as 'klijnsma_t'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client chacha20-poly1305@openssh.com <implicit> none
debug1: kex: client->server chacha20-poly1305@openssh.com <implicit> none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:RF9FhMVoFEIRAtA6/bYotUgCKnBJjufnCtVxb9H8234
debug1: Host '[wmgt01.psi.ch]:234' is known and matches the ECDSA host key.
debug1: Found key in /Users/thomas/.ssh/known_hosts:299
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: Roaming not allowed by server
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: password,keyboard-interactive
debug1: Next authentication method: keyboard-interactive
Password:
debug1: Authentication succeeded (keyboard-interactive).
Authenticated to wmgt01.psi.ch ([129.129.190.25]:234).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: Requesting X11 forwarding with authentication spoofing.
debug1: Requesting authentication agent forwarding.
debug1: Sending environment.
debug1: Sending env LC_ALL = en_US.UTF-8
debug1: Sending env LC_CTYPE = en_US.UTF-8
Last login: Tue Mar 14 10:36:50 2017 from pb-d-128-141-188-193.cern.ch
##########################################################################
##                                                                      ##
##                    KEY FINGERPRINTS FOR WMGT01/02                    ##
##                                                                      ##
##########################################################################

wmgt01.psi.ch:
    2048 e2:b4:0f:3b:11:63:68:a0:0d:73:b3:13:fd:ac:aa:05   (RSA)
    256 0e:51:92:cb:fe:84:93:0f:85:fd:2d:10:16:80:ef:5c   (ECDSA)

wmgt02.psi.ch:
    2048 7e:0c:53:3c:b8:fb:f6:58:0d:b5:53:01:6d:8d:e5:8f   (RSA)
    256 ca:1a:11:9e:53:a9:de:75:b9:a1:19:61:00:a8:0c:ae   (ECDSA)

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

     ALL YOUR ACTIVITY ON THIS SYSTEM WILL BE LOGGED!

     Please report problems to achim.gsell@psi.ch.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

wmgt01:~$
wmgt01:~$


This should be passwordless:

wmgt01:~$ ssh -A root@t3admin01
debug1: client_input_channel_open: ctype auth-agent@openssh.com rchan 4 win 65536 max 16384
debug1: channel 1: new [authentication agent connection]
debug1: confirm auth-agent@openssh.com
debug1: channel 1: FORCE input drain
debug1: channel 1: free: authentication agent connection, nchannels 2
Last login: Tue Mar 14 10:02:32 2017 from wmgt01
IS5500, E5460, E2760 GUI : /opt/SMgr/client/SMclient
Supermicro IPMI View     : cd /opt/SUPERMICRO/IPMIView && ./IPMIView20.sh
ILOM Java GUIs           : /opt/firefox-3.5/firefox
debug1: client_input_channel_open: ctype auth-agent@openssh.com rchan 4 win 65536 max 16384
debug1: channel 1: new [authentication agent connection]
debug1: confirm auth-agent@openssh.com
debug1: channel 1: FORCE input drain
debug1: channel 1: free: authentication agent connection, nchannels 2
[root@t3admin01 ~]#
[root@t3admin01 ~]#





[root@t3admin01 ~]#
[root@t3admin01 ~]# ssh -A root@t3nfs01
debug1: client_input_channel_open: ctype auth-agent@openssh.com rchan 4 win 65536 max 16384
debug1: channel 1: new [authentication agent connection]
debug1: confirm auth-agent@openssh.com
debug1: channel 1: FORCE input drain
debug1: channel 1: free: authentication agent connection, nchannels 2
Last login: Tue Mar 14 10:53:39 2017 from t3admin01.psi.ch
debug1: client_input_channel_open: ctype auth-agent@openssh.com rchan 4 win 65536 max 16384
debug1: channel 1: new [authentication agent connection]
debug1: confirm auth-agent@openssh.com
debug1: channel 1: FORCE input drain
[root@t3nfs01 ~]# debug1: channel 1: free: authentication agent connection, nchannels 2

[root@t3nfs01 ~]#
[root@t3nfs01 ~]# whoami
root
[root@t3nfs01 ~]#


Manually start a SAM test

MC test failing for T2

Monday 23 April 2018 



The failing SAM test is MC, which also tests the stage-out. Judging from the logs, this is where the problem is:


Relevant part:

diskCacheV111.services.space.SpaceAuthorizationException: Unable to reserve space: user not authorized to reserve space in any linkgroup.

Full error message:

stderr:
gfal-copy error: 70 (Communication error on send) - globus_ftp_client: the server responded with an error 451 diskCacheV111.services.space.SpaceAuthorizationException: Unable to reserve space: user not authorized to reserve space in any linkgroup.   
)
        ErrorCode : 60311
        ModuleName : StageOutError
        MethodName : __init__
        ErrorType : GeneralStageOutFailure
        ClassInstance : None
        FileName : /scratch/lcg/scratch/phoenix4/ARC_sessiondir/OUyNDmnkkUsnlztIepovt9HmABFKDmABFKDmGMTKDmABFKDmygTj6m/nagios/probes/org.cms/testjob/cms-MC-test/lib/python2.6/site-packages/WMCore/Storage/StageOutError.py
        ClassName : None
        Command : #!/bin/bash
env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh 2> /dev/null; date; gfal-copy -t 2400 -T 2400 -p -K adler32  file:///scratch/lcg/scratch/phoenix4/ARC_sessiondir/OUyNDmnkkUsnlztIepovt9HmABFKDmABFKDmGMTKDmABFKDmygTj6m/nagios/probes/org.cms/testjob/cms-MC-test/TEST-FILE srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/trivcat/store/unmerged/SAM/StageOutTest-132133-Mon-Apr-23-09_10_49-2018'
            EXIT_STATUS=$?
            echo "gfal-copy exit status: $EXIT_STATUS"
            if $EXIT_STATUS = 0; then
               echo "Non-zero gfal-copy Exit status!!!"
               echo "Cleaning up failed file:"
                env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh 2> /dev/null; date; gfal-rm -t 600 srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/trivcat/store/unmerged/SAM/StageOutTest-132133-Mon-Apr-23-09_10_49-2018 '
               exit 60311
            fi
            exit 0
            
        LineNumber : 32
        ErrorNr : 0
        ExitCode : 151











Disk pledge

Monday 16 April 2018 



Dear all,

Today the attached email was sent to the T2 CMS hypernews, which indicates that our disk pledge is not yet updated for 2018. The email specifies the following values for CSCS pledges for CMS:

2017 DDM Physics partition: 980 TB
2018 DDM Physics partition to be set now: 980 + 3% (~1010) TB

I'm a bit confused as to where the number 980 is coming from, as the current CMS disk usage at CSCS is much higher than that. From the last F2F meeting the numbers were 40%*4000 = 1600 TB, and currently up to 1687 TB. I suppose the DDM quota is some fixed fraction of the total disk pledge?



I intend to send the following reply:

"""
Dear Thomas Kress,

According to the PDF document you sent around there is a current pledge of CSCS for 980 TB, is that correct? This number seems to be much lower than 

"""



------------------------------------

Dear CMS Tier-2 site support & responsibles,


thanks to the T2_*_* (and T1s) which replied to my query about the 2018 DDM disk pledges.


Attached is a table with the information I received so far.
Sites/countries which did not reply are marked in red. For those sites we will increase (not yet updated in the table) the DDM disk pledge by 3% (which is the total CMS T2 disk pledge increase 2017->2018) relative to the 2017 DDM threshold given in the green 2017 table column if not other feedback is received until Tuesday.

   Greetings, Thomas.

------------------------------------

T2_CH_CSCS (no reply)
  2017 DDM Physics partition: 980
  2018 DDM Physics partition to be set now: 980

------------------------------------

Dear Derek,

I'm sorry for the delay in the answer, we've had some internal
misunderstandings on who should answer. I believe the figures you're
looking for are the same reported within the last F2F meeting
presentations (and within the LHConCRAY project), which are:

- 78000 HS06 for compute (pledged, 87k installed) and
- 4000 TB for storage. The Storage figures may have been pledged smaller
(400 TB were added AFTER the pledges were defined) so there might be a
discrepancy (for the better).

I believe the pledges are then defined as percentages of these figures,
but I am not the one setting them (Christoph can maybe answer that).

Now, regarding the actual assignments (system configuration) from the
last F2F figures:

- For storage, and because the granularity of the pools, CMS has 1687 TB
out of 4166 TB, that is 40%
- For compute, we're still using the 40:40:20 shares in the Slurm config
(but the actual usages differ substantially). I am not sure when we need
to change that fair share target (does this 38% starts from next April
or should have already started?).

Hope that helps!
BR/Pablo

Singularity failing

Thursday 22 March 2018 

Link to CRAB3 output:
Hard to find the actual jobs though.

-> Miguel says that actually the 'kernel panic' happens only at the end of a job


---------------------------
miguelgila [13:02]
Thomas @tklijnsm I’m seeing something really odd: since CMS has shifted production to singularity-only, nodes keep crashing with kernel panics (edited)
in the past we’ve seen this happening when jobs use bind mounts and overlayfs (this is part of the system). So we’re wondering if CMS production jobs using singularity are causing this…
would it be possible to get an example job, or an example call to singularity so we can evaluate if this is what’s causing the problem? Our initial tests and the CMS monitoring never triggered a kernel panic, though (edited)

Thomas,
As mentioned via chat, I see that since CMS moved to full singularity production jobs, nodes crash constantly. This is something that we think might be related to singularity plus a mix of overlayfs and bind mounts, but we never encountered this on any of our tests, nor with the automated CMS checks that have run so far.
Do you think it could be possible for us to get a few job examples with the precise singularity calls done by production jobs, so we could try reproducing this and, if necessary, bring it to Cray for evaluation/fix?
This is very important: when nodes crash hard, they get evicted from Lustre and everything hangs for ~5min across the full machine. In this situation, we’ve been forced to disable new job starts on the LHConCRAY partition.
Thanks,
Miguel


--------------------------
Ticket:


Singularity crashes nodes on T2_CH_CSCS_HPC


Since the move to full singularity workflows, nodes on T2_CH_CSCS_HPC have been crashing constantly. We are unable to reproduce this crashing behavior of production jobs with our tests. A possible cause could be related to singularity plus a mix of overlayfs and bind mounts, but no tests have so far indicated any problems.

Debugging would be greatly facilitated if we could reproduce the singularity calls made by the payloads. Could we be given a pointer to the code that makes the calls to singularity, especially at the end of a job?

Note that for now, CMS jobs have been disabled from T2_CH_CSCS_HPC to keep the overall system stable - hence the increased urgency of this ticket. In case this ticket is not submitted to the optimal support unit or with the optimal "Type of issue", feel free to change these.

Best,
Thomas Klijnsma




Explanation of pilot jobs

Factory entries for CEs at T2_CH_CSCS and T2_CH_CSCS_HPC


TO: "CMS Support Unit"
"CMS Glidein Factory"

Based on an e-mail from Stephan Lammel (16-03), the CMS configuration of our sites T2_CH_CSCS and T2_CH_CSCS_HPC indicates that we are not fully enabled for singularity workflows [1]. From the site engineers' point of view, the site should be able to run all singularity workflows, and thus the CMS configuration may be updated. Could you change the required settings from the CMS side? If there are any further modifications required at the site level, please let us know.

Best regards,
Thomas Klijnsma

Relevant lines:
T2_CH_CSCS: arc01.lcg.cscs.ch factory=CERN glexec=glite OS=rhel6
T2_CH_CSCS: arc01.lcg.cscs.ch factory=OSG glexec=glite OS=rhel6
T2_CH_CSCS: arc01.lcg.cscs.ch factory=UCSD glexec=glite OS=rhel6
T2_CH_CSCS_HPC: arc04.lcg.cscs.ch factory=CERN glexec=glite OS=rhel6
T2_CH_CSCS_HPC: arc04.lcg.cscs.ch factory=OSG glexec=glite OS=rhel6
T2_CH_CSCS_HPC: arc04.lcg.cscs.ch factory=UCSD glexec=glite OS=rhel6

Note that two CEs were not listed: arc02.lcg.cscs.ch and arc03.lcg.cscs.ch.

Jupyter hub

Monday 19 March 2018 


#1 ----------------
Add the following lines to .ssh/config :

Host psi*
    PreferredAuthentications=publickey,password

Host psi2
    User tklijnsm
    LocalForward 8915 localhost:8915
    HostName t3ui02.psi.ch

(The port number is somewhat arbitrary; it just has to be free on the laptop at the moment the tunnel is opened.)


#2 ----------------
ssh into PSI using:
ssh psi2


#3 ----------------
Set the following environment:

source /cvmfs/cms.cern.ch/slc6_amd64_gcc493/external/gcc/4.9.3/etc/profile.d/init.sh
source /cvmfs/cms.cern.ch/slc6_amd64_gcc493/external/bootstrap-bundle/1.0/etc/profile.d/init.sh
source /cvmfs/cms.cern.ch/slc6_amd64_gcc493/lcg/root/6.06.04/bin/thisroot.sh
export PATH=/swshare/anaconda/bin:$PATH


#4 ----------------
Start up a jupyter server with the following command:

jupyter notebook --port 8915 --no-browser


#5 ----------------
Open http://localhost:8915/ in a browser



Commonly people run this in a screen session so it doesn't die when logging out.
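A minimal sketch of that screen workflow (the session name is arbitrary):

screen -S jupyter                           # start a named screen session on the UI
jupyter notebook --port 8915 --no-browser   # same command as above
# detach with Ctrl-a d and log out; later log back in and reattach:
screen -r jupyter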

Everybody can essentially log in to someone else's jupyter hub, so it's not very safe.



Email about Jupyter Hub

Monday 19 February 2018 


Hi Nina and Derek,

I think I have brought up this point before, but in case I didn't, let me make a small case for it: using ipython notebooks and jupyter hubs is becoming widespread practice in particle physics. I personally don't use them a lot, but many members of our group are running their own jupyter hubs on the T3 and do all their main analysis work in them.

Currently everybody sets up his own jupyter hub. This has two main disadvantages:
(1) It's a slightly painful process to set this up, which makes many of our short-term students (semester-project and master's thesis students) spend a significant portion of their time on non-physics work. It would of course be much nicer for them to log in to a pre-configured hub and get to work immediately.
(2) The way these hubs are currently set up, anybody on the T3 can log into anybody else's server, and also run commands on the UI as the owner of the hub. While of course we have great trust in each other on the T3, this is a bit of a security problem.

A nice solution would be some sort of central jupyter hub. From what I understand, there are two main issues: (1) We are a little bit constrained in the amount of manpower that we can assign to this problem; to this end I of course volunteer to help out as much as I can. (2) A central jupyter hub requires opening more ports on the network, which poses a greater security risk. I am not an expert on infosec, but from what I understand this problem could maybe be solved by only allowing access to the central hub via an ssh tunnel. I also recall that the ETH IT group has set up a central jupyter hub in a secure way, so if this point becomes a struggle I can of course contact them.

Considering the number of requests I have received, I think this is one of the highest items on the users' wish list (the second item being a grid queue for high-memory/high-runtime jobs that has >20 slots). Let me know if you think there are other show-stoppers that I missed. If there are no pressing technical constraints on implementing a central jupyter hub, I think it would be very nice to create one.

Cheers,
Thomas




Jupyter hub
    it's a jupyter server that you log in to via a web browser
    starts a jupyter workspace
    worried about safety, but should ask ISG
        it opens another access door
        they have something like this running securely
    beats everyone having to make their own server

T3 disk space cleaning

Thursday 15 February 2018 







Thursday 15 February 2018 

Dear all,

As our shome is nearly at full capacity, we are asking the users with the highest usage to clean up their files. Note that the recommended usage is about 120G. Please try to limit your usage to this number.


NAME                        AVAIL  USED  USEDDS  USEDSNAP
data01/shome/vtavolar       5.38G  395G    390G     4.49G
data01/shome/creissel       46.6G  353G    348G     5.34G
data01/shome/koschwei       52.9G  347G    347G     85.6M
data01/shome/mschoene       72.7G  327G    327G      699M
data01/shome/musella        86.5G  313G    313G     8.81M
data01/shome/grauco         86.4G  314G    313G      169M
data01/shome/pbaertsc        198G  202G    202G     16.0K
data01/shome/dbrzhech       41.9G  358G    199G      159G
data01/shome/dsalerno        207G  193G    193G     2.40M
data01/shome/wiederkehr_s    213G  187G    186G      429M
data01/shome/gaperrin        214G  186G    186G      389M
data01/shome/casal           219G  176G    176G      320K
data01/shome/sdonato         219G  167G    160G     7.30G

See also the instructions on how to use the storage element here, in case you would like to store big files for a longer period of time: https://wiki.chipp.ch/twiki/bin/view/CmsTier3/HowToAccessSe

Best regards,

Thomas Klijnsma


The e-mail address you entered couldn't be found. Please check the recipient's e-mail address and try to resend the message. If the problem continues, please contact your helpdesk.

New PSI login instructions

Friday 9 February 2018 







ssh -A -i ~/.ssh/psi/id_rsa -l klijnsma_t -XY hop.psi.ch 

Killing a stuck process on a user interface

----------------------------------------------------
Monday 29 January 2018 


[tklijnsm@t3ui02 ~]$ ps aux | grep tklijnsm

tklijnsm  30269 99.8  0.4 850360 626556 ?       R    Jan25 5050:07 combine /mnt/t3nfs01/da .......

This did the job:
kill -9 30269


----------------------------------------------------
Wednesday 17 May 2017 

Log in as root on a UI:


psiagent
swmgt




top:

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                              
180882 grauco    20   0  167m  692  680 R 100.0  0.0  50056:04 emacs  


try: kill 180882

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                              
180882 grauco    20   0  167m  692  680 R 99.9  0.0  50056:17 emacs

No output was given, but process was not terminated


[root@t3ui02 ~]# kill -2 180882


And it's gone!





For kill strength of signal, see:


Most Linux or UNIX users know that there is a kill(1) command to stop processes, but what are the options, what do they mean?

These options are called signals, which can be expressed as either numbers or names. Some well-known ones are "-1" or "-HUP". Also well known is "-9" (aka "-KILL").

-1 or -HUP - This argument makes kill send the "Hang Up" signal to processes. This probably originates from the modem/dial-in era. Processes have to be programmed to actually listen to this signal and do something with it. Most daemons are programmed to re-read their configuration when they receive such a signal. Anyway; this is very likely the safest kill signal there is, it should not obstruct anything.
-2 or -SIGINT - This is the same as starting some program and pressing CTRL+C during execution. Most programs will stop, but you could lose data.
-9 or -KILL - The kernel will let go of the process without informing the process of it. An unclean kill like this could result in data loss. This is the "hardest", "roughest" and most unsafe kill signal available, and should only be used to stop something that seems unstoppable.
-15 or -TERM - Tell the process to stop whatever it's doing, and end itself. When you don't specify any signal, this signal is used. It should be fairly safe to perform, but it is better to start with a "-1" or "-HUP".
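Putting the escalation order together, a small illustrative sketch (reusing the emacs PID from the top output above; check with ps between attempts):

kill -HUP 180882     # safest: "hang up", daemons typically just reload their config
kill -2 180882       # SIGINT, same as CTRL+C (this is what finally worked above)
kill -15 180882      # SIGTERM, the default polite termination signal
kill -9 180882       # SIGKILL, last resort: the kernel drops the process, possible data loss
ps -p 180882         # check whether the process is still there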

Fall 2017 Sample - Pileup problem?

Friday 26 January 2018 

Compare with queue length at CSCS

Follow-up list for CSCS after F2F 01-2018

Thursday 25 January 2018 

If at all possible, figure out the HEPSPEC06 discrepancy between CSCS and CMS

CSCS wants one-on-one meetings with VO reps
    Try to find out what they want to know
    Internal meeting for monitoring
        ATLAS only? Derek 

Probably need a projection of needed CPU and storage requirements for T2
    Not sure where to get these numbers


About next F2F:
    Probably ~June, 5 months from now
    11-15 and 18-22 probably, Doodle will follow
    Will be in Zurich


Link to dashboard

Wednesday 24 January 2018 

For F2F 




CPU Efficiency ranking between 01-10-2017 - 24-01-2018




Thursday 25 January 2018 

Link from Derek, pending jobs:

Link to running jobs in CMS DB:


Debugging copy issue for Christina

Thursday 18 January 2018 


voms-proxy-init -voms cms -valid 192:00

[tklijnsm@t3ui02 T3_CopyFileDebugChristina]$ renewproxy
Enter GRID pass phrase for this identity:
Contacting lcg-voms2.cern.ch:15002 [/DC=ch/DC=cern/OU=computers/CN=lcg-voms2.cern.ch] "cms"...
Remote VOMS server contacted succesfully.


Created proxy in /mnt/t3nfs01/data01/shome/tklijnsm/.x509up_u624.

Your proxy is valid until Fri Jan 26 15:28:01 CET 2018
[tklijnsm@t3ui02 T3_CopyFileDebugChristina]$
[tklijnsm@t3ui02 T3_CopyFileDebugChristina]$
[tklijnsm@t3ui02 T3_CopyFileDebugChristina]$ voms-proxy-info
subject   : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tklijnsm/CN=764859/CN=Thomas Klijnsma/CN=814093632
issuer    : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tklijnsm/CN=764859/CN=Thomas Klijnsma
identity  : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tklijnsm/CN=764859/CN=Thomas Klijnsma
type      : RFC3820 compliant impersonation proxy
strength  : 1024
path      : /mnt/t3nfs01/data01/shome/tklijnsm/.x509up_u624
timeleft  : 191:59:54
key usage : Digital Signature, Key Encipherment




The copy problem is replicated if one forgets the "-voms cms" flag!


Tests using the storage element

kw: xrootd dcap



Friday 7 April 2017 


----------------------------------------------
Basic xrootd and dcap requests:

root dcap://t3se01.psi.ch:22125//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/oldRegressionNtuples/DoubleElectron_FlatPt-1To300/crab_TKNtup_21Jul_Electron_lowpt/160721_211105/0000/output_4.root

root root://t3dcachedb.psi.ch:1094//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/oldRegressionNtuples/DoubleElectron_FlatPt-1To300/crab_TKNtup_21Jul_Electron_lowpt/160721_211105/0000/output_4.root

(These files should exist - I'm keeping them around for these kinds of tests)

----------------------------------------------
Very easy control using UberFTP:

uberftp t3se01.psi.ch

cd /pnfs/psi.ch/cms/trivcat/store/user/tklijnsm
pwd
ls


[tklijnsm@t3ui02 usingTheSE]$ uberftp t3se01.psi.ch
220 GSI FTP door ready
200 User :globus-mapping: logged in
UberFTP (2.8)>
UberFTP (2.8)> ls /pnfs/psi.ch/cms/trivcat/store/user/tklijnsm
drwx------  1 tklijnsm   tklijnsm            512 Jul 21  2016 DoubleElectron_FlatPt-1To300
drwx------  1 tklijnsm   tklijnsm            512 Jul 21  2016 DoubleElectron_FlatPt-300To6500
drwx------  1 tklijnsm   tklijnsm            512 Jul 21  2016 DoublePhoton_FlatPt-300To6500
drwx------  1 tklijnsm   tklijnsm            512 Jul 21  2016 DoublePhoton_FlatPt-5To300
drwx------  1 tklijnsm   tklijnsm            512 Sep 26  2016 HPT
drwx------  1 tklijnsm   tklijnsm            512 Sep 27  2016 test_kg
drwx------  1 tklijnsm   tklijnsm            512 Sep 28  2016 test_kt
UberFTP (2.8)>
UberFTP (2.8)>         
UberFTP (2.8)>
UberFTP (2.8)> cd /pnfs/psi.ch/cms/trivcat/store/user/tklijnsm
UberFTP (2.8)> pwd
/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm
UberFTP (2.8)>
UberFTP (2.8)> ls
drwx------  1 tklijnsm   tklijnsm            512 Jul 21  2016 DoubleElectron_FlatPt-1To300
drwx------  1 tklijnsm   tklijnsm            512 Jul 21  2016 DoubleElectron_FlatPt-300To6500
drwx------  1 tklijnsm   tklijnsm            512 Jul 21  2016 DoublePhoton_FlatPt-300To6500
drwx------  1 tklijnsm   tklijnsm            512 Jul 21  2016 DoublePhoton_FlatPt-5To300
drwx------  1 tklijnsm   tklijnsm            512 Sep 26  2016 HPT
drwx------  1 tklijnsm   tklijnsm            512 Sep 27  2016 test_kg
drwx------  1 tklijnsm   tklijnsm            512 Sep 28  2016 test_kt
UberFTP (2.8)>
UberFTP (2.8)> rm -r test_kg
UberFTP (2.8)> ls
drwx------  1 tklijnsm   tklijnsm            512 Jul 21  2016 DoubleElectron_FlatPt-1To300
drwx------  1 tklijnsm   tklijnsm            512 Jul 21  2016 DoubleElectron_FlatPt-300To6500
drwx------  1 tklijnsm   tklijnsm            512 Jul 21  2016 DoublePhoton_FlatPt-300To6500
drwx------  1 tklijnsm   tklijnsm            512 Jul 21  2016 DoublePhoton_FlatPt-5To300
drwx------  1 tklijnsm   tklijnsm            512 Sep 26  2016 HPT
drwx------  1 tklijnsm   tklijnsm            512 Sep 28  2016 test_kt
UberFTP (2.8)>


Things that are easy:
- making directories
- renaming/deleting/copying 1 file
- wildcard operations for deleting

Things that are hard:
- wildcard operations for moving
- moving multiple files in general; possible with rename, but wildcards in rename may be tricky


Best bet: rename what you want to keep, rm all others, rename back
   

Moving wildcard-matching files to some directory is simply not possible without some external support.
    Should be easy on the T3 since a read-only mount is available:
    do a "smart ls" on the mount, then run a script that performs an ftp rename command
    on each match (see the sketch below).
    No quick solution otherwise.
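A rough sketch of that idea for the T3, assuming the /pnfs tree is visible at the same path on the read-only mount and that uberftp accepts a command string on the command line (as used above); the source pattern and target directory are made up for illustration:

SRC=/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/HPT   # example source directory
DEST=$SRC/archive                                      # example target directory
uberftp t3se01.psi.ch "mkdir $DEST"
for f in "$SRC"/output_*.root; do                      # the "smart ls" via the read-only mount
    uberftp t3se01.psi.ch "rename $f $DEST/$(basename "$f")"
done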


Copying files and directories to T3 storage element

Thursday 14 December 2017 

Keywords: T3 storage element se xrootd xrdcp gfal
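No commands were written down in this entry; for reference, a hedged sketch of the two usual approaches from a UI with a valid proxy (destination paths are illustrative, and the gsiftp endpoint is assumed from the uberftp tests above):

# via xrootd, writing through the dcache door used for reading elsewhere in these notes
xrdcp myfile.root root://t3dcachedb.psi.ch:1094//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/myfile.root

# via gfal2, through the GSI FTP door
gfal-copy file://$PWD/myfile.root gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/myfile.root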



------------------------------------------

Notes from ETH talk this morning (T2/T3)

Friday 10 November 2017 


Memory should scale with the number of CPUs (so requesting more cores gets more memory)

Able to mount SE fs read-only on WNs (not just listing, real reading)

Jupyter hub
    it's a jupyter server that you log in to via a web browser
    starts a jupyter workspace
    worried about safety, but should ask ISG
        it opens another access door
        they have something like this running securely
    beats everyone having to make their own server


Notes for ETH meeting talk

Tuesday 7 November 2017 



LHConCRAY talk

-----------------------------------
Tuesday 7 November 2017 

Link to similar accounting using local measurement at CSCS


Christoph proposes a short document with per-VO contribution
    Probably for the CRAY conclusions from measurement period
    CSCS writes a skeleton, VOs make a contribution
    Probably summarise measurement period
    May contain links to additional information on TWikis
        (Maybe even DB links - readers are members of CHIPP)

> Expect skeleton from Pablo at some point




-----------------------------------
Tuesday 7 November 2017 


Only CRAY:

Only phoenix:

Both (cpu consumption page):


-----------------------------------
Tuesday 3 October 2017 

Data taking for 01-01-2017 - 30-09-2017

Both

T2_CH_CSCS only


-----------------------------------
Monday 2 October 2017 

Data taking links:

Both:

Only CRAY:

Only Phoenix:




[tklijnsm@t3ui02 Talk_1003_LHConCRAY]$ python main.py

----------------------------------------------------------------------
Test for observable 'CPU consumption Good Jobs (fraction)'
{u'reprocessing': 462.8983333333333, u'unknown': 0.0, u'hcxrootd': 728.9625, u'analysis': 143147.3463888889, u'production': 0.07916666666666666, u'hctest': 5617.035, u'analysistest': 57.11388888888889}

----------------------------------------------------------------------
Test for observable 'CPU consumption All Jobs (fraction)'
{u'reprocessing': 462.8983333333333, u'unknown': 0.0, u'hcxrootd': 731.7344444444444, u'analysis': 148428.48305555555, u'production': 0.07916666666666666, u'hctest': 5655.1258333333335, u'analysistest': 57.922222222222224}

----------------------------------------------------------------------
Test for observable 'Wall Clock consumption Good Jobs (fraction)'
{u'reprocessing': 796.3680555555555, u'unknown': 0.0, u'hcxrootd': 1144.0936111111112, u'analysis': 242445.08138888888, u'production': 0.12416666666666666, u'hctest': 6274.671111111111, u'analysistest': 68.65555555555555}

----------------------------------------------------------------------
Test for observable 'Wall Clock consumption All Jobs (fraction)'
{u'reprocessing': 799.9469444444444, u'unknown': 0.0, u'hcxrootd': 1234.6538888888888, u'analysis': 401363.9505555556, u'production': 17.790277777777778, u'hctest': 6877.499444444445, u'analysistest': 75.88305555555556}

Notes SwissGridOps meeting

Thursday 2 November 2017 


Reservation for scale test (see F2F notes) on Monday

System will drain on Sunday then (24h before)

Test should really succeed, otherwise lots of resources are wasted during a pointless drain


Need a downtime?
    Prob not, only need to disable arc04 queue for Piz Daint

Around 1500 jobs on Monday -> 4 jobs per node, 1 gig per core

23000 activations


Fewer CMS jobs will run, but probably no tickets will be opened by CMS
    Will notice nothing


Report Derek:

About T3 request to install new monitoring software:
    Derek did not look at it in detail yet

Credential issue last week: one long-lived credential expired
    Private alarm set for Derek now
    --> Really ask for instructions

Monitoring meeting in November?
    Probably not for me
    GPFS specialists will go to PSI

January: IBM course on GPFS debugging
    Probably not for me, mostly for the real engineers


Next week (probably Wednesday): Small update, probably ~10 min, that needs dCache downtime
    For now, not scheduling downtime (trusting the resubmit system for jobs)
    Scheduled downtime makes the system drain, which is probably more
    harmful than just letting some jobs fail.

About wallclock, CPU, stime, utime

Thursday 2 November 2017 



See this link:



For SGE, stime and utime are computed by the kernel, and are often wrong if someone uses qdel
    --> Don't use stime and utime


Wallclock and CPU are measured by SGE, so are more reliable


Wallclock measures literally between start_time and end_time, nothing special.

CPU can be more than WC, if the job used multiple cores. It looks like this needs to be requested manually, because most jobs have CPU < WC, but not all.
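A quick way to check this for a single job, assuming qacct is available on the SGE submit host and reads the same accounting file (the job id is made up):

qacct -j 1234567 | grep -E 'start_time|end_time|ru_wallclock|slots|cpu'
# wallclock is end_time - start_time (= ru_wallclock); cpu can exceed it for multi-slot jobs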


F2F Bern Notes

==============================================
Wednesday 4 October 2017 

Some gathered action points

Send data in some format to Pablo (WC, etc.) for LHC on Cray
    Derek wants to combine with Gianfranco's data

Big scale test in November
    Need info on bad times to do this

Inode issue:
    Implement trigger for warning
    Try to get a job output on request when it
    happens

CSCS shares Kibana links with VO reps

Enable ticket notifications to VO reps

Next F2F date

2nd half of January
Probably CSCS, maybe ETHZ
If @ CSCS, I should get a flight


----------------------------------------

TWiki page about important particle physics conferences and probable heavy times for reconstruction - to be filled in by me and Gianfranco

Mailing list (using an alias?) to get notified of the GGUS tickets - Will be done by T2 guys


Some ideas about sharing the CSCS Kibana set-up with us


Should find out why CMS production is so low


----------------------------------------

For ATLAS, a common ranking method is based on looking solely at site availability metrics:


This is not so good for CSCS!

May need to work on SAM availability

==============================================
Tuesday 3 October 2017 

----------------------------------------
CMS may commit to something called "Singularity", while CSCS is running its own implementation called "Shifter".
Not sure what this piece of software really does. Shifter is 'docker-based', and Singularity is probably 'node-based'. It may be important for CMS to be aware of the existence of private implementations.


----------------------------------------
Notes during Pablo's talk

GF: Follow-through on CMS errors and make changes centrally
    D: First check for idle time, jobs going through perform well

For tomorrow:
Check for oscillations in the number of jobs (preferably queued jobs) at other comparable sites


----------------------------------------

New node test of the overall system, test whole chain
    Point is to test scalability & datawarp (?)
    has to be approved high up: Michele, Thomas
    See "Questions" at end of finance talk

mount scratch
firewall
cvmfs on demand
create partition
create ARC config

23000 cores
60G ram
3 GB cvmfs
    360 nodes

Miguel needs a week for this test, including system drain

Recommendation should be out by the end of November

The argument that a "shared system" (= Phoenix + CRAY) means +30% performance (see numbers) is quite strong


Plans for upcoming run
    Continue at least what was done already
    Need to solve discrepancy between CSCS and CMS
        May be idle time (would be no real worry)
    Subtract down times? May need quite some work
    Need especially CMS data


----------------------------------------

Costs:

Lunch: 13.90

Email for discussing presentation

Wednesday 1 November 2017 



Hi Derek and Nina,

Friday next week (10 Nov) I will give a talk in our weekly CMS physics ETH meeting regarding my work on T2 and T3. I will mostly be explaining a few basics about the two clusters, showing some basic statistics like consumed CPU and walltime, good vs. bad jobs, etc. I already got some pointers on how to collect this data from Joosep.

There are a few things I would like to discuss before I present them though. For example I have basic knowledge of what hardware is included, but I don't know in detail what the current status is, also regarding budget. I would for example like to explain what happened when the T3 was down for a week because of the broken controller earlier this year, but I would also like to make sure I get all the diagrams right.

I would also like to cover the issue of shome user space, which is a rather sensitive topic with the users (they really love their 400 GB). It's probably best I prepare rather well for this type of discussion.

Would you be able to have a short meeting (~30-60min) via skype/slack/scopia early next week about all of this? I can already produce a draft of my slides before then.

Cheers,

Thomas

Getting the job accounting files SGE

Monday 30 October 2017 


from wmgt01:

ssh -A root@t3ce02


[root@t3ce02 sge]# rsync /gridware/sge/default/common/accounting tklijnsm@t3ui02:~/
[root@t3ce02 sge]# rsync /gridware/sge/default/common/accounting.*.gz tklijnsm@t3ui02:~/


Found these commands by inspecting `history | grep "accounting"`


Relevant output:

13960  stat /gridware/sge/default/common/accounting
13961  ls -l /gridware/sge/default/common/accounting.not.deleted.by.logrotate
13962  ls -l /mnt/sdc/sge/accounting.not.deleted.by.logrotate
13963  ls -l /mnt/sdc/sge/accounting.not.deleted.by.logrotate
14032  ll /gridware/sge_ce/accounting.not.deleted.by.logrotate
14039  ll accounting.not.deleted.by.logrotate
14040  ll accounting.not.deleted.by.logrotate
14041  ll -h /mnt/sdc/sge/accounting.not.deleted.by.logrotate
14045  rm -f accounting.not.deleted.by.logrotate && mv /mnt/sdc/sge/accounting.not.deleted.by.logrotate .
14048  nohup tail --pid=$(pidof sge_qmaster) -n 0 -F /gridware/sge/default/common/accounting >> /gridware/sge/default/common/accounting.not.deleted.by.logrotate &
14213  ll /gridware/sge/default/common/accounting
14215  ll /gridware/sge/default/common/accounting.not.deleted.by.logrotate
14625  ll /mnt/t3nfs01/data01/shome/monuser/accounting-new.txt
14626  ll /home/monuser/accounting.sh
15326  du -csh /gridware/sge/default/common/accounting.complete
15327  du -csh /gridware/sge/default/common/accounting
15328  du -csh /gridware/sge/default/common/accounting.*.gz
15330  ls /gridware/sge/default/common/accounting.full.sge.history.since.2011
15331  du -csh /gridware/sge/default/common/accounting.full.sge.history.since.2011
15332  ls -al /gridware/sge/default/common/accounting.full.sge.history.since.2011
15337  scp /gridware/sge/default/common/accounting.*.gz root@t3ui17:/shome/jpata/
15338  scp /gridware/sge/default/common/accounting.*.gz t3ui17:/shome/jpata/
15535  ls default/common/accounting
15536  du -csh default/common/accounting
15537  du -csh $SGE_ROOT/default/common/accounting
15538  du -csh $SGE_ROOT/default/common/accounting
15539  tail -n100 /gridware/sge/default/common/accounting
15541  ls -al /gridware/sge/default/common/ | grep accounting
15543  rsync /gridware/sge/default/common/accounting jpata@t3ui02:~/
15545  rsync /gridware/sge/default/common/accounting jpata@t3ui02:~/
15546  ls -al /gridware/sge/default/common/ | grep accounting
15547  rsync /gridware/sge/default/common/accounting.*.gz jpata@t3ui02:~/
15671  rm -f accounting.*gz
16020  rm -f /mnt/sdc/sge/accounting.*
16101  mv accounting.not.deleted.by.logrotate /mnt/sdc/sge/ && ln -s /mnt/sdc/sge/accounting.not.deleted.by.logrotate
16106  cd /mnt/sdc/sge && sed -i '/^tail/d' /gridware/sge/default/common/accounting.not.deleted.by.logrotate && cd -
16108  nohup tail --pid=$(pidof sge_qmaster) -n 0 -F /gridware/sge/default/common/accounting >> /gridware/sge/default/common/accounting.not.deleted.by.logrotate &
16142  rm -f accounting.not.deleted.by.logrotate
16143  mv /mnt/sdc/sge/accounting.not.deleted.by.logrotate .

Copying from EOS to T3 SE

Thursday 26 October 2017 


Not a trivial issue... tried the following:


[tklijnsm@t3ui02 1026_copyingFromEosToSE]$ echo $HOSTNAME
t3ui02
[tklijnsm@t3ui02 1026_copyingFromEosToSE]$ xrdcp -v root://eosuser.cern.ch//eos/user/t/tklijnsm/testDirToCopy/test1.txt .
[0B/0B][100%][==================================================][0B/s]  
Run: [ERROR] Server responded with an error: [3010] Unable to give access - user access restricted - unauthorized identity used ; Permission denied


So issuing the commands from T3 doesn't work. Trying from lxplus:


[tklijnsm@lxplus082 tklijnsm]$ xrdcp -vvv -r testDirToCopy root://t3dcachedb.psi.ch:1094//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/
xrdcp: Indexing files in testDirToCopy
xrdcp: Copying 28 bytes from 2 files.


But the commands get stuck there.

export EOS_MGM_URL=root://eosuser.cern.ch
export EOS_FUSE_MGM_ALIAS=eosuser.cern.ch


This works:
scp -v tklijnsm@lxplus.cern.ch:/eos/user/t/tklijnsm/testDirToCopy/test1.txt .

But that's only to shome, not the SE



One possibility may involve mounting EOS... but that would be awful



Other trick, following instruction from here:


[tklijnsm@t3ui02 1026_copyingFromEosToSE]$ fts-transfer-submit -s https://fts3-pilot.cern.ch:8446 -f testfiles.txt
9a4c2b1c-ba2e-11e7-b7d7-02163e00a17a


So far this has failed, but the method seems promising
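For reference, my understanding of the bulk file that -f expects is one "source destination" pair per line; the destination URL below is only a guess at a valid T3 endpoint and should be checked:

# testfiles.txt (illustrative)
root://eosuser.cern.ch//eos/user/t/tklijnsm/testDirToCopy/test1.txt gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/testDirToCopy/test1.txt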

Install python package locally

Wednesday 17 May 2017 



pip install --install-option="--prefix=/mnt/t3nfs01/data01/shome/tklijnsm/pythonPackages" git+https://github.com/scipy/scipy.git


easy_install --install-dir /mnt/t3nfs01/data01/shome/tklijnsm/pythonPackages git+https://github.com/scipy/scipy.git





This worked for pexpect:

easy_install --install-dir /mnt/t3nfs01/data01/shome/tklijnsm/pythonPackages/ pexpect

First source the right python version to be sure.

Then make sure to do

PYTHONPATH="$PYTHONPATH:/mnt/t3nfs01/data01/shome/tklijnsm/pythonPackages/"

before using
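Putting the pieces together, a sketch of a session that uses the locally installed package (the interpreter path is taken from the psi-python27 setup noted further down; the import is the pexpect example above):

export PATH=/afs/psi.ch/sys/psi.x86_64_slp6/Programming/psi-python27/2.4.1/bin:$PATH   # "the right python version"
export PYTHONPATH="$PYTHONPATH:/mnt/t3nfs01/data01/shome/tklijnsm/pythonPackages/"
python -c 'import pexpect; print pexpect.__version__'                                  # python2 syntax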


--------------------------------------
Thursday 13 July 2017 

Installed scikit-learn:

See:


--------------------------------------
Thursday 1 June 2017 


Trying to get pandas, but need numpy>=1.7.0


This is okay:
source /swshare/ROOT/root_v5.34.18_slc6_amd64/bin/thisroot.sh
export PATH=/afs/psi.ch/sys/psi.x86_64_slp6/Programming/psi-python27/2.4.1/bin:$PATH
which python /afs/psi.ch/sys/psi.x86_64_slp6/Programming/psi-python27/2.4.1/bin/python

(In ~/Setup_env.sh )


[tklijnsm@t3ui02 pythonPackages]$ PYTHONPATH="$PYTHONPATH:/mnt/t3nfs01/data01/shome/tklijnsm/pythonPackages/"
[tklijnsm@t3ui02 pythonPackages]$
[tklijnsm@t3ui02 pythonPackages]$ easy_install --install-dir /mnt/t3nfs01/data01/shome/tklijnsm/pythonPackages/ pandas
Creating /mnt/t3nfs01/data01/shome/tklijnsm/pythonPackages/site.py
Searching for pandas
Best match: pandas 0.17.1
Adding pandas 0.17.1 to easy-install.pth file

Using /afs/psi.ch/sys/psi.x86_64_slp6/Programming/psi-python27/2.4.1/lib/python2.7/site-packages
Processing dependencies for pandas
Finished processing dependencies for pandas
[tklijnsm@t3ui02 pythonPackages]$

T2 Ticket about eos space


Requesting EOS space on 



Probably someone mistook T2_CH_CERN for T2_CH_CSCS


Idea for T3 groups on SE

Thursday 5 October 2017 

Hi Derek and Nina,

With the deletion of Marco Peruzzi's account, I got the question whether there could be some group space reserved for the physics group that is using Marco's data. In this case it concerns the "Higgs to diphoton" group, which is currently about 6 people strong. In total, Marco's data amounts to about 11 TB.

I received a similar question a few months back; in that case it concerned a group doing studies on diamond detectors. I think from the users' side there is some demand for storage space at the physics-group level. Do we already have some sort of strategy for grouped storage? Currently I see only a few instances of groups using the SE:

55.4 TB   : store/t3groups/uniz-higgs
12.2 TB   : store/user/bstomumu
552.6 GB  : store/user/b-physics

Would it make sense to start adding more directories in store/t3groups/ , besides the rather general 'ethz-higgs' and 'uniz-higgs' groups? Maybe we could create these directories upon request, and also assign a 'responsible person' for it in case it needs to be cleaned up. Or is there maybe some better, more widely accepted way of dealing with groups of users using an SE?

Let me know what you think.

Cheers,
Thomas

LHConCRAY WG meeting

Monday 2 October 2017 


mismatch between CMS and CSCS accounting
    simple check is to check number of jobs
        o pilot job is 1 job, inside a pilot there are more jobs
          may complicate accounting, are accounted separately in CMS
        o Number of pilots could be compared,
          but this is not extractable from CMS DB
        o Otherwise, compare number of local jobs from CSCS with CMS,
          but counting local jobs at CSCS requires counting via logs
        o May compare CPU efficiency numbers but this is anyway high
        


T2 Errors before a scheduled downtime

Wednesday 27 September 2017 


Jobs can already fail before a scheduled downtime, if the expected runtime is longer than the time left until the downtime. The error message looks as follows:

HoldReason = "The system macro SYSTEM_PERIODIC_HOLD expression '( JobStatus ! 5 && JobStatus ! 3 && ( CurrentTime - GridResourceUnavailableTime ) > 3600 )' evaluated to TRUE"

Tickets to T2: Myriams socket issue

Thursday 14 September 2017 


Dear colleagues,

The issue with broken pools from some months ago seems to be present again. In the following path, 33% of the files are not accessible:

/pnfs/lcg.cscs.ch/cms/trivcat/store/user/mschoene/crab/8_0_26/gg_data2016_Sep04/

A file that is consistently inaccessible to me for example, is:

/pnfs/lcg.cscs.ch/cms/trivcat/store/user/mschoene/crab/8_0_26/gg_data2016_Sep04/DoubleEG_Run2016G_03Feb2017/170904_153936/0000/mt2_753.root

The common error message (after waiting long in a stuck terminal) is:

Error in <TNetXNGFile::Open>: [ERROR] Socket error

Let me know if I can supply any further information.

Cheers,
Thomas

Deleting old files from Phedex LoadTest

Wednesday 13 September 2017 


See code in /mnt/t3nfs01/data01/shome/tklijnsm/Scripts/T2tools/RemoveLoadTests


First log in on cms02:

ssh cms02

Run the following command:

find /pnfs/lcg.cscs.ch/cms/trivcat/store/PhEDEx_LoadTest07/ -type f -mtime +7

Finds files older than 7 days. Copy filelist to a .txt file manually.
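The actual deletion logic lives in the RemoveLoadTests script referenced above; a hedged sketch of the idea, assuming gfal2 tools and a valid proxy are available on cms02, and reusing the SRM endpoint format from the gfal-rm call earlier in these notes:

find /pnfs/lcg.cscs.ch/cms/trivcat/store/PhEDEx_LoadTest07/ -type f -mtime +7 > old_loadtests.txt
while read -r f; do
    # delete each stale load-test file through SRM
    gfal-rm "srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=$f"
done < old_loadtests.txt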

Logging in on CSCS machines

Wednesday 13 September 2017 

Keywords: login cscs worker nodes wns wn cms02 lcg
-------------------------------------------------------------

First on the main portal:


With regular password (stIT+E.)

then to the login service (or something):


then on whatever machine:

ssh wn135 (access to /scratch/)
ssh cms02 ( /pnfs/ mounted )


New Inode Issue

Tuesday 12 September 2017 


And the winner is…
```[root@wn75 execute]# for proxy in $(find ./ -maxdepth 2  -type f -name x509up_*); do voms-proxy-info --file=$proxy |grep identity >> /tmp/identity; done
[root@wn75 execute]# sort -u /tmp/identity
identity  : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=yiiyama/CN=697199/CN=Yutaro Iiyama
```
(edited)


[17:27]
And this:
```[root@wn75 execute]# for proxy in $(find ./ -maxdepth 2  -type f -name x509up_*); do voms-proxy-info --file=$proxy |grep identity >> /tmp/identity2; done
[root@wn75 execute]# sort /tmp/identity2 |uniq -c
      6 identity  : /C=DE/O=GermanGrid/OU=uni-hamburg/CN=Viktor Gerhard Kutzner
      5 identity  : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=yiiyama/CN=697199/CN=Yutaro Iiyama
[root@wn75 execute]# find ./ |wc -l
447620
```


[17:29]
And this:
```[root@wn75 execute]# for proxy in $(find ./ -maxdepth 2  -type f -name x509up_*); do voms-proxy-info --file=$proxy |grep identity >> /tmp/identity3; done
[root@wn75 execute]# sort /tmp/identity3 |uniq -c
      1 identity  : /C=DE/O=GermanGrid/OU=uni-hamburg/CN=Viktor Gerhard Kutzner
     10 identity  : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=yiiyama/CN=697199/CN=Yutaro Iiyama
[root@wn75 execute]# find ./ |wc -l
463517
```


[17:29]
I’d say `Yutaro Iiyama` is the culprit



Email them tomorrow!

Cleaning up Phedex

Monday 14 August 2017 


To ask:

---------------------------------
605307
Camilla Galloni

Dear Camilla Galloni,

While cleaning up old Phedex datasets at CSCS, I came across the following request without a specified retention date:


Please let me know if I can delete this dataset.

Best regards,
Thomas Klijnsma
(CMS VO contact for CSCS)

--> CAN BE DELETED


---------------------------------
852161
No date specified

--> DO NOT DELETE
unofficial retention until 01.01.2018

Dear Alexis Pompili and Raul Iraq Rabadan Trejo:

While cleaning up old Phedex datasets at CSCS, I came across the following request without a specified retention date:


You were indicated as the person(s) responsible for this dataset. Please let me know if I can delete this dataset.

Best regards,
Thomas Klijnsma
(CMS VO contact for CSCS)



---------------------------------
"To be deleted at CSCS", Thea Aarenstadt

--> CAN BE DELETED

692455
702229
725656
742719
760712
765741
786446

thea.aarrestad@cern.ch  

Dear Thea Aarrestad,

While cleaning up old Phedex datasets at CSCS, I came across the following requests without a specified retention date:


The datasets indicate "To be deleted at CSCS", but I want to be sure that you really don't need to have this data anymore.

Let me know if I can go ahead and delete all these datasets.

Best regards,
Thomas Klijnsma
(CMS VO contact for CSCS)


---------------------------------
Urs:

Dear Urs,

While cleaning up old Phedex datasets at CSCS, I came across the following requests past retention date:

Retention: 2017-05-31
https://cmsweb.cern.ch/phedex/prod/Request::View?request=915927

Retention: 2017-06-30
https://cmsweb.cern.ch/phedex/prod/Request::View?request=798615
https://cmsweb.cern.ch/phedex/prod/Request::View?request=960825

Retention: 2017-07-31
https://cmsweb.cern.ch/phedex/prod/Request::View?request=831931

The retention date is not very far in the past, so I quickly wanted to check whether it is already okay to go ahead and delete these datasets.

Then there are two other requests that are almost at their retention date:

Retention: 2017-08-31
https://cmsweb.cern.ch/phedex/prod/Request::View?request=821525

Retention: 2017-08-31
https://cmsweb.cern.ch/phedex/prod/Request::View?request=944633

Would it be okay to delete these datasets too, in case you are already done with it? Let me know what can be deleted and what has to remain.

Cheers,
Thomas Klijnsma
(CMS VO contact for CSCS)



---------------------------------
Joosep

--> DO NOT DELETE

until July 2017
926340

until 01.09.2017
945719
965346


Hi Joosep,

While cleaning up old Phedex datasets at CSCS, I came across the following requests:

Retention: July 2017
https://cmsweb.cern.ch/phedex/prod/Request::View?request=926340

Retention: 01.09.2017
https://cmsweb.cern.ch/phedex/prod/Request::View?request=945719
https://cmsweb.cern.ch/phedex/prod/Request::View?request=965346
--> EXTEND RETENTION UNTIL 01.03.2018


The latter two are not past retention, so I just wanted to ask whether they are still actively being used. Cleaning them up is not urgent, but I figured that since I'm sending around e-mails anyway, I might as well ask.

The first one is past retention date, but not by a lot. Is it okay if I delete it?

I won't delete stuff unless I have your specific approval.

Cheers,
Thomas




---------------------------------
Clemens Lange
No date specified

--> CAN BE DELETED

Dear Clemens Lange,

While cleaning up old Phedex datasets at CSCS, I came across the following request without a specified retention date:


You were indicated as the person responsible for this dataset. Please let me know if I can go ahead and delete this data from CSCS.

Best regards,
Thomas Klijnsma
(CMS VO contact for CSCS)




---------------------------------
Simply past retention date:

125605
434200
635720
607219
896068


---------------------------------
To delete (batch 1)

First batch: Camilla, Thea, Clemens, and simply past:

125605
434200
635720
607219
896068
605307
692455
702229
725656
742719
760712
765741
786446
930631




T2 ticket update

Wednesday 9 August 2017 


Hi Amjad Kotobi and Antonio,

Many thanks for the explanations. We have been monitoring the number of CMS jobs submitted to T2_CH_CSCS_HPC over the last few days, but we are not seeing any increase.

HammerCloud tests seem to be running fine for T2_CH_CSCS_HPC:

Also my own test jobs are running fine on the system. Is there some other switch or flag that is preventing CMS from submitting jobs to T2_CH_CSCS_HPC? Did T2_CH_CSCS_HPC perhaps not pass the tests from the Integrated Test Bed?

Cheers,
Thomas

Trying to print some events from a complicated ROOT file with sub-objects

Wednesday 2 August 2017 



Make sure xQuartz is started, and connect with -X (did not test with -Y)



gSystem->Load("libDataFormatsFWLite.so")
AutoLibraryLoader::enable()
TFile *_file0 = TFile::Open("expected_output.root")


Not done, still can't print some events



HammerCloud tests working on T2_CH_CSCS_HPC

Opening remote ROOT files

Monday 31 July 2017 

-----------------------------------
T3

<todo>


-----------------------------------
T2

root root://storage01.lcg.cscs.ch//pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm/GenericTTbar/CRAB3_tutorial_May2015_MC_analysis/170504_143235/0000/output_2.root

T2 another set of test crab jobs

T3 PhEDEx dataset from Stephan not transferring

Friday 28 July 2017 


Concerns this dataset:



Which was this transfer request:



Dataset is available here (100%, not on tape):
T0_CH_CERN_Disk
T1_FR_CCIN2P3_Disk


T3_CH_PSI



------------------------------
Checked certificate:

[phedex@t3cmsvobox01 ~]$ openssl x509 -in gridcert/proxy.cert -dates
notBefore=Jul 28 08:00:04 2017 GMT
notAfter=Jul 28 19:59:04 2017 GMT
-----BEGIN CERTIFICATE-----
MIIPfzCCDmegAwIBAgIKLkREBAAA<...>
<...>
<...>jOHI=
-----END CERTIFICATE-----

The fact that today is July 28 is very suspicious

crab Lightweight job submission with crab

Thursday 27 July 2017 

Steps from:


------------------------------------------------------
1. Set up CMSSW (in tutorial: CMSSW_7_3_5_patch2)

cmsrel CMSSW_7_3_5_patch2
cd CMSSW_7_3_5_patch2/src
cmsenv


------------------------------------------------------
2. Source crab script so the crab executable is in the environment:

source /cvmfs/cms.cern.ch/crab3/crab.sh
which crab

(Example output: [tklijnsm@t3ui02 src]$ which crab
/cvmfs/cms.cern.ch/crab3/slc6_amd64_gcc493/cms/crabclient/3.3.1707.patch1/bin/crab )


------------------------------------------------------
3. Get a valid proxy:

voms-proxy-init --voms cms --valid 168:00


------------------------------------------------------
4. Set up configuration files:
    one CMSSW configuration file (could also be run locally)
    one CRAB configuration file

example "pset_tutorial_analysis.py":

###############
import FWCore.ParameterSet.Config as cms

process = cms.Process('NoSplit')

process.source = cms.Source(
    "PoolSource",
    fileNames = cms.untracked.vstring(
        'root://cms-xrd-global.cern.ch///store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/00CE4E7C-DAAD-E111-BA36-0025B32034EA.root'
        )
    )

process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(10))
process.options = cms.untracked.PSet(wantSummary = cms.untracked.bool(True))
process.output = cms.OutputModule("PoolOutputModule",
    outputCommands = cms.untracked.vstring("drop *", "keep recoTracks_*_*_*"),
    fileName = cms.untracked.string('output.root'),
)
process.out = cms.EndPath(process.output)
###############


example "crabConfig_tutorial_analysis_data.py"

###############
from CRABClient.UserUtilities import config, getUsernameFromSiteDB
config = config()

config.General.requestName = 'tutorial_May2015_MC_analysis'
config.General.workArea = 'crab_projects'
config.General.transferOutputs = True
config.General.transferLogs = True

config.JobType.pluginName = 'Analysis'
config.JobType.psetName = 'pset_tutorial_analysis.py'

config.Data.inputDataset = '/GenericTTbar/HC-CMSSW_5_3_1_START53_V5-v1/GEN-SIM-RECO'
config.Data.inputDBS = 'global'
config.Data.splitting = 'FileBased'
config.Data.unitsPerJob = 10
config.Data.outLFNDirBase = '/store/user/%s/' % (getUsernameFromSiteDB())
config.Data.publication = True
config.Data.outputDatasetTag = 'CRAB3_tutorial_May2015_MC_analysis'

config.Site.storageSite = 'T2_CH_CSCS'
###############


------------------------------------------------------
5. Submit the crab job

crab submit -c crabConfig_tutorial_MC_analysis.py


Monitor with:

crab status -d crab_projects/crab_tutorial_May2015_MC_analysis --long



Hadd issue on t3 storage element

----------------------------------------------
Thursday 27 July 2017

Sourcing in this environment:

cmsrel CMSSW_8_0_22
cd CMSSW_8_0_22/src
cmsenv

The test script breaks.

Properly working environment:
```[tklijnsm@t3ui02 ~]$ echo $LD_LIBRARY_PATH
/cvmfs/sft.cern.ch/lcg/views/LCG_84/x86_64-slc6-gcc49-opt/lib64:/cvmfs/sft.cern.ch/lcg/views/LCG_84/x86_64-slc6-gcc49-opt/lib:/cvmfs/sft.cern.ch/lcg/contrib/gcc/4.9/x86_64-slc6/lib64```

[11:38]
Faulty environment:
```[tklijnsm@t3ui02 src]$ echo $LD_LIBRARY_PATH
/mnt/t3nfs01/data01/shome/tklijnsm/Scripts/T3tools/haddTests/StephansEnv/CMSSW_8_0_22/biglib/slc6_amd64_gcc530:/mnt/t3nfs01/data01/shome/tklijnsm/Scripts/T3tools/haddTests/StephansEnv/CMSSW_8_0_22/lib/slc6_amd64_gcc530:/mnt/t3nfs01/data01/shome/tklijnsm/Scripts/T3tools/haddTests/StephansEnv/CMSSW_8_0_22/external/slc6_amd64_gcc530/lib:/cvmfs/cms.cern.ch/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_0_22/biglib/slc6_amd64_gcc530:/cvmfs/cms.cern.ch/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_0_22/lib/slc6_amd64_gcc530:/cvmfs/cms.cern.ch/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_0_22/external/slc6_amd64_gcc530/lib:/cvmfs/cms.cern.ch/slc6_amd64_gcc530/external/llvm/3.8.0-giojec2/lib64:/cvmfs/cms.cern.ch/slc6_amd64_gcc530/external/gcc/5.3.0/lib64:/cvmfs/cms.cern.ch/slc6_amd64_gcc530/external/gcc/5.3.0/lib```

[11:39]
A noteworthy difference is the gcc version 5.3 in the faulty environment

[11:40]
I don't think it would be far-fetched to assume that that's the problem


----------------------------------------------

Hi Giorgia and Stephan,

I have been trying some test hadd'ing and everything is working flawlessly for me, for multiple protocols. Here is some output of my tests:


Counted 30910 events in gsidcap://t3se01.psi.ch:22128/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/haddTests_Jul27/output_421.root
Counted 21661 events in gsidcap://t3se01.psi.ch:22128/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/haddTests_Jul27/output_422.root


EXECUTING: uberftp t3se01.psi.ch 'ls /pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/haddTests_Jul27'

220 GSI FTP door ready
200 User :globus-mapping: logged in
-r--------  1 tklijnsm   tklijnsm       11117447 Jul 27 10:36 output_421.root
-r--------  1 tklijnsm   tklijnsm        7801170 Jul 27 10:36 output_422.root


hadd -f gsidcap://t3se01.psi.ch:22128/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/haddTests_Jul27/output_hadded_Jul27.root gsidcap://t3se01.psi.ch:22128/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/haddTests_Jul27/output_421.root gsidcap://t3se01.psi.ch:22128/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/haddTests_Jul27/output_422.root

hadd Target file: gsidcap://t3se01.psi.ch:22128/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/haddTests_Jul27/output_hadded_Jul27.root
hadd compression setting for all ouput: 1
hadd Source file 1: gsidcap://t3se01.psi.ch:22128/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/haddTests_Jul27/output_421.root
hadd Source file 2: gsidcap://t3se01.psi.ch:22128/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/haddTests_Jul27/output_422.root
hadd Target path: gsidcap://t3se01.psi.ch:22128/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/haddTests_Jul27/output_hadded_Jul27.root:/
hadd Target path: gsidcap://t3se01.psi.ch:22128/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/haddTests_Jul27/output_hadded_Jul27.root:/een_analyzer

Counted 52571 events in gsidcap://t3se01.psi.ch:22128/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/haddTests_Jul27/output_hadded_Jul27.root


I have been using ROOT 6.06/02 for these tests. Could you maybe send a "printenv" of the shell you're using?

Cheers,
Thomas

T2 Minimum free space required

Friday 21 July 2017 

From the recent transfer failures due to lack of storage space, it seems that a minimum of 45 TB of free space is needed to keep transfers running.





T2 Monitoring Urs' dataset

-------------------------------------------
Tuesday 18 July 2017  17:22

Hi Urs,

Transfers from and to CSCS already started picking up earlier this afternoon, but I monitored your dataset and noticed the percentage completed was not changing. After some digging Derek and I found out there was an additional deeper issue with the transfer from CSCS to PSI relating to the PhEDEx certificate of PSI. Derek has now solved this and I am monitoring the progress of your transfer. I am not sure how much time this will take - if the transfer picks up its pace this should not last all that long. The transfer rate with CSCS is already slowly increasing since Derek's fix.

Best,
Thomas

-------------------------------------------
Tuesday 18 July 2017  15:59

Turns out the certificate is no longer valid:

wmgt01:~$ ssh -A root@t3cmsvobox01
Last login: Tue Jul 18 15:39:28 2017 from t3admin01.psi.ch
Welcome to Tier-3

[root@t3cmsvobox01 ~]#
[root@t3cmsvobox01 ~]# su - phedex
[phedex@t3cmsvobox01 ~]$ openssl x509 -in gridcert/proxy.cert -dates
notBefore=Jul 12 23:00:05 2017 GMT
notAfter=Jul 12 23:25:05 2017 GMT
-----BEGIN CERTIFICATE-----

Derek takes over here


-------------------------------------------
Tuesday 18 July 2017  14:13

Site: T3_CH_PSI
Block completion: 16.67%, Block presence: 100%, Dataset presence: 57.13%, File-replica presence: 57.13%, Site type: Disk, StorageElement: t3se01.psi.ch


Checking the rate from sites:

T2_CH_CSCS      T1_UK_RAL_Disk     16  58.4 GB 16.2 MB/s   –   –   13.1 MB/s   2h20
T2_CH_CSCS      T1_IT_CNAF_Disk    8   42.7 GB 11.8 MB/s   –   –   5.2 MB/s    2d1h08
T2_CH_CSCS      T1_ES_PIC_Buffer   12  42.6 GB 11.8 MB/s   –   –   14.4 MB/s   4h18
T2_CH_CSCS      T2_US_Purdue       6   36.0 GB 10.0 MB/s   –   –   2.8 MB/s    1d10h07
T2_CH_CSCS      T2_DE_RWTH         4   35.4 GB 9.8 MB/s    –   –   7.9 MB/s    13h50
T2_CH_CSCS      T1_US_FNAL_Buffer  6   29.7 GB 8.3 MB/s    –   –   10.5 MB/s   23h22
T2_CH_CSCS      T1_ES_PIC_Disk     8   28.0 GB 7.8 MB/s    –   –   5.5 MB/s    14h06
T2_CH_CSCS      T1_RU_JINR_Buffer  8   25.1 GB 7.0 MB/s    –   3   5.8 MB/s    3d6h48
T2_CH_CSCS      T2_US_Caltech      4   22.4 GB 6.2 MB/s    –   –   2.6 MB/s    1d15h25
T2_CH_CSCS      T1_US_FNAL_Disk    8   18.3 GB 5.1 MB/s    –   7   6.9 MB/s    4d10h15
T2_CH_CSCS      T2_IT_Bari         4   15.8 GB 4.4 MB/s    –   –   2.2 MB/s    7h29
T2_CH_CSCS      T1_RU_JINR_Disk    4   15.0 GB 4.2 MB/s    –   –   7.5 MB/s    8h41
T2_CH_CSCS      T2_FR_IPHC         4   14.8 GB 4.1 MB/s    –   –   2.0 MB/s    2h14
T2_CH_CSCS      T2_IT_Legnaro      4   14.7 GB 4.1 MB/s    –   –   2.5 MB/s    23h21
T2_CH_CSCS      T2_US_Nebraska     4   14.2 GB 4.0 MB/s    –   –   2.3 MB/s    2d0h15
T2_CH_CSCS      T2_BE_UCL          3   8.0 GB  2.2 MB/s    –   –   411.4 kB/s  2d9h05
Total   103 420.9 GB    116.9 MB/s  –   10  –/s 0h00


Sites that have the dataset:

Site: T0_CH_CERN_Disk
Site: T1_FR_CCIN2P3_Buffer
Site: T1_FR_CCIN2P3_Disk
Site: T1_FR_CCIN2P3_MSS
Site: T2_CH_CERN
Site: T2_UK_London_Brunel
Site: T2_US_Nebraska
Site: T3_CH_PSI


Link to the transfer request:


Search query in PhEDEx:





Additional note about Deletion Request 1045325

Tuesday 18 July 2017 


TO
christoph.wissing@desy.de
eliza.melo.da.costa@cern.ch
siddharth.m.narayanan@cern.ch
clemens.lange@cern.ch
Yutaro.Iiyama@cern.ch
david.alejandro.urbina.gomez@cern.ch
jean-roch.vlimant@cern.ch
sebastian.pulido@cern.ch
dmytro.kovalskyi@cern.ch
dmason@fnal.gov
Fabio.Martinelli@cern.ch
matteo.cremonesi@cern.ch
alan.malta@cern.ch
thomas.klijnsma@cern.ch
Oliver.Gutsche@cern.ch
ajit.kumar.mohapatra@cern.ch

CC
miguel.angel.gila@cern.ch
dario.petrusic@cern.ch
dino.conciatore@cscs.ch



Dear all,

A side note about PhEDEx Deletion Request 1045325: Some of the datasets listed here have been around on the CSCS storage since 2011(!). In this single request I am attempting to clean up a lot of storage at CSCS in one go. I only included datasets that had a clearly noted "retention date" as a comment that has long passed (Comments without an explicit retention date are thus not included in this request).

If there are any datasets in this request that you wish to keep around, please send me an e-mail by Thursday July 20 at 17:00. If I have not received any reason to keep the files around before then, I will approve the deletion request.

The datasets are all from the group "local" and of type "replica". The following query in PhEDEx yields a list of the included datasets:



Best regards,
Thomas Klijnsma

T2 July 2017 Clean-up round for T2

-----------------------------------------------------
Tuesday 18 July 2017 


        22.4 TB   : /store/user/decosa
        21.2 TB   : /store/user/ytakahas
        19.2 TB   : /store/user/dsalerno
        14.6 TB   : /store/user/jpata
        14.4 TB   : /store/user/oiorio
        12.1 TB   : /store/user/sdonato
        11.4 TB   : /store/user/mschoene


annapaola.de.cosa@cern.ch
Yuta.Takahashi@cern.ch
daniel.salerno@cern.ch
joosep.pata@cern.ch
alberto.orso.maria.iorio@cern.ch
silvio.donato@cern.ch
myriam.schoenenberger@cern.ch


Dear all,

I would first like to thank those of you who previously made the effort to clean up old data at CSCS and free up some storage. I'm sorry for calling on you again, but transfers to CSCS have again halted due to full pools, which is hurting important transfers for central productions.

I kindly ask you to have a look at your data once again, and please keep only what is really needed for your work - preferably today rather than tomorrow. My thanks in advance!

Cheers,
Thomas



-----------------------------------------------------
Tuesday 11 July 2017 

        28.9 TB   : /store/user/jpata
        22.4 TB   : /store/user/decosa
        21.2 TB   : /store/user/ytakahas
        19.2 TB   : /store/user/dsalerno
        16.5 TB   : /store/user/sdonato
        14.4 TB   : /store/user/oiorio

-----------------------------------------------------
Monday 10 July 2017 

    276.5 TB  : /store/user
        28.9 TB   : /store/user/jpata
        22.4 TB   : /store/user/decosa
        21.2 TB   : /store/user/ytakahas
        20.8 TB   : /store/user/zucchett
        19.2 TB   : /store/user/dsalerno
        16.5 TB   : /store/user/sdonato
        14.4 TB   : /store/user/oiorio
        11.4 TB   : /store/user/mschoene
        7.6 TB    : /store/user/bianchi
        7.2 TB    : /store/user/dpinna
        6.0 TB    : /store/user/grauco
        5.7 TB    : /store/user/paktinat
        5.6 TB    : /store/user/gregor
        5.6 TB    : /store/user/cgalloni
        5.5 TB    : /store/user/mwang

Unchanged w.r.t. July 3:

        28.9 TB   : /store/user/jpata
        22.4 TB   : /store/user/decosa
        20.8 TB   : /store/user/zucchett
        16.5 TB   : /store/user/sdonato
        14.4 TB   : /store/user/oiorio
        > small enough? 11.4 TB   : /store/user/mschoene

joosep.pata@cern.ch
annapaola.de.cosa@cern.ch  
zucchett@physik.uzh.ch
silvio.donato@cern.ch
alberto.orso.maria.iorio@cern.ch


Dear CSCS storage user,

As communicated last week, there is an urgent lack of disk space on the pools at the T2. Full pools mean read/write failures for official CMS productions, which will show up in the performance of CSCS as a computing site.

For this reason, please clean up your unused files on the T2 as soon as possible, very preferably today.

Best regards,
Thomas Klijnsma


-----------------------------------------------------
Monday 3 July 2017 

        38.8 TB   : /store/user/ytakahas
        29.3 TB   : /store/user/dsalerno
        28.9 TB   : /store/user/jpata
        22.9 TB   : /store/user/decosa
        20.8 TB   : /store/user/zucchett
        16.5 TB   : /store/user/sdonato
        14.4 TB   : /store/user/oiorio
        11.4 TB   : /store/user/mschoene
        10.9 TB   : /store/user/grauco





-----------------------------------------------------
Tuesday 4 July 2017 


Dear all,

My apologies for the increased pressure, but the T2 site at CSCS is currently in error mode because of full pools. It would help a lot if you could free up at least part of the storage you are using, preferably still today, and perhaps some more later in the week. The immediate problems should be resolved once a few tens of TBs are freed up. Thanks a lot for your help.

Best regards,
Thomas Klijnsma

Mails for CMS jobs on T2

Friday 14 July 2017 


lgerhardt@lbl.gov

Send this Monday at 9:00?

Dear Lisa Gerhardt,

I am the CMS contact person for the T2_CH_CSCS site located in Lugano. We are facing an issue where CMS jobs are not running on a relatively new HPC resource, Piz Daint (a hybrid Cray XC40/XC50). There is some more detailed information in the previous email, but essentially we are looking for someone who has experience with both CMS computing operations and local T2 site configuration. We were referred to you by Jakob Blomer.
 
At NERSC there are machines similar to the one we have in Lugano that are already successfully running CMS jobs. Would you be able to point us to one of the experts who could potentially help us identify problems with our setup? We can supply detailed information.

Best regards,
Thomas Klijnsma
(CMS VO contact for CSCS)


-----------------------------------------------

Hi Thomas,

I'm glad that you'd like to come to the February pre-GDB!

As for HPC contacts within the LHC experiments, I am just about to learn
myself who is caring about supercomputers.  I'd know to whom to forward
you in ATLAS, for CMS I'm still a little bit more in the dark.

At CERN, the following CMS contacts probably know more:

    Luca Malgeri (Luca.Malgeri@cern.ch): group leader of EP-CMG-PH
    Andreas Pfeiffer (Andreas.Pfeiffer@cern.ch): section leader of EP-CMG-CO

I'm also aware of successful CMS@HPC efforts by UC San Diego (Frank
Würthwein) and at NERSC on the Cori and Edison Crays.  At NERSC, Lisa
Gerhardt (lgerhardt@lbl.gov) can surely forward you to their experts.

Please feel free to keep me in the cc.

Hope this helps,
Jakob

-----------------------------------------------

Derek Feichtinger <derek.feichtinger@psi.ch>,
Miguel Gila <miguel.gila@cscs.ch>,
Christophorus Grab <grab@phys.ethz.ch>,
Guenther Dissertori <Guenther.Dissertori@cern.ch>

Jakob Blomer (jblomer@cern.ch) and Gerardo Ganis (ganis@cern.ch)



Dear Jakob Blomer and Gerardo Ganis,

I am the Swiss CMS contact person for the T2_CH_CSCS site located in Lugano. Together with Derek Feichtinger (PSI) and Miguel Gila (CSCS), we have recently been increasing our efforts to get CMS Glidein jobs to run on the relatively new site T2_CH_CSCS_HPC. This site is a hybrid Cray XC40/XC50 supercomputer, currently No. 3 on the Top500 list. We have an ongoing collaboration with the Swiss National Supercomputing Center (CSCS) to enable LHC workflows on this machine. The long-term goal is to shift the majority of our resources from a traditional cluster to this system. Unfortunately, despite our increased efforts we have so far been unable to get CMS HammerCloud jobs or analysis/production jobs to run on T2_CH_CSCS_HPC (although SAM tests work). Several CMS support e-groups have also looked at the problem, but so far jobs are still not running.

During the monthly GDB meeting I heard of your efforts to set up a forum for people working with HPC systems on the WLCG. While we will certainly participate in the pre-GDB on HPC resources next year, we intend to have CMS jobs running much sooner. Our hope is to find a contact with extensive experience in both CMS computing operations and local T2 site configuration. Would you be able to point us to the right person to aid us in debugging T2_CH_CSCS_HPC?

Kind regards,

Thomas Klijnsma
(CMS-VO contact for CSCS)








-- Progress so far --

Derek Feichtinger has prepared a report on our progress so far here:
Most of the information related to debugging starts from section 2.6 onward.

The latest HammerCloud tests for T2_CH_CSCS_HPC can be found here:
So far, HC jobs consistently get stuck as "submitted".

The logs of the most recent Glidein job that tested the CSCS site configuration can be found here (from June 26):



There do not seem to be any errors that could explain the lack of CMS jobs.

PhEDEx links

T2 Instructions for dump for storage consistency check

Tuesday 11 July 2017 



update#2
Sebastian Pulido     2017-07-04     11:02     
Public Diary:
Hi Dino,
The Transfer Team can help you by running a storage consistency check; for that, can you send us the full list of LFNs "OLDER THAN ONE MONTH" in the following directories:
/store/mc
/store/data
/store/generator
/store/results
/store/hidata
/store/himc
/store/lumi
/store/relval

More info about SCC in [1] and tools for creating dump in [2]

Also, files older than 2 weeks under /store/temp/ can be deleted from SE. [3]

Best,
Sebastian.

update#3
Christoph Wissing     2017-07-04     12:13     
Public Diary:
Hi,

in addition to what Sebastian wrote, you might want to clean the temporary areas, where we know that some files are not cleaned up efficiently.

LFN area /store/unmerged (please check in your TFC [storage.xml] what PFN this resolves to):
Files older than 2 weeks can be removed, BUT RESPECT the exception list. Have a look at [1]. If you do not have tools in place to respect the exception list, removing files older than 8 weeks is rather safe.

LFN area /store/temp/user
Everything older than 1 week (or say 2 weeks to be more safe) can be removed.

Cheers, Christoph
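
A minimal sketch of how such a dump and temp-area check could be produced, assuming the PNFS namespace is mounted on an admin node and GNU find is available (the trivcat prefix is taken from the uberftp session further down in these notes; the proper SCC dump tools referenced in [2] should be preferred over this):

for d in mc data generator results hidata himc lumi relval; do
    find /pnfs/lcg.cscs.ch/cms/trivcat/store/$d -type f -mtime +30 -printf "/store/$d/%P\n"
done > lfns_older_than_1_month.txt

# candidates in the temp area, along the lines of Christoph's note (two weeks, the safer choice);
# dry run only - check against the unmerged exception list before deleting anything
find /pnfs/lcg.cscs.ch/cms/trivcat/store/temp/user -type f -mtime +14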

T2 Transfer links seem to work again

Thursday 6 July 2017 


Started working again on 04-07-2017 around 12:00, and they seem to have been green ever since.

The dashboard is, however, still showing "n/a". Will have to keep an eye on it.

Table of usernames on T3 and T2

Monday 3 July 2017 

Table of user names


   | login        | ok? | followup by | Full Name                        | Institution               | cmsname      |
   |--------------+-----+-------------+----------------------------------+---------------------------+--------------|
   | aspiezia     |     | uzh         | Aniello Spiezia                  | Guest - INFN Perugia      | aspiezia     |
   | berger_p2    | ok  |             | Pirmin Berger                    | ETHZ                      | berger_p2    |
   | bianchi      | ok  |             | Lorenzo Bianchini                | ETHZ                      | bianchi      |
   | casal        | ok  |             | Bruno Casal                      | ETHZ                      | casal        |
   | cgalloni     | ok  |             | Camilla Galloni                  | UniZ                      | cgalloni     |
   | cheidegg     | ok  |             | Constantin Heidegger             | ETHZ                      | cheidegg     |
   | clange       |     | uzh         | Clemens Lange                    | UniZ                      | clange       |
   | clseitz      | ok  |             | Claudia Seitz                    | UniZ                      | clseitz      |
   | cmssgm       | ok  | admin       |                                  |                           |              |
   | creissel     |     | ethz        | Christina Reissel                | ETHZ                      | creissel     |
   | decosa       | ok  |             | Annapaola de Cosa                | UniZ                      | decosa       |
   | dmeister     |     | admin/ethz  | Daniel Meister                   | ETHZ                      | dmeister     |
   | dpinna       | ok  |             | Deborah Pinna                    | UniZ                      | dpinna       |
   | dsalerno     | ok  |             | Daniel Salerno                   | UniZ                      | dsalerno     |
   | dschafer     |     | uzh         | Daniela Schaefer                 | Guest - KIT               | dschafer     |
   | erdmann      | ok  |             | Wolfram Erdmann                  | PSI                       | werdmann     |
   | Federica24   | ok  | ethz        | Federica Tarsitano               | ETHZ                      | Federica24   |
   | feichtinger  | ok  |             | Derek Feichtinger                | PSI                       | dfeichti     |
   | gaperrin     | ok  |             | Gael Perrin                      | ETHZ                      | gaperrin     |
   | ggiannin     | ok  |             | Giulia Giannini                  | UniZ                      | ggiannin     |
   | giulioisac   | ok  |             | Giulio Isacchini                 | ETHZ                      | giulioisac   |
   | grauco       | ok  |             | Giorgia Rauco                    | UniZ                      | grauco       |
   | gregor       | ok  |             | Gregor Kasieczka                 | ETHZ                      | gregor       |
   | hinzmann     |     | uzh         | Andreas Hinzmann                 | UniZ                      | hinzmann     |
   | hsetiawa     | ok  |             | Hananiel Setiawan                | UniZ                      | hsetiawa     |
   | ineuteli     | ok  |             | Izaak Neutelings                 | UniZ                      | ineuteli     |
   | jandrejk     | ok  |             | Janik Walter Andrejkovic         | ETHZ                      | jandrejk     |
   | jhoss        |     | ethz        | Jan Hoss                         | ETHZ                      | jhoss        |
   | jngadiub     |     | uzh         | Jennifer Ngadiuba                | UniZ                      | jngadiub     |
   | jpata        | ok  |             | Joosep Pata                      | ETHZ                      | jpata        |
   | kaestli      | ok  |             | Hans-Christi Kaestli             | PSI                       | kaestli      |
   | koschwei     | ok  |             | Korbinian Schweiger              | UniZ                      | koschwei     |
   | kotlinski    | ok  |             | Bohdan Kotlinski                 | PSI                       | dkotlins     |
   | lbaeni       | ok  |             | Lukas Baeni                      | ETHZ                      | lbaeni       |
   | leac         | ok  |             | Lea Caminada                     | UniZ                      | leac         |
   | loktionova_n | ok  |             | Nina Loktionova                  | PSI                       | loktionova_n |
   | lshchuts     | ok  |             | Lesya Shchutska                  | ETHZ                      | lshchuts     |
   | mameinha     | ok  |             | Maren Tabea Meinhard             | ETHZ                      | mameinha     |
   | mangano      |     | ethz        | Boris Mangano                    | ETHZ                      | mangano      |
   | martinelli_f | ok  |             | Fabio Martinelli                 | PSI                       | none         |
   | mdefranc     |     | uzh         | Matteo Defranchis                | Guest - INFN - UniZ guest | mdefranc     |
   | mdonega      | ok  |             | Mauro Donega                     | ETHZ                      | mdonega      |
   | micheli      | ok  |             | Francesco Micheli                | ETHZ                      | micheli      |
   | mmarionn     | ok  |             | Matthieu Marionneau              | ETHZ                      | mmarionn     |
   | mmasciov     |     | ethz        | Mario Masciovecchio              | ETHZ                      | mmasciov     |
   | mquittna     |     | ethz        | Milena Quittnat                  | ETHZ                      | mquittna     |
   | mschoene     | ok  |             | Myriam Schonenberger             | ETHZ                      | mschoene     |
   | musella      | ok  |             | Pasquale Musella                 | ETHZ                      | musella      |
   | mvesterb     | ok  |             | Minna Leonora Vesterbacka Olsson | ETHZ                      | mvesterb     |
   | nchernya     | ok  |             | Nadezda Chernyavskaya            | ETHZ                      | nchernya     |
   | nding        |     | uzh         | Naili Ding                       | UniZ                      | nding        |
   | pablom       |     | ethz        | Pablo Martinez                   | ETHZ                      | pablom       |
   | pandolf      | ok  |             | Francesco Pandolfi               | ETHZ                      | pandolf      |
   | pbaertsc     | ok  |             | Pascal Baertschi                 | UniZ                      | pbaertsc     |
   | perrozzi     | ok  |             | Luca Perrozzi                    | ETHZ                      | perrozzi     |
   | peruzzi      |     | ethz        | Marco Peruzzi                    | Guest - ETHZ              | peruzzi      |
   | phwindis     |     | ethz        | Philipp Windischhofer            | ETHZ                      | phwindis     |
   | ptiwari      |     |             | Praveen Chandra Tiwari           | Guest - IISc Bangalore    | ptiwari      |
   | rbuergle     | ok  |             | Rico Buergler                    | UniZ                      | rbuergle     |
   | root         | ok  |             |                                  |                           |              |
   | sdonato      | ok  |             | Silvio Donato                    | UniZ                      | sdonato      |
   | spmondal     |     |             | Spandan Mondal                   | Guest - IISc Bangalore    | spmondal     |
   | starodumov   | ok  |             | Andrei Starodumov                | ETHZ                      | starodum     |
   | tbluntsc     |     |             | Tizian Bluntschli                | ETHZ                      | tbluntsc     |
   | thaarres     | ok  |             | Thea Klaeboe Aarrestad           | UZH                       | thaarres     |
   | thea         |     | ethz        | Alessandro Thea                  | ETHZ                      | thea         |
   | tklijnsm     | ok  |             | Thomas Klijnsma                  | ETHZ                      | tklijnsm     |
   | uchiyama     |     | meg         | Yusuke Uchiyama                  | PSI                       | uchiyama     |
   | ursl         | ok  |             | Urs Langenegger                  | PSI                       | ursl         |
   | vscheure     |     | uzh, admin  | Valerie Scheurer                 | Guest - KIT               | vscheure     |
   | vtavolar     | ok  |             | Vittorio Raoul Tavolaro          | ETHZ                      | vtavolar     |
   | wiederkehr_s | ok  |             | Stephan Wiederkehr               | PSI                       | wiederkehr_s |
   | ytakahas     | ok  |             | Yuta Takahashi                   | UniZ                      | ytakahas     |
   | zucchett     | ok  |             | Alberto Zucchetta                | UniZ                      | zucchett     |

Fun note about the PhEDEx transfer rate to T2_CH_CSCS

Monday 3 July 2017 

Saved a plot on desktop

Pre-approvals were from 26 to 30 of June, clear spike before that:


Submitting jobs to CRAY via arcsub



===============================================================================
Monday 3 July 2017 

See this email from Miguel:

Here are some commands that you can use from any UI (or LXPLUS) to submit the job and operate with ARC itself:

$ arcproxy --voms cms # generate RFC proxy

$ arcinfo -c arc04.lcg.cscs.ch # get status of ARC frontend

$ arcsub -c arc04.lcg.cscs.ch proxyinfo.xrls # submit the job

$ arcstat -a # get status of all jobs
$ arcstat gsiftp://arc04.lcg.cscs.ch:2811/jobs/PpcLDmvjBJqn3l2IepOMSphoABFKDmABFKDmt9GKDmABFKDm6LssAo -l #get status of a single job

$ arcget -d # download all jobs that have finished and are available on any ARC frontend
$ arcget gsiftp://arc04.lcg.cscs.ch:2811/jobs/3diNDmo3cKqn3l2IepOMSphoABFKDmABFKDmFTJKDmABFKDmOg0yrn # download a single job
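

The content of proxyinfo.xrls is not included in the mail; a minimal xRSL job description along those lines might look like the sketch below (attribute values are illustrative, not the actual test file):

cat > benchmark.xrls <<'EOF'
&(executable="run.sh")
 (jobname="benchmarktest")
 (stdout="stdout.txt")
 (stderr="stderr.txt")
 (gmlog="gmlog")
EOF

arcsub -c arc04.lcg.cscs.ch benchmark.xrls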


===============================================================================
Thursday 29 June 2017

Job from yesterday is finished:

Job: gsiftp://arc04.lcg.cscs.ch:2811/jobs/t27MDmUZTkqn3l2IepOMSphoABFKDmABFKDmIgJKDmABFKDmcGT3zm
Name: benchmarktest
State: Finished
Exit Code: 0


Output file is created:

[tklijnsm@t3ui02 benchmarkOnT2]$ uberftpT2
220 GSI FTP door ready
200 User :globus-mapping: logged in
UberFTP (2.8)>
UberFTP (2.8)> cd /pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm
UberFTP (2.8)> ls
-r--------  1 cms001     cms001               28 Jun 28 17:48 outputtest.txt
drwx------  1 cms001     cms001              512 May  4 16:45 GenericTTbar
UberFTP (2.8)>
UberFTP (2.8)>
UberFTP (2.8)> cat outputtest.txt
This is an output text file

UberFTP (2.8)>
UberFTP (2.8)> rm outputtest.txt


So that's good news. Now running again, but this time making sure to copy the log files instead of a test txt file.

New test:
Job: gsiftp://arc04.lcg.cscs.ch:2811/jobs/OeHLDm12mkqn3l2IepOMSphoABFKDmABFKDm9LHKDmABFKDmKofOBo
Name: benchmarktest
State: Queuing


===============================================================================
Wednesday 28 June 2017 


First get an RFC proxy:

voms-proxy-init -voms cms --rfc

Then using arcsub (see different instructions)

Monitor with "arcstat -a"





Test using the benchmark script:

Job: gsiftp://arc04.lcg.cscs.ch:2811/jobs/Yu4KDm3gSkqn3l2IepOMSphoABFKDmABFKDm3xLKDmABFKDmLZWFPo
Name: benchmarktest
State: Finished
Exit Code: 0

Status of 4 jobs was queried, 1 jobs returned information


Now how to get the output for this...



All the useful commands are here:

All the stuff that can be specified in the xrsl file is here:

arccat (job): prints out the relevant info for a job, like whether it's finished
arcget (job): copies over a directory with output files
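
Typical retrieval sequence once a job is done (sketch; job URLs abbreviated to <jobid>):

arcstat gsiftp://arc04.lcg.cscs.ch:2811/jobs/<jobid>    # wait until State: Finished
arccat  gsiftp://arc04.lcg.cscs.ch:2811/jobs/<jobid>    # print the job's stdout
arcget  gsiftp://arc04.lcg.cscs.ch:2811/jobs/<jobid>    # download the output sandbox into a local directory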




Added the following lines to the xrsl file:

(outputFiles=
    ('outputtest.txt' 'gsiftp://storage01.lcg.cscs.ch//pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm/outputtest.txt')
)

Now trying again to see if output files can be passed.


At 16:55: Still no output:

[tklijnsm@t3ui02 benchmarkOnT2]$ arcstat -a
WARNING: Job information not found in the information system: gsiftp://arc04.lcg.cscs.ch:2811/jobs/1DdMDmNlJfqn3l2IepOMSphoABFKDmABFKDmabNKDmABFKDmIElbHm
WARNING: Job information not found in the information system: gsiftp://arc04.lcg.cscs.ch:2811/jobs/DzALDmhj8gqn3l2IepOMSphoABFKDmABFKDmXXFKDmABFKDmQWcTAn
WARNING: Job information not found in the information system: gsiftp://arc04.lcg.cscs.ch:2811/jobs/xkyNDmgPJfqn3l2IepOMSphoABFKDmABFKDmlZIKDmABFKDmSd2Pnm
Status of 3 jobs was queried, 0 jobs returned information

The first two jobs are pretty old; not sure why I can't kill them with arckill.

Try again later to see if the last job will show something.



Submitted a copy again to be sure:

[tklijnsm@t3ui02 benchmarkOnT2]$ . submitCmd.sh
Job submitted with jobid: gsiftp://arc04.lcg.cscs.ch:2811/jobs/t27MDmUZTkqn3l2IepOMSphoABFKDmABFKDmIgJKDmABFKDmcGT3zm


At 16:59:
[tklijnsm@t3ui02 benchmarkOnT2]$ arcstat -a
WARNING: Job information not found in the information system: gsiftp://arc04.lcg.cscs.ch:2811/jobs/1DdMDmNlJfqn3l2IepOMSphoABFKDmABFKDmabNKDmABFKDmIElbHm
WARNING: Job information not found in the information system: gsiftp://arc04.lcg.cscs.ch:2811/jobs/DzALDmhj8gqn3l2IepOMSphoABFKDmABFKDmXXFKDmABFKDmQWcTAn
WARNING: Job information not found in the information system: gsiftp://arc04.lcg.cscs.ch:2811/jobs/xkyNDmgPJfqn3l2IepOMSphoABFKDmABFKDmlZIKDmABFKDmSd2Pnm
Job: gsiftp://arc04.lcg.cscs.ch:2811/jobs/t27MDmUZTkqn3l2IepOMSphoABFKDmABFKDmIgJKDmABFKDmcGT3zm
Name: benchmarktest
State: Queuing

Status of 4 jobs was queried, 1 jobs returned information




T1 to T2 transfer link failures

Monday 3 July 2017 


-------------------------------------------
Initial report by Daniel Salerno (Date: Jul 01, 20:04 )

Hi experts,

I am having problems staging out successfully completed jobs. All my transfers seem to be failing, with the following error on the CRAB monitor:

"tm_transfer_failure_reason     Job could not be submitted to FTS: users proxy expired"

I have several such tasks, here are 2 examples with stageout to different sites:



I requested my proxy just before submission and requested it for 1 week, so it should definitely be valid.

I've also noticed there is an empty folder created on both storage sites.

Can you see something I'm missing?

Thanks,
Daniel


-------------------------------------------
Reply by CMS CRAB Support (Date: Jul 01, 22:35 )

ASO server was having memory management problems, which made voms-proxy-*
command fail. Your proxy is actually fine.
I rebooted the machine where ASO runs, things appear good so far, let's see
if transfers now go. I am sure you were not the only one affected.
Stefano


-------------------------------------------
Reply by Daniel (Date: Jul 01, 22:55 )

Hi Stefano,

Thanks for fixing this. The transfers do not show up with crab status yet, but I see the files on the storage element, so it seems to have worked.

Cheers,
Daniel


-------------------------------------------
Reply by CMS CRAB Support (Date: Jul 01, 23:03 )

That's good to hear Daniel, thanks !
I see that transfers are being submitted again, while there was no activity
since the end of June 30th... something possibly happened when the date switched to
July 1st, but of course past month ends have been fully transparent. Let's
see how it goes.
Stefano

GlideinWMS e-group

Notes CRAY meeting

Monday 26 June 2017 


o Numbers on Piz Daint are mostly slightly worse than Phoenix
  (for ATLAS (Gianfranco) and LHCb (Roland)).

- Check why CMS share is suddenly lower
    o Only 22% CMS jobs, rest is ATLAS and LHCb on Phoenix
    o Someone mentions CMS has not had much to send the last few weeks,
      find out why

o New configuration will be installed
    o New measurement period will start after installation
    o Change expected around 10th of July

o Next meeting:
    o First week of August, about 1 month after new configuration for data taking
    o Doodle will follow


o Private chat with Christoph after:
    o We should contact the David Colling group as quickly as possible
        o Better to not wait until we have all the information from Derek
    o When establishing contact, make sure to mention we were referenced by
      Tommaso Boccali
        o Tommaso Boccali explicitly recommended trying stuff on the new architecture,
          (and to contact him for help?)
          See his "ecomm" paper (check spelling)
    o Ask Günther what he knows about David Colling's group
        o Christoph insisted, but not sure how Günther can help here

Note to #general about glidein factory settings

Monday 19 June 2017 


Hi Miguel and all,

After last week's investigations into CMS jobs not running on the CRAY, I think the following conclusions can be drawn:

CMS HammerCloud jobs are not running because of a deeper issue with the CMS GlideIn Factory settings. The current factory settings for T2_CH_CSCS_HPC can be found here:
I've been comparing them with the same page for T2_CH_CSCS, and I see no tell-tale indications that anything is configured incorrectly.

All the way at the bottom of the following file is probably the most telling error: http://submit-3.t2.ucsd.edu/CSstoragePath/FactoryLogsGlobalPoolITBDEV/goc-itb/entry_CMSHTPC_T2_CH_CSCS_arc04/job.430352.0.out

```     Validation failed in main/glexec_setup.sh.

    glexec test failed, nonzero value 203
    result:

    stderr:
    [gLExec]:  User cmspl3 (uid=23830) is not whitelisted, i.e. may not invoke gLExec.
               If this is unexpected, ask the sysadmin to check the syslog and the
               permissions of the config file.```

The latest log is unfortunately from June 15, so it's not the most up-to-date information, but there don't seem to be any newer logs (which is also the case for T2_CH_CSCS, so that's not our fault).

Still not sure in which direction to continue.

Cheers,
Thomas

Notes about CMS computing for visit

Wednesday 14 June 2017 


About Tiers:


Tier-0

This is the CERN Data Centre, which is located in Geneva, Switzerland and also at the Wigner Research Centre for Physics in Budapest, Hungary over 1200km away. The two sites are connected by two dedicated 100 Gbit/s data links. All data from the LHC passes through the central CERN hub, but CERN provides less than 20% of the total compute capacity.

Tier 0 is responsible for the safe-keeping of the raw data (first copy), first pass reconstruction, distribution of raw data and reconstruction output to the Tier 1s, and reprocessing of data during LHC down-times.

Tier 1

These are thirteen large computer centres with sufficient storage capacity and with round-the-clock support for the Grid. They are responsible for the safe-keeping of a proportional share of raw and reconstructed data, large-scale reprocessing and safe-keeping of corresponding output, distribution of data to Tier 2s and safe-keeping of a share of simulated data produced at these Tier 2s.

Tier 2

The Tier 2s are typically universities and other scientific institutes, which can store sufficient data and provide adequate computing power for specific analysis tasks. They handle analysis requirements and a proportional share of simulated event production and reconstruction.

There are currently around 160 Tier 2 sites covering most of the globe.

Tier 3

Individual scientists will access these facilities through local (also sometimes referred to as Tier 3) computing resources, which can consist of local clusters in a University Department or even just an individual PC. There is no formal engagement between WLCG and Tier 3 resources.



Links to wms glidein factory settings at CMS

Sourcing custom ROOT and gcc versions on T3

Wednesday 14 June 2017 


Looking for alternatives for:

source /afs/cern.ch/sw/lcg/external/gcc/4.9/x86_64-slc6-gcc49-opt/setup.sh
source /afs/cern.ch/sw/lcg/app/releases/ROOT/6.06.08/x86_64-slc6-gcc49-opt/root/bin/thisroot.sh


export PATH=/cvmfs/cms.cern.ch/slc6_amd64_gcc493/external/gcc/4.9.3/bin:$PATH
which gcc    # -> /cvmfs/cms.cern.ch/slc6_amd64_gcc493/external/gcc/4.9.3/bin/gcc
source /cvmfs/cms.cern.ch/slc6_amd64_gcc493/lcg/root/6.06.04/bin/thisroot.sh


Does not quite work:

[tklijnsm@t3ui02 ~]$ root
root: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by root)

root is probably picking up /usr/lib64/libstdc++.so.6, whereas it should use one from a cvmfs directory.
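
A quick way to check this (sketch, using standard ldd/strings):

ldd $(which root) | grep libstdc++               # which libstdc++.so.6 does the root binary actually resolve?
strings /usr/lib64/libstdc++.so.6 | grep CXXABI  # list the CXXABI versions the system library provides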



Trying with adding this to the path as well:
/cvmfs/cms.cern.ch/slc6_amd64_gcc530/external/llvm/3.7.1/bin


export PATH=/cvmfs/cms.cern.ch/slc6_amd64_gcc493/external/gcc/4.9.3/bin:/cvmfs/cms.cern.ch/slc6_amd64_gcc530/external/llvm/3.7.1/bin:$PATH
which gcc    # -> /cvmfs/cms.cern.ch/slc6_amd64_gcc493/external/gcc/4.9.3/bin/gcc
source /cvmfs/cms.cern.ch/slc6_amd64_gcc493/lcg/root/6.06.04/bin/thisroot.sh

no luck, still /usr/lib64/libstdc++.so.6 not found



Hardcore option:

export PATH=/cvmfs/cms.cern.ch/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_4/bin/slc6_amd64_gcc493:/cvmfs/cms.cern.ch/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_4/external/slc6_amd64_gcc493/bin:/cvmfs/cms.cern.ch/slc6_amd64_gcc493/external/llvm/3.7.1/bin:/cvmfs/cms.cern.ch/slc6_amd64_gcc493/external/gcc/4.9.3/bin:/cvmfs/cms.cern.ch/common:/cvmfs/cms.cern.ch/bin:$PATH
which gcc    # -> /cvmfs/cms.cern.ch/slc6_amd64_gcc493/external/gcc/4.9.3/bin/gcc
source /cvmfs/cms.cern.ch/slc6_amd64_gcc493/lcg/root/6.06.04/bin/thisroot.sh

[tklijnsm@t3ui02 ~]$ root
root: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by root)

Still... Not sure what cmsenv does differently.





Another attempt:

source /cvmfs/sft.cern.ch/lcg/views/LCG_84/x86_64-slc6-gcc49-opt/setup.sh

source /cvmfs/cms.cern.ch/slc6_amd64_gcc493/lcg/root/6.06.04/bin/thisroot.sh

export PYTHONPATH=/cvmfs/cms.cern.ch/slc6_amd64_gcc530/lcg/root/6.06.00-ikhhed4/lib:$CMSSW_BASE/python




Using the LCG views setup works, according to Mauro. Do this:

source /cvmfs/sft.cern.ch/lcg/views/LCG_84/x86_64-slc6-gcc49-opt/setup.sh
export PATH=/swshare/anaconda/bin:$PATH
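
For convenience this can go into a small setup script to be sourced on login (sketch; the file name is mine):

# ~/setup_lcg84.sh  -- source this, do not execute it
source /cvmfs/sft.cern.ch/lcg/views/LCG_84/x86_64-slc6-gcc49-opt/setup.sh
export PATH=/swshare/anaconda/bin:$PATH
root-config --version    # quick sanity check that ROOT comes from the LCG view
which gcc                # check which compiler is picked up now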













Ticket reply regarding lack of CMS jobs on CRAY

Wednesday 14 June 2017 



Can it be specified what is meant by "the CE is not reachable"? There are numerous SAM test jobs running fine on arc04.lcg.cscs.ch:


(Or here a more complete overview:
).

For your debugging purposes, one can directly submit jobs to arc04.lcg.cscs.ch using arcsub:

arcsub -c arc04.lcg.cscs.ch proxyinfo.xrls

where examples for proxyinfo.xrls and run.sh are added as attachments to this post. It should be checked whether CMS glidein jobs are actually being submitted like this, underneath all the layers of abstraction.


Looking at the CMS Glidein Factory Status here: http://vocms0305.cern.ch/factory/monitor/factoryEntryStatusNow.html , I see no entries for T2_CH_CSCS_HPC. There is one old entry for T2_CH_CSCS_arcbrisi, which is configured as follows:

GLIDEIN_CMSSite     T2_CH_CSCS_HPC
Gatekeeper     arcbrisi.cscs.ch

I am not sure if this entry is used for job submission, but if it is, it's clearly faulty. Also, it is worrying that there is no entry with arc04.lcg.cscs.ch at all.

Best regards,
Thomas Klijnsma

Checking the PhEDEx certificate

Friday 9 June 2017 


wmgt01:~$ ssh -A root@t3cmsvobox01

[root@t3cmsvobox01 ~]# su - phedex

[phedex@t3cmsvobox01 ~]$ openssl x509 -in gridcert/proxy.cert -dates
notBefore=Jun  9 14:00:04 2017 GMT
notAfter=Jun 10 01:59:04 2017 GMT
-----BEGIN CERTIFICATE-----
MIIPQDCCDiigAwIBAgIKQfEGWwAA[...]



Ctrl+D logs out of the phedex user.
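
To check programmatically how much validity is left, openssl's -checkend option (argument in seconds) can be used (sketch):

openssl x509 -in gridcert/proxy.cert -noout -checkend $((6*3600)) \
    && echo "proxy still valid for more than 6 hours" \
    || echo "proxy expires within 6 hours - renew it"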



T3 emergency update

Friday 9 June 2017 



Dear users,

As was already communicated last night, the hardware component was installed successfully. At the moment the site is still down because of firmware and security updates, which are ongoing.

The site is expected to come back online this afternoon.

Also, you may receive automated e-mails from Nagios with the following format:


Service: check_data01_shome_<<username>>
Host: t3nfs01
Address: 192.33.123.71
State: CRITICAL

Date/Time: 09-06-2017 XX:XX:XX

Additional Info:

Connection refused by host


You can safely ignore these messages.

Cheers,
Thomas

HammerCloud jobs on T2_CH_CSCS_HPC

Wednesday 7 June 2017 


Example jobs:


Input type: CMSSW_7_0_4
Input DS Patterns: /GenericTTbar/HC-CMSSW_7_0_4_START70_V7-v1/GEN-SIM-RECO
Ganga Job Template: crab3apitest.tpl   [View template file in GitLab]
User code: pf2pat_v7_cfg.py
Option file: empty
Template: functional T2 DE

Check the Metrics tab -> HC jobs seem to be submitted to HPC.

Opening a ticket was not needed!




Thursday 8 June 2017 

Issue not resolved: HC jobs are submitted, but the T2 dashboard shows no incoming HC jobs. The job script for test 47924 to HPC is here:


Job (
  application = CRABApp (),
  backend = CRABBackend (
    CRABConfig = CRABConfig(
    General = General(
      requestName = 'HC-98-T2_CH_CSCS_HPC-47924-20170607090501',
      transferOutputs = False,
      transferLogs = False),
    JobType = JobType(
      psetName = '/data/hc/apps/cms/inputfiles/usercode/pf2pat_v7_cfg.py',
      pluginName = 'Analysis',
      priority = 600000),
    Data    = Data(
      inputDataset = '/GenericTTbar/HC-CMSSW_7_0_4_START70_V7-v1/GEN-SIM-RECO',
      splitting = 'LumiBased',
      unitsPerJob = 100,
      totalUnits = 3000,
      ignoreLocality = True,
      publication = False,
      outputDatasetTag = '0ea12bcd230936c2556840cb8452714d'),
    User    = User(
      voRole='production'),
    Debug   = Debug(
      extraJDL = ['+CRAB_NoWNStageout=1', '+CRAB_HC=True']),
    Site    = Site(
      blacklist = ['T3*'],
      storageSite = 'T2_CH_CERN',
      whitelist = ['T2_CH_CSCS_HPC'])

    )
  )
)



Looking at certificates of valid HC jobs on phoenix:

Lines that contain "cert":
== ENV: X509_CERT_DIR=/gpfs2/scratch/phoenix4/ARC_sessiondir/EsZKDmoVocqnlztIepovt9HmABFKDmABFKDmPDUKDmABFKDmfuO4am/arc/certificates
== ENV: X509_USER_CERT=/gpfs2/scratch/phoenix4/ARC_sessiondir/EsZKDmoVocqnlztIepovt9HmABFKDmABFKDmPDUKDmABFKDmfuO4am/glide_P4F3lz/hostcert.pem
== ENV: GLEXEC_CLIENT_CERT=/gpfs2/scratch/phoenix4/ARC_sessiondir/EsZKDmoVocqnlztIepovt9HmABFKDmABFKDmPDUKDmABFKDmfuO4am/glide_P4F3lz/execute/dir_83171.condor/db85de4154ab6954b8b313bee9b26408357e57f4


Lines that contain "arc0":
== ENV: SLURM_SUBMIT_HOST=arc01.lcg.cscs.ch
== ENV: NORDUGRID_ARC_QUEUE=arc01
Dashboard early startup params: {'MonitorID': '170607_121219:sciaba_crab_HC-98-T2_CH_CSCS-47924-20170607090501', 'MonitorJobID': '1_https://glidein.cern.ch/1/170607:121219:sciaba:crab:HC-98-T2:CH:CSCS-47924-20170607090501_0', 'SyncCE': 'arc01.lcg.cscs.ch', 'OverflowFlag': 0, 'SyncSite': 'T2_CH_CSCS', 'SyncGridJobId': 'https://glidein.cern.ch/1/170607:121219:sciaba:crab:HC-98-T2:CH:CSCS-47924-20170607090501', 'WNHostName': 'wn25.lcg.cscs.ch'}

== JOB AD: Used_Gatekeeper = "arc01.lcg.cscs.ch"




List of HC jobs submitted to _HPC:


The site is currently 'unknown', so probably something is wrong on the CMS side.



Certificate in browser expired

Thursday 8 June 2017 

Problem
I already requested and started using a new grid certificate because the old one expired, so voms-proxy-init etc. all worked. However, the new grid certificate also needs to be imported into the browser - since June 02 I have been denied access to web sites that require a valid grid certificate.

Solution
Easiest: Look in the .globus directory where the valid grid certificate is (on lxplus or PSI, either one). Look for the .p12 file and copy it to the Mac.
Then open Keychain Access. Go to File --> Import, and import the .p12 file.
Restarting Chrome should be enough; for Firefox the certificate needs to be imported manually in Firefox as well.

If there is no .p12 file, it can be created using:

openssl pkcs12 -export -out mygrid.p12 -in usercert.pem -inkey userkey.pem -name "Grid Certificate 08062017"
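
Before importing, the freshly created .p12 can be sanity-checked like this (it prompts for the export password):

openssl pkcs12 -in mygrid.p12 -info -noout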


Controller failure for nfs01 (T3)

Wednesday 7 June 2017 


Background info

nfs01 is the storage system hosting all home directories. Normally it is backed up by a separate (ZFS) machine, but that one is currently offline. 

There is also nfs02, which contains an independent copy (?) of the same files. This is probably not a real copy but somehow uses the same files. Not sure if the actual "hard drives" containing the files are part of the nfs0X storage systems. nfs02 has the double responsibility of also being the NetApp dcache pool.


The problem
Between the UI and the shome file hosts there is a controller:

UI  <-->  controller_0X  <-->  nfs0X  <-->  files on hard drive

The controller is a separate part of the nfs0X that can be replaced. For most purposes it can be considered an integrated part of the nfs, except today, since it is this part that broke down.

Fabio proposed a number of solutions:

- nfs02 should contain a separate "copy" of all user files, and nfs02:data01 would normally be immediately mountable. It turns out nfs02:data01 is now destroyed, and will need to be restored in the future for this fix to work. The risk of this operation is that UI operations may then disturb dcache pool operations, since these also run on nfs02.

- Take the controller from nfs02 and put it in nfs01. I'm not sure to what extent this disturbs dcache operations.

- The third (and safest) option: wait for a new controller.

- An intermediate workable option using the scratch: Copy all user files to t3ui0X:/scratch/, and allow logins in degraded mode. Users will have to copy back their important files once nfs01 is back up.


For now, we're going with waiting for the new controller.

Metrics for F2F 30 May

Wednesday 31 May 2017 


All metrics for T2 HPC:



Example of new test job output:





Tuesday 30 May 2017 





SAM tests and jobs submitted to T2_CH_CSCS_HPC are currently routed to arcbrisi.cscs.ch, which is no longer in use. Because of this, T2_CH_CSCS_HPC is currently placed in the waiting room, and 100% of SAM tests are failing (see an example of a failed job in [1]). SAM tests urgently need to be rerouted to the new host, arc04.lcg.cscs.ch.

Could someone from the CMS side perform this change?

Best regards,
Thomas Klijnsma
(CMS-VO contact for T2_CH_CSCS and T2_CH_CSCS_HPC)

[1]:









Site availability:


63.56%




Agreed metrics (per VO) for both Phoenix and CRAY:

1 Produced walltime (good & bad) per core, per type of job
2 Walltime of good vs failed jobs, per type of job
3 CPU/Wallclock efficiency for successful jobs, per type of job
4 Site Availability
5 Alternative site HepSpec value, if wanted
6 Any other metric you think important, for discussion


   | Cluster | Job type | Produced walltime (core-hours) | Good vs Bad walltime (%) | CPU efficiency for good jobs (%) |
   |---------+----------+--------------------------------+--------------------------+----------------------------------|




Link from Derek to some metrics, CSCS vs CSCS_HPC





No statistics gathered for HPC; SAM tests are sent to arcbrisi.cscs.ch, which is no longer used. The new ARC CE should be arc04.lcg.cscs.ch.

Example of failed SAM test:




T2 Down for maintenance

Monday 29 May 2017 


Dear T2 storage users,

This Wednesday (May 31) there will be a site-wide maintenance at CSCS, and as a consequence the T2 storage element will also experience some downtime. See the attached email below.

Cheers,
Thomas


Status update T2 storage

Tuesday 16 May 2017 


Dear users,

During the scheduled downtime on Friday last week (12.05.2017), dCache was updated from version 2.13.50 to 2.13.58. The change log for this update was minor and no major issues were expected, but in the meantime working with the T2 storage element has become so slow that it is essentially unusable.

The origin has now been traced back to a communication problem with the Chimera database, which is responsible for keeping things in sync in the storage element. With the dCache update this communication became very slow for no apparent reason. The storage element essentially works fine, but simple operations break down and time out because of the extremely slow communication with the Chimera database.

The T2 engineer is in contact with dCache support, but this communication is not the speediest either. More updates will follow.

Cheers,

Thomas





Dear users,

After extensive efforts by the T2 engineers to make the dCache update work, last week's updates were reversed this morning, and the dCache version was rolled back to 2.13.50, as it was before.

The T2 storage element should now be functional again. If you run into any issues, please let me know as soon as possible.

Cheers,

Thomas




Some HammerCloud jobs failing

Responsibilities regarding T2 during outage

Friday 19 May 2017 


See chat below with Joosep.

In summary:

During a time when the T2 cannot function:
- Down time should be registered in GGUS. This was already done for me.
- Tickets against the T2 may appear in GGUS, and there is no notification email.
    Need to manually check here:
    In theory the T2 guys can respond to these as well and they can read
    also my responses, but it's better if I do this (also because the T2
    guys don't fully know physics operations)
- Keep users up to date
- No need to contact CMS if down time is expected to be resolved soon
    Contacting CMS can be done via GGUS tickets against CMS, but these are
    apparently tricky to submit
- Try to keep the T2 out of the morgue
    This happens after 1-2 weeks in the waiting room



Chat with Joosep:


Dario etc are responsible for providing a working dCache. CMS then submits all kinds of tests (SAM, HammerCloud) which have to succeed

[10:11]
if they succeed, then jobs flow into the site and must succeed. if they don't (rare, but sometimes), then you have to follow up with T2 guys and open a pre-emptive ggus ticket against CMS site support. If you don't react, CMS will usually notice within a few days and open a ticket against the site, which after a few days will result in the job flow being switched off until the site is fixed

[10:11]
if there was just downtime, first thing is to check that all kinds of tests succeed

tklijnsm [10:12 AM]
So I can submit a ticket via ggus, with the request to start SAM and HammerCloud jobs?

joosep [10:12 AM]
currently on SSB (flash app), our site is in the waiting room

[10:12]
with 33% SAM (bad) and 98% HC (good)

[10:12]
and there are 3 open ggus tickets

[10:13]
have you seen them?

[10:13]

tklijnsm [10:13 AM]
No

joosep [10:13 AM]
I haven't followed ggus

[10:13]
you have to be proactive there

[10:13]
click on the ggus

tklijnsm [10:14 AM]
Yes I see the tickets

joosep [10:14 AM]
normally the updates from T2->CMS go via you, or you can outsource it to T2 guys

[10:14]
but they have sometimes serious issues with communicating clearly, so better you write :slightly_smiling_face:

tklijnsm [10:14 AM]
I see

[10:14]
And anything->CMS goes via ggus then?

joosep [10:14 AM]
mostly, yes

[10:15]
it's better via ggus than private email to random person

[10:15]
but the latter also happens sometimes

[10:15]
you have to follow up why CMS SAM tests are failing (click on the red 33%)

<< copy paste some error codes from SAM jobs >>

[10:17]
bad signature, seems like some auth error

joosep [10:17 AM]
this info you just have to give to T2 guys, like: "CMS SAM tests are failing, here is the link, please take a look ASAP"

[10:17]
and they have to understand why it's failing, you can't do much there

[10:17]
sometimes you have to kind of convince them that it's actually a site problem rather than CMS problem

tklijnsm [10:18 AM]
I see

joosep [10:18 AM]
but SAM tests are not changing

[10:18]
they have to succeed 100%

[10:18]
so it's 99.999% site problem

tklijnsm [10:18 AM]
Okay, I'll put this in a ticket

[10:18]
(after I check a few other jobs)

joosep [10:18 AM]
yes, this is a webRT ticket to T2 and also let them know via chat

tklijnsm [10:19 AM]
a webRT ticket?

joosep [10:19 AM]
well, the way you make tickets to T2 guys

[10:19]
they have their own system

[10:19]
not ggus

tklijnsm [10:19 AM]
aah by grid-rt @ email == webRT, I see

joosep [10:19 AM]
you can also let them know on skype as gianfranco is doing

[10:19]
yes

tklijnsm [10:19 AM]
yes okay, I got that

joosep [10:19 AM]
it's the same, webRT is the web interface

tklijnsm [10:20 AM]
So the outage already started last week, with Dario continuously promising to resolve the issue "tomorrow"

joosep [10:20 AM]
yes

[10:20]
usual situation

[10:21]
I mean, also dCache support is basically useless

tklijnsm [10:21 AM]
Should I have contacted CMS at some point? And if so, how?

joosep [10:21 AM]
hmm

[10:21]
well, site was declared in downtime

[10:21]
so that was all correct

[10:21]
this is the way to let them know

[10:21]
possibly the GGUS tickets could have been updated with the same info

tklijnsm [10:21 AM]
hmm, so why is it in the waiting room now? It should have remained down until further notice

joosep [10:21 AM]
it looks like nobody updated those from T2 side

[10:21]
good question

tklijnsm [10:22 AM]
I see, so GGUS I will have to monitor more actively

joosep [10:22 AM]
I think site downtime -> moved to waiting room

[10:22]
and if in waiting room more than 1-2w, move to morgue

[10:22]
yes, ggus unfortunately doesn't send emails

[10:22]
for new tickets

[10:22]
and you have to check every once in a while

[10:22]
easy to see on SSB

tklijnsm [10:22 AM]
yes there is a pretty clear number

joosep [10:23 AM]
e.g. the spacemon ticket is already several months old

[10:23]
and still they are not able to do it

tklijnsm [10:23 AM]
Okay, I'll first submit a ticket to T2 regarding the SAM tests, then I deal with the GGUS ticket

joosep [10:23 AM]
yes

[10:23]
on ggus basically just let CMS know what is the situation

[10:23]
that there are problems with dcache after the last upgrade and it's being worked on

tklijnsm [10:24 AM]
And then for future reference, when I think CMS is doing something wrong, I can also submit a GGUS ticket against CMS?

joosep [10:24 AM]
yes, but ggus is really confusing

[10:24]
to make a ticket

[10:24]
maybe Derek or someone can show you

[10:24]
also I always get lost there

[10:24]
I only did it a few times

Response to GGUS tickets

Friday 19 May 2017 

The SE at T2_CH_CSCS (and T2_CH_CSCS_HPC) has been down since a failed dCache update. The engineers have tried everything to make the update work, and have today decided to roll back to a previous version. SAM tests should be succeeding again starting this morning.

Deleting files even if quota is reached

Monday 15 May 2017 


In case a user's quota is full, this trick will let them delete files even without deleting snapshots:

From (email from Derek):


Workaround:

The trick is to copy /dev/null to the file you want to delete:

bfguser@bwui:~> ls -lah testfile1
-rw-r--r-- 1 bfguser bfggroup 16M 2009-03-23 10:44 testfile1

bfguser@bwui:~> cp /dev/null testfile1
bfguser@bwui:~> ls -lah testfile1
-rw-r--r-- 1 bfguser bfggroup 0 2009-03-23 11:41 testfile1

bfguser@bwui:~> rm testfile1
bfguser@gwui:~> ls -lah testfile1
/bin/ls: testfile1: No such file or directory

Explanation:

ZFS is a copy-on-write filesystem, so a file deletion transiently takes slightly more space on disk before a file is actually deleted. It has to write the metadata involved with the file deletion before it removes the allocation for the file being deleted. This is how ZFS is able to always be consistent on disk, even in the event of a crash.
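
The same trick as a one-liner for an arbitrary file (the path is just an example):

f=/mnt/t3nfs01/data01/shome/$USER/some_big_file.root   # hypothetical file to get rid of
cp /dev/null "$f" && rm "$f"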

GOCDB: Outages and status

Monday 15 May 2017 


T2 storage is down: should I perform an action so that it does not end up in the morgue?




Made an account here:

Form contained only first + last name, email and phone
( used Thomas.Klijnsma@cern.ch for email)


From this site:

Recent Downtimes Affecting CSCS-LCG2's SEs
Description     From     To
dCache outage     2017-05-15 16:00     2017-05-16 16:00
dCache outage     2017-05-14 00:40     2017-05-15 18:00

So downtime is properly registered already (although not by me).


Probably a dCache problem alone is not enough to trigger a whole "Site Down" (SD) state.


T3 Steering Board Meeting

Friday 12 May 2017 


-------------------------------------------------------
Start with slides from Derek

(last T3SBM: winter 2015)




Urs PhEDEx deletion request:


About user home reduction:
Proposal is 100GB, statistics will be gathered before decision
Worries about:
 - New users - need much better documentation about using the storage element
 - Many small files - can log-files be auto-compressed? Stuff that's not root-files?
 - Is this what users want? Carefully look at what the system can handle


End of June: write TWiki page about job submission


Later this year:
Have second meeting with more statistics so decisions about investments can be made
See last slides regarding finance

Ask users whether they use grid-control; this may help with issues related to the T3 not being grid-attached.




-------------------------------------------------------

Users job won't run

Wednesday 10 May 2017 

Gael mentions that one of his jobs is refusing to run and remains queued.


Solution:


[tklijnsm@t3ui02 ~]$ qstat -u '*' | grep perrin
2696104 0.55001 Zee_CRttba gaperrin     dr    05/04/2017 03:53:10 all.q@t3wn48.psi.ch                1        
2696613 0.55001 Zee_CRZb_i gaperrin     dr    05/04/2017 04:52:19 all.q@t3wn48.psi.ch                1        
2696825 0.55001 Zee_CRttba gaperrin     dr    05/04/2017 05:18:13 all.q@t3wn48.psi.ch                1        
2697230 0.55001 Zuu_CRZlig gaperrin     dr    05/04/2017 06:07:37 all.q@t3wn48.psi.ch                1        
2697236 0.55001 Zuu_CRZlig gaperrin     dr    05/04/2017 06:08:01 all.q@t3wn48.psi.ch                1        
2697237 0.55001 Zuu_CRZlig gaperrin     dr    05/04/2017 06:08:01 all.q@t3wn48.psi.ch                1        
2698346 0.55001 Zuu_CRZb_i gaperrin     dr    05/04/2017 09:08:43 all.q@t3wn48.psi.ch                1        
2845072 0.55000 ZuuBDT_low gaperrin     qw    05/10/2017 11:56:43                                    1        
[tklijnsm@t3ui02 ~]$
[tklijnsm@t3ui02 ~]$
[tklijnsm@t3ui02 ~]$ qstat -j 2845072
==============================================================
job_number:                 2845072
exec_file:                  job_scripts/2845072
submission_time:            Wed May 10 11:56:43 2017
owner:                      gaperrin
uid:                        618
group:                      ethz-susy
gid:                        533
sge_o_home:                 /mnt/t3nfs01/data01/shome/gaperrin
sge_o_log_name:             gaperrin
sge_o_path:                 /cvmfs/cms.cern.ch/share/overrides/bin:/mnt/t3nfs01/data01/shome/gaperrin/VHbb/CMSSW_7_4_3/bin/slc6_amd64_gcc491:/mnt/t3nfs01/data01/shome/gaperrin/VHbb/CMSSW_7_4_3/external/slc6_amd64_gcc491/bin:/cvmfs/cms.cern.ch/slc6_amd64_gcc491/cms/cmssw/CMSSW_7_4_3/bin/slc6_amd64_gcc491:/cvmfs/cms.cern.ch/slc6_amd64_gcc491/cms/cmssw/CMSSW_7_4_3/external/slc6_amd64_gcc491/bin:/cvmfs/cms.cern.ch/slc6_amd64_gcc491/external/llvm/3.6/bin:/cvmfs/cms.cern.ch/slc6_amd64_gcc491/external/gcc/4.9.1-cms/bin:/bin:/cvmfs/cms.cern.ch/common:/cvmfs/cms.cern.ch/bin:/gridware/sge/bin/lx24-amd64:/cvmfs/cms.cern.ch/common:/cvmfs/cms.cern.ch/bin:/gridware/sge/bin/lx24-amd64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/mnt/t3nfs01/data01/swshare/psit3/bin:/mnt/t3nfs01/data01/shome/gaperrin/bin:/mnt/t3nfs01/data01/swshare/psit3/bin:/mnt/t3nfs01/data01/shome/gaperrin/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /mnt/t3nfs01/data01/shome/gaperrin/VHbb/CMSSW_7_4_3/src/Xbb/python
sge_o_host:                 t3ui02
account:                    sge
cwd:                        /mnt/t3nfs01/data01/shome/gaperrin/VHbb/CMSSW_7_4_3/src/Xbb/python
merge:                      y
hard resource_list:         os=sl6,h_vmem=6G
mail_list:                  gaperrin@t3ui02.psi.ch
notify:                     FALSE
job_name:                   ZuuBDT_lowpt_GaelZllHbb13TeV2016mergesyscachingdc
stdout_path_list:           NONE:NONE:/mnt/t3nfs01/data01/shome/gaperrin/VHbb/CMSSW_7_4_3/src/Xbb/python/logs_v25//MERGESYSCACHING_v8_Zll_7/Logs//mergesyscachingdc_2017_05_10-11_56_42_ZuuBDT_lowpt_GaelZllHbb13TeV2016_.out
jobshare:                   0
hard_queue_list:            all.q
env_list:                   SRT_G4LEVELGAMMADATA_SCRAMRTDEL=/cvmfs/cms.cern.ch/slc6_amd64_gcc491/external/geant4-G4PhotonEvaporation/3.0/data/Photon <<<VERY MUCH SHORTENED>>> git/1.8.3.1-odfocd/libexec/git-core,SRT_VALGRIND_LIB_SCRAMRTDEL=/cvmfs/cms.cern.ch/slc6_amd64_gcc491/external/valgrind/3.10.1/lib/valgrind,CMSSW_BASE=/mnt/t3nfs01/data01/shome/gaperrin/VHbb/CMSSW_7_4_3,LOCALRT=/mnt/t3nfs01/data01/shome/gaperrin/VHbb/CMSSW_7_4_3,BASH_FUNC_module()=() {  eval `/usr/bin/modulecmd bash $*`
job_args:                   ZuuBDT_lowpt,GaelZllHbb13TeV2016,mergesyscachingdc,1,noid
script_file:                runAll.sh
parallel environment:  smp range: 1
scheduling info:            queue instance "all.q.admin@t3wn21.psi.ch" dropped because it is temporarily not available
                            queue instance "all.q.admin@t3wn31.psi.ch" dropped because it is temporarily not available
                            queue instance "all.q.admin@t3wn48.psi.ch" dropped because it is temporarily not available
                            queue instance "all.q@t3wn46.psi.ch" dropped because it is disabled
                            queue instance "all.q@t3wn51.psi.ch" dropped because it is disabled
                            queue instance "all.q@t3wn58.psi.ch" dropped because it is disabled
                            queue instance "all.q@t3wn43.psi.ch" dropped because it is full
                            queue instance "all.q@t3wn37.psi.ch" dropped because it is full
                            queue instance "all.q@t3wn23.psi.ch" dropped because it is full
                            queue instance "all.q@t3wn36.psi.ch" dropped because it is full
                            queue instance "all.q@t3wn19.psi.ch" dropped because it is full
                            <<<VERY MUCH SHORTENED>>>
                            cannot run in queue "debug.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "short.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "meg.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "bigmem.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "long.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "all.q.admin" because it is not contained in its hard queue list (-q)
                            cannot run in queue "sherpa.gen.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "sherpa.int.long.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "sherpa.int.vlong.q" because it is not contained in its hard queue list (-q)
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn17.psi.ch" because it offers only hc:h_vmem=1.000G
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn15.psi.ch" because it offers only hc:h_vmem=1.553G
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn54.psi.ch" because it offers only hc:h_vmem=2.699G
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn26.psi.ch" because it offers only hc:h_vmem=1.281G
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn57.psi.ch" because it offers only hc:h_vmem=264.000M
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn10.psi.ch" because it offers only hc:h_vmem=1.818G
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn52.psi.ch" because it offers only hc:h_vmem=264.000M
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn53.psi.ch" because it offers only hc:h_vmem=2.281G
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn56.psi.ch" because it offers only hc:h_vmem=1.723G
                            cannot run because it exceeds limit "////t3wn50/" in rule "max_jobs_per_supermicro_host/1"
                            cannot run because it exceeds limit "////t3wn49/" in rule "max_jobs_per_supermicro_host/1"
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn24.psi.ch" because it offers only hc:h_vmem=1.412G
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn19.psi.ch" because it offers only hc:h_vmem=1.824G
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn14.psi.ch" because it offers only hc:h_vmem=596.000M
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn13.psi.ch" because it offers only hc:h_vmem=1.824G
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn29.psi.ch" because it offers only hc:h_vmem=1.824G
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn20.psi.ch" because it offers only hc:h_vmem=1.000G
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn27.psi.ch" because it offers only hc:h_vmem=1.406G
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn22.psi.ch" because it offers only hc:h_vmem=1017.954M
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn12.psi.ch" because it offers only hc:h_vmem=1017.954M
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn25.psi.ch" because it offers only hc:h_vmem=1.824G
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn16.psi.ch" because it offers only hc:h_vmem=1.824G
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn23.psi.ch" because it offers only hc:h_vmem=1.824G
                            (-l h_vmem=6G,os=sl6) cannot run in queue "t3wn18.psi.ch" because it offers only hc:h_vmem=1.412G
                            cannot run in PE "smp" because it only offers 0 slots
[tklijnsm@t3ui02 ~]$
[tklijnsm@t3ui02 ~]$

This hints at too much memory being requested, and that turned out to be the issue.
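
A sketch of the fix on the user side; the exact h_vmem value has to be chosen based on what the job really needs and what the hosts offer (job ID and arguments taken from the qstat output above):

# adjust the pending job in place (if permitted by the scheduler configuration);
# note that -l replaces the whole hard resource list, hence os=sl6 is repeated
qalter -l h_vmem=2G,os=sl6 2845072

# or resubmit with a smaller memory request
qsub -q all.q -l h_vmem=2G,os=sl6 runAll.sh ZuuBDT_lowpt GaelZllHbb13TeV2016 mergesyscachingdc 1 noid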

Fixed contents of CloseCoutSentry.h

Wednesday 10 May 2017 


This version seems to work fine everywhere. The fix was to make trueStdOut() open a duplicate of fdOut_ instead of fdOut_ directly.



#include "HiggsAnalysis/CombinedLimit/interface/CloseCoutSentry.h"

#include <cstdio>
#include <cassert>
#include <unistd.h>

#include <stdexcept>
#include <fcntl.h>

#include <iostream>


bool CloseCoutSentry::open_ = true;
int  CloseCoutSentry::fdOut_ = 0;
int  CloseCoutSentry::fdErr_ = 0;
FILE * CloseCoutSentry::trueStdOut_ = 0;
CloseCoutSentry *CloseCoutSentry::owner_ = 0;


CloseCoutSentry::CloseCoutSentry(bool silent) :
    silent_(silent), stdOutIsMine_(false)
{
    if (silent_) {
        if (open_) {

            std::cout << " [TK] In CloseCoutSentry.cc: Initializing CloseCoutSentry" << std::endl;
            std::cout << " [TK] In CloseCoutSentry.cc: fileno(stdout) = " << fileno(stdout) << std::endl;

            open_ = false;
            if (fdOut_ == 0 && fdErr_ == 0) {
                std::cout << " [TK] In CloseCoutSentry.cc: Duplicating fds" << std::endl;
                fdOut_ = dup(fileno(stdout));
                fdErr_ = dup(fileno(stderr));
            }

            int fdTmp_ = open( "/dev/null", O_RDWR );
            std::cout << " [TK] In CloseCoutSentry.cc: Redirecting fds" << std::endl;
            dup2(fdTmp_, fileno(stdout) );
            dup2(fdTmp_, fileno(stderr) );
            std::cout << " [TK] In CloseCoutSentry.cc: Fds redirected" << std::endl;
            assert(owner_ == 0);
            owner_ = this;
        } else {
            silent_ = false;
        }
    }
}

CloseCoutSentry::~CloseCoutSentry()
{
    clear();
}

void CloseCoutSentry::clear()
{

    fprintf(
        trueStdOutGlobal(),
        " [TK] In CloseCoutSentry.cc: Calling clear() \n"
        );

    if (stdOutIsMine_) {
        assert(this == owner_);
        fclose(trueStdOut_);
        trueStdOut_ = 0;
        stdOutIsMine_ = false;
    }
    if (silent_) {
        reallyClear();
        silent_ = false;
    }
    std::cout << " [TK] In CloseCoutSentry.cc: After clear(); This should print" << std::endl;
}

void CloseCoutSentry::reallyClear()
{
    if (fdOut_ != fdErr_) {
        dup2( fdOut_, fileno(stdout) );
        dup2( fdErr_, fileno(stderr) );
        open_   = true;
        owner_ = 0;
    }
}

void CloseCoutSentry::breakFree()
{
    reallyClear();
}

FILE *CloseCoutSentry::trueStdOutGlobal()
{
    if (!owner_) return stdout;
    return owner_->trueStdOut();
}

FILE *CloseCoutSentry::trueStdOut()
{
    if (open_) return stdout;
    if (trueStdOut_) return trueStdOut_;
    if (owner_ != this && owner_ != 0) return owner_->trueStdOut();
    assert(owner_ == this);
    stdOutIsMine_ = true;
    int fdOutDup_ = dup( fdOut_ );
    trueStdOut_ = fdopen( fdOutDup_, "w" );
    return trueStdOut_;
}



Comments on pull request for combine worker node sentry issue

Tuesday 9 May 2017 



adavidzh commented 17 minutes ago

@tklijnsma, can you please comment on:

the kind of crashes induced, and
how it is that the crash was found to be specific to SGE worker nodes (esp. why only on "some" nodes).
I have seen combine running out of file descriptors, so I wonder if this is related, and if dup2 mitigates it.



Hi @adavidzh,

For the longest time combine jobs would fail on our local computer cluster at PSI, and I only recently traced the origin back to CloseCoutSentry. The job would crash at the initialization of a CloseCoutSentry object, without any error messages whatsoever (those were probably successfully redirected).

The origin is (probably) that on our local cluster, stdout/stderr are redirected to files from the beginning. The following code (taking all the relevant excerpts from CloseCoutSentry) reproduces the problem:


#include <iostream>
#include <cstdlib>
#include <string>
#include <stdexcept>

#include <cstdio>
#include <cassert>
#include <unistd.h>
#include <fcntl.h>

using namespace std;

int main(int argc, char **argv) {

    cout << "Test print 0 at beginning of program" << endl;

    int  fdOut = 0;
    int  fdErr = 0;

    if (fdOut == 0 && fdErr == 0) {
        fdOut = dup(1);
        fdErr = dup(2);
        }
    freopen("/dev/null", "w", stdout);
    freopen("/dev/null", "w", stderr);

    cout << "Test print 1 after redirecting" << endl;

    if (fdOut != fdErr) {
        char buf[50];
        sprintf( buf, "/dev/fd/%d", fdOut ); freopen(buf, "w", stdout);
        sprintf( buf, "/dev/fd/%d", fdErr ); freopen(buf, "w", stderr);
        }

    cout << "Test print 2 after undoing the redirecting" << endl;

    }


On a UI / lxplus worker node:

Test print 0 at beginning of program
Test print 2 after undoing the redirecting


And on a worker node of our cluster:

Test print 0 at beginning of program
# ================================================================
# JOB Live Resources USAGE for job 2763256: (...)


I didn't mean to imply the crash is specific to SGE, I only meant to say that I reliably managed to crash combine on at least one type of SGE worker nodes. I'm not sure whether dup2 mitigates combine running out of file descriptors.
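As a side note, a minimal standalone sketch of the dup()/dup2() save-and-restore that the patch relies on (my own illustration, not an excerpt from the patch; it should behave the same whether stdout is a terminal or a redirected file):

#include <cstdio>
#include <unistd.h>
#include <fcntl.h>
#include <iostream>

int main() {
    std::cout << "before silencing" << std::endl;

    int saved   = dup(fileno(stdout));   // keep a handle on the original stdout target
    int devnull = open("/dev/null", O_RDWR);
    dup2(devnull, fileno(stdout));       // fd 1 now points to /dev/null
    close(devnull);

    std::cout << "silenced (this goes to /dev/null)" << std::endl;

    fflush(stdout);
    dup2(saved, fileno(stdout));         // fd 1 points back at the original target
    close(saved);

    std::cout << "after restoring" << std::endl;
    return 0;
}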



Contents of CloseCoutSentry.cc with fix


Tuesday 9 May 2017 



#include "HiggsAnalysis/CombinedLimit/interface/CloseCoutSentry.h"

#include <cstdio>
#include <cassert>
#include <unistd.h>

#include <stdexcept>
#include <fcntl.h>

bool CloseCoutSentry::open_ = true;
int  CloseCoutSentry::fdOut_ = 0;
int  CloseCoutSentry::fdErr_ = 0;
FILE * CloseCoutSentry::trueStdOut_ = 0;
CloseCoutSentry *CloseCoutSentry::owner_ = 0;


CloseCoutSentry::CloseCoutSentry(bool silent) :
    silent_(silent), stdOutIsMine_(false)
{
    if (silent_) {
        if (open_) {
            open_ = false;
            if (fdOut_ == 0 && fdErr_ == 0) {
                fdOut_ = dup(1);
                fdErr_ = dup(2);
            }

            // freopen("/dev/null", "w", stdout);
            // freopen("/dev/null", "w", stderr);

            int fdTmp_ = open( "/dev/null", O_RDWR );
            dup2(fdTmp_, 1);
            dup2(fdTmp_, 2);

            assert(owner_ == 0);
            owner_ = this;
        } else {
            silent_ = false;
        }
    }
}

CloseCoutSentry::~CloseCoutSentry()
{
    clear();
}

void CloseCoutSentry::clear()
{
    if (stdOutIsMine_) {
        assert(this == owner_);
        fclose(trueStdOut_); trueStdOut_ = 0; stdOutIsMine_ = false;
    }
    if (silent_) {
        reallyClear();
        silent_ = false;
    }
}

void CloseCoutSentry::reallyClear()
{
    if (fdOut_ != fdErr_) {
        // char buf[50];
        // sprintf(buf, "/dev/fd/%d", fdOut_); freopen(buf, "w", stdout);
        // sprintf(buf, "/dev/fd/%d", fdErr_); freopen(buf, "w", stderr);
        dup2( fdOut_, 1 );
        dup2( fdErr_, 2 );
        open_   = true;
        owner_ = 0;
    }
}

void CloseCoutSentry::breakFree()
{
    reallyClear();
}

FILE *CloseCoutSentry::trueStdOutGlobal()
{
    if (!owner_) return stdout;
    return owner_->trueStdOut();
}

FILE *CloseCoutSentry::trueStdOut()
{
    if (open_) return stdout;
    if (trueStdOut_) return trueStdOut_;
    if (owner_ != this && owner_ != 0) return owner_->trueStdOut();
    assert(owner_ == this);
    stdOutIsMine_ = true;

    // char buf[50];
    // sprintf(buf, "/dev/fd/%d", fdOut_); trueStdOut_ = fopen(buf, "w");
    trueStdOut_ = fdopen( fdOut_, "w" );

    return trueStdOut_;
}



Combine issue on worker nodes (wns) on T3

Friday 5 May 2017 

This line breaks:

CloseCoutSentry sentry(verbose < 3);

Going to try and reproduce this.





Monday 8 May 2017 

Tried to remake CloseCoutSentry.{cc,h}, but can't get it to work. Something must still be broken in the class.

It's weird though: I built a jobscript that reproduces the error consistently, fixed that problem in CloseCoutSentry.{cc,h}, but combine still breaks on the same line. So something else must be broken as well, or the fix was not good enough (though it seems fine).

All changes left uncommitted for now.

Mail about benchmark job

Friday 5 May 2017 


Derek Feichtinger <derek.feichtinger@psi.ch>,
Christoph Grab <grab@phys.ethz.ch>,
Nina Loktionova <nina.loktionova@psi.ch>,
Gorini Stefano Claudio <gorini@cscs.ch>,
Joosep Pata <joosep.pata@cern.ch>,
Thomas Klijnsma <thomas.klijnsma@cern.ch>,
luis.march.ruiz@cern.ch,
Miguel Gila <miguel.gila@cscs.ch>

derek.feichtinger@psi.ch
grab@phys.ethz.ch
nina.loktionova@psi.ch
gorini@cscs.ch
joosep.pata@cern.ch
thomas.klijnsma@cern.ch
luis.march.ruiz@cern.ch
miguel.gila@cscs.ch


Dear all,

During the last LHConCRAY meeting we discussed several performance benchmarks for Phoenix and CRAY. Among these would be a stable, easily interpretable process, preferably some sort of MC event generation.

At the T1 at FNAL there already exists a stable TTBAR MC generation job that is pretty trivial to run (it only needs access to /cvmfs). I received a script from Josep Flix that runs in a few hours for 500 events and measures the event throughput. See the attachment and the instructions below [*]. (Two paths in the script need to be edited before one can use it.)

Using 4 threads on my local T3_CH_PSI for example, I got the following benchmark:

TimeReport> Time report complete in 10336.7 seconds
Time Summary:
- Min event:   7.34417
- Max event:   49.9315
- Avg event:   20.568
- Total loop:  10322.4
- Total job:   10336.7
Event Throughput: 0.0484382 ev/s <-- The most informative number
CPU Summary:
- Total loop:  10297.3
- Total job:   10311.4
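(Cross-check, assuming this run indeed processed the full 500 events: 500 events / 10322.4 s ≈ 0.0484 ev/s, matching the reported Event Throughput.)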

I think this might be a nice solution for comparing the Phoenix with the CRAY. Let me know your thoughts.

Cheers,
Thomas


[*] Instructions by Josep Flix:

Dear Thomas,

sure, please find the benchmark attached. As said, you only need to run the benchmark-2017-gensim.sh script I attach here.
You need CVMFS in your system for CMS. That's all.

Be sure you adapt the variables:

[root@td713 cms]# cat benchmark-2017-gensim.sh | grep db12
export RES_DIR=/home/db12/cms/results/$RUNID
export CUR_DIR=/home/db12/cms

This is the current directory, and where the results go (./results).

You launch the script in this way:

./benchmark-2017-gensim.sh $NJOBS $THREADS $NEVTS

For example, I recommend you to run 1 job with 4 threads and 500 events, or filling all of the physical cores or logical cores of the machine, by changing the THREADS.  

THREADS = 1 means 4 threads, sorry for not being clear before.

If you want to fill a machine of 24 cores, you should set this to 6.

To interpret the results, for example, if you set 3 in THREADS you will see a similar directory like this in the results directory:

[root@td550 10908_3_4_500]# ls
test_1_stress  test_2_stress  test_3_stress

which contains the results for each job sent in parallel. Then, you can check the Summary of the test as:

[root@td550 10908_3_4_500]# cat test_1_stress/CMSSW_8_4_0_patch2/src/test_1_stress.log | grep -A 10 -B 10 Avg
[5] run: 1 lumi: 1 event: 4  vsize = 3665.02 deltaVsize = 108 rss = 2391.51 delta = 192.766
[21] run: 1 lumi: 1 event: 21  vsize = 3749.27 deltaVsize = 52.25 rss = 2461.91 delta = 70.4023
[39] run: 1 lumi: 1 event: 38  vsize = 3857.28 deltaVsize = 34 rss = 2538.01 delta = 76.1016
[483] run: 1 lumi: 1 event: 484  vsize = 4170.29 deltaVsize = 0 rss = 2691.33 delta = 17.0039
[482] run: 1 lumi: 1 event: 482  vsize = 4170.29 deltaVsize = 0 rss = 2653.8 delta = -20.5273
[481] run: 1 lumi: 1 event: 481  vsize = 4170.29 deltaVsize = 10 rss = 2674.32 delta = -15.6523
TimeReport> Time report complete in 4689.71 seconds
Time Summary:
- Min event:   8.95481
- Max event:   91.5796
- Avg event:   36.7333
- Total loop:  4673.36
- Total job:   4689.71
Event Throughput: 0.106989 ev/s
CPU Summary:
- Total loop:  18342.8
- Total job:   18358.8

The important thing is the Event Throughput. This is your benchmark.

Have fun!
Pepe.

Network issue with T2

Friday 5 May 2017 

When accessing root files, occasionally receive the following error:

>  ( ERROR )  TNetXNGFile::Open  : [ERROR] Server responded with an error: [3012] Internal timeout

This is an old issue:



Joosep: it's with xrootd, but today also gsiftp



Dario did fix a typo in some config files:

"""
dario [1:46 PM]
@joosep @gianf I found a typo error in 4 pools

[1:46]
I mean in the config file
dario [1:46 PM]
this is probably the reason
[1:47]
of the problems
[1:47]
I'm really sorry
[1:47]
I modified 146 pools
[1:47]
and did a typo in the last 4

ool.mover.xrootd.port.min=33511
pool.mover.xrootd.port.max=33511

"""

But issues are back, so this may not have changed anything.


grid-rt ticket:


Dear T2 admins,

Users have reported occasionally and randomly receiving error messages when opening root files on the T2 storage. The error message is the following:

>  ( ERROR )  TNetXNGFile::Open  : [ERROR] Server responded with an error: [3012] Internal timeout

The issue seems to be similar (or the same) as the xrootd issue from a few weeks ago.

Examples of affected files include some root files in the following directories:

root://storage01.lcg.cscs.ch/pnfs/lcg.cscs.ch/cms/trivcat/store/user/jpata/tth/Apr20_cmssw_v2/TT_TuneCUETP8M2T4_13TeV-powheg-pythia8/Apr20_cmssw_v2/170420_155829/0000/

root://storage01.lcg.cscs.ch/pnfs/lcg.cscs.ch/cms/trivcat/store/user/dsalerno/tth/V25_nBCSVM_v1/ttHTobb_M125_TuneCUETP8M2_ttHtranche3_13TeV-powheg-pythia8/V25_nBCSVM_v1/170430_142310/0000/


The error can be reproduced by, for example, the following Python script:

import os
import ROOT

from time import sleep
from time import strftime
now = lambda: strftime( '%H:%M:%S %b%d' )

########################################
# Main
########################################

def main():

    for iAttempt in xrange(100):
        print '\n[ {0} ] Trying to access'.format( now() )

        try:
            fp = ROOT.TFile.Open( 'root://storage01.lcg.cscs.ch/pnfs/lcg.cscs.ch/cms/trivcat/store/user/jpata/tth/Apr20_cmssw_v2/TT_TuneCUETP8M2T4_13TeV-powheg-pythia8/Apr20_cmssw_v2/170420_155829/0000/tree_14.root' )
            fp.ls()
            sleep(1)
            fp.Close()
        except:
            print 'HERE IS FAILURE!'

        sleep(5)


########################################
# End of Main
########################################
if __name__ == "__main__":
    main()

On my last try I got 2 error messages out of 100 attempts.

Cheers,
Thomas (CMS VO-contact at ETH Zurich)






Some output of the script:

[ 12:04:28 May05 ] Trying to access
TNetXNGFile**       root://storage01.lcg.cscs.ch/pnfs/lcg.cscs.ch/cms/trivcat/store/user/jpata/tth/Apr20_cmssw_v2/TT_TuneCUETP8M2T4_13TeV-powheg-pythia8/Apr20_cmssw_v2/170420_155829/0000/tree_14.root
TNetXNGFile*       root://storage01.lcg.cscs.ch/pnfs/lcg.cscs.ch/cms/trivcat/store/user/jpata/tth/Apr20_cmssw_v2/TT_TuneCUETP8M2T4_13TeV-powheg-pythia8/Apr20_cmssw_v2/170420_155829/0000/tree_14.root
  KEY: TDirectoryFile   vhbb;1  vhbb
  KEY: TH1F CounterAnalyzer_count_blr;1 count
  KEY: TH1F CounterAnalyzer_count_final;1   count
  KEY: TH1F CounterAnalyzer_count_lep;1 count
  KEY: TH1F CounterAnalyzer_count_jet;1 count
  KEY: TTree    tree;1  PhysicsTools.Heppy.analyzers.core.AutoFillTreeProducer.AutoFillTreeProducer_2
  KEY: TH1F CounterAnalyzer_count;1 count
  KEY: TH1F CounterAnalyzer_count_trg;1 count

[ 12:04:34 May05 ] Trying to access
TNetXNGFile**       root://storage01.lcg.cscs.ch/pnfs/lcg.cscs.ch/cms/trivcat/store/user/jpata/tth/Apr20_cmssw_v2/TT_TuneCUETP8M2T4_13TeV-powheg-pythia8/Apr20_cmssw_v2/170420_155829/0000/tree_14.root
TNetXNGFile*       root://storage01.lcg.cscs.ch/pnfs/lcg.cscs.ch/cms/trivcat/store/user/jpata/tth/Apr20_cmssw_v2/TT_TuneCUETP8M2T4_13TeV-powheg-pythia8/Apr20_cmssw_v2/170420_155829/0000/tree_14.root
  KEY: TDirectoryFile   vhbb;1  vhbb
  KEY: TH1F CounterAnalyzer_count_blr;1 count
  KEY: TH1F CounterAnalyzer_count_final;1   count
  KEY: TH1F CounterAnalyzer_count_lep;1 count
  KEY: TH1F CounterAnalyzer_count_jet;1 count
  KEY: TTree    tree;1  PhysicsTools.Heppy.analyzers.core.AutoFillTreeProducer.AutoFillTreeProducer_2
  KEY: TH1F CounterAnalyzer_count;1 count
  KEY: TH1F CounterAnalyzer_count_trg;1 count

[ 12:04:40 May05 ] Trying to access
Error in <TNetXNGFile::Open>: [ERROR] Server responded with an error: [3012] Internal timeout

HERE IS FAILURE!

[ 12:05:31 May05 ] Trying to access
Error in <TNetXNGFile::Open>: [ERROR] Server responded with an error: [3012] Internal timeout

HERE IS FAILURE!

Mail Derek about combine issue

Friday 5 May 2017 


Hi Derek and Nina,

This week I finally figured out why some of my analysis was never able to run on the T3 worker nodes. It concerns the program "combine", which is used extensively by a lot of people in our group.

Summarized, combine contains code that looks like this (spread across several functions in the actual implementation; this is heavily simplified):

int main(int argc, char **argv) {

    cout << "Test print 0 at beginning of program" << endl;

    int  fdOut = 0;
    int  fdErr = 0;

    if (fdOut == 0 && fdErr == 0) {
        fdOut = dup(1);
        fdErr = dup(2);
        }
    freopen("/dev/null", "w", stdout);
    freopen("/dev/null", "w", stderr);

    cout << "Test print 1 after redirecting" << endl;

    if (fdOut != fdErr) {
        char buf[50];
        sprintf( buf, "/dev/fd/%d", fdOut ); freopen(buf, "w", stdout);
        sprintf( buf, "/dev/fd/%d", fdErr ); freopen(buf, "w", stderr);
        }

    cout << "Test print 2 after undoing the redirecting" << endl;

    }


If I execute this on a user interface, I get:

[tklijnsm@t3ui02 StreamTest]$ ./sentryTest.exe
Test print 0 at beginning of program
Test print 2 after undoing the redirecting

So as expected, it skips test print 1. However, if I try to execute this on a worker node, I get:

[tklijnsm@t3ui02 StreamTest]$ cat joboutput/jobscript.sh.o2763256
Test print 0 at beginning of program
# ================================================================
# JOB Live Resources USAGE for job 2763256: ( don't consider mem values, they are wrong )
# usage    1:                 cpu=00:00:00, mem=0.00000 GBs, io=0.00000, vmem=N/A, maxvmem=N/A
#
# JOB Historical Resources USAGE for job 2763256: you have to manually run
# qacct -j 2763256 2&> /dev/null || qacct -f /gridware/sge/default/common/accounting.complete -j 2763256
#
#
# JOBs executed on t3wn[30-40] should run ~1.13 faster than t3wn[10-29]
#
# removing TMPDIR: /scratch/tmpdir-2763256.1.short.q

As you can see, test print 2 is never printed. While this seems minor, the implication has been that certain analyses have been run exclusively on lxplus for quite some time now.

I set up a minimal git repository, so you can easily reproduce the results:

make
./sentryTest.exe

Should print:
Test print 0 at beginning of program
Test print 2 after undoing the redirecting

Then submit the job by:
qsub jobscript.sh
(warning: The paths in this script need to be edited into something sensible)


Would there be any way to produce the intended behaviour on the worker nodes? And what could cause an issue like this in the first place?

Cheers,
Thomas

Storage directory on T2


Thursday 4 May 2017 


Accessing T2 storage element

(backslash to bypass alias)
\uberftp storage01.lcg.cscs.ch


root://storage01.lcg.cscs.ch//pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm




/pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm does not yet exist - probably have to create it via crab --> Trying the crab tutorial



After crab tutorial:

[tklijnsm@t3ui02 src]$ \uberftp storage01.lcg.cscs.ch 'ls /pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm/GenericTTbar'
220 GSI FTP door ready
200 User :globus-mapping: logged in
drwx------  1 cms001     cms001              512 May  4 16:45 CRAB3_tutorial_May2015_MC_analysis

woohoo!

db link:




Now set up some transfer scripts.



root root://storage01.lcg.cscs.ch//pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm/GenericTTbar/CRAB3_tutorial_May2015_MC_analysis/170504_143235/0000/output_2.root
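A minimal sketch of such a transfer, assuming xrdcp is available on the UI (using the tutorial output file from above as an example):

xrdcp root://storage01.lcg.cscs.ch//pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm/GenericTTbar/CRAB3_tutorial_May2015_MC_analysis/170504_143235/0000/output_2.root ./output_2.root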

Creating parent directories in one go on SE

Thursday 4 May 2017 

[[Cms-tier3] ] Making parent directories on SE

Izaak Neutelings via cern.onmicrosoft.com
09:58

to t3-eth-admins, Yuta
Hi admins!

Is there a one-liner to create all parent directories during or (preferably) before copying files to the /pnfs storage element, just like `mkdir -p`? In the end I would like to use this in a batch script.

I tried:

(...)

See:






Eventually figured it out via this:



Hi Izaak,

After trying out various tricks like you did, I came up with the following solution:

gfal-mkdir -p gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/test9/test10/test11/

This seems to create the nested directories correctly. I haven't tested it in a CMSSW environment though; this might matter.

Note that if you're not creating a huge number of directories, this is also the kind of job that could be done by mounting pnfs through gfalFS and performing a regular unix mkdir on it.

Cheers,
Thomas
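A minimal sketch of how this could look inside a batch script (gfal-copy is assumed to be available alongside gfal-mkdir; the local file name and the target path are placeholders):

#!/bin/bash
# create the nested target directory first, then copy the job output into it
TARGET=gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/test9/test10/test11
gfal-mkdir -p "$TARGET"
gfal-copy "file://$PWD/output.root" "$TARGET/output.root"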



Login denied when trying uberftp

Thursday 4 May 2017 


[tklijnsm@t3ui02 emailReport]$ uberftp -debug 1 t3se01.psi.ch
Debug level set at 1 (OFF)
220 GSI FTP door ready
530 Login denied

(more debug does not yield more interesting information)



[tklijnsm@t3ui02 emailReport]$ grid-proxy-init -debug -verify

User Cert File: /mnt/t3nfs01/data01/shome/tklijnsm/.globus/usercert.pem
User Key File: /mnt/t3nfs01/data01/shome/tklijnsm/.globus/userkey.pem

Trusted CA Cert Dir: /etc/grid-security/certificates

Output File: /mnt/t3nfs01/data01/shome/tklijnsm/.x509up_u624
Your identity: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tklijnsm/CN=764859/CN=Thomas Klijnsma
Enter GRID pass phrase for this identity:
Creating proxy .++++++
.................................................++++++
Done
Proxy Verify OK
Your proxy is valid until: Fri May  5 02:27:46 2017


[tklijnsm@t3ui02 emailReport]$ voms-proxy-init -debug -verify
Looking for user credentials in [/mnt/t3nfs01/data01/shome/tklijnsm/.globus/userkey.pem, /mnt/t3nfs01/data01/shome/tklijnsm/.globus/usercert.pem]...
Enter GRID pass phrase for this identity:
Credentials loaded successfully [/mnt/t3nfs01/data01/shome/tklijnsm/.globus/userkey.pem, /mnt/t3nfs01/data01/shome/tklijnsm/.globus/usercert.pem]
Loading CA Certificate /etc/grid-security/certificates/4339b4bc.0.
Loading CA Certificate /etc/grid-security/certificates/b4278411.0.
Loading CRL /etc/grid-security/certificates/4339b4bc.r0.
Loading CRL /etc/grid-security/certificates/4339b4bc.r0.
Loading EACL namespace (signing_policy) /etc/grid-security/certificates/4339b4bc.signing_policy.

Created proxy in /mnt/t3nfs01/data01/shome/tklijnsm/.x509up_u624.

Your proxy is valid until Fri May 05 02:28:45 CEST 2017


ls on the mount and opening files work fine:
[tklijnsm@t3ui02 ~]$ ls /pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/oldRegressionNtuples/
DoubleElectron_FlatPt-1To300     DoublePhoton_FlatPt-300To6500
DoubleElectron_FlatPt-300To6500  DoublePhoton_FlatPt-5To300


xrootd does not work:

[tklijnsm@t3ui02 CMSSWs]$ root root://t3dcachedb.psi.ch:1094//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/oldRegressionNtuples/DoubleElectron_FlatPt-1To300/crab_TKNtup_21Jul_Electron_lowpt/160721_211105/0000/output_4.root
root [0]
Attaching file root://t3dcachedb.psi.ch:1094//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/oldRegressionNtuples/DoubleElectron_FlatPt-1To300/crab_TKNtup_21Jul_Electron_lowpt/160721_211105/0000/output_4.root as _file0...
Error in <TNetXNGFile::Open>: [FATAL] Auth failed
(TFile *) nullptr



--> Renewed the grid certificate, and everything works again. Strange, since the certificate had not officially expired and the proxy-init commands reported no problems.

In general, when dcap works and xrootd/uberftp don't, it's a grid certificate problem.
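For reference, once a new certificate has been downloaded from the CERN CA, the renewal steps boil down to (the .p12 file name is a placeholder; same commands as in the certificate setup notes further down):

openssl pkcs12 -in myCertificate.p12 -clcerts -nokeys -out ~/.globus/usercert.pem
openssl pkcs12 -in myCertificate.p12 -nocerts -out ~/.globus/userkey.pem
chmod 400 ~/.globus/userkey.pem ~/.globus/usercert.pem
voms-proxy-init -voms cms -valid 192:00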



CRAB and certificates



Setting up certificates

Instructions from here:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideLcgAccess

Downloaded the CERN certificates from here:
https://www.tacar.org/cert/list


Now some kind of key needs to end up in ~/.globus on lxplus… not entirely clear yet how

https://twiki.cern.ch/twiki/bin/view/CMSPublic/PersonalCertificate
https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideLcgAccess

At the moment none of this is installed yet, and the CA is down :(



After that, follow this tutorial:

https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookCRAB3Tutorial

( This one also has info, but probably outdated:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookRunningGrid )



28 April:

Briefly discussed with Mauro:


1 Requested a new certificate (instantaneous) and downloaded it (a big "download certificate now" button appeared after creation); it was already named myCertificate.p12

2 Copied myCertificate.p12 to lxplus ~/.globus, renamed it to mycert.p12

3 Followed the steps from https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookStartingGrid#ObtainingCert


openssl pkcs12 -in mycert.p12 -clcerts -nokeys -out usercert.pem

openssl pkcs12 -in mycert.p12 -nocerts -out userkey.pem

chmod 400 userkey.pem

chmod 400 usercert.pem


At this point the debug command should already work:

[tklijnsm@lxplus009 .globus]$ grid-proxy-init -debug -verify 

User Cert File: /afs/cern.ch/user/t/tklijnsm/.globus/usercert.pem
User Key File: /afs/cern.ch/user/t/tklijnsm/.globus/userkey.pem

Trusted CA Cert Dir: /etc/grid-security/certificates

Output File: /tmp/x509up_u77787
Your identity: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tklijnsm/CN=764859/CN=Thomas Klijnsma
Enter GRID pass phrase for this identity:
Creating proxy .............................++++++
........++++++
 Done
Proxy Verify OK
Your proxy is valid until: Fri Apr 29 04:48:03 2016


4 Registered with the CMS VO: https://voms2.cern.ch
This probably never needs to be done again

[tklijnsm@lxplus009 .globus]$ voms-proxy-init -voms cms
Enter GRID pass phrase for this identity:
Contacting lcg-voms2.cern.ch:15002 [/DC=ch/DC=cern/OU=computers/CN=lcg-voms2.cern.ch] "cms"...
Remote VOMS server contacted succesfully.


Created proxy in /tmp/x509up_u77787.

Your proxy is valid until Fri Apr 29 04:54:21 CEST 2016


woohoo!





Running combine on t3 memory bug

Jobs still produce no output

[tklijnsm@t3ui02 differentialCombination2017]$ qacct -j 2670509                                                    
==============================================================
qname        short.q             
hostname     t3wn57.psi.ch       
group        ethz-higgs          
owner        tklijnsm            
project      NONE                
department   defaultdepartment   
jobname      job_SCAN_May03_Datacard_13TeV_differential_pT_moriond17_reminiaod_extrabin_corrections_newsysts_v5_renamed_0.sh
jobnumber    2670509             
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Wed May  3 12:02:45 2017
start_time   Wed May  3 12:02:48 2017
end_time     Wed May  3 12:11:15 2017
granted_pe   NONE                
slots        1                   
failed       0    
exit_status  139                 <--- Probably not right
ru_wallclock 507          
ru_utime     358.570      
ru_stime     145.356      
ru_maxrss    211344              
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    166049              
ru_majflt    1998                
ru_nswap     0                   
ru_inblock   383848              
ru_oublock   64                  
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     34149               
ru_nivcsw    52594               
cpu          503.926      
mem          143.989           
io           3.004             
iow          0.000             
maxvmem      654.543M
arid         undefined



From:

"Unix systems return errono 128+signal when a signal received. 128 + 11 = 139 . SIgnal 11 is SIGSEV (i.e. segmentation violation). = There is a memory access bug in your C++ code."

--> Segmentation violation
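A quick way to decode such an exit status in bash (anything above 128 is 128 + the signal number):

status=139
kill -l $((status - 128))    # prints SEGV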



Running with valgrind. First compile with debug information (so that line numbers are shown in the valgrind output; otherwise it is even more difficult to decode):

from:

scram b clean; scram b USER_CXXFLAGS="-g"
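The valgrind invocation itself would then look something like this (illustrative sketch; the combine method and datacard are placeholders, and the ROOT suppression file just cuts down false positives if it is present in the release):

valgrind --tool=memcheck --suppressions=$ROOTSYS/etc/valgrind-root.supp \
    combine -M MaxLikelihoodFit datacard.txt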




Simply adding a std::cout before every line reveals a problem with this:

    // std::cout << " [TK] Doing \" CloseCoutSentry sentry(verbose < 3);     \" " << std::endl;
    // CloseCoutSentry sentry(verbose < 3);    

Commented out all printouts via the sentry, and combine seems to run again.



Latest job output: status is:
SIGBUS     10     Core     Bus Error

[tklijnsm@t3ui02 test001_May03]$ qacct -j 2679117              
==============================================================
qname        short.q             
hostname     t3wn54.psi.ch       
group        ethz-higgs          
owner        tklijnsm            
project      NONE                
department   defaultdepartment   
jobname      job_SCAN_May03_Datacard_13TeV_differential_pT_moriond17_reminiaod_extrabin_corrections_newsysts_v5_renamed_0.sh
jobnumber    2679117             
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Wed May  3 16:51:05 2017
start_time   Wed May  3 16:51:06 2017
end_time     Wed May  3 18:21:07 2017
granted_pe   NONE                
slots        1                   
failed       100 : assumedly after job
exit_status  138                 
ru_wallclock 5401         
ru_utime     0.146        
ru_stime     0.074        
ru_maxrss    11260               
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    17771               
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   688                 
ru_oublock   8                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     615                 
ru_nivcsw    110                 
cpu          5391.160     
mem          5714.923          
io           0.568             
iow          0.000             
maxvmem      1.170G
arid         undefined

SGE exit code 100 --> This is probably the CPU time limit or the memory limit, but googling yields no clear answer

--> It's the CPU time limit, see the output:

# ================================================================
# JOB Live Resources USAGE for job 2679117: ( don't consider mem values, they are wrong )
# usage    1:                 cpu=01:29:44, mem=16152.00000 GBs, io=0.56841, vmem=3.000G, maxvmem=3.000G
#
# JOB Historical Resources USAGE for job 2679117: you have to manually run
# qacct -j 2679117 2&> /dev/null || qacct -f /gridware/sge/default/common/accounting.complete -j 2679117
#
#
# JOBs executed on t3wn[30-40] should run ~1.13 faster than t3wn[10-29]
#
# removing TMPDIR: /scratch/tmpdir-2679117.1.short.q

short.q has 90 min, so this makes sense
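So the job needs a queue with a longer CPU limit. A sketch of the resubmission (the queue name is a placeholder; check which queues actually exist on the cluster):

qconf -sql                      # list the queues defined on the cluster
qsub -q long.q jobscript.sh     # resubmit to a queue with a longer CPU limit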



Now recompiling without debugging information and seeing if all runs fine





gfalFS commands and issues

Thursday 27 April 2017 

Summary:

The CMSSW environment can screw up the gfalFS commands! Always execute this stuff from a clean shell.

Basic commands:

[tklijnsm@t3ui02 tklijnsm]$ cd /scratch/tklijnsm/; ls
t3
[tklijnsm@t3ui02 tklijnsm]$ gfalFS t3 gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm
[tklijnsm@t3ui02 tklijnsm]$ ls t3
oldRegressionNtuples
[tklijnsm@t3ui02 tklijnsm]$ gfalFS_umount -z t3
[tklijnsm@t3ui02 tklijnsm]$ ls t3


See the following e-mail thread:

===============================================================
Hi Izaak,

Thanks for following up - I did not know that the CMSSW environment could screw up the gfalFS commands. This is good to know. I'm glad you solved the problem!

Cheers,
Thomas

2017-04-27 13:05 GMT+02:00 Izaak Neutelings <izaak.neutelings@uzh.ch>:
Hi Thomas,

I tried again without and with running my setup script*. The problem seems to come about when I do
eval `scram runtime -sh`
so the problem should lie somewhere there...

It's mounted now, thanks!

Cheers,
Izaak

* /shome/ineuteli/setup_scripts/setup_SFrame.sh

Op 27 apr. 2017, om 11:52 heeft Thomas Klijnsma <thomas.klijnsma@cern.ch> het volgende geschreven:

Hi Izaak,

I just tried the same commands, and everything worked fine:

[tklijnsm@t3ui02 tklijnsm]$ cd /scratch/tklijnsm/; ls
t3
[tklijnsm@t3ui02 tklijnsm]$ gfalFS t3 gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/tklijnsm
[tklijnsm@t3ui02 tklijnsm]$ ls t3
oldRegressionNtuples
[tklijnsm@t3ui02 tklijnsm]$ gfalFS_umount -z t3
[tklijnsm@t3ui02 tklijnsm]$ ls t3

I found that gfalFS may sometimes fail for not very clear reasons. Could you try the same commands again?

Cheers,
Thomas

2017-04-27 11:25 GMT+02:00 Izaak Neutelings <izaak.neutelings@uzh.ch>:
Hi T3 admins,

I cannot mount the storage element on the scratch anymore... This problem I had before, but I have forgotten how we solved it.

[ineuteli@t3ui03 ineuteli]$ cd /scratch/ineuteli
[ineuteli@t3ui03 ineuteli]$ mkdir pnfs
[ineuteli@t3ui03 ineuteli]$ gfalFS pnfs gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/ineuteli
[ineuteli@t3ui03 ineuteli]$ ls
ls: cannot access pnfs: Invalid argument
SFrameAnalysis  pnfs  samples

Can you help me, please?

Cheers,
Izaak

---------------------------------------------------------------------------------------------------

===============================================================

Clean up T3 storage


Tuesday 25 April 2017 


10 TB+ users:
        33.2 TB   : store/user/clange
        32.8 TB   : store/user/ursl
        22.9 TB   : store/user/mmasciov
        22.9 TB   : store/user/cgalloni
        18.9 TB   : store/user/gaperrin
        17.0 TB   : store/user/hinzmann
        16.6 TB   : store/user/grauco
        14.5 TB   : store/user/perrozzi
        12.2 TB   : store/user/bstomumu
        11.1 TB   : store/user/peruzzi
        11.0 TB   : store/user/swiederk
        10.9 TB   : store/user/ytakahas
        10.5 TB   : store/user/mmarionn

Found email addresses:
clemens.lange@cern.ch
urs.langenegger@psi.ch
mario.masciovecchio@cern.ch
camilla.galloni@cern.ch
gael.ludovic.perrin@cern.ch
hinzmann@cern.ch
giorgia.rauco@cern.ch
perrozzi@cern.ch
wistepha@phys.ethz.ch
Yuta.Takahashi@cern.ch
mmarionn@cern.ch

Usernames with no address found:
bstomumu

not doing:
Marco.Peruzzi@cern.ch





Dear T3 storage user,

As the storage on our T3 is beginning to reach its quota, it is time to clean up old files. You are currently using over 10 TB of storage. Please have a look at your files and delete what you do not need anymore. Once the disk usage reaches critical levels, important services may cease to function, so please delete your unneeded files as soon as possible.

You can list and delete your files using uberftp:

uberftp t3se01.psi.ch 'ls /pnfs/psi.ch/cms/trivcat/store/user/tklijnsm'

uberftp t3se01.psi.ch 'rm -r /pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/path/to/files'


Your current disk usage is listed here:
(Warning: very big file - better to download and view in a text editor, or grep for your username)

Best regards,
Thomas Klijnsma

Mail CMS computing benchmark job

Tuesday 25 April 2017 


Dear Computing Operations team,

We are in the process of collecting performance data for our T2 sites T2_CH_CSCS and T2_CH_CSCS_HPC. Beyond the benchmarks from HEP-SPEC, we are interested in a benchmark job that produces reliable and stable results, for instance the generation of some Monte-Carlo events from a well-known process.

Is there a recommended type of job that we can use for this purpose? Surely other sites will have tried something similar in the past.

Best regards,
Thomas Klijnsma



Notes of Swiss Grid Ops

Thursday 20 April 2017 

Nothing interesting really, gave short report of inode issue. Mediocre attendance.

Notes are here:



Maintenance:
Next week or end/tenth of May
Only thing that changes is the nucleus part of ATLAS
Date will be tuned to needs of Dario, no preference from me or ATLAS

10th of May is selected

Does this interfere with our data collection? Hopefully not. Downtime is properly announced and services will be shut down in an orderly manner.

Summary of PSI meeting

Wednesday 19 April 2017 


-- T2 --

Over the next 2 months, some metrics should be defined by which to measure HPC performance. The metrics should also reflect a CMS point of view. Joosep's slides show a couple of examples, but some more thought has to be given to a proper selection of variables.

Regarding tasks: Nina will create an account and maybe try to run a CRAB example (supply one for her?). She will also subscribe to some of the HyperNews threads (we should probably give some pointers as to which ones).

Because T2 activities need lots of CMS input, I will probably remain responsible for communications.


-- T3 --

Principle for the division of labor: technical stuff for Nina, user-related stuff for Thomas. I will also start with some updates for the TWiki (still need an account for this).






Thursday 20 April 2017 

Summary of LHConCray meeting 10:00

Walltime good vs. bad
CPU efficiency
Failure rate (from Joosep)
Uptime (reliability and availability of both sites)
Throughput (Delivered shares / nominal shares, and more taken into account)


Performance benchmarks
- General consensus on the "control job" - add it as a metric
    Go to CMS computing groups and ask what kind of job
    would be nice; it has to be agnostic of filesystem,
    something very stable.
- HEP-SPEC
This is outside the measurement period, we do this after


Assume first 4 weeks measurement period
--> Run until end of May for first set of measurements
--> Opportunity later this year to run again
--> Interrupt run if big issues pop up

Then realize some more problems and missing observables, and run again

Christoph will contact LHCb people (Roland?) - they have different memory requirements that need to be taken into account.

Already schedule a follow-up meeting (to be arranged by e-mail), around the end of May, to look at the measurements. Joosep will be gone.

See also Derek's notes:


Notes of LHConCray meeting 10:00


Multicore: some part runs on single core, so always some inefficiency (but sw is updated, much progress made in the last few months). This behaviour is unpredictable for analysis jobs

Regarding the three types:
- Single core analysis
- ...
- ...

Amount of wall time hard to predict.


Choose a period (1/2 months): Let's say we have 90% single core analysis jobs, what goes to phoenix and what goes to cray


CG: Go back to the variables: Does Gianfranco's proposal make sense? (I need this proposal; I missed it because I arrived late.)


Walltime good vs. bad
CPU efficiency
Failure rate (from Joosep)
Uptime (reliability and availability of both sites)
Throughput (probably not comparable)


Benchmarking: monitor a 24h process on both sites and compare to HEP-SPEC rating. HEP-SPEC rating is already calculated (figures available).
Gianfranco asks repeatedly for custom benchmark tests, does not trust HEP-SPEC values.
MC test should complement HEP-SPEC, discussion about whether it should be run under normal conditions or in a controlled environment.
General consensus on the "control job" - add it as a metric

Throughput still under discussion, do something with walltime

Uptime: Having 1 core active == Up, which may be unfair. CRAY may be underestimated because there is only 1 computing element, may skew availability. Pablo suggests numbers won't be comparable.

Delivered shares / nominal shares (weighted by assigned share to ATLAS/CMS/LHCb). Closely related to throughput.



Other issue: length of measurement period

- Long enough to average out fluctuations (at least ~weeks).
- Take conference schedule into account
- At least two weeks according to Joosep
- Rather 4 weeks according to Gianfranco
- Whatever is acceptable between 2-6 weeks


Short note about memory measurement: Notoriously difficult quantity, but according to Gianfranco some obscure mysql command on the SLURM database should get the numbers.
    But then again a 2nd stage memory management goes on in the
    pilot jobs for CMS. Should look at filling efficiency
    inside the pilot jobs.


Skipped metric so far: CPU performance (this is the benchmarks).



About HEP-SPEC

HEP-SPEC06 is the HEP-wide benchmark for measuring CPU performance. It has been developed by the HEPiX Benchmarking Working Group in order to replace the outdated “kSI2k” metric.

The goal is to provide a consistent and reproducible CPU benchmark to describe experiment requirements, lab commitments, existing compute resources, as well as procurements of new hardware.

HEP-SPEC06 is based on the all_cpp benchmark subset (bset) of the widely used, industry standard SPEC® CPU2006 benchmark suite. This bset matches the percentage of floating point operations which we have observed in batch jobs (~10%), and it scales perfectly with the experiment codes.

HEP-SPEC06 is the official CPU performance metric to be used by WLCG sites since 1 April 2009.

Although the HEP-SPEC06 benchmark was initially designed to meet the requirements of High Energy Physics (HEP) labs, it is by now widely used also by other communities.










Final report on inode issue

Thursday 13 April 2017 



Regarding the excessive file creation on 11-12 April 2017


Hereby a short report on the aftermath of the excessive file creation on 11-12 April 2017. On 11 and 12 April, T2_CH_CSCS was flooded with the creation of many millions of files from CMS MC jobs. Further inspection and a follow-up with experts on MC generation yielded the following conclusions:

- Some jobs created ~100k files each. While this is a rather large number of files, it is apparently not out of the ordinary for MC jobs submitted by individual users.

- Central production jobs are run differently and should not exhibit this behaviour.

- Other sites apparently handle this volume without too much difficulty.

- This is not the first time this issue has come up, and it is unlikely to be the last time.

This time the issue was dealt with by increasing the inode quota and manually deleting millions of files. While the issue is not frequent, it would be better to avoid crisis management in the future. This can be done by optimising MC jobs to create fewer files, and/or by making the site more tolerant of large bursts of file creation.

Best regards,
Thomas Klijnsma
( CMS VO-contact for T2_CH_CSCS )

Find all log files and copy them to psi t3

/mnt/t3nfs01/data01/shome/tklijnsm/Test/gridpackoutput/

files=$("find . -maxdepth 4 -name "*.log"")

scp $files tklijnsm@t3ui02NOSPAMPLEASE.psi.ch:/mnt/t3nfs01/data01/shome/tklijnsm/Test/gridpackoutput/


Email to Vieri regarding the millions-of-files issue

-bash-4.1$ pwd
/scratch/lcg/scratch/phoenix4/ARC_sessiondir/joxNDm8hoHqnt3tIep4RIIJmABFKDmABFKDmGXiKDmABFKDmSQOS3m/glide_VdP5JO/execute/dir_10301
-bash-4.1$ find . -type f | wc -l
find: `./lheevent/process/tmpkVc2ER': Permission denied
97928
 --> Nearly 100.000 files...

-bash-4.1$ pwd
/scratch/lcg/scratch/phoenix4/ARC_sessiondir/joxNDm8hoHqnt3tIep4RIIJmABFKDmABFKDmGXiKDmABFKDmSQOS3m/glide_VdP5JO/execute/dir_10301/lheevent/process/SubProcesses
-bash-4.1$ find . -type f | wc -l
74590
 --> Most of them in lheevent/process/SubProcesses

Wednesday 12 April 2017 


Hi Luca,

Today and yesterday some MadGraph jobs were submitted on T2_CH_CSCS, and for some reason the jobs created a rather large number of files, nearly 100000 files per job. This is quickly exhausting the quotas on the T2, so I am trying to find out if this kind of thing can be prevented.

The jobs create directory structure as follows:

lheevent
    gridpack_generation.log
    mgbasedir
    process
        additional_command
        amcatnlo.tar.gz
        bin
        Cards
        check_poles_input.txt
        collect_events.log
        Events
        FixedOrderAnalysis
        HTML
        lib
        madspingrid
        makefile
        makegrid.dat
        MCatNLO
        MGMEVersion.txt
        nsqso_born.inc
        py.py
        README
        Source
        SubProcesses
            ajob_template
            analyse_opts
            appl_common.inc
            Boosts.h
            cluster.inc
            collect_events
            collect_events.o
            combine_plots_FO.sh
            combine_results_FO.sh
            combine_results.sh
            combine_root.C
            combine_root.sh
            coupl.inc
            cts_mpc.h
            cts_mprec.h
            dirs.txt
            done
            fill_MC_mshell.o
            fjcore.hh
            FKS_params.dat
            FKSParams.inc
            fks_powers.inc
            genps.inc
            handling_lhe_events.o
            initial_states_map.dat
            madfks_mcatnlo.inc
            madinMMC_F.2
            MadLoopParams.inc
            makefile
            makefile_fks_dir
            makefile_loop
            maxconfigs.inc
            maxparticles.inc
            max_split.inc
            MCmasses_HERWIG6.inc
            MCmasses_HERWIGPP.inc
            MCmasses_PYTHIA6PT.inc
            MCmasses_PYTHIA6Q.inc
            MCmasses_PYTHIA8.inc
            message.inc
            MGVersion.txt
            mint.inc
            mp_coupl.inc
            mp_coupl_same_name.inc
            nevents_unweighted
            nevents_unweighted1
            nevents_unweighted2
            nevents_unweighted_splitted
            P0_bxc_epve
            P0_bxc_tapvt
            .
            .
            . < About 700 directories >
            .
            .
            P5_uxux_tamvtxuxdx
            P5_uxux_tamvtxuxsx
            proc_characteristics
            procdef_mg5.dat
            q_es.inc
            randinit
            read40.for
            res_0.txt
            res_1.txt
            reweight0.inc
            reweight1.inc
            reweight_all.inc
            reweight_appl.inc
            reweight.inc
            reweightNLO.inc
            reweight_xsec_events.cmd
            reweight_xsec_events.local
            run.inc
            subproc.mg
            sudakov.inc
            sumres.py
            timing_variables.inc
            trapfpe.c
            trapfpe_secure.c
            unlops.inc
            vegas2.for
        TemplateVersion.txt
        test_MC_input.txt
        test_ME_input.txt
        tmpkVc2ER
        Utilities
    runcmsgrid.sh

(Hopefully this is still readable.) The problem directories seem to be in lheevent/process/SubProcesses. They all have a name like "P<number>_<some_letters>_<some_more_letters>", and they account for about 70k files (about 700 directories like this, with about 100 files each). Do you happen to know what these directories are supposed to contain? Would there be any way to run the gridpack without creating so many files?

If I can look at some other auto-generated logs or if I can give you more information, let me know!

Cheers,
Thomas






Wednesday 12 April 2017 

Dear Vieri Candelise,

Over the last 24 hours some jobs on the T2 site "T2_CH_CSCS" have created billions of files. While inspecting the logs, it seems that a majority of files were created by MadGraph jobs that use code from /afs/cern.ch/user/v/vieri/work/genproductions/bin/MadGraph5_aMCatNLO . While these jobs are probably running as they are supposed to, the overflow of new files will very likely lead to downtime for our T2 site. To prevent this type of issue in the future, we are interested in what kind of jobs were submitted exactly. Primarily, we are trying to find out whether the file-creation is inherent to the way CMSSW runs the gridpack.

Could you confirm that you submitted these MadGraph jobs? And if so, could you provide us with the specifics of the job? (E.g. your crab config file, some details on the gridpack, etc.).

Best regards,

Thomas Klijnsma
( CMS VO-contact at ETH Zurich )



Hi Thomas,

actually I did not sent any crab job since months now. And for sure not madgraph jobs… in the last days I submitted LSF Powheg jobs for a tarball creation, but It’s unlikely that this could be related.

Can you tell me more about what’s going on? Is there something I can do?

Thanks!!




Hi Vieri,

Thanks for your e-mail. Upon closer inspection, the problematic jobs only used a gridpack that was at some point created by you, but the person who has submitted the jobs is someone else. I attached the gridpack_generation.log - if you happen to know who could still be doing analysis using this gridpack, that would help us a lot. My apologies for the oversight!

Best regards,
Thomas





Hi Thomas,

no problem!

Actually I have no idea about who might have sent the jobs, but I suggest you to ask the SMP GEN contact (Paolo Gunnellini and Kenneth Long).

Cheers,
Vieri

User job creates millions of files on T2

Tuesday 11 April 2017 



e-mail from Joosep:

Dear colleagues,

The admins at T2_CH_CSCS have noticed that some user grid jobs are creating millions of inodes, exhausting the file system capabilities.
By examining in Argus the mapping of the DN to the affected pool account (cms06), it is likely that it is Seungkyu Ha [1]. I have looked at the crab jobs but I didn't see anything strange from them from the log files, they look like usual CMSSW_8_0_26_patch1 jobs using LHAPDF.
Therefore, we'd like to urgently understand what could be the problem, perhaps Seungkyu or the CMS computing experts can help us understand the issue and whether any other site is also seeing this.

[1]
> <190>0 2017-04-11T17:17:15.865208+02:00 argus03 argus-pepd-process  - - 2017-04-11 15:17:14.045Z - INFO [DFPMObligationHandler] - ACCOUNTMAPPER_OH: DN: CN=Seungkyu Ha,CN=728175,CN=sha,OU=Users,OU=Organic Units,DC=cern,DC=ch pFQAN: /cms FQANs: [/cms] mapped to POSIX account: PosixAccount{user=cms06 group=cms}

Best,
Joosep


Slack conversation starts here:

Dino managed to track down a possibly responsible user, but it's of course tricky because the files are created by a cms glidein job.
Individual users should actually not be able to do this.


Solving process: once noticed, find the culprit with the help of the T2 admins. Then, as soon as possible, involve the CMS team via hn-cms-computing-tools@cern.ch





Email to T2 users to clean up

Monday 3 April 2017 

--------------------------------------------------

Dear CSCS storage user,

As the storage on T2_CH_CSCS is beginning to reach its quota, it is time to clean up old files. You are currently using over 10 TB of storage. Please have a look at your files and delete what you do not need anymore. Once the disk usage reaches critical levels, important services may crash, so please delete your old files before Tuesday 11 April.

You can list and delete your files using uberftp:

uberftp storage01.lcg.cscs.ch 'ls /pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm'

uberftp -debug 2 storage01.lcg.cscs.ch 'rm -r /pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm/path/to/file'

uberftp storage01.lcg.cscs.ch 'rm -r /pnfs/lcg.cscs.ch/cms/trivcat/store/user/tklijnsm/path/to/file'


Your current disk usage is listed here:
(Warning: very big file - better to download and view in a text editor, or grep for your username)

Best regards,
Thomas Klijnsma


--------------------------------------------------

annapaola.de.cosa@cern.ch
silvio.donato@cern.ch
alberto.orso.maria.iorio@cern.ch
giorgia.rauco@cern.ch
daniel.salerno@cern.ch
zucchett@physik.uzh.ch
myriam.schoenenberger@cern.ch
gregor.kasieczka@cern.ch
Yuta.Takahashi@cern.ch
joosep.pata@cern.ch 


Monday 3 April 2017 

34.9 TB   : /store/user/ytakahas   Yuta.Takahashi@cern.ch
32.2 TB   : /store/user/jpata      joosep.pata@cern.ch
20.7 TB   : /store/user/zucchett   zucchett@physik.uzh.ch
20.0 TB   : /store/user/decosa     annapaola.de.cosa@cern.ch
18.9 TB   : /store/user/mschoene   myriam.schoenenberger@cern.ch
18.2 TB   : /store/user/gregor     gregor.kasieczka@cern.ch
17.1 TB   : /store/user/sdonato    silvio.donato@cern.ch
13.2 TB   : /store/user/oiorio     alberto.orso.maria.iorio@cern.ch
12.9 TB   : /store/user/grauco     giorgia.rauco@cern.ch
12.5 TB   : /store/user/dsalerno   daniel.salerno@cern.ch


Tuesday 4 April 2017 

        44.9 TB   : /store/user/ytakahas
        24.6 TB   : /store/user/jpata
        20.7 TB   : /store/user/zucchett
        20.0 TB   : /store/user/decosa
        18.9 TB   : /store/user/mschoene
        18.2 TB   : /store/user/gregor
        17.1 TB   : /store/user/sdonato
        13.2 TB   : /store/user/oiorio
        12.9 TB   : /store/user/grauco
        12.6 TB   : /store/user/dsalerno


Tuesday 11 April 2017 

        20.7 TB   : /store/user/zucchett
        20.0 TB   : /store/user/decosa
        18.0 TB   : /store/user/gregor
        17.5 TB   : /store/user/ytakahas
        17.1 TB   : /store/user/sdonato
        13.2 TB   : /store/user/oiorio
        12.6 TB   : /store/user/dsalerno
        12.1 TB   : /store/user/jpata
        10.8 TB   : /store/user/mschoene
        7.6 TB    : /store/user/bianchi
        7.2 TB    : /store/user/dpinna
        6.1 TB    : /store/user/grauco
        5.7 TB    : /store/user/paktinat
        5.6 TB    : /store/user/cgalloni
        5.5 TB    : /store/user/mwang


Users that did not clean up:

        20.7 TB   : /store/user/zucchett
        20.0 TB   : /store/user/decosa
        18.0 TB   : /store/user/gregor
        17.1 TB   : /store/user/sdonato
        13.2 TB   : /store/user/oiorio
        12.6 TB   : /store/user/dsalerno

However, current free space is ~180 TB, so we're in the safe zone again.





Second e-mail:

joosep.pata@cern.ch
zucchett@physik.uzh.ch
annapaola.de.cosa@cern.ch
gregor.kasieczka@cern.ch
silvio.donato@cern.ch
alberto.orso.maria.iorio@cern.ch


Dear CSCS storage user,

Hereby a gentle reminder to remove your old files from the storage on T2_CH_CSCS. Your help is greatly appreciated!

Best regards,

Thomas Klijnsma

A poem by Joosep

Thursday 6 April 2017

 

that is 100% the usual case

an old script used to work

but now with 10x more data it doesn't work

and the deadline is tomorrow :)

"please make it work"





also:
You have to remember and stress: dCache is not a normal unix filesystem
it is a server where you can put files and get files

Monitor active movers (xrootd, dcap)

Wednesday 5 April 2017 

C claims slow network

Tracking the active movers:

watch --interval=1 --differences 'lynx -dump -width=800 http://t3dcachedb.psi.ch:2288/queueInfo  | grep -v ______ | grep -v ops '

Coordinated with C and confirmed: the number of active movers spiked (taking up up to 40% of the max movers). The slowness before was probably just bad luck with the network.


Asked Joosep how we can see who is pushing for active movers, but there is no simple way. Easiest thing to check is to see who is running many jobs:

qstat -u '*'

And infer from that. Otherwise, Derek knows how to get the exact information
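A rough way to do that inference (a sketch assuming the default SGE qstat layout, where the user is the fourth column and the first two lines are the header):

qstat -u '*' | awk 'NR>2 {print $4}' | sort | uniq -c | sort -rn | head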


Using xrootd and dcap to open a root file on the T3 SE

Wednesday 5 April 2017 


-------------------------------------------------------
Summary:

first:

voms-proxy-init -voms cms -valid 192:00

then:

root dcap://t3se01.psi.ch:22125//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/DoublePhoton_FlatPt-5To300/crab_TKNtup_0106_Photon/160601_125033/0000/output_1.root

or 

root root://t3dcachedb.psi.ch:1094//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/DoublePhoton_FlatPt-5To300/crab_TKNtup_0106_Photon/160601_125033/0000/output_1.root


-------------------------------------------------------
Problem solving:

dcap worked:

[tklijnsm@t3ui02 ~]$
[tklijnsm@t3ui02 ~]$ root dcap://t3se01.psi.ch:22125//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/DoublePhoton_FlatPt-5To300/crab_TKNtup_0106_Photon/160601_125033/0000/output_1.root
root [0]
Attaching file dcap://t3se01.psi.ch:22125//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/DoublePhoton_FlatPt-5To300/crab_TKNtup_0106_Photon/160601_125033/0000/output_1.root as _file0...
(TFile *) 0x1d24e00
root [1]
root [1] .ls
TDCacheFile**          dcap://t3se01.psi.ch:22125//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/DoublePhoton_FlatPt-5To300/crab_TKNtup_0106_Photon/160601_125033/0000/output_1.root     
TDCacheFile*          dcap://t3se01.psi.ch:22125//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/DoublePhoton_FlatPt-5To300/crab_TKNtup_0106_Photon/160601_125033/0000/output_1.root     
  KEY: TDirectoryFile     een_analyzer;1     een_analyzer
root [2]

But xrootd gave an error:

[tklijnsm@t3ui02 ~]$ root root://t3dcachedb.psi.ch:1094//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/DoublePhoton_FlatPt-5To300/crab_TKNtup_0106_Photon/160601_125033/0000/output_1.root
XrdSec: No authentication protocols are available.
Error in <TXNetSystem::Connect>: some severe error occurred while opening the connection at root://t3dcachedb.psi.ch:1094//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/DoublePhoton_FlatPt-5To300/crab_TKNtup_0106_Photon/160601_125033/0000/output_1.root - exit
   'login failed: unable to get protocol object.'
Error in <TXNetSystem::TXNetSystem>: fatal error: connection creation failed.
root [0]
Attaching file root://t3dcachedb.psi.ch:1094//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/DoublePhoton_FlatPt-5To300/crab_TKNtup_0106_Photon/160601_125033/0000/output_1.root as _file0...
170405 16:37:32 16243 Xrd: CheckErrorStatus: Server [t3dcachedb.psi.ch] declared: login failed(error code: 3010)
170405 16:37:32 16243 Xrd: DoAuthentication: login failed
XrdSec: No authentication protocols are available.
170405 16:37:32 16243 Xrd: Open: Authentication failure: login failed: unable to get protocol object
Error in <TXNetFile::CreateXClient>: open attempt failed on root://t3dcachedb.psi.ch//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/DoublePhoton_FlatPt-5To300/crab_TKNtup_0106_Photon/160601_125033/0000/output_1.root
root [1]
root [1] 170405 16:37:33 16251 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
170405 16:37:33 16252 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).

root [1]
root [1]



SOLUTION: redo the init command for the VO proxy:

voms-proxy-init -voms cms -valid 192:00


~$
~$ st3
Last login: Wed Apr  5 16:43:10 2017 from pb-d-128-141-188-11.cern.ch
**************************************************************************
*                                                                        *
*                            NOTICE TO USERS                             *
*                            ---------------                             *
*                                                                        *
* This is the CMS PSI Tier-3 cluster. Using this infrastructure implies  *
* that you agree to the "usage and monitoring rules for IT Resources at  *
* PSI" which can be found at :                                           *
*       cd /mnt/t3nfs01/data01/swshare                                   *
*       ./psit3/doc/AW-95-06-01-EN.pdf AW-95-06-01-EN.pdf                *
*       ./psit3/doc/AW-95-06-01-EN.txt (converted to text)               *
*                                                                        *
**************************************************************************

[tklijnsm@t3ui02 ~]$
[tklijnsm@t3ui02 ~]$
[tklijnsm@t3ui02 ~]$ voms-proxy-info
subject   : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tklijnsm/CN=764859/CN=Thomas Klijnsma/CN=920634880
issuer    : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tklijnsm/CN=764859/CN=Thomas Klijnsma
identity  : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tklijnsm/CN=764859/CN=Thomas Klijnsma
Error checking proxy policy:ProxyCertInfoExtension parser error, sequence contains too many items
[tklijnsm@t3ui02 ~]$
[tklijnsm@t3ui02 ~]$
[tklijnsm@t3ui02 ~]$
[tklijnsm@t3ui02 ~]$ voms-proxy-init -voms cms -valid 192:00
Enter GRID pass phrase for this identity:
Contacting voms2.cern.ch:15002 [/DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch] "cms"...
Remote VOMS server contacted succesfully.


Created proxy in /mnt/t3nfs01/data01/shome/tklijnsm/.x509up_u624.

Your proxy is valid until Thu Apr 13 16:45:32 CEST 2017
[tklijnsm@t3ui02 ~]$
[tklijnsm@t3ui02 ~]$ voms-proxy-info
subject   : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tklijnsm/CN=764859/CN=Thomas Klijnsma/CN=proxy
issuer    : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tklijnsm/CN=764859/CN=Thomas Klijnsma
identity  : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tklijnsm/CN=764859/CN=Thomas Klijnsma
type      : full legacy globus proxy
strength  : 1024
path      : /mnt/t3nfs01/data01/shome/tklijnsm/.x509up_u624
timeleft  : 191:59:56
key usage : Digital Signature, Key Encipherment
[tklijnsm@t3ui02 ~]$
[tklijnsm@t3ui02 ~]$
[tklijnsm@t3ui02 ~]$



After that, opening works:

[tklijnsm@t3ui02 ~]$ root root://t3dcachedb.psi.ch:1094//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/DoublePhoton_FlatPt-5To300/crab_TKNtup_0106_Photon/160601_125033/0000/output_1.root
root [0]
Attaching file root://t3dcachedb.psi.ch:1094//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/DoublePhoton_FlatPt-5To300/crab_TKNtup_0106_Photon/160601_125033/0000/output_1.root as _file0...
(TFile *) 0x431c8a0
root [1]
root [1] .ls
TNetXNGFile**          root://t3dcachedb.psi.ch//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/DoublePhoton_FlatPt-5To300/crab_TKNtup_0106_Photon/160601_125033/0000/output_1.root     
TNetXNGFile*          root://t3dcachedb.psi.ch//pnfs/psi.ch/cms/trivcat/store/user/tklijnsm/DoublePhoton_FlatPt-5To300/crab_TKNtup_0106_Photon/160601_125033/0000/output_1.root     
  KEY: TDirectoryFile     een_analyzer;1     een_analyzer
root [2]


Finding an email address for a T3 username

Monday 3 April 2017 

As admin, do:

ssh -A root@t3nagios

Then:
cat /opt/nagios-4.2.4/etc/objects/contacts.cfg

This path may break in the future with other Nagios versions and updates; if so, run:

history

Which may reveal what previous admins tried in order to get the list.
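
To map usernames to email addresses directly, something like the following should work (a sketch; it assumes the contact blocks keep the contact_name / email layout shown in the listing below):

awk '$1 == "contact_name" {u = $2} $1 == "email" {print u, $2}' /opt/nagios-4.2.4/etc/objects/contacts.cfg

Pipe it through grep for the username of interest; users without an email line in their contact block simply do not show up.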




--------------------------------------------------------

List on Monday 3 April 2017:


[root@t3nagios ~]# cat /opt/nagios-4.2.4/etc/objects/contacts.cfg
#
# 'cms_t3_alerts'  
define contact{
        contact_name                    cms_t3_alerts
        alias                           cms_t3_alerts
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    c
        host_notification_options       d,u
        service_notification_commands   notify-by-email
        #service_notification_commands   service-notify-html-email
        host_notification_commands      host-notify-by-email
        email                           cms-tier3-alerts@lists.psi.ch
        #email                           fabio.martinelli@psi.ch
        }


define contact{
        contact_name                    nagiosadmin     ; Short name of user
    use             generic-contact     ; Inherit default values from generic-contact template (defined above)
        alias                           Nagios Admin        ; Full name of user

        #email                           cms-tier3-alerts@lists.psi.ch  ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS *****
        email                           fabio.martinelli@psi.ch
        host_notification_commands      host-notify-by-email
        service_notification_commands   service-notify-html-email
        }


define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin
        }


define contactgroup{
         contactgroup_name       t3-admins
         alias                   Tier3 Grid Administrators
         members                 cms_t3_alerts
         }

#
#
#
#
# # 24 July 2013 - F.Martinelli
# define contact{
#         contact_name                    ethz_bphys
#         alias                           ethz_bphys
#         service_notification_period     24x7
#         host_notification_period        24x7
#         service_notification_options    c,w
#         host_notification_options       d,u,r
#         #service_notification_commands   notify-by-email
#         service_notification_commands   service-notify-html-email
#         host_notification_commands      host-notify-by-email
#         #email                           cms-tier3-new-alerts@lists.psi.ch
#         email                           fabio.martinelli@psi.ch
#         }
#
# define contact{
#         contact_name                    ethz_ecal
#         alias                           ethz_ecal
#         service_notification_period     24x7
#         host_notification_period        24x7
#         service_notification_options    c,w
#         host_notification_options       d,u,r
#         #service_notification_commands   notify-by-email
#         service_notification_commands   service-notify-html-email
#         host_notification_commands      host-notify-by-email
#         #email                           cms-tier3-new-alerts@lists.psi.ch
#         email                           fabio.martinelli@psi.ch
#         }
#
#
#
#

define contact {
    contact_name cgalloni
    email camilla.galloni@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}


# F.Martinelli - Nov 2016 - to send /shome overquota warnings to the T3 users
# ldapsearch  -H ldap://t3ldap01.psi.ch -x -b dc=cmst3,dc=psi,dc=ch  mail=*@ uid mail | egrep 'uid|mail' | egrep -v 'dn:|requesting|filter'  | paste  - - | awk '{printf "\ndefine contact {\n    contact_name "$2"\n    email "$4"\n    host_notifications_enabled 0\n    service_notifications_enabled 1\n    service_notification_options w,c\n    service_notification_commands notify-by-email\n    host_notification_commands      host-notify-by-email\n    service_notification_period     24x7\n    host_notification_period        24x7\n}\n"}'


define contact {
    contact_name feichtinger
    email derek.feichtinger@psi.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}
define contact {
    contact_name giulioisac
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}
define contact {
    contact_name jandrejk   
    email janika@studentNOSPAMPLEASE.ethz.ch   
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name marchica
    email Carmelo.Marchica@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name erdmann
    email wolfram.erdmann@psi.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name kaestli
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name grab
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name koenig
    email stefan.koenig@psi.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name sibille
    email Jennifer.Sibille@psi.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name starodumov
    email andrey.starodumov@psi.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name fronga
    email Frederic.Ronga@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name arizzi
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name webermat
    email matthias.artur.weber@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name wehrlilu
    email lukas.wehrli@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name punz
    email Thomas.Punz@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name weng
    email Joanna.Weng@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name fabstoec
    email Fabian.Stoeckli@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name jueugste
    email juerg.eugster@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name sancheza
    email annkarin.sanchez@gmail.com
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name stiegerb
    email benjamin.stieger@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name alschmid
    email alexander.schmidt@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name thea
    email Alessandro.Thea@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name pnef
    email pascal.nef@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name ursl
    email urs.langenegger@psi.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name snoek
    email hella.snoek@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name eaguiloc
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name theofil
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name predragm
    email predrag.milenovic@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name leo
    email Leonardo.Sala@psi.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name andis
    email Andreas.Schatti@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name kotlinski
    email danek.kotlinski@psi.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name rossini
    email marco.rossini@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name lazo-flores
    email Jose.Lazo-Flores@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mirena
    email mirena.ivova.rikova@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name sdevissc
    email simon.de.visscher@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name bora
    email bora.akgun@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name papel
    email luc.pape@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name fmoortga
    email filip.moortgat@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name naegelic
    email christoph.naegeli@psi.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name radicci
    email valeria.radicci@psi.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name meridian
    email Paolo.Meridiani@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name bmillanm
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name lbaeni
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name bravo_c
    email cameron.bily.bravo@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name lamm_h
    email hlamm@physNOSPAMPLEASE.ksu.edu
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name bean
    email abean@ku.edu
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name benjtann
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name bortigno
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name opolina
    email polina.otiougova@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name haweber
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name buchmann
    email marco.andrea.buchmann@gmail.com
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name casal
    email bruno.casal.larana@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name pablom
    email Pablo.Martinez@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name tinti_g
    email gemma.tinti@psi.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name caminada
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name rohe
    email tilman.rohe@psi.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name lmartini
    email Luca.Martini@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name martinelli_f
    email fabio.martinelli@psi.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name favaro
    email carlotta.favaro@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name jordi_p
    email pascal.jordi@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mdjordje
    email milos.djordjevic@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mverzetti
    email Mauro.Verzetti@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name fangmeier_c
    email cfangmeier74@gmail.com
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mtakahashi
    email maiko.takahashi@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name nessif
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name peruzzi
    email Marco.Peruzzi@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name nmohr
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name chanon
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name peller
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name amarini
    email andrea.carlo.marini@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mdunser
    email marc.dunser@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name paktinat
    email Saeid.Paktinat@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name stupputi
    email Salvatore.Tupputi@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name pandolf
    email francesco.pandolfi@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name dpinna
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name maxmonch
    email max@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name dmeister
    email daniel.meister@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name aalkindi
    email ahmed.alkindi@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name folguera
    email Santiago.Folgueras@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name giulini
    email Maddalena.Giulini@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name deisher
    email amanda.deisher@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name burg_w
    email gwburg@gmail.com
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name turner_p
    email paul.jonathan.turner@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name cpfannen
    email calvertpf@gmail.com
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name grutar
    email giada.rutar@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name bianchi
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mmasciov
    email mario.masciovecchio@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name taroni
    email silvia.taroni@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mangano
    email Boris.Mangano@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name sregnard
    email simon.regnard@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name yangyong
    email Yong.YANG@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name jngadiub
    email jennifer.ngadiuba@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name decosa
    email annapaola.de.cosa@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name clange
    email clemens.lange@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name hits
    email Dmitry.Hits@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mquittna
    email milena.quittnat@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name pgras
    email Philippe.Gras@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name bbilin
    email Bugra.Bilin@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name shoeche
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name vciulli
    email Vitaliano.Ciulli@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name stosi
    email Silvano.Tosi@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name syu
    email Shin-Shan.Yu@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name thaarres
    email thea.aarrestad@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name liis
    email Liis.Rebane@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name gregor
    email gregor.kasieczka@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name strebeli
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name hinzmann
    email hinzmann@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name aspiezia
    email aniello.spiezia@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name caber
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name dsalerno
    email daniel.salerno@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mwang
    email mengmeng.wang@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name tstreble
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name cheidegg
    email constantin.heidegger@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name pniraula
    email prajwalniraula@gmail.com
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name leac
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mdonega
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name ardelaru
    email adelarue@mit.edu
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name musella
    email pasquale.musella@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name perrozzi
    email perrozzi@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name vlambert
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mameinha
    email maren.tabea.meinhard@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name dalfonso
    email dalfonso@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name gaperrin
    email gael.ludovic.perrin@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mmarionn
    email mmarionn@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name jpata
    email joosep.pata@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mschoene
    email mschoene@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name jhoss
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name bjk
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name tklijnsm
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name micheli
    email francesco.micheli@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name wiederkehr_s
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name grauco
    email giorgia.rauco@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name vtavolar
    email vittorio.tavolaro@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name ineuteli
    email izaak.neutelings@uzh.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mvesterb
    email leonora.vesterbacka@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name nchernya
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name lxiao
    email liting.xiao@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mdefranc
    email matteo.defranchis@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name tbluntsc
    email tizian.bluntschli@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name sbyland
    email samuel.martin.byland@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name ggiannin
    email giulia.giannini@uzh.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name sdonato
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name zucchett
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name mandrae
    email marie.andrae@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name ytakahas
    email Yuta.Takahashi@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name nding
    email Nelson.Ding@ColoradoCollege.edu
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name uchiyama
    email yusuke.uchiyama@psi.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name clseitz
    email seitz.claudia@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name phwindis
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name dschafer
    email dschafer@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}
define contact {
    contact_name berger_p2
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name vscheure
    email valerie.scheurer@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name lshchuts
    email lesya.shchutska@cern.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name creissel
    email reielc@ethz.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name pbaertsc
    email pascal.baertschi@uzh.ch
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}

define contact {
    contact_name Federica24
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}
define contact {
    contact_name koschwei  
    email korbinian.schweiger@cern.ch     
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}
define contact {
    contact_name hsetiawa  
    email hananiel.setiawan@uzh.ch        
    host_notifications_enabled 0
    service_notifications_enabled 1
    service_notification_options w,c
    service_notification_commands notify-by-email
    host_notification_commands      host-notify-by-email
    service_notification_period     24x7
    host_notification_period        24x7
}
[root@t3nagios ~]#
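
After adding or editing contact definitions like the ones above, it is worth validating the Nagios
configuration before reloading the daemon. A minimal sketch, assuming a standard layout (the config
path and service name on t3nagios are assumptions):

# Verify the full configuration, then reload only if the check passes
nagios -v /etc/nagios/nagios.cfg && service nagios reload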



Removing datasets on T2 using PhEDEx UI

Monday 3 April 2017 




Data -> Subscriptions

Select Data

group: local
--> group "local" IS IMPORTANT!

Make sure you are only looking at datasets in the "local" group;
the official ones coming in via Dynamo must not be touched, since
they are handled by a script.



Select CSCS in the list.

Under replica/move choose "replica".



Example of the process:

Order by size and look at something old.
Use physics knowledge --> e.g. a RECO can be removed if a ReReco has been done.

Look at the retention date or contact the requestor.

Make a query in the form /A/B/C to find all corresponding datasets (see the example below), and hit delete.

Approve the deletion request.
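
For illustration only (the dataset name below is made up), a /A/B/C query follows the
/PrimaryDataset/ProcessedDataset/DataTier naming, so a wildcard query could look like:

/DoubleMuon/Run2016*-23Sep2016-v*/RECO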





0 pending jobs to CSCS from CMS

Monday 3 April 2017 


Site readiness seems in order
  (check both the auto-generated table and the flash dashboard DB).

Monitoring pages:

SLURM top: SLTOP / SLTOP Phoenix4 / SLTOP arcbrisi (arcbrisi validation is ongoing)
SLURM errors: slurmerrors / slurmerrors Phoenix4


This is showing errors today:



SLURM ERRORS

Version: 0.5
Generated on Mon Apr  3 11:42:01 CEST 2017
Running on phoenix41.lcg.cscs.ch
Showing only jobs finished with these states: F,NF,TO

***************************************************************************************************************************************
* last 12h ago (sorted by start_time)
***************************************************************************************************************************************

Period between 2017-04-02T23:42 and 2017-04-03T11:42

       JobID  Partition      User          JobName  AllocCPUS      State ExitCode     ReqMem        NodeList               Start                 End    Elapsed  Timelimit
------------ ---------- --------- ---------------- ---------- ---------- -------- ---------- --------------- ------------------- ------------------- ---------- ----------
3391944           arc01   lhcbplt          gridjob          1     FAILED      1:0     2000Mc            wn52 2017-04-03T04:00:06 2017-04-03T04:13:12   00:13:06 5-00:00:00
3390959           arc02   lhcbplt          gridjob          1     FAILED      1:0     2000Mc            wn81 2017-04-03T03:14:28 2017-04-03T11:29:50   08:15:22 5-00:00:00
3390986           arc02   lhcbplt          gridjob          1     FAILED      1:0     2000Mc           wn115 2017-04-03T03:14:28 2017-04-03T11:19:13   08:04:45 5-00:00:00
3391028           arc02   lhcbplt          gridjob          1     FAILED      1:0     2000Mc           wn116 2017-04-03T03:14:28 2017-04-03T03:35:06   00:20:38 5-00:00:00
3391037           arc02   lhcbplt          gridjob          1     FAILED      1:0     2000Mc           wn126 2017-04-03T03:14:28 2017-04-03T03:31:53   00:17:25 5-00:00:00
3390921           arc02   lhcbplt          gridjob          1     FAILED      1:0     2000Mc           wn120 2017-04-03T03:14:28 2017-04-03T03:31:40   00:17:12 5-00:00:00
3391003           arc02   lhcbplt          gridjob          1     FAILED      1:0     2000Mc            wn82 2017-04-03T03:14:28 2017-04-03T03:30:57   00:16:29 5-00:00:00
3391046           arc02   lhcbplt          gridjob          1     FAILED      1:0     2000Mc            wn82 2017-04-03T03:14:28 2017-04-03T03:29:13   00:14:45 5-00:00:00
3390999           arc02   lhcbplt          gridjob          1     FAILED      1:0     2000Mc           wn127 2017-04-03T03:14:28 2017-04-03T03:29:07   00:14:39 5-00:00:00
3390933           arc02   lhcbplt          gridjob          1     FAILED      1:0     2000Mc            wn83 2017-04-03T03:14:28 2017-04-03T03:28:35   00:14:07 5-00:00:00
3391066           arc02   lhcbplt          gridjob          1     FAILED      1:0     2000Mc            wn95 2017-04-03T03:14:28 2017-04-03T03:28:11   00:13:43 5-00:00:00
3391053           arc02   lhcbplt          gridjob          1     FAILED      1:0     2000Mc            wn86 2017-04-03T03:14:28 2017-04-03T03:28:07   00:13:39 5-00:00:00
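
The report above looks like sacct output; a roughly equivalent ad-hoc query (a sketch only, the
exact flags used by the CSCS slurmerrors script are not known) would be:

# Sketch: failed/node-failed/timed-out jobs from the last 12 hours
sacct -a --state=F,NF,TO \
      -S "$(date -d '12 hours ago' +%Y-%m-%dT%H:%M)" -E now \
      --format=JobID,Partition,User,JobName,AllocCPUS,State,ExitCode,ReqMem,NodeList,Start,End,Elapsed,Timelimit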



Solution from our side: inform the CSCS people; site readiness on our side is fine, so there is nothing to do.

Deleting snapshots - recipe from Derek

Monday 27 March 2017 

Dear colleagues

Until we fully solve the snapshot situation, with documentation for user self-help and better automation, this mail lists the commands you can use to resolve the situation manually.

All commands need to be executed locally on t3nfs01.

###############

List detailed space usage for user grauco

[root@t3nfs01 ~]# /usr/sbin/zfs list -r -o name,quota,available,reservation,used,usedbydataset,usedbysnapshots,snapshot_count,compressratio,creation data01 -S usedbydataset -t filesystem | egrep 'NAME|grauco'
NAME                       QUOTA  AVAIL  RESERV   USED  USEDDS USEDSNAP  SSCOUNT  RATIO  CREATION
data01/shome/grauco         400G   106G     10G   294G    251G 43.9G


List the snapshots of grauco

[root@t3nfs01 ~]# zfs list -H -p -d 1 -t snap data01/shome/grauco -o creation,name
1490229044    data01/shome/grauco@zfssnap-day-20170323-013044
1490314703    data01/shome/grauco@zfssnap-day-20170324-011823
1490403185    data01/shome/grauco@zfssnap-day-20170325-015304
1490489000    data01/shome/grauco@zfssnap-day-20170326-014319
1490571962    data01/shome/grauco@zfssnap-day-20170327-014602


Destroy a particular snapshot

[root@t3nfs01 ~]# zfs destroy -r data01/shome/grauco@zfssnap-day-20170323-013044

##################
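
A minimal sketch for bulk cleanup (an illustration only, not Derek's script; the 7-day cutoff and
the dataset name are arbitrary): list the snapshots of one shome dataset with parsable creation
times and print the destroy commands for those older than the cutoff, so they can be reviewed
before being run by hand.

#!/bin/bash
# Sketch: print "zfs destroy" commands for snapshots older than 7 days.
# Run locally on t3nfs01; review the output before executing anything.
DATASET="data01/shome/grauco"                  # example dataset from above
CUTOFF=$(( $(date +%s) - 7*24*3600 ))          # epoch seconds, 7 days ago

zfs list -H -p -d 1 -t snapshot -o creation,name "$DATASET" |
while read -r created snap; do
    if [ "$created" -lt "$CUTOFF" ]; then
        echo zfs destroy -r "$snap"
    fi
done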


Before using these commands, it makes sense to read up on the basics in the man pages (man zfs).


N.B.: I already have some better scripting to reduce and delete snapshots, but it still needs to be put into production in the near future, along with testing and documenting the user self-help measures.


Best regards,
Derek

Deleting snapshots

Tuesday 14 March 2017 


[root@t3nfs01 ~]# zfs list data01/shome -d1
NAME                                        USED  AVAIL  REFER  MOUNTPOINT
data01/shome                               5.93T  1.94T   304K  /zfs/data01/shome
data01/shome@zfssnap-hour-20161222-080102   240K      -   304K  -
data01/shome@zfssnap-day-20170310-012348    144K      -   304K  -
data01/shome@zfssnap-day-20170311-011059    144K      -   304K  -
data01/shome@zfssnap-day-20170312-010713    144K      -   304K  -
data01/shome@zfssnap-day-20170313-011945    144K      -   304K  -
data01/shome@zfssnap-day-20170314-010840    144K      -   304K  -
data01/shome/Federica24                     516M   399G   514M  /zfs/data01/shome/Federica24
data01/shome/aspiezia                      60.0G   340G  56.6G  /zfs/data01/shome/aspiezia
data01/shome/berger_p2                      101G   299G   101G  /zfs/data01/shome/berger_p2
data01/shome/bianchi                       69.9G   330G  69.9G  /zfs/data01/shome/bianchi
data01/shome/caber                          488K   400G   488K  /zfs/data01/shome/caber
data01/shome/casal                          176G   224G   176G  /zfs/data01/shome/casal
data01/shome/cgalloni                       238G   162G   238G  /zfs/data01/shome/cgalloni
data01/shome/cheidegg                       131G   269G  57.9G  /zfs/data01/shome/cheidegg
data01/shome/clange                         129G   271G   129G  /zfs/data01/shome/clange
data01/shome/clseitz                       3.29G   397G  3.29G  /zfs/data01/shome/clseitz
data01/shome/cmssgm                         530M   399G   530M  /zfs/data01/shome/cmssgm
data01/shome/creissel                      5.22G   395G  5.21G  /zfs/data01/shome/creissel
data01/shome/decosa                        71.6G   328G  71.6G  /zfs/data01/shome/decosa
.................



Example: 5 snapshots from Joosep:
[root@t3nfs01 ~]# zfs list data01/shome/jpata -d1
NAME                                             USED  AVAIL  REFER  MOUNTPOINT
data01/shome/jpata                               235G   165G   192G  /zfs/data01/shome/jpata
data01/shome/jpata@zfssnap-day-20170310-012348  50.6M      -   212G  -
data01/shome/jpata@zfssnap-day-20170311-011059      0      -   221G  -
data01/shome/jpata@zfssnap-day-20170312-010713      0      -   221G  -
data01/shome/jpata@zfssnap-day-20170313-011945      0      -   221G  -
data01/shome/jpata@zfssnap-day-20170314-010840   121M      -   192G  -
data01/shome/jpata/docker                       3.58G   165G  3.58G  /zfs/data01/shome/jpata/docker
[root@t3nfs01 ~]#

Running `zfs destroy data01/shome/jpata@zfssnap-hour-20170208-160101` will destroy that particular snapshot.

Be careful: if you run `zfs destroy` carelessly, you can nuke user data (see the dry-run example below).
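
For a safer workflow, `zfs destroy` also accepts -n (dry run) and -v (verbose), which preview what
would be removed and how much space would be reclaimed, for example for one of the snapshots listed above:

zfs destroy -n -v data01/shome/jpata@zfssnap-day-20170310-012348

Once the preview looks right, rerun the same command without -n.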



Approving phedex requests


Overview page:

Check for pending requests to T3_CH_PSI and T2_CH_CSCS.

Click on the pending request.

Check if the group is "local" <-- Important

For files < 1 TB, approve without a comment.


