Infrastructure Status

  • Monitoring has a gap in it: the test proxy expired locally due to a site DNS change. Shouldn't recur.
  • UCSD transitioned from ITB to Production once they received the version that fixed the file descriptor leak.

Action Item Status

  1. JobRobot progress.
    • BB: Andrea sent jobs with a minimal CMSSW version for xrootd usage; working with him on the best way to "fake" a dataset for the JR.
  2. Progress with local physicists.
    • fkw: talked with local physicists to explore ideas of how xrootd might benefit them. Some of these are very specific to how the UCSD T2 operates; others are quite generic. I'm writing down what I was told, i.e. none of this reflects my opinion.
      • We produce ntuples of at least one PD that is not going to be physically located at UCSD. The students said they'd much rather run the jobs within the local environment and do remote file access than port their scripts to fnal, or use crab. The reasons for staying within one environment are twofold. First, things are arranged slightly differently at fnal: different stuff is sourced, and things change, so having only one place to log in interactively is easier. Second, with cmssw_4_2_preX they had problems where the same code does not compile in both places, fnal and UCSD. (It seems to me that this is unacceptable and needs to be addressed, but what do I know.) Students were particularly interested in understanding how to know what data at fnal can be accessed. E.g. RECO is sometimes needed for something, and won't be placed anywhere other than the T1.
      • Same as the previous but for ntupling MC. Up to now, we use crab for this. The students would rather use the scripts they wrote, and run exclusively on UCSD CPU. This would be possible with the remote access.
      • Access to disks we have that are outside the storage system:
        • nfs-3,4,5, the nfs-mounted spaces: access to these disks from desktops, laptops, and CERN.
        • /data/temp on uaf-3,4,5,6 are not nfs exported. They are local temp areas on the login nodes. Exporting them via xrootd would make them available to all the login nodes equally. Basically, xrootd as an nfs replacement.
      • We use the pick-event functionality reasonably often, and have written scripts for that. Accessing the data directly via Fireworks may be more convenient.
      • We have a post-processing step as part of our ntuple production. This does very little: it adds 3 branches, drops provenance, and merges files. At present, this is done by copying files out of hdfs onto local disk, then post-processing, then deleting the stuff. They adopted this way of operating because direct access to dcache with dcap was slow. Nobody could tell me if this copy out is still needed now. However, they did tell me that the fuse mounts don't perform well for the copy out, so they actually use the hdfs native client. In fact, I recall Terrence explaining this to me one day: fuse somehow hangs itself up after some amount of transfer out. It would be worthwhile doing a few tests to at least document this behaviour, and see if xrootd offers anything here. After all, xrootd uses the native hdfs clients, doesn't it? Maybe we don't need to have all our ntuples on these nfs disks any more?
    • AD: At UNL I use Xrootd to access PATTuples for our top cross section measurement using soft-muon tagging. These PATTuples are produced by various members of the analysis group and staged out via CRAB to our HDFS volume. Then, using condor directly (in my case), I use the appropriate Xrootd URLs to access the PATTuples from within CMSSW jobs which are doing the soft-muon analysis. As far as I know, members of our analysis group at Cornell also access our PATTuples in our HDFS store via Xrootd in a similar fashion, but with jobs running at their own local tier-3 cluster. This works very well. This is all being done with CMSSW_3_8_7. The only awkward part at this point is constructing the correct Xrootd URLs to be used within CMSSW. At the moment, this is done via some homemade scripts or emacs, but is rather straightforward. In the future, when we do our next analysis, we will be using later versions of CMSSW that already have the "fallback" method of data discovery, which first uses HDFS/NFS/dCache/whatever and, if it gets a "file not found", then tries Xrootd. This means we don't have to construct the URL manually.
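    • A minimal sketch of the LFN-to-URL construction AD describes. The redirector hostname and function name here are illustrative assumptions, not the actual homemade scripts:

```python
# Hypothetical helper: turn a CMS LFN (e.g. /store/user/...) into an
# Xrootd URL of the form root://<redirector>//store/... for use in CMSSW.
# The default redirector hostname is an assumption for this sketch.
def lfn_to_xrootd_url(lfn, redirector="xrootd.unl.edu"):
    """Build an Xrootd URL from an LFN under /store/."""
    if not lfn.startswith("/store/"):
        raise ValueError("expected an LFN under /store/: %s" % lfn)
    # root:// URLs carry the absolute path after a second slash
    return "root://%s/%s" % (redirector, lfn)
```

      For example, `/store/user/ad/pattuple_1.root` becomes `root://xrootd.unl.edu//store/user/ad/pattuple_1.root`; the CMSSW "fallback" mechanism mentioned above makes this step unnecessary in later releases.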
  3. Setup a local redirector at UCSD.
    • MT: Up and running at
      • uaf-[3-6] nodes report to it (and to
        • Added export of /store/user to both redirectors.
        • Apparently uaf nodes have local disk that could also be exported via xrootd.
          • Problem is, as these machines already report to UNL, that this would both expose this data world-wide and pollute the namespace.
          • Can run another xrootd instance there. Will ask Andy if there is a way to restrict export of given storage-path to a sub-set of masters.
      • nfs-3 reports only to it -- this is the disk space with TAS physics data.
  4. Test out the Xrootd 3.0.3 throttling.
    • BB: Code being tested at UNL.
  5. Get dCache sites upgraded to new version of libdcap.
    • BB: Large stress test on the xrootd/dcap integration run by Pisa. They provided good feedback; a few bugs were discovered. Overall, xrootd/dcap no longer permanently locks up, but definitely isn't "scale ready". At about 50 concurrent clients, you might as well give up.
      • This is another reason we need the JR tests: I would be able to launch my own stress tests aimed at sites. It's very difficult to debug a heavily multithreaded program at someone else's site.
  6. Improve the ML webpages.
    • MT: Studied a bit monitoring in xrootd and what is done with it in existing ML server. Learned how to operate ML repository and write custom chart pages.
      • Fixed the map to show the USA by default, and moved the UCSD location to its pre-earthquake position.
      • Two charts are shown now (for the uaf nodes and for one machine from UNL that Brian configured to send information to UCSD).
      • So far, we read practically no data via xrootd ... so the charts are not very meaningful.
      • Now, ML only reads detailed monitoring data from xrootd, and does relatively little with it. I'm working on a pre-processor that will collect and digest the summary information and register it into ML. This is not entirely trivial, as some things are reported cumulatively and others are reported as rates, so one has to keep history for each host to be able to calculate meaningful variables for ML. Costin might be interested in converting this into java and running it in the ML service directly.
      • Once this is done, I need to seriously think how this data will be stored in ML's Farm/Cluster/Node structure. If this is done right, one can do a lot of cool stuff at the web interface with little code. The thing here is that we have to support two modes of operation:
        • Sites just sending xrootd monitoring streams to UCSD. This can practically be advertised already -- I just need to decide on the port numbers I will use for this.
        • Sites installing their own ML services. This is secondary for now. Monas said they will provide rpms for that ... and offered to test / install it at Caltech.
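      • A rough sketch of the per-host history idea behind the pre-processor (class and variable names are illustrative assumptions, not the actual code): cumulative counters need a previous sample per host before a rate can be derived for ML.

```python
# Hypothetical pre-processor fragment: convert cumulative xrootd summary
# counters into rates by remembering the last (timestamp, value) per host.
class RateTracker:
    def __init__(self):
        self.last = {}  # host -> (timestamp, cumulative value)

    def update(self, host, timestamp, cumulative_value):
        """Return the rate since the previous sample, or None if this is
        the first sample for the host (or time did not advance)."""
        prev = self.last.get(host)
        self.last[host] = (timestamp, cumulative_value)
        if prev is None:
            return None
        dt = timestamp - prev[0]
        if dt <= 0:
            return None
        return (cumulative_value - prev[1]) / dt
```

        Counters already reported as rates would bypass this step and be registered into ML directly.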
  7. Make public per-site Nagios pages.
    • BB: Not done yet.


  • Code:
    • BB: Fixed leaky file-descriptor in GUMS-integration code (someone else's library). Fixed memory leaks in HDFS integration.
    • BB: Reviewed CERN IT's code for true asynchronous prefetching. Promising idea, but far from ready; it may require cleanup of ROOT code before such a patch is possible. Working to get it to compile. Will arrive no sooner than early 2012, IMHO.
  • Monitoring:
    • BB: Discussed on PhEDEx list the issue of how to get a random filename at a site. Answer: no elegant approach. Will have to hack something together.
    • BB: Monitoring alerts have been triggering often lately. May want to increase timeouts -- or at least investigate whether the failures are "real". A priori, it's not acceptable for a file to take >30s to open.
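    • An illustrative timing probe for the >30s open concern; the `xrdcp` invocation, threshold, and injectable runner are assumptions for a sketch, not the production Nagios check:

```python
import subprocess
import time

# Hypothetical probe: time how long it takes to read a file via xrdcp and
# flag it if the transfer fails or exceeds the threshold. The `runner`
# parameter exists so the timing logic can be exercised without xrootd.
def check_open_time(url, threshold=30.0, runner=subprocess.call):
    start = time.time()
    status = runner(["xrdcp", "-f", url, "/dev/null"])
    elapsed = time.time() - start
    ok = (status == 0) and (elapsed <= threshold)
    return ok, elapsed
```

      Distinguishing a slow open from a slow transfer would need a finer-grained client, but even this coarse check would help decide whether the current alerts are "real".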

Items for next week

  • "FNAL accessibility issues": what files are accessible from FNAL; how do we explain this to the users?
    • We'll control it by developing a simple policy for what files are available in Xrootd. FKW will contact Oli to do this.
    • EXAMPLE (not final): Anything at the T2 is available, plus any 2011 AOD file.
    • Brian will create a webpage explaining the policy.
  • "FNAL validation issue": We decided FNAL is validated if:
    • Nagios heartbeat
    • Nagios random file
    • Zabbix "each host test".
  • MonALISA progress:
    • Add more graphs.
    • Attempt to offload some of the data processing to the ML team: i.e., maybe have them tackle the issue of .
  • Carryover: make progress on JobRobot / HC.
  • Carryover: Xrootd 3.0.3 testing.
  • Carryover: public Nagios tests.
Topic revision: r6 - 2011-03-04 - BrianBockelman