Before You Do Anything on This Page

If you are reading this page, something is wrong with the CRT, so the CrtExperts need to know about it. Make sure to mention the problem either in its own eLog entry or in the end-of-run comments window. I've organized the steps on this page by symptom and linked to separate topics for the procedures that fix the CRT. If you can't solve the problem with what's on this page, contact the CrtExperts to get help.

Quickstart

Abbreviated CRT debugging instructions that seem to fix a lot of problems without requiring many decisions. If you don't need to know why the CRT isn't working, try these steps first.

  1. Make sure no one is running the CRT through run control right now
  2. LogInToCRTServer, which is a prerequisite for all of the other steps
  3. KillTheCRTBackend
  4. Check the CRTTimingEndpoint
  5. Try PowerCyclingCRT
  6. If you didn't need to change anything in steps 4 and 5, then the CrtExperts will have to read the CRTLogFiles. If you did something in step 4 or 5, you can check, to first order, whether the CRT is ready to run with:

    startallboards.pl gainupdate_v9 #Runs the CRT without run control, so the CTB will get triggers between this step and the next one.
    #If this step hangs, interrupt the command and get help from the CrtExperts.
    stopallboards.pl gainupdate_v9 #Puts the CRT back into a "waiting" state. Signals to the CTB will stop when this command completes.

    Shortly after startallboards.pl started, you should have seen terminal output like:

    Disk usage: 68% (925GB remaining in /nfs/sw)

    Found previous instances of readout. Killing...
    Started stop.pl. Looking for ./readout processes to kill...
    Starting
    DataPath=CRTDAQ/DATA/,Disk=1,DataFolder=/data1/CRTDAQ/DATA/
    pmtini = 1, pmtfin = 32
    Baseline data taking .................................................................
    Waiting for late packets ********** : 4 sec
    Initializing .................................................................
    PMT 0: baseline hits: 500
    PMT 1: baseline hits: 500
    PMT 2: baseline hits: 500
    PMT 3: baseline hits: 500
    PMT 4: baseline hits: 500
    PMT 5: baseline hits: 500
    PMT 6: baseline hits: 494
    PMT 7: baseline hits: 407
    PMT 8: baseline hits: 486
    PMT 9: baseline hits: 500
    PMT 10: baseline hits: 500
    PMT 11: baseline hits: 500
    PMT 12: baseline hits: 500
    PMT 13: baseline hits: 500
    PMT 14: baseline hits: 500
    PMT 15: baseline hits: 477
    PMT 16: baseline hits: 500
    PMT 17: baseline hits: 500
    PMT 18: baseline hits: 500
    PMT 19: baseline hits: 401
    PMT 20: baseline hits: 500
    PMT 21: baseline hits: 500
    PMT 22: baseline hits: 500
    PMT 23: baseline hits: 500
    PMT 24: baseline hits: 500
    PMT 25: baseline hits: 500
    PMT 26: baseline hits: 500
    PMT 27: baseline hits: 527
    PMT 28: baseline hits: 500
    PMT 29: baseline hits: 500
    PMT 30: baseline hits: 500
    PMT 31: baseline hits: 547
    USB 13 - File open for writing
    USB 14 - File open for writing
    USB 3 - File open for writing
    USB 22 - File open for writing
    .....Taking data .....

    If all of those baseline hits counts are greater than 400, try running the CRT through Run Control (RC) again. Otherwise, go back to step 5 one more time. If you get here a second time, get help from the CrtExperts. This class of problems is caused by hardware and is not related to artdaq. If the CrtExperts aren't available and you really need the CRT now, try making monitoring plots according to the instructions in CRTAdvancedOp.
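The baseline check in step 6 can also be done mechanically if you paste the terminal printout into a script. The following is only a minimal sketch: the function name low_baseline_pmts and its threshold parameter are made up for illustration and are not part of the CRT software. It assumes the printout contains lines in the "PMT N: baseline hits: M" form shown above.

```python
import re

# Matches lines such as "PMT 7: baseline hits: 407" from the startallboards.pl printout.
BASELINE_LINE = re.compile(r"PMT\s+(\d+):\s+baseline hits:\s+(\d+)")


def low_baseline_pmts(printout, threshold=400):
    """Return {pmt: hits} for every PMT whose baseline hits are not above threshold.

    Hypothetical helper: an empty result means every PMT passed the "> 400" check.
    """
    hits = {int(pmt): int(count) for pmt, count in BASELINE_LINE.findall(printout)}
    if not hits:
        raise ValueError("no 'baseline hits' lines found in printout")
    return {pmt: count for pmt, count in hits.items() if count <= threshold}


# A few lines copied from the example printout above.
sample = """\
PMT 6: baseline hits: 494
PMT 7: baseline hits: 407
PMT 19: baseline hits: 401
PMT 31: baseline hits: 547
"""

print(low_baseline_pmts(sample))  # prints {} because every count is above 400
```

An empty dictionary corresponds to "try running the CRT through RC again"; any entries name the PMTs that send you back to step 5.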

The CRT Won't Boot

  1. LogInToCRTServer
  2. You should still be logged into np04-crt-001.cern.ch from step 1, so you can KillTheCRTBackend.

The CRT Won't Configure

  1. Is it just crt0? crt0 is usually responsible for starting the backend process that takes baselines and sends messages to our hardware during the configure transition. It usually needs about 1 minute to configure after all of the other CRT board readers have already configured, so wait 2 minutes to be sure the baselines aren't just slow. The CRT board reader is supposed to configure more quickly than the WIBs, so if it doesn't, something might be wrong.
  2. Has anyone been working on the CRT timing endpoint recently? The CRT seems to get into a "bad state" whenever its timing endpoint stops sending clock pulses for any period of time, which seems to happen when the endpoint's firmware is reloaded. Low current values from the middle channels on each LV supply in pollAllCRTLV.sh seem to be a common symptom of problems like this. Try PowerCyclingCRT before going on.
  3. crt0 could be stuck reading baselines, thus getting all of the other board readers stuck looking for baselines. This probably means major problems with the CRTTimingEndpoint, so contact the CrtExperts before attempting a fix. If you're taking beam data, contact the CrtExperts, then take the CRT out of the run. I haven't seen this happen yet.

The CRT Won't Start

  1. Check CRTLogFiles for the board reader(s) that won't start. Try searching for "exception". You might see something like:

    cet::exception object caught:---- CRT BEGIN
    CRT timing board in bad state, 0xe, can't read run start time
    ---- CRT END

    If so, you need to soft-reset the CRTTimingEndpoint.
  2. If you don't find an exception like that, I'm not sure how this could happen, so contact the CrtExperts. There might be hints in the CRTLogFiles.

One or More CRT Board Readers Have an Average_Fragment_size_double of 48 or Lower

Average_Fragment_size_double is a metric that is related to the rate at which data is sent to offline files. For the CRT, it is related to the rate at which data from single CRT modules is accepted into readout windows. An Average_Fragment_size_double of 48 seems to be an empty data container for most board readers, so this means that CRT data isn't making it into offline files.

  1. If all 4 board readers are exhibiting this behavior, then you should get the CrtExperts' help to investigate BringingCRTUpFromRackOff.
  2. We want to figure out whether the problem is in the CRT board reader or the hardware. Open the Selection Tool and select a CRT board reader with a low Average_Fragment_size_double (read CRTBasicOp if you don't know how to do this). There is a pull-down menu on the right side of the Selection Tool that has the name Average_Fragment_size_double by default. Select Fragments Made int from this menu, then click the Add button just above the Selection Tool's graph. If you aren't already plotting Average_Fragment_size_double for the board reader you are investigating, please do so now as described in CRTBasicOp. When you're done, the Selection Tool's graph might look like this:
    TODO: Selection Tool with Fragments Made
    1. If Fragments Made doesn't drop to 0, let the run keep going for about 5 minutes if you can.
      1. If Fragments Made goes to 0 or starts to fluctuate wildly, then this might be a problem related to the global timing system:
        1. Check the most recent CRTLogFiles for a line that says:

          CRT nfo Set run start time to 77383514915854948, or 1547670298 seconds at UNIX timestamp of 1547670301 seconds. Looks like we skipped 0 32-bit rollovers, so set uppertime to 0.

          If you find it, find the difference between the "run start time in seconds" and the "UNIX timestamp". If it's more than 1 or 2, ask a timing expert to make sure that the timing system is synchronized with UNIX time. Otherwise, there might be a message like:

          CRT wrn CRT board reader failed to get reasonable run start time from timing endpoint. Difference from UNIX time of 3 > tolerance of 2

          The "Difference from UNIX time" in your message might be larger than 3. If you find either line, contact a timing expert and ask them to synchronize the timing system with the current UNIX time.
        2. If you couldn't find either of those messages, look for a message like:

          CRT wrn Got a large time difference of -301 between CRT timestamp of 77390260158757998( 1547805203 seconds) and current system time of 1547805504. lowertime = 107654059, uppertime = 0, and runstarttime = 77390260051103939. Throwing out this Fragment to prevent a single bad board from ruining all of our data.

          1. If you find it, check the CRTTimingEndpoint status and look at the entry called pulse.ctrl.cmd.
          2. If it isn't 0x4, then the CRT timing endpoint is probably getting sync pulses at weird times. From your terminal controlling the timing endpoint, issue the command:

            pdtbutler crt CRT_EPT configure 0 RunStart

            You should see something like:

            Created crt device
            +------------------+------------+
            | csr | 0x0 |
            | csr.ctrl | 0x0 |
            | csr.ctrl.tgrp | 0x0 |
            | csr.stat | 0x81 |
            | csr.stat.ep_rdy | 0x1 |
            | csr.stat.ep_stat | 0x8 |
            | pulse | 0x41 |
            | pulse.cnt | 0x2c5 |
            | pulse.ctrl | 0x41 |
            | pulse.ctrl.cmd | 0x4 |
            | pulse.ctrl.en | 0x1 |
            | pulse.ctrl.force | 0x0 |
            | pulse.ts_h | 0x112eb19 |
            | pulse.ts_l | 0x578dedf6 |
            +------------------+------------+

            Make sure that csr.stat.ep_stat is 0x8 (the timing endpoint is ready) and pulse.ctrl.cmd is 0x4 (seems to be RunStart sync pulse mode). If either parameter has a different value, please contact the CrtExperts and a timing system expert for help.
          3. If it is 0x4, then the global timing system might have drifted away from the current UNIX time on the CRT server. In the same log file as before, look for a line like:

            CRT nfo Set run start time to 77390260051103939, or 1547805201 seconds at UNIX timestamp of 1547805201 seconds. Looks like we skipped 0 32-bit rollovers, so set uppertime to 0.
        3. If you couldn't find any of those log file messages, something weird is happening in the board reader. Contact the CrtExperts.
      2. If Fragments Made doesn't go to 0, then this is a board reader problem. Contact the CrtExperts.
    2. If Fragments Made goes to 0 at about the same time that Average_Fragment_size_double drops, then this is probably a hardware problem. You should stop the run and try PowerCyclingCRT. If a power cycle doesn't help, then you might have encountered an elusive board reader problem. Contact the CrtExperts.
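Two of the checks above are simple enough to script: comparing the run start time with the UNIX timestamp in the "Set run start time" log line, and checking csr.stat.ep_stat and pulse.ctrl.cmd in the timing endpoint status table. This is only a rough sketch, not part of the CRT board reader or of pdtbutler. The function names run_start_offset and endpoint_problems are invented for illustration, and the regular expressions assume the log line and the table look exactly like the examples on this page.

```python
import re

# "Set run start time" log line, in the format of the examples on this page.
RUN_START = re.compile(r"or (\d+) seconds at UNIX timestamp of (\d+) seconds")
# One row of the status table, e.g. "| pulse.ctrl.cmd | 0x4 |".
STATUS_ROW = re.compile(r"\|\s*([\w.]+)\s*\|\s*(0x[0-9a-fA-F]+)\s*\|")


def run_start_offset(log_line):
    """Return (run start time - UNIX timestamp) in seconds, or None if the line doesn't match."""
    match = RUN_START.search(log_line)
    if match is None:
        return None
    return int(match.group(1)) - int(match.group(2))


def endpoint_problems(status_text):
    """Return {register: value found} for registers that differ from the values this page expects.

    A register that is missing from the table is reported with the value None.
    """
    expected = {"csr.stat.ep_stat": 0x8, "pulse.ctrl.cmd": 0x4}
    found = {name: int(value, 16) for name, value in STATUS_ROW.findall(status_text)}
    return {name: found.get(name) for name, want in expected.items() if found.get(name) != want}


# Log line and table rows copied from the examples above.
example_line = (
    "CRT nfo Set run start time to 77383514915854948, or 1547670298 seconds "
    "at UNIX timestamp of 1547670301 seconds."
)
example_status = """\
| csr.stat.ep_stat | 0x8 |
| pulse.ctrl.cmd | 0x4 |
"""

print(run_start_offset(example_line))     # prints -3
print(endpoint_problems(example_status))  # prints {}
```

The tolerance of 2 seconds quoted in the board reader's warning message suggests treating an offset whose absolute value is larger than 2 as a reason to contact a timing expert. An empty result from endpoint_problems means both registers hold the values described above.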

The CRT Won't Shut Down

  1. Is it just crt0? If so, click on the Process Manager in the list-tree on the left side of the Run Control window and click Force SHUTDOWN. This will probably take a few seconds. If nothing has changed after about 30 seconds, go to step 2.
  2. The Run Control's Finite State Machine might be confused about the state of one or more CRT board readers. Shut down every component you can, then KillTheCRTBackend.
  3. Contact a Run Control expert as well as the CrtExperts.
-- Main.anolivie - 2018-11-02
Topic revision: r6 - 2019-03-15 - AndrewPaulOlivier
 