PHOBOS ANALYSIS AND DATA MINING TIMING TESTS FOR MDC2
D. McLeod
March 7, 1999
INTRODUCTION
       The volume of "data" reconstructed in Mock Data Challenge 2 (MDC2) and the characteristics of the "events" approach those expected in early PHOBOS running closely enough that timing information on various types of MDC2 processing is useful for estimating the real workload and the limitations on getting quick answers.  During MDC2, but after the PHOBOS "on" time, we made some tests of plotting event information and of splitting off mini-DSTs containing a fraction of the events and/or a fraction of the data per event.  These tests used rmds03, a SUN with /disk20000 directly mounted, and rcas0134, a quad 200 MHz PPro with /disk20000 NFS mounted.

THE EVENTS USED
We analyzed the MDC2 reconstructions of 52476 events whose raw data were simulated with our GEANT3-based apparatus simulation (including secondary interactions), with event generation driven mostly by FRITIOF but with some tests of other generators.  About 18.5K events were central collision events; the remainder (34K) were inclusive.  About half the events had collisions spread along the beam direction in a simulation of an interaction diamond, while the rest were fixed at the nominal interaction point.  The "reconstructed" events included vertex reconstruction as we think it will be done with real data, an advanced version of the full-azimuth, ±4.5 rapidity unit multiplicity reconstruction (R. Verdier), and our latest version of the spectrometer track finding software, which succeeded in finding several tracks per event in the limited phase-space acceptance set up for these early tests.  Output to the mini-DSTs included the vertex object, some plots (including some eta vs. phi multiplicity distributions), some of the multiplicity data objects (in some runs), and various "keyword" leaves such as the total multiplicity, vertex location, tracks found, etc.  The details of track finding were not kept in the reconstructions.

The reconstructed event sizes were variable, about 3.1 MB for the central events and 0.41 MB on average for the inclusive events, for a total of 72 GB of input distributed in files of 200 events for the central events and 500 events for the inclusive (peripheral) events, 129 files in total.  In most cases the output mini-DST held about 11.3 KB (central) or 5.4 KB (inclusive) per event, because the large multiplicity analysis objects were omitted from the output.  The mini-DST output was compressed by ROOT by a factor of 1.4 to 4 (depending on the branch), average 2.7.  Many of the runs were done on the total available data, with an overall average of 1.37 MB/event in and 8.0 KB/event of mini-DST out for a mixture of central and inclusive events.  The reconstructed input events were in general not compressed (for RCF testing purposes).
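As a rough consistency check on these figures (using the quoted averages, so the numbers are only approximate):

        central:    18.5K events x 3.1 MB   ~ 57 GB
        inclusive:  34K events x 0.41 MB    ~ 14 GB
        total:      ~71 GB, or ~1.4 MB/event over 52476 events

which is in line with the 72 GB and 1.37 MB/event quoted above.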

THE ENVIRONMENT
       The tests all used /disk20000, which is mounted on rmds03, for the input data, and "user" space (i.e. /phobos/u/mcleod, NFS mounted) for the much smaller output.  Some tests were done on rmds03, a SUN with the input disk directly connected, as a stand-in for rsun00 & /diskA, since our "on" time when those were available was already over.  rmds03 is a SUN E450 with 4 250-MHz cpus; it might be possible to estimate a scaling to the speed of rsun00, a SUN E4000 with 6 cpus.  Load factors were not well known but appeared to be "light" during these tests.  More than half the tests were done on rcas0134 (4 200-MHz PPro cpus), with /disk20000 NFS mounted and no other users.

SOFTWARE
       Except for one test with interpreted (CINT) code, all the tests were done with a simple ROOT-based C++ program compiled with the PHAT (PHobos Analysis Tools) libraries linked in.  This was run noninteractively; most of the parameters varied during the tests were set via environment variables, and the rest required a very brief recompile and link.  The program is Phat/macros/mdc2/mdcdst.cxx in the repository, where you will also find the corresponding Mdcdst.C macro and the SUN and LINUX makefiles.  It sets up a TChain over the analysisout.root TTree files, from which a variety of subsets of our "analyzed data" can be selected.  Options include some simple TChain->Draw() plots of one or two "leaves" and/or plotting and selection of events in an explicit event loop.  The latter generally included output of the mini-DSTs described above, with event selection criteria applied and with some branches of the input data omitted.
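       The skeleton of such a filtering job, in the ROOT style used here, is sketched below.  This is an illustration only, not an excerpt from mdcdst.cxx: the tree name "T", the file pattern, and the branch and leaf names are hypothetical placeholders.

    // Minimal sketch: chain the input Trees, disable the large branches,
    // loop over events, and write selected events to a mini-DST.
    // All names below are placeholders, not the actual PHAT names.
    #include "TChain.h"
    #include "TFile.h"
    #include "TTree.h"

    void minidst_sketch()
    {
       TChain chain("T");                                // "T" = assumed tree name
       chain.Add("/disk20000/mdc2/analysisout_*.root");  // stands in for the 129 input files

       // Keep only the small branches in the copy.
       chain.SetBranchStatus("*", 1);
       chain.SetBranchStatus("MultArray*", 0);           // assumed large multiplicity branch

       TFile out("minidst.root", "RECREATE");
       TTree *minidst = chain.CloneTree(0);              // empty clone of the enabled branches

       Long64_t nent = chain.GetEntries();
       for (Long64_t i = 0; i < nent; ++i) {
          chain.GetEntry(i);
          // ... event selection on the "keyword" leaves (multiplicity, vertex, tracks) ...
          minidst->Fill();
       }
       minidst->Write();
       out.Close();
    }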

RAW RESULTS
Conditions and comments                     #  Machine   Total    CPU     Events  Events   MByte
                                                         time(s)  time(s)   in     out      out
Central events, vary cuts, small mDST       1  rmds03       44     12.4    5000      82     0.94
(to separate output & input i/o time)       2    "          34     13.8    5000     243     2.75
Small branches, half the events output      3    "          83     61.3    5000    2589    29.2
Same, with the smaller inclusive events     4    "         117     41.6   20000     291     1.56
    "                                       5    "          98     43.2   20000     607     3.27
    "                                       6    "         223    140     20000    7385    40.2
Central, all branches output to mDST        7    "         126    104        50      33    38.6
Do similar to the above on rcas0134         8  rcas0134    101     20.2    5000      82     0.94
   "                                        9    "          17.8   17.4    5000     243     2.75
Suspect... seg viol after output write.    10    "          71.6   71.0    5000    2589    29.2
   "   Inclusive events now                11    "         328     74     20000     291     1.56
   "                                       12    "         311     79     20000     607     3.27
Central events, all branches out           13    "         192    126        50      33    38.6
All 52K events, central & inclusive.       14  rcas0134   1748    997     52476   43371   342.5
Some rcas <-> rmds03 comparisons           15  rmds03     1317    843     52476   43371   342.5
  "                                        16  rcas0134   1342    609     52476   22223   177.0
  "                                        17  rmds03      917    492     52476   22223   177.0
Run same job all 4 cpus.  Each job:        18  rcas0134   1743    639     52476   22223   177.0
Same job on one cpu, no other load.        19  rcas0134   1230    607     52476   22223   177.0
Run same job on 4 of the cpus              20  rmds03      848    431     52476   22223   177.0
Same job, one cpu, unknown other loads     21  rmds03      793    454     52476   22223   177.0
No output DST; event loop, some plots      22  rcas0134    681    190     52476    ----    ----
Same on rmds03                             23  rmds03      353     98     52476    ----    ----
Two DSTs written from one input            24  rcas0134   1711   1131     52476  22, 30K  176, 207
CINT macro, not compiled                   25  rcas0134   1189    611     52476   22223   177.0
All buffers out again, at 30 events in     26  rcas0134    117     81        30       ?       ?
TChain->Draw 3 histograms & one            27  rlnx03       25.8   23.5   52476    ----    ----
 scatterplot of "leaves" = z pos'n,        28  rcas0134     48.2   42.6   52476    ----    ----
 multiplicity, tracks found, mtru:mrecon   29  rsun00       26.0   21.1   52476    ----    ----
 No event loop, try different machines     30  rmds03       70.7   26.7   52476    ----    ----
 
COMMENTS AND INTERPRETATION
     Tests 1-13 were run on smaller samples of either central or inclusive events, while most of the rest were done with the full data set, which is a mixture of event types.  Variable numbers of selected events were output to mDSTs to see i/o time effects, and in most cases only small branches were put out (0.36% of the central reconstructed event size, 1.32% of the inclusive size) on the assumption that this will be the most usual filter.  Comparing central to inclusive events, the CPU time associated with the input depends more on the number of events than on the total event size (11.3 vs. 5.2 KB), while the total time fluctuates too much (overall load?) to offer many conclusions.  This might be expected on rmds03 with other users, and on rcas it may reflect varying NFS loads.  The results are more consistent for the longer runs 14-25.  With all branches output (run 13) at about 1.2 MB/event, we would extrapolate to about 19K seconds of total time to write out 5000 events, or several hours... this result is suspect, see the discussion of run 26 below.  As expected, rmds03 appears faster than rcas by a factor of 1.3 - 1.8 in CPU time, and the total time, partially reflecting i/o time, is qualitatively shorter on rmds03 (but with a lot of variation).
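(The extrapolation is simply linear in the number of events: run 13 took 192 seconds of total time, 126 seconds of CPU time, for 50 events, so 5000 events would take roughly 100 x 192 ~ 19,000 seconds of total time, a bit over 5 hours, and roughly 13,000 CPU seconds.)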

         The longer runs give a direct comparison to the scale we expect, and may average out some of the load variation which appears to plague tests 1-13.  Runs 14-17 lead to the overall conclusion of 15-30 minutes to process 52K events in 72 GB into small output DSTs (0.6% of the data per event on average), varying with the machine used and the fraction of events output.  Looking at runs 18-21, there appears to be a real advantage to running in parallel on multiple CPUs of the same machine despite the shared i/o resources; the overall increase in time for four simultaneous jobs is especially small on rmds03 (848 vs. 793 seconds).  The processes are mutually ignorant, so the results should be the same with differing input streams.  It is puzzling that in this case the cpu time is longer for the single process!  It is hard to draw conclusions about i/o time because it is hard to separate out, especially with the fluctuations.  Runs 22 and 23 involve input time only (no output DST), and show qualitatively that the job runs 2 or 3 times faster than the corresponding jobs with output (compare runs 16 and 17), even though the output data are small.  This is somewhat puzzling, but remember that the input data are uncompressed while the output is compressed, which takes longer.

        Test 24 shows that writing two output DSTs from one input is very easy in ROOT and appears to add mostly CPU time.  Multiple output streams might be an interesting test.  Test 25 was puzzling in that it showed very little loss of time using CINT (compare to test 16 or 19), whereas in MDC1 we saw a factor of 2 gain from compilation in a somewhat similar job.  This point should be revisited.  Test 26 was another attempt to extrapolate the time for output of all branches, in the face of a memory leak problem.  At 117 seconds total for 30 events in, we again find about 19,000 seconds for 5000 events.  Finally, in contrast to this last dismaying result, tests 27-30 use no event loop, instead doing four TChain->Draw() operations on the entire data set; four different machines were tried.  The time of ~1 minute reflects the reading of only the very small structure branches used here and the advantages of the ROOT Tree structure, which puts this data together in a small, well defined location in each of the 129 Tree files in the Chain; nevertheless it is gratifying that the data is located in this many files spanning 72 GB so quickly.
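        The Draw-only style of tests 27-30 looks roughly like the following sketch (again with placeholder names: the tree name and the leaves zvtx, mult, ntrk, mtru and mrecon are not the actual PHAT names).

    #include "TChain.h"

    void draw_sketch()
    {
       // Chain all 129 reconstructed Tree files (placeholder names).
       TChain chain("T");
       chain.Add("/disk20000/mdc2/analysisout_*.root");

       // Three histograms and one scatterplot of small "keyword" leaves.
       // Only these branches are read, which is why the whole 72 GB chain
       // can be scanned in about a minute.  In practice each Draw() would
       // go to its own canvas or pad.
       chain.Draw("zvtx");          // vertex z position
       chain.Draw("mult");          // total multiplicity
       chain.Draw("ntrk");          // tracks found
       chain.Draw("mtru:mrecon");   // true vs. reconstructed multiplicity
    }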

GENERALIZATIONS
     PHOBOS is not likely to do classic "data mining" in the sense of looking for rather small data sets in an enormous volume of partially processed data, as in typical HEP collider experiments; it appears much of our analysis past the RCS-farm reconstruction will involve splitting off a sometimes small but seldom "tiny" fraction of the events, and removing unneeded branches for easier manipulation of the resulting mini-DSTs.  A dominating factor is the 1 1/2 hours taken to read a 50 GB tape on one of the few Redwood tape drives we will have available.  (We quote the uncompressed tape capacity because ROOT already compresses the data by default.)  A data-driven rather than user-driven tape staging scheme such as PHENIX's "Carousel" (D. Morrison, J. Lauret) or Fermilab's "freight train" may be well suited to our needs.  To this author the most significant point of these tests is that the times taken are a moderate fraction (typically around 20 minutes) of the tape reading time; note the full data set used is somewhat larger than one tape's contents.  One should keep in mind, though, that this is while writing out and/or plotting less than 1% of the data per event on a large fraction of the events, or a lot of the data on just a few events.  Another point of comparison is network speeds to remote sites.  If we take 5 MB/sec as an ultimate goal, it would take 10K seconds for a 50 GB tape, allowing 2 or 3 tapes "overnight".  3 1/2 MB/sec was recently obtained MIT <-> BNL (ABACUS <-> RCF), but sustained operation was not attempted.  At the current 1/3 to 1/2 MB/sec BNL -> UIC, before network improvements, the transfer would take 2-3 days!  In all cases, the time taken in these tests (even if the per-event processing is much expanded) is much less than the transfer time.  If a user becomes impatient with 20-minute delays, further micro-DSTs allowing use of TChain->Draw() etc. can be generated.  If PROOF is updated, released and documented, it may offer interesting possibilities for speeding up "mining" with parallel processing, but the parallel processors will generally be available only at BNL.
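For orientation, the arithmetic behind these transfer times (rough numbers only, and assuming the 2-3 day figure refers to moving a sample of the order of the full 72 GB):

        50 GB tape in 1 1/2 hours     ->  effective drive rate of roughly 9 MB/sec
        50 GB at 5 MB/sec             ->  50,000 MB / 5 MB/s = 10,000 s, just under 3 hours
        72 GB at 1/3 - 1/2 MB/sec     ->  roughly 145,000 - 215,000 s, i.e. about 1.5 - 2.5 days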

WHAT NEXT?
        As our software develops it should become clearer what the scale and data flow through typical Physics analyses will be.  Some recent estimates in Steve Manly's writeup "Comments on Phobos Simulation and Computing Needs for Year One" are useful for the discussion of DSTs, such as his projection of 200,000 triggered events per day late in year 1 running.  If reconstruction can "keep up", as is assumed there (of course this depends on successful and rapid track reconstruction, yet to be proven), then we can conclude that the reduction of this data to a DST stream containing 1% of the data can easily be done with one computer.  However, the time appears to grow rapidly with the size of the output events, to the extent that the 75 Si95 (SPECint95) of RCF "data mining" capacity promised us may indeed be fully occupied.  What is needed is more studies with larger output event sizes, such as might be needed for multiplicity analysis or detailed track fitting.  The 4 seconds/event found here for transcribing events of the order of 1 MB seems large and should be checked again when the software problem found above is fixed.  Studies of "mining" should be made with intermediate DST event sizes, 100 KB for example.  It would be useful to have even rough models of the event size and the fraction of events needed for DSTs, both at the initial extraction from the data reduction output and close to the Physics analysis stage, for a variety of Physics topics.
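        As a rough check using the timings above (linear extrapolation, so indicative only): for the small-output filtering of runs 14-17, 52476 events took 917 - 1748 seconds, i.e. about 17 - 33 ms/event, so 200,000 events/day would need roughly 3,500 - 7,000 seconds (one to two hours) on a single machine of this class.  For all-branches output at ~1.2 MB/event, run 13 gives 192 s / 50 events ~ 4 seconds/event, at which rate 200,000 events/day would need about 800,000 seconds, far beyond a single CPU and the origin of the concern above.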