THE EVENTS USED
We analyzed the MDC2 reconstructions of 52476 events of simulated raw data, produced with our GEANT3-based apparatus simulation and secondary-interaction event production, driven mostly by FRITIOF but with some tests of other generators. About 18.5K events were central collision events; the remainder (34K) were inclusive. In about half the events the collision vertices were spread along the beam direction to simulate an interaction diamond, while in the rest they were fixed at the nominal interaction point. The "reconstructed" events included vertex reconstruction as we expect to do it with real data, an advanced version of the full-azimuth, +-4.5-rapidity-unit multiplicity reconstruction (R. Verdier), and our latest version of the spectrometer track-finding software, which succeeded in finding several tracks per event in the limited phase-space acceptance set up for these early tests. Output to the mini-DSTs included the vertex object, some plots (including eta vs. phi multiplicity distributions), some of the multiplicity data objects (in some runs), and various "keyword" leaves such as the total multiplicity, vertex location, and number of tracks found. The details of track finding were not kept in the reconstructions. The reconstructed event sizes varied, about 3.1 MB for the central events and 0.41 MB on average for the inclusive events, for a total of 72 GB of input distributed in files of 200 events (central) or 500 events (inclusive), 129 files in total. In most cases the output mini-DST was about 11.3 KB (central) or 5.4 KB (inclusive) per event, because the large multiplicity analysis objects were omitted from the output. The mini-DST output was compressed by ROOT by a factor of 1.4 to 4 (average 2.7, depending on the branch). Many of the runs were done on the total available data, with an overall average of 1.37 MB/event in and 8.0 KB/event of mini-DST out for a mixture of central and inclusive events. The reconstructed input events were in general
not compressed (for RCF testing purposes).
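The compression is a property of the output TFile: as a minimal illustration (hypothetical file, tree and leaf names, not the PHOBOS code), the compression level is set in the TFile constructor and TTree::Print() reports the compression factor achieved branch by branch.

    // Sketch only: hypothetical names, just to show where the ROOT compression
    // factors quoted above come from.
    #include "TFile.h"
    #include "TTree.h"

    void compressionExample()
    {
       // The last constructor argument is the compression level (0 = off,
       // 1 = default); the MDC2 input events were written uncompressed.
       TFile out("minidst.root", "RECREATE", "example mini-DST", 1);

       TTree *dst = new TTree("dst", "example mini-DST tree");
       Float_t vtxZ = 0;                       // example "keyword" leaves
       Int_t   mult = 0;
       dst->Branch("vtxZ", &vtxZ, "vtxZ/F");
       dst->Branch("mult", &mult, "mult/I");

       for (Int_t i = 0; i < 1000; ++i) {      // dummy fill loop
          vtxZ = 0.01f * i;
          mult = i % 500;
          dst->Fill();
       }

       dst->Write();
       dst->Print();                           // reports the compression factor per branch
       out.Close();
    }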
THE ENVIRONMENT
The tests all used /disk20000, which is mounted on rmds03, for the input data, and NFS-mounted "user" space (i.e. /phobos/u/mcleod) for the much smaller output. Some tests were done on rmds03, a SUN with the input disk directly connected, as a stand-in for rsun00 and /diskA, since our "on" time when those were available was already over. rmds03 is a SUN E450 (with four 250 MHz CPUs); it might be possible to estimate a scaling to the speed of rsun00, a SUN E4000 with 6 CPUs. Load factors were not well known but appeared to be "light" during these tests. More than half the tests were done on rcas0134 (four 200 MHz PPro CPUs), with /disk20000 NFS mounted and no other users.
SOFTWARE
Except for one test with interpreted (CINT) code, all the tests were done with a simple ROOT-based C++ program compiled with the PHAT (PHobos Analysis Tools) libraries linked in. It was run noninteractively; most of the parameters varied during the tests were set via environment variables, and the rest required a very brief recompile and link. The program is Phat/macros/mdc2/mdcdst.cxx in the repository, where you will also find the corresponding Mdcdst.C macro and the SUN and LINUX makefiles. It sets up a TChain of the TTrees in the analysisout.root files, which can be selected as a variety of subsets of our "analyzed data". Options include simple TChain->Draw() plots of one or two "leaves" and/or plots and selection of events in an explicit event loop. The latter generally included output of the mini-DSTs described above, with event selection criteria and with
omission of some branches of the input data.
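The following is a minimal sketch of this kind of job, not the actual mdcdst.cxx; the tree name "T", the file paths, the leaf and branch names, and the vertex cut are illustrative assumptions only.

    // Minimal sketch only; tree name "T", paths, leaf/branch names and the cut
    // are assumptions for illustration, not the real mdcdst.cxx.
    #include "TChain.h"
    #include "TFile.h"
    #include "TMath.h"
    #include "TTree.h"

    void mdcdstSketch()
    {
       // Chain the per-run analysisout.root files into one logical TTree.
       TChain chain("T");
       chain.Add("/disk20000/mdc2/central/analysisout_*.root");
       chain.Add("/disk20000/mdc2/inclusive/analysisout_*.root");

       // Simple option: TChain::Draw() plots of one or two leaves, no event loop.
       chain.Draw("mult:vtxZ");

       // Event-loop option: write a mini-DST, dropping some large input branches
       // and applying an event selection cut.
       chain.SetBranchStatus("*", 1);
       chain.SetBranchStatus("MultObjects*", 0);      // omit the large multiplicity objects

       Float_t vtxZ = 0;
       chain.SetBranchAddress("vtxZ", &vtxZ);

       TFile out("minidst.root", "RECREATE");
       TTree *dst = chain.CloneTree(0);               // same structure, zero entries

       Long64_t nIn = chain.GetEntries();
       for (Long64_t i = 0; i < nIn; ++i) {
          chain.GetEntry(i);
          if (TMath::Abs(vtxZ) < 10.0) dst->Fill();   // keep events near the nominal vertex (illustrative cut)
       }
       dst->Write();
       out.Close();
    }

Only the branches left enabled are copied into the output tree, which is how branches of the input data are omitted from the mini-DST.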
RAW RESULTS
Conditions and comments                     |  # | Machine  | Total (s) | CPU (s) | Events in | Events out | MB out
--------------------------------------------+----+----------+-----------+---------+-----------+------------+---------
Central events, vary cuts, small mDST       |  1 | rmds03   |        44 |    12.4 |      5000 |         82 |   0.94
(to separate output & input i/o time)       |  2 | "        |        34 |    13.8 |      5000 |        243 |   2.75
Small branches, half the events output      |  3 | "        |        83 |    61.3 |      5000 |       2589 |   29.2
Same, with the smaller inclusive events     |  4 | "        |       117 |    41.6 |     20000 |        291 |   1.56
"                                           |  5 | "        |        98 |    43.2 |     20000 |        607 |   3.27
"                                           |  6 | "        |       223 |     140 |     20000 |       7385 |   40.2
Central, all branches output to mDST        |  7 | "        |       126 |     104 |        50 |         33 |   38.6
Do similar to the above on rcas0134         |  8 | rcas0134 |       101 |    20.2 |      5000 |         82 |   0.94
"                                           |  9 | "        |      17.8 |    17.4 |      5000 |        243 |   2.75
Suspect... seg viol after output write      | 10 | "        |      71.6 |    71.0 |      5000 |       2589 |   29.2
" Inclusive events now                      | 11 | "        |       328 |      74 |     20000 |        291 |   1.56
"                                           | 12 | "        |       311 |      79 |     20000 |        607 |   3.27
Central events, all branches out            | 13 | "        |       192 |     126 |        50 |         33 |   38.6
All 52K events, central & inclusive         | 14 | rcas0134 |      1748 |     997 |     52476 |      43371 |  342.5
Some rcas <-> rmds03 comparisons            | 15 | rmds03   |      1317 |     843 |     52476 |      43371 |  342.5
"                                           | 16 | rcas0134 |      1342 |     609 |     52476 |      22223 |  177.0
"                                           | 17 | rmds03   |       917 |     492 |     52476 |      22223 |  177.0
Run same job all 4 cpus. Each job:          | 18 | rcas0134 |      1743 |     639 |     52476 |      22223 |  177.0
Same job on one cpu, no other load          | 19 | rcas0134 |      1230 |     607 |     52476 |      22223 |  177.0
Run same job on 4 of the cpus               | 20 | rmds03   |       848 |     431 |     52476 |      22223 |  177.0
Same job, one cpu, unknown other loads      | 21 | rmds03   |       793 |     454 |     52476 |      22223 |  177.0
No output DST; event loop, some plots       | 22 | rcas0134 |       681 |     190 |     52476 |       ---- |   ----
Same on rmds03                              | 23 | rmds03   |       353 |      98 |     52476 |       ---- |   ----
Two DSTs written from one input             | 24 | rcas0134 |      1711 |    1131 |     52476 |   22K, 30K | 176, 207
CINT macro, not compiled                    | 25 | rcas0134 |      1189 |     611 |     52476 |      22223 |  177.0
All buffers out again, at 30 events in      | 26 | rcas0134 |       117 |      81 |        30 |          ? |      ?
TChain->Draw 3 histograms & one             | 27 | rlnx03   |      25.8 |    23.5 |     52476 |       ---- |   ----
  scatterplot of "leaves" = z pos'n,        | 28 | rcas0134 |      48.2 |    42.6 |     52476 |       ---- |   ----
  multiplicity, tracks found, mtru:mrecon   | 29 | rsun00   |      26.0 |    21.1 |     52476 |       ---- |   ----
  No event loop, try different machines     | 30 | rmds03   |      70.7 |    26.7 |     52476 |       ---- |   ----
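The total and CPU times in the table are the kind of numbers a ROOT TStopwatch reports; we do not know exactly how these particular timings were taken, so the following is only an illustrative sketch.

    // Illustrative only: measuring wall-clock ("total") and CPU time
    // around an analysis job with ROOT's TStopwatch.
    #include "TStopwatch.h"
    #include <cstdio>

    void timedJob()
    {
       TStopwatch timer;
       timer.Start();

       // ... set up the TChain and run the event loop here ...

       timer.Stop();
       std::printf("total time %.1f s, cpu time %.1f s\n",
                   timer.RealTime(), timer.CpuTime());
    }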
The longer runs give a direct comparison to the scale we expect, and may average out some of the load variation which appears to plague tests 1-13. Runs 14-17 give the overall conclusion: 15-30 minutes to process 52K events in 72 GB down to small output DSTs (0.6% of the data per event on average), varying with the machine used and the fraction of events output. Looking at runs 18-21, there appears to be a real advantage to running in parallel on multiple CPUs of the same machine despite the shared i/o resources; the overall increase in time for four simultaneous jobs is especially small on rmds03 (848 vs. 793 seconds). The processes are mutually ignorant, so the results should be the same with differing input streams. It is puzzling that in this case the CPU time is longer for a single process! It is hard to draw conclusions about i/o time because it is hard to separate, especially with the fluctuations seen. Runs 22 and 23 involve input only, and show qualitatively that reading alone takes 2 or 3 times less time than the corresponding runs with output, even though the output data are small. This is somewhat puzzling, but recall that the input data are uncompressed while the output is compressed, and compression takes time.
Test 24 shows that writing two output DSTs from one input is very easy in ROOT and appears to cost mostly CPU time; further tests with multiple output streams might be interesting. Test 25 was puzzling in that it showed very little loss of time using CINT (compare to test 16 or 19), whereas in MDC1 we saw a factor of 2 gain from compilation in a somewhat similar job. This point should be revisited. Test 26 was another attempt to extrapolate the time to output all branches, in the face of a memory leak problem; at 117 seconds total for 30 events in, we again extrapolate to about 19,000 seconds for 5000 events. Finally, in contrast to this last dismaying result, tests 27-30 use no event loop, instead doing four TChain->Draw() operations on the entire data set; four different machines were tried. The time of about one minute reflects the reading of only the very small structure branches used here and the advantage of the ROOT Tree structure, which puts this data together in a small, well defined location in each of the 129 Tree files in the Chain; nevertheless it is gratifying that the data are located in this many files spanning 72 GB so quickly.
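A minimal sketch of the test 27-30 style job is shown below; the leaf names are assumptions standing in for the z position, multiplicity, tracks-found, and mtru:mrecon quantities named in the table.

    // Sketch of the no-event-loop job of tests 27-30; leaf names are assumed.
    #include "TChain.h"

    void drawKeywordLeaves(TChain &chain)
    {
       chain.Draw("vtxZ");          // vertex z position
       chain.Draw("mult");          // total multiplicity
       chain.Draw("ntracks");       // spectrometer tracks found
       chain.Draw("mtru:mrecon");   // true vs. reconstructed multiplicity scatterplot
    }

Because a TTree stores each branch's data in its own baskets, only the few small branches named in the Draw() expressions need to be read from each of the 129 files.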
GENERALIZATIONS
PHOBOS is not likely to do classic "data mining" in the sense of looking for rather small data sets in an enormous volume of partially processed data, as in typical HEP collider experiments; it appears that much of our analysis past the RCS-farm reconstruction will involve splitting off a sometimes small but seldom "tiny" fraction of the events, and removing unneeded branches for easier manipulation of the resulting mini-DSTs. A dominating factor is the 1 1/2 hours taken to read a 50 GB tape on one of the few Redwood tape drives we will have available. (We quote uncompressed capacity because ROOT pre-compresses by default.) A data-driven rather than user-driven tape staging scheme, such as PHENIX's "Carousel" (D. Morrison, J. Lauret) or Fermilab's "freight train", may be well suited to our needs. To this author the most significant point of these tests is that the times taken are a moderate fraction (typically around 20 minutes) of the tape reading time; note that the full data set used is somewhat larger than the contents of one tape.
One should keep in mind, though, that this is while writing out and/or plotting less than 1% of the data for a large fraction of the events, or a lot of the data for just a few events. Another point of comparison is network speeds to remote sites. If we take 5 MB/sec as an ultimate goal, it would take 10K seconds to transfer a 50 GB tape, allowing 2 or 3 tapes "overnight". 3.5 MB/sec was recently obtained between MIT and BNL (ABACUS <-> RCF), but sustained operation was not attempted. At the current 1/3 to 1/2 MB/sec from BNL to UIC (before network improvements) it would take 2-3 days to transfer! In all cases, the time taken in these tests (even if the per-event processing is much expanded) is much less than the transfer time. If a user becomes impatient with 20-minute delays, further micro-DSTs allowing use of TChain->Draw() etc. can be generated. If PROOF is updated, released, and documented, it may offer interesting possibilities for using parallel processing to speed up "mining", but the parallel processors
will generally be available only at BNL.
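As a back-of-the-envelope check of these transfer times (plain arithmetic, not PHOBOS code), the fragment below reproduces the scale of the numbers quoted above for one 50 GB tape and for the full 72 GB sample.

    // Back-of-the-envelope transfer-time arithmetic for the rates quoted above.
    #include <cstdio>

    int main()
    {
       const double sizesMB[]  = {50000.0, 72000.0};        // one 50 GB tape, full 72 GB sample
       const double ratesMBs[] = {5.0, 3.5, 0.5, 1.0 / 3};  // goal, MIT<->BNL, BNL->UIC range
       for (double size : sizesMB)
          for (double rate : ratesMBs) {
             const double seconds = size / rate;
             std::printf("%5.0f GB at %4.2f MB/s: %8.0f s (%.2f days)\n",
                         size / 1000.0, rate, seconds, seconds / 86400.0);
          }
       return 0;
    }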
WHAT NEXT?
As our software develops it should become clearer what the scale and data flow through typical Physics analyses will be. Some recent estimates in Steve Manly's writeup "Comments on Phobos Simulation and Computing Needs for Year One" are useful for discussion of DSTs, such as his projection of 200,000 triggered events per day late in year 1 running. If reconstruction can "keep up", as is assumed there (it depends, of course, on successful and rapid track reconstruction, yet to be proven), then we can conclude that the reduction of this data to a DST stream of about 1% of the data volume can easily be done with one computer; scaling from the 15-30 minutes per 52K events of runs 14-17, 200K events correspond to roughly one to two hours of processing per day. However, the time appears to grow rapidly with the size of the output events, to the extent that the 75 Si95 (SPECint95) of RCF "data mining" facilities promised us may indeed be fully occupied. More studies are needed with larger event output sizes, such as might be needed for multiplicity analysis or detailed track fitting. The roughly 4 seconds/event found here for transcribing events of the order of 1 MB seems large and should be checked again when the software problem found above is fixed. Studies of "mining" should also be made with intermediate DST event sizes, 100 KB for example. It would be useful to have even rough models of the event size and the fraction of events needed for DSTs, both at the initial extraction from data-reduction output and close to the Physics analysis stage, for a variety of Physics topics.