last modified: Tuesday, December 10, 2002 16:06
DØ SAM Goals - Status Board  
 

changes at a glance: Added Completed Projects page; cleared completed projects out of testing.

 
guide comments:  
COMPLETED PROJECTS    

Priority Projects

status  project                                                             request date  SAM team priority  SAM team ETA
1       dCache integration                                                  10.04.02
2       CAB                                                                 10.04.02
3       pickevents                                                          10.04.02                         ~12.15.02
4       rewrite import_classes...to make it easier for users to enter data  09.24.02
5       batch review and redesign                                           09.24.02

Projects in Testing

project          prior status  date
NFS shared disk

Bugs

name  date

unranked:
  documentation standdown
  vsn, config, release management
  monitoring framework
  HPSS and other MSS adaptors - generalization for off-site MSS setups
  new test cluster for distributed cluster testing
  rapid response triage
  clear stuck jobs

work todo
  review metadata and consider new items (longer term)


Comments and/or additions: Chip Brock

project criteria for completion tester responsible requester date status

pick_events


09.24.02 The first problem is the fraction of requested events that pick_events successfully returns. We submit pick_events jobs that attempt to access 99 or 100 events at a time. The first time we submit a list, the event return rate varies from 2% to 100%, with a peak at about 85%. Subsequent submissions of the list usually return an increasing fraction of the events. For instance, in one attempt to access 99 events, I got 43 the first time, 91 on the second attempt, and 99 on the third attempt. Those results aren't atypical. when the raw-data file was already on SAM cache. I am concerned that my measurement for the success rate of pick_events is typical of other kinds of SAM access.
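
The return rates in the report above can be worked out directly from the quoted counts. This is a minimal sketch that uses only the numbers given in the report (43, 91, and 99 events returned out of 99 requested); the function name `return_rates` is illustrative and is not part of SAM or pick_events.

```python
def return_rates(requested, returned_per_attempt):
    """Return the fraction of requested events delivered on each attempt."""
    return [returned / requested for returned in returned_per_attempt]


# Counts quoted in the report: 99 events requested, three submissions.
rates = return_rates(99, [43, 91, 99])

for attempt, rate in enumerate(rates, start=1):
    # Print each attempt's return rate as a whole-number percentage.
    print(f"attempt {attempt}: {rate:.0%}")
```

The first submission in this example falls well below the roughly 85% peak mentioned in the report, and only the third submission returns every event.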

non-raw data, from any tier.

full design of pickevents involving pooling of common requests. caching all for some period of time - recently used files.
10.29.02 There are three issues related to pick events: 1) users have reported unreliable results, which need to be traced and fixed; 2) picking all tiers of data in addition to raw needs to be enabled; and 3) the general pick events design and implementation, to include request pooling and event caching and archiving, needs to be completed. Guidance from d0 on the urgency of 3) is needed.

  Kalk & Diehl    
nfs 10.29.02 Work is underway to test this version and it is part of an upcoming release. Andrew has added documentation to the station_configuration document available at:
http://d0db-dev.fnal.gov/sam/doc/install/stationConfig.shtml
We continue to test this installation at Karlsruhe.
       

batchdesign


10.29.02 Sinisa, Lauri, and Stefan have met to discuss the general details of this project, and Sinisa has spent 3 or 4 days working on the design and writing some of the classes. A prototype should be available in about a week for use with LSF. Work will be needed to add information to the IDLs and to create adapters for PBS and FBS. This work supports all four schemes for submitting projects and consumers to interactive or batch, and will enable the site administrators to easily configure the system to meet their needs. Completing this work will require an additional month, including contributions from Sinisa and Lauri. Additional work is needed to further abstract the configuration so that it is menu driven like the current sam_config. Issues raised recently regarding project master timeouts, running parallel consumers, etc. should be included in the design if they are not already addressed.
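
The per-batch-system adapters mentioned above could look roughly like the following. This is only a sketch of the adapter idea, not the SAM design: every class and function name here is hypothetical, the LSF and PBS adapters assume the standard `bsub -q` and `qsub -q` submit commands, and the FBS entry is a labeled placeholder because the page does not give the FBS invocation.

```python
from abc import ABC, abstractmethod


class BatchAdapter(ABC):
    """Hypothetical interface: turn a job request into a submit command."""

    @abstractmethod
    def submit_command(self, script, queue):
        """Return the argv list that would submit `script` to `queue`."""


class LSFAdapter(BatchAdapter):
    def submit_command(self, script, queue):
        # bsub is the LSF submit command; -q selects the queue.
        return ["bsub", "-q", queue, script]


class PBSAdapter(BatchAdapter):
    def submit_command(self, script, queue):
        # qsub is the PBS submit command; -q selects the destination queue.
        return ["qsub", "-q", queue, script]


class FBSAdapter(BatchAdapter):
    def submit_command(self, script, queue):
        # Placeholder only: the real FBS invocation is not given in this page.
        return ["fbs-submit-placeholder", queue, script]


# A site administrator would select one of these by name in the configuration.
ADAPTERS = {"lsf": LSFAdapter, "pbs": PBSAdapter, "fbs": FBSAdapter}


def build_submit_command(batch_system, script, queue):
    """Look up the adapter for the configured batch system and build argv."""
    try:
        adapter_class = ADAPTERS[batch_system.lower()]
    except KeyError:
        raise ValueError(f"no adapter for batch system {batch_system!r}") from None
    return adapter_class().submit_command(script, queue)


print(build_submit_command("LSF", "run_consumer.sh", "medium"))
```

Keeping the batch-specific details behind one small interface means that supporting a new site would only require adding one adapter class and one entry in the lookup table.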

minusbug


A bug involving SAM queries that use MINUS. I used two differently constructed SAM queries that logically should have produced the same set of data files. However, I got two different results. This indicates a logic flaw in the SAM query processing. I sent you email with details on Oct 11, "a strange result from a SAM definition". Use of MINUS will be standard for constructing group data samples. The typical usage is "data-of-a-kind MINUS files-analyzed" to produce a list of new or unanalyzed files.   diehl    
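
The expected behaviour of MINUS is a plain set difference, which makes the symptom easy to state: two logically equivalent constructions must yield identical file sets. The sketch below illustrates that check in Python; the file names are invented for illustration, and the helper functions are hypothetical and are not part of the SAM query language or API.

```python
# Invented file lists; real lists would come from SAM dataset definitions.
data_of_a_kind = {"f001.raw", "f002.raw", "f003.raw", "f004.raw"}
files_analyzed = {"f002.raw", "f004.raw"}


def minus(left, right):
    """Set difference: files in `left` that are not in `right`."""
    return set(left) - set(right)


def compare_results(first, second):
    """Report files that appear in only one of two query results."""
    first, second = set(first), set(second)
    return {
        "only_in_first": sorted(first - second),
        "only_in_second": sorted(second - first),
    }


# "data-of-a-kind MINUS files-analyzed" gives the new or unanalyzed files.
new_files = minus(data_of_a_kind, files_analyzed)

# A differently constructed but logically equivalent selection.
alternative = {name for name in data_of_a_kind if name not in files_analyzed}

# Both lists in the comparison are empty when the two results agree.
print(sorted(new_files))
print(compare_results(new_files, alternative))
```

A non-empty `only_in_first` or `only_in_second` list for two equivalent queries would be the kind of discrepancy described in the bug report.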

import classes


The WZ group wants to form several datasets containing different kinds of events. For instance, events with a W to muon, Z to muons, W to electron, Z to electrons, and another, say, overlap. We want to collect the events, perhaps thumbnails or DSTs and return them to SAM for group storage and access by type.

I think the metadata may contain the provision for this but that, perhaps, SAM can't handle the information yet. Here is an example metadata file I used to return some picked events; it is probably not the best example possible, but it is the one I have in hand. The file is called mrg_pick_dimuon_081302_001.raw.meta.py

from import_classes import *
TheFile = ProcessedFile( name = 'mrg_pick_dimuon_081302_001.raw',
                         sizeK = 226622,
                         events = Events(13961065, 15926, 2311),
                         stream = 'pick-event-p10dimcand',
                         tier = 'raw-bygroup',
                         start_time = '08/13/2002 13:03:05',
                         end_time = '08/13/2002 13:39:55',
                         pid = 831675,
                         parents = ['pick_d51.dat', ... ,'pick_d38.dat'])

We see the data tier "raw-bygroup" and the stream "pick-event-p10dimcand". I believe the stream field could be used to identify the type of event, e.g., W to muons, so that one could access only events of that type.
10.29.02 import classes: A general evaluation of the needs of the experiment has been underway. This includes recently presented use cases and also the use of the framework of what is now mcrunjob for workflow control and metadata generation for applications beyond MC. It also requires meeting with d0 users and developers outside of sam before enough is understood to proceed. Estimate at least 2 weeks to have requirements written down and design started, and probably a month of coding and testing after that to complete the project.
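
The request above, using the stream field to select files by event type, can be illustrated with a small sketch. It assumes nothing about how SAM stores or queries metadata: `FileRecord` and `select_by_stream` are hypothetical stand-ins, the first catalog entry reuses the values from the example metadata file, and the second entry and its stream name are invented for contrast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical stand-in for the metadata stored about one file."""
    name: str
    tier: str
    stream: str


def select_by_stream(records, stream, tier=None):
    """Return names of files in the given stream, optionally within one tier."""
    return [
        record.name
        for record in records
        if record.stream == stream and (tier is None or record.tier == tier)
    ]


catalog = [
    # Values taken from the example metadata file above.
    FileRecord("mrg_pick_dimuon_081302_001.raw", "raw-bygroup", "pick-event-p10dimcand"),
    # Invented entry with a made-up stream name, for contrast only.
    FileRecord("example_other_001.raw", "raw-bygroup", "pick-event-example-other"),
]

dimuon_files = select_by_stream(catalog, "pick-event-p10dimcand", tier="raw-bygroup")
print(dimuon_files)
```

With one stream name per event type (W to muon, Z to muons, and so on), a group could retrieve only the files of the type it wants from the shared tier.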

  diehl    

batch


These are all usability issues related to batch, even though they don't correspond to the nominal idea of batch flexibility.

1) Is there any time-out for an open project? For example, if I start a project when I submit my jobs, I don't want it to time out after a few hours so that there is no project there when the jobs actually start to execute!

2) Do I need to worry about the consumer_id? For the initial tests I didn't set it and everything seemed to work how I wanted it to. Will this change when I run things in parallel? Do I need to set it and if so do I have to set them all to the same value or different values?

3) Does the '-num_files' command line argument work with SAM jobs? In other words can I ensure that each job processes N files from the project using the '-num_files N' command line option?

  Moore    
           

ClueD0


 

Dugan, Lukas, Tom?
Oneil, Moore
05.02
Chris has finished, left good documentation. Ready for testing.

file status


 

NA
Schellman
07.26.02

CAB


SAM working on CAB. Heidi has supplied some specific tasks:
1) Evaluate current sam station with Sinisa Veseli
2) Minor fixes for local environment
3) Optimize mutual use of d0mino/d0cs cache
08/13/02 Moved to testing.

09.15.02 Need D0 interfaces for D0tools. Heidi is working on this.
10.01.02 Heidi continues to test. SAM debugging on clued0 is useful and should make CAB work.

Heidi, Mike D
Schellman

"Friday meltdown"


Fix the connection between the batch and project queues (aka Friday meltdown). There may be other D0mino performance issues buried in "station (and other) bug fixes and minor features" which we would like to evaluate.

10.01.02 Solved with a simplification in the way projects are started on d0mino. In this mode, when a user starts a project, the project is started and the user's executables are run in the sam batch job. We are confident this will fix the problem and make d0mino operations much better. Work is underway to further redesign the submission scripts to provide more options, and to move the batch adapters out of the sam station code into their own python package that is easier to maintain and adapt to new environments.
10.01.02 Moved to testing with the caveat that D0mino operations should not be affected and with the understanding that the ultimate solution will receive high priority.

NA
Lueking

monitoring


waiting criteria

10.01.02
1. Reimplemented the encp daily stats:
http://d0db.fnal.gov/sam_local/PlotsAndStats/EncpStats/Daily/2002/09_26_02/all_stations__09_26_02__prd.txt
2. Table summary of errors:
http://d0db.fnal.gov/sam_local/TransferErrors/transfer_errors.html
3. Data transfer stats:
http://d0db.fnal.gov/sam_local/monitorStats/index.html
Need to complete daily archive of log files.

 
Lueking

FNORB


10.01.02 Have managed to enclose this in exception handling, so the problem is contained but not really solved. Diana compiles a weekly status summary table (http://d0db.fnal.gov/d0dbsrv/d0dbstatus/d0_db_weekly_Sep2002.html). This indicates that server crashes are decreasing.
 

dcache


10.04.02 This is necessary in order to rate-adapt the 9940b tape drives to disk and also important for issues involving remote sites. Done when in production.
10.29.02 This project has two parts: 1) testing the server that has been provided for d0, and 2) turning dcache on for general d0 use on central analysis and other stations. We are still waiting for the ISD group to update the existing server to the latest software and turn it over to us. We estimate one week to set up the test on d0test and see how the system works under load. Assuming everything looks good, we will discuss the next steps and begin to use this server for central-analysis and/or farm. We will then probably phase it in generally for remote station access to enstore data.

  Lee    
           
           
           
           
           