5. Managing workflow
"Workflow" describes the sequence of operations needed to accomplish an objective, together with the resources and information needed at each step of such a sequence. For years organizations concerned with logistics (such as businesses) have used sophisticated software tools to manage workflow, but in science (and astrophysics in particular) scientific workflow management has been very ad-hoc. Surprisingly, this has included computational science, which despite its heavy investment in computers as calculating and data processing tools has placed too little emphasis on making scientific work easier and less error-prone.
This situation is changing with the increased emphasis on cyberinfrastructure development by NSF and other funding agencies. Software products and standards ranging from general middleware such as the Globus Toolkit to domain-specific application portals such as LEAD have begun to appear. Teuthis represents an effort to leverage some of the developing technologies in this area to provide workflow management specialized to the needs of computational astrophysicists and others who work with noninteractive, massively parallel applications.
Teuthis organizes scientific work using entities called projects, experiments, runs, and jobs. In this section we describe each of these and show how to use the properties dialogs associated with them. The basic sequence of tasks is shown below.
Projects are collections of related scientific questions, such as "Galaxy cluster scaling relations" or "Cosmic shear systematics due to projection effects." They represent the highest level of the Teuthis workflow paradigm. A new project is created either by selecting "New project..." from the File menu in the main window or by cloning an existing project using the "Clone" entry of its popup menu (right click in the main window) or the "Clone project" button in its properties dialog. (Note that cloning a project clones all of its child entities as well. Cloned projects will have "Copy" appended to their names.) When a new project is created, the dialog shown in Figure 1 appears.
Figure 1: Project properties dialog.
The project name should be, but does not have to
be, unique. It is used to identify the project in the main
window's project view. The description field appears in the
"Description" column of the main window's project view when this column
is enabled.
You can keep notes on the project in one of two
ways: by editing a text file on your local machine or by
launching a web browser and entering notes into a wiki. Enter the
pathname for the text file or the URL of the wiki in the field labeled
"Notes location," then select the appropriate radio button. If
you have selected "Edit text file," when you click the "Take notes"
button the application registered on your system to handle MIME type text/plain
will be launched. For this choice it may make sense to keep your
project notes in the workspace directory specified in the user
preferences dialog (although this is not required). If you have
selected "Wiki URL," clicking on the "Take notes" button will launch a
web browser and load the specified URL. The web browser is
launched using the Python webbrowser module, so the browser that
opens is system-dependent.
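As a minimal sketch of the "Wiki URL" case, the standard-library call involved is webbrowser.open. The function name open_wiki_notes and its opener argument are illustrative assumptions, not part of Teuthis; the opener is injectable only so that the logic can be tried without actually starting a browser.

```python
import webbrowser


def open_wiki_notes(url, opener=webbrowser.open):
    """Open a wiki notes page in whatever browser the system provides.

    `opener` defaults to webbrowser.open; it is a parameter only so the
    function can be exercised without launching a real browser.
    """
    if not url.startswith(("http://", "https://")):
        raise ValueError("expected an http(s) wiki URL, got %r" % url)
    # webbrowser.open returns True if a browser could be launched.
    return bool(opener(url))
```

Calling open_wiki_notes("https://example.org/wiki/MyProject") would hand the (hypothetical) URL to the default browser on your system.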
To create a new experiment, click the "New
experiment..." button, and an experiment dialog will appear. Note
that even if you cancel these dialogs, the project and its new
experiment will still exist within the project view in the main window;
to remove them you must manually delete the project using the "Delete"
option in its popup menu. This behavior is designed to avoid
accidentally deleting the project if you have gone deeply into creating
experiments, runs, and jobs within a new project and then click
"Cancel" when you come back to the project dialog.
Experiments vary one or more parameters using a
single application code in order to answer a specific scientific
question. An example of such a question might be, "How does the
level of galaxy feedback affect the scatter in the cluster
mass-temperature relation?" Experiments are created by clicking
on the "New experiment..." button of the parent project's dialog or
by selecting the corresponding entry from the project's popup menu in the
project view of the main window (right click on the project's entry
in the project view to bring up its popup menu). Experiments may
also be cloned using the "Clone" item in their popup menus or by
clicking the "Clone experiment" button in their properties
dialogs. (Cloning an experiment also clones its child
entities. Cloned experiments will have "Copy" appended to their
names.) When you create a new
experiment, you will see the dialog shown in Figure 2.
Figure 2: Experiment properties dialog.
Like projects, experiments have names and
description fields. The description field appears in the
"Description" column of the main window's project view when this column
is enabled.
Once you have provided a name and description for
your experiment, choose an application from among those you have
previously configured by clicking on the "Application" list box.
Also choose an execution machine for this experiment by selecting one
of the configured machines from the "Exec machine" list box. When
you make these selections, several fields will be filled in for
you. The remote build directory field will be initialized with a
path constructed from the build root directory for the execution
machine and the munged name of the application. The "Config
command" field will contain the application's configuration
command. The "Build command" field will contain the build command
for the application, and if you have checked "Move to remote exec dir"
in the application's profile, you will also see a command to move the
executable to the executable directory for the machine. The
absolute path name of the executable will be entered into the
"Executable to use" field; Teuthis will expect to find this in the
build directory if you have not checked "Move to remote exec
dir." Each of these entries can be modified as necessary,
including the "Exec arguments" field, which specifies additional
arguments to be passed to the application.
If you do not wish to build the remote
executable, simply check whether the "Executable to use" field is
correct, then move on to choosing
execution parameters below.
If you wish to build the application on the
remote machine from local source code, click on the "Upload source"
button. You will be asked to authenticate yourself to the
execution machine. What happens next (assuming your passphrase
was accepted!) depends on whether or not you have chosen to use the
tendril.py helper application in the user preferences dialog. Tendril
handles certain remote operations that would be difficult to code in a
cross-platform fashion without the use of Python. When you ask to use
Tendril, Teuthis first copies tendril.py from its location on your
machine to the remote machine's Teuthis directory. (A file permissions
check is performed first; if anybody except the system administrator
and the file's owner can write to tendril.py, an error message is
displayed.) Two
of the things Tendril can do are to obtain MD5 checksums for files in a
directory tree and to create a set of directories specified in a
file. These capabilities permit Teuthis to perform a one-way
remote synchronization of your local source tree with the remote
machine. Any files found on your machine but not on the remote
machine, or which differ between the two machines, will be copied from
your machine to the remote machine. (CVS directories are
excluded.) Hence you can treat your local source as
"authoritative" and only transfer necessary files when you click
"Upload source." Remote directory creation is also much faster
when you use Tendril.
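The one-way synchronization described above can be sketched in Python as follows. This is not the code in tendril.py; the function names md5_tree and files_to_upload are assumptions made for illustration. The sketch computes an MD5 checksum for every file in a tree (skipping CVS directories) and then lists the files that are missing remotely or whose checksums differ.

```python
import hashlib
import os


def md5_tree(root, exclude_dirs=("CVS",)):
    """Map each file path (relative to root, '/'-separated) to its MD5 digest."""
    sums = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d not in exclude_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    digest.update(chunk)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            sums[rel] = digest.hexdigest()
    return sums


def files_to_upload(local_sums, remote_sums):
    """Files absent on the remote side or differing there; local is authoritative."""
    return sorted(p for p, s in local_sums.items() if remote_sums.get(p) != s)
```

Files that exist only on the remote machine are never listed, which matches the one-way nature of the synchronization: nothing is copied back or deleted.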
If you have chosen not to use Tendril (for
example, if the remote machine does not have Python installed), the
entire source code tree will be uploaded. In this case it may be
more efficient to handle the transfer yourself using one scp
command.
After you have uploaded or synchronized the
source code, you can browse for a local configuration file and then
upload it by clicking the "Upload file" button. (Some
applications may not require this step.) On the remote machine
the file will be given the generic name you specified in the
application profile. For example, you might have configuration
files named Modules.galaxy and Modules.sedov on your local machine
containing configuration information for two different problem setups.
When you choose one and click "Upload file," it will be copied to the
generic file Modules in the remote build directory.
Note that some applications may not require a
configuration step but will instead have you edit a makefile and then
build. For these types of applications you should leave the
application's configuration command blank, specify "Makefile" as the
configuration file, and upload your edited makefile in place of a
configuration file. You can then skip forward to building the
code.
After you have uploaded the configuration file
(if any), you can remotely execute the application's configuration
command by clicking the "Do it" button next to the "Config command"
field. A progress bar dialog will be shown while the command
executes. When the command finishes, the dialog will go away, and
the "Output" button next to the "Do it" button will become
active. This button allows you to bring up a window showing the
standard output of the most recent configuration command run. If
the exit code was 0 (success), the "Output" button will have a green
border; otherwise it will have a red border. You can run the
configuration command as many times as you wish. An unsuccessful
configuration indicated by a red border will not prevent you from going
on to build the code, but it will indicate to you that you should check
the configuration output for possible problems.
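The relationship between the exit code and the border color can be illustrated with a small Python sketch. It runs a command locally with subprocess as a stand-in for the remote execution Teuthis performs; the function name run_step is an assumption, not a Teuthis API.

```python
import subprocess
import sys


def run_step(argv):
    """Run a command; return (border_color, standard_output).

    Exit code 0 maps to "green", anything else to "red".
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    border = "green" if result.returncode == 0 else "red"
    return border, result.stdout


if __name__ == "__main__":
    # Stand-in for a configuration command: a tiny Python one-liner.
    color, output = run_step([sys.executable, "-c", "print('configured')"])
    print(color, output.strip())
```

As in the dialog, a "red" result does not stop anything by itself; it only signals that the captured output deserves a look.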
The procedure for building the code is the same
as for configuring it: click the "Do it" button next to the
"Build command" field to execute the specified command on the remote
machine. While the build is in progress, Teuthis will block with a
progress dialog. When it is done, the output will be available by
clicking the "Output" button, and the border color of this button will
indicate the success or failure of the build procedure.
The configuration and building of some
applications can be quite lengthy procedures, and it is anticipated
that the next major release of Teuthis will support the backgrounding
of these two processes.
When you choose an execution machine from the
"Exec machine" list box, the queue, account, number of CPUs, CPU
tiling, and memory per node fields are filled using values taken from
the execution machine's profile. These can be adjusted as
needed. If you need to use a queue or account that is not listed,
close the experiment dialog, return to the machine dialog, and add the
appropriate queue or account to the list of those available for this
machine.
Note that the "# of CPUs" field allows you to
enter a range of values. Range specifications of the form "a-b",
"a,b,c", or combinations thereof are permitted. (The form "a-b"
means "step from a to b in increments of 1.") This feature makes
it straightforward to perform parallel scaling studies.
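To make the range syntax concrete, here is a minimal Python sketch of how a specification such as "1-4,8,16" might be expanded. The function name parse_cpu_counts is hypothetical, and the handling of stray spaces and reversed ranges is an assumption; the actual Teuthis parser may behave differently in those cases.

```python
def parse_cpu_counts(spec):
    """Expand a spec like "1-4,8,16" into [1, 2, 3, 4, 8, 16]."""
    values = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # "a-b" steps from a to b in increments of 1, inclusive.
            low_text, high_text = part.split("-", 1)
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError("range %r runs backwards" % part)
            values.extend(range(low, high + 1))
        else:
            values.append(int(part))
    return values
```

For a scaling study you would more likely enter an explicit list such as "16,32,64,128", since the "a-b" form always steps by 1.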
If your experiment will require the staging of
input files, choose a data source machine in the "Src machine" field
and enter the full paths (one per line) of the files to transfer from
this machine in the "Src files" field. (Wildcards are not
permitted in this field.) When jobs are submitted as part of this
experiment, the indicated files will first be transferred from the data
source machine to the run directory on the execution machine.
Note that unless both machines use the same access method and that
method permits third-party transfers (like gsissh+uberftp),
you will only be able to stage files from your local machine to a
remote machine, from a remote machine to your local machine, or from a
machine to itself.
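The transfer rule in the preceding paragraph can be written out as a short Python predicate. This is only a reading of the rule as stated here, not Teuthis code; the function name can_transfer, the machine names, and the "ssh+scp" method label are illustrative assumptions. Only gsissh+uberftp is named in this manual as permitting third-party transfers.

```python
# Access methods assumed to allow third-party transfers; only this one
# is named in the text.
THIRD_PARTY_METHODS = {"gsissh+uberftp"}


def can_transfer(src, dst, local, methods):
    """Return True if files can be moved from machine src to machine dst.

    `local` is the name of your own machine; `methods` maps each machine
    name to its access method.
    """
    if src == dst:
        return True  # a machine to itself
    if src == local or dst == local:
        return True  # local to remote, or remote to local
    # Remote to a different remote: needs a shared third-party-capable method.
    return methods[src] == methods[dst] and methods[src] in THIRD_PARTY_METHODS


if __name__ == "__main__":
    methods = {"laptop": "local", "clusterA": "gsissh+uberftp",
               "clusterB": "gsissh+uberftp", "clusterC": "ssh+scp"}
    print(can_transfer("clusterA", "clusterB", "laptop", methods))  # True
    print(can_transfer("clusterA", "clusterC", "laptop", methods))  # False
```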
If you wish to archive all of the files in the
run directory after the job is complete, choose a data destination
machine in the "Dest machine" field and enter the absolute path of the
destination directory on the data destination machine. If the
path does not exist at the time the archiving takes place, it will be
created for you. The same remarks regarding third-party transfers
made for staging job data before the job executes apply here as well.
Note that Teuthis will block with a file transfer
progress dialog while staging and archiving take place. If you
cancel the dialog while staging, job submission will also be
aborted. If you cancel while archiving, you can always come back
and initiate archiving again later. You can also change the data
destination machine and path and archive to another location.
Lengthy file transfers will be somewhat inconvenient in this version of
Teuthis since they occur in the foreground; as with configuring and
building applications, the backgrounding of this process is a priority
for the next major release of Teuthis.
Parameter survey generation is one of the most
powerful features offered through the experiment dialog. You can
browse for and edit a parameter file template directly from the dialog,
then specify a set of parameters to vary and, with a single click,
generate a set of runs that will cover the entire range of parameters
you have requested.
Applications each have their own format for
parameter files read at runtime. All Teuthis assumes about these
files is that they are text-based and thus amenable to pattern
substitution. When you create a parameter file template, set all
of the parameters that are required and will not vary for this
experiment. In the places where you would normally put the value
of a parameter that will vary, instead include the sequence %name%,
where name is the name you would like Teuthis to know the parameter by. You
may then list the parameter in one of the varied parameter fields of
the experiment dialog, together with the range of values it is to take
on.
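The following Python sketch shows the kind of pattern substitution a %name% template implies. The function name fill_template and the sample parameter names (tmax, tag) are hypothetical, and raising an error for an unknown placeholder is a design assumption rather than documented Teuthis behavior.

```python
import re


def fill_template(template, values):
    """Replace every %name% in the template with values[name]."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError("no value supplied for parameter %r" % name)
        return str(values[name])

    # A placeholder is a run of word characters between two percent signs.
    return re.sub(r"%(\w+)%", substitute, template)


if __name__ == "__main__":
    template = 'tmax = %tmax%\nbasenm = "%tag%_"\n'
    print(fill_template(template, {"tmax": 0.5, "tag": "run1"}))
```

Because the substitution is purely textual, the same template mechanism works for any application whose parameter file is plain text.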
Range specifications are of the form "a,b,c,..."
for non-integer parameters and "a,b,c,..." or "a-b" (or combinations
thereof) for integer parameters. The "a,b,c,..." form means that
the indicated parameter is to take values a, b, c, ...; the "a-b" form
means that the indicated parameter is to take values from a to b in
increments of 1. Other than this distinction, no type checking is
performed, so strings (including embedded spaces) and real numbers may
also be specified. It is up to you to make sure that the listed
value types are compatible with the application's interpretation of
each parameter.
By checking the lock buttons for two or more
parameters, you can force them to vary together. All of the
parameters with the lock setting must have the same number of values
specified for them, or else an error message will appear when you click
on the "Generate runs" button. For example, if you request
variable A with range "1-3" and variable B with range "run1,run2,run3"
and leave the lock buttons unchecked, you will end up with nine runs,
whereas if you check the lock buttons, you will end up with three
runs. Also checking the lock button for a variable C with range
"9,11,13,15" will result in an error.
To generate the requested runs, click on the
"Generate runs" button.
It is possible to generate runs without a
parameter file template. In this case, varied parameter settings
will be ignored, and you will be prompted to continue before a single
run is generated.
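The effect of the lock buttons on run generation can be sketched in Python: unlocked parameters are combined as a Cartesian product, while locked parameters step through their value lists together. The function name generate_runs is an assumption, and the sketch takes value lists that have already been expanded from their range specifications.

```python
from itertools import product


def generate_runs(params, locked=()):
    """Return one dict of parameter values per run.

    params maps each parameter name to its list of values; names listed
    in `locked` vary together instead of independently.
    """
    locked_names = [n for n in params if n in locked]
    free_names = [n for n in params if n not in locked]
    if locked_names:
        if len({len(params[n]) for n in locked_names}) > 1:
            raise ValueError("locked parameters need equal numbers of values")
        locked_rows = [dict(zip(locked_names, row))
                       for row in zip(*(params[n] for n in locked_names))]
    else:
        locked_rows = [{}]
    free_rows = [dict(zip(free_names, row))
                 for row in product(*(params[n] for n in free_names))]
    return [{**lock, **free} for lock in locked_rows for free in free_rows]


if __name__ == "__main__":
    params = {"A": [1, 2, 3], "B": ["run1", "run2", "run3"]}
    print(len(generate_runs(params)))                     # 9
    print(len(generate_runs(params, locked=("A", "B"))))  # 3
```

Adding a locked parameter C with four values to the example raises the ValueError, mirroring the error message described above.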
A run represents the execution of code for a
given set of parameter values, carried to completion. Examples
might include "Run with 10x fiducial energy input" or "Run with 10^8
solar mass black hole." Runs are created by clicking the
"Generate runs" button of an experiment dialog. Doing this adds
the runs to the project view of the main window but does not bring up a
run dialog. To examine a run's properties, return to the main
window and double-click a run or select "Properties..." from its popup
menu (accessed with the right mouse button). An example run
dialog is shown in Figure 3.
Figure 3: Run properties dialog.
The run name is automatically generated when a
set of runs is created. Run names run from "A" to "Z", then "AA"
to "AZ", "BA" to "BZ", etc. If you generate more than one set of
runs for a given experiment, the names of the new runs will pick up
where the old ones left off. You can edit individual run names if
you wish.
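The naming sequence works like spreadsheet column labels. A minimal Python sketch, with the hypothetical function names run_name and next_run_names, is shown here; continuing past "ZZ" to "AAA" is an assumption, since the text above only says "etc."

```python
def run_name(index):
    """Name of the run at a zero-based index: 0 -> "A", 25 -> "Z", 26 -> "AA"."""
    if index < 0:
        raise ValueError("index must be non-negative")
    name = ""
    n = index + 1
    while n > 0:
        # Bijective base 26: there is no zero digit, so shift by one.
        n, remainder = divmod(n - 1, 26)
        name = chr(ord("A") + remainder) + name
    return name


def next_run_names(existing_count, how_many):
    """Names for a new set of runs, continuing after the existing ones."""
    return [run_name(i) for i in range(existing_count, existing_count + how_many)]
```

For example, if an experiment already holds 25 runs ("A" through "Y"), next_run_names(25, 3) continues with "Z", "AA", and "AB".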
The "Comments" field appears in the "Description"
column of the main window's project view when this column is
enabled. The "Disposition" field is displayed in the project view
when you request that the status column be included. The list box
contains several possible disposition remarks; you may choose one of
these or enter one of your own.
Application, execution, and data transfer fields
These sections of the run dialog have the same
interpretations as the corresponding sections of the experiment dialog. You can add to or
change the executable path and arguments, the queue, account, CPU, and
memory settings, or the data transfer settings. However, you
cannot change the application or execution machine without returning to
the experiment dialog and generating a new set of runs.
The contents of the parameter file generated for
this run are stored within Teuthis and can be viewed by clicking on the
"View complete parameter file" button. The values of just those
parameters that were varied for this run are shown in the "Varied
parameters" field.
You can clone a run by clicking on the "Clone
run" button. Cloning a run also clones its jobs, although it does
not reset the cloned jobs' status fields. The cloned run will
have "Copy" appended to its name.
To create a job for this run, click on the
"Create job" button. A new job with a name consisting of the
run's name followed by a string of digits (e.g., "A0001") and a status
of "Unsubmitted" will be created, and its properties dialog will appear.
Jobs are the lowest level of the hierarchy of
workflow entities manipulated by Teuthis. A job represents an
individual attempt to complete or extend a run, for example "Job 123456
on 128 processors for 18 hours." Jobs are associated with
individual submissions to a remote queue and may be either "Original"
(begun from scratch) or "Restart" (restarted automatically from a
checkpoint file left by a previous job). Jobs are created via the
"Create job" popup menu item or dialog button associated with a run, by
cloning or continuing another job, or by cloning a higher-level
workflow entity. (In the last case the cloned job's status
field is not reset.) An example of the properties dialog
associated with a job is shown in Figure 4.
Figure 4: Job properties dialog.
Application, execution, and data transfer fields
These sections of the job dialog have the same
interpretations as the corresponding sections of the run dialog.
You can add to or change the executable path and arguments, the queue,
account, CPU, and memory settings, or the data transfer settings.
However, you cannot change the application or execution machine without
returning to the experiment dialog and generating a new set of runs.
Note that here you will set the wall clock time
limit to be requested when submitting the job. Teuthis does not
enforce the wall clock, processor, or memory limits of remote queues,
so it is up to you to make sure that your job will be accepted in the
queue to which you are submitting. If job submission fails
because you have exceeded queue limits, you can change the resource
requests in this dialog and resubmit the job.
By using the action buttons in a job's properties
dialog or by right-clicking on the job's entry in the project view of
the main window, you can perform several important job-related
functions. At any time you may view the runtime parameter set
associated with the job (and stored within the Teuthis project file) by
clicking on "View parameters." For original jobs and restart jobs
with applications that do not restart by copying a special file, these
parameters will be the same as those of the parent run. For
restart jobs with applications that create a special restart parameter
file after writing a checkpoint, the stored parameters will be the
downloaded contents of this special parameter file.
You may submit a job using the "Submit job"
button or menu item. This will generate a job script using the
application and machine settings in force at the time the button is
clicked. The job script will be displayed in a window for you to
review and (perhaps) edit before final submission. If a data
source machine and source files have been specified, the data files
will be staged to the execution machine before the job is submitted to
the execution machine queue. (As indicated above, staging and archiving of
job data are currently foreground processes in Teuthis.) If you
accept the job script, the remote run directory will be created, the
runtime parameter file will be created in it, and the job script will
be created in the remote Teuthis directory and submitted from
there. When a job is successfully submitted, its status will
change to "Pending" and it will be assigned a remote job ID by the
remote queuing system. (If the job claims to be submitted but the
remote job ID still appears as 000000, something has gone wrong.)
At any time you may click on "View log file" or
"View output" to view the log file or standard output plus standard
error associated with the job. Teuthis looks for a log file with
the name given in the job's application profile, located within the run
directory. Standard output and error files have names that depend
on the remote queuing system but will generally be searched for in the
Teuthis directory you have configured on the execution machine.
If a job's status is "Unsubmitted" or "Pending," you will not be able to
view these files (because they won't exist!). While a job is
running, you should be able to download and view the log file; you may
be able to obtain the standard output and error, depending on the
queuing system. After a job is found to have completed, the log
and standard output/error files will be downloaded and stored within
the Teuthis project file.
By clicking on a job's "Status" button or "Update
status" menu entry, or by using the "View - Update all jobs" menu item
in the main window, or by using automatic updates, you can check on a
job's run status. When a job's run status is found to have
changed from "Running" to "Complete," Teuthis will attempt to download
the job's log file and standard output/error, and if you have specified
a data destination machine and path, you will be prompted to archive
the job data files. Further status updates will do nothing and
will not produce attempts to download or archive files. If the
downloads or archiving were unsuccessful, you can try them again by
clicking on one of the "View" buttons or the "Archive data"
button. Once Teuthis has successfully downloaded the log and
standard output/error files, it will store them and not try to download
them again when you click on a "View" button. However, you can
archive the job data to another machine or path by changing the data
destination machine or path fields (assuming the files are still
present in the job's run directory when you click "Archive data").
To restart a job, click the "Continue job" button
or menu entry. This will create a new job with the same name,
resetting any internally stored log file or output file. The
internally stored runtime parameters will be the same as the original
job's if the application is configured to do nothing, expect a special
file, or accept a special command-line argument to restart. (In
the last case the appropriate argument will be added to the "Exec
arguments" field of the new job.) If the application is
configured to write a special restart parameter file whenever it writes
a checkpoint file, Teuthis will attempt to download this file and use
its contents for the runtime parameters of the restart job. The
restart job will be initialized with a comment field of "Restart of
######," where ###### is the remote job ID of the original job.
The restart job will run in the same run directory as the original job,
expecting to find the necessary checkpoint file(s) there. If a
remote disk purge has eliminated these files, you will need to re-stage
them for the restart job.
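The restart rules above can be summarized in a short Python sketch. The Job class, the continue_job function, and the restart mode labels ("nothing", "special_file", "argument", "parameter_file") are all hypothetical names chosen for illustration, and the sketch assumes that a restart job begins with status "Unsubmitted" and the placeholder job ID 000000.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Job:
    name: str
    remote_id: str
    parameters: str
    exec_args: str = ""
    comment: str = ""
    kind: str = "Original"
    status: str = "Unsubmitted"


def continue_job(job, restart_mode, restart_arg="", restart_params=None):
    """Build the restart job that follows `job`."""
    parameters, exec_args = job.parameters, job.exec_args
    if restart_mode == "argument":
        # The application restarts when given a special command-line argument.
        exec_args = (exec_args + " " + restart_arg).strip()
    elif restart_mode == "parameter_file":
        # The application wrote a restart parameter file with its checkpoint.
        if restart_params is None:
            raise ValueError("restart parameter file was not downloaded")
        parameters = restart_params
    elif restart_mode not in ("nothing", "special_file"):
        raise ValueError("unknown restart mode %r" % restart_mode)
    return replace(job, remote_id="000000", parameters=parameters,
                   exec_args=exec_args, kind="Restart", status="Unsubmitted",
                   comment="Restart of %s" % job.remote_id)
```

In this sketch the restart job keeps the original job's name, and only the "parameter_file" mode changes the stored runtime parameters.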