5. Managing workflow

"Workflow" describes the sequence of operations needed to accomplish an objective, together with the resources and information needed at each step of such a sequence. For years organizations concerned with logistics (such as businesses) have used sophisticated software tools to manage workflow, but in science (and astrophysics in particular) scientific workflow management has been very ad-hoc. Surprisingly, this has included computational science, which despite its heavy investment in computers as calculating and data processing tools has placed too little emphasis on making scientific work easier and less error-prone.

This situation is changing with the increased emphasis on cyberinfrastructure development by NSF and other funding agencies. Software products and standards ranging from general middleware such as the Globus Toolkit to domain-specific application portals such as LEAD have begun to appear. Teuthis represents an effort to leverage some of the developing technologies in this area to provide workflow management specialized to the needs of computational astrophysicists and others who work with noninteractive, massively parallel applications.

Teuthis organizes scientific work using entities called projects, experiments, runs, and jobs.  In this section we describe each of these and show how to use the properties dialogs associated with them.  The basic sequence of tasks is shown below.

  1. Configure applications and machines that will be used.
  2. Create a new project.
  3. Under the project, create a new experiment.
  4. Choose an application and execution machine for the experiment.  Configure and build the application.
  5. Select input data files and indicate the source and destination machines for data files used in the experiment.
  6. Create a parameter file template for your experiment and generate runs.
  7. For each run, create a job, then stage the input data and submit that job.
  8. Teuthis periodically checks to see if the job is done.  When it is, it obtains the log file and standard output, then archives the data produced by the job.
  9. Classify the disposition of the job and restart it if necessary.
  10. Retrieve the job data from the data destination machine and analyze it.

Presently Teuthis provides no special way to link a simulation job with related analysis jobs, but it is possible to configure analysis tools as applications and then run them with simulation data as input from Teuthis.

Projects

Projects are collections of related scientific questions, such as "Galaxy cluster scaling relations" or "Cosmic shear systematics due to projection effects."  They represent the highest level of the Teuthis workflow paradigm.  A new project is created either by selecting "New project..." from the File menu in the main window or by cloning an existing project using the "Clone" entry of its popup menu (right click in the main window) or the "Clone project" button in its properties dialog.  (Note that cloning a project clones all of its child entities as well.  Cloned projects will have "Copy" appended to their names.)  When a new project is created, the dialog shown in Figure 1 appears.

Figure 1: Project properties dialog.

Descriptive fields

The project name should be, but does not have to be, unique.  It is used to identify the project in the main window's project view.  The description field appears in the "Description" column of the main window's project view when this column is enabled.

Taking notes

You can keep notes on the project in one of two ways:  by editing a text file on your local machine or by launching a web browser and entering notes into a wiki.  Enter the pathname for the text file or the URL of the wiki in the field labeled "Notes location," then select the appropriate radio button.  If you have selected "Edit text file," when you click the "Take notes" button the application registered on your system to handle MIME type text/plain will be launched.  For this choice it may make sense to keep your project notes in the workspace directory specified in the user preferences dialog (although this is not required).  If you have selected "Wiki URL," clicking on the "Take notes" button will launch a web browser and load the specified URL.  The web browser is launched using the Python webbrowser module and will be system-dependent.

Creating new experiments

To create a new experiment, click the "New experiment..." button, and an experiment dialog will appear.  Note that even if you cancel these dialogs, the project and its new experiment will still exist within the project view in the main window; to remove them you must manually delete the project using the "Delete" option in its popup menu.  This behavior is designed to avoid accidentally deleting the project if you have gone deeply into creating experiments, runs, and jobs within a new project and then click "Cancel" when you come back to the project dialog.

Experiments

Experiments vary one or more parameters using a single application code in order to answer a specific scientific question.  An example of such a question might be, "How does the level of galaxy feedback affect the scatter in the cluster mass-temperature relation?"  Experiments are created by clicking on the "New experiment..." button of the parent project's dialog or by selecting the corresponding entry from the project's popup menu in the project view of the main window (right click on the project's entry in the project view to bring up its popup menu).  Experiments may also be cloned using the "Clone" item in their popup menus or by clicking the "Clone experiment" button in their properties dialogs.  (Cloning an experiment also clones its child entities.  Cloned experiments will have "Copy" appended to their names.)  When you create a new experiment, you will see the dialog shown in Figure 2.

Figure 2: Experiment properties dialog.

Descriptive fields

Like projects, experiments have names and description fields.  The description field appears in the "Description" column of the main window's project view when this column is enabled.

Building an application

Once you have provided a name and description for your experiment, choose an application from among those you have previously configured by clicking on the "Application" list box.  Also choose an execution machine for this experiment by selecting one of the configured machines from the "Exec machine" list box.  When you make these selections, several fields will be filled in for you.  The remote build directory field will be initialized with a path constructed from the build root directory for the execution machine and the munged name of the application.  The "Config command" field will contain the application's configuration command.  The "Build command" field will contain the build command for the application, and if you have checked "Move to remote exec dir" in the application's profile, you will also see a command to move the executable to the executable directory for the machine.  The absolute path name of the executable will be entered into the "Executable to use" field; Teuthis will expect to find this in the build directory if you have not checked "Move to remote exec dir."  Each of these entries can be modified as necessary, including the "Exec arguments" field, which specifies additional arguments to be passed to the application.

If you do not wish to build the remote executable, simply check whether the "Executable to use" field is correct, then move on to choosing execution parameters below.

If you wish to build the application on the remote machine from local source code, click on the "Upload source" button.  You will be asked to authenticate yourself to the execution machine.  What happens next (assuming your passphrase was accepted!) depends on whether or not you have chosen to use the tendril.py helper application in the user preferences dialog.  Tendril handles certain remote operations that would be difficult to code in a cross-platform fashion without the use of Python.  When you ask to use Tendril, Teuthis first copies tendril.py from its location on your machine to the remote machine's Teuthis directory.  (A file permissions check is performed first; if anybody except the system administrator and the file's owner can write to tendril.py, an error message is displayed.)  Two of the things Tendril can do are to obtain MD5 checksums for files in a directory tree and to create a set of directories specified in a file.  These capabilities permit Teuthis to perform a one-way remote synchronization of your local source tree with the remote machine.  Any files found on your machine but not on the remote machine, or which differ between the two machines, will be copied from your machine to the remote machine.  (CVS directories are excluded.)  Hence you can treat your local source as "authoritative" and only transfer necessary files when you click "Upload source."  Remote directory creation is also much faster when you use Tendril.
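
The one-way synchronization described above can be sketched in Python as follows. This is a minimal illustration, not the code in tendril.py: the function names md5_manifest and files_to_upload are hypothetical, and the sketch assumes that the checksums for both trees are already available as dictionaries keyed by relative path.

```python
import hashlib
import os

def md5_manifest(root):
    """Map each file path (relative to root) to its MD5 digest, skipping CVS directories."""
    manifest = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into CVS directories.
        dirnames[:] = [d for d in dirnames if d != "CVS"]
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as handle:
                digest = hashlib.md5(handle.read()).hexdigest()
            manifest[os.path.relpath(path, root)] = digest
    return manifest

def files_to_upload(local, remote):
    """Return files that are missing remotely or whose checksums differ.

    Files that exist only on the remote side are ignored, which is what
    makes the synchronization one-way.
    """
    return sorted(path for path, digest in local.items()
                  if remote.get(path) != digest)
```

In this sketch the local manifest is treated as authoritative: only the paths returned by files_to_upload would need to be copied to the remote machine.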

If you have chosen not to use Tendril (for example, if the remote machine does not have Python installed), the entire source code tree will be uploaded.  In this case it may be more efficient to handle the transfer yourself using one scp command.

After you have uploaded or synchronized the source code, you can browse for a local configuration file and then upload it by clicking the "Upload file" button.  (Some applications may not require this step.)  On the remote machine the file will be given the generic name you specified in the application profile.  For example, you might have configuration files named Modules.galaxy and Modules.sedov on your local machine containing configuration information for two different problem setups.  When you choose one and click "Upload file," it will be copied to the generic file Modules in the remote build directory.

Note that some applications may not require a configuration step but will instead have you edit a makefile and then build.  For these types of applications you should leave the application's configuration command blank, specify "Makefile" as the configuration file, and upload your edited makefile in place of a configuration file.  You can then skip forward to building the code.

After you have uploaded the configuration file (if any), you can remotely execute the application's configuration command by clicking the "Do it" button next to the "Config command" field.  A progress bar dialog will be shown while the command executes.  When the command finishes, the dialog will go away, and the "Output" button next to the "Do it" button will become active.  This button allows you to bring up a window showing the standard output of the most recent configuration command run.  If the exit code was 0 (success), the "Output" button will have a green border; otherwise it will have a red border.  You can run the configuration command as many times as you wish.  An unsuccessful configuration indicated by a red border will not prevent you from going on to build the code, but it will indicate to you that you should check the configuration output for possible problems.

The procedure for building the code is the same as for configuring it:  click the "Do it" button next to the "Build command" field to execute the specified command on the remote machine.  While the build is in progress, Teuthis will block with a progress dialog.  When it is done, the output will be available by clicking the "Output" button, and the border color of this button will indicate the success or failure of the build procedure.
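
The relationship between a command's exit code and the border color of the "Output" button can be illustrated with a short Python sketch. It is an assumption-laden simplification: the command is run locally with the standard subprocess module rather than on a remote machine, and the function name run_step is hypothetical.

```python
import subprocess
import sys

def run_step(command):
    """Run a command and return (border_color, standard_output).

    An exit code of 0 maps to "green"; any other exit code maps to "red".
    """
    result = subprocess.run(command, capture_output=True, text=True)
    color = "green" if result.returncode == 0 else "red"
    return color, result.stdout

# Example: a trivial command standing in for a configuration or build command.
color, output = run_step([sys.executable, "-c", "print('configured')"])
print(color, output.strip())
```

As in the dialog, a "red" result here does not stop anything; it only signals that the captured output deserves a closer look.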

The configuration and building of some applications can be quite lengthy procedures, and it is anticipated that the next major release of Teuthis will support the backgrounding of these two processes.

Choosing execution parameters

When you choose an execution machine from the "Exec machine" list box, the queue, account, number of CPUs, CPU tiling, and memory per node fields are filled using values taken from the execution machine's profile.  These can be adjusted as needed.  If you need to use a queue or account that is not listed, close the experiment dialog, return to the machine dialog, and add the appropriate queue or account to the list of those available for this machine.

Note that the "# of CPUs" field allows you to enter a range of values.  Range specifications of the form "a-b", "a,b,c", or combinations thereof are permitted.  (The form "a-b" means "step from a to b in increments of 1.")  This feature makes it straightforward to perform parallel scaling studies.
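
As an illustration of how such a specification expands, here is a minimal Python sketch. The function name parse_int_ranges is hypothetical, and the sketch assumes non-negative integers; it is not taken from the Teuthis source.

```python
def parse_int_ranges(spec):
    """Expand a range specification such as "8,16-18,32" into a list of integers."""
    values = []
    for piece in spec.split(","):
        piece = piece.strip()
        if "-" in piece:
            # "a-b" means: step from a to b (inclusive) in increments of 1.
            start, stop = (int(part) for part in piece.split("-"))
            values.extend(range(start, stop + 1))
        else:
            values.append(int(piece))
    return values

print(parse_int_ranges("8,16-18,32"))
```

For a scaling study, each value in the resulting list would correspond to one CPU count to be run.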

Setting up data transfers

If your experiment will require the staging of input files, choose a data source machine in the "Src machine" field and enter the full paths (one per line) of the files to transfer from this machine in the "Src files" field.  (Wildcards are not permitted in this field.)  When jobs are submitted as part of this experiment, the indicated files will first be transferred from the data source machine to the run directory on the execution machine.  Note that unless both machines use the same access method, and that method permits third-party transfers (like gsissh+uberftp), you will only be able to stage files from your local machine to a remote machine, from a remote machine to your local machine, or from a machine to itself.

If you wish to archive all of the files in the run directory after the job is complete, choose a data destination machine in the "Dest machine" field and enter the absolute path of the destination directory on the data destination machine.  If the path does not exist at the time the archiving takes place, it will be created for you.  The same remarks regarding third-party transfers made for staging job data before the job executes apply here as well.

Note that Teuthis will block with a file transfer progress dialog while staging and archiving take place.  If you cancel the dialog while staging, job submission will also be aborted.  If you cancel while archiving, you can always come back and initiate archiving again later.  You can also change the data destination machine and path and archive to another location.  Lengthy file transfers will be somewhat inconvenient in this version of Teuthis since they occur in the foreground; as with configuring and building applications, the backgrounding of this process is a priority for the next major release of Teuthis.

Generating runs

Parameter survey generation is one of the most powerful features offered through the experiment dialog.  You can browse for and edit a parameter file template directly from the dialog, then specify a set of parameters to vary and, with a single click, generate a set of runs that will cover the entire range of parameters you have requested.

Applications each have their own format for parameter files read at runtime.  All Teuthis assumes about these files is that they are text-based and thus amenable to pattern substitution.  When you create a parameter file template, set all of the parameters that are required and will not vary for this experiment.  In the places where you would normally put the value of a parameter that will vary, instead include the sequence %name%, where name is the name you would like Teuthis to know the parameter by.  You may then list the parameter in one of the varied parameter fields of the experiment dialog, together with the range of values it is to take on.
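
The pattern substitution described above can be sketched in a few lines of Python. This is an illustration only: the function name fill_template and the sample parameter names fb and tag are hypothetical, and the sketch assumes that parameter names consist of letters, digits, and underscores.

```python
import re

def fill_template(template, values):
    """Replace every %name% token in the template with values[name].

    A token whose name is missing from values raises KeyError.
    """
    def lookup(match):
        return str(values[match.group(1)])
    return re.sub(r"%(\w+)%", lookup, template)

# Fixed parameters are written out literally; varied ones appear as %name%.
template = "nsteps = 1000\nfeedback = %fb%\nbasename = %tag%_\n"
print(fill_template(template, {"fb": 0.5, "tag": "run1"}))
```

Because the substitution is purely textual, it works with any text-based parameter file format.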

Range specifications are of the form "a,b,c,..." for non-integer parameters and "a,b,c,..." or "a-b" (or combinations thereof) for integer parameters.  The "a,b,c,..." form means that the indicated parameter is to take values a, b, c, ...; the "a-b" form means that the indicated parameter is to take values from a to b in increments of 1.  Other than this distinction, no type checking is performed, so strings (including embedded spaces) and real numbers may also be specified.  It is up to you to make sure that the listed value types are compatible with the application's interpretation of each parameter.

By checking the lock buttons for two or more parameters, you can force them to vary together.  All of the parameters with the lock setting must have the same number of values specified for them, or else an error message will appear when you click on the "Generate runs" button.  For example, if you request variable A with range "1-3" and variable B with range "run1,run2,run3" and leave the lock buttons unchecked, you will end up with nine runs, whereas if you check the lock buttons, you will end up with three runs.  Also checking the lock button for a variable C with range "9,11,13,15" will result in an error.
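
The effect of the lock buttons can be sketched in Python as follows. The function name generate_runs and its arguments are hypothetical, and the sketch assumes that unlocked parameters are combined as a full Cartesian product, consistent with the nine-run example above.

```python
from itertools import product

def generate_runs(params, locked_names=()):
    """Return one dict of parameter values per run.

    params maps each parameter name to its list of values.  Parameters
    named in locked_names vary together; all others vary independently.
    """
    locked = [name for name in params if name in locked_names]
    free = [name for name in params if name not in locked_names]
    if locked:
        if len({len(params[name]) for name in locked}) != 1:
            raise ValueError("locked parameters must have the same number of values")
        # Zip the locked parameters into a single axis of value tuples.
        locked_axis = list(zip(*(params[name] for name in locked)))
    else:
        locked_axis = [()]
    runs = []
    for locked_values in locked_axis:
        for free_values in product(*(params[name] for name in free)):
            run = dict(zip(locked, locked_values))
            run.update(zip(free, free_values))
            runs.append(run)
    return runs

params = {"A": [1, 2, 3], "B": ["run1", "run2", "run3"]}
print(len(generate_runs(params)))                          # unlocked
print(len(generate_runs(params, locked_names=("A", "B"))))  # locked together
```

Adding a locked parameter C with four values to this example raises ValueError, mirroring the error message described above.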

To generate the requested runs, click on the "Generate runs" button.

It is possible to generate runs without a parameter file template.  In this case, varied parameter settings will be ignored, and after you confirm that you wish to continue, a single run will be generated.

Runs

A run represents the execution of code for a given set of parameter values, carried to completion.  Examples might include "Run with 10x fiducial energy input" or "Run with 10^8 solar mass black hole."  Runs are created by clicking the "Generate runs" button of an experiment dialog.  Doing this adds the runs to the project view of the main window but does not bring up a run dialog.  To examine a run's properties, return to the main window and double-click a run or select "Properties..." from its popup menu (accessed with the right mouse button).  An example run dialog is shown in Figure 3.

Figure 3: Run properties dialog.

Descriptive fields

The run name is automatically generated when a set of runs is created.  Run names run from "A" to "Z", then "AA" to "AZ", "BA" to "BZ", etc.  If you generate more than one set of runs for a given experiment, the names of the new runs will pick up where the old ones left off.  You can edit individual run names if you wish.
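
The naming sequence can be reproduced with a short Python sketch. The function name run_name is hypothetical, and the behavior beyond "ZZ" (continuing with "AAA") is an assumption, since the sequence is only described through "BZ, etc."

```python
def run_name(index):
    """Return the name of the run at a zero-based index: 0 -> "A", 26 -> "AA"."""
    name = ""
    index += 1
    while index > 0:
        # Peel off one letter at a time, least significant letter first.
        index, remainder = divmod(index - 1, 26)
        name = chr(ord("A") + remainder) + name
    return name

print([run_name(i) for i in (0, 25, 26, 51, 52)])
```

Generating a second set of runs for the same experiment would then start from the index following the last existing run.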

The "Comments" field appears in the "Description" column of the main window's project view when this column is enabled.  The "Disposition" field is displayed in the project view when you request that the status column be included.  The list box contains several possible disposition remarks; you may choose one of these or enter one of your own.

Application, execution, and data transfer fields

These sections of the run dialog have the same interpretations as the corresponding sections of the experiment dialog.  You can add to or change the executable path and arguments, the queue, account, CPU, and memory settings, or the data transfer settings.  However, you cannot change the application or execution machine without returning to the experiment dialog and generating a new set of runs.

Runtime parameters

The contents of the parameter file generated for this run are stored within Teuthis and can be viewed by clicking on the "View complete parameter file" button.  The values of just those parameters that were varied for this run are shown in the "Varied parameters" field.

Creating new runs and jobs

You can clone a run by clicking on the "Clone run" button.  Cloning a run also clones its jobs, although it does not reset the cloned jobs' status fields.  The cloned run will have "Copy" appended to its name.

To create a job for this run, click on the "Create job" button.  A new job with a name consisting of the run's name followed by a string of digits (e.g., "A0001") and a status of "Unsubmitted" will be created, and its properties dialog will appear.

Jobs

Jobs are the lowest level of the hierarchy of workflow entities manipulated by Teuthis.  A job represents an individual attempt to complete or extend a run, for example "Job 123456 on 128 processors for 18 hours."  Jobs are associated with individual submissions to a remote queue and may be either "Original" (begun from scratch) or "Restart" (restarted automatically from a checkpoint file left by a previous job).  Jobs are created via the "Create job" popup menu item or dialog button associated with a run, by cloning or continuing another job, or by cloning a higher-level workflow entity.  (In the last case the cloned job's status field is not reset.)  An example of the properties dialog associated with a job is shown in Figure 4.

Figure 4: Job properties dialog.

Descriptive fields

The local job ID is automatically generated when a job is created.  Job IDs take the form <run name>####, where #### is a zero-padded sequence number.  If you restart a job, the restart job will have the same local job ID as the original (and will share the same run directory on the execution machine).  If you clone a job, the cloned job will have "Copy" appended to its name.  If you click the "Create job" button for the job's parent run again, a new job with the next sequence number will be created.
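
A minimal Python sketch of this numbering scheme follows. The function names job_id and next_job_id are hypothetical, and the sketch assumes that sequence numbers start at 1, as the "A0001" example suggests.

```python
def job_id(run_name, sequence):
    """Format a local job ID as the run name plus a four-digit, zero-padded number."""
    return f"{run_name}{sequence:04d}"

def next_job_id(run_name, existing_ids):
    """Return the ID with the next sequence number for the given run.

    IDs belonging to other runs, or carrying a suffix such as "Copy",
    are ignored.  Restart jobs share an ID, so duplicates are harmless.
    """
    used = [int(job[len(run_name):]) for job in existing_ids
            if job.startswith(run_name) and job[len(run_name):].isdigit()]
    return job_id(run_name, max(used, default=0) + 1)

print(next_job_id("A", ["A0001", "A0001", "A0002"]))
```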

The "Comments" field appears in the "Description" column of the main window's project view when this column is enabled.  It is initialized with the text "Original" or "Restart of ######" when you create the job.  The "Disposition" field is displayed in the project view of the main window when you request that the status column be included.  The list box contains several possible disposition remarks; you may choose one of these or enter one of your own.

Application, execution, and data transfer fields

These sections of the job dialog have the same interpretations as the corresponding sections of the run dialog.  You can add to or change the executable path and arguments, the queue, account, CPU, and memory settings, or the data transfer settings.  However, you cannot change the application or execution machine without returning to the experiment dialog and generating a new set of runs.

Note that here you will set the wall clock time limit to be requested when submitting the job.  Teuthis does not enforce the wall clock, processor, or memory limits of remote queues, so it is up to you to make sure that your job will be accepted in the queue to which you are submitting.  If job submission fails because you have exceeded queue limits, you can change the resource requests in this dialog and resubmit the job.

Submitting jobs

By using the action buttons in a job's properties dialog or by right-clicking on the job's entry in the project view of the main window, you can perform several important job-related functions.  At any time you may view the runtime parameter set associated with the job (and stored within the Teuthis project file) by clicking on "View parameters."  For original jobs and restart jobs with applications that do not restart by copying a special file, these parameters will be the same as those of the parent run.  For restart jobs with applications that create a special restart parameter file after writing a checkpoint, the stored parameters will be the downloaded contents of this special parameter file.

You may submit a job using the "Submit job" button or menu item.  This will generate a job script using the application and machine settings in force at the time the button is clicked.  The job script will be displayed in a window for you to review and (perhaps) edit before final submission.  If a data source machine and source files have been specified, the data files will be staged to the execution machine before the job is submitted to the execution machine queue.  (As indicated above, staging and archiving of job data are currently foreground processes in Teuthis.)  If you accept the job script, the remote run directory will be created, the runtime parameter file will be created in it, and the job script will be created in the remote Teuthis directory and submitted from there.  When a job is successfully submitted, its status will change to "Pending" and it will be assigned a remote job ID by the remote queuing system.  (If the job claims to be submitted but the remote job ID still appears as 000000, something has gone wrong.)

Viewing job data

At any time you may click on "View log file" or "View output" to view the log file or standard output plus standard error associated with the job.  Teuthis looks for a log file with the name given in the job's application profile, located within the run directory.  Standard output and error files have names that depend on the remote queuing system but will generally be searched for in the Teuthis directory you have configured on the execution machine.  If a job's status is "Unsubmitted" or "Pending," you will not be able to view these files (because they won't exist!).  While a job is running, you should be able to download and view the log file; you may be able to obtain the standard output and error, depending on the queuing system.  After a job is found to have completed, the log and standard output/error files will be downloaded and stored within the Teuthis project file.

Updating a job's status

By clicking on a job's "Status" button or "Update status" menu entry, or by using the "View - Update all jobs" menu item in the main window, or by using automatic updates, you can check on a job's run status.  When a job's run status is found to have changed from "Running" to "Complete," Teuthis will attempt to download the job's log file and standard output/error, and if you have specified a data destination machine and path, you will be prompted to archive the job data files.  Further status updates will do nothing and will not produce attempts to download or archive files.  If the downloads or archiving were unsuccessful, you can try them again by clicking on one of the "View" buttons or the "Archive data" button.  Once Teuthis has successfully downloaded the log and standard output/error files, it will store them and not try to download them again when you click on a "View" button.  However, you can archive the job data to another machine or path by changing the data destination machine or path fields (assuming the files are still present in the job's run directory when you click "Archive data").

Restarting a job

To restart a job, click the "Continue job" button or menu entry.  This will create a new job with the same name, resetting any internally stored log file or output file.  The internally stored runtime parameters will be the same as the original job's if the application is configured to do nothing, expect a special file, or accept a special command-line argument to restart.  (In the last case the appropriate argument will be added to the "Exec arguments" field of the new job.)  If the application is configured to write a special restart parameter file whenever it writes a checkpoint file, Teuthis will attempt to download this file and use its contents for the runtime parameters of the restart job.  The restart job will be initialized with a comment field of "Restart of ######," where ###### is the remote job ID of the original job.  The restart job will run in the same run directory as the original job, expecting to find the necessary checkpoint file(s) there.  If a remote disk purge has eliminated these files, you will need to re-stage them for the restart job.
