Useful combinations of pipetask command-line options in BPS

BPS runs pipetask in a few different contexts internally, with the options passed to pipetask carefully tuned to be consistent in order to achieve one of a few high-level behaviors. If we can enumerate those behaviors and the options that produce them, we can make them direct BPS options and save users from having to write out their own pipetask command lines in their BPS configs. This page is an attempt to do that enumeration, including projections for BPS with execution butler.

This page assumes that a "--skip-existing-in <collection>" option will be added on DM-27492. In the "--skip-existing[-in]" columns in the tables below:

means "--skip-existing", i.e. skip quanta whose metadata datasets are already present in the current output RUN collection
A string means "--skip-existing-in <string>", i.e. skip quanta whose metadata datasets are already present in that given collection.
means pass neither of these options.
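
As a minimal sketch of how a BPS option could carry this three-way choice, the function below maps one value to the corresponding pipetask arguments. The function name and the use of True/False to stand for the two non-string cases are my own assumptions for illustration; nothing like this exists in BPS today.

```python
from typing import List, Union


def skip_existing_args(cell: Union[bool, str]) -> List[str]:
    """Translate one --skip-existing[-in] cell into pipetask arguments.

    True  -> --skip-existing (check the current output RUN collection)
    str   -> --skip-existing-in <collection>
    False -> pass neither option
    """
    if cell is True:
        return ["--skip-existing"]
    if cell is False:
        return []
    if isinstance(cell, str) and cell:
        return ["--skip-existing-in", cell]
    # Anything else (empty string, None, numbers) is a configuration error.
    raise ValueError(f"unsupported skip-existing value: {cell!r}")
```

Keeping the three cases in one value means a BPS config could not ask for both options at once by accident.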

This page also assumes DM-30845, which moves the check for whether --register-dataset-types will be needed to the beginning of QuantumGraph generation instead of the init job (for current BPS) or the transfer-back job (for execution-butler BPS). Failing this check late (or having typo-datasets written to the shared data repository) would be very frustrating for users, so it'll be especially bad with execution butler if we don't get that ticket implemented.

Recommended configurations

New processing (default)

Input collections may or may not include other processing outputs, but they are being used purely as inputs; the user does not want to skip any quanta based on what is in the input collections.

The output CHAINED and RUN collections either do not exist yet, or only the CHAINED collection exists and the user wants to push a new RUN to it, again without trying to skip any quanta based on what is already there.

	--output	--output-run	--skip-existing[-in]	--clobber-outputs	--register-dataset-types	--init-only	--skip-init-writes	--allow-new-dataset-types
standalone `pipetask`	set by user	or set by user			²			N/A
BPS `qgraph`	`{output}`	`{outCollection}`			N/A	N/A	N/A	²
BPS `init`	`{output}`	`{outCollection}`			²			N/A
BPS `run`	`{output}`	`{outCollection}`	¹	¹				N/A
BPS EB `qgraph`	`{output}`	`{outCollection}`			N/A	N/A	N/A	²
BPS EB `init`		`{outCollection}`						N/A
BPS EB `run`		`{outCollection}`	¹	¹				N/A
BPS EB `transfer-back`	`butler transfer-datasets ...` `butler collection-chain <shared-repo> --flatten --mode prepend {output} {outCollection} # if CHAINED collection exists` `butler collection-chain <shared-repo> --flatten {output} {outCollection} {inCollection} # if CHAINED collection is new`

¹ Only has an effect if automatic retries may occur, but is safe otherwise and may reduce datastore existence-test traffic if set. Passing these here also removes unnecessary differences between scenarios.
² User should register dataset types and allow new dataset types in QuantumGraph generation only if they have reason to believe those dataset types do not already exist. With execution butler, dataset type registration in the shared repo is something transfer-back should always do, but we really want --allow-new-dataset-types in QuantumGraph generation to be able to check whether that is a no-op early. All dataset types are always registered in the execution butler when it is created, because it is always a new repo.
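
The transfer-back row above picks one of two collection-chain commands depending on whether the CHAINED collection already exists. Here is a rough sketch of that choice. The function and parameter names are hypothetical, the argument order simply mirrors the commands as written in the table, and I'm treating `{inCollection}` as a single collection for simplicity.

```python
from typing import List


def transfer_back_chain_command(
    shared_repo: str,
    output: str,
    out_collection: str,
    in_collection: str,
    chain_exists: bool,
) -> List[str]:
    """Build the collection-chain argv for the execution-butler transfer-back step."""
    base = ["butler", "collection-chain", shared_repo, "--flatten"]
    if chain_exists:
        # Existing CHAINED collection: push the new RUN onto the front of it.
        return base + ["--mode", "prepend", output, out_collection]
    # New CHAINED collection: define it as the new RUN followed by the inputs.
    return base + [output, out_collection, in_collection]
```

Returning an argument list rather than a single string avoids quoting problems if a collection name ever contains shell-special characters.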

Continued processing with a new QuantumGraph and new RUN

The output CHAINED collection already exists. The user wants to make a new QuantumGraph that skips quanta that were already run successfully, and may also include quanta that were not run previously (or even included in the old graph).

This approach pushes a new RUN collection and depends on --skip-existing-in, which is the only way to satisfy this use case with execution butler.

	--output	--output-run	--skip-existing[-in]	--clobber-outputs	--register-dataset-types	--init-only	--skip-init-writes	--allow-new-dataset-types
standalone `pipetask`	set by user	or set by user			²			N/A
BPS `qgraph`	`{output}`	`{outCollection}`	`{output}`		N/A	N/A	N/A	²
BPS `init`	`{output}`	`{outCollection}`			²			N/A
BPS `run`	`{output}`	`{outCollection}`	¹	¹				N/A
BPS EB `qgraph`	`{output}`	`{outCollection}`	`{output}`		N/A	N/A	N/A	²
BPS EB `init`		`{outCollection}`						N/A
BPS EB `run`		`{outCollection}`	¹	¹				N/A
BPS EB `transfer-back`	`butler transfer-datasets ...` `butler collection-chain <shared-repo> --flatten --mode prepend {output} {outCollection}`

Cells that have changed from the default are highlighted in yellow, and footnotes are the same.
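
To make the qgraph row concrete, here is a small sketch that expands the `{output}` and `{outCollection}` placeholders into an option list. Only the cells that appear as text in the BPS `qgraph` row are modeled; the constant name, the helper, and the example collection names are made up for illustration.

```python
from typing import Dict, List

# Options from the BPS `qgraph` row above, written as str.format templates.
CONTINUED_QGRAPH_OPTIONS: List[str] = [
    "--output", "{output}",
    "--output-run", "{outCollection}",
    # Skip quanta whose metadata is already anywhere in the output chain.
    "--skip-existing-in", "{output}",
]


def render_options(templates: List[str], values: Dict[str, str]) -> List[str]:
    """Substitute BPS-style variables into each option template."""
    return [template.format(**values) for template in templates]


# Hypothetical collection names, not taken from any real repository.
example_values = {"output": "u/someone/demo", "outCollection": "u/someone/demo/run2"}
print(" ".join(render_options(CONTINUED_QGRAPH_OPTIONS, example_values)))
```

Note that --skip-existing-in points at the CHAINED collection rather than the new RUN, which is what lets the new graph skip work done in any earlier RUN in the chain.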

Other noteworthy configurations

--replace-run

I think this is a useful option for standalone pipetask, as a way to repeatedly try to run something until it works at all, but it's probably unwise to try to use it with BPS. It'd be much better to teach users to run some combination of

butler collection-chain {collection} --pop

(if they don't want --prune-replaced) or

butler prune-collection {outCollection} --purge --unstore --unlink {collection}

(if they do want --prune-replaced), to make the deletions explicit.

It would be nice to make it a bit easier to fully delete the top of a CHAINED collection, though (just as --pop makes it easy to remove it from the CHAINED collection, without knowing its name).

Continued processing with a new QuantumGraph and existing RUN

Given that we don't have --skip-existing-in yet, this is currently the only way to skip existing quanta. Since that's an important part of managing big processing runs, switching between this mode and the default "new processing" mode is the main reason everybody currently hand-edits their BPS configs. But I think once we have --skip-existing-in, we should just retire this mode, i.e. strongly discourage users from trying it. That's because:

there's no good way to implement this behavior with execution butler;
people almost always want to edit configs and/or software versions when they do this (note that this extends to even requestMemory changes, as that's a config option), but that really shouldn't be allowed by --extend-run. The fact that it is allowed at present is a bug, documented on DM-27492, but until that same ticket provides --skip-existing-in, that loophole also serves as an important workaround for the fact that it's the only way to pick up where you left off;
the fewer BPS+pipetask configuration combinations we have to support, the better.

Dropping this approach from our list of what's supported may disappoint users who prefer tidier CHAINED collections with a small number of RUNs. But I think that's the egg we need to break to make our omelet maintainable.

I'm not going to write out the full table for this mode because I don't think we'll ever want to implement it as a high-level BPS option.

Continued processing with an existing QuantumGraph and existing RUN

The output CHAINED collection and the output RUN collection already exist, and the user already has a QuantumGraph that they want to continue running, while skipping quanta that were already run successfully. If using execution-butler mode, they must also have an existing execution butler repository they can use; a new one will not be created.

This mode may not be used if the user is changing configuration or software versions, so I actually have some trouble thinking of contexts in which it'd be useful right now (maybe picking up after hardware failures or downtime?), and that's a big part of why it's not in the recommended category. I've written out the full table in large part because I think people will be curious about it, and want to use it when they shouldn't (they should continue with a new QuantumGraph and new RUN instead).

The other reason I think we can't recommend this mode is that it skips quanta only at runtime, which means it'll yield a lot of dataset-existence traffic in a brief time window, during those super-fast do-nothing jobs. That's definitely problematic for shared-registry execution, and probably undesirable even for execution butler (where we'll still be slamming the shared datastore and local registry files). We could address that in the future via an algorithm that prunes an existing QuantumGraph based on metadata dataset existence, but we would need to be careful to vectorize those checks, or we'll just generate the same dataset-existence traffic even faster, at QG-pruning time.

If we change the granularity of config and software-version datasets to be per-quantum rather than per-RUN, as Nate Lust brought up recently, and we implement efficient QuantumGraph pruning, this gets a lot more useful.
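
To illustrate what "vectorized" pruning might look like, here is a toy sketch: it asks for the existence of every metadata dataset in a single batched call and keeps only the quanta that still need to run. None of this is real QuantumGraph or butler API. The dict-of-strings graph, the `bulk_exists` callable, and the `FakeStore` class are all stand-ins I made up to show the shape of the idea.

```python
from typing import Callable, Dict, Iterable, List, Set


def prune_completed_quanta(
    metadata_by_quantum: Dict[str, str],
    bulk_exists: Callable[[List[str]], Set[str]],
) -> Dict[str, str]:
    """Return only the quanta whose metadata dataset is not yet present."""
    # One batched lookup for all metadata datasets, not one lookup per quantum.
    present = bulk_exists(list(metadata_by_quantum.values()))
    return {
        quantum: metadata
        for quantum, metadata in metadata_by_quantum.items()
        if metadata not in present
    }


class FakeStore:
    """In-memory stand-in for a datastore that counts existence queries."""

    def __init__(self, existing: Iterable[str]) -> None:
        self.existing = set(existing)
        self.calls = 0

    def bulk_exists(self, dataset_ids: List[str]) -> Set[str]:
        self.calls += 1
        return self.existing & set(dataset_ids)
```

The point of the counter in `FakeStore` is just to show that the number of existence queries stays at one no matter how many quanta are in the graph.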

	--output	--output-run	--register-dataset-types	--allow-new-dataset-types
standalone `pipetask`	set by user		²	N/A
BPS `qgraph`	`N/A`
BPS `init`	`{output}`	`{outCollection}`	²	N/A
BPS `run`	`{output}`	`{outCollection}`		N/A
BPS EB `qgraph`	`N/A`
BPS EB `init`		`{outCollection}`		N/A
BPS EB `run`		`{outCollection}`		N/A
BPS EB `transfer-back`	`butler transfer-datasets ...` `butler collection-chain <shared-repo> --flatten --mode prepend {output} {outCollection} # only if not run the first time this QG was executed`
