Platform

File structure

The platform automatically generates a file structure for both input and output upon initialisation.

Instance directory

The instance directory has the following structure:

Instances/
  Example_Instance_Set/
    instance_a.cnf
    instance_b.cnf
    ...        ...
    instance_z.cnf

Each directory under the Instances directory represents an Instance Set and each file is considered an instance. Note that if your dataset is a single file, it will be considered a single instance in the set.

For instances consisting of multiple files one additional file called instances.csv should be included in the Example_Instance_Set directory, describing which files together form an instance. The format is a single instance per line with each file separated by a space, as shown below.

instance_name_a instance_a_part_one.abc ... instance_a_part_n.xyz
instance_name_b instance_b_part_one.abc ... instance_b_part_n.xyz
...                     ...
instance_name_z instance_z_part_one.abc ... instance_z_part_n.xyz

Solver Directory

The solver directory has the following structure:

Solver/
  Example_Solver/
    sparkle_solver_wrapper.py
    parameters.pcs
    ...

The sparkle_solver_wrapper.py is a wrapper that Sparkle should call to run the solver with specific settings, and then returns a result for the configurator. In parameters.pcs the configurable parameters are described in the PCS format. Finally, when importing your Solver into Sparkle, a binary executable of the runsolver tool runsolver is added. This allows Sparkle to make fair time and computational cost measurements for all configuration experiments.

This same structure holds up for all other executables we refer to as SparkleCallable in the Sparkle package, such as Feature Extractors, which are placed in the Extractor directory.

The output directory

The output directory is located at the root of the Sparkle directory. Its structure is as follows:

Output/
  Logs/
    commandname_timestamp/
        log files
  Configuration/
    configurator/
        Raw_Data/
            configuration_scenario/
                related files
    Analysis/
  Parallel_Portfolio/
    Raw_Data/
        related files
    Analysis/
  Selection/
    selector/
        solver_scenario/
            related files
    Analysis/

The Logs directory should contain the history of commands and their output such that one can easily know what has been done in which order and find enough pointers to debug unwanted behaviour.

Other directories are cut into two subdirectories: Raw_Data contains the data produced by the main command, often time consuming to generate, handle with care; Analysis contains information extracted from the raw data, easy to generate, plots and reports.

For each type of task run by Sparkle, the related files differ. The aim is always to have all required files for reproducibility. A copy of the sparkle configuration file at the time of the run and of all files relevant to the run, a copy of any log or error file that could help with debugging or a link to it, and the output of the executed task.

For configuration the configuration trajectory if available, the training and testing sets, the default configuration and the final found configuration. The performance of those will be in the Analysis folder.

For parallel portfolio the resulting portfolio and its components. The performance of the portfolio will be in the Analysis folder.

For selection the algorithms and their performance on the training set, the model(s) generated if available and the resulting selector. The performance evaluation of the selector will be in the Analysis folder.

For analysis a link to the folder on which the analysis was performed (configuration, portfolio or selection), the performance evaluation from it and the report if it was generated.

Other directories

There are a few other special directories automatically generated by Sparkle.

Reference_Lists: Here Sparkle keeps track of user-defined aliases
Snapshots: Here Sparkle places your saved snapshots
Tmp: Here temporary files are placed that are generated during commands, but should also be removed during the command
Output/Feature_Data: Here Sparkle unifies all known/added Feature Extractors, the Instances and their features if calculated. When an extractor or instance is removed, they are also removed here.
Output/Performance_Data: Here Sparkle unifies all known/added Solvers, the Instances and their recorded objectives if known. When a solver or instance is removed, they are also removed here.

Platform Settings

Most settings can be controlled through the Settings directory, specifically the Settings/sparkle_settings.ini file. Possible settings are summarised per category in Options and possible values. For any settings that are not provided the defaults will be used. Meaning, in the extreme case, that if the settings file is empty (and nothing is set through the command line) everything will run with default values.

For convenience after every command Settings/latest.ini is written with the used settings. Here any overrides by commandline arguments are reflected. This can, for instance, provide the same settings to the next command in a chain. e.g. for generate_report after configure_solver. The used settings are also recorded in the relevant Output/ subdirectory. Note that when writing settings Sparkle always uses the name, and not an alias.

Note

When overriding settings in sparkle_settings.ini with the commandline arguments, this is considered as ‘temporary’ and only denoted in the latest_settings, but does not actually affect the values in sparkle_settings.ini

Example `sparkle_settings.ini`

This is a short example to show the format.

[general]
objective = PAR10
solver_cutoff_time = 60

[configuration]
number_of_runs = 25

[slurm]
number_of_runs_in_parallel = 25

When initialising a new platform, the user is provided with a default settings file, which can be viewed here.

Sparkle Objectives

To define an objective for your algorithms, you can define them in the general section of your Settings.ini like the following:

[general]
objective = PAR10,loss,accuracy:max,train_loss:metric

In the above example we have defined three objectives: Penalised Average Runtime, the loss function value of our algorithm on the task, and the accuracy of our algorithm on the task. Note that objectives are by default assumed to be minimised and we must therefore specify accuracy:max to clarifiy this. Furthermore, you may have certain objectives that you wish to record, but not actually have configurators and algorithms use as an objective. For this we can specificy train_loss:metric, letting the platform now this value will be present but must not be passed as an optimisable objective. The platform predefines for the user three objectives: cpu time, wallclock time and memory. These objectives will always recorded next to whatever the user may choose.

Note

Although the Platform supports multiple objectives to be registered for any Solver, not all used components, such as SMAC and Ablation Analysis, support Multi-Objective optimisation. In any such case, the first defined objective is considered the most important and used in these situations

Moreover, when aggregating an objective over various dimensions, Sparkle assumes the following:

When aggregating multiple Solvers (Algorithms), we aggregate by taking the minimum/maximum value.
When aggregating multiple runs on the same instances, we aggregate by taking the mean.
When aggregating multiple instances, we aggregate by taking the mean.

It is possible to redefine these attributes for your specific objective. The platform looks for a file called objective.py in your Settings directory of the platform, and reads your own object definitions. These definitions can either add new objectives to the platform, but also can overwrite existing definitions in the library. E.g. when creating an objective definition with the same name of one that already exists in the library, the user definiton simply overrules the library definition. Note that there are a few constraints and details:

The objective must inherit from the SparkleObjective class
The classnames are constrained to the format of alphabetical letters followed by numericals
The objective can be parametrised by an integer, such as PAR followed by 10 is interpreted as instantiating the PAR class with argument 10
If your objective is defined over time, you can indicate this using the UseTime enum, see the types module

Slurm

Slurm settings can be specified in the Settings/sparkle_settings.ini file. Any setting in the Slurm section not internally recognised by Sparkle will be added to the sbatch or srun calls. It is advised to overwrite the default settings specific to your cluster, such as the option “–partition” with a valid value on your cluster. Also, you might have to adapt the default “–mem-per-cpu” value to your system. For example, your Slurm section in the sparkle_settings.ini could look like:

[slurm]
partition = CPU
mem-per-cpu = 6000
...
time = 25:00

Discouraged options Currently these settings are inserted as is in any Slurm calls done by Sparkle. This means that any options exclusive to one or the other currently should not be used. The options below are exclusive to sbatch and are thus discouraged:

-–array
-–clusters
-–wrap

The options below are exclusive to srun and are thus discouraged:

-–label

Prepending to Slurm Jobs

In case you have specific scripts that need to be executed before running your job, such as activation of environments, you can specify this in the slurm section like:

[slurm]
...
job_prepend = echo $JOB_ID

In case that you have a multi line script, write it down as a file in the Settings directory, for example “slurm_prepend.sh” and reference it like:

[slurm]
...
job_prepend = Settings/slurm_prepend.sh

Options and possible values

[general]

objective

aliases: objective

values: str, comma seperated for multiple

description: The type of objectives Sparkle considers, see Sparkle Objective section for more.

configurator

aliases: configurator

values: SMAC2

description: The name of the Configurator class implementation to use. Currently only supports SMAC2.

selector_class

aliases: selector_class

values: Class.

Description: The ASF Algorithm selector class to use.

selector_model

aliases: selector_model

values: Model.

Description: The sklearn model to use for algorithm selection.

solution_verifier

aliases: N/A

values: {NONE, SAT}

note: Only available for SAT solving.

solver_cutoff_time

aliases: cutoff_time_each_solver_call

values: integer

description: The time a solver is allowed to run before it is terminated.

extractor_cutoff_time

aliases: cutoff_time_each_feature_computation

values: integer

description: The time a feature extractor is allowed to run before it is terminated. In case of multiple feature extractors this budget is divided equally.

run_on

aliases: run_on

values: LOCAL, SLURM

description: On which compute to run the jobs on.

verbosity

aliases: verbosity

values: QUIET, STANDARD

description: The verbosity level of Sparkle when running CLI.

check_interval

aliases: check_interval

values: int

description: Specifically for the Wait command. The amount of seconds to wait in between refreshing the wait information.

[configuration]

wallclock_time

aliases: wallclock_time

values: integer

description: The wallclock time one configuration run is allowed to use for finding configurations.

cpu_time

aliases: cpu_time

values: integer

description: The cpu time one configuration run is allowed to use for finding configurations.

solver_calls

aliases: solver_calls

values: integer

description: The number of solver calls one configuration run is allowed to use for finding configurations.

number_of_runs

aliases: number_of_runs

values: integer

description: The number of separate configurations runs.

target_cutoff_length

aliases: smac_each_run_cutoff_length

values: {max} (other values: whatever is allowed by SMAC)

[slurm]

number_of_jobs_in_parallel

aliases: num_job_in_parallel

values: integer

description: The number of jobs runs that can run in parallel.

max_parallel_runs_per_node

aliases: clis_per_node

values: integer

description: The number of parallel processes that can be run on one compute node. In case a node has 32 cores and each solver uses 2 cores, the max_parallel_runs_per_node is at most 16.

[ablation]

racing

aliases: ablation_racing

values: boolean

description: Use racing when performing the ablation analysis between the default and configured parameters

[parallel_portfolio]

check_interval

aliases: check_interval

values: int

description: How many seconds the parallel portfolio waits to check whether jobs have completed. Decreasing the amount increases the accuracy of the report but also significantly increases computational load.

num_seeds_per_solver

aliases: num_seeds_per_solver

values: int

description: Only relevant for undeterministic solvers. The amount of solvers that will be started with a random seed.

Priorities

Sparkle has a large flexibility with passing along settings. Settings provided through different channels have different priorities as follows:

Default - Default values will be overwritten if a value is given through any other mechanism;
File – Settings form the Settings/sparkle_settings.ini overwrite default values, but are overwritten by settings given through the command line;
Command line Settings file – Settings files provided through the command line, overwrite default values and other settings files.
Command line - Settings given through the command line overwrite all other settings, including settings files provided through the command line.
Configurators - Each configurator has its own option section and these values will take precedence of any value set in the general configurator section.

Reporting packages

The platform depends on the following user supplied packages to generate its reports:

pdflatex
latex
bibtex