Methods

Logically, Methods are computational functions or procedures. They can be implementations of algorithms or numerical recipes, data gathering or transformation steps, AI modules, or generators of visuals and charts. Anything that can be executed and produces a valuable scientific result that needs to be reproducible and repeatable can be written as a Method.

CCP tries to be as lax as possible with respect to the technical constraints for Methods. It aims at supporting every language and every reasonable combination of operating systems and dependencies by providing stacks of Runtimes that offer many ready-made solutions while remaining open to customisation.

Anatomy of a Method

At its heart, a Method is a JSON structure that aggregates a section of metadata, the definition of input parameters, the description of expected outputs, instructions for customising the deploy and execute steps of its lifecycle, and a link to a compatible Infrastructure.

The syntax of the JSON data structure is constrained by the grammar proposed by the OGC Processes API specification (https://ogcapi.ogc.org/processes/).

The following code snippet illustrates an example.

    
    {
        "id":"408d9dc5-ee37-4123-9f07-4294f13bce19",
        "title":"JDK-8 Example maven",
        "description":"Test for executing a jdk8 sample app from GitHub repository built with maven",
        "version":"0.0.1",
        "jobControlOptions":[
          "async-execute"
        ],
        "keywords":[
          "jdk", "java", "jdk8", "java8", "maven"
        ],
        "metadata":[
          {
            "title":"Marco Lettere",
            "role":"author",
            "href":"https://accounts.d4science.org/auth/admin/realms/d4science/users/09138708-9a19-4724-93d1-8c721d591da2"
          },
          {
            "role":"category",
            "title":"Test"
          },
          {
            "title":"%2Fgcube%2Fdevsec%2FCCP",
            "role":"context",
            "href":"https://accounts.dev.d4science.org/auth/admin/realms/d4science/clients/%2Fgcube%2Fdevsec%2FCCP"
          }
        ],
        "outputTransmission":[
          "value"
        ],
        "inputs":{
          "ccpimage":{
            "id":"ccpimage",
            "title":"Runtime",
            "description":"The image of the runtime to use for method execution. This depends on the infrastructure specific protocol for interacting with registries.",
            "minOccurs":1,
            "maxOccurs":1,
            "schema":{
              "type":"string",
              "format":"url",
              "contentMediaType":"text/plain",
              "default":"nubisware/ccp-jdk8-jammy:latest",
              "readonly":"true"
            }
          },
          "repository":{
            "id":"repository",
            "title":"Repository URL",
            "description":"Git url to repository",
            "minOccurs":1,
            "maxOccurs":1,
            "schema":{
              "type":"string",
              "format":"url",
              "default":"https://github.com/dcore94/jdk-maven-example"
            }
          },
          "mainclass":{
            "id":"mainclass",
            "title":"Main Class",
            "description":"The main class to run",
            "minOccurs":1,
            "maxOccurs":1,
            "schema":{
                "type":"string",
                "default":"example.HelloWorld"
            }
          }
        },
        "outputs":{
          "filetext":{
            "id":"filetext",
            "title":"Text output",
            "description":"Some output is written in txt format to file.txt",
            "minOccurs":1,
            "maxOccurs":1,
            "metadata":[
              {
                "title":"file.txt",
                "role":"file",
                "href":"/output/file.txt"
              }
            ],
            "schema":{
              "type":"string",
              "contentMediaType":"text/plain"
            }
          },
          "filexml":{
            "id":"filexml",
            "title":"XML output",
            "description":"Some output is written in XML format to file.xml",
            "minOccurs":1,
            "maxOccurs":1,
            "metadata":[
              {
                "title":"file.xml",
                "role":"file",
                "href":"/ccp_data/output/file.xml"
              }
            ],
            "schema":{
              "type":"string",
              "contentMediaType":"application/xml"
            }
          },
          "filejson":{
            "id":"filejson",
            "title":"JSON output",
            "description":"Some output is written in JSON format to file.json",
            "minOccurs":1,
            "maxOccurs":1,
            "metadata":[
              {
                "title":"file.json",
                "role":"file",
                "href":"/ccp_data/output/file.json"
              }
            ],
            "schema":{
              "type":"string",
              "contentMediaType":"application/json"
            }
          },
          "filecsv":{
            "id":"filecsv",
            "title":"CSV output",
            "description":"Some output is written in CSV format to file.csv",
            "minOccurs":1,
            "maxOccurs":1,
            "metadata":[
              {
                "title":"file.csv",
                "role":"file",
                "href":"/output/file.csv"
              }
            ],
            "schema":{
              "type":"string",
              "contentMediaType":"text/csv"
            }
          }
        },
        "additionalParameters":{
            "parameters":[
              {
                "name":"execute-script",
                "value":[
                  "cd execution",
                  "mkdir -p /ccp_data/output",
                  "java -cp target/jdk-maven-example-0.0.1-SNAPSHOT.jar {{ mainclass }} 1>> /ccp_data/stdout.txt 2>> /ccp_data/stderr.txt",
                  "cp /tmp/file.* /ccp_data/output/"
                ]
              },
              {
                "name":"deploy-script",
                "value":[
                  "git clone {{ repository }} execution 1>> /ccp_data/stdout.txt 2>> /ccp_data/stderr.txt",
                  "cd execution",
                  "mvn clean package 1> /ccp_data/stdout.txt 2>> /ccp_data/stderr.txt",
                  "cd -"
                ]
              },
              {
                "name":"undeploy-script",
                "value":[]
              },
              {
                "name":"cancel-script",
                "value":[]
              }
           ]
        },
        "links":[
            {
            "href":"infrastructures/nubisware-docker-swarm-nfs",
            "rel":"compatibleWith",
            "title":"Docker swarm with NFS on Nubis cluster"
            }
        ]
    }

This is an example of a Method that executes Java 8 code rooted at a main class example.HelloWorld and cloned from a public GitHub repository. The code is built with Maven.

The keywords section contains keywords that help in searching for the Method. The metadata fields author and context show which user has created the descriptor for the Method and in which context. Methods can also contain several category metadata items that help in classifying the Method.

jobControlOptions is hardcoded to "async-execute" because CCP always executes Methods in an asynchronous way.

In the example above the Method has three inputs.

  • ccpimage is required to appear exactly once. This input is mandatory for every Method that will be executed on CCP. It refers to the Runtime required for the Method execution. The input is a plain text string representing the reference to a container image matching the requirements of the Infrastructure. Since the example is compatible with a Docker based Infrastructure, the reference is a name in Docker form repository/image:versiontag. This input is readonly because the default value provided at Method definition time is constrained and not editable.
  • repository is the URL of the Git repository to be cloned. It defaults to an example project but can be edited.
  • mainclass is the main class of the Java application.

The Method declares four example output files encoded as XML, JSON, CSV and plain text. As will be shown later, a Method is not required to return only what it declares as outputs. The output declaration is mainly used to semantically enrich an output.
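In the example descriptor, each declared output carries a metadata entry with role "file" whose href points at the produced file. As a minimal sketch, assuming the descriptor has been loaded into a Python dict, the hypothetical helper declared_output_files (not part of CCP) collects those paths:

```python
def declared_output_files(method):
    """Map each declared output id to the href of its 'file' metadata entry.

    Hypothetical helper for illustration; it only reads the descriptor
    structure shown in the example above.
    """
    files = {}
    for output_id, output in method.get("outputs", {}).items():
        for entry in output.get("metadata", []):
            if entry.get("role") == "file":
                files[output_id] = entry["href"]
    return files


if __name__ == "__main__":
    # Reduced copy of the "outputs" section of the example Method.
    example = {
        "outputs": {
            "filexml": {
                "metadata": [
                    {"title": "file.xml", "role": "file",
                     "href": "/ccp_data/output/file.xml"}
                ]
            }
        }
    }
    print(declared_output_files(example))
```

Outputs without a "file" metadata entry are simply skipped by this sketch.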

The additionalParameters section encodes the scripts governing the Method's lifecycle. The lifecycle of a Method will be described in the following section. In this example, the deploy script takes care of cloning the Git repository passed as input parameter "repository" into a target folder and building the code using Maven. The execute script creates a folder named "output", then launches the main class of the Java application and finally copies the output files (which are created in the /tmp directory by the example Java code) to the output folder. The undeploy and cancel scripts are no-operations because, in an environment based on containers, the clean-up operations are intrinsic.

It is important to note that all inputs declared for the Method can be used as variables in the scripts by putting their id in double curly brackets. There are other variables that can be used in addition to inputs and they will be discussed in section "Execution context of a Method".
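The double curly bracket substitution can be illustrated with a small sketch. This is not CCP's templating engine; render_script is a hypothetical function, and leaving unknown placeholders untouched is an assumption made only for this example.

```python
import re

# Matches {{ name }} with optional spaces around the identifier.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render_script(lines, inputs):
    """Replace every {{ id }} placeholder with the matching input value.

    Illustrative only: unknown placeholders are left untouched here,
    which may differ from what CCP actually does.
    """
    def substitute(match):
        name = match.group(1)
        return str(inputs[name]) if name in inputs else match.group(0)

    return [_PLACEHOLDER.sub(substitute, line) for line in lines]


if __name__ == "__main__":
    script = [
        "cd execution",
        "java -cp target/jdk-maven-example-0.0.1-SNAPSHOT.jar {{ mainclass }}",
    ]
    for line in render_script(script, {"mainclass": "example.HelloWorld"}):
        print(line)
```

With the default input of the example Method, the second line would expand to a java invocation ending in example.HelloWorld.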

The links section encodes the link to the Infrastructure that is declared to be compatible with the Method.

Lifecycle of a Method execution

The following Figure depicts what happens when the execution of a Method is requested by a user either by interfacing with a GUI widget of the CCP or by invoking the REST API.

Figure: Lifecycle of Method execution

The message carrying the execution request is sent to CCP and the execution starts. The first task puts the execution in Launch state. During this phase the Runtime for the execution is prepared. On a container based Infrastructure this usually amounts to using the ccpimage input parameter in order to fetch the container image from a repository and instantiate the container.

After the transition to the Launch state, as for every other transition, the outcome of the operation is evaluated. In case of errors the process terminates by transitioning directly to the Destroy state, thus ensuring that the infrastructure is cleaned up.

After a successful Launch, the Method execution moves into Deploy state. As shown by the script task decorator, this task is scripted, meaning that by default it is a no-operation and the commands to be performed are supplied by the creator of the Method at definition time through the deploy-script attribute. Example operations that could occur in a deploy script are: fetching code from Git repositories, installing fine grained dependencies (for example pip install -r requirements.txt), building code, and downloading resource files.

After the Deploy phase, a Method execution enters the Execute phase. As for the Deploy phase, what exactly happens during this phase is determined by the execute-script provided by the Method creator at Method definition time. Instructions in the execute-script usually invoke the main code components.

The time spent in the Execute phase is limited by the Infrastructure. It is up to the Infrastructure manager to define the maximum amount of time allowed for Method execution. If the Method allows it, the execution time can be further limited by the user requesting the execution, by setting the ccpmaxtime input parameter.

The Fetch phase following a successful Execute phase is a non-scriptable transition in charge of uploading the outputs of a Method execution to the Execution storage.

.. The following Undeploy phase can be used by Method developers to perform operations after the Method execution has terminated. This phase is not meant to be a cleanup task because on containerised Infrastructures the system autonomously takes care of destroying resources at the end of a Method execution. Instead it could be used to perform extra work like notifying external systems or sharing outputs.

It is currently not possible to script the "Undeploy" phase because, on container based Infrastructures, the system autonomously takes care of destroying resources at the end of a Method execution.

Finally, the Destroy phase is when the Infrastructure controller literally destroys the Runtime of the execution and all resources created during the previous phases.
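The phase sequence described in this section, including the jump to Destroy on any failure, can be summarised in a short sketch. The phase names come from the text above; run_lifecycle and the actions mapping are hypothetical and only model the control flow, not the CCP controller itself.

```python
PHASES = ["Launch", "Deploy", "Execute", "Fetch", "Undeploy", "Destroy"]


def run_lifecycle(actions):
    """Walk the phases in order and return the list of phases entered.

    `actions` maps a phase name to a callable returning True on success.
    Phases without an action count as successful no-operations.
    """
    def run(phase):
        action = actions.get(phase)
        return True if action is None else bool(action())

    visited = []
    for phase in PHASES[:-1]:
        visited.append(phase)
        if not run(phase):
            # A failed phase skips the remaining ones and goes to Destroy.
            break
    # Destroy is always reached, so resources are cleaned up in every case.
    visited.append("Destroy")
    run("Destroy")
    return visited


if __name__ == "__main__":
    print(run_lifecycle({}))
    print(run_lifecycle({"Deploy": lambda: False}))
```

In this model a failing deploy script yields the sequence Launch, Deploy, Destroy, while a fully successful run passes through all six phases.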

Execution context of a Method

During the execution of deploy-script, execute-script and undeploy-script, as well as during Method execution, it is possible to access information that is contextual to CCP, the Method or the execution request.

Some information is accessible as template variables that can be used in the scripts with the {{ var }} syntax. Other useful information is passed directly to the Method execution as environmental variables that can be accessed with the proper APIs that every programming language supports.

All input parameters are passed in the form of template variables to the scripts. This allows the script to adapt the input values to the requirements of the Method (passing them directly as command-line arguments, setting them as environmental variables, or writing them to files).

There are a few input parameters that are used to govern the Method execution itself rather than providing input to the Method. In particular:

  • ccpimage, as already mentioned, is required and is used automatically during the Launch phase in order to instantiate a container.
  • ccpnote is a special input field that, when present, is used to tag the executions in order to provide better visual feedback.
  • ccpmaxtime can be used to limit the maximum execution time of a Method. The value is expressed in seconds and it is capped by the maximum time configured for the Infrastructure.
  • ccpreplicas, currently supported on Docker swarm based Infrastructures, allows for creating multiple instances of a Method execution in order to obtain a coarse grained degree of parallelism.
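The capping rule for ccpmaxtime can be written down as a one-line computation. The sketch below is an assumption about how the two limits combine, based only on the description above; effective_max_time is a hypothetical name, and rejecting non-positive values is a choice made for this example rather than documented CCP behaviour.

```python
def effective_max_time(infrastructure_max, requested=None):
    """Return the time limit in seconds that would apply to an execution.

    `infrastructure_max` is the cap configured for the Infrastructure,
    `requested` is the optional ccpmaxtime value supplied by the user.
    """
    if requested is None:
        # No user limit: only the Infrastructure cap applies.
        return infrastructure_max
    if requested <= 0:
        raise ValueError("ccpmaxtime must be a positive number of seconds")
    # The user can lower the limit but never raise it above the cap.
    return min(requested, infrastructure_max)


if __name__ == "__main__":
    print(effective_max_time(3600, 600))
    print(effective_max_time(3600, 7200))
```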

A set of environmental variables is passed to the Runtime inside of which the Method is executed in order to provide additional context.

The following two environmental variables provide context for the execution.

  • ccptaskname is the id of the execution.
  • ccptaskid is the index of the replica (1-based) when multiple replicas are requested with the input parameter ccpreplicas. This can be used to customise the behavior of a replica like accessing a slice of a dataset or writing output to different files.
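As a minimal sketch of the slicing use case mentioned above, a Method written in Python could read ccptaskid from the environment and pick every n-th item of a dataset. The total number of replicas is assumed here to be handed to the program by the execute-script (for instance as an argument built from the ccpreplicas input); replica_index and slice_for_replica are hypothetical helpers.

```python
import os


def replica_index(environ=os.environ):
    """Return the 1-based replica index, defaulting to 1 when unset."""
    return int(environ.get("ccptaskid", "1"))


def slice_for_replica(items, index, replicas):
    """Return the items assigned to replica `index` (1-based) of `replicas`."""
    if not 1 <= index <= replicas:
        raise ValueError("index must be between 1 and replicas")
    # Round-robin assignment: replica 1 takes items 0, n, 2n, and so on.
    return items[index - 1::replicas]


if __name__ == "__main__":
    dataset = list(range(10))
    # Assumption: three replicas were requested for this execution.
    print(slice_for_replica(dataset, replica_index(), 3))
```

The same index could be used to build distinct output file names so that replicas do not overwrite each other's results.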

The following variables are related to the authentication and authorization context of the Method execution. They can be used to access D4Science services in a secure and convenient way also for very long lasting executions.