Creating a Dataset#

This section covers how to create a dataset in QCArchive and add entries to it. With minor changes this is also how you add entries to an existing dataset, so both are covered here.

The steps are straightforward
  1. Create the dataset in QCArchive

  2. Get one or more molecules into SEAMM

  3. Upload them to the dataset

This flowchart is a very simple example of doing this.

Creating a QCArchive dataset

Flowchart to create a QCArchive dataset#

The steps are
  1. Set up a parameter to get the SMILES string

  2. Create the dataset in QCArchive if it doesn’t exist

  3. Create the structure from SMILES

  4. Quickly optimize the structure with a forcefield to make it reasonable

  5. Add the structure to QCArchive as an entry in the dataset

  6. List the entries in the dataset to check that it worked

Obviously this a toy example. In reality you would want to impoprt a number of structures, probably from files, using a loop. But this simple flowchart captures the essence of the problem.

The first QCArchive step is set up like this

Options for creating a QCArchive dataset

Options for creating a QCArchive dataset#

Hopfully this is straightforward and easy to understand. The interesting option is the last, saying that it is OK if the dataset exists. The first time we run this it will create the singlepoint dataset paul_test1 in QCArchive. Subsequently, it won’t do anything becasue the dataset exists, so we can run this flowchart a number of times to add several molecules.

The next couple of steps in the flowchart creates the structure from the SMILES string that the initial Parameter step captured, and then optimizes it with a forcefield in order to clean the structure.

The second QCArchive set adds the structure to the dataset

QCArchive step edit options dialog

Add the structure to the QCArchive dataset#

If all goes according to plane, the last QCArchive step will list the entries in the dataset so that we can check that the new structure has been added.

Options dialog for listing entries

Options for listing the entries in the dataset#

The output from running the flowchart is:

Running the flowchart
---------------------
Step 0: Start  2023.4.8

Step 1: Parameters  2023.1.23
    The following variables have been set from command-line arguments,
    environment variables, a configuration file, (.ini), or a default value, in
    that order.

    +------------+----------+-------------+-----------------------------+
    | Variable   | Value    | Set From    | Description                 |
    +============+==========+=============+=============================+
    | SMILES     | c1ccccc1 | commandline | The SMILES for the molecule |
    +------------+----------+-------------+-----------------------------+

Step 2: QCArchive  2023.3.28
    Will create a new singlepoint  dataset paul_test1 in the QCArchive
    Validation Project Server.

    Created singlepoint dataset paul_test1.

Step 3: from SMILES  2021.10.13
    Create the structure from the SMILES 'c1ccccc1', overwriting the current
    configuration. The name of the system will be the canonical SMILES of the
    structure. The name of the configuration will be initial.

    Created a molecular structure with 12 atoms.
           System name = c1ccccc1
    Configuration name = initial

Step 4: QuickMin  2023.1.14
    Minimizing the structure with the best available forcefield, with a maximum
    of 1000 steps. The optimized structure will be put in a new configuration
    with 'optimized with ' as its name.

    The minimization converged in 74 steps to 17.872 kJ/mol. The final structure was
    saved in the new configuration named 'optimized with GAFF'.

Step 5: QCArchive  2023.3.28
    Will add the current configuration to the singlepoint dataset paul_test1 in
    the QCArchive Validation Project Server.

    Added c1ccccc1/optimized with GAFF to the dataset.

Step 6: QCArchive  2023.3.28
    Will list the entries in the singlepoint dataset paul_test1 in the QCArchive
    Validation Project Server.

    There are 1 entries in the singlepoint dataset paul_test1:
        c1ccccc1/optimized with GAFF

That looks good! The last couple lines let us know that we added an entry for benzene to the dataset. If we run again with toluene as the molecule the last part looks like this:

There are 2 entries in the singlepoint dataset paul_test1:
    Oc1ccccc1/optimized with GAFF
    c1ccccc1/optimized with GAFF

which is correct: the dataset now has two entries, benzene and toluene.