.. _creating_a_dataset:

******************
Creating a Dataset
******************

This section covers how to create a dataset in QCArchive and add entries to it. With
minor changes this is also how you add entries to an existing dataset, so both are
covered here.

The steps are straightforward:

#. Create the dataset in QCArchive
#. Get one or more molecules into SEAMM
#. Upload them to the dataset

This :download:`flowchart <./create_dataset.flow>` is a very simple example of doing
this.

.. figure:: ./create_dataset.png
   :width: 300px
   :align: center
   :alt: Creating a QCArchive dataset

   Flowchart to create a QCArchive dataset

The steps in the flowchart are:

#. Set up a parameter to get the SMILES string
#. Create the dataset in QCArchive if it doesn't exist
#. Create the structure from SMILES
#. Quickly optimize the structure with a forcefield to make it reasonable
#. Add the structure to QCArchive as an entry in the dataset
#. List the entries in the dataset to check that it worked

Obviously this is a toy example. In reality you would want to import a number of
structures, probably from files, using a loop. But this simple flowchart captures the
essence of the problem.

The first QCArchive step is set up like this:

.. figure:: ./create_options.png
   :width: 300px
   :align: center
   :alt: Options for creating a QCArchive dataset

   Options for creating a QCArchive dataset

Hopefully this is straightforward and easy to understand. The interesting option is the
last, which says that it is OK if the dataset already exists. The first time we run the
flowchart it will create the singlepoint dataset *paul_test1* in QCArchive.
Subsequently, this step won't do anything because the dataset exists, so we can run the
flowchart a number of times to add several molecules.

The next couple of steps in the flowchart create the structure from the SMILES string
that the initial Parameters step captured, and then optimize it with a forcefield in
order to clean up the structure.

The second QCArchive step adds the structure to the dataset:

.. figure:: ./add_to_dataset.png
   :width: 300px
   :align: center
   :alt: QCArchive step edit options dialog

   Add the structure to the QCArchive dataset

If all goes according to plan, the last QCArchive step will list the entries in the
dataset so that we can check that the new structure has been added.

.. figure:: ./list_entries.png
   :width: 300px
   :align: center
   :alt: Options dialog for listing entries

   Options for listing the entries in the dataset

The output from running the flowchart is::

    Running the flowchart
    ---------------------
    Step 0: Start  2023.4.8

    Step 1: Parameters  2023.1.23
        The following variables have been set from command-line arguments,
        environment variables, a configuration file, (.ini), or a default value, in
        that order.

        +------------+----------+-------------+-----------------------------+
        | Variable   | Value    | Set From    | Description                 |
        +============+==========+=============+=============================+
        | SMILES     | c1ccccc1 | commandline | The SMILES for the molecule |
        +------------+----------+-------------+-----------------------------+

    Step 2: QCArchive  2023.3.28
        Will create a new singlepoint dataset paul_test1 in the QCArchive Validation
        Project Server.

        Created singlepoint dataset paul_test1.

    Step 3: from SMILES  2021.10.13
        Create the structure from the SMILES 'c1ccccc1', overwriting the current
        configuration. The name of the system will be the canonical SMILES of the
        structure. The name of the configuration will be initial.

        Created a molecular structure with 12 atoms.
               System name = c1ccccc1
        Configuration name = initial

    Step 4: QuickMin  2023.1.14
        Minimizing the structure with the best available forcefield, with a maximum
        of 1000 steps. The optimized structure will be put in a new configuration
        with 'optimized with ' as its name.

        The minimization converged in 74 steps to 17.872 kJ/mol.
        The final structure was saved in the new configuration named 'optimized
        with GAFF'.

    Step 5: QCArchive  2023.3.28
        Will add the current configuration to the singlepoint dataset paul_test1 in
        the QCArchive Validation Project Server.

        Added c1ccccc1/optimized with GAFF to the dataset.

    Step 6: QCArchive  2023.3.28
        Will list the entries in the singlepoint dataset paul_test1 in the QCArchive
        Validation Project Server.

        There are 1 entries in the singlepoint dataset paul_test1:
            c1ccccc1/optimized with GAFF

That looks good! The last couple of lines let us know that we added an entry for
benzene to the dataset. If we run again with phenol (SMILES ``Oc1ccccc1``) as the
molecule, the last part looks like this::

        There are 2 entries in the singlepoint dataset paul_test1:
            Oc1ccccc1/optimized with GAFF
            c1ccccc1/optimized with GAFF

which is correct: the dataset now has two entries, benzene and phenol.
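
The logic of the flowchart, and the loop over several structures mentioned earlier, can
be sketched in a few lines of Python. This is only an illustration under stated
assumptions: ``ToyServer``, ``ToyDataset``, ``add_structure`` and ``list_entries`` are
hypothetical in-memory stand-ins, not the SEAMM or QCArchive API, and the SMILES string
itself stands in for the optimized structure. The sketch shows why allowing the dataset
to exist makes repeated runs safe, and how the entry names seen in the output are built
from the system and configuration names.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Hypothetical in-memory stand-in for a QCArchive dataset."""

    name: str
    dataset_type: str = "singlepoint"
    entries: dict = field(default_factory=dict)


class ToyServer:
    """Hypothetical stand-in for a QCArchive server (not the real API)."""

    def __init__(self):
        self._datasets = {}

    def create_dataset(self, name, dataset_type="singlepoint", exist_ok=True):
        """Create the dataset, or return the existing one if exist_ok is True."""
        key = (dataset_type, name)
        if key in self._datasets:
            if not exist_ok:
                raise ValueError(f"{dataset_type} dataset {name} already exists")
            return self._datasets[key]
        dataset = ToyDataset(name, dataset_type)
        self._datasets[key] = dataset
        return dataset


def add_structure(dataset, system_name, configuration_name, structure):
    """Add an entry named '<system>/<configuration>', as in the output above."""
    entry_name = f"{system_name}/{configuration_name}"
    dataset.entries[entry_name] = structure
    return entry_name


def list_entries(dataset):
    """Return the entry names; sorting here is a choice made for this sketch."""
    return sorted(dataset.entries)


server = ToyServer()

# Each pass mimics one run of the flowchart with a different SMILES string.
for smiles in ["c1ccccc1", "Oc1ccccc1"]:
    # Safe to repeat: only the first call actually creates the dataset.
    dataset = server.create_dataset("paul_test1", exist_ok=True)
    # Placeholder: the SMILES string stands in for the optimized structure.
    add_structure(dataset, smiles, "optimized with GAFF", structure=smiles)

entry_names = list_entries(dataset)
print(
    f"There are {len(entry_names)} entries in the "
    f"{dataset.dataset_type} dataset {dataset.name}:"
)
for entry_name in entry_names:
    print(f"    {entry_name}")
```

Calling ``create_dataset`` with ``exist_ok=False`` in this sketch raises an error on the
second pass, which is the behavior the "OK if the dataset exists" option is meant to
avoid when the same flowchart is run repeatedly.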