Troubleshooting data package archival in the vault

After a data manager approves a data package for archiving in the vault, the copy-one-coll-to-vault.r script archives the data package asynchronously. Among other things, this involves copying its data from the research collection to the vault collection. If the script fails partway through archiving a data package, the copy-to-vault cronjob status is set to RETRY. The retry-copy-to-vault.r cronjob runs later and tries to finish archiving all data packages that have status PENDING or RETRY. By default, this whole process happens automatically. Please consult the vault process design documentation for more details.

Note that in Yoda versions 1.9 and older, the process is slightly different. The script copy-accepted-folders-to-vault.r archives the data package asynchronously, and if the archival fails in certain situations, the status is set to RETRY. The retry-copy-to-vault.r cronjob runs periodically to try to archive any data packages with a RETRY status.

This page contains an explanation of how to troubleshoot the process if something goes wrong.

Detecting failed archiving jobs

Archival jobs that have failed can be detected using the data package status report tool, which is part of the Yoda client tools.

You can also run this tool in a cronjob to send a report of data packages that have been in the process of being archived or published for a long time, which suggests that something might have gone wrong.

Example command for compiling a list of data packages that have been in the process of being archived or published for more than approximately four hours, and sending the list to an administrator if there are any:

yreport_datapackagestatus --pending --stale --email a.admin@uu.nl

This will also report data packages that are waiting for approval to be archived or published. In such cases, no technical troubleshooting is needed.
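To narrow the list down to data packages that have been flagged for retry, the AVU can also be queried directly. The following is a minimal sketch, not part of the Yoda client tools: the function name list_copy_to_vault_status is hypothetical, and it assumes the status is stored in the org_cronjob_copy_to_vault collection AVU with the value CRONJOB_RETRY, as described under "Restart options".

```shell
# Hypothetical helper: list collections whose copy-to-vault cronjob AVU
# has the given value (defaults to CRONJOB_RETRY).
list_copy_to_vault_status() {
    status="${1:-CRONJOB_RETRY}"
    # One collection path per line; the attribute name is taken from this page.
    iquest "%s" "SELECT COLL_NAME WHERE META_COLL_ATTR_NAME = 'org_cronjob_copy_to_vault' AND META_COLL_ATTR_VALUE = '${status}'"
}
```

Run it as a rodsadmin user, for example: list_copy_to_vault_status CRONJOB_RETRY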

Finding the cause

If the data package has been approved for archiving in the vault (status ACCEPTED), first check whether the cause of the problem can be found in the rodsLog files. Find the copy_to_vault or folder_secure message for the data package in the rodsLog. In Yoda versions 1.9 and older, look for the message iiCopyFolderToVault instead. Then grep for other messages logged by the same pid on the same day, and look for error messages.
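The two grep steps can be combined in a small helper. This is a sketch under assumptions: the function name trace_package is hypothetical, each rodsLog line is assumed to contain a "pid:<number>" field, and the copy_to_vault or folder_secure message is assumed to mention the collection path of the data package. For Yoda 1.9 and older, replace the first pattern with iiCopyFolderToVault.

```shell
# Hypothetical helper: print all log lines written by the process(es)
# that handled the given data package.
trace_package() {
    logfile="$1"    # rodsLog file for the day of the failure
    package="$2"    # collection path of the data package
    # Collect the PIDs of lines that mention a start message and the package.
    pids=$(grep -E 'copy_to_vault|folder_secure' "$logfile" \
        | grep -F -- "$package" \
        | grep -oE 'pid:[0-9]+' | sort -u)
    for pid in $pids; do
        # Require a non-digit (or end of line) after the PID,
        # so that pid:123 does not also match pid:1234.
        grep -E "${pid}([^0-9]|\$)" "$logfile"
    done
}
```

Example usage, assuming the default iRODS 4.2 log location and a hypothetical date:

trace_package /var/lib/irods/log/rodsLog.2024.01.15 /tempZone/home/research-groupname/data-package | grep -i error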

Possible causes include:

  • An issue with one of the source data objects in the research collection results in a failure when copying it. For example: a data object that is in an intermediate state cannot be copied.
  • A restart of the iRODS service while the copy-to-vault process was running.
  • A storage issue, such as a storage resource without free space available.

If the root cause is not transient, it needs to be resolved first. Otherwise, restarting the process would just result in the same problem occurring again.

Restart options

There are two ways to restart the transfer:

Trigger a complete restart of the copy-to-vault process for the data package

Signal the copy-to-vault job to retry the archiving operation by setting the org_cronjob_copy_to_vault AVU to CRONJOB_RETRY. The job will then copy the data package to a new vault folder.

Example command:

imeta set -C "/tempZone/home/vault-collection/data-package[1234567890]" org_cronjob_copy_to_vault CRONJOB_RETRY

Afterwards, you will need to manually remove the vault collection that was created on the first attempt.
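Setting the AVU and confirming that it was stored can be wrapped in one step. This is a minimal sketch: the function name retry_copy_to_vault is hypothetical, and the collection path is whatever path you would pass to the imeta command above. The path is quoted because an unquoted [1234567890] suffix can be interpreted by the shell as a glob pattern.

```shell
# Hypothetical helper: flag a data package for retry and show the resulting AVU.
retry_copy_to_vault() {
    coll="$1"
    # Stop here if the AVU could not be set.
    imeta set -C "$coll" org_cronjob_copy_to_vault CRONJOB_RETRY || return 1
    # List only this attribute, so the new value can be checked at a glance.
    imeta ls -C "$coll" org_cronjob_copy_to_vault
}
```

Example usage:

retry_copy_to_vault "/tempZone/home/vault-collection/data-package[1234567890]"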

Finish the archiving process manually

If the error occurred while copying the contents of the data package, it is also possible to finish the copy job manually. This can be useful if the data package is large and retrying the complete transfer would take a lot of time.

First, complete the synchronization process using the irsync command in a tmux session. For example:

irsync -r -V -s "i:/zoneName/home/research-groupname/data-package" "i:/zoneName/home/vault-groupname/data-package[1234567890]/original"
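It can help to see what is still missing before and after the synchronization. The sketch below wraps the irsync call above and adds a listing pass with the irsync -l option, which lists the files that still need to be synchronized without copying them. The function name finish_vault_copy is hypothetical, and the two paths are the same placeholders as in the example above.

```shell
# Hypothetical wrapper around the irsync command shown above.
finish_vault_copy() {
    src="i:$1"             # research collection
    dst="i:$2/original"    # "original" subcollection of the vault data package
    echo "Still to be copied:"
    irsync -r -l -s "$src" "$dst"
    # The actual synchronization; stop if it fails.
    irsync -r -V -s "$src" "$dst" || return 1
    echo "Remaining after synchronization (ideally nothing):"
    irsync -r -l -s "$src" "$dst"
}
```

Example usage, started inside a tmux session (for instance one created with tmux new -s vault-copy, where the session name is arbitrary):

finish_vault_copy "/zoneName/home/research-groupname/data-package" "/zoneName/home/vault-groupname/data-package[1234567890]"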

After irsync has finished, complete the copy-to-vault process manually using the secure-in-vault rule. Example command:

irule -r irods_rule_engine_plugin-irods_rule_language-instance -F /etc/irods/yoda-ruleset/tools/secure-in-vault.r '*researchCollection="'"/zoneName/home/research-groupname/data-package"'"' '*vaultCollection="'"/zoneName/home/vault-groupname/data-package[1234567890]"'"'
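The nested quoting in this command is easy to get wrong. As an alternative sketch, the two paths can be passed as parameters to a small function that builds the quoted rule arguments. The function name secure_in_vault is hypothetical; the rule file, rule engine instance and variable names are the ones from the command above.

```shell
# Hypothetical wrapper: run secure-in-vault.r for one data package.
secure_in_vault() {
    research="$1"    # research collection of the data package
    vault="$2"       # vault collection, including the [timestamp] suffix
    # Each rule variable is passed as *name="value"; the inner double quotes
    # are escaped so that they reach irule literally.
    irule -r irods_rule_engine_plugin-irods_rule_language-instance \
        -F /etc/irods/yoda-ruleset/tools/secure-in-vault.r \
        "*researchCollection=\"${research}\"" \
        "*vaultCollection=\"${vault}\""
}
```

Example usage:

secure_in_vault "/zoneName/home/research-groupname/data-package" "/zoneName/home/vault-groupname/data-package[1234567890]"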

Finally, check in the portal that the status of the data package in the research collection is Secured, the publication status is Unpublished, and the metadata of the vault package can be viewed. Also check the rodsLog for errors.