User Guide

Introduction

About NLNZ Tools Scripts Ingestion

NLNZ Tools Scripts Ingestion is a set of scripts related to the processing of SIPs for ingestion into the Rosetta archiving system. The aim is to provide useful tools to help in that processing.

Contents of this document

After this introduction, this User Guide contains the following sections:

  • Fairfax ingestion related scripts - Covers Fairfax ingestion related scripts.
  • Reports scripts - Covers reports-related scripts.
  • Utilities scripts - Covers useful utility scripts.
  • Running requirements - Covers running requirements.

Reports scripts

reports/daily-file-usage-report.py

Provides a daily usage report of a set of subfolders of a given root folder.

Arguments

-h, --help            show this help message and exit
--source_folder SOURCE_FOLDER
                    The root source-folder for the report.
--reports_folder REPORTS_FOLDER
                    The folder where reports exist and get written.
--number_previous_days NUMBER_PREVIOUS_DAYS
                    The number of previous days to include in the report.
                    The default is 0.
--create_reports_folder
                    Indicates that the reports folder will get created. Otherwise it must already exist.
--include_file_details_in_console_output
                    Indicates that individual file details will output to the console as well as the reports file.
--calculate_md5_hash  Calculate and report the md5 hash of individual files (this is a very intensive I/O operation).
--include_dot_directories
                    Include first-level root subdirectories that start with a '.'
--ignore_unchanged_directories
                    Do not report changes for directories that haven't changed.
--verbose             Indicates that operations will be done in a verbose manner.
                    NOTE: This means that no csv report file will be generated.
--debug               Indicates that operations will include debug output.
--test                Indicates that only tests will be run.
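
The --calculate_md5_hash option is I/O intensive because every byte of every file has to be read. The sketch below shows one way such a hash can be computed in fixed-size chunks, so that large files never need to fit in memory. It is an illustration only and is not taken from daily-file-usage-report.py: the function name md5_of_file and the chunk size are assumptions, and the sketch is written for Python 3.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at path, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as source:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: source.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned value is the same kind of hex string that the expectedMd5Hash value of bulk-file-replace.groovy would hold.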

Usage

daily-file-usage-report.py [-h] --source_folder SOURCE_FOLDER
                                  --reports_folder REPORTS_FOLDER
                                  [--number_previous_days NUMBER_PREVIOUS_DAYS]
                                  [--create_reports_folder]
                                  [--include_file_details_in_console_output]
                                  [--calculate_md5_hash]
                                  [--include_dot_directories]
                                  [--ignore_unchanged_directories]
                                  [--verbose] [--debug] [--test]

Example usage

scriptsFolder="/go/repos-nlnzdigitalpreservation/nlnz-tools-scripts-ingestion/reports"
sourceFolder="/media/legaldep-ftp"
reportsFolder="/media/sf_a-laptop-shared-work/ftp-daily-usage-reports"

${scriptsFolder}/daily-file-usage-report.py \
    --source_folder "${sourceFolder}" \
    --reports_folder "${reportsFolder}" \
    --ignore_unchanged_directories \
    --number_previous_days 21

Report output

The console output of the report can be used in a csv file. A csv file is also generated in the reports_folder; it contains a detailed listing of the source folders. This report csv file is then used as input for the next report, as long as it was generated within the number_previous_days.
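
The reuse of an earlier report can be pictured with the sketch below, which picks the most recently modified csv file in a reports folder that is no older than a given number of days. It is a hedged illustration, not the script's actual logic: the function name find_previous_report, the use of file modification time, and the "*.csv" pattern are all assumptions, since this guide does not describe how report files are named or dated. The sketch is written for Python 3.

```python
import time
from pathlib import Path


def find_previous_report(reports_folder, number_previous_days, now=None):
    """Return the newest .csv in reports_folder modified within the window, or None."""
    now = time.time() if now is None else now
    cutoff = now - number_previous_days * 24 * 60 * 60
    candidates = [
        path for path in Path(reports_folder).glob("*.csv")
        if path.is_file() and path.stat().st_mtime >= cutoff
    ]
    # max() with a default avoids an exception when nothing qualifies.
    return max(candidates, key=lambda path: path.stat().st_mtime, default=None)
```

Under these assumptions, a report older than the window is simply ignored and the function returns None.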

Utilities scripts

utilities/bulk-file-rename.py

Simple utility for renaming files in bulk.

Arguments

-h, --help            show this help message and exit
--source_folder SOURCE_FOLDER
                    The root source-folder containing the files to rename.
--file_name_portion_to_replace FILE_NAME_PORTION_TO_REPLACE
                    The portion of the filename that will be replaced.
--file_name_portion_replacement FILE_NAME_PORTION_REPLACEMENT
                    The replacement portion of the filename. If not specified, then an empty string is used.
--verbose             Indicates that operations will be done in a verbose
                    manner. NOTE: This means that no csv report file will
                    be generated.
--debug               Indicates that operations will include debug output.
--test                Indicates that only tests will be run.

Usage

bulk-file-rename.py [-h] --source_folder SOURCE_FOLDER
                         --file_name_portion_to_replace FILE_NAME_PORTION_TO_REPLACE
                         [--file_name_portion_replacement FILE_NAME_PORTION_REPLACEMENT]
                         [--verbose] [--debug] [--test]
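
The renaming itself amounts to a string replacement on each file name. The sketch below illustrates that idea for every file under a source folder. It is not the code of bulk-file-rename.py: the function name bulk_rename and the dry_run parameter are hypothetical, whether the real script descends into subfolders is not stated in this guide, and the sketch is written for Python 3.

```python
import os


def bulk_rename(source_folder, portion_to_replace, replacement="", dry_run=False):
    """Rename files whose names contain portion_to_replace; return (old, new) pairs."""
    if not portion_to_replace:
        raise ValueError("portion_to_replace must not be empty")
    renamed = []
    for folder, _subfolders, file_names in os.walk(source_folder):
        for file_name in file_names:
            if portion_to_replace not in file_name:
                continue
            new_name = file_name.replace(portion_to_replace, replacement)
            old_path = os.path.join(folder, file_name)
            new_path = os.path.join(folder, new_name)
            # Skip rather than overwrite if the new name is already taken.
            if os.path.exists(new_path):
                continue
            if not dry_run:
                os.rename(old_path, new_path)
            renamed.append((old_path, new_path))
    return renamed
```

Calling it with dry_run=True returns the planned renames without touching any file, which is a cautious way to check the effect of a pattern first.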

utilities/bulk-file-replace.groovy

Replaces a set of files that match a given regex with a replacement file. Use of this script may require editing the groovy file. So far the script has been used to bulk replace test PDF files that have the same hash but different names.

Arguments

targetFolder the target folder containing the files that will be matched.
             Note that all the files in the target folder will be checked
             (i.e. subdirectories will be searched as well).
replacementFile the file to replace the matched file with. The replacement file will be copied
                over the matching file.

Edited values

These are values that require editing in the groovy script itself.

regexPattern - the pattern used to match the target file

expectedMd5Hash - the MD5 hash of the target file.

Usage example

utilities/bulk-file-replace.groovy /path/to/target/folder utilities/resources/minimal-jhove-acceptable.pdf
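
As a rough picture of what the replacement does, the following Python sketch walks a target folder, selects files whose names match a regex and whose MD5 hash equals an expected value, and copies a replacement file over each one. The actual script is written in Groovy and is not reproduced here. The function name replace_matching_files is hypothetical, and the assumption that the hash check guards the replacement is inferred from the description of expectedMd5Hash rather than confirmed. The sketch is written for Python 3.

```python
import hashlib
import re
import shutil
from pathlib import Path


def replace_matching_files(target_folder, replacement_file, regex_pattern, expected_md5_hash):
    """Copy replacement_file over matching files under target_folder; return their paths."""
    pattern = re.compile(regex_pattern)
    source = Path(replacement_file).resolve()
    replaced = []
    # rglob("*") also visits subdirectories of the target folder.
    for path in sorted(Path(target_folder).rglob("*")):
        if not path.is_file() or not pattern.search(path.name):
            continue
        # Never copy the replacement file onto itself.
        if path.resolve() == source:
            continue
        # Reads the whole file at once; assumed acceptable for small test files.
        if hashlib.md5(path.read_bytes()).hexdigest() != expected_md5_hash:
            continue
        shutil.copyfile(str(source), str(path))
        replaced.append(path)
    return replaced
```

Files that match the pattern but have a different hash are left untouched, which limits the risk of overwriting real content.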

Running requirements

Python-based scripts

Scripts with a .py extension are Python-based. Currently these scripts run with Python 2.7. None of the scripts have been upgraded to Python 3.

Groovy-based scripts

Those scripts with a .groovy extension are Groovy-based. Currently these scripts run with Groovy 2.5.4 or later and Java OpenJDK 11.

Operating system

These scripts have only been tested and run on Ubuntu Linux 18.