implementation of controls

In this phase of the (re)processing, the primary objective was to implement control mechanisms for data cohesion. Intern students were provided with the integrated data. Their assigned workflow included identifying irregularities and common grammatical elements, and standardizing topics, subjects, offices, names, and other relevant data elements.

cleaning and standardizing data

The pandemic posed significant challenges in developing and implementing efficient workflows. Processing and cleansing data became more complex due to limitations like students accessing files from international locations, limited technology availability, and restricted access to students for training or skill development. To address confusion and accessibility issues, a workflow was designed to streamline applications, condense Python scripts, and incorporate other automation methods. The objective was to improve overall efficiency and effectiveness despite the pandemic-induced constraints.

The workflow was organized based on academic semesters (Fall, Spring, Summer) and allocated to internship students. These students were divided into three groups: Group A, B, and C. Each student was given specific ‘chunks’ of data, which were obtained by dividing the master worksheet. They also received a metadata master sheet containing their assigned lines, a folder with original scans from the aggregated information, and another folder with available videos.
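As a rough illustration of how a master worksheet could be divided into 'chunks' and handed out, the sketch below splits a list of rows into fixed-size pieces and assigns them round-robin. The function names, the chunk size, and the student labels are assumptions made for the example; the original scripts are not described in enough detail to reproduce them.

```python
from itertools import cycle


def split_into_chunks(rows, chunk_size):
    """Split the master worksheet rows into consecutive chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def assign_chunks(chunks, students):
    """Assign chunks to students round-robin; returns {student: [chunk, ...]}."""
    assignments = {student: [] for student in students}
    for student, chunk in zip(cycle(students), chunks):
        assignments[student].append(chunk)
    return assignments


# Hypothetical 10-line worksheet and hypothetical student labels.
master_rows = [{"line": n} for n in range(1, 11)]
chunks = split_into_chunks(master_rows, chunk_size=4)
assignments = assign_chunks(chunks, ["student_a1", "student_b1", "student_c1"])
```

Each student's assigned line numbers can then be read back from `assignments` to build the per-student metadata sheet.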

Automated delivery points were set up for each folder, triggering updates and reports. Once the work was completed, it was deposited into a Quality Assurance (QA) folder, which automatically assigned it to the groups responsible for quality checking. After undergoing two QA checks, the work was automatically sent to the collection archivist for finalization and preparation for the second phase¹.
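The routing rule described above (two QA checks, then on to the archivist) can be sketched as a small function. This is only a model of the logic: the destination labels `qa_1`, `qa_2`, and `archivist` are hypothetical, and the actual delivery points were presumably built with folder automation rather than code like this.

```python
QA_CHECKS_REQUIRED = 2  # the text describes two QA checks before finalization


def next_destination(qa_checks_passed):
    """Return the (hypothetical) folder a deposit should move to next."""
    if qa_checks_passed < 0:
        raise ValueError("qa_checks_passed cannot be negative")
    if qa_checks_passed >= QA_CHECKS_REQUIRED:
        return "archivist"
    return f"qa_{qa_checks_passed + 1}"


def trace_route():
    """List every stop a freshly deposited chunk makes on its way to the archivist."""
    stops, passed = [], 0
    while True:
        stop = next_destination(passed)
        stops.append(stop)
        if stop == "archivist":
            return stops
        passed += 1
```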

Sample Student Access ‘Groups A-B’ Processes:

Sort sheet by LAST NAME – FIRST NAME (workflow specific)
Implement controlled standardization script to standardize irregularities (a match must reach 87% confidence before it is edited)
- Names and nicknames vary for the candidate
- ICPSR numbers may vary for the candidate
- States may vary for the candidate
- Parties may vary for the candidate
Review terms or strings corrected and add extracted text to ‘control sheet-text’.
Implement controlled abbreviations script to standardize all acronyms, short-hand, jargon. Add extracted/corrected text to ‘control sheet-tab abbr.’
Implement error-A_B-1 script – this will export a report with highlighted ‘suggestions, alerts’ for human review.
Clear duplicates from master lists and amend (add parties, states, ICPSR numbers, etc…)
Back up files
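The 87% confidence rule in the Groups A-B steps can be sketched as follows. The source does not say how confidence was calculated, so this example assumes a string-similarity ratio from Python's standard `difflib` module; the function names and the sample names are hypothetical.

```python
from difflib import SequenceMatcher

CONFIDENCE_THRESHOLD = 0.87  # from the workflow: 87% confidence required to edit


def best_match(value, controlled_terms):
    """Return (term, score) for the controlled term most similar to value."""
    scored = [
        (term, SequenceMatcher(None, value.lower(), term.lower()).ratio())
        for term in controlled_terms
    ]
    return max(scored, key=lambda pair: pair[1])


def standardize(value, controlled_terms):
    """Replace value with its controlled form only when the match is confident.

    Returns (result, score, status); low-confidence values are left untouched
    and flagged so that a person can review them.
    """
    term, score = best_match(value, controlled_terms)
    if score >= CONFIDENCE_THRESHOLD:
        status = "unchanged" if term == value else "edited"
        return term, score, status
    return value, score, "needs review"


# Hypothetical control list and inputs.
control_names = ["Kennedy, John", "Nixon, Richard"]
corrected = standardize("Kenedy, John", control_names)
flagged = standardize("Smith, J.", control_names)
```

Values that come back as "edited" would be logged to the control sheet, while "needs review" values would go into the report for human review.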

Sample Student Access ‘Groups C’ Processes:

Sort sheet by LAST NAME – FIRST NAME (workflow specific)
Run fuzzy-comp-1 script to verify NAMES and ABBREVIATIONS and make sure they are in the CONTROL-SHEET-TAB-TEXT and CONTROL-SHEET-TAB-ABBRV.
Run error-C-1 script for report of ‘suggested’ alerts, spell checker
Implement controlled abbreviations script to standardize all acronyms, short-hand, jargon. Add extracted/corrected text to ‘control sheet-tab abbr.’
Sort sheet LAST NAME – FIRST NAME/OFFICE/ELECTION YEAR
Back up files
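A check like the one attributed to fuzzy-comp-1 (confirming that names and abbreviations already exist in the control sheet tabs) might look like the sketch below. The real script is not shown in the source; this version assumes `difflib.get_close_matches` from the standard library, and the `cutoff` value and sample terms are illustrative only.

```python
from difflib import get_close_matches


def verify_against_control(values, control_terms, cutoff=0.8):
    """Report every value missing from the control list.

    Each report entry carries the closest controlled term as a suggestion,
    or None when nothing is similar enough to suggest.
    """
    control = set(control_terms)
    report = []
    for value in values:
        if value in control:
            continue  # already a controlled term, nothing to report
        suggestions = get_close_matches(value, control_terms, n=1, cutoff=cutoff)
        report.append({
            "value": value,
            "suggestion": suggestions[0] if suggestions else None,
        })
    return report


# Hypothetical control-sheet entries and worksheet values.
control_sheet_text = ["Kennedy, John", "Nixon, Richard"]
worksheet_names = ["Kennedy, John", "Kenedy, John", "Zzz"]
name_report = verify_against_control(worksheet_names, control_sheet_text)
```

Note that the comparison here is case-sensitive; normalizing case first would be a reasonable variation.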

Sample Student Access ‘Groups QA-Rover-1’ Processes:

Files will be automatically added to the assigned QA/R folder
Perform data validation and standardization based on the controlled vocabulary with QA-1 script. This script will identify and handle any errors or inconsistencies in the data. This will generate a report highlighting the issues found and suggest possible corrections.
Review report and address issues (save report and notes/changes/alterations made).
Back up files
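The validation step above can be approximated with a short routine that compares each field against a controlled vocabulary and collects issues with suggested corrections. This is a sketch, not the QA-1 script itself: the field names, vocabulary values, and report layout are all assumptions.

```python
# Hypothetical controlled vocabulary; the real lists lived in the control sheets.
CONTROLLED_VOCABULARY = {
    "party": {"DEM", "REP", "IND"},
    "state": {"OK", "TX", "LA"},
}


def validate_rows(rows, vocabulary):
    """Return a list of issues found when checking rows against the vocabulary.

    A value that only differs by case or surrounding whitespace gets a
    suggested correction; anything else is reported with no suggestion.
    """
    issues = []
    for line_number, row in enumerate(rows, start=1):
        for field, allowed in vocabulary.items():
            value = row.get(field, "")
            cleaned = value.strip().upper()
            if cleaned in allowed:
                if cleaned != value:
                    issues.append({"line": line_number, "field": field,
                                   "value": value, "suggestion": cleaned})
            else:
                issues.append({"line": line_number, "field": field,
                               "value": value, "suggestion": None})
    return issues


# Hypothetical worksheet rows.
sample_rows = [
    {"party": "DEM", "state": "OK"},
    {"party": "dem ", "state": "Oklahoma"},
]
qa_report = validate_rows(sample_rows, CONTROLLED_VOCABULARY)
```

The resulting list could be saved alongside the reviewer's notes, matching the "save report and notes" step.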

Sample Student Access ‘Groups QA-Rover-2’ Processes:

Open new metadata worksheet and sort by ‘Component ID’
Open the ‘KanterVideos’ folder, press CTRL+A to select all files, then hold SHIFT, right-click, and choose ‘Copy as path’
Paste into a new tab named ‘files’ in the metadata worksheet (CTRL+A, CTRL+H, replace the path, leaving only the P-#-# identifier)
In the metadata sheet, copy the Component ID column into the ‘files’ tab next to the just-pasted video files.
Select both columns and, in Excel, go to ‘Home’ > ‘Conditional Formatting’ > ‘Highlight Cells Rules’ > ‘Duplicate Values’
Review the entries that appear in both the video list and the Excel sheet by sorting the columns by cell color, then clear those cells, leaving only the non-duplicates
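The same comparison between the video files and the Component ID column can be scripted. The sketch below assumes that each video file name (minus folder and extension) is its Component ID, as the P-#-# pattern in the steps suggests; the paths and IDs shown are made up for illustration.

```python
from pathlib import PureWindowsPath


def component_id_from_path(path):
    """Strip the folder and extension from a copied path, leaving the ID.

    Windows 'Copy as path' wraps each path in double quotes, so those are
    removed first.
    """
    return PureWindowsPath(path.strip().strip('"')).stem


def compare_videos_to_metadata(video_paths, component_ids):
    """Return the IDs that appear on only one side (the non-duplicates)."""
    video_ids = {component_id_from_path(p) for p in video_paths}
    sheet_ids = {c.strip() for c in component_ids}
    return {
        "video_without_metadata": sorted(video_ids - sheet_ids),
        "metadata_without_video": sorted(sheet_ids - video_ids),
    }


# Hypothetical pasted paths and Component ID column.
pasted_paths = [
    r'"C:\KanterVideos\P-1-1.mp4"',
    r'"C:\KanterVideos\P-1-2.mp4"',
    r'"C:\KanterVideos\P-9-9.mp4"',
]
component_id_column = ["P-1-1", "P-1-2", "P-2-5"]
mismatches = compare_videos_to_metadata(pasted_paths, component_id_column)
```

Set differences give the two lists of leftovers directly, which corresponds to what remains in the spreadsheet after the highlighted duplicates are cleared.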

PRACTICE EXERCISES

Footnotes

¹ Pryse, JA. Archival Education and Research Institute AERI (June 19-23, 2023) – Julian P. Kanter Collection: Chaos and Order. Louisiana State University.