Initial code and processes: a folder of tools for reviewing scanned documents and transcribing audio.


Phase 1 – Review and Structure Scanned Documents

1.  Create a Structured Word Document – pattern-structure2i.py

  • It checks whether a given string can be interpreted as a date using the ‘is_date’ function.
  • It reads a Word document specified by the provided file path.
  • It creates a new Word document.
  • It iterates through the paragraphs in the original document.
  • For each paragraph, it splits the text at occurrences of the word “received” (case-insensitive).
  • It processes each part of the split paragraph, formatting it in the following way:
  • Words are added one by one to the new paragraph.
  • Words following “received” are displayed in red and bold.
  • Words following a “received” section that appears to be a date are displayed in blue and bold.
  • It saves the newly formatted document with a filename that includes “formatted_and_date_highlighted_” followed by the original document’s filename (a minimal sketch follows this list).
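
  A minimal sketch of this approach, assuming the python-docx library and a simple ‘is_date’ helper; the date formats checked and the file names are illustrations, not the original script’s values.

      import re
      from datetime import datetime
      from docx import Document
      from docx.shared import RGBColor

      def is_date(text):
          """Return True if the string parses as a date in a known format (assumed list)."""
          for fmt in ("%m/%d/%Y", "%m/%d/%y", "%B %d, %Y"):
              try:
                  datetime.strptime(text.strip(",."), fmt)
                  return True
              except ValueError:
                  pass
          return False

      def highlight_received(file_path):
          src = Document(file_path)
          out = Document()
          for para in src.paragraphs:
              # Split at "received" (case-insensitive), keeping the delimiter.
              parts = re.split(r"(received)", para.text, flags=re.IGNORECASE)
              new_para = out.add_paragraph()
              after_received = False
              for part in parts:
                  is_marker = part.lower() == "received"
                  for word in part.split():
                      run = new_para.add_run(word + " ")
                      if is_marker or after_received:
                          run.bold = True
                          # Dates after "received" in blue; other words in red.
                          blue = after_received and is_date(word)
                          run.font.color.rgb = (RGBColor(0x00, 0x00, 0xFF) if blue
                                                else RGBColor(0xFF, 0x00, 0x00))
                  if is_marker:
                      after_received = True
          out.save("formatted_and_date_highlighted_" + file_path)

      highlight_received("scanned_pages.docx")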

  2.   Create Structured Excel – word-excel-1.py

    • The resulting Excel file divides the document at every occurrence of the term “received,” making it easier to merge with transcriptions in the next step (a sketch follows this list).
      • Reads a Word document specified by the provided file path.
      • Iterates through the document’s paragraphs and runs.
      • Identifies text segments that are marked as “received” (in red and bold).
      • Groups these “received” segments and the text that follows them into rows.
      • Creates a Pandas DataFrame from these rows.
      • Saves the DataFrame as an Excel file named ‘output.xlsx’ without an index.
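
      A minimal sketch of word-excel-1.py’s logic, assuming python-docx and pandas; the row-grouping details are inferred from the description above.

        import pandas as pd
        from docx import Document
        from docx.shared import RGBColor

        RED = RGBColor(0xFF, 0x00, 0x00)

        doc = Document("formatted_and_date_highlighted_scanned_pages.docx")
        rows, current = [], None
        for para in doc.paragraphs:
            for run in para.runs:
                is_marker = (run.bold and run.font.color.rgb == RED
                             and "received" in run.text.lower())
                if is_marker:
                    # Each highlighted "received" segment starts a new row.
                    if current is not None:
                        rows.append(current)
                    current = {"Received": run.text.strip(), "Text": ""}
                elif current is not None:
                    current["Text"] += run.text
        if current is not None:
            rows.append(current)

        pd.DataFrame(rows).to_excel("output.xlsx", index=False)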

   3.  Analyze the Excel File and Resolve Issues Affecting the Data

    • Issue 1: Inches Representation
      • The inches are represented inconsistently as “ll”, “11”, “|1”, and “||” throughout the file when referring to audio formats. We know audio reels come in various sizes, such as 3″, 5″, 7″, and 10″.
      • Resolution: Replace the inconsistent characters with the appropriate format sizes (3″, 5″, 7″, 10″).
      • Add the term “format:” to the text to clarify the audio format.
    • Issue 2: Time Representation
      • Time is represented inconsistently throughout the original documents, using seconds (e.g., :10, :30, :60) and, in some cases, “minutes.”
      • The OCR has rendered 3″ followed by :30 as “JII:JO,” which the Python script must replace with the correct terms.
    • Issue 3: Other Identifiable Terminology
      • Continue to identify additional issues that affect data management.
      • Record identifiable conditions and elements to build an effective script (states, titles, special characters).
      • Run the replace-add-format script to assist in managing the data (an illustrative sketch follows this list).
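
      An illustrative sketch of the replace-add-format step, assuming pandas and the output of Step 2; the replacement rules below are built from the issues listed above, not the script’s actual mapping.

        import pandas as pd

        df = pd.read_excel("output.xlsx")

        # OCR-garbled reel sizes: "3ll", "5|1", "7||", "10 11", etc. become 3″, 5″, ...
        df["Text"] = df["Text"].str.replace(
            r'\b(3|5|7|10)\s*(?:ll|11|\|1|\|\|)', r'\1″', regex=True)

        # Garbled time notation: 3″ followed by :30 came through as "JII:JO".
        df["Text"] = df["Text"].str.replace("JII:JO", '3″ :30', regex=False)

        # Prefix recognized reel sizes with "format:" to clarify the audio format.
        df["Text"] = df["Text"].str.replace(
            r'\b((?:3|5|7|10)″)', r'format: \1', regex=True)

        df.to_excel("output_cleaned.xlsx", index=False)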

   4.  Transcribe Audio Files

    • Run the audio-recog.py script, which performs speech recognition on audio files and extracts their speech content. It utilizes the ‘SpeechRecognition’ library (‘speech_recognition’). A sketch follows this list.
      • Imports necessary libraries (‘os’, ‘speech_recognition’, ‘pandas’).
      • Specifies a folder path containing audio files (‘folder_path’).
      • Initializes a speech recognizer (‘recognizer’) from the ‘SpeechRecognition’ library.
      • Iterates over the files in the specified folder, checking if they are audio files (‘.wav’, ‘.mp3’, or ‘.flac’).
      • For each audio file, it loads the file, recognizes the speech using Google’s speech recognition (‘recognize_google’), and appends the extracted text along with the file name to a list (‘data’).
      • Handles errors that may occur during the process.
      • Creates a Pandas DataFrame (‘df’) from the data.
      • Specifies an output file path for an Excel file (‘output_path’).
      • Writes the DataFrame to the Excel file.
      • Prints a completion message with the path to the saved Excel file.
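
      A minimal sketch of audio-recog.py, assuming the ‘SpeechRecognition’ and ‘pandas’ packages; the folder and output paths are placeholders. Note that ‘sr.AudioFile’ reads WAV, AIFF, and FLAC natively, so MP3 files may need conversion first.

        import os
        import pandas as pd
        import speech_recognition as sr

        folder_path = "audio_files"          # folder of audio files to transcribe
        output_path = "transcriptions.xlsx"  # Excel file to write

        recognizer = sr.Recognizer()
        data = []
        for name in os.listdir(folder_path):
            if not name.lower().endswith((".wav", ".mp3", ".flac")):
                continue
            try:
                with sr.AudioFile(os.path.join(folder_path, name)) as source:
                    audio = recognizer.record(source)
                text = recognizer.recognize_google(audio)
                data.append({"File": name, "Text": text})
            except (sr.UnknownValueError, sr.RequestError) as err:
                # Skip files the recognizer cannot handle, but report them.
                print(f"Could not transcribe {name}: {err}")

        df = pd.DataFrame(data)
        df.to_excel(output_path, index=False)
        print(f"Transcription complete; results saved to {output_path}")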

   5.   Extract Named Entities

    • (Option 1) Run the names-extract.py script to recognize and extract proper nouns (names of people, places, organizations, etc.); a sketch follows this list.
      • Imports the necessary libraries (‘pandas’ and ‘spacy’).
      • Loads the English language model for spaCy (‘en_core_web_sm’).
      • Loads data from an Excel file into a Pandas DataFrame (‘df’).
      • Checks if a ‘Text’ column exists in the DataFrame.
      • Defines a function (‘extract_proper_nouns’) to extract proper nouns (e.g., names of people, organizations, locations) from text using spaCy.
      • Applies the ‘extract_proper_nouns’ function to the ‘Text’ column and creates a new column ‘Text_proper_nouns’ in the DataFrame.
      • Writes the updated DataFrame back to an Excel file named ‘data_with_proper_nouns.xlsx’.
      • If the ‘Text’ column doesn’t exist, it prints a message indicating so.
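
      A minimal sketch of names-extract.py, assuming spaCy’s ‘en_core_web_sm’ model and an Excel file with a ‘Text’ column; the entity labels kept (PERSON, ORG, GPE, LOC) are an assumption.

        import pandas as pd
        import spacy

        nlp = spacy.load("en_core_web_sm")
        df = pd.read_excel("transcriptions.xlsx")

        def extract_proper_nouns(text):
            """Return named entities (people, organizations, places) as one string."""
            doc = nlp(str(text))
            ents = [ent.text for ent in doc.ents
                    if ent.label_ in ("PERSON", "ORG", "GPE", "LOC")]
            return ", ".join(ents)

        if "Text" in df.columns:
            df["Text_proper_nouns"] = df["Text"].apply(extract_proper_nouns)
            df.to_excel("data_with_proper_nouns.xlsx", index=False)
        else:
            print("No 'Text' column found in the input file.")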

    • (Option 2) Run the advanced names extraction.py script to train and use a named entity recognition (NER) model that extracts named entities from Excel, PDF, or other file types. An alternate script, start-nltk.py, is available for the political science data used in this project; it was built from an annotated control list and an NER-structured list derived from those annotations. A condensed sketch follows this list.
      • Imports necessary libraries, including ‘os’, ‘pandas’, ‘spacy’, ‘pdfplumber’, ‘train_test_split’ from ‘sklearn.model_selection’, and ‘Example’ from ‘spacy.training.example’.
      • Defines a function ‘train_model’ to train a Named Entity Recognition (NER) model using spaCy.
      • Loads annotated data from a CSV file into a Pandas DataFrame.
      • Splits the data into training and evaluation sets using ‘train_test_split’.
      • Initializes a blank spaCy model and adds a custom NER component with specific labels.
      • Converts labels to the BILOU format (Begin, Inside, Last, Outside, Unit) as required for NER.
      • Converts training data to spaCy format and prepares training examples.
      • Initializes the model with the ‘tok2vec’ component.
      • Defines an optimizer and loss function.
      • Trains the NER model using the training examples.
      • Saves the trained model to a specific location.
      • Loads the trained model.
      • Defines functions to extract text and named entities from PDF files and save them to a text file.
      • Executes the script when run as the main program, extracting named entities from a PDF file and saving them to a text file.
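
      A condensed sketch of the training portion, in spaCy 3.x conventions; the CSV column names (‘text’, ‘start’, ‘end’, ‘label’) and paths are assumptions, ‘Example.from_dict’ handles the BILOU conversion internally, and the evaluation and PDF-extraction steps are omitted for brevity.

        import pandas as pd
        import spacy
        from sklearn.model_selection import train_test_split
        from spacy.training.example import Example

        df = pd.read_csv("annotated.csv")  # assumed columns: text, start, end, label
        train_df, eval_df = train_test_split(df, test_size=0.2, random_state=42)

        nlp = spacy.blank("en")
        ner = nlp.add_pipe("ner")
        for label in df["label"].unique():
            ner.add_label(label)

        # Convert rows to spaCy Examples (one entity span per row here).
        examples = []
        for _, row in train_df.iterrows():
            doc = nlp.make_doc(row["text"])
            ents = {"entities": [(int(row["start"]), int(row["end"]), row["label"])]}
            examples.append(Example.from_dict(doc, ents))

        optimizer = nlp.initialize(lambda: examples)
        for epoch in range(10):
            losses = {}
            nlp.update(examples, sgd=optimizer, losses=losses)
            print(f"epoch {epoch}: {losses}")

        nlp.to_disk("custom_ner_model")
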
    • (Option 3) Run the PDF-metadata-extract.py script, which extracts metadata and text content from PDF and TXT files located in a specified directory. The extracted information is then organized into an Excel spreadsheet for further analysis. A skeletal sketch follows this list.
      • Imports necessary libraries, including ‘os’, ‘pandas’, ‘spacy’, ‘time’, ‘re’, ‘fitz’ (from PyMuPDF), ‘Path’ (from ‘pathlib’), and ‘SpellChecker’ (from ‘spellchecker’).
      • Loads the English language model for spaCy and initializes a spell checker.
      • Asks the user for input and output directories.
      • Defines functions to extract dates, locations, entities, persons, political parties, subjects, super PACs, and keywords from text.
      • Loads predefined labels from a CSV file.
      • Initializes a spaCy Matcher and adds patterns based on the loaded labels.
      • Defines a custom stoplist and a function to extract keywords and two-word phrases.
      • Defines a function to correct spelling errors in text.
      • Processes PDF and TXT files in the specified input directory, extracting metadata and performing spelling correction.
      • Creates a DataFrame containing extracted metadata.
      • Saves the DataFrame to an Excel file in the specified output directory.
      • Measures the total processing time and prints it.
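
      A skeletal sketch of the PDF-metadata-extract.py flow, assuming PyMuPDF (‘fitz’), spaCy, and pandas; the directory prompts match the description, but the metadata fields shown are a subset, and the Matcher patterns and spelling correction are omitted for brevity.

        import os
        import time
        import fitz  # PyMuPDF
        import pandas as pd
        import spacy

        nlp = spacy.load("en_core_web_sm")
        input_dir = input("Input directory: ")
        output_dir = input("Output directory: ")

        start = time.time()
        records = []
        for name in os.listdir(input_dir):
            path = os.path.join(input_dir, name)
            if name.lower().endswith(".pdf"):
                with fitz.open(path) as pdf:
                    text = "".join(page.get_text() for page in pdf)
            elif name.lower().endswith(".txt"):
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    text = fh.read()
            else:
                continue
            doc = nlp(text)
            records.append({
                "File": name,
                "Dates": "; ".join(e.text for e in doc.ents if e.label_ == "DATE"),
                "Persons": "; ".join(e.text for e in doc.ents if e.label_ == "PERSON"),
                "Locations": "; ".join(e.text for e in doc.ents if e.label_ == "GPE"),
            })

        pd.DataFrame(records).to_excel(
            os.path.join(output_dir, "metadata.xlsx"), index=False)
        print(f"Processed in {time.time() - start:.1f}s")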

   6.   Data Cleanup

    • Run the standard-fuzzy-clean-1.py script to perform fuzzy string matching that standardizes values (from a controlled vocabulary or other sources) in specific columns of an Excel file, based on a reference column containing the standardized values; a sketch follows this list.
      • Imports necessary libraries, including ‘pandas’ for data manipulation and ‘fuzzywuzzy’ for fuzzy string matching.
      • Reads an input Excel file specified by the ‘input_file’ variable into a pandas DataFrame (‘df’).
      • Specifies the columns to be compared and standardized using the ‘column_to_standardize’ and ‘reference_column’ variables.
      • Defines a function ‘find_best_match’ to find the best fuzzy match for a value in the ‘column_to_standardize’ from the values in the ‘reference_column’.
      • Obtains unique values from the ‘reference_column’ and stores them in the ‘reference_values’ variable.
      • Iterates through the values in the ‘column_to_standardize’, finding the best match for each value among the ‘reference_values’ and appending the best match to the ‘standardized_names’ list.
      • Adds a new column called ‘Standardized Names’ to the DataFrame (‘df’) containing the standardized values.
      • Saves the updated DataFrame to a new Excel file specified by the ‘output_file’ variable.
      • Prints a message indicating that the standardized names have been saved to the output file.
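
      A minimal sketch of the fuzzy standardization step, assuming the ‘fuzzywuzzy’ package and pandas; the column names are placeholders, and both columns are assumed to live in the same file.

        import pandas as pd
        from fuzzywuzzy import process

        input_file = "data_with_proper_nouns.xlsx"
        output_file = "standardized.xlsx"
        column_to_standardize = "Text_proper_nouns"
        reference_column = "Controlled_Vocabulary"

        df = pd.read_excel(input_file)
        reference_values = df[reference_column].dropna().unique().tolist()

        def find_best_match(value):
            """Return the closest controlled-vocabulary term for a raw value."""
            match, score = process.extractOne(str(value), reference_values)
            return match

        df["Standardized Names"] = df[column_to_standardize].apply(find_best_match)
        df.to_excel(output_file, index=False)
        print(f"Standardized names saved to {output_file}")
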
    • Run clean-1.py to clean text data: it checks cell contents, verifies text, and removes leading and trailing spaces, building on the spelling correction, entity identification, and standardization from the previous steps. Analyze the data and resolve issues such as inconsistent formatting or terminology. A sketch follows this list.
      • Imports the ‘openpyxl’ library, which is used for working with Excel files.
      • Defines a function called ‘clean_text_in_excel’ that takes two parameters: ‘file_path’ (the path to the Excel file) and ‘sheet_name’ (the name of the sheet to process).
      • Inside the function:
        • Loads the Excel workbook specified by ‘file_path’ into a workbook object.
        • Selects the sheet with the name provided in ‘sheet_name’ and assigns it to the ‘sheet’ variable.
        • Iterates over the rows in the selected sheet.
        • Within each row, it iterates over the cells.
        • Checks if the cell contains text (as indicated by ‘cell.data_type == "s"’).
        • If the cell contains text, it cleans the text by removing leading and trailing spaces and converting it to lowercase.
        • Updates the cell’s value with the cleaned text.
      • After processing all the cells in the sheet, it saves the updated workbook using ‘workbook.save(file_path)’.
      • An example usage of the function is provided, demonstrating how to clean text in a specific sheet of an Excel file (‘file_path’ and ‘sheet_name’ are specified).
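
      A minimal sketch of clean-1.py, assuming ‘openpyxl’; the file path and sheet name in the example usage are placeholders.

        import openpyxl

        def clean_text_in_excel(file_path, sheet_name):
            """Trim whitespace and lowercase every text cell in the named sheet."""
            workbook = openpyxl.load_workbook(file_path)
            sheet = workbook[sheet_name]
            for row in sheet.iter_rows():
                for cell in row:
                    if cell.data_type == "s":  # text cell
                        cell.value = cell.value.strip().lower()
            workbook.save(file_path)

        clean_text_in_excel("standardized.xlsx", "Sheet1")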

   7.   Compare and Link Data

    • Consider combining both datasets into one workbook with tabs, although two separate worksheets will also work (edit the script to define this). One worksheet will contain the filenames of the audio files, their transcriptions, and columns of named entities and dates. The second worksheet will contain the data extracted from the scanned pages, sorted and structured for management.
    • Run the record-link.py script to identify and link records in two different datasets that likely refer to the same entity, despite variations in data formatting or errors. The report will be output to a new Excel worksheet, identifying enough linkage to match the audio files’ names with the extracted information that describes them. A condensed sketch follows this list. Note: The threshold used in the classification step (len(df1.columns) × 0.75) can be adjusted to the specific needs and characteristics of the datasets being compared. Lowering the threshold may increase the likelihood of finding matches but also the chances of false positives; raising it may reduce false positives but miss potential matches. Fine-tune the threshold based on the data and use case.
      • Importing Libraries:
        • ‘import pandas as pd’: imports the pandas library for data manipulation.
        • ‘import recordlinkage’: imports the ‘recordlinkage’ library, which is used for record linkage and deduplication tasks.
      • Loading Data:
        • Two Excel files are loaded into pandas DataFrames (‘df1’ and ‘df2’). These DataFrames represent the datasets that you want to link or compare.
      • Indexation Step:
        • An ‘Index’ object from ‘recordlinkage’ is created using ‘indexer.full()’. This step defines the indexing method, which is “full” in this case, meaning that it will create all possible pairs of records between the two datasets.
      • Comparison Step:
        • A ‘Compare’ object is created to specify how columns from the two datasets should be compared.
        • The code iterates through columns in ‘df1’, checks their data type (text, numeric, or boolean), and sets up comparison methods accordingly.
          • For text columns, it uses the Jaro-Winkler similarity method with a threshold of 0.85.
          • For non-text columns, it uses an exact match.
        • The ‘compare.compute()’ function computes the comparison features for all pairs of records, resulting in a DataFrame of comparison scores.
      • Classification Step:
        • The code uses a threshold to classify pairs of records as matches.
        • In this example, pairs with a sum of comparison scores greater than ‘len(df1.columns) * 0.75’ are considered matches. You can adjust this threshold to meet your specific requirements.
      • Printing Matches:
        • The code prints the pairs of matching records based on the classification step’s threshold.
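
      A condensed sketch of record-link.py, assuming the ‘recordlinkage’ library and that both worksheets share the same column names; the file names are placeholders.

        import pandas as pd
        import recordlinkage

        df1 = pd.read_excel("transcriptions_sheet.xlsx")
        df2 = pd.read_excel("scanned_pages_sheet.xlsx")

        # Indexation: compare every record in df1 against every record in df2.
        indexer = recordlinkage.Index()
        indexer.full()
        pairs = indexer.index(df1, df2)

        # Comparison: fuzzy-match text columns, exact-match everything else.
        compare = recordlinkage.Compare()
        for col in df1.columns:
            if df1[col].dtype == object:
                compare.string(col, col, method="jarowinkler",
                               threshold=0.85, label=col)
            else:
                compare.exact(col, col, label=col)

        features = compare.compute(pairs, df1, df2)

        # Classification: pairs scoring above 75% of the column count are matches.
        matches = features[features.sum(axis=1) > len(df1.columns) * 0.75]
        print(matches)
        matches.to_excel("linkage_report.xlsx")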

   8.   Create a Working Metadata Worksheet

    • Combine and organize the data into a new metadata worksheet.
    • The new worksheet should include identification, transcription data, and legacy data.

   9.   Streamline the Workflow

    • Consider combining Steps 1-6 into one or two Python scripts to improve workflow efficiency.
    • The streamlined workflow will serve as a baseline collection management platform for newly acquired, large-scale collections.

   10.   Export Worksheet XML to Audio Files

    • Export the metadata worksheet as XML to the audio files.
    • Apply any necessary renaming, convert preservation copies to preservation access copies, and create access copies as per the repository policy.
    • Organize the exported files into folders for migration to the server.

 

Carl Albert Congressional Research and Studies Center | University of Oklahoma https://www.ou.edu/carlalbertcenter