Safe De-Identification Before AI Processing (Checklist + Redaction Rules)

Andrew Russo
Posted in Zoom Apr 17 · 18 Apr, 2026

To use AI tools safely, remove personally identifiable information (PII) and sensitive identifiers before you upload files. The simplest approach is a repeatable checklist, clear redaction rules, and a two-version workflow so you keep an original copy while sharing only a de-identified version. This guide gives you standard redaction markers, practical steps, and “never upload” guardrails.

Key takeaways

  • Use a two-version workflow: keep an original “Source” file and create a separate “AI-Safe” de-identified file.
  • Redact consistently with standard markers like [NAME], [EMAIL], and [ACCOUNT_ID].
  • De-identification is not just names; it includes IDs, metadata, small clues, and combinations that can re-identify someone.
  • Decide up front what the AI task needs, then redact everything else.
  • Some data should never be uploaded to general-purpose AI tools, even if you redact.

What “de-identification” means (and what it does not)

De-identification means you remove or replace information that can identify a person, directly or indirectly. This includes obvious items (like names) and less obvious items (like unique job titles, rare locations, or a full face in a photo).

De-identification does not automatically make data “risk-free.” People can still be re-identified when multiple small details combine, or when unique identifiers remain in filenames, headers, or embedded metadata.

PII vs. sensitive data: treat both as “redact by default”

PII identifies a person (like a full name or email). Sensitive data can harm someone if exposed (like medical details, finances, or a private complaint), even if it does not include a name.

If you are unsure which category something fits, treat it as sensitive and redact it unless the AI task truly needs it.

When you must follow formal rules

Some data types come with legal or contractual limits (for example, health, education, and financial records). If your organization is covered by specific regulations, follow those requirements and your internal policy before you use any AI tool.

For healthcare data in the U.S., review the HHS guidance on HIPAA de-identification.

What should never be uploaded to AI tools (even “redacted”)

Some information is so sensitive, high-risk, or hard to sanitize that it should not go into general-purpose AI tools at all. If you cannot confidently control where the content goes, who can access it, and how it is retained, keep it out.

  • Passwords, passcodes, PINs, and security answers (including temporary codes in emails or SMS screenshots).
  • Private keys, API keys, access tokens, session cookies, and authentication QR codes.
  • Full payment card data (especially full PAN + expiry + CVV), bank login details, or full account access instructions.
  • Unmasked government ID images (passports, driver’s licenses, national IDs) and full ID numbers.
  • Full medical records with direct identifiers, unless you have an approved, compliant workflow.
  • Minors’ sensitive records (health, school discipline, custody, abuse reports) unless you have formal authorization and controls.
  • Highly sensitive legal or HR materials with names and unique case details (whistleblower reports, terminations, investigations) unless approved.
  • Raw database exports containing unique identifiers (customer IDs tied to other systems), even if they “look anonymous.”

If you need AI help with these, create a synthetic example (made-up data) that preserves the structure but contains no real identifiers.
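
As a rough sketch of what a synthetic example can look like, the Python snippet below generates made-up support tickets that keep a realistic structure. The field names, the `CUST-` ID format, and the category list are assumptions for illustration; `example.com` is a domain reserved for documentation, so the addresses cannot belong to a real person.

```python
import random

def synthetic_tickets(count, seed=0):
    """Return made-up ticket records; nothing here comes from real data."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    categories = ["billing", "login", "shipping"]  # hypothetical categories
    return [
        {
            "customer_id": f"CUST-{rng.randrange(10**6):06d}",
            "email": f"user{i}@example.com",
            "issue_category": rng.choice(categories),
        }
        for i in range(1, count + 1)
    ]

sample = synthetic_tickets(3)
```

Because every value is generated, you can paste `sample` into an AI tool to ask about structure or logic without exposing a real customer.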

De-identification checklist (before you paste, upload, or connect a file)

Use this checklist every time you prepare content for AI processing. It works for text, audio transcripts, screenshots, spreadsheets, and PDFs.

Step 1: Define the minimum the AI needs

  • Write down the task in one sentence (example: “Summarize customer pain points from support tickets”).
  • List the fields the AI must see (example: issue category, product, error message).
  • List what it does not need (example: name, email, address, order number).
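
One way to enforce this step is an allowlist: you name the fields the AI must see and drop everything else by default. The Python sketch below assumes tickets are stored as dictionaries; the field names and sample values are hypothetical.

```python
# Hypothetical allowlist for the task
# "Summarize customer pain points from support tickets".
ALLOWED_FIELDS = frozenset({"issue_category", "product", "error_message"})

def keep_allowed_fields(record, allowed=ALLOWED_FIELDS):
    """Return a copy of the record that contains only allowlisted fields."""
    return {key: value for key, value in record.items() if key in allowed}

ticket = {
    "name": "Example Person",
    "email": "person@example.com",
    "order_number": "A-0001",
    "issue_category": "billing",
    "product": "Example App",
    "error_message": "Payment failed",
}
safe_ticket = keep_allowed_fields(ticket)
```

An allowlist fails safe: a new field added to the export later stays out of the AI-Safe copy until someone approves it.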

Step 2: Create a two-version workflow (Source vs. AI-Safe)

  • Source version (restricted): original file with full details, stored in your controlled environment.
  • AI-Safe version (shareable): a copy where you redact identifiers and sensitive content.
  • Never edit the Source file “in place.” Always make a copy and redact the copy.
  • Name files clearly, like ProjectX_Interview01_SOURCE and ProjectX_Interview01_AI-SAFE.
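
If you prepare files with a script, the copy step can be automated so nobody redacts the Source file by mistake. This is a minimal Python sketch that assumes the `_SOURCE` and `_AI-SAFE` naming shown above; the function name is made up for illustration.

```python
import shutil
from pathlib import Path

def make_ai_safe_copy(source_path):
    """Copy a *_SOURCE file to a sibling *_AI-SAFE file and return the new path.

    The Source file is only read, never modified.
    """
    source_path = Path(source_path)
    if not source_path.stem.endswith("_SOURCE"):
        raise ValueError("expected a file name ending in _SOURCE")
    safe_stem = source_path.stem[: -len("_SOURCE")] + "_AI-SAFE"
    safe_path = source_path.with_name(safe_stem + source_path.suffix)
    if safe_path.exists():
        # Refuse to overwrite an existing AI-Safe file that may already be redacted.
        raise FileExistsError(f"{safe_path} already exists")
    shutil.copyfile(source_path, safe_path)  # copies contents only
    return safe_path
```

The copy still contains every identifier at this point. It only becomes "AI-Safe" after you apply the redaction steps that follow.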

Step 3: Remove direct identifiers (high priority)

  • Full names, unique nicknames, and initials when they are tied to a role.
  • Email addresses, usernames, social media handles.
  • Phone numbers, fax numbers.
  • Full postal addresses and precise locations (apartment, street).
  • Government IDs (SSN, national ID, passport, driver’s license).
  • Account numbers, policy numbers, patient IDs, student IDs, employee IDs.
  • Biometric identifiers (face images, fingerprints), where applicable.

Step 4: Remove indirect identifiers (easy to miss)

  • Exact dates tied to a person (birth date, admission date, termination date).
  • Rare job titles, unique roles (“the only pediatric surgeon in town”), or small teams.
  • Exact employer + location combinations.
  • Case numbers, ticket IDs, invoice numbers that connect to other systems.
  • Unique device identifiers (IMEI, MAC address), serial numbers.
  • IP addresses and precise GPS coordinates.

Step 5: Redact metadata and file “leaks”

  • Remove names in filenames (e.g., rename “Jane_Doe_Performance_Review.pdf”).
  • Check document properties (author name, tracked changes, comments).
  • Remove hidden spreadsheet tabs, filters, and notes.
  • For images/screenshots, crop out browser tabs, notifications, and contact photos.

Step 6: Sanity-check for re-identification risk

  • Ask: “Could someone guess who this is from the remaining details?”
  • Search within the file for “@”, “+1”, “DOB”, “SSN”, “Account”, “MRN”, “Invoice”, and similar strings.
  • Spot-check the first page, last page, headers/footers, and any tables.
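
The string search in this step can be scripted for plain-text files. The Python sketch below checks each line for the literal strings listed above, plus two extra patterns (an email-like shape and long digit runs) that are assumptions added for illustration. It flags lines for a human to review and will produce false positives; it does not prove a file is clean.

```python
import re

# Literal strings from the checklist, matched case-insensitively.
LITERAL_FLAGS = ["@", "+1", "DOB", "SSN", "Account", "MRN", "Invoice"]
# Extra illustrative patterns (assumptions, not a complete set).
PATTERN_FLAGS = {
    "email-like": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "long digit run": re.compile(r"\d{7,}"),
}

def find_leftovers(text):
    """Return (line_number, reason) pairs for lines that need a manual look."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for literal in LITERAL_FLAGS:
            if literal.lower() in lowered:
                hits.append((number, f"contains {literal!r}"))
        for reason, pattern in PATTERN_FLAGS.items():
            if pattern.search(line):
                hits.append((number, reason))
    return hits
```

An empty result means only that these particular strings are absent, so the spot-checks in this step still apply.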

Step 7: Document what you changed

  • Keep a short redaction log (date, editor, rules used, version name).
  • If needed internally, keep a separate mapping file that links placeholders to real identities.
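
A redaction log can be as simple as one JSON line per AI-Safe file. The Python sketch below records the four items listed above; the function name and key names are assumptions, so adapt them to your own policy.

```python
import json
from datetime import date

def redaction_log_entry(editor, rules_used, version_name, when=None):
    """Build one JSON line for a redaction log."""
    entry = {
        "date": (when or date.today()).isoformat(),
        "editor": editor,
        "rules_used": sorted(rules_used),  # the markers applied, e.g. "[NAME]"
        "version_name": version_name,
    }
    return json.dumps(entry)

log_line = redaction_log_entry(
    editor="A. Editor",
    rules_used={"[NAME]", "[EMAIL]"},
    version_name="ProjectX_Interview01_AI-SAFE",
    when=date(2026, 4, 18),
)
```

Store the log with the Source file, not with the AI-Safe copy, and keep any mapping file in a separate restricted location.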

Standardized redaction markers (rules you can reuse)

Standard markers make your data easier to read and safer to handle. They also prevent “near misses,” like leaving a first name in one paragraph and redacting it elsewhere.

Core marker set (simple and consistent)

  • [NAME] (or role-based: [CUSTOMER_NAME], [AGENT_NAME])
  • [EMAIL]
  • [PHONE]
  • [ADDRESS] (or [CITY], [STATE] when partial location is needed)
  • [DOB]
  • [GOV_ID] (SSN, passport, driver’s license)
  • [ACCOUNT_ID] (customer ID, policy number, MRN, student ID)
  • [PAYMENT] (card/bank details)
  • [COMPANY] (when the company name is sensitive)
  • [LOCATION] (when any place reference could identify)
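
For identifiers with a predictable shape, a script can apply these markers before the manual pass. The Python sketch below covers three of them with deliberately simple patterns: email addresses, the US SSN layout, and North American phone formats. The patterns are illustrative assumptions that will miss other formats, and names cannot be found reliably this way.

```python
import re

# Order matters: the SSN pattern runs before the looser phone pattern.
RULES = [
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("[GOV_ID]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(
        r"(?<!\w)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b")),
]

def apply_markers(text):
    """Replace matches with standard markers; return (text, counts per marker)."""
    counts = {}
    for marker, pattern in RULES:
        text, replaced = pattern.subn(marker, text)
        if replaced:
            counts[marker] = replaced
    return text, counts

redacted, counts = apply_markers(
    "Email jane.doe@example.com or call 555-123-4567. SSN 123-45-6789."
)
```

The counts give you something to record in a redaction log and a quick signal when a file contains far more identifiers than expected.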

Numbering rules (keep relationships without exposing identity)

If the same person appears multiple times, keep them consistent with a stable label. Use numbering so the AI can follow the story without knowing the person.

  • Use [PERSON_1], [PERSON_2] for general narratives.
  • Use [PATIENT_1], [EMPLOYEE_1], [WITNESS_1] when roles matter.
  • Use [ORG_1], [CLINIC_1], [VENDOR_1] for organizations.
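
Stable numbering is easy to get wrong by hand, so here is a small Python sketch of the idea. It assumes you already have a list of the names to replace (the class and method names are hypothetical), and it matches exact spellings only, so variants like a first name used alone still need a manual pass.

```python
import re

class Pseudonymizer:
    """Assign stable numbered placeholders such as [PERSON_1] to known names."""

    def __init__(self):
        self.mapping = {}     # real value -> placeholder; keep this private
        self._counters = {}   # role -> last number used

    def label_for(self, real_value, role="PERSON"):
        if real_value not in self.mapping:
            self._counters[role] = self._counters.get(role, 0) + 1
            self.mapping[real_value] = f"[{role}_{self._counters[role]}]"
        return self.mapping[real_value]

    def redact(self, text, names_by_role):
        """Replace each listed name with its placeholder, longest names first."""
        for role, names in names_by_role.items():
            for name in names:
                self.label_for(name, role)
        for name in sorted(self.mapping, key=len, reverse=True):
            label = self.mapping[name]
            text = re.sub(rf"\b{re.escape(name)}\b", lambda _m, l=label: l, text)
        return text

p = Pseudonymizer()
story = p.redact(
    "Ana Ruiz emailed Ben Ito. Later, Ben Ito called Ana Ruiz.",
    {"PERSON": ["Ana Ruiz", "Ben Ito"]},
)
```

The `mapping` dictionary is the private mapping file in miniature: it links placeholders back to real identities, so it must never travel with the AI-Safe version.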

Partial redaction rules (only when needed)

Sometimes you need the shape of data for troubleshooting or analytics. If you keep partial values, do it in a controlled and consistent way.

  • Email: replace the whole address with [EMAIL: redacted] rather than leaving the domain visible.
  • Phone: [PHONE: last4=1234] only if the last 4 digits are required for matching.
  • Dates: replace with month/year (example: [DATE: 2026-04]) or relative time (example: [DATE: 3 months ago]).
  • Locations: keep region level (example: [STATE] instead of street).
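
The date and phone rules above can be expressed as small helper functions. This Python sketch uses hypothetical function names and assumes dates are in the past and phone numbers arrive as free-form strings.

```python
from datetime import date

def generalize_date(value):
    """Keep only year and month, e.g. [DATE: 2026-04]."""
    return f"[DATE: {value:%Y-%m}]"

def months_ago(value, today):
    """Express a past date relative to today, counted in calendar months."""
    months = (today.year - value.year) * 12 + (today.month - value.month)
    unit = "month" if months == 1 else "months"
    return f"[DATE: {months} {unit} ago]"

def phone_last4(raw_phone):
    """Keep the last four digits only; fall back to a full redaction."""
    digits = [ch for ch in raw_phone if ch.isdigit()]
    if len(digits) < 4:
        return "[PHONE]"
    return f"[PHONE: last4={''.join(digits[-4:])}]"
```

Use `phone_last4` only when matching really requires the digits. In every other case the plain [PHONE] marker is the safer default.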

Do not use black boxes that hide meaning

A marker like [REDACTED] everywhere makes analysis harder and invites mistakes. Prefer specific tags so you can audit what you removed.

Practical de-identification steps by content type

Different file types hide identifiers in different places. Use the most relevant steps below before you upload anything.

Text documents (notes, emails, PDFs)

  • Find-and-replace names, emails, and phone numbers with your standard markers.
  • Check headers, signatures, quoted email threads, and attachments listed inline.
  • Remove footers that include addresses, ticket IDs, or confidentiality lines with names.

Spreadsheets (CSVs, Excel exports)

  • Delete columns the AI does not need (best option) instead of redacting cell-by-cell.
  • Replace unique IDs with new random IDs, and store the mapping separately.
  • Scan for hidden tabs and pivot tables that still contain raw data.
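
For CSV exports, the first two steps can be combined in one pass. The Python sketch below keeps only the columns you name and swaps each unique ID for a random one, returning the mapping so you can store it separately. The function name, the `ID_` prefix, and the column names in the usage example are assumptions.

```python
import csv
import io
import secrets

def deidentify_csv(source_text, keep_columns, id_column):
    """Return (ai_safe_csv_text, mapping) for CSV text with a header row."""
    mapping = {}  # original ID -> random ID; store apart from the AI-Safe file
    reader = csv.DictReader(io.StringIO(source_text))
    output = io.StringIO()
    writer = csv.DictWriter(
        output, fieldnames=[id_column, *keep_columns], lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        original_id = row[id_column]
        if original_id not in mapping:
            new_id = "ID_" + secrets.token_hex(4)
            while new_id in mapping.values():  # regenerate on a rare collision
                new_id = "ID_" + secrets.token_hex(4)
            mapping[original_id] = new_id
        safe_row = {column: row[column] for column in keep_columns}
        safe_row[id_column] = mapping[original_id]
        writer.writerow(safe_row)
    return output.getvalue(), mapping

source = (
    "customer_id,name,email,issue\n"
    "C-100,Ana Ruiz,ana@example.com,login\n"
    "C-100,Ana Ruiz,ana@example.com,billing\n"
    "C-200,Ben Ito,ben@example.com,login\n"
)
safe_csv, id_mapping = deidentify_csv(source, ["issue"], "customer_id")
```

Rows from the same customer still share an ID in the output, so the AI can group them, but the new ID no longer matches anything in your other systems.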

Audio and transcripts (calls, interviews, meetings)

  • Redact spoken PII in the transcript using markers like [PHONE] and [ADDRESS].
  • If you share audio, remember the voice itself can identify a person in some contexts.
  • Consider sharing transcript-only for analysis tasks when audio is not required.

If you need a transcript quickly for review, you can start with automated transcription and then de-identify the text before any broader AI use.

Screenshots and images

  • Crop to the minimum needed area, then cover identifiers with solid blocks; blurring and pixelation can sometimes be reversed.
  • Remove profile photos, names in nav bars, email subjects, and notification banners.
  • Watch for small text like URLs that contain usernames or account IDs.

Chat logs and support tickets

  • Redact signatures and auto-filled fields (name, email, address).
  • Replace ticket IDs with [TICKET_ID] if they link back to a CRM.
  • Keep problem text, steps taken, error codes, and outcomes when useful.

Pitfalls that cause “accidental re-identification”

Most de-identification failures happen because of process gaps, not because people do not know what PII is. These are the issues to watch for.

Inconsistent redaction across the file

  • Someone redacts the first name once but leaves it in another paragraph.
  • A table is redacted, but the same data appears in an appendix or footnote.

Leaving “linkable” IDs in place

  • Internal IDs can look harmless, but they often connect to a person in another system.
  • Order numbers, claim numbers, and case IDs can also reveal identity when combined with other details.

Over-sharing context

  • “Our only CFO in Boise” can identify someone without a name.
  • Exact dates plus a rare event can narrow it down to one person.

Forgetting non-text data

  • Tracked changes, comments, and document properties can contain names.
  • Images can include faces, badges, and screens with open inboxes.

Assuming a tool will “handle it”

Do not rely on an AI tool to remove sensitive data after you upload it. Redact before you paste, upload, or connect any integration.

Decision criteria: is your file “AI-safe” yet?

Use these quick checks to decide whether your AI-Safe version is ready to share.

  • Purpose check: Every field remaining supports the AI task.
  • Identifier check: No direct identifiers remain in text, tables, headers, or filenames.
  • Linkability check: No internal IDs remain that could be matched elsewhere.
  • Uniqueness check: The remaining details do not point to a single person.
  • Metadata check: Comments, tracked changes, and properties are cleared.
  • Approval and retention check: You have approval to share this class of data with the chosen tool, and you know how the tool retains it.

If you cannot confidently pass these checks, reduce the data further or switch to synthetic examples.

Common questions

1) Is removing names enough to de-identify data for AI?

No. Names are only one identifier, and indirect clues can still point to a person. Remove or generalize IDs, precise dates, exact locations, and unique role details.

2) Should I replace identifiers with blanks or with tags like [NAME]?

Use tags whenever possible. Tags keep the text readable and preserve relationships (like “the same person” across paragraphs) without exposing identity.

3) Can I keep last four digits of a phone number or account number?

Only if the AI task truly needs it and you have a clear policy. If you keep partial values, label them explicitly (example: [PHONE: last4=1234]) and do it consistently.

4) What about voice recordings: are transcripts safer than audio?

Often, yes, because audio can include a recognizable voice and accidental spoken identifiers. If the AI task does not require audio features, share the redacted transcript instead of the recording.

5) How do I handle lists of participants in meeting notes?

Remove the participant list and use role-based placeholders inside the notes (example: [MANAGER_1], [ENGINEER_2]). Keep a private mapping file only if you need it.

6) How can I reduce risk when I must share real examples?

Share the smallest excerpt that still shows the problem, remove linkable IDs, and generalize rare details. If possible, create a synthetic version that matches the pattern without using real data.

7) Do I need to keep a “source” version if I redact?

Yes. A two-version workflow prevents accidental loss of the original and lets you re-check decisions later. Store the Source version securely and limit access.

If you need help turning audio or video into text you can review and redact, GoTranscript offers professional transcription services that fit into a careful, two-version workflow. You can also add an extra quality step with transcription proofreading services before you share an AI-Safe version more widely.