A law firm preparing a discovery production manually reviewed 50,000 documents for personal data. They redacted names, addresses, and obvious identifiers. After production, they believed they had satisfied their GDPR obligations. Six weeks later, opposing counsel notified them that 127 documents contained unredacted National Insurance numbers in embedded metadata. The subsequent ICO investigation resulted in a formal reprimand, required remediation, and reputational damage that far exceeded what proper redaction would have cost.
GDPR-compliant redaction isn't just about black boxes over visible text. It requires systematic identification of personal data wherever it exists—in text, in metadata, in embedded objects, in image layers—and applying appropriate techniques that truly remove the data, not just obscure it visually.
This comprehensive guide provides the practical knowledge legal teams need to implement proper GDPR-compliant redaction.
Understanding What Must Be Redacted
Personal Data Under GDPR
Article 4(1) of GDPR defines personal data as "any information relating to an identified or identifiable natural person." This deliberately broad definition encompasses far more than obvious identifiers.
Direct Identifiers - Data that identifies individuals on its own:
- Full names, first names, surnames
- National identification numbers (NI numbers, passport numbers, driving licence numbers)
- Postal and email addresses
- Phone numbers (mobile, landline, fax)
- Financial account numbers (bank accounts, credit cards)
- Government-issued identifiers (NHS numbers, tax references)
Indirect Identifiers - Data that identifies when combined with other information:
- Job titles and positions (especially in small organisations)
- Employee and membership numbers
- IP addresses and device identifiers
- Location data and travel patterns
- Unique characteristics or descriptions
- Vehicle registration numbers
Special Category Data - Requiring heightened protection under Article 9:
- Racial or ethnic origin
- Political opinions
- Religious or philosophical beliefs
- Trade union membership
- Genetic and biometric data
- Health data
- Sex life and sexual orientation
The Redaction Decision Framework
Not all personal data requires redaction in every context. The analysis should consider:
Is processing lawful? If a lawful basis exists for including the personal data in the disclosure, redaction may not be required. Article 6(1)(f) (legitimate interests) or Article 6(1)(c) (legal obligation) may apply to litigation disclosures.
Is the data necessary? Data minimisation requires processing only what's necessary for the purpose. Even if lawful to include, unnecessary personal data should be redacted.
Who will receive the document? Internal circulation may justify different treatment than disclosure to adverse parties or public production. Context matters.
What agreements exist? Data processing agreements, protective orders, or confidentiality undertakings may permit or require certain approaches to personal data in documents.
Redaction Techniques by Data Type
Text Redaction
Black Box Redaction
The traditional approach—covering text with opaque boxes—seems straightforward but contains a critical trap:
The Critical Requirement: The underlying text must be removed, not just covered. Many PDF tools create visual overlays that can be removed, revealing the "redacted" text beneath.
Verification Protocol:
- After applying redaction, attempt to copy text from the redacted area
- Use PDF inspection tools to examine the document structure
- Search the document for terms you believe were redacted
- If any text is accessible, redaction is incomplete
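The copy-and-search checks above can be partly automated. The following is a minimal sketch, assuming the text has already been extracted from the redacted file with whatever PDF tool you use; the function name `find_redaction_leaks` and its arguments are illustrative, not part of any particular product.

```python
import re


def find_redaction_leaks(extracted_text: str, redacted_terms: list[str]) -> list[str]:
    """Return the supposedly redacted terms that still appear in extracted text."""
    # Collapse whitespace and ignore case so that a term split across a
    # line break ("John\nSmith") is still detected.
    normalised = re.sub(r"\s+", " ", extracted_text).casefold()
    leaks = []
    for term in redacted_terms:
        needle = re.sub(r"\s+", " ", term).casefold()
        if needle and needle in normalised:
            leaks.append(term)
    return leaks


if __name__ == "__main__":
    text = "The claimant John\nSmith, NI number [REDACTED], wrote in."
    print(find_redaction_leaks(text, ["John Smith", "AB123456C"]))
```

An empty result only shows that these specific terms are absent from the extracted text layer. It does not replace inspecting the document structure and metadata.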
Character Replacement
Replace personal data with descriptive placeholders:
Original: "John Smith, NI number AB123456C, contacted the department on 15 March 2024."
Redacted: "[NAME REDACTED], NI number [REDACTED], contacted the department on 15 March 2024."
Advantages:
- Preserves document readability and context
- Makes clear what type of data was redacted
- Enables verification of redaction completeness
Best Practice: Use consistent placeholders across all documents (e.g., always "[NAME REDACTED]", not sometimes "NAME WITHHELD" or "[REDACTED NAME]").
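One way to keep placeholders consistent is to define them once and apply them programmatically. This sketch reproduces the example above; the placeholder table, the list of known names, and the function name are assumptions made for illustration.

```python
import re

# Single source of truth for placeholder wording (illustrative values).
PLACEHOLDERS = {
    "NAME": "[NAME REDACTED]",
    "NI_NUMBER": "[REDACTED]",
}

# Unspaced NI number format; word boundaries are an added assumption.
NI_PATTERN = re.compile(r"\b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b")


def replace_with_placeholders(text: str, known_names: list[str]) -> str:
    """Replace known names and NI numbers with fixed placeholders."""
    # Longest names first so "John Smith" is replaced before "John".
    for name in sorted(known_names, key=len, reverse=True):
        pattern = rf"\b{re.escape(name)}\b"
        text = re.sub(pattern, lambda _m: PLACEHOLDERS["NAME"], text)
    return NI_PATTERN.sub(lambda _m: PLACEHOLDERS["NI_NUMBER"], text)


if __name__ == "__main__":
    original = ("John Smith, NI number AB123456C, contacted the "
                "department on 15 March 2024.")
    print(replace_with_placeholders(original, ["John Smith"]))
```

Because every replacement reads from the same table, a wording change is made in one place and applies to the whole production.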
Pseudonymisation
Replace identifiers with consistent pseudonyms throughout documents:
"John Smith" becomes "Individual A" in all documents where he appears.
Advantages:
- Preserves relationships and narrative flow
- Enables analysis of patterns and communications
- Documents remain coherent and usable
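Critical Requirement: Maintain the mapping table separately and securely. If the mapping is compromised, pseudonymisation provides no protection. Pseudonymised data also remains personal data under GDPR (Recital 26) for as long as it can be re-attributed to an individual using that mapping.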
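A minimal sketch of consistent pseudonym assignment follows. The `Pseudonymiser` class and its "Individual A" labelling scheme are illustrative assumptions. In practice the mapping would be persisted in a separate, access-controlled store rather than held in memory.

```python
import re
import string


class Pseudonymiser:
    """Assigns each real name one stable pseudonym across all documents."""

    def __init__(self) -> None:
        # real name -> pseudonym; keep this apart from the produced documents
        self.mapping: dict[str, str] = {}

    @staticmethod
    def _label_for(index: int) -> str:
        # 0 -> A, 25 -> Z, 26 -> AA, and so on
        letters = ""
        index += 1
        while index:
            index, remainder = divmod(index - 1, 26)
            letters = string.ascii_uppercase[remainder] + letters
        return f"Individual {letters}"

    def pseudonym(self, name: str) -> str:
        if name not in self.mapping:
            self.mapping[name] = self._label_for(len(self.mapping))
        return self.mapping[name]

    def apply(self, text: str, names: list[str]) -> str:
        # Longest names first so a full name wins over a shorter overlap.
        for name in sorted(names, key=len, reverse=True):
            alias = self.pseudonym(name)
            text = re.sub(rf"\b{re.escape(name)}\b", lambda _m, a=alias: a, text)
        return text


if __name__ == "__main__":
    p = Pseudonymiser()
    names = ["John Smith", "Jane Doe"]
    print(p.apply("John Smith emailed Jane Doe.", names))
    print(p.apply("Jane Doe replied to John Smith.", names))
```

Reusing one `Pseudonymiser` instance for the whole document set is what keeps "John Smith" mapped to the same label everywhere.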
Metadata Redaction
Documents contain extensive metadata that frequently includes personal data invisible to casual review:
| Metadata Category | Personal Data Risk | Remediation Approach |
|---|---|---|
| Author fields | Creator names, login IDs | Clear or replace with generic values |
| Last modified by | Editor names, user accounts | Clear or replace |
| File paths | May contain usernames in paths | Remove or sanitise |
| Revision history | Names of all editors, deleted content | Remove revision history entirely |
| Comments/annotations | Commenter names, potentially sensitive content | Remove all comments |
| Email headers | Sender/recipient addresses, routing information | Targeted redaction of personal addresses |
Redaction Workflow:
- Use metadata removal tools before visual redaction
- Flatten documents to remove layers (tracked changes, comments)
- Consider converting to a clean format (e.g., printing to PDF/A)
- Verify with metadata inspection tools after processing
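As an illustration of the remediation approaches in the table, the sketch below sanitises a metadata dictionary. It assumes the metadata has already been extracted by an inspection tool. The field names (`author`, `last_modified_by`, `comments`, `revision_history`) are hypothetical and will differ between tools and file formats.

```python
import re

# Hypothetical field names; adapt to what your extraction tool reports.
CLEAR_FIELDS = {"author", "last_modified_by"}
REMOVE_LIST_FIELDS = {"comments", "revision_history"}

# Matches the username segment of common Windows, Linux and macOS home paths.
USER_PATH = re.compile(r"([A-Z]:\\Users\\|/home/|/Users/)[^\\/]+", re.IGNORECASE)


def sanitise_metadata(metadata: dict) -> dict:
    """Return a copy with personal fields cleared and usernames removed from paths."""
    cleaned = {}
    for key, value in metadata.items():
        if key in CLEAR_FIELDS:
            cleaned[key] = ""
        elif key in REMOVE_LIST_FIELDS:
            cleaned[key] = []
        elif isinstance(value, str):
            # Replace only the username segment, keeping the rest of the path.
            cleaned[key] = USER_PATH.sub(lambda m: m.group(1) + "user", value)
        else:
            cleaned[key] = value
    return cleaned


if __name__ == "__main__":
    meta = {
        "author": "jsmith",
        "last_modified_by": "Jane Doe",
        "template_path": r"C:\Users\jsmith\Templates\letter.dotx",
        "comments": ["JS: check this figure"],
        "title": "Disclosure bundle",
    }
    print(sanitise_metadata(meta))
```

The cleaned values still have to be written back to the file by a tool that supports the format, and the result re-inspected, as the final workflow step requires.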
Image Redaction
Images within documents—photographs, scanned pages, embedded graphics—require special handling:
Text in Images: OCR-derived text can be redacted from searchable layers, but text visible in the image itself requires image editing to obscure.
Facial Images: May require:
- Complete obscuring (black box over face)
- Blurring or pixelation (weaker than a solid box, since light blurring or coarse pixelation can sometimes be reversed)
- Consideration of whether context makes individuals identifiable even with face obscured
EXIF Data: Photograph metadata often contains:
- GPS coordinates of where photo was taken
- Camera serial numbers (potentially identifying)
- Timestamps and date information
- Device identifiers
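In JPEG files, EXIF data is stored in APP1 segments near the start of the file. The sketch below drops those segments using only the standard library. It assumes a well-formed baseline JPEG, and the function name is illustrative. Other segments (for example APP13, which can carry IPTC data) may also hold metadata and are not handled here, so a dedicated tool and a follow-up inspection are still advisable.

```python
import struct

SOI = b"\xff\xd8"   # start-of-image marker
APP1 = 0xE1         # segment type that carries EXIF (and XMP)
SOS = 0xDA          # start of scan: compressed image data follows


def strip_app1_segments(jpeg: bytes) -> bytes:
    """Return the JPEG with all APP1 segments before the image data removed."""
    if jpeg[:2] != SOI:
        raise ValueError("not a JPEG stream (missing SOI marker)")
    out = bytearray(SOI)
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError(f"expected a segment marker at offset {pos}")
        marker = jpeg[pos + 1]
        if marker == SOS:
            # Everything from here on is image data; copy it unchanged.
            out += jpeg[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        if marker != APP1:
            out += jpeg[pos:pos + 2 + length]
        pos += 2 + length
    out += jpeg[pos:]
    return bytes(out)
```

Usage would look like `cleaned = strip_app1_segments(open("photo.jpg", "rb").read())`, where the file name is a placeholder.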
Bulk Redaction: Scaling the Process
Pattern-Based Redaction
For large document sets, pattern matching enables systematic redaction of structured personal data:
Regular Expression Patterns:
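| Data Type | Pattern | Notes |
|---|---|---|
| Email addresses | [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} | Catches most formats |
| UK phone numbers | (\+44\|0)\s?[1-9]\d{2,4}\s?\d{3,4}\s?\d{3,4} | Common spacing patterns; some groupings (e.g. 020 7946 0958) are not matched |
| NI numbers | [A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D] | Standard unspaced format; spaced forms such as AB 12 34 56 C are not matched |
| UK postcodes | [A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2} | Standard structural formats, uppercase only; does not check that the postcode exists |
| Credit cards | \d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4} | 16-digit numbers with optional separators; 15-digit card numbers are not matched |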
Limitations of Pattern Matching:
- Patterns miss variations and errors in source data
- Over-inclusive patterns generate false positives requiring manual review
- Cannot identify personal data that doesn't match patterns (names, descriptions)
- International formats may differ significantly
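The patterns in the table can be applied in sequence as a first automated pass. This is a sketch under stated assumptions: the word boundaries (`\b`), the non-capturing group in the phone pattern, the placeholder labels, and the function name are additions for illustration, not part of the table.

```python
import re

# Patterns taken from the table above. Order matters: card numbers are
# handled before phone numbers so a phone match cannot split a card number.
PATTERNS = {
    "EMAIL": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "NI NUMBER": re.compile(r"\b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b"),
    "POSTCODE": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b"),
    "CARD": re.compile(r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b"),
    "PHONE": re.compile(r"(?:\+44|0)\s?[1-9]\d{2,4}\s?\d{3,4}\s?\d{3,4}"),
}


def redact_patterns(text: str) -> str:
    """Replace every match of each pattern with a typed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text


if __name__ == "__main__":
    sample = ("Email jane.doe@example.co.uk, mobile 07700 900123, "
              "NI number AB123456C, postcode SW1A 1AA.")
    print(redact_patterns(sample))
```

The first limitation listed above is easy to reproduce: the phone pattern as written does not match a London number grouped as 020 7946 0958, so pattern output should always be treated as a starting point for review.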
Named Entity Recognition (NER)
AI-based identification addresses pattern matching limitations:
Capabilities:
- Name identification across formats and cultural origins, with accuracy that depends on the model and its training data
- Address recognition beyond postcode patterns
- Contextual identification (job titles that effectively identify individuals)
- Relationship inference (identifying data through context)
Advantages Over Patterns:
- Catches variations that patterns miss
- Identifies personal data based on meaning, not just format
- Can improve over time when reviewer corrections are fed back into the model
- Handles multiple languages in multilingual document sets
The Hybrid Approach
Effective bulk redaction combines techniques:
- Pattern Matching: Automatically redact known identifier formats (NI numbers, account numbers, postcodes)
- Named Entity Recognition: Identify names, addresses, organisations, and other contextual personal data
- Dictionary Matching: Organisation-specific terms (project code names, internal identifiers)
- Human Review: Verify AI suggestions, address edge cases, make judgment calls
- Quality Control: Statistical sampling of processed documents to verify accuracy
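The combination above can be sketched as a set of detectors that each return character spans, which are merged before redaction and handed to reviewers. All names here are hypothetical. The NER stage is not implemented: a model-backed detector would be one more function with the same signature.

```python
import re
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int, str]            # (start, end, label)
Detector = Callable[[str], List[Span]]

NI_PATTERN = re.compile(r"\b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b")


def pattern_detector(text: str) -> List[Span]:
    """Pattern matching stage: structured identifiers (NI numbers only here)."""
    return [(m.start(), m.end(), "NI NUMBER") for m in NI_PATTERN.finditer(text)]


def dictionary_detector(terms: Dict[str, str]) -> Detector:
    """Dictionary stage: case-insensitive match of organisation-specific terms."""
    def detect(text: str) -> List[Span]:
        spans = []
        for term, label in terms.items():
            for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
                spans.append((m.start(), m.end(), label))
        return spans
    return detect


def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge overlapping spans so no text is redacted twice."""
    merged: List[Span] = []
    for start, end, label in sorted(spans):
        if merged and start < merged[-1][1]:
            prev = merged[-1]
            merged[-1] = (prev[0], max(prev[1], end), prev[2])
        else:
            merged.append((start, end, label))
    return merged


def redact(text: str, detectors: List[Detector]) -> Tuple[str, List[Span]]:
    """Apply all detectors; return redacted text plus spans for human review."""
    spans = merge_spans([s for detect in detectors for s in detect(text)])
    parts, cursor = [], 0
    for start, end, label in spans:
        parts.append(text[cursor:start])
        parts.append(f"[{label} REDACTED]")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), spans
```

Returning the spans alongside the redacted text gives reviewers and the QC sample a record of what each automated stage proposed.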
Verification and Quality Control
Redaction Verification Protocol
Text Layer Verification:
- Select all text and copy—does any supposedly redacted text appear?
- Search for patterns that should have been redacted (phone formats, email formats)
- Search for specific terms from the redaction log
Metadata Verification:
- Run metadata extraction tools on processed documents
- Check document properties panels
- Verify no hidden content, revision history, or comments remain
Visual Verification:
- Review documents for visible personal data
- Check images and embedded objects
- Verify consistent redaction appearance across documents
Statistical Quality Control
For large document sets, sample-based QC provides efficient verification:
Sample Selection:
- Random selection across document types
- Stratification by custodian, date range, document type
- Statistical sample sizes for desired confidence levels
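A common way to size such a sample is Cochran's formula for estimating a proportion, with a finite population correction. The sketch below makes several assumptions: the default 95% confidence level, the 2% margin of error, the conservative 50% expected rate, and proportional allocation across strata. Both function names are illustrative.

```python
import math
import random
from statistics import NormalDist


def sample_size(population: int, confidence: float = 0.95,
                margin: float = 0.02, expected_rate: float = 0.5) -> int:
    """Documents to review to estimate an error rate within +/- margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z ** 2 * expected_rate * (1 - expected_rate) / margin ** 2
    # Finite population correction: smaller sets need fewer samples.
    adjusted = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(adjusted))


def stratified_sample(doc_ids_by_stratum: dict, total: int, seed: int = 0) -> dict:
    """Draw a random sample from each stratum in proportion to its size."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    population = sum(len(ids) for ids in doc_ids_by_stratum.values())
    sample = {}
    for stratum, ids in doc_ids_by_stratum.items():
        share = math.ceil(total * len(ids) / population)
        sample[stratum] = rng.sample(ids, min(share, len(ids)))
    return sample


if __name__ == "__main__":
    print(sample_size(50_000))
```

For a 50,000-document set these defaults give a sample on the order of 2,300 documents. The calculation sizes a sample for estimating an error rate; it does not guarantee that documents outside the sample are free of errors.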
Review Protocol:
- Full manual review of every sample document
- Check for missed personal data and redaction failures
- Document error types and rates
Error Response:
- If sample reveals errors, investigate root cause
- Determine if errors are systematic or isolated
- Re-process affected document populations
- Expand sample or implement full re-review if error rates unacceptable
Common Redaction Failures
Failure 1: Visual-Only Redaction
The Problem: Black boxes that can be removed or text that remains searchable beneath visual obscuring.
The Consequence: Personal data remains fully accessible to anyone who knows to look, defeating the entire purpose of redaction.
The Prevention: Use true redaction tools that remove underlying text; verify by attempting extraction after processing.
Failure 2: Metadata Neglect
The Problem: Visible text properly redacted but metadata containing personal data untouched.
The Consequence: Personal data hidden in metadata is produced along with the document. This is precisely what happened to the firm in the opening example, where 127 documents carried NI numbers in embedded metadata.
The Prevention: Include metadata processing in standard workflow; verify with inspection tools.
Failure 3: Context Revelation
The Problem: Redacting direct identifiers while leaving context that identifies individuals.
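Example: "The CEO of [COMPANY], [REDACTED], stated in his interview..." where [COMPANY] stands for a named company. Only one person holds that position, so the job title identifies the individual even though the name is redacted.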
The Prevention: Consider whether surrounding context permits identification; redact contextual information if necessary.
Failure 4: Inconsistent Application
The Problem: Same name redacted in some documents but not others in the same production.
The Consequence: The unredacted instances defeat all the redacted ones, and the inconsistency may raise additional questions.
The Prevention: Maintain comprehensive redaction lists applied across entire document sets; implement cross-document verification.
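Cross-document verification can be sketched as a scan of every processed document against the full redaction list. This assumes the text of each produced document has been extracted into a dictionary keyed by document ID. The function name and the ID format are hypothetical.

```python
import re
from collections import defaultdict


def find_inconsistent_redactions(documents: dict, redaction_list: list) -> dict:
    """Map each document ID to the listed terms that still appear in its text."""
    # Longest terms first so a full name is reported rather than a fragment.
    terms = sorted({t for t in redaction_list if t}, key=len, reverse=True)
    if not terms:
        return {}
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    survivors = defaultdict(set)
    for doc_id, text in documents.items():
        for match in pattern.finditer(text):
            survivors[doc_id].add(match.group(0).casefold())
    return {doc_id: sorted(found) for doc_id, found in survivors.items()}


if __name__ == "__main__":
    docs = {
        "DOC-001": "[NAME REDACTED] signed the lease.",
        "DOC-002": "John Smith signed the lease.",
        "DOC-003": "Copied to JOHN SMITH and [NAME REDACTED].",
    }
    print(find_inconsistent_redactions(docs, ["John Smith"]))
```

Any document that appears in the result contains a name that was redacted elsewhere in the same production and should be re-processed before release.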
RUNO's Review & Redaction Tools
RUNO's Review & Redaction module addresses the full spectrum of GDPR-compliant redaction requirements:
AI-Powered Identification: The platform combines pattern matching with named entity recognition to identify personal data automatically. UK-specific patterns for NI numbers, NHS numbers, postcodes, and other identifiers ensure comprehensive coverage for UK documents.
True Redaction Processing: Redaction tools remove underlying text—not just overlay it—with automated verification that removed content is no longer accessible through any extraction method.
Metadata Processing: Integrated metadata handling strips or sanitises document metadata as part of the standard workflow, ensuring hidden personal data doesn't survive the redaction process.
Batch Processing: Redaction rules can be applied consistently across entire document populations, which guards against the inconsistent-application failures that are common in manual processes.
Audit Trail: Complete records document what was redacted, when, and by whom—supporting GDPR accountability requirements and providing defensibility documentation for any regulatory inquiry.
Conclusion: The Cost of Getting It Wrong
GDPR-compliant redaction requires systematic processes that address text, metadata, images, and context. The 127 documents with unredacted NI numbers weren't a failure of intent—the firm wanted to comply. They were a failure of process—relying on visual review that couldn't catch what eyes couldn't see.
The investment in proper redaction processes and tools is modest compared to the consequences of failure: regulatory enforcement, reputational damage, professional liability exposure, and the fundamental breach of trust that occurs when personal data is disclosed inappropriately.
When personal data must be shared, proper redaction protects both data subjects and the organisations handling their information. The tools and techniques exist. The question is whether your organisation will implement them before or after an incident forces the issue.
Explore RUNO's Review & Redaction Suite or request a demonstration to see AI-powered redaction in action.