Documentation

Improving Personal Data Identification and Analysis with AI

kurt

November 22, 2024

Improving Personal Data Identification and Analysis with AI

Introduction

In today’s landscape, companies face the critical challenge of effectively managing vast amounts of data while safeguarding personal information. As the volume and complexity of data grow, accurately identifying and managing personal data becomes increasingly difficult. To address this, organizations need solutions equipped with rapid and precise data analysis capabilities, enabling them to enhance their data protection standards significantly.

Challenges

Accurately classifying personal data in large-scale databases is a significant challenge faced by many organizations. Traditional methods based on regular expressions and fixed pattern recognition are losing effectiveness due to the following limitations:

Diversity of Data Patterns

Personal information such as addresses, names, and medical records often lacks standardized formats, making classification difficult.
Examples:
Addresses can appear as “110-2430” or “bldg 110 rm. 2430,” presenting varying structures.
Medical data can include abbreviations or technical jargon, increasing complexity.

Complexity of Regulatory Compliance
Global regulations such as GDPR, CCPA, HIPAA, and ISO/IEC 27701 mandate precise identification and protection of personal data. Failing to comply can result in legal issues, fines, and a decline in customer trust.

GDPR (General Data Protection Regulation): Enforces transparency in data handling and ensures the rights of EU data subjects.
CCPA (California Consumer Privacy Act): Grants California consumers the right to request data deletion and opt-out of data sales.
HIPAA (Health Insurance Portability and Accountability Act): Protects sensitive health information in the US, ensuring confidentiality and security.
ISO/IEC 27701: Provides an international framework for managing personal information and supporting regulatory compliance.

Each regulation presents unique requirements, and non-compliance exposes organizations to severe legal, financial, and reputational risks.

Inefficiency of Traditional Solutions

Regular expression-based systems only recognize predefined patterns and require frequent updates for new data formats.
This decreases operational efficiency and leads to increased costs for organizations.

These challenges weaken data protection, increase operational costs, and leave organizations vulnerable to compliance and security risks.

Objectives

The objective of the AI classifier is to enable customers to gain tangible benefits in data protection and management. By addressing the complexities of data management, improving the level of personal data protection, and achieving regulatory compliance effectively, this solution aims to deliver the following core goals:

1. Enhancing the Accuracy of Personal Data Identification

Context-Based Automated Classification: Rather than relying on fixed patterns, the AI classifier comprehends data contextually to accurately identify various types of personal data, such as addresses, names, and medical information.
Adaptation to New Data Patterns: The AI model continuously learns, overcoming the limitations of traditional solutions and adapting flexibly to new data patterns.

This significantly improves the accuracy of personal data identification, minimizing errors and uncertainties in data management for customers.

2. Boosting Operational Efficiency and Reducing Costs

Resource Savings: High-performance classification reduces the workload on IT, security, and data management teams, even in large-scale data environments.
Time Optimization: Processes diverse data quickly, reducing the time spent on repetitive tasks.
Operational Stability: The AI classifier provides high reliability and consistency during data processing, preventing system interruptions or errors and maintaining a stable operational environment.

By leveraging the AI classifier, organizations can enhance the efficiency of personal data management and allocate more resources to their core business activities.

3. Supporting Regulatory Compliance

Automated Regulatory Response: Aligns with regulations such as GDPR, CCPA, HIPAA, and ISMS-p by automating classification tailored to legal requirements.
Real-Time Monitoring and Reporting: Provides transparent data management and reporting capabilities to demonstrate compliance.
Mitigation of Legal Risks and Fines: Prevents penalties and reputational damage caused by regulatory violations, enhancing corporate trustworthiness.

With these, organizations can ensure compliance, minimize legal risks, and strengthen customer trust effectively.

Solution Overview

QueryPie’s AI classifier is a solution that combines contextual analysis and pattern recognition to classify personal data with precision and efficiency. This helps customers simplify complex data management processes and enhance personal data protection. The AI classifier offers the following key features:

1. Advanced Text Comprehension

Leverages bidirectional contextual understanding to analyze and classify data containing personal information accurately.
Handles various types of personal data, including names, addresses, and medical information, ensuring high accuracy for both structured and unstructured data.
Adapts flexibly to data by understanding its context, avoiding reliance on fixed patterns.

2. Reliable Data Collection and Refinement

Collects necessary data for personal information classification from credible sources such as national databases and public data portals.
The collected data undergoes refinement processes, including deduplication, error correction, and standardization, to ensure high-quality training data.
Refined data is a critical factor in improving classification accuracy, providing results tailored to the customer’s environment.

3. Customized Classification Models

Offers models optimized for different types of personal information.
For instance, separate AI models are designed for names, addresses, and medical information to maintain high accuracy.
Adjusts models to meet customer-specific requirements across various industries and data environments.
Continuously learns and updates to adapt flexibly to new data patterns.

4. Efficient Resource Utilization

Features advanced pre-filtering capabilities to filter out irrelevant text, maximizing processing efficiency.
Minimizes unnecessary model calls to optimize system resource usage, reducing operational costs effectively.

Technical Description

Background for Model Selection

To optimize performance for personal data classification tasks, various AI language models were analyzed and compared. The BERT-based model was selected due to its specific advantages for personal data classification over recently developed large language models like GPT or Claude:

Efficient Processing Speed

BERT provides an excellent balance between processing speed and accuracy for real-time classification tasks.
It operates reliably in large-scale data environments, minimizing latency issues during processing.

Context Understanding and Feature Extraction

BERT excels in analyzing text context both forwards and backwards, making it highly effective at accurately classifying personal data.
Whether dealing with names, addresses, or medical records, BERT consistently delivers high precision in handling various types of personal information.

Model Combination and Optimization

Different models were selected and optimized based on the specific types of personal data being classified:
- KoElectra: An open-source model optimized for Korean language datasets. It performs exceptionally well in tasks such as medical record classification and address identification. (For Korean market)
- Custom BERT-Based Model: A BERT model trained on tailored datasets offers robust performance, especially in handling Out-of-Vocabulary (OOV) issues caused by short text or abbreviations. It surpasses open-source models in stability and accuracy for such challenges.
By combining these models, the strengths of each are maximized to address various types of personal data effectively.

High Accuracy and Flexibility

The system leverages the strengths of multiple models to achieve high accuracy in personal data classification tasks.
With its robust learning and updating framework, the system can flexibly adapt to new data patterns and changes in environmental conditions.

Solution Components Overview

The personal data classification process of the AI classifier is designed in sequential stages to maximize accuracy and efficiency. Below is a detailed explanation of each component:

1. Pre-Filtering

Role: Analyzes input sentences to remove irrelevant text unrelated to personal data and filters out unnecessary content, allowing the model to focus on meaningful data.
Effect: Reduces the amount of data the model needs to process, optimizing resource usage, and improves processing speed by filtering out irrelevant data early.
Examples:
- Text Consisting Solely of Special Characters or Numbers:
  Strings like "123456" or "!@#$%^&*" are unlikely to be relevant to personal information such as addresses or medical data. Therefore, they are excluded from the analysis stage.
- Text Misaligned with Personal Data Types:
  For example, Korean text such as "홍길동" (a name written in Hangul) is excluded from the Romanized name classifier. Conversely, Romanized names like "Gil-Dong Hong" are passed to the Romanized name classifier for further processing.

2. Context Analysis Model

Role: Utilizes BERT-based language models like Ko-Electra to perform deep contextual analysis of input text.
Effect: Goes beyond simple keyword matching by understanding the meaning within context to accurately determine whether the text contains personal information.
Features:
- Handles complex data types such as addresses, names, and medical information.
- Adapts flexibly to new and evolving data patterns.

3. Classification Layer

Role: Uses the feature vectors extracted by the context analysis model to make a final determination on whether the text contains personal information.
Effect: Accurately identifies personal information and structures the results in a format suitable for the client environment.
Example Output:
- If the input text contains address information, the output might be formatted as: "is_address: true"
- This clearly communicates the presence of personal information and simplifies the data structure for seamless integration into subsequent processes.

Data Collection and Refinement

1. Data Collection

We gather data essential for personal information classification from trustworthy public sources and verified platforms. In this section, I will explain based on examples from Korean data sources.

Reliable Data Sources:
Data is obtained from credible sources such as (in Korean case):

Diverse Data Types:
- Address Data: We utilize Korean address data from the Address-based Industry Support Services. This includes city, county, and district information, which is combined to generate realistic or similar address data for training.
- Medical Information: Medical terms and abbreviations are extracted from statistics such as frequent disease conditions and disease frequency data provided by the Healthcare Big Data Open System.
- Occupations and Certifications: Occupational and certification-related data are sourced from the Korean Occupational Dictionary and PQI (Private Qualification Information Service).

Ensuring Accuracy: We ensure the credibility of data sources and strictly manage quality during the collection phase.

2. Data Refinement

The collected data undergoes a refinement process to ensure consistency and quality before being utilized.

Duplicate Removal: Identical data entries are removed to prevent redundant learning in the model.
Error Correction: Mistakes and omissions are reviewed and corrected. For example, typos or incorrect syntax in address data are fixed.
Standardization: Data is made uniform through removing special characters, eliminating unnecessary spaces, and building an abbreviation dictionary.
Quality Assurance: Refined data is sampled and reviewed to ensure accuracy and relevance to the model.

Shell Native Command Control through SSH Proxy Architecture