Anonymization
Pangeanic Anonymization Solution (PAS) is based on Neural Networks (NN). These networks ingest text and output annotated text that identifies the entities found, for instance person names, addresses...
But Neural Networks only work as they were taught during training, which means they cannot identify new entities or adapt to specific use cases on their own.
To solve this problem, PAS uses two techniques that refine identification performance:
- Dictionaries
- Rules
Anonymization Dictionaries
The simplest way to help the NN identify an entity is to declare that entity name in a Dictionary. We call these Dictionaries Anon Dictionaries because they list texts that can be anonymized.
Imagine you work for a hospital and you want to be sure the NN will detect and anonymize any doctor's name appearing in the documents you want to anonymize. You can create an Anon Dictionary called DoctorsList that simply contains the names, one per line, in a plain text file, and use it when anonymizing.
An Anon Dictionary is linked to an entity type. For instance, our DoctorsList can be linked to the type PER (Person Name), or we can create a new Entity Type called DOCTOR and assign those names to the DOCTOR type.
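As a rough illustration of the idea above, the following Python sketch loads a dictionary from plain text (one name per line) and tags every occurrence with an entity type. It is not the PAS implementation: the function names `load_dictionary` and `find_dictionary_entities`, the sample names, and the whole-word matching strategy are assumptions made for the example.

```python
import re

def load_dictionary(text):
    """Parse dictionary content: one entry per line, blank lines ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def find_dictionary_entities(document, entries, entity_type):
    """Return (start, end, text, entity_type) for each dictionary entry found."""
    matches = []
    for entry in entries:
        # Word boundaries avoid matching an entry inside a longer word.
        pattern = r"\b" + re.escape(entry) + r"\b"
        for m in re.finditer(pattern, document):
            matches.append((m.start(), m.end(), m.group(), entity_type))
    return sorted(matches)

# Hypothetical DoctorsList content, linked here to a custom DOCTOR type.
doctors_list = load_dictionary("Alice Romero\nJohn Smith\n")
doc = "Dr. John Smith referred the patient to Alice Romero."
entities = find_dictionary_entities(doc, doctors_list, "DOCTOR")
```

Linking the same list to PER instead of DOCTOR would only change the last argument of the call.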
Clear Dictionaries
There is a second type of dictionary, the Clear Dictionary, which can be defined to force the system to skip identification and anonymization of the listed texts.
Imagine you work for a hospital and you want to be sure the NN will NOT detect and anonymize any doctor's name appearing in the documents, because in this case you are only interested in anonymizing patients' names. You can create a Clear Dictionary called DoctorsList that simply contains the names, one per line, in a plain text file, and use it when anonymizing.
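The effect of a Clear Dictionary can be pictured as a filter applied to whatever the NN has detected. The sketch below assumes detections arrive as simple dictionaries with `text` and `type` keys and that matching ignores case; both are assumptions for illustration, not a description of how PAS works internally. The sample names are made up.

```python
def apply_clear_dictionary(detected, clear_entries):
    """Drop detected entities whose text is listed in the Clear Dictionary."""
    # Case-insensitive comparison is an assumption of this sketch.
    clear = {entry.casefold() for entry in clear_entries}
    return [ent for ent in detected if ent["text"].casefold() not in clear]

# Hypothetical NN output: one doctor and one patient, both tagged PER.
detected = [
    {"text": "John Smith", "type": "PER"},
    {"text": "Maria Lopez", "type": "PER"},
]
doctors_list = ["John Smith"]  # Clear Dictionary content

to_anonymize = apply_clear_dictionary(detected, doctors_list)
```

After filtering, only the patient name remains in `to_anonymize`, so the doctor's name would be left untouched in the output document.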
Rules
Rules are Regular Expressions (Regex) that can be defined to detect patterns.
Like Anon Dictionaries, Rules are associated with an existing or new Entity Type.
Rules are mostly used to detect sequences of numbers and characters that identify persons or objects, such as:
- A driving license number
- An employee number within an organization
- A case file identification code
- A bank account number
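The kinds of identifiers listed above can be sketched as Regex rules, each tied to an Entity Type. The example below uses Python's `re` module; the entity type names `EMPLOYEE_ID` and `CASE_FILE` and the identifier formats (`EMP-` plus six digits, `EXP/year/number`) are invented for illustration, and real rules would follow the formats used in your own documents.

```python
import re

# Hypothetical rules: entity type -> compiled regular expression.
RULES = {
    "EMPLOYEE_ID": re.compile(r"\bEMP-\d{6}\b"),
    "CASE_FILE": re.compile(r"\bEXP/\d{4}/\d{5}\b"),
}

def apply_rules(text, rules):
    """Replace every rule match with its entity type label (redaction)."""
    for entity_type, pattern in rules.items():
        text = pattern.sub(f"[{entity_type}]", text)
    return text

sample = "Employee EMP-004521 opened case EXP/2023/00017."
redacted = apply_rules(sample, RULES)
print(redacted)  # Employee [EMPLOYEE_ID] opened case [CASE_FILE].
```

Strict patterns such as these reduce false positives, at the cost of missing identifiers written in an unexpected format.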
Anonymization Profiles
When a user wants to anonymize a document, they have to enter some parameters in ECO: at least the language of the document (English, Spanish, Japanese...), the list of entities to anonymize and the anonymization mode (redaction, pseudo...).
It is not practical to ask users to remember and choose which dictionaries or rules to use, which is why admins can create anonymization profiles. An Anonymization Profile groups a set of dictionaries and rules under a name. The name should be easy for the user to remember, ideally linked to the specific anonymization tasks they perform from time to time. Users then choose the language, the entity list, the anonymization mode and, optionally, one of the profiles the admin has created for them, which encapsulates both dictionaries and rules.
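One way to picture a profile is as a named bundle of dictionaries and rules that a request may optionally reference. The Python sketch below is only a conceptual model: the class names, field names and sample values are assumptions and do not reflect the actual ECO data model or API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AnonymizationProfile:
    """Named bundle of dictionaries and rules created by an admin."""
    name: str
    anon_dictionaries: list = field(default_factory=list)
    clear_dictionaries: list = field(default_factory=list)
    rules: list = field(default_factory=list)

@dataclass
class AnonymizationRequest:
    """Parameters a user supplies; the profile is optional."""
    language: str
    entities: list
    mode: str
    profile: Optional[AnonymizationProfile] = None

    def resources(self):
        """Dictionaries and rules that apply to this request."""
        p = self.profile or AnonymizationProfile(name="")
        return {
            "anon_dictionaries": p.anon_dictionaries,
            "clear_dictionaries": p.clear_dictionaries,
            "rules": p.rules,
        }

# Hypothetical profile an admin might prepare for hospital reports.
hospital = AnonymizationProfile(
    name="HospitalReports",
    clear_dictionaries=["DoctorsList"],
    rules=["EmployeeNumber"],
)
request = AnonymizationRequest("English", ["PER"], "redaction", profile=hospital)
```

With this shape, the user only picks a profile name, and the dictionaries and rules behind it come along without having to be selected one by one.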