This content originally appeared on HackerNoon and was authored by EScholar: Electronic Academic Papers for Scholars
:::info Authors:
(1) Anthi Papadopoulou, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway and Corresponding author (anthip@ifi.uio.no);
(2) Pierre Lison, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(3) Mark Anderson, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(4) Lilja Øvrelid, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway;
(5) Ildiko Pilan, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway.
:::
Table of Links
2 Background
2.3 Privacy-Preserving Data Publishing
3 Datasets and 3.1 Text Anonymization Benchmark (TAB)
4 Privacy-oriented Entity Recognizer
4.2 Silver Corpus and Model Fine-tuning
5.4 Sequence Labelling and 5.5 Web Search
6 Analysis of Privacy Risk Indicators and 6.1 Evaluation Metrics
6.2 Experimental Results and 6.3 Discussion
6.4 Combination of Risk Indicators
Appendices
A. Human properties from Wikidata
B. Training parameters of entity recognizer
D. LLM probabilities: base models
E. Training size and performance
2.3 Privacy-Preserving Data Publishing
PPDP approaches to text sanitization rely on a privacy model that specifies formal conditions which must be fulfilled for the data to be shared without harm to the privacy of the individuals it contains. The most prominent privacy model is k-anonymity (Samarati and Sweeney, 1998), which requires that an individual/entity be indistinguishable from at least k − 1 other individuals/entities. This model was subsequently adapted to text data by approaches such as k-safety (Chakaravarthy et al., 2008) and k-confusability (Cumby and Ghani, 2011).
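To make the condition concrete, here is a minimal Python sketch (with illustrative attribute names and records, not taken from the paper) that checks whether every combination of quasi-identifier values in a table is shared by at least k records:

```python
# Minimal sketch of the k-anonymity condition on tabular data.
# Attribute names ("age_range", "zip_prefix") are illustrative assumptions.
from collections import Counter

def satisfies_k_anonymity(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    is shared by at least k records."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in groups.values())

records = [
    {"age_range": "30-40", "zip_prefix": "037", "diagnosis": "flu"},
    {"age_range": "30-40", "zip_prefix": "037", "diagnosis": "asthma"},
    {"age_range": "20-30", "zip_prefix": "037", "diagnosis": "flu"},
]
# The third record is the only one with ("20-30", "037"), so 2-anonymity fails.
print(satisfies_k_anonymity(records, ["age_range", "zip_prefix"], k=2))  # False
```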
t-plausibility (Anandan et al., 2012) follows a similar approach: already detected personal information is generalized until at least t documents can be mapped to the edited text. Sanchez and Batet (2016) present C-sanitized, an information-theoretic approach that computes the point-wise mutual information (using co-occurrence counts from web data) between the person or entity to protect and the terms of the document. Terms whose mutual information exceeds a given threshold are then masked.
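The sketch below illustrates the general PMI-thresholding idea behind C-sanitized; the counts, threshold, entity, and terms are placeholders (the original work derives counts from web queries rather than a hand-built dictionary):

```python
# Hedged sketch of PMI-based masking in the style of C-sanitized
# (Sanchez and Batet, 2016). All counts and names below are illustrative.
import math

def pmi(count_xy, count_x, count_y, total):
    """Point-wise mutual information from raw (co-)occurrence counts."""
    if count_xy == 0:
        return float("-inf")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

def mask_risky_terms(entity, terms, counts, total, threshold):
    """Mask terms whose PMI with the protected entity exceeds the threshold.
    `counts` maps a term, an entity, or an (entity, term) pair to its count."""
    sanitized = []
    for term in terms:
        score = pmi(counts[(entity, term)], counts[entity], counts[term], total)
        sanitized.append("***" if score > threshold else term)
    return sanitized

# Toy usage: "haemophilia" co-occurs strongly with the entity, so it is masked.
counts = {"John Doe": 120, "haemophilia": 800, ("John Doe", "haemophilia"): 90}
print(mask_risky_terms("John Doe", ["haemophilia"], counts, total=10_000, threshold=1.0))
```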
k-anonymity was also employed by Papadopoulou et al. (2022) in combination with NLP-based approaches: given an explicit assumption about the attacker's background knowledge, they search for the optimal set of masking decisions that ensures k-anonymity.
Finally, Manzanares-Salor et al. (2022) proposed to evaluate disclosure risks by training a text classifier and measuring how difficult it is to infer the identity of the protected individual from the sanitized text.
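A rough sketch of such classifier-based risk evaluation is given below; the TF-IDF plus logistic-regression pipeline is an assumption standing in for whichever model the authors actually use, and the risk score is simply re-identification accuracy:

```python
# Hedged sketch of classifier-based disclosure risk, loosely following the idea
# of Manzanares-Salor et al. (2022): if a classifier can still recover the
# individual's identity from sanitized texts, the sanitization leaks information.
# The concrete pipeline (TF-IDF + logistic regression) is an illustrative choice.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def disclosure_risk(train_texts, train_identities, sanitized_texts, true_identities):
    """Train on labelled (text, identity) pairs, then measure how often the
    classifier re-identifies the individual behind each sanitized text."""
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train_texts, train_identities)
    return clf.score(sanitized_texts, true_identities)  # re-identification accuracy
```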
2.4 Differential Privacy
Differential privacy (DP) is a framework for protecting the privacy of individuals in datasets (Dwork et al., 2006). It operates by producing randomized responses to queries, where the level of artificial noise introduced in each response is calibrated to guarantee that the amount of information that can be learned about any single individual stays below a given threshold.
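Formally, a mechanism M is ε-differentially private if Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] for any output set S and any two datasets D, D′ differing in a single individual. As a concrete (textbook, not paper-specific) instance of such a randomized response, the Laplace mechanism answers a numeric query with noise scaled to the query's sensitivity divided by ε:

```python
# Standard Laplace mechanism, shown as a minimal illustration of DP noise;
# this is the classic construction from Dwork et al. (2006), not code from the paper.
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return an epsilon-DP answer: noise drawn from Laplace(0, sensitivity/epsilon)."""
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a counting query (sensitivity 1) answered with a budget of epsilon = 0.5.
print(laplace_mechanism(true_answer=42, sensitivity=1.0, epsilon=0.5))
```

Smaller values of ε yield stronger privacy but noisier answers, which is the utility trade-off discussed below.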
Fernandes et al. (2019) applied DP to text data in combination with machine learning techniques, adding noise to the model's word embeddings. Their work focused on removing stylistic cues from the text, so that its author could not be identified from it. Feyisetan et al. (2019) also applied noise to word embeddings, in a setting where an individual's geolocation data is to be protected.
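Both works follow the same basic recipe, sketched below under simplifying assumptions (plain Gaussian noise, a toy embedding dictionary, and a brute-force nearest-neighbour lookup; the original mechanisms calibrate the noise to a formal metric-DP guarantee):

```python
# Hedged sketch of the noisy-embedding idea: perturb a word's embedding,
# then emit the vocabulary word closest to the noisy vector. The noise
# distribution and lookup here are simplifications, not the papers' mechanisms.
import numpy as np

def privatize_word(word, embeddings, noise_scale, rng=None):
    """Replace `word` with the nearest neighbour of its noised embedding.
    `embeddings` maps each vocabulary word to a numpy vector."""
    rng = rng or np.random.default_rng()
    noisy = embeddings[word] + rng.normal(scale=noise_scale, size=embeddings[word].shape)
    return min(embeddings, key=lambda w: np.linalg.norm(embeddings[w] - noisy))
```

With a small noise scale the original word is usually returned; as the scale grows, semantically nearby words are emitted instead, which is what obscures stylistic or identifying cues.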
More recently, Sasada et al. (2021) addressed the utility loss caused by the noise DP requires: by first creating duplicates of the text and only then adding noise, they reduce the amount of noise needed. Krishna et al. (2021) tackled the same issue with an auto-encoder-based algorithm that transforms text without losing data utility. Finally, Igamberdiev and Habernal (2023) introduced DP-BART, a DP rewriting system based on a pre-trained BART model, which seeks to reduce the amount of artificial noise needed to reach a given privacy guarantee.
DP-oriented approaches generally lead to complete transformations of the text, at least for reasonable values of the privacy threshold. They are therefore well suited to generating synthetic texts, in particular for collecting training data for machine learning models. However, they are difficult to apply to text sanitization, which is generally expected to retain the core content of the text and only edit out the personal identifiers. This is particularly the case for court judgments and medical records, where sanitization should not alter the wording and semantic content conveyed in the text.
:::info This paper is available on arXiv under a CC BY 4.0 DEED license.
:::