In recent years, the CNIL has been committed to clarifying its position in favour of promoting innovation and to reassuring professionals and consumers that the development of artificial intelligence (AI) systems can be reconciled with the imperative of protecting data subjects' privacy, even when such technology involves the massive consumption of personal data.
Indeed, concerns are often raised as to the feared negative impact AI could have on individuals’ rights, especially since the emergence of generative AI systems.
General Data Protection Regulation (“GDPR”) requirements, in particular those relating to the determination of specific purposes and limited retention periods, have also been pointed out as hindering or even preventing certain AI applications.
Today, the CNIL wishes to dispel both of these assumptions.
As part of its “action plan for the deployment of AI systems that respect individual privacy”, the CNIL has set up several programmes to support and assist companies in the AI field with their compliance process.
Now, pursuing this effort to help AI system designers comply with personal data protection requirements and to restore users' trust in these systems, the CNIL has published its first set of practical guidelines on the creation of learning databases for AI systems, which is open to public consultation until 16 November 2023.
These recommendations comprise an introductory sheet outlining their scope, a standard documentation template and seven practical sheets aimed at identifying concrete solutions to the legal and technical issues typically encountered:
Introductory sheet: scope of the recommendations
At this stage, the CNIL has deliberately focused on the development phase of AI tools (AI design and training), to the exclusion of the implementation phase (calibration, use, maintenance), and on the creation of databases dedicated to the training of these tools (whether in the development or implementation phase), where personal data is or could be processed.
Furthermore, the practical information sheets do not cover the operations carried out during the AI system shutdown phase or the deletion of personal data contained therein, which must comply with the principle of limited data retention periods.
The scope of the practical sheets is further limited to the following AI systems:
- machine learning systems;
- deterministic systems, which do not use statistical learning but are based on logic and knowledge (for example, inference and deduction engines and expert systems).
Finally, the recommendations under consultation apply only to AI systems governed by the GDPR. This is generally the case for AI systems developed as part of scientific research, research and development, the customization of a commercial product, or the improvement of a public service.
Data processing carried out during the development phase of an AI system subject to Directive (EU) 2016/680 of 27 April 2016, the so-called Law Enforcement Directive (“LED”), as well as processing concerning state security and national defence, is therefore excluded from the scope of these sheets.
Sheet 1: Applicable legal regime
This sheet clarifies how to determine the legal regime applicable to the AI development phase (GDPR, LED, or the rules on state security and national defence).
Sheet 2: Purpose of the processing
The CNIL provides practical guidelines to help define the processing purpose(s), considering the specificities of AI system development.
This sheet addresses the difficulties that developers of AI solutions may encounter when the future operational use of the tool is not yet clearly defined at the development stage.
In this case, it is still possible to comply with the obligation to determine a precise purpose for each processing of personal data, as long as this purpose refers cumulatively to the "type" of system being developed (image or sound generative AI, "computer vision" system, etc.) and to the features and technical capacities that can be anticipated.
Such principle may be departed from in the case of scientific research projects, under certain conditions.
Sheet 3: Qualification of AI systems providers under the GDPR
It is important for any provider of AI systems involving the processing of personal data to identify whether it is acting, within the meaning of the GDPR, as a controller, joint controller or processor. This qualification will determine their obligations.
The CNIL gives practical illustrations of relationships and the qualification that should be attributed to each party involved.
For example:
- The supplier of a conversational agent that trains its language model ("Large Language Model" or LLM) on data publicly available on the Internet is a data controller, since it decides on both the objective (training an AI system) and the essential means of the processing (selecting the data it will reuse).
In this case, the public or private entity that makes the data reused by the supplier available online also acts as a data controller, distinct from the supplier.
- University hospitals developing an AI system to analyse medical imaging data, and choosing to use the same federated learning protocol, act as joint controllers: together, they determine the objective (training a medical imaging AI system) and the means of the processing (through the choice of protocol and the determination of the data they exploit).
- An AI system supplier that develops such a system on behalf of one of its clients, as part of a service, acts as a data processor.
Sheet 4: Lawfulness of the processing
Any processing of personal data is lawful only if it relies on one of the six "legal bases" provided for by the GDPR.
According to the CNIL, with regard to the creation of a database of personal data intended for the training of an AI-based algorithm, the following bases may be contemplated: the data subject's consent, legitimate interest, the performance of a task carried out in the public interest, and the performance of a contract.
Although the CNIL does not explicitly rule out compliance with a legal obligation or the protection of vital interests as possible legal bases, it provides no specific guidance in this respect.
The sheet details the conditions under which each of these legal bases can be considered adequate and clarifies the additional checks to be carried out in the event of data reuse.
In particular, the CNIL specifies that the reuse of data, especially when publicly accessible on the Internet, can be justified under the rules laid down by the GDPR concerning research and innovation. Companies wishing to reuse this data to feed their learning algorithms must, however, ensure that it has not been collected in a patently unlawful manner.
Sheet 5: Data Protection Impact Assessment (DPIA)
A DPIA must be carried out prior to the implementation of any processing likely to generate a high risk for the rights and freedoms of data subjects.
The CNIL specifies in which cases AI stakeholders are likely to be covered by this obligation, taking into account the risks specific to the development of AI tools.
To determine whether a DPIA is required for AI-based data processing, it is necessary to:
(i) Check the list drawn up by the CNIL of processing operations for which a DPIA is always required. Some of these operations may be based on AI systems, such as those involving profiling or automated decision-making: in this case, a DPIA is always required.
(ii) If the contemplated processing does not appear on that list, apply the nine-criteria test established by the European Data Protection Board (EDPB). Under this test, any processing operation meeting at least two of the criteria is deemed subject to the obligation to carry out a DPIA.
According to the CNIL, the following criteria are typically relevant to the development phase of an AI system:
- collection of sensitive data or data of a highly personal nature (location data or financial data, for example);
- large-scale collection of personal data;
- collection of data from vulnerable individuals, such as minors;
- cross-referencing or combining data sets;
- innovative use or application of new technological or organizational solutions.
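The two-step reasoning above can be sketched in code. This is an illustration only, not an official compliance tool: the criterion names and the `dpia_required` helper are hypothetical, while the "at least two criteria" threshold and the criteria themselves follow the EDPB test as summarized here.

```python
# Illustrative sketch of the two-step DPIA check described above.
# The criterion names below are shorthand labels for the EDPB's nine criteria;
# the function itself is a hypothetical helper, not an official CNIL/EDPB tool.

EDPB_CRITERIA = {
    "evaluation_or_scoring",
    "automated_decision_with_legal_effect",
    "systematic_monitoring",
    "sensitive_or_highly_personal_data",
    "large_scale_processing",
    "matching_or_combining_datasets",
    "vulnerable_data_subjects",
    "innovative_use_of_technology",
    "prevents_exercise_of_rights_or_contract",
}

def dpia_required(on_cnil_mandatory_list: bool, criteria_met: set) -> bool:
    """Return True when a DPIA is required under the two-step reasoning above."""
    if on_cnil_mandatory_list:
        # Step (i): the processing appears on the CNIL's list -> DPIA always required.
        return True
    # Step (ii): EDPB test -- meeting at least two criteria triggers the obligation.
    return len(criteria_met & EDPB_CRITERIA) >= 2

# Example: a training database combining large-scale collection with minors' data.
print(dpia_required(False, {"large_scale_processing", "vulnerable_data_subjects"}))  # True
```

In practice the conclusion of each step should of course be documented, since the DPIA obligation is assessed case by case; the sketch only captures the decision logic.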
Sheets 6 and 7: System design, data collection and management
These sheets detail the recommended steps to be taken by any developer of an AI algorithm involving the processing of personal data, both upstream of the system's design and throughout the development phase.
Such recommendations are of a very practical nature (e.g., carry out a technical analysis of the state of the art, carry out pilot studies, consult an ethics committee, use of training data ablation techniques, etc.) and are issued with regard to the various major data protection principles set out in the GDPR (data minimization, retention periods, security, etc.).
The CNIL clarifies, in particular, that the principle of limited retention periods does not preclude long-term retention of data where such retention is necessary for the purposes of the AI processing.
A large number of contributions from AI stakeholders can be expected throughout the public consultation launched by the CNIL on its new recommendations, and further clarifications on the application of the GDPR to AI systems should be forthcoming.