On this page: social media, web scraping, data scraping, text and data mining, data donation, social networking
Date of last review: 2025-05-02
A growing number of studies use data from the web to analyse human behaviour, for example from social media.
It has never been easier to get access to people’s opinions or views about any product, topic or event.
Contrary to what you may expect, however, the fact that social media data are public does not mean that you can automatically use them without a license and/or permission: irrespective of whether information is public or private, the GDPR still applies when you use personal data.
Social media research
Typical issues in social media research
To get started, you can use the Self-reflection guide for human-related big data research, which was developed by the Ethics Committee of the Faculty of Humanities. It provides step-by-step guidance on doing big data research, which can include social media research.
Below we discuss the following typical issues in social media research that can play a role in designing your social media project:
Ethical considerations
It is important to consider the risks involved for individuals who post or who are otherwise involved in social media content. Three ethical considerations:
Recruitment
Data integrity
Private vs. public social media
Some social media are not public to all, but require registration or membership (such as Facebook groups, private Telegram channels, etc.). Content in such spaces is not manifestly made public, and therefore requires a different approach than when using data from public spaces. For example, you could contact the site or group administrator, and/or ask consent from users in these private spaces.
Even when a platform is public, individuals may not perceive the content they share as public, for example because they believe the platform to be more private than it is. Not all social media users are aware of the risks of posting personal information online, or of the contents of the platform’s terms of service and privacy policy. And even when they are, they may not expect their data to be used for academic purposes and could feel violated if their data were, for example, published or quoted without their knowledge or consent. For example, Telegram is often perceived by users as a private platform, though many channels are actually public. When using data from such platforms, it is therefore important to consider not only how public a social media platform de facto is, but also which online spaces people perceive as ‘private’ or ‘public’. To check this, you can run a small pilot and ask data subjects (or their representatives) about their expectations of privacy.
Intellectual property and terms of service
Both the social media user and the platform could potentially claim intellectual property rights on social media content. What each platform claims and allows is usually described in its terms of service. These can, for example, forbid use of the platform’s content by third parties or limit the amount of data that can be scraped. As far as copyright goes, the EU Copyright Directive (art. 3 and 4) and the Dutch Copyright Act (art. 15n and 15o) state that use for text and data mining research is not a copyright infringement. You are allowed to copy, use, and archive the data for such research, provided that the data were accessed lawfully (e.g., because they were publicly available) and are protected with sufficient security measures (read more about copyright). Terms of service that prohibit scraping for copyright reasons therefore do not apply to EU researchers. It is, however, strongly recommended to stick to the other terms set by the platform, as illustrated by a recent example on Reddit where researchers ignored the prohibition on using AI.
Informing social media users
The GDPR requires that you do your best to inform data subjects about your project before it starts, including which data you are collecting, and why and how you are using them. This is not necessary if:
Read more in the section about informing data subjects.
Legal basis: consent or public interest
Obtaining consent from social media users can become virtually impossible when data from hundreds of thousands of users are involved. In some cases, revealing your identity as a researcher or seeking consent can interfere with the research and be counterproductive, for example in ‘covert’ or ‘non-intrusive’ research where researchers are primarily interested in observing interactions between users. Consent is therefore usually not a suitable legal basis for this type of research, even when collecting special categories of personal data. You can use the legal basis flowchart to find a suitable legal basis (usually: public interest), or contact your privacy officer for help.
How to collect social media data
There are several ways to collect social media data (see also the RDM guide on collecting data from the web), for example:
Amount and sensitivity of personal data
Social media data can consist of many different elements in many different formats, such as user information (full name, username, number of followers, etc.), post content, timestamps, geotags, videos, and images. These elements are often interrelated and can be very unstructured (or their structure may change over time), which can make the data difficult to minimise and deidentify. Because most social media data are publicly available, a simple search may enable others to find the original social media post, so simply removing a username is usually not sufficient. Moreover, because social media are social in nature, there is often a lot of ‘bycatch’: you could unintentionally collect more data, more sensitive data, or data from vulnerable groups. For example, content such as comments and likes could concern people other than the person who posted it. And if you collect data from multiple platforms, it may become possible to construct more detailed user profiles. To apply data minimisation, it is crucial to think beforehand about which data you need, and from which and how many individuals.
Example: a researcher has collected Telegram conversations from Ukrainian refugees about escape routes, seeking humanitarian help, and war experiences. Even after direct identifiers such as usernames and contact information are removed, this information could still be sensitive because 1) it could be used to crack down on escape routes and 2) the Telegram channel could easily be located if the dataset became publicly available. In this case, the dataset should be fully anonymised, summarised, or made available only under restricted access.
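As a minimal sketch of the data minimisation described above, the Python example below keeps only the fields needed for analysis, replaces the username with a keyed hash, and reduces the timestamp to a date. The field names (`username`, `text`, `created_at`, `geotag`), the helper functions, and the key are hypothetical and will differ per platform and project. Note that this is pseudonymisation, not anonymisation: the post text remains searchable, so the result is still personal data.

```python
import hashlib
import hmac
from datetime import datetime

# Hypothetical field names; real platform exports will differ.
FIELDS_TO_KEEP = {"text", "created_at"}

# Assumption: in practice the key is generated randomly and stored
# separately from the dataset (or destroyed when linking is not needed).
SECRET_KEY = b"replace-with-a-randomly-generated-key"


def pseudonymise(username: str, secret_key: bytes) -> str:
    """Return a keyed hash of the username (a pseudonym, not anonymisation)."""
    digest = hmac.new(secret_key, username.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def minimise_post(post: dict, secret_key: bytes) -> dict:
    """Keep only the fields needed for analysis and coarsen the timestamp."""
    minimal = {k: v for k, v in post.items() if k in FIELDS_TO_KEEP}
    # A pseudonym keeps posts by the same author linkable without the username.
    minimal["author_id"] = pseudonymise(post["username"], secret_key)
    # Reduce the timestamp to a date to make exact matching harder.
    if "created_at" in minimal:
        timestamp = datetime.fromisoformat(minimal["created_at"])
        minimal["created_at"] = timestamp.date().isoformat()
    return minimal


example_post = {
    "username": "example_user",
    "full_name": "Example Person",
    "text": "An example post.",
    "created_at": "2024-03-01T14:22:05",
    "geotag": "52.09,5.12",
}

minimal = minimise_post(example_post, SECRET_KEY)
print(sorted(minimal))
```

An allow-list of fields (rather than a list of fields to remove) is used here so that new or unexpected fields in the platform's output are dropped by default.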
Data subjects’ rights and social media research
Data subjects have several rights under the GDPR that they should be able to exercise, such as the right to be forgotten. In social media research, several issues could arise:
In the case of scientific research, it is possible to limit or exclude the rights of access (art. 15), rectification (art. 16), and restriction of processing (art. 18) if granting those rights would endanger the research project. However, an even better solution is to deidentify the data to such an extent that it is no longer possible to find the data subject in the dataset. In that case, you should inform the data subject that they can no longer exercise their rights (art. 11(2)).
Making social media data FAIR
If the social media data you use cannot be anonymised or may not be re-published, you could still make the data Findable, Accessible, Interoperable, and Reusable (FAIR), for example by publishing your methodology (including your query to extract the data, if relevant) and metadata, and by providing access to the data under restrictions. Remember to always include the date on which you accessed or scraped the data and the parameters you used to select the data, as the data on social media can change from day to day. You can find more information on this topic in this chapter and in the guidebook “Making Qualitative Data Reusable” (Verburg, Braukmann, & Mahabier, 2023).
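One way to record the access date and selection parameters mentioned above is to store a small provenance record next to the dataset. The Python sketch below is an illustration only: the function `build_provenance`, the field names, and the example query and parameters are hypothetical, not a prescribed metadata standard.

```python
import json
from datetime import datetime, timezone


def build_provenance(platform, query, parameters, accessed_at=None):
    """Describe how and when a social media dataset was collected."""
    # Default to the current time in UTC when no access time is given.
    accessed_at = accessed_at or datetime.now(timezone.utc)
    return {
        "platform": platform,
        "query": query,
        "parameters": parameters,
        "accessed_at": accessed_at.isoformat(),
    }


# Hypothetical values for illustration.
record = build_provenance(
    platform="example-platform",
    query="#climate lang:en",
    parameters={"start_date": "2024-01-01", "end_date": "2024-01-31"},
    accessed_at=datetime(2024, 2, 1, 9, 30, tzinfo=timezone.utc),
)

# Serialise the record so it can be published alongside the methodology.
print(json.dumps(record, indent=2))
```

Such a record can be published even when the data themselves can only be shared under restricted access.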
Further reading