Metadata and Your Privacy
The importance of metadata to user privacy is routinely underemphasized.
Metadata can tell the whole story without anyone ever reading the message contents; with files, metadata can reveal additional and potentially sensitive information beyond whatever is contained inside the file.
People break up over metadata, are arrested over metadata, and are killed over metadata. In many cases, the leaking or capture of metadata is just as privacy-invasive as directly reading message contents, despite the downplaying by entities that rely on collecting it.
What is metadata?
In its most basic form, metadata is the data associated with a message but not included in its contents. Common metadata frequently includes the time a message was sent and who it was sent to.
These may seem like two insignificant data points at first glance, but a lot of the “story” can be told using just these two data points - in many cases, we don’t need to read the explicit contents of the message itself to get the “story.” Sometimes, these two data points answer important enough questions that the content of the message becomes irrelevant, such as:
- Who are they messaging?
- What time was the message sent? Received?
- How often are two people messaging?
For rather obvious reasons, who a message was sent to can be significant on its own, especially if the users are identified by unique identifiers, such as a phone number attached to a SIM card. Knowing when a message was sent can establish a pattern, especially if the central servers relaying messages log and store these data points, as many do.
Over time, by tracking just these two metadata points, we can start establishing clear patterns - for example, User X may message User Y every Sunday at 5:00pm for approximately an hour.
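As a rough illustration of how little is needed to surface such a pattern, here is a minimal Python sketch. The log entries, user names, and function names are hypothetical and invented for this example; it assumes a relay server that stores only sender, recipient, and timestamp, and never the message contents.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical server-side log: (sender, recipient, ISO timestamp). No message contents.
LOG = [
    ("user_x", "user_y", "2024-03-03T17:02:00"),
    ("user_x", "user_y", "2024-03-10T17:05:00"),
    ("user_x", "user_y", "2024-03-17T17:10:00"),
    ("user_x", "user_z", "2024-03-12T09:30:00"),
]


def contact_patterns(log):
    """Count messages per (sender, recipient, weekday, hour) bucket."""
    buckets = Counter()
    for sender, recipient, stamp in log:
        when = datetime.fromisoformat(stamp)
        buckets[(sender, recipient, DAYS[when.weekday()], when.hour)] += 1
    return buckets


def recurring(log, min_count=2):
    """Return buckets seen at least min_count times, most frequent first."""
    return [(key, n) for key, n in contact_patterns(log).most_common() if n >= min_count]


if __name__ == "__main__":
    for (sender, recipient, day, hour), n in recurring(LOG):
        print(f"{sender} -> {recipient}: {n} messages on {day}s around {hour}:00")
```

Bucketing by weekday and hour is a deliberately crude choice; anyone holding months of real logs could apply far more capable analysis to the same two data points.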
Even on encrypted platforms, sometimes you’ll find that while the message itself is encrypted, the metadata is made available to the servers handling the message. In this case, at minimum, metadata is transmitted to the servers of whatever communication service you’re using - be it a messenger or an email service provider.
If the provider respects user privacy, this data likely wouldn’t be logged for longer than necessary, analyzed, aggregated, or shared.
However, some messaging platforms use metadata for more than just message routing. They may use user metadata for targeted advertising, for tracking users across multiple platforms, as training data for machine-learning systems such as automated spam detection, or as a “commodity” to sell (to data brokers, for example).
Many free and popular, yet privacy-unfriendly, email providers also use metadata in the form of email headers. This metadata is often scanned and used to train the machine-learning algorithms behind spam, phishing, and malicious message filtering.
If the email provider also offers other products - perhaps a calendar - the data captured from the platform’s email scanning may be used to create calendar events. For example, if you’ve received a flight itinerary in your inbox, the platform may infer from the message metadata that you are planning a trip and place it on the calendar.
Metadata and communications
Metadata attached to communications can be especially broad, since technically anything outside the message contents can be labeled as metadata. Therefore, it would be exceedingly difficult to give a definitive and all-inclusive list, but common metadata frequently includes:
- Who sent/received a message (including unique identifiers)
- Time message was sent
- Server handling the message (specifically in email)
- Location of device when message was sent
- Client used to send a message (specifically in email)
As mentioned earlier, metadata is often associated with files - particularly photos. File metadata can include:
- Location details of a photo
- Timestamp/date details of a photo
- Who last edited a file, and when
- File type and size
Metadata and files
Most of this post focuses on metadata associated with messages. However, photos and videos also have associated metadata.
For example, by default on iOS, pictures taken with the device’s camera automatically have metadata (EXIF) associated with them; the most notable piece of this photo metadata is GPS location.
Exchangeable image file format (EXIF) is the specific metadata standard for digital images, and can also include metadata such as:
- Device ID numbers
- Camera settings
- Image metrics (pixel dimension, resolution, file size)
In most cases, devices - such as smartphones - include exact timestamps and GPS coordinates in images taken with the device’s camera by default. It’s often difficult to remove the majority of this EXIF metadata without specialized software.
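To make those GPS coordinates concrete: EXIF stores latitude and longitude as three rational numbers each (degrees, minutes, seconds) plus a hemisphere reference letter. The Python sketch below converts such values to the decimal degrees a map search accepts. The function name and the sample values are made up for illustration, and reading the tags out of an actual image file is assumed to have happened elsewhere.

```python
from fractions import Fraction


def dms_to_decimal(dms, ref):
    """Convert EXIF-style (degrees, minutes, seconds) rationals to signed decimal degrees.

    dms is three (numerator, denominator) pairs; ref is "N", "S", "E", or "W".
    """
    degrees, minutes, seconds = (Fraction(n, d) for n, d in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative in decimal notation
    return float(-value if ref in ("S", "W") else value)


# Hypothetical values, shaped like GPSLatitude / GPSLongitude tag contents
lat = dms_to_decimal(((40, 1), (26, 1), (4614, 100)), "N")
lon = dms_to_decimal(((79, 1), (58, 1), (5640, 100)), "W")
print(round(lat, 5), round(lon, 5))
```

A fraction of an arcsecond corresponds to only a few meters on the ground, which is why a single unscrubbed photo can be enough to identify a specific building.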
On iOS, users can disable the geolocation tagging using GPS via the Photos app by tapping on an image, tapping on the information tab, tapping “adjust” for the location, and then selecting “No Location.”
Most users are unaware their images contain this potentially sensitive data; as such, they frequently share photos containing this data to the internet, where it can be harvested by anyone with the know-how - including the platform on which it was shared.
An obvious answer some may have to this problem is to tell users to “only share photos with trusted contacts.” While this is generally good advice (especially for more personal photos or videos), it is but one solution to a multi-faceted problem because photos - and their associated metadata - may be unintentionally shared to other parties…
- Do you have iCloud enabled? Photos from your iDevice may be automatically synced with Apple’s iCloud servers and, while they are stored encrypted, Apple has the keys for decryption. (This can be mitigated by enabling Apple’s Advanced Data Protection.)
- Syncing/storing photos on a non-privacy friendly or unencrypted storage provider like Google Photos? You’re sharing your metadata, like location data and timestamp of photos, with Google.
- Uploaded a photo to Instagram? Instagram may ingest your photo EXIF metadata.
- Used Imgur to share an interesting bird from your hike on Reddit? In addition to sharing metadata with Imgur or Reddit, anyone who views your photo may be able to download it and extract its EXIF data.
Files (that are not photos) also have associated metadata. Common file types that frequently retain and attach metadata include Word files (.docx) and PDF files (.pdf).
These files can contain a lot of metadata, such as:
- Author names
- Which “author” last saved the document
- Comments (which can contain additional information)
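For .docx files specifically, these properties are not hidden deep in a binary format. A .docx file is a ZIP archive, and the author fields normally sit in a small XML part named docProps/core.xml. The sketch below reads them using only Python's standard library. The helper names and sample author names are hypothetical, and make_sample_docx builds a stand-in archive for demonstration rather than a valid Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by docProps/core.xml in Office Open XML packages
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

FIELDS = {
    "author": "dc:creator",
    "last_saved_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(docx):
    """Return author-related properties from a .docx path or file-like object."""
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    # Missing properties come back as None
    return {name: root.findtext(tag, namespaces=NS) for name, tag in FIELDS.items()}


def make_sample_docx(author="Jane Doe", last_saved_by="John Roe"):
    """Build a tiny stand-in archive holding only docProps/core.xml (demo data only)."""
    core = (
        f'<cp:coreProperties xmlns:cp="{NS["cp"]}" xmlns:dc="{NS["dc"]}" '
        f'xmlns:dcterms="{NS["dcterms"]}">'
        f"<dc:creator>{author}</dc:creator>"
        f"<cp:lastModifiedBy>{last_saved_by}</cp:lastModifiedBy>"
        "<dcterms:modified>2024-03-04T12:00:00Z</dcterms:modified>"
        "</cp:coreProperties>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("docProps/core.xml", core)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(read_core_properties(make_sample_docx()))
```

Anyone who receives the file, including a storage provider that can read it, could run the same few lines. Comments and tracked changes live in other parts of the archive and are not covered by this sketch.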
If the cloud storage provider does not run their servers in-house, then third parties such as the server infrastructure provider (hosting) may have access to PII from file properties and metadata. Even if the cloud storage provider runs their own infrastructure, if their servers know the decryption keys for files stored, then they could effectively access this data for themselves or share access with other third parties.
Fortunately, it is possible to remove file metadata prior to uploading to a cloud service, just as it is for EXIF data in photos. For example, in Word, users can use a built-in function to erase document properties and PII.
Many users make use of cloud storage services - like iCloud, Google Photos, Dropbox, and others - to store files and photos and sync/access them across different devices. These services generally have access to your file and photo metadata.
Additionally, since their servers know the decryption keys, these services may also have direct access to your files; this access can be given to third parties, such as government entities or server infrastructure providers (if applicable).
How is metadata used?
Metadata has valid uses, even though there are ways to transmit communications without metadata or with minimal exposure of it. Usually, metadata usage (and collection) depends on the protocol and platform; some protocols rely on metadata more than others and, by extension, some platforms generate, use, and collect more of it.
There are a wide variety of uses - some rather invasive, some legitimate - for metadata. Of course, what is ultimately acceptable to a user is dependent on that same user’s threat model.
1. Message routing
Very generally speaking, metadata is used by clients and servers to route (and, in the process of routing, validate) communications to the appropriate parties. In some applications, this type of metadata is common because information such as the recipient and timestamp is needed to successfully route messages.
However, there are protocols and messaging implementations that 1) severely limit the amount of metadata exposed to servers or 2) eliminate the “need” for the server to have access to even limited amounts of metadata, even for the purposes of message routing.
2. Investigations
Metadata is often used in investigations, ranging from law enforcement investigations to open source intelligence (OSINT) investigations.
Most commonly, for law enforcement, we see a high interest in location data; the biggest example as of writing is the revelation of Fog Data Science, where local police could purchase a relatively low-cost subscription tool that provided location histories of millions of devices.
Interestingly, Fog Data Science didn’t collect the data itself - the organization did not create something like a downloadable app for end users and then access and collect data that way.
Rather, they aggregated data from hundreds of third-party apps that do collect location data. These weren’t only apps “traditionally” associated with location data usage (think GPS apps or ride-sharing apps); any app that collected location data and shared/sold it was essentially fair game. Much of this data was collected in the background.
So, that could include messaging apps that collect/approximate location data - such as WhatsApp - and email clients of privacy-unfriendly email providers… and this is just for location data.
Naturally, investigations may have interest in other communication metadata such as:
- Who sent a message
- When a message was sent
- On which device a message was sent/received
- Information associated with a handle/account
Even file - particularly, photo - metadata could prove valuable to investigations, as photos containing coordinates in their EXIF data may be inadvertently shared/published publicly.
For example, John McAfee’s location in Guatemala was exposed after Vice Magazine posted a photo taken with an iPhone; the iPhone’s camera is a “GPS-enabled” camera and the photo was not scrubbed of location EXIF data prior to posting.
It’s important to remember: people are accused of many things based on metadata. Divorces happen over metadata; metadata can compromise operations security (OPSEC); metadata can “make” an open source investigation or take an investigation further. Many things can be inferred fairly accurately from capturing and examining metadata.
People are even arrested and convicted based on metadata. According to General Michael Hayden (former director of the CIA and NSA), “We kill people based on metadata.”
3. Marketing and advertising
Metadata is often used in various marketing/advertising campaigns.
Data brokers may use metadata to compile central repositories on millions of people and then sell it to whoever has the desire and money to purchase it. AdTech may use metadata either directly collected from their own platforms or received from another party to then display targeted or re-targeted advertisements.
It seems that in the targeted and re-targeted marketing landscape (a hefty part of “surveillance capitalism”), any byte of data is sought after; this includes the metadata associated with communications.
If the 2023 WhatsApp landscape has taught us anything, it is that AdTech is likely highly interested in communication metadata across many different communication channels: phone calls, contact books/addresses, emails, messaging, and social media direct messaging. Specific examples can include:
- Email subject, to, and CC lines
- Location an email or message was opened
- “High” engagement with specific social media “influencers”
- Who you communicate with on a given platform, and how often
- Whether a link was clicked in an email or message
These data points (and many more) can be used to “hyper personalize” a marketing approach, which can include (but is not limited to) displaying targeted ads, sending “lead” emails to your inbox, displaying similar ads to your frequent contacts, or recommending purchases based on email links you’ve followed.
As a note, like law enforcement, marketing and AdTech have a high interest in location data. When tied with communications, location data can provide a wealth of information, which can be used for further profiling, marketing, and sharing/selling.
For example, just collecting location data over time is significant; someone can easily learn where you live, work, go to school, work out, grocery shop, and more just from gathering location data. Combined with communication metadata, complete outsiders can gain real insight into who you communicate with, when communication happens, and from where communication occurs.
They can also reasonably infer or model your future actions - will you message Bob from around the location of The Only Burger Joint in Town today at 1200 like you have every Monday and Wednesday for the previous three months?
4. Machine learning
Depending on the platform, metadata can be used to train machine learning algorithms; this is seen frequently with “free” and privacy-unfriendly services that take user data as “payment” for their services:
- WhatsApp uses metadata to train its AI, which is partially responsible for content moderation and spam detection on the platform.
- Google likely uses email message (header) metadata to train its AI-powered spam filter for its Gmail service (though it claims scanned email data is not used for advertising purposes). It also provides an application programming interface (API) for developers to tap into and use Gmail metadata.
- Microsoft uses metadata to train its AI assistant, Cortana; Cortana shares data with Microsoft and can’t be easily disabled on Windows 10/11.
- Facebook Messenger uses metadata to feed and train its various algorithms across the Facebook platform. Until Facebook implements end-to-end encryption for its Facebook Messenger, it also has direct access to metadata and message contents.
Naturally, using real user metadata can have privacy implications.
For example, many of these “free” services collect metadata that allows accurate location pinpointing even if permission to GPS is denied. Perhaps your important email to your professor is never delivered because the machine learning algorithm driving the spam filter falsely flags your message as spam and prevents it from ever being delivered.
Perhaps an email provider automatically adds an "event" to your calendar because the machine-learning algorithm detected a receipt plus a location, even though the calendar is shared with others. Perhaps you are swept up in a law enforcement investigation because an algorithm used metadata from your communications with your brother - who is the suspect in a drug investigation.
The issue compounds because most privacy policies don’t specify how metadata is directly collected or used.
How to protect metadata?
Protecting metadata can be fairly complicated. As mentioned, sometimes metadata has valid uses; it’s possible to use metadata in a way that shows respect to user privacy.
Typically, the “best way” to protect metadata is to limit the amount of metadata produced; the easiest way to accomplish this is to not share metadata in the first place.
This may mean using a communication platform (email or messenger) that doesn’t require access to extensive metadata in order to render services in the first place. Or using a cloud storage provider that also end-to-end encrypts metadata on the client (device) side prior to upload. Or using an email provider that minimizes information transmitted via email headers.
In most cases, if metadata must be generated and/or used, it should be either 1) minimal or 2) encrypted so that it’s unreadable by the server handling the request(s).
Use a secure messenger
With messengers and messaging platforms, metadata can include information associated with an “account” and metadata attached to messages.
Some message platforms - even those advertised as end-to-end encrypted messengers - collect and store metadata, such as to whom and when a message was sent/received/opened. They may also require direct access to contacts, aggregate data attained from third parties, and share data (like location data) with third parties.
Account creation with some messengers may require a valid email address or a SIM-connected phone number for use.
It’s difficult to directly address these issues from the user side. Mitigations, such as denying GPS location and refraining from transmitting sensitive data over a messaging platform, can be taken; however, the most effective way to minimize metadata generation, collection, and transmission on a messenger is to switch to a secure messaging alternative.
While not all messaging alternatives eliminate transmitting metadata altogether, secure messaging alternatives do respect the privacy of the user by not using this metadata outside of message routing. Most secure messengers do not require excessive personal information and do not engage in data collection or sharing of their own.
For messaging platform suggestions, users are highly encouraged to review the options presented in avoidthehack’s recommendations for secure messenger alternatives to messaging platforms like WhatsApp and Facebook Messenger.
Use an encrypted email provider
Email, as a protocol in general, creates and transmits a wealth of metadata; this is an issue with the protocol itself and is tough for even secure email providers to wholly address. Like messaging platforms, email metadata frequently includes information associated with an email account and metadata (email headers) attached to messages.
Email headers, the metadata associated with sending/receiving email messages, can expose a lot of information. Some of this includes the sender's IP address, the names of the servers handling the message, and the email client used to draft/send an email. Most encrypted email providers attempt to minimize the metadata transmitted and/or even potentially available to sending/receiving servers.
With the popular email service providers, account creation frequently requires PII - which is associated with the account, which is associated with messages sent/received - such as a valid phone number and first/last name. Many popular email providers collect a great deal of metadata over time for their own use; in many cases, the email provider even has access to users’ inboxes, even if they are not reading each and every message sent or received.
Encrypted and privacy-respecting email providers do not collect metadata or require PII to establish/use an account, even if metadata is transmitted over/to the provider's servers. Encrypted email providers use zero-knowledge implementations, ensuring not even the service provider's servers have access to users' inboxes.
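To show what a receiving server (or anyone holding a copy of the raw message) can read without touching the body, here is a small Python sketch using the standard library's email parser. The message, addresses, hostnames, and client name are fabricated for illustration, and real headers vary widely between providers.

```python
import re
from email import message_from_string

# Fabricated raw message; the IP addresses come from documentation-only ranges
RAW = """\
Received: from mail.example.org (mail.example.org [203.0.113.7])
\tby mx.example.net with ESMTPS; Mon, 04 Mar 2024 12:00:05 +0000
Received: from [198.51.100.23] (helo=laptop)
\tby mail.example.org with ESMTPSA; Mon, 04 Mar 2024 12:00:01 +0000
From: Alice <alice@example.org>
To: Bob <bob@example.net>
Subject: Lunch?
Date: Mon, 04 Mar 2024 12:00:00 +0000
User-Agent: ExampleMail/1.0
Message-ID: <abc123@example.org>

(body is never inspected here)
"""

IP_RE = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def header_summary(raw):
    """Collect commonly revealing header fields from a raw email message."""
    msg = message_from_string(raw)
    received = msg.get_all("Received", [])
    return {
        "from": msg["From"],
        "to": msg["To"],
        "date": msg["Date"],
        "client": msg["User-Agent"] or msg["X-Mailer"],
        # Bracketed IPv4 addresses from each hop, newest hop first
        "relay_ips": [ip for hop in received for ip in IP_RE.findall(hop)],
    }


if __name__ == "__main__":
    for field, value in header_summary(RAW).items():
        print(f"{field}: {value}")
```

In this fabricated example, the last address in relay_ips belongs to the sender's own device, which is the kind of detail privacy-focused providers try to avoid passing along.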
For email providers, users are encouraged to use an encrypted email provider - specifically one that implements zero-knowledge encryption on its servers, which helps prevent email scanning and "snooping" by the provider. Encrypted email providers may not totally encrypt transmitted metadata, but do strive to minimize metadata; metadata generally isn't actively collected/consumed by the provider's servers for purposes other than message routing.
Users are highly encouraged to review the options presented in avoidthehack’s recommendations for secure and encrypted email providers; at minimum, most encrypted email providers minimize metadata transmitted when sending an email message.
Use a secure-by-design cloud storage provider
As mentioned, files and photos have associated metadata. This metadata may be unencrypted, or encrypted but directly accessible to the cloud storage provider. This access may extend to other third parties like government entities or even the server infrastructure provider (hosting).
For secure (and privacy-respecting) cloud storage providers, users are highly encouraged to review options presented in avoidthehack’s recommendations for secure cloud storage providers.
Secure cloud storage providers implement zero-knowledge encryption for files uploaded to users' accounts; they go beyond just encrypting and securely storing the files because they also encrypt file metadata that would otherwise be made available to the provider's servers.
Many of the popular cloud storage providers do not encrypt files or their metadata on the client side prior to upload to their servers. These providers also know and hold the decryption keys for the files stored on their servers, so they could decrypt and access the files on a whim - for example, to investigate a suspected terms of service violation.
If using any other cloud storage service, users can encrypt their files prior to uploading to the cloud. This is good practice, even for recommended secure-by-design cloud storage providers. After all, the cloud is just someone else's computer!
Minimize metadata sharing across online activities
Metadata is generated with nearly every activity done online. Some of this metadata generated may be unavoidable, or exceedingly hard to minimize. However, this is not the case for all metadata!
In addition to using more privacy-friendly messengers, email providers, and cloud storage providers, users can take basic steps to minimize unintended (meta)data sharing for common online actions, such as but not limited to:
- Scrubbing photos of GPS location prior to sharing, publishing, or posting with anyone on any platform.
- Users may also want to consider scrubbing the exact timestamp on the photo.
- Encrypting files prior to upload on any cloud storage service.
- Scrubbing files like .docx and PDFs of unnecessary metadata prior to sharing or publishing online.
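For the photo scrubbing step, the following Python sketch shows one possible approach using only the standard library. In JPEG files, EXIF data sits in an APP1 segment near the start of the file, so copying the file while skipping APP1 segments drops it. This is a simplified illustration and not a replacement for a dedicated scrubbing tool. It assumes a well-formed JPEG without padding bytes between markers, it also removes the orientation tag (so some images may display rotated), and it leaves other metadata segments, such as comments, in place. The function name is hypothetical.

```python
import struct

SOI = b"\xff\xd8"  # start-of-image marker that opens every JPEG
APP1 = 0xE1        # segment type that carries EXIF (and XMP) data
SOS = 0xDA         # start of scan: compressed image data follows


def strip_exif(jpeg_bytes):
    """Return a copy of a JPEG with its APP1 (EXIF/XMP) segments removed."""
    if jpeg_bytes[:2] != SOI:
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(SOI)
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            raise ValueError("unexpected byte where a marker should start")
        marker = jpeg_bytes[pos + 1]
        if marker == SOS:
            # Everything from here on is image data; copy it verbatim
            out += jpeg_bytes[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        if marker != APP1:
            out += jpeg_bytes[pos:pos + 2 + length]
        pos += 2 + length
    raise ValueError("no start-of-scan marker found")
```

Usage would look something like reading a file's bytes, passing them to strip_exif, and writing the result to a new file before sharing it. Checking the output with a metadata viewer before publishing is still worthwhile.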
Metadata is just as valuable as direct message contents.
Metadata alone can tell most of the “story” without ever revealing the explicit content of a communication. File metadata, specifically photo EXIF, can add sensitive information to “regular” or seemingly innocent photos or shared files, which risks user privacy.
The best way for users to protect metadata is to either refrain from sharing it or use strong encryption. Users should rely on strong encryption to keep their data secure and private.
Users are advised to use secure messengers and encrypted email providers to reduce the likelihood their metadata is used for other purposes and to avoid unintended data sharing.
Users are also advised to use secure cloud storage providers where possible - ideally, regardless of the cloud storage provider, users would encrypt their files prior to upload to any file storage/sharing service.
Users should take care to at least remove location coordinates from photos prior to sharing them, whether between “trusted contacts” or on the internet.
With that said, stay safe out there!