Metadata is data about data, though that definition does not help much until you think it through. Metadata adds context to other data. When I think of metadata, one of the first things that comes to mind is photography. You have a photograph that you can look at, and you can see the image, the colors, the shadows, and the composition. But the photograph also carries context that you cannot necessarily see, like how large the digital picture is, when it was created, and what the image resolution is. Some of that data is not even photographic in nature, like the author, the date and time, or geolocation information. Sometimes the photograph’s meaning changes depending on that additional information.
Consider the following photograph of part of a white-furred animal. Now name that animal. The list of potential candidates is very long, and you probably cannot identify the animal with any level of confidence. But even a little data about the photograph can vastly improve your chances of success. If I tell you the photograph was taken in Colorado, you can probably limit your options to “mountain goat” or “wolf”. If you know the photograph was taken near Hudson Bay, you might identify the animal as a polar bear. But if I don’t tell you the photograph was taken in Peru, would you ever guess “alpaca”?
Consider, if you will, the next photograph. I expect very few people will immediately recognize it. Typical answers are probably a black and white photo of a beach, satellite topography, piece of shale, and many other answers to try to identify what look to be striations or layers. I could give you f-stop and focal length, but it would likely not help unless you have more context. If I tell you there is no trick here, but only macro photography with a single reflected white halogen light source, and that I took that photograph in my basement office, does that help? If you have never seen my office, probably not much, because you still have a relative lack of data about the picture; a lack of metadata.
So we can see value in metadata, but is there enough value to justify the long-term identification, gathering, dissemination, and analysis of metadata?
Assume the following map shows where Joe Green travels during his typical days. The red marks are some of the cell towers to which Joe connects, along with the timeframe of some of those connections. The location information relating to Joe’s calls is clearly metadata. No one is listening to his calls, just observing where Joe travels during a day. This is not even all the metadata available for Joe. We are not looking at his browsing habits or the searches he runs from his phone, and we are not analyzing what apps and accounts he has. This is JUST geolocation data. What can we tell from this metadata?
First of all, we know Joe lives just off Sleepy Hollow Rd. in Annandale, VA. That time stamp from midnight to 7:30 am most likely shows Joe at home, in bed, asleep, and then getting ready for work. I might guess that the stop at 7:50 near Pentagon City is a coffee shop (verified by checking Starbucks locations). At 8:15 Joe arrives at work, and we can have confidence in that because his location is more or less steady from 8:15 to about 5:17. But Joe did not go right home. He made another stop at 5:43 near Bailey’s Crossroads. Where? Given that the stop lasted about an hour, it could have been a quick supper, or groceries, or a drink, or a gym; any number of activities. So, just by looking at the metadata about where he travelled, we know a lot about Joe: where he lives, where he gets coffee, where he works, and that he takes Columbia Pike to the office, despite all those stupid traffic lights and the chaos that is defined by Bailey’s Crossroads. We can predict about how much gas he uses in a normal week (18 miles round trip to work), and more.
Every Saturday at 8:00 am he shows up at the Army Navy Country Club (ANCC) for a round of golf. Since he plays at the ANCC, we may wonder if Joe is a current or former officer in the military or a senior government employee in the national security community. And, since he seems to keep a regular 8:00 am tee time, we can probably assume he has some pull at ANCC. He takes just over three hours to play his round and leave the club, so he is probably not playing a full 18 holes. He then appears to drive to Lubber Run Park, where he stays for about five minutes. Maybe he only needs five minutes to walk his dog, Giuseppe Verdi (what else would his dog be named?), but so be it. As metadata, other than the fact that he plays at ANCC, none of that is especially interesting.
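The weekday guesses above (home is where the phone sits overnight, work is where it sits during the day) are simple enough for a computer to make without any person looking at a map. Here is a minimal sketch in Python, assuming the tower connections have already been boiled down to stays with an arrival and departure time. The tower labels, the "Office" name, the 8:05 coffee departure, and the evening return home are my own placeholders for illustration, not anything from Joe's actual records.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical stays for one weekday: (tower label, arrival, departure), HH:MM.
# Times loosely follow the narrative; gaps were filled with made-up values.
STAYS = [
    ("Annandale", "00:00", "07:30"),
    ("Pentagon City", "07:50", "08:05"),
    ("Office", "08:15", "17:17"),
    ("Bailey's Crossroads", "17:43", "18:45"),
    ("Annandale", "19:05", "23:59"),
]


def stay_minutes(arrival, departure):
    """Length of a same-day stay in whole minutes."""
    fmt = "%H:%M"
    delta = datetime.strptime(departure, fmt) - datetime.strptime(arrival, fmt)
    return int(delta.total_seconds() // 60)


def guess_home_and_work(stays, day_start=6, day_end=19):
    """Guess home as the tower with the most night-time minutes and
    work as the tower with the most day-time minutes.

    A stay counts as "day" if it begins between day_start and day_end
    (hours, 24-hour clock); everything else counts as "night".
    """
    night = defaultdict(int)
    day = defaultdict(int)
    for tower, arrival, departure in stays:
        hour = int(arrival.split(":")[0])
        bucket = day if day_start <= hour < day_end else night
        bucket[tower] += stay_minutes(arrival, departure)
    home = max(night, key=night.get)
    work = max(day, key=day.get)
    return home, work


print(guess_home_and_work(STAYS))  # ('Annandale', 'Office')
```

The cutoff hours are arbitrary choices; a real system would presumably look at many days rather than one, and would have to cope with shift workers and weekends.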
Now let’s consider Joe’s phone calls, another easily identified piece of metadata. (Some meaningless numbers have been deleted.) He calls 555-555-5555 every day at about 8:30 am after he arrives at work, and again at the end of every day, apparently while he is driving. First bets are parents, spouse, girlfriend/boyfriend, or child. 111-111-1111 is likely a conference call number, since those are long calls during the work day. I am not sure who besides a close family member could get away with a 54-minute call starting at 9:17 in the evening, so 222-222-2222 is probably a brother, sister, or parent. My bet is that 444-444-4444 belongs to the friend with whom he plays golf Saturday morning. So, while we can start penciling in relationships, all in all, this is not very interesting information.
And metadata about his text messages is so boring I am not even including it. He rarely sends text messages, and they mostly go to the same “suspected family” numbers as above. Except that every Saturday, about the time he leaves Lubber Run Park, he sends a single text to 666-666-6666. This is slightly more interesting, since he never has any call or text communication with that number other than at about 11:48 every Saturday.
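That kind of oddity, a number that only ever shows up in one narrow weekly slot, is the sort of thing a program can pick out of a log. A rough sketch follows, assuming each call or text has been reduced to a number, a weekday, and a time expressed as minutes after midnight. The log entries, the function name, and the thresholds are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical log: (number, weekday, minutes after midnight).
# 708 minutes after midnight is 11:48 am.
LOG = [
    ("555-555-5555", "Mon", 510),
    ("555-555-5555", "Mon", 1040),
    ("555-555-5555", "Tue", 512),
    ("555-555-5555", "Sat", 600),
    ("666-666-6666", "Sat", 708),
    ("666-666-6666", "Sat", 707),
    ("666-666-6666", "Sat", 709),
]


def single_slot_contacts(log, min_events=3, window=15):
    """Return numbers whose every contact falls on the same weekday
    and within `window` minutes of each other, seen at least
    `min_events` times."""
    by_number = defaultdict(list)
    for number, weekday, minute in log:
        by_number[number].append((weekday, minute))

    flagged = []
    for number, events in by_number.items():
        weekdays = {day for day, _ in events}
        minutes = [minute for _, minute in events]
        same_day = len(weekdays) == 1
        narrow = max(minutes) - min(minutes) <= window
        if len(events) >= min_events and same_day and narrow:
            flagged.append(number)
    return sorted(flagged)


print(single_slot_contacts(LOG))  # ['666-666-6666']
```

On its own a flag like this proves nothing; plenty of people text the same person at the same time every week. It only marks the number as worth a second look.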
If we analyze Joe’s social media presence we can probably identify what his job is, his age, his political party, major interests, and more, but some of that begins to go beyond metadata. If we simplify the equation to just his geolocation, phone and text details, we can see part of Joe’s life in the nutshell of metadata. It is just some random information that may or may not show anything of interest. And, in Joe’s case, when viewed in isolation, it is rather unremarkable. But the power of metadata does not come from the data itself but from the ability of that data to be processed and correlated in an automated fashion.
The best intelligence about what is heard on a wiretap comes from a law enforcement official actually listening to the tape and identifying interesting things. The audio has to be processed by a person. But metadata is raw data which can be processed by computer instead of by a person. Computers can take Joe’s raw metadata and compare it with everyone else’s data: with my data, with your data, with my daughter’s data, with the data from my mailman, and with the data of people whom the NSA or FBI (or whoever is watching) may find of particular interest.
So, yes, they can include Natasha’s data, because they know Natasha is up to no good. Natasha’s metadata shows the same type of information about Natasha that it shows about Joe. Natasha lives in a condo near Georgetown. Someone can identify the route she takes to work as a “cultural attaché”. Her geolocation data suggests where she goes to lunch every day. The data indicates what numbers she dials and whom she texts. It may make it obvious she works out at a gym in Rosslyn. Someone reviewing her geolocation data might see that she walks her Bichon Frisé, Boris (really, what else would she call him?), in Lubber Run Park every Saturday afternoon. Where she shops for groceries may be visible, and even obvious. If Natasha is a “person of interest”, counterintelligence groups might even know what kind of toothpaste she uses. And all the information available about Natasha is input into computer systems which are processing this metadata from everyone else, always looking for correlation. Looking for patterns and similarities…
Wait a minute. She walks Boris in Lubber Run Park Saturday after lunch. Do we know anyone else who is regularly in Lubber Run Park?
So, because some computer system full of metadata identifies a correlation, Joe Green suddenly becomes a person of interest instead of just a pile of metadata. Law enforcement and intelligence agencies assess the potential threat: who is Joe Green, and to what does he have access? If Joe works in the Crystal City Comic Book shop and plays golf at the ANCC with his USAF Ret. Col. brother, then this might quickly end up as temporary noise. If Joe holds a clearance and is currently working on classified projects, law enforcement might pursue action, perhaps a subpoena for additional phone records and information to try to build more context. Is this a coincidence, or is this true correlation? At some point, if they felt they had enough information to think that actual email text, text message text, or phone call audio would be valuable, they would establish probable cause and pursue a warrant.
So, the value in the metadata isn’t about the data itself. The value in the metadata is in the ability of computer systems to process large amounts of metadata, looking for correlations which may not otherwise be found. But to do that, you really need a lot of metadata, because you want the ability to add as much context as you can as rapidly as you can.
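To make the idea of automated correlation a little more concrete, here is a small Python sketch of the Lubber Run Park match. It assumes everyone's geolocation metadata has been reduced to (person, place, day) sightings, where the day is just a counter, and it looks for pairs of people who turn up at the same place on the same day again and again. The sightings, the names used as keys, and the threshold of three days are all made up for illustration; this is one simple way such a match could be computed, not a description of how any agency actually does it.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sightings: (person, place, day number). Days 6, 13, and 20
# stand in for three consecutive Saturdays.
SIGHTINGS = [
    ("joe", "Lubber Run Park", 6),
    ("joe", "Lubber Run Park", 13),
    ("joe", "Lubber Run Park", 20),
    ("natasha", "Lubber Run Park", 6),
    ("natasha", "Lubber Run Park", 13),
    ("natasha", "Lubber Run Park", 20),
    ("joe", "Annandale", 8),
    ("mailman", "Annandale", 8),
    ("natasha", "Rosslyn", 9),
]


def recurring_overlaps(sightings, min_days=3):
    """Map (person_a, person_b, place) to the number of distinct days
    both were seen there, keeping only pairs with at least min_days."""
    people_at = defaultdict(set)  # (place, day) -> people seen there
    for person, place, day in sightings:
        people_at[(place, day)].add(person)

    shared_days = defaultdict(set)  # (a, b, place) -> days both were there
    for (place, day), people in people_at.items():
        for a, b in combinations(sorted(people), 2):
            shared_days[(a, b, place)].add(day)

    return {
        key: len(days)
        for key, days in shared_days.items()
        if len(days) >= min_days
    }


print(recurring_overlaps(SIGHTINGS))
# {('joe', 'natasha', 'Lubber Run Park'): 3}
```

Joe and the mailman share a place once and fall below the threshold, which is the "temporary noise" case. The buckets here are deliberately coarse (same place, same day); tightening them to hours would miss two people who visit the same park a few hours apart, and loosening them would flag half the neighborhood. That trade-off is one reason the volume of metadata matters so much.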
As for the second picture above, it shows a section, about an inch and a half long, of the hamon from my wakizashi, from the shinogi out to the sharp edge. The part of the blade from the shinogi to the mune is hidden off the top of the picture. The lines on the genuine hamon come from the folded steel, which results in 1024 layers. That detail is all “data”, and not metadata. You don’t get that level of information just by looking at a picture; you get it from adding enough context that you know exactly what you are looking at. And you probably won’t get that by just browsing metadata; you need more detail. In a case like Joe Green’s, you probably need to be listening to those phone calls.