Bayesian Flooding and Facebook Manipulation
May 23rd, 2012

For the past few months I've been conducting an open online experiment. It's really an extension of a seven-year project I've been working on, but this newer portion is exclusively focused on Facebook. The purpose of the experiment is to explore how much control a typical Facebook user has over his or her personal information online: not what is externally visible, but what is being analyzed internally. In other words, is it possible to manipulate the flow of information being collected for advertising and marketing purposes?

The idea rests on three basic realities:

  1. The advancement of online technology and communication is not going to slow down, much less stop. Social networking plays a vital role in this.
  2. In order to participate in society, one cannot simply hide from technology.
  3. There is a lucrative market for acquiring and selling private, personal information (demographics, lifestyle choices, spending habits, interests, etc.).

Over the past few years I've had dozens of people ask me if I thought they could remove themselves from the watchful eyes of Facebook. My answer is always no. The problem is that once the information has been collected, it will always be stored and associated with you. I have therefore devised a slightly different method for dealing with this problem. Rather than trying to hide information from Facebook, it may be possible simply to overwhelm it with too much information. While this may sound counter-intuitive, there is a well-known mathematical theorem that may in fact validate the idea.

Imagine if you were asked to look inside of a friend's pantry or cupboard for 30 seconds, and then to make a guess about their general diet. For most people this would be a pretty simple exercise and the results would probably be very accurate. But now imagine that when you peered inside, their pantry was magically the size of an entire grocery store and contained just as many products. Other than the fact that they weren't starving, what could you really report about this friend's diet? The amount of variation available would make the analysis very difficult. This is the basic idea.

Target and Teenage Pregnancy

Back in February of this year, Forbes released an article detailing how Target successfully determined that a teenage girl was pregnant before her father did. The girl had received coupons in the mail for a host of maternity items. This prompted her father to call Target's customer service department and complain that it was inappropriate to send a teenage girl such material. Target was quick to apologize, but a few days later the father actually wound up apologizing to Target, noting that his daughter was indeed pregnant. The question is, how could Target possibly have known this?

Like most big-box stores, Target tracks everything that their customers purchase. According to the article, customers are assigned a "guest ID number, tied to their credit card, name, or email address that becomes a bucket that stores a history of everything they've bought and any demographic information Target has collected from them or bought from other sources". So not only does Target keep a history of what customers buy, but they purchase demographic information about these customers from other companies as well. Perhaps companies like Facebook.

They are then able to have programmers and statisticians analyze giant chunks of data and assemble patterns of consumer spending habits. In the case of Target, the statisticians noticed that women tend to stock up on health supplements in their first trimester, and unscented lotions in their second trimester. As it turns out, they claim to have about 25 products that they use to compute a pregnancy score - the likelihood that a customer is pregnant. The article explains that when just four of those products are purchased by the same customer, the likelihood of that customer being pregnant is a whopping 87%.
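As a rough illustration of how a handful of purchases could be combined into a single score, here is a minimal sketch using a naive Bayes style update in log-odds form. It is not Target's actual model: the prior, the likelihood ratios, and the two placeholder product names are all invented for the example, and the function name `pregnancy_score` is hypothetical.

```python
import math

# Hypothetical likelihood ratios: how many times more often a pregnant
# shopper buys the item than a non-pregnant shopper. All values are invented.
INDICATOR_RATIOS = {
    "health supplements": 3.0,
    "unscented lotion": 4.0,
    "placeholder item 3": 3.0,   # stand-in for another indicator product
    "placeholder item 4": 2.5,   # stand-in for another indicator product
}


def pregnancy_score(basket, prior=0.03):
    """Estimate P(pregnant | basket), treating indicator items as independent."""
    # Start from the prior expressed as log-odds.
    log_odds = math.log(prior / (1.0 - prior))
    # Each distinct indicator item multiplies the odds by its likelihood ratio.
    for item in set(basket):
        ratio = INDICATOR_RATIOS.get(item)
        if ratio is not None:
            log_odds += math.log(ratio)
    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    print(pregnancy_score(["bread", "milk"]))
    print(pregnancy_score(["unscented lotion"]))
    print(pregnancy_score(list(INDICATOR_RATIOS)))
```

The point of the sketch is only that each additional indicator purchase pushes the score upward, so a few individually weak signals can add up to a confident guess.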

It might sound a bit like witchcraft, but it's the type of powerful statistical analysis that can be done with such large data pools. Imagine the selling power a company like Facebook has when it comes to providing supplemental demographic information to a company like Target. Whether or not they do sell such information is a matter of their internal business practices, but the data they are regularly collecting is priceless. This provides them with a continued incentive to collect and store personal data about individuals.

Email Extrapolation

In 2004 I was working for an email marketing company. Naturally we were eager to analyze potential customers any way that we could, but often had little more to work with than their email address (incidentally these were typically acquired via some pseudo-scandalous online promotion). I began working independently on an interpretive email address analyzer and, after leaving the company, had mixed success in marketing the product to other businesses. The relevance of this experience is how much can be ascertained just from a single email address, never mind the hordes of voluntary information associated with a Facebook profile.

The basic process would run an email address through a series of filters I had devised. Each filter would output how the email address scored against that particular filter. For example, say we wanted to analyze the gender of a user. If the email address reads susan@example.com, it's pretty likely it belongs to a female for the obvious reason that "Susan" is a female's name. The program would cross-reference a database of names and assign a probable score of the user being female. But consider a less obvious address like steelerj67@example.com. A different gender filter would extract the word "steeler", cross-reference a list of sports teams, and associate it with the Pittsburgh Steelers. It would then score the user as a male, albeit with much less certainty than the previous example. The logic being that females tend not to associate their email addresses with sports teams. There could be dozens of filters applied just to determine the likely gender of the person. And of course other filters would perform completely different tasks. In this case one would also extract the "67" and interpret it as a birth year.

Without performing this kind of analysis we just have a customer email address. But with this analysis we potentially have a 45-year-old male living in Pennsylvania who likes football. That's a lot of data to extract from a single address, even if it is speculative.
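The filter pipeline described above can be sketched roughly as follows. This is a simplified reconstruction for illustration, not the original analyzer: the lookup tables, the confidence values, and every function name are assumptions, and a real system would rely on far larger databases and many more filters.

```python
import re

# Tiny hypothetical lookup tables standing in for real databases.
FEMALE_NAMES = {"susan", "mary", "linda"}
MALE_NAMES = {"john", "robert", "david"}
TEAM_KEYWORDS = {"steeler": ("Pittsburgh Steelers", "Pennsylvania")}


def name_filter(local):
    """Guess gender from a first name at the start of the address."""
    match = re.match(r"[a-z]+", local)
    word = match.group(0) if match else ""
    if word in FEMALE_NAMES:
        return {"gender": ("female", 0.9)}
    if word in MALE_NAMES:
        return {"gender": ("male", 0.9)}
    return {}


def team_filter(local):
    """Guess gender, interest, and state from a sports team keyword."""
    for keyword, (team, state) in TEAM_KEYWORDS.items():
        if keyword in local:
            return {
                "gender": ("male", 0.6),   # weaker evidence than a first name
                "interest": (team, 0.7),
                "state": (state, 0.5),
            }
    return {}


def birth_year_filter(local):
    """Interpret exactly two trailing digits as a 20th-century birth year."""
    match = re.search(r"(?<!\d)(\d{2})$", local)
    if match:
        return {"birth_year": (1900 + int(match.group(1)), 0.4)}
    return {}


FILTERS = [name_filter, team_filter, birth_year_filter]


def analyze(email):
    """Run every filter and keep the most confident guess per attribute."""
    local = email.split("@", 1)[0].lower()
    profile = {}
    for run_filter in FILTERS:
        for attribute, (value, confidence) in run_filter(local).items():
            if attribute not in profile or confidence > profile[attribute][1]:
                profile[attribute] = (value, confidence)
    return profile


if __name__ == "__main__":
    print(analyze("susan@example.com"))
    print(analyze("steelerj67@example.com"))
```

Each filter returns its guesses together with a confidence, so a weak signal such as a team name never overrides a stronger one such as a recognized first name.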

This type of analysis is certainly not always accurate, but it is statistically very relevant, and can greatly affect the usefulness of large datasets. More importantly, when we apply the analysis of these filters to other collected demographic information, the results tend to improve very significantly. When these methodologies are applied to tens of millions of customer records, the aggregate change in revenue can increase tremendously. Advertising to a 13-year-old female versus a 62-year-old man requires a pretty substantial change in advertising content. Simply knowing this is invaluable to a company.

If I was able to successfully extract so much information just from an email address, think of how accurate Facebook can be with all of the information users provide them.

Bayesian Flooding


[Image: A look at Facebook's advertising system.]

Facebook no doubt uses methods similar to the two described above, but they also enjoy the luxury of collecting large amounts of personal data directly from users - this is exclusively what their service does. In turn, they're able to use the information collected to advertise to their users. If you've ever used their advertising system, you probably know just how powerful it is. The level of granularity and depth one is able to target for advertising purposes is far beyond the scope of Google. And yet this powerful advertising model may still suffer from a rather obvious Achilles' heel. For the time being at least, it appears to depend very heavily upon the honesty of its users.

This is where the experiment I have been working on comes directly into play.

Over the past several months I have entered a myriad of life-events to my Facebook profile using their new Timeline feature. Some of those life-events are true, and some of them are not. In my fictitious life I've explored a dozen different religions, had countless injuries and broken bones, suffered twice through cancer, been married, divorced, fathered children all around the world, and have even fought for numerous foreign militaries.

This is what I refer to as Bayesian Flooding, and to be perfectly honest, it's turned out to be a great deal of fun. My intent was to coin the term within the same sphere as Bayesian Filtering, a common method of filtering junk email by word analysis. Of course both terms pay homage to Thomas Bayes, a mathematician best known for Bayes' Theorem.


[Image: The basic formula describing Bayes' Theorem, P(A|B) = P(B|A) P(A) / P(B). It depicts the conditional probability of event A given that event B has occurred.]

Bayes' Theorem is a commonly applied mathematical formula used for calculating the conditional probability of some event given that some additional event has occurred, or that some additional knowledge has been gained. For example, if someone told you they had a nice conversation on a train, the probability it was a woman they spoke with is 50%. If they told you the person they spoke to was going to attend a quilt exhibition, it is far more likely than 50% it is a woman[1].

The probability of correctly assuming the gender increases because you have gained more information about the original problem. Mathematically speaking, you are now considering the probability that the person is a woman given that you know the person was attending a quilt exhibition; knitting, crocheting, and quilting are more typically regarded as female hobbies. An interesting quirk of Bayes' Theorem is that it heavily relies upon sexism, racism, ageism, and every other type of generalization imaginable in order to draw assumptions. This is not because it is somehow prejudiced, but rather because such categorizations can be shown to be statistically accurate. The key is to have accurate statistics about the topic being generalized. To borrow from the esteemed Sherlock Holmes, "You can never foretell what any one man will do, but you can say with precision what an average number will be up to." This is essentially the backbone of Bayes' Theorem.
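To make the train example concrete, here is a worked calculation. The figures are assumptions chosen purely for illustration: suppose 10% of women and 1% of men on the train are headed to a quilt exhibition, and that W, M, and Q stand for "woman", "man", and "attending the quilt exhibition".

```latex
\[
P(W \mid Q)
  = \frac{P(Q \mid W)\,P(W)}{P(Q \mid W)\,P(W) + P(Q \mid M)\,P(M)}
  = \frac{0.10 \times 0.5}{0.10 \times 0.5 + 0.01 \times 0.5}
  = \frac{0.05}{0.055}
  \approx 0.91
\]
```

Under those assumed rates, one extra piece of information moves the estimate from 50% to roughly 91%.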

When Facebook analyzes my profile and notices that I have participated in a dozen different religions over the past 30 years, their engine should make the assumption that I am interested in theology and various disciplines of spirituality. As a result, they're more likely to serve me ads and recommendations within this realm, perhaps for spiritual books, personal retreats, or the like. For a sizable majority of people, these assumptions and recommendations will be accurate and should result in a better click-through rate and ultimately more revenue. But the truth is that I'm an agnostic atheist and certainly couldn't care less about religious topics.
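The effect of flooding on such an engine can be sketched with a deliberately simple model. It assumes interests are inferred from nothing more than the relative frequency of event categories on a profile, which is certainly cruder than whatever Facebook actually runs. The category labels and function names are hypothetical.

```python
import math
from collections import Counter


def interest_distribution(events):
    """Turn a list of event categories into relative frequencies."""
    counts = Counter(events)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def entropy_bits(distribution):
    """Shannon entropy in bits: higher means a less informative profile."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


# A sparse, honest profile versus the same profile after flooding.
genuine = ["travel", "travel", "photography", "travel"]
flood = ["religion", "military", "medical", "marriage",
         "parenting", "religion", "medical", "military"]

before = interest_distribution(genuine)
after = interest_distribution(genuine + flood)

if __name__ == "__main__":
    print("top interest before:", max(before.values()), "entropy:", entropy_bits(before))
    print("top interest after: ", max(after.values()), "entropy:", entropy_bits(after))
```

In this toy model the flooded profile has a weaker top interest and a higher entropy, which is the pantry-turned-grocery-store effect expressed in numbers.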

It might seem like childish anarchy, but there is a legitimate rationale behind wanting to fool the engine. As data analysis becomes more and more detailed (namely due to our world being digitally cataloged), companies are inventing coercive psychological tricks that manipulate consumers into spending more, plain and simple. The products aren't necessarily getting better; rather, the science of selling the products is. Advertisers argue there are benefits to more efficiently targeting customers, but I believe these benefits fail to acknowledge the downside consumers face. It's simply a matter of knowing far too much about a person while having the singular goal of acquiring their money. If consumer manipulation is harmless, I would have to strongly question why we condemn psychics for applying similar tricks, while at the same time congratulating the business world.

Beyond psychological manipulation, there are also legitimate privacy concerns that need to be taken into consideration, much like the case of the pregnant teen. If advertisements became completely personalized, it would be possible to learn virtually anything about someone just by observing what they were suggested to buy, never mind what they actually bought. Whether or not a teenager should be made to disclose her pregnancy to her father is a matter for a different debate. But I believe it's a pretty unanimous position that Target should not be involved at any level of the discussion.

There are dozens of very large players in this game at the moment, but Facebook and Google are most likely the two best-known. Even people who may not grasp the complexity of personal data collection probably still suspect they're being cataloged - and they're right. But an interesting difference between Facebook and, say, Google is that I believe Facebook's data pool can be distorted without inhibiting one from using the site. If Facebook became the go-to source for private, personal information, and that information was flawed, it would potentially affect all other analysis of the individual as well.

The theoretical advantage a company like Google has is that it would be extremely difficult to apply an idea like Bayesian Flooding to their model with any level of practicality. Google's paid advertising is primarily based upon the user's active search query. If you enter a false query, you'll get answers to questions unimportant to you; it would be a futile exercise. Conversely, when you actually need to search for something, the ad engine would still be just as effective since it runs in real-time. This is especially true of services like Gmail. In order to flood a Gmail account, not only would you have to send nonsensical emails to contacts with some regularity (often referred to in this sense as Bayesian Poisoning), but recipients would have to reply in a similar manner. Of course the only reason one sends email in the first place is to exchange communication and thus the purpose of the tool would again be lost, at least in practical terms.

By contrast, applying Bayesian Flooding to a Facebook profile is quite trivial and in no way inhibits one from still enjoying the many facets of their service. The method only disturbs the advertising and recommendation model, not the actual tool. It is still possible to share photos, exchange stories and ideas, and comment on posts regardless of any superfluous details that happen to be associated with one's profile. What's the real harm if someone on Facebook thinks I spent two years in the Pakistani National Army so long as I can still share photos with them from my recent trip to Canada? With the release of their Timeline system, anybody is free to add such details, regardless of how accurate they are.

Now that Facebook has decided to become a publicly traded company, it seems to me this is a pretty significant detail shareholders are likely to begin questioning. It may even be one of the reasons why they have recently become so anxious to get people using their email services; such services are much more complicated to fool (as described via Google above). The more people that begin to use Facebook for day-to-day emailing and chatting, the more accurate and valuable each individual dossier becomes.

Results of the Experiment


[Image: Some of my current recommendations from the Facebook robot.]

Thus far, my experiment seems to be producing exactly the results I had hypothesized it would. Whatever algorithm(s) Facebook uses to recommend pages is evidently picking up on my colorful assortment of life-changing events. This is a promising start for those interested in reducing what they're worth to Facebook as a human commodity.

Companies are willing to pay for advertisements because ads produce quantitative, measurable results. If one receives an advertisement that is irrelevant to them, the cost of that ad space has been wasted. If the cost of advertising outweighs the return on investment, companies tend to stop advertising. This is an oversimplification of the whole cycle, but illustrates the basic premise.
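The premise can be put into rough numbers. The sketch below assumes a very simple model in which only ads shown to relevant users can earn anything; the function name and every figure are hypothetical and not drawn from any real campaign.

```python
def ad_return(impressions, relevant_share, click_rate,
              value_per_click, cost_per_impression):
    """Net return of a campaign where only relevant impressions can pay off."""
    revenue = impressions * relevant_share * click_rate * value_per_click
    cost = impressions * cost_per_impression
    return revenue - cost


if __name__ == "__main__":
    # Well-targeted campaign: every impression reaches an interested user.
    print(ad_return(1000, 1.0, 0.02, 1.00, 0.01))
    # Flooded profiles: only a fifth of impressions reach an interested user.
    print(ad_return(1000, 0.2, 0.02, 1.00, 0.01))
```

With these invented figures the campaign pays for itself when targeting is accurate and loses money once most impressions land on irrelevant profiles, which is the only point the sketch is meant to make.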

Of course there are numerous methods Facebook uses to provide recommendations and advertisements to people. Some are based on the 'likes' of Facebook friends, some are based on cookies from other sites, and I would imagine that some are even based on internal browsing history. So while it may not be possible to manipulate all of the personal information being collected about you, my own experimentation suggests the Timeline feature can be used to manipulate it to some degree.

Incidentally I've also discovered numerous bugs in their system, most of them related to dates on the Timeline. Although Facebook does not allow you to add life-events prior to your birth, they do permit you to include other people in your life-events that occurred before they were born. I'm sure Facebook will eventually fix this, but for the time being I suspect it makes their Bayesian analysis that much more inaccurate.

Algorithmic Corrections

It might be fair to ask if a company like Facebook would be able to adapt to something like Bayesian Flooding. The short answer is that yes, they most definitely would be able to. Data-mining companies can analyze all sorts of patterns illustrating how "normal people" tend to enter personal data. Once normal behavior patterns are established, it is not particularly difficult to flag outliers. Of course what Facebook might do with such outliers requires a bit of speculation.
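One simple way such flagging could work is sketched below. It assumes the only signal is the number of life-events each user has entered and uses a modified z-score based on the median absolute deviation, a common robust outlier test. Nothing here is known about Facebook's real detection methods; the user names, counts, and threshold are hypothetical.

```python
from statistics import median


def flag_outliers(event_counts, threshold=3.5):
    """Return users whose event count is far from the typical count."""
    values = list(event_counts.values())
    typical = median(values)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = median([abs(v - typical) for v in values])
    if mad == 0:
        return []  # no measurable spread, so nothing to compare against
    flagged = []
    for user, count in event_counts.items():
        modified_z = 0.6745 * (count - typical) / mad
        if abs(modified_z) > threshold:
            flagged.append(user)
    return flagged


if __name__ == "__main__":
    counts = {"alice": 4, "bob": 6, "carol": 5,
              "dave": 7, "erin": 3, "flooder": 60}
    print(flag_outliers(counts))
```

A robust statistic is used rather than a plain mean and standard deviation because a single heavy flooder would otherwise inflate the very baseline it is being compared against.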

At worst they could be banned for intentionally misusing the system. However, it seems this would be a little short-sighted as the user still carries a marketable value, perhaps just less of one. Instead I suspect they would be flagged to not receive certain kinds of advertisements, thus not wasting advertisers' money. After all, the company hosting the ad wants the product to sell as much as the company selling the product does. Why else would the advertising dollars continue to pour in?

If Facebook chose to implement filters to detect this type of Bayesian Flooding, people like myself would simply concoct new ways to further interfere with those filters, perhaps by adding events at a slower rate, or adding events more central to my actual life. Facebook would then try to correct those methods, and so on. This is how technological cat and mouse games get started, similar to fighting spam or attempting to prevent piracy. As a technological rule of thumb, every measure invented to curb a certain practice eventually has a counter-measure to circumvent it.

But having written that, there is an upside for those who take to Bayesian Flooding. I very sincerely doubt enough people will be interested in the idea to create a blip on Facebook's radar, much less change anything. That probably means those wishing to partake would find success in the idea.

Conclusion

While I'll definitely keep experimenting with their system, it does seem for the time being that people can directly affect what advertisements they receive simply by flooding their profile. The new Facebook Timeline feature makes this simple and even fun. And though the idea may appear petty to some people, those of us wishing to protect our privacy and avoid being cataloged by corporate America may find it beneficial to our cause.

If you have any comments or thoughts on this process, please feel free to contact me as always.