By Troy Chevalier
A friend and fellow food enthusiast has a food blog on her cooking website (www.cooked.it). She recently applied to the Google AdSense program and was rejected on the grounds that her site was "Adult Content." This was puzzling, because it is a very G-rated website devoted entirely to cooking. Thinking that perhaps it was a mistake, she reapplied and was quickly denied again. As you might expect, Google can't have a human review every site, so it relies on algorithmic techniques, hence the swift (but incorrect) response.
Butterflied roast chicken. Looking at the blog (cooked.it/blog), what caused it to be rejected? One thought was that the section on butterflied roast chicken shows step-by-step how to butterfly a "naked" chicken. Perhaps the raw chicken was misclassified as human skin? Another thought was the text's repeated use of the word "breast" (breast bone, breast meat).
Detecting adult content. Luminate also invests considerable effort to automatically detect adult content, using a variety of machine learning techniques:
- Classification of text associated with images. Luminate uses a common approach called logistic regression.
- Analysis of images. Skin-based detectors are a common approach, where a combination of color and texture information is used to determine whether each pixel is skin or not. The size and number of skin regions, plus other features, are then used to classify the image type.
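To make the text side concrete, here is a minimal sketch of logistic regression over a bag of words. It is not Luminate's or Google's actual classifier: the training sentences, labels, learning rate, and function names are all made up for illustration, and a real system would train on far more data with a proper library.

```python
import math

# Hypothetical toy training data (not from any real system): 1 = adult, 0 = safe.
TRAIN = [
    ("roast chicken breast with lemon", 0),
    ("remove the breast bone before roasting", 0),
    ("butterfly the chicken on a cutting board", 0),
    ("explicit adult photos", 1),
    ("nude adult video", 1),
    ("explicit nude breast photos", 1),
]

def tokenize(text):
    return text.lower().split()

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(tokens, weights, bias):
    # Linear score: bias plus one weight per token occurrence.
    return bias + sum(weights.get(t, 0.0) for t in tokens)

def train(examples, epochs=200, lr=0.5):
    # Plain stochastic gradient descent on the log-loss.
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            tokens = tokenize(text)
            error = label - sigmoid(score(tokens, weights, bias))
            bias += lr * error
            for t in tokens:
                weights[t] = weights.get(t, 0.0) + lr * error
    return weights, bias

def predict_proba(text, weights, bias):
    # Estimated probability that the text is adult content.
    return sigmoid(score(tokenize(text), weights, bias))

weights, bias = train(TRAIN)
for text in ("chicken breast bone", "explicit nude photos"):
    print(f"{predict_proba(text, weights, bias):.2f}  {text}")
```

Because "breast" appears in both classes here, the surrounding words decide the outcome. A model trained without any cooking text would have no such counterweight, which is one plausible way a recipe page could score badly.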
What caused Google to reject the food blog? There are many images that include close-ups of hands. There are also several images that include wooden cutting boards, and materials like wood and sand can sometimes be misclassified as skin. The rejection most likely resulted from a combination of factors, including the total number of images perceived as objectionable.
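The wood and sand problem is easy to reproduce with a color-only rule. The sketch below uses a widely cited RGB skin heuristic from the research literature; it is an assumption for illustration, not the detector Google or Luminate runs, and the sample colors and the tiny "image" are invented.

```python
def is_skin(pixel):
    # Color-only RGB rule of thumb; no texture information is used.
    r, g, b = pixel
    return (
        r > 95 and g > 40 and b > 20
        and max(r, g, b) - min(r, g, b) > 15
        and abs(r - g) > 15
        and r > g and r > b
    )

def skin_fraction(pixels):
    # Share of pixels in the image that the rule labels as skin.
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if is_skin(p)) / len(pixels)

# Hypothetical sample colors (RGB).
SKIN = (224, 172, 105)
WOOD = (193, 154, 107)
SAND = (194, 178, 128)
HERB = (60, 140, 60)

for name, color in (("skin", SKIN), ("wood", WOOD), ("sand", SAND), ("herb", HERB)):
    print(name, is_skin(color))

# A made-up photo: 70% cutting board, 30% green herbs, no skin at all.
cutting_board_photo = [WOOD] * 70 + [HERB] * 30
print(skin_fraction(cutting_board_photo))
```

In this sketch the wood and sand tones pass the same test as the skin tone, so a photo with no skin in it can still report a large skin fraction. That is why practical detectors add texture and region features on top of color.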
Technology and human review. Relying solely on algorithmic solutions makes this a very challenging task, so Luminate incorporates a unique combination of technology and human review. Incorporating human review into the process provides additional benefits, such as a feedback loop where information from the human reviewers is used to improve our technology in what is often referred to as semi-supervised learning.
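One simple way to picture that feedback loop: let the classifier decide the confident cases, send the uncertain ones to a person, and keep the person's answers as new training examples. The thresholds, function names, and labels below are hypothetical and only sketch the idea; they do not describe Luminate's actual pipeline.

```python
def route(score, low=0.2, high=0.8):
    # score is the classifier's estimated probability of adult content.
    if score >= high:
        return "reject"
    if score <= low:
        return "approve"
    return "human_review"

def review_cycle(items, score_fn, ask_human, training_set):
    # Decide each item; human-reviewed items become new labeled examples.
    decisions = {}
    for item in items:
        decision = route(score_fn(item))
        if decision == "human_review":
            label = ask_human(item)  # 1 = adult, 0 = safe
            training_set.append((item, label))
            decision = "reject" if label == 1 else "approve"
        decisions[item] = decision
    return decisions

# Made-up scores standing in for a real classifier.
fake_scores = {"salad.jpg": 0.05, "chicken.jpg": 0.55, "spam.jpg": 0.97}
new_examples = []
result = review_cycle(
    fake_scores,
    score_fn=fake_scores.get,
    ask_human=lambda item: 0,  # the reviewer says the chicken photo is safe
    training_set=new_examples,
)
print(result)
print(new_examples)
```

Retraining on the accumulated examples is what closes the loop: each borderline case a person resolves gives the classifier a labeled example of exactly the kind it found hard.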