Data Analytics Over Hidden Databases

Title: Data Analytics Over Hidden Databases
Author: Dasgupta, Arjun
Abstract: Web-based access to databases has become a popular method of data delivery. A multitude of websites provide access to their proprietary data through web forms. To view this data, customers use the web form interface and pose queries on the underlying database. These queries are executed and a resulting set of tuples (usually the top-k ones) is served to the customer. Top-k restrictions, along with strict limits on querying, are constraints used by database providers to conserve the power of the underlying data distribution. Delivering limited access only to tuples that satisfy a query enables providers to expose only a small snippet of the entire inventory at a time. This method of data delivery prevents analysts from deriving information on the holistic nature of the data. Analytical queries on the data statistics are hence blocked by these access restrictions. The objective of this work is to provide detailed approaches for inferring statistical information on such hidden databases using their publicly available front-end forms. To this end, we first explore the problem of random sampling of tuples from hidden databases. Samples representing the underlying data open up a proprietary database to a plethora of opportunities by giving external parties a glimpse into the holistic aspects of the data. Analysts can use samples to pose aggregate queries and gain information on the nature and quality of the data. In addition to sampling, we also present efficient techniques that directly produce unbiased estimates of various interesting aggregates. These techniques can also be applied to the more general problem of size estimation of such databases.
In light of these techniques for inferring aggregates, we introduce and motivate the problem of privacy preservation in hidden databases from the data provider's perspective, where the objective is to preserve the underlying aggregates while serving legitimate customers with answers to their form-based queries.
URI: http://hdl.handle.net/10106/5170
Date: 2010-11-01


