A Survey on Data Pricing: from Economics to Data Science

arXiv:2009.04462v2 [econ.TH] 27 Nov 2020

Data are invaluable. How can we assess the value of data objectively, systematically and quantitatively? Pricing data, or information goods in general, has been studied and practiced in dispersed areas and principles, such as economics, marketing, electronic commerce, data management, data mining and machine learning. In this article, we present a unified, interdisciplinary and comprehensive overview of this important direction. We examine various motivations behind data pricing, understand the economics of data pricing and review the development and evolution of pricing models according to a series of fundamental principles. We discuss both digital products and data products. We also consider a series of challenges and directions for future work.

1 Introduction

In this digital economics era, data are well recognized as an essential resource for work and life. Many products and services are delivered purely in digital forms. Many big data applications are built on the second use or reuse of data [196], that is, the same data are customized and reused by many applications for different purposes.

The extensive sharing and reuse of data has profound implications for the economy. For example, digital maps are often produced for traffic and directions as the immediate usage. However, Nagaraj [153] finds that mining activities benefited strongly from open maps or maps sponsored by governments, particularly for smaller firms with fewer resources. Universal availability of data often helps minority parties and emerging initiatives. In business and economic activities where data are shared, exchanged and reused, it is essential to measure the value of data properly.
While there exist many possible ways to appreciate and represent the value of data, a general approach that can be scalable for massive applications and acceptable to many parties is to set a price at which data can be sold or purchased, that is, data pricing. The importance of pricing in business is well recognized in financial modeling [120], as price is one of the four Ps of the marketing mix. (The four Ps are product, price, place and promotion [120].)

Pricing data is far from trivial. Data have many different aspects. Consequently, the term "price of data" may carry different meanings and refer to different properties of data. To illustrate the complexity, let us quickly consider the following three scenarios involving price information related to data.

• Data transmission. Imagine the scenario where a mobile service provider offers a smart phone user the price of its data package. Here, the price is quoted for the data transmission service and is decided by several factors, such as the amount of data the user wants to transmit in a month, the location (roaming or not, for example), and the transmission speed. The price does not include and is independent of the content, that is, what the data are about, such as data quality, and how the data are collected, stored or processed.

• Digital products. Imagine that a person wants to watch a movie at home. This is a purchase of data, since the movie is sent to the customer's home as a stream of bits. The price here typically is related to the content, but is independent of the data transmission service, that is, how the data are transmitted to the user's home.

• Data products. Many logistics companies want to pay for weather information to support their business operations. While historical data are relevant, more often than not those companies want to subscribe to weather forecasting information instead. Some companies may want weather predictions at a higher granularity while some may want detailed predictions at specific locations. Moreover, some may want long term predictions while some others may want short term projections. Here, prediction services are sold as data products.

The above three cases just elaborate some representative scenarios where data prices are used, and are by no means exhaustive. To appreciate data pricing, including ideas, principles and methods, we have to take an interdisciplinary approach from multiple fields, economics and data science being the two most prominent. Indeed, the studies and practice of data pricing started as early as the dawn of digital economics, and are highly diversified and rich in innovative thinking.

In this article, we try to present a comprehensive survey on data pricing, an emerging research and practice area that plays a more and more important role in the current big data and AI economics era. Our survey is highly related to the current strong rise of data science. To a large extent, data pricing is an overdue pillar in data science research and practice.

Data and information as goods discussed in this article are those that are distributed purely in digital form. We focus on the two categories of most interest: pricing digital products and pricing data products, demonstrated by the last two aforementioned scenarios, respectively. In this article, digital products refer to intangible goods that can be consumed through electronics, such as e-books, downloadable music, online ads, and internet coupons. Many digital products have physical correspondences in one way or another, though this is not absolutely necessary. Data products refer to data sets as products and information services derived from data sets. We build the linkage between these two categories by pointing out that many ideas and methods for pricing digital products can be generalized and applied to pricing data products.
In some scenarios, the boundary between digital products and data products is also blurry. Hereafter, we use the term information goods to refer to both digital products and data products.

1.1 Related Surveys

The research into data pricing happens simultaneously in multiple domains, including but not limited to economics, marketing, e-commerce, databases and data management, operational research, management science, machine learning and AI. However, to the best of our knowledge, there exists very limited effort to provide an interdisciplinary survey of the related work. This article presents our endeavor to produce a comprehensive picture.

There are some previous surveys related to data pricing. For example, Liang et al. [136] survey the life cycle of big data, and review 11 data pricing models. They also discuss data trading and protection. Fricker and Maksimov [75] report a literature survey over 18 research articles regarding several research questions, including maturity of the pricing models. Very recently, Zhang and Beltrán [220] review the state-of-the-art data pricing methods. They categorize data pricing methods according to two important data properties, granularity and privacy. This article covers a substantially broader scope than those surveys [75, 136, 220]. We connect economics, digital product pricing and data product pricing. We also discuss a series of desirable properties in data pricing, including arbitrage-freeness, revenue maximization, fairness, truthfulness, and privacy preservation, and review the techniques achieving those properties.

Data pricing is related to cloud pricing, since a lot of data for pricing and trading are hosted on cloud. Wu et al. [208] present a comprehensive survey on cloud pricing models. They systematically categorize three fundamental pricing strategies, namely value-based pricing, cost-based pricing and market-based pricing. Then, they further categorize nine pricing tactical objects.
Specifically, value-based pricing is demand driven and consists of customer value-based pricing, experience-based pricing, and service-based pricing. Cost-based pricing is supply driven and consists of expenditure-based pricing, resource-based pricing and utility-based pricing. Market-based pricing is an equilibrium of supply and demand and consists of free and pay later pricing, retail-based pricing and auction and online pricing. They cover in total 60 pricing models. While data and cloud are highly related, data pricing and cloud pricing are fundamentally different. Data pricing is selling data, while cloud pricing is selling cloud resources (e.g., storage and computation), including physical resources, virtual resources and stateless resources.

In addition, Sen et al. [181] survey the major broad-band pricing proposals, including the realizations in various consumer data plans around the world. Murthy et al. [150] list different pricing models and pricing schemes used by some popular IaaS (infrastructure-as-a-service) providers. Wu et al. [211] propose pricing as a service, which is essentially a personalized pricing service for IaaS. Aazam and Huh [1] propose broker as a service, which matches cloud services among cloud service providers and users. The key idea is to predict resource demands and thus derive prices.

As data are often hosted online, one interesting question is the fair sharing of the cost among data owners, data users and brokers. This is related to data pricing, because the costs of data hosting and processing have to be recovered from data pricing. Kantere et al. [116] study the fair allocation of costs in query services. They develop a stochastic model, which predicts the extent of cost amortization in time and number of services based on query traffic statistics. The model can be implemented on top of a cloud DBMS. Al-Kiswany et al. [12] provide a cost assessment tool to evaluate the cost of a desired data sharing.
One useful feature of the tool is that a user can explore the cost space of alternative configurations using various factors, such as quality, staleness, and accuracy. The technique is based on what-if analysis.

1.2 Structure of This Survey

We take a multi-disciplinary approach in this survey. The rest of the article is organized as follows.

In Section 2, we start from economics and focus on two aspects. First, we discuss cost reduction in information goods that contributes to their prices and has impact on economics. Then, we discuss the differences between digital products and data products.

In Section 3, we discuss the fundamental principles of data pricing. We first present versioning as a general framework for pricing information goods. Then, we identify several desirable properties in data pricing, including truthfulness, fairness, revenue-maximization, arbitrage-freeness, privacy preservation and computational efficiency.

In Section 4, we discuss pricing digital products. We first review the three major streams of revenues for digital products. Then, we revisit the bundling and subscription planning pricing models. Last, we consider auctions, which are widely used in pricing digital products.

In Section 5, we discuss pricing data products. We first overview the structures, players, and ways to produce data products in data marketplaces. Then, we examine several important areas in pricing data products, including arbitrage-free pricing, revenue maximization pricing, fair and truthful pricing, privacy preservation in pricing. We also discuss dynamic data pricing, online pricing, and pricing in federated and collaborative learning.

Last, in Section 6, we discuss challenges and future directions.

2 Economics of Data Pricing

In general, pricing is the practice that a business sets a price at which a product or a service can be sold. Pricing is often part of the marketing plan of a business.
To set prices, a business often considers a series of objectives, such as profitability, fitness in marketplace, market positioning, price consistency across categories and products, and meeting or preventing competition. Some major pricing strategies in the literature [38, 58, 108, 155, 159] include operation-oriented pricing, revenue-oriented pricing, customer-oriented pricing, value-oriented pricing, and relationship-oriented pricing. There is a rich body of studies in economics and marketing research on pricing tactics, which are far beyond the scope and capacity of this survey.

In this section, to understand the economic factors specific to data pricing, we examine the cost reduction in information goods. Then, we inspect the differences between digital products and data as products.

2.1 Cost Reduction in Information Goods

"Technology changes. Economic laws do not." [182] The production, distribution, and consumption of information goods, compared to those of physical products in the long history of human economies, are distinguished by significant cost reductions in five aspects, namely search costs, production costs, replication costs, transportation costs, and tracking and verification costs.

Essentially, digital and data economics investigates how standard economic models adjust when those costs are reduced dramatically. Goldfarb and Tucker [93] present a thorough discussion, whose framework is largely followed here.

2.1.1 Search Costs

"Search costs are the costs of looking for information" [182], which are incurred in any information collection activities. Information goods allow more effective and efficient online search. The consequent low search costs facilitate users' discovering digital products and data sets, as well as comparing prices of similar products and services. For example, Brynjolfsson and Smith [40] show that online prices of books and CDs are clearly lower than offline prices, though the price dispersion does not shrink accordingly.
Low search costs facilitate the sales of rare and long tail products [15, 214]. Thus, more variety is often observed in information goods and services. The degree of variety may be heavily impacted by recommender systems. Specific to consumption of media, one of the major categories of digital products, Gentzkow and Shapiro [82] show that online media consumption is more diverse than offline. At the same time, customers may tend to consume more content that aligns more or less with their viewpoints, which is called the "echo chamber" effect [188].

Low search costs give strong rise to the prevalent platform businesses, which provide extensive matching services to customers and improve trade efficiency [115]. Interoperability, compatibility and standards are strategic tools for both building platforms and running platform businesses [99].

2.1.2 Production Costs

Producing digital products, such as online courses, eBooks, software, graphics and digital arts, and photography, is very different from manufacturing physical products, like bread, shoes, and jackets. Moreover, collecting and processing massive data so that parts of data can be sold and can meet customers' needs is also different from traditional production. A wide spectrum of production costs in traditional products are substantially reduced in information goods.

First, some essential major costs in traditional production, such as materials, semi-finished products and their transportation, are dramatically reduced in producing information goods. In many cases, the costs of obtaining, producing and transporting raw materials and physical semi-finished products can be reduced to very low or can even approach zero in making information goods. Second, a substantial cost of a traditional physical product often belongs to the product itself and cannot be further reduced through sharing. The unit costs of information goods can approach zero through sharing as long as there are sufficient reuses and sales volume.
Last, smart manufacturing and customer-to-manufacturing can reduce the supply chain costs in traditional physical production [88, 187]. Information goods often can reduce the costs of customization to the extreme.

The substantial reduction in production cost in materials, semi-finished products, customization and sharing gives rise to a series of innovative business models, such as economics of sharing, pay-as-you-go and query-based data consumption. This also encourages innovation and long tail products that address diverse and smaller groups of potential customers.

2.1.3 Replication Costs

One distinct feature of information goods versus traditional products is that information goods are non-rival. That is, one customer consuming an information good does not reduce the amount or quality of the product available to other customers. The zero marginal costs and the non-rival property of information goods empower innovative opportunities and bring in new challenges.

In order to structure pricing of a large variety of non-rival information goods with zero marginal costs, bundling is often used [182], that is, multiple products are sold together at a single price. Since a large number of information goods can be bundled together without a substantial increase in cost, economically it may be optimal to bundle thousands of digital products together to meet diverse and independent customer preferences [10, 25, 26].

Due to the zero marginal costs and the non-rivalrous property, many information goods are made publicly available, such as Wikipedia (https://www.wikipedia.org) and open source software [131]. People contribute to open source or publicly available digital products and data to demonstrate their professional skills to potential employers. Companies support those products to complement their sales of other products.

The zero marginal costs and non-rivalrous property pose challenges to copyright policies and enforcement. Waldfogel [203] shows that low replication costs, though they may reduce revenue, help supply and demand, and thus boost quality. Williams [207] shows that the protection of intellectual property indeed has a negative impact on follow-on innovation in gene sequencing.

At the same time, there is evidence that government-mandated "open data" may lead to data leakages and privacy breaches that affect citizens' offline welfare [5]. On the negative side, the zero marginal costs and non-rivalrous nature also ease the way for spamming [174] and online crime [149].

2.1.4 Transportation Costs

Thanks to the Internet, the costs of transporting information goods approach zero. This may imply, in many scenarios, that local communities may not affect adoption and consumption of information goods, often known as the effect of the flat world [76]. Interestingly, this is not true all the time, as some studies demonstrate that tastes may still be local in music [73] and content consumption [81].

While the physical transportation costs may approach zero, regulation may put sophisticated constraints on locations. For example, when Wikipedia was blocked in China in October 2005, more contributors from outside China were motivated to contribute [221]. Copyright policies may also affect the availability and consumption of information goods in different regions, such as news media [46], and thus may be reflected by price.

2.1.5 Tracking and Verification Costs

The capability of tracking users with relatively low costs is an important feature of information goods [182]. The low tracking costs give rise to extensive personalized markets and possible price discrimination [77, 165]. Behavioral price discrimination is an immediate type, which sets prices according to customers' previous behavior.
Correspondingly, if customers are well aware of the benefits of tracking information to a monopoly, they may likely choose to be privacy sensitive and withhold the information [193]. Another type of price discrimination is versioning [183], which sells information at different prices to different customers using different versions. Versioning is discussed in detail in Section 3.1.

The advantage of low tracking costs also leads to the blooming businesses of personalized advertising [69]. A challenge for a company, however, is how to set prices for many advertisements that may be shown to a massive number of customers. The same advertisement may have different prices for different customers. Auctions are often used to address the challenge [19], and can even be used to discover prices for information goods [164]. At the same time, auctions may be less useful when online marketplaces become mature [66].

The low tracking costs and the consequences, such as price discrimination, lead to serious concerns on privacy [4]. As discussed later in this article, whether privacy should be treated as goods and how privacy is priced are investigated [74, 163]. Moreover, privacy regulation and the impact on welfare are important topics, though they are far beyond the scope of this survey.

As a byproduct of low tracking costs, the costs of verifying identity and reputation of producers and users of information goods are dramatically lower than those in traditional scenarios. The low verification costs facilitate online transactions extensively and lower the costs of trust dramatically.

2.2 Differences between Digital Products and Data Products

This survey focuses on pricing two categories of information goods, digital products and data products. While digital products and data products share a series of common ideas and methods in pricing, they are also essentially different from each other in at least four aspects.

First, the units of digital products are often well defined and fixed.
For example, individual movies and music tracks are often priced and sold in whole. Digital products are often consumed independently of each other. For example, it would be rare that two digital books have to be read at the same time. In contrast, although the basic unit in a data set can be at a very small granularity, such as a record in a relational table, the units for pricing and consumption often vary from one customer to another. For example, a customer may be interested in the sales data of female customers in a province, while another customer may be interested in the sales data on electronics during the Christmas season. Correspondingly, one individual unit of data at the lowest granularity may not be valuable as a data product. For example, one customer purchase record, after proper anonymization, may not be useful for a retailer. Instead, more often than not, many basic units of data are combined, aggregated and consumed together.

Second, different from digital products, data sets as data products have very strong and flexible aggregateability. Customers often aggregate data using various dimensions. The aggregateability, on the one hand, enables many opportunities for innovations in data business, and, on the other hand, poses many technical and business challenges, such as ensuring arbitrage-freeness, as discussed later in this article. In many business scenarios, digital products like movies and music are bundled. However, bundles are not aggregates. Customers still get digital products and consume them individually. Bundling takes advantage of the low replication costs of digital products to boost sales and meet customers' diverse demands [10, 25, 26].

Third, the means of consuming digital products and data products are also very different. Typically digital products are consumed directly by people, such as movies watched by people and music enjoyed by fans. Data sets are more often than not consumed by computers.
They are, for example, analyzed, summarized or used to train machine learning models. The outputs of models are used to automate operations or support human decision making.

Last, digital products and data products are dramatically different in ways to be reused and resold. Digital products are easy to be consumed by others, that is, to be reused, or even to be resold to others in whole. Data sets, to the contrary, can be reused by others in different ways, such as aggregation in different dimensions and analysis for different purposes. Moreover, data can be easily processed and transformed so that they can be resold in a hard-to-detect manner.

The above differences between digital products and data products lead to different considerations in pricing principles and methods, which are discussed later. Before we leave this topic, we want to point out that it is possible that the same information can be regarded as digital products in some situations and as data products in some other situations. For example, social media like tweets and customer reviews can be regarded as digital products when a customer reads them online. At the same time, they can be collected and processed in batch by analytic tools to detect events, discover customer profiles and feed recommender systems. In this situation, a systematic collection of social media can be priced and sold as a data product.

2.3 Summary

In summary, information goods, including digital products and data products, distinguish themselves from the traditional physical products in significant cost reductions, particularly in search costs, production costs, replication costs, transportation costs, and tracking and verification costs. The significant reduction of costs has profound impact on pricing information goods, which is discussed in the later sections of this article.
There are several major differences between digital products and data products, including consumption units, aggregateability, means of consumption, and reusing and reselling.

3 Fundamental Principles of Data Pricing

In this section, we first review the idea of versioning [182, 183], which is a fundamental framework of designing information goods and pricing them. Then, we review several important properties in cost models of digital and data products.

3.1 Versioning

As the replication costs of information goods are very low, even approaching zero in many cases, the price of an information good tends to be very low in marketplaces, too. The potential of very low prices of information goods, on the one hand, makes information goods economically appealing, and, on the other hand, may also make information goods economically dangerous, as competitors may easily enter the market [182, 183]. This dilemma keeps many traditional pricing strategies far away from being effective for information goods.

To tackle the dilemma, the core idea is "linking price to value", that is, setting the price reflecting the value that a customer places on the information. Specifically, the versioning strategy [183] makes different versions to appeal to different types of customers. For example, for a piece of software, different versions have different subsets of features. Different versions of a movie may provide different image resolutions and sound effects. Essentially, versioning divides customers into subgroups so that each subgroup may regard some features highly valuable and some other features of little value. A version corresponding to the demands can be provided.

There are many different ways to produce different versions of information goods. For example, as information is often time sensitive, delay is often a good basis. In stock market information services, an expensive version may deliver real time quotes while a basic version delivers the same information 20 minutes later.
In addition, versions may be defined by convenience (e.g., data can be accessed only by PDF file or by downloadable spreadsheet), comprehensiveness (e.g., the length of historical data available), manipulation (e.g., whether users can store, duplicate, print the information), community (e.g., availability of posting and reading discussion boards), annoyance (e.g., the option of no advertisements), the means of customer support (e.g., by website only or by talking to experts), and many other factors. Most versions of information goods are created by subtracting value from the most technologically advanced and complete version.

In many situations where customers may not realize the value of an information good unless they try it, even free versions may be provided. The rationale is that the free versions give potential customers opportunities to test the good. The objectives of offering free versions include building awareness, gaining follow-on sales, creating a customer network, attracting attention, and gaining competitive advantages.

The number of versions of an information good may be decided by two major considerations. First, the characteristics of the information to be sold are important. An information good that can be used in many different ways opens the door to many different versions. The second important factor is the value that different customers may place on it. The larger the variance, the more versions may be needed.

The versioning strategy has been investigated in pricing data products, for example, relational data sets and query results [27, 28]. Relational views provide a natural and flexible technical means to produce versions of an information source. A series of technical challenges are identified, such as arbitrage in pricing, fine-grained data pricing, pricing updates, integrated data and competing data sources, which are reviewed further in this article.
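The self-selection logic behind versioning can be illustrated with a small numerical sketch. Everything specific in it is hypothetical and chosen only for illustration: the two customer segments, their valuations of a real-time and a 20-minute delayed quote feed, the integer price grid, and the rule that ties go to the pricier version. Marginal costs are taken as zero, in line with the low replication costs discussed in Section 2.1.3.

```python
from itertools import product

# Hypothetical segments: (number of customers, value of real-time feed, value of delayed feed).
SEGMENTS = {
    "professional": (10, 100, 30),
    "casual": (90, 25, 20),
}


def revenue(price_realtime, price_delayed):
    """Revenue when each customer self-selects the offered version with the largest surplus.

    Pass None for a version that is not offered (at least one version must be offered).
    """
    total = 0
    for count, value_realtime, value_delayed in SEGMENTS.values():
        offers = []  # (surplus, price) for every version on offer
        if price_realtime is not None:
            offers.append((value_realtime - price_realtime, price_realtime))
        if price_delayed is not None:
            offers.append((value_delayed - price_delayed, price_delayed))
        # Largest surplus wins; ties go to the pricier version (an assumption).
        surplus, price = max(offers)
        if surplus >= 0:  # a customer buys only if the surplus is non-negative
            total += count * price
    return total


PRICES = range(0, 101)  # hypothetical integer price grid

# Best revenue with only the real-time version on sale: (revenue, price).
best_single = max((revenue(p, None), p) for p in PRICES)

# Best revenue with both versions on sale: (revenue, real-time price, delayed price).
best_versioned = max(
    (revenue(p_rt, p_d), p_rt, p_d) for p_rt, p_d in product(PRICES, PRICES)
)

print(best_single)
print(best_versioned)
```

Under these assumed numbers, the best single price for the real-time feed alone is 25 and yields a revenue of 2500, while offering the real-time version at 90 and the delayed version at 20 yields 2700: professionals keep the premium version and casual users buy the delayed one.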
3.2 Important Desiderata in Data Pricing

There are many different ways to design and implement pricing models for information goods. There are a small number of desiderata pursued by most models. How to implement those desiderata in pricing models is discussed in the later sections.

3.2.1 Truthfulness

To make a market efficient, the market is preferred to be truthful. A market is truthful if every buyer is selfish and only offers the price that maximizes the buyer's true utility value. In other words, in a truthful market, no buyer pays more than sufficient to purchase a product. Here, different buyers may have different utility values on the same product.

Truthfulness can facilitate a wide spectrum of pricing mechanisms, such as many kinds of auctions [7]. Auctions of digital products are discussed in Section 4.3.

3.2.2 Revenue Maximization

Pricing models can optimize different objectives, such as lowest cost, highest profit, and largest sales. The objective of maximizing revenue is often of special interest in designing pricing strategies. The rationale is that, for a business to be successful long term, a more immediate and important requirement is to win over as many customers as possible.

For traditional physical products, it is often assumed that the marginal cost goes up after a certain number of units are manufactured, and thus the profit can be maximized if the output level is set so that the marginal revenue is equal to the marginal cost, and the revenue can be maximized if the marginal revenue becomes zero. However, given that the replication costs of information goods are very low, revenue maximization and profit maximization for information products become quite different from those for physical products [7, 42].

3.2.3 Fairness

Essentially, a market is fair if each seller gets the fair share of the revenue in coalition. In his seminal article [184], Shapley lays out the fundamental requirements of fairness in markets.
Suppose that k sellers cooperatively participate in a transaction that leads to a payment $v$. There are four basic requirements for being fair.

• Balance: the sum of the payments to the sellers should be equal to $v$. That is, the payment is fully distributed to all sellers.

• Symmetry: for a set of sellers $S$ and two additional sellers $s$ and $s'$ who are not in $S$, that is, $s, s' \notin S$, if $S \cup \{s\}$ and $S \cup \{s'\}$ produce the same payment, then $s$ and $s'$ should receive the same payment. That is, the same contribution to utility should be paid the same.

• Zero element: for a set of sellers $S$ and an additional seller $s \notin S$, if $S \cup \{s\}$ and $S$ produce the same payment, then $s$ should receive a payment of 0. That is, no contribution, no payment.

• Additivity: if the goods can be used for two tasks $T_1$ and $T_2$ with payments $v_1$ and $v_2$, respectively, then the payment to complete both tasks $T_1 + T_2$ is $v_1 + v_2$.

In the above well celebrated Shapley fairness, the Shapley value is the unique allocation of payment that satisfies all the requirements.

\[ \psi(s) = \frac{1}{n} \sum_{S \subseteq D \setminus \{s\}} \frac{U(S \cup \{s\}) - U(S)}{\binom{n-1}{|S|}} \tag{1} \]

where $U(\cdot)$ is the utility function, $D$ is the complete set of sellers, $n = |D| = k$ is the number of sellers, $S \subseteq D$ is a set of sellers, and $s$ is a seller. Equivalently, Equation 1 can also be written as

\[ \psi(s) = \frac{1}{n!} \sum_{\pi \in \Pi(D)} \left( U(P_s^{\pi} \cup \{s\}) - U(P_s^{\pi}) \right) \tag{2} \]

where $\pi \in \Pi(D)$ is a permutation of all sellers, and $P_s^{\pi}$ is the set of sellers preceding $s$ in $\pi$.

Agarwal et al. [7] observe that, as the replication costs of information goods are very low and the marginal costs of production are close to zero, a seller can produce more units of the same information good to obtain a larger Shapley value and thus a larger portion of the payment, which is unjustified in business. This is a challenge in designing fair marketplaces for information goods.

3.2.4 Arbitrage-free Pricing

Arbitrage is the activity of taking advantage of price differences between two or more markets or channels.
For example, consider a scenario where a user wants to purchase access to an article whose listed price is $35. Suppose that the journal publishing the article has a monthly subscription rate of $25. Then, the user can conduct arbitrage by subscribing to the journal for only one month and obtaining the article at a price lower than the listed price.

Arbitrage is often undesirable in pricing models. At the least, it should be possible to check whether a pricing model is arbitrage-free. However, arbitrage can sneak into pricing models that are not thoroughly designed. For example, suppose a data service provider sells query results with prices based on variance [133]: $5 for each query result with variance 10, and $100 for each query result with variance 1. Each answer is perturbed independently. A customer who wants to obtain an answer with variance 1 can purchase the query 10 times, at a total cost of $50, and compute the average of the 10 results. Because the noise in each perturbation is independent, the average has variance 10/10 = 1, and thus the customer saves $50 by arbitrage.

3.2.5 Privacy-preservation

Privacy is becoming a more and more serious concern about information goods. In general, privacy is the ability of an individual or a group to keep themselves or the information about themselves hidden from being identified or approached by other people. Privacy is highly related to information and information exchange, which are what information goods are about. As explained in Section 2.1.5, due to the low tracking costs of information goods, it is easier to collect data about user privacy [4]. Whether privacy should be treated as goods and how privacy is priced are investigated [74, 163].

It is highly desirable to preserve privacy in marketplaces of information goods. In general, transactions in a marketplace may disclose privacy of various parties in many different ways.

First, privacy of buyers is highly vulnerable.
Their identities, the location and time of purchases, the specific products purchased, the purchase prices and the total amounts may reflect their privacy. It has been reported from time to time that e-commerce providers leak customer information by mistake, such as an accident reported recently.

Second, privacy of information good providers may also be disclosed. For example, medical treatment information in hospitals is highly valuable for many businesses, such as pharmacy and medical equipment companies. Imagine that hospitals can collect and anonymize medical treatment data properly and provide the corresponding data products in marketplaces so that individual patients cannot be re-identified. Buyers, however, may be able to infer from the data the success rates of a specific treatment in a hospital, which may be regarded as the privacy of the hospital.

Last, transactions in marketplaces may also disclose privacy of a third party involved. For example, an AI technology company may provide machine learning model building services to data product buyers. However, machine learning models, which are regarded as the privacy of the AI technology company, may be stolen [194].

To protect privacy in marketplaces of information goods, various directions are being explored, such as hiding the information about what, when and how much a buyer purchases [11], building decentralized and trustworthy privacy-preserving data marketplaces [50, 107], investigating the tradeoff between payments and accuracy when privacy is present [160], and aggregating non-verifiable information from a privacy-sensitive population [86]. There are many studies on preserving privacy in information goods. We refer interested readers to the rich body of surveys [8, 35, 61, 72, 78, 114, 212, 224] and others.
We do not discuss further details about general privacy preservation techniques in this article, since they are far beyond the scope and capacity of this survey.

3.2.6 Computational Efficiency

As many information goods may be sold to a huge number of potential buyers, a pricing model has to match goods/sellers and buyers with an appropriate price. Computing prices efficiently with respect to a large number of goods and a large number of buyers presents technical challenges [28]. For example, one reasonable expectation is that a marketplace is polynomial, that is, the complexity of computing prices has to be polynomial with respect to the number of sellers, and cannot grow with respect to the number of goods/buyers when prices are updated [7]. When auctions are used in determining prices, auctions are required to be efficient [92], that is, the time needed to process bids should be short.

Footnote (customer information leak mentioned in Section 3.2.5): https://www.telegraph.co.uk/technology/2020/03/10/leak-millions-amazon-ebay-transactions-exposes-customer-addresses/

3.3 Summary

Versioning is a common mechanism in designing and pricing information goods, so that prices of different versions can be linked to values placed by various customer groups. There are a series of important requirements on pricing information goods, including truthfulness, revenue maximization, fairness, arbitrage-free pricing, privacy preservation, and computational efficiency. Those requirements pose technical challenges to pricing models.

4 Pricing Digital Products

Although the focus of this article is on pricing data products, we provide a brief review on pricing digital products here, since some general ideas in pricing digital products can be borrowed and extended to data products. In some cases, the boundary between digital products and data products is even blurry.

We first discuss the three major streams of revenues for digital products. Then, we look at two major types of pricing models.
The first is bundling and subscription, and the second is auctions. These pricing models are popularly adopted by digital product marketplaces.

4.1 Streams of Revenues

As discussed in Section 3.2.2, revenue maximization often serves as the basic objective in pricing mechanisms, including pricing digital products. Therefore, the understanding of pricing digital products can naturally start with an analysis of the possible ways where revenues of digital products may come from. Lambrecht et al. [127] summarize that there are three streams of revenues for digital products that are delivered online.

• Money. A provider can sell to customers content or, more broadly, services, such as movies and e-books.
• Information/privacy. Instead of charging customers directly, a provider can collect customer information by tracking (e.g., using cookies) and sell the information about customers to generate revenues.
• Time/attention. A provider can sell space in their digital products to advertisers to produce revenue.

Often, a firm has to design a revenue model for its digital products that combines more than one revenue stream. The three streams are not independent. Instead, they compete with each other, and thus a good tradeoff has to be settled [79]. On the one hand, in some situations, revenues from the money stream may be increased at the cost of those from the time/attention stream. For example, customers may pay for the content and avoid ads [168, 171], or convert from free versions to premium versions with fitting functions [202]. On the other hand, customers may be highly price sensitive in some digital products, and thus growth in the time/attention stream may be easier. For example, an online news site experiences a dramatic loss of customer visits after introducing a paywall [45]. Free samples may stimulate long-term sales [37]. A possible tradeoff between money and time/attention has to be carefully designed.
Typical approaches in revenue models of content and services [173] include rigid pricing (e.g., each movie is priced at a fixed price), designing pricing tiers (e.g., basic versus premium versions), setting up the duration of subscription plans (e.g., 6 months of promotion period with a very low subscription price) and designing freemium models. One important and unique feature in digital product consumption is micropayments, which means a customer can pay a very small amount that is typically impractical in traditional transactions using standard credit cards due to network service fees. Micropayments and subscriptions have different effects on consumer behavior [20].

As a concrete example of revenue models, consider pricing software products [130]. The major parameters of pricing models include formation of price, structure of payment flow, assessment base, price discrimination, price building and dynamic strategies. The formation of price considers price determination, that is, cost-based, value-based or competition-oriented, as well as the degree of interaction, unilateral versus interactive. In terms of payment flow, it may be by single payment, recurring payments or a combination. The assessment base of pricing may be usage-dependent (e.g., by transaction or time) or usage-independent (e.g., server types and GPU).

As the tracking costs of digital products are low, a firm can collect customer personal data and sell such data for revenue, that is, generating revenues from the information/privacy stream. Typically, personal data may include customers' identities, behavior patterns, preferences and needs. There are various ways to sell customer data, which are also discussed in Section 5 when data products and their marketplaces are discussed. For example [32, 36], a website can provide direct marketing companies user activity information.
Moreover, websites can also collaborate with data management platforms (DMP, for advertising) [67] and produce revenues by helping businesses to identify audience segments. For example, the information about how customers are connected in social networks can be used to design customized discounts in marketing campaigns [215]. Bergemann and Bonatti [33] develop a model of pricing customer-level information such that the data about each customer are sold individually and individual queries to the database are priced linearly. As new technologies of customer tracking become available, more pricing models may emerge. We want to point out that selling customer data, though it serves the purpose of selling digital products, crosses the boundary between selling digital products and selling data products. We review some studies on setting prices for customer data and privacy information in the next section.

To produce revenues from the time/attention stream, many digital product producers and service providers embed advertisements in their products in one way or another, and obtain remarkable or even dominant advertising income. However, as John Wanamaker (1838-1922) wisely said, "Half the money I spend on advertising is wasted; the trouble is I don't know which half." It is well recognized that it is hard to accurately measure advertising effects [95, 132]. Advertisers customize ads for online display [190, 216]. One feasible way to improve advertising effectiveness is to combine user information and advertising opportunities. Retargeted advertising [128] is such an approach, which combines customer online and offline behavior data and makes firms focus on customers showing prior interest in the related products. For example, Athey et al. [21] consider customers with multiple homes and investigate the advertising strategies and effectiveness.
In summary, digital product and service suppliers produce revenues through three major streams: money, information/privacy and time/attention. Orthogonally, a firm can bundle its digital products and also design subscription plans that provide products and services in a specific period for a price, which is discussed next.

4.2 Bundling and Subscription Planning

Product bundling organizes products or services into bundles, such that a bundle of products or services is for sale as one combined product or service package. Product bundling is a common marketing practice, particularly in traditional industries such as telecommunication services, financial services, healthcare, and consumer electronics.

As discussed in Section 2.1.3, the low replication costs of information goods allow prevalent adoption of bundling in pricing digital products [182]. Designing product bundles essentially is a combinatorial optimization problem. The basic and static setting is that a customer wants to buy either one or multiple products at a time, which was investigated well before digital products became available [6]. A series of studies [18, 148, 169] develop pricing strategies with two products under different types of bundling. They share the basic assumption that demand for a bundle is elastic compared with demand for individual products. For example, Armstrong [18] studies the scenarios where products may be substituted or provided by separate sellers.

Bundling multiple products is analyzed, often under the independent value distribution framework [152]. Consider the situation where there are $n$ heterogeneous products for one buyer, and the objective is to maximize expected revenue. Assume that the value distributions on products are independent. That is, for each product $x_i$, the price that a buyer would like to pay follows an arbitrary distribution $D_i$ in the range $[a_i, b_i]$, where $0 \le a_i \le b_i < \infty$, and those distributions $D_1, \ldots, D_n$ are independent from each other.
Further assume that the buyer is additive, that is, the buyer's value for a set of products is the sum of the buyer's values of the individual products in the set. Babaioff et al. [51] show that either selling each item separately or selling all items together as a grand bundle produces at least a constant fraction of the optimal revenue. This interesting and important result allows a simple yet effective bundling strategy: either pricing each product individually or pricing the grand bundle at the expected price. In practice, many platforms, such as Hulu and Amazon Prime Video, offer grand bundle subscriptions for their products.

More recently, Haghpanah and Hartline [97, 98] show that the grand bundle is optimal if more price-sensitive buyers consider the products more complementary. When multiple buyers are considered, whose preferences are unknown, Balcan et al. [30] give a simple pricing model that achieves a surprisingly strong guarantee: in the case of unlimited supplies, a random single price achieves expected revenue within a logarithmic factor of the optimum for customers with general valuation functions. This result allows great convenience in practice, that is, setting a uniform price for all products. It is easier to price a bundle of a larger number of products, since the law of large numbers makes it possible to predict customers' valuations more accurately for a larger bundle of products [2].

Orthogonal to bundling, subscription is to price the interactions between customers and a platform over a period of time. Subscribing customers are in general heterogeneous in both usage rate and value of products. On the one hand, customers with higher usage rates may prefer subscribing to larger subscription sets. On the other hand, in order to maximize revenue, the platform wants customers with lower usage rates to subscribe, and customers with higher usage rates to rent. Moreover, different users may have different values for a product.
Many platforms offer subscription and renting at the same time. For a platform, the subscription model is to select a subscription fee and the period for each set of products and also set the rental price for each product [13]. Alaei et al. [13] follow the model of grand bundle and consider grand subscription, a single rental price for the set that includes all products. They establish the sufficient and necessary condition for the optimality of grand subscription. They also show that subscription fees can be set proportional to the cardinality of a set of products and can achieve $\frac{1}{4 \log 2m + \log n}$ of the optimal revenue for $n$ types of customers and $m$ types of products. This approximation is tight in the sense that it cannot be improved beyond the order of $\frac{1}{\log n}$ in polynomial time.

Overall, modeling bundling and subscriptions is computationally challenging due to their combinatorial nature. Dynamic pricing of bundles and subscriptions, such as promotions and coupons, has rarely been touched yet.

4.3 Auctions

Auctions have a long history dating back to the Babylonian and Roman empires [185]. There are many excellent surveys on auctions (e.g., [24, 68, 118, 145]). A comprehensive review on auctions is far beyond the scope and capacity of this article. Instead, we focus only on the important role of auctions as a pricing mechanism for digital products.

4.3.1 Basics about Auctions

There are four basic types of auctions widely used.

• In the ascending-bid auction (also known as the English auction), the price is raised successively until only one bidder remains, who wins the object at the final price.
• The descending auction (also known as the Dutch auction) works the other way by starting at a very high price and lowering the price continuously, until the first bidder calls out and accepts the current price.
• In the first-price sealed-bid auction, every bidder submits a bid without knowing the others' bids.
The one making the highest bid wins and pays the named price.
• The second-price sealed-bid auction (also known as the Vickrey auction [198]) works in the same way as the first-price sealed-bid auction does, except that the winner pays only the second highest bid.

There are two basic models of the value information in auctions. The private-value model assumes that every bidder has an independent value on the object for sale. The value is also private to the bidder only. The pure common-value model assumes that the actual value of the object for sale is the same for all bidders, but bidders have different private information about that actual value. Every bidder adjusts her/his estimate of the actual value by learning other bidders' signals. There are also models considering both values private to individual bidders and values common to all bidders.

One fundamental principle in auction theory is the revenue equivalence theorem [152, 177, 198, 199]. It essentially states that, for a set of risk-neutral bidders with independent private valuations of an object drawn from a common cumulative distribution that is strictly increasing and atomless on $[v_{\min}, v_{\max}]$, any auction mechanism yields the same expected revenue, and thus any bidder with valuation $v$ makes the same expected payment, if (1) the object is allocated to the bidder with the highest valuation; and (2) any bidder with valuation $v_{\min}$ has an expected utility of 0. Based on the revenue equivalence theorem, the four basic types of auctions lead to the same expected payment by the winner and the same expected revenue.

While most studies in auction theory make some simple assumptions about the independence of customer valuations, empirical studies [106] demonstrate that, in practice, the wrong assumption of valuation independence causes inefficient auctions in e-commerce.

4.3.2 Sponsored Search Auctions

Online ad and sponsored search auctions [126, 172, 197] are one important application of auctions in pricing digital products.
Sponsored search [110] is the business model where content providers pay search engines for traffic to their websites. In sponsored search, advertisers and, more generally, content providers bid for keywords in search engines, and search engines decide which ad to display in which position to answer a query from a user. GoTo.com created the first sponsored search auction [110].

Different pricing models can be used in sponsored search auctions, such as pay-per-mille (that is, the cost of 1,000 advertisement impressions)/pay-per-impression (PPM), pay-per-click (PPC), and pay-per-action (PPA). In the early days of sponsored search, a generalized first price auction was used. Each advertiser bids on multiple keywords, and can set a bidding price for each keyword. When a user query, which is a keyword, is answered, the ads with the top $k$ bids on the keyword are displayed. If an ad is clicked by the user, the corresponding advertiser pays the bidding price. The first price auction mechanism is unstable, costs advertisers time and reduces search engine profits [64]. Later, Google generalized the second price auction mechanism [65], and enhanced the ranking of bids by additional information, such as the ad's click-through-rate (CTR), keyword relevance, and the quality of the ad's landing page/site.

There are many in-depth analyses of sponsored search auction mechanisms (e.g., [172]). For example, some studies analyze auction mechanisms based on assumptions about rationality, budget constraints and CTR distributions. Some other studies look at practical sponsored search systems and discuss auction mechanisms when the standard assumptions do not hold. Another group of studies, such as [22, 43, 53, 80], conduct empirical studies to understand bidding behavior and statics. Last and latest, deep learning approaches are used to develop auction strategies in sponsored search [175, 222].
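As a rough illustration of the quality-weighted second price idea described above, the following sketch ranks advertisers by bid times a quality score and charges each winner the smallest per-click price that would keep its rank. This is a common textbook formulation, not the actual system of any search engine. The function name `generalized_second_price`, the single `quality` score per advertiser (standing in for CTR, relevance and landing-page quality), and the zero reserve price for the last ranked advertiser are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def generalized_second_price(
    bids: Dict[str, float],
    quality: Dict[str, float],
    slots: int,
) -> List[Tuple[str, float]]:
    """Return (advertiser, price_per_click) per filled slot, best slot first.

    Assumptions: quality scores are strictly positive, ties keep the input
    order, and the reserve price is zero.
    """
    # Rank advertisers by bid weighted by quality, highest score first.
    ranked = sorted(bids, key=lambda a: bids[a] * quality[a], reverse=True)
    outcome = []
    for position in range(min(slots, len(ranked))):
        winner = ranked[position]
        if position + 1 < len(ranked):
            runner_up = ranked[position + 1]
            # Smallest bid that still matches the runner-up's weighted score.
            price = bids[runner_up] * quality[runner_up] / quality[winner]
        else:
            price = 0.0  # nobody ranked below: assumed zero reserve price
        outcome.append((winner, price))
    return outcome


example = generalized_second_price(
    bids={"a": 4.0, "b": 3.0, "c": 1.0},
    quality={"a": 1.0, "b": 1.0, "c": 1.0},
    slots=2,
)
print(example)  # [('a', 3.0), ('b', 1.0)]
```

With equal quality scores the rule reduces to each winner paying the next highest bid. With unequal scores, an advertiser with a higher quality score can win a better slot with a lower bid.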
4.3.3 Auctions on Digital Products with Unlimited Supplies

One unique feature of digital products is that the replication costs are very low, and thus digital products may have almost unlimited supply. Products of unlimited supplies lead to new challenges and opportunities in auction mechanism design. For example, the second price auction can be straightforwardly generalized for $k$ identical products: the top $k$ highest bidders win and each pays the $(k+1)$-th bidding price. However, when there are unlimited identical products, the $(k+1)$-th bidding price approaches 0. The lack of competition due to excessive supplies prevents bidders from offering any high prices. In other words, the challenge is how to ensure that the bids are truthful, that is, reflecting the bidders' true valuations of the digital products.

Denote by $B$ the set of bidders, and by $b_1, b_2, \ldots$ the bidding prices in descending order, that is, $b_i \ge b_{i+1} \ge 0$ for any $i > 0$. Suppose the generalized second price auction mechanism is used. That is, if $k$ bids are taken, each of the winning bidders pays $b_{k+1}$. The auction objective is to maximize $k \cdot b_{k+1}$. An auction is competitive if it yields revenue within a constant factor of the optimal fixed pricing. It is tricky that, when there is unlimited supply, the Vickrey auction is not competitive if the seller chooses the number of products to sell before knowing the bids, and is not truthful if the seller chooses after knowing the bids [92].

Goldberg et al. [92] propose the first competitive auction for digital goods with unlimited supplies. The major idea is the smart framework of random sampling auction. An auction is bid-independent if bidder $i$'s bid value only determines whether the bidder wins the auction, but not the price. We select a sample $B'$ of $B$ at random, independent from the bid values. We use the bids in $B'$ to compute the optimal bid threshold $f_{B'}$ that maximizes the revenue in $B'$, and every bidder in $B - B'$ whose bid value is over $f_{B'}$ wins.
Symmetrically, we use the bids in $B - B'$ to compute the optimal bid threshold $f_{B-B'}$ that maximizes the revenue in $B - B'$, and every bidder in $B'$ whose bid value is higher than $f_{B-B'}$ wins. In general, $f_{B'} = f_{B-B'}$ does not necessarily hold. Random sampling auctions are competitive in both the single-price version and the multi-price version. Indeed, random sampling auctions are 15-competitive in the worst case [70] and 4-competitive for a large class of instances where there are at least 6 bids that are as good as the optimal sale price [14].

There are a series of improvements on random sampling auctions. For example, Hartline and McGrew [102] further improve the competitiveness. Goldberg and Hartline [89] extend the scope from a single digital product with unlimited supply to multiple products with unlimited supplies. Given a set of bids, they show that the bidder-optimal product assignment given the bids and the optimal sale prices can be determined by solving the integer programming problem as follows.

$$
\begin{aligned}
\max \quad & \sum_{j} \sum_{i} x_{ij} r_j \\
\text{subject to} \quad & r_0 = 0 \\
& \sum_{j} x_{ij} \le 1, && 1 \le i \le n \\
& x_{ij} \ge 0, && 1 \le i \le n,\ 1 \le j \le m \\
& p_i + r_j \ge a_{ij}, && 1 \le i \le n,\ 1 \le j \le m \\
& \sum_{i} p_i = \sum_{j} \sum_{i} x_{ij} (a_{ij} - r_j)
\end{aligned}
\tag{3}
$$

where $x_{ij}$ is the assignment of product $j$ to bidder $i$, $r_j$ is the optimal price for product $j$, $p_i$ is the profit of bidder $i$, and $a_{ij}$ is the bid from bidder $i$ on product $j$.

Then, we can solve the optimal pricing problem in the following random sampling auction. Let $B$ be the set of bidders. First, we obtain a sample $B'$ of bidders. Second, we compute the optimal sale prices for $B'$ using Equation 3. Last, we run the fixed-price auction on $B - B'$ using the sale prices computed. All bidders in $B'$ lose the auction. The random sampling auction is shown to be truthful and competitive [89].

Most of the proposed auctions for digital goods with unlimited supply are randomized auctions. Goldberg et al. [92] show that no deterministic auction can be competitive. Aggarwal et al.
[9] later point out that the result does not hold for asymmetric auctions [144]. In a symmetric ex ante auction, buyers' preference parameters are drawn from a symmetric probability distribution, and thus there exists a symmetric equilibrium if an equilibrium exists at all. In an asymmetric auction, each buyer has the same information about the product but a different opportunity cost of obtaining the product, that is, bidders' valuations are drawn from different distributions. Aggarwal et al. [9] give an asymmetric deterministic auction that can approximate the revenue of any optimal single-price sale in the worst case. Indeed, they develop a general derandomization technique to transform any randomized auction into an asymmetric deterministic auction with approximately the same revenue. The general idea follows the deterministic maximum flow solution to the well-known hat problem [63].

4.3.4 Envy-free Auctions

One drawback of random sampling auctions is that some bidders may lose even if they make bids higher than some winning bidders do, since the bidders in $B'$ and $B - B'$ use different thresholds (i.e., $f_{B-B'}$ and $f_{B'}$, respectively) in the one product version, and all bidders in $B'$ lose in the multiple product version.

Goldberg and Hartline [91] establish a fundamental result: an auction cannot be truthful, competitive and envy-free at the same time. They also explore possible tradeoffs between truthfulness and envy-freeness based on the consensus revenue estimate (CORE) technique [90]. Specifically, using an idea similar to that in combinatorial auctions with single parameter agents [16], we can relax the truthfulness requirement by requiring the auction to be truthful with probability $(1 - \epsilon)$, and always guarantee envy-freeness. The auction is highly truthful when $\epsilon$ approaches 0 and the number of winners in the auction approaches infinity. The other type of auctions relaxes the envy-free requirement to being envy-free with probability $(1 - \epsilon)$, and guarantees truthfulness.
Both auctions are competitive, and the probability is over the random coin tosses made by the randomized auction mechanism and not over the input.

4.3.5 Online Auctions

In addition to potentially unlimited supply, another important feature of digital goods is that a digital good, such as a movie or a song, may be sold repetitively. Therefore, auctions on digital goods may run continuously instead of for only one round. Moreover, customers may want to have prompt answers to their bids.

Online auctions [129] are designed to address the setting where different customers bid at different times. The auction mechanism has to make a decision about each bid as it arrives. An (online) auction is incentive compatible if the bidders are rationally motivated to reveal their true valuations of the object. Lavi and Nisan [129] show that, under the assumption of limited supply, an online auction is incentive compatible if and only if it is based on supply curves, that is, before it receives the $i$-th bid $b_i(q)$, it fixes the supply curve $p_i(q)$ based on the previous bids, and (1) the quantity $q_i$ sold to customer $i$ is the quantity $q$ that maximizes the sum $\sum_{j=1}^{q} (b_i(j) - p_i(j))$; and (2) the price paid by $i$ is $\sum_{j=1}^{q_i} p_i(j)$.

To tackle the challenges when there is unlimited supply, Bar-Yossef et al. [31] point out that supply curves are not available anymore. Instead, they propose an extremely simple incentive-compatible randomized online auction. For each bidder $i$, a random number $t \in \{0, \ldots, \lfloor \log h \rfloor\}$ is picked and the price threshold is set to $s_i = 2^t$, where $h$ is the ratio of the highest valuation to the lowest valuation among all bidders. This auction is $O(\log h)$-competitive.

The auction mechanism can be further improved to achieve an even better competitive ratio. Specifically, we can divide a sequence of bids $b_1, b_2, \ldots$ into $l = \lfloor \log h \rfloor + 1$ buckets, such that bucket $B_j$ contains the bids with indexes in range $[2^j, 2^{j+1})$.
The weight of bucket $B_j$ is the sum of bids within $B_j$, that is, $w_j = \sum_{b_i \in B_j} b_i$. A new bidder is assigned one of the buckets at random, with a probability determined by the bucket weights, and pays the price of the lowest bid of the bucket. The price $s_i$ that bidder $i$ pays follows the probability distribution $\Pr[s_i = 2^j] = \frac{w_j^d}{\sum_{r=0}^{l-1} w_r^d}$, where $d$ is a parameter. The auction is shown to be $O(3^d (\log h)^{\frac{1}{d+1}})$-competitive. By setting $d = \sqrt{\log \log h}$, the auction is $O(\exp(\sqrt{\log \log h}))$-competitive.

4.4 Summary

As revenue maximization plays a fundamental role in pricing digital products, we review the three major streams of revenues for digital products, namely money, information/privacy, and time/attention. Then, we revisit bundling and subscription planning for digital products, which echoes the opportunities and challenges due to the low replication costs of information goods. Auctions are widely used in pricing digital products. We review some basic types of auctions and their applications in digital products, including sponsored search auctions, auctions with unlimited supplies, envy-free auctions and online auctions. Some ideas employed in pricing digital products are also used in pricing data products, as discussed in the next section.

5 Pricing Data Products

In this section, we discuss pricing in marketplaces of data. We first obtain an overall understanding about data markets and the major players in such markets. Then, we look into several of the most studied technical problems in data product pricing, including arbitrage-free pricing, revenue maximization pricing, fair and truthful pricing and privacy preservation in data marketplaces. Last, we discuss pricing in novel application scenarios, including dynamic data pricing, online pricing and federated learning pricing.

5.1 Data Markets and Pricing, What Are They?

Marketplaces for data have been actively developed for over a decade.
An early survey [179] identifies different categories and dimensions of data marketplaces and data vendors in 2012. There are many studies on various issues about data markets and pricing strategies. Before we discuss any specifics in detail, it is important to obtain an overall understanding about data markets, such as what are sold and for what purposes, who are the sellers, who are the buyers, and what are the basic pricing models.

Pantelis and Aija [167] present a brief economic analysis of data taxonomy as a market mechanism. Data and databases are legally protected by either copyright or database right. Copyright protects expression and the significant creative effort that creates and organizes data. Database right protects a whole database. One challenge is that both copyright and database right are hard to enforce due to the non-rivalrous nature of data.

In general, data may be owned by governments, private parties or individuals. Consequently, data can be categorized into three types: open, public, and private data [167]. Open data are common pool resources [166], such as the data made available by the open data initiatives. Public data, such as the data collected by the government in the United States, are valuable resources subject to the "tragedy of the commons" [101]. Public data are often produced by individuals or organizations for research and used by governments and local authorities, but may also be employed by commercial parties to enhance their proprietary resources or services. Private data are generated by private applications or services.

To understand what are sold in data markets and for what purposes, Muschalle et al. [151] consider the common queries and demands on data markets, as well as the pricing strategies. They observe two major types of queries.
The first type of queries is to estimate the value of a "thing" or compare the values of "things", where examples of "things" include webpages for advertisements, starlets, politicians and products. The second type is to show all about a "thing". Those queries are raised by seven categories of beneficiaries, namely analysts, application vendors, data processing algorithm developers, data providers, consultants, licensing and certification entities, and data market owners.

Muschalle et al. [151] also identify three types of market structures. First, in a monopoly, a supplier is powerful enough to set prices to maximize profits. Second, an oligopoly is dominated by a small number of strong competitors. Last, in strong competition markets, prices may align with marginal costs.

A series of pricing strategies and models may be considered in data markets [151]. First, free data may be obtained from public authorities, may help to attract customers and suppliers of commercial data, and may be integrated into private and not-free data products. Second, prices can be based on usage, such as charging customers per hour of data usage. Third, package pricing allows a customer to obtain a certain amount of data or API calls for a fixed fee. A few studies [116, 210] try to optimize package pricing models. Fourth, in the flat fee tariff model, a data product or service is offered at a flat rate, regardless of usage. It is simple and easy to use. The drawback is the lack of flexibility, particularly for buyers. Fifth, combining package pricing and flat fee tariff results in two-part tariff, that is, a fixed basic fee plus an additional fee per unit consumed. This model is popular in data services. Specifically, Wu and Banker [209] show that, under zero marginal costs and monitoring costs, flat fee and two-part tariff pricing are on par, and two-part tariff is the most profitable strategy.
Last, in the freemium model, users can use basic products or services for free and pay for premium functions or services.

Recently, machine learning, particularly deep learning [94], has become disruptive in many applications, such as computer vision [139, 201] and natural language processing [218]. In most situations, powerful deep models heavily rely on large amounts of training data [156]. Monetization of data and of machine learning models built on data through markets is gaining stronger and stronger interest from industry. Specific to data as an economic good and data pricing as a monetization mechanism in this context, a series of studies focus on data utility for model building and the associated pricing, particularly considering privacy.

Some data owners may have detailed knowledge of specific machine learning tasks and thus dedicate corresponding effort to collect high quality data for building better models. Babaioff et al. [23] study the design of optimal mechanisms for a monopoly data provider to sell her/his data. Specifically, they show that it is feasible to achieve optimal revenue by a simple one-round protocol, that is, a protocol where a buyer and a seller each send a single message, and there is a single money transfer. The optimal mechanism can be computed in polynomial time. For a buyer who may abort the interaction with a seller prematurely, multiple rounds of partial information disclosure interleaved with payments may be needed to ensure optimal revenue. Cummings et al. [49] study the optimal design for data buyers to purchase data estimators with different variances and combine the estimators to meet a required quality guarantee on variance with the lowest total cost.

The role of privacy in data collection and machine learning model building has also been investigated.
For example, Ghosh and Roth [87] develop auctions that are truthful and approximately optimal for data buyers to obtain accurate estimates on data from owners who are compensated for privacy loss. They show that the classic Vickrey auction [198] can minimize the buyer's total payment and meet the accuracy requirement. They also develop a mechanism that can maximize the accuracy given a budget.

In general, modeling data owners' costs of privacy loss is very difficult, since the costs may be correlated with private data arbitrarily. It is impossible to design a direct revelation mechanism that can provide a non-trivial guarantee on accuracy and, at the same time, is rational for individual data owners. To tackle the issue, Ligett and Roth [137] design a take-it-or-leave-it mechanism, which randomly approaches individuals from a population and makes offers. This mechanism can be used for some data collection scenarios, such as surveys.

Versioning is an important strategy in data pricing. A data seller can customize data into different versions according to buyers' needs. Bergemann et al. [34] develop the optimal menu of information products that a monopoly data supplier can offer to a data buyer, so that one product can fit the buyer's willingness to buy the information at the offered price, and the revenue is maximized. One important finding is that information products indeed allow larger scopes of price discrimination. There are at least two dimensions that sellers can explore to derive various subsets of a data set, namely data quality and data position.

When data are used to build machine learning models, it is important to assess the value of each data record within a data set. There exist various methods for assessment, such as leave-one-out [47], leverage or influence score [48].
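As a rough illustration of the leave-one-out idea mentioned above, the following minimal sketch scores each record by how much a utility drops when that record is removed. The function name `leave_one_out_values` and the toy `coverage_utility` (number of distinct labels present) are hypothetical stand-ins chosen for this example; in practice the utility would be the performance of a model retrained on the remaining records.

```python
from typing import Callable, List, Sequence, Tuple

Record = Tuple[str, int]  # (record id, label); a hypothetical record format


def leave_one_out_values(
    data: Sequence[Record],
    utility: Callable[[List[Record]], float],
) -> List[float]:
    """Value of record i = utility(all records) - utility(all records except i)."""
    full_score = utility(list(data))
    values = []
    for i in range(len(data)):
        rest = [rec for j, rec in enumerate(data) if j != i]
        values.append(full_score - utility(rest))
    return values


def coverage_utility(records: List[Record]) -> float:
    """Toy utility (an assumption for illustration): number of distinct labels."""
    return float(len({label for _, label in records}))


records = [("a", 0), ("b", 0), ("c", 1)]
loo = leave_one_out_values(records, coverage_utility)
print(loo)
```

In this toy run the two records sharing label 0 each receive value 0, because removing either one leaves the other in place, while the only record with label 1 receives value 1. Redundant records can thus look worthless under leave-one-out even though they are jointly useful.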
Ghorbani and Zou [85] propose to apply the Shapley fairness on the data used to train a machine learning model, and thus define data Shapley for a record $i$ in a training data set $D$ of $n$ records as

$$\phi_i = C \sum_{S \subseteq D \setminus \{i\}} \frac{U(S \cup \{i\}) - U(S)}{\binom{n-1}{|S|}},$$

where $C$ is an arbitrary (positive) constant, and $U(S)$ is the performance score of the model trained on data $S \subseteq D$. One challenge is that computing the exact data Shapley values on large data sets for sophisticated models, such as deep neural networks, is computationally prohibitive. Ghorbani and Zou [85] also develop Monte Carlo and gradient-based methods for estimation.

If a data point $p$ appears in two samples $D_1$ and $D_2$ from the same data distribution, intuitively the Shapley values of $p$ in $D_1$ and $D_2$ should be similar. Mathematically, the intrinsic Shapley value of $p$ in a distribution should be the expectation of the Shapley value of $p$ in the distribution. Based on this intuition, Ghorbani et al. [84] propose the notion of distributional Shapley. Let $Z$ be a universe in question. For example, in classification problems, conventionally $Z = X \times Y$, where $X$ is the feature space and $Y$ is the output. Let $D$ be a data distribution in $Z$. Assuming a potential function or a performance metric $U : Z^* \to [0, 1]$ and a sample size $m > 0$, the distributional Shapley value of a point $z \in Z$ is the expected Shapley value over data sets of size $m$ containing $z$, that is,

$$\nu(z; U, D, m) = \mathbb{E}_{S \sim D^{m-1}}\left[\phi(z; U, S \cup \{z\})\right],$$

where $S \sim D^{m-1}$ is a set of $m-1$ points sampled i.i.d. from $D$. They show that distributional Shapley values are stable. Kwon et al. [125] further derive computationally tractable expressions for distributional Shapley for a series of models, including linear regression, binary classification and non-parametric density estimation.

Besides Shapley values, there are some other data valuation methods. For example, in machine learning, influence functions [119, 206] approximate leave-one-out to assess the value of a data item. Cai et al.
[41] propose strategy-proof mechanisms for data elicitation and trade off between model accuracy and reward. Richardson et al. [176] focus on the case of linear regression. Recently, Yoon et al. [217] propose data valuation using reinforcement learning. They use a data value estimator to learn how much a data item as an element in the training data contributes to improving model performance. One distinct advantage is that the model being trained and the data value estimator can improve each other's performance.

Data quality is an important issue [170]. There are many studies on assessment of data quality [103, 170, 204]. Some studies specifically focus on pricing based on data quality and the impact on data markets. Heckman et al. [103] propose a simple linear model,

$$\text{Value of data} = \text{fixed cost} + \sum_i w_i \times \text{factor}_i,$$

where the factors include but are not limited to age of data, periodicity of data, volume of data, and accuracy of data, and $w_i$ is the weight associated with factor $i$. One practical difficulty in using the model is that the parameters in the model are hard to estimate. Another difficulty is that many data sets do not have public prices associated. Yu and Zhang [219] consider pricing multiple versions formed by multiple factors of data quality and build a two-level model. The first level is the data platform where a single owner is assumed, who designs the number of versions. The second level is the customers who want to maximize the data utility. Each level is modeled as a maximization problem and thus the whole model is a bi-level programming problem, which is NP-hard.

Another way to form multiple versions of data products is to charge by queries [121-124]. Intuitively, a data seller may treat a view of a data set as a version. Setting the price for every possible view is not only tedious but also tricky. If prices on views are not set properly, arbitrage or prices lower than the highest attainable ones may occur. Koutris et al. [121, 124] propose a framework of query and view based data pricing.
The major idea is that a seller only needs to specify the prices on a few views, and then the prices of other views can be decided algorithmically. They advocate two desiderata, arbitrage-freeness and discount-freeness. Theoretically, they show the existence and uniqueness of pricing functions satisfying the requirements. They also show the complexity of computing the pricing functions. Unfortunately, only selection views and conjunctive queries without self-joins are tractable. They present polynomial time algorithms for chain queries and cyclic queries.

Technically, the core idea in the view and query based pricing framework is query determinacy [157, 158, 180]. A query $Q$ is said to be determined by a set of views $V$ if the answer to $Q$ can be completely derived from the views. Query determinacy enables the feasibility of arbitrage detection. If $V$ determines $Q$, then arbitrage happens if and only if the price of $V$ is cheaper than that of $Q$.

Koutris et al. [123] further explore the technical challenges in practical implementation of view and query based data pricing. Specifically, they develop an integer linear programming formulation for the pricing problem with a large number of queries. Considering the scenario where a user may purchase multiple queries over time or the database is updated, such that information in multiple queries and updates may overlap, they also leverage query history to avoid double charging. To handle the situation where there are multiple sellers, they define the share of a seller as the maximum revenue that the seller can get among all minimum-cost solutions, and accordingly define a fair revenue distribution policy. A prototype demonstration system is reported in [122].

Tang et al. [192] follow the view and query based pricing framework and consider the minimum granularity of data, that is, each tuple is a view. Their model assigns to each tuple a price and prices queries based on minimal provenances. Tang et al.
[191] extend view and query based pricing to XML documents and consider the situation where a customer may just want to purchase a sample instead of the complete query result.

5.2 Arbitrage-free Pricing

Arbitrage is probably the most intensively studied issue in pricing data products. As introduced in Section 3.2.4, in general, arbitrage refers to activities that take advantage of price differences between two or more markets or channels. Arbitrage is undesirable in many pricing models. Unfortunately, arbitrage may sneak into pricing models without rigorous design. For example, Balazinska et al. [28] show that subscription based pricing, possibly with a query limit, allows arbitrage. Muschalle et al. [151] point out that a pricing model offering users a certain number of API calls for a fixed rate may potentially allow arbitrage, depending on the package size.

Arbitrage-freeness is one of the fundamental properties of pricing models in query and view based pricing [121-124]. Li and Miklau [134] and Li et al. [133] develop frameworks for pricing linear aggregate queries. Specifically, Li et al. [133] consider linear queries. Given a data set of $n$ tuples $x_1, \ldots, x_n$, a linear query $q = (q_1, \ldots, q_n)$ is a real-valued vector, and the answer is $q(x) = \sum_{i=1}^{n} q_i x_i$. For a multiset of queries $S = \{Q_1, \ldots, Q_k\}$ and a query $Q$, if the answer to $Q$ can be linearly derived from the answers to the queries in $S$, then $Q$ is said to be determined by $S$, denoted by $S \to Q$. A pricing function $\pi(Q)$ is arbitrage-free if for any multiset $S$ and query $Q$ such that $S \to Q$, $\pi(Q) \le \sum_{i=1}^{k} \pi(Q_i)$.

Under the general intuition of arbitrage-freeness, Li et al. [133] consider a specific form of queries, linear queries with variance $(q, v)$, that is, the estimation of the answer to query $q$ should have a variance no larger than $v$. Using different values of $v$, different versions are formed. A pricing model not carefully designed may allow arbitrage. Li et al.
[133] first establish the observation that $\pi(q, v) = \Omega(1/v)$. Then, they synthesize the pricing function $\pi(q, v) = f^2(q)/v$, which is arbitrage-free if $f$ is positive and a semi-norm (footnote 5). For any arbitrage-free pricing functions $\pi_1, \ldots, \pi_k$, $f(\pi_1(q), \ldots, \pi_k(q))$ is also arbitrage-free if $f$ is a subadditive (footnote 6) and nondecreasing function.

Footnote 5: A function $f : \mathbb{R}^n \to \mathbb{R}$ is a semi-norm if for any $c \in \mathbb{R}$ and any query $q \in \mathbb{R}^n$, $f(cq) = |c| f(q)$; and for any $q_1, q_2 \in \mathbb{R}^n$, $f(q_1 + q_2) \le f(q_1) + f(q_2)$.

As Roth [178] summarizes, the framework by Li et al. [133] still faces three important challenges. First, it is still possible to exploit arbitrage by deriving answers to a bundle of queries from another bundle of queries and their answers. Second, arbitrage is still possible on biased estimators for statistical queries. Last, it is unclear whether we can obtain arbitrage-free pricing maximizing profit given the distribution of buyer demands. Later, Deep and Koutris [54] provide some interesting insights into arbitrage-free pricing for bundles.

Lin and Kifer [138] investigate arbitrage-free pricing for general data queries. They consider three types of pricing models for query bundles, where a query bundle is a set of queries posted simultaneously as a batch. First, an instance-independent pricing function depends on the query bundle but not the database instance. Second, an up-front dependent pricing function depends on both the query bundle and the database instance. A customer knows an up-front dependent pricing function, and decides whether or not to purchase the query answers. Last, a delayed pricing function depends on both the query bundle and the answers computed by the query bundle on the current database instance. The customer knows the pricing function, but does not know the exact price. Once agreeing, the customer is charged when the answers are computed.

Lin and Kifer [138] also summarize five different types of arbitrage situations.
First, if prices are quoted by queries, in order to avoid price-based arbitrage, answers to queries should not be deducible from prices alone. Second, a buyer may use multiple accounts to derive answers to a query bundle. To avoid separate account arbitrage, the price of a query bundle $[q_1, q_2]$ should be at most the sum of the prices of $q_1$ and $q_2$. Third, if the answers to a query bundle $q'$ can always be deduced from answers to another query bundle $q$, to prevent post-processing arbitrage from happening, the price of $q$ should be no cheaper than that of $q'$. Fourth, although the answers to a query bundle $q'$ may not always be derivable from the answers to another query bundle $q$ on all database instances, still for a specific database instance $I$, the answers to $q'$ may be derived from the answers to $q$. If so, a serendipitous arbitrage happens. Last, if two queries behave almost identically but their prices are dramatically different, almost-certain arbitrage happens. Based on the above categorization, they discuss conditions that can prevent various types of arbitrage situations from happening.

Footnote 6: A function $f$ is subadditive if for any $x_1, \ldots, x_k$, $f\left(\sum_{i=1}^{k} x_i\right) \le \sum_{i=1}^{k} f(x_i)$.

Pricing many queries in real time with formal guarantees on arbitrage-freeness is challenging. Many theoretical methods are not scalable in practice. For example, it takes QueryMarket [123] about one minute to compute the price of a join query over a relation of about 1000 tuples. Qirana [55, 56] is a system for query-based pricing. The system allows data sellers to choose from a set of pricing functions that are information arbitrage-free, which covers both post-processing arbitrage-freeness and serendipitous arbitrage-freeness in Lin and Kifer's taxonomy [138]. Qirana also supports history-aware pricing. Qirana has been shown to be highly efficient and scalable on the TPC-H (footnote 7) and SSB (footnote 8) benchmark datasets as demonstration.
The key idea in Qirana is that it regards a query as an uncertainty reduction mechanism. Initially, a buyer faces a set of possible databases $\mathcal{I}$ defined by a database schema, primary keys and predefined constraints. Once a buyer obtains the answer $E$ to a query $Q$, all possible databases $D$ such that $E \neq Q(D)$ are eliminated. The price assigned to $Q$ should be a function of how much the set of possible databases shrinks.

Let $S$ be the set of possible databases before the query $Q$ is answered. $S$ is called the support set. Then, a weighted coverage function assigns a weight $w_i$ to every $D_i \in S$, and computes the price of a query by

$$p^{wc}(Q, D) = \sum_{i : Q(D_i) \neq Q(D)} w_i.$$

Alternatively, consider the equivalence relation in $S$: $D_i \sim D_j$ if and only if $Q(D_i) = Q(D_j)$. Assign to each possible database $D_i \in S$ a weight $w_i$ such that $\sum_{D_i \in S} w_i = 1$. Let $P_Q$ be the set of equivalence classes. For each class $B \in P_Q$, denote by $w_B = \sum_{D_i \in B} w_i$. The Shannon entropy function is used to compute the price of query $Q$ as the entropy of the query output,

$$p^{H}(Q, D) = -\sum_{B \in P_Q} w_B \log w_B.$$

The q-entropy function (also known as Tsallis entropy) for $q = 2$ is used to assign to $Q$ the price

$$p^{T}(Q, D) = \sum_{B \in P_Q} w_B (1 - w_B).$$

Deep and Koutris [54] show that the weighted coverage function, the Shannon entropy function and the 2-entropy function are all arbitrage-free.

Using the complete set of possible databases as the support set leads to a #P-hard problem. To make the price calculation computationally feasible, Qirana uses uniform random samples and random neighbors as the support sets.

Footnote 7: http://www.tpc.org/tpch.
Footnote 8: http://www.cs.umb.edu/~poneil/StarSchemaB.PDF

In targeted advertising markets, user data, such as opt-in email addresses, and user impressions are sold as data products. How to price users properly to avoid arbitrage is important. Xia and Muthukrishnan [213] consider the following problem.
Denote by $q_i$ a selection query over user attributes, by $U_i$ the set of all users satisfying $q_i$, and by $p_i$ the price of each user in $U_i$. If a buyer purchases $n$ users ($1 \le n \le |U_i|$) in $U_i$, she/he has to pay $n \times p_i$. If prices of different queries are not well coordinated, version-arbitrage may arise. If two queries $q_i$ and $q_j$ return similar user sets but $q_i$ is dramatically more expensive than $q_j$, then a user who wants $q_i$ may purchase $q_j$ instead. Xia and Muthukrishnan [213] point out that uniform pricing, that is, every query has the same price, is arbitrage-free, but is only a logarithmic approximation to the maximum revenue arbitrage-free pricing solution. Then, they present a greedy non-uniform pricing design. The design starts with the optimal uniform pricing that is arbitrage-free, and then iteratively updates the pricing function. If the price of a query can be updated to increase the revenue, it is increased so that the arbitrage-free property is retained. This greedy algorithm is still a logarithmic approximation to the maximum revenue arbitrage-free pricing solution.

Chen et al. [44] develop an arbitrage-free pricing design for multiple versions of a machine learning model. They assume that a broker trains the optimal model on the complete raw data. Then, random Gaussian noise is added to the optimal model to produce different versions for different buyers. The assumption is that the error of a machine learning model instance is monotonic with respect to the variance of the noise injected into the model. In this setting, a pricing function is arbitrage-free if and only if the price of a randomized model instance is monotonically increasing and subadditive with respect to the inverse of the variance.
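The monotone-and-subadditive condition above can be spot-checked numerically. The sketch below is only an illustration under stated assumptions: the helper name `is_monotone_subadditive`, the grid of inverse-variance values, and the two candidate price curves are all hypothetical, and passing a finite grid check is a necessary condition only, not a proof of arbitrage-freeness.

```python
import math
from typing import Callable, Sequence


def is_monotone_subadditive(
    price: Callable[[float], float],
    grid: Sequence[float],
    tol: float = 1e-9,
) -> bool:
    """Check, on a finite grid of inverse-variance values x > 0, that
    price is nondecreasing and price(x + y) <= price(x) + price(y)."""
    points = sorted(grid)
    # Monotonicity: a less noisy model (larger x) must not be cheaper.
    for lo, hi in zip(points, points[1:]):
        if price(hi) < price(lo) - tol:
            return False
    # Subadditivity: combining two noisy versions must not undercut
    # the price of the single more accurate version.
    for x in points:
        for y in points:
            if price(x + y) > price(x) + price(y) + tol:
                return False
    return True


grid = [0.5 * k for k in range(1, 21)]  # hypothetical inverse variances


def sqrt_price(x: float) -> float:
    return 10.0 * math.sqrt(x)  # concave curve: passes both checks


def square_price(x: float) -> float:
    return x * x  # convex curve: monotone but not subadditive


print(is_monotone_subadditive(sqrt_price, grid))
print(is_monotone_subadditive(square_price, grid))
```

The convex curve fails because a buyer could purchase two cheap noisy versions and average them to obtain a more accurate model for less than its listed price, which is the kind of arbitrage the condition rules out.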
5.3 Revenue Maximization Pricing

As explained in Section 3.2.2, the objective of revenue maximization is often of special interest in designing pricing strategies, since for a business to be successful in the long term, a more immediate and important requirement is to win over as many customers as possible.

Footnote: Here, "buying a user" is short for purchasing the impression of a user in online advertising and a user email in targeted email advertising, for example.

Revenue maximization pricing for data products is a relatively less explored area. A possible reason is that, compared with pricing digital products, some other factors in pricing data products need more urgent accommodation, such as arbitrage.

As mentioned in Section 5.2, Xia and Muthukrishnan [213] develop logarithmic approximation pricing algorithms for revenue maximization in user-based markets. They also consider the situations where both the maximum number (i.e., maximum demand) and the minimum number (i.e., minimum demand) of users that a buyer purchases are specified, and provide an $O(D)$ approximation algorithm to maximize revenue, where $D$ is the largest minimal demand among all buyers.

Chawla et al. [42] consider query and view based pricing for arbitrage-free revenue maximization under the assumption that all buyers are single-minded and the supply is unlimited. A buyer is single-minded if the buyer wants to purchase the answer to a single set of queries. They consider three types of pricing functions. Uniform bundle pricing sets the same price for every bundle. Additive or item pricing prices each item and charges a bundle the sum of the prices of the items in the bundle. Fractionally subadditive pricing or XOS sets $k$ weights $w_j^1, \ldots, w_j^k$ for each item $j$, and for a bundle $e$, the price is set to $\max_{i = 1, \ldots, k} \sum_{j \in e} w_j^i$. Building on the extensive studies on revenue maximization with single-minded buyers and unlimited supply [29, 39, 96], they develop new heuristics.
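The three families of pricing functions described above can be written down in a few lines. This is a minimal sketch for intuition only: the function names, the item identifiers and the weights are hypothetical and are not taken from Chawla et al. [42].

```python
from typing import Dict, FrozenSet, Sequence

Bundle = FrozenSet[str]  # a bundle is a set of item identifiers


def uniform_bundle_price(bundle: Bundle, flat_price: float) -> float:
    """Uniform bundle pricing: every bundle costs the same."""
    return flat_price


def item_price(bundle: Bundle, prices: Dict[str, float]) -> float:
    """Additive (item) pricing: sum of the prices of the items in the bundle."""
    return sum(prices[item] for item in bundle)


def xos_price(bundle: Bundle, weight_sets: Sequence[Dict[str, float]]) -> float:
    """XOS pricing: maximum over k additive pricing functions.
    Assumes weight_sets is non-empty."""
    return max(item_price(bundle, weights) for weights in weight_sets)


# Hypothetical items and two hypothetical additive weight vectors (k = 2).
w1 = {"A": 3.0, "B": 1.0, "C": 1.0}
w2 = {"A": 1.0, "B": 2.0, "C": 2.0}

ab = frozenset({"A", "B"})
bc = frozenset({"B", "C"})

print(uniform_bundle_price(ab, 5.0))  # 5.0
print(item_price(ab, w1))             # 4.0
print(xos_price(ab, [w1, w2]))        # 4.0, attained by w1
print(xos_price(bc, [w1, w2]))        # 4.0, attained by w2
```

Item pricing is the special case of XOS with $k = 1$, which is why XOS pricing that combines several item pricing functions is at least as expressive as a single additive price vector.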
It is well known that there exists uniform bundle pricing that is an $O(\log m)$ approximation of revenue maximization, where $m$ is the number of bundles. Swamy and Cheung [189] show that item pricing can achieve an $O(\log B)$ approximation of maximum revenue, where $B$ is the maximum number of bundles an item can be involved in. Chawla et al. [42] show some new lower bounds, that is, uniform bundle pricing, item pricing and XOS pricing combining a constant number of item pricing functions are still $\Omega(\log m)$ away from maximum revenue. They also present approximation algorithms.

To maximize revenue in machine learning models, Chen et al. [44] show that the optimization problem is coNP-hard. Thus, they relax the subadditive constraint $p(x + y) \le p(x) + p(y)$ to $\frac{q(x)}{x} \ge \frac{q(y)}{y}$ for every $0 < x \le y$, and turn to finding a pricing function $q(\cdot)$ such that $\frac{q(x)}{x}$ is decreasing with respect to $x$. They show that, for every well standing pricing function $p(\cdot)$, there exists a pricing function $q(\cdot)$ with the relaxed subadditive constraint such that $\frac{p(x)}{2} \le q(x) \le p(x)$, and $q(x)$ can be computed using dynamic programming in $O(n^2)$ time, where $n$ is the number of interpolated price points.

5.4 Fair and Truthful Pricing

Fairness and truthfulness are important for data product markets. Recall that fairness means that the revenue generated by a sale transaction in the data market is distributed among sellers in an unprejudiced manner so that they are paid for their marginal contributions. Truthfulness means that buyers in the market are motivated to report their internal valuations of data products honestly.

Agarwal et al. [7] propose a mathematical model of data marketplaces that are fair, truthful, revenue maximizing, and scalable. They assume each seller $j$ supplies a data stream $X_j$ and each buyer $n$ conducts a prediction task $Y_n$, where $X_j, Y_n \in \mathbb{R}^T$. For example, $X_j$ may be a stream of customers' interest in different products, and $Y_n$ is a task predicting a new customer's interest.
Taking a prediction task $Y_n$ and an estimate $\hat{Y}_n$, a prediction gain function $G : \mathbb{R}^{2T} \to [0, 1]$ measures the quality of the prediction. The value that buyer $n$ gets from estimate $\hat{Y}_n$ is $\mu_n \cdot G(Y_n, \hat{Y}_n)$, where $\mu_n$ is the price rate that the buyer is willing to pay for a unit increase in $G$. A machine learning model $M : \mathbb{R}^{MT} \to \mathbb{R}^T$ uses data from $M$ sellers to produce an estimate $\hat{Y}_n$ for buyer $n$'s prediction task $Y_n$. Let $p_n$ and $b_n$ be the price and the bid, respectively. Then, the allocation function $AF : (p_n, b_n; X_M) \to \tilde{X}_M$ determines the quality at which buyer $n$ obtains the sellers' data on sale $X_M$, where $X_M \in \mathbb{R}^{MT}$. The revenue function $RF : (p_n, b_n, Y_n; M, G, X_M) \to r_n$ calculates how much revenue $r_n \in \mathbb{R}$ to extract from the buyer. The utility that buyer $n$ receives by bidding $b_n$ for $Y_n$ is

$$U(b_n, Y_n) = \mu_n \cdot G(Y_n, \hat{Y}_n) - RF(p_n, b_n, Y_n),$$

where $\hat{Y}_n = M(Y_n, \tilde{X}_M)$ and $\tilde{X}_M = AF(p_n, b_n; X_M)$. A market is truthful if for all prediction tasks $Y_n$, $\mu_n = \arg\max_{z \in \mathbb{R}_+} U(z, Y_n)$. They adopt the notion of fairness following the famous Shapley fairness [184].

One main result [7] is that the data market defined as such is truthful if and only if function $AF$ is monotonic, that is, an increase in the difference between price rate $p_n$ and bid $b_n$ leads to a decrease in prediction gain $G$. They also give randomized $\epsilon$-approximation algorithms for a fair data market, that is, $\lVert \psi_{n,\text{Shapley}} - \hat{\psi}_n \rVert_\infty < \epsilon$ with probability $1 - \delta$, where $\psi_{n,\text{Shapley}}$ is the Shapley-fair payment division among sellers, $\hat{\psi}_n$ is the output of the approximation algorithm, and $\epsilon, \delta > 0$. Their algorithms run in polynomial time.

Shapley fairness [184] is popularly adopted as the foundation of fairness in data markets. However, computing Shapley values exactly takes exponential time [57]. Maleki et al. [141] present a permutation sampling method that approximates Shapley values for any bounded utility functions. The basic idea is to use Equation 2 and tackle $\phi(s) = \mathbb{E}[U(P \cup \{s\}) - U(P)]$, where $P$ is the set of sellers preceding $s$ in a uniformly random permutation, by sample mean.
Following Hoeffding's inequality [104], to achieve an $(\epsilon, \delta)$-approximation, that is, $P(\lVert s - \hat{s} \rVert_p \le \epsilon) \ge 1 - \delta$, where $\hat{s}$ is the estimate of the vector $s$ of Shapley values, we need $\frac{2 r^2 N}{\epsilon^2} \log \frac{2N}{\delta}$ samples and evaluate the utility function $O(N^2 \log N)$ times, where $r$ is the range of the utility function $U$.

Jia et al. [112] present approximation algorithms for Shapley value that can substantially reduce the number of times that the utility function is evaluated. First, they apply the idea of feature selection using group testing [60, 225]. For seller $s$, let $\beta_s$ be the random variable indicating whether $s$ appears in a random sample of sellers. Then, for sellers $s_i$ and $s_j$, the difference in Shapley values between $s_i$ and $s_j$ is

$$\phi(s_i) - \phi(s_j) = \frac{1}{N-1} \sum_{S \subseteq D \setminus \{s_i, s_j\}} \frac{U(S \cup \{s_i\}) - U(S \cup \{s_j\})}{\binom{N-2}{|S|}},$$

which, up to a constant factor that depends on the sampling distribution, equals $\mathbb{E}[(\beta_{s_i} - \beta_{s_j}) \, U(\beta_{s_1}, \ldots, \beta_{s_N})]$, where $U(\beta_{s_1}, \ldots, \beta_{s_N})$ is the utility computed using the sellers appearing in the random sample. They can use group testing to first estimate the Shapley differences and then derive the Shapley values from the differences by solving a feasibility problem. They show that this algorithm is an $(\epsilon, \delta)$-approximation that evaluates the utility function at most $O(\sqrt{N} (\log N)^2)$ times. They further observe that most of the Shapley values are around the mean. Exploiting this approximate sparsity, they give an $(\epsilon, \delta)$-approximation algorithm that evaluates the utility function only $O(N (\log N) \log(\log N))$ times.

Ghorbani and Zou [85] propose a principled framework of fair data valuation in supervised learning, and Monte-Carlo and gradient-based approximation methods. Their Monte-Carlo method follows a general idea similar to that in Jia et al. [112]. They generate Monte-Carlo estimates until the average empirically converges. They also argue that, in practice, it is sufficient to estimate Shapley values up to the intrinsic noise in the predictive performance on the test data set.
Adding one tuple as a training data point does not significantly affect the performance of a model trained using a large training data set. Therefore, truncation can be used in practice based on the bootstrap variation on the test set. In their gradient Shapley method, they train a model using one "epoch" of the training data, and then update the model by gradient descent on one data point at a time, where the marginal contribution is the change in the performance of the model.

In general, computing Shapley values requires an exponential number of model evaluations. However, for some specific models, the computation may be reduced dramatically. For example, Jia et al. [111] show that for unweighted kNN classifiers, the exact computation needs only $O(N \log N)$ time and an $(\epsilon, \delta)$-approximation can be achieved in $O(N^{h(\epsilon, k)} \log N)$ time when $\epsilon$ is not too small and $k$ is not too large. They also propose a Monte-Carlo approximation of $O\left(\frac{N (\log N)^2}{(\log k)^2}\right)$ for weighted kNN classifiers. A key enabler of the progress is the specific utility function of a kNN classifier

$$U(S) = \frac{1}{k} \sum_{i=1}^{\min\{k, |S|\}} \mathbb{1}[y_{\alpha_i(S)} = y_{\text{test}}],$$

where $\alpha_i(S)$ is the index of the training feature that is the $i$-th closest to $x_{\text{test}}$ among the training examples in $S$. Moreover, the sublinear approximation for unweighted kNN classifiers is facilitated by locality sensitive hashing [52].

Recently, Jia et al. [113] leverage the efficient computation of Shapley values in kNN [111] to tackle general classification problems. They propose to first train a target model, such as a deep neural network, and identify the features. Then, they conduct a model distillation to kNN by training a kNN classifier using the features to mimic the performance of the original model and tune parameter $k$, the number of nearest neighbors considered. Last, they apply the Shapley value estimation method in kNN [111] to approach the Shapley values in the target model.
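To make the permutation sampling idea discussed earlier in this section concrete, the sketch below estimates Shapley values by averaging marginal contributions over random orderings of the sellers. It is a generic Monte-Carlo illustration and not the exact procedure of any cited paper; the function name `permutation_shapley`, the seller names and the toy `coverage` utility (number of distinct items the coalition holds) are assumptions made for this example.

```python
import random
from typing import Callable, Dict, FrozenSet, List, Sequence, Set


def permutation_shapley(
    players: Sequence[str],
    utility: Callable[[FrozenSet[str]], float],
    num_permutations: int,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate each player's Shapley value as the sample mean of its
    marginal contribution over uniformly random permutations."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    order: List[str] = list(players)
    for _ in range(num_permutations):
        rng.shuffle(order)
        coalition: Set[str] = set()
        previous = utility(frozenset(coalition))
        for p in order:
            coalition.add(p)
            current = utility(frozenset(coalition))
            totals[p] += current - previous  # marginal contribution of p
            previous = current
    return {p: total / num_permutations for p, total in totals.items()}


# Hypothetical sellers and the item ids each of them holds.
holdings = {"s1": {1, 2}, "s2": {2, 3}, "s3": {3}}


def coverage(coalition: FrozenSet[str]) -> float:
    """Toy utility: number of distinct items covered by the coalition."""
    covered: Set[int] = set()
    for p in coalition:
        covered |= holdings[p]
    return float(len(covered))


estimates = permutation_shapley(list(holdings), coverage, num_permutations=2000)
print(estimates)
```

Each permutation requires one utility evaluation per seller, so the cost grows with the number of sellers times the number of sampled permutations. By construction the estimates always sum to the utility of the full coalition minus that of the empty one, which is the efficiency property of Shapley values.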
Many classic rewarding methods, such as Shapley values, may be vulnerable to data-replication attacks. One data provider may replicate its data and act as an additional provider to obtain extra undeserved rewards. To prevent data-replication attacks from happening, replication-robust payoff mechanisms are proposed. Han et al. [100] propose a fix to Shapley value based payoff mechanisms. The idea is to down-weigh the Shapley value: a data provider gets a smaller reward if there are multiple copies of its data in the coalitions.

Related to fairness and truthfulness in a market, cooperation among different agents in a market may happen. Building trust in a sub-community within a data marketplace becomes an interesting subject. Armstrong and Durfee [17] analyze factors that may influence the efficiency of building trust and conducting cooperation in a data market. For each agent in a market, the other agents can be divided into two categories, namely those remembered agents and those strange or forgotten agents. They have a few interesting findings. Cooperation arising from iterated interactions is inversely proportional to the rate of system mixing, the number of initially misbehaving agents, and the rate at which agents explore alternative strategies. Cooperation is also initially inversely proportional to population size. At the same time, cooperation is proportional to average member size and better estimation of the likelihood of strange agents to misbehave.

5.5 Privacy Preserving Marketplaces of Data

Privacy is a serious concern and also a critical tipping point in designing marketplaces of data. When a user shares her/his data with some others, the user may disclose her/his privacy to some extent. Therefore, it is important to explore how to protect or minimize the privacy leakage. At the same time, it is also important to understand how a seller's privacy disclosure may be properly compensated through data pricing.
Ghosh and Roth [87] design truthful marketplaces where data buyers want to purchase data to estimate statistics and sellers want compensation for their privacy loss. In the design, there is only one query and the sellers' individual valuations of their data are private. Data owners are asked to report the costs for the use of their data. Under the assumption of differential privacy [61, 62], they transform the problem into variants of the multi-unit procurement auction. They show that, when a buyer holds an accuracy goal, the classic Vickrey auction can minimize the buyer's total cost and guarantee the accuracy. When the buyer has a budget, they give an approximation algorithm to maximize the accuracy under the budget constraint.

The method by Ghosh and Roth [87] may not work well when the costs and the data are correlated. For example, a store with more customer traffic may request a higher cost for using the data. Correspondingly, reporting the cost may reveal private information about the store. Fleischer and Lyu [74] tackle the scenario where costs are correlated with data and propose a posted-price-like mechanism. Given a set of data sellers categorized into different types and the associated distributions of costs, the mechanism offers each seller a contract with the expected payment corresponding to the type. If a seller takes the offer, the payment is determined by the seller's verifiable type and the associated payment in the contract. All sellers have the same probability to take or reject their contracts independently. The sellers are truthful, that is, a seller takes the offer if the payment is larger than or equal to the privacy loss.
This posted-price-like mechanism is Bayesian incentive compatible (i.e., the sellers' strategies form a Bayesian-Nash equilibrium), ex-interim individually rational (i.e., the expected utility is non-negative for every seller when the seller decides truthfully), O( )-accurate, perfectly data private (i.e., whenever the mechanism's posterior belief about a seller's data differs from its prior belief, the mechanism pays the seller) and $\epsilon$-differentially private.

Li et al. [133] tackle the same problem as Ghosh and Roth [87] do, but assume that individual valuations are public and focus on returning unbiased estimations and pricing multiple queries consistently. To address the concerns on privacy loss, they develop a theoretical framework to divide the price among data owners who contribute to the aggregate computation and thus incur a loss of privacy. Their framework extends several principles from both differential privacy and query pricing in data markets.

The fairness mechanism considered by Li et al. [133] only compensates a seller whose data are used. Niu et al. [163] further consider the scenario where multiple sellers' data are correlated and extend to dependent fairness. In dependent fairness, a seller $s$ is still compensated if the data of another seller $s'$, which are correlated with the data of $s$, are used. They propose two approaches to privacy compensation. In the bottom-up approach, the broker first satisfies each individual seller's privacy compensation and then decides the price of the statistic sold to a buyer. In the top-down design, the broker decides the total price of a data aggregate product sold to a buyer, and then sets aside a fraction of the total price for privacy compensation. The privacy compensation is divided and assigned to individual data sellers by solving a budget allocation problem. Each seller receives a compensation roughly proportional to the privacy loss due to the data sharing. Niu et al.
[161] further extend to time series data that may have temporal correlations. They adopt Pufferfish privacy [117] to measure privacy losses under temporal correlations.

While various efforts have been made to address the challenges of privacy loss compensation when user data are correlated in one way or another, as Ghosh and Roth [87] point out, in general, it is impossible for any mechanism to compensate individuals for privacy loss properly if correlations between their private data and their cost functions are unknown beforehand.

In the classical setting of physical goods [143], using contract theory [142] with hidden information, that is, unobservable types of buyers, a seller can design a set of contracts with different consumption levels to maximize revenue from buyers. Naghizadeh and Sinha [154] extend the contract design model to price a bundle of queries at different privacy levels to maximize revenue. They also consider adversarial users. Their work also adopts differential privacy [61, 62]. For a query bundle $\{Q_1, \ldots, Q_k\}$, a contract is a tuple $(p, \epsilon, s)$, where $p > 0$ is the price paid by a buyer, $\epsilon$ is the privacy budget, such that a buyer can get an answer to query $Q_i$ ($1 \le i \le k$) with an $\epsilon_i$-differential privacy guarantee and $\sum_{i=1}^{k} \epsilon_i \le \epsilon$, and $s$ is the post-hoc fine to be paid if the buyer is found misusing the query answers. It is assumed that an adversarial buyer derives a benefit $C(\epsilon)$, which is monotonically increasing and convex, with $C(0) = 0$. One interesting finding is that, in the traditional contract theory, if there are $n$ types of honest buyers and one type of adversarial buyers, the seller should design up to $n + 1$ contracts. In the data marketplace situation, they show that up to $n$ contracts are sufficient. In other words, a data seller should not design a contract for the adversary. Instead, the seller should adjust the contracts' pricing to account for the risks from adversarial users.
They also design post-hoc fines in pricing query bundles that can help to reduce the loss due to privacy leakage by adversarial buyers. They provide a fast approximation algorithm to compute the contracts.

A data owner has to decide a tradeoff between privacy and data utility. Li and Raghunathan [135] design an economics-based incentive-compatible mechanism for a data owner to price and disseminate private data. Specifically, let the two-part tariff pricing function $R(s, x) = \alpha_s + \beta_s x$ be the price for $x$ amount of data at sensitivity level $s$, where $\alpha_s$ and $\beta_s$ are the fixed and variable price factors, respectively. Assuming two types of data users, one type for aggregate information and patterns in data and the other type for individual identity and personal information, the proposed mechanism works in four stages. First, the data owner selects a variety of sensitivity types to offer. Second, the data owner offers different prices for data with different sensitivity types. Third, a data user selects a certain sensitivity type with the corresponding price, and thus reveals the user type. Last, the data user selects the optimal amount of data with the chosen sensitivity type. The core idea is that the data owner can identify the sensitive attributes in the data, such as the identifying attributes, which are not useful for aggregate analysis but necessary for individual communication. A data owner can offer a lower price for data without sensitive attributes, and charge a higher price for data with sensitive attributes. This approach provides an orthogonal idea to the popular ways of tuning the parameter $\epsilon$ in differential privacy.

Due to privacy concerns, when a company has opportunities to collect data about its customers, should it do so (i.e., collecting and revealing the data) or not (i.e., a blanket policy of never collecting)? Jaisingh et al. [109] find that the company should not collect customer data if the total gains from trading the data cannot cover the privacy loss.
In practice, there is an increasing tendency for consumers to overestimate their loss of privacy, particularly when the use of the private data is uncertain. In other cases, the company should offer two contracts on its services and products. One contract collects the customer data at a certain price, and the other contract does not collect any customer data at a different price.

While most of the studies on privacy preserving data marketplaces focus on the privacy of data owners, transactions may also disclose private information about data buyers, such as what, when and how much they buy. For example, a retail company purchasing query results may regard what queries it buys (e.g., the products or customer groups involved in the queries), when (e.g., the periods that the queries concern), and how much data it purchases as private, and may want to keep the information confidential from all others, including the data sellers and the broker. Aiello et al. [11] design a mechanism such that, after making an initial deposit and maintaining a sufficient balance, a buyer can engage in an unlimited number of priced oblivious transfer protocols where the sellers and the broker cannot learn anything other than the amount of interaction and the initial deposit amount. The broker cannot even know the buyer's current balance or when the buyer's balance runs out. This is achieved by adapting conditional disclosure [83] to the two-party setting.

The distribution and use of private data form another important step where privacy may leak. Hynes et al. [107] demonstrate Sterling, a decentralized marketplace for private data, which supports privacy-preserving distribution and use of data. The central technical idea comes from privacy-preserving smart contracts on a permissionless blockchain. To provide strong security and privacy guarantees, they combine blockchain smart contracts, trusted execution environments and differential privacy.
Particularly, smart contracts allow enforcement of constraints on data usage and enable payments and rewards.

5.6 Data Pricing in Novel Applications: Dynamic Data Pricing, Online Pricing and Federated Learning Pricing

The demand for data pricing arises in many novel application scenarios. In this subsection, we particularly discuss three emerging situations: dynamic data pricing, online pricing and pricing in federated learning.

Many applications are built on dynamic and online data. How to price temporal views on data streams properly is an important issue for practical data markets. One central task is to estimate and optimize the operational costs, which are the costs to evaluate queries of different users on the fly. The pricing decisions involve not only data sellers but also data buyers. For example, suppose two data buyers $b_1$ and $b_2$ purchase two queries $q_1$ and $q_2$, such that $q_2$ can be written as a further selection on top of $q_1$ (e.g., $q_1$ is about all customers in North America, while $q_2$ keeps everything the same as $q_1$ but focuses only on customers in Canada). The optimal pricing of $q_1$ and $q_2$ should take advantage of the overlap between the two queries so that the sharing can save operational costs, and, at the same time, be fair to $b_1$ and $b_2$.

Al-Kiswany et al. [12] propose a greedy method that enumerates all possible sharing plans and selects the one with the minimum additional cost. It does not come with any quality guarantee. Liu and Hacıgümüş [140] propose an improved method that accepts some risk in the sharing plan. If the costs of the previous sharings are already accumulated to a high level, and the additional cost of a new sharing (i.e., the risk) is moderate and can be amortized well by the previous sharings, then the new sharing may be taken. They also give five rules to ensure fair pricing. Let $AC(S)$ be the cost attributed to a sharing $S$. First, for two identical sharings $S_1 = S_2$, $AC(S_1) = AC(S_2)$ should hold.
Second, for any sharing $S$, $AC(S)$ should be no higher than the lowest cost of $S$ if no other sharing exists. Third, for two sharings $S_1$ and $S_2$, if the query of $S_1$ is contained by the query of $S_2$, that is, the result of $S_1$ is a subset of the result of $S_2$, and the lowest cost of $S_1$ is smaller than the lowest cost of $S_2$ if no other sharing exists, then $AC(S_1) \le AC(S_2)$. Fourth, a sharing plan with common subexpressions with other sharings should be compensated. Last, the cost of the global plan should be equal to the sum of costs attributed to all sharings.

In order to purchase dynamic data, a buyer may have to call a seller's API repeatedly. A buyer may have to pay for the same data multiple times. Upadhyaya et al. [195] explore how to modify APIs to achieve optimal history-aware pricing, that is, buyers are charged only once for data purchased and not updated. The central idea is the introduction of the notion of refund: a user can ask for refunds for data that she/he has bought before. For each query, the seller issues a coupon in addition to the query result, where the coupon records the identity information of the data in the query result. Specifically, a coupon consists of a triple $(id, uid, v)$, a query identifier and a hash value, where $id$ is a tuple identifier, $uid$ is a user-id, $v$ is a version-id that is monotonically increasing, the query identifier is also monotonically increasing, and the hash value is computed by a cryptographic hash function $H$ [59], such as SHA-1, SHA-256 and SHA-3, from $id$ and a secret key only known to the seller. If a buyer gets two coupons $c_1$ and $c_2$ in two different purchases such that $c_1[(id, uid, v)] = c_2[(id, uid, v)]$, then the buyer can ask the seller for a refund by showing the two coupons.

As pointed out by Deep and Koutris [55], the refund mechanism does not provide any arbitrage-free guarantee. Qirana [55,56] can support history-aware pricing.
To incorporate a query history, suppose a buyer has already purchased queries $Q = (Q_1, \ldots, Q_k)$ and has paid a total of $p(Q, D)$ so far. When a new query $Q_{k+1}$ comes, let the support set $S_{k+1} = \{D_i \in S \mid Q(D_i) = Q(D), Q_{k+1}(D_i) \ne Q_{k+1}(D)\}$. Then, the new total price is

$$p((Q_1, \ldots, Q_k, Q_{k+1}), D) = p(Q, D) + \sum_{D_i \in S_{k+1}} w_i.$$

This history-aware pricing function is shown to be arbitrage-free.

Zheng et al. [223] consider online pricing for mobile crowd-sensing data markets. Unlike most of the work on data markets, they assume that data providers are distributed in space and there are three types of spatial queries from buyers, namely single-data query (e.g., inquiring the value at a specific location), multi-data query (e.g., inquiring the mean in a region) and range query (e.g., inquiring the probability that the data at a region falls in a given range). The vendor uses raw data from data providers and produces a statistical model through a Gaussian process to answer queries. To form different versions of data products, the vendor generates different conditional Gaussian distributions with respect to locations and uses the conditional entropy to quantify the quality of the versions. They propose a randomized online pricing strategy so that the price can adapt to the historical queries. They show that the pricing mechanism is arbitrage-free and is a constant factor approximation of revenue maximization.

Niu et al. [162] consider an online data market where a query may be sold to different buyers at different times and the broker can adjust prices over time. The objective is to maximize the broker's cumulative revenue by posting reasonable prices for sequential queries. They design a contextual dynamic pricing mechanism with the reserve price constraint. The central idea is to use the properties of ellipsoids for efficient online optimization. Their method can support both linear and non-linear market value models with uncertainty.
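The history-aware pricing rule built on the support set, described earlier in this subsection, can be sketched in a few lines. The code below is a minimal illustration under simplifying assumptions, not Qirana's implementation: databases are small Python tuples, queries are plain functions, and the support set, its weights $w_i$, the toy queries and the function name `incremental_price` are all hypothetical. A support database $D_i$ is charged for a new query only if it agrees with $D$ on every query already purchased and disagrees with $D$ on the new one.

```python
def incremental_price(support, weights, true_db, history, new_query):
    """Charge for `new_query`, given that the queries in `history` were already bought."""
    charge = 0.0
    for db, weight in zip(support, weights):
        # Is this database still indistinguishable from the true one after past purchases?
        consistent = all(q(db) == q(true_db) for q in history)
        # The new answer rules this database out, so its weight is charged now.
        if consistent and new_query(db) != new_query(true_db):
            charge += weight
    return charge


def total_sales(db):
    """Toy query: sum of all amounts."""
    return sum(amount for _, amount in db)


def canada_sales(db):
    """Toy query: sum of amounts for region "CA" only."""
    return sum(amount for region, amount in db if region == "CA")


# Hypothetical true database and support set of alternative databases.
true_db = (("CA", 10), ("US", 20))
support = [
    (("CA", 10), ("US", 25)),
    (("CA", 15), ("US", 20)),
    (("US", 10), ("US", 20)),
]
weights = [1.0, 2.0, 3.0]

first = incremental_price(support, weights, true_db, [], total_sales)
second = incremental_price(support, weights, true_db, [total_sales], canada_sales)
repeat = incremental_price(support, weights, true_db, [total_sales, canada_sales], total_sales)
standalone = incremental_price(support, weights, true_db, [], canada_sales)
```

In this toy run, the second query costs 3.0 after the first purchase, compared with 5.0 if bought alone, because the support database with weight 2.0 was already paid for by the first query. Repurchasing an earlier query costs nothing.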
Federated learning [146,147] trains a machine learning model across multiple decentralized parties, where each party holds local data without any exchange of data between peers. The parties and their data sets are often ordered in a federated learning process. To accommodate the participation order and value data in federated learning, Wang et al. [205] develop the federated Shapley value. Let $I$ be the set of participants and $U$ be the utility function, where $U(A + B)$ is the utility of training first on $A$ and then on $B$. Write $I_t \subseteq I$ for the participants at round $t$ and $I_{1:t-1}$ for the participants of rounds $1$ through $t-1$ in order. For participant $i$ at round $t$ in a federated learning process, the federated Shapley value is

$$\phi_t(i) = \frac{1}{|I_t|} \sum_{S \subseteq I_t \setminus \{i\}} \frac{1}{\binom{|I_t| - 1}{|S|}} \left[ U(I_{1:t-1} + (S \cup \{i\})) - U(I_{1:t-1} + S) \right]$$

if $i \in I_t$ and $\phi_t(i) = 0$ otherwise. The federated Shapley value of a party is the sum of the values of all rounds, that is, $\phi(i) = \sum_{t=1}^{T} \phi_t(i)$.

Wang et al. [205] show that the federated Shapley values have instantaneous group rationality, that is, $\sum_{i \in I} \phi_t(i) = U(I_{1:t}) - U(I_{1:t-1})$. The fairness is guaranteed at each round. That is, for any two parties $i$ and $j$, $\phi_t(i) = \phi_t(j)$ at round $t$ if $\forall S \subseteq I_t \setminus \{i, j\}$, $U(I_{1:t-1} + (S \cup \{i\})) = U(I_{1:t-1} + (S \cup \{j\}))$. Moreover, for any party $i$ at round $t$, $\phi_t(i) = 0$ if $\forall S \subseteq I_t \setminus \{i\}$, $U(I_{1:t-1} + (S \cup \{i\})) = U(I_{1:t-1} + S)$. They also extend the previous Shapley value approximation techniques to compute federated Shapley values.

Sim et al. [186] consider the more general situation of collaborative machine learning and advocate using information gain as the utility function. For a model $\theta$ trained on data $D$, the information gain is $I(\theta; D) = H(\theta) - H(\theta \mid D)$, which is the reduction in uncertainty. They generalize to $\rho$-Shapley fairness by assigning a reward $r_i = k \phi_i^{\rho}$ to a party $i$. By tuning parameter $\rho$, they can trade off among Shapley fairness, individual rationality, stability of the grand coalition and group welfare.
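The round-by-round definition of the federated Shapley value given above can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the implementation of Wang et al. [205]: the function `federated_shapley`, the toy `utility` (additive in data size, with each later round discounted by one half), the data sizes and the participant names are hypothetical, and the enumeration over subsets is exponential in the number of participants per round.

```python
from itertools import combinations
from math import comb


def federated_shapley(rounds, utility):
    """Sum per-round Shapley values; `rounds` lists the participants of each round."""
    values = {}
    history = ()  # participant sets of the rounds already completed, in order
    for participants in rounds:
        n = len(participants)
        for i in participants:
            others = [j for j in participants if j != i]
            total = 0.0
            for size in range(n):
                for subset in combinations(others, size):
                    s = frozenset(subset)
                    # Marginal gain of adding i to S in this round, after the earlier rounds.
                    gain = utility(history, s | {i}) - utility(history, s)
                    total += gain / comb(n - 1, size)
            values[i] = values.get(i, 0.0) + total / n
        history += (frozenset(participants),)
    return values


sizes = {"a": 4, "b": 2, "c": 6}  # hypothetical data sizes


def utility(history, current):
    """Toy utility: data size, with round r (0-based) discounted by 0.5 ** r."""
    past = sum(0.5 ** r * sum(sizes[p] for p in rnd) for r, rnd in enumerate(history))
    return past + 0.5 ** len(history) * sum(sizes[p] for p in current)


rounds = [["a", "b"], ["b", "c"]]
values = federated_shapley(rounds, utility)
full_utility = utility((frozenset(rounds[0]),), frozenset(rounds[1]))
```

Because the toy utility is additive, each party's value in a round is just its discounted data size, and the values of all parties sum to the utility of the full training sequence, which matches the group rationality property summed over rounds.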
Hu and Gong [105] consider privacy leakage in federated learning and design an incentive mechanism to compensate the cost of privacy leakage of the users that are most likely to provide reliable data. Their problem is formulated as a two-stage Stackelberg game [200]. Richardson et al. [176] use influence functions to reward data contributions to linear regression in the federated learning setting.

5.7 Summary

In this section, we review the topic of pricing data products. We first analyze the structures, players, and ways to produce data products in data marketplaces. Then, we examine several important areas in pricing data products, including arbitrage-free pricing, revenue maximization pricing, fair and truthful pricing and privacy preserving pricing. We also discuss how to price dynamic data and online pricing. When pricing data products in a data marketplace, those considerations are typically incorporated and integrated in one way or another.

6 Discussion and Open Challenges

Data pricing comes from practical demands and has been tackled in multiple disciplines. Although there is a rich body of literature addressing a series of issues in data pricing, many questions remain unexplored. In this section, we discuss some interesting challenges for possible future work. Our list is by no means exhaustive. Instead, we hope our discussion can stimulate broader interest and research effort in this fast growing area.

6.1 Data Supply Chain: A Grand Challenge

At the macro level, although many studies focus on different steps in data marketplaces, we clearly observe a lack of systematic investigation on data supply chains and development of end-to-end solutions. As data products are abundant and diversified, to develop ecologically sustainable marketplaces, supply chains of data products have to be built.
Here, we introduce and advocate the notion of data supply chains, which connect all parties involved in data production and consumption, including data providers, data processors, data analysts, consumers of data products and services, and other possible roles. Each party in a data supply chain connects its upstream providers and its downstream consumers, provides its value-added contributions and obtains rewards. Feedback mechanisms through pricing and marketing have to be created in a data supply chain so that supply and consumption can be matched, coordinated and balanced. Most of those problems have not been studied thoroughly.

Although the notion of data supply chain is not mentioned in the literature, some specific trends and challenges are discussed sporadically. For example, Muschalle et al. [151] identify some trends and challenges in data consumption and marketplaces. First, they assert that many data processing tasks are essential for data markets, such as labeling, annotating and aggregating data. Second, data markets will be integrated with numerous application domains. To enable domain data markets, it is important to customize general data processing technologies for niche domains. Third, customers want to have data faster. Thus, it is important to create online data query services and develop corresponding pricing models. Fourth, as there are more data, more data providers and more analysts, a data product may be substituted by others. To foster a healthy data marketplace ecosystem, it is important to establish standard data processing mashups to facilitate data product substitution. Fifth, to maintain a fair data market overall, it is important to provide price transparency so that data product providers have to optimize their data and data processing/analysis services. Last, customer preferences and experience are critical for data markets.

Recently, Acemoglu et al.
[3] present an insightful study on the ecological effect of data markets. They demonstrate that a user's sharing of data is likely to reveal private information about other users and depress the price of other users' data. The depressed prices lead to excessive data sharing and thus further reduce welfare. Their study suggests the need for mediation in data sharing in data markets.

Most recently, Fernandez et al. [71] analyze the challenges and propose a research agenda around constructing a data market platform to address the sharing, discovery and integration of data among many parties. Their big picture covers both market design and system development. The focus is to create the incentives and mechanisms to connect data supply and demand. As the middlemen, arbiters build data mashups to match data supply and demand. The market platforms advocated by the authors can be regarded as the data exchange mechanisms in a data supply chain.

One challenge associated with the macro view of data supply chains is the interdisciplinary nature of data pricing research. As can be observed in this article, data pricing is studied in many different disciplines, such as economics, marketing, electronic commerce, data management, data mining and machine learning. The communication and dialog among different areas have to be strengthened.

6.2 Some Technical Challenges at the Micro Level

At the micro level, many research problems remain open. We name a few examples of fundamental problems.

First, most of the studies suggest relative prices of data products. Very few studies connect theoretical models with data pricing practice and investigate absolute prices of data products and their marketing effect. As data pricing is a market mechanism and user behavior in practice is hard to model completely, experimental studies of data pricing models are essential and should be connected to theoretical investigations.

Second, pricing is based on valuation and equilibrium among multiple parties.
Different parties may have different valuations of data, data products and data services. It is important to systematically establish the principles of value assessment for various parties in data marketplaces, such as data providers, data owners, data users, and data brokers. Moreover, it is important to understand what messages are passed to different parties in data marketplaces through data pricing actions, and how. So far, value assessment of data and negotiations among different parties in data marketplaces have largely not been analyzed in detail.

Third, many pricing models are proposed in the literature. It is important to understand how data pricing models and their assumptions can be implemented and enforced in practice. Specifically, accounting and auditing in data marketplaces are critical to achieve transparency in data pricing and efficiency in data marketplaces. Accounting and auditing in data marketplaces, however, are interesting problems that have not been investigated in depth yet. We need principles, quality guarantees and designs of operational procedures for accounting and auditing in data pricing, transactions and adversary detection.

Fourth, most of the studies on data pricing develop general models. At the same time, as data science transforms many application domains, data pricing has to deal with specific applications. Mechanisms, regulations and constraints in a specific domain may facilitate data pricing in some aspects, and pose challenges in some other aspects. For example, Jia et al. [111] show that, although fair pricing in general takes exponential computation time, it can be achieved in polynomial time in kNN models (Section 5.4). It is interesting and highly desirable to explore fairness, truthfulness, and privacy preservation of data pricing in specific applications.

Last but not least, almost all applications are dynamic in nature. The values of data, data products and data services may also evolve over time.
The changes may be caused by the updates in demands and supplies. It is important to develop mechanisms to capture and monitor changes in de- mand and supply of data, data products and data services, and explore corresponding dynamic pricing. 53 References [1] M. Aazam and E. Huh, \Broker as a service (baas) pricing and resource estimation model," in 2014 IEEE 6th International Conference on Cloud Computing Technology and Science, Dec 2014, pp. 463{468. [2] T. Abdallah, \On the bene t (or cost) of large-scale bundling," Production and Operations Management, vol. 28, no. 4, pp. 955{969, 2019. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10 .1111/poms.12958 [3] D. Acemoglu, A. Makhdoumi, A. Malekian, and A. Ozdaglar, \Too Much Data: Prices and Ineciencies in Data Markets," National Bureau of Economic Research, Inc, NBER Working Papers 26296, Sep. 2019. [Online]. Available: https://ideas.repec.org/p/nbr/nberwo /26296.html [4] A. Acquisti, C. Taylor, and L. Wagman, \The economics of privacy," Journal of Economic Literature, vol. 54, no. 2, pp. 442{92, June 2016. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/jel. 54.2.442 [5] A. Acquisti and C. Tucker, \Guns, privacy, and crime," Working paper, 2011. [Online]. Available: https://www.heinz.cmu.edu/ acqui sti/papers/acquisti-REV.pdf [6] W. J. Adams and J. L. Yellen, \Commodity bundling and the burden of monopoly," The Quarterly Journal of Economics, vol. 90, no. 3, pp. 475{498, 1976. [Online]. Available: http: //www.jstor.org/stable/1886045 [7] A. Agarwal, M. Dahleh, and T. Sarkar, \A marketplace for data: An algorithmic solution," in Proceedings of the 2019 ACM Conference on Economics and Computation, ser. EC'19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 701{726. [Online]. Available: https://doi.org/10.1145/3328526.3329589 [8] C. C. Aggarwal and P. S. Yu, \Privacy-preserving data mining: A survey," in Handbook of Database Security: Applications 54 and Trends, M. 
Gertz and S. Jajodia, Eds. Boston, MA: Springer US, 2008, pp. 431{460. [Online]. Available: https: //doi.org/10.1007/978-0-387-48533-1 18 [9] G. Aggarwal, A. Fiat, A. V. Goldberg, J. D. Hartline, N. Immorlica, and M. Sudan, \Derandomization of auctions," in Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, ser. STOC'05. New York, NY, USA: Association for Computing Machinery, 2005, pp. 619{625. [Online]. Available: https://doi.org/10.1145/1060590.1060682 [10] L. Aguiar and J. Waldfogel, \As streaming reaches ood stage, does it stimulate or depress music sales?" International Journal of Industrial Organization, vol. 57, no. C, pp. 278{307, 2018. [Online]. Available: https://ideas.repec.org/a/eee/indorg/v57y2018icp278-307.html [11] W. Aiello, Y. Ishai, and O. Reingold, \Priced oblivious transfer: How to sell digital goods," in Advances in Cryptology - EUROCRYPT 2001, International Conference on the Theory and Application of Cryptographic Techniques, Innsbruck, Austria, May 6-10, 2001, Proceeding, ser. Lecture Notes in Computer Science, vol. 2045. Springer, 2001, pp. 119{135. [Online]. Available: https://iacr.org/archive/eurocrypt2001/20450118.pdf [12] S. Al-Kiswany, H. Hacigum  u  s, Z. Liu, and J. Sankaranarayanan, \Cost exploration of data sharings in the cloud," in Proceedings of the 16th International Conference on Extending Database Technology, ser. EDBT'13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 601{612. [Online]. Available: https://doi.org/10.1145/2452376.2452447 [13] S. Alaei, A. Makhdoumi, and A. Malekian, \Optimal subscription planning for digital goods," SSRN Electronic Journal, 01 2019. [14] S. Alaei, A. Malekian, and A. Srinivasan, \On random sampling auctions for digital goods," in Proceedings of the 10th ACM Conference on Electronic Commerce, ser. EC'09. New York, NY, USA: Association for Computing Machinery, 2009, pp. 187{196. [Online]. Available: https://doi.org/10.1145/1566374.1566402 55 [15] C. 
Anderson, The Long Tail: Why the Future of Business Is Selling Less of More. Hyperion, 2006. [16] A. Archer, C. Papadimitriou, K. Talwar, and E. Tardos, \An approx- imate truthful mechanism for combinatorial auctions with single pa- rameter agents," in Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA'03. USA: Society for Industrial and Applied Mathematics, 2003, pp. 205{214. [17] A. A. Armstrong and E. H. Durfee, \Mixing and memory: Emer- gent cooperation in an information marketplace," in Proceedings of the 3rd International Conference on Multi Agent Systems, ser. ICMAS'98. USA: IEEE Computer Society, 1998, p. 34. [18] M. Armstrong, \A more general theory of commodity bundling," Journal of Economic Theory, vol. 148, no. 2, pp. 448{472, 2013. [Online]. Available: https://ideas.repec.org/a/eee/jetheo/v148y2013 i2p448-472.html [19] N. Arnosti, M. Beck, and P. Milgrom, \Adverse selection and auction design for internet display advertising," in Proceedings of the Sixteenth ACM Conference on Economics and Computation, ser. EC'15. New York, NY, USA: Association for Computing Machinery, 2015, p. 167. [Online]. Available: https://doi.org/10.1145/2764468.2764537 [20] S. Athey, E. Calvano, and J. Gans, \The impact of the internet on advertising markets for news media," National Bureau of Economic Research, Working Paper 19419, September 2013. [Online]. Available: http://www.nber.org/papers/w19419 [21] S. Athey, E. Calvano, and J. S. Gans, \The impact of consumer multi- homing on advertising markets and media competition," Management Science, vol. 64, pp. 1574{1590, 2018. [22] J. Auerbach, J. Galenson, and M. Sundararajan, \An empirical analysis of return on investment maximization in sponsored search auctions," in Proceedings of the 2nd International Workshop on Data Mining and Audience Intelligence for Advertising, ser. ADKDD'08. New York, NY, USA: Association for Computing Machinery, 2008, pp. 1{9. [Online]. 
Available: https://doi.org/10.1145/1517472.1517473 56 [23] M. Babaio , R. Kleinberg, and R. Paes Leme, \Optimal mechanisms for selling information," in Proceedings of the 13th ACM Conference on Electronic Commerce, ser. EC'12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 92{109. [Online]. Available: https://doi.org/10.1145/2229012.2229024 [24] P. Bajari and A. Hortacsu, \Economic insights from internet auctions," Journal of Economic Literature, vol. 42, no. 2, pp. 457{486, June 2004. [Online]. Available: https://www.aeaweb.org/articles?id= 10.1257/0022051041409075 [25] Y. Bakos and E. Brynjolfsson, \Bundling information goods: Pricing, pro ts, and eciency," Manage. Sci., vol. 45, no. 12, pp. 1613{1630, Dec. 1999. [Online]. Available: https://doi.org/10.1287/mnsc.45.12.1 [26] ||, \Bundling and competition on the internet," Marketing Science, vol. 19, no. 1, pp. 63{82, Feb. 2000. [27] M. Balazinska, B. Howe, P. Koutris, D. Suciu, and P. Upadhyaya, \A discussion on pricing relational data," in In Search of Elegance in the Theory and Practice of Computation: Essays Dedicated to Peter Buneman, V. Tannen, L. Wong, L. Libkin, W. Fan, W.-C. Tan, and M. Fourman, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 167{173. [Online]. Available: https://doi.org/10.1007/978-3-642-41660-6 7 [28] M. Balazinska, B. Howe, and D. Suciu, \Data markets in the cloud: An opportunity for the database community," PVLDB, vol. 4, no. 12, pp. 1482{1485, 2011. [Online]. Available: http://dblp.uni-trier.de/db /journals/pvldb/pvldb4.html#BalazinskaHS11 [29] M.-F. Balcan and A. Blum, \Approximation algorithms and online mechanisms for item pricing," in Proceedings of the 7th ACM Conference on Electronic Commerce, ser. EC'06. New York, NY, USA: Association for Computing Machinery, 2006, pp. 29{35. [Online]. Available: https://doi.org/10.1145/1134707.1134711 [30] M.-F. Balcan, A. Blum, and Y. 
Mansour, "Item pricing for revenue maximization," in Proceedings of the 9th ACM Conference on Electronic Commerce, ser. EC'08. New York, NY, USA: Association for Computing Machinery, 2008, pp. 50–59. [Online]. Available: https://doi.org/10.1145/1386790.1386802
[31] Z. Bar-Yossef, K. Hildrum, and F. Wu, "Incentive-compatible online auctions for digital goods," in Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA'02. USA: Society for Industrial and Applied Mathematics, 2002, pp. 964–970.
[32] J. Benfield and W. Szlemko, "Internet-based data collection: Promises and realities," Journal of Research Practice, vol. 2, no. 2, Jan. 2006.
[33] D. Bergemann and A. Bonatti, "Selling cookies," American Economic Journal: Microeconomics, vol. 7, no. 3, pp. 259–94, August 2015. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/mic.20140155
[34] D. Bergemann, A. Bonatti, and A. Smolin, "The design and price of information," American Economic Review, vol. 108, no. 1, pp. 1–48, January 2018. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/aer.20161079
[35] E. Bertino, D. Lin, and W. Jiang, "A survey of quantification of privacy preserving data mining algorithms," in Privacy-Preserving Data Mining: Models and Algorithms, C. C. Aggarwal and P. S. Yu, Eds. Boston, MA: Springer US, 2008, pp. 183–205. [Online]. Available: https://doi.org/10.1007/978-0-387-70992-5_8
[36] S. J. Best and B. S. Krueger, Internet Data Collection, ser. Quantitative Applications in the Social Sciences. Thousand Oaks, CA: SAGE Publications, Inc., 2004.
[37] A. Boom, "'Download for free': When do providers of digital goods offer free samples?" Free University Berlin, School of Business & Economics, Discussion Papers 2004/28, 2004.
[38] R. Brennan, L. Canning, and R. Mcdowell, Business-to-business marketing. Sage Publications, Jan. 2013.
[39] P. Briest and P.
Krysta, "Single-minded unlimited supply pricing on sparse instances," in Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA'06. USA: Society for Industrial and Applied Mathematics, 2006, pp. 1093–1102.
[40] E. Brynjolfsson and M. D. Smith, "Frictionless commerce? A comparison of internet and conventional retailers," Management Science, vol. 46, no. 4, pp. 563–585, 2000. [Online]. Available: https://doi.org/10.1287/mnsc.46.4.563.12061
[41] Y. Cai, C. Daskalakis, and C. Papadimitriou, "Optimum statistical estimation with strategic data sources," in Proceedings of The 28th Conference on Learning Theory, ser. Proceedings of Machine Learning Research, P. Grünwald, E. Hazan, and S. Kale, Eds., vol. 40. Paris, France: PMLR, 03–06 Jul 2015, pp. 280–296.
[42] S. Chawla, S. Deep, P. Koutris, and Y. Teng, "Revenue maximization for query pricing," Proc. VLDB Endow., vol. 13, no. 1, pp. 1–14, Sep. 2019. [Online]. Available: https://doi.org/10.14778/3357377.3357378
[43] Y.-K. Che, S. Choi, and J. Kim, "An experimental study of sponsored-search auctions," Games and Economic Behavior, vol. 102, pp. 20–43, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0899825616301233
[44] L. Chen, P. Koutris, and A. Kumar, "Towards model-based pricing for machine learning in a data marketplace," in Proceedings of the 2019 International Conference on Management of Data, ser. SIGMOD'19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 1535–1552. [Online]. Available: https://doi.org/10.1145/3299869.3300078
[45] L. Chiou and C. Tucker, "Paywalls and the demand for news," Information Economics and Policy, vol. 25, no. 2, pp. 61–69, 2013. [Online]. Available: https://EconPapers.repec.org/RePEc:eee:iepoli:v:25:y:2013:i:2:p:61-69
[46] L. Chiou and C. Tucker, "Content aggregation by platforms: The case of the news media," Journal of Economics & Management Strategy, vol. 26, no. 4, pp. 782–805, 2017. [Online].
Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/jems.12207
[47] R. D. Cook, "Detection of influential observation in linear regression," Technometrics, vol. 19, no. 1, pp. 15–18, Feb. 1977.
[48] R. Cook and S. Weisberg, Residuals and Influence in Regression, ser. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 1982. [Online]. Available: https://books.google.ca/books?id=MVSqAAAAIAAJ
[49] R. Cummings, K. Ligett, A. Roth, Z. S. Wu, and J. Ziani, "Accuracy for sale: Aggregating data with a variance constraint," in Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ser. ITCS'15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 317–324. [Online]. Available: https://doi.org/10.1145/2688073.2688106
[50] D. Dao, D. Alistarh, C. Musat, and C. Zhang, "Databright: Towards a global exchange for decentralized data ownership and trusted computation," CoRR, vol. abs/1802.04780, 2018. [Online]. Available: http://arxiv.org/abs/1802.04780
[51] C. Daskalakis, A. Deckelbaum, and C. Tzamos, "Strong duality for a multiple-good monopolist," Econometrica, vol. 85, no. 3, pp. 735–767, 2017. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.3982/ECTA12618
[52] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proceedings of the Twentieth Annual Symposium on Computational Geometry, ser. SCG'04. New York, NY, USA: Association for Computing Machinery, 2004, pp. 253–262. [Online]. Available: https://doi.org/10.1145/997817.997857
[53] D. Davydov, S. Izmalkov, and A. Smirnov, "Sponsored-Search Auctions: Empirical and Experimental Works," Journal of the New Economic Association, vol. 28, no. 4, pp. 56–73, 2015. [Online]. Available: https://ideas.repec.org/a/nea/journl/y2015i28p56-73.html
[54] S. Deep and P. Koutris, "The design of arbitrage-free data pricing schemes," CoRR, vol. abs/1606.09376, 2016.
[Online]. Available: http://arxiv.org/abs/1606.09376
[55] S. Deep and P. Koutris, "Qirana: A framework for scalable query pricing," in Proceedings of the 2017 ACM International Conference on Management of Data, ser. SIGMOD'17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 699–713. [Online]. Available: https://doi.org/10.1145/3035918.3064017
[56] S. Deep, P. Koutris, and Y. Bidasaria, "Qirana demonstration: Real time scalable query pricing," Proc. VLDB Endow., vol. 10, no. 12, pp. 1949–1952, Aug. 2017. [Online]. Available: https://doi.org/10.14778/3137765.3137816
[57] X. Deng and C. H. Papadimitriou, "On the complexity of cooperative solution concepts," Mathematics of Operations Research, vol. 19, no. 2, pp. 257–266, 1994. [Online]. Available: https://doi.org/10.1287/moor.19.2.257
[58] S. Dibb, L. Simkin, W. M. Pride, and O. Ferrell, Marketing: Concepts and Strategies, 5th ed. Abingdon, UK: Houghton Mifflin, April 2005. [Online]. Available: http://oro.open.ac.uk/2041/
[59] W. Diffie and M. Hellman, "New directions in cryptography," IEEE Trans. Inf. Theor., vol. 22, no. 6, pp. 644–654, Nov. 1976. [Online]. Available: https://doi.org/10.1109/TIT.1976.1055638
[60] D.-Z. Du and F. K. Hwang, Combinatorial Group Testing and Its Applications, 2nd ed. World Scientific, 1999. [Online]. Available: https://www.worldscientific.com/doi/abs/10.1142/4252
[61] C. Dwork, "Differential privacy: A survey of results," in Theory and Applications of Models of Computation, M. Agrawal, D. Du, Z. Duan, and A. Li, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 1–19.
[62] C. Dwork, F. McSherry, K. Nissim, and A. Smith, "Calibrating noise to sensitivity in private data analysis," in Theory of Cryptography, S. Halevi and T. Rabin, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 265–284.
[63] T. Ebert, "Applications of recursive operators to randomness and complexity," Ph.D.
dissertation, University of California, Santa Barbara,
[64] B. Edelman and M. Ostrovsky, "Strategic bidder behavior in sponsored search auctions," Decision Support Systems, vol. 43, no. 1, pp. 192–198, Feb. 2007. [Online]. Available: https://doi.org/10.1016/j.dss.2006.08.008
[65] B. Edelman, M. Ostrovsky, and M. Schwarz, "Internet advertising and the generalized second-price auction: Selling billions of dollars worth of keywords," American Economic Review, vol. 97, no. 1, pp. 242–259, March 2007. [Online]. Available: https://www.aeaweb.org/articles?id=10.1257/aer.97.1.242
[66] L. Einav, C. Farronato, J. Levin, and N. Sundaresan, "Auctions versus Posted Prices in Online Markets," Journal of Political Economy, vol. 126, no. 1, pp. 178–215, 2018. [Online]. Available: https://ideas.repec.org/a/ucp/jpolec/doi10.1086-695529.html
[67] H. Elmeleegy, Y. Li, Y. Qi, P. Wilmot, M. Wu, S. Kolay, A. Dasdan, and S. Chen, "Overview of Turn data management platform for digital advertising," Proc. VLDB Endow., vol. 6, no. 11, pp. 1138–1149, Aug. 2013. [Online]. Available: https://doi.org/10.14778/2536222.2536238
[68] R. Engelbrecht-Wiggans, "Auctions and bidding models: A survey," Management Science, vol. 26, no. 2, pp. 119–142, 1980. [Online]. Available: http://www.jstor.org/stable/2630247
[69] D. S. Evans, "The online advertising industry: Economics, evolution, and privacy," Journal of Economic Perspectives, vol. 23, no. 3, pp. 37–60, September 2009. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/jep.23.3.37
[70] U. Feige, A. Flaxman, J. D. Hartline, and R. Kleinberg, "On the competitive ratio of the random sampling auction," in Proceedings of the First International Conference on Internet and Network Economics, ser. WINE'05. Berlin, Heidelberg: Springer-Verlag, 2005, pp. 878–886. [Online]. Available: https://doi.org/10.1007/11600930_89
[71] R. C. Fernandez, P. Subramaniam, and M. J.
Franklin, "Data market platforms: Trading data assets to solve data problems," Proc. VLDB Endow., vol. 13, no. 12, pp. 1933–1947, Jul. 2020. [Online]. Available: https://doi.org/10.14778/3407790.3407800
[72] M. A. Ferrag, L. Maglaras, and A. Ahmim, "Privacy-preserving schemes for ad hoc social networks: A survey," IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 3015–3045, 2017.
[73] F. Ferreira and J. Waldfogel, "Pop internationalism: Has half a century of world music trade displaced local culture?" Economic Journal, vol. 123, no. 569, pp. 634–664, Jun 2013.
[74] L. K. Fleischer and Y.-H. Lyu, "Approximately optimal auctions for selling privacy when costs are correlated with data," in Proceedings of the 13th ACM Conference on Electronic Commerce, ser. EC'12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 568–585. [Online]. Available: https://doi.org/10.1145/2229012.2229054
[75] S. A. Fricker and Y. V. Maksimov, "Pricing of data products in data marketplaces," in Software Business, A. Ojala, H. Holmström Olsson, and K. Werder, Eds. Cham: Springer International Publishing, 2017, pp. 49–66.
[76] T. L. Friedman, The World Is Flat: A Brief History of the Twenty-First Century, 1st ed. New York: Farrar, Straus and Giroux, 2005.
[77] D. Fudenberg and J. M. Villas-Boas, "Price discrimination in the digital economy," in The Oxford Handbook of the Digital Economy, M. Peitz and J. Waldfogel, Eds. Oxford University Press, 2012. [Online]. Available: https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780195397840.001.0001/oxfordhb-9780195397840-e-10
[78] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, "Privacy-preserving data publishing: A survey of recent developments," ACM Comput. Surv., vol. 42, no. 4, Jun. 2010. [Online]. Available: https://doi.org/10.1145/1749603.1749605
[79] J. M. Gallaugher, P. Auger, and A.
BarNir, "Revenue streams and digital content providers: an empirical investigation," Information & Management, vol. 38, no. 7, pp. 473–485, 2001. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S037872060000083
[80] K. Ganchev, A. Kulesza, J. Tan, R. Gabbard, Q. Liu, and M. Kearns, "Empirical price modeling for sponsored search," in Internet and Network Economics, X. Deng and F. C. Graham, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 541–548.
[81] N. Gandal, "Native language and internet usage," International Journal of the Sociology of Language, vol. 2006, no. 182, pp. 25–40, 2006. [Online]. Available: https://www.degruyter.com/view/journals/ijsl/2006/182/article-p25.xml
[82] M. Gentzkow and J. M. Shapiro, "Ideological segregation online and offline," The Quarterly Journal of Economics, vol. 126, no. 4, pp. 1799–1839, Nov. 2011. [Online]. Available: https://doi.org/10.1093/qje/qjr044
[83] Y. Gertner, Y. Ishai, E. Kushilevitz, and T. Malkin, "Protecting data privacy in private information retrieval schemes," J. Comput. Syst. Sci., vol. 60, no. 3, pp. 592–629, Jun. 2000. [Online]. Available: https://doi.org/10.1006/jcss.1999.1689
[84] A. Ghorbani, M. P. Kim, and J. Zou, "A distributional framework for data valuation," in Proceedings of the International Conference on Machine Learning 1 pre-proceedings (ICML 2020), 2020.
[85] A. Ghorbani and J. Zou, "Data Shapley: Equitable valuation of data for machine learning," in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, 09–15 Jun 2019, pp. 2242–2251. [Online]. Available: http://proceedings.mlr.press/v97/ghorbani19c.html
[86] A. Ghosh, K. Ligett, A. Roth, and G. Schoenebeck, "Buying private data without verification," in Proceedings of the Fifteenth ACM Conference on Economics and Computation, ser. EC'14.
New York, NY, USA: Association for Computing Machinery, 2014, pp. 931–948. [Online]. Available: https://doi.org/10.1145/2600057
[87] A. Ghosh and A. Roth, "Selling privacy at auction," in Proceedings of the 12th ACM Conference on Electronic Commerce, ser. EC'11. New York, NY, USA: Association for Computing Machinery, 2011, pp. 199–208. [Online]. Available: https://doi.org/10.1145/1993574.1993605
[88] A. Gilchrist, Industry 4.0: The Industrial Internet of Things, 1st ed. USA: Apress, 2016.
[89] A. V. Goldberg and J. D. Hartline, "Competitive auctions for multiple digital goods," in Proceedings of the 9th Annual European Symposium on Algorithms, ser. ESA'01. Berlin, Heidelberg: Springer-Verlag, 2001, pp. 416–427.
[90] A. V. Goldberg and J. D. Hartline, "Competitiveness via consensus," in Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA'03. USA: Society for Industrial and Applied Mathematics, 2003, pp. 215–222.
[91] A. V. Goldberg and J. D. Hartline, "Envy-free auctions for digital goods," in Proceedings of the 4th ACM Conference on Electronic Commerce, ser. EC'03. New York, NY, USA: Association for Computing Machinery, 2003, pp. 29–35. [Online]. Available: https://doi.org/10.1145/779928.
[92] A. V. Goldberg, J. D. Hartline, and A. Wright, "Competitive auctions and digital goods," in Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA'01. USA: Society for Industrial and Applied Mathematics, 2001, pp. 735–744.
[93] A. Goldfarb and C. Tucker, "Digital economics," Journal of Economic Literature, vol. 57, no. 1, pp. 3–43, March 2019. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/jel.20171452
[94] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[95] B. R. Gordon, F. Zettelmeyer, N. Bhargava, and D.
Chapsky, "A comparison of approaches to advertising measurement: Evidence from big field experiments at Facebook," Marketing Science, vol. 38, no. 2, pp. 193–225, 2019. [Online]. Available: https://doi.org/10.1287/mksc.2018.1135
[96] V. Guruswami, J. D. Hartline, A. R. Karlin, D. Kempe, C. Kenyon, and F. McSherry, "On profit-maximizing envy-free pricing," in Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA'05. USA: Society for Industrial and Applied Mathematics, 2005, pp. 1164–1173.
[97] N. Haghpanah and J. Hartline, "Reverse mechanism design," in Proceedings of the Sixteenth ACM Conference on Economics and Computation, ser. EC'15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 757–758. [Online]. Available: https://doi.org/10.1145/2764468.2764498
[98] N. Haghpanah and J. Hartline, "When is pure bundling optimal?" The Pennsylvania State University, Working Paper, April 2020. [Online]. Available: https://www.personal.psu.edu/nuh47/papers/bundling.pdf
[99] H. Halaburda and Y. Yehezkel, "Platform competition under asymmetric information," American Economic Journal: Microeconomics, vol. 5, no. 3, pp. 22–68, 2013. [Online]. Available: http://www.jstor.org/stable/43189630
[100] D. Han, S. Tople, A. Rogers, M. Wooldridge, O. Ohrimenko, and S. Tschiatschek, "Replication-robust payoff-allocation with applications in machine learning marketplaces," ArXiv, vol. abs/2006.14583,
[101] G. Hardin, "The tragedy of the commons," Science, vol. 162, no. 3859, pp. 1243–1248, 1968. [Online]. Available: https://science.sciencemag.org/content/162/3859/1243
[102] J. D. Hartline and R. McGrew, "From optimal limited to unlimited supply auctions," in Proceedings of the 6th ACM Conference on Electronic Commerce, ser. EC'05. New York, NY, USA: Association for Computing Machinery, 2005, pp. 175–182. [Online]. Available: https://doi.org/10.1145/1064009.1064028
[103] J. Heckman, E. Peters, N. G. Kurup, E. Boehmer, and M.
Davaloo, "A pricing model for data markets," in iConference 2015 Proceedings. iSchools, 2015.
[104] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963. [Online]. Available: http://www.jstor.org/stable/2282952
[105] R. Hu and Y. Gong, "Trading data for learning: Incentive mechanism for on-device federated learning," ArXiv, vol. abs/2009.05604, 2020.
[106] W. Hu and A. Bolivar, "Online auctions efficiency: A survey of eBay auctions," in Proceedings of the 17th International Conference on World Wide Web, ser. WWW'08. New York, NY, USA: Association for Computing Machinery, 2008, pp. 925–934. [Online]. Available: https://doi.org/10.1145/1367497.1367621
[107] N. Hynes, D. Dao, D. Yan, R. Cheng, and D. Song, "A demonstration of Sterling: A privacy-preserving data marketplace," Proc. VLDB Endow., vol. 11, no. 12, pp. 2086–2089, Aug. 2018. [Online]. Available: https://doi.org/10.14778/3229863.3236266
[108] G. Irvin, Modern Cost-Benefit Methods. London: Macmillan Publishers Limited, 1978.
[109] J. Jaisingh, J. Barron, S. Mehta, and A. Chaturvedi, "Privacy and pricing personal information," European Journal of Operational Research, vol. 187, no. 3, pp. 857–870, 2008. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0377221706007867
[110] J. Jansen and T. Mullen, "Sponsored search: An overview of the concept, history, and technology," International Journal of Electronic Business, vol. 6, pp. 114–131, Jan. 2008.
[111] R. Jia, D. Dao, B. Wang, F. A. Hubis, N. M. Gürel, B. Li, C. Zhang, C. Spanos, and D. Song, "Efficient task-specific data valuation for nearest neighbor algorithms," Proc. VLDB Endow., vol. 12, no. 11, pp. 1610–1623, Jul. 2019. [Online]. Available: https://doi.org/10.14778/3342263.3342637
[112] R. Jia, D. Dao, B. Wang, F. A. Hubis, N. Hynes, N. M. Gürel, B. Li, C. Zhang, D. Song, and C. J.
Spanos, "Towards efficient data valuation based on the Shapley value," in Proceedings of Machine Learning Research, ser. Proceedings of Machine Learning Research, K. Chaudhuri and M. Sugiyama, Eds., vol. 89. PMLR, 16–18 Apr 2019, pp. 1167–1176. [Online]. Available: http://proceedings.mlr.press/v89/jia19a.html
[113] R. Jia, X. Sun, J. Xu, C. Zhang, B. Li, and D. Song, "An empirical and comparative analysis of data valuation with scalable algorithms," CoRR, vol. abs/1911.07128, 2019. [Online]. Available: http://arxiv.org/abs/1911.07128
[114] H. Jiang, J. Pei, D. Yu, J. Yu, B. Gong, and X. Cheng, "Differential privacy and its applications in social network analysis: A survey," ArXiv, vol. abs/2010.02973, 2020.
[115] B. Jullien, "Two-sided B to B platforms," in The Oxford Handbook of the Digital Economy, M. Peitz and J. Waldfogel, Eds. Oxford University Press, 2012. [Online]. Available: https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780195397840.001.0001/oxfordhb-9780195397840-e-7
[116] V. V. Kantere, D. Dash, G. Gratsias, and A. Ailamaki, "Predicting cost amortization for query services," in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD'11. New York, NY, USA: Association for Computing Machinery, 2011, pp. 325–336. [Online]. Available: https://doi.org/10.1145/1989323.1989358
[117] D. Kifer and A. Machanavajjhala, "A rigorous and customizable framework for privacy," in Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, ser. PODS'12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 77–88. [Online]. Available: https://doi.org/10.1145/2213556.2213571
[118] P. Klemperer, "Auction theory: A guide to the literature," Journal of Economic Surveys, vol. 13, no. 3, pp. 227–286, 1999. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-6419.00083
[119] P. W. Koh and P.
Liang, "Understanding black-box predictions via influence functions," in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML'17. JMLR.org, 2017, pp. 1885–1894.
[120] P. Kotler, Marketing Management: The Millennium Edition. Boston, MA: Pearson Custom Pub., 2000.
[121] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu, "Query-based data pricing," in Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, ser. PODS'12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 167–178. [Online]. Available: https://doi.org/10.1145/2213556.2213582
[122] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu, "QueryMarket demonstration: Pricing for online data markets," Proc. VLDB Endow., vol. 5, no. 12, pp. 1962–1965, Aug. 2012. [Online]. Available: https://doi.org/10.14778/2367502.2367548
[123] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu, "Toward practical query pricing with QueryMarket," in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD'13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 613–624. [Online]. Available: https://doi.org/10.1145/2463676.2465335
[124] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu, "Query-based data pricing," J. ACM, vol. 62, no. 5, Nov. 2015. [Online]. Available: https://doi.org/10.1145/2770870
[125] Y. Kwon, M. A. Rivas, and J. Zou, "Efficient computation and analysis of distributional Shapley values," ArXiv, vol. abs/2007.01357, 2020.
[126] S. Lahaie, D. M. Pennock, A. Saberi, and R. V. Vohra, "Sponsored search auctions," in Algorithmic Game Theory, N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani, Eds. Cambridge University Press, 2007, pp. 699–716.
[127] A. Lambrecht, A. Goldfarb, A. Bonatti, A. Ghose, D. Goldstein, R. Lewis, A. Rao, N. Sahni, and S. Yao, "How do firms make money selling digital goods online?" Marketing Letters, vol. 25, pp. 331–341, Sep. 2014.
[128] A. Lambrecht and C. Tucker, "When does retargeting work?
Information specificity in online advertising," Journal of Marketing Research, vol. 50, no. 5, pp. 561–576, 2013. [Online]. Available: https://doi.org/10.1509/jmr.11.0503
[129] R. Lavi and N. Nisan, "Competitive analysis of incentive compatible on-line auctions," in Proceedings of the 2nd ACM Conference on Electronic Commerce, ser. EC'00. New York, NY, USA: Association for Computing Machinery, 2000, pp. 233–241. [Online]. Available: https://doi.org/10.1145/352871.352897
[130] S. Lehmann and P. Buxmann, "Pricing strategies of software vendors," Business & Information Systems Engineering, vol. 1, pp. 452–462, 12
[131] J. Lerner, P. A. Pathak, and J. Tirole, "The dynamics of open-source contributors," American Economic Review, vol. 96, no. 2, pp. 114–118, May 2006. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/000282806777211874
[132] R. Lewis and J. Rao, "On the near impossibility of measuring the returns to advertising," SSRN Electronic Journal, Jan. 2013.
[133] C. Li, D. Y. Li, G. Miklau, and D. Suciu, "A theory of pricing private data," ACM Trans. Database Syst., vol. 39, no. 4, Dec. 2015. [Online]. Available: https://doi.org/10.1145/2691190.2691191
[134] C. Li and G. Miklau, "Pricing aggregate queries in a data marketplace," in Proceedings of the 15th International Workshop on the Web and Databases 2012, WebDB 2012, Scottsdale, AZ, USA, May 20, 2012, Z. G. Ives and Y. Velegrakis, Eds., 2012, pp. 19–24. [Online]. Available: http://db.disi.unitn.eu/pages/WebDB2012/papers/p15.pdf
[135] X.-B. Li and S. Raghunathan, "Pricing and disseminating customer data with privacy awareness," Decision Support Systems, vol. 59, pp. 63–73, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167923613002534
[136] F. Liang, W. Yu, D. An, Q. Yang, X. Fu, and W. Zhao, "A survey on big data market: Pricing, trading and protection," IEEE Access, vol. PP, pp. 1–1, Feb. 2018.
[137] K. Ligett and A.
Roth, "Take it or leave it: Running a survey when privacy comes at a cost," in Proceedings of the Eighth International Workshop on Internet and Network Economics (WINE'12), ser. Lecture Notes in Computer Science, P. W. Goldberg and M. Guo, Eds., vol. 7695. Berlin, Heidelberg: Springer, 2012, pp. 378–391.
[138] B.-R. Lin and D. Kifer, "On arbitrage-free pricing for general data queries," Proc. VLDB Endow., vol. 7, no. 9, pp. 757–768, May 2014. [Online]. Available: https://doi.org/10.14778/2732939.2732948
[139] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," Medical Image Analysis, vol. 42, pp. 60–88, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S136184151730113
[140] Z. Liu and H. Hacıgümüş, "Online optimization and fair costing for dynamic data sharing in a cloud data market," in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD'14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 1359–1370. [Online]. Available: https://doi.org/10.1145/2588555.2593679
[141] S. Maleki, L. Tran-Thanh, G. Hines, T. Rahwan, and A. Rogers, "Bounding the estimation error of sampling-based Shapley value approximation with/without stratifying," CoRR, vol. abs/1306.4265, 2013. [Online]. Available: http://arxiv.org/abs/1306.4265
[142] A. Mas-Colell, M. Whinston, and J. Green, Microeconomic Theory. Oxford University Press, 1995. [Online]. Available: https://EconPapers.repec.org/RePEc:oxp:obooks:9780195102680
[143] E. Maskin and J. Riley, "Monopoly with incomplete information," The RAND Journal of Economics, vol. 15, no. 2, pp. 171–196, 1984. [Online]. Available: http://www.jstor.org/stable/2555674
[144] E. Maskin and J. Riley, "Asymmetric Auctions," The Review of Economic Studies, vol. 67, no. 3, pp. 413–438, Jul. 2000. [Online].
Available: https://doi.org/10.1111/1467-937X.00137
[145] R. P. McAfee and J. McMillan, "Auctions and Bidding," Journal of Economic Literature, vol. 25, no. 2, pp. 699–738, June 1987. [Online]. Available: https://ideas.repec.org/a/aea/jeclit/v25y1987i2p699-738.html
[146] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-Efficient Learning of Deep Networks from Decentralized Data," in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[147] B. McMahan and D. Ramage, "Federated learning: Collaborative machine learning without centralized training data," Google AI Blog, April 2017. [Online]. Available: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
[148] D. Menicucci, S. Hurkens, and D.-S. Jeon, "On the optimality of pure bundling for a monopolist," Journal of Mathematical Economics, vol. 60, pp. 33–42, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S030440681500066X
[149] T. Moore, R. Clayton, and R. Anderson, "The economics of online crime," Journal of Economic Perspectives, vol. 23, no. 3, pp. 3–20, September 2009. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/jep.23.3.3
[150] M. K. M. Murthy, H. A. Sanjay, and J. P. Ashwini, "Pricing models and pricing schemes of IaaS providers: A comparison study," in Proceedings of the International Conference on Advances in Computing, Communications and Informatics, ser. ICACCI'12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 143–147. [Online]. Available: https://doi.org/10.1145/2345396.2345421
[151] A. Muschalle, F. Stahl, A. Löser, and G. Vossen, "Pricing approaches for data markets," in Enabling Real-Time Business Intelligence, M. Castellanos, U. Dayal, and E. A. Rundensteiner, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 129–144.
[152] R. B. Myerson, "Optimal auction design," Math. Oper. Res., vol. 6, no. 1, pp.
58–73, Feb. 1981. [Online]. Available: https://doi.org/10.1287/moor.6.1.58
[153] A. Nagaraj, "The private impact of public information: Landsat satellite maps and gold exploration," Unpublished, Jul. 2016. [Online]. Available: http://abhishekn.com/files/nagaraj_landsat2020.pdf
[154] P. Naghizadeh and A. Sinha, "Adversarial contract design for private data commercialization," in Proceedings of the 2019 ACM Conference on Economics and Computation, ser. EC'19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 681–699. [Online]. Available: https://doi.org/10.1145/3328526.3329633
[155] T. T. Nagle and J. Hogan, The Strategy and Tactics of Pricing: A Guide to Growing More Profitably. Prentice Hall, 2010.
[156] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic, "Deep learning applications and challenges in big data analytics," Journal of Big Data, vol. 2, no. 1, p. 1, Feb 2015. [Online]. Available: https://doi.org/10.1186/s40537-014-0007-7
[157] A. Nash, L. Segoufin, and V. Vianu, "Determinacy and rewriting of conjunctive queries using views: A progress report," in Proceedings of the 11th International Conference on Database Theory, ser. ICDT'07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 59–73. [Online]. Available: https://doi.org/10.1007/11965893_5
[158] A. Nash, L. Segoufin, and V. Vianu, "Views and queries: Determinacy and rewriting," ACM Trans. Database Syst., vol. 35, no. 3, Jul. 2010. [Online]. Available: https://doi.org/10.1145/1806907.1806913
[159] M. Neumeier, The Brand Flip: Why customers now run companies and how to profit from it. San Francisco: New Riders, 2015.
[160] K. Nissim, S. Vadhan, and D. Xiao, "Redrawing the boundaries on purchasing data from privacy-sensitive individuals," in Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, ser. ITCS'14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 411–422. [Online].
Available: https://doi.org/10.1145/2554797.2554835
[161] C. Niu, Z. Zheng, S. Tang, X. Gao, and F. Wu, "Making big money from small sensors: Trading time-series data under pufferfish privacy," in IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, April 2019, pp. 568–576.
[162] C. Niu, Z. Zheng, F. Wu, S. Tang, and G. Chen, "Online pricing with reserve price constraint for personal data markets," CoRR, vol. abs/1911.12598, 2019. [Online]. Available: http://arxiv.org/abs/1911.12598
[163] C. Niu, Z. Zheng, F. Wu, S. Tang, X. Gao, and G. Chen, "Unlocking the value of privacy: Trading aggregate statistics over private correlated data," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD'18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 2031–2040. [Online]. Available: https://doi.org/10.1145/3219819.3220013
[164] A. Ockenfels, D. Reiley, and A. Sadrieh, "Online auctions," National Bureau of Economic Research, Working Paper 12785, December 2006. [Online]. Available: http://www.nber.org/papers/w12785
[165] A. Odlyzko, "Privacy, economics, and price discrimination on the internet," in Proceedings of the 5th International Conference on Electronic Commerce, ser. ICEC'03. New York, NY, USA: Association for Computing Machinery, 2003, pp. 355–366. [Online]. Available: https://doi.org/10.1145/948005.948051
[166] E. Ostrom, Governing the Commons: The Evolution of Institutions for Collective Action, ser. Canto Classics. Cambridge University Press,
[167] K. Pantelis and L. Aija, "Understanding the value of (big) data," in 2013 IEEE International Conference on Big Data, 2013, pp. 38–42.
[168] K. Pauwels and A. Weiss, "Moving from free to fee: How online firms market to change their business model successfully," Journal of Marketing, vol. 72, no. 3, pp. 14–31, 2008. [Online]. Available: https://doi.org/10.1509/JMKG.72.3.014
[169] A. Pavan, I. Segal, and J.
Toikka, "Dynamic mechanism design: A Myersonian approach," Econometrica, vol. 82, no. 2, pp. 601–653, 2014. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.3982/ECTA10269
[170] L. L. Pipino, Y. W. Lee, and R. Y. Wang, "Data quality assessment," Commun. ACM, vol. 45, no. 4, pp. 211–218, Apr. 2002. [Online]. Available: https://doi.org/10.1145/505248.506010
[171] A. Prasad, V. Mahajan, and B. Bronnenberg, "Advertising versus pay-per-view in electronic media," International Journal of Research in Marketing, vol. 20, no. 1, pp. 13–30, 2003. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167811602001192
[172] T. Qin, W. Chen, and T.-Y. Liu, "Sponsored search auctions: Recent advances and future directions," ACM Trans. Intell. Syst. Technol., vol. 5, no. 4, Jan. 2015. [Online]. Available: https://doi.org/10.1145/2668108
[173] A. Rao, "Online Content Pricing: Purchase and Rental Markets," Marketing Science, vol. 34, no. 3, pp. 430–451, May 2015. [Online]. Available: https://ideas.repec.org/a/inm/ormksc/v34y2015i3p430-451.html
[174] J. M. Rao and D. H. Reiley, "The economics of spam," Journal of Economic Perspectives, vol. 26, no. 3, pp. 87–110, September 2012. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/jep.26.3.87
[175] K. Ren, J. Qin, L. Zheng, Z. Yang, W. Zhang, and Y. Yu, "Deep landscape forecasting for real-time bidding advertising," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD'19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 363–372. [Online]. Available: https://doi.org/10.1145/3292500.3330870
[176] A. Richardson, A. Filos-Ratsikas, and B. Faltings, "Rewarding high-quality data via influence functions," CoRR, vol. abs/1908.11598, 2019. [Online]. Available: http://arxiv.org/abs/1908.11598
[177] J. Riley and W. F. Samuelson, "Optimal auctions," American Economic Review, vol. 71, no. 3, pp. 381–392, 1981. [Online].
Available: https://EconPapers.repec.org/RePEc:aea:aecrev:v:71:y:1981:i:3:p:381-92
[178] A. Roth, "Technical perspective: Pricing information (and its implications)," Commun. ACM, vol. 60, no. 12, p. 78, Nov. 2017. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/3139455
[179] F. Schomm, F. Stahl, and G. Vossen, "Marketplaces for data: An initial survey," SIGMOD Rec., vol. 42, no. 1, pp. 15–26, May 2013. [Online]. Available: https://doi.org/10.1145/2481528.2481532
[180] L. Segoufin and V. Vianu, "Views and queries: Determinacy and rewriting," in Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ser. PODS'05. New York, NY, USA: Association for Computing Machinery, 2005, pp. 49–60. [Online]. Available: https://doi.org/10.1145/1065167.1065174
[181] S. Sen, C. Joe-Wong, S. Ha, and M. Chiang, "A survey of smart data pricing: Past proposals, current plans, and future trends," ACM Computing Surveys, vol. 46, no. 2, Nov. 2013. [Online]. Available: https://doi.org/10.1145/2543581.2543582
[182] C. Shapiro and H. R. Varian, Information Rules: A Strategic Guide to the Network Economy, ser. Strategy/Technology / Harvard Business School Press. Harvard Business School Press, 1998. [Online]. Available: https://books.google.ca/books?id=aE_J4Iv_PVEC
[183] C. Shapiro and H. R. Varian, "Versioning: The smart way to sell information," Harvard Business Review, pp. 106–114, November-December 1998. [Online]. Available: https://hbr.org/1998/11/versioning-the-smart-way-to-sell-information
[184] L. S. Shapley, "A Value for n-Person Games," RAND Corporation, Santa Monica, CA, Tech. Rep. P-295, 1952. [Online]. Available: https://www.rand.org/pubs/papers/P0295.html
[185] M. Shubik, "Auctions, bidding, and markets: An historical sketch," in Auctions, Bidding, and Contracting, M. Shubik and J. Stark, Eds. New York University Press, 1983, pp. 33–52.
[186] R. H. L. Sim, Y. Zhang, M. C. Chan, and B. K. H.
Low, "Collaborative machine learning with incentive-aware model rewards," in Proceedings of the International Conference on Machine Learning 1 pre-proceedings (ICML 2020), 2020.
[187] B. Squire, S. Brown, J. Readman, and J. Bessant, "The impact of mass customisation on manufacturing trade-offs," Production and Operations Management, vol. 15, pp. 10–21, 01 2009.
[188] C. Sunstein, Echo Chambers: Bush V. Gore, Impeachment, and Beyond. Princeton University Press, 2001. [Online]. Available: https://books.google.ca/books?id=sEgHAAAACAAJ
[189] C. Swamy and M. Cheung, "Approximation algorithms for single-minded envy-free profit-maximization problems with limited supply," in 2008 IEEE 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS). Los Alamitos, CA, USA: IEEE Computer Society, oct 2008, pp. 35–44. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/FOCS.2008.15
[190] G. Tang, Y. Yang, and J. Pei, "Price information patterns in web search advertising: An empirical case study on accommodation industry," in 2013 IEEE International Conference on Data Mining (ICDM). Los Alamitos, CA, USA: IEEE Computer Society, dec 2013, pp. 737–746. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/ICDM.2013.100
[191] R. Tang, A. Amarilli, P. Senellart, and S. Bressan, "Get a sample for a discount," in Database and Expert Systems Applications, H. Decker, L. Lhotská, S. Link, M. Spies, and R. R. Wagner, Eds. Cham: Springer International Publishing, 2014, pp. 20–34.
[192] R. Tang, H. Wu, Z. Bao, S. Bressan, and P. Valduriez, "The price is right," in Database and Expert Systems Applications, H. Decker, L. Lhotská, S. Link, J. Basl, and A. M. Tjoa, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 380–394.
[193] C. R. Taylor, "Consumer Privacy and the Market for Customer Information," RAND Journal of Economics, vol. 35, no. 4, pp. 631–650, Winter 2004. [Online].
Available: https://ideas.repec.org/a/rje/randje/v35y20044p631-650.html
[194] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, "Stealing machine learning models via prediction APIs," in Proceedings of the 25th USENIX Conference on Security Symposium, ser. SEC'16. USA: USENIX Association, 2016, pp. 601–618.
[195] P. Upadhyaya, M. Balazinska, and D. Suciu, "Price-optimal querying with data APIs," Proc. VLDB Endow., vol. 9, no. 14, pp. 1695–1706, Oct. 2016. [Online]. Available: https://doi.org/10.14778/3007328.300
[196] S. van de Sandt, S. Dallmeier-Tiessen, A. Lavasa, and V. Petras, "The definition of reuse," Data Science Journal, vol. 18, no. 1, p. 22, 2019.
[197] H. R. Varian, "Online ad auctions," American Economic Review, vol. 99, no. 2, pp. 430–34, May 2009. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/aer.99.2.430
[198] W. Vickrey, "Counterspeculation, auctions, and competitive sealed tenders," The Journal of Finance, vol. 16, no. 1, pp. 8–37, 1961. [Online]. Available: http://www.jstor.org/stable/2977633
[199] W. Vickrey, "Auctions and bidding games," in Recent Advances in Game Theory. Princeton, New Jersey: Princeton University Conference, 1962, pp. 15–27.
[200] H. von Stackelberg, Market Structure and Equilibrium. J. Springer,
[201] A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis, "Deep learning for computer vision: A brief review," Computational Intelligence and Neuroscience, vol. 2018, p. 7068349, Feb 2018. [Online]. Available: https://doi.org/10.1155/2018/7068349
[202] T. Wagner, A. Benlian, and T. Hess, "Converting freemium customers from free to premium–the role of the perceived premium fit in the case of music as a service," Electronic Markets, vol. 24, pp. 259–268, 12
[203] J. Waldfogel, "Copyright research in the digital age: Moving from piracy to the supply of new products," American Economic Review, vol. 102, no. 3, pp. 337–42, May 2012. [Online].
Available: http://www.aeaweb.org/articles?id=10.1257/aer.102.3.337
[204] R. Y. Wang and D. M. Strong, "Beyond accuracy: What data quality means to data consumers," Journal of Management Information Systems, vol. 12, no. 4, pp. 5–33, 1996. [Online]. Available: https://doi.org/10.1080/07421222.1996.11518099
[205] T. Wang, J. Rausch, C. Zhang, R. Jia, and D. Song, "A principled approach to data valuation for federated learning," ArXiv, vol. abs/2009.06192, 2020.
[206] Z. Wang, H. Zhu, Z. Dong, X. He, and S. Huang, "Less is better: Unweighted data subsampling via influence function," CoRR, vol. abs/1912.01321, 2019. [Online]. Available: http://arxiv.org/abs/1912.01321
[207] H. L. Williams, "Intellectual property rights and innovation: Evidence from the human genome," Journal of Political Economy, vol. 121, no. 1, pp. 1–27, 2013. [Online]. Available: https://doi.org/10.1086/669706
[208] C. Wu, R. Buyya, and K. Ramamohanarao, "Cloud pricing models: Taxonomy, survey, and interdisciplinary challenges," ACM Comput. Surv., vol. 52, no. 6, Oct. 2019. [Online]. Available: https://doi.org/10.1145/3342103
[209] S. Wu and R. Banker, "Best pricing strategy for information services," Journal of the Association of Information Systems, vol. 11, no. 6, pp. 339–366, Jan. 2010.
[210] S. Wu and P. Pavlou, "On the optimal fixed-up-to pricing for information services," Journal of the Association of Information Systems, vol. 20, no. 10, pp. 1447–1474, Jan. 2019.
[211] X. Wu, W. Zhang, and W. Dou, "Pricing as a service: Personalized pricing strategy in cloud computing," in 2012 IEEE 12th International Conference on Computer and Information Technology, Oct 2012, pp. 1119–1124.
[212] X. Wu, X. Ying, K. Liu, and L. Chen, "A survey of privacy-preservation of graphs and social networks," in Managing and Mining Graph Data, C. C. Aggarwal and H. Wang, Eds. Boston, MA: Springer US, 2010, pp. 421–453. [Online]. Available: https://doi.org/10.1007/978-1-4419-6045-0_14
[213] C. Xia and S.
Muthukrishnan, "Arbitrage-free pricing in user-based markets," in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, ser. AAMAS'18. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2018, pp. 327–335.
[214] H. Yang, "Targeted search and the long tail effect," RAND Journal of Economics, vol. 44, no. 4, pp. 733–756, December 2013.
[215] Y. Yang, X. Mao, J. Pei, and X. He, "Continuous influence maximization: What discounts should we offer to social network users?" in Proceedings of the 2016 International Conference on Management of Data, ser. SIGMOD'16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 727–741. [Online]. Available: https://doi.org/10.1145/2882903.2882961
[216] Y. Yang, Q. S. Lu, G. Tang, and J. Pei, "The Impact of Market Competition on Search Advertising," Journal of Interactive Marketing, vol. 30, no. C, pp. 46–55, 2015. [Online]. Available: https://ideas.repec.org/a/eee/joinma/v30y2015icp46-55.html
[217] J. Yoon, S. Arik, and T. Pfister, "Data valuation using reinforcement learning," in Proceedings of the International Conference on Machine Learning 1 pre-proceedings (ICML 2020), 2020.
[218] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75,
[219] H. Yu and M. Zhang, "Data pricing strategy based on data quality," Computers & Industrial Engineering, vol. 112, pp. 1–10, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0360835217303509
[220] M. Zhang and F. Beltran, "A survey of data pricing methods," SSRN, April 2020. [Online]. Available: https://ssrn.com/abstract=3609120 or http://dx.doi.org/10.2139/ssrn.3609120
[221] X. M. Zhang and F. Zhu, "Group size and incentives to contribute: A natural experiment at Chinese Wikipedia," American Economic Review, vol. 101, no. 4, pp.
1601–15, June 2011. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/aer.101.4.1601
[222] J. Zhao, G. Qiu, Z. Guan, W. Zhao, and X. He, "Deep reinforcement learning for sponsored search real-time bidding," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD'18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 1021–1030. [Online]. Available: https://doi.org/10.1145/3219819.3219918
[223] Z. Zheng, Y. Peng, F. Wu, S. Tang, and G. Chen, "An online pricing mechanism for mobile crowdsensing data markets," in Proceedings of the 18th ACM International Symposium on Mobile Ad Hoc Networking and Computing, ser. Mobihoc'17. New York, NY, USA: Association for Computing Machinery, 2017. [Online]. Available: https://doi.org/10.1145/3084041.3084044
[224] B. Zhou, J. Pei, and W. Luk, "A brief survey on anonymization techniques for privacy preserving publishing of social network data," SIGKDD Explor. Newsl., vol. 10, no. 2, pp. 12–22, Dec. 2008. [Online]. Available: https://doi.org/10.1145/1540276.1540279
[225] Y. Zhou, U. Porwal, C. Zhang, H. Ngo, X. Nguyen, C. Ré, and V. Govindaraju, "Parallel feature selection inspired by group testing," in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS'14. Cambridge, MA, USA: MIT Press, 2014, pp. 3554–3562.

Computing Research Repository arXiv (Cornell University)

A Survey on Data Pricing: from Economics to Data Science

Computing Research Repository, Volume 2021 (2009) – Sep 9, 2020

ISSN
1041-4347
eISSN
ARCH-3344
DOI
10.1109/TKDE.2020.3045927

Abstract

Data are invaluable. How can we assess the value of data objectively, systematically and quantitatively? Pricing data, or information goods in general, has been studied and practiced in dispersed areas and principles, such as economics, marketing, electronic commerce, data management, data mining and machine learning. In this article, we present a unified, interdisciplinary and comprehensive overview of this important direction. We examine various motivations behind data pricing, understand the economics of data pricing and review the development and evolution of pricing models according to a series of fundamental principles. We discuss both digital products and data products. We also consider a series of challenges and directions for future work.

arXiv:2009.04462v2 [econ.TH] 27 Nov 2020

1 Introduction

In this digital economics era, data are well recognized as an essential resource for work and life. Many products and services are delivered purely in digital forms. Many big data applications are built on the second use or reuse of data [196], that is, the same data are customized and reused by many applications for different purposes. The extensive sharing and reuse of data have profound implications for the economy. For example, digital maps are often produced for traffic and directions as the immediate usage. However, Nagaraj [153] finds that mining activities benefited strongly from open maps or maps sponsored by governments, particularly for smaller firms with fewer resources. Universal availability of data often helps minority parties and emerging initiatives.

In business and economic activities where data are shared, exchanged and reused, it is essential to measure the value of data properly. While there exist many possible ways to appreciate and represent the value of data, a general approach that can scale to massive applications and be acceptable to many parties is to set a price at which data can be sold or purchased, that is, data pricing.
The importance of pricing in business is well recognized in financial modeling [120], as price is one of the four Ps of the marketing mix: product, price, place and promotion [120].

Pricing data is far from trivial. Data have many different aspects. Consequently, the term "price of data" may carry different meanings and refer to different properties of data. To illustrate the complexity, let us quickly consider the following three scenarios involving price information related to data.

• Data transmission. Imagine the scenario where a mobile service provider offers a smart phone user the price of its data package. Here, the price is quoted for the data transmission service and is decided by several factors, such as the amount of data the user wants to transmit in a month, the location (roaming or not, for example), and the transmission speed. The price does not include and is independent of the content, that is, what the data are about, such as data quality, and how the data are collected, stored or processed.

• Digital products. Imagine that a person wants to watch a movie at home. This is a purchase of data, since the movie is sent to the customer's home as a stream of bits. The price here is typically related to the content, but is independent of the data transmission service, that is, how the data are transmitted to the user's home.

• Data products. Many logistics companies want to pay for weather information to support their business operations. While historical data are relevant, more often than not those companies want to subscribe to weather forecasting information instead. Some companies may want weather predictions at a higher granularity while some may want detailed predictions at specific locations. Moreover, some may want long term predictions while some others may want short term projections. Here, prediction services are sold as data products.
The above three cases just elaborate some representative scenarios where data prices are used, and are by no means exhaustive. To appreciate data pricing, including ideas, principles and methods, we have to take an interdisciplinary approach from multiple fields, economics and data science being the two most prominent. Indeed, the studies and practice of data pricing started as early as the dawn of digital economics, and are highly diversified and rich in innovative thinking.

In this article, we try to present a comprehensive survey on data pricing, an emerging research and practice area that plays an increasingly important role in the current big data and AI economics era. Our survey is closely related to the current strong rise of data science. To a large extent, data pricing is an overdue pillar in data science research and practice.

Data and information as goods discussed in this article are those that are distributed purely in digital form. We focus on the two categories of most interest: pricing digital products and pricing data products, demonstrated by the last two aforementioned scenarios, respectively. In this article, digital products refer to intangible goods that can be consumed through electronics, such as e-books, downloadable music, online ads, and internet coupons. Many digital products have physical counterparts in one way or another, though this is not strictly necessary. Data products refer to data sets as products and information services derived from data sets. We build the linkage between these two categories by pointing out that many ideas and methods for pricing digital products can be generalized and applied to pricing data products. In some scenarios, the boundary between digital products and data products is also blurry. Hereafter, we use the term information goods to refer to both digital products and data products.
1.1 Related Surveys

The research into data pricing happens simultaneously in multiple domains, including but not limited to economics, marketing, e-commerce, databases and data management, operational research, management science, machine learning and AI. However, to the best of our knowledge, there exists very limited effort to provide an interdisciplinary survey of the related work. This article presents our endeavor to produce a comprehensive picture.

There are some previous surveys related to data pricing. For example, Liang et al. [136] survey the life cycle of big data, and review 11 data pricing models. They also discuss data trading and protection. Fricker and Maksimov [75] report a literature survey over 18 research articles regarding several research questions, including maturity of the pricing models. Very recently, Zhang and Beltrán [220] review the state-of-the-art data pricing methods. They categorize data pricing methods according to two important data properties, granularity and privacy. This article covers a substantially broader scope than those surveys [75, 136, 220]. We connect economics, digital product pricing and data product pricing. We also discuss a series of desirable properties in data pricing, including arbitrage-freeness, revenue maximization, fairness, truthfulness, and privacy preservation, and review the techniques achieving those properties.

Data pricing is related to cloud pricing, since a lot of data for pricing and trading are hosted on the cloud. Wu et al. [208] present a comprehensive survey on cloud pricing models. They systematically categorize three fundamental pricing strategies, namely value-based pricing, cost-based pricing and market-based pricing. Then, they further categorize nine pricing tactical objects. Specifically, value-based pricing is demand driven and consists of customer value-based pricing, experience-based pricing, and service-based pricing.
Cost-based pricing is supply driven and consists of expenditure-based pricing, resource-based pricing and utility-based pricing. Market-based pricing is an equilibrium of supply and demand and consists of free and pay later pricing, retail-based pricing and auction and online pricing. They cover in total 60 pricing models. While data and cloud are highly related, data pricing and cloud pricing are fundamentally different. Data pricing is selling data, while cloud pricing is selling cloud resources (e.g., storage and computation), including physical resources, virtual resources and stateless resources.

In addition, Sen et al. [181] survey the major broadband pricing proposals, including the realizations in various consumer data plans around the world. Murthy et al. [150] list different pricing models and pricing schemes used by some popular IaaS (infrastructure-as-a-service) providers. Wu et al. [211] propose pricing as a service, which is essentially a personalized pricing service for IaaS. Aazam and Huh [1] propose broker as a service, which matches cloud services among cloud service providers and users. The key idea is to predict resource demands and thus derive prices.

As data are often hosted online, one interesting question is the fair sharing of the cost among data owners, data users and brokers. This is related to data pricing, because the costs of data hosting and processing have to be recovered from data pricing. Kantere et al. [116] study the fair allocation of costs in query services. They develop a stochastic model, which predicts the extent of cost amortization in time and number of services based on query traffic statistics. The model can be implemented on top of a cloud DBMS. Al-Kiswany et al. [12] provide a cost assessment tool to evaluate the cost of a desired data sharing. One useful feature of the tool is that a user can explore the cost space of alternative configurations using various factors, such as quality, staleness, and accuracy.
The technique is based on what-if analysis.

1.2 Structure of This Survey

We take a multi-disciplinary approach in this survey. The rest of the article is organized as follows.

In Section 2, we start from economics and focus on two aspects. First, we discuss cost reduction in information goods that contributes to their prices and has impact on economics. Then, we discuss the differences between digital products and data products.

In Section 3, we discuss the fundamental principles of data pricing. We first present versioning as a general framework for pricing information goods. Then, we identify several desirable properties in data pricing, including truthfulness, fairness, revenue-maximization, arbitrage-freeness, privacy preservation and computational efficiency.

In Section 4, we discuss pricing digital products. We first review the three major streams of revenues for digital products. Then, we revisit the bundling and subscription planning pricing models. Last, we consider auctions, which are widely used in pricing digital products.

In Section 5, we discuss pricing data products. We first overview the structures, players, and ways to produce data products in data marketplaces. Then, we examine several important areas in pricing data products, including arbitrage-free pricing, revenue maximization pricing, fair and truthful pricing, and privacy preservation in pricing. We also discuss dynamic data pricing, online pricing, and pricing in federated and collaborative learning.

Last, in Section 6, we discuss challenges and future directions.

2 Economics of Data Pricing

In general, pricing is the practice by which a business sets a price at which a product or a service can be sold. Pricing is often part of the marketing plan of a business. To set prices, a business often considers a series of objectives, such as profitability, fitness in marketplace, market positioning, price consistency across categories and products, and meeting or preventing competition.
Some major pricing strategies in the literature [38, 58, 108, 155, 159] include operation-oriented pricing, revenue-oriented pricing, customer-oriented pricing, value-oriented pricing, and relationship-oriented pricing. There is a rich body of studies in economics and marketing research on pricing tactics, which are far beyond the scope and capacity of this survey.

In this section, to understand the economic factors specific to data pricing, we examine the cost reduction in information goods. Then, we inspect the differences between digital products and data as products.

2.1 Cost Reduction in Information Goods

"Technology changes. Economic laws do not." [182] The production, distribution, and consumption of information goods, compared to those of physical products in the long history of human economies, are distinguished by significant cost reductions in five aspects, namely search costs, production costs, replication costs, transportation costs, and tracking and verification costs. Essentially, digital and data economics investigates how standard economic models adjust when those costs are reduced dramatically. Goldfarb and Tucker [93] present a thorough discussion, whose framework is largely followed here.

2.1.1 Search Costs

"Search costs are the costs of looking for information" [182], which are incurred in any information collection activities. Information goods allow more effective and efficient online search. The consequent low search costs facilitate users' discovering digital products and data sets, as well as comparing prices of similar products and services. For example, Brynjolfsson and Smith [40] show that online prices of books and CDs are clearly lower than offline prices, though price dispersion does not shrink accordingly.

Low search costs facilitate the sales of rare and long tail products [15, 214]. Thus, more variety is often observed in information goods and services. The degree of variety may be heavily impacted by recommender systems.
Specific to consumption of media, one of the major categories of digital products, Gentzkow and Shapiro [82] show that online media consumption is more diverse than offline. At the same time, customers may tend to consume more content that aligns more or less with their viewpoints, which is called the "echo chamber" effect [188].

Low search costs give strong rise to the prevalent platform businesses, which provide extensive matching services to customers and improve trade efficiency [115]. Interoperability, compatibility and standards are strategic tools for both building platforms and running platform businesses [99].

2.1.2 Production Costs

Producing digital products, such as online courses, eBooks, software, graphics and digital arts, and photography, is very different from manufacturing physical products, like bread, shoes, and jackets. Moreover, collecting and processing massive data so that parts of data can be sold and can meet customers' needs is also different from traditional production. A wide spectrum of production costs in traditional products are substantially reduced in information goods.

First, some essential major costs in traditional production, such as materials, semi-finished products and their transportation, are dramatically reduced in producing information goods. In many cases, the costs of obtaining, producing and transporting raw materials and physical semi-finished products can be reduced to very low or can even approach zero in making information goods. Second, a substantial cost of a traditional physical product often belongs to the product itself and cannot be further reduced through sharing. The unit costs of information goods can approach zero through sharing as long as there are sufficient reuses and sales volume. Last, smart manufacturing and customer-to-manufacturing can reduce the supply chain costs in traditional physical production [88, 187]. Information goods can often reduce the costs of customization to the extreme.
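To make the second point above concrete, the following is a minimal sketch that is not taken from the survey. It assumes a simple cost model in which a one-time fixed production cost is spread evenly over all units sold and each additional unit adds a constant marginal cost. The function name `unit_cost` and all numbers are hypothetical and chosen only for illustration.

```python
def unit_cost(fixed_cost: float, marginal_cost: float, units_sold: int) -> float:
    """Average cost per unit under the assumed model:
    a one-time fixed cost shared by all units, plus a constant per-unit cost."""
    if units_sold <= 0:
        raise ValueError("units_sold must be positive")
    return fixed_cost / units_sold + marginal_cost


if __name__ == "__main__":
    # Hypothetical numbers: both goods cost 50,000 to develop once.
    # The physical product also costs 20 per unit to make; the information
    # good is assumed to cost nothing extra per copy.
    for n in (1, 100, 10_000, 1_000_000):
        physical = unit_cost(fixed_cost=50_000.0, marginal_cost=20.0, units_sold=n)
        digital = unit_cost(fixed_cost=50_000.0, marginal_cost=0.0, units_sold=n)
        print(f"{n:>9} units: physical {physical:>10.2f}, information good {digital:>10.2f}")
```

Under this assumed model, the unit cost of the information good keeps falling toward zero as reuse and sales volume grow, while the unit cost of the physical product can never drop below its per-unit cost of 20.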
The substantial reduction in production cost in materials, semi-finished products, customization and sharing gives rise to a series of innovative business models, such as economics of sharing, pay-as-you-go and query-based data consumption. This also encourages innovation and long tail products that address diverse and smaller groups of potential customers.

2.1.3 Replication Costs

One distinct feature of information goods versus traditional products is that information goods are non-rival. That is, one customer consuming an information good does not reduce the amount or quality of the product available to other customers. The zero marginal costs and the non-rival property of information goods empower innovative opportunities and bring in new challenges.

In order to structure pricing of a large variety of non-rival information goods with zero marginal costs, bundling is often used [182], that is, multiple products are sold together at a single price. Since a large number of information goods can be bundled together without a substantial increase in cost, economically it may be optimal to bundle thousands of digital products together to meet diverse and independent customer preferences [10, 25, 26].

Due to the zero marginal costs and the non-rivalrous property, many information goods are made publicly available, such as Wikipedia (https://www.wikipedia.org) and open source software [131]. People contribute to open source or publicly available digital products and data to demonstrate their professional skills to potential employers. Companies support those products to complement their sales on other products.

The zero marginal costs and non-rivalrous property pose challenges to copyright policies and enforcement. Waldfogel [203] shows that low replication costs, though they may reduce revenue, help supply and demand, and thus boost quality.
Williams [207] shows that the protection of intellectual property indeed has a negative impact on follow-on innovation in gene sequencing.

At the same time, there is evidence that government-mandated "open data" may lead to data leakages and privacy breaches that affect citizens' offline welfare [5]. On the negative side, the zero marginal costs or non-rivalrous nature also ease the way for spamming [174] and online crime [149].

2.1.4 Transportation Costs

Thanks to the Internet, the costs of transporting information goods approach zero. This may imply, in many scenarios, that local communities may not affect adoption and consumption of information goods, often known as the effect of a flat world [76]. Interestingly, this is not true all the time, as some studies demonstrate that tastes may still be local in music [73] and content consumption [81].

While the physical transportation costs may approach zero, regulation may put sophisticated constraints on locations. For example, when Wikipedia was blocked in China in October 2005, more contributors from outside China were motivated to contribute [221]. Copyright policies may also affect the availability and consumption of information goods in different regions, such as news media [46], and thus may be reflected by price.

2.1.5 Tracking and Verification Costs

The capability of tracking users with relatively low costs is an important feature of information goods [182]. The low tracking costs give rise to extensive personalized markets and possible price discrimination [77, 165]. Behavioral price discrimination is an immediate type, which sets prices according to customers' previous behavior. Correspondingly, if customers are well aware of the benefits of tracking information to a monopoly, they may likely choose to be privacy sensitive and hold the information [193]. Another type of price discrimination is versioning [183], which sells information at different prices to different customers using different versions.
Versioning is discussed in detail in Section 3.1.

The advantage of low tracking costs also leads to the blooming business of personalized advertising [69]. A challenge for a company, however, is how to set prices for many advertisements that may be shown to a massive number of customers. The same advertisement may have different prices for different customers. Auctions are often used to address the challenge [19], and can even be used to discover prices for information goods [164]. At the same time, auctions may be less useful when online marketplaces become mature [66].

The low tracking costs and their consequences, such as price discrimination, lead to serious concerns about privacy [4]. As discussed later in this article, whether privacy should be treated as a good and how privacy is priced are investigated [74, 163]. Moreover, privacy regulation and its impact on welfare are important topics, though they are far beyond the scope of this survey.

As a byproduct of low tracking costs, the costs of verifying the identity and reputation of producers and users of information goods are dramatically lower than those in traditional scenarios. The low verification costs facilitate online transactions extensively and lower the costs of trust dramatically.

2.2 Differences between Digital Products and Data Products

This survey focuses on pricing two categories of information goods, digital products and data products. While digital products and data products share a series of common ideas and methods in pricing, they are also essentially different from each other in at least four aspects.

First, the units of digital products are often well defined and fixed. For example, individual movies and pieces of music are often priced and sold as a whole. The consumption of one digital product is often independent of the consumption of others. For example, it would be rare that two digital books have to be read at the same time.
In contrast, although the basic unit in a data set can be at a very small granularity, such as a record in a relational table, the units for pricing and consumption often vary from one customer to another. For example, a customer may be interested in the sales data of female customers in a province, while another customer may be interested in the sales data on electronics during the Christmas season. Correspondingly, one individual unit of data at the lowest granularity may not be valuable as a data product. For example, one customer purchase record, after proper anonymization, may not be useful for a retailer. Instead, more often than not, many basic units of data are combined, aggregated and consumed together.

Second, different from digital products, data sets as data products have very strong and flexible aggregateability. Customers often aggregate data along various dimensions. The aggregateability, on the one hand, enables many opportunities for innovation in data business, and, on the other hand, poses many technical and business challenges, such as ensuring arbitrage-freeness, as discussed later in this article. In many business scenarios, digital products like movies and music are bundled. However, bundles are not aggregates. Customers still get digital products and consume them individually. Bundling takes advantage of the low replication costs of digital products to boost sales and meet customers' diverse demands [10, 25, 26].

Third, the means of consuming digital products and data products are also very different. Typically, digital products are consumed directly by people, such as movies watched by people and music enjoyed by fans. Data sets are more often than not consumed by computers. They are, for example, analyzed, summarized or used to train machine learning models. The outputs of models are used to automate operations or support human decision making.
Last, digital products and data products are dramatically different in the ways they can be reused and resold. Digital products are easily consumed by others, that is, reused, or even resold to others as a whole. Data sets, in contrast, can be reused by others in different ways, such as aggregation along different dimensions and analysis for different purposes. Moreover, data can be easily processed and transformed so that they can be resold in a hard-to-detect manner.

The above differences between digital products and data products lead to different considerations in pricing principles and methods, which are discussed later. Before we leave this topic, we want to point out that the same information can be regarded as digital products in some situations and as data products in others. For example, social media content like tweets and customer reviews can be regarded as digital products when a customer reads them online. At the same time, they can be collected and processed in batch by analytic tools to detect events, discover customer profiles and feed recommender systems. In this situation, a systematic collection of social media content can be priced and sold as a data product.

2.3 Summary

In summary, information goods, including digital products and data products, distinguish themselves from traditional physical products by significant cost reductions, particularly in search costs, production costs, replication costs, transportation costs, and tracking and verification costs. The significant reduction of costs has a profound impact on pricing information goods, which is discussed in the later sections of this article. There are several major differences between digital products and data products, including consumption units, aggregateability, means of consumption, and reusing and reselling.
3 Fundamental Principles of Data Pricing

In this section, we first review the idea of versioning [182, 183], which is a fundamental framework for designing information goods and pricing them. Then, we review several important properties desired in pricing models for digital and data products.

3.1 Versioning

As the replication costs of information goods are very low, even approaching zero in many cases, the price of an information good tends to be very low in marketplaces, too. The potential for very low prices, on the one hand, makes information goods economically appealing, and, on the other hand, may also make information goods economically dangerous, as competitors may easily enter the market [182, 183]. This dilemma renders many traditional pricing strategies ineffective for information goods.

To tackle the dilemma, the core idea is "linking price to value", that is, setting the price to reflect the value that a customer places on the information. Specifically, the versioning strategy [183] makes different versions to appeal to different types of customers. For example, for a piece of software, different versions have different subsets of features. Different versions of a movie may provide different image resolutions and sound effects. Essentially, versioning divides customers into subgroups so that each subgroup may regard some features as highly valuable and some other features as of little value. A version corresponding to those demands can then be provided.

There are many different ways to produce different versions of information goods. For example, as information is often time sensitive, delay is often a good basis. In stock market information services, an expensive version may deliver real-time quotes while a basic version delivers the same information 20 minutes later.
In addition, versions may be defined by convenience (e.g., data can be accessed only by PDF file or by downloadable spreadsheet), comprehensiveness (e.g., the length of historical data available), manipulation (e.g., whether users can store, duplicate, print the information), community (e.g., availability of posting and reading discussion boards), annoyance (e.g., the option of no advertisements), the means of customer support (e.g., by website only or by talking to experts), and many other factors. Most versions of information goods are created by subtracting value from the most technologically advanced and complete version.

In many situations where customers may not realize the value of an information good unless they try it, even free versions may be provided. The rationale is that free versions give potential customers an opportunity to test the product. The objectives of offering free versions include building awareness, gaining follow-on sales, creating a customer network, attracting attention, and gaining competitive advantages.

The number of versions of an information good may be decided by two major considerations. First, the characteristics of the information to be sold are important. An information good that can be used in many different ways opens the door to many different versions. The second important factor is the value that different customers may place on it. The larger the variance, the more versions may be needed.

The versioning strategy has been investigated in pricing data products, for example, relational data sets and query results [27, 28]. Relational views provide a natural and flexible technical means to produce versions of an information source. A series of technical challenges are identified, such as arbitrage in pricing, fine-grained data pricing, pricing updates, integrated data and competing data sources, which are reviewed further in this article.
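To make the idea of versions as views more concrete, the following is a minimal sketch. It is not taken from the survey or from [27, 28]. It assumes a hypothetical stream of stock quotes and two hypothetical versions that differ only in delay and in the length of accessible history. The names `Version` and `view`, as well as all prices and time values, are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

Quote = Tuple[int, float]  # (minute of the trading day, quoted price)


@dataclass(frozen=True)
class Version:
    name: str
    price: float       # hypothetical subscription price of this version
    delay_min: int     # how stale the newest visible quote is, in minutes
    history_min: int   # how far back from "now" the customer may look


def view(quotes: List[Quote], now: int, version: Version) -> List[Quote]:
    """Return the quotes visible at time `now` to a customer of `version`."""
    newest = now - version.delay_min
    oldest = now - version.history_min
    return [(t, p) for (t, p) in quotes if oldest <= t <= newest]


# One quote every 10 minutes: t = 0, 10, ..., 60 (made-up data).
quotes = [(t, 100.0 + t) for t in range(0, 61, 10)]

premium = Version("premium", price=50.0, delay_min=0, history_min=60)
basic = Version("basic", price=5.0, delay_min=20, history_min=30)

premium_view = view(quotes, now=60, version=premium)
basic_view = view(quotes, now=60, version=basic)
```

In this sketch the basic version is obtained by subtracting value (timeliness and history) from the premium version, so every quote visible in `basic_view` is also visible in `premium_view`.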
3.2 Important Desiderata in Data Pricing

There are many different ways to design and implement pricing models for information goods. Most models, however, pursue a small number of common desiderata. How to implement those desiderata in pricing models is discussed in the later sections.

3.2.1 Truthfulness

For a market to be efficient, it is preferable that the market is truthful. A market is truthful if every buyer is selfish and only offers the price that maximizes the buyer's true utility value. In other words, in a truthful market, no buyer pays more than is sufficient to purchase a product. Here, different buyers may have different utility values for the same product.

Truthfulness can facilitate a wide spectrum of pricing mechanisms, such as many kinds of auctions [7]. Auctions of digital products are discussed in Section 4.3.

3.2.2 Revenue Maximization

Pricing models can optimize different objectives, such as lowest cost, highest profit, and largest sales. The objective of maximizing revenue is often of special interest in designing pricing strategies. The rationale is that, for a business to be successful in the long term, a more immediate and important requirement is to win over as many customers as possible.

For traditional physical products, it is often assumed that the marginal cost goes up after a certain number of units are manufactured. Thus, the profit is maximized if the output level is set so that the marginal revenue is equal to the marginal cost, and the revenue is maximized if the marginal revenue becomes zero. However, given that the replication costs of information goods are very low, revenue maximization and profit maximization for information products become quite different from those for physical products [7, 42].

3.2.3 Fairness

Essentially, a market is fair if each seller gets a fair share of the revenue in a coalition. In his seminal article [184], Shapley lays out the fundamental requirements of fairness in markets.
Suppose there are k sellers who cooperatively participate in a transaction that leads to a payment v. There are four basic requirements for being fair.

• Balance: the sum of the payments to all sellers should be equal to v. That is, the payment is fully distributed to all sellers.
• Symmetry: for a set of sellers $S$ and two additional sellers $s$ and $s'$ who are not in $S$, that is, $s, s' \notin S$, if $S \cup \{s\}$ and $S \cup \{s'\}$ produce the same payment, then $s$ and $s'$ should receive the same payment. That is, the same contribution to utility should be paid the same.
• Zero element: for a set of sellers $S$ and an additional seller $s \notin S$, if $S \cup \{s\}$ and $S$ produce the same payment, then $s$ should receive a payment of 0. That is, no contribution, no payment.
• Additivity: if the goods can be used for two tasks $T_1$ and $T_2$ with payments $v_1$ and $v_2$, respectively, then the payment to complete both tasks $T_1 + T_2$ is $v_1 + v_2$.

In the above celebrated Shapley fairness, the Shapley value is the unique allocation of payment that satisfies all the requirements:

\[ \psi(s) = \frac{1}{n} \sum_{S \subseteq D \setminus \{s\}} \frac{U(S \cup \{s\}) - U(S)}{\binom{n-1}{|S|}} \tag{1} \]

where $U(\cdot)$ is the utility function, $D$ is the complete set of sellers, $n = |D|$ is the number of sellers (denoted by k above), $S \subseteq D$ is a set of sellers, and $s$ is a seller. Equivalently, Equation 1 can also be written as

\[ \psi(s) = \frac{1}{n!} \sum_{\pi \in \Pi(D)} \left( U(P_s^{\pi} \cup \{s\}) - U(P_s^{\pi}) \right) \tag{2} \]

where $\pi \in \Pi(D)$ is a permutation of all sellers, and $P_s^{\pi}$ is the set of sellers preceding $s$ in $\pi$.

Agarwal et al. [7] observe that, as the replication costs of information goods are very low and the marginal costs of production are close to zero, a seller can produce more units of the same information good to obtain a larger Shapley value and thus a larger portion of the payment, which is unjustified in business. This is a challenge in designing fair marketplaces for information goods.

3.2.4 Arbitrage-free Pricing

Arbitrage is the activity of taking advantage of price differences between two or more markets or channels.
For example, consider a scenario where a user wants to purchase access to an article whose listed price is $35. Suppose that the journal publishing the article has a monthly subscription rate of $25. Then, the user can conduct arbitrage by subscribing to the journal for only one month and obtaining the article at a price lower than the listed price.

Arbitrage is often undesirable in pricing models. At the least, it should be possible to check whether a pricing model is arbitrage-free. However, arbitrage can sneak into pricing models that are not thoroughly designed. For example, suppose a data service provider sells query results with prices based on variance [133]: $5 for each query result with a variance of 10, and $100 for each query result with a variance of 1. Each answer is perturbed independently. A customer who wants to obtain an answer with a variance of 1 can purchase the variance-10 answer 10 times and compute the average. Due to the independent noise in perturbation, the average of 10 answers has variance 10/10 = 1 at a total cost of 10 × $5 = $50, and thus the customer saves $50 by arbitrage.

3.2.5 Privacy-preservation

Privacy is becoming a more and more serious concern about information goods. In general, privacy is the ability of an individual or a group to keep themselves or the information about themselves hidden from being identified or approached by other people. Privacy is highly related to information and information exchange, which are what information goods are about. As explained in Section 2.1.5, due to the low tracking costs of information goods, it is easier to collect private data about users [4]. Whether privacy should be treated as a good and how privacy is priced are investigated [74, 163].

It is highly desirable to preserve privacy in marketplaces of information goods. In general, transactions in a marketplace may disclose the privacy of various parties in many different ways.

First, the privacy of buyers is highly vulnerable.
Their identities, the location and time of purchases, the specific products purchased, the purchase prices and the total amount may reflect their privacy. It has been reported from time to time that e-commerce providers leak customer information by mistake, such as an incident reported recently.

Second, the privacy of information good providers may also be disclosed. For example, medical treatment information in hospitals is highly valuable for many businesses, such as pharmacy and medical equipment companies. Imagine that hospitals can collect and anonymize medical treatment data properly and provide the corresponding data products in marketplaces so that individual patients cannot be re-identified. Buyers, however, may be able to infer from the data the success rates of a specific treatment in a hospital, which may be regarded as the privacy of the hospital.

Last, transactions in marketplaces may also disclose the privacy of a third party involved. For example, an AI technology company may provide machine learning model building services to data product buyers. However, machine learning models may be stolen [194], and such models are regarded as the privacy of the AI technology company.

To protect privacy in marketplaces of information goods, various directions are being explored, such as hiding the information about what, when and how much a buyer purchases [11], building decentralized and trustworthy privacy-preserving data marketplaces [50, 107], investigating the tradeoff between payments and accuracy when privacy is a concern [160], and aggregating non-verifiable information from a privacy-sensitive population [86]. There are many studies on preserving privacy in information goods. We refer interested readers to the rich body of surveys [8, 35, 61, 72, 78, 114, 212, 224] and others.
We do not discuss further details about general privacy preservation techniques in this article, since they are far beyond the scope and capacity of this survey.

3.2.6 Computational Efficiency

As many information goods may be sold to a huge number of potential buyers, a pricing model has to match goods/sellers and buyers with an appropriate price. Computing prices efficiently with respect to a large number of goods and a large number of buyers presents technical challenges [28]. For example, one reasonable expectation is that a marketplace is polynomial, that is, the complexity of computing prices has to be polynomial with respect to the number of sellers, and cannot grow with respect to the number of goods/buyers when prices are updated [7]. When auctions are used in determining prices, good auction efficiency [92] is required, that is, the time needed to process bids has to be short.

Footnote to the leak of e-commerce customer information mentioned in Section 3.2.5: https://www.telegraph.co.uk/technology/2020/03/10/leak-millions-amazon-ebay-transactions-exposes-customer-addresses/

3.3 Summary

Versioning is a common mechanism in designing and pricing information goods, so that prices of different versions can be linked to the values placed by various customer groups. There are a series of important requirements on pricing information goods, including truthfulness, revenue maximization, fairness, arbitrage-free pricing, privacy preservation, and computational efficiency. Those requirements pose technical challenges to pricing models.

4 Pricing Digital Products

Although the focus of this article is on pricing data products, we provide a brief review of pricing digital products here, since some general ideas in pricing digital products can be borrowed and extended to data products. In some cases, the boundary between digital products and data products is even blurry.

We first discuss the three major streams of revenues for digital products. Then, we look at two major types of pricing models.
The first is bundling and subscription, and the second is auctions. These pricing models are widely adopted by digital product marketplaces.

4.1 Streams of Revenues

As discussed in Section 3.2.2, revenue maximization often serves as the basic objective in pricing mechanisms, including pricing digital products. Therefore, the understanding of pricing digital products can naturally start with an analysis of where the revenues of digital products may come from. Lambrecht et al. [127] summarize that there are three streams of revenues for digital products that are delivered online.

• Money. A provider can sell to customers content or, more broadly, services, such as movies and e-books.
• Information/privacy. Instead of charging customers directly, a provider can collect customer information by tracking (e.g., using cookies) and sell the information about customers to generate revenues.
• Time/attention. A provider can sell space in their digital products to advertisers to produce revenue.

Often, a firm has to design a revenue model for its digital products that combines more than one revenue stream. The three streams are not independent. Instead, they compete with each other, and thus a good tradeoff has to be settled [79]. On the one hand, in some situations, revenues from the money stream may be increased at the cost of those from the time/attention stream. For example, customers may pay for the content and avoid ads [168, 171], or convert from free versions to premium versions with fitting functions [202]. On the other hand, customers may be highly price sensitive for some digital products, and thus growth in the time/attention stream may be easier. For example, an online news site experiences a dramatic loss of customer visits after introducing a paywall [45]. Free samples may stimulate long-term sales [37]. A possible tradeoff between money and time/attention has to be carefully designed.
Typical approaches in revenue models of content and services [173] include rigid pricing (e.g., each movie is priced at a fixed price), designing pricing tiers (e.g., basic versus premium versions), setting up the duration of subscription plans (e.g., a 6-month promotion period with a very low subscription price) and designing freemium models. One important and unique feature in digital product consumption is micropayments, which means a customer can pay a very small amount that is typically impractical in traditional transactions using standard credit cards due to network service fees. Micropayments and subscriptions have different effects on consumer behavior [20].

As a concrete example of revenue models, consider pricing software products [130]. The major parameters of pricing models include formation of price, structure of payment flow, assessment base, price discrimination, price bundling and dynamic strategies. The formation of price considers price determination, that is, cost-based, value-based or competition-oriented, as well as the degree of interaction, unilateral versus interactive. In terms of payment flow, it may be a single payment, recurring payments or a combination. The assessment base of pricing may be usage-dependent (e.g., by transaction or time) or usage-independent (e.g., server types and GPU).

As the tracking costs of digital products are low, a firm can collect customer personal data and sell such data for revenue, that is, generate revenues from the information/privacy stream. Typically, personal data may include customers' identities, behavior patterns, preferences and needs. There are various ways to sell customer data, which are also discussed in Section 5 when data products and their marketplaces are discussed. For example [32, 36], a website can provide direct marketing companies with user activity information.
Moreover, websites can also collaborate with data management platforms (DMP, for advertising) [67] and produce revenues by helping businesses identify audience segments. For example, the information about how customers are connected in social networks can be used to design customized discounts in marketing campaigns [215]. Bergemann and Bonatti [33] develop a model of pricing customer-level information such that the data about each customer are sold individually and individual queries to the database are priced linearly. As new technologies of customer tracking become available, more pricing models may emerge.

We want to point out that selling customer data, though it serves the purpose of selling digital products, crosses the boundary between selling digital products and data products. We review some studies on setting prices for customer data and privacy information in the next section.

To produce revenues from the time/attention stream, many digital product producers and service providers embed advertisements in their products in one way or another, and obtain remarkable or even dominant advertising income. However, as John Wanamaker (1838-1922) wisely said, "Half the money I spend on advertising is wasted; the trouble is I don't know which half." It is well recognized that it is hard to accurately measure advertising effects [95, 132]. Advertisers customize ads for online display [190, 216]. One feasible way to improve advertising effectiveness is to combine user information and advertising opportunities. Retargeted advertising [128] is such an approach, which combines customer online and offline behavior data and makes firms focus on customers showing prior interest in the related products. For example, Athey et al. [21] consider customers with multiple homes and investigate the advertising strategies and effectiveness.
In summary, digital product and service suppliers produce revenues through three major streams: money, information/privacy and time/attention. Orthogonally, a firm can bundle its digital products and also design subscription plans that provide products and services in a specific period for a price, which is discussed next.

4.2 Bundling and Subscription Planning

Product bundling organizes products or services into bundles, such that a bundle of products or services is for sale as one combined product or service package. Product bundling is a common marketing practice, particularly in traditional industries such as telecommunication services, financial services, healthcare, and consumer electronics.

As discussed in Section 2.1.3, the low replication costs of information goods allow prevalent adoption of bundling in pricing digital products [182]. Designing product bundles is essentially a combinatorial optimization problem. The basic and static setting is that a customer wants to buy either one or multiple products at a time, which was investigated well before digital products became available [6]. A series of studies [18, 148, 169] develop pricing strategies with two products under different types of bundling. They share the basic assumption that demand for a bundle is elastic compared with demand for individual products. For example, Armstrong [18] studies the scenarios where products may be substituted or provided by separate sellers.

Bundling multiple products is often analyzed under the independent value distribution framework [152]. Consider the situation where there are n heterogeneous products for one buyer, and the objective is to maximize expected revenue. Assume that the value distributions on products are independent. That is, for each product $x_i$, the price that the buyer is willing to pay follows an arbitrary distribution $D_i$ over the range $[a_i, b_i]$, where $0 \le a_i \le b_i < \infty$, and the distributions $D_1, \ldots, D_n$ are independent of each other.
Further assume that the buyer is additive, that is, the buyer's value for a set of products is the sum of the buyer's values of the individual products in the set. Babaioff et al. [51] show that either selling each item separately or selling all items together as a grand bundle produces at least a constant fraction of the optimal revenue. This interesting and important result allows a simple yet effective bundling strategy: either pricing each product individually or pricing the grand bundle at the expected price. In practice, many platforms, such as Hulu and Amazon Prime Video, offer grand bundle subscriptions for their products.

More recently, Haghpanah and Hartline [97, 98] show that the grand bundle is optimal if more price-sensitive buyers consider the products more complementary. When multiple buyers are considered, whose preferences are unknown, Balcan et al. [30] give a simple pricing model that achieves a surprisingly strong guarantee: in the case of unlimited supplies, a random single price achieves expected revenue within a logarithmic factor for customers with general valuation functions. This result allows great convenience in practice, that is, setting a uniform price for all products. It is easier to price a bundle of a larger number of products, since the law of large numbers allows customers' valuations to be predicted more accurately for a larger bundle of products [2].

Orthogonal to bundling, subscription prices the interactions between customers and a platform over a period of time. Subscribing customers are in general heterogeneous in both usage rate and value of products. On the one hand, customers with higher usage rates may prefer subscribing to larger subscription sets. On the other hand, in order to maximize revenue, the platform wants customers with lower usage rates to subscribe, and customers with higher usage rates to rent. Moreover, different users may have different values for a product.
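The tension between customers and the platform described above can be illustrated with a toy calculation. The sketch below is a deliberately simplified assumption and not the model of [13]: a customer rents a known number of times per period, values each use above the rental price, and simply minimizes cost. The function names, the rental price and the subscription fee are hypothetical.

```python
def customer_subscribes(usage: int, rental_price: float, fee: float) -> bool:
    """A cost-minimizing customer subscribes only if renting would cost more."""
    return usage * rental_price > fee


def platform_revenue(usage: int, rental_price: float, fee: float,
                     subscribes: bool) -> float:
    """Revenue the platform collects from one customer in one period."""
    return fee if subscribes else usage * rental_price


RENTAL_PRICE = 4.0  # hypothetical price per rental
FEE = 10.0          # hypothetical subscription fee per period

for usage in (1, 5):  # a light user and a heavy user
    chosen = customer_subscribes(usage, RENTAL_PRICE, FEE)
    actual = platform_revenue(usage, RENTAL_PRICE, FEE, chosen)
    opposite = platform_revenue(usage, RENTAL_PRICE, FEE, not chosen)
    print(f"usage={usage}: subscribes={chosen}, "
          f"revenue={actual}, revenue under the opposite choice={opposite}")
```

Under these assumptions the light user rents, although the platform would earn more from a subscription, and the heavy user subscribes, although the platform would earn more from rentals. Designing fees and rental prices jointly is what makes subscription planning nontrivial.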
Many platforms offer subscription and renting at the same time. For a platform, the subscription model is to select a subscription fee and the period for each set of products and also set the rental price for each product [13]. Alaei et al. [13] follow the model of grand bundle and consider grand subscription, a single rental price for the set that includes all products. They establish the sufficient and necessary condition for the optimality of grand subscription. They also show that subscription fees can be set proportional to the cardinality of a set of products and can achieve $\frac{1}{4 \log 2m + \log n}$ of the optimal revenue for n types of customers and m types of products. This approximation is tight in the sense that it cannot be improved beyond the order of $\frac{1}{\log n}$ in polynomial time.

After all, modeling bundling and subscriptions is computationally challenging due to the combinatorial nature. Dynamic pricing of bundles and subscriptions, such as promotions and coupons, has rarely been studied.

4.3 Auctions

Auctions have a long history dating back to the Babylonian and Roman empires [185]. There are many excellent surveys on auctions (e.g., [24, 68, 118, 145]). A comprehensive review of auctions is far beyond the scope and capacity of this article. In this article, we instead focus only on the important role of auctions as a pricing mechanism for digital products.

4.3.1 Basics about Auctions

There are four basic types of auctions widely used.

• In the ascending-bid auction (also known as the English auction), the price is raised successively until only one bidder remains, who wins the object at the final price.
• The descending auction (also known as the Dutch auction) works the other way by starting at a very high price and lowering the price continuously, until the first bidder calls out and accepts the current price.
• In the first-price sealed-bid auction, every bidder submits a bid without knowing the others' bids.
  The one making the highest bid wins and pays the named price.
• The second-price sealed-bid auction (also known as the Vickrey auction [198]) works in the same way as the first-price sealed-bid auction does, except that the winner pays only the second highest bid.

There are two basic models of the value information in auctions. The private-value model assumes that every bidder has an independent value on the object for sale. The value is also private to the bidder only. The pure common-value model assumes that the actual value of the object for sale is the same for all bidders, but bidders have different private information about that actual value. Every bidder adjusts her/his estimate of the actual value by learning other bidders' signals. There are also models considering both values private to individual bidders and values common to all bidders.

One fundamental principle in auction theory is the revenue equivalence theorem [152, 177, 198, 199]. It essentially states that, for a set of risk-neutral bidders with independent private valuations of an object drawn from a common cumulative distribution that is strictly increasing and atomless on $[v_{\min}, v_{\max}]$, any auction mechanism yields the same expected revenue, and thus any bidder with valuation $v$ makes the same expected payment, if (1) the object is allocated to the bidder with the highest valuation; and (2) any bidder with valuation $v_{\min}$ has an expected utility of 0. Based on the revenue equivalence theorem, the four basic types of auctions lead to the same expected payment by the winner and the same expected revenue.

While most studies in auction theory make some simple assumptions about independence of customer valuations, empirical studies [106] demonstrate that, in practice, the wrong assumption of valuation independence causes inefficient auctions in e-commerce.

4.3.2 Sponsored Search Auctions

Online ad and sponsored search auctions [126, 172, 197] are one important application of auctions in pricing digital products.
Sponsored search [110] is the business model where content providers pay search engines for traffic to their websites. In sponsored search, advertisers and, more generally, content providers bid for keywords in search engines, and search engines decide which ad to display in which position to answer a query from a user. GoTo.com created the first sponsored search auction [110].

Different pricing models can be used in sponsored search auctions, such as pay-per-mille/pay-per-impression (PPM, that is, the cost of 1,000 advertisement impressions), pay-per-click (PPC), and pay-per-action (PPA). In the early days of sponsored search, a generalized first price auction is used. Each advertiser bids on multiple keywords, and can set a bidding price for each keyword. When a user query, which is a keyword, is answered, the ads with the top $k$ bidding prices on the keyword are displayed. If an ad is clicked by the user, the corresponding advertiser pays the bidding price. The first price auction mechanism is unstable, costs advertisers time and reduces search engine profits [64]. Later, Google generalizes the second price auction mechanism [65], and enhances the ranking of bids by additional information, such as the ad's click-through-rate (CTR), keyword relevance, and the ad's landing-page/site quality.

There are many in-depth analyses about sponsored search auction mechanisms (e.g., [172]). For example, some studies analyze auction mechanisms based on assumptions about rationality, budget constraints and CTR distributions. Some other studies look at practical sponsored search systems and discuss auction mechanisms when the standard assumptions do not hold. Another group of studies, such as [22, 43, 53, 80], conduct empirical studies to understand bidding behavior and statics. Last and latest, deep learning approaches are used to develop auction strategies in sponsored search [175, 222].
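To make the contrast between the generalized first price and second price mechanisms concrete, the following is a minimal sketch of a quality-weighted generalized second price auction. It is an illustration under stated assumptions rather than the mechanism of any particular search engine: ads are ranked by bid times a single quality score (a stand-in for CTR and the other signals mentioned above), each winner pays per click the smallest bid that would still keep its position, and the reserve price is zero. The function name `gsp_with_quality` and the example numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def gsp_with_quality(bids: Dict[str, float],
                     quality: Dict[str, float],
                     slots: int) -> List[Tuple[str, float]]:
    """Return (ad, price per click) for the winning ads, best slot first.

    Assumed rule: rank by bid * quality; the ad in a slot pays the lowest
    bid that would still outrank the ad right below it.
    Quality scores are assumed to be strictly positive.
    """
    ranked = sorted(bids, key=lambda ad: bids[ad] * quality[ad], reverse=True)
    outcome = []
    for pos, ad in enumerate(ranked[:slots]):
        if pos + 1 < len(ranked):
            runner_up = ranked[pos + 1]
            price = bids[runner_up] * quality[runner_up] / quality[ad]
        else:
            price = 0.0  # nobody ranked below: assumed zero reserve price
        outcome.append((ad, price))
    return outcome


bids = {"a": 4.0, "b": 3.0, "c": 1.0}
quality = {"a": 0.1, "b": 0.2, "c": 0.1}
for ad, price in gsp_with_quality(bids, quality, slots=2):
    print(ad, "pays", round(price, 2), "per click; its bid was", bids[ad])
```

In this example ad "b" takes the first slot although "a" bids more, because ranking uses the quality-weighted bid, and neither winner pays its own bid, which removes the incentive to keep shaving bids that makes the first price mechanism unstable.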
4.3.3 Auctions on Digital Products with Unlimited Supplies

One unique feature of digital products is that the replication costs are very low and thus may have almost unlimited supply. Products of unlimited supplies lead to new challenges and opportunities to auction mechanism design. For example, the second price auction can be straightforwardly generalized for $k$ identical products: the top $k$ highest bidders win and each pays the $(k+1)$-th bidding price. However, when there are unlimited identical products, the $(k+1)$-th bidding price approaches 0. The lack of competition due to excessive supplies prevents bidders from offering any high prices. In other words, the challenge is how to ensure the bids are truthful, that is, reflecting the bidders' true valuation of the digital products.

Denote by $B$ the set of bidders, and by $b_1, b_2, \ldots$ the bidding prices in descending order, that is, $b_i \geq b_{i+1} \geq 0$ for any $i > 0$. Suppose the generalized second price auction mechanism is used. That is, if $k$ bids are taken, those winning bidders each pays the cost $b_{k+1}$. The auction objective is to maximize $k \cdot b_{k+1}$. An auction is competitive if it yields revenue within a constant factor of the optimal fixed pricing. It is tricky that, when there is unlimited supply, the Vickrey auction is not competitive if the seller chooses the number of products to sell before knowing the bids, and is not truthful if the seller chooses after knowing the bids [92].

Goldberg et al. [92] propose the first competitive auction for digital goods with unlimited supplies. The major idea is the smart framework of random sampling auction. An auction is bid-independent if bidder $i$'s bid value should only determine whether the bidder wins the auction, but not the price. We select a sample $B'$ of $B$ at random, independent from the bid values. We use the bids in $B'$ to compute the optimal bid threshold $f_{B'}$ that maximizes the revenue in $B'$, and every bidder in $B - B'$ whose bid value is over $f_{B'}$ wins.
Symmetrically, we use the bids in $B - B'$ to compute the optimal bid threshold $f_{B-B'}$ that maximizes the revenue in $B - B'$, and every bidder in $B'$ whose bid value is higher than $f_{B-B'}$ wins. In general, $f_{B'} = f_{B-B'}$ does not necessarily hold. Random sampling auctions are competitive, no matter the single-price version or the multi-price version. Indeed, random sampling auctions are 15-competitive in the worst case [70] and 4-competitive for a large class of instances where there are at least 6 bids that are as good as the optimal sale price [14].

There are a series of improvements on random sampling auctions. For example, Hartline and McGrew [102] further improve the competitiveness. Goldberg and Hartline [89] extend the scope from a single digital product with unlimited supply to multiple products with unlimited supplies. Given a set of bids, they show that the bidder-optimal product assignment given the bids and the optimal sale prices can be determined by solving the integer programming problem as follows.

$$
\begin{aligned}
\max\ & \sum_{j}\sum_{i} x_{ij}\, r_j & & \\
\text{subject to}\ & r = 0 & & \\
& x_{ij} \leq 1 & & 1 \leq i \leq n \\
& x_{ij} \geq 0 & & 1 \leq i \leq n,\ 1 \leq j \leq m \\
& p_i + r_j \geq a_{ij} & & 1 \leq i \leq n,\ 1 \leq j \leq m \\
& \sum_{i} p_i = \sum_{j}\sum_{i} x_{ij}\,(a_{ij} - r_j) & &
\end{aligned}
\qquad (3)
$$

where $x_{ij}$ is the assignment of product $j$ to bidder $i$, $r_j$ is the optimal price for product $j$, $p_i$ is the profit of bidder $i$, and $a_{ij}$ is the bid from bidder $i$ on product $j$.

Then, we can solve the optimal pricing problem in the following random sampling auction. Let $B$ be the set of bidders. First, we obtain a sample $B'$ of bidders. Second, we compute the optimal sale prices for $B'$. Last, we run the fixed-price auction on $B - B'$ using the sale prices computed in Equation 3. All bidders in $B'$ lose the auction. The random sampling auction is shown truthful and competitive [89].

Most of the proposed auctions for digital goods with unlimited supply are randomized auctions. Goldberg et al. [92] show that no deterministic auction can be competitive. Aggarwal et al.
[9] later point out that the result does not hold for asymmetric auctions [144]. In a symmetric ex ante auction, buyers' preference parameters are drawn from a symmetric probability distribution, and thus there exists a symmetric equilibrium if an equilibrium exists at all. In an asymmetric auction, each buyer has the same information about the product but a different opportunity cost of obtaining the product, that is, bidders' valuations are drawn from different distributions. Aggarwal et al. [9] give an asymmetric deterministic auction that can approximate the revenue of any optimal single-price sale in the worst case. Indeed, they develop a general derandomization technique to transform any randomized auction into an asymmetric deterministic auction with approximately the same revenue. The general idea follows the deterministic maximum flow solution to the well-known hat problem [63].

4.3.4 Envy-free Auctions

One drawback in random sampling auctions is that some bidders may lose even if they make bids higher than some winning bidders do, since the bidders in $B'$ and $B - B'$ use different thresholds (i.e., $f_{B-B'}$ and $f_{B'}$, respectively) in the one product version and all bidders in $B'$ lose in the multiple product version.

Goldberg and Hartline [91] establish a fundamental result: an auction cannot be truthful, competitive and envy-free at the same time. They also explore possible tradeoffs between truthfulness and envy-freeness based on the consensus revenue estimate (CORE) technique [90]. Specifically, using a similar idea in combinatorial auctions with single parameter agents [16], we can relax the truthfulness requirement by requiring being truthful with probability $(1 - \epsilon)$, and always guarantee envy-freeness. The auction is highly truthful when $\epsilon$ approaches 0 and the number of winners in the auction approaches infinity. The other type of auctions relaxes the envy-free requirement to being envy-free with probability $(1 - \epsilon)$, and guarantees truthfulness.
Both auctions are competitive and the probability is over random coin tosses made by the randomized auction mechanism and not the input.

4.3.5 Online Auctions

In addition to potentially unlimited supply, another important feature of digital goods is that a digital good may be sold repetitively, such as a movie and a song. Therefore, auctions on digital goods may run continuously instead of only one round. Moreover, customers may want to have prompt answers to their bids.

Online auctions [129] are designed to address the setting where different customers bid at different times. The auction mechanism has to make a decision about each bid as it arrives. An (online) auction is incentive compatible if the bidders are rationally motivated to reveal their true valuations of the object. Lavi and Nisan [129] show that an online auction is incentive compatible if and only if it is based on supply curves under the assumption of limited supply, that is, before it receives the $i$-th bid $b_i(q)$, it fixes the supply curve $p_i(q)$ based on the previous bids, and (1) the quantity $q_i$ sold to customer $i$ is the quantity $q$ that maximizes the sum $\sum_{j=1}^{q} (b_i(j) - p_i(j))$; and (2) the price paid by $i$ is $\sum_{j=1}^{q_i} p_i(j)$.

To tackle the challenges when there is unlimited supply, Bar-Yossef et al. [31] point out that supply curves are not available anymore. Instead, they propose an extremely simple incentive-compatible randomized online auction. For each bidder $i$, the auction picks a random number $t \in \{0, \ldots, \lfloor \log h \rfloor\}$ and sets the price threshold to $s_i = 2^t$, where $h$ is the ratio of the highest valuation against the lowest valuation among all bidders. This auction is $O(\log h)$-competitive.

The auction mechanism can be further improved to achieve even better incentive-compatibility. Specifically, we can divide a sequence of bids $b_1, b_2, \ldots$ into $l = (\lfloor \log h \rfloor + 1)$ buckets, such that bucket $B_j$ contains the bids with values in range $[2^j, 2^{j+1})$.
The weight of bucket $B_j$ is the sum of bids within $B_j$, that is, $w_j = \sum_{i \in B_j} i$. For a new bidder, one of the buckets is chosen at random with a probability that depends on the bucket weights, and the bidder pays the price of the lowest bid of the bucket. The price $s_i$ that bidder $i$ pays follows the probability distribution $\Pr[s_i = 2^j] = \frac{w_j^d}{\sum_{r=0}^{l-1} w_r^d}$, where $d$ is a parameter. The auction is shown $O(3^d (\log h)^{\frac{1}{d+1}})$-competitive. By setting $d = \sqrt{\log \log h}$, the auction is $O(\exp(\sqrt{\log \log h}))$-competitive.

4.4 Summary

As revenue maximization plays a fundamental role in pricing digital products, we review the three major streams of revenues for digital products, namely money, information/privacy, and time/attention. Then, we revisit bundling and subscription planning for digital products, which echoes the opportunities and challenges due to low replication costs of information goods.

Auctions are widely used in pricing digital products. We review some basic types of auctions and their applications in digital products, including sponsored search auctions, auctions with unlimited supplies, envy-free auctions and online auctions.

Some ideas employed in pricing digital products are also used in pricing data products, as discussed in the next section.

5 Pricing Data Products

In this section, we discuss pricing in marketplaces of data. We first obtain an overall understanding about data markets and the major players in such markets. Then, we look into several of the most studied technical problems in data product pricing, including arbitrage-free pricing, revenue maximization pricing, fair and truthful pricing and privacy preservation in data marketplaces. Last, we discuss pricing in novel application scenarios, including dynamic data pricing, online pricing and federated learning pricing.

5.1 Data Markets and Pricing, What Are They?

Marketplaces for data have been actively developed for over a decade.
An early survey [179] identifies different categories and dimensions of data marketplaces and data vendors in 2012. There are many studies on various issues about data markets and pricing strategies. Before we discuss any specifics in detail, it is important to obtain an overall understanding about data markets, such as what is sold and for what purposes, who the sellers are, who the buyers are, and what the basic pricing models are.

Pantelis and Aija [167] present a brief economic analysis of data taxonomy as a market mechanism. Data and databases are legally protected by either copyright or database right. Copyright protects expression and significant creative effort that creates and organizes data. Database right protects a whole database. One challenge is that both copyright and database right are hard to enforce due to the non-rivalrous nature of data.

In general, data may be owned by governments, private parties or individuals. Consequently, data can be categorized into three types: open, public, and private data [167]. Open data are common pool resources [166], such as the data made available by the open data initiatives. Public data, such as the data collected by the government in the United States, are valuable resources subject to the "tragedy of the commons" [101]. Public data are often produced by individuals or organizations for research and used by governments and local authorities, but may also be employed by commercial parties to enhance their proprietary resources or services. Private data are generated by private applications or services.

To understand what is sold in data markets and for what purposes, Muschalle et al. [151] consider the common queries and demands on data markets, as well as the pricing strategies. They observe two major types of queries.
The first type of queries is to estimate the value of a "thing" or compare the values of "things", where examples of the "things" are webpages for advertisements, starlets, politicians and products. The second type is to show all about a "thing". Those queries are raised by seven categories of beneficiaries, namely analysts, application vendors, data processing algorithm developers, data providers, consultants, licensing and certification entities, and data market owners. Muschalle et al. [151] also identify three types of market structures. First, in a monopoly, a supplier is powerful enough to set prices to maximize profits. Second, an oligopoly is dominated by a small number of strong competitors. Last, in strong competition markets, prices may align with marginal costs.

A series of pricing strategies and models may be considered in data markets [151]. First, free data may be obtained from public authorities, may help to attract customers and suppliers of commercial data, and may be integrated into private and not-free data products. Second, prices can be based on usages, such as charging customers per hour of data usage. Third, package pricing allows a customer to obtain a certain amount of data or API calls for a fixed fee. A few studies [116, 210] try to optimize package pricing models. Fourth, in the flat fee tariff model, a data product or service is offered at a flat rate, regardless of usage. It is simple and easy to use. The drawback is the lack of flexibility, particularly for buyers. Fifth, combining package pricing and flat fee tariff results in two-part tariff, that is, a fixed basic fee plus an additional fee per unit consumed. This model is popular in data services. Specifically, Wu and Banker [209] show that, under zero marginal costs and monitoring costs, flat fee and two-part tariff pricing are on par, and two-part tariff is the most profitable strategy.
Last, in the freemium model, users can use basic products or services for free and pay for premium functions or services.

Recently, machine learning, particularly deep learning [94], has become disruptive in many applications, such as computer vision [139, 201] and natural language processing [218]. In most situations, powerful deep models heavily rely on large amounts of training data [156]. Monetization of data and machine learning models built on data through markets is gaining ever stronger interest from industry. Specific to data as an economic good and data pricing as a monetization mechanism in this context, a series of studies focus on data utility for model building and the associated pricing, particularly considering privacy.

Some data owners may have detailed knowledge of specific machine learning tasks and thus dedicate corresponding effort to collect high quality data for building better models. Babaioff et al. [23] study the design of optimal mechanisms for a monopoly data provider to sell her/his data. Specifically, they show that it is feasible to achieve optimal revenue by a simple one-round protocol, that is, a protocol where a buyer and a seller each sends a single message, and there is a single money transfer. The optimal mechanism can be computed in polynomial time. For a buyer who may abort the interaction with a seller prematurely, multiple rounds of partial information disclosure interleaved by payments may be needed to ensure optimal revenue. Cummings et al. [49] study the optimal design for data buyers to purchase data estimators with different variances and combine the estimators to meet a required quality guarantee on variance with the lowest total cost.

The role of privacy in data collection and machine learning model building has also been investigated.
For example, Ghosh and Roth [87] develop auctions that are truthful and approximately optimal for data buyers to obtain accurate estimates on data from owners who are compensated for privacy loss. They show that the classic Vickrey auction [198] can minimize the buyer's total payment and meet the accuracy requirement. They also develop a mechanism that can maximize the accuracy given a budget.

In general, modeling data owners' costs of privacy loss is very difficult, since the costs may be correlated with private data arbitrarily. It is impossible to design a direct revelation mechanism that can provide a non-trivial guarantee on accuracy and, at the same time, is rational for individual data owners. To tackle the issue, Ligett and Roth [137] design a take-it-or-leave-it mechanism, which randomly approaches individuals from a population and makes offers. This mechanism can be used for some data collection scenarios, such as surveys.

Versioning is an important strategy in data pricing. A data seller can customize data into different versions according to buyers' needs. Bergemann et al. [34] develop the optimal menu of information products that a monopoly data supplier can offer to a data buyer, so that one product can fit the buyer's willingness to buy the information at the offered price, and the revenue is maximized. One important finding is that information products indeed allow larger scopes of price discrimination. There are at least two dimensions that sellers can explore to derive various subsets of a data set, namely data quality and data position.

When data are used to build machine learning models, it is important to assess the value of each data record within a data set. There exist various methods for assessment, such as leave-one-out [47], leverage or influence score [48].
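Leave-one-out, the simplest of the assessment methods just mentioned, values a record by how much the performance score drops when that record is removed. The following is a minimal sketch, assuming a caller-supplied utility function; `label_coverage` is a hypothetical toy stand-in for the performance score of a trained model, not a method from the cited work.

```python
from typing import Callable, List, Sequence


def leave_one_out_values(data: Sequence,
                         utility: Callable[[List], float]) -> List[float]:
    """Value of record i = utility(all data) - utility(data without i)."""
    full_score = utility(list(data))
    values = []
    for i in range(len(data)):
        without_i = [rec for j, rec in enumerate(data) if j != i]
        values.append(full_score - utility(without_i))
    return values


def label_coverage(sample: List[str]) -> float:
    # Hypothetical utility: fraction of the two classes "a" and "b"
    # that appear at least once in the sample.
    return len(set(sample) & {"a", "b"}) / 2


print(leave_one_out_values(["a", "a", "b"], label_coverage))  # [0.0, 0.0, 0.5]
```

The two duplicated records receive a value of 0 because each is redundant given the other, which illustrates a known weakness of leave-one-out and is one motivation for valuation schemes that average over many subsets.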
Ghorbani and Zou [85] propose to apply the Shapley fairness on the data used to train a machine learning model, and thus define data Shapley for a record $i$ in a training data set $D$ as

$$\phi_i = C \sum_{S \subseteq D - \{i\}} \frac{U(S \cup \{i\}) - U(S)}{\binom{n-1}{|S|}}$$

where $C$ is an arbitrary (positive) constant, $n = |D|$ is the number of records, and $U(S)$ is the performance score of the model trained on data $S \subseteq D$. One challenge is that computing the exact data Shapley values on large data sets for sophisticated models, such as deep neural networks, is computationally prohibitive. Ghorbani and Zou [85] also develop Monte Carlo and gradient-based methods for estimation.

If a data point $p$ appears in two samples $D_1$ and $D_2$ from the same data distribution, intuitively the Shapley values of $p$ in $D_1$ and $D_2$ should be similar. Mathematically, the intrinsic Shapley value of $p$ in a distribution should be the expectation of the Shapley value of $p$ in the distribution. Based on this intuition, Ghorbani et al. [84] propose the notion of distributional Shapley. Let $Z$ be a universe in question. For example, in classification problems, conventionally $Z = X \times Y$, where $X$ is the feature space and $Y$ is the output. Let $\mathcal{D}$ be a data distribution in $Z$. Assuming a potential function or a performance metric $U: Z^{*} \rightarrow [0, 1]$ and a sample size $m > 0$, the distributional Shapley value of a point $z \in Z$ is the expected Shapley value over data sets of size $m$ containing $z$, that is, $\nu(z; U, \mathcal{D}, m) = E_{S \sim \mathcal{D}^{m-1}}[\phi(z; U, S \cup \{z\})]$, where $S \sim \mathcal{D}^{m-1}$ is a set of $m - 1$ points sampled i.i.d. from $\mathcal{D}$. They show that distributional Shapley values are stable. Kwon et al. [125] further derive the computationally tractable expressions for distributional Shapley for a series of models, including linear regression, binary classification and non-parametric density estimation.

Alternative to Shapley values, there are some other data valuation methods. For example, in machine learning, influence functions [119, 206] approximate leave-one-out to assess the value of a data item. Cai et al.
[41] propose strategy-proof mechanisms for data elicitation and the trade-off between model accuracy and reward. Richardson et al. [176] focus on the case of linear regression. Recently, Yoon et al. [217] propose data valuation using reinforcement learning. They use a data value estimator to learn how much a data item as an element in the training data contributes to improving model performance. One distinct advantage is that the model being trained and the data value estimator can improve each other's performance.

Data quality is an important issue [170]. There are many studies on assessment of data quality [103, 170, 204]. Some studies specifically focus on pricing based on data quality and the impact on data markets. Heckman et al. [103] propose a simple linear model,

$$\text{Value of data} = \text{fixed cost} + \sum_{i} w_i \cdot \text{factor}_i,$$

where the factors include but are not limited to age of data, periodicity of data, volume of data, and accuracy of data, and $w_i$ is the weight associated with the $i$-th factor. One practical difficulty in using the model is that the parameters in the model are hard to estimate. Another difficulty is that many data sets do not have public prices associated. Yu and Zhang [219] consider pricing multiple versions formed by multiple factors of data quality and build a two-level model. The first level is the data platform where a single owner is assumed, who designs the number of versions. The second level is the customers who want to maximize the data utility. Each level is modeled as a maximization problem and thus the whole model is a bi-level programming problem, which is NP-hard.

Another way to form multiple versions of data products is to charge by queries [121-124]. Intuitively, a data seller may treat a view of a data set as a version. Setting the price for every possible view is not only tedious but also tricky. If prices on views are not set properly, arbitrage or less than highest prices may happen. Koutris et al. [121, 124] propose a framework of query and view based data pricing.
The major idea is that a seller only needs to specify the prices on a few views, and then the prices of other views can be decided algorithmically. They advocate two desiderata, arbitrage-freeness and discount-freeness. Theoretically, they show the existence and uniqueness of pricing functions satisfying the requirements. They also show the complexity of computing the pricing functions. Unfortunately, only selection views and conjunctive queries without self-joins are tractable. They present polynomial time algorithms for chain queries and cyclic queries.

Technically, the core idea in the view and query based pricing framework is query determinacy [157, 158, 180]. A query $Q$ is said to be determined by a set of views $V$ if the answer to $Q$ can be completely derived from the views. Query determinacy enables the feasibility of arbitrage detection. If $V$ determines $Q$, then arbitrage happens if and only if the price of $V$ is cheaper than that of $Q$.

Koutris et al. [123] further explore the technical challenges in practical implementation of view and query based data pricing. Specifically, they develop an integer linear programming formulation for the pricing problem with a large number of queries. Considering the scenario where a user may purchase multiple queries over time or the database is updated, such that information in multiple queries and updates may have overlaps, they also leverage query history to avoid double charging. To handle the situation where there are multiple sellers, they define the share of a seller as the maximum revenue that the seller can get among all minimum-cost solutions, and accordingly define a fair revenue distribution policy. A prototype demonstration system is reported in [122].

Tang et al. [192] follow the view and query based pricing framework and consider the minimum granularity of data, that is, each tuple is a view. Their model assigns to each tuple a price and prices queries based on minimal provenances. Tang et al.
[191] extend view and query based pricing to XML documents and consider the situation where a customer may just want to purchase a sample instead of the complete query result.

5.2 Arbitrage-free Pricing

Arbitrage is probably the most intensively studied issue in pricing data products. As introduced in Section 3.2.4, in general, arbitrage refers to activities that take advantage of price differences between two or more markets or channels. Arbitrage is undesirable in many pricing models. Unfortunately, arbitrage may sneak into pricing models without rigorous design. For example, Balazinska et al. [28] analyze that subscription based pricing, possibly with a query limit, allows arbitrage. Muschalle et al. [151] point out that a pricing model charging users a fixed rate for a certain number of API calls may potentially allow arbitrage, depending on the package size.

Arbitrage-freeness is one of the fundamental properties of pricing models in query and view based pricing [121-124]. Li and Miklau [134] and Li et al. [133] develop frameworks of pricing linear aggregate queries. Specifically, Li et al. [133] consider linear queries. Given a data set of $n$ tuples $x_1, \ldots, x_n$, a linear query $q = (q_1, \ldots, q_n)$ is a real-valued vector, and the answer $q(x) = \sum_{i=1}^{n} q_i x_i$. For a multiset of queries $S = \{Q_1, \ldots, Q_k\}$ and query $Q$, if the answer to $Q$ can be linearly derived from the answers to the queries in $S$, then $Q$ is said to be determined by $S$, denoted by $S \rightarrow Q$. A pricing function $\pi(Q)$ is arbitrage-free if for any multiset $S$ and query $Q$ such that $S \rightarrow Q$, $\pi(Q) \leq \sum_{i=1}^{k} \pi(Q_i)$.

Under the general intuition of arbitrage-freeness, Li et al. [133] consider a specific form of queries, linear queries with variance $(q, v)$, that is, the estimation of the answer to query $q$ should have a variance no larger than $v$. Using different values of $v$, different versions are formed. A pricing model not carefully designed may allow arbitrage. Li et al.
[133] first establish the observation that $\pi(q, v) = \Omega(\frac{1}{v})$. Then, they synthesize pricing function $\pi(q, v) = \frac{f^2(q)}{v}$, which is arbitrage-free if $f$ is positive and semi-norm (footnote 5). For any arbitrage-free pricing functions $\pi_1, \ldots, \pi_k$, $f(\pi_1(q), \ldots, \pi_k(q))$ is also arbitrage-free if $f$ is a subadditive (see footnote 6) and nondecreasing function.

Footnote 5: A function $f: R^n \rightarrow R$ is semi-norm if for any $c \in R$ and any query $q \in R^n$, $f(cq) = |c| f(q)$; and for any $q_1, q_2 \in R^n$, $f(q_1 + q_2) \leq f(q_1) + f(q_2)$.

As Roth [178] summarizes, the framework by Li et al. [133] still faces three important challenges. First, arbitrage is still possible by deriving answers to a bundle of queries from another bundle of queries and their answers. Second, arbitrage is still possible on biased estimators for statistical queries. Last, it is unclear whether we can obtain arbitrage-free pricing maximizing profit given the distribution of buyer demands. Later, Deep and Koutris [54] provide some interesting insights to arbitrage-free pricing for bundles.

Lin and Kifer [138] investigate arbitrage-free pricing for general data queries. They consider three types of pricing models for query bundles, where a query bundle is a set of queries posted simultaneously as a batch. First, an instance-independent pricing function depends on the query bundle but not the database instance. Second, an up-front dependent pricing function depends on both the query bundle and the database instance. A customer knows an up-front dependent pricing function, and decides whether or not to purchase the query answers. Last, a delayed pricing function depends on both the query bundle and the answers computed by the query bundle on the current database instance. The customer knows the pricing function, but does not know the exact price. Once agreeing, the customer is charged when the answers are computed.

Lin and Kifer [138] also summarize five different types of arbitrage situations.
First, if prices are quoted by queries, in order to avoid price-based arbitrage, answers to queries should not be deduced from prices alone. Second, a buyer may use multiple accounts to derive answers to a query bundle. To avoid separate account arbitrage, the price of a query bundle $[q_1, q_2]$ should be at most the sum of the prices of $q_1$ and $q_2$. Third, if the answers to a query bundle $q'$ can always be deduced from answers to another query bundle $q$, to prevent post-processing arbitrage from happening, the price of $q$ should be no cheaper than that of $q'$. Fourth, although the answers to a query bundle $q'$ may not be always derivable from the answers to another query bundle $q$ on all database instances, still for a specific database instance $I$, the answers to $q'$ may be derived from the answers to $q$. If so, a serendipitous arbitrage happens. Last, if two queries behave almost identically but their prices are dramatically different, almost-certain arbitrage happens. Based on the above categorization, they discuss conditions that can prevent various types of arbitrage situations from happening.

Footnote 6: A function $f$ is subadditive if for any $x_1, \ldots, x_k$, $f(\sum_{i=1}^{k} x_i) \leq \sum_{i=1}^{k} f(x_i)$.

Pricing many queries in real time with formal guarantees on arbitrage freeness is challenging. Many theoretical methods are not scalable in practice. For example, it takes QueryMarket [123] about one minute to compute the price of a join query over a relation of about 1000 tuples. Qirana [55, 56] is a system for query-based pricing. The system allows data sellers to choose from a set of pricing functions that are information arbitrage-free, which covers both post-processing arbitrage-freeness and serendipitous arbitrage-freeness in Lin and Kifer's taxonomy [138]. Qirana also supports history-aware pricing. Qirana has been shown highly efficient and scalable on the TPC-H (footnote 7) and SSB (footnote 8) benchmark datasets as demonstration.
The key idea in Qirana is that it regards a query as an uncertainty reduction mechanism. Initially, a buyer faces a set of possible databases $\mathcal{I}$ defined by a database schema, primary keys and predefined constraints. Once a buyer obtains the answer $E$ to a query $Q$, all possible databases $D$ such that $E \neq Q(D)$ are eliminated. The price assigned to $Q$ should be a function of how much the set of possible databases shrinks.

Let $S$ be the set of possible databases before the query $Q$ is answered. $S$ is called the support set. Then, a weighted coverage function assigns a weight $w_i$ to every $D_i \in S$, and computes the price to a query by $p^{wc}(Q, D) = \sum_{i: Q(D_i) \neq Q(D)} w_i$. Alternatively, consider the equivalence relation in $S$: $D_i \sim D_j$ if and only if $Q(D_i) = Q(D_j)$. Assign to each possible database $D_i \in S$ a weight $w_i$ such that $\sum_{D_i \in S} w_i = 1$. Let $P_Q$ be the set of equivalence classes. For each class $B \in P_Q$, denote by $w_B = \sum_{D_i \in B} w_i$. The Shannon entropy function is used to compute the price of query $Q$ as the entropy of the query output, $p(Q, D) = -\sum_{B \in P_Q} w_B \log w_B$. The q-entropy function (also known as Tsallis entropy) for $q = 2$ is used to assign to $Q$ the price $p(Q, D) = \sum_{B \in P_Q} w_B (1 - w_B)$. Deep and Koutris [54] show that the weighted coverage function, the Shannon entropy function and the 2-entropy function are all arbitrage-free.

Using the complete set of possible databases as the support set leads to a #P-hard problem. To make the price calculation computationally feasible, Qirana uses uniform random samples and random neighbors as the support sets.

Footnote 7: http://www.tpc.org/tpch
Footnote 8: http://www.cs.umb.edu/~poneil/StarSchemaB.PDF

In targeted advertising markets, user data, such as opt-in email addresses, and user impressions are sold as data products. How to price users properly to avoid arbitrage is important. Xia and Muthukrishnan [213] consider the following problem.
Denote by $q_i$ a selection query over user attributes, by $U_i$ the set of all users satisfying $q_i$, and by $p_i$ the price of each user in $U_i$. If a buyer purchases $n$ users ($1 \le n \le |U_i|$) in $U_i$, she/he has to pay $n \cdot p_i$. If prices of different queries are not well coordinated, version-arbitrage may arise. If two queries $q_i$ and $q_j$ return similar user sets but $q_i$ is dramatically more expensive than $q_j$, then a buyer who wants $q_i$ may purchase $q_j$ instead. Xia and Muthukrishnan [213] point out that uniform pricing, that is, every query has the same price, is arbitrage-free, but is only a logarithmic approximation to the maximum revenue arbitrage-free pricing solution. Then, they present a greedy non-uniform pricing design. The design starts with the optimal uniform pricing that is arbitrage-free, and then iteratively updates the pricing function. If the price of a query can be updated to increase the revenue, it is increased so that the arbitrage-free property is retained. This greedy algorithm is still a logarithmic approximation to the maximum revenue arbitrage-free pricing solution.

Chen et al. [44] develop an arbitrage-free pricing design for multiple versions of a machine learning model. They assume that a broker trains the optimal model on the complete raw data. Then, random Gaussian noise is added to the optimal model to produce different versions for different buyers. The assumption is that the error of a machine learning model instance is monotonic with respect to the variance of the noise injected into the model. In this setting, a pricing function is arbitrage-free if and only if the price of a randomized model instance is monotonically increasing and subadditive with respect to the inverse of the variance.
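The condition attributed to Chen et al. [44] can be probed numerically. The intuition, stated here as background rather than taken from the survey, is that averaging two independently perturbed copies of the same model with inverse-variance weights yields a copy whose inverse variance is the sum of the two, so a price that is not subadditive in the inverse variance can be undercut. The sketch below only tests the two properties on a finite grid, which can refute but not prove arbitrage-freeness; the function names and the grid are hypothetical.

```python
import math


def is_monotone_subadditive_on_grid(price, grid, tol=1e-9):
    """Check monotonicity and subadditivity of price(x) on grid points,
    where x stands for the inverse of the noise variance."""
    pts = sorted(grid)
    monotone = all(price(a) <= price(b) + tol for a, b in zip(pts, pts[1:]))
    subadditive = all(price(a + b) <= price(a) + price(b) + tol
                      for a in pts for b in pts)
    return monotone and subadditive


def sqrt_price(x):
    """Concave price: passes both checks."""
    return math.sqrt(x)


def quadratic_price(x):
    """Convex price: monotone but not subadditive."""
    return x * x


GRID = [0.5 * i for i in range(1, 21)]  # hypothetical inverse-variance levels

# With the quadratic price, two copies at x = 1 cost less than one copy at x = 2,
# although combining them reaches the same inverse variance.
cost_two_noisy_copies = 2 * quadratic_price(1.0)
cost_one_precise_copy = quadratic_price(2.0)
```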
Footnote: Here, "buying a user" is short for purchasing the impression of a user in online advertising or a user email in targeted email advertising, for example.

5.3 Revenue Maximization Pricing

As explained in Section 3.2.2, the objective of revenue maximization is often of special interest in designing pricing strategies, since for a business to be successful in the long term, a more immediate and important requirement is to win over as many customers as possible.

Revenue maximization pricing for data products is a relatively less explored area. A possible reason is that, compared with pricing digital products, some other factors in pricing data products, such as arbitrage, need more urgent attention.

As mentioned in Section 5.2, Xia and Muthukrishnan [213] develop logarithmic approximation pricing algorithms for revenue maximization in user-based markets. They also consider the situations where both the maximum number (i.e., maximum demand) and the minimum number (i.e., minimum demand) of users that a buyer purchases are specified, and provide an $O(D)$ approximation algorithm to maximize revenue, where $D$ is the largest minimal demand among all buyers.

Chawla et al. [42] consider query and view based pricing for arbitrage-free revenue maximization under the assumption that all buyers are single-minded and the supply is unlimited. A buyer is single-minded if the buyer wants to purchase the answer to a single set of queries. They consider three types of pricing functions. Uniform bundle pricing sets the same price for every bundle. Additive or item pricing prices each item and charges a bundle the sum of the prices of the items in the bundle. Fractionally subadditive pricing, or XOS, sets $k$ weights $w_j^1, \ldots, w_j^k$ for each item $j$, and for a bundle $e$ the price is set to $\max_{i=1}^{k} \sum_{j \in e} w_j^i$. Building on the extensive studies on revenue maximization with single-minded buyers and unlimited supply [29, 39, 96], they develop new heuristics.
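The three families of pricing functions can be written down directly. The sketch below is illustrative only and is not the heuristics of Chawla et al. [42]: it assumes single-minded buyers who purchase their bundle exactly when its price does not exceed their valuation, with unlimited supply. The buyer list, weights and function names are hypothetical.

```python
def uniform_bundle_price(bundle, flat_price):
    """Every bundle costs the same flat price."""
    return flat_price


def item_price(bundle, weights):
    """A bundle costs the sum of the prices of its items."""
    return sum(weights[j] for j in bundle)


def xos_price(bundle, weight_vectors):
    """XOS: the maximum over k item pricings of the bundle's additive price."""
    return max(item_price(bundle, w) for w in weight_vectors)


def revenue(price_of, buyers):
    """Total revenue when each buyer buys iff the price is within the valuation."""
    total = 0.0
    for bundle, valuation in buyers:
        p = price_of(bundle)
        if p <= valuation:
            total += p
    return total


def best_uniform_price(buyers):
    """Search the buyers' valuations: raising a flat price up to the next
    valuation never loses a buyer, so some valuation is optimal."""
    candidates = sorted({v for _, v in buyers})
    return max(candidates,
               key=lambda p: revenue(lambda b: uniform_bundle_price(b, p), buyers))


# Hypothetical single-minded buyers: (desired bundle, valuation).
BUYERS = [(frozenset({"a", "b"}), 10.0),
          (frozenset({"b"}), 4.0),
          (frozenset({"c"}), 7.0)]
ITEM_WEIGHTS = {"a": 6.0, "b": 4.0, "c": 7.0}
LOW_WEIGHTS = {"a": 1.0, "b": 1.0, "c": 1.0}
```

In this toy instance the best flat price is 7 with revenue 14, whereas the item pricing `ITEM_WEIGHTS` extracts every buyer's full valuation; the instance is chosen for illustration and says nothing about the general approximation bounds.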
It is well known that there exists a uniform bundle pricing that is an $O(\log m)$ approximation of revenue maximization, where $m$ is the number of bundles. Swamy and Cheung [189] show that item pricing can achieve an $O(\log B)$ approximation of maximum revenue, where $B$ is the maximum number of bundles an item can be involved in. Chawla et al. [42] show some new lower bounds, that is, uniform bundle pricing, item pricing and XOS pricing combining a constant number of item pricing functions are still $\Omega(\log m)$ away from maximum revenue. They also present approximation algorithms.

To maximize revenue in machine learning models, Chen et al. [44] show that the optimization problem is coNP-hard. Thus, they relax the subadditive constraint $p(x + y) \le p(x) + p(y)$ to $\frac{q(x)}{x} \ge \frac{q(y)}{y}$ for every $0 < x \le y$, and turn to finding a pricing function $q(\cdot)$ such that $\frac{q(x)}{x}$ is decreasing with respect to $x$. They show that, for every well standing pricing function $p(\cdot)$, there exists a pricing function $q(\cdot)$ with the relaxed subadditive constraint such that $\frac{p(x)}{2} \le q(x) \le p(x)$, and $q(x)$ can be computed using dynamic programming in $O(n^2)$ time, where $n$ is the number of interpolated price points.

5.4 Fair and Truthful Pricing

Fairness and truthfulness are important for data product markets. Recall that fairness means that the revenue generated by a sale transaction in the data market is distributed among sellers in an unprejudiced manner so that they are paid for their marginal contributions. Truthfulness means that buyers are well motivated to report their internal valuations of data products honestly.

Agarwal et al. [7] propose a mathematical model of data marketplaces that are fair, truthful, revenue maximizing, and scalable. They assume each seller $j$ supplies a data stream $X_j$ and each buyer $n$ conducts a prediction task $Y_n$, where $X_j, Y_n \in \mathbb{R}^{T}$. For example, $X_j$ may be a stream of customers' interest in different products, and $Y_n$ is a task predicting a new customer's interest.
Taking a prediction task $Y_n$ and an estimate $\hat{Y}_n$, a prediction gain function $G: \mathbb{R}^{2T} \to [0, 1]$ measures the quality of the prediction. The value that buyer $n$ gets from estimate $\hat{Y}_n$ is $\mu_n \cdot G(Y_n, \hat{Y}_n)$, where $\mu_n$ is the price rate that the buyer is willing to pay for a unit increase in $G$. A machine learning model $\mathcal{M}: \mathbb{R}^{MT} \to \mathbb{R}^{T}$ uses data from $M$ sellers to produce an estimate $\hat{Y}_n$ for buyer $n$'s prediction task $Y_n$. Let $p_n$ and $b_n$ be the price and the bid, respectively. Then, the allocation function $AF: (p_n, b_n; X_M) \to \tilde{X}_M$ determines the quality at which buyer $n$ is allocated the sellers' data on sale $X_M$, where $X_M \in \mathbb{R}^{MT}$. The revenue function $RF: (p_n, b_n, Y_n; \mathcal{M}, G, X_M) \to r_n$ calculates how much revenue $r_n \in \mathbb{R}_{+}$ to extract from the buyer. The utility that buyer $n$ receives by bidding $b_n$ for $Y_n$ is

$$U(b_n, Y_n) = \mu_n \cdot G(Y_n, \hat{Y}_n) - RF(p_n, b_n, Y_n),$$

where $\hat{Y}_n = \mathcal{M}(Y_n, \tilde{X}_M)$ and $\tilde{X}_M = AF(p_n, b_n; X_M)$. A market is truthful if for all prediction tasks $Y_n$, $\mu_n = \arg\max_{z \in \mathbb{R}_{+}} U(z, Y_n)$. They adopt the notion of fairness following the famous Shapley fairness [184].

One main result [7] is that the data market defined as such is truthful if and only if function $AF$ is monotonic, that is, an increase in the difference between price rate $p_n$ and bid $b_n$ leads to a decrease in prediction gain $G$. They also give randomized $\epsilon$-approximation algorithms for a fair data market, that is, $\|\psi_{n,\mathrm{Shapley}} - \hat{\psi}_n\|_{\infty} < \epsilon$ with probability $1 - \delta$, where $\psi_{n,\mathrm{Shapley}}$ is the Shapley-fair payment division among sellers, $\hat{\psi}_n$ is the output of the approximation algorithm, and $\epsilon, \delta > 0$. Their algorithms are polynomial.

Shapley fairness [184] is popularly adopted as the foundation of fairness in data markets. However, computing Shapley values exactly takes exponential time [57]. Maleki et al. [141] present a permutation sampling method that approximates Shapley values for any bounded utility function. The basic idea is to use Equation 2 and estimate $\psi(s) = E_{\pi}[U(P_s^{\pi} \cup \{s\}) - U(P_s^{\pi})]$ by the sample mean, where $P_s^{\pi}$ is the set of sellers preceding $s$ in a random permutation $\pi$.
Following Hoeffding's inequality [104], to achieve an $(\epsilon, \delta)$-approximation, that is, $P(\|s - \hat{s}\|_p \le \epsilon) \ge 1 - \delta$, where $\hat{s}$ is the estimate, we need $\frac{2 r^2 N}{\epsilon^2} \log \frac{2N}{\delta}$ samples and evaluate the utility function $O(N^2 \log N)$ times, since each sampled permutation requires $N$ utility evaluations, where $r$ is the range of the utility function $U$.

Jia et al. [112] present approximation algorithms for Shapley value that can substantially reduce the number of times that the utility function is evaluated. First, they apply the idea of feature selection using group testing [60, 225]. For seller $s$, let $\beta_s$ be the random variable indicating whether $s$ appears in a random sample of sellers. Then, for sellers $s_i$ and $s_j$, the difference in Shapley values between $s_i$ and $s_j$ is

$$\psi(s_i) - \psi(s_j) = \frac{1}{N-1} \sum_{S \subseteq D \setminus \{s_i, s_j\}} \frac{U(S \cup \{s_i\}) - U(S \cup \{s_j\})}{\binom{N-2}{|S|}} \propto E[(\beta_{s_i} - \beta_{s_j})\, U(\beta_{s_1}, \ldots, \beta_{s_N})],$$

where $U(\beta_{s_1}, \ldots, \beta_{s_N})$ is the utility computed using the sellers appearing in the random sample, and the proportionality constant depends on the sampling distribution. They can use group testing to first estimate the Shapley differences and then derive the Shapley values from the differences by solving a feasibility problem. They show that this algorithm is an $(\epsilon, \delta)$-approximation that evaluates the utility function at most $O(\sqrt{N} (\log N)^2)$ times. They further observe that most of the Shapley values are around the mean. Exploiting this approximate sparsity, they give an $(\epsilon, \delta)$-approximation algorithm that evaluates the utility function only $O(N (\log N) \log(\log N))$ times.

Ghorbani and Zou [85] propose a principled framework of fair data valuation in supervised learning, together with Monte-Carlo and gradient-based approximation methods. Their Monte-Carlo method follows a general idea similar to that in Jia et al. [112]. They generate Monte-Carlo estimates until the average empirically converges. They also argue that, in practice, it is sufficient to estimate Shapley values up to the intrinsic noise in the predictive performance on the test data set.
Adding one tuple as a training data point does not significantly affect the performance of a model trained using a large training data set. Therefore, truncation can be used in practice based on the bootstrap variation on the test set. In their gradient Shapley method, they train a model using one "epoch" of the training data, and then update the model by gradient descent on one data point at a time, where the marginal contribution is the change in the performance of the model.

In general, computing Shapley values requires an exponential number of model evaluations. However, for some specific models, the computation may be reduced dramatically. For example, Jia et al. [111] show that for unweighted kNN classifiers, the exact computation needs only $O(N \log N)$ time and an $(\epsilon, \delta)$-approximation can be achieved in $O(N^{h(\epsilon, k)} \log N)$ time when $\epsilon$ is not too small and $k$ is not too large. They also propose a Monte-Carlo approximation of $O\left(\frac{N (\log N)^2}{(\log k)^2}\right)$ for weighted kNN classifiers. A key enabler of the progress is the specific utility function of a kNN classifier

$$U(S) = \frac{1}{k} \sum_{i=1}^{\min\{k, |S|\}} \mathbb{1}[y_{\alpha_i(S)} = y_{\mathrm{test}}],$$

where $\alpha_i(S)$ is the index of the training feature that is the $i$-th closest to $x_{\mathrm{test}}$ among the training examples in $S$. Moreover, the sublinear approximation for unweighted kNN classifiers is facilitated by locality sensitive hashing [52].

Recently, Jia et al. [113] leverage the efficient computation of Shapley values in kNN [111] to tackle general classification problems. They propose to first train a target model, such as a deep neural network, and identify the features. Then, they conduct a model distillation to kNN by training a kNN classifier using the features to mimic the performance of the original model and tune parameter $k$, the number of nearest neighbors considered. Last, they apply the Shapley value estimation method in kNN [111] to approximate the Shapley values in the target model.
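The permutation sampling idea and the kNN utility discussed above can be combined in a few lines. This is a minimal sketch under stated assumptions, not the algorithm of any cited paper: features are one-dimensional, distance is the absolute difference, there is a single test point, the utility is the unweighted kNN utility with the $1/k$ normalization, and no truncation or convergence test is applied. All names and the toy data are hypothetical.

```python
import random


def knn_utility(subset, train, test_point, k):
    """Fraction (out of k) of the min(k, |subset|) nearest training points
    in `subset` whose label matches the test label."""
    x_test, y_test = test_point
    nearest = sorted(subset, key=lambda i: abs(train[i][0] - x_test))[:k]
    return sum(train[i][1] == y_test for i in nearest) / k


def permutation_shapley(n, utility, num_permutations, rng):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled permutations of the n data points."""
    values = [0.0] * n
    for _ in range(num_permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        prefix = []
        prev = utility(prefix)
        for i in perm:
            prefix.append(i)
            cur = utility(prefix)
            values[i] += cur - prev  # marginal contribution of point i
            prev = cur
    return [v / num_permutations for v in values]


# Hypothetical toy data: (feature, label) pairs and one test point.
TRAIN = [(0.1, 1), (0.2, 1), (0.9, 0), (1.5, 0)]
TEST_POINT = (0.0, 1)


def toy_utility(subset):
    return knn_utility(subset, TRAIN, TEST_POINT, k=2)


estimates = permutation_shapley(len(TRAIN), toy_utility, 200, random.Random(0))
```

In this toy instance the two nearest points carry the matching label and each contributes exactly 1/2 in every permutation, while the two distant points never change the utility, so the estimates coincide with the exact values regardless of the random seed. On realistic data the estimates fluctuate, which is where the sample-size bound above becomes relevant.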
Many classic rewarding methods, such as Shapley values, may be vulnerable to data-replication attacks. One data provider may replicate its data and act as an additional provider to obtain extra undeserved rewards. To prevent data-replication attacks, replication-robust payoff mechanisms are proposed. Han et al. [100] propose a fix to Shapley value based payoff mechanisms. The idea is to down-weigh the Shapley value: a data provider gets a smaller reward if there are multiple copies of its data in the coalitions.

Related to fairness and truthfulness in a market, cooperation among different agents in a market may happen. Building trust in a sub-community within a data marketplace becomes an interesting subject. Armstrong and Durfee [17] analyze factors that may influence the efficiency of building trust and conducting cooperation in a data market. For each agent in a market, the other agents can be divided into two categories, namely remembered agents and strange or forgotten agents. They have a few interesting findings. Cooperation arising from iterated interactions is inversely proportional to the rate of system mixing, the number of initially misbehaving agents, and the rate at which agents explore alternative strategies. Cooperation is also initially inversely proportional to population size. At the same time, cooperation is proportional to average memory size and to better estimation of the likelihood that strange agents misbehave.

5.5 Privacy Preserving Marketplaces of Data

Privacy is a serious concern and also a critical tipping point in designing marketplaces of data. When a user shares her/his data with others, the user may disclose her/his privacy to some extent. Therefore, it is important to explore how to protect privacy or minimize privacy leakage. At the same time, it is also important to understand how a seller's privacy disclosure may be properly compensated through data pricing.
Ghosh and Roth [87] design truthful marketplaces where data buyers want to purchase data to estimate statistics and sellers want compensation for their privacy loss. In the design, there is only one query and the individual valuations of their data are private. Data owners are asked to report the costs for the use of their data. Under the assumption of differential privacy [61, 62], they transform the problem into variants of multi-unit procurement auction. They show that, when a buyer holds an accuracy goal, the classic Vickrey auction can minimize the buyer's total cost and guarantee the accuracy. When the buyer has a budget, they give an approximation algorithm to maximize the accuracy under the budget constraint.

The method by Ghosh and Roth [87] may not work well when the costs and the data are correlated. For example, a store with more customer traffic may request a higher cost for using its data. Correspondingly, reporting the cost may reveal private information about the store. Fleischer and Lyu [74] tackle the scenario where costs are correlated with data and propose a posted-price-like mechanism. Given a set of data sellers categorized into different types and the associated distributions of costs, the mechanism offers each seller a contract with the expected payment corresponding to the type. If a seller takes the offer, the payment is determined by the seller's verifiable type and the associated payment in the contract. All sellers have the same probability to take or reject their contracts independently. The sellers are truthful, that is, a seller takes the offer if the payment is larger than or equal to the privacy loss.
This posted-price-like mechanism is Bayesian incentive compatible (i.e., every seller's strategy is a Bayesian-Nash equilibrium strategy), ex-interim individually rational (i.e., the expected utility is non-negative for every seller when the seller decides truthfully), O( )-accurate, perfectly data private (i.e., whenever the mechanism's posterior belief about a seller's data differs from its prior belief, the mechanism pays the seller) and $\epsilon$-differentially private.

Li et al. [133] tackle the same problem as Ghosh and Roth [87] do, but assume that individual valuations are public and focus on returning unbiased estimations and pricing multiple queries consistently. To address the concerns on privacy loss, they develop a theoretical framework to divide the price among data owners who contribute to the aggregate computation and thus suffer a loss of privacy. Their framework extends several principles from both differential privacy and query pricing in data markets.

The fairness mechanism considered by Li et al. [133] only compensates a seller whose data are used. Niu et al. [163] further consider the scenario where multiple sellers' data are correlated and extend to dependent fairness. In dependent fairness, a seller $s$ is still compensated if the data of another seller $s'$ are used and are correlated with the data of $s$. They propose two approaches to privacy compensation. In the bottom-up approach, the broker first satisfies each individual seller's privacy compensation and then decides the price of the statistic sold to a buyer. In the top-down design, the broker decides the total price of a data aggregate product sold to a buyer, and then sets aside a fraction of the total price for privacy compensation. The privacy compensation is divided and assigned to individual data sellers by solving a budget allocation problem. Each seller receives a compensation roughly proportional to the privacy loss due to the data sharing. Niu et al.
[161] further extend to time series data that may have temporal correlations. They adopt Pufferfish privacy [117] to measure privacy losses under temporal correlations.

While various efforts have been made to address the challenges of privacy loss compensation when user data are correlated in one way or another, as Ghosh and Roth [87] point out, in general, it is impossible for any mechanism to compensate individuals for privacy loss properly if correlations between their private data and their cost functions are unknown beforehand.

In the classical setting of physical goods [143], using contract theory [142] with hidden information, that is, unobservable types of buyers, a seller can design a set of contracts with different consumption levels to maximize revenue from buyers. Naghizadeh and Sinha [154] extend the contract design model to price a bundle of queries at different privacy levels to maximize revenue. They also consider adversarial users. Their work also adopts differential privacy [61, 62]. For a query bundle $\{Q_1, \ldots, Q_k\}$, a contract is a tuple $(p, \epsilon, s)$, where $p > 0$ is the price paid by a buyer, $\epsilon$ is the privacy budget, such that a buyer can get an answer to query $Q_i$ ($1 \le i \le k$) with an $\epsilon_i$-differential privacy guarantee and $\sum_{i=1}^{k} \epsilon_i \le \epsilon$, and $s$ is the post-hoc fine to be paid if the buyer is found misusing the query answers. It is assumed that an adversarial buyer derives a benefit $C(\epsilon)$, which is monotonically increasing and convex with $C(0) = 0$. One interesting finding is that, in traditional contract theory, if there are $n$ types of honest buyers and one type of adversarial buyers, the seller should design up to $n + 1$ contracts. In the data marketplace situation, they show that up to $n$ contracts are sufficient. In other words, a data seller should not design a contract for the adversary. Instead, the seller should adjust the contracts' pricing to account for the risks from adversarial users.
They also design post-hoc fines in pricing query bundles that can help to reduce loss due to privacy leakage by adversarial buyers. They provide a fast approximation algorithm to compute the contracts.

A data owner has to decide a tradeoff between privacy and data utility. Li and Raghunathan [135] design an economics-based incentive-compatible mechanism for a data owner to price and disseminate private data. Specifically, let the two-part tariff pricing function $R(s, x) = \alpha_s + \beta_s x$ be the price for $x$ amount of data at sensitivity level $s$, where $\alpha_s$ and $\beta_s$ are the fixed and variable price factors, respectively. Assuming two types of data users, one type interested in aggregate information and patterns in data and the other type interested in individual identity and personal information, the proposed mechanism works in four stages. First, the data owner selects a variety of sensitivity types to offer. Second, the data owner offers different prices for data with different sensitivity types. Third, a data user selects a certain sensitivity type with the corresponding price, and thus reveals the user type. Last, the data user selects the optimal amount of data with the chosen sensitivity type. The core idea is that the data owner can identify the sensitive attributes in the data, such as the identifying attributes, which are not useful for aggregate analysis but necessary for individual communication. A data owner can offer a lower price for data without sensitive attributes, and charge a higher price for data with sensitive attributes. This approach provides an idea orthogonal to the popular ways of tuning the parameter $\epsilon$ in differential privacy.

Due to privacy concerns, when a company has opportunities to collect data about its customers, should it do so (i.e., collecting and revealing the data) or not (i.e., a blanket policy of never collecting)? Jaisingh et al. [109] find that the company should not collect customer data if the total gains from trading the data cannot cover the privacy loss.
In practice, there is an increasing tendency for consumers to overestimate their loss of privacy, particularly when the use of the private data is uncertain. In other cases, the company should offer two contracts on its services and products. One contract collects the customer data at a certain price, and the other contract does not collect any customer data at a different price.

While most of the studies on privacy preserving data marketplaces focus on the privacy of data owners, transactions may also disclose the privacy of data buyers, such as what, when and how much they buy. For example, a retail company purchasing query results may consider what queries (e.g., the products or customer groups involved in the queries), when (e.g., the periods that the queries concern), and how much data it purchases as private, and may want to keep the information confidential from any others, including the data sellers and the broker. Aiello et al. [11] design a mechanism such that, after making an initial deposit and maintaining a sufficient balance, a buyer can engage in an unlimited number of priced oblivious transfer protocols in which the sellers and the broker learn nothing other than the amount of interaction and the initial deposit amount. The broker cannot even know the buyer's current balance or when the buyer's balance runs out. This is achieved by adapting conditional disclosure [83] to the two-party setting.

Distribution and use of private data are another important step where privacy may leak. Hynes et al. [107] demonstrate Sterling, a decentralized marketplace for private data, which supports privacy-preserving distribution and use of data. The central technical idea comes from privacy-preserving smart contracts on a permissionless blockchain. To provide strong security and privacy guarantees, they combine blockchain smart contracts, trusted execution environments and differential privacy. In particular, smart contracts allow enforcement of constraints on data usage and enable payments and rewards.

5.6 Data Pricing in Novel Applications: Dynamic Data Pricing, Online Pricing and Federated Learning Pricing

The demand for data pricing arises in many novel application scenarios. In this subsection, we particularly discuss three emerging situations: dynamic data pricing, online pricing and pricing in federated learning.

Many applications are built on dynamic and online data. How to price temporal views on data streams properly is an important issue for practical data markets. One central task is to estimate and optimize the operational costs, which are the costs to evaluate queries of different users on the fly. The pricing decisions involve not only data sellers but also data buyers. For example, suppose two data buyers $b_1$ and $b_2$ purchase two queries $q_1$ and $q_2$, such that $q_2$ can be written as a further selection on top of $q_1$ (e.g., $q_1$ is about all customers in North America, while $q_2$ is the same as $q_1$ but focuses on only customers in Canada). The optimal pricing of $q_1$ and $q_2$ should take advantage of the overlap between the two queries so that the sharing can save operational costs and, at the same time, be fair to $b_1$ and $b_2$.

Al-Kiswany et al. [12] propose a greedy method that enumerates all possible sharing plans and selects the one with the minimum additional cost. It does not come with any quality guarantee. Liu and Hacıgümüş [140] propose an improved method that takes some risk in the sharing plan. If the costs of the previous sharings have already accumulated to a high level, and the additional cost of a new sharing (i.e., the risk) is moderate and can be amortized well by the previous sharings, then the new sharing may be taken. They also give five rules to ensure fair pricing. Let $AC(S)$ be the cost attributed to a sharing $S$. First, for two identical sharings $S_1 = S_2$, $AC(S_1) = AC(S_2)$ should hold.
Second, for any sharing $S$, $AC(S)$ should be no higher than the lowest cost of $S$ if no other sharing exists. Third, for two sharings $S_1$ and $S_2$, if the query of $S_1$ is contained in the query of $S_2$, that is, the result of $S_1$ is a subset of the result of $S_2$, and the lowest cost of $S_1$ is smaller than the lowest cost of $S_2$ if no other sharing exists, then $AC(S_1) \le AC(S_2)$. Fourth, a sharing plan with common subexpressions with other sharings should be compensated. Last, the cost of the global plan should be equal to the sum of the costs attributed to all sharings.

In order to purchase dynamic data, a buyer may have to call a seller's API repeatedly. A buyer may have to pay for the same data multiple times. Upadhyaya et al. [195] explore how to modify APIs to achieve optimal history-aware pricing, that is, buyers are charged only once for data that are purchased and not updated. The central idea is the introduction of the notion of refund: a user can ask for refunds for data that she/he has bought before. For each query, the seller issues a coupon in addition to the query result, where the coupon records the identity information of the data in the query result. Specifically, a coupon $c$ consists of a triple $(tid, uid, v)$, a query identifier, and a hash value computed by $H$ over the tuple identifier and a secret key, where $tid$ is a tuple identifier, $uid$ is a user-id, $v$ is a version-id that is monotonically increasing, the query identifier is also monotonically increasing, $H$ is a cryptographic hash function [59], such as SHA-1, SHA-256 and SHA-3, and the secret key is known only to the seller. If a buyer gets two coupons $c_1$ and $c_2$ in two different purchases such that $c_1[(tid, uid, v)] = c_2[(tid, uid, v)]$, then the buyer can ask the seller for a refund by showing the two coupons.

As pointed out by Deep and Koutris [55], the refund mechanism does not provide any arbitrage-free guarantee. Qirana [55,56] can support history-aware pricing.
To incorporate a query history, suppose a buyer has already purchased queries $\mathbf{Q} = Q_1, \ldots, Q_k$ and has paid a total of $p(\mathbf{Q}, D)$ so far. When a new query $Q_{k+1}$ comes, let the support set $S_{k+1} = \{D_i \in S \mid \mathbf{Q}(D_i) = \mathbf{Q}(D),\ Q_{k+1}(D_i) \neq Q_{k+1}(D)\}$. Then, the new total price is

$$p((Q_1, \ldots, Q_k, Q_{k+1}), D) = p(\mathbf{Q}, D) + \sum_{D_i \in S_{k+1}} w_i.$$

This history-aware pricing function is shown to be arbitrage-free.

Zheng et al. [223] consider online pricing for mobile crowd-sensing data markets. Different from most of the work on data markets, they assume that data providers are distributed in space and there are three types of spatial queries from buyers, namely single-data queries (e.g., inquiring the value at a specific location), multi-data queries (e.g., inquiring the mean in a region) and range queries (e.g., inquiring the probability that the data in a region fall in a given range). The vendor uses raw data from data providers and produces a statistical model through a Gaussian process to answer queries. To form different versions of data products, the vendor generates different conditional Gaussian distributions with respect to locations and uses the conditional entropy to quantify the quality of the versions. They propose a randomized online pricing strategy so that the price can adapt to the historical queries. They show that the pricing mechanism is arbitrage-free and is a constant factor approximation of revenue maximization.

Niu et al. [162] consider an online data market where a query may be sold to different buyers at different times and the broker can adjust prices over time. The objective is to maximize the broker's cumulative revenue by posting reasonable prices for sequential queries. They design a contextual dynamic pricing mechanism with a reserve price constraint. The central idea is to use the properties of ellipsoids for efficient online optimization. Their method can support both linear and non-linear market value models with uncertainty.
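The support-set pricing functions that Qirana uses, including the history-aware variant described earlier in this subsection, can be sketched on a toy support set. This is an illustrative sketch and not Qirana's implementation: it assumes databases are small tuples, queries are Python functions with hashable outputs, and the support set and weights are given explicitly. All names and data are hypothetical.

```python
import math
from collections import defaultdict


def weighted_coverage(query, true_db, support, weights):
    """Total weight of support databases that the query answer rules out."""
    truth = query(true_db)
    return sum(w for db, w in zip(support, weights) if query(db) != truth)


def class_weights(query, support, weights):
    """Weights of the equivalence classes induced by the query output."""
    classes = defaultdict(float)
    for db, w in zip(support, weights):
        classes[query(db)] += w
    return list(classes.values())


def shannon_price(query, support, weights):
    """Entropy of the query output under the weights (which must sum to 1)."""
    return -sum(w * math.log(w) for w in class_weights(query, support, weights) if w > 0)


def tsallis2_price(query, support, weights):
    """Tsallis entropy with q = 2 of the query output."""
    return sum(w * (1 - w) for w in class_weights(query, support, weights))


def history_aware_total(queries, true_db, support, weights):
    """Charge each query only for databases not already ruled out earlier."""
    alive = list(zip(support, weights))
    total = 0.0
    for q in queries:
        truth = q(true_db)
        total += sum(w for db, w in alive if q(db) != truth)
        alive = [(db, w) for db, w in alive if q(db) == truth]
    return total


# Hypothetical support set of four two-attribute databases with uniform weights.
SUPPORT = [(1, 2), (1, 3), (2, 2), (2, 3)]
WEIGHTS = [0.25, 0.25, 0.25, 0.25]
TRUE_DB = (1, 2)


def first_attribute(db):
    return db[0]


def attribute_sum(db):
    return db[0] + db[1]
```

In this toy instance, `first_attribute` alone costs 0.5 and `attribute_sum` alone costs 0.75, but buying `attribute_sum` after `first_attribute` adds only 0.25, and repeating a query adds nothing, which is the behavior the history-aware formula is meant to capture.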
Federated learning [146,147] trains a machine learning model across multiple decentralized parties, where each party holds local data without any peer-wise data exchange. The parties and their data sets are often ordered in a federated learning process. To accommodate the participation order and value data in federated learning, Wang et al. [205] develop the federated Shapley value. Let $I$ be the set of participants, $I_t \subseteq I$ the participants in round $t$, and $I_{1:t-1}$ the sequence of participant sets in rounds $1$ to $t-1$. Let $U$ be the utility function, where $U(A + B)$ is the utility of training first on $A$ and then on $B$. For participant $i$ at round $t$ in a federated learning process, the federated Shapley value is

$$\psi_t(i) = \frac{1}{|I_t|} \sum_{S \subseteq I_t \setminus \{i\}} \frac{1}{\binom{|I_t| - 1}{|S|}} \left[ U(I_{1:t-1} + (S \cup \{i\})) - U(I_{1:t-1} + S) \right]$$

if $i \in I_t$, and $\psi_t(i) = 0$ otherwise. The federated Shapley value of a party is the sum of the values of all rounds, that is, $\psi(i) = \sum_{t=1}^{T} \psi_t(i)$, where $T$ is the number of rounds.

Wang et al. [205] show that the federated Shapley values have instantaneous group rationality, that is, $\sum_{i \in I_t} \psi_t(i) = U(I_{1:t}) - U(I_{1:t-1})$. Fairness is guaranteed at each round. That is, for any two parties $i$ and $j$, $\psi_t(i) = \psi_t(j)$ at round $t$ if for all $S \subseteq I_t \setminus \{i, j\}$, $U(I_{1:t-1} + (S \cup \{i\})) = U(I_{1:t-1} + (S \cup \{j\}))$. Moreover, for any party $i$ at round $t$, $\psi_t(i) = 0$ if for all $S \subseteq I_t \setminus \{i\}$, $U(I_{1:t-1} + (S \cup \{i\})) = U(I_{1:t-1} + S)$. They also extend the previous Shapley value approximation techniques to compute federated Shapley values.

Sim et al. [186] consider the more general situation of collaborative machine learning and advocate using information gain as the utility function. For a model $\theta$ trained on data $D$, the information gain is $I(\theta; D) = H(\theta) - H(\theta \mid D)$, which is the reduction in uncertainty. They generalize to $\rho$-Shapley fairness by assigning a reward $r_i = k \psi_i^{\rho}$ to a party $i$. By tuning parameter $\rho$, they can trade off among Shapley fairness, individual rationality, stability of the grand coalition and group welfare.
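The per-round definition of the federated Shapley value above can be evaluated exactly for a handful of parties. The sketch below is a direct enumeration under stated assumptions, not the approximation techniques of Wang et al. [205]: a training history is a list of participant sets, one per round, and the utility is a made-up, order-dependent function of that list. The names `toy_utility`, `SIZES` and `ROUNDS` are hypothetical.

```python
import math
from itertools import combinations
from math import comb

SIZES = {"a": 100, "b": 300, "c": 50}  # hypothetical local data sizes


def toy_utility(history, decay=0.5):
    """Made-up order-dependent utility: later rounds count less,
    and an empty round contributes nothing."""
    return sum((decay ** t) * math.log1p(sum(SIZES[j] for j in group))
               for t, group in enumerate(history))


def round_value(i, t, rounds, utility):
    """Federated Shapley value of party i in round t (0-based)."""
    participants = rounds[t]
    if i not in participants:
        return 0.0
    history = rounds[:t]
    others = sorted(participants - {i})
    n = len(participants)
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            gain = utility(history + [s | {i}]) - utility(history + [s])
            total += gain / comb(n - 1, size)
    return total / n


def federated_shapley(i, rounds, utility):
    """Sum of the per-round values of party i over all rounds."""
    return sum(round_value(i, t, rounds, utility) for t in range(len(rounds)))


# Hypothetical participation schedule: two rounds with overlapping parties.
ROUNDS = [frozenset({"a", "b"}), frozenset({"b", "c"})]
```

With these definitions, the values of the parties in a round sum to the utility gained in that round, which is the instantaneous group rationality property stated above, and a party absent from a round receives zero for it.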
Hu and Gong [105] consider privacy leakage in federated learning and design an incentive mechanism to compensate the cost of privacy leakage of the users that are most likely to provide reliable data. Their problem is formulated as a two-stage Stackelberg game [200]. Richardson et al. [176] use influence functions to reward data contributions to linear regression in the federated learning setting.

5.7 Summary

In this section, we review the topic of pricing data products. We first analyze the structures, players, and ways to produce data products in data marketplaces. Then, we examine several important areas in pricing data products, including arbitrage-free pricing, revenue maximization pricing, fair and truthful pricing and privacy preserving pricing. We also discuss how to price dynamic data and online pricing. When pricing data products in a data marketplace, these considerations are typically incorporated and integrated in one way or another.

6 Discussion and Open Challenges

Data pricing comes from practical demands and has been tackled in multiple disciplines. Although there is a rich body of literature addressing a series of issues in data pricing, many questions remain unexplored. In this section, we discuss some interesting challenges for possible future work. By no means is our list exhaustive. Instead, we hope our discussion can stimulate more extensive interest and research effort in this fast growing area.

6.1 Data Supply Chain: A Grand Challenge

At the macro level, although many studies focus on different steps in data marketplaces, we clearly observe a lack of systematic investigation on data supply chains and development of end-to-end solutions. As data products are abundant and diversified, to develop ecologically sustainable marketplaces, supply chains of data products have to be built.
Here, we introduce and advocate the notion of data supply chains, which connect all parties involved in data production and consumption, including data providers, data processors, data analysts, consumers of data products and services, and other possible roles. Each party in a data supply chain connects its upstream providers and its downstream consumers, provides its value-added contributions and obtains rewards. Feedback mechanisms through pricing and marketing have to be created in a data supply chain so that supply and consumption can be matched, coordinated and balanced. Most of those problems have not been studied thoroughly.

Although the notion of data supply chain is not mentioned in the literature, some specific trends and challenges are discussed sporadically. For example, Muschalle et al. [151] identify some trends and challenges in data consumption and marketplaces. First, they assert that many data processing tasks, such as labeling, annotating and aggregating data, are essential for data markets. Second, data markets will be integrated with numerous application domains. To enable domain data markets, it is important to customize general data processing technologies for niche domains. Third, customers want to have data faster. Thus, it is important to create online data query services and develop corresponding pricing models. Fourth, as there are more data, more data providers and more analysts, a data product may be substituted by others. To foster a healthy data marketplace ecosystem, it is important to establish standard data processing mashups to facilitate data product substitution. Fifth, to maintain a fair data market overall, it is important to provide price transparency so that data product providers have to optimize their data and data processing/analysis services. Last, customer preferences and experience are critical for data markets.

Recently, Acemoglu et al.
[3] present an insightful study on the ecological effect of data markets. They demonstrate that a user's sharing of data may reveal private information about other users and depress the price of those users' data. The depressed prices lead to excessive data sharing and thus further reduce welfare. Their study suggests the need for mediation in data sharing in data markets.

Most recently, Fernandez et al. [71] analyze the challenges and propose a research agenda around constructing a data market platform to address the sharing, discovery and integration of data among many parties. Their big picture covers both market design and system development. The focus is to create the incentives and mechanisms to connect data supply and demand. As the middlemen, arbiters build data mashups to match data supply and demand. The market platforms advocated by the authors can be regarded as the data exchange mechanisms in a data supply chain.

One challenge associated with the macro view of data supply chains is the interdisciplinary nature of data pricing research. As can be observed in this article, data pricing is studied in many different disciplines, such as economics, marketing, electronic commerce, data management, data mining and machine learning. The communication and dialog among different areas have to be strengthened.

6.2 Some Technical Challenges at the Micro Level

At the micro level, many research problems remain open. We name a few examples of fundamental problems.

First, most of the studies suggest relative prices of data products. Very few studies connect theoretical models with data pricing practice and investigate absolute prices of data products and their marketing effect. As data pricing is a market mechanism and user behavior in practice is hard to model completely, experimental studies of data pricing models are essential and should be connected to theoretical investigations.

Second, pricing is based on valuation and equilibrium among multiple parties.
Different parties may have different valuations of data, data products and data services. It is important to systematically establish the principles of value assessment for various parties in data marketplaces, such as data providers, data owners, data users, and data brokers. Moreover, it is important to understand what messages are passed to different parties in data marketplaces through data pricing actions, and how. So far, value assessment of data and negotiations among different parties in data marketplaces have largely not been analyzed in detail.

Third, many pricing models are proposed in the literature. It is important to understand how data pricing models and their assumptions can be implemented and enforced in practice. Specifically, accounting and auditing in data marketplaces are critical to achieve transparency in data pricing and efficiency in data marketplaces. Accounting and auditing in data marketplaces, however, are interesting problems that have not been investigated in depth yet. We need principles, quality guarantees and designs of operational procedures for accounting and auditing in data pricing, transactions and adversary detection.

Fourth, most of the studies on data pricing develop general models. At the same time, as data science transforms many application domains, data pricing has to deal with specific applications. Mechanisms, regulations and constraints in a specific domain may facilitate data pricing in some aspects, and pose challenges in some other aspects. For example, Jia et al. [111] show that, although fair pricing in general takes exponential computation time, it can be achieved in polynomial time for kNN models (Section 5.4). It is interesting and highly desirable to explore fairness, truthfulness, and privacy preservation of data pricing in specific applications.

Last but not least, almost all applications are dynamic in nature. The values of data, data products and data services may also evolve over time.
The changes may be caused by updates in demand and supply. It is important to develop mechanisms to capture and monitor changes in demand and supply of data, data products and data services, and explore corresponding dynamic pricing.

References

[1] M. Aazam and E. Huh, "Broker as a service (baas) pricing and resource estimation model," in 2014 IEEE 6th International Conference on Cloud Computing Technology and Science, Dec. 2014, pp. 463–468.
[2] T. Abdallah, "On the benefit (or cost) of large-scale bundling," Production and Operations Management, vol. 28, no. 4, pp. 955–969, 2019. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/poms.12958
[3] D. Acemoglu, A. Makhdoumi, A. Malekian, and A. Ozdaglar, "Too Much Data: Prices and Inefficiencies in Data Markets," National Bureau of Economic Research, Inc, NBER Working Papers 26296, Sep. 2019. [Online]. Available: https://ideas.repec.org/p/nbr/nberwo/26296.html
[4] A. Acquisti, C. Taylor, and L. Wagman, "The economics of privacy," Journal of Economic Literature, vol. 54, no. 2, pp. 442–92, June 2016. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/jel.54.2.442
[5] A. Acquisti and C. Tucker, "Guns, privacy, and crime," Working paper, 2011. [Online]. Available: https://www.heinz.cmu.edu/~acquisti/papers/acquisti-REV.pdf
[6] W. J. Adams and J. L. Yellen, "Commodity bundling and the burden of monopoly," The Quarterly Journal of Economics, vol. 90, no. 3, pp. 475–498, 1976. [Online]. Available: http://www.jstor.org/stable/1886045
[7] A. Agarwal, M. Dahleh, and T. Sarkar, "A marketplace for data: An algorithmic solution," in Proceedings of the 2019 ACM Conference on Economics and Computation, ser. EC'19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 701–726. [Online]. Available: https://doi.org/10.1145/3328526.3329589
[8] C. C. Aggarwal and P. S. Yu, "Privacy-preserving data mining: A survey," in Handbook of Database Security: Applications and Trends, M.
Gertz and S. Jajodia, Eds. Boston, MA: Springer US, 2008, pp. 431–460. [Online]. Available: https://doi.org/10.1007/978-0-387-48533-1_18
[9] G. Aggarwal, A. Fiat, A. V. Goldberg, J. D. Hartline, N. Immorlica, and M. Sudan, "Derandomization of auctions," in Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, ser. STOC'05. New York, NY, USA: Association for Computing Machinery, 2005, pp. 619–625. [Online]. Available: https://doi.org/10.1145/1060590.1060682
[10] L. Aguiar and J. Waldfogel, "As streaming reaches flood stage, does it stimulate or depress music sales?" International Journal of Industrial Organization, vol. 57, no. C, pp. 278–307, 2018. [Online]. Available: https://ideas.repec.org/a/eee/indorg/v57y2018icp278-307.html
[11] W. Aiello, Y. Ishai, and O. Reingold, "Priced oblivious transfer: How to sell digital goods," in Advances in Cryptology - EUROCRYPT 2001, International Conference on the Theory and Application of Cryptographic Techniques, Innsbruck, Austria, May 6-10, 2001, Proceeding, ser. Lecture Notes in Computer Science, vol. 2045. Springer, 2001, pp. 119–135. [Online]. Available: https://iacr.org/archive/eurocrypt2001/20450118.pdf
[12] S. Al-Kiswany, H. Hacıgümüş, Z. Liu, and J. Sankaranarayanan, "Cost exploration of data sharings in the cloud," in Proceedings of the 16th International Conference on Extending Database Technology, ser. EDBT'13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 601–612. [Online]. Available: https://doi.org/10.1145/2452376.2452447
[13] S. Alaei, A. Makhdoumi, and A. Malekian, "Optimal subscription planning for digital goods," SSRN Electronic Journal, Jan. 2019.
[14] S. Alaei, A. Malekian, and A. Srinivasan, "On random sampling auctions for digital goods," in Proceedings of the 10th ACM Conference on Electronic Commerce, ser. EC'09. New York, NY, USA: Association for Computing Machinery, 2009, pp. 187–196. [Online]. Available: https://doi.org/10.1145/1566374.1566402
[15] C.
Anderson, The Long Tail: Why the Future of Business Is Selling Less of More. Hyperion, 2006.
[16] A. Archer, C. Papadimitriou, K. Talwar, and E. Tardos, "An approximate truthful mechanism for combinatorial auctions with single parameter agents," in Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA'03. USA: Society for Industrial and Applied Mathematics, 2003, pp. 205–214.
[17] A. A. Armstrong and E. H. Durfee, "Mixing and memory: Emergent cooperation in an information marketplace," in Proceedings of the 3rd International Conference on Multi Agent Systems, ser. ICMAS'98. USA: IEEE Computer Society, 1998, p. 34.
[18] M. Armstrong, "A more general theory of commodity bundling," Journal of Economic Theory, vol. 148, no. 2, pp. 448–472, 2013. [Online]. Available: https://ideas.repec.org/a/eee/jetheo/v148y2013i2p448-472.html
[19] N. Arnosti, M. Beck, and P. Milgrom, "Adverse selection and auction design for internet display advertising," in Proceedings of the Sixteenth ACM Conference on Economics and Computation, ser. EC'15. New York, NY, USA: Association for Computing Machinery, 2015, p. 167. [Online]. Available: https://doi.org/10.1145/2764468.2764537
[20] S. Athey, E. Calvano, and J. Gans, "The impact of the internet on advertising markets for news media," National Bureau of Economic Research, Working Paper 19419, September 2013. [Online]. Available: http://www.nber.org/papers/w19419
[21] S. Athey, E. Calvano, and J. S. Gans, "The impact of consumer multi-homing on advertising markets and media competition," Management Science, vol. 64, pp. 1574–1590, 2018.
[22] J. Auerbach, J. Galenson, and M. Sundararajan, "An empirical analysis of return on investment maximization in sponsored search auctions," in Proceedings of the 2nd International Workshop on Data Mining and Audience Intelligence for Advertising, ser. ADKDD'08. New York, NY, USA: Association for Computing Machinery, 2008, pp. 1–9. [Online].
Available: https://doi.org/10.1145/1517472.1517473
[23] M. Babaioff, R. Kleinberg, and R. Paes Leme, "Optimal mechanisms for selling information," in Proceedings of the 13th ACM Conference on Electronic Commerce, ser. EC'12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 92–109. [Online]. Available: https://doi.org/10.1145/2229012.2229024
[24] P. Bajari and A. Hortacsu, "Economic insights from internet auctions," Journal of Economic Literature, vol. 42, no. 2, pp. 457–486, June 2004. [Online]. Available: https://www.aeaweb.org/articles?id=10.1257/0022051041409075
[25] Y. Bakos and E. Brynjolfsson, "Bundling information goods: Pricing, profits, and efficiency," Manage. Sci., vol. 45, no. 12, pp. 1613–1630, Dec. 1999. [Online]. Available: https://doi.org/10.1287/mnsc.45.12.1
[26] Y. Bakos and E. Brynjolfsson, "Bundling and competition on the internet," Marketing Science, vol. 19, no. 1, pp. 63–82, Feb. 2000.
[27] M. Balazinska, B. Howe, P. Koutris, D. Suciu, and P. Upadhyaya, "A discussion on pricing relational data," in In Search of Elegance in the Theory and Practice of Computation: Essays Dedicated to Peter Buneman, V. Tannen, L. Wong, L. Libkin, W. Fan, W.-C. Tan, and M. Fourman, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 167–173. [Online]. Available: https://doi.org/10.1007/978-3-642-41660-6_7
[28] M. Balazinska, B. Howe, and D. Suciu, "Data markets in the cloud: An opportunity for the database community," PVLDB, vol. 4, no. 12, pp. 1482–1485, 2011. [Online]. Available: http://dblp.uni-trier.de/db/journals/pvldb/pvldb4.html#BalazinskaHS11
[29] M.-F. Balcan and A. Blum, "Approximation algorithms and online mechanisms for item pricing," in Proceedings of the 7th ACM Conference on Electronic Commerce, ser. EC'06. New York, NY, USA: Association for Computing Machinery, 2006, pp. 29–35. [Online]. Available: https://doi.org/10.1145/1134707.1134711
[30] M.-F. Balcan, A. Blum, and Y.
Mansour, "Item pricing for revenue maximization," in Proceedings of the 9th ACM Conference on Electronic Commerce, ser. EC'08. New York, NY, USA: Association for Computing Machinery, 2008, pp. 50–59. [Online]. Available: https://doi.org/10.1145/1386790.1386802
[31] Z. Bar-Yossef, K. Hildrum, and F. Wu, "Incentive-compatible online auctions for digital goods," in Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA'02. USA: Society for Industrial and Applied Mathematics, 2002, pp. 964–970.
[32] J. Benfield and W. Szlemko, "Internet-based data collection: Promises and realities," Journal of Research Practice, vol. 2, no. 2, Jan. 2006.
[33] D. Bergemann and A. Bonatti, "Selling cookies," American Economic Journal: Microeconomics, vol. 7, no. 3, pp. 259–94, August 2015. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/mic.20140155
[34] D. Bergemann, A. Bonatti, and A. Smolin, "The design and price of information," American Economic Review, vol. 108, no. 1, pp. 1–48, January 2018. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/aer.20161079
[35] E. Bertino, D. Lin, and W. Jiang, "A survey of quantification of privacy preserving data mining algorithms," in Privacy-Preserving Data Mining: Models and Algorithms, C. C. Aggarwal and P. S. Yu, Eds. Boston, MA: Springer US, 2008, pp. 183–205. [Online]. Available: https://doi.org/10.1007/978-0-387-70992-5_8
[36] S. J. Best and B. S. Krueger, Internet Data Collection, ser. Quantitative Applications in the Social Sciences. Thousand Oaks, CA: SAGE Publications, Inc., 2004.
[37] A. Boom, "'Download for free': When do providers of digital goods offer free samples?" Free University Berlin, School of Business & Economics, Discussion Papers 2004/28, 2004.
[38] R. Brennan, L. Canning, and R. Mcdowell, Business-to-business marketing. Sage Publications, Jan. 2013.
[39] P. Briest and P.
Krysta, "Single-minded unlimited supply pricing on sparse instances," in Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, ser. SODA'06. USA: Society for Industrial and Applied Mathematics, 2006, pp. 1093–1102.
[40] E. Brynjolfsson and M. D. Smith, "Frictionless commerce? A comparison of internet and conventional retailers," Management Science, vol. 46, no. 4, pp. 563–585, 2000. [Online]. Available: https://doi.org/10.1287/mnsc.46.4.563.12061
[41] Y. Cai, C. Daskalakis, and C. Papadimitriou, "Optimum statistical estimation with strategic data sources," in Proceedings of The 28th Conference on Learning Theory, ser. Proceedings of Machine Learning Research, P. Grünwald, E. Hazan, and S. Kale, Eds., vol. 40. Paris, France: PMLR, 03–06 Jul. 2015, pp. 280–296.
[42] S. Chawla, S. Deep, P. Koutris, and Y. Teng, "Revenue maximization for query pricing," Proc. VLDB Endow., vol. 13, no. 1, pp. 1–14, Sep. 2019. [Online]. Available: https://doi.org/10.14778/3357377.3357378
[43] Y.-K. Che, S. Choi, and J. Kim, "An experimental study of sponsored-search auctions," Games and Economic Behavior, vol. 102, pp. 20–43, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0899825616301233
[44] L. Chen, P. Koutris, and A. Kumar, "Towards model-based pricing for machine learning in a data marketplace," in Proceedings of the 2019 International Conference on Management of Data, ser. SIGMOD'19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 1535–1552. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/3299869.3300078
[45] L. Chiou and C. Tucker, "Paywalls and the demand for news," Information Economics and Policy, vol. 25, no. 2, pp. 61–69, 2013. [Online]. Available: https://EconPapers.repec.org/RePEc:eee:iepoli:v:25:y:2013:i:2:p:61-69
[46] L. Chiou and C. Tucker, "Content aggregation by platforms: The case of the news media," Journal of Economics & Management Strategy, vol. 26, no. 4, pp. 782–805, 2017. [Online].
Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/jems.12207
[47] R. D. Cook, "Detection of influential observation in linear regression," Technometrics, vol. 19, no. 1, pp. 15–18, Feb. 1977.
[48] R. Cook and S. Weisberg, Residuals and Influence in Regression, ser. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 1982. [Online]. Available: https://books.google.ca/books?id=MVSqAAAAIAAJ
[49] R. Cummings, K. Ligett, A. Roth, Z. S. Wu, and J. Ziani, "Accuracy for sale: Aggregating data with a variance constraint," in Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ser. ITCS'15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 317–324. [Online]. Available: https://doi.org/10.1145/2688073.2688106
[50] D. Dao, D. Alistarh, C. Musat, and C. Zhang, "Databright: Towards a global exchange for decentralized data ownership and trusted computation," CoRR, vol. abs/1802.04780, 2018. [Online]. Available: http://arxiv.org/abs/1802.04780
[51] C. Daskalakis, A. Deckelbaum, and C. Tzamos, "Strong duality for a multiple-good monopolist," Econometrica, vol. 85, no. 3, pp. 735–767, 2017. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.3982/ECTA12618
[52] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proceedings of the Twentieth Annual Symposium on Computational Geometry, ser. SCG'04. New York, NY, USA: Association for Computing Machinery, 2004, pp. 253–262. [Online]. Available: https://doi.org/10.1145/997817.997857
[53] D. Davydov, S. Izmalkov, and A. Smirnov, "Sponsored-Search Auctions: Empirical and Experimental Works," Journal of the New Economic Association, vol. 28, no. 4, pp. 56–73, 2015. [Online]. Available: https://ideas.repec.org/a/nea/journl/y2015i28p56-73.html
[54] S. Deep and P. Koutris, "The design of arbitrage-free data pricing schemes," CoRR, vol. abs/1606.09376, 2016.
[Online]. Available: http://arxiv.org/abs/1606.09376
[55] S. Deep and P. Koutris, "Qirana: A framework for scalable query pricing," in Proceedings of the 2017 ACM International Conference on Management of Data, ser. SIGMOD'17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 699–713. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/3035918.3064017
[56] S. Deep, P. Koutris, and Y. Bidasaria, "Qirana demonstration: Real time scalable query pricing," Proc. VLDB Endow., vol. 10, no. 12, pp. 1949–1952, Aug. 2017. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.14778/3137765.3137816
[57] X. Deng and C. H. Papadimitriou, "On the complexity of cooperative solution concepts," Mathematics of Operations Research, vol. 19, no. 2, pp. 257–266, 1994. [Online]. Available: https://doi.org/10.1287/moor.19.2.257
[58] S. Dibb, L. Simkin, W. M. Pride, and O. Ferrell, Marketing: Concepts and Strategies, 5th ed. Abingdon, UK: Houghton Mifflin, April 2005. [Online]. Available: http://oro.open.ac.uk/2041/
[59] W. Diffie and M. Hellman, "New directions in cryptography," IEEE Trans. Inf. Theor., vol. 22, no. 6, pp. 644–654, Nov. 1976. [Online]. Available: https://doi.org/10.1109/TIT.1976.1055638
[60] D.-Z. Du and F. K. Hwang, Combinatorial Group Testing and Its Applications, 2nd ed. World Scientific, 1999. [Online]. Available: https://www.worldscientific.com/doi/abs/10.1142/4252
[61] C. Dwork, "Differential privacy: A survey of results," in Theory and Applications of Models of Computation, M. Agrawal, D. Du, Z. Duan, and A. Li, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 1–19.
[62] C. Dwork, F. McSherry, K. Nissim, and A. Smith, "Calibrating noise to sensitivity in private data analysis," in Theory of Cryptography, S. Halevi and T. Rabin, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 265–284.
[63] T. Ebert, "Applications of recursive operators to randomness and complexity," Ph.D.
dissertation, University of California, Santa Barbara,
[64] B. Edelman and M. Ostrovsky, "Strategic bidder behavior in sponsored search auctions," Decision Support Systems, vol. 43, no. 1, pp. 192–198, Feb. 2007. [Online]. Available: https://doi.org/10.1016/j.dss.2006.08.008
[65] B. Edelman, M. Ostrovsky, and M. Schwarz, "Internet advertising and the generalized second-price auction: Selling billions of dollars worth of keywords," American Economic Review, vol. 97, no. 1, pp. 242–259, March 2007. [Online]. Available: https://www.aeaweb.org/articles?id=10.1257/aer.97.1.242
[66] L. Einav, C. Farronato, J. Levin, and N. Sundaresan, "Auctions versus Posted Prices in Online Markets," Journal of Political Economy, vol. 126, no. 1, pp. 178–215, 2018. [Online]. Available: https://ideas.repec.org/a/ucp/jpolec/doi10.1086-695529.html
[67] H. Elmeleegy, Y. Li, Y. Qi, P. Wilmot, M. Wu, S. Kolay, A. Dasdan, and S. Chen, "Overview of turn data management platform for digital advertising," Proc. VLDB Endow., vol. 6, no. 11, pp. 1138–1149, Aug. 2013. [Online]. Available: https://doi.org/10.14778/2536222.2536238
[68] R. Engelbrecht-Wiggans, "Auctions and bidding models: A survey," Management Science, vol. 26, no. 2, pp. 119–142, 1980. [Online]. Available: http://www.jstor.org/stable/2630247
[69] D. S. Evans, "The online advertising industry: Economics, evolution, and privacy," Journal of Economic Perspectives, vol. 23, no. 3, pp. 37–60, September 2009. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/jep.23.3.37
[70] U. Feige, A. Flaxman, J. D. Hartline, and R. Kleinberg, "On the competitive ratio of the random sampling auction," in Proceedings of the First International Conference on Internet and Network Economics, ser. WINE'05. Berlin, Heidelberg: Springer-Verlag, 2005, pp. 878–886. [Online]. Available: https://doi.org/10.1007/11600930_89
[71] R. C. Fernandez, P. Subramaniam, and M. J.
Franklin, "Data market platforms: Trading data assets to solve data problems," Proc. VLDB Endow., vol. 13, no. 12, pp. 1933–1947, Jul. 2020. [Online]. Available: https://doi.org/10.14778/3407790.3407800
[72] M. A. Ferrag, L. Maglaras, and A. Ahmim, "Privacy-preserving schemes for ad hoc social networks: A survey," IEEE Communications Surveys Tutorials, vol. 19, no. 4, pp. 3015–3045, 2017.
[73] F. Ferreira and J. Waldfogel, "Pop internationalism: Has half a century of world music trade displaced local culture?" Economic Journal, vol. 123, no. 569, pp. 634–664, Jun. 2013.
[74] L. K. Fleischer and Y.-H. Lyu, "Approximately optimal auctions for selling privacy when costs are correlated with data," in Proceedings of the 13th ACM Conference on Electronic Commerce, ser. EC'12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 568–585. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/2229012.2229054
[75] S. A. Fricker and Y. V. Maksimov, "Pricing of data products in data marketplaces," in Software Business, A. Ojala, H. Holmström Olsson, and K. Werder, Eds. Cham: Springer International Publishing, 2017, pp. 49–66.
[76] T. L. Friedman, The World Is Flat: A Brief History of the Twenty-First Century, 1st ed. New York: Farrar, Straus and Giroux, 2005.
[77] D. Fudenberg and J. M. Villas-Boas, "Price discrimination in the digital economy," in The Oxford Handbook of the Digital Economy, M. Peitz and J. Waldfogel, Eds. Oxford University Press, 2012. [Online]. Available: https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780195397840.001.0001/oxfordhb-9780195397840-e-10
[78] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, "Privacy-preserving data publishing: A survey of recent developments," ACM Comput. Surv., vol. 42, no. 4, Jun. 2010. [Online]. Available: https://doi.org/10.1145/1749603.1749605
[79] J. M. Gallaugher, P. Auger, and A.
BarNir, "Revenue streams and digital content providers: an empirical investigation," Information & Management, vol. 38, no. 7, pp. 473–485, 2001. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S037872060000083
[80] K. Ganchev, A. Kulesza, J. Tan, R. Gabbard, Q. Liu, and M. Kearns, "Empirical price modeling for sponsored search," in Internet and Network Economics, X. Deng and F. C. Graham, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 541–548.
[81] N. Gandal, "Native language and internet usage," International Journal of the Sociology of Language, vol. 2006, no. 182, pp. 25–40, 2006. [Online]. Available: https://www.degruyter.com/view/journals/ijsl/2006/182/article-p25.xml
[82] M. Gentzkow and J. M. Shapiro, "Ideological segregation online and offline," The Quarterly Journal of Economics, vol. 126, no. 4, pp. 1799–1839, Nov. 2011. [Online]. Available: https://doi.org/10.1093/qje/qjr044
[83] Y. Gertner, Y. Ishai, E. Kushilevitz, and T. Malkin, "Protecting data privacy in private information retrieval schemes," J. Comput. Syst. Sci., vol. 60, no. 3, pp. 592–629, Jun. 2000. [Online]. Available: https://doi.org/10.1006/jcss.1999.1689
[84] A. Ghorbani, M. P. Kim, and J. Zou, "A distributional framework for data valuation," in Proceedings of the International Conference on Machine Learning pre-proceedings (ICML 2020), 2020.
[85] A. Ghorbani and J. Zou, "Data shapley: Equitable valuation of data for machine learning," in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, 09–15 Jun. 2019, pp. 2242–2251. [Online]. Available: http://proceedings.mlr.press/v97/ghorbani19c.html
[86] A. Ghosh, K. Ligett, A. Roth, and G. Schoenebeck, "Buying private data without verification," in Proceedings of the Fifteenth ACM Conference on Economics and Computation, ser. EC'14.
New York, NY, USA: Association for Computing Machinery, 2014, pp. 931–948. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/2600057
[87] A. Ghosh and A. Roth, "Selling privacy at auction," in Proceedings of the 12th ACM Conference on Electronic Commerce, ser. EC'11. New York, NY, USA: Association for Computing Machinery, 2011, pp. 199–208. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/1993574.1993605
[88] A. Gilchrist, Industry 4.0: The Industrial Internet of Things, 1st ed. USA: Apress, 2016.
[89] A. V. Goldberg and J. D. Hartline, "Competitive auctions for multiple digital goods," in Proceedings of the 9th Annual European Symposium on Algorithms, ser. ESA'01. Berlin, Heidelberg: Springer-Verlag, 2001, pp. 416–427.
[90] A. V. Goldberg and J. D. Hartline, "Competitiveness via consensus," in Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA'03. USA: Society for Industrial and Applied Mathematics, 2003, pp. 215–222.
[91] A. V. Goldberg and J. D. Hartline, "Envy-free auctions for digital goods," in Proceedings of the 4th ACM Conference on Electronic Commerce, ser. EC'03. New York, NY, USA: Association for Computing Machinery, 2003, pp. 29–35. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/779928.
[92] A. V. Goldberg, J. D. Hartline, and A. Wright, "Competitive auctions and digital goods," in Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA'01. USA: Society for Industrial and Applied Mathematics, 2001, pp. 735–744.
[93] A. Goldfarb and C. Tucker, "Digital economics," Journal of Economic Literature, vol. 57, no. 1, pp. 3–43, March 2019. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/jel.20171452
[94] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[95] B. R. Gordon, F. Zettelmeyer, N. Bhargava, and D.
Chapsky, "A comparison of approaches to advertising measurement: Evidence from big field experiments at facebook," Marketing Science, vol. 38, no. 2, pp. 193–225, 2019. [Online]. Available: https://doi.org/10.1287/mksc.2018.1135
[96] V. Guruswami, J. D. Hartline, A. R. Karlin, D. Kempe, C. Kenyon, and F. McSherry, "On profit-maximizing envy-free pricing," in Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA'05. USA: Society for Industrial and Applied Mathematics, 2005, pp. 1164–1173.
[97] N. Haghpanah and J. Hartline, "Reverse mechanism design," in Proceedings of the Sixteenth ACM Conference on Economics and Computation, ser. EC'15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 757–758. [Online]. Available: https://doi.org/10.1145/2764468.2764498
[98] N. Haghpanah and J. Hartline, "When is pure bundling optimal?" The Pennsylvania State University, Working Paper, April 2020. [Online]. Available: https://www.personal.psu.edu/nuh47/papers/bundling.pdf
[99] H. Halaburda and Y. Yehezkel, "Platform competition under asymmetric information," American Economic Journal: Microeconomics, vol. 5, no. 3, pp. 22–68, 2013. [Online]. Available: http://www.jstor.org/stable/43189630
[100] D. Han, S. Tople, A. Rogers, M. Wooldridge, O. Ohrimenko, and S. Tschiatschek, "Replication-robust payoff-allocation with applications in machine learning marketplaces," ArXiv, vol. abs/2006.14583,
[101] G. Hardin, "The tragedy of the commons," Science, vol. 162, no. 3859, pp. 1243–1248, 1968. [Online]. Available: https://science.sciencemag.org/content/162/3859/1243
[102] J. D. Hartline and R. McGrew, "From optimal limited to unlimited supply auctions," in Proceedings of the 6th ACM Conference on Electronic Commerce, ser. EC'05. New York, NY, USA: Association for Computing Machinery, 2005, pp. 175–182. [Online]. Available: https://doi.org/10.1145/1064009.1064028
[103] J. Heckman, E. Peters, N. G. Kurup, E. Boehmer, and M.
Davaloo, "A pricing model for data markets," in iConference 2015 Proceedings. iSchools, 2015.
[104] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963. [Online]. Available: http://www.jstor.org/stable/2282952
[105] R. Hu and Y. Gong, "Trading data for learning: Incentive mechanism for on-device federated learning," ArXiv, vol. abs/2009.05604, 2020.
[106] W. Hu and A. Bolivar, "Online auctions efficiency: A survey of ebay auctions," in Proceedings of the 17th International Conference on World Wide Web, ser. WWW'08. New York, NY, USA: Association for Computing Machinery, 2008, pp. 925–934. [Online]. Available: https://doi.org/10.1145/1367497.1367621
[107] N. Hynes, D. Dao, D. Yan, R. Cheng, and D. Song, "A demonstration of sterling: A privacy-preserving data marketplace," Proc. VLDB Endow., vol. 11, no. 12, pp. 2086–2089, Aug. 2018. [Online]. Available: https://doi.org/10.14778/3229863.3236266
[108] G. Irvin, Modern Cost-Benefit Methods. London: Macmillan Publishers Limited, 1978.
[109] J. Jaisingh, J. Barron, S. Mehta, and A. Chaturvedi, "Privacy and pricing personal information," European Journal of Operational Research, vol. 187, no. 3, pp. 857–870, 2008. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0377221706007867
[110] J. Jansen and T. Mullen, "Sponsored search: An overview of the concept, history, and technology," International Journal of Electronic Business, vol. 6, pp. 114–131, Jan. 2008.
[111] R. Jia, D. Dao, B. Wang, F. A. Hubis, N. M. Gürel, B. Li, C. Zhang, C. Spanos, and D. Song, "Efficient task-specific data valuation for nearest neighbor algorithms," Proc. VLDB Endow., vol. 12, no. 11, pp. 1610–1623, Jul. 2019. [Online]. Available: https://doi.org/10.14778/3342263.3342637
[112] R. Jia, D. Dao, B. Wang, F. A. Hubis, N. Hynes, N. M. Gürel, B. Li, C. Zhang, D. Song, and C. J.
Spanos, "Towards efficient data valuation based on the Shapley value," in Proceedings of Machine Learning Research, ser. Proceedings of Machine Learning Research, K. Chaudhuri and M. Sugiyama, Eds., vol. 89. PMLR, 16–18 Apr 2019, pp. 1167–1176. [Online]. Available: http://proceedings.mlr.press/v89/jia19a.html
[113] R. Jia, X. Sun, J. Xu, C. Zhang, B. Li, and D. Song, "An empirical and comparative analysis of data valuation with scalable algorithms," CoRR, vol. abs/1911.07128, 2019. [Online]. Available: http://arxiv.org/abs/1911.07128
[114] H. Jiang, J. Pei, D. Yu, J. Yu, B. Gong, and X. Cheng, "Differential privacy and its applications in social network analysis: A survey," ArXiv, vol. abs/2010.02973, 2020.
[115] B. Jullien, "Two-sided B to B platforms," in The Oxford Handbook of the Digital Economy, M. Peitz and J. Waldfogel, Eds. Oxford University Press, 2012. [Online]. Available: https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780195397840.001.0001/oxfordhb-9780195397840-e-7
[116] V. V. Kantere, D. Dash, G. Gratsias, and A. Ailamaki, "Predicting cost amortization for query services," in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD'11. New York, NY, USA: Association for Computing Machinery, 2011, pp. 325–336. [Online]. Available: https://doi.org/10.1145/1989323.1989358
[117] D. Kifer and A. Machanavajjhala, "A rigorous and customizable framework for privacy," in Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, ser. PODS'12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 77–88. [Online]. Available: https://doi.org/10.1145/2213556.2213571
[118] P. Klemperer, "Auction theory: A guide to the literature," Journal of Economic Surveys, vol. 13, no. 3, pp. 227–286, 1999. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-6419.00083
[119] P. W. Koh and P.
Liang, "Understanding black-box predictions via influence functions," in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML'17. JMLR.org, 2017, pp. 1885–1894.
[120] P. Kotler, Marketing Management: The Millennium Edition. Boston, MA: Pearson Custom Pub., 2000.
[121] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu, "Query-based data pricing," in Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, ser. PODS'12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 167–178. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/2213556.2213582
[122] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu, "QueryMarket demonstration: Pricing for online data markets," Proc. VLDB Endow., vol. 5, no. 12, pp. 1962–1965, Aug. 2012. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.14778/2367502.2367548
[123] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu, "Toward practical query pricing with QueryMarket," in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD'13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 613–624. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/2463676.2465335
[124] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu, "Query-based data pricing," J. ACM, vol. 62, no. 5, Nov. 2015. [Online]. Available: https://doi.org/10.1145/2770870
[125] Y. Kwon, M. A. Rivas, and J. Zou, "Efficient computation and analysis of distributional Shapley values," ArXiv, vol. abs/2007.01357, 2020.
[126] S. Lahaie, D. M. Pennock, A. Saberi, and R. V. Vohra, "Sponsored search auctions," in Algorithmic Game Theory, N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani, Eds. Cambridge University Press, 2007, pp. 699–716.
[127] A. Lambrecht, A. Goldfarb, A. Bonatti, A. Ghose, D. Goldstein, R. Lewis, A. Rao, N. Sahni, and S. Yao, "How do firms make money selling digital goods online?" Marketing Letters, vol. 25, pp. 331–341, 09 2014.
[128] A. Lambrecht and C. Tucker, "When does retargeting work?
information specificity in online advertising," Journal of Marketing Research, vol. 50, no. 5, pp. 561–576, 2013. [Online]. Available: https://doi.org/10.1509/jmr.11.0503
[129] R. Lavi and N. Nisan, "Competitive analysis of incentive compatible on-line auctions," in Proceedings of the 2nd ACM Conference on Electronic Commerce, ser. EC'00. New York, NY, USA: Association for Computing Machinery, 2000, pp. 233–241. [Online]. Available: https://doi.org/10.1145/352871.352897
[130] S. Lehmann and P. Buxmann, "Pricing strategies of software vendors," Business & Information Systems Engineering, vol. 1, pp. 452–462, 12
[131] J. Lerner, P. A. Pathak, and J. Tirole, "The dynamics of open-source contributors," American Economic Review, vol. 96, no. 2, pp. 114–118, May 2006. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/000282806777211874
[132] R. Lewis and J. Rao, "On the near impossibility of measuring the returns to advertising," SSRN Electronic Journal, 01 2013.
[133] C. Li, D. Y. Li, G. Miklau, and D. Suciu, "A theory of pricing private data," ACM Trans. Database Syst., vol. 39, no. 4, Dec. 2015. [Online]. Available: https://doi.org/10.1145/2691190.2691191
[134] C. Li and G. Miklau, "Pricing aggregate queries in a data marketplace," in Proceedings of the 15th International Workshop on the Web and Databases 2012, WebDB 2012, Scottsdale, AZ, USA, May 20, 2012, Z. G. Ives and Y. Velegrakis, Eds., 2012, pp. 19–24. [Online]. Available: http://db.disi.unitn.eu/pages/WebDB2012/papers/p15.pdf
[135] X.-B. Li and S. Raghunathan, "Pricing and disseminating customer data with privacy awareness," Decision Support Systems, vol. 59, pp. 63–73, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167923613002534
[136] F. Liang, W. Yu, D. An, Q. Yang, X. Fu, and W. Zhao, "A survey on big data market: Pricing, trading and protection," IEEE Access, vol. PP, pp. 1–1, 02 2018.
[137] K. Ligett and A.
Roth, "Take it or leave it: Running a survey when privacy comes at a cost," in Proceedings of the Eighth International Workshop on Internet and Network Economics (WINE'12), ser. Lecture Notes in Computer Science, P. W. Goldberg and M. Guo, Eds., vol. 7695. Berlin, Heidelberg: Springer, 2012, pp. 378–391.
[138] B.-R. Lin and D. Kifer, "On arbitrage-free pricing for general data queries," Proc. VLDB Endow., vol. 7, no. 9, pp. 757–768, May 2014. [Online]. Available: https://doi.org/10.14778/2732939.2732948
[139] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," Medical Image Analysis, vol. 42, pp. 60–88, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S136184151730113
[140] Z. Liu and H. Hacıgümüş, "Online optimization and fair costing for dynamic data sharing in a cloud data market," in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD'14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 1359–1370. [Online]. Available: https://doi.org/10.1145/2588555.2593679
[141] S. Maleki, L. Tran-Thanh, G. Hines, T. Rahwan, and A. Rogers, "Bounding the estimation error of sampling-based Shapley value approximation with/without stratifying," CoRR, vol. abs/1306.4265, 2013. [Online]. Available: http://arxiv.org/abs/1306.4265
[142] A. Mas-Colell, M. Whinston, and J. Green, Microeconomic Theory. Oxford University Press, 1995. [Online]. Available: https://EconPapers.repec.org/RePEc:oxp:obooks:9780195102680
[143] E. Maskin and J. Riley, "Monopoly with incomplete information," The RAND Journal of Economics, vol. 15, no. 2, pp. 171–196, 1984. [Online]. Available: http://www.jstor.org/stable/2555674
[144] E. Maskin and J. Riley, "Asymmetric Auctions," The Review of Economic Studies, vol. 67, no. 3, pp. 413–438, 07 2000. [Online].
Available: https://doi.org/10.1111/1467-937X.00137
[145] R. P. McAfee and J. McMillan, "Auctions and Bidding," Journal of Economic Literature, vol. 25, no. 2, pp. 699–738, June 1987. [Online]. Available: https://ideas.repec.org/a/aea/jeclit/v25y1987i2p699-738.html
[146] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-Efficient Learning of Deep Networks from Decentralized Data," in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[147] B. McMahan and D. Ramage, "Federated learning: Collaborative machine learning without centralized training data," Google AI Blog, April 2017. [Online]. Available: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
[148] D. Menicucci, S. Hurkens, and D.-S. Jeon, "On the optimality of pure bundling for a monopolist," Journal of Mathematical Economics, vol. 60, pp. 33–42, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S030440681500066X
[149] T. Moore, R. Clayton, and R. Anderson, "The economics of online crime," Journal of Economic Perspectives, vol. 23, no. 3, pp. 3–20, September 2009. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/jep.23.3.3
[150] M. K. M. Murthy, H. A. Sanjay, and J. P. Ashwini, "Pricing models and pricing schemes of IaaS providers: A comparison study," in Proceedings of the International Conference on Advances in Computing, Communications and Informatics, ser. ICACCI'12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 143–147. [Online]. Available: https://doi.org/10.1145/2345396.2345421
[151] A. Muschalle, F. Stahl, A. Löser, and G. Vossen, "Pricing approaches for data markets," in Enabling Real-Time Business Intelligence, M. Castellanos, U. Dayal, and E. A. Rundensteiner, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 129–144.
[152] R. B. Myerson, "Optimal auction design," Math. Oper. Res., vol. 6, no. 1, pp.
58–73, Feb. 1981. [Online]. Available: https://doi.org/10.1287/moor.6.1.58
[153] A. Nagaraj, "The private impact of public information: Landsat satellite maps and gold exploration," Unpublished, 07 2016. [Online]. Available: http://abhishekn.com/files/nagaraj_landsat2020.pdf
[154] P. Naghizadeh and A. Sinha, "Adversarial contract design for private data commercialization," in Proceedings of the 2019 ACM Conference on Economics and Computation, ser. EC'19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 681–699. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/3328526.3329633
[155] T. T. Nagle and J. Hogan, The Strategy and Tactics of Pricing: A Guide to Growing More Profitably. Prentice Hall, 2010.
[156] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic, "Deep learning applications and challenges in big data analytics," Journal of Big Data, vol. 2, no. 1, p. 1, Feb 2015. [Online]. Available: https://doi.org/10.1186/s40537-014-0007-7
[157] A. Nash, L. Segoufin, and V. Vianu, "Determinacy and rewriting of conjunctive queries using views: A progress report," in Proceedings of the 11th International Conference on Database Theory, ser. ICDT'07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 59–73. [Online]. Available: https://doi.org/10.1007/11965893_5
[158] A. Nash, L. Segoufin, and V. Vianu, "Views and queries: Determinacy and rewriting," ACM Trans. Database Syst., vol. 35, no. 3, Jul. 2010. [Online]. Available: https://doi.org/10.1145/1806907.1806913
[159] M. Neumeier, The Brand Flip: Why Customers Now Run Companies–and How to Profit from It. San Francisco: New Riders, 2015.
[160] K. Nissim, S. Vadhan, and D. Xiao, "Redrawing the boundaries on purchasing data from privacy-sensitive individuals," in Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, ser. ITCS'14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 411–422. [Online].
Available: https://doi-org.proxy.lib.sfu.ca/10.1145/2554797.2554835
[161] C. Niu, Z. Zheng, S. Tang, X. Gao, and F. Wu, "Making big money from small sensors: Trading time-series data under pufferfish privacy," in IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, April 2019, pp. 568–576.
[162] C. Niu, Z. Zheng, F. Wu, S. Tang, and G. Chen, "Online pricing with reserve price constraint for personal data markets," CoRR, vol. abs/1911.12598, 2019. [Online]. Available: http://arxiv.org/abs/1911.12598
[163] C. Niu, Z. Zheng, F. Wu, S. Tang, X. Gao, and G. Chen, "Unlocking the value of privacy: Trading aggregate statistics over private correlated data," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD'18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 2031–2040. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/3219819.3220013
[164] A. Ockenfels, D. Reiley, and A. Sadrieh, "Online auctions," National Bureau of Economic Research, Working Paper 12785, December 2006. [Online]. Available: http://www.nber.org/papers/w12785
[165] A. Odlyzko, "Privacy, economics, and price discrimination on the Internet," in Proceedings of the 5th International Conference on Electronic Commerce, ser. ICEC'03. New York, NY, USA: Association for Computing Machinery, 2003, pp. 355–366. [Online]. Available: https://doi.org/10.1145/948005.948051
[166] E. Ostrom, Governing the Commons: The Evolution of Institutions for Collective Action, ser. Canto Classics. Cambridge University Press,
[167] K. Pantelis and L. Aija, "Understanding the value of (big) data," in 2013 IEEE International Conference on Big Data, 2013, pp. 38–42.
[168] K. Pauwels and A. Weiss, "Moving from free to fee: How online firms market to change their business model successfully," Journal of Marketing, vol. 72, no. 3, pp. 14–31, 2008. [Online]. Available: https://doi.org/10.1509/JMKG.72.3.014
[169] A. Pavan, I. Segal, and J.
Toikka, "Dynamic mechanism design: A Myersonian approach," Econometrica, vol. 82, no. 2, pp. 601–653, 2014. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.3982/ECTA10269
[170] L. L. Pipino, Y. W. Lee, and R. Y. Wang, "Data quality assessment," Commun. ACM, vol. 45, no. 4, pp. 211–218, Apr. 2002. [Online]. Available: https://doi.org/10.1145/505248.506010
[171] A. Prasad, V. Mahajan, and B. Bronnenberg, "Advertising versus pay-per-view in electronic media," International Journal of Research in Marketing, vol. 20, no. 1, pp. 13–30, 2003. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167811602001192
[172] T. Qin, W. Chen, and T.-Y. Liu, "Sponsored search auctions: Recent advances and future directions," ACM Trans. Intell. Syst. Technol., vol. 5, no. 4, Jan. 2015. [Online]. Available: https://doi.org/10.1145/2668108
[173] A. Rao, "Online Content Pricing: Purchase and Rental Markets," Marketing Science, vol. 34, no. 3, pp. 430–451, May 2015. [Online]. Available: https://ideas.repec.org/a/inm/ormksc/v34y2015i3p430-451.html
[174] J. M. Rao and D. H. Reiley, "The economics of spam," Journal of Economic Perspectives, vol. 26, no. 3, pp. 87–110, September 2012. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/jep.26.3.87
[175] K. Ren, J. Qin, L. Zheng, Z. Yang, W. Zhang, and Y. Yu, "Deep landscape forecasting for real-time bidding advertising," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD'19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 363–372. [Online]. Available: https://doi.org/10.1145/3292500.3330870
[176] A. Richardson, A. Filos-Ratsikas, and B. Faltings, "Rewarding high-quality data via influence functions," CoRR, vol. abs/1908.11598, 2019. [Online]. Available: http://arxiv.org/abs/1908.11598
[177] J. Riley and W. F. Samuelson, "Optimal auctions," American Economic Review, vol. 71, no. 3, pp. 381–392, 1981. [Online].
Available: https://EconPapers.repec.org/RePEc:aea:aecrev:v:71:y:1981:i:3:p:381-92
[178] A. Roth, "Technical perspective: Pricing information (and its implications)," Commun. ACM, vol. 60, no. 12, p. 78, Nov. 2017. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/3139455
[179] F. Schomm, F. Stahl, and G. Vossen, "Marketplaces for data: An initial survey," SIGMOD Rec., vol. 42, no. 1, pp. 15–26, May 2013. [Online]. Available: https://doi.org/10.1145/2481528.2481532
[180] L. Segoufin and V. Vianu, "Views and queries: Determinacy and rewriting," in Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ser. PODS'05. New York, NY, USA: Association for Computing Machinery, 2005, pp. 49–60. [Online]. Available: https://doi.org/10.1145/1065167.1065174
[181] S. Sen, C. Joe-Wong, S. Ha, and M. Chiang, "A survey of smart data pricing: Past proposals, current plans, and future trends," ACM Computing Surveys, vol. 46, no. 2, Nov. 2013. [Online]. Available: https://doi.org/10.1145/2543581.2543582
[182] C. Shapiro and H. R. Varian, Information Rules: A Strategic Guide to the Network Economy, ser. Strategy/Technology / Harvard Business School Press. Harvard Business School Press, 1998. [Online]. Available: https://books.google.ca/books?id=aE_J4Iv_PVEC
[183] C. Shapiro and H. R. Varian, "Versioning: The smart way to sell information," Harvard Business Review, pp. 106–114, November-December 1998. [Online]. Available: https://hbr.org/1998/11/versioning-the-smart-way-to-sell-information
[184] L. S. Shapley, "A Value for n-Person Games," RAND Corporation, Santa Monica, CA, Tech. Rep. P-295, 1952. [Online]. Available: https://www.rand.org/pubs/papers/P0295.html
[185] M. Shubik, "Auctions, bidding, and markets: An historical sketch," in Auctions, Bidding, and Contracting, M. Shubik and J. Stark, Eds. New York University Press, 1983, pp. 33–52.
[186] R. H. L. Sim, Y. Zhang, M. C. Chan, and B. K. H.
Low, "Collaborative machine learning with incentive-aware model rewards," in Proceedings of the International Conference on Machine Learning 1 pre-proceedings (ICML 2020), 2020.
[187] B. Squire, S. Brown, J. Readman, and J. Bessant, "The impact of mass customisation on manufacturing trade-offs," Production and Operations Management, vol. 15, pp. 10–21, 01 2009.
[188] C. Sunstein, Echo Chambers: Bush V. Gore, Impeachment, and Beyond. Princeton University Press, 2001. [Online]. Available: https://books.google.ca/books?id=sEgHAAAACAAJ
[189] C. Swamy and M. Cheung, "Approximation algorithms for single-minded envy-free profit-maximization problems with limited supply," in 2008 IEEE 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS). Los Alamitos, CA, USA: IEEE Computer Society, Oct. 2008, pp. 35–44. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/FOCS.2008.15
[190] G. Tang, Y. Yang, and J. Pei, "Price information patterns in web search advertising: An empirical case study on accommodation industry," in 2013 IEEE International Conference on Data Mining (ICDM). Los Alamitos, CA, USA: IEEE Computer Society, Dec. 2013, pp. 737–746. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/ICDM.2013.100
[191] R. Tang, A. Amarilli, P. Senellart, and S. Bressan, "Get a sample for a discount," in Database and Expert Systems Applications, H. Decker, L. Lhotská, S. Link, M. Spies, and R. R. Wagner, Eds. Cham: Springer International Publishing, 2014, pp. 20–34.
[192] R. Tang, H. Wu, Z. Bao, S. Bressan, and P. Valduriez, "The price is right," in Database and Expert Systems Applications, H. Decker, L. Lhotská, S. Link, J. Basl, and A. M. Tjoa, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 380–394.
[193] C. R. Taylor, "Consumer Privacy and the Market for Customer Information," RAND Journal of Economics, vol. 35, no. 4, pp. 631–650, Winter 2004. [Online].
Available: https://ideas.repec.org/a/rje/randje/v35y20044p631-650.html
[194] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, "Stealing machine learning models via prediction APIs," in Proceedings of the 25th USENIX Conference on Security Symposium, ser. SEC'16. USA: USENIX Association, 2016, pp. 601–618.
[195] P. Upadhyaya, M. Balazinska, and D. Suciu, "Price-optimal querying with data APIs," Proc. VLDB Endow., vol. 9, no. 14, pp. 1695–1706, Oct. 2016. [Online]. Available: https://doi.org/10.14778/3007328.300
[196] S. van de Sandt, S. Dallmeier-Tiessen, A. Lavasa, and V. Petras, "The definition of reuse," Data Science Journal, vol. 18, no. 1, p. 22, 2019.
[197] H. R. Varian, "Online ad auctions," American Economic Review, vol. 99, no. 2, pp. 430–34, May 2009. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/aer.99.2.430
[198] W. Vickrey, "Counterspeculation, auctions, and competitive sealed tenders," The Journal of Finance, vol. 16, no. 1, pp. 8–37, 1961. [Online]. Available: http://www.jstor.org/stable/2977633
[199] W. Vickrey, "Auctions and bidding games," in Recent Advances in Game Theory. Princeton, New Jersey: Princeton University Conference, 1962, pp. 15–27.
[200] H. von Stackelberg, Market Structure and Equilibrium. J. Springer,
[201] A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis, "Deep learning for computer vision: A brief review," Computational Intelligence and Neuroscience, vol. 2018, p. 7068349, Feb 2018. [Online]. Available: https://doi.org/10.1155/2018/7068349
[202] T. Wagner, A. Benlian, and T. Hess, "Converting freemium customers from free to premium–the role of the perceived premium fit in the case of music as a service," Electronic Markets, vol. 24, pp. 259–268, 12
[203] J. Waldfogel, "Copyright research in the digital age: Moving from piracy to the supply of new products," American Economic Review, vol. 102, no. 3, pp. 337–42, May 2012. [Online].
Available: http://www.aeaweb.org/articles?id=10.1257/aer.102.3.337
[204] R. Y. Wang and D. M. Strong, "Beyond accuracy: What data quality means to data consumers," Journal of Management Information Systems, vol. 12, no. 4, pp. 5–33, 1996. [Online]. Available: https://doi.org/10.1080/07421222.1996.11518099
[205] T. Wang, J. Rausch, C. Zhang, R. Jia, and D. Song, "A principled approach to data valuation for federated learning," ArXiv, vol. abs/2009.06192, 2020.
[206] Z. Wang, H. Zhu, Z. Dong, X. He, and S. Huang, "Less is better: Unweighted data subsampling via influence function," CoRR, vol. abs/1912.01321, 2019. [Online]. Available: http://arxiv.org/abs/1912.01321
[207] H. L. Williams, "Intellectual property rights and innovation: Evidence from the human genome," Journal of Political Economy, vol. 121, no. 1, pp. 1–27, 2013. [Online]. Available: https://doi.org/10.1086/669706
[208] C. Wu, R. Buyya, and K. Ramamohanarao, "Cloud pricing models: Taxonomy, survey, and interdisciplinary challenges," ACM Comput. Surv., vol. 52, no. 6, Oct. 2019. [Online]. Available: https://doi.org/10.1145/3342103
[209] S. Wu and R. Banker, "Best pricing strategy for information services," Journal of the Association of Information Systems, vol. 11, no. 6, pp. 339–366, Jan. 2010.
[210] S. Wu and P. Pavlou, "On the optimal fixed-up-to pricing for information services," Journal of the Association of Information Systems, vol. 20, no. 10, pp. 1447–1474, Jan. 2019.
[211] X. Wu, W. Zhang, and W. Dou, "Pricing as a service: Personalized pricing strategy in cloud computing," in 2012 IEEE 12th International Conference on Computer and Information Technology, Oct 2012, pp. 1119–1124.
[212] X. Wu, X. Ying, K. Liu, and L. Chen, "A survey of privacy-preservation of graphs and social networks," in Managing and Mining Graph Data, C. C. Aggarwal and H. Wang, Eds. Boston, MA: Springer US, 2010, pp. 421–453. [Online]. Available: https://doi.org/10.1007/978-1-4419-6045-0_14
[213] C. Xia and S.
Muthukrishnan, "Arbitrage-free pricing in user-based markets," in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, ser. AAMAS'18. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2018, pp. 327–335.
[214] H. Yang, "Targeted search and the long tail effect," RAND Journal of Economics, vol. 44, no. 4, pp. 733–756, December 2013.
[215] Y. Yang, X. Mao, J. Pei, and X. He, "Continuous influence maximization: What discounts should we offer to social network users?" in Proceedings of the 2016 International Conference on Management of Data, ser. SIGMOD'16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 727–741. [Online]. Available: https://doi.org/10.1145/2882903.2882961
[216] Y. Yang, Q. S. Lu, G. Tang, and J. Pei, "The Impact of Market Competition on Search Advertising," Journal of Interactive Marketing, vol. 30, no. C, pp. 46–55, 2015. [Online]. Available: https://ideas.repec.org/a/eee/joinma/v30y2015icp46-55.html
[217] J. Yoon, S. Arik, and T. Pfister, "Data valuation using reinforcement learning," in Proceedings of the International Conference on Machine Learning 1 pre-proceedings (ICML 2020), 2020.
[218] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75,
[219] H. Yu and M. Zhang, "Data pricing strategy based on data quality," Computers & Industrial Engineering, vol. 112, pp. 1–10, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0360835217303509
[220] M. Zhang and F. Beltran, "A survey of data pricing methods," SSRN, April 2020. [Online]. Available: https://ssrn.com/abstract=3609120 or http://dx.doi.org/10.2139/ssrn.3609120
[221] X. M. Zhang and F. Zhu, "Group size and incentives to contribute: A natural experiment at Chinese Wikipedia," American Economic Review, vol. 101, no. 4, pp.
1601–15, June 2011. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/aer.101.4.1601
[222] J. Zhao, G. Qiu, Z. Guan, W. Zhao, and X. He, "Deep reinforcement learning for sponsored search real-time bidding," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD'18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 1021–1030. [Online]. Available: https://doi.org/10.1145/3219819.3219918
[223] Z. Zheng, Y. Peng, F. Wu, S. Tang, and G. Chen, "An online pricing mechanism for mobile crowdsensing data markets," in Proceedings of the 18th ACM International Symposium on Mobile Ad Hoc Networking and Computing, ser. Mobihoc'17. New York, NY, USA: Association for Computing Machinery, 2017. [Online]. Available: https://doi.org/10.1145/3084041.3084044
[224] B. Zhou, J. Pei, and W. Luk, "A brief survey on anonymization techniques for privacy preserving publishing of social network data," SIGKDD Explor. Newsl., vol. 10, no. 2, pp. 12–22, Dec. 2008. [Online]. Available: https://doi.org/10.1145/1540276.1540279
[225] Y. Zhou, U. Porwal, C. Zhang, H. Ngo, X. Nguyen, C. Ré, and V. Govindaraju, "Parallel feature selection inspired by group testing," in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS'14. Cambridge, MA, USA: MIT Press, 2014, pp. 3554–3562.

Journal

Computing Research Repository, arXiv (Cornell University)

Published: Sep 9, 2020
