Chunbin Lin, Jiaheng Lu, T. Ling, Bogdan Cautis (2012). LotusX: A Position-Aware XML Graphical Search System with Auto-Completion. 2012 IEEE 28th International Conference on Data Engineering.
G. Nemhauser, L. Wolsey, M. Fisher (1978). An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14.
Aston Zhang, Amit Goyal, Weize Kong, Hongbo Deng, Anlei Dong, Yi Chang, Carl Gunter, Jiawei Han (2015). adaQAC: Adaptive Query Auto-Completion via Implicit Negative Feedback. Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval.
Nandish Jayaram, Mahesh Gupta, Arijit Khan, Chengkai Li, Xifeng Yan, R. Elmasri (2014). GQBE: Querying knowledge graphs by example entity tuples. 2014 IEEE 30th International Conference on Data Engineering.
Chaohui Wang, Miao Xie, S. Bhowmick, Byron Choi, Xiaokui Xiao, Shuigeng Zhou (2020). FERRARI: an efficient framework for visual exploratory subgraph search in graph databases. The VLDB Journal.
Robert Pienta, Fred Hohman, Acar Tamersoy, A. Endert, S. Navathe, Hanghang Tong, Duen Chau (2017). Visual Graph Query Construction and Refinement. Proceedings of the 2017 ACM International Conference on Management of Data.
Manasi Vartak, Sajjadur Rahman, S. Madden, Aditya Parameswaran, N. Polyzotis (2015). SeeDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics. Proceedings of the VLDB Endowment, 8.
Y. Ioannidis, Stratis Viglas (2006). Conversational querying. Inf. Syst., 31.
Arnab Nandi, H. Jagadish (2007). Assisted querying using instant-response interfaces.
G. Marchionini (2006). Exploratory search. Communications of the ACM, 49.
S. Boyd, L. Vandenberghe (2004). Convex Optimization. doi:10.1017/CBO9780511804441.
Kai Huang, Huey-Eng Chua, S. Bhowmick, Byron Choi, Shuigeng Zhou (2019). CATAPULT: Data-driven Selection of Canned Patterns for Efficient Visual Graph Query Formulation. Proceedings of the 2019 International Conference on Management of Data.
Jia Li, Yang Cao, Shuai Ma (2017). Relaxing Graph Pattern Matching With Explanations. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management.
J. Leskovec, C. Faloutsos (2006). Sampling from large graphs.
(2017). The ubiquity of large graphs and surprising challenges of graph processing.
D. Mottin, Emmanuel Müller (2017). Graph Exploration: From Users to Large Graphs. Proceedings of the 2017 ACM International Conference on Management of Data.
Yunyao Li, Cong Yu, H. Jagadish (2008). Enabling Schema-Free XQuery with meaningful query focus. The VLDB Journal, 17.
D. Mottin, F. Bonchi, Francesco Gullo (2015). Graph Query Reformulation with Diversity. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
L. Cordella, P. Foggia, Carlo Sansone, M. Vento (2004). A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26.
H. Bast, Ingmar Weber (2006). Type less, find more: fast autocompletion search with a succinct index. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Peipei Yi, Byron Choi, S. Bhowmick, Jianliang Xu (2016). AutoG: a visual query autocompletion framework for graph databases. The VLDB Journal, 26.
S. Comai, E. Damiani, P. Fraternali (2001). Computing graphical queries over XML data. ACM Trans. Inf. Syst., 19.
Peipei Yi, Byron Choi, Zhiwei Zhang, S. Bhowmick, Jianliang Xu (2022). GFocus: User Focus-Based Graph Query Autocompletion. IEEE Transactions on Knowledge and Data Engineering, 34.
Xifeng Yan, Jiawei Han (2002). gSpan: graph-based substructure pattern mining. 2002 IEEE International Conference on Data Mining.
Lilong Jiang, Arnab Nandi (2015). SnapToQuery: Providing Interactive Feedback during Exploratory Query Specification. Proc. VLDB Endow., 8.
S. Abiteboul, Yael Amsterdamer, T. Milo, P. Senellart (2012). Auto-completion learning for XML. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data.
Daniele Braga, A. Campi, S. Ceri (2005). XQBE (XQuery By Example): A visual interface to the standard XML query language. ACM Trans. Database Syst., 30.
Arnab Nandi, H. Jagadish (2007). Effective Phrase Prediction.
Arnab Nandi, Lilong Jiang, Michael Mandel (2013). Gestural Query Specification. Proc. VLDB Endow., 7.
Jianhua Feng, Guoliang Li (2012). Efficient Fuzzy Type-Ahead Search in XML Data. IEEE Transactions on Knowledge and Data Engineering, 24.
Mohammed Elseidy, Ehab Abdelhamid, Spiros Skiadopoulos, Panos Kalnis (2014). GRAMI: Frequent Subgraph and Pattern Mining in a Single Large Graph. Proc. VLDB Endow., 7.
Yinghui Wu, Shengqi Yang, M. Srivatsa, A. Iyengar, Xifeng Yan (2013). Summarizing Answer Graphs Induced by Keyword Queries. Proc. VLDB Endow., 6.
Ho Hung, S. Bhowmick, Ba Truong, Byron Choi, Shuigeng Zhou (2013). QUBLE: blending visual subgraph query formulation with query processing on large networks.
Chuan Xiao, Jianbin Qin, Wei Wang, Y. Ishikawa, K. Tsuda, K. Sadakane (2013). Efficient Error-tolerant Query Autocompletion. Proc. VLDB Endow., 6.
Nandish Jayaram, Sidharth Goyal, Chengkai Li (2015). VIIQ: Auto-Suggestion Enabled Visual Interface for Interactive Graph Query Formulation. Proc. VLDB Endow., 8.
J. McGregor (1982). Backtrack search algorithms and the maximal common subgraph problem. Software: Practice and Experience, 12.
S. Bhowmick, Byron Choi, C. Dyreson (2016). Data-driven Visual Graph Query Interface Construction and Maintenance: Challenges and Opportunities. Proc. VLDB Endow., 9.
Nathan Ng, Peipei Yi, Zhiwei Zhang, Byron Choi, S. Bhowmick, Jianliang Xu (2019). FGreat: Focused Graph Query Autocompletion. 2019 IEEE 35th International Conference on Data Engineering (ICDE).
FLAG: Towards Graph Query Autocompletion for Large Graphs

*Peipei Yi (pyi@lenovo.com) · *Byron Choi (bchoi@comp.hkbu.edu.hk) · Jianping Li (csjpli@comp.hkbu.edu.hk) · Sourav S. Bhowmick (assourav@ntu.edu.sg) · Jianliang Xu (xujl@comp.hkbu.edu.hk)
Peipei Yi: Machine Intelligence Center, Lenovo, Hong Kong, Hong Kong. Byron Choi, Jianping Li, Jianliang Xu: Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong. Sourav S. Bhowmick: School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore.

Graph query autocompletion (GQAC) takes a user's graph query as input and generates top-k query suggestions as output, to help alleviate the verbose and error-prone graph query formulation process in a visual interface. To compose a target query with GQAC, the user may iteratively adopt suggestions or manually add edges to augment the existing query. The current state of the art of GQAC, however, focuses only on large collections of small- or medium-sized graphs. The subgraph features exploited by existing GQAC are either too small or too scarce in large graphs. In this paper, we present Flexible graph query autocompletion for LArge Graphs, called FLAG. We are the first to propose wildcard labels in the context of GQAC, which summarize query structures that have different labels. FLAG allows augmenting users' queries with subgraph increments carrying wildcard labels to form suggestions. To support wildcard-enabled suggestions, a new suggestion ranking function is proposed. We propose an efficient ranking algorithm and extend an index to further optimize the online suggestion ranking. We have conducted a user study and a set of large-scale simulations to verify both the effectiveness and efficiency of FLAG. The results show that the query suggestions saved roughly 50% of mouse clicks and that FLAG returns suggestions in a few seconds.

Keywords Subgraph query · Query autocompletion · Large graphs · Database usability

1 Introduction

Researchers and practitioners perform different types of queries on large graphs [30]. Formulating a subgraph matching query, among others, requires significant user effort. A popular approach to providing query formulation aids is to build visual query interfaces (a.k.a. GUIs) that facilitate the drawing of query graphs in an easy and intuitive manner. Real-world visual query interfaces (e.g., PubChem, ChemSpider, and Scaffold Hunter) have already been offered. (PubChem: https://pubchem.ncbi.nlm.nih.gov/search/; ChemSpider: http://www.chemspider.com/; Scaffold Hunter: http://scaffoldhunter.sourceforge.net/.) However, composing graph queries in a visual environment may still be cumbersome.

To alleviate the burden of visual graph query formulation, graph query autocompletion (GQAC) [36, 37] has been proposed. Consider a scenario in which a user formulates a target query graph q_t iteratively via the GUI. Given an existing, partially formulated query graph q, GQAC aims to suggest a subgraph increment Δq to q to form a query suggestion, such that the suggestion is closer to the target query q_t. Since users' intentions are hard to predict, GQAC typically returns k suggestions on a visual interface for users to choose from. An example GUI, the user's current query, and the suggestions of GQAC are shown in Fig. 1. We mimicked the example figure style of a related work [37] for presentation consistency. (Prototypes: http://autog.comp.hkbu.edu.hk:8000/autog/ and http://autog.comp.hkbu.edu.hk:8000/gfocus/.)

Fig. 1 A typical GUI and query suggestions of GQAC

Existing studies only consider the GQAC problem for large collections of small graphs (e.g., chemical databases) and cannot be directly applied to large graphs.
In particular, previous studies construct query suggestions based on some popular substructures (a.k.a. features) of the graph data. This assumes that users want to construct queries to retrieve some graphs. For instance, we may set the minimum support of frequent substructures to 10% of the dataset size for PubChem, which yields approximately a thousand features for GQAC. In large graphs, however, such features are smaller in size and scarcer in quantity. For example, frequent subgraphs are very few in CiteSeer: fewer than 10 frequent subgraphs were reported for various support threshold values [8]. This phenomenon leads to two main challenges of GQAC for large graphs. First, there are a large number of distinct subgraphs, and each of them has small support in the graph. Candidate suggestions generated from them are many but rare in the graph data. Further, the visual interface shows only k suggestions, and humans may only interpret a small set of suggestions in practice. Such k suggestions may not be useful. We illustrate the second challenge with Example 1.

Example 1 Suppose the current query is q. The first suggestion in Fig. 1 (q+Δq_1) increments q by one edge, which may not save much query formulation effort. The last three suggestions increment q by two edges. They have, however, become overly specific: each of them appears only a few times in the data graph. The three suggestions also occupy a relatively large area of the GUI. It is desirable to efficiently summarize the specific suggestions, rank the generalized one high, and leave room for others.

To address the aforementioned challenges, we propose Flexible graph query autocompletion for LArge Graphs (FLAG). To tackle the first challenge, we propose wildcard labels for GQAC. A wildcard label represents any label of the data graph. It is suitable for GQAC on a large graph for two reasons. First, FLAG can then provide suggestions that contain wildcard labels. An example is shown as q* in Fig. 2: q* summarizes (or generalizes) the suggestions q+Δq_2, q+Δq_3, and q+Δq_4 of Fig. 1, each of which has only little support from the graph data, and thereby spares some space in the top-k list for other suggestions. Second, wildcards can be naturally used when users are not sure about the labels of the nodes/edges of the query graph, while FLAG still suggests new edges. To avoid wildcards appearing in arbitrary places of query suggestions, we propose well-formed suggestions.

Fig. 2 Example of query and the suggestions with wildcards

To address the second challenge, we introduce query generalization and query specialization of suggestions for GQAC. Query specialization is an operator for augmenting an existing query into one that is closer to the target query. It also quantifies how much a suggestion augments the existing query. In each specialization, the user either (i) adds a wildcard edge or (ii) changes a wildcard label to an exact label. The introduction of wildcard labels does not alter the asymptotic complexity of query processing (e.g., subgraph matching) or query autocompletion (e.g., suggestion ranking). Next, we propose query generalization, the opposite of query specialization. Recall from Fig. 2 that three suggestions are generalized into one, so that the support of the generalized suggestion becomes higher: q* is more specific than q+Δq_1 but more general than q+Δq_2, q+Δq_3, and q+Δq_4.

Wildcard-enabled GQAC may generate numerous candidate suggestions, and their ranking can be inefficient. We propose a novel linear submodular ranking function that involves not only a query suggestion's specialization of the current query but also its summarization of the possible candidate suggestions. Specifically, we propose the specialization value to quantify how much a suggestion augments the existing query, and the summarization value to quantify how many candidate suggestions a suggestion summarizes. The approximation of the ranking function is differentiable; hence, we can adopt a stochastic gradient descent algorithm to learn the parameters of the ranking function. It is also not surprising that the ranking problem is NP-hard. Since the ranking function is submodular, we propose an efficient greedy algorithm for computing the top-k suggestions. To further optimize efficiency, we extend an existing index to support wildcards in ranking. (When query logs, which are also graphs, are available, GQAC may generate suggestions from them as well.)

In conclusion, this paper makes the following contributions.

1. We propose wildcard labels for query graphs and query suggestions, and a notion of well-formed wildcard graphs for GQAC.
2. We propose the specialization value and the summarization value to measure how much a suggestion specializes an existing query and summarizes other candidate suggestions.
3. We propose a ranking function based on the specialization and summarization values.
4. To optimize the efficiency of online ranking of query suggestions, we present the techniques needed to extend an existing index for wildcard-enabled GQAC.
5. We use a stochastic gradient descent algorithm to learn the parameters of the ranking function in experiments. We investigate the usefulness and efficiency of FLAG via a user study and extensive simulations. The results show that FLAG saves about 50% of mouse clicks in query formulation and that suggestions are returned in several seconds under a large variety of settings.

The rest of the paper is organized as follows. Section 2 provides the background of GQAC. Section 3 proposes wildcard labels for GQAC. Section 4 proposes the specialization value and the summarization value for query suggestions. Section 5 provides the details of efficient online suggestion ranking. We present a performance study in Sect. 6 and discuss related work in Sect. 7. Section 8 concludes the paper and presents some future work.

2 Preliminaries

This section provides the preliminaries of graph query autocompletion (GQAC) and presents the problem being studied. Some frequently used notations are listed in Table 1.

Table 1 Frequently used notations

Symbol     Meaning
q          The current (or existing) query
q′         A query suggestion (or simply suggestion)
Δq         Query increment (adding Δq to q yields q′)
Q          Query suggestions
q ⊆_λ q′   q is a subgraph of q′ and λ is the embedding of q in q′
f          A (proper) connected subgraph of the query

2.1 Background on Graph Query Autocompletion (GQAC)

2.1.1 Graph Data

We consider a single large graph G = (V, E, l), consisting of a set of nodes V, a set of edges E, and a labeling function l that assigns labels to nodes and edges. The size of a graph q is defined by |E_q|. deg is a function that returns the degree of a vertex. For example, Fig. 1 shows the CiteSeer graph. Node labels represent the area of a publication (e.g., DB, DM, IR) and edge labels represent the distance between a pair of publications. This dataset will be used in subsequent examples. For presentation simplicity, all examples illustrate undirected graphs with a single label for each node and edge.

2.1.2 Query Formalism

This paper adopts subgraph isomorphism, a popular and fundamental query formalism, for the technical discussions. Subgraph isomorphism is recalled below.

Definition 1 (Subgraph isomorphism) Given two graphs g = (V, E, l) and g′ = (V′, E′, l′), g is a subgraph of g′, denoted as g ⊆_λ g′, iff there is an injective function (or embedding) λ : V ↦ V′ such that

1. ∀u ∈ V, λ(u) ∈ V′ such that match(l(u), l′(λ(u))) = true; and
2. ∀(u, v) ∈ E, (λ(u), λ(v)) ∈ E′ such that match(l(u, v), l′(λ(u), λ(v))) = true,

where match(l_1, l_2) = true iff l_1 = l_2.

Multiple subgraph isomorphic embeddings of g in g′ may exist, denoted as λ⁰_{g,g′}, λ¹_{g,g′}, …, λᵐ_{g,g′}. For succinct presentation, we refer to each as an embedding λ when the subscripts and superscripts are clear from or irrelevant to the context. (Note that this is different from the embedding that maps a graph into a d-dimensional space in the context of machine learning.)

Definition 2 (Subgraph query) Given a single large graph G and a query graph q, the answer (or result set) of q is G_q = {λ | q ⊆_λ G}.

Fig. 3 FLAG: graph query autocompletion for large graphs
Fig. 4 An illustration of query composition: forming a large query graph from small graphs

2.1.3 Visual Graph Query Construction

Graphs and their query graphs can be intuitively displayed and drawn in a visual environment (e.g., the GUI of Fig. 1). In the process of visual query construction, the user draws the current query q in the query editor with the target query q_t in mind, and performs an action (e.g., adding an edge or a subgraph to q) to make the current query closer to the target. Performing this process manually can be error-prone.
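To make Definitions 1 and 2 concrete, the embedding test can be sketched as a small backtracking matcher. This is an illustrative sketch only: the graph representation and the names (`Graph`, `match`, `embeddings`) are ours, and practical systems use optimized subgraph isomorphism algorithms (e.g., that of Cordella et al.) rather than this naive enumeration.

```python
# A tiny labeled graph and a brute-force embedding enumerator (sketch).
class Graph:
    def __init__(self, node_labels, edge_labels):
        self.node_labels = dict(node_labels)                 # node -> label
        self.edge_labels = {frozenset(e): l                  # {u, v} -> label
                            for e, l in edge_labels.items()}

    def edge_label(self, u, v):
        return self.edge_labels.get(frozenset((u, v)))

def match(l1, l2):
    # Exact label matching of Def. 1 (Sect. 3 relaxes this with wildcards).
    return l1 == l2

def embeddings(q, g):
    """Enumerate injective, label-preserving mappings V(q) -> V(g) (Def. 1)."""
    nodes_q = list(q.node_labels)

    def backtrack(i, lam):
        if i == len(nodes_q):
            yield dict(lam)
            return
        u = nodes_q[i]
        for v in g.node_labels:
            if v in lam.values() or not match(q.node_labels[u], g.node_labels[v]):
                continue
            # Every query edge between u and an already-mapped node must
            # exist in g with a matching label.
            ok = all(
                g.edge_label(v, lam[u2]) is not None
                and match(q.edge_label(u, u2), g.edge_label(v, lam[u2]))
                for u2 in lam if q.edge_label(u, u2) is not None
            )
            if ok:
                lam[u] = v
                yield from backtrack(i + 1, lam)
                del lam[u]

    yield from backtrack(0, {})
```

Per Def. 2, the answer set of q over G is then `list(embeddings(q, G))`.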
2.1.4 Graph Query Autocompletion (GQAC)

A visual environment often provides visual aids for query construction, in addition to the basic constructs (e.g., the node and edge labels shown in the label panel of Fig. 1). Recently, Yi et al. [36] proposed GQAC, which aims at relieving users from cumbersome actions by providing useful subgraph suggestions. The process of GQAC is sketched in Fig. 3. Here, we present its major steps; the details related to FLAG are postponed to later sections.

1. GQAC takes the user's current query q and the user's preference for suggestions as input. Voluminous candidate query suggestions are generated and ranked. A query suggestion is a graph that augments the current query with structure and/or labels; note that the increments to the query can be subgraphs. A small set of ranked query suggestions is efficiently generated for the user's review.
2. The user may compose the query by either adopting a suggestion or manually adding other edges.
3. The above steps are repeated until the target query is constructed.

2.1.5 Formalizing GQAC

Recall that query suggestions are formed by incrementing the current query with a subgraph. The current state of the art of GQAC [36] exploits the concept of graph features (or simply features). Graph features are generally understood as subgraphs that carry important characteristics of the graph data. Features have also been considered the tokens of GQAC. For example, an existing work on GQAC [36] decomposes the current query into a set of features and augments the current query with another feature to form a query suggestion. The intuition is that users may want to specify some characteristics of the graph in their target queries. While existing work uses c-prime features as the features for GQAC, other features can be plugged into GQAC, depending on the users' applications.

The composition of two subgraphs (incrementing a subgraph with another) can be intuitively understood as a one-step construction of a query suggestion, which can be formally defined as a function. We recall some relevant definitions below.

Definition 3 (Common subgraph) Given two graphs g_1 and g_2, a common subgraph of g_1 and g_2, denoted as cs(g_1, g_2) (or simply cs when g_1 and g_2 are clear from the context), is a connected subgraph containing at least one edge that is a subgraph of both g_1 and g_2, i.e., cs ⊆_λ1 g_1 and cs ⊆_λ2 g_2 for some λ_1 and λ_2.

Definition 4 (compose for query composition) [36] compose is a function that takes two graphs, g_1 and g_2, and the corresponding embeddings (λ_1 and λ_2) of a common subgraph cs as input, and returns the graph g that is composed from g_1 and g_2 via λ_1 and λ_2 of cs, respectively, denoted by g = compose(g_1, g_2, cs, λ_1, λ_2).

Example 2 An example of query composition is shown in Fig. 4. Assume that f_10 is the current query and f_13 is the graph feature used to increment f_10. Then, g is the query graph formed by adding f_13 to f_10 via the common subgraph f_4, i.e., g = compose(f_10, f_13, f_4, λ_1, λ_2). The increment is highlighted in blue with a gray background. The embeddings λ_1 and λ_2 specify the locations of f_4 in f_10 and f_13, respectively.

Definition 5 (Useful suggestion) Given a target query q_t and an existing (or current) query q, a query suggestion q′ is useful if and only if q ⊂_λ1 q′ and q′ ⊆_λ2 q_t, for some λ_1 and λ_2.

(We remark that existing GQAC systems do not rank infrequent items of a database high, if at all. In the context of web search, some infrequent phrases are also not suggested. A possible reason is that users can simply run those queries without further constructing them and manually examine the few results to obtain their desired information.)

As motivated, users' target queries are hard to predict. Recently, GQAC systems have proposed various ranking mechanisms (according to users' preferences and a ranking function) to efficiently compute a small list of suggestions, with the hope that they are useful. Ranking factors include the result counts of the suggested queries and the structural diversity of the suggestions. It is not surprising that the suggestion ranking problems are generally intractable; hence, greedy algorithms have been proposed to efficiently rank the useful query suggestions.

2.1.6 Problem Statement

Given a large graph G, an existing query q, a ranking function, a user preference, and a parameter k, this paper investigates how to return query suggestions Q_k = {q′_1, q′_2, …, q′_k} such that for i ∈ [1, k], q′_i is composed by adding an increment to q, and Q_k is the top-k suggestions w.r.t. the ranking function and the user preference.

To the best of our knowledge, this paper is the first work that computes query suggestions for querying a single large graph, and wildcards for GQAC have not been proposed before.

3 Wildcard Labels for GQAC

In this section, we propose wildcard labels to generalize similar substructures into a summary structure. We further discuss how to introduce wildcard labels into the process of GQAC (e.g., graph features and query compositions).

3.1 Wildcard Labels and Graphs

A wildcard label (or simply wildcard) represents any possible label of nodes/edges and is assigned to new unlabeled nodes and edges by default, meaning that the labels are not yet specified. Figure 5 shows an ordinary feature (f_13) and features having a wildcard label on an edge (f_8 and f_9). The query formalism of subgraph isomorphism can be readily extended with wildcards by simply replacing the matching function match of Def. 1 with match*, where match*(l_1, l_2) = true iff l_1 = "*" or l_1 = l_2.

Fig. 5 Adding wildcard labels to features

Definition 6 (Wildcard graph) A graph with wildcard labels "*", denoted as G*, is defined as a 3-ary tuple (V, E, λ), where V and E are the node and edge sets and the label function assigns ordinary labels or "*" to each node or edge. (We use G* to denote a graph having wildcards but may omit "*" when it is not relevant to the discussion.)

A wildcard can be introduced to query graphs manually by users or suggested by GQAC. When introducing wildcards to GQAC, the features to be added to an existing query must allow wildcards. However, this leads to an exponential blowup in the number of features used by existing GQAC for constructing query suggestions. Having too many wildcards in queries or suggestions is not only computationally costly to generate and rank but also confusing to users. Furthermore, suggestions having wildcards should be neither too generic nor too specific with respect to the closest ordinary suggestion. To this end, we restrict wildcards to occur at leaf nodes/edges only (see Def. 7). Hence, users may often expand their query graphs at the boundaries.

Definition 7 (Well-formed wildcard graph) A graph G is a well-formed wildcard graph if it is a wildcard graph and all wildcard labels are on one leaf edge and the incident leaf node.

3.2 Wildcard Features for GQAC

While wildcards may still significantly increase the number of features, and hence query suggestions, not every wildcard feature is useful. Consider an extreme case where a frequent feature f and a wildcard feature f* of the same size have the same result set, i.e., f ⊆_λ f*, |f| = |f*|, and D_f = D_f*; it is then not necessary to consider f* in GQAC. Among wildcard features and ordinary features with the same result set, it is sufficient to increment the existing query with the ordinary feature, so such an f* can be omitted from GQAC. Recall that GQAC generates query suggestions by adding a feature from a feature set to the existing query. We propose independent wildcard features such that the features retrieve different results from the data.
Definition 8 (Independent Wildcard Feature (IWF)) A wildcard feature f* is independent w.r.t. a feature set F if:

1. there exists F_1 ⊆ F and λ_1 for f_1 ∈ F_1 such that f_1 ⊆_λ1 f* and |f*| = |f_1| + 1;
2. there exists F_2 ⊆ F and λ_2 for f_2 ∈ F_2 such that f_2 ⊆_λ2 f* and |f*| = |f_2|; and
3. |D_{f*}| / |D_{f_1}| ≥ τ and |D_{f*}| / |D_{f_2}| ≥ τ for all f_1 ∈ F_1 and f_2 ∈ F_2, where τ is a constant called the independent ratio.

The independent ratio has the following properties: (i) τ ≥ 1; and (ii) the larger the value of τ, the more the feature result sets differ and, intuitively, the more independent the wildcard features are with respect to their closest ordinary features F_1 and F_2. A feature f* is introduced to GQAC if it is independent enough from F.

The detailed process of generating independent, well-formed wildcard features for graph query autocompletion is presented in Algo. 1. We adopt existing studies of feature mining [35] to obtain a set of features F = {f_1, f_2, …, f_n}. Then, we add wildcard labels one by one to each f to obtain wildcard features F* that are both independent and well-formed. Applying the concepts introduced in Defs. 8 and 7, we iteratively generate all wildcard features by substituting the labels on one leaf edge with wildcards. Meanwhile, we eliminate the wildcard features that are dependent on existing ordinary features.

Example 3 We illustrate the process of adding wildcard labels to features with Fig. 5. Given the ordinary feature f_13, the edge DB-DB connects a leaf node DB. We replace the labels on the edge DB-DB with wildcard labels to obtain the wildcard features f_9, f_8, and f_6, which can be regarded as generalizing the labels on the edge DB-DB with wildcards.

3.3 Composition of Well-Formed Wildcard Features

The features discussed earlier can be the tokens for query autocompletion. Suggestions with wildcards are constructed by adding a feature to an existing query graph. The query composition (a one-step query suggestion construction, Def. 4) can be readily extended: given a query composition compose(g_1, g_2, cs, λ_1, λ_2), g_1 and g_2 can be wildcard features, and compose is restricted to return a well-formed suggestion.

Example 4 Recall the query composition compose(f_10, f_13, f_4, λ_1, λ_2) in Example 2. We add wildcard labels to f_13 as in Example 3 and obtain the wildcard features {f_9, f_8, f_6}. A wildcard composition can be obtained by simply substituting f_13 in the composition with any of the wildcard features. One such wildcard composition, compose(f_10, f_8, f_4, λ_1, λ_2), is illustrated in Fig. 6.

Fig. 6 An illustration of wildcard composition

4 Query Specialization and Query Summarization

The previous sections presented the features and their composition. In this section, we formalize query specialization for modeling the whole query-suggestion construction process. We propose the specialization value to quantify how a query graph is specialized from an empty graph, and the summarization value to quantify how one wildcard query suggestion summarizes other suggestions.

4.1 Query Specialization

4.1.1 Specialization Order (≺)

The specialization order is a partial order defined between two query graphs. The intuition is that a more specialized query is closer to the target query; the order also models that one query is constructed from the other. We formally define the specialization operators and the specialization order as follows.

The specialization operators are the following two:

1. Wildcard-edge addition, applied to (q, e:(u, v)): adds a new edge e, where λ(e) is a "*" label if u and v are existing nodes, and λ(v) is also a "*" label if v is a new node.
2. Label specialization, applied to (q, e): replaces a "*" label with a specific label on the edge or node of e.
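A minimal sketch of the two specialization operators, using our own query representation (a dict of labeled nodes and edges; these names are not FLAG's own). Per the paper, the specialization value of a query counts applications of these operators starting from an empty graph.

```python
WILDCARD = "*"

def add_wildcard_edge(q, u, v):
    """First operator: add edge (u, v) labeled "*"; if v is new, v is also
    labeled "*". `q` is {"nodes": {node: label}, "edges": {frozenset: label}}."""
    q = {"nodes": dict(q["nodes"]), "edges": dict(q["edges"])}
    if v not in q["nodes"]:
        q["nodes"][v] = WILDCARD
    q["edges"][frozenset((u, v))] = WILDCARD
    return q

def specialize_label(q, item, label):
    """Second operator: replace a "*" label on a node (plain key) or an
    edge (frozenset key) with a concrete label."""
    q = {"nodes": dict(q["nodes"]), "edges": dict(q["edges"])}
    store = q["edges"] if isinstance(item, frozenset) else q["nodes"]
    assert store[item] == WILDCARD, "only wildcard labels can be specialized"
    store[item] = label
    return q
```

For example, turning a lone DB node into the query DB-[2]-IR takes three operator applications: one wildcard-edge addition followed by two label specializations, matching the way specialization values accumulate in Fig. 7.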
In other 13 14 21 22 1 1 2 2 DB DB DB words, q summarizes all the query graphs that specialize q. q q q q The summarization of a set of graphs Q is as follows. (Q)= (q). Fig. 7 An illustration of specialization orders and values q∈Q Example 6 Continuing with Fig. 7, given four query Definition 9 (Specialization order ( ≺ )) Given two query graphs, the specialization order of the query graphs is � � � � graphs q =(V , E, l) and q =(V , E , l ) , q specializes q, � �� ��� q ≺ q ≺ q ≺ q . Then, (q)= {q, q , q , q } , and denoted as q ≺ q iff there is an injective (or embedding) � � �� ��� (q )={q , q , q }. function ∶ V → V such that q ⊆ q . When the user formulates the query graph, both the num- 4.1.2 Specialization Value ( ) ber of query results and the possible suggestions are decreas- ing. This property (see Prop. 1) can be used to reduce the To further measure the different degree of specialization number of candidate suggestions for efficient GQAC . In par- of query graphs, we propose specialization value based on ticular, if g is an answer for query q , then g is an answer for the specialization operators, in Def. 10. In addition, given a every query q that summarizes q . On the other hand, if g is suggestion to an existing query, the difference of their spe- not an answer for q, then g is not an answer for every query cialization values captures how much does the suggestion q that specializes q. This is formally described as follows. augment the query. For simplicity, Def. 10 assumes that all operators are equal. Proposition 1 Given two query graphs q and q , where q specializes q (i.e., q ≺ q ) via a series of specialization Definition 10 (Specialization value ( )) The specialization � operators. Then, q ≺ q ⇒ ∀ g ∈ G � , ∃ g ∈ G s.t. g ⊆ g . q q 𝜆 value of a query graph q is the number of specialization operators needed to formulate q from an empty graph q , denoted as (q). 
P. Yi et al.

Example 5 We illustrate the specialization order and the specialization value with Fig. 7. The specialization order of the four query graphs is q ≺ q′ ≺ q″ ≺ q‴. The existing query is q (leftmost), with a wildcard. The specialization value of q is 13, indicated in bold at the center of q. After specializing the wildcard of q into the label DB, the user obtains q′, with the specialization value increased by 1. Then, the user adopts a suggestion with a wildcard to get q″, with the specialization value increased by 7. At last, the user specializes the wildcard to a specific label and obtains the target query q‴.

4.2 Query Summarization

4.2.1 Summarization Set (Σ)

To model how likely a query q is useful to the user, we compute how many suggestions can be specialized from q. We formally define the summarization set to denote such suggestions.

Definition 11 (Summarization set (Σ)) The summarization set of a query q, denoted as Σ(q), contains all the non-empty query graphs that specialize q:

Σ(q) = {q′ | q ≺ q′, G_q′ ≠ ∅},

where G_q′ is the subgraph query result set of q′. In other words, q summarizes all the query graphs that specialize q. The summarization of a set of graphs Q is Σ(Q) = ∪_{q∈Q} Σ(q).

Example 6 Continuing with Fig. 7, given the four query graphs whose specialization order is q ≺ q′ ≺ q″ ≺ q‴, we have Σ(q) = {q, q′, q″, q‴} and Σ(q′) = {q′, q″, q‴}.

5 Autocompletion Framework for Large Graphs

The overall query autocompletion is presented in Algo. 3 and illustrated with Fig. 3. FLAG assumes that (1) the user submits a query and an intent, and (2) the query is decomposed into a set of embeddings of wildcard features of the data graph. FLAG then supports wildcards in the two main steps of GQAC. First, in the candidate generation step, (3) we determine the possible candidate suggestions, i.e., the well-formed wildcard features to attach to the current query to form suggestions that may yield non-empty answers; in Sect. 5.1, we propose pruning and sampling techniques for large graphs. Second, in Sect. 5.2, (4) we present a new ranking function that combines the specialization value and the summarization set size.

5.1 Candidate Suggestions Generation

5.1.1 Query Decomposition

During the online autocompletion, the query decomposition procedure (Algo. 1 of AutoG [36]) is adopted. The query graph q is decomposed into a feature set F_q, along with the embeddings of the features in the query.
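The summarization set of Def. 11 can be sketched as follows. This is a simplified model (our own, not the paper's code): queries are represented as edge lists, and the subset-based `spec_order` below is an illustrative stand-in for the subgraph-isomorphism test of Def. 9.

```python
# A minimal sketch of Sigma(q) of Def. 11: given a predicate
# spec_order(q, q2) for "q2 specializes q" (q ≺ q2) and a pool of
# non-empty candidate queries, collect the summarization set.
# The subset-based spec_order is an illustrative assumption.

def spec_order(q, q2):
    # Treat queries as edge collections; q2 specializes q if it contains q.
    return set(q) <= set(q2)

def summarization_set(q, candidates):
    """Sigma(q) = {q' in candidates | q ≺ q'} (candidates are non-empty)."""
    return [c for c in candidates if spec_order(q, c)]

def summarization_of_set(Q, candidates):
    """Sigma(Q) = union of Sigma(q) over q in Q, preserving order."""
    out = []
    for q in Q:
        for c in summarization_set(q, candidates):
            if c not in out:
                out.append(c)
    return out
```

Note that, as in Example 6, a query is a member of its own summarization set under this reflexive reading of ≺.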
The detailed process is presented in Algo. 2. To generate well-formed query suggestions, in which the wildcards appear only at leaf nodes/edges, the non-leaf wildcards (if any) in F_q need to be specialized before candidate suggestions are generated (Lines 3–9).

5.1.2 Non-Empty Candidate Suggestions

Candidate suggestions can specialize the existing query in multiple ways. First, suggestions can replace wildcards in the query with specific labels. Second, candidates can increment the query with (wildcard) features. Specifically, given a set of features, the number of possible candidates is, in the worst case, exponential in the query and feature sizes. However, many of the composed queries may not make sense, namely when they do not retrieve any results from the underlying data graph. Such queries are known as empty queries. Furthermore, the problem of deciding the emptiness of a subgraph matching query is NP-hard.

Existing work [36] has proposed a necessary condition for the compositions of non-empty query candidates. It has been reported that the condition pruned 13% and 45% of the query compositions for AIDS and PubChem, respectively, which consist of large collections of modest-sized graphs. When directly applied, however, [36] prunes only 0.1% of the possible compositions of the CiteSeer dataset. Therefore, in Prop. 2, we propose a necessary condition for non-empty query compositions based on the large graph and sampling techniques.

We illustrate how to efficiently prune empty compositions using the embedding information. The queries that are not pruned are considered candidate suggestions.

Consider a large graph G, a set of sampled graphs D obtained from G using existing graph sampling techniques (Sect. 6), and a set of frequent features F extracted from D offline using existing frequent subgraph mining techniques. The embeddings M_{F,D} of the features in the sampled graphs and the embeddings M_{D,G} of the sampled graphs in the large graph can be computed offline. For a composition (f1, f2, v1, v2), the embeddings of f1 and f2 in the large graph are obtained using M_{F,D} and M_{D,G}.

Proposition 2 A query q is a non-empty query of the sampled graphs only if each query composition (f1, f2, v1, v2) of q satisfies that there exist embeddings λ_{f1,G} and λ_{f2,G} such that λ_{f1,G}[v1] = λ_{f2,G}[v2].

Proposition 2 verifies whether each composition of the query can find at least one instance in the large graph from the sampled portion. There can be false negatives, simply because the sampled graphs may not cover all the possible compositions of the large graph, even though one may increase the sampling size for higher accuracy. Prop. 2 is used in both the online candidate generation and the offline indexing of query compositions.
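The check of Prop. 2 can be sketched as follows. We use our own simplified data model (not the authors' structures): each feature's embeddings in the large graph are lists of node maps, assumed precomputed offline from the sampled graphs.

```python
# A sketch of the Prop. 2 emptiness check. `embeddings` maps a
# feature id to a list of node maps (dict: feature node -> node of
# the large graph). Names and the data layout are illustrative
# assumptions, not the paper's implementation.

def may_be_nonempty(composition, embeddings):
    """Necessary condition of Prop. 2: the composition (f1, f2, v1, v2)
    passes only if some embedding of f1 and some embedding of f2 map
    the joining nodes v1 and v2 to the same node of the large graph."""
    f1, f2, v1, v2 = composition
    images1 = {m[v1] for m in embeddings[f1]}
    images2 = {m[v2] for m in embeddings[f2]}
    return bool(images1 & images2)
```

A composition that fails this test cannot have a match within the sampled portion, so it is pruned; a composition that passes may still be empty in G, which is why the condition is only necessary.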
5.2 Suggestion Ranking

From our preliminary experiments, we observed that the number of candidate suggestions can be in the thousands. Considering that users may only be able to interpret a small subset of them, FLAG returns the top-k suggestions w.r.t. a ranking function and a user preference. The suggestion ranking criteria of existing studies [36] are either infeasible to obtain from large graphs for efficient online autocompletion or indistinguishable among the candidate suggestions, because the increment parts share no common subgraphs and hence yield the same value. As the first attempt at GQAC for large graphs, we present a ranking function that prefers query suggestions that (i) augment the existing query more and (ii) summarize more candidate suggestions. The first preference simply reflects the user's intent to adopt larger useful increments, whereas the second one recognizes the importance of summarizing more suggestions that can be useful to the user. These two preferences are quantified as the specialization power and the summarization power. We then combine the two criteria to measure the utility of a set of query suggestions.

Definition 12 (Specialization power (sp)) Given a set of candidate suggestions U for an existing query q, the specialization power of a suggestion q′ ∈ U w.r.t. q is defined as

sp(q′, q) = (sv(q′) − sv(q)) / max({sv(q″) − sv(q) | q″ ∈ U}).

The specialization power of a suggestion q′ is the increment of the specialization value if the user adopts the suggestion, normalized by the maximum specialization value increment over all the candidate suggestions.

Definition 13 (Summarization power (smp)) Given a set of candidate suggestions U for an existing query q, the summarization power of a subset of candidate suggestions Q′ ⊆ U w.r.t. U is defined as

smp(Q′) = |Σ(Q′) ∩ U| / |U|.

The summarization power of a set of suggestions is the number of candidate suggestions summarized by them, normalized by the total number of candidate suggestions.

Example 7 We illustrate the specialization power (sp) and the summarization power (smp) using Fig. 8. The user manually adds a wildcard edge (with a wildcard node) to query q12 and obtains q14. There are 7 candidate suggestions for the current query q14, i.e., q1, q2, ..., q7. According to Def. 12, sp(q1, q14) = 0.5 and sp(q5, q14) = 1. According to Definitions 11 and 13, Σ(q1) = {q1, q5} and Σ(q3) = {q3, q5, q6, q7}. Then, smp({q1, q3}) is 5/7.

Definition 14 (Utility of query suggestions) Given a set of query suggestions Q′ = {q1, q2, …, qk}, the specialization power of each suggestion with respect to the existing query q, the summarization power of Q′ with respect to all the candidate suggestions, a user preference component α ∈ [0, 1], and scaling factors β and γ, the utility of Q′ is defined as follows:

u(Q′) = α · (β / k) · Σ_{q′ ∈ Q′} sp(q′, q) + (1 − α) · γ · smp(Q′).

The bi-criteria ranking function combines the specialization power and the summarization power of the query suggestions. α is a parameter to set the preference between the two criteria, and the constant denominator k is for normalization. Since the values of the two criteria can be of very different ranges in practice, which makes α sensitive and difficult to tune, we introduce the scaling factors β and γ. The parameters α, β, and γ are data-specific. In order to tune them, we adopt a machine learning method. This, however, requires all the functions involved to be differentiable, whereas the maximum function in Def. 12 is not. We adopt a differentiable approximation to the maximum function [4]. Hence, in the experiments, we can use a stochastic gradient descent algorithm to learn the parameters.

Fig. 8 An example of suggestions for an existing query (q14 with candidate suggestions q1–q7 and their SP increments)

Example 8 Continuing with Fig. 8, we illustrate the utility of query suggestions defined in Def. 14; β and γ are set to 1. There are 7 candidate suggestions for the existing query q14, i.e., q1, q2, ..., q7. When GQAC only considers how much the suggestions specialize the existing query (i.e., α = 1), q5 and q1 would be the top-2 suggestions. When GQAC only considers how much the suggestions summarize other candidate suggestions (i.e., α = 0), then q1 and q3 would be the top-2 suggestions. When α is set to 0.5, then q1 and q5 would be the top-2 suggestions.

The ranking task is then to find the top-k candidate suggestions that have the highest u value. It can be noted that the two objectives sp and smp of u can be competing: in practice, the summarization power of smaller queries is often larger, as more candidate suggestions are summarized by smaller ones, whereas smaller queries provide smaller specialization power. It is not surprising that the problem of determining the query suggestions with the highest u value is NP-hard.

Definition 15 (Ranked Subgraph Query Suggestions for Large Graphs (RSQL)) Given a query q, a set of query suggestions Q, the ranking function u, a user preference component α, and a user-specified constraint k, the ranked subgraph query suggestions problem is to determine a subset Q′ ⊆ Q such that u(Q′) is maximized, i.e., |Q′| ≤ k and there is no other Q″ ⊆ Q such that u(Q″) > u(Q′).
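The bi-criteria utility of Defs. 12–14 can be sketched as follows, under our own simplified model: each candidate carries its specialization value `sv` and its summarization set `sigma` (ids of the candidates it summarizes). The scaling factors beta and gamma default to 1; all names are illustrative assumptions, not the authors' code.

```python
# A sketch of sp (Def. 12), smp (Def. 13), and the utility u (Def. 14)
# over candidates represented as dicts {"sv": ..., "sigma": set-of-ids}.

def specialization_power(cand, query_sv, max_gain):
    return (cand["sv"] - query_sv) / max_gain            # Def. 12

def summarization_power(subset, universe_size):
    covered = set().union(*(c["sigma"] for c in subset))  # Def. 13
    return len(covered) / universe_size

def utility(subset, query_sv, candidates, alpha, beta=1.0, gamma=1.0):
    max_gain = max(c["sv"] - query_sv for c in candidates)
    sp_sum = sum(specialization_power(c, query_sv, max_gain) for c in subset)
    k = len(subset)
    return (alpha * beta / k) * sp_sum + \
           (1 - alpha) * gamma * summarization_power(subset, len(candidates))
```

Setting alpha to 1 or 0 reproduces the two extreme behaviours discussed in Example 8: pure specialization ranking versus pure summarization ranking.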
Proposition 3 The RSQL problem is NP-hard.

(Proof sketch) The maximization of the utility function u is NP-hard, by a reduction from the Set Cover (SC) problem. Given an instance of the SC problem over the elements {o1, …, om}, each subset S is converted to a candidate suggestion q_S that summarizes the suggestions q_i corresponding to the elements o_i ∈ S; k remains the same, and α and γ of RSQL are set to 0 and 1, respectively. Finding the query suggestion set is then to find the i query suggestions, where i ≤ k, that cover the candidate suggestions the most. It can be trivially mapped to the solution of SC that covers all the elements with the smallest number of subsets i. ◻

5.3 Efficient Summarization Computation

This subsection presents efficient algorithms for determining Σ, which enables efficient ranking for the online autocompletion. We remark that the computation of the specialization power is straightforward, given q, and hence is omitted. The computation of the utility further depends on Σ (Defs. 12 and 13). To determine whether suggestions summarize the others, i.e., the specialization orders between them, we would need to compute subgraph isomorphisms between each pair of suggestions online. Hence, we derive a necessary condition for the specialization order between candidate suggestions and index it. This can be done efficiently for the following two reasons: (i) some query suggestions are similar, because they are composed by adding small increments to the same existing query graph; and (ii) the specialization order between the wildcard features (i.e., the increments) is available offline.
We formalize a necessary condition for the specialization order between candidate suggestions. We illustrate how to efficiently (i) compute and index all the possible specialization orders offline, and (ii) prune the false ones online based on the current query graph q.

Proposition 4 Given two suggestions q1 and q2 for a query q, where q1 is formed via a composition (q, f1, v11, v12) and q2 is formed via a composition (q, f2, v21, v22), q2 specializes q1 (i.e., q1 ≺ q2) only if

1. the increment of q2 specializes that of q1, i.e., Δq1 ≺ Δq2; and
2. there exists an embedding λ ∈ {λ | Δq1 ≺_λ Δq2} such that the nodes where q1 increments at match those of q2 via λ.

The proposition can be established by a simple proof by contradiction. The first condition of Prop. 4 can be computed offline and then indexed. The second condition can be used in the online autocompletion to prune false specialization orders using the current query.

5.3.1 Indexing Wildcard Features

We extend the Feature DAG index (FDAG) [36] with the support of wildcards. Due to space limitations, we highlight the main ideas of the extensions but skip the verbose index definition. An illustration of the index is shown in Fig. 9 (index structure, partial, for CiteSeer). In particular, we index the wildcard features (shown at the bottom of the figure) in a DAG, where each index node represents a feature and an edge represents a specialization order between two features. All the possible subgraph isomorphism embeddings are indexed (shown in the M fields of the index edges and of the indexed content). That is, all the possible ways in which two well-formed features can be composed have been precomputed and indexed. This avoids computing the specializations of features online. The features are further indexed by their specialization values.
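The index can be sketched as follows. This is our own simplification of the FDAG extension (not the paper's implementation): nodes are features, edges record the offline-computed specialization order, and features are grouped by specialization value for retrieval.

```python
# A small sketch of a feature-DAG index in the spirit of Sect. 5.3.1.
# Names are illustrative assumptions.

from collections import defaultdict

class FeatureDAG:
    def __init__(self):
        self.children = defaultdict(set)   # f -> features specializing f
        self.by_sv = defaultdict(set)      # specialization value -> features

    def add_feature(self, f, sv):
        self.by_sv[sv].add(f)

    def add_specialization(self, general, special):
        # Offline-computed fact: `special` specializes `general`.
        self.children[general].add(special)

    def specializes(self, general, special):
        """Check condition 1 of Prop. 4 by reachability in the DAG."""
        stack, seen = [general], set()
        while stack:
            f = stack.pop()
            if f == special:
                return True
            if f not in seen:
                seen.add(f)
                stack.extend(self.children[f])
        return False
```

Because the DAG is built offline, the online phase only performs cheap reachability lookups instead of subgraph-isomorphism tests between increments.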
Example 9 We illustrate the efficient specialization order computation with Fig. 10. Given two compositions, the specialization relation of the increments (i.e., f_δ1 ≺ f_δ2) has been indexed. The embeddings λ0 = (0, 1) and λ1 = (1, 0) can be simply retrieved. During the online autocompletion, we check the second condition of Prop. 4. We find that λ0 = (0, 1) satisfies that the nodes where q1 increments at match those of q2 via λ0. Hence, the suggestion q2 specializes q1.

Fig. 10 Efficient suggestion summarization computation

5.4 Efficient Ranking Algorithm

Given that the ranking function of a set of candidate suggestions Q′ can be efficiently computed, we present a greedy ranking algorithm in Lines 16–20 of Algo. 3. Greedy algorithms are typical approximation algorithms for RSQL because u is submodular. Recall that a function is submodular if the marginal gain from adding an element to a set S is at least as high as the marginal gain from adding it to a superset of S; in particular, it satisfies f(S ∪ {o}) − f(S) ≥ f(T ∪ {o}) − f(T) for every element o and every pair of sets S ⊆ T. We can analyze u as follows. Firstly, the specialization component is linear and monotone submodular, since it is a sum of non-negative numbers. Secondly, the summarization component is monotone submodular, because adding new suggestions can only summarize more candidate suggestions. Hence, u is a non-negative linear combination of two scaled monotone submodular components, and thus u is monotone submodular. The problem of maximizing a monotone submodular function subject to a cardinality constraint admits a (1 − 1/e)-approximation algorithm [27].
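The greedy selection for a monotone submodular utility can be sketched as follows: repeatedly add the candidate with the largest marginal gain until k suggestions are chosen. The classical result gives the (1 − 1/e) approximation guarantee [27]. `utility` here is any set function; the coverage-style function used in the test below is an illustrative stand-in for the summarization component, not the authors' exact objective.

```python
# A sketch of greedy top-k selection for a monotone submodular
# utility (Sect. 5.4). Names are illustrative assumptions.

def greedy_top_k(candidates, utility, k):
    chosen = []
    for _ in range(min(k, len(candidates))):
        best = max((c for c in candidates if c not in chosen),
                   key=lambda c: utility(chosen + [c]) - utility(chosen))
        chosen.append(best)
    return chosen
```

Each iteration evaluates only marginal gains, so a round costs O(|candidates|) utility evaluations, which suits the online setting.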
6 Experimental Evaluation

This section presents an experimental evaluation of FLAG. We first investigated the suggestion quality via a user study, and we then conducted an extensive performance evaluation via simulations on popular real datasets. In particular, we studied the overall performance of FLAG, the effectiveness of the optimizations, and the effects of the parameters of FLAG.

6.1 Software and Hardware

We implemented the FLAG prototype on top of AutoG [36]. The prototype was mainly implemented in C++, using VF2 [7] for subgraph query processing and McGregor's algorithm [21] (with minor adaptations) for determining common subgraphs. We used gSpan [35] for frequent subgraph mining. We conducted all the experiments on a machine with a 2.2 GHz Xeon E5-2630 processor and 256 GB of memory, running Linux. All the indexes were built offline, loaded from the hard disk, and then made fully memory-resident for online query autocompletion.

6.2 Datasets

We conducted experiments on several different workload settings by employing real graph datasets with various characteristics. Table 2 reports some dataset characteristics.

Table 2 Some characteristics of the datasets

Dataset   |V|         |E|         |l(V)|  |l(E)|  avg. deg.
Twitter   11,316,811  85,331,845  32      1       15.1
WordNet   73,753      234,024     28      1       6.3
CiteSeer  3,312       4,591       6       3       2.8

1. Twitter (http://socialcomputing.asu.edu/datasets/Twitter). This dataset models the Twitter social network. It consists of ∼11M vertices and ∼85M edges. Each vertex represents a user and each edge represents the friendship/followership relation between two users. The original graph has no labels. We randomly added labels to the vertices. The number of distinct labels was set to 32 and the randomization follows a Gaussian distribution (μ = 50 and σ = 3).

2. WordNet (https://networkrepository.com/wordnet-words.php). This dataset models the lexical network of words. It consists of ∼74K vertices and ∼234K edges. Each vertex represents an English word and each edge represents a relationship between words, such as synonymy, antonymy, and meronymy. The original graph has no labels. We randomly added labels to the vertices, similar to the way used in Twitter.

3. CiteSeer (https://linqs.soe.ucsc.edu/data). This dataset models publications in CiteSeer. It consists of ∼3K vertices and ∼4K edges. Each vertex represents a publication and each edge represents the citation relation between two publications. Each vertex is labeled with the Computer Science area (e.g., DB, DM, IR) and each edge is labeled with the Jaccard distance between the pair of publications. The distance is computed from the word attributes of the publications and further evenly categorized into three types (small, medium, and large distances).

6.3 Query Sets

We generated numerous sets of query graphs of different query sizes |q| (the number of edges) and various frequencies in the large graph. Each query set contained 100 graphs. In particular, we generated queries that yield different result set sizes (i.e., |G_q| > G_min for all query graphs). A query was generated following Random Walk sampling (the same as the graph sampling of Sect. 6.4); we checked that each constructed query q had not been generated before and has a result set size |G_q| larger than G_min. These query sets enable us to investigate the usefulness and performance of FLAG under different user workloads. The query sizes ranged from 2 to 9.
6.4 Graph Sampling

Instead of running expensive frequent subgraph mining algorithms on the single large graph, we scaled down the large graph using Random Walk sampling [16] before frequent subgraph mining. We sampled min{|V(G)|, 10^c} graphs of 10 edges from the large graph, for a fixed cap 10^c. In particular, we randomly selected a vertex as the starting vertex and then simulated a random walk on the graph. At each step, there is a probability of 0.15 (the value commonly used in the literature) that we jump back to the starting vertex and continue the random walk. If we cannot reach the required sample graph size after a large number of steps (e.g., 100 × |V(G_D)|), or if the random walk has exhausted the neighbors of the starting vertex, we select another starting vertex and restart the random walk.

6.5 Feature Mining

We followed AutoG, using gSpan [35] to obtain a sufficient number of features (frequent subgraphs) to build the index offline. In particular, we set the default minimum support value (minSup) to 0.2%, 0.3%, and 0.5% for Twitter, WordNet, and CiteSeer, respectively. These minimum support values are an order of magnitude smaller than those used in AutoG. We set smaller minSup values because frequent subgraphs are relatively scarce in large graphs. The maximum feature size maxL was set to 10 for all datasets. (This limitation is due to the gSpan binary executable; we investigated several existing feature mining works before opting to apply gSpan to graph samples.) Some statistics of the features are summarized in Table 3.

Table 3 Some characteristics of the features of the datasets

Dataset   minSup  |F|   avg |V|  avg |E|  Time (s)
Twitter   0.2%    1859  3.33     2.33     114.6
WordNet   0.3%    1745  3.30     2.33     8.7
CiteSeer  0.5%    1720  5.72     4.92     1.4

6.6 Index

With the frequent features mined by gSpan, we adopted the AutoG procedure (i.e., Algorithm 4 of [36]) to enumerate the possible compositions of feature pairs. We discovered that the pruning technique proposed in AutoG for composition enumeration is ineffective on the employed large graphs. Their pruning technique can prune 13% and 45% of the empty compositions for the AIDS and PubChem datasets, respectively. It is not surprising that this necessary condition only prunes 0.1% of the compositions on CiteSeer, since the characteristics of a citation network are much different from those of chemical and biological structures. After applying the embedding-based necessary condition for non-empty query compositions (introduced in Sect. 5.1), 41% of the compositions of the CiteSeer dataset are pruned. Table 4 briefly summarizes the characteristics of index construction and composition enumeration.

Table 4 Index construction

Dataset   |V|    |E|     Time (s)  # compositions  Time (s)
Twitter   1,859  12,637  1.0       244,195         1,711,636
WordNet   1,745  11,862  0.9       298,369         16,333
CiteSeer  1,720  20,842  2.3       8,481,895       151,847
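The Random Walk sampling of Sect. 6.4 can be sketched as follows, in our own simplified rendering (not the authors' code): walk from a random start vertex, jump back to the start with probability 0.15, and collect edges until the sample reaches the requested size or a step budget is exhausted.

```python
# A sketch of restart-based random-walk edge sampling. Names and the
# adjacency-dict data model are illustrative assumptions.

import random

def random_walk_sample(adj, sample_edges, max_steps, restart_p=0.15):
    """adj: dict vertex -> list of neighbours. Returns a set of
    undirected edges (u, v) with u < v."""
    start = random.choice(list(adj))
    current, edges, steps = start, set(), 0
    while len(edges) < sample_edges and steps < max_steps:
        steps += 1
        if random.random() < restart_p or not adj[current]:
            current = start                   # jump back to the start
            continue
        nxt = random.choice(adj[current])
        edges.add((min(current, nxt), max(current, nxt)))
        current = nxt
    return edges
```

A full sampler would additionally restart from a fresh vertex when the budget is exhausted, as described in the text.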
6.7 Quality Metrics

We adopted several popular metrics to measure suggestion qualities [25, 36]. We report the number of suggestion adoptions (i.e., #Auto) and the total profit metric (i.e., TPM). Specifically, the total profit metric (TPM) [25, 36] quantifies the percentage of mouse clicks saved by adopting suggestions during the visual query formulation:

TPM = (no. of clicks saved by suggestions / no. of clicks without suggestions) × 100%.

In addition to #Auto and TPM, we report the number of specializations obtained from adopting suggestions (denoted as #Sp), the average number of specializations from each adoption (denoted as avgSp), and the useful suggestion ratio U, defined as (no. of useful suggestions / no. of returned suggestions) × 100%. Each reported number is the average over the 100 queries in each query set. Note that even when the suggestions are correct, users still need at least a mouse click to adopt them to obtain the target query. The employed quality metrics are listed in Table 5.

Table 5 Quality metrics and their meanings

Metric  Meaning
#Auto   The average number of suggestions accepted in the simulation
#Sp     The total number of specializations obtained from suggestions
avgSp   The average number of specializations obtained from each accepted suggestion
U       The useful suggestion ratio, defined as (no. of useful suggestions / no. of returned suggestions) × 100%
TPM     The total profit metric adopted from [25], which quantifies the percentage of specializations saved by FLAG in the visual graph query formulation: TPM = (no. of specializations saved by FLAG / no. of specializations without FLAG) × 100%
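The two ratio metrics can be computed as follows; this is a tiny sketch of Sect. 6.7 as we read it (function names are our own): TPM measures the fraction of formulation steps saved by adopting suggestions, and U the fraction of returned suggestions that were useful.

```python
# A minimal sketch of the TPM and U metrics of Sect. 6.7.

def tpm(steps_saved, steps_without_suggestions):
    """Total profit metric, in percent."""
    return 100.0 * steps_saved / steps_without_suggestions

def useful_ratio(useful_suggestions, returned_suggestions):
    """Useful suggestion ratio U, in percent."""
    return 100.0 * useful_suggestions / returned_suggestions
```

For instance, a query whose formulation needs 25 specializations, 13 of which come from adopted suggestions, has a TPM of 52%.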
6.8 Learning Scaling Factors

We used a stochastic gradient descent algorithm to learn the default scaling factors of Definition 14. We generated 100 random simple queries from a dataset; each initial query contains 1 edge and its target query contains 4 edges. We divided the queries into 10 groups, and each group is used to learn the parameters for around 33 iterations. The learning rate is set to 0.01. The learning algorithm converges at around 300 iterations. For the Twitter dataset, the default β and γ are 3.8 and 7.6, respectively. For the CiteSeer dataset, we obtained the defaults similarly; their values are 3.6 and 7.2, respectively. The learned α for Twitter and CiteSeer are 0.58 and 0.45, respectively. The β, γ, and α values of the WordNet dataset are 1, 1, and 0.5, respectively.

6.9 Suggestion Qualities of FLAG

6.9.1 User Study

We first conducted a user test with 10 volunteers. Each user was given 3 queries each with high, medium, and low TPM values, respectively, from the simulation. We randomly shuffled these 9 queries. The users were asked to formulate the target queries via the visual aid shown in Fig. 1. They expressed their level of agreement with the statement "FLAG is useful when I draw the query." via a symmetric 5-level agree–disagree Likert scale, where 1 means "strongly disagree" and 5 means "strongly agree".

Consistent with [36, 37], our results showed that the correlation coefficient between TPMs and users' points is 0.819 and the p-value is 0.007. Thus, TPM is a good quality indication of FLAG. The average ratings of the queries with high, medium, and low TPM values are 3.57 (between "agree" and "neither agree nor disagree"), 2.63 ("neither agree nor disagree"), and 1.83 (between "disagree" and "strongly disagree"), respectively.

6.9.2 Large-Scale Simulations

We investigated the suggestion qualities via simulations under a large variety of parameter settings. α is set to 0.5 so that both sp and smp contribute to the ranking. For each target query, we started with a random edge with one node label (the other node and the edge carry wildcard labels). In each step, we called FLAG and then chose the useful suggestion with the largest number of specializations. If no useful suggestions were returned, we specialized the query by a random specialization operator toward the target query. Each target query set contains 100 queries.

We studied the effects of the major parameters of FLAG on CiteSeer, WordNet, and Twitter. We report the representative simulation results in Tables 6–17. The performance characteristics presented here can be useful for users to set their default parameter values, which could be dataset-specific.

6.9.3 Varying the Maximum Increment Sizes (δmax)

Table 6 shows the quality metrics of Q5 (i.e., queries of 5 edges) with various δmax on CiteSeer. The results show that the qualities decrease as δmax increases. #Auto shows that the suggestions were used in multiple iterations of the query formulation; in particular, the formulation process of each query adopted around 5.7 suggestions on average. #Sp shows that the number of specializations added by FLAG was around 15. TPM shows that FLAG saved roughly 53% of the manual specializations in query formulation. avgSp shows that each adoption introduced 2–3 specializations to the existing query. U shows that FLAG generally produced useful suggestions. Tables 7 and 8 show the quality metrics of Q4 with various δmax on WordNet and Twitter. The results on WordNet and Twitter share the same trends as those on CiteSeer. The values of the quality metrics of WordNet and Twitter were lower than those of CiteSeer, since the numbers of compositions of WordNet and Twitter were relatively small.

Table 6 Quality metrics by varying δmax (CiteSeer)

δmax  #Auto  #Sp   TPM  avgSp  U
4     4.9    13.1  60   2.7    14
8     5.2    12.6  54   2.6    11
12    5.2    11.7  47   2.4    10
16    4.9    11.4  48   2.4    9
20    3.9    11.8  58   4.0    9

Table 7 Quality metrics by varying δmax (WordNet)

δmax  #Auto  #Sp   TPM  avgSp  U
4     5.1    10.6  40   2.1    8
8     4.3    9.2   36   2.2    7
12    3.6    7.7   30   2.2    6
16    3.4    7.6   31   2.3    6
20    3.4    7.6   31   2.3    6

Table 8 Quality metrics by varying δmax (Twitter)

δmax  #Auto  #Sp   TPM  avgSp  U
4     5.1    11.3  44   2.2    9
8     3.7    9.4   40   2.7    6
12    3.2    8.7   39   2.9    6
16    3.2    8.7   39   2.9    6
20    3.2    8.7   39   2.9    6

6.9.4 Varying the Target Query Sizes (|q|)

Tables 9, 10, and 11 show the quality metrics for various |q|. It is not surprising that FLAG achieved more suggestion adoptions as |q| increased: the number of adoptions (#Auto) and the adopted specializations (#Sp) increased as |q| increased. TPM and U on CiteSeer, WordNet, and Twitter were generally retained as |q| increased.

Table 9 Quality metrics by varying |q| (CiteSeer)

|q|  #Auto  #Sp   TPM  avgSp  U
2    2.6    5.3   44   2.1    8
3    3.8    8.4   46   2.3    9
4    5.2    11.7  47   2.4    10
5    5.8    14.9  52   3.0    10
6    7.0    18.5  54   2.9    10
7    8.1    21.5  54   2.9    10
8    8.8    24.3  54   3.0    10
9    10.8   27.7  52   2.8    10

Table 10 Quality metrics by varying |q| (WordNet)

|q|  #Auto  #Sp   TPM  avgSp  U
2    1.9    4.0   35   2.2    6
3    2.7    6.1   34   2.4    6
4    3.6    7.7   30   2.2    6
5    4.3    9.6   31   2.3    6
6    4.4    10.4  28   2.5    5
7    5.2    12.0  28   2.4    5
8    6.0    13.5  27   2.3    5
9    6.3    14.2  24   2.3    5

Table 11 Quality metrics by varying |q| (Twitter)

|q|  #Auto  #Sp   TPM  avgSp  U
2    1.4    3.9   41   2.9    5
3    2.6    7.0   44   3.1    6
4    3.2    8.7   39   2.9    6
5    4.4    11.4  39   2.7    6
6    5.1    12.8  35   2.6    5
7    6.3    15.4  35   2.6    5
8    7.2    17.8  35   2.6    5
9    8.2    19.9  35   2.5    5

6.9.5 Varying the User-Specified Constant k

Tables 12, 13, and 14 show the suggestion quality when we varied k. The results show that #Auto, #Sp, TPM, and avgSp generally increased with k. This is not surprising: as more suggestions are returned, there is a higher chance that some of them are adopted. Importantly, the useful suggestion ratio is higher when k is smaller, mainly because the useful suggestions of CiteSeer usually rank higher than those of WordNet and Twitter.

Table 12 Quality metrics by varying k (CiteSeer)

k   #Auto  #Sp   TPM  avgSp  U
4   4.9    10.3  40   2.3    19
6   5.0    11.0  44   2.4    14
8   5.2    11.5  46   2.4    12
10  5.2    11.7  47   2.4    10

Table 13 Quality metrics by varying k (WordNet)

k   #Auto  #Sp   TPM  avgSp  U
4   2.0    4.5   19   2.4    6
6   2.3    5.3   22   2.4    5
8   2.9    6.5   27   2.3    5
10  3.6    7.7   30   2.2    6

Table 14 Quality metrics by varying k (Twitter)

k   #Auto  #Sp   TPM  avgSp  U
4   2.3    6.1   28   2.9    8
6   2.8    7.4   33   2.8    7
8   3.1    8.2   36   2.9    6
10  3.2    8.7   39   2.9    6

6.9.6 Varying α of the Ranking Functions

Tables 15, 16, and 17 show the suggestion quality with various αs. The results show that the suggestion qualities were generally good when α was small. The optimal α for CiteSeer was 0.2, that for WordNet was around 0, and that for Twitter was 0.0–0.4. The quality then decreased as the value of α increased. The learned αs from Sect. 6.8 produced slightly lower TPMs when compared to the optimal ones. FLAG generally produced high-quality suggestions when the αs are smaller than 0.8 for CiteSeer, 0.2 for WordNet, and 0.8 for Twitter.

Table 15 Quality metrics by varying α (CiteSeer)

α     #Auto  #Sp   TPM  avgSp  U
0.00  5.6    13.1  55   2.5    15
0.20  5.2    12.9  56   2.8    12
0.40  5.4    12.0  49   2.4    10
0.60  4.5    11.0  48   2.6    9
0.80  2.8    10.2  54   4.4    7
1.00  0.3    3.0   20   11.5   1

Table 16 Quality metrics by varying α (WordNet)

α     #Auto  #Sp   TPM  avgSp  U
0.00  5.4    11.2  42   2.1    12
0.20  5.0    10.4  39   2.1    8
0.40  4.3    9.0   34   2.2    7
0.60  2.6    6.1   26   2.5    4
0.80  1.4    3.6   16   2.8    2
1.00  0.2    1.5   9    6.6    1

Table 17 Quality metrics by varying α (Twitter)

α     #Auto  #Sp   TPM  avgSp  U
0.00  5.6    11.8  44   2.1    13
0.20  5.2    11.3  44   2.2    9
0.40  4.4    10.5  44   2.5    7
0.60  3.1    8.6   39   3.0    4
0.80  2.2    6.9   34   3.4    2
1.00  0.4    2.0   11   5.0    1

6.10 Efficiency of FLAG

We conducted a detailed evaluation of the online FLAG processing. We report the Average Response Time (ART) of FLAG under the default setting in Fig. 11 (ARTs under the default setting).
For CiteSeer, we obtained ARTs of around 3 s. For Twitter, we obtained short ARTs, as the number of compositions was relatively small. Thus, the response time of FLAG is generally very short. (We remark that query decomposition takes less than a few milliseconds, which is negligible, and hence is not shown separately.) The rest of this section reports the average response time when we vary the major parameters of FLAG, i.e., α, k, and |q|.

6.10.1 Varying α of the Ranking Functions

We ranged α from 0 to 1. Figure 12 (ART vs α) shows the effects of α on the ARTs. The ART was always less than 3.5 s. We also noticed that the ART decreased as α approached 1. The higher the value of α, the more the GQAC process prefers suggestions with large specialization and small summarization, which results in a shorter time for updating the summarizations of the candidate suggestions.

6.10.2 Varying the User-Specified Constant k

We varied k from 10 to 50 and report the ARTs for CiteSeer and Twitter in Fig. 13. The largest value of k tested was 50, which is large enough for common visual interfaces. The results show that the ARTs increased as k increased. FLAG returned suggestions within 5 s when k is less than 20. The GQAC process may need 8 s to provide suggestions when k is up to 50.
6.10.2 Varying the User-Specified Constant k

We varied k from 10 to 50 and reported the Atrs for CiteSeer and Twitter in Fig. 13. The largest value of k tested was 50, which is large enough for common visual interfaces. The results show that the Atrs increased as k increased. FLAG returned suggestions within 5s when k is less than 20. The GQAC process may need 8s to provide suggestions when k is up to 50.

Fig. 13  Atr vs k
Fig. 14  Atr vs |q|

6.10.3 Varying the Target Query Sizes (|q|)

Figure 14 shows the Atr as the query size increased. The results show that the autocompletion process of FLAG finished within 6s for queries with up to 8 edges. The Atr increased as the query size |q| increased, mainly because large queries required more time to generate more candidate suggestions and then rank them.

7 Related Work

Query formulation aids have recently gained increasing research attention. Firstly, recent work has proposed a variety of innovative approaches to help query formulation. For example, GestureQuery [26] proposes to use gestures for specifying SQL queries. SnapToQuery [15] guides users to explore query specification via snapping users' likely intended queries. [3] has proposed a data-driven approach for GUI construction. Exploratory search has been demonstrated to be useful for enhancing interactions between users and search systems (e.g., [20, 22, 23]). QUBLE [11] allows users to explore regions of a graph that contain at least one query answer. Wang et al. [32] recently proposed efficient visual exploratory search in graph databases. Huang et al. [10] studied canned subgraph patterns for GUIs. SeeDB [31] proposes visualization recommendations for supporting data analysis. [18] introduces the Meaningful Query Focus (MQF) of a given keyword search to generate XQuery. While keyword search (e.g., [33]) has been proposed to query graphs, this approach does not allow users to precisely specify query structures. This paper contributes to query autocompletion for query formulation.

Secondly, there is existing work on query autocompletion for various query types. For instance, there is work on query autocompletion for keyword search (e.g., [2, 25, 34]) and structured queries (e.g., [24]). Li et al. [9] extend keyword search autocompletion to XML queries. [18] associated structures with query keywords. LotusX provides position-aware autocompletion capability for XML [19]. An autocompletion learning editor for XML provides intelligent autocompletion [1]. [12] presents a conversational mechanism that accepts incomplete SQL queries and then matches and replaces a part (the user focus) of previously issued queries. There has been a stream of work on extending Query By Example to construct structural queries, e.g., [5, 6, 14]. In contrast, this paper focuses on structural queries for graphs. Hence, we only include related work on graphs. Regarding GQAC, Yi et al. [36] proposed AutoG to rank subgraph suggestions for graphs of small or modest sizes. The recent work [28, 37] introduced user focus to GQAC. In [22], Mottin et al. proposed graph query reformulation, which determines a set of reformulated queries that maximally cover the results of the current query. In Pienta et al. [29] and Jayaram et al. [13], the authors demonstrated interactive methods to produce edge or node suggestions for visual graph query construction. In contrast, this paper considers flexible subgraph suggestions for large graphs.

8 Conclusion

We proposed FLAG, which exploits the notion of wildcard labels to generate top-k query suggestions that help query formulation for large graphs. Considering that the graph features exploited by existing GQAC studies are either absent or rare in large graphs, we proposed to introduce wildcard labels into query graphs and query suggestions to allow more query suggestion candidates. Candidate query suggestions are ranked by a new ranking function that considers how much a suggestion augments the existing query and how many other suggestions it summarizes. We proposed an efficient algorithm for suggestion ranking. Our user study and experiments verified both the effectiveness and efficiency of FLAG.

This paper leads to a variety of interesting future work. We are extending the study by incorporating histories of users' activities [38] into the ranking. We are also studying explanations (e.g., [17]) for the few cases where GQAC returned incorrect suggestions.

Funding: Hong Kong Research Grants Council (C6030-18GF, 12201119, 12201518), Hong Kong Baptist University (IRCMS/19-20/H01), and Prof. Byron Choi.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Abiteboul S, Amsterdamer Y, Milo T, Senellart P (2012) Auto-completion learning for XML. In SIGMOD, pages 669-672
2. Bast H, Weber I (2006) Type less, find more: fast autocompletion search with a succinct index. In SIGIR, pages 364-371
3. Bhowmick SS, Choi B, Dyreson CE (2016) Data-driven visual graph query interface construction and maintenance: challenges and opportunities. PVLDB 9:984-992
4. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
5. Braga D, Campi A, Ceri S (2005) XQBE (XQuery By Example): a visual interface to the standard XML query language. TODS, pages 398-443
6. Comai S, Damiani E, Fraternali P (2001) Computing graphical queries over XML data. TOIS, pages 371-430
7. Cordella LP, Foggia P, Sansone C, Vento M (2004) A (sub)graph isomorphism algorithm for matching large graphs. PAMI, pages 1367-1372
8. Elseidy M, Abdelhamid E, Skiadopoulos S, Kalnis P (2014) GraMi: frequent subgraph and pattern mining in a single large graph. PVLDB 7:517-528
9. Feng J, Li G (2012) Efficient fuzzy type-ahead search in XML data. TKDE, pages 882-895
10. Huang K, Chua H, Bhowmick SS, Choi B, Zhou S (2019) CATAPULT: data-driven selection of canned patterns for efficient visual graph query formulation. In SIGMOD, pages 900-917
11. Hung HH, Bhowmick SS, Truong BQ, Choi B, Zhou S (2013) QUBLE: blending visual subgraph query formulation with query processing on large networks. In SIGMOD, pages 1097-1100
12. Ioannidis YE, Viglas S (2006) Conversational querying. Inf. Syst., pages 33-56
13. Jayaram N, Goyal S, Li C (2015) VIIQ: auto-suggestion enabled visual interface for interactive graph query formulation. PVLDB, pages 1940-1951
14. Jayaram N, Gupta M, Khan A, Li C, Yan X, Elmasri R (2014) GQBE: querying knowledge graphs by example entity tuples. In ICDE, pages 1250-1253
15. Jiang L, Nandi A (2015) SnapToQuery: providing interactive feedback during exploratory query specification. PVLDB 8(11):1250-1261
16. Leskovec J, Faloutsos C (2006) Sampling from large graphs. In KDD
17. Li J, Cao Y, Ma S (2017) Relaxing graph pattern matching with explanations. In CIKM
18. Li Y, Yu C, Jagadish HV (2008) Enabling schema-free XQuery with meaningful query focus. VLDB J., pages 355-377
19. Lin C, Lu J, Ling TW, Cautis B (2012) LotusX: a position-aware XML graphical search system with auto-completion. In ICDE, pages 1265-1268
20. Marchionini G (2006) Exploratory search: from finding to understanding. Commun. ACM, pages 41-46
21. McGregor JJ (1982) Backtrack search algorithms and the maximal common subgraph problem. Softw., Pract. Exper., pages 23-34
22. Mottin D, Bonchi F, Gullo F (2015) Graph query reformulation with diversity. In KDD, pages 825-834
23. Mottin D, Müller E (2017) Graph exploration: from users to large graphs. In SIGMOD, pages 1737-1740
24. Nandi A, Jagadish HV (2007) Assisted querying using instant-response interfaces. In SIGMOD, pages 1156-1158
25. Nandi A, Jagadish HV (2007) Effective phrase prediction. In VLDB, pages 219-230
26. Nandi A, Jiang L, Mandel M (2013) Gestural query specification. PVLDB 7(4):289-300
27. Nemhauser GL, Wolsey LA, Fisher ML (1978) An analysis of approximations for maximizing submodular set functions - I. Math. Program., pages 265-294
28. Ng N, Yi P, Zhang Z, Choi B, Bhowmick SS, Xu J (2019) Fgreat: focused graph query autocompletion. In ICDE, pages 1956-1959
29. Pienta R, Hohman F, Tamersoy A, Endert A, Navathe SB, Tong H, Chau DH (2017) Visual graph query construction and refinement. In SIGMOD, pages 1587-1590
30. Sahu S, Mhedhbi A, Salihoglu S, Lin J, Özsu MT (2017) The ubiquity of large graphs and surprising challenges of graph processing. PVLDB 11:420-431
31. Vartak M, Rahman S, Madden S, Parameswaran A, Polyzotis N (2015) SeeDB: efficient data-driven visualization recommendations to support visual analytics. PVLDB 8(13):2182-2193
32. Wang C, Xie M, Bhowmick SS, Choi B, Xiao X, Zhou S (2020) FERRARI: an efficient framework for visual exploratory subgraph search in graph databases. VLDB J 29(5):973-998
33. Wu Y, Yang S, Srivatsa M, Iyengar A, Yan X (2013) Summarizing answer graphs induced by keyword queries. PVLDB 6:1774-1785
34. Xiao C, Qin J, Wang W, Ishikawa Y, Tsuda K, Sadakane K (2013) Efficient error-tolerant query autocompletion. PVLDB, pages 373-384
35. Yan X, Han J (2002) gSpan: graph-based substructure pattern mining. In ICDM, pages 721-724
36. Yi P, Choi B, Bhowmick SS, Xu J (2017) AutoG: a visual query autocompletion framework for graph databases. VLDB J 26(3):347-372
37. Yi P, Li J, Choi B, Bhowmick SS, Xu J (2020) GFocus: user focus-based graph query autocompletion. TKDE
38. Zhang A, Goyal A, Kong W, Deng H, Dong A, Chang Y, Gunter CA, Han J (2015) adaQAC: adaptive query auto-completion via implicit negative feedback. In SIGIR, pages 143-152
Data Science and Engineering – Springer Journals
Published: Jun 1, 2022
Keywords: Subgraph query; Query autocompletion; Large graphs; Database usability