Symmetric Queries as a Building Block for Efficient Parallel Query Evaluation
Introduction | People | Publications | Boarder Impacts
This material is based upon work supported by the National Science Foundation under Grant No. 1606557 (2015-2020)
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation."
Today's applications frequently feature massive and heterogeneous data and complicated computational requirements. There have been many efforts towards efficient parallel query processing and optimization. However, this does not mean that the parallel potentials have been realized by existing techniques and frameworks in scaling to massive datasets, especially for applications that inherently demand recursive data accesses.
The aim of the project is to offer a theoretical methodology for tackling the problem of parallel query evaluation on massive data. The PI conjectures that to maximize parallelizability of generic queries, e.g. queries that are used frequently in analytical and transactional applications, one need to examine queries that are inherently parallelizable as the basic unit of study, and identifies symmetric queries as a set of queries that are potentially highly parallelizable and use such queries as a stepping stone to study parallelizable query languages and leverage the findings to design techniques for efficient evaluation of generic queries. In particular, the project will focus on three separate yet highly related tasks: (1) design and study a set of query languages whose queries are symmetric, investigate the properties of these languages, and propose and prove theoretical bounds of the computational complexity of the languages, in terms of scaling and data skew; (2) investigate and propose data structures and algorithms for efficiently evaluating queries of these languages in a parallel manner; and (3) propose strategies including query rewrite and optimization techniques for efficient evaluation of arbitrary queries, based on the new data structures and algorithms that result from (2).
The challenges of this project is logistical rather than technical. The PI, Melanie Wu, moved from Indiana University, an R1 institution, to Pomona College, a liberal arts institution with only undergraduate students. She had to delay the start of the project for a year from 2014 to 2015 and conducted research mainly through collaborations. With the support from the NSF program directors, she was able to readjust the aim of the project to limit the scope to theoretical studies. However, the research is fruitful and she was able to involve undergraduate students, especially supporting URM students.
Research Outcomes and Publications
...Many sources of data have temporal start and end attributes or are created in a time-ordered manner. For many applications, it is only natural to join such temporal datasets based on these temporal attributes. ... To support windowed temporal joins and temporal joins of skewed datasets, we propose the skip- join algorithm, which operates on stab-forests, a novel data structure for indexing temporal data. The stab-forest is a dynamic data structure that allows ecient updates, given that events are appended in a time-based order. It eciently supports not only traditional temporal stab-queries and window-queries, but also more general multi-stab-queries and multi-window-queries. ...
...Many graph query languages rely on the composition operator to navigate graphs and select nodes of interests, even though evaluating compositions of relations can be costly. Often, this need for composition can be reduced by rewriting towards queries that use semi-joins instead. In this way, the cost of evaluating queries can be signicantly reduced. We study techniques to recognize and apply such rewritings. Concretely, we study the relationship between the expressive power of the relation algebras, which heavily rely on composition, and the semi- join algebras, which replace the composition operator in favor of the semi-join operators. As our main result, we show that each fragment of the relation algebras where intersection and/or dierence is only used on edges (and not on complex compositions) is expressively equivalent to a fragment of the semi-join algebras. ... In addition, on sibling-ordered trees, we establish new relationships among the expressive power of Regular XPath, Conditional XPath, FO-logic, and the semi- join algebra augmented with restrictedxpoint operators...
... Motivated by the continuing interest in the tree data model, we study the expressive power of downward navigational query languages on trees and chains. Basic navigational queries are built from the identity relation and edge relations using composition and union. We study the eects on relative expressiveness when we add transitive closure, projections, coprojections, intersection, and dierence; this for Boolean queries and path queries on labeled and unlabeled structures.
We are able to present the complete Hasse diagrams of relative expressiveness, visualized in the figure on the right. In particular, our results show, for each fragment of the navigational query languages that we study, whether it is closed under dierence and intersection when applied on trees. These results are proven using the concept of condition automata, which we use to represent and manipulate navigational expressions. We also use condition automata to show that, on labeled chains, projections do not provide additional expressive power for Boolean queries
... We study the problem of enumeration of all k-sized subsets of temporal events that mutually overlap at some point in a query time window. This problem arises in many application domains, e.g., in social networks, life sciences, smart cities, telecommunications, and others. We propose a start time index (STI) approach that overcomes the efficiency bottlenecks of current methods which are based on 2-way join algorithms to enumerate temporal k-cliques. Additionally, we investigate how precomputed checkpoints can be used to further improve the efficiency of STI. Our experimental results demonstrate that STI outperforms the state of the art by a wide margin and that our checkpointing strategies are effective.....
This is an extended version of the paper of the same title that was published in Foundation of INformation and Knowledge Systems in 2018. In this extended version, We formalize GCount-k, k 0, as the class of queries that can be written using so-called generalized counting terms. We show that GCount-k captures many practical queries in k-SyCALC. We also establish that satisability is NEXPTIME-complete for SimpleCALC-k and GCount-k, and is in EXPTIME for 1-SyCALC. The decidability of 1-SyCALC and SimpleCALC-k, k>=0, sets them apart from many other fragments offirst-order logic. We further established that the satisability for 2-SyCALC is shown to be undecidable. Hence, proving a strong dichotomy between 1-SyCALC and SimpleCALC-k, k>=0, on the one hand and 2-SyCALC on the other hand.
... In this paper, the authors wish to initiate research on symmetric queries. Thereto, a data model is proposed in which a binary relation between objects and set names encodes set membership. On this data model, two query languages are introduced, QuineCALC and SyCALC. They are correlated with the symmetric Boolean functions of Quine, respectively symmetric relational functions, on sequences of sets of given length. Symmetric Boolean functions involve the Boolean operations union, intersection, and complement, whereas symmetric relational functions additionally involve projection and Cartesian product. Quine's characterization of symmetric Boolean functions in terms of incidence information is generalized to QuineCALC queries. This generalization also yields an incidence-based normal form for QuineCALC queries. Inspired by these desirable incidence-related properties of QuineCALC queries, counting-only SyCALC queries are introduced as those SyCALC queries for which the result only depends on incidence information. Counting-only SyCALC queries are then characterized as quantied Boolean combinations of QuineCALC queries, and a normal form is proposed for them as well. It is shown that, while it is undecidable whether a SyCALC query is counting-only, it is decidable whether a counting-only SyCALC query is a QuineCALC query. Finally, some decidability problems are considered, such as satisability, containment, equivalence, validity, and emptiness. It is shown that all these problems are undecidable for SyCALC, but decidable for QuineCALC and counting-only SyCALC queries.
... For several practical queries on bags of sets of objects, the answer does not depend on the precise composition of these sets, but only on the number of sets to which each object belongs. This is the case k=1 for the more general situation where the query answer only depends on the number of sets to which each group of at most k objects belongs. We call such queries k-counting-only. We focus on k-SyCALC, k-counting-only queries that arerst-order denable .We introduce SimpleCALC-k, a syntactically dened (strict) fragment of k-SyCALC. It turns out that many practical queries in k-SyCALC can already be expressed in SimpleCALC-k. We prove that the k-countingonly queries form a non-collapsing hierarchy: for every k, there exist (k+1)-counting-only queries that are not k-counting-only. This result specializes to both SimpleCALC-k and k-SyCALC. Finally, we establish a strong dichotomy between 1-SyCALC and SimpleCALC-k on the one hand and 2-SyCALC on the other hand by showing that satisability, validity, query containment, and query equivalence are decidable for the former two languages, but not for the latter one...
... Fragments of Tarski's relation algebra form the basis of many versatile graph and tree query languages including the regular path queries, XPath, and SPARQL. Surprisingly, however, a systematic study of the relative expressive power of relation algebra fragments on trees has not yet been undertaken. Our approach is to start from a basic fragment which only allows composition and union. We then study how the expressive power of the query language changes if we add diversity, converse, projections, coprojections, intersections, and/or difference, both for path queries and Boolean queries. For path queries, we found that adding intersection and difference yields more expressive power for some fragments, while adding one of the other operators always yields more expressive power. For Boolean queries, we obtain a similar picture for the relative expressive power, except for a few fragments where adding converse or projection yields no more expressive power. ...
... Many graph query languages rely on the composition operator to navigate graphs and select nodes of interests, even though evaluating compositions of relations can be costly. Often, this need for composition can be reduced by rewriting towards queries that use semi-joins instead. In this way, the cost of evaluating queries can be significantly reduced. We study techniques to recognize and apply such rewritings. Concretely, we study the relationship between the expressive power of the relation algebras, that heavily rely on composition, and the semi-join algebras, that replace the composition operator in favor of the semi-join operators. As our main result, we show that each fragment of the relation algebras where intersection and/or difference is only used on edges (and not on complex compositions) is expressively equivalent to a fragment of the semi-join algebras. This expressive equivalence holds for node queries that evaluate to sets of nodes. For practical relevance, we exhibit constructive steps for rewriting relation algebra queries to semi-join algebra queries, and prove that these steps lead to only a well-bounded increase in the number of steps needed to evaluate the rewritten queries. ....
... We study the expressiveness on a given document of various fragments of XPath, the core navigational language on XML documents. Viewing these languages as fragments of Tarski�s relation algebra, we give characterizations for when a binary relation on the document�s nodes (i.e., a set of paths) is definable by an expression in these algebras. In contrast with this �global view� on language semantics, there is also a �local view� where one is interested in the nodes to which one can navigate starting from a particular node. In this view, we characterize when a set of nodes can be defined as the result of applying an expression to a given node. All of these global and local definability results are obtained using a two-step methodology, which consists of first characterizing when two nodes cannot be distinguished by an expression in the language, and then bootstrapping these characterizations to the desired results. ...
... we focus on functional constraints. Since the usefulness of functional constraints is not limited to the RDF data model, we study the functional constraints in the more general setting of relations with arbitrary arity. We further introduce constant constraints and study the functional and constant constraints combined. Our main results are sound and complete axiomatizations for the functional and constant constraints, both separately and combined. ...
... exploiting relations
and symmetries within probabilistic models
has been proven to be surprisingly eective
at solving large scale data mining problems.
One of the key operations inside these lifted
approaches is counting - be it for parame-
ter/structure learning or for efficient inference.
... we investigates whether 'Com-
pilation to Graph Databases' could be a prac-
tical technique for scaling lifted probabilistic
inference and learning methods and demonstrate that graph database queries to obtain
both exact and approximate counts can make
state-of-the-art inference and learning meth-
ods orders of magnitude faster, without sacri-
cing performance....
... we study navigational query languages for graphs (binary relations). We make more powerful languages by adding intersection; set dierence; projection; coprojection; converse; transitive closure; and the diversity relation to the union and composition operations of the base lanauge.... we compare the expressive power of all resulting languages, both for binary-relation queries as well as for boolean queries... we now complete the Hasse diagram in the presence of transitive closure, both for the case of a single edge label, as well as for the case of at least two edge labels. Our findings include: add expressive power when coprojection is present. ... transitive closure also adds expressive power as soon as converse is present... converse adds expressive power in the presence of transitive closure.
... we study the expressive power of downward fragments of navigational query languages on trees. The basic navigational query language we consider expresses queries by building binary relations from the edge relations and the identity relation, using composition and union... we focus on studying the effects on the expressive power when add transitive closure, projections, coprojections, intersection, and difference operations are added, at both the level of boolean queries and path queries, and on both labeled and unlabeled trees, and on both labeled and unlabeled chains. ... our results include the complete Hasse diagram of relative expressiveness, in which we are able to decide, for each fragment of the navigational query languages that we study, whether it is closed under difference and intersection when applied on trees....
...Our basic language has only the operators union and composition, together with the identity relation. Richer languages can be obtained by adding other features
such as intersection, difference, projection and coprojection, converse, and the
diversity relation. .... For the languages considered above, adding transitive closure augments
the expressive power not only at the level of path queries but also at the level of
Boolean queries, for the latter provided that multiple input relations are allowed.
This is no longer true in the context of unlabeled graphs (i.e., in the case where there
is only one input relation), however.
... we prove that this is indeed not
the case for the basic language to which none, one, or both of projection and the
diversity relation are added, a surprising result given the limited expressive power
of these languages. In combination with our earlier work,
this result yields a complete understanding of the impact of transitive closure on the
languages under consideration.
... we study navigational query languages for graphs (binary relations). The simplest language has only the two operators: union and composition, together with the identity relation. We make more powerful languages by adding any of the following operators: intersection; set difference; projection;
coprojection; converse; and the diversity relation. All these operators map binary relations to binary relations.
We compare the expressive power of all resulting languages,
not only for general path queries (queries where the result may be any binary relation) but also for boolean queries ...
... In this paper, we initiate research on symmetric queries. On a data model in which a binary relation between objects and set names encodes set membershipwe introduce two query languages: QuineCALC and SyCALC, inspired by the symmetric Boolean functions of Quine on sequences of sets of given length. The latter do not only involve the Boolean operations union, intersection, and complement, but also projection and Cartesian product. Quine�s characterization of symmetric Boolean functions in terms of incidence information is generalized to QuineCALC queries.... Inspired by these desirable incidence-related properties of QuineCALC queries, we introduce counting-only queries as SyCALC for which the result only depends on incidence information. We characterize counting-only queries as quantified Boolean combinations of QuineCALC queries and propose a normal form for it. ....
Boarder Impacts and Community Outreach
The fund provided training for undergraduate students to get into research. Among the students funded by this project, Eric Campbell is currently a PhD candidate at Cornell University and a 2019 NSF Graduate Research Fellow. Many of the students are URMs and 1st-gen students.
With the support from NSF, the PI has been a leading force in promoting diversity in the field of computer science.
Melanie Wu have been a member of the ACM-W Executive Council since 2012. ACM-W is a sub-organization of the ACM (Association for Computing Machinery) that focuses on promoting the visibility of female computer scientists and helping female students build successful careers in this field via scholarships and community activities.
Melanie Wu chaired the communication committee of ACM-W from 2012 to 2019. Since 2019, she has been the treasurer of ACM-W.
Melanie Wu sered on the CRA-WP board of directors. CRA-WP is a sub-organization of CRA (Computer Research Association) that aims at increasing the success and participation of underrepresented populations in computing research and education at all levels. CRA-WP provides mentoring and support for women at every level of the research pipeline: undergraduate students, graduate students, faculty, and industry and government researchers. We strive to ensure that our activities have a positive impact on all underrepresented groups in CSE (Computer Science and Engineering) and are committed to improving the working environment and increasing the success of all Computer Scientists and Engineers, without regard for gender, race, sexual orientation or socioeconomic background.
Melanie Wu co-chaired the CREU (Collaborative Research Experiences for Undergraduates) program, an undergraduate research program that provides research stipends to teams of students from underrepresented groups working on research projects under the guidance of a mentor at their home institutions. her work involves administrating application, reviewing proposals, awarding grants and monitoring the students' research progress throughout the grant period.
She co-chaired the CRA-WP Grace Hopper Celebration Research Scholars Program. The program brings undergraduates to the annual Grace Hopper Celebration and engages them in a unique, research-focused experience, providing them a mentor, networking opportunities, and advising toward graduate school and research careers in computing.
Grad Cohort aims to increase the ranks of senior women in computing-related studies and research by building and mentoring nationwide communities of women through their graduate studies. Participants will meet for two days with 20 to 25 senior computing-related researchers and professionals, who will share pertinent information on graduate school survival skills, as well as more personal information and insights about their experiences.
In recent years, over 500 female PhD and MS students attend the 2-day conference. Melanie Wu was a panelist and faculty mentor at CRA-W Grad Cohort 2016, 2017 and 2019 .
Career mentoring workshops aim to provide career mentoring to women in computing in both early and middle careers. It includes three tracks, for women in educational institites, reserach instituites and research labs. The Early Career Mentoring Workshop focuses on how to build a great research program, advise students and get funding, how to be an effective and inspiring teacher, how to set yourself up for promotion and success, and how to balance work and life, among other things. The Mid-Career Mentoring Workshop focuses on topics such as how to prepare for promotion to a more senior position, be an effective leader, build collaborations and start new research initiatives, and balance work and life.
Melanie Wu was a panelist and mentor at the early and mid career mentoring workshops in 2018 and 2019.
The Grace Hopper Celebration of Women in Computing (GHC) is the world's largest gathering of women technologists. It is produced by the Anita Borg Institute and presented in partnership with ACM.
Melanie Wu has been attending GHC with her students every year. at GHC, she hosted panel, served as judge for student research competition, hosted ACM information booth, and participated in student mentoring.
She lead 15 students to attend GHC 2017 in Orlando, FL. She was leading the effort to organize students from Pomona college to attend GHC (2018, 2019, 2020).
Point of Contact: Yuqing Melanie Wu
Last updated on July 2, 2020