Zhengyi Yang's Publications

[by year] [by type]

(* Corresponding Author; # Equal Contribution)

[29] Haoran Ning, Bocheng Han, Zhengyi Yang*, Kongzhang Hao, Miao Ma, Chunling Wang, Boge Liu, Xiaoshuang Cheng, Yu Hao, Yi Jin, Wanchun Zhang, and Chengwei Zhang. Exploring simple architecture of just-in-time compilation in databases. In Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, 2024. [ bib ]
[28] Yi Ding, Zhengyi Yang*, Shunyang Li, Liuyi Chen, Haoran Ning, Kongzhang Hao, and Yongfei Liu. Fgaq: Accelerating graph analytical queries using fpga. In Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, 2024. [ bib ]
[27] Wenke Yang, Zihan Yang, Liuyi Chen, Ruiqing Yan, Zhengyi Yang*, Linhan Zhang, and Tang Yifu. Parallel program generation for hybrid tabular-textual question answering. In Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, 2024. [ bib ]
[26] Diya Yan, Riza Yosia Sunindijo, Cynthia C Wang, and Zhengyi Yang. Comparing the career paths of male and female managers in the construction industry: Insights from linkedin data. In Proceedings of the 40th Annual Association of Researchers in Construction Management Conference (ARCOM), 2024. [ bib ]
[25] Kaiyu Chen, Dong Wen, Wentao Li, Zhengyi Yang, and Wenjie Zhang. On compressing historical cliques in temporal graphs. In Database Systems for Advanced Applications (DASFAA), 2024. [ bib ]
[24] Tianming Zhang, Junkai Fang, Zhengyi Yang, Bin Cao, and Jing Fan. Tatkc: A temporal graph neural network for fast approximate temporal katz centrality ranking. In Proceedings of the ACM on Web Conference 2024 (WWW), WWW '24, pages 527–538, New York, NY, USA, 2024. Association for Computing Machinery. [ bib | DOI | http ]
Numerous real-world networks are represented as temporal graphs, which capture the dynamics of connections over time. Identifying important nodes on temporal graphs has a plethora of real-life applications, such as information propagation and influential user identification, etc. Temporal Katz centrality, a popular temporal metric, gauges the importance of nodes by taking into account both the number of temporal walks and the timespan between the interactions. The computation of traditional temporal Katz centrality is computationally expensive, especially when applied to massive temporal graphs. Therefore, in this paper, we design a temporal graph neural network to approximate temporal Katz centrality computation. To the best of our knowledge, we are the first to address temporal Katz centrality computation purely from a learning-based perspective. We propose a time-injected self-attention model that consists of two phases. In the first phase, we utilize a time-injected self-attention mechanism to acquire node representations that encompass both structural information and temporal relevance. The second phase is structured as a multi-layer perceptron (MLP) which uses the learned node representation to predict node rankings. Furthermore, normalization and neighbor sampling strategies are integrated into the model to enhance its overall performance. Extensive experiments on real-world networks demonstrate the efficiency and accuracy of TATKC.
[23] Tianming Zhang, Yunjun Gao, Jie Zhao, Lu Chen, Lu Jin, Zhengyi Yang, Bin Cao, and Jing Fan. Efficient exact and approximate betweenness centrality computation for temporal graphs. In Proceedings of the ACM on Web Conference 2024 (WWW), WWW '24, pages 2395–2406, New York, NY, USA, 2024. Association for Computing Machinery. [ bib | DOI | http ]
Betweenness centrality of a vertex in a graph evaluates how often the vertex occurs in the shortest paths. It is a widely used metric of vertex importance in graph analytics. While betweenness centrality on static graphs has been extensively investigated, many real-world graphs are time-varying and modeled as temporal graphs. Examples include social networks and telecommunication networks, where a relationship between two vertices occurs at a specific time. Hence, in this paper, we target efficient methods for temporal betweenness centrality computation. We firstly propose an exact algorithm with the new notion of time instance graph, based on which, we derive a temporal dependency accumulation theory for iterative computation. To reduce the size of the time instance graph and improve the efficiency, we propose an additional optimization, which compresses the time instance graph with equivalent vertices and edges, and extends the dependency theory to the compressed graph. Since it is theoretically complex to compute temporal betweenness centrality, we further devise a probabilistically guaranteed approximate method to handle massive temporal graphs. Extensive experimental results on real-world temporal networks demonstrate the superior performance of the proposed methods. In particular, our exact and approximate methods outperform the state-of-the-art methods by up to two and five orders of magnitude, respectively.
[22] Lingbo Li, Zhichun Li, Fusen Guo, Haoyu Yang, Jingtian Wei, and Zhengyi Yang. Prototype comparison convolutional networks for one-shot segmentation. IEEE Access, 12:54978–54990, 2024. [ bib | DOI ]
In few-shot semantic segmentation (FSS), the key challenges are efficiently tuning the interaction between the support set and the query set and distinguishing between context, background, and interfering items. To address these challenges, we propose prototype comparison networks for one-shot segmentation (OPCN) to capture the details required for FSS. Specifically, we offer the Fusion Interaction Module (FIM) to improve the segmentation performance by capturing the correlation and semantic information between the support set and query set features. Subsequently, we propose the Feature Enhancement Module (FEM), which aims to enhance the information required in the support set and query set features while increasing the focus on critical details by reducing the weight of the background regions to provide a more targeted feature representation for subsequent query image segmentation. Then, we propose the Feature Refinement Module (FRM) to filter irrelevant background information in the features and specify the target location region. Finally, the Feature Matching Module (FM) generates the final segmentation mask for the query image. Extensive experiments on the PASCAL-5i and COCO-20i datasets show that our approach achieves excellent performance in the case of the one-shot setup.
[21] Tianming Zhang, Xinwei Cai, Lu Chen, Zhengyi Yang, Yunjun Gao, Bin Cao, and Jing Fan. Towards efficient simulation-based constrained temporal graph pattern matching. World Wide Web, 27(3):22, Apr 2024. [ bib | DOI | http ]
In the context of searching a single data graph G, graph pattern matching is to find all the occurrences of a pattern graph Q in G, specified by a matching rule. It is of paramount importance in many real applications such as social network analysis and cyber security, among others. A wide spectrum of studies target general graph pattern matching. However, to analyze time-relevant services such as studying the spread of diseases and detecting attack patterns, it is attractive to study inexact temporal graph pattern matching. Hence, in this paper, we propose a relaxed matching rule called constrained temporal dual simulation, and study simulation-based constrained temporal graph pattern matching which guarantees that the matching result (i) preserves the ancestor and descendant temporal connectivities; and (ii) implements edge-to-temporal path mapping. We devise a decomposition-based matching method, which first decomposes the data graph into Source Temporal Connected Components, and then performs matching on decomposed subgraphs. To speed up the matching, we define child/parent dependency relation tables and propose an efficient double hierarchical traverse strategy. Considering that the temporal graphs are naturally dynamic, we further propose update algorithms. An extensive empirical study over real-world and synthetic temporal graphs has demonstrated the effectiveness and efficiency of our approach.
[20] Sai Li, Peng Kou, Miao Ma, Haoyu Yang, Shuo Huang, and Zhengyi Yang. Application of semi-supervised learning in image classification: Research on fusion of labeled and unlabeled data. IEEE Access, 12:27331–27343, 2024. [ bib | DOI ]
Deep learning has attracted wide attention recently because of its excellent feature representation ability and end-to-end automatic learning method. Especially in clinical medical imaging diagnosis, the semi-supervised deep learning model is favored and widely used because it can make maximum use of a limited number of labeled data and combine it with a large number of unlabeled data to extract more information and knowledge from it. However, the scarcity of medical image data, the vast image size, and the instability of image quality directly affect the model's robustness, generalization, and image classification performance. Therefore, this paper proposes a new semi-supervised learning model, which uses quadratic neurons instead of traditional neurons, aiming to use quadratic convolution instead of the conventional convolution layer to improve the feature extraction ability of the model. In addition, we introduce two Dropout layers and two fully connected layers at the end of the model to enhance the nonlinear fitting ability of the network. Experiments on two large medical public data sets - ISIC 2019 and Retinopathy OCT - show that our method can improve the model's generalization performance and image classification accuracy.
[19] Wenqian Zhang, Zhengyi Yang*, Dong Wen, and Xiaoyang Wang. Efficient distributed core graph decomposition. In 2023 IEEE International Conference on Data Mining Workshops (ICDMW), pages 1023–1031, 2023. [ bib | DOI ]
Core decomposition is one of the most fundamental problems in graph analytics, which is associated with numerous applications, such as community detection, protein network analysis, and system structure analysis. As the sizes of graphs are becoming increasingly large, it is challenging to compute core decomposition on a single machine. In this paper, we study the problem of k-Core decomposition in the distributed environment. Specifically, we propose the distributed Filter-Array k-Core (FAkCore) algorithm, which adopts the commonly used Scatter-Gather framework. We design an auxiliary data structure of running counts for each vertex to track the statistics of its neighbors' core number. It allows us to recompute the core number of a vertex only when the value is updated. Together with an enhanced message filtering mechanism, our method significantly reduces redundant computation and communication in the existing distributed k-Core decomposition algorithm. Experiments on 10 real-world graphs show that our method outperforms the baseline algorithms by 1.4 times on average and up to 2.2 times.
[18] Liuyi Chen, Bocheng Han, Xuesong Wang, Jiazhen Zhao, Wenke Yang, and Zhengyi Yang*. Machine learning methods in weather and climate applications: A survey. Applied Sciences, 13(21), 2023. [ bib | DOI | http ]
With the rapid development of artificial intelligence, machine learning is gradually becoming popular for predictions in all walks of life. In meteorology, it is gradually competing with traditional climate predictions dominated by physical models. This survey aims to consolidate the current understanding of Machine Learning (ML) applications in weather and climate prediction—a field of growing importance across multiple sectors, including agriculture and disaster management. Building upon an exhaustive review of more than 20 methods highlighted in existing literature, this survey pinpointed eight techniques that show particular promise for improving the accuracy of both short-term weather and medium-to-long-term climate forecasts. According to the survey, while ML demonstrates significant capabilities in short-term weather prediction, its application in medium-to-long-term climate forecasting remains limited, constrained by factors such as intricate climate variables and data limitations. Current literature tends to focus narrowly on either short-term weather or medium-to-long-term climate forecasting, often neglecting the relationship between the two, as well as general neglect of modeling structure and recent advances. By providing an integrated analysis of models spanning different time scales, this survey aims to bridge these gaps, thereby serving as a meaningful guide for future interdisciplinary research in this rapidly evolving field.
[17] Miao Ma, Zhengyi Yang*, Kongzhang Hao, Liuyi Chen, Chunling Wang, and Yi Jin. An empirical analysis of just-in-time compilation in modern databases. In Australasian Database Conference (ADC), pages 227–240. Springer Nature Switzerland, 2023. [ bib | DOI | http ]
JIT (Just-in-Time) technology has garnered significant attention for improving the efficiency of database execution. It offers higher performance by eliminating interpretation overhead compared to traditional execution engines. LLVM serves as the primary JIT architecture, which was implemented in PostgreSQL since version 11. However, recent advancements in WASM-based databases, such as Mutable, present an alternative JIT approach. This approach minimizes the extensive engineering efforts associated with the execution engine and focuses on optimizing supported operators for lower latency and higher throughput. In this paper, we perform comprehensive experiments on these two representative open-source databases to gain deeper insights into the effectiveness of different JIT architectures.
[16] Nimish Ukey, Zhengyi Yang*, Wenke Yang, Binghao Li, and Runze Li. knn join for dynamic high-dimensional data: A parallel approach. In Australasian Database Conference (ADC), pages 3–16. Springer Nature Switzerland, 2023. [ bib | DOI | http ]
The k nearest neighbor (kNN) join operation is a fundamental task that combines two high-dimensional databases, enabling data points in the User dataset U to identify their k nearest neighbor points from the Item dataset I. This operation plays a crucial role in various domains, including knowledge discovery, data mining, similarity search applications, and scientific research. However, exact kNN search in high-dimensional spaces is computationally demanding, and existing sequential methods face challenges in handling large datasets. In this paper, we propose an efficient parallel solution for dynamic kNN join over high-dimensional data, leveraging the high-dimensional R tree (HDR Tree) for improved efficiency. Our solution harnesses the power of Simultaneous Multi-Threading (SMT) technologies and Single-Instruction-Multiple-Data (SIMD) instructions in modern CPUs for parallelisation. Importantly, our research is the first to introduce parallel computation for exact kNN join over high-dimensional data. Experimental results demonstrate that our proposed approach outperforms the sequential HDR Tree method by up to 1.2 times with a single thread. Moreover, our solution provides near-linear scalability as the number of threads increases.
[15] Nimish Ukey#, Guangjian Zhang#, Zhengyi Yang*, Binghao Li, Wei Li, and Wenjie Zhang. Efficient continuous knn join over dynamic high-dimensional data. World Wide Web, Sep 2023. [ bib | DOI | http ]
Given a user dataset U and an object dataset I, a kNN join query in high-dimensional space returns the k nearest neighbors of each object in dataset U from the object dataset I. The kNN join is a basic and necessary operation in many applications, such as databases, data mining, computer vision, multi-media, machine learning, recommendation systems, and many more. In the real world, datasets frequently update dynamically as objects are added or removed. In this paper, we propose novel methods of continuous kNN join over dynamic high-dimensional data. We firstly propose the HDR+ Tree, which supports more efficient insertion, deletion, and batch update. Further observed that the existing methods rely on globally correlated datasets for effective dimensionality reduction, we then propose the HDR Forest. It clusters the dataset and constructs multiple HDR Trees to capture local correlations among the data. As a result, our HDR Forest is able to process non-globally correlated datasets efficiently. Two novel optimisations are applied to the proposed HDR Forest, including the precomputation of the PCA states of data items and pruning-based kNN recomputation during item deletion. For the completeness of the work, we also present the proof of computing distances in reduced dimensions of PCA in HDR Tree. Extensive experiments on real-world datasets show that the proposed methods and optimisations outperform the baseline algorithms of naive RkNN join and HDR Tree.
[14] Kongzhang Hao, Long Yuan, Zhengyi Yang, Wenjie Zhang, and Xuemin Lin. Efficient and scalable distributed graph structural clustering at billion scale. In Database Systems for Advanced Applications (DASFAA), pages 234–251, Cham, 2023. Springer Nature Switzerland. [ bib | DOI | http ]
Structural Graph Clustering (SCAN) is a fundamental problem in graph analysis and has received considerable attention recently. Existing distributed solutions either lack efficiency or suffer from high memory consumption when addressing this problem in billion-scale graphs. Motivated by these, in this paper, we aim to devise a distributed algorithm for SCAN that is both efficient and scalable. We first propose a fine-grained clustering framework tailored for SCAN. Based on the new framework, we devise a distributed SCAN algorithm, which not only keeps a low communication overhead during execution, but also effectively reduces the memory consumption at all time. We also devise an effective workload balance mechanism that is automatically triggered by the idle machines to handle skewed workloads. The experiment results demonstrate the efficiency and scalability of our proposed algorithm.
[13] Zhengyi Yang, Wenjie Zhang, Xuemin Lin*, Ying Zhang, and Shunyang Li. Hgmatch: A match-by-hyperedge approach for subgraph matching on hypergraphs. In 2023 IEEE 39th International Conference on Data Engineering (ICDE), pages 2063–2076, 2023. [ bib | DOI ]
Hypergraphs are generalisation of graphs in which a hyperedge can connect any number of vertices. It can describe n-ary relationships and high-order information among entities compared to conventional graphs. In this paper, we study the fundamental problem of subgraph matching on hypergraphs (i.e, subhypergraph matching). Existing methods directly extend subgraph matching algorithms to the case of hypergraphs. However, this approach delays hyperedge verification and underutilises the high-order information in hypergraphs, which leads to large search space and high enumeration cost. Furthermore, with the growing size of hypergraphs, it is becoming hard to compute subhypergraph matching sequentially. Thus, we propose an efficient and parallel subhypergraph matching system, HGMatch, to handle subhypergraph matching in massive hypergraphs. We proposes a novel match-by-hyperedge framework to utilise high-order information in hypergraphs and uses set operations for efficient candidates generation. Moreover, we develop an optimised parallel execution engine in HGMatch based on the dataflow model, which features a task-based scheduler and fine-grained dynamic work stealing to achieve bounded memory execution and better load balancing. Experimental evaluation on 10 real-world datasets shows that HGMatch outperforms the extended version of the state-of-the-art subgraph matching algorithms (CFL, DAF, CECI and RapidMatch) by orders of magnitude when using a single thread, and achieves almost linear scalability when the number of threads increases.
[12] Nimish Ukey, Zhengyi Yang*, Binghao Li, Guangjian Zhang, Yiheng Hu, and Wenjie Zhang. Survey on exact knn queries over high-dimensional data space. Sensors, 23(2), 2023. [ bib | DOI | http ]
k nearest neighbours (kNN) queries are fundamental in many applications, ranging from data mining, recommendation system and Internet of Things, to Industry 4.0 framework applications. In mining, specifically, it can be used for the classification of human activities, iterative closest point registration and pattern recognition and has also been helpful for intrusion detection systems and fault detection. Due to the importance of kNN queries, many algorithms have been proposed in the literature, for both static and dynamic data. In this paper, we focus on exact kNN queries and present a comprehensive survey of exact kNN queries. In particular, we study two fundamental types of exact kNN queries: the kNN Search queries and the kNN Join queries. Our survey focuses on exact approaches over high-dimensional data space, which covers 20 kNN Search methods and 9 kNN Join methods. To the best of our knowledge, this is the first work of a comprehensive survey of exact kNN queries over high-dimensional datasets. We specifically categorise the algorithms based on indexing strategies, data and space partitioning strategies, clustering techniques and the computing paradigm. We provide useful insights for the evolution of approaches based on the various categorisation factors, as well as the possibility of further expansion. Lastly, we discuss some open challenges and future research directions.
[11] Weixiao Xu, Lin Sun, Cheng Zhen, Bo Liu, Zhengyi Yang, and Wenke Yang. Deep learning-based image recognition of agricultural pests. Applied Sciences, 12(24), 2022. [ bib | DOI | http ]
Pests and diseases are an inevitable problem in agricultural production, causing substantial economic losses yearly. The application of convolutional neural networks to the intelligent recognition of crop pest images has become increasingly popular due to advances in deep learning methods and the rise of large-scale datasets. However, the diversity and complexity of pest samples, the size of sample images, and the number of examples all directly affect the performance of convolutional neural networks. Therefore, we designed a new target-detection framework based on Cascade RCNN (Regions with CNN features), aiming to solve the problems of large image size, many pest types, and small and unbalanced numbers of samples in pest sample datasets. Specifically, this study performed data enhancement on the original samples to solve the problem of a small and unbalanced number of examples in the dataset and developed a sliding window cropping method, which could increase the perceptual field to learn sample features more accurately and in more detail without changing the original image size. Secondly, combining the attention mechanism with the FPN (Feature Pyramid Networks) layer enabled the model to learn sample features that were more important for the current task from both channel and space aspects. Compared with the current popular target-detection frameworks, the average precision value of our model (mAP@0.5) was 84.16%, the value of (mAP@0.5:0.95) was 65.23%, the precision was 67.79%, and the F1 score was 82.34%. The experiments showed that our model solved the problem of convolutional neural networks being challenging to use because of the wide variety of pest types, the large size of sample images, and the difficulty of identifying tiny pests.
[10] Xia Li, Kongzhang Hao, Zhengyi Yang, Xin Cao, and Wenjie Zhang. Hop-constrained s-t simple path enumeration in large uncertain graphs. In Australasian Database Conference (ADC), pages 115–127. Springer International Publishing, 2022. [ bib | DOI | http ]
Uncertain graphs are graphs where each edge is assigned with a probability of existence. In this paper, we study the problem of hop-constrained s-t simple path enumeration in large uncertain graphs. To the best of our knowledge, we are the first to study this problem in the literature. Specifically, we propose a light-weight index to prune candidate paths by adopting the concept of probability-constrained distance. An efficient enumeration algorithm is designed based on the index structure. Experiment results on real-world datasets show that our proposed methods significantly outperform the baseline methods by up to 6 times.
[9] Nimish Ukey, Zhengyi Yang*, Guangjian Zhang, Boge Liu, Binghao Li, and Wenjie Zhang. Efficient knn join over dynamic high-dimensional data. In Australasian Database Conference (ADC), pages 63–75. Springer International Publishing, 2022. [ bib | DOI | http ]
Given a user dataset U and an object dataset I in high-dimensional space, a kNN join query retrieves each object in dataset U its k nearest neighbors from the dataset I. kNN join is a fundamental and essential operation in applications from many domains such as databases, computer vision, multi-media, machine learning, recommendation systems, and many more. The datasets in real world often update dynamically on insertion or deletion of objects. However, existing algorithms of dynamic kNN join lack support for deletion and batch update, which are important in real-life applications. In this paper, we propose a new method of kNN join over dynamic high-dimensional data. Specifically, our method features lazy updates, batch operations, and optimised deletions. Experiments on real-world datasets show that our method outperforms the existing algorithms of naive RkNN join and HDR Tree by up to 5 and 4 times, respectively.
[8] Xia Li, Kongzhang Hao, Zhengyi Yang, Xin Cao, Wenjie Zhang, Long Yuan, and Xuemin Lin. Hop-constrained s-t simple path enumeration in billion-scale labelled graphs. In Web Information Systems Engineering (WISE), pages 49–64. Springer International Publishing, 2022. [ bib | DOI | http ]
Hop-constrained s-t simple path (HC-s-tpath) enumeration is a fundamental problem in graph analysis. Existing solutions for this problem focus on unlabelled graphs and assume queries are issued without any label constraints. However, in many real-world applications, graphs are edge-labelled and the queries involve label constraints on the path connecting two vertices. Therefore, we study the problem of labelled hop-constrained s-t path (LHC-s-tpath) enumeration in this paper. We aim to efficiently enumerate the HC-s-tpaths using only edges with provided labels. To achieve this goal, we first demonstrate the existence of unnecessary computation specific to the label constraints in the state-of-the-art HC-s-tpath enumeration algorithm. We then devise a novel online index to identify the fruitless exploration during the enumeration. Based on the proposed index, we design an efficient LHC-s-tpath enumeration algorithm in which unnecessary computation is effectively pruned. Extensive experiments are conducted on real-world graphs with billions of edges. Experiment results show that our proposed algorithms significantly outperform the baseline methods by over one order of magnitude.
[7] Shunyang Li, Zhengyi Yang, Xianhang Zhang, Wenjie Zhang, and Xuemin Lin. Sql2cypher: Automated data and query migration from rdbms to gdbms. In Web Information Systems Engineering (WISE), pages 510–517. Springer International Publishing, 2021. [ bib | DOI | http ]
There are many real-world application domains where data can be naturally modelled as a graph, such as social networks and computer networks. Relational Database Management Systems (RDBMS) find it hard to capture the relationships and inherent graph structure of data and are inappropriate for storing highly connected data; thus, graph databases have emerged to address the challenges of high data connectivity. As the performance of querying highly connected data in relational query statements is usually worse than that in the graph database. Transforming data from a relational database to a graph database is imperative for improving the performance of graph queries. In this paper, we demonstrate SQL2Cypher, a system for migrating data from a relational database to a graph database automatically. This system also supports translating SQL queries into Cypher queries. SQL2Cypher is open-source (https://github.com/UNSW-database/SQL2Cypher) to allow researchers and programmers to migrate data efficiently. Our demonstration video can be found here: https://www.youtube.com/watch?v=eGaeBrVTJws.
[6] Zhengyi Yang, Longbin Lai, Xuemin Lin, Kongzhang Hao, and Wenjie Zhang. Huge: An efficient and scalable subgraph enumeration system. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD), SIGMOD/PODS '21, pages 2049–2062, New York, NY, USA, 2021. Association for Computing Machinery. [ bib | DOI | http ]
Subgraph enumeration is a fundamental problem in graph analytics, which aims to find all instances of a given query graph on a large data graph. In this paper, we propose a system called HUGE to efficiently process subgraph enumeration at scale in the distributed context. HUGE features 1) an optimiser to compute an advanced execution plan without the constraints of existing works; 2) a hybrid communication layer that supports both pushing and pulling communication; 3) a novel two-stage execution mode with a lock-free and zero-copy cache design; 4) a BFS/DFS-adaptive scheduler to bound memory consumption; and 5) two-layer intra- and inter-machine load balancing. HUGE is generic such that all existing distributed subgraph enumeration algorithms can be plugged in to enjoy automatic speed up and bounded-memory execution.
[5] Xin Jin, Zhengyi Yang, Xuemin Lin, Shiyu Yang, Lu Qin, and You Peng. Fast: Fpga-based subgraph matching on massive graphs. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 1452–1463, 2021. [ bib | DOI ]
Subgraph matching is a basic operation widely used in many applications. However, due to its NP-hardness and the explosive growth of graph data, it is challenging to compute subgraph matching, especially in large graphs. In this paper, we aim at scaling up subgraph matching on a single machine using FPGAs. Specifically, we propose a CPU-FPGA co-designed framework. On the CPU side, we first develop a novel auxiliary data structure called candidate search tree (CST) which serves as a complete search space of subgraph matching. CST can be partitioned and fully loaded into FPGAs' on-chip memory. Then, a workload estimation technique is proposed to balance the load between the CPU and FPGA. On the FPGA side, we design and implement the first FPGA-based subgraph matching algorithm, called FAST. To take full advantage of the pipeline mechanism on FPGAs, task parallelism optimization and task generator separation strategy are proposed for FAST, achieving massive parallelism. Moreover, we carefully develop a BRAM-only matching process to fully utilize FPGA's on-chip memory, which avoids the expensive intermediate data transfer between FPGA's BRAM and DRAM. Comprehensive experiments show that FAST achieves up to 462.0x and 150.0x speedup compared with the state-of-the-art algorithm DAF and CECI, respectively. In addition, FAST is the only algorithm that can handle the billion-scale graph using one machine in our experiments.
[4] Ran Wang, Zhengyi Yang, Wenjie Zhang, and Xuemin Lin. An empirical study on recent graph database systems. In Knowledge Science, Engineering and Management (KSEM), pages 328–340, Berlin, Heidelberg, 2020. Springer International Publishing. [ bib | DOI | http ]
Graphs are widely used to model the intricate relationships among objects in a wide range of applications. The advance in graph data has brought significant value to artificial intelligence technologies. Recently, a number of graph database systems have been developed. In this paper, we present a comprehensive overview and empirical investigation on existing property graph database systems such as Neo4j, AgensGraph, TigerGraph and LightGraph (LightGraph has recently renamed to TuGraph.). These systems support declarative graph query languages. Our empirical studies are conducted in a single-machine environment against on the LDBC social network benchmark, consisting of three different large-scale datasets and a set of benchmark queries. This is the first empirical study to compare these graph database systems by evaluating data bulk importing and processing simple and complex queries. Experimental results provide insightful observations of various graph data systems and indicate that AgensGraph works well on SQL based workload and simple update queries, TigerGraph is powerful on complex business intelligence queries, Neo4j is user-friendly and suitable for small queries, and LightGraph is a more balanced product achieving good performance on different queries. The related code, scripts and data of this paper are available online (https://github.com/UNSW-database/GraphDB-Benchmark).
[3] Longbin Lai, Zhu Qing, Zhengyi Yang, Xin Jin, Zhengmin Lai, Ran Wang, Kongzhang Hao, Xuemin Lin, Lu Qin, Wenjie Zhang, Ying Zhang, Zhengping Qian, and Jingren Zhou. Distributed subgraph matching on timely dataflow. In Proc. VLDB Endow. (VLDB), volume 12, pages 1099–1112. VLDB Endowment, June 2019. [ bib | DOI | http ]
Recently there emerge many distributed algorithms that aim at solving subgraph matching at scale. Existing algorithm-level comparisons failed to provide a systematic view of distributed subgraph matching mainly due to the intertwining of strategy and optimization. In this paper, we identify four strategies and three general-purpose optimizations from representative state-of-the-art algorithms. We implement the four strategies with the optimizations based on the common Timely dataflow system for systematic strategy-level comparison. Our implementation covers all representative algorithms. We conduct extensive experiments for both unlabelled matching and labelled matching to analyze the performance of distributed subgraph matching under various settings, which is finally summarized as a practical guide.
[2] Kongzhang Hao, Zhengyi Yang, Longbin Lai, Zhengmin Lai, Xin Jin, and Xuemin Lin. Patmat: A distributed pattern matching engine with cypher. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM), CIKM '19, pages 2921–2924, New York, NY, USA, 2019. Association for Computing Machinery. [ bib | DOI | http ]
Graph pattern matching is one of the most fundamental problems in graph database and is associated with a wide spectrum of applications. Due to its computational intensiveness, researchers have primarily devoted their efforts to improving the performance of the algorithm while constraining the graphs to have singular labels on vertices (edges) or no label. Whereas in practice graphs are typically associated with rich properties, thus the main focus in the industry is instead on powerful query languages that can express a sufficient number of pattern matching scenarios. We demo PatMat in this work to glue together the academic efforts on performance and the industrial efforts on expressiveness. To do so, we leverage the state-of-the-art join-based algorithms in the distributed contexts and Cypher query language - the most widely-adopted declarative language for graph pattern matching. The experiments demonstrate how we are capable of turning complex Cypher semantics into a distributed solution with high performance.
[1] Zhengmin Lai, Zhengyi Yang, and Longbin Lai. Improving distributed subgraph matching algorithm on timely dataflow. In 2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW), pages 266–273, 2019. [ bib | DOI ]
The subgraph matching problem is defined to find all subgraphs of a data graph that are isomorphic to a given query graph. Subgraph matching plays a vital role in the fields of e-commerce, social media and biological science. CliqueJoin is a distributed subgraph matching algorithm that is designed to be efficient and scalable. However, CliqueJoin is originally developed on MapReduce, thus the performance of the algorithm can be affected by the notorious I/O issue of MapReduce while processing multi-round join tasks. Meanwhile, CliqueJoin does not propose a cost evaluation strategy for labelled graphs, which limits its application in practice where most real-world graphs are labelled. Targeting the limitations of CliqueJoin, we propose CliqueJoin++ to improve CliqueJoin in two aspects. Firstly, we implement CliqueJoin++ on the Timely dataflow system instead of MapReduce to avoid considerable I/O cost. Secondly, we extend the cost evaluation function in CliqueJoin to compute optimal join plans for labelled graphs in the distributed context. Extensive experiments have been conducted to show that the proposed method is up to 10 times faster than the MapReduce version for unlabelled matching, and it achieves good performance and scalability for labelled matching.

This file was generated by bibtex2html 1.99.