Zhengyi Yang's Publications

(* Corresponding Author; # Equal Contribution)

[54]	Yuxing Han, Lixiang Chen, Haoyu Wang, Zhanghao Chen, Yifan Zhang, Chengcheng Yang, Kongzhang Hao, and Zhengyi Yang. Learning from the past: Adaptive parallelism tuning for stream processing systems. In 2025 IEEE 41st International Conference on Data Engineering (ICDE), pages 3535–3548, Los Alamitos, CA, USA, May 2025. IEEE Computer Society. [ bib \| DOI \| http ] Distributed stream processing systems rely on the dataflow model to define and execute streaming jobs, organizing computations as Directed Acyclic Graphs (DAGs) of operators. Adjusting the parallelism of these operators is crucial to handling fluctuating workloads efficiently while balancing resource usage and processing performance. However, existing methods often fail to effectively utilize execution histories or fully exploit DAG structures, limiting their ability to identify bottlenecks and determine the optimal parallelism. In this paper, we propose StreamTune, a novel approach for adaptive parallelism tuning in stream processing systems. StreamTune incorporates a pre-training and fine-tuning framework that leverages global knowledge from historical execution data for job-specific parallelism tuning. In the pre-training phase, StreamTune clusters the historical data with Graph Edit Distance and pre-trains a Graph Neural Network-based encoder per cluster to capture the correlation between the operator parallelism, DAG structures, and the identified operator-level bottlenecks. In the online tuning phase, StreamTune iteratively refines operator parallelism recommendations using an operator-level bottleneck prediction model enforced with a monotonic constraint, which aligns with the observed system performance behavior. Evaluation results demonstrate that StreamTune reduces reconfigurations by up to 29.6% and parallelism degrees by up to 30.8% in Apache Flink under a synthetic workload. In Timely Dataflow, StreamTune achieves up to an 83.3% reduction in parallelism degrees while maintaining comparable processing performance under the Nexmark benchmark, when compared to the state-of-the-art methods.
[53]	Zebin Chen, Kaiyu Chen, Dong Wen, Zhengyi Yang, Wentao Li, and Ying Zhang. Accelerating shortest path counting on road networks. In 2025 IEEE 41st International Conference on Data Engineering (ICDE), pages 3508–3521, Los Alamitos, CA, USA, May 2025. IEEE Computer Society. [ bib \| DOI \| http ] Counting the number of shortest paths between two query vertices on road networks has a wide range of applications and has recently drawn significant research attention. The state-of-the-art solution builds a tree-based index using the concept of tree decomposition. However, its performance deteriorates when the tree decomposition results in an unbalanced tree and may not perform well when the query vertices are close to each other. This paper aims to improve the efficiency of shortest path counting. We propose a novel indexing scheme that combines hub labeling with a balanced tree hierarchy. This approach significantly reduces the number of visited labels compared to the state-of-the-art solution. Furthermore, we introduce several optimizations to enhance the efficiency of index construction and minimize its size. Extensive experiments conducted on real-world road networks demonstrate that our method achieves up to 4.1 times higher query efficiency and reduces the index size by a factor of 2.35 compared to the state-of-the-art solution.
[52]	Wenqian Zhang, Zhengyi Yang, Dong Wen, Wentao Li, Wenjie Zhang, and Xuemin Lin. Accelerating core decomposition in billion-scale hypergraphs. In Proceedings of the ACM on Management of Data (SIGMOD)*, volume 3, New York, NY, USA, February 2025. Association for Computing Machinery. [ bib \| DOI \| http ] Hypergraphs provide a versatile framework for modeling complex relationships beyond pairwise interactions, finding applications in various domains. k-core decomposition is a fundamental task in hypergraph analysis that decomposes hypergraphs into cohesive substructures. Existing studies capture the cohesion in hypergraphs based on the vertex neighborhood size. However, such decomposition poses unique challenges, including the efficiency of core value updates, redundant computation, and high memory consumption. We observe that the state-of-the-art algorithms do not fully address the above challenges and are unable to scale to large hypergraphs. In this paper, we propose an efficient approach for hypergraph k-core decomposition. Novel concepts and strategies are developed to compute the core value of each vertex and reduce redundant computation of vertices. Experimental results on real-world and synthetic hypergraphs demonstrate that our approach significantly outperforms the state-of-the-art algorithm by 7 times on average while reducing the average memory usage by 36 times. Moreover, while existing algorithms fail on tens of millions of hyperedges, our approach efficiently handles billion-scale hypergraphs using only a single thread.
[51]	Tianming Zhang, Renbo Zhang, Zhengyi Yang, Yunjun Gao, Bin Cao, and Jing Fan. CLGNN: A contrastive learning-based GNN model for betweenness centrality prediction on temporal graphs, 2025. [ bib \| arXiv \| http ]
[50]	Yizhe Zhang, Zhengyi Yang, Bocheng Han, Haoran Ning, Xin Cao, John Shepherd, and Guanfeng Liu. RISC-V meets RDBMS: An experimental study of database performance on an open instruction set architecture. In Proceedings of Workshops at the 51st International Conference on Very Large Data Bases (VLDB)*. VLDB.org, 2025. [ bib ]
[49]	Diya Yan, Yi Ding, Riza Yosia Sunindijo, Cynthia C Wang, and Zhengyi Yang. Key factors in women's managerial advancement in the construction industry: insights from machine learning. International Journal of Construction Management, 0(0):1–15, 2025. [ bib \| DOI \| http ] Despite ongoing efforts to promote gender diversity in the Australian construction industry, women remain significantly underrepresented in managerial positions. Differing from previous studies using traditional survey or interview approaches, this study applied career capital theory and analyzed 1,595 LinkedIn profiles with 11 features, related to work experience, network size, educational background, and industry recognition. Predictive modeling was conducted using MATLAB's Classification Learner, applying multiple machine learning algorithms to assess the significance of those features in predicting managerial level. The results identified current employer size as the strongest predictor of female managerial levels. Women in small enterprises were more likely to reach top management, while those in large companies more likely remained in lower managerial levels. Experience duration also had a significant impact, but progression plateaued beyond seven years, indicating tenure alone does not drive advancement. Follower and connection count demonstrated a notable contribution, emphasizing the importance of professional visibility. Contrary to traditional assumptions, recommendation count and highest education level had lower relevance, while construction-related degrees, certifications, awards, and courses showed minimal impact. This study sheds light on the barriers and contributors of women's managerial advancement and provides practical recommendations for policymakers and industry stakeholders to foster inclusive and equitable workplaces.
[48]	Liuyi Chen#, Yi Ding#, Xushuo Tang, Fangyue Chen, Siyuan Gong, Xu Zhou, and Zhengyi Yang. Accelerating streaming subgraph matching via vector databases. Intelligent Computing*, 4:0131, 2025. [ bib \| DOI \| http ] Graphs are widely used in applications such as social network analysis, bioinformatics, and recommendation systems to represent relationships and complex dependencies. Subgraph matching, which involves finding instances of a query subgraph within a larger graph, is crucial for tasks such as fraud detection, pattern recognition, and semantic search. The streaming subgraph matching problem, an extension of this task, aims to efficiently process queries in a stream with minimal latency. This is particularly important in real-time applications such as dynamic monitoring and network anomaly detection, where quick query responses are essential. To address streaming subgraph matching, existing methods incorporate precomputed indices, such as tree structures. However, these approaches often fail to scale efficiently under high query arrival rates or for large graphs due to limitations in caching, query reuse, and indexing performance. In this paper, we adopt a framework that leverages a subgraph index based on graph embeddings, enabling effective caching and reuse of query results. Building on this foundation, we perform k-nearest neighbor search on high-dimensional vectors by using a vector database for indexing. Inverted file and product quantization techniques within the vector database were employed to accelerate the process. Experimental evaluations on 16 diverse real-world datasets show that our approach reduces processing time by an average of 87.7% compared to the state-of-the-art method, achieves cache hit rates ranging from 70% to 90%, and demonstrates robustness and consistency across varying batch sizes and datasets.
[47]	Huangleshuai He, Zhengyi Yang, Dong Wen, Wenqian Zhang, Michael Yu, Wenke Yang, and Wenjie Zhang. A survey on efficient graph reachability queries. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)*, pages 58–77, Singapore, 2025. Springer Nature Singapore. [ bib \| DOI \| http ] Graph reachability queries, which determine whether a path exists between two vertices in a graph, are a foundational problem in graph analytics. This survey provides a comprehensive review of techniques for efficient graph reachability querying, including static and dynamic approaches, indexed and online methods, and recent advancements. Detailed discussions explore various design choices for efficiently handling graph reachability queries, emphasising their applicability, limitations, and performance trade-offs. Additionally, we outline key open challenges and potential future directions to advance this field. This survey aims to guide researchers in navigating and advancing the state-of-the-art in graph reachability queries.
[46]	Longbin Lai, Changwei Luo, Yunkai Lou, Mingchen Ju, and Zhengyi Yang. Graphy'our data: Towards end-to-end modeling, exploring and generating report from raw data. In Companion of the 2025 International Conference on Management of Data (SIGMOD), SIGMOD/PODS '25, pages 147–150, New York, NY, USA, 2025. Association for Computing Machinery. [ bib \| DOI \| http ] While Large Language Models (LLMs) excel at single-document queries and conversational workflows, they struggle with progressively exploring, analyzing, and synthesizing large unstructured document sets, such as in literature surveys. We address this challenge – termed Progressive Document Investigation – by introducing Graphy, an end-to-end platform that automates data modeling, exploration and high-quality report generation in a user-friendly manner. Graphy comprises an offline Scrapper that transforms raw documents into a graph, and an online Surveyor that enables iterative exploration and LLM-driven report generation. We showcase a pre-scrapped graph of over 50,000 papers, demonstrating how Graphy facilitates the literature-survey scenario, with video available at https://youtu.be/uM4nzkAdGlM.
[45]	Xushuo Tang, Liuyi Chen, Wenke Yang, Zhengyi Yang, Mingchen Ju, Xin Shu, Zihan Yang, and Yifu Tang. Tabular-textual question answering: From parallel program generation to large language models. World Wide Web*, 28(4):42, Jun 2025. [ bib \| DOI \| http ] Hybrid tabular-textual question answering (HTQA) involves integrating multiple data sources, traditionally managed through LSTM-based step-by-step reasoning. However, such sequential approaches are prone to exposure bias and cumulative errors, limiting their effectiveness. This paper first introduces an innovative parallel program generation method, ConcurGen, aiming to transform this paradigm by simultaneously formulating comprehensive program constructs that seamlessly blend operations and values. This approach not only rectifies the inherent pitfalls of sequential methodologies but also infuses efficiency into the process. Through our further research, we found that some HTQA scenarios extend beyond traditional question-answering, often involving open-ended questions that demand dynamic, context-aware response generation. Therefore, we introduce a second framework that leverages large language models (LLMs) to effectively answer both traditional and open-ended questions. Our method demonstrates substantial improvements over existing models such as FinQANet and MT2Net on benchmarks including ConvFinQA and MultiHiertt, achieving new state-of-the-art performance across multiple evaluation metrics. In addition to its accuracy, it delivers a nearly 21x speedup in program generation, significantly enhancing inference efficiency. Unlike traditional models, our system maintains robust performance as the complexity of numerical reasoning increases, highlighting its adaptability in challenging scenarios. Furthermore, supplementary experiments on the LLM-based framework show that it provides enriched answer justifications while achieving similar performance to ConcurGen on standard benchmarks.
[44]	Zi Chen, Keke Liang, Long Yuan, Wenjie Zhang, and Zhengyi Yang. Recent advances in efficient dynamic graph processing. Applied Sciences, 15(11), 2025. [ bib \| DOI \| http ] Graphs, as one of the most fundamental and representative data structures, have found a wide spectrum of emerging application domains such as social media, financial transactions, biological science, and road networks. Recently, with the proliferation of graph applications, graph processing has attracted much attention in both industry and academia. Most existing works focus on static graphs, in which the vertices and edges are immutable. However, in the real world, graphs are constantly and dynamically changing, which makes processing such dynamic graphs challenging. This paper surveys the recent advances in dynamic graph processing, including centrality, graph coloring, cohesive subgraph, path traversal, and graph separation. We summarize the computational complexity models for dynamic algorithm analysis and theoretically compare the efficiency of algorithms across different research topics. Moreover, we explore research opportunities for the future.
[43]	Qi Luo, Wenjie Zhang, Zhengyi Yang, Dongxiao Yu, Xuemin Lin, and Liping Wang. Efficient indexing and searching of constrained core in hypergraphs. The VLDB Journal, 34(3):34, Apr 2025. [ bib \| DOI \| http ] Hypergraphs are generalized graph models that capture high-order relationships through hyperedges that contain arbitrary vertices. In large-scale hypergraphs, it is common to observe global sparsity and local density, which makes identifying the dense substructures a fundamental task in graph mining. This paper introduces a framework called (k, p)-PoolCore for searching constrained cohesive subgraphs in hypergraphs, with k representing the degree constraint of vertices and p indicating the size constraint of hyperedges. We theoretically analyze the monotonicity and hierarchical properties of (k, p)-PoolCore. To capitalize on these properties, we propose a tree-based PoolCore index that organizes all (k, p)-PoolCores within trees to facilitate efficient queries. Furthermore, we optimize this index to establish a list-based PoolCore index, which offers O(1) query time complexity in most cases without additional space complexity in large-scale hypergraphs. Our extensive experiments and case studies on real-world hypergraphs demonstrate the advantages of (k, p)-PoolCore in hypergraph decomposition and modeling cohesiveness substructures. The proposed tree-based PoolCore index achieves a remarkable $10^6\times$ speedup compared to the basic computation method. Additionally, the optimized list-based PoolCore index further accelerates querying by at least 100x while maintaining space-complexity efficiency and scalability in real-world hypergraphs.
[42]	Kaiyu Chen, Dong Wen, Hanchen Wang, Zhengyi Yang, Wenjie Zhang, and Xuemin Lin. Covering k-cliques in billion-scale graphs. In Proceedings of the ACM on Web Conference 2025 (WWW), WWW '25, pages 2299–2308, New York, NY, USA, 2025. Association for Computing Machinery. [ bib \| DOI \| http ] The k-clique structure in graphs has been investigated in various real-world applications, such as community detection in complex networks, functional module discovery in biological networks, and link spam detection in web graphs. Despite extensive research on k-clique enumeration, the large number of k-cliques in many graphs poses a challenge for practical application and computation. To address this, we explore the k-clique τ-cover problem, a generalization of the vertex cover problem. The problem aims to find a small set of vertices that can effectively represent all k-cliques in the graph. We prove the NP-hardness of finding the minimum k-clique cover. We propose a hierarchical solution that computes a small cover without enumerating k-cliques. Extensive experiments on real-world graphs verify the efficiency and effectiveness of our solution.
[41]	Diya Yan, Yi Ding, Riza Yosia Sunindijo, Cynthia C Wang, and Zhengyi Yang. Career progression of female and male managers in the Australian construction industry: a comparative analysis using LinkedIn data. International Journal of Construction Management, 0(0):1–10, 2025. [ bib \| DOI \| http ] Despite ongoing efforts to promote gender diversity, gender disparities in leadership roles remain in the Australian construction industry. Existing research has identified barriers to women's career progression in construction using traditional survey or interview methods. To complement these findings and provide a broader perspective on the current state of women's representation in managerial positions in the Australian construction industry, this study analyzed 12,284 LinkedIn profiles using an exploratory-confirmatory design and compared 10 career features between male and female managers relating to work experience, educational background, and professional network and endorsement. The results revealed that women remain underrepresented in managerial roles, but their distribution across management levels was comparable to men, indicating progress toward gender parity. Women also demonstrated higher educational attainment and greater activity on professional networking platforms. However, gender disparities were observed in job categories, indicating that certain career paths may not be fully accessible to women. Additionally, female managers with more than 15 years of experience reported fewer followers than men, highlighting the need for supporting professional influence in later career stages. This study contributes a quantitative perspective to the understanding of gender disparities and offers practical strategies to advance gender equity in the Australian construction industry.
[40]	Boge Liu, Chunling Wang, Xiaoshuang Chen, Yu Hao, Zhengyi Yang, Yi Jin, Yixing Yang, Wenke Yang, Wanchuan Zhang, and Wenjie Zhang. PhoebeDB: A disk-based RDBMS kernel for high-performance and cost-effective OLTP. In Proceedings 28th International Conference on Extending Database Technology (EDBT)*, pages 996–1004, 2025. [ bib \| DOI \| .pdf ] Relational databases have long been fundamental to data management. This paper presents PhoebeDB, an enterprise- and commercial-oriented RDBMS kernel that integrates recent research with practical innovations to deliver high-performance, cost-efficient OLTP solutions. It features: 1) an in-memory data-centric storage design optimized for parallel access and data temperature-based buffer management, 2) a co-routine pool-based runtime with a smart scheduler that maximizes CPU utilization for high-concurrency workloads, and 3) optimized transaction management with in-memory UNDO logs, hybrid concurrency control, parallel Write-Ahead Logging with Remote Flush Avoidance, and enhanced snapshot isolation. Experiments show that PhoebeDB achieves nearly 13.7 million tpmC and 30 million tpm on the TPC-C benchmark using a single machine, delivering a 27x improvement over PostgreSQL.
[39]	Tianming Zhang, Renbo Zhang, Zhengyi Yang, Lu Chen, Yunjun Gao, and Xiaochun Yang. TemsRoute: A temporally and socially aware routing framework for delay-tolerant networks. Ad Hoc Networks, 169:103755, 2025. [ bib \| DOI \| http ] In the realm of delay-tolerant networks (DTNs), designing a routing strategy that optimizes relay vertex selection for faster message dissemination and lower network overhead is an important and challenging topic. DTNs inherently exhibit temporal variations, characterized by mobility and intermittent connectivity. To address this, we model DTNs as temporal networks and propose a temporally and socially aware routing framework, called TemsRoute, which takes both the temporal betweenness centrality and the social information into consideration to intelligently identify optimal relay vertices. Within TemsRoute, we devise exact and approximate heuristic sorting-based label propagation methods, together with two pruning lemmas, to efficiently compute the temporal betweenness centrality. We also introduce four metrics to calculate social relevance between pairs of vertices. Additionally, we explore how to incrementally update the approximate temporal betweenness centrality within the context of temporal graph streams. Extensive experiments conducted on real-world DTNs underscore the superior performance of TemsRoute. It achieves the highest message delivery rate and the lowest message average delay when compared to six other routing methods. This underscores the potential of TemsRoute to improve message dissemination efficiency in certain dynamic and challenging DTN scenarios, particularly those involving sporadic connectivity, frequent topology changes, and limited resources.
[38]	Jingtian Wei, Zhengyi Yang, Qi Luo, Yu Zhang, Lu Qin, and Wenjie Zhang. High-order local clustering on hypergraphs. EAI Endorsed Transactions on Scalable Information Systems*, 11(6), November 2024. [ bib \| DOI ]
[37]	Diya Yan, Riza Yosia Sunindijo, Cynthia C Wang, and Zhengyi Yang. Comparing the career paths of male and female managers in the construction industry: Insights from LinkedIn data. In 40th Annual ARCOM Conference, pages 247–256. Association of Researchers in Construction Management, September 2–4 2024. [ bib \| .pdf ] As one of the most male-dominated industries, the construction industry in Australia is known for the low representation of women in the workforce. Previous studies have indicated that women face significant obstacles in their career development, progressing more slowly than their male counterparts and struggling to attain managerial positions. Building on past research, this research compares the differences between the career paths of male and female managers in construction to understand key factors contributing to gender disparities in career progression. Based on an examination of the LinkedIn profiles of 480 managerial-level employees from 61 companies, this research leveraged the rich dataset provided by the world's largest professional network, LinkedIn, to examine these key factors, such as educational backgrounds, years of experience, job titles, career transitions, and network sizes. This research contributes to the existing literature on gender disparities in leadership roles within the construction industry. By utilising LinkedIn data, which reflect the lived experiences of managers, this research provides empirical evidence that can inform policymaking, talent management, and diversity initiatives aimed at narrowing the gender gap in construction leadership.
[36]	Xiao Li, Yanping Wu, Xiaoyang Wang, Zhengyi Yang, Wenjie Zhang, and Ying Zhang. Keyword-based betweenness centrality maximization in attributed graphs. In Australasian Database Conference (ADC), pages 209–223, Singapore, 2024. Springer Nature Singapore. [ bib \| DOI ] Betweenness centrality is a key concept in graph analysis that measures the significance of a node by counting how often it appears in the shortest paths between other nodes. The task of betweenness centrality maximization, which seeks to identify a set of nodes with the highest centrality scores, is crucial in various real-world applications. Most existing studies about betweenness centrality focus on general graphs. However, in reality, users in networks are usually associated with attributes such as preferences, which play an essential role in analyzing the properties of networks. Therefore, the traditional betweenness centrality is not applicable to the attribute graphs. Motivated by this, we propose a novel concept called Keyword-based Betweenness Centrality (KBC), which quantifies the number of times each node acts as the midpoint of shortest paths between nodes having one of the given attributes. Given an attribute graph G, a query attribute set Q, and a positive integer k, in this paper, we aim to find a node set of size no larger than k so that its KBC value based on Q is maximized. To address this problem, we propose a keyword-based hyper-edge sampler and devise an algorithm achieving the approximation guarantee of $(1-1/e-\epsilon)$ with at least $1-\delta$ probability. Extensive experiments on four real networks demonstrate the efficiency and effectiveness of our proposed algorithms.
[35]	Jiaxuan Wu, Xushuo Tang, Zhengyi Yang, Kongzhang Hao, Longbin Lai, and Yongfei Liu. An experimental evaluation of LLM on image classification. In Australasian Database Conference (ADC)*, pages 506–518, Singapore, 2024. Springer Nature Singapore. [ bib \| DOI ] Image classification is one of the fundamental tasks in computer vision (CV) and has numerous practical applications. Traditionally, machine learning and deep learning methods such as k-Nearest Neighbors (kNN), decision trees, and Convolutional Neural Networks (CNN) have been widely used to perform this task. However, with the recent emergence of large language models (LLMs), such as Generative Pre-trained Transformers (GPT), originally designed for natural language processing, their cross-domain applications, including in CV, are now being explored. In this paper, we investigate the capabilities of GPT-4o, a variant of the GPT model, for image classification on the Fashion-MNIST dataset. By using carefully designed prompts, we evaluate GPT-4o's performance and compare it with more traditional models. Our study offers insights into the cross-domain potential of GPT models, explores how prompt engineering can enhance GPT's performance on image classification tasks, and suggests new avenues for developing more flexible and adaptable multimodal LLM systems. The code can be found at https://github.com/Tanghaha1424/gpt-fashionmnist.
[34]	Yi Ding, Hualong Lin, Zhengyi Yang, Dong Wen, Xiaoyang Wang, and Wenjie Zhang. Benchmarking RDF systems on the cloud: an experimental comparison of RDF systems on cloud. In Australasian Database Conference (ADC)*, pages 30–43, Singapore, 2024. Springer Nature Singapore. [ bib \| DOI ] With the growing use of graph-based applications across various domains, the demand for efficient and scalable RDF (Resource Description Framework) systems has intensified. As organizations increasingly deploy RDF systems on cloud infrastructures, understanding the trade-offs between performance and cost becomes critical. This paper presents a comprehensive experimental comparison of multiple RDF systems, analyzing their performance in cloud and on-premises environments. Our study evaluates key metrics such as query execution time, data ingestion speed, storage efficiency, and cost-effectiveness across several benchmarks, including LDBC SNB, WatDiv, LUBM, and DBpedia. The experiments reveal distinct advantages and limitations in RDF systems. Notably, Virtuoso demonstrated strong performance and cost-efficiency across most datasets, while systems like gStore exhibited higher variability under stress conditions. Our findings offer actionable insights for practitioners seeking to balance performance and cost when deploying RDF systems in cloud environments. We also highlight areas for improvement in system documentation, compatibility, and stability under extreme workloads. Future work will explore alternative architectures and further optimization strategies for cloud-based RDF solutions. The related code, scripts and data of this paper are available online (https://github.com/unswdb/RDF_on_Cloud).
[33]	Xia Li, Zhengyi Yang, Kongzhang Hao, Xin Shu, Xin Cao, and Wenjie Zhang. Distributed hop-constrained s-t simple path enumeration in labelled graphs. In Australasian Database Conference (ADC), pages 265–278, Singapore, 2024. Springer Nature Singapore. [ bib \| DOI ] Hop-constrained s-t simple path (HC-s-t path) enumeration is a core problem in graph analysis, commonly applied to unlabelled graphs without considering label constraints. However, in many practical scenarios, graphs are edge-labelled, requiring queries to satisfy specific label constraints on the paths between vertices. This introduces new computational challenges that traditional HC-s-t path algorithms are not equipped to handle, especially in distributed environments dealing with large-scale graphs. To address these challenges, we propose a distributed algorithm for labelled hop-constrained s-t path (LHC-s-t path) enumeration. Our approach introduces an online label-based index to prune unnecessary computations, reducing both redundant processing and communication overhead. This enables efficient LHC-s-t path enumeration across distributed systems. Extensive experiments on large real-world graphs demonstrate that our algorithm significantly outperforms existing methods, achieving over an order of magnitude improvement in performance and scalability.
[32]	Zhuoqing Xu, Zhengyi Yang, Xiaoshuang Chen, Huangleshuai He, and Xin Cao. Efficient answering of k-reachability on temporal bipartite graphs. In Australasian Database Conference (ADC), pages 224–238, Singapore, 2024. Springer Nature Singapore. [ bib \| DOI ] This paper investigates the k-reachability problem on temporal bipartite graphs, which determines whether a vertex can reach another vertex within a given time interval I and a step constraint k. The problem is motivated by real-world scenarios where the number of hops from a source vertex s to a target vertex t reflects the influence s exerts on t. Bipartite graphs are used to model relationships between two types of entities, such as people-location, author-paper, and customer-product. When modeling real-world applications like disease outbreaks, biomedical reactions, and travel planning, edges are often enriched with temporal information. Temporal bipartite graphs provide an effective tool for modeling these dynamic scenarios. Recent studies show that k-reachability on temporal bipartite graphs can expand research in the above fields. Adding a step constraint k provides more insights into the connectivity of temporal bipartite graphs. Although k-reachability has been extensively studied on unipartite graphs, it remains unexplored on temporal bipartite graphs. To fill this research gap, we propose an efficient algorithm for k-reachability problem on temporal bipartite graphs in this paper. Specifically, we introduce a 2-hop index-based approach to efficiently answer k-reachability queries. Extensive experiments on real-world datasets demonstrate that the proposed method achieves a speedup of at least three orders of magnitude in query performance.
[31]	Nimish Ukey#, Guangjian Zhang#, Zhengyi Yang, Xiaoyang Wang, Binghao Li, Serkan Saydam, and Wenjie Zhang. A cluster-based approach to knn join over batch-dynamic high-dimensional data. In International Conference on Advanced Data Mining and Applications (ADMA)*, pages 81–96, Singapore, 2024. Springer Nature Singapore. [ bib \| DOI ] The k nearest neighbors (kNN) join is a crucial operation in data mining, retrieving the kNN for each point in the query set within an answer set. This operation finds extensive applications in various domains, including recommendation systems, spatial databases, and knowledge discovery. With the surge in data volume and dimensionality, numerous approaches have emerged to enhance the efficiency of kNN join operations on static and dynamic high-dimensional data. However, we observed that research on batch-dynamic kNN join, where updates occur in batches rather than individually, remains scarce. To bridge this gap, we propose a novel cluster-based approach tailored for batch-dynamic kNN join over high-dimensional data. Our contributions include a cluster-based batch update technique, which efficiently processes similar updates in clusters, and a cluster-based pruning method using the high-dimensional R-tree (HDR-Tree) for optimised search during updates. Extensive experimental evaluations across 6 real-world datasets demonstrate the efficiency of our approach, significantly outperforming state-of-the-art methods by 19 to 55 times.
[30]	Qi Luo, Wenjie Zhang, Zhengyi Yang, Dong Wen, Xiaoyang Wang, Dongxiao Yu, and Xuemin Lin. Hierarchical structure construction on hypergraphs. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM), CIKM '24, pages 1597–1606, New York, NY, USA, 2024. Association for Computing Machinery. [ bib \| DOI \| http ] Exploring the hierarchical structure of graphs presents notable advantages for graph analysis, revealing insights ranging from individual vertex behavior to community distribution and overall graph stability. This paper studies hierarchical structures within hypergraphs, where a hyperedge can connect multiple vertices. We observed that directly extending hierarchical frameworks from pairwise graphs to hypergraphs overlooks high-order interactions and can result in either high computational complexity or sparse hierarchy structure. To address this challenge, we introduce a dual-layer hypergraph hierarchy consisting of a primary hierarchy and a secondary hierarchy, enabling the construction of a refined hypergraph hierarchy in linear time. The dual-layer hierarchy establishes a global hierarchy based on vertex cohesion, utilizing vertex-induced subhypergraphs, and a local hierarchy based on hyperedge containment, employing edge-induced subhypergraphs. The combination of global and local hierarchy mitigates the homogeneity and sparsity issues inherent in single-layer hierarchies, allowing more effective modeling of high-order interactions. Furthermore, we propose an efficient hierarchical construction algorithm by leveraging a novel hyperedge-based disjoint set to identify connected subhypergraphs. Additionally, to optimize the local hierarchy further and prevent the emergence of excessively redundant levels, we introduce a compact local hierarchy by defining a restricted subgraph metric to eliminate redundancy caused by large-sized hyperedges. 
Empirical studies on real-world hypergraphs demonstrate the effectiveness of our approach.
[29]	Haonan Yan, Zhengyi Yang, Tianming Zhang, Dong Wen, Qi Luo, and Nimish Ukey. Approximating temporal katz centrality with monte carlo methods. In Web and Big Data. APWeb-WAIM 2024 International Workshops*, pages 3–16, Singapore, 2024. Springer Nature Singapore. [ bib \| DOI ] Graphs have long served as fundamental data models across various disciplines such as data mining, social media analysis, and knowledge management. In real-world applications, interactions between nodes often evolve over time, necessitating the use of temporal graphs. Centrality measures are pivotal in graph analysis for identifying key nodes. Specifically, Temporal Katz Centrality (TKC) has garnered significant attention in recent years for its ability to measure node influence in temporal graphs. TKC evaluates the influence of nodes by considering all walks originating from a node, where contributions are weighted based on a user-specified time decay factor. This approach captures temporal dynamics by incorporating both the timing of interactions and the intervals between them, offering a comprehensive assessment of node importance over time. However, computing TKC in large temporal graphs is computationally intensive due to the requirement to traverse all edges and update centrality values along temporal paths. To address this challenge, this paper proposes an efficient Monte Carlo approximation method for TKC. This approach employs random sampling via Simple Random Walk and Alpha Walk to estimate TKC values. The paper rigorously proves the asymptotic consistency of the method using the Law of Large Numbers and Central Limit Theorem, ensuring that estimated TKC values converge to true TKC values as the sample size increases. Experiments conducted on six real-world temporal graph datasets demonstrate the effectiveness of the proposed approximation methods. 
Under the same running time, the Alpha Walk outperforms other methods on large temporal graphs, exhibiting the lowest mean relative error and highest precision.
[28]	Haoran Ning, Bocheng Han, Zhengyi Yang, Kongzhang Hao, Miao Ma, Chunling Wang, Boge Liu, Xiaoshuang Cheng, Yu Hao, Yi Jin, Wanchun Zhang, and Chengwei Zhang. Exploring simple architecture of just-in-time compilation in databases. In Asia-Pacific Web and Web-Age Information Management Joint International Conference on Web and Big Data (APWeb-WAIM)*, pages 504–514, Singapore, 2024. Springer Nature Singapore. [ bib \| DOI ] Just-in-Time (JIT) compilation is an effective technique for enhancing query execution in modern relational databases, and it has gained increasing attention from academia and industry in recent years. However, the architectures of state-of-the-art JIT-based database systems are often complex, leading to challenges and limitations when adopted for commercial use. In this paper, we present an industrial view of JIT compilation for relational databases, emphasizing practicality and applicability. Our focus is on minimizing engineering effort, simplifying testing, and ensuring seamless integration with existing database ecosystems. We achieve these goals by adhering to three core principles: a simple, lightweight architecture; reuse of existing technologies and frameworks, particularly LLVM; and strong extensibility and compatibility. We demonstrate the feasibility and potential of this approach through an initial exploration using LLVM's mature JIT compilation capabilities to translate TPC-H database queries into optimized machine code. This proof-of-concept implementation shows the promise of our approach, paving the way for a comprehensive database system that leverages a lightweight yet powerful JIT compilation framework for real-world applications.
[27]	Yi Ding, Zhengyi Yang, Shunyang Li, Liuyi Chen, Haoran Ning, Kongzhang Hao, and Yongfei Liu. Fgaq: Accelerating graph analytical queries using fpga. In Asia-Pacific Web and Web-Age Information Management Joint International Conference on Web and Big Data (APWeb-WAIM)*, pages 357–361, Singapore, 2024. Springer Nature Singapore. [ bib \| DOI ] Field-programmable gate arrays (FPGAs) have significant advantages in parallelism and energy efficiency over CPUs and GPUs and are widely deployed by many enterprises and cloud server providers nowadays. In this paper, we demonstrate FGAQ, an FPGA-based system for accelerating graph queries on massive graphs. FGAQ supports the two most fundamental types of graph queries, namely subgraph and path queries, and features 1) a CPU-FPGA co-designed framework, 2) a fully pipelined FPGA execution, and 3) reduced data transfer from FPGA's external memory. FGAQ provides a user-friendly interface and significantly improved performance. Performance evaluation shows that FGAQ outperforms the most popular graph database, Neo4j, by up to three orders of magnitude. The demo video can be found at https://www.youtube.com/watch?v=pEkzw_DOQYE.
[26]	Wenke Yang, Zihan Yang, Liuyi Chen, Ruiqing Yan, Zhengyi Yang, Linhan Zhang, and Tang Yifu. Parallel program generation for hybrid tabular-textual question answering. In Asia-Pacific Web and Web-Age Information Management Joint International Conference on Web and Big Data (APWeb-WAIM)*, pages 121–137, Singapore, 2024. Springer Nature Singapore. [ bib \| DOI ] Hybrid tabular-textual question answering (HTQA) involves tapping into a mosaic of data sources, traditionally managed through LSTM-based step-by-step reasoning, which has been vulnerable to exposure bias and subsequent error accumulation. This paper introduces an innovative parallel program generation method, ConcurGen, aiming to transform this paradigm by simultaneously formulating comprehensive program constructs that seamlessly blend operations and values. This approach not only rectifies the inherent pitfalls of sequential methodologies but also infuses efficiency into the process. When subjected to rigorous evaluation on benchmarks like the ConvFinQA and MultiHiertt datasets, our methodology showcased significant superiority over prevalent models such as FinQANet and MT2Net. This was evidenced by enhancements in various performance metrics, effectively raising the bar for what's deemed state-of-the-art. Notably, beyond setting these commendable benchmarks, our method facilitates a striking acceleration in program creation, achieving speeds nearly 21 times faster. Additionally, a salient feature of our approach becomes evident when numerical reasoning steps escalate: unlike traditional models, our system sustains its robust performance, emphasizing its adaptability and resilience in complex scenarios.
[25]	Kaiyu Chen, Dong Wen, Wentao Li, Zhengyi Yang, and Wenjie Zhang. On compressing historical cliques in temporal graphs. In Database Systems for Advanced Applications (DASFAA), pages 37–53, Singapore, 2024. Springer Nature Singapore. [ bib \| DOI \| http ] Maximal clique is a fundamental cohesive subgraph model that plays an important role in many practical applications such as social network analysis and bioinformatics. Many real-world graphs change over time, with edges arriving continuously, and each edge has a timestamp representing the arrival time of that edge; such graphs are also known as temporal graphs. All maximal cliques in all snapshots since all possible historical moments are called historical cliques. Querying historical cliques has not been explored in existing research efforts. In this paper, we study the problem of compressing historical cliques in temporal graphs. We design a novel trie index structure called HC-Trie to compress historical cliques by exploiting the duration of historical cliques and the overlapping relationship between different historical cliques. We also propose an algorithm for maintaining the index when the graph changes. Experiments on real-world temporal graphs demonstrate that our solution achieves much faster query efficiency than the online solution and significantly less index space than the straightforward solution.
[24]	Tianming Zhang, Junkai Fang, Zhengyi Yang, Bin Cao, and Jing Fan. Tatkc: A temporal graph neural network for fast approximate temporal katz centrality ranking. In Proceedings of the ACM on Web Conference 2024 (WWW), WWW '24, pages 527–538, New York, NY, USA, 2024. Association for Computing Machinery. [ bib \| DOI \| http ] Numerous real-world networks are represented as temporal graphs, which capture the dynamics of connections over time. Identifying important nodes on temporal graphs has a plethora of real-life applications, such as information propagation and influential user identification, etc. Temporal Katz centrality, a popular temporal metric, gauges the importance of nodes by taking into account both the number of temporal walks and the timespan between the interactions. The computation of traditional temporal Katz centrality is computationally expensive, especially when applied to massive temporal graphs. Therefore, in this paper, we design a temporal graph neural network to approximate temporal Katz centrality computation. To the best of our knowledge, we are the first to address temporal Katz centrality computation purely from a learning-based perspective. We propose a time-injected self-attention model that consists of two phases. In the first phase, we utilize a time-injected self-attention mechanism to acquire node representations that encompass both structural information and temporal relevance. The second phase is structured as a multi-layer perceptron (MLP) which uses the learned node representation to predict node rankings. Furthermore, normalization and neighbor sampling strategies are integrated into the model to enhance its overall performance. Extensive experiments on real-world networks demonstrate the efficiency and accuracy of TATKC.
[23]	Tianming Zhang, Yunjun Gao, Jie Zhao, Lu Chen, Lu Jin, Zhengyi Yang, Bin Cao, and Jing Fan. Efficient exact and approximate betweenness centrality computation for temporal graphs. In Proceedings of the ACM on Web Conference 2024 (WWW), WWW '24, pages 2395–2406, New York, NY, USA, 2024. Association for Computing Machinery. [ bib \| DOI \| http ] Betweenness centrality of a vertex in a graph evaluates how often the vertex occurs in the shortest paths. It is a widely used metric of vertex importance in graph analytics. While betweenness centrality on static graphs has been extensively investigated, many real-world graphs are time-varying and modeled as temporal graphs. Examples include social networks and telecommunication networks, where a relationship between two vertices occurs at a specific time. Hence, in this paper, we target efficient methods for temporal betweenness centrality computation. We firstly propose an exact algorithm with the new notion of time instance graph, based on which, we derive a temporal dependency accumulation theory for iterative computation. To reduce the size of the time instance graph and improve the efficiency, we propose an additional optimization, which compresses the time instance graph with equivalent vertices and edges, and extends the dependency theory to the compressed graph. Since it is theoretically complex to compute temporal betweenness centrality, we further devise a probabilistically guaranteed approximate method to handle massive temporal graphs. Extensive experimental results on real-world temporal networks demonstrate the superior performance of the proposed methods. In particular, our exact and approximate methods outperform the state-of-the-art methods by up to two and five orders of magnitude, respectively.
[22]	Lingbo Li, Zhichun Li, Fusen Guo, Haoyu Yang, Jingtian Wei, and Zhengyi Yang. Prototype comparison convolutional networks for one-shot segmentation. IEEE Access, 12:54978–54990, 2024. [ bib \| DOI ] In few-shot semantic segmentation (FSS), the key challenges are efficiently tuning the interaction between the support set and the query set and distinguishing between context, background, and interfering items. To address these challenges, we propose prototype comparison networks for one-shot segmentation (OPCN) to capture the details required for FSS. Specifically, we offer the Fusion Interaction Module (FIM) to improve the segmentation performance by capturing the correlation and semantic information between the support set and query set features. Subsequently, we propose the Feature Enhancement Module (FEM), which aims to enhance the information required in the support set and query set features while increasing the focus on critical details by reducing the weight of the background regions to provide a more targeted feature representation for subsequent query image segmentation. Then, we propose the Feature Refinement Module (FRM) to filter irrelevant background information in the features and specify the target location region. Finally, the Feature Matching Module (FM) generates the final segmentation mask for the query image. Extensive experiments on the PASCAL-5i and COCO-20i datasets show that our approach achieves excellent performance in the case of the one-shot setup.
[21]	Tianming Zhang, Xinwei Cai, Lu Chen, Zhengyi Yang, Yunjun Gao, Bin Cao, and Jing Fan. Towards efficient simulation-based constrained temporal graph pattern matching. World Wide Web, 27(3):22, 2024. [ bib \| DOI \| http ] In the context of searching a single data graph G, graph pattern matching is to find all the occurrences of a pattern graph Q in G, specified by a matching rule. It is of paramount importance in many real applications such as social network analysis and cyber security, among others. A wide spectrum of studies target general graph pattern matching. However, to analyze time-relevant services such as studying the spread of diseases and detecting attack patterns, it is attractive to study inexact temporal graph pattern matching. Hence, in this paper, we propose a relaxed matching rule called constrained temporal dual simulation, and study simulation-based constrained temporal graph pattern matching which guarantees that the matching result (i) preserves the ancestor and descendant temporal connectivities; and (ii) implements edge-to-temporal path mapping. We devise a decomposition-based matching method, which first decomposes the data graph into Source Temporal Connected Components, and then performs matching on decomposed subgraphs. To speed up the matching, we define child/parent dependency relation tables and propose an efficient double hierarchical traverse strategy. Considering that the temporal graphs are naturally dynamic, we further propose update algorithms. An extensive empirical study over real-world and synthetic temporal graphs has demonstrated the effectiveness and efficiency of our approach.
[20]	Sai Li, Peng Kou, Miao Ma, Haoyu Yang, Shuo Huang, and Zhengyi Yang. Application of semi-supervised learning in image classification: Research on fusion of labeled and unlabeled data. IEEE Access, 12:27331–27343, 2024. [ bib \| DOI ] Deep learning has attracted wide attention recently because of its excellent feature representation ability and end-to-end automatic learning method. Especially in clinical medical imaging diagnosis, the semi-supervised deep learning model is favored and widely used because it can make maximum use of a limited number of labeled data and combine it with a large number of unlabeled data to extract more information and knowledge from it. However, the scarcity of medical image data, the vast image size, and the instability of image quality directly affect the model's robustness, generalization, and image classification performance. Therefore, this paper proposes a new semi-supervised learning model, which uses quadratic neurons instead of traditional neurons, aiming to use quadratic convolution instead of the conventional convolution layer to improve the feature extraction ability of the model. In addition, we introduce two Dropout layers and two fully connected layers at the end of the model to enhance the nonlinear fitting ability of the network. Experiments on two large medical public data sets - ISIC 2019 and Retinopathy OCT - show that our method can improve the model's generalization performance and image classification accuracy.
[19]	Wenqian Zhang, Zhengyi Yang, Dong Wen, and Xiaoyang Wang. Efficient distributed core graph decomposition. In 2023 IEEE International Conference on Data Mining Workshops (ICDMW)*, pages 1023–1031, 2023. [ bib \| DOI ] Core decomposition is one of the most fundamental problems in graph analytics, which is associated with numerous applications, such as community detection, protein network analysis, and system structure analysis. As the sizes of graphs are becoming increasingly large, it is challenging to compute core decomposition on a single machine. In this paper, we study the problem of k-Core decomposition in the distributed environment. Specifically, we propose the distributed Filter-Array k-Core (FAkCore) algorithm, which adopts the commonly used Scatter-Gather framework. We design an auxiliary data structure of running counts for each vertex to track the statistics of its neighbors' core number. It allows us to recompute the core number of a vertex only when the value is updated. Together with an enhanced message filtering mechanism, our method significantly reduces redundant computation and communication in the existing distributed k-Core decomposition algorithm. Experiments on 10 real-world graphs show that our method outperforms the baseline algorithms by 1.4 times on average and up to 2.2 times.
[18]	Liuyi Chen, Bocheng Han, Xuesong Wang, Jiazhen Zhao, Wenke Yang, and Zhengyi Yang. Machine learning methods in weather and climate applications: A survey. Applied Sciences*, 13(21), 2023. [ bib \| DOI \| http ] With the rapid development of artificial intelligence, machine learning is gradually becoming popular for predictions in all walks of life. In meteorology, it is gradually competing with traditional climate predictions dominated by physical models. This survey aims to consolidate the current understanding of Machine Learning (ML) applications in weather and climate prediction—a field of growing importance across multiple sectors, including agriculture and disaster management. Building upon an exhaustive review of more than 20 methods highlighted in existing literature, this survey pinpointed eight techniques that show particular promise for improving the accuracy of both short-term weather and medium-to-long-term climate forecasts. According to the survey, while ML demonstrates significant capabilities in short-term weather prediction, its application in medium-to-long-term climate forecasting remains limited, constrained by factors such as intricate climate variables and data limitations. Current literature tends to focus narrowly on either short-term weather or medium-to-long-term climate forecasting, often neglecting the relationship between the two, as well as general neglect of modeling structure and recent advances. By providing an integrated analysis of models spanning different time scales, this survey aims to bridge these gaps, thereby serving as a meaningful guide for future interdisciplinary research in this rapidly evolving field.
[17]	Miao Ma, Zhengyi Yang, Kongzhang Hao, Liuyi Chen, Chunling Wang, and Yi Jin. An empirical analysis of just-in-time compilation in modern databases. In Australasian Database Conference (ADC)*, pages 227–240. Springer Nature Switzerland, 2023. [ bib \| DOI \| http ] JIT (Just-in-Time) technology has garnered significant attention for improving the efficiency of database execution. It offers higher performance by eliminating interpretation overhead compared to traditional execution engines. LLVM serves as the primary JIT architecture, which has been implemented in PostgreSQL since version 11. However, recent advancements in WASM-based databases, such as Mutable, present an alternative JIT approach. This approach minimizes the extensive engineering efforts associated with the execution engine and focuses on optimizing supported operators for lower latency and higher throughput. In this paper, we perform comprehensive experiments on the two representative open-source databases to gain deeper insights into the effectiveness of different JIT architectures.
[16]	Nimish Ukey, Zhengyi Yang, Wenke Yang, Binghao Li, and Runze Li. knn join for dynamic high-dimensional data: A parallel approach. In Australasian Database Conference (ADC)*, pages 3–16. Springer Nature Switzerland, 2023. [ bib \| DOI \| http ] The k nearest neighbor (kNN) join operation is a fundamental task that combines two high-dimensional databases, enabling data points in the User dataset U to identify their k nearest neighbor points from the Item dataset I. This operation plays a crucial role in various domains, including knowledge discovery, data mining, similarity search applications, and scientific research. However, exact kNN search in high-dimensional spaces is computationally demanding, and existing sequential methods face challenges in handling large datasets. In this paper, we propose an efficient parallel solution for dynamic kNN join over high-dimensional data, leveraging the high-dimensional R tree (HDR Tree) for improved efficiency. Our solution harnesses the power of Simultaneous Multi-Threading (SMT) technologies and Single-Instruction-Multiple-Data (SIMD) instructions in modern CPUs for parallelisation. Importantly, our research is the first to introduce parallel computation for exact kNN join over high-dimensional data. Experimental results demonstrate that our proposed approach outperforms the sequential HDR Tree method by up to 1.2 times with a single thread. Moreover, our solution provides near-linear scalability as the number of threads increases.
[15]	Nimish Ukey#, Guangjian Zhang#, Zhengyi Yang, Binghao Li, Wei Li, and Wenjie Zhang. Efficient continuous knn join over dynamic high-dimensional data. World Wide Web*, 2023. [ bib \| DOI \| http ] Given a user dataset U and an object dataset I, a kNN join query in high-dimensional space returns the k nearest neighbors of each object in dataset U from the object dataset I. The kNN join is a basic and necessary operation in many applications, such as databases, data mining, computer vision, multi-media, machine learning, recommendation systems, and many more. In the real world, datasets frequently update dynamically as objects are added or removed. In this paper, we propose novel methods of continuous kNN join over dynamic high-dimensional data. We firstly propose the HDR+ Tree, which supports more efficient insertion, deletion, and batch update. Further, observing that the existing methods rely on globally correlated datasets for effective dimensionality reduction, we then propose the HDR Forest. It clusters the dataset and constructs multiple HDR Trees to capture local correlations among the data. As a result, our HDR Forest is able to process non-globally correlated datasets efficiently. Two novel optimisations are applied to the proposed HDR Forest, including the precomputation of the PCA states of data items and pruning-based kNN recomputation during item deletion. For the completeness of the work, we also present the proof of computing distances in reduced dimensions of PCA in HDR Tree. Extensive experiments on real-world datasets show that the proposed methods and optimisations outperform the baseline algorithms of naive RkNN join and HDR Tree.
[14]	Kongzhang Hao, Long Yuan, Zhengyi Yang, Wenjie Zhang, and Xuemin Lin. Efficient and scalable distributed graph structural clustering at billion scale. In International Conference on Database Systems for Advanced Applications (DASFAA), pages 234–251, Cham, 2023. Springer Nature Switzerland. [ bib \| DOI \| http ] Structural Graph Clustering (SCAN) is a fundamental problem in graph analysis and has received considerable attention recently. Existing distributed solutions either lack efficiency or suffer from high memory consumption when addressing this problem in billion-scale graphs. Motivated by these, in this paper, we aim to devise a distributed algorithm for SCAN that is both efficient and scalable. We first propose a fine-grained clustering framework tailored for SCAN. Based on the new framework, we devise a distributed SCAN algorithm, which not only keeps a low communication overhead during execution, but also effectively reduces the memory consumption at all time. We also devise an effective workload balance mechanism that is automatically triggered by the idle machines to handle skewed workloads. The experiment results demonstrate the efficiency and scalability of our proposed algorithm.
[13]	Zhengyi Yang, Wenjie Zhang, Xuemin Lin, Ying Zhang, and Shunyang Li. Hgmatch: A match-by-hyperedge approach for subgraph matching on hypergraphs. In 2023 IEEE 39th International Conference on Data Engineering (ICDE)*, pages 2063–2076, 2023. [ bib \| DOI ] Hypergraphs are a generalisation of graphs in which a hyperedge can connect any number of vertices. They can describe n-ary relationships and high-order information among entities compared to conventional graphs. In this paper, we study the fundamental problem of subgraph matching on hypergraphs (i.e., subhypergraph matching). Existing methods directly extend subgraph matching algorithms to the case of hypergraphs. However, this approach delays hyperedge verification and underutilises the high-order information in hypergraphs, which leads to large search space and high enumeration cost. Furthermore, with the growing size of hypergraphs, it is becoming hard to compute subhypergraph matching sequentially. Thus, we propose an efficient and parallel subhypergraph matching system, HGMatch, to handle subhypergraph matching in massive hypergraphs. We propose a novel match-by-hyperedge framework to utilise high-order information in hypergraphs and use set operations for efficient candidate generation. Moreover, we develop an optimised parallel execution engine in HGMatch based on the dataflow model, which features a task-based scheduler and fine-grained dynamic work stealing to achieve bounded memory execution and better load balancing. Experimental evaluation on 10 real-world datasets shows that HGMatch outperforms the extended version of the state-of-the-art subgraph matching algorithms (CFL, DAF, CECI and RapidMatch) by orders of magnitude when using a single thread, and achieves almost linear scalability when the number of threads increases.
[12]	Nimish Ukey, Zhengyi Yang, Binghao Li, Guangjian Zhang, Yiheng Hu, and Wenjie Zhang. Survey on exact knn queries over high-dimensional data space. Sensors*, 23(2), 2023. [ bib \| DOI \| http ] k nearest neighbours (kNN) queries are fundamental in many applications, ranging from data mining, recommendation system and Internet of Things, to Industry 4.0 framework applications. In mining, specifically, it can be used for the classification of human activities, iterative closest point registration and pattern recognition and has also been helpful for intrusion detection systems and fault detection. Due to the importance of kNN queries, many algorithms have been proposed in the literature, for both static and dynamic data. In this paper, we focus on exact kNN queries and present a comprehensive survey of exact kNN queries. In particular, we study two fundamental types of exact kNN queries: the kNN Search queries and the kNN Join queries. Our survey focuses on exact approaches over high-dimensional data space, which covers 20 kNN Search methods and 9 kNN Join methods. To the best of our knowledge, this is the first work of a comprehensive survey of exact kNN queries over high-dimensional datasets. We specifically categorise the algorithms based on indexing strategies, data and space partitioning strategies, clustering techniques and the computing paradigm. We provide useful insights for the evolution of approaches based on the various categorisation factors, as well as the possibility of further expansion. Lastly, we discuss some open challenges and future research directions.
[11]	Weixiao Xu, Lin Sun, Cheng Zhen, Bo Liu, Zhengyi Yang, and Wenke Yang. Deep learning-based image recognition of agricultural pests. Applied Sciences, 12(24), 2022. [ bib \| DOI \| http ] Pests and diseases are an inevitable problem in agricultural production, causing substantial economic losses yearly. The application of convolutional neural networks to the intelligent recognition of crop pest images has become increasingly popular due to advances in deep learning methods and the rise of large-scale datasets. However, the diversity and complexity of pest samples, the size of sample images, and the number of examples all directly affect the performance of convolutional neural networks. Therefore, we designed a new target-detection framework based on Cascade RCNN (Regions with CNN features), aiming to solve the problems of large image size, many pest types, and small and unbalanced numbers of samples in pest sample datasets. Specifically, this study performed data enhancement on the original samples to solve the problem of a small and unbalanced number of examples in the dataset and developed a sliding window cropping method, which could increase the perceptual field to learn sample features more accurately and in more detail without changing the original image size. Secondly, combining the attention mechanism with the FPN (Feature Pyramid Networks) layer enabled the model to learn sample features that were more important for the current task from both channel and space aspects. Compared with the current popular target-detection frameworks, the average precision value of our model (mAP@0.5) was 84.16%, the value of (mAP@0.5:0.95) was 65.23%, the precision was 67.79%, and the F1 score was 82.34%. The experiments showed that our model solved the problem of convolutional neural networks being challenging to use because of the wide variety of pest types, the large size of sample images, and the difficulty of identifying tiny pests.
[10]	Xia Li, Kongzhang Hao, Zhengyi Yang, Xin Cao, and Wenjie Zhang. Hop-constrained s-t simple path enumeration in large uncertain graphs. In Australasian Database Conference (ADC), pages 115–127. Springer International Publishing, 2022. [ bib \| DOI \| http ] Uncertain graphs are graphs where each edge is assigned with a probability of existence. In this paper, we study the problem of hop-constrained s-t simple path enumeration in large uncertain graphs. To the best of our knowledge, we are the first to study this problem in the literature. Specifically, we propose a light-weight index to prune candidate paths by adopting the concept of probability-constrained distance. An efficient enumeration algorithm is designed based on the index structure. Experiment results on real-world datasets show that our proposed methods significantly outperform the baseline methods by up to 6 times.
[9]	Nimish Ukey, Zhengyi Yang, Guangjian Zhang, Boge Liu, Binghao Li, and Wenjie Zhang. Efficient knn join over dynamic high-dimensional data. In Australasian Database Conference (ADC)*, pages 63–75. Springer International Publishing, 2022. [ bib \| DOI \| http ] Given a user dataset U and an object dataset I in high-dimensional space, a kNN join query retrieves, for each object in dataset U, its k nearest neighbors from dataset I. kNN join is a fundamental and essential operation in applications from many domains such as databases, computer vision, multi-media, machine learning, recommendation systems, and many more. Real-world datasets are often updated dynamically through insertions and deletions of objects. However, existing algorithms for dynamic kNN join lack support for deletion and batch update, which are important in real-life applications. In this paper, we propose a new method of kNN join over dynamic high-dimensional data. Specifically, our method features lazy updates, batch operations, and optimised deletions. Experiments on real-world datasets show that our method outperforms the existing algorithms of naive RkNN join and HDR Tree by up to 5 and 4 times, respectively.
[8]	Xia Li, Kongzhang Hao, Zhengyi Yang, Xin Cao, Wenjie Zhang, Long Yuan, and Xuemin Lin. Hop-constrained s-t simple path enumeration in billion-scale labelled graphs. In Web Information Systems Engineering (WISE), pages 49–64. Springer International Publishing, 2022. [ bib \| DOI \| http ] Hop-constrained s-t simple path (HC-s-tpath) enumeration is a fundamental problem in graph analysis. Existing solutions for this problem focus on unlabelled graphs and assume queries are issued without any label constraints. However, in many real-world applications, graphs are edge-labelled and the queries involve label constraints on the path connecting two vertices. Therefore, we study the problem of labelled hop-constrained s-t path (LHC-s-tpath) enumeration in this paper. We aim to efficiently enumerate the HC-s-tpaths using only edges with provided labels. To achieve this goal, we first demonstrate the existence of unnecessary computation specific to the label constraints in the state-of-the-art HC-s-tpath enumeration algorithm. We then devise a novel online index to identify the fruitless exploration during the enumeration. Based on the proposed index, we design an efficient LHC-s-tpath enumeration algorithm in which unnecessary computation is effectively pruned. Extensive experiments are conducted on real-world graphs with billions of edges. Experiment results show that our proposed algorithms significantly outperform the baseline methods by over one order of magnitude.
[7]	Shunyang Li, Zhengyi Yang, Xianhang Zhang, Wenjie Zhang, and Xuemin Lin. Sql2cypher: Automated data and query migration from rdbms to gdbms. In Web Information Systems Engineering (WISE), pages 510–517. Springer International Publishing, 2021. [ bib \| DOI \| http ] There are many real-world application domains where data can be naturally modelled as a graph, such as social networks and computer networks. Relational Database Management Systems (RDBMS) find it hard to capture the relationships and inherent graph structure of data and are inappropriate for storing highly connected data; thus, graph databases have emerged to address the challenges of high data connectivity. Because querying highly connected data with relational query statements usually performs worse than querying it in a graph database, transforming data from a relational database to a graph database is imperative for improving the performance of graph queries. In this paper, we demonstrate SQL2Cypher, a system for migrating data from a relational database to a graph database automatically. This system also supports translating SQL queries into Cypher queries. SQL2Cypher is open-source (https://github.com/UNSW-database/SQL2Cypher) to allow researchers and programmers to migrate data efficiently. Our demonstration video can be found here: https://www.youtube.com/watch?v=eGaeBrVTJws.
[6]	Zhengyi Yang, Longbin Lai, Xuemin Lin, Kongzhang Hao, and Wenjie Zhang. Huge: An efficient and scalable subgraph enumeration system. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD), SIGMOD/PODS '21, pages 2049–2062, New York, NY, USA, 2021. Association for Computing Machinery. [ bib \| DOI \| http ] Subgraph enumeration is a fundamental problem in graph analytics, which aims to find all instances of a given query graph on a large data graph. In this paper, we propose a system called HUGE to efficiently process subgraph enumeration at scale in the distributed context. HUGE features 1) an optimiser to compute an advanced execution plan without the constraints of existing works; 2) a hybrid communication layer that supports both pushing and pulling communication; 3) a novel two-stage execution mode with a lock-free and zero-copy cache design; 4) a BFS/DFS-adaptive scheduler to bound memory consumption; and 5) two-layer intra- and inter-machine load balancing. HUGE is generic such that all existing distributed subgraph enumeration algorithms can be plugged in to enjoy automatic speed up and bounded-memory execution.
[5]	Xin Jin, Zhengyi Yang, Xuemin Lin, Shiyu Yang, Lu Qin, and You Peng. Fast: Fpga-based subgraph matching on massive graphs. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 1452–1463, 2021. [ bib \| DOI ] Subgraph matching is a basic operation widely used in many applications. However, due to its NP-hardness and the explosive growth of graph data, it is challenging to compute subgraph matching, especially in large graphs. In this paper, we aim at scaling up subgraph matching on a single machine using FPGAs. Specifically, we propose a CPU-FPGA co-designed framework. On the CPU side, we first develop a novel auxiliary data structure called candidate search tree (CST) which serves as a complete search space of subgraph matching. CST can be partitioned and fully loaded into FPGAs' on-chip memory. Then, a workload estimation technique is proposed to balance the load between the CPU and FPGA. On the FPGA side, we design and implement the first FPGA-based subgraph matching algorithm, called FAST. To take full advantage of the pipeline mechanism on FPGAs, task parallelism optimization and task generator separation strategy are proposed for FAST, achieving massive parallelism. Moreover, we carefully develop a BRAM-only matching process to fully utilize FPGA's on-chip memory, which avoids the expensive intermediate data transfer between FPGA's BRAM and DRAM. Comprehensive experiments show that FAST achieves up to 462.0x and 150.0x speedup compared with the state-of-the-art algorithm DAF and CECI, respectively. In addition, FAST is the only algorithm that can handle the billion-scale graph using one machine in our experiments.
[4]	Ran Wang, Zhengyi Yang, Wenjie Zhang, and Xuemin Lin. An empirical study on recent graph database systems. In 13th International Conference Knowledge Science, Engineering and Management (KSEM), pages 328–340, Berlin, Heidelberg, 2020. Springer International Publishing. [ bib \| DOI \| http ] Graphs are widely used to model the intricate relationships among objects in a wide range of applications. The advance in graph data has brought significant value to artificial intelligence technologies. Recently, a number of graph database systems have been developed. In this paper, we present a comprehensive overview and empirical investigation of existing property graph database systems such as Neo4j, AgensGraph, TigerGraph and LightGraph (LightGraph has recently been renamed TuGraph). These systems support declarative graph query languages. Our empirical studies are conducted in a single-machine environment against the LDBC social network benchmark, consisting of three different large-scale datasets and a set of benchmark queries. This is the first empirical study to compare these graph database systems by evaluating data bulk importing and processing simple and complex queries. Experimental results provide insightful observations of various graph data systems and indicate that AgensGraph works well on SQL-based workloads and simple update queries, TigerGraph is powerful on complex business intelligence queries, Neo4j is user-friendly and suitable for small queries, and LightGraph is a more balanced product achieving good performance on different queries. The related code, scripts and data of this paper are available online (https://github.com/UNSW-database/GraphDB-Benchmark).
[3]	Kongzhang Hao, Zhengyi Yang, Longbin Lai, Zhengmin Lai, Xin Jin, and Xuemin Lin. Patmat: A distributed pattern matching engine with cypher. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM), CIKM '19, pages 2921–2924, New York, NY, USA, 2019. Association for Computing Machinery. [ bib \| DOI \| http ] Graph pattern matching is one of the most fundamental problems in graph database and is associated with a wide spectrum of applications. Due to its computational intensiveness, researchers have primarily devoted their efforts to improving the performance of the algorithm while constraining the graphs to have singular labels on vertices (edges) or no label. Whereas in practice graphs are typically associated with rich properties, thus the main focus in the industry is instead on powerful query languages that can express a sufficient number of pattern matching scenarios. We demo PatMat in this work to glue together the academic efforts on performance and the industrial efforts on expressiveness. To do so, we leverage the state-of-the-art join-based algorithms in the distributed contexts and Cypher query language - the most widely-adopted declarative language for graph pattern matching. The experiments demonstrate how we are capable of turning complex Cypher semantics into a distributed solution with high performance.
[2]	Longbin Lai, Zhu Qing, Zhengyi Yang, Xin Jin, Zhengmin Lai, Ran Wang, Kongzhang Hao, Xuemin Lin, Lu Qin, Wenjie Zhang, Ying Zhang, Zhengping Qian, and Jingren Zhou. Distributed subgraph matching on timely dataflow. In Proc. VLDB Endow. (VLDB), volume 12, pages 1099–1112. VLDB Endowment, 2019. [ bib \| DOI \| http ] Recently, many distributed algorithms have emerged that aim at solving subgraph matching at scale. Existing algorithm-level comparisons failed to provide a systematic view of distributed subgraph matching, mainly due to the intertwining of strategy and optimization. In this paper, we identify four strategies and three general-purpose optimizations from representative state-of-the-art algorithms. We implement the four strategies with the optimizations based on the common Timely dataflow system for systematic strategy-level comparison. Our implementation covers all representative algorithms. We conduct extensive experiments for both unlabelled matching and labelled matching to analyze the performance of distributed subgraph matching under various settings, and finally summarize the findings as a practical guide.
[1]	Zhengmin Lai, Zhengyi Yang, and Longbin Lai. Improving distributed subgraph matching algorithm on timely dataflow. In 2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW), pages 266–273, 2019. [ bib \| DOI ] The subgraph matching problem is defined to find all subgraphs of a data graph that are isomorphic to a given query graph. Subgraph matching plays a vital role in the fields of e-commerce, social media and biological science. CliqueJoin is a distributed subgraph matching algorithm that is designed to be efficient and scalable. However, CliqueJoin is originally developed on MapReduce, thus the performance of the algorithm can be affected by the notorious I/O issue of MapReduce while processing multi-round join tasks. Meanwhile, CliqueJoin does not propose a cost evaluation strategy for labelled graphs, which limits its application in practice where most real-world graphs are labelled. Targeting the limitations of CliqueJoin, we propose CliqueJoin++ to improve CliqueJoin in two aspects. Firstly, we implement CliqueJoin++ on the Timely dataflow system instead of MapReduce to avoid considerable I/O cost. Secondly, we extend the cost evaluation function in CliqueJoin to compute optimal join plans for labelled graphs in the distributed context. Extensive experiments have been conducted to show that the proposed method is up to 10 times faster than the MapReduce version for unlabelled matching, and it achieves good performance and scalability for labelled matching.