A Production-Style NetworKit 11.2.1 Coding Tutorial for Large-Scale Graph Analytics, Communities, Cores, and Sparsification
In this tutorial, we implement a production-grade, large-scale graph analytics pipeline in NetworKit, focusing on speed, memory efficiency, and version-safe APIs in NetworKit 11.2.1. We generate a large scale-free network, extract the largest connected component, and then compute structural backbone signals via k-core decomposition and centrality ranking. We also detect communities with PLM and quantify quality using modularity; estimate distance structure using effective and estimated diameters; and, finally, sparsify the graph to reduce cost while preserving key properties. We export the sparsified graph as an edgelist so we can reuse it in downstream workflows, benchmarking, or graph ML preprocessing.
!pip -q install networkit pandas numpy psutil
import gc, time, os
import numpy as np
import pandas as pd
import psutil
import networkit as nk
print("NetworKit:", nk.__version__)
nk.setNumberOfThreads(min(2, nk.getMaxNumberOfThreads()))
nk.setSeed(7, False)
def ram_gb():
    p = psutil.Process(os.getpid())
    return p.memory_info().rss / (1024**3)
def tic():
    return time.perf_counter()
def toc(t0, msg):
    print(f"{msg}: {time.perf_counter()-t0:.3f}s | RAM~{ram_gb():.2f} GB")
def report(G, name):
    print(f"\n[{name}] nodes={G.numberOfNodes():,} edges={G.numberOfEdges():,} directed={G.isDirected()} weighted={G.isWeighted()}")
def force_cleanup():
    gc.collect()
PRESET = "LARGE"
if PRESET == "LARGE":
N = 120_000
M_ATTACH = 6
AB_EPS = 0.12
ED_RATIO = 0.9
elif PRESET == "XL":
N = 250_000
M_ATTACH = 6
AB_EPS = 0.15
ED_RATIO = 0.9
else:
N = 80_000
M_ATTACH = 6
AB_EPS = 0.10
ED_RATIO = 0.9
print(f"\nPreset={PRESET} | N={N:,} | m={M_ATTACH} | approx-betweenness epsilon={AB_EPS}")We set up the Colab environment with NetworKit and monitoring utilities, and we lock in a stable random seed. We configure thread usage to match the runtime and define timing and RAM-tracking helpers for each major stage. We choose a scale preset that controls graph size and approximation knobs so the pipeline stays large but manageable.
t0 = tic()
G = nk.generators.BarabasiAlbertGenerator(M_ATTACH, N).generate()
toc(t0, "Generated BA graph")
report(G, "G")
t0 = tic()
cc = nk.components.ConnectedComponents(G)
cc.run()
toc(t0, "ConnectedComponents")
print("components:", cc.numberOfComponents())
if cc.numberOfComponents() > 1:
    t0 = tic()
    G = nk.components.ConnectedComponents.extractLargestConnectedComponent(G, True)
    toc(t0, "Extracted LCC (compactGraph=True)")
    report(G, "LCC")
force_cleanup()

We generate a large Barabási–Albert graph and immediately log its size and runtime footprint. We compute connected components to understand fragmentation and quickly diagnose topology. We extract the largest connected component and compact it to improve the rest of the pipeline’s performance and reliability.
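Conceptually, the components pass is just a repeated breadth-first flood fill. Here is a minimal, library-free sketch of that idea on a toy adjacency list (the function name and toy graph are ours for illustration, not NetworKit's API):

```python
from collections import deque

def connected_components(adj):
    """Label each node with a component id via BFS over an adjacency list."""
    label = {u: None for u in adj}
    comp = 0
    for s in adj:
        if label[s] is not None:
            continue
        # Start a new component and flood-fill everything reachable from s.
        queue = deque([s])
        label[s] = comp
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if label[v] is None:
                    label[v] = comp
                    queue.append(v)
        comp += 1
    return comp, label

# Toy graph: a triangle plus a separate edge -> 2 components.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
n, label = connected_components(adj)
print(n)  # 2
```

NetworKit's implementation does the same sweep in parallel C++, which is why it handles the 120k-node graph in a fraction of a second.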
t0 = tic()
core = nk.centrality.CoreDecomposition(G)
core.run()
toc(t0, "CoreDecomposition")
core_vals = np.array(core.scores(), dtype=np.int32)
print("degeneracy (max core):", int(core_vals.max()))
print("core stats:", pd.Series(core_vals).describe(percentiles=[0.5, 0.9, 0.99]).to_dict())
k_thr = int(np.percentile(core_vals, 97))
t0 = tic()
nodes_backbone = [u for u in range(G.numberOfNodes()) if core_vals[u] >= k_thr]
G_backbone = nk.graphtools.subgraphFromNodes(G, nodes_backbone)
toc(t0, f"Backbone subgraph (k>={k_thr})")
report(G_backbone, "Backbone")
force_cleanup()
t0 = tic()
pr = nk.centrality.PageRank(G, damp=0.85, tol=1e-8)
pr.run()
toc(t0, "PageRank")
pr_scores = np.array(pr.scores(), dtype=np.float64)
top_pr = np.argsort(-pr_scores)[:15]
print("Top PageRank nodes:", top_pr.tolist())
print("Top PageRank scores:", pr_scores[top_pr].tolist())
t0 = tic()
abw = nk.centrality.ApproxBetweenness(G, epsilon=AB_EPS)
abw.run()
toc(t0, "ApproxBetweenness")
abw_scores = np.array(abw.scores(), dtype=np.float64)
top_abw = np.argsort(-abw_scores)[:15]
print("Top ApproxBetweenness nodes:", top_abw.tolist())
print("Top ApproxBetweenness scores:", abw_scores[top_abw].tolist())
force_cleanup()

We compute the core decomposition to measure degeneracy and identify the network’s high-density backbone. We extract a backbone subgraph using a high core-percentile threshold to focus on structurally important nodes. We run PageRank and approximate betweenness to rank nodes by influence and bridge-like behavior at scale.
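Core decomposition itself is a simple peeling process: repeatedly remove the minimum-degree nodes, and record the level k at which each node falls out. A pure-Python sketch of that peeling on a toy graph (the helper name is ours, not NetworKit's):

```python
def core_numbers(adj):
    """Peel nodes of degree <= k, raising k only when nothing peels;
    a node's core number is the level at which it is removed."""
    deg = {u: len(vs) for u, vs in adj.items()}
    core = {}
    remaining = set(adj)
    k = 0
    while remaining:
        peel = [u for u in remaining if deg[u] <= k]
        if not peel:
            k += 1
            continue
        for u in peel:
            core[u] = k
            remaining.discard(u)
            for v in adj[u]:
                if v in remaining:
                    deg[v] -= 1  # removing u lowers its neighbors' degrees
    return core

# Toy graph: a 4-clique (3-core) with one pendant node attached (1-core).
adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4], 4: [3]}
print(core_numbers(adj))  # {4: 1, 0: 3, 1: 3, 2: 3, 3: 3}
```

The maximum value produced by this peeling is exactly the degeneracy reported above.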
t0 = tic()
plm = nk.community.PLM(G, refine=True, gamma=1.0, par="balanced")
plm.run()
toc(t0, "PLM community detection")
part = plm.getPartition()
num_comms = part.numberOfSubsets()
print("communities:", num_comms)
t0 = tic()
Q = nk.community.Modularity().getQuality(part, G)
toc(t0, "Modularity")
print("modularity Q:", Q)
sizes = np.array(list(part.subsetSizeMap().values()), dtype=np.int64)
print("community size stats:", pd.Series(sizes).describe(percentiles=[0.5, 0.9, 0.99]).to_dict())
t0 = tic()
eff = nk.distance.EffectiveDiameterApproximation(G, ratio=ED_RATIO)
eff.run()
toc(t0, f"EffectiveDiameter (ratio={ED_RATIO})")
print("effective diameter:", eff.getEffectiveDiameter())
t0 = tic()
diam = nk.distance.Diameter(G, algo=nk.distance.DiameterAlgo.EstimatedRange, error=0.1)
diam.run()
toc(t0, "Diameter (estimated range)")
print("estimated diameter (lower, upper):", diam.getDiameter())
force_cleanup()

We detect communities using PLM and record the number of communities found on the large graph. We compute modularity and summarize community-size statistics to validate the structure rather than simply trusting the partition. We estimate global distance behavior using the effective diameter and a ranged diameter estimate, keeping to the approximation APIs that scale to graphs of this size.
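The modularity score that validates the partition can be checked by hand on a toy example using Newman's formula Q = Σ_c (e_c/m − (d_c/(2m))²), where e_c is the number of intra-community edges, d_c the total degree inside community c, and m the edge count. A small library-free sketch (the helper name is ours):

```python
def modularity(edges, part):
    """Newman modularity for an undirected, unweighted edge list and a
    node -> community mapping."""
    m = len(edges)
    intra = {}   # e_c: edges with both endpoints in community c
    degsum = {}  # d_c: summed degree of nodes in community c
    for u, v in edges:
        degsum[part[u]] = degsum.get(part[u], 0) + 1
        degsum[part[v]] = degsum.get(part[v], 0) + 1
        if part[u] == part[v]:
            intra[part[u]] = intra.get(part[u], 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degsum.items())

# Two triangles joined by one bridge edge, split into their natural communities.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(round(modularity(edges, part), 4))  # 0.3571
```

This is the same quantity `nk.community.Modularity().getQuality(part, G)` computes, just without the parallel bookkeeping.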
t0 = tic()
G.indexEdges()  # sparsification scores are stored per edge id
sp = nk.sparsification.LocalSimilaritySparsifier()
G_sparse = sp.getSparsifiedGraphOfSize(G, 0.7)
toc(t0, "LocalSimilarity sparsification (target edge ratio=0.7)")
report(G_sparse, "Sparse")
t0 = tic()
pr2 = nk.centrality.PageRank(G_sparse, damp=0.85, tol=1e-8)
pr2.run()
toc(t0, "PageRank on sparse")
pr2_scores = np.array(pr2.scores(), dtype=np.float64)
print("Top PR nodes (sparse):", np.argsort(-pr2_scores)[:15].tolist())
t0 = tic()
plm2 = nk.community.PLM(G_sparse, refine=True, gamma=1.0, par="balanced")
plm2.run()
toc(t0, "PLM on sparse")
part2 = plm2.getPartition()
Q2 = nk.community.Modularity().getQuality(part2, G_sparse)
print("communities (sparse):", part2.numberOfSubsets(), "| modularity (sparse):", Q2)
t0 = tic()
eff2 = nk.distance.EffectiveDiameterApproximation(G_sparse, ratio=ED_RATIO)
eff2.run()
toc(t0, "EffectiveDiameter on sparse")
print("effective diameter (orig):", eff.getEffectiveDiameter(), "| (sparse):", eff2.getEffectiveDiameter())
force_cleanup()
out_path = "/content/networkit_large_sparse.edgelist"
t0 = tic()
nk.graphio.EdgeListWriter("\t", 0).write(G_sparse, out_path)
toc(t0, "Wrote edge list")
print("Saved:", out_path)
print("\nAdvanced large-graph pipeline complete.")We sparsify the graph using local similarity to reduce the number of edges while retaining useful structure for downstream analytics. We rerun PageRank, PLM, and effective diameter on the sparsified graph to check whether key signals remain consistent. We export the sparsified graph as an edgelist so we can reuse it across sessions, tools, or additional experiments.
In conclusion, we developed an end-to-end, scalable NetworKit workflow that mirrors real large-network analysis: we started from generation, stabilized the topology with LCC extraction, characterized the structure through cores and centralities, discovered communities and validated them with modularity, and captured global distance behavior through diameter estimates. We then applied sparsification to shrink the graph while keeping it analytically meaningful, and we saved the result for repeatable pipelines. The tutorial provides a practical template we can reuse for real datasets by replacing the generator with an edgelist reader, while keeping the same analysis stages, performance tracking, and export steps.
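To make the "replace the generator with an edgelist reader" swap concrete, here is a stdlib-only sketch of the tab-separated, 0-indexed convention the writer above produces (the helper names are ours; on the NetworKit side, an EdgeListReader configured with the same separator and first-node id would be the counterpart):

```python
import io

def write_edgelist(edges, fh, sep="\t"):
    # One "u<sep>v" line per undirected edge, matching a tab-separated,
    # 0-indexed edge list.
    for u, v in edges:
        fh.write(f"{u}{sep}{v}\n")

def read_edgelist(fh, sep="\t"):
    edges = []
    for line in fh:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        u, v = line.split(sep)
        edges.append((int(u), int(v)))
    return edges

# Round-trip through an in-memory buffer instead of a real file.
buf = io.StringIO()
write_edgelist([(0, 1), (1, 2)], buf)
buf.seek(0)
print(read_edgelist(buf))  # [(0, 1), (1, 2)]
```

Keeping the separator and node-id base consistent between writer and reader is the main pitfall when moving the exported graph between tools.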
The post A Production-Style NetworKit 11.2.1 Coding Tutorial for Large-Scale Graph Analytics, Communities, Cores, and Sparsification appeared first on MarkTechPost.