Introduction by example¶
Little Ball of Fur is a graph sampling extension library for NetworkX.
Little Ball of Fur consists of methods that sample from graphs. To put it simply, it is a Swiss Army knife for graph sampling tasks. First, it includes a large variety of vertex, edge, and exploration sampling techniques. Second, it provides a unified public interface that makes applying sampling algorithms trivial for end-users. Implemented methods cover a wide range of networking (Networking, INFOCOM, SIGCOMM) and data mining (KDD, TKDD, ICDE) conferences, workshops, and papers from prominent journals.
Citing¶
If you find Little Ball of Fur useful in your research, please consider citing the following paper:
@inproceedings{rozemberczki2020little,
    title={{Little Ball of Fur: A Python Library for Graph Sampling}},
    author={Benedek Rozemberczki and Oliver Kiss and Rik Sarkar},
    year={2020},
    pages={3133--3140},
    booktitle={Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20)},
    organization={ACM},
}
Overview¶
We briefly overview the fundamental concepts and features of Little Ball of Fur through simple examples.
Standardized dataset ingestion¶
Little Ball of Fur assumes that the NetworkX or NetworKit graph provided by the user has the following important properties:
The graph is undirected.
The graph is connected (it consists of a single connected component).
Nodes are indexed with integers.
There are no orphaned nodes in the graph.
The node indexing starts with zero and the indices are consecutive.
The returned NetworkX or NetworKit graph uses the same indexing.
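If a graph does not yet satisfy these assumptions, it can usually be brought into the required shape with a few lines of NetworkX. The helper below is a minimal sketch (the function name and approach are ours, not part of the library): it keeps the largest connected component, drops orphaned nodes, and relabels the nodes as consecutive integers starting from zero.

```python
import networkx as nx

def prepare_graph(graph):
    """Restrict the graph to its largest connected component and
    relabel the nodes as consecutive integers starting at zero."""
    largest_cc = max(nx.connected_components(graph), key=len)
    subgraph = graph.subgraph(largest_cc)
    # Relabeling keeps the indexing consecutive and zero-based.
    return nx.convert_node_labels_to_integers(subgraph)

# Example: a graph with string labels and an orphaned node.
graph = nx.Graph([("a", "b"), ("b", "c")])
graph.add_node("orphan")
clean = prepare_graph(graph)
print(sorted(clean.nodes()))  # [0, 1, 2]
```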
API driven design¶
Little Ball of Fur follows the design principles of Scikit-learn, which means that the algorithms in the package share the same API. Each graph sampling procedure is implemented as a class that inherits from the Sampler class. The constructors of the sampling algorithms set the hyperparameters, and the default hyperparameters work well out of the box. This means that non-expert users do not have to make decisions about these in advance, and only a little fine-tuning is required. For each class, the sample public method performs the sampling from the graph. In practice, this API-driven design means that one can sample a subgraph from a Watts-Strogatz graph with a RandomWalkSampler just like this:
import networkx as nx
from littleballoffur import RandomWalkSampler
graph = nx.newman_watts_strogatz_graph(1000, 20, 0.05)
model = RandomWalkSampler()
new_graph = model.sample(graph)
This snippet can be modified to use a ForestFireSampler with minimal effort:
import networkx as nx
from littleballoffur import ForestFireSampler
graph = nx.newman_watts_strogatz_graph(1000, 20, 0.05)
model = ForestFireSampler()
new_graph = model.sample(graph)
Looking at these two snippets, the advantage of the API-driven design is evident. First, we only had to change the import of the sampler. Second, we needed to change the sampler construction, while the default hyperparameters were already set. The public methods provided by RandomWalkSampler and ForestFireSampler are the same: a subsample is returned by sample. This allows for quick and minimal changes to the code when a sampling procedure performs poorly and has to be replaced.
Node sampling¶
The first task that we will look at is sampling a subgraph by drawing a representative set of nodes from a Facebook graph. In this network nodes represent official verified Facebook pages and the links between them are mutual likes. For details about the dataset see this paper.
We first need to load the Facebook Page-Page network dataset, which is returned as a NetworkX graph.
from littleballoffur import GraphReader
reader = GraphReader("facebook")
graph = reader.get_graph()
The constructor defines the parametrized graph reader object, while the get_graph method reads the data.
Now let’s use the PageRank Proportional Node Sampling method from Sampling From Large Graphs. We will sample approximately 50% of the original nodes from the network.
from littleballoffur import PageRankBasedSampler
number_of_nodes = int(0.5*graph.number_of_nodes())
sampler = PageRankBasedSampler(number_of_nodes=number_of_nodes)
new_graph = sampler.sample(graph)
The constructor defines a graph sampler; we sample nodes from the Facebook graph with the sample method and return the induced subgraph. Finally, we can evaluate the sample quality by comparing the clustering coefficient values calculated for the original and subsampled graphs. We somewhat overestimated the transitivity.
import networkx as nx
transitivity = nx.transitivity(graph)
transitivity_sampled = nx.transitivity(new_graph)
print('Transitivity Original: {:.4f}'.format(transitivity))
print('Transitivity Sampled: {:.4f}'.format(transitivity_sampled))
>>> Transitivity Original: 0.2323
>>> Transitivity Sampled: 0.2673
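Transitivity is only one possible quality metric; the same evaluation pattern works for any graph statistic NetworkX can compute. The sketch below compares average degree instead. To keep the example self-contained, it uses a plain uniform random node sample with its induced subgraph (not a Little Ball of Fur sampler) purely for illustration:

```python
import random
import networkx as nx

random.seed(42)
graph = nx.newman_watts_strogatz_graph(1000, 20, 0.05)

# A uniform random node sample and its induced subgraph,
# standing in for the output of any node sampler.
nodes = random.sample(list(graph.nodes()), graph.number_of_nodes() // 2)
new_graph = graph.subgraph(nodes)

def average_degree(g):
    # Each edge contributes to the degree of two nodes.
    return 2 * g.number_of_edges() / g.number_of_nodes()

print('Average degree original: {:.2f}'.format(average_degree(graph)))
print('Average degree sampled: {:.2f}'.format(average_degree(new_graph)))
```

As with transitivity, the sampled value deviates from the original one; induced subgraphs on half of the nodes keep only a fraction of the edges, so the average degree drops.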
Edge sampling¶
The second task that we will look at is sampling a subgraph by drawing a representative set of edges from a Wikipedia graph. In this network nodes represent Wikipedia pages about Crocodiles and the edges between them are mutual links. For details about the dataset see this paper.
We first need to load the Wikipedia dataset, which is returned as a NetworkX graph.
from littleballoffur import GraphReader
reader = GraphReader("wikipedia")
graph = reader.get_graph()
The constructor defines the parametrized graph reader object, while the get_graph method reads the dataset.
Now let’s use the Hybrid Node-Edge Sampling method from Reducing Large Internet Topologies for Faster Simulations. We will sample approximately 50% of the original edges from the network.
from littleballoffur import HybridNodeEdgeSampler
number_of_edges = int(0.5*graph.number_of_edges())
sampler = HybridNodeEdgeSampler(number_of_edges=number_of_edges)
new_graph = sampler.sample(graph)
The constructor defines a graph sampler; we sample edges from the Wikipedia graph with the sample method and return the induced subgraph. Finally, we can evaluate the sample quality by comparing the clustering coefficient values calculated for the original and subsampled graphs. We massively underestimated the transitivity.
import networkx as nx
transitivity = nx.transitivity(graph)
transitivity_sampled = nx.transitivity(new_graph)
print('Transitivity Original: {:.4f}'.format(transitivity))
print('Transitivity Sampled: {:.4f}'.format(transitivity_sampled))
>>> Transitivity Original: 0.0261
>>> Transitivity Sampled: 0.0070
Exploration sampling¶
The third task that we will look at is extracting a subgraph with exploration sampling from a GitHub social network. In this graph nodes represent GitHub developers and the edges between them are mutual follower relationships. For details about the dataset see this paper.
We first need to load the GitHub dataset, which is returned as a NetworkX graph.
from littleballoffur import GraphReader
reader = GraphReader("github")
graph = reader.get_graph()
The constructor again defines the parametrized graph reader object, while the get_graph method reads the dataset.
Now let’s use the Metropolis-Hastings Random Walk Sampler method from Metropolis Algorithms for Representative Subgraph Sampling. We will sample approximately 50% of the original nodes from the network.
from littleballoffur import MetropolisHastingsRandomWalkSampler
number_of_nodes = int(0.5*graph.number_of_nodes())
sampler = MetropolisHastingsRandomWalkSampler(number_of_nodes=number_of_nodes)
new_graph = sampler.sample(graph)
The constructor defines a graph sampler; we sample from the GitHub graph with the sample method and return the new graph. Finally, we can evaluate the sampling by comparing the clustering coefficient values calculated from the original and subsampled graphs. We overestimated the transitivity.
import networkx as nx
transitivity = nx.transitivity(graph)
transitivity_sampled = nx.transitivity(new_graph)
print('Transitivity Original: {:.4f}'.format(transitivity))
print('Transitivity Sampled: {:.4f}'.format(transitivity_sampled))
>>> Transitivity Original: 0.0124
>>> Transitivity Sampled: 0.0228
Benchmark datasets¶
We included a number of social network and webgraph datasets that can be used to compare the performance of sampling algorithms. These are the following: