Introduction by example¶
Little Ball of Fur is a graph sampling extension library for NetworkX.
Little Ball of Fur consists of methods that sample from graphs. To put it simply, it is a Swiss Army knife for graph sampling tasks. First, it includes a large variety of vertex, edge, and exploration sampling techniques. Second, it provides a unified public interface that makes applying sampling algorithms trivial for end-users. Implemented methods cover a wide range of networking (Networking, INFOCOM, SIGCOMM) and data mining (KDD, TKDD, ICDE) conferences, workshops, and papers from prominent journals.
Citing¶
If you find Little Ball of Fur useful in your research, please consider citing the following paper:
@inproceedings{rozemberczki2020little,
    title={{Little Ball of Fur: A Python Library for Graph Sampling}},
    author={Benedek Rozemberczki and Oliver Kiss and Rik Sarkar},
    year={2020},
    pages={3133--3140},
    booktitle={Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20)},
    organization={ACM},
}
Overview¶
We briefly overview the fundamental concepts and features of Little Ball of Fur through simple examples.
Standardized dataset ingestion¶
Little Ball of Fur assumes that the NetworkX or NetworKit graph provided by the user has the following important properties:
The graph is undirected.
The graph is connected (it consists of a single connected component).
Nodes are indexed with integers.
There are no orphaned nodes in the graph.
The node indexing starts with zero and the indices are consecutive.
The returned NetworkX or NetworKit graph uses the same indexing.
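If a graph does not yet satisfy these assumptions, it can usually be brought into the required shape with a few lines of NetworkX. The helper below is a minimal sketch (the function name and approach are ours, not part of the library): it keeps the largest connected component, drops orphaned nodes, and relabels the nodes as consecutive integers starting from zero.

```python
import networkx as nx

def prepare_graph(graph):
    """Restrict the graph to its largest connected component and
    relabel the nodes as consecutive integers starting at zero."""
    largest_cc = max(nx.connected_components(graph), key=len)
    subgraph = graph.subgraph(largest_cc)
    # Relabeling keeps the indexing consecutive and zero-based.
    return nx.convert_node_labels_to_integers(subgraph)

# Example: a graph with string labels and an orphaned node.
graph = nx.Graph([("a", "b"), ("b", "c")])
graph.add_node("orphan")
clean = prepare_graph(graph)
print(sorted(clean.nodes()))  # [0, 1, 2]
```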
API driven design¶
Little Ball of Fur follows the design principles of Scikit-learn, which means that the algorithms in the package share the same API. Each graph sampling procedure is implemented as a class that inherits from the Sampler class. The constructors of the sampling algorithms set the hyperparameters, and the default hyperparameters work well out of the box. This means that non-expert users do not have to make decisions about these in advance, and only a little fine-tuning is required. For each class, the sample public method performs the sampling from the graph. In practice, this API-driven design means that one can sample a subgraph from a Watts-Strogatz graph with a RandomWalkSampler just like this:
import networkx as nx
from littleballoffur import RandomWalkSampler
graph = nx.newman_watts_strogatz_graph(1000, 20, 0.05)
model = RandomWalkSampler()
new_graph = model.sample(graph)
This snippet can be modified to use a ForestFireSampler with minimal effort:
import networkx as nx
from littleballoffur import ForestFireSampler
graph = nx.newman_watts_strogatz_graph(1000, 20, 0.05)
model = ForestFireSampler()
new_graph = model.sample(graph)
Looking at these two snippets, the advantage of the API-driven design is evident. First, we only had to change the import of the sampler. Second, we needed to change the sampler construction, while the default hyperparameters were already set. The public methods provided by RandomWalkSampler and ForestFireSampler are the same: a subsample is returned by sample. This allows for quick and minimal changes to the code when a sampling procedure performs poorly and has to be replaced.
Node sampling¶
The first task that we will look at is sampling a subgraph by drawing a representative set of nodes from a Facebook graph. In this network nodes represent official verified Facebook pages and the links between them are mutual likes. For details about the dataset see this paper.
We first need to load the Facebook Page-Page network dataset, which is returned as a NetworkX graph.
from littleballoffur import GraphReader
reader = GraphReader("facebook")
graph = reader.get_graph()
The constructor defines the parametrized graph reader object, while the get_graph method reads the data.
Now let’s use the PageRank Proportional Node Sampling method from Sampling From Large Graphs. We will sample approximately 50% of the original nodes from the network.
from littleballoffur import PageRankBasedSampler
number_of_nodes = int(0.5*graph.number_of_nodes())
sampler = PageRankBasedSampler(number_of_nodes=number_of_nodes)
new_graph = sampler.sample(graph)
The constructor defines a graph sampler; we sample nodes from the Facebook graph with the sample method and return the induced subgraph. Finally, we can evaluate the sample quality by comparing the clustering coefficient values calculated for the original and subsampled graphs. We somewhat overestimated the transitivity.
import networkx as nx
transitivity = nx.transitivity(graph)
transitivity_sampled = nx.transitivity(new_graph)
print('Transitivity Original: {:.4f}'.format(transitivity))
print('Transitivity Sampled: {:.4f}'.format(transitivity_sampled))
>>> Transitivity Original: 0.2323
>>> Transitivity Sampled: 0.2673
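Transitivity is only one possible quality metric; the same evaluation pattern works for any graph statistic NetworkX can compute. The sketch below compares average degree instead. To keep the example self-contained, it uses a plain uniform random node sample with its induced subgraph (not a Little Ball of Fur sampler) purely for illustration:

```python
import random
import networkx as nx

random.seed(42)
graph = nx.newman_watts_strogatz_graph(1000, 20, 0.05)

# A uniform random node sample and its induced subgraph,
# standing in for the output of any node sampler.
nodes = random.sample(list(graph.nodes()), graph.number_of_nodes() // 2)
new_graph = graph.subgraph(nodes)

def average_degree(g):
    # Each edge contributes to the degree of two nodes.
    return 2 * g.number_of_edges() / g.number_of_nodes()

print('Average degree original: {:.2f}'.format(average_degree(graph)))
print('Average degree sampled: {:.2f}'.format(average_degree(new_graph)))
```

As with transitivity, the sampled value deviates from the original one; induced subgraphs on half of the nodes keep only a fraction of the edges, so the average degree drops.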
Edge sampling¶
The second task that we will look at is sampling a subgraph by drawing a representative set of edges from a Wikipedia graph. In this network nodes represent Wikipedia pages about Crocodiles and the edges between them are mutual links. For details about the dataset see this paper.
We first need to load the Wikipedia dataset, which is returned as a NetworkX graph.
from littleballoffur import GraphReader
reader = GraphReader("wikipedia")
graph = reader.get_graph()
The constructor defines the parametrized graph reader object, while the get_graph method reads the dataset.
Now let’s use the Hybrid Node-Edge Sampling method from Reducing Large Internet Topologies for Faster Simulations. We will sample approximately 50% of the original edges from the network.
from littleballoffur import HybridNodeEdgeSampler
number_of_edges = int(0.5*graph.number_of_edges())
sampler = HybridNodeEdgeSampler(number_of_edges=number_of_edges)
new_graph = sampler.sample(graph)
The constructor defines a graph sampler; we sample edges from the Wikipedia graph with the sample method and return the induced subgraph. Finally, we can evaluate the sample quality by comparing the clustering coefficient values calculated for the original and subsampled graphs. We massively underestimated the transitivity.
import networkx as nx
transitivity = nx.transitivity(graph)
transitivity_sampled = nx.transitivity(new_graph)
print('Transitivity Original: {:.4f}'.format(transitivity))
print('Transitivity Sampled: {:.4f}'.format(transitivity_sampled))
>>> Transitivity Original: 0.0261
>>> Transitivity Sampled: 0.0070
Exploration sampling¶
The third task that we will look at is extracting a subgraph with exploration sampling from a GitHub social network. In this graph nodes represent GitHub developers and the edges between them are mutual follower relationships. For details about the dataset see this paper.
We first need to load the GitHub dataset, which is returned as a NetworkX graph.
from littleballoffur import GraphReader
reader = GraphReader("github")
graph = reader.get_graph()
The constructor again defines the parametrized graph reader object, while the get_graph method reads the dataset.
Now let’s use the Metropolis-Hastings Random Walk Sampler method from Metropolis Algorithms for Representative Subgraph Sampling. We will sample approximately 50% of the original nodes from the network.
from littleballoffur import MetropolisHastingsRandomWalkSampler
number_of_nodes = int(0.5*graph.number_of_nodes())
sampler = MetropolisHastingsRandomWalkSampler(number_of_nodes=number_of_nodes)
new_graph = sampler.sample(graph)
The constructor defines a graph sampler; we sample from the GitHub graph with the sample method and return the new graph. Finally, we can evaluate the sampling by comparing the clustering coefficient values calculated from the original and subsampled graphs. We overestimated the transitivity.
import networkx as nx
transitivity = nx.transitivity(graph)
transitivity_sampled = nx.transitivity(new_graph)
print('Transitivity Original: {:.4f}'.format(transitivity))
print('Transitivity Sampled: {:.4f}'.format(transitivity_sampled))
>>> Transitivity Original: 0.0124
>>> Transitivity Sampled: 0.0228
Benchmark datasets¶
We included a number of social network and webgraph datasets that can be used to compare the performance of sampling algorithms. These are the following: