1 Introduction
Graph partitioning is an important optimization problem with numerous applications in domains spanning computer vision, VLSI design, biology, social networks, transportation networks and more. The objective is to find balanced partitions of a graph while minimizing the number of edges cut. The problem is NP-complete and is usually formulated as a discrete optimization problem, so solutions are generally derived using heuristics and approximation algorithms. Some notable approaches include multilevel methods and spectral partitioning methods [Karypis and Kumar, 1998, Karypis et al., 1999, Karypis and Kumar, 2000, Miettinen et al., 2006, Andersen et al., 2006, Chung, 2007].
In this work, we introduce a learning-based approach, GAP, for the continuous relaxation of the problem. We define a differentiable loss function which captures the objective of partitioning a graph into disjoint balanced partitions while minimizing the number of edges cut across those partitions. We train a deep model to optimize this loss function. The optimization is done in an unsupervised manner without the need for labeled datasets.
Our approach, GAP, does not assume anything about the graph structure (e.g., sparse vs. dense, or scale-free). Instead, GAP learns and adapts to the graph structure using graph embedding techniques while optimizing the partitioning loss function. This representation learning allows our approach to be self-adaptive without the need for us to design different strategies for different types of graphs.
Our learning-based approach is also capable of generalization, meaning that we can train a model on a set of graphs and then use it at inference time on unseen graphs of varying sizes. In particular, we show that when GAP is trained on smaller graphs (e.g., 1k nodes), it transfers what it learned to much larger ones (e.g., 20k nodes). This generalization allows trained GAP models to quickly infer partitions on large unseen graphs, whereas baseline methods have to redo the entire optimization for each new graph.
In summary, this paper makes the following contributions:

We propose GAP, a Generalizable Approximate Partitioning framework, which is an unsupervised learning approach to the classic problem of balanced graph partitioning. We define a differentiable loss function for partitioning that uses a continuous relaxation of the normalized cut. We then train a deep model and apply backpropagation to optimize the loss.

GAP models can produce efficient partitions on unseen graphs at inference time. Generalization is an advantage over existing approaches which must redo the entire optimization for each new graph.

GAP leverages graph embedding techniques [Kipf and Welling, 2017, Hamilton et al., 2017] and learns to partition graphs based on their underlying structure, allowing it to generate efficient partitions across a wide variety of graphs.

To encourage reproducible research, we provide source code in the supplementary materials and are in the process of open-sourcing the framework.

We show that GAP achieves competitive partitions while being up to 100 times faster than top-performing baselines on a variety of synthetic and real-world graphs with up to 27,000 nodes.
2 Related Work
Graph Partitioning: Graph partitioning is an important combinatorial optimization problem that has been studied extensively. The most widely used graph partitioning algorithms generate partitions by performing operations on the input graph until convergence [Andersen et al., 2006, Chung, 2007]. On the other hand, multilevel partitioning approaches first reduce the size of the graph by collapsing nodes and edges, then partition the smaller graph, and finally expand the graph to recover the partitioning for the original graph [Karypis and Kumar, 2000, Karypis et al., 1999, Karypis and Kumar, 1998, Miettinen et al., 2006]. These algorithms are shown to provide high-quality partitions [Miettinen et al., 2006].
Another approach is to use simulated annealing. [Van Den Bout and Miller, 1990] proposed mean field annealing, which combines simulated annealing with Hopfield neural networks. [Kawamoto et al., 2018] studied a different formulation of graph partitioning in which a graph is generated by a statistical model, and the task is to infer the preassigned group labels of the generative model. They developed a mean-field theory of a minimal graph neural network architecture for this version of the problem.
This line of inquiry formulates graph partitioning as a discrete optimization problem, while our GAP framework is one of the first deep learning approaches for the continuous relaxation of the problem. Moreover, GAP generalizes to unseen graphs, generating partitions on the fly, rather than having to redo the optimization per graph.
Clustering: Given a set of points, the goal of clustering is to identify groups of similar points. Clustering problems with different objectives such as self-balanced k-means and balanced min-cut have been studied extensively [Liu et al., 2017, Chen et al., 2017, Chang et al., 2014]. One of the most effective techniques for clustering is spectral clustering, which first generates node embeddings in the eigenspace of the graph Laplacian, and then applies k-means clustering to these vectors [Shi and Malik, 2000, Ng et al., 2002, Von Luxburg, 2007].
However, generalizing clustering to unseen nodes and graphs is nontrivial. To address generalization, SpectralNet [Shaham et al., 2018] is a deep learning approach to spectral clustering which generates spectral embeddings for unseen data points. Other deep learning approaches for clustering attempt to encode the input in a way that is amenable to clustering by k-means or Gaussian Mixture Models [Yang et al., 2017, Xie et al., 2016, Zheng et al., 2016, Dilokthanakul et al., 2016].
Although related, graph clustering and graph partitioning are different problems in that graph clustering attempts to maximize locality of clusters, whereas graph partitioning seeks to preserve locality while maintaining balance among partitions. Our approach also treats the partitioning problem as an end-to-end learning problem with a differentiable loss, whereas the aforementioned approaches generate embeddings that are then clustered using non-differentiable techniques like k-means.
Device Placement: The practical significance of graph partitioning is demonstrated by the task of device placement for TensorFlow computation graphs, where the objective is to minimize execution time by assigning operations to devices. [Mirhoseini et al., 2017] proposed a reinforcement learning method to optimize device placement for TensorFlow graphs. They used a seq2seq policy to assign operations to devices. The execution time of the generated placements is then used as a reward signal to optimize the policy. A hierarchical model for device placement has been proposed in [Mirhoseini et al., 2018], where the graph partition and placement are learned jointly. While this work also uses a neural network to learn the partitions, their objective is to optimize the runtime of the resulting partitions, forcing them to use policy gradient to optimize their non-differentiable loss function.
3 Problem Definition and Background
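Let $G = (V, E)$ be a graph where $V = \{v_i\}$ and $E = \{e(v_i, v_j) \mid v_i \in V, v_j \in V\}$ are the sets of nodes and edges in the graph. Let $n$ be the number of nodes. A graph $G$ can be partitioned into $g$ disjoint sets $S_1, S_2, \ldots, S_g$, where the union of the nodes in those sets is $V$ ($\bigcup_{k=1}^{g} S_k = V$), and each node belongs to only one set ($S_k \cap S_l = \emptyset$ for $k \neq l$), by simply removing edges connecting those sets.
Minimum Cut: The total number of edges that are removed from $G$ in order to form disjoint sets is called the cut. Given sets $S_k$ and $\bar{S}_k$, the $\text{cut}(S_k, \bar{S}_k)$ is formally defined as:
$$\text{cut}(S_k, \bar{S}_k) = \sum_{v_i \in S_k,\ v_j \in \bar{S}_k} e(v_i, v_j) \tag{1}$$
This formula can be generalized to multiple disjoint sets $S_1, S_2, \ldots, S_g$, where $\bar{S}_k$ is the union of all sets except $S_k$.
$$\text{cut}(S_1, S_2, \ldots, S_g) = \frac{1}{2} \sum_{k=1}^{g} \text{cut}(S_k, \bar{S}_k) \tag{2}$$
Normalized Cut: The optimal partitioning of a graph that minimizes the cut (Equation 2) is a well-studied problem and there exist efficient polynomial-time algorithms for solving it [Papadimitriou and Steiglitz, 1982]. However, the minimum cut criterion favors cutting off nodes whose degrees are small and leads to unbalanced sets/partitions. To avoid such bias, normalized cut (Ncut), which is based on the graph conductance, has been studied by [Shi and Malik, 2000, Zhang and Rohe, 2018], where the cost of a cut is computed as a fraction of the total edge connections to all nodes.
$$\text{Ncut}(S_1, S_2, \ldots, S_g) = \sum_{k=1}^{g} \frac{\text{cut}(S_k, \bar{S}_k)}{\text{vol}(S_k, V)} \tag{3}$$
Where $\text{vol}(S_k, V) = \sum_{v_i \in S_k,\ v_j \in V} e(v_i, v_j)$, i.e., the total degree of the nodes belonging to $S_k$ in graph $G$.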
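To make the contrast between the minimum cut and the normalized cut concrete, the following is a minimal sketch for hard (non-probabilistic) partitions. The function names and the small example graph are assumptions made for illustration and are not part of the GAP implementation.

```python
def cut_size(edges, part, k):
    """Number of edges with exactly one endpoint in partition k (Equation 1)."""
    return sum(1 for u, v in edges if (part[u] == k) != (part[v] == k))


def volume(edges, part, k):
    """Total degree of the nodes assigned to partition k."""
    return sum((part[u] == k) + (part[v] == k) for u, v in edges)


def total_cut(edges, part, g):
    """Equation 2: every cut edge is seen from both sides, hence the 1/2."""
    return sum(cut_size(edges, part, k) for k in range(g)) / 2


def normalized_cut(edges, part, g):
    """Equation 3: sum over partitions of cut / volume."""
    return sum(cut_size(edges, part, k) / volume(edges, part, k) for k in range(g))


# Hypothetical graph: two triangles {0,1,2} and {3,4,5} joined by two edges,
# plus a pendant node 6 attached to node 0.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3), (1, 4), (0, 6)]

pendant = [0, 0, 0, 0, 0, 0, 1]   # cut off only the low-degree node 6
balanced = [0, 0, 0, 1, 1, 1, 0]  # split between the two triangles

cut_pendant = total_cut(edges, pendant, g=2)
cut_balanced = total_cut(edges, balanced, g=2)
ncut_pendant = normalized_cut(edges, pendant, g=2)
ncut_balanced = normalized_cut(edges, balanced, g=2)
```

In this example the plain cut prefers isolating the pendant node (one edge removed instead of two), while the normalized cut prefers the split between the triangles, because the tiny volume of the single-node set inflates its term in Equation 3.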
One way to minimize the normalized cut is based on the eigenvectors of the graph Laplacian, which has been studied in [Shi and Malik, 2000, Zhang and Rohe, 2018]. Previous research has shown that across a wide range of social and information networks, the clusters with the smallest graph conductance are often small [Leskovec, 2009, Zhang and Rohe, 2018]. Regularized spectral clustering has been proposed by [Zhang and Rohe, 2018] to address this problem.
In this paper, however, we propose GAP as an unsupervised learning approach with a differentiable loss function that can be trained to find balanced partitions with minimum normalized cuts. We show that GAP enables generalization to unseen graphs.
4 Generalizable Approximate Partitioning
We now introduce the Generalizable Approximate Partitioning framework (GAP). As shown in Figure 1, GAP has two main components: graph representation learning for generating partition probabilities per node (the model), and a differentiable formulation of the normalized cut objective (the loss function). GAP enables us to train a neural network to optimize a previously non-differentiable objective by generating balanced partitions with minimum edge cut. We first present the loss function before discussing the model.
4.1 GAP Loss Function
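We assume that our model returns $Y \in \mathbb{R}^{n \times g}$ where $Y_{ik}$ represents the probability that node $v_i \in V$ belongs to partition $S_k$. We propose a loss function based on $Y$ to calculate the expected normalized cut in Equation 3 and evaluate the balancedness of the partitions. Later in Section 4.2, we discuss the model that generates $Y$.
Normalized Cut: As we discussed in Section 3, $\text{cut}(S_k, \bar{S}_k)$ is the number of edges $e(v_i, v_j)$, where $v_i \in S_k$ and $v_j \in \bar{S}_k$. Let $Y_{ik}$ be the probability that node $v_i$ belongs to partition $S_k$. The probability that node $v_j$ does not belong to partition $S_k$ would be $1 - Y_{jk}$. Therefore, $\mathbb{E}[\text{cut}(S_k, \bar{S}_k)]$ can be formulated by Equation 4, where $N(v_i)$ is the set of nodes adjacent to $v_i$ (visual illustration in Figure 1).
$$\mathbb{E}[\text{cut}(S_k, \bar{S}_k)] = \sum_{i=1}^{|V|} \sum_{v_j \in N(v_i)} Y_{ik} (1 - Y_{jk}) \tag{4}$$
Since the set of adjacent nodes for a given node can be retrieved from the adjacency matrix $A$ of graph $G$, we can rewrite Equation 4 as follows:
$$\mathbb{E}[\text{cut}(S_k, \bar{S}_k)] = \sum_{\text{reduce-sum}} Y_{:,k} (1 - Y_{:,k})^{\top} \odot A \tag{5}$$
The element-wise product with the adjacency matrix ($\odot A$) ensures that only the adjacent nodes are considered. Moreover, the result of $Y_{:,k} (1 - Y_{:,k})^{\top} \odot A$ is an $n \times n$ matrix and $\sum_{\text{reduce-sum}}$ is the sum over all of its elements.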
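As a sanity check on the two formulations of the expected cut, the sketch below evaluates the neighbor-list form (Equation 4) and the adjacency-masked form (Equation 5) on a small example. The function names, the 4-node path graph, and the probability values are assumptions chosen for illustration.

```python
def expected_cut_neighbors(neighbors, Y, k):
    """Equation 4: sum over nodes i and their neighbors j of Y[i][k] * (1 - Y[j][k])."""
    return sum(Y[i][k] * (1.0 - Y[j][k])
               for i in range(len(Y)) for j in neighbors[i])


def expected_cut_matrix(A, Y, k):
    """Equation 5: reduce-sum of Y[:, k] (1 - Y[:, k])^T masked element-wise by A."""
    n = len(Y)
    return sum(Y[i][k] * (1.0 - Y[j][k]) * A[i][j]
               for i in range(n) for j in range(n))


# Hypothetical 4-node path graph 0 - 1 - 2 - 3.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
A = [[1 if j in neighbors[i] else 0 for j in range(4)] for i in range(4)]

# Soft assignment (rows sum to 1) and the hard assignment {0, 1} | {2, 3}.
Y_soft = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
Y_hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

soft_from_neighbors = expected_cut_neighbors(neighbors, Y_soft, k=0)
soft_from_matrix = expected_cut_matrix(A, Y_soft, k=0)
hard_cut = expected_cut_matrix(A, Y_hard, k=0)
```

With a one-hot assignment the expectation reduces to the ordinary cut: only the edge between nodes 1 and 2 crosses the partition, so `hard_cut` equals 1.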
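From Equation 3, $\text{vol}(S_k, V)$ is the sum over the degree of all nodes that belong to $S_k$. Let $D$ be a column vector of size $n$ where $D_i$ is the degree of the node $v_i \in V$. Given $Y$, we can calculate $\mathbb{E}[\text{vol}(S_k, V)]$ as follows:
$$\Gamma = Y^{\top} D, \qquad \mathbb{E}[\text{vol}(S_k, V)] = \Gamma_k \tag{6}$$
Where $\Gamma$ is a vector in $\mathbb{R}^{g}$, and $g$ is the number of partitions.
With $\mathbb{E}[\text{cut}(S_k, \bar{S}_k)]$ and $\mathbb{E}[\text{vol}(S_k, V)]$ from Equations 5 and 6, we can calculate the expected normalized cut in Equation 3 as follows:
$$\mathbb{E}[\text{Ncut}(S_1, S_2, \ldots, S_g)] = \sum_{\text{reduce-sum}} (Y \oslash \Gamma)(1 - Y)^{\top} \odot A \tag{7}$$
$\oslash$ is element-wise division and the result of $(Y \oslash \Gamma)(1 - Y)^{\top} \odot A$ is an $n \times n$ matrix where $\sum_{\text{reduce-sum}}$ is the sum over all of its elements.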
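The expected normalized cut can be sketched in a few lines once the expected volumes are available. The code below is an illustrative reading of Equations 6 and 7 rather than the authors' implementation; the function name and the example graph are assumptions, and every partition is assumed to have non-zero expected volume so that the division is defined.

```python
def expected_ncut(A, Y):
    """Return (E[Ncut], Gamma) for adjacency matrix A and probabilities Y."""
    n, g = len(Y), len(Y[0])
    degree = [sum(row) for row in A]                      # D: degree of each node
    gamma = [sum(Y[i][k] * degree[i] for i in range(n))   # Gamma = Y^T D
             for k in range(g)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if A[i][j]:                                   # the element-wise mask by A
                total += sum(Y[i][k] / gamma[k] * (1.0 - Y[j][k]) for k in range(g))
    return total, gamma


# Hypothetical 4-node path graph 0 - 1 - 2 - 3, split as {0, 1} | {2, 3}.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
Y_hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

ncut, gamma = expected_ncut(A, Y_hard)
```

For this hard assignment each side has volume 3 and one cut edge, so the result matches Equation 3 evaluated directly: 1/3 + 1/3.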
Balanced Cut: So far, we have shown how one can calculate the expected normalized cut of a graph given the matrix $Y$ (probabilities of nodes belonging to partitions). Here, we show that given $Y$ we can also evaluate how balanced those partitions are.
Given the number of nodes $n = |V|$ in the graph and the number of partitions $g$, to have balanced partitions the number of nodes per partition should be $n/g$. The sum of the columns in $Y$ gives us the expected number of nodes in each partition due to the fact that $Y_{ik}$ represents the probability that node $v_i$ belongs to partition $S_k$. Thus, for the balanced partitions we minimize the following error:
$$\sum_{\text{reduce-sum}} \left( \mathbf{1}^{\top} Y - \frac{n}{g} \right)^{2} \tag{8}$$
Combining expected normalized cut (Equation 7) with the balanced partition error (Equation 8), we have the following loss function:
$$L = \sum_{\text{reduce-sum}} (Y \oslash \Gamma)(1 - Y)^{\top} \odot A + \sum_{\text{reduce-sum}} \left( \mathbf{1}^{\top} Y - \frac{n}{g} \right)^{2} \tag{9}$$
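The combined loss can be evaluated directly from an adjacency matrix and a probability matrix. The following sketch is one possible plain-Python reading of Equation 9, not the authors' code; `gap_loss` and the example graph are assumed names and data, and each partition is assumed to receive non-zero expected volume.

```python
def gap_loss(A, Y):
    """Expected normalized cut plus the balance penalty (Equation 9)."""
    n, g = len(Y), len(Y[0])
    degree = [sum(row) for row in A]
    gamma = [sum(Y[i][k] * degree[i] for i in range(n)) for k in range(g)]
    ncut = sum(A[i][j] * Y[i][k] / gamma[k] * (1.0 - Y[j][k])
               for i in range(n) for j in range(n) for k in range(g))
    # Column sums of Y are the expected partition sizes; n / g is the target.
    balance = sum((sum(Y[i][k] for i in range(n)) - n / g) ** 2 for k in range(g))
    return ncut + balance


# Hypothetical graph: triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for u, v in edges:
    A[u][v] = A[v][u] = 1

Y_balanced = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3   # one triangle per partition
Y_uniform = [[0.5, 0.5]] * 6                        # uninformative assignment
Y_lopsided = [[1.0, 0.0]] * 5 + [[0.0, 1.0]]        # node 5 alone

loss_balanced = gap_loss(A, Y_balanced)
loss_uniform = gap_loss(A, Y_uniform)
loss_lopsided = gap_loss(A, Y_lopsided)
```

On this example the balanced split has the lowest loss, the uniform assignment is penalized only through the cut term, and the lopsided split is penalized by both terms.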
Next, we discuss the GAP neural model that finds the graph partition to minimize the loss in Equation 9.
4.2 The GAP Model
The GAP model ingests a graph definition, generates node embeddings that leverage local graph structure, and projects each embedding into logits that define a probability distribution to minimize the expected normalized cut (Equation 9).
Graph Embedding Module: The purpose of the graph embedding module is to learn node embeddings using the graph structure and node features. Recently, there have been several advances on applying graph neural networks for node embedding and classification tasks using approaches such as Graph Convolution Network [Kipf and Welling, 2017] (GCN), GraphSAGE [Hamilton et al., 2017], Neural Graph Machines [Bui et al., 2017], Graph Attention Networks [Veličković et al., 2018] and other variants. In this work, we leverage GCN and GraphSAGE to learn graph representations across a variety of graphs, which helps with generalization.
GCN: [Kipf and Welling, 2017] showed that untrained GCN with random weights can serve as a powerful feature extractor for graph nodes. In our implementation, we used a 3-layer GCN with weight matrices $W^{(l)}$ ($l = 0, 1, 2$) using Xavier initialization described in [Glorot and Bengio, 2010].
$$Z = \tanh\big(\hat{A}\, \tanh\big(\hat{A}\, \tanh(\hat{A} X W^{(0)})\, W^{(1)}\big)\, W^{(2)}\big)$$
where $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph $G$ with added self-connections. $I_N$ is the identity matrix, and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. The input feature matrix $X$ depends on the graph. In TensorFlow computation graphs, each operation type (such as MatMul, Conv2d, Sum, etc.) would be a feature.
GraphSAGE: [Hamilton et al., 2017] developed a node embedding technique that generates high-dimensional graph node representations based on node input features. Central to this technique is sample and aggregate, where given a node $v$, we sample a set of $v$'s neighbors from $N(v)$, and aggregate their representations (with max pooling) to generate an embedding for the sampled neighbors of $v$. This neighbor representation, along with the representation of $v$ itself, is combined to generate a new representation for $v$. Iterating this process multiple times results in message passing among nodes for an increasing number of hops.
Our implementation of GraphSAGE is based on Algorithm 1 in [Hamilton et al., 2017]. For each message passing step $l$, we perform the following operations per node $v$:
$$h^{l+1}_{N(v)} = \max\big(\sigma(W^{agg} h^{l}_{u} + b^{agg}),\ \forall u \in N(v)\big)$$
$$h^{l+1}_{v} = \sigma\big(W^{proj} \times \text{concat}(h^{l}_{v}, h^{l+1}_{N(v)})\big)$$
where agg and proj denote the aggregation and projection matrices respectively.
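The two embedding variants described in this subsection can be sketched with plain Python lists. This is an illustrative approximation under several assumptions, not the authors' implementation: tanh is used as the GCN nonlinearity and ReLU as the GraphSAGE nonlinearity, neighbor sampling is omitted (all neighbors are used), every node is assumed to have at least one neighbor, and all function names, layer sizes, and the example graph are hypothetical.

```python
import math
import random


def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def relu(x):
    return max(0.0, x)


def normalized_adjacency(A):
    """A_hat = D~^{-1/2} (A + I) D~^{-1/2}, where D~ holds the row sums of A + I."""
    n = len(A)
    A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    d = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]


def xavier(rng, fan_in, fan_out):
    """Xavier/Glorot uniform initialization for a fan_in x fan_out matrix."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


def gcn_forward(A, X, weights):
    """Apply H <- tanh(A_hat H W) once per weight matrix."""
    A_hat, H = normalized_adjacency(A), X
    for W in weights:
        H = [[math.tanh(v) for v in row] for row in matmul(matmul(A_hat, H), W)]
    return H


def sage_step(neighbors, H, W_agg, b_agg, W_proj):
    """One max-pooling message passing step over all neighbors of each node."""
    out = []
    for v, h_v in enumerate(H):
        # Transform each neighbor's vector, then take the element-wise maximum.
        transformed = [[relu(x + b) for x, b in zip(matmul([H[u]], W_agg)[0], b_agg)]
                       for u in neighbors[v]]
        h_nbr = [max(col) for col in zip(*transformed)]
        # Concatenate the node's own vector with the pooled vector and project.
        out.append([relu(x) for x in matmul([h_v + h_nbr], W_proj)[0]])
    return out


rng = random.Random(0)
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # path graph 0 - 1 - 2 - 3
A = [[1 if j in neighbors[i] else 0 for j in range(4)] for i in range(4)]
X = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # one-hot features

Z_gcn = gcn_forward(A, X, [xavier(rng, 4, 8), xavier(rng, 8, 8), xavier(rng, 8, 8)])
Z_sage = sage_step(neighbors, X, xavier(rng, 4, 8), [0.0] * 8, xavier(rng, 12, 8))
```

Either output is an n × 8 matrix of node embeddings that a partitioning head can consume; stacking several `sage_step` calls would correspond to message passing over more hops.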
Graph Partitioning Module: The second module in our GAP framework is responsible for partitioning the graph, taking in node embeddings and generating the probability that each node belongs to partitions (Y in Figure 1). This module is a fully connected layer followed by softmax, trained to minimize Equation 9.
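A single dense layer followed by a softmax is enough to turn embeddings into the probability matrix Y. The sketch below assumes hypothetical embeddings, weights, and a 3-partition setup; it only illustrates the shape of the computation, and in practice the weights would be learned by minimizing Equation 9.

```python
import math


def softmax(logits):
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def partition_probabilities(Z, W, b):
    """Map node embeddings Z (n x d) to partition probabilities Y (n x g)."""
    Y = []
    for z in Z:
        logits = [sum(z_i * W[i][k] for i, z_i in enumerate(z)) + b[k]
                  for k in range(len(b))]
        Y.append(softmax(logits))
    return Y


# Hypothetical embeddings for 3 nodes (d = 2) and weights for g = 3 partitions.
Z = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.1]]
W = [[1.0, -1.0, 0.0],
     [0.0, 0.5, -0.5]]
b = [0.0, 0.0, 0.0]

Y = partition_probabilities(Z, W, b)
hard_assignment = [row.index(max(row)) for row in Y]  # argmax per node at inference
```

Each row of `Y` sums to one, which is what allows the column sums to be read as expected partition sizes in the balance term.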
We also note that for particularly large graphs, it is possible to optimize on randomly sampled minibatches of nodes from the larger graph. Furthermore, it is possible to stop gradient flow from the partitioning module to the embedding module, resulting in unsupervised node embeddings.
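Table 1: Performance of GAP against hMETIS on a 3-partition problem over TensorFlow computation graphs.

| Computation graph | hMETIS edge cut | hMETIS balancedness | GAP edge cut | GAP balancedness |
|---|---|---|---|---|
| VGG | 0.05 | 0.99 | 0.04 | 0.99 |
| MNIST-conv | 0.05 | 0.99 | 0.05 | 0.99 |
| ResNet | 0.04 | 0.99 | 0.04 | 0.99 |
| AlexNet | 0.05 | 0.99 | 0.05 | 0.99 |
| Inception-v3 | 0.04 | 0.99 | 0.04 | 0.99 |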
5 Experiments
The main goals of our experiments are to (a) evaluate the performance of the GAP framework against hMETIS [Karypis and Kumar, 2000], a widely used partitioner that uses multilevel partitioning, and (b) evaluate the generalizability of GAP over unseen graphs and provide insights on how the structural similarities between train and test graphs affect the generalization performance. Source code is provided for reproducibility and is in the process of being open-sourced.
5.1 Setup
We conducted experiments on real and synthetic graphs. Specifically, we use five widely used TensorFlow graphs. We also generate Random as well as Scale-free graphs as synthetic datasets to show the effectiveness of GAP on graphs with different structures.
Real Datasets

ResNet [He et al., 2016] is a deep convolutional network with residual connections to avoid vanishing gradients. The TensorFlow implementation of ResNet_v1_50 with 50 layers contains operations.

Inception-v3 [Szegedy et al., 2017] consists of multiple blocks, each composed of several convolutional and pooling layers. The TensorFlow graph of this model contains operations.

AlexNet [Krizhevsky et al., 2012] consists of 5 convolutional layers, some of which are followed by max-pooling layers, and 3 fully-connected layers with a final softmax. The TensorFlow graph of this model has operations.

MNIST-conv has 3 convolutional layers for the MNIST classification task. The TensorFlow graph of this model contains operations.

VGG [Simonyan and Zisserman, 2014] contains 16 convolutional layers. The TensorFlow graph of VGG contains operations.
Synthetic Datasets

Random: Randomly generated networks of size and nodes using the Erdös–Rényi model [Erdos and Rényi, 1960], where the probability of having an edge between any two nodes is .

Scale-free: Randomly generated scale-free networks of size and nodes using NetworkX [Hagberg et al., 2008]. (A scale-free network is a network whose degree distribution follows a power law [Bollobás et al., 2003].)
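For readers who want to reproduce the flavor of the Random dataset without extra dependencies, the Erdös–Rényi model can be sampled in a few lines. This is a minimal sketch; the function name is hypothetical, and the graph size and edge probability below are placeholder values chosen for illustration, not the ones used in the experiments.

```python
import itertools
import random


def erdos_renyi_edges(n, p, seed=None):
    """Sample an undirected G(n, p) graph: keep each node pair independently with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in itertools.combinations(range(n), 2) if rng.random() < p]


# Placeholder parameters for illustration only.
edges = erdos_renyi_edges(n=100, p=0.1, seed=1)

# Degree of every node, e.g. for inspecting the degree histogram.
degree = [0] * 100
for u, v in edges:
    degree[u] += 1
    degree[v] += 1
```

Fixing the seed makes a sampled graph reproducible, which is convenient when several methods must be compared on the same inputs.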
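Table 2: Generalization of GAP, trained on VGG, to unseen computation graphs.

| Name | Embedding | AlexNet edge cut | AlexNet balancedness | Inception-v3 edge cut | Inception-v3 balancedness | ResNet edge cut | ResNet balancedness |
|---|---|---|---|---|---|---|---|
| GAP-op | | 0.16 | 0.71 | 0.24 | 0.74 | 0.45 | 0.90 |
| GAP-id | GCN offline | 0.28 | 0.97 | 0.19 | 0.98 | 0.17 | 0.93 |
| GAP-op | GCN offline | 0.07 | 0.99 | 0.12 | 0.98 | 0.11 | 0.94 |
| GAP-op | GraphSAGE offline | 0.07 | 0.99 | 0.08 | 0.99 | 0.09 | 0.95 |
| GAP-op | GraphSAGE trained | 0.06 | 0.99 | 0.06 | 0.99 | 0.08 | 0.98 |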
Baseline: Since graph partitioning is NP-complete, solutions are generally derived using heuristics and approximation algorithms. While there has been a substantial amount of work on graph partitioning for specific graph structures and applications [Gonzalez et al., 2012, Hada et al., 2018], hMETIS [Karypis and Kumar, 2000, Karypis et al., 1999] is a general framework that works across a wide variety of graphs and is shown to provide high-quality partitions in different domains (e.g., VLSI, road networks) [Miettinen et al., 2006, Xu and Tan, 2012]. Similar to hMETIS, GAP is a general framework that makes no assumptions about graph structure. In our experiments, we compare GAP against hMETIS. We set the hMETIS parameters to return balanced partitions with minimum edge cut.
Performance Measures: As we discussed in Section 3, the goal of graph partitioning is to produce balanced partitions with minimum edge cut. We evaluate the resulting partitions using 1) Edge cut: the ratio of the cut to the total number of edges, and 2) Balancedness: one minus the MSE between the number of nodes in every partition and the size of a balanced partition ($n/g$).
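Both measures are straightforward to compute from a hard assignment. The sketch below is one possible reading of the definitions above; in particular, it assumes that partition sizes are expressed as fractions of n before the MSE is taken, so that balancedness stays within [0, 1]. That normalization, the function names, and the example graph are assumptions made for illustration.

```python
def edge_cut_ratio(edges, part):
    """Fraction of edges whose endpoints fall in different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v]) / len(edges)


def balancedness(part, g):
    """One minus the MSE between partition size fractions and the ideal 1 / g.

    Normalizing sizes by n is an assumption made so the score lies in [0, 1].
    """
    n = len(part)
    fractions = [part.count(k) / n for k in range(g)]
    mse = sum((f - 1.0 / g) ** 2 for f in fractions) / g
    return 1.0 - mse


# Hypothetical graph: triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
balanced = [0, 0, 0, 1, 1, 1]
lopsided = [0, 0, 0, 0, 0, 1]

cut_balanced = edge_cut_ratio(edges, balanced)
score_balanced = balancedness(balanced, g=2)
cut_lopsided = edge_cut_ratio(edges, lopsided)
score_lopsided = balancedness(lopsided, g=2)
```

The balanced split removes one of seven edges and scores a balancedness of 1, while the lopsided split removes two edges and is penalized on both measures.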
5.2 Performance
In this set of experiments, we find that GAP outperforms hMETIS. Since hMETIS does not generalize to unseen graphs and optimizes one graph at a time, we also constrain GAP to optimize one graph at a time for a fair comparison. We discuss the generalization ability of GAP in Section 5.3.
Table 1 shows the performance of GAP against hMETIS on a 3-partition problem over real TensorFlow graphs. Both techniques generate very balanced partitions, with GAP outperforming hMETIS on edge cut for the VGG graph.
Figure 3 shows the performance of GAP against hMETIS on random graphs when the number of partitions is varied from 2 to 10. The plots represent the average value across 5 random graphs. Both GAP and hMETIS produce 99% balanced partitions. However, GAP is also able to find lower edge cut partitions than hMETIS. By examining the degree histograms of our datasets (Figures 1(a) to 1(d)), we found that while hMETIS heuristics work reasonably well on sparse TensorFlow graphs, GAP outperforms hMETIS on dense graphs.
5.3 Generalization
In this section, we show that GAP generalizes effectively on real and synthetic datasets. To the best of our knowledge, we are the first to propose a learning approach for graph partitioning that can generalize to unseen graphs.
5.3.1 Generalization on real graphs
In this set of experiments, we train GAP with a single TensorFlow graph, VGG, and validate on MNIST-conv. At inference time, we test the trained model on unseen TensorFlow graphs: AlexNet, ResNet, and Inception-v3.
Table 2 shows the result of our experiments, and illustrates the importance of node features and graph embeddings in generalization. In GAP-id, we use the index of a node as its feature, while in GAP-op, the operation type (such as Add, Conv2d, and L2loss in TensorFlow) is used as the node feature. We encode all features as one-hot vectors. Following Section 4.2, we leverage Graph Convolution Networks [Kipf and Welling, 2017] (GCN) and GraphSAGE [Hamilton et al., 2017] to capture similarities across graphs. In GCN offline and GraphSAGE offline, no gradient flows from the partitioning module to the graph embedding module (Figure 1), so the embedding module is not trained on the partitioning loss, while in GraphSAGE trained both modules are trained jointly. Table 2 shows that GAP-op with GraphSAGE trained (last row) achieves the best performance and generalizes better than the other models. Note that this model is trained on a single graph, VGG with only nodes, and it is tested on AlexNet, ResNet, and Inception-v3 with , , and nodes, respectively.
Figure 4 shows the GAP partitioning of Inception-v3 using a model trained on the same graph (3(a)) and a model trained on VGG (3(b)). Note that partitions are denoted by colors and we only show nodes whose operation type is convolution. In the scenario (3(a)) where we train and test GAP on Inception-v3, we achieve 99% balanced partitions with 4% edge cut (Table 1). When GAP is trained on VGG and tested on the unseen graph (Inception-v3), it achieves 99% balanced partitions with 6% edge cut (last row of Table 2). The partition assignments in Figures 3(a) and 3(b) are quite similar (75%), which demonstrates GAP generalization.
We also observed that the similarity of the node features (operation types) in VGG and other computation graphs used in inference and validation is correlated with the edge cut score of GAP partitioning (Figure 5). For example, let A and B be the sets of operation types in VGG and ResNet, respectively, with a Jaccard similarity of $J(A, B) = |A \cap B| / |A \cup B|$. Figure 5 shows that as the Jaccard similarity of a graph with VGG increases, the edge cut decreases. In other words, the presence of similar node types across train and test graphs aids the generalization of our model.
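The Jaccard similarity between two sets of operation types is simple to compute. In the sketch below the operation names are made up for illustration and do not reflect the actual operation sets of VGG or ResNet.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


# Hypothetical operation-type sets for a training graph and a test graph.
train_ops = {"Conv2D", "Relu", "MatMul", "MaxPool"}
test_ops = {"Conv2D", "Relu", "Add", "MaxPool", "FusedBatchNorm"}

similarity = jaccard(train_ops, test_ops)
```

Here three operation types are shared out of six distinct ones, giving a similarity of 0.5; a test graph that shares more operation types with the training graph would score closer to 1.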
Model Architecture and Hyperparameters: Here, we describe the details of the model with the best performance (corresponding to the last row of Table 2). The number of features (TensorFlow operation types) is 1518. GraphSAGE has 5 layers of 512 units with shared pooling, and the graph partitioning module is a 3-layer dense network of 64 units with a final softmax layer. We use ReLU as the activation function and all weights are initialized using Xavier initialization [Glorot and Bengio, 2010]. We use the Adam optimizer with a learning rate of 7.5e-5.
5.3.2 Generalization on synthetic graphs
We further evaluate the generalization of GAP on Random and Scale-free graphs. Note that we train and test GAP on the same type of graph, but the number of nodes may vary. For example, we train GAP on random graphs of 1k nodes and test on random graphs of 1k and 10k nodes. Similarly, we train GAP on scale-free graphs of 1k nodes and test on scale-free graphs of 1k and 10k nodes.
Figures 5(a), 5(b), and 5(c) show the edge cut, balancedness, and execution time of GAP against hMETIS over the scale-free graphs (every point is the average of 5 experiments). In GAP-Scalefree-1 we train GAP with only one scale-free graph, while GAP-Scalefree-10 is trained on 10 scale-free graphs. We then test the trained models GAP-Scalefree-1 and GAP-Scalefree-10 over 5 unseen scale-free graphs of 1k and 10k nodes and we report the average results. Figure 5(a) shows that both GAP-Scalefree-1 and GAP-Scalefree-10 partition the unseen graphs of 1k and 10k nodes with lower edge cut than hMETIS. Despite the balancedness of GAP-Scalefree-1 being lower than that of hMETIS, by increasing the number of graphs in the training set (GAP-Scalefree-10) balancedness is improved as shown in Figure 5(b), while its edge cut is still smaller (5(a)). Furthermore, GAP-Scalefree-10 runs slightly faster than hMETIS (5(c)) and its partitions are just as balanced as those of hMETIS (5(b)) but with lower edge cut (5(a)).
Figures 6(a), 6(b), and 6(c) show the edge cut, balancedness, and execution time of GAP against hMETIS on random graphs. Every point is the average of 5 experiments. In GAP-Random-1, we train GAP on only one random graph, while in GAP-Random-10, we train on 10 random graphs. We then test the trained models GAP-Random-1 and GAP-Random-10 on 5 unseen random graphs with 1k and 10k nodes and we report the average results. The performance of GAP when generalizing on unseen random graphs of 1k and 10k nodes is almost the same as the performance of hMETIS, while Figure 6(c) shows that during inference, GAP is 10 to 100 times faster than hMETIS.
Model Architectures and Hyperparameters:
Unlike computation graphs where node features are operation types, nodes in synthetic graphs have no features. Furthermore, we must train a model that generalizes to graphs of different sizes. For example, we train a model on a random graph with 1k nodes and test it on a random graph with 10k nodes. To do so, we apply PCA to the adjacency matrix of a featureless graph and retrieve embeddings of size 1000 as our node features. We use ReLU as our activation function and all weights are initialized using Xavier initialization. We also use the Adam optimizer. Here are the rest of the hyperparameters for each model.
GAP-Scalefree-1: Trained with one scale-free graph. GraphSAGE has 5 layers of 512 units, and the graph partitioning module is a 3-layer dense network of 128 units with softmax. Learning rate is 2.5e-6.
GAP-Scalefree-10: Trained with 10 scale-free graphs. GraphSAGE has 4 layers of 128 units, and the graph partitioning module is a 1-layer dense network of 64 units with softmax. Learning rate is 7.5e-6.
GAP-Random-1: Trained with only one random graph. GraphSAGE has 5 layers of 128 units with shared pooling, and the graph partitioning module is a 2-layer dense network of 64 units with softmax. Learning rate is 7.5e-4.
GAP-Random-10: Trained with 10 random graphs. GraphSAGE has 2 layers of 256 units with shared pooling, and the graph partitioning module is a 3-layer dense network of 128 units with softmax. Learning rate is 7.5e-6.
6 Conclusion
We propose a deep learning framework, GAP, for the graph partitioning problem, where the objective is to assign the nodes of a graph into balanced partitions while minimizing the edge cut across the partitions. Our GAP framework enables generalization: we can train models that produce performant partitions at inference time, even on unseen graphs. This generalization is an advantage over existing baselines which redo the optimization for each new graph. Our results over widely used machine learning models (ResNet, VGG, and Inception-v3), scale-free graphs, and random graphs confirm that GAP achieves competitive partitions while being up to 100 times faster than the baseline and generalizing to unseen graphs.
References
 [Andersen et al., 2006] Andersen, R., Chung, F., and Lang, K. (2006). Local graph partitioning using pagerank vectors. In FOCS, pages 475–486. IEEE.
 [Bollobás et al., 2003] Bollobás, B., Borgs, C., Chayes, J., and Riordan, O. (2003). Directed scale-free graphs. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’03, pages 132–139.
 [Bui et al., 2017] Bui, T. D., Ravi, S., and Ramavajjala, V. (2017). Neural graph machines: Learning neural networks using graphs. CoRR, abs/1703.04818.
 [Chang et al., 2014] Chang, X., Nie, F., Ma, Z., and Yang, Y. (2014). Balanced k-means and min-cut clustering. arXiv preprint arXiv:1411.6235.
 [Chen et al., 2017] Chen, X., Huang, J. Z., Nie, F., Chen, R., and Wu, Q. (2017). A self-balanced min-cut algorithm for image clustering. In ICCV, pages 2080–2088.
 [Chung, 2007] Chung, F. (2007). Four proofs for the Cheeger inequality and graph partition algorithms. In Proceedings of ICCM, volume 2, page 378.
 [Dilokthanakul et al., 2016] Dilokthanakul, N., Mediano, P. A., Garnelo, M., Lee, M. C., Salimbeni, H., Arulkumaran, K., and Shanahan, M. (2016). Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
 [Erdos and Rényi, 1960] Erdos, P. and Rényi, A. (1960). On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60.
 [Glorot and Bengio, 2010] Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Teh, Y. W. and Titterington, M., editors, AISTATS, volume 9, pages 249–256, Chia Laguna Resort, Sardinia, Italy. PMLR.
 [Gonzalez et al., 2012] Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. (2012). PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, volume 12, page 2.
 [Hada et al., 2018] Hada, R. J., Wu, H., and Jin, M. (2018). Scalable minimum-cost balanced partitioning of large-scale social networks: Online and offline solutions. IEEE Transactions on Parallel and Distributed Systems, 29(7):1636–1649.
 [Hagberg et al., 2008] Hagberg, A., Swart, P., and Schult, D. (2008). Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Lab. (LANL), Los Alamos, NM (United States).
 [Hamilton et al., 2017] Hamilton, W. L., Ying, Z., and Leskovec, J. (2017). Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 1025–1035.
 [He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
 [Karypis et al., 1999] Karypis, G., Aggarwal, R., Kumar, V., and Shekhar, S. (1999). Multilevel hypergraph partitioning: applications in VLSI domain. IEEE T VLSI SYST, 7(1):69–79.
 [Karypis and Kumar, 1998] Karypis, G. and Kumar, V. (1998). Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 48(1):96–129.
 [Karypis and Kumar, 2000] Karypis, G. and Kumar, V. (2000). Multilevel k-way hypergraph partitioning. VLSI design, 11(3):285–300.
 [Kawamoto et al., 2018] Kawamoto, T., Tsubaki, M., and Obuchi, T. (2018). Mean-field theory of graph neural networks in graph partitioning. In Advances in Neural Information Processing Systems, pages 4366–4376.
 [Kipf and Welling, 2017] Kipf, T. N. and Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In ICLR.
 [Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.
 [Leskovec, 2009] Leskovec, J. (2009). Community structure in large networks: Natural cluster sizes and the absence of large welldefined clusters. Internet Mathematics, 6(1):29–123.
 [Liu et al., 2017] Liu, H., Han, J., Nie, F., and Li, X. (2017). Balanced clustering with least square regression. In AAAI, pages 2231–2237.
 [Miettinen et al., 2006] Miettinen, P., Honkala, M., and Roos, J. (2006). Using METIS and hMETIS algorithms in circuit partitioning. Helsinki University of Technology.
 [Mirhoseini et al., 2018] Mirhoseini, A., Goldie, A., Pham, H., Steiner, B., Le, Q. V., and Dean, J. (2018). A hierarchical model for device placement. In ICLR.
 [Mirhoseini et al., 2017] Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., and Dean, J. (2017). Device placement optimization with reinforcement learning. In ICML.
 [Ng et al., 2002] Ng, A. Y., Jordan, M. I., and Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pages 849–856.
 [Papadimitriou and Steiglitz, 1982] Papadimitriou, C. H. and Steiglitz, K. (1982). Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
 [Shaham et al., 2018] Shaham, U., Stanton, K., Li, H., Nadler, B., Basri, R., and Kluger, Y. (2018). Spectralnet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587.
 [Shi and Malik, 2000] Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905.
 [Simonyan and Zisserman, 2014] Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
 [Szegedy et al., 2017] Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. (2017). Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, page 12.
 [Van Den Bout and Miller, 1990] Van Den Bout, D. E. and Miller, T. K. (1990). Graph partitioning using annealed neural networks. IEEE Transactions on neural networks, 1(2):192–203.
 [Veličković et al., 2018] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph Attention Networks. International Conference on Learning Representations.
 [Von Luxburg, 2007] Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and computing, 17(4):395–416.
 [Xie et al., 2016] Xie, J., Girshick, R., and Farhadi, A. (2016). Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pages 478–487.
 [Xu and Tan, 2012] Xu, Y. and Tan, G. (2012). hMETIS-based offline road network partitioning. In AsiaSim 2012, pages 221–229. Springer.
 [Yang et al., 2017] Yang, B., Fu, X., Sidiropoulos, N. D., and Hong, M. (2017). Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3861–3870. JMLR.org.
 [Zhang and Rohe, 2018] Zhang, Y. and Rohe, K. (2018). Understanding regularized spectral clustering via graph conductance. In NeurIPS, pages 10654–10663.
 [Zheng et al., 2016] Zheng, Y., Tan, H., Tang, B., Zhou, H., et al. (2016). Variational deep embedding: A generative approach to clustering. arXiv preprint arXiv:1611.05148.