[Syngen] 22.12 Release Demo notebooks description extension

Artur Kasymov 3 years ago
parent
commit
5c45d613ea

+ 220 - 45
Tools/DGLPyTorch/SyntheticGraphGeneration/demos/advanced_examples/e2e_cora_demo.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "raw",
-   "id": "af9ebdc3",
+   "id": "5a21cdb1",
    "metadata": {},
    "source": [
     "# Copyright 2023 NVIDIA Corporation. All Rights Reserved.\n",
@@ -23,7 +23,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "e16cd18a",
+   "id": "277223b5",
    "metadata": {},
    "source": [
     "# End to end graph generation demo (CORA)"
@@ -31,12 +31,14 @@
   },
   {
    "cell_type": "markdown",
-   "id": "0711da78",
+   "id": "d8ddf635",
    "metadata": {},
    "source": [
     "## Overview\n",
     "\n",
-    "In this notebook, we have walked through the complete process of generating a synthetic dataset based on a CORA dataset. The CORA dataset consists of scientific publications classified into one of seven classes. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary, so we can interpret the CORA dataset as a graph with categorical node features.\n",
+    "In this notebook, we walk through the complete process of generating a synthetic dataset based on the CORA dataset. \n",
+    "\n",
+    "The CORA dataset consists of scientific publications classified into one of seven classes. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary, so we can interpret the CORA dataset as a graph with categorical node features.\n",
     "\n",
     "Content:\n",
     "\n",
@@ -48,7 +50,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "cb2a97a3",
+   "id": "295da055",
    "metadata": {},
    "source": [
     "### Imports"
@@ -57,7 +59,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "4c92ade2",
+   "id": "9c8dab41",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -87,17 +89,40 @@
   },
   {
    "cell_type": "markdown",
-   "id": "9f40bfe2",
+   "id": "359be279",
    "metadata": {},
    "source": [
     "<a id=\"1\"></a>\n",
     "### Fit synthesizer"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "5e850760",
+   "metadata": {},
+   "source": [
+    "#### Instantiating the building blocks"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c7f3981b",
+   "metadata": {},
+   "source": [
+    "As the CORA dataset is a graph with node features, the following objects are instantiated:\n",
+    "\n",
+    "- A node feature generator to generate the node features. In this example we simply choose the Kernel Density Estimate (KDE) generator.\n",
+    "- A graph generator to generate the graph structure, e.g. RMAT.\n",
+    "- An aligner to align the two, in this case a random aligner, which randomly assigns the node features to the generated nodes.\n",
+    "\n",
+    "\n",
+    "**Note**: Alternative generators can be used as long as they implement the `fit` \\& `generate` API and consume the data dictionary (described below)."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "b23cd179",
+   "id": "c29b7ddb",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -107,10 +132,28 @@
     "graph_aligner = RandomAligner()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "b7b9cbca",
+   "metadata": {},
+   "source": [
+    "#### Defining the synthesizer"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "95139288",
+   "metadata": {},
+   "source": [
+    "Once the set of building blocks is instantiated with the corresponding hyperparameters, we can instantiate a synthesizer which defines how these building blocks interact. \n",
+    "\n",
+    "The static graph synthesizer object can be used to generate graphs with either node or edge features."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "bf237cca",
+   "id": "3d7bce4f",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -121,20 +164,49 @@
     "                                    graph_aligner=graph_aligner)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "323834ba",
+   "metadata": {},
+   "source": [
+    "#### Preprocessing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a00edc74",
+   "metadata": {},
+   "source": [
+    "For the CORA dataset, a preprocessing step is pre-implemented (see `/syngen/preprocessing/cora.py`), which reads the corresponding data files to create the CORA graph with labels converted into ordinal values."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "ac06ae7e",
+   "id": "4fe80b1c",
    "metadata": {},
    "outputs": [],
    "source": [
     "data = preprocessing.transform('/workspace/data/cora')"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "024b44d8",
+   "metadata": {},
+   "source": [
+    "The output of the preprocessing function is a dictionary with the following keys:\n",
+    "\n",
+    "- MetaData.EDGE_DATA: data corresponding to the graph's edge information.\n",
+    "- MetaData.NODE_DATA: data corresponding to the graph's node information.\n",
+    "\n",
+    "Now that we have the data, the synthesizer can be fit. This step simply fits each component on the data."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "f24786f9",
+   "id": "7956c60e",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -143,17 +215,27 @@
   },
   {
    "cell_type": "markdown",
-   "id": "0ff98447",
+   "id": "6ab8d92c",
    "metadata": {},
    "source": [
     "<a id=\"2\"></a>\n",
     "## Dataset Generation"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "006edfa1",
+   "metadata": {},
+   "source": [
+    "Now that we have a synthesizer fitted on a downstream dataset, we can generate a graph with characteristics similar to those of the original.\n",
+    "\n",
+    "In this example, we simply generate a graph of the same size."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "c3116be5",
+   "id": "cb149b46",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -161,10 +243,18 @@
     "num_nodes = 2708 "
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "79adcbe2",
+   "metadata": {},
+   "source": [
+    "Calling `generate` with the desired graph size returns a dictionary with keys corresponding to edge data and node data (if the synthesizer was configured to generate these)."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "24cf045e",
+   "id": "c364179e",
    "metadata": {},
    "outputs": [
     {
@@ -181,31 +271,56 @@
   },
   {
    "cell_type": "markdown",
-   "id": "57c0fc18",
+   "id": "355d6dff",
    "metadata": {},
    "source": [
     "<a id=\"3\"></a>\n",
     "## Tabular Data Evaluation"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "38f0c94d",
+   "metadata": {},
+   "source": [
+    "Now that we have generated the data, we may be interested in assessing the quality of the generated graph.\n",
+    "\n",
+    "The tool provides a set of analyzers which can be used to analyze\n",
+    "- tabular features\n",
+    "- graph structure\n",
+    "- both\n",
+    "\n",
+    "Below, a series of examples compares the original node feature distribution with those produced by various node feature generators."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "3c0c1a12",
+   "id": "d85f0ead",
    "metadata": {},
    "outputs": [],
    "source": [
+    "# - extract the fitted node generator from the synthesizer\n",
     "tabular_generator = synthesizer.node_feature_generator\n",
     "cols_to_drop = set(['id'])"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "488e32f4",
+   "metadata": {},
+   "source": [
+    "**Note**: the `id` column is dropped as it simply corresponds to the node id."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "b721dea8",
+   "id": "3371b941",
    "metadata": {},
    "outputs": [],
    "source": [
+    "# - extract node features associated with nodes in the graph\n",
     "real = data[MetaData.NODE_DATA]\n",
     "real = real.drop(columns=cols_to_drop).reset_index(drop=True)"
    ]
@@ -213,18 +328,29 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "a23275bd",
+   "id": "16423b72",
    "metadata": {},
    "outputs": [],
    "source": [
+    "# - generate node features using the generator from the synthesizer\n",
+    "# note the synthetic data could also be replaced with the node data\n",
+    "# generated above.\n",
     "synthetic = tabular_generator.sample(len(real))\n",
     "synthetic = synthetic.drop(columns=cols_to_drop).reset_index(drop=True)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "6c90e69d",
+   "metadata": {},
+   "source": [
+    "The `real` and `synthetic` data to be compared are then fed to the `TabularMetrics` object, which can be used to visually compare the data, or provide a series of metrics comparing feature distributions and their correlations."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "623110cd",
+   "id": "125df9fb",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -238,7 +364,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "3942330b",
+   "id": "e541e647",
    "metadata": {},
    "outputs": [
     {
@@ -260,16 +386,24 @@
   },
   {
    "cell_type": "markdown",
-   "id": "305c3fc9",
+   "id": "546b0618",
    "metadata": {},
    "source": [
     "### Random Tabular Data"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "450ced3b",
+   "metadata": {},
+   "source": [
+    "In the cell below a comparison is done using a uniform random generator."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 13,
-   "id": "be3317b8",
+   "id": "f0d511e7",
    "metadata": {},
    "outputs": [
     {
@@ -301,16 +435,24 @@
   },
   {
    "cell_type": "markdown",
-   "id": "33a7d3e4",
+   "id": "6c3f78ff",
    "metadata": {},
    "source": [
     "### Random Multivariate"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "b8fd3f8a",
+   "metadata": {},
+   "source": [
+    "In the cell below a comparison is done using a multivariate random generator. Note the similarity of this with `KDEGenerator`, as the KDE generator simply adds Gaussian noise and clamps the values."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 14,
-   "id": "0b42fc7e",
+   "id": "cae90372",
    "metadata": {},
    "outputs": [
     {
@@ -342,17 +484,31 @@
   },
   {
    "cell_type": "markdown",
-   "id": "a61839a5",
+   "id": "d8556744",
    "metadata": {},
    "source": [
     "<a id=\"4\"></a>\n",
     "## Structure evaluation"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "995bd9ad",
+   "metadata": {},
+   "source": [
+    "Next, the graph structure can similarly be analyzed.\n",
+    "\n",
+    "In the following cells, the properly generated graph (using the synthesizer), a random graph, and the original graph are compared.\n",
+    "\n",
+    "The tool implements a graph analyzer, i.e. `AnalysisModule`, which provides a series of useful metrics for comparing graphs.\n",
+    "\n",
+    "First, only the graph structure is extracted, i.e. nodes and edges."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 15,
-   "id": "b8aad032",
+   "id": "67a04184",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -365,7 +521,7 @@
   {
    "cell_type": "code",
    "execution_count": 16,
-   "id": "ca24da8e",
+   "id": "ffca7954",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -384,7 +540,7 @@
   {
    "cell_type": "code",
    "execution_count": 17,
-   "id": "fb12cfcb",
+   "id": "d2e6520c",
    "metadata": {},
    "outputs": [
     {
@@ -398,6 +554,7 @@
     }
    ],
    "source": [
+    "# - print graph size\n",
     "print(f'src_dst:{len(src_dst)}')\n",
     "print(f'dst_srct:{len(dst_src)}')\n",
     "print(f'graph:{len(graph)}')"
@@ -406,7 +563,7 @@
   {
    "cell_type": "code",
    "execution_count": 18,
-   "id": "c1b20191",
+   "id": "09ec506e",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -418,11 +575,11 @@
   {
    "cell_type": "code",
    "execution_count": 19,
-   "id": "c694f1a1",
+   "id": "2a00643e",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# save graph structure to file\n",
+    "# - save graph structure to file\n",
     "np.savetxt('/workspace/data/cora_demo_proper.txt', np.array(graph_structure_proper), fmt='%i', delimiter='\\t')\n",
     "np.savetxt('/workspace/data/cora_demo_random.txt', np.array(graph_structure_random), fmt='%i', delimiter='\\t')\n",
     "np.savetxt('/workspace/data/cora_demo_orig.txt', np.array(graph_structure_orig), fmt='%i', delimiter='\\t')"
@@ -431,17 +588,26 @@
   {
    "cell_type": "code",
    "execution_count": 20,
-   "id": "6d88b840",
+   "id": "d368ba9c",
    "metadata": {},
    "outputs": [],
    "source": [
+    "# - instantiate graph analyzer\n",
     "graph_analyser = AnalysisModule()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "de328c33",
+   "metadata": {},
+   "source": [
+    "Graph objects are then instantiated using the extracted graph structures"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 21,
-   "id": "38fc6135",
+   "id": "b2a3ff7f",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -451,10 +617,19 @@
     "all_graphs = [proper_graph, random_graph, orig_graph]"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "b2e1b416",
+   "metadata": {},
+   "source": [
+    "The graphs can then be fed to various metrics, for example `get_dd_similarity_score` provides a score between 0 and 1,\n",
+    "comparing the degree distribution of a source graph and destination graph."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 22,
-   "id": "82de736f",
+   "id": "c05b60b2",
    "metadata": {},
    "outputs": [
     {
@@ -476,10 +651,18 @@
     "print(\"ORIG vs RANDOM:\", orig_random)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "64bd7bde",
+   "metadata": {},
+   "source": [
+    "`compare_graph_stats` compares the graphs across a series of statistics."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 23,
-   "id": "b689eef8",
+   "id": "e3272df1",
    "metadata": {},
    "outputs": [
     {
@@ -714,7 +897,7 @@
   {
    "cell_type": "code",
    "execution_count": 24,
-   "id": "e5b77900",
+   "id": "b6240453",
    "metadata": {},
    "outputs": [
     {
@@ -735,14 +918,6 @@
     "set_loglevel('warning')\n",
     "_ = graph_analyser.compare_graph_plots(*all_graphs);"
    ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "6e58427e",
-   "metadata": {},
-   "outputs": [],
-   "source": []
   }
  ],
  "metadata": {
@@ -764,7 +939,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.15"
+   "version": "3.8.10"
   }
  },
  "nbformat": 4,

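The CORA notebook describes `get_dd_similarity_score` as returning a score between 0 and 1 that compares the degree distributions of two graphs. The diff does not show how that score is computed, so the following is only a minimal sketch of one way such a score could be defined: one minus the total variation distance between normalized degree histograms. The function names `degree_histogram` and `dd_similarity` are hypothetical and are not part of the tool.

```python
import numpy as np


def degree_histogram(edges, num_nodes):
    """Return the normalized degree histogram of an undirected edge list."""
    edges = np.asarray(edges, dtype=np.int64)
    # Each edge contributes one to the degree of both of its endpoints.
    degrees = np.bincount(edges.ravel(), minlength=num_nodes)
    # counts[k] is the number of nodes that have degree k.
    counts = np.bincount(degrees)
    return counts / counts.sum()


def dd_similarity(edges_a, edges_b, num_nodes):
    """Hypothetical score in [0, 1]: 1 minus the total variation distance."""
    hist_a = degree_histogram(edges_a, num_nodes)
    hist_b = degree_histogram(edges_b, num_nodes)
    # Pad the shorter histogram so both cover the same degree range.
    size = max(len(hist_a), len(hist_b))
    hist_a = np.pad(hist_a, (0, size - len(hist_a)))
    hist_b = np.pad(hist_b, (0, size - len(hist_b)))
    return 1.0 - 0.5 * np.abs(hist_a - hist_b).sum()


# Toy graphs on four nodes: a path and a star.
path = [(0, 1), (1, 2), (2, 3)]
star = [(0, 1), (0, 2), (0, 3)]
print("path vs path:", dd_similarity(path, path, num_nodes=4))
print("path vs star:", dd_similarity(path, star, num_nodes=4))
```

Identical graphs score 1, and the score falls as the degree histograms diverge. The tool's own metric may weight or bin degrees differently.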
+ 245 - 46
Tools/DGLPyTorch/SyntheticGraphGeneration/demos/advanced_examples/e2e_ieee_demo.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "raw",
-   "id": "00f1c154",
+   "id": "576b3dbf",
    "metadata": {},
    "source": [
     "# Copyright 2023 NVIDIA Corporation. All Rights Reserved.\n",
@@ -23,7 +23,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "c66c5dfb",
+   "id": "39e9d423",
    "metadata": {},
    "source": [
     "# End to end bipartite graph generation demo (IEEE)"
@@ -31,12 +31,12 @@
   },
   {
    "cell_type": "markdown",
-   "id": "d2e0bc04",
+   "id": "fdc85976",
    "metadata": {},
    "source": [
     "## Overview\n",
     "\n",
-    "In this notebook, we have walked through the complete process of generating a synthetic dataset based on an IEEE dataset. The IEEE dataset includes information about e-commerce transactions, so it can be iterpret as bipartite graph (user / product) with edge features (transaction info).\n",
+    "In this notebook, we walk through the complete process of generating a synthetic dataset based on an IEEE dataset. The IEEE dataset includes information about e-commerce transactions, so it can be interpreted as a bipartite graph (user / product) with edge features (transaction info).\n",
     "\n",
     "Content:\n",
     "\n",
@@ -48,7 +48,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "dc784911",
+   "id": "70e5ff72",
    "metadata": {},
    "source": [
     "### Imports"
@@ -57,7 +57,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "8a194237",
+   "id": "b59277ba",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -92,17 +92,40 @@
   },
   {
    "cell_type": "markdown",
-   "id": "c62a9eea",
+   "id": "52555dc9",
    "metadata": {},
    "source": [
     "<a id=\"1\"></a>\n",
     "### Fit synthesizer"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "460f9a0d",
+   "metadata": {},
+   "source": [
+    "#### Instantiating the building blocks"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3677e422",
+   "metadata": {},
+   "source": [
+    "As the IEEE dataset is a graph with edge features, and we would like to generate a graph with edge features as well, the following objects are instantiated:\n",
+    "\n",
+    "- An edge feature generator to generate the edge features. In this example CTGAN is used.\n",
+    "- A graph generator to generate the graph structure. As the original graph is bipartite, the RMATBipartiteGenerator is used.\n",
+    "- An aligner to align the two.\n",
+    "\n",
+    "\n",
+    "**Note**: Alternative generators can be used as long as they implement the `fit` \\& `generate` API and consume the data dictionary (described below)."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "44081e17",
+   "id": "858155d3",
    "metadata": {},
    "outputs": [
     {
@@ -121,10 +144,28 @@
     "graph_aligner = XGBoostAligner(features_to_correlate_edge={'TransactionAmt': ColumnType.CONTINUOUS})"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "068d91ee",
+   "metadata": {},
+   "source": [
+    "#### Defining the synthesizer"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ca6c4610",
+   "metadata": {},
+   "source": [
+    "Once the set of building blocks is instantiated with the corresponding hyperparameters, we can instantiate a synthesizer which defines how these building blocks interact. \n",
+    "\n",
+    "The static graph synthesizer object can be used to generate graphs with either node or edge features."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "16b5ad0b",
+   "id": "7a179deb",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -135,10 +176,26 @@
     "                                    graph_aligner=graph_aligner)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "5b4248ae",
+   "metadata": {},
+   "source": [
+    "#### Preprocessing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "385e1bf5",
+   "metadata": {},
+   "source": [
+    "For the IEEE dataset, a preprocessing step is pre-implemented (see `/syngen/preprocessing/ieee.py`), which reads the corresponding data files to create the IEEE graph with labels converted into ordinal values."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "83c77e00",
+   "id": "791cb6db",
    "metadata": {},
    "outputs": [
     {
@@ -154,10 +211,22 @@
     "data = preprocessing.transform('/workspace/data/ieee-fraud/data.csv')"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "25b906a2",
+   "metadata": {},
+   "source": [
+    "The output of the preprocessing function is a dictionary with the following key:\n",
+    "\n",
+    "- MetaData.EDGE_DATA: data corresponding to the graph's edge information.\n",
+    "\n",
+    "Now that we have the data, the synthesizer can be fit. This step simply fits each component on the data."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "75e74395",
+   "id": "2b18e425",
    "metadata": {},
    "outputs": [
     {
@@ -218,17 +287,27 @@
   },
   {
    "cell_type": "markdown",
-   "id": "360bf7dd",
+   "id": "c09b9ac5",
    "metadata": {},
    "source": [
     "<a id=\"2\"></a>\n",
     "## Dataset generation"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "8e726c9c",
+   "metadata": {},
+   "source": [
+    "Now that we have a synthesizer fitted on a downstream dataset, we can generate a graph with characteristics similar to those of the original.\n",
+    "\n",
+    "In this example, we simply generate a graph of the same size."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "b864f571",
+   "id": "017c7ae3",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -239,10 +318,18 @@
     "num_edges_dst_src = num_edges"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "8469a953",
+   "metadata": {},
+   "source": [
+    "Calling `generate` with the desired graph size returns a dictionary with keys corresponding to edge data and node data (if the synthesizer was configured to generate these)."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "9f351504",
+   "id": "21c6a61f",
    "metadata": {},
    "outputs": [
     {
@@ -289,17 +376,32 @@
   },
   {
    "cell_type": "markdown",
-   "id": "716e1503",
+   "id": "7ab43ba7",
    "metadata": {},
    "source": [
     "<a id=\"3\"></a>\n",
     "## Tabular Data Evaluation"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "d35d5a34",
+   "metadata": {},
+   "source": [
+    "Now that we have generated the data, we may be interested in assessing the quality of the generated graph.\n",
+    "\n",
+    "The tool provides a set of analyzers which can be used to analyze\n",
+    "- tabular features\n",
+    "- graph structure\n",
+    "- both\n",
+    "\n",
+    "Below, a series of examples compares the original feature distribution with those produced by various feature generators."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "3feef581",
+   "id": "f8c47316",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -307,10 +409,18 @@
     "cols_to_drop = set(['user_id', 'product_id'])"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "0ad07177",
+   "metadata": {},
+   "source": [
+    "**Note**: the `id` columns are dropped as these simply correspond to the node ids."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "178834df",
+   "id": "8edbe264",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -321,17 +431,25 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "b1c4c8b2",
+   "id": "1f0cef64",
    "metadata": {},
    "outputs": [],
    "source": [
     "synthetic = tabular_generator.sample(len(real))"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "2a547326",
+   "metadata": {},
+   "source": [
+    "The `real` and `synthetic` data to be compared are then fed to the `TabularMetrics` object, which can be used to visually compare the data, or provide a series of metrics comparing feature distributions and their correlations."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "2673ba75",
+   "id": "f4ce3f1d",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -345,7 +463,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "524aef2c",
+   "id": "24cbeecb",
    "metadata": {},
    "outputs": [
     {
@@ -401,16 +519,24 @@
   },
   {
    "cell_type": "markdown",
-   "id": "1ed80808",
+   "id": "8959f465",
    "metadata": {},
    "source": [
     "### Random Tabular Data"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "741292a9",
+   "metadata": {},
+   "source": [
+    "In the cell below a comparison is done using a uniform random generator."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 13,
-   "id": "7d489be0",
+   "id": "ae74f402",
    "metadata": {},
    "outputs": [
     {
@@ -469,16 +595,24 @@
   },
   {
    "cell_type": "markdown",
-   "id": "34e99313",
+   "id": "a157f4c9",
    "metadata": {},
    "source": [
     "### Random Multivariate"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "a8ede4ca",
+   "metadata": {},
+   "source": [
+    "In the cell below a comparison is done using a multivariate random generator."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 14,
-   "id": "82eee13d",
+   "id": "6eadec77",
    "metadata": {},
    "outputs": [
     {
@@ -537,17 +671,31 @@
   },
   {
    "cell_type": "markdown",
-   "id": "35d063d3",
+   "id": "79e9edc0",
    "metadata": {},
    "source": [
     "<a id=\"4\"></a>\n",
     "## Structure evaluation"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "51cb5872",
+   "metadata": {},
+   "source": [
+    "Next, the graph structure can similarly be analyzed.\n",
+    "\n",
+    "In the following cells, the properly generated graph (using the synthesizer), a random graph, and the original graph are compared.\n",
+    "\n",
+    "The tool implements a graph analyzer, i.e. `AnalysisModule`, which provides a series of useful metrics for comparing graphs.\n",
+    "\n",
+    "First, only the graph structure is extracted, i.e. nodes and edges."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 15,
-   "id": "a4bb63d6",
+   "id": "f891475c",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -561,7 +709,7 @@
   {
    "cell_type": "code",
    "execution_count": 16,
-   "id": "9b40ee17",
+   "id": "40332eae",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -580,7 +728,7 @@
   {
    "cell_type": "code",
    "execution_count": 17,
-   "id": "5809bf7a",
+   "id": "18582388",
    "metadata": {},
    "outputs": [
     {
@@ -602,7 +750,7 @@
   {
    "cell_type": "code",
    "execution_count": 18,
-   "id": "9a303c30",
+   "id": "f8b38eb4",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -614,7 +762,7 @@
   {
    "cell_type": "code",
    "execution_count": 19,
-   "id": "3d9ea211",
+   "id": "60b28bdd",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -627,17 +775,25 @@
   {
    "cell_type": "code",
    "execution_count": 20,
-   "id": "f7e6008d",
+   "id": "e767c65f",
    "metadata": {},
    "outputs": [],
    "source": [
     "graph_analyser = AnalysisModule()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "21d8c406",
+   "metadata": {},
+   "source": [
+    "Graph objects are then instantiated using the extracted graph structures"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 21,
-   "id": "b7e5c7e1",
+   "id": "07c04e12",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -647,10 +803,19 @@
     "all_graphs = [proper_graph, random_graph, orig_graph]"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "f082c939",
+   "metadata": {},
+   "source": [
+    "The graphs can then be fed to various metrics, for example `get_dd_similarity_score` provides a score between 0 and 1,\n",
+    "comparing the degree distribution of a source graph and destination graph."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 22,
-   "id": "aed14815",
+   "id": "ceafd9ac",
    "metadata": {},
    "outputs": [
     {
@@ -672,10 +837,18 @@
     "print(\"ORIG vs RANDOM:\", orig_random)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "8423f983",
+   "metadata": {},
+   "source": [
+    "`compare_graph_stats` compares the graphs across a series of statistics."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 23,
-   "id": "d3869525",
+   "id": "50d4ecf8",
    "metadata": {},
    "outputs": [
     {
@@ -889,7 +1062,7 @@
   {
    "cell_type": "code",
    "execution_count": 24,
-   "id": "4f9b0eae",
+   "id": "aad8b339",
    "metadata": {},
    "outputs": [
     {
@@ -917,7 +1090,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "27e11a72",
+   "id": "65f39b1b",
    "metadata": {},
    "source": [
     "# Structure + Feature Distribution"
@@ -925,16 +1098,34 @@
   },
   {
    "cell_type": "markdown",
-   "id": "076d6177",
+   "id": "d4031c91",
+   "metadata": {},
+   "source": [
+    "The final graph with its associated features can be compared using a heat map visualization of the associated degree distribution and feature distribution.\n",
+    "\n",
+    "In the cells below, the real graph is compared with graphs generated using random generators versus the properly fitted generators."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d56446f1",
    "metadata": {},
    "source": [
     "### Real Data"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "c61326c5",
+   "metadata": {},
+   "source": [
+    "`plot_node_degree_centrality_feat_dist` consumes a graph and plots the degree distribution-feature distribution for a particular column of interest."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 25,
-   "id": "cb17c6a2",
+   "id": "d1fa05e3",
    "metadata": {},
    "outputs": [
     {
@@ -965,7 +1156,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "5d9a1310",
+   "id": "0b77586a",
    "metadata": {},
    "source": [
     "### Random"
@@ -974,7 +1165,7 @@
   {
    "cell_type": "code",
    "execution_count": 26,
-   "id": "03e58f56",
+   "id": "0d6144e7",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -986,7 +1177,7 @@
   {
    "cell_type": "code",
    "execution_count": 27,
-   "id": "7891cba1",
+   "id": "e3f41114",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -997,7 +1188,7 @@
   {
    "cell_type": "code",
    "execution_count": 28,
-   "id": "b671f226",
+   "id": "17251b68",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -1012,7 +1203,7 @@
   {
    "cell_type": "code",
    "execution_count": 29,
-   "id": "cabd5761",
+   "id": "080400b1",
    "metadata": {},
    "outputs": [
     {
@@ -1035,16 +1226,24 @@
   },
   {
    "cell_type": "markdown",
-   "id": "dc463132",
+   "id": "d19976bd",
    "metadata": {},
    "source": [
     "### Properly Generated"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "f7ec2fce",
+   "metadata": {},
+   "source": [
+    "As depicted below, the properly generated graph most closely resembles the 2D heatmap of the original graph."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 30,
-   "id": "8c86b048",
+   "id": "970cdbff",
    "metadata": {},
    "outputs": [
     {
@@ -1085,7 +1284,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.15"
+   "version": "3.8.10"
   }
  },
  "nbformat": 4,

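The "Structure + Feature Distribution" cells in the IEEE notebook compare graphs through a heat map of degree distribution against feature distribution for one column of interest. The plotting function itself is not shown in this diff, so the sketch below only illustrates the underlying idea with `numpy.histogram2d`: bin every edge by the degree of its source node and by the value of one edge feature. The helper `degree_feature_histogram` and the toy data are assumptions made for illustration, not the tool's implementation.

```python
import numpy as np


def degree_feature_histogram(src, feature, bins=4):
    """Hypothetical helper: 2D histogram of source-node degree vs. an edge feature."""
    src = np.asarray(src, dtype=np.int64)
    feature = np.asarray(feature, dtype=float)
    # Degree of every source node, then looked up once per edge.
    degree_per_node = np.bincount(src)
    degree_per_edge = degree_per_node[src]
    hist, degree_edges, feature_edges = np.histogram2d(
        degree_per_edge, feature, bins=bins
    )
    # Normalize so the cells sum to one and graphs of different size are comparable.
    return hist / hist.sum(), degree_edges, feature_edges


rng = np.random.default_rng(0)
src = rng.integers(0, 50, size=1000)        # toy source-node ids, one per edge
amount = rng.exponential(100.0, size=1000)  # toy stand-in for a transaction amount
heat, _, _ = degree_feature_histogram(src, amount)
print(heat.shape)
```

Computing this array for the original, random, and properly generated graphs with shared bin edges would make the three heat maps directly comparable. The sketch uses per-call bins for brevity.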
+ 96 - 30
Tools/DGLPyTorch/SyntheticGraphGeneration/demos/advanced_examples/edge_classification_pretraining.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "raw",
-   "id": "49d2aaf9",
+   "id": "ea7ceaf7",
    "metadata": {},
    "source": [
     "# Copyright 2023 NVIDIA Corporation. All Rights Reserved.\n",
@@ -23,7 +23,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "fc0ea279",
+   "id": "4e1a2027",
    "metadata": {},
    "source": [
     "# Edge Classification Pretraining demo (IEEE)"
@@ -31,16 +31,19 @@
   },
   {
    "cell_type": "markdown",
-   "id": "5a5196e2",
+   "id": "05d3d798",
    "metadata": {},
    "source": [
     "## Overview\n",
+    "\n",
+    "It is often helpful to pre-train a network, or initialize it with learned weights, and then fine-tune it on a downstream task of interest.\n",
+    "\n",
     "This notebook demonstrates the steps for pretraing a GNN on synthetic data and finetuning on real data. "
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "01a6300c",
+   "id": "26f39e76",
    "metadata": {},
    "source": [
     "### Imports"
@@ -49,7 +52,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "69a28c16",
+   "id": "f315cfcd",
    "metadata": {},
    "outputs": [
     {
@@ -94,16 +97,28 @@
   },
   {
    "cell_type": "markdown",
-   "id": "01eddd70",
+   "id": "20e3e3a6",
    "metadata": {},
    "source": [
     "### Generate synthetic data"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "5c3db76c",
+   "metadata": {},
+   "source": [
+    "In the following cells, a synthesizer is instantiated and fitted on the IEEE dataset.\n",
+    "\n",
+    "Once fitted, the synthesizer is used to generate synthetic data with similar characteristics.\n",
+    "\n",
+    "For a more detailed explanation, check out `e2e_ieee_demo.ipynb`."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "65da8b0a",
+   "id": "8f86bf18",
    "metadata": {},
    "outputs": [
     {
@@ -131,7 +146,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "b0b64872",
+   "id": "60bb8cfb",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -145,7 +160,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "ac0d50f7",
+   "id": "37d4eb69",
    "metadata": {},
    "outputs": [
     {
@@ -164,7 +179,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "84732600",
+   "id": "873d0cf2",
    "metadata": {},
    "outputs": [
     {
@@ -204,7 +219,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "b615610c",
+   "id": "b08f1603",
    "metadata": {},
    "outputs": [
     {
@@ -251,7 +266,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "66f7a839",
+   "id": "03e21408",
    "metadata": {},
    "source": [
     "### Train GNN"
@@ -259,7 +274,22 @@
   },
   {
    "cell_type": "markdown",
-   "id": "07805108",
+   "id": "a834318e",
+   "metadata": {},
+   "source": [
+    "To train an example GNN we need the following:\n",
+    "\n",
+    "- a dataset object instantiated using either the synthetic or original data\n",
+    "- the model, optimizer and hyperparameters defined\n",
+    "\n",
+    "In the tool an example dataloader is implemented for edge classification under `syngen/benchmark/data_loader`.\n",
+    "\n",
+    "This dataset object is used to create the DGL graphs corresponding to both the generated data and the real data."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "28fabfa9",
    "metadata": {},
    "source": [
     "#### Create datasets"
@@ -268,7 +298,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "0fe941f0",
+   "id": "f7e8bd44",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -279,16 +309,24 @@
   },
   {
    "cell_type": "markdown",
-   "id": "a8b23137",
+   "id": "b830709c",
    "metadata": {},
    "source": [
     "#### Create helper function\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "b959a3a2",
+   "metadata": {},
+   "source": [
+    "The helper function defines a simple training loop and standard metrics for edge classification."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "f46973e3",
+   "id": "5c4bec86",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -329,16 +367,24 @@
   },
   {
    "cell_type": "markdown",
-   "id": "6ad092e6",
+   "id": "dc4cea06",
    "metadata": {},
    "source": [
     "#### No-Pretrain"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "093203f8",
+   "metadata": {},
+   "source": [
+    "Without pre-training the model is trained from scratch using the original data graph."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "d4ad039a",
+   "id": "93ab387d",
    "metadata": {},
    "outputs": [
     {
@@ -383,16 +429,26 @@
   },
   {
    "cell_type": "markdown",
-   "id": "7f061442",
+   "id": "08f5280a",
    "metadata": {},
    "source": [
     "#### Pretrain"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "18bebba4",
+   "metadata": {},
+   "source": [
+    "In this example the model is first trained on the generated data for a certain epoch budget.\n",
+    "\n",
+    "Subsequently it is further trained on the original data graph."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "2f3985b2",
+   "id": "e21ab679",
    "metadata": {},
    "outputs": [
     {
@@ -438,7 +494,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "f33bec4f",
+   "id": "8b615c76",
    "metadata": {},
    "outputs": [
     {
@@ -458,7 +514,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "a6f0cfbe",
+   "id": "69b9e95c",
    "metadata": {},
    "source": [
     "### CLI example"
@@ -466,7 +522,17 @@
   },
   {
    "cell_type": "markdown",
-   "id": "2c48ec37",
+   "id": "93fd05a0",
+   "metadata": {},
+   "source": [
+    "The tool also provides this functionality through its CLI.\n",
+    "\n",
+    "The commands that generate the synthetic data and then pretrain and fine-tune on the downstream task, as done above, are provided below."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8de441fe",
    "metadata": {},
    "source": [
     "#### Generate synthetic graph"
@@ -475,7 +541,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "b588c44a",
+   "id": "af89d214",
    "metadata": {},
    "outputs": [
     {
@@ -553,7 +619,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "01eeff23",
+   "id": "7fef4fb7",
    "metadata": {},
    "source": [
     "#### Results without pretraining"
@@ -562,7 +628,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "50238488",
+   "id": "c65ab4be",
    "metadata": {},
    "outputs": [
     {
@@ -607,7 +673,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "1a8474cb",
+   "id": "e6655f58",
    "metadata": {},
    "source": [
     "#### Pretrain and finetune"
@@ -616,7 +682,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "92039366",
+   "id": "fd2b8caf",
    "metadata": {},
    "outputs": [
     {
@@ -668,7 +734,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "f0405bf2",
+   "id": "2da530b6",
    "metadata": {},
    "outputs": [],
    "source": []
@@ -693,7 +759,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.15"
+   "version": "3.8.10"
   }
  },
  "nbformat": 4,

+ 54 - 24
Tools/DGLPyTorch/SyntheticGraphGeneration/demos/advanced_examples/frechet_lastfm_demo.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "raw",
-   "id": "cdba51b0",
+   "id": "19f38122",
    "metadata": {},
    "source": [
     "# Copyright 2023 NVIDIA Corporation. All Rights Reserved.\n",
@@ -23,7 +23,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "28128ce4",
+   "id": "7e5b0ae2",
    "metadata": {},
    "source": [
     "## Scaling non-bipartite graph demo (lastfm)"
@@ -31,17 +31,17 @@
   },
   {
    "cell_type": "markdown",
-   "id": "f15262a4",
+   "id": "2358eb8e",
    "metadata": {},
    "source": [
     "### Overview\n",
     "\n",
-    "This notebooks demonstates the scaling capabilities of SynGen non-bipartite generators."
+    "This notebook demonstrates the scaling capabilities of syngen non-bipartite generators."
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "e55bea44",
+   "id": "1c11a911",
    "metadata": {},
    "source": [
     "#### Imports"
@@ -50,7 +50,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "abde3728",
+   "id": "315be8e0",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -76,16 +76,26 @@
   },
   {
    "cell_type": "markdown",
-   "id": "6c4ca1a2",
+   "id": "3352d6ef",
    "metadata": {},
    "source": [
     "#### Prepare synthesizers"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "59bdd307",
+   "metadata": {},
+   "source": [
+    "The synthesizer is prepared simply using the graph generator, which will generate the graph structure.\n",
+    "\n",
+    "In this case no features will be generated."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "9dc7c323",
+   "id": "389cf122",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -96,7 +106,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "a0d6675c",
+   "id": "e2af2e5b",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -110,16 +120,26 @@
   },
   {
    "cell_type": "markdown",
-   "id": "311dc5aa",
+   "id": "b127cc7a",
    "metadata": {},
    "source": [
     "#### Load data"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "6000c677",
+   "metadata": {},
+   "source": [
+    "The original dataset is loaded.\n",
+    "\n",
+    "**Note**: to obtain the datasets run the `/scripts/get_datasets.sh` script as described in the `README.md`"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "740d92ee",
+   "id": "bef09719",
    "metadata": {},
    "outputs": [
     {
@@ -140,7 +160,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "91ce5732",
+   "id": "fb90b9a6",
    "metadata": {},
    "source": [
     "#### Fit synthesizer"
@@ -149,7 +169,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "87091346",
+   "id": "9780781c",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -159,7 +179,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "d6af0893",
+   "id": "438630a2",
    "metadata": {},
    "outputs": [
     {
@@ -182,16 +202,18 @@
   },
   {
    "cell_type": "markdown",
-   "id": "1ee0040e",
+   "id": "d6b33361",
    "metadata": {},
    "source": [
-    "#### Generate graphs"
+    "#### Generate and compare graphs\n",
+    "\n",
+    "To check the generator's scaling capabilities, we compare randomly and properly generated graphs with the original graph. There is no trivial way to scale the original graph, so we use the Normalized Frechet Score. It takes all three graphs, normalizes their degree distribution curves using a moving average, smooths them using a log transformation, computes the [Frechet Distance](https://en.wikipedia.org/wiki/Fr%C3%A9chet_inception_distance) between the original graph and each generated graph, and compares these distances.\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "a2166a66",
+   "id": "807a7cfa",
    "metadata": {},
    "outputs": [
     {
@@ -305,7 +327,7 @@
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "e92248f5",
+   "id": "2cacbfe6",
    "metadata": {},
    "outputs": [
     {
@@ -345,16 +367,24 @@
   },
   {
    "cell_type": "markdown",
-   "id": "26d0f190",
+   "id": "8ff3cafc",
    "metadata": {},
    "source": [
     "#### Show results"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "6dde7adc",
+   "metadata": {},
+   "source": [
+    "The next couple of cells compute the Frechet distance between the original graph and the synthetic graphs generated using the fitted parameters, as well as the Erdos-Renyi graphs."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "aa968937",
+   "id": "d36e3da8",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -379,7 +409,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "a1aca3d0",
+   "id": "64e2a773",
    "metadata": {},
    "outputs": [
     {
@@ -412,7 +442,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "ad6f7e80",
+   "id": "0b8732df",
    "metadata": {},
    "outputs": [
     {
@@ -449,7 +479,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "f9a3fb35",
+   "id": "5a258342",
    "metadata": {},
    "outputs": [],
    "source": []
@@ -471,7 +501,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.15"
+   "version": "3.8.10"
   }
  },
  "nbformat": 4,

+ 65 - 23
Tools/DGLPyTorch/SyntheticGraphGeneration/demos/advanced_examples/frechet_tabformer_demo.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "raw",
-   "id": "a4f97b46",
+   "id": "33237397",
    "metadata": {},
    "source": [
     "# Copyright 2023 NVIDIA Corporation. All Rights Reserved.\n",
@@ -23,7 +23,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "6ddabe6f",
+   "id": "ec42487c",
    "metadata": {},
    "source": [
     "## Scaling bipartite graph demo (tabformer)"
@@ -31,7 +31,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "eb5fb8ae",
+   "id": "131c7e13",
    "metadata": {},
    "source": [
     "### Overview\n",
@@ -41,7 +41,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "caf5d448",
+   "id": "fa1d39b6",
    "metadata": {},
    "source": [
     "#### Imports"
@@ -50,7 +50,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "89643962",
+   "id": "662b3b65",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -75,16 +75,26 @@
   },
   {
    "cell_type": "markdown",
-   "id": "26abb5cd",
+   "id": "a5fe1bb3",
    "metadata": {},
    "source": [
     "#### Prepare synthesizers\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "7c28f45e",
+   "metadata": {},
+   "source": [
+    "The synthesizer is prepared simply using the graph generator.\n",
+    "\n",
+    "In this case no edge or node features will be generated."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "dc0b8b3b",
+   "id": "2e4149d0",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -95,7 +105,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "d3141a29",
+   "id": "269c0bbb",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -109,16 +119,26 @@
   },
   {
    "cell_type": "markdown",
-   "id": "aac62e12",
+   "id": "ab7bcfc9",
    "metadata": {},
    "source": [
     "### Load data"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "7eac4c0c",
+   "metadata": {},
+   "source": [
+    "The original dataset is loaded.\n",
+    "\n",
+    "**Note**: to obtain the datasets run the `/scripts/get_datasets.sh` script as described in the `README.md`"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "e4d165a6",
+   "id": "a5a1808a",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -133,7 +153,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "5dbbdbbf",
+   "id": "28a7834e",
    "metadata": {},
    "source": [
     "#### Fit graph"
@@ -142,7 +162,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "165fcdf1",
+   "id": "7c1d47f9",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -152,7 +172,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "d3ccf3e2",
+   "id": "e69a90a4",
    "metadata": {},
    "outputs": [
     {
@@ -174,10 +194,22 @@
     "synthesizer.graph_generator.get_fit_results()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "7a0d1a0e",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "#### Generate and compare graphs\n",
+    "\n",
+    "To check the generator's scaling capabilities, we compare randomly and properly generated graphs with the original graph. There is no trivial way to scale the original graph, so we use the Normalized Frechet Score. It takes all three graphs, normalizes their degree distribution curves using a moving average, smooths them using a log transformation, computes the [Frechet Distance](https://en.wikipedia.org/wiki/Fr%C3%A9chet_inception_distance) between the original graph and each generated graph, and compares these distances."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "11892ec1",
+   "id": "55a39d02",
    "metadata": {},
    "outputs": [
     {
@@ -277,7 +309,7 @@
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "4ca3c2e7",
+   "id": "90d8f977",
    "metadata": {},
    "outputs": [
     {
@@ -347,16 +379,26 @@
   },
   {
    "cell_type": "markdown",
-   "id": "10c42dee",
+   "id": "1113728f",
    "metadata": {},
    "source": [
     "#### Show results"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "ba04be6d",
+   "metadata": {},
+   "source": [
+    "The original dataset is loaded.\n",
+    "\n",
+    "**Note**: to obtain the datasets run the `/scripts/get_datasets.sh` script as described in the `README.md`"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "7a2dcf5c",
+   "id": "b23c4866",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -399,7 +441,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "ee64cac0",
+   "id": "d42fb841",
    "metadata": {},
    "outputs": [
     {
@@ -441,7 +483,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "d869107c",
+   "id": "675d057f",
    "metadata": {},
    "outputs": [
     {
@@ -474,7 +516,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "d614f9f1",
+   "id": "0d9c1ac7",
    "metadata": {},
    "outputs": [
     {
@@ -507,7 +549,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "4e5a93ad",
+   "id": "2120c9e3",
    "metadata": {},
    "outputs": [],
    "source": []
@@ -515,7 +557,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "9120e5a7",
+   "id": "e1cd5805",
    "metadata": {},
    "outputs": [],
    "source": []
@@ -537,7 +579,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.15"
+   "version": "3.8.10"
   }
  },
  "nbformat": 4,

+ 46 - 30
Tools/DGLPyTorch/SyntheticGraphGeneration/demos/basic_examples/er_demo.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "raw",
-   "id": "91851fca",
+   "id": "abe6a6ef",
    "metadata": {},
    "source": [
     "# Copyright 2023 NVIDIA Corporation. All Rights Reserved.\n",
@@ -23,7 +23,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "b348ed6c",
+   "id": "2cafca8e",
    "metadata": {},
    "source": [
     "# Graph generation demo"
@@ -31,12 +31,13 @@
   },
   {
    "cell_type": "markdown",
-   "id": "d50b8105",
+   "id": "b360efc8",
    "metadata": {},
    "source": [
     "### Overview\n",
     "\n",
-    "This notebook demonstrates examples of generating different types of random graphs with their further analysis using SynGen tool. You may find the following graphs:\n",
+    "This notebook demonstrates examples of generating random graphs and their analysis using the syngen tool. \n",
+    "We utilize [Erdos-Renyi](https://en.wikipedia.org/wiki/Erd%C5%91s%E2%80%93R%C3%A9nyi_model) model-based generators for non-bipartite and bipartite graphs. Along with this, we will consider directed and undirected graphs, so we have the following options:\n",
     "\n",
     "1. [Directed nonbipartite graphs](#1)\n",
     "1. [Undirected nonbipartite graphs](#2)\n",
@@ -48,7 +49,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "0b39b4f4",
+   "id": "edf205a6",
    "metadata": {},
    "source": [
     "### Imports"
@@ -57,15 +58,18 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "9cd73740",
+   "id": "b4c4bf6c",
    "metadata": {},
    "outputs": [],
    "source": [
+    "# utils\n",
     "import math\n",
     "import numpy as np\n",
     "\n",
+    "# generators\n",
     "from syngen.generator.graph import RandomGraph, RandomBipartite\n",
     "\n",
+    "# graph statistics \n",
     "from syngen.analyzer.graph import Graph\n",
     "from syngen.analyzer.graph.analyser import AnalysisModule\n",
     "from syngen.analyzer.graph.stats import get_dd_simmilarity_score"
@@ -73,16 +77,20 @@
   },
   {
    "cell_type": "markdown",
-   "id": "23915683",
+   "id": "77a49f09",
    "metadata": {},
    "source": [
-    "### Helper functions"
+    "### Helper functions\n",
+    "\n",
+    "The `generate_graphs` function handles all the graph generation cases presented in this notebook. It picks between the `RandomBipartite` and `RandomGraph` generators, fits the chosen one with all the necessary information (whether the graph is directed or not), and generates `n` graphs with a given number of nodes and edges. \n",
+    "\n",
+    "Note: In the bipartite scenario, the generator requires the number of nodes in both parts."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "e0bb0a3b",
+   "id": "e55eaf74",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -109,10 +117,18 @@
     "    return graphs"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "f0dd08c3-6129-4b9b-aca9-b132e2b0a13e",
+   "metadata": {},
+   "source": [
+    "We use the [SNAP](https://snap.stanford.edu/) library for graph analysis, so we need to convert the generated graphs to the appropriate format."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "6b066f4c",
+   "id": "934a221e",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -128,7 +144,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "5a24b2ec",
+   "id": "961da913",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -141,7 +157,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "fec274cd",
+   "id": "d2b3d974",
    "metadata": {},
    "source": [
     "## Graph Generation"
@@ -149,7 +165,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "df1ced92",
+   "id": "c14df222",
    "metadata": {},
    "source": [
     "<a id=\"1\"></a>\n",
@@ -159,7 +175,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "3c345225",
+   "id": "140a3492",
    "metadata": {},
    "outputs": [
     {
@@ -440,7 +456,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "e14ca828",
+   "id": "cbb029ec",
    "metadata": {},
    "source": [
     "### Directed nonbipartite graphs with noise"
@@ -449,7 +465,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "fce2cae5",
+   "id": "fb23e1ca",
    "metadata": {},
    "outputs": [
     {
@@ -753,7 +769,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "60dbb7fc",
+   "id": "017e9d00",
    "metadata": {},
    "source": [
     "<a id=\"2\"></a>\n",
@@ -763,7 +779,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "a01bb067",
+   "id": "d48f65c3",
    "metadata": {},
    "outputs": [
     {
@@ -1034,7 +1050,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "9e8fdcd1",
+   "id": "cbcbbd88",
    "metadata": {},
    "source": [
     "### Undirected nonbipartite graphs with noise"
@@ -1043,7 +1059,7 @@
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "8ad10f13",
+   "id": "f23504f1",
    "metadata": {},
    "outputs": [
     {
@@ -1314,7 +1330,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "2298292a",
+   "id": "392f226a",
    "metadata": {},
    "source": [
     "<a id=\"3\"></a>\n",
@@ -1324,7 +1340,7 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "22b0b4b5",
+   "id": "f387e078",
    "metadata": {},
    "outputs": [
     {
@@ -1605,7 +1621,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "3b7d0ee5",
+   "id": "24dad00a",
    "metadata": {},
    "source": [
     "### Directed bipartite graphs with noise"
@@ -1614,7 +1630,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "df9b8d1b",
+   "id": "f0ac1262",
    "metadata": {},
    "outputs": [
     {
@@ -1895,7 +1911,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "7c9cbeb0",
+   "id": "17c51d59",
    "metadata": {},
    "source": [
     "<a id=\"4\"></a>\n",
@@ -1905,7 +1921,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "62f5aa71",
+   "id": "a4129268",
    "metadata": {},
    "outputs": [
     {
@@ -2156,7 +2172,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ac3ad808",
+   "id": "6e2a0ced",
    "metadata": {},
    "source": [
     "### Undirected bipartite graphs with noise"
@@ -2165,7 +2181,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "b44f8600",
+   "id": "fa5715ba",
    "metadata": {},
    "outputs": [
     {
@@ -2437,7 +2453,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "67cdfd33",
+   "id": "f60a23be",
    "metadata": {},
    "outputs": [],
    "source": []
@@ -2459,7 +2475,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.15"
+   "version": "3.9.13"
   }
  },
  "nbformat": 4,

+ 67 - 26
Tools/DGLPyTorch/SyntheticGraphGeneration/demos/basic_examples/ieee_demo.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "raw",
-   "id": "4763f2d2",
+   "id": "bc53e4f2",
    "metadata": {},
    "source": [
     "# Copyright 2023 NVIDIA Corporation. All Rights Reserved.\n",
@@ -23,7 +23,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "92a13c36",
+   "id": "2b4e6b4a",
    "metadata": {},
    "source": [
     "# Basic bipartite graph generation demo (IEEE)"
@@ -31,12 +31,12 @@
   },
   {
    "cell_type": "markdown",
-   "id": "d11612f1",
+   "id": "9e800257",
    "metadata": {},
    "source": [
     "### Overview\n",
     "\n",
-    "This notebook demonstrates an example of mimicking a bipartite graph structure by providing an edge list of the target graph directly to SynGen generator. Our target graph will be the one from [IEEE](https://www.kaggle.com/c/ieee-fraud-detection) dataset. \n",
+    "This notebook demonstrates an example of mimicking a bipartite graph structure by providing an edge list of the target graph directly to the graph generator, which is responsible for generating the structure. The target graph will be the one from the [IEEE](https://www.kaggle.com/c/ieee-fraud-detection) dataset. \n",
     "\n",
     "### Content\n",
     "\n",
@@ -48,7 +48,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "7522acbd",
+   "id": "ea1ec967",
    "metadata": {},
    "source": [
     "### Imports"
@@ -57,7 +57,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "4b858098",
+   "id": "45af8524",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -74,19 +74,27 @@
   },
   {
    "cell_type": "markdown",
-   "id": "4e8fcd1a",
+   "id": "ac151980",
    "metadata": {},
    "source": [
     "<a id=\"1\"></a>\n",
     "### Load graph\n",
     "\n",
-    "[IEEE](https://www.kaggle.com/c/ieee-fraud-detection) dataset is stored as a csv file; however, we need only a graph structure, which could be extracted from `user_id` and `product_id` columns."
+    "The [IEEE](https://www.kaggle.com/c/ieee-fraud-detection) dataset is stored as a CSV file; however, we only require the corresponding graph structure, which can be extracted from the `user_id` and `product_id` columns."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e058fb42",
+   "metadata": {},
+   "source": [
+    "**Note**: to obtain the datasets run the `/scripts/get_datasets.sh` script as described in the `README.md`"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "04089b74",
+   "id": "57d41238",
    "metadata": {},
    "outputs": [
     {
@@ -124,17 +132,25 @@
   },
   {
    "cell_type": "markdown",
-   "id": "7bc5ce9f",
+   "id": "8b36bb64",
    "metadata": {},
    "source": [
     "<a id=\"2\"></a>\n",
     "### Fit generator to graph"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "40682480",
+   "metadata": {},
+   "source": [
+    "Now that we have loaded the original dataset and extracted the graph structure, an `RMATBipartiteGenerator` is instantiated and fitted on this graph."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "369ae541",
+   "id": "a19a04e3",
    "metadata": {},
    "outputs": [
     {
@@ -154,7 +170,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "bbf4ce52",
+   "id": "7c818ca0",
    "metadata": {},
    "outputs": [
     {
@@ -171,17 +187,26 @@
   },
   {
    "cell_type": "markdown",
-   "id": "cc98f4d6",
+   "id": "3c42be3d",
    "metadata": {},
    "source": [
     "<a id=\"3\"></a>\n",
     "### Generate graphs"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "d3e151cb",
+   "metadata": {},
+   "source": [
+    "The fitted generator can be used to generate graphs of arbitrary size.\n",
+    "A few examples are provided below."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "4c35cf46",
+   "id": "fc9c684d",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -196,7 +221,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "36e34ef5",
+   "id": "563b9275",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -211,7 +236,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "753986c4",
+   "id": "87ab6a29",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -227,7 +252,7 @@
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "460f9c0c",
+   "id": "2ac4b3a1",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -241,16 +266,24 @@
   },
   {
    "cell_type": "markdown",
-   "id": "db0ee959",
+   "id": "56c85932",
    "metadata": {},
    "source": [
     "### Convert graph to SNAP"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "01230096",
+   "metadata": {},
+   "source": [
+    "Next, the original graph is converted to the [SNAP](https://snap.stanford.edu/) format, so that the graph analyzer can compare the generated graphs with the original."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "798113a1",
+   "id": "ce014250",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -264,7 +297,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "0acbffab",
+   "id": "7d4b705c",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -279,7 +312,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "75885b69",
+   "id": "c0485b3d",
    "metadata": {},
    "source": [
     "<a id=\"4\"></a>\n",
@@ -289,7 +322,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "e760aaba",
+   "id": "02457764",
    "metadata": {},
    "outputs": [
     {
@@ -314,7 +347,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "5ff19f73",
+   "id": "c49e37af",
    "metadata": {},
    "outputs": [
     {
@@ -585,10 +618,18 @@
     "df"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "2b5944c0",
+   "metadata": {},
+   "source": [
+    "The degree distribution comparison and hop plots can also be visualized by calling the `compare_graph_plots` function on the analyzer."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 13,
-   "id": "0000aa29",
+   "id": "aedf53b7",
    "metadata": {},
    "outputs": [
     {
@@ -611,7 +652,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "2c327c6e",
+   "id": "904b97e0",
    "metadata": {},
    "outputs": [],
    "source": []
@@ -636,7 +677,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.15"
+   "version": "3.8.10"
   }
  },
  "nbformat": 4,

+ 46 - 29
Tools/DGLPyTorch/SyntheticGraphGeneration/demos/basic_examples/lastfm_demo.ipynb

@@ -3,7 +3,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "527a872a",
+   "id": "87265316",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -25,7 +25,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "48b6c28b",
+   "id": "fd00dbdb",
    "metadata": {},
    "source": [
     "# Basic graph generation demo (lastfm)"
@@ -33,10 +33,10 @@
   },
   {
    "cell_type": "markdown",
-   "id": "e4991072",
+   "id": "f988ea71",
    "metadata": {},
    "source": [
-    "This notebook demonstrates an example of mimicking a no-partite graph structure by providing an edge list of the target graph directly to SynGen generator. Our target graph will be the one from [lastfm]() dataset. \n",
+    "This notebook demonstrates an example of mimicking a non-bipartite graph structure by providing an edge list of the target graph directly to the graph generator, which is responsible for generating the structure. In this example the target graph will be the one from the [lastfm](https://snap.stanford.edu/data/feather-lastfm-social.html) dataset. \n",
     "\n",
     "### Content\n",
     "\n",
@@ -49,7 +49,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "ca0e5eb6",
+   "id": "79a868a8",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -64,17 +64,19 @@
   },
   {
    "cell_type": "markdown",
-   "id": "95057ba0",
+   "id": "480775d4",
    "metadata": {},
    "source": [
     "<a id=\"1\"></a>\n",
-    "### Load graph"
+    "### Load graph\n",
+    "\n",
+    "We are interested only in the structural part of the [lastfm](https://snap.stanford.edu/data/feather-lastfm-social.html) dataset, so we load only the edges."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "ae3a821a",
+   "id": "8b27d322",
    "metadata": {},
    "outputs": [
     {
@@ -100,17 +102,19 @@
   },
   {
    "cell_type": "markdown",
-   "id": "d9b56403",
+   "id": "fa3d1624",
    "metadata": {},
    "source": [
     "<a id=\"2\"></a>\n",
-    "### Fit generator to graph"
+    "### Fit generator to graph\n",
+    "\n",
+    "Now that the graph structure is loaded, we instantiate `RMATGenerator`, one of the implemented graph generators, and fit it to this graph."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "ba66b20e",
+   "id": "ee718a26",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -121,7 +125,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "4afd9cc1",
+   "id": "08a89163",
    "metadata": {},
    "outputs": [
     {
@@ -138,17 +142,20 @@
   },
   {
    "cell_type": "markdown",
-   "id": "5d2747dd",
+   "id": "8f634f99",
    "metadata": {},
    "source": [
     "<a id=\"3\"></a>\n",
-    "### Generate graphs"
+    "### Generate graphs\n",
+    "\n",
+    "The fitted generator can be used to generate graphs of arbitrary size.\n",
+    "A few examples are provided below."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
-   "id": "03e1e551",
+   "execution_count": null,
+   "id": "bf28269e",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -162,7 +169,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "7ccd90df",
+   "id": "de3a3140",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -176,7 +183,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "8ae1f465",
+   "id": "838272df",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -191,7 +198,7 @@
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "eb5e2bcb",
+   "id": "4012835f",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -204,16 +211,18 @@
   },
   {
    "cell_type": "markdown",
-   "id": "66608f04",
+   "id": "869a8136",
    "metadata": {},
    "source": [
-    "### Convert graph to SNAP"
+    "### Convert graph to SNAP\n",
+    "\n",
+    "Next, the original graph is converted to [SNAP](https://snap.stanford.edu/), and the graph analyzer is used to compare the generated graphs against the original."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "2f12a218",
+   "id": "580a56bd",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -227,7 +236,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "8a9c3eea",
+   "id": "147e28fe",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -242,7 +251,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "a30c24be",
+   "id": "26986292",
    "metadata": {},
    "source": [
     "<a id=\"4\"></a>\n",
@@ -252,7 +261,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "1b46e743",
+   "id": "b97186af",
    "metadata": {},
    "outputs": [
     {
@@ -277,7 +286,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "6ea38131",
+   "id": "8b2f13f8",
    "metadata": {},
    "outputs": [
     {
@@ -548,10 +557,18 @@
     "df"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "5ba91a8b",
+   "metadata": {},
+   "source": [
+    "The degree distribution comparison and hop plots can also be visualized by calling the `compare_graph_plots` function on the analyzer."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 13,
-   "id": "15df3ae7",
+   "id": "5a89dba0",
    "metadata": {},
    "outputs": [
     {
@@ -574,7 +591,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "bb2c601f",
+   "id": "a1eacb32",
    "metadata": {},
    "outputs": [],
    "source": []
@@ -596,7 +613,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.15"
+   "version": "3.8.10"
   }
  },
  "nbformat": 4,

+ 79 - 35
Tools/DGLPyTorch/SyntheticGraphGeneration/demos/performance/struct_generator.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "raw",
-   "id": "2da8f4bf",
+   "id": "505e59e0",
    "metadata": {},
    "source": [
     "# Copyright 2023 NVIDIA Corporation. All Rights Reserved.\n",
@@ -23,7 +23,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "f987a4b2",
+   "id": "1f5c969c",
    "metadata": {},
    "source": [
     "# Graph structure generation demo"
@@ -31,13 +31,15 @@
   },
   {
    "cell_type": "markdown",
-   "id": "eb1caeed",
+   "id": "dd1a0e56",
    "metadata": {},
    "source": [
     "## Overview\n",
     "\n",
     "In this notebbok we compare the performance (throughput) of graph structure generators presented in the SynGen tool. \n",
     "\n",
+    "The graph generator can run on both CPU and GPU, so we provide cells to run the throughput test on either device.\n",
+    "\n",
     "Available generators:\n",
     "\n",
     "1. [Exact RMAT generator (GPU)](#1)\n",
@@ -49,7 +51,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "f8a1adaf",
+   "id": "2fc6406b",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -70,7 +72,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "cb6937d3",
+   "id": "bb794f78",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -82,7 +84,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ad27d92d",
+   "id": "1896fcb6",
    "metadata": {},
    "source": [
     "## Exact generator"
@@ -90,17 +92,27 @@
   },
   {
    "cell_type": "markdown",
-   "id": "06dadc04",
+   "id": "3535c9fd",
+   "metadata": {},
+   "source": [
+    "In the cell below, a graph with 6.8e6 nodes and 168e3 edges is generated. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3eb50018",
    "metadata": {},
    "source": [
     "<a id=\"1\"></a>\n",
-    "### GPU"
+    "### GPU\n",
+    "\n",
+    "We instantiate an `RMATGenerator` and ensure that it runs in GPU mode."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "a8bf077d",
+   "id": "2e9af1cc",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -108,10 +120,18 @@
     "static_graph_generator.gpu=True"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "32106249",
+   "metadata": {},
+   "source": [
+    "Provide the RMAT probability matrix for graph generation."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "01bc4461",
+   "id": "683b5ed6",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -121,7 +141,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "e52f19c4",
+   "id": "3d6b33c9",
    "metadata": {},
    "outputs": [
     {
@@ -147,10 +167,18 @@
     "print(elapsed)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "a29fa479",
+   "metadata": {},
+   "source": [
+    "Additional throughput tests for varying node/edge scaling sizes."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "7271985d",
+   "id": "b8fc1679",
    "metadata": {},
    "outputs": [
     {
@@ -240,7 +268,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "99412af6",
+   "id": "867378a1",
    "metadata": {},
    "outputs": [
     {
@@ -273,17 +301,25 @@
   },
   {
    "cell_type": "markdown",
-   "id": "7b20c540",
+   "id": "f15d5794",
    "metadata": {},
    "source": [
     "<a id=\"2\"></a>\n",
     "### CPU"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "51eb24ab",
+   "metadata": {},
+   "source": [
+    "The setup mirrors the GPU case, except that the generator device is set to CPU."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "09ae86bd",
+   "id": "8686622f",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -294,7 +330,7 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "a92affa3",
+   "id": "fb876c5a",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -304,7 +340,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "f939a153",
+   "id": "8c0c391f",
    "metadata": {},
    "outputs": [
     {
@@ -586,10 +622,18 @@
     "    "
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "0aff93e0",
+   "metadata": {},
+   "source": [
+    "We can plot the throughput, measured as the number of edges generated per second, for the varying scales, as depicted below."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "68c523f8",
+   "id": "5097ef88",
    "metadata": {},
    "outputs": [
     {
@@ -623,7 +667,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "f45ad186",
+   "id": "c4d7a90f",
    "metadata": {},
    "outputs": [
     {
@@ -668,7 +712,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "d24cdc1b",
+   "id": "29ac5284",
    "metadata": {},
    "source": [
     "## Approximate generator"
@@ -676,7 +720,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "e0f2c00a",
+   "id": "7c56cc7b",
    "metadata": {},
    "source": [
     "<a id=\"3\"></a>\n",
@@ -686,7 +730,7 @@
   {
    "cell_type": "code",
    "execution_count": 13,
-   "id": "7551999a",
+   "id": "630f0035",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -724,7 +768,7 @@
   {
    "cell_type": "code",
    "execution_count": 14,
-   "id": "3100aa6e",
+   "id": "46f143d8",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -734,7 +778,7 @@
   {
    "cell_type": "code",
    "execution_count": 15,
-   "id": "3d89f284",
+   "id": "c21a1449",
    "metadata": {},
    "outputs": [
     {
@@ -819,7 +863,7 @@
   {
    "cell_type": "code",
    "execution_count": 16,
-   "id": "304e5fee",
+   "id": "401d907f",
    "metadata": {},
    "outputs": [
     {
@@ -852,7 +896,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "0f6346ce",
+   "id": "44dca44d",
    "metadata": {},
    "source": [
     "<a id=\"4\"></a>\n",
@@ -862,7 +906,7 @@
   {
    "cell_type": "code",
    "execution_count": 17,
-   "id": "343eb3fb",
+   "id": "accaea7b",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -872,7 +916,7 @@
   {
    "cell_type": "code",
    "execution_count": 18,
-   "id": "0fcc8f5a",
+   "id": "522f623f",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -884,7 +928,7 @@
   {
    "cell_type": "code",
    "execution_count": 19,
-   "id": "423f40bd",
+   "id": "e8173a77",
    "metadata": {},
    "outputs": [
     {
@@ -911,7 +955,7 @@
   {
    "cell_type": "code",
    "execution_count": 20,
-   "id": "74e28c87",
+   "id": "9008bf1f",
    "metadata": {},
    "outputs": [
     {
@@ -932,7 +976,7 @@
   {
    "cell_type": "code",
    "execution_count": 21,
-   "id": "0fc54d1f",
+   "id": "0b80b0d3",
    "metadata": {},
    "outputs": [
     {
@@ -1204,7 +1248,7 @@
   {
    "cell_type": "code",
    "execution_count": 22,
-   "id": "9159ec79",
+   "id": "bb794041",
    "metadata": {},
    "outputs": [
     {
@@ -1238,7 +1282,7 @@
   {
    "cell_type": "code",
    "execution_count": 24,
-   "id": "5e3ededb",
+   "id": "98a67938",
    "metadata": {},
    "outputs": [
     {
@@ -1281,7 +1325,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "f2fdddd6",
+   "id": "1592cf45",
    "metadata": {},
    "outputs": [],
    "source": []
@@ -1303,7 +1347,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.15"
+   "version": "3.8.10"
   }
  },
  "nbformat": 4,

+ 55 - 33
Tools/DGLPyTorch/SyntheticGraphGeneration/demos/performance/tabular_generator.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "raw",
-   "id": "bbdcd85e",
+   "id": "715b754c",
    "metadata": {},
    "source": [
     "# Copyright 2023 NVIDIA Corporation. All Rights Reserved.\n",
@@ -23,7 +23,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "99811865",
+   "id": "d0289a5e",
    "metadata": {},
    "source": [
     "# Tabular data generation performance demo"
@@ -31,7 +31,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "f4436d03",
+   "id": "6cb2ea32",
    "metadata": {},
    "source": [
     "## Overview\n",
@@ -49,7 +49,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "cec05078",
+   "id": "071caf5f",
    "metadata": {},
    "source": [
     "### Imports"
@@ -58,7 +58,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "5ab89412",
+   "id": "e024521d",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -82,16 +82,18 @@
   },
   {
    "cell_type": "markdown",
-   "id": "81ec3beb",
+   "id": "aaf4b14c",
    "metadata": {},
    "source": [
-    "### Helper function"
+    "### Helper function\n",
+    "\n",
+    "This function measures throughput in samples per second."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "be4b5b7a",
+   "id": "35af9cde",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -107,16 +109,18 @@
   },
   {
    "cell_type": "markdown",
-   "id": "4db05295",
+   "id": "7def7b7e",
    "metadata": {},
    "source": [
-    "### Load tabular features"
+    "### Load tabular features\n",
+    "\n",
+    "We use the `IEEEPreprocessing` class, which loads and prepares the entire dataset, and then extract the tabular data."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "d09efe1a",
+   "id": "eff6b0c1",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -126,7 +130,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "fd401437",
+   "id": "ce93c923",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -136,7 +140,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "fec262e5",
+   "id": "3baf7373",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -145,10 +149,18 @@
     "real = data[MetaData.EDGE_DATA][list(cat_cols)].reset_index(drop=True)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "b888bad5",
+   "metadata": {},
+   "source": [
+    "A utility dictionary to store the results."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "9968856c",
+   "id": "640a6738",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -157,17 +169,19 @@
   },
   {
    "cell_type": "markdown",
-   "id": "c69a6885",
+   "id": "23d1e2f3",
    "metadata": {},
    "source": [
     "<a id=\"1\"></a>\n",
-    "## KDE (Kernel Density Estimation) Generator\n"
+    "## KDE (Kernel Density Estimation) Generator\n",
+    "\n",
+    "A PyTorch implementation of [KDE](https://en.wikipedia.org/wiki/Kernel_density_estimation)."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "d3a7994d",
+   "id": "e72061cc",
    "metadata": {},
    "outputs": [
     {
@@ -189,17 +203,19 @@
   },
   {
    "cell_type": "markdown",
-   "id": "9548e934",
+   "id": "a1c93461",
    "metadata": {},
    "source": [
     "<a id=\"2\"></a>\n",
-    "## KDE (Kernel Density Estimation) Generator from sklearn"
+    "## KDE (Kernel Density Estimation) Generator from sklearn\n",
+    "\n",
+    "A wrapper over [KDE sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html)."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "4678f8e6",
+   "id": "6314b42d",
    "metadata": {},
    "outputs": [
     {
@@ -221,17 +237,19 @@
   },
   {
    "cell_type": "markdown",
-   "id": "9f76d797",
+   "id": "85e43e6f",
    "metadata": {},
    "source": [
     "<a id=\"3\"></a>\n",
-    "## Uniform Generator"
+    "## Uniform Generator\n",
+    "\n",
+    "Takes the data distribution from the real data and then samples uniformly from it."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "7ba982e7",
+   "id": "4c3f25e9",
    "metadata": {},
    "outputs": [
     {
@@ -253,17 +271,19 @@
   },
   {
    "cell_type": "markdown",
-   "id": "3ccccb76",
+   "id": "cf5c65f6",
    "metadata": {},
    "source": [
     "<a id=\"4\"></a>\n",
-    "## Gaussian Generator"
+    "## Gaussian Generator\n",
+    "\n",
+    "Interprets the real data distribution as a Normal one and samples from it."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "fc461ba3",
+   "id": "df79fc05",
    "metadata": {},
    "outputs": [
     {
@@ -285,17 +305,19 @@
   },
   {
    "cell_type": "markdown",
-   "id": "76cfb4c4",
+   "id": "23c08184",
    "metadata": {},
    "source": [
     "<a id=\"5\"></a>\n",
-    "## CTGAN Generator"
+    "## CTGAN Generator\n",
+    "\n",
+    "Implements the [Modeling Tabular data using Conditional GAN](https://arxiv.org/abs/1907.00503) paper."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "a650784f",
+   "id": "a8548c71",
    "metadata": {},
    "outputs": [
     {
@@ -318,7 +340,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "2ee4678f",
+   "id": "7d1c9c26",
    "metadata": {},
    "source": [
     "## Results"
@@ -327,7 +349,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "68bc5f55",
+   "id": "06e15e7c",
    "metadata": {},
    "outputs": [
     {
@@ -388,7 +410,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "2d6da47b",
+   "id": "4344a7f3",
    "metadata": {},
    "outputs": [],
    "source": []
@@ -410,7 +432,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.15"
+   "version": "3.8.10"
   }
  },
  "nbformat": 4,