|
|
@@ -2,7 +2,7 @@
|
|
|
"cells": [
|
|
|
{
|
|
|
"cell_type": "raw",
|
|
|
- "id": "af9ebdc3",
|
|
|
+ "id": "5a21cdb1",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
"# Copyright 2023 NVIDIA Corporation. All Rights Reserved.\n",
|
|
|
@@ -23,7 +23,7 @@
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "e16cd18a",
|
|
|
+ "id": "277223b5",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
"# End to end graph generation demo (CORA)"
|
|
|
@@ -31,12 +31,14 @@
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "0711da78",
|
|
|
+ "id": "d8ddf635",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
"## Overview\n",
|
|
|
"\n",
|
|
|
- "In this notebook, we have walked through the complete process of generating a synthetic dataset based on a CORA dataset. The CORA dataset consists of scientific publications classified into one of seven classes. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary, so we can interpret the CORA dataset as a graph with categorical node features.\n",
|
|
|
+ "In this notebook, we walk through the complete process of generating a synthetic dataset based on the CORA dataset.\n",
|
|
|
+ "\n",
|
|
|
+ "The CORA dataset consists of scientific publications classified into one of seven classes. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary, so we can interpret the CORA dataset as a graph with categorical node features.\n",
|
|
|
"\n",
|
|
|
"Content:\n",
|
|
|
"\n",
|
|
|
@@ -48,7 +50,7 @@
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "cb2a97a3",
|
|
|
+ "id": "295da055",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
"### Imports"
|
|
|
@@ -57,7 +59,7 @@
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 1,
|
|
|
- "id": "4c92ade2",
|
|
|
+ "id": "9c8dab41",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
@@ -87,17 +89,40 @@
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "9f40bfe2",
|
|
|
+ "id": "359be279",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
"<a id=\"1\"></a>\n",
|
|
|
"### Fit synthesizer"
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "5e850760",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### Instantiating the building blocks"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "c7f3981b",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Because the CORA dataset is a graph with node features, the following objects are instantiated:\n",
|
|
|
+ "\n",
|
|
|
+ "- A node feature generator to generate the node features. In this example, we simply choose the Kernel Density Estimate (KDE) generator.\n",
|
|
|
+ "- A graph generator to generate the graph structure, e.g. RMAT.\n",
|
|
|
+ "- An aligner to align the two. In this case, a random aligner is used, which randomly assigns the node features to the generated nodes.\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "**Note**: Alternative generators can be used as long as they implement the `fit` and `generate` API and consume the data dictionary (described below)."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 2,
|
|
|
- "id": "b23cd179",
|
|
|
+ "id": "c29b7ddb",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
@@ -107,10 +132,28 @@
|
|
|
"graph_aligner = RandomAligner()"
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "b7b9cbca",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### Defining the synthesizer"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "95139288",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Once the building blocks are instantiated with their corresponding hyperparameters, we can instantiate a synthesizer, which defines how these building blocks interact.\n",
|
|
|
+ "\n",
|
|
|
+ "The static graph synthesizer object can be used to generate graphs with either node or edge features."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 3,
|
|
|
- "id": "bf237cca",
|
|
|
+ "id": "3d7bce4f",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
@@ -121,20 +164,49 @@
|
|
|
" graph_aligner=graph_aligner)"
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "323834ba",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### Preprocessing"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "a00edc74",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "For the CORA dataset, a preprocessing step is pre-implemented (see `/syngen/preprocessing/cora.py`). It reads the corresponding data files to create the CORA graph, with labels converted into ordinal values."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 4,
|
|
|
- "id": "ac06ae7e",
|
|
|
+ "id": "4fe80b1c",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
"data = preprocessing.transform('/workspace/data/cora')"
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "024b44d8",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "The output of the preprocessing function is a dictionary with the following keys:\n",
|
|
|
+ "\n",
|
|
|
+ "- `MetaData.EDGE_DATA`: data corresponding to the graph's edge information.\n",
|
|
|
+ "- `MetaData.NODE_DATA`: data corresponding to the graph's node information.\n",
|
|
|
+ "\n",
|
|
|
+ "Now that we have the data, the synthesizer can be fit. This step simply fits each component on the data."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 5,
|
|
|
- "id": "f24786f9",
|
|
|
+ "id": "7956c60e",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
@@ -143,17 +215,27 @@
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "0ff98447",
|
|
|
+ "id": "6ab8d92c",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
"<a id=\"2\"></a>\n",
|
|
|
"## Dataset Generation"
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "006edfa1",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Now that we have a synthesizer fitted on a downstream dataset, we can generate a graph with characteristics similar to those of the original.\n",
|
|
|
+ "\n",
|
|
|
+ "In this example, we simply generate a graph of the same size."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 6,
|
|
|
- "id": "c3116be5",
|
|
|
+ "id": "cb149b46",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
@@ -161,10 +243,18 @@
|
|
|
"num_nodes = 2708 "
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "79adcbe2",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Calling `generate` with the desired graph size returns a dictionary with keys corresponding to edge data and node data (if the synthesizer was configured to generate these)."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 7,
|
|
|
- "id": "24cf045e",
|
|
|
+ "id": "c364179e",
|
|
|
"metadata": {},
|
|
|
"outputs": [
|
|
|
{
|
|
|
@@ -181,31 +271,56 @@
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "57c0fc18",
|
|
|
+ "id": "355d6dff",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
"<a id=\"3\"></a>\n",
|
|
|
"## Tabular Data Evaluation"
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "38f0c94d",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Now that we have generated the data, we may be interested in assessing the quality of the generated graph.\n",
|
|
|
+ "\n",
|
|
|
+ "The tool provides a set of analyzers, which can be used to analyze:\n",
|
|
|
+ "- tabular features\n",
|
|
|
+ "- graph structure\n",
|
|
|
+ "- both\n",
|
|
|
+ "\n",
|
|
|
+ "Below, a series of examples compares the original node feature distribution with the output of various node feature generators."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 8,
|
|
|
- "id": "3c0c1a12",
|
|
|
+ "id": "d85f0ead",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
+ "# - extract the fitted node generator from the synthesizer\n",
|
|
|
"tabular_generator = synthesizer.node_feature_generator\n",
|
|
|
"cols_to_drop = set(['id'])"
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "488e32f4",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "**Note**: the `id` column is dropped, as it simply corresponds to the node ID."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 9,
|
|
|
- "id": "b721dea8",
|
|
|
+ "id": "3371b941",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
+ "# - extract node features associated with nodes in the graph\n",
|
|
|
"real = data[MetaData.NODE_DATA]\n",
|
|
|
"real = real.drop(columns=cols_to_drop).reset_index(drop=True)"
|
|
|
]
|
|
|
@@ -213,18 +328,29 @@
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 10,
|
|
|
- "id": "a23275bd",
|
|
|
+ "id": "16423b72",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
+ "# - generate node features using the generator used in the synthesizer\n",
|
|
|
+ "# note the synthetic data could also be replaced with the node data\n",
|
|
|
+ "# generated above.\n",
|
|
|
"synthetic = tabular_generator.sample(len(real))\n",
|
|
|
"synthetic = synthetic.drop(columns=cols_to_drop).reset_index(drop=True)"
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "6c90e69d",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "The `real` and `synthetic` data to be compared are then fed to the `TabularMetrics` object, which can be used to visually compare the data, or provide a series of metrics comparing feature distributions and their correlations."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 11,
|
|
|
- "id": "623110cd",
|
|
|
+ "id": "125df9fb",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
@@ -238,7 +364,7 @@
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 12,
|
|
|
- "id": "3942330b",
|
|
|
+ "id": "e541e647",
|
|
|
"metadata": {},
|
|
|
"outputs": [
|
|
|
{
|
|
|
@@ -260,16 +386,24 @@
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "305c3fc9",
|
|
|
+ "id": "546b0618",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
"### Random Tabular Data"
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "450ced3b",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "In the cell below, a comparison is done using a uniform random generator."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 13,
|
|
|
- "id": "be3317b8",
|
|
|
+ "id": "f0d511e7",
|
|
|
"metadata": {},
|
|
|
"outputs": [
|
|
|
{
|
|
|
@@ -301,16 +435,24 @@
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "33a7d3e4",
|
|
|
+ "id": "6c3f78ff",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
"### Random Multivariate"
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "b8fd3f8a",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "In the cell below, a comparison is done using a multivariate random generator. Note the similarity of its results to those of the `KDEGenerator`, as the KDE generator simply adds Gaussian noise and clamps the values."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 14,
|
|
|
- "id": "0b42fc7e",
|
|
|
+ "id": "cae90372",
|
|
|
"metadata": {},
|
|
|
"outputs": [
|
|
|
{
|
|
|
@@ -342,17 +484,31 @@
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "a61839a5",
|
|
|
+ "id": "d8556744",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
"<a id=\"4\"></a>\n",
|
|
|
"## Structure evaluation"
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "995bd9ad",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Next, the graph structure can be analyzed in a similar way.\n",
|
|
|
+ "\n",
|
|
|
+ "In the following cells, the properly generated graph (using the synthesizer), a random graph, and the original graph are compared.\n",
|
|
|
+ "\n",
|
|
|
+ "The tool implements a graph analyzer, i.e. `AnalysisModule`, which provides a series of useful metrics for comparing graphs.\n",
|
|
|
+ "\n",
|
|
|
+ "First, only the graph structure is extracted, i.e. nodes and edges."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 15,
|
|
|
- "id": "b8aad032",
|
|
|
+ "id": "67a04184",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
@@ -365,7 +521,7 @@
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 16,
|
|
|
- "id": "ca24da8e",
|
|
|
+ "id": "ffca7954",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
@@ -384,7 +540,7 @@
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 17,
|
|
|
- "id": "fb12cfcb",
|
|
|
+ "id": "d2e6520c",
|
|
|
"metadata": {},
|
|
|
"outputs": [
|
|
|
{
|
|
|
@@ -398,6 +554,7 @@
|
|
|
}
|
|
|
],
|
|
|
"source": [
|
|
|
+ "# - print graph size\n",
|
|
|
"print(f'src_dst:{len(src_dst)}')\n",
|
|
|
"print(f'dst_srct:{len(dst_src)}')\n",
|
|
|
"print(f'graph:{len(graph)}')"
|
|
|
@@ -406,7 +563,7 @@
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 18,
|
|
|
- "id": "c1b20191",
|
|
|
+ "id": "09ec506e",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
@@ -418,11 +575,11 @@
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 19,
|
|
|
- "id": "c694f1a1",
|
|
|
+ "id": "2a00643e",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
- "# save graph structure to file\n",
|
|
|
+ "# - save graph structure to file\n",
|
|
|
"np.savetxt('/workspace/data/cora_demo_proper.txt', np.array(graph_structure_proper), fmt='%i', delimiter='\\t')\n",
|
|
|
"np.savetxt('/workspace/data/cora_demo_random.txt', np.array(graph_structure_random), fmt='%i', delimiter='\\t')\n",
|
|
|
"np.savetxt('/workspace/data/cora_demo_orig.txt', np.array(graph_structure_orig), fmt='%i', delimiter='\\t')"
|
|
|
@@ -431,17 +588,26 @@
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 20,
|
|
|
- "id": "6d88b840",
|
|
|
+ "id": "d368ba9c",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
+ "# - instantiate graph analyzer\n",
|
|
|
"graph_analyser = AnalysisModule()"
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "de328c33",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Graph objects are then instantiated using the extracted graph structures."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 21,
|
|
|
- "id": "38fc6135",
|
|
|
+ "id": "b2a3ff7f",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
@@ -451,10 +617,19 @@
|
|
|
"all_graphs = [proper_graph, random_graph, orig_graph]"
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "b2e1b416",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "The graphs can then be fed to various metrics. For example, `get_dd_similarity_score` provides a score between 0 and 1,\n",
|
|
|
+ "comparing the degree distribution of a source graph and destination graph."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 22,
|
|
|
- "id": "82de736f",
|
|
|
+ "id": "c05b60b2",
|
|
|
"metadata": {},
|
|
|
"outputs": [
|
|
|
{
|
|
|
@@ -476,10 +651,18 @@
|
|
|
"print(\"ORIG vs RANDOM:\", orig_random)"
|
|
|
]
|
|
|
},
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "64bd7bde",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "`compare_graph_stats` compares the graphs across a series of statistics."
|
|
|
+ ]
|
|
|
+ },
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 23,
|
|
|
- "id": "b689eef8",
|
|
|
+ "id": "e3272df1",
|
|
|
"metadata": {},
|
|
|
"outputs": [
|
|
|
{
|
|
|
@@ -714,7 +897,7 @@
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 24,
|
|
|
- "id": "e5b77900",
|
|
|
+ "id": "b6240453",
|
|
|
"metadata": {},
|
|
|
"outputs": [
|
|
|
{
|
|
|
@@ -735,14 +918,6 @@
|
|
|
"set_loglevel('warning')\n",
|
|
|
"_ = graph_analyser.compare_graph_plots(*all_graphs);"
|
|
|
]
|
|
|
- },
|
|
|
- {
|
|
|
- "cell_type": "code",
|
|
|
- "execution_count": null,
|
|
|
- "id": "6e58427e",
|
|
|
- "metadata": {},
|
|
|
- "outputs": [],
|
|
|
- "source": []
|
|
|
}
|
|
|
],
|
|
|
"metadata": {
|
|
|
@@ -764,7 +939,7 @@
|
|
|
"name": "python",
|
|
|
"nbconvert_exporter": "python",
|
|
|
"pygments_lexer": "ipython3",
|
|
|
- "version": "3.8.15"
|
|
|
+ "version": "3.8.10"
|
|
|
}
|
|
|
},
|
|
|
"nbformat": 4,
|