
SCMs Data Generation


Advice for developers (if needed): SCMs data generation

Generator in CADIMULC serves as a general data-generation framework for the task of causal discovery. The default hyperparameter settings in the Generator (e.g. the parameters of a specific causal function) may need to be fine-tuned depending on the purpose of the simulation.

Users can develop their own "causal simulator" for data analysis, according to their own interests, by following the data-generation template in Generator.

Class: Generator

cadimulc.utils.generation.Generator

Bases: object

The Generator simulates empirical data implied by structural causal models (SCMs). The primary parameters for the Generator's simulation consist of the model classes (e.g. linear or non-linear) and the (independent) noise distributions (e.g. Gaussian or non-Gaussian).

Consider causation in a graphical context, where a variable \(y_i\) is supposed to be the effect of its parents \(pa(y_i)\). The data for \(y_i\) is then expected to be generated from the (group of) data for \(pa(y_i)\), following the causal mapping mechanism \(F\) characterized by the SCMs.

Research in causal discovery suggests that structurally identifiable empirical data should be generated by a special "genus" of SCMs, normally referred to as additive noise models (ANMs):

\[ y_{i} := F(pa(y_i), e_{i}):= \sum_{x_{j} \in pa(y_i)} f( x_{j}) + e_{i}, \]

where \(f(\cdot)\) denotes a linear or non-linear function, and \(e_{i}\) refers to the independent noise following a Gaussian or non-Gaussian distribution.
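To make the additive mechanism concrete, here is a minimal standalone numpy sketch of a two-variable ANM (not the Generator implementation); the function `f` and the noise scales are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # sample size

# Cause: x is a root variable, so it equals its own independent noise.
x = rng.uniform(-1.0, 1.0, size=n)  # non-Gaussian (uniform) noise

# Effect: y := f(pa(y)) + e, here with a non-linear f and Gaussian noise.
f = lambda v: np.sin(2.0 * v)     # hypothetical non-linear causal function
e = rng.normal(0.0, 0.1, size=n)  # independent Gaussian noise
y = f(x) + e
```

Any other choice of \(f\) and noise distribution fits the same template, as long as the noise stays additive and independent of the parents.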

Structurally identifiable SCMs simulation in CADIMULC, in light of related literature

linear: linear non-Gaussian acyclic models (LiNGAM)[1], referring to the experiment setup by MLC-LiNGAM[2].

non-linear: causal additive models (CAM)[3], referring to the experiment setup by CAM-UV[4].

__init__(graph_node_num, sample, causal_model='hybrid_nonlinear', noise_type='Gaussian', noise_scale='default', sparsity=0.3, _skeleton=None, _dag=None, _dataset=None)

Parameters:

Name Type Description Default
graph_node_num int

Number of vertices in the (ground-truth) causal graph, which equals the number of variables in the causal model (recommended: < 15).

required
sample int

Size of the dataset generated from the SCMs (recommended: < 10000).

required
causal_model str

Refer to the structurally identifiable SCMs simulation in light of related literature, e.g. LiNGAM (str: lingam) or CAM (str: hybrid_nonlinear).

'hybrid_nonlinear'
noise_type str

Refer to the structurally identifiable SCMs simulation in light of related literature, e.g. Gaussian (str: Gaussian), or the uniform distribution as the non-Gaussian case (str: non-Gaussian).

'Gaussian'
noise_scale int | str

"default" follows the experiment setups in the related literature.

'default'
sparsity float

Controls the sparsity of the (ground-truth) causal graph (recommended: 0.3).

0.3

The causal model should be carefully paired with the noise type

  • If causal_model="lingam", the noise distribution must satisfy noise_type="non-Gaussian".
  • If causal_model="hybrid_nonlinear", the noise distribution can be either noise_type="Gaussian" or noise_type="non-Gaussian". However, evaluation in CADIMULC suggests that Gaussian noise is preferable for yielding identifiable results.
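The two noise options can be sketched as follows; `sample_noise` is a hypothetical helper (not part of the Generator API), and the uniform distribution stands in for the non-Gaussian case as in the parameter description above:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = 5000

def sample_noise(noise_type: str, scale: float = 1.0) -> np.ndarray:
    """Draw independent noise of (approximately) unit variance.

    Hypothetical helper mirroring the `noise_type` options; the actual
    scales used by `Generator` may differ.
    """
    if noise_type == "Gaussian":
        return rng.normal(0.0, scale, size=sample)
    elif noise_type == "non-Gaussian":
        # Uniform on [-a, a] has variance a^2 / 3, so a = sqrt(3) * scale
        # matches the variance of the Gaussian case.
        a = np.sqrt(3.0) * scale
        return rng.uniform(-a, a, size=sample)
    else:
        raise ValueError(f"unknown noise_type: {noise_type}")

e_gauss = sample_noise("Gaussian")
e_unif = sample_noise("non-Gaussian")
```

Matching the variances of the two noise families keeps comparisons between the model/noise pairings fair.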

Primary Method: run_generation_procedure

Run the common two-step procedure for SCMs data generation:

  1. Generate a random DAG in light of the well-known Erdős–Rényi model;
  2. Given a topological order derived from the DAG, generate each variable \(y_i\) by summing the effects of its parents \(pa(y_i)\).

Returns:

Name Type Description
self object

Update the Generator's attributes: _skeleton as the undirected graph corresponding to the ground-truth causal graph, _dag as the directed acyclic graph (DAG) of the ground truth, and _dataset as an (n * d) array (n = sample, d = graph_node_num).

Source code in cadimulc\utils\generation.py
def run_generation_procedure(self) -> object:
    """ Run the common **two-step** procedure for SCMs data generation:

    1. Generate a random DAG in light of the well-known Erdős–Rényi model;
    2. Given a topological order derived from the DAG, generate each variable $y_i$
    by summing the effects of its parents $pa(y_i)$.

    Returns:
        self:
            Update the `Generator`'s attributes: `_skeleton` as the undirected graph
            corresponding to the ground-truth causal graph, `_dag` as the
            directed acyclic graph (DAG) of the ground truth,
            and `_dataset` as an (n * d) array (n = `sample`, d = `graph_node_num`).
    """

    self._clear()

    # Generate a random DAG in light of the well-known Erdős–Rényi model.
    self._generate_dag()

    # Given a topological order derived from the DAG, generate each variable by
    # summing the effects of its parents.
    self._generate_data()

    return self

Private Method: _generate_dag

Generate a random DAG in light of the well-known Erdős–Rényi model.

Returns:

Name Type Description
self object

Update _dag (DAG), represented as a boolean adjacency matrix.

Source code in cadimulc\utils\generation.py
def _generate_dag(self) -> object:
    """
    Generate a random DAG in light of the well-known Erdős–Rényi model.

    Returns:
        self: Update `_dag` (DAG), represented as a boolean adjacency matrix.
    """

    # undirected graph
    undigraph = self._get_undigraph(
        graph_node_num=self.graph_node_num,
        sparsity=self.sparsity
    )
    self.skeleton = copy(undigraph)

    # directed graph
    digraph = self._orient(undigraph)
    self.dag = copy(digraph)

    return self
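The helpers `_get_undigraph` and `_orient` are not shown above. Assuming they follow the standard Erdős–Rényi recipe described in this section, the two sub-steps might look like the following standalone sketch (`erdos_renyi_dag` is a hypothetical name, not the library's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def erdos_renyi_dag(node_num: int, sparsity: float) -> np.ndarray:
    """Hypothetical sketch of the two sub-steps: sample an Erdős–Rényi
    undirected skeleton, then orient it along a random node ordering so
    the result is acyclic by construction."""
    # Undirected skeleton: each edge appears independently with prob `sparsity`.
    upper = np.triu(rng.random((node_num, node_num)) < sparsity, k=1)
    skeleton = upper | upper.T

    # Orientation: permute the nodes, then keep only edges j -> i where
    # j precedes i in the ordering.
    order = rng.permutation(node_num)
    rank = np.argsort(order)  # rank[v] = position of node v in the ordering
    dag = np.zeros_like(skeleton)
    for i in range(node_num):
        for j in range(node_num):
            if skeleton[i, j] and rank[j] < rank[i]:
                dag[j, i] = True  # boolean adjacency entry for j -> i
    return dag
```

Because every edge points from an earlier node to a later node in the ordering, no directed cycle can arise.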

Private Method: _generate_data

Given a topological order derived from the DAG, generate each variable by summing the effects of its parents.

Returns:

Name Type Description
self object

Update _data, represented as an (n * d) numpy array (n = sample, d = graph_node_num).

Source code in cadimulc\utils\generation.py
def _generate_data(self) -> object:
    """
    Given a topological order derived from the DAG, generate each variable by
    summing the effects of its parents.

    Returns:
        self: Update `_data`, represented as an (n * d) numpy array (n = `sample`,
         d = `graph_node_num`).
    """

    self.data = np.zeros([self.sample, self.graph_node_num])

    # topological order derived from the DAG
    dag_nx = nx.DiGraph(self.dag.T)
    topo_order = list(nx.topological_sort(dag_nx))

    # data generation in light of additive noise model
    for child_index, child_var in enumerate(topo_order):
        parent_vars = list(dag_nx.predecessors(child_var))
        parent_indexes = [topo_order.index(var) for var in parent_vars]

        # independent noise
        self._add_random_noise(var_index=child_index)

        # sum of the parents' effects
        if len(parent_vars) > 0:
            for parent_index in parent_indexes:
                self._simulate_causal_model(
                    child_index=child_index,
                    parent_index=parent_index
                )

    # scale each variable by its standard deviation (default setting)
    self.data = self.data / np.std(self.data, axis=0)

    return self
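The helpers `_add_random_noise` and `_simulate_causal_model` are likewise not shown. Assuming the hybrid non-linear setting, a standalone numpy sketch of the per-variable updates might look like this (function bodies are illustrative stand-ins, not the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
sample, node_num = 1000, 3
data = np.zeros((sample, node_num))

def add_random_noise(data, var_index, scale=1.0):
    # Hypothetical counterpart of `_add_random_noise`: initialize a
    # column with independent Gaussian noise.
    data[:, var_index] += rng.normal(0.0, scale, size=data.shape[0])

def simulate_causal_model(data, child_index, parent_index):
    # Hypothetical counterpart of `_simulate_causal_model`: add a
    # non-linear transform of the parent column to the child column,
    # matching the additive form of the ANMs.
    data[:, child_index] += np.tanh(data[:, parent_index])

# Chain x0 -> x1 -> x2, visited in topological order as in the loop above.
for child, parents in [(0, []), (1, [0]), (2, [1])]:
    add_random_noise(data, child)
    for parent in parents:
        simulate_causal_model(data, child, parent)
```

Each child column is first seeded with its independent noise and then accumulates one additive term per parent, mirroring the ANM equation earlier on this page.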

Running examples

CADIMULC is a lightweight Python repository without sophisticated library API design. The documentation on this page provides introductory material for the practical tooling of causal discovery. For running examples, please check out the Quick Tutorials for straightforward usage within the "micro" workflow of causal discovery.

Reference

[1] Shimizu, Shohei, Patrik O. Hoyer, Aapo Hyvärinen, Antti Kerminen, and Michael Jordan. "A linear non-Gaussian acyclic model for causal discovery." Journal of Machine Learning Research. 2006.

[2] Chen, Wei, Ruichu Cai, Kun Zhang, and Zhifeng Hao. "Causal discovery in linear non-Gaussian acyclic model with multiple latent confounders." IEEE Transactions on Neural Networks and Learning Systems. 2021.

[3] Bühlmann, Peter, Jonas Peters, and Jan Ernest. "CAM: Causal additive models, high-dimensional order search and penalized regression." The Annals of Statistics. 2014.

[4] Maeda, Takashi Nicholas, and Shohei Shimizu. "Causal additive models with unobserved variables." In Uncertainty in Artificial Intelligence. 2021.