SCMs Data Generation
Advice for developers (optional reading): SCM data generation
The Generator in CADIMULC serves as a general framework for data generation in causal discovery tasks.
The default hyperparameter settings in the Generator (e.g. the parameters of a specific causal function) may need fine-tuning, depending on the purpose of the simulation.
Users can develop their own "causal simulator" for data analysis, based on their needs and interests, by following the data generation template in the Generator.
Class: Generator
cadimulc.utils.generation.Generator
Bases: object
The Generator simulates the empirical data implied by structural causal models (SCMs).
The primary parameters of the Generator's simulation are the model class (e.g. linear or non-linear) and the (independent) noise distribution (e.g. Gaussian or non-Gaussian).
Consider causation in a graphical context, where a variable \(y_i\) is the effect of its parents \(pa(y_i)\). The data for \(y_i\) is then generated from the (group of) data for \(pa(y_i)\), following the causal mapping mechanism \(F\) characterized by the SCM.
Research in causal discovery suggests that structurally identifiable empirical data should be generated by a special "genus" of SCMs, normally referred to as additive noise models (ANMs):
\[ y_i := f(pa(y_i)) + e_i, \]
where \(f(\cdot)\) denotes a linear or non-linear function, and \(e_{i}\) is independent noise following a Gaussian or non-Gaussian distribution.
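A minimal sketch of the additive noise model above, using only the Python standard library. The helper name `simulate_anm`, the coefficient 0.8, the choice of `tanh`, and the noise ranges are illustrative assumptions and are not taken from CADIMULC.

```python
import math
import random


def simulate_anm(parent_samples, f, noise_sampler):
    """Return y_i = f(pa(y_i)) + e_i for every row of parent values.

    Hypothetical helper for illustration; not part of CADIMULC.
    """
    return [f(parents) + noise_sampler() for parents in parent_samples]


rng = random.Random(0)

# One parent variable x with 1000 samples (each row is a list of parent values).
x = [[rng.gauss(0.0, 1.0)] for _ in range(1000)]

# Linear mechanism with uniform (non-Gaussian) noise, in the spirit of LiNGAM.
y_linear = simulate_anm(x, lambda pa: 0.8 * pa[0], lambda: rng.uniform(-1.0, 1.0))

# Non-linear mechanism with Gaussian noise.
y_nonlinear = simulate_anm(x, lambda pa: math.tanh(pa[0]), lambda: rng.gauss(0.0, 0.5))
```

In both cases the noise is drawn independently of the parent values, which is the property the ANM definition relies on.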
Structurally identifiable SCM simulation in CADIMULC, in light of related literature
- linear: linear non-Gaussian acyclic models (LiNGAM) [1], following the experiment setup of MLC-LiNGAM [2].
- non-linear: causal additive models (CAM) [3], following the experiment setup of CAM-UV [4].
__init__(graph_node_num, sample, causal_model='hybrid_nonlinear', noise_type='Gaussian', noise_scale='default', sparsity=0.3, _skeleton=None, _dag=None, _dataset=None)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
graph_node_num | int | Number of vertices in the ground-truth causal graph, i.e. the number of variables in the causal model (recommended: < 15). | required |
sample | int | Size of the dataset generated from the SCM (recommended: < 10000). | required |
causal_model | str | Structurally identifiable SCM to simulate, in light of related literature, e.g. LiNGAM (str: lingam), CAM (str: hybrid_nonlinear). | 'hybrid_nonlinear' |
noise_type | str | Noise distribution of the simulated SCM, in light of related literature, e.g. Gaussian (str: Gaussian), uniform distribution as non-Gaussian (str: non-Gaussian). | 'Gaussian' |
noise_scale | int \| str | Scale of the noise; 'default' follows the experiment setup in the related literature. | 'default' |
sparsity | float | Controls the sparsity of the ground-truth causal graph (recommended: 0.3). | 0.3 |
The causal model should be carefully paired with the noise type
- If causal_model="lingam", the noise distribution must be noise_type="non-Gaussian".
- If causal_model="hybrid_nonlinear", the noise distribution can be either noise_type="Gaussian" or noise_type="non-Gaussian". However, evaluation in CADIMULC suggests that Gaussian noise is preferable for yielding identifiable results.
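The pairing rule above can be expressed as a small validation step run before constructing a Generator. The function `check_model_noise_pairing` is a hypothetical helper written for this page; it is not part of CADIMULC, and the library may or may not perform such a check itself.

```python
def check_model_noise_pairing(causal_model: str, noise_type: str) -> None:
    """Raise ValueError when the documented model/noise pairing is violated.

    Hypothetical helper for illustration; not part of CADIMULC.
    """
    # Allowed noise types per causal model, as stated in the note above.
    allowed = {
        "lingam": {"non-Gaussian"},
        "hybrid_nonlinear": {"Gaussian", "non-Gaussian"},
    }
    if causal_model not in allowed:
        raise ValueError(f"unknown causal_model: {causal_model!r}")
    if noise_type not in allowed[causal_model]:
        raise ValueError(
            f"causal_model={causal_model!r} requires noise_type in "
            f"{sorted(allowed[causal_model])}, got {noise_type!r}"
        )


check_model_noise_pairing("hybrid_nonlinear", "Gaussian")  # passes silently
check_model_noise_pairing("lingam", "non-Gaussian")        # passes silently
```

Calling it with `("lingam", "Gaussian")` raises a `ValueError`, which makes an unidentifiable linear-Gaussian setup fail early instead of producing misleading simulation results.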
Primary Method: run_generation_procedure
Run the common two-step procedure for SCM data generation:
- Generate a random DAG following the well-known Erdős–Rényi model;
- Given a topological order derived from the DAG, generate each variable \(y_i\) by summing the effects of its parents \(pa(y_i)\).
Returns:
Name | Type | Description |
---|---|---|
self | object | Update the |
Source code in cadimulc\utils\generation.py
Private Method: _generate_dag
Generate a random DAG in light of the well-known Erdős–Rényi model.
Returns:
Name | Type | Description |
---|---|---|
self | object | Update |
Source code in cadimulc\utils\generation.py
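As an illustration of the idea behind _generate_dag (this is a sketch, not the code in generation.py), an Erdős–Rényi-style random DAG can be sampled by shuffling the nodes into a random order and adding each forward edge independently with a fixed probability. The name `random_er_dag` and the parameter `edge_prob` are assumptions made here; how the Generator's `sparsity` maps to an edge probability is not specified on this page.

```python
import random


def random_er_dag(node_num: int, edge_prob: float, seed: int = 0):
    """Sample a random DAG as an adjacency matrix plus a topological order.

    adj[i][j] == 1 means there is an edge i -> j.
    Hypothetical helper for illustration; not part of CADIMULC.
    """
    rng = random.Random(seed)

    # A random permutation of the nodes serves as the causal order.
    order = list(range(node_num))
    rng.shuffle(order)

    adj = [[0] * node_num for _ in range(node_num)]
    for a in range(node_num):
        for b in range(a + 1, node_num):
            # Only edges from earlier to later nodes are allowed,
            # so the resulting graph cannot contain a cycle.
            if rng.random() < edge_prob:
                adj[order[a]][order[b]] = 1
    return adj, order


adj, order = random_er_dag(node_num=5, edge_prob=0.3)
```

Because every edge points forward in `order`, the returned `order` is a valid topological order of the sampled DAG.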
Private Method: _generate_data
Given a topological order derived from the DAG, generate each variable by summing the effects of its parents.
Returns:
Name | Type | Description |
---|---|---|
self | object | Update |
Source code in cadimulc\utils\generation.py
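The data generation step can be sketched as follows (again an illustration, not the code in generation.py). Variables are visited in topological order, so the parents' values already exist when a child is generated; the child is the sum of a function of each parent plus independent noise. The name `generate_data`, the default `tanh` mechanism, and the Gaussian noise with `noise_scale` as its standard deviation are assumptions made for this example.

```python
import math
import random


def generate_data(adj, order, sample, f=math.tanh, noise_scale=1.0, seed=0):
    """Generate `sample` rows from a DAG given as an adjacency matrix.

    adj[i][j] == 1 means i -> j; `order` must be a topological order of the DAG.
    Hypothetical helper for illustration; not part of CADIMULC.
    """
    rng = random.Random(seed)
    n = len(adj)
    data = [[0.0] * n for _ in range(sample)]
    for j in order:  # parents always precede their children in `order`
        parents = [i for i in range(n) if adj[i][j] == 1]
        for row in data:
            # Additive effect of all parents, plus independent Gaussian noise.
            effect = sum(f(row[i]) for i in parents)
            row[j] = effect + rng.gauss(0.0, noise_scale)
    return data


# Chain 0 -> 1 -> 2, whose topological order is [0, 1, 2].
chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
data = generate_data(chain, order=[0, 1, 2], sample=500)
```

A root node has no parents, so its value is pure noise; swapping `f` for a linear function and the Gaussian draw for a uniform one would give a LiNGAM-style simulation instead.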
Running examples
CADIMULC is a lightweight Python repository without a sophisticated library API design. The documentation on this page is meant as introductory material for this practical causal discovery tool. For running examples, please check out the Quick Tutorials, which show straightforward usage in the "micro" workflow of causal discovery.
References
[1] Shimizu, Shohei, Patrik O. Hoyer, Aapo Hyvärinen, and Antti Kerminen. "A linear non-Gaussian acyclic model for causal discovery." Journal of Machine Learning Research. 2006.
[2] Chen, Wei, Ruichu Cai, Kun Zhang, and Zhifeng Hao. "Causal discovery in linear non-gaussian acyclic model with multiple latent confounders." IEEE Transactions on Neural Networks and Learning Systems. 2021.
[3] Bühlmann, Peter, Jonas Peters, and Jan Ernest. "CAM: Causal additive models, high-dimensional order search and penalized regression." The Annals of Statistics. 2014.
[4] Maeda, Takashi Nicholas, and Shohei Shimizu. "Causal additive models with unobserved variables." In Uncertainty in Artificial Intelligence. 2021.