Basic Usage

The two required input elements to run diffusion using DiffuPy are:
  1. A network/graph. (see Network-Input Formatting below)

  2. A dataset of scores. (see Scores-Input Formatting below)

You can run diffusion either from file paths or from Python objects:

from diffupy.diffuse import run_diffusion

# DATA INPUT and GRAPH as PATHs -> returned as *PandasDataFrame*
diffusion_scores = run_diffusion('~/data/input_scores.csv', '~/data/network.csv').as_pd_dataframe()

# DATA INPUT and GRAPH as Python OBJECTS -> exported *as_csv*
diffusion_scores = run_diffusion(input_scores, network).as_csv('~/output/diffusion_results.csv')
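
The paths above start with `~`. This section does not say whether run_diffusion expands the home-directory shortcut itself, so a cautious approach is to expand it before passing the path on. A minimal sketch, where `resolve_input_path` is a hypothetical helper and not part of DiffuPy:

```python
from pathlib import Path


def resolve_input_path(raw_path: str) -> str:
    """Expand '~' and return an absolute path string (hypothetical helper, not a DiffuPy function)."""
    return str(Path(raw_path).expanduser().resolve())


# The file does not need to exist for the path to be resolved.
scores_path = resolve_input_path('~/data/input_scores.csv')
network_path = resolve_input_path('~/data/network.csv')

print(scores_path)
print(network_path)
```

The resolved strings can then be passed wherever a path is expected.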

Methods

The default diffusion method is z, which applies a statistical normalization that has previously been shown to give better performance. Further parameters let you adapt the propagation procedure, for example by choosing among the available diffusion methods or by providing a custom method function. See Diffusion Methods and/or Method modularity.

diffusion_scores_select_method = run_diffusion(input_scores, network, method='raw')

from networkx import pagerank  # Custom method function

diffusion_scores_custom_method = run_diffusion(input_scores, network, method=pagerank)

You can also provide your own kernel method, or select one of those provided in the kernels.py module, and pass it as the kernel_method argument. By default, regularised_laplacian_kernel is used.

from diffupy.kernels import p_step_kernel  # Custom kernel calculation function

diffusion_scores_custom_kernel_method = run_diffusion(input_scores, network, method='raw', kernel_method=p_step_kernel)

In short, method selects the diffusion process, while kernel_method selects how the kernel is calculated.
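
To make the distinction concrete, the sketch below separates the two steps with NumPy. It assumes the textbook regularised Laplacian kernel, K = (I + sigma2 * L)^-1 with L = D - A, and assumes that raw diffusion is the product of the kernel and the input vector. Both functions are illustrative stand-ins, not DiffuPy's implementation, and the graph reuses the edges of the custom-network example (A-B, B-C, A-D).

```python
import numpy as np


def regularised_laplacian_kernel_sketch(adjacency: np.ndarray, sigma2: float = 1.0) -> np.ndarray:
    """Kernel step (illustrative): K = (I + sigma2 * L)^-1, where L = D - A."""
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    identity = np.eye(adjacency.shape[0])
    return np.linalg.inv(identity + sigma2 * laplacian)


def raw_diffusion_sketch(kernel: np.ndarray, input_scores: np.ndarray) -> np.ndarray:
    """Diffusion step (illustrative): propagate the input scores through the kernel."""
    return kernel @ input_scores


# Undirected graph with edges A-B, B-C, A-D; node order is A, B, C, D.
adjacency = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

# Only node A carries an input score.
input_scores = np.array([1.0, 0.0, 0.0, 0.0])

kernel = regularised_laplacian_kernel_sketch(adjacency)
diffused = raw_diffusion_sketch(kernel, input_scores)

for label, score in zip('ABCD', diffused):
    print(label, round(float(score), 4))
```

In this toy graph, A keeps the largest score and C, two edges away from A, receives the smallest. Swapping the kernel function changes how scores spread, while swapping the diffusion function changes how they are normalized or combined.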

Formatting

Before running diffusion algorithms on your network with DiffuPy, take into account the formats of the graph and of the input data/scores. Samples of the supported input scores and networks are specified here.

Input format

The input is preprocessed and then mapped onto the network before the diffusion. See input mapping or the process_input docs for further details. Here we outline the input formats that the preprocessing covers.

Scores

You can submit your dataset in any of the following formats:

  • CSV (.csv)

  • TSV (.tsv)

  • pandas.DataFrame

  • List

  • Dictionary

(see Input dataset examples)

You can either provide a path to a .csv or .tsv file:

from diffupy.diffuse import run_diffusion

diffusion_scores_from_file = run_diffusion('~/data/diffusion_scores.csv', network)

or pass a Python data structure as the input_scores parameter:

import pandas as pd

data = {'Node': ['A', 'B', ...],
        'NodeType': ['Metabolite', 'Gene', ...],
        ....
        }
df = pd.DataFrame(data, columns=['Node', 'NodeType', ...])

diffusion_scores_from_dict = run_diffusion(df, network)

Please ensure that the dataset minimally has a column ‘Node’ containing node IDs. You can also optionally add the following columns to your dataset:

  • NodeType

  • LogFC (log2 fold change)

  • p-value
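
Before submitting a file, you may want to confirm that its header matches the columns listed above. The sketch below uses only the standard library. `check_score_columns` is a hypothetical helper, not part of DiffuPy, and it assumes the column names are spelled exactly as in the list.

```python
import csv
import io

REQUIRED_COLUMNS = {'Node'}
OPTIONAL_COLUMNS = {'NodeType', 'LogFC', 'p-value'}


def check_score_columns(csv_text: str):
    """Return (missing required columns, optional columns present) for a scores CSV.

    Hypothetical helper for illustration; column spellings are assumed.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    missing = sorted(REQUIRED_COLUMNS - header)
    present = sorted(OPTIONAL_COLUMNS & header)
    return missing, present


sample = "Node,LogFC,p-value\nA,4,0.03\nB,-1,0.05\n"
missing, optional_present = check_score_columns(sample)

print('missing required columns:', missing)
print('optional columns found:', optional_present)
```

An empty list of missing columns means the minimal requirement is met.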

Networks

If you would like to submit your own networks, please ensure they are in one of the following formats:

  • BEL (.bel)

  • CSV (.csv)

  • Edge list (.lst)

  • GML (.gml or .xml)

  • GraphML (.graphml or .xml)

  • Pickle (.pickle). BELGraph object from PyBEL 0.13.2

  • TSV (.tsv)

  • TXT (.txt)

At a minimum, please ensure that each of the following columns is included in the network file you submit:

  • Source

  • Target

Optionally, you can add a third column, “Relation”, to your network (as in the example below). If the relation between the Source and Target nodes is omitted, and/or if the directionality is ambiguous, either node can be assigned as the Source or Target.
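
As an illustration of this layout, the following sketch writes an edge list to CSV text with the three column headers and reads it back. It uses only the standard library. `network_to_csv` is a hypothetical helper, and the edges mirror the Custom-network example.

```python
import csv
import io

# (Source, Target, Relation) triples taken from the Custom-network example.
edges = [
    ('A', 'B', 'Increase'),
    ('B', 'C', 'Association'),
    ('A', 'D', 'Association'),
]


def network_to_csv(edge_list) -> str:
    """Serialize edges as CSV text with a Source,Target,Relation header (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['Source', 'Target', 'Relation'])
    writer.writerows(edge_list)
    return buffer.getvalue()


csv_text = network_to_csv(edges)
print(csv_text)

# Reading the text back gives one dictionary per edge, keyed by column name.
rows = list(csv.DictReader(io.StringIO(csv_text)))
```

To produce a file instead of a string, write the returned text to a path ending in .csv.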

Kernel

If you have a precalculated kernel, you can provide the kernel object directly, without also providing a graph object. As mentioned above, if you wish to use your own kernel method function, you can pass it as the kernel_method argument of the function described previously.

Input dataset examples

DiffuPy accepts several input formats, which can be encoded in different ways. See the diffusion scores summary for more details on how the input labels are treated by each available method.

1. You can provide a dataset with a column ‘Node’ containing node IDs.

  Node
  ----
  A
  B
  C
  D

from diffupy.diffuse import run_diffusion

diffusion_scores = run_diffusion(dataframe_nodes, network)

Also as a list of nodes:

['A', 'B', 'C', 'D']
diffusion_scores = run_diffusion(['A', 'B', 'C', 'D'], network)

2. You can also provide a dataset with a column ‘Node’ containing node IDs as well as a column ‘NodeType’, indicating the entity type of the node to run diffusion by entity type.

  Node  NodeType
  ----  ----------
  A     Gene
  B     Gene
  C     Metabolite
  D     Gene

Also as a dictionary of type:list of nodes:

{'Gene': ['A', 'B', 'D'], 'Metabolite': ['C']}
diffusion_scores = run_diffusion({'Gene': ['A', 'B', 'D'], 'Metabolite': ['C']}, network)
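
If your data starts out as (Node, NodeType) rows like the table above, the dictionary form can be derived by grouping nodes under their type. A minimal sketch, where `group_nodes_by_type` is a hypothetical helper and not a DiffuPy function:

```python
from collections import defaultdict

# (Node, NodeType) rows from the table above.
rows = [
    ('A', 'Gene'),
    ('B', 'Gene'),
    ('C', 'Metabolite'),
    ('D', 'Gene'),
]


def group_nodes_by_type(node_rows):
    """Group node IDs under their entity type, preserving row order (hypothetical helper)."""
    grouped = defaultdict(list)
    for node, node_type in node_rows:
        grouped[node_type].append(node)
    return dict(grouped)


nodes_by_type = group_nodes_by_type(rows)
print(nodes_by_type)
```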

3. You can also choose to provide a dataset with a column ‘Node’ containing node IDs as well as a column ‘LogFC’ with their log2 fold changes. You may also add a ‘NodeType’ column to run diffusion by entity type.

  Node  LogFC
  ----  -----
  A     4
  B     -1
  C     1.5
  D     3

Also as a dictionary of node:score_value:

{'A': 4, 'B': -1, 'C': 1.5, 'D': 3}
diffusion_scores = run_diffusion({'A': 4, 'B': -1, 'C': 1.5, 'D': 3}, network)

Combining this with point 2, you can also indicate the node type:

  Node  LogFC  NodeType
  ----  -----  ----------
  A     4      Gene
  B     -1     Gene
  C     1.5    Metabolite
  D     3      Gene

Also as a dictionary of type:node:score_value:

{'Gene': {'A': 4, 'B': -1, 'D': 3}, 'Metabolite': {'C': 1.5}}

diffusion_scores = run_diffusion({'Gene': {'A': 4, 'B': -1, 'D': 3}, 'Metabolite': {'C': 1.5}}, network)
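
The score dictionaries in point 3 can be built from tabular data with a few lines of standard-library code. The sketch below returns a flat node:score dictionary when there is no NodeType column and a nested type:node:score dictionary when there is one. `scores_to_dict` is a hypothetical helper, and the CSV text reproduces the rows of the LogFC tables above.

```python
import csv
import io


def scores_to_dict(csv_text: str) -> dict:
    """Build node:score, or type:node:score if a NodeType column exists (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    has_type = 'NodeType' in (reader.fieldnames or [])
    result = {}
    for row in reader:
        score = float(row['LogFC'])
        if has_type:
            # Nest each node's score under its entity type.
            result.setdefault(row['NodeType'], {})[row['Node']] = score
        else:
            result[row['Node']] = score
    return result


flat_csv = "Node,LogFC\nA,4\nB,-1\nC,1.5\nD,3\n"
typed_csv = "Node,LogFC,NodeType\nA,4,Gene\nB,-1,Gene\nC,1.5,Metabolite\nD,3,Gene\n"

flat_scores = scores_to_dict(flat_csv)
typed_scores = scores_to_dict(typed_csv)

print(flat_scores)
print(typed_scores)
```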

4. Finally, you can provide a dataset with a column ‘Node’ containing node IDs, a column ‘LogFC’ with their log2 fold changes and a column ‘p-value’ with adjusted p-values. You may also add a ‘NodeType’ column to run diffusion by entity type.

  Node  LogFC  p-value
  ----  -----  -------
  A     4      0.03
  B     -1     0.05
  C     1.5    0.001
  D     3      0.07

In Python, this format is only accepted as a pandas DataFrame.

See the sample datasets directory for example files.

Custom-network example

  Source  Target  Relation
  ------  ------  -----------
  A       B       Increase
  B       C       Association
  A       D       Association

You can also take a look at our sample networks folder for some examples.

Input Mapping/Coverage

Although it does not affect how you supply the input, the mapping of the input entities onto the background network matters when assessing the diffusion process, because the input coverage determines which entity scores are actually diffused. In other words, only the entities whose labels match an entity in the network are processed further for diffusion.

Running diffusion will report the mapping as follows:

Mapping descriptive statistics

wikipathways:
gene_nodes  (474 mapped entities, 15.38% input coverage)
mirna_nodes  (2 mapped entities, 4.65% input coverage)
metabolite_nodes  (12 mapped entities, 75.0% input coverage)
bp_nodes  (1 mapped entities, 0.45% input coverage)
total  (489 mapped entities, 14.54% input coverage)

kegg:
gene_nodes  (1041 mapped entities, 33.80% input coverage)
mirna_nodes  (3 mapped entities, 6.98% input coverage)
metabolite_nodes  (6 mapped entities, 0.375% input coverage)
bp_nodes  (12 mapped entities, 5.36% input coverage)
total  (1062 mapped entities, 31.58% input coverage)

reactome:
gene_nodes  (709 mapped entities, 23.02% input coverage)
mirna_nodes  (1 mapped entities, 2.33% input coverage)
metabolite_nodes  (6 mapped entities, 37.5% input coverage)
total  (716 mapped entities, 22.8% input coverage)

total:
gene_nodes  (1461 mapped entities, 43.44% input coverage)
mirna_nodes  (4 mapped entities, 0.12% input coverage)
metabolite_nodes  (13 mapped entities, 0.38% input coverage)
bp_nodes  (13 mapped entities, 0.39% input coverage)
total  (1491 mapped entities, 44.34% input coverage)
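
The idea behind these figures can be sketched as a set intersection between the input labels and the network nodes. The example below assumes exact label matching and defines coverage as the number of mapped entities divided by the number of distinct input labels. DiffuPy's own preprocessing may match labels differently or use another denominator, so treat `mapping_coverage` as a hypothetical illustration.

```python
def mapping_coverage(input_labels, network_nodes):
    """Return (mapped labels, percent of distinct input labels found in the network).

    Hypothetical helper: assumes exact string matching between labels and nodes.
    """
    input_set = set(input_labels)
    mapped = input_set & set(network_nodes)
    coverage = 100 * len(mapped) / len(input_set) if input_set else 0.0
    return mapped, coverage


input_labels = ['A', 'B', 'C', 'X']   # 'X' has no counterpart in the network
network_nodes = ['A', 'B', 'C', 'D']

mapped, coverage = mapping_coverage(input_labels, network_nodes)
print(f"total  ({len(mapped)} mapped entities, {coverage:.2f}% input coverage)")
```

Here three of the four input labels are found in the network, so only their scores would take part in the diffusion.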

To see the mapping coverage graphically, you can also plot a heatmap view of the mapping (see views). To see how the mapping is performed within an input preprocessing pipeline, take a look at this Jupyter Notebook or see the process_input docs in DiffuPy.

Output format

The returned format is a custom Matrix type, with node labels as rows and a column with the diffusion score, which can be exported into the following formats:

diffusion_scores.to_dict()
diffusion_scores.as_pd_dataframe()
diffusion_scores.as_csv()
diffusion_scores.to_nx_graph()