Tutorial 5: Cell-type annotation on scST data
We utilized MERFISH mouse hypothalamic data (ID=15) to test the annotation accuracy of STABox-STAMapper, the scST dataset with a manual annotation is provided. The scRNA-seq dataset and the scST dataset can be download from https://drive.google.com/drive/folders/1xP3Fh94AwKu4OsH3khGq-KEw0VCoiRnL (MERFISH_hypothalamic.zip).
[1]:
from pathlib import Path
import numpy as np
import pandas as pd
import scanpy as sc
import os
import shutil
[2]:
from stabox.model import STAMapper
import warnings
warnings.filterwarnings("ignore")
D:\Users\lqlu\work\software\Anaconda\envs\new_STABox_env\lib\site-packages\tqdm\auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
[3]:
key_class1 = 'celltype'#celltype label for adata_sc, stored in adata_sc.obs[key_class1]
key_class2 = 'celltype'#celltype label for adata_sp, stoted in adata_sp.obs[key_class2]
key_classes = [key_class1, key_class2]
[4]:
path = 'D:/Users/lqlu/work/Data/STABox_Data/MERFISH_hypothalamic/scRNA/hypothalamic_sc.h5ad'
[5]:
adata_sc = sc.read_h5ad(r'D:/Users/lqlu/work/Data/STABox_Data/MERFISH_hypothalamic/scRNA/hypothalamic_sc.h5ad')
adata_sp = sc.read_h5ad(r'D:/Users/lqlu/work/Data/STABox_Data/MERFISH_hypothalamic/spatial/MERFISH_hypothalamic_15.h5ad')
[6]:
sc_name = 'hypothalamic_sc'
sp_name = 'hypothalamic_MERFISH'
#change dsnames for a new training task
dsnames = (sc_name, sp_name)
[7]:
adata_sc
[7]:
AnnData object with n_obs × n_vars = 30370 × 20320
obs: 'celltype'
var: 'gene_ids', 'n_cells'
[8]:
adata_sp
[8]:
AnnData object with n_obs × n_vars = 18586 × 161
obs: 'Animal_ID', 'celltype'
var: 'n_cells'
obsm: 'spatial'
[9]:
adatas = [adata_sc, adata_sp]
[10]:
adatas
[10]:
[AnnData object with n_obs × n_vars = 30370 × 20320
obs: 'celltype'
var: 'gene_ids', 'n_cells',
AnnData object with n_obs × n_vars = 18586 × 161
obs: 'Animal_ID', 'celltype'
var: 'n_cells'
obsm: 'spatial']
[11]:
outputs = STAMapper.training(adatas=adatas, dsnames=dsnames, key_classes=key_classes)
Data Preprocessing!
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
[leiden] Time used: 238.7358 s
a new directory made:
_temp\('hypothalamic_sc', 'hypothalamic_MERFISH')\0\figs
already exists:
_temp\('hypothalamic_sc', 'hypothalamic_MERFISH')\0
[*] Setting dataset names:
0-->hypothalamic_sc
1-->hypothalamic_MERFISH
[*] Setting aligned features for observation nodes (self._features)
[*] Setting observation-by-variable adjacent matrices (`self._ov_adjs`) for making merged graph adjacent matrix of observation and variable nodes
-------------------- Summary of the DGL-Heterograph --------------------
Graph(num_nodes={'cell': 48956, 'gene': 154},
num_edges={('cell', 'express', 'gene'): 1451057, ('cell', 'self_loop_cell', 'cell'): 48956, ('cell', 'similar_to', 'cell'): 315368, ('gene', 'expressed_by', 'cell'): 1451057, ('gene', 'self_loop_gene', 'gene'): 154},
metagraph=[('cell', 'gene', 'express'), ('cell', 'cell', 'self_loop_cell'), ('cell', 'cell', 'similar_to'), ('gene', 'cell', 'expressed_by'), ('gene', 'gene', 'self_loop_gene')])
self-loops for observation-nodes: True
self-loops for variable-nodes: True
AlignedDataPair with 48956 obs- and 154 var-nodes
n_obs1 (hypothalamic_sc): 30370
n_obs2 (hypothalamic_MERFISH): 18586
Dimensions of the obs-node-features: 154
main directory: _temp\('hypothalamic_sc', 'hypothalamic_MERFISH')\0
model directory: _temp\('hypothalamic_sc', 'hypothalamic_MERFISH')\0\_models
============== start training (device='cuda') ==============
It will take minutes from epoch 0 to 100, please wait!
Epoch 0000 | Train Acc: 0.5054 | Test: 0.5148 (max=0.5148) | AMI=0.3144 | Time: 14.1613
Epoch 0050 | Train Acc: 0.9674 | Test: 0.8378 (max=0.8378) | AMI=0.6519 | Time: 10.2342
[current best] model weights backup
Epoch 0099 | Train Acc: 0.9817 | Test: 0.8559 (max=0.8559) | AMI=0.6810 | Time: 9.5997
Epoch 0100 | Train Acc: 0.9841 | Test: 0.8475 (max=0.8559) | AMI=0.6730 | Time: 9.5985
[current best] model weights backup
Epoch 0102 | Train Acc: 0.9843 | Test: 0.8555 (max=0.8559) | AMI=0.6817 | Time: 9.5939
[current best] model weights backup
Epoch 0106 | Train Acc: 0.9841 | Test: 0.8574 (max=0.8574) | AMI=0.6900 | Time: 9.5863
[current best] model weights backup
Epoch 0108 | Train Acc: 0.9859 | Test: 0.8601 (max=0.8601) | AMI=0.6922 | Time: 9.5800
Epoch 0150 | Train Acc: 0.9916 | Test: 0.8540 (max=0.8601) | AMI=0.6784 | Time: 9.6373
[current best] model weights backup
Epoch 0170 | Train Acc: 0.9918 | Test: 0.8598 (max=0.8601) | AMI=0.6950 | Time: 9.6166
Epoch 0200 | Train Acc: 0.9929 | Test: 0.8528 (max=0.8601) | AMI=0.6784 | Time: 9.5164
[current best] model weights backup
Epoch 0230 | Train Acc: 0.9933 | Test: 0.8621 (max=0.8621) | AMI=0.6955 | Time: 9.4783
[current best] model weights backup
Epoch 0237 | Train Acc: 0.9939 | Test: 0.8579 (max=0.8621) | AMI=0.6964 | Time: 9.4771
[current best] model weights backup
Epoch 0247 | Train Acc: 0.9945 | Test: 0.8630 (max=0.8630) | AMI=0.7002 | Time: 9.4642
Epoch 0250 | Train Acc: 0.9935 | Test: 0.8547 (max=0.8630) | AMI=0.6957 | Time: 9.4615
[current best] model weights backup
Epoch 0277 | Train Acc: 0.9947 | Test: 0.8608 (max=0.8630) | AMI=0.7002 | Time: 9.4659
[current best] model weights backup
Epoch 0283 | Train Acc: 0.9942 | Test: 0.8622 (max=0.8630) | AMI=0.7039 | Time: 9.4768
[current best] model weights backup
Epoch 0285 | Train Acc: 0.9938 | Test: 0.8591 (max=0.8630) | AMI=0.7063 | Time: 9.4758
Epoch 0300 | Train Acc: 0.9941 | Test: 0.8434 (max=0.8630) | AMI=0.6709 | Time: 9.4846
Epoch 0350 | Train Acc: 0.9954 | Test: 0.8478 (max=0.8645) | AMI=0.6805 | Time: 9.4704
Epoch 0400 | Train Acc: 0.9941 | Test: 0.8588 (max=0.8678) | AMI=0.6926 | Time: 9.4040
Epoch 0450 | Train Acc: 0.9949 | Test: 0.8554 (max=0.8678) | AMI=0.6878 | Time: 9.3473
Epoch 0500 | Train Acc: 0.9936 | Test: 0.8538 (max=0.8678) | AMI=0.6853 | Time: 9.3259
The AMI didn't increase for the last 100 epohs, early stopping!
Epoch 0501 | Train Acc: 0.9937 | Test: 0.8566 (max=0.8678) | AMI=0.6806 | Time: 9.3289
figure has been saved into:
_temp\('hypothalamic_sc', 'hypothalamic_MERFISH')\0\figs\cluster_index.png
states loaded from: _temp\('hypothalamic_sc', 'hypothalamic_MERFISH')\0\_models\weights_epoch285.pt
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:05<00:00, 1.01it/s]
WARNING:root:An error occurred when plotting results: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
object saved into:
_temp\('hypothalamic_sc', 'hypothalamic_MERFISH')\0\datapair_init.pickle
[15]:
from stabox.model import pp, pl
[16]:
folder_path = Path(".") /'_temp' / f'{dsnames}'
best_model, accuracy, macrof1, weightedf1 = pp.calc_scores(folder_path, key_class2)
print(f"Best Model: {best_model}, Accuracy: {accuracy:.3f}, Macro F1 Score: {macrof1:.3f}, Weighted F1 Score: {weightedf1:.3f}")
Best Model: 0, Accuracy: 0.859, Macro F1 Score: 0.573, Weighted F1 Score: 0.861
[17]:
obs = pd.read_csv(folder_path/f'{best_model}'/'obs.csv',index_col=0)
obs = obs[obs['dataset'].isin([sp_name])]
adata_sp.obs['predicted'] = list(obs['predicted'])
[18]:
sc.pp.normalize_total(adata_sp, target_sum=1e4)
sc.pp.log1p(adata_sp)
adata_sp.raw = adata_sp
#sc.pp.scale(adata_sp, max_value=10)#optional
sc.tl.pca(adata_sp, svd_solver="arpack")
sc.pp.neighbors(adata_sp, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata_sp)
[19]:
palette = {'Astrocytes': '#1f77b4',
'Inhibitory': '#9467bd',
'Endothelial': '#ff7f0e',
'Ependymal': '#2ca02c',
'Excitatory': '#d62728',
'Fibroblast': '#f3bb70',
'Macrophage': '#e7ac18',
'Microglia': '#8c564b',
'Mural': '#bcbd22',
'OD Immature': '#aec7e8',
'OD Mature': '#7f7f7f',
'OD Newly formed': '#fda1a0'}
sc.pl.umap(adata_sp, color=["celltype", "predicted"], ncols=1, palette=palette)
[ ]: