Vignette to use singletCode package

The input needed to run singletCode is a .csv file that contains the information about cell ID (added while sequencing), lineage barcode, and sample name. Each row should be repeated n times where n is the number of UMIs associated with that barcode and cell ID combination. You can download a sample input sheet here. It is a subset of data from Jiang Et al and details about it are described in the singletCode paper in detail. The folder also contains expected output files in the test folder within the outputFiles folder. This vignette can be downloaded as a jupyter notebook from the singletCode Tools repo.

Install singletCode package

!pip3 install singletCode

Import necessary functions from it

from singletCode import check_sample_sheet, get_singlets

Read in input sheet

# Read in input sheet
import pandas as pd
path = "path/to/downloaded/and/unzipped/folder"
pathToInputSheet = f"{path}/inputFiles/JiangEtAlSubset_InputSheet.csv"
df = pd.read_csv(pathToInputSheet)

Check formatting of input sheet

check_sample_sheet(df)
The sample sheet provided can be used as input to get_singlets to get a list of singlets identified.

The sample sheet provided can be used as input to get_singlets to get a list of singlets identified.

Identify singlets from input sheet

outputPath = "path/to/output/folder"
cellLabelList, stats = get_singlets(df, dataset_name= "JiangEtAlSubset", save_all_singlet_categories = True, output_path=outputPath)
INFO: Raw data counts:
sample
1    1306
Name: count, dtype: int64
Total cells for sample 1: 39
INFO: Using ratio based filtering.
Current Sample Adjusted UMI cutoff: 2
100%|██████████| 122/122 [00:00<00:00, 11475.27it/s]
All singlets identified with multiple barcodes are unique? True
Total Singlets: 10
Total Multiplets: 9
stats.to_csv(f"{outputPath}/JiangEtAlSubset_stats.csv")
cellLabelList[cellLabelList['label'] == "Singlet"].to_csv(f"{outputPath}/JiangEtAlSubset_singletList.csv")

Visualizing the distribution of cells into low UMI, different kinds of singlets and undetermined

import matplotlib.pyplot as plt
#Plotting the distribution of low UMI cells, different kinds of singlets, and undetermined cells.
colors = ['#62575b', '#2175a8', '#feb422', '#d62728', '#d4d4d4']  # Example colors, modify as needed
plotData = stats.set_index('sample', inplace=False).drop(columns = ['dataset', 'total_cells', "total_singlets"])

# Plotting
ax = plotData.plot(kind='barh', stacked=True, figsize=(10, 7), color=colors)

for p in ax.patches:
    ax.annotate(f'{int(p.get_width())}', (p.get_x() + p.get_width()/2, p.get_y() + p.get_height()/2), ha='right', va='center')

ax.set_xlabel('Total cells')
ax.set_title('Distribution of Singlets by Criteria')
plt.show()
../_images/singletCodePackageVignette_12_0.png

The above plot shows that the data we had contained different kind of singlets: 6 single-barcode cells, 2 cells which had more than one barcode but with same combination being present in more than one cell, 2 cells that had one dominant barcode. The data also contained 9 cells which singletCode could not determine as being truly singlets and 20 cells whose barcode UMI counts were below the set threshold.

Understanding the output files

To understand some of the files in the output, we can look at cell IDs and their data in the original input sheet

For the dominant_umi_singlets, there are two cell IDs. One of them is TGTAAGCGTCTCGCGA. If we look at that entry in the input sheet and count the number of UMI associated with each barcode, we see that one barcode has 99 UMI counts while the second highest UMI count is 7. So, the cell most likely has only one barcode associated with it and hence, a singlet.

import pandas as pd
df[df['cellID'] == 'TGTAAGCGTCTCGCGA'].groupby(['cellID', 'barcode', 'sample']).size().reset_index(name='count').sort_values('count', ascending=False).reset_index(drop=True)
cellID barcode sample count
0 TGTAAGCGTCTCGCGA ATTGTTGTTGCAGATGCAGTTGATGCTGATGAAGTTGTACAAGGTC... 1 99
1 TGTAAGCGTCTCGCGA ATTCGACTTGATCTTCTAGAACATGGTGAACTAGCAGGTGCTGATC... 1 7
2 TGTAAGCGTCTCGCGA ATACTAGCTCAAGCAGTACTACTACTTCGTCTTCATGCAGAACAAC... 1 6
3 TGTAAGCGTCTCGCGA ATAGATGCACTTGGTGGTCGAGTTCTAGTTGTAGCTGATCGTCCAG... 1 6
4 TGTAAGCGTCTCGCGA ATTCGACCAGAACCACATGCAGTTCAACGTGTTCGAGGTGTAGATG... 1 6
... ... ... ... ...
82 TGTAAGCGTCTCGCGA ATAGTAGTAGCTGTTGGTGTTGAAGTACTTCCTCTTGCTCCTCGTG... 1 1
83 TGTAAGCGTCTCGCGA ATAGTAGATGAACGTCCTCTACATGTTCTTCGTCAAGTACCAGCAC... 1 1
84 TGTAAGCGTCTCGCGA ATAGTACATGGTGGACCTGGACTTCGAGATGGAGCTCTTGTTCCTG... 1 1
85 TGTAAGCGTCTCGCGA ATAGGAGTAGTTGGTGATGGTCTACCAGAAGGTGAAGGTGGAGAAG... 1 1
86 TGTAAGCGTCTCGCGA GGTGCTCAACTTCTTGTTGTACTTCTAGTTGATGTTGGACGTCATC... 1 1

87 rows × 4 columns

Next, we can look at multi-barcode singlets. There are two cell IDs: AGGCTGCTCTTTCCGG and GAGGGATGTAACATCC. If we look at the barcodes with greater than 2 UMI counts, we see that they have the same combination. The only way this can occur is if a cell receives multiple barcode initially and then divides.

(df[df['cellID'] == 'AGGCTGCTCTTTCCGG']
 .groupby(['cellID', 'barcode', 'sample'])
 .size()
 .reset_index(name='count')
 .sort_values('count', ascending=False)
 .query('count >= 2')
 .reset_index(drop=True)
)
cellID barcode sample count
0 AGGCTGCTCTTTCCGG ATAGGAGTAGTTGGTGATGGTCTACCAGAAGGTGAAGGTGGAGAAG... 1 13
1 AGGCTGCTCTTTCCGG ATTGAACGTGGAGTTGAACTTGTACTACGAGTACGTCTAGAACATG... 1 2

scRNAseq data

Further single-cell RNAseq analysis with both scRNAseq data and singlet information from singletCode output

Install and import scanpy for further single-cell RNAseq analysis

!pip scanpy[leiden]
#Import scanpy
import scanpy as sc

Reading the scRNAseq input data in h5ad format

#Reading the scRNAseq data in h5ad format
adata = sc.read_h5ad(f"{path}/inputFiles/JiangEtAlSubset_scRNAseqData.h5ad")
adata
AnnData object with n_obs × n_vars = 39 × 36601
    var: 'gene_ids', 'feature_types'
Making copies of singletCode input/output to use them along with scRNAseq data. The -1 is added to cell IDs to match the cell IDs seen in 10x format data.
NOTE: It may not be needed for your actual data.
singleCellDf = df.copy()
singleCellDf['cellID'] = singleCellDf['cellID'] + "-1"
singleCellDf = singleCellDf.drop_duplicates(subset = 'cellID')
cellLabelListSingleCell = cellLabelList.copy()
cellLabelListSingleCell['cellID'] = cellLabelListSingleCell['cellID'] + "-1"
cellLabelListSingleCell = cellLabelListSingleCell.drop_duplicates(subset='cellID').reset_index(drop = True)

Calculating total counts and genes identified per cell.

NOTE: In this vignette we are not doing any actual QC - but in actual analysis, it would need to be done.

sc.pp.calculate_qc_metrics(adata, inplace=True)

Calculating PCA and plotting variance ratio vs ranking

sc.tl.pca(adata)
sc.pl.pca_variance_ratio(adata, n_pcs=10)
../_images/singletCodePackageVignette_28_0.png

Identifying cells that were thresholded by singletCode as low UMI by identifying cells that were in the original list provided to singletCode but not labeled as either singlet or undetermined. Then creating a list of annotations of singletStatus(singlet, multiplet, low UMI) for all cells

umiCutoff = pd.DataFrame(
    singleCellDf.loc[~singleCellDf['cellID'].isin(cellLabelListSingleCell['cellID']), 'cellID']
    .drop_duplicates()
    .reset_index(drop=True),
    columns=['cellID']
)
umiCutoff['label'] = "Low UMI"
cellIDLabels = cellLabelListSingleCell.drop(columns = ['barcode', 'sample', 'nUMI']).drop_duplicates().reset_index(drop = True)
#Creating a list of cell IDs with annotation of whether singlet, multiplet or low UMI.
labelID = pd.concat([umiCutoff, cellIDLabels]).reset_index(drop=True)
labelID = labelID.set_index(labelID['cellID']).drop(columns = ['cellID'])
#Adding the labels to cells in the adata to visualise it
adata.obs["singletStatus"] = labelID

Visualising the cells in PCA space

sc.pl.pca(
    adata,
    color = ['n_genes_by_counts', 'total_counts', 'singletStatus'],
    size = 250
)
../_images/singletCodePackageVignette_34_0.png

Calculating neigbours and UMAP for further visualisation

sc.pp.neighbors(adata)
sc.tl.umap(adata, random_state=101010)
sc.pl.umap(
    adata,
    color=['singletStatus'],
    # Setting a smaller point size to get prevent overlap
    size=250,
)
../_images/singletCodePackageVignette_37_0.png

Saving the AnnData

adata.write(f"{outputPath}/JiangEtAlSubset.h5ad")