singletCode PyPI package

Availability

The package is available on PyPI as singletCode.

How do I use it?

  1. Installation

    It can be installed from PyPI by running the following in the terminal:

    pip3 install singletCode

    Once it is installed, import the two functions used below in Python:

    from singletCode import check_sample_sheet, get_singlets
    
  2. Preparing the input sample sheet.

    The input sample sheet is a .csv file that contains the cell ID (added while sequencing), the lineage barcode, and the sample name. Each row should be repeated n times, where n is the number of UMIs associated with that barcode and cell ID combination. To create the input, read the .csv file in as a pandas dataframe.

    import pandas as pd
    df = pd.read_csv("path/to/csv/file.csv")
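
If your data is stored as one row per cell ID and barcode with a UMI count, it has to be expanded into the repeated-row layout described above. The following is a minimal sketch of one way to do that with pandas. The counts table, its nUMI column name, and all values are invented for illustration and are not produced by singletCode.

```python
import pandas as pd

# Hypothetical per-barcode UMI counts; the values are made up for illustration.
counts = pd.DataFrame({
    "cellID": ["AAAC", "AAAC", "TTTG"],
    "barcode": ["BC1", "BC2", "BC1"],
    "sample": ["Sample1", "Sample1", "Sample1"],
    "nUMI": [3, 1, 2],
})

# Repeat each row nUMI times, then keep only the three expected columns.
sample_sheet = (
    counts.loc[counts.index.repeat(counts["nUMI"]), ["cellID", "barcode", "sample"]]
    .reset_index(drop=True)
)
```

The resulting sample_sheet has one row per UMI and can be passed on in place of df in the steps that follow.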
    

    You can check whether the format and the column names are valid for running singletCode using the check_sample_sheet() function.

    check_sample_sheet(df)
    

    It will either report an error explaining how to modify the sample sheet to make it a valid input, or print the message “The sample sheet provided can be used as input to get_singlets to get a list of singlets identified”, in which case you can move on to the next step.
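
To give a feel for the kind of check involved, here is a rough standalone sketch that tests for the three required columns and their order. It is not the package's implementation: the function name find_sample_sheet_problems and its messages are hypothetical, and the real check_sample_sheet() may verify more than this.

```python
import pandas as pd

REQUIRED_COLUMNS = ["cellID", "barcode", "sample"]

def find_sample_sheet_problems(df: pd.DataFrame) -> list:
    """Hypothetical helper: return a list of layout problems (empty if none)."""
    problems = []
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    else:
        # All required columns exist; check their relative order.
        present = [col for col in df.columns if col in REQUIRED_COLUMNS]
        if present != REQUIRED_COLUMNS:
            problems.append(f"columns out of order: {present}")
    return problems

good = pd.DataFrame({"cellID": ["c1"], "barcode": ["BC1"], "sample": ["Sample1"]})
bad = pd.DataFrame({"barcode": ["BC1"], "cellID": ["c1"]})
print(find_sample_sheet_problems(good))
print(find_sample_sheet_problems(bad))
```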

  3. Running get_singlets() to get an assignment of singlet status for each cell ID and barcode combination.

    This is the step where singlet identification is done using the singletCode framework. You can read more about the parameters for get_singlets() below. In this example, default values are used.

    cellLabelList, stats = get_singlets(df, dataset_name = "Sample1")
    

    cellLabelList is a pandas dataframe that contains 5 columns: cellID, barcode, sample, nUMI and label. The label column indicates whether the particular cell ID and barcode combination has been called a singlet or not.

    To get a dataframe containing just the singlets, you can run this:

    singletList = cellLabelList[cellLabelList["label"] == "Singlet"]
    

    stats contains the statistics for each sample present in your dataset: the total number of cells, the total number of singlets, the number of singlets recovered from different categories of singlets (such as single-barcode singlets, multi-barcode singlets, and dominant-UMI singlets), the number of cells removed due to low UMI counts for the barcode, and the number of indeterminate cells. Cells are reported as indeterminate because singletCode can identify cells that are truly singlets but cannot be certain that the remaining cells are not singlets.

    You can save the list of singlets in a .txt file, along with the list of cell IDs in each category of singlets, by setting the save_all_singlet_categories parameter to True.
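
If you want a quick per-sample tally of your own from the label column, something along these lines may help. The frame below is a mock stand-in for cellLabelList with invented values. Only the "Singlet" label comes from the example above; "Other" is a placeholder, since the other label values the package uses are not listed here.

```python
import pandas as pd

# Mock stand-in for cellLabelList; all values are invented and
# "Other" is a placeholder label, not necessarily one singletCode emits.
cell_labels = pd.DataFrame({
    "cellID": ["c1", "c2", "c3", "c4"],
    "barcode": ["BC1", "BC2", "BC3", "BC1"],
    "sample": ["Sample1", "Sample1", "Sample2", "Sample2"],
    "nUMI": [12, 3, 40, 7],
    "label": ["Singlet", "Other", "Singlet", "Singlet"],
})

# Count distinct cells and rows labelled "Singlet" for each sample.
per_sample = cell_labels.groupby("sample").agg(
    total_cells=("cellID", "nunique"),
    singlets=("label", lambda labels: int((labels == "Singlet").sum())),
)
print(per_sample)
```

This only summarises the label column and does not reproduce the category breakdown that stats provides.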

Detailed information about the parameters for the functions in the package

get_singlets

get_singlets(sample_sheet, output_path=None, dataset_name=None, save_all_singlet_categories=False, save_plot_umi=False, umi_cutoff_method='ratio', umi_cutoff_ratio=7.5e-06, umi_cutoff_percentile=None, min_umi_cutoff=2, umi_diff_threshold=50, umi_dominant_threshold=10)

Function that takes the sample sheet and other parameters and runs them through count_doublets to get a list of singlets in the sample. Each repeated row is assumed to represent one UMI associated with a barcode in that cell (identified by the cellID).
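
To make the repeated-row convention concrete, the sketch below collapses a toy sample sheet into one row per cell ID and barcode with a UMI count. It illustrates the counting convention with pandas only and is not the package's internal code. The data values are invented, and the nUMI column name is borrowed from the cellLabelList output described earlier.

```python
import pandas as pd

# Toy sample sheet: each repeated row stands for one UMI.
sample_sheet = pd.DataFrame({
    "cellID": ["c1", "c1", "c1", "c1", "c2", "c2"],
    "barcode": ["BC1", "BC1", "BC1", "BC2", "BC1", "BC1"],
    "sample": ["Sample1"] * 6,
})

# Count how many times each (cellID, barcode, sample) row occurs.
umi_counts = (
    sample_sheet.groupby(["cellID", "barcode", "sample"])
    .size()
    .reset_index(name="nUMI")
)
print(umi_counts)
```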

Parameters
  • sample_sheet (DataFrame) – A dataframe that contains 3 columns: cellID, barcode, sample.

  • dataset_name (str) – The name of the dataset being analysed. It will be in the name of all saved files and be a column in the singlet_stats sheet returned.

  • output_path (str, optional) – The path to store any output files, including plots that show the UMI distribution and the umi_cutoff used, and csv files containing singlets of different categories. If None, then the list of singlets will be returned but it won’t contain information about what category of singlet each cell is. Defaults to None.

  • save_all_singlet_categories (bool, optional) – If true, then singlets of each category are saved separately in csv files along with all singlets and all non-singlets. Defaults to False.

  • save_plot_umi (bool, optional) – If true, then plots showing UMI distribution indicating the UMI cutoff used will be saved for each sample. Defaults to False.

  • umi_cutoff_method (str, optional) – Specify if quality control for barcodes using UMI counts should be based on “ratio” or “percentile”. Defaults to ‘ratio’.

  • umi_cutoff_ratio (float, optional) – The ratio used to determine the umi_cutoff if umi_cutoff_method is “ratio”. Defaults to 3/4e5 (that is, 7.5e-06).

  • umi_cutoff_percentile (float, optional) – If umi_cutoff_method is “percentile”, then the umi_cutoff will be the minimum UMI count required to be in the top umi_cutoff_percentile percentile. There is no default, so this parameter must be set manually when umi_cutoff_method is “percentile”.

  • min_umi_cutoff (int, optional) – This is the absolute minimum number of UMIs that need to be associated with a barcode for it to be considered a barcode. However, the actual umi_cutoff used will be the greater of min_umi_cutoff and the cutoff calculated using umi_cutoff_method. Defaults to 2.

  • umi_dominant_threshold (int, optional) – The minimum UMI count to be associated with a barcode for it to be considered to be a potential dominant barcode in a cell. Defaults to 10.

  • umi_diff_threshold (int, optional) – This is the minimum difference between UMI counts associated with a potential dominant barcode present within a cell and the median UMI count of all barcodes associated with the cell. If a cell has only one dominant barcode, it will be counted. Defaults to 50.
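
The interplay between these thresholds can be sketched roughly as follows. This is an illustration under stated assumptions, not singletCode's actual code: it assumes the ratio is multiplied by the total UMI count of a sample, and that both thresholds are inclusive (>=). Neither detail is confirmed by the descriptions above, and both function names are hypothetical.

```python
import numpy as np

def umi_cutoff_from_ratio(total_umis, umi_cutoff_ratio=7.5e-06, min_umi_cutoff=2):
    """Hypothetical helper: the greater of min_umi_cutoff and the ratio-based cutoff.

    Assumption: the ratio is applied to the total UMI count of the sample.
    """
    return max(min_umi_cutoff, umi_cutoff_ratio * total_umis)

def is_potential_dominant(cell_umi_counts, barcode_umis,
                          umi_dominant_threshold=10, umi_diff_threshold=50):
    """Hypothetical helper: check one barcode against both dominance thresholds.

    Assumption: both comparisons are inclusive (>=).
    """
    median_umis = float(np.median(cell_umi_counts))
    return bool(
        barcode_umis >= umi_dominant_threshold
        and (barcode_umis - median_umis) >= umi_diff_threshold
    )

# With few UMIs the floor of min_umi_cutoff applies.
print(umi_cutoff_from_ratio(100_000))
# A barcode far above the cell's median UMI count passes both thresholds.
print(is_potential_dominant([120, 3, 2], 120))
```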

Returns

A 2-tuple containing:
  • pandas.DataFrame: A dataframe which contains all unique cell ID and barcode combinations in the data along with singlet assignment to each cell ID.

  • pandas.DataFrame: A dataframe which contains the statistics for total singlets, different categories of singlets, and cells removed due to low UMI counts.

Return type

tuple

check_sample_sheet

check_sample_sheet(data_frame)

Function to check whether the dataframe can be used as input to the get_singlets function. It checks that the three required columns (cellID, barcode, sample) are present and in that order. If the dataframe can be used with get_singlets, a statement confirming this is printed.

Parameters

data_frame – A dataframe that contains your sample sheet.

Returns

None