
Standardize and append a batch of data#

Here, we’ll learn

  • how to standardize a less well curated collection

  • how to append it to the growing versioned collection

import lamindb as ln
import lnschema_bionty as lb

ln.settings.verbosity = "hint"
lb.settings.auto_save_parents = False
ln.track()
💡 lamindb instance: testuser1/test-scrna
💡 notebook imports: lamindb==0.67.0 lnschema_bionty==0.38.4
💡 saved: Transform(uid='ManDYgmftZ8C5zKv', name='Standardize and append a batch of data', short_name='scrna2', version='1', type=notebook, updated_at=2024-01-12 06:16:59 UTC, created_by_id=1)
💡 saved: Run(uid='3l4nIIfFIijfLF4oTUvN', run_at=2024-01-12 06:16:59 UTC, transform_id=2, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_3l4nIIfFIijfLF4oTUvN.txt

Standardize a data shard#

Let’s now consider a data shard with less well curated features:

adata = ln.dev.datasets.anndata_pbmc68k_reduced()
adata
AnnData object with n_obs × n_vars = 70 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

We are still working with human data, and can globally set an organism:

lb.settings.organism = "human"

Standardize & validate genes #

This data shard is indexed by gene symbols, which we’ll want to map to Ensembl IDs:

adata.var.head()
n_counts highly_variable
index
HES4 1153.387451 True
TNFRSF4 304.358154 True
SSU72 2530.272705 False
PARK7 7451.664062 False
RBP7 272.811035 True

Let’s inspect the identifiers:

lb.Gene.inspect(adata.var.index, lb.Gene.symbol)
695 terms (90.80%) are validated for symbol
70 terms (9.20%) are not validated for symbol: ATPIF1, C1orf228, CCBL2, RP11-782C8.1, RP11-277L2.3, RP11-156E8.1, AC079767.4, GPX1, H1FX, SELT, ATP5I, IGJ, CCDC109B, FYB, H2AFY, FAM65B, HIST1H4C, HIST1H1E, ZNRD1, C6orf48, ...
   detected 54 terms with synonyms: ATPIF1, C1orf228, CCBL2, AC079767.4, H1FX, SELT, ATP5I, IGJ, CCDC109B, FYB, H2AFY, FAM65B, HIST1H4C, HIST1H1E, ZNRD1, C6orf48, SEPT7, WBSCR22, RSBN1L-AS1, CCDC132, ...
→  standardize terms via .standardize()
   detected 5 Gene terms in Bionty for symbol: 'SNORD3B-2', 'GPX1', 'RN7SL1', 'IGLL5', 'SOD2'
→  add records from Bionty to your Gene registry via .from_values()
   couldn't validate 11 terms: 'AC084018.1', 'RP11-390E23.6', 'RP11-489E7.4', 'RP3-467N11.1', 'RP11-782C8.1', 'CTD-3138B18.5', 'RP11-291B21.2', 'RP11-620J15.3', 'TMBIM4-1', 'RP11-277L2.3', 'RP11-156E8.1'
→  if you are sure, create new records via ln.Gene() and save to your registry
<lamin_utils._inspect.InspectResult at 0x7fdd350993c0>

Let’s first standardize the gene symbols from synonyms:

adata.var.index = lb.Gene.standardize(adata.var.index, lb.Gene.symbol)
validated = lb.Gene.validate(adata.var.index, lb.Gene.symbol)
💡 standardized 749/765 terms
749 terms (97.90%) are validated for symbol
16 terms (2.10%) are not validated for symbol: RP11-782C8.1, RP11-277L2.3, RP11-156E8.1, GPX1, RP3-467N11.1, SOD2, RP11-390E23.6, RP11-489E7.4, RP11-291B21.2, RP11-620J15.3, TMBIM4-1, AC084018.1, RN7SL1, SNORD3B-2, CTD-3138B18.5, IGLL5
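
The counts in the two reports above fit together, which is a quick sanity check worth making before subsetting. Here is a minimal sketch of the arithmetic, using only the numbers printed in the outputs (the variable names are illustrative and not part of lamindb):

```python
# Counts copied from the inspect() and validate() outputs above.
n_total = 765
n_validated_before = 695
n_synonyms = 54   # terms that .standardize() can map from synonyms
n_in_bionty = 5   # found in Bionty, but not yet in the Gene registry
n_unknown = 11    # could not be validated at all

n_unvalidated_before = n_total - n_validated_before
# The unvalidated terms split into the three groups that inspect() reports.
assert n_unvalidated_before == n_synonyms + n_in_bionty + n_unknown

# Standardizing synonyms moves those terms into the validated set.
n_validated_after = n_validated_before + n_synonyms
n_dropped = n_total - n_validated_after
print(n_validated_after, n_dropped)  # 749 16
```

The 16 remaining terms are the 5 symbols known to Bionty plus the 11 that could not be validated; they are the ones removed in the next step.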

We only want to register data with validated genes:

adata_validated = adata[:, validated].copy()

Now that all symbols are validated, let’s convert them to Ensembl IDs via .standardize(). Note that this mapping can be ambiguous, as one symbol may match more than one Ensembl ID; the first match is kept because the keep arg of .standardize() defaults to "first":

adata_validated.var["ensembl_gene_id"] = lb.Gene.standardize(
    adata_validated.var.index,
    field=lb.Gene.symbol,
    return_field=lb.Gene.ensembl_gene_id,
)
adata_validated.var.index.name = "symbol"
adata_validated.var = adata_validated.var.reset_index().set_index("ensembl_gene_id")
adata_validated.var.head()
💡 standardized 749/749 terms
symbol n_counts highly_variable
ensembl_gene_id
ENSG00000188290 HES4 1153.387451 True
ENSG00000186827 TNFRSF4 304.358154 True
ENSG00000160075 SSU72 2530.272705 False
ENSG00000116288 PARK7 7451.664062 False
ENSG00000162444 RBP7 272.811035 True
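
To make the "first match is kept" behavior concrete, here is a minimal sketch in plain Python. It is not lamindb’s implementation; the helper map_keep_first and the gene symbols and IDs are made up for illustration:

```python
def map_keep_first(pairs):
    """Build a symbol -> ID mapping, keeping the first ID seen per symbol."""
    mapping = {}
    for symbol, gene_id in pairs:
        # setdefault only writes if the symbol has no ID yet
        mapping.setdefault(symbol, gene_id)
    return mapping


# Toy reference table (hypothetical): "GENE_B" is listed under two IDs.
reference = [
    ("GENE_A", "ID_0001"),
    ("GENE_B", "ID_0002"),
    ("GENE_B", "ID_0003"),
]

symbol_to_id = map_keep_first(reference)
print(symbol_to_id["GENE_B"])  # ID_0002
```

In this sketch the result depends on the order of the reference table, so symbols with several candidate IDs are worth checking by hand if the exact ID matters downstream.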

We also keep the data stored in .raw, subset to the validated genes and indexed by the same Ensembl IDs:

adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index

Standardize & validate cell types #

Inspection shows none of the terms are validated:

inspector = lb.CellType.inspect(adata_validated.obs.cell_type)
❗ received 9 unique terms, 61 empty/duplicated terms are ignored
9 terms (100.00%) are not validated for name: Dendritic cells, CD19+ B, CD4+/CD45RO+ Memory, CD8+ Cytotoxic T, CD4+/CD25 T Reg, CD14+ Monocytes, CD56+ NK, CD8+/CD45RA+ Naive Cytotoxic, CD34+
   couldn't validate 9 terms: 'CD34+', 'CD4+/CD25 T Reg', 'Dendritic cells', 'CD4+/CD45RO+ Memory', 'CD14+ Monocytes', 'CD8+ Cytotoxic T', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD19+ B'
→  if you are sure, create new records via ln.CellType() and save to your registry

Let’s search for the cell type names in the public ontology and add each name found in the AnnData object as a synonym of its top match in the public ontology.

bionty = lb.CellType.public()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = lb.CellType.from_public(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()  # save the record
    # add the original name as a synonym, so that next time, we can just run .standardize()
    record.add_synonym(name)
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000451'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0001201'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0001087'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000910'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000919'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0002057'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0002101'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000624'
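
A search-based mapper always takes the top hit, whether or not it is a good one. The sketch below illustrates this with the standard library’s difflib as a stand-in; it is not how the ontology search in lamindb is implemented, and top_match and the short list of ontology names are assumptions for illustration:

```python
import difflib


def top_match(name, ontology_names):
    """Return the ontology name most similar to `name` (case-insensitive)."""
    lowered = {n.lower(): n for n in ontology_names}
    # cutoff=0.0 returns the best candidate even if the similarity is low
    hits = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=0.0)
    return lowered[hits[0]] if hits else None


# A few names taken from the describe() outputs in this notebook.
ontology_names = ["dendritic cell", "cytotoxic T cell", "classical monocyte"]

toy_mapper = {
    name: top_match(name, ontology_names)
    for name in ["Dendritic cells", "CD34+"]
}
print(toy_mapper["Dendritic cells"])  # dendritic cell
```

Here "CD34+" also receives a match, although none of the candidates describes it. For the same reason, it is worth reviewing a search-based name_mapper before saving synonyms to the registry.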

We can now standardize cell type names using the search-based mapper:

adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)

Now, all cell types are validated:

validated = lb.CellType.validate(adata_validated.obs.cell_type)
assert all(validated)
9 terms (100.00%) are validated for name
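
Saving the original names as synonyms pays off for the next shard that uses the same labels: standardizing becomes a lookup. Here is a minimal sketch of that idea with a hypothetical in-memory ToyRegistry; it only mimics the roles of add_synonym(), standardize() and validate() and is not the lamindb API:

```python
class ToyRegistry:
    """Hypothetical in-memory registry of canonical names and synonyms."""

    def __init__(self):
        self._names = set()   # canonical names
        self._synonyms = {}   # synonym -> canonical name

    def add(self, name):
        self._names.add(name)

    def add_synonym(self, name, synonym):
        if name not in self._names:
            raise KeyError(f"unknown record: {name}")
        self._synonyms[synonym] = name

    def standardize(self, values):
        # Replace known synonyms, leave everything else untouched.
        return [self._synonyms.get(v, v) for v in values]

    def validate(self, values):
        return [v in self._names for v in values]


registry = ToyRegistry()
registry.add("dendritic cell")
registry.add_synonym("dendritic cell", "Dendritic cells")

standardized = registry.standardize(["Dendritic cells", "CD34+"])
print(standardized)                     # ['dendritic cell', 'CD34+']
print(registry.validate(standardized))  # [True, False]
```

Names without a registered synonym pass through unchanged and fail validation, which is the signal to curate them.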

We don’t want to store any of the other metadata columns:

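for column in ["n_genes", "percent_mito", "louvain"]:
    # drop() returns a new DataFrame and targets adata here, so the columns
    # remain in adata_validated.obs and are reported as unvalidated below
    adata.obs.drop(column, axis=1)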

Register #

experimental_factors = lb.ExperimentalFactor.lookup()
organism = lb.Organism.lookup()
features = ln.Feature.lookup()
artifact = ln.Artifact.from_anndata(
    adata_validated,
    description="10x reference adata",
    field=lb.Gene.ensembl_gene_id,
)
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/Fnlkapxs00VtAUBQoN6n.h5ad')
💡 parsing feature names of X stored in slot 'var'
749 terms (100.00%) are validated for ensembl_gene_id
✅    linked: FeatureSet(uid='VJyf7iOhqwDRtbgBYUUo', n=749, type='number', registry='bionty.Gene', hash='o70Gw1y_TnH190ggJ4Fw', created_by_id=1)
💡 parsing feature names of slot 'obs'
1 term (25.00%) is validated for name
3 terms (75.00%) are not validated for name: n_genes, percent_mito, louvain
✅    linked: FeatureSet(uid='RN5RRCb5KXFMEPd1AnfT', n=1, registry='core.Feature', hash='-q5M1pKR4seTGVpNrxe6', created_by_id=1)

As we do not want to manage the remaining unvalidated terms in registries, we can save and annotate the artifact:

artifact.save()
artifact.labels.add(adata_validated.obs.cell_type, features.cell_type)
artifact.labels.add(organism.human, feature=features.organism)
artifact.labels.add(
    experimental_factors.single_cell_rna_sequencing, feature=features.assay
)
artifact.describe()
✅ saved 2 feature sets for slots: 'var','obs'
✅ storing artifact 'Fnlkapxs00VtAUBQoN6n' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/Fnlkapxs00VtAUBQoN6n.h5ad'
✅ loaded: FeatureSet(uid='8IyN9ZSLwbIBIYV1HAQs', n=1, registry='core.Feature', hash='Jvdom8iFnEJ0A-lSpMqH', updated_at=2024-01-12 06:16:52 UTC, created_by_id=1)
✅ linked new feature 'organism' together with new feature set FeatureSet(uid='8IyN9ZSLwbIBIYV1HAQs', n=1, registry='core.Feature', hash='Jvdom8iFnEJ0A-lSpMqH', updated_at=2024-01-12 06:17:11 UTC, created_by_id=1)
💡 nothing links to it anymore, deleting feature set FeatureSet(uid='8IyN9ZSLwbIBIYV1HAQs', n=1, registry='core.Feature', hash='Jvdom8iFnEJ0A-lSpMqH', updated_at=2024-01-12 06:17:11 UTC, created_by_id=1)
✅ linked new feature 'assay' together with new feature set FeatureSet(uid='r5fFhdHaHqNXgwIDTj03', n=2, registry='core.Feature', hash='K5LbdAzPMpnbvOg83iZ5', updated_at=2024-01-12 06:17:11 UTC, created_by_id=1)
Artifact(uid='Fnlkapxs00VtAUBQoN6n', suffix='.h5ad', accessor='AnnData', description='10x reference adata', size=853388, hash='eKH1ljAEh7Kd81-o2H4A7w', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-01-12 06:17:10 UTC)

Provenance:
  🗃️ storage: Storage(uid='sYjXl3Ee', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2024-01-12 06:16:27 UTC, created_by_id=1)
  💫 transform: Transform(uid='ManDYgmftZ8C5zKv', name='Standardize and append a batch of data', short_name='scrna2', version='1', type=notebook, updated_at=2024-01-12 06:16:59 UTC, created_by_id=1)
  👣 run: Run(uid='3l4nIIfFIijfLF4oTUvN', run_at=2024-01-12 06:16:59 UTC, transform_id=2, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-01-12 06:16:27 UTC)
Features:
  var: FeatureSet(uid='VJyf7iOhqwDRtbgBYUUo', n=749, type='number', registry='bionty.Gene', hash='o70Gw1y_TnH190ggJ4Fw', updated_at=2024-01-12 06:17:10 UTC, created_by_id=1)
    'IL18', 'NPM3', 'S100A9', 'S100A8', 'CNN2', 'ARHGAP45', 'RNF34', 'GPX4', 'S100A6', 'ADISSP', 'S100A4', 'FAM174C', 'SIT1', 'CCDC107', 'RSL1D1', 'TLN1', 'HES4', 'TNFRSF17', 'PCNA', 'RAB13', ...
  obs: FeatureSet(uid='RN5RRCb5KXFMEPd1AnfT', n=1, registry='core.Feature', hash='-q5M1pKR4seTGVpNrxe6', updated_at=2024-01-12 06:17:10 UTC, created_by_id=1)
    🔗 cell_type (9, bionty.CellType): 'dendritic cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'CD38-positive naive B cell', 'CD4-positive, alpha-beta T cell', 'CD16-positive, CD56-dim natural killer cell, human'
  external: FeatureSet(uid='r5fFhdHaHqNXgwIDTj03', n=2, registry='core.Feature', hash='K5LbdAzPMpnbvOg83iZ5', updated_at=2024-01-12 06:17:11 UTC, created_by_id=1)
    🔗 assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
    🔗 organism (1, bionty.Organism): 'human'
Labels:
  🏷️ organism (1, bionty.Organism): 'human'
  🏷️ cell_types (9, bionty.CellType): 'dendritic cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'CD38-positive naive B cell', 'CD4-positive, alpha-beta T cell', 'CD16-positive, CD56-dim natural killer cell, human'
  🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
artifact.view_lineage()
[Figure: data lineage graph of the artifact]

Append the shard to the collection#

Query the previous collection:

collection_v1 = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="1"
).one()

Create a new version of the collection by sharding it across the new artifact and the artifact underlying version 1 of the collection:

collection_v2 = ln.Collection(
    [artifact, collection_v1.artifact],
    is_new_version_of=collection_v1,
)
collection_v2.save()
collection_v2.labels.add_from(artifact)
collection_v2.labels.add_from(collection_v1)
✅ loaded: FeatureSet(uid='cqxdmgGi52JUD5QkkwmN', n=36390, type='number', registry='bionty.Gene', hash='gRQGj3QB8ZsIfXA1BjiL', updated_at=2024-01-12 06:16:49 UTC, created_by_id=1)
✅ loaded: FeatureSet(uid='vBGOrQDcV8jISsmeQK93', n=4, registry='core.Feature', hash='_fgSxLBHcJkUbq0B0akl', updated_at=2024-01-12 06:16:51 UTC, created_by_id=1)
✅ loaded: FeatureSet(uid='r5fFhdHaHqNXgwIDTj03', n=2, registry='core.Feature', hash='K5LbdAzPMpnbvOg83iZ5', updated_at=2024-01-12 06:17:11 UTC, created_by_id=1)
💡 adding collection [1] as input for run 2, adding parent transform 1
💡 adding artifact [1] as input for run 2, adding parent transform 1
💡 transferring cell_type
💡 transferring assay
💡 transferring organism
💡 transferring cell_type
💡 transferring assay
💡 transferring tissue
💡 transferring donor
💡 adding collection [1] as input for run 2, adding parent transform 1
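
The two add_from() calls transfer labels from both sources, so the new version ends up with the union of labels per feature. Here is a minimal sketch of that merge in plain Python. The helper merge_labels is hypothetical, the tissue list is abbreviated, and the assumption that the three 10x assays come from version 1 is inferred from the outputs rather than stated by them:

```python
def merge_labels(*sources):
    """Union label values per feature across several label dictionaries."""
    merged = {}
    for labels in sources:
        for feature, values in labels.items():
            merged.setdefault(feature, set()).update(values)
    return merged


# Labels of the new shard, as annotated above.
shard_labels = {
    "assay": {"single-cell RNA sequencing"},
    "organism": {"human"},
}
# Assumed (abbreviated) labels of version 1 of the collection.
v1_labels = {
    "assay": {"10x 3' v3", "10x 5' v2", "10x 5' v1"},
    "tissue": {"blood", "spleen"},
}

merged = merge_labels(shard_labels, v1_labels)
print(len(merged["assay"]))  # 4
```

Features present in only one source, such as organism or tissue in this sketch, are carried over as they are.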

Version 2 of the collection now spans 40 cell types, 4 assays, 17 tissues and 12 donors:

collection_v2.describe()
Collection(uid='tS0d8GpnFLVXJEfFcMl3', name='My versioned scRNA-seq collection', version='2', hash='BOAf0T5UbN_iOe3fQDyq', visibility=1, updated_at=2024-01-12 06:17:11 UTC)

Provenance:
  💫 transform: Transform(uid='ManDYgmftZ8C5zKv', name='Standardize and append a batch of data', short_name='scrna2', version='1', type=notebook, updated_at=2024-01-12 06:16:59 UTC, created_by_id=1)
  👣 run: Run(uid='3l4nIIfFIijfLF4oTUvN', run_at=2024-01-12 06:16:59 UTC, transform_id=2, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-01-12 06:16:27 UTC)
Features:
  var: FeatureSet(uid='cqxdmgGi52JUD5QkkwmN', n=36390, type='number', registry='bionty.Gene', hash='gRQGj3QB8ZsIfXA1BjiL', updated_at=2024-01-12 06:16:49 UTC, created_by_id=1)
    'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'OR4F29', 'None', 'OR4F16', 'None', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'None', ...
  obs: FeatureSet(uid='vBGOrQDcV8jISsmeQK93', n=4, registry='core.Feature', hash='_fgSxLBHcJkUbq0B0akl', updated_at=2024-01-12 06:16:51 UTC, created_by_id=1)
    🔗 cell_type (40, bionty.CellType): 'dendritic cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'CD38-positive naive B cell', 'CD4-positive, alpha-beta T cell', 'classical monocyte', 'T follicular helper cell', ...
    🔗 assay (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
    🔗 tissue (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
    🔗 donor (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
  external: FeatureSet(uid='r5fFhdHaHqNXgwIDTj03', n=2, registry='core.Feature', hash='K5LbdAzPMpnbvOg83iZ5', updated_at=2024-01-12 06:17:11 UTC, created_by_id=1)
    🔗 assay (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
    🔗 organism (1, bionty.Organism): 'human'
Labels:
  🏷️ organism (1, bionty.Organism): 'human'
  🏷️ tissues (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
  🏷️ cell_types (40, bionty.CellType): 'dendritic cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'CD38-positive naive B cell', 'CD4-positive, alpha-beta T cell', 'classical monocyte', 'T follicular helper cell', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
  🏷️ ulabels (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
  🏷️ unordered_artifacts (2, core.Artifact): 'scrna/conde22.h5ad', 'None'

View data lineage:

collection_v2.view_lineage()
[Figure: data lineage graph of version 2 of the collection]