Standardize and append a batch of data

Here, we’ll learn how to standardize a less well-curated collection and how to append it to the growing versioned collection.

import lamindb as ln
import lnschema_bionty as lb

ln.settings.verbosity = "hint"
lb.settings.auto_save_parents = False
ln.track()

💡 lamindb instance: testuser1/test-scrna

💡 notebook imports: lamindb==0.67.0 lnschema_bionty==0.38.4

💡 saved: Transform(uid='ManDYgmftZ8C5zKv', name='Standardize and append a batch of data', short_name='scrna2', version='1', type=notebook, updated_at=2024-01-12 06:16:59 UTC, created_by_id=1)

💡 saved: Run(uid='3l4nIIfFIijfLF4oTUvN', run_at=2024-01-12 06:16:59 UTC, transform_id=2, created_by_id=1)

💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_3l4nIIfFIijfLF4oTUvN.txt

Standardize a data shard

Let’s now consider a collection with less well-curated features:

adata = ln.dev.datasets.anndata_pbmc68k_reduced()
adata

We are still working with human data, and can globally set an organism:

lb.settings.organism = "human"

Standardize & validate genes

This data shard is indexed by gene symbols, which we’ll want to map to Ensembl IDs:

adata.var.head()

	n_counts	highly_variable
index
HES4	1153.387451	True
TNFRSF4	304.358154	True
SSU72	2530.272705	False
PARK7	7451.664062	False
RBP7	272.811035	True

Let’s inspect the identifiers:

lb.Gene.inspect(adata.var.index, lb.Gene.symbol)

Let’s first standardize the gene symbols from synonyms:

adata.var.index = lb.Gene.standardize(adata.var.index, lb.Gene.symbol)
validated = lb.Gene.validate(adata.var.index, lb.Gene.symbol)

💡 standardized 749/765 terms

✅ 749 terms (97.90%) are validated for symbol

❗ 16 terms (2.10%) are not validated for symbol: RP11-782C8.1, RP11-277L2.3, RP11-156E8.1, GPX1, RP3-467N11.1, SOD2, RP11-390E23.6, RP11-489E7.4, RP11-291B21.2, RP11-620J15.3, TMBIM4-1, AC084018.1, RN7SL1, SNORD3B-2, CTD-3138B18.5, IGLL5
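
To make the two steps above concrete, here is a minimal plain-Python sketch of what synonym-based standardization followed by validation amounts to conceptually. It is not lamindb's implementation: the `REGISTRY` dictionary, the two helper functions, and the synonym entry are assumptions made up for illustration.

```python
# Toy stand-in for a gene registry: canonical symbol -> known synonyms.
# The entries are illustrative only, not an excerpt of a real registry.
REGISTRY = {
    "HES4": set(),
    "PARK7": {"DJ-1"},
    "SSU72": set(),
}


def standardize(symbols, registry):
    """Replace known synonyms with their canonical symbol; leave the rest untouched."""
    synonym_to_canonical = {
        synonym: canonical
        for canonical, synonyms in registry.items()
        for synonym in synonyms
    }
    return [synonym_to_canonical.get(symbol, symbol) for symbol in symbols]


def validate(symbols, registry):
    """Return one boolean per symbol: True if it is a canonical registry entry."""
    return [symbol in registry for symbol in symbols]


symbols = ["HES4", "DJ-1", "RP11-782C8.1"]
standardized = standardize(symbols, REGISTRY)
validated = validate(standardized, REGISTRY)
# Keep only the validated symbols, analogous to subsetting with a boolean mask
kept = [symbol for symbol, ok in zip(standardized, validated) if ok]
print(standardized)
print(validated)
print(kept)
```

In this sketch, a synonym is rewritten to its canonical symbol, while a term that is neither a canonical symbol nor a known synonym stays unvalidated and is filtered out.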

We only want to register data with validated genes:

adata_validated = adata[:, validated].copy()

Now that all symbols are validated, let’s convert them to Ensembl IDs via .standardize(). Note that this mapping is ambiguous, since a symbol can match more than one Ensembl ID; the first match is kept because the keep argument of .standardize() defaults to "first":

adata_validated.var["ensembl_gene_id"] = lb.Gene.standardize(
    adata_validated.var.index,
    field=lb.Gene.symbol,
    return_field=lb.Gene.ensembl_gene_id,
)
adata_validated.var.index.name = "symbol"
adata_validated.var = adata_validated.var.reset_index().set_index("ensembl_gene_id")
adata_validated.var.head()

💡 standardized 749/749 terms

	symbol	n_counts	highly_variable
ensembl_gene_id
ENSG00000188290	HES4	1153.387451	True
ENSG00000186827	TNFRSF4	304.358154	True
ENSG00000160075	SSU72	2530.272705	False
ENSG00000116288	PARK7	7451.664062	False
ENSG00000162444	RBP7	272.811035	True
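
The effect of keeping the first match in an ambiguous mapping can be pictured with a small plain-Python sketch. The pairs, the placeholder IDs, and the `build_mapper` helper are hypothetical, and the sketch says nothing about how lamindb orders its candidates internally.

```python
# Hypothetical (symbol, Ensembl ID) pairs; names and IDs are placeholders.
PAIRS = [
    ("GENE_A", "ENSG_PLACEHOLDER_1"),
    ("GENE_B", "ENSG_PLACEHOLDER_2"),
    ("GENE_A", "ENSG_PLACEHOLDER_3"),  # a second candidate for GENE_A
]


def build_mapper(pairs, keep="first"):
    """Collapse a one-to-many mapping into one value per key."""
    mapper = {}
    for symbol, ensembl_id in pairs:
        if keep == "first":
            mapper.setdefault(symbol, ensembl_id)  # ignore later candidates
        elif keep == "last":
            mapper[symbol] = ensembl_id  # later candidates overwrite
        else:
            raise ValueError(f"unsupported keep value: {keep!r}")
    return mapper


first = build_mapper(PAIRS, keep="first")
last = build_mapper(PAIRS, keep="last")
print(first["GENE_A"])
print(last["GENE_A"])
```

Because the choice between candidates is silent, it may be worth checking symbols with several candidates before relying on the resulting index.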

We also subset .raw to the validated genes and re-index it by Ensembl ID so that it matches .var:

adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index

Standardize & validate cell types

Inspection shows none of the terms are validated:

inspector = lb.CellType.inspect(adata_validated.obs.cell_type)

Let’s search the public ontology for each cell type name and add the name found in the AnnData object as a synonym of the top match:

bionty = lb.CellType.public()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = lb.CellType.from_public(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()  # save the record
    # add the original name as a synonym, so that next time, we can just run .standardize()
    record.add_synonym(name)

We can now standardize cell type names using the search-based mapper:

adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)
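
The search-then-register-synonym pattern can be sketched without any registry at all. The example below uses `difflib` from the standard library as a stand-in for the ontology search; the term list, the observed labels, and the `standardize` helper are hypothetical, and the scoring differs from whatever the real search uses.

```python
import difflib

# Toy stand-in for the public ontology's term names.
ONTOLOGY_NAMES = ["dendritic cell", "cytotoxic T cell", "classical monocyte"]

# Hypothetical free-text labels as they might appear in a dataset.
observed_labels = ["Dendritic cells", "cytotoxic T cells", "classical monocytes"]

name_mapper = {}  # observed label -> standardized name
synonyms = {}     # standardized name -> set of registered synonyms

for label in observed_labels:
    # cutoff=0.0 always returns the best-scoring candidate, however weak
    top_match = difflib.get_close_matches(label, ONTOLOGY_NAMES, n=1, cutoff=0.0)[0]
    name_mapper[label] = top_match
    synonyms.setdefault(top_match, set()).add(label)


def standardize(label):
    """Resolve a label through the registered synonyms; unknown labels pass through."""
    for name, registered in synonyms.items():
        if label in registered:
            return name
    return label


standardized = [name_mapper[label] for label in observed_labels]
print(standardized)
print(standardize("Dendritic cells"))
```

A top hit is not guaranteed to be the intended term, so reviewing the mapper before applying it seems prudent, especially since the mapped names are then stored as synonyms.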

Now, all cell types are validated:

validated = lb.CellType.validate(adata_validated.obs.cell_type)
assert all(validated)

✅ 9 terms (100.00%) are validated for name

We don’t want to store any of the other metadata columns:

# drop() returns a new DataFrame, so assign the result back to the object we register
adata_validated.obs = adata_validated.obs.drop(
    columns=["n_genes", "percent_mito", "louvain"]
)
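
For reference, pandas' `DataFrame.drop` returns a new frame by default rather than modifying the original, so columns are only removed once the result is assigned back. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd

# Toy observation table; the values are made up for illustration.
obs = pd.DataFrame(
    {
        "cell_type": ["dendritic cell", "cytotoxic T cell"],
        "n_genes": [1200, 950],
        "percent_mito": [0.02, 0.03],
        "louvain": ["0", "1"],
    }
)

obs.drop(columns=["n_genes"])  # result discarded: obs is unchanged
columns_after_discarded_drop = list(obs.columns)

obs = obs.drop(columns=["n_genes", "percent_mito", "louvain"])  # result assigned back
columns_after_assigned_drop = list(obs.columns)

print(columns_after_discarded_drop)
print(columns_after_assigned_drop)
```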

Register

experimental_factors = lb.ExperimentalFactor.lookup()
organism = lb.Organism.lookup()
features = ln.Feature.lookup()

artifact = ln.Artifact.from_anndata(
    adata_validated,
    description="10x reference adata",
    field=lb.Gene.ensembl_gene_id,
)

As we do not want to manage the remaining unvalidated terms in registries, we can save and annotate the artifact:

artifact.save()
artifact.labels.add(adata_validated.obs.cell_type, features.cell_type)
artifact.labels.add(organism.human, feature=features.organism)
artifact.labels.add(
    experimental_factors.single_cell_rna_sequencing, feature=features.assay
)
artifact.describe()

✅ saved 2 feature sets for slots: 'var','obs'

✅ storing artifact 'Fnlkapxs00VtAUBQoN6n' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/Fnlkapxs00VtAUBQoN6n.h5ad'

✅ loaded: FeatureSet(uid='8IyN9ZSLwbIBIYV1HAQs', n=1, registry='core.Feature', hash='Jvdom8iFnEJ0A-lSpMqH', updated_at=2024-01-12 06:16:52 UTC, created_by_id=1)

✅ linked new feature 'organism' together with new feature set FeatureSet(uid='8IyN9ZSLwbIBIYV1HAQs', n=1, registry='core.Feature', hash='Jvdom8iFnEJ0A-lSpMqH', updated_at=2024-01-12 06:17:11 UTC, created_by_id=1)

💡 nothing links to it anymore, deleting feature set FeatureSet(uid='8IyN9ZSLwbIBIYV1HAQs', n=1, registry='core.Feature', hash='Jvdom8iFnEJ0A-lSpMqH', updated_at=2024-01-12 06:17:11 UTC, created_by_id=1)

✅ linked new feature 'assay' together with new feature set FeatureSet(uid='r5fFhdHaHqNXgwIDTj03', n=2, registry='core.Feature', hash='K5LbdAzPMpnbvOg83iZ5', updated_at=2024-01-12 06:17:11 UTC, created_by_id=1)

Artifact(uid='Fnlkapxs00VtAUBQoN6n', suffix='.h5ad', accessor='AnnData', description='10x reference adata', size=853388, hash='eKH1ljAEh7Kd81-o2H4A7w', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-01-12 06:17:10 UTC)

Provenance:
  🗃️ storage: Storage(uid='sYjXl3Ee', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2024-01-12 06:16:27 UTC, created_by_id=1)
  💫 transform: Transform(uid='ManDYgmftZ8C5zKv', name='Standardize and append a batch of data', short_name='scrna2', version='1', type=notebook, updated_at=2024-01-12 06:16:59 UTC, created_by_id=1)
  👣 run: Run(uid='3l4nIIfFIijfLF4oTUvN', run_at=2024-01-12 06:16:59 UTC, transform_id=2, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-01-12 06:16:27 UTC)
Features:
  var: FeatureSet(uid='VJyf7iOhqwDRtbgBYUUo', n=749, type='number', registry='bionty.Gene', hash='o70Gw1y_TnH190ggJ4Fw', updated_at=2024-01-12 06:17:10 UTC, created_by_id=1)
    'IL18', 'NPM3', 'S100A9', 'S100A8', 'CNN2', 'ARHGAP45', 'RNF34', 'GPX4', 'S100A6', 'ADISSP', 'S100A4', 'FAM174C', 'SIT1', 'CCDC107', 'RSL1D1', 'TLN1', 'HES4', 'TNFRSF17', 'PCNA', 'RAB13', ...
  obs: FeatureSet(uid='RN5RRCb5KXFMEPd1AnfT', n=1, registry='core.Feature', hash='-q5M1pKR4seTGVpNrxe6', updated_at=2024-01-12 06:17:10 UTC, created_by_id=1)
    🔗 cell_type (9, bionty.CellType): 'dendritic cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'CD38-positive naive B cell', 'CD4-positive, alpha-beta T cell', 'CD16-positive, CD56-dim natural killer cell, human'
  external: FeatureSet(uid='r5fFhdHaHqNXgwIDTj03', n=2, registry='core.Feature', hash='K5LbdAzPMpnbvOg83iZ5', updated_at=2024-01-12 06:17:11 UTC, created_by_id=1)
    🔗 assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
    🔗 organism (1, bionty.Organism): 'human'
Labels:
  🏷️ organism (1, bionty.Organism): 'human'
  🏷️ cell_types (9, bionty.CellType): 'dendritic cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'CD38-positive naive B cell', 'CD4-positive, alpha-beta T cell', 'CD16-positive, CD56-dim natural killer cell, human'
  🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'

artifact.view_lineage()

Append the shard to the collection

Query the previous collection:

collection_v1 = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="1"
).one()

Create a new version of the collection by sharding it across the new artifact and the artifact underlying version 1 of the collection:

collection_v2 = ln.Collection(
    [artifact, collection_v1.artifact],
    is_new_version_of=collection_v1,
)
collection_v2.save()
collection_v2.labels.add_from(artifact)
collection_v2.labels.add_from(collection_v1)

Version 2 of the collection combines the labels of both artifacts and therefore covers more conditions than version 1, for example 4 assays and 40 cell types.

collection_v2.describe()

Collection(uid='tS0d8GpnFLVXJEfFcMl3', name='My versioned scRNA-seq collection', version='2', hash='BOAf0T5UbN_iOe3fQDyq', visibility=1, updated_at=2024-01-12 06:17:11 UTC)

Provenance:
  💫 transform: Transform(uid='ManDYgmftZ8C5zKv', name='Standardize and append a batch of data', short_name='scrna2', version='1', type=notebook, updated_at=2024-01-12 06:16:59 UTC, created_by_id=1)
  👣 run: Run(uid='3l4nIIfFIijfLF4oTUvN', run_at=2024-01-12 06:16:59 UTC, transform_id=2, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-01-12 06:16:27 UTC)
Features:
  var: FeatureSet(uid='cqxdmgGi52JUD5QkkwmN', n=36390, type='number', registry='bionty.Gene', hash='gRQGj3QB8ZsIfXA1BjiL', updated_at=2024-01-12 06:16:49 UTC, created_by_id=1)
    'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'OR4F29', 'None', 'OR4F16', 'None', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'None', ...
  obs: FeatureSet(uid='vBGOrQDcV8jISsmeQK93', n=4, registry='core.Feature', hash='_fgSxLBHcJkUbq0B0akl', updated_at=2024-01-12 06:16:51 UTC, created_by_id=1)
    🔗 cell_type (40, bionty.CellType): 'dendritic cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'CD38-positive naive B cell', 'CD4-positive, alpha-beta T cell', 'classical monocyte', 'T follicular helper cell', ...
    🔗 assay (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
    🔗 tissue (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
    🔗 donor (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
  external: FeatureSet(uid='r5fFhdHaHqNXgwIDTj03', n=2, registry='core.Feature', hash='K5LbdAzPMpnbvOg83iZ5', updated_at=2024-01-12 06:17:11 UTC, created_by_id=1)
    🔗 assay (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
    🔗 organism (1, bionty.Organism): 'human'
Labels:
  🏷️ organism (1, bionty.Organism): 'human'
  🏷️ tissues (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
  🏷️ cell_types (40, bionty.CellType): 'dendritic cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'CD38-positive naive B cell', 'CD4-positive, alpha-beta T cell', 'classical monocyte', 'T follicular helper cell', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
  🏷️ ulabels (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
  🏷️ unordered_artifacts (2, core.Artifact): 'scrna/conde22.h5ad', 'None'
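
The assay labels in the output above are consistent with a per-feature union of the labels of both sources. The plain-Python sketch below illustrates that reading. The `union_labels` helper is hypothetical, and the version 1 labels are inferred from the version 2 output rather than shown in this notebook.

```python
# Labels of the new artifact, as shown in its describe() output.
new_artifact_labels = {
    "assay": {"single-cell RNA sequencing"},
    "organism": {"human"},
}

# Assumption: inferred as the version 2 labels minus those of the new artifact.
collection_v1_labels = {
    "assay": {"10x 3' v3", "10x 5' v2", "10x 5' v1"},
    "organism": {"human"},
}


def union_labels(*sources):
    """Merge label sets feature by feature across all sources."""
    merged = {}
    for source in sources:
        for feature, labels in source.items():
            merged.setdefault(feature, set()).update(labels)
    return merged


collection_v2_labels = union_labels(new_artifact_labels, collection_v1_labels)
n_assays = len(collection_v2_labels["assay"])
print(n_assays)
print(sorted(collection_v2_labels["organism"]))
```

Under these assumptions the union yields four assay labels and a single organism label, matching the counts reported for version 2.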

View data lineage:

collection_v2.view_lineage()
