HTSJDK
是一个用于处理高通量测序数据的 Java 库,其中VariantContext
是这个库中的一个重要类,用于表示变异数据。VariantContext
类是htsjdk.variant
包的一部分,主要用于处理和描述变异(如 SNPs、插入和删除等)。
VariantContext
类介绍VariantContext
类用于存储和操作与变异相关的信息,包括:
位置:
getContig()
: 返回变异所在的染色体或 contig 名称。getStart()
: 返回变异的起始位置(1-based)。getEnd()
: 返回变异的结束位置(1-based)。变异类型:
isSNP()
: 检查变异是否是一个单核苷酸多态性(SNP)。isIndel()
: 检查变异是否是插入或删除(Indel)。等位基因:
getAlleles()
: 返回变异的等位基因列表。getReference()
: 返回参考等位基因。getAlternateAlleles()
: 返回变异等位基因列表。基因型:
getGenotypes()
: 返回所有样本的基因型信息。getGenotype(String sampleName)
: 获取特定样本的基因型。过滤和质量:
isFiltered()
: 检查变异是否被过滤掉。getPhredScaledQual()
: 返回变异的质量分数。附加信息:
getAttributes()
: 获取变异的附加信息(例如注释、额外的标记等)。/* * Copyright (c) 2012 The Broad Institute * * Permission is hereby granted, free of charge, to any person * obtaining a copy of this software and associated documentation * files (the "Software"), to deal in the Software without * restriction, including without limitation the rights to use, * copy, modify, merge, publish, distribute, sublicense, and/or sell * copies of the Software, and to permit persons to whom the * Software is furnished to do so, subject to the following * conditions: * * The above copyright notice and this permission notice shall be * included in all copies or substantial portions of the Software. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR * THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ package htsjdk.variant.variantcontext; import htsjdk.beta.plugin.HtsRecord; import htsjdk.tribble.Feature; import htsjdk.tribble.TribbleException; import htsjdk.tribble.util.ParsingUtils; import htsjdk.variant.utils.GeneralUtils; import htsjdk.variant.vcf.*; import java.io.Serializable; import java.util.ArrayList; import java.util.Collection; import java.util.Collections; import java.util.Comparator; import java.util.EnumSet; import java.util.HashMap; import java.util.HashSet; import java.util.List; import java.util.Map; import java.util.Set; import java.util.regex.Pattern; import java.util.stream.Collectors; /** * * High-level overview
* * The VariantContext object is a single general class system for representing genetic variation data composed of: * * - Allele: representing single genetic haplotypes (A, T, ATC, -) (note that null alleles are used here for illustration; see the Allele class for how to represent indels)
* - Genotype: an assignment of alleles for each chromosome of a single named sample at a particular locus
* - VariantContext: an abstract class holding all segregating alleles at a locus as well as genotypes * for multiple individuals containing alleles at that locus
*
* * The class system works by defining segregating alleles, creating a variant context representing the segregating * information at a locus, and potentially creating and associating genotypes with individuals in the context. *
* * All of the classes are highly validating -- call validate()
if you modify them -- so you can rely on the * self-consistency of the data once you have a VariantContext
in hand. The system has a rich set of assessor * and manipulator routines, as well as more complex static support routines in VariantContextUtils
. *
* * The VariantContext
(and Genotype
) objects are attributed (supporting addition of arbitrary key/value pairs) and * filtered (can represent a variation that is viewed as suspect). *
* *VariantContext
s are dynamically typed, so whether a VariantContext
is a SNP, Indel, or NoVariant depends * on the properties of the alleles in the context. See the detailed documentation on the Type
parameter below. *
* * It's also easy to create subcontexts based on selected genotypes. *
* Working with Variant Contexts
* By default, VariantContexts are immutable. In order to access (in the rare circumstances where you need them) * setter routines, you need to create MutableVariantContext
s and MutableGenotype
s. * * Some example data
* * Allele A, Aref, T, Tref; * Allele del, delRef, ATC, ATCref; *
* * A [ref] / T at 10 *
* * GenomeLoc snpLoc = GenomeLocParser.createGenomeLoc("chr1", 10, 10); *
* * A / ATC [ref] from 20-23 *
* * GenomeLoc delLoc = GenomeLocParser.createGenomeLoc("chr1", 20, 22); *
* * // A [ref] / ATC immediately after 20 *
* * GenomeLoc insLoc = GenomeLocParser.createGenomeLoc("chr1", 20, 20); *
* Alleles
* * See the documentation in the Allele
class itself * * What are they?
* * Alleles can be either reference or non-reference
* * Examples of alleles used here:
* * A = new Allele("A"); * Aref = new Allele("A", true); * T = new Allele("T"); * ATC = new Allele("ATC"); *
* Creating variant contexts
* * By hand
* * Here's an example of a A/T polymorphism with the A being reference: * * * VariantContext vc = new VariantContext(name, snpLoc, Arrays.asList(Aref, T)); *
* * If you want to create a non-variant site, just put in a single reference allele * * * VariantContext vc = new VariantContext(name, snpLoc, Arrays.asList(Aref)); *
* * A deletion is just as easy: * * * VariantContext vc = new VariantContext(name, delLoc, Arrays.asList(ATCref, del)); *
* * The only thing that distinguishes between an insertion and deletion is which is the reference allele. * An insertion has a reference allele that is smaller than the non-reference allele, and vice versa for deletions. * * * VariantContext vc = new VariantContext("name", insLoc, Arrays.asList(delRef, ATC)); *
* * Converting rods and other data structures to VariantContext
s
* * You can convert many common types into VariantContexts using the general function: * * * VariantContextAdaptors.convertToVariantContext(name, myObject) *
* * dbSNP and VCFs, for example, can be passed in as myObject
and a VariantContext
corresponding to that * object will be returned. A null
return value indicates that the type isn't yet supported. This is the best * and easiest way to create contexts using RODs. * * * Working with genotypes
* * * List<Allele> alleles = Arrays.asList(Aref, T); * Genotype g1 = new Genotype(Arrays.asList(Aref, Aref), "g1", 10); * Genotype g2 = new Genotype(Arrays.asList(Aref, T), "g2", 10); * Genotype g3 = new Genotype(Arrays.asList(T, T), "g3", 10); * VariantContext vc = new VariantContext(snpLoc, alleles, Arrays.asList(g1, g2, g3)); *
* * At this point we have 3 genotypes in our context, g1-g3. * * You can assess a good deal of information about the genotypes through the VariantContext
: * * * vc.hasGenotypes() * vc.isMonomorphicInSamples() * vc.isPolymorphicInSamples() * vc.getSamples().size() * * vc.getGenotypes() * vc.getGenotypes().get("g1") * vc.hasGenotype("g1") * * vc.getCalledChrCount() * vc.getCalledChrCount(Aref) * vc.getCalledChrCount(T) *
* * NO_CALL alleles
* * The system allows one to create Genotype
s carrying special NO_CALL alleles that aren't present in the * set of context alleles and that represent undetermined alleles in a genotype: * * Genotype g4 = new Genotype(Arrays.asList(Allele.NO_CALL, Allele.NO_CALL), "NO_DATA_FOR_SAMPLE", 10); *
* * subcontexts
* It's also very easy get subcontext based only the data in a subset of the genotypes: * * * VariantContext vc12 = vc.subContextFromGenotypes(Arrays.asList(g1,g2)); * VariantContext vc1 = vc.subContextFromGenotypes(Arrays.asList(g1)); *
* * * * Fully decoding.
* Currently VariantContext
s support some fields, particularly those * stored as generic attributes, to be of any type. For example, a field AB might * be naturally a floating point number, 0.51, but when it's read into a VC its * not decoded into the Java presentation but left as a string "0.51". A fully * decoded VariantContext
is one where all values have been converted to their * corresponding Java object types, based on the types declared in a VCFHeader
. * * The fullyDecode(...)
method takes a header object and creates a new fully decoded VariantContext
* where all fields are converted to their true java representation. The VCBuilder
* can be told that all fields are fully decoded, in which case no work is done when * asking for a fully decoded version of the VC. * * */ public class VariantContext implements HtsRecord, Feature, Serializable { public static final long serialVersionUID = 1L; private static final boolean WARN_ABOUT_BAD_END = true; private static final int MAX_ALLELE_SIZE_FOR_NON_SV = 150; private boolean fullyDecoded = false; protected CommonInfo commonInfo = null; public static final double NO_LOG10_PERROR = CommonInfo.NO_LOG10_PERROR; public static final Set PASSES_FILTERS = Collections.emptySet(); /** The location of this VariantContext */ protected final String contig; protected final long start; protected final long stop; private final String ID; /** The type (cached for performance reasons) of this context */ protected Type type = null; /** The type of this context, cached separately if ignoreNonRef is true */ protected Type typeIgnoringNonRef = null; /** A set of the alleles segregating in this context */ protected final List alleles; /** A mapping from sampleName -> genotype objects for all genotypes associated with this context */ protected GenotypesContext genotypes = null; /** Counts for each of the possible Genotype types in this context */ protected int[] genotypeCounts = null; public static final GenotypesContext NO_GENOTYPES = GenotypesContext.NO_GENOTYPES; // a fast cached access point to the ref / alt alleles for biallelic case private Allele REF = null; // set to the alt allele when biallelic, otherwise == null private Allele ALT = null; /* cached monomorphic value: null -> not yet computed, False, True */ private Boolean monomorphic = null; /* * Determine which genotype fields are in use in the genotypes in VC * @return an ordered list of genotype fields in use in VC. If vc has genotypes this will always include GT first */ public List calcVCFGenotypeKeys(final VCFHeader header) { final Set keys = new HashSet<>(); boolean sawGoodGT = false; boolean sawGoodQual = false; boolean sawGenotypeFilter = false; boolean sawDP = false; boolean sawAD = false; boolean sawPL = false; for (final Genotype g : this.getGenotypes()) { keys.addAll(g.getExtendedAttributes().keySet()); if ( g.isAvailable() ) sawGoodGT = true; if ( g.hasGQ() ) sawGoodQual = true; if ( g.hasDP() ) sawDP = true; if ( g.hasAD() ) sawAD = true; if ( g.hasPL() ) sawPL = true; if (g.isFiltered()) sawGenotypeFilter = true; } if ( sawGoodQual ) keys.add(VCFConstants.GENOTYPE_QUALITY_KEY); if ( sawDP ) keys.add(VCFConstants.DEPTH_KEY); if ( sawAD ) keys.add(VCFConstants.GENOTYPE_ALLELE_DEPTHS); if ( sawPL ) keys.add(VCFConstants.GENOTYPE_PL_KEY); if ( sawGenotypeFilter ) keys.add(VCFConstants.GENOTYPE_FILTER_KEY); List sortedList = ParsingUtils.sortList(new ArrayList<>(keys)); // make sure the GT is first if (sawGoodGT) { final List newList = new ArrayList<>(sortedList.size() + 1); newList.add(VCFConstants.GENOTYPE_KEY); newList.addAll(sortedList); sortedList = newList; } if (sortedList.isEmpty() && header.hasGenotypingData()) { // this needs to be done in case all samples are no-calls return Collections.singletonList(VCFConstants.GENOTYPE_KEY); } else { return sortedList; } } // --------------------------------------------------------------------------------------------------------- // // validation mode // // --------------------------------------------------------------------------------------------------------- //no controls and white-spaces characters, no semicolon. public static final Pattern VALID_FILTER = Pattern.compile("^[!-:<-~]+$"); public enum Validation { ALLELES() { void validate(VariantContext variantContext) { validateAlleles(variantContext); } }, GENOTYPES() { void validate(VariantContext variantContext) { validateGenotypes(variantContext); } }, FILTERS { void validate(VariantContext variantContext) { validateFilters(variantContext); } }; abstract void validate(VariantContext variantContext); private static void validateAlleles(fina