Why does this human bam file only have one copy of each chromosome?
















As we know, in human DNA one copy of each chromosome comes from the mother and the other copy comes from the father, so every chromosome is present in two copies. So, if we extract the exome sequence from this DNA, the exome must also contain two copies of each chromosome (one from the mother and one from the father). But in both the fastq files and the bam files of WES data, I always find only one copy of each chromosome.



Q1: Where is the other gene copy in the sequence, or have I missed something?



Q2: How can I check the ploidy and find out whether there are two copies of each chromosome in WES data? How can I do this if my WES data are mapped to the reference and stored in a bam file?







bam sequencing fastq exome














edited Dec 24 '18 at 6:45









conchoecia











asked Dec 24 '18 at 5:15









Lot_to_learn













  • Not all genes have two copies; for example, the Y chromosome is (usually) present as only one copy.
    – Mithoron
    Jan 3 at 23:29




























1 Answer



The sequences of the maternal and paternal copies of a chromosome are called haplotypes. Many metazoans (animals), not just humans, are diploid and inherit one set of chromosomes from each parent during sexual reproduction.



Response to Q1



Your question, in other words, is: Why do .bam files not differentiate between haplotypes?



Your question gets at something fundamental about how most people do "genomics" today. When most people assemble a genome or create a reference genome, they produce a fasta file that contains a single sequence for each chromosome. This is not biologically accurate, as a diploid individual carries two distinct sequences per chromosome. Most genome assemblies report only a chimeric consensus of the two and call this the reference. These reference genomes are haploid, or haplotype-collapsed.



This is where your question comes in: You have mapped reads that are biologically derived from two different haplotypes to a reference genome that contains only one consensus sequence, which (inadequately) represents both haplotypes. As a result, at a single locus in the bam file there will be mapped reads from both haplotypes. If your reference fasta file contained the exact sequence of both haplotypes, and if you mapped the reads to that reference and looked only at primary alignments, the reads would mostly map to their haplotype of origin (reads that cover no site where the haplotypes differ would remain ambiguous).
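To make the point about a collapsed reference concrete, here is a minimal sketch using made-up toy sequences (the names `REFERENCE`, `MATERNAL`, `PATERNAL`, `make_reads` and `pileup` are all hypothetical, and the "mapping" is simply the known start position of each read). It only illustrates that reads from both haplotypes pile up at the same reference coordinate; it is not how a real aligner or pileup tool works.

```python
from collections import Counter

# Hypothetical toy data: two haplotypes of one locus that differ at one site.
REFERENCE = "ACGTACGTAC"  # haplotype-collapsed reference
MATERNAL = "ACGTACGTAC"   # identical to the reference
PATERNAL = "ACGTTCGTAC"   # carries a T at 0-based position 4


def make_reads(haplotype, read_len=6):
    """Return (start, sequence) for every full-length window of the haplotype."""
    return [(i, haplotype[i:i + read_len])
            for i in range(len(haplotype) - read_len + 1)]


def pileup(reads, position):
    """Count the bases that reads place at one reference position."""
    counts = Counter()
    for start, seq in reads:
        if start <= position < start + len(seq):
            counts[seq[position - start]] += 1
    return counts


# Reads from both haplotypes land at the same reference coordinates,
# so the heterozygous site shows a mixture of two alleles.
reads = make_reads(MATERNAL) + make_reads(PATERNAL)
site_counts = pileup(reads, 4)
print(dict(site_counts))
```

In this toy case the position where the haplotypes differ shows an even split between the two alleles, which is the signature a variant caller interprets as a heterozygous site.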



This leads into another topic called phasing, in which sequencing data are used to determine which alleles at polymorphic sites lie together on the same haplotype. This has some problems, as it relies on correctly detecting variant sites. Software such as GATK can find single-nucleotide polymorphisms (SNPs) when sequencing coverage is good, but detecting insertions and deletions is much more difficult. This gives a very SNP-biased view of the haplotype differences in any given genome. After variant sites are found, phasing software such as hapcut2 determines which variants fall on which haplotype and outputs blocks of variants that are thought to belong to the same haplotype.
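The idea behind read-backed phasing can be sketched with a toy example. The code below is only an illustration under simplifying assumptions (two heterozygous sites, hypothetical reads that each span both sites, and a made-up helper `phase_pair`); it is not the algorithm used by hapcut2 or any other real phasing tool.

```python
from collections import Counter

# Hypothetical reads, each reporting the allele it carries at two
# heterozygous sites (site 1: A/G, site 2: C/T).
read_alleles = [("A", "C"), ("A", "C"), ("G", "T"),
                ("G", "T"), ("A", "C"), ("G", "C")]  # last read: sequencing error


def phase_pair(observations):
    """Return the best-supported allele pairing and its complementary pairing."""
    support = Counter(observations)
    (a1, b1), _ = support.most_common(1)[0]
    # The second haplotype must carry the other allele at both sites.
    partners = [(a, b) for (a, b) in support if a != a1 and b != b1]
    best_partner = max(partners, key=lambda pair: support[pair])
    return (a1, b1), best_partner


hap1, hap2 = phase_pair(read_alleles)
print(hap1, hap2)
```

Reads that span two or more variant sites are what link those sites together, which is why longer reads or read pairs tend to produce longer phased blocks.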



Phasing alone is not enough to accurately reconstruct the exact sequence of both haplotypes, because read mapping cannot detect all variants. The gold standard of the future is diploid de novo genome assembly, in which each haplotype is assembled independently. This is an active area of research for people who develop genome assemblers and is tightly linked to advances in PacBio and Oxford Nanopore long reads. The paper on trio canu is a good starting point for learning about one successful diploid assembly technique.



Response to Q2



If you want to check the ploidy of all of the chromosomes, then you need at least 10X coverage of whole-genome shotgun data and tools like smudgeplot and genomescope.



If you are trying to check for local duplications or whole-chromosome duplications in something like a cancer sample, you would also need whole-genome shotgun data. WES data do not give reliable information on ploidy on their own, since WES sequences captured genomic DNA and the read count at a given locus depends heavily on how efficiently that region was captured during exome enrichment, not only on the actual quantity of homologous chromosomal DNA for the region in question.



If you are trying to use a bam file to look for evidence of chromosomal duplication, or duplication of a locus, in cancer samples, you would look for an increase in whole-genome shotgun coverage at that locus compared to a known normal locus. For example, if all chromosomes but one have an average coverage of 22 and one chromosome has an average coverage of 33, this is evidence of a trisomy. This logic can be applied to smaller regions as well (barring repetitive regions, paralogs, et cetera).
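The coverage comparison above can be written down as a short calculation. This is a rough sketch, assuming you already have a mean depth per chromosome from whole-genome shotgun data (for instance from a tool such as `samtools coverage`); the function name `estimate_copy_number`, the dictionary of depths, and the assumption that most chromosomes are diploid are all illustrative choices, not part of any specific tool.

```python
from statistics import median


def estimate_copy_number(mean_depth, baseline_ploidy=2):
    """Scale each chromosome's depth by the median depth across chromosomes.

    Assumes the typical (median) chromosome is present at baseline_ploidy copies.
    """
    typical = median(mean_depth.values())
    return {chrom: round(baseline_ploidy * depth / typical, 2)
            for chrom, depth in mean_depth.items()}


# Hypothetical per-chromosome mean depths, using the numbers from the text.
depths = {"chr1": 22.0, "chr2": 22.0, "chr3": 22.0, "chr21": 33.0}
copies = estimate_copy_number(depths)
print(copies)
```

With these numbers the chromosome at depth 33 comes out at about three copies while the others stay at two. Real data are noisier, so a depth ratio like this is a hint to investigate rather than proof of a trisomy.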















edited Dec 24 '18 at 14:48









Daniel Standage











answered Dec 24 '18 at 6:21









conchoecia













  • Thanks for the nice explanation. If possible, I would like to give +50. This answer can be a nice starting point for further study.
    – Lot_to_learn
    Dec 25 '18 at 6:17



































