https://doi.org/10.1016/j.stem.2018.10.023
本研究是如何找切入点的?
Previously, we reported that iPSC-derived dopamine neurons obtained from PD GBA-N370S patients exhibited increased ER stress, autophagic and lysosomal perturbations, and elevated α-synuclein release (Fernandes et al., 2016).
这是之前的研究基础,分子层面的研究对象已经比较明确了:ER stress, autophagic and lysosomal perturbations, and elevated α-synuclein release
https://doi.org/10.1016/j.stemcr.2016.01.013
Overall, gene ontology (GO) enrichment analysis of the upregulated and downregulated genes in PD GBA-N370S iPSC-derived dopamine neurons highlighted DE of genes involved in neuronal development, neuronal differentiation, and synaptic activity, whereas zinc ion transport functions featured in the upregulated genes
bulk的结果没有多大用处:neuronal development, neuronal differentiation, and synaptic activity, whereas zinc ion transport functions featured in the upregulated genes
分层的结果: A GO enrichment analysis on the over-dispersed genes identified the signal recognition particle (SRP)-dependent co-translational protein targeting to membrane pathway and related processes as driving this variation.
关于core gene set的选取:
Within the set of 60 DE genes, those downregulated early in the proposed case-control axis include genes implicated in neuronal function (γ-synuclein [SNCG], brain-derived neurotrophic factor [BDNF], and dopamine receptor D2 [DRD2]); genes involved in microtubule-associated protein tau (MAPT) splicing, microtubule function and formation, and neurite and axonal outgrowth; genes involved in protein secretion and trafficking; and protein kinase C (PKC) pathway genes. Genes identified as upregulated late in the process include the ER stress genes protein disulfide isomerase family member 6 (PDIA6), FK506 binding protein 9 (FKBP9), and ER oxidoreductase 1 alpha (ERO1A).
目前,我已经能够过滤出所有对HSCR有用的基因,那就是S-HSCR or L-HSCR的差异基因,在这些基因中再确定S和L有无差异。
把所有的基因分类,而不要把一个基因重复分到几个类里!!!
目前已经大致确定了S-HSCR没有太多特异的基因(有一些,但基本可以忽略),反而他与L-HSCR共通的pathway有很多,而L-HSCR则有一些非常特异的pathway让他们变得更加严重。
猜想-证明
猜想一:S-HSCR相较于L-HSCR而言,基因表达的变化非常轻微
猜想二:虽然S-HSCR基因表达影响非常轻微,但S-HSCR和L-HSCR有一些共通的pathway
猜想三:L-HSCR有一些pathway受影响非常严重,证明S-HSCR基本不受影响
猜想四:有一个key gene调控了基本的HSCR表型
猜想五:有一个key gene对L-HSCR的严重表型负责
猜想六:有一个core gene set可以描述HSCR的progress
I have read the CSC paper again and again and found one core gene set is not enough for us.
In the CSC paper, they only have one phenotype and one variant in GBA. So one core gene set and one key regulator make sense to describe the PD disease. For us, we have two clear phenotypes and different variants. Then we should have at least two core gene sets and their upstream regulators (like HDAC4) for S-HSCR and L-HSCR.
For picking up the candidate core get set, I’m sure the CSC paper used a supervised method. They filtered the DEGs by their prior result (pathway analysis). So, the early genes are both related to neuronal development, the late genes are both related to ER stress. In our case, I found there was no uniquely affected pathways in S-HSCR (that means nearly all affected pathways in S-HSCR were also affected in L-HSCR). But, the L-HSCR showed some uniquely affected pathways like -
全部搞完,我的任务就结束了。。。
2019年09月19日再读:
很明确的一个问题:如何鉴定出一个core gene set?
(1) those DE in bulk RNA-seq using DESeq2 after the removal of GBA3 (at 5% FDR)
(2) those DE across PC2 using switchde (at 5% FDR; Campbell and Yau, 2017; Figure S4B).
We further refined this set to include only those additionally identified as discriminating marker genes after clustering the single-cell RNA-seq data using SC3 (Kiselev et al., 2017).
To validate that our core set of 60 genes were functionally convergent, we assessed the functional similarity between these genes within a phenotypic linkage network, as compared to known PD genes and a random background set controlled for the relevant core set gene properties (STAR Methods; Figures S4D, S4E, and 3B). We found significant enrichment of functional similarity within the 60-gene set compared to background genes (p < 2.6e−16; Figure 3B). Strikingly, we also found a significant enrichment between the 60 genes and a set of known PD genes (Figure 3B).
第二个很明确的问题:如何找到这个core gene set的核心调控因子,hub gene?
Analysis of the core set of 60 DE genes using ingenuity pathway analysis (IPA) (QIAGEN) identified histone deacetylase 4 (HDAC4) as a repressor of a set of genes downregulated early in the pseudotemporal profile (Figure S5A).
这里很有讲究,外行很难明白其中道理。找的是一个repressor,为什么不是activator;为什么core gene set里面down的DEG占绝大多数(52个up,8个down)?因为up不好解释!!!down很直接,variant就可以做到;而up则要绕几个弯才行。
(A) IPA (Qiagen) identified HDAC4 as an upstream regulator of a number of genes identified as downregulate and one up-regulated gene, in the core set of 60 DE genes (modified from IPA Qiagen).
这里看似简单,其实很难找到这个HDAC4,因为HDAC4并不是DEG,也不是TF,这意味着我们要去试3万个gene才有可能找到HDAC4,这就是IPA的厉害之处了,有意义的数据库大大减少了数据量。
细细一想,真的很不容易:这个分析的也是瞎搞,瞎猫碰到死老鼠了,随便杂交了几个方法,得到了所谓的60个core gene set,然后就用IPA去找上游的调控因子,找到了HDAC4,但居然表达没变化,(正常人到这里就会放弃了,然而人家扣着这个点,继续探索其他方面),才发现了mislocalization,他们为什么这么坚定的就认定了HDAC4呢?难道不怀疑自己的core gene set的问题吗?
Although total levels of HDAC4 protein were unchanged between controls and PD GBA-N370S patients (Figure S5B), the downregulation of four of the HDAC4-regulated genes (TSPAN7, ATP1A3, RTN1, and PRKCB) in PD GBA-N370S patient-derived neurons was experimentally confirmed 这里还有其他可能,或许有其他的因子来调节这几个基因。
我觉得坚定走这条路的原因:
1. 一个因子调控了8个基因,而且都是确实变化了的基因,IPA的调控关系十分可靠;(作为上游的调控因子的关系至关重要,需要非常准确,且有文献支持)
2. mislocalization和up、down一样都不是什么新鲜事了;
HDAC4, a class IIa histone deacetylase, shuttles between the cytoplasm and the nucleus, where it acts as a transcriptional repressor. We observed an increase in nuclear localization of HDAC4 in PD GBA-N370S iPSC-derived dopamine neurons compared to controls at DIV 45, consistent with the downregulation of HDAC4 controlled genes within our set of 60 genes (Figure 4A). This HDAC4 nuclear mislocalization was not observed in iPSC-derived non-dopaminergic neurons of PD GBA-N370S patients (Figure S6).
这里解释就很直观了,HDAC4错误地跑到了细胞核里,抑制了一系列的转录,从而导致了GBA中的core gene set表达下调,从而引发了PD疾病,逻辑一气呵成。
我们的难题:
- bulk数据不可靠,不可用
- HSCR的up的基因更多
- 没有IPA可用,噪音少(假阳性少)、已知的数据库更可靠(与其说验证,不如说重复),算法简单(主要还是根据疾病分类了)
- 我们有两个phenotype,所以会有三个gene set
IPA:
- 数据的基本结构
- 找一个疾病的所有相关分子
- 这些因子之间的相互关系(不是简单的PPI)
- IPA这个工具真的设计得很好,对于机制的挖掘十分有帮助
- 缺点:无法预测新的关系,所有的都是已经发表的关系
Question: Is There A Free Alternative To Ingenuity Pathway Analysis? 没有太好的替代品,IPA做得真的很好
Gene and Pathways Interactions Graph User's Guide 整合了很多数据库,但是数据库太老了
BioGRID Database Statistics - 非常新
关键
找出一个key gene,调控一群重要的基因表达,这不难,文章里只有7个而已,实际上有几百上千个基因的表达被影响了,所以此文只解释了其中很小的一部分;
关键是,这个key gene足够重要,这才能做接下来的功能验证;
首先这个key gene确实变了,Mislocalize,down;
其次Correct这个key gene一定要能恢复表型,不然就不算重要,后面功能验证也没意义;
2019年09月18日再读:
单细胞的核心优势是什么?Disease Progression
如何评估iPSC诱导分化而来的细胞?iPSC-Dopamine Neurons
这篇文章其实是几篇小文章的整合体:
1. three PD GBA-N370S patients and three controls,细胞的诱导分化和提纯;
2. 常规转录组测序,分析了Synaptic Function and Development;
3. 单细胞Stratifies PD GBA-N370S Patients,这是常规转录组做不了的,样本太少,effect太弱;PCA差异、heatmap基因表达差异、RPS基因差异;也是ribosomal protein差异;(虽然此病人服从诊断标准,但是在后续的跟踪中仍发现了差异;iPSC的单细胞情况和临床的信息完美的吻合了,加分)
4. Pseudotemporal Axis的构建,core gene set的筛选;用phenotypic linkage network验证了这个gene set;与已知基因集显著富集;其实是根据prior和pathway的结果来筛选的core gene set;用了一个Bayesian approach来锚定了这些core set gene;用IPA跑出来了一个HDAC4调控因子;用qRT-PCR简单大致的验证了4个基因;
5. HDAC4 Is Mislocalized to the Nucleus,验证了猜想的调控机制;功能验证更倾向于repressor
6. 调节HDAC4能够缓解Corrects其下游的基因和表型;
7. HDAC4 Modulation Corrects Perturbations in the Autophagy and Lysosomal Pathway
8. Idiopathic PD Cases,拓展了HDAC4的专利范围,PD beyond carriers of GBA mutations,我觉得还可以再拓展;
解释gene type和clinical phenotype的方法,值得借鉴:
Although all the patients in our study fulfilled UK Brain Bank diagnostic criteria for clinically probable PD at presentation, longitudinal clinical follow-up allows the diagnosis to be reviewed in light of disease progression and subsequent medication response. GBA patients 1, 2, and 4 presented at an early stage with akinetic-rigid parkinsonism and maintained a good levodopa response for their first five years of treatment without significant falls or dementia. In contrast, patient GBA3 presented with akinetic-rigid parkinsonism, failed to respond to standard medication (600 mg levodopa with benserazide 150 mg daily), and presented with early dementia and frequent falls two years after initial PD diagnosis. A supranuclear gaze palsy with dysarthria was then noted, and the patient received a revised clinical diagnosis of progressive supranuclear palsy (PSP). The stratification of PSP from PD among these GBA-N370S carriers by single-cell profiling of their iPSC-derived dopamine neurons is therefore consistent with the clinical stratification and reveals a potential disease-relevant pathway for PSP.
核心问题:
1. 他们的变异很简单,只做了一个变异,而我们则是很多个变异杂糅在一起;
2. 样本量少,有效的就是两个样本,这样就不存在confounder的问题,但也有可能找出假阳性;
3. 某些分析说不清原因,分析不够严谨;
4. 我们的bulk RNA-seq不纯,没法利用,之前的结论的可靠性有待考证;
5. 目前我们卡在了core gene set这一步,而且似乎也没有比较靠谱的可以讲故事的一个调控因子;我的数据分析只能止于此步了,再就只能往横向拓展了!!!
6. 后面验证都是围绕一个调控因子来讲故事,讲得很深;
2019年04月01日再读;精读;
已经发现我的data没法在PCA里有明显的规律;应该可以直接从bulk RNA-seq里获取有价值的信息,那么single cell到底有什么优势呢?回答:单细胞的数据是必须的,它可以把core genes锚定到case-control pseudotime,从而可以得到早期的致病基因。
比较:
- 他们研究的是PD,这个疾病更重要,更流行,在老年人中发病率高,生理表征是失去了多巴胺神经元,仅有10%的遗传性;我们研究的是HSCR,没有那么流行,大概万分之几,基本是出生缺陷病,表征类似,没有正常function的神经元,遗传性高;
- GBA基因发现与PD有很强的相关性,5%-10%的PD携带GBA杂合突变;类似我们HSCR里的RET基因的突变;
- GBA和LRRK2的研究已经发现蛋白的稳态缺陷,通过内质网、自噬和溶酶体通路;我们发现了什么?需要看一下RET的paper;
- PD是一个神经退性疾病,所以他们要找上游的信号;而我们已经测到ENCC,我们鉴定出来的就直接是上游的信号了,我们只需要看上游的影响了哪些pathway;
short-segment Hirschsprung disease (S-HSCR: 80% of the patients), longsegment Hirschsprung (15% of the patients), and total colonic aganglionosis (TCA: 5% of the patients).1
highlight1:因为iPSC-derived的样本包含了细胞的异质性,会混淆基因表达的分析,所以本文做了单细胞(然而即使FACS后还是有异质性存在)。
highlight2:样本分层,发现某个样本中某些pathway被高度激活,确认后,这个样本就从下游分析中移除了;我们的更复杂,两个对照细胞系,更多的疾病种类,如何处理?
highlight3:60 genes,capture an axis of variation between control and case;如何确认的,我们的数据如何处理?
highlight4:HDAC4地位的确定,早期的许多差异基因被HDAC4调控,所以顺理成章,后面围绕着HSAC4故事讲得很完整,但他们用的软件貌似是收费的;我们如何发现这个明星分子?
核心1:60 genes是如何选出来的?
- DEG bulk RNA-seq
- DEG PC2 switch
- markers SC3
最终得到了60个genes,其中52个下调,8个上调;
核心2:如何验证这60个genes的?
- phenotypic linkage network;确定这些基因确实是和表型相关的;
- A Bayesian approach that learns transcriptomic trajectories,就是把每一个基因锚定到pseudotime上,确定它们是在哪个点产生影响的;
- ingenuity pathway analysis (IPA) (QIAGEN),收费的,伤不起啊;
核心3:HDAC4是如何筛选出来的。
Although total levels of HDAC4 protein were unchanged between controls and PD GBA-N370S patients (Figure S5B), the downregulation of four of the HDAC4-regulated genes (TSPAN7, ATP1A3, RTN1, and PRKCB) in PD GBA-N370S patient-derived neurons was experimentally confirmed. 可以看到,HDAC4的表达水平是不变的,但是有四个早期的基因都是被HDAC4调控的(IPA显示的),所以才有了后面的mislocalization的推断和验证。
其实整个过程是可以控制的,他们的搞法不是特别地严谨,也就只能在这套数据上使用吧。我们是不是可以取齐精髓,改进一下这个方法;50-100个core gene set比较好讲故事,200个以上就不好讲了,他们的阈值卡的不错;
这篇文章的方法写得比较容易理解,比正文写得明确些,值得多看看。
方法篇:
RNA-seq read alignment and expression quantification
Single-cell and bulk RNA-seq data was processed identically. FASTQ files were trimmed using TrimGalore v0.4.1 on default settings. Transcript expression levels were quantified using Kallisto v0.42.5 (Bray et al., 2016) against GRCh38 reference human transcriptome. Additional quality control (QC) metrics were compiled by Picard 2.0.1 (https://broadinstitute.github.io/picard/) on BAM files aligned to the human genome (GRCh38) using Hisat2 (Kim et al., 2015). Transcript level abundances were then summarized to gene level estimates using tximport 1.4.0 (https://bioconductor.org/packages/release/bioc/html/tximport.html) (Soneson et al., 2015).
Quality control of single-cell RNA-seq
Quality-control, visualization, and handling of single-cell data was performed using Scater (McCarthy et al., 2017). Cells belonging to plates 3-6 were removed due to distinct clustering on reduced-dimensionality representations and low expression of otherwise constitutively expressed genes. Further outliers were removed using Cellity (Ilicic et al., 2016) and subsequently any cell expression GAPDH at a level below the maximum GAPDH expression in blank wells was further removed, leaving a total of 146 cells for analysis.
Differential expression analysis
Differential expression analysis on bulk RNA-seq was performed using DESeq2 (Love et al., 2014), including a covariate to account for technical replicates. GO enrichment was performed using goseq (Young et al., 2010) and over-represented p values were multiple test corrected using the Benjamini-Hochberg procedure.
Single-cell pseudotime analysis
A PCA representation of the cells was computed using the prcomp function in the stats package in R using the 500 most variable genes (in log expression space), the default in Scater. Single-cell differential expression analysis along PC2 was performed using the R package switchde (Campbell and Yau, 2017). A further refined trajectory using the combined gene list along was computed using Ouija.
Identification of pathway activation in GBA 3
Over-dispersion analysis was performed in the method identical to Brennecke et al. (2013) using the R package scran (Lun et al., 2016) using ERCC spike-ins. A gene was designated as over-dispersed if the reported q-value < 0.05. GO analysis was performed using the R package GOSeq (Young et al., 2010). Genes from the SRP-dependent co-translational protein targeting to membrane pathway were selected for validation by performing a Wilcoxon rank-sum test for log2(TPM+1) expression in GBA3 cells compared to controls and prioritized based on p value.
Phenotypic linkage network construction
To assess functional similarity and convergence of the core gene set we constructed a phenotypic linkage network (Honti et al., 2014). We wished to assess the functional similarity of genes within the core set compared to a randomly sampled background distribution. The genes selected for the background distribution should match the overall expression pattern of the core set in these iPSCs in order to account for the increased likelihood of functional similarities between genes randomly selected from the same cell type. We first noted that the core set of genes exhibited high mean expression than average among the whole transcriptome. We then fitted a gamma distribution to the mean log2(TPM+1) values of both the core gene set and all other genes in the transcriptome (the core distribution and background distribution). Then for each gene not in the core gene set we calculated the probability of observing the mean expression level under both models, and formed an unnormalized inclusion probability of the ratio of the density of the observed expression given the core gene set distribution to the density of the observed expression given the background distribution. This can loosely be thought of as ‘‘how many times more likely is it that a gene fits the core set distribution compared to the background distribution.’’ To choose the background set of genes we then sampled 1000 genes from the transcriptome excluding the core gene set, where the probability of a gene being selected was proportional to the un-normalized inclusion probability. The resulting empirical distribution fitted well with the fitted core gene set distribution (Figure S4D). We subsequently constructed a phenotypic linkage network as per Honti et al. (2014) using 1) the core set of genes, 2) the 1000 sampled background genes, and 3) the genes SNCA, PARK2, PARK7, LRRK2, UCHL1, GBA, PINK1, ATP13A2, HTRA2, PLA2G6, VPS35, and EIF4G1. Links were compared between different classes of genes using a one-sided Wilcoxon rank-sum test.
这篇文章的核心就是构建了disease pseudotime,case-control axis ,这跟普通的pseudotime完全不同;
从core gene set中选出了关键的基因HDAC4,并做了非常完善的功能验证;药物验证;
目前唯一一篇把单细胞测序应用到疾病发育上的高分paper,发表在了CSC上。
参考阅读:
首先明确一下概念:
帕金森疾病有一定的遗传因素,本文就是研究可遗传的这部分,GBA基因里的N370S突变会导致帕金森。The majority of PD cases are idiopathic, with only about 10% attributed to inherited PD cases
早期发现:increased ER stress, autophagic and lysosomal perturbations, and elevated a-synuclein release;内质网应激增加,自噬和溶酶体微扰增加,a-synuclein释放增加
基本的思路:
找出了一个核心的DEG:HDAC4
HDAC4靶向药物可以纠正帕金森症状
把核心DEG按照作用时间进行了分类。
根据临床信息把不同的疾病样本分层了stratify
HDAC4错误定位了 mislocalized
问题:
1. 为什么找出的是这个基因?如何找出HDAC4的?
因为这个基因是早期基因的一个抑制因子,它更有可能是disorder的起源。
Analysis of the core set of 60 DE genes using ingenuity pathway analysis (IPA) (QIAGEN) identified histone deacetylase 4 (HDAC4) as a repressor of a set of genes downregulated early in the pseudo temporal profile
这并不是一个转录因子
Pseudotime analysis of genes differentially expressed (DE) along this axis identified the transcriptional repressor histone deacetylase 4 (HDAC4) as an upstream regulator of disease progression.
HDAC4 was mislocalized to the nucleus in PD iPSC-derived dopamine neurons and repressed genes early in the disease axis, leading to late deficits in protein homeostasis.
HDAC4错误地进入了细胞核,抑制了那些早起基因,导致了蛋白质稳态缺陷。
2. 怎么找到靶向药物的?
3. 如何构建pseudotime的?
we identified a progressive axis of gene expression variation leading to endoplasmic reticulum stress.
Combining bulk and single-cell expression profiles, we identified a robust set of 60 genes whose expression captured an axis of variation between cells from controls and the remaining two PD GBA-N370S patients.
4. core gene set是怎么找出来的?
We next sought to identify a core set of genes consistently perturbed across both bulk RNA-seq and the single-cell transcriptomic signature. This core set was identified as the intersection of two gene sets from the analysis: (1) those DE in bulk RNAseq using DESeq2 after the removal of GBA3 (at 5% FDR) and (2) those DE across PC2 using switchde (at 5% FDR; Campbell and Yau, 2017; Figure S4B). We further refined this set to include only those additionally identified as discriminating marker genes after clustering the single-cell RNA-seq data using SC3 (Kiselev et al., 2017). By combining genes found through both bulk and single-cell DE as well as single-cell clustering (STAR Methods), we identified a core set of 60 genes, 52 of which were consistently downregulated and 8 of which were consistently upregulated in PD GBA-N370S iPSC-derived dopamine neurons (Figure S4B).
去文中一一找答案。。。
图1:
A:提纯purify iPSC-derived dopamine neurons,然后做bulk RNA-seq 和单细胞
B:increased expression of dopamine neuron marker genes多巴胺神经元的marker表达提升。感觉图表很业余,看得费力,大致就是想展现一些marker的表达情况,证明细胞是对的;
C:常规DEG的火山图,对比的bulk RNA-seq,我们不能用,因为样本不够纯;
D:常规的GO 注释图,down and up分类
图二:
A:单细胞常规的聚类分析,只不过用得是PC2和PC3,应该用了scater包,发现了分层现象,GBA3的表达变异被PC3捕获了;
B:常规的Over dispersion分析,鉴定出具有生物学差异的基因;才143个gene,有没有搞错;
C:an enrichment in the endoplasmic reticulum (ER) signal recognition particle (SRP) pathway in GBA3;ER通路在GBA3中高度富集;
D:挑了几个基因出来证明;
E:qRTPCR验证,靶向膜途径的SRP依赖性共翻译蛋白的升高对GBA3具有特异性,提出了本研究中使用的患者的分子分层,分子分层的依据;
The stratification of PSP from PD among these GBAN370S carriers by single-cell profiling of their iPSC-derived dopamine neurons is therefore consistent with the clinical stratification and reveals a potential disease-relevant pathway for PSP.
我们的这个项目适合分层吗?肯定可以
图三:
A:首先想办法构建了一个pseudo diease axis,然后判断core gene set在哪个位置发挥作用
B:统计学验证了core gene set的合理性;
C:挑了几个基因做了验证,DIV 22就是分化的天数,可以直观的看出基因到底是在哪一个区段开始发挥作用