Genomics, Big Data, and Medicine Seminar Series – Trey Ideker

Thanks for the invite here. Today’s lecture is on how to interpret genomic variants in general, and how network knowledge of the cell may be useful in doing that, which is quite close to everyone’s hearts. So let’s see.

As all of us in this audience know, or should know, massive DNA sequencing is underway for lots of different diseases. We happen to mostly study cancer, in which case you have, depending on how you tally it up, about 10 to the 5th genomes. Of course that depends on whether you’re talking about the complete genomic sequence or just the exon portion of that, but nonetheless it’s a lot of genomes, and it’s made possible by these X Ten instruments, among other advances. If you read essentially any of those Cancer Genome Atlas papers that people have been publishing, or again papers for any complex disease, practically by definition one of the most striking findings out of all of those studies is how enormously heterogeneous each patient’s genome is. In cancer, when you sequence the genome, as you may know, you actually have two genomes: the genome of the tumor and the genome of the normal tissue, and you subtract the two to get just the mutations in the tumor relative to the normal. But even then you have on average, depending on the mutational load and tissue type and all of that, let’s say about a hundred mutated genes in that tumor versus normal. So if you look at that pattern of somatic mutations, it’s essentially a snowflake for that particular patient: that 100-gene set has never before been seen in nature and will never be seen again in nature. And that really is the fundamental challenge that underlies interpretation of these complex disease genomes.

To see that in a couple of different ways, let’s first just randomly take a patient out of the Cancer Genome Atlas. The most frequently mutated gene among the 25 genes this patient has mutated is GATA3, and
that’s about 50 times more than expected by chance, relative to the background mutation rate. Then very quickly you get into this quote-unquote long tail of mutations, and this line here is itself the random background. So notice there are genes that are mutated below random background. Now, are these genes functional or are they not? Notice GATA3 is a known cancer gene. BRCA1, perhaps the most infamous breast cancer gene, is not currently detectable by GWAS; it just gets missed, but of course it’s known by lots of other means. And these two genes are also known cancer genes, yet they’re very rarely mutated in this cohort, begging the question: are they really functional or are they not? So that’s the real key challenge of heterogeneity.

Let’s now switch cohorts and look at the Cancer Genome Atlas ovarian cancer cohort, and now we’re going to zoom in on just one chromosome at gene-by-gene resolution, not mutation-by-mutation resolution. There are about 550 genes encoded on chromosome 17, and a dot indicates that that gene was mutated in that patient.

For a long time in cancer biology, and in complex disease biology as a whole, it has been appreciated that this disease is not a disease at the nucleotide mutation level or even at the gene level, but a disease at the pathway level. Here’s the whole pathway. One gene in that pathway in particular, in this case EGFR, is frequently mutated, here in fifty-seven percent of these glioblastoma patients. Look at how many more mutations you can pick up and recognize as really the same subtype of cancer, based on the fact that genes in that same pathway, or downstream of EGFR here, are mutated. And look, you have genes like RAS, which are very well-known cancer genes, mutated in only one percent of this particular patient cohort or population, but nonetheless we believe these are going to be functional because we’re hitting that pathway. And again, it’s
not just RAS; it’s really many, many different points connected by pathway interactions. So that’s the working hypothesis for how we might make sense of all of these data. The question then is how do you implement that hypothesis in any kind of systematic way, and that’s right now where the field is. We sort of know that pathway information is important, but how exactly you get at that with the same kind of systematic rigor that the genome project
itself is getting at mutations is sort of the open challenge.

My last introductory slide here is just to remind us that most cancer genes have not been discovered with sequencing or with GWAS. This is again part of Matan’s paper that’s in press right now. He did a very interesting thing that is simple to do but that no one had done. You take the gold standard set of cancer disease genes, and you might say, well, which gold standard set do I believe? There are about three different ones. He took this one, the Cancer Gene Census from the Sanger Center; it’s got about 500 cancer genes, all literature curated and validated by that team. But if you prefer a more conservative list you might look at Bert Vogelstein’s list, and there are other lists you might look at. All of them are going to have the same qualitative plot that I’m going to show you here.

For each accepted cancer gene in this database, what Matan did is go into PubMed. You might know PubMed has these things called MeSH terms, which I’d actually never personally used before, but they were really useful here. You look at the first PubMed publication for each of those cancer genes that associates that gene with cancer, so not the first time the gene is ever mentioned in the literature, but the first time that someone says this is a cancer gene. Then you look at that PubMed MeSH term annotation, which tells you what experimental methods were used in that paper, and he records what those experimental methods were. This takes us through 2013, although he actually did this in 2015; I think he wanted to make sure he had a completely updated Cancer Gene Census, so it ends in 2013. But you can see here, of those genes, the real winner in terms of the first discovery of a gene as a cancer gene is gene manipulation by knockout, knockdown, or knock-in. So the original Bob Weinberg-ian experiments, where they showed oncogenes by transfecting
in a DNA fragment that would transform cells, that was certainly mapping to the gene knock-in or gene manipulation category. Another great example to pick out of here is p53: although it is the most recurrently mutated gene in sequencing and would certainly be rediscovered by sequencing, it was in fact found first by gene expression, so it would map to the RNA analysis category here. You could always say, well, sequencing has come along more recently, maybe part of this is historical, and that could be true. But it’s nonetheless, I think, sobering to look at this and realize that sequencing really has not found most of the cancer genes we now know, and it’s worth finding out whether it could.

Okay, so then the question is how to get at these cancer genes, and how to get at the pathways that these cancer genes function in, in any kind of systematic way. For that reason, late last year Nevan Krogan and I announced this initiative we call the Cancer Cell Map Initiative. It’s just getting off the ground; it’s a collaboration currently between UCSD’s cancer center and UCSF and their cancer center, and the two teams there. The idea is twofold. One, can we integrate knowledge from genotype into these networks as a way of better predicting phenotypes of interest to clinicians, like disease diagnosis, response to therapy, and so on? Essentially the operation is exactly the same as what all good geneticists do to go from genotype to phenotype, the difference being that now we want to insert a middle layer, and that middle layer would be a comprehensive map of pathways relevant to that disease biology. That begs the second question: how do we begin to accrue that comprehensive network knowledge in any kind of systematic way? Much like genomes are now systematically sequenced, can we do the same for pathways?

Today I’m going to talk almost exclusively about this first task. I’m going to assume we have a lot of network knowledge, and
we’re going to get a lot more of it as systematic pathway mapping continues. But the question I’m going to address today in my talk is really the analysis challenge: how you would use that knowledge systematically to translate genotype into clinical phenotype. For that purpose, because we don’t yet have incredibly systematic knowledge of networks and pathways (it’s coming, by a bunch of different technologies I won’t spend much time on today), we do at least have existing network knowledge, and that has been collated in a bunch of
different databases that many of you in this audience, I think, if you don’t use, have at least heard of: an alphabet soup of acronyms like BioGRID, IntAct, Human Protein Reference Database, and so on and so forth. These collate interactions of many different kinds: protein-protein interactions, transcriptional or protein-DNA interactions, metabolic connections. You can always throw up your hands and draw a coexpression network if you want, to connect together co-expressed genes. If you just look, back of the envelope, at how many interactions you can get from existing databases, it’s about a million, about 10 to the 6th, for pairs of human proteins. This of course is by no means comprehensive and by no means has a well-characterized error rate, and these databases are not context specific in terms of what disease or cell line, much less patient, they refer to. But nonetheless you get a large collection of, again, about a million interactions that may or may not be useful as hypotheses behind the pathways of interest. So that’s what we’re going to use here, just to get started with this whole exercise.

For about 10 or 15 years my lab has worked on ways of translating patient data through networks in this way, but the paper I’m going to talk about today is this 2013 Nature Methods paper, because it was really built on the shoulders of a lot of our earlier work, for instance work done by Han-Yu Chuang and Alice Lee. This is Matan here. I think this paper had all the wisdom of those earlier ones, and it really has been pivotal for all the thinking in my lab since.

The way that this method uses networks to interpret patient data goes something like this. If you simply look at the raw somatic mutation data, as we already saw in the ovarian cancer example, you cannot cluster it. What’s the first thing you want to do with omics data, not knowing anything better to do? Let’s try to stratify those patients by their omics
data, that is to say, cluster the patients by their omics data. For gene expression, you can always get clusters of the gene expression profiles because it’s not a sparse data set: every gene has a level in every patient, and there are lots of downstream effects like proliferation and stress that cause patients to be quite similar to one another across the whole cohort. So you can always get clusters; the question then is, are they meaningful? In the case of somatic mutations, we saw you can’t get clusters. In fact, if you simply try to cluster patients right out of the box, you get one monolithic cluster, in this particular variant of the clustering algorithm, or each patient in their own cluster. It just doesn’t know what to do because, again, there are very few features shared in common between patients. Everyone has the p53 mutation, and after that no one has anything in common whatsoever. That’s sort of what’s happening here.

So here’s what Matan did. He said, well, the hypothesis is that it’s not the gene level that’s causing these patients to look similar or not; it’s the pathway level that we want to analyze. So what I want to do is transform those gene measurements, by pathway knowledge, into something that I can cluster. I’ll show you the schematic of the approach in a second, but what this approach is going to do is, instead of representing whether a gene is mutated or not in a patient, instead of just using that list of genes, every gene is now going to get scored in a patient by how close it is to mutations in the network. So even if a gene wasn’t mutated in the patient, if that gene is upstream or downstream of the mutated gene, it will feel heat and it’ll get a score. So now you’ve transformed a very sparse set of 50 to 100 genes into a network proximity score for every gene, and that clusters beautifully. Let’s see that.

First, the algorithm here is not original. It’s got several
different names depending on which literature you read, but they’re all similar: network propagation, network diffusion, or network smoothing. The paper I like to refer people to, which was an early use of this in the computational biology literature, is this paper by Vanunu et al. from Roded Sharan’s group at Tel Aviv University, published in PLoS Computational Biology in 2010. The first use in cancer I’m aware of is by Ben Raphael’s group, in this paper here, and we’re using a very similar algorithm now to cluster patients. The way it works is, for each patient, you propagate those
mutations on the network in a way I’ll describe. Imagine two patients, a yellow patient genotype and a blue patient genotype. Yellow marks the mutated genes in patient 1, blue marks the mutated genes in patient 2, and maybe there’s one gene like p53 that’s mutated in both patients. Now it runs essentially a thermodynamic simulation where every mutation is a source of heat, and that heat can then spread over time through wires in the network, which are these protein interactions. So that blue heat now spreads to its neighbors, and then to a lesser degree to the neighbors’ neighbors, and so on until the entire system comes to equilibrium. That’s essentially how it works. There’s a closed-form solution for the math, but that’s unimportant; the point is you get this sort of heat spreading, and that’s how upstream activators, repressors, and downstream effectors will feel the heat if you’ve mutated here. In this toy example, what’s happening is that because this region of the network gets hit a lot in both patients, even though it’s almost never in the same place, that whole part of the network now feels heat in both patients, and that similarity is what allows you to cluster the patients in a robust way.

To show an example, this is now back to those 351 ovarian cancer patients. This happens to be the network responsible for clustering the patients robustly into one of the four clusters. In fact, two slides ago was that slide I showed that took an unclusterable dataset into four clusters; this is now one of those four clusters. The idea here is that although two patients in this cluster almost never hit the same genes, every patient in that cluster hits at least one gene on this subnetwork on the slide. Just to illustrate one case study there, this cluster of red genes happens to be the fibroblast growth factor receptor and fibroblast growth factor family, so it’s getting connected in the network database presumably because you have a family
of related genes, all in the FGFR family. A few of these genes were already known to be cancer genes, FGFR1 in particular, because it is frequently mutated in cohorts of cancer patients; it is hit time and time again over that cohort. But in other cases you have a bunch of n-of-1 events: there’s one patient that hits there, for instance, one patient that hits there, one patient that hits there. If you didn’t have network knowledge, even though in retrospect this is almost an obvious case to have looked at this gene family, it wouldn’t have popped out; you would not have found any recurring events. What the network is doing is organizing them all, at a higher level, into this network.

I wouldn’t show you this particular example if it weren’t also clinically interesting. Once you get clusters, whether from mutations through networks as in this case, or gene expression, or whatever way you’ve derived your patient clusters, you’d like to know whether those clusters are clinically informative. Up until this 2013 paper there had been about eight different expression studies of ovarian cancer, and while all of them had shown a clustering figure at some point in the study, none had been able to link those clusters to any kind of clinical outcome. So we thought it was significant that when you cluster with mutations, as opposed to expression, you get clinically informative outcomes. Here are the survival and drug response curves for those 351 patients, stratified by network subtype. The network I just showed you, when it gets hit, that’s the blue subtype. It’s the most aggressive subtype: you can see that after 70 months no patients are left alive. Compare that to the best subtype, where 65 percent or upwards of the patients are still alive. And look at the drug response in terms of platinum sensitivity. By the way, all ovarian cancer patients get the same standard of care, platinum-based agents. Why? Because
we can’t stratify them right now, so they all got platinum, and you can therefore look at what fraction remain sensitive to platinum over time. You can see that what’s essentially happening in this aggressive subtype is that all these patients are drug-resistant essentially right out of the box.

Now, this is unpublished data that we’re writing up now. Since the 2013 study, a second large sequencing study of ovarian tumors came out from the sister project of the Cancer Genome Atlas, the ICGC, or International Cancer Genome Consortium. We were quite pleasantly surprised to see that when those patients came out, even though it’s fewer patients and one of the subtypes
drops out for lack of patients, if we now simply use these subtypes in a prognostic way, in a supervised way for you bioinformaticians out there, then we can take each one of these ICGC women and put her tumor into a subtype that we defined back here, look at the survival of those cohorts, and it’s the same exact stratification. So that’s suggestive.

If you look back again at that network I showed you, this is a quick and dirty layout of the exact same network I showed in the previous slides, same nodes and edges. If you now look at which nodes are getting hit in TCGA, in orange, and which ones are getting hit in ICGC, in blue, the green are the genes that are hit in both cohorts. You can see that without the network you would see much less similarity, just four genes here, than you can now recognize with the network between these two patient cohorts from different continents. So that’s again showing you how the network is able to integrate these discrete events.

Then just to give my one non-cancer example: my lab mainly thinks about cancer, but we collaborate with many others who think about all sorts of complex diseases. This is a paper by Joe Gleeson from a couple of years ago where we used essentially the same approach that Matan had developed, and this is a network that integrates mutations for that disease. This is more of a straight gene linkage slash GWAS type of paper, for a rare childhood neurodegenerative disorder called HSP, or hereditary spastic paraplegia. Joe sequences these cohorts of consanguineous individuals as a way of mapping disease genes for these rare childhood diseases, and essentially what he was able to do here is use that network to substantially increase his power to find linkage for some of these candidate genes. It’s the same exact idea.

You heard from Pamela already that we develop Cytoscape. I have to have my obligatory advertisement for Cytoscape, but I also
want to point out that our latest thing is we are trying to get Cytoscape hooked into a back-end database of networks. There are lots of interaction databases out there, but what we saw a need for in the field is databases of pathway diagrams that come out of studies like those going on here in many of your labs, like those that go on in my own lab and many labs everywhere. You find that you have a pathway diagram coming out of these systematic network analyses, a quote-unquote subnetwork. Maybe one of those subnetworks ended up in a major figure of your paper, but then seven others are deep in the supplement, and you might know them but no one else is ever going to know them. The idea is we need to capture those. These are not canonical pathways like you find in textbooks, or in the databases for those. These are working databases of network hypotheses that have been published, so they’re somewhat mature, but they’re not all the way to, you know, a Molecular Biology of the Cell textbook kind of thing. Like Facebook, you can define social groups; you can share your networks at the beginning with just your work group, and once you publish you can make those networks public or not. It’s up to you; hopefully you would make them public once you publish. This was actually developed initially through a grant from pharma, so for instance Roche has NDEx deployed inside of Roche, and they get all the public networks but they also get the Roche-specific networks that none of us can see. An interesting feature, but it’s useful for industry collaborations with academia and so on and so forth. So check that out.

Now, for the remainder of my talk, a little more than halfway through here, I want to talk about where we think we should go from here, which is essentially the same thing as saying what bothered us about that last analysis. And what bothers us, not just about that last analysis but in fact about network
biology as a field as a whole, is that networks do not look like the contents of cells. The contents of cells are not networks. It’s not the same as if you opened up a Pentium chip; there is in a Pentium chip a network, that is, the wiring diagram of transistors and the connections between them. So this hairball we draw for biology actually does exist in some domains when you draw that hairball, like in electronic engineering.

And in biology, you know, okay, two proteins interact, we draw a line between them, but that’s not what the cell looks like. The cell looks like this. In fact, what you’re looking at here is the same structure in two different representations. This is the proteasome. As you may know, the proteasome is the vacuum cleaner of proteins in the cell: it sucks up and degrades proteins. You’re looking at that vacuum cleaner from a side view; there are multiple folds of symmetry, of course, but you’re looking at the side of it. In orange and yellow tones is the core of that vacuum cleaner, and the regulatory particle is in purple and blue on either side. And in the network, with the same colors, there’s the core here, and in purple and blue there’s the regulatory particle. You can already start to see there is some similarity between the network representation and the structure, because in the spring-embedded layout in Cytoscape you can see it’s trying to pull apart the core and the regulatory particle, because there’s a higher density of interactions on either side than there is between them. The question we had is how much farther can you push that, towards ultimately a structure, or if not a structure, at least some sort of hierarchy of modules that that structure contains.

Let me describe what I mean by that. For the proteasome, it’d be great to get a 3D picture of it, but even before you get a 3D picture you can at least recognize that there’s a deep, three-layer hierarchy of modularity even in this quote-unquote single protein complex. There’s the whole proteasome, which factors into a core and a regulatory particle; each of those in turn factors into two parts; and it turns out there’s even a sub-sub-factoring beneath that that you can’t see here. So what we tried to do, starting a couple of years ago, was see how far we could get at that hierarchy of modules that a protein complex or pathway contains directly from data, from interaction
data, I should say. That resulted in an approach we called NeXO, for Network Extracted Ontology. Why the word ontology? Because if you think about it, the Gene Ontology is in fact such a hierarchical factorization of the cell. I’m expecting ninety percent of my audience has probably used the Gene Ontology and either loves or hates it, or maybe both at the same time. But if you don’t know the Gene Ontology, the easiest way to describe it is that it’s the best model of a cell, the best descriptive model of a cell, that we as biologists have, for better or for worse. It’s a large team of about five different groups, headed by Judy Blake, that has attempted to factor a cell into its constituent parts, and their parts, and their parts, all the way down to individual genes. At the top of the hierarchy you have organelles, organelles factor into processes, processes factor into complexes and smaller complexes, and finally genes. We thought, and this is now a team led by Mike Kramer and Janusz Dutkowski, that we could try to reconstruct a lot of this hierarchy directly from network data.

We went for this process first in yeast, so the next number of slides are no longer human cancer; it’s now all in Saccharomyces cerevisiae, because we have by far the most network data, really the most of any omics data, for yeast of any biological system or species. If you pour in different interaction types, like protein-protein interactions and other kinds of interactions, and start to cluster those interactions using any of about a thousand different clustering algorithms that have been published, you could begin to create these binary trees, or dendrograms, to describe your clustering. You would join two genes based on their similarity in interaction space, then you would join that later, based on similarity, to other clusters, and so on and so forth, until you had this entire tree linking all genes. That starts to look like the
hierarchy of processes in a cell, with two key differences. One is that when you’re talking about, say, a protein complex in the cell, there’s no requirement that that complex has only two genes or two proteins beneath it; a protein complex can have any number of proteins at once. So we first need a clustering algorithm that can join them right away: if you have seven proteins, or seven objects, that are equally similar, you just join them in one fell swoop. That’s the first requirement that most clustering algorithms didn’t have. The second is this notion that geneticists call pleiotropy. That is to say, what about a component, a gene or protein
complex, that works not just in one parent process but in multiple parent processes? Why not just recognize that in data? Why force everything into one cluster? Let the data tell you how many modules that gene or complex participates in. So you need to allow for multiple parents, and then you get a structure that is driven not by some computational convenience, like this binary tree that really came out of evolutionary biology and speciation trees. Here we need a structure from our clustering that resembles, and is driven by, the data. That’s all I’m going to say about the algorithm. It was published in Nature Biotech in 2013, you can read all about it; we updated that in Bioinformatics in 2014, you can read all about that. In fact, I think there’s probably a lot of work still to be done on how you best do this clustering, but we’ve got a couple of tools.

Having built this structure, you can now directly align it against the reference, the human- or literature-curated version that we call the Gene Ontology. So you can recognize when you’ve rediscovered something: ah, there’s the ribosome, I just rediscovered that; ah, there’s something else, there’s protein translation as a parent of the ribosome, I just rediscovered that. And then you can also look at where you have not captured GO and you have novel concepts or terms that you have found, and that’s your novel biology.

To bring this a bit more down to earth, this is an unpublished project. This is Mike Kramer’s last work before he graduates from my lab, where he’s trying to build an entirely data-driven ontology of the process of autophagy. As you may know, autophagy, like the proteasome, degrades proteins and components in the cell. It’s the other pathway by which you can recycle components in the cell, through the so-called lysosome, or vacuole; in yeast it’s a vacuole. Just to show you how it works in this case with data, here we’ve already integrated
lots of different omics data sets, again PPI (protein-protein interaction), gene expression, et cetera, into an overall pairwise gene similarity score. Red means that in the data these genes are more similar; blue means they’re less similar. We’ve already done a normal 2D clustering on that, on a subset of core autophagy genes here, to show you how it works. As many of us are used to doing, you can look at these heat maps and pull out by eye some of the structures. You can see here ATG7 and ATG3 (ATG, by the way, stands for autophagy, if you didn’t guess); ATG7 and ATG3 form a cluster. But look, 7 forms another cluster with 10 and 12, and 10 itself forms another cluster with 27. So already you can see a straight clustering would not really have done the right thing here. Not only that, but here’s a case where that strong cluster has a slightly less strong cluster above it, and a slightly less strong cluster above it still. All of that gets processed by the algorithm, and you get this ontology. By aligning that against the knowledge in the reference Gene Ontology, you can now start to name the terms, or the concepts, or the clusters you have found in this heat map. So let’s look at what happened with ATG7: it got put under two terms. And then all of these guys end up getting clustered here, which looks a lot like what GO calls core machinery of autophagy. Now, you also get terms that don’t align to GO, and it turns out one of them is here, 183; that’s just the systematic numerical name that the algorithm gave it. Look at this term in the data, though: 10 and 27, that’s a pretty strong cluster, and it turns out it’s very robust. If you look at the data, there are multiple experimental data types putting those two together. That’s probably a real function, and it’s very clear where to look. That’s the idea.

Let’s now look at the whole ontology for yeast, so not just one sub-process. This is what it looks
like. In fact, you can browse around it; there’s an online tool here that I’ll quickly show you after I orient you. The root of this thing is here. All the names, again, come from direct transfer from the Gene Ontology database by direct alignment. Very quickly: this is the whole cell, and the cell factors into mitochondria, a membrane, and an intracellular part. This is of course, again, all directly from data. Why did it do that? Because even though mitochondrial genes have weak similarity to one another, it’s nonetheless stronger than their similarity to non-mitochondrial genes, and

that’s why you get that whole branch and it turns out that go tells you that was the mitochondrial part that let’s now look into just to bring this whole thing full circle let’s now look back at at the original example that motivated me here with the proteasome there is what the proteasome looks like if you currently look at the network and cytoscape or your favorite network browser here’s what it now looks like in this neck so data driven hierarchy or gene ontology that’s up towards the root so just to go back here it is in the full view and just zooming in on that we’ve and again the names come from transfer from go when you recognize the same the same term so we found the proteasome it splits into liqueur and a regulatory particle the core in turn splits into an alpha beta subunit the regulatory particle splits into a base and a lid and then you have all those those those proteins so you have these three layers of hierarchy that do get recapitulated and just so you’ll check out this there is a tool behind this so here it is online in next ontology org this is more of I think a publicity stunt right now than to actually you know have a functional tool but you can see where how useful it might be so now we can zoom into this thing there’s the proteasome right and if you want to actually see what are the raw data that that led to any one of these terms being formed you can click on it and it’s going to pop out and there’s the hairball okay and if you want to look at the interactions maybe it’s a little more intuitive here you can you can scroll down and get all of the data that are supporting that particular inference but again why do that just look at the hierarchy and realize that data art have have gone all these data have gone into building it oh yeah re so in the last seven minutes here one more thing to talk about so so just this just to finish off that point maybe when we’re looking at these hair balls we’re simply too close to the data not unlike if you 
think again about the structures and their functions. How is that structure actually generated from structural data? Well, it turns out that’s a cartoon that I downloaded from PDB, but if you click and find out where the data actually came from, it’s a few million X-ray diffraction images stacked together and analyzed by what is now a very well-established set of algorithms that we call structural proteomics. Right now you take all of those images, you push the button, and you get that. Not that simply, as you know, but that’s the workflow. Of course you can look at those pictures, and people do and certainly did, but that is obviously much more intuitive. So in the same way, maybe networks are just like diffraction images: it’s not something we can process by eye easily, but there are important patterns there that can be analyzed if you know how.

OK, now I want to get back to the first thing I talked about, which was using this kind of information to interpret genomes. I showed you at the beginning of my talk how you can use the network, in a flat way, to integrate mutations, which can stratify patients and predict things like survival and drug response. Can we do the same thing now that we’ve changed the representation of networks, and we’re no longer looking at them directly but building them into hierarchies? That’s what this last piece is about. Again, it’s still in yeast, because our hierarchies are in yeast, but you can see where we would like this to go with more research; and this just came out a few weeks ago.

So the way this is going to work: first I have to have a model genotype-to-phenotype translation problem in yeast. I certainly have them in abundance in human disease. Here our model problem was predicting yeast growth phenotype from simple genetic perturbations to the genotype. In yeast, lots and lots of experiments exist where a single gene, or a pair of genes, or three genes, a triplet of genes, have been
deleted and the growth rate measured. The largest such data set is from Toronto, by Brenda Andrews and Charlie Boone, who looked at many, many double gene deletion genotypes and scored them for growth. Combining all of the extant data out there in the literature, you get what I think is probably the largest genotype-phenotype compendium in existence. It’s about three million genotype-phenotype pairs, and most of those are pairs of genes being deleted and a growth measurement. So it’s a huge amount of genotype-phenotype data in yeast, and the question is: can we predict the growth for a simple genotype, like a pairwise gene knockout, through this hierarchical network representation or through the Gene Ontology?

An ontology is a very natural way to think about integrating these data, because it spans, in a very convenient representation, all of the scales between genotype and phenotype. If you think about it, genotype-phenotype really is the original multi-scale problem. You have insults or changes to nucleotides at the level of nanometers, right? Those impact protein complexes and larger molecular machines on the order of hundreds of nanometers; those then impact signaling pathways, and finally organelles, and finally the cell; and cells are about 1 to 10 microns for yeast, for humans maybe a little bit bigger. So we’ve basically connected here a 1-nanometer scale to the cell level, which is maybe a 10-micron scale, and all of the layers in between. If you ask how deep this hierarchy goes, GO is about 12 layers deep, and our data-driven GO, NeXO, was also about 12 layers deep. So it’s quite nice as a translation hierarchy, or device, to get from a nanometer scale to a micron scale.

The way we thought about this was: take the genotype (by definition, a genotype is the set of states on genes) and convert that to a set of states on intermediate biological processes like complexes, pathways, and organelles. We call this the ontotype. So whereas the genotype is the set of states of genes, the ontotype is the set of states of all biological objects, all the way from genotype to phenotype. So for instance, this large complex T5 here would get some state which is a function of its components and their components, all
the way down to the genes. That begs the question: what should the function be that integrates all of this? That’s an open question. Here we wanted just to start as simply as possible, so we used essentially an OR gate: for any term, if anything below T5 gets mutated, I call T5 mutated. And in fact, as a slight variant of that, what Mike Yu has done here is simply count the number of genes beneath that term that are mutated. So minus 2 means that there are two genes that have been deleted beneath T5 here when I delete B and D together. And notice that’s the same ontotype I can get from deleting another set of genes together, because they’re impacting the same hierarchy of processes. Once I have the ontotype, I just do a straight GWAS or machine learning thing and, in a black-box way, try to associate these term states with the growth phenotype. That’s the way it works.

I’m going to skip that slide and go right to the prediction. So here, the interaction score really just applies to double mutants; you can think about this as the predicted growth score versus the measured growth score. This is again against those three million genotype-phenotype relations in yeast; it’s in cross-validation, if you’re interested in that. And you can summarize how well the predictions are working. You can certainly see some correlation here, but you can of course compute a correlation or a precision-recall from the raw analysis. But let me focus really on this set of bars, because what it does is just take this analysis and give you a correlation. So what you’re looking at here is about 0.34; it’s a correlation coefficient between predicted growth and measured growth. So, room for improvement; it is what it is. This is using the GO hierarchy; this one is using the NeXO data-driven hierarchy I described. So you can see the performance: whether you construct a hierarchy or you take the literature-curated hierarchy,
it’s about the same. At the moment the literature-curated one wins slightly. If I randomize the structure of the hierarchy, as a negative control, I like to see that my correlation goes away. And then there are a bunch of papers that in the past have attempted to predict these kinds of growth or genetic interactions using what we’ll just call non-hierarchical approaches. There are a bunch of different ways, and we just chose three of them here: FBA, MMC, and GBA. And for comparison we also simply flattened both the GO and the NeXO networks; you might be aware that there are straightforward ways of taking the entire hierarchy and just reducing it to gene similarities, and we just used that gene similarity network for prediction of interactions. Essentially all of these non-hierarchical approaches perform around the same, and significantly worse, I hope you’ll appreciate, than these hierarchical approaches. And then here are the precision-recall versions of that same analysis. So we think there’s something simple and inherent to this hierarchy that’s really working for us here.

So that’s my talk. Where do I think this is going? We want to apply this to human cancer and other diseases, but generally we think it’s a really nice way to proceed, to move from genotype to phenotype. Why? Because it’s a way of capturing cell biology. Networks just capture cell biology; these hierarchies now just encode those networks in a way that relates genotype to phenotype. Genotype and phenotype: I used to think about them as fundamentally, qualitatively different objects. Genotype, phenotype, they’re not the same. They are the same; they’re just at two different scales. They’re traits of biology at two different scales, and those two traits, traits on nucleotides and traits on cells or organisms, can be connected through hierarchies. And so you can think about mutations hitting genes, or dripping down from genes through this sort of organic system of funnels, and then ultimately all getting coalesced to create a unified phenotype that we see.

So in summary: as we know, genome sequencing has revealed thousands of genes altered in many diseases. Some of these genes are well known and have strong effects; others have weak effects, and we don’t know exactly how many there are. But common patterns emerge, or begin to emerge, at the
level of components, pathways, and systems. For cancer, Nevan and I have recently launched this open campaign to try to map cancer networks and apply them using these principles. This is, by the way, not meant to be an exclusive activity; we’d love to involve people who are interested, and we have a planning meeting going on into the summer for that, so please let me know if you’d like to participate. And then in this last piece, which I think is certainly the most interesting computationally and maybe the most forward-looking biologically, we think network data are just too close to the raw data, and we should be building at least hierarchies from them, if not something even more advanced like a whole-cell model. And we have some promising results that show how that representation is even better at connecting genotype to phenotype. And it’s this idea of the ontotype: really, those two notions, genotype and phenotype, are just opposite ends of the spectrum. So with that I will take any questions. Thank you very much.

Lots of lively questions afterwards. Thank you.

Um, so I guess I have like three very related questions. You mentioned the Gene Ontology in terms of how you were able to kind of capture some of the biology of this, right? Have you looked at other kinds of cell pathways, if you will, to see if maybe some of these mutations cluster in certain specific groups of pathways, like in the hallmarks of cancer paper that described multiple processes that have to be hit?

Yeah, to me that’s exactly where we want to go. Yes, the idea is that now we’ll have a hierarchical representation of that, or we need to work towards having it. And again, I think it’s going to be interesting to see what that hierarchy is. In terms of the hallmarks, it’s basically two levels, right? Maybe I should go back and read that paper, but, you know, at first blush it’s
basically one level of hierarchy: these 12, I think it is now, pathways. They have genes, but we know it’s going to be like 12 layers deep.

So, I mean, have you tried to see if, let’s say, the KEGG pathways, just using those to perhaps group together your genes, work as well?

Yes, absolutely; that part has been done by the TCGA, you know, like the glioblastoma pathways. I should have said at the beginning: certainly my group is not the first, by any means, to organize mutations in canonical pathways, and KEGG I think is one of the sources people use, especially, for instance, in glioblastoma, where you have IDH, which is a prominent TCA cycle gene that is frequently mutated, and so you have sort of a rich story around metabolic canonical pathways you can integrate. So I think my talk has really been about how you would systematize that, and clearly at the end I think you should go back and ask how much of the systematic discoveries overlap with the hallmarks.

And, uh, one last question: I’m interested in complex diseases, particularly Alzheimer’s, and I wonder about your thoughts on the application of this to discover epistasis and, like, epistatic interactions.

Yeah, exactly. So epistasis is an interaction between two or more genes, just to be clear, right? And that’s exactly what has gone on here to get this good predictive performance over yeast growth. Essentially all these experiments where Toronto knocked out pairs of genes and measured growth, that was to find epistasis, and so you’re really trying to predict that. The game here really is: if you just want to predict growth, ninety percent of the time you can do pretty well, because you can start by predicting wild-type growth, or you can look at the single gene deletion phenotypes and just add them together. If you have two or three gene perturbations, you would just add the effects of each perturbation independently, and you can predict growth fairly well that way. But here the question is, what
can you do to predict the actual interaction? So if you notice the predictions versus... let’s go back here. Yeah, too fast, I glossed over it. This actually wasn’t predicting growth directly; that would look like 0.99, that would be perfect correlation, and it wouldn’t be helpful. Here we’re actually doing the harder problem, predicting the interaction. This is predicting epistasis, and so that was actually the benchmark we used. I sort of glossed over it; this is all about how well you’re predicting epistasis.

Hi Trey. Hi. This is Ray. I’m a PI and work with Eric. I was at UCSD, so probably we met in your home during my postdoc. So basically I have two questions. One relates to the first part, which uses mutation data to stratify the patients: have you found concordance between the stratification you get from mutation data and the pattern you get when you use gene expression?

Great question. So I mentioned gene expression only to sort of dis gene expression and say, oh, there are eight papers that tried this with gene expression and they failed, right? But let’s now get back to that. So we’ve now discovered subtypes that were not discovered before; we’ve defined a novel stratification of this disease, and that stratification appears to be clinically useful. Why wasn’t it found before? That’s question one. And question two is: having found it, can you train expression profiles to find it? That is to say, can you select a few genes whose expression levels do give you those subtypes? The answer to the second question is yes, you can. Let’s start with that: I can go back and, in a supervised mode, find mRNA levels that do predict those clusters. So now, why weren’t those found if you just cluster expression data without that knowledge? Because it’s washed out by all these downstream effects. Proliferation has been found a thousand times in cancer papers, you know. I mean, seriously, I
mean, if you look at cancer papers’ expression data, they always pick up on essentially cell cycle and how fast the cells are growing, and it’s a great indicator, but it’s a downstream effect or phenotype. And you have lots of other phenotypes that expression data are sensitive to: stress state of the cells, immunological state, you know, all of this stuff is integrated into an expression profile, which is either advantageous or disadvantageous depending on what you want to do. In this case it was disadvantageous, because it’s completely obscuring; it’s dominating the signal and driving the clustering. Whereas the clusters we’re finding here, the hypothesis would be, at least, that these are the more upstream causal events that are causing those different subtypes to emerge. But anyway, thanks for the question, because you can find these clusters in expression data in a supervised way. And so if you wanted a diagnostic, for instance, you might now not even use mutations in the clinic; you might use RNA profiles.

Probably just to pick up some, like, icon features; I can do that in your, you know, temperature network. And then the second question is about the last part. In this other respect I think that the answer is great, but what do you think is the reason why you don’t get a higher, say 0.7, correlation? I mean, I know you used decision trees to train the model; is it because of that, or is it because of the initial design of this OR gate?

OK, it’s tough. Yeah, it’s a great question. So what I was trying to focus on here is how much better this is. This is a very hard problem, and it’s one that a lot of people have been interested in for, you know, 15 years. So I was more resting on the laurels of that. I would actually say, if you’re interested in this, make sure that we’re right about this. I claim we’re right, but don’t believe me, you know, because that’s a big difference. That to me will be interesting, right, if others are also able to build their own hierarchical models and see the same effect. OK, then the question is where to go from there. And one is: can you learn the functions at the same time? I mean, the beauty of it is we’re not learning the structure; the structure is learned (well, we are, but in a separate problem). So, like any time in biology, you
have structure and function, and we’ve now cleanly separated the structural reconstruction from the functionalization of that structure. And so my own thinking, by the way, is evolving on this. In the initial NeXO structure we built for the recapitulation of GO, we did throw in every data set we could get our hands on. Now I’m thinking: don’t. I’m thinking keep it pure, keep it structural, and then keep all your functional data aside to throw at the structure, to learn what those gates are on that deep neural network of neurons, which aren’t neurons, they’re components of the cell, but it’s the same exact type of encoding.

My question is on the quality of the ontology. If you are reconstructing this hierarchy in a cell, in much more complex areas about which there is little high-quality data...

Yeah, definitely. So first of all, just to clarify, it’s not dependent on GO at all, because we threw GO out and we rebuilt it. Just to be clear, GO does get used at the end, just as a reference to annotate back and name the things in our menagerie, basically. But then what about the omics data that go in? The innovation really was in that clustering algorithm that can build this ontology. At that point we have standard statistical techniques like bootstrapping that we can apply to assess robustness. So bootstrapping would sort of subsample your data, which in effect throws out about a third of it, and then rebuild the hierarchy, and it does that 10,000 times; and now the terms you want to preserve are the ones that come up ninety-five percent of the time. So the way it actually works is we build like 10,000 hierarchies and align them all to each other and then preserve the result. So through robustness we’re pretty confident that the terms we see are real, especially when those terms get named because they were in GO, so now you’ve recapitulated biological knowledge. The more interesting use, though, is when you now look at the terms that aren’t in GO and rank them by robustness. So I think the way you do new biology with this approach is you rank the new terms by their robustness and you study the top ones first.

Now the other thing you’re kind of getting at here is: what about what’s missing? And that’s a much, much harder question to get at. Then you go into the lab, right? At a certain point you have to go back and really start following up on those discoveries. But the nice thing is that it points you at exactly where the novel discoveries in your data are. One can imagine in the future, if we can really make this kind of thing a standard systems approach, you could be a journal, and someone gives you a big omics data set they’ve generated just because they can, and you always wonder, with reviewers, what’s the novelty? And what do people do? They pick an anecdote and they drive deep and follow it up and show you that, yes, fine, it works in a mouse. But that doesn’t really tell you anything systematically about how many novel discoveries are in their resource. This does. So I think that’s the exciting use of it as well: it tells you exactly what’s novel about the resource.

So yeah, actually, check out the paper, because we looked a little bit deeper at this in the paper. This is blanket, for all three million genotype-phenotype pairs, but now we can dissect it if we just focus on the genotypes that perturb subsystems in the cell, for instance DNA repair. In fact there’s a whole experimental piece: we went out and basically did two thousand more genetic interaction
studies to further expand on DNA repair and nuclear lumen, the two processes. Anyway, as part of that we showed that (for the whole cell, that’s the correlation, right) for each process you can get a correlation, and there were some processes where GO wins and there are some processes where NeXO wins. And if you then start to look at where NeXO wins, it’s basically in areas where GO sucks, you know. And so it’s interesting. This is not to disparage literature curation and all those GO teams of hundreds of curators. It’s not that; they can’t curate what hasn’t been published, right, whereas this sort of can. And so I think that might be the better take-home message: not to look at the whole-cell correlation, to answer your question; it’s that there are some processes where one looks way better than the other.

Thank you very much. I’m coming from a different background, not cancer biology, but it seems that the data you are using for stratification of tumors are somatic mutations. How could you ignore the genetic background of each patient and just include the difference between the tumor and the genetic...

You mean like the germline, or inherited variation?

Exactly, of each patient.

No, so we did ignore it. I mean, by the way, pretty much every cancer sequencing paper that’s been published in the past 10 years ignores it. It’s a real problem, and I think what’s going to be exciting in the next five years of cancer genomics is to work out the germline. So what can I say, we ignored it, but we’re working from a certain data set that we have. Now the question is to what degree these kinds of approaches are going to be useful for GWAS types of approaches. And I did show that one example with Joe Gleeson’s lab on hereditary paraplegia, and that’s most definitely inherited variants that we’re analyzing there. And if you look outside of my... I mean, this is my talk, my time up here, but of course in a way what I’m talking about here is representative of a field that’s trying to do this, in which case outside of my group there are a number of examples of network GWAS. But as you may know, the verdict is definitely not in on any of that.

Hi. So here you’re basically comparing each patient with itself, but in other fields, like the one that I’m working on, I’m dealing with the very same problem of relating genotype to phenotype, and there I’m only using germline variants. So there are basically two types of questions. If we don’t want to stratify our patients and we are only trying to find some genes that are related to that disorder, how should we go about the network approach? How can I narrow down to some genes?

Yeah, so I’d refer you to a number of papers by others. Actually, Eric Schadt and I wrote a review article with a number of others a few years ago about all of these kinds of methods. Ed Marcotte has a nice method for boosting GWAS with networks, which is a very simple approach but a good place to start, I would say. Just a comment about that: there is a lot of literature on doing exactly what you want to do, which kind of, you know, motivated us to do something no one had done before, right? But in terms of what’s out there, look at the literature; there’s a lot of
pathway GWAS methods and papers. One thing I would say is most of them are actually not going to be helpful to you. Why? Because they perform the pathway analysis after, or downstream of, the GWAS. So they’ve already found p-values of association for every SNP, and then they show that the most significant SNPs are enriched for a GO function. That is probably not going to be helpful to address your question. What your question is, is: can I use pathway knowledge before I compute an association, and use that to find genes that I wouldn’t have found? There are fewer papers, but there still are some that do that, so those are the papers you want.

And just a last question. The number of genes that you have considered in your network is much less than the number of genes that we have in our genome, so how can we address that?

I’m sorry, what is the question?

The number of genes that you consider in the network is much less than what we have in our genome.

But that’s by design, I mean, you want...

Because in your network, in the 2013 Nature article, you have used only ten percent of the edges between the genes, right? I mean, in HumanNet you have only included 10% of the edges.

Which, you’re saying we should have used more edges, not fewer edges, you mean?

Uh, yeah, and those ten percent of edges only include seven thousand genes.

Oh, OK, sorry, I thought your question was why the network was so large; it’s why it’s so small.

Yes.

Got it, yeah. That’s a reasonable question. So at the time we used the HumanNet database of interactions, and it was very much tied to our analysis of that and what was considered confident. We were essentially basing our calls on a recommendation of that paper. There might have been a computational efficiency consideration going on as well, but I think it really had to do with what was recommended by that paper. You know, these days there’s lots of
networks one could use. It’s an open question what network one should use and how exactly to formulate that. Here, you know, rather than pause there (although certainly people are writing review articles about networks and network quality, and my lab is too), we’ve sort of raced past that and said we actually don’t think networks are the right way to represent the knowledge anyway; we think it’s hierarchies. Let’s not dwell on it, let’s just keep going. And so that’s where we’re at.
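
The ontotype idea from the talk (each term's state is the count of deleted genes beneath it) and the epistasis benchmark (measured double-mutant growth versus what the single deletions would predict if simply added) can be sketched as follows. This is a minimal illustration under stated assumptions, not the published method: the toy hierarchy, the term names other than T5, the gene-to-term assignments, and the function names `genes_under`, `ontotype`, and `additive_interaction` are all hypothetical, and the additive null model is only one way to read "add the effects of each perturbation independently".

```python
# Hypothetical toy hierarchy: each term maps to its direct children.
# A name that is not a key in the dict is treated as a gene (a leaf).
HIERARCHY = {
    "T1": ["A", "B"],
    "T2": ["C", "D"],
    "T3": ["E", "F"],
    "T5": ["T1", "T2"],
    "ROOT": ["T5", "T3"],
}


def genes_under(term, hierarchy):
    """Return the set of all genes at or below `term`."""
    children = hierarchy.get(term)
    if children is None:  # not a term, so it is a gene
        return {term}
    genes = set()
    for child in children:
        genes |= genes_under(child, hierarchy)
    return genes


def ontotype(deleted_genes, hierarchy):
    """Map each term to minus the number of deleted genes beneath it."""
    deleted = set(deleted_genes)
    return {
        term: -len(genes_under(term, hierarchy) & deleted)
        for term in hierarchy
    }


def additive_interaction(wild_type, single_a, single_b, double_ab):
    """Interaction score under an assumed additive null model.

    Expected double-mutant growth is wild type plus the two single-deletion
    effects; the score is measured minus expected growth.
    """
    expected = wild_type + (single_a - wild_type) + (single_b - wild_type)
    return double_ab - expected


if __name__ == "__main__":
    # Deleting B and D gives T5 a state of -2, as in the talk's example.
    print(ontotype(["B", "D"], HIERARCHY))
    # A different genotype (A and C) yields the same ontotype.
    print(ontotype(["A", "C"], HIERARCHY) == ontotype(["B", "D"], HIERARCHY))
    # A negative score means the double mutant grows worse than expected.
    print(additive_interaction(1.0, 0.9, 0.8, 0.5))
```

In a fuller pipeline the ontotype dictionary would become the feature vector for a learner that predicts the interaction score; that step is omitted here because the talk only says the model was trained in a black-box way.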