Slideshow w/ audio of talk by Jonathan Eisen @phylogenomics at #LAMG12

As I’m sure you can tell, I’m a very serious person. I wore a T-shirt that I got printed at Zazzle. It says “Proud to be a GMD” (that’s Guardians of Microbial Diversity), and my kids, five and seven, drew what they thought of as microbial diversity. I don’t quite know what any of them are. They even crossed out some of the words; I don’t know what that was about. So, as you will see in my talk, I’m also very, very serious when I give talks.

So thanks, Jeff, and thanks to everyone for coming to an evening talk before the drinks. I guess that’s the way to keep people at least somewhat... I’ve done the after-dinner talks, after the drinks, which is really a disaster.

As many of you know, I’m kind of obsessed with the history of this meeting. I’ve been coming for many, many years; not the whole way, but many, many years. If you’re interested in some of the past history, I’ve posted scans of many of the programs from the last 10 or so of these meetings. I made a TinyURL link for them so it’s easy for people to find, at /LAMG12. That’s also the Twitter hashtag for the meeting, for all the people posting to Twitter: it’s LAMG12. And there’s all sorts of stuff there: notes from previous meetings, previous blog posts, my handwritten notes, and I’m sure there’s really offensive stuff somewhere in there, so please forgive me.

But really the key to the history of these meetings is captured, and I’m going to need your help with this for this year, in quotes from previous years. So when you hear something good, like we just heard a minute ago (I wrote down “microbe Annette”), please record those. We need a record of these good quotes. I’m going to go through some of them here, not telling you who said them or when they were. Someone in one of their talks referred to “the space-time continuum of genes and genomes.” “Microbes not only have a lot of sex, they have a lot of weird sex.” That must’ve been Rosie Redfield, right? “I’m an atom.” “Gene sequences are the wormhole that allows one to tunnel into the past.” I... yeah, everybody’s
nodding. I have no idea what that means. “This is how you do metagenomics on fifty dollars, and that’s Canadian.” This shows the quality, the precision of the science of this one day. “The human God’s are a real with you of stuff.” “Antibiotics do not kill things, they corrupt them.” And I’m not done. “There comes a point in life when you have to bring chemists into the fish.” “I disagree with this statement.” “Ba ba is true”; I’m not even going to read it. “If I have time, I will tell you about a dream.” And that was in a talk, not at the bar. “Another thing you need to know... actually, you don’t need to know any of it.” Lance. “I have been influenced by Fisher-Price throughout my life.” If you have the pleasure of seeing Lance show all of his Fisher-Price toys in his talks... “This is going to be ironic coming from someone who studies circumcision.” Lance. And “We will bring out the unused cheese from yesterday.” That must have been Jeff. “A paper came out next year.” “The people here are really brilliant.” “It takes 1000 in a ball just to make one.” And last but not least, “In an engineering sense, the vagina is a simple plug flow reactor.”

So now you know what I’m all about. Now please record quotes, posting to Twitter (unless they’re Rich Roberts telling a bad story, sorry), and/or send them to me, and I will keep an archive of them. All right, so now I’m
sorry, I’m actually going to give a talk. What I want to talk about today (and really, there are no more quotes) is taking a phylogenetic approach to studying microbial diversity. Or you can call it phylogenomic if you integrate genomic data, although I kind of hate omic words, even the one I invented, phylogenomic, so I probably should cross this out from my slide. I’ve been doing this pretty much for my whole career, and I want to talk about some new developments, mostly methodological developments that have come either from my group or from collaborators, to sort of prime the pump, so to speak, for the rest of the meeting, where I hope that people think explicitly about phylogeny and the history of organisms and of genes when they’re looking at some of the data that people will be presenting.

The first thing I want to talk about, and I assume everyone or virtually everyone here is familiar with the general principle, is phylotyping. Phylotyping was developed originally by Norm Pace and colleagues: you go to an environmental sample, you clone out ribosomal RNA genes (originally, and still predominantly, ribosomal RNA genes), sequence them, take the sequences, and build a sequence alignment with them. You can do lots of analysis from that sequence alignment, but phylotyping explicitly is really supposed to refer to phylogenetic analysis from that sequence alignment. So you can build a tree and try to understand what the uncharacterized organisms are, that is, who is out there, from building a phylogenetic tree of the sequences and comparing the sequences you get from an environmental sample to characterized organisms.

That tree is really important because it tells the history of the relationships among the organisms, and it’s explicitly different from a grouping of things by their similarity to each other. If you just look at the similarity of
organisms or genes to each other, you may be misled, for example in cases of unequal rates of evolution in different taxa, where the level of similarity will not reflect the evolutionary history of the organisms or genes. So you build this evolutionary tree, and you try to classify the uncharacterized sequences by how they relate to each other or to characterized organisms.

You can deviate from this general plan, and as we get more and more sequence data, more and more people are doing this: you take the sequence alignment, or even the sequences directly, and cluster them, compare them to each other, try to identify which ones are really, really similar to each other, and pull out the dreaded OTUs from that clustering: groupings that correspond to species or some taxonomic group, where you think that all the entities in a group are closely related to each other and the entities in different groups are distinct from each other. You can come up with a list of the OTUs from this clustering without ever doing a phylogenetic analysis.

However, it’s been shown in a variety of papers, in a variety of ways, that if you lay on top of that OTU clustering the phylogenetic relatedness of the OTUs, you can do a lot with that data. For example, this is the basis behind things like UniFrac analysis, where you compare two different communities not by the list of OTUs in those communities but by the phylogenetic relatedness of the organisms in those communities. You can also, as I implied before, rather than going to OTUs directly, build your tree first and then either identify OTUs from that tree or directly do phylogenetic analysis without ever identifying OTUs. So there’s this suite of ways that people take ribosomal RNA data and end up with either a tree, or OTUs, or a tree of OTUs that they classify. And there have been, you know, 25-plus years of people using phylogenetic
approaches to analyze this type of data, and it has been shown to be useful in a variety of ways: identifying novel groups, identifying clades, looking at rates of change in organisms and, if you have different genes, at lateral gene transfer, convergent evolution, questions of phylogenetic diversity, etc. I’m not going to go into any of these in real detail, but I just want to emphasize again that phylogenetic approaches are somewhat distinct from just getting a list of the taxa that are present in a particular sample, because they tell you how closely related particular taxa are to each other.

So what I want to talk about instead is not the history of this and the justification for it, but what’s really new in the last few years in phylotyping. And, as I’m sure everybody here appreciates, the really
new thing in phylotyping, or one of them, and the first one I’m going to talk about, is what you can get by cranking out data from one of the wonderful new toy sequencing machines that we all use, or that many of us use, where you can generate just literally absurd amounts of data for a trivial cost compared to what you could have done three years ago, five years ago, 10 years ago. Obviously, if you’re doing ribosomal RNA PCR, you can generate more PCR products, but that in and of itself isn’t necessarily the key thing. What many people want to do is deeper sequencing of individual samples: you can get the tail end of the relative abundance curve, and if you’re interested in a particular sample you might get to what many people in the community have been calling the rare biosphere, and maybe get more accurate relative abundance estimates of even the abundant organisms by sampling more deeply.

What I think many people here will talk about, or show posters about, is taking the sequence data that you can generate and, instead of generating deeper sequencing, generating data from more samples: time series, spatially diverse samples, fine-scale sampling, et cetera. For ribosomal RNA PCR you can add barcodes to each of your samples and pool together hundreds to thousands to possibly tens of thousands of samples into a single run of one of the sequencing machines, and thereby generate data the likes of which we have never seen for microbes before. I mean, people have collected this type of data for plants and animals from thousands and thousands of sites. It’s hard, but people have done it. But for microorganisms we generally don’t have this type of diversity data from across a wide number of samples.

So this is just an example of what can be done. This is a paper that I was a co-author on with Jen Hughes Martiny and a few other people from Jen’s group, where she was interested in beta diversity, and when I was at TIGR, prior to moving to UC Davis, I helped her
sequence, with Sanger sequencing (people know what that is? Should I explain?), something like 10,000 ribosomal RNA sequences. I mean, that’s like five cents now with Illumina sequencing or something. In analyzing this data, and in planning the experiment, she was able to generate really interesting findings related to the biogeography of organisms and beta diversity across the planet. And we thought, you know, this is amazingly deep: we had five sites that are closely spaced next to each other and, in total, something like 48 sites across the world.

Now you have things like the Earth Microbiome Project (I don’t think anybody’s going to talk about it here), where Jack Gilbert and colleagues are literally talking about generating data from millions of samples across the globe instead of 48. You can do this, again, because of barcoding and pooling together samples, and what we’re going to be able to do is generate really a biogeography of the planet, at least at some level. If we choose this well, we can do good biogeography.

And I would argue (and I’ll come back to this in a minute) that even though you may feel like you’re drowning in data, it’s still useful to take a phylogenetic approach to analyzing that data, even if you think, “Oh my God, I can’t build trees with this data because there’s too much of it.” So you can do things with that data. For example, my colleague Josh Ladau, who’s in Katie Pollard’s lab, has been working on taking biogeographical data and trying to actually develop range maps for microbes. A few other people, Noah Fierer and colleagues, have been starting to do this as well. You know, in a bird field guide we have a range map of the birds; we don’t really have that for environmental microbes out there. We have ten samples from many places, but this is trying to take a modeling approach to develop range maps for organisms.

I made this argument a couple of weeks ago, and I thought I’d just go through it
again. I mean, five years ago, making this argument would have been absurd: to talk about DNA-based studies of microbes in the Mississippi River. It’s 2,320 miles long. If you want to do one site per mile, three samples per site (that’s not a lot of redundancy, but let’s pretend that’s good), that’s 6,960 samples. With barcoding, if you get really good runs, you can generate something like 4,300 sequences per sample in a single MiSeq run for those seven thousand samples, or 862,000 sequences per sample if you do it in a full HiSeq 2000 run that all works perfectly and everything is high-quality. We’re going to ignore all the problems with that. But even so, you can now literally imagine going out for $5,000, $10,000, whatever, and creating a transect of a site that you’re interested in, with DNA-based studies of microbes on a scale that no one had thought about previously. It’s why, as I assume you will hear in the human microbiome sessions,
many people have been, in essence, creating a biogeographical map of the human body. With the low-cost sequencing, we can now do that in any environment.

So what’s new in phylotyping? What else is new? Obviously, in the last ten years, one of the big things that’s been new is metagenomics. As sequencing gets cheaper and cheaper, people are not going to just do ribosomal RNA PCR from an environmental sample; they’re going to do metagenomics from those environmental samples. Obviously you get fewer sequences per gene that you’re interested in (let’s say you were interested in some functional gene), but you get a sample of all the genes in your sample rather than just the ribosomal RNA genes. And you can go to environments and do metagenomics just like you would do with ribosomal RNA surveys.

Now, it turns out that it’s a little trickier, a little more complicated, but you can take the same general approach, phylotyping, with metagenomic data that people have been taking for years with ribosomal RNA PCR data. You can take metagenomic data, cluster sets of sequences that you’re interested in, compare them to each other, and build OTUs. Or you can build phylogenetic trees of the sequences that you’re interested in and identify OTUs, or just do phylogeny, and then take all of the benefits that come from phylogeny, such as doing things like UniFrac or phylogenetic ecology, or looking at rates and modes of evolution, and do it for all the genes in a sample as opposed to just ribosomal RNA. That’s the fundamental difference: you can do this for all of the genes in the sample and get maybe a different picture of the biogeography or biodiversity of organisms than was available from ribosomal RNA, or work with organisms that were not amenable to ribosomal RNA PCR. What are those small things, viruses, that have like a coat on them? I mean, viruses have been ignored by the ribosomal RNA PCR studies, but we can now do
phylotyping of viruses, as many people have been doing for the last four or five years, because of random shotgun metagenomics.

So what I want to talk about now is the challenges with trying to do this. Let’s just assume we’re going to generate either billions of sequences or sequences from metagenomics, and now we’re going to try to do phylotyping from those samples. One of the big problems, especially if you get your data from metagenomics, is that if you generate a sequence alignment of your gene compared to full-length data (mostly from sequenced genomes, which is where we’re getting most of the full-length sequence data), you get a sequence alignment where your environmental data might be a fragment at the left-hand end of the gene, or at the right-hand end of the gene, or in the middle of the gene, and some of those sequences might not even overlap with each other. So how are we going to build a phylogenetic tree from metagenomic data with traditional methods that rely upon a multiple sequence alignment without any gaps in it? Figuring out exactly how to do this has been somewhat of a challenge for many people analyzing large data sets. Multiple approaches have been developed, and what I want to do is take you through the philosophy and examples of each of them.

The first thing you can do, which many people still do today, is to treat each sequence as an island. In essence, you build a tree of reference sequences, full-length sequences that you have from genomes, and then you take each individual metagenomic sequence and build a tree of it relative to the reference data. So here you pluck out an individual sequence and build a tree of that one environmental sequence relative to the reference data, then you build a tree of the next sequence relative to the reference data, and the next sequence relative to the reference data, and so on. How do you do this for billions of sequences? Well, you need to start to
automate virtually everything. Everybody who does genomics now is becoming more and more familiar with this. A few years ago we developed a pipeline called STAP to automate this sequence-as-an-island approach to analyzing ribosomal RNA data and building a phylogenetic tree of it. You almost certainly don’t want to use it anymore, because it can’t handle millions of sequences. There are many other pipelines out there that do things like this.

We’ve also built other pipelines in my lab. This one, called AMPHORA, was developed by Martin Wu a few years ago, and it will do the same type of thing (take a sequence one at a time and build a tree of it relative to reference data), but it will do this with protein sequences as opposed to ribosomal RNA sequences. We selected a set of 31 phylogenetic marker genes, protein-coding genes that seem to have good properties that would allow them to robustly build phylogenetic trees of the organisms from which those sequences came, and built tools around automating the construction of phylogenetic trees of these sequences using alignments from
hidden Markov models and precomputed masks for the sequence alignments, so that you didn’t have to pull out regions that were poorly aligned from these sequences. And you can use the AMPHORA program to build trees of each individual protein-coding sequence for these 31 markers, and you can then scan through data and get phylogenetic types, phylotypes, for a suite of genes in addition to ribosomal RNA.

Even for organisms where you have ribosomal RNA genes this can be beneficial, and I’ll come back to this a little bit. For example, the copy number of ribosomal RNA genes varies significantly between taxa. If you want to estimate the relative abundance of organisms from the number of copies of sequences that you see, and the copy number of the gene in the genome varies a lot, you’re going to misestimate the relative abundance of organisms without knowing that information. Many of these protein-coding genes have much lower variance in copy number between organisms than ribosomal RNA and provide a better way of estimating relative abundance. They also tend to have higher sequence variation among closely related organisms, because of third codon position variation that ribosomal RNA does not have, so they allow you to resolve the phylogeny of organisms at a finer scale than ribosomal RNA.

An alternative to this sequence-as-an-island approach is to try to analyze the metagenomic sequences not just relative to the reference but relative to each other. One of the approaches that has been taken to doing this is to take a sequence alignment and identify a core region of that alignment where many of your metagenomic sequences overlap with each other, throw away everything else, and then build trees from that. That’s what I did in an analysis of the Venter Sargasso Sea data in the original Venter metagenomics paper. Again, you can scan through ribosomal RNA data from metagenomics, build trees
just like this, and now compare all of the sequences to each other. You can build trees with protein-coding genes as well. I always pick RecA as my first protein-coding gene (many people know that I’m obsessed with RecA), build trees, and do the phylogenetic classification again, but now comparing the sequences to each other as well as to the reference data. This is getting to be more and more important as we deeply sample environments but don’t have genomes from reference organisms that are closely related to the taxa in the environments that you’re interested in. You need to compare the environmental sequences to each other, not just to the references. This is, again, how many software packages work. This is basically what people do when they analyze ribosomal RNA PCR data in QIIME, or when you analyze data in mothur, or when you take metagenomic data and analyze them in similar ways: find a core region, align it, build a tree from that, and then do analysis.

Just as a brief aside, I wanted to mention this. A previous graduate student in my lab, Amber Hartman, developed this software pipeline called WATERS, which is for analyzing ribosomal RNA data. It’s what’s called a scientific workflow system; it’s embedded within the Kepler scientific workflow package. This is not the easiest-to-use system on the planet, but it’s going to be more and more important in the future as everybody starts to do bioinformatics. What scientific workflows do is basically record everything that you do with the data that you’re looking at. They have full provenance of every setting, every input and output, the flow of data, conversion into different data formats, the operating system that you’re using, every sort of feature of the data. That’s going to be more and more important for record keeping as we basically become a data-driven world. You can make your
workflow available to people, record everything you did in one particular workflow, and people will be able to, at least in theory, reproduce what you did a little bit more than in many cases in bioinformatics.

So, I’ve already mentioned this a couple of times: there’s one major problem with relying upon ribosomal RNA where phylogeny can actually help us, and that’s this copy number variation. For many years people have been looking at the copy number variation in ribosomal RNA and saying there seems to be some biological consistency to the copy number. There have been a lot of papers about organisms with more copies growing more rapidly than organisms with fewer copies, and a lot of studies about the ecological context of organisms that have more copies. But there’s also phylogenetic consistency in the copy number. Steven Kembel, who was a postdoc in Jessica Green’s lab and is now moving on to a faculty position, worked on this with Martin Wu, who used to
be in my lab, to track the phylogenetic history of copy number. That’s what’s shown in this figure. There’s a paper in press in PLoS Computational Biology that he’s the first author on. Here you have a tree of organisms, and the black bars are the copy number of ribosomal RNA for those organisms, and there is some phylogenetic consistency to this copy number. What that means is that you can correct for the over- or underestimate that you get from ribosomal RNA sequence data by looking at where you are in the phylogeny and at the copy number pattern in the phylogeny, and get a corrected estimate of the relative abundance of organisms, using the phylogenetic history of copy number to estimate what the copy number is going to be in a new organism. It is not perfect: copy number varies in ways that are not related to the phylogeny, or not predictable right now. But it is better than not using this correction.

A third way, and what we think is probably the best way to analyze metagenomic data in the long run, is to analyze everything: to analyze all the data, and not, in essence, chicken out when sequences don’t overlap with each other. For years, evolutionary biologists and implementations and others have been working on methods to compare bits of data to each other even if they don’t compare to each other. That is, if you have reference data, you can compare sequences to each other by triangulation. You can ask how far a sequence on the left side of a multiple sequence alignment is from reference number one, and how far a sequence on the right side of the multiple sequence alignment is from reference number one, and try to analyze them in the context of each other and build a tree of everything from the metagenomic data.

There are a variety of things you might want to do with this. Tom Sharpton, who is a postdoc in Katie Pollard’s lab, worked on a method to do classification of OTUs. So this is an important subset of what
you might want to do with data: is this fragment from the same OTU as this fragment, even when they don’t align with each other? You can, in essence, build a phylogenetic tree of all the sequences and try to figure out if they’re from the same OTU, and that’s what this PhylOTU software package that Tom Sharpton developed does. It’s phylogenetic analysis of OTUs, so I really wanted him to call this POTUS, President of the United States, but he chose PhylOTU. It would have been perfect for election time. I’d be happy to talk to people about this on the side; I don’t really have time to go into all the details.

We did something like this, trying to analyze all of the sequences compared to each other, when we did phylogenetic analysis of other protein-coding marker genes from environmental data, and found, even for core phylogenetic marker genes like RecA and RpoB, sequences that did not group into any of the known major lineages of organisms, including bacteria, archaea, eukaryotes, and the known lineages of viruses. So there’s data out there that’s from really weird phylogenetic lineages. We don’t know what those sequences come from. But if we want to compare fragments that are from a rare organism, and we only have the left half in sample number one and only the right half in sample number two, we again need methods to build phylogenetic trees of all the sequences.

The real cutting edge in this right now is methods that work like pplacer, which many people may have heard of, developed (and still being developed) by Erick Matsen, who’s now at the Hutch in Seattle. And there’s a software package being developed in my lab by Aaron Darling and Holly Bik, who’s here, and Guillaume Jospin and other people, that’s sort of an equivalent to AMPHORA, the package that was developed by Martin Wu, but now trying to analyze all sequences compared to each other rather than just the regions that
overlap with each other. The way these things work is basically what I was outlining: you have a reference sequence alignment, you place individual sequences into the tree of the reference sequence alignment, but you try to place them all into the tree. Then you can even compare the individual environmental sequences to each other to try to figure out whether they came from the same lineage or not, using certain statistical approaches. That’s one of the things I’m not going to go into detail about, but I’m sure Holly or other people in my lab would be happy to talk to you about it.

Statistical approaches to analyzing phylogeny have become very big for phylogenetic trees of known organisms: likelihood phylogeny originally, and now Bayesian-based methods. One of the advantages of Bayesian methods and likelihood-based
methods is that, rather than in essence just giving you a tree and telling you that that’s the best tree, you get multiple trees and a statistical probability of each tree, in essence. So you may not be able to place a sequence into an individual part of the tree perfectly, but you get a probability for where it should go. That’s really important if you want to do things like use a UniFrac metric to compare five different environments to each other. You get a tree-based metric, which I love, by comparing them to each other, but in essence you lock the tree when you do that, and you assume that the tree you feed into UniFrac is correct. If you instead took a tree that had statistical probabilities of each position in the tree, you could calculate similarities and differences between communities, and statistics of those similarities and differences, based upon the statistics of the tree itself. I think that’s where a lot of the phylogenetic methods that people are using to analyze environmental data are going: taking advantage of the statistical approaches that have been used for a lot of the reference organisms, or cultured organisms, etc., and applying that to environmental data or to very large trees.

Like many things in my lab, we’re pretty open about stuff. There’s no paper yet published for PhyloSift, but the code is out there on GitHub, and Holly and other people even have a blog describing many of the things going on with PhyloSift. If you just Google PhyloSift you can find all the stuff about that, and you’re welcome to download the software and use it.

One eventual end point that I would like to see, but it’s pretty hard, is, instead of analyzing just one gene at a time, to analyze all sequences relative to all genes, trying, for example, to place all sequences in an environmental sample relative to the reference genomes that you have rather than individual reference genes. You could call this method sort of the all-in-the-genome approach to
analyzing data: taking each of these fragments, your fragmentary data from different sequences, and now analyzing multiple genes, if not entire genomes, of organisms, and building a phylogenetic tree of everything in a sample relative to everything else in a sample. To really do this (I’ve ignored this up until now), you have to take into account real phylogenetic history: not just bifurcating vertical evolution, but recombination within species and that nasty recombination between species, that lateral gene transfer stuff that makes trees somewhat messy. We can actually take that into account when analyzing environmental data. Many people are building phylogenetic networks of reference genomes compared to each other, taking into account gene gain and loss, duplication, recombination, deletion, lateral gene transfer, etc., and now you can overlay environmental data onto that network instead of onto a tree. In theory we could place all the environmental data onto that network rather than a single gene at a time.

Coming back to Steve Kembel again: he’s really done some brilliant stuff with metagenomic data, and he was the first person that I know of to do this in a true phylogenetic context. He basically took the AMPHORA phylogenetic markers and, rather than analyzing them one at a time as AMPHORA did, he concatenated them all together, built a reference phylogeny, and then anchored environmental data into that reference phylogeny, in this paper that came out last year in PLoS ONE. I’m not going to go into the details of it; it seems to help, you know, at least a little bit, with making conclusions based upon environmental data that you weren’t able to make if you didn’t analyze all the data from across the genomes.

So I’m going to shift away from phylotyping now and talk, very briefly, about another example of phylogenetic analysis in the context of environmental data. In Lake Arrowheads of ancient past, this is what I used to talk about a lot. At the first Lake Arrowhead
that I came to, I talked about what I used to call phylogenomics, which was taking sequences of genes in multigene families, or even sequences of genes that just had homologs in other genomes. When you wanted to predict functions for these genes, what most people did at the time, and still what many people do, is you take the sequence of a gene from your new genome, you do a BLAST search or some similarity search with that gene, and you pull out a list of homologs. One of the problems with doing this is that frequently when you get a list of homologs, the list has genes with different functions in it. So how do we take that list of genes that has different functions and choose which one to grab and assign to our uncharacterized gene? What I argued at the
time, and what I still believe is really useful, is to build phylogenetic trees of the genes, place the new gene in context relative to characterized genes, and use, in essence, evolutionary character state reconstruction methods to infer the likely evolutionary changes in function of those genes over time, which will then allow you to predict the function of unknown genes. Now, I didn’t talk about this before, but that’s basically what most people were trying to do with phylotyping for many years: predict the biology of an organism by its phylogenetic placement relative to organisms for which the biology has been characterized. This is the same thing, but now for gene function. So instead of organismal biology overlaid onto the tree, like photosynthesis, chemosynthesis, whatever, we’re taking a tree of genes, overlaying experimentally determined functions onto that tree, and predicting the function of uncharacterized genes by their position in the tree. Turns out this actually probably works better than predicting function of organisms based upon ribosomal RNA position. Organisms with the same ribosomal RNA sequence can differ by forty percent of their genomes, so it’s got some issues with predicting function, but predicting functions from phylogenetic trees of genes actually turns out to work pretty well.

You can of course do this for any type of data, like environmental data. You can build phylogenetic trees of your environmental sequences, apply the same approaches that you would use for genes from genomes, and try to predict functions of those. The same methods that I just went through for phylotyping can be applied to phylogenetic analysis in the context of functional prediction. So if we want to predict the functions of uncharacterized environmental rhodopsin sequences, proteorhodopsin homologs, we can use those different phylogenetic methods that I described to improve our ability to build a phylogenetic tree and then predict the functions of uncharacterized environmental genes in
the same way you would for characterized environmental genes. So I’m not going to talk about that in detail, but you can apply those same methods, whether or not it’s, you know, of course you should use PhyloSift, but I mean you can use other things.

So what I want to talk about just very briefly is the issues with this phylogenetic prediction of function. It turns out if you want to predict functions of genes based upon the similarity of that gene to something in a database, you can do a BLAST search, or if you’re, you know, a snooty phylogeneticist like me, you can build a tree. But what if you do that search and, in the list of genes that your gene is homologous to, none of them have been experimentally characterized? Who cares about the tree? I mean, it’s not going to help you to place your sequence in context if nothing’s been studied. And so this of course is a massive problem. We are generating sequence data at an unbelievable rate. We are not generating functional data at a particularly high rate to go with the sequence data, so we are beset with this problem all the time.

And actually at my first Lake Arrowhead meeting, I think, I don’t know if it was a talk or just talking to people who were at the meeting, I heard about an alternative approach to doing this, which is non-homology functional prediction methods. They were developed by people like Pellegrini and David Eisenberg and Ed Marcotte and others, and one of them that we like in particular in my lab I’ll explain here in a minute. But non-homology functional prediction methods are sort of exactly what they sound like: you’re trying to predict the function of a gene based upon some feature of that gene other than its sequence similarity to genes that have been characterized. So a classic example of this is correlated gene expression patterns in microarray experiments. The genes don’t have to have any sequence similarity to each other, but if they always show up as being co-expressed under every condition that you look at,
you might want to consider that they have some functional similarity to each other. And if you have an uncharacterized gene that shows up as being co-expressed with 75 genes that all have the same function, I’m going to suggest you might want to look at that function that those other genes have, even if there is no sequence similarity. So there’s a list of now about eight or nine of these non-homology functional prediction approaches. My favorite is something called phylogenetic profiling, where you look at the distribution patterns of genes across organisms and you group genes by the distribution pattern of that gene, its profile. So you search a gene against genomes and you make a list, yes or no, is that gene or its homologs present across the set of genomes, and you group genes by the similarity of their profiles. That is, you’re looking for co-occurrence patterns of genes. They don’t have to have any sequence
similarity to each other. We did this many years ago, and I presented it actually at Lake Arrowhead, I think probably in 2004, ancient history, for an organism called Carboxydothermus hydrogenoformans. It’s an interesting thermophile that grows in hot springs off of carbon monoxide gas and produces hydrogen as a byproduct of its metabolism. We knew that it was probably closely related to sporulating organisms when we sequenced the genome, and then when we annotated the genome we found homologs of many of the sporulation genes in the genome. People had tried to get it to sporulate previously and it never worked, but Frank Robb was able to get it to form spores and go through the whole endospore germination process. When you look at the genes in the genome and you group them by the distribution pattern of their homologs across other taxa, you see this amazing cluster of genes that are found in basically all the sporulating species and not in any of the non-sporulating species. You see many of them are annotated as sporulation proteins, SpoII, SpoIII, SpoV, whatever, but many of them were annotated as conserved hypothetical proteins. They had never been experimentally characterized in any organism, including the model sporulating organisms like Bacillus subtilis and maybe some of the other clostridial species. They were there, but no one had ever shown an experimentally determined function for any of them in those organisms. So, not to tell you details about this, but we published a paper showing this, suggesting that many of these genes were involved in sporulation. And actually two days ago, Rich Losick, who we sent the list to and have been going back and forth with for about five years now over this list, he had a postdoc in his lab working on this for a while, and he’s actually shown that a huge number of these genes are in fact sporulation genes that were missed by biochemical and genetic surveys. I think probably 18 of the 20 that we
showed as conserved hypothetical proteins are now known to be sporulation proteins. So this phylogenetic profiling just gets better and better as we get more and more genomes. You know, it would be nice to have more functional information, but we are going to fill in the knowledge of sort of protein pathways across organisms, and if you want to play around with genomes, it’s a great thing to do phylogenetic profiling to look for conserved functions across taxa.

Now, you can do this for metagenomic data. It turns out to be much messier and trickier. Susannah Tringe did this many years ago with one of the first sort of metagenomic comparative analysis papers, Peer Bork’s done this in many papers, a lot of people have done this: comparing the presence and absence of genes across metagenomic samples rather than across taxa, and grouping them, again like phylogenetic profiling. So you could call it metagenomic profiling. It’s pretty messy to pull out information from this right now. It doesn’t work as well as phylogenetic profiling. We’ve been collaborating with some people who are more mathematically inclined than I am to try to develop new methods to group genes by their distribution patterns across taxa. There’s a new sort of, you could call it a bi-clustering method, non-negative matrix factorization, that a bunch of people have been using, not just in bioinformatic applications but in clustering approaches in general. I can’t tell you anything about the math, but we have a paper coming out actually on Tuesday where we’ve used this with collaborators Josh Weitz and Jonathan Dushoff to try to analyze metagenomic data, and it seems to really work quite uniquely compared to other clustering methods. I’m not sure if it’s better, but it’s different, and it works, so I think it’s going to end up being useful for environmental samples.

So, last example I want to talk about is using phylogeny sort of in a different way, which many people have heard
me, and I’m sure they’re really sick of it, talk about, in terms of selecting organisms for study. I’ve been pretty obsessed with this for many, many years, trying to say: here’s a phylogenetic tree of organisms, here’s the list of genomes, they’re not coming from a diverse sample of the phylogenetic tree of organisms. And that’s been the case since the start of the genome project. Some subgroups within the evolutionary tree of life have been sampled pretty well phylogenetically. Actually the NHGRI has done a pretty good job of targeting diverse vertebrates to compare to humans. There’ve been many plant projects to sample across the plant tree of life. But for studies of bacteria and archaea, unfortunately, sampling across the tree was not done very well previously. At the 2010 Lake Arrowhead meeting I talked about this Genomic Encyclopedia of Bacteria and Archaea project I was
sort of coordinating as an outside member of the Joint Genome Institute. Really the Joint Genome Institute was turning its entire sort of microbial sequencing machine onto this project, and so far we’ve now sequenced about 250, I don’t know if Tanja knows the latest numbers, but 230 genomes from the GEBA project, from the phylogenetic diversity of cultured organisms. That’s been this amazing partnership, driven mostly by JGI in partnership with the DSMZ culture collection. But that’s, you know, that’s old news. Can anybody pronounce LAMG? I don’t know if you’re supposed to be able to pronounce hashtags, but I have no idea how to do this one. Maybe Jeff should tell us how to pronounce them. I’m not going to talk about this really in detail today. What I just want to point out is that, you know, genomes are coming out at an astonishing rate, including now good phylogenetic sampling of the diversity of life. How do you keep up? The way I usually keep up is by looking at Genomes OnLine. Nikos Kyrpides maintains this database, it’s Genomes OnLine, or GOLD, an amazing compendium of all the microbial genome projects that are out there. There are also various databases that track genomes. Morgan Langille, who has been at previous Lake Arrowhead meetings, developed something called MicrobeDB, which is out there as an open source pipeline for keeping track of microbial genomes.

What I want to talk about in the context of GEBA is how we can use the information that came from GEBA, in essence, and other approaches to improve our ability to do phylotyping. So phylotyping works, but it doesn’t work exceptionally well in all cases, in particular when lineages have not been sampled well for their genomes. So one thing that we’ve been doing is going through the genomes that have now been generated, with the GEBA project as well as with other genomes that are out there, and rather than using the 31 phylogenetic markers that we and
Peer Bork and other people identified from across the diversity of bacteria to sample the phylogenetic diversity of metagenomes, we’ve been going through each major lineage of archaea and bacteria and identifying phylogenetic markers that seem to be robust for that group, as opposed to for all bacteria at once or all archaea at once. You can see, for example, for some lineages there are hundreds, if not up to a thousand, marker genes that appear to work robustly for doing phylogenetic analysis of that clade. So when we scan through metagenomic data, we can actually place, with sort of robust phylogenetic markers, a much greater fraction of the data than we are currently doing with AMPHORA or these particular phylogenetic marker approaches. To do this we need a reference tree. We need to have a big phylogenetic tree to anchor all these environmental sequences. That’s the picture I took of the tree that Jenna Morgan, Jenna Lang now, sorry, she was Morgan at the last meeting she came to, Jenna Lang has been working in my lab to build a tree of all the genomes using sort of these new statistically based approaches to analyzing genomes. That’s been submitted and revisions are in the works. I don’t think you can see the individual taxa on this tree, but we’d be happy to share it with you if you need a reference tree for genomes.

Another thing you can do with all these new genomes that are coming out is improve your ability to make functional predictions from environmental data. So we’ve been going through all of the genomes that are available, in particular the phylogenetically diverse genomes, and identifying the protein families that are present in those genomes. Now, there are great databases of protein families out there, whether it’s Pfam or TIGRFAMs or COGs or a variety of other databases, but they don’t get updated very rapidly, and with the speed at which sequence data is getting generated, one thing we need to do is figure out ways to update the protein family databases
with new families and add sequences to the old families rapidly. So we’ve been working on sort of very rapid Markov chain based clustering algorithms to take all genomes, all the peptides in all genomes, search them against all of the genomes, and identify protein families as rapidly as you can possibly do it. Although there’s no way we can keep up with one HiSeq machine anymore, we’ll pretend like we can, you know, we can keep up at least at some level and identify protein families in those databases. Guillaume Jospin in my lab and Tom Sharpton in Katie Pollard’s lab and a variety of other people have been working on analyzing these protein families. We’ve released a data set of all of these protein families. We’ve built alignments of all the families and hidden Markov models of all the families, and that will allow people to scan through environmental data for more protein families than anybody is looking at currently, and get rapid alignments and even trees if you want them.

And if you want them for sequences from environmental
samples, there’s a program that you could use, which I would encourage you to actually use for many of these cases, that Martin Wu developed when he was in my lab, he’s now at the University of Virginia, called ZORRO. It’s for masking sequences, we called it ZORRO, get it, ha. It basically takes a sequence alignment and uses a probabilistic model of comparison of the sequences in order to identify which regions of the sequence alignment are in essence not statistically supported, and you can mask those out when building a phylogeny. He showed that there are some improvements, in particular with distantly related sequences, that you get when you have this masking as opposed to analyzing a complete sequence alignment.

Let’s skip over the last little thing, and then just to wrap up. So what I’ve been trying to talk about in this latest context is how we can use better sampling of gene families or genomes in order to leverage that information to then analyze environmental data. I think the most striking thing that came from our genome encyclopedia paper and from my work on the genome encyclopedia project is actually something we could have done even without sequencing any of the genomes, but we hadn’t yet sort of wrapped our brains around. If you take the tree of life, let’s just take the ribosomal RNA tree from SILVA or Greengenes or some database that’s out there, and you take that tree and you count the branch lengths in the tree, that’s a metric called PD, or phylogenetic diversity, the sum total length of the branches in that tree. If we take that and we count that for different subgroups of the tree, what we see is shown in this plot, with the number of taxa that we look at on the x-axis and the sum total branch length in the tree for those taxa on the y-axis. If we do that for our genome encyclopedia project, in light blue here are the genomes that were sequenced before our project. They came to a total of about 20 units of PD. We
sorted them by the amount of PD that they contain relative to other organisms. So the 18th E. coli genome, sorry all you E. coli fans out there, didn’t add a lot of PD to the tree. It added many genes because of lateral gene transfer, but not a lot of ribosomal RNA diversity. Each of our genome encyclopedia genomes added a lot of PD. They’d better, that’s how we picked them. That would have been pretty messed up if they didn’t add a lot of branch length, because that’s how we selected organisms. If you look in dark gray here, that was all the organisms in the Greengenes tree that were described as cultured organisms at the time: a sum total of, you know, 180 units of PD in this tree, it’s sort of arbitrary units. We would need about 5,000 genome sequences to sample half of that phylogenetic diversity, something to that effect. There is in fact a project going on at JGI right now to start to tackle this by sequencing all the type strains from across the diversity of bacteria and archaea, and eventually we will fill in the PD of cultured organisms.

But look at the light gray. That’s the uncultured organisms that were just in that database from like five years ago, before the sequencing went crazy with Illumina and 454 sequencing, although many of the things coming out of that might not be real. But we would need ten thousand genomes to capture half of the phylogenetic diversity of the uncultured organisms that had full-length ribosomal RNA sequences in the Greengenes database as of, I think, 2008. That’s a massive number of genomes from organisms that we have not yet grown in the laboratory, technically challenging to generate those genomes. If we actually look at the full diversity of ribosomal RNA that is out there now, the sequences that maybe we trust more than some of the sequences that seem to be error prone, we probably need hundreds of thousands of genomes to capture half of the PD of bacteria and archaea that are out there. We have sampled nothing. I
mean, it’s great, we have lots of genomes, they’re really interesting, but overall, in terms of the full diversity of bacteria and archaea, and forget about microbial eukaryotes and viruses, we’ve barely scratched the surface of the diversity of life on the planet. So, you know, building all these protein families and reference trees, it’s helpful, but only to a point in terms of what we’ve sampled. We need projects like, Tanja, are you talking about the uncultured GEBA here? So the project that Tanja Woyke is running at the JGI to go through the uncultured lineages and try to obtain complete genome sequences for those organisms, which are going to be fundamentally important for sampling the diversity of life. Of course we need experiments across the diversity of life too, so, you know, just sequencing
things is not enough. But eventually we will capture some portion of the diversity of microbes that are out there. We will have a phylogeny of them, we sort of have that now. We will have genomes of them. We will have the biogeography of those organisms. We will have the functions of some of those organisms, or many of those organisms. What we’re going to have is my dream. As many people know, I used to be a birder. When I first came to Lake Arrowhead I was actually deciding between multiple jobs, and one of them was not even on microbes at the time, actually I used to work on birds. I want a field guide to the microbes. And actually, I mean, this seemed ludicrous five years ago, even ten years ago. A real field guide will cover, again, the taxonomy, the phylogeny, the functions, the ranges, the niches, the biogeography, the temporal variation. We can actually start to do this for at least some subsets of environments on the planet, for microbes, for the first time.

So I will leave it at that and thank the many diverse sources of funding that I have obtained for much of the work that I talked about, in particular the Department of Energy for all the GEBA-related stuff, the National Science Foundation and the Gordon and Betty Moore Foundation, as well as DARPA and Homeland Security for much of the informatics work. And then I’ll just leave it up there and thank many of the people, and hope that you interact with many people from my group who have driven down to this meeting and probably really want to get out of their chairs right now. So I will thank you.
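
Editor's note: the PD (phylogenetic diversity) metric described in the talk, the sum total length of the branches in the tree for a set of taxa, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy tree, its branch lengths, the taxon names, and the function names `phylogenetic_diversity` and `pd_gain` are hypothetical and are not taken from the talk's data or software. It assumes a rooted tree stored as child-to-parent links and counts each branch on the paths back to the root once.

```python
# Toy rooted tree (hypothetical): each node maps to (parent, branch_length).
# The root has no entry. Lengths are in arbitrary units.
TOY_TREE = {
    "ecoli_1": ("clade", 0.25),
    "ecoli_2": ("clade", 0.25),
    "clade": ("root", 1.0),
    "novel": ("root", 4.0),
}


def phylogenetic_diversity(tree, taxa):
    """Sum the lengths of all branches on the paths from the taxa to the root.

    Each branch is counted once, even when several taxa share it.
    """
    seen = set()
    total = 0.0
    for taxon in taxa:
        node = taxon
        # Walk toward the root until we hit it or a branch already counted.
        while node in tree and node not in seen:
            seen.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total


def pd_gain(tree, sequenced, candidate):
    """Extra PD contributed by adding one candidate to an already sequenced set."""
    before = phylogenetic_diversity(tree, sequenced)
    after = phylogenetic_diversity(tree, list(sequenced) + [candidate])
    return after - before


if __name__ == "__main__":
    sequenced = ["ecoli_1"]
    for candidate in ("ecoli_2", "novel"):
        print(candidate, pd_gain(TOY_TREE, sequenced, candidate))
```

On this toy tree a second close relative adds only its own short tip branch, while the deeply branching lineage adds far more, which is the intuition behind ranking candidate genomes by the PD they would add.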