Best Practices: Technology Assisted Review

Attendees are in listen-only mode. Welcome, and thank you for attending today's session of Lexbe's eDiscovery webinar series, "Best Practices: Technology Assisted Review," presented by Lexbe principal Carson Weber. Today's presentation will discuss how you can apply transparent and scalable predictive coding technology to speed review and reduce costs. Our eDiscovery webinar series covers a wide range of topics related to e-discovery and litigation document management technologies. Our webinar today should last about 20 to 30 minutes, and a URL to access a recorded version of the presentation will be available later this week to anyone who registered.

To give you a brief overview of our focus at Lexbe: we develop and implement technology that helps legal professionals meet tight discovery deadlines regardless of the amount of data in their case, and without breaking the bank. We offer high-speed ESI processing and conversion services in addition to our web-based review tool, and we were recently named a top 100 provider at ComplexDiscovery.com. If during our webinar you have any questions or experience any technical issues, please email them to webinars@lexbe.com.

Before I hand things over, let me tell you a bit about Carson. As mentioned, Carson is one of the founders and principals here at Lexbe and is the architect and lead developer of Lexbe's e-discovery technologies, including our assisted review and other analytics tools. Carson brings a wealth of experience managing complex software development in a variety of fields, does expert witness consulting with the Lumen Expert Group, holds an MBA from the University of Texas and a master's in engineering from the Danish Technical University. And with that, let me turn things over to Carson.

Thank you, Stu. Today we're going to talk about technology assisted review: what is it, how does it work, when to use it, and then comparing the outcomes, as well as some of the benefits we bring to the table that are specific to our implementation of the algorithm.

So what is technology assisted review, also known as predictive coding? As the data sets generated by computers get bigger and bigger, it's the ability to use computers to help us get through the review of that ever-increasing volume of data. Basically, it's an alternative to manual review, and it can dramatically reduce the amount of time needed to complete the review.

So why use it? It really comes down to two things. If you have a small collection, you wouldn't use technology assisted review. As a rule of thumb, some say you need maybe 40,000 to 50,000 documents before using technology assisted review really makes sense; others say you need a case with more than 100,000 documents before the benefits of technology assisted review outweigh the cost of just brute-force linear review. But once the case size increases, you're able to apply technology assisted review to increase the speed, which means you'll be able to stay on track and meet the deadlines, and you'll also be able to decrease the cost at the same time.

Just as further motivation for the reduced cost: on this slide you can see that seventy-five percent of the overall cost is hidden in the review part. So if you're able to save significantly on review, it will impact the overall budget quite a bit, whereas, by comparison, if you're able to save significantly on collections, it will not have nearly the same impact on overall cost reduction.
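To make that concrete, here is a quick back-of-the-envelope illustration. Only the seventy-five percent review share comes from the slide; the budget figure and the ten percent collection share are made-up numbers for the example.

```python
# Hypothetical $100,000 discovery budget; only the 75% review share
# comes from the slide, the other figures are assumptions.
total_budget = 100_000
review_cost = 0.75 * total_budget      # $75,000
collection_cost = 0.10 * total_budget  # $10,000 (assumed)

# The same 40% saving applied to each phase:
print(0.40 * review_cost / total_budget)      # 0.30 -> 30% off the total budget
print(0.40 * collection_cost / total_budget)  # 0.04 ->  4% off the total budget
```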
So how does it work? There are lots of different implementations of technology assisted review, but they tend to have a similar approach, and that is that the algorithms cannot run without some amount of training. You need to tell the system what a responsive document looks like and what a non-responsive document looks like, and so it will involve, at first, a training set.

It's very interesting when you look at it statistically: if your data set were normally distributed, then as the data set grows you would need more and more documents in that initial training set, also known as the seed set, in order to show the system which documents are responsive and which ones are not. But if you double the number of documents in the case, you don't need twice as many documents in the seed set. There's a formula from statistics that shows that as the number of documents goes to infinity, the number of documents you need in the training set goes to 2,399. So, statistically speaking, with ninety-five percent confidence you only need 2,399 documents, even if your data set is several million documents large.

Now, that relies on the assumption that the data is normally distributed, and in litigation, data is anything but. So you have to choose your seed set size with care. One thing we offer here is text grouping: you can take the text in the case and group it into near-duplicates. If it comes out that you have 3,000 near-dupe groups, then your training set has to be at least 3,000 documents, because otherwise you're clearly missing a representative for some group. The more of those groups you can get into your seed set, the better it's going to perform.
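As a minimal sketch of that sizing arithmetic: the 2,399 figure quoted above is consistent with the standard sample-size formula at ninety-five percent confidence and roughly a two percent margin of error (an assumption on our part), and the near-duplicate rule of thumb simply puts a floor under that number.

```python
import math

def seed_set_size(z=1.96, margin=0.02, p=0.5, near_dupe_groups=0):
    """Sample size for an effectively infinite population, floored by the
    number of near-duplicate groups so every group gets a representative."""
    n = math.ceil(z**2 * p * (1 - p) / margin**2)
    return max(n, near_dupe_groups)

print(seed_set_size())                       # 2401, roughly the 2,399 quoted
print(seed_set_size(near_dupe_groups=3000))  # 3000: the group count wins
```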
Once you have demonstrated to the system on your seed set which documents are responsive and which ones are not, you could in principle let it rip, code all of the hundreds of thousands of documents, wait however long that takes to compute, and then go look at the results. But you can't just let it run, turn the production over, and hope it turned out okay; you want to do a QC step. The way a lot of technology assisted review tools work is that, rather than letting it run and doing the QC after the fact, it runs on smaller sets, called control sets. We run the prediction on just a small set of documents, and the user goes in, checks those results, and says, "Yes, I agreed with this ninety percent of the time, or ninety-five percent of the time." Then you do another set, and another, until you see that the predictive coding is stabilizing (we'll talk about that in more detail), and once you've convinced yourself of that, you let it run on the remainder of the case.

The control sets are generally small, as I mentioned: 25 to maybe 50 documents at a time. And it's your first chance to really see how well this algorithm is working on this particular data set. It's interesting that how well it works really depends on the data set in question. Some data sets lend themselves very well to technology assisted review; others may not work as well. But you will find out very quickly when you go into that second step of checking the results.

You can compare it to having a pool of different balls. If they were red balls and green balls and your task was to sort them into two buckets, that would be trivial, and you could do it very quickly. But if, all of a sudden, some of the balls were both red and green, or had polka dots, or shades of colors, it becomes very different. If your case consists of the first scenario, where you only have the red and the green, it's going to lend itself extremely well to predictive coding, or technology assisted review. If you have a case with a lot of documents that are in between, you will not get nearly as perfect results as you would otherwise, and we'll get into that in more detail later.
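Here is a minimal sketch of that control-set loop, under assumed numbers: the ninety-five percent agreement target, the three-batches-in-a-row stabilization rule, and the simulated reviewer are all illustrative, not Lexbe's actual criteria.

```python
import random

def control_set_qc(predict, reviewer, docs, batch=50, target=0.95, streak=3):
    """Run small control sets until the reviewer agrees with the predictions
    at least `target` of the time for `streak` consecutive batches."""
    pool, stable = list(docs), 0
    random.shuffle(pool)
    while pool and stable < streak:
        control, pool = pool[:batch], pool[batch:]
        agree = sum(predict(d) == reviewer(d) for d in control) / len(control)
        stable = stable + 1 if agree >= target else 0
        print(f"control set of {len(control)} docs: {agree:.0%} agreement")
    return stable >= streak

# Toy demo: simulated coding that matches the reviewer 96% of the time.
docs = range(1_000)
truth = {d: random.random() < 0.3 for d in docs}
predict = lambda d: truth[d] if random.random() < 0.96 else not truth[d]
print("stabilized:", control_set_qc(predict, truth.__getitem__, docs))
```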

And here's the last step, where we actually apply the coding. This is the one where there's no longer a user involved: it runs completely automatically and classifies the big mass of data. The first two steps might work on, you know, 1,000, 2,000, 5,000 documents, with users involved; if it covers five thousand documents and you're reviewing at 50 documents an hour, you can see that's a hundred hours' worth of work. But on the last step you might run a million documents, and it's all automated at that point.

Once it's done running, it will have classified everything into two bins: the responsive and the non-responsive. Let's look at it the following way: you have a row of the actual documents, responsive and non-responsive, and underneath it you have the predicted ones, and in this case there's a perfect match.

So what does it mean that a document is actually responsive or non-responsive? I think that's a really interesting question, because it's easy to assume that every document can be cleanly classified one way or the other. It goes back to that analogy with the balls: between the red and the green there will be all kinds of different shades. We've run several tests on this. We've had people who have done coding for 10 or 20 years, and they didn't all get the same result on the same documents. Some would say this document is responsive, and others would say, "Are you sure?" It could be a document that's 20 pages long but mentions one line of something relevant to the case. Does that make the whole document responsive? And what exactly was meant by that one line; was it in context or out of context? So it can be very difficult to establish even that first row of "actual" classifications; there can be a real debate about it.

But let's, for the sake of argument, assume there is a way to classify every document with one hundred percent certainty. Then we need some way of saying how well the algorithm's prediction did, and that breaks down into two measures, called precision and recall. Precision: we know it's not going to be one hundred percent, so out of the documents that were produced, what is the percentage that were actually responsive? We want that number to be as high as possible. Recall: what percentage of the responsive documents were classified as responsive; in other words, how many of the responsive ones actually got produced? We want that number to be as high as possible too. You'd like both of them to be a hundred percent, but as you'll see, those two numbers work against each other: as one gets very high, the other starts to fall off, and vice versa.

Here's an example of low recall and high precision. Low recall and high precision means we may have omitted some documents that really were responsive, but the ones we produced have very high precision: almost all of them are indeed responsive. The extreme would be a production of only one document, where we know with incredibly high certainty that this document is responsive.
It only turned over one document, but it was responsive, so that production would have a precision of one hundred percent, while the recall would be just a few percent. The flip side would be: guess what, we turn over everything. Have we given you all the responsive documents? Yes, we turned over the entire collection. In that case the recall is one hundred percent, but if, say, only ten percent of the documents are actually responsive, then your precision is only ten percent. So we need a balance between the two.
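As a small worked example of the two extremes just described (the counts are made up: a 1,000-document collection with 100 truly responsive documents):

```python
def precision_recall(produced, responsive):
    """Precision: of what we produced, how much was actually responsive.
    Recall: of everything actually responsive, how much we produced."""
    true_positives = len(produced & responsive)
    return true_positives / len(produced), true_positives / len(responsive)

responsive = set(range(100))    # 100 truly responsive documents
collection = set(range(1_000))  # in a 1,000-document collection

# Extreme 1: produce a single document we are certain about.
print(precision_recall({0}, responsive))         # (1.0, 0.01)

# Extreme 2: produce the entire collection.
print(precision_recall(collection, responsive))  # (0.1, 1.0)
```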

But it is not possible to have them both be one hundred percent with a predictive coding algorithm. In fact, you see something very similar with human reviewers: they tend to be maybe around eighty percent, not a hundred percent. So it really comes back to best effort. How can we justify doing technology assisted review if it's not one hundred percent? Well, it's a matter of recognizing that cases are getting ever bigger, and at some point it simply is not practical to do a traditional linear review where humans look at every single document. If it's barely practical today, wait another year and you have a case that's ten times bigger; wait another, and the case is a hundred times bigger than today. At some point that simply breaks down, and you have to look at the alternatives, and one of them is technology assisted review. So while it's not perfect, it's a great alternative, and there are lots of studies showing it competes very well with manual review, because manual review is not perfect either.

We believe in transparency, and the approach we've taken with our predictive coding is to create a system in which it's possible to disclose the algorithm, so that somebody else could take the same documents, run it, and get the same results. This is different from some other algorithms that are a complete black box, where there's no easy way of arguing why this document got produced in the collection and why that other document did not. We didn't want that, so instead we chose a Bayesian classifier. The algorithm is very, very well known; you can look it up on Wikipedia.

Basically, here's how it works, and how technology assisted review works in a lot of systems: it's all based on the words in the documents, and you can think of each word as having a certain weight. If a word appears in responsive documents, its weight becomes very positive; if it appears in non-responsive documents, the weight goes to the other side. You can sort of think of it like an old-fashioned scale, Lady Justice's scale: you've got a weight for each word, and it goes on either the left side, responsive, or the right side, non-responsive. You've trained the system on the seed set, so it knows whether those words indicate responsive or not. It will say, "Gee, I've seen this word show up in both responsive and non-responsive documents, so it doesn't tip the scale very much." But if, for example, your case is about fantasy football, and the system has only seen the words "fantasy football" in responsive documents, then those words weigh in very heavily on the responsive side. Then, the first time a document about fantasy basketball shows up and is coded non-responsive,
the system suddenly says: "The word 'fantasy' I have now seen in both a responsive document and a non-responsive document, so its weight becomes a little more neutral, whereas the weight associated with the word 'football' is still very heavy on the responsive side."
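Here's a minimal sketch of a word-weight classifier in that spirit. To be clear, this is a generic naive Bayes illustration with assumed details (whitespace tokenization, add-one smoothing, class priors omitted), not Lexbe's actual implementation:

```python
import math
from collections import Counter

def train(seed_set):
    """seed_set: list of (text, is_responsive) pairs.
    Returns one weight per word: positive tips the scale toward responsive,
    negative toward non-responsive. Add-one smoothing keeps words seen in
    only one class from getting infinite weight."""
    counts = {True: Counter(), False: Counter()}
    for text, label in seed_set:
        counts[label].update(text.lower().split())
    vocab = set(counts[True]) | set(counts[False])
    totals = {c: sum(counts[c].values()) + len(vocab) for c in (True, False)}
    return {w: math.log((counts[True][w] + 1) / totals[True])
              - math.log((counts[False][w] + 1) / totals[False])
            for w in vocab}

def score(weights, text):
    # Sum the word weights; above zero leans responsive.
    return sum(weights.get(w, 0.0) for w in text.lower().split())

seed = [("fantasy football draft picks", True),
        ("quarterly budget meeting notes", False)]
weights = train(seed)
print(score(weights, "fantasy football league"))  # clearly positive

# Code one "fantasy basketball" document non-responsive and retrain:
weights = train(seed + [("fantasy basketball scores", False)])
print(weights["fantasy"], weights["football"])    # ~0.19 vs ~0.89:
# "fantasy" has turned nearly neutral; "football" still tips the scale.
```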

In that sense you can think of it as keyword search, but much more encompassing, because it takes all of the words in the documents into account. It's also very elegant: with keyword search you find the one document, but with predictive coding, once you find that one document, it might find another term in there, like the Houston Oilers, that was related to the fantasy football, and from then on "Houston Oilers" gets picked up and used to predict other documents. So it's a very simple approach, but very powerful at classifying; it was first used in the old days for spam filters, and now for technology assisted review.

One thing we do is run the system in the cloud. In a data center we're able to scale up very quickly to hundreds of servers, and that way we can do in hours what would otherwise take days to run. That leaves you more time before the deadline before you have to start the actual coding, while still being able to finish in time.
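That final, fully automated pass parallelizes naturally, since no document's prediction depends on any other's. Here's an illustrative sketch of fanning batches out across local worker processes (standing in for those hundreds of servers); the dummy classify_batch is a placeholder for the trained model, not part of any real product API:

```python
from concurrent.futures import ProcessPoolExecutor

def classify_batch(batch):
    # Placeholder for the trained model's predict step.
    return ["review" in doc.lower() for doc in batch]

def classify_all(docs, workers=8, batch_size=10_000):
    # Split the collection into batches and score them in parallel.
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return [label for chunk in pool.map(classify_batch, batches)
                for label in chunk]

if __name__ == "__main__":
    docs = ["please review the contract"] * 50_000 + ["lunch order"] * 50_000
    labels = classify_all(docs)
    print(sum(labels), "predicted responsive of", len(labels))
```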
So, in summary: it allows a skilled reviewer to train the system. With the explanation I gave of the words having different weights on a scale, you can see that that person has to be very skilled: if they make one bad classification, all of a sudden a word gets the wrong weight and starts to tip the scale in the wrong direction. So it's absolutely critical that you get the base set correct. Unlike some other systems, we have a way of setting up that base set within a traditional review, with all the QC bells and whistles, and then feeding those results into our technology assisted review. We can use this on larger sets to still make the deadline, and as the case size grows, it becomes more and more economically advantageous to use predictive coding. It works, as we talked about, by setting up a seed set, testing it, and then letting it run.

We talked about precision and recall. One little side note on that: back in the day when Google started, there was another search engine, AltaVista, that was there before. They came very much from an academic background and took the approach that you need to be able to search for academic articles, so you need super-high recall: if you type this, we'll make sure to show all those results. That resulted in very low precision; you would get lots of web pages, but not the results you were looking for. Google, among the many things they got right, flipped that around and said: we'll go with very high precision. There will be lots of results you don't see, because of the lower recall, but you won't know they were there, so it won't hurt as much, as long as what you see on the first page is what you're really interested in. So the same precision-and-recall trade-off we're talking about here applies both to technology assisted review and to web search, and that's the terminology the big search firms use.

It's important to understand the algorithm, not just so that you can explain and defend your results. It's also important, thinking of the analogy with the red and green balls, because different cases will behave differently, and by having a disclosed algorithm you're able to better understand why it works the way it does, as opposed to wondering why it worked very well on previous cases but somehow doesn't work as well on yours. And then, finally, we have a very scalable and transparent approach.

Thank you, Carson, and thank you all for attending today's webinar. If you have any questions or are interested in learning more about predictive coding and assisted review, please don't hesitate to contact us at any time. You can also visit lexbe.com/assisted-review to download our Assisted Review white paper, which further details our specific approach to TAR. Thanks again for joining us, and have a great day.