Improving Natural Language Understanding through Adversarial Testing

So let me now introduce today's speaker, Chris Potts. Christopher Potts is Professor of Linguistics and, by courtesy, of Computer Science at Stanford University. He is the Director of the Center for the Study of Language and Information at Stanford. In his research he develops computational models of linguistic reasoning, emotional expression, and dialogue. He is the author of the book The Logic of Conventional Implicatures, as well as numerous papers in linguistics and natural language processing. So thank you for joining us, Chris, and I'm happy to hand it over to you.

Thank you, Petra, and thank you everyone for turning out today. I think this is an exciting event. I also want to take the chance to thank those course alums for coming back to talk about their work. I think that's very generous; they did big and exciting things during the course, and it's great for them to be reporting out to you on what they achieved.

To kick this off, I want to say very boldly that we live in the most exciting moment in history for doing work in natural language understanding. It really does feel like over the past 15 years we've seen some real qualitative changes in what we're able to do, both with technology and with core scientific development. So that's a very positive picture; it really is an exciting moment. On the other hand, I think as practitioners we can see that some of the gains aren't what they first seem, and that the big questions are still left open. That's part of what makes this such an exciting moment: not just that we're able to do new and exciting things, but also that the big research challenges are still ahead of us. What I can do today is give you a full picture of that through the lens of what I'm going to call adversarial testing, which is a new mode of evaluating our systems, looking for ways to find problems with them and improve them.

So here's our outline for today. I do want to emphasize, under the heading of a golden age for NLU, that lots of exciting things are happening, and I want that to be the overall message: it really is an exciting moment. However, it's important that we take a peek behind the curtain and come to a measured understanding of what this progress is actually like. That will key us up to talk about adversarial testing, which is a more technical, more strategic way that we as practitioners can find fault with our systems and then look for ways to improve them. And that's a nice transition into the coursework for XCS224U, because I think the tools and techniques that we introduce combine really well with adversarial testing in finding new ways to make progress.

Well, let's kick it off on the positive note: a golden age for NLU. I've assembled some examples that have to do with natural language understanding; actually, I could be talking about AI in general, because it really is a golden age for the entire field. First example: artificial assistants. These get a lot of the press. These are things like Siri and Google Home; the more trusting among you might have these devices in your homes, listening to you at all times. They certainly pass the bar in terms of utility: they're able to help us with simple tasks around the house. I also want to call out the fact that their speech-to-text capabilities are astounding. They're good in the sense that, 15 years ago, the things that they achieve every day would have looked like science fiction.
Now, it's by no means a solved problem; there are plenty of remaining issues with that speech-to-text. But again, it really does feel like we've made a phase change in terms of the things we can achieve there. Maybe for the NLU part a little less so, as we'll see in a little bit.

We don't actually talk about machine translation in the course, but I'd be remiss if I didn't bring it up, because this is another major breakthrough area for language technologies. I've picked Google Translate as my example here. First of all, Google Translate can take you from dozens of input languages to dozens of output languages; that alone is an astounding science and technology accomplishment. But it's also remarkable how good the translations can be. Just as a quick example, I've put some English in on the left, and I've got a French example on the right, and this is actually an example from a popular dataset, so we have a so-called gold translation here. It's just striking how close the translated text is to that human gold standard. You have the usual mistakes in terms of prepositions and maybe word choice and stylistics, but again, it really passes the bar in terms of helping someone who doesn't understand the input language figure out what was expressed, and that's just something that wasn't true 15 or 20 years ago for these MT systems.

Image captioning is a really great, grounded language understanding task, where the inputs are images and the task is to assign them accurate, interesting, descriptive captions. This is from a paper from a few years ago that was really a breakthrough paper on doing this kind of natural language generation, and I just want to observe that these captions are really great: "a person riding a motorcycle on a dirt road," "a group of young people playing a game of frisbee." This is fluent, basically factually accurate text that is really good as a caption for those images, and again, this just wasn't something we could achieve 15 years ago.

Another major technology moment, for me anyway, was when IBM's Watson system won Jeopardy!, the game show. This is a really integrated technology system that had to do lots of things in order to play the game of Jeopardy, but at its heart this was a very powerful open-domain question answering system. So I really do mark this as a win for NLU: it was able to be superhuman, in some sense, on the show by beating these two Jeopardy champions.

And if we zoom in on the kinds of tasks that we actually focus on in the course, I think we see a similar picture, where the kinds of things that we can do now just feel very different from the kinds of things we could do 15 years ago, and it feels like we're on the cusp of seeing some really transformative things in the near future as well. Because it's a major unit for the course, I've picked natural language inference as the task I'm going to focus on to illustrate some of this. Just briefly, in the task of natural language inference, or NLI, you're given a premise and a hypothesis text, and the task is to assign one of three labels to that pair. In this case we would assign the label "entails" to the pair "a turtle danced" / "a turtle moved": the idea is that any situation in which the premise sentence is true would also be one in which the hypothesis sentence is true. It's a kind of common-sense reasoning task. For the second example, "every reptile danced" is neutral with respect to "a turtle ate," because they can be true or false independently of each other. And finally, "some turtles walk" and "no turtles move" would be a contradiction in this common-sense notion of reasoning.

So that's a fast overview of the NLI task. There are a few major benchmark datasets. The first and oldest one is the Stanford Natural Language Inference (SNLI) corpus. What I've done is map out on the y-axis the F1 score; you can think of that as a kind of notion of accuracy for the system. In the original paper we set a human baseline of just short of 92 on this F1 metric; that's this red line here. Across the x-axis I have time, and so what we're going to look at is published papers over time that have tried to achieve new things on this SNLI benchmark. Here's the picture. What you see is, first of all, basically monotonic progress: I think people are learning from previous papers and figuring out new tricks that help them on the SNLI task, so you see a lot of very rapid progress. And then the striking thing is that, in the middle of last year, we saw the first, what you might call, superhuman system for the SNLI task. Two things I want to say about that. First, again, this is not something that we could have achieved two decades ago; it really is remarkable that we even have systems that can enter this kind of competition, to say nothing of actually surpassing this kind of estimate of human performance.
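To make that metric concrete, here is a tiny, self-contained sketch of the NLI data format and the macro-averaged F1 computation, using the toy turtle examples from above rather than real SNLI data; the "system predictions" are invented purely for illustration.

```python
from sklearn.metrics import f1_score

# Toy NLI examples in (premise, hypothesis, gold label) form; not real SNLI data.
examples = [
    ("A turtle danced", "A turtle moved", "entailment"),
    ("Every reptile danced", "A turtle ate", "neutral"),
    ("Some turtles walk", "No turtles move", "contradiction"),
]
gold = [label for _, _, label in examples]

# A hypothetical system's output, just to show how the score is computed.
predicted = ["entailment", "entailment", "contradiction"]

print(f1_score(gold, predicted, average="macro"))  # macro-averaged F1 over the 3 labels
```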
However, what we have to keep in mind, and what we'll see in a little bit, is that this does not mean that we have systems that are superhuman when it comes to the human task of common-sense reasoning. All the hard aspects of that problem remain completely unsolved, and I'm going to make that very clear to you. But nonetheless, it's striking that we have systems that are even this good in this narrowly circumscribed way.

MultiNLI is a very similar dataset; it's just arguably harder, because the underlying data is more diverse. I have the same framework here: F1 score, and the human estimate is 92.6, so a little bit higher, and I again have time along the x-axis. Now, it's kind of exciting that this dataset, unlike the previous one, is on Kaggle, so many more people can enter and we can get their scores. The picture overall is that there's a lot more variance, and a lot more people are entering and doing more interesting and diverse things, but it's a similar picture in that we can see the community slowly hill-climbing toward what is eventually going to be superhuman performance on this task, and that's exciting to see. And NLI, just to be clear, is not the only area in which we have what you might call, in quotes, "superhuman" performance. Here are a few other examples; they include speech technologies, translation, question answering, and GLUE, a big benchmark that captures a lot of diverse things.

Again, though, you have to be really careful about how you talk about this. What we have is systems that are superhuman on these particular datasets, using a very particular set of metrics. This does not mean that we have superhuman performance in any larger sense, and that's the part that I find exciting: in fact, these are unsolved problems. But nonetheless, you might look at and reflect back on this technology and start to adopt the perspective that's in this book by Nick Bostrom called Superintelligence, which looks at the current state of technology and begins to wonder what life might be like when, very soon perhaps, we have systems that are vastly better than humans at all of these core human abilities, and he worries about what the world and the universe might be like when we achieve those kinds of breakthroughs. So have that picture in mind; I can see why people would arrive at it, given the golden age we live in, but I do want to temper it a little bit.

So let's take a peek behind the curtain at those examples. I mentioned those artificial agents before that are in your houses; you've probably experienced them in various ways. I think the dream is that they'll be able to do things like this. You say, "Any good burger joints around here?" and it replies, "I found a number of burger restaurants near you," and you say, "Hmm, what about tacos?" and at that point your device is able to recognize your intention very flexibly, think about your language and the context it's in, and proactively help you solve the problem that you've implicitly defined for it. That's the dream; I'm not sure how often you all experience it. I want to balance that against a very funny sketch from the Stephen Colbert show from a number of years ago. The premise is that Stephen has been playing with his iPhone, which has Siri on it, all day, and he has failed to write his television show. So he says, "For the love of God, the cameras are on, give me something," and Siri replies, "What kind of place are you looking for: camera stores or churches?" As practitioners we should pause there and realize what has happened: Siri is doing some very superficial keyword matching on the utterance, not deep language understanding, and that's why it has associated "cameras" with camera stores and "God" with churches. So that's a peek behind the curtain. And then the interaction continues: "I don't want to search for anything, I want to write the show," and Siri does what Siri often does: "searching the web for 'search for anything I want to write the shuffle.'" There's a small transcription error there, but I think the broader picture is just that Siri does not have a deep understanding of this interaction, and we're seeing on the surface the cheap tricks that the device uses to try to get past those limitations. This is by no means the open-domain dialogue that we were hoping for.

Translation: again, I think Google Translate is an astounding technological achievement, but it too shows that it doesn't have deep understanding. For this example, what I've done is input a bunch of random vowel sequences; this is a trick I learned from the Language Log website. Completely random input. It's interesting that it has inferred that this is the Hawaiian language; if you know something about Hawaiian syllable structure, you might grant that that is at least an interesting hypothesis about this input.
Nonetheless, it is completely random. The really disconcerting part is that on the right here, in English, we have a completely fluent sentence that by definition has nothing to do with that nonsense input. And even stranger, if I make very small changes on the left, I'll get a completely fluent but completely different sentence out on the right. This reveals that these systems don't know anything about their own uncertainty and certainly don't understand the inputs they're processing.

I showed you before those examples from image captioning. To their credit, in that excellent paper they didn't just show the really successful cases; here we have a spectrum from the really good ones on the left to the rather embarrassing ones on the right. This middle one says "a refrigerator filled with lots of food and drinks"; it's actually a sign with some stickers on it. "A yellow school bus parked in a lot" is kind of close, but really not like what the human understanding of those scenes is. So lots of work to be done, even in the narrowly circumscribed space of image captioning.

I mentioned before that I think Watson was a real breakthrough technology moment, especially for open-domain question answering, but again, Watson did not understand what it was processing. Here's a sort of funny interaction; you have to remember that Jeopardy kind of reverses its questions and answers. So the answer came, "grasshoppers eat it," and Watson's reply was "kosher," which seems completely disjointed; it's not a guess that a human would make. But if you realize that Watson was primarily trained on lots of Wikipedia entries, and you look up grasshoppers on Wikipedia, you'll find very rich discussions of whether modern-day grasshoppers are kosher.

So there is a way in which we can understand what Watson did, but this also reveals how superficial the processing techniques actually are, and how un-human-like they actually are.

Summarizing: I showed you that perspective from the book Superintelligence before. Having now seen behind the curtain, you might balance that against the perspective in this very funny book called How to Survive a Robot Uprising. This is by Daniel Wilson, who is a practitioner, a roboticist, and the book is full of advice like: if you're being pursued by a robot, run up some stairs, or be sure to wear clothing that will confuse its vision system. A much more tempered perspective.

So let's try to make that a little more precise, in terms of things we could take action on in a course like Natural Language Understanding. That falls under the heading of adversarial testing. Just to get into our common ground, let me quickly review what standard evaluations are like. In standard evaluations in NLU, but actually throughout the field of artificial intelligence, we work like this. You create a dataset from some single process; you could scrape some data from the web or crowdsource a new dataset or something like that, but the point is that it's kind of homogeneous. In the next step, you divide the dataset into disjoint train and test sets, and you set the test set aside; it's under lock and key. You develop your system on the train set, never once looking at the test set, and only after all development is complete do you finally evaluate your trained system on that held-out test set. The idea is that this provides an estimate of your system's capacity to generalize to new cases, because after all, you held that test set under lock and key and only at the very end did you look at how your system behaved on those entirely new examples. It sounds good, and it has a lot going for it, but I want to point out how generous this is to the systems we're developing, because in step one we had a single process. It's too much to say that this is actually going to be an estimate of how the system will perform in the real world, because the real world will throw at our system many more diverse experiences than we saw in step one and throughout this process.

Adversarial testing embraces that, because in adversarial testing we make a slight tweak. You start by creating a dataset by whatever means you like; it could be just as before. You develop and assess your system using that dataset, again according to whatever protocols you choose, so this part could be standard. But here's the new bit: you develop a new test dataset of examples that you suspect or know, as a practitioner, will be challenging given your system and the original dataset. And then, of course, only after all system development is complete, you evaluate systems on that new test set and report that number as an estimate of the system's capacity to generalize. A lot of this is familiar, except for the introduction of this new and potentially quite adversarial dataset in the middle. This simulates what we saw when we looked behind the curtain, where entirely new examples that the system developers didn't anticipate were causing a lot of grief for our otherwise very good systems.
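As a schematic illustration of that workflow, here is a minimal sketch in Python; load_original_data, load_adversarial_data, build_model, and score are hypothetical placeholders for whatever dataset, model, and metric you are actually working with.

```python
from sklearn.model_selection import train_test_split

# Hypothetical helpers; substitute your own dataset loading, model, and metric.
X, y = load_original_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = build_model()
model.fit(X_train, y_train)                      # develop only on the training split

standard_score = score(model, X_test, y_test)    # the usual held-out estimate

# The new step: a separate test set constructed to probe suspected weaknesses,
# touched only after all development is complete.
X_adv, y_adv = load_adversarial_data()
adversarial_score = score(model, X_adv, y_adv)   # often far lower than standard_score
```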
Let's return to that NLI problem, and let me show you what this is like in practice. Remember, this is that premise-hypothesis prediction task with three labels. In a lovely paper, Glockner et al. created a new adversarial dataset based on lexical substitutions. I actually hesitate even to call it adversarial, because I think this is just an interesting challenge that they set up. Here's how it worked. You have a fixed premise; here it's "a little girl kneeling in the dirt crying." The original example had the hypothesis "a little girl is very sad," and that pair is in the entailment relation. What they did is just use WordNet, which is a structured lexical resource, to substitute here the word "sad" so that it becomes the word "unhappy." Those are roughly synonymous, so what we would expect is that systems will just continue to predict the entailment relation for this new adversarial example; everything else about the example is the same. That's why I say this is actually a rather friendly adversary. What they found in practice is that systems that are otherwise very good are apt to predict something like contradiction for the second case, probably because they take the negation inside the word "unhappy" as a good signal of contradiction. So they make a mistake, and it's not a very human mistake; it's something very systematic about our understanding of a language like English that we see that these two are synonymous, assuming we know what the words mean. This example down here is similar: you have the fixed premise, and all they've done is change "wine" to "champagne," and that should cause a change from entailment to neutral. But in fact, since the system has a very fuzzy understanding of how wine and champagne relate to each other, it continues to predict entailment in that case.
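To give a feel for the kind of substitution involved, here is a rough sketch using NLTK's WordNet interface; Glockner et al.'s actual pipeline is more careful about word senses and about how each substitution changes the gold label, so treat this as illustrative only.

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def wordnet_synonyms(word, pos=wn.ADJ):
    """Collect single-word synonym candidates for `word` from WordNet."""
    candidates = set()
    for synset in wn.synsets(word, pos=pos):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower():
                candidates.add(name)
    return candidates

premise = "A little girl kneeling in the dirt crying"
hypothesis = "A little girl is very sad"

# Swapping in a near-synonym should preserve the entailment label.
for synonym in wordnet_synonyms("sad"):
    print(premise, "=>", hypothesis.replace("sad", synonym), "(label: entailment)")
```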

This is a picture of the dataset. I think it's really cool, because it's got a lot of examples, especially for contradiction and entailment, and it also has this nice breakdown by individual categories, so you can get some real insight into what a system is doing. And, as predicted, this is quite devastating for these systems. I have a few models here that were very good models at the time; they have very good SNLI test accuracy, which is one of those benchmark tasks I mentioned before, and their accuracy on this new test set has plummeted by as much as 30 percentage points in absolute terms. So this is really devastating.

Now, there is a ray of hope here. I'm not going to go into this slide in detail, because there's a lot of information on it; I've put it here just to say that our course has really great coverage of what are called transformer-based models. You might have heard of them: BERT, RoBERTa, ELECTRA, XLNet. By the end of the course you'll have a very deep understanding of all the technical details that you see here. For now, though, I just want you to see that there has been a really interesting breakthrough in the last two years related to how people use these transformer-based models, and I can give you a glimpse of that. Let's highlight RoBERTa. What I've done on the next slide is use some of the code from our course and a pre-trained model that's easy to access using Facebook's code. I read that model in and evaluate it on the full Glockner et al. dataset that I just showed you, and the result is amazing. These are the performance numbers: the accuracy is at 0.97, and it's doing extremely well for those two categories where you have enough examples, enough support. Remember, just two years ago the best system on this adversarial test wasn't even above 0.75, and now we're at 0.97 and doing well on both of these categories. That's starting to look like, yet again, some big leap forward in how well we can do. It's very exciting.
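For a sense of what that kind of evaluation looks like in code, here is a minimal sketch using the publicly released RoBERTa model fine-tuned on MNLI, loaded through PyTorch Hub from Facebook's fairseq; the label ordering follows the fairseq documentation, and adversarial_examples is a placeholder for a list of (premise, hypothesis, gold label) triples such as the Glockner et al. pairs. This is not the course's own evaluation script.

```python
import torch

# RoBERTa-large fine-tuned on MNLI, via PyTorch Hub (fairseq).
roberta = torch.hub.load("pytorch/fairseq", "roberta.large.mnli")
roberta.eval()

# Label order per the fairseq documentation.
LABELS = ["contradiction", "neutral", "entailment"]

def predict(premise, hypothesis):
    tokens = roberta.encode(premise, hypothesis)
    return LABELS[roberta.predict("mnli", tokens).argmax().item()]

# `adversarial_examples` is a placeholder list of (premise, hypothesis, gold) triples.
correct = sum(predict(p, h) == gold for p, h, gold in adversarial_examples)
print("accuracy:", correct / len(adversarial_examples))
```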
Now we can level up once more. So far we've been using adversaries just for test sets, but we could actually have them be part of the entire life cycle of a model, and that's what these authors have done for the Adversarial NLI (ANLI) dataset. This is a direct response to the kind of adversarial test failings we just saw. Here's how it worked. The annotator is presented with a premise sentence and a condition, that is, a label they need to produce: entailment, contradiction, or neutral. The annotator writes a hypothesis, and then a state-of-the-art model comes in and makes a prediction about this new premise-hypothesis pair. If the model's prediction matches the condition, that is, if the model was correct, the annotator returns to step two and tries again, and that loop continues until the model is finally fooled and you have a premise-hypothesis pair, which you then validate with humans. So what's happening here, by definition, is that we're creating a dataset that is intuitive and natural for humans but very difficult for state-of-the-art models, because they are now in a loop where people are being adversarial toward them. And it's a familiar picture. This is the current state of the art: we have a few systems here that are outstanding on SNLI and MultiNLI, and all these numbers in the middle are different views of the Adversarial NLI dataset, and you can just see that they are dramatically lower than those standard evaluations on the right. So, another unsolved problem. We saw a glimmer of progress, but I think this is now the new thing to beat, and in fact we're going to hear a bit more about these kinds of evaluations a bit later.

Finally, by way of wrapping up, I don't want to take too much time, but I thought I could connect this really nicely with our coursework. Here's the high-level summary. We cover these topics on the left. It's by no means an exhaustive list of topics for the field, but I think it's a good sample, in the sense that it gives you a picture of a lot of different tasks, structures, models, techniques, and metrics, so that if you're good at this sample of topics, you're really empowered to take on anything that's happening in the field of NLU right now. Part of the reason I feel confident saying that is that the course is very hands-on. We have four assignments, each paired with a bake-off; I'm going to tell you about the bake-offs in a second, but each one of them is meant to be a kind of simulation of a small original final project. That culminates in, or leads into, the final projects, which come in a sequence of steps that help you incrementally build up from a literature review, through an experimental protocol, and finally to a final paper, so that, with the help of a teaching-team mentor, you've slowly built toward something that's an original contribution in the field.

For those assignments and bake-offs, let me give you a glimpse of what the rhythm is like. Each assignment culminates in a bake-off, which is an informal competition in which you enter an original model; this is like the shared evaluation tasks that you see a lot throughout the field. The assignments ask you to build up some baseline systems to inform your own model design, then to build that original model and enter it into the bake-off. We have held-out test sets for you, so that we really get a look at how good your systems are, and the teams that win get some extra credit. It's also important that the teaching team assembles all of these entries and reflects insights from them back to the entire group, so that we can collectively learn what worked and what didn't for these problems. The rationale behind all this, of course, is that each one of these should exemplify best practices for doing NLU and help make you an expert practitioner.

I want to connect back with those earlier themes, and I think we have one bake-off that does that in a really exciting way. This is a kind of micro version of the NLI task: we do word-level entailment, where the training examples are pairs like turtle/animal, and a 1 means they're in the entailment relation; turtle/desk gets a 0. So it's a small, one-word version of the full NLI problem. The reason it connects with what I was just covering is that we try to make it a bit adversarial: the train and test sets have disjoint vocabularies, so, for example, if you see "turtle" in the train set, you won't find it anywhere in the pairs in the test examples. The idea is to really push systems to make sure they are learning genuinely generalizable information about the lexicon, as opposed to just benefiting from idiosyncrasies in the patterns of the dataset that happen to exist. In that way, I think we can push ourselves to develop systems that really have robust lexical knowledge embedded in them, and these test evaluations give you a glimpse of how much of that you've actually achieved.

Oh, and just to emphasize again how hands-on this all is: this is a full system for that word-level entailment problem. We don't need to dive into the details of the code; I'll just say that you make essentially three decisions here. The first, here using GloVe, is your choice of how to represent the individual words; in this case I'm using GloVe pre-trained representations. GloVe is a model we cover in some detail at the start of the course, but of course you are free to make use of any representation scheme you want for these words. You should also decide how to represent the pairs; here I've chosen to just concatenate the two representations, but lots of things are possible. And then, I'd say, the most interesting and exciting part falls under the network. This is a bit of PyTorch code, using code that we release as part of the course and that you'll make a lot of use of. The reason that's important is that the pre-built code really frees you up to think creatively about the problem at hand, and you can see that here: this is a complete working system in cell 4. Primarily, what you do for this assignment and bake-off is work on the build_graph method, where you're essentially building the computation graph for a deep neural network model, and everything else about the optimization process is handled by the base classes, which are already part of the course repository. That's important because, hidden away under the base keyword args here, this model actually has lots of different settings that you can explore for different optimization choices and other things, so that you can really experience, in a hands-on way, how best to optimize the modern deep learning models you're building.
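As a rough, hypothetical sketch of what such a system can look like (not the actual course notebook), assuming a course-style base classifier that handles all the fitting and optimization and exposes attributes roughly like the ones used below:

```python
import numpy as np
import torch.nn as nn
# Assumed course utility; the exact module and attribute names are assumptions.
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier

def vec_concatenate(u, v):
    """Represent a word pair as the concatenation of its two GloVe vectors."""
    return np.concatenate((u, v))

class WordEntailmentClassifier(TorchShallowNeuralClassifier):
    def build_graph(self):
        # Only the computation graph is specified here; batching, the loss, and the
        # optimizer are handled by the (assumed) base class.
        return nn.Sequential(
            nn.Linear(self.input_dim, self.hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=0.1),
            nn.Linear(self.hidden_dim, self.n_classes_))
```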
I don't have time for it, but I did just want to mention this other bake-off; there are four in all, but this is a really different one from the previous ones. If I had more time with you, the other theme I would emphasize would be the importance of grounding natural language outside of language, in actual physical scenes and so forth. The way we explore that in a tractable way is by doing natural language generation, where we're trying to describe color patches in context. This is another modeling direction, and it does bring in non-linguistic information in the form of these color patches. I think it's a really interesting problem; it connects with interesting topics in linguistics, and it's a chance for you to explore another prominent class of models, the encoder-decoder models, which process sequences: on the left, this is not a linguistic sequence, these are color patches (they could be images), and on the right, of course, you're producing a natural language description.
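To make the encoder-decoder idea concrete, here is a minimal, hypothetical PyTorch sketch, not the course's model; all names and dimensions are illustrative. The encoder reads a short sequence of color representations, and the decoder generates a token sequence conditioned on the encoder's final state.

```python
import torch
import torch.nn as nn

class ColorDescriber(nn.Module):
    def __init__(self, vocab_size, color_dim=3, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.encoder = nn.GRU(color_dim, hidden_dim, batch_first=True)
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, colors, tokens):
        # colors: (batch, n_colors, color_dim); tokens: (batch, seq_len) of word ids.
        _, state = self.encoder(colors)              # summarize the color context
        embedded = self.embedding(tokens)
        decoded, _ = self.decoder(embedded, state)   # condition generation on that summary
        return self.output(decoded)                  # per-position vocabulary logits

# Example shapes: 8 contexts of 3 colors each, 5-token descriptions.
# logits = ColorDescriber(vocab_size=100)(torch.rand(8, 3, 3), torch.zeros(8, 5, dtype=torch.long))
```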

But in the interest of time, I'll go quickly to the wrap-up. As I said before, I really believe this is the most exciting moment ever in history for doing NLU. It's not as though you're joining the field just at the moment when all the hard tasks have been solved; rather, I think we now have a good foundation for the really exciting breakthroughs, which are in the future, and I think the adversarial testing really makes that clear. This course gives you hands-on experience with a wide range of challenging NLU problems, and when you come to do your original research, you'll have a mentor from the teaching team to guide you through not only the project work but also all those assignments and bake-offs and so forth. As for examples of success there: some of these projects have turned into really exciting and mature papers, some of them even published, and you're going to hear about some of that really mature and interesting work in just a moment. The central goal of all of this, of course, is to make you the best, that is, the most insightful and responsible, NLU researcher and practitioner, whatever you decide to do next with all of this new material. So I'll wrap up there. Thank you very much.

Thank you, Chris, this was extremely interesting. If you have any questions for Chris, feel free to post them in the Q&A box. We will be moving on right now to allow enough time for students to present their projects. If you are interested in learning more about the course Chris was mentioning, or other courses SCPD is offering in the Artificial Intelligence program, like machine learning or deep learning, you can check the links you will see on your platform. But now we will move on, and we will hear from two project teams. They took the Natural Language Understanding course and developed great projects that they will now briefly present. The first speaker will be Gokan Chagrici, so Gokan, you can go ahead.

Yes, hi everyone, my name is Gokan, and my presentation is about the effect of ensembling on an NLI benchmark. Our focus keyword is basically adversaries. As you probably know, leaderboards are created for some challenging problems, but most of them use a frozen corpus, and then there are practitioners and researchers trying to beat the best scores, and this is how the life cycle goes. But you might imagine that this does not necessarily mean you are getting the best kind of model, one that can generalize to new areas, because with a fixed training and test set these models can take shortcuts instead of really generalizing the idea. So here is one of the latest papers about incorporating this adversarial idea. Basically, as Professor Potts mentioned, it uses different rounds: there is round one, which creates a training and test set, and a state-of-the-art model is released; then, based on the weaknesses of that model, a new training and test set is created taking those weaknesses into account, and this goes on. It is used to challenge the model's capability and to increase its generalization capability.

Before we move on to the next slide, I would like to mention something. The question is: is this even easy for humans? For an NLI task, as you see, there are very simple texts, single-sentence texts, and very simple hypotheses based on these premises, and these judgments are being made by humans. You'll see that for the second example and the last example, for instance, even humans cannot agree on the correct label. So if this is the case for humans, then how are we going to approach this problem
with the machines themselves? Well, as we said, we will challenge the model with progressively harder tasks. In my project I took the dataset from the paper that I mentioned, and I applied several transformer-based state-of-the-art models. You'll see three different models, BERT, RoBERTa, and XLNet, each in two variants, one base and one large. For the outputs y1 through y6, you see the outputs per model, and for y7 there is a strategy for ensembling these models, to see if ensembling is going to be a cure for us.
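To make the ensembling step concrete, here is a minimal sketch of one common strategy, a simple majority vote over the per-model predictions; whether this matches the project's exact ensembling rule is an assumption, and ties here are broken by whichever label Counter happens to return first.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """predictions_per_model: a list of equal-length label lists, one per model (y1..y6)."""
    ensembled = []
    for labels_for_example in zip(*predictions_per_model):
        label, _count = Counter(labels_for_example).most_common(1)[0]
        ensembled.append(label)
    return ensembled

# Hypothetical usage: y7 = majority_vote([y1, y2, y3, y4, y5, y6])
```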

On the right, I wanted to give a feeling for the dataset. You see that this dataset contains very complex and long sentences; there are a lot of named entities, relationships, and references, and for the top three best-performing models, for these two examples, none of them could come up with the right answers. So again we see that the task is very hard.

And for the question I mentioned, let's see whether ensembling is the cure here. Well, these are the results for the models in isolation. We can look at just the F1 score, because it is one of the scores used a lot by the community for these problems, and even though 90-plus percent F1 scores have been achieved for SNLI and its variants, here we see that we couldn't even reach a 50 percent F1 score. And when we applied ensembling, yes, there is something there, but we could barely see 51 percent for the F1 score, which again is far from a reasonable kind of success. So here we see that just ensembling different models is not going to help for something that is so hard for the models in isolation; people use ensembling mostly to improve on something that the individual models already handle reasonably well.

The next reasonable question to ask, then, is why we are still so far away from a really nice solution. Here is a list. Yes, the model architecture can be an issue, but personally I don't think it is one of the most critical ones. The size of the training data is as important here as in other areas: if you have high quality in the training and test data, then you have a much better chance of creating a satisfactory model. But here, even creating the training data is very expensive, because, as we said, even humans have trouble agreeing on the relationship between the hypothesis and the premise, so it takes a lot of effort, time, and money. But yes, it is very important. The last one is very interesting: if you think of a child, that child's interaction with the environment plays a very important role, alongside reading text from books and trying to analyze it. The child is basically experimenting with the external world all the time, creating new hypotheses and testing them, creating another hypothesis and testing it again. Machines lack this ability; maybe we are trying something that cannot be done without these machines living among us.

Having said that, I would like to conclude with my experience of the project and the class. Yes, it is very demanding, but something should be demanding to give you better insight into the topic, and it should challenge you so that you feel the need to learn more, and that comes up in the rewarding part: you gain the discipline of analyzing papers, searching and comparing results, and then trying to reproduce those results or even go beyond them. And last but not least, the guiding part: it does not matter what kind of questions you have, there is a very strong community from Stanford helping. I didn't even see a single question that was not answered by those expert people, including Professor Potts.

And yeah, I'm really happy to be here; thanks for your time, and see you soon.

Thank you, Gokan, thank you for your time and for presenting your project to everybody. Now we can move on to the other project, which was developed by Mohan Rangarajan, Wu Pang, and Ethan Guin, so Mohan will now let you know a little bit more about it.

Thank you, Petra. Hi everyone, this is Mohan. It's a privilege to be presenting here on behalf of my team, Ethan and Wu, and we're certainly looking forward to this. This quote by Mahatma Gandhi actually captures the essence of how we approached both the course and the project: we had a learning mindset, and we said we are going to learn, whatever the cost. When we approached the project work, we wanted to do something in question answering, obviously, and using knowledge graphs. Like many people, and as Professor Potts mentioned earlier, we were enamored with BERT, transformers, the BERT variants, and the whole notion of contextual embeddings. We were curious to see how contextual embeddings would improve accuracy, and so our hypothesis was really about using a knowledge graph and seeing whether contextual embeddings would improve the accuracy.

You may ask: how did you go from this broad topic to a specific, focused hypothesis? Here we really have to credit the structured approach that Professor Potts mentioned: doing the literature review, then the experimental protocol, and then going on to the project. This really helped us narrow down to a specific, focused hypothesis. On the left-hand side, what you see is the broad area that we initially wanted to look at, and as we did the literature review, we realized that we had to narrow our focus. We sought guidance from our course facilitator, and based on that direction, and also looking at how much compute we had and the time available, we narrowed down to the natural-language back end of this topic. Even within that, we chose the SimpleQuestions dataset, and for the knowledge graph we chose an embedded approach to represent it, which was based on Freebase. I would be remiss if I didn't point out that we had complete freedom in choosing the topic and the hypothesis, so the outcome really was not a concern, because the evaluation was going to be on the methodology and the rigor we brought to it. That freed us from the pressure that comes with "oh, our hypothesis should actually improve the results," letting us focus more on the methodology and the results.

Here's our experiment. We had three fundamental tasks in the project: one was entity learning, another was predicate learning, and the third was entity detection. The entity learning and predicate learning models were primarily used for predicting the entity and the predicate in the simple question, and the entity detection model was used for collecting the set of tokens that would represent entity names. For the knowledge graph we used Freebase, and we used the embedded representation that was needed for the entity and the predicate.
The idea, really, is that we present the tokens from entity detection to the knowledge graph and retrieve a set of candidate facts, and then the tail entity associated with the best fact yields our possible answer. Since we were using embeddings, the closest fact, that is, the fact closest to the embedding representation of the predicted entity and predicate, gives the answer to the question. So that was, in essence, what our model and our experiment were about. Here is a little more detail on the model itself: the entity and predicate learning models were very similar, but the entity detection model was slightly different, in the sense that each token had to be assessed as to whether it would be a potential entity name or not.
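As a purely hypothetical sketch of that answer-selection step (all function names are placeholders, dot-product similarity is just one possible reading of "closest," and none of this is the team's actual code):

```python
import numpy as np

def answer_question(detected_tokens, predicted_entity_vec, predicted_predicate_vec, kg):
    # Retrieve candidate (head, relation, tail) facts whose head matches the detected tokens.
    candidates = retrieve_candidates(kg, detected_tokens)   # placeholder helper

    def score(fact):
        head, relation, _tail = fact
        # Compare predicted entity/predicate embeddings against the fact's embeddings;
        # entity_emb and predicate_emb are placeholder lookup functions.
        return (np.dot(predicted_entity_vec, entity_emb(head))
                + np.dot(predicted_predicate_vec, predicate_emb(relation)))

    best_fact = max(candidates, key=score)
    return best_fact[2]   # the tail entity of the best-scoring fact is the answer
```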

Looking at the results and analysis, we should say that we were quite pleased that there was a marginal improvement in the model we were using compared to the baseline model, but I use the word marginal because, as you can see, it was only that: a small improvement. Since we were using knowledge graphs, we also wanted to compare the results if we didn't use an embedded representation of the knowledge graph, and it was interesting to note that, for the individual tasks, that is, entity learning, predicate learning, and entity detection, the scores for using the knowledge graph directly, without embeddings, were better. But the interesting part is that when we did the evaluation on the test set, the embedding-based approach had better results than the approach without embeddings, and of course the marginal improvement in accuracy was higher compared to the baseline models we were using. The other interesting part is that, while we were pleased with the marginal improvement, we also noticed that the execution time associated with our models was much slower: our fastest variant was twice as slow as the original baseline model. So we were puzzled as to whether contextual embeddings really helped on this problem or not. We concluded that, since training is slow and the improvements in accuracy were marginal, it was less compelling to use a fine-tuned BERT model for simple question answering applications in the real world; a big generalization to make, but that was our conclusion based on the SimpleQuestions dataset. The other thing to note is that we chose accuracy as the measure because it was a simple question answering setting: either the answer is correct or it is incorrect.

So what did we learn from this exercise? I think it's important to understand the dataset you're using for your testing. When we started the project with our hypothesis, we felt we were going to get at least 90 percent accuracy, considering all the wonderful things BERT and the BERT variants have done, and then we were a bit deflated when we weren't performing as well. We did a little more research, only to realize that there is a cap of 83.4 percent accuracy as far as the SimpleQuestions dataset is concerned, because there is a high prevalence of unanswerable questions, and some questions don't have any ground truth in the knowledge graph. The other takeaway is that, apart from everything else, we now have a deep appreciation for the level of effort required to hypothesize, research, experiment, and author a paper that conforms to ACL standards.

I'd like to quote Isaac Newton here: "If I have seen further, it is by standing on the shoulders of giants." We really have a lot of people to thank: Xiao Huang and team, whose work served as a launching pad for our work here; Salman Mohammed and team, whose work we used as a baseline for comparing our model; the Hugging Face company, whose transformer-based models allowed us to compare different variants; and Professor Potts, thank you so much, we learned a ton of new things in this course.
Your active participation in the Slack channel and your enthusiasm were a welcome and pleasant surprise. Our course facilitator, Pradeep Cheema, and the other course facilitators who helped us were always there to encourage us in our work. And lastly, Steve, it would be remiss if I didn't mention you, for helping us throughout the course. Thank you so much.

Thank you, Mohan, thank you for presenting the project; I think it's very exciting. Thank you also for joining us, Ethan and Wu, who are here with us but not visible at this moment. I think we can move on to the Q&A session. We got some interesting questions from the audience, so thank you everybody for your questions. I will now ask Chris the first question: are the adversarial examples generated exclusively by humans, or by computational models such as generative adversarial networks?

Yeah, that's a great question. You really see a full spectrum of approaches. In some cases humans have just written new adversarial cases, as you saw with Gokan's project and the Adversarial NLI dataset. Sometimes we can do quasi-automatic things, like with WordNet, where we do some lexical substitutions and can assume that the meaning we want is preserved or changed in a systematic way. But you can also have models in the loop acting as adversaries. There have been some applications of generative adversarial networks in NLU to make models more robust; the picture seems more mixed than what you get from vision, where I know GANs have really been a powerful force for good in making models more robust. So I think there's some space for innovation there, but the general picture would be that we can think really flexibly and creatively about how to create those adversarial tests, and creating one can have its own modeling interest, in addition to serving as a new way to evaluate models. So, lots of space for innovation there.

Okay, great, I hope that answered the question. The next question is about the ELECTRA-based transformer: is an ELECTRA-based transformer more robust to adversarial examples compared to MLM-based transformers such as BERT?

Oh, interesting open question. ELECTRA's primary motivation, I think, is to make more efficient use of its data than BERT does. We have a nice little lecture on ELECTRA, how it works and why it's successful, but in the paper, if I remember correctly, there isn't an evaluation that you would call adversarial; they mostly post better numbers on the standard datasets and explore a really wide range of variations on the ELECTRA model, both in how it deals with data and in how it's structured. But again, I love that question; it's just so interesting to ask, for a model that seems to be a step forward not only in terms of accuracy but also in terms of efficient use of data and compute resources: what is it doing on these very human but ultimately very challenging adversarial datasets? A great question to address, and it's especially fruitful if you can address it and then think about how the answer could inform an improvement to a model like ELECTRA, because then you have that full cycle of the adversary helping us do innovative things with the models we're building.

Okay, great. The next question: it seems that there is no adversarial training in this picture, only adversarial testing; would adversarial training fit into this paradigm for NLU, and how would it work?

For sure, yeah. So actually, Adversarial NLI, the dataset that Gokan talked about, is large enough that you can use it for training in addition to assessment, and there are a few other datasets like that. Very few of them were created with the kind of full human-in-the-loop process you saw with Adversarial NLI; some of them use more automatic, model-based means of creating datasets that are large enough for training. But that's on the horizon, and in fact one of the visionary statements the Adversarial NLI paper makes is that we should move into a mode of continually retraining and evaluating our models on datasets that were created adversarially, and that that ongoing process, which is a more fundamental change to how we do system development, is a way to get even more robust systems. So my quick answer would be: we should be looking for ways to scale the adversarial testing paradigm so that we can have training sets as well.
So, what is the current state of research and applications of deep RL to NLU, apart from transformers? If there is research on it, how promising do you find it?

Interesting question. Overall, I would say the vision of reinforcement learning really resonates with me. If you think about your life as an agent in the world, trying to learn things and experience the environment, you don't get direct reward signals; you only get very indirect feedback, and you have credit assignment problems, you don't know how to update your own parameters, and so forth. So you have to make a lot of guesses, and it's a kind of chaotic process, but nonetheless that's the world we live in, and we all learn effectively. So something like that set of techniques has to be brought more fully into the field. I think we're seeing really exciting work in the area of combining deep RL with grounded language understanding, and certainly with dialogue, and I'm sure there are other areas.

So it's a great space to explore. The models tend to be hard to optimize and hard to understand, but that's part of the journey toward making them really effective for the problems we look at. I'll give a quick plug: I did some work that I thought was really exciting that tried to use reinforcement learning together with transformer-like models to induce more modularity, so that the systems we developed, instead of having very diffuse solutions, would do things that looked more like encapsulating lexical capabilities and specific functionality. Those are called recursive routing networks. Again, hard to tune, but an inspiring idea, and we should keep pushing those techniques.

Maybe one very general question: where do you see the future of NLU? What will it look like in the next few years?

Oh, five years is starting to seem like a long time; after all, my short lecture showed you that in just a two-year span we had what looked like a real phase change on some hard problems, so five years feels like an eternity. But here are some predictions I could make. The idea behind contextual word representations, which is primarily how transformer-based architectures are used, is powerful and is going to last; it really has changed things, and I think the vision there is that more chances to have contextual understanding, more chances to be embedded in a context, are going to be important. Grounded language problems are going to be more prominent, and I think that's going to lead to breakthroughs, because after all, human learners don't learn just from text; that's kind of an absurd idea. Human learners learn from the social environment, from lots of inputs they get in addition to language input, so to the extent that our systems can be multimodal like that, they're probably going to get better. I hope, as a personal thing, that we as a field do more to break free of the confines of always looking at English. I saw briefly in the Q&A that there was a question about whether the field just looks at English; the big datasets tend to be in English, but I think that's changing a little bit. For example, for the NLI problem we now have some good multilingual datasets, and that's important, because English is not representative of the world's languages, and if we look farther afield, that might lead us to new kinds of models that would count, again, as fundamental breakthroughs. So that's going to be important. And then, now that our systems are more useful and more often deployed, we should all be thinking much more holistically: not just "my system did well in some narrow evaluation," but what is it actually doing out there in the world when it interacts with real examples and real users? The fundamental thing should be assessing and making sure that those systems are having a positive impact, as opposed to causing some social process to go awry or something like that. That's the kind of new responsibility that comes with the recent successes; a welcome challenge, but an important one.

Okay, great, thank you. We're actually almost at time, so I would like to spend the last minute thanking you for your time and for taking part in this webinar, and also Gokan, Ethan, Mohan, and also Wu, who joined us today. I think it was really interesting and exciting that they joined us and talked about their projects.
They brought the course materials to life. So thank you all for joining us; I hope you participants enjoyed it, and if you are interested in learning more about the courses in the AI Professional Certificate program, feel free to check the links in the interface, or please feel free to contact us directly. Thank you, everybody, and stay safe and healthy.

Thanks, Petra. And thanks to the project teams; those presentations were really great, and I found them really inspiring.