Mining Input Grammars for Security Testing

>> Good morning, everyone. Thank you so much for coming. My name is Tom Zimmermann, I'm a researcher in the RiSE Group at MSR, and it's my pleasure to introduce Andreas Zeller, who is a full professor at Saarland University in Germany. He was actually my PhD adviser, so I'm really looking forward to the talk. Andreas is one of the top people in software engineering research. He is an ACM Fellow. He won several Most Influential Paper Awards for his work on mining software repositories and automated debugging. In recent years, he has done much more on security, and we are looking forward to your talk.
>> Thank you, Tom. Thank you very much. I was making jokes because I became an ACM Fellow for our work on mining software repositories, among others. OK. Yes, that's great. It's always nice to come back to the Seattle area. I just noted that it's now 18 years since I came here the first time, and it's amazing how much things have changed since then. I was strolling through the city yesterday and I found that the Alaskan Way Viaduct is now ripe for demolition, which people were talking about 18 years ago. It's nice to see the new 520 bridge now in operation, which people were talking about 18 years ago.
>> The hardware is pretty much the software.
>> And I understand that in only seven years, there will be an extension to Microsoft. We will see, we'll see. Before I retire, I will see all the beautiful things that people have been talking about all these years.
>> You must plan the work.
>> The place where I come from: so this is Germany over here, and this little pixel, two-by-two-pixel thing here, this is the state of Saarland, and this is the city of Saarbrücken. Well, I also arrived there 16 years ago, and we've had no new bridges or, no, new... oh, we got a new light rail. Yes, but it has two new stations, too, very much like the one over here. But things don't change that much there. It's fine and it's good, because things are good as they are. Although there is one particular point
where things do change, and if you haven't been to Saarbrücken for some time: this is our newest building in computer science. This is, oh, I think it should be our tenth building or so that we're getting there. This is an all-new center for IT security, and earlier this year it has been announced that this center is going to be expanded massively. Right now, there's about 100 researchers in there. It's going to expand to 500 researchers in the course of the next 10 years. Meaning that, right now, we're having one application talk every week, and it's going to remain that way for the next 10 years, which is fun. If you have any friend who, for some reason, doesn't like to go to Microsoft Research, which is also what I recommend to my students, and who likes to work in security, ask them to check us out. It's a very dynamic place to be in. What I'm doing there is, among others, the work that I'm presenting today. This is "Mining Input Grammars for Security Testing". Although, well, this is not the best title for this talk, as I realized, unfortunately, only yesterday, because a much better title for this talk would be something like "Making $50,000 a Month by Fuzzing Software". And now that I have your attention (money is a great motivator), let me tell you how we're doing this. Fuzzing, as most of you know, I very much suppose, is the idea of feeding programs with random input at the system level. This is Bart Miller, who invented the term and the technique in 1989, and the paper today is a classic, "An Empirical Study of the Reliability of UNIX Utilities", where Barton and his students went along, created random inputs, fed them into the Unix utilities of the time, and saw them crashing and hanging. Actually, a very wide number of them, 25 to 33 percent, crashed and hung, because they had never been subject to such inputs before. This is a typical output of a fuzzer. It's very easy to build a fuzzer. Actually, that's five lines of Python code,
which will happily produce inputs like these. I'm sure there is a programming language where this is actually a valid program.
>> It's called shots.
>> Yeah, not exactly sure.

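A fuzzer in the spirit of the five-line Python generator mentioned above might look like this. This is a minimal sketch, not Miller's original code, and the function and parameter names are my own:

```python
import random

def fuzzer(max_length=100, char_start=32, char_range=96):
    """Return a string of random printable ASCII characters,
    much like the inputs fed to the UNIX utilities in the 1989 study."""
    length = random.randrange(0, max_length + 1)
    return "".join(chr(random.randrange(char_start, char_start + char_range))
                   for _ in range(length))

print(fuzzer())
```

Running it a few times produces exactly the kind of long, random lines that overrun buffers in naive parsers.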
Maybe we can fit this into a Perl interpreter of sorts or something else, but the thing is, there will be lines in there which simply are very long, so programs are prone to buffer overflows. Anything that parses things will check, say, for strings, for a beginning of a string, an end of a string, so there are many ways that programs can fail simply by scanning this input. And at the time, yes, plenty of different utilities on the various Unix flavors of the time, on Sun, on Solaris, on IBM, and so on, all happily failed. Actually, the nice exception to the rule was the GNU tools, which already existed at the time. They turned out to be the most robust ones, and this is also the base for the tools we have today on Linux. Since then, of course, 1989 has been a long time, and anyone now knows how to fuzz a program. The reliability has much improved. Today, it's actually pretty hard to get any of these systems to crash if you use the same technique as Bart Miller did in 1989, because the parsers that are written today are much more robust than the parsers we had in 1989, when people only began to imagine that there could be some weird input fed into one's program. The problem is that a well-written parser will happily reject all invalid input as invalid, and you will never ever reach the code that actually matters to you, namely the runtime system, the deeper layers beyond the parser, where you would like to find bugs, or where you'd like to find successes. However, if you do have a syntactically valid input, that is, an input that is actually processed by the parser, then you have a chance to discover bugs in those later stages that then process the input as parsed by the parser. In order to get such a syntactically valid input, what we need is a description of what makes a valid input. This is called the input language of the program, and we all know there's language theory for that, and what we can do is, we
can create a grammar that describes the input structure, the syntactic rules of the input, and using this grammar, we can then produce a syntactically correct input. This is something we did, a student of mine, in 2012; that's five years ago now. My student, Christian Holler, built this whole thing for JavaScript. He built a JavaScript grammar, which actually was quite some work to write down. Using this JavaScript grammar, which he created, his tool would then happily create inputs that are indeed valid JavaScript and feed these to JavaScript interpreters such as the one that is found in Firefox or in Chrome. This input, as you see it over here, was one of the first inputs that actually crashed the JavaScript interpreter, which actually is a serious thing, because if you can manage to crash the JavaScript interpreter, this typically is the base for a later exploit. If you have an exploit in the JavaScript interpreter, that's very serious, because this means that you can take over the entire browser and then collect arbitrary data from the user. Browsers today are like operating systems, so you get full control of everything. The interesting thing about the work of Christian, this fuzzer, was that not only would it produce arbitrary random inputs; you could also use the very same grammar to parse existing JavaScript programs, of which there were many. And to top it all off, he used his parser to parse existing JavaScript programs from bug reports which were known to have caused trouble before, and then to recombine them. So in a way, he was looking at reports of, say, the measles and cholera, parsed these two, recombined them, created new strains, and fed all of these to the interpreter in the hope of triggering new bugs, which was lots of fun. It also was very successful.

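The grammar-based production step described here can be sketched in a few lines of Python. The toy arithmetic grammar below is purely illustrative (Christian's actual JavaScript grammar was far larger), and the depth bound is my own assumption to keep the random expansion finite:

```python
import random
import re

# A toy grammar: nonterminals map to lists of alternative expansions.
EXPR_GRAMMAR = {
    "<expr>": ["<term> + <expr>", "<term> - <expr>", "<term>"],
    "<term>": ["<factor> * <term>", "<factor>"],
    "<factor>": ["(<expr>)", "<digit>"],
    "<digit>": [str(d) for d in range(10)],
}

NONTERMINAL = re.compile(r"<[^<> ]+>")

def produce(grammar, symbol="<expr>", depth=0, max_depth=10):
    """Expand `symbol` by replacing nonterminals with random alternatives.
    Past max_depth, always pick the shortest alternative, which steers
    the expansion toward terminals and forces termination."""
    if symbol not in grammar:
        return symbol  # terminal symbol: emit as-is
    alternatives = grammar[symbol]
    if depth >= max_depth:
        expansion = min(alternatives, key=len)
    else:
        expansion = random.choice(alternatives)
    return NONTERMINAL.sub(
        lambda m: produce(grammar, m.group(0), depth + 1, max_depth),
        expansion)

print(produce(EXPR_GRAMMAR))
```

Each call yields a fresh, syntactically valid expression; swapping in a JavaScript grammar would yield valid JavaScript fragments instead.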
Over the course of nine months, he got 18 Chromium Security Awards and 12 Mozilla Bug Bounty Awards, from Google and Mozilla respectively. More specifically, he gained $50,000 in bug bounties in the first four weeks of running his tool, running it on a single machine, on a laptop. So we're not even talking about a server farm or anything. And this is him, this is Christian. After these four weeks, so you get to the top of all the security leaderboards and everything, so that was fun, he got hired away by Mozilla at a quite nice salary. It's not $50,000 a month, but it's steady, it's fun. And his job description was simply to continue to run his tool. That was fun. So he's been running his tool for five years now; he's perfecting it. One year ago, when I last asked him, his tool had found 4,000 bugs in the JavaScript interpreters of Firefox and Chrome, and all of these bugs are confirmed by developers and fixed. So the whole shebang, it's all there. By the way, Mozilla gave him the authorization to still run his tool on Chrome, and he still gets bug bounties once in a while as a side business. So he gets a steady income from Mozilla and he gets the bug bounties from Chrome on the side, so he's perfectly at ease.
>> Trigger buffer overruns?
>> Yes, this is buffer overruns, this is all sorts of ways to cause the interpreter to hang or to crash. So it's buffer overruns, it's bugs in the optimization stages, the just-in-time compiler, everything. And as I said, the largest success comes from the fact that his tool is able to parse existing bug reports and existing inputs and recombine them. So not only are his inputs not completely random, there actually is a stronger resemblance to existing code; that's one thing. And the second thing is that if you search in the vicinity of existing bugs, you're likely to find more bugs, because as we know, bug fixes aren't always perfect. Sometimes they treat more the symptom than the cause, and if you search in the vicinity, you're going to find more. This is what made his tool particularly successful. When a taxi driver asks me, so what do you do for a living? I'm a professor. So you're living off my taxpayer money and everything; have you already done something where you can pay things back? Okay. You at Microsoft get asked this all the time, but for me this is rather weird. Then I can ask this taxi driver: so, which browser are you using?
Oh, I'm using Firefox. And I can go: yes, you know, the fact that you can use Firefox and it's all safe is, to quite some extent actually, due to the work with my student at the time. And then he's very happy and everything is settled. So this is work from already a couple of years ago. It was very nice, we loved it, and we also had a nice paper out of it. So in principle, one could do this for every program, but the problem is that, well, you need a grammar in the first place, and creating such a grammar for a non-trivial input, and most inputs are non-trivial, is actually quite some work. For Christian, it took three months to create this JavaScript grammar. Actually, if you want to have a real working grammar for C programs, for instance, just in order to do some refactoring work or the like, such that you have a good parser for it, that's actually something quite expensive. I think you have to shell out about $10,000 for a single license for a good C grammar, for instance. Yes?
>> But in some sense, you don't need a really good grammar, you just need a good enough grammar. So you could write a grammar that is much more permissive than it needs to be and rely on a real parser to allow that [inaudible].
>> To some extent, yes. If you want to have a grammar that simply produces lots of JavaScript code, you can deal with a grammar that only covers a subset of JavaScript, for instance.

But if you want the very same grammar to parse the existing inputs and to recombine them, it really needs to go down to the tiniest details in order to be able to process all of them.
>> You could always use that real parser to filter out-
>> That's right, that's right. So there are two things in here. Your grammar may over-approximate, that is, it may produce inputs that are actually not valid JavaScript; in this particular setting of fuzzing, well, you're going to live with whatever 20%, 50% of inputs that will simply be rejected, so that's all there is. You're just wasting a bit of time. But if you have a grammar that under-approximates, you're going to miss out on a couple of features, which will not be tested, and then you will miss all the bugs that are in there. So that's the downside of the whole thing. So there are these two sides. But then, we're talking about context-free grammars here anyway, and JavaScript has plenty of context-sensitive features, as any programming language does. Actually, Christian also did a couple of tricks in order to deal with identifiers and others in order to have the parser accept more input. So, what we started to reason about at the time was the idea of whether it would be possible to learn such grammars, given not just a set of samples, there are a couple of works that do that using machine learning, but actually using the program at hand that processes those inputs, in order to learn how its inputs are structured. And the way we built it is as follows. We have a set of inputs, which can be a very small set; actually, we're aiming to get this set to be the empty set. We feed these inputs to a program. We observe how the program processes these inputs; we do this dynamically. We have a tool called autogram, which observes how the program processes those inputs, and out of these observations it produces a grammar. This grammar can then be fed into a producer, which then happily will create more inputs for the program, thus creating, in the end, a virtuous circle
where you learn more and more about the grammar and the program itself. So how does this work? Think of a program that processes a very simple input, let's say, URLs. We all know what URLs are made of, and the idea is as follows. A URL is made out of different parts. We have a protocol. We have the hostname. We have a port. We have a login; people, you should not use logins in URLs because they're transmitted in the clear, but, well, they provide this. And there's also a path. So these are standard elements, and we have a couple of separators in there. And all of these elements, all of these individual fragments, go through different paths in the execution while they are being processed by the program that processes a URL. The protocol, for instance, is being used to determine how the rest is going to be parsed. The hostname is going to end up in some DNS resolver function that is going to return us an IP address, to which we are going to connect using the given port. And then, over that connection, we're going to send out the login as well as the page request. So all of these little fragments go through different paths through the execution, and they end up in different functions. And this idea, that there is a natural separation of concerns in the program execution that corresponds to the structure in the input, is what we want to exploit. Different functions and different variables for each fragment: this is the basic idea by which we want to extract the structure of the grammar. So, how can we do that? Essentially, we have explored two ways: one very thorough one and one very simple one. The first technique that we looked at is to use dynamic tainting, which, dynamically during execution, attaches a label, a tag, to individual data elements as they're being processed. And this tag propagates along the computation.

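The character-level tainting idea can be illustrated with a toy tagged-string class. This is a sketch of the concept only, not the actual autogram implementation (which works on Java), and note that converting a tainted string to a number would lose the tag in this simplified version:

```python
class Tainted(str):
    """A string that remembers the input position each character came from.
    Slicing and concatenation propagate these origin tags, a toy version
    of the character-level dynamic tainting described in the talk."""

    def __new__(cls, s, origin=None):
        obj = super().__new__(cls, s)
        # One origin entry per character; None means "not from the input".
        obj.origin = list(origin) if origin is not None else list(range(len(s)))
        return obj

    def __getitem__(self, key):
        if isinstance(key, slice):
            return Tainted(str.__getitem__(self, key), self.origin[key])
        return Tainted(str.__getitem__(self, key), [self.origin[key]])

    def __add__(self, other):
        other_origin = getattr(other, "origin", [None] * len(other))
        return Tainted(str(self) + str(other), self.origin + other_origin)

url = Tainted("http://www.example.com:80/")
port = url[23:25]                 # the characters "80"
print(str(port), port.origin)     # origins [23, 24]: we know where they came from
```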
This tag, in our case, labels all the inputs with their origin, that is, the input source and the position in the input. So, for instance, we can track where the port goes to, and even if the port gets converted to a number, the number inherits the tag of the string, and the port then carries that tag over to the function, so we can track all of these. We can track all of these inputs down. The second technique is much simpler. This simply consists of taking a memory dump at a particular point in time and looking, for all string variables, whether any of them contain some characters that also happen to occur in the original input. This is much simpler, much cruder; there's also room for error in there, but it turns out that the latter is something one can actually implement much more easily than the first one. And I had some fun a couple of weeks ago implementing the second technique in Python. In Python, accessing memory during execution is a very simple exercise. You have an API call named settrace, and you pass it a function, here called traceit. This traceit function is going to be called for every single line that is executed, and for every line you get passed the current frame and the current event, meaning that a new line has been executed, as well as other information. And in the current frame you can then look up all the local variables and all of their values as Python elements. It's very simple. So actually, writing a debugger in Python is 12 lines of code. Okay?
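The settrace-based tracer just described can be sketched as follows. The parse function here is a hand-written stand-in for urlparse, and the two-character threshold follows the heuristic from the talk:

```python
import sys

the_input = "http://www.example.com:80/path"
matches = {}   # variable name -> value, for string values found in the input

def traceit(frame, event, arg):
    """Called for every traced event; on each executed line, record local
    string variables (two or more characters) whose value occurs in the input."""
    if event == "line":
        for name, value in frame.f_locals.items():
            if isinstance(value, str) and len(value) >= 2 and value in the_input:
                matches[name] = value
    return traceit

def parse(url):
    # Stand-in for urlparse: split a URL into scheme, netloc, and path.
    scheme, rest = url.split("://", 1)
    netloc, _, path = rest.partition("/")
    return scheme, netloc, "/" + path

sys.settrace(traceit)
parse(the_input)
sys.settrace(None)
print(matches)   # includes the scheme, netloc, and path fragments of the input
```

Simply observing the local variables recovers the lexical items of the input, without any knowledge of what parse does internally.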
So, compare that to Java or C; there, it's no fun at all. In Python, dynamic analysis is a breeze. And what we do is: we access all the local variables, we access all of their values, and then we simply check, for each value, if it's a string, if it has two characters or more, and if this value also occurs in the input that we've seen; then we simply store the variable and the value that we have just seen. So we just track these. And let me show you how this works in real code. So here's the very same code again. I have a function, which is urlparse. I have a couple of inputs which are passed to it. And way down here, I have precisely the function that we have just seen. So I'm using this trace function thing. Where is it again? One second. Here we go. So this is the settrace function. I am producing some output, and for every value, I'm saving it, accessing the individual variables, and saving those that also occur in the input. Now, if I run this on urlparse, this is precisely what I get. I get that during urlparse, a function which I have not implemented and which I have never looked into, there's a variable called url which holds the path; this is one part of the input. There is a variable named fragment, which, lo and behold, is the fragment. There is a variable called scheme, which happens to be the scheme. There is a variable called netloc, which holds the host and the port, which is precisely what we have over here. And there's a path, which, very much like the url variable, also holds this very path over here. And I can now go and also apply this to other inputs, just for the fun of it, where I will, if you let me, make this the first input. So here I am, happily going along. So here's another URL, and here we also have a scheme, which now is https, and we have a netloc, again a location, which also occurs in here. So you can see that simply by looking at the variables, I
can already extract the individual lexical items as they occur in the input. So from this, I can now go and use the knowledge that the input is decomposed into these variables to derive a grammar, namely in two steps. I start with a very simple grammar that simply says, "My grammar consists of precisely this input." This is very trivial; anybody can write a grammar like this.

Then you create a disjunction over multiple inputs, and it's all very simple. That's not how you do it, okay. Well, this way, grammar inference can be a very easy thing, and it's actually 100% sound, so it's cool: every string in that grammar is a valid input. And then what I do is, I go and, for each and every variable I see, I introduce a rule by replacing all the occurrences of that value by a non-terminal, and then I say this non-terminal expands to the appropriate value. Let's try that out with the values that we've just seen. So here's my initial grammar. Here we go. Now I apply the second rule: I go through all of these variables and introduce appropriate new rules. Let's start with netloc, for instance. netloc, this is this one. So I'm going to replace this string, as it occurs up here, by the netloc non-terminal. This is what I get. So now I have this expansion to netloc, and I see what netloc is in here. I can do the same for the scheme, and I can repeat the same thing for the path, for the fragment, as well as, finally, for this url variable, which doesn't really belong here, I guess; it's some sort of misnomer, but that's what you get. And if I do this again, then I see that the path is in fact a url; under url, there's path. All these names are well chosen, except for the url one. I have no idea what the folks who wrote urlparse thought of when they gave this variable this name. But what you can see here now is how we have decomposed this input into a nice grammar: our URL starts with a scheme, followed by a location, followed by a path, followed by a fragment. Now, this is just for one single input, okay? And of course, I want to extend this to multiple inputs. Yes?
>> [inaudible] for example, if the user had capitalized all the URLs?
>> Yes, yes, yes. I'll come to that in a second. Wait a moment. We'll definitely expand this whole thing, okay?
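The two-step derivation just described, start with a trivial grammar, then abstract each observed variable value into a non-terminal, can be sketched like this. The observations dictionary stands in for what the tracer recovered, and replacing longer values first is my own tie-breaking assumption:

```python
def mine_grammar(the_input, observations):
    """Two-step grammar derivation: start with <start> ::= the input itself,
    then, for each observed (variable, value) pair, replace occurrences of
    the value in all rules with a new non-terminal named after the variable."""
    grammar = {"<start>": [the_input]}
    # Assumption: abstract longer values first, so the whole netloc is
    # replaced before any shorter value contained inside it.
    for var, value in sorted(observations.items(), key=lambda kv: -len(kv[1])):
        nonterminal = "<%s>" % var
        for symbol in list(grammar):
            grammar[symbol] = [e.replace(value, nonterminal)
                               for e in grammar[symbol]]
        grammar[nonterminal] = [value]
    return grammar

# Illustrative observations, as the tracer might recover them from urlparse.
observations = {
    "scheme": "http",
    "netloc": "www.example.com:80",
    "path": "/path",
    "fragment": "ref",
}
grammar = mine_grammar("http://www.example.com:80/path#ref", observations)
for symbol, expansions in grammar.items():
    print(symbol, "::=", " | ".join(expansions))
```

The resulting start rule decomposes the input into scheme, location, path, and fragment, exactly the structure described above.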
So this is just for one single input, and this is just one decomposition, the one decomposition we've seen for that input. Now, the idea is that I can go and do this for multiple inputs. And for multiple inputs, what I'm going to... actually, I have implemented this whole thing, so here we go. And if we implement this and if we run this, this is precisely the grammar that we get. So this is just live here in Python. Note that this is a fully self-contained Python program, and we are at like 200 lines here. It took me four hours to design it, to build it, and to quality-control it. So that was something that was not very hard to do, and this was very annoying for my student, who had spent a year building this whole thing for Java, but his work is much more sophisticated than the very crude thing I'm having over here. Now, the interesting thing is that I can repeat this for multiple inputs, and I can easily merge these grammars together. And since I have the structure, I can merge them by rule. So for every single rule that I have, I'm simply going to add a disjunction for the next value that I see, which represents the individual values that the respective variable can take. So if I do that, and I am going to spare you the code for that, it's not very nice code, I'm simply going to run it. What I have done here is I have taken three inputs with different schemes, for instance. And as you see, here is the disjunction: it's HTTP or HTTPS. And the fragment variable: one is ref and one is ref2. And the host, of course, can take different values; and the query, well, there's just one value here, from one of the inputs. So we have a path that has taken the values /path and /bar. And over here we see the various schemes that there are. So we see that you can have something with a fragment, you can have something without a fragment, you can have something with a query and a fragment. So these are different expansions, and
if I would add some syntactic sugar, I could go and say that the query in here and the fragment are actually both optional. Now, if I have such a grammar, and indeed I can start with this very grammar, I can immediately use it for producing various combinations of the little constituents. And then I can create arbitrary combinations of hosts, and paths, and references, and, for some reason, no query in here. Good. I'm going to... where's the one with the query? Oh yes, it hasn't expanded. Let's try that again. Do we see a query? No query. How come? Good. Here we go. Again, another combination. So every time, I can roll the dice and use an arbitrary expansion. Writing a producer for a given grammar, again, is eight lines of code. Not very complicated. Okay. And of course, the more elaborate my grammar gets, the more alternatives I have, and the more possibilities for combination follow. So as I said, this is a very simple program. We use these programs for teaching. I'm having a course on security testing, the next one starts mid-October, and we are indeed showing folks how to create such grammars and how to mine such grammars. This is the very program we're giving out to students such that they can then try it out and see how it works. Now, this is just a very simple version, okay? There are still lots and lots of pieces missing, but it is clear that there are ways to further expand this, and we now get to our real implementation, the one that my student took a year for. This is now for a real programming language, so we are working with Java. I never try to submit a paper where I am saying, this works beautifully on these 20 lines of this, or these 10 lines of Python. For some reason, it has to be this big Java program or this big C program, so I am always fearing that if I submit something that works on Python, people will not take it seriously. So, we built this whole thing for Java, and it does full grammar inference on Java programs. This is actually the first technique ever that is able to mine input grammars from programs. What we did have before was mining such grammars from samples, which is very nice. Google has implemented quite a number of machine learning techniques that tell you that if you have this token and this other token, then typically this next token is going to follow, and you can create
valid documents with that. The problem, though, is that this is only going to give you documents that look realistic; it is not going to give you documents that explore all the other possibilities that are in the program.
>> Which paper, if Google does that?
>> Google does that, yes.
>> Which paper is that?
>> I don't know whether they have a paper, but I spoke to Domagoj Babic at Google, I was not under NDA, and he told me that.
>> [inaudible]
>> And there is work with Patrice Godefroid, yes, machine learning for [inaudible], yes, sure. So you have the same paper. I think they never bothered to write a paper about it. And there are also people at Stanford, with work that appeared this year at PLDI, who have a great technique for inferring grammars using some active learning on the program. But our idea is to have just the program as a reference and extract the grammar from it: no samples, nothing, as little as possible. Our Java implementation uses dynamic tainting, so we don't look at memory dumps; we simply follow the characters. We also do some active learning to check for repetitions and optional parts. That is, for each and every grammar element, we actually go and find out what happens if we leave it out. If this still results in a valid input, if the program under test still accepts it, we know it is optional. For every element, we also try repetitions, and if the program accepts multiple repetitions, we know we can repeat it. That's fine too, and very nice too. We do have a number of predefined lexical items, such as numbers and others, which we identify; these are regular expressions, and if we find that some input matches a regular expression, say a number, we try out more numbers, and if we find that more numbers are also accepted, we know that this is a number. We also have a hierarchy of these expressions: we have natural numbers, and next coming up is negative numbers. This is how we find out that the port number, for instance, can only be a natural number but cannot be a negative number. I have no idea what happens if you pass a negative port number. I remember that if you kill a process on Unix and give it a negative process ID, the entire group is going to collapse, but no idea what happens if you open a negative port. Anyway, we now know, from our work, that ports cannot be negative in URLs. I am going to give you a couple of examples of what this actually looks like. So, this is again a URL example, but now with our autogram prototype applied on the Java net.URL parser, and this is the grammar that pops out of it. This is the raw output of our tool. Just as in the Python example that I have shown you before, we make use of actual variable names and function names in order to make this more readable. And here you can see: what is a URL? A URL starts with a spec, and there is a string in the beginning, and this string is a protocol, followed by an authority, as well as by an optional path. The authority consists of an optional login, at a host; the host is mandatory, followed by an optional port. The protocol in our example here is always HTTP. The user info is as given, but the host is a hostname. The port is a sequence of digits. The path is actually a path. The query and references are all alphanumeric characters with special characters in there. And this is what we learn from this one single example only. So, simply by looking at this one single example, which has a couple of features, and generalizing over it, finding out which parts are optional, finding out which parts can be repeated, we get all the generalizations that we get in this mined grammar.
>> The protocols that are not [inaudible], have you done some kind of checking of options?
>> The protocol is something we cannot generalize, because our tool has no knowledge about HTTP protocols. If it had knowledge about protocols, then I think it would be cheating at some point. We can generalize over individual protocols; what we can find out is that we cannot generalize the protocol to an arbitrary string, because arbitrary strings would not be accepted. But we cannot find out that other protocols, say like HTTPS or file or something, would be accepted. We would need extra inputs for that. The reason for that is simple. The way the Java net.URL parser works is, for a given protocol, if you pass that in, it is going to look whether there is any class in the class loader that matches the appropriate name, and if there is such a class, it is going to load it and then execute it. Well, we cannot come up with a neat and universal mechanism that at some point would search the file system for Java classes that match specific names. So that is not something we can find out; we do not know this. But the other parts, yes: the path, the query, the reference, and all of this can all be systematically expanded, such that we know which parts are actually part of the grammar. You had a question? Not now? Yes.
>> When you are doing that taint tracking?
You’re now looking at the path of a thing, like the control flow of a thing >>We’re not looking at the control flow, we’re only looking at the data flow Thing is, if we look at the control flow, if we look at control – It’s implicit information flow, that would be the correct term Then, we would very quickly get, and find out that everything depends on everything, and so we only look at the data flow This has a limitation For instance, this technique works because the URL parser is a handwritten parser If we apply this technique say on the table driven parser, such as a generator parser, we would not be able to find out that there is information flow along all this part or on all these table elements because all of these are governed by implicit information flow Generally, if you have something like switch and a character, and then in case of that character do that, in case of the other character do that, you have a controlled dependency and this is something where we still have to tune which kinds of control dependencies we want to look at and which others we don’t want to look at But there is a couple of expansions, which goes into that direction On the other hand, If you do have a table-driven parser, chances are that you do have a computer readable grammar in the first place because that’s where these table-driven parsers are typically generated from I can give you another example This is actually a Microsoft one This is INI files. This is a Java parser for INI files, and here we can see, so INI files That’s key value pairs with an equal sign sections in square brackets

The essence is in here You have lines that consist of sections, or of a key followed by an arbitrary number of blanks, followed by a space, followed by an optional value The key is not optional, the value is; I didn’t know that you can have something without values And all of this is in here, and these lines then are separated by newlines, and they are all combined into one single grammar You can see that this is a bit convoluted You can see here there’s a first line that is parsed first, so there’s a first section, and then we have this section down here again That’s because of the way this INI file parser is written, and I should say it’s not very well written, it looks like a crude scanner, but then these individual elements that are in the implementation are also reflected in our grammar But still, with a bit of manual cleanup, this becomes a very beautiful grammar >> I would have expected you to generate things like, it’s either line 1, line 2, line 3, line 4 or line 1, line 2, line 2, line 3, so you have a way to pick a sequence of lines and generalize it to a line star >>Yes, this is what we do too This is repetition recognition, and we can find out that the input is valid if we have no lines, if we have one line, and if we have multiple lines, and this is how we can then generalize that there is a, is it a Kleene star? I think it’s a Kleene star So we can find out that the elements can be repeated an arbitrary number of times Likewise here for the spaces and others This is my all-time favorite This is for JSON This is, again, one single input, so one single sample suffices This is a constructed example where we actually look for various ways of number formatting, because these make up different parts in the parser Again, we feed this into our Autogram tool, and this is the grammar that comes out of it, and I’m very much in love with that grammar because it’s almost textbook quality You see, what is a JSON file?
The JSON file is a value, and a value is either a string, or false, or true, or an object, an array, a null, or a number A string is something called a hash, which is a bad name but I’m happy with it, something enclosed in double quotes False and true are false and true An object is a string, a key-value pair followed by more optional key-value pairs, as many as needed, separated by commas and with a closing brace at the end An array is similar, except that we’re using square brackets here Null is a null A number is an integer followed by a possible fraction and exponent All of this is in there Again, this is what the tool produces from one single input, okay? There are so many things you can do with that You can use this immediately for understanding what the input is made of Because we all know what JSON is, we can see that this grammar is valid, but now imagine you have some unknown file format and this is what pops out Or you can immediately use this for fuzzing, because you can immediately create tons and tons of valid JSON inputs You can use this to compare multiple versions of a program against each other You take one JSON parser, take another JSON parser, and you find out what the difference between the two grammars is So it gives you plenty of insights about what the input looks like and what you can do with it You have a question >>Yes.
The terminals or non-terminals, are those pulled from like identifiers within the- >>All pulled from identifiers >>Mostly, there’s some alphanum with white space or special white spaces >>The ones that are underlined here, alphanum with special white space, these are our predefined ones I just noticed that string internal is not, this is actually false This is part of the grammar in here, but the ones that are underlined in here, this is true up here Integer here and positive integer, these should also be underlined These are our predefined ones We have a paper where this is described These also come as part of another grammar where all of these are being defined We simply recognize that something matches this regular expression, and then we know that this actually holds for all the positive integers that we have generated, and after a couple of, I think it’s five trials,

after five trials we know, well, we’ve seen it work for 53, for 4,700 and for three more, so we assume it’s a positive integer We can also do some Newton iteration in order to find ranges and other stuff, that’s all in the works You have a question >>Many things or types of learned object definition? >>Yes >>What is the guideline if you’re trying to understand and trying all possible expressions in the grammar? >>What we do is, for individual elements in one rule, okay? For individual elements in one rule, we try to omit them and we check whether the input is still valid Let me just check, I think we also do this for larger groups We first do this for individual elements, then for pairs, then for triples and so on It takes some time, so this JSON inference is a matter of a couple of minutes, but well, you do it once and then you have it, so there you go Yes >>What are you doing when you obfuscate variable names? Let’s say someone uses an obfuscator or something and then you have arrays with, you know >> Then all of this becomes X1, X2, X3 and so on, so we cannot find these out either But, even so, it’s not only the names that make these readable, it’s also the structure that follows from it, because you can really see how this, so if you look at the JSON implementation, the structure of the grammar very nicely reflects the structure of the program that processes it And you can obfuscate lots of names, no problem If you start obfuscating the structure of the program that processes its inputs, plenty of programmers who will work with your code later on will be extremely angry at you If you want to really obfuscate things, you need to come up with a mechanism that not only obscures the control flow but actually also obscures all the data flow in your program You would have to split these things into multiple variables, but, even so, if you say I’m going to store the first three
characters in one variable and the next two characters in another variable, all that is going to show up is that there’s a rule in here that says that false is composed of two parts, and the first is fal and the second is se So yeah, there would be ways of obscuring this, but right now we’re rather interested in bringing more light into what’s happening here These are just three examples for our Autogram tool, and yes, we do have a couple of exciting pieces of work that deal with that Right now, we are re-implementing this whole thing for C programs, and the reason for that is that we found that our technique for tagging the Java elements and actually changing their representation is still something that is very brittle Whenever we have a new subject, in particular a larger and more complex subject, we typically run into bugs in the Java interpreter, in the Java virtual machine We have many ways of getting the Java virtual machine to crash, which is very interesting, but it doesn’t help us because we need to build appropriate workarounds Now, we’re building this whole thing for C programs, for which we do have nice interpreters and instrumentation infrastructures that are way more solid, plus there are more interesting system-level utilities at the C level Think of the operating system, parts of the daemons, the protocols, all the data exchanges, all this is written in C, so it’s fun With this active learning, we can already do with a minimum of samples One single sample is great So we can explore all the variations, but in the long run, we want to do this without any samples at all For JSON we have, how do you say, a spike solution We have managed to create a nice little test generator that is oriented towards the parsing technique and is able to systematically generate all the different variations of JSON, that was nice And now, we’re generalizing this towards more programs, which means that in the longer run, you will be able
to take an arbitrary program, feed it into our technique, and out pops the grammar This is the long-term goal we’re aiming for The fuzzing part, interestingly, has lots and lots of open issues, because you want to apply a couple of tricks from test generation, for instance you want to fuzz in order to get a maximum of code coverage or feature coverage You want to do, a simple thing,

covering all the production rules, okay? Again, that’s in our Python fuzzer That’s 30 minutes, one hour of work And it turns out this has not been done before, but as soon as you systematically try to cover all the various production rules, you get a much better coverage of the program And that’s because the structure of the program is reflected in the structure of the grammar And so, if you cover the structure of the grammar, you’re also going to cover the structure of the program This is very helpful Integration with symbolic testing: grammars on one side, program semantics on the other side How do you bring these two together? What we can do, and what one of my students is working on, is to create local solutions with symbolic constraint solving at the function level, determining how you can reach a particular behavior in the program So, solving those constraints, and then finding out, well, this variable, which I need to have this particular value down here in that function in order to achieve that goal, is something that stems from this element in the grammar, because this is something we have established earlier with our tainting So we can find out, oh, if you pass a minus one down here, it’s going to crash We can lift that value back to the grammar and say, well, let’s put a minus one in here Maybe the minus one will be rejected by the parser, but at least that’s worth a try And in the end, what we want to build is the world’s best fuzzing platform Yeah, I’m competing with the pros here, I know that, but yes, we want to bring all of this together, creating a platform where you throw in a program, where we can infer the structure of the input, where we can apply smart test generation techniques based on the structure of the input, and then cover as many features as possible in a minimum amount of time And fortunately, I’m not
alone with that I do have a couple of great post-docs and PhD students We have in the group a post-doc from Oregon State University, a student of Alex Groce, who’s working with me on this Nikolas is working on creating the fuzzer, Matthias has created the first implementation of Autogram, Alexander is working on the integration with symbolic testing, and Björn is doing all the work on C programs these days This is the army of folks who are helping with it I’d like to point out that I wrote the Python programs So, you know, for somebody who increasingly relies on others to write programs and build stuff, I would like to point out that I can also write stuff myself Just to point that out: German professors can write code too You are welcome to analyze my Python code for whatever features Occasionally, I write stuff, OK? And in the long run, you know, this is fun I looked at file formats on Wikipedia There’s a page called file format, a list of file formats These are all the file formats that are documented somewhere in Wikipedia There are about, I think, 1,000 different file formats, all possible three- and four-letter combinations of file extensions, everything under the sun, from JPEG format to .crdownload, no idea what this is, oh, it’s Google Chrome of course And we want to be able to infer those formats one after the other, and then be able to fuzz them all As I said, this is a very simple question: what is the input language of a program?
This is a very simple question, and it’s awfully under-researched It’s a question that, well, is very fundamental to all sorts of testing and of course to program understanding, at least to data understanding, but inferring this automatically is something that at best a handful of people have undertaken in the last decade And how can I leverage this infrastructure in order to cover, understand, or prevent particular inputs? Then you can say, “Okay, here’s this legacy program which has a couple of vulnerabilities I’m going to restrict its input to only those inputs that I know are legal Or the other way around: I know that specific inputs are malicious, I’m going to infer a grammar for them, and I’m going to explicitly prevent them.” There you go
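The whitelist filter just described could be sketched like this, with a made-up toy grammar and a naive backtracking matcher standing in for the program's real inferred grammar; all names and the grammar itself are illustrative assumptions

```python
# Sketch of a grammar-based input filter for a legacy program: only
# inputs that are in the inferred grammar's language pass through.
# Toy grammar and naive backtracking matcher, purely illustrative.

GRAMMAR = {
    "<input>": [["GET ", "<path>", "\n"]],
    "<path>":  [["/", "<word>"], ["/"]],
    "<word>":  [["a"], ["ab"], ["index"]],
}

def matches(symbol, text):
    """Yield every suffix of `text` left over after matching `symbol`."""
    if symbol not in GRAMMAR:                 # terminal string
        if text.startswith(symbol):
            yield text[len(symbol):]
        return
    for alternative in GRAMMAR[symbol]:
        rests = [text]
        for sym in alternative:               # match the alternative's
            rests = [r2 for r in rests        # symbols left to right
                     for r2 in matches(sym, r)]
        yield from rests

def filter_input(data: str) -> bool:
    """Forward `data` only if it is in the inferred input language."""
    return any(rest == "" for rest in matches("<input>", data))

assert filter_input("GET /index\n")                   # legal: passes
assert not filter_input("GET /../../etc/passwd\n")    # not in grammar
```

Such a filter sits in front of the unmaintained program; anything outside the inferred language, including inputs matching a separately inferred "malicious" grammar, is simply dropped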

You don’t even have to patch the program for your, whatever, your two-year-old IoT device which you bought at a cheap price and which is no longer maintained Well, at least you can set up a filter for it All of this also translates into a better understanding of code, because if you know that this part of the code processes this particular input, well, you have a deeper understanding of what is going on And of course, this also gives you an idea about how to cover specific behaviors, how to understand behaviors There are hundreds of open issues, so it’s very easy to start doing research If I can write a Python program that infers a grammar in four hours, you can do so too I’m sure you folks can do so in one hour, including learning Python So, please enjoy it all, okay And this gets me to the end of my talk So, what have I shown? I’ve shown you how to use grammars for fuzzing existing programs The obstacle to that is getting a grammar in the first place And I have shown you both the Python prototype as well as our Autogram prototype that is able to infer such grammars automatically, and my students are happily working on extending this idea in various directions There are plenty of research opportunities around it, and all of this can actually be found on our web page that is devoted to Autogram You’ll also find our most recent technical report with all the features on active learning there So, was it fun? It was my pleasure to present, and I’m happy to take questions Thank you very much for your attention >> Yes, Chris? >> Really cool, and the result is kind of exciting My question is, it seems like you’re focusing mostly on input that is human readable, and often inputs, whether it’s file formats or across the wire or binary, typically, at least in my experience writing the codecs for these things, are not context-free >> Yes >> You’ll have some length encoding or layers or something like that >> Yes. Yes. Yes.
Yes >>Do you have a sense of how well these types of approaches would work in trying to handle that, where the output is not a context-free grammar but something else? >> Yes. There’s two important issues in there The first one is context, as you put it What we infer here are context-free grammars, and there’s plenty of data formats, actually most reasonable formats, where there is some amount of context sensitivity in there You would normally not want a context-sensitive grammar to describe these Suppose you have a checksum, for instance, okay? You do not want to describe the checksum using a context-sensitive grammar, okay? You can do that, but that’s not fun I mean, look at regular expressions: I’m not sure about that, but programmers love regular expressions, and for 100 regular expressions you have maybe one grammar, one context-free grammar that is I’m pretty sure that for 100 context-free grammars, you have one context-sensitive grammar, and then we go up all the way to universal grammars and Turing machines and whatnot The further up you go, the less popular things are, and you don’t want to write this with a context-sensitive grammar What we’re working on is to extend such context-free grammars with constraints that tell you that one particular item is the result of this particular operation being applied to this other item And this operation is something that we imagine can actually be extracted from code, because you can find out that the length that is being encoded in here actually comes from code and is being checked at this point, so we can extract these operations and make constraints out of them This is how you can then represent constraints with respect to checksums and lengths in all these container formats and binary formats Actually, I have a very nice master’s thesis on that, where I’m awaiting results soon, which is specifically geared towards checksums in inputs The second thing is that what we need here,
what we have here is actually multiple layers of languages stacked upon each other, like you have in media formats, or

say, like the ISO/OSI networking model, not sure whether this is still in textbooks Let’s say you have a layer where you describe things in terms of bits being processed, you have a layer where you think about these as data packets, and then somewhere up there, you have the application layer The same thing happens with media containers You have a representation where you have little bits with checksums and lengths, and then you have another representation where you actually talk about the content of all these individual frames that are packaged in here, and our goal is to bring these together, not only by having multiple grammars We can have one grammar for each layer that describes it, but there is also the concept of so-called conjunctive grammars, which is a mostly theoretical result from language theory The idea is that you can take multiple constraints: not only do you have disjunctions, where I tell you this can be A or B, but you can also have conjunctions You can say this can be A or B, and it also has to fulfill this other constraint, and the idea here is that you can then express that a file has to fulfill the constraints that apply at the bit level but also at the application level, and all of these together You can actually have negations in there too, then you get Boolean grammars Then you can say, an input has to fulfill this, and this, and must not be this, okay?
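The conjunctive idea might be sketched like this in Python, with three made-up predicates standing in for the bit-level language, the application-level language, and a negated (blacklist) language; the conjunction of the first two with the negation of the third defines the accepted inputs, as in Boolean grammars

```python
# Sketch of conjunctive/Boolean language membership: instead of one
# context-free grammar, an input is accepted iff it satisfies several
# layer checks at once, minus an explicit blacklist. The three
# predicates below are made-up stand-ins for real layers.

def wire_layer_ok(data: bytes) -> bool:
    return len(data) % 4 == 0          # packet-level framing constraint

def app_layer_ok(data: bytes) -> bool:
    return data.startswith(b"MSG:")    # application-level syntax

def blacklisted(data: bytes) -> bool:
    return b"<script" in data          # inputs to reject explicitly

def accepts(data: bytes) -> bool:
    # L = L1 AND L2 AND NOT L3
    return (wire_layer_ok(data) and app_layer_ok(data)
            and not blacklisted(data))

assert accepts(b"MSG:helo")              # 8 bytes, well-formed
assert not accepts(b"MSG:hi")            # 6 bytes: framing violated
assert not accepts(b"MSG:<script>....")  # blacklisted
```

Producing inputs that satisfy such a conjunction is the hard part mentioned next: it turns generation into string constraint solving rather than simple grammar expansion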
Then you can actually combine things like, this is a kind of input that you’d like to avoid, so you have an explicit blacklist, for instance This is something where we think that, in the near future, we’ll be able to infer these The problem is production, because, right now, if you only have disjunctions, production is a very simple thing As soon as you have conjunctions, you actually have to go into string constraint solving, where you, in the end, have to invoke something like HAMPI or Z3 in order to fulfill those constraints But, again, this is classical constraint solving We have tools for that, we know how to do that, so it’s not out of reach Two longer answers to a single question Yes? >> The examples that you show, would it have been the proper way with this thing, in the first, in writing the grammar >> Sorry? >> Would it have been the proper way to deal with these things to write the grammar in the first place, rather than constructing it as you go? >> Well, you know, even for a thing as simple as a URL, writing a grammar can be difficult, because if you write the grammar, you start to reason: what kind of characters can be in the path? What is the range of the port? Can I omit the port? What kind of characters are allowed in the password, okay?
These are the things that make you think hard, but it’s important to capture these constraints precisely, because these are precisely the things you want to test And by having an automatic learner, or in this case an active learner, that infers those things for you by running tens of thousands of experiments, you get way more information than what you could ever get through manual testing and probing Even for something where we think we know how it is composed, formalizing this is actually manual work Now think about a format that you do not know, okay, say some proprietary data format I remember, this was during my student days, I was working with some sort of vector drawing program which had a proprietary file format in which everything was stored I had no idea what this looked like I loaded this file into my text editor, and the first characters in that file were, “So you’re trying to find out the innards of a such-and-such file, eh?” Question mark It was an insult, I mean, it’s like “abandon all hope, ye who enter here” or something like that Okay, so it is difficult Now, think about it: somewhere there’s a hard disk on which somebody has stored millions and millions of pictures made with that particular program, and you have no idea what they are made of But if we can extract the grammar for it, then we will be able to identify what it’s made of And actually, if you think of it, we can create a universal XML converter: take anything and convert it to XML, using the very structure in the grammar to make XML out of it, and all of a sudden you can reuse everything Yes, there’s plenty of uses for that
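The kind of experiment just described, probing which characters a parser accepts at a given position, might look like the following sketch; the oracle and all names are made up to stand in for a real URL parser under test

```python
import string

# Sketch of one active-learning experiment: probe the parser (an
# oracle) with every candidate character at one position and keep the
# accepted ones. The tool runs experiments like this by the thousands.
# The oracle here is a toy stand-in, not a real URL parser.

def oracle(url: str) -> bool:
    # stand-in parser: the port must be one to five digits after ":"
    host, _, port = url.partition(":")
    return host == "example.com" and port.isdigit() and len(port) <= 5

def learn_charset(template: str, hole: str = "\x00") -> set:
    """Return every character the oracle accepts at the hole position."""
    return {c for c in string.printable
            if oracle(template.replace(hole, c))}

# Probing the middle of a port number reveals the digit character class:
accepted = learn_charset("example.com:8\x0080")
assert accepted == set(string.digits)
```

Repeating such probes over every position, and over deletions and repetitions as shown earlier, is what lets the learner derive character classes, optional parts, and ranges without any manual reasoning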

>> People might want to replace their current implementation with a grammar-based implementation that they get out of your tools >> Well, if you have a grammar, then it’s easier to create a parser again out of it It’s obviously easy to- also, if you think of it as an oracle, for instance, you can use the grammar to check whether the outputs are valid It’s amazing: we computer scientists love grammars, we do If you do one year of studying computer science, you learn a bit about grammars and formal languages, and you need them later too But bringing these things together, here’s my program and here’s the language So far, we’ve only seen this from the perspective of creating a specification and then implementing it, creating a parser automatically from a written specification Yes, we know these are parser generators But the other way around, here’s a program, what’s its input? That’s a problem that has never been addressed before, and I’m not even sure whether the problem has even been posed before And I think the problem has not been posed before because, even though the answer is relatively simple, it simply was not obvious how to do that The thing is, if you apply standard static symbolic techniques, if you have a parser and you want to find out what it does using standard static symbolic techniques Have you ever looked at a generated parser, with its tables and everything? It’s impossible to understand what it does, and it’s also impossible for constraint solving, because if you have a table with 10,000 entries, you’re going to end up with a constraint that has 10,000 entries, whose solution is, lo and behold, the grammar So there was no such thing, and only by using a dynamic technique, I think, by looking at what’s actually happening during the execution and combining this with active learning and testing, this is how we now have a good grip on how to tackle that problem It’s still going to be a decade
of research until it all works out, and there are plenty of opportunities, so I’m happy to encourage you all to try these things out for yourselves >> Okay. So let’s thank Andreas one more time >> My pleasure