Natural language processing in Python using NLTK. Part 1/3

Okay, so what is natural language processing? This was the first question I heard in the hallway just now. In a broad sense, it's any kind of computer manipulation of natural language, and that can go from counting words or word frequencies all the way to really hard things like understanding meaning in a text, things that have not been entirely solved, because right now we don't know how to really communicate with a machine and do it very well. There have been some advances recently, but we are still not there.

So why would we want to learn natural language processing, and why would we want to do it as political scientists especially? There are multiple applications. First of all, at some point you might just want to process text: you have a block of text, you want to do something with it, extract some information from it, and this is going to be very helpful. Second, information extraction: extracting certain types of information, not just dividing the text into sentences and then working with the sentences or the words, but actually pulling pieces of information out of the text. Another thing that I've been doing myself is document classification, and also sentiment analysis. You can also compare documents or texts among themselves, and you can do automatic summarizing of texts; again, that's one of the harder things, and programs are not extremely good at it, but you can do it. You can also do translation, and you know that from Google Translate, for example. And as political scientists we can also do discourse analysis, with all that entails.

So what is NLTK? It's the Natural Language Toolkit: a toolbox of different sorts of libraries and programs for Python. And you know Python, because we've already had the introduction to it, and we had it last week as well. The good thing about Python is of course that it's free, and the good thing about NLTK is that
it's free. It was initially developed for educational purposes, and as a consequence it also has extremely good online documentation: there are multiple books on it, I will be talking about two of them, there's online documentation, and there's also help on Stack Overflow and places like that.

What are some other options for working with text, for natural language processing? There are some. If you go to R and check out the natural language processing section, you'll see that there are lots of packages; OpenNLP is one of them, and OpenNLP is also available in Java. It's similar to NLTK. You have other things like LingPipe, in Java as well, and you have lots of commercial applications; it's one of the things that really sells well, so lots of companies are developing it. You have SAS Text Analytics and various SPSS tools for that. However, the cool thing about NLTK is that it is the most widely used, and the most consistently used by its users.

Okay, so how do we download and install it? I already sent you the email about that, but in case you didn't get it, you can find the installation instructions at that link. You will need Python 2.7, even though there's a newer version of Python available: first of all I strongly recommend getting 2.7 just because it's a bit better for this, and second, NLTK only works on 2.7 right now. You also need to download the corpora and packages, the data used for the examples in the book. From Python, as I already said, you have to run two lines of code: you import nltk and then call the download function, and it's going to open a downloader where you can choose which packages to install; lots of them are going to be installed already. If anyone finished downloading, we can actually go ahead and use it. Can everyone see my code if I make the font larger?

Does this work okay? So, what resources am I going to use? Most of what I'm going to show you is based on the textbook for NLTK, which is Natural Language Processing with Python. The textbook is free and available online; you can also buy a hard copy of it. Another useful one is Python Text Processing with NLTK 2.0 Cookbook. I used that one when I was learning NLTK; it's a really easy read. But the examples I have right now are based on the textbook.

Okay, let's start by reviewing some of the things we already know from the first Python presentation we had this semester. First of all, strings. They are the lowest level of text processing, and you have some examples here of things you can do with a string. I'm defining monty as a string. Why do I have it on two rows? Just because I wanted to show you that you can write strings on two rows; you need to join them so the program knows it's the same thing. You can do operations with strings: for example, I can multiply that string by two, I can add text to it, and in the last piece over there I'm taking out only the last word, the last few characters of the string, so what it prints out is the string doubled, plus the expression we put in the middle, plus the last word. You can also search a string: if you run the name of the string plus find, it's going to give you the position where the substring starts, position six in this case. You can also transform strings into all uppercase or lowercase, and you can replace letters, or groups of letters, in strings; in this example I've replaced y by x.

Most of the time we are not going to use strings in NLTK, though; we are going to use lists, because you can do lots of things with lists that you cannot do with strings. For example, lists are flexible about what elements they contain, which means the elements of a list can be lots of things: entire words, sentences, or pieces of text. And if you loop through a list, you end up with the elements the way they are, whereas if you loop through a string, it loops over the characters.

There are some common things you can do with lists. You can look at the length. You can look at individual elements, and one important thing to remember is that indexing starts with zero, so if we ask for the first element of the list, we actually get what looks like the second one on the screen. You can append elements to a list, and you can sort lists: if you sort, you end up with a new list that starts with the words beginning with numeric characters, then the words starting with capital letters, ordered among themselves, and only then the rest, alphabetically. You can also join the elements of a list the way I did here, and you can split a string into the elements of a list.

Now, for this part I'm going to use text from the book data that you downloaded just now, so I hope it works for everyone who wants to run the code with me. I'm importing the text that comes from the presidential inaugural addresses.
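The string and list operations in this review can be tried directly in plain Python; here is a quick sketch (the strings and lists are my own examples, not the ones from the slides):

```python
# One string written over two rows, joined into a single value.
monty = ("Monty Python's "
         "Flying Circus")

print(monty * 2)                  # repetition
print(monty + "!")                # concatenation
print(monty[-6:])                 # slicing: the last six characters, "Circus"
print(monty.find("Python"))       # position where the substring starts: 6
print(monty.upper())              # all uppercase
print(monty.replace("y", "x"))    # replace letters

# Lists are more flexible: elements can be words, sentences, whole texts.
sent = ["colorless", "green", "ideas", "sleep", "furiously"]
print(len(sent))                  # length of the list
print(sent[0])                    # indexing starts at 0
sent.append("again")              # append an element

# Sorting puts digit-initial words first, then capitals, then lowercase.
mixed = ["banana", "Apple", "2nd", "cherry", "Berry"]
print(sorted(mixed))

print(" ".join(sent))                 # list -> string
print("the quick brown fox".split()) # string -> list
```

Running this shows each operation's result, including the sort order described above.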

Here I'm showing you how you can search for and display words in the context of the text. For instance, if I search for the word vote, it's going to have eight matches, and I'm only showing you the first three of them here; you can see the word inside sentences in the text. Another thing you can do with this text is look at words that appear in contexts similar to the ones the word vote appeared in: vote appears in sentences like these, and then other words, such as nation, abandoned, achieve, adopt, appear in similar contexts, surrounded by similar words.

One thing that is pretty important, and is going to be even more important when we start working with text, is collocations. Collocations are groups of words that appear frequently together, and we can show a list of the collocations that appear in our text: for example 'United States', 'fellow-citizens', and so on. These are words that usually appear together in groups. Someone asked whether the collocation function removes anything, common words for instance. It's not removing anything from the text; it's just looking for words that frequently appear together, one after the other, and from what I see there are no common words in the output, so common words simply don't show up as collocations on their own.

Another thing you can easily do, in terms of basic operations with text, is counting. You can count the length of a text from start to finish, and, a bit different, you can see how many distinct words are used in a text, the vocabulary of the text. You can actually compute a measure of the richness of the text by looking at the ratio between the number of distinct words, each counted only once, and the total number of words in the text.

Another thing you can do is look at the positions of different words in the text. What I'm graphing here is a lexical dispersion plot, which shows where the words citizen, democracy, freedom, war, and America appear across all the inaugural speeches, and since the speeches are ordered over time, you can read the plot over time. You can see that America appears more often towards more recent times, and you can see where the word war is more densely used. Citizens is a pretty common word and is used consistently across all the texts. Democracy is rare; I was thinking democracy would be used more often, but it's not actually used that often. And freedom, this is interesting, appears a lot towards the end, in the recent speeches. Of course there are more systematic ways of doing this.

Another pretty cute thing you can do using NLTK is to take text written in a certain style and have NLTK generate random text in a similar style. I did that, and if you read it, it's really, really funny. I haven't tried to get it to sound good or anything; it's just the first thing that popped out of NLTK, and I really like the last sentence: 'destructive wars issued issued which has always worked perfectly'. Someone asked what kind of generator that is.
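For a sense of how the generator works: as far as I know, NLTK builds an n-gram language model from the text and samples from it. Here is a minimal bigram sketch in plain Python; the training text is invented, and this is not NLTK's actual implementation:

```python
import random
from collections import defaultdict

def generate_bigram_text(words, seed_word, n=10):
    """Generate text by repeatedly picking a random successor of the
    current word, based on bigram counts from the training words."""
    successors = defaultdict(list)
    for w1, w2 in zip(words, words[1:]):
        successors[w1].append(w2)
    rng = random.Random(0)           # fixed seed so the output is repeatable
    out = [seed_word]
    for _ in range(n - 1):
        options = successors.get(out[-1])
        if not options:              # dead end: the word never had a successor
            break
        out.append(rng.choice(options))
    return " ".join(out)

corpus = ("the people of the united states and the citizens of "
          "the world demand peace and the people demand freedom").split()
print(generate_bigram_text(corpus, "the"))
```

Every word it emits follows a word it has actually seen it follow, which is why the output sounds like the training text without being a quote from it.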

Yeah, it's pretty cool. I also did it with text from the Bible, and that one looked exactly like it was going for the Bible; this one is pretty close as well. If you do it with news, though, it's not going to be so good, and you can imagine why: first of all the richness of the vocabulary is larger, and then news sometimes uses short phrases, so there isn't a lot of context around the words. By the end of the presentation you'll actually be able to figure out how the program did this, if you don't know already.

Okay, what can we do with elements of a list? I remember Drew speaking about list comprehensions, and here, for example, I'm looking at the length of a list that includes the words in my text which are lowercase and longer than five characters; it's going to give you a number, and you do that in only one line of code. Another thing you can do is put words in capital letters, here for the first five words of the text.

Another brief review: loops and conditions. What I'm doing here is printing different things depending on what I find in the text. For example, if the word is shorter than five characters and ends with e, I print 'is short and ends with e'; else, if the word starts with a capital letter, which is what istitle means, I print 'title case word'; and else, if none of the other conditions are met, I print 'this is just another word'. You can try this in the program yourself and see what it gives you.

Okay, going back to NLTK. Someone asked whether all of this is just Python: yes, these are all things you do in Python, almost all of them. I also showed you a few things where you need NLTK, but most of them are things you do in plain Python.
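The loops-and-conditions example and the list comprehension can be written out like this (my own sample words and text, not the ones from the slides):

```python
def describe(word):
    """Classify a word the way the if/elif/else example above does."""
    if len(word) < 5 and word.endswith("e"):
        return word + " is short and ends with e"
    elif word.istitle():
        return word + " is a title case word"
    else:
        return word + " is just another word"

for w in ["fine", "Python", "banana"]:
    print(describe(w))

# The list-comprehension example: lowercase words longer than five characters.
text = ("We the People of the United States in Order to "
        "form a more perfect Union").split()
long_lower = [w for w in text if w.islower() and len(w) > 5]
print(len(long_lower), long_lower)
```

The comprehension does the filtering and the counting in one line, which is exactly the appeal.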
This was just a review of what we need to know in Python to be able to work with NLTK. So we are going back to NLTK and to text. I used the word corpus earlier; what is that? It's a large collection of texts, and it can be very large, hundreds of documents. It can be raw or it can be categorized, so the documents can be put into categories, and it can be either concentrated on a specific topic or very broad and open-ended. What are some examples? You have the Brown corpus, which was the first one to become really, really huge; it's categorized by genre and includes lots of different types of documents. You have a web text corpus that includes reviews, discussions from forums, and things like that from the web. You have one for news, which is from Reuters, and you have the inaugural speeches. There is also a multilingual one, and these are not the only ones: NLTK comes with access to lots of different corpora. Why do we need them, and what can we do with them? We'll see a bit later that we can use corpora for training our own programs, expanding our programs with what is already known from the corpus.

So what are some basic corpus operations? You can list the categories in a corpus, if it has categories. You can also list the names of the files in the corpus: for the inaugural corpus you can see the names of the first files, and it starts with the 1789 speech. And we can actually inspect the distribution of different words in these texts: it's going to plot the conditional frequency distribution

function for the words America and war. Another condition I have here is that we are lowercasing the words in the text and using only those, and I'm looking at words that start with these two words, so forms like "America's" are included as well. This is what the graph looks like, with the years down there, and you can see the evolution of the terms over time.

You can also look at differences between the categories inside a corpus. For example, we can look at several genres from the Brown corpus: news, government, and romance. Romance is going to have love novels and things like that, government is definitely going to have government documents, and news is news. And you can look at how often different modal verbs appear in these documents. I'm looking at some that I think might be interesting, like should, may, and can, and I have a feeling their frequency is going to depend on the type of document we are talking about. If you do that, you get that in the news category the most common ones are indeed can and may; however, in the government category the most common one is may, which is pretty interesting, and in the romance category it's can. So you can do that, and you can imagine doing this with other sorts of words, words that are more informative than these.

Another corpus, and a really important one, is WordNet. What is it? It's a huge dictionary of English in which synonyms and antonyms are defined, and also things that are more specific to language processing, such as hypernyms and hyponyms. What are these? Hyponyms are the subcategories of a certain category, and hypernyms are the upper categories of that category. You can also look at a word and see how deep in the tree it is categorized, on what step it sits, and you'll see, for example, that a word such as animal
is going to be not so deep, compared to a more specific word such as lion; things are structured in trees in WordNet. The other interesting thing it has is entailment: for instance, the last line of code here shows you that walking entails stepping. How does it know that? Because they are ordered like that in the tree.

Let's go a bit through the code, because this is interesting. What I'm doing first of all is importing wordnet, and I'm importing it as wn because it's easier to use it like this. Then we can see the synsets for motorcar, which is going to be car, and you can see the different other names for that: car, auto, automobile, machine, motorcar. You can also see the definition, which will be just a definition of what a car is. And you can also print the other senses of car, and in one of them you have car, railcar, railway car, railroad car; you can see these are things related to trains, and they are grouped together in one category.
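To make the depth idea concrete, here is a toy hypernym tree in plain Python. The data is invented for illustration; it is not WordNet's real structure, and this is not the NLTK API:

```python
# Toy hypernym chain: each word points to its more general category.
# (Invented data; the real WordNet tree is far larger and richer.)
hypernym_of = {
    "lion": "big_cat", "big_cat": "feline", "feline": "carnivore",
    "carnivore": "mammal", "mammal": "animal", "animal": "entity",
}

def depth(word):
    """Count hypernym steps from the word up to the root 'entity'."""
    steps = 0
    while word != "entity":
        word = hypernym_of[word]
        steps += 1
    return steps

print(depth("animal"))   # a general word sits near the root
print(depth("lion"))     # a specific word sits much deeper
```

So 'animal' is shallow and 'lion' is deep, the same ordering you would expect from the real WordNet tree.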

In another sense you have car together with gondola, and these are just two of them, because there are others: you'll have one with car together with automobile and things like that, and you can have car as in a cable car, and so on, each sense grouped together with its synonyms. This is why it can access them like that: it has all this information about the relations between different sorts of words. Someone asked whether you can extend it on your own. It's already pretty comprehensive, but I think you can: it's going to be structured as lists and dictionaries and trees, so you can actually see how it's built, and you can of course change it and define your own things if you want to write code for that.

Another thing we can do, and I'm sure NLTK is not the best program out there for doing this, but you can do it if you want, is get text out of HTML, and it does that through urlopen. It can actually clean the text for you; it has a function that cleans HTML, although it's not extremely good. For example, let's say you want to open a file, just a BBC article about planets and which is the smallest planet, and we can read that. If you try to look at it, what it gives you is HTML code: it's going to have lots of HTML tags, tables, and everything, so this is not very good. But then you can clean that HTML code, and if you then tokenize it and look at just the words, you'll see that it gets rid of the markup and gives you the main words in the text. Of course it still needs
some cleaning, because what was coming after, accessibility and links and things like that, had nothing to do with the article. So if you want to extract just the article, you might still want to work on this, but it gets you pretty far. And of course you can type in your own data; you can use it to input your data, which is not going to get you very far, but you can also use it on your own files, on your disk. This might be pretty obvious for you guys, but I thought it would be good to have here: how do you open a file, how do you import your own file in Python. Pablo was smiling here because we were both working on the same project where we used the manifestos, and what I'm working with is manifesto data from the UK; you can actually open that.

Another interesting thing that I forgot to mention is that when you open files and start using them, you might want to check out the section in the book about file formats, because it depends a lot on which format the file is in originally. You want to have it in Unicode. Why? Because Unicode understands everything, so you can do everything with it. But you might not have it in Unicode, and then you need to figure out how to decode it into Unicode, and encode it back if you want to get it back out.

Okay, another thing you can do is import your own collection of files as a corpus. Why would you want to do that? Because corpora come with predefined operations, so if you define your files as a corpus, there are some things already defined that you can use. For example, I can go through the files in my folder.
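The corpus-over-your-own-files idea can be sketched in a few lines of plain Python. This is a toy stand-in, not NLTK's corpus reader, and the files are simulated with an in-memory dict instead of a folder on disk:

```python
# A miniature stand-in for a plaintext corpus reader over your own files.
class TinyCorpus:
    def __init__(self, files):
        self.files = files            # {fileid: raw text}

    def fileids(self):
        """List the file names in the corpus, like the predefined readers do."""
        return sorted(self.files)

    def raw(self, fileid):
        return self.files[fileid]

    def words(self, fileid):
        """Naive word list for one file (real readers tokenize properly)."""
        return self.files[fileid].split()

corpus = TinyCorpus({
    "manifesto_a.txt": "we promise lower taxes and better schools",
    "manifesto_b.txt": "we promise better roads",
})
print(corpus.fileids())
print(corpus.words("manifesto_b.txt"))
```

Once your files are wrapped like this, the same fileids-and-words interface works whether the text came from a big published corpus or from your own folder.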

You can see the file IDs, for example, in my folder, the same kind of thing we saw previously for data coming from one of the big corpora out there, and I can also look at the words in my corpus for a certain file ID; I already have those functions.

Okay, I'm not going to go through regular expressions, but you have to know that if you want to do things in Python and in NLTK, you should know some basics about regular expressions and how you can use them to do what you want to do. If you want to test yourself, I have an exercise here at the end: count the vowels in the string 'supercalifragilisticexpialidocious'. And you might be thinking, I'm good at this; well, you should count them with Python.

Okay, another thing we can do, and this is going to get us started on more complicated things: you have your raw text, and the first thing you do with it is tokenize it. What does it mean to tokenize? You divide the text into words, let's say, although tokens are not always going to be words; tokens can also be, for example, commas and punctuation. You can decide how to tokenize your text, and there are multiple ways of doing it. One of the most important decisions you will need to take is how you are going to treat punctuation: when you have a sentence and then an end-of-sentence character, are you going to include that character with the word or treat it separately? You can see that if you want to do complicated things with your text, these decisions are going to be important. The basic tokenizer treats punctuation separately, so it is not including it with words, which makes sense, because you want to be able to compare words: a word at the end of a sentence should be comparable to the same word in the middle of a sentence, so you don't want a punctuation mark attached to it that would confuse things.
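That punctuation-separating behaviour can be seen in a minimal regular-expression tokenizer. This is a sketch in plain Python; NLTK's tokenizers are more careful than one pattern:

```python
import re

def tokenize(text):
    """Split text into word tokens and punctuation tokens, kept separate,
    so 'here.' becomes ['here', '.'] and words stay comparable."""
    # \w+ matches a run of word characters; [^\w\s] matches one
    # punctuation character (anything that is not a word char or space).
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("A sentence ends here. Another, shorter one!"))
```

Because the period becomes its own token, 'here.' at the end of one sentence and 'here' in the middle of another count as the same word.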
But again, you can choose how to do this. You can also use a regular expression tokenizer, which is pretty good, and decide what pattern you're going to use for your punctuation. I actually used something similar when I was working with the manifestos data: we were trying to divide the texts not into words but into sentences, and what was coming right out of the regular tokenizer in NLTK was not very good for us, because we had lots of unusual characters that were, or were not, ends of lines. So using a regular expression tokenizer with a pattern you define yourself is a very good idea. And another thing you should of course do, when you have a large project and want to make sure you're doing things correctly, is test your tokenizer: you're defining your own tokenizer and you want to know how good it is, so you test it against text that has already been tokenized by someone else, where you know it's properly tokenized the way you want it to be.

After you tokenize the text, the next thing we do is normalize the tokens. What does that mean? We remove, for example, the capital letters. Again, an important decision is whether you want to do that at all, and it depends on the project, because for some things it might make sense to keep uppercase, if that's going to make a difference for what you're interested in.

The next thing is stemming. What do you do when you stem? You take off word endings, like the 'ly' in beautifully, or other sorts of affixes at the ends of words, and you end up with something like the root of the word. There are at least two stemmers in NLTK that I know of, Porter and Lancaster, and if you look at an

example, the book had some text from a novel or something like that, and if you compare the two stemmers, they give you different stems for some of the words. For example, for 'lying', the Porter one is going to turn it into 'lie' and the other one is not going to do anything with it; for 'women', one of them truncates it and the other leaves it as it is. This can be useful because you can then compare both the plural and the singular of a word, and they work much better with simple words.

Similar to stemming is lemmatization, which only removes affixes if the resulting word is in a dictionary: it searches a dictionary, and if the result is there, it removes the affix; otherwise it leaves the word as it is. So, for example, it's going to leave 'lying' as it is, and it's going to turn 'women' into 'woman'.

And of course you can define your own stemmers if you want, and you can do it through regular expressions, because there aren't that many word endings: for instance, verbs in the past tense end with 'ed', and you know for sure that that one has to be removed, and there are others. Someone asked about other languages. I don't know, because I've never worked with other languages, but if you know the language really well, you can always define your own stemmers, and it's not going to be that long, because if you're a linguist you know exactly what the rules are, and there are not all that many categories if you think about it. In English, I think I saw an example once of how to build your own stemmer, and what they were doing was removing the ten or fifteen endings that are very common and then defining some extra rules for the remaining words. So it might be like that.
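A naive stemmer in the spirit just described, strip a handful of very common endings, can be sketched like this. The suffix list and examples are my own toy choices; this is nothing like the real Porter or Lancaster rules:

```python
# Longest suffixes first, so 'ing' is tried before 'g' would ever be, etc.
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")

def naive_stem(word):
    """Strip the first matching common ending, keeping a minimum stem length
    so short words like 'is' are left alone."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:len(word) - len(suffix)]
    return word

for w in ["walked", "walking", "beautifully", "cats", "is"]:
    print(w, "->", naive_stem(w))
```

Even this crude version lets 'walked' and 'walking' match, which is the whole point of stemming; the real stemmers just handle many more cases correctly.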
It depends on the rules of the language. However, you can also try to search whether someone else has one already defined, and go check whether there's any code out there for it. Someone asked about German; I don't know about Russian, and I can see how that can be harder. A lot of what we are talking about here is informed by the syntax of the English language.

Another thing we can do with text, and that you would probably have to do at some point, is divide it into sentences, to be able to do more complicated things in NLTK. You don't always have to do this, but for some tasks you need to divide into sentences; you need to know where the sentences end. You do that by using the Punkt sentence segmenter. It is pretty good, it can divide sentences pretty well, but it has some problems with, for example, abbreviations like 'U.S.A.' at the end of a sentence: that gets really confusing and it's not going to know what to do with it, so it might need some help cleaning things up and defining special situations like that. What I have here is that I'm printing out sentences coming from the raw manifesto data that we uploaded previously, and I'm finding some sentences from that manifesto, and now they are listed each one on a different row, separated from each other.
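A crude hand-rolled version of the idea shows both the rule and the abbreviation problem. This is a sketch, not Punkt, which learns its abbreviation list from data; my abbreviation set and example sentences are invented:

```python
import re

ABBREVIATIONS = {"u.s.a.", "e.g.", "dr.", "mr."}   # toy list, far from complete

def split_sentences(text):
    """Split at '.', '!' or '?' followed by whitespace and a capital letter,
    unless the period belongs to a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        end = m.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue                       # abbreviation, not a real boundary
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())  # whatever is left at the end
    return sentences

print(split_sentences("We will win. The U.S.A. is large. Onward!"))
print(split_sentences("He lives in the U.S.A. Many do."))
```

Without the abbreviation check, the second example would be wrongly cut after 'U.S.A.', which is exactly the failure mode mentioned above.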

And, as someone was mentioning, one problem with word segmentation is that in some cases there are no visual markers to help us. The example, I think, was Chinese, where it is really hard to divide words from each other, and this is also a pretty large problem when you are working with spoken language: you can imagine that if you're a machine trying to understand what I am saying, you hear a string of sounds and you have to know where each word ends. What you do in a situation like that is define search-and-identify algorithms and functions that are based on an objective function, a scoring function: it looks at how likely it is that you have the correct segmentation, how likely it is that the word ends there, based on what it already knows and what it has in dictionaries.

The other obvious thing: we need to write output files as well, and you can write output from Python, you just have to put it in a string before you do that. One thing that is pretty important is to avoid filenames that contain space characters, and also filenames that are identical except for lower and upper case; you probably already know that if you've ever tried it and had problems with it.

Okay, now something that is pretty specific to natural language: part-of-speech tagging. What is that, and why do we want to do it? It is the process of classifying words into their parts of speech and labeling them. The parts of speech are classes such as nouns, verbs, adjectives, adverbs, and so on and so forth; you can define lots of different types of classes, and you tag words based on what class they belong to. The collection of tags is called a tagset, and you can use the application to tag text automatically. So the question is, why would you want to know, for a given text, whether something is a verb or an adverb or a noun or
whatever? Well, first of all, because you can use that information to analyze word usage in different texts: for instance, let's say you want to see how dynamic a certain text is and you're interested in the number of verbs it uses; you can use the information about the verbs. You can also predict the behavior of previously unseen words if you know the parts of speech. You can also run more powerful searches, because you can define your search not only based on the information in the text, what the words or the characters are, but also on whether a word is a noun or a verb. And it's also used for classification; this is going to be the most important one.

You have multiple methods that can be used, and are used by NLTK, for tagging: you have a default tagger, which does not perform really well, a regular expression tagger, a unigram tagger, and an n-gram tagger. The best way of approaching part-of-speech tagging is to combine them, and you do that by using a technique called backoff, which means that you start with a very specialized model, a bigram tagger for example, and if you don't get good enough results with that one, you fall back to a more general model. And you can define your own taggers; you have to train and evaluate them using corpora that have already been tagged.

So let's see some examples. Some corpora come already tagged; the Brown corpus, for instance, comes tagged.
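The backoff idea in miniature: a unigram tagger is just a word-to-tag lookup table learned from tagged text, and when the word is unknown it backs off to a default tagger that always answers 'NN'. The training data here is a toy I made up; real taggers are trained on corpora like Brown:

```python
# Tiny "tagged corpus" to learn from: (word, tag) pairs.
tagged_training = [("the", "DET"), ("dog", "NN"), ("runs", "VB"),
                   ("the", "DET"), ("cat", "NN"), ("sleeps", "VB")]

# Unigram model: remember one tag per word (last seen wins in this toy).
unigram_table = {}
for word, pos in tagged_training:
    unigram_table[word] = pos

def tag(words, default="NN"):
    """Look each word up; unknown words back off to the default tag."""
    return [(w, unigram_table.get(w, default)) for w in words]

print(tag(["the", "dog", "sleeps"]))
print(tag(["the", "platypus", "runs"]))   # 'platypus' backs off to NN
```

NLTK chains these the same way, just with real models: an n-gram tagger backs off to a unigram tagger, which backs off to a default tagger.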

Looking at the first words of Brown, you can see 'The', tagged as an article, and 'Fulton', tagged as a proper noun. And let's see an example in which we define a sentence and tag it with some simple tags: if our sentence is 'and now for something completely different', then 'and' is going to be defined as a conjunction, 'for' is going to be a preposition, 'something' is going to be a noun, and so on and so forth. There's a list of what these tags mean in NLTK; you have lots of other tags, one for foreign words, tags for the different tenses of verbs, one for determiners, ones starting with WH for words like 'which' and 'where', and so on and so forth.

Now let's see how NLTK performs on an example that is harder to solve, in which we have words that can be both nouns and verbs: 'they refuse to permit us to obtain the refuse permit'. It's tricky, and you might think, oh, this is impossible, it's not going to be able to do that. But actually, when you look at it, it correctly identifies 'refuse' the first time as a verb and the second time as a noun, and 'permit' again is first a verb and then a noun, so it correctly tags them. How does it do that? It does that by taking the context of the sentence into account.
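To see what 'taking the context into account' means, here is a deliberately tiny hand-written rule in plain Python. The real tagger is statistical and learned from corpora; this lexicon and rule are invented purely to show how the previous tag can disambiguate 'refuse' and 'permit':

```python
AMBIGUOUS = {"refuse", "permit"}                 # can be noun or verb
LEXICON = {"they": "PRP", "to": "TO", "us": "PRP",
           "obtain": "VB", "the": "DET"}         # unambiguous toy entries

def tag_with_context(words):
    tags = []
    for i, w in enumerate(words):
        if w in AMBIGUOUS:
            prev = tags[i - 1] if i > 0 else None
            # After a determiner or a noun, read it as a noun;
            # after 'to' or a pronoun, read it as a verb.
            tags.append("NN" if prev in ("DET", "NN") else "VB")
        else:
            tags.append(LEXICON.get(w, "NN"))
    return list(zip(words, tags))

sent = "they refuse to permit us to obtain the refuse permit".split()
for word, pos in tag_with_context(sent):
    print(word, pos)
```

The same word gets different tags depending on what came before it, which is the core of what the context-aware tagger is doing, just with probabilities instead of one hard rule.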