Andrew Fogg of import•io at San Diego Tableau User Group

Hi, my name is Andrew and, as you can tell by the accent, I’m from San Francisco. So, I’m from London, and I’m one of the founders of a company called import.io. What I’m going to do is very quickly give you a bit of a background story on what import.io is and what it does, and then we’ll go straight into playing around with why it’s interesting for people who are interested in Tableau.

I think all of us in this room will agree that there is a lot of data on the web, but it’s trapped inside web pages, and getting it is kind of difficult. You can use an API if there’s one available. You can write custom computer code but, sad face, that’s kind of tricky too. Both of these things require a friendly developer and a lot of time. I assume everyone here has used Tableau, but who has ever written custom code to get data from a website? Some nodding, quite a few hands. OK, so we know this is painful. We have a lot of experience working with people who do that a lot, and we reckon it takes somewhere between four hours and four days to get data out of any single website. So imagine if you could turn any website into a spreadsheet or an API in less than four minutes without writing any code. That’s what we do.

So how does it work? It’s a web browser for data. You navigate to a web page that’s got the data on it that you want. The web browser is connected to our servers in the cloud, and it’ll become clear why as we go through. So you get to a web page with some data on it; this is a shopping website. I extract the data from the page by point-and-click: I select an example row and it highlights it, I do the same thing for another example and it highlights that as well, and then it learns from these examples and generalizes to learn where all the rows are on the page. I do the same thing for columns. I’ve
got some rows and columns [partly inaudible], and I choose an example column value from the web browser. In this case it’s the product name, so I click to say this is a product name. It’s the same thing: extract the table. So with a couple of clicks I can turn a web page into a table of data.

I can also record actions on websites so as to teach the platform how to navigate through a site to a web page that’s got the data on it that I want. This is me showing it how to do a search for shoes. [Aside, partly inaudible.] There’s a crawler you can use to get a load of data from websites: you show it a couple of examples of web pages that have got the data on them that you want, you go out to lunch, you come back, and you’ve got tens of thousands of rows on your computer.

You can also search multiple sources in real time. For this one I went and built what we call connectors to a bunch of job-search websites: Indeed, Monster, CareerBuilder, etc. Then I’m doing a search for engineering jobs in California, and in real time it’s going out to all those websites, executing those searches, and streaming the data straight back into the application. There are dashboards for display, very simple stuff, a bit out of date, but basically you can see your data. Most importantly, everything I’ve just shown you is available over an API for immediate integration as soon as you’ve done it. So, pretty powerful.

For this background I’m going to very quickly run through some use cases, and then we’ll talk about how this can be used with Tableau. So, very quickly, how people are using it today. British Red Cross, a very simple use case: they wanted an iPhone application with hospital data from the National Health Service. There is a website with this data on it but no API, so they built a connector the way I just showed you. It took them minutes, and now in their iPhone application they
have live hospital data. Nice and easy.

A slightly more complicated use case: HP. These guys sell laptops via channel partners like Amazon, etc., and the channel partners are meant to observe a minimum retail price, which is a price beneath which no further discounting is allowed. But sometimes they’re a bit naughty: they’ll discount below MRP for a couple of hours, steal the sales from the other channel partners, and put the price

back up, hoping no one’s noticed. This is a problem for HP and for the channel partners. So what they did is they built connectors to about 50 partner websites, and they can now monitor live price data in their existing business process, which for them was [inaudible].

Final example: a large recruitment company. What these guys wanted is that as soon as a job appears on the careers page of a large company’s website, that job goes straight into Salesforce, so that one of the recruiters can do what he does best: immediately pick up the phone and fill the vacancy. They tried writing custom code themselves to do this for four thousand websites. [Aside about a kitten on the slide, partly inaudible.] So, they tried writing custom code themselves for four thousand websites. Scared cats, because every website is different. And the estimate for how long that would take, remember those numbers from before, for four thousand websites is going to be in this range. Let’s call it 39 years, to be fair and go in the middle. With import.io it took five people two weeks because, if you remember, no developers are required; it’s all point-and-click.

So that’s a bit of background: where we are, how it’s being used, and a very quick run-through of the technical capabilities. Because I want to talk about Tableau, actually, let’s talk about Porsches first. OK, I would love a Porsche, but I know nothing about cars. I just think they look pretty, and I’ve been told 911s are quite easy and forgiving to drive. So I’m going to need some help. I’m going to use import.io and Tableau together, live demo time, to try and help me get a good Porsche. Everyone loves a live demo.

I’ve been on some websites, and this is a Porsche dealer website; that’s what it looks like. This is in Google Chrome, and this
is in Redwood City, not too far away from me, and it’s got all this data about Porsches. What I’m going to do is open this up in the import.io browser and extract all the car prices and all of these cars from here up into the cloud. So let’s click the button to say "let’s get cracking". First it’s asking me how I want to extract the data. I’m going to go with the very simplest, just an extractor, and say yes, I can confirm I’m on a page with the data on it that I want. Then it detects the optimal settings, and it asks me: is the data still visible in the browser? No, it’s not, so it’s going to change some settings and try again. It’s asking me whether the data I want is now visible, which it is, so I say yes.

For the sake of speed I’m going to show you our auto table extraction, because it allows us to get straight into Tableau very quickly. You can see it detecting the tables on the page. There’s only one table on this page, it looks like, so I can just click on it and say "extract table", and it extracts it straight down into the table at the bottom. The rows are correct, the columns are correct, so I’m done training, and I upload this to the platform and call it "Porsches for San Diego". What this is doing is creating that API on the platform. I can also say "show me the data", and it’s going to show me the data I just extracted. Pretty straightforward.

[Audience: Are you working in the browser?] This is our browser; this is our application. It’s kind of a browser, yeah. So I also know from

having looked around a little bit that basically there are loads of these sites. Maybe it’s actually done by Porsche; it’s obviously a service they provide to their dealerships, that they can host a website for them on porschedealer.com, and basically all the websites look pretty similar. So I reckon I can use that extractor that I built with all of these different Porsche websites. This is the tedious part: I copy and paste these things and add some URLs. [Pause while pasting, partly inaudible.] I’ll paste this final one in and I’ll explain what’s going to happen.

What we’ve just done is this: using the browser on my machine, I created an extraction pattern for that one Porsche dealer website. I then published it, and that created an API on our platform in the cloud. So now I’ve effectively got an API that takes a URL as an input and gives me back structured data. What I’ve just done here on my laptop is paste in a bunch of URLs. This is effectively like a client application, and when I hit refresh it’s making a single API call to the service, using that API that I just built, feeding all these URLs in and getting the data back. We had, I think, about 30 or so rows from just one of those sites, but now we’ve got a couple of hundred. OK, so I’m going to download this as a CSV file.

[Audience: So I’m calling your service?] That’s right. [Audience question about the extractor and the IP address the requests come from, partly inaudible.] That’s right. [Audience question, partly inaudible, about whether the data is stored.] So we don’t store the data. What we’re effectively building is the API, the thing that passes between the two places. We do do some saving of data, but in principle it’s all live in that sense. So what I’ve just done now is just download it to my
laptop and opened Tableau. It’s a bit out of date. [Aside about Tableau updates, partly inaudible.] OK, I’m after a bargain, you see; I don’t have much money. So I downloaded the CSV [partly inaudible], and very simply, I’ll give you the hypothesis, which is that mileage and price are correlated: the more miles on it, the cheaper it will be. Look at that, that looks about right. You can see that as the

mileage goes up, the price comes down. But importantly, with the trend line on this, I basically want to be buying from underneath the trend line. Anyone who buys above it is theoretically overpaying, and I want to be underneath. So there are some interesting things here. This one looks good: very cheap, well beneath twenty thousand dollars, only 7,000 miles. Let me have a look at the underlying data and see what it is. It looks like this is actually a Honda Accord; it’s a Porsche dealer that sells cars that are not Porsches. You know this stuff well yourselves, but what I’m going to do to help me is go to the image, get the alt text of the image, and put that on as a label. Sure enough, I can see down here we’ve got an S-Class Merc, a Wrangler, and an Audi A4; all the outliers are not Porsches. So, very simple, as you know: just do a filter.

[Audience question, partly inaudible.] When I extracted it, sorry, maybe I didn’t scroll over, but this is all the data. I didn’t actually show the fact that it pulled all the images, etc. With the auto table extraction it also detects the data type, so when there are currencies it’s typed it as a currency and broken it out as the currency and the number. With the images we’ve got the alt text and the URL of the image source [partly inaudible], which is how I was able to do that so quickly. [Audience: You didn’t see me writing formulas in that cell or anything.] Exactly right.

I’m going to go down here and get rid of some of these things that are not Porsches. Very quickly: there are some Audis here. I’m keeping my Porsches. C-Class, OK, get rid of that as well. Caymans are fine. [Partly inaudible.] Panameras, those are Porsches. [Audience question about the number of rows, partly inaudible.] Yeah, I pulled
in [inaudible]; it was 98, so I can check. I can show you pagination in a moment.

[Interruption from an organizer, partly inaudible: a few people are heading over to Sweet Labs, attendees can let themselves out, and thanks for being here.]

So, very quickly, I can see these are now all Porsches, and I can even see there are some 911s around here which I’m able to get kind of under the money. I came back to the presentation because this is prompting some questions. I think I can find something within my budget. [Partly inaudible.]

OK, let’s have a quick look at similar things in more detail. I want to go back to the live side and show you extractors, because what I did was very quick and I used a new feature, this auto extract. The auto extract is a very small part of what we do. It’s just come out, and it sometimes works and sometimes doesn’t. What almost always does work is training it

yourself. Remember those screenshots in the beginning where I selected a row and then another one? I just want to very quickly show you that. I’m going to navigate to a website with data on it from my favorites. I’m hoping to go mountain biking this weekend, I like the outdoors, so I’ve got REI here, and [inaudible] shoes. This is one of their category pages and it’s got a bunch of data here. I click the button as before, "let’s get cracking", and build an extractor. It detects the settings, it’s asking me if the data I want is still in the browser, and it is, so I say yes.

What it’s doing there is this: we want the API that gets built on the server to be the most optimal possible, so it doesn’t run JavaScript if it doesn’t need to in order to get the data, and it’s doing a number of things like that. This interrogative way of building it, questioning you, is to help with that. You don’t see that green thing around anything here because there are no HTML tables, so it can’t do the table auto extract, but I can teach it where the data is.

The first question it asks me is: is it just a single row of data, or are there effectively multiple rows? [Aside, partly inaudible.] This quite clearly has multiple rows of data. So the first thing I’m going to do is teach it where the rows are. I select a row and train it; there’s a dim blue highlight around it, which you probably can’t see on the screen. And again, a pink highlight. And now there’s actually a pink and blue highlight around every single shoe, so it’s got all 30 rows. Then I add in some columns. One for the name, and I can actually type this: you’ve got text, but I’m going to type this as a link, because it’s also a link to an underlying product page. And I’m going to add another one in for price. So if you wanted all outdoor-wear product, like,
product data from every single online retailer, you would just make sure you called the product name "product name" on every single one. I’ll do the image as well. And I’m going to show you the price as well, so I’m going to add in another column for price. I notice here we’ve got sale prices on some items and not on others. If I were just to train it with one of the examples and say this is an example of price, you’ll see it’s actually pulling all of the numerical data from here, so it’s saying something like "99.93, was 170, save [amount inaudible]". I don’t want that, so I’m going to undo this. [Partly inaudible.] OK, so I’m actually just going to select the price here and train it, and it only manages to extract one, because of the nature of that example. Then I generalize and say, actually, when it’s just a retail price with no sale price, that’s good for the price too. Train it again, and you see it’s keeping the 99.93, and it also extracts the full price from the other items. Add an

image back in. [Partly inaudible: a price of 117.93 is shown.] [Audience question, partly inaudible, about whether the data is live.] Exactly right, it’s all live, all real time. So I’m done training. I upload it and call it "REI San Diego category". What we’re going to see first is the data that I just trained it with, the data that I extracted from the website. But if I go to REI and have a look at another category page, say climbing harnesses, I can change this and it will pull fresh data through. [Audience: So it wasn’t trained on that page previously?] Exactly right.

So that is extractors. Now crawlers, and I’ll use the same site. You see there’s data on the category page, but there’s also data on the product page as well, and I want to show you how that works. If I go to REI, I’ll just click on one of these and grab an example. So this is an example product page, and I’m going to build a crawler, because I want product data for all of their products, and getting at it through the categories is one way but it’s not ideal. So, "let’s get cracking". I’m going to choose a crawler, and it says OK, let’s extract some data. I’m on a page with the data on it that I want; detect optimal settings; it asks me if the data is still there, and yes, it is. It’s detecting tables on here, but nothing so interesting that I want it. Because there are no rows here, effectively this is a single row: it’s a single product, so it’s a single-row product page. So it’s not going to ask me to teach it where the rows are. I’m just going to add some columns straight in for name, image, and price, typing the image one as an image. I select an example of a name and train it; this is an
example of the image, train it, and I’ve got all the images; and there’s my price. OK, so I’ve got what I need. Because it’s a crawler, and also because it’s a single page, it asks for five examples. Now, I’m not going to waste our time and sit here doing five examples, because I actually did one already and I’ll show you what it looks like. You do that five times on five different product pages so it’s got some variety. That will (a) ensure that the extraction pattern is working and (b) give a variety of different examples for the platform to try and work out how best to crawl this website to get to all the data you want. So I’m going to go back to the edit screen. This is the state that we were just in, but now you’ll see it’s got five pages already; you can see them on the right-hand side.

[Audience question, partly inaudible.] Well, this [inaudible] is the extracting

from the page, and the short answer is no, it doesn’t. It doesn’t say "oh, there are these things that you might be interested in"; it doesn’t suggest things. [Partly inaudible.] It’s got some output: you can actually get it to write the logs to a text file for you, and if you’re technical, or interested, I should say, you can actually start getting into that. But here’s what we’re trying to do. Our alpha was super technical, super powerful, and super difficult to use; you could do all sorts of amazing things. The shift from our alpha to our beta was all about taking lots of features out, making it less powerful but much, much easier to use. And we’re in a process now of gradually putting these powerful features back in on a sort of bang-for-your-buck basis. All of you know Tableau; Tableau does a good job of making something hard easy, and it was a reference point for us as we were going through this. We still think this is too complicated. We want to take all of the complexity out to make it very simple, but we’re discovering that’s one of the hardest things to do.

So anyway, I had my five examples, and I just clicked OK to run the crawler. It’s brought up the crawler screen, pre-populated with my five examples, which is where it’s going to start the crawl from. If I like, I can get into the advanced options here and fiddle around with these to my heart’s content to understand how crawlers work. But it can just use the example pages that I’ve given it to try and work out some defaults, so I can just say go. I’m not doing anything, and it will start pulling data. OK, right, so we’re now into territory where it’s pulling data which I’ve not trained it for. I only
trained it on five examples and we’ve already got 14. I’ve got it dialed down so it’s not going too quickly, but I could go to lunch now, come back, and I’d have lots of data to play around with in Tableau. I’ll stop that and upload the data. So it’s just added the data into this thing, and I can download it and play around with it.

What else do we have? We’ve got crawlers; connectors. I’m going to show you one of these. [Aside about the screen resolution, partly inaudible.] A connector is for the situation whereby, in order to get data out of a site, you’ve got to put data in. A search box is the most obvious, but there are lots of different situations; you can actually use what we’re doing to write data automatically as well, into web forms. I’m going to show you that to make it very clear how it can be used. I click the IO button in the top right-hand corner to bring up the bottom screen, "let’s get cracking", and I choose a connector this time. I go to the search box, or whatever it is that you want to interact with, and say I’m there. Then I hit record to track my actions, and it’s recording now. It refreshes the page, and I’m going to do an example search for glasses, click the search button, and see whether the data comes back, which it does. If data was not coming back I could say "not working" and it would restart the process with JavaScript turned on, and there are a couple of opportunities to do that if it’s a complicated site: maintaining session integrity, that kind of stuff, if you’re technical. So we’ve got from the search box to the results page, and I just click stop. It’s now going to work out what it can do. Is the data I want still in the browser? Yes, it is. It’s suggesting I can use the auto table extract.

The first thing I’m going to do, though, is this: the recording is going to be available on the platform, and every time it hits this recording it’s going to type "glasses" into that search box. I don’t want it to do that. I want it to pass in a variable, like
sometimes it might be sofas or chairs or tables. So I’m going

to take that text input, which is recorded here, and set that as an input. It’s picking out the name "query" from the website itself; if you wanted to, you could call it "product name" or whatever. Then I take the next step, and I’m going to train it by hand. There are some questions around pagination, because pagination is common in search scenarios. Does the site show the results count? It does, there’s a result count there, so yes, I’ll train for that. And you can extract data from pages two, three, four, etc., so shall I train pagination? Why not. So it’s going to play back using the example I’ve just given it. The yellow visual feedback is more for you; obviously it runs a lot quicker on the server, but it’s for you to check that everything is good. I’m going to train the rows. We’ve got some interesting JavaScript here, and I think that’s because I didn’t get it to turn JavaScript on, but that’s where we’ve got the [name unclear] section. Train it. OK, we’ve got our data, all the rows; it’s got its rows. Add a column, which I can name, and put in the name, description, whatever; I’m going to combine things. You’ve seen how the training works, so I’m just going to move on quickly. I’ve got what I need. This is where it takes the results count. Then I go to the second page. Because I’ve disabled JavaScript, it’s skewed the formatting of this web page, so trust me when I say that it would work; I’m going to skip it for now and we’ll do the final one, because this will bring us to the end and I want to show you the live side of things. Playing it back: no training should be required, and it will just be confirmation. Yes, I do have data: all 56 rows are highlighted, all of the names are coming through. I’ve got what I need. I’m done; upload to import.io. [Partly inaudible.] Reminder to self: in the future
always enable JavaScript on the site. OK, show the data, and we’re going to see that familiar spreadsheet-like view. We’ve got the two queries that we trained it with. I’m going to get rid of one, type in and search for "chairs", and again, data’s back. And obviously we’ve seen the data includes URLs, images, etc.

Integration is the final thing. I gave the promise at the beginning of turning websites into APIs or spreadsheets. I think I’ve overdone the spreadsheet side, so I’m going to finish by showing you how the APIs work. There is an "integrate" button on every data table where you’re viewing data, and if you click it, it takes you to a view that looks like this, with a bunch of integration options. We’ve got ways of integrating live data into both Microsoft Excel and Google Sheets, and also into your favorite programming languages. For the sake of showing them here I’m going to use curl, which is a command-line tool and a very simple way of demonstrating an API. [Partly inaudible: something about my API key in this example.] Here is my terminal, and I can run this and I get back JSON data.

[Audience question, partly inaudible.] Yes. So, as I think you would

remember from an earlier example, some people integrate it into iPhone applications, where it’s live and the input data is originating from a user playing on their iPhone, and some people are using it in back-end processes, like [inaudible] in retail. There are lots of different use cases. The best thing about doing what we’ve done is the crazy ways that people use it.

One of my favorites is a company called onefinestay. [Partly inaudible.] They feel like a hotel. The difference between onefinestay and Airbnb is that with Airbnb you advertise your property, people sign up to stay, and then you’ve got to be there giving them the keys, changing the sheets, making up the beds; you’ve got to be around and about. With onefinestay these are all very nice, very expensive properties in places like Los Angeles [partly inaudible]. You give them the keys and you walk away, and they then run it like a hotel: they have people who come in to your property when you’re away and change the sheets [partly inaudible].

They had a problem, which was that these are expensive, and if a property isn’t filled on a Friday evening it’s effectively like dead inventory sitting there. So you want to optimize the pricing. They hired a bunch of statisticians and [inaudible] developers to build a new variable pricing model. What they did, when they started, was use import.io to do a piece of experimental research. They went to the booking websites of five-star hotels in all of their major markets [partly inaudible], because they see their competition, the alternative to staying with them, as staying in a five-star hotel. They built connectors to these hotel booking pages and effectively booked, or at least reserved, the same hotel room in every hotel [partly inaudible] five and
four [partly inaudible] checked out. So effectively they could directly inspect the variable pricing of all of these five-star hotels. Some of them were flat, with no variation in the price; some of them had a kind of weekly cycle, peaking Friday to Sunday. And what was particularly interesting was the trends as well: the Olympics were in London [partly inaudible: a comparison of prices around the Olympics, January, and Christmas]. I never dreamt that anyone would use it that way.

[Audience question, partly inaudible, about whether this version can drive any form input: a URL, a click, a drop-down, a radio button.] Yeah, so at the moment, with the recording, you could achieve that, but only manually. [Partly inaudible: it is text inputs that can be made variable.] There’s a long story about why we chose to do that; basically it requires a lot of work. In fact, in the example of the hotels guys, there are ways of doing it using a bit of [inaudible].

[Partly inaudible.] I should say, we’re not Tableau; we’re not as big as those guys. We’re very small, and we have [number unclear] in London

[partly inaudible] very, very fun. [Audience question, partly inaudible, about a user community.] We do have a user [inaudible]; at the moment it’s [partly inaudible], and there’s something like a Stack Overflow-style community which we’re looking at, yeah.

[Audience question, partly inaudible, about sharing.] The things you create are sort of private. They’re private in that the URL has a GUID, a random code that identifies the thing that you built. You can share that with your friends or post it on the internet [partly inaudible], but by default it’s private.

[Audience question, partly inaudible, about extracting a report view that sits behind a login, with a unique URL per manager.] It maintains session integrity, and that’s the whole point of a connector: the recording works so that you can actually navigate into a website and do that. If the data is not available just from the URL, then you can navigate to it. [Partly inaudible.] In future you could do all sorts of other stuff: you could navigate to a page to pick up a cookie [partly inaudible] and then visit that website with it. But it’s all point-and-click: set it once, run it many times.

[Audience question, partly inaudible, about scale.] It scales [partly inaudible]. My co-founder Matt is a friend of mine from Cambridge, and when I persuaded him to start this [remainder inaudible; the recording ends here]
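
[Editor’s sketch, not part of the talk.] The API step demonstrated with curl during the demo (a published connector that accepts an input such as a search query or a page URL, plus an API key, and returns JSON) could look roughly like this from Python. The endpoint, the parameter names, the connector ID, and the "results" field are all assumptions made up for illustration; they are not taken from the talk or from import.io’s documentation.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
# This is NOT import.io's real API; consult the product's own documentation.
API_BASE = "https://api.example.com/connectors"


def build_query_url(connector_id: str, inputs: dict, api_key: str) -> str:
    """Build the GET URL for one query against a published connector."""
    params = dict(inputs)        # e.g. {"query": "chairs"} or {"url": "..."}
    params["api_key"] = api_key  # assumed parameter name for the API key
    return f"{API_BASE}/{connector_id}/query?{urlencode(params)}"


def rows_from_json(body: str) -> list:
    """Pull the list of extracted rows out of a JSON response body."""
    payload = json.loads(body)
    return payload.get("results", [])  # "results" is an assumed field name


# Offline demonstration with a made-up response, so no network is needed.
sample_body = '{"results": [{"name": "Example chair", "price": 49.99}]}'
url = build_query_url("abc123", {"query": "chairs"}, "MY_KEY")
rows = rows_from_json(sample_body)
print(url)
print(rows[0]["name"], rows[0]["price"])
```

To send the request for real, the URL could be passed to `urllib.request.urlopen` and the response body handed to `rows_from_json`. Keeping URL construction separate from parsing means the same two helpers cover both cases mentioned in the talk: a search term as input (the connector) and a page URL as input (the extractor).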