AWS Summit Series 2016 | Chicago – Deep Dive on Amazon S3

Hello, good afternoon, and welcome to the deep dive session for S3. My name is America Lonnie, I'm a product manager on the S3 team, and helping me out is Carl Summers. "My name is Carl, I'm a software engineer with S3. I spent a couple of years on the front door, the REST API, and most recently I've been developing the event notifications features."

All right, so let's get started. We have a lot to cover today. This is a deep dive session, so we're going to try to keep it as detailed as possible, but there are a lot of new capabilities we've added to S3, so we're going to be focusing a lot on those as well. I'm sure there will be questions at the end; Carl and I are going to be here for as long as it takes to answer them. We encourage you to come up and talk to us, give us your feedback, and we'll answer any questions as well.

So with that, let's start with a bit of trivia. On the slide behind me are all the storage services that AWS offers as part of the storage portfolio. Who here can tell me which one of these services launched first? S3? Yeah, you are in the S3 session and both of us are on the S3 team, so that may be leading you a bit, but that's right: S3 was one of the first services that AWS launched.

This leads into how we think about storage all up at AWS: for us it's about freedom of choice. Very early on we heard from customers that not all data is built equal. Data comes in different formats, and as such it requires different capabilities from its storage. You have data that's file-based, and you need file-based semantics in order to use it appropriately; you have block storage, which you might want to use with your compute for persistent storage; and then of course you have object storage, which we have S3 and Glacier for. We want to continue to offer the right breadth of capabilities for our customers, so that the right kind of data, in many cases more than one type per customer, goes into the right kind of storage. And it's not just about adding breadth to storage; we make sure that we continue to innovate in each one of these storage services, as we have all along, and you'll see in our session today how we keep thinking about making it easier for customers to get more value from their data in S3.

Another aspect, especially as we work with customers through their journey of migrating data from on-premises to the cloud, is how we transfer data from on-premises to the cloud and back. There are a number of different services we are continuing to invest in that help you move your data in both directions. You have the ability to use Direct Connect to establish a high-throughput link between your on-premises network and the cloud. Last year we announced Snowball, and as of this summit you can now move up to 80 terabytes of information by getting a Snowball appliance shipped to you, connecting it to your network, moving the information onto the device, and shipping it back, so it lands directly in your data lake in S3, where you can then start getting a lot more value from that data by using the great ecosystem of services and capabilities we provide around it.
We work very closely with our ISV partners to make sure that all the on-premises storage appliances that you know and love work seamlessly with AWS and S3. We also launched Kinesis Firehose, which gives you the ability to stream data directly into S3 and Redshift, for example streaming sensor and device information for those IoT-related use cases. And for this summit we announced S3 Transfer Acceleration, a new capability that lets you leverage the broad network we have for CloudFront and upload data faster to S3, by using CloudFront's network and reaching out to your closest point of presence. We'll talk a bit more about this, but there is also a dedicated session for it at the summit, so I would encourage you to go look at that. And then of course we have Storage Gateway, which allows you to work seamlessly between your network and the cloud. So there's a lot of breadth in terms of services and a lot of innovation in each one of these capabilities, and we're going to be focusing on S3 for our session today.

Before we move on to some of the new capabilities we've launched, though, I want to take a minute and talk about our core customer promise. Customers really think of S3 as the place to put data so that when you go back for it, it's going to be there when you need it. I know my data is going to be durable: we promise eleven nines of durability and four nines of availability, so when I need my data it's going to be there, available for me, and it's going to scale seamlessly. We're working with customers starting at hundreds of petabytes all the way up to exabytes, and having the ability to scale seamlessly to that level is a big part of what we provide. We regularly peak at millions of transactions per second and have trillions of objects in S3. That scale makes my job very interesting, because a lot of the energy we put into any capability we add to the platform goes back to our core customer promise: keeping those eleven nines of durability and four nines of availability, as well as handling S3 scale. It's one thing to add a replication capability that replicates billions of objects from one bucket to another; it's a very different engineering problem when it could potentially be, and in many cases is, trillions of objects. How do you design a system that scales to that while still giving you the same performance, durability, and availability? That's something we continuously think about on the team, and for any capability we provide, we make sure we stay true to that promise.

So with that, let's talk a little bit about some of the innovation we've been doing in S3 over the past years. I'm not going to go into detail on each of these in the interest of time, but if you have questions around them at the end, feel free to come talk to us; we'd love to hear your feedback.

One of the feedback items we hear from a lot of customers is: I put data in S3, and that's great, but I need to be able to use that data and derive value from it, and a big part of that is using S3 as a data lake so I can leverage all the capabilities and services AWS has to offer. True to that, one of the capabilities we built last year was event notifications on S3, where you can add a configuration to your bucket which then triggers a notification on put events or delete events, whatever you configure, and delivers it to multiple destinations in AWS. You can send a notification to SNS, you can send it to SQS and pick it up from your application and act on it, or you can send it to a pre-configured Lambda function, which is compute in the cloud that scales dynamically, and take specific actions there. There are a lot of very interesting use cases customers are building with this. The more common ones are transcoding: when a new file shows up in S3, maybe I want to create a thumbnail or encode that video in a different format. Or I want to build a secondary index, for instance in DynamoDB, so when a new object gets put I trigger a function that updates my index, and when an object is removed I want to take action as well. So there are a lot of very interesting things you can do with this event-based computing paradigm. And as we do in S3, we continue to innovate: another capability we added there is filtering by prefix, so you can trigger a notification not only for the entire bucket but for a specific prefix.
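As a rough illustration, here's a minimal sketch of that kind of notification configuration using boto3; the bucket name, Lambda function ARN, and prefix are placeholders, and the function would already need permission to be invoked by S3:

    import boto3

    s3 = boto3.client('s3')

    # Assumes the Lambda function already exists and has been granted
    # invoke permission for S3 (via lambda add-permission).
    s3.put_bucket_notification_configuration(
        Bucket='my-media-bucket',
        NotificationConfiguration={
            'LambdaFunctionConfigurations': [{
                'Id': 'thumbnail-on-upload',
                'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:111122223333:function:make-thumbnail',
                'Events': ['s3:ObjectCreated:*'],
                # Only fire for objects under this prefix.
                'Filter': {'Key': {'FilterRules': [
                    {'Name': 'prefix', 'Value': 'uploads/images/'}
                ]}}
            }]
        })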
Another capability we launched is cross-region replication, specifically for those use cases where you have compliance-related requirements and you need a copy of your data hundreds of miles apart. We work with a lot of financial customers who need to make sure that, for compliance reasons, their data is stored some specific geographical distance apart. Or I want to move a copy of my data closer to my users, so that when my end users access the object they see the best possible performance, because it is physically closer to them. Another use case is protecting against a rogue actor: we see customers giving access to their logging information, for instance, to all of their organization so people can move fast and use that information, while at the same time replicating a copy of it into another region which is locked down to just the administrator, which provides that extra protection.

We also launched VPC endpoints for Amazon S3. With a VPC you can create a closed network, your own private cloud, and make sure your application cannot be accessed from the outside world and vice versa. But a lot of these applications do use S3, and previously you had to set up an internet gateway or manage NAT instances for that. With VPC endpoints for S3, you only need to put a policy on your VPC endpoint, and then, without your traffic ever going over the internet, you can talk to the specific S3 buckets you set up. And that's important: you need to be able to configure which VPCs talk to which buckets, and which buckets allow API calls in from a specific VPC. You can control both of these using a combination of VPC endpoint policies and bucket policies.
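To give a feel for the bucket-policy side of that, here's a sketch that denies access unless requests arrive through a specific VPC endpoint; the bucket name and endpoint ID are placeholders, not values from the talk:

    import json
    import boto3

    s3 = boto3.client('s3')

    # Careful: a Deny like this also blocks console and other access paths
    # that don't come through the endpoint.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowOnlyFromMyVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": ["arn:aws:s3:::my-private-bucket",
                         "arn:aws:s3:::my-private-bucket/*"],
            "Condition": {"StringNotEquals": {"aws:sourceVpce": "vpce-0123456789abcdef0"}}
        }]
    }
    s3.put_bucket_policy(Bucket='my-private-bucket', Policy=json.dumps(policy))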

Now, as S3 grows in scale, one of the feedback items we hear from customers is: I need a better way to understand how much data I have in S3 and how that is changing. Responding to that, we launched specific CloudWatch metrics for S3. For free, for all buckets, you can now monitor the object count for your bucket and see how it trends over time, as well as the byte count for the Standard, Standard-Infrequent Access, and Reduced Redundancy storage classes, so you really get an idea of how your storage is changing over time.

We also integrated with CloudTrail, so for bucket-level API and configuration requests you can now track when a specific API call was made, what the response was, and who made it; that gets added to your CloudTrail logs, so you have it available for auditability and you can go back in time and see who changed what configuration and when.

We also increased the bucket limit. By default you can create up to 100 buckets in your account, and we find that for most use cases that is more than enough. There are certainly use cases, however, where a higher bucket limit is required, so if you think you need one, open a support ticket and we will work with you to increase your limit to an appropriate number.

Now, S3 is a distributed system and as such is eventually consistent, which means that if you delete or update an object, you will eventually see the latest version of that object. However, we heard from customers that, specifically for new objects coming into S3, it would be great to get read-after-write consistency: if I put a new object, I want to be able to read it immediately and take action on it. So last year we announced that, for all regions and all endpoints, we have read-after-write consistency for new objects. You are still eventually consistent after updates, deletes, and other operations, however.

So that is a quick look at some of the capabilities we've been working on. Of course it doesn't stop there. We also launched a new storage class last year, which is seeing great adoption and great feedback: S3 Standard-Infrequent Access, and I'll go into more detail on it. We also just announced this year two new lifecycle policies that help you manage your data better. The idea behind lifecycle policies is that for those repetitive, cumbersome tasks around cleanup and managing the life of individual objects, especially at S3 scale, the more of that work we can take away from you, the customer, the more you can focus on building great applications and workloads on top of S3. True to that, lifecycle now has two new policies: one for expired object delete markers, which we'll cover in more detail as we progress through the presentation, and another for incomplete multipart upload expiration.

And as I mentioned earlier, we launched S3 Transfer Acceleration. We leverage the great network we have with CloudFront, as well as other network optimizations, to increase the upload speed of your data by up to 5x depending on network conditions, and it's very easy to use. There are basically two steps: first, set a configuration at your bucket level to enable transfer acceleration, and then, for the specific put APIs where you want to use it, simply change the endpoint to the transfer acceleration endpoint. That's about it; you normally don't need to make any other changes to your application, and we'll make sure your data gets to S3 in the fastest possible way.
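Those two steps might look roughly like this with boto3; the bucket and file names are placeholders:

    import boto3
    from botocore.client import Config

    s3 = boto3.client('s3')

    # Step 1: enable acceleration on the bucket.
    s3.put_bucket_accelerate_configuration(
        Bucket='my-upload-bucket',
        AccelerateConfiguration={'Status': 'Enabled'})

    # Step 2: create a client that uses the accelerated endpoint
    # (<bucket>.s3-accelerate.amazonaws.com) and upload as usual.
    s3_accel = boto3.client('s3', config=Config(s3={'use_accelerate_endpoint': True}))
    s3_accel.upload_file('big-video.mp4', 'my-upload-bucket', 'videos/big-video.mp4')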
So let's dive a bit deeper into the new storage class, Standard-Infrequent Access. The way we think about this is that for active data we've had S3 Standard, and customers know and love it: you get the best possible durability and the best possible performance, and regardless of how actively you use the data, S3 Standard is the best storage class for it. On the other end of the spectrum you have archived data: data that's cold, that you're not really using, that you want durably stored but don't need immediately. For that, of course, Glacier is the best possible choice.

The way we see a lot of workloads play out, talking to customers, is that data and its characteristics change over time. I might have an object, let's say a training video for our enterprise, that I upload to S3; it's training season, so everyone is looking at it. Or there might be a document I'm working on with my colleagues; it's new, I'm actively working on it, and it's being very actively accessed. But over time there comes a point when I want to archive it, have it available, but not actively use it. That doesn't change overnight; it's not a flip of a switch. Over time these characteristics change: the data is still important, the video is still important, people still look at it, but it is less frequently accessed. That's where Standard-Infrequent Access comes in. It is the storage class specifically designed for data that is less frequently accessed but that you need immediately when you do need it, so we made sure you get the best possible performance while saving on storage costs because the data is accessed less often.

There are a number of very interesting use cases we're seeing with this. File sync and share for documents is one I talked about a little bit; backup and archive is another.

Disaster recovery specifically is very interesting, because I might be backing up my data, but when I need it for disaster recovery purposes I need it immediately. I don't want to wait for that data, because this is about business continuity for me, so I need access right away even though it's a very unlikely scenario. Similarly, long-retained data such as log data: we're seeing a lot of that, where logs are something I might not actively use. If there is an issue with my application I really need that information right away because it's an operational issue, but odds are I don't run into that issue, which means the data is very infrequently accessed. And there are a lot more interesting use cases we're hearing from customers as well.

So let's look at some of the characteristics of Standard-Infrequent Access. We wanted to make sure that as a customer, when you put data into S3, you never have to worry about losing it, so across the board, from Standard to Standard-Infrequent Access to Glacier, you get the same eleven nines of durability; durability is exactly the same as Standard. Standard-Infrequent Access is designed for three nines of availability, which means 99.9 percent of the time when you access an object you will get it, and if you don't, you can issue a get request immediately after and get your object. We also wanted to make sure you get the same performance, and that's very important to our customers; what we heard is: yes, this file is less frequently accessed, but I don't want my end users to pay for that in user experience. When my user requests a file, they should get it with the same performance as Standard, and that is what we've delivered: you get the same throughput as S3 Standard.

Another important aspect is ease of use. Not only can you keep objects in Standard, Standard-Infrequent Access, and Glacier within the same bucket, you don't need to change the prefix, the location, or your API, which means you don't need to make any changes to your application; you simply tier your data from Standard to Standard-Infrequent Access and your application just works.

Security is always top of mind for AWS and S3, and we've added a lot of capability around security and encryption over the years. You can encrypt an object using server-side encryption, you can bring your own key and encrypt the object even before you put it in S3, or you can use KMS-managed encryption keys. S3 Standard-Infrequent Access works with all of these, just like Standard does. It is also integrated with all the other capabilities we continue to add to S3: it works with the lifecycle platform, it works with versioning just like Standard, you can fire an event notification from Standard-Infrequent Access just like you can with Standard, and you get metrics for this storage class as well.
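For example, a put that writes straight into Standard-IA with a KMS-managed key could look something like this sketch; the bucket, key, and KMS alias are placeholders:

    import boto3

    s3 = boto3.client('s3')

    with open('report.pdf', 'rb') as f:
        s3.put_object(
            Bucket='my-docs-bucket',
            Key='archive/2016/report.pdf',
            Body=f,
            StorageClass='STANDARD_IA',       # write directly into Standard-IA
            ServerSideEncryption='aws:kms',   # server-side encryption with KMS
            SSEKMSKeyId='alias/my-app-key')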
Now, one very common question I hear from customers around this is: it's great that I can put infrequently accessed data into Standard-Infrequent Access and save on it, but more often than not I have specific workloads or applications running within the same bucket under different prefixes, so how do I know which of my prefixes are infrequently accessed and which are not? For our demo today we're going to answer that specific question using a capability that's available today in S3: server access logs. We've had server access logging for a while; you can enable it on a bucket, and all of your API calls get logged into log files within your bucket. Those are individual API requests with timestamps, though, so in the demo Carl has set up we leverage one of the very common use cases for S3, big data on S3, and we've written a small big data job that processes these logs and gets us to the specific insight we're looking for: which of my prefixes should I tier to Standard-IA, given the data.

So with that, let's look at the architecture of what we're going to show you. It's actually seven simple steps. We're going to enable access logging, a free capability. We're going to put the logs in our data lake in S3. We're going to leverage the power of EMR and write a very simple Spark application that aggregates these logs by the hour and filters by operation, so we can pull out the total number of gets per hour, the total number of puts, and so forth, and then visualize those. We also want to filter by response code, so that, for instance, successful gets are what we look at. We're also going to persist the interim results to S3; that's a very common usage pattern with big data jobs, where once you've projected structure onto your data and pre-processed it, you don't want to do that again, because you might have to come back to it for the same or a different analysis. So it's a great best practice to persist that interim data, such as a Hive table for instance, to your S3 bucket, so your big data job can start from there and save a lot of time when you re-run the analysis. And of course we're going to persist the aggregated results in S3 and then use visualization tools to drive insights from that aggregated information.
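Step one, enabling access logging, can also be done programmatically; here's a minimal sketch with boto3, using placeholder bucket names and prefix (the target bucket needs to grant the S3 log delivery group write access):

    import boto3

    s3 = boto3.client('s3')

    s3.put_bucket_logging(
        Bucket='my-app-bucket',
        BucketLoggingStatus={
            'LoggingEnabled': {
                'TargetBucket': 'my-app-bucket',   # same or a different bucket
                'TargetPrefix': 'access-logs/'
            }
        })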

There it is. All right, so let's start with enabling access logs for our bucket. Once you go to your bucket, under Properties you have server access logging available. You can simply use the UI and click Enable, put the access logs either in the same bucket or a different bucket, and specify a prefix, and from there S3 will start generating access logs for you.

Let's look at the logs themselves to get a feel for what the logging information is. For every single API request that comes to S3, we log that request along with interesting information about it, such as the turnaround time, the response code, and the bucket and key being accessed. This is what your access log looks like, and if we highlight one of these entries we can get a feel for it: you have your key, you have your timestamp for when the specific API call was made, you have your operation, this one was a put, and then the response, the turnaround time, and so forth. That's just to give you an idea of what access logs look like. Of course, this by itself doesn't answer which of my prefixes are infrequently accessed, and that's where we need to get to as part of our demo.

So the next step is to set up an EMR cluster, and we're going to use Spark for that. It's super easy to set up: we create a new cluster and give it a name. Logging, again, you can enable and put in your data lake in S3, but we'll keep the defaults. We'll use Cluster as the launch mode and pick the latest release version; for applications we can select all applications, although we could have selected just Spark, because this is a Spark application. Then we pick an instance type, we could leave this as m3.xlarge, and pick the number of instances depending on the amount of logs you have and the overall traffic you have, which, by the way, you can see using CloudWatch metrics as well. Then we set up the security access and create the cluster. Let's not actually create it right now; we already have a pretty big cluster set up for you to go look at.

All right, so with that set up, the next step is to add our Spark code. Let's take a look at the code Carl put together for us. It's actually fairly simple: all this code really does is pull the raw access logs from S3, project structure onto them, parse the logs, and then aggregate the parsed results. We are saving the interim results here, by the way; that's one of the best practices I talked about. Once you've done the pre-processing of your logs, it's a great idea to persist that to S3 as well, so the next time I spin up this cluster I can start from the pre-processed data and don't need to do the parsing again.
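The demo code itself isn't reproduced here, but a simplified PySpark sketch of that kind of job might look like the following; the bucket paths are placeholders, and the parsing regex is a rough approximation of the server access log format rather than a production-grade parser:

    from pyspark import SparkContext
    import re

    sc = SparkContext(appName='s3-access-log-aggregation')

    # Roughly: owner, bucket, [time], remote ip, requester, request id,
    # operation, key, "request line", status, ...
    LOG_PATTERN = re.compile(
        r'\S+ (\S+) \[([^\]]+)\] \S+ \S+ \S+ (\S+) (\S+) "[^"]*" (\d{3})')

    def parse(line):
        m = LOG_PATTERN.match(line)
        if not m:
            return None
        bucket, timestamp, operation, key, status = m.groups()
        prefix = key.split('/')[0] if '/' in key else ''
        hour = timestamp[:14]                  # e.g. '06/Apr/2016:21'
        return ((prefix, hour, operation, status), 1)

    logs = sc.textFile('s3://my-app-bucket/access-logs/')
    counts = (logs.map(parse)
                  .filter(lambda x: x is not None)
                  .reduceByKey(lambda a, b: a + b))

    # Persist the aggregated results back to S3 so later analyses can start here.
    counts.map(lambda kv: ','.join(list(kv[0]) + [str(kv[1])])) \
          .saveAsTextFile('s3://my-app-bucket/log-analysis/output/')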
This next file looks at the prefix aggregation. Over here, what we're doing is deciding how we want to add things up. The access logs are individual access requests: you see a put, you see a get, and information about each. What we want to do is add them up by prefix, because we want to answer the question per prefix, and add them up by hour, otherwise it's just too much information. So I want to see the total number of get requests for a given prefix for hour one, two, three, and so forth. The status code is also important, because we want to separate unsuccessful requests from 200s; 400s, for instance, you might want to filter out. Then we kick off the job to do that aggregation, and at the end we persist the final results back to S3. So that really is the extent of the code we have, in two files.

Now let's add the code to our EMR cluster. We'll add a Spark application step, give it a name, and then give it our input options. In our code, if you noticed, we had two parameters: the input location, where my raw logs live, and the output location, the S3 location where I save my final results. So let's copy these arguments over, and there we have it.
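For reference, the same kind of step can be added programmatically rather than through the console; this sketch assumes boto3's EMR client, a placeholder cluster ID, and a hypothetical PySpark script location:

    import boto3

    emr = boto3.client('emr')

    emr.add_job_flow_steps(
        JobFlowId='j-XXXXXXXXXXXXX',
        Steps=[{
            'Name': 'aggregate-access-logs',
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['spark-submit',
                         's3://my-app-bucket/jobs/log_aggregation.py',
                         's3://my-app-bucket/access-logs/',           # input location
                         's3://my-app-bucket/log-analysis/output/']   # output location
            }
        }])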

Notice that in our arguments we specifically give S3 as both the input and output location. The jar files have been uploaded to S3 as well, and once you specify all this you can simply hit Add, and that adds the Spark application step to the cluster. So EMR is super easy to use; in fact, let's go back to the presentation and talk about some interesting best practices when it comes to leveraging EMR with S3.

EMR is easy to use, and there are a lot of key benefits you get by leveraging it with S3. First and foremost, S3 allows you to separate your compute from your storage, which means you can resize or shut down your EMR cluster completely without ever having to worry about losing data, because that data is in S3. Additionally, you can point multiple EMR clusters at the same data in S3. These are key value adds that S3 brings to the table, in addition to the durability and availability of your data.

Another best practice is leveraging EMRFS. EMRFS is an implementation of the Hadoop file system that is optimized for writing to and reading from S3 directly from your EMR cluster, and it brings a number of capabilities that optimize your big data jobs. First, it gives you read-after-write consistency. It also has optimizations that make list performance faster; when you kick off your job, given the volume of your data, you might otherwise wait a while, whereas EMRFS has optimizations built in to make your lists perform much faster. It also has error handling built in, so as far as your big data application is concerned, you have transparent access directly to S3. And of course the other advantage is that any and all capabilities we have in S3 you get to use: encryption, for instance, whether server-side encryption, client-side encryption, or KMS-based encryption, as well as lifecycle. All of the capabilities we've been talking about, and are going to talk about, you can leverage with your big data job; for instance, you don't need to delete your temp files yourself, you can set a lifecycle policy that takes care of it for you.

So let's move back to our demo and start by looking at the output file. Once we do the aggregation we talked about, aggregating by prefix, operation, and response code within one-hour time slots, what does the output file look like? You can see we've transformed the access logs from individual API requests into rows that say: for this given prefix, for this given hour, 21:00, I had 235 get API requests that were successful, with a 200 response code, and so forth, and you see this for all the different timestamps we analyzed.
What we did next is put that output into a pivot table, filtered by access frequency using the counts, so we can see the different API requests. If you visualize that pivot table, each of these graphs shows, for a given hour, the total number of successful get requests. I see the prefix in dark blue, for instance, showing a very high, persistent get count over time; the light blue one is still high but tapering down over a longer period; whereas on the bottom left, the green and the red ones are the low-hanging fruit: yes, they peaked at about 100 gets in an hour, but they taper down very quickly. So for this particular workload, maybe most logs get analyzed in the first few days and then I'm largely done with them. Now I know specifically which of my prefixes are infrequently accessed, and I can leverage a lifecycle policy to go back and tier those out. So let's take a look at that real quick.

So if we go to our bucket, we can set a lifecycle policy for the specific prefixes that we now know are infrequently accessed, based on our analysis. We simply go in, specify the prefix, and then, given the data we saw, we can say in 30 or 60 days, whatever the data supported, transition these objects, and we can do that with high confidence, because the access data tells us that this prefix is infrequently accessed.

All right, let's move back to the presentation and talk a bit more about lifecycle policies. As I mentioned earlier, the goal of lifecycle policies is to take as much of the housekeeping overhead of managing the lifecycle of your data away from you as possible, making it simpler to manage data, especially at scale. There are two kinds of actions you can take in a lifecycle policy: you can transition, to Standard-Infrequent Access or to Glacier, or you can take an expiration action and delete objects, for those use cases where, say, I set up my big data job with S3 and I know I don't want to keep the temp files for more than a week. So you can expire objects through lifecycle, or expire them yourself. You also have the ability to combine multiple actions, so in our demo we set a lifecycle policy for one prefix, and we can go set other lifecycle policies for other prefixes in the same bucket as well.

Looking back at Standard-Infrequent Access, as I mentioned earlier, it fully supports the lifecycle capabilities: you can transition from Standard to Standard-Infrequent Access, and then from Standard-Infrequent Access to Glacier, based on the usage pattern we just looked at in our demo. You can of course expire objects in Standard-Infrequent Access, just like in every other storage class, and it works with versioning as well. Another thing I want to call out: while you have the ability to transition from Standard to Infrequent Access, and for a lot of workloads it makes sense that data starts out frequently accessed and changes character over time, there are also a lot of cases where it makes sense to put data directly into Standard-Infrequent Access, and you have that ability. When you put an object you get to specify the storage class, and Standard-Infrequent Access is one of the storage classes you can specify, so if I know I have logs that are not going to be used frequently, I can just put them straight into Standard-IA.

Here's a sample lifecycle policy, just to drive home how easy it is to set up. Here I'm setting two transition actions: in the first, highlighted transition, I'm moving data that is 30 days old, regardless of its current storage class, into Standard-Infrequent Access, and in the next slide we see that we then transition data that is a year old into Glacier. Both of these rules work together to move your data between storage classes, and again, remember that you never have to change anything in your application, nor does the location of your data change, as it moves from Standard to Standard-Infrequent Access to Glacier.
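Expressed as a lifecycle configuration with boto3, that sample policy might look roughly like this; the bucket and prefix are placeholders:

    import boto3

    s3 = boto3.client('s3')

    # Note: this call replaces the bucket's existing lifecycle configuration,
    # so in practice you would combine all rules into one call.
    s3.put_bucket_lifecycle_configuration(
        Bucket='my-app-bucket',
        LifecycleConfiguration={'Rules': [{
            'ID': 'tier-then-archive',
            'Prefix': 'logs/infrequent/',
            'Status': 'Enabled',
            'Transitions': [
                {'Days': 30,  'StorageClass': 'STANDARD_IA'},   # 30 days old -> Standard-IA
                {'Days': 365, 'StorageClass': 'GLACIER'}        # a year old  -> Glacier
            ]
        }]})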
All right, so let's look at another best practice we recommend to our customers, which is versioning. We talked a bit about this already: versioning is a capability that lets you protect against accidental deletes and overwrites. The way it works is that you enable versioning at the bucket level, and once you do, if you put an object and then put another version of it, essentially trying to overwrite the object, S3 will not overwrite it; it makes the current version non-current while still keeping it, and simply places the latest version on top. So you have all versions of your object and can always recover or roll back: if you accidentally overwrote an object, no problem, you can retrieve the previous version and make it your current version again. That gives you additional protection against accidental deletes and overwrites.

There are three states for versioned buckets. By default, versioning is not enabled; you have to enable it for your bucket. Once you enable versioning, the state changes from unversioned to versioning-enabled, and that is when S3 starts persisting all versions of your objects. You can also suspend versioning, in which case we will not remove any of the existing object versions, but any additional puts will not be versioned.
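Enabling versioning is a one-call operation; a quick sketch with boto3 and a placeholder bucket name:

    import boto3

    s3 = boto3.client('s3')

    s3.put_bucket_versioning(
        Bucket='my-app-bucket',
        VersioningConfiguration={'Status': 'Enabled'})
    # To suspend later, set 'Status': 'Suspended'; existing versions are kept,
    # but new puts are no longer versioned.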

Now, one very common usage pattern we see with versioning and lifecycle together is building a recycle bin: I want to protect against accidental deletes, but at the same time I don't want to keep all versions of my data forever. If that is what you want, you can combine lifecycle policies with versioning: enable versioning on your bucket and then set an expiration lifecycle policy specifically for non-current versions to expire after X days. So maybe I want the ability to retrieve and roll back for 90 days: if I make a mistake, I have 90 days to undo it. You can do that by combining a lifecycle expiration action with versioning.

Another interesting piece of feedback we heard from customers is this: because I have versioning enabled and an expiration policy that removes my non-current versions, it is possible for me to issue a delete, not have a current version anymore, and then over time the expiration policy removes the non-current versions underneath, leaving me with an empty delete marker. The way this works is that when I issue a delete, instead of removing the object, S3 places a delete marker to signify that the latest version has been deleted, and it does not remove the other versions; lifecycle can then come along and remove the non-current versions based on the policy you set up. The implication is that over time you can end up with delete markers for objects where all the versions have been removed by the policy. You can list those and remove them yourself, but again, true to the spirit of lifecycle, we want to take that overhead away from you, and you can do this housekeeping simply by specifying the new lifecycle policy to remove expired object delete markers.

There are two different use cases here. The first is when I use lifecycle to expire current versions, and previous versions as well. That's the example I have here: I set an expiration policy saying I want to expire objects once they are 60 days old, but because versioning is enabled, S3 is not going to delete the object; it simply makes it the non-current version and places a delete marker on top. Then I specify another action to expire non-current versions of my object after another 30 days. So at 60 days the object gets a delete marker, 30 days later the lifecycle policy deletes the expired, non-current version, and all that is left is the delete marker. Because you're using lifecycle to do this, lifecycle will do the right thing and simply clean up the delete marker once all the versions under it are removed.
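A sketch of what that first case might look like as a lifecycle configuration, with a placeholder bucket and the 60/30-day values from the example:

    import boto3

    s3 = boto3.client('s3')

    s3.put_bucket_lifecycle_configuration(
        Bucket='my-app-bucket',
        LifecycleConfiguration={'Rules': [{
            'ID': 'expire-with-recycle-bin',
            'Prefix': '',                                # whole bucket
            'Status': 'Enabled',
            'Expiration': {'Days': 60},                  # current version gets a delete marker at 60 days
            'NoncurrentVersionExpiration': {'NoncurrentDays': 30}   # old versions removed 30 days after becoming non-current
        }]})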
There's another version of this where I do not use an expiration policy to remove objects: it's possible that my application needs to issue the delete command itself. However, once objects are removed, I still want the recycle bin functionality for 30 days, the ability to recover for 30 days. In this case I set the non-current version expiration policy just like before, for 30 days, but lifecycle does not issue the delete; my application does. What is left behind is a delete marker, which either the application or the customer would need to remove, or you can specify the additional property, the expired object delete marker setting, and set it to true: if there is a delete marker with no versions under it, the lifecycle policy will do that cleanup and housekeeping for you.

All right, let's take a look at how you might use the UI to set up this expired object delete marker policy. For any given bucket, if I go to Properties, under Lifecycle I can add a new rule for previous versions. For previous versions, if I want to delete them after, say, 30 days, I now also have a checkbox to remove the expired object delete marker and do the housekeeping for me, so I don't need to worry about doing it myself. That's how simple it is: you specify that, and S3 takes care of it for you.

One implication of this is that you get better list performance if you do not have delete markers. If you have, say, hundreds of billions of objects and a lot of them are actually delete markers, because you keep deleting objects and adding new data over time, then when you list, we have to filter those delete markers out, and that takes a little bit of time. So if you remove them by doing this housekeeping, you see an improvement in the performance of your list API.
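That second pattern, a recycle bin driven by application deletes plus delete marker cleanup, might be expressed roughly like this, again with a placeholder bucket name:

    import boto3

    s3 = boto3.client('s3')

    s3.put_bucket_lifecycle_configuration(
        Bucket='my-app-bucket',
        LifecycleConfiguration={'Rules': [{
            'ID': 'recycle-bin-with-delete-marker-cleanup',
            'Prefix': '',
            'Status': 'Enabled',
            'NoncurrentVersionExpiration': {'NoncurrentDays': 30},
            # Once no versions remain under a delete marker, remove the marker too.
            'Expiration': {'ExpiredObjectDeleteMarker': True}
        }]})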

Another best practice, again to protect against accidental deletes, is multi-factor authentication delete. That's something we support in S3: you can configure it at the bucket level so that, in order to successfully issue a delete, S3 requires not only your security credentials but also the code from an approved, authenticated device. It's yet another way to protect yourself against accidental deletes, whether it's an issue in your policy or simply an accident by a user. So you can set up the recycle bin functionality with versioning and lifecycle, and this is another recommended way to protect the data where you want that extra double check before it is actually removed.

With that, let's talk a little bit about some best practices for improving the performance of your workloads and applications on S3. One of those is parallelizing your put API calls. Specifically for larger objects, you can break the object into multiple parts and upload those parts in parallel to get much better performance. For instance, if you're on a high-bandwidth network, you can get much higher aggregate throughput by breaking your object into parts and uploading them in parallel; if you're doing a big data job like we were, this is an optimization worth making when you're moving a lot of data around. It also helps on networks where resiliency is lower, on spotty networks: you want to make sure that on a retry you don't have to upload the whole large object again, so if you break the object down into smaller parts, one specific part might fail, but that is the only one you need to re-upload; you don't need to re-upload the entire object.

The way multipart upload works is that you break the object into smaller parts, upload them, and once you've completed all of them, you issue a command to put the object back together, and S3 exposes it to you as a complete object. However, we've heard from customers that, depending on the application, there are cases where the application dies or throws an exception without having uploaded all of the parts. Now I've ended up with parts that I'm paying for, because you are charged storage costs for each of those parts, but the upload was never completed. You can list those: by default, list does not show them, but you can specify a parameter to list incomplete multipart uploads and clean them up yourself. But why do that when lifecycle can do it for you? That is why we introduced the new policy for incomplete multipart upload expiration, where you can tell lifecycle to simply delete all the parts if the upload is not completed within X days. Here's an example of the policy: in your lifecycle policy you specify a rule to abort incomplete multipart uploads, and all you need to specify is how long you expect your multipart uploads to take, in days, and the lifecycle policy will then delete all the parts of uploads that have not completed within that many days. So you don't need to worry about that cleanup, and of course you can restart the upload again.
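For a rough idea of both halves of this, here's a sketch that uploads a large file with parallel multipart transfer and sets the abort rule; the bucket, file names, sizes, and the seven-day window are illustrative choices, not values from the talk:

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client('s3')

    # Parallel multipart upload: objects above the threshold are split into
    # parts and uploaded on multiple threads.
    config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                            multipart_chunksize=16 * 1024 * 1024,
                            max_concurrency=10)
    s3.upload_file('backup.tar', 'my-app-bucket', 'backups/backup.tar', Config=config)

    # Housekeeping for uploads that never complete: abort incomplete multipart
    # uploads (and free their stored parts) seven days after initiation.
    s3.put_bucket_lifecycle_configuration(
        Bucket='my-app-bucket',
        LifecycleConfiguration={'Rules': [{
            'ID': 'abort-stale-multipart-uploads',
            'Prefix': '',
            'Status': 'Enabled',
            'AbortIncompleteMultipartUpload': {'DaysAfterInitiation': 7}
        }]})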
Another best practice: in addition to parallelizing your put requests, you can also parallelize your get requests. You have the ability to do range-based gets on individual objects, specifically for large objects, so you can spin up multiple threads, get higher aggregate download speed, and download large objects quicker. This again compensates for unreliable networks, because if you're downloading smaller parts and one fails, you only need to retry that specific part rather than download the entire object. So you can do multipart uploads and upload specific parts in parallel, and you can download specific ranges of your objects in parallel.
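A minimal sketch of parallel range-based gets with boto3 and a thread pool; the bucket, key, and part size are placeholders:

    import boto3
    from concurrent.futures import ThreadPoolExecutor

    s3 = boto3.client('s3')
    BUCKET, KEY = 'my-app-bucket', 'videos/big-video.mp4'
    PART_SIZE = 8 * 1024 * 1024

    size = s3.head_object(Bucket=BUCKET, Key=KEY)['ContentLength']
    ranges = [(start, min(start + PART_SIZE, size) - 1)
              for start in range(0, size, PART_SIZE)]

    def fetch(byte_range):
        start, end = byte_range
        resp = s3.get_object(Bucket=BUCKET, Key=KEY,
                             Range='bytes={}-{}'.format(start, end))
        return resp['Body'].read()

    # Download the ranges in parallel, then stitch them together in order.
    with ThreadPoolExecutor(max_workers=8) as pool:
        parts = list(pool.map(fetch, ranges))
    with open('big-video.mp4', 'wb') as f:
        for part in parts:
            f.write(part)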

Another thing you can do is parallelize your list requests as well. Specifically, if you want a sequential list of your keys, you can pick specific prefixes and start multiple list requests. Because list requests are paginated, you get up to a thousand objects per request, so you can start multiple requests with the appropriate markers for multiple prefixes, or even within the same prefix. Another very common usage pattern we see from customers is using a secondary index, specifically for lists: I can maintain a secondary index, and not only does that get me faster lists, I also have the ability to sort and search on metadata, which in a lot of cases can be very useful.

Another best practice is around SSL. Really, for all of these best practices, if you can, you should just use the SDK: we continually update and optimize the SDKs to leverage all of these where possible. Of course you can't use the SDK every single time, so if not, be aware of these practices; they can really help you optimize the performance of your application. For SSL, if you can use hardware that's optimized for it, hardware acceleration for SSL encryption can really help, because there is overhead in doing that encryption in software. The handshakes are also expensive, so if you can avoid them by using keep-alives or pooling connections, sending multiple HTTP requests over the same connection rather than setting up an expensive connection every time, that again is a way to optimize your application and improve performance.

Another best practice is making sure that your key names are randomly distributed, and that is really relevant for buckets where you expect high TPS, a hundred TPS or more. We see a lot of customers naming keys similar to this example, where I have my bucket and then date ranges, year, month, day, and then I break down my data. The problem is that if your keys are in this sequential order, S3 is going to write them to a specific partition, because they sort together, and since you don't have any randomness, all of your uploads go to a single partition. Now, S3 does redistribute workload under the covers, but if your workload spikes very fast, you might see higher turnaround times. The best practice is to get as much randomness as possible, as early as possible, in your key name: right after your bucket name, if you can randomize the first few characters of the key name, the uploads naturally spread out across multiple partitions in S3, which makes sure you do not see higher turnaround times when you hit higher TPS.

Some of the techniques we see customers use: you can hash the entire key name, and it's important to note that this is the key name, not just the file name, so if you have bucket1/prefix1/2/image.jpg, everything after the bucket name is the key name, and you need to hash everything right after the bucket name, including the prefixes. Another technique is to prepend the key name with a short hash, or you can reverse the key name if that makes sense for your data.
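Here's a small sketch of the "prepend a short hash" technique; the four-character hash length is just an illustrative choice:

    import hashlib

    def randomized_key(original_key):
        """Prepend a short hash of the key so writes spread across partitions."""
        prefix = hashlib.md5(original_key.encode('utf-8')).hexdigest()[:4]
        return '{}-{}'.format(prefix, original_key)

    print(randomized_key('2016/05/18/cust1234/image.jpg'))
    # -> something like 'ab12-2016/05/18/cust1234/image.jpg'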
But again, the goal is really to make sure that right after the bucket name you have as many random characters as possible, which ensures that for high-TPS workloads the keys get distributed across multiple partitions in S3 and you don't see higher turnaround times.

So that was a lot of information. We looked at Standard-Infrequent Access; we looked at the great demo Carl put together for us, where we took the access logs we already have and used that information to determine which of my prefixes are in fact infrequently accessed, and then set a lifecycle policy to tier those specific prefixes to Standard-IA, because now we deterministically know they are infrequently accessed. We looked at lifecycle policies and the new management capabilities we've added to them, for expired object delete markers as well as incomplete multipart uploads. We looked at versioning and some best practices around that, as well as some best practices around performance. So with that, we're going to end. Again, Carl and I are here for any questions; we encourage you to come over, chat with us, and give us your feedback. Thank you for coming.