Adopting PCF At An Automobile Manufacturer – Gregor Zurowski, Thomas Seibert (Mercedes-Benz)

Hello everybody, and thank you for attending our talk on adopting Pivotal Cloud Foundry at an automobile manufacturer. By now you should know what kind of automobile manufacturer we are: we are Mercedes-Benz.io. Let me quickly introduce us, because we only have 30 minutes left and a lot of information to cover. My name is Thomas Seibert, and I am the lead architect at Mercedes-Benz.io.

Hi everyone, my name is Gregor Zurowski. I am an independent consultant, and I have been working as a software architect for Mercedes-Benz.io since early 2016.

Just a couple of words about Mercedes-Benz.io: it is a very young company, founded in November of this year. So you might ask yourself how we can have so much experience and talk about the adoption of Pivotal Cloud Foundry if we are such a young company. Well, we have been here before under a previous company name, and that company was transformed into Mercedes-Benz.io. We started our journey back in 2015, so in the meantime we have gathered a lot of experience in adopting and working with Pivotal Cloud Foundry.

To understand our journey with Pivotal Cloud Foundry, we have to go back to 2014 and look at the situation in the Daimler digital landscape at that time, meaning all the marketing-related websites. First, there was the Mercedes me portal, which you see on the left. It is customer focused: it handles customer data, customer registration, vehicle association, and so on. Then we had the product-focused eMB site (eMB means electronic Mercedes-Benz), which covers all the brand offerings and vehicles, dealer information, and functionality such as the car configurator, which you probably all know. The third site was the brand site, mercedes-benz.com, which shows lifestyle information and provides information about events such as auto shows or the latest Formula One news.

So these were three different sites. If a visitor came to the Mercedes me portal and then jumped to the eMB site, we did not know anything about what had happened before: the behavior of our visitor was invisible to each of those sites, because they were separate. We had a clearly broken customer journey, and on top of that we did not gain any coherent knowledge about the customer and the intentions and purposes he or she was pursuing. On top of this we had different technology stacks: the Mercedes me portal ran on a Java EE portal server, eMB ran on a content management system, and the brand site ran on a content management system as well, but a totally different one than the one used for eMB.

So what we had to do in 2014, in order to get a coherent customer journey, was first of all to formulate a mission. The mission said: we want to create the best customer experience for all Daimler customers and prospects, but also for people who are just interested in the brand and the offerings. To create this best customer experience, there were three things we had to put into focus: first, deliver the most relevant content to our visitors for their purpose and their current context; second, provide the most useful functionality for the intents our customers and visitors are pursuing; and third, stay constantly innovative and improve all the time. And this is where we created the Mercedes-Benz One Web platform.
The platform should unify all those contexts, all those areas of purpose (the customer, the product, and the brand) and guarantee a coherent customer journey. On top of this we wanted a consolidated, unified technology stack, based on Pivotal Cloud Foundry, one content management system, and analytics, with an infrastructure as a service below that.

Because we were a small company with worldwide coverage of all these websites, we had to find a way to scale and to extend our work into the world. This is why we thought about dimensions of scaling, and we identified three: scale with content, scale with functions, and scale with markets. In true MVP fashion we started out with one vehicle model on the content dimension, one market (the UK), and the core components we deemed most important for the customer experience. Then we gradually scaled out: more content, more markets, and extended functionality.

Today it is late 2017, almost 2018. So what have we achieved so far? We collaborate with more than ten external partner companies. We did a canary rollout to the UK in 2016 with the C-Class only, with limited content and limited functionality. We rolled out to Austria in summer 2017, and since then we have rolled out to 20 countries in less than five months. That means we rolled out four countries a month on average; sometimes we even managed to roll out three countries in one week. All of this has been done without any critical production incident, and with zero management escalations. This is why we are quite confident that in 2018 we will reach more than 40 countries worldwide with our platform.

So what were the architectural decisions for the One Web platform, and how did we go about achieving those goals? We knew from the beginning that we had to build a lightweight and elastic application landscape. We also knew that we wanted an organizational blueprint with a startup setting: feel like a startup, behave like a startup, have the values of a startup. Deliver small increments in small batches, constantly collect feedback from stakeholders, product owners, and customers, and incorporate that feedback into the next iteration. We also applied hypothesis-driven development: we started a product, project, or component by defining one or more KPIs, and then tried as quickly as possible to validate or falsify our hypothesis, be it on the technical or the business side. This is how we created an efficient, scalable, and extensible system for a global web presence: go fast, stay focused, and be responsive.

These were the concrete architectural decisions. We had to go to the cloud to get faster provisioning of hardware and to be more efficient with our compute usage; we only wanted to pay when we actually used compute power. We wanted a cloud application platform, a PaaS (this is Pivotal Cloud Foundry), to let our teams focus on developing business capabilities, and business capabilities alone. We wanted easy deployments to a modern container environment; Gregor will tell you more about isolation, which was very important to us. And we wanted a ready-made outer architecture.
By the outer architecture we mean all the services with horizontal functionality: service registry, configuration server, monitoring, logging, and so on. These should not be built by the teams, because the teams should concentrate on business capabilities. And of course automation was very important for us: configuration, building, deployment, and operations should all be automated as much as possible.

On top of the PaaS we built a microservices-oriented architecture. We knew it was very beneficial to work with the microservices concept, but we were careful not to over-architect; as Martin Fowler put it, "you must be this tall to use microservices." We knew about the dangers, so we experimented a little: if something worked out, we went on; if it did not, we pivoted or abandoned the hypothesis. We also defined common principles and guidelines for our whole landscape, our whole ecosystem. The last point was that we wanted to scale out and let the teams develop different functionality in parallel. We really needed speed, and we got it by isolating processes running in containers (isolation in several steps, which Gregor will detail later on), by having independent releases and deployments, and by making components communicate with each other via messaging or HTTP.

Here is a bird's-eye view of the One Web technology landscape: on the right you see the content management system, on the left the microservices architecture with the API gateway channeling every request to the different services, and at the base Pivotal Cloud Foundry as the PaaS. But we could not do greenfield development: we knew there were many systems and a lot of data living in a Daimler data center, so we also had to take care of the communication between our cloud and the Daimler data center.

Enabling teams was also a very important topic for us. We wanted to minimize the ramp-up time for a new team, or for a team on a new project or product. So with PCF we automated the space creation, the permissions, and also the service bindings, and we provided Maven archetypes so the teams could generate application skeletons and be productive within one or two days; very fast. We tried to allow as much self-servicing as possible: service provisioning could be done by the teams themselves, deployment templates let the teams customize their deployment pipelines, and we granted the teams access to monitoring and logging so they could evaluate the behavior of their applications on their own. We also provided common components, covering error handling, actuator endpoint configuration, and other generic functionality. Finally, we built ecosystem guidelines for everybody, so the teams had certainty when building applications: they knew where the boundaries were and which principles should guide them when developing functionality. We provided blueprints for the communication specification, as well as a precise definition of what a good citizen, a good application, is within our One Web ecosystem.
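To give a feel for the kind of automation described here, the following is a minimal sketch of a scripted space setup with the cf CLI. The org, space, user, service, and plan names are purely illustrative, not the actual One Web setup:

```bash
# Minimal sketch of automated team onboarding (illustrative names only).
# Creates an isolated space, grants the team access, and provisions a backing service.
cf create-space team-configurator -o oneweb
cf target -o oneweb -s team-configurator

# Grant a team member the SpaceDeveloper role in the new space
cf set-space-role jane.doe@example.com oneweb team-configurator SpaceDeveloper

# Provision a service instance that the team's apps can later bind to
cf create-service p-mysql 100mb configurator-db
```

In the setup described in the talk, scripting of this kind is combined with the Maven archetypes, so a new team gets a space, permissions, service bindings, and an application skeleton essentially on day one.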
Let me compare what we had before with what we do now in the cloud. In the past we had the classic software development lifecycle, which is probably well known to you: an inception phase with ideation, scoping, budgeting, technical feasibility, and so on; then we gathered requirements and created architecture artifacts; and once we knew what kind of hardware we needed, we ordered that hardware at the data center. Sometimes, very often actually, we then had to wait until all the hardware was available before we could deploy to an integration environment, and this waiting time was essentially waste. With Pivotal Cloud Foundry and the cloud we have the same inception phase, but then we do not even need to know what kind of hardware we need: we just create a space, the service bindings, the service skeletons, the deployment pipelines, and the team permissions, and the team can go ahead and be productive within a very short amount of time. There is no waiting time at all. If you compare the durations from inception to the first integration deployment, it used to be about 30 days; now we manage to have teams productive and deploying to an integration environment within two or three days. How we achieved this in detail is what Gregor will tell you now.

All right. We want to talk about how we integrated PCF with our architecture. One of the key things we thought about at the inception of the project were the principles of isolation and decoupling. We see these principles as the main enablers of efficiency in the teams, and as what makes it possible for different teams to work in parallel. Isolation also allows us to change parts of the system, just parts of the system, without affecting other parts at the same time.

How do we achieve that? We use three levels of isolation. One level is the isolation PCF provides natively with orgs and spaces, which isolates different applications and different users. We also use PCF's containers, which provide isolation on the process level and isolate applications from each other. And we came up with a custom versioning concept that allows us to deploy different versions of the same application in parallel without them affecting each other, effectively achieving parallel deployments of different releases.

With these three levels of isolation in place, we still found that we were missing something on the PCF platform level, and that concerns service tiles. Service tiles are software components you install in Cloud Foundry that provide foundational services and data sources, such as Spring Cloud Services, GemFire, MySQL, and so on. We found that there is no isolation when you update these tiles: if you have one Cloud Foundry foundation and host different deployment environments on it, you have no chance to test the changes a new service tile update would bring.

So we thought about how to address this issue, and what we came up with is our own setup of the Cloud Foundry environment: separate foundations for different deployment environments. We chose an approach where we group some of the deployment environments so as not to waste resources, and we arrived at the setup you see at the bottom of the slide: one foundation for dev and test, one foundation for integration and pre-prod, and a separate foundation for our production environment. As the diagram shows, this allows us to test product tile updates in a lower-level environment, do QA there, and once we know we are safe, propagate these changes into the higher-level environments: through integration and pre-prod, and then eventually to production.
This follows a very traditional workflow of promoting changes, but it is essentially less complex than other approaches: it is simple, stable, and effective so far.

The next thing I want to talk about brings us back to the first level of isolation we mentioned: how PCF is able to isolate different apps and different teams. PCF natively provides the concept of orgs and spaces, and with orgs and spaces you can effectively achieve multi-tenancy on one Cloud Foundry foundation. In our setup, as you see in the diagram on the right, we give each product team that provides business functionality its own space. Within it they have the highest level of freedom: they can create their services, push their applications into their space, and instantiate service instances without affecting other teams. We also, obviously, make use of roles and permissions to guarantee that every team member has the right access and sees only what that team member is supposed to see.

Having separate spaces for isolating applications and teams from each other is great, but at the same time we had to think about how to provide a setup with services that are shared across consumers. There are cross-cutting services, like a service registry used for service discovery or a config server, that should not be instantiated in each and every space, essentially also because you might have global components that need to discover applications hosted in different spaces. So we came up with a concept we call shared services: as you see in the diagram, we have one "shared services" space where we instantiate those service instances that need to be shared, and by using PCF-native mechanisms such as service keys and user-provided services, together with a customized connector library, we are able to give applications deployed in our business-functionality spaces access to these shared services.

This concept works well and we are running it in production, but it also comes with a couple of disadvantages. The main disadvantage is that it requires granting space developer permissions to team members who want access to a service dashboard; this is obviously a problem, because it is a high level of access and could potentially be dangerous. The next disadvantage we identified along the way is that we are missing an overview of who is actually consuming our shared services, meaning we could accidentally delete a shared service instance without knowing that there are still active consumers out there. And last but not least, our setup requires a static naming scheme on the consumer end, because there is no tagging feature for user-provided services like the one that exists for regular managed services. But overall we are very happy with this setup. It has definite advantages: it reduces the overall maintenance of these types of services, especially when it comes to updates, and it is simple, stable, and effective. It is also pretty much aligned with what Pivotal is planning for the future: as we know, Pivotal is already working on making shared services, which is very similar to our setup, a first-class citizen in future versions.
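As a rough illustration of the mechanism described here (rough, because the actual connector library and naming scheme are internal), sharing a service instance across spaces via service keys and user-provided services might look like this; the tile, plan, and service names are assumptions:

```bash
# In the shared-services space: create the shared instance and a service key
cf target -s shared-services
cf create-service p-service-registry standard shared-registry   # illustrative tile/plan
cf create-service-key shared-registry registry-key
cf service-key shared-registry registry-key    # prints the credentials as JSON

# In a consumer team's space: wrap those credentials in a user-provided service.
# The static name "shared-registry" is the naming convention consumers rely on.
cf target -s team-configurator
cf create-user-provided-service shared-registry \
  -p '{"uri":"...","client_id":"...","client_secret":"..."}'
cf bind-service my-app shared-registry
```

The "..." placeholders stand for the values printed by `cf service-key`; the missing-tagging disadvantage mentioned above is exactly why the consumer side has to rely on a fixed service name.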
So let's talk about development topics: how do we develop with Cloud Foundry, and what were the major topics we had to address? The first thing we thought about was our approach to versioning. We realized from the start that in a corporate environment we will always have the requirement to run different major versions of the same application; it was one of the main requirements we set. We selected semantic versioning (SemVer) as the versioning concept, which is a pretty safe choice, and then we decided how to realize it in PCF: every major version is deployed and runs as a separate application, and we only expose major versions through the API of each app, meaning we have a v1 app, we have a v2 app, and so on. This simplifies our overall development and testing approach, because we can introduce a new major version, test it, and QA it without doing any QA on the older version that has already been through QA. It also simplifies upgrades and transitions to newer technologies: if, after years of an old version running in production, we want to come up with a new major version, we can opt for a new technology stack or a new language and do not have to worry about incorporating the functionality of the old version into the new one. This effectively decouples changes in business logic.

How does it look? You see it here in a simplified diagram. We have two versions of an application, v1 and v2, deployed in our environment. They register with our service registry under their major version, and they pull their configuration from the config server, which in turn pulls it from the backing Git repository. On the very left we have a client that wants to issue a request against version 2 of the service. It hits the API gateway, the gateway translates that request into the corresponding version, uses the service registry to discover the right application, and then forwards the request to the appropriate instance.

Let's talk about the API gateway, which we have mentioned several times now. We think the API gateway is the core component in every microservices architecture. It is essentially an edge service that acts as a single entry point, controlling external access to all of the services and APIs we host in our foundation. It provides uniform endpoint behavior at a base URL and makes it very simple for clients to implement consumers against those APIs. The API gateway has centralized responsibility for things like dynamic routing, service discovery, circuit breaking, client-side load balancing, and dealing with timeouts, retries, and so on. For the implementation of the API gateway we chose Netflix Zuul. It is a lightweight API gateway, or edge service, and it comes with a simple implementation: it is essentially a Spring Boot application that we deploy into Cloud Foundry, which makes it very easy for us to analyze problems with something as simple as curl. It is also easily extensible with filters for fine-grained control over request processing.
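To make the versioned routing concrete, here is a minimal sketch of what such a gateway configuration could look like with Spring Cloud Netflix Zuul. The route and service names are hypothetical, not the actual One Web configuration:

```yaml
# application.yml of the Zuul gateway (hypothetical names).
# Each major version runs as a separate app registered in Eureka under its
# own service ID, so v1 and v2 traffic is routed independently.
zuul:
  routes:
    products-v1:
      path: /api/v1/products/**
      serviceId: products-v1   # resolved via the Eureka service registry
    products-v2:
      path: /api/v2/products/**
      serviceId: products-v2
```

With a setup like this, retiring an old major version later is just a matter of removing its route and its app, without touching the newer version.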
Here are some of the insights we want to share with you regarding API gateways. First of all, as already mentioned, a gateway definitely provides unified endpoint behavior that can be managed centrally. On the other hand, the way we have set it up now, it does not provide any self-servicing functionality for our development teams: if a development team comes up with a new service or a new version, we need to do some plumbing in the API gateway to enable it, and that is essentially a governance task. We also do not have any service catalog functionality, so there is no overview of what is installed in our environment together with the appropriate documentation. And we identified that it definitely has a limited feature set compared to more comprehensive API management solutions.

The next thing I want to talk about is blue-green deployments. Blue-green deployments, or zero-downtime deployments, were one of the preconditions for us to start the project at all and to enable continuous deployment; in fact, a precondition for even going into production. After some thought and experimentation we also decided that we need blue-green deployments in the lower-level environments: it is not only important in production, it is also very helpful in non-production environments, because then we do not cause any kind of outage when we apply changes to any environment, and therefore we do not cause any issues. Unfortunately, at the inception of the project we realized that zero-downtime deployments are not a first-class citizen in PCF: if you use cf push or similar, you essentially drag down all the existing instances and start new ones as containers, which is very nice, but you will still experience an outage in that window. So what we needed to do was create a custom implementation that follows a simplified approach: in our case we do not deal with canary deployments and we do not deal with traffic redirection; we are happy just to be able to push applications out without outages on the client end. At this point we are able to do multiple deployments per day into production without causing any disruption.
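The exact implementation is not shown in the talk, but a simplified route-swapping blue-green push with the cf CLI (the general pattern such a custom approach builds on) might look like this; app, artifact, and domain names are illustrative, using cf CLI v6-era flags:

```bash
# Simplified blue-green sketch: no canary deployments, no traffic shaping.
# 1. Push the new (green) version alongside the running (blue) one,
#    reachable only under a temporary route for smoke testing
cf push products-green -p target/products.jar -n products-green -d example.com

# 2. Once the green app looks healthy, map the production route to it;
#    for a short moment both apps receive production traffic
cf map-route products-green example.com -n products

# 3. Detach the old app from the production route and retire it
cf unmap-route products-blue example.com -n products
cf delete products-blue -f
```

As noted a little later in the talk, the officially documented blue-green procedure did not play well with Eureka registration, which is one reason a custom variant of this pattern was needed.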
We also want to talk about the development challenges we faced when we started to create the platform and the first couple of applications. Our development stack is based mainly on Spring Boot and Spring Cloud Services deployed to our environment, and we ran into several issues early on. We had problems where garbage collection led to out-of-memory errors during Eureka communication; this was eventually fixed in Spring Cloud Services 1.1. We also recently had problems with the Spring Boot integration in the Apps Manager, which does not work well with our custom context paths; we have a workaround, and as far as I know it is supposed to be fixed with Spring Boot 2.0. We furthermore identified that the recommended blue-green deployment procedure, as documented on the Cloud Foundry website, is not well aligned with Eureka communication, so you might even experience outages there; that is why we set up our own approach. And last but not least, another example: we had issues with the Spring Cloud Services config server, because it had a reduced feature set compared to the open source version. We addressed that with the Pivotal engineering team and eventually got a new version that provides all the features we need.

One thing we want to point out is that using Spring Cloud Services is very useful: it makes self-servicing much easier, and we get features like security and high availability for important services of our infrastructure. But we also see it as an additional dependency in our overall architecture that needs to be maintained, and maintained on both ends: on the infrastructure level, when you update your service tile to a new version you will need to update all the service instances, but you also need to update all your client dependencies.

The last section we want to cover is some lessons learned throughout this project. One thing that is quite obvious: automation is good and provides obvious benefits, but it is not always easy to achieve. When we started out, we thought we could automate everything, but we realized very early that we needed to start from a minimal set that gives us the functionality for continuous delivery, and then increase the level of automation from there; that would be our recommendation as well. It is also very important, as Thomas already described, to provide the developers with a toolbox they can use to automate project setup, builds, and deployments to the platform. We also learned that a PaaS solution can tremendously help in increasing the efficiency of your teams, because it automates many specific tasks: the creation of PCF routes when you deploy an application, so you do not have to worry about that; service binding, which implements the concept of attached resources as described in the twelve-factor app methodology, so development teams do not have to worry about managing credentials; and it is extremely easy to scale in and out.
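A small aside on service binding: when Cloud Foundry binds a service, it injects the credentials into the application's VCAP_SERVICES environment variable, and Spring Boot or Spring Cloud Connectors normally consume them automatically. A bare-bones sketch of reading them by hand (the "p-mysql" label and the org.json dependency are assumptions for illustration) could look like this:

```java
import org.json.JSONObject; // assumes the org.json library on the classpath

// Minimal sketch: manually reading bound-service credentials from VCAP_SERVICES.
// In a real Spring Boot app on Cloud Foundry this is handled automatically.
public class VcapServicesSketch {
    public static void main(String[] args) {
        String vcap = System.getenv("VCAP_SERVICES");
        if (vcap == null) {
            System.out.println("Not running on Cloud Foundry");
            return;
        }
        // VCAP_SERVICES maps a service label (e.g. "p-mysql") to bound instances
        JSONObject credentials = new JSONObject(vcap)
                .getJSONArray("p-mysql")   // label of the bound service (example)
                .getJSONObject(0)          // first bound instance of that service
                .getJSONObject("credentials");
        System.out.println("JDBC URI: " + credentials.optString("uri"));
    }
}
```

This is what "attached resources" means in twelve-factor terms: the app discovers its backing services from the environment instead of from hard-coded credentials.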
But a platform-as-a-service is no panacea for everything. There is definitely considerable effort that goes into maintaining the platform: it is not a one-click installation after which you never have to worry about anything, so you still need to plan for an ops team that manages the system. We also found that we are still missing some major functionality in PCF. A built-in monitoring and alerting system is still not available, so if you plan to implement this platform, plan to use external products for that. As we already described, there are no out-of-the-box blue-green deployments, although that will potentially be addressed in future PCF versions. There is no multi-versioning of tiles, which could tremendously help us change our current foundation setup. And one smaller thing: we think that container networking across orgs and spaces could be simplified for the users.

On the organizational level, we think you should treat your development teams as customers in order to be successful with this platform. It is a complex system and there are many things to consider, so get their feedback and incorporate it into growing your platform; only like this will you be successful. And because of the complexity of the platform, it is also extremely important to put some effort into onboarding new team members, because they might not know all the concepts of twelve-factor apps or of working with cloud environments in general. Thomas, do you want to jump in and also share some lessons learned?

Yes, this last slide is on a higher level. I want to summarize the lessons we have learned and the experience we have gathered with Cloud Foundry, PaaS, and cloud-native application development. First, provisioning a centrally managed platform drastically improves the velocity of the teams, and it also improves the stability of the components. Second, the isolation and decoupling of components and teams is paramount for parallel development: you cannot have a multitude of teams without isolating and decoupling those teams, through deployments, container processes, and so on, as Gregor explained in detail a couple of minutes ago. And the last point: if you turn to a microservices-oriented architecture, your organization has to change as well. If your organization is not accustomed to multiple teams iterating at high velocity, it has to change in order to draw the most profit from a microservices-oriented architecture. So this basically is it; thank you very much for your attention. If you have any questions, now is the best time to ask.

[Audience question] Of course it depends on what your organization looks like when you start your journey, but if you have, for example, a strictly hierarchical organization, that is not going to work with a microservices-oriented team setup. You need team setups where the teams are self-responsible: they should have as much freedom as possible to develop, and the freedom to iterate very quickly.

[Audience question] Daimler was probably not really known for agile development; I think it is safe to say that. When we started out, it was a gradual development: once you really see that it is happening, and that it gives you real velocity and stability, everybody involved can turn to it. For us it was a very gradual process, but right now all our teams practice a very active, iterative, and incremental development. This is something where things have changed a lot within the last two or three years.

[Audience question] Right, the question is whether each team gets its own space. Yes, that is correct: in our setup each team gets its own space, to isolate applications and service instances from each other. And yes, we have a parent organization: for this project we have one parent organization, that is correct. We might change that setup; we started out like that, but with PCF 1.10 you have isolation segments and things like that, so we might change things around. For now, the current setup is a working setup for us.

[Audience question] The question is how we manage a central config server and extend it to other applications. There are several things we did, and we went through a couple of iterations, also because some features were missing initially and were added to the config server later. In the beginning we used placeholders: we had different config repositories, and we addressed them from the config server with the {application} placeholder. So if an application called service-one requested its configuration, the placeholder resolved to service-one and the correct configuration was pulled in. Later on we changed the setup because of security concerns: at this point (and this is a task we still want to automate) we add separate configuration repositories with separate credentials, just to increase the level of security. Does that answer your question? OK.
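The placeholder mechanism mentioned in that answer is a standard Spring Cloud Config Server feature; a minimal sketch of the first setup described, with a hypothetical Git URL, might be:

```yaml
# application.yml of the config server (hypothetical repository URL).
# The {application} placeholder is replaced by the name of the requesting
# app, so service-one pulls its configuration from the service-one repository.
spring:
  cloud:
    config:
      server:
        git:
          uri: https://git.example.com/config/{application}
```

The later, more locked-down variant described in the answer would instead register each configuration repository separately, each with its own credentials.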
Yes, and for topics that need to be discussed in detail, feel free to just approach us after the talk; we are more than happy to discuss them with you. Thank you.

[Audience question] The question is how we manage application names, and whether we came up with a convention. Yes: when we started the project we gave it some thought, and we came up with the concept of using only acronyms. When a team creates a new application, they ask us for an appropriate acronym, we document it, and we use it going forward. That acronym is also used in service registration, in service configuration, and so on. It has turned out to be very effective. We try to capture this in the ecosystem guidelines, which give the teams the safety to do the right things; and if they do not know, or do not fully understand the guidelines, that is something we address in the onboarding events, so that all the teams know what they have to do and we keep an overview of the whole ecosystem.

[Audience question] You are asking whether we automated deployments and what permissions the developers have. Yes, at this point developers have the space developer role. Again, we trust them: they have their isolated space, so they cannot cause any disruption elsewhere. I think this also goes back to the organizational change Thomas was talking about before. I know from other setups that there is still a mentality of not wanting to give out permissions and being very careful, but this essentially slows down development. It also has a psychological impact: if people are not able to push things into production, they will potentially write different code than if they know their code goes into production. So it has turned out to be a very well-working concept: every developer, or most developers, gets the space developer permission and is responsible, from development up to the higher-level environments, for maintaining the application. That also includes Jenkins and so on.

[Audience question] What you are asking is whether there are updates we communicate, with regard to PCF or to our ecosystem. When a team starts anew on our platform, we hold onboarding sessions: back-end onboarding sessions, service onboarding sessions, as well as front-end implementation sessions, in which people get introduced to our ecosystem and the guidelines, while still keeping maximum freedom. Of course we cannot cover everything, but we cover most of it, enough to really get the teams going and have them implement their functionality. And as I said, we have several teams: if they run into problems, they come to us and ask, and if we deem a topic valuable for the ecosystem guidelines, because it touches subjects everybody should know about, we simply integrate it into the ecosystem guidelines.

And just to add to that: we started out with those onboarding sessions and trainings, but we also found that this is not enough. That is why, in the lessons learned, we said we also want to provide all development teams with what we call a toolbox; things like a Maven archetype. Once you have a mature platform, there is a lot of plumbing and configuration to be done: you need to configure your Eureka client, you need to include the major version, and things like that. It is much easier to create that Maven archetype once than to go back and forth with support calls. We also expanded into things like Jenkins pipelines: we have a shared Jenkins library used by the Jenkins pipelines, the code generated from the archetypes also creates those Jenkins pipelines, and we have a seeded Jenkins that automatically sets up the necessary deployment jobs. I think it is very important to incorporate that as well.
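Generating a project skeleton from such an archetype would look roughly like the following; the archetype coordinates are made up for illustration, since the actual One Web archetypes are internal:

```bash
# Generate a new service skeleton from a (hypothetical) internal archetype.
# The generated project would already contain the Eureka/config-server
# plumbing and the matching Jenkins pipeline definition described above.
mvn archetype:generate \
  -DarchetypeGroupId=io.oneweb.archetypes \
  -DarchetypeArtifactId=oneweb-service-archetype \
  -DarchetypeVersion=1.0.0 \
  -DgroupId=io.oneweb.services \
  -DartifactId=products-v1 \
  -DinteractiveMode=false
```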
[Audience question] Yes, the question is how long it takes for a team to really become productive. It depends, of course, on the knowledge of the team. If the team knows about twelve-factor apps and cloud-native applications, it is much faster; if they do not know anything about that and really need to be educated first, it takes more time. So you cannot simply say it takes two weeks or one month; it really depends on the experience level of the team. But if a team is very knowledgeable in implementing cloud-native applications, I think after one week you have the team really going, including the ecosystem guidelines, and after around one month the team is fully productive.

All right, thank you very much. As I said before, if you have any questions and want to discuss anything in detail, come and talk to us. Thank you very much. [Applause]
