DevOps Consulting, Platform Engineering
Transport | Trade | Logistics
Every single day, over 2.5 quintillion bytes of data are created in the world, driving innovation. Our client, a global player in logistics and a pioneering company from its start, wanted to harness the power of its own data to realize its digital marketing potential. The company was already collecting behavioral, attitudinal, and transactional metrics 24/7, from its own applications as well as from third-party tools such as Google Analytics and Adobe Analytics. But the wide variety of sources (intelligent marketing automation tooling, websites, customer surveys, social media, online communities, and loyalty programs), not to mention its legacy systems, made maximizing its data potential a challenge. The multinational services company needed an entirely new data platform.
Xebia joined the project in close cooperation with Xebia group company GoDataDriven (GDD), forming a team of platform engineers, data engineers, a scrum master, and a product owner. As the team consisted of Xebia consultants, GDD consultants, and employees of the client organization, it was possible to closely align with the client's requirements and ensure sustainable knowledge transfer. We defined the roadmap together and adjusted the team's composition and expertise whenever priorities changed.
The team was given the trust and freedom to try different ways of working and, learning along the way, settled on Agile Scrum combined with DevOps principles.
With the new self-service data platform, we are able to instantly provide our marketing organization worldwide with actual and consistent data to fuel their customer-focused campaigns.
Bringing Software Engineering & DevOps Principles to Data Science & Data Platform Engineering
Within a month, the team built the initial (MVP) data platform on Amazon Web Services (AWS). This platform complied with the initial business requirements and with DevOps principles. For example, from the very start it was possible to completely delete and redeploy the platform within hours. DevOps principles were embedded in the data science context, such as:
- Everything as code (e.g. the configuration of components, the definition of (security) policies)
- Complete version control in Git
- Automated testing and quality checks
- Buy over build (SaaS over PaaS/cloud native over IaaS)
- Multifunctional autonomous teams
- Agile delivery
- You build it, you run it
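As a rough illustration of "everything as code" combined with automated checks, the sketch below keeps storage policies as plain Python data and runs a check over them, the kind of check a CI job could run on every commit. The `BucketPolicy` class, its fields, the rules, and the bucket names are hypothetical and are not taken from the client's platform.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BucketPolicy:
    """Hypothetical policy record kept in version control next to the code."""
    name: str
    encrypted: bool
    public_read: bool


def check_policy(policy: BucketPolicy) -> List[str]:
    """Return a list of human-readable violations for one policy."""
    problems = []
    if not policy.encrypted:
        problems.append(f"{policy.name}: encryption must be enabled")
    if policy.public_read:
        problems.append(f"{policy.name}: public read access is not allowed")
    return problems


def check_all(policies: List[BucketPolicy]) -> List[str]:
    """Collect violations across all policies, preserving input order."""
    return [problem for policy in policies for problem in check_policy(policy)]


# Illustrative bucket names only (assumptions, not the client's buckets).
POLICIES = [
    BucketPolicy("raw-events", encrypted=True, public_read=False),
    BucketPolicy("marketing-exports", encrypted=False, public_read=False),
]

if __name__ == "__main__":
    for problem in check_all(POLICIES):
        print(problem)
```

Because the policies live in the repository, a change to them goes through the same review, versioning, and test steps as any other code change.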
The team designed the architecture and developed a new cloud-native platform on AWS that used AWS-native services and open-source tools such as HashiCorp Terraform, Apache Airflow, and Jenkins. Fully reproducible within minutes, the new data platform supported continuous delivery concepts and delivered its (data) services on demand via self-service.
Core Services of the Data Platform:
- ETL to onboard data sources to the platform: on-prem databases (Oracle, Teradata), cloud data sources (PostgreSQL, Cassandra) and third-party data providers (e.g., Google Analytics, Adobe Analytics & Salesforce)
- A data science stack (exploration & development) for data scientists and a data analytics stack for data analysts
- (SQL based) data exploration and data catalog
- A runtime environment for hosting analytical, ML, and AI applications (e.g., Spark, TensorFlow models)
The new platform integrated with multiple data sources and could quickly add new ones. It delivered cloud benefits such as scalability, reliability, real-time availability, and cost-effectiveness. Dashboards visualized performance, and dedicated alerting notified the team immediately of incidents.
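One way to picture how a platform can make new data sources quick to add is a small registry in which each source contributes only an extraction function. The sketch below is an assumption about how such onboarding could look, not a description of the actual implementation; `register_source`, `run_all`, the source names, and the sample records are all invented for illustration.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Extractor = Callable[[], Iterable[Record]]

# Registry mapping a source name to the function that extracts its records.
SOURCES: Dict[str, Extractor] = {}


def register_source(name: str) -> Callable[[Extractor], Extractor]:
    """Decorator that adds an extraction function to the registry."""
    def decorator(func: Extractor) -> Extractor:
        if name in SOURCES:
            raise ValueError(f"source already registered: {name}")
        SOURCES[name] = func
        return func
    return decorator


@register_source("web_analytics")
def extract_web_analytics() -> Iterable[Record]:
    # Stand-in for a real API or database call (hypothetical data).
    yield {"visitor": "v1", "page": "/track"}
    yield {"visitor": "v2", "page": "/quote"}


@register_source("customer_survey")
def extract_customer_survey() -> Iterable[Record]:
    # Stand-in for a second, unrelated source (hypothetical data).
    yield {"visitor": "v1", "score": 9}


def run_all() -> List[Record]:
    """Extract every registered source and tag each record with its origin."""
    loaded: List[Record] = []
    for name, extract in SOURCES.items():
        for record in extract():
            loaded.append({"source": name, **record})
    return loaded
```

In this sketch, onboarding another source means writing one more decorated function; the loop in `run_all` does not change.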
Migration from AWS to GCP
In 2019, the client decided to move the data platform from AWS to Google Cloud Platform (GCP). The migration required refactoring the cloud-native services, which meant targeted changes to the infrastructure code written in HashiCorp Terraform. It also required moving the data sets from Amazon S3 buckets to Google Cloud Storage. The migration proved the portability of the platform's cloud-agnostic analytical, ML, and AI applications: configured with the proper libraries and given access to the required data sets, they can execute on a runtime environment regardless of the cloud provider.
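Moving data sets between object stores usually starts with a mapping from source locations to target locations. The sketch below builds such a copy plan by translating `s3://` URIs into `gs://` URIs. It does not copy anything, and the bucket names and the `bucket_map` parameter are hypothetical; the source does not say how the actual transfer was planned or executed.

```python
from typing import Dict, List, Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"URI needs both a bucket and a key: {uri}")
    return bucket, key


def to_gcs_uri(s3_uri: str, bucket_map: Dict[str, str]) -> str:
    """Translate an S3 object URI to its Cloud Storage counterpart."""
    bucket, key = split_s3_uri(s3_uri)
    if bucket not in bucket_map:
        raise KeyError(f"no target bucket configured for: {bucket}")
    return f"gs://{bucket_map[bucket]}/{key}"


def plan_copy(s3_uris: List[str], bucket_map: Dict[str, str]) -> List[Tuple[str, str]]:
    """Build (source, destination) pairs without copying anything."""
    return [(uri, to_gcs_uri(uri, bucket_map)) for uri in s3_uris]


if __name__ == "__main__":
    # Hypothetical bucket names for illustration.
    mapping = {"acme-raw": "acme-raw-gcp"}
    for src, dst in plan_copy(["s3://acme-raw/events/2019/01.parquet"], mapping):
        print(src, "->", dst)
```

Keeping the plan separate from the transfer makes it easy to review the mapping and to fail early on buckets that have no configured target.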
The GCP Data and Analytics Platform
The new platform aligns data science and analytics with the needs of the business. It can be used to search for and identify customer segments to target with campaigns, and it can run analyses without interfering with existing data in corporate systems. The data platform facilitates data analysis on demand, a way of working that has replaced standard reporting.
Once a data scientist's analysis finishes, the cluster is cleaned up automatically, which saves costs. Each data scientist has a personal data science stack for developing predictive models. The benefits are isolation, data traceability, and scalability.
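The automatic cleanup of a per-user cluster can be pictured as a context manager that always tears the cluster down, even when the analysis fails. The sketch below uses an in-memory stand-in for a cloud client; `InMemoryClusterClient`, `ephemeral_cluster`, and the cluster ID format are hypothetical and do not correspond to a real cloud SDK or to the platform's actual mechanism.

```python
from contextlib import contextmanager
from typing import Iterator, Set


class InMemoryClusterClient:
    """Stand-in for a cloud API client; it only tracks running cluster IDs."""

    def __init__(self) -> None:
        self.running: Set[str] = set()
        self._counter = 0

    def create(self, owner: str) -> str:
        self._counter += 1
        cluster_id = f"{owner}-cluster-{self._counter}"
        self.running.add(cluster_id)
        return cluster_id

    def delete(self, cluster_id: str) -> None:
        self.running.discard(cluster_id)


@contextmanager
def ephemeral_cluster(client: InMemoryClusterClient, owner: str) -> Iterator[str]:
    """Create a personal cluster and guarantee its deletion afterwards."""
    cluster_id = client.create(owner)
    try:
        yield cluster_id
    finally:
        # Runs on success and on failure, so no cluster is left running.
        client.delete(cluster_id)


if __name__ == "__main__":
    client = InMemoryClusterClient()
    with ephemeral_cluster(client, "alice") as cluster_id:
        print("running analysis on", cluster_id)
    print("clusters still running:", len(client.running))
```

Tying the cluster's lifetime to the analysis is what keeps idle resources, and therefore costs, from accumulating.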
The platform also helps data scientists apply mature software engineering and DevOps principles. They can use version control for their code, embed code quality checks, and use automated deployment pipelines.