FrieslandCampina has a team of data scientists and data engineers who work on the processing and analysis of data from many different sources, including ERP systems, point-of-sale data and social media. They work according to agile methodologies and require a high level of flexibility in their activities.

Problem definition

They work in an AWS environment using native AWS services. There is a requirement to reduce their dependency on other teams while safeguarding security and operational stability. They often need to work with production datasets to validate data models. The solution provided needs to be reusable across multiple projects and development teams with minimal customization. Specifically, we were asked to achieve the following:

  • Increase speed of deployment from days to minutes.
  • Achieve flexibility and reliability through repeatability.
  • Enable data scientists and engineers to deploy and configure AWS resources based on ‘infrastructure as code’ commits to code repositories, with an approval step.
  • A single source of truth via version control.
  • Eliminate human errors by enforcing a unified way of working via deployment pipelines and avoiding local, undocumented hacks.


Solution

We enabled the data science team in the following ways:

  • Deployment pipeline for AWS Lambda functions that run short-lived or lighter Python Spark processing and analysis of data (including cross-account deployments).

Using AWS SAM templates, AWS CodeCommit, AWS CodePipeline, AWS CodeBuild and AWS CloudFormation, we created a deployment pipeline that allows data engineers and data scientists to deploy AWS Lambda functions for Python Spark based data processing and analysis.

In AWS CodeCommit we use separate test and production branches that deploy through different AWS CodePipelines; for production there is a manual approval step, with access limited to specific authorized AWS IAM users.
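As an illustration, below is a minimal sketch of a Lambda handler of the kind this pipeline deploys. It is not the team's actual code: it assumes an S3 put trigger configured in the SAM template, and the line-counting step is a stand-in for the real processing logic.

# Hypothetical sketch: an S3-triggered Lambda function deployed via the
# pipeline described above. Bucket and key names come from the event itself.
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Fetch the newly uploaded object and count its lines as a stand-in
        # for the team's real processing logic.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        results.append({"bucket": bucket, "key": key, "lines": len(body.splitlines())})

    print(json.dumps(results))  # written to CloudWatch Logs
    return {"processed": len(results)}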


  • Deployment pipeline for Python Spark applications that run long-running data processing and analysis on MapR and/or EMR clusters (including cross-account deployments).

Using AWS CodeCommit, AWS CodePipeline, AWS CodeDeploy and AWS CloudFormation, we created a deployment pipeline that allows data engineers and data scientists to deploy and configure Python Spark applications to a MapR cluster and/or an AWS EMR cluster for long-running data processing and analysis.

As with the Lambda pipeline, test and production branches in AWS CodeCommit deploy through different AWS CodePipelines; for production there is a manual approval step, with access limited to specific authorized AWS IAM users.
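Below is a minimal, illustrative sketch of a Python Spark application of the kind this pipeline deploys; the input/output paths and column names are hypothetical and would in practice come from the pipeline's configuration.

# Illustrative Python Spark job of the kind deployed to MapR or EMR.
# Paths and column names are hypothetical.
import sys
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def main(input_path: str, output_path: str) -> None:
    spark = SparkSession.builder.appName("daily-sales-aggregation").getOrCreate()

    # Read raw sales events, aggregate per day, and write the result back out.
    events = spark.read.parquet(input_path)
    daily = events.groupBy(F.to_date("event_time").alias("day")).count()
    daily.write.mode("overwrite").parquet(output_path)

    spark.stop()

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])

On EMR such a script would typically be submitted as a spark-submit step; on MapR it would go through the cluster's own job submission mechanism.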

  • Repeatable, standardized deployment of developer workstations in staging and/or production for new hires.

Using AWS CloudFormation and Ansible (AWX), we deploy a standard developer workstation on demand for new hires to the data science team.
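As a rough sketch of the on-demand part, the snippet below launches a workstation stack with boto3; the stack name, template URL and parameter keys are illustrative, and Ansible (AWX) would subsequently configure the instance.

# Hypothetical sketch: launching a developer workstation stack on demand.
# Template URL, stack name and parameter keys are illustrative.
import boto3

cloudformation = boto3.client("cloudformation")

def create_workstation(new_hire: str, environment: str) -> str:
    response = cloudformation.create_stack(
        StackName=f"dev-workstation-{new_hire}-{environment}",
        TemplateURL="https://s3.amazonaws.com/example-bucket/dev-workstation.yaml",
        Parameters=[
            {"ParameterKey": "Owner", "ParameterValue": new_hire},
            {"ParameterKey": "Environment", "ParameterValue": environment},
        ],
        Capabilities=["CAPABILITY_NAMED_IAM"],
        Tags=[
            {"Key": "application", "Value": "data-science-workstation"},
            {"Key": "environment", "Value": environment},
        ],
    )
    return response["StackId"]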

Outcomes

  • Lambda function deployments, including S3 buckets and associated triggers, can be deployed and updated by the data science team independently of other teams.
    • Including cross-account deployment to staging and production.
  • Continuous deployment and testing of Python Spark applications to MapR and/or EMR.
  • Reduction of the time to provision new hires with developer workstations in staging and/or production environments from several days to 2 hours.
  • Billable deployment requests from the development team to the DevOps team have been reduced by 70%.
  • The pipeline and process is reusable for other development teams.
  • Logging for data science team activities centralized in CloudWatch, with alerting to a designated Slack channel (see the sketch after this list).
  • Access control for deployments and configuration via ‘application’ and ‘environment’ tagging.
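One way to implement the CloudWatch-to-Slack alerting mentioned above is a small Lambda function behind a CloudWatch Logs subscription filter. The sketch below assumes that pattern and that the Slack incoming-webhook URL is supplied via an environment variable; the variable name and message format are illustrative, not the team's actual implementation.

# Hypothetical sketch: forward CloudWatch Logs subscription events to Slack.
# SLACK_WEBHOOK_URL is an assumed environment variable.
import base64
import gzip
import json
import os
import urllib.request

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

def handler(event, context):
    # CloudWatch Logs delivers the payload base64-encoded and gzip-compressed.
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))

    lines = [e["message"] for e in payload.get("logEvents", [])]
    text = f"*{payload.get('logGroup', 'unknown log group')}*\n" + "\n".join(lines)

    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return {"status": resp.status}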