Google Cloud Platform Blog
Product updates, customer stories, and tips and tricks on Google Cloud Platform
Easily run Dataflow Big Data pipelines anywhere, thanks to Cloudera
January 20, 2015
Big data processing can take place in many contexts. Sometimes you’re prototyping new pipelines, and at other times you’re deploying them to run at scale. Sometimes you’re working on-premises, and at other times you’re in the cloud. Sometimes you care most about speed of execution, and at other times you want to optimize for the lowest possible processing cost. The best deployment option often depends on this context. It also changes over time; new data processing engines become available, each optimized for specific needs — from the venerable Hadoop MapReduce to Storm, Spark, Tez or Flink, all in open source, as well as cloud-native services. Today’s optimal choice of big data runtime might not be tomorrow’s.
But in all these cases, what remains true is that you need an easy-to-use, powerful and flexible programming model that makes developers productive. And no one wants to have to rewrite their algorithm for a specific runtime.
We believe the Dataflow programming model, based on years of experience at Google, can provide maximum developer productivity and seamless portability. That's why in December we open sourced the Cloud Dataflow SDK, which offers a set of primitives for large-scale distributed computing, including rich semantics for stream processing. This allows the same program to execute either in stream or batch mode.
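To make the batch-versus-stream point concrete, here is a minimal sketch in plain Python. It is only an analogy and does not use the Cloud Dataflow SDK: the function `running_word_counts` and the helper `endless_lines` are hypothetical names invented for this illustration. The idea it shows is that processing logic written once against an iterable can consume a bounded input (batch) or an unbounded one (stream) without modification.

```python
import itertools
from typing import Dict, Iterable, Iterator, Tuple


def running_word_counts(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (word, count so far) for every word, in arrival order.

    Hypothetical illustration: the logic never asks whether `lines` ends.
    """
    counts: Dict[str, int] = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
            yield word, counts[word]


# Batch mode: a bounded, in-memory input, fully consumed.
batch_input = ["a b", "a"]
batch_result = list(running_word_counts(batch_input))


def endless_lines() -> Iterator[str]:
    """Stand-in for an unbounded source: repeats the same lines forever."""
    return itertools.cycle(batch_input)


# Stream mode: the same function over an unbounded input; we only look at
# the first three results because the source never ends.
stream_result = list(itertools.islice(running_word_counts(endless_lines()), 3))
```

A real streaming system also has to deal with concerns this sketch ignores, such as windowing and late data; the snippet only illustrates the "write once, run in either mode" idea.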
Today, we’re taking the next step in ensuring the portability of the Dataflow programming model by working with Cloudera to make Dataflow run on Spark. There are currently three runners available to allow Dataflow programs to execute in different environments:
Direct Pipeline: The “Direct Pipeline” runner executes the program on the local machine.
Google Cloud Dataflow: The Google Cloud Dataflow service is a hosted and fully managed execution environment for Dataflow programs on Google Cloud Platform. Programs can be deployed on it via a runner. This service is currently in alpha phase and available to a limited number of users; you can apply here.
Spark: Thanks to Cloudera, the Spark runner allows the same Dataflow program to execute on a Spark cluster, whether in the cloud or on-premises. The runner is part of the Cloudera Labs effort and is available in this GitHub repo. You can find out more about Dataflow and the Spark runner from Cloudera’s Josh Wills in this blog post.
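The runner idea described above can be sketched in a few lines of plain Python. This is a conceptual illustration under stated assumptions, not the Dataflow SDK API: `Pipeline`, `Runner`, `LocalRunner` and `ChunkedRunner` are hypothetical names made up for this example. The pipeline is defined once; each runner decides how to execute it. `ChunkedRunner` is a crude stand-in for a cluster-style runner that processes partitions of the input independently.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, List

# A step maps one input element to zero or more output elements.
Step = Callable[[str], Iterable[str]]


class Pipeline:
    """Hypothetical runner-agnostic pipeline: an ordered list of steps."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def apply(self, step: Step) -> "Pipeline":
        self.steps.append(step)
        return self  # returned so calls can be chained


def run_steps(steps: List[Step], elements: List[str]) -> List[str]:
    """Apply each step to every element, flattening the outputs."""
    for step in steps:
        elements = [out for element in elements for out in step(element)]
    return elements


class Runner(ABC):
    """Hypothetical execution strategy for a Pipeline."""

    @abstractmethod
    def run(self, pipeline: Pipeline, data: List[str]) -> List[str]:
        ...


class LocalRunner(Runner):
    """Runs every step in-process over the whole input at once."""

    def run(self, pipeline: Pipeline, data: List[str]) -> List[str]:
        return run_steps(pipeline.steps, list(data))


class ChunkedRunner(Runner):
    """Processes fixed-size chunks independently, then concatenates them."""

    def __init__(self, chunk_size: int) -> None:
        self.chunk_size = chunk_size

    def run(self, pipeline: Pipeline, data: List[str]) -> List[str]:
        data = list(data)
        results: List[str] = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            results.extend(run_steps(pipeline.steps, chunk))
        return results


# The pipeline is written once, with no reference to any runner.
pipeline = Pipeline().apply(str.split).apply(lambda word: [word.lower()])
lines = ["Dataflow on Spark", "Spark on premises"]

local_result = LocalRunner().run(pipeline, lines)
chunked_result = ChunkedRunner(chunk_size=1).run(pipeline, lines)
print(local_result == chunked_result)
```

Because every step here works element by element, splitting the input into chunks does not change the result. That property is what lets the choice of runner stay separate from the pipeline definition in this sketch.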
We are delighted that Cloudera is joining us, and we look forward to the future growth of the Dataflow ecosystem. We’re confident that Dataflow programs will make data more useful in an ever-growing number of environments, in cloud or on-premises. Please join us – whether by using the Dataflow SDK (deploying via one of the three runners listed above) for your own data processing pipelines, or by creating a new Dataflow runner for your favorite big data runtime.
-Posted by William Vambenepe, Product Manager