Running Apache Beam WordCount on Apache Apex
As the next step toward running on Apex, you can run your pipeline on a local Apex cluster, which gives a testing scenario slightly closer to production:
mvn compile exec:java \
    -P apex-runner \
    -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--inputFile=gs://apache-beam-samples/shakespeare/* --output=/tmp/output-apex/ --runner=ApexRunner --embeddedExecution=true"
Again, you should find output files in /tmp/output-apex. The number of files may differ, but their overall contents will be the same. Unless you request particular sharding, it is up to the Beam runner to decide the parallelism of the write step.
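One way to check that the overall contents match regardless of how many files the runner wrote is to merge all shards back into a single set of counts. The following is a minimal sketch, not part of the Beam or Apex tooling: the helper name merge_word_counts is hypothetical, and it assumes each output line has the form "word: count", as the WordCount example writes them.

```python
import glob
import os
from collections import Counter


def merge_word_counts(output_dir):
    """Merge every shard file in output_dir into one Counter (hypothetical helper)."""
    totals = Counter()
    for path in sorted(glob.glob(os.path.join(output_dir, "*"))):
        if not os.path.isfile(path):
            continue  # skip subdirectories, e.g. temporary folders
        with open(path, encoding="utf-8") as shard:
            for line in shard:
                # Assumed line format: "word: count"; split on the last separator.
                word, sep, count = line.rstrip("\n").rpartition(": ")
                if not sep:
                    continue  # ignore blank or unexpected lines
                totals[word] += int(count)
    return totals


if __name__ == "__main__":
    # Print the ten most frequent words across all shards.
    for word, count in merge_word_counts("/tmp/output-apex").most_common(10):
        print(f"{word}: {count}")
```

Running the helper on the output directories of two runs and comparing the resulting counters with == should show whether they agree, even when the shard counts differ.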
Next, we should run this on a real YARN cluster. If you are not already in an environment with a cluster available, it is easy to set one up with Google Cloud Dataproc or AWS EMR; no special treatment is needed to do so.
Let's spin up a Dataproc cluster and run the pipeline on it:
mvn compile...