I Love AWS

Big Data on AWS

How to use Amazon EMR to process data using the broad ecosystem of Hadoop tools like Hive and Hue and work with Amazon DynamoDB, Amazon Redshift, Amazon QuickSight, Amazon Athena and Amazon Kinesis.

Harness the power of your data with AWS Analytics https://aws.amazon.com/blogs/big-data/harness-the-power-of-your-data-with-aws-analytics/

AWS Marketplace for Big Data

Data Ingestion and Transfer

Amazon Kinesis Agent for Data Ingestion https://github.com/awslabs/amazon-kinesis-agent
Apache Flume https://flume.apache.org/ can be installed and run on Amazon EC2 instances.
S3DistCp http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html copies data between Amazon S3 buckets or from HDFS to Amazon S3, including across AWS accounts.
Apache Sqoop https://cwiki.apache.org/confluence/display/SQOOP/Home supports the transfer of data between Hadoop and structured data stores such as Amazon RDS.
AWS IoT https://aws.amazon.com/iot/ can collect and handle large quantities of data coming from a variety of sources, and makes it easy to use AWS services like AWS Lambda, Amazon Kinesis, Amazon S3, Amazon Machine Learning, and Amazon DynamoDB.
AWS DataSync https://aws.amazon.com/datasync/ is a data transfer service that makes it easy for you to automate moving data between on-premises storage and Amazon S3 or Amazon Elastic File System (Amazon EFS).
Amazon FSx for Lustre https://aws.amazon.com/fsx/lustre/ provides a high-performance file system optimized for fast processing of workloads such as machine learning, high performance computing (HPC), video processing, financial modeling, and electronic design automation (EDA).
AWS Glue DataBrew https://aws.amazon.com/glue/features/databrew/ is a visual data preparation tool for cleaning and normalizing data to prepare it for analytics and machine learning.
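
As an illustration of the S3DistCp entry above, the following Python sketch builds an Amazon EMR step definition that runs s3-dist-cp through command-runner.jar. The HDFS path, bucket name, and step name are hypothetical placeholders, and the helper build_s3distcp_step is invented for this example. It only builds a dictionary and does not call AWS.

```python
def build_s3distcp_step(src, dest, name="Copy data with S3DistCp"):
    """Return an EMR step definition that runs s3-dist-cp from src to dest.

    src and dest are locations such as hdfs:///path or s3://bucket/prefix/.
    The values used below are placeholders, not real resources.
    """
    if not src or not dest:
        raise ValueError("both src and dest are required")
    return {
        "Name": name,
        # Keep the cluster running if the copy fails (an assumption for this sketch).
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets an EMR step run a command such as s3-dist-cp.
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }


# Hypothetical source and destination, for illustration only.
step = build_s3distcp_step("hdfs:///output/logs", "s3://example-bucket/logs/")
print(step["HadoopJarStep"]["Args"])
```

A dictionary of this shape could be passed in the Steps list of the boto3 EMR client's add_job_flow_steps call for a running cluster, assuming boto3 and suitable credentials are available.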

Big Data Streaming and Amazon Kinesis

Amazon Kinesis Data Streams resources https://aws.amazon.com/kinesis/data-streams/resources/ Tools and Libraries.
Overview of Amazon Kinesis Data Firehose https://aws.amazon.com/kinesis/data-firehose/
Amazon Kinesis Data Analytics - SQL Functions https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sql-reference-functions.html
Using the Schema Discovery Feature on Streaming Data https://docs.aws.amazon.com/kinesisanalytics/latest/dev/sch-dis.html
Apache Spark Streaming enables high-throughput, fault-tolerant, and scalable processing of live data streams. It divides the incoming data streams into batches before sending them to the Spark engine for processing. http://spark.apache.org/streaming/
Amazon Managed Streaming for Apache Kafka (Amazon MSK) https://aws.amazon.com/msk/ is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data.
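
The Spark Streaming entry above says that incoming streams are divided into batches before processing. The following plain-Python sketch illustrates that micro-batching idea only. It is not the Spark API; the function micro_batches, the sample events, and the one-second batch interval are all assumptions made for this example.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into fixed-length time batches.

    Batch k holds every event whose timestamp t satisfies
    k * batch_interval <= t < (k + 1) * batch_interval.
    Batches are returned in time order; intervals with no events are skipped.
    """
    if batch_interval <= 0:
        raise ValueError("batch_interval must be positive")
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[k] for k in sorted(batches)]


# Hypothetical events: (seconds since start, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]

# Each batch would be handed to the processing engine as one unit of work.
for batch in micro_batches(events, batch_interval=1.0):
    print(len(batch), batch)
```

A real Spark Streaming job would obtain its batches from the framework, for example from a Kinesis or Kafka source, rather than from a list in memory.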

Visualisation

The best stats you've ever seen https://www.youtube.com/watch?v=usdJgEwMinM a TED-Ed talk by Hans Rosling in July 2013.
Amazon QuickSight Features https://aws.amazon.com/quicksight/features/
Demo QuickSight Dashboards https://democentral.learnquicksight.online/#Dashboard-DashboardDemo-Exec-Business-Summary
Amazon QuickSight Community hub https://community.amazonquicksight.com/ - the place to ask, answer, and learn with others in the QuickSight community

TECHNICAL BLOGS and PRESENTATIONS

Pushing Physical Limits with AWS Snowball Edge https://www.youtube.com/watch?v=__ooXhq5gZ4&t=47s
AWS serverless data analytics pipeline reference architecture https://aws.amazon.com/blogs/big-data/aws-serverless-data-analytics-pipeline-reference-architecture
SQL Based Data Processing in Amazon ECS https://d1.awsstatic.com/architecture-diagrams/ArchitectureDiagrams/sql_based_data_processing_amazon_ecs.pdf Build a configuration-driven, codeless extract-transform-load (ETL) alternative using a containerized ETL framework (ARC) https://arc.tripl.ai/ that simplifies and accelerates data processing with Apache Spark.
This blog post describes how to implement continuous integration and delivery of serverless AWS Glue ETL applications using AWS Developer Tools https://aws.amazon.com/blogs/big-data/implement-continuous-integration-and-delivery-of-serverless-aws-glue-etl-applications-using-aws-developer-tools/
Using Step Functions to Orchestrate Amazon EMR Workloads https://aws.amazon.com/blogs/aws/new-using-step-functions-to-orchestrate-amazon-emr-workloads/
Amazon EMR Studio introduction video https://youtu.be/Xv0yhKJQPOc
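
To make the Step Functions and Amazon EMR entry above more concrete, here is a minimal Python sketch that assembles a one-state Amazon States Language definition using the addStep.sync service integration. The cluster ID, script location, and state name are hypothetical placeholders, and the helper emr_add_step_state is invented for this example. The sketch only prints JSON and does not create any AWS resources.

```python
import json


def emr_add_step_state(cluster_id, step_name, args):
    """Return a Task state that adds a step to an EMR cluster and waits for it."""
    return {
        "Type": "Task",
        # The .sync suffix makes the state machine wait until the EMR step finishes.
        "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
        "Parameters": {
            "ClusterId": cluster_id,
            "Step": {
                "Name": step_name,
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
            },
        },
        "End": True,
    }


# Placeholder cluster ID and script path, for illustration only.
definition = {
    "Comment": "Sketch: run one Spark job on an existing EMR cluster",
    "StartAt": "RunSparkJob",
    "States": {
        "RunSparkJob": emr_add_step_state(
            "j-EXAMPLECLUSTER",
            "Run Spark job",
            ["spark-submit", "s3://example-bucket/jobs/job.py"],
        )
    },
}

print(json.dumps(definition, indent=2))
```

The printed JSON could serve as the starting point for a state machine definition. A production workflow would usually add further states, such as cluster creation, error handling, and cluster termination.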

CUSTOMER REFERENCES

AWS Case Study: Hearst Corporation https://aws.amazon.com/solutions/case-studies/hearst/ . See also AWS re:Invent 2015 | (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS https://www.youtube.com/watch?v=6cwbbqi36k8 . The Hearst presentation starts at 18:30.
Netflix case study https://aws.amazon.com/solutions/case-studies/netflix/
Snowplow Analytics https://aws.amazon.com/solutions/case-studies/snowplow/ enables its clients to collect granular customer-level and event-level data from multiple platforms (web and mobile).
AdRoll https://aws.amazon.com/solutions/case-studies/adroll/ is a global leader in retargeting, serving 50 billion personal ad impressions every day. It uses Amazon S3, Amazon Kinesis, Apache Storm, and Amazon DynamoDB.
DataXu https://aws.amazon.com/solutions/case-studies/dataxu/ uses Amazon Kinesis for data processing, and Amazon Athena on Amazon S3 as a single source of truth.
Channel 4 https://aws.amazon.com/solutions/case-studies/channel-4/ uses Amazon EMR, which enables it to run analyses on 100% of its available data instead of sampling.
