Welcome to episode 274 of The Cloud Pod, where the forecast is always cloudy! Justin, Ryan, and Matthew are your hosts this week as we explore the world of snapshots, Maia, open source, and VMware – just to name a few of the topics. And stay tuned for an installment of our continuing Cloud Journey Series exploring ways to decrease tech debt, all this week on The Cloud Pod.
Titles we almost went with this week:
- 🫛The Cloud Pod in Parallel Cluster
- 🧾The Cloud Pod cringes at managing 1,000 AWS accounts
- 👀The Cloud Pod welcomes Imagen 3 with less Wokeness
- 🖼️The Cloud Pod wants to be instantly snapshotted
- 💸The Cloud Pod hates tech debt
A big thanks to this week’s sponsor:
We’re sponsorless! Want to get your brand, company, or service in front of a very enthusiastic group of cloud news seekers? You’ve come to the right place! Send us an email or hit us up on our slack channel for more info.
General News
00:32 Elasticsearch is Open Source, Again
- Shay Banon is pleased to call Elasticsearch and Kibana “open source” again. He says everyone at Elastic is ecstatic to be open source again; it’s part of his and “Elastic’s DNA.”
- They’re doing this by adding AGPL as another license option next to ELv2 and SSPL in the coming weeks.
- They never stopped believing in or behaving like an OSS company after they changed the license, but being able to use the term open source – and using AGPL, an OSI-approved license – removes any questions or FUD people might have.
- Shay says the change 3 years ago was because they had issues with AWS and the market confusion their offering was causing.
- So, after trying all the other options, changing the license – all while knowing it would result in a fork with a different name – was the path they took.
- While it was painful, they said it worked.
- 3 years later, Amazon is fully invested in their OpenSearch fork, the market confusion has mostly gone, and their partnership with AWS is stronger than ever.
- They were even named AWS Partner of the Year.
- They want to “make life of our users as simple as possible,” so if you’re ok with the ELv2 or the SSPL, then you can keep using that license. They aren’t removing anything, just giving you another option with AGPL.
- He calls out the trolls and nitpickers he expects to pick at this announcement, and attempts to address them in advance.
- “Changing the license was a mistake, and Elastic now backtracks from it”: “We removed a lot of market confusion when we changed our license 3 years ago. And because of our actions, a lot has changed. It’s an entirely different landscape now. We aren’t living in the past. We want to build a better future for our users. It’s because we took action then that we are in a position to take action now.”
- “AGPL is not true open source, license X is”: “AGPL is an OSI-approved license, and it’s a widely adopted one. For example, MongoDB used to be AGPL and Grafana is AGPL. It shows that AGPL doesn’t affect usage or popularity. We chose AGPL because we believe it’s the best way to start to pave a path, with OSI, towards more open source in the world, not less.”
- “Elastic changes the license because they are not doing well”: “I will start by saying that I am as excited today as ever about the future of Elastic. I am tremendously proud of our products and our team’s execution. We shipped Stateless Elasticsearch, ES|QL, and tons of vector database/hybrid search improvements for GenAI use cases. We are leaning heavily into OTel in logging and Observability. And our SIEM product in Security keeps adding amazing features, and it’s one of the fastest growing in the market. Users’ response has been humbling. The stock market will have its ups and downs. What I can assure you is that we are always thinking long term, and this change is part of it.”
03:03 📢 Ryan – “I have a hard time thinking that this has nothing to do with performance, and, you know, there was quite the reputation hit when they changed the license before. And since you can use OpenSearch now, which is truly open source, I imagine there’s a lot of people that are sort of adopting that instead.”
AI Is Going Great – Or How ML Makes All Its Money
06:28 Nvidia H100 now available on DigitalOcean Kubernetes (EA)
- DigitalOcean is making Nvidia’s latest H100 GPUs available on DigitalOcean Kubernetes (DOKS).
- Early access customers have the choice of 1 x H100 or 8 x H100 nodes.
- H100 nodes are, of course, in high demand for building and training AI workloads, so this is a great alternative to the other cloud providers.
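- For anyone kicking the tires, here’s a minimal sketch (using the official Kubernetes Python client) of what claiming a single H100 from a DOKS GPU node pool might look like. The image and the `nvidia.com/gpu` resource name follow standard NVIDIA device plugin conventions; check DigitalOcean’s docs for the specifics of their GPU worker nodes.

```python
# Minimal sketch: schedule a pod that claims one H100 on a DOKS GPU node pool.
# Assumes your doctl-generated kubeconfig is active and the cluster runs the
# NVIDIA device plugin (which advertises GPUs as the nvidia.com/gpu resource).
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="h100-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvidia/cuda:12.4.1-base-ubuntu22.04",
                command=["nvidia-smi"],  # prints the GPU the pod was granted
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # one GPU from a 1x or 8x H100 node
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```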
06:51 📢 Ryan – “I wonder how many people, because of the capacity constraints, are actually having to utilize multiple clouds for this. It’s kind of crazy if you think about, you know, people using capacity across DigitalOcean, GCP, Azure, and AWS trying to get model training done, but it’s possible.”
AWS
08:06 How AWS powered Prime Day 2024 for record-breaking sales
- AWS is here to tell us how they powered the mighty Prime Day from July 17-18th in their annual recap blog post.
- Amazon services such as Rufus and Search use AWS artificial intelligence chips under the hood, and Amazon deployed a cluster of over 80,000 Inferentia and Trainium chips for Prime Day.
- They used over 250k AWS Graviton chips to power more than 5,800 distinct Amazon.com services (double that of 2023).
- EBS used 264 PiB of storage, 62% more than the year before.
- With 5.6 trillion read/write operations, they transferred 444 Petabytes of data during the event, an 81% increase.
- Aurora had 6,311 database instances running PostgreSQL- and MySQL-compatible editions, which processed 376 billion transactions, stored 2,978 terabytes of data, and transferred 913 terabytes of data.
- DynamoDB powers many things, including Alexa, Amazon.com sites and Amazon fulfillment centers.
- Over the course of the Prime Days, they made tens of trillions of calls to the DynamoDB API.
- DynamoDB maintained high availability while delivering single-digit millisecond responses, peaking at 146 million requests per second.
- ElastiCache served more than a quadrillion requests on a single day, with a peak of over 1 trillion requests per minute.
- QuickSight dashboards saw 107k unique hits, 1300+ unique visitors and delivered over 1.6M queries.
- SageMaker processed 145B inference requests.
- SES sent 30 percent more emails than the prior year.
- GuardDuty monitored nearly 6 trillion log events per hour, a 31.9% increase.
- CloudTrail processed over 976 billion events in support of Prime Day.
- CloudFront had a peak load of over 500M HTTP requests per minute, for a total of over 1.3 trillion HTTP requests during Prime Day – 30% more than the year prior.
- Rigorous preparation is key; for example, 733 AWS Fault Injection Service experiments were run to test resilience and ensure Amazon.com remained highly available.
- With the rebranded AWS Countdown support program, your organization can handle these big events using tried-and-true methods.
13:47 📢 Matthew – “ I would love to be at a company where I’m running something at this scale. I feel like, you know, they’re like, cool, come have us do it. But the amount of companies that run stuff at this insane scale is going to be in the single digits.”
16:48 Announcing AWS Parallel Computing Service to run HPC workloads at virtually any scale / AWS Parallel Computing Service is Now Generally Available, Designed to Accelerate Scientific Discovery
- AWS is announcing AWS Parallel Computing Service (AWS PCS), a new managed service that helps customers set up and manage HPC clusters so they can seamlessly run simulations at virtually any scale on AWS.
- Using the Slurm scheduler, you can work in a familiar HPC environment, accelerating your time to results instead of worrying about infrastructure.
- This is a managed version of AWS ParallelCluster, the open source tool they released in November 2018.
- This open source tool allowed you to build and deploy POC and production HPC environments, and you could take advantage of a CLI, API and Python libraries.
- But you were responsible for updates, as well as tearing down and redeploying clusters.
- The managed service makes everything available via the AWS Management Console, AWS SDK, and AWS CLI.
- Your system administrators can create managed Slurm clusters that use their compute and storage configs, identity and job allocation preferences.
- “Developing a cure for a catastrophic disease, designing novel materials, advancing renewable energy, and revolutionizing transportation are problems that we just can’t afford to have waiting in a queue,” said Ian Colle, director, advanced compute and simulation at AWS. “Managing HPC workloads, particularly the most complex and challenging extreme-scale workloads, is extraordinarily difficult. Our aim is that every scientist and engineer using AWS Parallel Computing Service, regardless of organization size, is the most productive person in their field because they have the same top-tier HPC capabilities as large enterprises to solve the world’s toughest challenges, any time they need to, and at any scale.”
- Maxar Intelligence provides secure, precise geospatial intelligence, enabling government and commercial customers to monitor, understand, and navigate our changing planet. “As a long-time user of AWS HPC solutions, we were excited to test the service-driven approach from AWS Parallel Computing Service,” said Travis Hartman, director of Weather and Climate at Maxar Intelligence. “We found great potential for AWS Parallel Computing Service to bring better cluster visibility, compute provisioning, and service integration to Maxar Intelligence’s WeatherDesk platform, which would enable the team to make their time-sensitive HPC clusters more resilient and easier to manage.”
18:31 Exclusive: Inside the mind of AWS CEO Matt Garman and how he aims to shape the future of cloud and AI
- Silicon Angle’s John Furrier got an exclusive with new AWS CEO Matt Garman, and they chatted about how he plans to shape the future of cloud and AI.
- Garman was a key architect of the AWS EC2 computing service.
- Now, as the new CEO, he faces leading AWS into the future – and this is a future dominated by generative AI.
- On generative AI, Garman says that their job at AWS is to help customers and companies take advantage of AI on a secure, reliable, and performant platform that allows them to innovate in ways never imagined before.
- Garman sees AI as a transformative force that could redefine the AWS trajectory.
- Garman asserts that they never obsess about their competitors, instead they obsess about their customers.
- He says AWS is focused on customers by focusing on the future and not dwelling on the past.
- In the interview Garman stressed the importance of inference, which is leveraging the knowledge of the AI to generate insights or perform tasks, as the true killer app of generative AI.
- “All the money and effort that people are spending on building these large training models don’t make sense if there isn’t a huge amount of inference on the backend to build interesting things,” Garman notes. He sees inference not just as a function but as an integral building block that will be embedded in every application.
- “Inference is where the real value of AI is realized,” Garman adds, signaling that AWS is not just participating in the AI revolution but is engineering the very infrastructure that will define its future.
- Garman believes generative AI could unlock new dimensions for AWS, enabling it to maintain its dominance while expanding into new areas of growth.
- Garman views developers and startups as the lifeblood of AWS.
- AWS is not just a cloud provider; it’s an enabler of innovation at all levels, from the smallest startups to the largest enterprises.
- Garman isn’t just investing in silicon with Trainium and Inferentia chips, but in the whole ecosystem by betting on open, scalable technologies.
- Their investments in Ethernet networking, for example, have allowed them to outperform traditional InfiniBand networks in terms of scalability and reliability.
- Garman is confident that AWS is up to the task in AI and cloud and continues to innovate.
- AWS offers not just the best technology, but a partnership that is focused on helping customers succeed.
21:15 📢 Justin – “Well, I feel like we’re reaching the point where AI has already been shoved in at the low-hanging fruit for things. We were like, cool, you know, EBS is AI. Cool. That doesn’t really help me, and I don’t really care about it. I feel like now you’re starting to hit those higher-level services. You’ve done the building blocks, and now hopefully they can start to piece things together to be useful AI, versus just everyone raising their hands and saying, I have AI in things. And I think that’s what’s going to be interesting – can they get AI to those higher-level services the same way they’ve done with S3 & EC2?”
23:55 Amazon EC2 status checks now support the reachability health of attached EBS volumes
- You can now leverage EC2 status checks to directly monitor whether the EBS volumes attached to your instance are reachable and able to complete I/O operations. You can use the new status check to quickly detect attachment issues or volume impairments that may impact the scaling of your apps running on EC2.
- You can further integrate these status checks with Auto Scaling groups to monitor the health of EC2 instances and replace impacted instances, ensuring high availability and reliability of your applications. Attached EBS status checks can be used along with the instance status and system status checks to monitor the health of your instances.
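- A rough sketch of polling the new check with boto3 – the `AttachedEbsStatus` field follows the shape described in the announcement, but verify the exact names against your SDK version; the instance ID is a placeholder:

```python
# Sketch: read the attached-EBS status check alongside the classic instance
# and system checks via DescribeInstanceStatus.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instance_status(InstanceIds=["i-0123456789abcdef0"])

for s in resp["InstanceStatuses"]:
    print("instance check:", s["InstanceStatus"]["Status"])
    print("system check:  ", s["SystemStatus"]["Status"])
    # New: 'ok' means attached EBS volumes are reachable and completing I/O.
    print("ebs check:     ", s.get("AttachedEbsStatus", {}).get("Status", "not reported"))
```

- Per the announcement there’s also a matching CloudWatch metric for the new check, which is how the Auto Scaling integration above would typically be wired up via alarms.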
24:37 📢 Justin – “And this one’s like, I get it. It’s nice that this is there. It seems straightforward that you’d want to know that your EBS volume is attached. But really the reason why people typically don’t like an EBS volume is because of its performance, not because of its attachment status. So they do their own set of custom checks typically on the EBS volume to make sure it’s getting the expected IO throughput, which I do not believe is part of this particular status check.”
29:16 Organizational Units in AWS Control Tower can now contain up to 1,000 accounts
- AWS Control Tower now supports OUs with up to 1,000 accounts.
- You can now implement governance best practices and standardize configurations across the accounts in your OU at greater scale.
- When registering an OU or enabling the AWS Control Tower baseline on an OU, member accounts receive best-practice configurations, controls, and baseline resources such as AWS IAM roles, AWS CloudTrail, AWS Config, and AWS IAM Identity Center.
- Previously you could only register OUs with 300 or fewer accounts, so this is more than a 3x increase.
30:07 📢 Justin – “Every time I see things that support this number of accounts, I’m like, okay, it’s great. But there is a base cost for an AWS account by the time you implement CloudTrail and GuardDuty and Config and all those, and you have to enable some of those services here. The base costs alone are going to be a lot – but then again, if you have a thousand accounts, you probably don’t care about a couple hundred dollars.”
GCP
31:33 Get started with the new generally available features of Gemini in BigQuery
- Several BigQuery Gemini features are now generally available:
- SQL Code Generation and explanation
- Python code generation
- Data Canvas
- Data insights
- Partitioning and clustering recommendations
- Data insights starts with data discovery and assessing which insights you can get from your data assets.
- Imagine having a library of insightful questions tailored specifically to your data – questions you didn’t even know you should ask.
- Data Insights eliminates the guesswork with pre-validated, ready-to-run queries offering immediate insights.
- For instance, if you are working with a table containing customer churn data, Data Insights might prompt you to explore the factors contributing to churn within specific customer segments.
- Gemini for BigQuery now helps you write and modify SQL or Python code using straightforward natural language prompts, referencing relevant schemas and metadata.
- This helps reduce errors and inconsistencies in your code while empowering users to craft complex, accurate queries, even if they have limited coding experience.
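- The generation itself happens in BigQuery Studio’s console, so there’s no SQL-generation API to call here; but to make the churn example concrete, this is the kind of query a prompt like “which customer segments churn the most?” might hand back, run via the standard BigQuery client. The project, dataset, table, and columns are hypothetical.

```python
# Illustration only: running a Gemini-style generated query with the
# google-cloud-bigquery client. All table and column names are made up.
from google.cloud import bigquery

bq = bigquery.Client()
sql = """
SELECT
  customer_segment,
  COUNTIF(churned) AS churned_customers,
  ROUND(100 * COUNTIF(churned) / COUNT(*), 1) AS churn_rate_pct
FROM `my-project.crm.customer_churn`
GROUP BY customer_segment
ORDER BY churn_rate_pct DESC
"""

for row in bq.query(sql).result():
    print(row.customer_segment, row.churn_rate_pct)
```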
32:44 📢 Ryan – “Yeah, I mean, that’s the cool thing about BigQuery and Gemini is that they’ve just built it right into the console.”
34:07 New in Gemini: Custom Gems & improved image generation with Imagen 3
- Google is rolling out Gems, first previewed at Google I/O.
- Gems is a new feature that lets you customize Gemini to create your own personal AI experts on any topic you want.
- They are now available for Gemini Advanced, Business and Enterprise users.
- Their new image generation model, Imagen 3, will be rolling out across Gemini, Gemini Advanced, Business and Enterprise in the coming days.
- Gems allow you to create a team of experts to help you think through a challenging project, brainstorm ideas for an upcoming event or write the perfect caption for a social media post.
- Some of the premade gems available for you:
- Learning Coach
- Brainstormer
- Career Guide
- Writing Editor
- Coding Partner
- Imagen 3 sets a new high watermark for image quality.
- Gems have built-in safeguards and adhere to product design principles. Across a wide range of benchmarks, Imagen 3 performs favorably compared to other image generation models available.
35:00 📢 Matthew – “Yeah, it’s kind of cool. I was wondering if I could get all of those pre-made Gems at the same time. Like, I’m going to do a brainstorming session with the Career Guide and the Coding Partner and the Brainstormer, and then the Career Guide is like, you should really think about getting a new job. I like to use SQL Server on Kubernetes, and it’s like, yeah, I think you should update your resume. That’s what it should say.”
39:11 Instant snapshots: protect Compute Engine workloads from errors and corruption
- Google is introducing instant snapshots for Compute Engine, which provide near-instantaneous, high-frequency, point-in-time checkpoints of a disk that can be rapidly restored as needed.
- Instant snapshots have an RPO of seconds and an RTO in the tens of seconds.
- Google Cloud is the only hyperscaler to provide high-performance checkpointing that allows you to recover in seconds.
- Common use cases for this feature include:
- Enabling rapid recovery from user error, application software failures, and file system corruptions
- Backup verification workflows, such as for database workloads, that create periodic snapshots and immediately restore them to run data consistency checks.
- Taking restore points before an application upgrade to enable rollback in the event that maintenance fails.
- Improving developer productivity.
- Verify state before backups
- Increase backup frequencies
- Some additional benefits over traditional snapshots:
- In-place backups at the zonal or regional disk level
- Fast and incremental
- Fast restore
- Convertible to a backup or archive (a second point of presence for long-term, geo-redundant storage)
- I suppose this could save you in a CrowdStrike event too…
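- If you want to script the pre-upgrade restore-point use case, a rough sketch with the google-cloud-compute client might look like the following. Note that `InstantSnapshotsClient` and these field names are our assumption from the Compute API surface, so confirm your library version exposes them (the CLI equivalent lives under `gcloud compute instant-snapshots`).

```python
# Rough sketch (assumed API surface): create an instant snapshot of a zonal
# disk before an application upgrade, so you can roll back in seconds.
from google.cloud import compute_v1

client = compute_v1.InstantSnapshotsClient()

snap = compute_v1.InstantSnapshot(
    name="pre-upgrade-checkpoint",
    source_disk="projects/my-project/zones/us-central1-a/disks/db-disk",
)

# insert() returns a long-running operation; wait for it before upgrading.
op = client.insert(
    project="my-project",
    zone="us-central1-a",
    instant_snapshot_resource=snap,
)
op.result()  # restore later by creating a new disk from this instant snapshot
```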
40:22 📢 Justin – “Ryan, I’d like you to get this set up on all of our operating system drives for CrowdStrike as soon as possible.”
44:29 Google Cloud launches Memorystore for Valkey, a 100% open-source key-value service
- The Memorystore team is announcing the preview of Valkey 7.2 support.
- Memorystore for Valkey joins Memorystore for Redis Cluster and Memorystore for Redis as a direct response to customer demand, and is a game-changer for organizations seeking high-performance data management solutions built on 100% open source software.
- Maybe soon Redis can be open source again too (but we won’t hold our breath.)
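- Since Valkey 7.2 speaks the Redis wire protocol, existing clients should work unchanged. A minimal sketch against a Memorystore for Valkey endpoint – the host IP is a placeholder for the private address the instance exposes in your VPC:

```python
# Sketch: redis-py talking to a Memorystore for Valkey instance. Because
# Valkey is wire-compatible with Redis, no Valkey-specific client is needed.
import redis

r = redis.Redis(host="10.0.0.5", port=6379, decode_responses=True)
r.set("greeting", "hello from valkey")
print(r.get("greeting"))  # -> hello from valkey
```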
45:12 📢 Justin – “I haven’t heard much about Valkey since they forked. I assume people are adopting it, but I didn’t hear much about Open Tofu for quite a while. Then everyone started talking about Open Tofu, so I assume it’s one of those things. As the cloud providers get support for it, I do think Valkey was already supported on AWS ElastiCache, and I think Microsoft was supporting it earlier as well. So I think Google is kind of late to the party on supporting Valkey, but we’ll see.”
45:46 A radically simpler way to plan and manage block storage performance
- Earlier this year, Google announced the GA of Hyperdisk storage pools with advanced capacity, which help you simplify management and lower the TCO of your block storage capacity.
- Today, they are bringing that same innovation to block storage performance through Hyperdisk storage pools with advanced performance.
- You can now provision IOPS and throughput in aggregate, which Hyperdisk storage pools will dynamically allocate as your apps read and write data, allowing you to increase resource utilization and radically simplify performance planning and management.
46:18 📢 Justin – “I mean, it’s just basically taking a pool of IOPS and you’re allocating it to different disks dynamically through ML or AI, similar to what you’re doing for the capacity of your disk. It makes it nice, I appreciate it. I don’t know that I use it, but I like that it’s there.”
Azure
47:07 Inside Maia 100: Revolutionizing AI Workloads with Microsoft’s Custom AI Accelerator
- At Hot Chips 2024, Microsoft shared some initial specs on Maia 100, Microsoft’s first-gen custom AI accelerator designed specifically for large-scale AI workloads deployed in Azure.
- The Maia 100 accelerator is purpose-built for a wide range of cloud-based AI workloads, and utilizes TSMC’s N5 process with CoWoS-S interposer technology.
- Equipped with a large on-die SRAM, Maia 100’s reticle-size SoC die, combined with four HBM2e die, provides a total of 1.8 TB per second of bandwidth and 64 GB of capacity to accommodate AI-scale data handling requirements.
- The chip architecture includes a high-speed tensor unit for training and inference, while supporting a wide range of data types, including low precision data types such as the MX data format.
- The vector processor is a loosely coupled superscalar engine built with a custom instruction set architecture (ISA) to support a wide range of data types, including FP32 and BF16.
- A direct memory access engine supports different tensor sharding schemes, and a hardware semaphore enables asynchronous programming on Maia systems.
- Maia 100 supports up to 4,800 Gbps of all-gather and scatter-reduce bandwidth, and 1,200 Gbps of all-to-all bandwidth.
49:05 📢 Ryan – “I’m just, not sure whether or not like I’m just too far gone into the managed services part where I don’t really want this level of detail anymore. Like just, do the thing I’m paying to do the thing and all the type of processor with this type of chip and you know, these types of things are irrelevant, but also like maybe, maybe in that space, if you’re deep in it, you need that performance. It’s really hard to say.”
50:29 Introducing Simplified Subscription Limits for SQL Database and Synapse Analytics Dedicated SQL Pool
- Azure is introducing new and simplified subscription limits for Azure SQL Database and Azure Synapse Analytics dedicated SQL pool (formerly SQL DW).
- What’s changing:
- New vCore-based limits, which are directly equivalent to the existing DTU and DWU limits
- Default logical server limits
- Configurable vCore limits
- New Portal Experience
- All subscriptions will have a default limit of 250 logical servers.
51:23 📢 Matthew – “They went from one metric – their original metric, a weird combination of memory and CPU and maximum storage allocation – to the newer one, which is supposed to simplify it.”
54:21 Check out what’s new in Azure VMware Solution
- Azure is pleased to announce several enhancements to Azure VMware Solution:
- Azure VMware Solution is now available in 33 regions.
- Azure VMware Solution has been added to the DoD SRG Impact Level 4 provisional authorization in Azure Government.
- Expanded support for FCF with NetApp: VMware customers can simplify their FCF hybrid environments by leveraging NetApp ONTAP software.
- You can now leverage Spot Eco by NetApp with your vSphere VMs in the cloud.
- Collaboration with JetStream enhances DR and ransomware protection; JetStream delivers advanced DR that offers near-zero RPO and instant RTO.
55:04 📢 Matthew – “Can I translate this? How to burn all your capital and piss off your CFO in 15 minutes or less.”
Cloud Journey Series
55:52 4 ways to pay down tech debt by ruthlessly removing stuff from your architecture
- Richard Seroter from Google had a great blog post about paying down tech debt by ruthlessly removing stuff from your architecture.
- We thought we’d pass some of these along to the co-hosts to get their take on Richard’s advice.
- He starts out covering debt – really architectural debt: carrying 8 products that do the same thing in every category; brittle automation that only partially works or still requires manual workarounds and black magic; unique customizations to packaged software that prevent upgrades to modern versions; half-finished “ivory tower” designs where the complex distributed system isn’t fully in place and may never be; too much coupling; too little coupling; unsupported frameworks; and on and on.
- To help eliminate some of this debt, he breaks it down into 4 ways.
- #1 Stop moving so much data around
- How many components do you have that get data from point A to point B? How many ETL pipelines to consolidate or hydrate data, messaging and event-processing solutions to send this data around, or even API calls that suck data from system A to system B?
- Can you dump some of this? Here are some examples to help you:
- Perform analytics queries against data sitting in different places by leveraging BigQuery Omni – query your data where it lives in AWS, Azure, or GCP, and stop consolidating it into a single data lake.
- Enrich your data from outside the database. You might have ETL jobs in place to bring reference data into your data warehouse to supplement what’s already there, but with things like BigQuery federated queries, you can reach live into PostgreSQL, MySQL, Spanner, and even SAP Datasphere (see the sketch after this list).
- Perform complex SQL analytics against log data instead of copying and sending logs to online systems.
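- As a concrete example of the federated-query idea above, here’s a hedged sketch using BigQuery’s EXTERNAL_QUERY to join warehouse data against a table that stays in Cloud SQL for PostgreSQL. The connection ID, tables, and columns are hypothetical.

```python
# Sketch: enrich warehouse rows by reaching live into PostgreSQL via a
# BigQuery connection, instead of maintaining an ETL copy job.
from google.cloud import bigquery

bq = bigquery.Client()
sql = """
SELECT o.order_id, o.total, ref.region_name
FROM `my-project.sales.orders` AS o
JOIN EXTERNAL_QUERY(
  'my-project.us.postgres-conn',                 -- BigQuery connection resource
  'SELECT region_id, region_name FROM regions;'  -- runs live in PostgreSQL
) AS ref
USING (region_id)
"""

for row in bq.query(sql).result():
    print(row.order_id, row.region_name)
```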
58:39 📢 Justin- “I was thinking about this is a great pitch for Google because I don’t think I could do this on AWS because all the data storage is separate for every product because of their isolation model. Where on GCP I can do these things because they have one data layer.”
- #2 Compress the stack by removing duplicative components
- Break out the chainsaw; it’s time to kill duplicated products, or too many best-of-breeds. A rule of thumb from Richard’s colleague Josh McKenty: “if it’s emerging, buy a few; if it’s mature, no more than two.”
- You don’t need multiple database platforms or project management solutions. Or leverage multi-purpose services and embrace “good enough.”
- Do you have multiple databases? Maybe you should wait 15 days before you buy a specialized vector database – you can use Postgres or any number of existing databases that now support vectors. If you have multiple messaging buses and stream processors, consolidate to Pub/Sub, etc.
- (Underneath, this one is really just “use managed services.”)
1:00:08 📢 Ryan – “I’m sort of like… the trick of this is the replacing it, right? This is still identification of tech debt. I actually don’t know if that’s really the problem to be solved. I think the problem is like, how do you prioritize and change these? And I thought that, you know, the article, it sort of references offhand, but you know, the reality is you have to be constantly making changes.”
- #3 Replace hyper-customized software and automation with managed services and vanilla infrastructure.
- You are not Google or that unique. Your company likely does a few things that are “secret sauce,” but the rest is identical.
- Fit the team to the software, not the other way around. This customization leads to lock-in, and you get stuck in upgrade purgatory.
- No one gets rewarded for their super highly customized K8s cluster. Use GKE Autopilot, pay per pod, or find some other way to avoid managing something highly customized to your org.
1:03:23 📢 Matthew – “Yeah; most of the time you don’t need that extra performance that you’re squeezing out of it, but you’re adding complexity – and honestly, it’s most likely the cause of many underlying outages, whether you want to believe it or not.”
- #4 Tone it down on microservices and distributed systems
- People have gone overkill on microservices. You don’t need dozens of serverless functions to serve a static web app, or a big, complex JavaScript framework for two pages.
- Tech debt often comes from overengineering the system when you’d be better off smashing it back into an “app” hosted on Cloud Run. There would be fewer moving parts and all the agility you want.
- He doesn’t advocate going full DHH, but most folks would be better off defaulting to more monolithic systems running on a server or two.
1:04:54 📢 Ryan – “ It’s a common fallacy that you want to develop everything as a microservice so that you can manage them and update them separately. But really, if you only have a single customer of that API or your microservice, it shouldn’t be separate. And so it’s really about understanding the contracts and ins and outs and who needs to use the service.”
Closing
And that is the week in the cloud! Visit our website, the home of The Cloud Pod, where you can join our newsletter or Slack team, send feedback, or ask questions at theCloudPod.net – or tweet at us with the hashtag #theCloudPod.