Welcome to episode 269 of the Cloud Pod Podcast – where the forecast is always cloudy! Justin, Matthew and Ryan are your hosts this week as we talk about – you guessed it – the Crowdstrike update that broke, well, everything! We’re also looking at Databricks, Google potentially buying Wiz, NY Summit news, and more!
Titles we almost went with this week:
- ✈️You can’t take Justin down; but a 23-hour flight to India (or Crowdstrike updates) can
- 🎯Google wants Wiz, and Crowdstrike Strikes all
- 🙃Crowdstrike, does anyone know the Graviton of this situation?
- ⛰️We are called to this summit to talk AWS AI Supremacy
- 🐻Crowdstrike, Wiz and Chat GPT 4o Mini… oh my
- 🧙An Impatient Wiz builds his own data centers not impacted by Crowdstrike
A big thanks to this week’s sponsor:
We’re sponsorless! Want to reach a dedicated audience of cloud engineers? Send us an email or hit us up on our Slack Channel and let’s chat!
General News
00:58 You Guessed It – Crowdstrike
Microsoft, CrowdStrike outage disrupts travel and business worldwide
Our Statement on Today’s Outage (listener note: paywall article)
- It’s not every day you get to experience one of the largest IT Outages in history, and it even impacted our recording of the show last week.
- CrowdStrike, a popular EDR solution, caused major disruption to the world’s IT systems with an errant update to its software that caused Windows machines to BSOD, disrupting travel (airplanes, trains, etc.), governments, news organizations, and more.
- CrowdStrike removed the errant file quickly, but the damage was already done, with tons of systems requiring manual intervention to recover.
- The fix required booting into safe mode and removing a file from the CrowdStrike directory (a hedged sketch of the cleanup follows this list).
- This was all complicated by BitLocker and a lack of local admin rights on many end-user devices.
- In some cases, simply rebooting up to 15 times would bring a machine back to life.
- Some admins resorted to swinging hard drives from a broken server into a working one, removing the file manually, and then putting the drive back.
- The issue also caused a large-scale outage in the Azure Central region.
- Windows-based services running on AWS were also impacted (Amazon is a well-known, large CrowdStrike customer).
- CrowdStrike CEO George Kurtz (who happened to be the CTO at McAfee during the 2010 update fiasco that impacted McAfee clients globally) stated that he was deeply sorry and vowed to make sure every customer is fully recovered.
- By the time of this recording, most customers should be largely recovered, and we are all anxiously waiting to hear how this could have happened.
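For the morbidly curious, here’s a minimal sketch of what the manual cleanup amounted to, assuming the widely reported C-00000291*.sys channel-file pattern and admin access from safe mode; this is illustrative only, not official CrowdStrike guidance:

```python
# Hedged sketch of the manual CrowdStrike workaround, for illustration only.
# Run from Safe Mode / recovery with admin rights; the file pattern
# (C-00000291*.sys) is the one from the widely published remediation notes.
from pathlib import Path

CS_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def remove_bad_channel_files(dry_run: bool = True) -> None:
    """Find and (optionally) delete the errant channel files."""
    for f in CS_DIR.glob("C-00000291*.sys"):
        print(f"{'Would delete' if dry_run else 'Deleting'}: {f}")
        if not dry_run:
            f.unlink()

if __name__ == "__main__":
    remove_bad_channel_files(dry_run=True)  # flip to False to actually delete
```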
04:50 📢 Justin – “It’s really an Achilles heel of the cloud. I mean, to fix this, you need to be able to boot a server into safe mode or into recovery mode and then remove this file manually, which requires that you have console access, which, you know, Amazon just added a couple of years ago.”
07:45 📢 Matthew – “It’s always fun when you’re like, okay, everyone sit down, no stupid ideas. Like these crazy ideas that you have, like end up being the ones that work, but you would never realistically have the opportunity to try them because you know, one, how often and God, I hope in your day job, you’re not actively logging into the serial port for fun or how to automate your deployments. Just sounds like you’re doing something horribly wrong at that point.”
15:20 📢 Justin – “I saw that article this morning about the EU might be the reason why Microsoft doesn’t protect the kernel more. I think that’s a cop out. Basically the EU saying we want fair and equal competition. And basically what Mac did or Apple did was they basically created a custom API that basically does what CrowdStrike needs to do in the kernel and provides that to serve to CrowdStrike and other vendors. They’re all on equal footing. They all get access to the same API. They can all implement the same features, but Mac controls it at the API.”
22:09 Google Has Been in Talks to Acquire Wiz for $23 Billion
(listener note: paywall article)
- Over the weekend it was rumored that Google has been in talks to acquire Wiz, a four-year-old cybersecurity startup, for around $23B. The deal could still fall apart over negotiations and prolonged regulatory reviews… or this getting leaked to the press.
- The $23B price would be a large increase over the $12B valuation from its latest funding round in May.
22:58 📢 Ryan – “I still haven’t played firsthand with Wiz and I hear nothing but good things. And so I’m very conflicted on this, because with where Wiz is, I wonder if it’s going to be more exposed in Google products like Mandiant has become, or is it going to be sort of a behind-the-scenes integration? And so I don’t know. We’ll see. I’m just curious how all the things shake down.”
AI Is Going Great – Or, How ML Makes All Its Money
- Databricks is announcing the GA of serverless compute for notebooks, jobs and delta live tables on AWS and Azure.
- Databricks customers already enjoy fast, simple and reliable serverless compute for Databricks SQL and Databricks Model Serving.
- The same capability is now available for all ETL workloads on the Data Intelligence Platform, including Apache Spark and Delta Live Tables.
- You write the code; Databricks then provides workload startup, automatic infrastructure scaling, and seamless version upgrades of the Databricks Runtime.
- Importantly, with serverless compute you are only billed for work done instead of time spent acquiring and initializing instances from cloud providers.
- Databricks is currently offering an introductory promotional discount on serverless compute, available now until October 31st, 2024.
AWS
24:42 Monitor data events in Amazon S3 Express One Zone with AWS CloudTrail
- S3 Express One Zone supports AWS CloudTrail data event logging, allowing you to monitor all object-level operations like PutObject, GetObject and DeleteObject, in addition to bucket-level actions like CreateBucket and DeleteBucket (a hedged sketch of turning this on follows this list).
- This enables auditing for governance and compliance, and can help you take advantage of S3 Express One Zone’s 50% lower request costs compared to the S3 Standard storage class.
- I mean… really use a lesser storage level, get less secure… for shame Amazon.
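Here’s a minimal boto3 sketch of enabling these data events on an existing trail; the trail name is a placeholder and the resources.type value is our best reading of the announcement:

```python
# Hedged sketch: enable CloudTrail data-event logging for S3 Express One Zone
# objects on an existing trail.
import boto3

cloudtrail = boto3.client("cloudtrail")

cloudtrail.put_event_selectors(
    TrailName="my-trail",  # placeholder trail name
    AdvancedEventSelectors=[
        {
            "Name": "Log S3 Express One Zone object-level events",
            "FieldSelectors": [
                {"Field": "eventCategory", "Equals": ["Data"]},
                # Assumed resource type for directory bucket objects:
                {"Field": "resources.type", "Equals": ["AWS::S3Express::Object"]},
            ],
        }
    ],
)
```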
27:23 AWS Graviton4-based Amazon EC2 R8g instances: best price performance in Amazon EC2
- Remember when this would have been announced at the NY summit? Pepperidge Farms remembers.
- Graviton4-based EC2 R8g instances are now generally available (they had been in preview since re:Invent 2023); a quick launch sketch follows this list.
- AWS has built more than 2 million Graviton processors and has more than 50,000 customers using Graviton-based instances to achieve the best price performance for their applications.
- R8g instances offer larger instance sizes with up to 3x more vCPUs (up to 48xl), 3x the memory (up to 1.5 TB), 75% more memory bandwidth, and 2x more L2 cache over R7g instances.
- Early benchmarks show Graviton4 performing about 30% faster than Graviton3.
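A quick boto3 sketch of launching one of the new instances; the AMI ID is a placeholder and needs to be an arm64 image in your region:

```python
# Hedged sketch: launching an R8g (Graviton4, arm64) instance with boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder arm64 AMI
    InstanceType="r8g.xlarge",
    MinCount=1,
    MaxCount=1,
)
print(resp["Instances"][0]["InstanceId"])
```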
28:10 📢 Ryan – “You know, because it’s only indirectly related to AI. That’s why I didn’t make the summit.”
30:28 Amazon SageMaker introduces a new generative AI inference optimization capability
- Amazon is saying its new inference capability delivers up to ~2x higher throughput while reducing costs by up to 50% for generative AI models such as Llama 3, Mistral and Mixtral models.
- For example, with the Llama 3 70B model, you can achieve up to ~2400 tokens/sec on an ml.p5.48xlarge instance vs. ~1200 tokens/sec previously without optimization.
- This allows customers to choose from several options such as speculative decoding, quantization and compilation, and apply them to their generative AI models.
31:32 Announcing the next generation of Amazon FSx for NetApp ONTAP file systems
Amazon FSx for NetApp ONTAP now allows you to read data during backup restores
- FSx for NetApp ONTAP file systems get several new features this week.
- They can now provide higher scalability and flexibility compared to previous generations. Previously the system consisted of a single high-availability (HA) pair of file servers with up to 4 GB/s of throughput. Now a next-gen file system can be created or expanded with up to 12 HA pairs, allowing you to scale up to 72 GB/s of total throughput (6 GB/s per pair), giving you the flexibility to scale performance and storage to meet the needs of your most demanding workloads (a hedged provisioning sketch follows this list).
- You can now leverage the NVMe-over-TCP block storage protocol with NetApp ONTAP.
- Using NVMe/TCP, you can accelerate your block storage workloads, such as databases and VDI, with lower latency compared to traditional iSCSI block storage, and simplify multi-path (MPIO) configurations relative to iSCSI.
- This is the first NVMe/TCP support in AWS that we’re aware of.
- You can now read data from a volume while it is being restored from a backup. This “read-access during backup restores” feature can improve your RTO by up to 17x for read-only workloads.
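Here’s a hedged boto3 sketch of provisioning a scale-out file system; the deployment type, sizes, and subnet ID are assumptions for illustration, not a verified recipe:

```python
# Hedged sketch: creating a scale-out FSx for NetApp ONTAP file system with
# multiple HA pairs via boto3.
import boto3

fsx = boto3.client("fsx")

resp = fsx.create_file_system(
    FileSystemType="ONTAP",
    StorageCapacity=1048576,                  # GiB, spread across the HA pairs
    SubnetIds=["subnet-0123456789abcdef0"],   # placeholder subnet
    OntapConfiguration={
        "DeploymentType": "SINGLE_AZ_2",      # assumed scale-out deployment type
        "HAPairs": 12,                        # up to 12 HA pairs per the announcement
        "ThroughputCapacityPerHAPair": 6144,  # MB/s per pair (~6 GB/s)
    },
)
print(resp["FileSystem"]["FileSystemId"])
```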
33:55 📢 Justin – “Yeah, so NVMe over TCP is not iSCSI, just to be clear. But it’s basically iSCSI. It’s basically in kernel. It’s much more performant than iSCSI is, and it is the new hotness to replace iSCSI. But it is not technically iSCSI. Don’t correct us.”
35:36 Amazon ECS now enforces software version consistency for containerized applications
- ECS now enforces software version consistency for your containerized applications, helping you ensure all tasks in your application are identical and that all code changes go through safeguards defined in your deployment pipeline.
- Image tags aren’t immutable, but images are, and there is no standard mechanism to prevent different versions from being unintentionally deployed when you configure a containerized application using image tags.
- Now, ECS resolves container image tags to the image digest (the SHA256 hash of the image manifest) when you deploy an update to an ECS service, and enforces that all tasks in the service are identical and launched with that image digest (see the digest-resolution sketch after this list).
- This means even if you use a mutable image tag like ‘latest’ in your task definition and your service scales out after the deployment, the correct image (the one used when deploying the service) is used for launching new tasks.
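To make the digest pinning concrete, here’s a small boto3 sketch of what the tag-to-digest resolution looks like; the repository name and tag are placeholders:

```python
# Hedged sketch of what ECS now does on your behalf: resolving a mutable tag
# (e.g. "latest") to its immutable digest at deployment time.
import boto3

ecr = boto3.client("ecr")

def resolve_digest(repository: str, tag: str) -> str:
    """Return the immutable image digest for a (mutable) tag."""
    resp = ecr.describe_images(
        repositoryName=repository,
        imageIds=[{"imageTag": tag}],
    )
    return resp["imageDetails"][0]["imageDigest"]

digest = resolve_digest("my-app", "latest")
print(f"Tasks will launch from my-app@{digest}")  # e.g. my-app@sha256:...
```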
36:24 📢 Ryan – “Well, the interesting part about this is I actually really like this change, because it is sort of mathematically guaranteeing the workload is what you’ve set the workload is. But it’s funny because it is going to be a mixed bag; the ability to tag an image with a shared tag that you refresh and change the image out from underneath has been something that’s been used, and pretty much been called out as an anti-pattern, using environment-specific labels or latest. And so it’s sort of this weird thing, and I’ve used this to get myself out of binds for sure. Actually specifically in ECS, like using latest to update stuff as part of the underlying platform.”
37:57 Top Announcements of the AWS Summit in New York, 2024
AWS Summit recently took place in New York City – and there’s **A LOT** of announcements. Like, a lot.
👂Listener Poll: Do you genuinely think Amazon is leading at this level? Does this feel genuine to you? Let us know your thoughts by tagging us @thecloudpod or hit us up on our Slack Channel and let us know.
40:55 Vector search for Amazon MemoryDB is now generally available
- GA of vector search for MemoryDB, a new capability that you can use to store, index, retrieve and search vectors to develop real-time machine learning and generative AI applications with in-memory performance and Multi-AZ durability (a hedged query sketch follows this list).
- With this launch, Amazon MemoryDB delivers the fastest vector search performance at the highest recall rates among popular vector databases on AWS.
- You no longer have to make trade-offs around throughput, recall and latency, which are traditionally in tension with one another.
- You can now use one MemoryDB database to store your app data and millions of vectors, with single-digit millisecond query and update response times at the highest levels of recall.
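Here’s a hedged sketch of what a vector index and KNN query look like, assuming MemoryDB’s RediSearch-style FT.* commands; the endpoint, index, and field names are placeholders:

```python
# Hedged sketch of vector search against MemoryDB using FT.* commands.
import numpy as np
import redis

r = redis.Redis(host="my-memorydb-cluster.example.amazonaws.com",
                port=6379, ssl=True)

# Create an HNSW vector index over hashes prefixed with "doc:"
r.execute_command(
    "FT.CREATE", "idx", "ON", "HASH", "PREFIX", "1", "doc:",
    "SCHEMA", "embedding", "VECTOR", "HNSW", "6",
    "TYPE", "FLOAT32", "DIM", "128", "DISTANCE_METRIC", "COSINE",
)

# Store one document with its embedding
vec = np.random.rand(128).astype(np.float32)
r.hset("doc:1", mapping={"text": "hello cloud", "embedding": vec.tobytes()})

# KNN query for the 5 nearest neighbours
query = np.random.rand(128).astype(np.float32)
results = r.execute_command(
    "FT.SEARCH", "idx", "*=>[KNN 5 @embedding $qv]",
    "PARAMS", "2", "qv", query.tobytes(), "DIALECT", "2",
)
print(results)
```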
36:24 📢 Ryan – “This sounds expensive, but I think it’s cool as hell, Vector search in general is just a new paradigm.”
42:29 Build enterprise-grade applications with natural language using AWS App Studio
- Amazon is releasing a new no-code solution with the public preview of AWS App Studio.
- App Studio is a generative AI powered service that uses natural language to create enterprise-grade applications in minutes, without requiring software development skills.
- It’s as easy as creating a new app, using the new generative AI assistant and building.
- Uh huh sure… may you live longer than Honeycode.
43:37 📢 Justin – “It’s no code. It’s dumb no code, but yeah.”
43:36 Amazon Q Apps, now generally available, enables users to build their own generative AI apps
- Amazon Q Apps are now generally available, with some new capabilities that were not available during the preview, such as an API for Amazon Q Apps and the ability to specify data sources at the individual card level.
- Specifying data sources at the card level lets you control which sources the outputs are generated from.
- Amazon Q Apps API allows you to now create and manage Q Apps programmatically with APIs for managed apps, app library and app sessions.
- Cool. Moving on.
45:00 Customize Amazon Q Developer (in your IDE) with your private code base
- The Amazon Q Developer customization capability is now generally available for inline code completion, and a preview of customized chat is launching.
- You can now customize Amazon Q to generate specific code recommendations from private code repositories in the IDE code editor and the chat.
- Amazon Q is an AI coding companion.
- It helps software developers accelerate application development by offering code recommendations in their integrated development environment derived from existing comments and code.
45:10 Announcing IDE workspace context awareness in Q Developer chat
- In addition, you can now bring your workspace into Q Developer chat.
- This allows you to ask the chatbot questions about the code in the project you currently have open in the IDE.
45:50📢 Ryan – “I think Dr. Matt would probably, you know, think about his slide instead of doing bar charts, maybe do a little time-based chart, because these are, you know, features that ChatGPT was announcing like 18 months ago, two years ago.”
46:46 Agents for Amazon Bedrock now support memory retention and code interpretation
- Agents for Bedrock now support memory retention and code interpretation.
- Retain memory across multiple interactions.
- This allows agents to retain a summary of their conversations with each user and provide a smooth, adaptive experience, especially for complex, multi-step tasks such as user-facing interactions and enterprise automation like booking flights or processing insurance claims (a hedged invocation sketch follows this list).
- Support for code interpretation: agents can now dynamically generate and run code snippets within a secure, sandboxed environment, addressing complex use cases such as data analysis, data visualization, text processing, solving equations and optimization problems.
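A hedged boto3 sketch of invoking an agent with memory retention; the agent IDs are placeholders and the memoryId parameter reflects our reading of the announcement, so check the current docs:

```python
# Hedged sketch: invoking a Bedrock agent with memory retention enabled.
import boto3

runtime = boto3.client("bedrock-agent-runtime")

resp = runtime.invoke_agent(
    agentId="AGENT123",        # placeholder
    agentAliasId="ALIAS123",   # placeholder
    sessionId="session-001",   # one conversation
    memoryId="user-42",        # assumed: ties multiple sessions to one memory
    inputText="Book me a flight to Seattle next Tuesday.",
)

# The response is streamed; collect the text chunks.
answer = b"".join(
    event["chunk"]["bytes"]
    for event in resp["completion"]
    if "chunk" in event
)
print(answer.decode("utf-8"))
```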
47:23📢 Justin – “But we have a sandbox that code can’t get out of the sandbox. That’s what CrowdStrike said too.”
- Guardrails allows you to implement safeguards based on application requirements and your company’s responsible AI policies (a hedged usage sketch follows this list).
- It can help prevent undesirable content, block prompt attacks, and remove sensitive information for privacy.
- Guardrails for Bedrock provides additional customizable safeguards on top of the native protections offered by FMs, delivering what Amazon calls the best safety features in the industry.
- Blocks as much as 85% more harmful content
- Allows customers to customize and apply safety, privacy and truthfulness protections within a single solution.
- Filters over 75% of hallucinated responses for RAG and summarization workloads.
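Here’s a hedged sketch of attaching a guardrail to a Converse call; the guardrail ID, version, and model ID are placeholders:

```python
# Hedged sketch: applying a Bedrock Guardrail via the Converse API.
import boto3

bedrock = boto3.client("bedrock-runtime")

resp = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user",
               "content": [{"text": "Summarize our outage postmortem."}]}],
    guardrailConfig={
        "guardrailIdentifier": "gr-placeholder-id",  # placeholder guardrail
        "guardrailVersion": "1",
    },
)
print(resp["output"]["message"]["content"][0]["text"])
```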
49:27📢 Ryan – “I’m surprised that Bedrock can do this. I mean, it feels like, no, detecting hallucinations based off of third-party models? That seems crazy to me. And it just highlights how little I know about how these models work and how a platform like Bedrock operates. In my head, it’s just: you ask the model a question, you get an answer back. So with guardrails, clearly there’s more information being exchanged at a different level with which they can detect hallucinations.”
52:31 Knowledge Bases for Amazon Bedrock now supports additional data connectors (in preview)
- With Knowledge Bases for Bedrock, foundation models and agents can retrieve contextual information from your company’s private data sources for RAG.
- RAG helps FMs deliver more relevant, accurate and customized responses.
- Now, in addition to S3, you can connect web domains, Confluence, Salesforce and SharePoint as data sources in your RAG applications (a hedged connector sketch follows this list).
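A hedged sketch of wiring up one of the new connectors (a web crawler) to an existing knowledge base; the IDs, URL, and the exact shape of the WEB configuration are assumptions based on the announcement:

```python
# Hedged sketch: attaching a web-crawler data source to a Knowledge Base.
import boto3

agent = boto3.client("bedrock-agent")

agent.create_data_source(
    knowledgeBaseId="KB12345",   # placeholder knowledge base ID
    name="docs-site",
    dataSourceConfiguration={
        "type": "WEB",  # other new types: CONFLUENCE, SALESFORCE, SHAREPOINT
        "webConfiguration": {     # assumed shape, check the API reference
            "sourceConfiguration": {
                "urlConfiguration": {
                    "seedUrls": [{"url": "https://docs.example.com"}]
                }
            }
        },
    },
)
```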
53:07 Introducing Amazon Q Developer in SageMaker Studio to streamline ML workflows
- SageMaker Studio can now simplify and accelerate the ML development lifecycle.
- Amazon Q Developer in SageMaker Studio is a gen AI-powered assistant built natively into the SageMaker JupyterLab experience.
- This assistant uses natural language inputs and crafts a tailored execution plan for your ML development lifecycle by recommending the best tools for each task, providing step-by-step guidance, generating code to get started, and offering troubleshooting assistance when you encounter errors.
53:39 📢 Ryan – “It’s going to need to be one hell of an AI bot if it’s going to get me to successfully run Spark.”
GCP
54:31 New Cloud SQL upgrade tool for MySQL & PostgreSQL major versions and Enterprise Plus
- Google is announcing an automated Cloud SQL upgrade tool for major version and Enterprise Plus edition upgrades.
- The tool provides automated upgrade assessments, scripts to resolve issues, and in-place major version upgrades, as well as Enterprise Plus Edition upgrades, all in one go (a hedged API sketch follows this list).
- It’s particularly useful for organizations that want to avoid extended support fees associated with Cloud SQL extended support.
- Key features include:
- Automated pre-upgrade assessment, where checks are curated based on recommendations available for MySQL and PostgreSQL upgrades, as well as from insights from real customer experiences
- Detailed assessment reports
- Automated scripts to resolve issues
- In-place major version and enterprise plus upgrades leveraging Cloud SQL’s in-place major version upgrade feature.
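Under the hood, an in-place major version upgrade is a patch of the instance’s databaseVersion; here’s a hedged sketch via the Cloud SQL Admin API, with placeholder project, instance, and target version (the new tool layers the assessments and fix-up scripts around this):

```python
# Hedged sketch: in-place major version upgrade by patching databaseVersion.
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1")

op = sqladmin.instances().patch(
    project="my-project",   # placeholder project
    instance="orders-db",   # placeholder instance
    body={"databaseVersion": "POSTGRES_16"},  # placeholder target version
).execute()
print(op["name"])  # long-running operation to poll
```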
56:39 Flexible committed-use discounts are now even more flexible
- The Compute Flexible CUD has been expanded to cover Cloud Run on-demand resources, most GKE Autopilot Pods, and the premiums for the Autopilot Performance and Accelerator compute classes.
- With one CUD purchase, you can now cover eligible spend on all three products.
- Since the new expanded Compute Flexible CUD has a higher discount than the GKE Autopilot CUD and greater overall flexibility, Google is retiring the GKE Autopilot CUD.
47:23📢 Matthew – “I love when single things support multiple so I don’t have to think about it. It’s like, how much money do you want? Divide by four so I can give you a little bit so I can refresh as needed once a quarter. And here you go. Now I don’t need to manage 16 different things.”
59:01 Modern SecOps Masterclass: Now Available on Coursera
- Google is releasing a six-week, platform-agnostic education program for modern SecOps.
- The course leverages the autonomic security operations framework and continuous detection, continuous response methodology.
59:18 📢 Ryan- “I’m sure the content won’t be heavily towards Security Command Center and the enterprise offering solutions.”
1:00:16 Discover a brand new catalog experience in Dataplex, now generally available
- Dataplex Catalog, Google Cloud’s next-generation data asset inventory platform, provides a unified inventory for all your metadata, whether your resources are in Google Cloud or on-premises, and today it’s GA.
- Dataplex Catalog allows you to search and discover your data across the organization, understand its context to better assess its suitability for data consumption needs, enable data governance over your data assets, and further enrich it with additional business and technical metadata to capture the context and knowledge about your data realm.
- Benefits of Dataplex Catalog:
- Wide range of metadata types
- Self-configure the metadata structure for your custom resources
- Interact with all metadata associated with an entry through single atomic CRUD operations and fetch multiple metadata annotations associated with search or list responses.
- There are no charges for basic API operations (CRUD) and searches performed against Dataplex catalog individual resources.
1:01:01 📢 Ryan- “I really like that this is supporting both data on GCP and off GCP. Cause that’s the reality is, you know, almost always that you have data in multiple places. And if you’re trying to catalog everything so that you have a place to search and understand where your data is and sensitivity and the metadata around it. If you have three different versions of that catalog, it doesn’t, it’s worse than just having one.”
- Google has enhanced the ability to get HA and still meet data residency requirements with Cloud Spanner.
- Previously, to get the highest level of availability (99.999%) you had to use a multi-region configuration, which could conflict with data residency requirements.
- You could achieve 99.99% with only two cloud regions.
- Now you can take advantage of five 9s of availability with the new Spanner dual-region configurations, available in Australia, Germany, India and Japan.
- This works by taking advantage of countries that have multiple regions in a single geography, e.g., Delhi/Mumbai.
1:02:38 📢 Matthew – “I mean, with the data resilience, with the data regional requirements, like something like this is slowly going to be required more and more. It’s interesting that, you know, before you only got four nines with it, but at the end of the day compliance always wins in keeping the data in the correct country. It keeps you out of, you know, compliance hell, and that’s kind of important. So, you know, it’s nice to be able to get that extra nine.”
Azure
Oh look, Azure woke up… oh, it’s AI.
1:03:52 OpenAI’s fastest model, GPT-4o mini is now available on Azure AI
- GPT-4o Mini now allows customers to deliver stunning results at lower costs and blazing speed.
- It’s significantly stronger than GPT-3.5 Turbo, scoring 82% on MMLU (Measuring Massive Multitask Language Understanding) compared to GPT-3.5’s 70%, and it’s 60% cheaper.
- The model delivers an expanded 128k context window and integrates the improved multilingual capabilities of GPT-4o.
- GPT-4o mini, announced by OpenAI, is available simultaneously on Azure AI, supporting text processing capabilities with excellent speed, with image, audio, and video support coming later (a hedged call sketch follows this list).
- OpenAI… can you stick to a naming convention?
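A minimal sketch of calling a GPT-4o mini deployment through Azure OpenAI with the openai Python SDK; endpoint, API version, and deployment name are placeholders for your own resource:

```python
# Hedged sketch: chat completion against an Azure OpenAI GPT-4o mini deployment.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://my-resource.openai.azure.com",  # placeholder
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",  # assumed API version
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # your deployment name
    messages=[{"role": "user",
               "content": "Give me one-line show notes for a cloud podcast."}],
)
print(resp.choices[0].message.content)
```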
1:05:12 📢 Matthew – “So unlike AWS, with Azure you have to be careful that this is not available in many of the regions, which definitely makes Azure a little bit harder at times, but you get less press announcements.”
1:05:28 Latest advancements in Premium SSD v2 and Ultra Azure Managed Disks
- Azure is announcing the latest advancements in Premium SSD V2 and Ultra Disks, the next generation of Azure Disk Storage.
- First up, they now support incremental snapshots of Pv2 and Ultra Disks, which are reliable and cost-effective point-in-time backups of your disks that store only the changes made since the last snapshot (a hedged snapshot sketch follows this list).
- Azure-native fully managed backup and recovery for Pv2 and Ultra Disks lets you protect your VMs with a single click.
- There is also preview support for Azure Site Recovery for Pv2 disks.
- Application-consistent VM restore points for Pv2 and Ultra Disks.
- Third-party support for backup and DR.
- Encryption at Host and Trusted Launch support for Pv2 and Ultra Disks round out the new features and capabilities.
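A hedged azure-mgmt-compute sketch of taking an incremental snapshot of a Pv2 disk; subscription, resource group, and disk ID are placeholders:

```python
# Hedged sketch: creating an incremental snapshot of a Premium SSD v2 disk.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

disk_id = (
    "/subscriptions/<subscription-id>/resourceGroups/my-rg"
    "/providers/Microsoft.Compute/disks/my-pssdv2-disk"
)

poller = compute.snapshots.begin_create_or_update(
    resource_group_name="my-rg",
    snapshot_name="my-pssdv2-snap-001",
    snapshot={
        "location": "eastus",
        "incremental": True,  # store only changes since the previous snapshot
        "creation_data": {"create_option": "Copy", "source_resource_id": disk_id},
    },
)
print(poller.result().name)
```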
1:06:25 📢 Matthew – “A lot of these just feel like good quality-of-life improvements that they really needed to get out there. Like the incremental snapshot support, you know, Pv2 and, you know, Ultra Disks go to decently large sizes, so you probably don’t really need to be snapping the whole drive if you’re just handling little bits and pieces of change.”
OCI
1:07:18 Oracle Loses Out to Musk’s Impatience
- After rumors that Elon was close to a deal with Oracle Cloud to power the LLM of his startup, xAI, Elon has pivoted and decided to build his own AI data center.
- Musk explained, “when our fate depends on being the fastest by far, we must have our own hands on the steering wheel.” Apparently the deal stalled over Musk’s demand that the facility be built faster than Oracle thought possible.
- Only a month ago, Ellison trumpeted xAI as one of several big and successful companies choosing Oracle Cloud.
- Musk clarified that they are using Oracle for a smaller AI use case at the moment.
1:07:54 📢 Ryan – “Yeah, my money is totally on the fact that it’s going to take them longer to get this set up than whatever date they were looking at from Oracle.”
1:09:16 Oracle Announces Exadata Exascale, World’s Only Intelligent Data Architecture for the Cloud
- Oracle is announcing Exadata Exascale (exa-spensive), an intelligent data architecture for the cloud that provides extreme performance for all Oracle Database workloads — including AI vector processing, analytics and transactions at any scale.
- “Exadata Database Service on Exascale Infrastructure is the most flexible database environment we have ever worked with,” said Luis Mediero, director, Cloud and Data Solutions, Quistor. “Its ability to scale efficiently will allow us to move all workloads to high-performance environments with minimal migration time. Because it leverages Exadata technology, we also have confidence in our data resiliency and security, something that has proven difficult to achieve in other environments. In addition, Exascale’s scalability will enable us to grow resources quickly and with minimal costs as our business expands.”
- Exadata Exascale provides the following benefits:
- Elastic, pay-per-use resources: With Exascale, resources are completely elastic and pay-per-use, with no extra charge for IOPS.
- Users only specify the number of database server ECPUs and storage capacity they need, and every database is spread across pooled storage servers for high performance and availability, eliminating the need to provision dedicated database and storage servers.
- This reduces the cost of entry-level infrastructure for Exadata Database Service by up to 95 percent and enables flexible and granular on-line scaling of resources.
- Intelligent storage cloud: With Exascale, Oracle delivers the world’s only RDMA-capable storage cloud. This intelligent storage cloud distributes databases across all available storage servers and uses data aware, intelligent Smart Scan to make thousands of CPU cores available to speed up any database query. In addition, data is replicated on three different storage servers to provide high levels of fault tolerance. Exascale Storage Cloud intelligently moves hot or frequently accessed data from disk to memory or flash, and delivers the performance of DRAM, the IOPs of flash, and the capacity of disks.
- Intelligent AI: Exascale uses AI Smart Scan, a unique way to offload data and compute-intensive AI Vector Search operations to the Exascale intelligent storage cloud. AI Smart Scan and Exadata System Software 24ai run key vector search operations up to 30X faster enabling customers to run thousands of concurrent AI vector searches in multi-user environments.
- Intelligent OLTP: Intelligent communication between servers enables high-performance database scaling across the Exascale Virtual Machine clusters, and intelligent, low-latency OLTP IO quickly completes mission-critical transactions and supports more concurrent users. Exadata Exascale delivers 230X more throughput than other database cloud services—2,880 GB/s compared to up to 21 GB/s for other hyperscalers. It also delivers 50X lower latency, with 17 microseconds compared to 1,000 microseconds for other cloud providers.
- Intelligent analytics: Unique data intelligence automatically offloads data-intensive SQL queries to the Exascale intelligent storage cloud, enabling extreme throughput scaling for analytics. Automatic columnarization converts data into an ultra-fast in-memory columnar format that automatically uses flash caches in the Exascale intelligent storage cloud to increase capability and performance.
- Database-aware intelligent clones: Users can instantly create full copies or thin clones using the Exascale intelligent storage cloud and its redirect-on-write technology. Advanced snapshot capabilities make creating space-efficient clones of pluggable or container databases easy using read-write sources. These development, test, or recovery copies are immediately available and have the same native Exadata performance and scale as the source databases.
Closing
And that is the week in the cloud! Visit our website, the home of the Cloud Pod, where you can join our newsletter, Slack team, send feedback, or ask questions at theCloudPod.net, or tweet at us with the hashtag #theCloudPod