Google Cloud Professional Cloud Architect Certification Renewal
My GCP Certification has one month left for renewal, so I’m going to put my notes here. My primary source of study is Linux Academy, where I have a yearly subscription. The Coursera Course is also free to audit.
Edit: 21 February 2020 – Passed

IAM
- Account Types
- Cloud Identity Domain – the group of all Google accounts in an org. Can sync with Active Directory
- allUsers
- allAuthenticatedUsers
- Google Group
- Service accounts (not a person), used for interactions between GCP resources and applications
- Both a member and a resource
- Users are granted access to act as a SA
- SA granted access to permissions
- Role types:
- Primitive -> Owner (can touch IAM and billing), Editor (can’t touch IAM and billing), Viewer. Apply across entire project
- Predefined. Apply to a single service like compute engine (preferred over primitive due to least privilege)
- Custom
- Org -> Folder -> Project -> Resource. Roles are inherited down this hierarchy. A more permissive parent policy overrides a more restrictive child policy
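A minimal gcloud sketch of granting a predefined role at the project level (project, user and role here are placeholders):

    # bind a predefined role to a user on a single project
    gcloud projects add-iam-policy-binding my-project \
        --member="user:alice@example.com" \
        --role="roles/compute.instanceAdmin.v1"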
Billing
- Export to cloud storage (retention) and BigQuery (analysis)
- Billing account needs to be the same as the ownership account
Stackdriver
- Mnemonic – MELTED
- Works in AWS and GCP
- Logging – project scope
- Retention is either 400 days (admin activity audit logs) or 30 days (most other logs). Logs must be exported if you want to keep them longer (Cloud Storage, BigQuery, Pub/Sub)
- To export, create a sink – a filter plus one of the destinations above (see the sink example after this list)
- Trace – latency for requests
- Works in App engine, HTTP LBs, GCE, GKE
- Error reporting – real time, App engine and cloud functions, only works with a few languages
- Debug – GAE, GCE, GKE; only works with a few languages
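A rough sketch of a log export sink to Cloud Storage (bucket name and filter are made up; the sink's writer identity also needs write access on the bucket):

    # a sink = a filter plus a destination
    gcloud logging sinks create audit-sink \
        storage.googleapis.com/my-log-archive-bucket \
        --log-filter='resource.type="gce_instance" AND severity>=ERROR'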
Cloud Storage
- Unstructured data
- ACLs (down to the object) and IAM (bucket level, not object)
- Prefer IAM over ACL. IAM is auditable. ACL can give object access without bucket access
- Regional vs multi-regional
- Nearline (roughly monthly access, 30-day minimum storage) vs Coldline (roughly quarterly access, 90-day minimum storage)
- NOT block storage
- Changing a bucket’s default storage class only applies to newly written objects
- Can’t change multi-regional <-> regional
- gsutil uses Linux-style filesystem commands (cp, mv, rm, ls)
- Can use the .boto config file to supply customer-supplied encryption keys
- Object versioning is applied on the BUCKET LEVEL
- Lifecycle management: either delete objects or downgrade their storage class, applied at the bucket level. Uses rules, conditions and actions (sketched below)
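A hedged gsutil sketch of object versioning plus a lifecycle policy (bucket name and thresholds are arbitrary):

    # turn on object versioning for the whole bucket
    gsutil versioning set on gs://my-bucket
    # lifecycle.json (file contents, for reference):
    #   {"rule": [
    #     {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"}, "condition": {"age": 30}},
    #     {"action": {"type": "Delete"}, "condition": {"age": 365}}
    #   ]}
    gsutil lifecycle set lifecycle.json gs://my-bucket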
Databases
Cloud SQL
- Classic MySQL, Postgres type of database
- Only in one region, with read replicas (same region as the primary) or a failover replica (in a different zone from the primary DB instance’s zone)
- If there’s a read replica and a failover replica: upon failover, a new read replica is created in the same zone as the failover replica, and the failover replica is promoted to primary DB
- There is replication lag to the failover replica
- Vertical scaling (i.e. increase the instance’s CPU or RAM), requires restart
- Storage scales automatically
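A rough gcloud sketch of a primary instance plus a read replica (instance names, tier and region are placeholders; HA/failover setup is omitted):

    # primary MySQL instance in a single region
    gcloud sql instances create my-primary \
        --database-version=MYSQL_5_7 --tier=db-n1-standard-1 --region=us-central1
    # read replica attached to the primary
    gcloud sql instances create my-replica \
        --master-instance-name=my-primary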
Cloud Spanner
- Expensive horizontally scalable SQL
- Relational
- Cross region
- Still need to choose the number of nodes (not fully serverless)
Datastore = Firestore
- NoSQL, multi regional
- No need to spec any instance (fully managed)
- Scales from zero up to terabytes of data
- Web/mobile apps
- You can create composite indexes for complex queries using gcloud + YAML files
- Indexes: gcloud datastore indexes create [index.yaml]
- gcloud datastore indexes cleanup [index.yaml] – deletes deployed indexes that are no longer listed in the file
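A minimal sketch of the index workflow (the "Task" kind and its properties are invented):

    # index.yaml (file contents, for reference):
    #   indexes:
    #   - kind: Task
    #     properties:
    #     - name: done
    #     - name: priority
    #       direction: desc
    gcloud datastore indexes create index.yaml
    # remove any deployed indexes that are no longer listed in the file
    gcloud datastore indexes cleanup index.yaml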
Bigtable (Hbase)
- NoSQL
- Terabytes to Petabytes
- Expensive, more performant than Datastore
- More analytics oriented
- Requires node management
- High write volume, low response times
BigQuery
- No management needed, Serverless
- Uses SQL as a query language
- Load in datasets – i.e. not an operational DB. Project -> Dataset -> Table
- Data warehouse for storing, exploring, analysing data
- IAM is on the project and dataset level, not table
- Can list all queries done by users for auditing
- Scenario: running queries against data in a non-billing project (needs the dataViewer role there) while charging them to a particular billing project (needs the BigQuery user role there)
- A job = a query. You need the bigquery.jobUser or bigquery.user role to run a query
- You then need permission to access the data you want to query = dataViewer
- Partitioning lowers query times and costs when you know you only want data from a certain time period. Partition by ingest time or by a timestamp/date column (see the bq sketch after this list)
- Sharded tables are an alternative where you literally create a separate table per date
- Table data can be set to expire. Tables that are not edited for 90 days are automatically moved to long-term storage pricing
- Relevant to the Mountkirk Games and TerramEarth case studies
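A hedged bq sketch of a column-partitioned table and a query that only scans one partition (dataset, table and schema are invented):

    # table partitioned by day on the event_ts column
    bq mk --table \
        --time_partitioning_field=event_ts \
        --time_partitioning_type=DAY \
        mydataset.events \
        event_ts:TIMESTAMP,player_id:STRING,score:INTEGER
    # filtering on the partition column limits the data scanned (and billed)
    bq query --use_legacy_sql=false \
        'SELECT player_id, score FROM mydataset.events WHERE DATE(event_ts) = DATE "2020-02-01"'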
MemoryStore
- Redis as a service
- Regional in nature
VPC
- Subnets are Regional. Subnets can span multiple zones in a region
- VPCs are Project based and can be shared across projects within organization – Shared VPC
- Host project = host shared VPC, Service Project = can connect to host’s VPC
- Allows network + security people to manage networks in the host project
- Incoming (ingress) traffic is free. Egress is not
- IAM: Compute Admin, Compute Network Admin
- Firewalls: implied deny all ingress, allow all egress (example after this list)
- Static IPs are REGIONAL
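A minimal firewall-rule sketch (network, CIDR and tag are placeholders):

    # ingress is denied by default; explicitly allow SSH to tagged instances
    gcloud compute firewall-rules create allow-ssh-from-office \
        --network=my-vpc --direction=INGRESS --action=ALLOW --rules=tcp:22 \
        --source-ranges=203.0.113.0/24 --target-tags=bastion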
Hybrid Connectivity
Cloud Interconnect
- Physical connection
- Partner Interconnect (via a telco edge location), up to 10 Gbps
- Dedicated Interconnect (directly to Google), 10 Gbps per link, 10-80 Gbps total
- Doesn’t touch public internet, i.e. private network
- No VPN tunnels
- 10 Gbps per link, up to 80Gbps max
- saves egress fees
Cloud VPN
- IPSEC over public internet
- 1.5 Gbps per tunnel, up to 8 = ~12 Gbps max
- Site to site, not site to client
- Supports IKEv1 and IKEv2
- Static vs dynamic routing (cloud router)
- Static, need to manage routing table manually to add all subnet routes
- Dynamic routing (uses Cloud Router) uses BGP. The on-premises VPN gateway needs to support BGP for this to work. Auto-discovers new subnet routes (see the Cloud Router sketch after this list)
- Dynamic routing may require enabling “Global” dynamic routing at the VPC level for both networks. The default is “Regional” dynamic routing
- Both sides need to setup VPN gateways, and both sides need a VPN tunnel to each other
- 99.9% SLA
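For dynamic routing, the Cloud Router that speaks BGP is created separately, roughly like this (network, region and ASN are placeholders):

    # the Cloud Router advertises and learns routes over BGP for the VPN tunnels
    gcloud compute routers create my-router \
        --network=my-vpc --region=us-central1 --asn=65001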
Peering
- Direct connection to Google as a whole, not just GCP – e.g. for G Suite
- Direct vs. Carrier Peering
- No SLAs
- Uses BGP
- 10 Gbps per link
VPC Network Peering
- Connect two GCloud VPC (even in different organizations)
- Shared VPC is for when we are in the same org but different projects
- Still internal private network
- Less expensive and lower latency than VPNs
- Peering must be set up on both ends. No transitive peering
Compute Engine
Disks
- Persistent DISK (SSD or standard, network attached), Local SSD (physically attached to VM)
- Persistent disks are ZONAL and Local SSD are LOCAL
- HDD uses AES-128, SSD uses AES-256
- Persistent disk is the ONLY bootable option
- Auto RAID, networked
- Can detach and move
- Resize while running
- Can attach to multiple instances if in read only mode
- Performance improves as disk size increases
- Highly reliable
- Can do encryption at rest. including self managed keys
- Can use as a file server
- Up to 64 TB
- Only accessible within a ZONE. Not cross ZONE inside a region yet
- Local SSD
- Can’t be used as boot disk
- Physically attached, highest performance
- Must be attached on CREATION of instance
- Not accessible to other instances in the same zone
- 375 GB, up to 8 of them (3TB)
- Less reliable as disk can fail, no data replication
- Can encrypt but not using own keys
- Two interface types: SCSI and NVMe (faster)
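A hedged sketch of requesting both disk types at instance creation (names and sizes are arbitrary):

    # local SSD must be requested at creation time; a blank persistent data disk is added alongside it
    gcloud compute instances create ssd-worker \
        --zone=us-central1-a \
        --local-ssd=interface=NVME \
        --create-disk=name=data-disk,size=500GB,type=pd-ssd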
Images
- Can be used across zones, regions and projects and shared across projects
- Custom images – project scope
- Create new instances or create instance templates
- Better to create an image when the instance is shut down
- Has deprecation states: deprecated (works with warning), obsolete (no new access), deleted (no access at all), active
- Can version control images using Image Families
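An image-family sketch (disk, image, family and project names are placeholders):

    # bake an image from a stopped instance's disk and put it in a family
    gcloud compute images create my-app-v2 \
        --source-disk=my-app-disk --source-disk-zone=us-central1-a \
        --family=my-app
    # instances created from the family get the newest non-deprecated image
    gcloud compute instances create new-vm \
        --zone=us-central1-a --image-family=my-app --image-project=my-project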
Snapshots
- For backups; can also be used to create instances, but NOT instance templates
- To create an instance template from a snapshot, you have to turn it into an image or a disk first
- Incremental backup – first snapshot is big, subsequent only contain the diff
- If you delete a snapshot, the changes get merged to the next existing snapshot
- Can create while running
- Project scope but can be shared across projects
- If we restore a snapshot using gcloud command:
- First create a disk using the snapshot
- Then create the instance using the disk name
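The two-step restore above, sketched with gcloud (snapshot, disk and instance names are placeholders):

    # 1. create a disk from the snapshot
    gcloud compute disks create restored-disk \
        --source-snapshot=my-snapshot --zone=us-central1-a
    # 2. boot an instance from that disk
    gcloud compute instances create restored-vm \
        --zone=us-central1-a --disk=name=restored-disk,boot=yes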
Startup Scripts
- Copy and paste the script on instance creation, or point to a Cloud Storage URL that holds the script: key = startup-script-url; value = gs://[bucket]/[script].sh
- Keeping the startup script in a bucket lets you change the script on a whim
- Run as sudo
- Shutdown scripts exist, best-effort basis. Good for preemptible instances
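A minimal sketch of pointing an instance at a startup script stored in a bucket (bucket path is made up):

    gcloud compute instances create web-vm \
        --zone=us-central1-a \
        --metadata=startup-script-url=gs://my-bucket/startup.sh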
Pre-emptible VMs
- Disposable VM, 24 hrs max
- Google can shut down your instance at any time w/ a 30 sec warning
- Used for fault tolerant BATCH PROCESSING workloads, e.g. rendering
- Up to 80% cheaper
- Can preserve disk state if we use --no-auto-delete
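A preemptible-instance sketch (names are placeholders; the auto-delete flag shown here is one way to keep the boot disk after the VM is gone):

    gcloud compute instances create batch-worker \
        --zone=us-central1-a --preemptible \
        --no-boot-disk-auto-delete    # keep the boot disk when the instance is deleted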
Scaling
Load Balancers
- L4 vs L7
- HTTP(S)=L7 vs TCP or UDP=L4
- 5 Types of LB:
- HTTP(S) – Global scope, external, Content based vs location based LB
- SSL Proxy – Global scope, external
- TCP Proxy – Global scope, external
- Internal TCP/UDP – internal L4 LB, regional in scope
- External TCP/UDP – L4 Network Load balancer, regional in scope
- Global (HTTP(S), SSL PROXY, TCP Proxy) vs regional (Internal TCP/UDP, Network TCP/UDP)
- External ( HTTP(S), SSL PROXY, TCP Proxy, Network TCP/UDP) vs internal traffic (Internal TCP/UDP)
- Firewall rules are not applied to LB, they are applied to whatever is behind the LB
- A Cloud Storage bucket can be a backend of an HTTP(S) LB
Instance Groups
- Manage a group of instances together
- Managed vs unmanaged
- Unmanaged: a collection of instances that are not identical; no need to focus on this, only good for LOAD BALANCING
- Two steps to create a MANAGED instance group: create an instance template, then create an instance group from that template (sketched after this list)
- Instance Templates are global and can be reused for multiple groups
- An instance group is bound to a zone or a region
- Managed instance groups are pretty much always paired with LBs
- Managed Instance Group Updater – migrate to new machines with no downtime. Specify a newer Instance Template
- Can’t use snapshot to create instance template, but can use images
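The template -> group -> rolling update flow, sketched with gcloud (all names, sizes and the image family are invented):

    # 1. instance template (global, reusable)
    gcloud compute instance-templates create web-template-v2 \
        --machine-type=n1-standard-1 --image-family=my-app --image-project=my-project
    # 2. managed instance group built from the template
    gcloud compute instance-groups managed create web-mig \
        --template=web-template-v2 --size=3 --zone=us-central1-a
    # 3. later, roll the group onto a newer template with no downtime
    gcloud compute instance-groups managed rolling-action start-update web-mig \
        --version=template=web-template-v3 --zone=us-central1-a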
Autoscaling
- Health-check traffic needs to be allowed in the instances’ firewall rules (firewall example under Network Security)
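Autoscaling is attached to the managed instance group, roughly like this (group name and thresholds are arbitrary):

    gcloud compute instance-groups managed set-autoscaling web-mig \
        --zone=us-central1-a --min-num-replicas=2 --max-num-replicas=10 \
        --target-cpu-utilization=0.6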
CDN
- Cache hit ratio can be improved by removing parts of the URL (e.g. query strings) from the cache key
- Configure use of CDN from HTTP(S) LB
Compute Options
- Compute Engine – full control, most administrative work, for lift and shift
- Kubernetes Engine – containers, patches os for you
- App Engine – PaaS, HTTP only
- Cloud Functions – Respond to Events
App Engine
- Standard Env vs. Flexible Environment
- Standard has more constraints: Python, Java, PHP, Go, Node
- Faster scaling, cheaper
- Flexible has more languages, and can use Docker containers. Scales more slowly, has VPC access which allows for VPN
- App Engine can make use of memcache to speed up DB queries
- Memcache has 2 levels, Shared (free, on by default) and Dedicated (pick GB of memory to dedicate)
- Know that Dedicated memcache can improve SQL query times and app performance
- App Engine has version management, which allows for canary testing: split traffic e.g. 40% to V1, 60% to V2
- Can deploy a version but direct ZERO traffic to it for testing using the --no-promote flag
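A canary/versioning sketch (service and version names are placeholders):

    # deploy a new version but send it no traffic yet
    gcloud app deploy --version=v2 --no-promote
    # then split traffic 40/60 between the two versions
    gcloud app services set-traffic default --splits=v1=0.4,v2=0.6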
Kubernetes Engine
- Portability/compatibility; reduces OS and version dependencies
- Use with microservices
- Pod = smallest deployable unit. A pod = one or more containers bundled together
- Containers inside a pod share hostname and ip address
- Pod & containers are the software side
- containers are isolated from containers in other pods
- Pods are the unit of replication
- Deployments are used to start pods
- Node = one Compute Engine VM. Multiple pods/containers per node
- Nodes run Docker containers
- Node pool = group of nodes that have same config (ram, cpu, disk, image)
- Cluster = group of node pools (first thing you create)
- zonal or regional
- requires a default node pool
- gcloud vs kubectl commands: kubectl is used for pods on nodes, gcloud for GCP resources (see the command sketch after this list)
- pods and containers are kubectl
- clusters and nodes are gcloud (e.g. adding nodes is gcloud container clusters resize)
- Autoscale the replicas/pods -> kubectl
- Autoscale the node pools -> gcloud
- Use Alpine Linux base images in Dockerfiles; install dependencies, then copy source code (IN THAT ORDER, so layer caching works)
- The master node has the API server, scheduler, etcd (a key-value store), and the controller manager
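The gcloud vs kubectl split, sketched (cluster, deployment and sizes are made up):

    # clusters and node pools: gcloud
    gcloud container clusters create my-cluster --zone=us-central1-a --num-nodes=3
    gcloud container clusters resize my-cluster --zone=us-central1-a \
        --node-pool=default-pool --num-nodes=5
    # pods/replicas: kubectl
    kubectl autoscale deployment my-app --min=2 --max=10 --cpu-percent=60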
Big Data & Machine Learning
- Cloud Dataproc = managed Apache Hadoop & Spark. Lift and shift Hadoop & Spark workloads. Mostly used for Hadoop compatibility
- Cloud Dataflow = batch & streaming data processing, built on Apache Beam
- Serverless
- Batch can get data from Cloud Storage
- Streaming can get data from PubSub
- Used for data processing
- Dataproc (hadoop compatibility) vs Dataflow (preferred)
- Cloud Dataprep = prepare your data with a UI, built on dataflow
- Cloud Pub/sub = async messaging. Global scope, serverless. Data ingest
- Machine Learning services
- Datalab – visualize data (GCP), based on Jupyter
- Data Studio – visualize data (Gsuite)
Data Lifecycle
- Ingest – pubsub
- Store – DBs. cloud storage
- Process and analyze – above section
- Explore and visualize – datalab, data studio, google sheets
Case Studies
Reference to the post I had two years ago about the case studies:
Mountkirk Games
- Want to put new game on GCP
- Game backend – GCE needed for their custom Linux distro (managed instance groups with custom images)
- NoSQL – Datastore
- Need analytics storage
- Management cares about:
- Scaling in case their game takes off
- measure performance (stackdriver monitoring/logging)
- No downtime
- Managed services
- Analytics – usage patterns
- establish global footprints – multiple regional instance group backends + global HTTP LB. PubSub (can buffer data), Datastore, BigQuery, Cloud Storage, Dataflow
- Game Backend – managed instance groups + stackdriver to drive autoscaling
- Transactional DB service, managed nosql = Cloud Datastore (Firestore)
- Time-series game activity = BigQuery (probably: fully managed, ~10 TB of data) or Bigtable (has admin overhead)
- Game analytics platform
- pubsub -> dataflow -> bigquery
- User mobile data: batch upload to Cloud Storage -> Dataflow
Dress4Win
- Hosted on-prem, needs to be future-proofed
- PoC -> migrate dev and test to GCP
- Set up DR on GCP (hybrid network, VPN or Interconnect)
- If successful, then full migration to GCP
- prefer managed services
- Worry about costs, scaling down during off-peak times
- Security, customer supplied key, IAM, firewalls
- Global footprint NOT a priority ATM
- Want to recreate existing infra in cloud, not redesign their applications
- Dev/Test should be separate projects
- Automate infra creation using gcloud, deployment manager
- Stackdriver monitoring, logging and debug
- Continuous deployment CI/CD, Cloud Build
- Replicate MySQL to Cloud SQL -> DNS cutover for DR; a single region is OK
- Redis cluster -> Memorystore
- Managed instance groups allows lift and shift of app servers
- Apache Hadoop servers -> Cloud Dataproc + Cloud Storage
- 3 RabbitMQ -> Pub/Sub
- Storage/SAN -> Persistent disks (block level storage)
- NAS -> Cloud Storage (potentially persistent disk as well)
TerramEarth
- Tractors, bulldozers w/ sensors
- Data collected is used for analytics, to tune vehicles, and to pre-emptively stock replacement parts by detecting likely part failures
- 20 million offline tractors, 200k IoT-connected tractors. Most data is only accessible when a tractor is brought to a service centre for data upload
- They want to collect and act on data faster (900 batch+9TB stream /day)
- Global footprint
- Solution 1: convert everything to IoT
- Tractors <-> Cloud IoT Core <-> Pub/Sub <-> Dataflow <-> BigQuery -> Machine Learning to tune tractor params
- Share data & dashboards with dealer networks with Data Studio
- They need multi-regional/global services
- They need a backup strategy -> BigQuery to Cloud Storage
- (no bigtable)
Cost Optimization
- Sustained use discounts are automatic, based on how long your instance is up and running in the month (Compute Engine and Cloud SQL). Up to 30%, calculated per region
- Custom machine types – choose you own CPU and RAM combo
- Preemptible VMs, 24 hrs, up to 80% discount
- Nearline and Coldline Cloud Storage – same performance, but with retrieval costs and minimum storage durations
- Committed use discounts – 1 or 3 year terms set pool of CPU and RAM, up to 57% discount
Storage Transfer
- Import from AWS S3, HTTP/HTTPS, another Cloud Storage Bucket
- Not applicable to on prem – in this case use gsutil
- The only destination is a GCP Storage Bucket
- Can physically mail in a “Transfer Appliance”
- gsutil [option] cp
- -m multi-threading option, transfer multiple files in parallel
- -o, object composition, break up a single file into multiple chunks for parallel upload
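The two options side by side, as a sketch (paths and the threshold value are arbitrary):

    # -m: copy many files in parallel
    gsutil -m cp -r ./local-data gs://my-bucket/data
    # -o: split one large file into chunks uploaded in parallel (composite upload)
    gsutil -o "GSUtil:parallel_composite_upload_threshold=150M" cp big-file.dat gs://my-bucket/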
Disaster Recovery
- GCE instances = disk snapshots (incremental). Done using a cron job or the snapshot scheduler
- Cloud Storage object versioning + lifecycle management. Revert object
- Application rollback. Compute Engine -> rolling update by applying an old instance group template. Snapshots are irrelevant
- App Engine has versioning control with traffic % split (canary update)
Security
- Separate projects for dev, test, prod
- Principle of least privilege at Organization and Project levels (primarily)
- Google Cloud Storage: IAM (lower scope is bucket level), ACL and signed URLS (object level)
- Securing communications: public key infrastructure
- IOT core uses Message Queuing Telemetry Transport (MQTT)
- Customer managed encryption keys CMEK – data at rest, manage creations, rotations, but keep keys on cloud
- Customer supplied encryption keys – KEEP KEY ON PREM, provide the keys for API calls. Only for cloud storage and compute engine data at rest
Network Security
- Projects (including IAM)
- VPC
- Firewall
- Organization -> Projects -> VPC -> Regions -> Subnets
- IAM cannot limit access to VPCs within a same project. So if we don’t want someone to have access to a VPC, that VPC must be in a different project
- Projects separate people from having access
- VPCs separates resources
- Firewalls separate access by filtering network traffic by port, tag, service account, IP range, or subnet
- Firewall applies to resources behind a load balancer. NOT at the load balancer
- Health checks require firewall rules allowing traffic from Google’s health-check/LB IP ranges (example after this list)
- Bastion hosts can be used to remove public IP addresses from other instances
- Cloud NAT = allow instances with no public facing IP to make internet requests
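A health-check firewall sketch; 130.211.0.0/22 and 35.191.0.0/16 are Google's documented health-check/LB ranges, the other names are placeholders:

    gcloud compute firewall-rules create allow-health-checks \
        --network=my-vpc --allow=tcp:80 \
        --source-ranges=130.211.0.0/22,35.191.0.0/16 \
        --target-tags=web-backend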
Other
- DLP Data Loss Prevention API – redact PII data automatically before forwarding to other services
- Audits usually mean Stackdriver Logging + export to Cloud Storage for storage. If needed for analysis, BigQuery can get data from Cloud Storage
- Blue/green deployment – similar to rolling updates in GCE and App Engine version traffic splitting. Switch the LB to point at one of the two environments