Zero Trust Microsegmentation with Illumio Core

Trust Issues

Every day, we are bombarded with news of cybersecurity breaches and ransomware attacks that cost companies a fortune and compromise sensitive data. It’s not a matter of if, but when, your boundary is breached. The legacy firewall network segmentation methodology, known as castle-and-moat, was not designed to stop the spread of a breach once the moat is crossed. Once one workload within your network boundary is compromised, bad actors and malware can spread rapidly and unimpeded.

In May 2021, Executive Order (EO) 14028, titled ‘Improving the Nation’s Cybersecurity,’ was issued. This order mandates the adoption of a Zero Trust Architecture (ZTA) within the US Government. The order defines a ZTA security model as one that operates under the assumption that a breach is inevitable or has likely already occurred.

Microsegmentation

Microsegmentation is a foundational component of a ZTA. It helps contain a breach by moving the boundary to the workloads where the data originates. In the Department of Defense’s (DoD) Zero Trust Strategy, which outlines seven pillars essential for achieving a ZTA, microsegmentation is highlighted as a key component of the ‘Network & Environment’ pillar. It also contributes significantly to the success of other pillars, such as ‘Applications & Workloads’ and ‘Visibility & Analytics’.


One of the first steps, and a core capability of microsegmentation, is flow mapping. Data flows are collected from all devices and are then visualized within a web UI. Once organizations have visibility into the current state of their workloads’ flows, they can move toward updating their policies to an allow-list model.

An allow-list model explicitly permits only necessary traffic and denies all other traffic implicitly, following the principle of least privilege. By restricting access to only what is essential, microsegmentation ensures that any breach is not only contained but also quickly identified and eliminated.
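To make the allow-list idea concrete, here is a minimal host-firewall sketch in iptables; the addresses and ports are purely illustrative and are not drawn from any Illumio policy:

# Default-deny: anything not explicitly allowed is dropped
iptables -P INPUT DROP
# Explicitly allow only the flows this web workload actually needs
iptables -A INPUT -p tcp --dport 443 -s 10.0.20.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -s 10.0.99.10 -j ACCEPT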

Illumio Core

In the following sections, I will provide a high-level overview of Illumio Core, Illumio’s approach to segmenting workloads in on-premises and cloud data centers. While Illumio is a key player in this domain, other notable vendors in microsegmentation include Akamai Guardicore, VMware NSX, and Cisco Secure Workload (formerly known as Tetration). Although this piece won’t compare these vendors, a future blog may delve into an analysis of alternatives. Below, you’ll find a company overview of Illumio as provided by the vendor.

[Figure: Illumio company overview, as provided by the vendor]

Illumio Core consists of two key components: the Policy Compute Engine (PCE) and the Virtual Enforcement Node (VEN). The PCE is the server side of the platform, aka “the brain,” and the VEN is the agent that gets installed on workloads, converting them to “managed workloads.”


After a VEN is installed on a workload, the VEN can manage the host’s native OS firewall and report all flows in and out of the workload back to the PCEs. Once VENs are deployed throughout the network, the PCEs have full visibility of flows through their web UI. The infographic below shows how the agent is integrated with the host OS.

[Figure: How the VEN integrates with the host OS]

VEN Enforcement Modes

There are four different enforcement modes a VEN can be set to. It is recommended that VENs are first paired in Idle mode and progressed through the enforcement modes as flows are discovered, rules are created, and policies are enforced.

Idle: The VEN doesn’t take control of the workload’s iptables (Linux) or Windows Firewall (Windows Filtering Platform) yet. The VEN reports the equivalent of a netstat back to the PCEs. A compatibility check is performed to see if there are any incompatibilities with the VEN software.

Visibility Only: The VEN takes full control of the host OS firewall. An allow-all rule is added at the end of the firewall rules so nothing is blocked. The VEN reports all flows back to the PCEs.

Selective Enforcement: Policies created on the PCEs are provisioned. Specific deny rules are added above the allow any/any rule. All flows that do not have explicit allow rules are logged as “potentially blocked” to help prepare for Full Enforcement mode.

Full Enforcement: Rules are enforced for all inbound and outbound services. Traffic that is not allowed by a rule is blocked.


The Power of Labels

Illumio uses a multi-dimensional label-based model to label workloads, create policies, and grant role-based access control for administrative access. By utilizing labels, policies will dynamically update when IP addresses change, new devices are deployed, or a device is decommissioned.

Illumio recommends four label types by default (Role, Application, Environment, and Location), with the option to create additional custom label types if needed. It’s important to define a labeling schema that is thoughtful, scalable, standardized, and simple.

As an example of labeling a workload, an application called CI/CD could consist of three servers with three different roles: a web server, an application server, and a database server. These servers could be in the development environment and located in Miami. The three workloads within that app group could be labeled as follows:

[Figure: Example labels for the CI/CD app group]
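As a rough sketch of what those label assignments might look like (the hostnames are hypothetical):

Workload       Role  Application  Environment  Location
cicd-web-01    Web   CI/CD        Development  Miami
cicd-app-01    App   CI/CD        Development  Miami
cicd-db-01     DB    CI/CD        Development  Miami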

Policy Creation

Once the VENs are deployed and the workloads are labeled, it’s time for policy creation. Many application owners don’t know all the flows their application actually needs, especially for east/west traffic that doesn’t leave the network boundary. Historically, this has led to either weak policies or a broken application until the needed ports can be identified.

To overcome this problem, in addition to flow visibility, Illumio provides the ability to draft and test policy before implementation. Illumio tracks all flows that would have been blocked if a drafted policy were enforced. Potentially blocked traffic can then be analyzed and rules added before enforcing the policy.

There are many approaches to policy creation, and each organization will have different needs. Illumio identifies four segmentation strategies that progressively provide increased security. These strategies allow organizations to progress over time to limit the application attack surface.

Environment or Location Segmentation
Denies traffic from one environment to another, or one location to another. For example, deny all traffic between the development and production environment.

Application Microsegmentation (Ringfencing)
Creates a boundary around applications allowing only necessary ports and protocols in and out of the app group, but allows full communications within the application.

Tier-to-Tier Microsegmentation
Locking down the application further using role labels: specific tiers/roles can communicate on all ports and protocols within an app group, while unnecessary tier-to-tier communication is blocked.

Nanosegmentation
And finally, using role labels and services, specific workloads can communicate only on specific ports.

Each strategy provides a tailored approach to segmentation that can be combined, catering to different organizational needs and security objectives.

Beyond Illumio Core

Besides Illumio Core, Illumio has been expanding its product line to provide a robust microsegmentation solution. Check out the following products to learn more.

Illumio CloudSecure: Agentless segmentation for cloud-native workloads.

Illumio Endpoint: Segmentation for end user devices.

Illumio for Microsoft Azure Firewall: Simplify Azure Firewall management with enhanced visibility and Zero Trust security policies.

Bottom Line At The End

In summary, this blog shed light on the critical ‘why’ behind microsegmentation: traditional network segmentation methods are increasingly inadequate in safeguarding data and thwarting ransomware attacks, necessitating a shift in strategy. We delved into ZTA and emphasized the crucial role microsegmentation plays in fulfilling these stringent security requirements. Finally, we provided a high-level overview of Illumio’s solution as a concrete example of microsegmentation in action, illustrating how innovative approaches can effectively respond to modern cybersecurity challenges.

References

Executive Order (EO) 14028, Improving the Nation’s Cybersecurity
https://www.whitehouse.gov/briefing-room/presidential-actions/2021/05/12/executive-order-on-improving-the-nations-cybersecurity/

Department of Defense (DoD) Cybersecurity Reference Architecture
https://dodcio.defense.gov/Portals/0/Documents/Library/CS-Ref-Architecture.pdf

Department of Defense (DoD) Zero Trust Strategy
https://dodcio.defense.gov/Portals/0/Documents/Library/DoD-ZTStrategy.pdf

Illumio Resource Center
https://www.illumio.com/resource-center/illumio-core-the-power-of-labels
https://www.illumio.com/resource-center/zero-trust-segmentation-for-dummies

Complex Routing Symmetry in BGP

Enforcing network traffic symmetry across stateful devices has long been the biggest routing challenge faced in our network. It is unquestionably the singular problem I have dedicated the most amount of brainstorming, whiteboarding, and testing to. In this blog I share the problem, how it relates to our network, and the theory and solution that we use to solve it. This blog assumes the reader has a strong foundation in BGP design and behavior for a full understanding of the theory and solution.

Traffic Symmetry

In many networks, asymmetry of traffic can be acceptable, provided it does not negatively affect the reliability of the traffic flow. Routers generally do not function as stateful devices — they do not require establishing and maintaining a connection that must handle both sides of a bidirectional traffic flow. However, firewalls, load balancers, and many security devices generally are stateful — they must observe the traffic flow in both directions. It should be intuitive that any device responsible for inspecting traffic, and not just forwarding traffic, would have this requirement. In a large, dispersed network that handles hundreds of stateful devices, the challenge rests on engineering the routing to ensure that any traffic flow traverses the same stateful device(s) bidirectionally. This challenge may be simple when stateful devices are placed at points in the network where traffic is naturally funneled, such as the ingress point to a local network, but may be very difficult when those devices are instead distributed across centralized locations.

Our Network

To better frame the problem as it relates to our network, it is helpful to provide an overview of our network design. Our network consolidated all network security and inspection into approximately a dozen centralized locations across the United States, while processing traffic originating at hundreds of other locations. We refer to these centralized security/inspection locations as stacks. They are a collection of routers, firewalls, switches, load balancers, passive and active inspection devices, and monitoring tools. These stacks also serve as the ingress/egress point for all traffic entering or exiting our network perimeter. Security is layered into tiers, where traffic flows often require traversing several sets of security devices when communicating between different organizations, each of whom control the security policy in their routing domain. Each stack provides redundancy within itself, as well as redundancy in predefined failover orders, such that any traffic pattern should have N+1 redundancy within a single stack, and N+2 redundancy across different stacks. In this model, every stack and edge location can be said to have a distinct primary, secondary, and tertiary stack for processing traffic.

All edge locations on the network are typically served by a single pair of edge routers with an L2 campus infrastructure.

[Figure 1]

All locations on the network connect to an MPLS network and exchange L3 routing information in BGP via VPNv4/VPNv6. Every security zone on the network exists as an isolated forwarding domain that must traverse one or more security tiers to communicate outside of that zone. These are enforced by strictly segregated L3VPNs across the MPLS core, VRFs on routing devices, VLANs on switches, and ultimately correspond to a unique zone on a firewall, the sum of which we refer to as a routing-domain.

The routing-domains form the discrete components of the tiered architecture, where an upper tier may belong to a large entity who broadly controls the security policy for the entire organization at the network perimeter, while a lower tier may belong to a subdivision of that organization, responsible only for their piece of the network, but still nested within the organization as a whole.

[Figure 2]

All intra-zone traffic routes directly while all inter-zone traffic must route through some portion of a stack — at a minimum across the FW within the zone’s routing domain. Routing information between zones within a stack is exchanged via eBGP, directly from each VRF on the edge routers to a control-plane-only router that we call the firewall-router. This router effectively shadows each firewall context and provides the routing functionality that could otherwise be provided by the firewall itself.

[Figure 3]

Requirements

The ultimate goal in developing our solution is that any traffic flow between hosts in different zones takes a symmetric path through the stacks, subject to the following requirements:

  1. Signaled in BGP — the routing logic must be enforced and signaled by BGP and without maintaining knowledge of specific prefixes (e.g. no policy-based-routing based on prefix lists).
  2. Redundancy — minimum N+2 redundancy across stacks must be provided for any traffic flow.
  3. Predictable/Optimal pathing — a traffic flow should always take a predictable and optimal path. These paths should be predetermined based on the geographic proximity of a network to a stack.

Theory

To better understand the logic behind how we solve this problem, I present the theory we use on how to classify and frame routing decisions within this environment.

Note: A number of unique terms to describe certain routing behaviors are defined and used in this section, often italicized, which should not be mistaken as belonging to the general network engineering lexicon nor any industry standard.

Source/Destination-Following

Source-following and destination-following are a pair of terms we coined to describe the two fundamental options for choosing a stack to route traffic through. In source-following, traffic to a destination follows the stack nearest to the source — but more accurately, the stack nearest the device forwarding the traffic. In destination-following, traffic to a destination follows the stack nearest to the destination. It should be apparent that using the same option bidirectionally results in asymmetry through the stacks. However, using the opposite technique on each side of the flow results in symmetry through a single stack.

[Figure 4]

This logic is implemented by tagging prefixes with communities indicating both their proximity to, and advertisement through, a particular stack. We then use these communities in policy to set appropriate metrics.

For the following examples, consider a network composed of 4 stacks with N+1 redundancy, where 1/2 and 3/4 form redundancy groups (i.e. 1 is secondary to 2, and 2 is secondary to 1). Each Rx represents a specific forwarding table (i.e. VRF) on the edge routers at the stack.

[Figure 5]

In destination-following, each prefix is tagged with the dst-x community when originated into BGP, where x is an identifier for the stack. Policies on the north/south eBGP neighbors within the stack set a local-preference of 400 for paths tagged with the dst-x community that corresponds to its local stack, and a local-preference of 300 for paths that correspond to its secondary stack. The preferences are unmodified as they are advertised east/west across the VPNv4 mesh, and thus all tables on that horizontal plane share the same perspective on any prefix — each agree on the same best path for any prefix.
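As a minimal pseudo-policy sketch of that destination-following import, written in the same style as the policies later in this post (from the perspective of stack-1, with stack-2 as its secondary):

if community matches dst-1 then
  set local-pref 400
elseif community matches dst-2 then
  set local-pref 300

pass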

In source-following, each path is tagged with the src-x community when advertised north/south through the stack. Policies on the north/south eBGP neighbors set a local-preference of 400 for all paths. The east/west VPNv4 neighbors set a local-preference of 400 for paths corresponding to its local stack, and 300 for paths corresponding to its secondary stack, as indicated by the src-x community. In this case, all tables on the horizontal plane have a different perspective on the best path for any prefix.

[Figure 6]

In this scenario, the best path at R12 to network D1 is the LP 400 path learned through the stack from R11, and the best path at R11 is the LP 400 path learned laterally from R41. The best path at R41 to prefix A2 is the LP 400 path learned laterally from R11, and at R11 is the LP 400 path learned through the stack from R12.

Source/Destination-Following Control-Plane Overhead

It may be intuitive that following an anycasted prefix (e.g. a default route), where that prefix may originate equally at all stacks, is consistent with the method of source-following. Indeed, we are required to follow the stack of the source in this case — an anycast route must never be originated on the side that uses destination-following, as this would result in bi-directional source-following. In a similar fashion, source-following is conducive to the use of expansive aggregate routes or supernets, without concern for the originating stack of a contributor. This allows for a notable reduction in control-plane overhead. For destination-following, aggregation can only be accomplished to the extent that the contributing and suppressed prefixes match the designated origin of the aggregate route.

Tiered Source/Destination-Following

When expanding the source/destination-following concept to multiple tiers, additional logic is required to continue to enforce symmetry during a failure of individual tiers within a stack. Consider the below scenario where there is a failure at the lower tier at stack-1. The traffic flows asymmetrically through the upper tier due to a shift in perspective when traffic from network A2 fails over to stack-2. The network A2 prefix continues to be shared laterally across the Rx2 tables above the failure, and then advertised through the upper tier at all stacks. The destination-following logic would dictate that the best path for prefix A2 on all Rx1 tables is the LP 400 path learned laterally from R11 on stack-1. However, during the failure event, traffic from network A2 would move laterally around the failure up through stack-2 to R22. At this point, the traffic must continue to follow stack-2’s perspective on how to reach network D1, which is to follow the path from R21 up the stack.

[Figure 7]

This problem introduces a pair of techniques that we call source-shifting and destination-shifting, where the src-x and dst-x communities are rewritten as a path traverses a tier. In destination-shifting, every stack in the prefix’s failover order rewrites the path’s existing dst-x community with the local dst-x community as the path is advertised across a tier. In source-shifting every stack rewrites the existing src-x community with the local src-x community as the path is advertised across a tier.

The destination-shifting technique relies on a concept that we call lateral-blocking, where paths learned laterally may block paths learned with different attributes across the stack. This, of course, is just standard BGP best-path selection behavior, but we highlight it to stress that we utilize path-selection for more than the basic purpose of influencing the forwarding table. We use it to signal information (e.g. network failures) carried in path-attributes, even in cases where the selection of one path over another has zero effect on the forwarding table. This is shown more clearly in the following sections.

In the non-failure scenario, the path for network A2 is rewritten with the local dst-x community across each stack in the failover order. The path learned laterally from R12 remains the best path on all Rx2, and is the one subsequently advertised to the next tier. In the failure scenario, the path with the rewritten dst-2 community onto R22 becomes the new best path. As that path is advertised through the upper tier, all Rx1 tables now see the path from R21 as the best path, and we achieve symmetry. The effect of this shifting behavior is that lower tier failures propagate up, causing traffic to shift to another stack across both tiers. However upper tier failures are masked, and traffic does not shift until reaching the tier that has failed.

[Figure 8]

Zone Directionality

These concepts apply well when there is a clear choice for determining which side utilizes source-following and which side utilizes destination-following. The natural direction in our network would simply be up/down the stack, where down points towards lower tiers closer to the edge of user networks, while up points to the upper tiers towards the egress of our network perimeter. For many traffic flows, that distinction cannot easily be made, for example, between two zones that are neither up nor down from each other. This largely applies to traffic between different DMZs within the same routing-domain, and forms what we refer to as zone-to-zone routing. Moreover, as there may be many zones in the same routing-domain, they all must be consistent in determining a routing option.

The directionality problem can be solved by determining a predictable ordering of the stacks (e.g. stack-1 is of a lower order than stack-2, and stack-2 is of a lower order than stack-3, etc) where routing between two zones utilizes source-following from a zone at a lower ordered stack, and utilizes destination-following from a zone at the higher ordered stack. For the below example: traffic from network A2 to B1 should traverse stack-1, from A2 to D1 should traverse stack-1, from B1 to C2 should traverse stack-2, and from C2 to D1 should traverse stack-3.

[Figure 9]

This logic, however, is slightly more nuanced in that, similar to the tiering problem, the perspective of each stack must be accounted for. In the following examples, note that stacks 2 and 3 have been flipped, such that 1/3 and 2/4 now form the redundancy groups, but the stack ordering of 1 < 2 < 3 < 4 remains the same.

Traffic between B1 and C2 would normally traverse stack-2, due to stack-2 being of a lower order than stack-3. During a failure at stack-2, R22 could not choose to traverse its secondary stack-4, since R42 will have a different (and unchanged by this failure) perspective that the best path to reach B1 is via the lower ordered stack-3 on R32.

[Figure 10]

This likewise applies in a dual-failure where both stack-2 and stack-3 have failed. Both R22 and R31 are forced to consider the perspective of their redundant stacks with respect to the best path, despite the fact that the traffic now follows the failover-order of the higher ordered stack as opposed to the failover-order of the lower ordered stack.

[Figure 11]

To achieve this, we preference paths as the combination of both the stack ordering and the advertising stack’s position in the path’s failover order (i.e. primary vs secondary). The rule being: when routing between two networks, choose the highest failover position (e.g. pri is better than sec) and then the lowest ordered stack between them.

To implement, we use a third helper failover community for each level of redundancy: failover-pri, failover-sec, etc. This community is applied when a path is advertised through any stack within the path’s failover order. For example, when network A2 is advertised through stack-1, it is tagged with failover-pri. When that prefix is advertised through stack-3 (keeping with the failover pairs from the previous example), it is tagged with failover-sec. When advertised through stack-2 or stack-4, no failover community is applied. This community is not fundamental, in that it can be derived from an existing pair of dst-x and src-x values; however, the abstraction substantially reduces the verbosity of the policies that use it.

The failover community in conjunction with the src-x community signals the information needed to preference any path: which stack the path is being advertised from, and where in the failover order (pri/sec/etc) the stack belongs for that path. Illustrating the logic directly on the diagrams would be impractical, but a complete implementation is provided in the following sections. This preferencing logic is applied on both the eBGP peers through the stack and the lateral VPNv4 peers, where the paths are ordered from most preferred to least preferred as follows:

  1. From a stack lower or equal ordered to the primary and is the failover-primary for the path.
  2. From primary stack.
  3. From a stack higher ordered to the primary and is the failover-primary for the path.
  4. From a stack lower or equal ordered to the secondary and is the failover-secondary for the path.
  5. From secondary stack.
  6. From a stack higher ordered to the secondary and is the failover-secondary for the path.

… Each set of 3 terms repeats for the degree of stack redundancy required.

Zone-to-Zone Control-Plane Overhead

The zone-to-zone model, while flexible, comes at the worst case expense in control-plane utilization. Generally, every prefix originating in a zone must be installed in the forwarding table for every other zone. Unlike source/destination-following, the option to at least reduce the overhead from the source-following side, with a default route or broad aggregation, does not apply.

Tiered Zone-to-Zone

The zone-to-zone model nearly breaks down when it is attempted across multiple tiers. The same rules from the tiering section apply, with the need for destination-shifting and source-shifting. The latter is by nature easy, as the only requirement is to implement a local rewrite of src-x. Destination-shifting is far more difficult, because it relies on lateral-blocking to force all stacks to agree on the same best-path. With zone-to-zone, the same lateral-blocking is not achieved — all routers on the same horizontal plane do not converge to the same best path. Without an agreement on the same best path, a dst-x rewrite would signal non-existent failures in accordance with the tiering logic.

If there is a solution to this problem purely in BGP policy, I have not discovered it; however, a simple solution is taking advantage of another BGP feature here: add-path. The BGP add-path capability allows advertising and/or receiving multiple paths for the same prefix to/from a BGP neighbor. By implementing add-path send-only from R11 and R12 and add-path receive-only at the FW between them, we can finalize the necessary lateral-blocking at the mid-point, without affecting the forwarding decision made by the zone-to-zone logic. After lateral-blocking converges to a singular best path (i.e. from the stack with the highest failover position) at the FW, destination-shifting is implemented, and the paths are exported to R11/R12 with the desired attributes.

A hypothetical BGP feature that would allow for decoupling the selection of paths for advertisement from paths for RIB installation would allow for the cleanest solution to this.

Our Hybrid Solution

Our network had been through several iterations of solving this traffic problem. A variety of new requirements and unrealized failure scenarios caused us to rethink and redesign our solution. We ultimately settled on a hybrid model based on the principles described in the previous section.

While implementing a complete tiered zone-to-zone model across the entire network would provide the ultimate in flexibility and uniformity — the ideal one-size-fits-all approach — in practice and at our scale this would immediately crush and exhaust the control-plane resources on our network. The exponential growth of routes propagated in a complete zone-to-zone model would far exceed the resource capacity on our edge routers, which are hefty carrier-grade routers as it is. Coupled with the fact that we do have some natural directionality to the network, we use the following hybrid approach, where we partition each routing-domain on the network. This approach uses both the source/destination-following and tiered zone-to-zone models. The tiered portion of zone-to-zone is implemented in policy, but only used by exception.

[Figure 12]

For every routing-domain, we implement source/destination-following between the untrust zone and all other zones. The untrust zone is that which peers to any upper-tiers, and towards the network perimeter. Between the non-untrust zones, we implement zone-to-zone logic. When source-following to the untrust zone, we limit the advertised prefix to only the default route, and subsequently constrain the exponential growth of routes at each routing-domain boundary.

Recall from the first section that the peering design within a stack is represented as follows, where each zone in a particular routing-domain on the edge routers (coded as ER) peers to its corresponding table on the firewall-router (coded as FR).

[Figure 13]

The following pseudo-code policies outline the logic used to implement this solution.

ER Redistribute — Unicast Prefix

Implemented during simple redistribution or origination of a network into BGP. We add only the dst-x and failover communities, and set our default maximum local preference.

add community dst-<stack-id>
add community failover-pri
set local-pref 400
pass

ER Redistribute — Anycast Prefix

Implemented during redistribution or origination of an anycasted network into BGP (e.g. a default route). We add only a special anycast community to signal that it is an anycast network.

add community anycast
set local-pref 400
pass

ER-Untrust to FR Export

Implemented on the edge router when exporting paths from an untrust zone to the firewall-router. We pass only paths tagged with the anycast community, which should normally be only the default route.

if community matches anycast then
  pass

ER to FR Export

Implemented on the edge router when exporting paths from a non-untrust zone to the firewall-router. We pass only paths tagged with a failover community. In tiered zone-to-zone this limits the add-paths to only those necessary for lateral-blocking.

if community matches failover-all then
  pass

FR to ER Import

Implemented on the firewall-router when importing paths from any zone on the edge router. The policy completes the lateral-blocking needed in the tiered zone-to-zone model by preferencing the add-paths, which only applies to the paths from non-untrust zones.

if community matches failover-pri then
  set local-pref 400
elseif community matches failover-sec then
  set local-pref 300
elseif community matches failover-ter then
  set local-pref 200

pass

FR to ER Export

Implemented on the firewall-router when exporting paths to any zone on the edge router. The first section of the policy rewrites the src-x community needed for source-shifting. The next section simultaneously rewrites the dst-x community needed for destination-shifting, and rewrites the failover community needed for zone-to-zone logic.

delete community in src-all
add community src-<pri stack-id>

delete community in failover-all
if community matches <stack-id>-pri-dst then
  delete community in dst-all
  add community dst-<pri stack-id>
  add community failover-pri
elseif community matches <stack-id>-sec-dst then
  delete community in dst-all
  add community dst-<pri stack-id>
  add community failover-sec
elseif community matches <stack-id>-ter-dst then
  delete community in dst-all
  add community dst-<pri stack-id>
  add community failover-ter

pass

ER-Untrust to FR Import

Implemented on the edge router when importing paths into an untrust zone from the firewall-router. The policy first deletes the src-x community set by the firewall-router in order to force the zone to bypass any zone-to-zone preferencing set by the VPNv4 policy and instead fall-through to destination-shifting. The preferences are set consistent with destination-shifting by matching on failover communities.

delete community in src-all

if community matches failover-pri then
  set local-pref 180
elseif community matches failover-sec then
  set local-pref 150
elseif community matches failover-ter then
  set local-pref 120

ER to FR Import

Implemented on the edge router when importing paths to a non-untrust zone from the firewall-router. The logic needs only consider the top two preferences from the zone-to-zone logic, as they encompass any possible path learned in this direction: it is either a failover-pri path learned from its primary stack, or it is any other path learned from its primary stack.

if community matches failover-pri
  set local-pref 200
else
  set local-pref 190

VPNv4 Import

Implemented on the edge router when importing paths from all other stacks over VPNv4. The policy preferences paths consistent with the ordering specified by the zone-to-zone logic. To implement the “from any stack equal or lower than” qualifier in those rules, we utilize regex-based communities to classify those stacks. For example, a src-3-minus community would use the pattern ^[1-3] on the left-hand side of the community value.

if community matches src-<pri stack-id>-minus and failover-pri
  set local-pref 200
elseif community matches src-<pri stack-id>
  set local-pref 190
elseif community matches failover-pri
  set local-pref 180
elseif community matches src-<sec stack-id>-minus and failover-sec
  set local-pref 170
elseif community matches src-<sec stack-id>
  set local-pref 160
elseif community matches failover-sec
  set local-pref 150
elseif community matches src-<ter stack-id>-minus and failover-ter
  set local-pref 140
elseif community matches src-<ter stack-id>
  set local-pref 130
elseif community matches failover-ter
  set local-pref 120
else
  drop

pass

Hybrid Control-Plane Overhead

For a quick example of the control-plane savings on the hybrid approach: assume a network with 5 route-domains, each with 10 other route-domains tiered underneath, and where every route-domain contains 5 unique zones, and every zone originates 50 prefixes. In total: 55 route-domains, 275 zones, and 13,750 prefixes.

Some quick math can demonstrate that a complete tiered zone-to-zone model, absent any aggregation, would come at an expense of roughly 3.8M routes inserted into the RIB.
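For the complete model, that figure follows directly from the requirement that every prefix be installed in every other zone's table:

13,750 prefixes × 274 other zones ≈ 3.77M routes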

By partitioning the zone-to-zone logic to exist only within a route-domain, and enforcing that every route-domain uses a default route from the untrust to exit, this expense is reduced to roughly 130K routes inserted into the RIB — roughly a 97% reduction.

Conclusion

Implementing a complete tiered zone-to-zone model follows the same policy logic as above, only without applying the unique policies on the untrust zone. There are a number of other policies involved to handle other unique requirements, for example interfacing with external autonomous systems, redundant redistribution points, and route aggregation, all of which follow the same general routing principles outlined here.

Use Azure Automation Runbook to deploy Nessus Agent via Terraform

Problem

All Virtual Machines (VMs) in the Azure environment must have Nessus Agent installed and registered to a newly created Nessus Manager without direct SSH or RDP access to any of the VMs.

Solution

Use an existing Azure Automation Account to deploy the Nessus Agent via a runbook. The runbook will add a Virtual Machine extension that will have the necessary steps to install and register the Nessus Agent based on the Operating System. This solution can be used to install pretty much anything on a Windows or Linux Virtual Machine.

What is an Azure Automation Account?

An Azure Automation Account is a cloud-based management service provided by Microsoft, designed to help automate, orchestrate, and manage repetitive tasks and processes within the Azure environment. It serves as a centralized location for storing various automation assets, such as runbooks, credentials, and integration modules, enabling users to streamline their automation efforts and improve operational efficiency.

In this case, an existing Azure Automation Account that was previously created is being used for this effort. If you don’t have an existing one, you can create a new one strictly for this purpose. There are a couple of requirements that are needed to make this work.

  • Associate a User-assigned Managed Identity that has, at a minimum, the “Virtual Machine Contributor” Azure role on all subscriptions in your tenant (a Terraform sketch of such a role assignment follows this list).
  • The Azure Automation Account must be linked to the same Log Analytics workspace that your VMs are linked to. In this environment, this task was previously taken care of as part of another effort. To associate VMs with a Log Analytics workspace, you will need the OMS or the MMA agent. See the link below, as there are many ways to tackle this.
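As a rough sketch, not taken from the original post, a role assignment for that managed identity could look something like the following; the subscription ID is a placeholder and the data source name mirrors the one used later in this post:

# Grant the user-assigned managed identity VM Contributor rights on a subscription
resource "azurerm_role_assignment" "vm_contributor" {
  scope                = "/subscriptions/<subscription id>"
  role_definition_name = "Virtual Machine Contributor"
  principal_id         = data.azurerm_user_assigned_identity.identity.principal_id
}

Repeat (or iterate with for_each) for every subscription the runbook should manage.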

As mentioned above, if you don’t already have an Automation Account, you will need to create one. Below is an example of creating an Azure Automation Account with Terraform.

resource "azurerm_automation_account" "aa_account" {
location = "<azure region>"
name     = "<name of account>"
resource_group_name = var.rg
identity {
  identity_ids = ["<Your Managed identity ids>"]
  type         = "UserAssigned
}

What is an Azure Automation Runbook?

An Azure Automation Runbook is a set of tasks or operations that you can automate within the Azure environment. It is essentially a collection of PowerShell or Python script(s) that perform various actions, such as managing resources, configuring systems, or handling other operational tasks. Azure Automation Runbooks are commonly used for automating repetitive tasks, scheduling maintenance activities, and orchestrating complex workflows within Azure.

PowerShell 5.x was the scripting language used for this task, in part because Terraform does not currently support PowerShell 7.1 as a runbook type (e.g. https://github.com/hashicorp/terraform-provider-azurerm/issues/14089).
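In the Terraform that follows, this simply means the runbook type resolves to the Windows PowerShell 5.1 runbook type; assuming var.runbook_type is defined alongside the other variables (not shown), its value would be something like:

runbook_type = "PowerShell"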

Terraform

Terraform is the current Infrastructure as Code tool for this environment, therefore it is used in this scenario. Below is a snippet of the main.tf.

Let’s take a look at the Terraform code:

resource "azurerm_automation_runbook" "nessus_install" {
  name                    = var.runbook_name
  location                = data.azurerm_resource_group.ops.location
  resource_group_name     = data.azurerm_automation_account.ops.resource_group_name
  automation_account_name = data.azurerm_automation_account.ops.name
  log_verbose             = true
  log_progress            = true
  description             = var.runbook_description
  runbook_type            = var.runbook_type
  tags                    = var.default_tags
  content = templatefile("${path.module}/runbook/nessus.ps1", {
    umi                         = data.azurerm_user_assigned_identity.identity.client_id
    tenantid                    = var.tenant_id
    scriptnamelinux             = var.scritpname_linux
    scriptnamewindows           = var.scritpname_win
    storageaccountcontainer     = data.azurerm_storage_container.sa.name
    storageaccountresourcegroup = data.azurerm_resource_group.sa.name
    storageaccountname          = var.sa_acct
    workbookname                = var.runbook_name
    storageaccountsub           = data.azurerm_subscription.sa.subscription_id
    client_id                   = data.azurerm_user_assigned_identity.identity.client_id
    vms_to_exclude              = join(",", [for vm in local.vms_file_content : "\"${vm}\""])
    defaultsub                  = ""
  })
}

resource "azurerm_automation_job_schedule" "nessus_install" {
  resource_group_name     = data.azurerm_automation_account.ops.resource_group_name
  automation_account_name = data.azurerm_automation_account.ops.name
  schedule_name           = azurerm_automation_schedule.nessus_install.name
  runbook_name            = azurerm_automation_runbook.nessus_install.name

}

resource "azurerm_automation_schedule" "nessus_install" {
  name                    = var.nessus_schedule
  resource_group_name     = data.azurerm_automation_account.ops.resource_group_name
  automation_account_name = data.azurerm_automation_account.ops.name
  frequency               = var.schedule_frequency
  timezone                = var.timezone
  start_time              = var.start_time
  description             = var.schedule_description
  week_days               = var.week_days
  expiry_time             = var.expiry_time
}

azurerm_automation_runbook: This section defines the Azure Automation Runbook, including its name, location, resource group, and related configurations. The templatefile function takes several inputs that allow you to modify your variables and have your script configured with the desired output. The runbook content comes from a PowerShell script file named `nessus.ps1`, which is responsible for orchestrating the Nessus installation process and is covered in the next section.

azurerm_automation_job_schedule: Here, we set up an Azure Automation Job Schedule, which determines the frequency and timing of the execution of the Nessus installation process.

azurerm_automation_schedule: This section specifies the details of the schedule, including the frequency, time zone, start time, and expiry time for the Nessus installation process. This needs to be run on a weekly basis to incorporate any new VMs that get created in any subscription.
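For reference, a weekly schedule could be driven by variable values along these lines (illustrative values only; the real values live in the variable definitions, which are not shown here):

schedule_frequency = "Week"
week_days          = ["Saturday"]
timezone           = "America/New_York"
start_time         = "2024-01-06T07:00:00Z"
expiry_time        = "2025-01-06T07:00:00Z"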

If you choose to use the code as-is, the variables used in the templatefile are explained below.

    umi  = User Managed Identity that is associated with the Azure Automation Account
    tenantid                    = The Tenant ID 
    scriptnamelinux             = Name of Linux shell script
    scriptnamewindows           = Name of Windows script
    storageaccountcontainer     = Name of the Storage Account where the scripts reside
    storageaccountresourcegroup = Name of the Resource Group where the Storage Account resides
    storageaccountname          = Name of the Storage Account
    workbookname                = Name of the Runbook you are creating
    storageaccountsub           = The Subscription ID of the Storage Account
    vms_to_exclude              = join(",", [for vm in local.vms_file_content : "\"${vm}\""])
    defaultsub                  = "" # If you want to loop through all active subscriptions leave this as-is; if not, put in the subscription(s) you want this script to run against

The vms_to_exclude variable was configured so you can skip VMs by name if you choose. An issue occurred where a VM’s resources were pegged and the script would eventually error out waiting for the VM to finish, so this logic was inserted to mitigate that. A flat text file, “vms.txt”, is used for this purpose; just list the VMs to exclude in this file, one per line.
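As a rough sketch (the filename and layout are assumptions based on the description above), the local.vms_file_content referenced by the templatefile could be built from that file like so:

# Read vms.txt and turn it into a clean list of VM names to exclude
locals {
  vms_file_content = [
    for line in split("\n", file("${path.module}/vms.txt")) :
    trimspace(line) if trimspace(line) != ""
  ]
}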

Powershell

Let’s take a look at the PowerShell script being called, nessus.ps1:

Disable-AzContextAutosave -Scope Process

$AzureContext = (Connect-AzAccount -Identity -Environment AzureUSGovernment -AccountId ${umi}).context
$TenantId = '${tenantid}'
$scriptNameLinux = '${scriptnamelinux}'
$scriptNameWindows = '${scriptnamewindows}'
$storageAccountContainer = '${storageaccountcontainer}'
$storageAccountResourceGroup = '${storageaccountresourcegroup}'
$storageAccountName = '${storageaccountname}'
$defaultSubscriptionId = '${defaultsub}'

$settingsLinux = @{
    "fileUris"         = @("https://$storageAccountName.blob.core.usgovcloudapi.net/$storageAccountContainer/$scriptNameLinux")
    "commandToExecute" = "bash $scriptNameLinux"
} | ConvertTo-Json

$settingsWindows = @{
    "fileUris"         = @("https://$storageAccountName.blob.core.usgovcloudapi.net/$storageAccountContainer/$scriptNameWindows")
    "commandToExecute" = "powershell -NonInteractive -ExecutionPolicy Unrestricted -File $scriptNameWindows"
} | ConvertTo-Json

$storageKey = (Get-AzStorageAccountKey -Name $storageAccountName -ResourceGroupName $storageAccountResourceGroup)[0].Value

$protectedSettingsLinux = @{
    "storageAccountName" = $storageAccountName
    "storageAccountKey"  = $storageKey
} | ConvertTo-Json

$protectedSettingsWindows = @{
    "storageAccountName" = $storageAccountName
    "storageAccountKey"  = $storageKey
} | ConvertTo-Json

$currentAZContext = Get-AzContext

if ($currentAZContext.Tenant.id -ne $TenantId) {
    Write-Output "This script is not authenticated to the needed tenant. Running authentication."
    Connect-AzAccount -TenantId $TenantId
}
else {
    Write-Output "This script is already authenticated to the needed tenant - reusing authentication."
}

$subs = @()

if ($defaultSubscriptionId -eq "") {
    $subs = Get-AzSubscription -TenantId $TenantId | Where-Object { $_.State -eq "Enabled" }
}
else {
    if ($defaultSubscriptionId.IndexOf(',') -eq -1) {
        $subs = Get-AzSubscription -TenantId $TenantId -SubscriptionId $defaultSubscriptionId
    }
    else {
        $defaultSubscriptionId = $defaultSubscriptionId -replace '\s', ''
        $subsArray = $defaultSubscriptionId -split ","
        foreach ($subsArrayElement in $subsArray) {
            $currTempSub = Get-AzSubscription -TenantId $TenantId -SubscriptionId $subsArrayElement
            $subs += $currTempSub
        }
    }
}



$excludeVmNamesArray = (${vms_to_exclude})


foreach ($currSub in $subs) {
    Set-AzContext -subscriptionId $currSub.id -Tenant $TenantId

    if (!$?) {
        Write-Output "Error occurred during Set-AzContext. Error message: $( $error[0].Exception.InnerException.Message )"
        Write-Output "Trying to disconnect and reconnect."
        Disconnect-AzAccount
        Connect-AzAccount -TenantId $TenantId -SubscriptionId $currSub.id
        Set-AzContext -subscriptionId $currSub.id -Tenant $TenantId
    }

    $VMs = Get-AzVM

    foreach ($vm in $VMs) {
        if ($excludeVmNamesArray -contains $vm.Name) {
            Write-Output "Skipping VM $($vm.Name) as it is excluded."
            continue
        }

        $status = (Get-AzVM -ResourceGroupName $vm.ResourceGroupName -Name $vm.Name -Status).Statuses[1].DisplayStatus

        if ($status -eq "VM running") {
            Write-Output "Processing running VM $( $vm.Name )"

            $extensions = (Get-AzVM -ResourceGroupName $vm.ResourceGroupName -Name $vm.Name).Extensions

            foreach ($ext in $extensions) {
                if ($null -ne $vm.OSProfile.WindowsConfiguration) {
                    if ($ext.VirtualMachineExtensionType -eq "CustomScriptExtension") {
                        Write-Output "Removing CustomScriptExtension with name $( $ext.Name ) from VM $( $vm.Name )"
                        Remove-AzVMExtension -ResourceGroupName $vm.ResourceGroupName -VMName $vm.Name -Name $ext.Name -Force
                        Write-Output "Removed CustomScriptExtension with name $( $ext.Name ) from VM $( $vm.Name )"
                    }
                }
                else {
                    if ($ext.VirtualMachineExtensionType -eq "CustomScript") {
                        Write-Output "Removing CustomScript extension with name $( $ext.Name ) from VM $( $vm.Name )"
                        Remove-AzVMExtension -ResourceGroupName $vm.ResourceGroupName -VMName $vm.Name -Name $ext.Name -Force
                        Write-Output "Removed CustomScript extension with name $( $ext.Name ) from VM $( $vm.Name )"
                    }
                }
            }

            if ($vm.StorageProfile.OsDisk.OsType -eq "Windows") {
                Write-Output "Windows VM detected: $( $vm.Name )"
                $settingsOS = $settingsWindows
                $protectedSettingsOS = $protectedSettingsWindows
                $publisher = "Microsoft.Compute"
                $extensionType = "CustomScriptExtension"
                $typeHandlerVersion = "1.10"
            }
            elseif ($vm.StorageProfile.OsDisk.OsType -eq "Linux") {
                Write-Output "Linux VM detected: $( $vm.Name )"
                $settingsOS = $settingsLinux
                $protectedSettingsOS = $protectedSettingsLinux
                $publisher = "Microsoft.Azure.Extensions"
                $extensionType = "CustomScript"
                $typeHandlerVersion = "2.1"
            }
            $customScriptExtensionName = "NessusInstall"

            Write-Output "$customScriptExtensionName installation on VM $( $vm.Name )"

            Set-AzVMExtension -ResourceGroupName $vm.ResourceGroupName `
                -Location $vm.Location `
                -VMName $vm.Name `
                -Name $customScriptExtensionName `
                -Publisher $publisher `
                -ExtensionType $extensionType `
                -TypeHandlerVersion $typeHandlerVersion `
                -SettingString $settingsOS `
                -ProtectedSettingString $protectedSettingsOS

            Write-Output "---------------------------"
        }
        else {
            Write-Output "VM $( $vm.Name ) is not running, skipping..."
        }
    }

    Set-AzContext -SubscriptionId $defaultSubscriptionId -Tenant $TenantId
}

This particular environment is in Azure Government (hence the AzureUSGovernment environment and the usgovcloudapi.net endpoints in the script) but could be modified to use any Azure cloud or region.

The script is designed to automate the deployment of custom scripts/extensions to multiple Azure VMs across different subscriptions. It provides flexibility for both Linux and Windows VMs and ensures that any existing custom script extensions are removed before deployment; this is because you cannot have an extension with the same name.

OS Scripts

Now let’s look at the Windows script that nessus.ps1 calls:

$installerUrl = "<URL to the msi>"


$NESSUS_GROUP="<Name of your Nessus Group>"

$NESSUS_KEY="<Name of Nessus Key>"

$NESSUS_SERVER="<FQDN of Nessus Server>"

$NESSUS_PORT="<Port if different from standard 8834>"

$installerPath = "C:\TEMP\nessusagent.msi"

$windows_package_name = "'Nessus Agent (x64)'"

$installed = Get-WmiObject -Query "SELECT * FROM Win32_Product WHERE Name = $windows_package_name" | Select-Object Name

function Test-Admin {

    $currentUser = New-Object Security.Principal.WindowsPrincipal $([Security.Principal.WindowsIdentity]::GetCurrent())

    $currentUser.IsInRole([Security.Principal.WindowsBuiltinRole]::Administrator)

}

 

if ((Test-Admin) -eq $false) {

    if ($elevated) {

    }

    else {

        Start-Process powershell.exe -Verb RunAs -ArgumentList ('-noprofile -file "{0}" -elevated' -f ($myinvocation.MyCommand.Definition))

    }

    exit

}

 
'running with full privileges'

if ($installed) {

    Write-Output "Nessus Agent is already installed. Exiting."

}

else {

    Write-Output "Downloading Nessus Agent MSI installer..."

    Invoke-WebRequest -Uri $installerUrl -OutFile $installerPath


    Write-Output "Installing Nessus Agent..."

    Start-Process -FilePath msiexec.exe -ArgumentList "/i C:\TEMP\nessusagent.msi NESSUS_GROUPS=`"$NESSUS_GROUP`" NESSUS_SERVER=`"$NESSUS_SERVER`" NESSUS_KEY=$NESSUS_KEY /qn" -Wait

    $installed = Get-WmiObject -Query "SELECT * FROM Win32_Product WHERE Name = $windows_package_name" | Select-Object Name


    if ($installed) {

        Write-Output "Nessus Agent has been successfully installed."

    }

    else {

        Write-Output "Failed to install Nessus Agent."

    }

}

 
if (Test-Path $installerPath) {

    Remove-Item -Path $installerPath -Force

}

 
Function Start-ProcessGetStreams {


    [CmdLetBinding()]

    Param(

        [System.IO.FileInfo]$FilePath,

        [string[]]$ArgumentList

    )


    $pInfo = New-Object System.Diagnostics.ProcessStartInfo

    $pInfo.FileName = $FilePath

    $pInfo.Arguments = $ArgumentList

    $pInfo.RedirectStandardError = $true

    $pInfo.RedirectStandardOutput = $true

    $pinfo.UseShellExecute = $false

    $pInfo.CreateNoWindow = $true

    $pInfo.WindowStyle = [System.Diagnostics.ProcessWindowStyle]::Hidden


    $proc = New-Object System.Diagnostics.Process

    $proc.StartInfo = $pInfo


    Write-Verbose "Starting $FilePath"

    $proc.Start() | Out-Null

    Write-Verbose "Waiting for $($FilePath.BaseName) to complete"

    $proc.WaitForExit()

    $stdOut = $proc.StandardOutput.ReadToEnd()

    $stdErr = $proc.StandardError.ReadToEnd()

    $exitCode = $proc.ExitCode

 

    Write-Verbose "Standard Output: $stdOut"

    Write-Verbose "Standard Error: $stdErr"

    Write-Verbose "Exit Code: $exitCode"


    [PSCustomObject]@{

        "StdOut"   = $stdOut

        "Stderr"   = $stdErr

        "ExitCode" = $exitCode

    }

}



Function Get-NessusStatsFromStdOut {

 

    Param(

        [string]$stdOut

    )

 

    $stats = @{}

 

 

 

    $StdOut -split "`r`n" | ForEach-Object {

        if ($_ -like "*:*") {

            $result = $_ -split ":"

            $stats.add(($result[0].Trim() -replace "[^A-Za-z0-9]", "_").ToLower(), $result[1].Trim())

        }

    }


    Return $stats

}


Function Get-DateFromEpochSeconds {

    Param(

        [int]$seconds

    )

 

    $utcTime = (Get-Date 01.01.1970) + ([System.TimeSpan]::fromseconds($seconds))

    Return Get-Date $utcTime.ToLocalTime() -Format "yyyy-MM-dd HH:mm:ss"

}

 
Try {

    $nessusExe = Join-Path $env:ProgramFiles -ChildPath "Tenable\Nessus Agent\nessuscli.exe" -ErrorAction Continue

}

Catch {

    Throw "Cannot find NessusCli.exe, installing..."

}

 

Write-Output "Getting Agent Status..."

$agentStatus = Start-ProcessGetStreams -FilePath $nessusExe -ArgumentList "agent status"

 

If ($agentStatus.stdOut -eq "" -and $agentStatus.StdErr -eq "") {

    Throw "No Data Returned from NessusCli, linking now"

    Start-ProcessGetStreams -FilePath $nessusExe -ArgumentList 'agent link --key=$NESSUS_KEY --groups="$NESSUS_GROUP" --host=$NESSUS_SERVER --port=$NESSUS_PORT'

}

elseif ($agentStatus.StdOut -eq "" -and $agentStatus.StdErr -ne "") {

    Throw "StdErr: $($agentStatus.StdErr)"

}

elseif (-not($agentStatus.stdOut -like "*Running: *")) {

    Throw "StdOut: $($agentStatus.StdOut)"

}

else {

    $stats = Get-NessusStatsFromStdOut -stdOut $agentStatus.StdOut

    If ($stats.linked_to -eq $NESSUS_SERVER -and $stats.link_status -ne 'Not linked to a manager') {

        Write-Output "Connected to $NESSUS_SERVER"

    }

    else {

        Write-Output "Connecting..."

        Start-ProcessGetStreams -FilePath "C:\Program Files\Tenable\Nessus Agent\nessuscli.exe" -ArgumentList 'agent link --key=$NESSUS_KEY --groups="$NESSUS_GROUP" --host=$NESSUS_SERVER --port=$NESSUS_PORT'

    }


    If ($stats.last_connection_attempt -as [int]) { $stats.last_connection_attempt = Get-DateFromEpochSeconds $stats.last_connection_attempt }

    If ($stats.last_connect -as [int]) { $stats.last_connect = Get-DateFromEpochSeconds $stats.last_connect }

    If ($stats.last_scanned -as [int]) { $stats.last_scanned = Get-DateFromEpochSeconds $stats.last_scanned }

}

 
#$stats | Out-Host

This script streamlines the process of installing and linking the Nessus Agent to the specified Nessus server, automating various steps and ensuring the seamless deployment and integration of the agent within the intended environment.

Now let’s look at the Linux script that nessus.ps1 calls:

#!/bin/bash

exec 3>&1 4>&2

trap 'exec 2>&4 1>&3' 0 1 2 3

exec 1>/tmp/nessus-install-log.out 2>&1

 

PACKAGE_NAME="nessusagent"

ACTIVATION_CODE="<Your Nessus Activation Key/Code>"

NESSUS_HOST="<fqdn of your Nessus Manager>"

NESSUS_AGENT="/opt/nessus_agent/sbin/nessuscli"

NESSUS_PORT="<port # if different from 8834>"

NESSUS_GROUP="<name of your group>"

base_url="<url to your Storage Account>"

debian_filename="NessusAgent-10.3.1-ubuntu1404_amd64.deb" # Debian/Ubuntu filename

redhat_7_filename="NessusAgent-10.3.1-es7.x86_64.rpm" # Redhat EL7 filename

redhat_8_filename="NessusAgent-10.3.1-es8.x86_64.rpm" # Redhat EL8 filename

 

 

if_register_agent() {

  if  "$NESSUS_AGENT" agent status | grep -q "Linked to: $NESSUS_HOST"; then

    echo "Nessus Agent is already linked to Nessus Manager."

  else

    $NESSUS_AGENT agent link --host="$NESSUS_HOST" --port="$NESSUS_PORT" --key="$ACTIVATION_CODE" --groups="$NESSUS_GROUP"

    if [ $? -eq 0 ]; then

        echo "Nessus Agent linked successfully."

        else

          echo "Failed to link Nessus Agent. Check your activation code or permissions."

          exit 1

    fi

  fi

}

 

is_package_installed_debian() {

  if  dpkg -l | grep -i "ii  $PACKAGE_NAME"; then

    if_register_agent

    return 0

  else

    return 1

  fi

}

 

is_package_installed_redhat() {

  if  rpm -qa | grep -i "$PACKAGE_NAME" > /dev/null; then

    if_register_agent

    return 0

  else

    return 1

  fi

}

 

install_package_debian() {

  echo "$PACKAGE_NAME is not installed on $ID. Installing it now..." &&

  sleep 20 &&

   wget -qP /tmp $base_url$debian_filename &&

  sleep 20 &&

   dpkg -i /tmp/"$debian_filename" &&

  sleep 20 &&

   $NESSUS_AGENT agent link --host="$NESSUS_HOST" --port="$NESSUS_PORT" --key="$ACTIVATION_CODE" --groups="$NESSUS_GROUP" &&

  sleep 20 &&

   systemctl enable nessusagent --now &&

  sleep 20 &&

   $NESSUS_AGENT agent status |  tee /tmp/nessus_agent_status &&

   sleep 20 &&

   rm -f /tmp/"$debian_filename"

   exit

}

 

 

install_package_redhat_v7() {

  echo "$PACKAGE_NAME is not installed on $ID-$VERSION_ID Installing it now..."

  yum -y install wget &&

  sleep 20 &&

   wget -qP /tmp $base_url$redhat_7_filename &&

  sleep 20 &&

   rpm -ivh /tmp/"$redhat_7_filename" &&

  sleep 20 &&

   $NESSUS_AGENT agent link --host="$NESSUS_HOST" --port="$NESSUS_PORT" --key="$ACTIVATION_CODE" --groups="$NESSUS_GROUP" &&

  sleep 20 &&

   systemctl enable nessusagent --now &&

  sleep 20 &&

   $NESSUS_AGENT agent status |  tee /tmp/nessus_agent_status &&

   rm -f /tmp/"$redhat_7_filename"

   exit

}

 

install_package_redhat_v8() {

  echo "$PACKAGE_NAME is not installed on $ID-$VERSION_ID. Installing it now..."

  sleep 20 &&

   wget -qP /tmp $base_url$redhat_8_filename &&

  sleep 20 &&

   rpm -ivh /tmp/"$redhat_8_filename" &&

  sleep 20 &&

   $NESSUS_AGENT agent link --host="$NESSUS_HOST" --port="$NESSUS_PORT" --key="$ACTIVATION_CODE" --groups="$NESSUS_GROUP" &&

  sleep 20 &&

   systemctl enable nessusagent --now &&

  sleep 20 &&

   $NESSUS_AGENT agent status |  tee /tmp/nessus_agent_status &&

   rm -f /tmp/"$redhat_8_filename"

   exit

}

 

check_debian_based() {

  lowercase_id=$(echo "$ID" | tr '[:upper:]' '[:lower:]')

  if [[ "$lowercase_id" == *debian* || "$lowercase_id" == *ubuntu* ]]; then

    if is_package_installed_debian; then

      echo "$PACKAGE_NAME is already installed on $ID."

      exit 0

    else

      install_package_debian

    fi

  fi

}

 

check_redhat_based() {

  lowercase_id=$(echo "$ID" | tr '[:upper:]' '[:lower:]')

  if [[ "$lowercase_id" == *centos* || "$lowercase_id" == *rhel* || "$lowercase_id" == *ol* || "$lowercase_id" == *el* ]]; then

    if is_package_installed_redhat; then

      echo "$PACKAGE_NAME is already installed on $ID."

      exit 0

    else

      if [[ "$VERSION_ID" == 7 ]]; then

        echo "Red Hat $ID version 7 detected."

        install_package_redhat_v7

      elif [[ "$VERSION_ID" == 8 ]]; then

        echo "Red Hat $ID version 8 detected."

        install_package_redhat_v8

      else

        echo "Unsupported version: $VERSION_ID"

        exit 1

      fi

    fi

  fi

}

 

if [ -f /etc/os-release ]; then

  . /etc/os-release

  check_debian_based

  check_redhat_based

else

  echo "Unsupported Linux distribution."

  exit 1

fi

This script serves the same purpose as the Windows script above, but for Linux distributions. It determines the OS type, installs the appropriate agent package, and registers the agent with the Nessus Manager.

Example of the variables.tf

variable "default_tags" {
  description = "A map of tags to add to all resources"
  type        = map(string)
  default = {
  }
}

variable "tenant_id" {
  description = "Azure AD Tenate ID of the Azure subscription"
  type        = string
}

variable "nessus_schedule" {
  description = "Name of the Schedule in Automation Account"
  type        = string
  default     = "nessus-automation-schedule"
}

variable "timezone" {
  description = "Name of the Timezone"
  type        = string
  default     = "America/New_York"
}

variable "schedule_description" {
  description = "Schedule Description"
  type        = string
  default     = "This is schedule to download and install Nessus"
}

variable "week_days" {
  description = "Schedule Description"
  type        = list(string)
  default     = ["Monday", "Wednesday", "Saturday"]
}

variable "scritpname_linux" {
  default     = "nessus-linux.sh"
  description = "Name of Linux script"
  type        = string
}

variable "scritpname_win" {
  default     = "nessus-windows.ps1"
  description = "Name of Windows script"
  type        = string
}

variable "sa_container" {
  description = "Name of the Storage Account Container"
  type        = string
}

variable "sa_rg" {
  description = "Name of the Storage Account Resource Group"
  type        = string
}

variable "sa_sub" {
  description = "Subscription ID where the Storage Account lives"
  type        = string
}


variable "sa_acct" {
  description = "Name of the Storage Account"
  type        = string
}

locals {
  vms_file_content = split("\n", file("${path.module}/vms.txt"))
}

variable "schedule_frequency" {
  description = "Job frequency"
  type        = string
  default     = "Week"
}

variable "runbook_name" {
  description = "Name of the runbook"
  type        = string
  default     = "nessus_agent_install"
}

variable "runbook_type" {
  description = "Name of the language used"
  type        = string
  default     = "PowerShell"
}

variable "runbook_description" {
  description = "Description of the Runbook"
  type        = string
  default     = "This runbook will Download and Install the Nessus Agent"
}

variable "start_time" {
  description = "When to start the runbook schedule"
  type        = string
  default     = "2024-10-07T06:00:15+02:00"
}

variable "expiry_time" {
  description = "When to start the runbook schedule"
  type        = string
  default     = "2027-10-07T06:00:15+02:00"
}

variable "identity_sub" {
  description = "Subscription where MI lives"
  type        = string
}
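
To show how these variables might come together, below is a minimal, hypothetical sketch of the schedule wiring for the runbook deployment. The automation account and resource group references (azurerm_automation_account.aa and azurerm_resource_group.rg) and the interval value are illustrative assumptions rather than the actual repository code; the argument names follow the azurerm provider's azurerm_automation_schedule and azurerm_automation_job_schedule resources.

# Hypothetical sketch - the automation account and resource group are assumed
# to be defined elsewhere in the configuration.
resource "azurerm_automation_schedule" "nessus" {
  name                    = var.nessus_schedule
  resource_group_name     = azurerm_resource_group.rg.name
  automation_account_name = azurerm_automation_account.aa.name
  description             = var.schedule_description
  frequency               = var.schedule_frequency
  interval                = 1
  timezone                = var.timezone
  start_time              = var.start_time
  expiry_time             = var.expiry_time
  week_days               = var.week_days
}

# Link the schedule to the runbook so the agent install runs on the scheduled days.
resource "azurerm_automation_job_schedule" "nessus" {
  resource_group_name     = azurerm_resource_group.rg.name
  automation_account_name = azurerm_automation_account.aa.name
  schedule_name           = azurerm_automation_schedule.nessus.name
  runbook_name            = var.runbook_name
}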


All the above code can be found at the link below.

https://github.com/rdeberry-sms/nessus_aa_runboook
Automating Operating System Hardening https://www.sms.com/blog/automating-operating-system-hardening/ https://www.sms.com/blog/automating-operating-system-hardening/#respond Wed, 12 Jul 2023 17:01:42 +0000 https://smsprod01.wpengine.com/?p=6583 By Andrew Stanley, Director of Engineering, SMS

In the ever-evolving landscape of cybersecurity, the importance of operating system hardening cannot be overstated. As the foundational layer of any IT infrastructure, the operating system presents a broad surface area for potential attacks. Hardening these systems, therefore, is a critical step in any comprehensive cybersecurity strategy. However, the challenge lies in automating this process, particularly in legacy on-premises infrastructures not designed with automation in mind. 

Open-source software has emerged as a powerful ally in this endeavor, offering flexibility, transparency, and a collaborative approach to tackling cybersecurity challenges. Tools such as OpenSCAP and Ansible have been instrumental in automating and streamlining the process of operating system hardening. The Center for Internet Security (CIS), a non-profit entity, plays a pivotal role in this context by providing well-defined, community-driven security benchmarks that these tools can leverage. 

While cloud-native architectures have been at the forefront of automation with tools like HashiCorp’s Packer and Terraform, these tools are not confined to the cloud. They can be ingeniously adapted to work with on-premises systems like VMware, enabling the creation of hardened virtual machine images and templates. This convergence of cloud-native tools with traditional on-premises systems is paving the way for a new era in cybersecurity, where robust, automated defenses are within reach for all types of IT infrastructures. This blog post will delve into how these tools can automate operating system hardening, making cybersecurity more accessible and manageable. 

Why Use OpenSCAP and Ansible for Operating System Hardening

The Center for Internet Security (CIS) Benchmarks Level II Server Hardening standard is a stringent set of rules designed for high-security environments. It includes advanced security controls like disabling unnecessary services, enforcing password complexity rules, setting strict access controls, and implementing advanced auditing policies. OpenSCAP, an open-source tool, can automate the application of these benchmarks by generating Ansible templates. This automation ensures consistency, accuracy, and efficiency in securing your servers according to these high-level standards.

Prerequisites

  • VMware vSphere environment for building and testing images
  • One Linux host or VM to run the required tools
  • One Linux host or VM for auditing

Note

The examples in this post use Ubuntu 20.04 but should work for other versions and distros.

Steps

  • Execute the following on the host you intend to use for running OpenSCAP, Ansible, Packer, and Terraform.
# Reference - https://medium.com/rahasak/automate-stig-compliance-server-hardening-with-openscap-and-ansible-85f2f091b00
# install openscap libraries on local and remote hosts
sudo apt install libopenscap8

# Create a working directory
mkdir ~/openscap
export WORKDIR=~/openscap
cd $WORKDIR

# Download ssg packages and unzip
# Check for updates here - https://github.com/ComplianceAsCode/content/releases
wget https://github.com/ComplianceAsCode/content/releases/download/v0.1.67/scap-security-guide-0.1.67.zip
unzip -q scap-security-guide-0.1.67.zip

# Clone openscap
git clone https://github.com/OpenSCAP/openscap.git
  • Create a new Ubuntu 20.04 base image and virtual machine template in VMware

Note

There are several ways to create base images in vSphere. Our recommendation is to use Hashicorp Packer and the packer-examples-for-vsphere project. The setup and configuration of these is outside the scope of this post but we may cover it in more detail in the future. The advantage of using this project is that it already provides a convenient way to add ansible playbooks to your image provisioning process. Additionally, SMS develops reusable terraform modules that are designed to work with images created from this project.

  • Run a remote scan against the new virtual machine you created
# Return to the root of the working directory
cd $WORKDIR

# Scan the newly created Ubuntu 20.04 instance using the CIS Level2 Server profile
./openscap/utils/oscap-ssh --sudo <user@host> 22 xccdf eval \
  --profile xccdf_org.ssgproject.content_profile_cis_level2_server \
  --results-arf ubuntu2004-cis_level2_server.xml \
  --report ubuntu2004-cis_level2_server.html \
  scap-security-guide-0.1.67/ssg-ubuntu2004-ds.xml
  • Generate an Ansible Remediation Playbook
# Generate an Ansible Playbook using OpenSCAP
oscap xccdf generate fix \
  --fetch-remote-resources \
  --fix-type ansible \
  --result-id "" \
  ubuntu2004-cis_level2_server.xml > ubuntu2004-playbook-cis_level2_server.yml
  • Test the generated Ansible Playbook
# Validate the playbook against the target machine
ansible-playbook -i "<host>," -u <user> -b -K ubuntu2004-playbook-cis_level2_server.yml

Note

It may be necessary to perform the previous scanning and playbook creation steps multiple times. As new packages are added, additional hardening configurations will be needed.

Using Ansible Templates with Packer Examples for VMware vSphere

In this section, we delve into the practical application of Packer in a vSphere environment. We will explore the Packer Examples for VMware vSphere repository on GitHub, which provides a comprehensive set of examples for using Packer with vSphere. These examples demonstrate how to automate the creation of vSphere VM templates using Packer, Ansible and Terraform which can be used to create consistent and repeatable infrastructure. By the end of this section, you will have a solid understanding of how to leverage these examples in a vSphere environment to streamline your infrastructure management tasks. 

# Return to the root of the working directory
cd $WORKDIR

# Clone packer-examples-for-vsphere
git clone https://github.com/vmware-samples/packer-examples-for-vsphere.git
cd ./packer-examples-for-vsphere

# Create a new branch to save customizations. New templates will include the branch name by default.
git checkout -b dev
  • Update the repo to include the Ansible Playbook created with OpenSCAP
# Add a new role to the Ansible section of the repo
mkdir -p ./ansible/roles/harden/tasks
mkdir -p ./ansible/roles/harden/vars

# Create a variables file for the new role and copy all of the variables from the Ansible Playbook
vi ./ansible/roles/harden/vars/main.yml

# Create a task file and copy the remaining contents of the Ansible Playbook
vi ./ansible/roles/harden/tasks/main.yml

# Update the existing Ansible Playbook to include the newly created role
vi ./ansible/main.yml

---
- become: "yes"
  become_method: sudo
  debugger: never
  gather_facts: "yes"
  hosts: all
  roles:
    - base
    - users
    - configure
    - harden
    - clean
  • Create a new hardened image and virtual machine template in VMware
# Follow the setup instructions in the README.md then create your base images
./build.sh

    ____             __                ____        _ __    __     
   / __ \____ ______/ /_____  _____   / __ )__  __(_) /___/ /____ 
  / /_/ / __  / ___/ //_/ _ \/ ___/  / __  / / / / / / __  / ___/ 
 / ____/ /_/ / /__/ ,< /  __/ /     / /_/ / /_/ / / / /_/ (__  )  
/_/    \__,_/\___/_/|_|\___/_/     /_____/\__,_/_/_/\__,_/____/   

  Select a HashiCorp Packer build for VMware vSphere:

      Linux Distribution:

         1  -  VMware Photon OS 4
         2  -  Debian 11
         3  -  Ubuntu Server 22.04 LTS (cloud-init)
         4  -  Ubuntu Server 20.04 LTS (cloud-init)

Choose Option 4

Creating Virtual Machines on VMware vSphere Using the Hardened Virtual Machine Templates

In this section, we will explore using the ‘terraform-vsphere-instance’ project, hosted on GitLab by SMS, for creating virtual machines. This project provides a set of Terraform configurations designed to create instances on VMware vSphere. These configurations leverage the power of Terraform, a popular Infrastructure as Code (IaC) tool, to automate the provisioning and management of vSphere instances. By using these Terraform modules, you can streamline the process of creating and managing your virtual machines on vSphere, ensuring consistency and repeatability in your infrastructure.

  • Create a virtual machine instance from the new template
# Return to the root of the working directory
cd $WORKDIR

# Clone terraform-vsphere-instance
git clone https://gitlab.com/sms-pub/terraform-vsphere-instance.git
cd ./terraform-vsphere-instance/examples/vsphere-virtual-machine/template-linux-cloud-init

# Copy and update the example tfvars file with settings for your environment
cp terraform.tfvars.example test.auto.tfvars

# Deploy a new virtual machine using Terraform
terraform plan

...
Plan: 1 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + module_output = [
      + {
          + vm_id           = (known after apply)
          + vm_ip_address   = (known after apply)
          + vm_ip_addresses = (known after apply)
          + vm_moid         = (known after apply)
          + vm_tools_status = (known after apply)
          + vm_vmx_path     = (known after apply)
        },
    ]

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Note: You didn't use the -out option to save this plan, so Terraform can't guarantee to take exactly these actions if you run "terraform apply" now.

terraform apply

...
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

module_output = [
  {
    "vm_id" = "423d5014-829b-e000-9489-ac12dfaf4627"
    "vm_ip_address" = "10.4.3.142"
    "vm_ip_addresses" = tolist([
      "10.4.3.142",
      "fe80::250:56ff:febd:394f",
    ])
    "vm_moid" = "vm-4174"
    "vm_tools_status" = "guestToolsRunning"
    "vm_vmx_path" = "f784ad64-86a2-588d-a073-0025b500002e/lin-test-2004-default-00.vmx"
  },
]

Conclusion

In this blog post, we’ve explored the importance of operating system hardening and the challenges of automating this process, particularly in legacy on-premises infrastructures. We’ve seen how open-source tools like OpenSCAP and Ansible, along with the CIS Benchmarks, provide a robust framework for maintaining the security of enterprise systems. 

We’ve also delved into the practical application of Packer in a vSphere environment, demonstrating how to automate the creation of vSphere VM templates. Furthermore, we’ve seen how these templates can be used to create consistent and repeatable infrastructure, ensuring a high level of security across all systems. 

Finally, we’ve explored the use of Terraform modules from GitLab for creating virtual machines on VMware vSphere. This approach leverages the power of Infrastructure as Code (IaC) to automate the provisioning and management of vSphere instances, streamlining the process and ensuring consistency and repeatability in your infrastructure. 

In conclusion, the convergence of cloud-native tools with traditional on-premises systems is paving the way for a new era in cybersecurity. By leveraging these tools, organizations can ensure that their systems are configured according to best security practices and are resilient against potential threats. This approach makes cybersecurity more accessible and manageable, even in complex, legacy infrastructures. 

As we move forward, it’s clear that the automation of operating system hardening will continue to play a crucial role in cybersecurity. By staying informed and leveraging the right tools, we can ensure that our systems remain secure in the face of ever-evolving threats.

EVPN+IRB Over MPLS With JUNOS and IOS-XR  https://www.sms.com/blog/evpnirb-over-mpls-with-junos-and-ios-xr/ https://www.sms.com/blog/evpnirb-over-mpls-with-junos-and-ios-xr/#respond Thu, 27 Oct 2022 02:00:38 +0000 http://sms-old.local/?p=4718 By Zachary Cayou, Network Engineer, SMS 

Introduction
I was given a project to implement EVPN+IRB over MPLS in our network, with the bonus to make it interoperable between JUNOS and IOS-XR routers. At the time, the depths of Google revealed precisely zero examples, guides, or blog posts of anyone attempting to do this. In addition, vendor documentation on the subject tends to assume a particular network design, which we do not follow. As a result, it became an interesting experiment of trial/error, dissecting documentation and RFCs, and a fair amount of head scratching.

The purpose of this post is to outline the compatibility, limitations, and tweaks necessary to implement EVPN+IRB in a multivendor environment with JUNOS and IOS-XR.

Primer
Ethernet VPN (EVPN) is a next-generation VPN protocol for building both L2 and L3 VPNs. EVPN attempts to address many of the challenges faced by traditional L2VPN protocols such as VPLS, while also providing L3VPN capabilities.

Adding integrated routing and bridging (IRB) into EVPN enables both L2 forwarding of intra-subnet traffic and L3 forwarding of inter-subnet traffic within the L3VPN. This facilitates stretching a L2 domain across the core when L2 reachability is needed, while providing optimal forwarding of L3 traffic, and enabling VM mobility support with distributed anycast gateways.

The topic of EVPN, associated protocols, and their applications are far too broad to be covered in depth here, and thus the details and challenges discussed hereafter assume a fair understanding of EVPN already. Details on EVPN implementations may be found in the respective vendor’s documentation.

The specific platforms and OS versions referenced:

  • Juniper MX80s running JUNOS 20.4R1.12
  • Cisco ASR9010s running IOS-XR 6.7.3

Interoperability
Is EVPN+IRB over MPLS interoperable between JUNOS and IOS-XR? At the time of writing… no, strictly speaking they are not, due to reasons I’ll outline below. The two platforms implement incompatible EVPN+IRB behavior. That said, with a certain degree of workarounds, loose interoperability can be achieved.

While EVPN is a mature technology, EVPN+IRB is less so. Vendors began adding IRB features of EVPN in advance of governing RFCs. RFC 9135 (Integrated Routing and Bridging in Ethernet VPN (EVPN)) was in draft until October 2021. This RFC largely outlines the two IRB models that IOS-XR and JUNOS follow, symmetric and asymmetric, respectively.

Asymmetric IRB


In the asymmetric IRB model, the lookup operation for inter-subnet routing is asymmetric on the ingress and egress PE. When H1 sends a packet destined for H4, PE1 interface X receives the frame and conducts an IP lookup for the destination in its VRF table, where the longest match resolves to the network on interface Y. PE1 then does a lookup for H4's MAC, which resolves as an EVPN learned adjacency. The packet is then encapsulated with a source MAC of interface Y and a destination MAC of H4, and forwarded as an L2 payload across the core to PE2. At PE2, a single MAC lookup is performed to switch the traffic to H4. While this model achieves some simplicity in the control plane, and provides for centralized routing, it introduces limitations in scalability and flexibility. Since the ingress PE must be able to do the MAC lookup for the destination, it follows that it must also contain every IRB interface and install adjacency rewrites in the forwarding plane for every host in the routing domain, regardless of whether the PE has any local hosts in that network.

 

Symmetric IRB


In the symmetric model, the lookup operation for inter-subnet routing is symmetric on the ingress and egress PE. When host H1 sends a packet destined for H4, PE1 interface X receives the frame and conducts an IP lookup for the destination in its VRF table, where the longest match resolves to a host route learned from PE2. The traffic is forwarded as an L3 payload across the core to PE2. PE2 then does a lookup for H4's MAC, which resolves as a local adjacency on interface Y, and the packet can be forwarded to H4.

In the symmetric model, it should be clear that PE1 need not store a L2 adjacency for H4 in the forwarding plane, nor does interface Y need to exist on PE1 at all. This is achieved by including an additional label (aka Label2) and an additional VRF route-target into EVPN Type-2 MAC/IP advertisements. The additional route-target and label are used the same as they would be in a VPNv4 advertisement: to import the host-route into the correct VRF and to provide a forwarding label.

Compatibility
RFC 9135 says that asymmetric and symmetric IRB modes may coexist on the same network, with the egress PE indirectly dictating the mode by the presence or absence of Label2 and the VRF Route-Target in its EVPN Type-2 advertisements. In other words, this coexistence only means that PEs can prefer to operate in different modes, but they must be capable of both. As it turns out, IOS-XR operates exclusively in symmetric mode, and JUNOS operates exclusively in asymmetric mode.


Let's look at where this breaks down on the data plane. Where PE1 is JUNOS and PE2 is IOS-XR, PE1 sends an EVPN Type-2 MAC/IP advertisement without Label2/VRF RT for H1, and PE2 sends an EVPN Type-2 MAC/IP advertisement with Label2/VRF RT for H4. PE1 does not recognize the Label2/VRF RT attributes, so they are ignored, and it installs an adjacency for the H4 MAC/IP in the forwarding table. PE1 performs an IP lookup for H4, finds interface Y as the longest match, does a MAC lookup for H4, and forwards the L2 payload to PE2, which successfully switches the packet to H4. In the return direction from H4, PE2 performs an IP lookup for H1, finds interface X as the longest match, but fails to resolve a MAC for H1. Even though PE2 is aware of the MAC/IP binding from EVPN in the control plane, the binding is not installed as an adjacency in the data plane. PE2 operates purely in symmetric mode and expects to see Label2/VRF RT in the Type-2 advertisement for H1 if inter-subnet routing is desired.

The ultimate problem is that bidirectional traffic fails because there is no way to properly route traffic from a PE that only supports symmetric mode to a PE that only supports asymmetric mode. Therefore, we can conclude that EVPN+IRB in isolation is not presently interoperable between JUNOS and IOS-XR. However, since the problem is purely one of routing, there are other ways we could approach it and still make it work.

Solution
Our network consists of hundreds of sites, each typically with a pair of PEs connected to a L2 campus infrastructure. Each subnet’s gateway lives as a FHRP VIP between each PE. Routing for each site is distributed over VPNv4. The introduction of EVPN+IRB in our network was intended to support stretching subnets across sites for services that required L2 connectivity, as well as supporting VM mobility for failover events. Our requirement included that a host stretched to any site must always be able to route on the local PE.


The interoperability problem outlined above breaks down specifically due to the lack of host-routes advertised from the JUNOS PE to IOS-XR PE, but in our network we are already running another protocol that we could use to solve this problem: VPNv4. The solution is to make the JUNOS PEs advertise the local EVPN host-routes inside VPNv4. While this does introduce additional overhead on the control-plane, as every EVPN Type-2 MAC/IP advertisement from the JUNOS PE will also have a corresponding VPNv4 advertisement, this expense is trivial for our use cases. By only injecting the EVPN host routes into VPNv4 on the JUNOS PE, we end up running EVPN in the asymmetric model on the JUNOS to IOS-XR path, and in the symmetric model on the IOS-XR to JUNOS path. At this point we’re still bound by the limitations of the asymmetric model on the JUNOS PEs. When we take the same solution one step further by also advertising the host-routes in VPNv4 from the IOS-XR PE, then we finally replicate the symmetric model bidirectionally.

Configuration Steps
The following are the configuration steps utilized to achieve loose EVPN+IRB interoperability on our network. For brevity, this assumes the control-plane is already configured with VPNv4 and EVPN address families enabled, and with relevant VRFs already created.

1.  Create the attachment circuits and set the ethernet circuit parameters. In our design, each PE in a pair connects to the L2 campus but not as a LAG, so we configure the ethernet segment in single-active mode.

IOS-XR:

evpn interface Bundle-Ether1
evpn interface Bundle-Ether1 ethernet-segment
evpn interface Bundle-Ether1 ethernet-segment identifier type 0 00.00.00.00.00.00.00.00.01
evpn interface Bundle-Ether1 ethernet-segment load-balancing-mode single-active

interface Bundle-Ether1.1000 l2transport
interface Bundle-Ether1.1000 l2transport description v1000;VRF-A;172.16.0.0/24
interface Bundle-Ether1.1000 l2transport encapsulation dot1q 1000
interface Bundle-Ether1.1000 l2transport rewrite ingress tag pop 1 symmetric

JUNOS:

set interfaces ae1 flexible-vlan-tagging
set interfaces ae1 encapsulation flexible-ethernet-services
set interfaces ae1 esi 00:00:00:00:00:00:00:00:00:02
set interfaces ae1 esi single-active

set interfaces ae1 unit 1000 description "v1000;VRF-A;172.16.0.0/24"
set interfaces ae1 unit 1000 encapsulation vlan-bridge
set interfaces ae1 unit 1000 vlan-id 1000

2.  Create the IRB interfaces in the respective VRF. To provide gateway ARP consistency as a distributed anycast gateway, the MAC address must be statically assigned and replicated on each PE.   

IOS-XR PE: 

The “host-routing” knob enables the symmetric behavior in the control plane.

interface BVI1000 description VRF-A;172.16.0.0/24
interface BVI1000 host-routing
interface BVI1000 vrf VRF-A
interface BVI1000 ipv4 address 172.16.0.1 255.255.255.0
interface BVI1000 mac-address 0.0.1000

JUNOS: 

set interfaces irb unit 1000 description "VRF-A;172.16.0.0/24"
set interfaces irb unit 1000 family inet address 172.16.0.1/24
set interfaces irb unit 1000 mac 00:00:00:00:10:00

set routing-instance VRF-A interface irb.1000

3.  Create the EVPN instance.   

IOS-XR:

The binding of the attachment circuit(s), and IRB interface is done inside of a L2VPN configuration, which is then associated with the EVPN instance.  

evpn evi 1000
evpn evi 1000 bgp
evpn evi 1000 bgp rd 1000:1000
evpn evi 1000 route-target import 1000:1000
evpn evi 1000 route-target export 1000:1000
evpn evi 1000 description VRF-A;172.16.0.0/24
 
l2vpn bridge group VRF-A
l2vpn bridge group VRF-A bridge-domain 1000 
l2vpn bridge group VRF-A bridge-domain 1000 interface Bundle-Ether1.1000
l2vpn bridge group VRF-A bridge-domain 1000 routed interface BVI1000
l2vpn bridge group VRF-A bridge-domain 1000 evi 1000

JUNOS: 

The EVPN instance is created as a routing-instance of type EVPN.  

By default, JUNOS will suppress both ingress/egress ARPs across the core, and instead proxy the response utilizing information known from the Type-2 MAC/IP advertisements. IOS-XR does not implement this feature, so ARP requests/replies must be allowed across the core. The hidden command ‘no-arp-suppression’ is necessary to disable this behavior on JUNOS.  

JUNOS also implements default-gateway MAC synchronization by default. In our use case with distributed anycast gateways, this feature is not necessary since all gateway MACs are statically set, and should be disabled with the “default-gateway do-not-advertise” knob.  

Finally, JUNOS by default does not insert a control-word in front of the payload for egress traffic, while IOS-XR by default does. A control-word is a 4-byte field beginning with a nibble of zeros that sits between the bottom MPLS label and the payload. Its purpose is to ensure a "dumb" transit device does not mistake an L2 payload with a source MAC starting with a 4 or 6 for an L3 IPv4 or IPv6 payload. This must be consistent across all PEs, otherwise the payload offset on received traffic will be inconsistent.

set routing-instances VRF-A-evpn-1000 protocols evpn interface ae1.1000
set routing-instances VRF-A-evpn-1000 protocols evpn no-arp-suppression
set routing-instances VRF-A-evpn-1000 protocols evpn default-gateway do-not-advertise
set routing-instances VRF-A-evpn-1000 protocols evpn control-word
set routing-instances VRF-A-evpn-1000 instance-type evpn
set routing-instances VRF-A-evpn-1000 vlan-id none
set routing-instances VRF-A-evpn-1000 routing-interface irb.1000
set routing-instances VRF-A-evpn-1000 interface ae1.1000
set routing-instances VRF-A-evpn-1000 route-distinguisher 1000:1000
set routing-instances VRF-A-evpn-1000 vrf-target target:1000:1000

4.  Advertise host routes in VPNv4… i.e., the interoperability workaround.

IOS-XR: 

This command imports the local EVPN IRB adjacencies as host-routes into the VRF table, allowing for advertisement in VPNv4 or other protocols. 

vrf VRF-A address-family ipv4 unicast import from bridge-domain advertise-as-vpn

JUNOS: 

In JUNOS, the local EVPN IRB adjacencies already exist in the VRF table, and advertising them requires nothing other than allowing routes from protocol evpn.

JUNOS will also advertise both Type-2 MAC/IP advertisements and Type-2 MAC-only advertisements. Here we only require the MAC/IP advertisements, so to reduce control-plane overhead, the MAC-only advertisements should be filtered.

set policy-options policy-statement VRF-A-export term EVPN from protocol evpn
set policy-options policy-statement VRF-A-export term EVPN then accept

set policy-options policy-statement rr-bgp-export term EVPN from family evpn
set policy-options policy-statement rr-bgp-export term EVPN from evpn-mac-route mac-only
set policy-options policy-statement rr-bgp-export term EVPN then reject

Future
The solution in place in our network is not optimal with regards to control-plane utilization, as we are required to double the necessary BGP advertisements for any given host in an EVPN. However, it remains more than viable at the scale we plan to deploy EVPN for the foreseeable future. JUNOS has recently implemented support for the symmetric IRB model for EVPN+IRB over VXLAN, so presumably support over MPLS is on the horizon. At that point, transitioning to the native symmetric model in EVPN would be desirable for both overhead and protocol simplicity.

Setup of Ansible on Eve-NG https://www.sms.com/blog/setup-of-ansible-on-eve-ng/ https://www.sms.com/blog/setup-of-ansible-on-eve-ng/#comments Fri, 16 Sep 2022 17:35:22 +0000 http://sms-old.local/?p=4419 By George Djaboury, Senior Network Engineer, SMS 

Introduction

The purpose of this post is to perform a basic setup of Ansible on Eve-NG. If you are a network engineer who has never been exposed to Ansible before, then this is perfect for you. By the end of this tutorial, you will be able to successfully push configurations to network devices via Ansible, and you’ll have the building blocks to begin constructing your own playbooks. Let’s get started!

Prerequisites

  • Eve-NG installed
  • Ubuntu Server VM installed and has IP connectivity to the internet and routers in the lab environment.
  • Router images must support SSH – Cisco images must have "k9" in the image name

Images Used

  • “Linux-ubuntu-18.04-server” – downloaded from the Eve-NG website. They have several Linux images (mostly out of date). Updating images after installation is always recommended. I updated the image to the latest version – Ubuntu Server 22.04 LTS. Eve-NG also supports custom images.
  • Routers1 and 2 (CSRv): csr1000vng-universalk9.16.03
  • Router3: c7200-adventerprisek9-mz.152-4.M7.bin
  • Switch: i86bi-linux-l2-adventerprise-15.1b.bin – any layer 2 image will be sufficient

Topology

The Eve-NG topology is simple, but you don't need much to get started pushing configurations with Ansible.


Cloud

Inserting a “Network” type object is required for internet access. The Linux server just needs to be able to pull updates and download Ansible. I used the “Management(Cloud0)” type in this lab.

Add an object -> Network


Verify Connectivity

Ensure the Linux server can ping all Routers before proceeding

root@ubuntu22-server:~# ping 192.168.1.46
PING 192.168.1.46 (192.168.1.46) 56(84) bytes of data.
64 bytes from 192.168.1.46: icmp_seq=1 ttl=255 time=2.04 ms
64 bytes from 192.168.1.46: icmp_seq=2 ttl=255 time=1.58 ms

root@ubuntu22-server:~# ping 192.168.1.47
PING 192.168.1.47 (192.168.1.47) 56(84) bytes of data.
64 bytes from 192.168.1.47: icmp_seq=1 ttl=255 time=1.97 ms
64 bytes from 192.168.1.47: icmp_seq=2 ttl=255 time=1.66 ms

root@ubuntu22-server:~# ping 192.168.1.48
PING 192.168.1.48 (192.168.1.48) 56(84) bytes of data.
64 bytes from 192.168.1.48: icmp_seq=1 ttl=255 time=1.94 ms
64 bytes from 192.168.1.48: icmp_seq=2 ttl=255 time=1.67 ms

Enable SSH on Routers

A basic configuration is needed to enable SSH. Ansible utilizes SSH by default to connect to all hosts.

Router(config)#username cisco privilege 15 secret cisco
Router(config)#hostname R1
R1(config)#ip domain name test.com
R1(config)#line vty 0 15
R1(config-line)#login local
R1(config-line)#transport input ssh
R1(config-line)#exit
R1(config)#crypto key gen rsa
The name for the keys will be: R1.test.com
Choose the size of the key modulus in the range of 360 to 4096 for your
General Purpose Keys. Choosing a key modulus greater than 512 may take
a few minutes.

How many bits in the modulus [512]: 2048
% Generating 2048 bit RSA keys, keys will be non-exportable...
[OK] (elapsed time was 5 seconds)

R1(config)#ip ssh version 2

Verify crypto keys are generated:

R1#show crypto key mypubkey rsa

Verify SSH from Ubuntu Server to Routers

Ansible requires the crypto keys of the remote devices before it can connect via SSH. Let’s see if we can SSH.

root@ubuntu22-server:~# ssh cisco@192.168.1.47
Unable to negotiate with 192.168.1.47 port 22: no matching key exchange method found. Their offer: diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1

It looks like SSH failed since the images in my lab use older algorithms that are not compatible with Ubuntu Server 22.04 LTS. To get around this, we have to create the user SSH configuration file and insert compatible ciphers/algorithms.

root@ubuntu22-server:~# vi ~/.ssh/config
KexAlgorithms +diffie-hellman-group14-sha1
Ciphers +aes128-cbc
~

Restart SSH and try again.

root@ubuntu22-server:~# sudo service sshd restart
root@ubuntu22-server:~# ssh cisco@192.168.1.47
Unable to negotiate with 192.168.1.47 port 22: no matching host key type found. Their offer: ssh-rsa

SSH still failed. I’ll need to add the ssh-rsa key type in the config file based on what the router is offering

root@ubuntu22-server:# vi ~/.ssh/config
KexAlgorithms +diffie-hellman-group14-sha1
Ciphers +aes128-cbc
PubkeyAcceptedAlgorithms +ssh-rsa
HostKeyAlgorithms +ssh-rsa

Save the file, restart SSH, and give it another try.

root@ubuntu22-server:~# sudo service sshd restart
root@ubuntu22-server:~#
root@ubuntu22-server:~# ssh cisco@192.168.1.46
The authenticity of host '192.168.1.46 (192.168.1.46)' can't be established.
RSA key fingerprint is SHA256:IaXpnjoCJTZv9RKNTCEzHCajJB+ppaLc3VVgzb8ATqs.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '192.168.1.46' (RSA) to the list of known hosts.
(cisco@192.168.1.46) Password:

R1#

It worked! Now SSH to your remaining Routers.
If you need to remove a learned key in Ubuntu for some reason (e.g. SSH key changed), then type:

ssh-keygen -f "/root/.ssh/known_hosts" -R "192.168.1.46"

We are now ready to setup the server with Ansible!

Installation and Setup of Ansible on Ubuntu

The first step is to update the server to the latest version.

sudo apt update
sudo apt upgrade

To install Ansible, it’s best to follow instructions on Ansible Documentation website.

sudo apt install software-properties-common
sudo add-apt-repository --yes --update ppa:ansible/ansible
sudo apt install ansible

Verify Ansible installation by checking the version:

root@ubuntu22-server:/etc/ansible# ansible --version
ansible [core 2.12.6]
config file = /etc/ansible/ansible.cfg
configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3/dist-packages/ansible
ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.10.4 (main, Apr 2 2022, 09:04:19) [GCC 11.2.0]
jinja version = 3.0.3
libyaml = True

Additionally, we need to install LibSSH, which is needed to connect to network devices and is replacing Paramiko.

sudo apt install python3-pip
pip install --user ansible-pylibssh

Creating the inventory file

The /etc/ansible directory should have been created during installation; if not, create it. Navigate to the ansible directory and view its contents. The 'hosts' file is the default Ansible inventory file.

root@ubuntu22-server:~# cd /etc/ansible/
root@ubuntu22-server:/etc/ansible# ls
ansible.cfg hosts

Now let’s add our routers in the hosts file:

root@ubuntu22-server:/etc/ansible# sudo vi hosts

[iosrouters]
r1 ansible_host=192.168.1.46 
r2 ansible_host=192.168.1.47 
r3 ansible_host=192.168.1.48

[iosrouters:vars]
ansible_network_os=cisco.ios.ios
ansible_connection=ansible.netcommon.network_cli
ansible_user=cisco
ansible_password=cisco

It's a good idea to use groups by enclosing the group name in brackets []. The first group comprises all of our lab IOS routers. The second group defines the variables for the iosrouters group. For a lab environment, it's OK to use plain-text credentials here.

After saving the inventory file, you can verify the variables have been applied to each host:

root@ubuntu22-server:/etc/ansible# ansible-inventory --list -y
all:
  children:
    iosrouters:
      hosts:
        r1:
          ansible_connection: ansible.netcommon.network_cli
          ansible_host: 192.168.1.46
          ansible_network_os: cisco.ios.ios
          ansible_password: cisco
          ansible_user: cisco
        r2:
          ansible_connection: ansible.netcommon.network_cli
          ansible_host: 192.168.1.47
          ansible_network_os: cisco.ios.ios
          ansible_password: cisco
          ansible_user: cisco
        r3:
          ansible_connection: ansible.netcommon.network_cli
          ansible_host: 192.168.1.48
          ansible_network_os: cisco.ios.ios
          ansible_password: cisco
          ansible_user: cisco
    ungrouped: {}

Creating a Playbook

All playbooks must be written in YAML format with a .yml or .yaml file extension.
Create a "playbooks" folder and a new playbook file. This playbook uses the "ios_config," "ios_static_route," and "ios_banner" modules and defines four different tasks.

root@ubuntu22-server:/etc/ansible# mkdir playbooks/
root@ubuntu22-server:/etc/ansible# vi playbooks/initial-config.yaml
---
- name: initial configuration
  hosts: iosrouters
  gather_facts: false
  connection: local
  tasks:
    - name: enable ipv6 globally
      ios_config:
        lines:
          - ipv6 unicast-routing

    - name: configure ospf
      ios_config:
        parents: router ospf 1
        lines: network 192.168.1.0 0.0.0.255 area 0

    - name: configure static
      ios_static_route:
        prefix: 0.0.0.0
        mask: 0.0.0.0
        next_hop: 192.168.1.1

    - name: configure banner
      ios_banner:
        banner: login
        text: |
          this is my awesome
          login banner
        state: present

Run Ansible playbook

root@ubuntu22-server:/etc/ansible# ansible-playbook -i hosts playbooks/initial-config.yaml 

PLAY [initial configuration] ************************************************************************************************

TASK [enable ipv6 globally] *************************************************************************************************
[WARNING]: To ensure idempotency and correct diff the input configuration lines should be similar to how they appear if
present in the running configuration on device
changed: [r2]
changed: [r1]
changed: [r3]

TASK [configure ospf] *******************************************************************************************************
changed: [r1]
changed: [r2]
changed: [r3]

TASK [configure static] *****************************************************************************************************
changed: [r1]
changed: [r2]
changed: [r3]

TASK [configure banner] *****************************************************************************************************
changed: [r2]
changed: [r1]
changed: [r3]
PLAY RECAP ******************************************************************************************************************
r1                         : ok=4    changed=4    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
r2                         : ok=4    changed=4    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
r3                         : ok=4    changed=4    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

For each task, it displays the status of the task for each device; the state of the task displays “changed.”
Now run the same playbook, but with the --check option. This option simulates what would happen if the playbook were actually pushed, but without making changes.

root@ubuntu22-server:/etc/ansible# ansible-playbook -i hosts playbooks/initial-config.yaml --check

PLAY [initial configuration] ************************************************************************************************

TASK [enable ipv6 globally] *************************************************************************************************
ok: [r2]
ok: [r1]
ok: [r3]

TASK [configure ospf] *******************************************************************************************************
ok: [r2]
ok: [r1]
ok: [r3]

TASK [configure static] *****************************************************************************************************
ok: [r2]
ok: [r1]
ok: [r3]

TASK [configure banner] *****************************************************************************************************
ok: [r2]
ok: [r1]
ok: [r3]

PLAY RECAP ******************************************************************************************************************
r1                         : ok=4    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
r2                         : ok=4    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
r3                         : ok=4    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0  

Notice the states of the tasks in the left column, which now show "ok" instead of "changed." Ansible will not execute a task if the desired state is already achieved.

Conclusion

Setup is now complete! While that completes this initial task, I encourage you to keep learning; search for different Ansible modules and use them in different playbooks. Ansible can be a powerful tool in your automation toolbox!
For network engineers just getting started with Ansible, I highly recommend going through the Network Automation resource on the Ansible website.

Deploying DokuWiki on Amazon Elastic Container Service (ECS) – Part 2 of 2 https://www.sms.com/blog/deploying-dokuwiki-on-amazon-elastic-container-service-ecs-part-2/ https://www.sms.com/blog/deploying-dokuwiki-on-amazon-elastic-container-service-ecs-part-2/#respond Mon, 15 Aug 2022 11:34:57 +0000 http://sms-old.local/?p=4299 By Rob Stewart, Cloud Architect, SMS  

Improving the DokuWiki Deployment

In Part 1 of this series, we documented a very basic “click-ops” deployment of an ECS Task running the Bitnami DokuWiki container. Please go back and read that post if you haven’t already before you read this one.

In this post, we are going to address some of the deficiencies in the original deployment:

  • Improving the fault tolerance of the DokuWiki deployment
  • Moving on from a “Click-Ops” deployment via the AWS console to an Infrastructure as Code (IaC) deployment using Terraform

Improving the Fault Tolerance of our Elastic Container Service (ECS) DokuWiki Deployment

In Part 1 of this series, we performed a manual deployment of the following resources using the AWS Console:

  • An ECS Task Definition which referenced the Bitnami DokuWiki container from Docker Hub that we wanted to run on AWS.
  • An ECS Cluster which is used by AWS to logically separate sets of ECS Tasks and ECS Services.
  • An ECS Task which is a running instance of the ECS Task Definition we created.
  • A Security Group which controlled the network traffic going to the ECS Task running the DokuWiki container.

Exhibit 1: The Original Deployment of DokuWiki


After we finished deploying all these resources, we found that the deployment was not very robust. If our ECS task crashed then our application would stop working and any data we added to DokuWiki would be lost. We also noted that we had to connect to our application using a nonstandard TCP port.

This time around, we are going to enhance the deployment by introducing the following changes (a rough Terraform sketch of how these pieces fit together follows the list):

  • An ECS Service which will restart the ECS Task if it fails
  • An Elastic Filesystem (EFS) to store DokuWiki data so that it no longer resides on the running ECS Task and is thus preserved if the task should fail
  • An Application Load Balancer (ALB) to give us a consistent URL for our application and route traffic dynamically to the ECS Tasks created by the ECS Service
  • Security Groups for our ALB and EFS to control network traffic
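
To make the pieces above more concrete, here is a rough, illustrative Terraform sketch of how an EFS-backed task definition and an ECS service fronted by an ALB target group might be wired together. This is not the repository code we will review later in this post; the referenced ECS cluster, EFS file system, target group, security group, and subnet variable are assumptions that would be defined elsewhere, and IAM roles and logging are omitted for brevity.

# Illustrative sketch only - the cluster, EFS file system, ALB target group,
# security group, and subnets are assumed to be defined elsewhere.
resource "aws_ecs_task_definition" "dokuwiki" {
  family                   = "dokuwiki"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "1024"

  container_definitions = jsonencode([{
    name         = "dokuwiki"
    image        = "bitnami/dokuwiki:latest"
    essential    = true
    portMappings = [{ containerPort = 8080, protocol = "tcp" }]
    # Persist DokuWiki data on EFS so it survives task replacement
    mountPoints  = [{ sourceVolume = "dokuwiki-data", containerPath = "/bitnami/dokuwiki" }]
  }])

  volume {
    name = "dokuwiki-data"
    efs_volume_configuration {
      file_system_id = aws_efs_file_system.dokuwiki.id
    }
  }
}

# The ECS Service keeps the desired number of tasks running and registers
# each task with the ALB target group.
resource "aws_ecs_service" "dokuwiki" {
  name            = "dokuwiki"
  cluster         = aws_ecs_cluster.dokuwiki.id
  task_definition = aws_ecs_task_definition.dokuwiki.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [aws_security_group.ecs_task.id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.dokuwiki.arn
    container_name   = "dokuwiki"
    container_port   = 8080
  }
}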

Exhibit 2: An Enhanced Deployment of DokuWiki


It would take a long time to run through all the configuration required to create all of these services and the connections between them using the AWS console, and there is a good chance that we might miss a step along the way. There is a better way to complete this deployment.

From “Click-Ops” to Infrastructure as Code Using HashiCorp Terraform

One of the major shortcomings of the original deployment was that all the resources were created via the AWS console. When you are first learning how to use AWS, creating resources via the console can be helpful as it will enable you to gain an understanding of how AWS services work together. However, there are several shortcomings of this approach when it is time to make the transition to deploying production workloads on AWS.

  1. Unless somebody is watching you click through the console, or they are very good at picking through AWS CloudTrail logs after the fact then they will not get a full understanding of the steps you followed to complete a deployment.
  2. If you wanted to have a shot at a repeatable deployment then you would have to write up a long run book detailing each step of the process. Even if you include every step, and the person following your directions is very conscientious, there is a very good chance that they would miss a step during the deployment which will result in small inconsistencies cropping up over time. Further, eventually the document will be out of date as the AWS console changes and evolves.
  3. In most cases, you will only discover issues or security vulnerabilities introduced by a manual deployment after the deployment is already done. After a security vulnerability is discovered, you will have to revise the run book and then go through each manual deployment and attempt to make corrections which can lead to other complications.

In sum, manual deployments are not scalable. Fortunately, we have tools like Terraform and AWS CloudFormation which enable us to define our infrastructure, the resources we will deploy in the cloud, as code. There are several benefits to defining our infrastructure as code.

  1. We can be precise in defining exactly what resources we need and how each resource will be configured.
  2. We can store the code in a Version Control System (VCS) like GitLab or GitHub.
  3. We can introduce code reviews into the process where other engineers can review our code and provide input prior to deployment.
  4. We can also employ tools to scan our code and identify any deviations from best practices and established standards and detect potential security vulnerabilities so that these issues can be addressed before we deploy infrastructure.
  5. We can repeat the deployment process exactly as the level of human involvement in each deployment is dramatically reduced.
  6. We reduce or eliminate the need to write lengthy run books describing deployments as the code is self-documenting. If you want to know what is deployed then all you need to do is review the code.
  7. When we discover an issue with a deployment, all we need to do is update the code and deploy it again. If we deploy our updates via the code, then we can be much more confident that the changes will be applied consistently.

Taken together, these benefits combine to make a compelling case for leveraging IaC tools to express our infrastructure as code.

As one of the best tools for Infrastructure as Code, HashiCorp Terraform has several unique benefits:

  1. HashiCorp Configuration Language (HCL), the language used to write Terraform code, is easy to write and easy to read due to its declarative nature and the sheer volume of helpful examples available online due to broad industry adoption. While this simplicity results in a gentle learning curve when first learning Terraform, the language has evolved to handle increasingly complex scenarios without forsaking clarity.
  2. Terraform’s core workflow loop of generating a plan describing what changes will be made, applying the changes in accordance with that plan after review, and then optionally rolling back all changes via a destroy operation if they are no longer needed is easy for engineers to understand and use. This simplicity enables rapid iteration when writing code to deploy infrastructure.

Exhibit 3: The Terraform Workflow

  3. Terraform records the infrastructure that it deploys in a state file, which makes it easy to see exactly what has been deployed and simple to completely remove that infrastructure when it is no longer needed.
  4. Terraform supports packaging code into reusable modules. In the HashiCorp Terraform documentation, modules are described as follows:
    • A module is a container for multiple resources that are used together. Modules can be used to create lightweight abstractions, so that you can describe your infrastructure in terms of its architecture, rather than directly in terms of physical objects.
    • The files in the root of a terraform project are really just a module as far as Terraform is concerned. The root module can reference other modules using module blocks. You will see this in action below.
    • HashiCorp encourages developers to create and use modules by providing a searchable module registry which now contains hundreds of robust modules contributed by the community.
  5. Terraform is designed to support a diverse ecosystem of platforms and technologies via plug-ins called providers. Providers are responsible for managing the communication between Terraform and other cloud services and technologies. One benefit of this approach is that the core Terraform functionality and the functionality made available via a given provider can evolve independently. For example, when a cloud provider makes a new service available then that service can be added to an updated version of the Terraform provider, and Terraform will automatically support it. A short, generic example of a provider block and a module block follows this list.

Exhibit 4: Terraform Providers

  6. Most importantly, Terraform is fast. You can deploy and then destroy a few resources or a complex environment made up of hundreds of resources using Terraform in a matter of minutes.
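
To illustrate the provider and module concepts above, here is a short, generic sketch. The region, module inputs, and names are hypothetical placeholders rather than the actual repository code, which we will review below.

# A provider block configures how Terraform communicates with a platform (here, AWS).
provider "aws" {
  region = "us-east-1"
}

# A module block pulls a reusable set of resources into the root module.
# The input values shown here are hypothetical placeholders.
module "dokuwiki" {
  source = "./modules/dokuwiki"

  name        = "dokuwiki-demo"
  environment = "dev"
}

When a root module like this is applied, Terraform downloads the AWS provider, reads the module from its source path, and builds a single plan covering every resource the module defines.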

Deploying DokuWiki on ECS Using Terraform

If you have never used Terraform before you will need to get your computer set up first. After you finish getting your computer set up, you will need to download the Terraform code using Git, review the code, deploy the DokuWiki resources in AWS by running Terraform commands, validate that it worked by logging into the AWS Console, and then destroy the infrastructure created by the code.

Getting Setup

In order to follow along with the steps in this post you will first need to install Git, the Terraform command line interface (CLI), and the AWS CLI. You will need to have access to an AWS account with IAM Administrative permissions, and you will need to setup programmatic access to AWS with the AWS CLI.

Install Git

The Terraform code associated with this post has been uploaded to a GitLab repository. In order to download this code and follow along, you will need to install Git. Follow the installation directions on the Git website to get started. If you haven’t ever used Git before, the git-scm.com site has a lot of great documentation to get you going in the right direction, including the following sections of the Pro Git book:

Install the Terraform Command Line Interface (CLI)

In order to follow along with the steps in the post, you will need to install Terraform. The following tutorial on the HashiCorp Learn site takes you step by step through the installation process.

Install the AWS CLI

In order to create infrastructure on AWS using Terraform, you will also need to install the AWS CLI. This page in the AWS documentation takes you step by step through the installation process.

Setup An AWS Account

You will also need access to an AWS account with IAM Administrative permissions. If you were following along with the first post in this series then you already created an AWS account.

NOTE: If you follow along with the steps in this post, there is some chance you may incur minimal charges. However, if you create a new account you will be under the AWS Free Tier. That said, it is always prudent to remove any resources you create in AWS right after you finish using them so that you limit the risk of unexpected charges. Guidance on how to remove the resources created by the Terraform code after we finish with them is provided below.

Set Up Programmatic Access to AWS for Terraform Using the AWS CLI

Before you can deploy infrastructure to your AWS account using Terraform, you will need to generate an AWS IAM Access Key and Secret Key (a key pair) using the AWS Console. After you generate a key pair, you will need to configure the AWS CLI to use it by running the aws configure command. The following page in the AWS documentation provides step-by-step directions on how to generate an Access Key and Secret Key for your AWS IAM user in the AWS Console and then configure the AWS CLI to use those credentials.

NOTE: When you run the aws configure command, you will be prompted to select a region.  Make sure you specify one.

NOTE: When you generate an Access Key and Secret Key for an IAM user, that key pair grants the same access to your AWS account that your AWS login and password have. If the IAM account has permissions to create resources, then anybody who possesses the Access Key and Secret Key can create resources. Therefore, you should never share these keys and should treat them with the same care as your login and password.

Download the ECS DokuWiki Terraform Code from GitLab

Before we can start deploying infrastructure using the Terraform CLI, we need to download the code. Enter the following command to clone the project from GitLab:

git clone https://gitlab.com/sms-pub/terraform-aws-ecs-dokuwiki-demo.git

Next change into the directory containing the code you just downloaded.

cd terraform-aws-ecs-dokuwiki-demo

Review the Terraform Code

If you take a look at the Terraform code you just downloaded, you will see several files and folders.

Exhibit 5: The Terraform ECS Code.
Files in sub-directories are not represented for the sake of brevity.

AWSExhibit5 1


The following table summarizes the files and folders in the repository. The Terraform files are described in further detail below.

Name | Object Type | Contents
/terraform.tf | Terraform file | terraform block
/provider.tf | Terraform file | Provider block
/variables.tf | Terraform file | Variable blocks
/main.tf | Terraform file | Module block for the dokuwiki module
/outputs.tf | Terraform file | Output blocks for output values
/README.md | Markdown document | Overview documentation for the GitLab project. This file is displayed when you view the project on GitLab.com.
/modules | Directory | Contains the dokuwiki module that is referenced by the module block in main.tf
/modules/dokuwiki | Directory | Contains all the files that make up the DokuWiki Terraform module. Most of the Terraform files in this folder contain module blocks referencing the modules in its modules subfolder.
/modules/dokuwiki/modules | Directory | Contains all the modules used by the DokuWiki module
/modules/dokuwiki/modules/application-load-balancer | Directory | Terraform module which creates Application Load Balancers (ALBs)
/modules/dokuwiki/modules/ecs-cluster | Directory | Terraform module which creates ECS Clusters
/modules/dokuwiki/modules/ecs-service | Directory | Terraform module which creates ECS Services
/modules/dokuwiki/modules/ecs-task-definition | Directory | Terraform module which creates ECS Task Definitions
/modules/dokuwiki/modules/efs | Directory | Terraform module which creates the EFS storage
/modules/dokuwiki/modules/security-group | Directory | Terraform module which creates Security Groups

NOTE: The intent behind modules is to create code that can be used in multiple projects. Therefore, it is not generally considered a best practice to put terraform modules in subfolders within the project that is referencing those modules. Instead, it is more common to reference the project repository and tag for the module you are using in your module block. The modules tutorial on the HashiCorp Learn site goes into more detail on this.
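
For example, a module block that references a module stored in its own Git repository, pinned to a tag, might look like this (the repository URL and tag below are placeholders, not this project's actual repository):

module "dokuwiki" {
  # Pull the module from a Git repository at a specific tag
  source = "git::https://gitlab.com/example-group/terraform-aws-dokuwiki.git?ref=v1.0.0"

  name_prefix = "dokuwiki"
}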

Terraform.tf

The terraform.tf file contains a terraform block which defines settings for the project including required terraform CLI and AWS provider versions:

# Terraform block which specifies version requirements

terraform {
  # Specify required providers and versions
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.21.0"
    }
  }

  # Specify the required version for Terraform itself
  required_version = ">= 1.2.4"
}

Provider.tf

The provider.tf file contains a provider block which defines settings for the AWS Terraform provider including the AWS region where the resources will be created and default tags that will be applied to all of the resources created using this provider:

# Terraform AWS Provider block
# variables come from variables.tf file 

provider "aws" {     
  # Set default AWS Region
  region = var.region 

 # Define default tags
  default_tags {
    tags = merge(var.default_tags, )
  }
}

Variables.tf

The variables.tf file contains variable blocks which set the AWS region, the resource name prefix that is prepended to the names of all the resources created by Terraform, and the default tags that are applied to all resources created by this project:

# AWS Region where resources will be deployed
variable "region" {
  type        = string
  description = "AWS Region where resources will be deployed"
  default     = "us-east-1"
}

# Names for all resources created by this project will have this prefix applied
variable "name_prefix" {
  type        = string
  description = "Prefix all resource names"
  default     = "dokuwiki"
}

# All resources will have these tags applied
variable "default_tags" {
  description = "A map of tags to add to all resources"
  type        = map(string)
  default = {
    tf-owned = "true"
    repo     = "https://TODOUPDATEURL"
    branch   = "main"
  }
}
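
These defaults can be overridden without editing the code. For example, you could create a terraform.tfvars file in the project root (or pass -var on the command line) to deploy to a different region; the values below are purely illustrative:

# terraform.tfvars (example values)
region      = "us-west-2"
name_prefix = "dokuwiki-dev"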

Main.tf

The main.tf file contains the module block for the dokuwiki module:

# This module block creates all of the AWS resources for Dokuwiki
module "dokuwiki" {

  # Specify the path to the dokuwiki module in this project 
  source = "./modules/dokuwiki"

  # these variables come from variables.tf
  region       = var.region
  default_tags = var.default_tags
  name_prefix  = var.name_prefix
}

If you are only looking at files in the main root directory of the project, it might seem as though a lot of detail is missing, and it is. In order to simplify the code, we have intentionally placed all of the resources that are created by the deployment in the dokuwiki module which references other modules to actually create resources in AWS.

Exhibit 6: The Relationships Between the Different Modules in the DokuWiki Project

AWSExhibit6 1

Outputs.tf

The outputs.tf file contains output blocks which expose values from modules and resources in the command output when we run Terraform commands.

# Name of the ECS Cluster
output "ecs_cluster_name" {
  description = "The name of the ECS cluster"
  value       = module.dokuwiki.ecs_cluster_name
}

# DNS Name for the Application Load Balancer 

output "alb_dns_name" {
  description = "The DNS name of the load balancer."
  value       = module.dokuwiki.alb_dns_name
}

Both of these output blocks are pulling data from the Dokuwiki module.
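
For the root module to be able to read module.dokuwiki.ecs_cluster_name and module.dokuwiki.alb_dns_name, the dokuwiki module must declare outputs with those names and pass the values up from its own child modules. The module's code is not shown here, but the passthrough looks roughly like this (the inner output name is an assumption):

# modules/dokuwiki/outputs.tf (sketch)
output "alb_dns_name" {
  description = "The DNS name of the load balancer."
  value       = module.alb.lb_dns_name # whichever DNS name output the ALB child module exposes
}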

Deploy DokuWiki Resources Using Terraform

Now that we have taken an initial look at the code, it is time to start running Terraform commands to turn this code into resources in AWS. We’ll go through the following steps to execute this deployment using Terraform:

  1. Initialize the project and download required providers
  2. Generate and review the list of changes that terraform will perform
  3. Deploy changes to our infrastructure
  4. Test the deployment
  5. Roll back the changes made by Terraform

These steps are described in more detail in the following sections.

Step 1: Initialize the Project – terraform init

Before we can start deploying resources with Terraform, we need to instruct it to download any modules and provider plug-ins that we are using in our code.

Make sure you change into the project folder and then run the following command to initialize the project:

terraform init

When you run this command, terraform will do the following:

  1. Evaluate the code in the current folder (the root module).
  2. Download any modules referenced in the current folder that are not available locally and put them into a hidden subfolder in our project folder called .terraform.
  3. Evaluate all the code blocks in the current folder and all module folders to determine which provider plug-ins are needed.
  4. Download the provider plug-ins and put them in the .terraform folder.

Essentially, it does all the preparation work required to enable us to proceed to the next step.
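
If you later change the version constraints in terraform.tf, you can re-run the command with the -upgrade flag so Terraform selects the newest provider and module versions that still satisfy those constraints:

terraform init -upgrade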

Step 2: Generate and Review a List of Changes that Terraform will Perform – terraform plan

After we initialize our project using the terraform init command, the next step is to instruct Terraform to generate a list of the changes it will make to our infrastructure; this list of changes is called a plan.

When Terraform generates a plan, it will do the following:

  1. Evaluate all of the code blocks in the current folder and the code blocks in all of the modules that are referenced in the current folder
  2. Determine which resources will be created
  3. Generate a dependency map which determines the order in which those resources will be created
  4. Print out a detailed output listing exactly what terraform will do if we choose to make changes to our infrastructure

NOTE: Running the command terraform plan is a safe operation. Terraform will not make any changes to your infrastructure when you run a plan. It will only tell you what changes will be made.

Run the following command to instruct terraform to generate a plan:

terraform plan

Let’s go through the output of the plan to see what resources Terraform will create.

Note: The attributes of each resource were removed from the plan for the sake of brevity.

module.dokuwiki.data.aws_vpc.vpc: Reading...
module.dokuwiki.data.aws_vpc.vpc: Read complete after 1s [id=vpc-02bc8afe3a47e8497]
module.dokuwiki.data.aws_subnets.subnets: Reading...
module.dokuwiki.data.aws_subnets.subnets: Read complete after 0s [id=us-east-1]

The first statements we find in the plan output are a collection of data outputs from data blocks in the DokuWiki module. In Terraform, data blocks represent read-only queries that a provider runs to fetch information about existing resources. If we want to learn more about these, we need to check the documentation for the HashiCorp Terraform AWS Provider in the Terraform Registry.

  • The data.aws_vpc data block triggers a call to the AWS API to fetch the properties of a VPC. In this case, the intent of the data call is to fetch the ID of the default VPC in the AWS Region. When a new AWS account is created, AWS places a VPC in the account by default so workloads can be created without first having to create a VPC.
  • The data.aws_subnets data block triggers a call to the AWS API to fetch the subnets in a VPC. In this case, the intent of the data call is to get the attributes of all the subnets in the default VPC. It is necessary to specify which subnets will be used when creating resources like application load balancers and ECS services.
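
The data blocks themselves live inside the DokuWiki module and are not shown in the plan output, but they likely look something like this (a sketch, not the module’s exact code):

# Look up the default VPC in the current region
data "aws_vpc" "vpc" {
  default = true
}

# Look up all subnets that belong to that VPC
data "aws_subnets" "subnets" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.vpc.id]
  }
}
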
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
+ create

Terraform marks resources in the plan with a + symbol to indicate that they will be created by Terraform when we eventually create the infrastructure. We’ll see other symbols when we roll back the changes we make.

Terraform will perform the following actions:
  # module.dokuwiki.aws_cloudwatch_log_group.ecs-log-group will be created
  + resource "aws_cloudwatch_log_group" "ecs-log-group" {
  . . .
}

The aws_cloudwatch_log_group resource block is the first resource in the plan. AWS CloudWatch log groups aggregate logging data from AWS resources. In this case, the log group will capture logging data from our ECS Tasks.

# module.dokuwiki.aws_iam_role.ecs_task_role will be created
+ resource "aws_iam_role" "ecs_task_role" {
. . .
}
# module.dokuwiki.aws_iam_role_policy_attachment.ecs-task-role-policy-attach will be created
+ resource "aws_iam_role_policy_attachment" "ecs-task-role-policy-attach" {
. . .
}
  1. The aws_iam_role resource block creates an IAM role for our ECS Task. Whenever you deploy a resource in AWS that needs to interact with the AWS API, you have to assign it an IAM role with the corresponding IAM permissions.
  2. The aws_iam_role_policy_attachment resource block attaches an AWS IAM policy to the ECS Task IAM role. In this case, we are attaching minimal permissions to permit the task to interact with other AWS services. You can read more about the AmazonECSTaskExecutionRolePolicy AWS policy in the AWS Documentation.
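
For reference, a task execution role like this is typically defined with a trust policy that lets ECS tasks assume it, plus the AWS-managed policy attachment. The module’s exact code is not shown here; the following is only a sketch of the usual pattern:

resource "aws_iam_role" "ecs_task_role" {
  name = "dokuwiki-ecstaskrole"

  # Allow ECS tasks to assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "ecs-task-role-policy-attach" {
  role       = aws_iam_role.ecs_task_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
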
# module.dokuwiki.random_id.index will be created
+ resource "random_id" "index" {
. . .
}

The random_id resource block generates a random number that is used to pick subnets for our resources; we don’t care which subnets the resources are placed in, but Terraform still needs a concrete way to choose them.

# module.dokuwiki.module.alb.aws_lb.this[0] will be created
+ resource "aws_lb" "this" {
. . .
}

# module.dokuwiki.module.alb.aws_lb_listener.frontend_http_tcp[0] will be created
+ resource "aws_lb_listener" "frontend_http_tcp" {
. . .
}

# module.dokuwiki.module.alb.aws_lb_target_group.main[0] will be created
+ resource "aws_lb_target_group" "main" {
. . .
}
  1. The aws_lb resource block creates an Application Load Balancer which will send HTTP traffic from users of our application to the ECS Tasks running DokuWiki.
  2. The aws_lb_listener resource block creates an Application Load Balancer Listener which listens for traffic coming to the load balancer and sends it to Target Groups.
  3. The aws_lb_target_group resource block creates an Application Load Balancer Target Group. When the ECS Service creates a new ECS Task, it will register it to the Target Group so that HTTP traffic coming to the Application Load Balancer can be sent to the ECS Task.
# module.dokuwiki.module.ecs-cluster.aws_ecs_cluster.this[0] will be created
+ resource "aws_ecs_cluster" "this" {
. . .
}

# module.dokuwiki.module.ecs-service.aws_ecs_service.this will be created
+ resource "aws_ecs_service" "this" {
. . .
}

# module.dokuwiki.module.ecs-task-def-dokuwiki.aws_ecs_task_definition.ecs_task_definition[0] will be created
+ resource "aws_ecs_task_definition" "ecs_task_definition" {
. . .
}
  1. The aws_ecs_cluster resource block creates an ECS Cluster.
  2. The aws_ecs_service resource block creates an ECS Service.
  3. The aws_ecs_task_definition resource block creates an ECS Task Definition.
    Note: For more information on the relationships between the different ECS resources please refer back to Part 1 of this series.
# module.dokuwiki.module.efs.aws_efs_access_point.default["Doku"] will be created
+ resource "aws_efs_access_point" "default" {
. . .
}

# module.dokuwiki.module.efs.aws_efs_backup_policy.policy[0] will be created
+ resource "aws_efs_backup_policy" "policy" {
. . .
}

# module.dokuwiki.module.efs.aws_efs_file_system.default[0] will be created
+ resource "aws_efs_file_system" "default" {
. . .
}

# module.dokuwiki.module.efs.aws_efs_mount_target.default[0] will be created
+ resource "aws_efs_mount_target" "default" {
. . .
}
  1. The aws_efs_access_point resource block creates an EFS Access Point which exposes a path on the EFS storage volume as the root directory of the filesystem mapped to the ECS Task.
  2. The aws_efs_backup_policy resource block creates an EFS backup policy for an EFS storage volume.
  3. The aws_efs_file_system resource block creates an EFS storage volume which our ECS Tasks use to store Dokuwiki data.
  4. The aws_efs_mount_target resource block creates an EFS Mount Target. In order to access an EFS filesystem an EFS mount target must be created in the VPC.
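
The EFS pieces fit together roughly as shown below; this is an illustrative sketch, and the access point path is an assumption rather than the module’s actual configuration:

resource "aws_efs_file_system" "default" {
  encrypted = true
}

# Enable automatic backups for the file system
resource "aws_efs_backup_policy" "policy" {
  file_system_id = aws_efs_file_system.default.id
  backup_policy {
    status = "ENABLED"
  }
}

# Expose a directory on the file system as the root seen by the ECS Task
resource "aws_efs_access_point" "default" {
  file_system_id = aws_efs_file_system.default.id
  root_directory {
    path = "/dokuwiki" # hypothetical path
  }
}
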
# module.dokuwiki.module.sg_ecs_task.aws_security_group.this_name_prefix[0] will be created
+ resource "aws_security_group" "this_name_prefix" {
. . .
}

# module.dokuwiki.module.sg_ecs_task.aws_security_group_rule.computed_egress_rules[0] will be created
+ resource "aws_security_group_rule" "computed_egress_rules" {
. . .
}

# module.dokuwiki.module.sg_ecs_task.aws_security_group_rule.computed_ingress_with_source_security_group_id[0] will be created
+ resource "aws_security_group_rule" "computed_ingress_with_source_security_group_id" {
. . .
}

# module.dokuwiki.module.sg_efs.aws_security_group.this_name_prefix[0] will be created
+ resource "aws_security_group" "this_name_prefix" {
. . .
}

# module.dokuwiki.module.sg_efs.aws_security_group_rule.computed_egress_rules[0] will be created
+ resource "aws_security_group_rule" "computed_egress_rules" {
. . .
}

# module.dokuwiki.module.sg_efs.aws_security_group_rule.computed_ingress_with_source_security_group_id[0] will be created
+ resource "aws_security_group_rule"&nbsp; "computed_ingress_with_source_security_group_id" {
. . .
}

# module.dokuwiki.module.sg_alb.module.sg.aws_security_group.this_name_prefix[0] will be created
+ resource "aws_security_group" "this_name_prefix" {
. . .
}

# module.dokuwiki.module.sg_alb.module.sg.aws_security_group_rule.egress_rules[0] will be created
+ resource "aws_security_group_rule" "egress_rules" {
. . .
}

# module.dokuwiki.module.sg_alb.module.sg.aws_security_group_rule.ingress_rules[0] will be created
+ resource "aws_security_group_rule" "ingress_rules" {
. . .
}

# module.dokuwiki.module.sg_alb.module.sg.aws_security_group_rule.ingress_rules[1] will be created
+ resource "aws_security_group_rule" "ingress_rules" {
. . .
}

# module.dokuwiki.module.sg_alb.module.sg.aws_security_group_rule.ingress_with_self[0] will be created
+ resource "aws_security_group_rule" "ingress_with_self" {
. . .
}
  1. The aws_security_group resource blocks create Security Groups in a VPC, which are essentially firewalls for the resources associated with them.
  2. The aws_security_group_rule resource blocks create Security Group Rules for the security groups. Security Group Rules define what incoming and outgoing network traffic is permitted to and from the resources that the security groups are associated with.

For the ECS deployment, we have three security groups.

  1. dokuwiki.module.sg_efs.aws_security_group – The Security Group assigned to the EFS storage volume which only permits traffic coming from resources that have been assigned to the ECS Tasks Security Group.
  2. dokuwiki.module.sg_ecs_task.aws_security_group – The Security Group assigned to the ECS Tasks which only permits traffic coming from the resources that have been assigned to the ALB Security Group.
  3. dokuwiki.module.sg_alb.module.sg.aws_security_group – The Security Group assigned to the ALB which allows incoming HTTP traffic from any address.
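
To make this chaining concrete, a rule that only admits traffic from another security group is expressed by referencing that group as the source. A simplified sketch (not the module’s exact code, using the default VPC):

resource "aws_security_group" "alb" {
  name_prefix = "dokuwiki-alb-"
  description = "ALB security group (sketch)"
}

resource "aws_security_group" "ecs_task" {
  name_prefix = "dokuwiki-task-"
  description = "ECS task security group (sketch)"
}

# Only members of the ALB security group may reach the ECS Tasks on port 8080
resource "aws_security_group_rule" "task_ingress_from_alb" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ecs_task.id
  source_security_group_id = aws_security_group.alb.id
}
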
Plan: 25 to add, 0 to change, 0 to destroy.
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Changes to Outputs:
+ alb_dns_name = (known after apply)
+ ecs_cluster_name = "dokuwiki-cluster"

Note: You didn’t use the -out option to save this plan, so Terraform can’t guarantee to take exactly these actions if you run terraform apply now.

When we choose to move on to the next step, we will be deploying 25 distinct resources to AWS! We also see that the plan shows two changes to Outputs.

  1. The ecs_cluster_name has a value because Terraform can already determine what the value should be.
  2. The alb_dns_name shows a value of (known after apply) because this value will only be known after the Application Load Balancer (ALB) is created.

NOTE: When you run the terraform plan command, it is very important to review the output carefully and confirm that the plan does exactly what you expect it to do.
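
As the note at the bottom of the plan output suggests, you can also save the plan to a file with the -out option and then apply exactly that saved plan:

terraform plan -out=tfplan
terraform apply tfplan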

Step 3: Make Changes to Infrastructure – terraform apply

After we run the terraform plan command to generate the planned list of changes that Terraform will make to our infrastructure, the next step is to instruct Terraform to carry out those changes.

Run the following command to instruct terraform to initiate the process of applying the changes that we saw in the plan:

terraform apply

When you run the terraform apply command, Terraform will generate a new plan for you. Let’s go through the output of the apply command.

Note: The list of resources was removed from the sample output for the sake of brevity.

module.dokuwiki.data.aws_vpc.vpc: Reading...
module.dokuwiki.data.aws_vpc.vpc: Read complete after 1s [id=vpc-02bc8afe3a47e8497]
module.dokuwiki.data.aws_subnets.subnets: Reading...
module.dokuwiki.data.aws_subnets.subnets: Read complete after 0s [id=us-east-1]
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
+ create

Terraform will perform the following actions:
. . .

The output from the terraform apply command starts out just like the output from the terraform plan command we just ran, so we don’t need to go over it again. However, if you scroll to the end of the output, you will see something new.

Plan: 25 to add, 0 to change, 0 to destroy.
Changes to Outputs:
+ alb_dns_name     = (known after apply)
+ ecs_cluster_name = "dokuwiki-cluster"

Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.

Enter a value:

Terraform is asking you to confirm that you want it to carry out all the actions that are listed in the plan. Go ahead and type yes at the prompt and hit enter. When you do this, Terraform will start creating all of the resources in AWS on your behalf.

As Terraform works through the process of carrying out the planned changes to our infrastructure, it lists out what it is doing.

Note: Some of the output was omitted for the sake of brevity.

module.dokuwiki.random_id.index: Creating...
module.dokuwiki.random_id.index: Creation complete after 0s [id=7lA]
. . .
module.dokuwiki.module.ecs-service.aws_ecs_service.this: Still creating... [2m20s elapsed]
module.dokuwiki.module.ecs-service.aws_ecs_service.this: Creation complete after 2m23s [id=arn:aws:ecs:us-east-1:816649246361:service/dokuwiki-cluster/dokuwiki-service]
Apply complete! Resources: 25 added, 0 changed, 0 destroyed.

Outputs:
alb_dns_name = "dokuwiki-alb-979498190.us-east-1.elb.amazonaws.com"
ecs_cluster_name = "dokuwiki-cluster"

It should take around 3 minutes for Terraform to create all 25 resources! This is a huge time savings if you consider how long it would take to create all of these resources by clicking around in the AWS Console.

NOTE: Terraform tracks all the changes it makes to infrastructure in a state file. This will come up later when we are finished and want Terraform to roll back all the changes it has made for us.

Notice that the output for the alb_dns_name now has a value. Terraform can tell us what the DNS name is for the Application Load Balancer (ALB) because it has now been created. Try copying the value for the alb_dns_name from your output (which will be different from mine) and then paste it into your browser to go to the Dokuwiki site Terraform created.
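
If you need this value again later, you can ask Terraform to reprint it from the state at any time:

terraform output alb_dns_name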

Exhibit 7: Accessing Dokuwiki in the browser.

AWSExhibit7 1

NOTE: The Dokuwiki application may not load the first time you try it. When the ECS Task is created, AWS needs to pull the Bitnami Dokuwiki container from Docker Hub and then start it which may take a few minutes. If you try to access the DNS name for the ALB in your browser, but it does not load for you or you see an error, just wait a few minutes and try it again.

Step 4: Test the Deployment

If we were able to launch the Dokuwiki site using the alb_dns_name in the last step then we have tested a lot of the infrastructure. At the very least the following is working:

  1. Our Application Load Balancer (ALB) is handling the incoming network traffic from our browser via the security group
  2. The ALB is routing network traffic to the ECS Task running Dokuwiki which means we have a running ECS Cluster with an ECS Task running Dokuwiki.
  3. Unlike the last deployment in Part 1 of this series, we didn’t have to specify a port number in the URL to access the running Dokuwiki container from the browser.

However, there are two enhancements to this deployment when compared with the deployment we did in Part 1 of this series.

  1. Introducing an ECS Service to run our ECS Task so that if the ECS Task stops for some reason then the ECS Service will start a new one for us.
  2. Previously, our data was stored on the Dokuwiki ECS Task; therefore it would be lost if the Task was stopped or failed. However, now our ECS Task is using EFS storage for our content which continues to be available even if the Task is lost.

Now that we have deployed the infrastructure using Terraform we can test these aspects of the deployment by adding content to Dokuwiki via the browser, stopping the running ECS Task, and then verifying that the ECS Service starts a new ECS Task and that our content is still visible when we refresh the page via the browser.

  1. Add Content to Dokuwiki

From your browser, click the pencil on the right side of the page to engage the content editor for the current page.

Exhibit 8: Edit the current page in Dokuwiki.

AWSExhibit8 1

Next, type some text into the text box for the page and then click the Save button. You should now see the text you changed appear on the page.

  2. Stop the ECS Task

Now that we have added the content to the page in Dokuwiki, we should stop the running ECS Task and then wait to see if a replacement starts. We could log in to the AWS Console and stop the running task. However, it would be much quicker to use the AWS Command Line Interface (CLI) instead. We need to use two AWS CLI commands to do this.

  1. aws ecs list-tasks – which lists the ECS tasks running in an ECS Cluster. Go here to check out the documentation for this command.
  2. aws ecs stop-task – which stops a running ECS task. Go here to check out the documentation for this command.

First, run the following command to get a list of running ECS Tasks on our cluster.

aws ecs list-tasks --cluster dokuwiki-cluster

If we have set up the AWS CLI correctly with an IAM Access Key and Secret Key, then we should get a response like this when we run the command.

{
  "taskArns": [
    "arn:aws:ecs:us-east-1:816649246361:task/dokuwiki-cluster/ef120a2a79fe4e4e8efb70a6623d886e"
  ]
}

Note:  The output you get will not match mine exactly. The identifiers will be different.

Next, we need to stop the ECS Task that we saw when we ran the aws ecs list-tasks command. To do this, we run the aws ecs stop-task command and pass it the name of our ECS Cluster and the identifier of the task we want to stop. Run the following command, substituting the ECS Task ID you got when you ran the first command.

aws ecs stop-task --cluster dokuwiki-cluster --task ef120a2a79fe4e4e8efb70a6623d886e

If we ran the command correctly, AWS will stop the task and return all of the parameters of the stopped task. Hit the q key to get back to the command prompt.
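
If you prefer, the two steps can also be combined into one command by letting the AWS CLI extract the first task ARN with the --query option (assuming a shell that supports command substitution):

aws ecs stop-task --cluster dokuwiki-cluster --task $(aws ecs list-tasks --cluster dokuwiki-cluster --query 'taskArns[0]' --output text)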

Now that we ran a command to stop the task, run the following command again to see if our task was actually stopped.

aws ecs list-tasks --cluster dokuwiki-cluster

If we run the aws ecs list-tasks command fast enough, we may not see any tasks in the list. However, if we wait 15 seconds and run it again, we should see another task listed with a new Task ID.

{
  "taskArns": [
    "arn:aws:ecs:us-east-1:816649246361:task/dokuwiki-cluster/91e4b6e4c0084d0493756cfe0c4d7898"
  ]
}

Note that the ID at the end of the task is different this time because the ECS Service created a new ECS Task.

  3. Check the DokuWiki Site to Confirm the Content We Changed is Still Loading from EFS Storage

After confirming that a new ECS Task has started, reload the DokuWiki page in your browser to see if the content you changed previously is still there. You may find that the first time you reload the page you get an error message. This is expected because it will take a minute for the ECS Service to start a new ECS Task running the DokuWiki container. However, if you wait 30 seconds or so and reload the page, you should find that the content you changed in DokuWiki is still there. A successful test is evidence that our content is now stored on the EFS storage volume instead of on our ECS Task.

Step 5: Roll Back the Changes Made by Terraform – terraform destroy

Now that we have deployed and validated our infrastructure, it is time to remove it. Fortunately, Terraform tracked all the changes it made to our infrastructure in a state file and can use this information to roll back all the changes it made.

Run the following command to instruct terraform to roll back or destroy all the changes made to our infrastructure:

terraform destroy

When you run the terraform destroy command, Terraform will generate a new plan for you listing the resources that will be removed. Let’s go through the output on the destroy command.

Note: The list of resources was removed from the sample output for the sake of brevity.

module.dokuwiki.random_id.index: Refreshing state... [id=m4g]
module.dokuwiki.data.aws_vpc.vpc: Reading...
module.dokuwiki.aws_iam_role.ecs_task_role: Refreshing state... [id=dokuwiki-ecstaskrole]
. . .

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
- destroy

The terraform destroy command refreshes the state of all the resources first. Most of these messages were removed from the sample output for the sake of brevity. After it finishes refreshing the state of all the resources, it tells you what it will do. This time around, the symbol changes to - destroy, indicating that any resources with the minus symbol next to them will be destroyed.

Terraform will perform the following actions:

# module.dokuwiki.aws_cloudwatch_log_group.ecs-log-group will be destroyed
- resource "aws_cloudwatch_log_group" "ecs-log-group" {
. . .
}

# module.dokuwiki.aws_iam_role.ecs_task_role will be destroyed
- resource "aws_iam_role" "ecs_task_role" {
. . .
}

. . .

Plan: 0 to add, 0 to change, 25 to destroy.

Changes to Outputs:
- alb_dns_name = "dokuwiki-alb-979498190.us-east-1.elb.amazonaws.com"
- ecs_cluster_name = "dokuwiki-cluster"

Do you really want to destroy all resources?
Terraform will destroy all your managed infrastructure, as shown above.
There is no undo. Only 'yes' will be accepted to confirm.

Enter a value:

Note: Some resources and resource attributes were removed from the sample output for the sake of brevity.

As you continue to review the output you should notice that every single resource now has a minus symbol next to it indicating that if you approve the operation then Terraform will remove all the resources. If you scroll down to the end of the output, you’ll see that it will destroy 25 resources which is exactly the same as the number of resources that Terraform created when we ran the apply command.

Terraform is asking you to confirm that you want it to carry out all the actions that are listed in the plan. Go ahead and type yes at the prompt and hit enter. When you do this, Terraform will start destroying all of the resources in AWS on your behalf.

As Terraform works through the process of carrying out the planned changes to our infrastructure, it lists out what it is doing.

Note: Some of the output was omitted for the sake of brevity.

module.dokuwiki.module.efs.aws_efs_backup_policy.policy[0]: Destroying... [id=fs-02fc0e0a740ddd00e]
module.dokuwiki.module.sg_alb.module.sg.aws_security_group_rule.egress_rules[0]: Destroying... [id=sgrule-1334013127]
module.dokuwiki.module.efs.aws_efs_mount_target.default[0]: Destroying... [id=fsmt-03bb888c2575f56e4]
. . .

Destroy complete! Resources: 25 destroyed.

It should take around 3 minutes for Terraform to destroy all 25 resources! Again, this is a huge time savings if you consider how long it would take to destroy all of these resources by clicking around in the AWS Console.

After the destroy process finishes, you will find that if you reload the DokuWiki browser tab it will no longer load because the Application Load Balancer (ALB) created by AWS no longer exists.

Closing Remarks

We covered a lot of ground in this post.

  1. We started by looking at some ways to make our ECS DokuWiki deployment more robust using an ECS Service and an EFS volume.
  2. We listed some of the benefits of Infrastructure as Code (IaC) when compared with “Click-Ops.”
  3. We went over some of the benefits of Terraform.
  4. We described the setup requirements for running Terraform with AWS including installing Git, the AWS CLI, and Terraform.
  5. We pulled the source code for the Terraform deployment from GitLab.
  6. We reviewed the code, ran a terraform init, ran a terraform plan, and then deployed the code using terraform apply.
  7. We tested to confirm that the Terraform deployment was successful using the AWS CLI.
  8. We then used Terraform to destroy our deployment so that we wouldn’t have to pay for resources in AWS that we were no longer using.

Thanks for reading!

Cisco Certified Internetwork Expert (CCIE) Recertification https://www.sms.com/blog/cisco-certified-internetwork-expert-ccie-recertification/ https://www.sms.com/blog/cisco-certified-internetwork-expert-ccie-recertification/#respond Fri, 29 Jul 2022 19:26:14 +0000 http://sms-old.local/?p=4219 By Ryan DeBerry, Cloud Solutions Architect, SMS

To keep my Cisco Certified Internetwork Expert (CCIE) certification in an active state, I must recertify every 3 years. I haven’t touched anything related to helping me recertify the CCIE in a year and a half…or so I thought.

In September of 2020, after 15+ years as a Network Engineer, I transitioned to a Cloud Engineer (see the following blog post where I briefly detailed my journey: Network to Cloud). I heard about the new Continuing Education (CE) credits for Cisco, but had not done any research on them until this year. I just assumed, as usual, that I needed to check the CCIE written syllabus, look at the new topics, and start labbing away. So, like anything I dread doing, I Googled it first to see how other people have blogged about it; my actual search was "How do I recert CCIE without taking exam." Not surprisingly, I found many links that I won’t bother listing here.

All paths lead to the Cisco Continuing Education Program, which is a way for an individual to recertify and learn new technologies in the process. The breakdown of the number of CE credits required for each certification level is listed in the table below.

Certification Level & Duration | CE Credits
Associate – 3 years | 30
Specialist – 3 years | 40
Professional – 3 years | 80
CCIE – 3 years | 120
CCDE – 3 years | 120

As you can see above, I needed 120 credits to accomplish my task. The process was actually very easy to follow. Head over to Cisco Digital Learning and log in with your CSCO ID or email address. Here you will find many courses that can be used for your CE credit; search for "CE Credits." All courses in this search that are not greyed out are free to take. If a course is greyed out, it requires a subscription – more on this later.

Each course (free or paid) contains various chapters of content (videos/PDFs) with a quiz at the end of each chapter. The passing score for each quiz is 70%; in addition to each chapter quiz, you must pass the final exam to receive the CE credits. Basically, you can claim 61 credits without a subscription, leaving 59 credits to obtain. Those 61 credits all come from SD-WAN related courses, so if you don’t have that skill in your bag then you are in luck. I personally did a lot of SD-WAN labs/practice with EVE-NG a couple of years ago, so I was pretty familiar with most of the topics. I was able to breeze through most of the material as it was just a refresher. If that is not the case for you, please take the time to learn by doing the labs embedded in the course. For a free course, the material is pretty good.

My suggestion would be to give yourself 2-3 months ahead of your expiration date to get started because you might not know anything about these topics. Below is a list of the courses and how many credits each one will give you.

Course Name | Credits
The SD-WAN Mastery Collection – Getting Started – For Customers | 6
Planning and Deploying SD-Access Fundamentals (for Customers) | 12
The SD-WAN Mastery Collection – Deploying the Data Plane – For Customers | 6
The SD-WAN Mastery Collection – Managing the Application Experience – For Customers | 6
CUST-SDA-ISE – Preparing the ISE for SD-Access (for Customers) | 4
Cisco DNA Center Fast Start Use Cases | 5
Securing Branch Internet and Cloud Access with Cisco SD-WAN | 11
The SD-WAN Mastery Collection – Bringing Up the Control Plane Devices – For Customers | 2
Getting Started with DNA Center Assurance (A-DNAC-ASSUR) v1.0 | 4
The SD-WAN Mastery Collection – Deploying the Overlay Topology – For Customers | 5
Total | 61

These 10 courses took me a little over a week to complete. I then needed to figure out what other courses I needed to purchase. My co-worker, Chris Brooks, told me about the Cisco Certified DevNet Associate course, offered in 1-, 3-, 6-, and 12-month plans; it is worth 48 CE credits, which would bring me to 109. I opted for the 1-month plan for $99, since I didn’t know if I would need this material after I obtained the CE credits required for the recertification. This is a really good course to learn about Software Development and Design, Application Development, Automation, and more.

Here is where my recent role transition to the cloud paid off. I was already doing most of what the topics covered in my daily job, or had done it in previous roles, so it was very easy to follow and complete. Once I completed it, I only needed 11 additional credits to reach the required 120…but there isn’t an 11-credit offering. So, my next thought was picking a topic that I know pretty well and just going straight to the final exam to be done with it. At this point in the journey, it had been almost 2 months of videos, PDFs, and labs for each course, and I was burnt out. I found Introduction to 802.1X Operations for Cisco Security Professionals (802.1X) 2.0. I have done quite a bit of work with 802.1X in my career, so I figured this would be a slam dunk, and it was. This course normally costs $200, but there was a discount because Cisco Live was going on, so the course ended up being $160.

Looking at this only from a cost perspective, taking the CCIE written exam would have cost $400. To recertify using CE credits, I spent $360 and picked up useful skills in the process; a win in my book. I recommend picking a few courses that you are unfamiliar with so you can learn the foundation of a new technology. While I ended up with more than 120 credits, I accomplished my goal and gained new skills along the way.

Deploying Your First Container on Amazon Elastic Container Service (ECS) – Part 1 of 2 https://www.sms.com/blog/deploying-your-first-container-on-amazon-elastic-container-service-ecs/ Thu, 07 Jul 2022 20:48:10 +0000 http://sms-old.local/?p=3897 By Rob Stewart, Cloud Architect, SMS

The Case for Amazon Elastic Container Service

According to Corey Quinn, there are at least 17 Ways to Run Containers on AWS. However, if you are familiar with all the hype concerning containers in general and Kubernetes in particular, then you might conclude that there is no reason to even consider any of the 16+ other options that Amazon provides for running containers in AWS. The experts have all weighed in and Amazon Elastic Kubernetes Service (EKS) is the only option worth considering. Even though EKS is a great option, I invite you to consider Amazon Elastic Container Service (ECS) as an alternative, depending on what type of containerized workload you need to deploy and support.

AWS describes ECS as “a fully managed container orchestration service that makes it easy for you to deploy, manage, and scale containerized applications.” ECS became Generally Available (GA) in April of 2015, a good 3 years prior to the GA of EKS in June 2018. Given this timing, it is not surprising that ECS has strong Docker vibes. It might even be fair to call ECS “AWS Managed Docker Container Service.” Given this heritage, you would expect that ECS would be a good citizen of AWS, and you would be correct. ECS is tightly integrated with many other AWS services. Therefore, if you already have experience with EC2, ELB, ALBs, and CloudWatch then the additional concepts you need to grasp to start using ECS are far more incremental in nature. ECS is one more piece to snap into the larger AWS puzzle.

In contrast, EKS represents Amazon’s effort to address the needs of customers who have already adopted Kubernetes and want to run it using a managed service on AWS. Kubernetes is a fantastic container orchestrator; however, it is not by any stretch a simple one; adding the inherent complexity of AWS Services including IAM, EC2, etc. on top of the complexity of Kubernetes takes it to another level. In some cases, it is not necessary to deploy a containerized workload on Kubernetes. In these scenarios, using ECS may be a better option as it will be easier to get the workload running to start with and easier to maintain it going forward. There is value in simplicity.

Amazon Elastic Container Service Overview

Before diving into a simple deployment on ECS, it is helpful to have some understanding of the different components of the architecture at a high level.

ECS Cluster

To deploy containers on ECS, you must create an ECS Cluster. The ECS documentation describes the ECS Cluster like this:

An Amazon ECS cluster is a logical grouping of tasks or services. You can use clusters to isolate your applications. This way, they don’t use the same underlying infrastructure.

ECS Task Definition

To tell AWS what you want to run on your ECS cluster, you must create an ECS Task Definition. A Task Definition is a JSON-formatted document that lists the containers that you would like to deploy and describes how those containers will interact with other AWS services like Elastic File System and CloudWatch. The containers you reference in the task definition can come from a public registry like Docker Hub or Amazon’s own Elastic Container Registry.
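
For example, a minimal Fargate task definition for the Dokuwiki container used later in this post might look something like the following; the values are illustrative rather than the exact definition the console generates:

{
  "family": "dokuwiki-task",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [
    {
      "name": "dokuwiki",
      "image": "bitnami/dokuwiki",
      "essential": true,
      "portMappings": [{ "containerPort": 8080 }]
    }
  ]
}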

ECS Tasks

ECS Tasks are created when you ask ECS to run a Task Definition on an ECS Cluster. There are multiple ways to run ECS Tasks on ECS:

  1. EC2 instances running in your VPC can run tasks if the ECS Container Agent is installed on them. The agent enables ECS to manage the tasks running on that instance.
  2. When you don’t want to manage EC2 instances, you can use AWS Fargate. When you use Fargate, AWS will provision the compute required to run the tasks and place Elastic Network Interfaces in your VPC so that these tasks can interact with the network.

ECS Services

ECS supports running task definitions directly as ECS Tasks. However, containers are ephemeral in nature; they should be considered cattle, not pets. Therefore, you will usually create an ECS Service which is responsible for scheduling one or more tasks for you and launching new tasks automatically if one of the running tasks should fail. The following diagram shows the ECS Cluster and how tasks are deployed to the VPC on EC2 instances or as Fargate tasks:

Exhibit 1: AWS ECS Architecture Overview

AWSExhibit1 1

Deploying Dokuwiki on ECS

Now that we have provided a high-level overview of ECS, we’ll move on to describe how to deploy the Bitnami Dokuwiki container on ECS via the AWS Console. For those who may not be familiar with it, Dokuwiki is an open source wiki engine that runs on PHP; what it lacks in flair it makes up for in simplicity. We’ll be using the Bitnami Dokuwiki container because it doesn’t require a database or other services to function.

Step 1: Create a New AWS Account or Login to an AWS Account

To follow the remaining steps described in this post you will need to either create a new AWS Account or login to an existing AWS Account using an account with IAM permissions to resources such as IAM, EC2, and ECS resources. Providing guidance on how to sign up for a new AWS account is outside the scope of this post; however, Amazon provides guidance on how to create an account here.

NOTE: If you follow along with the steps in this post, there is some chance you may incur minimal charges. However, if you create a new account you will be under the AWS Free Tier. That said, it is always prudent to remove any resources you create in AWS right after you finish using them so that you limit the risk of unexpected charges. Guidance on how to remove the resources is provided below.

Step 2: Navigate to the ECS Dashboard

After you login to AWS, enter ECS in the search box at the top of the page and then select Elastic Container Service to navigate to the ECS Dashboard.

Exhibit 2: Navigate to the ECS Dashboard

AWSExhibit2

NOTE: The steps in this blog post describe how to navigate the current version of the AWS ECS UI in the AWS Console.  However, AWS is preparing to release a new version of the ECS Experience in the near future.  There is a toggle switch in the upper left-hand corner which enables you to switch between the current version of the ECS UI and the New ECS Experience.  If you find that the descriptions below do not match with what you are seeing then the New ECS experience may already be enabled.  If so, please use the toggle switch to turn off the New ECS Experience.  We plan to update this post in the future after Amazon transitions all customers over to the new experience.

Step 3: Create a New ECS Task Definition

  1. Select the Task Definitions link on the left-hand navigation.
  2. Select the Create new Task Definition button.
  3. Select Fargate for the launch type compatibility in Step 1.
  4. Select the Next step button on the bottom right hand corner of the page.
  5. For the Task definition name, enter the following: dokuwiki-task
  6. You can skip down to Operating System family and select Linux from the drop down.
  7. Next, set the Task size by setting Task memory (GB) to .5GB and Task CPU (vCPU) to .25 vCPU using the drop downs.
  8. Next, select the Add container button to define container settings.
  9. On the “Add Container” panel, specify the following:
  10. Container name: dokuwiki
  11. Image: bitnami/dokuwiki
  12. You do not need to configure any other settings, so scroll to the bottom and select the Add button.
  13. If you added the container definition correctly then you should see the container in the list of container definitions.

Exhibit 3: Adding a Container definition

AWSExhibit3

After you finish adding the Container definition, scroll all the way down to the bottom of the page and select the Create button. If everything went okay, you should see messages telling you that AWS completed the creation of the task definition. Go ahead and select the View task definition button in the lower right-hand corner of the screen.

Step 4: Create an ECS Cluster

Before you can run an ECS Task based on the task definition you just created, you need to create a new ECS Cluster.

  1. Select Clusters on the left navigation.
  2. Select the Create Cluster button.
  3. For the cluster template, Networking only should already be highlighted as it is the default selection. This is the option we want so go ahead and select the Next step button on the bottom right hand corner of the page.
  4. Enter Dokuwiki-cluster for the Cluster name.
  5. There is no need to change the options for Networking or CloudWatch Container Insights so scroll down to the bottom of the page and select the Create

NOTE: In some cases, you may encounter an error the first time you attempt to create a new ECS cluster telling you that AWS wasn’t able to create the cluster. If this happens, just repeat the cluster creation steps one more time. It should work just fine on the second attempt.

  6. After the cluster has been created successfully, select View Cluster. You should now see the new ECS cluster you just created.

Exhibit 4: The new ECS Cluster

AWSExhibit4

Step 5: Launch a new ECS task

Now that you have created an ECS Task Definition and a new ECS Cluster, it is time to create a new ECS Task based on the Task Definition.

  1. From the same ECS Cluster overview page, select the Tasks tab near the bottom of the page and then select the Run new task button.

Exhibit 5:  Run a new ECS Task

AWSExhibit5

2. On the Run Task page, select the following options:
a.  Launch type: FARGATE
b.  Operating System Family: Linux
c.  Task Definition Family: dokuwiki-task
d.  Task Definition Revision: 1 (latest)
e.  Platform version: Latest
f.   Cluster: Dokuwiki-cluster

Note: You can just leave Task group blank.

g.  Cluster VPC: Select the VPC ID for the default VPC in your account. It  should have an ID and IP CIDR range that looks similar to this: vpc-0b57fca38d911cb56 (172.31.0.0/16) Your VPC ID will be different. However, the IP CIDR range of 172.31.0.0/16 should match.
h.  Subnets: Select one subnet from the drop down. It does not matter which  subnet you select.
i.  Security groups: Select the Edit button to open the Configure Security Groups panel.
j.  On the Configure Security Groups panel, do the following:
i.  On Assigned security groups, Select the Create new security group radio button.
ii.  Security group name: dokuwiki-sg.
iii.  Description: Dokuwiki security group
iv.  For the Inbound rules for security group, set Type to Custom TCP, Protocol should be TCP, Port range should be 8080 and Source should be  anywhere.
v.  After you finish configuring the security group, select the Save button in the lower right-hand corner to close the panel.

Exhibit 6:  Configure security groups

AWSExhibit6

k.  Auto assign public IP: Enabled
l.  You do not need to change any other options, select the Run Task button to run the ECS Task Definition.
m.  If you did everything correctly, you will see a message at the top of the page indicating that the task was created successfully.

Exhibit 7: ECS task created successfully

Step 6: Access the Dokuwiki Application

Now that we have created the new ECS task, we can access it via the browser using the public IP address that AWS provisioned for it.

  1. Select the task in the list:

Exhibit 8: Select the newly created ECS Task

  2. Amazon has assigned a public IP to your container. Locate the Public IP in the Network section of the page and copy the IP address.

Exhibit 9:  The public IP Address of the ECS Task

  3. Open a new browser tab, type http:// in the address bar, paste the public IP you just copied, add :8080 to the end of the address, and then hit the Enter key to load the page. If you did this correctly, you should see the Dokuwiki application load in your browser.

Exhibit 10: The Dokuwiki application loads in the browser

  4. Feel free to play around with Dokuwiki. You can click the pencil icon on the right side of the page to create a page, enter some text, and then click the save button at the bottom of the page.
  5. When you’re finished playing with Dokuwiki, switch back to the AWS Console tab in your browser. You should still have the ECS Task open. Go ahead and kill the task by selecting the Stop button in the upper right-hand corner of the page. When you select the Stop button, you will be prompted to confirm that you want to delete the task; go ahead and select the red Stop button to confirm that you want the task to be removed.

Exhibit 11:  Stop the Dokuwiki ECS Task

  6. After the ECS Task has stopped, you will find that if you switch back to the browser tab where you had Dokuwiki loaded and hit the Refresh button, Dokuwiki will no longer load. The public IP associated with the task was removed when you stopped the ECS task.
  7. As we are done working with the ECS Cluster, you can go ahead and delete it. Select the Delete Cluster button in the upper right-hand corner of the page. After you select the Delete Cluster button, you will be prompted to confirm that you want to delete it. Enter delete me in the input box and then select the Delete button to confirm that you want the cluster to be deleted.
  8. The only other resource we created was the EC2 Security Group. Feel free to switch to the EC2 Service and remove the Security Group if you like. However, Amazon does not charge for security groups.

Final Notes

In this post, we provided a quick overview of AWS Elastic Container Service and then shared step by step directions for how to create an ECS Cluster and run a Dokuwiki container as an ECS Task. While this was a great start, there are several deficiencies with this approach.

  1. We performed all these actions via the AWS Console. When you are first learning about new services in AWS, it is very helpful to explore the console to get a better grasp on how services work. However, clicking through the console is not repeatable. Once you understand how a given AWS service works, it is better to provision all AWS resources using Infrastructure as Code (IaC) tools like AWS CloudFormation or Hashicorp Terraform. When you express your infrastructure as code then you can create it and destroy it with a single command as many times as you like, and all the details of the deployment are automatically documented. You can also track the IaC in a source code repository which enables version tracking, collaboration, and reviews.
  2. The deployment of the ECS Task was not fault tolerant. As soon as the ECS Task was deleted, our application no longer loaded in the browser. For a real deployment, we would want to use an ECS Service to run multiple Tasks. The Service would take responsibility for making sure new tasks are started up when a task fails.
  3. Further, any content we saved in Dokuwiki was gone forever when the ECS Task was deleted. For a real deployment, we would want to store this data in a redundant fashion so that it would not be lost when an ECS Task failed. One way to introduce redundancy for storage is to create an AWS Elastic File System and mount it to the container in the ECS Task Definition.
  4. The deployment was not secure. When we accessed Dokuwiki in the browser, we had to tell the browser to use the HTTP protocol and specify port 8080. Therefore, any data we entered into Dokuwiki would not be encrypted in transit using Transport Layer Security (TLS). One way to address this deficiency is to use an AWS Elastic Load Balancer to handle the traffic. Elastic Load Balancers can be configured with TLS certificates so that data is encrypted in transit.

We address these deficiencies in Part 2 of this series.

Lastly, while ECS may work great in some cases, EKS shines in cases where the application is more complex. One of the advantages of Kubernetes is that complex deployments can be expressed as Helm charts which can be deployed on Amazon EKS or on another cloud provider’s version of Kubernetes. If you need to deploy a container-based application on AWS from a vendor who provides a well-documented Helm chart then using EKS is likely your best path forward.

Click here for Part 2 in this series.

Network Engineer to Cloud Engineer https://www.sms.com/blog/network-engineer-to-cloud-engineer/ https://www.sms.com/blog/network-engineer-to-cloud-engineer/#respond Fri, 10 Jun 2022 18:43:13 +0000 http://sms-old.local/?p=3862 By Ryan DeBerry, Cloud Solutions Architect, SMS

Introduction

This is an attempt to give other Network Engineers a brain dump of the things you should learn if you want to make the transition from networking to cloud. I decided to do this because an opportunity was presented to me at my company, SMS. I was hesitant as I was comfortable in my previous position, but it turned out to be a good decision.

The Transition

Getting up to speed on Amazon Web Services (AWS) was difficult; just like learning anything new, there are many new terms/bells/whistles/you name it. I learn best by building/breaking/fixing/destroying, so I created my own AWS account to get hands-on experience, with the goal of achieving the AWS Certified Solutions Architect – Associate.

I quickly learned that there is a lot to learn, A LOT! This exam is like the CCNA: it covers a lot of topics but doesn’t go too deep. I followed the exam guide syllabus and searched the internet for examples of creating resources, testing them, and tearing them down (staying in the free tier to avoid being charged!).

Disclaimer: I was also forced to learn Infrastructure as Code (IaC), Terraform to be exact, so a lot of my experience was probably not conventional. I did not ClickOps anything; I learned how to use Terraform and learned AWS at the same time. I do not suggest this method; ClickOps is a good way to learn how something works, and the exams usually want you to know how it’s done in the UI. The problem with ClickOps is that in production you shouldn’t be doing it at all unless absolutely necessary. You’ll soon find it impossible to keep track of what you did via ClickOps, and how to delete it, which all cloud providers love because you keep getting charged for the provisioned resources.

I spent many evenings learning what I could from Google. I soon realized that YouTube channels are very handy for learning all things cloud-related (see the useful links at the bottom of this post). I ended up creating multiple AWS accounts, running various services between them, and “networking” them together as well. I am not going to go into detail about the various terms and options in the cloud; it will make your brain hurt. Just know that your networking and system design knowledge is pivotal in making all this work.

This is when I figured, “I should take the AWS Networking Exam!” That said, I still haven’t taken that exam, but I do have at least 12 hours of Adrian Cantrill’s course video under my belt. If you read the exam guide for the AWS Certified Advanced Networking – Specialty, you will see the same terms from your Network Engineer path. My suggestion for the true network folks is to study for this exam right after the Solutions Architect, or at least treat it as your CCIE equivalent. Personally, I don’t think you need to sit for the exam; I’d just learn all the material as if you were going to. You will be drinking from the firehose no matter which path you end up taking. Big skill gap? Maybe; it depends on you.

A good Engineer’s skillset comprises the following:

  1. An analytical mind
  2. An ability to learn new technologies quickly
  3. Good time management skills
  4. An ability to follow processes
  5. An ability to figure things out when no process exists
  6. An ability to recognize when you don’t know… and to know who to reach out to for that understanding

Does this really change for a Cloud Engineer? The short answer is no. These are the skills that I think are the most important, not the actual technology itself. You can learn and become an expert at any technology with these foundational skills. The cloud is still infrastructure that needs to be configured and maintained. You still have to know how systems are built, how they talk to each other, and how you can secure them.

Enough of the fluff; you came here to learn how to make the transition easily.

I put these baseline skills into three categories; this is obviously not a comprehensive list.

Technical skills you must have (high level foundational):

This is in the order I believe you should learn these:

  1. Linux: This is non-negotiable, IMHO; I could write an entire blog about it. The more experience the better; most, if not all, resources deployed in the cloud are Linux-based, and it will help you in the long run if you know your way around and how to deal with it when it breaks. I’d recommend installing Linux and using it without a UI; it’s important to get comfortable with the command line and your distro’s package manager. Check out this link for distro-agnostic Essential Commands.
  2. Git: GitLab, GitHub, Bitbucket, Gitea. There are many platforms, but Git is still git, not “get”; be familiar with the git CLI.
  3. IaC: Terraform, Pulumi, Chef, Puppet, Ansible. If you know any one of these, it shouldn’t be too hard to learn the others; they all have free versions.
  4. Integrated Development Environment (IDE): PyCharm, IntelliJ, VS Code, etc.
  5. Understanding how to create and manipulate the following file formats: JSON and YAML (a short example follows below).

You have to be able to navigate all of these in any cloud environment.
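
As a small illustration of the JSON/YAML point above, here is a hedged Python sketch that round-trips a document between the two formats. The file names are made up, and it assumes the PyYAML package is installed.

```python
# Convert a (hypothetical) YAML file to JSON and back.
# Requires PyYAML: pip install pyyaml
import json
import yaml

with open("service.yaml") as f:       # placeholder file name
    doc = yaml.safe_load(f)           # YAML text -> Python dicts/lists

print(json.dumps(doc, indent=2))      # Python objects -> JSON text

with open("service.json", "w") as f:
    json.dump(doc, f, indent=2)

# The reverse direction is symmetric, since both formats map onto the
# same basic data structures.
with open("service.json") as f:
    print(yaml.safe_dump(json.load(f), sort_keys=False))
```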

Technical skills that I believe are past foundational

So, you have mastered the foundational aspects… getting to the other side will be like climbing a mountain, but if you make it, there might be a pot of gold.

  1. Basic understanding of a Programming Language: Python, Java, C++, Golang (a short sketch follows this list)
  2. Containers & Orchestration: Docker, containerd, Kubernetes, Nomad
  3. Public Key Infrastructure (PKI): Trust me, it pays to understand this; I still struggle with it
  4. Continuous Integration/Continuous Deployment (CI/CD): Jenkins, Travis, Argo, GitLab, GitHub Actions, the list goes on…
  5. Databases – General understanding: Relational vs. NoSQL vs. Object-Oriented vs. Key/Value
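
To illustrate why a basic programming language earns the first spot on this list, here is a hypothetical Python sketch of the kind of everyday automation a cloud engineer ends up writing: an EC2 inventory for a single region. It assumes boto3 is installed and AWS credentials are already configured; the region is a placeholder.

```python
# Hypothetical example: list every EC2 instance in one region with its
# state and Name tag. Assumes boto3 and configured AWS credentials.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            # The Name tag is optional, so fall back to "-" when absent.
            name = next(
                (t["Value"] for t in instance.get("Tags", []) if t["Key"] == "Name"),
                "-",
            )
            print(instance["InstanceId"], instance["State"]["Name"], name)
```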

Way Past Foundational

  1. Aviatrix
  2. Rancher

Bookmark this link!

If you don’t do anything else, bookmark this: DevSecOps Learning

This link has a lot of what I have already talked about and more, a lot more.

Concepts to get in your head

  1. Pets vs. Cattle
  2. Don’t Repeat Yourself

YouTube channels that I subscribe to for various Cloud/DevOps resources

  1. DevOps Toolkit
  2. Just Me and OpenSource
  3. Pablos Spot

Cloud Service Provider Links

Each cloud provider has its own way of implementing the same thing; you will just need to get your hands dirty to figure out the differences. Get an account with any provider (they all have free tiers) and Git-R-Done!

  1. Azure
  2. AWS
  3. Google

Final Thoughts

If you keep an open mind, it won’t be that bad. You will definitely turn into what we call a “glorified Sys Admin,” but at the end of the day, you can never learn enough. That is the gift… and the curse!

Positive
Energy
Activates
Constant
Elevation
