The Refactor That Wasn't Broken

The most important refactors are the ones where nothing's broken yet.

I'd just landed a feature. A cross-app "What's Next" widget that surfaces things the user should do next. Incomplete onboarding. Missing documents. Ungenerated content sections. The first draft worked. Tests passed. Lint was clean. All the mount points rendered. Then I read the code again and didn't like it.

The trigger was internal dissatisfaction. Not a bug. Not a performance problem. Just a strong sense that the shape was wrong. That's a legitimate signal, and refusing to act on it because "nothing's broken" is how codebases quietly rot.

Refactors that come from dissatisfaction are the ones I trust most. Bug-driven refactors usually have a narrower scope: you're fixing the specific thing that broke. Refactors that come from a dissatisfied read of code that works are usually restructuring something that would have caused pain later, and doing it before the pain arrives. That's cheaper.

Here's how I walked the refactor from that vague sense of wrong to a shape I could defend.

Stage 0: The starting code

A 200-line service with eight item-builder methods. Two of them looked roughly like this:

def content_items
  nudges = []
  nudges << content_ungenerated_item if ungenerated_sections.any?
  nudges << content_over_cap_item    if over_cap_count.positive?
  nudges.compact
end

def content_ungenerated_item
  missing = ungenerated_sections
  labels  = missing.map { |s| I18n.t("app.content.sections.#{s}.label") }
  item(kind: :nudge,
       step: :content,
       label_key: "user.content_ungenerated",
       label_params: { count: missing.size, sections: labels.to_sentence },
       cta_url: @url_helpers.content_path,
       cta_method: :get)
end

And a couple methods later:

def document_items
  nudges = []
  nudges << document_missing_item  if missing_document_types.any?
  nudges << document_expired_item  if expired_document_count.positive?
  nudges.compact
end

def document_missing_item
  missing = missing_document_types
  labels  = missing.map { |t| I18n.t("app.documents.types.#{t}.label") }
  item(kind: :nudge,
       step: :documents,
       label_key: "user.documents_missing",
       label_params: { count: missing.size, types: labels.to_sentence },
       cta_url: @url_helpers.documents_path,
       cta_method: :get)
end

Look at those two together. They're not the same method, the domain concept differs, the I18n namespace differs, the CTA URL differs. But the shape is identical. The list-and-labels processing is the same. The item-construction is the same. The nudges << x if cond guard is the same. And the smell of copy-paste is unmistakable.

Extend that to the eight rules that were in the file, and you had eight variants of the same shape doing eight different domain jobs. That's fine when there are two of them. It's a code smell when there are eight.

And a compute method that imperatively concatenated each builder's output:

def compute
  items = []
  items.concat(onboarding_items)
  items.concat(profile_items)
  items.concat(document_items)
  items.concat(content_items)
  items.concat(export_items)
  items.sort_by { |item| WhatsNext::PRIORITY.fetch(item.kind) }
end

It worked. It passed tests. It rendered correctly on every page. But four smells:

  1. Imperative concat. items = []; items.concat(...) is "build a thing through mutation," which fights the declarative style the rest of the codebase uses. Most of this app leans on map, flat_map, and array literals with conditionals. This compute method is an island of mutation in a sea of declaration.
  2. Builder methods accepting parameters that were all derivable from self. Every keyword argument passed to item(...) was instance state. That's the data-clump smell in one of its most recognizable forms: passing the object's own state back to itself as arguments.
  3. Duplicated label processing. Two of the eight callers mapped a list through I18n.t and .to_sentence. Same shape, different namespace. This kind of near-duplication accumulates in a service class over time. Each new rule copies the pattern of the previous one, and by rule five you have four almost-identical helpers.
  4. A class doing eight things in one file. Each "rule" was a private method related to the others only by living in the same Ruby file. As long as this stayed at eight rules, the file was manageable. The moment it grew to twelve or fifteen (and it would), the file would be unreadable.

None of these were bugs. Every one of them was a design smell that would compound over time.

Stage 1: Lay out the options before touching code

The temptation when you spot a smell is to fix it inline. I resisted. I put three options on the table with code sketches for each.

Option A: Surface fix only. Convert the imperative shapes to declarative ones without changing the class structure. The compute method becomes:

def compute
  [
    *onboarding_items, *profile_items, *document_items,
    *content_items, *export_items
  ].sort_by { |item| WhatsNext::PRIORITY.fetch(item.kind) }
end

And each builder method flattens into the declarative form:

def content_items
  [
    (content_ungenerated_item if ungenerated_sections.any?),
    (content_over_cap_item    if over_cap_count.positive?)
  ].compact
end

Minimum churn. The imperative smell is gone. The other three smells remain. Deploy risk: near zero. But it doesn't address the deeper problem: this class is still doing eight things.

Option B: Rule classes with a light factory. Each builder becomes its own class with a uniform #items interface. The main service collapses to a registry:

class WhatsNext
  RULES = [
    Rules::Onboarding, Rules::Profile, Rules::Documents,
    Rules::Content, Rules::Export
    # ... and so on
  ]

  def compute
    RULES.flat_map { |klass| klass.new(@user, @url_helpers).items }
         .sort_by { |item| PRIORITY.fetch(item.kind) }
  end
end

Each rule class is tiny. Something like:

class Rules::Documents
  def initialize(user, url_helpers)
    @user, @url_helpers = user, url_helpers
  end

  def items
    [
      (missing_item if missing_types.any?),
      (expired_item if expired_count.positive?)
    ].compact
  end

  private
  # ... the domain-specific helpers ...
end

Eight tiny files, each independently testable. Pattern matches an existing precedent in this codebase (another service was already using the flat-map-a-list-of-classes pattern). Adding a ninth rule is one new file with one new line in the RULES list.

Option C: Real domain models plus presenters. Pull the "what state is this domain in" question into actual models:

class ProfileState
  def initialize(user)
    @user = user
  end

  def missing_fields
    [
      (:primary_specialty if @user.primary_specialty.blank?),
      (:bio if @user.bio.blank?),
      (:location if @user.location.blank?)
    ].compact
  end

  def complete?
    missing_fields.empty?
  end
end

Then the WhatsNext service becomes a thin presenter over the domain layer:

class WhatsNext
  def compute
    assessments = [
      ProfileState.new(@user),
      DocumentCoverage.new(@user),
      # ... and so on
    ]
    assessments.flat_map { |a| items_from(a) }
               .sort_by { |item| PRIORITY.fetch(item.kind) }
  end
end

Just describing three alternatives surfaced tradeoffs I'd have missed if I'd started refactoring right away. Option A had almost no upside beyond removing the imperative concat. Option C had significant upside, but only if we had enough evidence to justify the domain layer today. Option B was the sensible middle. Even knowing B was right, sketching A and C made me confident in it.

The principle: before refactoring, force yourself to write down 2-3 alternatives. Just describing them surfaces tradeoffs you'd otherwise miss.

Stage 2: Pick by evidence, not by aesthetic

I leaned toward B. But a sharper question came up: is C worth the cost? "Do the domain models read nicely" isn't the right frame. The right frame is: is there existing duplication or imminent reuse to justify pulling out the extra layer?

I went searching for real duplication of the ProfileState-style query across the codebase. The actual grep commands I ran looked something like:

# Find every place we're checking "is this profile field filled in?"
rg 'primary_specialty.presence' --type=ruby
rg 'primary_specialty.present?' --type=ruby
rg '@user\.(bio|location|specialty)' --type=ruby -A 2

# For the document coverage query:
rg 'user\.documents\.where\(type:' --type=ruby -B 1 -A 3

# For the content readiness query:
rg 'ungenerated_sections' --type=ruby

The output shape for the ProfileState query looked like this:

app/services/generate/resume_base.rb:23:  fallback = user.primary_specialty.presence || "(not declared)"
app/services/generate/opportunity_resume.rb:47:  fallback = user.primary_specialty.presence || "(not declared)"
app/services/generate/linkedin_section.rb:31:  fallback = user.primary_specialty.presence || "(not declared)"
app/services/score_specialty.rb:12:  return :low if user.primary_specialty.blank?
app/services/whats_next.rb:88:  missing << :primary_specialty if user.primary_specialty.blank?

Five callers. Three of them in generation services. One in a scoring module. One in the new What's Next rule.

Concrete evidence per candidate:

ProfileState. Five places already. Plus imminent reuse in three upcoming features. Verdict: extract today. High-confidence reuse and immediate duplication.

DocumentCoverage. Only one caller today. But the roadmap included a "document tailoring" feature that would need this exact query as a direct lookup rather than as an AI-inferred flag. Verdict: defer until that feature forces the second caller. Extracting it now would be architectural speculation, the correct instinct is to wait for evidence.

ContentReadiness and ExportRoster. One caller each, no clear second. Verdict: keep inline.

Along the way I made an honest correction. I'd initially claimed the four generation services were duplicating the ProfileState query. Looking again, they were doing presence || "(not declared)" as a prompt-interpolation idiom that was adjacent to but different from the completeness question. The state question is "is this field filled in?"; the interpolation idiom is "how do I display this field in prompt text?"

Look at the specific pattern used in the generation services:

# In app/services/generate/resume_base.rb:
specialty = @user.primary_specialty.presence || "(not declared)"
prompt = "You are writing a resume for a #{specialty}..."

versus what ProfileState would want to do:

# In ProfileState:
def missing_fields
  fields = []
  fields << :primary_specialty if @user.primary_specialty.blank?
  # ...
end

Both touch @user.primary_specialty. Both involve a blank/present check. But they're doing different jobs. The generation service is producing a display string. ProfileState is producing a domain classification. Conflating them would make ProfileState a god-object serving two concerns.

So: extract ProfileState for the state question. Leave the generation services alone. The four sites where presence || "(not declared)" appears will keep doing that inline, because it's a display concern in that context, not a domain concern.

The principle: "extract when a second caller appears" is a useful brake on domain modeling. It prevents speculative architecture. Domain models without callers are debt. Wait for evidence. The exception is when you can find duplication already in the code AND know the next caller is imminent. That's high-confidence reuse.

The corollary is that being honest about what "duplication" means matters. Two lines of code that look similar aren't necessarily the same concern. If you extract them together, you'll end up with a helper class that has to fork behavior based on which caller is calling it, and now you've made two separate concerns into one entangled one.

Stage 3: Naming matters more than the structure

With Option B and the ProfileState extraction agreed on, the next push came from an unexpected place: the name.

WhatsNext was a UI label leaking into the domain layer. The service's actual job was measuring the state of the workspace across multiple dimensions. Three honest framings on the table:

Framing Fit Notes
Readiness All 8 rules "Is the workspace ready in this dimension?" Covers gaps, queues, and broken artifacts uniformly.
Completeness 5 of 8 Strong for gaps, awkward for review queues and defensive items.
Audit All 8 Honest about what the service does. Compliance connotations.

Picked Readiness. It unified the meaning across all eight rules, the namespace nested cleanly, and the user-facing "What's Next" widget could stay as a thin presenter over the domain output. In code, that layering wires up like this:

# Domain layer, the reusable, business-facing thing:
module Reviewer
  RULES = [
    Reviewer::Onboarding,
    Reviewer::Profile,
    Reviewer::Documents,
    Reviewer::Content,
    Reviewer::Export
    # ...
  ]

  def self.compute(user, url_helpers)
    RULES.flat_map { |klass| klass.new(user, url_helpers).items }
         .sort_by { |item| WhatsNext::PRIORITY.fetch(item.kind) }
  end
end

# Presentation layer, the specific widget that surfaces this:
class WhatsNext
  def self.compute(user, url_helpers)
    Reviewer.compute(user, url_helpers)
  end
end

For now the presenter is a one-line delegate. But if we ever need a second presenter, an email digest of pending items, a CLI status command, a health-check endpoint that gates the workspace as "not ready", the domain layer is the reusable thing. The UI layer is what changes per surface.

The principle: when the existing name is a widget label, push back. The domain layer should be named for what it does, not how it's rendered.

Naming discussions feel like bikeshedding when you're in them. They're not. The name you pick determines what future engineers think the code is for. Choosing "Readiness" over "WhatsNext" is choosing to communicate that this code will be reused across contexts. Choosing "WhatsNext" would have communicated that it's a UI widget and would have implicitly discouraged the future presenter.

Stage 4: First pass, then a second look

First implementation: eight rule classes, each with their own #items method:

class Reviewer::Onboarding < Reviewer::Base
  def items
    return [] if @user.onboarded?
    [item(kind: :blocking, step: :onboarding,
          label_key: "user.onboarding_incomplete",
          cta_url: @url_helpers.onboarding_path, cta_method: :get)]
  end
end

Tests passed. Lint clean. Pushed.

Then I re-read the code and found more smells the first pass had left behind. Every #items was rebuilding the WhatsNext::Item structure inline. Every keyword argument to item(...) was derivable from self. Two rules were duplicating the same list-to-sentence label processing.

This was the Template Method pattern asking to be born. The Base class should own the orchestration. Subclasses should describe their properties via hook methods:

class Reviewer::Base
  LABEL_NAMESPACE = "app.workflow.whats_next".freeze

  def initialize(user, url_helpers)
    @user = user
    @url_helpers = url_helpers
  end

  def items
    applies? ? [item] : []
  end

  def item
    WhatsNext::Item.new(
      kind: kind, step: step, label: label,
      cta_url: cta_url, cta_method: cta_method
    )
  end

  private

  attr_reader :user, :url_helpers

  # Hooks, subclass overrides
  def applies?     = raise NotImplementedError
  def kind         = raise NotImplementedError
  def step         = raise NotImplementedError
  def label_key    = raise NotImplementedError
  def cta_url      = raise NotImplementedError
  def label_params = {}
  def cta_method   = :get

  # Derived helpers
  def label
    I18n.t("#{LABEL_NAMESPACE}.#{kind}.#{label_key}", **label_params)
  end

  def localized_list(keys, namespace:)
    keys.map { |k| I18n.t("#{namespace}.#{k}") }.to_sentence
  end
end

And the subclasses collapse to property declarations:

class Reviewer::Onboarding < Reviewer::Base
  private

  def applies?  = !user.onboarded?
  def kind      = :blocking
  def step      = :onboarding
  def label_key = "user.onboarding_incomplete"
  def cta_url   = url_helpers.onboarding_path
end

Five hook overrides. No construction noise. No parameter passing. No return [] unless ... boilerplate. Each method does one thing. The grain is right.

For rules with computed label parameters, the same pattern holds:

class Reviewer::Profile < Reviewer::Base
  private

  def applies?     = user.onboarded? && !state.complete?
  def kind         = :nudge
  def step         = :profile
  def label_key    = "user.profile_incomplete"
  def cta_url      = url_helpers.profile_path

  def label_params
    { fields: localized_list(state.missing_fields,
                             namespace: "#{LABEL_NAMESPACE}.profile_fields") }
  end

  def state
    @state ||= ::ProfileState.new(user)
  end
end

And for a rule that has multiple ways to trigger:

class Reviewer::Documents < Reviewer::Base
  private

  def applies?  = missing_types.any? || expired_count.positive?
  def kind      = :nudge
  def step      = :documents
  def label_key = missing_types.any? ? "user.documents_missing" : "user.documents_expired"
  def cta_url   = url_helpers.documents_path

  def label_params
    return { count: missing_types.size } if missing_types.any?
    { count: expired_count }
  end

  def missing_types
    @missing_types ||= DocumentCoverage.new(user).missing_types
  end

  def expired_count
    @expired_count ||= user.documents.expired.count
  end
end

The label_key and label_params are conditional because the same rule handles two related states. That's fine, the rule is small enough to hold both conditions without the class becoming confused. If the branching got more complex, that would be a signal to split into two rules.

Each method does one thing. The grain is right. And adding a ninth rule is one file with five hook overrides.

Ruby techniques worth calling out

A few Ruby-specific details make this pattern read especially well:

Endless method definitions (def applies? = !user.onboarded?). Ruby 3.0+. They make hook overrides read as data rather than as code. Compare:

# Regular form, six lines per property
def applies?
  !user.onboarded?
end

def kind
  :blocking
end

To:

# Endless form, two lines
def applies?  = !user.onboarded?
def kind      = :blocking

Same behavior. The endless form makes the "declaration" nature of the code visible. When every subclass's overrides are single-expression declarations, the endless form removes the def...end noise and lets the properties themselves become the visual content of the file. If you're glancing at a rule class, you should be able to see its full property definition in a screen. Regular def...end blocks make that hard for a class with five hooks; endless methods make it easy.

attr_reader :user, :url_helpers in the base. Small but meaningful. Compare:

# Without attr_reader, subclasses use instance variables
def applies?  = !@user.onboarded?
def cta_url   = @url_helpers.onboarding_path

To:

# With attr_reader on the base, subclasses use methods
def applies?  = !user.onboarded?
def cta_url   = url_helpers.onboarding_path

The second reads as a property declaration ("I need user to check whether they're onboarded"). The first reads as instance-variable access ("I need @user, but @user where? Set where? Any subclass can accidentally reassign this."). The attr_reader also enforces read-only access to the constructor arguments, which prevents a whole class of bugs where a subclass would clobber @user mid-execution.

The declarative array-with-conditionals form. Compare:

# Imperative, build through mutation
def items
  arr = []
  arr << missing_item if missing_types.any?
  arr << expired_item if expired_count.positive?
  arr
end

To:

# Declarative, describe what the array can contain
def items
  [
    (missing_item if missing_types.any?),
    (expired_item if expired_count.positive?)
  ].compact
end

Same output, different reading order. The first describes an algorithm ("start empty, conditionally push, return"). The second describes a specification ("here are the two items this rule can produce"). In a service class whose job is to describe eligibility, the second form reads as the answer to the question. The first reads as the process by which the answer gets computed. When you're describing rules, the specification form is almost always right.

flat_map and sort_by as a one-liner pipeline vs. imperative concat-and-sort. Compare:

# Imperative
def compute
  items = []
  RULES.each do |klass|
    items.concat(klass.new(user, url_helpers).items)
  end
  items.sort_by { |i| PRIORITY.fetch(i.kind) }
end

To:

# Declarative pipeline
def compute
  RULES.flat_map { |klass| klass.new(user, url_helpers).items }
       .sort_by { |i| PRIORITY.fetch(i.kind) }
end

The chain is two operations that compose. The imperative version is five lines that don't compose. In Ruby, whenever you can express a transformation as a pipeline, it reads better than as a loop with mutation.

Testing the pattern

Each rule class is independently testable, which is one of the biggest reasons for extracting them. Before the refactor, testing the WhatsNext service required constructing a User in various complete/incomplete states and asserting on the aggregate output. After the refactor, testing a rule is a two-line stub:

RSpec.describe Reviewer::Onboarding do
  let(:url_helpers) { double(onboarding_path: "/onboarding") }

  it "returns a blocking item when user is not onboarded" do
    user = build_stubbed(:user, onboarded: false)
    rule = described_class.new(user, url_helpers)

    expect(rule.items.size).to eq(1)
    expect(rule.items.first.kind).to eq(:blocking)
    expect(rule.items.first.step).to eq(:onboarding)
  end

  it "returns no items when user is already onboarded" do
    user = build_stubbed(:user, onboarded: true)
    rule = described_class.new(user, url_helpers)

    expect(rule.items).to be_empty
  end
end

The rule's tests describe the rule's contract, "here's when it fires, here's what it produces." No user state gets tangled up with other rules. The WhatsNext service, meanwhile, gets a much shorter test suite that just checks the composition, "when three rules fire, we get three items in the right order." Each layer's tests describe that layer's job.

Before the refactor, the service class had eighteen tests, all of which had to construct a User in specific combined states to exercise the various branches. After the refactor, the same behavioral coverage lives in twenty-five tests split across the rule classes and the WhatsNext composition. Each test is smaller. Each is easier to write. And when a rule changes, only that rule's tests need updating.

The dependency tree that made this possible

One thing I glossed over: this refactor was cheap because we already had a complete dependency map for the What's Next pattern. Every rule, every downstream consumer, every follow-up item that would need to change if a rule's shape changed, all mapped out and documented. Before I typed the first character of the refactor, I knew exactly what depended on what.

That's not typical. In most codebases, identifying the dependency tree for a feature like this is something you do late in the game, usually after a refactor has started to go sideways, and you're scrambling to figure out what else you're breaking. We do it upstream. The dependency map is a first-class artifact, not something that lives in developers' heads and gets discovered as needed.

The immediate value of a complete dependency map is that it makes evidence-based refactoring easy. When I said "grep the codebase for this pattern," the grep was the confirmation, not the discovery. The dependency map already told me ProfileState had five callers in specific files. The grep just verified the exact line numbers. That inverts the usual investigation loop: instead of guessing what might depend on something and then searching to verify, you already know, and searching is bookkeeping.

The deeper value is that a complete dependency map makes it easy to identify follow-up items, the places where a related architecture should look the same but doesn't yet. Once we agreed to extract ProfileState as a domain model, the dependency map told us exactly which other reviewer rules would benefit from the same treatment when their evidence arrived. Not "someday when we notice." Right now, in the docs, with the specific criterion (a second caller) that would trigger each extraction. When DocumentCoverage becomes justified by the second caller landing, the extraction won't be a fresh design exercise. It will be running the pattern we already documented.

This is a discipline most codebases skip. Their dependency maps live in the developers' heads, discovered as needed. The result: refactors are expensive because you don't know what you're going to break, so you refactor conservatively, small changes, minimum churn, leave the rest of the codebase alone. Which means the rest of the codebase stays in the pre-refactor shape indefinitely. It never gets the same treatment. The codebase drifts into inconsistency because each part gets refactored on a different schedule based on when it starts causing pain.

Having the dependency map up front inverts this. We can refactor systematically because we know the whole picture. When we improve the shape of one rule, we can immediately improve the shape of related rules that share the same architecture pattern. The refactor doesn't leak into surprise breakage because there aren't surprises. And when we sweep the codebase months later to check that all rules follow the same pattern, we're checking against a documented spec, not against a shifting collective memory.

For teams that haven't done this: it's worth doing. The upfront cost of mapping the dependency tree is real, but it pays off the first time you need to refactor anything non-trivial in that part of the codebase. And it pays off continuously, every subsequent change is done with the full picture in view rather than with the local view of just the code you're touching.

Principles distilled

Six principles I'd carry from this refactor to the next one.

The most important refactors are the ones where nothing's broken. Dissatisfaction is a legitimate input. The smell instinct is faster than the bug instinct, and codebases that only get refactored when something breaks accumulate quiet debt.

Write down 2-3 alternatives before changing code. Even when you know option B is right, sketching A and C makes you confident in B. Sometimes the sketching surfaces a better hybrid you hadn't seen.

"Extract when a second caller appears" is a useful brake on domain modeling. Domain models without callers are debt. Wait for evidence. The exception is when you can find duplication already in the code and know the next caller is imminent. That's high-confidence reuse.

Push back on names that leak presentation into the domain layer. WhatsNext was the widget label. The service was doing Readiness assessment. The rename made the layering obvious.

Be honest mid-refactor about your earlier claims. I claimed five callers were duplicating the ProfileState query. On re-read, four were doing an adjacent-but-different pattern. Saying that out loud is the difference between honest design and motivated reasoning.

Hook methods over parameters when the data is on self. Parameter lists in Ruby OO are often a sign the data should be moved onto the receiver. Subclasses that declare def kind = :blocking read like property declarations. That's the grain the language wants.

The final shape

Eight rule classes, each 12-29 lines. A Reviewer::Base at 72 lines doing all the orchestration. A Reviewer registry at ~20 lines. A ProfileState domain model at ~30 lines. Twenty-five test examples covering the rules and the new model.

Adding the ninth rule is one file with five hook overrides.

None of that was necessary. Nothing was broken. The dissatisfaction was the whole trigger. That's the point.


If you're navigating a Rails codebase where the design has quietly drifted and you're not sure where to start, that's exactly the kind of work Rock Agile does. Get in touch.

Previous
Previous

The U-Curve of AI Amplification

Next
Next

Inheriting a Dev Project: How to Not Blow It in the First Two Weeks