
Overengineering

... to overengineer or not to overengineer ...

There is a simple task: crawl HTML pages, parse them, and return the parsed results. Sounds simple, and it is. In fact, it is only about 40 lines of code, including comments:
public async Task<ActionResult<MeasurementSet>>
        Get([FromQuery]DateTime? from = null, [FromQuery]DateTime? to = null)
{
  // Defaulting values
  from ??= DateTime.Today.AddDays(-1);
  to ??= DateTime.Today;
  // Defining URLs
  var query = $"beginn={from:dd.MM.yyyy}&ende={to:dd.MM.yyyy}";
  var baseUrl = new Uri("https://BaseURL/");
  using (var client = new HttpClient { BaseAddress = baseUrl })
  {
    // Collecting data
    return Ok(new MeasurementSet 
    { 
      Temperature = await GetMeasures(query, client, "wassertemperatur"), 
      Level = await GetMeasures(query, client, "wasserstand"), 
    });
  }
}

private static async Task<IEnumerable<Measurement>>
        GetMeasures(string query, HttpClient client, string apiName)
{
  // Retrieving the data
  var response = await client.GetAsync($"{apiName}/some/other/url/part/?{query}");
  var html = await response.Content.ReadAsStringAsync();
  // Parsing HTML response
  var bodyMatch = Regex.Match(html, "<tbody>(.*)<\\/tbody>", RegexOptions.Singleline); // Singleline: '.' must match across line breaks
  var rowsHtml = bodyMatch.Groups.Values.Last();
  return Regex.Matches(rowsHtml.Value, "<tr  class=\"row2?\"><td >([^<]*)<\\/td><td  class=\"whocares\">([^<]*)<\\/td>")
    // Building the results
    .Select(match => new Measurement
    {
      Date = DateTime.Parse(match.Groups[1].Value),
      Value = decimal.Parse(match.Groups[2].Value)
    });
}
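The model types are not shown above; a minimal sketch of what Measurement and MeasurementSet could look like, with only the properties the code actually references (the exact shapes are an assumption):

```csharp
// Assumed DTO shapes, derived from how the controller and parser use them.
public class Measurement
{
  public DateTime Date { get; set; }
  public decimal Value { get; set; }
}

public class MeasurementSet
{
  public IEnumerable<Measurement> Temperature { get; set; }
  public IEnumerable<Measurement> Level { get; set; }
}
```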
It is not the best code, as it would be complicated to maintain, extend, debug and test. It does not fit the basic OOP principles (except maybe DRY). So let's refactor it (the sources and the history are placed here).

Refactoring

Treating the project as real-life code, let's try to fix and refactor it problem by problem.

1. Possible failures

The first step is error handling, as lots of things could go wrong. There is no logging at the moment, but these changes make errors at least more descriptive:
if (!response.IsSuccessStatusCode)
{
  throw new Exception($"{apiName} gathering failed. [{response.StatusCode}] {html}");
}
// Parsing HTML response
var bodyMatch = Regex.Match(html, "<tbody>(.*)<\\/tbody>", RegexOptions.Singleline);
if (!bodyMatch.Success)
{
  throw new Exception($"Failed to define data table body. Content: {html}");
}
Strictly speaking, this was a functional improvement rather than a refactoring.
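Logging itself is still missing; a minimal sketch of how the same check could report through the standard ILogger&lt;T&gt; (the injected _logger field is an assumption, not part of the original code):

```csharp
// Assumes an ILogger<T> injected via the constructor and stored in _logger.
if (!response.IsSuccessStatusCode)
{
  _logger.LogError("{ApiName} gathering failed. [{StatusCode}] {Html}",
      apiName, response.StatusCode, html);
  throw new Exception($"{apiName} gathering failed. [{response.StatusCode}] {html}");
}
```

Structured logging like this keeps the exception message for the caller while making failures searchable in the log sink.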

2. Layers separation

All the logic is located in a single class, which is also an API controller, meaning it acts as the persistence, business and presentation layers at the same time. The next step is to separate the layers by introducing a MeasureParser in the business layer and a RawMeasuresCollector in the persistence layer.
 
The URL building is part of the persistence logic, which is why the enum MeasureType, with the appropriate mapping, was introduced in the persistence layer:
var apiName = measure switch
{
  MeasureType.Temperature => "wassertemperatur",
  MeasureType.Level => "wasserstand",
  _ => throw new NotImplementedException($"Measure type {measure} not implemented")
};
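The contract between the two layers can be sketched as a single interface (the exact signature is an assumption, consistent with the mock setup shown in the test section later in the post):

```csharp
// Persistence-layer contract: fetch the raw HTML for a given measure type.
// The business-layer MeasureParser depends only on this interface,
// not on HttpClient or any concrete collector.
public interface IRawMeasuresCollector
{
  Task<string> CollectRawMeasurement(MeasureType measure);
}
```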

3. Single Responsibility

There is also logic for building the URL query from the provided dates or default values, and that should be separated as well. It is easy to do by introducing a UrlQueryBuilder, which takes care of exactly one task:
public class UrlQueryBuilder
{
  public DateTime From { get; set; } = DateTime.Today.AddDays(-1);
  public DateTime To { get; set; } = DateTime.Today;

  public string Build(MeasureType measure)
  {
    var query = $"beginn={From:dd.MM.yyyy}&ende={To:dd.MM.yyyy}";
    var apiName = measure switch
    {
      MeasureType.Temperature => "wassertemperatur",
      MeasureType.Level => "wasserstand",
      _ => throw new NotImplementedException($"Measure type {measure} not implemented")
    };
    return $"{apiName}/some/other/url/part/?{query}";
  }
}
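Usage then becomes a one-liner (the dates here are illustrative):

```csharp
var url = new UrlQueryBuilder
{
  From = new DateTime(2020, 11, 23),
  To = new DateTime(2020, 11, 24)
}.Build(MeasureType.Temperature);
// url == "wassertemperatur/some/other/url/part/?beginn=23.11.2020&ende=24.11.2020"
```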

4. Reducing coupling

Coupling between the business logic and persistence layers is reduced by introducing an interface. A nice side effect is that we can start using dependency injection. On top of that, we can remove the hardcoded base URL from the code and move it into the settings:
var settings = Configuration.GetSection("AppSettings").Get<AppSettings>();
services.AddHttpClient(Constants.ClientName, client =>
{
  client.BaseAddress = new Uri(settings.BaseUrl);
});
services.AddTransient<IRawMeasuresCollector, RawMeasuresCollector>();
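On the consuming side, the RawMeasuresCollector can resolve the named client through IHttpClientFactory; a sketch, where the constructor shape is an assumption:

```csharp
public class RawMeasuresCollector : IRawMeasuresCollector
{
  private readonly HttpClient _client;

  // The named client configured in Startup carries the base URL from settings.
  public RawMeasuresCollector(IHttpClientFactory clientFactory)
    => _client = clientFactory.CreateClient(Constants.ClientName);

  public async Task<string> CollectRawMeasurement(MeasureType measure)
  {
    var url = new UrlQueryBuilder().Build(measure);
    var response = await _client.GetAsync(url);
    return await response.Content.ReadAsStringAsync();
  }
}
```

Error handling is omitted here for brevity; in the real class it stays where step 1 put it.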

5. Testing

After all the steps above, we can finally mock the data collector interface and test the business logic, which is represented by the HTML parsing and URL query building features. More code, fewer words; check the parsing test sample:

[TestMethod]
public async Task TestHtmlTemperatureParsing()
{
  var collector = new Mock<IRawMeasuresCollector>();
  collector
    .Setup(c => c.CollectRawMeasurement(MeasureType.Temperature))
    .Returns(Task.FromResult(_temperatureDataA));

  var actual = (await new MeasureParser(collector.Object)
    .GetMeasures(MeasureType.Temperature)
    ).ToArray();

  Assert.AreEqual(165, actual.Length);
  Assert.AreEqual(7.1M, actual
      .First(l => l.Date == DateTime.Parse("24.11.2020 10:15")).Value);
}
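The UrlQueryBuilder can be covered in the same fashion, and it does not even need a mock:

```csharp
[TestMethod]
public void TestUrlQueryBuilding()
{
  var builder = new UrlQueryBuilder
  {
    From = new DateTime(2020, 11, 23),
    To = new DateTime(2020, 11, 24)
  };

  Assert.AreEqual(
    "wasserstand/some/other/url/part/?beginn=23.11.2020&ende=24.11.2020",
    builder.Build(MeasureType.Level));
}
```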

6. Increase flexibility

Unfortunately, the current source is not under our control: the URLs and the page DOM can change at any time. That means the structure should be flexible enough to handle it.
 
To resolve the issue of a possible source change, a factory for measure collectors was created; another implementation of IRawMeasuresCollector could then read the HTML from a different source.
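A minimal sketch of what such a factory could look like (the signature follows the controller code below; the constructor shape, and the assumption that RawMeasuresCollector gains an overload accepting the dates, are mine):

```csharp
public interface ICollectorFactory
{
  IRawMeasuresCollector CreateCollector(DateTime? from, DateTime? to);
}

public class CollectorFactory : ICollectorFactory
{
  private readonly IHttpClientFactory _clientFactory;

  public CollectorFactory(IHttpClientFactory clientFactory)
    => _clientFactory = clientFactory;

  // Swapping the data source means swapping this factory (or its output),
  // without touching the controller or the parser.
  public IRawMeasuresCollector CreateCollector(DateTime? from, DateTime? to)
    => new RawMeasuresCollector(_clientFactory, from, to);
}
```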

There are two different measure types, which are read from different pages, but at the same time there is only a single parsing algorithm. Separating the logic preserves the functionality if at some point the DOM of one of the pages changes. The controller GET action then becomes:
public async Task<ActionResult<MeasurementSet>> Get
    ([FromQuery] DateTime? from = null, [FromQuery] DateTime? to = null)
{
  var parser = new MeasureParser(_collectorFactory.CreateCollector(from, to));
  return Ok(new MeasurementSet
  {
    Temperature = await parser.GetTemperature(),
    Level = await parser.GetLevel(),
  });
}
Additionally, that change allows removing the MeasureType enum, which is positive: it reduces the number of code branches (no switch, fewer branches), and it protects the code from runtime failures when a newly introduced type is not covered in the query building, as the problem becomes visible at build time.
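With the enum removed, the parser could expose one method per measure type; a sketch, under the assumption that the collector also exposes per-measure methods (CollectRawTemperature and CollectRawLevel are assumed names):

```csharp
public class MeasureParser
{
  private readonly IRawMeasuresCollector _collector;

  public MeasureParser(IRawMeasuresCollector collector) => _collector = collector;

  // One public method per measure, both delegating to the shared parsing.
  public async Task<IEnumerable<Measurement>> GetTemperature()
    => Parse(await _collector.CollectRawTemperature());

  public async Task<IEnumerable<Measurement>> GetLevel()
    => Parse(await _collector.CollectRawLevel());

  // The single regex-based parsing algorithm from the original code.
  private static IEnumerable<Measurement> Parse(string html)
  {
    var body = Regex.Match(html, "<tbody>(.*)<\\/tbody>", RegexOptions.Singleline);
    return Regex
      .Matches(body.Groups[1].Value,
        "<tr  class=\"row2?\"><td >([^<]*)<\\/td><td  class=\"whocares\">([^<]*)<\\/td>")
      .Select(m => new Measurement
      {
        Date = DateTime.Parse(m.Groups[1].Value),
        Value = decimal.Parse(m.Groups[2].Value)
      });
  }
}
```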

Summary

As a result of our actions, the project grew from a single 40-line file into something bigger:

 
It is flexible and maintainable (according to Visual Studio it has a maintainability index of 87), safe, and it follows the standards. But at the same time the functionality is spread across different files, and the project seems bigger than it should be: there are only 67 executable lines of code in 277 lines of source code.

As for me, such a structure makes more sense in bigger projects, and the goal is to find an individual balance for each project.
 
Feel free to mention the step you would consider the last one of the refactoring, or bring more improvement ideas if you see the need.
