Using Analytics to Make Bad Buildings Better in New York City

Tobias PfaffProject stories5 Comments


by Sohaib Hasan | 5 min read


Problem: Deteriorating physical conditions in multifamily buildings endanger the health and safety of residents in NYC.

Objective: Preemptively identify at-risk buildings.

Method: Develop a metric to find buildings approaching states of severe distress by modeling a variety of bad outcomes at the building level.



The government of the city of New York delivers its services through a number of agencies with varying mission statements. One area of note in which these agencies overlap is the assessment of building distress, with each agency being motivated by  different interests and necessary focus. For instance, the Department of Buildings (DOB) is concerned with structural building conditions, while the New York City Fire Department (FDNY) is concerned specifically with proclivity for fires. With the multitude of complaints and violation types generated, the task of assessing the holistic condition of a building in New York City has become a convoluted affair [1], not to mention trying to do so before unpleasant and severe hazards occur.

Housing violations NYC 2014Source:

Housing Preservation and Development (HPD) is one of the city’s agencies that is tasked with developing and preserving affordable housing for New York City. The enforcement wing of HPD is responsible both for ensuring buildings do not reach critical levels of distress,  as well as taking enforcement action against owners and landlords should buildings ever enter states of severe distress. The majority of this action is undertaken as a result of complaints made to the city via the 311 system, NYC’s public service call center (a hotline for all non-emergencies and information requests). However, HPD also has an initiative to find distressed buildings that may not have had enough complaint or violation activity to warrant being classified as “highly distressed,”  but that appear to be on the path toward becoming such buildings. Identification of these buildings and proactive enforcement is the purpose of the Proactive Preservation Initiative (PPI). [2]

“A year ago the building was not in that great shape but now it’s much better.  They replaced the broken tiles on my bathroom floor.  They even repaired the walls and ceiling then painted it.  They also painted the hallways and repaired the stairs.”

Anthony Rodriquez, Bronx
(statement after entering PPI, source:

When the PPI program was initially developed, HPD used only internal data sources to construct metrics indicating building distress. This metric was a weighted sum of a number of HPD’s indicators for distress. For example, HPD categorizes their complaints into A/B/C levels (A being the highest priority/most severe). Hence, a combination of three or more B/C level complaints would count as one point and would be set as equivalent to two A level complaints. Something as serious as a litigation case would count as another point by itself. HPD reached out to the Mayor’s Office of Data Analytics (MODA) [3] to develop a metric that was more statistically rigorous, as well as more comprehensive by leveraging more data sources other than those generated by HPD.


(Warning! It’s getting technical now…)
MODA and HPD jointly identified nine potential “bad outcomes” that a building could have and for which data was also available:

  • Foreclosure
  • Tax Lien
  • HPD comprehensive litigation case
  • B or C level violation
  • Emergency Repair Program (ERP) charges levies by HPD
  • Vacate order issued by the Department of Buildings (DOB)
  • Housing quality related complaints from 311
  • Residential quality related complaints from 311
  • Residential fires

These different outcomes show varying patterns of occurrence, and therefore require their own modeling techniques. All of these “bad outcomes” form the dependent side of individual sub-models, which are further described below.

For independent variables for the sub-models, as many data sources as possible were pulled in from DataBridge [4], the MODA-curated central warehouse for city data. The data were collapsed and merged at the building level, including: Department of Finance data provided information on property values, building size, building age, and occurrence of tax liens, lis pendens (pre-foreclosure notices), or property charges. Property Shark [5] provided information on foreclosures. Next, complaints sourced from the Department of Environmental Preservation, Department of Buildings, and HPD were compiled along with all 311 service requests resulting from civilian calls to 311. Resulting violations that were processed by the Environmental Control Board (ECB) were also gathered. Finally, all 911 emergency calls, sorted by type and priority, were also pulled in. All of these were collapsed into counts and flags at the building level, which resulted in a total of 398 potential predictors for approximately one million buildings in NYC.

The dependant variables were collapsed in a similar way, with binary outcomes created for Foreclosures, Tax Liens, Comprehensive Litigation, and Fires, where a 1 was assigned to the building if the event occurred at any point in the past two years. The rest of the “bad outcomes” were represented as number/amount of occurrence within the past two years.

Details on the sub-models

Since the vast majority of the predictors show significant amounts of multicollinearity, a Principal Components Analysis was conducted with an eigenvalue threshold of 1.3. This resulted in an extraction of 27 latent factors, composed of linear combinations of the above variables.

Scatter plots, coupled with trial and error, were used to determine the regression technique employed for each of the bad outcomes. The binary outcomes were straightforward as all can be approximated with logistic regressions. The two 311 complaint count “bad outcomes” were modeled using Poisson regressions (appropriate for count data). B/C violations and ERP charges were modeled using Tobit regressions (appropriate for censored samples with many zeros).

The buildings sample was randomly broken into an in-sample training set (75%) and out-of-sample test set (25%). The 27 factors were thinned down to between 15 and 23 factors for each model by systematically removing one factor, re-running logistic regression, and observing how the model performed on the in-sample and out of sample portions of the dataset. The factors selected were those that remained significant in all iterations of the training and also remained significant in the sample reserved for testing. The final run with the completed model selection was used to output predictions for each building in the city.

See some SAS code used in the project here.

Next, all of the bad outcomes predicted were used as inputs in a meta-model to create a final score that amounted to a weighted sum of all the bad outcomes modeled.  This score was generated for every building in the city and was considered a generalized building distress metric.

Implementation and outlook

The buildings with the highest scores on the building distress metric were highlighted for comparison with the several other HPD programs that result in inspections for buildings. Our aim was to find buildings that were not currently in HPD’s purview, but ought to be. The 500 worst scoring buildings that did not overlap with any other HPD program were compiled and sent to HPD in score order for proactive inspection in the PPI initiative.

MODA and HPD will jointly keep a close eye on results from this approach as inspection results begin to come in. Special attention will be focused on comparison to previous years to get a sense of the added value of the “bad outcome” meta-model approach outlined above in comparison with the weighted sum approach taken by HPD in prior years. Furthermore, inspection results can be used as additional training data for the existing models, or to potentially create new sub-models for an upgraded meta-model. The meta-model also provides the city with a single metric to assess the level of distress of a building. This enables the city to transmit the information of all the various distress measures the city collects in a convenient, easy to interpret package.

The collaboration of MODA with the Proactive Preservation Initiative shows how data analytics can help city officials better identify hazardous buildings, and ultimately prevent citizens from living amidst unfavorable and unsafe conditions. This report also hopefully serves as a blueprint for other cities to rethink simple data approaches, such as a weighted sum model, and replace them with more sophisticated analytics.

You want to share your project story on the DataLook blog? Just drop us a line. You can also follow DataLook on Twitter.

Read more:
[1] “City in distress” on
[2] Proactive Preservation Initiative
[3] Mayor’s Office of Data Analytics
[4] DataBridge
[5] Property Shark

Project duration: Two months

Project team: MODA -Sohaib Hasan (Lead Analyst), Logan Werschky (Project Manager); HPD – Vito Mustaciolo (Head of Enforcement),  Andrew Eickman (Project Manager)

Tools: SAS (see code snippets here)



Sohaib Hasan
Sohaib is the former Chief Analyst of NYC Mayor’s Office of Data Analytics. He is now a Data Scientist at OnDeck.

Gracious editing by Danielle Feinberg.