The power and promise of Analytics: Three case studies
The following are examples of analytics applied in network security, healthcare and education. Each will discuss the data that is used, the process by which the data is reviewed and an algorithm is derived, the application of the algorithm, the risks raised and how they are mitigated, and how the organisation determines whether and to what extent the results of analytics will be used.
Case 1: Intel – Big Data Analytics to Improve Network Security
Security professionals manage enterprise system risks by controlling access to systems, services and applications; defending against external threats; protecting valuable data and assets from theft and loss; and monitoring the network to quickly detect and recover from an attack. Big data analytics is particularly important to network monitoring, auditing and recovery. Intel’s Security Business Intelligence uses big data and analytics for these purposes.
Most of the data analyzed for network security comes from log files of every event that occurs on a network system. Log files may include records of attempts to access a website or to download a file, system logins, email transmissions and authentication attempts. The vast amount of data generated by log files enables researchers to identify malfunctions, attacks or suspicious activity on the system. Intel Security gathers log file data from servers, clients, network devices, specific applications and specialised sensors. It also collects contextual information that helps security experts to interpret the events captured in log files. Because an enterprise system can generate five billion events per day, big data analytics is instrumental in making sense of network activity.
Data compiled in log files and contextual information is maintained in a variety of formats and must be put into a consistent format and entered into a system for analysis. For each network event, data is extracted, put in standard formats and loaded into a data warehouse. Formatting of data is an automated function that can process 11 billion new network events and more than one million events per second during periods of peak activity.
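The formatting step described above can be sketched in a few lines. This is a minimal illustration only, not Intel's pipeline: the two input formats, the field names and the functions `normalize_web` and `normalize_auth` are all hypothetical, chosen to show how records from different sources might be mapped onto one consistent schema before loading.

```python
from datetime import datetime, timezone

# Hypothetical common schema: timestamp, source, actor, action, outcome.

def normalize_web(line: str) -> dict:
    """Parse an assumed web log line: '<ISO time> <method> <path> <status> <ip>'."""
    ts, method, path, status, ip = line.split()
    return {
        "timestamp": datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc),
        "source": "web",
        "actor": ip,
        "action": f"{method} {path}",
        "outcome": "success" if status.startswith("2") else "failure",
    }

def normalize_auth(line: str) -> dict:
    """Parse an assumed auth log line: 'epoch=<seconds> user=<name> result=<OK|FAIL>'."""
    fields = dict(part.split("=", 1) for part in line.split())
    return {
        "timestamp": datetime.fromtimestamp(int(fields["epoch"]), tz=timezone.utc),
        "source": "auth",
        "actor": fields["user"],
        "action": "login",
        "outcome": "success" if fields["result"] == "OK" else "failure",
    }

# Both records end up with the same keys and timezone-aware timestamps,
# so they can be loaded into one table and compared directly.
web_event = normalize_web("2024-01-05T10:00:00Z GET /index.html 200 10.0.0.5")
auth_event = normalize_auth("epoch=1704448800 user=alice result=FAIL")
```

A production system would run this kind of mapping as an automated, parallel process; the sketch only shows the shape of the transformation.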
Because this volume of data is too large to be processed effectively, security experts distill samples of data that represent normal network behaviour to make anomalies and threats more easily detectable. By condensing data in this way, one can model anticipated threats based on identified network activity trends, geographic regions with disproportionate threat activity, and other network characteristics that signal an attack. Based on this analysis, one can create predictive models to identify new potential threats. Analytics models are continually refined based on feedback data, making possible faster responses to actual threats, more accurate predictions and real-time detection of potential attacks.
Prior to the use of big data analytics in network security, scheduled network analysis would be performed to assess the health of a network. Today, systems like Intel’s Security Business Intelligence enable real-time processing and analysis of data to identify three things: safe traffic (network activity that is known not to be dangerous or associated with threats to the system, and related trends); high-risk traffic (activity that is dangerous or associated with threats to the system, and related trends); and predictive trends. All network activity is compared against these models, and each individual network event can be flagged as safe, threatening or suspicious relative to the trends identified. The activity may then be blocked, noted or allowed. Tracking the accuracy of those decisions supports further refinement of the model.
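The comparison of each event against safe and high-risk models can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the destination lists, the request-rate baseline, the z-score threshold and the mapping from label to action are hypothetical and do not describe Intel's actual models.

```python
from statistics import mean, stdev

# Hypothetical indicator lists standing in for the "safe" and "high-risk" models.
SAFE_DESTINATIONS = {"intranet.example", "mail.example"}
HIGH_RISK_DESTINATIONS = {"malware.example"}

# Assumed mapping from label to response; the source only says that
# activity "may be blocked, noted or allowed".
ACTIONS = {"safe": "allow", "suspicious": "note", "threatening": "block"}

def classify(event: dict, baseline_rates: list, z_threshold: float = 3.0) -> str:
    """Label one event as safe, suspicious or threatening."""
    if event["destination"] in HIGH_RISK_DESTINATIONS:
        return "threatening"
    # Compare the event's request rate with a sample of normal behaviour.
    mu, sigma = mean(baseline_rates), stdev(baseline_rates)
    z = (event["requests_per_minute"] - mu) / sigma if sigma > 0 else 0.0
    if z > z_threshold:
        return "suspicious"  # far above the normal rate, even if the destination is known
    if event["destination"] in SAFE_DESTINATIONS:
        return "safe"
    return "suspicious"      # unknown destination: neither model matches

baseline = [10, 12, 11, 9, 13]  # illustrative requests per minute under normal load
event = {"destination": "intranet.example", "requests_per_minute": 50}
label = classify(event, baseline)
action = ACTIONS[label]
```

In this sketch a burst of traffic to an otherwise safe destination is noted rather than allowed. Recording whether such decisions proved correct is what would feed the model refinement described above.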
Intel Security maintains a privacy plan to address the collection and use of data. To mitigate the risk that individuals may be identified through such accumulated data, personnel access is restricted to appropriate areas of the system through a formal process that establishes whether an individual is authorized to see certain data. In addition, data may be de-identified. In some cases data sources are inherently de-identified because credentials are not associated with the access request, while in other cases logs deal primarily with identity.
Case 2: Merck – Reducing Patient Readmission Rates
Vree™ Health, a subsidiary of Merck & Co. Inc., applies analytics to big data to address patient care issues and to reduce hospital readmission rates. The focus of Vree Health’s TransitionAdvantage™ service is patients hospitalized for heart attack, heart failure or pneumonia. The leading causes of hospital readmission have been identified as failure to provide patients with necessary information upon discharge, lack of follow-up with patients, medication management issues and insufficient coordination of care. By helping hospitals identify issues that may arise after patients have left the hospital and promoting patient compliance with post-discharge care plans, the project aims to reduce readmissions that might occur within 30 days of patient discharge.
Vree Health uses data collected throughout the course of patient care — when the patient completes hospital admission forms, during the hospital stay and at the time the patient leaves the hospital.
Vree Health also uses data collected by its representatives during follow-up calls over the 30 days after discharge and data generated when patients interact with the resources available through the cloud-based web platform or mobile application, over the phone with an operator or via its interactive voice response system. This information is combined with data from more traditional sources, including weight, changes in diet, medications, other clinical data, demographic details and data gathered from third-party data sources such as the Centers for Medicare & Medicaid Services for comparison with historical controls of patient populations.
Data is preprocessed to correct any errors and to format it for analysis. Once cleaned, it is incorporated into the data warehouse where it may be organised into subcategories so that it can be more readily accessed for use in specific areas of research. For example, a data scientist and clinician could partner to investigate a data set that focuses on congestive heart failure and might exclude patients with heart attack or pneumonia. However, even though data may have been sub-categorized in this way, it is still possible to leverage the entire data set to discover broader population trends (e.g., within regions, or within or between hospitals). By analyzing relationships among the data, researchers identify factors that are likely to lead to readmission. A variety of data sets may be reviewed: researchers may analyze data, for example, across all participating hospitals, across all the hospitals within a single region or within patient cohorts defined by demographic or disease category.
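The idea of working on a subcategory while still being able to analyze the full data set can be sketched as follows. The records are synthetic and the field names (`condition`, `region`, `readmitted`) and helper functions are hypothetical; they are not Vree Health's schema.

```python
from collections import defaultdict

def select_cohort(records: list, condition: str) -> list:
    """Return only the records for one disease category."""
    return [r for r in records if r["condition"] == condition]

def readmission_rate_by(records: list, key: str) -> dict:
    """Compute the share of readmitted patients for each value of `key`."""
    totals = defaultdict(int)
    readmits = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        readmits[r[key]] += int(r["readmitted"])
    return {group: readmits[group] / totals[group] for group in totals}

# Synthetic, de-identified records for illustration only.
records = [
    {"condition": "heart failure", "region": "north", "readmitted": True},
    {"condition": "heart failure", "region": "north", "readmitted": False},
    {"condition": "heart failure", "region": "south", "readmitted": False},
    {"condition": "pneumonia", "region": "south", "readmitted": True},
]

# Narrow view: one disease cohort, compared across regions.
heart_failure = select_cohort(records, "heart failure")
cohort_rates = readmission_rate_by(heart_failure, "region")

# Broad view: the same function applied to the entire data set.
overall_rates = readmission_rate_by(records, "region")
```

The same grouping function serves both views, which mirrors the point that sub-categorizing the data does not prevent population-level analysis.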
Partnering medical facilities may apply identified trends to information about patients to address conditions that could lead to hospital readmission. Patients who exhibit certain characteristics or behaviours that would indicate that readmission is likely can be provided with specified services or assistance. For instance, a patient who does not check in with her primary care physician within seven days of discharge — a factor indicating increased likelihood of readmission — could be contacted by email or phone.
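The seven-day check-in example can be expressed as a simple rule. This is a sketch under stated assumptions: the function name `needs_outreach` and its parameters are hypothetical, and a real service would combine many such factors rather than rely on one.

```python
from datetime import date, timedelta
from typing import Optional

FOLLOW_UP_WINDOW = timedelta(days=7)  # window taken from the example in the text

def needs_outreach(discharge: date, pcp_visit: Optional[date], today: date) -> bool:
    """Flag a patient who did not see a primary care physician within the window."""
    deadline = discharge + FOLLOW_UP_WINDOW
    if pcp_visit is not None and pcp_visit <= deadline:
        return False          # checked in on time
    return today > deadline   # flag only once the window has actually elapsed

# A patient discharged on 1 March with no recorded visit is flagged from 9 March.
flagged = needs_outreach(date(2024, 3, 1), None, today=date(2024, 3, 10))
```

A flagged patient could then be contacted by email or phone, as described above.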
Use of sensitive health information raises risks that Merck takes measures to address. Vree Health pays particular attention to issues related to obtaining consent to the use of the data, to the use of de-identified rather than identified information (and when the use of each is appropriate), and to the risks raised by the application of algorithms derived from this data.
Consent forms state that the health information Vree Health collects is de-identified and used for analysis to support research to enhance care for the individual patient and others. Merck periodically reviews its disclosures to assess their readability and effectiveness. It gives patients the ability to “deactivate” their consent, but even where consent is withdrawn, the de-identified information is still used for public health research, for meta-studies and to maintain the integrity of the knowledge discovery research. Patients admitted to a participating hospital other than the facility of their initial hospital stay are asked again to provide consent. The terms of consent provided by the patient are attached to the data; if the terms change, a new consent is requested.
Because identified data is not necessary to discover trends and build analytic models, Merck uses de-identified data for knowledge discovery. However, de-identification is not irreversible. Data may be re-identified for use by clinicians who may need to know the identity and health issues of those under their care. Data incorporated into Vree Health’s patient profiles is limited to what is germane to the patient’s hospital discharge episode. If consent is granted, this information is shared with the patient’s family caregiver (e.g., a parent or adult son or daughter), the primary care physician or specialist, the transition liaisons and the nurse call centre.
Merck recognizes the potential for harm that may result from inaccurate or untrustworthy predictive models. Models and algorithms are scrutinized and validated before the interventions they suggest are applied to individual patients. Merck refines models and algorithms as more data becomes available and researchers arrive at new insights. By incorporating data about interventions and their effect, for example, researchers update and improve prediction models.
Case 3: IBM – Analytics to Reduce the Student Dropout Rate
Analytics applied to education data can help schools and school systems better understand how students learn and succeed. Based on these insights, schools and school systems can take steps to enhance education environments and improve outcomes. Assisted by analytics, educators can use data to assess and when necessary re-organise classes, identify students who need additional feedback or attention, and direct resources to students who can benefit most from them.
Alabama’s Mobile County public school system is the largest in the state, comprising 63,000 students and 95 schools. Forty-eight percent of students left school prior to graduation, a rate significantly higher than the national average. With the goal of reducing dropout rates, IBM worked with Mobile County Public Schools to apply analytics to education data to help the school system identify which students were at risk of dropping out, and which interventions would help at-risk students. Based on these insights, educators developed an individualised response to each student’s problems.
Working closely with Mobile County Schools to develop the analytic models, IBM used information that had been collected about students over the course of their schooling. It included administrative and academic data that had been gathered from each of the system’s schools, including data about attendance and test scores, and demographic information including neighbourhood, race, gender and socio-economic status. Some data was collected specifically for analytic research; other data was available from legacy systems and existing databases. This information was combined for analysis with aggregated, de-identified data related to population and ethnicity from external sources, including the U.S. Department of Education (on general trends nationwide) and various state-level government agencies, including the Alabama State Department of Education. Data from these outside sources was also used to allow researchers to test findings and to assess progress as compared to similar school districts.
Data was cleaned and formatted for analysis. Redundancies were removed and relevant data was retrieved from the database or warehouse. The data was then examined to determine the relationship between a dependent variable (e.g., whether a child will drop out in the future) and a set of independent variables (e.g., parents’ education and income; child’s neighbourhood, test scores, absences).
These relationships were used to create a model for predicting which students were at risk of dropping out. Each student’s data was entered into an algorithm to yield a score representing the student’s risk of leaving school.
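One common way to relate a yes/no dependent variable to several independent variables and turn the result into a risk score is logistic regression. The source does not say which technique IBM used, so the following is only an illustrative sketch with that technique swapped in. The two features, the training rows and all function names are hypothetical, and the data is synthetic.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def risk_score(x, weights, bias) -> float:
    """Return a score between 0 and 1; higher means higher estimated risk."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Synthetic training data: [absence rate, test score scaled to 0-1];
# label 1 means the student left school. Not real student records.
ROWS = [[0.05, 0.9], [0.10, 0.8], [0.08, 0.7], [0.40, 0.4], [0.50, 0.3], [0.35, 0.5]]
LABELS = [0, 0, 0, 1, 1, 1]

WEIGHTS, BIAS = fit_logistic(ROWS, LABELS)
high = risk_score([0.45, 0.35], WEIGHTS, BIAS)  # many absences, low scores
low = risk_score([0.05, 0.85], WEIGHTS, BIAS)   # few absences, high scores
```

With this toy data the student with many absences and low test scores receives the higher score. A real model would draw on far more variables and would need validation against held-out data before use.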
Mobile County addressed a variety of concerns raised by its use of student data for analytics. While consent to school officials’ use of personally identifiable information contained in student education records for legitimate education purposes is not required, the Mobile County Public Schools obtained parental consent to use student information and maintained conservative disclosure and access policies. The system designed by Mobile County Public Schools and IBM also provided parents with access to their children’s information through personal computers or hand-held devices.
Project administrators also recognized that sharing data beyond the school system could create vulnerabilities and lead to unintended consequences. Identifiable information was available only to parents, guardians and those within the school system. De-identified information was made available to other institutions under the guidelines outlined on the school system’s website.
Project administrators also recognized that the results of analytics could be misunderstood or misused. False positives (predictions that wrongly identify students as being at risk) may seem harmless because teachers or administrators will intervene to offer support. However, placing students in different categories (e.g., remedial course tracks) can impede their mobility within a school system (e.g., they may remain in remedial courses when they are ready for more advanced work). While the predictions may help educators more accurately place students in programs that will lead to their academic success, it is important that teachers and administrators understand the limitations of the predictions. Users were, therefore, instructed on how to understand and make appropriate decisions based on the findings of the analysis.
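One way to make the limits of such predictions concrete for users is to report how often the model flags students who did not in fact leave school. The sketch below assumes risk scores between 0 and 1 and a hypothetical flagging threshold. The scores and outcomes are synthetic, and the function names are illustrative.

```python
def confusion_counts(scores, dropped_out, threshold):
    """Count true/false positives and negatives at a given flagging threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, dropped in zip(scores, dropped_out):
        flagged = score >= threshold
        if flagged and dropped:
            counts["tp"] += 1
        elif flagged and not dropped:
            counts["fp"] += 1   # flagged as at risk but stayed in school
        elif not flagged and dropped:
            counts["fn"] += 1   # missed a student who left
        else:
            counts["tn"] += 1
    return counts

def false_positive_rate(counts) -> float:
    """Share of students who stayed in school but were flagged anyway."""
    negatives = counts["fp"] + counts["tn"]
    return counts["fp"] / negatives if negatives else 0.0

# Synthetic scores and outcomes for illustration only.
scores = [0.9, 0.7, 0.4, 0.2]
dropped_out = [True, False, False, True]

counts = confusion_counts(scores, dropped_out, threshold=0.5)
fpr = false_positive_rate(counts)
```

Raising the threshold reduces false positives at the cost of missing more at-risk students, which is the kind of trade-off users of the predictions would need to understand.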