Big Data Framework
- ali@fuzzywireless.com
- Mar 4, 2022
- 7 min read
Network Architecture
At a high level, a big-data-based analytical network architecture will consist of input data, a database, a big data platform and a web interface (Chauhan & Jangade, 2016). Input data is fed from various hospitals and clinics and covers medical information such as electronic health records, which contain patients' historical data, personal information, billing, etc. Medical imaging data such as X-rays, ultrasound, magnetic resonance imaging and computed tomography will also be fed into the big data framework for processing, as will sensor data collected from electroencephalograms, electrocardiograms and similar devices, which can yield critical insights (2016). Wearables will be another source, contributing long-term heart rate, physical activity and other health-related data (Tseng, Chou, Yang & Tseng, 2017). The database will store all the input data in structured, semi-structured and unstructured formats (Chauhan & Jangade, 2016): electronic health records are usually saved in structured formats, while sensor data and medical imaging data are either semi-structured or unstructured. The big data platform, which will process, filter, extract, mine and analyze the data for useful insights, will be based on Spark, Hadoop and similar technologies. A web interface with advanced visualization tools will present the resulting information for necessary actions (2016).
The data from the various sources will be stored in the Hadoop Distributed File System (HDFS) of the Hadoop core for analytical processing by Spark (Rahman, Slepian & Mitra, 2016). Spark will perform distributed and parallel computing over the cluster nodes. Spark SQL will be used for structured data such as electronic health records, while Hive will be used for unstructured data such as medical images and sensor data. Spark Streaming will be used for real-time processing of sensor data for alerts and triggers. The MLlib machine learning library will be used to perform classification, regression and clustering analysis, and GraphX will be used for relational processing of graph data. The scheduler and YARN will manage the distributed cluster nodes. Cassandra, a NoSQL database, will store the processed data for the web interface, which will be based on Apache Tomcat (2016). The processing will consist of data preprocessing, feature extraction and pattern mining (Tseng, Chou, Yang & Tseng, 2017). Figure 1 shows the overall big data framework for healthcare organizations.

Figure 1. Big data framework for health care organizations
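The three processing stages named above (data preprocessing, feature extraction, pattern mining) can be sketched in plain Python. This is only an illustrative toy, not part of the cited framework: the heart-rate samples, the valid range and the alert threshold are all invented for the example.

```python
from statistics import mean, pstdev

# Hypothetical heart-rate samples (beats per minute) from a wearable sensor;
# None marks a dropped reading, 310 is a sensor glitch.
raw_samples = [72, 75, None, 310, 74, 71, 73, None, 76, 70]

# 1. Data preprocessing: drop missing values and physiologically impossible readings.
clean = [s for s in raw_samples if s is not None and 30 <= s <= 220]

# 2. Feature extraction: summarize the cleaned window into simple statistics.
features = {
    "mean_hr": mean(clean),
    "std_hr": pstdev(clean),
    "max_hr": max(clean),
}

# 3. Pattern mining (toy rule): flag windows whose mean exceeds a resting threshold.
alert = features["mean_hr"] > 100

print(features, alert)
```

In the actual framework each stage would run distributed over Spark, but the shape of the computation per record window is the same.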
Based on the requirements and data volume, the framework can also be hosted in the cloud using the OpenStack cloud operating system in a private cloud configuration (Yang & Cao, 2016). Figure 2 shows that all of the big data processing, storage and web interface components can be hosted on an OpenStack-based private cloud for resilience, scalability, flexibility, availability and cost effectiveness.

Figure 2. Big data framework for health care organizations hosted on a private OpenStack cloud
Database
Connolly & Begg (2014) define cloud computing as a model comprising a shared pool of virtualized resources, such as networks, servers, storage, applications and services, which can be provisioned and released on the fly with minimal effort. A distinguishing feature of the cloud is that these computing resources are usually sourced from off-the-shelf commodity processors and internet connections and thus offer efficiency and reliability at low cost (Johnson, 2009). Redundancy helps in circumventing failures, load balancing, etc.
Some of the key characteristics of cloud computing are on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service (Connolly & Begg, 2014). Clouds can be deployed as public, private, community or hybrid models. Cloud computing offers three service models:
Software as a Service (SaaS) – software and data are centrally hosted and can be accessed via thin clients, e.g. Google's Gmail and Salesforce.com's sales management applications.
Platform as a Service (PaaS) – a computing platform that can be used to develop web applications using software and hardware hosted in the cloud, e.g. Microsoft's Azure and Google's App Engine.
Infrastructure as a Service (IaaS) – virtualized resources such as servers, storage, network and OS offered to consumers as an on-demand service, e.g. Amazon's Elastic Compute Cloud (EC2) and Rackspace.
SQL on Cloud - SaaS
Using the software as a service (SaaS) model, SQL can be offered in the cloud as Database as a Service (DBaaS) with full database functionality (Connolly & Begg, 2014). A management layer in DBaaS monitors and configures the database to achieve optimized scaling, high availability, multi-tenancy and effective resource allocation in the cloud.
Several architectural options are available to implement a SQL database in the cloud; some of these models are (Connolly & Begg, 2014):
Separate servers
Shared server, separate database server process
Shared database server, separate database
Shared database, separate schema
Shared database, shared schema
Separate Server
This approach provides a separate server and database for each tenant, offering a high degree of isolation and supporting large databases, high numbers of users and specific performance requirements, at high cost (Connolly & Begg, 2014).
Shared server, separate database server process
In this architecture, different groups of users have their own databases but use a shared server (Connolly & Begg, 2014). This is a typical virtualized environment where server resources such as processing are partitioned for each group of users and thus not available to the others. Performance may be an issue, but security is not.
Shared database server, separate database
Compared with the previous architecture, this model offers a separate database for each group of users, but all users share a server and database server process, which improves efficiency and resource utilization.
Shared database, separate schema
This architecture offers a shared server, shared server process and shared database. However, each group of users has its own schema, which requires a strict DBMS permission structure (Connolly & Begg, 2014).
Shared database, shared schema
This architecture shares the server, server process, database and schema, so every database table requires a column identifying the intended group of users (Connolly & Begg, 2014). This is the most effective solution in terms of cost, hardware and software, but it has the lowest data isolation. This approach therefore requires additional security effort to ensure that different groups of users cannot access each other's data.
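The shared-database, shared-schema model can be sketched with SQLite standing in for the shared database server; the table, column and tenant names are invented for illustration. The key point is that every table carries a tenant-identifying column and every query must filter on it.

```python
import sqlite3

# In-memory stand-in for the shared database; every table carries a tenant_id
# column so rows from different groups of users coexist in one schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE patient_records (
        tenant_id TEXT NOT NULL,   -- identifies the group of users owning the row
        patient   TEXT NOT NULL,
        diagnosis TEXT
    )
""")
conn.executemany(
    "INSERT INTO patient_records VALUES (?, ?, ?)",
    [("clinic_a", "Alice", "flu"),
     ("clinic_b", "Bob", "fracture")],
)

def records_for(tenant_id):
    # Every query MUST filter on tenant_id; forgetting this filter is exactly
    # the cross-tenant leak the extra security effort has to guard against.
    return conn.execute(
        "SELECT patient, diagnosis FROM patient_records WHERE tenant_id = ?",
        (tenant_id,),
    ).fetchall()

print(records_for("clinic_a"))
```

In a production DBMS the tenant filter would typically be enforced centrally (e.g. via views or row-level permissions) rather than repeated in every query.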
Alternate approaches: SQL on Hadoop in Cloud – PaaS
Another approach is to deploy SQL on Hadoop in the cloud using Platform as a Service (PaaS) offerings such as Microsoft Azure, Amazon Web Services (AWS), Google Cloud Platform and Rackspace. Analytical services such as Hive and Spark are usually preconfigured on Hadoop in a fully elastic and on-demand PaaS environment (Poggi, Berral, Fenech, Carrera, Blakeley, Minhas, & Vujic, 2016). The latest offerings from PaaS providers facilitate planning-free infrastructures by separating processing from storage, improving service times and reliability.
Alternate approaches: SQL in Cloud – IaaS
An IaaS architecture can be realized by running an off-the-shelf DBMS on a rented virtual machine hosted in the cloud (Zhang & Holland, 2015). Applications connect to the database via APIs such as JDBC or ODBC. Beyond using a virtual machine to host the database, this approach doesn't offer much, because the consumer remains fully responsible for installation, configuration, backup, recovery and system management. For persistence, the virtual machine must be attached to a persistent storage system, which makes scaling difficult.
Critical Requirements & Issues to be resolved for SQL on Cloud
Some of the important requirements to host SQL on cloud are (Bernstein, Cseri, Dani, Ellis, Kalhan, Kakivaya, Lomet, Manne, Novik, & Talius, 2011):
ACID – although atomicity, consistency, isolation and durability are standard guarantees of traditional RDBMSs, they are not universally offered by web-scale data stores such as Amazon's Dynamo and Yahoo's PNUTS.
High availability – replication of data is necessary to offer high availability while using commodity hardware.
Failure recovery – failed partitions must be reconfigured and recovered using global partition management techniques.
Scalability – processing and storage resources should scale up and down easily with minimal intervention.
Cost effectiveness – compared to hosting a traditional server-based database, SQL on cloud should be cost effective under both low-load and high-load conditions.
Security – a balance must be struck, per requirements, between complete isolation, which offers high security at high cost, and less isolation at lower cost.
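The ACID requirement, specifically atomicity, can be illustrated with a small transaction sketch using SQLite as a stand-in for any RDBMS; the accounts table, balances and transfer amount are invented for the example. Either both updates of a transfer commit, or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("billing", 100), ("escrow", 0)])
conn.commit()

try:
    with conn:  # transaction: commits on success, rolls back on exception
        # Credit succeeds first...
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'escrow'")
        # ...then the debit violates the CHECK constraint (100 - 150 < 0).
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'billing'")
except sqlite3.IntegrityError:
    pass  # the whole transfer is rolled back, including the earlier credit

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # both rows unchanged: atomicity preserved
```

A cloud DBaaS must preserve exactly this behavior while the data is replicated and partitioned across commodity nodes, which is what makes the requirement non-trivial at scale.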
Security Policy Proposal
Chandra, Ray and Goswami (2017) outlined some of the key challenges facing healthcare data today, which include sharing data in the cloud, e-governance and laws, malware attacks, medical identity theft, social issues, incorrect treatment and diagnosis, denial of insurance claims, employment issues and so on. Security of sensitive health care data in the given health care organization will be achieved by following cyber security best practices, which include (Ntiva, 2018):
1. Automated software patching and updates,
2. Employee training program,
3. IoT device tracking,
4. Strict access control,
5. Network segmentation,
6. Leveraging AI-driven technologies,
7. Implementing an incident response plan,
8. Data encryption,
9. Data loss prevention and
10. Mobile device management (2018).
Data privacy will be ensured by the following measures:
1. Sensitive health care data will be visible only to a limited set of people: the physician, the patient and billing/insurance associates with the highest level of permissions
2. Health care data will be created, stored and transmitted with the strongest available encryption
3. Health care data will be accessible only through the private cloud, reached via a virtual private network with hardware- and software-token authentication
4. Medical records will not be allowed to be copied onto any form of storage media, including computers, laptops, smartphones, external hard drives, etc.
5. Health care data will not be shared with any person or entity outside the company for research or marketing purposes
6. Explicit permission from patients will be required to share sensitive health care data with other health care providers
7. Inside the premises of health care facility, only wired network connections will be allowed
8. Anonymization techniques will be applied to remove identifiers from the data
9. Email, internet and other communication services offered to employees of the health care organization will be available only on company-supplied devices, strictly for work-related usage
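The anonymization measure in item 8 can be sketched in Python. This is a minimal pseudonymization example, not a complete de-identification scheme: the record fields, the identifier set and the salt are all invented, and in practice the salt would be a securely stored secret.

```python
import hashlib

# Hypothetical patient record; field names are illustrative.
record = {
    "name": "Jane Doe",
    "ssn": "123-45-6789",
    "age": 47,
    "diagnosis": "hypertension",
}

DIRECT_IDENTIFIERS = {"name", "ssn"}
SALT = b"org-wide-secret-salt"  # assumption: a secret kept outside the dataset

def anonymize(rec):
    # Drop direct identifiers and replace them with a salted one-way hash,
    # so records of the same patient can still be linked without revealing identity.
    pseudonym = hashlib.sha256(SALT + rec["ssn"].encode()).hexdigest()[:12]
    return {"pseudonym": pseudonym,
            **{k: v for k, v in rec.items() if k not in DIRECT_IDENTIFIERS}}

anon = anonymize(record)
print(anon)
```

Note that removing direct identifiers alone does not guarantee privacy; quasi-identifiers such as age can still re-identify patients, which is why this sits alongside the access-control and encryption measures above rather than replacing them.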
A zero-tolerance policy will be practiced toward data security and privacy across the health care organization, which means any violation will result in immediate termination of employment. All HIPAA regulations will be strictly followed and implemented to avoid penalties from the governing body. In the event of a data breach or loss, the relevant authorities will be notified immediately, along with the resolution steps taken to address the security lapse. Patients will be informed immediately if their personal data is compromised during a breach.
References
Connolly, T. & Begg, C. (2014). Database Systems: a practical approach to design, implementation, and management (6th ed.). Upper Saddle River, NJ: Pearson.
Poggi, N., Berral, J., Fenech, T., Carrera, D., Blakeley, J., Minhas, U., & Vujic, N. (2016). The state of SQL-on-Hadoop in the cloud. 2016 IEEE International Conference on Big Data, 1432-1443.
Johnson, J. (2009). SQL in the clouds. IEEE Computing in Science & Engineering, 11(4), 12-28.
Bernstein, P., Cseri, I., Dani, N., Ellis, N., Kalhan, A., Kakivaya, G., Lomet, D., Manne, R., Novik, L., & Talius, T. (2011). Adapting Microsoft SQL server for cloud computing. 2011 IEEE 27th International Conference on Data Engineering, 1255-1263.
Zhang, W. & Holland, D. (2015). Containerized SQL query evaluation in a cloud. 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity), 1010-1017.
Chandra, S., Ray, S. & Goswami, R. (2017). Big data security in healthcare. 2017 IEEE 7th International Advance Computing Conference.
Yang, X. & Cao, X. (2016). Design of cloud-based China's community care system for diabetes. 2016 International Conference on Information System and Artificial Intelligence.
Rahman, F., Slepian, M. & Mitra, A. (2016). A novel big-data processing framework for healthcare applications. 2016 IEEE Conference on Big Data.
Tseng, V., Chou, C., Yang, K. & Tseng, J. (2017). A big data analytical framework for sports behavior mining and personalized health services. 2017 Conference on Technologies and Applications of Artificial Intelligence.
Chauhan, R. & Jangade, R. (2016). A robust model for big healthcare data analytics. 2016 6th International Conference on Cloud System and Big Data Engineering.
Ntiva (2018). 10 cyber security best practices for the healthcare industry. Retrieved from https://www.ntiva.com/blog/10-cybersecurity-best-practices-for-the-healthcare-industry