Saturday, December 5, 2015

What we learned during the course...


[Dive Deeper]

The team struggled with the project many times throughout the semester, but every time a dilemma arose, we found a way out. "Dive deeper" has been essential in helping us achieve that.

Throughout the semester, a series of progress reports pushed us to solve the problems we had identified, step by step. Each report introduced a new big data concept – basic preliminary data analysis, temporal/spatial analysis, data visualization, and prediction modeling. The concepts were all new to us, and the requirements were even harder. Our team was therefore dedicated to diving deeper into each report to exceed expectations despite our limited time and technical ability.

Here is an example of how we dove deeper into our data sources. Initially, we tried several different sources – Tumblr, Twitter, and Reddit. We spent a lot of effort collecting the data, but when we reached the temporal/spatial analysis report, the data on hand turned out to be insufficient for a complete analysis. The Tumblr data set had neither user profile information nor timestamps, and the Twitter streaming API returned very limited information. At that point, the data was barely sufficient for the report and certainly could not exceed expectations.
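A minimal sketch of the kind of audit that exposed those gaps: count how many records carry the fields the temporal/spatial analysis actually needs. The field names ("timestamp", "user") and the sample records are illustrative, not our actual schema.

```python
# Sample records standing in for posts pulled from different sources.
records = [
    {"text": "woke up frozen again", "timestamp": "2015-10-02T03:14:00", "user": "a"},
    {"text": "sleep paralysis is terrifying"},                           # no user, no time
    {"text": "it happened at 3am", "timestamp": "2015-10-05T02:50:00"},  # no user profile
]

# A record is usable for temporal/spatial work only if it has both fields.
required = ("timestamp", "user")
usable = [r for r in records if all(k in r for k in required)]
coverage = len(usable) / len(records)

print(f"{len(usable)}/{len(records)} records usable ({coverage:.0%})")
```

Running a check like this early, per source, would have told us immediately which data sets could not support the later reports.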

Next, our team split into two groups to look for more insightful data. Although Tumblr did not work out due to API limitations, the Twitter data set was greatly expanded: in addition to the streaming API, we used the user timeline API and the user profile API to collect a much more complete data set. As a result, our final prediction model – the project's ultimate goal – relied heavily on the Twitter data.
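Combining several endpoints means the same tweet can arrive more than once, so the merge step boils down to deduplicating by tweet id. The dicts below are stand-ins for real API responses, not actual Twitter payloads.

```python
# Tweets gathered from two different endpoints; id 2 appears in both.
streaming = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
timeline  = [{"id": 2, "text": "b"}, {"id": 3, "text": "c"}]

# Keyed by id, later copies simply overwrite their duplicates.
merged = {}
for tweet in streaming + timeline:
    merged[tweet["id"]] = tweet

print(sorted(merged))   # → [1, 2, 3]
```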

The complexity of big data comes in many forms. "Diving deeper" into a big data set is a metaphor for perseverance and dedication. Our team, pure amateurs in the field, ultimately used big data to build a working prediction model. The results may be off from the real world, but the point is that we learned an approach to solving problems, to making big data insightful, and to diving deeper not only into the data set itself but also into our own thinking. Among all the terminology and technology we encountered, "dive deeper" is what carried us through the semester and made big data vivid. We are confident that knowing how to dive deeper will serve every team member well in their future careers.



[Setting the Goal]

The most crucial lesson from our team project was to have a clear vision before diving into the analysis. At the start, we began collecting as much data as possible immediately after settling on our topic – sleep paralysis. Anticipating that data collection and cleaning would take a long time, we wanted to take action as soon as possible. Even though everyone on the team understood the importance of defining a goal up front, we failed to do so. As a result, we struggled with the goal of our analysis and prediction almost every step of the way. Fortunately, we managed to set the purpose of the data analysis before performing the technical work, but we still felt lost at certain points. If we had clearly thought out and defined the purpose of studying sleep paralysis, the project would have gone much more smoothly.





The true value of any data analysis, big data included, lies in the goal of the analysis. In the age of information explosion, the challenge we face is no longer too little information but too much. How to ride the waves of information without losing our purpose is the key question, not only in data analysis but in life. An appropriate and meaningful goal is the first step toward success.

Setting a good goal for data analysis requires considering multiple aspects. Below are the three most important ones our team learned from the project:

Limitations
During the data collection stage, we used APIs from Twitter, Tumblr, and Reddit. One thing we were not prepared for is that they offer only very limited data. That was frustrating for our team, since an essential part of big data analysis is temporal and spatial information at the individual user level. These limitations should be taken into account when setting the goal.

Data Set Attributes
Understanding the data set we will be working with is crucial before setting a goal. We learned late in the project that one third of the location information in our Twitter data set was invalid. Such exploratory findings are vital in deciding the direction of the analysis. For example, if we had had high-quality location data for sleep paralysis patients, we would have focused our analysis on locations.
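A finding like "one third of the locations are invalid" comes from a quick exploratory pass over the location field. The sample values and the validity heuristic below are toy stand-ins, not the rule we actually used.

```python
# Raw location strings as users typed them into their profiles.
locations = ["Tucson, AZ", "", "the moon", "Phoenix, AZ", None, "Mars"]

def looks_valid(loc):
    # Toy heuristic: non-empty and shaped like "City, State".
    return bool(loc) and "," in loc

invalid = sum(1 for loc in locations if not looks_valid(loc))
print(f"invalid share: {invalid / len(locations):.0%}")
```

Because Twitter profile locations are free text, any real validity check would need geocoding rather than a string heuristic; the point is to measure the problem before committing to a location-based goal.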

Business Value
Business value is the core component of big data analysis, and it is the part we struggled with most when trying to set a goal. For sleep paralysis, it was hard to think of appropriate business scenarios due to our lack of domain knowledge. Domain knowledge helps you dive deeper into the analysis and guides you at each step of the way.



[Data Collection & Data Cleaning]

This is the first course that required us to collect streaming data rather than just download data sets from websites, and the first that covered the whole process, from data collection to model construction. That meant starting the project from scratch. Before this course, we would search everywhere for qualified data sets, compare them, and pick the one that was easiest to model and analyze – in other words, data sets already cleaned by others. All we had to worry about was building models and making predictions. But that is not what doing data analysis is really about.

This course made us realize that collecting and cleaning data usually take most of the time. Before building models, we had to decide how to collect our data: write programs to pull it from websites, or design questionnaires and send out surveys? If collecting from websites, which techniques should we use? These questions must be answered before the project even begins. After that comes cleaning and transforming the data. This stage takes most of the time, which is the opposite of what we used to think. We had to decide which format fit the project best and how to store the data, and to determine which parts were valuable for further analysis and which should be discarded. We believe this was the most important part of the entire project – we spent hours upon hours debating it. In this course, we learned how to select and decide on a project goal, determine proper data collection methods, and audit data for quality.
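The clean-and-transform step described above can be sketched as a small pipeline: parse each raw record, discard the malformed ones, and keep only the fields worth analyzing. The field names and the raw lines are illustrative.

```python
import json

# Raw lines as they might arrive from a collector; one is malformed.
raw = [
    '{"id": 1, "text": "cant move, cant scream", "lang": "en", "junk": "xyz"}',
    'not even json',
    '{"id": 2, "text": "pesadilla otra vez", "lang": "es"}',
]

keep = ("id", "text", "lang")
cleaned = []
for line in raw:
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        continue                                   # discard malformed rows
    cleaned.append({k: rec[k] for k in keep if k in rec})

print(len(cleaned))   # malformed row dropped
```

Writing the result out as JSON Lines (one record per line) is one common storage choice, since it streams well and appends cheaply.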



[Gephi and Network Analysis]

Beyond the learning curve of data collection and cleaning, we have enjoyed using Gephi to visualize our network. It is a powerful tool for analyzing and presenting networks. We are proud that our "artwork" is on public display as part of the Eller MIS department.

During the visualization assignment, we ran into some technical issues with Gephi. The network we tried to load was too large; the software would run but usually crashed mid-process. It took us many hours and days to load the data, calculate centrality, and visualize the network. The lesson we took from this is to use the underlying XML (GEXF) file to modify the network outside Gephi. The file records all data for each node and edge, from positions to colors and centrality values. Here are some sleep paralysis community visualizations we created after the assignment:


The entire network

Four communities in the network

Another four communities in the network
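The trick of editing Gephi's XML file outside the GUI can be sketched with the standard library. The snippet below uses a simplified stand-in for the real GEXF schema (and made-up centrality scores), but the ElementTree pattern of loading, editing attributes, and writing back is the same.

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for a Gephi-exported node list.
doc = ET.fromstring("""
<graph>
  <nodes>
    <node id="0" label="userA" size="1.0"/>
    <node id="1" label="userB" size="1.0"/>
  </nodes>
</graph>""")

# Example edit: scale node size by a precomputed centrality score,
# without ever opening the (crash-prone) GUI.
centrality = {"0": 0.8, "1": 0.2}
for node in doc.iter("node"):
    node.set("size", str(10 * centrality[node.get("id")]))

print(doc.find("nodes/node").get("size"))   # → 8.0
```

For a real GEXF file, the tags live in the GEXF namespace and colors sit in a separate `viz` namespace, so the element paths would need namespace prefixes, but the approach is identical.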


Struggles aside, Gephi gives us many ways to analyze the network and ask questions. Looking at our network, we found ourselves asking:

+ Which community does this node belong to? What does that community have in common?
+ Why does it belong to that community? Why does this node have the highest/lowest betweenness/closeness centrality?
+ Who is this node? Who does it connect to?
+ How does this node affect other nodes and the network as a whole?

Those questions helped us understand and dive deeper into the network. We soon realized that the more of our own questions we answered, the more questions we wanted to ask. Many things were revealed and discovered during these question-and-answer exercises. For example, according to our network, many people who said they had experienced sleep paralysis follow heavy metal music accounts on Twitter. Given the time limits of the semester, we could not verify whether there is a real relationship between sleep paralysis and heavy metal music, but it would make good research toward discovering more about what causes sleep paralysis. We could then build a model to predict who potentially has sleep paralysis. The idea could also be extended to networks of people with high stress levels, depression, or suicidal thoughts. Those symptoms may be among the many factors behind the rise in mass shootings worldwide over the last five years. If we could build a predictive model on this subject, we might be able to help depressed people and prevent potential mass shootings.
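The centrality questions above can be made concrete on a toy interaction graph. Closeness centrality of a node is (n−1) divided by the sum of its shortest-path distances to all other nodes, computable with plain breadth-first search; the graph below is invented for illustration.

```python
from collections import deque

# Toy undirected interaction graph (adjacency lists).
graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B", "E"],
    "E": ["D"],
}

def closeness(g, src):
    # BFS from src to get shortest-path distances to every node.
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return (len(g) - 1) / sum(d for d in dist.values() if d)

best = max(graph, key=lambda n: closeness(graph, n))
print(best)   # the node bridging the two clusters
```

Gephi computes these statistics for you, but knowing what the numbers mean is what lets you ask "why does this node have the highest closeness?" in the first place.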



[Network thinking]

Over the course of four months, Dr. Ram and Devi have completely changed our understanding of big data and introduced a whole new perspective on big data analytics. Our team wants to make a specific note about network thinking. Before this class, each of us had done some data analysis on our own, and from that experience, identifying a topic was the most difficult part, because we always started from the available data sets. When certain information was unavailable, we tended to give up. This selection process ruled out a lot of interesting topics.

In this class, Dr. Ram introduced network thinking. It can be applied to almost any topic, especially social media analysis. Network thinking allows us to incorporate data sets and information that are related to, but not directly from, the study target. For example, our group studied sleep paralysis in the social media network. Not only did we collect data about sleep paralysis patients, we also identified their common interests by examining their interactions on social media. That would be impossible with traditional analysis.

With network thinking techniques, we are tapping into the future of data analysis. According to Dr. Albert-László Barabási, people in the industry are becoming aware of network effects thanks to technological advances, yet we are still at the stage of using network thinking mostly as a buzzword. In the article "Thinking in Network Terms", Dr. Barabási explains the stages we must pass through to harness this power. The first stage is thinking in network terms, which is exactly what Dr. Ram taught us in this class.

We were impressed by the Smart Campus and asthma emergency room visit examples, which are perfect demonstrations of how to apply network thinking to real-world problems. In the Smart Campus example, Dr. Ram looked into students' interactive behaviors by tracking their campus card usage. That data might not mean much under traditional analysis techniques, but with network thinking we can dig deeper into the data set, incorporate temporal and spatial variables, and even compare students' interactive behaviors with their grades. Dr. Ram's team was able to build a model predicting how likely a student is to drop out of college. Network thinking is extremely powerful, and the concept is not limited to data analysis – we can use it to make daily-life decisions too. There is also a great TEDx talk in which Dr. Ram explains her idea of creating a smarter world with big data and network thinking.



REFERENCE: https://edge.org/conversation/albert_l_szl_barab_si-thinking-in-network-terms

Saturday, November 14, 2015

Big Data in Urban Planning

Photographer Catherine Hyland (CNN)

A few days ago, I came across an article on a Chinese news website about ghost cities in China that had been identified by Baidu's Big Data Lab. The ghost city phenomenon has long been noticed and reported by news agencies, but there was never concrete statistical evidence behind horrifying pictures like the one above. Out of curiosity, I searched Google for the specific paper; below is a brief summary of the study.

Baidu's Big Data Lab uses big data analytics to measure housing vacancy rates and locate ghost cities in real time. The data comes from Baidu's own user data (including temporal and spatial information) and residential areas pinpointed on Baidu Maps. Data collection lasted over six months across 2014 and 2015. After initial cleaning and processing, Baidu used a clustering algorithm to analyze users' spatial distribution and then calculated vacancy rates based on the Baidu Maps data.
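A drastically simplified version of that density step: bucket user positions into a spatial grid and flag residential cells whose observed user count falls below a threshold. The coordinates, cell size, and threshold below are all made up for illustration; the actual study's clustering and thresholds are far more sophisticated.

```python
from collections import Counter

positions = [(0.1, 0.1), (0.2, 0.1), (0.15, 0.2), (0.9, 0.9)]  # user pings
residential_cells = {(0, 0), (0, 1), (1, 1)}                    # from the map layer

# Bucket each ping into a 0.5-unit grid cell.
counts = Counter((int(x // 0.5), int(y // 0.5)) for x, y in positions)

# A residential cell with too few observed users is a vacancy candidate.
threshold = 2
vacant = {cell for cell in residential_cells if counts[cell] < threshold}

print(sorted(vacant))
```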

To classify ghost cities, the team used two different standards to calculate the expected ranges for normally populated residential areas, and the final result takes both measures into account. Below is a picture pinpointing ghost city areas in 9 cities, by Baidu.
Vacant areas in 9 cities (Baidu Big Data Lab)


Baidu offered a list of cities where vacant residential areas were detected, labeling those that rely heavily on tourism. Since none of our team members comes from these places, it is hard for us to judge whether Baidu's results reflect the real situation. However, based on my personal experience, there are some things Baidu may have missed.

  • Multiple properties owned by one family: There are two main reasons for an apartment to be vacant: someone bought it but never actually moved in, or the unit was never sold or finished. Baidu already tried to exclude new residential projects. Even so, it is hard to get the sales progress of an ongoing real estate project, and families owning multiple homes they never occupy is common in China and should not be completely overlooked. If Baidu could include data from the secondary housing market and the rental market, the result would be more robust.
  • Predicting the next "ghost city": News reports cite many indicators of a city's tendency to become the next ghost city, but they are mostly based on changes in local housing unit prices and recent land auction prices. With Baidu's real-time analytics, ghost cities could be identified in a far more timely manner.

This ghost city study reveals one corner of the role big data analytics can play in urban planning. Beyond real estate, railways, airports, and roads also sometimes suffer from messy planning: some roads carry few vehicles while others are packed, and roads are constantly repaired because traffic exceeds expectations. Is it possible to use big data analytics to monitor road conditions? Urban planning of this kind could probably be improved with the help of big data.

For developing countries with large populations, such as China and India, building infrastructure is hard. Money, technology, and labor are needed, of course, but avoiding the emergence of ghost cities remains a challenge. According to macroeconomic principles, the market itself sometimes needs intervention, and I think the ghost city phenomenon is an example of that. People tend to look for solutions only after a problem appears, but big data provides real-time data and evidence for officials to make decisions at the planning stage.

Right now, Baidu's big data strategy focuses mainly on business use; the study mentioned above is part of an analytics package designed for real estate clients. Still, it further confirms the over-construction problem in urban China, and I personally hope more big data studies will emerge and offer new perspectives on some traditionally challenging problems.


Reference:
[1] Data mining reveals the extent of China ghost cities http://www.technologyreview.com/view/543121/data-mining-reveals-the-extent-of-chinas-ghost-cities/
[2] "Ghost Cities" analysis based on positioning data in China http://arxiv.org/abs/1510.08505
[3] China's New ghost town: Wonderland in Beijing http://travel.cnn.com/shanghai/life/chinas-new-ghost-town-nankou-wonderland-beijing-561846

Saturday, October 10, 2015

How will Big Data change the world?

  • Introduction
Based on Steve Ernst's post "5 Ways Big Data Will Change the World" on Innovation Insights (insights.wired.com), this post covers the same five areas but from a different perspective. According to Ernst, there are currently five areas where big data can potentially make a huge impact: medicine, security, urban planning, consumer products, and elections. This post will discuss medicine, consumer products, and elections.

Big data is a recently coined term describing the huge amount of data generated every day. With the growth of the internet, processors, and data storage, collecting and processing data has become easier than ever. People collect every type of data available: how we use cell phones or credit cards, how we walk, what our habits are, and so on. At first such data seems overwhelming and not very useful, but once scientists and researchers find ways to clean and process it, the big messy meaningless data becomes smart, clean, and useful. It is already being used in incredible ways to change the world and the way we live.

  • Medicine
Ideally, doctors could make better diagnoses if they had a prediction model built on the records of all patients in the world. Based on a patient's symptoms, such a model would predict the most likely illness, helping detect cancer or other serious conditions at an early stage. Prediction models in medicine would also be a great tool for preventing and controlling the spread of disease in the future.

Many companies and researchers have recently built prediction models in this area. For instance:

+ STEM (Spatial Temporal Epidemiological Modeler): IBM's prediction model for dengue fever and malaria. The project's goal is to "combat illness and infectious diseases in real-time with smarter data tools for public health". The model uses data from the World Health Organization together with local weather and environmental data such as temperature and precipitation.

+ CancerLinQ: a project collecting and analyzing data from cancer patient visits. Its goal is to provide "real-time, personalized guidance and quality feedback for physicians" and to "find the most effective therapy for each specific cancer patient". So far, the project has collected about 100,000 patient records from 27 oncology practices.

Although the outcomes of big data in medicine are still largely unproven, more and more researchers and companies are jumping into the area. As a result, many medical prediction models will be created in the near future and, hopefully, will help humanity defeat cancer and other severe diseases.

  • Consumer Products
Unlike in medicine, big data in consumer products is already being used effectively by giant retail and social media companies such as Walmart, Amazon, Google, and Facebook. Data is collected from all the different actions and movements consumers make; every click and every comment is recorded. The data reveals the true consumer: what they like, how they shop, what they want. As a result, products, prices, and services are customized to each individual customer. Products themselves are also getting smarter at adapting to customers' needs: a car that suggests the best route or reminds the driver about the speed limit, or a toothbrush that learns the customer's brushing habits and predicts the condition of their teeth.

Here is an interesting YouTube video that helps explain how big data is changing the consumer products industry:

The video brings up two general questions that big data can help with:

1. The conversion rate

One of PwC's clients didn't know how their website content was working: how it was placed on the site, how many clicks it drew, and, most importantly, the conversion rate from website click-throughs to actual purchases.
Big data enables the client to better organize and advertise their website content, and to figure out the best advertising pattern and marketing content for attracting more online customers and sales volume.
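The conversion rate itself is simple arithmetic; the hard part is the click and purchase tracking behind it. The numbers below are invented for illustration.

```python
# Tracked counts for one campaign period (illustrative numbers).
clicks = 12_500          # click-throughs on the site
purchases = 340          # completed orders attributed to those clicks

conversion_rate = purchases / clicks
print(f"{conversion_rate:.2%}")   # → 2.72%
```

What big data adds is the ability to break this one number down by page placement, ad variant, and customer segment, which is how the "best pattern" gets found.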


2. Inventory stocking problems

PwC's other client is a large US beverage company. In consumer products, inventory planning is always the biggest issue. The video mentions that the company used to stock up based on past sales, seasons, and so on. However, this didn't work out: they still faced overstock problems, and overstocking particular SKUs leads to insufficient stock of other SKUs.

Big data analysis can step in here to learn the beverage company's inventory stocking patterns and project next month's sales volume. Unlike traditional inventory planning, you don't have to enumerate seasons, holidays, and other variables by hand: the analysis learns those patterns from the data, including many variables that were previously overlooked.
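As a deliberately naive stand-in for the kind of model described, here is a per-SKU moving-average forecast. A real big data model would learn seasonality and many more variables from the data; the SKU names and monthly figures are invented.

```python
# Monthly units sold per SKU, oldest first (illustrative numbers).
history = {
    "cola-12pk":  [980, 1010, 1200, 1150],
    "water-24pk": [400,  390,  410,  405],
}

# Forecast next month as the mean of the last three months.
forecast = {sku: sum(months[-3:]) / 3 for sku, months in history.items()}
print(forecast["cola-12pk"])   # → 1120.0
```

Even this baseline makes the trade-off visible: a forecast per SKU is what lets a planner rebalance stock instead of over-ordering one item at the expense of another.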


Overall, consumers will surely gain huge benefits, but there may be little privacy left, because every single product they use will be collecting their data.

  • Election
Elections are always a hot topic, especially in America. The biggest and hardest question for every presidential candidate is how to influence people and win more votes. Big data can be the tool to answer that question.

As we discussed in class with brands, one brand can influence another and vice versa. We can apply the same concept to big data in elections: one person's decision affects many other people's voting decisions. At the household scale, a father's political views can change those of other family members; at a larger scale, a public figure can make a huge difference in the vote count for certain candidates through his or her influence.

In other words, big data can make a difference by analyzing personal and brand influence. Based on the results, a presidential candidate can devise strategies to reach the "influential nodes" who will serve as bridges connecting the candidate to the massive population around them.

Here is a video clip about how big data worked in Obama's election campaign and whether big data will influence future campaigns.



The video describes the success of the Obama campaign in using the power of social media to broadcast the right message to the right group based on each individual's preferences. According to the video, it even swayed young Jewish voters in Florida, described as a conservative group and a major Republican support base, toward the Democrats. This is very surprising, yet in hindsight expected.

Surely many factors play into an election, but big data has been in play since Obama's campaigns and will be developed further in today's elections. Used in the right place, big data's grasp of individuals' preferences and opinions could definitely change the final result. This is also why Hillary Clinton started hiring software engineers in early 2015, as soon as she decided to run for president.

The earlier she starts, the better the network she can construct; a better network leads to better decisions and better-targeted information. Once the 2016 election is done, let's see what the final result is and how big data helped shape it.


  • Conclusion

In conclusion, as Harvard Magazine puts it, big data is a big deal. It promises to change the world in many different ways.

Below is an infographic from www.visualcapitalist.com summarizing what big data is and how it can change the world.



(Image source: http://www.visualcapitalist.com/order-from-chaos-how-big-data-will-change-the-world/)






Source:

  • Ernst, Steve. "5 Ways Big Data Will Change the World." - Innovation Insights. 24 Apr. 2014. Web. 10 Oct. 2015. <http://insights.wired.com/profiles/blogs/5-ways-big-data-will-change-the-world#axzz3nr2OirZj>.
  • Bedgood, Larisa. "How Is Big Data Changing the World?" - Data Science Central. 8 June 2015. Web. 10 Oct. 2015. <http://www.datasciencecentral.com/profiles/blogs/how-is-big-data-changing-the-world>.
  • Shaw, Jonathan. "Why “Big Data” Is a Big Deal." Harvard Magazine. 18 Feb. 2014. Web. 10 Oct. 2015. <http://harvardmagazine.com/2014/03/why-big-data-is-a-big-deal>.
  • Terry, Ken. "Cancer Researchers Mine Big Data To Individualize Treatment - InformationWeek." InformationWeek. 28 Mar. 2013. Web. 10 Oct. 2015. <http://www.informationweek.com/healthcare/electronic-health-records/cancer-researchers-mine-big-data-to-individualize-treatment/d/d-id/1109305?>.
  • Galvez, John. "Made in IBM Labs: Scientists Turn Data into Disease Detective to Predict Dengue Fever and Malaria Outbreaks." IBM News Room. 30 Sept. 2013. Web. 10 Oct. 2015. <http://www-03.ibm.com/press/uk/en/pressrelease/42103.wss>.
  • Desjardins, Jeff. "Order From Chaos: How Big Data Will Change the World." Visual Capitalist. 29 July 2015. Web. 10 Oct. 2015. <http://www.visualcapitalist.com/order-from-chaos-how-big-data-will-change-the-world/>.



Friday, September 11, 2015

Relationship of characters in Game of Thrones




Game of Thrones is an American fantasy drama television series that has attracted an exceptionally broad audience at home and abroad. The series is an adaptation of A Song of Ice and Fire, George R. R. Martin's series of fantasy novels.

As the banner above points out, in Game of Thrones you win or you die. In the first four seasons of the series, the death toll among characters reached 456. Characters appear, get killed, and are forgotten by the audience. With so many characters, it is quite normal for the audience to get confused about how they all relate to one another.

Thus the TV series has become an interesting subject for data visualization. There are two main ways of illustrating how the characters are connected: the traditional hierarchy diagram, and the social network analysis diagram.


Hierarchy Diagram 


The diagram above shows clear relationships within each house through marriage or birth. The reader can easily see who is related to whom by blood. The author also added other relationships, such as servants, killers, and rivals, to showcase the connections between major characters.

A family-tree-like diagram easily shows the hierarchy of characters, and readers get a basic idea of the relationships between the major ones. However, there are over 500 characters in Game of Thrones, and it would be hard to illustrate all of their relationships in a single diagram.


Social Network Analysis 



This is an example of social network analysis of the Season 1 characters of Game of Thrones. In total, 120 characters are identified. Each character is a node, and two characters are connected if they ever interact, whether through verbal, physical, or gestural actions. The degree of each node represents the character's "air time": higher-degree nodes get bigger circles in the diagram. Without actually watching the first season, viewers can easily see that Eddard Stark is the most relevant character.
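The "air time" idea in miniature: count each character's degree from a list of interaction edges. The pairs below are invented for illustration, not the show's actual interaction data.

```python
from collections import Counter

# Each tuple is one interaction between two characters.
edges = [("Eddard", "Robert"), ("Eddard", "Catelyn"),
         ("Eddard", "Arya"), ("Catelyn", "Robb")]

# An undirected edge contributes one degree to each endpoint.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

print(degree.most_common(1))   # → [('Eddard', 3)]
```

Scaling node size by this count is exactly how the diagram makes the central character jump out before you know anything about the plot.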




Here is another example of network analysis of character relationships, this one based on the novels. In addition to the information shown in the previous diagram, it adds "kills" and time to the network: a link between nodes means a kill instead of an interaction.


Learning and Thoughts

The following table compares the pros and cons of the two methods used to visualize the show, Game of Thrones.

Hierarchy Diagram
PROS:
Ø  Shows the hierarchy of characters
Ø  Very clear about specific relationships
CONS:
Ø  Limited in size and quantity
Ø  Only surface facts are revealed

Network Analysis
PROS:
Ø  Reveals deeper connections between characters
Ø  Easily picks out central nodes (not necessarily the ones at the top of the hierarchy)
Ø  Size and number of nodes are not limited
CONS:
Ø  Not clear about specific relationships between nodes
Ø  No hierarchical view (especially important for understanding the show)

Each method excels in its own domain and has its disadvantages, but it is quite pleasant to see that, when used together to analyze this TV show, they complement each other. The hierarchy diagram provides great detail about each relationship and clearly ranks the social status of each character, while the network analysis reveals deeper connections between characters. I would say the hierarchy diagram helps the audience understand the characters better, while the network analysis prepares them for future surprises in the show.

Our team was amazed that text mining and social network analysis can also bring a new perspective to classic novels. For example, Dream of the Red Chamber, one of China's four great classical novels, would be interesting to analyze. The book has more than 40 major characters and over 500 additional ones, and beyond the complicated relationships, only about two thirds of the book is confirmed to have been written by the original author. Social network analysis could probably bring a new perspective to the research devoted to this novel. The examples above really changed our team's view of network analysis: not only can it benefit scientific research, it can also complement the entertainment in our lives.

After all, it is quite interesting to see what we learned in class helping us understand complicated relationships. At the beginning of the class, when we first heard the word "network", we automatically pictured graphs with nodes. How exciting could that be? The answer turned out to be: super exciting! As we slowly dipped our toes into network analysis under Dr. Ram's guidance, we began to appreciate how beautifully it fits into every aspect of this world. It has been applied across many scientific and social disciplines, including machine learning, particle physics, the internet, biology, supply chains, social networks, and event prediction. With rapid technological advancement and a flood of data, network analysis is becoming easier and more accessible to more and more people, enabling each individual to make better and faster decisions on the run.

Stay tuned for more insights about Big Data Analytics!




Team Big Bear,

- "Think Big, Like a Big Bear!"








Reference: 

https://en.wikipedia.org/wiki/Game_of_Thrones
http://hauteslides.com/2011/05/game-of-thrones-infographic-illustrated-guide-to-houses-and-character-relationships/
https://en.wikipedia.org/wiki/Dream_of_the_Red_Chamber
http://www.washingtonpost.com/graphics/entertainment/game-of-thrones/
https://medium.com/@nscmnto/game-of-thrones-beyond-the-power-struggles-ea4567451e67
https://en.wikipedia.org/wiki/Network_theory