Data Science with Python Pandas by Athena Kan

ANITA KHAN: Hi, everyone. Welcome to Data Science with Python pandas. This is a CS50 seminar and my name is Ms. Anita Khan. Just to give you a little bit of an introduction about myself, I'm a sophomore here at Harvard and I'm in Pfoho. This past summer I interned at Booz Allen Hamilton, a tech consulting firm, and there I was doing server security and data science research. On campus, I'm involved with the Harvard Open Data Project, among other things, where we try to aggregate all of Harvard's data into one central area. That way students can work with data from all across the university to create something special, and some applications that improve student life. I'm also on the school's curling team. Just to give you a brief introduction about data science, this is a rapidly evolving field that's growing so quickly. Right now on Glassdoor it's rated as the number one best job in America in 2016.

And you can have a median base salary of $116,000. Harvard Business Review also listed it as the sexiest job of the 21st century, and it's always growing. If we look at Indeed, we see the number of job postings has been skyrocketing. In the past four years alone, the number of postings have increased eight times, which is pretty incredible, just because data science is such a growing field and every company now wants to use it. If we also look at job seeker interest versus job posting, we see that there are, at max, there are sometimes 30 times more posts than there are people to fill them. And we also have at minimum still almost 20, which is incredible. And we want to fill that demand. So here I'll be teaching some data science. Stephen Few once said that numbers have an important story to tell. They rely on you to give them a clear and convincing voice.

And today, I'll be helping you to develop that clear and convincing voice. If we look at some examples of data science, we have seen things like how data science has been used to predict results for the election. And so we can see here, there is a diagram about how many different ways Clinton can win, how many different ways Trump can win, just depending on the number of different results. And this results in a very interactive and intuitive visualization. So if we want to look at the brief article here, we see here some election results. And this was just released pretty recently actually, updated 28 minutes ago. So we see here there are things like the percentage over time on the likelihood of winning. We also have where exactly the race has shifted, some things about state by state estimates divided by time.

And so this is just a really intuitive way for people all across the world to be accessing this data. And so data scientists are always taking this data and these huge spreadsheets that aren't always accessible for people to see, and that way people can actually observe what's going on. You also can see different forecasts, some different outcomes that are pretty likely, and, again, an interactive visualization for people to really understand the data that's going on. Some other ways are that we can see Obama rates are rising here. Obamacare rates are rising and there is the graph to see how that's changing. We've also used data to catch people like Osama bin Laden, and to fight crime. So data scientists have a lot of different usages across many different fields. There are many steps to data science and I'll be going through them today. So the first thing is you want to ask a question.

Ask a question, then you want to– it's important to ask a question because otherwise there's nothing to answer. Data science is a tool, and so once you have a question to answer you use data science to answer that. You can't just use data science on some arbitrary data set that you don't care too much about. Next you want to get the data. And so there are a wide variety of places to get the data, but you just want to find a data set that you also care about. After that you can explore the data set a little bit, get a better sense of what kind of descriptive statistics you're looking for. Next, you want to model the data. So what happens if you are trying to predict something years into the future? What happens if this scenario occurs? Or what happens if this predictor changes a lot? Then you want to see what could possibly happen based on your model.

And models always improve when you have more data, and so it's always good to get more data. Finally, you want to communicate all of your information. Because while it's great that a data scientist has all of this information that they found, and all these visualizations, it's really important to share that with your boss or other colleagues. That way there can be something actionable about it. So in the examples we showed before, we've seen things like how Osama bin Laden was caught using data science. But if the person who's the data scientist came up with the data and couldn't present that effectively, then that couldn't have happened. There are a bunch of different tools to help you get along, help you find all this information. So when asking a question, you can think about to your own experiences. What are some issues that you faced before? What is something that you want to know more about? You can also look at websites, like Kaggle, for example, which presents data challenges pretty frequently.

And so if a company poses a question, you can always answer them yourself. You can also ask some experts what kinds of things they are looking to answer that they might not necessarily have the capability to address. And so you can help them using the data that you find to answer their question. As for getting the data, there are many different ways to get data. You can scrape a web page, so that you can get information that way. You can also look at databases if you have access to one. And finally, a lot of different places have Excel spreadsheets or CSVs, Comma Separated Values, and text files that are really easy to work with. After that, you want to explore the data a little bit. And so we have a couple of different Python libraries, along with others, but Python seems to be pretty common in the industry. You have libraries such as pandas, matplotlib, which is more for visualization, and then NumPy as well, which works with arrays. And so after that you want to work with modeling the data.

So, extrapolating essentially. And so you can also do this with pandas, and a library that's gaining a lot of traction is sklearn, which is more for machine learning. And finally, you want to communicate your information. So matplotlib is great for creating graphs and d3 is great for creating interactive visualizations. But as we've seen before, pandas is used in both exploring and modeling. And also, pandas is built on top of NumPy and its plotting goes through matplotlib. So that's why pandas is great. So we're going to be exploring that today. Just a little bit more information about pandas. It's a Python library, as I mentioned before. And it's great for a wide variety of steps in the data science process. So, things like cleaning, analysis, and visualization. It's super easy, very quick. And it's very flexible, so you can work with a bunch of different data types, often many different types at once.

You could have several different columns with strings, but also numbers, and even strings within strings, and it's great. And finally, it integrates well with other libraries: because it's a Python library, it works with NumPy tools and other libraries as well. So it's pretty easy to integrate. Next, we'll also be using Jupyter Notebooks. So this is kind of similar to the CS50 IDE, but this is preferable for data science because you can see the graphs inline and you don't have to worry about loading things separately. You also have all of your tools and all your libraries already loaded. So if you download a package called Anaconda, that has all of these tools already. It also supports over 40 languages. So today, we'll be focusing on Python but it's great that you can share notebooks and work with many different languages as well.

So we're going to just launch into pandas. And so there are two different data types in Python pandas. So the main one is called Series and there's another great one called DataFrame. And so Series are essentially NumPy arrays. They're essentially arrays. So you can index through them, just as you did in CS50, but one difference is that you can hold a lot of different data types. So this is kind of similar to a Python list. So we can work on a couple of different exercises. So here is going to be our notebook where we're going to be working with all of our information. This way you can see everything as it goes. So you have the code here, and then if you press Shift-Enter it runs the code for you. So here in this section, we're going to be exploring different Series. And so first you want to import the library as you did in the last pset, for CS50.

So if you import pandas as pd, that pd means that you can access different modules within pandas just using the word pd. So you don't have to type pandas all the time. So if you want to create a Series, you just call pd.Series. And then, once you import NumPy as np, this NumPy command generates five random numbers, and then in the Series you'll also have an index. So let's see what it creates. As you can see, you have an index here, a, b, c, d, e. And then you have your five random numbers from just here. Because this isn't saved inside of a variable, it's just pd.Series; if you want to save it inside of a variable, you can also do the same thing. You also don't need to specify an index; the default is just 0 through 4.
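To make the Series step concrete, here is a minimal sketch of what such a notebook cell might look like. The variable names `s` and `s_default` are illustrative, and `np.random.randn` is assumed as the "five random numbers" command; the seminar's actual cell is not shown in the transcript.

```python
import numpy as np
import pandas as pd

# Five random numbers with an explicit label index a..e
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
print(s)

# Without an index argument, pandas falls back to 0 through 4
s_default = pd.Series(np.random.randn(5))
print(s_default.index.tolist())
```

Saving the result in a variable, as with `s` here, is what lets later cells refer back to it.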

Next, you can also index through them because they are essentially arrays. So can someone tell me what s[0] would return here? AUDIENCE: The first value. ANITA KHAN: Yeah, exactly. And then do you know what this one is? AUDIENCE: That's all the values up to the third. ANITA KHAN: Yup, exactly. So here you have your first value, as you had here. And then after, when you are slicing through them, it gets easier. 1 and 2. So that's a Series in a nutshell. The next type of data structure is called a DataFrame. And so essentially this is just multiple Series added together into one table. That way you can work with many different Series at once, and with many different data types as well. You can also index by both index and columns, so you can work with many different data types very quickly.
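A small sketch of the indexing and slicing just discussed, using fixed values so the results are easy to follow. One assumption to flag: the seminar writes `s[0]`, but on a Series with string labels recent pandas versions warn about (or reject) positional access through plain brackets, so this sketch uses `.iloc` for positions instead.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0],
              index=["a", "b", "c", "d", "e"])

first = s.iloc[0]         # the first value, by position
up_to_third = s.iloc[:3]  # slice: positions 0, 1 and 2
by_label = s["a"]         # the same first value, looked up by its label

print(first)
print(up_to_third)
```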

So here we're going to do a couple of exercises with DataFrames. And so first we create a DataFrame in the same way. So when we call pd.DataFrame, that means you access the command DataFrame in pandas. That means you create a DataFrame out of s. So s, remember, was this Series back up here. So we're going to create a DataFrame out of that Series, and we're going to call that column Column 1. So as you can see, it's the exact same Series that we had before, these random five numbers put into this DataFrame. And then its column is named Column 1. You can also access the column by the name if you want to have a specific column. So if you call df, which is the name of the DataFrame, and then in brackets ["Column 1"], kind of like what we did in the last pset with accessing dicts, then you can access that first column. It's also really easy to work with different functions applied to that.
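As a rough illustration of building a DataFrame from the Series and reading a column back by name, the following sketch uses the dict form of the constructor. That is one of several ways to do it and is an assumption here, not necessarily the call used in the seminar notebook.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

# One column, named "Column 1", holding the values of s
df = pd.DataFrame({"Column 1": s})
print(df)

# Dict-style access returns that column as a Series
col = df["Column 1"]
print(col)
```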

And so for example, if we wanted to create another column called Column 2, for example, and we want that column to be the same as Column 1 but multiplied by 4, it would just be like adding another element in that dict. So then it would be df, and then in that dict we'd be creating something else called Column 2. And then that's equal to the Column 1 times 4. And so as you can see, we've added a second column that's exactly the same, except it's multiplied by 4. So it's pretty intuitive. You can work with many different other functions as well. And so if you want to add something like df times 5, or like subtracting, or you can even add or subtract two different columns, you can add multiple columns, it's pretty flexible with what you can do. You can also work with other manipulations, such as a thing like sorting.
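The column arithmetic described above can be sketched as follows. The numbers are fixed placeholders rather than the random values from the seminar, and the extra "Sum" column is a hypothetical addition to show arithmetic between two columns.

```python
import pandas as pd

df = pd.DataFrame({"Column 1": [0.5, -1.0, 0.25]}, index=["a", "b", "c"])

# New column: every value of Column 1 multiplied by 4
df["Column 2"] = df["Column 1"] * 4

# Columns can also be combined element by element
df["Sum"] = df["Column 1"] + df["Column 2"]
print(df)
```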

So if you want to preserve– you can do other things such as sorting. So if you want to sort by Column 2, for example, you can take this column and you can call df.sort_values and then by Column 2. And if you want to preserve it, make sure to set it to a variable because this just sorts it and it doesn't actually affect how the DataFrame actually looks. And so if you sort by Column 2, you can see that the whole DataFrame is just sorted with these indices staying the same. So, for example, you see that Column 2, this one has the lowest value so it's going to be at the top, and then you also have the indices preserved sorted by that Column 2. You can also do something called Boolean indexing, which is where– so if you recall from a Python array, if you just call, for example, is this array less than 2, then it should return trues and falses to see whether each element is actually less than 2.
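For the sorting step, a minimal sketch with placeholder values: `sort_values` returns a new, sorted DataFrame and leaves the original alone, which is why the result has to be assigned to a variable if you want to keep it.

```python
import pandas as pd

df = pd.DataFrame(
    {"Column 1": [0.5, -1.0, 0.25], "Column 2": [2.0, -4.0, 1.0]},
    index=["a", "b", "c"],
)

# Sort rows by Column 2; each row keeps its original index label
sorted_df = df.sort_values(by="Column 2")
print(sorted_df)

# The original DataFrame is still in its original order
print(df)
```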

So this same concept can be applied to a DataFrame. And so if you call this DataFrame, if you want to access things that in Column 2 are less than 2, then you can just do syntax like this and it would return every row where Column 2 is less than 2. As you can see, the first row has been eliminated because Column 2 is not less than 2. You can also apply things called anonymous functions. And so if you have something called lambda x, which is the minimum of the column plus the maximum of the column, then you can apply that to your DataFrame and that should return the result for each column. So, for example, if you run this you take the minimum of Column 1 and then you add it to the maximum of Column 1. And this result is negative 1.31966.
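Here is a small sketch of both ideas, Boolean indexing and applying an anonymous function, on placeholder data. The values are chosen so the first row fails the "less than 2" test, mirroring what happens in the seminar; they are not the seminar's numbers.

```python
import pandas as pd

df = pd.DataFrame(
    {"Column 1": [0.5, -1.0, 0.25], "Column 2": [2.0, -4.0, 1.0]},
    index=["a", "b", "c"],
)

# Boolean indexing: keep only the rows where Column 2 is below 2
small = df[df["Column 2"] < 2]
print(small)

# apply runs the lambda once per column; x is that column as a Series
min_plus_max = df.apply(lambda x: x.min() + x.max())
print(min_plus_max)
```

With these values the first row is dropped from `small`, and `min_plus_max` holds one number per column.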

And then you do the same thing for Column 2 as well. You can also apply another anonymous function. Do you want to try it out? Give an example? So it's something like df.apply(lambda x: ...). AUDIENCE: A mean? ANITA KHAN: Mean? OK. mean(x). Oh, whoops. That's why you don't do live coding during seminars. You can also call np.mean(df), and that should return the mean as well. Finally, you can describe different characteristics of that DataFrame. And so if you do something like df.describe(), it returns how many values are inside the DataFrame. You can also find things like the mean, standard deviation, minimum, quartiles, and finally, the maximum. So it's pretty easy: once you have all that data loaded into a DataFrame, calling df.describe() gives you pretty essential statistics about that DataFrame.

That way you can work with different things pretty quickly. So if you want to subtract and add the mean, then you have these values here already. If you want to access things like the mean exactly, and table is the result of describe, then you can call table.loc["mean"], and that should access the means as well. So we're going to go through the data science process together. So, the first thing we're going to do is ask a question. So what are some data sets that you're interested in and what kind of questions do you want to answer with data? AUDIENCE: Who's going to win the election? ANITA KHAN: Who's going to win the election? That's a good one. AUDIENCE: Anything to do with stock prices. ANITA KHAN: Stock prices. What kind of things with stock prices? Kind of similar to CS50 Finance? Or like if you want to predict how a stock moves up and down? AUDIENCE: Yeah.
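Going back to describe for a moment, this sketch shows what the summary table looks like in code and how a single statistic can be pulled out of it. The name `table` is illustrative, and the data is a placeholder.

```python
import pandas as pd

df = pd.DataFrame({"Column 1": [1.0, 2.0, 3.0, 4.0]})

# describe() returns a DataFrame whose rows are the statistics:
# count, mean, std, min, 25%, 50%, 75%, max
table = df.describe()
print(table)

# Rows of the summary are selected by label with .loc
mean_value = table.loc["mean", "Column 1"]
count_value = table.loc["count", "Column 1"]
```

Because the statistics are row labels rather than column names, `.loc` is the way to reach them.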

ANITA KHAN: OK. All very interesting questions. And the data is definitely available. So for something like, yeah, we can go through that later. So today we're going to be exploring how earth's surface temperatures have changed over time. And this is definitely a relevant issue as global warming is pretty prevalent and temperatures definitely are increasing a lot. We had a very hot summer, a very hot winter. So this might be something we want to explore, and there are definitely data sets out there. So for getting the data in this kind of example, so where do you think you'd get data about who's going to win the election? AUDIENCE: I'm sure there's several databases. Or past results. ANITA KHAN: Past results of previous elections? AUDIENCE: Yeah. And polls. ANITA KHAN: Where do you think you could get data about elections? AUDIENCE: Previous polls. ANITA KHAN: Yeah, definitely. And as we saw before in the New York Times visualization, that's how a lot of people predict how the elections are going to go, just based on aggregating a lot of different polls together. And we can take maybe the mean and see who's actually going to win based on all of these.

That way you account for any variance, or where different places are, and who different polls are targeting, and so on. So for something like stock prices, what would you look at? Or where would you get the data? AUDIENCE: You could start with Google Finance. ANITA KHAN: Google Finance. Yeah. Anything. AUDIENCE: Like Bloomberg or something like that. ANITA KHAN: Yeah, for sure. Same thing? AUDIENCE: Same places, I guess. ANITA KHAN: Same places. Yeah. And what's really cool is that there are industries that are predicated off of both of the questions that you're asking. And so if you can use data science to predict how stocks are going to move, that's how some companies operate. That's how they decide what to invest in. And then for elections, if you can predict the election, that's life changing. And so here we're going to get the data from this place called Kaggle. As I mentioned before, it's where a lot of different companies pose challenges for data science.

And so if we look here, there is a challenge for looking at earth's surface temperature data since 1750. And it was posted by Berkeley Earth pretty recently. What's great about Kaggle is that you can also look at other people's contributions or discussion about it if you need help with how to access different types of data. So if we look at a description of this data, we see a brief graph of how things have changed over time. So we can definitely see this is a relevant issue. And you can see from this example of data science already, it's pretty intuitive to see what exactly is happening in this graph. We see that there is an upward trend happening over time, and we see exactly what the anomalies are around this line of best fit.

We also see that this data set includes other different files, such as global land and ocean temperature, and so on. And the raw data comes from the Berkeley Earth data page. So if we download this– it might take a little bit to download because it's a huge data file, because it's containing every single temperature since 1750 by city, by country, by everything. So it's a pretty cool data set to work with. There's a lot of different data sources. And while this isn't quite like technically big data, this definitely is a chance to work with a large data set. So if we look here, we can look at global temperatures. So here you can see some pretty cool information about the data. You see that it's organized by timestamp. You can look at land average temperatures, you can see here. Might be kind of hard to tell. Land Average Temperature Uncertainty, that's a pretty interesting field. Maximum Temperature, Maximum Temperature Uncertainty, Minimum Temperature.

So it's always great to look at a data set, like once you actually have it, what kinds of fields there are. And so there's things like date, temperature. We see that there are a lot of different blanks here, which is kind of curious. And so maybe this could get resolved later in the data set? And we see that this goes all the way up to the 1800s so far. And then we see here that the other fields are populated here. So it's possible that before 1850, they just didn't measure this at all, which is why we don't have information before. So this is something to keep in mind as we work with the data set. And so we see, there's a lot of information, a lot of really cool data. And so we want to work with that. And so we open up our notebook. You import all of the libraries you need. The great thing about Jupyter Notebook is that it keeps in memory the things that you've loaded before.

So up here we loaded pandas and NumPy already, so we don't have to load them again. And so we just import matplotlib, which is, again, for visualizations, and graphs, and everything. And we also import NumPy, which we already imported, but it helps you work with arrays and everything. This %matplotlib inline allows you to look at graphs within Jupyter Notebook. Otherwise it would just open up a new window, which can get kind of annoying. And so if you want to see it inline, that way you can work with things pretty quickly rather than switching between windows, it's a good thing to use. And then this is just a style preference for how you want your graphs to look. And so if you use the default, it's just like blue. I wanted it to be red and gray, and nice so I changed it. So if you call pd.read_csv, again, remember that pd is referencing pandas. And so this is accessing a function in pandas called read_csv. So it lets you load in a CSV, just with a single command, and that way it loads into your DataFrame. And so if we call that, yeah. So this looks exactly the same way we had it before, or had it in the Excel spreadsheet, just loaded into a DataFrame.
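As a self-contained stand-in for loading the Kaggle file, the sketch below reads a tiny CSV from an in-memory string instead of from disk. The column names `dt`, `LandAverageTemperature` and `LandAverageTemperatureUncertainty` are assumed from the fields mentioned in the seminar, and the three rows are made-up placeholders, not real measurements.

```python
import io

import pandas as pd

# Made-up placeholder rows; the real file has thousands of them
csv_text = (
    "dt,LandAverageTemperature,LandAverageTemperatureUncertainty\n"
    "1750-01-01,3.0,3.5\n"
    "1750-02-01,3.1,3.7\n"
    "1750-03-01,,\n"
)

# With a real file this would be pd.read_csv("<path to the csv>")
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows (up to five)
print(df.tail(2))  # last two rows
print(df.shape)    # (number of rows, number of columns)
```

Blank cells in the CSV come back as NaN, which is how the gaps seen in the spreadsheet end up in the DataFrame.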

So again, very simple. If you want to see the rest of the file, you just call df. I just chose head(), that way head shows the first five elements rather than every single thing, because it was a pretty long data set. But it does show the first 30, and then also the last 30 I believe. And so you can see that there are 3,192 rows and 9 columns, just from loading it in. You can also call tail(), and then that should show you the last five elements. You can also change the number within here to be the last 10 elements. So you can see things pretty easily. Next, we want to look at just the land average temperature. That way we can work with just the temperature for now. The others are a little bit confusing to work with, and so we want to just focus on one column for now.

Plus, that's what we're interested in. We want to see how temperature has changed over time. So we want to look at just temperature. And so this is a method to index. And so this takes the columns from 0 all the way up to 2, where it stops right before 2. And then it gets to zeroth column, and the first column. The zeroth column, remember, is the datetime, and the first column is the land average temperature. And then again, we want to take the head(). So as you see, it's just the datetime and the land average temperature. And we also changed the DataFrame to be updated to this. That way we can just work with just these rather than the rest of them. Next, as we saw before, df.describe was a very helpful tool. And so if we run that again, that will allow us to see basic information about it.
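The column selection described above (columns 0 up to, but not including, 2) can be sketched with `.iloc`, which selects by position. This is an assumption about the mechanism; the transcript does not show which indexer the seminar notebook used. Column names and values are placeholders as before.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dt": ["1750-01-01", "1750-02-01", "1750-03-01", "1750-04-01"],
    "LandAverageTemperature": [3.0, 3.1, np.nan, 8.5],
    "LandAverageTemperatureUncertainty": [3.5, 3.7, np.nan, 2.4],
})

# All rows, columns at positions 0 and 1 (the stop value 2 is excluded)
df = df.iloc[:, :2]
print(df.head())

# describe() summarizes the numeric column; count ignores missing values
summary = df.describe()
print(summary)
```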

And so we see that there are in total 3,180. And then we also have a mean temperature. We have a standard deviation for temperature. We have our minimum and maximum as well. And we also see that we have NaN values, which means it's not a number. So that's a little bit curious. We might want to explore that a little bit. In all likelihood, it probably is just that there are Not a Number values in there, and so it's hard to find quartiles when some of them are not valid numbers. So once we have a description, we can see we've gained insights already about it, just from those couple lines of code up here. And so we see that the mean temperature from 1750 to 2015 was 8.4 degrees, which is interesting. Next, we want to just plot it, just so we have a little bit of a sense of how the data is trending. We just want to plot it, just to see we can explore some of the data. And plus, it's pretty easy to apply it. So even if it doesn't look too great, then we aren't losing anything. And so, plt.

Again, we imported matplotlib, which is the library that helps you plot. matplotlib.pyplot helps you plot. And then if you import it as plt, you can access all the functions from just calling plt. And so we have plt.figure. plt.figure(figsize), that just defines how big that graph is going to look. And so we say it's going to be 15 by 5. And so the width is a little bit bigger, and that's to be expected because it should be like a time series graph, and so there will be more years than there are actual temperatures. Next, we're going to actually plot the thing. And so since we have a DataFrame that has all that information, we can just plug that in. And this command knows exactly how to sort between the x and y, so you just need to pass that DataFrame. The only thing is that matplotlib in this case would plot a Series. You can also plot multiple of them. But the Series, as you remember from before, is a one-dimensional array with an index.

And so in this case that land average temperature, or the temperature itself, would be what you plot on your y-axis. And then the x-axis would be the index. So that would be what year you're in. You can also plot a whole DataFrame. And then this would just plot all the different lines all at once. So if you had a land maximum temperature, then you can see the differences between that. We also have plt.title, that changes the title of a whole graph. You have the x label, year, and y label. And finally, you want to show the graph. You also don't have to call it in Jupyter Notebook, because the graph shows up inline either way. And so you see from this graph, it's a little bit noisy. And so we see that there seems to be an upward trend, but it's kind of unclear because it looks like things are just swinging back and forth.
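A rough sketch of the plotting calls just walked through, on six placeholder values. The title and label strings are illustrative, and the `Agg` backend line is an assumption added only so the snippet runs as a plain script without a display; in a notebook you would rely on inline plotting and could call `plt.show()` instead.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for scripts
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder monthly values that swing between "winter" and "summer"
temps = pd.Series([3.0, 8.5, 14.0, 9.0, 2.5, 8.0])

fig = plt.figure(figsize=(15, 5))  # wider than tall, for a time series
plt.plot(temps)                    # y = values, x = the Series index
plt.title("Average land temperature")
plt.xlabel("row index (one row per month)")
plt.ylabel("temperature")

# Inspect what was drawn: the x values are the index 0..5
ax = plt.gca()
line = ax.get_lines()[0]
print(list(line.get_xdata()))
```

Note that the x-axis shows the row index, not the year, which is the oddity the seminar picks up on next.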

Do you have an idea why that might be the case? AUDIENCE: It's connecting the dots. ANITA KHAN: Yeah, exactly. Yeah, that's exactly right. And so we also see from the table up here, there are different months listed. And so, of course, the temperature will decrease during the winter and increase during the summer. And so as it connects the dots, as you said, then it'll just be connecting the dots between winter and summer and it will just be swinging a lot. So this graph is kind of messy. We want to think about how exactly we can refine it. But we do see that there is a general upward trend, which is a good thing for us to see, probably not good for the world, but it's OK. We can also pretty clearly see what the ranges are. And so we see here, you can get from as low as a couple of negative degrees up to almost 20 degrees, which is consistent with our df.describe findings.

We also see that it goes from 0 to 3,000, or almost 3,200, which is not quite correct because we only had the years from 1750 to 2015. And so there's something incorrect here. It's probably referencing the months maybe. AUDIENCE: I think it's referencing the indexes? ANITA KHAN: Yeah, exactly. Referencing the indexes, but each row is a month. And so it would be like the zeroth month, first month, and so on. So how do you think we can make this graph a little bit smoother, so that it doesn't go up and down by month? AUDIENCE: Make a scatterplot? ANITA KHAN: Scatterplot. But if you had the points, yeah, we can try that. So plt.plot(kind=scatter). And then for a scatterplot, you need to specify the x and the y. So we could have x equals the index, as we said before.

And the y equals the actual thing itself. plt.scatter. Scatterplot. So we still see a couple different– it's still a little bit messy. It's still kind of hard to see exactly where everything is. What else do you think we could do? So right now we have it indexed by month. What do you think we could change about that? AUDIENCE: You can have dates by year. ANITA KHAN: Yeah, exactly. So if we ever– AUDIENCE: Like the max temperature. ANITA KHAN: Max temperature, yup. All very good ideas and something to definitely explore. So for now we can just look at the mean of the year, or average of the year. That way we can see because each year has all of the months, it would make sense just to average all of them, just to see how that's been changing. However, we notice when we look at the timestamp column, which is called DT, if we access that and called the type, it's actually of type str.

So that means all of these dates are recorded inside of the file as a string rather than a date. So that would mean if we want to parse through them, we have to look through every single letter inside of the DT. So what might be helpful is to convert that to something pandas has called a DatetimeIndex. Pandas is very adapted towards time series data. And so, definitely, there are a lot of tools in their library for this exactly. So if we convert it to a DatetimeIndex, we can also group it by a year. And this is a syntax where we take the year in the index, and then we also take the mean of every single one. So if we run that, and then we plot that again, that's a little bit smoother. So we can definitely see that there is a trend over time. And as there are a lot of different spikes, so it's not incredibly uniform, which makes sense because there are peaks and valleys for years. But as a whole, this data set is trending upwards.
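The conversion and yearly averaging might look roughly like the sketch below. The data is a four-row placeholder, and storing the year in a helper column named `year` before grouping is an illustrative choice, not necessarily the syntax used in the seminar notebook.

```python
import pandas as pd

df = pd.DataFrame({
    "dt": ["1750-01-01", "1750-07-01", "1751-01-01", "1751-07-01"],
    "LandAverageTemperature": [2.0, 14.0, 3.0, 15.0],
})

# The dates start out as plain strings
print(type(df["dt"].iloc[0]))

# Parse them into a DatetimeIndex, which exposes .year, .month, ...
times = pd.DatetimeIndex(df["dt"])
df["year"] = times.year

# One mean temperature per year
yearly = df.groupby("year")["LandAverageTemperature"].mean()
print(yearly)
```

Averaging within each year removes the winter-to-summer swing, which is why the yearly plot looks smoother.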

So this is wrapping up the exploratory phase. But then we notice there is something pretty anomalous here. We see right around the 1750, in the beginning with 1750s, there's a huge dip down. So before while it was at 8.5 before, it went all the way down to 5.7. So let's see. There might be a couple of reasons why this might be the case, such as maybe there was an ice age for that one year or something and then it went back up to 8.5. But that's probably not what happened. So let's look into the data a little bit. Maybe they messed up something, maybe someone mistyped a number. So that it says negative 40, or negative 20 instead of 20, or something like that. And so if we look at the data– and it's important to check in with yourself, make sure that what you're getting is reasonable– we can look in. And so we want to see what caused these anomalies.

Because it was in the first couple of years, we can call something like .head(), which shows the first five elements. And we see here that 1752 is what caused this. And for whatever reason, even though all of the years previous and after had 8 degrees and then 9 degrees. It just goes back down to 6.4 degrees, which matches what we found in our plot. So let's look at that data set exactly. So, as you remember, we can filter by Booleans. So if we want to see if the year of that grouped DataFrame is equal to 1752, we can see what happened. And so we see here, in this case we can see every single temperature from every single month, and the land average temperature, as long as that year is 1752. And because it's a DatetimeIndex, we're allowed to do something like that, rather than searching the string for every single thing, looking for 1752. And so we see here in this exploration that land average temperature, so while this January makes sense that it's pretty low, we also have things like Not a Number.
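Filtering down to a single year with a Boolean mask can be sketched like this. The rows are placeholders (including the missing value), and `rows_1752` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dt": ["1751-12-01", "1752-01-01", "1752-02-01", "1753-01-01"],
    "LandAverageTemperature": [2.0, 0.5, np.nan, 1.5],
})

times = pd.DatetimeIndex(df["dt"])

# True for each row whose date falls in 1752, False otherwise
mask = times.year == 1752
rows_1752 = df[mask]
print(rows_1752)
```

Comparing `times.year` to a number works because the dates were parsed; with raw strings you would have to search the text of every date instead.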

And you have things, like you have a couple of the numbers but then all these summer months are just gone. And so what happens is when you average this, where it might not have a number, it'll just average the existing values. And so because you're missing those summer months it'll be low, even though it's not supposed to be. So what exactly can we do about that? So there are a lot of null values. You want to see what exactly we can do. Also, this might be affecting results in the future. Because what happens if there are other null values in other years? It wouldn't be just exclusive to 1752. And so again, as we tried from that Boolean values, if we call numpy.isnan(), that can access every single thing and determine which cells exactly are not a number. And specifically, land average temperature is not a number. And so we see here that there are a lot of different values that are all not a number.
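To see why missing months skew a yearly average, here is a small sketch with placeholder numbers: the mean is taken over the values that exist, and `np.isnan` marks which entries are missing.

```python
import numpy as np
import pandas as pd

# Six "months", three of them missing
temps = pd.Series([1.0, np.nan, 5.0, np.nan, np.nan, 9.0])

# mean() skips NaN by default, so this averages only 1.0, 5.0 and 9.0
average = temps.mean()
print(average)

# Boolean mask of the missing entries, usable for Boolean indexing
missing = np.isnan(temps)
print(temps[missing])
print(missing.sum())  # how many values are missing
```

If the missing entries happen to be the warm months, the average of what remains comes out too low, which matches the 1752 dip in the seminar.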

And so this is OK. It definitely makes sense, because no data set is going to be perfect. As we saw before when we were looking at the data set, it was missing all these columns. And so it's not ever going to be perfect, which is OK. The thing that you have to do is either work with data that is perfect, or you have to fill in those null values. You have to make sure that it has something that's reasonable that shouldn't affect your data that much, but you should fill it in with something that makes sense. So, in order to find out what exactly makes sense, we want to look at possibly other information around it. So if we wanted to predict this February of 1752, how do you think that we could estimate what that should be? AUDIENCE: Look at the previous and following Februaries? ANITA KHAN: Yeah, exactly. Yeah, the previous and following Februaries are a good way.

Another way to do it might be looking at the January and the March of that same year. It should be somewhere around the middle maybe. Because to get from that January to the March, you have to be somewhere in the middle. And so it would make sense that February should be right around the middle. And then you could do the same thing for these values as well. It's a little bit more difficult because you don't have before and after values where there are a lot missing in a row, but definitely looking at the year before and the year after might be helpful. So what we're going to do today is we're going to be looking at the month before, or the most recent value that's valid. So, for example, in February you would look at the month before. So then this would be that January. For this May, you would be looking at the April previously. And then for this June, because the most recent value is that April, you'll be looking at that April value as well.

So you'd just be filling all of these with this April value. So, not the most accurate, but it's something that we can at least say is reasonable. So you're going to be changing the value of what that DataFrame column is. And so we want to set that equal to something else. And it's going to be exactly the same thing, but we're going to be calling a command called fillna(). It's another pandas command, and it fills all of the null values. So these are things like None, NaN, any blank spaces, or anything, just things that would go under na, that you would classify as na. And the way we're going to fill this is going to be something called ffill, or forward fill. So this is going to take things from before and then it's just going to fill the things ahead of it. You can also do backward fill, and there are some other different ways as well.
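A minimal sketch of forward fill next to two of the other options mentioned, on made-up values. One assumption to flag: the talk passes the fill method to fillna(), while recent pandas versions deprecate that argument, so this sketch uses the equivalent `.ffill()` and `.bfill()` methods instead.

```python
import numpy as np
import pandas as pd

# Made-up monthly values with gaps, standing in for the real column.
temps = pd.Series([0.3, np.nan, 5.8, 8.3, np.nan, np.nan, 7.8])

forward = temps.ffill()        # carry the last valid value forward
backward = temps.bfill()       # or pull the next valid value backward
between = temps.interpolate()  # or draw a straight line between neighbors

print(forward.tolist())
print(backward.tolist())
print(between.tolist())
```

Forward fill repeats the April-style value across a run of gaps, while `interpolate()` corresponds to the "somewhere around the middle" idea of using the months on either side.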

And so once we call that, it changes. And then we can graph that again. And then we see it's a little bit more reasonable. There still are some dips and everything, but it can't be perfect. So we might want to try different avenues for the future. That data set definitely looks a lot cleaner than it was before. And we know that there are no null values as of right now, so then we can work with the whole data set and not have to worry about that at all. All the syntax for the plots is pretty similar. So you can always definitely copy it, or even create a function out of it, that way you don't have to worry too much about styling and everything. You can also change things like the x-axis, y-axis, font size. So it's pretty simple. So that concludes our exploration of our data set. Next, we want to model our data set a little bit to predict what would happen based on future conditions or other variables that could happen.
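Turning the repeated plotting syntax into a function might look like the sketch below. The helper name `plot_series`, its parameters, and the sample numbers are all hypothetical; the styling used in the talk's notebook is not shown in the transcript.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_series(series, title, xlabel, ylabel, fontsize=12):
    """Hypothetical helper: line plot with the same labels and font size each time."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(series.index, series.values)
    ax.set_title(title, fontsize=fontsize)
    ax.set_xlabel(xlabel, fontsize=fontsize)
    ax.set_ylabel(ylabel, fontsize=fontsize)
    return fig, ax


# Made-up yearly averages, just to exercise the helper.
yearly = pd.Series([8.4, 8.5, 8.3, 8.6], index=[1750, 1751, 1752, 1753])
fig, ax = plot_series(yearly, "Average land temperature", "Year", "Degrees")
```

In a notebook the backend line can be dropped; it is only there so the sketch runs in a plain script.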

So in your example of predicting the election, what would you want to model? AUDIENCE: Who gets electoral votes. ANITA KHAN: Yes, exactly. And then for stock price, what might you want to model? AUDIENCE: Likely [INAUDIBLE]. ANITA KHAN: Yeah, exactly. And how that will change over time. And so there are different ways to model. The model we're going to use today is called linear regression. So, as you might have learned before in class, it's just like creating a line of best fit. That way you can estimate how that trend is going to change over time. So we're going to be calling in a library called sklearn. This is typically used for machine learning, but it's definitely good for regression models or just seeing how things will change over time, and it's pretty easy to use. And so this is just a couple of lines of syntax, so that you can set what that x is and what that y is.

You just want to take just the values rather than a series, and that creates a NumPy array. And then when you import this as LinReg, you can just set your regression equal to this. And then sklearn has a quirky syntax where you want to fit it to your data first, and then you can predict the data based on what you had there. That way if you want to predict a certain value that wasn't in your data set, you could call that in predict. And so if you call reg.fit(x, y), that should find the line of best fit between x and y. And then if you want to predict something, then you would call reg.predict(x). You can also do something called score, which is where you compare your predicted values against your actual values. And so here you put in x, which would be your predictors, and y, which is the actual values you're trying to predict. So in this case x would be the year, and then y would be what exactly that temperature would be. And so you compare what the actual temperature is against what your predicted temperature is. Next, we want to find that accuracy to see how good our model is and everything.
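The fit, predict, and score steps can be sketched like this. The yearly temperatures are made up; in the talk, x and y come from the cleaned DataFrame instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression as LinReg

# Made-up yearly averages; the real values come from the grouped DataFrame.
years = np.array([1990, 1995, 2000, 2005, 2010, 2015])
temps = np.array([9.2, 9.3, 9.2, 9.7, 9.7, 9.8])

x = years.reshape(-1, 1)  # sklearn expects a 2-D array of predictors
y = temps

reg = LinReg()
reg.fit(x, y)               # find the line of best fit
predicted = reg.predict(x)  # predicted temperature for each year
accuracy = reg.score(x, y)  # R^2 between predictions and actual values
print(accuracy)
```

The `reshape(-1, 1)` step is there because sklearn treats each row as one observation and each column as one predictor, so a single predictor still needs to be a column.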

And so this compares how far the predicted point is from the actual point, squares those residuals, and gives you r-squared, if you heard that in stats. And so we see that it's not very accurate, but it's better than nothing. It would be better than a random point. And since this was a very basic model, this is actually not terrible. It's a good way to start. And so next we want to plot it to see exactly how accurate it is. Because while this percentage could mean something as to how accurate it is, it's not that intuitive, and so we want to graph it. So again, graph it as we did before. Scatterplot is good for this. And we see how all of these points– you see that we have our straight line of best fit here, that blue line, but then we also have all of our points. And we see that it's not perfect, but it definitely matches the trend in data, which is what we're looking for. And so if we wanted to predict something like 2050, we would just extend that line a little bit further.
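For reference, the score that sklearn's regression models report is the coefficient of determination, r-squared. A sketch of the same calculation by hand, using NumPy's polyfit for the line of best fit and made-up numbers:

```python
import numpy as np

# Made-up yearly averages, for illustration only.
years = np.array([1990, 1995, 2000, 2005, 2010, 2015], dtype=float)
temps = np.array([9.2, 9.3, 9.2, 9.7, 9.7, 9.8])

# Least-squares line of best fit: temps ~ slope * year + intercept.
slope, intercept = np.polyfit(years, temps, 1)
predicted = slope * years + intercept

ss_res = np.sum((temps - predicted) ** 2)     # residual sum of squares
ss_tot = np.sum((temps - temps.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

A value near 1 means the line explains most of the variation in temperature, and a value near 0 means it does little better than always guessing the mean.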

Or if you just wanted the number, you could call reg.predict(). And so this is what we did here, calling reg.predict(2050). So this predicts that the temperature in 2050 will be 9.15 degrees, which is pretty consistent with what this line is. Do you have any ideas for a better regression model? So instead of linear, what might we do? AUDIENCE: Like a polynomial? ANITA KHAN: Yeah, exactly. So it looks like this data set is following a pretty curvy model. We see while it's pretty straight here, it curves up here. And so, definitely, polynomial might be something to look for. There's also another pretty cool method of predicting called k-nearest neighbors. What this does is find the nearest points and then predict based on those. So for example, if you wanted to predict 2016, you would look at the nearest points, which are 2015 and 2014, maybe 2013 if you want that. Average them together and then that would be your prediction. There are other regression methods as well. You could do logistic regression, or you can use linear regression but use a few more parameters.
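A sketch of predicting a single year and of the k-nearest neighbors idea, on made-up yearly averages. One assumption worth noting: current sklearn versions expect a 2-D input for predict, so the year is wrapped as `[[2050]]` here rather than passed as a bare number.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Made-up yearly averages, for illustration only.
years = np.arange(2006, 2016).reshape(-1, 1)
temps = np.array([9.5, 9.7, 9.4, 9.5, 9.7, 9.5, 9.5, 9.6, 9.6, 9.8])

# Extend the line of best fit out to a year that is not in the data.
linear = LinearRegression().fit(years, temps)
linear_2050 = linear.predict([[2050]])[0]

# k-nearest neighbors: average the temperatures of the 3 closest years.
knn = KNeighborsRegressor(n_neighbors=3).fit(years, temps)
knn_2016 = knn.predict([[2016]])[0]

print(linear_2050, knn_2016)
```

With three neighbors, the 2016 prediction is simply the mean of the 2013, 2014, and 2015 values, which matches the averaging described above.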

That way you can decrease the effect a certain predictor has, and so on. But linear regression is a good start. You should definitely look at the sklearn library and there are definitely a lot of different models for you to use there. And so the next part is communicating our data. So how do you think we could communicate the information that we have now? Who would we want to communicate to on global temperature data? AUDIENCE: [INAUDIBLE] ANITA KHAN: What do you think? Same thing? OK. If you wanted to communicate something about what your examples are, once you had data about election predictions, how do you think you could communicate that? AUDIENCE: Do something very similar to what the New York Times did. ANITA KHAN: And what about stock market data? Who would you communicate to, what would you be sharing? AUDIENCE: Try to put it in some type of presentation. ANITA KHAN: Yeah, exactly.

That'd be great. And you could present to one of these companies, or you could do it at a stock pitch competition, or even invest, because maybe you just want to communicate to yourself, and that's fine too. But the idea is once you have that data, someone needs to see it. Once you have that data, it can generate pretty actionable goals, which is a great thing about data science. So just talking about some other resources since we've gone through the pretty simple data science process. Other resources if you want to continue this further. I'm a part of the Harvard Open Data Project where we're trying to aggregate Harvard data sets into one central area. That way students can work with that kind of data and create something. So some projects that we're working on are looking at energy consumption data sets, or food waste data sets, and seeing how exactly we can make changes in that. So other than that, again, as I showed you before, Kaggle.

Definitely a great resource if you want to just play with some simple data sets. They have a great tutorial on how to predict who's going to survive the Titanic crash based on socioeconomic status, or gender, or age. Can you exactly predict who will survive? And actually, the best models are pretty accurate. And so that's really cool that just using a couple regression models and using exactly the same tools that I showed you, you can predict anything. Your predictions might not be very correct, but you can definitely create a model that would be more accurate than if you shot in the dark. Some other tools are data.gov and data.cityofboston.gov. So again, more open data sets that you can play with and you can create actually meaningful conclusions. And so in data.gov you could look at a data set on economic trends. So, how unemployment is changing. You could predict how unemployment will be in a couple different years.

Or you can definitely get information about how election races have gone in the past. You can definitely reach out to organizations like Data Ventures, which works with other organizations, essentially like consulting for another organization using data science. There are a lot of classes at Harvard about this. CS50 definitely has sentiment analysis. You can work with that as well. So if you got all the tweets of Donald Trump and Hillary Clinton, and all the other presidential candidates, and did some sentiment analysis on that, or looked at different words, you could predict what exactly might happen. You can also take other classes such as CS109 A and B, which are Data Science and, I believe, Advanced Topics in Data Science. CS181 is Machine Learning as well. There are other classes, I'm sure, that definitely help with this. Also another good resource is if you just Google things. If you search Python pandas groupby, for example, if you forget the syntax, you can look through great documentation on how exactly to use it.

So it gives you examples, like code examples. So in case you forget from this presentation, or there are other tools that you might want to use as well. So, for example, if you want to do a tutorial, or if you want to work with time series, there are a lot of– the documentation for pandas is pretty robust. And same thing for the other libraries as well. So sklearn linear regression. I've definitely looked that up before. And you can do the same thing, where it has parameters that it takes in, and also what exactly you can get after you've called sklearn's linear regression. So you can get the coefficients, you can get the residuals, the sum of the residuals. You can get your intercepts. There is some other information that you can use. They probably have examples as well. They have examples using this, just in case you want an example of what exactly yours should look like, or you want code. That's definitely helpful. And finally, just to inspire you a little bit further, I can talk a little bit about my data science projects that I'm working on.

For one of my final projects for a class I'm trying to predict the NBA draft order just from college statistics. So there's a lot of information, I think going back to when the NBA started, on how exactly draft order is selected, just based on that college student's statistics. And so definitely a lot of people are trying– like there are industries devoted to predicting what will happen based on those college statistics. Like exactly what order, how much they'll get paid, how this affects their play time while they're on their teams, and so on. Also, over the summer at Booz Allen I was developing an intrusion detection system for industrial control systems. Essentially what this entails is industrial control systems are responsible for our national infrastructure. And so if we observe different data about them, we can possibly detect any anomalies in them. An anomaly might indicate the presence of an attack, or a virus or something on it.

And so that is a possibly better alternative to current intrusion detection systems that might be a little bit more complex rather than just focusing on data. Something else I'm working on for another final project for class is looking at Instagram friends based on mutual interactions. And so each person on Instagram, maybe they like certain people's photos more often than other people's photos. Maybe they comment more, maybe they are tagged in more photos. And so looking at that information, if you look at the Instagram API, it's pretty cool to see how there is a certain web of influence, and you have a certain circle that's very condensed and expands a little bit further. And what's interesting about that is celebrities, for sure, they definitely interact with certain people more or less, definitely get in hate wars, or anything. For example, Justin Bieber and Selena Gomez.

People found out they broke up because they unfollowed each other on Instagram. So I think that's interesting. Also some other things that I've done are predicting diabetes subtypes based on biometric data. So this was in CS109. First P set, I believe. And so given biometric data, so it would be information like age and gender, but also biometric data like presence of certain markers, or blood pressure, or something, you can pretty accurately predict what type of diabetes they'll have, or whether they'll have diabetes or not, like type 1, type 2, or type 3. And we can also predict things like urban demographic changes. Because a lot of this information is available online, you know what socioeconomic status people are in, but you also know where exactly they're located based on longitude and latitude. And so based on how good your regression model is, if you input a specific latitude and longitude, you can predict what exactly socioeconomic status they're in, which I think is pretty cool.

And over time as well, because their data sets go back many different years. So those are a couple of ideas. Any questions about data science? AUDIENCE: It's pretty cool. ANITA KHAN: Thank you. OK. Well, thank you for coming. If you have any questions, feel free to let me know. My information is here if you want any advice or tips or anything. And also these slides and everything will be posted online if you want to access that again. So, thank you.