Is GitHub a Source for Exploits?
We look at the phenomenon of exploit code moving away from traditional, cybersecurity-centric databases like Exploit-DB and Metasploit and instead being published on GitHub. Is GitHub becoming a de facto database for exploit code?
Transcript
Dan Mellinger: Today on Security Science, is GitHub the new source for exploits? Hello, and thank you for joining us. I’m Dan Mellinger. And today, we’re looking at the phenomenon of GitHub becoming a de facto database for exploit code, or is it? With me, I have our director of security research at Kenna Security, the exploit-coding Jerry Gamblin. How’s it going, Jerry?
Jerry Gamblin: Good, good. How are you guys today?
Dan Mellinger: Doing good. And some would say our special guest is the de facto database for cybersecurity research. He’s the partner and co-founder at Cyentia Institute, Jay Jacobs. Welcome back, Jay. How’s it going?
Jay Jacobs: Good, good. Thanks for having me.
Dan Mellinger: Awesome. Just real quick, I’ll start off by saying this topic is based off of a blog that Jay recently published on the Cyentia website. And so, we’ll link to that on the episode page on podcast.kennaresearch.com, just so you guys can all check it out, follow along. There’s some pretty cool charts. And I know that our friend, Fahmida, wrote an article on this as well on duo.com/decipher. So you can go check that out if you’re listening this week, which is, what? Thursday, January 14th. This will probably be up a week later, should still be on the front page, I would assume. But ultimately, we’re looking at this phenomenon that Jay’s been tracking on exploits and weaponization of exploits being published to GitHub, which we’ll explain, but typically isn’t the use case for it. So anyway, I figured we’d start off real quick with a primer from Jay. Jay, what got you looking into this?
Jay Jacobs: Well, so there’s a couple of things. One thing is that we’ve been trying to find more and more sources for anything about vulnerabilities, and looking at, essentially, like, MITRE publishes their CVE list, NVD will pick that up and add a lot more stuff to it. But looking at the references for each CVE, we saw that GitHub was climbing year over year, the past three, four years. The number of references to GitHub was going way up. And so, that made us wonder, are people just linking to vulnerable projects or are they actually linking to exploits? Why are we seeing GitHub go up, grow in popularity and references for vulnerabilities? And so, that made us look into it. And of course we found, of course, anybody who has been around knows that there are exploits published to GitHub, but there’s a huge challenge then of discovering, classifying, labeling these as exploits versus, you get scanners, you get discussions of the vulnerabilities that have no exploits whatsoever. Trying to separate those out is an interesting challenge.
Dan Mellinger: Ah, yeah, that makes a ton of sense. And before we get too deep into it, and a nice segue here, but Jerry, do you mind just giving us an overview on what’s a vulnerability versus an exploit versus weaponization of that exploit?
Jerry Gamblin: So yeah, we’ll start at the first. A vulnerability is a known flaw in software. It could just be a logic flaw. It could be a code flaw. It could be anything. It’s just something wrong with a piece of software. And so, those are eligible for CVEs. An exploit is code that actually exploits that. Jay’s the master at that chart that shows the difference between a vulnerability and a vulnerability with an exploit. And I haven’t seen that chart in the last three months, but it always is in the low single digits. The truth is, finding a vulnerability is easy, finding an exploit is super difficult.
Dan Mellinger: Yeah. I’ll just list off some of the stats that are in the blog and have actually been shared through some of the prioritization or prediction reports and our exploit prediction scoring system work, but vulns that have a published exploit or exploit code that’s published online are seven times as likely to be exploited. Kind of makes sense, right? Because the hard part is actually writing the exploit like Jerry said, and then weaponized exploits. So I don’t know if I’ve seen this stat before, Jay. But you wrote in the blog that the odds of exploitation in the wild jump from 3.7% to 37.1% if a vuln has an exploit and it’s weaponized. Could you go over that a little bit?
Jay Jacobs: Yeah, I think that was specifically Metasploit when we’re talking about a weaponized exploit, and that is essentially something that is incredibly easy to run. Anybody can grab it, run it, it’s weaponized, it’s ready to go. As opposed to something like some source code on GitHub or Pastebin or something you have to grab, compile, configure. It’s not ready to go, but it’s there, the exploit is there. So that weaponized, when we looked at essentially Metasploit, we find that the odds of exploitation in the wild essentially at a base rate is like 3.7%. And then when we look at the number of CVEs, vulnerabilities in Metasploit that we see exploited in the wild, it’s like 37% of those, that jumps way up. Those that are in Metasploit, if we know that it’s in Metasploit, published in Metasploit, we’re way more likely, 10 times more likely to see it exploited in the wild. And that, again, that’s a correlation. We don’t know if that’s causation just based on the data alone. We can’t say that because it’s in there, it gets exploited. But what we know is that when we see it on there, we are also likely to see it exploited in the wild.
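The "10 times more likely" figure can be checked with a few lines of arithmetic. This is only a sketch using the two percentages quoted in the conversation (3.7% and 37.1%); the function names are made up for illustration, and the base rate here is the overall rate across all CVEs, so this is not a strict comparison against CVEs without a Metasploit module. It also shows that the ratio of probabilities and the ratio of odds are different numbers, even though the word "odds" gets used loosely in conversation.

```python
# Percentages quoted in the episode, expressed as probabilities.
BASE_RATE = 0.037        # share of all CVEs seen exploited in the wild
METASPLOIT_RATE = 0.371  # share of CVEs in Metasploit seen exploited


def rate_ratio(p_group, p_base):
    """Ratio of two probabilities ("N times more likely")."""
    return p_group / p_base


def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1 - p)


def odds_ratio(p_group, p_base):
    """Ratio of the odds, which grows faster than the rate ratio."""
    return odds(p_group) / odds(p_base)


print(f"rate ratio: {rate_ratio(METASPLOIT_RATE, BASE_RATE):.1f}x")
print(f"odds ratio: {odds_ratio(METASPLOIT_RATE, BASE_RATE):.1f}x")
```

Either way, as Jay notes, this is a correlation and says nothing on its own about causation.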
Dan Mellinger: That makes sense. And that’s actually a really good segue as well, to explain the traditional ways that we typically have exploit code published is things like Exploit DB. You talked about Metasploit. Could you guys just provide a little background on what’s Exploit DB? Why does it exist? What’s it used for? It might help color the conversation on why people are moving towards GitHub.
Jerry Gamblin: Yeah, Exploit DB is just a website. I don’t know exactly when it was founded, but it was just the start of mapping CVEs to exploit code. You could add the exploit code and then you could search for it. It was the first real way to match up a CVE to something that could be exploited. Exactly as the name says.
Dan Mellinger: Yeah. It was a… I think Offensive Security is the company who-
Jay Jacobs: They bought it, I think it was like 2010 that they acquired that. But I think before that it was one person or a small group of people. I think it’s still a small group of people, don’t get me wrong. So I think what’s interesting is a lot of the data that we see around vulnerabilities, a lot of it is being driven by one person or two or three people sort of driving a large portion of this. If you look at the history of just MITRE and CVEs in general, just seeing how many people were involved in 99 when the first one was published through 2003 to 2005, and then, what happens in 2017 when we get the CNA process. I mean, it’s just really interesting. A lot of this stuff, when it starts out as like one person, and then usually you get some sponsorship and some corporate involvement and it gets a little bit more mature.
Jerry Gamblin: I mean, and I think it’s a good point to talk about here, that GitHub isn’t replacing VulnDB. VulnDB has just become-
Jay Jacobs: Exploit DB.
Jerry Gamblin: Exploit DB, sorry, had just become less of a place for people to put the data. The data is getting stale because it’s going to Twitter, it’s going to other places. And I think we have a question later in the question list about how are these exploits shared. Because they weren’t publishing the exploit to Exploit DB. It was just getting linked there in some way. That was hardly ever the first place that it would end up. It would end up on a mailing list that somebody was on or something like that. Exploit DB was just the central hub for that data. And I think that’s moved somewhere else and I’m not exactly sure where outside of the 40 Slack channels I’m in and the Twitter shit that never stops.
Dan Mellinger: Yep. I did want to note as well, which I just thought was kind of funny because of the topic, but Exploit DB also has a Git repository for it as well. Just kind of funny that it’s titled that way. And then Metasploit, can you go into that? That essentially is a weaponized exploit. It makes it a lot easier to use.
Jerry Gamblin: Yeah, when I was coming up in the industry, we called those script kiddies. I don’t know if that’s still the term, but what it is, it’s basically a CLI [inaudible]. I want to attack this system with this exploit. I don’t need to know what it does. Most of the time I don’t care what it does. And then give me the data.
Dan Mellinger: Leap [inaudible].
Jerry Gamblin: Yeah. It’s the leap button. It’s the easy button for hacking.
Dan Mellinger: It’s the video game kid screaming at you, I’m going to hack you because you beat me. Method, got it. Well, and I think now that we’re getting into the meat of things a little bit, I mean, GitHub, let’s just do a little background. What’s it intended to be used for and that might influence why it’s become what it has?
Jerry Gamblin: GitHub is a code sharing site. It runs Git. I’ve added my normal XKCD thing in there. Nobody knows what Git is, but GitHub was a code sharing site that’s become the number one code sharing site. There’s GitLab out there. And there are a few others. I’m not sure if Jay looked at any of those, but GitHub is by far the world’s most popular code sharing website.
Jay Jacobs: Yeah, definitely. And I mean like there’s, no, it’s free for most users. I mean, there are of course commercial options and things that you can subscribe to, but to just create a public repository of code, you can just go sign up and do it. The barrier to entry is much smaller compared to something like Exploit DB, where you actually have to grab an exploit, create a submission, submit it to, I can’t remember the company’s name, Offensive Security. They will review it and then possibly post it and validate it and stuff like that. For GitHub, it’s just, I’m going to create an account. I’m going to go dump it and then I’m done. You know what I mean? There’s no review, there’s no gate to get through. It’s just, it’s out there.
Dan Mellinger: That might make sense why people might use it more. Because I know Exploit DB, they do try to validate some of these vulnerabilities where they can by testing it and being like, okay, yeah, this works. And we’ll put a green check mark next to it. No such thing on GitHub. Jay, let’s get into this, the numbers. We have a really cool chart. I encourage anyone who’s listening to go check out the blog on the Cyentia Institute website, also linked on the podcast page, but let’s go over the numbers.
Jay Jacobs: Essentially I looked at the number of new exploits being published to Exploit DB, GitHub and Metasploit. And Metasploit, not too much of a story, it’s pretty flat. And it’s, I would guess, probably on average, what, about 10 a month, 10 to 20 on average a month. And this is over the last four years. And when we look at Exploit DB, it sort of peaks in 2018 and it’s been dropping pretty steadily ever since. It looks like there might be a little uptick at the end of 2020, but generally it went from probably, I don’t know, 80 to 100 a month at the beginning of 2018. And now it’s at 20 to 30 maybe in 2020, as opposed to GitHub, that started out at 10 to 20 in 2017. And now on average, it’s probably 60 a month. Essentially what we see is Exploit DB sort of peaks in 2018, it’s been declining ever since. GitHub has been increasing since 2017. And it looks like it’s outpaced Exploit DB at this point for the number of exploits on there, and then Metasploit is just sort of steady.
Dan Mellinger: Yeah. It’s interesting. These two charts they’re plotted it looks like by month, by year. So over time and number of published and they look like mirrors of each other almost, which is interesting.
Jay Jacobs: Yeah, and sort of flipped around.
Dan Mellinger: Yeah, yeah. Exploit DB is going down, GitHub is going up and Metasploit is roughly about the same throughout the years. Jerry, you found this super interesting. What are your initial thoughts?
Jerry Gamblin: It’s interesting, but it also talks to a wider audience. We have a problem with CVEs, a quote-unquote problem, with the amount of CVEs that are being introduced. Last year was ridiculous. This year, it’s, what is it? The 14th? And I just ran that thing. We’ve had 83 CVEs so far. 15 of them were in one product that’s not supported, and they were all cross-site scripting. We’re just filling up the CVE list with stuff like this. And all of those could easily go into GitHub, the cross-site scripting ones, as, you know, there’s a CVE for it, here’s how I do it, and that’ll make the numbers go up. I love this data and I do see GitHub becoming more and more the place for this. But I really think that it gets interesting when we break it down. And I know Jay’s working on this or we’re talking about this, what does the CVE actually attack? Is it a cross-site scripting PoC? Because I expect to see one of those. If you get a CVE for a cross-site scripting attack, you better have a PoC that you can post with it. Some of those AppSec CVEs should automatically have PoC data.
Dan Mellinger: Interesting. Yeah. I mean, that’s a good point because the number of CVEs is only increasing year over year and we’ve seen that and there’s a ton of them, and Jerry, early on, I think, wrote a blog about CVE stuffing for things like these cross-site scripting issues, which you could just essentially run on ad infinitum: check software here, website here.
Jerry Gamblin: Yeah. I mean, but then we have the other issue too, where a PoC is a PoC, but it doesn’t mean that it’s ever going to get weaponized. Growing up I had a friend who did karate and he was always like, let me hold your arm like this and show you what I can do. Yeah, of course, if you get somebody with their arm behind their back, you can do a cool karate move. And some of these PoCs are kind of like that, it’s like, hey, if you have a box and you can run this script as root, you can make this work. It’s like, oh, nice. Yeah, I see where that is. I mean-
Dan Mellinger: If I have root, I’m doing something completely different.
Jerry Gamblin: It’s a valid PoC because it exploits the CVE, but the chance of it getting weaponized is there’s still a lot of steps to take between some PoCs that’ll run on a machine as root or with the software to get to something that’s actually weaponized. And you know, it’s great to have these PoCs and this data to see how it does. But as Jay was talking about, getting to a weaponized state is still a bridge too far for some of these.
Jay Jacobs: Yeah.
Dan Mellinger: Yeah. Makes sense. And that also, the karate piece reminds me of one of The Office’s Dwight and Jim segments. I’ll see if I can link that. Because that would be hilarious. Well, that being said, so Jay, I’m sure this is the fun part for you. GitHub was not designed, it’s not an exploit database, it wasn’t designed to host exploits and be searchable for them, all that fun stuff. So how did you determine if a GitHub repo has an exploit?
Jay Jacobs: First, GitHub has a set of tags that people can tag a repository with. And there are tags for exploits, PoC. I mean, like, people, you can grab these tags and search for them, but the problem is that’s up to the creator of the repository to make sure those tags are there. And that is completely… you can’t depend on it at all. And so there are two huge problems in here. One is discovery, like how do you look at the fire hose that is GitHub? Because it is extremely popular and keeping up on what’s being published there is extremely difficult. And one of the cool things GitHub offers is a streaming API, where you can essentially, when you refresh, you get a list of all the new repos and changes in the last X period. But the problem is you can’t refresh that fast enough to watch the fire hose. You go and you get this thing. It’s like, hey, there’s 30 pages here. And by the time you go to the second page, it’s already looped through 30 pages of content. Like it’s insane. Especially during busy peak times, in general evenings in the US and things like that. You can’t watch that fire hose that way. Essentially going through their API, looking for searches, all these things, trying to find good keywords, trying to whittle that down to some sort of seed dataset. And then of course the challenge is how do you look at these discovered repositories and say which of these contain exploits and which don’t. And this is where the next challenge comes in, that essentially what we decided to do is to manually hit the list. We started going through, I think we got several hundred, I think six, 700 repositories manually inspected. And what was interesting, I mean, like we would look at some of these repositories and have absolutely no idea if this was an exploit or not. And just some of the complete vagueness, you get things where it’s like the only thing in there is something called exploit.py.
And so you actually have to go look at the exploit, look at the code and it might say like this will find all exploits for this CVE. And so you look at it and it basically grabs a header and looks at the version as opposed to actually exploiting it. Because if it just finds a version, that’s not going to help anybody attack or test it in anything other than just say, it’s there. You have to sort of whittle through and go through these very carefully. And then you get, there’s so many gray areas to say, is this an exploit or not? For cross-site scripting, if it just has alert this is XSS, like, is that actually an exploit? Because it does theoretically do cross-site scripting, but popping up an alert is not essentially a huge payload, but I mean, anybody who knows some basic JavaScript can replace that with what they want, but is that truly an exploit? So it’s tough, but essentially what we did is we got that list of labeled data saying these repositories are, these repositories are not. And then there are other challenges. Like you get a repository that has 100 CVEs and exploits in there, and some of them are exploits and some of them aren’t. Huge mess. Anyway, once you get that list of those, then you can look at attributes. So we’re looking at file names, the contents, the dates, the lifetimes, how many commits, what does a commit range? We’re looking at as many possible attributes of these repositories and the code that we can get our hands on from GitHub. And we use that to essentially create a classifier. And then once you’ve got that classifier, you can see how it performs. You train it on some amount of this labeled data and hold out some of it and you say, all right, I’m going to run this classifier on what I held out, how did it perform? How did it go? And then in the blog post, I think I threw in a ROC curve.
It’s called the receiver operating characteristic, which doesn’t mean anything to anybody pretty much, but essentially it’s a curve that says, what is the false positive rate versus the true positive rate? And because the classifier outputs something between zero and one, it doesn’t say this is, or this isn’t. It says, this is a 0.9. This is a 0.2. This is 0.7. And so you want to take that output and say, I’m going to make a cutoff at 0.9 or 0.6 or 0.5, whatever it is. And when I make that cutoff, what is the true positive rate versus the false positive rate? And so like for a company like Kenna for instance, if you wanted to be super confident that what you’re saying is an exploit, you’re going to want to set that threshold pretty high, like a 0.9, 0.95. That way, anything above that, you can be really sure that is an exploit. And conversely, if you want to be sure that you don’t miss something, you might want to lower that down significantly.
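To make the cutoff trade-off concrete, here is a minimal sketch of how a true positive rate and false positive rate fall out of a chosen threshold. The scores and labels below are invented for illustration and are not Cyentia's data, and the helper name `rates_at_threshold` is hypothetical. Sweeping the threshold from 1 down to 0 and plotting the resulting pairs is what traces out a ROC curve.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (true_positive_rate, false_positive_rate) when every
    repository scoring at or above `threshold` is called an exploit."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


# Made-up held-out set: classifier scores with manual labels (1 = exploit).
scores = [0.97, 0.92, 0.85, 0.70, 0.65, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

for cutoff in (0.9, 0.6, 0.25):
    tpr, fpr = rates_at_threshold(scores, labels, cutoff)
    print(f"cutoff {cutoff:.2f}: TPR {tpr:.1f}, FPR {fpr:.1f}")
```

In this toy data the 0.9 cutoff flags nothing wrongly but finds only two of the five real exploits, while the 0.25 cutoff finds all five at the cost of two false alarms, which is the choice Jay describes between being sure and not missing anything.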
Dan Mellinger: Interesting. Well, let’s move on to what I think Jerry thinks is the fun part: now that you’ve determined if something has an exploit, what’s kind of the breakdown? So Jerry, you were asking questions on the breakdown of the languages used, some of the vendors listed. Jay popped in some nice data that isn’t in the blog. And we might be trying to work on another blog to publish after this podcast that’ll talk about this stuff, but Jerry, did you want to lead any interesting takeaways from the language breakdown, some of the [crosstalk]?
Jerry Gamblin: We’ll probably try to add that to the blog or Twitter or to the notes. But it’s mostly Python, Python was the number one language, and that’s probably most people’s go-to quick and dirty kind of language. It’s what I always go to, especially if it’s web-based, because the requests library is so great. You can just say, oh, I’m looking for this header from this VPN project. So it’s simple to work with. I wasn’t surprised there. I was mostly asking for the data to just kind of validate what I was thinking, but yeah. And then you get to C, which is interesting. I don’t know a lot of people who write exploits in C, I wonder how many of those… Those would be an interesting set to look through because my best guess is those are Windows vulnerabilities that are probably really legitimate, where somebody has written this code in C and there’s a compiler that turns it into an exe that lets you exploit that on there. So that’s what’s interesting on those lists. I just did C in school, so I’m not great with it, but I would really be interested in looking through that list of CVEs with exploits that are written in C.
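The quick header lookup Jerry describes is also the pattern Jay flagged earlier as a scanner rather than an exploit: it reads a version and compares it, nothing more. A rough sketch of that check follows, under loud assumptions: the product name `ExampleVPN` and the "first fixed" version are hypothetical, the banner is assumed to use a `Product/1.2.3` format, and the function takes a plain dictionary of headers so it runs without any network access (with the requests library, `response.headers` could be passed in instead).

```python
import re


def parse_version(server_header, product):
    """Pull a dotted version for `product` out of a Server banner,
    returning a tuple of ints or None if the product is not named."""
    match = re.search(rf"{re.escape(product)}/(\d+(?:\.\d+)*)", server_header)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def looks_vulnerable(headers, product, first_fixed):
    """True when the advertised version is older than `first_fixed`.
    This only detects a version; it does not exploit anything."""
    version = parse_version(headers.get("Server", ""), product)
    return version is not None and version < first_fixed


# Hypothetical product and fix version, for illustration only.
print(looks_vulnerable({"Server": "ExampleVPN/9.0.3"}, "ExampleVPN", (9, 1, 0)))
print(looks_vulnerable({"Server": "ExampleVPN/9.1.0"}, "ExampleVPN", (9, 1, 0)))
```

A repository containing only something like this would have been labeled "not an exploit" under the manual review described above, even if the file were named exploit.py.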
Jay Jacobs: If I had to guess, there’s a couple of thoughts here. One is that as we’re going through these repositories, there are quite a few student projects. And our teachers will actually assign, go find a vulnerability, write an exploit and submit it as your final project or something and so you’ll see, and by the way, do it in C. You might see some things like that. Another thing might be some of the memory related vulnerabilities, I think might be easier to exploit in C than Python or something else. That might be specific to the type of vulnerability. I haven’t looked at that, but that would just be my hunch.
Jerry Gamblin: Yeah. Much more low level of a piece of software. Outside of that, what do you think that this means for security practitioners, researchers and IT admin, do you think that this is kind of a big shift in the way that-
Jay Jacobs: Yeah, I don’t think it’s a shift. I mean, it’s a shift because constantly things are changing and you have to sort of keep your finger on the pulse as best you can to try and figure out what we see from a threat perspective. And as we talked about early on, one of the biggest indicators of threat is when these exploits are discussed and have exploits published for them. And so the more broadly these things are discussed, the more likely we are to see them exploited in the wild. And so just being aware if you are working in vulnerability management, if you’re a security practitioner, researcher, whatever you want to be aware of first the vulnerabilities in your systems, but then also how to prioritize. And one of the probably easiest indicators is if you see some type of exploit published out there. And so keeping an eye on, GitHub, keeping an eye on Exploit DB, Metasploit, all the other sources out there for when these things get published. I think it’s a really good indicator that these should probably be prioritized over those without an exploit.
Jerry Gamblin: Did the model look at whether these links on these GitHub repos get added back to the CVE details page?
Jay Jacobs: I ended up completely ignoring that, completely ignoring the CVE details. Yeah.
Dan Mellinger: Well, that’s actually interesting. Because we just did volume six of the report, where we were looking at basically momentum of attackers versus defenders and, based off of timelines, what happened first, order of operations, what happens second. And throughout the P2P series, the existence of an exploit has always been kind of the go moment. When that characteristic happens, you should go take care of this. What are some of the drawbacks of them existing on GitHub, with all the challenges you had just pulling the data, accurate data, and the trade-offs with that, versus being published to Exploit DB or Metasploit or somewhere else that is designed to do this, where people can keep track of that flood a little bit easier?
Jerry Gamblin: Well, after Jay, I’d like to hop in here because I have probably an interesting take outside of the data part on this.
Jay Jacobs: Yeah. So yeah, like you mentioned, volume six of the P2P series. And then there we looked at… We had a question about if exploit code is published. I think we had a cutoff of when the patch was early. So if an exploit code was published before the patch was released, what is the effect on exploitation in the wild? And we saw a huge shift. And of course, when I first saw that, my brain was like, “Oh, this looks terrible. Like, exploit code appears to cause more exploitation in the wild.” But that cause versus correlation is something people always trip up on. I’m no exception. So essentially, I mean, when we looked at it deeper, in order to detect something as exploited in the wild, you need to have a signature to detect it. And to write that signature, it’s a lot easier to write if you have some sort of proof of concept that you can use to make sure that signature is going to trigger at the right time. If you think about it, in order to detect it in the wild, it’s helpful to have a proof of concept. When we see a proof of concept out there and we see it exploited in the wild faster, does that mean that it’s actually been exploited faster or that we’re just detecting it faster? Was it always exploited out there, or is the fact that we’re seeing that exploit giving us an advantage? It’s an advantage to see these things exploited sooner. So it’s a difficult thing. And so, the drawback, I think, of seeing these things on GitHub, obviously they can be used by attackers and they’re going to highlight this vulnerability more, put it more on people’s radar, but I think there’s benefits too. Once you have that exploit, you can create a signature, you can know how to reconfigure a firewall or your WAF or whatever, just how to deal with that exploit if you can actually get it and run it and put it through your security tools.
Jerry Gamblin: Yeah. I really think this is a signal intelligence question, you want to be as close to the information as your adversary is. And by the time you walk all the way back to Exploit DB, that’s had to go through a process that either involves somebody running some kind of script or a person looking at it. So you’re 24, 48 hours kind of removed. And I’m guessing that Metasploit, which is kind of like the gold standard, is probably way more than that. What he was talking about, that streaming API, you want to be as close to that as you can. I want the data in real time. I don’t want to give, like we talked about in P2P volume six, that head start. If I can see a PoC at the same time that my attacker’s seeing the PoC, we can start our remediation flow at exactly that point, but waiting on someone to update another set in Exploit DB or whatever adds latency to kicking off that remediation process. So the closer and faster we can get to real-time data gets us closer to how our attackers are working.
Dan Mellinger: Interesting. So you think this is ultimately a net benefit. We just need to figure out how to do this faster.
Jerry Gamblin: This is where the attackers are getting it and sites that pull this together and kind of make it a service or whatever, an Exploit DB, they have to get much faster because 48 hour turnaround time for being published on GitHub to being in my threat intelligence data, isn’t acceptable in 2020. You want to see it as quickly as your attackers are seeing it.
Dan Mellinger: Hmm. That’s really interesting. Is there… Yeah, I’m sure we probably don’t have this level of data yet, but it’d be interesting. So are the exploits against existing CVEs primarily?
Jerry Gamblin: I’m going to say yes. There’s a really, really interesting article that just came out that says the NSA hasn’t dealt with a zero-day in over three years. They said that all of the stuff that they’ve worked on publicly has been known CVEs. I just tweeted it this morning.
Jay Jacobs: Like from a response perspective or what?
Jerry Gamblin: Yeah, from other countries, from their defensive side. They’re not seeing bad country X using zero-days. They’re seeing them use known CVEs with known exploits.
Dan Mellinger: BlueKeep and the Microsoft-
Jay Jacobs: I mean, that’s something we saw in our data in V6, that essentially the remediation gets to a point and plateaus. You don’t get 100% remediation on these vulnerabilities that people are finding in their vuln scanners, you get 80, 90% coverage. And there’s always these corners that the vuln scanners see, but you can’t remediate. They’re either not part of the remediation platform, you can’t find the owner of that system, they’re in a corner. Whatever it is, there’s just some amount of systems that just seem to fall off. And so what we see then in the attacker perspective, when we see the exploits in the wild, it’s a much slower growth to it. And you sort of see the attackers just sort of marching along and they don’t appear to be in a rush. We don’t see this mad scramble when something is published or the patch is released. We don’t see that. We see this sort of… we do see a slight increase, but it’s not like 80% is in the first week. It’s like 10% in the first week or something. And it’s just sort of spread out over time. And it’s a slow march from the attacker perspective.
Jerry Gamblin: Because I wonder, just kind of think about that as somebody who played defense for a lot of time, I wonder how much of that is an attacker picks a CVE to exploit or if an attacker is just going to look at an organization? Like I want to get in this organization. I don’t care what CVE it is. Let me see if they’ve missed patching their firewall or their VPN or their web server or their exchange server. I don’t care what it is. I just want in versus here’s the CVE I’m going to shop everywhere to try to get in.
Dan Mellinger: Yeah. So the difference between a target versus targets of opportunity.
Jerry Gamblin: Yeah, absolutely.
Dan Mellinger: Interesting. Do you think security right now, as an industry, is monitoring GitHub closely enough? I’m going to go with no.
Jerry Gamblin: I’m going with, I think it’s becoming more and more of a focal point. Not outside of this research. This research is unique and I love this stuff that Jay’s done here, but kind of as a security industry, especially on the AppSec side, GitHub has decided that they’re going to basically become everything. They’re offering SCA now, a static code analysis tool. They do alerts inside their CNA, they’re issuing CVEs for code hosted on GitHub. So GitHub has really, really become the center of at least the application security world’s focus on vulnerabilities recently. If you go and look in their actions in their marketplace, almost all the major security tools have plugins that allow you to run the code, run their tool right from GitHub.
Jay Jacobs: Yeah.
Dan Mellinger: Jay, do you think there’s any gaps in data?
Jay Jacobs: Yeah. I mean, there’s always this discussion about vulnerabilities that have a CVE versus not. And so part of the challenge is that when people post a vulnerability, and I mean, this applies not just to GitHub, but to everything with a vulnerability. So if you look at Exploit DB even, my guess is there’s some proportion, and I have no idea what that is, because I have no idea how many, but there’s some proportion that are exploiting a CVE where nobody took the time to say this is mapped to the CVE. And for GitHub, for example, one of the key fields that we’re trying to search on is CVE. And so when we find CVEs mentioned, of course it’s super easy to tie to a CVE. BlueKeep is a great example. When you have a vulnerability that’s very popular, that has a name, people will create repos with BlueKeep and not even bother discussing a CVE or linking a CVE somehow. And so those become a little bit more of a challenge to discover and associate, and this is globally, this is not just GitHub, but it’s something that we’re seeing. And so that’s part of the challenge with this, that it is so ad hoc, so that whoever creates a repo can decide every aspect of what goes into that, how it’s tagged, how it looks, how it appears, the information in there. And so the challenge is to go through and figure out the commonalities and how to mine this in an automated fashion.
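The CVE-matching problem Jay describes can be sketched in a few lines. This is an illustration under stated assumptions, not the method used in the research: the regular expression accepts the standard CVE-YYYY-NNNN form (four or more digits in the last part) and also underscores, which is a guess about how repository names might be written, and the alias table is a hypothetical stand-in containing only BlueKeep, which is CVE-2019-0708.

```python
import re

# Matches CVE-2019-0708 as well as variants such as cve_2019_0708.
CVE_PATTERN = re.compile(r"CVE[-_](\d{4})[-_](\d{4,})", re.IGNORECASE)

# Hypothetical alias table for named vulnerabilities; a real one
# would have to be curated by hand.
ALIASES = {"bluekeep": "CVE-2019-0708"}


def cve_ids(text):
    """Return the sorted CVE IDs mentioned in a repo name or description,
    either explicitly or through a known vulnerability nickname."""
    found = {f"CVE-{year}-{number}" for year, number in CVE_PATTERN.findall(text)}
    lowered = text.lower()
    for nickname, cve in ALIASES.items():
        if nickname in lowered:
            found.add(cve)
    return sorted(found)


print(cve_ids("cve_2019_0708 PoC and CVE-2020-1472 scanner"))
print(cve_ids("BlueKeep-exploit"))
print(cve_ids("my dotfiles"))
```

Without the alias table the second repository would be missed entirely, which is the discovery gap being described; any named vulnerability not in the table still slips through.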
Jerry Gamblin: Well, and did you even look at Gist or no?
Jay Jacobs: I didn’t. We skipped the individual code things. That would be fun, because, I mean, there’s, again, that’s probably even more scattered.
Jerry Gamblin: Yeah. And that’s where most of my PoC, kind-of-working stuff is, just in a Gist, because a repo is forever. Like, I want the code to meet a certain level of cleanliness. Like, I’m going to update it. And I want people to do pull requests on that or whatever, but 90% of the time, if I’m working on something that’s just kind of a one-off, I’d just use GitHub’s code snippet service. It’s called Gist, G-I-S-T. If you go look at what’s in my Gist list, it’s a bunch of shell scripts and Python scripts that work that I want to share, but I don’t want to support.
Dan Mellinger: Interesting. It’s like a scratch pad type thing. Or are you just-
Jerry Gamblin: Yeah, a public scratchpad. A public bulletin board.
Jay Jacobs: Yeah. And you don’t have to have the whole repository built around it. You can just put a code block out there.
Dan Mellinger: Gotcha. Interesting. Well, I mean, this is all very cool. I think the industry, especially from a cybersecurity standpoint, is paying increasing attention to GitHub even outside of the pure application security side. But overall, any final takeaways you guys have? I know we want to dig further into this and it seems like it’s going to align with some of our future reports, for sure. So Jerry, any final takeaways for you?
Jerry Gamblin: I’m waiting to see how GitHub reacts to becoming this repository. They will become a big target if code is hosted there that then goes on to be used to exploit massive numbers of people. I know I’ve seen PoC kind of versions of, like, C2 bots that are run through GitHub, but none of them have really been big or noteworthy even, but I’m just waiting for the first time that someone drops a PoC on GitHub that becomes the base for a major breach and to see how they respond to that.
Jay Jacobs: Yeah. That’d be hard to link back though. [inaudible] I mean, like, knowing where they picked up an exploit that was used in a large attack is going to be a tricky attribution to make.
Jerry Gamblin: Or probably more likely for an attacker to post his code on GitHub and then use that code to exploit a bunch of people.
Jay Jacobs: Yeah. Yeah. I think it will be interesting to watch. I mean, it’s sort of exciting to see. It’s always tough working on data when you see that data source start to dwindle, as we’ve seen with Exploit DB. And then to find that some of this is shifting over to GitHub is very… It’s a relief to me to find something that’s increasing lately around this. So I’m pretty excited about that.
Dan Mellinger: Never enough data.
Jay Jacobs: Never enough.
Dan Mellinger: Awesome. Well, appreciate both you gentlemen for hopping on today. And like I said, we’ll have all these resources linked: Jerry’s XKCD post, because we have to have one of those in the podcast with him, and then also the links to the Cyentia blog and the Decipher article. And then we may go back and update this if we write a secondary blog, but Jay, Jerry, thanks for hopping on. And we’ll keep tracking this.
Jerry Gamblin: Thank you so much. Thanks, Jay, this has been really fun.
Jay Jacobs: Yeah. Thanks.