I am trying to scrape all the content inside the <code> tags, but my code behaves oddly at code_snippet = soup.find('code'):
it returns different data than I expect, as shown below:
<code class="language-plaintext highlighter-rouge">backend/src</code>Nonehh2019/09/22/dragonteaser19-rms/<code>What do? list [p]ending requests list [f]inished requests [v]iew result of request [a]dd new request [q]uitChoice? [pfvaq]</code>Nonehh2019/01/02/exploiting-math-expm1-v8/<code class="language-plaintext highlighter-rouge">nc 35.246.172.142 1</code>Nonehh2018/12/23/xmas18-white-rabbit/<code class="MathJax_Preview">n</code>Nonehh2018/12/02/pwn2win18-tpm20/<code>Welcome to my trusted platform. Tell me what do you want:hh2018/05/21/rctf18-stringer/<code class="language-plaintext highlighter-rouge">calloc</code>None
However, printing soup = BeautifulSoup(content['value'], "html.parser")
shows the right data, including the pre > code elements.
I am only interested in the content inside those tags. The soup looks like this:
<h3 id="overview">Overview</h3><p>The challenge shipped with several cave templates.A user can build a cave from an existing template and populate it with treasures in random positions.For caves created by the gamebot, the treasures are flags.Any user can visit a cave by providing a program written in a custom programming language.The program has to navigate around the cave.If it terminates on a treasure, the treasure’s contents will be printed.</p><p>I was drawn to this challenge because the custom programming language is compiled to machine code using LLVM, and then executed.It seemed like a fun place to look for bugs.</p><p>The challenge ships the backend’s source code in <code class="language-plaintext highlighter-rouge">backend/src</code>, some program samples in <code class="language-plaintext highlighter-rouge">backend/samples</code>, and the prebuilt binaries in <code class="language-plaintext highlighter-rouge">backend/build</code>.The <code class="language-plaintext highlighter-rouge">backend/build/SaarlangCompiler</code> executable is a standalone compiler for the language.It’s useful for testing, but it is not used in the challenge.The actual server is <code class="language-plaintext highlighter-rouge">backend/build/SchlossbergCaveServer</code>.It binds to the local port 9081, and it is exposed to other teams through a nginx reverse proxy on port 9080.I will use port 9081 in examples and exploits so that they can be tested locally without nginx.</p><h3 id="api-interactions">API interactions</h3><p>The APIs are defined in <code class="language-plaintext highlighter-rouge">backend/src/api.cpp</code>.We will take a look at some typical API interactions.I will prettify JSON responses for your convenience.</p><p>First, we need to register a user:</p><div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl -c cookies -X POST -H 'Content-Type: application/json' \ -d '{"username": "abiondo", "password": 
"secret"}' \ http://localhost:9081/api/users/register{"username": "abiondo"}</code></pre></div></div>
I want to scrape every <pre *><code> element
and clean it with code_snippet.get_text(),
but I am not sure what I am missing. I am using asyncio + feedparser + bs4
for the scraper, and at some point it gives me the wrong data. This is the loop:
for entrie in entries:
    print(entrie['link'])
    for content in entrie['content']:
        soup = BeautifulSoup(content['value'], "html.parser")
        code_snippet = soup.find('code')
        print(soup)
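
For reference, here is a minimal sketch of the result I am after, assuming the CSS selector `pre > code` is the right way to target only code blocks nested directly inside a <pre>. The helper name `extract_pre_code` and the shortened `sample_html` string are made up for illustration; they are not part of my scraper.

```python
from bs4 import BeautifulSoup


def extract_pre_code(html):
    """Return the text of every <code> that is a direct child of a <pre>."""
    soup = BeautifulSoup(html, "html.parser")
    # select() returns all matches, whereas find('code') returns only the
    # first <code> in the document, inline or not.
    return [code.get_text() for code in soup.select("pre > code")]


# Hypothetical, shortened stand-in for content['value'].
sample_html = (
    '<p>Source in <code class="language-plaintext highlighter-rouge">backend/src</code>.</p>'
    '<div class="highlight"><pre class="highlight">'
    "<code>$ curl -c cookies -X POST</code></pre></div>"
)

snippets = extract_pre_code(sample_html)
print(snippets)
```

With this input, the inline `backend/src` element should be skipped and only the contents of the <pre> block should be returned.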