Grobid
Description
GROBID stands for GeneRation Of BIbliographic Data. It is a machine learning library for extracting, parsing and restructuring raw documents such as PDFs into structured XML/TEI-encoded documents, with a particular focus on technical and scientific publications. Grobid can help you perform bibliographic analyses on scientific papers.
Grobid can be used via a web application in which you can upload and parse documents in the browser. It also provides an API for scripting purposes.
Launching this Catalog Item will provide you with a workspace on which the Grobid web application and API are running.
Variations
There are some variations of this Catalog Item that come with different Grobid extensions in place:
1. Grobid Standalone: provides only the main Grobid application, without extensions
2. Grobid Datastet: provides the Datastet extension for detecting dataset mentions
3. Grobid Softcite: provides the Softcite extension for detecting software mentions
4. Grobid All-in-One: provides all of the above
For the first three variations, the application is simply hosted at https://yourworkspaceurl.nl/. For Grobid All-in-One, the various applications are accessible at the following locations:
- Grobid (main app): https://yourworkspaceurl.nl/grobid/
- Datastet: https://yourworkspaceurl.nl/datastet/
- Softcite: https://yourworkspaceurl.nl/softcite/
Creation
Create a workspace
In the Research Cloud portal, click the ‘Create a new workspace’ button and follow the steps in the wizard.
See the workspace creation manual page for more guidance.
Interactive parameters
If you want to override the default password used to access the API, you can do so on the final page, before you press Submit to create your workspace. Just fill in your desired password in the provided Interactive Parameter field.
Access
Web application
Members of the workspace’s Collaborative Organisation can simply point their browser to the workspace’s URL and log in using their organisation’s Single Sign-On mechanism (e.g. Solis login with two-factor authentication for UU employees and students). You can use the yellow ‘Access’ button in the Workspace overview in the portal to be linked to the right URL.
API
The API for each application is accessible at sublocations of your workspace’s URL:
- For Grobid, use the /api sublocation, for instance https://yourworkspaceurl.nl/api/.
- For Datastet and Softcite, use the /service sublocation, e.g. https://yourworkspaceurl.nl/datastet/service/.
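The sublocation rule above can be sketched as a small shell helper. This is only an illustration: the function name api_url is hypothetical, and it assumes that the API sublocation is appended to the location at which the application itself is hosted, as in the two example URLs above.

```shell
#!/bin/sh
# api_url APP_ROOT APP
# Print the API location for an application hosted at APP_ROOT.
# APP is one of: grobid, datastet, softcite.
# (Hypothetical helper; the URLs below are the placeholders used in this page.)
api_url() {
    app_root="${1%/}"   # drop one trailing slash, if present
    case "$2" in
        grobid)            printf '%s/api/\n' "$app_root" ;;
        datastet|softcite) printf '%s/service/\n' "$app_root" ;;
        *) echo "unknown application: $2" >&2; return 1 ;;
    esac
}

# Examples:
#   api_url https://yourworkspaceurl.nl/ grobid
#   api_url https://yourworkspaceurl.nl/datastet/ datastet
```

The first example prints https://yourworkspaceurl.nl/api/ and the second prints https://yourworkspaceurl.nl/datastet/service/, matching the locations listed above.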
Since Single Sign-On is difficult to implement when scripting, authentication for the API uses a simple username/password scheme. The default username and password are both grobid. You can set the password as an interactive parameter.
From a command line, you can test the Grobid API, for example as follows:
curl -ugrobid:grobid -L https://grobiddatastet.itsdatalandscap.src.surf-hosted.nl/api/version
Also see the API documentation.
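Beyond a version check, a typical scripted task is sending a PDF to Grobid and saving the resulting TEI. The following is a minimal sketch, not a definitive script. It assumes the placeholder URL https://yourworkspaceurl.nl/api and the default credentials; the variable names GROBID_API, GROBID_USER and GROBID_PASS and the function names are hypothetical. It uses Grobid's processFulltextDocument service, which accepts the PDF in a multipart form field named input; please check the API documentation for the services and options available in your version.

```shell
#!/bin/sh
# Assumed settings: replace the placeholder URL with your workspace's URL,
# and the password with the one you set as an interactive parameter.
GROBID_API="${GROBID_API:-https://yourworkspaceurl.nl/api}"
GROBID_USER="${GROBID_USER:-grobid}"
GROBID_PASS="${GROBID_PASS:-grobid}"

# tei_name FILE.pdf
# Derive the output file name, e.g. paper.pdf becomes paper.tei.xml.
tei_name() {
    printf '%s.tei.xml\n' "${1%.pdf}"
}

# grobid_fulltext FILE.pdf
# Upload the PDF and write the TEI result next to it.
grobid_fulltext() {
    if [ ! -f "$1" ]; then
        echo "no such file: $1" >&2
        return 1
    fi
    # --fail makes curl return a non-zero status on HTTP errors,
    # for instance when the username or password is wrong.
    curl --fail --silent --show-error -L \
         -u "$GROBID_USER:$GROBID_PASS" \
         -F "input=@$1" \
         -o "$(tei_name "$1")" \
         "$GROBID_API/processFulltextDocument"
}

# Usage (paper.pdf is a hypothetical file name):
#   grobid_fulltext paper.pdf
```

Reading the credentials from environment variables keeps a non-default password out of the script itself.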
SSH
You can also access the workspace via the command line (SSH). If you do so, the Grobid service(s) will be available on http://localhost:<port>/api. The port number is different for Grobid, Datastet, and Softcite. You can read off the relevant port numbers by running the following command:
$ sudo docker container list
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d64db08a37ea grobid/datastet:0.8.0 "sh -c 'java --add-o…" About a minute ago Up 57 seconds 0.0.0.0:8060->8060/tcp, :::8060->8060/tcp b636ea56-cceb-42e9-bef9-b8850047be04_datastet_1
In this example you can see that Datastet is running on localhost:8060.
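If you would rather not read the port off by eye, the lookup can be scripted. The sketch below is an illustration under assumptions: the function names host_port and list_service_ports are hypothetical, and it assumes the PORTS column has the shape shown in the example output above. It relies on the --format option of docker container list to print only the container names and port mappings.

```shell
#!/bin/sh
# host_port PORTS
# Print the first published host port found in a docker PORTS string.
# Prints nothing if the string contains no published port.
host_port() {
    printf '%s\n' "$1" | sed -n 's/^[^>]*:\([0-9][0-9]*\)->.*/\1/p'
}

# list_service_ports
# Print one line per running container: its name and its local address.
list_service_ports() {
    sudo docker container list --format '{{.Names}}|{{.Ports}}' |
    while IFS='|' read -r name ports; do
        printf '%s localhost:%s\n' "$name" "$(host_port "$ports")"
    done
}

# Example with the PORTS value from the output above:
#   host_port '0.0.0.0:8060->8060/tcp, :::8060->8060/tcp'    # prints 8060
```

Run list_service_ports on the workspace itself to see the local address of each running service.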
Usage
See the Grobid documentation, or the docs for the extensions you are using, for help with using the application.