LiveBlog: Developer Challenge Show & Tell

Paul Walk begins by thanking all of the Repository Fringe 2013 Developer Challenge sponsors, and the judges for the Developer Challenge.

I want to thank our entrants for the Developer Challenge. We saw ideas being exchanged even over the last few days, which is great. Also interesting: many of our entrants did not want to win a prize. Three of our developers said that! It gives you a sense of what doesn’t motivate people! I didn’t think that people would be content without me issuing them one. We discussed the entries this morning and took a majority vote.

The winning entry was from Russell Boyatt! Congratulations! There were four judges and we have two runners-up. The first runner-up prize is shared by Peter Murray-Rust and Cesare Bellini. And our other runner-up is Chris Gutteridge.

As you will see from the presentations, the winner was very timely, very much about preservation, and a real call to action for something they need to do. Of the two runners-up, one project had a very strong idea and had implemented it with executable code, and the other built on top of it and integrated it into a repository. It was fantastic to have those feed into each other. This morning we had a session where all teams exchanged ideas, asked questions and offered feedback, and it was really collaborative.

Russell Boyatt – Preserving a MOOC

I am interested in MOOCs: Massive Open Online Courses. There are a growing number of providers – Coursera, FutureLearn, etc. What happens in these is that a *large* number of students take part in the course, and there is a huge amount of social media activity around these courses, much of it outside of the platforms themselves. Wouldn’t it be great if we could capture and preserve that?

Why do that? Well, it’s an institutional archive: you want to see what all that expenditure represents, you want to reuse it, to learn from it, to see how it relates to on-campus activity. Students want to look back at that data, to see what they have done. And there is future research: MOOCs are new and in five years we will have questions that require us to look back over that work, so we have to collect it now. Also, our current repository is assembling content for the MOOC. So the MOOC Preservation Toolkit I have started to build will reach into these platforms – particularly Moodle, EdX and OpenMOOC, as they are open source – to gather material, alongside gathering the social media interactions: the tweets, the discussions, everything. And then it will look at our own repository, package all of that up together, and push it into a repository using SWORD.

For social media we can use the Twitter streaming API – and I’ve helpfully been pointed at the Southampton EPrints Twitter tool. And I’ve been involved in a blog preservation project which solves that gathering issue, so we can preserve that content. We can ensure that learning resources developed for a MOOC can be captured and stored in a repository, and we can use this internally and externally to the university. The MOOC materials could be a route to collecting learning resources in a form suitable for an OER. And then we have a representation of that activity over time.

So I have started building a MOOC Preservation Toolkit. With it I will be able to extract discussions from a MOOC, pull them out as XML, and push that to a repository with SWORD – I will try to get that working this weekend.
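A SWORD v2 binary deposit boils down to an HTTP POST with a handful of standard headers. As a rough sketch (the package filename, endpoint URL and credentials below are invented for illustration; the header names and the SimpleZip packaging URI come from the SWORD v2 profile):

```python
import mimetypes
from pathlib import Path

# Standard SWORD v2 packaging identifier for a plain zip package.
SIMPLE_ZIP = "http://purl.org/net/sword/package/SimpleZip"

def build_deposit_headers(package_path, in_progress=False):
    """Build the HTTP headers for a SWORD v2 binary deposit."""
    name = Path(package_path).name
    content_type = mimetypes.guess_type(name)[0] or "application/octet-stream"
    return {
        "Content-Type": content_type,
        "Content-Disposition": f"filename={name}",
        "Packaging": SIMPLE_ZIP,
        "In-Progress": "true" if in_progress else "false",
    }

headers = build_deposit_headers("mooc-discussions.zip")
# The actual POST, commented out so the sketch stays self-contained:
# import requests
# requests.post("https://repo.example.ac.uk/sword/collection",
#               data=open("mooc-discussions.zip", "rb"),
#               headers=headers, auth=("user", "pass"))
print(headers)
```

The repository end then unpacks the zip and creates the record, so the toolkit only has to produce the XML package and one POST.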

I haven’t built this as a polished tool yet, but it might be useful. This is a call to action. There is huge activity around MOOCs right now. There is a huge amount of content in repositories being used in MOOCs, and we want to know how it is being used and interacted with.


Q1) What about activity data?

A1) Yes, thinking about that, need to do more thinking.

Peter Murray-Rust and Cesare Bellini – Images in Scientific Publication

If you use an image – here’s one on FigShare – you can claim it as your own, but how can you prove it’s yours? Springer stole an image of mine! So this bothers me a lot! How do we make images proof against copy fraud? How could we stamp an image in an unremovable way that would be immediately obvious to humans? I have blogged this already – read it there. So you take an image, you take the CC-BY mark, and you overlay one on top of the other. No-one can remove the CC-BY without destroying the image. That would be hugely useful to have on ALL images. Cesare is implementing this, and we hope to run it on a server. And thanks to Chris for developing this further. Mark from Ubiquity Press likes this, and I’m going to approach all the major open access publishers.


Q1) What about a water mark for video or visualisations?

A1) That might be worthwhile.

A1 – Chris G) And should be doable by reusing VLC.

Chris Gutteridge – Images with Creative Commons Licenses

I want more young people snapping at my heels for these contests! More young people now!

Anyway, what I have done is steal Peter’s idea but make some other tweaks. So I take an image. I say who took the image, and the Creative Commons data is added on top of the image – the CC BY mark etc. – as well as a proper attribution on the side to show whose image it is. And because we have the data on licence and attribution to hand while we are munging the image, we can do more with it. JPEG and PDF files have space for metadata in the file format; JPEG has an attribution field. So we have added the licence to the EXIF metadata of the image.
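The two steps described here – a visible licence strip drawn onto the image plus the licence embedded in the JPEG’s EXIF metadata – can be sketched with Pillow. This is an illustrative sketch, not the actual tool: the image and attribution text are made up, and 0x8298 is the standard EXIF Copyright tag.

```python
from io import BytesIO
from PIL import Image, ImageDraw

CC_NOTICE = "CC-BY | photo: Jane Example"  # hypothetical attribution
COPYRIGHT_TAG = 0x8298  # standard EXIF "Copyright" tag

# In practice you would open the photograph; a blank image stands in here.
img = Image.new("RGB", (320, 240), "navy")
draw = ImageDraw.Draw(img)
draw.text((8, 222), CC_NOTICE, fill="white")  # visible licence strip

# Embed the licence in the file's own metadata too, so it travels with it.
exif = Image.Exif()
exif[COPYRIGHT_TAG] = CC_NOTICE
buf = BytesIO()
img.save(buf, format="JPEG", exif=exif.tobytes())

# Round trip: re-open the JPEG and read the embedded licence back.
reloaded = Image.open(BytesIO(buf.getvalue()))
print(reloaded.getexif()[COPYRIGHT_TAG])
```

The visible strip protects against casual copy fraud, while the EXIF field keeps the attribution machine-readable even when the image is copied whole.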


Q1) When will it be on the EPrints training course?!

Patrick McSweeney – Preservation Toolkit

So yesterday I learned that format preservation isn’t really an issue, which upended this a little! But this is what I did… I made use of some EPrints functionality built by Dave Tarrant. The idea is that you take a file and convert the file format, so that, say, a Word document becomes easy to open as a different file type. But I’ve swapped one preservation risk for another. BUT you can convert your document to native raw HTML, requiring no special tools at all. It’s kind of cool. The HTML is a bit ropey. The purpose of the service was twofold. EdShare, my baby, does this stuff server side. You could also use the tool to create a zip file of images of your document, PDF, PPT etc.

Richard Wincewicz – Metadata Creator

I thought of this idea on the way home last night and spent three hours coding, so don’t get your expectations up too much! Yesterday there was much talk of missing metadata, but I thought, well, there is loads of metadata in a document if you know how to find it! So this solution would allow you to upload a document and do text mining etc. to pull out the metadata. All good, then I spoke to Chris Gutteridge who showed me something built years back… but in any case… here is what I built!

So, here is a PDF: the tool pulls out the metadata and outputs it as XML. You would use this as a web service, basically, so you wouldn’t see all these fields in their raw form here; you’d see a user interface. In cases where you have lots of files with no data, this kind of metadata gives you a starting point.
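As a sketch of the kind of output such a service might produce, here is a flat metadata dict (the field names and values are invented for the example; a real run would get them from Apache Tika) wrapped up as simple Dublin Core XML:

```python
import xml.etree.ElementTree as ET

# Illustrative metadata of the kind a text-mining pass over a PDF yields.
extracted = {
    "title": "An Example Thesis",
    "creator": "A. Researcher",
    "date": "2013-08-01",
    "format": "application/pdf",
}

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def to_dublin_core(fields):
    """Wrap a flat metadata dict in simple Dublin Core XML."""
    root = ET.Element("metadata")
    for name, value in fields.items():
        el = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        el.text = value
    return ET.tostring(root, encoding="unicode")

print(to_dublin_core(extracted))
```

A web service front end would return this XML, leaving the repository interface to present the fields for a human to check and correct.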


Q1 – Balviar) Do you know about the NZ preservation metadata extractor?

A1) Yes, it’s built into Apache Tika, which is an umbrella for lots of libraries. This is essentially a front end for that.

Q2 – Chris) The great idea here is the generic API that gives you structured information. So that old hack shouldn’t put you off… this is a neat idea.

Paul Walk: I noted that four of the five entries were about metadata, addressing the lack of metadata. Maybe we need to have a bigger crack at that idea of metadata travelling with the object. Chris and Peter’s idea is an extreme version, but the idea of putting it in the EXIF is nice. And this is a new emphasis, if not a new idea.

Peter Burnhill: I recall when people were focusing on metadata and we had the catalogue record… the argument went that metadata should be separate, but if it is embedded you have it in there, and you can then extract it if you want to. Also, for images you often want to know where associated objects are. Some documentation should be intrinsic to the object. In extreme cases one wants to find something related to enhance the object in creation.

Paul Walk: Now for something not in the plan… Chris talked about the lack of new young developers, showing up the old fart that he is now. I wanted to put it as a question: is it something we as a community can do something about? It’s been an important part of the Repository Fringe and other events. It has really helped build relationships. So many of our developers didn’t want a prize that perhaps “challenge” is no longer the right word here… I saw collaboration and cooperation, and maybe next year we reframe it.

Pat: The Developer Challenge can be a challenge still, just not a competition.

Paul Walk: I think that’s a really good idea. A solution to…

Peter M-R: Hackathons are the modern approach. My idea stemmed from the Hack for Ac…

Dave Tarrant: Hacks are old ideas too. What we lost this year is training for developers. Dev8D didn’t run this year. That training is important. Having that critical mass, exchanging ideas, that’s what’s so important. We have to think about how to do that again, so that vision and hacks get realised. So many developments from challenges become part of day-to-day work. We should embrace that.

Chris Gutteridge: The reason I talk about young developers… I’ve seen giant prizes drive people purely for the cash. It can defeat the point. You really want to go back to your work and show that your work is great. For us, we have won a lot of these, we have been inspired and validated, but we want new people helped up and validated.

Paul: That was the motivation for the prizes. I have examples of developers who were taken more seriously because of that prize.

Russell: That feedback in the session this morning was brilliant. Chris’ comment has changed what I do for the better. Do that. Collaborate more. That’s the value.

Claire: That has to be every day, perhaps not just at these events. So, like plugins etc.: is that a good idea, is it cool, has it been done? How do you keep up?!

Dave: You couldn’t have a better place to reach that community frankly. The friendships are built and last beyond the challenge.

Peter B: Last year when we had OR2012 we had concern that the big show was coming to town… we blended Repository Fringe in again. I’m delighted that we’ve come out the other side. There is a commitment to that. You alluded to there being an intrinsic value to this sort of “mixed collar” event. So if there is a particular problem for those coming through compsci schools, we may have an obligation to work that out for next year.

Paul: And finally a big thank you to Muriel Mewissen and Nicola Osborne, who did most of the heavy lifting to organise the Developer Challenge.


LiveBlog: Pecha Kucha Session 2

Repositories for Scientific Research Data – Peter Murray Rust

We spend huge money, millions, looking at phylogenetic trees; huge amounts of data are gathered but most are thrown away. Little work has been done to solve this, but ANDS in Australia has done some great work, and we have a model in crystallography – you have to supply your data to be published, so you have data as part of publications. A similar problem exists in computational systems. There is a lack of awareness of DataCite, and really of the concept of sharing data at all. Progress? Well, Obama is on board…

So back to that tree of life. We need repositories for particular domains. I like FigShare. I like CKAN. Anything open. But compared with IRs there are very few domain repositories, and I want to index absolutely all species, places and dates in biosis – thanks to JISC for supporting this.

And hat tip to OKFN here too.

But we need to make repositories for scientific data. Jisc have clout to make domain repositories happen!

Matt Taylor – Small Dataset Support

I want to address people at universities with an embarrassing problem. I speak, of course, of those with small datasets. We inadvertently emphasise huge datasets, but small datasets matter. I know what a lot of you are thinking: “why are you talking to me about this?”. Perhaps you have a friend or coworker who is a bit quiet, who has unconventional metadata needs. But these unloved academics need a supportive hand. Not out of pity but as genuine recognition of their problem.

Redfeather is a Jisc RI project designed to offer a repository-like experience, but in miniature. It can be trivially installed on any computer via a simple PHP script. It has a simplified interface and workflow, ideal for those inexperienced in, or intimidated by, depositing their materials. There is support for audio, visual, PDF, documents, etc., and social media tools let you spread enticing rumours about your research. RDF and JSON output allow you to spread machine-readable versions of your work to the world. And you can customise it, reskin it, and make it work for you. As used by, for instance… Redfeather allows those poorly endowed with data to maximise its use.

It’s not the size of your dataset that counts, it’s how you use it!

Sebastian Palucha – Implementing Durham E-Theses

We started with an out-of-the-box EPrints, but we wanted to make it as simple as possible to use and to highly customise it, with LaTeX support. We added Google Analytics for full text. And we wanted to be interoperable with EThOS. The British Library does digitisation services for us, so we have changed the model a bit: we store EThOS persistent IDs. There were some UTF-8 issues; rather than update EPrints we used an XML fix. And a student question raised Creative Commons licensing as an issue for us to work out, so we have made it clear to the user how to use it. We use Google Custom Search to look across repositories.

We have a retrospective digitisation project on the go at the moment, with lots of materials to work out. We need to ensure we comply with the EU cookie law. And then there is the repository vs real life: users want to do bulk upload; some try to send encrypted PDFs. And now we need to look to sustainability as well.

We have plans to review our processes, to connect our CRIS, and to engage with the repository of the future as a concept as well.


Q1 – Kevin) For Peter and Matt: a similar problem from different ends of the size spectrum, very different solutions. Discuss!

A1 – Matt) My motivation was: how can you make a repository-like system that is super simple? I do teaching and learning repositories on the whole. They have much lower size and detail requirements than many other repositories. Often when working with EPrints I need a simplified solution. That was the idea of Redfeather.

A1 – Peter) If we can get people using this stuff on their own machines then that’s great. The people who can distribute are Apple and Google etc.; we need to lead or they will. It’s right to put it on our own systems, to get it on the iPhone etc., but will it happen? And when we talk about long tail data… I think you need a few repositories for specialist domains in order to get it to the community.

Q2) So your idea is for those without a repository?

A2) The idea of Redfeather is that it is for those without a repository.

Open Access: Hegemonic and Subaltern – Les Carr

This came out at 45 mins when I practised it. This should be interesting!

When we think libraries we think stacks etc. Universities are huge and old, conservative and sustainable – they predate the states they are located in. Very unlike many research contexts. Ten years ago the web escaped from CERN through physics departments, research, banking. People adopted stuff built for open physics collaboration.

This wasn’t the first time someone tried this… In the early twentieth century there was a phone-based, library-card-supported idea that was basically Google but offline! H G Wells had the idea of microfiche allowing a collection of all research in the world. Vannevar Bush tried the same with hypertext. We eventually got the web, which escaped from CERN and inflated to the world with ideas like openness. Not worrying about identity, IP, theft – not issues for academia…

And ten years later… Open Access comes along, defined in Hungary. The idea of opening the door to knowledge, science, data, educational resources, government data, etc. But it doesn’t suit everyone. Not everyone is a physicist! We have commercial interests and we have the academy. There are genuine interests and tensions.

Robert Maxwell was one of the first to think about commercial benefit from open materials. Damaging stuff. We publish facts in journals; we need rules and evaluation. The web changes our practice and our use of science. If we shackle that stuff because of commercial interests it spoils everything.

Our mission statement at the University of Southampton does talk about benefiting the world. But we have come to the wrong conclusion with Finch etc. We have come to the point of asking: whose side are you on?

Scott Renton – Images at UoE

This is light advertising! I work with special collections at the library. I will talk about what we do and what’s coming! We have CRC collections, largely prints. Photography is our first stage – we have a huge fancy camera. We grab the images and feed them into LUNA, which uses very high JPEG2000 compression. The workflow is complex. We use a DAMS collection management system and all images go into the collection. With so much born-digital stuff you can’t apply it in the way you’d want. Metadata is provided by photographers; we have cataloguing in LUNA and are looking to crowdsource that.

Discovery and publicity of images really matters. One way to do this is using OAI-PMH. We used this with the Europeana Project, in the MEMO project. We also have a Flickr presence connected up via the API. And there is a BookReader object interface to display scans – low res images linked to the high res versions. We are part of LUNA Commons and would like a UK LUNA Commons too.
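For reference, exposing images over OAI-PMH means harvesters start from a simple request URL. A minimal sketch (the endpoint below is invented; `verb`, `metadataPrefix` and `resumptionToken` are standard OAI-PMH parameters):

```python
from urllib.parse import urlencode

# Hypothetical OAI-PMH endpoint for an image collection.
BASE = "https://images.example.ac.uk/oai"

def list_records_url(metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # A resumptionToken must be the only argument besides the verb.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return f"{BASE}?{urlencode(params)}"

print(list_records_url())
```

A harvester such as Europeana's then pages through the collection by following the resumption tokens the endpoint returns.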

Next up will be ordering systems. Digital preservation: we embed metadata into TIFFs. Our DSpace collections have Skylight as an interface. We want some sort of e-commerce; we have built a checkout system to sit with images. The next version of LUNA will be more scalable, faster, and web based, which will be good.

Better visualisations would be great, more mass digitisation, and interoperability with Voyager.

All links are in the end slide here – take a look.


Q1 – Robin Rice) Commercialisation came up in both your talks; it was also raised in the DCC Round Table…

A1 – Les) I have no problem with publishers making piles of money out of universities. Nor do I have a problem with new sites making money. But I don’t like barriers. Adding value is great; artificially creating scarcity is not OK. My big beef is not that we can’t afford the literature; it’s that we can’t datamine it. So the Cory Doctorow model: free, but with enhanced versions for money too.

Q2 – Robin Rice) You mentioned high res image sales – is there a business model?

A2 – Scott) We had to grapple with Creative Commons issues for Flickr; we share low res images there… It shifts regularly where these business models are at… No simple solution.


LiveBlog: Short Presentations 4

Paul Walk – RIOXX Metadata Application Profile and Vocabularies for OA (V4OA)

Paul will be presenting on two projects, RIOXX and V4OA, developed by UKOLN and funded by JISC. Funding for both ended yesterday.

RIOXX had as its goal to improve the quality of metadata being harvested from IRs, really enabling better re-use. The other goal was to satisfy reporting requirements from RCUK. The principles I was keen to adopt in this project were to create something that would cause the minimum disruption to the IR, learning from previous experience and particularly from the SWAP profile, where perceived complexity was a barrier to uptake. And we wanted to emphasise pragmatism over elegance – the least doable in a reasonable timeframe. And we knew this would be an interim solution, as there are other technologies coming down the line: things like CERIF accommodate these sorts of information. So there are lots of compromises in RIOXX, and that’s a feature, not a bug.

So we have delivered a set of guidelines for repository managers – how to describe open access papers, primarily for reporting to RCUK – and a metadata application profile to support that, borrowing heavily from the EThOS project. And we created XML schemas to support that profile. We also commissioned specific software for EPrints (a plugin) and a DSpace repository patch – they are now available to download. And Atmire is also trialling RIOXX at the moment.

The authority file of funder names was an interesting thing. It was not straightforward, but we have moved forward; there still isn’t a globally recognised system, so we have a list of data. We had a list from Elsevier, but in the meantime CrossRef have developed a database called FundRef, based on the same data and with an API. That overlaps with our work. If the terms are appropriate I have agreed to deprecate the RIOXX list in favour of FundRef.

Where RIOXX has been developed in a deliberately open way, V4OA has been a closed consultation with major stakeholders to reach consensus over which vocabularies to use. It became apparent that some stakeholders really needed that work to be private, so it has been very private and very closed as a result – not an environment I usually work in. But agreements have been reached in a number of areas, and those agreements will be made openly available for public consultation in a few weeks’ time. What it tries to do is pin down how to use vocabularies to describe phrases like “Open Access” in repository records. Just having an agreed vocabulary across 50-60 repositories will be very useful. Look out for it.

In terms of potential for RIOXX: funding for RIOXX has ended – and funding for UKOLN has ended – but I have undertaken to keep RIOXX and V4OA up and running for another year. In terms of further development of RIOXX, there will be efforts to include developments such as V4OA, and I know that Jisc are keen on that. And there will be implementation or support at a national aggregation – there is an ITT out from Jisc for that at the moment.


Q1 – Ian Stuart) The RIOXX tools for DSpace and EPrints – are they import or export or both?

A1) They are for export. The DSpace one hooks into the authority file list. It’s essentially an export filter.

Q2 – Peter Murray-Rust) I’m impressed by V4OA if it is what you say it is. We desperately need it… Do you think the process has been constructive or disruptive? Do we have white smoke?

A2) I have been surprised by how constructive it seems to have been. I’m not sure about a workable conclusion, but we do have consensus. There was a moment in the consultation where we moved away from gold and instead turned to more practical and implementable things.

Q3 – Balviar) Jisc are committed to looking at the next phase and any implementation, configuration and development that may be needed.

Chris Keene – I’m turning enterprisey (I really think so)

Quick background: the University of Sussex’s repository is Sussex Research Online. For us the repository is a system with lots of metadata and some files connected, mainly if funders mandate it. We don’t have a CRIS, so some things may not apply to those with a CRIS.

We are a bit in the past and fluffy… We like stuff, we like open stuff, we encourage metadata sharing. But in the last year we’ve turned much more enterprisey; our work has a knock-on effect for the university. There are two big drivers here, the REF and RCUK. The REF basically decides research block funding for UK HEIs for the next six years or so. It’s broken down into units of assessment, with submission due November 2013. This isn’t extra funding, this is crucial core funding. It makes a big difference. I’ve been working on REF2 – Research Outputs.

We are using our IR for REF2, using the EPrints REF Plugin. Data is exported to the university data warehouse along with other REF data, which turns it into an XML file submitted automatically to HEFCE.

So, what’s changed? Well, we used to do great stuff for personal pride. But all of a sudden there is big financial risk here. Before, it was what was published that mattered; now what’s right is what HEFCE thinks is right. It doesn’t matter what we want; they are right. When is something published, for instance? When it’s in the journal? Or when it’s online first? The conference item is an object of publication in some disciplines (and not in others). And what is the publication date – issue, volume, page number? With no physical version, what’s the date? If the online copy comes online first, what’s the date?

Implications for the IR: well, there is risk aversion now. We can’t do fun trial stuff on the IR anymore; this is the REF system for us, which is probably not a good thing. Metadata matters – research has to be represented correctly. And we have to do all we can to avoid losing out…

Also RCUK and others have reported on open access, which gives us two aspects to think about: workflow – how can we ensure OA is built in? – and how do we identify appropriately funded researchers to ensure they are OA compliant? And we don’t actually know what data we need to collect for RCUK, though V4OA will help there. Someone from a Research Council says they may use these as the basis of what they will require, which is a bit scary. We were supposed to collect data from this April (gone), but we don’t know what the requirements are yet, so we don’t know if we are collecting the right stuff. We probably need to store basic research project information per research item – whether green/gold/long green, funding council, etc.

So managing IRs feels really different in 2013. Getting it right is so much more important now and has financial implications.


Q1 – Theo Andrew) You talked about the perils of getting things wrong… any positive points of REF implications for you?

A1) Yes! We want metadata and we want open access stuff in the IR and people are doing that – REF drives them but we are seeing increased usage and we’ve also had funding to train and support the IR which is good.


LiveBlog: Short Presentations 3

We are kicking off Day Two with short presentations:

Hydra – Chris Awre, University of Hull

Firstly, thank you to my colleague Tom Cramer at Stanford for some of these slides. Hydra started out as a project in 2008 between the University of Hull, the University of Virginia, Stanford University, Fedora Commons/DuraSpace, and MediaShelf LLC, as we identified a common need. The time-frame was 2008 to 2011, but it is now running indefinitely.

We had several fundamental assumptions. Firstly, that no single system can provide the full range of repository-based solutions for a given institution’s needs, yet sustainable solutions require a common repository infrastructure. And we also assumed that no single institution can resource the development of a full range of solutions on its own, yet they all need to tailor their solutions to their own local needs and circumstances.

So Hydra is a repository system which you can take and run and use but you can select what you need knowing that all elements share a common infrastructure. But Hydra is also a community, that is key to sustainability through encouraging lots of input from lots of places. It is a technical framework that can be applied to other solutions. And Hydra is open source.

The software we use is a Fedora repository with a Solr indexing tool. It uses Blacklight, adapted to repository content, as an interface. And everything is built on Ruby, as it is flexible and has excellent testing tools, with Ruby Gems used as well.

Fedora can be complex in enabling its flexibility – so how can the system be enabled through simpler interfaces and interactions? Well the concept of Hydra is that there are many views onto a single body of materials (Hydra, one body, many heads). We now have well over 20 institutions using Hydra. Many are in the US but there are others around the world. Hull is by no means a large university and we have really benefited from being part of this project. LSE, Glasgow Caledonian and Oxford are also using Hydra.

Hydra allows you to manage ETDs, Books, Articles, Images, Audio-visual content, Research data, Maps and GIS and Documents. You can include any of those as a single body of content so that you are not building new systems for each. And that idea of different views allows you to filter through that single body of data.

We have four key capabilities:

  • Support for any kind of record or metadata (as per Fedora)
  • Object specific behaviors – whether books, images, music, etc.
  • Tailored views
  • Easy to enhance

Hydra@Hull includes many types of data – we have datasets, committee papers, student handbooks and articles, etc. We try not to overstretch ourselves, but it’s great to be able to accommodate others’ needs. We are using version 6 with a Bootstrap interface (the Twitter interface tool now usable for other sites).

We have seven strategic Hydra priorities at the moment:

  1. Develop solution bundles
  2. Develop turnkey applications – make it even easier to set up and install
  3. Grow the Hydra vendor ecosystem – support matters and we have already started to see vendors come onboard
  4. Codify a scalable training framework to fuel community growth – a session in Dublin recently, more coming up in Virginia soon
  5. Develop a documentation framework
  6. Ensure the technical framework allows for further enhancement and development – we will be “Gemifying” in September
  7. Reinforce and develop the Hydra community.


Q1 ) Can you say more about BootStrap?

A1) It is a CSS library designed by Twitter, and you can download it and either use the entire libraries (as we are) or you can take elements and apply them.

Q1) Mainly for responsiveness?

A1) Well yes, everything works on mobile. But it’s also about the freshness and flexibility of the design.

Q2 – Les) Can we have your OAI-PMH endpoint?

A2) Yes.

Q3) Repository owners have problems recruiting developers and keeping them engaged, Hydra doesn’t seem to have that issue, can you say why you think that is?

A3) I think it was the choice of Ruby – none of our developers had used it before, but all got up to speed rapidly and they enjoy that environment, and they enjoy the interaction of others sharing ideas with each other. One reason it’s potentially successful in the US in particular may be that US libraries take technical development really seriously. That can seem to be a struggle in the UK, and we will need to address that if repository development is seen as important.

Q4) Can you say a bit about how you have adapted for other purposes?

A4) A lot of our REF records are being output from the CRIS using Ruby scripts. In terms of data management, we are using Blacklight to enable searching and analysis of datasets.

Andrew Dorward and Pablo de Castro – UK RepositoryNet+

We have been involved over the last few years on UK RepNet, which is a project to build out the socio-technical infrastructure for shared repository services in the UK. The two-year project came to an end two days ago, so this is an ideal opportunity to reflect on what we have done. We have a round table later in which we will explore some of the issues, especially around CRISs. But just now we will share outcomes and lessons learned.

We have worked on various services and elements – some have come from ideas through to services during the project, some have been explored during the project. Our website gathers what we’ve done in one place. There is two more years’ funding from JISC, so service elements such as IRUS-UK, RoMEO, Juliet and the Repository Junction Broker will continue under the management of JISC. And the website includes so much more of our work and findings as well.

In terms of our outcomes we used an ITIL framework for transitioning projects into services – to bring all stakeholders together in a coherent framework. So what lies under RepNet are a series of components addressing Aggregation and Search; Benchmarking and Reporting; Registry of Repositories; Deposit Tools; Metadata quality – tools for enhancing it. And then a gap analysis of where gaps in metadata could be filled through new initiatives or services.

We started the project with a market analysis – looking at where repository managers felt they were and where they would like to be. We looked at prototype projects and services and then moved those through to services. So one thing we created was a mapping of the CRIS / IR landscape. We found a mixture of usage and obviously over the last two years some of these have changed. We want to explore CRIS further later on.

Another key outcome of the project was stakeholder engagement activity. A complex diagram as many stakeholders. We have HEIs, Service Providers and Vendors. We have JISC, we have RCUK and Wellcome, and we have component providers (e.g. EDINA, University of Nottingham, Mimas, etc) and we have OpenAIRE/OpenAIRE+, we have COAR and euroCRIS. We also have ARMA, UKCoRR and RSP.

The STARS initiative was one of the main outcomes of RepNet in stakeholder engagement and exploring how the landscape analysis could apply to a single institution – in this case St Andrews. We explored running services on DSpace IR and/or on PURE CRIS.


Q1) I think IRUS has been one of the big successes of the last few years. But two things on there I’m curious about are the Repository Junction and Metadata Enhancement. When I think about that I think about the REF framework – put data in there and CrossRef comes back with matching metadata, and that’s been very useful for tidying up my metadata.

A1) So IRUS wise I think people who were in the Repository of the Future session yesterday – Balviar mentioned that 35 institutions are signed up but we’d like to get to 150. When we worked with St Andrews it was a fast install for IRUS. There have been huge numbers of downloads, extrapolating across the UK there will be huge traffic across the network. We are hugely excited by IRUS. That’s analytics but adding in bibliometrics and Altmetrics it is even more exciting.

Pablo: RJ Broker – the broker allows mediated full text deposit and as there is increased demand to do that it will really be useful. We went through some implementation issues yesterday and will have more space to talk about that later on. Every repository platform and version has to be supported in order to use push mechanism and SWORD across all repositories. It has two aspects. The core is working, the additional work is being implementing.

Balviar: On RJ Broker… it is in test phase right now. It is being tested with Nature Publishing Group, EuPMC, Imperial and Oxford. All in test phase. In terms of what it can do it’s one-to-many deposits – e.g. for multi author papers. Both those publishers are really on board.

Pablo: Finally I’d like to highlight the paper by William Nixon on APC funding – related to your question on the REF, including funding as metadata. This was published in the UKSG journal Serials earlier this week and we will talk more about that too.

Angus Whyte – The role of repositories in supporting RDM: lessons from the DCC engagements

I want to share some experiences and really about the role of repository managers in the wider institution. Following on from other presentations really in that it is about interoperability.

For those who are not aware the Digital Curation Centre has been around for almost ten years. In latter years we have had a much greater focus on Research Data Management. Since 2011 we’ve had HEFCE funding to help institutions engage and embed Research Data Management. Our work includes institutional engagement in both research intensive and teaching led institutions.

We have also had a background role in the JISC Managing Research Data programme, which has funded 25 infrastructure projects from 2009-2013. We have supported events and provided tools for the sector.

So we have a view of the development process of RDM. This process certainly isn’t linear. Our role has focused on the earlier stages – helping institutions to develop policies and advocate for research groups. Our tools build on work carried out elsewhere: CARDIO – Collaborative Assessment of Research Data Infrastructure and Objectives – and DAF – the Data Asset Framework. CARDIO is based on work by the Data Library and other institutions in 2006.

From the institutions that we work with and from speaking with JISC we came up with this view of the services that we have seen emerging over the last few years. This is a very high level view but captures early stages of technology through to establishing data catalogues with metadata assets. We can probably see repository roles sitting on the bottom (guidance and support) and the right side of this diagram.

In terms of emerging services there have been a number of excellent surveys published recently in the US and the UK (Cox and Pinfield 2013 as well as Corrall, Kennan and Afzal 2013). These really give a good view of planned RDM services. Very interesting views of what libraries in particular are planning to deliver in the next 2 to 3 years. And the prioritisation of those plans. There are a mixture of advice and liaison, and technical services planned. There are interesting points from those priorities – there is still a lot to do to help libraries develop policy. And data citation advice comes low down the list – it is a priority for funders but perhaps the library see their role slightly differently here. What sits with the library, what sits with the institution?

When we engage with institutions repository managers get involved in very different things. So if we compare Oxford Brookes with Edinburgh University – very different institutions – we see repository managers taking lead roles in steering groups to develop policy, to develop online guidance, to support data management planning. Oxford Brookes have been driven by EPSRC expectations and they are aware that the infrastructure isn’t what they’d like it to be. They have done a lot in the last few years. There is data in the IR and they have a helpdesk, all without specific RDM staff. A contrast with Edinburgh. Edinburgh have been active in this area for many years, Robin Rice has had a very active role in the steering group here. One of the first UK data repositories. Data Library pivotal in RDM developments. They have actively involved social science librarians to help build awareness and activity. They have led on RDM policy and training materials – particularly on MANTRA of course.

So, to sum up. In our experience repository managers are very active in kickstarting “softer” capabilities. Still few universities have dedicated RDM staff; the role tends to be carved out of existing academic liaison roles (as the surveys mentioned also indicate). It’s kind of obvious that repositories already deal with computing services, research support and records managers, but what I hope we can discuss later is how those relationships, particularly in day to day work, function in terms of research data, continuity of process and data, and where new workflows come in.


Q1 – Andrew Dorward) You talked about data catalogues – is that a repository or registry of data repositories? Is there a common way of benchmarking metadata for different disciplines where data varies?

A1) Good question. Data catalogues are basically catalogues of what research data has been produced by the institution. Not all things will be recorded or deposited in the repository, if there is one. But generally we think institutions want a record of what has been deposited, whether with them or elsewhere. So that catalogue has to be a lowest common denominator to work across disciplines and contexts. Southampton have thought this through well with a three level approach, allowing people to make some choices rather than shoehorning data into an inappropriate format for them.

Q2) Question about the survey and whether you were shocked by the results. There are few institutions with dedicated RDM staff. Priorities for training and advice… will they hire people for those roles?

A2) People are trying to carve those roles out of existing roles. Then hiring short period (1-2 years) project managers to lead that work. Whether institutions make the EPSRC 2015 deadline or the 2014 REF will be interesting. Institutions have to figure RDM into their planning to ensure they can get things going.

Q2) In terms of roles… is anyone here in the education space rather than repository management space?

A2) We know of a few but you could put them in a couple of taxis! Those institutions that have funded those roles have done so as they see it as a competitive advantage.

Comment – Kevin Ashley) Most of those surveys questioned libraries on RDM but not institutions. Libraries are important stakeholders in RDM but not the be all and end all. If one wants to understand what universities are doing with research data you have to ask universities not just libraries.

A3) Yes, I think that’s reflected in the priorities.

Q4) In terms of trust that you mentioned… is that in the repository within the institution, or for the end users accessing the data?

A4) Firstly researchers have to trust the repository, and funders require researchers to deposit data in appropriate jurisdictions. There is a gap at the moment in guidance around what is a good place to ask researchers to deposit their data, in terms of trust standards, seals of approval, ISO 16363. That standard has been established but few certified repositories exist. How do you deal with Databib, which lists hundreds of repositories but offers no guarantee of longevity? We probably need RJ Broker but for data… more work to do to get there first though.

And now for coffee followed by Round Tables and the parallel judging of the Developer Challenge!

Posted in LiveBlog

Repository Fringe Day Two: What We Found Out (and Who Won the Lego!)

We had a fantastic and very busy day two at Repository Fringe today!

A brief welcome from Repository Fringe Chair Nicola Osborne kicked off the day and was followed by a welcome and introduction to the city – and its many and varied repository activities – by Stuart Lewis. Our Opening Keynote Jacqui Taylor talked Open Data, data visualisation and connecting up the supplier side and demand side of data sharing (see her slides here). She urged attendees to consider being part of the UK Open Data User Group and shared their call for members. Read the full live blog here.

Coffee, served in the stylish Repository Fringe insulated mugs of course, fuelled an extra long networking break to allow for serious business card exchange, with a prize for the most creative cards or handing-out method. There was also cake to keep us all lively for networking!

Image of branded coffee mugs

The Repository Fringe 2013 coffee mugs – a “collectors item” according to a tweet from Ann Green (@annthegreen)!

As the break drew to a close the Developer Challenge officially kicked off in the “cafe” area of the Informatics Forum. We have seriously good prizes this year and we are also extending our online submission deadline so it’s all to play for – even if you are not with us in person! Find out what happened today at the Challenge here and find out more about the hack, including the rules, here.

The morning continued with short presentations. Angela Laurans and Theo Andrew filled us in on what happened at yesterday’s OJS Workshop then Graham Triggs talked Vivo and FigShare. Pablo de Castro and Jackie Proven gave a double act on STARS – a project trialling various repository innovations at St Andrews. Then Chris Gutteridge explained why “It’s not open if nobody can find it” and Tim Gollins polished off the session with his energised low-slide-count “Parsimonious Preservation and Digital Sensitivity Review” talk. Read the full live blog of that session here.

After lunch we returned with three Round Tables looking at the Repository of the Future, Preservation and Digital Sensitivity, and Academia and open access. All three were busy and lively and we know there will be write ups for all three following shortly from our crack volunteer blogging team!

After a refreshing coffee break – and more cake! – it was back to our main room for the announcement of our Networking Prize winner, Lisa Ng from Heriot-Watt, who had pre-customised a large number of coffee mug holders with her contact address. She won a coveted Lego business card holder! Honourable mention goes to Muriel Mewissen – who skewered her business cards on free lollipops – and to Claire Knowles – who handed out free calendars from the University of Edinburgh library as her own unique calling card. Most importantly of all everyone got a great excuse to meet new people, chat and exchange ideas!

Image of Lego networking prize and entries.

The networking prize and some of the best contenders. Click through for an annotated image on Flickr.

Then it was on to the Pecha Kuchas which kicked off with a breakneck pace GoGeo and GeoDoc overview from Tony Mathys, followed by Sarah Jones giving a super calm guide to the new DMP Online tool. Sarah was followed by Dave Tarrant convincing us that little buttons make a big difference, Pat McSweeney telling a room full of librarians and library related staff that metadata was, basically, bad, and finally Muriel Mewissen went through the challenges and opportunities of RJ Broker. It was a fab sequence of presentations – full liveblog here – with just enough time for questions and instant voting!

We finished the afternoon with more short presentations kicking off with Robin Burgess’s excellent follow up to his 2011 Repository Song – video of this follow up and accompanying music video to follow shortly. Next up Robin Rice talked Use Cases and finally Stuart Lewis closed the day with his super concise overview of ResourceSync. Full notes from this session here.

The very final tasks of the day were the award of our first Pecha Kucha prize and our Symplectic Drinks Reception. After challenging voting – as all of the PKs were great – the winner was Pat McSweeney, proving that librarians and academics really do love a heated debate! Pat takes home a rather desirable Lego calendar and a bonus prize: the new CD from Scratchy Noises, one of whom is also a Repository Fringe delegate!

And the day closed with very convivial drinks, chat and plotting of evening Edinburgh Fringe shows for many. If you were only with us today please do fill out our feedback form. Otherwise we shall see you tomorrow for more Repository Fringe fun including our marvellous Closing Keynote Mark Hahnel, more Round Tables, the judging of the Developer Challenge and yes, more Lego Pecha Kucha prizes.

From the blogs

We are indebted to our volunteer blog and social media team who are tweeting, sketching, photographing and preparing a rich Twitter analysis of the event. We are sharing links here to posts that have already gone live but please leave a comment below or email us if your own post isn’t included here!

There will be more blogs, video, images, etc. to come after the Fringe so do keep an eye on the blog!

Posted in Developer Challenge, LiveBlog

Developer Challenge Update

Our developer challenge runs in parallel to the main Repository Fringe event. The first signs of hacking appeared this morning in the foyer area. Armed with a few sweets to break the ice, but also to provide some extra energy for coding, I went to meet some of the participants and encouraged all potential hackers to take part in the challenge.

We have a few interesting ideas brewing that fit well with our preservation theme. One of them, from the Picture Liberation Team led by Peter Murray-Rust, helped by Richard Wincewicz and Cesare Bellini on the developer side, is looking at ways to add the open access license in the picture itself rather than in the metadata, which can easily be stripped and separated from the data. Peter wrote a post about this on his blog: Making images Open can and should be routine.
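As a toy illustration of why putting the license in the picture itself is more robust than metadata (this sketch and its function names are mine, not the team's actual method): a licence written into the least significant bits of the pixel bytes survives even when EXIF/XMP metadata is stripped.

```python
# Toy sketch, not the Picture Liberation Team's code: hide a licence string
# in image *pixel* bytes so it survives metadata stripping. A real image
# would come from a decoder; here a raw byte buffer stands in for pixels.

def embed_licence(pixels: bytearray, licence: str) -> bytearray:
    """Store each bit of the licence in the least significant bit of
    successive pixel bytes (8 bytes per character, NUL-terminated)."""
    data = licence.encode("ascii") + b"\x00"
    if len(data) * 8 > len(pixels):
        raise ValueError("image too small to hold licence")
    out = bytearray(pixels)
    for i, byte in enumerate(data):
        for bit in range(8):
            idx = i * 8 + bit
            out[idx] = (out[idx] & 0xFE) | ((byte >> (7 - bit)) & 1)
    return out

def extract_licence(pixels: bytearray) -> str:
    """Read LSBs back out until the NUL terminator."""
    chars = []
    for i in range(len(pixels) // 8):
        byte = 0
        for bit in range(8):
            byte = (byte << 1) | (pixels[i * 8 + bit] & 1)
        if byte == 0:
            break
        chars.append(chr(byte))
    return "".join(chars)

pixels = bytearray(range(256)) * 4        # stand-in for decoded pixel data
stamped = embed_licence(pixels, "CC-BY 4.0")
print(extract_licence(stamped))           # → CC-BY 4.0
```

The point is simply that the licence now lives in the image data itself; stripping metadata blocks, as many web services do, no longer removes it.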

In a similar frame of mind, Chris Gutteridge has been working on an EPrints plug-in that would embed copyright and license information in images.

Andrew Dorward & Pablo de Castro have an idea around the concept of a registry of data repositories but are looking for developers to discuss and develop their concept. It is not too late to take part and we do accept well presented ideas as well as prototypes. The prizes are great and all that is required is a short presentation. Get in touch with Andrew or Pablo if you can help them with tuning the technical side of their idea.

A few more developers are busy working on hacks but are still too shy to give me more details yet. Richard, Russell, Patrick, Ian, I’m looking forward to seeing what ideas you have been working on!

Presenting and attending parallel sessions is obviously putting pressure on our potential participants and I have caught a few of them writing up their Pecha Kuchas instead of hacking! More distraction awaits tonight with the Fringe in such close proximity!!

Remember that you can submit your entry remotely even if you couldn’t make the meeting in Edinburgh – all the details are in the Developer Challenge section. Just let me know of your entry before 10.00 tomorrow.

The presentations to the judges will start at 10.30 with the winners show & tell scheduled for 14.30. Looking forward to it already!


Posted in Developer Challenge, Guest Blogs

LiveBlog: Short Presentations 2

Rocio von Jungenfeld is introducing our final three presentations…

Robin Burgess’ Repository Song

When I presented at Repository Fringe 2011 my paper was on take up and embedding of the Glasgow School of Art’s new repository, replacing their previous system with a better one. Then it was the story so far and where we were… the conclusion was that we finally had a name… RADAR.

Two years on we will update you on where we’re at! They said never work with children and animals… watch out for miaowing from my cat! The song is called RADAR: The Repository Song.

As I can’t capture Robin’s song properly here there will be video… 

It’s better than Filemaker… and easy to use… It’s built on EPrints. It gives a much better view of GSA’s research capability. And they can use the repository from home. The impacts we’ve been seeing are mainly to improve take up and use of research outputs… hopefully. Recommendations? Make sure you engage staff resources and work with their needs.

RADAR is a repository and thank you for welcoming us to the repository community!

Robin Rice – Edinburgh DataShare and RDM

I know some people are interested in Edinburgh’s DataShare and I thought I’d share some “warts and all” issues we’ve faced with putting data into repositories. DataShare is a free-at-the-point-of-use data repository that started in 2008, back when no one was really thinking about reuse of academic research data in that way. We worked with repository communities and library communities, and with Oxford and Southampton on this. We have gone on to create our own Research Data Management Policy. It means people should deposit their data. We don’t say it has to be open access as a term – but it’s clear that people should make their data available – we are about encouraging sharing.

We had the policy in 2011; now we have a steering group who help make sure it’s fit for purpose for academics. So this is a picture of the University RDM Roadmap. It’s the brickwork for everything we do. Our work relates to the data stewardship part of the roadmap. We have been using the DMP Online tool as part of this too. The group have given us a challenge to look at tough test cases. So for instance Dr Nuno Feirrara came to us through MANTRA and wanted to encourage good practice and get students to share data in Clinical Psychology. But they have NHS data and supervisors, and a lot of the fieldwork is considered sensitive. Not in a legal sense, but the NHS people may not want a study out there in case it leads to a scandal. So whilst Nuno grapples with the politics we could use him as a usability test case – he wanted a simple process for his students to work through. We’ve got a lot of usability results out of it, so the next two releases will be fixing that, hiding unneeded fields etc.

Another use case we had, Dr Bert Remijsen, was sort of a perfect user – we did assisted deposit for him. He’s from Linguistics and English Language. His expectation was that he could upload a zip and have it magically unpacked with the content explained. To get round that we did it for him. But we would like to make it that simple… He had already deposited in a Max Planck repository for lost languages too. So is duplication good or not? He was happy with the download stats and referred colleagues to them.
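The deposit step Bert hoped for – upload a zip and have the repository unpack it into per-file records automatically – is straightforward to sketch with Python's standard zipfile module. The record fields below are illustrative, not DataShare's real schema:

```python
# Hedged sketch of an auto-unpacking deposit step: the repository receives
# a zip and derives one record per file. Field names are invented.
import io
import zipfile

def unpack_deposit(zip_bytes: bytes) -> list[dict]:
    records = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            records.append({
                "filename": info.filename,
                "size_bytes": info.file_size,
            })
    return records

# Build a small zip in memory to stand in for an uploaded deposit.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("recordings/example_001.wav", b"fake audio")
    zf.writestr("README.txt", b"field notes")

records = unpack_deposit(buf.getvalue())
print([r["filename"] for r in records])
```

A real implementation would also need the "explain the content" half – prompting for, or inferring, per-file descriptions – which is the genuinely hard part.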

And we’ve had users from Informatics – Prof. Simon King from the Centre for Speech Technology Research – recording and annotating videos. They want to share the data. They have huge video files. They have specialist software to deposit. We’re still at the talking stage with them. There is an ongoing deposit… They want user registration and their own licenses in the headers of the files (a system they devised 10 years back). So there are special non-standard terms and conditions. How do we cope with that? There are no checks on some of the licenses – and some would say requiring users to register violates academic freedom.

And another use case has been the Roslin Institute – they have lots of ‘omics data. We felt there were specialist repositories for this stuff… apparently that’s not the case. They have loads of data created by machine: big files, getting bigger, very easy to generate. Should the repository be the place to share these things? Or should they go it alone and figure out a storage solution? That’s still an open conversation. And they are interested in a push-pull relationship with the CRIS.

And finally the Fish4Knowledge project from Prof Bob Fisher. It’s an EU project in the Institute of Perception, Action and Behaviour (Informatics). But when that project ends do they just wipe the data? It’s observational data. The professor came up with a way of automatically detecting fish in video. He’s suggested keeping a 5% sample of the video data but that would be 3TB. And that would swamp our database – a great challenge for us around big data.

And we’ve also done some work with the ECA. They have their own digital asset management needs, so to what extent should there be rich display to the users – is that the responsibility of another service?

So issues arising from pilots include: usability and user education; encouraging user to document and future-proof; relationships of IRs and subject repositories, etc.


Q1 – Chris Adie) Many of those use cases are very specific. But at the same time one thing that comes across when managing data is that we need more use cases described to learn from across the sector. So to what extent could you write up or describe those use cases for the sector? Of how to manage different types of data.

A1) I think that’s a good idea. I’d like to solve all the problems first.

A1 – Stuart) We keep thinking people must have done this already so a central collection would be great!

Stuart Lewis – ResourceSync and SWORD

Somehow I’ve managed to both start and end the day! So you all seem to know about SWORD (by show of hands) so I’ll focus on ResourceSync. JISC has been supporting this work. ResourceSync has been developed by the same people who developed OAI-PMH, but this time working with NISO/OAI with Sloan funding and building on the OAI-PMH experience.

It’s basically about ways to synchronise resources on the web – and those might be files, images, whatever. As much as OAI-PMH has been adopted and embedded in our community, it has its issues. So this allows us to look for changes or updates needed in the repository – so if you archive a site you might just want to update rather than overwrite the data (e.g. fetch new blog posts rather than overwriting the whole history of posts). It’s an interoperability protocol – like SWORD – not a piece of software! Much like HTTP. You have to have a client and a server.

There are several different layers to the protocol: discovery; capability description – how does it do that?; baseline sync – grab everything; changelists – a way to gather only the latest updates; dumps – basically zips… ways to archive the repository quickly and efficiently.

So that’s ResourceSync. It’s about getting things out of repositories for reuse and discovery.

Now, who has heard of Sitemaps? Your repository will support them out of the box! They let Google etc. understand your website rather than blindly crawling it. So we aren’t reinventing the wheel… we are making use of sitemaps, adding information about changes, about relationships, about what needs to be synced.
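To make the sitemap connection concrete, here is a hedged sketch of the client side: a ResourceSync changelist is a sitemap `<urlset>` extended with `rs:` elements, and a harvester reads it to learn what changed. The sample document below is invented, though the namespaces follow the ResourceSync specification:

```python
# Sketch of a ResourceSync client step: parse a changelist (a sitemap
# <urlset> with rs:md extensions) to find what needs re-syncing.
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
RS = "{http://www.openarchives.org/rs/terms/}"

CHANGELIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"/>
  <url>
    <loc>http://repo.example.org/item/1.pdf</loc>
    <rs:md change="updated" datetime="2013-08-01T10:00:00Z"/>
  </url>
  <url>
    <loc>http://repo.example.org/item/2.pdf</loc>
    <rs:md change="deleted" datetime="2013-08-01T11:00:00Z"/>
  </url>
</urlset>"""

def parse_changelist(xml_text: str) -> list[tuple[str, str]]:
    """Return (url, change) pairs so a client can fetch only what changed."""
    root = ET.fromstring(xml_text)
    changes = []
    for url in root.findall(SM + "url"):
        loc = url.find(SM + "loc").text
        md = url.find(RS + "md")
        changes.append((loc, md.get("change")))
    return changes

for loc, change in parse_changelist(CHANGELIST):
    print(change, loc)
```

Because the base format is a plain sitemap, a repository that already publishes one is most of the way there; the `rs:md` elements add only the change type and timestamp.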

To sum this up… if you take nothing away… SWORD is putting stuff in repositories. ResourceSync is about what’s in the repository and what’s changed.
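The "putting stuff in" half can be sketched too. SWORD v2 is built on AtomPub, so a deposit starts with an Atom entry POSTed to a repository collection URI. This builds such an entry with the standard library only; the metadata values are placeholders, and a real client (such as the sword2 library) would also handle packaging, headers and deposit receipts:

```python
# Minimal sketch of a SWORD v2 metadata deposit payload: an Atom entry.
# No network call is made; the collection URI and metadata are placeholders.
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
DC = "http://purl.org/dc/terms/"

def build_entry(title: str, author: str) -> bytes:
    ET.register_namespace("", ATOM)
    ET.register_namespace("dcterms", DC)
    entry = ET.Element("{%s}entry" % ATOM)
    ET.SubElement(entry, "{%s}title" % ATOM).text = title
    ET.SubElement(ET.SubElement(entry, "{%s}author" % ATOM),
                  "{%s}name" % ATOM).text = author
    ET.SubElement(entry, "{%s}abstract" % DC).text = "Deposited via SWORD"
    return ET.tostring(entry, encoding="utf-8")

payload = build_entry("A test deposit", "R. Fringe")
# A client would then POST `payload` to the collection URI with
# Content-Type: application/atom+xml;type=entry
print(payload.decode("utf-8"))
```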

And if you feel brave arXiv ResourceSync have a Feed…

And here’s a 30 second demo. One nice thing about the trials JISC has funded is that they are real repositories. So I will run 3 lines of code. We see the changes and can sync them. Then run another line of code to deposit all those changes! And we can see them update in realtime! Live mirroring of arXiv into DSpace.


Q1 – Pat McSweeney) When are DSpace and EPrints going to have this?

A1) Well part of this work has been about how to do this in a fairly generic way. The DSpace trial has highlighted that DSpace timestamps an “item” but not “parts of the item”. We can see it’s updated but not which part has changed. And that’s highlighted that issue which needs resolving to make ResourceSync work. So there is a version for DSpace.

Q2) Is it that dependent on URLs that you can’t connect to desktop app? You can imagine that being useful…

A2) Call it a URI then. So that should be possible.

Posted in LiveBlog

LiveBlog – Pecha Kucha Session 1

Martin Donnelly, DCC, is introducing our Pecha Kucha session:

Tony Mathys – Geo Metadata and repositories

The GoGeo vision for repositories. How many librarians are there here? (not many). I work for the GoGeo metadata service. We work to make metadata discovery easier. We did a spatial data audit some years ago and found lots of spatial data that needed to be more findable. So we created a UK Spatial Data Infrastructure. People snore when you mention metadata. We could try mind melding but that’s not practical! So we came up with GeoDoc – a form for filling in metadata, with text fields, drop downs and automated lists, designed to make it as easy as possible. There is metadata validation. And that tool allows sharing in multiple formats, privately to institutions and out to GoGeo. We’ve had thousands of accesses but 230 records created, and most are published privately to their institutions. Why? Well, privacy and security concerns.
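The validation step might look something like this sketch – the required fields and rules here are hypothetical, not GeoDoc's actual checks, but they show the kind of sanity-checking (required elements, plausible bounding box) a geospatial metadata editor can do before a record is shared:

```python
# Illustrative sketch only: validation rules a geo metadata editor might
# apply before publishing a record. Field names are invented.

REQUIRED = ("title", "abstract", "bbox")  # hypothetical required fields

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    errors = [f"missing field: {f}" for f in REQUIRED if f not in record]
    bbox = record.get("bbox")
    if bbox is not None:
        west, south, east, north = bbox
        if not (-180 <= west <= east <= 180):
            errors.append("bad longitude range")
        if not (-90 <= south <= north <= 90):
            errors.append("bad latitude range")
    return errors

record = {
    "title": "Soil samples, Midlothian",
    "abstract": "Point data collected 2012-2013",
    "bbox": (-3.5, 55.8, -2.9, 55.95),  # west, south, east, north
}
print(validate_record(record))  # → [] (record passes)
```

Checks like these catch the common slips (swapped coordinates, empty mandatory fields) without making the form feel heavier to fill in.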

When data is made public it is surfaced in GoGeo. Records can be searched for, downloaded, or followed through to the repository to access the data. We have the ShareGeo Open data repository. We have few external contributors but 3,000 data downloads a month! We have elearning modules on metadata there, and a biannual newsletter on metadata.

So, our vision… we are trying to encourage good practice in depositing data, raising awareness in workshops. We envisage data sharing and deposit but also sharing of metadata as well! And for people to share multiple data sets for new applications. And that reuse might achieve digital immortality.


Q) You were talking about institutional nodes, and them not sharing publicly. Why do you think that is?

A) It tends to be institutions where an individual creates the structure. This whole process is about promoting data management. We want to break down the barriers and raise confidence in sharing metadata and data. If it’s discipline focused then each group knows their own needs. GoGeo has been funded by JISC for 12 years – the first academic geo portal anywhere in the world – but it’s about building trust, and they are fine sharing with each other. Even if just the metadata is shared it’s helpful for colleagues.

Q – Balviar) Geoparsing question, EDINA has a specific tool, has there been big uptake? We had a small project geoparsing historical documents, not sure what happened after that.

A) Not as much as there should be! Unlock is built into GeoDoc to georeference things, use footprint to find associated keywords.

Q) So you could run over repositories and georeference as part of the process?

A) Absolutely!

Q – Kevin) I’m sorry if I missed it… so you’ve got 2300 records, 230 shared more widely. So the 90% that aren’t – are they complete? Are they trial records?

A) We don’t know! We respect the privacy of the user. We’d like to know too. Can query which fields are filled but that’s about all that’s appropriate.

Sarah Jones – DMPOnline

Sarah is going to tell us about DMPOnline, a tool for writing data management plans online. What’s the idea behind this? We realised at the DCC that a lot of funders were asking for research data management plans and we wanted to help with that. And that was in 2010. Since then there has been even more interest growing, with more funders requiring plans and universities making their own plans and policies around research data management. So we did some assessment of the tool: people liked having an online tool, and having a chance to collaborate and share. But they found the process very long and overwhelming and they found the checklist a little confusing. People wanted the minimum they could get away with. Simple tools and clear guidance.

So in terms of the checklist, that’s a list of questions that might need addressing in a data management plan. But over time the list had gotten too long and too confusing. Sometimes we asked several questions instead of one. Some wanted it to be easier and less spoon feeding – fewer questions. Sometimes the questions didn’t map. So now we are using the funder’s or institution’s questions as asked and answered – it’s in a different place. We are asking fewer questions to keep plans short and emphasising guidance, as that’s what researchers need. So we have 13 key questions under 8 sections – please look and provide your feedback. We have created new use cases and the database has been redesigned; we expect to roll out version 4 in the autumn. We’ve tried to make it clean and simple: things expand only as you start to fill them in, and you expand as you want/need. And you fill plans in at different stages in the project. If we have a suggested answer to a question we provide that and you can delve in further as you need.

So if you are inspired, register to use it, or download it from GitHub. And contact us if you want to customise it for your university. Three things you may want: provide guidance and suggested answers; map the form to your policies; and customise it to your institution and share your template online.

Dave Tarrant – Little buttons make a big difference

I came up with an idea last week that little buttons make a big difference. Basically I wanted to talk about access – this little button that says “download”. But that button does so much. It can enable demand, as called for in our keynote. To look into what this button can enable I want to tell you a story about peer to peer lending. This allows people to bypass banks and lend money directly, setting their own rates and conditions. At the Open Data Institute (ODI) we asked a number of peer-to-peer lending companies to open up data about their activities for us to analyse. The study looked at three sites that do peer to peer lending in the UK (92% of it). If you look at the demand for money it’s all over the UK, but looking at who loans the money it seems to mainly be in the South. It was also found that the mean loan was currently less than £10,000. The visualisations and results of this work can be seen on

What was the impact of this interesting little bit of work? It made the front page of the Financial Times, which reported that P2P lending would be worth £1 billion by 2016. Since this article, one P2P lending company has also reduced their minimum loan amount, as they had previously written themselves out of most of the potential market! So now consumers have more choice!

This work was carried out by the Open Data Institute to show the value of Open Data. The key bit is the “Open” – getting government, private companies and the general public to realise the benefits of Open. There is international interest, thanks to the G8 Open Data Charter. Early work from the ODI presented here includes data certificates, which sit alongside research data management.

The open access community has been around for many years now, so this is not a new approach, just a new community. On this note, we could potentially count the open access community as one of the 5 open stars (my idea – why not?!). These five stars for organisations would be:

  • open data – the raw data that is collected and used to create new things
  • open access – the things, the knowledge and stories around the data
  • open science and open knowledge – allowing science to be crowdsourced: data plus knowledge plus method
  • open innovation – allowing commercial companies to exploit the open ecosphere and give back to the economy (don’t put an NC licence on things!)
  • open by default – complete transparency
This doesn’t represent a whole new approach, simply involvement with more communities to achieve a bigger vision.

Patrick McSweeney – ReCollect Research

I am from the University of Southampton; I do EPrints, open data, all that. I’m going to talk about work I did with scientific data. I won a prize at OR2012 for building data visualisation into a repository. A colleague at the University of Southampton asked me for help: the EPSRC want a data plan, and want it enacted in a year’s time. He said don’t worry about the time. The University of Essex have a repository and had talked to researchers about what they want in the repository. Off I trotted to Colchester… to discover that things were not as I had envisaged. Indeed, the antithesis of what I believed would work: long lists, complex workflows, not user friendly. It was set up with a “my first day of programming” type approach. Much needed to be done. I heaved out some metadata fields and polished some edges – graceful install/uninstall. But then there was an attack of politics: they wanted what they had before, but with a magic one-click install. So I made 8 fields essential, and the rest optional, with text to explain why you don’t need them.

It launched early this year, but there have been only 8 deposits so far. I’m not pleased with how it works. We want to keep the process simple; the key problem, as I saw it, was that I’d won a prize for doing as much as possible with the thing you’ve deposited. Is there an easy way to do this with the EPSRC requirements? Well, you can install the ReCollect plugin if you want miles of metadata. But if you want to make something user friendly, use FigShare, basically. It hurt me to do this work. I encourage you to read Don’t Make Me Think by Steve Krug. I am just a tool: if you as a librarian come to me and want me to build a machine of death… well, I’ll query it, but I have to build it… so don’t ask me to build a machine of death.

Muriel Mewissen – RJ Broker

This is my first Pecha Kucha, so we’ll see how it goes. Most of you will have heard of the RJ Broker before, but quickly: it’s a way to transfer data to the institutional repository. This presentation is about the challenges that we’ve had. We can take any data… but that has meant lots of things of all shapes and sizes – we’ll take everything. Publishers expect first-class special treatment, so for each provider we’ve set up a bespoke system for the Broker. Bespoke means time, money and effort, but we expect this to be voluntary. And the data is precious: sometimes it is open, sometimes it’s not, and sometimes we have to persuade them to make it open. Repositories are happy to sign up if the Broker has lots of data available – so the more open data we have, the more sign-ups we’ll get – but data tends to follow having more repositories signed up: a chicken-and-egg situation. There is also the issue of technology versions: different technologies have different requirements, but some people do not want to change. And what you get isn’t always what you expect.
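The “bespoke system per provider” problem can be pictured as a handler registry that normalises each provider’s payload into one common record shape. This is only a sketch of the pattern – the provider names and payload fields here are invented, not the Broker’s real interfaces:

```python
HANDLERS = {}

def handler(provider):
    """Register a bespoke parser for one content provider."""
    def register(fn):
        HANDLERS[provider] = fn
        return fn
    return register

@handler("publisher_a")
def parse_a(payload):
    # hypothetical provider A sends flat title/doi keys
    return {"title": payload["title"], "doi": payload["doi"]}

@handler("publisher_b")
def parse_b(payload):
    # hypothetical provider B nests its metadata differently
    meta = payload["metadata"]
    return {"title": meta["articleTitle"], "doi": meta["identifiers"]["doi"]}

def to_common(provider, payload):
    """Dispatch to the right bespoke handler; emit a common record shape."""
    return HANDLERS[provider](payload)
```

Each new provider costs a new handler – which is exactly the time, money and effort the talk describes.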

Those of you who were in the Repository of the Future session will know that the CRIS will rewrite the data every 24 hours – we don’t need to work with repositories anymore, we need to work with the CRIS!

We are a UK project; we have to change as things move on, we have to react. And there are a lot of people involved in very different roles, so getting information across to the right people in the right roles can be a huge challenge. We hope to move to a service, and hopefully it will be like the Olympics – hopefully we’ll get gold… but then we’ll take any kind of open access you choose! We know it will be a long road and we’ll keep developing as we need to – we are happy to do that. We want cake for everyone, so everyone can be happy.


Q – Robin Rice) For Sarah – we are looking forward to the new tool and the suggested answers. How do you expect or want other institutions to engage with you to get that customisation?

A – Sarah) We have been working to make it so that you can add various guidance and suggested answers and fill in the template. So for Edinburgh, you have a policy, and you want to think about anything additional you might want to add on top of what the funders ask for.

Posted in LiveBlog

Live(ish)blog: Getting to the Repository of the Future Workshop

Our blogger Rocio (with some help from Nancy Pontika) shares her live notes from the Getting to the Repository of the Future workshop which took place today…


Chris Awre

We’ve been working on repositories for over 10 years, and we’ve moved forward, but where do we go next?
Are the systems right, are the repositories we put in place 10 years ago still doing what we want them to do?

What impact will repositories have on staff, skills ..?

Academics are very good at creating research output, but are they as good at managing those assets?
How do we shape the future of repositories?

Balviar Notay

  • From Jisc repository programme since 2002
  • Value of managing digital assets for the institution

Now we have the infrastructure, how can we move on? What can we make out of this? Some of these projects include:

  • Hydra project (University of Hull)
  • Kulture, Kultivate, eNova (University of Creative Arts) – public portfolios of staff/institution work
  • MIRAGE (Middlesex University) – 3D visualisations of 2D scans using ParaView; it’s being used for pedagogical purposes, so that the repository adds immediate value to the university’s activities

Repositories were at the edges, but the power is slowly moving from the centre to the edges.


Chris Awre

Questions on tables are based on the two papers that were distributed as preparation for the workshop.

Consider the indirect revenue of openness – but how can we measure/evaluate it?
What services could we build around repositories that would be economically sustainable?
Different disciplines require different approaches, or at least different researchers require different specifications for managing their research data.

Generational changes/differences… identify current trends.
Risk of repositories becoming silent/unused.
Can repositories be data-centric, or do they need to be user friendly?
Are repositories just another tool, or part of the infrastructure?
Licensing and copyright are like shifting sand – a dune that will not disappear, just change location/attention.
The environment will change, because organisations will change, as will their relationships to other systems (VLE, CRIS) and collaborations across institutions.
Security and preservation (hoping this point is being taken care of from now on, so it doesn’t seem like we’ve not been taking care of this issue – that would defeat the purpose)…

Small group activities
Think about the questions: what do we want repositories to do? Horizon view – think about the questions from a 2-, 5-, and 10-year perspective. What do we need to do now to see those changes implemented in the future?

2 years = short term, powerful trends; 5 years = aspects that need to be addressed (like how to navigate through different research outputs, aspects…). Risk analysis for the repository is different to the risk analysis of the stakeholders.



General wish for full text and good quality of data and metadata.
The repo as a content holder, not just a place to deposit plain text.
Flexibility to customise with other technologies.
Changing landscape of software; investment by the library in terms of cataloguing skills.
In 10 years, not sure whether there will be multiple systems working together or one platform that can do more.

New content into the repository.
Some content is prepared, with good quality metadata, and some is unprepared.
Organisation: set criteria for acceptance.
Funding and policy decisions; issues regarding preservation and archiving – do we keep something for ever?
Setting metrics to see if your repository is successful, and planning for the future, such as doing reviews. Sustainability and cost: how much does the institution invest in the repository?
What about DOIs, and how much do they cost?

Is blogging an ephemeral research outcome that is lost at the end? Will repositories preserve this material, this data?
Linked data requires IDENTIFIERS for every individual, academic institution and repository (ORCID) – the identifier issue needs to be sorted out; we’ve been trying to do this for the last 10 years, so it is optimistic to think we’ll get it sorted in 5 years.
Vision for 10 years’ time: the repository fully integrated into the research flow, but completely invisible, so the copy is immediately available.

Debate about repository content: should institutional repositories only want published content to be deposited in the repository?
User needs – when the repository does not match the needs.
Formal cooperation between different repositories: V-mirroring, cooperating in curating specific areas.
Hoping users will find their way to the repository and the data stored there.
CRIS – managed from the top; not interested in using metadata for appraisal.
Librarians – want material to be shared.
Research data: granularity and level of metadata…
In 2 years: progress in permanent identifiers (ORCID iDs across institutions), academic norms for data citation, more data audits, and a better understanding of what users want.
In 5 years: capturing metadata automatically.
In 10 years: hoping the repository will last.

What is the repository for? The repository is the stuff, the content.
Functionality that is not part of the repository – manipulating, visualising… – does not need to be done by the repo, but by other layers; metadata is the means of discovery.
In the 5-10 year term, make data and metadata interoperable.
What content is better not managed in the repository but somewhere else?
3 levels of openness: make it available on the internet; make it available to be discovered; make it available and interoperable.
A galaxy of data centres – prevent duplication and silence; ideally funders will provide data centres, but what about the research that is not funded?
The institutional repository needs duplication: people work across institutions and move away from institutions.

Future: thinking about the repository not as something that is fixed.
We articulated what we want to see in the future; we articulated what we have been wanting in the past.
Discovery/interoperability issues.
The future is unevenly distributed.
Let’s go back to the OAIS model, to make things easier for people to deposit and to retrieve.
Academics are willing to use tools like Mendeley, LinkedIn and Academia.edu voluntarily – but they are all silos. We want to make things discoverable, and this can be done through repositories; CRIS systems are too much of a silo.
Discoverability: RSS is disappearing (Google has taken it off their service).
When will ORCID become mainstream? Possibly in 10 years.
Organisational issue: it would be great if publishers were more collaborative, like the Jisc APC.
Make our repositories more attractive – why do they look the same, why are the discussions the same as those being had 10 years ago, and why haven’t we moved on?
The issue of impact and citation; REF as the impact that research has on others…
Being a silo does not have to do with the technology, but with the mentality of the institution.
CRIS as a way of bringing all data together and using it as a reporting tool – we don’t see it as a silo. Does your silo’s information go outside your organisation? What is your public-facing side…? It is unusual that the data is then made available.

Repositories are still regarded as a good thing that will be here in 10 years, as a place to have stuff, to hold the digital stuff. And the breadth of content will grow.

Development of repositories took place in a policy-free environment, but now this is changing, with policy makers taking over repositories. In 5 years this will be clearer: those running repositories will know what they are supposed/expected to do.

We would like better identifiers, automation, metadata… but how are we going to do any of this, putting aside financial constraints? What are the barriers to fulfilling the requirements and allowing us to do what we want to do?

How do we get over the different identifier providers (OpenDOAR – an EU format; RIOXX – a UK format…)? How do we match these identifiers?



Live Blog – Short Presentations (1)

OJS – Angela Laurens and Theo Andrew

OJS is Open Journal Systems, developed by PKP, a group of American universities. It is used worldwide by around 7,000 journals. Our first OJS journal was Concept, the Journal of Contemporary Community Education Practice Theory. We have various journals running on the system – some student led, some academic or researcher led – but all are peer reviewed.

We held yesterday’s OJS workshop because there is a growing UK community. Since the Finch review there has been a lot more interest in the system. Ourselves and St Andrews have been using OJS since around 2009/10, but in the last year we’ve had many more enquiries. So we wanted a forum for the UK OJS community.

One key theme arising was resources. The software is free but it has a real cost: Pittsburgh estimated 3.5 FTE worth of time for them; UoE reckoned at least 0.5 FTE. It’s significant time. According to the Finch Report we can expect to pay an average of £1,750 to publish an article, and UoE regularly received invoices from Elsevier for $5,000. So 8 articles cost the same as running OJS does for us. It’s really good value for money, and it brings control back to universities and to academics themselves.

Another key theme for us was the learning curve. It can be steep, which means training up front, but the system is otherwise relatively self-supporting.

Managing expectations is important here. What does a free service include or exclude? Is it just the systems, or is it design, support, training, layout, policy? Who manages submissions? Institutions providing a service have documentation in place to standardise the service and to manage those expectations. Our keynote speaker Vanessa Garbler explained the substantial documentation that helps ensure expectations are managed. But quality matters: Pittsburgh have a committee to approve new journals to ensure quality is maintained.

A few more key themes:

Licensing matters: CC-BY is recommended, as NC is too restrictive. Avoid heavy customisation – it is too difficult to maintain and manage. Similarly, a single installation is better than multiple installations. And there is a real opportunity to engage students – as guest editors of journals, etc. And Kevin Ashley spoke about the beneficial impact of having preservation as part of the routine process of publishing, as OJS enables.


Q1) Will there be follow up in terms of meetings etc?

A1 – Angela) We hope so, yes. There was real appetite for that, for sharing expertise and experience, and for building a toolkit we can all use to avoid reinventing the wheel.

Vivo, Repositories and FigShare – Graham Triggs, Symplectic

I work on integrating our tools with repositories. VIVO is one system for integrating systems – it is a network of systems across the world. The whole system is based on Linked Open Data – all triple-based information – and data captured at the university. It’s not just publications but ontologies, events, professional activities, the people themselves, all linked together. That means there are SPARQL endpoints that can be queried and investigated: new ways to discover research and expertise across multiple universities. These systems work by harvesting different sources – from CSVs, from PubMed, from Scopus. Your repository is only a partial view of your research; you need to add all that other stuff in.

But this stuff is tricky: how do you disambiguate the individual across multiple data sources, and work out what all the research and resources are? One option is to use a CRIS like Symplectic Elements or Pure. Those research management systems make it easier for you to disambiguate and link to authors with precise connections. We’ve already talked at this conference about how you can integrate a CRIS and repositories, taking metadata with you. But interestingly, the CRIS also gives you information about what’s in the repository in a way that is exportable through an API. So if you harvest from a CRIS and put it into VIVO, you can take all of that information, with the links, into VIVO. For instance, Duke University in America allows links right through into national networks, into the repository, from person or research to the article itself.
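A toy sketch of the harvesting idea: take a publication record exported from a CRIS API and emit subject–predicate–object triples for a VIVO-style linked-data store. The record schema here is hypothetical, and the predicates are indicative rather than a faithful rendering of the VIVO ontology:

```python
def to_triples(record):
    """Map one CRIS publication record (hypothetical schema) to RDF-style triples."""
    pub = f"pub:{record['id']}"
    triples = [
        (pub, "rdf:type", "bibo:Article"),
        (pub, "rdfs:label", record["title"]),
    ]
    for author_id in record.get("authors", []):
        # link the publication to each disambiguated author
        triples.append((pub, "vivo:relatedBy", f"person:{author_id}"))
    if record.get("repository_url"):
        # carry the repository full-text link into the network
        triples.append((pub, "vivo:webpage", record["repository_url"]))
    return triples
```

The payoff is exactly what the talk describes: the repository link travels with the person and publication data rather than living in a separate system.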

You can also pull data sets in from FigShare – and embed materials into other systems.

The STARS Shared Initiative – Pablo de Castro, UK RepositoryNet+ project and Jackie Proven, Repository Support Officer at St Andrews

STARS stands for St Andrews and RepNet SDLC, and is about delivering repository services in an advanced CRIS set-up.

The UK RepositoryNet+ project is about building a socio-technical infrastructure for UK repository services. On our brand new website you can see the wide range of components that are part of this: IRUS-UK, RoMEO, JULIET, and the Repository Junction Broker. And we have the different strands that RepNet is covering in terms of services: aggregation, benchmarking, registries, preservation, etc.

RepNet was originally conceived to build services on the UK repository network, but as time has gone by there have been huge changes in repositories and policy. We have found there are no longer so many standalone IRs; it is now a much more complex and mixed system, with many CRIS systems, particularly at the research-intensive institutions. So we have IR-only institutions, those with IR+CRIS, CRIS-only systems, IR+Symplectic, IR+RMS. There is quite a discussion to be had here – we have a round table tomorrow with an opportunity to discuss more.

Another finding from our survey of IR managers is that a wide range of support services are presently available to IRs for hosting or maintaining repositories. Research@St Andrews, for instance, offers a complex set-up – CRIS+IR plus an external provider (in this case SDLC). So, over to Jackie.

You will see from this diagram how the repository fits together – how data comes into our repository and how those systems fit together. We have a Pure CRIS – there is a portal as a front end that showcases our research. When full text goes into Pure it goes into our DSpace repository, and the link to that full text is included in the portal. That process is influenced and affected by the set-up in Pure.

Pablo: Our work programme for STARS included many refinements and enhancements, from enabling SWORD to use of IRUS, etc. So we started with IRUS, a project for gathering COUNTER-compliant usage data from repositories. It is in 35 repositories so far, and aiming for wider coverage.

Jackie: Installation of IRUS was really easy for us – particularly as Claire Knowles did it for us. We now have monthly stats that we can compare to our existing Google Analytics stats. There are some discrepancies, and that has led to some really interesting discussions – we’ll talk more about that in our round table tomorrow.
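The stats comparison Jackie describes can be as simple as lining up monthly counts from the two sources and flagging months that diverge. A hedged sketch – the data shapes are invented, since IRUS and Google Analytics each have their own export formats:

```python
def flag_discrepancies(irus, analytics, tolerance=0.25):
    """Flag months where two sources' download counts diverge by more than
    `tolerance` (as a fraction of the larger count)."""
    flagged = []
    for month in sorted(set(irus) & set(analytics)):
        a, b = irus[month], analytics[month]
        if max(a, b) and abs(a - b) / max(a, b) > tolerance:
            flagged.append(month)
    return flagged
```

The interesting work is then explaining the flagged months – the two systems count downloads under different rules, which is what makes the discussions interesting.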

We tested OpenAIRE. The CRIS is very tied up with the REF system, so we tested OpenAIRE with a test collection in the portal of EU-funded publications, and we added the relevant fields – there are some issues with validation to investigate.

Pablo: With the RJ Broker we tested the data workflow. There were two options: whether to push or pull the data with the CRIS.

Conclusions: this was a really interesting opportunity to test the landscape we had been looking at.

Jackie: For us it has also been a real opportunity to build bridges with stakeholders and we want to keep those lines of communication open.


Q1 – Balviar Notay at Jisc) The RepNet project is a Jisc-funded project which ended yesterday, but we are looking at delivering the components as shared services from this point onwards. Just to say that IRUS is not just article-level data. If you are not already being harvested by IRUS then please get in touch – they are harvesting from over 30 repositories. We’ve seen over 3 million uses of materials in the repositories, 2.5 million of them articles. Thank you, really useful to hear your experience.

A1 – Pablo) Yes, IRUS provides journal- as well as article-level data in a COUNTER-compliant way.

A1 – Jackie) For us it also links into our other systems, so it’s great to have that comparison of stats.

It’s not open if no-one can find it – Chris Gutteridge, University of Southampton

I was working on repositories; now I work in open data, but I retain a fondness for repositories. I want to talk to you about our open data hub. It is not being run like – they run on a shoestring and we run on a lot less than that (although we accidentally have some funding now!). We want to be a hub and ensure that data has a sensible generic domain for future proofing.

The big initiative we’ve been working on is the national portal for UK HE research equipment. If you want a laser or a DNA sequencer, for instance, you can see where the nearest one is. It works through basic open data principles. The next stage for us is to export to our Southampton repository, so that we can use our equipment IDs in the papers and tie the equipment to the articles published – a great way to show the value of equipment.

So if we look at the Roslin Institute in Edinburgh: they have open data about their equipment – only ten or fifteen items. The old system was very slow and manual; it’s expensive to maintain those types of relationships. We are making this sustainable (not cheap – sustainable!). Getting info from the website isn’t scalable. But if you go to the website you can see in the HTML a single line that shows where the equipment data is – and that is harvestable. It makes it clear who the institute are and where they are, and gives a series of assertions about their equipment data and where that data can be found. From that, I can autodiscover a CSV daily, automatically, and compile it into a database you can search. And it works. And you can trace back to the institutions. It’s simple, it’s automated. Neater still, you could do the same thing for any type of organisation or aggregation of organisations – e.g. an Irish network. All the code is on GitHub – please steal the code.
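The autodiscovery flow Chris describes – one line of HTML pointing at an organisation’s open data, which a harvester then follows to a CSV – can be sketched in a few lines of stdlib Python. The `rel` token used here is an assumption for illustration; the real markup is defined by the data.ac.uk documentation:

```python
import csv
import io
from html.parser import HTMLParser

class ProfileLinkParser(HTMLParser):
    """Collect <link> elements whose rel value marks an organisation profile.
    The rel token "openorg" is a placeholder, not the real vocabulary."""
    def __init__(self, rel_token="openorg"):
        super().__init__()
        self.rel_token = rel_token
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and self.rel_token in (a.get("rel") or "").split():
            self.hrefs.append(a.get("href"))

def discover_profiles(html):
    """Find the 'single line in the HTML' pointing at the open data."""
    parser = ProfileLinkParser()
    parser.feed(html)
    return parser.hrefs

def load_equipment(csv_text):
    """Parse a harvested equipment CSV into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))
```

A nightly job would fetch each institution’s page, follow the discovered link, and reload the rows into a searchable database – the simple, automated loop described above.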


Q1 – Les Carr) Any chance of grants on the web being rolled in, so that RCUK can make use of it?

A1) There’s a website called research gateway with data from grants. I’m keen to link up our definitions to theirs. So yes, but there are some delays with data providers there. And the other real value: the advantage that a URI survives any reorganisations, which will be really helpful for future proofing.

Parsimonious Preservation at the UK National Archives – Tim Gollins, Head of Digital Preservation

What do we do in digital preservation? We look after things for a long time! There is a concept that you should be able to describe anything in two lines. Your challenge: what is the two-line description of digital preservation?

So I want to talk about the threats to materials we want to preserve. Everyone talks about formats, about media obsolescence… what happens if you bring in your Zip disks and floppy disks? You can’t read them. So… Rule 1 is: get it off removable media!

Just to say, my perspective is that of the National Archives; you may have different data, etc. But maybe you are costing yourself more than you need. So… file formats… the long tail. The National Archives’ own data: we had over 1.2 million emails in our repository, then 400K+ documents, and 130K Excel sheets. But there are 800 formats in total. Should I worry about the 800th format, in which we will have almost no data? There is an economic issue here: what’s worth saving?
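The long-tail argument is easy to reproduce on any collection: count files per format and ask how much of the holdings the top handful of formats covers. A minimal sketch – extension-based, which is cruder than the signature-based identification tools archives actually use (e.g. DROID):

```python
from collections import Counter

def format_profile(paths):
    """Count files per (lower-cased) extension, most common first."""
    exts = Counter(p.rsplit(".", 1)[-1].lower() for p in paths if "." in p)
    return exts.most_common()

def head_coverage(profile, top_n):
    """Fraction of all files accounted for by the top_n formats."""
    total = sum(count for _, count in profile)
    return sum(count for _, count in profile[:top_n]) / total
```

If the top few formats cover almost everything, that is where the (cheap) effort goes; the 800th format is an economic decision, not a technical one.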

And should I worry about that list in the next thirty years? Well, I don’t think most of those formats will be obsolete any time soon. There are millions of .doc or .xls files, and someone will want to read them; that will ensure the material stays readable. It’s a very similar graph to one shared by the British Library recently. So the National Archives doesn’t do file format conversions; we put things into a repository, and if people want to use them we give them to the customer – they can read them. When they can’t read them, we’ll worry about it then. How long will your repository survive?

From Pat in the crowd: Until the end of the REF! [cue much laughter]

Is your system 10 years old? 15 years old? Very few systems are operational that long. They get upgraded, replaced, improved – they get changed. So will any of the threats your data could come up against actually be a problem?

What’s the two-line thing for digital preservation…?

  1. “Know what you’ve got” – have a catalogue; know what’s there.
  2. “Keep the bits safe” – so that you can actually hand the record onwards reliably; that’s what has been preserved.
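Those two lines map directly onto the simplest possible preservation tooling: a manifest (“know what you’ve got”) and a fixity check (“keep the bits safe”). A stdlib sketch, operating on in-memory bytes for brevity:

```python
import hashlib

def make_manifest(files):
    """'Know what you've got': record a SHA-256 digest per file name."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in files.items()}

def verify(files, manifest):
    """'Keep the bits safe': list files whose bytes no longer match the manifest."""
    return [name for name, digest in manifest.items()
            if hashlib.sha256(files.get(name, b"")).hexdigest() != digest]
```

Real systems run the verify step on a schedule and repair from a second copy when a mismatch appears.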


Q1 – Jacqui Taylor) In the NHS there are systems much older than 15 years

A1) Good point. They have a major preservation problem for that reason. I’m not aware of anyone addressing it.

Q1) I am!

A1) Great!

Q2 – Kevin) The idea of counting formats is a good way of cutting the problem down. Sometimes a rare item is more important to preserve, and then you put the effort in.

A2) That is the point. You can invest relatively cheaply in seeing what’s there, and then you have the resource to identify what needs expensive specialist curation and preservation – the stuff at the end of the long tail. But if you don’t know what you’ve got, you can’t curate effectively – you attempt to curate everything to the same level.


Repository Fringe 2013 is organised by:

The Digital Curation Centre


The University of Edinburgh